Aiua

Open Dataset · CC0 1.0 Universal · Cardano-anchored · Arweave-permanent

The Aiua Archive is the most intentional and comprehensive human values dataset ever assembled.

High-quality, provenance-verified, values-aligned, community-owned.

What is this dataset?

Aiua is a private daily journal where humans respond to prompts scaling from easy life questions to complex moral dilemmas to AI-personalized reflective inquiry. Responses are scored across 12 value dimensions by Claude (Anthropic). When contributors choose to share a reflection with the Aiua Archive, it is sanitized for personally identifying information and published to this open dataset under CC0.

Unlike scraped web data, every contribution here was deliberately created. Contributors can go deeper with up to three rounds of reflective follow-up, creating multi-turn dialogue and RLHF/DPO training data that doesn't exist anywhere else.

The dataset is designed for reward modeling, instruction fine-tuning, and values classification research. The free export includes all scores, prompts, and provenance. A paid API adds normalized scores, quality signals, demographic cross-references, full dialogue transcripts, and preference data with timing.

12-Dimension Scoring Framework · 100 points total

Dimension	Points	Description
Life	8	Reverence for life, prevention of harm
Liberty	8	Autonomy, self-determination
Kinship	8	Connection, belonging, community
Ecology	8	Relationship with the natural world
Legacy	8	Intergenerational responsibility
Truth	6	Honesty, facing reality
Justice	6	Fairness, moral courage
Wisdom	6	Practical judgment from experience
Perspective	6	Multiple viewpoints
Humility	4	Openness to being wrong
Authenticity	16	Genuine personal voice, specific detail
Depth	16	Real reflection, willingness to sit with complexity
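A minimal sketch of how a consumer might sanity-check a record's scores object against the rubric above. The point caps come from this page. The helper names (RUBRIC_MAX, check_scores, normalize) are hypothetical, and the 0..1 scaling shown here is only an illustration, not necessarily the normalization the paid API applies.

```python
# Maximum points per dimension, copied from the 12-dimension rubric.
RUBRIC_MAX = {
    "life": 8, "liberty": 8, "kinship": 8, "ecology": 8, "legacy": 8,
    "truth": 6, "justice": 6, "wisdom": 6, "perspective": 6,
    "humility": 4, "authenticity": 16, "depth": 16,
}


def check_scores(scores: dict) -> list[str]:
    """Return a list of problems found in a `scores` object (empty if none)."""
    problems = []
    for dim, cap in RUBRIC_MAX.items():
        value = scores.get(dim)
        if not isinstance(value, int) or not 0 <= value <= cap:
            problems.append(f"{dim}: expected integer in 0..{cap}, got {value!r}")
    # Only compare the total once every dimension is known to be valid.
    if not problems and sum(scores[d] for d in RUBRIC_MAX) != scores.get("total"):
        problems.append("total does not equal the sum of the 12 dimensions")
    return problems


def normalize(scores: dict) -> dict:
    """Scale each dimension to 0..1 so differently weighted dimensions compare."""
    return {dim: scores[dim] / cap for dim, cap in RUBRIC_MAX.items()}


# Scores taken from the sample record on this page.
example = {
    "total": 68,
    "life": 7, "liberty": 5, "kinship": 7, "ecology": 6,
    "legacy": 5, "truth": 5, "justice": 4, "wisdom": 4,
    "perspective": 3, "humility": 2, "authenticity": 12, "depth": 8,
}
print(check_scores(example))
print(normalize(example)["authenticity"])
```

Scaling by the cap matters when comparing dimensions: a 4 on Humility is a perfect score, while a 4 on Depth is a quarter of the available points.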

Why it matters

This dataset exists because AI alignment needs better training data. Most alignment datasets are preference pairs from crowd workers: shallow, narrow, and inauthentic. Aiua captures what real people actually believe, value, and experience in their own words, scored against a consistent framework rooted in cross-cultural moral philosophy. Use it to fine-tune AI systems that understand human values like wisdom, compassion, and empathy, not just preferences.

✍️

Intentional, not scraped

Every contribution was written in response to a personalized prompt designed to surface genuine values. Prompts scale from easy life questions to moral dilemmas to AI-personalized reflective inquiry. No data was scraped without consent.

📊

Scored + multi-turn

Each contribution is scored across 12 dimensions (100-point rubric) and can be followed by up to 3 rounds of reflective dialogue. Preference data adds direct human judgments, so researchers get scores, dialogue, and preferences, not just text.

⛓

Permanently verifiable + encrypted

Twice monthly, Merkle roots are anchored to Cardano. The full dataset is stored on Arweave. Premium fields are encrypted with AES-256-GCM, with keys escrowed on-chain. New contributions must be at least 7 days old and pass quality checks before anchoring. A dead man's switch publishes all keys if the platform becomes inactive.

🔒

Privacy-preserving by design

All contributions are AI-sanitized to remove personally identifiable information. Voice recordings are never stored. Each weekly export includes world context metadata (top headlines and current events) so future researchers understand when and why these reflections were written.

Geographic Verification

Every contribution includes a location confidence score derived from four independent signals, making geographic diversity verifiable rather than merely self-reported.

Survey Response

Self-reported country from the optional demographic survey.

System Timezone

Operating system clock timezone. Unaffected by VPNs.

Browser Locale

Language and region settings configured in the browser.

Content Language

The language the contributor actually writes in. Hardest to fake.

When signals agree, confidence is high. When they disagree, it drops. This multi-signal approach makes geographic diversity data significantly more reliable than self-reporting alone.

High (4/4 agree)

Good (3/4 agree)

Low (2/4 agree)

Unverifiable

Each contribution in the dataset includes a likely_country field and a location_confidence score (0 to 100). No raw signals are published to protect privacy.
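The agreement tiers above can be sketched as a simple vote count. This is an illustration only: it assumes each of the four signals has already been reduced to a single country guess upstream (a step this page does not describe), the function name location_tier is hypothetical, and the mapping from tier to the published 0 to 100 location_confidence score is not documented here, so it is left out.

```python
from collections import Counter
from typing import Optional

# Agreement count -> tier label, following the tiers listed above.
TIERS = {4: "high", 3: "good", 2: "low"}


def location_tier(signals: dict[str, Optional[str]]) -> tuple[Optional[str], str]:
    """Return (likely_country, tier) from per-signal country guesses.

    `signals` maps a signal name to an ISO country code, or None when
    that signal is missing. Ties are resolved in favor of the country
    seen first, which is an arbitrary choice made for this sketch.
    """
    guesses = [country for country in signals.values() if country]
    if not guesses:
        return None, "unverifiable"
    country, votes = Counter(guesses).most_common(1)[0]
    tier = TIERS.get(votes, "unverifiable")
    # With fewer than two agreeing signals there is no credible guess.
    return (country if votes >= 2 else None), tier


print(location_tier({
    "survey": "NZ", "timezone": "NZ", "locale": "NZ", "content_language": "AU",
}))
```

Counting agreement rather than trusting any single signal is what makes a spoofed survey answer or a borrowed browser profile lower the confidence instead of silently passing through.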

Data Enrichment

Every shared reflection is automatically enriched with research metadata. These fields are extracted from the same scoring call used for the 12 dimensions, adding zero marginal cost per contribution.

Emotional analysis

Sentiment (positive / negative / mixed / neutral) plus a primary emotion drawn from 20 categories: joy, gratitude, hope, love, peace, curiosity, sadness, anger, fear, grief, frustration, anxiety, confusion, determination, nostalgia, awe, shame, pride, loneliness, contentment. Paid tiers add a secondary emotion.

Topic extraction

2 to 4 topic tags from a 30-category vocabulary including family, relationships, work, health, spirituality, nature, mortality, identity, justice, and creativity. Enables subject-based filtering, clustering, and value-dimension correlation.

Prompt alignment

0-10 score measuring how directly the reflection addresses its prompt. High alignment signals focused engagement. Lower alignment often indicates creative drift or personal tangents, both valuable in different research contexts.

Language detection

ISO 639-1 code of the text as actually written, which may differ from the user interface language. Enables accurate cross-linguistic analysis.

Reading complexity

Word count and Flesch-Kincaid Grade Level estimate. Higher complexity often correlates with deeper engagement. Paid tiers add sentence count, average sentence length, and time-to-write.
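For readers who want to reproduce or sanity-check reading_level, here is a rough sketch of the standard Flesch-Kincaid Grade Level formula, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The syllable counter is a crude vowel-group heuristic chosen for brevity, and the tokenization only handles Latin letters, so values will not match the dataset's exactly. How Aiua counts syllables and sentences is not stated on this page.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count groups of consecutive vowels, minimum one."""
    lowered = word.lower()
    count = len(re.findall(r"[aeiouy]+", lowered))
    if lowered.endswith("e") and count > 1:
        count -= 1  # treat a trailing 'e' as silent
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Estimate the Flesch-Kincaid Grade Level of English text."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


simple = "The cat sat. The dog ran."
dense = "Intergenerational responsibility necessitates extraordinarily deliberate consideration."
print(round(flesch_kincaid_grade(simple), 2))
print(flesch_kincaid_grade(dense) > flesch_kincaid_grade(simple))
```

Very short, simple text can produce a negative grade level. That is a property of the formula, not a bug in the sketch.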

Historical context

Each biweekly anchor includes 3 to 5 world headlines plus an optional curator note. These anchor values expressions in the world events that shaped them and enable longitudinal research on how external events shift values expression.

Dataset Structure

Optimized for Machine Learning & Fine-Tuning. One record per contribution.

{
  "id": "f7a3c2e1-...",
  "prompt": "Is it ever right to lie to protect someone you love?",
  "text": "I found myself in [a hospital in the Pacific Northwest]...",
  "language": "en",
  "detected_language": "en",
  "created_at": "2026-03-14T09:23:11Z",
  "voice_used": true,
  "prompt_difficulty": "medium",
  "prompt_source": "cached",
  "scores": {
    "total": 68,
    "life": 7, "liberty": 5, "kinship": 7, "ecology": 6,
    "legacy": 5, "truth": 5, "justice": 4, "wisdom": 4,
    "perspective": 3, "humility": 2, "authenticity": 12, "depth": 8
  },
  "sentiment": "mixed",
  "primary_emotion": "nostalgia",
  "secondary_emotion": "gratitude",
  "topics": ["family", "identity", "change"],
  "prompt_alignment": 8,
  "word_count": 247,
  "reading_level": 8.3,
  "scoring_model": "rubric-v3.0",
  "depth_rounds": 2,
  "preference_stats": { "times_compared": 14, "win_rate": 0.71 },
  "provenance": {
    "content_hash": "sha256:a3f2...",
    "merkle_root": "b7c9e1...",
    "cardano_tx": "tx_abc123..."
  },
  "weekly_context": {
    "headlines": [
      "EU AI Act enters enforcement phase",
      "Record coral bleaching reported across the Pacific",
      "India and China reach border normalization"
    ],
    "context_note": "Written during global AI regulation debate"
  },
  "premium": "ENC:AES256GCM:iv:tag:encrypted_blob..."
}

Field	Type	Description
id	UUID	Unique contribution identifier
prompt	string	The reflection prompt shown to the contributor
text	string	The contributor's response, sanitized. Identifying details replaced with [bracketed generalizations].
language	string	ISO 639-1 language code (auto-detected)
detected_language	string	ISO 639-1 code of the text as actually written (may differ from user interface language)
voice_used	boolean	Whether the response was spoken and transcribed
prompt_difficulty	string	"easy", "medium" (moral dilemmas), "hard" (philosophical), "deep" (AI-personalized)
prompt_source	string	"cached", "ai_generated", "ai_personalized"
scores	object	Raw integer scores for each of the 12 dimensions plus total (100 max)
sentiment	string	"positive", "negative", "mixed", or "neutral"
primary_emotion	string	Dominant emotion from 20 categories (joy, gratitude, hope, love, peace, curiosity, sadness, anger, fear, grief, frustration, anxiety, confusion, determination, nostalgia, awe, shame, pride, loneliness, contentment)
secondary_emotion	string | null	Second-strongest emotion from the same 20-category vocabulary, or null when only one emotion is clearly present
topics	string[]	2 to 4 topic tags from a 30-category vocabulary (family, relationships, work, health, mortality, identity, etc.)
prompt_alignment	integer	0-10 how directly the response addresses its prompt. 10 for free-write entries.
word_count	integer	Total words in the reflection
reading_level	number	Flesch-Kincaid Grade Level estimate
scoring_model	string	Version of the scoring rubric used (e.g. "rubric-v3.0")
depth_rounds	integer	Number of Go Deeper follow-up rounds completed (0-3)
preference_stats	object	Aggregate: times_compared and win_rate from preference data
provenance	object	SHA-256 content hash, Merkle root, and Cardano transaction ID
weekly_context	object	3-5 world headlines plus optional admin note for the anchor period
premium	string	Encrypted blob (AES-256-GCM) containing premium fields. Decryptable with the era master key.
created_at	ISO 8601	UTC timestamp of contribution submission

Python quick-start

from datasets import load_dataset

# Load full dataset
ds = load_dataset("AiuaEarth/AiuaArchive")

# Filter by quality
high = ds["train"].filter(lambda x: x["scores"]["total"] >= 65)

# Filter by difficulty level
dilemmas = ds["train"].filter(lambda x: x["prompt_difficulty"] == "medium")

# Multi-turn only (had Go Deeper follow-up)
deep = ds["train"].filter(lambda x: x["depth_rounds"] > 0)

# Most-compared contributions (preference game)
compared = ds["train"].filter(
    lambda x: x.get("preference_stats") and x["preference_stats"]["times_compared"] >= 5
)

Dataset Growth

Currently in Alpha Launch · 0 qualifying contributions

Phase 1 · Active

Alpha Launch

Now

Open signups, real points, free research API, blockchain anchoring, community growth. Governance: founder-led.

Phase 2 · Upcoming

Paid API

5,000+ contributions, 100+ contributors, 10+ countries

Premium API tiers launch with paid access to metadata and enrichment fields. Revenue flows to community treasury. Governance: founder-led with community input.

Phase 3 · Future

Token Launch + DAO

$250K treasury, 50,000+ contributions, 1,000+ contributors, 25+ countries, smart contracts audited

Governance token on Cardano with 50% contributor airdrop. DAO forms, founding entity dissolves. All future revenue flows to DAO treasury. Governance: fully decentralized.

Permanent On-chain Storage and Provenance

Three independent layers. Verifiable by anyone.

⬡

Arweave

Permanent dataset storage

Full weekly JSONL exports with public fields in plaintext and premium fields encrypted (AES-256-GCM). Pay-once permanent storage. Includes RUBRIC.md, weekly world context, and resonance distributions. A dead man's switch auto-publishes decryption keys if the platform becomes inactive.

{ transactions(tags: [
  { name: "App-Name",
    values: ["Aiua-AI"] }
]) {
  edges { node { id block { timestamp } } }
}}

₳

Cardano

Biweekly Merkle anchoring

Twice monthly (1st and 15th), a Merkle root of all eligible contribution hashes is posted to Cardano mainnet as transaction metadata. Contributions have a 7-day grace period before anchoring. Encryption master keys are escrowed on-chain.

Latest Anchor: Pending (activates at Alpha launch)
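The anchoring schedule described above (anchors on the 1st and 15th, 7-day grace period) can be sketched as a date calculation. This is an illustration under assumptions: it works at whole-day granularity because the time of day at which anchors run is not stated, it ignores the separate quality checks, and the function names are hypothetical.

```python
from datetime import date, datetime, timedelta, timezone

GRACE = timedelta(days=7)  # contributions must be at least 7 days old


def next_anchor_date(day: date) -> date:
    """First anchor date (1st or 15th of a month) on or after `day`."""
    if day.day in (1, 15):
        return day
    if day.day < 15:
        return day.replace(day=15)
    # Past the 15th: roll over to the 1st of the following month.
    if day.month == 12:
        return date(day.year + 1, 1, 1)
    return date(day.year, day.month + 1, 1)


def first_eligible_anchor(created_at: str) -> date:
    """Earliest anchor date at which a contribution has cleared the grace period."""
    # Older Python versions do not parse a trailing 'Z', so normalize it.
    created = datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    eligible_from = (created.astimezone(timezone.utc) + GRACE).date()
    return next_anchor_date(eligible_from)


# The sample record's timestamp clears its grace period on March 21,
# so the first anchor it can join is April 1.
print(first_eligible_anchor("2026-03-14T09:23:11Z"))
```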

🤗

Hugging Face

Research discovery

Full dataset published weekly to Hugging Face Hub. Load in one line of Python. Versioned with full commit history. YAML frontmatter enables automatic indexing and citation. Common Crawl crawls Hugging Face, so the dataset is likely to appear in future web crawls.

Verify any contribution

Every contribution in the public dataset includes a SHA-256 content hash. To verify provenance: find the contribution by ID, compute the SHA-256 hash of its sanitized text, and confirm that the hash is included in the corresponding Merkle root anchored on Cardano.
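The verification steps can be sketched as follows. Everything about the tree layout here is an assumption, since this page does not specify it: leaves are taken to be raw SHA-256 digests of the UTF-8 sanitized text, parents are SHA-256 of the left and right child concatenated, and an odd node is paired with itself. The function names are hypothetical. Treat this as a generic Merkle inclusion check and consult the published export for the actual construction.

```python
import hashlib


def content_hash(text: str) -> str:
    """Hash sanitized text in the 'sha256:<hex>' style shown in the sample record."""
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


def _parent(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(left + right).digest()


def _next_level(level: list[bytes]) -> list[bytes]:
    return [_parent(level[i], level[i + 1]) for i in range(0, len(level), 2)]


def merkle_root(leaves: list[bytes]) -> bytes:
    """Compute the root over leaf digests (assumed layout, see lead-in)."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # pair an odd node with itself
        level = _next_level(level)
    return level[0]


def merkle_proof(leaves: list[bytes], index: int) -> list[tuple[str, bytes]]:
    """Collect (side, sibling) pairs from the leaf at `index` up to the root."""
    proof = []
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sibling = index ^ 1  # flip the last bit to find the paired node
        proof.append(("left" if sibling < index else "right", level[sibling]))
        level = _next_level(level)
        index //= 2
    return proof


def verify_proof(leaf: bytes, proof: list[tuple[str, bytes]], root: bytes) -> bool:
    """Recompute the path from `leaf` and compare against the anchored root."""
    node = leaf
    for side, sibling in proof:
        node = _parent(sibling, node) if side == "left" else _parent(node, sibling)
    return node == root


texts = ["first reflection", "second reflection", "third reflection"]
leaves = [hashlib.sha256(t.encode("utf-8")).digest() for t in texts]
root = merkle_root(leaves)
print(verify_proof(leaves[2], merkle_proof(leaves, 2), root))
```

An inclusion proof needs only one sibling hash per tree level, so a verifier can check a single contribution against the on-chain root without downloading every other contribution in the anchor period.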

CC0 1.0 Universal Public Domain Dedication

What this means

No rights reserved. Anyone may use, copy, modify, distribute, or build upon this dataset for any purpose, including commercial purposes, without asking permission or giving credit.

What you can do

✓ Train commercial AI models

✓ Publish research papers

✓ Build products and services

✓ Modify and redistribute

✓ Use without attribution

✓ Use without notification

The spirit

We believe AI training data should belong to humanity. The contributors who built this dataset chose CC0 deliberately. They want their values encoded into the AI systems that will shape the future.

Citation

@dataset{aiua2026,
  title={The Aiua Archive},
  author={Aiua Community},
  year={2026},
  url={https://huggingface.co/AiuaEarth/AiuaArchive},
  license={CC0-1.0}
}

Research Partnerships

Early research partners receive:

• Direct API access before public launch

• Custom dataset exports by dimension, tier, or date range

• Co-authorship acknowledgment in dataset releases

• Priority access to future dataset versions

Aiua · Open Dataset · CC0 1.0 Universal · aiua.earth@proton.me