Most AI localization is still skin-deep.
We translate the interface. We swap a few examples. We fine-tune on Spanish text. We declare victory.
But Spain is not a language setting.
It is a territorial, linguistic, household, economic, and civic reality. It is co-official-language contexts and urban-rural differences. It is housing pressure in one place and demographic aging in another. It is trust built locally, media consumed differently, and services experienced through households, municipalities, and institutions — not just through abstract “users.” If frontier AI is going to be genuinely useful here, it has to understand more than Spanish. It has to understand Spain.
That is the gap I wanted to address with apol/spain-reference-personas-frontier.
Translation localizes language. Representation localizes intelligence.
The Missing Layer in AI for Spain
There is a difference between building AI in Spanish and building AI for Spain.
The first is linguistic. The second is social.
A system can speak fluent Spanish and still misunderstand the country it serves. It can produce polished answers while missing the practical realities that shape whether a person trusts it, follows it, or benefits from it. It can fail because it assumes a generic digital-native user. It can fail because it ignores household burden, local purchase preference, bilingual contexts, or the different rhythms of life between Madrid, Galicia, Andalucía, Cataluña, Euskadi, Canarias, or Castilla y León. It can fail because it was localized at the surface, not aligned at the social layer.
This is the problem hidden inside many AI products today. They are technically impressive and socially under-modeled.
That is especially dangerous when we move beyond chat demos into systems that touch real life: public-service assistants, healthcare navigation, education support, benefits discovery, banking interfaces, retail guidance, energy onboarding, civic information, or anything that asks a model to interact with the lived texture of a country. In those settings, bad representation is not a cosmetic flaw. It becomes product failure, exclusion, or mistrust.
What I Built
I built spain-reference-personas-frontier as a synthetic reference population and benchmark substrate for Spanish-language LLM work grounded in the territorial, household, cultural, linguistic, and civic structure of Spain.
The release is not just a file full of personas. It is a package designed to be used. It includes a stable persona_core, linked household_core, multiple persona_views, an actor_state_init layer for mutable context, and benchmark_tasks to evaluate system behavior rather than relying on anecdotal prompts. In plain English: it helps teams move from decorative personas to something much closer to an engineering substrate for simulation, prompting, testing, and design.
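To make the layered shape concrete, here is a minimal Python sketch of how the layers might link together. All field names (`persona_id`, `household_id`, `region`, and so on) are illustrative assumptions for the example, not the dataset's actual schema.

```python
# Hypothetical sketch of the layered structure: a stable persona_core row,
# a linked household_core row, and a mutable actor_state_init overlay.
# Every field name here is an assumption for illustration.

persona_core = {
    "persona_id": "p-0001",
    "household_id": "h-0400",
    "region": "Galicia",
    "age": 47,
    "primary_language": "gl",  # co-official-language context
}

household_core = {
    "household_id": "h-0400",
    "members": 3,
    "tenure": "rented",
    "urban_rural": "rural",
}

actor_state_init = {
    "persona_id": "p-0001",
    "recent_media_exposure": ["regional_tv"],
    "mood": "neutral",
}

def assemble_context(persona, household, state):
    """Join the stable layers with mutable state into one prompt-ready record."""
    assert persona["household_id"] == household["household_id"]
    assert persona["persona_id"] == state["persona_id"]
    return {**household, **persona, **state}

record = assemble_context(persona_core, household_core, actor_state_init)
```

The point of the sketch is the separation of concerns: stable identity, linked household context, and mutable state stay in distinct layers until a workflow assembles them.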
To make the project legible beyond the dataset card, I also built two public demo spaces on Hugging Face:
- The Campaign Studio, a one-screen targeting demo for pre-research and hypothesis generation. Choose a campaign mission and get a first-fit audience, an opening message direction, a channel mix, and representative persona cards. It is a pre-research accelerator, not observed market microdata, designed for communication and service design teams exploring directions before fieldwork.
- The Research Observatory, a one-screen research lens. Choose one audience story and get a short readout, a signal snapshot, and representative persona cards. Use it to inspect structured audience patterns across the synthetic population before committing to deeper analysis.
Why This Work Is Frontier
We often talk about “frontier AI” as if it only meant larger models, bigger benchmarks, or better reasoning scores.
That is part of the story, but not the whole story.
A different frontier is emerging now: the frontier of alignment to real societies. Not alignment only in the safety sense, or in the preference-tuning sense, but alignment to the actual people, institutions, languages, constraints, and cultural realities a system is meant to serve.
That is where I believe this project sits.
It is frontier work because it tries to solve a problem that increasingly matters for serious LLM systems: how do you connect very capable general models to a specific country in a way that is structured enough to build with, light enough to prompt with, and rigorous enough to evaluate? A static persona deck cannot do that. A synthetic reference population with household structure, multi-view representations, mutable actor state, and benchmark tasks can begin to.
Source: EVALUATION_REPORT.md (v0.1 release). Region share is near-exact (MAE 0.022 pp); age calibration is the main remaining weakness.
Real-world LLM systems do not always need the long-form extended profile (600-token budget ceiling, observed mean ~365 tokens). Sometimes they need a compact card. Sometimes they need a policy view. Sometimes a consumer or dialogue view is enough. Token-efficient representations are not a cosmetic detail; they are part of what makes a dataset usable in production-facing workflows. The project is not only about fidelity. It is about operationalizing fidelity.
And the benchmark layer is crucial. I did not want something that only looked compelling in a workshop or a notebook. I wanted something that could help answer a harder question: does a system behave better when grounded in a structured representation of Spain? The release ships 1,800 benchmark tasks across 9 families and 4 split regimes. It packages the evaluation layer — but live cross-model benchmark lift is not yet part of the v0.1 release. The infrastructure is there; the empirical results will follow. Because if you cannot evaluate the difference, you are still mostly storytelling.
Respecting People Means More Than Privacy
When we talk about “responsible AI,” we rightly focus on privacy, consent, security, and governance.
But there is another kind of respect that matters just as much in product design: representational respect.
It is disrespectful to flatten a country into a handful of imported archetypes. It is disrespectful to assume that a persona designed for an English-speaking U.S. product can simply be translated and made to stand in for Spain. It is disrespectful to treat linguistic plurality, local identity, household structure, rurality, affordability, or trust in institutions as peripheral details.
If a system is supposed to serve Spaniards, it should be designed with some humility toward Spanish reality.
That does not mean essentializing people. Quite the opposite. It means refusing caricature. It means building structured representations that acknowledge variation: between regions, between household forms, between media habits, between different levels of digital access or institutional trust, between people under housing pressure and people living with more security, between monolingual and multilingual contexts, between those who feel included by technology and those who do not.
In that sense, respecting people is not only about protecting their data. It is also about not erasing the shape of their lives.
Respecting Data Means Being Honest About Limits
This part matters to me just as much.
The release is explicit about what it is and what it is not. It is a synthetic reference population, not observed microdata. It is designed for simulation and evaluation, not for replacing field surveys. It exposes limitations openly: age share MAE is 2.95 percentage points, with deviations of +4.00 pp in 25–34 and −4.16 pp in 45–54 — age calibration remains the main weakness in v0.1. High-disclosure-risk rows (0.418% of the population) are flagged in metadata so downstream users can exclude them where needed.
I consider that explicitness a feature, not a footnote.
Too much AI work still oscillates between two bad habits: either pretending the data is “real enough” to stand in for reality, or treating synthetic data as a kind of magic that dissolves all ethical and methodological problems. Neither is serious. Good synthetic data does not pretend to be reality. It declares its assumptions, documents its gaps, exposes its provenance, and makes safe-use boundaries visible.
The release is anchored to official Spanish population statistics (INE), complemented by institutional and survey conditioning for language, media, and civic structure, and augmented with modeled latent variables for values and motivations. Four layers: official statistics → institutional/survey inputs → modeled latent variables → rendered narrative. That chain is documented in the companion DATASHEET.md so every downstream user can trace what is observed, what is conditioned, and what is generated.
If you are going to model a society, you owe that society honesty.
Why This Is Useful for Building with LLMs
For builders, the value here is practical.
One of the most common problems in LLM product work is that teams do not have a good middle layer between “generic user” and “real deployment.” Prompts become ad hoc. Evaluation becomes anecdotal. Teams test on a few hand-written personas, get a few impressive outputs, and mistake that for robustness.
This project is meant to give builders a better layer to work with.
persona_core for Cohort Selection
Select cohorts by region, age, education, language, digital fluency — whatever your product requires. Weight distributions to match your actual user base or stress-test against underrepresented segments.
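A cohort selection like that can be sketched in a few lines. The rows and field names below are invented for illustration; the dataset's real schema may differ.

```python
# Illustrative cohort selection over a list of persona_core-style rows.
# All field names and values here are assumptions for the sketch.

population = [
    {"persona_id": "p-1", "region": "Andalucía", "age": 68, "digital_fluency": "low"},
    {"persona_id": "p-2", "region": "Madrid", "age": 29, "digital_fluency": "high"},
    {"persona_id": "p-3", "region": "Andalucía", "age": 74, "digital_fluency": "low"},
]

def select_cohort(rows, **criteria):
    """Keep rows matching every criterion; callables act as predicates."""
    def matches(row):
        return all(
            crit(row[key]) if callable(crit) else row[key] == crit
            for key, crit in criteria.items()
        )
    return [r for r in rows if matches(r)]

# Stress-test segment: older, low-digital-fluency users in Andalucía.
cohort = select_cohort(
    population,
    region="Andalucía",
    digital_fluency="low",
    age=lambda a: a >= 65,
)
```

The same pattern supports the stress-testing use: swap the criteria to pull an underrepresented segment and check how your system behaves on it.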
household_core When Context Demands It
When housing, caregiving, economic constraints, or household composition matter — as they do for public services, benefits, healthcare, and housing products — link the household layer for richer grounding.
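For example, a housing-support flow might ground on household burden rather than income labels alone. The join below is a sketch; `housing_cost_share` and the 30% threshold are assumptions for illustration, not fields or rules from the release.

```python
# Sketch: linking household_core rows to personas when household context matters.
# Field names (household_id, tenure, housing_cost_share) are illustrative.

personas = [
    {"persona_id": "p-1", "household_id": "h-1"},
    {"persona_id": "p-2", "household_id": "h-2"},
]
households = {
    "h-1": {"tenure": "rented", "housing_cost_share": 0.42},
    "h-2": {"tenure": "owned_outright", "housing_cost_share": 0.05},
}

def with_household(persona):
    """Attach the linked household record to a persona row."""
    return {**persona, **households[persona["household_id"]]}

# Ground a housing-support flow on tenure and burden, not just income labels.
under_pressure = [
    p for p in map(with_household, personas)
    if p["tenure"] == "rented" and p["housing_cost_share"] > 0.30
]
```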
persona_views for Token-Efficient Prompting
Pick the persona_view that fits your context window and task: compact card, policy view, consumer view, or long-form extended profile. Not every interaction needs the full-length view.
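Choosing a view can be as simple as a budget check. The view names follow the post; the token costs below are invented for illustration, except the ~365-token mean for the extended profile, which the post cites.

```python
# Sketch: pick the largest persona_view that fits a remaining token budget.
# Token costs are illustrative assumptions (365 is the observed mean cited
# in the post for the extended profile).

views = {
    "compact_card": 60,
    "policy_view": 150,
    "consumer_view": 180,
    "extended_profile": 365,
}

def pick_view(budget_tokens):
    """Return the richest view that still fits, or None if nothing fits."""
    fitting = [(cost, name) for name, cost in views.items() if cost <= budget_tokens]
    return max(fitting)[1] if fitting else None

assert pick_view(100) == "compact_card"
```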
actor_state_init for Mutable Context
When recent media exposure, mood, seasonal events, or crisis sensitivity matter, layer in the actor state. Personas are not static. Neither should your simulations be.
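One way to keep that mutability disciplined is to treat actor state as an overlay updated by simulated events while the persona itself stays fixed. The event kinds and fields below are assumptions for the sketch.

```python
# Sketch: actor_state as a mutable overlay driven by simulated events.
# Event kinds, field names, and deltas are illustrative assumptions.

base_state = {"persona_id": "p-1", "recent_media": [], "crisis_sensitivity": 0.2}

def apply_event(state, event):
    """Return a new state reflecting an event; the underlying persona is untouched."""
    new = dict(state)
    if event["kind"] == "media_exposure":
        new["recent_media"] = state["recent_media"] + [event["topic"]]
    elif event["kind"] == "crisis":
        new["crisis_sensitivity"] = min(1.0, state["crisis_sensitivity"] + event["delta"])
    return new

state = apply_event(base_state, {"kind": "media_exposure", "topic": "energy_prices"})
state = apply_event(state, {"kind": "crisis", "delta": 0.3})
```

Returning a new state per event (rather than mutating in place) keeps simulation runs replayable and easy to diff.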
benchmark_tasks for Measurable Behavior
Instead of relying on taste or intuition, score system behavior with explicit tasks and split regimes. Does your system behave better when grounded in structured representation? Now you can measure it.
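A minimal harness for that kind of scoring might look like the sketch below. The task shape and the exact-match scorer are assumptions; the release's 9 task families and 4 split regimes would slot into the `split` (or an analogous `family`) field.

```python
# Sketch: score system outputs against benchmark_tasks, aggregated per split.
# Task schema and exact-match scoring are illustrative assumptions.
from collections import defaultdict

tasks = [
    {"task_id": "t1", "split": "region_holdout", "expected": "A"},
    {"task_id": "t2", "split": "region_holdout", "expected": "B"},
    {"task_id": "t3", "split": "random", "expected": "A"},
]
system_outputs = {"t1": "A", "t2": "C", "t3": "A"}

def score_by_split(tasks, outputs):
    """Accuracy per split regime: fraction of tasks answered as expected."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t in tasks:
        totals[t["split"]] += 1
        hits[t["split"]] += outputs.get(t["task_id"]) == t["expected"]
    return {s: hits[s] / totals[s] for s in totals}

scores = score_by_split(tasks, system_outputs)
```

Per-split aggregation is what turns "the outputs looked good" into a comparison you can run before and after grounding a system in the structured representation.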
That is useful for product teams. It is useful for service designers. It is useful for researchers. It is useful for model builders trying to see whether compact views generalize differently from richer profiles. It is useful for anyone who suspects that the missing ingredient in AI quality is often not more raw capability, but better grounding.
From Better Localization to Better Social Outcomes
The most visible use case for a project like this is marketing. That is why one of the demos is framed as a campaign studio.
But the deeper value is much broader.
A country-specific synthetic reference layer can help teams design better onboarding flows, better service explanations, better recommendation systems, better conversational agents, better citizen-facing tools, and better decision support. It can help reduce a familiar kind of AI failure: the system that works beautifully for the imagined mainstream user and poorly for everyone who lives outside that abstraction.
For Spain, that matters across both public and private sectors.
It matters for a public-service assistant that should not assume high digital fluency. It matters for a housing support flow that should understand tenure and burden, not just income labels. It matters for healthcare and care-navigation systems that need to speak clearly across generations. It matters for education and employment tools that should recognize household constraints rather than pretending everyone has the same time, bandwidth, or confidence. It matters for retail, banking, insurance, mobility, and energy products trying to serve real Spanish households rather than imported personas.
And it matters for democracy, too.
Because when AI systems mediate access to information, services, opportunities, or institutions, representation becomes part of inclusion. A society that is badly represented in the systems built for it will be badly served by them.
Built in a Frontier Workflow
I also wanted the way I built this project to reflect the reality of frontier practice in 2026.
This was not a one-model exercise. It was a multi-model studio.
| Model | Role | Why This Model |
|---|---|---|
| Codex with GPT-5.4 | Scale generation & narrative | Schema-compliant data generation at scale. Statistical distributions anchored to INE data across 50 provinces plus Ceuta and Melilla. Turned demographic scaffolding into three-dimensional personas — “Female, 45, Nurse, Valencia” becomes a person with a view on the hospital’s new digital triage system. |
| Claude Code with Opus 4.6 | Quality assurance & critique | Long-horizon coding and calibration. Million-token context window holds hundreds of personas simultaneously for internal coherence, demographic plausibility, and cultural accuracy checks. |
| MiniMax M2.7 | Cost-efficient iteration | Tier-1 benchmark performance at $0.30/MTok input. High-volume iteration passes: generating variants, testing edge cases, filling coverage gaps across underrepresented demographics at a fraction of the cost. |
That matters because the frontier today is increasingly editorial as much as technical. The unit of work is no longer “ask one model, get one answer.” It is a directed workflow in which each model contributes a distinct strength — scale generation, deep quality assurance, cost-efficient iteration — and the human builder acts less like a passive prompter and more like a conductor: setting standards, comparing outputs, rejecting shallow answers, integrating good ones, and keeping the work grounded in a real problem.
That is how I now think about serious AI building. Not as a conversation with a single oracle, but as a disciplined collaboration with frontier systems — all in service of something more important than the models themselves.
Human First, AI Frontier — for Spain
In that sense, this project is a very practical expression of a principle I keep returning to: Human First, AI Frontier.
The point is not to choose between ambition and responsibility. The point is to refuse the false trade-off.
Spain should not have to settle for second-hand AI: generic systems built elsewhere, translated late, lightly adapted, and then presented as localization. Nor should the path to frontier capability require stripping out the social texture that makes people legible to the services meant to support them.
We can aim higher than that.
We can build AI that is technically ambitious and socially grounded. We can build with frontier models and still respect people. We can use synthetic data while being honest about its limits. We can treat localization not as cosmetic translation, but as serious representational work. And we can design systems that are not only more effective, but more just, more trustworthy, and more useful.
The next wave of useful AI in Spain will not be won only by whoever has the best base model. It will be won by whoever can align frontier capability with Spanish reality: connecting frontier systems to real societies with fidelity, humility, and structure.
That is the ambition behind spain-reference-personas-frontier.
Not a perfect representation of Spain. Not a replacement for fieldwork. Not a final answer.
But a better substrate for building.
And in this decade, I increasingly believe that better substrates may matter almost as much as the models themselves.
The dataset, demo spaces, and documentation are all open under CC-BY-4.0. Explore the dataset on Hugging Face, try the Campaign Studio, or dive into the Research Observatory.
References
Dataset & Companion Documents
- Guerrero, M. (2026). spain-reference-personas-frontier: Dataset Card. Hugging Face.
- Guerrero, M. (2026). DATASHEET.md — Datasheet for spain-reference-personas-frontier. Companion document.
- Guerrero, M. (2026). EVALUATION_REPORT.md — Evaluation Report. Companion document.
- Guerrero, M. (2026). PRIVACY_AND_DISCLOSURE.md — Privacy and Disclosure Analysis. Companion document.
Official Statistics
- INE (2025). “Censo Anual de Población. First results 2025.” Instituto Nacional de Estadística, Spain.
Related Work
- NVIDIA (2025). “Nemotron-Personas-USA: Synthesized Data for Sovereign AI.” Hugging Face Blog.
- NVIDIA (2026). “Nemotron-Personas Collection.” Hugging Face.
- Ge, Y. et al. (2024). “Scaling Synthetic Data Creation with 1,000,000,000 Personas.” arXiv:2406.20094. Tencent AI Lab.
- Cambridge University Press (2024). “Synthetic users: insights from designers’ interactions with persona-based chatbots.” AI EDAM.
- Gebru, T. et al. (2021). “Datasheets for Datasets.” Communications of the ACM, 64(12), 86–92.
- Argyle, L.P. et al. (2023). “Out of One, Many: Using Language Models to Simulate Human Samples.” Political Analysis, 31(3), 337–351.
- Shanahan, M. et al. (2023). “Role-Play with Large Language Models.” Nature, 623, 493–498.
Models Used
- OpenAI (2026). Codex Models Documentation.
- MiniMax (2026). “MiniMax M2.7: Model Self-Improvement.” MiniMax.io.