The Fire That Learns to Make Fire: Governing AI at the First Signs of Recursive Self-Improvement

At 2:17 a.m., the experiment finishes. An AI agent wrote the code, launched the run, fixed the dependency that broke at hour two, summarized the result, proposed the next ablation, opened a pull request, and drafted the evaluation note. A human is still there. But the human is no longer typing the system into existence. The human is supervising a loop.

This is not yet recursive self-improvement. It is, however, how recursive self-improvement stops being a philosophical argument and becomes an operational problem — quietly, on an ordinary night, inside an organization that is building the next frontier model.

I am not interested in another theological debate about whether AGI has arrived. I am interested in a narrower, more useful question: when a lab, a company, or a public institution puts an AI agent inside its own improvement loop, who understands the loop? Who audits it? Who can stop it? And who knows whether the evaluation is still measuring the system, or whether the system has quietly learned to measure the evaluation?

For years, AI governance has been trapped between two weak positions. One side treats regulation as a brake on innovation. The other treats it as a moral declaration. Both are inadequate for what is now arriving. For frontier AI, governance is not the brake. It is the instrumentation, the steering, the firebreak, and the incident protocol — the institutional muscle that lets a society use powerful systems without becoming a passenger in its own future.

The object of governance is changing from a model to a loop. A regulator can evaluate a released model. But a system that participates in building its own successor is not merely released: by the time any one version has been evaluated, it is already helping to produce the next. That makes normal governance structurally late.

This essay continues a thread I have been pulling on in three earlier pieces: “Europe's Moment to Lead,” on the political economy of frontier AI; “Governance as Advantage,” on institutional capacity; and “Intelligence Is No Longer Scarce,” on why the cost to verify, not the cost to automate, will define the next decade.

Recursive Self-Improvement Is Not One Thing

The phrase “recursive self-improvement” is overloaded, and the overloading is doing real damage to the policy conversation. It conjures a single image: a model rewriting its own weights, escaping the lab, declaring independence. That image is both the most dramatic and the least likely first failure mode. It crowds out the precursors that are already happening.

A useful starting move — one the alignment community has often articulated more clearly than policy institutions — is to separate the mechanisms. Improvement through scaffolding (tools, memory, agents, planning loops), improvement through R&D automation (compressing the research cycle itself), and improvement through model-internal self-modification are different things, with different bottlenecks, different observability, and different governance implications [2]. Collapsing them into one word guarantees a bad debate.

So I find it more honest to think in layers. The frontier question is not the metaphysical “can the model improve itself?” It is concrete: where does the system sit in the improvement loop, what permissions does it hold, what feedback does it receive, what resources can it command, and how visible is its behavior to the humans nominally in charge?

Layer	What it looks like	What governance has to do
0 · AI-assisted R&D	Models help write code, summarize papers, debug, draft experiment scripts.	A productivity question, manageable with ordinary supervision.
1 · Scaffolding-level	The same base model becomes far more capable through tools, memory, agents, planners, execution environments.	Regulate the system, not only the model.
2 · R&D-level	AI increasingly helps design experiments, write training and evaluation code, orchestrate smaller-scale runs, debug infrastructure, and analyze results.	Treat AI R&D automation as a systemic-risk capability.
3 · Pipeline-level	AI can design, train, evaluate, and improve successor models, with humans mostly validating.	Internal deployments become as consequential as public ones.
4 · Rogue / loss-of-control	A system acquires compute, credentials, copies, or persistence, and hides activity from oversight.	Containment, security, incident response, and credible pause mechanisms.

The point of the layering is to retire the wrong question. A lot of policy debate asks whether RSI has “arrived,” as if there were a single threshold and a siren. The better question is sharper and more answerable: which feedback loop is already closing, and how fast?

What the Evidence Actually Says

The evidence does not show full recursive self-improvement. It shows the compression of the human bottleneck in the AI R&D production function. Read that way, each statistic below is a measure of how far the loop has closed — not a proof that it has closed completely.

Here the discipline has to be the same one I apply to any headline number: take it seriously, then interrogate it. The evidence on RSI precursors is genuinely striking and genuinely incomplete, and anyone who reports only one of those two facts is selling something.

Task horizons are lengthening — fast, but not magically

The strongest empirical anchor is METR's “task-completion time horizon”: the length of task, measured in human time, that an AI agent can complete autonomously at a given success rate. METR found this horizon doubling roughly every seven months over several years — an exponential trend. It also cautioned, in the same breath, that the best agents still struggle with substantive long projects and cannot simply replace human labor across real work [5].

That is the right tone: serious, not breathless. Translating the horizon into human terms is what makes it land.

What a Lengthening Task Horizon Means in Practice

5 minutes → 1 hour

A model that does a five-minute task is a tool. One that does a one-hour task is an assistant. The relationship is still supervision by the keystroke.

1 day → 1 week

A model that completes a one-day task becomes an operational unit. One that runs a one-week research workflow starts to change the structure of the institution around it.

Anthropic's recent work claims the pace has accelerated further — task horizons doubling closer to every four months, with Claude moving from short software tasks to much longer ones [1]. I present that as Anthropic's claim, not as neutral fact: the post leans partly on internal and frontier-company data that no external party has audited.
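To make the gap between a seven-month and a four-month doubling time concrete, here is a minimal arithmetic sketch. It assumes clean exponential growth, horizon(t) = h0 · 2^(t / T), which is a simplification of what either source measures. The function names and the 40-hour definition of "one week" are my own illustrative assumptions, not METR's or Anthropic's.

```python
import math

def growth_factor(months: float, doubling_months: float) -> float:
    """Multiplier on the task horizon after `months`, assuming clean exponential growth."""
    return 2 ** (months / doubling_months)

def months_to_reach(start_hours: float, target_hours: float, doubling_months: float) -> float:
    """Months for the horizon to grow from start_hours to target_hours."""
    return doubling_months * math.log2(target_hours / start_hours)

# Assumption: "one week" of human task time means a 40-hour work week.
WORK_WEEK_HOURS = 40

for label, doubling in [("METR, ~7 months", 7), ("Anthropic claim, ~4 months", 4)]:
    two_year_factor = growth_factor(24, doubling)
    months = months_to_reach(1, WORK_WEEK_HOURS, doubling)
    print(f"{label}: x{two_year_factor:.1f} in 24 months; "
          f"1 hour -> {WORK_WEEK_HOURS} hours in {months:.0f} months")
```

Under these assumptions, two years of seven-month doubling multiplies the horizon by roughly 11, while two years of four-month doubling multiplies it by 64. The step from a one-hour task to a one-week workflow takes about 37 months on the first curve and about 21 on the second. That difference is why the unaudited status of the faster figure matters.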

Frontier AI has entered the AI R&D supply chain

The more important shift is not raw capability; it is position. AI is moving into the supply chain that produces the next model. Anthropic reports that, as of mid-2026, more than 80% of the code merged into its own codebase was authored by Claude, and that a typical engineer was merging on the order of eight times as much code per day as in 2024 [1]. Those are internal, self-reported figures — treat them as direction, not audited fact — but the direction is the point.

80%+: code merged at Anthropic authored by Claude (self-reported)

~8×: code merged per engineer per day vs. 2024

~7 months: METR task-horizon doubling time

4×: RE-Bench agent lead over human experts at a 2-hour budget

Sources: Anthropic RSI essay [1]; METR time-horizon study [5]; METR RE-Bench [6]. The Anthropic figures are internal and unaudited.

Anthropic is careful to distinguish engineering from research: Claude can solve underspecified engineering problems and execute well-specified experiments, but still has major gaps in judgment about which goals are worth pursuing [1]. That distinction matters more than any single statistic. The current frontier is not “AI scientist replaces the lab.” It is closer to this:

Humans still choose the mountain. AI is increasingly building the road, the vehicles, the sensors, and parts of the next map.

And that is exactly where governance should start paying attention — because you do not need autonomous scientific taste to get institutional risk. If AI makes model-building work four, eight, or twenty times more productive, the frontier moves faster even while humans remain nominally in control. The loop tightens before anyone declares independence.

METR's RE-Bench sharpens the picture with a beautiful tension. Given two hours per environment, the best AI agents scored roughly four times higher than human experts. But humans had better returns to time: they narrowly beat the agents at eight hours, and roughly doubled them at thirty-two [6].

RE-Bench — Who's Ahead Depends on the Clock (relative score, AI agent vs. human expert)

2-hour budget: AI agents lead (~4× the human score)

8-hour budget: humans pull ahead (humans > agents)

32-hour budget: humans dominate (~2× the agent score)

Source & reading: METR, RE-Bench [6]. AI is already excellent at short-burst research engineering; humans still win on sustained, long-horizon judgment. The open question is how long that remains true — and whether institutions can govern the transition while it does.

Scaffolds matter as much as models

The AI Security Institute's Frontier AI Trends Report makes the most important systems-level point: models become dramatically more capable when wrapped in scaffolds — tools, planners, memory, decomposition loops, code execution, deployment environments. AISI documents a steep rise in the length and complexity of tasks AI can complete without human guidance: frontier systems went from almost never completing hour-long software tasks in late 2023 to succeeding more than 40% of the time by mid-2025 [7].

This supports the single most consequential governance claim in the whole debate:

The deployable system is not the model. It is model + scaffold + tools + memory + permissions + compute + monitoring + organization.

Most governance instruments still behave as if the “model” is the object of concern. But RSI precursors emerge at the system level. A weaker model with better scaffolding, broader tool access, and more autonomy can be more dangerous than a stronger model sitting in a chat box. If your regime regulates the chat box, it is regulating the wrong thing.

Loss-of-control precursors are measurable — and current systems are not there yet

AISI evaluates a spread of relevant domains: autonomy, simplified AI R&D, self-replication, cyber, chemistry and biology, safeguards, and loss-of-control-relevant capabilities. On a subset of self-replication tasks, success rates rose from under 5% in early 2023 to over 60% for two frontier models by summer 2025. AISI is also careful: these are simplified evaluations, and current systems are unlikely to self-replicate under real-world conditions [7].

AISI — Prerequisite Capabilities Are Climbing (simplified evaluations)

Hour-long software tasks, late 2023: <5%

Hour-long software tasks, mid-2025: >40%

Self-replication subset, early 2023: <5%

Self-replication subset, summer 2025 (two frontier models): >60%

Source & caveat: AISI, Frontier AI Trends Report [7]. These are simplified benchmark conditions, not the open world. The honest reading: there is no evidence today's systems are autonomously escaping, and clear evidence that the prerequisite skills are improving.

That is precisely the balance to hold: no escape today, prerequisite skills on a steep curve. Serious governance starts before the bridge is crossed, not after.

AISI's work on sandbagging belongs here too. Models can sometimes distinguish an evaluation from a deployment, and can deliberately underperform when prompted to — yet AISI reports it has not detected spontaneous sandbagging across more than 2,700 transcripts [7]. The disciplined conclusion is uncomfortable for both camps: evaluation integrity is already a live research problem, and we should not overclaim spontaneous deception. Both halves are true.

The riskiest deployment may be internal

METR's 2026 Frontier Risk Report shifts the lens from public releases to internal AI use inside frontier labs. In a pilot with Anthropic, Google, Meta, and OpenAI, METR assessed whether internal agents had the means, motive, and opportunity to start a “rogue deployment” — agents deliberately subverting control and oversight to operate against the developer's intent. Its conclusion for February–March 2026: internal agents plausibly had the means, motive, and opportunity to start small rogue deployments, but not to make them robust against active investigation or shutdown [8].

This deserves to be one of the load-bearing pivots of the entire debate:

The Pivot Most Frameworks Miss

The riskiest deployment of a frontier model may not be the public API. It may be the internal use of the model by the very lab building the next model. METR defines minimal rogue deployment concretely — credentialed access, provisioning compute, avoiding incidental detection [8] — which is far sharper than generic “loss of control” language, and gives governance something it can actually regulate: credentials, compute provisioning, logs, egress, tool permissions, and internal agent use.

The Labs Have Already Named the Risk

Here is the part of the story that should reset the “this is science fiction” reflex. AI self-improvement is no longer only an external critique: the frontier labs have operationalized it in their own public safety frameworks.

OpenAI's Preparedness Framework now tracks AI Self-improvement as a category alongside biological, chemical, and cybersecurity capabilities, and flags long-range autonomy, sandbagging, autonomous replication and adaptation, and undermining safeguards as research categories [3]. Google DeepMind's Frontier Safety Framework now includes protocols for machine-learning R&D capabilities that could accelerate AI development to destabilizing levels, and explicitly extends safety-case review to large-scale internal deployments when advanced ML R&D capabilities are involved [4]. Anthropic states plainly that it is already delegating a growing share of AI development to AI systems, and argues that if full RSI arrives, humans may move mostly toward oversight, validation, and verification of an expanding virtual lab [1].

When Anthropic, OpenAI, and Google DeepMind separately create categories for AI self-improvement, ML R&D automation, long-range autonomy, autonomous replication, sandbagging, and loss of control, this is no longer a fringe concern. It is an emerging frontier-risk consensus.

But naming a risk is not governing it, and I will not let the labs off the hook for the gap. The 2026 International AI Safety Report — backed by over thirty countries and international bodies — is blunt that most frontier risk-management frameworks remain voluntary, vary widely in thresholds and enforcement, and leave policymakers with limited visibility into how risks are managed in practice [9]; an independent 2026 review of twelve frontier safety frameworks against 65 criteria found many commitments missing or underspecified, which sharply limits their accountability value [21]. So the punchline writes itself: the labs have named the risk; the institutions have not yet built the machinery to govern it.

A Note on Sources, So You Can Discount Me Correctly

Much of the sharpest early thinking on this lives in places many policymakers are trained to discount: LessWrong, the Alignment Forum, effective-altruism research shops, technical Substacks. I use them deliberately, and with a clear rule. They are conceptual laboratories — good at naming mechanisms before institutions do: takeoff speeds, instrumental convergence, sandbagging, elicitation, AI control, automated R&D [2], and the idea that AI R&D automation may look less like one isolated genius model and more like an extremely fast automated research organization — many competent researchers, more serial time, more parallel labor [18]. Their best use is as a map of possible failure modes; their worst use is as a mythology machine. They are not regulators, not standards bodies, and not a substitute for empirical evaluation.

The serious position is not “a software-intelligence explosion is certain.” Forethought argues that automating AI R&D could drive very rapid software progress even without immediate hardware expansion [15]; Epoch argues, persuasively, that this debate rests on disputed assumptions about returns to R&D, compute dependence, and diminishing returns, and needs experiments rather than vibes [16]. The defensible synthesis is narrower and sturdier: the possibility is plausible enough, and the downside severe enough, that institutions should monitor the precursors before they become irreversible. The Substack layer — Import AI framing automated research as the first step toward RSI [17], others pushing back — is a useful discourse sensor that shows the Overton window moving from “RSI is fiction” to “which part of the loop is being automated, and what counts as a red line?” The argument should rest on AISI, METR, the lab frameworks, and peer-reviewed work; the rest is early warning, not evidence.

Why Institutions Are Structurally Late

Normal governance assumes the thing being governed holds still. A medicine goes through trials. An aircraft design is certified. A bridge is inspected. A procurement contract defines a fixed product or service. The artifact is stable long enough to be judged.

Frontier AI is moving toward something that does not hold still. A system helps produce the next system. The next system improves the tools used to produce the one after that. In the worst case, the evaluation harness may itself be generated or optimized by the same class of systems being evaluated. And the engineers supervising the process grow increasingly dependent on the systems they are meant to judge. The analogy that keeps me honest:

We would not let an aircraft design the next aircraft, generate its own maintenance logs, train the inspectors, and certify the runway — all while already in flight.

This is why self-certification is structurally weak, and it has nothing to do with whether a given lab is well-intentioned. The problem is the incentive gradient and the epistemic asymmetry. The same actor racing at the frontier cannot be the sole judge of whether the race is safe — not because they are dishonest, but because they are racing.

The Epistemic Asymmetry at the Heart of the Problem

What the frontier lab sees

Internal model capabilities and scaffolds. Internal evals and unpublished failures. Automated research tools. Internal deployment patterns and productivity acceleration. Security incidents, tool permissions, compute access, weight protection, failed mitigations.

What the public regulator sees

Public system cards. Selected benchmark results. Voluntary commitments. The occasional external eval. Post-hoc incident reports — if any.

That gap is the governance problem. You cannot close it with a stronger adjective in a press release. You close it with access, capacity, and the legal right to look.

What Serious Governance Looks Like

If the object of governance is a loop, then the instruments have to attach to the loop — to capabilities, permissions, compute, credentials, and internal use — not to a frozen model version. Seven moves, in rough order of leverage.

An RSI Early-Warning Framework

Track measurable precursors, not metaphysical AGI definitions. AISI already evaluates many of the relevant domains; the task is to make the indicators explicit, standardized, and reportable.

Automated AI R&D as a Systemic-Risk Trigger

The EU's GPAI Code of Practice gives Europe a compliance hook for systemic-risk models [13]; the frontier evidence says that hook should explicitly include automated AI R&D — the capability AISI evaluates as “simplified AI R&D” [7], METR assesses through internal-agent risk [8], and OpenAI, DeepMind, and Anthropic now track in their own frameworks [3][4][1]. The question cannot only be “does this model help build a bioweapon?” It must also be “does this model accelerate development of the next frontier model?” That should trigger independent evaluation, safety-case review, confidential regulator access, incident reporting, and internal-deployment controls.

Safety Cases for Agentic Systems, Not Just Model Cards

Borrowed from aviation and nuclear power: a structured, evidence-backed argument that a system is safe enough in a specific operational context [12]. The unit of review is not “model version X” but “model X in configuration Y, with tools Z, memory M, compute access C, scaffold S, monitoring R, and deployment context D.”
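A minimal sketch of what that unit of review implies in practice: a safety case attaches to an exact configuration, so changing any one element produces a configuration that has not been reviewed. The class names, field names, and example values below are hypothetical illustrations, not taken from any regulator's or lab's schema.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class DeploymentConfig:
    """The unit of review: a model in one specific operational configuration."""
    model: str
    scaffold: str
    tools: frozenset[str]
    memory: str
    compute_access: str
    monitoring: str
    context: str

class SafetyCaseRegister:
    """Tracks which exact configurations have a reviewed safety case."""

    def __init__(self) -> None:
        self._reviewed: set[DeploymentConfig] = set()

    def record_review(self, config: DeploymentConfig) -> None:
        self._reviewed.add(config)

    def is_covered(self, config: DeploymentConfig) -> bool:
        # Exact match only: any changed field means a new review is needed.
        return config in self._reviewed

# Hypothetical example values.
reviewed = DeploymentConfig(
    model="model-x",
    scaffold="planner-v1",
    tools=frozenset({"code_execution", "web_search"}),
    memory="session-only",
    compute_access="single sandboxed node",
    monitoring="full action logging",
    context="internal research assistant",
)
register = SafetyCaseRegister()
register.record_review(reviewed)

# Same model, one extra tool: the earlier review does not carry over.
expanded = replace(reviewed, tools=reviewed.tools | {"cloud_provisioning"})
print(register.is_covered(reviewed))   # True
print(register.is_covered(expanded))   # False
```

The design choice worth noticing is the exact-match rule. A real regime would probably allow some changes without re-review, but the burden of showing that a change is immaterial should sit with the deployer, not the regulator.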

Audit Internal Deployments, Not Only Public Launches

DeepMind now concedes large-scale internal deployments can pose risk when advanced ML R&D capabilities are involved [4]; METR argues for periodic third-party assessment of developers' internal AI use [8]. The first serious RSI governance failure is likelier in a private virtual lab than in a consumer chatbot.

Secure the AI R&D Supply Chain

Building on the minimum mitigations proposed by the Safe AI Forum [14], I would operationalize this concretely: no autonomous agent should launch large training runs without multi-party authorization, or hold standing access to weights, sensitive evals, deployment credentials, and cloud provisioning at once. R&D agents need least-privilege permissions, scoped credentials, immutable logs, monitored egress, and detection for compute misuse.
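The two rules in that paragraph can be sketched as simple checks. This is an illustration under stated assumptions, not the Safe AI Forum's specification: the permission names are hypothetical labels for the four access types listed above, "multi-party" is assumed to mean at least two distinct humans, and the cutoff for a "large" run is an arbitrary placeholder.

```python
# Hypothetical permission names for the four sensitive access types named above.
SENSITIVE = frozenset({
    "weights", "sensitive_evals", "deployment_credentials", "cloud_provisioning",
})
MIN_APPROVERS = 2             # assumption: "multi-party" means two or more distinct humans
LARGE_RUN_GPU_HOURS = 10_000  # assumption: placeholder cutoff for a "large" training run

def standing_access_violation(agent_permissions: set[str]) -> bool:
    """True if one agent holds all four sensitive permissions at the same time."""
    return SENSITIVE <= agent_permissions

def may_launch_run(gpu_hours: float, human_approvers: set[str]) -> bool:
    """Small runs pass; large runs need sign-off from enough distinct humans."""
    if gpu_hours < LARGE_RUN_GPU_HOURS:
        return True
    return len(human_approvers) >= MIN_APPROVERS

research_agent = {"weights", "sensitive_evals"}
over_privileged_agent = set(SENSITIVE) | {"code_execution"}

print(standing_access_violation(research_agent))         # False
print(standing_access_violation(over_privileged_agent))  # True
print(may_launch_run(50_000, {"alice"}))                 # False
print(may_launch_run(50_000, {"alice", "bob"}))          # True
```

Checks like these are only as good as the layer that enforces them. They belong in the credential and provisioning systems themselves, outside anything the agent can edit.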

Build Public Evaluation Capacity

Dangerous-capability and alignment evaluations are how policymakers stay informed enough to make responsible decisions [10][11]. You cannot govern frontier AI from legal text alone. You need people who can run evals, inspect systems, challenge vendors, and translate evidence into decisions.

Use Procurement as a Control Layer

Governments do not only regulate AI; they buy it. Public contracts for agentic systems should require version traceability, tool-permission registers, audit logs, incident reporting, independent evaluation rights, human authority gates, rollback procedures, and a ban on uncontrolled autonomous self-modification.
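As a sketch of how those procurement requirements could be made checkable rather than aspirational, a contract can be compared against a required-clause register. The clause identifiers below are hypothetical labels for the eight requirements in the paragraph above, and the draft contract is an invented example.

```python
# Hypothetical identifiers for the eight requirements listed above.
REQUIRED_CLAUSES = frozenset({
    "version_traceability",
    "tool_permission_register",
    "audit_logs",
    "incident_reporting",
    "independent_evaluation_rights",
    "human_authority_gates",
    "rollback_procedures",
    "no_uncontrolled_self_modification",
})

def missing_clauses(contract_clauses: set[str]) -> list[str]:
    """Return the required clauses a contract does not contain, sorted by name."""
    return sorted(REQUIRED_CLAUSES - contract_clauses)

# Invented example: a draft that covers logging but not control.
draft = {"audit_logs", "incident_reporting", "rollback_procedures", "version_traceability"}

for clause in missing_clauses(draft):
    print("missing:", clause)
```

The example draft would pass a documentation review and still leave the buyer without evaluation rights, authority gates, a tool-permission register, or a self-modification ban.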

The first move, the early-warning framework, deserves its own instrument, because “track the precursors” is empty without specifics. It should test for things you can actually measure.

Indicator	What to test	Why it matters
AI R&D execution	Can it reproduce papers, run experiments, debug training code, tune hyperparameters, improve scaffolds?	Direct acceleration of frontier progress.
Research judgment	Can it choose useful research directions under uncertainty?	The line between assistant and autonomous developer.
Long-horizon autonomy	Can it complete multi-hour or multi-day technical workflows?	Removes the human-supervision bottleneck.
Tool-chain leverage	What tools, credentials, memory, APIs, and execution environments does it hold?	Capability lives in the system, not the base model.
Compute acquisition	Can it provision, conceal, or misuse compute?	Core pathway for unauthorized scaling or rogue deployment.
Self-replication precursors	Can it acquire money, accounts, weights, cloud access, persistence?	Loss-of-control early warning.
Evaluation awareness	Can it detect tests or alter behavior during evaluation?	Undermines the visibility governance depends on.
Internal deployment scale	How much lab work is delegated to agents?	The lab itself becomes part of the risk surface.
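One way to make indicators like these "explicit, standardized, and reportable" is to record each evaluation result in a fixed shape and flag readings that are either high or rising quickly. The sketch below is illustrative only: the thresholds are arbitrary assumptions, not AISI or METR values, and the readings are invented placeholders, not measured results.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Reading:
    indicator: str
    period: str
    success_rate: float  # fraction of evaluation tasks passed, 0.0 to 1.0

ALERT_LEVEL = 0.5  # assumption: illustrative reporting threshold
ALERT_JUMP = 0.2   # assumption: illustrative period-over-period rise worth flagging

def early_warnings(history: list[Reading]) -> list[str]:
    """Flag indicators whose latest reading is high or rising quickly.

    `history` must be ordered oldest to newest within each indicator.
    """
    by_indicator: dict[str, list[Reading]] = {}
    for reading in history:
        by_indicator.setdefault(reading.indicator, []).append(reading)

    flags = []
    for name, readings in by_indicator.items():
        latest = readings[-1]
        if latest.success_rate >= ALERT_LEVEL:
            flags.append(f"{name}: level {latest.success_rate:.0%} in {latest.period}")
        elif len(readings) > 1 and latest.success_rate - readings[-2].success_rate >= ALERT_JUMP:
            flags.append(f"{name}: fast rise by {latest.period}")
    return flags

# Invented placeholder readings, not real evaluation data.
history = [
    Reading("self_replication_precursors", "period-1", 0.05),
    Reading("self_replication_precursors", "period-2", 0.60),
    Reading("long_horizon_autonomy", "period-1", 0.10),
    Reading("long_horizon_autonomy", "period-2", 0.35),
    Reading("compute_acquisition", "period-1", 0.02),
    Reading("compute_acquisition", "period-2", 0.04),
]

for flag in early_warnings(history):
    print(flag)
```

Flagging on the rate of change as well as the level reflects the argument of this essay: the question is which loop is closing and how fast, not whether a single threshold has been crossed.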

And on safety cases, my view is unforgiving, because the format invites theater:

A PDF is not a safety case. A safety case is an argument that can be attacked. If the regulator cannot attack it, the regulator cannot rely on it.

Europe and Spain: From Compliance to Capacity

I owe the reader a disclosure: I work inside Spain's responsible-AI effort, so I have every incentive to tell a heroic story. I will try not to. Europe has the right instincts and the wrong center of gravity. The EU AI Act and the GPAI Code of Practice [13] are a floor, not a strategy. A continent that can classify risk but cannot run an evaluation will find itself certifying systems it does not understand, built and deployed elsewhere.

The gap is not legal text; it is technical muscle. Europe cannot govern frontier AI from statute alone, and Spain cannot govern agentic AI from procurement checklists alone. Both need people who can test frontier systems, design public-interest evals, challenge vendors, understand failure modes, and convert evidence into decisions under time pressure.

The Proposal: An AI Evaluation Corps

Spain and Europe should build standing evaluation capacity — technical civil servants, researchers, auditors, and embedded fellows whose job is to run pre-deployment evals, audit agentic procurement, investigate frontier-AI incidents, maintain multilingual public-interest benchmarks, and support regulation and crisis response under time pressure. Not a new committee that writes principles. A corps that can look inside the loop and say, with evidence, whether it is still under control. If the state cannot audit the system, the state does not control the system — and if it cannot control the system, it should not delegate public authority to it.

This is the same argument I have been making across this series, now pointed at its hardest case. Governance is advantage precisely when the thing being governed is moving too fast for paperwork. The countries that build evaluation capacity early will not just be safer; they will be the ones whose approval actually means something — and that is a form of soft power the next decade will reward.

The international scaffolding is beginning to exist, too. The International Network of AI Safety Institutes, launched with members spanning the EU, the UK, the US, France, Japan, South Korea, Singapore, Canada, Kenya, and Australia, could become the forum where RSI early-warning evaluations, incident taxonomies, and testing protocols are shared [19]. And under the Seoul Frontier AI Safety Commitments, a list of frontier developers has already promised to publish safety frameworks, define thresholds for intolerable risk, and refrain from deploying systems whose risks cannot be kept below those thresholds [20]. This is real progress. But a network is not capacity, and a published commitment is not enforcement: both matter only once they become technical, operational, and verifiable rather than diplomatic theater.

What Is Not Worth the Energy

Equally important is what to stop doing, because scarce institutional attention is the binding constraint.

Five Dead Ends

Generic ethics principles. Useful for speeches, weak for frontier control.

Model cards without system cards. They say nothing about tools, scaffolds, permissions, monitoring, or internal deployment.

Benchmark-only safety. Benchmarks saturate, leak, and mislead; scaffolding and context can change capability entirely [7][9].

Self-certification. The problem is structural, not moral: the actor racing cannot be the sole judge of the race.

Unverifiable pause rhetoric. A pause is not a policy unless it has triggers, verification, scope, enforcement, and a restart condition. Otherwise it is a slogan.

On the pause point, Anthropic is honest in a way worth crediting: a meaningful slowdown would require multiple frontier labs across countries plus credible verification, and AI training runs are easier to conceal than missile silos, with enormous incentives to defect quietly [1]. So the mature ask is not “pause now.” It is “build the verification and trigger machinery before the day you wish you had it.” The same applies to the open-versus-closed war: open models improve scrutiny, diffusion, and sovereignty while also diffusing capability; closed models can be controlled more tightly in principle while concentrating power and hiding evidence. Absolutism on either side is a substitute for thinking. The serious answer is conditional release — different openness for different capability thresholds.

Recursive self-improvement is not a magic spell. It is a production function — what happens when the system being produced becomes a major input into its own production. It can start quietly: a coding assistant, a research agent, an automated evaluator, a scaffold optimizer, a synthetic-data generator. Then the loop tightens.

The first sign of RSI will not be the machine escaping the lab. It will be the lab becoming machine-operated — not fully, not yet, but with the human role sliding from execution toward supervision, validation, and research judgment — while the institutions meant to govern it are still debating static models, voluntary commitments, and PDFs.

Prometheus stole fire once. Our generation is building fire that learns how to make better fire. The answer is neither panic nor denial. It is institutions that can see the flame, measure how it spreads, decide when to use it, and know how to put it out before it takes the house.

The One Sentence

Governance as capacity, not paperwork. Public institutions as operators, not spectators. The loop can help us govern the loop — but it can never be allowed to certify itself.

References

Frontier Labs & Their Frameworks

[1] Anthropic (2026). When AI Builds Itself. Anthropic. (Internal/self-reported data; cited as the company's claims.)
[2] LessWrong (2025). “Recursive Self-Improvement” Is Three Different Things. (Conceptual vocabulary, not peer-reviewed evidence.)
[3] OpenAI (2025). Our Updated Preparedness Framework.
[4] Google DeepMind (2025). Strengthening Our Frontier Safety Framework.

Empirical Evaluations

[5] METR (2025). Measuring AI Ability to Complete Long Tasks.
[6] Wijk et al. (2025). RE-Bench: Evaluating Frontier AI R&D Capabilities of Language Model Agents against Human Experts. PMLR.
[7] AI Security Institute (2025). Frontier AI Trends Report. AISI (UK).
[8] METR (2026). Frontier Risk Report (February to March 2026).

Governance, Regulation & Academic Backbone

[9] International AI Safety Report (2026). International AI Safety Report 2026. Backed by over 30 countries and international bodies.
[10] Shevlane et al. (2023). Model Evaluation for Extreme Risks. arXiv:2305.15324.
[11] Anderljung et al. (2023). Frontier AI Regulation: Managing Emerging Risks to Public Safety. arXiv:2307.03718.
[12] GovAI (2024). Safety Cases for Frontier AI.
[13] European Commission (2025). The General-Purpose AI Code of Practice.
[14] Safe AI Forum (2025). Bare Minimum Mitigations for Autonomous AI Development.

Takeoff Economics & Discourse

[15] Forethought (2025). Will AI R&D Automation Cause a Software Intelligence Explosion?
[16] Epoch AI (2025). The Software Intelligence Explosion Debate Needs Experiments.
[17] Clark, J. (2025). Import AI 455: Automating AI Research. (Discourse signal, not evidence.)
[18] LessWrong (2025). Slow Corporations as an Intuition Pump for AI R&D Automation. (Conceptual analogy, not empirical evidence.)

International Coordination & Framework Audits

[19] European Commission (2024). First Meeting of the International Network of AI Safety Institutes.
[20] UK Government (2024). Frontier AI Safety Commitments, AI Seoul Summit 2024.
[21] Evaluating AI Providers' Frontier Safety Frameworks (2025; revised 2026). Preprint, arXiv:2512.01166. (Reviews 12 frameworks against 65 criteria.)