The Missing Layer in Autonomous Science
Autonomous laboratories are compressing the timeline of scientific discovery. Robotic platforms execute experiments. Generative models design novel materials, molecules, and protein sequences. Screening systems evaluate thousands of candidates in hours rather than months. Each layer of this stack is improving rapidly. The layer that has not kept pace is the judgment connecting them.
Between the screen and the decision to act on it sits a judgment call. Where should we look for novel candidates? Is this evidence trustworthy? Should we spend budget escalating to a more expensive verifier? Does this new result change what we believe about the competing hypotheses? Today, one scientist runs one campaign at a time. That is the serial constraint.
Two questions drive materials discovery. What is worth exploring? And once we have a candidate, what are we actually capable of building?
AI has compressed the first question. The space of things worth exploring is no longer the bottleneck. The second question is harder. Every candidate must pass through physical reality, and the judgment connecting exploration to reliable commitment is where that question lives.
Models encode broad scientific knowledge, but they lack a training signal for this kind of sequential evaluation. The signal that determines whether a judgment call was correct lives in the experimental process itself, not in any web-scale corpus.
We set out to build that training signal. Verified campaign environments with physics-grounded oracles provide deterministic reward for every search, trust, escalate, and revise decision. The goal is not to replace the scientist but to rethink the workflow.
A critical step in any automated discovery pipeline is the screening campaign. Given competing hypotheses, a limited experimental budget, and access to a generative model that can produce novel candidates on demand, the system must decide where to search, which candidates to test, whether to trust each result, and when the accumulated evidence justifies committing to a particular direction. These decisions are sequential and coupled. A search that finds a disambiguating candidate changes what is available for selection. Choosing candidate A at round 3 determines what evidence is available at round 4. Trusting a misleading result early corrupts every subsequent belief update. Escalating to a more expensive verifier buys accuracy but burns budget that could fund another search or another candidate.
This is the decision layer where experimental throughput meets scientific reliability.
A campaign that converges on the right answer in six rounds, with calibrated confidence, feeds a reliable signal into synthesis, characterization, and scale-up. A campaign that trusts the wrong evidence and commits too early wastes not just the screening budget but every downstream experiment built on that commitment.
When deployed on these campaigns, language models parse protocols correctly and generate plausible reasoning about candidate selection. But when tasked with multi-round judgment under budget pressure and noisy evidence, a consistent failure pattern emerges. They trust cheap evidence indiscriminately, rarely escalate to more expensive verification, and almost never revise a commitment they made early in the campaign.
In the left panel of the figure, hypothesis accuracy improves from 55.2% to 79.3% over 100 gradient steps on 29 held-out environments, with the largest gain between steps 50 and 75. In the right panel, the behavioral structure is clearer. In single-pass rollouts on held-out campaigns, the base model stays near 50% regardless of how much evidence it sees. The RL model is already correct in 93.1% of episodes at round 0 and reaches 100% by round 3.
Same evidence. Different judgment.
The Judgment Bottleneck
The training signal for judgment differs from the signal for knowledge or format compliance. Knowledge transfers through pretraining. Protocol adherence transfers through instruction tuning. But the signal for whether a judgment call was correct requires ground truth, depends on what happened in subsequent rounds, and compounds across the full campaign trajectory.
Verified campaign environments provide this signal. Each environment pairs a candidate pool with a physics-grounded oracle [1], a staged verifier ladder [2] with increasing cost and fidelity, and a budget constraint that forces genuine prioritization. The oracle evaluates every trust, escalate, and revise decision against deterministic ground truth, scoring the full campaign trajectory rather than individual rounds. The model learns not what an expert did, but what would have been correct given what the physics actually showed.
How It Works
The environment-generation pipeline has six components. We build it here for materials discovery, but any domain with a candidate source and a verification oracle can produce campaign environments with deterministic reward.
We call an environment open-world when its candidate pool contains genuinely novel structures that no model has seen during training, so the agent cannot rely on memorized associations between compositions and outcomes.
A candidate source generates novel structures at runtime. We use Crystalite, a 67M-parameter Diffusion Transformer [3] trained on Alex-MP-20, to generate candidates on demand when the agent calls SEARCH. A lean seed pool of three pre-computed "literature" candidates per environment provides a starting point, but the agent must learn when novel candidates from Crystalite are worth the budget. A verification oracle provides deterministic ground truth. We use the Materials Project 2020 convex hull [4], where an energy-above-hull value of zero means the structure is thermodynamically stable.
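To make the oracle's quantity concrete, here is a minimal sketch of energy above hull for a toy binary system. It is an illustration only, not the Materials Project pipeline: the real hull spans many elements, and the names `lower_hull` and `energy_above_hull` and the example entries are hypothetical. Each entry is a pair (fraction of element B, formation energy per atom in eV), with the pure elements at zero formation energy.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (fraction of element B, formation energy in eV/atom)


def _cross(o: Point, a: Point, b: Point) -> float:
    """2D cross product of OA x OB; positive means a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def lower_hull(entries: List[Point]) -> List[Point]:
    """Lower convex hull of (composition, energy) points, left to right."""
    # Keep only the lowest-energy entry at each composition.
    best: Dict[float, float] = {}
    for x, e in entries:
        best[x] = min(e, best.get(x, e))
    hull: List[Point] = []
    for p in sorted(best.items()):
        # Drop the last hull point while it lies on or above the new edge.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull


def energy_above_hull(x: float, energy: float, hull: List[Point]) -> float:
    """Energy above the hull at composition x; 0.0 means on (or below) the hull."""
    for (x0, e0), (x1, e1) in zip(hull, hull[1:]):
        if x0 <= x <= x1:
            hull_energy = e0 + (e1 - e0) * (x - x0) / (x1 - x0)
            # Clamp so that floating-point error never reports a negative value.
            return max(0.0, energy - hull_energy)
    raise ValueError("composition lies outside the hull's range")


# Hypothetical entries: two elemental references and two compounds.
entries = [(0.0, 0.0), (1.0, 0.0), (0.5, -1.0), (0.25, -0.2)]
hull = lower_hull(entries)
```

In this toy system the compound at x = 0.5 sits on the hull, while the one at x = 0.25 lies above the tie-line between the pure element and that compound, so it would count as unstable.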
A staged verifier ladder transforms the oracle signal into cheap, medium, and final evidence stages with increasing cost and fidelity. The cheap stage applies a noisy surrogate. The medium stage uses a more accurate transform. The final stage returns the true energy-above-hull value. This staging creates naturally occurring evidence patterns that make campaigns genuinely challenging. Trap candidates score well on the cheap verifier but collapse at the final stage. Conflicting signals between cheap and medium stages force the agent to reconcile contradictory evidence.
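A staged ladder of this kind can be sketched as follows. This is a sketch under stated assumptions, not the transforms used in our environments: the costs, the noise levels, the Gaussian noise model, and the names `Stage`, `LADDER`, `verify`, and `is_trap` are all hypothetical. The one property it preserves from the description above is that noise is a fixed function of the candidate and the stage, so repeated calls return the same evidence.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    cost: int           # budget units charged per call (assumed value)
    noise_sigma: float  # std. dev. of added noise in eV/atom (assumed value)


# Hypothetical ladder: cost rises as noise falls; the final stage is exact.
LADDER = (
    Stage("cheap", cost=1, noise_sigma=0.08),
    Stage("medium", cost=3, noise_sigma=0.02),
    Stage("final", cost=8, noise_sigma=0.0),
)


def verify(candidate_id: str, true_e_hull: float, stage: Stage, seed: int = 0) -> float:
    """Return this stage's estimate of energy above hull in eV/atom."""
    if stage.noise_sigma == 0.0:
        return true_e_hull
    # Seed the generator from (seed, candidate, stage) so the same query
    # always yields the same answer.
    rng = random.Random(f"{seed}:{candidate_id}:{stage.name}")
    return max(0.0, true_e_hull + rng.gauss(0.0, stage.noise_sigma))


def is_trap(cheap_estimate: float, final_value: float, threshold: float = 0.05) -> bool:
    """A trap looks stable on the cheap stage but fails at the final stage."""
    return cheap_estimate <= threshold and final_value > threshold
```

Under this toy model, a candidate becomes a trap whenever the cheap stage's noise pulls an unstable structure under the stability threshold, which is the pattern that makes indiscriminate trust in cheap evidence costly.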
A budget constraint caps total rounds, forcing genuine prioritization between searching for new candidates and committing to existing ones. A difficulty schedule uses a UCB bandit [5] over environment parameters to discover configurations where the agent's decisions actually matter, concentrating training on campaigns where good and bad judgment lead to different outcomes. A hybrid reward combines seven oracle-grounded components with a gated rubric bonus from an external judge, scoring each full campaign trajectory rather than individual rounds.
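The difficulty schedule can be illustrated with the textbook UCB1 rule. The text above says only that a UCB bandit is used, so the specific variant, the exploration constant, and the idea of treating each environment configuration as one arm are assumptions here, as is the reward: a hypothetical "decision gap" in [0, 1] that is high when good and bad judgment produce different outcomes.

```python
import math
from typing import List


class UCB1:
    """Textbook UCB1 over a fixed set of arms (here: environment configurations)."""

    def __init__(self, n_arms: int, c: float = math.sqrt(2.0)) -> None:
        self.counts: List[int] = [0] * n_arms    # pulls per arm
        self.means: List[float] = [0.0] * n_arms  # running mean reward per arm
        self.c = c                                # exploration strength

    def select(self) -> int:
        # Try every arm once before using confidence bounds.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        total = sum(self.counts)
        scores = [
            mean + self.c * math.sqrt(math.log(total) / n)
            for mean, n in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm: int, reward: float) -> None:
        # Incremental mean update; reward is assumed to lie in [0, 1].
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
```

Each training iteration would call `select()` to pick a configuration, run campaigns in it, and call `update()` with the measured decision gap, so sampling drifts toward configurations where judgment changes the outcome while still revisiting the others occasionally.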
We train through supervised fine-tuning [6] on expert demonstrations followed by multi-turn reinforcement learning across 247 open-world training environments. The oracle evaluates every decision against physics. For the full environment, training, and reward design, see our technical report.
What We Found
On 29 held-out open-world environments with novel Crystalite-generated crystal structures, the RL-trained model, with 3B active parameters [7] and trained on 247 environments for 100 gradient steps, achieves 79.3% hypothesis accuracy (23/29), surpassing GPT-5.4 at 72.4% (21/29). The SFT baseline sits at 55.2% (16/29). Qwen3.5-397B (44.8%) and Opus 4.6 (34.5%) fall below the SFT checkpoint despite substantially larger parameter counts. Frontier models without task-specific training degrade on the longest campaigns, consistent with scientific judgment being a distinct capability that does not scale with model size alone.
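The percentages above follow directly from the episode counts, and a few lines of arithmetic make the size of the gaps explicit. The snippet uses only the counts stated in the text; the variable and function names are illustrative.

```python
HELD_OUT = 29  # held-out environments

# Correct episodes out of 29, as reported above.
correct = {"RL-trained model": 23, "GPT-5.4": 21, "SFT baseline": 16}


def accuracy_pct(k: int, n: int) -> float:
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100.0 * k / n, 1)


table = {name: accuracy_pct(k, HELD_OUT) for name, k in correct.items()}

# Gap between the RL-trained model and GPT-5.4, measured in episodes.
gap_episodes = correct["RL-trained model"] - correct["GPT-5.4"]
```

With 29 environments, one episode is worth about 3.4 percentage points, so the lead over GPT-5.4 corresponds to two episodes and the lead over the SFT baseline to seven.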
In workflow language, hypothesis accuracy means the model identified the right material group after all evidence rounds. Budget efficiency means it did not waste scarce experimental budget on uninformative tests. Contamination means it was misled by a trap candidate. Belief quality means it maintained calibrated confidence under partial and conflicting information. These map directly to the decisions a lab lead reviews after every screening campaign.
The RL model's gains show in decision structure. The base model never uses "suspect" as a validation category. The RL model expanded to a three-way triage (trust/suspect/reject) and reduced total revision rounds from 177 to 6 across the held-out set. Wrong-winner episodes collapsed from 14 out of 29 to 1. These are structural changes in how the model allocates budget and updates belief across a multi-round campaign.
On MADE [8], an independently developed materials-discovery benchmark with a different task format and system decomposition, the RL model shows positive transfer on a full 10-system same-family sweep. Formula recall improves from 0.375 to 0.416, discovery efficiency from 0.543 to 0.576, normalized AUDC from 0.372 to 0.396, and novel stable unique discoveries from 14.7 to 16.1. Structure recall is flat (0.069 vs 0.068). The gain is strongest in search quality and composition-level discovery rather than structure-level retrieval. This supports the narrower claim: judgment learned in verified campaign environments carries to an independently developed benchmark, but the transfer is bounded.
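Expressing the MADE numbers as relative gains shows how bounded the transfer is. The snippet below uses only the before and after values quoted above; structure recall is left out because it is reported as flat. The names are illustrative.

```python
# (before, after) pairs as reported above.
made = {
    "formula recall": (0.375, 0.416),
    "discovery efficiency": (0.543, 0.576),
    "normalized AUDC": (0.372, 0.396),
    "novel stable unique discoveries": (14.7, 16.1),
}


def relative_gain_pct(before: float, after: float) -> float:
    """Relative improvement over the starting value, in percent."""
    return 100.0 * (after - before) / before


gains = {
    metric: round(relative_gain_pct(before, after), 1)
    for metric, (before, after) in made.items()
}
```

The relative gains fall between roughly 6% and 11%, largest for formula recall and smallest for discovery efficiency.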
Closing the Loop
AI has compressed what is worth exploring. The harder question, what are we actually capable of building, requires judgment that no web-scale corpus teaches. Our results suggest this judgment is trainable through verified environments with physics-grounded oracles. The training extends beyond evidence evaluation to the full planning loop: where to search, what to test, how to judge, and when to revise.
A good scientist holds beliefs loosely, tests them against reality, and revises when the evidence demands it. In our experiments, the RL-trained model rejects more, escalates more, and revises faster than the base model or the SFT checkpoint. Whether that pattern holds across domains and scales is an open question. But the direction is encouraging.
We tested this in one domain: thermodynamic stability of crystal structures, using Crystalite-generated candidates and the Materials Project convex hull as the verification oracle. The environment-generation pipeline itself requires only two domain-specific inputs, a candidate source and a verification oracle. The staged verifier ladder, budget mechanics, difficulty scheduling, and reward decomposition are domain-general by construction.
Better judgment produces more informative experiments. More informative experiments produce higher-quality training signal for the next generation of judgment models. That spiral is how autonomous discovery could scale. Framing the full process as one closed-loop RL run remains an open problem, and the system is non-stationary. Verifiers improve, candidate generators retrain, and the science itself shifts.
We think this points to a broader shift in how scientific work gets done. The workflow of inspecting data, building models, generating candidates, and drawing conclusions has historically required a team of domain experts coordinating across each step. That coordination is increasingly compressible. Not because the expertise stops mattering, but because the steps where an agent can operate reliably are expanding. The scientist's role evolves from executing each step to scoping the problem, setting the objective, and recognizing when the system has failed in ways it cannot detect itself. More of the experimental loop moves into compute. More of the human contribution moves into defining what questions are worth answering and what evidence would change our minds.
Citation
@article{barnes2026training,
author = {Barnes, Jarrod},
title = {Training Scientific Judgment with Verified
Environments for Autonomous Science},
journal = {Dynamical Systems},
year = {2026},
url = {https://dynamicalsystems.ai/blog/training-scientific-judgment}
}
If you are running autonomous experiments, building scientific workflows for agents, or researching how AI fits into the discovery loop, we want to hear about it.