
The Missing Layer in Autonomous Science

Authors: Jarrod Barnes

Published: April 2, 2026

Autonomous laboratories are compressing the timeline of scientific discovery. Robotic platforms execute experiments. Generative models design novel materials, molecules, and protein sequences. Screening systems evaluate thousands of candidates in hours rather than months. Each layer of this stack is improving rapidly. The layer that has not kept pace is the judgment connecting them.

Between the screen and the decision to act on it sits a judgment call. Where should we look for novel candidates? Is this evidence trustworthy? Should we spend budget escalating to a more expensive verifier? Does this new result change what we believe about the competing hypotheses? Today, one scientist runs one campaign at a time. That is the serial constraint.

Two questions drive materials discovery. What is worth exploring? And once we have a candidate, what are we actually capable of building?

AI has compressed the first question. The space of things worth exploring is no longer the bottleneck. The second question is harder. Every candidate must pass through physical reality, and the judgment connecting exploration to reliable commitment is where that question lives.

Models encode broad scientific knowledge, but they lack a training signal for this kind of sequential evaluation. The signal that determines whether a judgment call was correct lives in the experimental process itself, not in any web-scale corpus.

We set out to build that training signal. Verified campaign environments with physics-grounded oracles provide deterministic reward for every search, trust, escalate, and revise decision. The goal is not to replace the scientist but to rethink the workflow.

A critical step in any automated discovery pipeline is the screening campaign. Given competing hypotheses, a limited experimental budget, and access to a generative model that can produce novel candidates on demand, the system must decide where to search, which candidates to test, whether to trust each result, and when the accumulated evidence justifies committing to a particular direction. These decisions are sequential and coupled. A search that finds a disambiguating candidate changes what is available for selection. Choosing candidate A at round 3 determines what evidence is available at round 4. Trusting a misleading result early corrupts every subsequent belief update. Escalating to a more expensive verifier buys accuracy but burns budget that could fund another search or another candidate.
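To make the coupling concrete, here is a minimal sketch of a campaign as a budgeted sequence of decisions. The action names mirror the search, test, escalate, and commit choices described above; the `Campaign` class, the `COST` table, and the specific cost values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    SEARCH = "search"      # ask the generative model for a new candidate
    TEST = "test"          # run the cheap verifier on a candidate
    ESCALATE = "escalate"  # re-check a candidate with a costlier verifier
    COMMIT = "commit"      # stop and commit to a hypothesis


# Hypothetical per-action costs in budget units (not from the article).
COST = {Action.SEARCH: 1, Action.TEST: 1, Action.ESCALATE: 3, Action.COMMIT: 0}


@dataclass
class Campaign:
    budget: int
    history: list = field(default_factory=list)
    committed: bool = False

    def step(self, action: Action) -> bool:
        """Apply one decision; return False if it is not allowed."""
        if self.committed or COST[action] > self.budget:
            return False
        self.budget -= COST[action]
        self.history.append(action)
        if action is Action.COMMIT:
            self.committed = True
        return True


campaign = Campaign(budget=6)
plan = [Action.SEARCH, Action.TEST, Action.ESCALATE, Action.TEST, Action.COMMIT]
for action in plan:
    campaign.step(action)
```

In this toy run the single escalation consumes half the budget, which is the trade-off the paragraph describes: accuracy bought by escalating is budget no longer available for another search or test.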

This is the decision layer where experimental throughput meets scientific reliability.

A campaign that converges on the right answer in six rounds, with calibrated confidence, feeds a reliable signal into synthesis, characterization, and scale-up. A campaign that trusts the wrong evidence and commits too early wastes not just the screening budget but every downstream experiment built on that commitment.

When deployed on these campaigns, language models parse protocols correctly and generate plausible reasoning about candidate selection. But when tasked with multi-round judgment under budget pressure and noisy evidence, a consistent failure pattern emerges. They trust cheap evidence indiscriminately, rarely escalate to more expensive verification, and almost never revise a commitment they made early in the campaign.

[Figure 1 panels: (a) Training Dynamics (accuracy, reward); (b) Per-Round Accuracy, comparing RL (Dynamical-30B-A3B) with Base (Qwen3-30B).]

Figure 1. The trainability and behavioral structure of scientific judgment. (a) Trajectory-level RL improves hypothesis accuracy from 55.2% to 79.3% over 100 gradient steps on 29 held-out open-world environments, with consolidation in the final 25 steps. (b) In 29 single-pass mechanistic rollouts, the base model hovers near 50% regardless of investigation depth, while the RL model is correct in 93.1% of episodes at round 0 and reaches 100% by round 3. Later rounds have smaller support counts as shorter-budget campaigns end earlier.

On the left, hypothesis accuracy improves from 55.2% to 79.3% over 100 gradient steps on 29 held-out environments, with the largest gain between steps 50 and 75. On the right, the behavioral structure is clearer. In single-pass rollouts on held-out campaigns, the base model stays near 50% regardless of how much evidence it sees. The RL model is already correct in 93.1% of episodes at round 0 and reaches 100% by round 3.

Same evidence. Different judgment.

The Judgment Bottleneck

The training signal for judgment differs from the signal for knowledge or format compliance. Knowledge transfers through pretraining. Protocol adherence transfers through instruction tuning. But the signal for whether a judgment call was correct requires ground truth, depends on what happened in subsequent rounds, and compounds across the full campaign trajectory.

Verified campaign environments provide this signal. Each environment pairs a candidate pool with a physics-grounded oracle,¹ a staged verifier ladder² with increasing cost and fidelity, and a budget constraint that forces genuine prioritization. The oracle evaluates every trust, escalate, and revise decision against deterministic ground truth, scoring the full campaign trajectory rather than individual rounds. The model learns not what an expert did, but what would have been correct given what the physics actually showed.

¹ A verification function that returns deterministic ground truth for a candidate. In this work, the Materials Project convex hull determines whether a structure is thermodynamically stable. The oracle scores decisions against physics, not against human expert judgment.

² A sequence of verification methods ordered by cost and accuracy. Cheap surrogates screen thousands of candidates. Medium-fidelity simulations narrow the pool. Final-stage evaluation returns ground truth. Each step adds confidence and expense.

How It Works

The environment-generation pipeline has six components. We build it here for materials discovery, but any domain with a candidate source and a verification oracle can produce campaign environments with deterministic reward.

Figure 2. End-to-end pipeline from environment generation through RL training. The center panel requires two domain-specific inputs (candidate source + verification oracle); everything else is domain-general.

We call an environment open-world when its candidate pool contains genuinely novel structures that no model has seen during training, so the agent cannot rely on memorized associations between compositions and outcomes.

A candidate source generates novel structures at runtime. We use Crystalite, a 67M-parameter Diffusion Transformer³ trained on Alex-MP-20, to generate candidates on demand when the agent calls SEARCH. A lean seed pool of three pre-computed "literature" candidates per environment provides a starting point, but the agent must learn when novel candidates from Crystalite are worth the budget. A verification oracle provides deterministic ground truth. We use the Materials Project 2020 convex hull,⁴ where an energy-above-hull value of zero means the structure is thermodynamically stable.

³ A neural network that generates crystal structures by iteratively denoising from random noise. Combines the diffusion process (gradual refinement from noise) with the Transformer architecture (attention-based processing of atom sequences).

⁴ The set of thermodynamically stable compounds in a chemical system. A structure on the hull has no energetically favorable decomposition path. Energy above the hull measures how far a candidate sits from stability.

A staged verifier ladder transforms the oracle signal into cheap, medium, and final evidence stages with increasing cost and fidelity. The cheap stage applies a noisy surrogate. The medium stage uses a more accurate transform. The final stage returns the true energy-above-hull value. This staging creates naturally occurring evidence patterns that make campaigns genuinely challenging. Trap candidates score well on the cheap verifier but collapse at the final stage. Conflicting signals between cheap and medium stages force the agent to reconcile contradictory evidence.
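A rough sketch of how such a ladder can produce trap candidates, under stated assumptions: the article does not specify the cheap and medium transforms, so this example models them as Gaussian noise around the true energy above hull. The `STAGES` costs, the noise levels, the clipping at zero, and the helper names `verify` and `looks_stable` are all hypothetical.

```python
import random

# Hypothetical stage settings (not from the article):
# name -> (cost in budget units, Gaussian noise std in eV/atom).
STAGES = {"cheap": (1, 0.10), "medium": (2, 0.03), "final": (4, 0.0)}


def verify(true_e_hull: float, stage: str, rng: random.Random) -> float:
    """Return the stage's reading of energy above hull, in eV/atom."""
    _, sigma = STAGES[stage]
    if sigma == 0.0:
        return true_e_hull  # final stage: the oracle's ground truth
    # Noisy surrogate. Clipping at zero is a simplification of this
    # sketch that treats zero as the "stable" floor.
    return max(true_e_hull + rng.gauss(0.0, sigma), 0.0)


def looks_stable(reading: float) -> bool:
    """A reading of zero is interpreted as thermodynamically stable."""
    return reading <= 0.0


rng = random.Random(0)
true_value = 0.05  # an unstable candidate, 50 meV/atom above the hull
cheap_reads = [verify(true_value, "cheap", rng) for _ in range(1000)]
# Fraction of cheap readings that wrongly look stable (trap behavior).
trap_rate = sum(looks_stable(r) for r in cheap_reads) / len(cheap_reads)
final_read = verify(true_value, "final", rng)
```

With these illustrative noise levels, a sizable share of cheap readings flag the unstable candidate as stable, while the final stage always returns the true value. An agent that trusts the cheap stage without escalating would be misled in exactly those cases.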

A budget constraint caps total rounds, forcing genuine prioritization between searching for new candidates and committing to existing ones. A difficulty schedule uses a UCB bandit⁵ over environment parameters to discover configurations where the agent's decisions actually matter, concentrating training on campaigns where good and bad judgment lead to different outcomes. A hybrid reward combines seven oracle-grounded components with a gated rubric bonus from an external judge, scoring each full campaign trajectory rather than individual rounds.

⁵ Upper Confidence Bound, a strategy for balancing exploration and exploitation. Here it selects environment configurations where the agent's decisions have the most room to improve, concentrating training on difficult campaigns.
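For readers unfamiliar with UCB scheduling, the following is a minimal sketch of the standard UCB1 rule applied to a small set of environment configurations. The article does not say which UCB variant or reward signal the difficulty schedule uses, so the `UCB1` class, the exploration constant, and the per-configuration `payoffs` (standing in for "how much the agent's decisions matter") are assumptions.

```python
import math


class UCB1:
    """Standard UCB1 bandit over a fixed number of arms."""

    def __init__(self, n_arms: int, c: float = math.sqrt(2)):
        self.counts = [0] * n_arms   # pulls per arm
        self.means = [0.0] * n_arms  # running mean reward per arm
        self.c = c                   # exploration constant

    def select(self) -> int:
        # Play every arm once before using the confidence bound.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        total = sum(self.counts)
        scores = [
            mean + self.c * math.sqrt(math.log(total) / n)
            for mean, n in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


# Hypothetical signal per configuration: higher means good and bad
# judgment lead to more different outcomes in that configuration.
payoffs = [0.1, 0.8, 0.3]
bandit = UCB1(n_arms=len(payoffs))
for _ in range(200):
    arm = bandit.select()
    bandit.update(arm, payoffs[arm])
most_used = max(range(len(payoffs)), key=bandit.counts.__getitem__)
```

The bandit keeps sampling every configuration occasionally but spends most of its pulls on the one with the highest signal, which is the concentration effect the schedule is meant to achieve.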

We train through supervised fine-tuning⁶ on expert demonstrations followed by multi-turn reinforcement learning across 247 open-world training environments. The oracle evaluates every decision against physics. To read more on the full environment, training, and reward design, refer to our full technical report.

⁶ Training a model to imitate expert demonstrations before reinforcement learning. The SFT checkpoint provides a behavioral starting point; RL then optimizes for outcomes the demonstrations did not explicitly teach.

What We Found

On 29 held-out open-world environments with novel Crystalite-generated crystal structures, the RL-trained model achieves 79.3% hypothesis accuracy (23/29), surpassing GPT-5.4 at 72.4% (21/29). It does so with 3B active parameters,⁷ trained on 247 environments for 100 gradient steps. The SFT baseline sits at 55.2% (16/29). Qwen3.5-397B (44.8%) and Opus 4.6 (34.5%) fall below the SFT checkpoint despite substantially larger parameter counts. Frontier models without task-specific training degrade on the longest campaigns, consistent with scientific judgment being a distinct capability that does not scale with model size alone.

⁷ In a mixture-of-experts model, only a subset of parameters activate per input. 3B active parameters means the model uses 3 billion parameters per forward pass, though the total parameter count is higher.
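As a quick arithmetic check, the reported percentages follow directly from the episode counts over 29 environments. The snippet below uses only the counts stated above; the dictionary keys are informal labels, not official model identifiers.

```python
HELD_OUT = 29  # number of held-out open-world environments

# Correct-hypothesis counts as reported in the text.
correct = {"RL": 23, "GPT-5.4": 21, "SFT": 16}

# Accuracy in percent, rounded to one decimal place.
accuracy = {name: round(100 * n / HELD_OUT, 1) for name, n in correct.items()}

# Gap between the RL model and GPT-5.4, in environments.
gap_envs = correct["RL"] - correct["GPT-5.4"]

for name, pct in accuracy.items():
    print(f"{name}: {correct[name]}/{HELD_OUT} = {pct}%")
```

The gap between the RL model and GPT-5.4 corresponds to two environments out of 29, which is worth keeping in mind when reading the percentage difference.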

In workflow language, hypothesis accuracy means the model identified the right material group after all evidence rounds. Budget efficiency means it did not waste scarce experimental budget on uninformative tests. Contamination means it was misled by a trap candidate. Belief quality means it maintained calibrated confidence under partial and conflicting information. These map directly to the decisions a lab lead reviews after every screening campaign.

The RL model's gains show in decision structure. The base model never uses "suspect" as a validation category. The RL model expands to a three-way triage (trust/suspect/reject) and reduces total revision rounds from 177 to 6 across the held-out set. Wrong-winner episodes fall from 14 out of 29 to 1. These are structural changes in how the model allocates budget and updates belief across a multi-round campaign.

On MADE,⁸ an independently developed materials-discovery benchmark with a different task format and system decomposition, the RL model shows positive transfer on a full 10-system same-family sweep. Formula recall improves from 0.375 to 0.416, discovery efficiency from 0.543 to 0.576, normalized AUDC from 0.372 to 0.396, and novel stable unique discoveries from 14.7 to 16.1. Structure recall is flat (0.069 vs 0.068). The gain is strongest in search quality and composition-level discovery rather than structure-level retrieval. This supports the narrower claim: judgment learned in verified campaign environments carries to an independently developed benchmark, but the transfer is bounded.

⁸ MAterials Discovery Environments (Malik et al., 2026). An independent benchmark for closed-loop materials discovery that evaluates modular workflows built from interchangeable planners, generators, and filters under a constrained oracle budget.
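To put the MADE numbers on a common scale, the relative improvements can be computed from the before and after values quoted above. This is only a restatement of the reported figures; the metric keys are informal labels, and structure recall is left out because the text reports it as flat.

```python
# (before, after) values as reported in the text.
made = {
    "formula_recall": (0.375, 0.416),
    "discovery_efficiency": (0.543, 0.576),
    "normalized_audc": (0.372, 0.396),
    "novel_stable_unique": (14.7, 16.1),
}

# Relative gain in percent, rounded to one decimal place.
relative_gain = {
    name: round(100 * (after / before - 1), 1)
    for name, (before, after) in made.items()
}

for name, pct in relative_gain.items():
    print(f"{name}: +{pct}%")
```

All four gains land in the range of roughly 6% to 11%, consistent with the text's framing of the transfer as positive but bounded.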

Closing the Loop

AI has compressed what is worth exploring. The harder question, what are we actually capable of building, requires judgment that no web-scale corpus teaches. Our results suggest this judgment is trainable through verified environments with physics-grounded oracles. The training extends beyond evidence evaluation to the full planning loop: where to search, what to test, how to judge, and when to revise.

A good scientist holds beliefs loosely, tests them against reality, and revises when the evidence demands it. In our experiments, the RL-trained model rejects more, escalates more, and revises faster than the base model or the SFT checkpoint. Whether that pattern holds across domains and scales is an open question. But the direction is encouraging.

We tested this in one domain: thermodynamic stability of crystal structures, using Crystalite-generated candidates and the Materials Project convex hull as the verification oracle. The environment-generation pipeline itself requires only two domain-specific inputs: a candidate source and a verification oracle. The staged verifier ladder, budget mechanics, difficulty scheduling, and reward decomposition are domain-general by construction.

Better judgment produces more informative experiments. More informative experiments produce higher-quality training signal for the next generation of judgment models. That spiral is how autonomous discovery could scale. Framing the full process as one closed-loop RL run remains an open problem, and the system is non-stationary. Verifiers improve, candidate generators retrain, and the science itself shifts.

We think this points to a broader shift in how scientific work gets done. The workflow of inspecting data, building models, generating candidates, and drawing conclusions has historically required a team of domain experts coordinating across each step. That coordination is increasingly compressible. Not because the expertise stops mattering, but because the steps where an agent can operate reliably are expanding. The scientist's role evolves from executing each step to scoping the problem, setting the objective, and recognizing when the system has failed in ways it cannot detect itself. More of the experimental loop moves into compute. More of the human contribution moves into defining what questions are worth answering and what evidence would change our minds.

Citation

@article{barnes2026training,
  author  = {Barnes, Jarrod},
  title   = {Training Scientific Judgment with Verified
             Environments for Autonomous Science},
  journal = {Dynamical Systems},
  year    = {2026},
  url     = {https://dynamicalsystems.ai/blog/training-scientific-judgment}
}

If you are running autonomous experiments, building scientific workflows for agents, or researching how AI fits into the discovery loop, we want to hear about it.