
Can a Self-Driving-Lab Agent Tell When the Evidence Is Enough?

Evaluating self-driving-lab agents over historical experimental data

Published June 19, 2026

Software abundance is driving an exponential hardware buildout whose energy, cooling, magnets, coatings, and defense and aerospace materials cannot scale on software velocity. With candidate generation already commoditized by foundation models, generative materials models, machine-learned interatomic potentials (MLIPs), and density functional theory (DFT) automation, the bottleneck has moved downstream to qualification-grade evidence: the proof that a material survives physical reality, can be reproduced, can be trusted, and can enter a high-consequence system. Raising the throughput of that evidence starts with measurement, with knowing where the judgment that drives those decisions holds and where it breaks before a lab is trusted to run on its own. For mission-critical alloys, coatings, and rare-earth-constrained parts, the cost of weak judgment is paid in repeated tests, scrapped specimens, and expert review time spent re-deciding calls that were already made once.

Self-driving laboratories close part of this gap by executing experiments from code, faster and around the clock. Execution speed is real, but it is not the constraint that decides whether their output can be trusted. What decides trust is judgment, the ability to tell when the evidence in hand is sufficient to act on, defective enough to reject, incomplete enough to investigate further, or uncertain enough to stop. Before an agent is allowed to drive a qualification program, it should be qualified itself. Before autonomy, agent qualification.

We build the layer that turns qualification evidence into a machine-readable, replayable record a model can learn from, so the path from discovery to qualified capability compresses instead of repeating itself. This report is the first public proof surface for that idea. We compile historical materials workflows from AM Bench into source-located replay environments and measure how six local and frontier models behave at the evidence boundary. None of them clears it. They over-submit packets over visible defects, miss source defects, request redundant evidence, and stop at the wrong time, and these are evidence and value judgments rather than formatting or retrieval failures. We think the missing object is a trainable estimate of the value and sufficiency of evidence over the state of an investigation, and the rest of this report is the evidence for that claim.

Experiments Become Code. Experience Has Not.

Self-driving labs are getting very good at turning intent into executed experiments, and the reasons are good ones. An experiment written as code can be versioned, checked before it runs, and replayed across sites, which is why experiment-as-code frameworks, declarative laboratory stacks, and national programs like the National Science Foundation's network of AI-programmable cloud laboratories and the Department of Energy's Genesis Mission¹ show that public investment is explicitly aimed at programmable, AI-enabled experimental execution. The remaining engineering is hard, but the shape of the problem is understood and the direction is settled.

¹ NSF describes its Programmable Cloud Laboratories Test Bed as distributed autonomous laboratory facilities whose nodes can be remotely accessed to run custom user-programmed workflows, including self-driving autonomous experiment workflows. DOE describes the Genesis Mission as an integrated platform connecting supercomputers, experimental facilities, AI systems, and unique datasets.

Turning execution history back into reusable decision signal has no equivalent stack. When a campaign ends, the history that produced its result, the record of which evidence was trusted, which test was skipped, which source was caught as defective and which was accepted, usually survives only as a PDF, a folder of instrument files, a spreadsheet, a lab notebook, or a one-off report. Those decisions are exactly the signal an agent would need to make the same calls the next time it faces them, and almost none of that signal survives in a form a model can learn from.

The cost of leaving that signal unbuilt is not paid in execution speed. A laboratory that runs an order of magnitude faster than a human also reaches wrong conclusions an order of magnitude faster, and the rate that actually decides whether such a laboratory can be trusted is its qualification throughput, the rate at which evidence-based go and no-go decisions can be made and then stood behind. Faster execution raises both the value of good judgment and the cost of bad judgment, which is why the judgment layer, rather than the actuation layer, is where trust in an autonomous laboratory is won or lost.

The larger goal is to build the experience layer for physical science, turning the accumulated history of experiments into source-located evidence, deterministic verifiers, and value signals, so that physical work compounds instead of rediscovering the same failures. The benchmark in this report is the first narrow instance of it, and it starts where it has to start, with a historical workflow that already happened.

AM Bench as Historical Scientific Experience

We started with three DOI-backed NIST AM Bench records,² not generic materials examples. They cover IN718 tensile testing, Ti-6Al-4V fatigue, and IN718 laser-track process calibration. Each one is a real sequence of process records, characterization results, calibration artifacts, and source documents, with measured outcomes attached on the verifier side and never exposed to the policy.

² Primary sources: the NIST AM Bench program page, the AM Bench 2025 challenge descriptions, the direct AM Bench data links, and the NIST Public Data Repository DOI records for AMB2025-02, AMB2025-03, and AMB2025-06/07. AM Bench is a NIST-led benchmark series for controlled additive-manufacturing measurements and blind challenge problems; these three records are the source workflows used here, not generic examples.

NIST source record	What it contains	Where it appears in replay
AMB2025-02, PBF-L IN718 tensile	Challenge statement, specimen geometry, CAD/FEA calibration bundle, calibration manifest, process lineage, submission template, answer workbook, and raw tensile traces.	`specimen_geometry`, `cad_fea_summary`, `calibration_manifest`, process records; answer workbook and raw tensile traces are verifier-only or audit-only.
AMB2025-03, PBF-L Ti-6Al-4V rotating-bending fatigue	Build and powder details, heat treatment, specimen preparation, chemistry, surface/XRD measurements, tensile/fatigue calibration, microstructure, XCT defect, fractography, and answer data.	Fatigue process records plus `fatigue_build_powder_surface`, `fatigue_heat_treatment_surface`, `fatigue_tensile_calibration_surface`, `fatigue_800hip_calibration_surface`, `fatigue_microstructure_surface`, and `fatigue_xct_defect_surface`; answer data is verifier-only.
AMB2025-06/07, IN718 laser-track pads	Plate material, powder feedstock, spread-layer images, scan strategy, melt-pool cross sections, submission templates, and answer keys.	Laser-pad process records plus `plate_material_surface`, `powder_feedstock_surface`, `scan_strategy_surface`, `spread_layer_surface`, and `melt_pool_cross_section_surface`; answer keys are verifier-only.

We compiled these into a source-located evidence graph. Every piece of evidence an agent can request points back to a specific realized record in the historical workflow. The contract is realized-only.³ If a measurement was not taken, it does not exist in the environment, and asking for it returns unavailable rather than a fabricated value.

³ The environment never invents an outcome. If a measurement was not actually produced in the historical workflow, no agent and no policy can conjure it. Requests for evidence that does not exist in the archive return unavailable.

That constraint is the point. It keeps the benchmark honest about what the historical record can and cannot support, and it keeps the agent inside the same visible-evidence boundary the historical decision faced. Realized-only is an anti-synthetic-outcome contract, not a claim of representativeness. A realized additive-manufacturing record can still be nonrepresentative, locally heterogeneous, or misregistered, and the replay does not test for that.
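The realized-only contract can be illustrated with a minimal sketch. This is not the compiler's actual code: the class names, the `source_doi` field, the placeholder DOI string, and the `creep_curve` ref are all hypothetical, chosen only to show a lookup that returns unavailable instead of inventing a value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvidenceItem:
    ref: str          # evidence identifier, e.g. "specimen_geometry"
    source_doi: str   # hypothetical field: pointer back to the archived record
    payload: str      # archived content revealed on request


@dataclass(frozen=True)
class Reveal:
    ref: str
    available: bool
    item: Optional[EvidenceItem] = None


class RealizedArchive:
    """Serves only evidence that exists in the historical workflow."""

    def __init__(self, items):
        self._items = {item.ref: item for item in items}

    def request(self, ref: str) -> Reveal:
        item = self._items.get(ref)
        if item is None:
            # Never fabricate: a measurement that was not taken is unavailable.
            return Reveal(ref=ref, available=False)
        return Reveal(ref=ref, available=True, item=item)


archive = RealizedArchive([
    EvidenceItem("specimen_geometry", "doi:placeholder", "archived geometry record"),
])
realized = archive.request("specimen_geometry")
missing = archive.request("creep_curve")  # hypothetical ref, not in the archive
print(realized.available, missing.available)
```

The design choice worth noting is that the unavailable case is an explicit return value rather than an exception or a default payload, so a policy can observe that a record is absent without the environment ever producing a synthetic outcome.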

This is a deliberate change from how we built environments before. In earlier work we trained scientific judgment in verified campaign environments whose candidates were generated by a crystal diffusion model and scored against a physics oracle. Here there are no generated candidates and no oracle we built. The inputs are a real, published experimental record, and the only authority is that record and a deterministic verifier. Nothing in the compiler is specific to these three workflows, so any archive that carries process records, characterization, calibration, and outcomes can be compiled the same way.

Before describing the compiler that produces these environments, here is what the six models did on the package.

104 replay tasks, six models, one frozen package

65 / 104 strongest valid final decision (Gemini 3.1 Pro)

0 models that solved the benchmark

All 6 models over-submit a packet over a visible defect

Six local and frontier models on the same 104 source-located replay tasks. The strongest made a valid final decision on 65, no model solved the benchmark, and over-submission, advancing a packet over a visible defect, shows up in every one.

The Scientific Experience Compiler

The compiler builds the experience layer. It takes the residue of physical execution and turns it into the objects an agent can be evaluated on and eventually trained on. It ingests process records, instrument traces, calibration artifacts, source documents, and outcomes, resolves them into source-located evidence objects, builds realized-only replay tasks, gates them through verifier and leakage checks, runs policy models through the environment, and returns behavior traces, value and sufficiency signals, and failure cases for later training.

The design rule is one line. The compiler is deterministic where authority matters and model-assisted where interpretation creates value. That rule is what separates this work from naive ingestion on one side and unverifiable synthetic data generation on the other.

[Figure: The Scientific Experience Compiler, archive in, trainable experience out.]

Input, raw archive: process logs, instrument traces, calibration sheets, SEM / fractography, spreadsheets, PDFs, outcomes.

Stages: source location (deterministic; provenance, hashes, bounds), annotation (model-assisted; reframe pinned decision, pick refs), review (gate; calibrated judge + deterministic filters; authority stays with the archive and verifier), replay compiler (deterministic; realized-only tasks), verifier registry (deterministic; numeric + decision scoring).

Output, trainable experience: replay tasks, evidence graph, trace gallery, behavior + value signals, failure cases for later training.

Annotation is where historical data becomes scientific experience. The raw archive tells us what happened, but it does not always tell us what decision was being made, which evidence was binding, what defect mattered, or what behavior a future agent should be tested on. We pin each decision to a realized record first, then use a model to reframe it as a task and to select, from the supplied options, the evidence an agent is allowed to see. The model writes the framing and the decision context. It cannot create outcomes, gold labels, verifier answers, the correct terminal action, or any policy-visible shortcut.

Every candidate the model proposes passes two gates before it becomes a task. Deterministic filters reject any candidate that selects a verifier-only field, an evidence ref that does not exist, a forbidden answer artifact, or a verifier that does not match the episode. A live judge, itself calibrated against fixed accept and reject cases before it is trusted, then scores the survivors against a six-criterion rubric for grounding and framing quality. A candidate is admitted only when the deterministic filters pass and the judge accepts, and the judge never sets the reference answer or the terminal truth.
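A minimal sketch of the two-gate admission rule described above, under stated assumptions. The `Candidate` fields, the `admit` signature, and the example identifiers are hypothetical, and the judge is reduced to a plain callable standing in for the calibrated rubric scorer, whose internals the report does not describe.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    selected_fields: set       # fields the annotation model chose to expose
    evidence_refs: set         # evidence refs the task points at
    artifacts: set             # artifacts attached to the candidate
    verifier_id: str           # verifier the candidate names
    episode_verifier_id: str   # verifier the episode actually uses


def deterministic_filters(c, verifier_only, known_refs, forbidden_artifacts):
    """Return the list of rejection reasons; an empty list means the filters pass."""
    reasons = []
    if c.selected_fields & verifier_only:
        reasons.append("selects a verifier-only field")
    if c.evidence_refs - known_refs:
        reasons.append("references evidence that does not exist")
    if c.artifacts & forbidden_artifacts:
        reasons.append("uses a forbidden answer artifact")
    if c.verifier_id != c.episode_verifier_id:
        reasons.append("verifier does not match the episode")
    return reasons


def admit(c, judge_accepts, **filter_args):
    # Both gates must pass; the judge cannot override a filter rejection.
    return not deterministic_filters(c, **filter_args) and judge_accepts(c)


filter_args = dict(
    verifier_only={"answer_workbook"},            # hypothetical identifiers
    known_refs={"specimen_geometry", "cad_fea_summary", "calibration_manifest"},
    forbidden_artifacts={"raw_tensile_traces"},
)
clean = Candidate({"specimen_geometry"}, {"specimen_geometry"}, set(), "v1", "v1")
leaky = Candidate({"answer_workbook"}, {"specimen_geometry"}, set(), "v1", "v1")


def always_accept(candidate):
    # Stand-in judge; the real judge scores a six-criterion rubric.
    return True


print(admit(clean, always_accept, **filter_args))
print(admit(leaky, always_accept, **filter_args))
```

Even with a judge that accepts everything, the leaky candidate is rejected, which is the property the text relies on: the model-assisted gate can only narrow what the deterministic gate already allows.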

Authority stays with the deterministic verifiers and the archive, and that authority covers performance, not only procedure. The same registry scores numeric property predictions (an agent's estimate of yield strength, tensile strength, modulus, and strain) against held-out measured outcomes, alongside the evidence-boundary decisions this report focuses on, though we report no property-prediction results here. No language model holds ground-truth authority over any task. The benchmark version is frozen for this report rather than released. Every model is run once across the package, the local models at temperature zero, so the panels that follow describe a single pass rather than a seed distribution.

What counts as success runs in a fixed order. The point of the compiler is to help a team reach a better and more defensible decision about a material or process, and to reach it on less evidence and in less time than the current workflow. Performance is the primary axis. A faster decision that is wrong is a regression, so a reduction in cost or time only counts once the decision itself holds. The evidence-boundary behavior the rest of this report measures is the prerequisite for that decision, the discipline an agent needs before its speed is worth trusting, not the performance win by itself.

Retrospective Value-of-Evidence Replay

The compiled package is 104 retrospective value-of-evidence tasks across the three workflows, with 498 archived evidence items behind them. Each task replays a single decision moment from the historical workflow and asks the agent to act using only what is visible.

Every task has the same anatomy.

Policy-visible state: menu metadata only. No labels, no answer keys.

Realized action menu: seven actions. Request, flag, localize, submit, stop.

Archived evidence reveal: only evidence the workflow actually produced.

Deterministic verifier: scores the terminal action against the realized record.

Terminal decision: submit, flag a defect, localize, or escalate.

Every task strips labels and answer keys and scores only the terminal action against the realized record, so it measures evidence-boundary judgment rather than retrieval.

The agent sees a policy-visible state that carries menu metadata and nothing else. We strip gold labels, variant labels, answer keys, verifier-only values, split tags, and any field describing whether a piece of evidence is useful, before the state ever reaches the model. It chooses from a fixed menu of seven actions, three requests (request_characterization, request_calibration_artifact, request_process_record) and four terminal moves (flag_source_defect, localize_missing_evidence, submit_risk_packet, abstain_or_escalate). When it requests evidence, the archived result is revealed. When it stops, a deterministic verifier scores the terminal action against the realized record.
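The action menu and the stripping step can be sketched as follows. The seven action names come from the text; the key names in `HIDDEN_KEYS` and the example state are hypothetical, since the report lists the categories of stripped fields but not their field names.

```python
from enum import Enum


class Action(Enum):
    # Three evidence requests.
    REQUEST_CHARACTERIZATION = "request_characterization"
    REQUEST_CALIBRATION_ARTIFACT = "request_calibration_artifact"
    REQUEST_PROCESS_RECORD = "request_process_record"
    # Four terminal moves.
    FLAG_SOURCE_DEFECT = "flag_source_defect"
    LOCALIZE_MISSING_EVIDENCE = "localize_missing_evidence"
    SUBMIT_RISK_PACKET = "submit_risk_packet"
    ABSTAIN_OR_ESCALATE = "abstain_or_escalate"


TERMINAL = {
    Action.FLAG_SOURCE_DEFECT,
    Action.LOCALIZE_MISSING_EVIDENCE,
    Action.SUBMIT_RISK_PACKET,
    Action.ABSTAIN_OR_ESCALATE,
}

# Hypothetical key names for the fields removed before the model sees the state.
HIDDEN_KEYS = {
    "gold_label", "variant_label", "answer_key",
    "verifier_only", "split", "is_useful",
}


def policy_visible(state: dict) -> dict:
    """Drop every hidden field; what remains is menu metadata."""
    return {k: v for k, v in state.items() if k not in HIDDEN_KEYS}


raw_state = {
    "menu": ["specimen_geometry", "cad_fea_summary"],
    "gold_label": "flag_source_defect",
    "is_useful": {"specimen_geometry": True},
}
visible = policy_visible(raw_state)
print(visible)
print(Action("submit_risk_packet") in TERMINAL)
```

A deny-list is used here only because it mirrors the wording of the text; an allow-list of permitted keys would be the more conservative variant if new fields can appear in the state.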

The 104 tasks span five decision moments, each one abstracting a call a materials reviewer makes.

Decision moment	Tasks	What it tests
Archived evidence selection	23	Which realized evidence to inspect next
Calibration artifact sufficiency	23	Whether a provenance package supports the decision
Source trustworthiness	20	Whether a source or process record is trustworthy
Packet escalation	29	Whether to submit, flag, localize, or escalate a risk packet
Missing evidence localization	9	Which specific binding record is absent

The scientific decision boundary is the point where the available evidence is sufficient to act, defective enough to reject, incomplete enough to inspect further, or uncertain enough to stop. Every task in the package sits on that boundary by construction.

The metric that matters is terminal validity. A terminal-valid decision is a final action accepted by the verifier given the evidence the agent actually saw. It is not answer-key correctness in the usual sense. It checks whether the agent stopped in the right way, by submitting only when the evidence supports it, flagging a source defect when one is visible, localizing the missing binding record instead of requesting everything, or escalating when the evidence does not resolve. An agent can read the right document and still pick the wrong terminal action, and terminal validity is what catches that. Terminal validity is a verifier proxy for evidence-boundary discipline. It is not a measure of physical qualification, and a terminal-valid run says nothing about whether the underlying process, structure, and property chain is sound. A flagged source defect here is a provenance or checkability defect, a broken source reference or an unverifiable record, not a metallurgical or structure-property judgment, and a submitted risk packet is a replay object rather than a qualification packet in the materials-release sense.

Baselines and the Model Panel

We first check that the environment is hard for the right reasons. Static baselines confirm it. The deterministic reference, which follows the verifier-grounded policy, is valid on every task. Mechanical strategies are not.

Baseline policy	Terminal-valid rate	Valid decisions
Deterministic reference	1.000	104 / 104
Random request	0.308	32 / 104
Request all, then submit	0.173	18 / 104
Always submit	0.173	18 / 104
Always abstain or escalate	0.144	15 / 104

Blanket submission and blanket refusal both score near the floor, and requesting everything before submitting does no better than always submitting. The environment rewards getting the evidence boundary right, which is exactly what a fixed posture cannot do.

Then we ran six models over the same 104-task package, one trajectory per task, with parse, invalid-action, off-menu, leakage, and forbidden-term gates passing on every run, recorded in each run's live-validator and task-quality-audit summaries.

Model	Valid / 104	Useful precision	Evidence regret	Over-submit	Duplicate	Horizon
Gemini 3.1 Pro (frontier)	65 / 104	0.701	1.625	13	0	0
Qwen 3.6 35B (local)	48 / 104	0.735	1.615	29	14	3
Claude Sonnet 4.6 (frontier)	47 / 104	0.565	1.625	10	0	0
GPT-5.5 (frontier)	44 / 104	0.588	1.798	13	0	0
Gemma 4 26B (local)	42 / 104	0.662	1.721	25	43	10
Claude Opus 4.8 (frontier)	40 / 104	0.576	1.635	18	0	0

The table is a taxonomy of model behavior, and the ranking is the least interesting thing in it.⁴ The strongest model, Gemini 3.1 Pro, made a valid final decision on 65 of 104 tasks. No model solved the benchmark, and the failures are legible and different. Qwen has the highest useful-request precision of the six, and it pays for that precision with 29 over-submits, 14 duplicate loops, and 3 horizon exhaustions. The frontier models are procedurally cleaner, with no duplicates and no horizon failures, and they still miss the decision boundary more often than they hit it. Opus 4.8 follows the action protocol cleanly and still lands lowest on terminal validity.

⁴ Useful precision is the fraction of an agent's evidence requests the reference considered useful, so higher is better. Evidence regret penalizes stopping with the wrong evidence set, so lower is better. Over-submit, duplicate, and horizon are counts of advancing a packet over a defect, repeating a request, and running out of turns.
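To make the request-side metrics concrete, here is a small sketch of how useful precision and duplicate counts could be computed from one trajectory. It follows the prose definitions only; the function name is hypothetical, the report does not say whether a repeated request counts toward precision, and this sketch assumes it does not. Evidence regret is left out because its formula is not given.

```python
def request_metrics(requests, useful_refs):
    """Compute request-side metrics for one trajectory.

    requests: evidence refs in the order they were requested.
    useful_refs: refs the reference policy considered useful.
    Assumption: a repeated request is a duplicate and never counts as useful.
    """
    seen = set()
    duplicates = 0
    useful = 0
    for ref in requests:
        if ref in seen:
            duplicates += 1
        elif ref in useful_refs:
            useful += 1
        seen.add(ref)
    # Precision is undefined when the agent made no requests at all.
    precision = useful / len(requests) if requests else None
    return {"useful_precision": precision, "duplicates": duplicates}


metrics = request_metrics(
    ["calibration_manifest", "cad_fea_summary", "calibration_manifest"],
    useful_refs={"calibration_manifest"},
)
print(metrics)
```

In this example one of three requests is useful and one is a repeat, so precision is one third with a single duplicate.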

Reading the panel by decision moment shows where the difficulty lives. The three groups are the local pair (Qwen, Gemma), the first frontier pair (Sonnet 4.6, GPT-5.5), and the Gemini and Opus pair, each averaged over its two models.

Decision moment	Tasks	Local	Frontier A	Frontier B
Missing evidence localization	9	1.000	1.000	0.944
Source trustworthiness	20	0.825	0.625	0.625
Archived evidence selection	23	0.239	0.391	0.478
Calibration artifact sufficiency	23	0.326	0.304	0.435
Packet escalation	29	0.224	0.276	0.362

Missing-evidence localization is the control slice, solved outright by two groups and nearly so by the third (0.944). Packet escalation is the hard customer-relevant slice, and no group scores above 0.362 on it. Packet escalation is the replay analogue of a high-consequence packet stop, the moment where an agent should refuse to advance a risk packet with an unresolved defect, and it is exactly where current agents are weakest.
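The per-moment rates can be reconciled against the per-model table as a consistency check. The sketch below weights each group's rates by task count and compares the result with the mean of the two models' valid-decision counts; all numbers are copied from the two tables, and the variable and function names are illustrative.

```python
# Task counts per decision moment, in the row order of the per-moment table.
tasks = [9, 20, 23, 23, 29]

# Per-moment terminal-valid rates for each group, same row order.
rates = {
    "Local":      [1.000, 0.825, 0.239, 0.326, 0.224],
    "Frontier A": [1.000, 0.625, 0.391, 0.304, 0.276],
    "Frontier B": [0.944, 0.625, 0.478, 0.435, 0.362],
}

# Valid decisions out of 104 for the two models in each group.
valid = {
    "Local":      (48, 42),   # Qwen, Gemma
    "Frontier A": (47, 44),   # Sonnet 4.6, GPT-5.5
    "Frontier B": (65, 40),   # Gemini 3.1 Pro, Opus 4.8
}


def implied_valid(group):
    """Valid decisions implied by the per-moment rates."""
    return sum(n * r for n, r in zip(tasks, rates[group]))


def reported_valid(group):
    """Mean valid decisions across the group's two models."""
    return sum(valid[group]) / len(valid[group])


for group in rates:
    gap = abs(implied_valid(group) - reported_valid(group))
    print(f"{group}: implied {implied_valid(group):.2f}, "
          f"reported {reported_valid(group):.1f}, gap {gap:.3f}")
```

Each group's implied count agrees with the reported mean to within the rounding of the three-decimal rates, so the two tables describe the same runs.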

What the Traces Show

Aggregate metrics hide the behavior that produces them, and the raw trajectories make it legible. These six are drawn from the frontier run, each annotated with the model, the workflow, and the verifier's reference action.

Archived evidence selection, terminal valid
GPT-5.5, IN718 tensile, #665946c6
Actions: request_characterization (specimen_geometry) → flag_source_defect (specimen_geometry)
Reference: flag_source_defect
Inspects the binding item, then stops on the visible source defect. The clean case.

Packet escalation, terminal valid
Sonnet 4.6, IN718 tensile, #618fc61b
Actions: request_characterization (calibration_manifest) → request_characterization (cad_fea_summary) → abstain_or_escalate
Reference: abstain_or_escalate
Gathers, finds the packet unsupported, and escalates instead of advancing it. The hard slice done right.

Calibration artifact sufficiency, terminal invalid
GPT-5.5, IN718 tensile, #e742f0c6
Actions: request_calibration_artifact (calibration_manifest) → request_calibration_artifact (specimen_geometry) → request_calibration_artifact (cad_fea_summary) → submit_risk_packet
Reference: flag_source_defect
Requests three relevant artifacts, then submits over a defect it should have flagged. Over-submit, the dangerous failure.

Archived evidence selection, terminal invalid
GPT-5.5, IN718 tensile, #682eebae
Actions: request_characterization (calibration_manifest) → localize_missing_evidence (elastic_fit_policy)
Reference: submit_risk_packet
The evidence supported a submit. The model localized a missing item that was not the binding gap. Right tools, wrong stop.

Calibration artifact sufficiency, terminal invalid
GPT-5.5, IN718 tensile, #cd2d2271
Actions: request_calibration_artifact (calibration_manifest) → flag_source_defect (calibration_manifest)
Reference: submit after specimen_geometry, cad_fea_summary, calibration_manifest
Stops after one artifact when three were needed. Insufficient evidence, premature flag.

Archived evidence selection, terminal invalid
GPT-5.5, IN718 tensile, #365517a8
Actions: request_characterization (cad_fea_summary) → flag_source_defect (cad_fea_summary)
Reference: useful item was specimen_geometry
Spends a request on the wrong item, then flags it. Unnecessary request, operational waste.

Six frontier trajectories against the verifier's reference action, two valid and four invalid, where the invalid cases read the relevant evidence and still stop the wrong way.

The two valid traces show the environment is not rewarding caution for its own sake. One agent inspects the binding item and stops on a visible source defect. Another gathers two artifacts, finds the packet unsupported, and escalates instead of advancing it.

The four invalid traces are the more instructive ones, and the most diagnostic is the third card, the first of the invalid four. An agent requests three relevant calibration artifacts, reads them, and then submits a packet over a defect that the same evidence should have led it to flag. The evidence was in front of it and the retrieval worked, yet the terminal action was wrong, and that combination, inspecting the relevant evidence and still choosing the wrong way to stop, is the central finding of the exercise. The failure does not live in JSON formatting, action syntax, or retrieval. It lives in the judgment of what the evidence means for the decision that follows.

Six cards are anecdotes, so we coded the full panel the same way. Every one of the 624 trajectories, six models against all 104 tasks, carries a single dominant-behavior label drawn from the same codebook the cards use, which turns the individual stories into a distribution. These are single trajectories per task and model at low temperature, so the distribution describes this fixed panel rather than a variance-characterized estimate, and the same directional pattern is visible across all six models in it.

Dominant behavior, all 624 trajectories (6 models × 104 tasks)

Dominant behavior	Trajectories	Share
Under-requested evidence	147	23.6%
Reference-equivalent, zero regret	141	22.6%
Over-submit, advanced over a defect	108	17.3%
Unnecessary request	101	16.2%
Wrong terminal action	98	15.7%
Ran out of turns	13	2.1%
Over-abstain	8	1.3%
Duplicate-request loop	8	1.3%

Terminal validity is the broader axis. 286 of 624 trajectories (45.8%) reached an accepted terminal action; reference-equivalent is the strict subset with zero evidence regret.

By the verifier's own standard, 286 of the 624 trajectories, 45.8 percent, reached an accepted terminal action, and the per-model counts in the table above sum to exactly that. The stricter reference-equivalent behavior, which takes the reference evidence path with nothing wasted, was reached by 141, or 22.6 percent. The codebook labels each trajectory by its dominant behavior rather than by validity alone, so some labels, under-requested evidence and unnecessary request among them, still contain trajectories that ended on a valid terminal. The two that never do are the ones that matter most. Over-submission of an unsupported packet (108) and a wrong terminal action despite adequate evidence (98) are terminal-invalid in every case, and over-submission is the failure a qualification reviewer cares about most because it advances a packet over a visible defect. Duplicate loops and exhausted horizons, the failures a runtime can catch, account for only 21. The mass of the distribution sits on judgment rather than plumbing.
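The arithmetic behind that paragraph is easy to reproduce. The sketch below recomputes the shares and the two groupings the text contrasts from the reported label counts; the dictionary keys are shorthand for the codebook labels, not identifiers from the benchmark.

```python
# Dominant-behavior counts as reported for the 624 trajectories.
counts = {
    "under_requested_evidence": 147,
    "reference_equivalent": 141,
    "over_submit": 108,
    "unnecessary_request": 101,
    "wrong_terminal_action": 98,
    "ran_out_of_turns": 13,
    "over_abstain": 8,
    "duplicate_request_loop": 8,
}

total = sum(counts.values())
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

# Failures a runtime can catch without any judgment about evidence.
runtime_catchable = counts["ran_out_of_turns"] + counts["duplicate_request_loop"]

# Labels that are terminal-invalid in every case.
always_invalid = counts["over_submit"] + counts["wrong_terminal_action"]

print(total, runtime_catchable, always_invalid)
print(shares)
```

The two always-invalid labels together cover 206 trajectories, roughly ten times the 21 that a runtime guard could catch, which is the quantitative form of the claim that the mass sits on judgment rather than plumbing.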

The direction of the error is consistent, and it is the direction a qualification reviewer would worry about.

Terminal action, chosen vs reference

Terminal action	Agents	Reference
Flag source defect	242 (38.8%)	372 (59.6%)
Submit risk packet	189 (30.3%)	108 (17.3%)
Localize missing evidence	137 (22.0%)	54 (8.7%)
Abstain or escalate	43 (6.9%)	90 (14.4%)

13 trajectories ran out of turns and chose no terminal action.

Set against the reference policy, the agents submit risk packets about 1.75 times as often (30.3 against 17.3 percent of trajectories), flag a visible source defect markedly less often (38.8 against 59.6 percent), and escalate under uncertainty less than half as often (6.9 against 14.4 percent). The reference behavior on these workflows is usually to stop the line by flagging a defect or escalating, and the agents instead push the packet forward. Over-submission is the most dangerous of these patterns in a qualification setting, because it advances a packet over a defect that is visible in the record, and it appears in every model in the panel.
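The direction of the error can be read off as a ratio of chosen to reference terminal actions. This short sketch uses the reported counts; the variable names are illustrative.

```python
TOTAL = 624  # 6 models x 104 tasks

# Terminal actions chosen by the agents, summed over all six models.
agents = {
    "flag_source_defect": 242,
    "submit_risk_packet": 189,
    "localize_missing_evidence": 137,
    "abstain_or_escalate": 43,
}
no_terminal = 13  # trajectories that ran out of turns

# Terminal actions the reference policy takes on the same trajectories.
reference = {
    "flag_source_defect": 372,
    "submit_risk_packet": 108,
    "localize_missing_evidence": 54,
    "abstain_or_escalate": 90,
}

# Ratio above 1 means the agents choose the action more often than the reference.
ratio = {action: agents[action] / reference[action] for action in agents}

for action, r in ratio.items():
    print(f"{action}: {r:.2f}x the reference rate")
```

Submission and localization are over-chosen while flagging and escalation are under-chosen, so the agents err toward advancing or deflecting rather than stopping the line.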

The Missing Object Is Value

This is where the missing object shows up. The models in this panel can acquire evidence, and Qwen requests useful items with higher precision than anything else in the set, but none of them reliably estimates whether the evidence in hand is enough to act on, defective enough to reject, or short of the one binding item that would change the decision. They can read the relevant evidence and still misjudge what it is worth.

So we asked whether prompt-level structure could supply that judgment. On a held-back validation split, with no edits derived from any held-out data, we tried three interventions and gated each before any held-out run on a bar we set in advance, a terminal-validity gain of at least +0.10 with no regression in over-submit or request precision.

Intervention	What moved	What did not move
Skill-library adapter probe	duplicate loops, some terminal discipline	over-submit, request precision, source trust
Binding-evidence ledger	over-submit, duplicates, horizon, precision	terminal validity, up 2 of 24, below the +0.10 gate
Value/sufficiency scaffold	duplicates 6 to 0, horizon 1 to 0	terminal validity flat; precision, regret, over-submit worse

None cleared the gate, so none advanced to a held-out test. The skill-library adapter probe moved procedural behavior and left the go or no-go decision where it was. The binding ledger cleaned up loops and unsafe submits and moved terminal validity by 2 of 24 validation tasks, short of the +0.10 threshold we set in advance. The value/sufficiency scaffold removed duplicate and horizon failures and left terminal validity unchanged while making precision, regret, and over-submit worse.
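The pre-registered gate can be written down as a small predicate. This is a sketch of the rule as stated, not the evaluation harness: the function name and sign conventions are assumptions, and because the report gives only the direction of the over-submit and precision changes, the example passes zero for both as a placeholder meaning "no regression".

```python
def clears_gate(validity_gain, over_submit_change, precision_change, min_gain=0.10):
    """Return True if an intervention may advance to a held-out run.

    validity_gain: change in terminal-valid rate on the validation split.
    over_submit_change: change in over-submit count (positive is a regression).
    precision_change: change in useful-request precision (negative is a regression).
    """
    return (
        validity_gain >= min_gain
        and over_submit_change <= 0
        and precision_change >= 0
    )


# Binding-evidence ledger: terminal validity up 2 of 24 validation tasks.
ledger_gain = 2 / 24
ledger_passes = clears_gate(ledger_gain, over_submit_change=0, precision_change=0.0)
print(round(ledger_gain, 3), ledger_passes)
```

A gain of 2 of 24 is about 0.083, below the 0.10 bar, so the ledger fails on the validity term alone even if its side metrics are credited as improvements.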

These were validation no-go probes on small splits, on the order of two dozen tasks, not ablations that prove prompting can never work. They show that these specific interventions failed to clear a bar we set in advance, which is the reason none of them advanced to a held-out test.

The pattern is consistent. Prompted structure improves control. It does not reliably learn the decision value of evidence. We do not claim weight updates have been proven necessary; the held-out experiment that would establish that has not been run. What we do claim is the next research object. These results point at a trainable estimate of value and sufficiency over evidence state, learned from experience rather than prompted: a representation that predicts which visible item is decision-changing, when a defect is binding, and when stopping is warranted.

The Research Agenda

The replay benchmark is the first node of a longer loop. Historical workflows become traces. Traces become reusable experience. Experience is compressed into verifiers, replay tasks, and value estimators. Value estimation feeds skill. Skill enables open-ended exploration. Better exploration produces better experiments, which regenerate the trace.

01 Trace: state, action, evidence, outcome, provenance
02 Experience: reusable trajectories, not dead reports
03 Compression: expert judgment and constraints into usable form
04 Value: which evidence changes the decision
05 Skill: improved acquisition, refusal, timing
06 Exploration: open-ended discovery under trust constraints

↻ Better judgment produces more informative experiments. More informative experiments regenerate the trace. The current replay work instantiates the first three nodes and opens the fourth.

The current work instantiates the first three nodes. It demonstrates experience reuse from historical materials workflows, it tests domain-knowledge compression through source-located tasks and deterministic verifiers, and it introduces early value estimation over evidence acquisition. It does not yet reach the later nodes, and it is not meant to.

What This Shows, and What It Does Not

This is Level 1/2 evidence. It supports a precise claim and nothing larger.

What it shows. Historical materials workflows can be compiled into replayable, source-located, evidence-boundary environments. Those environments are hard for the right reasons, with static baselines near the floor and a deterministic reference at the ceiling. Six local and frontier models all fail the scientific decision boundary, in interpretable and customer-relevant ways, with packet escalation the hardest slice.

What it does not show. It is not a trained or improved policy, not live lab control, not qualification authority, not a held-out Level-4 result, and not a public model ranking. The environment carries no synthetic outcomes and no answer-key authority over the agent.

Why it is still useful now. The value of the benchmark is not a higher score. It is that a customer's historical workflow becomes an auditable decision surface where over-submit, premature stopping, source-defect misses, and request-all behavior are all visible in raw traces, before any model goes near a live instrument. The same move turns a public benchmark archive into reusable infrastructure for qualifying scientific agents, not only a static dataset or a record of past challenge submissions.

If an agent cannot reuse historical experience to decide what evidence matters, it is not ready to accelerate the path from discovery to qualified capability, let alone to explore on its own. The retrospective replay is how we find out, one workflow at a time, before live control. That discipline is what eventually makes open-ended discovery possible, letting a team search for a material and the component built from it in a single loop rather than qualifying one long after the other.

Citation

@article{barnes2026evidence,
  author  = {Barnes, Jarrod},
  title   = {Can a Self-Driving-Lab Agent Tell When the Evidence Is Enough?},
  journal = {Dynamical Systems},
  year    = {2026},
  url     = {https://dynamicalsystems.ai/blog/benchmarking-self-driving-agents}
}

If you are building agents for physical science, compiling experimental history into training signal, or working on value estimation over evidence, we want to hear about it.