The Future Is Thousands of Labs
A thesis on experiments that compound.

The abundance of software is creating more demand for hardware.
More intelligence now means more power, more compute, more autonomy, more aerospace capacity, more resilient supply chains, and more systems operating at physical extremes. To build that physical future, we need new materials.
Materials discovery begins with two questions.
What is worth exploring? And once a candidate exists, what are we actually capable of building?
For most of history, good answers to the first question were scarce. Good hypotheses were hard to find. The search space was too large, the literature too fragmented, the instruments too expensive, and the number of trained scientists too small. Discovery moved at the speed of human imagination disciplined by experiment.
Models can now propose structures, mechanisms, synthesis routes, and experimental plans faster than any physical lab can validate them. The frontier is moving from a shortage of plausible ideas to a shortage of trusted contact with reality.
What can we actually build? What forms, what fails, what phase appears instead, what measurement survives doubt, what process repeats, and what result can be trusted enough for someone else to act on?
The laboratory is where that question becomes real.
Before a lab is a room of instruments, a furnace schedule, a sample queue, or a set of methods, it is a disciplined way of letting the physical world answer. It turns matter into evidence, evidence into judgment, and judgment into the next experiment. It is where discoveries are realized.
The physical world is becoming the training environment for AI, but today's labs weren't built to emit training signal.
The design problem is not only generating candidates. It is turning candidate designs into materials that survive contact with synthesis, characterization, and use. A material is real when it can be made, measured, trusted, reproduced, and built into something that survives its intended environment.
The best way to reason about the future is to study how the past changed. In a field moving this quickly, the most useful signal is the pattern by which labs absorb new ways of seeing, measuring, computing, and acting.
The history of materials laboratories is the history of expanding what can be captured from the lab's encounter with the physical world.
In the early experimental tradition, knowledge lived in skilled hands, notebooks, demonstrations, recipes, and the credibility of witnesses. Robert Boyle helped make reproducibility a public scientific concern; he was asked to replicate experiments at early Royal Society meetings, and his experimental practice made the details of apparatus, materials, and procedure part of scientific trust. The point was not only to report a conclusion. It was to make an encounter with reality inspectable by others.
In materials science, instruments widened that encounter. X-ray diffraction made crystal structure visible in a new way; the Braggs received the 1915 Nobel Prize for analyzing crystal structure by means of X-rays. Electron microscopy pushed inspection beyond the optical limit; Ernst Ruska developed the first electron microscope in 1933 and later received the Nobel Prize for his work in electron optics. The lab could capture not only what a material did, but what it was.
Computation changed what could be searched before anything was made. Simulation, electronic-structure methods, thermodynamic modeling, data repositories, and materials informatics made parts of the materials search space navigable. The Materials Genome Initiative made the infrastructure thesis explicit by aiming to reduce the cost and development time of materials discovery, optimization, and deployment through data, models, interoperability, and quality.
Automation changed the tempo. High-throughput workflows, robotic platforms, and self-driving laboratories began to close loops between planning, synthesis, characterization, and analysis.
Skilled hands became written protocols, instruments made invisible structure measurable, computation made search navigable, and automation made selected procedures repeatable. The agentic lab extends that sequence by making experimental experience trainable.
The Forward Path and the Return Path
Autonomous labs expose a forward path into the physical world. Programmable control alone does not make a lab agentic. If an outcome only updates a local optimizer, the lab has automated search. If it changes the data, the model, the simulator, the verifier, and the next human decision, the lab has begun to accumulate experience.
The clearest form of the forward path treats the experiment as code. Experimental intent becomes a declarative configuration that compiles down to device-level APIs, with program analysis, safety checks, resource assignment, job orchestration, and state-aware execution across heterogeneous instruments. Many workflows remain vendor-specific, GUI-driven, not portable, not statically checkable, not composable across labs, and brittle around live state, calibration drift, hidden device assumptions, safety semantics, and shared instrument contention.
That work separates intent from execution. The next question is how execution becomes reusable experience: how the physical result returns as a structured object that can train agents, update verifiers, calibrate simulators, generate rewards, and be replayed.
The forward path is intent into execution. The return path is execution into learning.
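As a rough illustration of the forward path, the sketch below treats one anneal as a declarative intent that is statically checked and then compiled into device-level steps. Everything in it is an assumption made for illustration: the names AnnealIntent, FurnaceLimits, check, and compile_intent, the step vocabulary, and the limits themselves are hypothetical and do not describe any particular lab's or vendor's API.

```python
from dataclasses import dataclass

# Hypothetical declarative intent: what the scientist wants, not how a device does it.
@dataclass(frozen=True)
class AnnealIntent:
    sample_id: str
    target_temp_c: float
    hold_minutes: float
    ramp_c_per_min: float

# Hypothetical device limits; a real lab would read these from live instrument state.
@dataclass(frozen=True)
class FurnaceLimits:
    max_temp_c: float
    max_ramp_c_per_min: float

def check(intent: AnnealIntent, limits: FurnaceLimits, ambient_c: float) -> list:
    """Static safety checks that run before anything touches hardware."""
    errors = []
    if intent.target_temp_c <= ambient_c:
        errors.append("target temperature must exceed ambient")
    if intent.target_temp_c > limits.max_temp_c:
        errors.append("target temperature exceeds furnace limit")
    if not 0 < intent.ramp_c_per_min <= limits.max_ramp_c_per_min:
        errors.append("ramp rate outside furnace limits")
    if intent.hold_minutes < 0:
        errors.append("hold time cannot be negative")
    return errors

def compile_intent(intent: AnnealIntent, limits: FurnaceLimits,
                   ambient_c: float = 25.0) -> list:
    """Lower a checked intent into an ordered list of (step_name, parameters)."""
    errors = check(intent, limits, ambient_c)
    if errors:
        raise ValueError("; ".join(errors))
    ramp_minutes = (intent.target_temp_c - ambient_c) / intent.ramp_c_per_min
    return [
        ("load_sample", {"sample_id": intent.sample_id}),
        ("ramp", {"to_c": intent.target_temp_c, "minutes": ramp_minutes}),
        ("hold", {"minutes": intent.hold_minutes}),
        ("cool", {"to_c": ambient_c}),
    ]

limits = FurnaceLimits(max_temp_c=1200.0, max_ramp_c_per_min=20.0)
intent = AnnealIntent("S-001", target_temp_c=825.0, hold_minutes=60.0, ramp_c_per_min=10.0)
steps = compile_intent(intent, limits)
```

The point of the separation is that the same intent could be rejected, rescheduled, or lowered to a different furnace without the scientist rewriting it. The return path would then attach what actually happened to each compiled step.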
What the Visible Record Misses
For most of modern science, the laboratory was designed to produce papers, reports, local decisions, and trusted artifacts for human readers. That was the right output for a world where human experts were the primary carriers of scientific context. The lab could leave much unsaid because the people around the workflow supplied the missing context. They knew why the first synthesis failed, why the second sample was more trustworthy, why one XRD match was suspicious, why a microscopy image was a preparation artifact, why the property measurement should not yet move the program forward.
An agent cannot inherit the tacit context of a lab unless that context is represented. It cannot learn from a failed synthesis if the failure is only a note in a notebook. It cannot calibrate a simulator from a characterization result if the sample provenance is missing. It cannot improve a planner from a staff correction if the correction is not tied to the original decision, artifact, and outcome. It cannot become scientific by reading only the visible surface of science.
The visible record is biased toward success. Papers compress the argument after uncertainty has been resolved. Databases preserve selected facts after context has been stripped away. Benchmarks freeze a task after a community has decided how to score it. These artifacts matter. They are civilization's compressed scientific record. But they are not the whole process by which reality disciplines belief.
The next systems need the attempts, dead ends, instrument faults, ambiguous measurements, process deviations, negative results, human corrections, and reasons a competent scientist decided not to trust something. Many scientific fields lack large, high-quality datasets like the Protein Data Bank, while much of the most important context remains dark data: logbooks, hardware issues, failed procedures, undocumented code choices, and dead ends.
Models can now generate candidates faster than physical systems can validate them. GNoME reported 2.2 million new crystal predictions, including 380,000 predicted stable materials, and Berkeley Lab's autonomous A-Lab synthesized more than 41 new materials using Materials Project data and GNoME insights. The result is extraordinary, but its lesson is that the world becomes the bottleneck.
A guided-diffusion workflow for superconductors showed the same lesson. It sampled 200,000 structures and produced 773 DFT-screened candidates above 5 K. But when 18 candidates were synthesized, XRD showed that the predicted structures generally did not form; the products were mostly disordered solid solutions or mixtures.
When candidate generation becomes abundant, verification becomes scarce. When agents can propose more plausible actions than a lab can execute, the bottleneck becomes deciding which physical actions are worth taking and returning what happened as signal.
Information Throughput
Scaling experimental throughput cannot mean simply running more experiments. If a lab doubles the number of runs but loses the process context, ambiguity, failures, and expert corrections, it has doubled activity, not learning. Real throughput is information throughput: how much uncertainty is reduced per instrument hour, per sample, per dollar, and per expert intervention.
That requires a different unit of scientific experience. The unit of learning is the whole chain: hypothesis, action, process, observation, interpretation, failure attribution, next action, and outcome. Static prediction, extraction, generation, and local optimization are useful modules, but discovery-oriented systems require open-ended experimental environments where real-world outcomes can propagate backward through the pipeline.
A trace preserves the semantics of an encounter with reality. It holds the intent beside the execution, the sample beside the process, the raw artifact beside the method, the interpretation beside the correction, the measurement beside the uncertainty it changed, and the outcome beside the decision it should affect.
From that trace, the learning objects follow, including evals, reward signals, hard negatives, simulator-calibration cases, verifier-training examples, and replayable episodes. The trace preserves enough structure for future agents to be judged against what reality actually revealed.
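One way to picture such a trace is as a record that keeps the raw artifact fixed while interpretations accumulate as versions, and from which learning objects are derived rather than hand-written. The sketch below is a minimal illustration under that assumption; the Trace and Interpretation classes, their field names, and the learning_objects function are hypothetical, not an established schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Interpretation:
    author: str   # "agent" or an expert's identifier
    claim: str    # e.g. "target phase formed"
    version: int

@dataclass
class Trace:
    hypothesis: str
    intent: dict            # what was requested
    execution: dict         # what the instruments actually did
    sample_id: str
    raw_artifact_uri: str   # raw data stays canonical and is never overwritten
    formed: Optional[bool] = None              # None means still ambiguous
    failure_attribution: Optional[str] = None
    interpretations: list = field(default_factory=list)

    def interpret(self, author: str, claim: str) -> Interpretation:
        """Append a new versioned interpretation; earlier versions are kept."""
        item = Interpretation(author, claim, version=len(self.interpretations) + 1)
        self.interpretations.append(item)
        return item

def learning_objects(trace: Trace) -> list:
    """Derive illustrative learning objects from a single trace."""
    objects = []
    if trace.formed is False:
        # A non-formation becomes a hard negative for the proposer.
        objects.append({"kind": "hard_negative",
                        "hypothesis": trace.hypothesis,
                        "attribution": trace.failure_attribution})
    if len(trace.interpretations) >= 2:
        first, last = trace.interpretations[0], trace.interpretations[-1]
        if first.claim != last.claim:
            # A disagreement between readings of the same artifact becomes an eval.
            objects.append({"kind": "eval",
                            "artifact": trace.raw_artifact_uri,
                            "rejected": first.claim,
                            "accepted": last.claim})
    return objects

trace = Trace(hypothesis="target phase forms at 825 C", intent={}, execution={},
              sample_id="S-001", raw_artifact_uri="xrd/S-001.raw",
              formed=False, failure_attribution="protocol")
trace.interpret("agent", "target phase formed")
trace.interpret("expert", "substrate peak, not the target phase")
objects = learning_objects(trace)
```

In this toy case a single corrected run yields both a hard negative and an eval, which is the sense in which one encounter with reality can feed several learners.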
In a digital model, loss can move backward through a network. In scientific discovery, the analogous signal has to move backward through each link in the chain, from the candidate proposed, to the priors that made it plausible, the simulation that screened it, the protocol that tried to make it, the instrument that measured it, the expert who challenged it, and the decision that followed. Formation should strengthen some choices. Failure should weaken others. Ambiguity should reveal where the verifier is underpowered.
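A deliberately simple way to picture that backward signal is a ledger of per-stage trust that outcomes update. The sketch below is a toy under strong assumptions: the stage list is abbreviated, each stage carries Beta-style pseudo-counts, a failure is either attributed to one stage or spread evenly, and ambiguity is merely counted as a verifier gap. The Ledger class and assign_credit function are hypothetical names, and a real system would need far richer attribution than this.

```python
from dataclasses import dataclass, field
from typing import Optional

# Abbreviated, illustrative pipeline stages (an assumption, not a fixed taxonomy).
STAGES = ("candidate", "simulation", "protocol", "instrument", "interpretation")

@dataclass
class Ledger:
    # Start every stage at one pseudo-count of support and one of doubt.
    support: dict = field(default_factory=lambda: {s: 1.0 for s in STAGES})
    doubt: dict = field(default_factory=lambda: {s: 1.0 for s in STAGES})
    verifier_gaps: int = 0

    def trust(self, stage: str) -> float:
        """Mean of the Beta distribution implied by the pseudo-counts."""
        return self.support[stage] / (self.support[stage] + self.doubt[stage])

def assign_credit(ledger: Ledger, outcome: str, blamed: Optional[str] = None) -> None:
    """Push one downstream outcome back onto the stages that led to it."""
    if outcome == "formed":
        # Formation strengthens every choice on the path.
        for stage in STAGES:
            ledger.support[stage] += 1.0
    elif outcome == "failed":
        if blamed is not None and blamed not in STAGES:
            raise ValueError(f"unknown stage: {blamed}")
        # An attributed failure weakens one stage; otherwise doubt is spread evenly.
        targets = [blamed] if blamed is not None else list(STAGES)
        for stage in targets:
            ledger.doubt[stage] += 1.0 / len(targets)
    elif outcome == "ambiguous":
        # Ambiguity marks a place where the verifier was underpowered.
        ledger.verifier_gaps += 1
    else:
        raise ValueError(f"unknown outcome: {outcome}")

ledger = Ledger()
assign_credit(ledger, "formed")
assign_credit(ledger, "failed", blamed="simulation")
assign_credit(ledger, "ambiguous")
```

Even this crude version shows why failure attribution matters: without the blamed stage, the same failure would only smear a little doubt across everything.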
A successful experiment tells us that one path exists. A failed experiment tells us where the path ends, where the map is wrong, where the process is fragile, where the measurement is misleading, where the model is overconfident, or where the material is not yet under control.
A model trained only on polished successes learns islands without shorelines. It sees the materials that survived publication, not the neighboring attempts that reality rejected. It learns what humans decided to preserve, not what they learned to avoid. It inherits the selection bias of the visible record and misses the contour of synthesizability, manufacturability, reproducibility, and trust.
The highest-value signal in this loop is often human judgment. Expert correction teaches the system what kind of evidence matters. A staff scientist can see that a phase assignment is too convenient, that a substrate peak is being mistaken for the material, that a sample charged under the beam, that a property signal belongs to an impurity, that a result is not yet strong enough to move toward qualification.
The future lab should make that judgment durable. Agent interpretation can be cheap and continuous; expert adjudication should be precise and high-leverage. The raw artifact remains canonical. The interpretation is versioned. The correction becomes training signal. The disagreement becomes an eval. The non-formation becomes a hard negative. The repeated artifact becomes a better verifier.
The true goal is to gain information that reduces uncertainty. That is measurable. An intelligent system learns compressions that preserve what improves prediction or decision.
In representation learning, a useful representation is structure that supports interpretation, reliability, control, and future action.
The laboratory needs an analogous shift. It should form representations of experimental experience around what changes the next decision, including which process parameter mattered, which measurement resolved an ambiguity, which simulation was trusted too far, which synthesis route failed for a physical reason, when characterization was sufficient to stop, and when the right action was to refuse to believe the current result.
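The idea of uncertainty reduced per instrument hour can be made concrete in at least one simple way: hold a belief over a discrete set of competing hypotheses, measure its Shannon entropy before and after a measurement, and divide the drop by the instrument time spent. The sketch below assumes exactly that framing. The function names and the four-phase example are illustrative, and the realized change for a single measurement can be negative when a result is surprising.

```python
import math

def entropy_bits(weights) -> float:
    """Shannon entropy in bits of a discrete belief; weights are normalized first."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return -sum((w / total) * math.log2(w / total) for w in weights if w > 0)

def information_throughput(prior, posterior, instrument_hours: float) -> float:
    """Bits of uncertainty removed per instrument hour for one measurement."""
    if instrument_hours <= 0:
        raise ValueError("instrument_hours must be positive")
    return (entropy_bits(prior) - entropy_bits(posterior)) / instrument_hours

# Four candidate phases start equally likely; a two-hour scan rules out two of them.
prior = [0.25, 0.25, 0.25, 0.25]
posterior = [0.5, 0.5, 0.0, 0.0]
bits_per_hour = information_throughput(prior, posterior, instrument_hours=2.0)
```

Here the belief falls from two bits to one bit over two hours, or half a bit per instrument hour. The same ratio could be taken per sample, per dollar, or per expert intervention by changing the denominator.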
Qualification
The downstream bottleneck is qualification.
The world needs materials systems that can be made, measured, reproduced, trusted, scaled, and deployed, including magnets, coatings, batteries, semiconductors, thermal materials, structural alloys, catalysts, devices, and process windows that survive contact with production.
Rare earths make the stakes concrete. The U.S. Geological Survey's 2026 mineral commodity summary reports that magnets are the leading global use of rare earths, that U.S. net import reliance for rare-earth compounds and metals was 67 percent in 2025, and that China supplied 71 percent of U.S. imports of rare-earth compounds and metals from 2021 through 2024. But the issue is not only supply. The hard part is separation, refining, alloying, microstructure, magnet manufacturing, lot reproducibility, and qualification. A better magnet imagined by a model is still only a conjecture until the phase forms, the process repeats, the measured property is attributed to the intended material, and the evidence is strong enough for someone else to trust.
Qualification is the slow conversion of possibility into warranted action.
Some autonomous science factories will exist, and they should. But they are not the whole future.
The deeper transformation is that the labs we already have become learnable.
University shared facilities. National labs. Industrial R&D labs. Characterization centers. Pilot lines. Foundries. Qualification labs. Process-development labs. The places where real samples already move through real instruments, under real constraints, with real experts making real decisions.
These labs contain the world's experimental nervous system.
Three Research Questions
That transformation has three research questions.
Experience reuse asks how traces from real scientific workflows can be compiled into reusable experience, replay episodes, and evaluation environments rather than disappearing as one-off lab records.
Backward and delayed outcome credit assignment asks how downstream physical outcomes, including failures and non-formations, assign credit or blame back to earlier choices in the pipeline, including corpus, objective, candidate, protocol, simulator, process parameters, instrument state, and interpretation.
Verifier-grounded RL infrastructure asks how experiment-as-code workflows become stable training environments with typed state, actions, rewards, uncertainty, hard negatives, and train/eval splits so labs can improve policies against their own scientific objectives.
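To give the third question some shape, the sketch below replays a recorded episode as a small environment with typed state and actions, and assigns whole samples to train or eval by a deterministic hash so one sample never lands in both. It is a toy under stated assumptions: the State, Action, and ReplayEnv classes and the split_of function are hypothetical, the reset and step shape is only loosely modeled on common RL conventions rather than any library's API, and an off-trace action ends the episode with no reward because the record cannot say what reality would have done.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class State:
    sample_id: str
    step: int

@dataclass(frozen=True)
class Action:
    name: str

def split_of(sample_id: str, eval_fraction: float = 0.2) -> str:
    """Deterministically assign a whole sample to the train or eval split."""
    bucket = hashlib.sha256(sample_id.encode("utf-8")).digest()[0] / 256
    return "eval" if bucket < eval_fraction else "train"

class ReplayEnv:
    """Replays one recorded episode of (action, reward) pairs."""

    def __init__(self, sample_id: str, recorded: list):
        self._sample_id = sample_id
        self._recorded = recorded
        self._step = 0
        self._done = False

    def reset(self) -> State:
        self._step = 0
        self._done = False
        return State(self._sample_id, 0)

    def step(self, action: Action):
        """Return (next_state, reward, done); reward is None when off-trace."""
        if self._done or self._step >= len(self._recorded):
            raise RuntimeError("episode finished; call reset()")
        recorded_action, reward = self._recorded[self._step]
        if action != recorded_action:
            # Off-trace: the record holds no outcome for this action.
            self._done = True
            return State(self._sample_id, self._step), None, True
        self._step += 1
        self._done = self._step == len(self._recorded)
        return State(self._sample_id, self._step), reward, self._done

env = ReplayEnv("S-001", [(Action("anneal"), 0.0), (Action("xrd"), 1.0)])
env.reset()
first = env.step(Action("anneal"))
second = env.step(Action("xrd"))
env.reset()
off_trace = env.step(Action("etch"))
```

Replay of this kind can only score choices the lab actually made, which is one reason the physical loop still has to stay open for new actions.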
Thousands of Labs
The future is still physical. It is furnaces, films, powders, wafers, magnets, coupons, devices, microscopy sessions, diffraction patterns, thermal cycles, corrosion tests, failed batches, and expert doubt. The change is that these encounters no longer disappear after the report. They become shared training signal that humans and agents can both use.
Human expertise moves upward in this future. A workflow that once required multiple expert scientists to inspect data, develop models, make tradeoffs, generate candidates, and summarize conclusions can increasingly be compressed into a single agentic process. The scientist's work shifts toward defining the problem, choosing the objective, recognizing failure modes, deciding what should be built and validated in the real world, and setting the standard of proof. The human still decides what would count as knowing.
The next era of AI science will be trained on verified physical experience.
That experience is already being produced every day, when a candidate is proposed, a process is planned, a simulation is run, a synthesis is attempted, a sample is characterized, an expert changes their mind, a failure is attributed, and a future decision becomes clearer.
Most of that experience is still lost as training signal.
The work now is to build the return path from physical work to trainable experience, from repeated workflows to replayable environments, from expert correction to durable judgment, from failure to better models of the world.
Our mission is to make science programmable.
The future is thousands of existing labs becoming programmable, verifiable, and compounding: thousands of labs learning from reality and from themselves.