Thesis

The Future Is Thousands of Labs

A thesis on experiments that compound.

The abundance of software is creating more demand for hardware.

More intelligence now means more power, more compute, more autonomy, more aerospace capacity, more resilient supply chains, and more systems operating at physical extremes. To build that physical future, we need new materials.

Two questions follow. What is worth exploring? And once a candidate exists, what are we actually capable of building?

For most of history, the first question was the hard one. Good hypotheses were hard to find, and discovery moved at the speed of human imagination disciplined by experiment.

Models can now propose structures, mechanisms, synthesis routes, and experimental plans faster than any physical lab can validate them. The frontier is moving from a shortage of plausible ideas to a shortage of trusted contact with reality.

Before a lab is a room of instruments, a furnace schedule, a sample queue, or a set of methods, it is a disciplined way of letting the physical world answer. It turns matter into evidence, evidence into judgment, and judgment into the next experiment.

The physical world is becoming the training environment for AI, but today's labs weren't built to emit training signal.

From Execution to Experience

Experimental knowledge once lived only in skilled hands, notebooks, and the credibility of witnesses. Instruments widened what the lab could see: the 1915 Nobel Prize in Physics recognized the reading of crystal structure from X-rays, and electron microscopy pushed past the optical limit. Computation made parts of the search space navigable before anything was made, and the Materials Genome Initiative made the infrastructure thesis explicit. Automation changed the tempo, closing loops between planning, synthesis, characterization, and analysis. The agentic lab extends the sequence by making experimental experience trainable.

A lab becomes agentic when the physical result changes the data, the model, the simulator, the verifier, and the next human decision. Experiment as code separates intent from execution. The physical result still has to return as a structured object that can train agents, update verifiers, calibrate simulators, and become replayable experience.

Until now, the lab could leave much unsaid because the people around the workflow supplied the missing context. A candidate can clear every computational screen and never form in the furnace. A synthesis can fail twice and succeed the third time because someone changed a precursor and remembered why. A measurement can be trusted because of how the sample was prepared, not only because of what the instrument returned. An agent cannot inherit any of this unless it is represented: a failed synthesis that survives only as a line in a notebook teaches nothing, and a characterization result with no provenance cannot calibrate a simulator. The runs that would teach a model the most (the failures, the deviations, the expert corrections) are exactly the ones today's records are least equipped to keep.

The fields where AI compounds fastest are the ones where checking an answer is cheap. Code compiles and runs against its tests. A proof either checks or it does not. A materials claim is checked by a furnace, a diffractometer, and months of testing, and generation has already outrun that check; GNoME reported 2.2 million new crystal predictions, including 380,000 predicted stable materials.

When candidate generation becomes abundant, verification becomes scarce. When agents can propose more plausible actions than a lab can execute, the bottleneck becomes deciding which physical actions are worth taking and returning what happened as signal.

What We Build

At Dynamical, we build long-horizon evaluations that represent the full trajectory of discovering and developing a material, from candidate to qualified capability. We bring them as close to reality as we can, grounding them in real experimental records, instrument artifacts, and expert corrections. Our first benchmark replays real additive-manufacturing campaigns and asks whether an agent can tell when the evidence is enough. These evaluations show where agentic systems fail on real research decisions, and each failure marks where a stronger verifier, an environment, or expert judgment is needed. We supply that signal to the teams training frontier models and to the labs generating the experiments. Each pass through the loop makes verification faster, cheaper, and closer to reality. Over time the loop converges to the inference-time infrastructure that lets agents operate in real research environments and reach materials that do not yet exist.

In its first run, the benchmark replayed 1,872 trajectories across six frontier models, and the dominant failure was over-caution: agents kept spending on evidence the record already contained, stopping short of a conclusion more than six times as often as they overclaimed.

Information Throughput

Real throughput is information throughput: how much uncertainty is reduced per instrument hour, per sample, per dollar, and per expert intervention. The experiments worth running are the ones that change the next decision and produce data the world does not yet have.
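One way to make that measure concrete is to score a run by the bits of uncertainty it removes, divided by what it cost. The sketch below is illustrative only: it assumes beliefs can be written as a discrete probability distribution over competing hypotheses, and the function names and the four-phase example are hypothetical, not a description of our tooling.

```python
import math


def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_throughput(prior, posterior, cost):
    """Bits of uncertainty removed per unit of cost.

    `cost` can be instrument hours, samples, dollars, or expert
    interventions; the unit is an assumption left to the caller.
    """
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (entropy_bits(prior) - entropy_bits(posterior)) / cost


# Hypothetical belief over four candidate phases, before and after one run.
prior = [0.25, 0.25, 0.25, 0.25]
informative = [0.5, 0.5, 0.0, 0.0]   # the run ruled out two phases
redundant = [0.25, 0.25, 0.25, 0.25]  # the run repeated what was known

bits_per_hour = information_throughput(prior, informative, cost=2.0)
wasted = information_throughput(prior, redundant, cost=2.0)
```

Under these assumptions the informative run removes one bit in two instrument hours, while the redundant run scores zero no matter how quickly it executes.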

A trace preserves the semantics of an encounter with reality: the intent beside the execution, the interpretation beside the correction, the outcome beside the decision it should affect. Traces are what open-ended experimental environments and their rewards are built from.
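As a rough sketch of what such a record might hold, the structure below pairs each element named above and serializes it for replay. The class, its field names, and the example values are assumptions made for illustration; they do not describe an existing schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Trace:
    """Hypothetical record of one encounter with reality."""
    intent: str                      # what the run was meant to test
    execution: str                   # what was actually done
    interpretation: str              # the first reading of the result
    outcome: str                     # what the physical world returned
    correction: Optional[str] = None         # expert revision, if any
    decision_affected: Optional[str] = None  # the choice this should change
    provenance: dict = field(default_factory=dict)  # sample prep, instrument

    def is_replayable(self) -> bool:
        # Assumed minimum for replay: intent, execution, outcome, provenance.
        return all([self.intent, self.execution, self.outcome, self.provenance])


# A failed synthesis kept as structured experience instead of a notebook line.
failed_run = Trace(
    intent="form target phase at 900 C",
    execution="fired 12 h in air, precursor lot A",
    interpretation="target phase absent in diffraction pattern",
    outcome="no formation",
    correction="substrate peak was misassigned; precursor likely hydrated",
    decision_affected="switch precursor lot before repeating",
    provenance={"instrument": "diffractometer-1", "sample_prep": "pressed pellet"},
)

serialized = json.dumps(asdict(failed_run), indent=2)
```

The point of the design is that the correction and the decision travel with the outcome, so a failure can still teach something after the people who remember it have moved on.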

In a digital model, loss can move backward through a network. In scientific discovery, the analogous signal has to move backward through each link in the chain, from the candidate proposed, to the priors that made it plausible, the simulation that screened it, the protocol that tried to make it, the instrument that measured it, the expert who challenged it, and the decision that followed. Formation should strengthen some choices. Failure should weaken others. Ambiguity should reveal where the verifier is underpowered.

The highest-value signal is often human judgment. A staff scientist can see that a phase assignment is too convenient, that a substrate peak is being mistaken for the material, that a sample charged under the beam, that a result is not yet strong enough to move toward qualification. Representations of experimental evidence have to be built around what changes the next decision.

Beyond Optimization

Assume the exponential holds. Generation, simulation, and lab throughput all compound; a characterization run that once took an afternoon now takes seconds. Optimization inside known chemistry becomes cheap. Frontier models already match purpose-built optimizers at rounding out known spaces, and the value moves to the edge of what is known, where the data does not yet exist.

Working at that edge takes two capabilities. The first is reading the model's own internal state to find where its knowledge is thin and which physical result would teach it the most; our work on test-time verification shows that signal already lives inside the generator and can be surfaced as a control primitive rather than an after-the-fact filter. The second is training the generator and the simulator against each other, so that each real result pulls both toward the regions experiments actually occupy. Simulators are now accurate on ordinary structures and quietly unreliable at the defects, kinetics, and extremes where discovery happens.

One thing does not compound. Informative contact with reality stays rationed by synthesis, instrument hours, and the months a qualification test takes, and much of what a high-throughput lab measures repeats what it already knew. The ratio of hypotheses to physical answers keeps widening, so the scarce act becomes choosing the single experiment that most reduces the whole system's uncertainty and returning its outcome to every model at once. The experiment is the backward pass of the world model. Discovery stops being a search a scientist fully specifies and becomes a loop that learns where to look.
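Choosing that single experiment can be framed, in its simplest form, as maximizing expected information gain. The sketch below assumes a small discrete set of hypotheses and a known table of outcome likelihoods for each candidate experiment; both are simplifications, and the experiment names and numbers are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information_gain(prior, likelihood):
    """Expected drop in entropy over hypotheses from running one experiment.

    likelihood[o][h] is the assumed P(outcome o | hypothesis h).
    """
    gain = entropy(prior)
    for row in likelihood:
        p_outcome = sum(l * p for l, p in zip(row, prior))
        if p_outcome == 0:
            continue  # this outcome can never occur under the prior
        # Bayes update for this outcome, weighted by how likely it is.
        posterior = [l * p / p_outcome for l, p in zip(row, prior)]
        gain -= p_outcome * entropy(posterior)
    return gain


# Two hypotheses about a candidate: it forms, or it does not.
prior = [0.5, 0.5]

# Hypothetical candidate experiments and their outcome likelihoods.
experiments = {
    "repeat_known_measurement": [[0.5, 0.5], [0.5, 0.5]],
    "discriminating_synthesis": [[0.9, 0.1], [0.1, 0.9]],
}

best = max(
    experiments,
    key=lambda name: expected_information_gain(prior, experiments[name]),
)
```

Here the repeated measurement has zero expected gain because its outcomes are equally likely under both hypotheses, so the discriminating synthesis is selected. A real system would also have to weigh cost and share the result across every model that depends on it.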

Thousands of Labs

The future is still physical. It is furnaces, films, powders, wafers, coupons, thermal cycles, corrosion tests, failed batches, and expert doubt.

Software demand is becoming materials demand, and materials demand is becoming qualification demand. Qualified materials and components are now a rate limit on strategic capacity. The labs we already have become learnable: university shared facilities, national labs, industrial R&D labs, characterization centers, pilot lines, foundries, and qualification labs. Together they are the world's experimental nervous system.

As discovery compounds, the scientist's work moves upward, toward defining the problem, choosing the objective, recognizing failure modes, and setting the standard of proof. The human still decides what would count as knowing. When materials research moves at the speed of product development, the material and the machine are designed together, hardware and software moving through the same loop against the same physical evidence. In the limit, an engineer specifies the environment a part must survive, and agents, simulators, verifiers, and labs return a material qualified to survive it.

Our mission is to make science programmable. Thousands of existing labs become programmable, verifiable, and compounding.

We are shaping a new way of approaching agentic science, and we are looking for people who want to build it with us.