Scaling Test-Time Verification for Novel Materials

In language models, the generation-verification gap is a selection problem. The model generates correct answers but cannot identify which ones are correct. You close it by ensembling weak verifiers or scaling compute at test time. In materials discovery, the gap is harder. The final verifier is nature. You cannot ensemble a robotic physics lab. Neural surrogates return in seconds, physics simulations in minutes, but actual synthesis takes hours of irreducible physical time. Generate ten times more candidates and the verification burden grows ten times. Scaling generation without scaling verification produces diminishing returns.
We set out to close that gap by pushing verification into the generation process itself. Crystal diffusion models already encode property information in their hidden states that they never use during sampling. A 256-parameter linear probe extracts this signal, and its gradient steers the generation trajectory toward target structures at test time. An unconditional model trained on 97.9% metals, with no property conditioning at all, reaches 24-43% of structures in a target band-gap window (the energy range that determines whether a material is a metal, semiconductor, or insulator), comparable to state-of-the-art conditional generation. Swap the probe, change the target. No retraining, no mode collapse, over 50x faster per sample. Every candidate that exits the model has already been steered toward physical viability. The downstream pipeline starts from a higher-quality pool.
The candidate pool is no longer the constraint. Diffusion models like Crystalite (Hadži Veljković et al.) and MatterGen (Zeni et al.) propose novel candidates by jointly denoising atom identities, fractional coordinates (the position of each atom within the repeating unit cell), and lattice geometry from noise. Neural potentials screen for thermodynamic stability in seconds. Of over half a million candidates proposed by GNoME, fewer than one in five exhibited predicted synthesizability in subsequent analysis. The synthesis gap is where materials discovery stalls.
Multi-fidelity verification pipelines now span neural surrogates that return in seconds, physics simulations that return in minutes, and robotic platforms that run real experiments in hours. Each layer adds confidence and cost. The question is where in this pipeline verification should start.
In our previous post, we described the judgment layer, training models to decide which evidence to trust, when to escalate to a more expensive verifier, and when to commit. That layer operates on candidates. If the candidates themselves are not pre-filtered for physical viability, the judgment system spends its budget evaluating structures that will never survive synthesis. Judgment decides what to pursue. Verification ensures what you generate can actually be built.
Self-Correcting Search
Self-correcting search (Hazra et al., Goodfire Research) exploits a gap between what diffusion models encode internally and what they express in their outputs. A probe trained on MatterGen's GemNet hidden states can predict what the band gap of the final structure will be at each intermediate denoising step. Self-correcting search uses this prediction as a feedback loop. At each step, the probe evaluates the proposal, and Metropolis-Adjusted Langevin Ascent accepts or rejects it based on whether it moves toward the target property range.
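The accept/reject loop can be sketched on a toy scalar. This is a deliberate simplification: it uses plain random-walk Metropolis rather than the full Metropolis-adjusted Langevin update, and `probe_score` is a stand-in for the real hidden-state probe, with the target, step size, and temperature chosen arbitrarily for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def probe_score(x, target=5.0):
    # Stand-in for the hidden-state probe: higher when the toy scalar's
    # "predicted band gap" is closer to the target.
    return -(x - target) ** 2

def metropolis_step(x, step=0.5, temperature=1.0):
    """One accept/reject move biased toward higher probe score."""
    proposal = x + step * rng.normal()
    log_ratio = (probe_score(proposal) - probe_score(x)) / temperature
    if np.log(rng.uniform()) < log_ratio:
        return proposal, True   # proposal accepted
    return x, False             # proposal rejected, keep current state

x = 0.0  # toy state; the real method operates on intermediate denoising proposals
for _ in range(200):
    x, _ = metropolis_step(x)
```

After a few hundred steps the chain concentrates near the target score region, which is the behavior the real method exploits at each denoising step.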
Standard conditional generation with MatterGen achieves 15% of samples in the target band-gap range, dropping to 6.5% when filtering for stable, unique, and novel candidates. Self-correcting search pushes forward the targeting-viability frontier at every conditioning strength tested, generating ~30% more viable candidates in the target range. The approach is general. Any property predictable from model internals becomes a viable steering signal.
We reproduced these results on a single GB10 GPU to understand the method's behavior across conditioning strengths and random seeds.

At gamma=1.5, self-correction produces 25-28% stable-and-in-window candidates, with 43% of relaxed structures in the target band-gap range. Structural viability improves from 54% to 70%. Self-correction improves both targeting and structural validity. Seed robustness varies at lower conditioning strengths, and the robustness variant (best-of-3 with a hard floor) stabilizes performance at the cost of peak throughput.
Model internals carry verification signal that meaningfully improves generation. This was the starting point for our work. The natural question was whether the same principle extends beyond conditional models. MatterGen receives the band-gap target as input at every conditioning strength. Can test-time verification steer a model that has no conditioning signal at all?
Probe-Gradient Guidance
Crystalite (Hadži Veljković et al.) is a ~67M-parameter Diffusion Transformer that generates crystal structures unconditionally. No property target enters the model. It denoises atom tokens, fractional coordinates, and lattice descriptors jointly from noise, using a subatomic tokenizer that encodes elements as continuous vectors derived from periodic table position and valence structure. Its training data is Alex-MP-20 (675,204 structures total, 540,162 in the training split), 97.9% of which are metals by our analysis. If the goal is generating wide-bandgap semiconductors, the base distribution has no coverage of the target region.
We trained a 256-parameter linear probe on Crystalite's atom-mean hidden states to predict band gap. The probe achieves 0.957 AUROC. The model represents whether a structure is metallic or insulating at every denoising step, despite never being trained to condition on that property.
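A minimal sketch of what training such a probe involves, using logistic regression on synthetic stand-ins for the atom-mean hidden states and metal/insulator labels (real labels would come from DFT band gaps). The hidden dimension here is an assumption chosen so that 255 weights plus 1 bias match the stated 256 parameters; the post does not specify the actual dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 255   # assumed, so that 255 weights + 1 bias = 256 parameters

# Synthetic stand-ins for atom-mean hidden states and binary labels.
H = rng.normal(size=(512, hidden_dim))
y = (H[:, 0] > 0).astype(float)   # linearly separable toy label

w = np.zeros(hidden_dim)
b = 0.0
lr = 0.5
for _ in range(300):
    logits = np.clip(H @ w + b, -30, 30)   # clip to keep exp stable
    p = 1.0 / (1.0 + np.exp(-logits))      # sigmoid
    grad = (p - y) / len(y)                # dBCE/dlogits, batch-averaged
    w -= lr * (H.T @ grad)
    b -= lr * grad.sum()
```

The probe is cheap enough that training one per property is a non-event, which is what makes "swap the probe, change the target" practical.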
We first applied the same Metropolis accept/reject strategy that works on MatterGen. On an unconditional model, it failed across all 36 configurations tested. Every one produced 0% in-window and 97-100% metals. Metropolis can only select among proposals the model already generates. If the base distribution is 97.9% metals, there are no insulator proposals to select. Passive verification requires base distribution coverage. Active steering creates it.
Instead of accepting or rejecting proposals, we backpropagate through the probe at each denoising step to produce a gradient on the generation trajectory. The gradient pushes the denoising path toward structures with higher predicted band gap.
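A toy version of the guidance step follows. For a linear probe w·h, the gradient with respect to the hidden state is simply the weight vector w, so this sketch needs no autograd; `denoise_step` is a placeholder that merely contracts toward the origin with time-scaled noise, not Crystalite's actual update, and the guidance weight is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Stand-in linear probe: predicted band gap = w . h, so grad_h = w.
w = rng.normal(size=dim)

def denoise_step(h, t):
    # Placeholder denoiser (NOT Crystalite's update): shrink the state
    # and inject noise scaled by the remaining time t.
    return 0.9 * h + 0.1 * t * rng.normal(size=dim)

def guided_sample(steps=50, weight=0.0):
    h = rng.normal(size=dim)
    for i in range(steps):
        t = 1.0 - i / steps
        h = denoise_step(h, t)
        h = h + weight * w   # probe-gradient step: ascend the predicted property
    return h

baseline = np.mean([w @ guided_sample(weight=0.0) for _ in range(20)])
guided = np.mean([w @ guided_sample(weight=0.1) for _ in range(20)])
```

Even a small per-step nudge compounds across the trajectory, which is why modest guidance weights shift the output distribution substantially.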

At guidance weight zero, the model generates what it always generates. 96.5% metals, 0.0% in the target window. At w=1, metals drop to 0.8% and 3.5% of structures hit the 4-6 eV range. At w=10, metals reach 0.0% and 24.2% of structures hit the target, with a mean band gap of 4.19 eV. An unconditional model trained on 97.9% metals reaches comparable targeting accuracy to MatterGen's conditional generation with self-correcting search (25-28%), steered entirely at test time.
Head to Head
The comparison between MatterGen with self-correcting search and Crystalite with probe-gradient guidance is not a model-versus-model benchmark. The two systems differ in architecture (equivariant GNN vs. Diffusion Transformer), conditioning (conditional vs. unconditional), and verification mechanism (accept/reject vs. gradient steering). What the juxtaposition reveals is how far the test-time verification principle generalizes.
| | MatterGen + Self-Correction | Crystalite + Gradient |
|---|---|---|
| Targeting (in-window) | 25-28% at gamma=1.5 | 24.2% (base) / 42.6% (balanced) |
| Latency | 9-16 min / 32 samples | ~5 sec / 10 samples |
| Structural validity | ~70% stable | 100% lattice, 99.6% geometry (balanced) |
| Conditional model | Required | Not required |
| Per-property retraining | Required | Swap probe (256 params) |
| Compositional uniqueness | Not reported at scale | 99.7% (base) / 78% (balanced) |
| Seed robustness | Varies (9-16% at gamma=1.0) | Stable across 3 seeds |
| Formation energy probe | N/A | AUROC 0.990 |
MatterGen with self-correction achieves 25-28% stable-and-in-window at its strongest conditioning. Crystalite with probe-gradient guidance achieves 24.2% in-window on the base model, rising to 42.6% on a balanced-training checkpoint. Band-gap hit rates on relaxed structures converge at 43% for MatterGen (gamma=1.5) and 42.6% for the balanced Crystalite model (w=3).
The difference that enables everything else is latency. Crystalite's base architecture is over 100x faster than MatterGen at generation (22 sec vs 2,639 sec per 1,000 structures). With guidance overhead, guided Crystalite remains over 50x faster per sample than MatterGen with self-correction. Sweeping across guidance weights, composing constraints, and regenerating takes seconds. Changing the target property means swapping a 256-parameter probe, not retraining the generator.
MatterGen's strength is structural maturity. Its equivariant GNN architecture produces physically valid crystal structures by construction. Crystalite's base model (trained on the full Alex-MP-20 distribution) achieves high compositional diversity (99.7% unique) but lower structural validity (5-16%). A balanced-training checkpoint resolves this, reaching 100% lattice validity and 99.6% geometry validity, at the cost of reduced diversity (78% unique). We return to this tradeoff below.
The same verification principle, applied through different mechanisms on different architectures, produces comparable targeting in both cases. Model internals carry the signal. The question is how to extract and act on it, not whether it exists.
The Pareto Frontier
Does guidance sacrifice diversity for targeting? This is the standard tradeoff in conditional generation. Stronger conditioning collapses the output distribution. We tested this with a Pareto sweep of 18,432 structures across 6 guidance weights, 3 random seeds, and 1,024 samples per batch.

Every guidance weight Pareto-dominates the baseline. At w=10, in-window rate reaches 31.8% (up from 24.2% in the smaller initial sweep, consistent at larger N) while compositional uniqueness holds at 99.7%. At w=15, targeting climbs to 33.7% with uniqueness at 99.6%. The number of distinct chemical systems explored stays flat at roughly 1,015 across all weights. Novelty remains above 99% everywhere.
Element entropy reveals what the gradient is doing at the compositional level. At w=1, entropy dips from 4.67 to 4.57 as the composition space shifts from the metallic regime to the insulator regime. By w=10, it recovers to 4.65. The insulator composition space is as diverse as the metallic one. The probe gradient opens a different region, not a narrower one.
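For concreteness, element entropy here is the Shannon entropy of element frequencies across a batch of generated compositions. A minimal helper (the compositions below are toy examples, not generated structures):

```python
import numpy as np
from collections import Counter

def element_entropy(compositions):
    """Shannon entropy (nats) of element frequencies across a batch."""
    counts = Counter(el for comp in compositions for el in comp)
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return float(-(p * np.log(p)).sum())

diverse = element_entropy([("Li", "O"), ("Na", "Cl"), ("K", "F")])
collapsed = element_entropy([("Au", "Cs")] * 3)
```

A batch that collapses onto a few elements scores low; a batch spread over many elements scores high, which is what the flat entropy across guidance weights indicates.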
Crystalite's subatomic tokenizer creates this behavior. Instead of one-hot element identities, it encodes each element as a continuous vector built from periodic table position and valence structure, then compresses via principal component analysis to 16 dimensions. This creates a smooth manifold where many different element combinations achieve similar probe scores. The gradient says "make this an insulator" without specifying which insulator. The model retains full compositional freedom within the target property range.
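The continuous-embedding idea can be illustrated in a few lines. The feature set below (period, group, valence electrons, Pauling electronegativity) is a hypothetical miniature of the tokenizer's richer features, and we keep 2 principal components instead of 16 because the toy set is tiny.

```python
import numpy as np

# Hypothetical element features: (period, group, valence e-, electronegativity).
features = {
    "Si": [3, 14, 4, 1.90],
    "Ge": [4, 14, 4, 2.01],
    "Ga": [4, 13, 3, 1.81],
    "As": [4, 15, 5, 2.18],
    "Na": [3, 1, 1, 0.93],
    "Cl": [3, 17, 7, 3.16],
}
X = np.array(list(features.values()), dtype=float)
X = X - X.mean(axis=0)

# PCA via SVD; the real tokenizer compresses to 16 dimensions.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
embed = X @ Vt[:2].T   # continuous element embeddings

si, ge, na = embed[0], embed[1], embed[4]
```

Chemically similar elements (Si, Ge) land close together on the resulting manifold while dissimilar ones (Si, Na) land far apart, which is what lets a gradient move smoothly between compositions.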
From Diversity to Production
The base model generates compositionally diverse candidates but not structurally valid ones for the target property range. Alex-MP-20 is 97.9% metals. The model never learned what insulator crystal geometry looks like. A balanced training subset with 35% insulators fixes this. The balanced checkpoint at w=3 produces 42.6% in-window candidates with 100% lattice validity. A formation energy probe achieves 0.990 AUROC.
Not everything is reachable from model internals. An E_hull probe achieves 0.000 AUROC. E_hull (energy above the convex hull) measures how far a structure sits from the most stable known compounds in its chemical system. It depends on what other compounds exist in that system, information the Transformer has no access to. Structure-level properties are accessible. Database-relative metrics are not.
But even within what the probes can reach, a single property target is rarely enough. Novel discovery tends to look more like search under constraint. "Generate a refractory insulator (one that retains structural integrity at extreme temperatures) with no cobalt or nickel." Real queries have multiple constraints, some continuous (band gap), some discrete (which elements to include or exclude). Composing multiple probe gradients for all of these failed. A weak refractory probe created adversarial shortcuts, collapsing output to Au/Cs. The approach that works is matching each constraint type to the right mechanism. Probe-gradient steering handles continuous properties. Token masking handles discrete composition constraints. The hybrid produces 100% refractory, 0% cobalt/nickel, 100% insulator, 30% in the target band-gap window. Swap the mask, swap the probe, regenerate in seconds.
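The discrete half of the hybrid can be sketched as hard logit masking: forbidden element tokens get probability exactly zero before sampling, while the probe gradient (as in the guidance sketch earlier) handles the continuous properties. The vocabulary and logits below are toy stand-ins, not Crystalite's tokenizer.

```python
import numpy as np

rng = np.random.default_rng(0)
ELEMENTS = ["Si", "O", "Al", "Co", "Ni", "Ga"]   # toy vocabulary
FORBIDDEN = {"Co", "Ni"}                          # discrete constraint

def mask_logits(logits):
    """Hard token mask: forbidden elements can never be sampled."""
    out = logits.copy()
    for i, el in enumerate(ELEMENTS):
        if el in FORBIDDEN:
            out[i] = -np.inf
    return out

def sample_element(logits):
    p = np.exp(logits - logits.max())   # numerically stable softmax
    p /= p.sum()
    return rng.choice(ELEMENTS, p=p)

samples = [sample_element(mask_logits(rng.normal(size=len(ELEMENTS))))
           for _ in range(200)]
```

Unlike a soft gradient penalty, the mask is a guarantee: the 0% cobalt/nickel result holds by construction, not by optimization pressure.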
From Generation to Synthesis
Inference-time verification changes the scaling laws of the generation-verification gap. When the verification signal is folded into the generation trajectory, every candidate that exits the model has already been steered toward target properties. Generating more means more verified candidates, not more candidates to verify.
The mechanism is composable. Stacking probes and token masks adds verification dimensions without retraining. The cost of adding a constraint is training a 256-parameter probe, not retraining a model.
What the probes cannot reach is the property that matters most. Synthesizability depends on kinetics, processing conditions, and precursor chemistry. No model internal encodes it, because no training dataset contains it. The signal lives in experiment logs. What was attempted, what formed, what failed. If that signal were available as supervision, the same gradient mechanism would steer generation toward structures pre-filtered for physical realizability.
Judgment decides which evidence to trust and when to commit. Verification steers generation toward structures worth committing to. Better candidates produce more informative experiments. More informative experiments produce the synthesis outcome data that the next generation of verification probes needs. The verification signal that matters most is the one that comes back from the lab.
Citation
@article{barnes2026verification,
  author  = {Barnes, Jarrod},
  title   = {Scaling Test-Time Verification for Novel Materials},
  journal = {Dynamical Systems},
  year    = {2026},
  url     = {https://dynamicalsystems.ai/blog/scaling-test-time-verification}
}

If you are running autonomous experiments, building generation-to-synthesis pipelines, or working on synthesizability prediction, we want to hear about it.