Physical evals

Evaluations in the actual physical world.

May 23, 2026

A physical evaluation tests an AI system in the actual physical world — not a simulator, not a sandbox, not a virtual environment dressed up as one. The point is to measure how well AI can do real things in real places.

An orchard owner has birds eating the fruit. She sets up a few cameras and a drone, brings them online safely so any agent can be invited to take a slot on the system, and poses the question: who can keep the birds off the fruit best? That setup is a physical eval. It has cameras, a drone, an orchard, and birds, none of them simulated, and outcomes are measured against what matters to the orchard.

leaderboard · week 12 · live

#   operator          fruit saved   cost
1   Owl-3B            94%           $0.18/h
2   Hummingbird v2    89%           $0.21/h
3   FlockSentinel     84%           $0.31/h
4   human baseline    71%
5   RoboScarecrow     63%           $0.09/h

A physical eval, sketched: an orchard with perimeter cameras and a deterrent drone, and a live leaderboard of operators competing to keep the birds off the fruit. Operators here are illustrative.

In this document, a few threads are developed:

  1. What physical evals are. A definition, what they’re not (simulators, sim-to-real benchmarks, curated demos), virtual environments as virtual gyms and physical evals as a final exam.
  2. An anatomy. An initial draft of components a physical eval needs, good practices.
  3. Safety for physical evals. Letting anyone on the internet drive real hardware is its own adversarial-security challenge.
  4. An open movement for physical evals. Simple protocols, strong safety standards, and economical setups could lead to a Cambrian explosion of physical evals, where anyone can bring their physical challenge online.
  5. Physical evals as a market. One can set up an eval to delegate the selection of the right AI model/algorithm to competing participants.

What physical evals are

A physical eval is an evaluation of an AI system carried out in the actual physical world. The system being measured operates a real environment - fruit trees, a wet bench, a warehouse cell, a field plot - through sensors and actuators connected to the internet, so any agent can take a slot, attempt the task, and submit a score.

In principle, most problems in the physical world could be turned into a challenge that surfaces the state of the art in AI for that problem. In a way, physical evals can act as a forcing function to saturate evaluations in the real world. Saturate as in: take the measurable outcome to its ceiling. The eval defines the ceiling; the participants find out how close they can get.

What they’re not

The final exam

A physical eval isn’t where models get trained; it’s where they get tested.

Due to the high cost of interacting with the real world, it is likely that all the learning — model fitting, policy iteration, RL rollouts, fine-tuning, ablations, sweeps — will happen somewhere cheaper: a simulator, a virtual environment, a closed in-house testbed. Some of the gyms people are using today: OpenAI Gym / Gymnasium, MuJoCo, PyBullet, Isaac Gym / Isaac Sim, DeepMind Lab, Habitat, AI2-THOR, CARLA, AirSim, Genesis. Participants are free to use whichever virtual world or gym they like — there is a whole landscape of simulators specifically built for this.

By contrast, a physical eval is like a final exam. Build and train wherever you want; gather data however you want; iterate as much as you want. Then submit to the physical eval to see how your work holds up against the real world.

Why the distinction matters: the gap between a virtual world and the physical one, the sim-to-real gap, contains everything the simulator didn’t model. Wind that doesn’t blow the way it does in the sim. Lighting the renderer didn’t predict. Mechanical wear, sensor noise, calibration drift, the way birds actually respond to a drone rather than the way an idealised model of a bird does. The physical eval catches all of this because the physical eval is the physical world.

Examples

The cards below are sketches of what a small handful of physical evals could look like across very different domains. They give a feel for the surface area before we get into the abstraction that they all share.

Anatomy of a physical eval

One way to think about what a physical eval needs is to ask what any participant would need to interact with it and what any observer would need to trust the result. The diagram below sketches one possible decomposition — not the only one, but a useful starting point:

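[Figure: orchard sketch with a “fruit saved 94% ↑ 3” readout and seven callouts: 1 Environment (the orchard) · 2 Sensors (perimeter cameras) · 3 Action space (drone, deterrents) · 4 Metric (+ secondaries) · 5 Guardrails (geofence, no-fly buffers) · 6 Open access (remote operator input) · 7 Governance (who sets the rules)]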
The seven components of a physical eval, called out on the orchard sketch. Numbers correspond to the items in the list below.

I don’t think this list is final. Different domains will surface components I haven’t named — calibration drift, biological containment, human-in-the-loop sign-off, regulatory constraints — and the right abstraction is going to settle as people actually build the things.
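To make the decomposition a little more concrete, here is a minimal sketch of the seven components as a machine-readable spec. The class name, field names, and the secondary metrics are my own placeholders for illustration, not an agreed schema; the orchard values are taken from the sketch above.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PhysicalEvalSpec:
    """Hypothetical spec mirroring the seven components in the orchard sketch."""

    environment: str                     # 1. the physical place being operated
    sensors: tuple[str, ...]             # 2. what participants can observe
    action_space: tuple[str, ...]        # 3. what participants can actuate
    primary_metric: str                  # 4. the number the leaderboard ranks on
    secondary_metrics: tuple[str, ...]   # 4. side metrics reported alongside it
    guardrails: tuple[str, ...]          # 5. hard limits applied to every command
    open_access: str                     # 6. how remote participants connect
    governance: str                      # 7. who sets and changes the rules

    def missing_components(self) -> list[str]:
        """Return the names of components that were left empty."""
        return [name for name, value in vars(self).items() if not value]


orchard = PhysicalEvalSpec(
    environment="the orchard",
    sensors=("perimeter cameras",),
    action_space=("drone", "deterrents"),
    primary_metric="fruit saved",
    secondary_metrics=("bird injuries", "operator complaints"),  # assumed examples
    guardrails=("geofence", "no-fly buffers"),
    open_access="remote operator input",
    governance="orchard owner",  # assumed: the sketch only says "who sets the rules"
)

# A spec with a component left blank is easy to flag before anyone books a slot.
incomplete = replace(orchard, governance="")
```

A registry could refuse to list an eval whose `missing_components()` is non-empty, which is one cheap way to make the anatomy enforceable rather than advisory.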

Keep evals honest

Any eval with a numeric target eventually invites unintended ways to hit it. Physical evals are no exception, with the wrinkle that the unintended ways can also cause real-world harm.

Saturation. A physical eval defines a ceiling — the best achievable value of its primary metric under its rules. Saturating the eval means pushing right up against that ceiling. Whether the ceiling is known in advance or discovered through the leaderboard is a design choice; both shapes exist and both are interesting. When someone saturates, the eval has done its job: it has produced a numeric, real-world answer to a real-world question.

Goodhart’s shadow. An agent may discover that the camera-counts metric ticks up faster if it drives the deterrent so hard that both the birds and the human operators avoid the orchard. A sprayer drone may saturate pest pressure reduction by spraying when nobody is watching the secondary metrics. This is the classic Goodhart failure, with physical consequences.

The fix isn’t a perfect metric; there isn’t one. It’s a stack of practices that keep an eval from drifting into measurement theatre:

None of these are novel; they’re standard practice in any community that runs a long-lived leaderboard. The point is that physical evals inherit all of these problems and have to solve them from day one.
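One way to blunt the Goodhart failure described above is to let secondary metrics gate the primary one, so a slot that breaks a side constraint earns no score at all. The sketch below assumes two hypothetical secondary metrics (`bird_injuries`, `operator_complaints`) and zero-tolerance limits; a real eval would choose its own.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SlotResult:
    fruit_saved: float          # primary metric, as a fraction in [0, 1]
    bird_injuries: int          # hypothetical secondary metric
    operator_complaints: int    # hypothetical secondary metric


def gated_score(
    result: SlotResult,
    max_injuries: int = 0,
    max_complaints: int = 0,
) -> Optional[float]:
    """Return the primary metric only if every secondary limit holds.

    None marks a disqualified slot, which keeps it off the leaderboard
    instead of ranking it with a penalty.
    """
    if result.bird_injuries > max_injuries:
        return None
    if result.operator_complaints > max_complaints:
        return None
    return result.fruit_saved


clean_run = SlotResult(fruit_saved=0.94, bird_injuries=0, operator_complaints=0)
harsh_run = SlotResult(fruit_saved=0.99, bird_injuries=2, operator_complaints=0)
```

Here `harsh_run` saves more fruit than `clean_run` but is disqualified. Gating rather than subtracting a penalty means no amount of primary-metric gain can buy back a violated constraint.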

Problems physical evals haven’t solved yet

The following are some of the harder design problems that don’t have clean answers yet — and that any serious physical eval effort will have to confront.

Non-stationarity. The physical world changes whether you want it to or not. An orchard in week one is not the same orchard in week twelve — season, weather, and pest population all shift. A wet-lab bench drifts as reagent batches age. Field plots evolve. Comparing scores across time is therefore hard, sometimes impossible: a result from March is a result from a different environment than a result from October. Rotating environments help with gaming but make longitudinal comparison worse. There is no escape from the tension; the best an eval can do is be explicit about it.

Sequential contamination. Each participant leaves a trace for the next. In a wet lab this is acute (reagents consumed, cultures disturbed, hardware worn), but the problem is general: stock depleted in a warehouse cell, soil compacted on a field plot, bird behaviour shifted by a heavy deterrence week. Sequential slots work for environments with a natural or cheap reset; they don’t work for environments where state accumulates. Eval designers need to decide which kind of environment they have before the first slot runs, not after.

Latency as a confound. A participant operating remotely over the internet sees the environment through a sensor stream and acts through a command channel, both of which have variable latency. Two agents with identical policies but different network conditions will produce different results. This is especially visible in fast-moving environments — a drone avoiding a collision, a robot arm catching a falling object. Whether to treat latency as part of the task or as noise to be filtered is a design choice, but ignoring it produces a leaderboard that partially ranks network infrastructure rather than agent capability.
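Whichever way that design choice goes, the eval can at least record latency next to every score. The sketch below summarises per-command round-trip times for a slot and checks whether two slots ran under similar enough network conditions to compare directly. The function names and the 20 ms tolerance are assumptions for illustration.

```python
from __future__ import annotations

import math
import statistics


def latency_summary(round_trips_ms: list[float]) -> dict[str, float]:
    """Summarise a slot's command round-trip times (median and nearest-rank p95)."""
    if not round_trips_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(round_trips_ms)
    p95_index = math.ceil(0.95 * len(ordered)) - 1
    return {
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[p95_index],
    }


def comparable(a: dict[str, float], b: dict[str, float], max_gap_ms: float = 20.0) -> bool:
    """True if two slots' median latencies differ by at most max_gap_ms."""
    return abs(a["median_ms"] - b["median_ms"]) <= max_gap_ms


nearby_team = latency_summary([20, 22, 25, 30, 200])
distant_team = latency_summary([100, 110, 120])
```

Publishing these summaries beside each leaderboard entry does not remove the confound, but it lets a reader see when a ranking gap coincides with a latency gap.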

Observer effect. The sensors required to score an eval change what is being measured. A camera rig that watches a field plot for pest activity may deter the pests on its own. A flow sensor on a reagent line changes the thermal environment of the bench. In some domains the effect is negligible; in others it will corrupt the primary metric. Eval designers need a clear-eyed account of what their instrumentation does to the thing it is measuring.

Cost asymmetry. A team that can afford to run fifty slots iterates faster than a team that can afford five. If the leaderboard is open to anyone but only well-resourced teams can learn quickly, it is open in name only. Possible mitigations — subsidised slots, a cap on attempts per period, separate tracks for constrained and unconstrained budgets — each introduce their own distortions. No obviously correct answer exists.
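Of the mitigations listed, a cap on attempts per period is the simplest to sketch. The class below is a hypothetical in-memory booking check, with the period represented as an opaque label such as an ISO week; a real scheduler would persist this state and handle cancellations.

```python
from __future__ import annotations

from collections import defaultdict


class AttemptCap:
    """Allow each team at most max_attempts booked slots per period."""

    def __init__(self, max_attempts: int) -> None:
        if max_attempts < 1:
            raise ValueError("max_attempts must be at least 1")
        self.max_attempts = max_attempts
        # Maps (team, period) to the number of slots already booked.
        self._used: dict[tuple[str, str], int] = defaultdict(int)

    def try_book(self, team: str, period: str) -> bool:
        """Book one slot if the team is under its cap; return whether it succeeded."""
        key = (team, period)
        if self._used[key] >= self.max_attempts:
            return False
        self._used[key] += 1
        return True

    def remaining(self, team: str, period: str) -> int:
        """Slots the team can still book in this period."""
        return self.max_attempts - self._used.get((team, period), 0)


cap = AttemptCap(max_attempts=2)
```

The distortion this introduces is visible in the code: the cap is per team name, so a well-resourced group can register several teams unless identity is checked somewhere else.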

Safety for physical evals

A physical eval that anyone on the internet can operate is, by construction, a public attack surface on a real-world physical system. The participant at any given slot might be a well-behaved research team, an AI agent following a poorly-aligned policy, or a person who wants to break things on purpose. The eval has to keep working — usefully, openly, safely — across all three.

This is a different adversarial model than most AI-security work contends with. It’s not “is this output dangerous to publish?”, not “can this model be jailbroken?”, not even “is this code safe to run in a sandbox?” It’s: somebody we don’t fully trust is about to make our drone, our sprayer, our autoclave do something for the next twenty minutes — what’s the worst they can do, and how do we bound it?

The threat surface

A useful first pass is to categorise harms by who pays the cost:

  1. Harm to the eval itself. The drone crashes, the cell line dies, the robot arm jams. Cheap if the guardrails work — the operator resets, the leaderboard absorbs the failure.
  2. Harm to the surrounding environment. Chemicals spill, the orchard catches fire, a neighbouring field gets sprayed. Real cost, often externalised to people who never agreed to host the eval.
  3. Harm to humans. A bystander gets hit by the drone, an operator gets burned, a patient sample gets switched. This is the category that matters most, and the hardest to bound. The lines between these categories also blur in practice: chemical drift is “environment” until a bystander walks through it.
  4. Information harms. Footage of bystanders or proprietary processes leaves the eval site; the eval is used as a covert surveillance platform; sensor streams are exfiltrated.
  5. Generation of dangerous artifacts. The wet-lab cell is steered toward synthesising something harmful; the sprayer drone is weaponised; the autoclave is used to destroy evidence.

Categories 1–3 are about what can happen during a slot. Categories 4–5 are about what can leave the eval afterwards. They want different defences, and a serious eval needs both.

Defences worth building

None of the following is a finished answer. They’re the moves I’d want to see physical evals try, evaluate, and write up:

Most of these are borrowed from adjacent fields — public cloud security, scientific-facility time-sharing (telescope nights, beamline schedules), bug-bounty programs, robotics-safety standards. None of them have been worked out in detail for a public, openly-instrumented physical system that AI agents are also supposed to operate. That’s a research agenda in itself, and a big part of why this blog exists.
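As one small, concrete instance of bounding what a participant can do during a slot, here is a sketch of the geofence guardrail named in the anatomy: every waypoint in a drone command is checked against a box with a safety buffer before anything reaches the hardware. The coordinate frame (local metres), the box shape, and the names are assumptions; real geofences are usually polygons and are enforced on the vehicle as well as on the server.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

Waypoint = tuple[float, float, float]  # (x, y, z) in local metres, an assumed frame


@dataclass(frozen=True)
class Geofence:
    x_min: float
    x_max: float
    y_min: float
    y_max: float
    z_max: float            # altitude ceiling; the floor is ground level (0)
    buffer_m: float = 0.0   # no-fly margin inside the horizontal boundary

    def allows(self, x: float, y: float, z: float) -> bool:
        """True if the point is inside the fence, clear of the buffer, and below the ceiling."""
        inside_x = self.x_min + self.buffer_m <= x <= self.x_max - self.buffer_m
        inside_y = self.y_min + self.buffer_m <= y <= self.y_max - self.buffer_m
        inside_z = 0.0 <= z <= self.z_max
        return inside_x and inside_y and inside_z


def first_violation(fence: Geofence, waypoints: list[Waypoint]) -> Optional[int]:
    """Return the index of the first waypoint the fence rejects, or None if all pass."""
    for index, (x, y, z) in enumerate(waypoints):
        if not fence.allows(x, y, z):
            return index
    return None


orchard_fence = Geofence(x_min=0, x_max=100, y_min=0, y_max=50, z_max=30, buffer_m=5)
```

Rejecting the whole command when `first_violation` returns an index, rather than trimming the path, keeps the guardrail simple to reason about. It only covers harm during a slot; the information and artifact harms above need different defences.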

An open movement for building physical evals

A handful of high-profile, sponsor-led competitions is what got us here. To go further, physical evals need to stop looking like championships and start looking like an open-source ecosystem: cheap to stand up, easy to fork, open to anyone with a problem worth measuring, and continuously running rather than reserved for the four weeks around a finals event.

The rest of this section is the arc: where we’ve been (Prior art), what an open ecosystem looks like in practice (Open at every layer), what it would have to cost (What’s the Raspberry Pi of a physical eval?), and what selects the good evals from the bad (A Darwinian ecosystem).

Prior art

Physical-world AI competitions aren’t new. The DARPA Grand Challenge put autonomous vehicles in the Mojave; the DARPA Robotics Challenge put humanoids through disaster-response courses; the Amazon Picking Challenge ran in warehouse mock-ups for several years; the Indy Autonomous Challenge and Roborace have put driverless cars on real circuits. RoboCup has been running its soccer leagues since 1997, which arguably makes it the longest-lived physical eval in continuous operation, and the one with the most literature on what makes it work and what it ends up measuring. On the digital side, ARC-AGI is the closest analog in spirit: a fixed-format eval the whole field competes on year after year.

What these have in common: each was (or is) a sponsor-led, time-limited event with closed protocols and bespoke infrastructure. They produced brilliant moments and a small library of papers; they were expensive to build and harder to reproduce.

Open at every layer

Concretely, an open posture at every layer of the stack:

What’s the Raspberry Pi of a physical eval?

The hard constraint on all of this is cost. A DARPA-class eval needs millions of dollars and a multi-year program; even a modest research-grade one runs into six figures once you count sensors, networking, fail-safe hardware, and the human labour to keep it operating. That ceiling is what makes physical evals rare today — and rare evals can’t be the basis of an ecosystem.

So one of the most important questions this community can keep returning to is the one in the heading, a stand-in for “the cheapest plausible build”. The Raspberry Pi did this for hobbyist computing; what’s the equivalent for a physical AI eval? What’s the bill of materials that brings a credible, instrumented, openable physical eval down to the cost of a serious hobby project? Probably some mix of commodity sensors, a single-board computer for the control loop, an open scheduling service for time-share, off-the-shelf safety hardware, and a reference orchestration stack that everyone forks. If the answer ends up being “a few hundred dollars and a weekend,” the ecosystem can actually form. If it stays at “a few hundred thousand and a six-month build,” it stays a fantasy.

Tracking and lowering that cost is the unsexy part of this agenda, and probably the most important.

Public verifiability

The safety section above focuses on protecting the physical environment from adversarial participants. There is a symmetric problem that gets less attention: protecting participants — and the public — from adversarial eval runners.

In an open world where anyone can wire a field, a lab bench, or a warehouse cell to the internet and declare it a physical eval, the operator controls the sensors, the scoring pipeline, and the ground truth. A dishonest operator can inflate results for a preferred team, suppress evidence of harm, or fabricate the physical record entirely. If physical evals are going to carry weight — as procurement signals, safety certifications, or policy inputs — the data they produce has to be trustworthy independent of whether the runner is trustworthy.

This is a largely unsolved problem, and probably a valuable research direction in its own right. Some threads worth pulling:

No single mechanism is sufficient. Combinations are likely to be necessary, and the right combination will vary by domain. What a wet-lab needs to prove that a synthesis actually ran differs from what an orchard needs to prove that a drone actually flew a slot. Building this layer — call it physical eval attestation — is at least as important as building the evals themselves.
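One possible ingredient of such an attestation layer, offered as my own illustration rather than a settled design, is a hash-chained log: each sensor or scoring record commits to the one before it, so editing an earlier record invalidates everything after it. The record fields below are hypothetical.

```python
from __future__ import annotations

import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def _digest(prev_hash: str, payload: dict) -> str:
    """SHA-256 over the previous hash plus a canonical JSON encoding of the payload."""
    body = json.dumps(payload, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_record(chain: list[dict], payload: dict) -> dict:
    """Append a record that commits to the current head of the chain."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    entry = {"prev": prev_hash, "payload": payload, "hash": _digest(prev_hash, payload)}
    chain.append(entry)
    return entry


def verify_chain(chain: list[dict]) -> bool:
    """Recompute every link; any edited, reordered, or dropped record breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True


log: list[dict] = []
append_record(log, {"slot": 1, "team": "Owl-3B", "fruit_saved": 0.94})
append_record(log, {"slot": 2, "team": "RoboScarecrow", "fruit_saved": 0.63})
```

This only makes tampering evident if the latest hash is published somewhere the operator does not control, and it says nothing about whether the sensors told the truth in the first place. It is a building block to combine with other mechanisms, not an answer on its own.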

A Darwinian ecosystem

Not every physical eval will be a good one. Some will be hard, some trivial. Some will be well-structured; some will be a mess. Some will scale to many participants; some will only ever host one team at a time. That’s fine — even desirable. The interesting questions are second-order: what makes a physical eval reproducible? Honest? Worth competing on? The shape of “what makes a good physical eval” is going to emerge from people building them, breaking them, and learning what each one actually measured.

Sketched, a public registry for such an ecosystem might look like this:

physevals.io · open registry · 126 evals · 43 accepting slots

Physical eval registry

eval                     category                                       status         metric                  teams
Orchard pest defence     agriculture · outdoor · Greenfields UK         open           fruit saved / week      18
Wet-lab synthesis cell   chemistry · indoor · SynthLab Oxford           open           g/h at target purity    11
Indoor vertical farm     agriculture · controlled environment           2 slots left   g / kWh / cycle         6
Pick-and-pack cell       logistics · warehouse robotics                 open           correct orders / hour   23
Outdoor sprayer drone    agri-robotics · field · safety-vetted access   coming soon    pest Δ / ml sprayed

Submit an eval: open spec · CC-BY · any domain

updated 26 May · specs CC-BY · physevals.io is imagined

The entries are made up. The structure isn’t: an open registry is exactly what this ecosystem needs, and what no single lab can build alone.

This blog will track that ecosystem as it forms.

More than evals

Every execution of a physical eval produces something beyond a score: a timestamped record of sensor readings, actions taken, and outcomes observed, all under conditions that were defined in advance and held constant across participants. That record has value on its own.

The most immediate use is data collection. A team that runs an agent on the orchard eval for a week doesn’t just get a leaderboard position — they accumulate labelled trajectories in a real environment that would be expensive to stage deliberately. Even failed attempts are informative: a drone that misses a bird on Tuesday has documentation of exactly what the environment looked like and what the agent did.

For some categories of problem, a further step is worth considering: repurposing the eval environment as a training environment. The orchard is already instrumented. The slot system already handles scheduling. If the cost of running episodes is low enough (bird-deterrence is essentially free to attempt, wet-lab synthesis is not), the same infrastructure can run RL rollouts between evaluation windows. The environment that scores a model on Monday can help train the next version by Friday.

This doesn’t collapse the distinction between training and testing. Eval integrity still requires held-out conditions, independent scoring, and participants who didn’t design the environment. But the hardware doesn’t have to be idle between eval slots, and the data generated during evaluation doesn’t have to be discarded. For operators willing to share trajectories under open licences, a physical eval site becomes something closer to a living dataset — one that grows richer every time a new agent takes a slot.

Physical evals as a market

Here’s another way to see what an eval does once it exists. The question “which AI should I use on my problem?” stops being a question the problem-owner has to answer themselves. The eval becomes the market: any participant (human, agent, team, company, hobbyist) can take a slot, attempt to saturate the metric, and submit. The leaderboard answers the AI-selection question by revealing whose approach actually delivers in the physical world.

That’s a real shift: physical evals are a way for problem-owners to delegate AI knowledge to a market. This is the same shift that happened with bug bounties. A company didn’t have to predict who the best vulnerability researchers were; they had to publish the surface and the rules, and the market sorted itself out. The grower doesn’t pick a model. The hospital doesn’t pick a model. The factory doesn’t pick a model. They pick a problem worth instrumenting and let the world’s AI builders compete to be the answer. As more domains follow suit, an aggregate picture emerges of where AI is actually good — earned in the real world, not claimed on a benchmark.

That market only works if it is safe to open, honest enough to resist Goodharting, and cheap enough for many people to run. The preceding sections are the scaffolding for that future.

Get in touch

If any of this resonates, please write. Three good reasons:

DM @iamnotnicola on X.

Let’s turn more of the physical world into something AI can be measured against — and use that to point AI at problems that actually matter.

Acknowledgements

This work was brainstormed as part of ARIA’s Scaling Trust programme, in collaboration with Alex Obadia.