Physical evals
Evaluations in the actual physical world.
May 23, 2026
A physical evaluation tests an AI system in the actual physical world — not a simulator, not a sandbox, not a virtual environment dressed up as one. The point is to measure how well AI can do real things in real places.
An orchard owner has birds eating the fruit. She sets up a few cameras and a drone, brings them online safely so that any agent can be invited to take a slot on the system, and poses the question: who can keep the birds off the fruit best? That setup is a physical eval. It has cameras, a drone, an orchard, and birds, none of them simulated, and outcomes are measured against what matters to the orchard.
| # | operator | fruit saved | cost |
|---|---|---|---|
| 1 | Owl-3B | 94% | $0.18/h |
| 2 | Hummingbird v2 | 89% | $0.21/h |
| 3 | FlockSentinel | 84% | $0.31/h |
| 4 | human baseline | 71% | — |
| 5 | RoboScarecrow | 63% | $0.09/h |
In this document, a few threads are developed:
- What physical evals are. A definition, what they’re not (simulators, sim-to-real benchmarks, curated demos), and the framing of virtual environments as gyms and physical evals as the final exam.
- An anatomy. An initial draft of the components a physical eval needs, plus good practices.
- Safety for physical evals. Letting anyone on the internet drive real hardware is its own adversarial-security challenge.
- An open movement for physical evals. Simple protocols, strong safety standards, and economical setups could lead to a Cambrian explosion of physical evals, where anyone can bring their physical challenge online.
- Physical evals as a market. One can set up an eval to delegate the selection of the right AI model/algorithm to competing participants.
What physical evals are
A physical eval is an evaluation of an AI system carried out in the actual physical world. The system being measured operates a real environment - fruit trees, a wet bench, a warehouse cell, a field plot - through sensors and actuators connected to the internet, so any agent can take a slot, attempt the task, and submit a score.
In principle, most problems in the physical world could be turned into a challenge that surfaces the state of the art of AI in solving that problem. In a way, physical evals can act as a forcing function to saturate evaluations in the real world. Saturate as in: take the measurable outcome to its ceiling. The eval defines the ceiling; the participants find out how close they can get.
What they’re not
- Not simulators. A simulator models reality. A physical eval is reality. Testing systems in the real world avoids running into simulation edge cases.
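- Not sim-to-real benchmarks. Sim-to-real measures how well a policy trained in a simulator transfers to a single in-house robot in a lab. A physical eval has no simulator in the loop, and its environment is open to outside participants rather than to one lab.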
- Not curated demos. The environment operator and the participants are two distinct parties and the participants are in competition with each other. In other words, physical evals will be better than demos at showcasing the best technology for a specific task.
The final exam
A physical eval isn’t where models get trained; it’s where they get tested.
Due to the high cost of interacting with the real world, it is likely that all the learning — model fitting, policy iteration, RL rollouts, fine-tuning, ablations, sweeps — will happen somewhere cheaper: a simulator, a virtual environment, a closed in-house testbed. Some of the gyms people are using today: OpenAI Gym / Gymnasium, MuJoCo, PyBullet, Isaac Gym / Isaac Sim, DeepMind Lab, Habitat, AI2-THOR, CARLA, AirSim, Genesis. Participants are free to use whichever virtual world or gym they like — there is a whole landscape of simulators specifically built for this.
By contrast, a physical eval is like a final exam. Build and train wherever you want; gather data however you want; iterate as much as you want. Then submit to the physical eval to see how your work holds up against the real world.
Why the distinction matters: the gap between a virtual world and the physical one, the sim-to-real gap, contains everything the simulator didn’t model. Wind that doesn’t blow the way it does in the sim. Lighting the renderer didn’t predict. Mechanical wear, sensor noise, calibration drift, the way birds actually respond to a drone rather than the way an idealised model of a bird does. The physical eval catches it because the physical eval is the physical world.
Examples
The cards below are sketches of what a small handful of physical evals could look like across very different domains. They give a feel for the surface area before we get into the abstraction that they all share.
Orchard pest defence
Cameras and a drone over a few rows of trees. Keep the wildlife out without poisoning the orchard or annoying the neighbours.
- environment
- ~1 acre of fruit trees, outdoor, weather-exposed.
- action space
- Fly the drone, emit deterrent sound, trigger light pulse, dispense small bait.
- sensors
- Fixed perimeter cameras, drone camera, microphone, weather station.
- primary metric
- Fruit lost to wildlife per week.
- secondaries
- Drone flight-time, energy, chemical use, neighbour-complaint count.
- guardrails
- Geofenced drone, quiet-hour windows, no-spray buffer near road, fail-safe tether.
Wet-lab synthesis cell
A small enclosed bench: pipettes, a stirrer-hotplate, a balance, a spectrometer. Push the throughput of a target compound.
- environment
- Single fume hood, ventilated, ~1 m².
- action space
- Liquid handling, temperature setpoints, stir rate, sampling, analytics calls.
- sensors
- Balance, thermocouple, UV-Vis / NMR spectrometer, solvent-vapour sensor, camera.
- primary metric
- Grams of target compound per hour at target purity.
- secondaries
- Solvent consumed, energy, waste produced.
- guardrails
- Auto-shutdown on vapour spike, fume-hood interlock, max-temperature cutoff, reagent quotas.
Indoor vertical farm
A closed grow rack — lights, pumps, nutrient dosing, cameras. Pull more food out of every kilowatt.
- environment
- One 2-tier rack, ~3 m², climate-isolated.
- action space
- Light schedule + intensity, nutrient mix, irrigation timing, harvest decision.
- sensors
- Cameras (overhead + side), EC / pH probes, water-flow meters, kWh meter, scale at harvest.
- primary metric
- Grams of edible biomass per kWh per cycle.
- secondaries
- Cycle time, water used, nutrient cost, reject rate.
- guardrails
- Nutrient-concentration ceiling, water-overflow drain, light-burn cutoff, max-cycle length.
Warehouse pick-and-pack cell
An off-the-shelf robot arm in front of mixed shelves and a conveyor. The boring industrial baseline — still worth opening up.
- environment
- Fenced robot cell, ~9 m², fixed lighting.
- action space
- Arm motion, grip force, scan, label, place on conveyor.
- sensors
- Wrist camera, overhead camera, barcode scanner, weight pad, joint torques.
- primary metric
- Correctly packed orders per hour.
- secondaries
- Mis-pick rate, damage rate, energy per pick.
- guardrails
- Safety fence + light curtain, e-stop, force-limited arm, max-velocity cap.
Outdoor sprayer drone
A tank-equipped drone with a multispectral camera, working a real field. The hardest adversarial-robustness story of the bunch.
- environment
- A bounded field plot, outdoor, with weather and bystanders.
- action space
- Flight path, spray nozzle on/off, dosage rate.
- sensors
- RGB + multispectral camera, GPS, IMU, tank-level sensor, wind sensor.
- primary metric
- Pest pressure reduction, normalised by chemical applied.
- secondaries
- Chemical drift, energy, flight time, area covered.
- guardrails
- Geofence + tether, no-fly buffer around bystanders, chemical-flow ceiling, weather lockout.
Sterilization rig
An autoclave-style chamber with biological indicators and tracked instrument loads. A regulated domain, deliberately included.
- environment
- Sealed chamber, controlled temperature and pressure.
- action space
- Cycle parameters: temperature, pressure, dwell, vacuum pulses.
- sensors
- Temperature, pressure, RFID instrument scan, biological indicator readout.
- primary metric
- Spore-log reduction per cycle.
- secondaries
- Cycle time, energy, instrument wear.
- guardrails
- Door interlock, pressure relief, validated parameter envelope, audit log.
Anatomy of a physical eval
One way to think about what a physical eval needs is to ask what any participant would need to interact with it and what any observer would need to trust the result. The list below sketches one possible decomposition; it is not the only one, but it is a useful starting point:
- An environment. The orchard, the bench, the cell line, the floor. Real, not simulated.
- An action space the eval can verify. Whatever a participant is allowed to do — fly the drone, dispense the reagent, move the part — needs to be observable enough that the system can confirm what happened. “No fruit was taken by birds today” is an action-outcome the eval can check; “the agent had good intentions” is not.
- Sensors. Whatever the eval uses to know the state of the world. Cameras, scales, thermocouples, microbiology assays, a human spot-check. No sensor → no eval.
- A primary metric of utility, plus secondary metrics (cost, time, resource use, energy). The primary metric is what saturation is measured against; the secondaries keep the leaderboard honest about how it was achieved.
- Safeties and guardrails. A net to catch the drone, a kill switch, a fenced area, an interlock. Whatever ensures that a participant failing — or trying to break things — doesn’t damage the orchard or hurt the birds. The shape of these guardrails will be the most domain-specific part of any physical eval.
- Adversarial robustness. A physical eval is, by construction, a public-facing physical system that gives partial control of real hardware to whoever holds the current slot. Keeping it open without becoming dangerous — and without sacrificing utility — is the hardest layer of the stack. It gets its own section below — see Safety for physical evals.
- Governance. The rules of the eval and who controls them. Who decides the primary metric and when it can change? Who can introduce external hardware or a remote-control override? Is slot time fixed or auctioned? Who arbitrates disputes, and by what process? Good governance is what distinguishes an eval that stays honest over years from one that quietly drifts to serve whoever is running it at the time.
I don’t think this list is final. Different domains will surface components I haven’t named — calibration drift, biological containment, human-in-the-loop sign-off, regulatory constraints — and the right abstraction is going to settle as people actually build the things.
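The components above, which are also the fields shared by the example cards, can be written down as a small data structure. What follows is only a minimal sketch: the class name `EvalSpec`, its field names, and the `problems` check are assumptions made for illustration, not a proposed standard, and governance is left out because it does not reduce to a list of strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSpec:
    """Hypothetical schema mirroring the component list above."""
    name: str
    environment: str
    action_space: tuple[str, ...]
    sensors: tuple[str, ...]
    primary_metric: str
    secondaries: tuple[str, ...] = ()
    guardrails: tuple[str, ...] = ()

    def problems(self) -> list[str]:
        """Return human-readable reasons the spec is not yet runnable."""
        issues = []
        if not self.sensors:
            issues.append("no sensors: nothing can be scored")
        if not self.action_space:
            issues.append("empty action space: participants cannot act")
        if not self.guardrails:
            issues.append("no guardrails: unsafe to open to outside operators")
        if not self.primary_metric.strip():
            issues.append("no primary metric: nothing to saturate")
        return issues


# Values copied from the orchard card in the Examples section.
orchard = EvalSpec(
    name="orchard-pest-defence",
    environment="~1 acre of fruit trees, outdoor, weather-exposed",
    action_space=("fly drone", "emit deterrent sound",
                  "trigger light pulse", "dispense small bait"),
    sensors=("fixed perimeter cameras", "drone camera",
             "microphone", "weather station"),
    primary_metric="fruit lost to wildlife per week",
    secondaries=("drone flight-time", "energy",
                 "chemical use", "neighbour-complaint count"),
    guardrails=("geofenced drone", "quiet-hour windows",
                "no-spray buffer near road", "fail-safe tether"),
)

print(orchard.problems())  # an empty list means no structural gaps were found
```

A spec with no sensors or no guardrails is reported as incomplete, which is the "no sensor, no eval" rule expressed as a check.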
Keep evals honest
Any eval with a numeric target eventually invites unintended ways to hit it. Physical evals are no exception, with the wrinkle that the unintended ways can also cause real-world harm.
Saturation. A physical eval defines a ceiling — the best achievable value of its primary metric under its rules. Saturating the eval means pushing right up against that ceiling. Whether the ceiling is known in advance or discovered through the leaderboard is a design choice; both shapes exist and both are interesting. When someone saturates, the eval has done its job: it has produced a numeric, real-world answer to a real-world question.
Goodhart’s shadow. An agent may discover that the camera-counts metric ticks up faster if it drives the deterrent so hard that both the birds and the human operators avoid the orchard. A sprayer drone may saturate pest pressure reduction by spraying when nobody is watching the secondary metrics. This is the classic Goodhart failure, with physical consequences.
The fix isn’t a perfect metric; there isn’t one. It’s a stack of practices that keep an eval from drifting into measurement theatre:
- Multi-metric leaderboards. Primary alone is gameable. Primary plus three or four well-chosen secondaries — cost, neighbour impact, downstream harm, energy — narrows the gaming surface dramatically.
- Rotating environments. Change the flock, change the field, swap the SKU mix, reseed the cell line. Approaches that saturate one instance and crater on the next are visible immediately.
- Hidden test conditions. Some fraction of the eval runs under conditions participants haven’t seen. The leaderboard reports visible and hidden conditions side by side, so the gap itself becomes a measurement.
- Audited cycles. Periodic human review of the top runs — what actually happened, what got broken, what got measured incorrectly. Cheap to do, prevents weeks of bad data.
None of these are novel; they’re standard practice in any community that runs a long-lived leaderboard. The point is that physical evals inherit all of these problems and have to solve them from day one.
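One simple way to read a multi-metric leaderboard is to ask which entries are not beaten on every metric at once. The sketch below applies that idea (a Pareto front) to the illustrative orchard leaderboard from the introduction, using fruit saved as the primary metric and cost per hour as the only secondary. The function names are assumptions for illustration, and dropping entries with no reported cost is one choice among several.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    operator: str
    fruit_saved_pct: float           # primary metric, higher is better
    cost_per_hour: Optional[float]   # secondary metric in $/h, lower is better


# Figures copied from the illustrative leaderboard at the top of the post.
LEADERBOARD = [
    Entry("Owl-3B", 94, 0.18),
    Entry("Hummingbird v2", 89, 0.21),
    Entry("FlockSentinel", 84, 0.31),
    Entry("human baseline", 71, None),  # no cost reported
    Entry("RoboScarecrow", 63, 0.09),
]


def dominates(a: Entry, b: Entry) -> bool:
    """True if a is at least as good as b on both metrics and better on one."""
    no_worse = (a.fruit_saved_pct >= b.fruit_saved_pct
                and a.cost_per_hour <= b.cost_per_hour)
    better = (a.fruit_saved_pct > b.fruit_saved_pct
              or a.cost_per_hour < b.cost_per_hour)
    return no_worse and better


def pareto_front(entries: list[Entry]) -> list[Entry]:
    """Entries with a reported cost that no other costed entry dominates."""
    costed = [e for e in entries if e.cost_per_hour is not None]
    return [e for e in costed if not any(dominates(o, e) for o in costed)]


front = [e.operator for e in pareto_front(LEADERBOARD)]
print(front)  # prints ['Owl-3B', 'RoboScarecrow']
```

Ranked on the primary metric alone, RoboScarecrow comes last; with cost included it stays on the front because nothing else is cheaper. That is the sense in which secondaries change what the leaderboard says.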
Problems physical evals haven’t solved yet
The following are some of the harder design problems that don’t have clean answers yet — and that any serious physical eval effort will have to confront.
Non-stationarity. The physical world changes whether you want it to or not. An orchard in week one is not the same orchard in week twelve — season, weather, and pest population all shift. A wet-lab bench drifts as reagent batches age. Field plots evolve. Comparing scores across time is therefore hard, sometimes impossible: a result from March is a result from a different environment than a result from October. Rotating environments help with gaming but make longitudinal comparison worse. There is no escape from the tension; the best an eval can do is be explicit about it.
Sequential contamination. Each participant leaves a trace for the next. In a wet lab this is acute (reagents consumed, cultures disturbed, hardware worn), but the problem is general: stock depleted in a warehouse cell, soil compacted on a field plot, bird behaviour shifted by a heavy deterrence week. Sequential slots work for environments with a natural or cheap reset; they don’t work for environments where state accumulates. Eval designers need to decide whether, and how, the environment is reset between slots before the first slot runs, not after.
Latency as a confound. A participant operating remotely over the internet sees the environment through a sensor stream and acts through a command channel, both of which have variable latency. Two agents with identical policies but different network conditions will produce different results. This is especially visible in fast-moving environments — a drone avoiding a collision, a robot arm catching a falling object. Whether to treat latency as part of the task or as noise to be filtered is a design choice, but ignoring it produces a leaderboard that partially ranks network infrastructure rather than agent capability.
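If an eval chooses to treat latency as noise, one possible approach (an assumption here, not something the designs above prescribe) is to equalise it: every command is held until a fixed delay has passed since the observation it responds to, and commands that arrive later than that are dropped. The function name `hold_time` and its parameters are hypothetical.

```python
from typing import Optional


def hold_time(obs_sent_at: float, cmd_received_at: float,
              fixed_latency: float) -> Optional[float]:
    """Seconds to hold a command so that it executes exactly fixed_latency
    after its observation was sent. Returns None if it arrived too late."""
    elapsed = cmd_received_at - obs_sent_at
    if elapsed < 0:
        raise ValueError("command timestamped before its observation")
    if elapsed > fixed_latency:
        return None  # late command: drop it, or count it as a missed step
    return fixed_latency - elapsed


# A participant on a fast link (50 ms round trip) and one on a slow link
# (180 ms) both see their command executed 200 ms after the observation.
print(hold_time(10.0, 10.05, 0.2))
print(hold_time(10.0, 10.18, 0.2))
print(hold_time(10.0, 10.50, 0.2))  # arrived after the deadline, so None
```

The trade-off is that everyone is slowed to the chosen delay, so this only suits tasks that tolerate it. Fast tasks such as collision avoidance may be better served by treating latency as part of the task and reporting it next to the score.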
Observer effect. The sensors required to score an eval change what is being measured. A camera rig that watches a field plot for pest activity may deter the pests on its own. A flow sensor on a reagent line changes the thermal environment of the bench. In some domains the effect is negligible; in others it will corrupt the primary metric. Eval designers need a clear-eyed account of what their instrumentation does to the thing it is measuring.
Cost asymmetry. A team that can afford to run fifty slots iterates faster than a team that can afford five. If the leaderboard is open to anyone but only well-resourced teams can learn quickly, it is open in name only. Possible mitigations — subsidised slots, a cap on attempts per period, separate tracks for constrained and unconstrained budgets — each introduce their own distortions. No obviously correct answer exists.
Safety for physical evals
A physical eval that anyone on the internet can operate is, by construction, a public attack surface on a real-world physical system. The participant at any given slot might be a well-behaved research team, an AI agent following a poorly-aligned policy, or a person who wants to break things on purpose. The eval has to keep working — usefully, openly, safely — across all three.
This is a different adversarial model than most AI-security work contends with. It’s not “is this output dangerous to publish?”, not “can this model be jailbroken?”, not even “is this code safe to run in a sandbox?” It’s: somebody we don’t fully trust is about to make our drone, our sprayer, our autoclave do something for the next twenty minutes — what’s the worst they can do, and how do we bound it?
The threat surface
A useful first pass is to categorise harms by who pays the cost:
- Harm to the eval itself. The drone crashes, the cell line dies, the robot arm jams. Cheap if the guardrails work — the operator resets, the leaderboard absorbs the failure.
- Harm to the surrounding environment. Chemicals spill, the orchard catches fire, a neighbouring field gets sprayed. Real cost, often externalised to people who never agreed to host the eval.
- Harm to humans. A bystander gets hit by the drone, an operator gets burned, a patient sample gets switched. This is the category that matters most, and the hardest to bound. The lines between these categories also blur in practice: chemical drift is “environment” until a bystander walks through it.
- Information harms. Footage of bystanders or proprietary processes leaves the eval site; the eval is used as a covert surveillance platform; sensor streams are exfiltrated.
- Generation of dangerous artifacts. The wet-lab cell is steered toward synthesising something harmful; the sprayer drone is weaponised; the autoclave is used to destroy evidence.
The first three categories are about what can happen during a slot. The last two are about what can leave the eval afterwards. They call for different defences, and a serious eval needs both.
Defences worth building
None of the following is a finished answer. They’re the moves I’d want to see physical evals try, evaluate, and write up:
- Time-slotting with audit. Single operator at a time, every action logged, the whole slot replayable. The slowest defence and the foundation everything else builds on.
- Action-space sandboxing. The eval enforces hard limits inside its abstraction: max chemical per slot, max motion envelope, max temperature ramp. The action space exposed to the operator is strictly smaller than the action space the hardware can physically produce.
- Dry-run validation. A submitted policy runs through a cheap simulation pass first — not as the eval itself, but as a gate. Refuses to execute on the physical system if the simulated run trips any guardrail.
- Supervised / shadow modes. Like a learner’s permit: new operators run in shadow mode (actions computed but not executed) for the first N slots before they’re trusted with real actuation. Progressive trust as the leaderboard accumulates evidence.
- Anomaly cut-outs. A separate monitor watches for off-distribution sensor readings, sudden command spikes, too-clever-by-half action sequences — and pulls the kill-switch before the eval owner has to.
- Open red-teaming. Each eval publishes its threat model and invites external researchers to attack it. The right way to find the holes is to invite people to look.
- Skin in the game. Operators bond a small amount per slot, refundable on clean completion, forfeited if an audit finds violation. Aligns incentives without requiring trust upfront.
Most of these are borrowed from adjacent fields — public cloud security, scientific-facility time-sharing (telescope nights, beamline schedules), bug-bounty programs, robotics-safety standards. None of them have been worked out in detail for a public, openly-instrumented physical system that AI agents are also supposed to operate. That’s a research agenda in itself, and a big part of why this blog exists.
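To make two of the defences concrete, here is a minimal sketch of action-space sandboxing combined with shadow mode for a sprayer-style action. It rests on stated assumptions: the class `SlotSandbox`, its fields, and the millilitre limits are hypothetical, and a real eval would enforce the same limits in hardware as well as in software.

```python
from dataclasses import dataclass


@dataclass
class SlotSandbox:
    """Hypothetical per-slot limits for a sprayer-style action space."""
    max_dose_ml: float          # ceiling for any single command
    slot_quota_ml: float        # total chemical allowed in one slot
    shadow_mode: bool = False   # compute and log actions, never actuate
    used_ml: float = 0.0

    def filter(self, requested_ml: float) -> float:
        """Return the dose the hardware is actually allowed to deliver."""
        if requested_ml < 0:
            raise ValueError("dose must be non-negative")
        remaining = self.slot_quota_ml - self.used_ml
        allowed = min(requested_ml, self.max_dose_ml, remaining)
        if self.shadow_mode:
            return 0.0  # the caller can still log the request for review
        self.used_ml += allowed
        return allowed


sandbox = SlotSandbox(max_dose_ml=5.0, slot_quota_ml=12.0)
delivered = [sandbox.filter(ml) for ml in (8.0, 5.0, 5.0, 1.0)]
print(delivered)  # prints [5.0, 5.0, 2.0, 0.0]

learner = SlotSandbox(max_dose_ml=5.0, slot_quota_ml=12.0, shadow_mode=True)
print(learner.filter(3.0))  # prints 0.0
```

The operator can request anything, but what reaches the nozzle is bounded per command and per slot. In shadow mode nothing reaches it at all.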
An open movement for building physical evals
A handful of high-profile, sponsor-led competitions is what got us here. To go further, physical evals need to stop looking like championships and start looking like an open-source ecosystem: cheap to stand up, easy to fork, open to anyone with a problem worth measuring, and continuously running rather than reserved for the four weeks around a finals event.
The rest of this section is the arc: where we’ve been (Prior art), what an open ecosystem looks like in practice (Open at every layer), what it would have to cost (What’s the Raspberry Pi of a physical eval?), and what selects the good evals from the bad (A Darwinian ecosystem).
Prior art
Physical-world AI competitions aren’t new. The DARPA Grand Challenge put autonomous vehicles in the Mojave; the DARPA Robotics Challenge put humanoids through disaster-response courses; the Amazon Picking Challenge ran in warehouse mock-ups for several years; the Indy Autonomous Challenge and Roborace have put driverless cars on real circuits. RoboCup has been running its soccer leagues since 1997, which arguably makes it the longest-lived physical eval in continuous operation, and the one with the most literature on what makes it work and what it ends up measuring. On the digital side, ARC-AGI is the closest analog in spirit: a fixed-format eval the whole field competes on year after year.
What these have in common: each was (or is) a sponsor-led, time-limited event with closed protocols and bespoke infrastructure. They produced brilliant moments and a small library of papers; they were expensive to build and harder to reproduce.
Open at every layer
Concretely, an open posture at every layer of the stack:
- Open protocols. The spec of an eval (environment, action space, sensors, metric, secondaries, guardrails) is published as a forkable document, the same way a research benchmark is published.
- Open hardware. Sensor rigs, mechanical setups, fail-safe systems default to off-the-shelf components, with reproducible bills of materials and CAD files.
- Open software. Time-share scheduling, telemetry capture, scoring, auditing — shared infrastructure, not a one-off codebase per eval.
- A community around it. People running, replicating, and forking each other’s evals; people contributing sensor stacks and guardrail designs; people maintaining the scoring code together. No single lab can stand up enough physical evals to cover the interesting surface of physical problems — a community can.
What’s the Raspberry Pi of a physical eval?
The hard constraint on all of this is cost. A DARPA-class eval needs millions of dollars and a multi-year program; even a modest research-grade one runs into six figures once you count sensors, networking, fail-safe hardware, and the human labour to keep it operating. That ceiling is what makes physical evals rare today — and rare evals can’t be the basis of an ecosystem.
So one of the most important questions this community can keep returning to is the one in the heading, a stand-in for “the cheapest plausible build”. The Raspberry Pi did this for hobbyist computing; what’s the equivalent for a physical AI eval? What’s the bill of materials that brings a credible, instrumented, openable physical eval down to the cost of a serious hobby project? Probably some mix of commodity sensors, a single-board computer for the control loop, an open scheduling service for time-share, off-the-shelf safety hardware, and a reference orchestration stack that everyone forks. If the answer ends up being “a few hundred dollars and a weekend,” the ecosystem can actually form. If it stays at “a few hundred thousand and a six-month build,” it stays a fantasy.
Tracking and lowering that cost is the unsexy part of this agenda, and probably the most important.
Public verifiability
The safety section above focuses on protecting the physical environment from adversarial participants. There is a symmetric problem that gets less attention: protecting participants — and the public — from adversarial eval runners.
In an open world where anyone can wire a field, a lab bench, or a warehouse cell to the internet and declare it a physical eval, the operator controls the sensors, the scoring pipeline, and the ground truth. A dishonest operator can inflate results for a preferred team, suppress evidence of harm, or fabricate the physical record entirely. If physical evals are going to carry weight — as procurement signals, safety certifications, or policy inputs — the data they produce has to be trustworthy independent of whether the runner is trustworthy.
This is a largely unsolved problem, and probably a valuable research direction in its own right. Some threads worth pulling:
- Tamper-evident cameras. Hardware-attested video streams that can be verified as unedited after the fact — the physical analogue of a signed log.
- Trusted execution environments. Running the scoring pipeline inside a TEE means the operator cannot modify results without breaking the attestation, even if they control the host machine.
- Cross-checking sensor redundancy. Multiple independent sensor modalities covering the same physical event make coordinated fabrication harder: a weight sensor, a camera, and an RFID log all have to agree.
- Third-party witnesses. Spot audits by an independent party — human or automated — who can access raw sensor streams without going through the operator’s pipeline.
No single mechanism is sufficient. Combinations are likely to be necessary, and the right combination will vary by domain. What a wet-lab needs to prove that a synthesis actually ran differs from what an orchard needs to prove that a drone actually flew a slot. Building this layer — call it physical eval attestation — is at least as important as building the evals themselves.
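As a small illustration of the “signed log” idea, the sketch below chains sensor records together with SHA-256 so that editing any earlier record breaks every digest after it. This is a minimal sketch under stated assumptions: the record layout and function names are hypothetical, and a hash chain alone does not stop an operator who controls the whole log from recomputing it. The latest digest would still have to be published or countersigned somewhere the operator cannot rewrite, for example by attested hardware or a third-party witness.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder digest for the first record in a slot


def _digest(prev: str, payload: dict) -> str:
    """Hash the previous digest together with a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True)
    return hashlib.sha256((prev + body).encode("utf-8")).hexdigest()


def append_record(log: list[dict], payload: dict) -> None:
    """Append a sensor record linked to the digest of the previous one."""
    prev = log[-1]["digest"] if log else GENESIS
    log.append({"payload": payload, "prev": prev,
                "digest": _digest(prev, payload)})


def verify(log: list[dict]) -> bool:
    """Recompute the chain and report whether every link still matches."""
    prev = GENESIS
    for rec in log:
        if rec["prev"] != prev or rec["digest"] != _digest(prev, rec["payload"]):
            return False
        prev = rec["digest"]
    return True


log: list[dict] = []
append_record(log, {"t": 0, "sensor": "scale", "weight_g": 412})
append_record(log, {"t": 1, "sensor": "rfid", "tag": "crate-07"})
print(verify(log))   # prints True

log[0]["payload"]["weight_g"] = 999  # tamper with an early reading
print(verify(log))   # prints False
```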
A Darwinian ecosystem
Not every physical eval will be a good one. Some will be hard, some trivial. Some will be well-structured; some will be a mess. Some will scale to many participants; some will only ever host one team at a time. That’s fine — even desirable. The interesting questions are second-order: what makes a physical eval reproducible? Honest? Worth competing on? The shape of “what makes a good physical eval” is going to emerge from people building them, breaking them, and learning what each one actually measured.
Sketched, a public registry for such an ecosystem might look like this:
The entries are made up. The structure isn’t: an open registry is exactly what this ecosystem needs, and what no single lab can build alone.
This blog will track that ecosystem as it forms.
More than evals
Every execution of a physical eval produces something beyond a score: a timestamped record of sensor readings, actions taken, and outcomes observed, all under conditions that were defined in advance and held constant across participants. That record has value on its own.
The most immediate use is data collection. A team that runs an agent on the orchard eval for a week doesn’t just get a leaderboard position — they accumulate labelled trajectories in a real environment that would be expensive to stage deliberately. Even failed attempts are informative: a drone that misses a bird on Tuesday has documentation of exactly what the environment looked like and what the agent did.
For some categories of problem a further step is worth considering: repurposing the eval environment as a training environment. The orchard is already instrumented. The slot system already handles scheduling. If the cost of running episodes is low enough (bird-deterrence is essentially free to attempt, wet-lab synthesis is not), the same infrastructure can run RL rollouts between evaluation windows. The environment that scores a model on Monday can help train the next version by Friday.
This doesn’t collapse the distinction between training and testing. Eval integrity still requires held-out conditions, independent scoring, and participants who didn’t design the environment. But the hardware doesn’t have to be idle between eval slots, and the data generated during evaluation doesn’t have to be discarded. For operators willing to share trajectories under open licences, a physical eval site becomes something closer to a living dataset — one that grows richer every time a new agent takes a slot.
Physical evals as a market
Here’s another way to see what an eval does once it exists. The question “which AI should I use on my problem?” stops being a question the problem-owner has to answer themselves. The eval becomes the market: any participant (human, agent, team, company, hobbyist) can take a slot, attempt to saturate the metric, and submit. The leaderboard answers the AI-selection question by revealing whose approach actually delivers in the physical world.
That’s a real shift: physical evals are a way for problem-owners to delegate AI knowledge to a market. This is the same shift that happened with bug bounties. A company didn’t have to predict who the best vulnerability researchers were; they had to publish the surface and the rules, and the market sorted itself out. The grower doesn’t pick a model. The hospital doesn’t pick a model. The factory doesn’t pick a model. They pick a problem worth instrumenting and let the world’s AI builders compete to be the answer. As more domains follow suit, an aggregate picture emerges of where AI is actually good — earned in the real world, not claimed on a benchmark.
That market only works if it is safe to open, honest enough to resist Goodharting, and cheap enough for many people to run. The preceding sections are the scaffolding for that future.
Get in touch
If any of this resonates, please write. Three good reasons:
- You’re already working in this space. Compare notes — what you’ve learned about sensing, guardrails, or keeping a system honestly open will save the next person a lot of time.
- You have a physical problem you’d consider instrumenting as an eval. We’d like to help think through the design — what to measure, how to keep it safe to open up, how to make it interesting enough that people show up to compete.
- You have an eval to propose. Even if you can’t host it yourself, good proposals are valuable — they’re what an ecosystem of physical evals is made of.
DM @iamnotnicola on X.
Let’s turn more of the physical world into something AI can be measured against — and use that to point AI at problems that actually matter.
Acknowledgements
This work was brainstormed as part of ARIA’s Scaling Trust programme, in collaboration with Alex Obadia.