A Demo Proves Nothing
Frontier AI agents fail more than half the time on real customer-service tasks, and the variance is worse than the average. Evaluation is the only honest answer to "does it work?"
In June 2024, researchers at Sierra published a benchmark called τ-bench. They built it to test whether AI agents could handle simulated retail and airline customer service. The agents had to call domain-specific APIs and follow real policy constraints. The kind of thing companies actually deploy.
GPT-4o-class models succeeded on fewer than 50% of tasks.
That's the headline. The number underneath is worse. The pass^8 metric, which measures whether the agent succeeds on the same task in all 8 of 8 independent trials, dropped below 25% in retail. Fewer than one task in four was solved on every attempt. A coin-flip agent doesn't survive contact with production.
This is what evaluation tells you that a demo never can.
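The pass^k idea above can be sketched in a few lines. This is a minimal illustration, assuming the combinatorial estimator given in the τ-bench paper: for a task with c successes in n trials, the chance that k trials drawn from those n all succeed is C(c, k) / C(n, k), averaged over tasks. The function name and the per-task success counts are hypothetical.

```python
from math import comb

def pass_hat_k(successes_per_task, n_trials, k):
    """Estimate pass^k: the chance that k trials drawn from a task's
    n_trials recorded trials all succeed, averaged over tasks."""
    if not 1 <= k <= n_trials:
        raise ValueError("k must be between 1 and n_trials")
    denom = comb(n_trials, k)
    # comb(c, k) is 0 whenever a task has fewer than k successes.
    per_task = [comb(c, k) / denom for c in successes_per_task]
    return sum(per_task) / len(per_task)

# Hypothetical results: 4 tasks, 8 trials each, successes counted per task.
successes = [8, 6, 4, 0]
print("pass^1:", pass_hat_k(successes, 8, 1))  # average single-trial success
print("pass^8:", pass_hat_k(successes, 8, 8))  # share of tasks solved 8 of 8
```

In this made-up data the single-trial average looks respectable, yet only the one task that never failed counts toward pass^8.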
The Static Benchmark Era
The first generation of LLM evaluation looked like a multiple choice test. In September 2020, Dan Hendrycks and collaborators published MMLU, a 57-subject exam covering elementary math, US history, computer science, law, and professional medicine. About 16,000 questions. Most models at launch scored near the 25% random-guess baseline. GPT-3 hit 43.9%. By 2024 several frontier models cleared 85%.
OpenAI released HumanEval in July 2021. 164 hand-written Python problems with hidden unit tests. Codex, the model behind early Copilot, solved 28.8% on a single try. Given 100 samples per problem, with a problem counted as solved if any one sample passed the tests, it hit 70.2%.
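The single-try and 100-try figures are both instances of pass@k. A minimal sketch of the unbiased estimator described in the HumanEval paper, for one problem with n generated samples of which c pass the tests: pass@k = 1 - C(n - c, k) / C(n, k). The sample counts below are hypothetical.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem:
    n samples generated, c of them correct, budget of k tries."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 10 samples generated, 2 of them pass the tests.
print(pass_at_k(10, 2, 1))
print(pass_at_k(10, 2, 5))
```

The benchmark score is this value averaged over all problems. Note the contrast with pass^k: pass@k rewards getting it right at least once, while pass^k demands getting it right every time.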
A few months later, Karl Cobbe's team at OpenAI published GSM8K, 8,500 grade-school math word problems for measuring step-by-step reasoning. Then in 2022, Jason Wei's chain-of-thought paper from Google Research showed PaLM 540B jumping from 18% to 57% on GSM8K just from being prompted differently. Same model. The harness around it changed.
The model didn't get smarter. The wrapping did.
The Contamination Problem
Static benchmarks live on the public internet. Models train on the public internet. You can see the problem.
By 2024, HumanEval problems had leaked into training corpora. Models kept scoring well, but nobody could tell whether they'd learned to code or memorized the test. Naman Jain's LiveCodeBench paper made this concrete. The authors collected fresh competition problems from LeetCode, AtCoder, and Codeforces with explicit release dates, then tested models only on problems released after their training cutoff. DeepSeek's models showed a sharp drop on LeetCode problems released after September 2023. The earlier scores had been inflated, at least in part, by contamination.
This isn't an anecdote. It's structural. Any benchmark old enough to matter is old enough to leak.
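The release-date approach described above comes down to splitting problems at the model's training cutoff and comparing accuracy on each side. A minimal sketch, not LiveCodeBench's actual code: the `Problem` type, the helper names, the dates, and the results are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Problem:
    problem_id: str
    released: date

def split_by_cutoff(problems, training_cutoff):
    """Separate problems the model may have seen from ones it cannot have."""
    seen = [p for p in problems if p.released <= training_cutoff]
    fresh = [p for p in problems if p.released > training_cutoff]
    return seen, fresh

def accuracy(problems, solved_ids):
    """Fraction of problems solved, or None for an empty bucket."""
    if not problems:
        return None
    return sum(p.problem_id in solved_ids for p in problems) / len(problems)

# Hypothetical problems and results for one model.
problems = [
    Problem("p1", date(2023, 5, 1)),
    Problem("p2", date(2023, 8, 15)),
    Problem("p3", date(2023, 10, 2)),
    Problem("p4", date(2024, 1, 20)),
]
solved = {"p1", "p2", "p3"}
seen, fresh = split_by_cutoff(problems, date(2023, 9, 1))
print("possibly seen:", accuracy(seen, solved))
print("released after cutoff:", accuracy(fresh, solved))
```

A large gap between the two numbers is a warning sign rather than proof, since newer problems can also simply be harder.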
What Evaluation Looks Like Now
The frameworks have caught up. The UK AI Security Institute ships Inspect AI with over 200 prebuilt evaluations covering reasoning, knowledge, coding, behavior, multimodal, and agentic tasks. EleutherAI's lm-evaluation-harness powers Hugging Face's Open LLM Leaderboard and gets used internally by NVIDIA, Cohere, BigScience, BigCode, and MosaicML. OpenAI publishes its Evals framework as YAML configs with an open registry that staff review when training new models. Anthropic publishes agent evaluations through its research posts, often built on Inspect AI as the reference implementation.
The hard cases are the agent benchmarks. τ-bench is one. SWE-bench is another. Princeton's Carlos Jimenez and collaborators pulled 2,294 real GitHub issues with matching pull requests from 12 popular Python repos and asked models to produce patches that pass the existing tests. Claude 2 resolved 1.96% of issues at launch in late 2023. By late 2025, frontier models combined with strong agent harnesses had pushed past 60% on the SWE-bench Verified subset.
A roughly 30x improvement in two years, with the caveat that the two figures come from different subsets of the benchmark. Almost none of it came from the model alone.
The Long-Horizon Question
METR, the Model Evaluation and Threat Research nonprofit, has been pushing this further. Their HCAST benchmark contains 189 software tasks across machine learning, cybersecurity, software engineering, and general reasoning. They paid 140 human contractors to attempt the same tasks. 563 baseline runs total. AI scores get compared to human time-to-completion directly.
In March 2025, METR published a "time horizon" metric: the length of task, measured in how long it takes a human, that a model can complete with 50% reliability. The headline finding was that frontier-model time horizons have doubled roughly every 7 months since 2019. If that trend holds, models will be completing tasks that take humans days or weeks before the end of the decade.
If.
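The extrapolation is simple exponential arithmetic, which is part of why the "if" carries so much weight. A minimal sketch, assuming a constant 7-month doubling time; the 60-minute starting horizon is a hypothetical round number, not a METR measurement.

```python
import math

DOUBLING_MONTHS = 7.0  # doubling time from the reported trend

def projected_horizon(current_minutes, months_ahead, doubling_months=DOUBLING_MONTHS):
    """Task length (in human minutes) at 50% reliability after months_ahead,
    if the horizon keeps doubling at a constant rate."""
    return current_minutes * 2 ** (months_ahead / doubling_months)

def months_until(current_minutes, target_minutes, doubling_months=DOUBLING_MONTHS):
    """Months needed to grow from the current horizon to the target horizon."""
    return doubling_months * math.log2(target_minutes / current_minutes)

# Hypothetical starting point: a 60-minute horizon today.
print(projected_horizon(60, 28))  # four doublings later, in minutes
print(months_until(60, 8 * 60))   # months to reach a full 8-hour workday
```

Every output here inherits the assumption that the doubling time stays fixed, so a modest change in that one parameter moves the projected dates by years.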
The Trust Gap
The Stanford AI Index Report for 2026 aggregates this across the field. Performance on OSWorld, a real-computer-task benchmark, rose from roughly 12% in March 2025 to 66.3% in March 2026. Six points off human baseline in a year. The same report noted that the gap between benchmark progress and production deployment stayed large. Most agent demos that score well on benchmarks fail to reach production reliability.
That gap is the point of this whole series.
Benchmarks are necessary. They're not sufficient. Old benchmarks contaminate. New benchmarks oversimplify. Trace-based evaluation on real tasks is more honest but harder to reproduce. And the variance number matters as much as the headline. A 50% pass rate at pass^1 with a 5% pass rate at pass^8 is not "half-working." It's essentially broken under production conditions.
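The arithmetic behind that gap is worth seeing once. A minimal sketch under a deliberately simple assumption: every trial succeeds independently with the same probability p, so succeeding on all k trials has probability p^k. Real tasks vary in difficulty, and the function names are illustrative.

```python
def all_k_success(p, k):
    """Probability of k successes in k independent trials at per-trial rate p."""
    return p ** k

def required_per_trial(target, k):
    """Per-trial success rate needed to hit a target all-k success rate."""
    return target ** (1.0 / k)

print(all_k_success(0.5, 8))       # a 50% agent asked to succeed 8 times in a row
print(required_per_trial(0.9, 8))  # per-trial rate needed for 90% at 8 of 8
```

Under this assumption a 50% agent clears 8 of 8 less than half a percent of the time, and reaching 90% at pass^8 takes a per-trial rate close to 99%. A measured 5% at pass^8 alongside 50% at pass^1 would suggest that a small set of tasks is solved consistently while most of the rest are hit or miss.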
Builders who don't run their own evaluations are flying blind. Buyers who accept a demo as evidence are buying noise.
The bridge between "the model can do this on a benchmark" and "the system reliably does this in production" is the harness. The only way to inspect that bridge honestly is to measure it.
Part of The Harness Layer series.
This article reflects the state of AI tooling as of May 2026. Specific framework names, APIs, and vendor positioning may have shifted.
Sources
- Yao et al. "τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains" (2024, arXiv:2406.12045)
- Hendrycks et al. "Measuring Massive Multitask Language Understanding" (2020, arXiv:2009.03300)
- Chen et al. "Evaluating Large Language Models Trained on Code" (2021, arXiv:2107.03374)
- Cobbe et al. "Training Verifiers to Solve Math Word Problems" (2021, arXiv:2110.14168)
- Wei et al. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (2022, arXiv:2201.11903)
- Jain et al. "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code" (2024, arXiv:2403.07974)
- UK AI Security Institute: Inspect AI
- EleutherAI: Language Model Evaluation Harness
- OpenAI: Evals framework
- Jimenez et al. "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" (2023, arXiv:2310.06770)
- SWE-bench Verified leaderboard
- METR: HCAST Benchmark
- METR: "Measuring AI Ability to Complete Long Tasks" (19 March 2025)
- Stanford HAI: The 2026 AI Index Report



