A Demo Proves Nothing

In June 2024, researchers at Sierra published a benchmark called τ-bench. They built it to test whether AI agents could handle simulated retail and airline customer service. The agents had to call domain-specific APIs and follow real policy constraints. The kind of thing companies actually deploy.

GPT-4o-class models succeeded on fewer than 50% of tasks.

That's the headline. The number underneath is worse. The pass^8 metric, which measures how often the agent solves the same task in all 8 of 8 independent trials, dropped below 25% in retail. Fewer than one task in four got solved every time. A coin-flip agent doesn't survive contact with production.

This is what evaluation tells you that a demo never can.
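The pass^8 figure can be computed from raw trial outcomes. Here is a minimal sketch, assuming each task was attempted n times with c successes and using the combinatorial estimator C(c, k) / C(n, k) described in the τ-bench paper. The function names and the sample outcomes are illustrative, not taken from the benchmark's code or data.

```python
from math import comb

def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Chance that k trials drawn without replacement from `trials` all succeeded."""
    if k > trials:
        raise ValueError("k cannot exceed the number of trials")
    # comb(successes, k) is 0 when successes < k, so the score drops to 0.
    return comb(successes, k) / comb(trials, k)

def benchmark_pass_hat_k(results: dict[str, list[bool]], k: int) -> float:
    """Average pass^k over tasks; `results` maps task id to per-trial outcomes."""
    scores = [pass_hat_k(sum(r), len(r), k) for r in results.values()]
    return sum(scores) / len(scores)

# Hypothetical outcomes: three tasks, eight trials each.
results = {
    "return_item": [True] * 8,              # always succeeds
    "change_address": [True, False] * 4,    # succeeds half the time
    "cancel_order": [True] * 7 + [False],   # fails once in eight
}

print(benchmark_pass_hat_k(results, 1))  # average single-trial success rate
print(benchmark_pass_hat_k(results, 8))  # share of tasks solved in all 8 trials
```

In this made-up data the single-trial rate is close to 80%, yet only one task in three survives eight trials. One failure in eight is enough to zero out a task.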

The Static Benchmark Era

The first generation of LLM evaluation looked like a multiple choice test. In September 2020, Dan Hendrycks and collaborators published MMLU, a 57-subject exam covering elementary math, US history, computer science, law, and professional medicine. About 16,000 questions. Most models at launch scored near the 25% random-guess baseline. GPT-3 hit 43.9%. By 2024 several frontier models cleared 85%.

OpenAI dropped HumanEval the following July. 164 hand-written Python problems with hidden unit tests. Codex, the model behind early Copilot, solved 28.8% on a single try. Given 100 tries per problem, it hit 70.2%.
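The gap between 28.8% and 70.2% is the gap between pass@1 and pass@100. A sketch of the unbiased estimator from the Codex paper, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass the unit tests. The sample counts below are hypothetical.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples passes) from n samples with c correct."""
    if k > n:
        raise ValueError("k cannot exceed the number of samples")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 200 samples drawn, 10 of them passed the tests.
print(pass_at_k(200, 10, 1))    # single-try success rate, about 5%
print(pass_at_k(200, 10, 100))  # success rate when allowed 100 tries
```

Note the contrast with pass^k: pass@k rewards a model for getting lucky once, while pass^k punishes it for failing once.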

A few months later, Karl Cobbe's team at OpenAI published GSM8K, 8,500 grade-school math word problems for measuring step-by-step reasoning. Then in 2022, Jason Wei's chain-of-thought paper from Google Research showed GSM8K accuracy jumping from 18% to 57% just from prompting differently. Same model. The harness around it changed.

The model didn't get smarter. The wrapping did.
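"Prompting differently" meant changing the few-shot exemplars so that each worked answer spells out its reasoning. The sketch below contrasts the two prompt formats. The exemplar is the tennis-ball problem used in the chain-of-thought paper. The helper names and the regex are illustrative assumptions, and no model is called.

```python
import re
from typing import Optional

EXEMPLAR_QUESTION = (
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?"
)
DIRECT_ANSWER = "The answer is 11."
REASONED_ANSWER = (
    "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. "
    "5 + 6 = 11. The answer is 11."
)

def build_prompt(question: str, chain_of_thought: bool) -> str:
    """Build a one-shot prompt; only the exemplar's answer style differs."""
    answer = REASONED_ANSWER if chain_of_thought else DIRECT_ANSWER
    return f"Q: {EXEMPLAR_QUESTION}\nA: {answer}\n\nQ: {question}\nA:"

def extract_answer(completion: str) -> Optional[str]:
    """Pull the final number out of a completion that ends 'The answer is N.'"""
    match = re.search(r"The answer is \$?(-?[\d,]*\d(?:\.\d+)?)", completion)
    return match.group(1).replace(",", "") if match else None

# Hypothetical test question; the same string goes into both prompts.
question = "A baker makes 24 muffins and sells 9. How many muffins are left?"
print(build_prompt(question, chain_of_thought=False))
print(build_prompt(question, chain_of_thought=True))
```

The model weights and the test question are identical in both runs. The only variable is a few sentences of scaffolding in the prompt, and those sentences are part of the harness.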

The Contamination Problem

Static benchmarks live on the public internet. Models train on the public internet. You can see the problem.

By 2024, HumanEval problems had leaked into training corpora. Models kept scoring well, but nobody could tell whether they'd learned to code or memorized the test. Naman Jain's LiveCodeBench paper made this concrete. The authors collected fresh competition problems from LeetCode, AtCoder, and Codeforces with explicit release dates, then tested models only on problems released after their training cutoff. DeepSeek's models showed a sharp drop on LeetCode problems released after September 2023. Earlier scores were partially contaminated.

This isn't an anecdote. It's structural. Any benchmark old enough to matter is old enough to leak.
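The date-based check is simple to run on your own results. A minimal sketch, assuming you have a release date and a pass/fail outcome for each problem: split at the model's training cutoff and compare pass rates on either side. The record type, the cutoff, and the data are hypothetical and not drawn from LiveCodeBench.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ProblemResult:
    problem_id: str
    released: date   # when the problem first appeared publicly
    passed: bool     # whether the model solved it

def pass_rate(results: list[ProblemResult]) -> float:
    """Fraction of problems solved; NaN when the list is empty."""
    return sum(r.passed for r in results) / len(results) if results else float("nan")

def split_by_cutoff(results: list[ProblemResult], cutoff: date) -> tuple[float, float]:
    """Return (pass rate on or before cutoff, pass rate after cutoff)."""
    before = [r for r in results if r.released <= cutoff]
    after = [r for r in results if r.released > cutoff]
    return pass_rate(before), pass_rate(after)

# Hypothetical results for a model with an assumed cutoff of 1 September 2023.
results = [
    ProblemResult("p1", date(2023, 5, 1), True),
    ProblemResult("p2", date(2023, 6, 15), True),
    ProblemResult("p3", date(2023, 8, 20), True),
    ProblemResult("p4", date(2023, 8, 30), False),
    ProblemResult("p5", date(2023, 10, 2), True),
    ProblemResult("p6", date(2023, 11, 11), False),
    ProblemResult("p7", date(2023, 12, 5), False),
    ProblemResult("p8", date(2024, 1, 9), False),
]

seen, unseen = split_by_cutoff(results, cutoff=date(2023, 9, 1))
print(f"on or before cutoff: {seen:.0%}, after cutoff: {unseen:.0%}")
```

A large drop after the cutoff does not prove memorization, since newer problems may simply be harder. It is still a signal that the earlier score deserves less trust.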

What Evaluation Looks Like Now

The frameworks have caught up. The UK AI Security Institute ships Inspect AI with over 200 prebuilt evaluations covering reasoning, knowledge, coding, behavior, multimodal, and agentic tasks. EleutherAI's lm-evaluation-harness powers Hugging Face's Open LLM Leaderboard and gets used internally by NVIDIA, Cohere, BigScience, BigCode, and MosaicML. OpenAI publishes its Evals framework as YAML configs with an open registry that staff review when training new models. Anthropic publishes agent evaluations through its research posts, often built on Inspect AI as the reference implementation.

The hard cases are the agent benchmarks. τ-bench is one. SWE-bench is another. Princeton's Carlos Jimenez and collaborators pulled 2,294 real GitHub issues with matching pull requests from 12 popular Python repos and asked models to produce patches that pass the existing tests. Claude 2 resolved 1.96% of issues at launch in late 2023. By late 2025, frontier models combined with strong agent harnesses had pushed past 60% on the SWE-bench Verified subset.

A 30x improvement in two years. Almost none of it came from the model alone.
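"Pass the existing tests" has a specific meaning in SWE-bench. The tests that the reference pull request fixed must now pass, and the tests that already passed must not break. Here is a sketch of that check, assuming the test outcomes after applying the model's patch have already been collected. The parameter names `fail_to_pass` and `pass_to_pass` mirror the dataset's FAIL_TO_PASS and PASS_TO_PASS fields. The function and the test names are illustrative, not the benchmark's own evaluation code.

```python
def is_resolved(
    outcomes: dict[str, bool],
    fail_to_pass: list[str],
    pass_to_pass: list[str],
) -> bool:
    """True if the patch fixes the issue without breaking existing behavior.

    `outcomes` maps test name to pass/fail after the patch is applied.
    A test missing from `outcomes` counts as a failure.
    """
    fixed = all(outcomes.get(test, False) for test in fail_to_pass)
    unbroken = all(outcomes.get(test, False) for test in pass_to_pass)
    return fixed and unbroken

def resolve_rate(flags: list[bool]) -> float:
    """Share of issues resolved across a benchmark run."""
    return sum(flags) / len(flags)

# Hypothetical issue: one test targets the bug, two guard existing behavior.
fail_to_pass = ["test_parse_empty_header"]
pass_to_pass = ["test_parse_basic", "test_parse_unicode"]

good_patch = {"test_parse_empty_header": True, "test_parse_basic": True, "test_parse_unicode": True}
regressing_patch = {"test_parse_empty_header": True, "test_parse_basic": False, "test_parse_unicode": True}

print(is_resolved(good_patch, fail_to_pass, pass_to_pass))        # fixes the bug, breaks nothing
print(is_resolved(regressing_patch, fail_to_pass, pass_to_pass))  # fixes the bug, breaks a test
```

The criterion is all-or-nothing per issue, which is why a patch that looks right in a demo can still score zero.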

The Long-Horizon Question

METR, the Model Evaluation and Threat Research nonprofit, has been pushing this further. Their HCAST benchmark contains 189 software tasks across machine learning, cybersecurity, software engineering, and general reasoning. They paid 140 human contractors to attempt the same tasks. 563 baseline runs total. AI scores get compared to human time-to-completion directly.

In March 2025, METR published a "time horizon" metric: the length of task, measured in human working time, that a model can complete with 50% reliability. The headline finding was that frontier-model time horizons have doubled roughly every 7 months since 2019. If that trend holds, models will be completing tasks that take humans hours by the late 2020s.

If.
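The extrapolation behind that "if" is plain exponential growth. A sketch assuming a constant doubling time, with 7 months as the default. The 60-minute starting horizon is a placeholder for illustration, not a METR measurement.

```python
import math

def projected_horizon_minutes(
    start_minutes: float,
    months_elapsed: float,
    doubling_months: float = 7.0,
) -> float:
    """Task length at 50% reliability after `months_elapsed`, under steady doubling."""
    return start_minutes * 2 ** (months_elapsed / doubling_months)

def months_until(
    target_minutes: float,
    start_minutes: float,
    doubling_months: float = 7.0,
) -> float:
    """Months needed for the horizon to grow from start to target."""
    return doubling_months * math.log2(target_minutes / start_minutes)

start = 60.0  # placeholder: a one-hour horizon today

print(projected_horizon_minutes(start, 14))         # two doublings later
print(months_until(8 * 60, start))                  # months to reach an 8-hour task
print(months_until(8 * 60, start, doubling_months=10.0))  # same target, slower trend
```

The projection is sensitive to the doubling time. Stretching it from 7 to 10 months pushes the same milestone out by about 40%, which is why the trend holding matters more than the current number.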

The Trust Gap

The Stanford AI Index Report for 2026 aggregates this across the field. Performance on OSWorld, a real-computer-task benchmark, rose from roughly 12% in March 2025 to 66.3% in March 2026. Six points off human baseline in a year. The same report noted that the gap between benchmark progress and production deployment stayed large. Most agent demos that score well on benchmarks fail to reach production reliability.

That gap is the point of this whole series.

Benchmarks are necessary. They're not sufficient. Old benchmarks contaminate. New benchmarks oversimplify. Trace-based evaluation on real tasks is more honest but harder to reproduce. And the variance number matters as much as the headline. A 50% pass rate at pass^1 with a 5% pass rate at pass^8 is not "half-working." It's essentially broken under production conditions.

Builders who don't run their own evaluations are flying blind. Buyers who accept a demo as evidence are buying noise.

The bridge between "the model can do this on a benchmark" and "the system reliably does this in production" is the harness. The only way to inspect that bridge honestly is to measure it.

Part of The Harness Layer series.

This article reflects the state of AI tooling as of May 2026. Specific framework names, APIs, and vendor positioning may have shifted.

