Why 'Harness' Won the Naming Fight

In 2021, the EleutherAI project shipped a piece of software called the Language Model Evaluation Harness. It was a test rig. You plugged a model in. It ran that model through a battery of benchmarks. The word "harness" had a specific meaning then: the code wrapped around the model that turned it into something you could actually run.

That word has been mutating ever since.

By mid-2026, "harness" is the most common term for the entire surrounding software stack that turns a frozen model into a usable system. The meaning expanded from a test rig to the whole workshop. It's not the only word in play. People still say "agent runtime," "agent framework," "scaffolding," "agent loop." But "harness" is winning, and it's worth understanding why.

A model is a function. Tokens in, token probabilities out. That's it. A harness is everything else.

A working definition for mid-2026: a harness is the software that turns a model into a usable system by adding state, control flow, tool registration, sandboxing, approval gates, retries, and observability.

I use two harnesses every day. Claude Code from Anthropic and Codex CLI from OpenAI. The model calls inside both of them are not special. They're the same API calls any developer can make. The value lives in the wrapping. A filesystem-aware tool registry. Sandboxed execution so a runaway call can't wipe my repo. Conversation memory. Hook points where I can inject custom behavior. Tracing of every tool call. Gates that require me to approve destructive actions before they run.

Pull the model out. Drop it into a different harness. The behavior changes dramatically. The weights are identical. The product is different.

The Seven Capabilities

By mid-2026, a competent harness is expected to provide roughly seven things.

State management. It keeps track of conversation history, scratchpads, intermediate plans, and tool-call traces across turns. The model has a context window. The harness keeps everything that no longer fits inside it.
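Here is a minimal sketch of that split, not any vendor's implementation. The class name SessionState and its fields are hypothetical, and the trimming rule (keep the last N turns) is an assumption chosen for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class SessionState:
    """Hypothetical per-session state owned by the harness, not the model."""

    history: list = field(default_factory=list)      # every turn, never truncated
    scratchpad: dict = field(default_factory=dict)   # intermediate plans and notes
    tool_traces: list = field(default_factory=list)  # record of tool calls

    def add_turn(self, role: str, content: str) -> None:
        self.history.append({"role": role, "content": content})

    def context_window(self, max_turns: int) -> list:
        """Return only the most recent turns to send to the model."""
        return self.history[-max_turns:] if max_turns > 0 else []


state = SessionState()
for i in range(5):
    state.add_turn("user", f"message {i}")

# The model would see two turns. The harness still holds all five.
window = state.context_window(max_turns=2)
```

A real harness would likely summarize or rank old turns rather than simply dropping them. The point is only that the full record lives outside the model.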

Tool registration. The model needs a callable interface. The harness defines which functions the model can invoke and the JSON Schema that constrains its arguments.
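A rough sketch of what registration can look like. The names register_tool, call_tool, and read_file are made up for illustration, and the validator below checks only required keys and a few primitive types. It is not a full JSON Schema implementation.

```python
from typing import Any, Callable

# Hypothetical registry: tool name -> callable plus its argument schema.
TOOLS: dict = {}

_JSON_TYPES = {"string": str, "boolean": bool}


def register_tool(name: str, schema: dict) -> Callable:
    """Decorator that exposes a function to the model under `name`."""
    def decorator(fn: Callable) -> Callable:
        TOOLS[name] = {"fn": fn, "schema": schema}
        return fn
    return decorator


def _matches(value: Any, json_type: str) -> bool:
    if json_type == "integer":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)
    expected = _JSON_TYPES.get(json_type)
    return expected is None or isinstance(value, expected)


def validate_args(schema: dict, args: dict) -> None:
    """Simplified check: required keys, known keys, primitive types."""
    for key in schema.get("required", []):
        if key not in args:
            raise ValueError(f"missing required argument: {key}")
    properties = schema.get("properties", {})
    for key, value in args.items():
        if key not in properties:
            raise ValueError(f"unexpected argument: {key}")
        if not _matches(value, properties[key].get("type", "")):
            raise ValueError(f"wrong type for argument: {key}")


def call_tool(name: str, args: dict) -> Any:
    """What the harness does when the model asks for a tool call."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    entry = TOOLS[name]
    validate_args(entry["schema"], args)
    return entry["fn"](**args)


@register_tool("read_file", {
    "type": "object",
    "properties": {"path": {"type": "string"}, "max_bytes": {"type": "integer"}},
    "required": ["path"],
})
def read_file(path: str, max_bytes: int = 1024) -> str:
    # Placeholder body so the sketch runs without touching the disk.
    return f"<first {max_bytes} bytes of {path}>"


output = call_tool("read_file", {"path": "notes.txt"})
```

The model never calls read_file directly. It emits a name and arguments, and the harness decides whether that request is well-formed enough to run.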

Sandboxing. The harness runs tools in isolated environments. So when the model decides to delete files, it only deletes files inside an ephemeral container, not your home directory.
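Real sandboxes use containers or virtual machines, which don't fit in a short example. The sketch below shows only one narrow piece of the idea: confining a delete tool to a throwaway workspace directory. The Workspace class is hypothetical, and path confinement alone is not real isolation.

```python
import tempfile
from pathlib import Path


class Workspace:
    """Hypothetical file tool that refuses to act outside its root directory."""

    def __init__(self, root: Path) -> None:
        self.root = root.resolve()

    def _confine(self, relative: str) -> Path:
        # Resolve "..", symlinks, and absolute paths before checking containment.
        target = (self.root / relative).resolve()
        if not target.is_relative_to(self.root):  # requires Python 3.9+
            raise PermissionError(f"path escapes workspace: {relative}")
        return target

    def delete_file(self, relative: str) -> None:
        self._confine(relative).unlink()


with tempfile.TemporaryDirectory() as tmp:
    ws = Workspace(Path(tmp))
    (ws.root / "scratch.txt").write_text("temporary")

    # Deleting inside the workspace is allowed.
    ws.delete_file("scratch.txt")
    deleted = not (ws.root / "scratch.txt").exists()

    # Reaching outside the workspace is refused before anything is touched.
    try:
        ws.delete_file("../outside.txt")
        escaped = True
    except PermissionError:
        escaped = False
```

The workspace itself is ephemeral here: the temporary directory disappears when the block exits, much as a container would be thrown away after the run.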

Approval gates. High-impact actions surface to a human before they execute. File writes. Payments. Outbound emails. The harness decides what needs a human in the loop.
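One way to picture a gate, as a sketch under stated assumptions. The action names in HIGH_IMPACT and the run_with_gate function are invented for this example, and the approver is a plain callback standing in for whatever prompt or UI a real harness would show.

```python
from typing import Callable

# Hypothetical policy: which action kinds need a human to say yes.
HIGH_IMPACT = {"file_write", "payment", "send_email"}


def run_with_gate(
    action: str,
    execute: Callable[[], str],
    approve: Callable[[str], bool],
) -> str:
    """Run `execute` only if the action is low impact or a human approved it."""
    if action in HIGH_IMPACT and not approve(action):
        return f"{action}: blocked pending approval"
    return execute()


def deny_everything(action: str) -> bool:
    # Stand-in for a human who clicks "no".
    return False


blocked = run_with_gate("payment", lambda: "charged card", deny_everything)
allowed = run_with_gate("file_read", lambda: "file contents", deny_everything)
```

The read goes through without asking anyone. The payment never runs. The model requested both in the same way, and the harness treated them differently.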

Retries and error handling. Each tool call gets wrapped in policy. Does this retry on failure? Fall back to a different tool? Escalate? The harness decides.
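A minimal sketch of such a policy, assuming exponential backoff, an optional fallback tool, and an exception as the escalation path. The function call_with_policy and its parameters are hypothetical, not taken from any SDK.

```python
import time
from typing import Any, Callable, Optional


def call_with_policy(
    primary: Callable[[], Any],
    fallback: Optional[Callable[[], Any]] = None,
    max_attempts: int = 3,
    base_delay: float = 0.01,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Retry `primary` with backoff, then try `fallback`, then escalate."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return primary()
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x the base delay
    if fallback is not None:
        return fallback()
    raise RuntimeError("tool failed after retries; escalating") from last_error


attempts = {"count": 0}


def flaky_search() -> str:
    # Fails twice, then succeeds, to exercise the retry path.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("upstream timed out")
    return "results"


def always_down() -> str:
    raise ConnectionError("service unavailable")


def no_sleep(seconds: float) -> None:
    # Skip real waiting so the example runs instantly.
    return None


result = call_with_policy(flaky_search, sleep=no_sleep)
fallback_result = call_with_policy(
    always_down, fallback=lambda: "cached results", sleep=no_sleep
)
```

Catching every Exception is a simplification. A production policy would more likely retry only on errors it knows to be transient.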

Tracing and observability. Every prompt, completion, tool call, and timing gets recorded. So when the agent does something weird at 3am, you can replay it.
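As a sketch, tracing can be as small as a decorator that appends one record per tool call. The names TRACE and traced are hypothetical, and an in-memory list stands in for whatever log store or tracing backend a real harness would write to.

```python
import functools
import time
from typing import Any, Callable

# Hypothetical in-memory trace log, one dict per tool call.
TRACE: list = []


def traced(fn: Callable) -> Callable:
    """Record name, arguments, outcome, and duration of every call to `fn`."""
    @functools.wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        record = {"tool": fn.__name__, "args": args, "kwargs": kwargs}
        start = time.perf_counter()
        try:
            record["result"] = fn(*args, **kwargs)
            return record["result"]
        except Exception as exc:
            record["error"] = repr(exc)
            raise
        finally:
            # Runs on success and on failure, so every call leaves a record.
            record["seconds"] = time.perf_counter() - start
            TRACE.append(record)
    return wrapper


@traced
def add(a: int, b: int) -> int:
    return a + b


@traced
def explode() -> None:
    raise ValueError("boom")


total = add(2, 3)
try:
    explode()
except ValueError:
    pass
```

After those two calls, TRACE holds one record with a result and one with an error, each with a duration. That is the raw material for a replay.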

Memory. Short-term scratchpads inside the context window. Long-term stores (often a vector database or a key-value store) that survive across sessions. Both live in the harness.
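To make the two tiers concrete, here is a small sketch. The LongTermMemory class is hypothetical and uses a JSON file as the key-value store, which is an assumption for the sake of a runnable example. A vector database would need an external dependency.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


class LongTermMemory:
    """Hypothetical key-value store that persists to a JSON file."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.data = json.loads(path.read_text()) if path.exists() else {}

    def remember(self, key: str, value: str) -> None:
        self.data[key] = value
        self.path.write_text(json.dumps(self.data))

    def recall(self, key: str) -> Optional[str]:
        return self.data.get(key)


with tempfile.TemporaryDirectory() as tmp:
    store_path = Path(tmp) / "memory.json"

    # Session one: a scratchpad in process memory plus one durable fact.
    scratchpad = {"current_plan": "refactor parser"}
    session_one = LongTermMemory(store_path)
    session_one.remember("preferred_test_command", "pytest -q")
    del scratchpad, session_one  # the session ends; the scratchpad is gone

    # Session two: a fresh object reading the same file.
    session_two = LongTermMemory(store_path)
    recalled = session_two.recall("preferred_test_command")
    plan = session_two.recall("current_plan")  # never persisted, so None
```

The scratchpad dies with the session. The stored fact comes back in the next one. Neither ever touched the model's weights.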

None of these capabilities live in the model. All of them live in the harness.

The SDKs Are Documenting the Stack

The big labs are now publishing this decomposition explicitly.

Anthropic's Claude Agent SDK overview lists tool use, agent skills, permissions, hooks, sub-agents, and hosting as first-class features of the SDK rather than the model. (The SDK was renamed from "Claude Code SDK" to "Claude Agent SDK" in 2025. The names are still moving.)

OpenAI's Agents SDK shipped a v2 in April 2025 with configurable memory and sandbox-aware orchestration. Its feature list reads almost like Anthropic's: tools, handoffs between agents, guardrails, sessions, tracing.

LangGraph, the durable-execution graph framework from LangChain, emphasizes execution that survives process restarts, human-in-the-loop interruption, and both short-term and long-term memory.

The vocabularies differ. The decomposition is converging.

The Workshop Around the Worker

The metaphor that fits is the workshop around the worker. The model is the worker. It produces token sequences when prompted. The harness is the workshop. It owns the tools, the workspace, the safety rails, the inventory of past projects, and the rules about what counts as finished.

Move the worker into a different workshop and you get a different product. Same hands. Different shop.

This is also one of the fastest-moving layers of the AI stack. Anthropic renamed its SDK once already. OpenAI is on version 2. LangChain promoted LangGraph from a side project to its flagship runtime in under two years. Any specific name in this article will probably shift again. But the structural decomposition (state, tools, sandbox, approvals, retries, tracing, memory) is more durable than any specific vendor's labels for the parts.

So when someone says "the AI built this" or "the model wrote this code," check what they mean. They usually mean the harness did most of the work. Calling the model at the right moments. Feeding it the right context. Catching its outputs. Sending them through the right guardrails. The model contributed token prediction. The harness contributed everything else.

That's not a smaller claim about what AI is doing. It's just a more accurate one.

Part of The Harness Layer series.

This article reflects the state of AI tooling as of May 2026. Specific framework names, APIs, and vendor positioning may have shifted.
