Most 'AI' in Production Is a Workflow
Anthropic's own engineers tell you not to build an agent when a workflow will do. The deployed wins prove them right.
In December 2024, two engineers at Anthropic, Erik Schluntz and Barry Zhang, published a piece called "Building Effective Agents." Most of it is an argument against building agents.
The core line: "Agentic systems often trade latency and cost for better task performance, and you should consider when this tradeoff makes sense." Translation: don't reach for the loop when a function call will do.
The post catalogs five workflow patterns. Prompt chaining runs LLM calls in a fixed sequence, each feeding the next, with optional gates in between. Routing classifies an input and dispatches it to a specialized prompt. Parallelization fans out subtasks and aggregates the results. Orchestrator-workers uses one LLM to break a problem down and dispatch sub-prompts. Evaluator-optimizer loops a generator against a critic until the critic accepts.
Notice what these have in common. The control flow lives in code. The LLM is a callable function that returns text. The harness decides what to do with the text.
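Here is a minimal sketch of three of those patterns, assuming a hypothetical `llm` callable that takes a prompt and returns text. The stub model, function names, and prompt prefixes are all illustrative assumptions, not code from the Anthropic post. The point is only that the sequence, the gate, the dispatch, and the round limit are ordinary code.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out


def chain(llm: LLM, topic: str) -> str:
    """Prompt chaining: a fixed sequence with a coded gate between steps."""
    outline = llm(f"OUTLINE: {topic}")
    # Gate: plain code, not the model, decides whether the chain continues.
    if len(outline.split()) < 3:
        raise ValueError("outline too thin, stopping the chain")
    return llm(f"DRAFT: {outline}")


def route(llm: LLM, message: str, handlers: dict[str, Callable[[str], str]]) -> str:
    """Routing: the model picks a label, the code picks the handler."""
    label = llm(f"CLASSIFY: {message}").strip().lower()
    handler = handlers.get(label, handlers["other"])  # unknown labels fall through
    return handler(message)


def evaluate_and_optimize(generate: LLM, critique: LLM, task: str,
                          max_rounds: int = 3) -> str:
    """Evaluator-optimizer: a bounded loop whose round limit lives in code."""
    draft = generate(task)
    for _ in range(max_rounds):
        verdict = critique(draft)
        if verdict == "ACCEPT":
            break
        draft = generate(f"{task}\nFeedback: {verdict}")
    return draft


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    if prompt.startswith("OUTLINE:"):
        return "intro, body, conclusion"
    if prompt.startswith("DRAFT:"):
        return "draft based on " + prompt[len("DRAFT: "):]
    if prompt.startswith("CLASSIFY:"):
        return "refund" if "refund" in prompt.lower() else "other"
    return "ok"
```

Swapping `stub_llm` for a real client changes none of the control flow, which is the property the patterns share.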
An agent, by Anthropic's definition, is different. The LLM directs its own process and tool usage, operating independently for extended periods in a loop it controls itself. The loop runs until the model emits a stop signal, hits a step limit, or a human intervenes. Schluntz and Zhang are explicit that agents earn their cost only on "open-ended problems where it's difficult or impossible to predict the required number of steps, and where you can't hardcode a fixed path." Almost everything else is a workflow.
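For contrast, a minimal sketch of that loop. It assumes a hypothetical `model` callable that reads the history and returns either a tool request or a stop action as a plain dict. That action format and the scripted model are assumptions for illustration, not any vendor's API. The harness only enforces the step limit; the model chooses every step.

```python
from typing import Callable, Optional

Action = dict  # hypothetical shape: {"type": "tool", ...} or {"type": "stop", ...}


def run_agent(model: Callable[[list[str]], Action],
              tools: dict[str, Callable[[str], str]],
              goal: str,
              max_steps: int = 10) -> Optional[str]:
    """Run a model-controlled tool loop until it stops or hits the step limit."""
    history = [f"GOAL: {goal}"]
    for _ in range(max_steps):
        action = model(history)  # the model, not the harness, picks the next move
        if action["type"] == "stop":
            return action["answer"]
        tool = tools.get(action["tool"])
        if tool is None:
            result = "error: unknown tool"
        else:
            result = tool(action["input"])
        history.append(f"{action['tool']}({action['input']}) -> {result}")
    return None  # step limit reached; the caller can hand off to a human


def scripted_model(history: list[str]) -> Action:
    """Stand-in model: requests two lookups, then stops with the last result."""
    if len(history) >= 3:
        return {"type": "stop", "answer": history[-1]}
    return {"type": "tool", "tool": "lookup", "input": "order-1"}


demo_tools = {"lookup": lambda order_id: f"status of {order_id}: shipped"}
```

The number of iterations is not visible anywhere in this code. That unpredictability is what the latency and cost pay for.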
The Klarna Story
The biggest "AI replaced humans" headline of 2024 was Klarna. The Klarna AI Assistant, built on OpenAI's API, went live globally in early 2024 and handled 2.3 million conversations in its first month. Klarna's own press release framed that as the equivalent of 700 full-time agents. Average resolution time dropped from 11 minutes to under 2 minutes. Repeat inquiries fell 25%.
That looks like an agent story. It's not.
The deployed system is a workflow. A router classifies the message. A small set of tool calls handles specific actions. The escalation triggers are coded. The script boundaries are coded. The model produces language. The harness produces decisions.
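The shape described above can be sketched as follows. This is not Klarna's code, and its internals are not documented here. The escalation words, tool names, and the `classify` and `phrase` callables are all hypothetical stand-ins for the two places a model would be called.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Reply:
    text: str
    escalated: bool


# Hypothetical triggers and tools; real values would be product decisions.
ESCALATION_WORDS = ("fraud", "lawyer", "complaint")
TOOLS: dict[str, Callable[[str], str]] = {
    "refund_status": lambda order_id: f"Refund for order {order_id} is processing.",
    "payment_due": lambda order_id: f"Payment for order {order_id} is due in 14 days.",
}
HANDOFF = "Connecting you with a human agent."


def handle(message: str, order_id: str,
           classify: Callable[[str], str],
           phrase: Callable[[str], str]) -> Reply:
    """Route one customer message through coded triggers and a tool registry."""
    # Coded escalation trigger: checked before any model call.
    if any(word in message.lower() for word in ESCALATION_WORDS):
        return Reply(HANDOFF, escalated=True)
    intent = classify(message)  # model call 1: pick a label
    tool = TOOLS.get(intent)
    if tool is None:
        # Anything outside the registry goes to a human; the model cannot improvise.
        return Reply(HANDOFF, escalated=True)
    facts = tool(order_id)
    return Reply(phrase(facts), escalated=False)  # model call 2: wording only


def stub_classify(message: str) -> str:
    """Stand-in for a model-based intent classifier."""
    return "refund_status" if "refund" in message.lower() else "unknown"


def stub_phrase(facts: str) -> str:
    """Stand-in for a model that turns tool output into a reply."""
    return f"Thanks for reaching out. {facts}"
```

Every branch that decides what happens to the customer is an `if` statement someone can read, test, and change.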
By mid-2025, Klarna walked the all-AI strategy back and reintroduced human agents for higher-friction cases. The hybrid is still mostly a workflow. The win came from getting the routing and the tool registry right, not from letting the model run open-ended.
Copilot Is a Workflow Too
GitHub Copilot launched as a technical preview on 29 June 2021 and reached general availability in June 2022. For most of its history it was a single-shot prompt. The IDE captures the cursor context. The editor sends a structured request. The model returns completions. The editor inserts them.
No loop. No tool selection by the model. No goal-directed planning. The product won on context construction, not autonomy.
Copilot's later "edit" and "agent" modes added small loops. But the bulk of the product's success was built on a workflow so tight that most users never thought of it as one. The model was a frozen function. The IDE was the harness.
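A minimal sketch of that single-shot shape, under loud assumptions: this is not GitHub's implementation. The `prefix`/`suffix` request fields, the character budgets, and the `model` callable are hypothetical. What it illustrates is that the harness builds the context, makes one call, and inserts the result.

```python
from typing import Callable


def build_request(source: str, cursor: int,
                  max_prefix: int = 2000, max_suffix: int = 500) -> dict[str, str]:
    """Context construction: keep the text nearest the cursor, within a budget."""
    prefix = source[max(0, cursor - max_prefix):cursor]
    suffix = source[cursor:cursor + max_suffix]
    return {"prefix": prefix, "suffix": suffix}


def complete(source: str, cursor: int,
             model: Callable[[dict[str, str]], str]) -> str:
    """One request, one response, one insertion. No loop, no tool selection."""
    request = build_request(source, cursor)
    completion = model(request)
    return source[:cursor] + completion + source[cursor:]


def stub_model(request: dict[str, str]) -> str:
    """Stand-in for a completion model so the sketch runs offline."""
    return "a + b" if request["prefix"].endswith("return ") else ""
```

A usage example: with `source = "def add(a, b):\n    return \n"` and the cursor placed right after `return `, `complete(source, cursor, stub_model)` yields the same text with `a + b` filled in. Most of the engineering effort in a product like this would go into `build_request`, not into the call itself.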
Time Horizons and the Tradeoff
The case for restraint gets sharper as task length grows. METR, a nonprofit that benchmarks frontier models, has been publishing a "time horizon" measurement: the length of task, measured by how long it takes a human, that a model can complete with 50% reliability. Their March 2025 paper showed that time horizons have grown roughly exponentially since 2020. At that 50% threshold, the best 2025 models could handle tasks that take a human 30 to 60 minutes.
By the January 2026 update, the horizon on some software tasks had pushed past several hours.
Two things are true at once. The trajectory is real. The variance is brutal. A workflow that runs in 5 seconds with 99% reliability beats an agent that runs in 5 minutes with 70% reliability for any high-volume use case. The agent earns its cost when the task is genuinely open-ended and a one-off failure is cheap. The workflow earns its cost everywhere else.
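The arithmetic behind that comparison, as a small sketch. The latency and reliability figures are the ones quoted above. The volume of 10,000 requests is an assumption chosen for illustration, and the model ignores retries and parallelism.

```python
def expected_load(requests: int, success_rate: float,
                  seconds_per_run: float) -> tuple[int, float]:
    """Return (expected failures, total run time in hours) for a batch."""
    failures = round(requests * (1 - success_rate))
    hours = round(requests * seconds_per_run / 3600, 1)
    return failures, hours


REQUESTS = 10_000  # assumed volume, not a figure from any source

workflow = expected_load(REQUESTS, success_rate=0.99, seconds_per_run=5)
agent = expected_load(REQUESTS, success_rate=0.70, seconds_per_run=300)

print(workflow)  # (100, 13.9)
print(agent)     # (3000, 833.3)
```

At that volume the agent leaves thirty times as many failures for someone to clean up and consumes sixty times the run time. Neither gap matters much for a single one-off task, which is why the tradeoff flips there.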
The workflow is not a downgrade. It's a different product. The harness around the model is doing real work: classifying, routing, gating, retrying, escalating. That work has a name, and the name is engineering. The LLM is just one of the function calls.
The model is not the product. The workflow is the product. Most of the time, the loop is overkill.
Part of The Harness Layer series.
This article reflects the state of AI tooling as of May 2026. Specific framework names, APIs, and vendor positioning may have shifted.
Sources
- Schluntz, Erik and Zhang, Barry. "Building Effective Agents" (Anthropic, 19 December 2024)
- Klarna. "Klarna AI Assistant Handles Two-Thirds of Customer Service Chats in Its First Month" (2024)
- OpenAI. "Klarna's AI Assistant Does the Work of 700 Full-Time Agents" customer story
- Promptlayer. "Klarna Customer Service: From AI-First to Human-Hybrid Balance" (2025)
- GitHub. "Introducing GitHub Copilot: Your AI Pair Programmer" (29 June 2021)
- METR. "Measuring AI Ability to Complete Long Tasks" (19 March 2025)
- METR. "Time Horizon 1.1" (29 January 2026)



