The Tool Loop and Who Drives It
Tool use is what lets a frozen language model take an action in the real world. The interesting question is not how it works, but who controls the loop.
On 13 June 2023, OpenAI shipped function calling in the chat completions API. Before that day, the only thing a language model could do was write text. After it, the model could ask the application to do something on its behalf.
That sounds like a small thing. It is not.
The mechanic is simple. The developer describes a callable function in JSON Schema. The model, instead of just generating prose, returns a structured JSON object that names the function and fills in its arguments. The application reads that JSON, actually runs the function, and feeds the output back into the next turn of the conversation. The model can then pick the next call. (OpenAI, "Function Calling and Other API Updates", 13 June 2023 (opens in new tab))
Anthropic shipped its own version for Claude in 2024 and has since added server-side tools (web search, code execution, web fetch) and a programmatic tool-calling mode where the model orchestrates tool calls inside a code-execution sandbox to keep the context window from filling up. (Anthropic, "Tool Use with Claude" docs (opens in new tab))
The academic story is older. In February 2023, Schick et al. at Meta AI published Toolformer. They showed a base language model could be fine-tuned in a self-supervised way to insert API calls into its own text. A calculator. A translation engine. A calendar. The model learns when to reach for the tool. (Toolformer, arXiv:2302.04761 (opens in new tab))
Three months later, the Gorilla paper from Berkeley took it further. Patil et al. trained a LLaMA-7B model on a corpus of more than 1,600 ML API descriptions. With a document retriever attached, it outperformed GPT-4 at writing correct API invocations. (Gorilla, arXiv:2305.15334 (opens in new tab))
The point both papers made is the same one. Tool use is not a wired-in feature. It is a learned capability. The model only picks a good tool if it has seen good examples of picking tools.
The Loop
Inside any production harness, the tool loop runs the same shape. Read the tool's output. Reason about it. Decide what to do next. Execute. Read again.
This shape has a name from 1960s military doctrine. The OODA loop. Observe, orient, decide, act.
Colonel John Boyd of the United States Air Force coined it after studying why American F-86 pilots in the Korean War kept winning dogfights against the technically superior Soviet MiG-15. The kill ratio was roughly ten to one. Boyd's argument: whoever could complete observe-orient-decide-act cycles faster than the opponent got inside the opponent's decision loop and won. (Wikipedia, OODA loop (opens in new tab))
A tool loop is an OODA loop. Observe is reading the result back into the context window. Orient is the model's internal reasoning over the new state. Decide is the model picking the next call. Act is the harness executing it.
The framing is older than the framework. The framework just borrows it.
Who's Driving
The interesting design question is not how the loop works. It's who decides what happens at each step.
There are two extremes.
At one end, the model decides every step. It returns a tool-call JSON. The harness executes it. The harness feeds the result back. The model picks the next tool. Repeat. This is the ReAct pattern, named after a 2022 paper by Yao et al. (ReAct, arXiv:2210.03629 (opens in new tab))
At the other end, the orchestrator decides every step. A developer writes a fixed pipeline. Call the model. Run a search. Call the model again with the search results. Write to a database. The model never picks a tool. It just generates text at scheduled points.
Most real systems live in the middle. Anthropic's December 2024 post on building effective agents describes a pattern called "orchestrator-workers." One LLM call decides which sub-task is needed and dispatches it to a specialized worker prompt. Each worker may or may not call its own tools. Some of the control flow is fixed in code. Some of it is decided by the model on the fly. (Anthropic, "Building Effective Agents", 19 December 2024 (opens in new tab))
The reason this distinction matters is reliability.
The further control moves toward the model, the higher the variance of the system. Letting the model decide every step buys flexibility. You pay in unpredictability. The same input does not always produce the same path through your code.
The further control moves toward the orchestrator, the lower the variance. You buy reliability. You pay in brittleness. The first time a user's request does not match your fixed pipeline, the system has nothing to do.
The art of building a usable harness is choosing where to draw the line. Tool by tool. Step by step. Sometimes the model picks. Sometimes the code picks. The teams that ship working products are the ones who think carefully about which is which.
The model is still a frozen function. Tokens in, token probabilities out. The whole question of who decides anything is a question about the code that wraps it.
Part of The Harness Layer series.
This article reflects the state of AI tooling as of May 2026. Specific framework names, APIs, and vendor positioning may have shifted.
Sources
- OpenAI: Function Calling and Other API Updates (13 June 2023) (opens in new tab)
- Anthropic: Tool Use with Claude (opens in new tab)
- Schick et al. "Toolformer: Language Models Can Teach Themselves to Use Tools" (2023, arXiv:2302.04761) (opens in new tab)
- Patil et al. "Gorilla: Large Language Model Connected with Massive APIs" (2023, arXiv:2305.15334) (opens in new tab)
- Yao et al. "ReAct: Synergizing Reasoning and Acting in Language Models" (2022, arXiv:2210.03629) (opens in new tab)
- Anthropic: Building Effective Agents (19 December 2024) (opens in new tab)
- Wikipedia: OODA Loop (opens in new tab)



