When Tool Use Slows Your Agent Down

Jordan Ellis · May 22, 2026 · 6 min read

Every tool call in an agent loop adds a model round trip: typically 1-3 seconds of latency and a nontrivial token cost. Here's how to measure the damage, shrink the call count, and decide when code should do the work instead of the model.

Each tool call in an agent loop costs you roughly 1-3 seconds of round-trip latency plus the token overhead of the result. Four sequential tool calls and you're already 4-12 seconds into a run before the agent has done anything visible. And the cost keeps growing, because every result lengthens the context the model has to read on the next turn.

How much does each tool call actually cost in latency?

A tool call isn't free in any direction. The model has to finish generating the tool invocation, the tool executes, the result comes back, and then the model reads that result and generates the next token. That last part is the part people forget: the model reads the whole accumulated context again on every continuation.

Realistic breakdown for a Claude Sonnet 4 call with a modest context:

Phase	Typical time
Model generates tool call	0.3-0.8s
Tool execution (fast, local)	0.05-0.2s
Result appended to context	~0s
Model reads context + generates next output	0.8-2s
Total per tool call	~1.2-3s

Tool execution itself is usually not the bottleneck. A database query or a file read takes milliseconds. The round trips through the model are the latency source.
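To see how the per-call figures add up over a run, here is a back-of-the-envelope sketch. The function name is made up for illustration, and the default 1.2-3s range is simply the total from the table above, not a measurement of your setup.

```python
def sequential_latency(n_calls, per_call=(1.2, 3.0)):
    """Return the (low, high) total latency in seconds for n sequential tool calls.

    per_call is an assumed (low, high) range for one full model round trip.
    """
    low, high = per_call
    return (round(n_calls * low, 2), round(n_calls * high, 2))


# Four sequential calls at the table's per-call range.
low, high = sequential_latency(4)
print(f"4 calls: {low}-{high}s")
```

Swap in your own measured range once you have per-call timings; the point is only that the total scales with the number of round trips, not with how fast the tools themselves are.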

For MCP-based tools there's an additional layer: the MCP server has to receive the request, run the handler, and respond. A local stdio MCP server adds very little. A remote MCP server over HTTP (Streamable HTTP, or the older SSE transport) adds its own network latency on top of the model round trip.

Does giving an agent more tools help?

Generally, no. It tends to hurt.

When the model sees a large tool list, two things happen. First, the tool definitions go into the context on every turn. Thirty tool definitions might add 2,000-6,000 tokens to every single call in the loop. That's cost and latency you pay regardless of which tools get used.
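The arithmetic is worth doing once for your own agent. A rough sketch, assuming a flat per-definition token count (real definitions vary in size, and the 100-token figure below is an illustrative guess, not a measurement):

```python
def tool_definition_overhead(n_tools, tokens_per_tool, n_turns):
    """Tokens spent re-sending tool definitions across every turn of a run."""
    return n_tools * tokens_per_tool * n_turns


# 30 tools at an assumed ~100 tokens each, over a 10-turn run.
per_turn = tool_definition_overhead(30, 100, 1)
per_run = tool_definition_overhead(30, 100, 10)
print(per_turn, per_run)
```

Under those assumptions the definitions alone cost 3,000 input tokens per turn, which sits inside the 2,000-6,000 range above, and that is before any tool is actually called.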

Second, the model has to do more selection work. With five tools, picking the right one is easy. With thirty, the model sometimes picks the wrong one, calls a broad tool where a narrow one would have been enough, or chains tools unnecessarily to satisfy an ambiguous intent.

The right number of tools for an agent is the number that covers the task surface, nothing more. Giving an agent 30 tools to handle 4 real use cases is a prompt engineering mistake that shows up as latency and cost.

For production agents I try to keep the active tool list under 10 for any single agent scope. If a task genuinely needs more surface area, I split into specialized sub-agents, each with a focused tool set.
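One way to enforce that limit is to make the scaffolding refuse to build an oversized tool list. This is a minimal sketch: the scope names and tool names are hypothetical, and the cap of 10 is just the rule of thumb from the paragraph above.

```python
MAX_TOOLS_PER_SCOPE = 10  # "under 10" rule of thumb, not a hard platform limit

# Hypothetical tool names grouped by sub-agent scope.
TOOL_SCOPES = {
    "billing": ["get_invoice", "list_payments", "issue_refund"],
    "support": ["get_ticket", "search_kb", "add_ticket_note"],
}


def tools_for(scope):
    """Return the tool names for one sub-agent, failing loudly if the list has grown too large."""
    tools = TOOL_SCOPES[scope]
    if len(tools) >= MAX_TOOLS_PER_SCOPE:
        raise ValueError(f"scope {scope!r} has {len(tools)} tools; split it further")
    return tools


print(tools_for("billing"))
```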

When should work happen in code instead of a tool call?

If the logic is deterministic, put it in code.

A tool call that filters a list, formats a date, calculates a total, or applies a simple transformation is unnecessary. The model doesn't need to be involved in decisions that have exactly one right answer given the input. Write that as a function, call it before the agent sees the data, and send the model a cleaner result.

The pattern I've landed on: tools are for things that require external state (reading a file, querying a database, calling an API) or genuine ambiguity (the model needs to decide something). Pure computation and data shaping belong in the scaffolding layer.
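As an illustration of data shaping in the scaffolding layer, the sketch below filters, totals, and date-formats a list of orders before anything reaches the model. The field names (`total`, `created_at`) and the function name are assumptions made up for the example.

```python
from datetime import datetime


def prepare_orders(orders, min_total=0.0):
    """Filter, total, and format orders in code before the model sees them."""
    kept = [o for o in orders if o["total"] >= min_total]
    return {
        "order_count": len(kept),
        "grand_total": round(sum(o["total"] for o in kept), 2),
        "orders": [
            {
                "id": o["id"],
                "total": o["total"],
                # Deterministic formatting: no model involvement needed.
                "date": datetime.fromisoformat(o["created_at"]).strftime("%Y-%m-%d"),
            }
            for o in kept
        ],
    }


raw = [
    {"id": 1, "total": 19.99, "created_at": "2026-05-01T10:30:00"},
    {"id": 2, "total": 5.00, "created_at": "2026-05-03T08:00:00"},
]
summary = prepare_orders(raw, min_total=10)
print(summary)
```

The model then receives a small, already-clean summary instead of a raw list plus three tools to massage it.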

This also applies to validation. If you're sending the model a tool result and then asking it whether the result looks reasonable before proceeding, that's a check you can often write in code for a fraction of the cost.
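A code-side sanity check might look like the following sketch. The required fields and the email rule are hypothetical; the idea is only that the scaffolding decides whether a result is usable before spending a model turn on it.

```python
def validate_user_record(result):
    """Return a list of problems; an empty list means the result can go to the model."""
    if not isinstance(result, dict):
        return ["result is not an object"]
    problems = []
    for field in ("id", "email"):
        if not result.get(field):
            problems.append(f"missing field: {field}")
    email = result.get("email")
    if email and "@" not in email:
        problems.append("email looks malformed")
    return problems


print(validate_user_record({"id": "u1", "email": "ada@example.com"}))
print(validate_user_record({"id": "u1"}))
```

If the list comes back non-empty, the scaffolding can retry the tool or fail fast without asking the model for an opinion.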

Can you run tool calls in parallel?

Yes, and this is one of the highest-leverage improvements available. Claude's tool use supports multiple tool calls in a single model response. If the calls are independent, you can execute them concurrently and only do one model round trip instead of N.

```python
import asyncio

# run_tool(name, args) is assumed to be your scaffolding's async tool executor.


async def fetch_sequential(user_id):
    # Sequential: the model asks for one tool per turn, 3 round trips, ~3-9s total
    result_a = await run_tool("get_user", {"id": user_id})
    result_b = await run_tool("get_orders", {"user_id": user_id})
    result_c = await run_tool("get_preferences", {"user_id": user_id})
    return result_a, result_b, result_c


async def fetch_parallel(user_id):
    # Parallel: 1 round trip for the calls, concurrent execution, ~1-3s total
    # (model returns all three tool_use blocks in one response)
    return await asyncio.gather(
        run_tool("get_user", {"id": user_id}),
        run_tool("get_orders", {"user_id": user_id}),
        run_tool("get_preferences", {"user_id": user_id}),
    )
```

For this to work the model needs to actually return multiple tool calls in one turn. Whether it does depends on the prompt. If you've written system instructions that encourage step-by-step sequential thinking, the model will often call tools one at a time by default. A simple addition like "when multiple pieces of information are needed and independent, request them in the same response" in the system prompt is usually enough to change the behavior.
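On the scaffolding side, handling a multi-tool response means running every `tool_use` block from that response and returning all the results in one turn. The sketch below uses plain dicts shaped like the `tool_use` and `tool_result` content blocks rather than a real SDK response object, and `run_tool` is a stand-in executor, so treat both as assumptions.

```python
import asyncio


async def run_tool(name, args):
    """Stand-in executor; a real one would dispatch to the actual tool."""
    await asyncio.sleep(0.01)
    return {"tool": name, "args": args}


async def execute_tool_calls(content_blocks):
    """Run every tool_use block from one model response concurrently."""
    calls = [b for b in content_blocks if b.get("type") == "tool_use"]
    outputs = await asyncio.gather(*(run_tool(b["name"], b["input"]) for b in calls))
    # One tool_result per tool_use, matched by id, all sent back in a single turn.
    return [
        {"type": "tool_result", "tool_use_id": b["id"], "content": str(out)}
        for b, out in zip(calls, outputs)
    ]


blocks = [
    {"type": "text", "text": "Fetching user context."},
    {"type": "tool_use", "id": "t1", "name": "get_user", "input": {"id": 7}},
    {"type": "tool_use", "id": "t2", "name": "get_orders", "input": {"user_id": 7}},
]
tool_results = asyncio.run(execute_tool_calls(blocks))
print([r["tool_use_id"] for r in tool_results])
```

`asyncio.gather` preserves input order, which is what lets the results be zipped back to their originating calls.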

Not every sequence is parallelizable. If tool B depends on the result of tool A, you can't batch them. Map your call graph before assuming parallelism is available.
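Mapping the call graph can be done mechanically. The sketch below uses the standard library's `graphlib.TopologicalSorter` to group tools into waves, where everything in a wave can run concurrently. The tool names and their dependencies are hypothetical.

```python
from graphlib import TopologicalSorter


def parallel_waves(dependencies):
    """Group tools into waves; every tool in a wave can run concurrently."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to make output stable
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Each key depends on the tools in its set (hypothetical names).
deps = {
    "lookup_user_id": set(),
    "get_orders": {"lookup_user_id"},
    "get_preferences": {"lookup_user_id"},
    "build_summary": {"get_orders", "get_preferences"},
}
waves = parallel_waves(deps)
print(waves)
```

The number of waves is the minimum number of round trips the workflow needs; if it equals the number of tools, there is nothing to parallelize.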

When does a single broad tool beat five granular ones?

When the five granular tools would always get called together.

A common pattern: an agent has get_user, get_user_orders, get_user_address, and get_user_preferences as separate tools. Almost every time it needs user context, it calls all four. That's four round trips (or one batched round trip with four concurrent executions) when a get_user_profile tool that returns the full context object in one call would do it in one.

The tradeoff is specificity vs. call count. Granular tools give the model options. Broad tools reduce the total number of turns and the total prompt size. For a workflow where the access pattern is predictable, consolidate.
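A consolidated tool can be a thin wrapper over the granular lookups. In this sketch the four inner functions are stand-ins returning canned data; the shape of the returned profile object is an assumption, not a prescribed schema.

```python
import asyncio


# Stand-ins for the four granular lookups; real ones would hit a database or API.
async def get_user(user_id):
    return {"id": user_id, "name": "Ada"}


async def get_user_orders(user_id):
    return [{"order_id": 101}]


async def get_user_address(user_id):
    return {"city": "London"}


async def get_user_preferences(user_id):
    return {"newsletter": False}


async def get_user_profile(user_id):
    """One tool call that returns the full user context object."""
    user, orders, address, preferences = await asyncio.gather(
        get_user(user_id),
        get_user_orders(user_id),
        get_user_address(user_id),
        get_user_preferences(user_id),
    )
    return {**user, "orders": orders, "address": address, "preferences": preferences}


profile = asyncio.run(get_user_profile(7))
print(profile)
```

Only `get_user_profile` is exposed to the model, so the tool list shrinks by three definitions and the fan-out happens in code.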

For MCP server design specifically: thin, single-purpose tools make sense for an exploratory agent. For a production workflow agent with a known task shape, broader tools with richer return types will consistently outperform the thin-tool approach on latency.

How do you measure whether tool use is the actual bottleneck?

Log timestamps around every tool call. The model's time-to-first-token is one number; tool execution time is another; time from result return to the next model output is a third. Most agent frameworks don't surface these separately, which means people optimize the wrong thing.

Once you have per-call timings, the patterns become obvious: one slow external API call dominating a 15-call run, or a 200ms tool getting called eight times in a loop because the model keeps re-checking the same state instead of reusing the result it already has.

I log at the scaffolding layer rather than inside the tools, so I get the full round-trip time including model latency. A tool that looks fast in isolation often looks expensive in context.
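A minimal way to get those numbers is a timing context manager wrapped around each phase in the agent loop. The phase names are my own labels, and the `time.sleep` calls stand in for the real model and tool calls. Note that this measures whole phases; a true time-to-first-token figure would need a streaming client.

```python
import time
from contextlib import contextmanager

timings = []


@contextmanager
def timed(phase, **labels):
    """Record wall-clock seconds for one phase of the loop, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings.append({"phase": phase, "seconds": time.perf_counter() - start, **labels})


with timed("model_generate", turn=1):
    time.sleep(0.01)  # stand-in for the model producing the tool call
with timed("tool_execute", turn=1, tool="get_user"):
    time.sleep(0.005)  # stand-in for the tool itself
with timed("model_continue", turn=1):
    time.sleep(0.01)  # stand-in for the model reading the result

for t in timings:
    print(f'{t["phase"]}: {t["seconds"]:.3f}s')
```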

The one change that consistently helps: audit your agent run logs for repeated calls to the same tool with the same arguments. That pattern almost always means the agent is using tool calls as a memory substitute. Fix it by passing the result forward explicitly in the context rather than letting the model re-fetch.
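That audit is a few lines of code if your run log records the tool name and arguments. A sketch, assuming each log entry is a dict with `tool` and `args` keys (your log format will differ):

```python
import json
from collections import Counter


def repeated_calls(call_log):
    """Return {(tool, canonical_args): count} for calls made more than once."""
    counts = Counter(
        # sort_keys makes argument order irrelevant when comparing calls.
        (call["tool"], json.dumps(call["args"], sort_keys=True))
        for call in call_log
    )
    return {key: n for key, n in counts.items() if n > 1}


log = [
    {"tool": "get_status", "args": {"job_id": 42}},
    {"tool": "get_user", "args": {"id": 7}},
    {"tool": "get_status", "args": {"job_id": 42}},
    {"tool": "get_status", "args": {"job_id": 42}},
]
duplicates = repeated_calls(log)
print(duplicates)
```

Some repeats are legitimate, such as polling a job whose status really changes, so treat the output as a list of candidates to inspect rather than a list of bugs.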

FAQ

Does MCP add latency compared to native tool definitions?

A local stdio MCP server adds minimal latency, usually under 50ms per call. The overhead is in process startup if you reinitialize on every run. The call-response cycle is cheap once the server is running. A remote MCP server over HTTP (Streamable HTTP, or the older SSE transport) will add its own network round trip, which can be significant if the server is not colocated with the agent.

How many tool calls is too many for a single agent run?

There's no hard ceiling, but I treat any run that exceeds 15-20 tool calls as a design signal. Either the task scope is too broad for one agent, the tools are too granular, or the agent is doing redundant work it could cache. A run that takes 30 tool calls to do something a human would do in three steps usually has an architecture problem.

Does the model get slower as the context grows during a run?

Yes. Inference time scales with context length. A long agent run that accumulates tool results across many turns will see each subsequent generation take longer than the first, because the model is attending over a larger input. Trimming tool results to only the fields the agent actually needs, rather than passing back full API responses, is one of the cheaper ways to keep context size manageable across a long run.
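Trimming can be as simple as an allowlist applied before the result is appended to the context. The response fields below are invented for illustration.

```python
def trim_result(result, keep):
    """Keep only the fields the agent needs from a tool result."""
    return {k: result[k] for k in keep if k in result}


# Hypothetical full API response with fields the agent never reads.
api_response = {
    "id": 7,
    "status": "shipped",
    "eta": "2026-05-25",
    "raw_events": [{"code": "PU"}, {"code": "IT"}],
    "carrier_metadata": {"depot": "X1", "route": "R9"},
}
trimmed = trim_result(api_response, keep=("id", "status", "eta"))
print(trimmed)
```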
