PylonworksTell us what's eating your time
All posts

What Autonomous AI Agents Can Actually Do in Production

Jordan Ellis6 min read

I break down which autonomous agent tasks hold up in production and which still need a human gate, with current 2026 token costs, rate-limit numbers, and a retry pattern that survives 429s and 529s. The dividing line is whether a test catches the mistake first.

A single runaway agent cost me $340 in Opus tokens before a loop guard caught it. Here is what autonomous agents actually do well in production: bounded, verifiable tasks. Ticket triage, data transforms, code behind a review gate, scheduled jobs that fail loud. They break on open-ended goals with no checkpoint.

What counts as an autonomous agent in production?

An autonomous AI agent is a loop where the model decides which tool to call next, calls it, reads the result, and repeats until it hits a stop condition. That is the whole idea. The model is the planner and the tools are the hands.

In production the word "autonomous" has a ceiling. It means the agent runs without a human approving each step, not that it runs without guardrails. Every agent I run has a step cap, a token budget, a timeout, and a place where it writes its failures. Remove any of those and "autonomous" turns into "unsupervised," which is a different and more expensive thing.

What can autonomous AI agents reliably do today?

The tasks that hold up share one trait: a machine or a person can check the output before it matters.

Task pattern Holds up in production? Why
Ticket and email triage (label, route, draft) Yes A human reads the draft before it sends
Data extraction and transform Yes Schema validation catches malformed rows
Code generation behind review Yes Tests and a PR gate catch the mistakes
Scheduled research and summaries Yes Output is read, not executed
End-to-end "run my business" goals No No checkpoint, errors compound silently
Irreversible actions with no confirm No One bad tool call is permanent

The pattern is dull and it works. I run a daily agent that reads a few dozen inbound emails, classifies each one, and drafts replies. It uses Haiku 4.5 for the classification pass and Sonnet 4.6 only for the drafts. It never sends. A person clicks send. That agent has run for months without an incident because the worst case is a bad draft nobody approves.

What still breaks?

Three failure modes account for almost everything I have seen.

The first is the compounding loop. The agent makes a small wrong assumption on step two, builds on it through step nine, and by then the output is confidently wrong. No single step looks broken in the logs. This is why open-ended goals fail. There is no point where reality corrects the agent.

The second is the silent tool error. A tool returns an empty result or a stale value, the model treats it as truth, and keeps going. Your retry logic does not help here because nothing threw an exception.

The third is cost blowup. An agent stuck in a retry loop or re-reading the same large context on every step burns tokens fast. My $340 incident was an agent that re-sent a 60K-token document on all 14 steps of its loop because I forgot to cache it. That is roughly 840K input tokens at Opus rates for one run that should have cost under a dollar.

The most expensive bug in agent work never throws an exception. It is the agent that keeps going while it is quietly wrong, and you pay for every step of it.

How much does running an autonomous agent cost?

Here are the current Anthropic rates, billed per million tokens, with output at 5x input across the line.

Model Input $/M Output $/M Best for
Opus 4.8 $5 $25 hard reasoning, final review step
Sonnet 4.6 $3 $15 most agent planning and tool loops
Haiku 4.5 $1 $5 classification, routing, high volume

Two levers move the bill more than model choice. Prompt caching cuts cached input by 90 percent, which matters because an agent re-sends its system prompt and context on every step of the loop. Batch processing is 50 percent cheaper for anything that does not need a real-time answer. Run your high-volume classification through Haiku in batch and the cost rounds to nothing.

The trap is defaulting every step to Opus. Most of an agent loop is routing and tool selection that Sonnet handles fine. Reserve Opus for the one step that needs it, usually the final synthesis or a code review pass. (For the record, drafting this post cost about 90 cents of Sonnet plus two web searches. I keep the meter running on everything, including the posts about keeping the meter running.)

How do you handle rate limits mid-task?

You will hit them. Anthropic tiers run 50 RPM at Tier 1, 1,000 at Tier 2, 2,000 at Tier 3, and 4,000 at Tier 4, with separate caps on input and output tokens per minute. A multi-step agent fanning out parallel tool calls hits the per-minute token cap before the request cap.

Two status codes matter. A 429 means you exceeded a rate limit and should honor the Retry-After header. A 529 means the API is overloaded and you should back off and try again. Treat them differently from a 400, which means your request is wrong and retrying will never help.

import time, anthropic

client = anthropic.Anthropic()
for attempt in range(5):
    try:
        msg = client.messages.create(model="claude-sonnet-4-6", ...)
        break
    except anthropic.RateLimitError as e:        # HTTP 429
        wait = float(e.response.headers.get("retry-after", 2 ** attempt))
        time.sleep(min(wait, 60))
    except anthropic.InternalServerError:        # HTTP 529 overloaded
        time.sleep(min(2 ** attempt, 60))
else:
    raise RuntimeError("exhausted 5 retries")

Cap your attempts. Five is plenty. An agent that retries forever is just another cost blowup. Log every retry so you can see when you are systematically over your limit and need a higher tier instead of more backoff.

When does an agent beat a plain script?

Not as often as the demos suggest. If the steps are fixed and the inputs are predictable, write the script. It is cheaper, faster, and it does not hallucinate. An agent earns its cost when the input is messy and the path is not known in advance: parsing inconsistent documents, deciding which of twelve tools fits an ambiguous request, handling the long tail a rule-based system would need a hundred branches for.

A script costs microseconds and zero dollars per run. An agent step costs a network round trip, 800ms to several seconds of latency, and real tokens. Use the agent where that trade buys you something a regex cannot.

What should I change first?

Pick your one running agent and add three things this week: a hard step cap, a token budget that aborts the run, and a log line on every failure that a human will actually read. Then move every step that is just classifying or routing down to Haiku 4.5, and turn on prompt caching for any context you re-send across steps. That combination fixed both the reliability and the cost of every agent I run.


Are autonomous AI agents safe to run without a human?

For reversible, checkable tasks, yes. For anything irreversible (sending money, deleting data, posting publicly), keep a confirmation step. The rule is simple: if a wrong action cannot be undone, a human approves it.

What is the cheapest way to run an agent at volume?

Route classification and routing steps to Haiku 4.5, run them in batch for a 50 percent discount, and cache any context you resend. Reserve Sonnet 4.6 for the planning loop and Opus 4.8 for the single step that needs deep reasoning.

How many retries should an autonomous agent do?

Cap at five with exponential backoff, honoring Retry-After on a 429 and backing off on a 529. Never retry a 400. An uncapped retry loop is a cost incident waiting to happen.


Tired of re-keying the same data between tools? Pylonworks builds custom automation and internal tools for businesses without a developer, on a fixed quote you approve up front. Tell us what's eating your time

Back to all posts