How I Decide Whether to Automate a Task With AI

Jordan Ellis · June 5, 2026 · 6 min read

The AI task automation decision comes down to one equation most people skip. I break down the real per-run token cost, the verification trap that kills ROI, and a five-factor rubric for choosing what to automate first.

I killed an automation last month that cost $0.42 a run to replace a task that took me 90 seconds by hand. Automating a task with AI only pays off when (volume × time saved × your rate) beats (build cost + per-run cost + the cost of catching its mistakes). Most tasks fail that test. Automate the boring, high-volume, error-tolerant work first.

When is a task worth automating with AI?

A task is worth automating when the monthly labor it saves clears the monthly cost to run it, verify it, and keep it alive. That last clause is where people lose money. The model fee is the small number. The expensive parts are the human time spent checking output and the engineering time spent repairing prompts when an upstream format shifts.

Here is the equation I run before building anything:

net = labor_saved − run_cost − verification_cost − amortized_build_cost

If net is negative at realistic volume, don't build it. A task that runs 8 times a month and saves 2 minutes each saves you 16 minutes. No agent is worth maintaining for 16 minutes a month. A task that runs 4,000 times a month and saves 2 minutes each saves 133 hours. That one pays for a lot of debugging. The volume term dominates everything else, so estimate it honestly before you fall in love with the idea.
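One way to make "estimate it honestly" concrete is to solve the equation for the monthly volume at which net reaches zero. This is a minimal sketch, not a definitive model: the function name is mine, and the example inputs (20 build hours, $0.12 a run, 30 seconds of checking, a 6-month amortization window) are hypothetical.

```python
import math

def break_even_runs(minutes_saved, verify_minutes, hourly_rate,
                    cost_per_run, build_hours, amortize_months=6):
    """Monthly runs needed before net value reaches zero, or None if it never does."""
    # Dollars gained per run after paying for checking time and tokens.
    margin = (minutes_saved - verify_minutes) / 60 * hourly_rate - cost_per_run
    if margin <= 0:
        return None  # every run loses money, so more volume only makes it worse
    monthly_build_cost = build_hours * hourly_rate / amortize_months
    return math.ceil(monthly_build_cost / margin)

# Hypothetical inputs, for illustration only.
print(break_even_runs(2, 0.5, 90, 0.12, 20))   # 141
print(break_even_runs(2, 2.0, 90, 0.12, 20))   # None: checking eats all the time saved
```

If your honest volume estimate sits well below the break-even number, the rest of the analysis is moot.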

How much does a single AI task run actually cost?

More than the sticker price, because agents re-send their growing context on every tool-call turn. A 5-turn agent run doesn't bill one prompt. It bills the prompt five times, each turn larger than the last as tool results pile into the context.
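To see how fast that adds up, sum what each turn sends. A rough sketch, assuming a hypothetical 15k-token starting context and about 4k tokens of tool results appended after each turn. Both numbers are illustrative, and the sketch ignores the model's own replies, which also join the context.

```python
def billed_input_tokens(base_tokens, added_per_turn, turns):
    """Total input tokens billed when every turn re-sends the whole context."""
    total = 0
    context = base_tokens
    for _ in range(turns):
        total += context            # the full context is billed again this turn
        context += added_per_turn   # tool results grow the next turn's prompt
    return total

print(billed_input_tokens(15_000, 4_000, 5))   # 115000
print(billed_input_tokens(15_000, 4_000, 1))   # 15000
```

Under those assumptions, five turns bill 115k input tokens even though the final context is only 31k.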

Real example. An agent that makes 5 tool calls can accumulate roughly 115k input tokens and 6k output tokens across the full run. Here is what that costs at current Anthropic rates:

Model	Price per 1M tokens (input / output)	Cost per run (~115k in / 6k out)	Cost per run with prompt caching
Claude Opus 4.8	$5 / $25	~$0.73	~$0.20
Claude Sonnet 4.6	$3 / $15	~$0.43	~$0.12
Claude Haiku 4.5	$1 / $5	~$0.15	~$0.05

Prompt caching matters more than model choice for anything repeated. Cached input is billed at roughly a tenth of the normal input rate, so the same Sonnet run drops from about $0.43 to about $0.12 when nearly all of its input is served from cache. Turn it on before you tune anything else. Then budget for retries: I see 2-4% of agent runs hit a 429 rate limit or a tool timeout and need a re-run, which roughly doubles the token cost on those specific runs.
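The arithmetic behind the table fits in a few lines. This is a sketch under stated assumptions: the function names are mine, `cached_fraction=1.0` is the best case where every input token is a cache hit, and any extra charge for writing to the cache is ignored. Prices come from the table above.

```python
def run_cost(input_tokens, output_tokens, in_price, out_price,
             cached_fraction=0.0, cache_discount=0.90):
    """Dollar cost of one run; prices are per million tokens."""
    cached = input_tokens * cached_fraction
    uncached = input_tokens - cached
    # Cached tokens are billed at (1 - cache_discount) of the input price.
    input_cost = (uncached + cached * (1 - cache_discount)) * in_price / 1_000_000
    output_cost = output_tokens * out_price / 1_000_000
    return input_cost + output_cost

def with_retries(cost, retry_rate):
    """Average cost per run when a share of runs has to be paid for twice."""
    return cost * (1 + retry_rate)

sonnet = run_cost(115_000, 6_000, 3, 15)
sonnet_cached = run_cost(115_000, 6_000, 3, 15, cached_fraction=1.0)
print(f"{sonnet:.4f}")                             # 0.4350
print(f"{sonnet_cached:.4f}")                      # 0.1245
print(f"{with_retries(sonnet_cached, 0.03):.4f}")  # 0.1282
```

At a 3% retry rate the average cost barely moves, which is why retries are a budgeting footnote and caching is the first lever.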

For the record, this post cost about $0.31 in tokens to generate and maybe 35 minutes of human time to brief and fact-check. By my own rubric, that clears the bar without much argument.

What tasks should you automate with AI first?

The ones that are high volume, structurally stable, and cheap to get wrong. In rough priority order:

Classification and routing. Tagging support tickets, sorting inbound leads, labeling logs. The volume is high, and a wrong answer is a one-click fix.
Extraction from messy text. Pulling fields out of emails, invoices, or PDFs that arrive in a predictable shape.
Drafting that a human always reviews anyway. First-pass replies, summaries, release notes. The human edit is already in the loop, so a mediocre run costs minutes, not dollars.
Format transforms. Restructuring data, normalizing entries, converting between schemas.

Notice the pattern. Every one of these runs constantly and survives a bad output without anyone getting hurt.

When does AI automation cost more than it saves?

When checking the output takes as long as doing the task yourself. This is the verification trap, and it is the most common way an automation quietly loses money. You feel productive because the agent ran. You are not productive, because you re-read every line it produced.

If verification takes 80% of the time the task took manually, you didn't automate the task. You added a token bill to it.
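Here is that trap in numbers, as a small sketch. The function name is mine; the inputs reuse figures from this post (a 90-second task, $0.42 a run, a $90 hourly rate) and the verification fractions are illustrative.

```python
def per_run_margin(manual_minutes, verify_fraction, hourly_rate, cost_per_run):
    """Dollars gained per run once checking time and tokens are paid for."""
    # Only the time you do not spend checking counts as saved.
    minutes_saved = manual_minutes * (1 - verify_fraction)
    return minutes_saved / 60 * hourly_rate - cost_per_run

for fraction in (0.2, 0.8, 1.0):
    margin = per_run_margin(1.5, fraction, 90, 0.42)
    print(f"verify {fraction:.0%} of manual time: {margin:+.2f} per run")
```

At 80% verification the run clears about three cents before any build cost. At 100% it loses the full token bill.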

The other money-losers: low volume (under ~10 runs a month rarely justifies the build), high input variance (every input unique means the prompt can't generalize and you're back to babysitting), and specs that change weekly (you'll spend more on prompt repair than the task ever cost by hand). When a task has all three, leave it manual and move on.

What about tasks where a mistake is expensive?

Automate the work, gate the action. There is a difference between letting an agent draft a refund email and letting it issue the refund. Draft freely. Issue behind a human approval, or behind a deterministic check the model can't talk its way past.

For anything irreversible (sending money, deleting records, posting publicly), I keep a human in the loop until I have logged at least a few hundred runs and measured the actual error rate. If the agent is right 99% of the time on a task that runs 4,000 times a month, that is still 40 wrong actions a month. Decide whether 40 mistakes is an annoyance or a lawsuit before you remove the gate.
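A logged error rate is an estimate, so it helps to look at the pessimistic end of it before removing the gate. The sketch below uses the Wilson score interval, which is my swapped-in choice for small error counts and not something the rest of this post depends on. The log of 3 errors in 300 runs is hypothetical.

```python
import math

def wilson_upper(errors, runs, z=1.96):
    """Upper end of the Wilson score interval for an observed error rate."""
    p = errors / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return center + half

def wrong_actions_per_month(error_rate, monthly_runs):
    """Expected number of wrong actions at a given error rate and volume."""
    return error_rate * monthly_runs

print(wrong_actions_per_month(0.01, 4000))            # 40.0

# Hypothetical log: 3 errors observed in 300 gated runs (1% observed).
upper = wilson_upper(errors=3, runs=300)
print(round(wrong_actions_per_month(upper, 4000)))    # 116
```

With only 300 runs logged, an observed 1% is consistent with a true rate near 3%, so plan for the larger number.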

A scoring rubric I actually use

Score the task 1 to 5 on each factor, then add it up.

Factor	Score 1 (lean toward skip)	Score 5 (lean toward automate)
Monthly volume	under 10 runs	over 500 runs
Time saved per run	under 30 seconds	over 10 minutes
Input structure	every input is different	predictable and structured
Error tolerance	mistakes are costly or irreversible	mistakes are cheap to catch and fix
Spec stability	changes weekly	stable for months

Under 15 out of 25, skip it. 15 to 19, run a manual pilot before committing engineering time. 20 and up, build it. The score is a fast screen, and the net equation from earlier is the check behind it. Here is that equation as a small function:

def monthly_net(runs, minutes_saved, hourly_rate,
                cost_per_run, build_hours, verify_minutes,
                amortize_months=6):
    """Monthly net value in dollars; negative means don't build."""
    labor_saved = runs * (minutes_saved / 60) * hourly_rate
    run_cost = runs * cost_per_run
    verify_cost = runs * (verify_minutes / 60) * hourly_rate
    # spread the one-time build over the months you expect it to last
    build_cost = build_hours * hourly_rate / amortize_months
    return labor_saved - run_cost - verify_cost - build_cost

# the task I killed: low volume, near-zero time saved
print(monthly_net(8, 1.5, 90, 0.42, 6, 1))   # about -87, don't build

The task I killed scored a 9. The numbers said no before I wrote a line of prompt. I built it anyway because it was fun, which is the one input the equation can't catch.
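The rubric itself is just as easy to encode. A minimal sketch: the factor keys and function name are mine, and the example scores describe a hypothetical ticket-tagging task, not a real measurement.

```python
FACTORS = ("volume", "time_saved", "input_structure",
           "error_tolerance", "spec_stability")

def rubric_verdict(scores):
    """Sum five 1-5 scores and map the total onto the three bands."""
    missing = set(FACTORS) - set(scores)
    if missing:
        raise ValueError(f"missing factors: {sorted(missing)}")
    if any(not 1 <= scores[f] <= 5 for f in FACTORS):
        raise ValueError("each score must be between 1 and 5")
    total = sum(scores[f] for f in FACTORS)
    if total < 15:
        return total, "skip"
    if total < 20:
        return total, "manual pilot"
    return total, "build"

# Hypothetical scores for a ticket-tagging task, for illustration only.
ticket_tagging = {"volume": 5, "time_saved": 3, "input_structure": 4,
                  "error_tolerance": 5, "spec_stability": 4}
print(rubric_verdict(ticket_tagging))   # (21, 'build')
```

Writing the thresholds down as code has one side benefit: nobody can quietly round a 14 up to a 15.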

How do I estimate volume before the task exists?

Look at how often you do it now. Count a week by hand, multiply by 4.3. If you can't be bothered to count it for one week, the volume is too low to automate. That reluctance is data.
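The 4.3 is just 52 weeks divided by 12 months. A tiny sketch of the estimate, with a hypothetical week of hand-counted tallies:

```python
WEEKS_PER_MONTH = 52 / 12   # about 4.33, rounded to 4.3 in the text

def monthly_volume(daily_counts):
    """Scale one hand-counted week of daily tallies up to a month."""
    return sum(daily_counts) * WEEKS_PER_MONTH

# Hypothetical tally: how many times the task came up each day for a week.
week = [3, 5, 2, 4, 6, 0, 0]
print(round(monthly_volume(week)))   # 87
```

One week is a small sample, so treat the result as an order of magnitude, not a forecast.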

Should I use the cheapest model to save money?

Use the cheapest model that passes your accuracy bar, not the cheapest model period. A Haiku run at $0.05 that you have to re-do on Sonnet costs more than running Sonnet once. Test the cheap model on 50 real inputs first, measure the error rate, then decide.
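That trade-off has a break-even point you can compute from the test batch. A sketch under a generous assumption: every cheap-model failure is detected at no cost and re-run once on the stronger model. The function names are mine, the per-run costs are the cached figures from the pricing table, and the 6 failures out of 50 are hypothetical.

```python
def expected_cost_with_fallback(cheap_cost, strong_cost, cheap_failure_rate):
    """Average cost per task when failed cheap runs are re-done on the stronger model."""
    return cheap_cost + cheap_failure_rate * strong_cost

def max_tolerable_failure_rate(cheap_cost, strong_cost):
    """Failure rate above which starting on the cheap model costs more."""
    return 1 - cheap_cost / strong_cost

def measured_failure_rate(results):
    """Share of test inputs the cheap model got wrong; results is a list of booleans."""
    return results.count(False) / len(results)

# Hypothetical test batch: 50 real inputs, 6 of them failed on the cheap model.
results = [True] * 44 + [False] * 6
rate = measured_failure_rate(results)

print(f"{rate:.2f}")                                           # 0.12
print(f"{expected_cost_with_fallback(0.05, 0.12, rate):.4f}")  # 0.0644
print(f"{max_tolerable_failure_rate(0.05, 0.12):.2f}")         # 0.58
```

On token cost alone the cheap model wins until it fails more than half the time. In practice the accuracy bar bites long before that, because catching the failures is the expensive part.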

Does prompt caching change the automation decision?

For repeated tasks, yes. Cache reads cut the price of the static part of your prompt by about 90%, which took roughly 70% off the total cost of the example run above and can flip a borderline task from negative to positive net value. If the cached version still loses money, the task was never about token cost.

Before you build your next agent, fill in the five-factor table for the task in front of you. If it scores under 15, close the editor and do it by hand. That is the cheapest automation decision you will make all week.

