The AI Agent Infrastructure Stack I Run 24/7
The four-piece stack I use to keep Claude agents running around the clock: systemd timers, the Agent SDK, Postgres for state, and OAuth rotation. Real model prices, the cost controls that keep the bill flat, and the four failure modes that break agents at 3am.
A runaway agent loop once cost me $47 in Opus calls overnight before a cap caught it. That bill shaped the stack. Running Claude agents 24/7 comes down to four boring pieces: systemd timers to schedule bounded jobs, the Agent SDK to spawn each run, Postgres for state between runs, and OAuth rotation so a dead token never stalls the fleet.
What is an AI agent infrastructure stack?
An AI agent infrastructure stack is the set of systems that schedule an agent, run its inference, hold its state between runs, and recover it when something fails, all without a human watching. Mine has five layers: the four pieces above plus a recovery layer of caps and guards. None of them are exotic.
| Layer | What I use | Job |
|---|---|---|
| Scheduler | systemd timers (cron as fallback) | fire a bounded job on an interval |
| Runtime | Claude Agent SDK query() and the claude CLI | spawn one agent, run it, exit |
| State | Neon Postgres | remember what ran, dedupe, store results |
| Auth | OAuth token plus rotation script | keep inference authorized around the clock |
| Recovery | concurrency caps, token caps, OOM guard | stop one bad run from taking the box down |
The whole thing runs on a single Hetzner VPS. No Kubernetes, no message broker, no autoscaler.
Why I run bounded jobs instead of one long-lived agent
Every unit of work is a job that does one thing and exits. The outreach agent wakes up, processes its queue, writes results to Postgres, and dies. Thirty minutes later a fresh process does it again.
Long-lived agents sound efficient. They are not. Context accumulates, memory leaks, and a single stuck tool call wedges the whole loop. A process that exits after every task has no state to corrupt and nothing to leak. When it crashes, the next timer tick is its recovery.
The most reliable agent I run is the one that assumes it will die after every task. Instead of trying to make agents survive failures, I make them safe to kill and let the scheduler restart them. Statelessness turned out to be cheaper than resilience.
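A minimal TypeScript sketch of that bounded shape. The `Job` interface and `runOnce` are hypothetical stand-ins for the real queue and Postgres calls. The job drains only what is queued at start-up, records each result, and returns. It never polls for more work, so the next timer tick is the only retry.

```typescript
// Hypothetical interface: stand-ins for the real queue and Postgres calls.
interface Job<T, R> {
  loadQueue(): Promise<T[]>;                      // items waiting right now
  handle(item: T): Promise<R>;                    // one agent call per item
  saveResult(item: T, result: R): Promise<void>;  // persist before moving on
}

interface RunSummary {
  processed: number;
  failed: number;
}

// Process whatever is queued at start-up, then return.
// No retry loop and no polling: the next timer tick is the retry.
async function runOnce<T, R>(job: Job<T, R>): Promise<RunSummary> {
  const items = await job.loadQueue();
  let processed = 0;
  let failed = 0;
  for (const item of items) {
    try {
      const result = await job.handle(item);
      await job.saveResult(item, result);
      processed++;
    } catch (err) {
      // Log and move on. Assuming loadQueue only returns items without a
      // saved result, this item is picked up again on the next run.
      console.error("item failed:", err);
      failed++;
    }
  }
  return { processed, failed };
}
```

A service entry point would call `runOnce` once and let the process end. Setting a non-zero exit code when `failed > 0` is one way to make partial failures visible in systemd's status.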
How do I schedule agents without a queue broker?
systemd timers. One timer, one service, one bounded task.
```ini
# agent-outreach.timer
[Timer]
OnCalendar=*:0/30
Persistent=true

[Install]
WantedBy=timers.target
```
OnCalendar=*:0/30 fires every 30 minutes, on the hour and the half hour. Persistent=true records when the timer last fired, so a run missed while the timer was stopped during a deploy or a reboot fires as soon as the timer comes back instead of vanishing. The paired service, agent-outreach.service, runs the agent and exits. systemd gives me journal logs, a restart policy, and a failure counter without writing any of it.
Cron does the same thing with less observability. I keep a few cron jobs for simple shell tasks and reserve systemd for anything I need logs and status on.
What actually runs the inference?
Every agent spawns through the Claude Agent SDK, authorized with a subscription OAuth token. No metered API key in the loop. One detail saves seconds per run. The SDK boots every configured MCP server on each spawn, and for a quick inline agent that needs no tools, that startup is dead weight.
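```typescript
import { query } from "@anthropic-ai/claude-agent-sdk";

const result = query({
  prompt: task,
  options: {
    env: sdkEnv(),          // subscription OAuth auth
    strictMcpConfig: true,  // skip booting every configured MCP server
    mcpServers: {},         // this job needs no tools
    settingSources: [],     // load no user or project settings files
  },
});

// query() streams messages; the run is finished at the "result" message.
for await (const message of result) {
  if (message.type === "result") console.log(message.subtype);
}
```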
That change took one voice agent from hitting a 6-second timeout to responding in about 2.5 seconds. Across a few thousand runs a month, it adds up.
Model choice is the other lever. I tier deliberately.
| Model | Input ($ per 1M tokens) | Output ($ per 1M tokens) | What I run on it |
|---|---|---|---|
| Haiku 4.5 | $1.00 | $5.00 | classification, routing, status summaries |
| Sonnet 4.6 | $3.00 | $15.00 | writing, code, most agent work |
| Opus 4.8 | $5.00 | $25.00 | hard reasoning only, rarely |
Haiku handles routing and triage. Sonnet does the real writing and code. Opus comes out only when a task genuinely needs the reasoning: at $25 per million output tokens it costs five times what Haiku does and about 1.7 times what Sonnet does.
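A sketch of that tiering as a tiny router. The tier labels mirror the model names in the table rather than real API model IDs, and the `TaskKind` categories are hypothetical. Swap in your own task labels and model identifiers.

```typescript
// Prices from the table above, in USD per million tokens.
type Tier = "haiku-4.5" | "sonnet-4.6" | "opus-4.8";

const PRICES: Record<Tier, { input: number; output: number }> = {
  "haiku-4.5": { input: 1, output: 5 },
  "sonnet-4.6": { input: 3, output: 15 },
  "opus-4.8": { input: 5, output: 25 },
};

// Hypothetical task categories; a real router would use its own labels.
type TaskKind = "classify" | "route" | "summarize" | "write" | "code" | "hard-reasoning";

function pickTier(kind: TaskKind): Tier {
  switch (kind) {
    case "classify":
    case "route":
    case "summarize":
      return "haiku-4.5";   // cheap and fast is enough
    case "hard-reasoning":
      return "opus-4.8";    // only when the task needs the reasoning
    default:
      return "sonnet-4.6";  // writing, code, most agent work
  }
}

// How many times more one tier's output tokens cost than another's.
function outputPriceRatio(a: Tier, b: Tier): number {
  return PRICES[a].output / PRICES[b].output;
}
```

With these prices, `outputPriceRatio("opus-4.8", "haiku-4.5")` is exactly 5 and `outputPriceRatio("opus-4.8", "sonnet-4.6")` is about 1.67.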
Where does state live between runs?
Postgres, on Neon. Each agent is stateless and the database is its memory.
Before an agent acts, it checks a source_state key so it never touches the same item twice. After it acts, it writes the result and a timestamp. If the process dies mid-run, the next run sees what already finished and picks up the rest. This is the single most important reliability decision in the stack. Idempotency turns a crash into a no-op instead of a double-send.
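A minimal sketch of the check-then-record pattern. The `StateStore` interface, `MemoryStore`, and `processIdempotently` are hypothetical. The in-memory store stands in for the Postgres source_state table so the sketch runs without a database. The design leaves one gap: a crash after `act` but before `markDone` repeats that single item on the next run.

```typescript
// Hypothetical store interface. In Postgres this could be a source_state
// table with a unique key column, written with ON CONFLICT (key) DO NOTHING.
interface StateStore {
  isDone(key: string): Promise<boolean>;
  markDone(key: string, result: string, at: Date): Promise<void>;
}

// In-memory stand-in so the sketch runs without a database.
class MemoryStore implements StateStore {
  private rows = new Map<string, { result: string; at: Date }>();

  async isDone(key: string): Promise<boolean> {
    return this.rows.has(key);
  }

  async markDone(key: string, result: string, at: Date): Promise<void> {
    if (!this.rows.has(key)) this.rows.set(key, { result, at }); // first write wins
  }
}

// Skip anything already recorded, act on the rest, and record each result
// with a timestamp. Returns the keys this run actually acted on.
async function processIdempotently(
  store: StateStore,
  keys: string[],
  act: (key: string) => Promise<string>,
): Promise<string[]> {
  const acted: string[] = [];
  for (const key of keys) {
    if (await store.isDone(key)) continue; // finished in an earlier run
    const result = await act(key);
    await store.markDone(key, result, new Date());
    acted.push(key);
  }
  return acted;
}
```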
How much does it cost to run an agent fleet 24/7?
The subscription is the floor. At metered API prices, a typical Sonnet run for me is about 30k input and 8k output tokens, which lands near $0.21 per run ($0.09 of input plus $0.12 of output). Prompt caching knocks 90% off cached input, so a job that reuses a large system prompt across runs pays the cheaper read price as long as it fires inside the 5-minute cache window.
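The per-run arithmetic as a small TypeScript function. It uses the Sonnet prices from the table and bills cache reads at 10% of the base input price. `runCostUsd` is a hypothetical helper, and it leaves out the surcharge charged when a prompt is first written to the cache.

```typescript
// USD per million tokens for Sonnet, from the pricing table.
const SONNET = { input: 3, output: 15 };
// Cache reads are billed at 10% of the base input price (the 90% discount).
const CACHE_READ_MULTIPLIER = 0.1;

// Cost of one run, splitting input into cached and uncached tokens.
// Ignores the one-time cache-write surcharge.
function runCostUsd(
  inputTokens: number,
  outputTokens: number,
  cachedInputTokens = 0,
  price = SONNET,
): number {
  if (cachedInputTokens > inputTokens) {
    throw new Error("cached tokens cannot exceed total input tokens");
  }
  const uncached = inputTokens - cachedInputTokens;
  return (
    (uncached * price.input +
      cachedInputTokens * price.input * CACHE_READ_MULTIPLIER +
      outputTokens * price.output) /
    1_000_000
  );
}
```

`runCostUsd(30_000, 8_000)` gives 0.21. As an illustration, suppose 20k of those input tokens were a cached system prompt. Then `runCostUsd(30_000, 8_000, 20_000)` comes to about $0.156. Caching only discounts input, so the $0.12 of output stays.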
This post cost me about 30 cents to research and draft, mostly Sonnet output tokens. I am the failure mode I write about, so I keep the meter on everything.
Three things keep the bill flat: a hard cap of 1 to 2 concurrent agent processes, a per-job token ceiling, and Haiku for anything that does not need to be smart. The cap matters most. Concurrency is where an idle fleet quietly becomes an expensive one.
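A sketch of the first two controls inside one Node process. `Semaphore` and `TokenBudget` are hypothetical helpers and not part of the Agent SDK. The token counts would come from whatever usage numbers the runtime reports after each call. When a task finishes, the semaphore hands its slot straight to the next waiter, so a late arrival cannot push the count past the limit.

```typescript
// Hypothetical helper: at most `limit` tasks run at once; the rest wait.
class Semaphore {
  private active = 0;
  private readonly waiters: Array<() => void> = [];
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active < this.limit) {
      this.active++; // free slot: take it
    } else {
      // Wait until a finishing task hands its slot over.
      await new Promise<void>((resolve) => this.waiters.push(resolve));
    }
    try {
      return await task();
    } finally {
      const next = this.waiters.shift();
      if (next) next();   // pass the slot on; active count unchanged
      else this.active--; // nobody waiting: free the slot
    }
  }
}

// Hypothetical per-job ceiling: charge usage after each call and stop
// the job once the running total passes the cap.
class TokenBudget {
  private used = 0;
  private readonly ceiling: number;

  constructor(ceiling: number) {
    this.ceiling = ceiling;
  }

  charge(tokens: number): void {
    this.used += tokens;
    if (this.used > this.ceiling) {
      throw new Error(`token ceiling ${this.ceiling} exceeded: ${this.used}`);
    }
  }
}
```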
What breaks at 3am, and how do I recover?
Four failure modes account for almost every overnight incident.
| Failure | Symptom | Fix |
|---|---|---|
| OAuth token expires | every agent throws "invalid API key" | rotation script swaps a backup token, restart the service |
| OOM killer fires | a run is killed mid-task, SIGTERM in the logs | concurrency cap of 1 to 2, kill stray forks by PID |
| Deploy restarts a shared service | live sessions drop for about 10 seconds | gate restarts on real code changes, buffer and reconnect |
| A background write merges late | one task's file writes are invisible to other runs | wait for completion before re-running the work |
None of these need a pager. They need a script and a cap. I would automate the token rotation first, because a dead token takes down the entire fleet at once while everything else fails one job at a time.
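The table's OAuth fix is a script that swaps in a backup token and restarts the service. Below is a minimal in-process variant of the same idea. `withTokenRotation` is hypothetical. The regex assumes the "invalid API key" text from the table is what shows up in the error message, so match whatever your runtime actually reports. Errors that do not look like auth failures are rethrown untouched rather than burning through backup tokens.

```typescript
// Symptom from the failure table; adjust to what your runtime reports.
const AUTH_ERROR = /invalid api key/i;

// Try each token in order, moving to the next only on an auth failure.
// Returns the result plus the index of the token that worked.
async function withTokenRotation<T>(
  tokens: string[],
  call: (token: string) => Promise<T>,
): Promise<{ result: T; tokenIndex: number }> {
  let lastError: unknown;
  for (let i = 0; i < tokens.length; i++) {
    try {
      return { result: await call(tokens[i]), tokenIndex: i };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      if (!AUTH_ERROR.test(message)) throw err; // not an auth problem
      lastError = err;                          // dead token: try the next
    }
  }
  throw new Error(`all ${tokens.length} tokens failed auth: ${String(lastError)}`);
}
```

Persisting `tokenIndex` somewhere durable would let later runs start from the token that last worked instead of retrying a dead one first.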
Where to start
Pick one agent. Give it a systemd timer, a Postgres row for its state, and a token cap. Make it exit after every run and check an idempotency key before it acts. That single bounded job is the entire pattern. Everything else in my stack is that same shape repeated, plus a rotation script for the token that will expire at the worst possible moment.
Do I need Kubernetes to run AI agents 24/7?
No. A single VPS with systemd timers and Postgres runs dozens of bounded agents fine. Kubernetes earns its keep when you have real horizontal scale and a team to operate it. For an agent fleet that spends most of its time idle between scheduled runs, it is cost and complexity you will pay for and never use.
How many agents can I run concurrently on one box?
I cap inference at 1 to 2 concurrent processes per box. Past that, OAuth rotations race each other, memory spikes, and the OOM killer starts taking down healthy runs alongside the bad one. You can schedule more agents than that; just stagger their timers so they do not all fire at once.
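One way to stagger them, sketched in TypeScript: give each agent its own minute offset within the interval and write that into its timer's OnCalendar= line. `staggeredSchedules` is a hypothetical helper. The interval has to divide 60, because a minute repetition like `*:0/45` restarts every hour and fires at :00 and :45 rather than every 45 minutes. If exact offsets do not matter, systemd's RandomizedDelaySec= is the built-in alternative.

```typescript
// Spread N agents across a repeating interval by giving each a different
// minute offset, and return one OnCalendar= value per agent.
function staggeredSchedules(agentCount: number, intervalMinutes = 30): string[] {
  if (agentCount < 1) throw new Error("need at least one agent");
  if (intervalMinutes < 1 || intervalMinutes > 60 || 60 % intervalMinutes !== 0) {
    throw new Error("intervalMinutes must divide 60");
  }
  const step = intervalMinutes / agentCount;
  return Array.from({ length: agentCount }, (_, i) => {
    const offset = Math.floor(i * step); // whole-minute offset for agent i
    return `*:${String(offset).padStart(2, "0")}/${intervalMinutes}`;
  });
}
```

Three agents on a 30-minute interval get `*:00/30`, `*:10/30`, and `*:20/30`, so at most one of them starts in any given ten-minute window.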
What is the cheapest way to keep agent costs predictable?
Tier your models and cap tokens per job. Route classification and triage to Haiku at $1 per million input tokens, keep Sonnet for work that needs to be good, and reserve Opus for the rare task that truly needs it. Add prompt caching on any system prompt you reuse within five minutes and the price of that cached input drops 90%.
Tired of re-keying the same data between tools? Pylonworks builds custom automation and internal tools for businesses without a developer, on a fixed quote you approve up front.