Building an AI Agent That Actually Works in Production

15 April 2026 aiagentsproductionpythonclaude

The demo worked perfectly. Of course it did: I’d run it five times before showing anyone. Two weeks later, at some unreasonable hour, the same agent got stuck calling the same broken tool over and over, burned through its budget, and produced nothing at all. That gap, between the version that wows a room and the version that quietly falls over when no one is watching, is the entire story of putting AI agents into production. I’ve now run mine, Tree AI, every single day for over a year. Almost everything I actually know about agents I learned in that gap, not in the demo.

TL;DR

A demo proves an agent can succeed once. Production is about what it does when it fails.
Four things break agents in the real world: context bloat, tool failures, infinite loops, and slow persona drift. The fixes are deliberately boring.
The agent that survives production isn’t the most capable one. It’s the one that fails gracefully and recovers cleanly.

What it actually does

Tree AI runs around the clock on the same small server as the rest of my infrastructure. I talk to it: voice in through speech-to-text, a response, voice back out. It handles a slice of my home, keeps an eye on my calendar and email, runs the market-analysis pipeline that publishes to my Telegram channels, and switches personas depending on what I’m asking. It’s the thing I use to run my day, which is why its failures are never theoretical: they happen to me, on a Tuesday, when I needed the thing to work.

The four ways an agent dies

The failures are surprisingly consistent once you’ve lived with one for a while.

The first is context. Long conversations pile up until you hit the limit or the responses quietly degrade into mush. My answer is structured compaction: every so often the conversation gets summarised into a compact object, and the agent starts fresh with that summary as its memory. Not elegant, but it holds.

The second is tool failure. Tools go down, APIs time out, rate limits hit. An agent that crashes the first time a tool fails is useless, because tools fail constantly. The fix is the opposite of clever, and I’ll come back to it.

The third is the loop of doom. The agent gets confused, tries the same action, fails, and tries it again, forever if you let it. So I don’t let it. Every run has a hard ceiling on tool calls; hit the ceiling and it stops and reports what it tried. That single limit has saved me from runaway loops more times than I can count.

The fourth is the quiet one: persona drift. Over a long conversation the agent slowly wanders from how it’s supposed to behave. The language shifts, then the tone, then, the dangerous part, the decisions. The fix is to re-inject the system prompt periodically rather than trusting that the opening instructions still hold a thousand turns later.

An agent that crashes on the first failed tool call is just a demo with extra steps.

The protocol beats the smarts

Here’s the counterintuitive lesson. When a tool fails, do not let the model improvise its way out. Give it a fixed protocol: if tool X fails, try Y, then Z, then say plainly that it couldn’t. Hard-coded fallback chains beat “intelligent” recovery every time, because the model’s improvisation is exactly where the unpredictable, expensive, hard-to-debug behaviour comes from. In an agent, the boring, deterministic path is a feature.

MCP changed the shape of the work

The Model Context Protocol was the best thing to happen to agent development recently. Before it, every capability meant a bespoke integration glued to one client. With MCP you write a server once, and any compatible agent can discover and use it. My home automation, calendar, and file access all expose MCP servers; the agent connects and finds out what’s available on its own. Adding a new ability went from a project to a chore: write the server, restart the agent, done.

What I’d tell someone starting now

Use a real model: the gap between a frontier model and a small open one is enormous for messy real-world tasks, and saving on the model usually costs you more everywhere else. Build observability before features, because you cannot debug what you cannot see. Start with exactly one use case and make it genuinely solid before adding a second. And accept, up front, that it will break in ways you didn’t imagine.

The agent that works in production is never the one with the longest list of capabilities. It’s the one that stays inside its scope, fails without drama, and recovers on its own. What’s the part of your day you’d actually trust to something that fails that gracefully, and what would it have to prove first?

Stack: Python · Claude API · MCP · Whisper · Telegram Bot API

Need something like this for your own business? See how I can help →

What it actually does

The four ways an agent dies

The protocol beats the smarts

MCP changed the shape of the work

What I’d tell someone starting now

Get the AI Audit Kickoff Checklist