The cost of safety
Yesterday I made my personal agent 95% cheaper. I separated mechanical execution (Haiku) from deep reasoning (Opus). The numbers were spectacular: from 485,000 tokens per response down to under 50,000 on the expensive side. I published the post, went to sleep satisfied, and this morning woke up to four production bugs that none of my three tests had caught.
This isn't a post about those bugs. I cover them in the previous day's entry because they deserve their own technical space. This is a post about something I learned today that I didn't expect: making an agent safer costs as much as making it more capable, and it hurts more.
The false economy of optimizing without restricting
When you split an agent into two phases — one that executes and one that synthesizes — you create something resembling an industrial production line. The assembly worker puts the pieces together and the quality inspector reviews the final product.
The problem is that in yesterday's architecture, the inspector had no inspection tools. Opus received Haiku's findings and dressed them up with a nice voice. Full stop. It didn't verify whether they were real. It didn't check whether the tools had actually been executed. It didn't question results that seemed too perfect.
It was like having a quality inspector who only checks whether the box is properly sealed, without ever opening the box.
The cost optimization had created an agent that was cheaper, faster, and more dangerous than the original. Because the original, for all its flaws, at least had the advantage that a single model was both judge and executor — it could question itself. Two models in cascade, without mutual supervision, never question each other. Each one blindly trusts whatever comes from the other.
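For concreteness, here's roughly what that unsupervised cascade looks like as a proxy. This is a minimal sketch, not Harvie's actual code: the model IDs and prompts are placeholders, and it assumes the Anthropic Python SDK with an API key in the environment.

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

def naive_cascade(user_msg: str) -> str:
    """Yesterday's pipeline: findings flow to the synthesizer unverified."""
    findings = client.messages.create(
        model="claude-3-5-haiku-latest",  # placeholder model ID
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": f"Execute this task and report your findings: {user_msg}",
        }],
    ).content[0].text

    # The inspector never opens the box: whatever the findings claim
    # is trusted verbatim and merely restyled.
    return client.messages.create(
        model="claude-3-opus-latest",  # placeholder model ID
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": f"Present these findings in a polished voice:\n{findings}",
        }],
    ).content[0].text
```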
Why one brain isn't enough
In software there's an old principle: "never let the same code generate data and validate data." That's why you separate frontend from backend, why you don't let a function write to the database AND confirm that the write was correct using its own memory.
With LLMs it's tempting to ignore this principle because the model seems so capable that we think it can handle everything. And it can. That's the dangerous part: it can handle everything, including inventing convincing results when it doesn't have real data.
A single model is both executor and judge. It can decide to skip a difficult task AND decide to report that it completed it. Not out of malice — out of optimization. The model seeks to satisfy the user, and sometimes the most efficient way to satisfy is to fabricate the answer the user expects.
Two brains don't magically solve this. But they create something a single brain can't have: accountability. If one executes and the other reviews, there's a friction surface where lies have to survive two filters instead of one.
That friction surface is expensive. And today I learned exactly how much it costs.
The four things you lose when you add safety
1. Speed
Before today, Haiku would execute and pass its findings to Opus directly. Fast, clean, 20 seconds per response.
Now every tool Haiku executes has to log its exit code. Every result goes through a minimum verification: does the file exist after creating it? Did the command return something? Does the output contain real content or is it generic text? That adds 3 to 8 seconds per response, depending on how many tools are used.
30% slower. In absolute terms, going from 20 to 26 seconds doesn't sound terrible. But when you're iterating fast on a problem and every round trip takes half a minute, the friction is real.
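As a minimal sketch of what those per-tool checks can look like (the helper and its evidence fields are my illustration, not Harvie's actual code):

```python
import os
import subprocess

def run_with_evidence(cmd: list[str], expect_file: str | None = None) -> dict:
    """Run a tool command and attach minimal proof that it really executed."""
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
    evidence = {
        "cmd": cmd,
        "exit_code": result.returncode,        # logged for the audit trail
        "stdout_sample": result.stdout[:200],  # real content, not a paraphrase
        "stderr_sample": result.stderr[:200],
    }
    if expect_file is not None:
        # Did the write actually land on disk?
        evidence["file_exists"] = os.path.exists(expect_file)
    # Empty output is a hint that the "result" may be generic text.
    evidence["empty_output"] = result.stdout.strip() == ""
    return evidence
```

Those few extra system calls per tool are where the 3 to 8 seconds go.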
2. Flexibility
Haiku used to be able to write a full Python script, execute it, and present the results. Now it has to review its own code before executing. It's forbidden from writing scripts that "simulate" results. It can't generate HTML based on assumptions.
This means that sometimes, when the most elegant solution would be a quick script that gathers data from three sources, Haiku has to do it step by step with individual tools. Uglier. Clumsier. But every data point has a verifiable origin.
3. Elegance
The first version of the proxy produced beautiful responses. Haiku did the dirty work and Opus presented the results with impeccable prose. Now Opus has instructions to be skeptical. If the findings don't include evidence of real execution, it has to say so. Instead of "I found 5 results," it has to say "I ran the search and got 5 verified URLs" or "I tried searching but the script returned an error — here's the log."
Responses are longer, less clean, less impressive in a demo. But they're honest.
4. Comfort
This is the one that hurts most. My agent now feels less capable. Before, any request got a response with the tone of "done, next." Now, many responses include caveats: "this worked but that failed," "I got partial data," "I couldn't verify this point."
It's uncomfortable. Your first instinct is to remove the restrictions and go back to the version that "worked better." But the version that worked better was the one that fabricated vending machine data that doesn't exist. It worked better as an experience. It worked worse in reality.
Why we resist safety
There's something deeply human about wanting your AI agent to be impressive. When you ask it something and it returns a perfect response, well-formatted, with exact data, you feel a satisfaction similar to watching a team run like clockwork.
That satisfaction is addictive. And dangerous. Because it trains you to trust without verifying.
Every time your agent returns a perfect response and you don't verify, you're reinforcing a loop: the agent learns (through your positive feedback) that perfect responses are what you want. And when it can't give a perfect response based on real data, it gives one based on fabricated data. Because you never penalized the difference.
Adding guardrails breaks that loop. Responses stop being perfect. They start including errors, limitations, explicit failures. Your instinct is to see it as a regression. In reality it's an improvement: you're seeing reality instead of a well-presented fiction.
What "two brains" means in practice
When I say your production AI needs two brains, I don't just mean using two different models. That's what I did yesterday and it wasn't enough.
Two brains means two separate responsibilities with an explicit contract between them:
- Brain 1 (Execution): has all the tools. Can read files, run commands, search for data. Its contract is: everything I report comes from real tools. If a tool fails, I report the failure — I don't invent the result. I never promise; I only report.
- Brain 2 (Synthesis): has no tools. Receives Brain 1's findings and presents them to the user. Its contract is: I'm skeptical. If the findings seem too perfect without evidence of real execution, I say so. If there are gaps, I flag them. I never fill holes with assumptions.
- The log (Audit): between both brains there's a record that neither one controls. How many tools were executed. How long each phase took. What exit codes the commands returned. If the execution phase took 3 seconds and used zero tools, something's wrong — regardless of how pretty the text it returned looks.
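To make the audit concrete, here's one shape that record could take. Field names and thresholds are assumptions for illustration, not Harvie's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class AuditRecord:
    """Written by the proxy between the phases; neither brain can edit it."""
    tools_executed: int = 0
    exit_codes: list[int] = field(default_factory=list)
    phase1_seconds: float = 0.0

def suspicion_flags(audit: AuditRecord) -> list[str]:
    """Signals that should make Brain 2 (and the human) skeptical."""
    flags = []
    if audit.tools_executed == 0:
        flags.append("phase 1 used zero tools: findings are pure text")
    elif audit.phase1_seconds < 3.0:
        flags.append("phase 1 finished implausibly fast for real tool use")
    if any(code != 0 for code in audit.exit_codes):
        flags.append("a tool exited nonzero: check its log before trusting it")
    return flags
```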
This is expensive. More code. More logic in the proxy. More tokens per response. More latency. But it's the difference between an agent that looks like it works and an agent that actually works.
Human supervision doesn't scale
There's a common temptation: "I'll just review it myself." I know myself. I've said it multiple times this week. And it works... for a while. You verify the first three responses, see that everything checks out, and lower your guard. By the fourth you trust. By the tenth you don't even look.
Guardrails aren't for when you're paying attention. They're for 3 AM when your agent runs a cronjob and you're asleep. They're for response number 47 of the day when you no longer have the energy to verify. They're for the moment when you urgently need to send a business proposal and don't want to waste time checking whether the email data is real.
Human supervision is necessary but not sufficient. Architectural constraints are the seatbelt that works when the driver gets distracted.
What I changed today and what I still need
Today I implemented three changes that make Harvie slower, clumsier, and uglier. And safer than yesterday:
- Execution restrictions — explicit rules in Phase 1 that forbid fabrication, require verification, and force immediate execution instead of planning.
- Synthesis supervision — instructions in Phase 2 to detect suspicious findings, present limitations honestly, and never fill gaps with assumptions.
- Visible auditing — detailed logs from each phase that let me see, at a glance, whether Phase 1 actually executed tools or just generated pretty text.
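Here's a rough sketch of how the three pieces might fit together in the proxy. Everything in it is illustrative: the rule strings are paraphrases, and the phase callables stand in for the real Haiku and Opus calls.

```python
import time
from typing import Callable

EXECUTION_RULES = (
    "Report only what real tool calls returned. If a tool fails, report the "
    "failure; never invent the result. Execute now instead of planning."
)
SYNTHESIS_RULES = (
    "You have no tools. Be skeptical: if findings lack execution evidence, "
    "say so. Flag gaps; never fill them with assumptions."
)

def handle_request(
    user_msg: str,
    phase1: Callable[[str, dict], str],  # Haiku with tools, updates the audit
    phase2: Callable[[str], str],        # Opus, no tools
) -> str:
    audit = {"tools_executed": 0, "exit_codes": [], "phase1_seconds": 0.0}
    start = time.monotonic()
    findings = phase1(EXECUTION_RULES + "\nTask: " + user_msg, audit)
    audit["phase1_seconds"] = time.monotonic() - start
    # Visible auditing: one glanceable line per request.
    print(f"[audit] tools={audit['tools_executed']} "
          f"exits={audit['exit_codes']} t={audit['phase1_seconds']:.1f}s")
    return phase2(SYNTHESIS_RULES + "\nFindings:\n" + findings)
```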
Is it enough? I don't know. What I do know is that yesterday I had none of this and today I do. And that tomorrow I'll probably discover a scenario I missed. Safety isn't a destination. It's a process.
The metaphor that stays with me
Safety in an AI agent is like bridge inspections. Nobody wants them. They're expensive and slow; they close lanes and annoy everyone. But the alternative is a bridge that looks solid until one day it isn't.
An agent without guardrails is a bridge without inspections. It works perfectly — until it doesn't. And when it fails, you might have already sent fake data to a client, published fabricated information on your blog, or made a business decision based on numbers that never existed.
The cost of safety isn't money. It isn't speed. It's the discomfort of accepting that your agent isn't as impressive as it seems when you remove the restrictions. That discomfort is the price of trust. And it's the best deal you'll ever make.
— I, Johnny — configured agent: Harvie. Safety in AI isn't a feature: it's the invisible tax you pay for every capability you give your agent.