claude · AI agents · hermes · oauth · proxy · claude-code · productivity
6 min read

When the proxy isn't enough — how I ended up using Claude Code as a backend

Yesterday I was celebrating. The OAuth proxy was working, Harvie was talking through Sonnet, everything was beautiful. Today? Everything broke again.

What went wrong

Yesterday's proxy forwards API requests to api.anthropic.com using the OAuth token from Claude Code. Simple, elegant. But today it started returning this:

You're out of extra usage. Add more at claude.ai/settings/usage and keep going.

It wasn't a rate limit. It wasn't a timeout. A billing error — on requests with more than ~720 input tokens. Small requests worked fine. Harvie's full system prompt (35KB, ~9000 tokens) — instant rejection.

Down the rabbit hole

We spent hours trying everything:

  • Replicating every header that Claude Code sends (x-stainless-*, anthropic-beta: claude-code-20250219, billing headers)
  • Testing different cch hash values in the billing header
  • Testing with the VPS token and the Mac token
  • Curling the API directly, bypassing the proxy
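For reference, the direct-API test looked roughly like this, sketched in Node rather than curl. Only the headers the post names are shown; the billing header carrying the cch hash is deliberately omitted because we never pinned down a value that worked:

```javascript
// Sketch of the headers we replicated from Claude Code's requests.
// The x-stainless-* family comes from Anthropic's SDK client; the
// anthropic-beta value is the one Claude Code sends.
function buildClaudeCodeHeaders(oauthToken) {
  return {
    'authorization': `Bearer ${oauthToken}`,
    'anthropic-version': '2023-06-01',
    'anthropic-beta': 'claude-code-20250219',
    'content-type': 'application/json',
    'x-stainless-lang': 'js',
    // ...the cch billing header went here; exact name/value unverified
  };
}

// Firing the actual request (this is the call that came back with the
// "out of extra usage" billing error for anything over ~720 input tokens):
// fetch('https://api.anthropic.com/v1/messages', {
//   method: 'POST',
//   headers: buildClaudeCodeHeaders(process.env.CLAUDE_OAUTH_TOKEN),
//   body: JSON.stringify({ model: 'claude-sonnet-4', max_tokens: 1024,
//                          messages: [{ role: 'user', content: 'hi' }] }),
// });
```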

Nothing worked. Even the Mac token — the one powering a fully functional Claude Code session — failed when used from curl with the exact same headers.

The real problem

We finally checked the account's organization settings:

Org: joca.dev
  capabilities: ['api']
  billing_type: prepaid
  rate_limit_tier: auto_prepaid_tier_0  ← no credits

Org: [email protected]
  capabilities: ['claude_max', 'chat']  ← NO 'api'!
  billing_type: stripe_subscription
  rate_limit_tier: default_claude_max_5x

Two organizations. The personal one has Claude Max 5x but no API capability. The other has API capability but no credits. API calls get routed to the wrong organization.

Claude Code works because it uses claude_max billing, not api billing. They're different channels. The OAuth token gives you access to both, but direct API calls go through the api channel — which has no credits.

The fix: claude -p as a backend

If Claude Code can use the subscription, why not use Claude Code as the backend?

Hermes → POST /v1/messages → proxy (Node.js) → writes the system prompt as CLAUDE.md → runs: claude -p --output-format stream-json --model opus → translates the response to Anthropic's API format → returns to Hermes

The key: CLAUDE.md. When Claude Code starts, it loads CLAUDE.md from the working directory as part of its system prompt. This content integrates with Claude Code's billing system correctly — no token caps, no "out of extra usage." We tested with 35KB of system prompt and over 25,000 tokens. Works perfectly.

--append-system-prompt has the same 720-token cap as the API. But CLAUDE.md? No limit. Same subscription, different billing path.

The architecture

Telegram → Hermes (agent, tools, memory, SOUL.md) → HTTP proxy (:18791) → writes CLAUDE.md with Hermes' system prompt → launches: claude -p --output-format stream-json → parses structured events (tool_use, text, result) → sends tool progress to Telegram (edits the same message) → returns final response in Anthropic's API format → Hermes sends response to Telegram

The proxy translates between Anthropic's API format (what Hermes speaks) and the Claude Code CLI (what actually talks to the models). Hermes has no idea there's a CLI behind the curtain.
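The translation layer boils down to a fold over the CLI's event stream. A sketch, with the caveat that the event shapes below are simplified assumptions about what claude -p emits (assistant events carrying content blocks, a final result event); the real output has more fields:

```javascript
// Collapse claude -p stream-json events into an Anthropic-style
// /v1/messages response body. Event shapes are simplified assumptions,
// not the CLI's full schema.
function toAnthropicResponse(events, model = 'claude-opus-4') {
  const textParts = [];
  for (const ev of events) {
    if (ev.type === 'assistant') {
      for (const block of ev.message?.content ?? []) {
        if (block.type === 'text') textParts.push(block.text);
      }
    }
  }
  return {
    type: 'message',
    role: 'assistant',
    model,
    content: [{ type: 'text', text: textParts.join('') }],
    stop_reason: 'end_turn',
  };
}
```

Hermes receives that object from POST /v1/messages exactly as if api.anthropic.com had produced it.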

Tool progress in Telegram

Something we lost with this approach: Hermes used to show live tool progress in Telegram (what it's running, file reads, etc.). With claude -p, the tools run inside Claude Code — Hermes never sees them.

The fix: the proxy parses the events from --output-format stream-json, detects tool_use blocks, and sends them directly to Telegram via the bot API. Same visual format as Hermes: a message that gets edited with each new tool, deduped with (x2) counters.
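The dedup logic is a small accumulator over the same event stream. A sketch under the same assumed event shape as above:

```javascript
// Track tool_use blocks from stream-json events and render the Telegram
// progress text, deduping repeated tools with (x2)-style counters.
// Event shape is a simplified assumption about claude -p's output.
function renderToolProgress(events) {
  const counts = new Map();
  for (const ev of events) {
    if (ev.type !== 'assistant') continue;
    for (const block of ev.message?.content ?? []) {
      if (block.type === 'tool_use') {
        counts.set(block.name, (counts.get(block.name) ?? 0) + 1);
      }
    }
  }
  return [...counts.entries()]
    .map(([name, n]) => (n > 1 ? `${name} (x${n})` : name))
    .join('\n');
}
```

Each time a new tool_use event lands, the proxy re-renders this string and calls the Telegram bot API's editMessageText on the same message, so the chat shows one progress message updating in place.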

Trade-offs

What works:

  • Full Claude Max subscription, no API billing
  • Opus with 1M context, 35KB system prompts, tools, everything
  • Tool progress visible in Telegram
  • Request queue (one at a time, the rest wait)
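The request queue from that last bullet is just a promise chain. A minimal sketch (the helper name is mine):

```javascript
// One-at-a-time request queue: each incoming request chains onto the
// tail of a promise, so only one `claude -p` process runs at a time
// and the rest wait their turn.
function createQueue() {
  let tail = Promise.resolve();
  return function enqueue(task) {
    const run = tail.then(() => task());
    // Keep the chain alive even if a task rejects.
    tail = run.catch(() => {});
    return run;
  };
}
```

Since each claude -p run monopolizes CLAUDE.md in the working directory, serializing requests also prevents two conversations from clobbering each other's system prompt.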

What's slower:

  • Each message spawns a new claude -p process (~8-10s startup overhead)
  • No real streaming — the response arrives all at once when Claude finishes
  • Heavier on the VPS (Node.js process + Claude Code per request)

What's lost:

  • The elegance of a simple HTTP proxy
  • Sub-second API response times
  • Real SSE streaming to the client

What I learned

  1. claude_max ≠ api. They're separate billing capabilities. Your Max subscription doesn't include direct API access.
  2. OAuth tokens work for both, but the API routes through the org that has api capability — which might not be the one with your subscription.
  3. CLAUDE.md is magic. It integrates with Claude Code's billing in a way that --system-prompt and --append-system-prompt don't.
  4. --output-format stream-json gives you structured events (tool calls, text deltas, results) — much better than parsing raw text.

Next steps

The claude -p approach works but it's a compromise. The ideal solution would be to add api capability to the Max org — then the original fast proxy works instantly. Until Anthropic bundles API access with Max subscriptions (or lets you enable it), this is the best we've got.

Next up: investigating persistent Claude Code sessions (--input-format stream-json) to eliminate the 8-10s startup per message. Keep a process alive, pipe messages to it. That would make it nearly as fast as the direct API.

What the agent actually did today

All of the above — the proxy debugging, the architecture pivot — was infrastructure. But the point of infrastructure is what runs on top. Here's what Harvie did today while I was at work:

Built an Obsidian vault for OhanaSmart using the LLM Wiki pattern. I shared a Karpathy video about replacing RAG with markdown wikis managed by an LLM. Harvie analyzed the video, extracted the architecture, and applied it to our vending project. The result: a complete vault with 31 documented leads (each with its own page — contact details, objections, next actions), competitive analysis, regulation pages, and pitch strategy. All cross-linked. The idea is that each interaction with a lead improves the wiki, so by lead #15 you've already seen every possible objection.

Analyzed a YouTube video and extracted business insights. Not a summary — strategic conclusions mapped to my three projects. "This pattern could be a premium service for Pimesit.es." "This is how we should structure OhanaSmart's knowledge." That's the filter: if a video doesn't connect to revenue or LOIs, it doesn't make the cut.

Fixed its own bugs. Voice transcription was failing because a config pointed to whisper-1 (OpenAI's cloud model) instead of the local Whisper model name (base). Harvie found the bug, patched it, and moved on. I didn't even know it was broken until it was already fixed.

None of this required me to open a terminal. I sent voice messages from Telegram while the kids were playing. The agent did the rest.

That's the real value of wrapping Claude Code as a backend: not the architecture diagram, but the fact that an AI agent is building knowledge bases, analyzing content, generating media, and fixing itself — all through a Telegram chat, powered by a Claude Max subscription and a hacky Node.js proxy on a 7 EUR/month VPS.


— I, Johnny — configured agent: Harvie. The best architecture is the one that works; the second best, the one you can fix at 3 AM.