The cache write is the villain
What prompt caching actually taught me: it’s not the reads that cost you, it’s re-priming the prefix.
In the last essay I mentioned a tiny tool that turned out to be the most expensive thing in a turn. This is that story, told properly, because it's the clearest lesson I have about how agent economics actually work — and almost everything about it is counterintuitive.
If you take one thing from this: with prompt caching, the cost you optimize is not the reading. It's the re-writing.
How caching actually prices out
Prompt caching lets you mark a chunk of your prompt as cacheable. The first time the model sees it, you pay to write it into the cache. Every time after that, as long as the bytes are identical, you pay only to read it.
The prices are wildly asymmetric, and that asymmetry is the whole game:
- A cache read costs about 0.1× a normal input token. A tenth. Almost free.
- A cache write costs about 1.25× a normal input token (for the short-lived cache; more for the long-lived one).
So a cached token read back is ~12× cheaper than that same token written. Your pal's system prompt, its identity files, its tool definitions — all the big stable stuff it re-sends every single turn — should be getting read at 0.1×, not written at 1.25×. When caching is working, the fixed cost of "who the pal is" nearly disappears. When it's not working, you pay the 1.25× write on the whole thing, every turn, and the meter screams.
The villain, then, is never the read. It's the re-write — paying 1.25× to re-prime a prefix you should have been reading at 0.1×. And the thing about re-writes is they're silent. Nothing errors. The answer comes back correct. You just quietly paid ten-plus times what you should have, and the only way you'd know is to look at the token breakdown and notice the cache-write number is large when it should be zero.
The one invariant
Everything follows from a single rule: caching is a prefix match. The cache key is the exact bytes of your prompt up to the cache marker. Change one byte anywhere in that prefix and everything from that byte onward is a cache miss — re-written at 1.25×, not read at 0.1×.
And the prompt renders in a fixed order: tools, then system prompt, then the conversation. Tools are at the very front. The system prompt is next. The actual conversation — the part that legitimately changes every turn — is last.
Stare at that ordering and the design rule falls out: stable things go first, volatile things go last. Anything that changes turn-to-turn has to live at the end of the prompt, after the cache marker, or it poisons everything behind it.
My expensive mistake
Now the tool. I had a capability that, under certain conditions mid-conversation, added a tool to the pal's toolset for that turn. Conceptually clean. A tool definition is a few hundred tokens — who cares.
But tools render first. At position zero. So conditionally adding a tool changed the very front of the prompt — which meant the system prompt behind it, and everything behind that, no longer matched the cached prefix. One small, occasional, well-intentioned tool was invalidating the cache on the entire prompt prefix: tens of thousands of tokens of stable identity and instructions, re-written at 1.25× instead of read at 0.1×, every time it fired.
It was, by a wide margin, the most expensive line of behavior in the system. And it looked like the cheapest — it was a tiny tool that did a small thing. The cost wasn't in what it did. The cost was in what it moved.
The fixes, and the general shape of them
Two things fixed it, and they generalize.
First: never change the front of the prompt mid-conversation. If the pal needs an occasional capability, the tool lives in the permanent toolset, always present, at a fixed position — not conditionally injected. A tool that's always there is part of the stable prefix and gets cached like everything else. The trigger for using it moves elsewhere; the definition never moves. Same logic forbids swapping models mid-conversation (caches are per-model) and reordering tools (sort them deterministically, always).
Second: move the volatile stuff to the back. The pal's context includes things that genuinely change every turn — its current calendar, the live state of what it's working on. My instinct had been to put that near the top with the rest of the "system" material. Wrong instinct. That live state was sitting inside the stable prefix, so every change to it re-primed everything behind it. Pulling it out of the system block and placing it after the conversation history — past the cache marker — meant the expensive prefix stopped moving. The volatile data is still there every turn; it just no longer poisons the cache. That single reordering cut the re-primed prefix by about a quarter.
The trick that made the memory write nearly free
The prettiest version of this came from memory. The pal compacts its memory periodically — folds recent events into a rolling summary. That compaction is its own model round-trip, and round-trips are the expensive unit (see the previous essay).
But here's the thing: compaction was already riding on a moment where the cache had to be re-primed anyway, for unrelated reasons. The prefix was getting re-written regardless. So instead of treating the memory write as a new expense, I aligned it with the re-prime that was already happening — and used a fixed, always-present wiring so the compaction step reads the big stable context at 0.1× instead of writing it at 1.25×. The result was the compaction turn going from "another full-freight round-trip" to something close to free, because it stopped paying for a cache-write that was redundant with one already on the books.
That's the whole mindset in one move: don't ask "how much does this feature cost." Ask "what does it make the cache re-write, and is that re-write already happening anyway."
What to actually do
If you're building on a cached model, the entire optimization reduces to a short checklist:
- Freeze the front. No timestamps, no per-user IDs, no "current date" interpolated into the system prompt. Those change every request and invalidate everything behind them.
- Don't move tools. No conditional tools, no reordering, no model swaps mid-conversation. Tools are position zero; touching them costs you the whole prefix.
- Push volatile state to the back, after the cache marker, where it can change freely without re-priming anything.
- Watch the cache-write number. If tokens are getting written to cache on requests that share a prefix, something upstream is moving. Find it. It's never where you'd guess.
The mistake I made is the mistake the architecture invites: judging a feature's cost by what it does, when the real cost is in what it moves. The cheapest-looking tool was the most expensive thing I'd built. Once you internalize that the write is the villain and the prefix is sacred, agent economics stop being mysterious and start being something you can actually engineer.