Context Caching Is How AI Labs Are Building the New Lock-In
Anthropic and Google are offering up to 90% cost reductions on repeated tokens. That discount is not a gift — it is a switching-cost mechanism dressed as a cost optimization feature.
The Discount That Costs You Later
When Anthropic and Google started offering steep discounts on repeated context tokens, the engineering community celebrated it as a straightforward win: lower bills, faster responses, better economics for long-context workloads. Both companies have built context caching as a first-class, production-grade API primitive — not a beta feature, not an enterprise add-on, but a core part of how their platforms expect developers to build.
That framing is correct as far as it goes. The discounts are real. The cost savings for the right workload are significant. But treating context caching as a pure cost optimization feature misses what it actually does to your infrastructure over time. Context caching is a switching-cost mechanism. The discount is the entry price. The architectural coupling is the exit cost.
This piece is about the mechanism — and why understanding it should change how you evaluate the feature before you build around it.
What Context Caching Actually Is
At the API level, context caching allows developers to store a block of input tokens — a large system prompt, a reference document, a code base — on the provider's infrastructure and then reuse those precomputed tokens across multiple requests without paying full input processing cost each time.
Google Cloud launched Vertex AI context caching in 2024 specifically to address the cost and latency burden of re-processing repeated tokens. As Google's own product documentation describes it, the feature targets scenarios where "substantial amounts of contextual information — be it a lengthy document, a detailed set of system instructions, a code base — need to be repeatedly sent to the model." The mechanism: customers save and reuse precomputed input tokens, reducing both cost and latency for any workload where large context is stable across requests.
Anthropic built the equivalent into the Claude API under the label "Prompt caching," positioned as a core feature under "Context management" in the developer documentation. Anthropic's official API pricing page lists prompt caching as a first-class priced feature, though per-token cache rates could not be verified from that page directly; the 70–90% cost reduction figure cited here comes from a detailed third-party pricing analysis published in February 2026. That analysis also documents base API rates for Claude Sonnet at $3 per million input tokens and $15 per million output tokens at full price. A workload that reuses a 100,000-token system prompt across thousands of daily requests is looking at a cost reduction that compounds into real money very quickly at enterprise scale.
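To make the mechanism concrete, here is a minimal sketch of what prompt caching looks like against Anthropic's Messages API, assuming the official Python SDK. The model name, prompt text, and user message are placeholders; treat this as an illustration of the documented cache_control parameter, not production code.

```python
# Minimal sketch of Anthropic prompt caching via the Python SDK.
# Model name, prompt, and message are placeholders.
import anthropic

LARGE_SYSTEM_PROMPT = "..."  # placeholder for a large, stable context block

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; use the model you actually deploy
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LARGE_SYSTEM_PROMPT,
            # Marking this block with cache_control asks the API to cache the
            # prompt prefix up to and including it; later requests that send an
            # identical prefix are billed at the discounted cache-read rate.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize section 4 of the reference document."}],
)

# The usage object reports cache writes and cache reads, which is how you
# confirm the discount is actually being applied.
print(response.usage)
```

The important detail is not the syntax. It is that cache placement becomes part of how you structure the prompt itself.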
Both implementations are live, priced, and being actively adopted. Neither company is hiding the ball on the economics. The discount is real. The question is what it costs you architecturally.
The Economic Incentive Chain
Here is the mechanism, stated plainly.
Step one: the discount creates optimization pressure. Any engineering team running an agentic pipeline, a RAG system, or a document-processing workflow with a large stable system prompt is going to feel pressure to cache it. At 70–90% cost reduction on repeated input tokens, the financial case for caching is overwhelming. To illustrate the scale: at the published Claude Sonnet rate of $3 per million input tokens, a 50,000-token system prompt running against 10,000 requests per day produces an uncached input cost of roughly $1,500 per day (around $547,500 annually) for that context block alone. With caching applied at the 70–90% discount range, that figure drops to roughly $55,000–$164,000 per year. Even at a tenth of that request volume, the annual savings still run well into five figures. These are illustrative figures derived from published rates, not independently benchmarked production data — but the order of magnitude is what matters. Engineers optimize for cost. This is rational behavior.
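The arithmetic is simple enough to check yourself. The sketch below reproduces the back-of-envelope numbers above, using the published base rate and the illustrative volumes from this section; nothing in it is measured production data.

```python
# Back-of-envelope cost model for a large cached system prompt.
# All inputs are the illustrative figures from the text, not measured data.
PROMPT_TOKENS = 50_000          # stable system prompt, tokens per request
REQUESTS_PER_DAY = 10_000
INPUT_RATE_PER_MTOK = 3.00      # published Claude Sonnet base input rate, USD
CACHE_DISCOUNTS = (0.70, 0.90)  # advertised cost-reduction range on cached tokens

daily_uncached = PROMPT_TOKENS * REQUESTS_PER_DAY / 1_000_000 * INPUT_RATE_PER_MTOK
annual_uncached = daily_uncached * 365

print(f"uncached: ${daily_uncached:,.0f}/day, ${annual_uncached:,.0f}/year")
for discount in CACHE_DISCOUNTS:
    print(f"at {discount:.0%} discount: ${annual_uncached * (1 - discount):,.0f}/year")

# uncached: $1,500/day, $547,500/year
# at 70% discount: $164,250/year
# at 90% discount: $54,750/year
```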
Step two: optimization creates architectural coupling. To realize those savings, you do not simply flip a setting. You restructure your application around the provider's caching API. For Anthropic, this means implementing the cache_control parameter in your message structure and organizing your prompts so that cacheable content appears in the right position in the context window. For Google, this means creating CachedContent objects via the Gemini API and managing cache TTLs and storage costs as a separate operational concern. These are not identical interfaces. The prompt architecture that maximizes cache hit rate on Anthropic is not the same as the architecture that maximizes it on Google. You are not just writing a system prompt — you are writing a system prompt optimized for a specific provider's cache implementation.
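For contrast, here is what the same stable prompt looks like on the Google side, assuming the google-generativeai Python SDK; the model version, API key handling, TTL, and prompt are placeholders. The cache is an explicit, named server-side object with its own lifecycle, which is precisely the structural difference at issue.

```python
# Sketch of explicit cache management with the Gemini API
# (google-generativeai SDK). Model version, key, TTL, and prompt are placeholders.
import datetime

import google.generativeai as genai
from google.generativeai import caching

LARGE_SYSTEM_PROMPT = "..."  # same text as before, different cache lifecycle

genai.configure(api_key="YOUR_GEMINI_API_KEY")  # placeholder

# The cache is a named server-side object with its own TTL; you pay a
# per-token, per-hour storage rate for as long as it lives.
cache = caching.CachedContent.create(
    model="models/gemini-1.5-pro-001",  # placeholder model version
    display_name="stable-system-prompt",
    system_instruction=LARGE_SYSTEM_PROMPT,
    ttl=datetime.timedelta(hours=1),
)

# Requests run through a model bound to that cache object.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content("Summarize section 4 of the reference document.")
print(response.text)

# Unlike the Anthropic sketch above, creation, expiry, and deletion of the
# cache are explicit operations your own tooling has to schedule and monitor.
```

Two working implementations, two different shapes: one inlines cache markers into the prompt structure, the other manages a separate cached object with its own storage bill and expiry.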
Step three: coupling creates migration cost. This is the part that gets elided in the cost-optimization framing. When you decide to evaluate a different model provider — or when your current provider raises prices, degrades quality, or loses a regulatory battle in a key market — the migration is not "change the API key and update the model name." It is "redesign the prompt architecture to match the new provider's cache implementation, rebuild the cache warming logic, absorb the full input token cost during the warmup period while the new cache fills, and retune any downstream behavior that depended on the old cached context structure." None of that is impossible. All of it is expensive and slow.
The Counterargument: It's Just Text
The strongest objection to this argument is also the most intuitive one: cached prompts are plain text files. There is no proprietary binary format, no compiled artifact, no database schema migration. You can copy your system prompt from one provider to another with a paste operation. Where is the lock-in?
This objection conflates three different kinds of portability: text portability, API portability, and financial portability.
Text portability is real. Your prompt text is yours. No argument there.
API portability is not. The code that manages cache creation, cache invalidation, cache TTL monitoring, and cache-hit-rate optimization is written against a provider-specific interface. That code does not port without rewriting. A team that has built sophisticated cache management tooling around Anthropic's cache_control parameter will need to rewrite that layer from scratch to work with Google's CachedContent lifecycle model — and vice versa. The surface area of that rewrite scales with how seriously you took the optimization.
Financial portability is the most underappreciated problem. Even if your prompt text is identical and your new provider's API is functionally equivalent, there is a warmup cost. The first day on a new provider, your cache is cold. You pay full input token rates for every request until the cache fills. For a large enterprise deployment processing hundreds of thousands of requests daily, that warmup period represents a real transition cost that appears nowhere in the feature's marketing materials. It also means the financial case for migration is always slightly worse than it looks on a spreadsheet, because the spreadsheet assumes steady-state cache hit rates that don't exist on day one of a new deployment.
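That warmup premium is knowable in advance, even if it never appears in the marketing. Here is a rough sketch of the estimate, using assumed volumes, rates, and warmup window in the same spirit as the earlier numbers.

```python
# Rough estimate of the cold-cache warmup premium during a provider migration.
# Every input here is an assumption for illustration, not measured data.
PROMPT_TOKENS = 50_000                # stable context block, tokens per request
REQUESTS_PER_DAY = 200_000            # assumed enterprise-scale volume
INPUT_RATE_PER_MTOK = 3.00            # new provider's full input rate, USD, assumed
STEADY_STATE_DISCOUNT = 0.90          # cache-read discount once the cache is warm
WARMUP_DAYS = 2                       # assumed window before hit rates stabilize

full_rate_cost = (PROMPT_TOKENS * REQUESTS_PER_DAY / 1_000_000
                  * INPUT_RATE_PER_MTOK * WARMUP_DAYS)
steady_state_cost = full_rate_cost * (1 - STEADY_STATE_DISCOUNT)

# The line item that rarely makes it into the migration spreadsheet:
print(f"warmup premium over steady state: ${full_rate_cost - steady_state_cost:,.0f}")
```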
What This Means for Infrastructure Decisions Made Today
None of this means context caching is a bad feature to adopt. For most production workloads with stable large system prompts, the cost savings are large enough that not caching is the wrong decision economically. The feature is valuable. Adopt it.
But adopt it with an accurate model of what it does to your optionality, not with the implicit assumption that you can walk away cleanly if the vendor relationship changes.
Specifically: if you are building a multi-year enterprise platform on top of LLM APIs and you are structuring your prompt architecture primarily around cache efficiency, you are making a bet that your current provider's pricing, quality, and availability will remain acceptable for the duration of that platform's life. That may be a good bet. It is not the same bet as "we can switch providers in a sprint if we need to."
The practical implication for CTOs evaluating this right now: treat context caching adoption the same way you would treat any other infrastructure dependency that has meaningful switching costs — with explicit documentation of what the migration path looks like before you build around the feature, not after. Know what your provider's cache API interface looks like. Know what the equivalent interface looks like at your fallback provider. Estimate the warmup cost for your actual request volume. If those numbers are acceptable, proceed. If they are not, build an abstraction layer over the cache interface now, while the codebase is still small enough to do it cleanly.
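What that abstraction layer might look like is not exotic. A minimal sketch follows, with hypothetical class and method names that belong to no vendor SDK; each adapter wraps one provider's caching interface so the rest of the codebase never touches cache_control blocks or CachedContent objects directly.

```python
# Hypothetical provider-agnostic cache interface. Names are illustrative;
# each adapter hides one vendor's caching mechanics behind the same surface.
from abc import ABC, abstractmethod


class PromptCache(ABC):
    """The only cache surface the application is allowed to depend on."""

    @abstractmethod
    def ensure(self, cache_key: str, prompt_text: str) -> None:
        """Create or refresh the cached context for this key."""

    @abstractmethod
    def complete(self, cache_key: str, user_message: str) -> str:
        """Run a request against the cached context and return the reply."""

    @abstractmethod
    def invalidate(self, cache_key: str) -> None:
        """Drop the cached context (or let it expire, provider permitting)."""


class AnthropicPromptCache(PromptCache):
    """Adapter over cache_control-marked system blocks (sketch only)."""
    ...


class GeminiPromptCache(PromptCache):
    """Adapter over CachedContent objects and their TTLs (sketch only)."""
    ...
```

The point is not this particular interface. The point is that the seam exists before the first provider-specific optimization is written, so the rewrite surface stays contained.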
The Strategy Behind the Feature
Context caching is not a conspiracy. Anthropic and Google are not concealing the lock-in dynamic — they are simply not advertising it, which is exactly what any rational business would do when introducing a feature that both generates genuine customer value and creates structural switching costs. The discount is real. The savings are real. The architectural coupling is also real.
What is notable is that the two most capable frontier model providers have independently converged on the same feature architecture — precomputed token storage as a priced infrastructure primitive, available in production, documented as a first-class API concept. That convergence is not coincidence. It is the shape of a market where the primary competition is now happening at the infrastructure layer, not the capability layer. When two providers both decide that the right way to compete on cost is a feature that also ties developers to their prompt architecture, the correct interpretation is that this feature is serving two purposes simultaneously.
The first purpose is the one they advertise: lower costs, faster responses, better developer economics.
The second purpose is the one this piece is about. Now you know to look for it.