Prompt Caching in Production: What the Numbers Actually Look Like

We added Anthropic prompt caching to a document analysis tool. Here’s the real cost and latency data after 30 days.

Prompt caching sounds like an obvious win on paper — pay less, wait less. The reality is more nuanced. After running it in production for 30 days across a document analysis tool processing roughly 4,000 requests per day, here’s what we actually measured.

What We Were Building

The tool takes uploaded documents (contracts, reports, specs) and answers structured questions about them. The system prompt includes task instructions and a detailed output schema, about 2,800 tokens that stay identical across every request. The variable parts are the document content and the user’s question.

Why Caching Seemed Like a Good Fit

The classic use case for prompt caching is exactly this: a large static prefix followed by a small dynamic suffix. Our system prompt was large, static, and hit on every request. The only reason we hadn’t enabled caching earlier was uncertainty about how much of the cost was actually in that prefix versus the document tokens.

Setting It Up

Enabling caching with the Anthropic SDK is a one-line change per cacheable block:

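import Anthropic from '@anthropic-ai/sdk'

// The client reads ANTHROPIC_API_KEY from the environment by default.
const client = new Anthropic()

const response = await client.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: SYSTEM_PROMPT, // static task instructions and output schema
      // Marks the prompt up to and including this block as a cacheable prefix.
      cache_control: { type: 'ephemeral' },
    },
  ],
  // The document and question change per request, so they stay outside the cached prefix.
  messages: [{ role: 'user', content: userMessage }],
})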

The cache has a 5-minute TTL by default, and each cache hit resets that timer. Our workload is bursty during business hours and quiet overnight, so this meant high hit rates during the day and cold caches every morning.

Measuring Cache Hit Rate

The API response includes usage.cache_read_input_tokens and usage.cache_creation_input_tokens. We logged both to a simple time-series table for the first week before drawing any conclusions.
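As a rough illustration of how those two fields can be turned into a hit rate, here is a minimal sketch. The `UsageRecord` shape and the `summarizeCacheUsage` helper are hypothetical names for this example, not part of the SDK. The sketch assumes one logged record per request and that `input_tokens` counts only the uncached portion of the prompt.

```typescript
// Hypothetical shape of one logged row; field names mirror the API's usage object.
interface UsageRecord {
  input_tokens: number;                // input tokens that were not cached
  cache_read_input_tokens: number;     // tokens served from the cache
  cache_creation_input_tokens: number; // tokens written to the cache
}

interface CacheStats {
  requestHitRate: number; // fraction of requests that read anything from the cache
  tokenHitRate: number;   // fraction of all input tokens that were served from the cache
}

// Hypothetical helper: aggregates logged usage rows into two hit-rate views.
function summarizeCacheUsage(records: UsageRecord[]): CacheStats {
  let hits = 0;
  let cachedTokens = 0;
  let totalTokens = 0;
  for (const r of records) {
    if (r.cache_read_input_tokens > 0) hits += 1;
    cachedTokens += r.cache_read_input_tokens;
    // Total prompt size is the sum of the uncached, cache-read, and cache-written parts.
    totalTokens +=
      r.input_tokens + r.cache_read_input_tokens + r.cache_creation_input_tokens;
  }
  return {
    requestHitRate: records.length === 0 ? 0 : hits / records.length,
    tokenHitRate: totalTokens === 0 ? 0 : cachedTokens / totalTokens,
  };
}
```

The two rates answer different questions. The request-level rate tells you how often the cache was warm, while the token-level rate shows how much of the total input the cached prefix actually represents once document tokens are included.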

The Numbers

Over 30 days with caching enabled versus the 30 days before:

Input token cost: down 61%
Median latency (p50): down 340ms
Tail latency (p95): down 890ms
Cache hit rate during business hours: 78%
Cache hit rate overnight: 9%

Where It Didn’t Help

Requests that arrive more than 5 minutes after the previous one get no benefit. They pay the cache creation price on the entire prefix, which is 25% higher than the normal input-token price. For our overnight batch jobs that process documents on a slow queue, caching was a net negative. We disabled it for that queue specifically.
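To make the break-even point concrete, the following sketch estimates the cost of the cached prefix relative to sending it uncached. The 1.25x write multiplier comes from the paragraph above. The 0.1x read multiplier is an assumption based on Anthropic's published pricing and should be checked against current rates. The function names are illustrative, and the model assumes every cache miss results in a cache write.

```typescript
// Price multipliers relative to the base input-token price.
const CACHE_WRITE_MULTIPLIER = 1.25; // from the text: creation costs 25% more
const CACHE_READ_MULTIPLIER = 0.1;   // ASSUMPTION: verify against current pricing

// Expected cost of the cached prefix per request, where 1.0 means "same as no caching".
// Assumes each request is either a cache read (hit) or a cache write (miss).
function relativePrefixCost(hitRate: number): number {
  return hitRate * CACHE_READ_MULTIPLIER + (1 - hitRate) * CACHE_WRITE_MULTIPLIER;
}

// Hit rate at which caching costs exactly as much as not caching.
function breakEvenHitRate(): number {
  return (CACHE_WRITE_MULTIPLIER - 1) / (CACHE_WRITE_MULTIPLIER - CACHE_READ_MULTIPLIER);
}

console.log(breakEvenHitRate().toFixed(3));       // break-even hit rate
console.log(relativePrefixCost(0.78).toFixed(3)); // business-hours hit rate
console.log(relativePrefixCost(0.09).toFixed(3)); // overnight hit rate
```

Under these assumptions the break-even hit rate is roughly 22%. A 78% hit rate puts the prefix at about 0.35 of its uncached cost, while a 9% hit rate puts it at about 1.15, which is consistent with the overnight queue coming out as a net negative. These figures cover only the cached prefix, not the document tokens, so they are not directly comparable to the overall input cost reduction reported above.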

Practical Recommendations

Prompt caching is worth enabling if your static prefix exceeds the minimum cacheable length (1,024 tokens on many models, higher on some) and your requests regularly arrive less than 5 minutes apart. The implementation cost is minimal. The monitoring cost is where you should spend your time: make sure you’re not accidentally paying cache creation premiums on infrequent paths.

The 5-minute TTL is the number to build your intuition around. If your traffic pattern is spiky or async, profile your inter-request gaps before assuming caching will save money.
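One way to profile inter-request gaps before enabling caching is to replay historical request timestamps against the TTL. The sketch below is a simplification under stated assumptions: `estimateWarmFraction` is a hypothetical helper, all timestamps belong to requests sharing the same cached prefix, every hit resets the TTL, and concurrency effects (a second request arriving before the first has finished writing the cache) are ignored.

```typescript
const DEFAULT_TTL_MS = 5 * 60 * 1000; // the default 5-minute TTL

// Hypothetical helper: given request timestamps (milliseconds since epoch) for one
// cached prefix, estimate the fraction of requests that would find a warm cache.
function estimateWarmFraction(
  timestampsMs: number[],
  ttlMs: number = DEFAULT_TTL_MS,
): number {
  if (timestampsMs.length === 0) return 0;
  const sorted = [...timestampsMs].sort((a, b) => a - b);
  let warm = 0;
  let previous: number | null = null;
  for (const t of sorted) {
    // Every request either writes or refreshes the cache, so only the gap to the
    // immediately preceding request decides whether this one is warm.
    if (previous !== null && t - previous < ttlMs) warm += 1;
    previous = t;
  }
  // The first request is always counted as cold.
  return warm / sorted.length;
}

// Example: six requests at minutes 0, 1, 3, 10, 12, and 30.
const minutes = [0, 1, 3, 10, 12, 30];
console.log(estimateWarmFraction(minutes.map((m) => m * 60_000)));
```

Running this per queue or per endpoint, rather than across all traffic, gives a better picture, since a busy path can mask a slow one that would only ever pay the creation premium.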