Claude Code cache & context cost, explained

Why context is sent at all

The model is stateless between requests — it has no memory of your session. To act on the conversation so far, the client sends the full context with every turn: your instructions, the files it read, every tool result, and all prior model replies. Without this the model would start each turn blind.

Prompt caching makes re-sending that context affordable. The first time a chunk of context appears it's written to the cache; on later turns it's marked as a cache hit and read back at a steep discount instead of being re-billed at full input price.

The four buckets and their prices

Each turn's usage is split into components. Anthropic's documented cache multipliers (always verify for your tier — they can change):

Component	Price (relative to input)	What it covers
Cache read	0.1× input	Context already in the cache, re-read this turn. Paid every turn.
Cache write (5 min)	1.25× input	New context written to the short-lived cache. Paid once.
Cache write (1 hour)	2× input	New context written to the longer-lived cache. Paid once.
Fresh input	1× input	Uncached prompt tokens.
Output	output rate	Tokens the model generates.

So writing context is slightly more expensive than plain input (you pay a premium to store it), and reading it back is much cheaper (0.1×). The trade is worth it because you read far more often than you write — but you still pay that 0.1× on the whole context, every turn.

Why 0.1× still dominates

The cache-read discount is large, which makes it easy to assume re-sent context is cheap. The thing that flips it is repetition. Cache-read cost is, roughly:

cache-read cost ≈ context size × number of turns × input price × 0.1

A modest 10k-token context over 10 turns is negligible. But agentic coding sessions routinely grow to hundreds of thousands of tokens and run for hundreds of turns. Multiply a large context by many turns and the 0.1× rate adds up — frequently past every other line combined. That's why a session can be dominated by cache-read even when the model wrote relatively little code.

Two corollaries:

Context that arrives early is the most expensive, because it gets re-read on every turn that follows. A big file read on turn 3 is re-billed on turns 4, 5, 6, and so on.
Shrinking context mid-session pays off going forward, not retroactively — which is exactly why /compact and fresh sessions work.

Cache writes and cache expiry

Cache entries don't live forever. The 5-minute cache is refreshed as long as you keep working; if a session sits idle past the window, the entry expires and the context has to be re-written (1.25× or 2× input) before it can be read cheaply again. In practice this means long idle gaps can quietly add cache-write charges. Working in focused bursts keeps more context on the cheap read path.

See the split for your sessions

tokenscope reads your local Claude Code logs and shows exactly this breakdown — output vs. cache-read vs. cache-write vs. fresh input — plus how context grew turn by turn. The token counts come straight from your logs; cost is those counts times the prices above (overridable in .tokenscope.json).

npx @wartzar-bee/tokenscope Paste a report in your browser →

Read-only, local, no upload, no telemetry.

Why context is sent at all

The four buckets and their prices

Why 0.1× still dominates

Cache writes and cache expiry

See the split for your sessions

Related guides