First, measure — don't guess
Before changing your workflow, find out where the money is actually going. On most long sessions the dominant line is cache read (context re-sent every turn), but the only way to know your split is to look at your own logs.
npx @wartzar-bee/tokenscope
Paste a report in your browser →
Local, read-only, no upload. It prints the output / cache-read / cache-write split, the context-growth curve, and which tools filled your context.
The checklist
1. Compact or restart long sessions
Cache-read cost scales with context size × number of turns. Once a session has
accumulated a large context, every remaining turn re-pays for it. Running /compact
collapses the history into a summary so subsequent turns carry far less; starting a
fresh session for a new task resets it entirely. This is usually the single
highest-leverage habit. See when to use /compact.
2. Don't keep huge files and tool output resident
Every file the model reads and every verbose command it runs gets written into context, then re-sent on subsequent turns. A few habits help:
- Point the model at the relevant files or functions rather than asking it to read everything.
- Avoid piping enormous logs, full test output, or giant generated files into the conversation — summarize or grep first.
- If a big file was needed once but isn't anymore, a compact or fresh session sheds it.
3. Trim the tools that bloat context
Some tools and MCP servers inject large schemas, directory listings, or verbose results on every relevant turn. tokenscope's tool breakdown shows which tools were called most; if a tool you rarely use is adding a lot of context, disabling it removes that weight from every turn.
4. Match the model to the task
Input price drives every cache-read and cache-write charge, so the model you choose multiplies your whole context cost. Documented default rates (verify for your tier — they can change):
| Model family | Input / 1M | Output / 1M |
|---|---|---|
| claude-opus-4 | $15 | $75 |
| claude-sonnet-4 | $3 | $15 |
| claude-haiku-4 | $1 | $5 |
Reserve the most expensive model for the reasoning-heavy work; a cheaper model for routine edits and exploration cuts the cost of the exact same context proportionally. (These are defaults; always confirm current pricing for your plan.)
5. Let caching work for you
Prompt caching already discounts re-sent context to 0.1× the input price, but the cache has a short lifetime. Long idle gaps can let it expire, forcing a re-write (1.25× for the 5-minute cache, 2× for the 1-hour cache). Working in focused bursts rather than leaving a session idle for long stretches keeps more of your context on the cheap cache-read path.
What not to over-optimize
The model's output — the code it writes — is usually a small slice of the bill, so asking for terser answers rarely moves the needle. Don't sacrifice useful explanations to save output tokens; spend your effort on context size instead. Measure first so you optimize the line that's actually big.