Insights·June 17, 2026·12 min read

Where the tokens go — the real cost of a Claude Code session

What I found reading the transcripts of my own sessions

Claude Code·LLMs·Context Engineering·Prompt Caching·Tokens

A few days ago I decided to understand something that had been bothering me: where the cost goes when I work for hours on end with a coding agent. Instead of guessing, I went to the place where the answer had been all along — the records of my own sessions, saved as files on my computer. This article is what I found there, and what you can do with it.

Why I went looking

I use Claude Code as my main development tool, and many sessions run for hours. Every so often a usage summary shows up telling me things like "long sessions cost more even when cached" or "you used a lot of subagents." Those are useful nudges, but generic — they describe the symptom without giving me the mechanism. And without the mechanism, any adjustment I made would be superstition: dropping something that might not even be the problem.

What I was missing was an understanding of how cost forms, turn by turn, so I could decide based on data instead of hunches. And here's the detail most people never use: the agent saves each session to a local text file in .jsonl format — a file where every line is a JSON object, one per message. Inside each model response there's a usage field that records, precisely, how many tokens that response consumed, broken down by type. A token is the unit the model reads and generates — roughly a chunk of a word. That field is an exact accounting of what each turn cost, sitting right there, waiting to be read.

So I wrote a small script to scan those sessions and add up the numbers. What follows came out of hundreds of real sessions, with no estimation involved: the tool recording itself.
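A minimal sketch of what such a scan can look like. It is not the script behind the numbers in this article. It assumes each model response is a JSON line whose usage object sits under message.usage, and that the counters are named input_tokens, cache_creation_input_tokens, cache_read_input_tokens and output_tokens. Inspect one line of your own files before trusting those names. The function names sum_usage and sum_session_file are made up for illustration.

```python
import json
from collections import Counter
from pathlib import Path

# Assumed counter names inside each response's usage object.
USAGE_KEYS = (
    "input_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
    "output_tokens",
)


def sum_usage(lines):
    """Add up token counts, by type, over an iterable of JSONL lines."""
    totals = Counter({key: 0 for key in USAGE_KEYS})
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or corrupt lines
        if not isinstance(record, dict):
            continue
        # Assumption: the usage object is nested under "message".
        message = record.get("message")
        usage = message.get("usage") if isinstance(message, dict) else None
        if not isinstance(usage, dict):
            continue  # this line records no usage, nothing to add
        for key in USAGE_KEYS:
            totals[key] += usage.get(key) or 0
    return dict(totals)


def sum_session_file(path):
    """Same as sum_usage, reading one session file from disk."""
    with Path(path).open(encoding="utf-8") as handle:
        return sum_usage(handle)
```

Calling sum_session_file on one .jsonl file returns a dictionary with the four totals. Looping over a directory of session files and adding the dictionaries gives the cross-session view.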

How the API actually bills

Before the numbers, the mechanism, because it explains everything that follows.

The API behind the agent is stateless: it keeps no memory between one response and the next. That has a consequence that sounds obvious once you say it out loud, but it changes how you think: on every turn, the client resends the entire conversation to the model. All the system instructions, all the history, all the tool results, all of it again, plus your new message. The model reads the full package, generates the reply, and forgets. On the next turn, the client sends everything once more.

If it charged full price for that, it would be unviable. What makes it possible is prompt caching — a heavy discount for stretches that were already sent before and haven't changed. It works because the start of the conversation is stable: the system instructions on turn 50 are identical to the ones on turn 1. That stable stretch is billed at a fraction of the price.

In practice, every turn has four types of token, with very different weights. To compare without depending on a dollar price, you can measure everything in "input-equivalents," using the standard ratios between them:

new input — tokens the model has never seen (your new message). Weight 1.
cache_write — writing a new stretch into the cache for the first time. Weight 1.25.
cache_read — rereading a stretch that's already cached (the stable prefix of the conversation). Weight 0.1.
output — the tokens the model writes. Weight 5.
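The four weights above can be turned into a small conversion, sketched here under the assumption that you already have raw token counts per type. The dictionary keys and the function names input_equivalents and cost_shares are illustrative, not part of any tool.

```python
# Weights from the list above, in input-equivalents per token.
WEIGHTS = {
    "input": 1.0,
    "cache_write": 1.25,
    "cache_read": 0.1,
    "output": 5.0,
}


def input_equivalents(tokens):
    """Convert raw token counts per type into weighted input-equivalents."""
    return {kind: tokens.get(kind, 0) * weight for kind, weight in WEIGHTS.items()}


def cost_shares(tokens):
    """Share of the total weighted cost attributable to each token type."""
    weighted = input_equivalents(tokens)
    total = sum(weighted.values())
    if total == 0:
        return {kind: 0.0 for kind in weighted}
    return {kind: value / total for kind, value in weighted.items()}
```

With hypothetical counts of 1,000 new input, 2,000 cache_write, 100,000 cache_read and 500 output tokens, cache_read accounts for 62.5% of the weighted cost, even though each reread token is the cheapest kind.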

The intuition almost everyone has is that the expensive part is the output, because it's the "smart" part. The numbers say otherwise.

What the numbers said

The bill isn't what the model writes. It's the entire conversation being reread every time it answers.

When I summed the token types across my larger sessions, cache_read — the rereading of the already-cached history — came in between 54% and 73% of each session's total cost. The output, the part I'd assumed was expensive, sat near 10%. New input, the stuff I actually type, didn't even reach 1%.

It makes sense once you connect it to the mechanism: in one of the sessions, cache_read totaled close to 194 million tokens for about 4 million tokens of actual content. That is, the conversation was reread around 48 times over the course of the session. Every time the model responded, it reread everything already there. Cheaply, but it reread it. When "everything already there" is large, 0.1 times a mountain is still a steep bill, multiplied by dozens of turns.

The second finding is what I started calling the backpack. Anything that enters the context and never leaves is billed again on every subsequent turn. I measured the average cache_read per turn at the start and the end of long sessions: it went from about 140 thousand tokens per turn to over 500 thousand. Almost four times as much. It's not that each individual turn got smarter — it's that the backpack kept filling, and turn 200 carries on its back everything accumulated in the 199 before it. A useless item that enters the context early gets paid for hundreds of times before the session ends.
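The backpack effect can be put into numbers with a rough model. This sketch assumes an item is written to the cache once, on the turn it enters, and is then reread from the cache on every later turn until the session ends, with no cache expiry along the way. The function name carry_cost and its parameters are hypothetical.

```python
CACHE_WRITE_WEIGHT = 1.25  # weight of writing a stretch into the cache
CACHE_READ_WEIGHT = 0.1    # weight of rereading a cached stretch


def carry_cost(item_tokens, entered_at_turn, last_turn):
    """Input-equivalents paid for one item that stays in context until last_turn."""
    if last_turn < entered_at_turn:
        raise ValueError("last_turn must not be earlier than entered_at_turn")
    rereads = last_turn - entered_at_turn  # one cached reread per later turn
    write = item_tokens * CACHE_WRITE_WEIGHT
    return write + item_tokens * CACHE_READ_WEIGHT * rereads
```

Under those assumptions, a 10,000-token item that enters on turn 10 of a 200-turn session costs about 202,500 input-equivalents, roughly twenty times its own size.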

And the findings pointed to two different ways a session gets expensive, which call for opposite remedies. One is the visual-validation session, where I use a tool of mine that captures the screen state as text to check an interface — each capture is large, and they pile up in the context, inflating cache_read. The other is the heavy-editing session spread across many files: there the culprit is cache_write, because editing files inside an already-huge context forces large chunks of the cache to be rewritten on every change. The first problem is solved by discarding the old captures; the second, by keeping the context small from the start. Different diagnoses, same bill.

Subagents: cheap where I thought they were expensive

A subagent is a separate instance of the agent that I dispatch to handle a sub-task — find where something lives in the code, investigate a stretch, propose a plan — and that returns only the conclusion to me. The usage notice tends to flag "subagent-heavy" sessions as expensive, and I'd been carrying the suspicion that spreading subagents around was wasteful.

The numbers corrected that, but not the way I expected. In a session with dozens of subagents dispatched, each one returned, on average, only about 3 KB of text to the main context. Which means: the subagent doesn't pollute my context window — it returns almost nothing and absorbs the dirty work outside. Its cost isn't in my context; it's in its own separate stack, which grows with the same backpack logic, but isolated.

That shifts the decision. A subagent isn't expensive or cheap in the abstract — it depends on how much carrying it saves you. If I sent a subagent to read 40 files and it returned 3 KB of summary, I saved 40 reads that would weigh on my context for dozens of turns. But if I dispatch a subagent for a two-command task, I pay for an entire mini-session to save almost nothing. The rule I took from this is simple: delegating is worth it when the work would dump more into my context than the subagent returns, and when I'd reread that weight over many turns. Outside of that, I do it inline.
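That rule can be sketched as a comparison between two costs in input-equivalents. The model is an assumption, not a measurement: whatever lands in the main context is written to the cache once and reread on every remaining turn, and the subagent's own session cost is a number you supply, for example summed from its transcript. All names here are hypothetical.

```python
CACHE_WRITE_WEIGHT = 1.25
CACHE_READ_WEIGHT = 0.1


def context_cost(tokens, remaining_turns):
    """Cost of adding tokens to the main context and carrying them to the end."""
    return tokens * (CACHE_WRITE_WEIGHT + CACHE_READ_WEIGHT * remaining_turns)


def worth_delegating(dumped_tokens, returned_tokens, remaining_turns, subagent_cost):
    """True when delegating is cheaper than doing the work inline.

    dumped_tokens: what the work would add to the main context if done inline.
    returned_tokens: size of the conclusion the subagent hands back.
    subagent_cost: the subagent's own session cost, in input-equivalents.
    """
    inline = context_cost(dumped_tokens, remaining_turns)
    delegated = context_cost(returned_tokens, remaining_turns) + subagent_cost
    return delegated < inline
```

With made-up figures: work that would dump 200,000 tokens, returns 750, has 50 turns still ahead and costs 400,000 in the subagent's own stack is worth delegating. A task that would dump 2,000 tokens and costs 60,000 to delegate is not.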

A thermometer for context

If the problem is that useless things keep getting paid for turn after turn, the question becomes: what, exactly, is worth keeping? Comparing the sessions, I noticed you can order the types of information on a kind of thermometer, with four bands, from the one I always want to keep to the one that can go first.

One data point anchors it all: my messages — what I actually type to the agent — are less than 2.5% of a session's message volume, but they carry almost all the intent that matters. The signal is tiny; the noise (the tool results) is the mass. The bands come out of that contrast between token weight and durability, that is, how long a piece of information keeps supporting a decision:

Critical — weighs almost nothing in tokens and lasts the whole session. The instructions and constraints I gave, the decisions made and the reasoning behind them, and above all the evolution of the target: how the work started aiming at one thing and, because of a finding, shifted to another. Cheap to keep and expensive to lose.
Important — it weighs, but it has to stay in some form. The code being written now, the errors that came up and how they were fixed (so they aren't repeated), the files that matter. Here the ideal is to keep it distilled — a pointer to the relevant stretch, not the whole copy repeated.
Situational — only worth keeping while it's active. The plan in progress, the pending tasks, the last test result. When the task closes, this can go.
Ephemeral — weighs a lot and lasts almost nothing. The old screen captures, the file reads already superseded by edits, the verbose output of a test command, the search dumps. This is exactly the band that represents most of the bill, and the first that should be discarded.
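The four bands can be written down as a tiny policy table. This is only a sketch of the classification described above. The names Band and action_for, and the three action labels, are invented for illustration, and deciding which band a given piece of context belongs to is still a judgment call the code does not make.

```python
from enum import Enum


class Band(Enum):
    CRITICAL = "critical"        # light and durable: instructions, decisions, pivots
    IMPORTANT = "important"      # heavy but needed: current code, errors and fixes
    SITUATIONAL = "situational"  # useful only while its task is active
    EPHEMERAL = "ephemeral"      # heavy and short-lived: old captures, verbose logs


def action_for(band, task_active=True):
    """Return what a context cut should do with an item in the given band."""
    if band is Band.CRITICAL:
        return "keep"
    if band is Band.IMPORTANT:
        return "distill"  # keep a pointer or summary, not the full copy
    if band is Band.SITUATIONAL:
        return "keep" if task_active else "discard"
    return "discard"
```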

The reading of the thermometer is that a good context cut treats the two ends in opposite ways and distills the middle: it saves what's cheap and durable at the top and throws out what's heavy and dead at the bottom. The risk lives on the border between the critical and the situational: it's where the thread of the story lives, the reasoning behind the pivots, and it's exactly what's easy to lose without noticing.

How /compact works on the inside

In the agent, the tool that attacks the backpack is /compact — a command that summarizes the conversation so far and replaces the long history with the summary, freeing up space. I'd been using it without knowing what it preserved and what it dropped, so I went to understand its behavior in practice, by observing the summaries it produced in my own sessions.

Three things became clear. The first is that there's more than one compaction mode: one that summarizes the whole conversation to restart the session from the summary, and another, incremental, that keeps the beginning intact and summarizes only the most recent part. The second is that the summary follows a fixed template of sections — primary request and intent, technical concepts, files and code sections, errors and fixes, problems solved, all user messages, pending tasks, current work, next step.

The third is the most interesting, and it talks directly to the thermometer. The template preserves all user messages, with explicit attention to "changing intent," and it instructs the summarizer to keep security constraints in their literal form. In other words, what I typed myself is the protected channel: what I said survives the cut. And the template is evidently concerned with not losing the thread of the task: it asks for verbatim quotes of the most recent work precisely to keep the interpretation of what was being done from drifting in paraphrase.

That's good, and it has two blind spots that are well worth knowing about.

The first is the finding-driven pivot. When I change the target — "actually, let's go another way" — that lives in a message of mine and survives the cut. But when it's the investigation that changes the target — the agent discovers an assumption was wrong and the plan has to change — the reasoning for that turn lives in a tool result, and a tool result is the first thing the summary throws away. The template captures the intent I verbalize, not the one that emerges from the inquiry itself.

The second is recency bias. Several sections of the template ask for special attention to the most recent messages. That makes sense for continuing where you left off, but it means a foundational decision made way back at the start — a pivot on turn 50 that shaped everything that followed — gets underweighted if it wasn't restated recently. The summary pulls the focus to the end of the conversation.

Putting it together: /compact is, at its core, a good summarizer of state, anchored in what I type and careful with the interpretation of the task. What it doesn't guarantee is the trajectory — the chain of cause between what was discovered and the change of course that finding triggered, especially when the finding is old.

What you can do with this

The method of reading the logs is worth it on its own — anyone who uses the tool can run the same analysis on their own session files and see where their cost lives, instead of trusting mine. But the study also points to concrete actions, and they come straight out of the findings.

The cheapest is a habit: open each session inside the specific project's directory, not in a root folder that mixes everything. A session without boundaries is the one that grows the most, mixes files from different projects into the same window, and becomes that eight-hour session that carries everything. The project boundary is, on its own, a natural context limiter.

The second follows from what /compact doesn't guarantee. Since the protected channel is my messages and the recent verbatim quotes, a finding-driven pivot only survives the cut if I bring it into that channel — stating the turn explicitly the moment it happens, instead of leaving it buried in a tool result. And /compact accepts a focus instruction alongside the command, so you can ask it, every time, to preserve the trajectory of the pivots and discard captures and logs, instead of accepting the raw default template.

The third is about when to cut. It's not by blind size, it's by composition: the best moment to compact is when the backpack has gotten heavy with ephemeral stuff, for instance right after a visual-validation cycle wraps up, when the screen captures have already done their job and turned into dead weight. The gain is direct to calculate: it's the number of tokens freed, times the cache_read weight of 0.1, times all the turns still to come. Cutting early the junk that would otherwise be reread many times is what saves the most.
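That calculation fits in one line of code. The sketch below is a simplification: it counts only the rereads avoided and ignores what the compaction itself costs (the summary the model has to write and the cache that has to be rebuilt afterwards). The function name compaction_gain is hypothetical.

```python
def compaction_gain(freed_tokens, remaining_turns, cache_read_weight=0.1):
    """Input-equivalents saved by dropping freed_tokens for the rest of the session."""
    if freed_tokens < 0 or remaining_turns < 0:
        raise ValueError("freed_tokens and remaining_turns must be non-negative")
    return freed_tokens * cache_read_weight * remaining_turns
```

For an illustrative 300,000 tokens of dead captures, cutting with 100 turns still ahead saves about 3,000,000 input-equivalents. Making the same cut with 10 turns left saves about 300,000.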

What stayed with me

What struck me most in this study wasn't any specific number, it was realizing that the tool's behavior is legible. I didn't need to guess where the cost went, or follow generic advice about "long sessions" — the exact accounting was in a file on my disk, and the compaction template was observable in the very result it produces. I just had to go read it.

There's an extra layer to this, the kind of thing I want to record here. The relationship with an AI tool improves a lot when you stop treating it as a black box and start treating it as a system — with inputs, costs, and behaviors you can measure and adjust. The next step, for me, is turning these findings into small automations: a reminder that fires when the backpack gets heavy with junk, a compaction pattern that protects the trajectory. But that's the next article, with the code alongside.
