Benchmarks measure a model you are not running

agents May 17, 2026 8 min read

When a new model drops, the benchmark scores arrive first. SWE-bench Verified percentage. HumanEval pass@1. APPS accuracy. These numbers travel fast because they are precise and comparable — they feel like a specification sheet for a component you are about to buy.

What they actually measure is a very specific operating condition that most people are not running in. Understanding what that condition is changes how you interpret the score, and it changes which deployment model actually delivers the capability the score advertises.

What the context window looks like during a benchmark

I measured the token count of every problem in the seven major coding benchmarks used to evaluate LLM agents. The tokenizer is cl100k_base (GPT-4 / tiktoken). The numbers are for the raw problem input — the text actually sent to the model.

Benchmark	Problems	Median	P99	Max
MBPP	257	16 tokens	48	49
HumanEval	164	117 tokens	310	391
BigCodeBench	1,140	129 tokens	363	1,216
SWE-bench Verified	500	294 tokens	2,514	6,939
SWE-bench	2,294	282 tokens	2,937	22,483
LiveCodeBench	400	421 tokens	1,105	1,521
APPS	5,000	456 tokens	1,103	1,815

Every single problem across all seven benchmarks fits under 8,000 tokens. The largest problem in the entire dataset — an outlier SWE-bench issue at 22,483 tokens — is still under 12% of Claude’s 200,000-token context window. At the median, HumanEval problems use 0.06% of that window. MBPP problems use 0.008%.

The model being benchmarked is operating with its context window essentially empty.

Flat illustration in the same bold graphic style as the rest of the site: warm cream background, bold black outlines, deep red and warm orange palette. A tall vertical rectangle labeled CONTEXT WINDOW 200K TOKENS. The rectangle is almost entirely empty. At the very bottom, a tiny deep crimson red sliver labeled 282 TOKENS — a hairline compared to the total height. The disproportion is the point.

Two conditions, not one

It is tempting to read “empty context window” as a capacity fact — the model has space available. The more important thing is what that space is not filled with.

A benchmark evaluation has two properties that travel together but are worth separating:

Few tokens of input. The problem statement is small. At 117 tokens, a HumanEval prompt is shorter than most Slack messages. The model has room to reason freely.

No prior turns. The context contains nothing except the problem. No corrections from three exchanges ago. No dead end the model went down and you had to redirect. No half-finished implementation that the model is tempted to continue in the wrong direction. No long system prompt from a previous session that is now only partially relevant. The model starts with a blank slate.

The second condition is the one that matters more, and it is the one people talk about least.

What fills a pet session’s context

A pet agent session accumulates context continuously. After an hour of work, the context window contains:

The original task description (probably underspecified)
Your first few corrections and clarifications
The model’s first attempt, which missed something
Your redirect
The model’s second attempt, partially correct
More corrections
Tool call outputs — file reads, test runs, error messages
The model’s current understanding, shaped by all of the above

This is not neutral. Research on context window utilization shows that models degrade when relevant content is buried in earlier positions — the “lost in the middle” phenomenon. Information at the start and end of a context is attended to more reliably than information in the middle. A long pet session buries its most important content (the actual task requirements) under layers of accumulated back-and-forth.

Beyond attention degradation, there is a more direct effect: the model’s prior wrong attempts are in the context. It has seen itself go down a particular path. It has a prior. That prior shapes the next attempt, not always in the right direction.

The benchmark model has none of this. It sees the problem cold. Whatever capability it has for that problem is expressed fully, without interference.

The cattle worker is the benchmark condition

A stateless cattle worker dispatched by an orchestrator starts each task with a context containing:

The project instructions (CLAUDE.md) — stable, curated, written once
The task body from the bead — a precise specification of one unit of work
Whatever reference material the task body explicitly includes

That is structurally close to a benchmark evaluation. The input is small and deliberate. There is no accumulated conversation. No prior wrong attempts. No corrections that anchored the model toward a direction it should abandon.

The pet session is not that model. The pet session is a model operating under conditions that are systematically worse than the conditions under which it was benchmarked — and those conditions degrade further the longer the session runs.

This means benchmark scores are a better predictor of cattle performance than pet performance. When you read that a model scores X% on SWE-bench Verified, the deployment model that will actually realize that capability is a stateless worker with a clean context and a well-scoped task — not an ongoing chat session where the model has been talking to you for two hours.

The score you are buying is not what you are running

The practical consequence is that most people buy capability — a higher benchmark score, a more expensive model tier — and then run it in a mode that systematically degrades that capability below what the benchmark measured.

A pet session with a more capable model is better than a pet session with a less capable model, but both are operating below their benchmark-measured ceiling. The gap between “what the benchmark measured” and “what the pet session delivers” grows as the session accumulates context. By hour three, with 50,000 tokens of back-and-forth, you are running a noticeably different model than the one that got the score.

There is no equivalent degradation in the cattle model. A stateless worker dispatched against a well-specified task is as close to benchmark conditions as production use gets. The score is what you bought. The score is roughly what you run.

This also changes the economics of model selection. The correct question is not “which model scores highest on SWE-bench?” but “which model scores highest on SWE-bench, and am I running it in conditions that will actually realize that score?” If you are running pet sessions, you are paying for capability you are not fully using. If you are running cattle workers with clean contexts and scoped tasks, you are.

What an honest benchmark would measure

The benchmarks that exist were designed for a world of clean evaluations, not for the question of how model capability degrades across a production conversation. None of them measure:

Performance at turn 30 of a live session versus turn 1
Performance with 80K tokens of prior context versus 500
How much capability is recovered by compressing and distilling the context versus starting fresh

These would be more diagnostic for real production use. The existing benchmarks tell you the ceiling. What is missing is the curve that describes how quickly you fall away from that ceiling as the context accumulates, and how different deployment models (cattle versus pet) track that curve differently.

Until those measurements exist, the empirical data points in one direction: the deployment model closest to benchmark conditions is stateless dispatch into clean context. That is the cattle model.

The question I now ask

When someone cites a benchmark to justify a model choice, I ask:

Under what context conditions was that score measured, and are those the conditions we are actually running?

If the answer is “a 300-token prompt against a blank context window” and the deployment is “an ongoing chat session that has been running for six hours,” the score is describing a model that is not the model they are running.

Knowing that does not mean you stop using pet sessions — there are tasks where they are the right tool, and where the accumulated context is a feature rather than a bug. It means you understand that benchmark scores are ceiling measurements, and that the delta between the ceiling and what you actually get is a function of the deployment model you chose.

Cattle closes that delta. That is the case, made empirically.

— Jed

Data: token counts measured across 9,755 benchmark problems (HumanEval, MBPP, SWE-bench, SWE-bench Verified, LiveCodeBench, BigCodeBench, APPS) using the cl100k_base tokenizer. Raw data (per-problem token counts + full statistics): benchmark-tokens.json — schema, license, and citation on the data page. Background: pet agents vs. cattle agents — the deployment model this post argues for. The context reconstruction approach that makes clean-context cattle work: the plan is the prompt.