Benchmarks measure a model you are not running

agentsorchestrationbenchmarkscontext

When a new model drops, the benchmark scores arrive first. SWE-bench Verified percentage. HumanEval pass@1. APPS accuracy. These numbers travel fast because they are precise and comparable — they feel like a specification sheet for a component you are about to buy.

What they actually measure is a very specific operating condition that most people are not running in. Understanding what that condition is changes how you interpret the score, and it changes which deployment model actually delivers the capability the score advertises.

What the context window looks like during a benchmark

I measured the token count of every problem in the seven major coding benchmarks used to evaluate LLM agents. The tokenizer is cl100k_base (GPT-4 / tiktoken). The numbers are for the raw problem input — the text actually sent to the model.

BenchmarkProblemsMedianP99Max
MBPP25716 tokens4849
HumanEval164117 tokens310391
BigCodeBench1,140129 tokens3631,216
SWE-bench Verified500294 tokens2,5146,939
SWE-bench2,294282 tokens2,93722,483
LiveCodeBench400421 tokens1,1051,521
APPS5,000456 tokens1,1031,815

Every single problem across all seven benchmarks fits under 8,000 tokens. The largest problem in the entire dataset — an outlier SWE-bench issue at 22,483 tokens — is still under 12% of Claude’s 200,000-token context window. At the median, HumanEval problems use 0.06% of that window. MBPP problems use 0.008%.

The model being benchmarked is operating with its context window essentially empty.

Flat illustration in the same bold graphic style as the rest of the site: warm cream background, bold black outlines, deep red and warm orange palette. A tall vertical rectangle labeled CONTEXT WINDOW 200K TOKENS. The rectangle is almost entirely empty. At the very bottom, a tiny deep crimson red sliver labeled 282 TOKENS — a hairline compared to the total height. The disproportion is the point.

Two conditions, not one

It is tempting to read “empty context window” as a capacity fact — the model has space available. The more important thing is what that space is not filled with.

A benchmark evaluation has two properties that travel together but are worth separating:

Few tokens of input. The problem statement is small. At 117 tokens, a HumanEval prompt is shorter than most Slack messages. The model has room to reason freely.

No prior turns. The context contains nothing except the problem. No corrections from three exchanges ago. No dead end the model went down and you had to redirect. No half-finished implementation that the model is tempted to continue in the wrong direction. No long system prompt from a previous session that is now only partially relevant. The model starts with a blank slate.

The second condition is the one that matters more, and it is the one people talk about least.

Flat illustration in the same bold graphic style as the rest of the site: warm cream background, bold black outlines, deep red and warm orange palette. Two panels divided by a bold vertical black line. Left panel labeled BENCHMARK: a nearly empty context window rectangle with only a tiny crimson sliver at the bottom labeled PROBLEM, and a worker figure with a checkmark indicating success. Right panel labeled PET SESSION: the same rectangle now packed with alternating cream and crimson horizontal bands, with ORIGINAL TASK barely visible at the bottom buried under accumulated context. The same worker figure stands beside it looking overwhelmed, with a swirl above their head.

What fills a pet session’s context

A pet agent session accumulates context continuously. After an hour of work, the context window contains:

This is not neutral. Research on context window utilization shows that models degrade when relevant content is buried in earlier positions — the “lost in the middle” phenomenon. Information at the start and end of a context is attended to more reliably than information in the middle. A long pet session buries its most important content (the actual task requirements) under layers of accumulated back-and-forth.

Flat illustration in the same bold graphic style as the rest of the site: warm cream background, bold black outlines, deep red and warm orange palette. A single tall context window rectangle packed top-to-bottom with horizontal bands labeled ORIGINAL TASK, WRONG ATTEMPT, DEAD END, CORRECTION, REDIRECT, TOOL OUTPUT, WRONG ATTEMPT, DEAD END repeating downward. Label at the top reads TURN 47. A worker figure (short black hair, crimson red polo, warm orange skin, closed eyes) stands to the right with arms crossed, staring at the packed context.

Beyond attention degradation, there is a more direct effect: the model’s prior wrong attempts are in the context. It has seen itself go down a particular path. It has a prior. That prior shapes the next attempt, not always in the right direction.

The benchmark model has none of this. It sees the problem cold. Whatever capability it has for that problem is expressed fully, without interference.

The cattle worker is the benchmark condition

A stateless cattle worker dispatched by an orchestrator starts each task with a context containing:

That is structurally close to a benchmark evaluation. The input is small and deliberate. There is no accumulated conversation. No prior wrong attempts. No corrections that anchored the model toward a direction it should abandon.

Flat illustration in the same bold graphic style as the rest of the site: warm cream background, bold black outlines, deep red and warm orange palette. Two columns. Left labeled BENCHMARK SCORE: a large bold 72% in crimson red, with a label below reading CLEAN CONTEXT 300 TOKENS. Right labeled CATTLE WORKER: a worker figure (short black hair, crimson red polo, orange skin, closed serene eyes) with a small TASK BEAD document icon and a checkmark. A bold double-headed arrow between the two columns. Label below reads SAME OPERATING CONDITIONS.

The pet session is not that model. The pet session is a model operating under conditions that are systematically worse than the conditions under which it was benchmarked — and those conditions degrade further the longer the session runs.

This means benchmark scores are a better predictor of cattle performance than pet performance. When you read that a model scores X% on SWE-bench Verified, the deployment model that will actually realize that capability is a stateless worker with a clean context and a well-scoped task — not an ongoing chat session where the model has been talking to you for two hours.

The score you are buying is not what you are running

The practical consequence is that most people buy capability — a higher benchmark score, a more expensive model tier — and then run it in a mode that systematically degrades that capability below what the benchmark measured.

A pet session with a more capable model is better than a pet session with a less capable model, but both are operating below their benchmark-measured ceiling. The gap between “what the benchmark measured” and “what the pet session delivers” grows as the session accumulates context. By hour three, with 50,000 tokens of back-and-forth, you are running a noticeably different model than the one that got the score.

There is no equivalent degradation in the cattle model. A stateless worker dispatched against a well-specified task is as close to benchmark conditions as production use gets. The score is what you bought. The score is roughly what you run.

This also changes the economics of model selection. The correct question is not “which model scores highest on SWE-bench?” but “which model scores highest on SWE-bench, and am I running it in conditions that will actually realize that score?” If you are running pet sessions, you are paying for capability you are not fully using. If you are running cattle workers with clean contexts and scoped tasks, you are.

What an honest benchmark would measure

The benchmarks that exist were designed for a world of clean evaluations, not for the question of how model capability degrades across a production conversation. None of them measure:

These would be more diagnostic for real production use. The existing benchmarks tell you the ceiling. What is missing is the curve that describes how quickly you fall away from that ceiling as the context accumulates, and how different deployment models (cattle versus pet) track that curve differently.

Until those measurements exist, the empirical data points in one direction: the deployment model closest to benchmark conditions is stateless dispatch into clean context. That is the cattle model.

The question I now ask

When someone cites a benchmark to justify a model choice, I ask:

Under what context conditions was that score measured, and are those the conditions we are actually running?

If the answer is “a 300-token prompt against a blank context window” and the deployment is “an ongoing chat session that has been running for six hours,” the score is describing a model that is not the model they are running.

Knowing that does not mean you stop using pet sessions — there are tasks where they are the right tool, and where the accumulated context is a feature rather than a bug. It means you understand that benchmark scores are ceiling measurements, and that the delta between the ceiling and what you actually get is a function of the deployment model you chose.

Cattle closes that delta. That is the case, made empirically.

— Jed


Data: token counts measured across 9,755 benchmark problems (HumanEval, MBPP, SWE-bench, SWE-bench Verified, LiveCodeBench, BigCodeBench, APPS) using the cl100k_base tokenizer. Raw data (per-problem token counts + full statistics): benchmark-tokens.json. Background: pet agents vs. cattle agents — the deployment model this post argues for. The context reconstruction approach that makes clean-context cattle work: the plan is the prompt.