Coding-benchmark task token counts

Measured token counts for the task descriptions of seven popular LLM coding benchmarks. This is the data behind the note "Benchmarks measure a model you are not running." The median benchmark task is a few hundred tokens, while real full-context work is orders of magnitude larger; that gap is the note's whole argument.

Download: benchmark-tokens.json

Schema

A JSON object keyed by benchmark: HumanEval, MBPP (sanitized), SWE-bench, SWE-bench Verified, LiveCodeBench, BigCodeBench, and APPS. Each entry carries summary statistics over task-description token counts (count, min and max, mean, median, percentiles from p25 to p99, and the share of tasks under common context sizes), plus the raw per_problem_tokens array so you can recompute anything yourself.
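As a rough illustration of recomputing the summary yourself, here is a minimal Python sketch. It relies only on the per_problem_tokens key named above. Everything else is an assumption made for illustration: the helper names, the output key names (which may not match the file's own), the nearest-rank percentile method, and the context-size thresholds of 512, 2048, and 8192 tokens.

```python
import json
import math
import statistics
from pathlib import Path


def load_benchmarks(path="benchmark-tokens.json"):
    """Load the dataset; the path assumes the file sits in the working directory."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def recompute_summary(entry, context_sizes=(512, 2048, 8192)):
    """Recompute summary statistics from one entry's raw per-task token counts.

    Only "per_problem_tokens" is taken from the published schema. The output
    key names and the context_sizes thresholds are illustrative assumptions.
    """
    tokens = sorted(entry["per_problem_tokens"])
    n = len(tokens)

    def percentile(p):
        # Nearest-rank percentile: the smallest value with at least p% of
        # the data at or below it. The dataset may use another method.
        rank = max(1, math.ceil(p / 100 * n))
        return tokens[rank - 1]

    return {
        "count": n,
        "min": tokens[0],
        "max": tokens[-1],
        "mean": statistics.fmean(tokens),
        "median": statistics.median(tokens),
        "p25": percentile(25),
        "p99": percentile(99),
        # Fraction of tasks strictly below each context size.
        "share_under": {
            size: sum(t < size for t in tokens) / n for size in context_sizes
        },
    }


# Usage, once the file is downloaded:
#   data = load_benchmarks()
#   for name, entry in data.items():
#       print(name, recompute_summary(entry)["median"])
```

Because percentile conventions differ (nearest-rank versus interpolated), small disagreements with the published p25 to p99 values would not necessarily indicate an error in either computation.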

License & citation

CC BY 4.0 — reuse freely with attribution:

Jed Arden, "Coding-benchmark task token counts", jedarden.com/data, 2026.