Coding-benchmark task token counts
Measured token counts for the task descriptions of seven popular LLM coding benchmarks. This is the data behind the note "Benchmarks measure a model you are not running". The median benchmark task is a few hundred tokens; real full-context work is orders of magnitude larger, which is the note's whole argument.
Download: benchmark-tokens.json
Schema
A JSON object keyed by benchmark: HumanEval, MBPP (sanitized), SWE-bench, SWE-bench Verified,
LiveCodeBench, BigCodeBench, and APPS. Each entry carries summary statistics over task-description
token counts (count, min/max, mean, median, p25–p99 percentiles, and the share of tasks under
common context sizes) plus the raw per_problem_tokens array so you can recompute anything yourself.
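As a minimal sketch of that recomputation, the Python below rebuilds the basic statistics from the raw array. It relies on only two names from this page, the file benchmark-tokens.json and the per_problem_tokens key, and assumes that key sits directly under each benchmark entry as a list of integers. The function names, the output keys, and the context-size thresholds (512, 2048, 8192) are hypothetical choices for illustration, not the file's actual field names or cutoffs.

```python
import json
import statistics
from pathlib import Path


def summarize(tokens, context_sizes=(512, 2048, 8192)):
    """Recompute summary statistics from a list of per-task token counts.

    The context_sizes defaults are illustrative assumptions, not the
    thresholds used in the published file.
    """
    ordered = sorted(tokens)
    n = len(ordered)
    return {
        "count": n,
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        # Fraction of tasks strictly below each context size.
        "share_under": {
            size: sum(t < size for t in ordered) / n for size in context_sizes
        },
    }


def load_and_summarize(path="benchmark-tokens.json"):
    """Load the downloaded file and summarize every benchmark entry.

    Assumes each top-level value is an object that contains a
    per_problem_tokens list, as described in the schema text.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return {
        name: summarize(entry["per_problem_tokens"])
        for name, entry in data.items()
    }
```

Calling load_and_summarize() in the directory that holds the downloaded file returns one summary per benchmark. Percentiles are left out on purpose: values such as p25 or p99 depend on the interpolation method, so recomputed figures may differ slightly from the stored ones unless the same method is used.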
License & citation
CC BY 4.0 — reuse freely with attribution:
Jed Arden, "Coding-benchmark task token counts", jedarden.com/data, 2026.