-
Anchor on a fact the model can't see
An agent's confidence is uncorrelated with whether it's right. The cheapest correction you can make is to hold out one ground-truth fact it never had in context — usually a number — and check its output against that. Works the same whether you're watching one session or auditing a fleet's reports.
-
Benchmarks measure a model you are not running
SWE-bench problems have a median of 282 tokens. HumanEval: 117. MBPP: 16. Every major coding benchmark evaluates a model operating with essentially an empty context window — which is almost never the condition you run in. Unless you are running cattle.