#context

2 notes tagged “context”. All notes →

Anchor on a fact the model can't see

agents

An agent's confidence is uncorrelated with whether it's right. The cheapest correction you can make is to hold out one ground-truth fact it never had in context — usually a number — and check its output against that. Works the same whether you're watching one session or auditing a fleet's reports.

May 20, 2026

agents collaboration context
Benchmarks measure a model you are not running

agents

SWE-bench problems have a median of 282 tokens. HumanEval: 117. MBPP: 16. Every major coding benchmark evaluates a model operating with essentially an empty context window — which is almost never the condition you run in. Unless you are running cattle.

May 17, 2026

agents orchestration benchmarks context