A worker reported it had sized a node pool for the new cluster. The plan looked reasonable, the YAML looked reasonable, the explanation was fluent. The only problem was a number I happened to know and the agent didn’t: the configured Rackspace Spot node is 2 CPU and 3.75 GB of RAM. The agent had quietly reasoned as though it had more. “That doesn’t make sense,” I typed, “the node is 2CPU/3.75GB” — and the whole plan unwound, because it had been built on a resource budget that didn’t exist.
I do some version of this constantly. “5 beads per 24h? The close rate is far higher than that.” “That can’t be right.” It is the single most reliable correction I make, and it works because of one fact about these models that took me a while to internalize.
Confidence is not correlation
A language model’s fluency is uncorrelated with whether it’s right. It will describe a node pool it sized wrong with exactly the same calm competence it uses to describe one it sized right. There is no tremor in the prose when the underlying number is fiction. This is not the model lying — it is the model doing precisely what it does, which is produce the most plausible continuation given what’s in its context. If the true number was never in its context, the most plausible continuation is built on a guess, and the guess is delivered with the same production values as a fact.
So you cannot audit an agent by reading for hesitation. There is none. You have to audit it against something outside the text.
The model is fluent about a number it can’t see. Your job is to be the one who can see it.
Why the number is usually missing
Benchmarks measure a model running on an almost-empty context window, and most real work is the opposite — but “full context” is not the same as “correct context.” The agent can have ten thousand tokens of code in its window and still be missing the one operational fact that decides the task: the actual size of the node, the real close rate, the column that’s secretly nullable, the rate limit on the third-party API, the fact that this cluster is read-only.
These facts share a property: they live in the world, not in the repo. They’re in a dashboard, in your head, in a billing console, in something you saw last Tuesday. The agent has no path to them unless you put them in front of it. And because it can’t tell that they’re missing — absence of a fact doesn’t feel like anything from the inside of a context window — it doesn’t ask. It interpolates. Confidently.

This reframes what you, the human, are actually for in the loop. You are not there to write the code; the agent is faster than you at that. You are there because you are holding facts the agent structurally cannot hold, and the highest-leverage thing you do all session is notice when the output collides with one of them.
Numbers are the best anchors
Of all the ground-truth facts you might hold, numbers are the ones to reach for first, because they fail loudly. A node has 3.75 GB, not “enough.” A close rate is 40 a day, not “low.” A node pool either fits in the budget or it doesn’t, and you can check by subtracting. Prose claims are slippery — “this should improve latency” is hard to falsify in a glance — but a quantitative claim either matches the number you’re holding or it doesn’t, and the mismatch is instant.
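Here is that subtraction as a minimal Python sketch. The 2 CPU / 3.75 GB node is the figure from the story above. The system reserve, the three workload requests, and the names Resources, headroom, and fits are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpu: float     # cores
    ram_gb: float  # gigabytes


# Ground truth from the story: the configured node.
NODE = Resources(cpu=2.0, ram_gb=3.75)

# Assumption: capacity held back for the OS and node agents (illustrative value).
SYSTEM_RESERVE = Resources(cpu=0.25, ram_gb=0.75)


def headroom(node, reserve, requests):
    """Return what is left after subtracting the reserve and every request."""
    cpu = node.cpu - reserve.cpu - sum(r.cpu for r in requests)
    ram = node.ram_gb - reserve.ram_gb - sum(r.ram_gb for r in requests)
    return Resources(cpu=cpu, ram_gb=ram)


def fits(node, reserve, requests):
    """A plan fits only if neither resource goes negative."""
    left = headroom(node, reserve, requests)
    return left.cpu >= 0 and left.ram_gb >= 0


# Hypothetical plan: three workloads the agent wants to place on one node.
plan = [Resources(0.5, 1.0), Resources(0.5, 1.0), Resources(0.5, 2.0)]

left = headroom(NODE, SYSTEM_RESERVE, plan)
print("fits:", fits(NODE, SYSTEM_RESERVE, plan))
print("headroom:", left)
```

With these invented requests the CPU fits with a quarter core to spare, but the RAM comes out a full gigabyte short. That is the kind of mismatch a fluent explanation hides and a subtraction does not.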
So when I read an agent’s report, the thing I’m hunting for is the load-bearing number. How big did it think the node was? How many of these did it assume there are? What rate did it design against? Often the agent doesn’t even state the number — it’s implicit in a decision. Half the work is making the implicit number explicit so I can check it: “what RAM budget did you assume here?” The moment it commits to a figure, I can compare against the one in my head, and the whole edifice stands or falls on that single comparison.
This scales straight into the fleet
Interactively, this is a reflex: read the report, find the number, check it, say “that doesn’t make sense” when it collides. Across a fleet it becomes a design requirement, and it’s the same instinct wearing a different hat.
If the operator’s job is to anchor agents against facts they can’t see, then a fleet needs those facts written into the work, not held in one human’s head — because there’s no human in a cattle worker’s loop to say “that can’t be right.” The node size goes in the plan. The expected close rate becomes an acceptance criterion the worker validates against. The reports the fleet emits get spot-checked against the dashboards, because a confident report is exactly as trustworthy at scale as it was in your single session — which is to say, not at all, until you’ve checked it against something it couldn’t see.
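One way to picture "facts written into the work" is a small check that compares what a worker reports against figures recorded in the plan. This is a sketch under assumptions, not a description of any real fleet tooling: the field names, the 25% tolerance, and the check_report helper are hypothetical, and the 40-a-day close rate is the illustrative figure used earlier.

```python
# Ground-truth facts recorded in the plan, where the worker can see them.
# Field names are hypothetical; the values come from the examples in this post.
PLAN_FACTS = {
    "node_cpu": 2.0,
    "node_ram_gb": 3.75,
    "closes_per_day": 40.0,
}


def check_report(reported, facts, rel_tol=0.25):
    """Return a list of collisions between a worker's report and the plan.

    rel_tol is an assumed tolerance: a reported figure counts as a collision
    when it differs from the recorded fact by more than this fraction.
    """
    problems = []
    for key, expected in facts.items():
        if key not in reported:
            # An unstated number cannot be checked, so surface it.
            problems.append(f"{key}: not stated, ask the worker to commit to a figure")
            continue
        got = reported[key]
        if abs(got - expected) > rel_tol * abs(expected):
            problems.append(f"{key}: reported {got}, plan says {expected}")
    return problems


# Hypothetical worker report that quietly assumed a bigger node and a lower rate.
report = {"node_cpu": 2.0, "node_ram_gb": 8.0, "closes_per_day": 5.0}

problems = check_report(report, PLAN_FACTS)
for line in problems:
    print("that can't be right ->", line)
```

The check is deliberately simple. Its value is that the comparison happens against numbers the worker did not generate, which is the same move as the interactive correction, made ahead of time.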
The pet version and the cattle version are the same practice. Interactively you supply the missing fact in real time. At scale you supply it in advance, in writing, and you keep auditing the output against the world because the model’s confidence never told you anything in the first place.
The question I now ask
When an agent hands me a result that sounds right, I ask:
What number is this conclusion resting on, and have I seen that number with my own eyes?
If the conclusion rests on a figure the agent could only have known by being told — and I never told it — then I’m not reading a finding, I’m reading a plausible guess in the costume of one. The fix is not to argue with the prose. It’s to put the real number on the table and watch what survives contact with it.
— Jed
Background: “Benchmarks measure a model you are not running,” on context and what the model is actually operating with; and “Deterministic state machines for non-deterministic agents,” where the “check it against something external” reflex becomes a validation gate.