Jed Arden

The agentic coding ladder is a list of things you give up

Sat, 23 May 2026 00:00:00 GMT

Steve Yegge has a framework I keep coming back to: eight levels of AI-assisted development, from "no AI" up to "build your own orchestrator." It reads the climb by trust — how much of the machine's output you'll accept without looking. Permissions on, permissions off, stop reading diffs, abandon the IDE, run a fleet. It's a genuinely good way to find where you stand, and I'm not trying to replace it.

What follows is a second instrument pointed at the same mountain. Climbing those levels myself, I kept noticing a quieter move underneath the growing trust: a renunciation. To climb a rung I had to give up something I'd treated as my job — not just trust the machine more, but do less of the thing I was proud of.

So this measures the same ascent off a different dial. Not "how much do you trust it." What have you given up. Both readings are true; I've just found the second one tells me more about the person doing the climbing.

The first thing you give up is the search bar.

How the ladder is shaped

There are two halves and two thresholds.

In the first half, the thing that grows is the size of the unit you hand off. A snippet, then a function, then a file, then a whole application. You're still writing; you're just writing less of the code and more of the intent. The first threshold is where that growth tops out: you stop writing code at all.

In the second half, the axis flips. You've run out of bigger pieces of code to delegate, so what grows instead is your distance from the work. More sessions, then too many sessions, then sessions you don't sit in. The second threshold is the important one, and it's the one I've spent years on: the agent stops being able to ask you anything. A live CLI is a conversation — it hits ambiguity, asks, you answer. Headless is a dispatch — it hits ambiguity and must resolve it alone or fail, because you are gone. That crossing is the same line I've called pets versus cattle. It's where everything I build now lives.

The first half — you give up the code

Each rung below shows the avatar with the thing it just gave up faded and struck through beside it — the picture is the renunciation. Read down until the crossed-out object is the last thing you actually stopped reaching for.

1 · The Student — "I ask it to explain the error, but I'd never paste its code in unread." Gave up: Google / Stack Overflow.

2 · The Borrower — "I paste in its snippets, but I assemble the program." Gave up: writing boilerplate by hand.

3 · The Reviewer — "It writes functions in my codebase, but I read every diff before it lands." Gave up: being the author.

4 · The Delegator — "I let it write a whole file and I test the output instead of reading every line." Gave up: reading every line.

5 · The Director — "I write the spec; I judge the app by whether it runs, not by how it's built." Gave up: knowing how my own code works inside.

Rung 3 is where most working developers actually are, and it's a genuinely good place to be — but notice what you surrendered to get there. You stopped being the person who wrote the function. You became the person who approves it. That's not a small thing; it's the first time the code in your repo wasn't authored by you. Rung 4 surrenders the diff itself: you stop reading lines and start testing outputs. By rung 5 you've handed off whole applications, and the only honest way to do that is to put your thinking into the plan up front — because once it's a whole app, you can't steer line by line, and when the result is wrong, the cheapest fix is to throw it away and re-run, not to repair it.

Threshold 1 — you stop writing code. Above this line there is no larger unit to hand off. You are no longer producing code at any granularity. So the ladder has to start measuring something else.

The second half — you give up your presence

6 · The Juggler — "I've got two or three sessions going and I bounce between them; I let them recover instead of fixing by hand." Gave up: undivided focus.

Rung 6 is where the surrender stops being technical and starts being uncomfortable. You give up undivided focus — you accept that you'll never again give one task your whole head — and you start letting agents course-correct instead of correcting them yourself. That "self-healing" people talk about isn't a feature that emerges; it's the thing you're forced to build the moment you stop being available to fix every wrong turn.

7 · The Bottleneck — "I'm losing track of which session is doing what. I'm the constraint now — not the model." Gave up: the belief that attention scales.

Rung 7 is not really a capability. It's a wall. It's the day you realize you are the slowest component in your own system — that adding another session doesn't add throughput, it adds the cost of context-switching into your own skull. Everyone I know who runs agents seriously remembers hitting this. It feels like drowning. And it is the entire reason rung 8 exists.

Threshold 2 — the conversation becomes a dispatch. This is the crossing. You stop opening sessions and start launching jobs. The agent can no longer reach you, which means it can no longer ask you the question that used to save it. Every discipline I care about lives on this side of the line because you removed yourself as the answer: anchoring the work on a fact the model can't see, never letting the agent grade its own homework, and making the unit economics work when nobody's watching the meter. None of those are optional up here. They're the prosthetics that replace the human you just amputated from the loop.

8 · The Dispatcher — "I kick off runs I don't sit in and read the log after — they can't ask me anything." Gave up: being reachable.

9 · The Operator — "The volume means I couldn't inspect any single run even if I tried; I live on dashboards, gates, and cost." Gave up: the option to inspect any individual run.

By rung 9 the surrender is inspectability itself. At rung 8 you still could open a run's transcript and follow it if you wanted to. At rung 9 there are too many; you couldn't audit an individual run if you tried, so you're forced up onto aggregates — fleet metrics, verification gates, spend. You stop looking at any cow and only ever look at the herd.

How to find yourself on this

Don't look for your rung. Look for your band.

Read from rung 1 upward. Each rung names something you've given up. Keep climbing for as long as the surrender is genuinely, permanently true of you — not "I could," but "I have." The first rung whose renunciation you haven't actually made — where you'd honestly say "no, I still do that" — is your ceiling. You reside on the rung just below it.

That's the strict part, and it's strict on purpose. Owning the infrastructure for a rung doesn't put you on it. I built a fleet orchestrator; I can reach rung 9 any afternoon. But I haven't given up dropping into individual runs — I still do it, constantly — so I don't live at 9. I reside lower and reach higher, and the truth about me is the band between them, not a single number. That's the honest shape of everyone working near the top. Nobody stands on one rung. The renunciation test just keeps you from flattering yourself about which one.
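The renunciation test is mechanical enough to write down. What follows is a minimal sketch in Python, assuming each rung gets a strict yes-or-no answer. The function names (`resident_rung`, `band`) and the sample answers are illustrative and not part of the framework itself.

```python
# Rung numbers and renunciations as listed in the two halves above.
RUNGS = {
    1: "Google / Stack Overflow",
    2: "writing boilerplate by hand",
    3: "being the author",
    4: "reading every line",
    5: "knowing how my own code works inside",
    6: "undivided focus",
    7: "the belief that attention scales",
    8: "being reachable",
    9: "the option to inspect any individual run",
}


def resident_rung(given_up: dict[int, bool]) -> int:
    """Climb from rung 1 and stop at the first renunciation not yet made.

    given_up[n] should be True only for "I have", never for "I could".
    Returns 0 if even rung 1's surrender is not true of you.
    """
    resident = 0
    for rung in sorted(RUNGS):
        if not given_up.get(rung, False):
            break  # this rung is the ceiling
        resident = rung
    return resident


def band(given_up: dict[int, bool], reach: int) -> tuple[int, int]:
    """Pair the rung you live on with the highest rung you can reach."""
    resident = resident_rung(given_up)
    return resident, max(resident, reach)


# Hypothetical answers: everything through rung 8 surrendered, rung 9 not.
answers = {rung: rung <= 8 for rung in RUNGS}
print(band(answers, reach=9))  # prints (8, 9)
```

Note that a "yes" above a gap does not count: answering yes to rungs 1, 2 and 4 but no to rung 3 still leaves you resident on rung 2, which is the strictness the test is after.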

Higher isn't better — it's heavier

A ladder invites you to read it as a ranking: get to the top. This one isn't. Each rung up buys capability, but it also adds overhead — specification, planning, scaffolding, verification — and that overhead only pays for itself above a certain size of job. So every rung has a break-even: the smallest piece of work worth bringing to it. And the break-even rises as you climb.

I run headless fleets. I would never point one at a one-line typo. Specifying that fix well enough for a worker that can't ask me anything, then waiting on orchestration and gates to confirm it, costs far more than opening the file and typing the character at rung 3. It's chartering a freight train to mail a letter.

So the rung you should use for a task is the lowest one whose break-even the task clears — not the highest one you've reached. This is what the band was already hinting at: I drop to rung 3 every day, not because I've lost rung 9 but because most of what's in front of me doesn't clear rung 9's break-even. Operating below your ceiling isn't regression; it's fit.

Which means the skill the ladder actually rewards at the top isn't living up high. It's choosing the rung per task — and staying honest that a lot of work is small, and small work belongs on a low rung. The most advanced thing you can do with all nine rungs available to you is pick the cheapest one that still clears the job.
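One way to make "pick the cheapest one that still clears the job" concrete is a fixed-plus-marginal cost model: each rung charges a fixed overhead (specification, planning, scaffolding, verification) and a per-unit cost of human attention that shrinks as you climb. The sketch below assumes that model, and every number in it is invented purely for illustration; none of them are measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rung:
    number: int
    overhead_min: float      # fixed human minutes before any work happens
    minutes_per_unit: float  # human minutes per unit of work at this rung


# Hypothetical figures, chosen only so the break-evens rise with the rung.
HYPOTHETICAL_RUNGS = [
    Rung(number=3, overhead_min=0, minutes_per_unit=10),
    Rung(number=5, overhead_min=30, minutes_per_unit=2),
    Rung(number=9, overhead_min=240, minutes_per_unit=0.1),
]


def total_cost(rung: Rung, size: float) -> float:
    """Human minutes spent bringing a job of this size to this rung."""
    return rung.overhead_min + rung.minutes_per_unit * size


def cheapest_rung(size: float, rungs=HYPOTHETICAL_RUNGS) -> Rung:
    """Pick the rung with the lowest total cost; ties go to the lower rung."""
    return min(rungs, key=lambda r: (total_cost(r, size), r.number))


for size in (1, 10, 500):
    print(size, "->", cheapest_rung(size).number)
```

Under these made-up numbers a one-unit job (the typo) lands on rung 3, a ten-unit job on rung 5, and only the five-hundred-unit job justifies rung 9. The exact crossover points depend entirely on the assumed costs; the shape is the point.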

Some people skip rungs

Here's the part that follows from measuring the climb in renunciations rather than skill: a rung can only slow you down if you have the habit it asks you to give up.

I learned to code the old way. I have two decades of reflexes — read the diff, own the function, know how it works inside — and every one of them is a thing the ladder demands I surrender. For me the climb is mostly unlearning. The lower rungs were hard not because the technique was hard but because the identity was.

Someone who learned to build with agents from day one doesn't have that. They never authored a function alone, never formed the reflex to read every line, never built the search-bar muscle, never tied their self-worth to knowing how the internals work. They don't climb rungs 1 through 5 so much as start above them — there's simply nothing there to undo. This is the uncomfortable inversion at the center of the whole framework: seniority is friction. The more experience you have, the more attached you are to exactly the things each rung requires you to drop. The ladder punishes the veteran and waves the newcomer through.

But skipping a rung is not the same as having earned it. The giving-up was never only a loss — each surrender was also where you built the judgment to survive the next one. The veteran who gives up authorship still knows what good code looks like, because reading every diff for twenty years is how they learned it. The newcomer who never authored may direct an app they have no way to evaluate. So skipping rungs is real and it's an advantage, but it's a specific one: nothing to unlearn, not automatic competence. Your floor is set by attachment, not ability — and the people with the lowest floor sometimes have the thinnest judgment holding up their ceiling.

10+

There is a rung above 9. I can't tell you what it is.

That's not a dodge — it's the most rigorous thing on this page. Look at the pattern. Every rung gives up something the rung below it considered non-negotiable. Ask the Reviewer at rung 3 to imagine giving up reading the diff and they'll tell you that's the whole job. Ask the Bottleneck at rung 7 to imagine being unreachable and they can't even parse it. From inside any rung, the next surrender is invisible, because you are still standing on the thing you're about to give up. A ladder that could name its own summit would be lying about how the first nine rungs felt before anyone had climbed them.

So 10 is defined by its shape, not its content: it is whatever still feels load-bearing at rung 9 — the thing I'd resist giving up hardest, the pillar I can't currently see as a pillar. That gives a clean test, and the test is what makes 10 honest. A candidate only qualifies as level 10 if, standing at 9, it still feels indispensable. Which is exactly why "fleets that assign their own work" or "fleets that tune themselves" aren't it — I can already picture those. They're just more of 9. The real one is something I'd argue with you about. Maybe it's human intent at the root of the work. Maybe it's human accountability for the output. Maybe it's the human as the one who still decides what "good" means. I won't commit to any of them, because if I could name it correctly, it wouldn't be 10.

I haven't reached it. I don't know anyone who has. On the map it isn't an arrow — an arrow would mean I know which way is up. It's fog above the tree line. You can see the mountain keeps going. You can't see the next ledge, or even where it sits.

If you can name the thing rung 9 still can't function without — the necessity none of us can see because we're all still standing on it — then you might be the first one looking at level 10.

The whole ladder, in one line: it opens with giving up the search bar, and it ends — if it ends — with giving up something I'm not yet wise enough to miss.

— Jed

Ending is better than mending

Fri, 22 May 2026 00:00:00 GMT

I once spent the better part of an afternoon reading four hundred lines of code an agent had written wrong. I traced its logic, patched the broken parts, tried to bend it toward the thing I had actually asked for. When I finally gave up (git reset --hard, three sharpened sentences in the plan, a re-dispatch), the correct version existed eight minutes later. The hours I spent mending were the most expensive hours of my day. The eight minutes of ending were nearly free.

I had the cost structure exactly backwards, and I had it backwards because I learned to write software in a world that no longer exists.

A slogan from a dystopia

In Brave New World, Aldous Huxley gives his citizens a sleep-taught jingle to keep the economy churning: "Ending is better than mending." Its companion line is "The more stitches, the less riches." The point of the conditioning is to stop people repairing what they own, so they keep consuming. Huxley meant it as a warning — a portrait of a society that had engineered away the instinct to fix things, because thrift was bad for business.

For software, in the agent era, the slogan has quietly become true. Not as propaganda. As arithmetic.

The cost structure inverted

For the entire history of software until very recently, code was the expensive thing. Every line was hand-written by a person whose time cost money, so a working artifact represented accumulated human labor. Throwing it away threw away that labor. Mending — debugging, patching, salvaging — was almost always cheaper than rewriting, because rewriting meant paying the labor cost twice. The whole craft was built on that premise. "Don't rewrite, refactor" is the premise wearing a cardigan.

That premise broke. Writing code is now cheap: an agent and a pile of tokens produce four hundred lines in minutes for a few cents. What did not get cheaper — what got relatively more expensive as everything around it fell — is human time and taste. Your judgment. Your attention. The scarce, slow, irreplaceable act of looking at something and knowing it's wrong, and knowing what right would look like.

When the two inputs trade places like that, the optimal move trades places with them:

When the artifact is wrong, fixing it spends the input that got expensive and regenerating it spends the input that got cheap. Default to regenerating.

Mending is a tax on attention

The hidden cost of mending isn't the typing. It's the comprehension. To fix four hundred lines of wrong code, you first have to load the agent's reasoning into your own head — including the parts that are wrong, because you can't tell which parts are wrong until you understand all of it. You pay full price to build a mental model of something you're going to partially discard. That modelling is the single most expensive thing you do all day, and mending demands it up front, in full, before you've fixed a single line.

Ending skips the tax. You don't have to understand wrong code to delete it. git reset --hard costs no comprehension at all. The only thing you have to understand is what you wanted — which you needed to understand anyway, and which is far cheaper to hold in your head than someone else's flawed attempt to deliver it.

This is why the afternoon felt so wrong in retrospect. I spent the expensive resource — hours of my own comprehension — to rescue the cheap one. I should have spent thirty seconds of taste deciding the artifact was unsalvageable, and let the cheap resource run again.

Why this is safe now, and wasn't before

You cannot tell people to throw away working code without sounding reckless, so let me be precise about the condition that makes it sane: ending is only cheap when the value has moved out of the artifact and into something durable.

For an agent fleet, that durable thing is the plan. I've argued before that the plan is the prompt: the token-dense, carefully written specification is the real artifact, and the code is a rendering of it. If that's true, then nuking the codebase isn't destroying value; it's discarding one rendering and asking for another. The plan is the negative; the code is just a print. You can burn the print.

This is the same move as pets versus cattle, pushed down one level. Cattle made the agent disposable: don't nurse the worker, replace it. "Ending is better than mending" makes the output disposable too: don't nurse the artifact, regenerate it. Both rest on the same foundation — value lives in the reproducible source (the template, the plan), never in the running instance.

The other half of the condition is mechanical: reversibility. git reset --hard and even rm -rf && regenerate are only safe moves on top of disciplined version control — atomic commits, clean revert points, a plan and a task graph that survive the deletion. Ending is cheap because the floor is solid. Knock the floor out — uncommitted work, no plan, no clean checkpoint — and ending stops being thrift and becomes the reckless thing it sounds like.
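The "solid floor" can be checked before anything is deleted. Here is a minimal sketch, assuming the plan lives in a file in the repository; the name `PLAN.md` is a hypothetical placeholder, not a convention from this post. It relies on `git status --porcelain`, which prints nothing when the working tree has no uncommitted or untracked changes.

```python
import subprocess
from pathlib import Path


def working_tree_is_clean(porcelain_output: str) -> bool:
    """True when `git status --porcelain` reported no changes at all."""
    return porcelain_output.strip() == ""


def safe_to_end(porcelain_output: str, plan_exists: bool) -> bool:
    """Ending is only thrift if nothing uncommitted is lost and a plan survives."""
    return plan_exists and working_tree_is_clean(porcelain_output)


def check_repo(repo: Path, plan_file: str = "PLAN.md") -> bool:
    """Run the check against a real repository. This never resets anything.

    plan_file is a hypothetical name; point it at wherever the plan lives.
    """
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo,
        capture_output=True,
        text=True,
        check=True,
    )
    return safe_to_end(result.stdout, (repo / plan_file).is_file())
```

The check is deliberately conservative: an untracked file also shows up in the porcelain output, so a plan that exists but was never committed counts as a missing floor.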

When mending still wins

The slogan is a default, not a law, and the place it breaks is worth naming exactly.

Ending discards everything the artifact knows that the plan does not. Most of the time that's nothing — the code was a faithful rendering of a spec you still have. But sometimes the artifact has accumulated real knowledge that never made it back into the source: a subtle concurrency fix, three edge cases you discovered only by running it, a workaround for a genuine bug in a dependency. If that knowledge lives only in the code, regenerating throws it away, and the fresh artifact will rediscover the same pain the hard way.

So the test is not "is this wrong." The test is:

Where does the knowledge live — in the plan, or only in the code?

If it lives in the plan, end: git reset beats the debugger every time. If it lives only in the code, mend once — and then do the thing that actually matters, which is to move that knowledge back into the plan. Write the edge case into the spec. Add the concurrency note to the acceptance criteria. You mend the artifact exactly long enough to harvest what it learned, and then you make sure you never have to mend it again, because next time the plan will know.

There's a second failure mode that masquerades as ending: regenerating against the same vague plan that produced the mess. That isn't ending, it's thrashing — you'll get a differently-wrong artifact and spend your comprehension budget all over again. Ending only pays off if you change the input. A regeneration that doesn't sharpen the plan is just a bare "no" wearing a git reset costume.
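The test and its two failure modes reduce to a small decision table. A sketch, assuming you can answer two questions honestly; the enum and function names are illustrative:

```python
from enum import Enum


class Move(Enum):
    END = "regenerate from the sharpened plan"
    SHARPEN_FIRST = "sharpen the plan before regenerating"
    MEND_AND_HARVEST = "mend once, then write what the code learned into the plan"


def next_move(knowledge_only_in_code: bool, plan_changed: bool) -> Move:
    """Decide what to do with a wrong artifact.

    knowledge_only_in_code: the artifact holds fixes or edge cases
        that never made it back into the plan.
    plan_changed: the plan has been sharpened since it produced this artifact.
    """
    if knowledge_only_in_code:
        return Move.MEND_AND_HARVEST
    if not plan_changed:
        # Regenerating against the same vague plan is thrashing, not ending.
        return Move.SHARPEN_FIRST
    return Move.END


print(next_move(knowledge_only_in_code=False, plan_changed=True).value)
```

The ordering of the checks matters: knowledge that lives only in the code is checked first, because regenerating would destroy it no matter how sharp the plan has become.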

Owning the dystopia

It would be too easy to quote Huxley approvingly and miss that he was describing a nightmare. Mindless disposability is corrosive — that was the whole point of the book. So what keeps this from being his consumer dystopia rebuilt in a code editor?

It comes down to where the taste goes. Ending is better than mending only if the attention you save by not salvaging code gets reinvested upstream — in the plan, the acceptance criteria, the specification that the next regeneration will render. Spend it there and ending is leverage: you move your scarce judgment to the highest point in the system, where one improvement propagates into every future artifact. Pocket it instead — regenerate mindlessly, never sharpen the source, let the agent churn — and you've built exactly the thing Huxley was warning about, just with tokens instead of textiles.

The discipline, in one line: end the artifact, mend the plan. Be ruthless with the disposable thing and precious with the durable one. That's the opposite of the dystopia. It's the only version of "ending is better than mending" worth practicing.

The question I now ask

Before I open the debugger on something an agent got wrong, I ask:

Is this cheaper to regenerate from the plan than to understand and fix — and if I regenerate, what am I changing about the input so I don't just get the same thing back?

If the knowledge lives in the spec, my hand goes to git reset, not to the stack trace. If it lives only in the code, I mend once, harvest the lesson into the plan, and end it forever after. The expensive resource is my own attention. I try very hard now to spend it at the source, where it compounds — not in the artifact, where it evaporates the next time the agent runs.

— Jed

Background: The plan is the prompt — why the durable artifact is the spec, not the code that renders it. And Pet agents vs. cattle agents — the same disposability logic, one level up.

Don't let the agent grade its own homework

Thu, 21 May 2026 00:00:00 GMT

"Done — the pod is deployed and running." It wasn't. The image had built, the manifest had applied, and the pod was sitting in CrashLoopBackOff because of an env var the agent never set. The agent wasn't lying. It had done the steps it set out to do, observed that each step returned success, and concluded — reasonably, from inside its own process — that the job was finished. What it had not done was look at the thing it built from the outside.

The correction I now reach for reflexively used to be an afterthought: "have a separate instance monitor the pod to confirm it's running." Not "are you sure?" — asking the same agent to re-grade its own work just gets you the same grade with more words. A different observer. The shift from the first habit to the second is the whole post.

Self-report is the actor grading its own work

When an agent tells you it succeeded, you are reading a claim produced by the same process that did the work, using the same assumptions that may have caused the failure. If the agent misunderstood the goal, its definition of "done" inherited the misunderstanding. If it set the wrong env var, its mental model of "running" is the one without that var. Asking it to verify its own output runs the check through the exact mistake you're trying to catch. The blind spot grades itself and reports 20/20.

This isn't an agent flaw so much as a structural one — it's true of people too, which is why code review exists and why surgeons count the sponges twice with a second person. The fix is the same fix it's always been:

Verification has to come from somewhere the work didn't.

"Somewhere else" can be cheaper than it sounds. It does not require a second model or fancy infrastructure. It requires that the check not share the doing's assumptions.

Three places "somewhere else" comes from

The live artifact. The strongest observer is the thing itself, queried fresh. Don't ask the agent whether the pod is running — read the pod status. Don't ask whether the deploy landed — "confirm the new version actually landed on the server." Don't trust that the UI works — drive it: I'll have an agent verify a phone app over the live ADB connection rather than accept "the screen should now show X." The artifact doesn't share the agent's assumptions because the artifact isn't reasoning at all. It just is whatever it is.

A separate agent. Spawn a second agent whose only job is to confirm, with no stake in the first one's story and, ideally, no knowledge of how the work was done. "Have a separate instance watch the logs and confirm functionality." A fresh agent reads the running system cold and reports what it actually sees, instead of what the builder expected to have produced. The independence is the point — give the verifier the original goal and the artifact, not the builder's narrative of success.

A different tool. Build with one tool, check with another. The compiler said it built; does the test suite agree? The agent says the migration ran; does a raw SELECT against the table agree? Each tool has its own failure modes, and a claim that survives two unrelated ones is far more likely to be true than one blessed twice by the same path.

The common thread: the verifier must be able to disagree with the actor. If your check can only ever return "yep, looks good," it isn't a check. It's a mirror.
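For the pod from the opening, a check that can disagree might look like the sketch below. It reads the pod's own status as returned by `kubectl get pod NAME -o json` and ignores whatever the builder reported. The sample pod is invented for illustration. Worth noticing: a crash-looping pod will typically still report a `Running` phase, so the phase alone is a mirror, and the container states are where the disagreement shows up.

```python
import json
import subprocess


def pod_verdict(pod: dict) -> tuple[bool, str]:
    """Judge a pod from its status alone, with no input from the builder."""
    status = pod.get("status", {})
    containers = status.get("containerStatuses", [])
    if not containers:
        return False, f"no container statuses yet (phase={status.get('phase')})"
    for cs in containers:
        waiting = cs.get("state", {}).get("waiting")
        if waiting:
            return False, f"{cs.get('name')}: waiting ({waiting.get('reason')})"
        if not cs.get("ready", False):
            return False, f"{cs.get('name')}: not ready"
    return True, "all containers running and ready"


def fetch_pod(name: str, namespace: str = "default") -> dict:
    """Query the live artifact. Requires kubectl and cluster access."""
    out = subprocess.run(
        ["kubectl", "get", "pod", name, "-n", namespace, "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)


# Hypothetical status for a pod whose container keeps crashing.
crashing = {
    "status": {
        "phase": "Running",
        "containerStatuses": [
            {
                "name": "app",
                "ready": False,
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
            }
        ],
    }
}
print(pod_verdict(crashing))
```

Because the function takes only the pod object, a second agent or a plain script can run it without ever seeing the first agent's account of what it did.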

Bake the verification into the request

The habit that made this stick for me is small: I no longer ask for the work and the verification as two messages. I ask for them as one. "Build the image, confirm it builds, then update the deployment, then confirm the new pod is actually running." The verification step is part of the task, not a follow-up I might forget after the agent's confident "done" has already half-convinced me.

This matters because the failure mode isn't usually that you can't verify — it's that the agent's success report lowers your guard at exactly the wrong moment. Fluent confidence is sedating. By the time you've read "deployed and running," some part of you has already moved on. Writing the check into the original ask means the verification happens while you're still paying attention, before the report has a chance to talk you out of looking.

The same practice, wearing fleet armor

Interactively, this is something I do by hand: I spawn the watcher, I read the pod status, I tell the agent to confirm before it moves on. It's a reflex, and it's reversible — if I skip it, I find out a few minutes later and re-run.

In a fleet there is no "a few minutes later me" watching. A worker that grades its own homework and closes its own task injects a silent failure straight into the system, and the next worker builds on top of it. So the same instinct hardens into structure: in the state machine, a successful exit code is not a success — it's a claim that has to clear a validation gate before the task is allowed to close, and if the gate fails the task is released back to the queue rather than marked done. The acceptance criteria written into each task are exactly the "somewhere else" the check comes from: a definition of done the worker did not get to author.
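As a rough illustration of that transition, here is a sketch of the closing step. The state names and the `settle` function are assumptions made for the example, not the actual state machine described in the background post.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class State(Enum):
    QUEUED = "queued"
    CLAIMED = "claimed"
    DONE = "done"


@dataclass
class Task:
    name: str
    gate: Callable[[], bool]  # acceptance check the worker did not author
    state: State = State.QUEUED


def settle(task: Task, exit_code: int) -> State:
    """Decide a task's next state after its worker exits.

    A zero exit code is only a claim. The task closes when the gate agrees;
    otherwise it goes back to the queue for another worker.
    """
    if exit_code == 0 and task.gate():
        task.state = State.DONE
    else:
        task.state = State.QUEUED
    return task.state


# The worker exits cleanly, but the independent gate says the work is not done.
task = Task(name="deploy-api", gate=lambda: False, state=State.CLAIMED)
print(settle(task, exit_code=0))  # prints State.QUEUED
```

The worker never receives a reference to `gate`, so it has no way to mark its own task done; the most it can do is exit and let the check run.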

It is the identical practice. Interactively, I am the gate, and I stay cheap by staying in the loop. In the fleet, the gate is code, because no one is in the loop. What does not change between the two is the rule that produced both: the thing that did the work does not get to be the thing that certifies it.

The question I now ask

Before I accept an agent's "done," I ask:

Who checked this — and was it the same process, with the same assumptions, that did the work?

If the answer is "the agent says so," I haven't verified anything; I've read a press release. The fix is never to interrogate the agent harder. It's to go find an observer that didn't share the work's blind spot — the artifact, another agent, another tool — and let it have the last word.

— Jed

Background: Deterministic state machines for non-deterministic agents — where this reflex becomes a validation gate that a worker's "done" has to clear. And Pet agents vs. cattle agents — why no-one-is-in-the-loop changes everything about who gets to certify the work.

Anchor on a fact the model can't see

Wed, 20 May 2026 00:00:00 GMT

A worker reported it had sized a node pool for the new cluster. The plan looked reasonable, the YAML looked reasonable, the explanation was fluent. The only problem was a number I happened to know and the agent didn't: the configured Rackspace Spot node is 2 CPU and 3.75 GB of RAM. The agent had quietly reasoned as though it had more. "That doesn't make sense," I typed, "the node is 2CPU/3.75GB" — and the whole plan unwound, because it had been built on a resource budget that didn't exist.

I do some version of this constantly. "5 beads per 24h? The close rate is far higher than that." "That can't be right." It is the single most reliable correction I make, and it works because of one fact about these models that took me a while to internalize.

Confidence is not correlation

A language model's fluency is uncorrelated with whether it's right. It will describe a node pool it sized wrong with exactly the same calm competence it uses to describe one it sized right. There is no tremor in the prose when the underlying number is fiction. This is not the model lying — it is the model doing precisely what it does, which is produce the most plausible continuation given what's in its context. If the true number was never in its context, the most plausible continuation is built on a guess, and the guess is delivered with the same production values as a fact.

So you cannot audit an agent by reading for hesitation. There is none. You have to audit it against something outside the text.

The model is fluent about a number it can't see. Your job is to be the one who can see it.

Why the number is usually missing

Benchmarks measure a model running on an almost-empty context window, and most real work is the opposite — but "full context" is not the same as "correct context." The agent can have ten thousand tokens of code in its window and still be missing the one operational fact that decides the task: the actual size of the node, the real close rate, the column that's secretly nullable, the rate limit on the third-party API, the fact that this cluster is read-only.

These facts share a property: they live in the world, not in the repo. They're in a dashboard, in your head, in a billing console, in something you saw last Tuesday. The agent has no path to them unless you put them in front of it. And because it can't tell that they're missing — absence of a fact doesn't feel like anything from the inside of a context window — it doesn't ask. It interpolates. Confidently.

This reframes what you, the human, are actually for in the loop. You are not there to write the code; the agent is faster than you at that. You are there because you are holding facts the agent structurally cannot hold, and the highest-leverage thing you do all session is notice when the output collides with one of them.

Numbers are the best anchors

Of all the ground-truth facts you might hold, numbers are the ones to reach for first, because they fail loudly. A node has 3.75 GB, not "enough." A close rate is 40 a day, not "low." A node pool either fits in the budget or it doesn't, and you can check by subtracting. Prose claims are slippery — "this should improve latency" is hard to falsify in a glance — but a quantitative claim either matches the number you're holding or it doesn't, and the mismatch is instant.

So when I read an agent's report, the thing I'm hunting for is the load-bearing number. How big did it think the node was? How many of these did it assume there are? What rate did it design against? Often the agent doesn't even state the number — it's implicit in a decision. Half the work is making the implicit number explicit so I can check it: "what RAM budget did you assume here?" The moment it commits to a figure, I can compare against the one in my head, and the whole edifice stands or falls on that single comparison.
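The comparison itself is just subtraction. A sketch using the node size from the story (2 CPU, 3.75 GB) against a set of workload requests that are invented for illustration:

```python
NODE_CPU = 2.0      # from the story above
NODE_MEM_GB = 3.75  # from the story above


def headroom(requests, cpu=NODE_CPU, mem_gb=NODE_MEM_GB):
    """Return (cpu_left, mem_left_gb) after placing every request on one node.

    requests is a list of (cpu, mem_gb) pairs. A negative value means the
    plan assumed resources the node does not have.
    """
    used_cpu = sum(c for c, _ in requests)
    used_mem = sum(m for _, m in requests)
    return cpu - used_cpu, mem_gb - used_mem


def fits(requests, cpu=NODE_CPU, mem_gb=NODE_MEM_GB) -> bool:
    cpu_left, mem_left = headroom(requests, cpu, mem_gb)
    return cpu_left >= 0 and mem_left >= 0


# Hypothetical requests a plan might have packed onto a single node.
planned = [(0.5, 1.0), (1.0, 2.0), (0.5, 1.5)]
print(headroom(planned))  # prints (0.0, -0.75)
print(fits(planned))      # prints False
```

Treat the nominal node size as an upper bound: the memory actually available to workloads is usually somewhat lower once the system's own reservations are taken out, so a plan that only just fits here may still not fit in practice.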

This scales straight into the fleet

Interactively, this is a reflex: read the report, find the number, check it, say "that doesn't make sense" when it collides. Across a fleet it becomes a design requirement, and it's the same instinct wearing a different hat.

If the operator's job is to anchor agents against facts they can't see, then a fleet needs those facts written into the work, not held in one human's head — because there's no human in a cattle worker's loop to say "that can't be right." The node size goes in the plan. The expected close rate becomes an acceptance criterion the worker validates against. The reports the fleet emits get spot-checked against the dashboards, because a confident report is exactly as trustworthy at scale as it was in your single session — which is to say, not at all, until you've checked it against something it couldn't see.
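Written down, the anchors become data that a gate can compare against. A minimal sketch, assuming reports arrive as simple key-to-number mappings. The key names, the 25% tolerance, and the reported node size of 8 GB are assumptions made for illustration; the 3.75 GB, 40-a-day, and 5-per-24h figures are the ones quoted earlier.

```python
def collisions(report, anchors, tolerance=0.25):
    """Return every anchored figure the report gets wrong or never states.

    report and anchors map a name to a number. A figure collides when it
    differs from the anchor by more than `tolerance` (a fraction of the
    anchor). A missing figure is reported as None so the implicit number
    gets forced into the open.
    """
    found = {}
    for key, truth in anchors.items():
        if key not in report:
            found[key] = (None, truth)
            continue
        claimed = report[key]
        if abs(claimed - truth) > tolerance * abs(truth):
            found[key] = (claimed, truth)
    return found


# Facts the operator holds, written into the work ahead of time.
anchors = {"node_mem_gb": 3.75, "closes_per_day": 40}

# Hypothetical figures pulled from a worker's report.
report = {"node_mem_gb": 8, "closes_per_day": 5}

print(collisions(report, anchors))
```

An empty result means the report survived contact with the numbers it was checked against. It says nothing about figures nobody anchored.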

The pet version and the cattle version are the same practice. Interactively you supply the missing fact in real time. At scale you supply it in advance, in writing, and you keep auditing the output against the world because the model's confidence never told you anything in the first place.

The question I now ask

When an agent hands me a result that sounds right, I ask:

What number is this conclusion resting on, and have I seen that number with my own eyes?

If the conclusion rests on a figure the agent could only have known by being told — and I never told it — then I'm not reading a finding, I'm reading a plausible guess in the costume of one. The fix is not to argue with the prose. It's to put the real number on the table and watch what survives contact with it.

— Jed

Background: Benchmarks measure a model you are not running — on context and what the model is actually operating with. And Deterministic state machines for non-deterministic agents — where the "check it against something external" reflex becomes a validation gate.

'No' is not an instruction

Tue, 19 May 2026 00:00:00 GMT

The agent picks OpenAI's API to transcribe some audio. You don't want that — you have Whisper running on your own cluster, and you'd rather not ship audio to a third party. So you type "no, don't use OpenAI." The agent apologizes, agrees, and reaches for... AssemblyAI. Still wrong. You type "no" again. It tries Deepgram. You are now three turns deep and the agent is walking a random path through the space of speech-to-text vendors, because every turn you have told it where not to go and never once told it where to go.

This is the most common way I waste turns with a coding agent, and it took me an embarrassing number of sessions to see the shape of it.

"Not that" is an enormous space

A rejection carries one bit of information: the last thing was wrong. That is genuinely useful, but it is also almost all you've given the agent. Everything that is "not the rejected thing" is still on the table, and for most decisions that set is huge. There are a dozen transcription vendors. There are twenty ways to do auth. There are infinitely many refactors that are "not the one you just did."

When you say only "no," you are asking the agent to guess again from a barely-narrowed space. It will guess confidently, because that is what these models do, and it will frequently guess in a direction you like even less than the first one. You have spent a turn and bought yourself a fresh problem.

The fix is one clause long:

A rejection that names the alternative re-steers in a single turn. A rejection without one buys another guess.

Here is what that looks like in my own transcripts, again and again:

"Don't use OpenAI. Use the Whisper instance running on the cluster."
"No need for an API lookup — checking the profile picture is a free signal, use that."
"Don't use middleware for auth. Have the page itself ask for the password."

None of those is more than a sentence longer than the bare "no" would have been. Each one collapses the search space to a single point. The agent does not have to guess what I meant, because I told it.

Why the asymmetry is brutal at scale

With a single interactive session — a pet — a content-free rejection is merely annoying. You catch the second wrong guess, you sigh, you finally say the thing you should have said the first time. The cost is a couple of turns and a little of your patience.

Run the same habit across a fleet and the cost stops being linear. A vague correction is information you didn't write down, and the input is the plan. When a cattle worker misreads the task, it doesn't pause to ask — it runs to completion in some "not what you wanted" direction, fails validation, and the task goes back on the queue for the next worker to misread differently. A rejection you'd have typed interactively never even reaches them. The only thing that reaches a stateless worker is what you committed to the plan, the bead, the standing rule. So the discipline you build interactively — name the alternative, don't just veto — is the same discipline that makes a written task unambiguous enough to hand to a worker you'll never talk to.

This is why I think of it as one practice with two surfaces. Interactively, "don't X, do Y" saves you a turn. In a plan, "don't X, do Y" saves you a failed run multiplied by however many workers hit the ambiguity before you noticed.

The tell: when you reach for "no," you know the answer

Here is the part that makes this actionable. Almost every time I catch myself typing a bare "no," I already know what I want instead. The right answer is right there — I just didn't type it, because rejecting felt faster than specifying. It isn't. The half-second I saved by typing "no, don't do that" I pay back with interest on the next turn when the agent guesses again.

So the rule I hold myself to now is mechanical: if I'm about to reject something, I am not allowed to send the message until it contains the alternative. If I genuinely don't know the alternative yet, that's a different and more honest message — "stop, I need to think about the right approach here" — and it should not masquerade as a correction. A correction implies I know the target. If I know the target, I should say it.

There's a softer version of the same failure that's worth naming: the rejection that's technically a redirect but points nowhere useful. "No, do it properly." "That's not what I meant." "Be more careful." These feel like instructions and contain none. "Properly" is not a destination. If the agent could infer "properly," it would have gone there the first time.

What it costs

Almost nothing, which is exactly why it's hard. The whole tax is a moment of thinking before you hit enter instead of after the agent guesses wrong. You have to convert the vague dissatisfaction you feel — "ugh, not that" — into the specific thing you'd prefer, in the same breath. The feeling arrives before the specification; the work is refusing to send the feeling on its own.

The one real cost: sometimes you'll discover, in the act of trying to name the alternative, that you don't actually know what you want. That is not the practice failing. That is the practice doing its most valuable job — surfacing, for the price of one unsent message, that the problem was underspecified in your own head before it was ever underspecified to the agent.

The question I now ask

Before I send a correction — interactive or written into a plan — I ask:

If the agent does the literal opposite of what it just did, will that be right?

If yes, a plain "no" is fine; the space really is binary. That case is rare. Far more often the answer is "no, the opposite is also wrong," which means the space is wide, which means a bare rejection is about to cost me another guess — and the alternative I'm failing to type is already sitting in my head, waiting to be sent.

— Jed

Background: The plan is the prompt — why the information you don't write down is the information that costs you, multiplied by every worker that hits the gap. And Pet agents vs. cattle agents — the model that makes the asymmetry visible.

Benchmarks measure a model you are not running

Sun, 17 May 2026 00:00:00 GMT

When a new model drops, the benchmark scores arrive first. SWE-bench Verified percentage. HumanEval pass@1. APPS accuracy. These numbers travel fast because they are precise and comparable — they feel like a specification sheet for a component you are about to buy.

What they actually measure is a very specific operating condition that most people are not running in. Understanding what that condition is changes how you interpret the score, and it changes which deployment model actually delivers the capability the score advertises.

What the context window looks like during a benchmark

I measured the token count of every problem in seven major coding benchmarks used to evaluate LLM agents. The tokenizer is cl100k_base (the GPT-4 encoding in tiktoken). The numbers are for the raw problem input, meaning the text actually sent to the model.

Benchmark	Problems	Median (tokens)	P99 (tokens)	Max (tokens)
MBPP	257	16	48	49
HumanEval	164	117	310	391
BigCodeBench	1,140	129	363	1,216
SWE-bench Verified	500	294	2,514	6,939
SWE-bench	2,294	282	2,937	22,483
LiveCodeBench	400	421	1,105	1,521
APPS	5,000	456	1,103	1,815

In six of the seven benchmarks, every single problem fits under 8,000 tokens. The exception is SWE-bench, whose largest problem, an outlier issue at 22,483 tokens, is still under 12% of Claude's 200,000-token context window. At the median, HumanEval problems use 0.06% of that window. MBPP problems use 0.008%. (Claude does not tokenize with cl100k_base, so read these percentages as approximations.)
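The percentages are plain division against the window size. A short sketch that recomputes them from the median and maximum columns of the table above; the `share` helper is a name introduced here for illustration.

```python
CONTEXT_WINDOW = 200_000  # tokens, the window size quoted in the text

# (median, max) token counts copied from the table above
benchmarks = {
    "MBPP": (16, 49),
    "HumanEval": (117, 391),
    "BigCodeBench": (129, 1_216),
    "SWE-bench Verified": (294, 6_939),
    "SWE-bench": (282, 22_483),
    "LiveCodeBench": (421, 1_521),
    "APPS": (456, 1_815),
}


def share(tokens, window=CONTEXT_WINDOW):
    """Percentage of the context window that `tokens` occupies."""
    return 100.0 * tokens / window


for name, (median, largest) in benchmarks.items():
    print(f"{name:20s} median {share(median):6.3f}%   max {share(largest):6.2f}%")
```

Even the single largest problem stays in the low double digits, and every median is a small fraction of one percent.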

The model being benchmarked is operating with its context window essentially empty.

Two conditions, not one

It is tempting to read "empty context window" as a capacity fact — the model has space available. The more important thing is what that space is not filled with.

A benchmark evaluation has two properties that travel together but are worth separating:

Few tokens of input. The problem statement is small. At a median of 117 tokens, the typical HumanEval prompt is shorter than many Slack messages. The model has room to reason freely.

No prior turns. The context contains nothing except the problem. No corrections from three exchanges ago. No dead end the model went down and you had to redirect. No half-finished implementation that the model is tempted to continue in the wrong direction. No long system prompt from a previous session that is now only partially relevant. The model starts with a blank slate.

The second condition is the one that matters more, and it is the one people talk about least.

What fills a pet session's context

A pet agent session accumulates context continuously. After an hour of work, the context window contains:

The original task description (probably underspecified)
Your first few corrections and clarifications
The model's first attempt, which missed something
Your redirect
The model's second attempt, partially correct
More corrections
Tool call outputs — file reads, test runs, error messages
The model's current understanding, shaped by all of the above

This is not neutral. Research on long-context behavior shows that models degrade when relevant content is buried in the middle of a long context, the "lost in the middle" phenomenon. Information at the start and end of a context is attended to more reliably than information in the middle. A long pet session buries its most important content (the actual task requirements) under layers of accumulated back-and-forth.

Beyond attention degradation, there is a more direct effect: the model's prior wrong attempts are in the context. It has seen itself go down a particular path. It has a prior. That prior shapes the next attempt, not always in the right direction.

The benchmark model has none of this. It sees the problem cold. Whatever capability it has for that problem is expressed fully, without interference.

The cattle worker is the benchmark condition

A stateless cattle worker dispatched by an orchestrator starts each task with a context containing:

The project instructions (CLAUDE.md) — stable, curated, written once
The task body from the bead — a precise specification of one unit of work
Whatever reference material the task body explicitly includes

That is structurally close to a benchmark evaluation. The input is small and deliberate. There is no accumulated conversation. No prior wrong attempts. No corrections that anchored the model toward a direction it should abandon.

The pet session is not that model. The pet session is a model operating under conditions that are systematically worse than the conditions under which it was benchmarked — and those conditions degrade further the longer the session runs.

This means benchmark scores are a better predictor of cattle performance than pet performance. When you read that a model scores X% on SWE-bench Verified, the deployment model that will actually realize that capability is a stateless worker with a clean context and a well-scoped task — not an ongoing chat session where the model has been talking to you for two hours.

The score you are buying is not what you are running

The practical consequence is that most people buy capability — a higher benchmark score, a more expensive model tier — and then run it in a mode that systematically degrades that capability below what the benchmark measured.

A pet session with a more capable model is better than a pet session with a less capable model, but both are operating below their benchmark-measured ceiling. The gap between "what the benchmark measured" and "what the pet session delivers" grows as the session accumulates context. By hour three, with 50,000 tokens of back-and-forth, you are running a noticeably different model than the one that got the score.

There is no equivalent degradation in the cattle model. A stateless worker dispatched against a well-specified task is as close to benchmark conditions as production use gets. The score is what you bought. The score is roughly what you run.

This also changes the economics of model selection. The correct question is not "which model scores highest on SWE-bench?" but "which model scores highest on SWE-bench, and am I running it in conditions that will actually realize that score?" If you are running pet sessions, you are paying for capability you are not fully using. If you are running cattle workers with clean contexts and scoped tasks, you are.

What an honest benchmark would measure

The benchmarks that exist were designed for a world of clean evaluations, not for the question of how model capability degrades across a production conversation. None of them measure:

Performance at turn 30 of a live session versus turn 1
Performance with 80K tokens of prior context versus 500
How much capability is recovered by compressing and distilling the context versus starting fresh

These would be more diagnostic for real production use. The existing benchmarks tell you the ceiling. What is missing is the curve that describes how quickly you fall away from that ceiling as the context accumulates, and how different deployment models (cattle versus pet) track that curve differently.

Until those measurements exist, the empirical data points in one direction: the deployment model closest to benchmark conditions is stateless dispatch into clean context. That is the cattle model.

The question I now ask

When someone cites a benchmark to justify a model choice, I ask:

Under what context conditions was that score measured, and are those the conditions we are actually running?

If the answer is "a 300-token prompt against a blank context window" and the deployment is "an ongoing chat session that has been running for six hours," the score is describing a model that is not the model they are running.

Knowing that does not mean you stop using pet sessions — there are tasks where they are the right tool, and where the accumulated context is a feature rather than a bug. It means you understand that benchmark scores are ceiling measurements, and that the delta between the ceiling and what you actually get is a function of the deployment model you chose.

Cattle closes that delta. That is the case, made empirically.

— Jed

Data: token counts measured across 9,755 benchmark problems (HumanEval, MBPP, SWE-bench, SWE-bench Verified, LiveCodeBench, BigCodeBench, APPS) using the cl100k_base tokenizer. Raw data (per-problem token counts + full statistics): benchmark-tokens.json — schema, license, and citation on the data page. Background: pet agents vs. cattle agents — the deployment model this post argues for. The context reconstruction approach that makes clean-context cattle work: the plan is the prompt.

The plan is the prompt

Sat, 16 May 2026 00:00:00 GMT

When a pet agent misunderstands the task, you correct it. Two sentences, maybe three. The session absorbs the correction and continues. The cost is negligible — a few seconds of your time and a handful of tokens.

When a cattle worker misunderstands the task, it runs to completion. It produces output that does not match what you wanted, the orchestrator classifies the outcome as failure, and the task goes back on the queue. Another worker picks it up, starts cold, misunderstands in a slightly different direction, and fails again. If you have twenty workers running the same workspace, the same misunderstanding plays out twenty different ways at twenty different costs before you notice the pattern and fix the input.

The input is the plan.

What you are actually paying for in a pet session

Pet sessions accumulate context through dialogue. You describe what you want, the agent responds, you correct, it adjusts. The eventual output reflects not just your initial description but everything you added along the way: the clarifications, the course corrections, the "no, I meant the other thing." By the time the pet agent produces something useful, it has received substantially more information than what was in your first message.

This is not a flaw — it is how conversations work. But it conceals the actual input cost. The information the agent needs to do the job was never written down in one place. It accumulated, across turns, in a format that is impossible to hand directly to a stateless worker.

When you move to cattle, that cost comes due. The worker starts cold. It gets exactly what you wrote, nothing more. The turns of dialogue that would have filled in the blanks do not exist — so the blanks have to be filled before dispatch. The plan is how you fill them.

The compression problem

A running system has enormous context attached to it: commit history, bead history, comments in the code, decisions made and reversed over weeks of work. A worker technically has access to all of this. The question is whether it can find the relevant signal in time to use it.

Commit history is retrospective and fragmented. You learn what changed but rarely why the tradeoff landed where it did. Bead bodies describe individual tasks, not the coherent shape of the whole. Code captures implementation, not intent. None of these is dense enough.

The plan document is different. It is written to be read by a worker who knows nothing — which means it has to contain everything necessary in as few tokens as possible. A good plan is not comprehensive in the way a specification is comprehensive. It is compressed in a specific way: it records the decisions that were made without recording the deliberations that led to them. It says this tradeoff resolved in this direction rather than here are both sides of the argument.

The compression is the point. A worker does not need to re-litigate the decisions. It needs to know what was decided and work within that. A plan that explains why every decision was made is longer than a plan needs to be and often less useful, because the why is background and the worker needs foreground.

The hierarchy that makes cold starts work

NEEDLE's task structure has three layers:

A genesis bead sits at the root of any significant project. It exists to tie phases together and track overall progress. Its body references the plan document — that reference is the load-bearing connection. The genesis bead does not contain the plan; it points to it.

Phase beads derive from the plan's phasing section. Each phase bead describes a coherent unit of work — a vertical slice, a capability, a milestone — with its own acceptance criteria and its own list of child tasks. Phase beads block the genesis bead; when all phases close, the genesis closes.

Task beads are the atomic units a worker actually executes. A task bead is scoped to a single piece of work a worker can complete in one run: one file, one function, one test suite, one configuration change.

A worker assigned a task bead reads: the task bead body, the parent phase bead, and the plan document at the genesis bead's reference path. In that order, smallest to largest scope, most specific to least specific. By the time it has read all three, it knows exactly what it is doing, how it fits into the phase, and what the overall project is trying to accomplish.
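A rough sketch of that hierarchy and read order, under stated assumptions. This is not NEEDLE's or beads_rust's actual data model: the `Bead` class, its fields, the `cold_start_context` function, and the `docs/plan.md` path are all hypothetical. The sketch also assumes tasks close their phase the same way phases close the genesis.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Bead:
    title: str
    body: str
    plan_path: Optional[str] = None  # set only on the genesis bead
    closed: bool = False
    parent: Optional["Bead"] = field(default=None, repr=False, compare=False)
    children: List["Bead"] = field(default_factory=list, repr=False, compare=False)

    def add_child(self, title: str, body: str) -> "Bead":
        child = Bead(title, body, parent=self)
        self.children.append(child)
        return child

    def close(self) -> None:
        """Close this bead; a parent closes once all of its children have."""
        self.closed = True
        parent = self.parent
        if parent is not None and all(c.closed for c in parent.children):
            parent.close()


def cold_start_context(task: Bead, read_plan: Callable[[str], str]) -> str:
    """Assemble what a worker reads, most specific scope first."""
    phase = task.parent
    genesis = phase.parent
    sections = [
        ("TASK", task.body),
        ("PHASE", phase.body),
        ("PLAN", read_plan(genesis.plan_path)),  # the genesis only points at the plan
    ]
    return "\n\n".join(f"## {label}\n{text}" for label, text in sections)


genesis = Bead("genesis", "Tie the phases together.", plan_path="docs/plan.md")
phase = genesis.add_child("phase-1", "Deliver the upload slice end to end.")
task_a = phase.add_child("task-a", "Write the upload handler.")
task_b = phase.add_child("task-b", "Write the upload handler tests.")

plans = {"docs/plan.md": "Scope lock, acceptance criteria, constraints."}
context = cold_start_context(task_a, plans.__getitem__)

task_a.close()  # the phase stays open because task-b is still open
task_b.close()  # last child closes, so the phase and then the genesis close
```

Passing `read_plan` in as a callable keeps the sketch independent of where the plan actually lives; reading a file from the repository would be one obvious choice.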

This only works if the plan document is coherent enough to anchor the hierarchy. A plan that is vague at the project level produces phase beads that are vague at the phase level, which produce task beads that leave workers guessing. The failure propagates down; the fix has to start at the top.

What has to be in the plan

Six categories of content make a plan useful to a cold-start worker, and each one that is missing degrades the plan.

Scope lock. What the system does, stated precisely enough that a worker can determine whether a given change is in scope or out of scope without asking. This is harder to write than it sounds. The failure mode is a scope statement that is technically accurate but vague enough to be consistent with many different implementations — which means every worker is free to pick a different implementation.

Acceptance criteria. The conditions under which the project is done. Not aspirations ("the system should be fast") but testable criteria ("p99 response time under 200ms with N concurrent users"). Acceptance criteria are what let the orchestrator evaluate whether a worker's output counts as success. If the criteria are absent or vague, the orchestrator cannot classify the outcome reliably, and the outcome table loses a row.

Phase boundaries. Where one phase ends and the next begins, stated as conditions rather than calendar dates. A phase boundary is a checkpoint: the system is in this state before this phase, and in this state after. If the boundary is defined as a date, it is almost certainly wrong the moment implementation starts. If it is defined as a condition, it stays true regardless of how long the phase takes.

Known unknowns. The things you do not know yet, stated explicitly. A plan that does not acknowledge its own uncertainty is pretending to more confidence than it has — and workers will act on that pretended confidence. A plan that says "we do not yet know how to handle the edge case of X; this will be resolved in phase 3" gives workers license to defer the question cleanly rather than improvise an answer that may conflict with what phase 3 eventually decides.

Constraint inventory. The fixed points that eliminate solution space: existing interfaces you cannot change, performance budgets you cannot exceed, security requirements you cannot trade away. Constraints are more useful than requirements because they narrow the space of valid implementations without prescribing a specific one. A worker that knows the constraints can make autonomous design decisions within them; a worker that does not know the constraints makes design decisions that may violate them invisibly.

Rollback plan. What happens if phase N fails. Not "we will figure it out" but the actual fallback: which state the system can be safely returned to, which changes are reversible and which are not, what the recovery path is. Workers do not need this to execute normally — they need it to handle the abnormal cases that the orchestrator surfaces.
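To make the acceptance-criteria item concrete, here is a minimal sketch of a testable criterion expressed as a check an orchestrator could run. It assumes latency samples in milliseconds have already been collected under the agreed load. The function names are hypothetical, and nearest-rank is only one of several percentile definitions.

```python
def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer form of ceil(0.99 * n), which avoids floating-point rounding.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_latency_criterion(latencies_ms, limit_ms=200.0):
    """True when p99 response time is under the limit."""
    return p99(latencies_ms) < limit_ms


# Invented samples: 98 fast requests and 2 slow ones.
samples = [120.0] * 98 + [450.0] * 2
print(p99(samples), meets_latency_criterion(samples))
```

"The system should be fast" cannot be written this way, which is the difference between an aspiration and a criterion.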

The plan is how you avoid pivoting mid-flight

The most expensive thing that can happen to a cattle fleet is not a worker crashing. It is a worker succeeding at the wrong thing — because the plan did not make "the right thing" unambiguous.

A mid-flight pivot in a pet session costs a turn of conversation. A mid-flight pivot in a cattle fleet costs every task derived from the misunderstanding: the work already done, the retries queued, the downstream tasks that were built on the wrong foundation. The later in the implementation the pivot happens, the more work has to be undone. This is why the plan-review gate exists.

There is also an asymmetry in how expensive the pivot is depending on what changed. If the plan needed adjustment, the code artifacts can be wholesale deleted and workers restarted from the revised plan. Code is cheap. A well-scoped implementation takes hours for a fleet of workers, not days. The correct response to discovering your plan was wrong is not to patch the existing code — it is to fix the plan, clear the code, and run again. The plan is the expensive artifact; the code is the output.

This is the rule everything above resolves to: the plan is the source of truth. When the plan and the artifacts disagree, the plan is right by definition, and the artifacts are what gets amended. The direction is not negotiable. Editing the plan to match whatever the code drifted into feels like keeping the document current, but it is laundering a mistake into the source of truth — the next cold-start worker reads the retrofitted plan and treats the drift as intent. You conform the artifacts to the plan, never the plan to the artifacts. The only thing that legitimately changes a plan is a changed decision, made deliberately — and that change leads the code rather than trailing it.

This changes what you invest in. You spend the effort on the plan. You spend comparatively little worrying about whether any individual implementation is precious, because it is not — it is reproducible from the plan in a matter of hours.

The plan-review gate

Before any worker touches the code, the plan goes through /plan-review.

The skill checks 80+ structural patterns across scope, acceptance criteria, architecture, preflight safety, phasing, testing, security, performance, operations, API design, and risk. It was developed from analysis of high-quality planning documents by Jeffrey Emanuel, whose methodology for writing plans that survive contact with implementation influenced how I think about this. The patterns were extracted from what those plans had in common — and, more usefully, from what the plans that failed mid-implementation were missing.

The most common failure patterns cluster around the same four gaps: no acceptance criteria (workers cannot self-evaluate output), no phase gates (workers do not know when a phase is complete), no rollback plan (failures have no recovery path), and no constraint inventory (workers make design decisions in an unconstrained space and produce incompatible implementations). Plan-review checks for all four explicitly, along with everything else.

The output of a review is a scorecard with PRESENT / PARTIAL / MISSING ratings for each item, followed by an offer to draft the missing sections. The offer is worth taking. A plan that passes review at 90% is close enough to deploy; a plan at 60% has gaps that will compound across a fleet.
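One way a scorecard like this could collapse into a single percentage, purely as an illustration. The weighting below (PRESENT counts 1, PARTIAL counts 0.5, MISSING counts 0) is an assumption made for this sketch and is not taken from the plan-review skill. The item names are the four common gaps listed above.

```python
# Assumed weights; the real skill may score differently.
WEIGHTS = {"PRESENT": 1.0, "PARTIAL": 0.5, "MISSING": 0.0}


def review_score(ratings):
    """Percentage score for a mapping of item name -> rating."""
    return 100.0 * sum(WEIGHTS[r] for r in ratings.values()) / len(ratings)


def gaps(ratings):
    """Items that still need drafting, in alphabetical order."""
    return sorted(item for item, r in ratings.items() if r != "PRESENT")


ratings = {
    "acceptance criteria": "PRESENT",
    "phase gates": "PARTIAL",
    "rollback plan": "MISSING",
    "constraint inventory": "PRESENT",
}
print(review_score(ratings), gaps(ratings))
```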

The skill is available at jedarden/jeds-curated-skills.

The cost multiplier

The math for why plan quality matters more in cattle than in pets is straightforward.

In a pet session, a plan gap costs one correction: a few seconds of your time, a few tokens, the session absorbs the fix. The cost is O(1).

In a cattle fleet with N workers, a plan gap costs N failed executions before you notice the pattern. Each failed execution burns its full budget — time, tokens, whatever the worker spent before the orchestrator classified the outcome as failure. The cost is O(N × execution budget). At twenty workers with a per-task budget of 100K tokens, one plan gap that takes two failed iterations to surface costs four million tokens in wasted execution before you see it in the outcome distribution and trace it back to the plan.
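The arithmetic behind that figure, as a small sketch. It assumes every worker burns its full per-task budget on every failed iteration; the function name is introduced here for illustration.

```python
def wasted_tokens(workers, budget_per_task, failed_iterations):
    """Tokens burned before a plan gap shows up in the outcome distribution."""
    return workers * budget_per_task * failed_iterations


fleet_cost = wasted_tokens(workers=20, budget_per_task=100_000, failed_iterations=2)
print(f"{fleet_cost:,} tokens")  # 4,000,000 tokens
```

Doubling either the fleet size or the number of iterations it takes to notice doubles the waste, which is why the gate belongs before dispatch rather than after.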

This is not hypothetical. It is the most expensive class of bug in a cattle system — more expensive than a bad prompt, more expensive than a misconfigured model, more expensive than a network issue. Network issues affect individual calls; plan gaps affect every worker on every task derived from the plan.

The fix is to treat the plan as a first-class artifact with a quality gate, not as a rough sketch you clarify in conversation.

What a plan is not

A plan is not a specification. A specification describes every detail of the implementation. A plan describes the decisions that constrain the implementation without prescribing every detail of it. Workers fill in the details; the plan tells them which details are fixed and which are theirs to choose.

A plan is not a design document. A design document explains how the system will be built. A plan records what the system will do and what success looks like. The how is the worker's job; the what and the done are the plan's job.

A plan is not a changelog. It records the current state of decisions, not the history of how those decisions evolved. A plan that accumulates commentary about why things changed over time is a plan that is getting harder to read with each revision. Keep the plan current and put the history in commit messages.

What I'd change

Two things.

Plans should be versioned with the code — and the plan leads. The plan lives next to the code it describes, and the two travel together. But "together" has a direction. When the code diverges from the plan — which it always does, in small ways — that divergence is a defect in the code, not a fact to be transcribed into the plan. You amend the artifacts to match the plan in the same commit that surfaces the drift. The plan itself changes only when the decision changes, and then it changes first: revise the plan, then regenerate the code from it. The habit I want to kill is the reflexive one — editing the plan to describe whatever the implementation drifted into, because that keeps the document looking current while quietly demoting it from source of truth to changelog. The convention is not "plan changes follow code changes." It is "code changes follow plan changes."

Staleness should be explicit. A plan written at the start of a project is not the same as that plan after six weeks of implementation. When a later phase revises an earlier decision, every section that rested on the old decision is now wrong — and there is a window between making the new decision and propagating it through the document. Today there is no marker for that window: nothing in the plan says "§Architecture still describes the phase-1 decision; phase 3 superseded it, rewrite pending." There should be. A worker that reads an un-updated section and acts on it produces work that has to be undone. The marker is a stopgap, not a resting state — staleness is a defect in the plan to be closed, not a permanent admission that the code has outrun the document. The goal is always a plan with no stale sections, because the plan is what every worker trusts.

The question I now ask

Before I commit a plan and start creating beads from it:

Could a worker who has never spoken to me, reading only this document and the task bead it has been assigned, produce something I would accept on the first try?

Both parts matter. The first — never spoken to me — rules out plans that rely on context you have accumulated in conversation. The second — on the first try — rules out plans where success requires multiple iterations to clarify. If the answer is no, the plan is not done. I work on the plan, not the code.

The workers are ready. The question is whether the inputs are.

— Jed

Plan methodology: derived from Jeffrey Emanuel's (@dicklesworthstone) approach to high-quality planning documents. Plan-review skill: jedarden/jeds-curated-skills. The orchestration layer this sits on: NEEDLE. The task structure (genesis beads, phase beads): beads_rust.

The unit economics of running cattle

Tue, 05 May 2026 00:00:00 GMT

The first time I ran headless agents in earnest, I burned through a quarter of my Anthropic monthly limit in three days. I told that story already as a parable about treating agents as pets. It is also a story about money — and the money story is the one almost nobody plans for before it happens to them.

The pet model has a hidden cost-control mechanism that nobody designed deliberately and nobody notices until they remove it. The cattle model removes it. If you do not replace it with something explicit before the workers go headless, you find out about the gap at the bottom of an invoice.

This is the post about the replacement.

The pet model's hidden subsidy

When you run a pet agent, you are the spend control. Not metaphorically — literally. The reason your bill stays sane is that your attention is the bottleneck. You can only watch one or two sessions at a time. Each session can only spend money while you are in it. When you walk away from the keyboard, the spending stops because the conversation stops.

This is invisible until you scale. As long as the human is in the loop, three things happen automatically:

You notice runaways. The agent that has been chewing on the same prompt for ten minutes producing nothing is something you see and stop. You do not need a metric for it. You see the spinner.
You pace yourself. Token-heavy operations — long context windows, large diffs, tool calls with big outputs — get throttled because you are the one initiating them. You feel the lag and back off.
You self-throttle on quota. When the dashboard creeps up, you slow down. The dashboard is your conscience. The agent has no conscience.

None of those mechanisms exist in cattle. The agent does not see its own cost. The agent does not see the dashboard. The agent does not get tired. Twenty headless workers running unattended, each invoking expensive tool calls in tight loops, will produce a bill that bears no relationship to the value of the work — unless something between them and the provider says no.

The pet model's economics work because the human is the rate-limiter. The cattle model's economics work because something else is the rate-limiter. That something else is the actual subject of this post.

Tokens are the wrong unit

The first instinct, when the bills get scary, is to start counting tokens. Token-counting feels rigorous. The provider exposes per-call usage. You can build dashboards. You can say "this prompt averages 12,000 input tokens, this one averages 40,000" and feel like you are doing the work.

It is the wrong unit.

Tokens are an input cost, not an output measurement. Take a worker that burns 200K tokens to close one bead and a worker that burns 800K tokens to close five. The second worker is cheaper per outcome, at 160K tokens per bead against 200K, even though it spent four times as many tokens. If you optimize the token line you will end up trimming context windows on the second worker until it stops being able to close beads at all, and you will congratulate yourself on the savings.

The unit that matters is cost per closed outcome. A bead closed. A test passing. A PR merged and reviewed. The denominator is the work the system actually delivered, not the inputs it consumed getting there. Token cost without an outcome attached is just spend.
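To make the arithmetic concrete, here is a small sketch that computes spend per closed bead for the two workers in the example above. The `WorkerRun` type and its field names are illustrative and not part of NEEDLE, and token counts stand in for dollars.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerRun:
    """Hypothetical summary of one worker's activity over some window."""
    name: str
    tokens_spent: int
    beads_closed: int


def tokens_per_closed_bead(run: WorkerRun) -> float:
    """Spend divided by delivered outcomes; infinite when nothing closed."""
    if run.beads_closed == 0:
        return float("inf")
    return run.tokens_spent / run.beads_closed


runs = [
    WorkerRun("first", tokens_spent=200_000, beads_closed=1),
    WorkerRun("second", tokens_spent=800_000, beads_closed=5),
]

for run in runs:
    print(f"{run.name}: {run.tokens_spent:,} tokens spent, "
          f"{tokens_per_closed_bead(run):,.0f} per closed bead")
```

The second worker spends four times as many tokens and still lands at 160,000 per bead against 200,000. A worker that closes nothing has an infinite cost per outcome, which is the case a token dashboard alone never shows.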

This is harder to measure. It requires the orchestrator to know when a worker produced a real outcome — which is the same exhaustive-handler discipline from the previous post showing up in a different form. If your state machine cannot tell success from failure, it cannot tell expensive-but-productive from expensive-and-wasted, and you will optimize the wrong column.

The orchestrator that classifies outcomes is also the orchestrator that can compute outcome cost. It is the same machine, doing the same classification work, serving two purposes.

Cost governance is its own component

Once you accept that the human cannot be the spend control, the question becomes: where does the spend control live?

It does not live in the worker. The worker is fungible by design — and a worker that polices its own budget is not fungible, because some workers will refuse work that other workers would accept. It does not live in the provider, because providers are happy to sell you whatever you will buy. It does not live in the orchestrator, because the orchestrator's job is to dispatch work, not to mediate the financial contract with each provider.

It lives in a dedicated component sitting between the workers and the providers. A proxy that every model call passes through, with three jobs:

Cap spend. A hard ceiling on outflow per window — daily, weekly, monthly. When the cap is hit, calls return a structured error and the orchestrator handles it as just another outcome (route to the "rate-limited" handler, sleep the worker, retry later). The cap is not a soft warning. It is enforced at the request layer.

Throttle in-flight concurrency. A semaphore on simultaneous calls. Twenty workers do not mean twenty concurrent provider requests; the proxy holds a smaller number and queues the rest. This is what protects you when a tight retry loop in the orchestrator turns into an accidental DDoS of yourself.

Gate against quota. Subscription plans (Anthropic Max, Z.AI Max) have weekly or monthly windows. The proxy tracks burn against those windows independently of what the provider reports, and starts shedding load before the provider does. You hit your own ceiling before the provider hits theirs, because hitting the provider's ceiling is the bad failure mode.

claude-governor is what this looks like in my setup. It is unromantic plumbing — a Rust process that holds the Anthropic API key, exposes a local HTTP endpoint that workers call instead of api.anthropic.com, and enforces all three policies above. There is nothing clever about it. The cleverness is that the workers do not know it exists; they just see model calls succeed or fail. The policy is hidden behind the same interface the model itself uses, which is the only place a policy of this kind can live without leaking into every worker.
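The three policies can be sketched in a few lines. This is not claude-governor's code (that is a Rust HTTP proxy). It is an in-process Python illustration, and the class name, the `call` signature, the up-front cost estimates, and the 90% shedding threshold are all assumptions made for the example. Window resets are left out.

```python
import threading


class GovernorDenied(Exception):
    """Structured refusal the orchestrator can route like any other outcome."""

    def __init__(self, reason: str):
        super().__init__(reason)
        self.reason = reason


class Governor:
    """Toy spend governor: hard cap, quota gate, in-flight throttle."""

    def __init__(self, cap_usd, quota_tokens, max_in_flight=4, shed_at=0.9):
        self.cap_usd = cap_usd              # hard ceiling for the current window
        self.quota_tokens = quota_tokens    # our own model of the subscription window
        self.shed_at = shed_at              # refuse past this fraction of the quota
        self.spent_usd = 0.0
        self.used_tokens = 0
        self._lock = threading.Lock()
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def call(self, send, est_cost_usd, est_tokens):
        """Run `send()` if policy allows; otherwise raise GovernorDenied."""
        with self._lock:
            if self.spent_usd + est_cost_usd > self.cap_usd:
                raise GovernorDenied("capped")
            if self.used_tokens + est_tokens > self.shed_at * self.quota_tokens:
                raise GovernorDenied("gated-by-quota")
            # Charge the estimate up front so concurrent callers cannot
            # overshoot the cap together. A real proxy would reconcile
            # against actual usage once the response arrives.
            self.spent_usd += est_cost_usd
            self.used_tokens += est_tokens
        with self._slots:                   # waits while max_in_flight calls are live
            return send()
```

A worker would call `governor.call(lambda: provider_request(...), est_cost_usd=..., est_tokens=...)`, where `provider_request` is whatever function actually reaches the model. The worker only ever sees a result or a `GovernorDenied` with a reason, which matches the point above: the policy hides behind the same interface as the model call.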

Run a portfolio, not a provider

There are two pricing models in this market and they do completely different things to your unit economics.

Subscriptions (Anthropic Max, Z.AI Max) cap your downside. You pay a fixed amount per month, you get a quota window, and if you exceed it you get cut off rather than charged more. Total spend is bounded above. The downside is bounded; the upside is bounded too. Subscriptions are how you fund sustained throughput, the steady drumbeat of cattle work that runs every hour of every day.

Metered API caps nothing. You pay for what you use. There is no ceiling. A bug in your retry loop can turn into a five-figure invoice overnight if there is no governor in front of it. The upside is real — metered access is how you handle bursts, spikes, one-off batches that exceed the subscription window without burning the next week's quota. The downside is unbounded. You only run metered traffic with the governor enforcing a hard cap, no exceptions.

A healthy fleet runs both. The subscription is the floor — committed throughput at a known cost. The metered API is the surge capacity. The orchestrator does not care which is which; the proxy makes the routing decision based on which subscription has remaining quota, which model the task asked for, and how much headroom the daily metered cap has.
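The routing decision described above reduces to a short function. This is a sketch under assumptions: the `Subscription` shape, the placeholder model names, and the "first subscription with room wins" ordering are illustrative, not how claude-governor is documented to work.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Subscription:
    name: str
    models: frozenset          # models this plan can serve
    remaining_tokens: int      # headroom left in the current quota window


def route(model: str, est_tokens: int, est_cost_usd: float,
          subscriptions: Sequence[Subscription],
          metered_headroom_usd: float) -> Optional[str]:
    """Pick a backend for one call, or None if every option is exhausted."""
    # Committed throughput first: any subscription that serves the model
    # and still has room in its window.
    for sub in subscriptions:
        if model in sub.models and sub.remaining_tokens >= est_tokens:
            return sub.name
    # Surge capacity second, only while the daily metered cap has headroom.
    if est_cost_usd <= metered_headroom_usd:
        return "metered"
    return None                # caller treats this as a rate-limited outcome
```

Returning `None` instead of raising keeps "nowhere to send this call" an ordinary value the orchestrator can route, the same way it routes a lost race or an empty queue.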

The mistake I have watched people make repeatedly is going all-in on metered API "for flexibility." You get the flexibility. You also get a cost structure that scales linearly with how aggressive your retry policy is, which is exactly the wrong elasticity to give an autonomous fleet.

Quota observability is asymmetric

The proxy needs to know how much you have spent against each window. This sounds trivial. It is not, because the providers expose this information unevenly.

Anthropic exposes weekly limit headers on every API response. The governor reads them, knows where it is in the window, and shapes traffic accordingly. You can build proper closed-loop control because the loop has a sensor.

Z.AI exposes nothing programmatic. There is no quota endpoint. There are no rate-limit headers. The only place you can see your usage is the web dashboard, which is fine for a human checking in once a day and useless for a process that needs to make a decision every second. The governor is flying blind on Z.AI quota — until the wall, when the API starts returning rate-limit errors and you reverse-engineer what just happened.

The fix is to model your own quota independently. Measure outflow at the proxy, count it against your own ledger, and treat that as the source of truth — not what the provider reports, because the provider may not report at all. "You are out of quota" should be a fact your governor knows before the provider tells you, because the only signal the provider gives some of these subscriptions is the failure itself.
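One way to keep that ledger is a rolling window over the outflow the proxy has measured itself. The sketch below assumes a rolling window (a provider may reset on fixed boundaries instead), and the class name, the 10% headroom default, and the injectable clock are choices made for the example. A real ledger would also persist across restarts using wall-clock time.

```python
import time
from collections import deque


class QuotaLedger:
    """Self-maintained usage record for one provider window."""

    def __init__(self, limit_tokens: int, window_seconds: float,
                 clock=time.monotonic):
        self.limit_tokens = limit_tokens
        self.window_seconds = window_seconds
        self._clock = clock
        self._events = deque()     # (timestamp, tokens), oldest first
        self._total = 0

    def record(self, tokens: int) -> None:
        """Count tokens the proxy just sent or received."""
        self._events.append((self._clock(), tokens))
        self._total += tokens

    def used(self) -> int:
        """Tokens spent inside the window, after expiring old events."""
        cutoff = self._clock() - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, tokens = self._events.popleft()
            self._total -= tokens
        return self._total

    def exhausted(self, headroom: float = 0.1) -> bool:
        """True once usage reaches the limit minus a safety margin."""
        return self.used() >= (1 - headroom) * self.limit_tokens
```

The safety margin is what lets the governor shed load before the provider does: `exhausted()` turns true while the provider would still accept the call.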

This is more work than it should be. It is the cost of operating against providers whose business model has not yet caught up with the operational needs of customers running unattended fleets. It will get better. Until then, the governor's quota model is a thing you maintain by hand.

The cheap-restart reflex

The pet operator has one cost-control reflex that the cattle operator should preserve and amplify: when a worker is stuck, kill it.

In the pet model this is intuitive. You see the agent thrashing, you stop it, you start over. The cost of the restart is small. The cost of letting it grind for another five minutes is large.

In the cattle model the same reflex is correct, but the operator cannot apply it manually because the operator is not watching. The reflex has to be built into the orchestrator. Three bounds, enforced at the worker level:

Time-bounded executions. A worker that has been running on the same task for longer than the budget gets killed. The handler is the timeout handler from the previous post — release the task, mark deferred, loop. No appeal. No "but it might be making progress." Long-tail tasks are almost always stuck tasks.

Token-bounded executions. A worker that has emitted more than N tokens of output on a single task gets killed. Most real outcomes fit comfortably in a budget. The ones that exceed it are usually agents in some kind of loop — emitting the same diff over and over, retrying the same failing tool call, repeating themselves with minor variations.

Iteration-bounded executions. A worker that has invoked the model more than N times on a single task gets killed. This catches the case the previous two miss: a worker that makes small, fast calls in tight succession, each one cheap and the total expensive. None of them individually trips the time or token bounds. The count does.

These three bounds together convert "the worker decides when to stop" into "the orchestrator decides when to stop." Which is the only correct allocation of that decision in a cattle system, because the worker has no skin in the game and the orchestrator has all of it.
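The three bounds fit in one small meter that the orchestrator consults between model calls. This is an illustrative sketch: `Budget`, `TaskMeter`, and the order in which breaches are reported are assumptions, and actually killing the worker process is left to the caller.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Budget:
    max_seconds: float
    max_output_tokens: int
    max_model_calls: int


class TaskMeter:
    """Tracks one task's consumption against its budget."""

    def __init__(self, budget: Budget, clock=time.monotonic):
        self.budget = budget
        self._clock = clock
        self._started = clock()
        self.output_tokens = 0
        self.model_calls = 0

    def record_call(self, output_tokens: int) -> None:
        """Call once per model invocation made on behalf of this task."""
        self.model_calls += 1
        self.output_tokens += output_tokens

    def breached(self) -> Optional[str]:
        """Name of the first bound exceeded, or None while within budget."""
        if self._clock() - self._started > self.budget.max_seconds:
            return "time"
        if self.output_tokens > self.budget.max_output_tokens:
            return "tokens"
        if self.model_calls > self.budget.max_model_calls:
            return "iterations"
        return None
```

When `breached()` returns a name, the orchestrator kills the worker and routes the task through the matching handler. The worker is never asked whether it thinks it is making progress.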

A worker killed early might have been about to succeed. That is fine. The task goes back on the queue, gets picked up by another worker, possibly with a different prompt or a different model, and tries again. The cost of a wasted execution is a single bounded run. The cost of a runaway nobody kills is unbounded.

Cheap restart, every time.

What I'd change

Three things, in order of how often I think about them.

Outcome cost should be a first-class metric, not a derived one. Today I compute cost per closed bead by joining provider invoices to the orchestrator's outcome log after the fact. It works but it is asynchronous — I find out about expensive failures after they have already happened. The proxy already knows, in real time, what each call cost. The orchestrator already knows what task each call belonged to. The metric should be emitted live: this bead just closed, here is what it cost, here is whether that is in line with the historical distribution. If it is an outlier, alert immediately.

Per-task budgets should be configurable, not fleet-wide constants. Right now the time / token / iteration bounds are global. Some tasks legitimately want bigger budgets — a complex refactor across many files genuinely needs more space than a one-line fix. The bead should declare its budget. The orchestrator should enforce it. Today everything gets the same budget and I either set it too low for hard tasks or too high for easy ones.
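A sketch of what a bead-declared budget could look like. The `budget` field on the bead, the key names, and the default values are all hypothetical; beads do not carry this field today, which is the point of the paragraph above.

```python
# Fleet-wide fallback values; the numbers are placeholders.
FLEET_DEFAULT = {
    "max_seconds": 900,
    "max_output_tokens": 50_000,
    "max_model_calls": 40,
}


def budget_for(bead: dict) -> dict:
    """Merge a bead's declared budget over the fleet default."""
    declared = bead.get("budget", {})
    unknown = set(declared) - set(FLEET_DEFAULT)
    if unknown:
        # Refuse misspelled or unsupported bounds instead of ignoring them.
        raise ValueError(f"unknown budget keys: {sorted(unknown)}")
    return {**FLEET_DEFAULT, **declared}


refactor = {"id": "bead-refactor", "budget": {"max_seconds": 3600}}
print(budget_for(refactor))
```

A bead that declares nothing gets the fleet default, so existing beads keep working. Rejecting unknown keys keeps a typo from silently falling back to the default.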

The governor should expose its policy decisions as a stream. Right now when the governor decides to throttle, the worker just sees a delayed call. There is no record of why it was delayed, against which window, or what the governor's view of remaining headroom was. When you go to debug "why did the fleet's throughput drop in this hour," the governor's reasoning is opaque. It should not be. Every policy decision the governor makes — throttled, capped, routed-to-metered, gated-by-quota — should be a structured event that the observability stack can query.
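As a sketch of what such an event could carry: the four decision names come from the paragraph above, while the `PolicyEvent` fields and the JSON-lines encoding are assumptions chosen for illustration.

```python
import json
import time
from dataclasses import asdict, dataclass

DECISIONS = frozenset(
    {"throttled", "capped", "routed-to-metered", "gated-by-quota"}
)


@dataclass(frozen=True)
class PolicyEvent:
    decision: str      # one of DECISIONS
    window: str        # which window the decision was made against
    headroom: float    # governor's view of remaining headroom, 0.0 to 1.0
    task_id: str       # the task whose call was affected
    at: float          # unix timestamp of the decision


def encode(event: PolicyEvent) -> str:
    """Serialize one decision as a single JSON line."""
    if event.decision not in DECISIONS:
        raise ValueError(f"unknown decision: {event.decision!r}")
    return json.dumps(asdict(event), sort_keys=True)


event = PolicyEvent("throttled", "weekly", 0.12, "bead-42", time.time())
print(encode(event))
```

With events in this shape, "why did throughput drop in this hour" becomes a filter on `decision` and `window` over a time range.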

The question I now ask

Before I run any new workload on the cattle pipeline, I ask:

What is each dollar of spend supposed to produce, and how would I know if it didn't?

Both halves matter. The first half forces you to attach a denominator to the spend — you are not buying tokens, you are buying outcomes. The second half forces you to instrument the answer — if you cannot tell whether the spend produced the outcome, you cannot govern it. You are just hoping.

The pet model lets you skip both halves because your attention substitutes for both. You see the outcome (or its absence) directly. You see the spend (or its absence) directly. The cattle model takes both signals away from you and forces you to rebuild them in the orchestrator and the governor — explicitly, before the workers go live.

If you have not built that, you do not have a cattle system. You have a pet system with the supervision removed, which is a different thing. It looks the same right up until the bill arrives.

— Jed

If you want to see the governor side: claude-governor is the Rust proxy described here. The fleet side — workers, state machine, outcome handling — is NEEDLE. Together they bound the system from both ends: NEEDLE decides what work to do; claude-governor decides what that work is allowed to cost.

Deterministic state machines for non-deterministic agents

Mon, 04 May 2026 00:00:00 GMT

A worker crashes mid-task. A model-provider rate-limit kicks in for nine minutes. Two workers race for the same task and one of them loses. The agent finishes successfully but produces output that doesn't compile. Same workspace, same hour, four different outcomes — and zero of them are wrong. They are exactly the outcomes a long-running headless agent fleet should expect.

The question is not how to prevent any of those. The question is what happens after each one.

If your answer is some flavor of "I'll go look," you are still running pets. The cattle model needs an answer that does not involve a person — and the answer has to be defined before the outcome, not improvised after it.

This is the post about that answer.

The shape of the problem

Existing agent orchestration tools cluster into two shapes. Neither one quite fits.

Conversational frameworks — LangGraph, AutoGen, CrewAI. These assume a chat loop with a human-in-the-loop or another LLM, and they are good at that. They are bad at sustained autonomous work because the entire abstraction is the conversation. When the conversation ends, the abstraction ends. Recovering from a crash means starting over and hoping the new conversation lands in roughly the same place. Failure modes are vague — "the agent went off the rails" — because the framework never modeled what "the rails" were.

Workflow engines — Temporal, Argo Workflows, Inngest. These are excellent at deterministic step orchestration. They assume each step is code that you wrote, that produces a known shape of output, and that fails in known ways. Plug a non-deterministic agent into a Temporal workflow and the type system gives up: the agent's output is a string, the failure mode is "any exception with any message," and the workflow's retry logic cannot tell a transient rate-limit from a logic bug from "the agent gave up."

The missing middle is a deterministic state machine that drives non-deterministic agents. The orchestrator is rigid; the worker is fuzzy. The orchestrator's job is to enumerate every shape the fuzziness can produce and route each shape to a known handler. The agent's job is to produce one of those shapes.

NEEDLE is the shape this idea takes when you write it down. The rest of this post is what falls out.

The thesis

If an outcome can happen, it has a handler. If it doesn't have a handler, it cannot happen.

Every state transition in NEEDLE has an explicit handler. There are no implicit fallbacks. There is no match _ => continue. There is no "swallow the error, log a warning, hope nobody notices."

This sounds like a small constraint. It is the largest constraint in the system, and almost everything else falls out of it.

What this looks like in practice

A NEEDLE worker is a loop that executes six steps:

SELECT — query the bead queue for the next claimable task in deterministic priority order.
CLAIM — atomically claim the task via a SQLite transaction. Exactly one worker wins.
BUILD — construct the prompt from the task definition. Same task → same prompt, every time.
DISPATCH — load the agent adapter (Claude Code, OpenCode, Codex, Aider, anything CLI) and invoke it.
EXECUTE — wait for the agent to exit. The only inputs the orchestrator gets back are the exit code and what was written to disk.
OUTCOME — classify what happened. Run the handler. Loop.

The first five steps are mechanical. Almost any orchestration framework can do them. The whole game is in step six.
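The SELECT and CLAIM steps can be sketched with Python's built-in `sqlite3` module. The table layout, column names, and the conditional `UPDATE` are assumptions for illustration; NEEDLE's real schema is not shown in this post. Here a lost race shows up as an update that touched zero rows.

```python
import sqlite3
from typing import Optional, Sequence


def open_queue(path: str = ":memory:") -> sqlite3.Connection:
    # isolation_level=None puts the connection in autocommit mode, so
    # each statement below runs as its own atomic transaction.
    db = sqlite3.connect(path, isolation_level=None)
    db.execute(
        "CREATE TABLE IF NOT EXISTS beads ("
        " id TEXT PRIMARY KEY,"
        " priority INTEGER NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'open',"
        " claimed_by TEXT)"
    )
    return db


def select_next(db: sqlite3.Connection,
                exclude: Sequence[str] = ()) -> Optional[str]:
    """SELECT: next open bead in a deterministic order, or None."""
    marks = ",".join("?" * len(exclude))
    row = db.execute(
        f"SELECT id FROM beads WHERE status = 'open' AND id NOT IN ({marks}) "
        "ORDER BY priority, id LIMIT 1",
        tuple(exclude),
    ).fetchone()
    return row[0] if row else None


def claim(db: sqlite3.Connection, bead_id: str, worker: str) -> bool:
    """CLAIM: succeeds for exactly one worker; False means race lost."""
    cur = db.execute(
        "UPDATE beads SET status = 'claimed', claimed_by = ? "
        "WHERE id = ? AND status = 'open'",
        (worker, bead_id),
    )
    return cur.rowcount == 1
```

The `exclude` argument is what lets a worker that lost a race skip that candidate and retry SELECT. The tie-break on `id` keeps the ordering deterministic when priorities are equal.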

Here is the OUTCOME table for one NEEDLE iteration:

Outcome	Detection	Handler
Success	exit code `0`, output validates	close the task, log effort, loop
Failure	exit code `1`	log failure reason, release the task, increment retry count, loop
Timeout	exit code `124`	release the task, mark deferred, loop
Crash	exit code `>128` (SIGKILL, SIGSEGV, etc.)	release the task, create an alert task, loop
Race lost	claim transaction returned no row	exclude this candidate, retry SELECT
Queue empty	no claimable tasks	enter strand escalation: search other workspaces, do cleanup, alert if all strands exhausted

Six rows. Every row has a handler. Every row was added because the absence of a handler caused a real bug.
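The table translates almost directly into code. NEEDLE is written in Rust, where the compiler enforces exhaustiveness; the Python sketch below imitates that with a check at import time. The names are illustrative, and the handler values are just the table's descriptions as strings. The table does not say how exit code 0 with output that fails validation is classified, so the sketch refuses to guess.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    TIMEOUT = "timeout"
    CRASH = "crash"
    RACE_LOST = "race-lost"
    QUEUE_EMPTY = "queue-empty"


class UnclassifiedExit(Exception):
    """An exit with no row in the table: stop and name it, do not guess."""


def shell_code(returncode: int) -> int:
    # Python's subprocess reports death by signal N as -N; shells report
    # 128 + N. Normalize to the shell convention the table uses.
    return 128 - returncode if returncode < 0 else returncode


def classify_exit(exit_code: int, output_validates: bool) -> Outcome:
    """Map an agent's exit to one of the four exit-driven outcomes."""
    if exit_code == 0 and output_validates:
        return Outcome.SUCCESS
    if exit_code == 1:
        return Outcome.FAILURE
    if exit_code == 124:
        return Outcome.TIMEOUT
    if exit_code > 128:
        return Outcome.CRASH
    raise UnclassifiedExit(
        f"exit code {exit_code}, output_validates={output_validates}"
    )


HANDLERS = {
    Outcome.SUCCESS: "close the task, log effort, loop",
    Outcome.FAILURE: "log failure reason, release the task, increment retry count, loop",
    Outcome.TIMEOUT: "release the task, mark deferred, loop",
    Outcome.CRASH: "release the task, create an alert task, loop",
    Outcome.RACE_LOST: "exclude this candidate, retry SELECT",
    Outcome.QUEUE_EMPTY: "enter strand escalation",
}

# Stand-in for a compiler's exhaustiveness check: fail at import time
# if any outcome lacks a handler.
missing = set(Outcome) - set(HANDLERS)
assert not missing, f"outcomes without handlers: {missing}"
```

Race lost and queue empty never come from an exit code. They are produced by the CLAIM and SELECT steps, which is why `classify_exit` returns only four of the six variants.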

A few of those rows are worth dwelling on.

Race lost is its own outcome, not a failure. Two workers see the same top-priority task, both try to claim it, exactly one wins. The loser is not broken. It just needs to skip that task and try the next one. If you don't model "race lost" as a first-class outcome, you end up with workers that retry endlessly on tasks they will never claim — or worse, with workers that crash with cryptic SQLite errors and get auto-replaced by an outer supervisor that has no idea what just happened.

Queue empty is its own outcome, not idle time. When a worker has nothing to do, that is a signal, not a non-event. It means: this workspace is exhausted, look elsewhere. NEEDLE has a strand escalation sequence for this — search other workspaces, do cleanup, propose alternatives for blocked tasks, etc. — but none of that runs unless "queue empty" is a recognized outcome that triggers it. A worker that just spins on an empty queue is wasting cycles and obscuring the signal that the queue is empty.

Crash is distinct from failure. A failure is the agent saying "I tried and produced nothing useful." A crash is the agent dying without saying anything. They look superficially similar. They require different handlers: a failure increments a retry count and tries again; a crash creates an alert and may indicate something fundamentally wrong (the agent binary isn't installed, the workspace is corrupted, the model provider is rejecting all requests). Conflating them is the difference between a system that self-heals and a system that thrashes.

What it costs

The deterministic state machine is not free. The cost shows up in three places.

Up-front enumeration. You have to sit down and think through every shape your worker can produce. This is harder than it sounds. The natural state of any human-built system is "I'll add the handler when the bug actually happens" — which works fine for human-supervised work and is actively dangerous for unattended fleets. The first month of NEEDLE was mostly me producing new outcome rows because the system kept finding outcomes I hadn't thought to enumerate.

Discipline against match _. Rust makes this discipline visible: a non-exhaustive match is a compiler error. Languages without exhaustiveness checks make it tempting to write a wildcard handler that does something reasonable. Wildcards are how state machines silently grow undefined behavior. The rule has to be: when a new outcome shows up, you stop, you name it, you give it a row in the table, and you write its handler. You do not add it to the wildcard.

Schema-first, not caller-first. When you discover a new outcome, you do not patch the call site. You go back to the type that represents the outcome and add a variant. Then the compiler tells you everywhere else that needs to handle the new variant. This is more friction than the alternative — but the alternative is an outcome enum that drifts from reality, with handlers that quietly stop being called.

What it's worth

In exchange you get three things that are not available under any other model.

Workers fail predictably. Predictable failure is the foundation of recovery. A worker that returns garbage in a known failure mode — exit code 1, log line in a known format — is more useful than a worker that silently returns subtly-wrong output. The orchestrator can route the known failure; it has no leverage on the silent corruption. Designing for predictable failure means refusing to wallpaper over the failures you cannot classify, and instead either learning to classify them or rejecting the work.

The state machine is the contract; agents are the implementation. This is the property that lets NEEDLE be agent-agnostic. The state machine doesn't know whether the worker is Claude Code, OpenCode, Codex, or Aider. It knows that the worker is something that takes a prompt and produces an exit code. Add a new agent and you add a YAML adapter file — no code changes. Drop a worse agent and replace it with a better one — same. The agents are parts; the state machine is the machine.

You can run twenty of these and reason about what the herd is doing. Twenty pet agents are unreasonable. Twenty deterministic state machines, each running the same six-step loop, are reasonable. You stop debugging individual workers and start debugging the outcome distribution — which row of the table is firing more often than it should? When the failure column trends up, you know to look at the prompts. When the timeout column trends up, you know to look at the model provider. The state machine made each worker's behavior legible enough that the fleet's behavior is also legible.
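Debugging the distribution instead of the worker can start as simply as counting rows. In this sketch the log is a plain list of outcome names, and the baseline shares and the 5-point tolerance are made-up values for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List


def outcome_shares(log: Iterable[str]) -> Dict[str, float]:
    """Fraction of iterations that ended in each outcome."""
    counts = Counter(log)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()} if total else {}


def rows_over_baseline(current: Dict[str, float],
                       baseline: Dict[str, float],
                       tolerance: float = 0.05) -> List[str]:
    """Outcomes firing more often than the baseline allows."""
    return sorted(
        name for name, share in current.items()
        if share - baseline.get(name, 0.0) > tolerance
    )


log = ["success"] * 6 + ["failure"] * 3 + ["timeout"]
baseline = {"success": 0.8, "failure": 0.1, "timeout": 0.1}
print(rows_over_baseline(outcome_shares(log), baseline))
```

Here only the failure row is over its baseline, which by the reasoning above points at the prompts and not at the provider.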

What I'd change

Three things, with the benefit of running this in production for a while.

Outcome classification should be richer than exit codes. Exit codes are a 0–255 alphabet. They are too coarse to express the difference between "agent gave up gracefully" and "agent gave up because the rate-limiter hit it." Right now NEEDLE squints at stderr to disambiguate. If I rebuilt today, I would have agent adapters return a structured outcome envelope (JSON to a known sentinel path, or a stdout marker) instead of relying on exit codes alone. Exit codes would be the fallback when the envelope is missing.
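A sketch of the envelope-with-fallback idea. The sentinel filename `.outcome.json` and the envelope's `outcome` key are hypothetical; the paragraph above only says "JSON to a known sentinel path", and NEEDLE does not do this today.

```python
import json
from pathlib import Path


def read_outcome(workdir: Path, exit_code: int) -> dict:
    """Prefer the agent's structured envelope; fall back to the exit code."""
    envelope = workdir / ".outcome.json"     # hypothetical sentinel path
    try:
        data = json.loads(envelope.read_text())
    except (OSError, ValueError):
        data = None                          # missing, unreadable, or not JSON
    if isinstance(data, dict) and isinstance(data.get("outcome"), str):
        return {**data, "source": "envelope"}
    # No usable envelope: the caller classifies from the exit code alone.
    return {"source": "exit-code", "outcome": None, "exit_code": exit_code}
```

An adapter that writes `{"outcome": "rate-limited"}` before exiting with code 1 lets the orchestrator tell that case apart from an agent that gave up gracefully, without reading stderr. A malformed envelope falls back to exit codes and is not trusted.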

Strand escalation should be a separate state machine. "Queue empty" routes to a sequence of fallback behaviors — search other workspaces, do cleanup, propose alternatives, alert if exhausted. Today that sequence is a function inside the OUTCOME handler for queue-empty. It really wants to be its own state machine with its own outcome table. Whenever a section of code starts growing its own enum of "what happened," that is the system asking for a state machine.

Determinism in the orchestrator does not buy determinism in outcomes. Two NEEDLE workers running the same task will produce different outputs because the agent is non-deterministic. That is by design. But it means replaying the orchestration against a recorded outcome stream is not the same as replaying the work. If I rebuilt today, I would separate "orchestration replay" (deterministic) from "work replay" (impossible without the agent), and design the recording format to make the first kind of replay easy.

The question I now ask

Before I add an outcome handler — before I add any new state to a NEEDLE-like system — I ask:

What outcome am I making explicit, and what was my system doing about it before?

If the answer to the second part is "nothing, it was hidden in a wildcard," that is the bug I am fixing. If the answer is "it was conflated with a different outcome," that is also a bug. If the answer is "I am inventing this outcome to handle a hypothetical," I do not add it. The state machine grows by making implicit outcomes explicit, not by adding speculative variants.

This is dual to the cattle question from the last post: can a stateless headless worker complete this with only the inputs I write down? Together they bound the design space:

Cattle says: the agent must be replaceable.
State machine says: the orchestrator must be exhaustive.

Either alone is a tarpit. Cattle without a state machine is a fleet of identical workers all failing in mysterious ways. A state machine without cattle is a beautiful enum that one operator hand-runs forever.

The combination is the system that runs unattended.

— Jed

Background: Pet agents vs. cattle agents — the mental-model shift this post sits on top of. Code: NEEDLE — the deterministic state machine described here, in Rust.

Pet agents vs. cattle agents

Sun, 03 May 2026 00:00:00 GMT

When I started running multiple LLM agents in parallel, I burned through a quarter of my Anthropic monthly limit in three days because I was nursing each one. Restarting the long ones when they went off the rails. Hand-curating context windows. Crafting bespoke system prompts. Watching specific sessions like a worried parent.

The waste was not the tokens. The waste was the model in my head.

I was treating my agents the way ops teams treated servers in 2008: as pets. Named, hand-tuned, irreplaceable. Each one a small individual project of mine. The day I stopped doing that — the day I started treating agents as cattle — was the day they actually started doing useful work at scale.

This is the framing decision that sits underneath every other technical choice I make about LLMs in production. It pre-dates the architecture decisions. It pre-dates the framework picks. It is, more than any individual tool, what separates "interesting demo" from "system that runs unattended."

The metaphor's origin

Bill Baker, an engineer at Microsoft, popularized "pets vs. cattle" sometime around 2012 to describe the shift from artisanal server management to fleet-scale operations. The pet was the box you named after a Norse god, ssh'd into to fix manually, panicked about when it went down. The cow was the AMI you spun up in an autoscaling group, terminated without ceremony when it misbehaved, and replaced from the same template five seconds later.

The shift the industry made — from pets to cattle — was not primarily about scale. It was about amortizing the cost of operating things. You cannot afford to know every server's name when you have ten thousand of them. The cattle model says: build the inputs, build the lifecycle, observe the outputs, and stop investing in any individual instance.

LLM agents in 2025 are sitting roughly where servers sat in 2010. Most people are running pets. The cattle model exists, but it requires more infrastructure than most teams have built yet, and the framing has not made it into the public conversation.

What pet agents look like

You probably already know whether you are running pets, but here are the symptoms:

You have a long-lived chat session you have spent hours curating context for, and you actively mourn the day you have to start a new one.
You hand-tune a system prompt for a specific task, then never reuse it on a different task because it would not transfer.
When the agent goes off the rails mid-task, you intervene. You re-prompt. You correct. You steer.
You measure success by the quality of any one output, not the throughput of the system.
You are the orchestrator. The agent is the worker. Both jobs cost your attention.
The cost of failure is high enough that you avoid letting it run unsupervised.

Pet agents are not bad. They are the right tool for high-judgment, one-off, exploratory work. The bug fix you cannot describe well enough for autonomous work. The design conversation that benefits from a back-and-forth. Anything where the value of the output justifies the cost of your attention.

The trap is using pet agents for work that is bulk, repetitive, or asynchronous. That is where the model breaks.

What cattle agents look like

The cattle model has a few non-negotiable properties:

Headless. The agent does not chat. It receives a prompt, does work, exits. The exit code and the diff it produced are the entire interface. There is no human in the loop during the run.

Stateless. Each invocation reconstructs its own context from durable inputs. Same task, same context, same prompt — every time. If the worker dies mid-run, another worker starts the same task fresh from the same state. No "resuming" a session.

Replaceable. Workers are anonymous. They are identified by NATO phonetic alphabet names (alpha, bravo, charlie), deliberately content-free because the workers are interchangeable. When a worker fails, you do not investigate that worker. You investigate the task it was working on, then dispatch another worker.

Observed at the herd level. You do not look at individual sessions. You look at fleet metrics: tasks completed per hour, cost per task, failure rate by category, queue depth. The unit of analysis is the herd, not the cow.

Governed at the fleet level. Spend caps, rate limits, weekly quotas — all enforced at the orchestrator, not the agent. No agent can spend more than the herd is allowed to spend, regardless of what the agent decides to do.

Concretely, this is what NEEDLE does. A worker is a tmux session running a deterministic state-machine loop: select the next task from a shared queue, claim it atomically, build the prompt from the task definition, dispatch to a headless CLI (Claude Code, OpenCode, Codex, Aider — agent-agnostic), wait for an exit code, classify the outcome, handle it, loop. The agent does the fuzzy work; the orchestrator handles every other dimension.

It is unromantic on purpose. The agent is a black box that produces work; the orchestration is the part you can reason about.

What you give up

Treating agents as cattle has real costs. Anyone who tells you otherwise has not actually run them this way:

You give up the warm context. A pet session, after an hour of back-and-forth, has accumulated a lot of nuance: corrections you made, dead ends you ruled out, preferences you taught it. Cattle agents start cold, every time. Whatever you have not encoded into the task definition is gone.

This forces you to write task definitions seriously. The work that used to live in your conversation history now has to live in the bead, the prompt template, the reference docs. It is more discipline up front and less recovery in the moment.

You give up the moment-to-moment steering. When a pet agent starts heading the wrong way, you stop it. Cattle agents finish the wrong way. They produce a bad output, the orchestrator classifies it as failure, the task gets retried — possibly with the same wrong approach, possibly with a different worker that happens to be configured differently.

This forces you to be explicit about what wrong looks like. Acceptance criteria become real artifacts, because the orchestrator needs to evaluate them automatically. "I will know it when I see it" is not a thing cattle can implement.

You give up the artisan's pride. Each pet session has the satisfying texture of "we built this together, you and the agent." Cattle is fungible by design. You stop having a relationship with any specific worker. You only have a relationship with the throughput of the system.

This is the part most people resist hardest. It is genuinely a loss.

What you cannot have without cattle

In exchange, you get three things that are simply not available in the pet model:

Parallelism. I have run twenty NEEDLE workers concurrently on the same workspace. They coordinate through atomic claims on a shared bead queue (SQLite transactions guarantee exactly one worker wins each claim). I cannot manage twenty pet agents. Nobody can. The pet model has a hard ceiling around three or four sessions before the human becomes the bottleneck.
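
A minimal sketch of an atomic claim on a SQLite-backed queue, using Python's standard `sqlite3` module. The `beads` table layout, the status values, and the function names are assumptions for illustration; NEEDLE's actual schema is not shown in this post. The idea is that `BEGIN IMMEDIATE` takes the write lock before the row is read, so two workers cannot both select the same open bead and then both mark it claimed.

```python
import sqlite3
from typing import Optional


def open_queue(path: str) -> sqlite3.Connection:
    # isolation_level=None: no implicit transactions, we issue BEGIN/COMMIT ourselves.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS beads ("
        " id INTEGER PRIMARY KEY,"
        " title TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'open',"
        " worker TEXT)"
    )
    return conn


def add_bead(conn: sqlite3.Connection, title: str) -> int:
    return conn.execute("INSERT INTO beads (title) VALUES (?)", (title,)).lastrowid


def claim_next(conn: sqlite3.Connection, worker: str) -> Optional[int]:
    """Claim the oldest open bead for `worker`; return its id, or None if none are open."""
    conn.execute("BEGIN IMMEDIATE")  # acquire the write lock before reading
    try:
        row = conn.execute(
            "SELECT id FROM beads WHERE status = 'open' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE beads SET status = 'claimed', worker = ? WHERE id = ?",
                (worker, row[0]),
            )
        conn.execute("COMMIT")
        return None if row is None else row[0]
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

With several worker processes pointed at the same database file, each would open its own connection and call `claim_next`; a worker that cannot get the write lock in time sees an `sqlite3.OperationalError`, which an orchestrator could treat as a lost race rather than a crash.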

Cost governance. When I had pet agents, I was the spend control. I noticed when one was running long and stopped it. I noticed when one was being expensive and intervened. With twenty headless workers running unattended, I cannot be the spend control — the orchestrator has to be. claude-governor caps spend, throttles workers, and gates against weekly Anthropic limits. None of that is meaningful in a pet model because the pet has only one operator and that operator is paying attention.
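
To make "the orchestrator has to be the spend control" concrete, here is a deliberately tiny gate that a dispatcher could consult before launching a worker. It is not claude-governor's API or behavior: the `SpendGate` name, the dollar-denominated cap, and the idea of passing in an estimated cost are all assumptions for the sketch, and it ignores throttling and window resets entirely.

```python
from dataclasses import dataclass


@dataclass
class SpendGate:
    weekly_cap_usd: float
    spent_usd: float = 0.0

    def may_dispatch(self, estimated_cost_usd: float) -> bool:
        """Allow a dispatch only if it cannot push spend past the cap."""
        return self.spent_usd + estimated_cost_usd <= self.weekly_cap_usd

    def record(self, actual_cost_usd: float) -> None:
        """Account for what a finished worker actually cost."""
        self.spent_usd += actual_cost_usd


gate = SpendGate(weekly_cap_usd=10.0)
launched = 0
for _ in range(5):  # five queued tasks, each estimated at 4 USD
    if gate.may_dispatch(4.0):
        launched += 1
        gate.record(4.0)
```

The point of the shape is that the decision happens before dispatch and without a human: once the cap would be exceeded, remaining tasks simply stay in the queue.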

Failure as a normal mode. A pet session "failing" is a small disaster. You console yourself, restart, try to recover the context. A cattle worker failing is expected. The orchestrator has an explicit handler for every outcome a worker can produce — success, failure, timeout, crash, race-lost, queue-empty. None of those is exceptional. Each has a defined recovery path. Workers fail constantly; the system does not.

This last point is the one I underestimated longest. In a pet model, failure is the thing you try to avoid. In a cattle model, failure is just one more outcome the orchestrator routes. You stop optimizing for "agents that don't fail" and start optimizing for "an orchestrator that handles failures cleanly." That second target is much more tractable.
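
"An explicit handler for every outcome" can be expressed as a routing table plus a check that no outcome is left unrouted. The six outcomes below are the ones named above; the `Action` names and the particular mapping are hypothetical recovery paths chosen for the example, not NEEDLE's actual handlers.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    TIMEOUT = "timeout"
    CRASH = "crash"
    RACE_LOST = "race-lost"
    QUEUE_EMPTY = "queue-empty"


class Action(Enum):
    CLOSE_TASK = "close the task"
    RETRY_TASK = "requeue the task for another attempt"
    RELEASE_CLAIM = "release the claim and requeue"
    SELECT_AGAIN = "select a different task"
    SLEEP = "sleep, then poll the queue again"


# Hypothetical routing: every outcome maps to exactly one recovery action.
ROUTES = {
    Outcome.SUCCESS: Action.CLOSE_TASK,
    Outcome.FAILURE: Action.RETRY_TASK,
    Outcome.TIMEOUT: Action.RETRY_TASK,
    Outcome.CRASH: Action.RELEASE_CLAIM,
    Outcome.RACE_LOST: Action.SELECT_AGAIN,
    Outcome.QUEUE_EMPTY: Action.SLEEP,
}


def unrouted_outcomes() -> set:
    """Outcomes with no handler; this should be empty before any worker starts."""
    return set(Outcome) - set(ROUTES)


def route(outcome: Outcome) -> Action:
    return ROUTES[outcome]
```

Checking `unrouted_outcomes()` at startup turns "none of those is exceptional" into something enforceable: adding a new outcome without a recovery path becomes a configuration error instead of a surprise at 3 a.m.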

The honest tradeoff

This is not a story where one model wins. They are tools for different shapes of work.

Pets are right when:

The task requires high-judgment back-and-forth.
The unit of value is one specific output, not throughput.
You cannot fully specify success in advance.
The cost of getting it wrong is high enough to justify your attention.

Cattle are right when:

The work is bulk, queueable, asynchronous.
Success is specifiable enough that the orchestrator can classify outcomes.
Throughput matters more than any individual artifact.
The economics only work if a human is not in the loop.

Most teams I see are running pet workflows on tasks that should be cattle. The fix is not just buying more agents — it is rewriting the task so that an unattended agent can succeed at it. That rewrite is the actual work. Once the task is specified well enough for cattle, the orchestration layer is a handful of weekends.

The question I now ask

Before I build anything new with LLMs, I ask one question:

Could a stateless, headless worker that does not know my name complete this task with only the inputs I write down?

If the answer is yes, it goes into the cattle pipeline.

If the answer is no, I do one of three things:

Specify the task harder until the answer becomes yes.
Decide it is genuinely a pet task and budget my attention accordingly.
Decide LLMs are the wrong tool for this and write the code myself.

The point is not that one answer is correct. The point is that I now make the decision, deliberately, instead of defaulting to pets because pets are what the chat interface trained me to do.

The chat interface is wonderful for what it is. It is the tutorial. It is not the production system.

— Jed

If you want to see what the cattle model looks like as code: NEEDLE is the orchestrator (Rust, deterministic state machine, K8s-native fleet); claude-governor is the fleet-level spend and quota gate; ccdash is the herd-health TUI. All three exist because the cattle model does not hold up with any one of them missing.

A new place for opinions and biases

Sat, 02 May 2026 00:00:00 GMT

I build agent infrastructure for a living. Most of what I have learned about running headless LLMs in production lives in commit messages, internal docs, and conversations that never make it past the room they happen in. This is the surface where some of it leaks out.

What I plan to write about

Headless LLM systems — the work between "agents are interesting" and "agents run unattended in production." Orchestration, cost governance, observability, fleet operations.
Strong opinions, weakly held — what I currently believe about how agent infrastructure should be built, and why. I expect to be wrong about some of these and look forward to the corrections.
Post-mortems on my own decisions — patterns that paid off, patterns that didn't, and the difference between the two.

What I won't write about

Hot takes on the latest model release.
Frameworks I haven't actually used in anger.
Speculation dressed up as expertise.

If something here is useful, it earned its place by being grounded in a shipped system. The bar is "code I run, problems I hit, fixes that actually held."

— Jed