A Team of One and a Hundred Agents

There are mornings now where I open my work and review more shipped software than I could have written by hand in a week. It was produced overnight, by agents, while I slept. Branches pushed, tests written, a migration drafted, three bugs fixed with the failing case reproduced first. My job that morning is not to write. It is to read, judge, reject a real share of it (more on a refactor, less on a batch of tests), and decide what ships.

This is the part of agentic development that the productivity charts miss. The interesting change is not that I type faster. It is that the unit of engineering output stopped being me and became me plus the harness I built to direct the work. A single developer with the right scaffolding now operates at a scope that used to need a team. Not by working harder, but by changing what they personally have to touch.

I want to be clear about where that leverage comes from, and equally clear about where it runs out. It does run out, and where it runs out is what I most want to get across.

Agents are only as good as what you hand them

The first thing I learned running agents at any real volume is that they inherit the quality of their tools. An agent pointed at a raw provider API, asked to manage its own streaming and tool-call bookkeeping, spends its budget on plumbing and makes plumbing mistakes. The same agent handed a clean primitive, one call that runs the whole model-to-tool-to-model loop and hands back a transcript, spends its budget on the actual problem.

So the harness sits on top of the compounding primitives I have written about before. That is not a coincidence; it is the precondition. A fleet of agents is a multiplier, and a multiplier on a weak foundation just produces more weak output faster. The primitives are what make the agents worth running at all. You do not get the hundred-agent morning by hiring a hundred agents. You get it by building the floor they stand on, one layer at a time, until directing them is less about doing the work than about decomposing it well and knowing what to accept. That last part is its own skill, and the one the rest of this post is about.

The harness is orchestration plus a gate

Two pieces make the difference between a fleet that produces leverage and one that produces a mess to clean up.

The first is orchestration: how work is decomposed and dispatched. I run agents in parallel across separable tasks, but serialized where they would otherwise tangle the same files. Each one gets an objective and acceptance criteria specific enough that “done” is checkable, not vibes. The skill here is the same skill as decomposing work for human collaborators, except the collaborators are fast, literal, tireless, and entirely without the social grace that would make a person ask “are you sure?” before doing the obviously-wrong thing you asked for.

The second piece is the one people underestimate: a gate. Nothing an agent produces ships because the agent says it is done. It ships because it has survived review, and the review I trust is adversarial, not congratulatory. I run changes past reviewers whose entire job is to try to break the claim that the work is correct: to find the edge case, the dropped invariant, the test that asserts nothing, the demo that shows what you should see instead of what you would see. A verdict has to be grounded in the actual code, citing the actual lines, which narrows the room for a confident-sounding fabrication without ever fully closing it. And a fix has to be verified by reverting it and watching the failure come back. Output that cannot survive that is not output. It is a draft.

Two catches make it concrete. In a code editor I was building from scratch, an agent shipped diagnostic squiggles (the little underlines under errors) that technically worked. Hover over one and the tooltip appeared, exactly as designed. The catch was that the hover target was the squiggle itself, the literal one-pixel wavy line. You had to land the cursor on it, to the pixel, or nothing happened. A test harness driving the mouse to exact coordinates hit it every time and reported success. A person, moving a real mouse at a real underline, missed it almost every time and gave up. It was usable by a machine and useless to a human, which is the most insidious kind of broken, because every automated check was green. The second catch was smaller and more embarrassing. Throwaway example text in a demo, sample code that exists purely to be read, was confidently fabricated and wrong, plausible at a glance and nonsense on a second look. Nobody had told the agent the examples needed to be correct, so they weren’t. Both shipped green. Both were caught by someone looking, not something running.

That gate is the QA layer for the fleet, and it is the reason the hundred-agent morning produces software instead of plausible-looking wreckage.

What still doesn’t scale

Here is the ceiling, and I have hit it enough times to respect it.

Everything mechanical scales. Writing code scales. Running tests scales. Reproducing a bug, drafting a migration, sweeping a rename across forty files: all of it scales, to as many agents as I can afford to run.

Judgment does not scale, and neither does taste. The decision of whether a thing is actually what we said it was (whether the abstraction is right, whether the migration preserved meaning and not just shape, whether the demo is telling the truth) does not scale, because it is the work that cannot be delegated to the same kind of process that produced the thing being judged. The reviewer fleet helps; it surfaces candidates and catches the mechanical failures. But the final call about what is good is mine, and it stays mine, and on a heavy morning it is the bottleneck. The agents finish faster than I can decide.

That bottleneck is not a flaw in the setup. It is the setup working correctly. The whole value of a team of one with a hundred agents is that the one is still accountable for the judgment. The moment I start rubber-stamping to keep up with the throughput, approving because the diff is long and the tests are green and reading it carefully is slow, the leverage inverts. I am no longer operating a fleet. I am laundering its mistakes into the codebase with my name on them.

Leveraged judgment, not replaced judgment

The outsized returns are real, and I am not going to perform modesty about them. As a team of one I have built and now operate several production systems whose scope would, not long ago, have needed a staffed team. That is not a rounding error; it is the most significant change in how I work in my career.

I should be precise about one thing while I’m here, though, because it is where this claim gets oversold. A hundred agents is throughput, not adoption. The agents don’t choose to depend on what I build; they execute it. The leverage is mine to wield. It is not yet a platform other people have taken up, and that is a later, different, harder thing to earn. I don’t want the word “team” to smuggle in a claim I haven’t made.

The returns, then, are leveraged judgment, not replaced judgment. The agents multiply the part of the job that was always mechanical and let me spend my scarce attention on the part that never was. The skill that matters is no longer how much I can produce, but how much I can correctly refuse: how reliably I can look at a morning’s worth of confident, well-formatted, test-passing work and say which of it is wrong.

A team of one and a hundred agents is not a team of a hundred and one. It is one person, accountable, sitting at the top of a very tall lever. The lever is extraordinary, but it still needs someone who knows which way to push.