Taste Doesn't Come in the Diff
Agents optimize what they can verify: does it render, do the values map, do the tests pass. They cannot optimize what they can't measure, whether the result is legible, calm, honest, good. That gap is where the human reviewer's job moved.
An agent once built me a data visualization that was perfect and unusable in the same breath. A heat-style overlay, values mapped to color, every number in the right place, every pixel rendered, every test green. And it looked like an emergency. The whole thing glowed, saturated and intense, throbbing with a heat that said something is on fire here when the actual data was calm. It was, functionally, exactly what I asked for. As a thing a person looks at, it was wrong.
“Looks wrong” is not a compile error. No test failed, because nothing testable had failed, and the visual-regression check that might someday flag a drift from this screenshot would, today, have happily blessed it as the baseline. The agent had optimized every property it could measure: correctness of the mapping, completeness of the render, passage of the suite. It sailed straight past the one property that mattered most to whether the feature was any good, because that property has no assertion. It took several passes to dial the thing back down. Less glow, lower opacity, a palette that whispered instead of shouted. None of those passes were bug fixes. They were taste, applied by hand, because taste does not come in the diff.
Correctness and taste are orthogonal
I did not fully appreciate this until I was reviewing agent output at volume. Correct and good are different axes, and an agent is built to climb only one of them.
An agent is a magnificent optimizer of the measurable. Does the function return the right value? Does the component mount? Does the layout hold at every breakpoint? It will grind those down to green with a thoroughness no tired human matches at 5 p.m. But “is this beautiful,” “is this legible,” “is this calm where it should be calm and loud only where it earns it,” “does this feel like the thing we meant” all resist measurement in the way a unit test does not, and so they are, to the agent, nearly invisible. The agent does not have bad taste. Taste is barely in the loss function. There is almost no reliable gradient pointing toward “this is lovely,” so it doesn’t descend toward it with anything like the confidence it brings to “this returns 200.”
I should be honest that “not measurable” is too strong, because the industry has spent real effort making slices of taste measurable, and an agent will satisfy every one of them. Visual-regression snapshots turn “looks different from the approved screenshot” into a failing check. Contrast linters compute the exact WCAG ratio. Design tokens encode a spacing scale and a palette as constraints. Performance budgets put a number on layout shift. All of that is real work, and an agent clears all of it. But every one of those tools measures taste that has already been decided. It enforces a baseline a human approved, a ratio a human chose, tokens a human authored last quarter. They catch regressions from good taste. None of them can tell you the baseline itself is shouting, that the chosen palette is wrong for this data, that every token was satisfied and the assembly is still incoherent. The encodable part of taste is a moving frontier; the residual, is this the right thing rendered well, stays human. The over-saturated visualization fell straight into that residual. It ships wearing the face of success, passing every gate built to catch its broken cousin, failing only the person who looks at it and says no, that’s shouting, bring it down. That sentence is not automatable, and pretending otherwise is how you end up with software that is technically flawless and quietly unpleasant to use.
The design pass is its own gate
So I run a design pass as a gate distinct from code review, because the two catch different classes of failure and collapsing them loses one.
Code review asks: is this correct, is this safe, will it hold. The design pass asks a different and stubbornly qualitative set of questions. Is the hierarchy right, does your eye land where it should first? Is the contrast actually legible, not just nominally present? Is the motion serving comprehension or just showing off? Is this restrained? A change can pass code review cleanly, correct and performant and well-typed, and fail the design pass completely, because it is correct and ugly, or correct and confusing, or correct and louder than it has any business being.
Accessibility is the sharpest illustration, because part of it is a number and part of it isn’t. Contrast ratio is literal; a linter computes it and an agent can satisfy it to the second decimal. But a label can clear the required ratio by a hair and still be miserable to actually read. Two states can both pass and still be indistinguishable to a colorblind eye the ratio never modeled. A distinction the spec certifies as present can be one the retina, in context, simply does not find. The measurable floor is necessary, and an agent will meet it. Whether the result is perceivable, usable by a real person in a real glance, is the part above the floor, and that part is still a human leaning in and saying: I can pass the check and I still can’t read it. Only one of us is doing QA.
Design drifts one reasonable commit at a time
The other half of this is temporal, and it is the part the original framing of “humans in the loop” usually misses. The danger is not only the single over-saturated visualization. It is drift, the slow divergence between the design you intended and the thing that actually ships, accumulating one individually-reasonable change at a time.
Every commit can be defensible on its own and the sum can still wander away from the intent. A spacing tweak here, a color nudged for contrast there, a component that grew a third variant because a feature needed it. None of them wrong, all of them together slowly eroding a coherence that no single diff was responsible for holding. The intent lives in a person, not in the repository. The repo holds the current state; it does not hold the meant state, the felt target you are steering toward. An agent optimizing each change locally has no access to that target and cannot tell you when the system has drifted off it, because from inside any single commit nothing is wrong.
I watched this happen to an interface I owned. Across a stretch of feature work, a calm UI had quietly gone busy: a contrast nudge for one fix, a third variant of a component for one feature, a spacing tweak to fit one new control. No commit was wrong. I had reviewed and approved each one as it landed. The drift only became visible when I stopped reading diffs and looked at the whole surface at once, and it took a deliberate reconciliation pass across nearly every screen to pull the system back to the coherence it had started with. What I reverted was not a line. It was a direction. Someone has to hold the whole picture in their head and notice the gap between where it is and where it was supposed to go. That someone is, still, me.
The reviewer’s job got more human, not less
It would be easy to read all of this as a complaint, the agent can’t do taste, woe is the craft. I read it the opposite way. The mechanical layer of review, does it compile, does it lint, does the obvious bug lurk on line forty, is increasingly handled, and good riddance; that was never the interesting part of the job. What’s left was always the point and was always the hardest: is this good, is this true, is this what we meant.
That isn’t a consolation prize. The job didn’t shrink; it concentrated. The reviewer’s job used to be substantially typo-hunting and bug-spotting, with judgment squeezed into whatever attention survived the mechanical pass. Now the mechanical pass is largely free, and the whole of my attention goes to the irreducibly human questions: adversarial truth-checking on one axis (is this actually doing what it claims, or just looking like it), and taste adjudication on the other (is this the right thing, rendered well). Those are harder than hunting typos. They are also the reason the work is still interesting, and the reason it still needs a person at all.
The diff will tell you, with perfect fidelity, what changed. It will tell you which lines moved and which tests passed and which files are new. It will never once tell you whether the result is any good. That judgment did not get automated. It got handed back to you in its purest form, stripped of the busywork that used to crowd it out. Taste doesn’t come in the diff. It comes from the person reading it, and that, increasingly, is the whole job.