The Wrong Benchmark: Why "Human-Level" Misses What Actually Matters in AI Refactoring
Ciprian · 8 min read

A new paper called CodeTaste asks whether LLMs can produce refactorings that match what skilled humans would do. The question makes sense for academic evaluation. The researchers run blind comparisons, collect human refactoring samples, and measure how closely model outputs align with expert judgment.
What they’re measuring is real. The methodology is sound for what it is. But building Octokraft and watching developers interact with AI-assisted refactoring on production codebases has convinced me that the entire framing asks the wrong question.
What Benchmarks Actually Capture
The CodeTaste approach isolates refactorings into discrete, evaluable units. You take a codebase, identify a transformation, and compare what the LLM produces against what a human expert produced. The metrics are about correctness, style matching, and whether the output achieves the stated goal.
This tells you something useful if you’re comparing models or tracking improvement over time. It establishes a baseline for what LLMs can do in controlled conditions. The problem isn’t with the research. The problem is that practitioners reading these results will assume the benchmark tells them something about whether the tool will work in their codebase. It doesn’t.
The benchmark surface is too clean by design. Academic datasets use well-isolated transformations where the goal is unambiguous and the context is self-contained. A method extraction here. A rename there. Maybe a parameter consolidation. The refactoring is the task.
Production refactoring rarely works that way.
What Production Actually Looks Like
Real codebases carry context that never makes it into the file being refactored. There are naming conventions that evolved over years and exist only in tribal knowledge. There are patterns that look wrong in isolation but exist for a reason documented in a Slack thread from 2023. There are partial migrations where the same abstraction is represented three different ways because the team is halfway through a rewrite.
A benchmark can’t capture this because the benchmark doesn’t know what it doesn’t know. The LLM doesn’t either. But here’s the thing: when a human refactors, they also don’t have complete information. The difference is that humans know what they don’t know. They ask questions. They hesitate. They make smaller changes because they’re uncertain about the ripple effects.
An LLM making a “human-level” refactoring in a blind evaluation doesn’t hesitate. It can’t. The evaluation setup doesn’t give it that option. It produces the transformation, and the transformation gets scored.
This is where the divergence between benchmark and production performance becomes practical. An LLM that handles clean cases well can still fail badly on production code, not because the transformation is wrong but because the transformation doesn't account for context that isn't in the file. No human-level benchmark tells you how your codebase scores on that dimension.
Reviewability as the Hidden Metric
Here’s what I’ve learned from watching developers accept or reject AI refactoring suggestions: the question they’re asking isn’t “is this what a human would produce?” The question is “can I verify this in under thirty seconds?”
Those are different problems.
A refactoring can be technically correct and still be unreviewable. I’ve seen LLM suggestions that fix the target issue, then fix adjacent issues for consistency, then rename variables to match a style pattern the LLM inferred. Each individual change is defensible. The total diff is a page and a half. The developer looks at it, can’t hold all the changes in their head, and rejects the suggestion.
The benchmark would score that refactoring highly. The developer scored it as unusable.
Reviewability has specific properties. Small diffs are more reviewable than large ones. Changes confined to the target area are more reviewable than changes that spread across files. Transformations that preserve structure are more reviewable than transformations that reorganize. When an LLM refactoring suggestion fails in production, the failure mode is usually scope creep, not incorrectness.
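Those properties are concrete enough to approximate in code. Here's a rough sketch of a reviewability heuristic; everything about it is illustrative, including the thresholds and weights, and it's not something Octokraft or any benchmark actually computes:

```python
def reviewability_score(diff_lines: int, files_touched: int,
                        hunks_outside_target: int) -> float:
    """Rough heuristic: smaller, more focused diffs score higher.

    All thresholds and weights are illustrative, not calibrated
    against real review data.
    """
    score = 1.0
    # Small diffs are easier to hold in your head.
    score *= min(1.0, 30 / max(diff_lines, 1))
    # Cross-file changes spread reviewer attention.
    score *= min(1.0, 2 / max(files_touched, 1))
    # Penalize every change the developer didn't ask for.
    score *= 0.5 ** hunks_outside_target
    return score

# A focused single-file change versus a sprawling one with
# two unasked-for hunks:
focused = reviewability_score(diff_lines=20, files_touched=1,
                              hunks_outside_target=0)
sprawling = reviewability_score(diff_lines=90, files_touched=3,
                                hunks_outside_target=2)
```

The point of the sketch isn't the numbers. It's that scope creep is penalized multiplicatively: each unasked-for hunk halves the score, which matches how quickly a reviewer's confidence collapses as a diff grows.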
The tool fixed the thing you pointed at. Then it fixed related things. Then it cleaned up naming inconsistencies it noticed along the way. This is what a good human would do if asked to refactor comprehensively. It’s exactly what makes AI suggestions hard to review.
Benchmarks measure the quality of the transformation. They don't penalize unasked-for surface area. They should.
Where AI Refactoring Actually Belongs
Not all refactorings are equal from a trust perspective. Some categories are genuinely well-suited to AI assistance. Others create more review overhead than they save.
Extract-and-rename operations are the sweet spot. The goal is unambiguous, the scope is bounded, and the verification is mechanical. Did the extraction preserve semantics? Did the rename propagate consistently? A developer can answer those questions quickly. The AI suggestion provides value because it does the tedious work and the human can confirm correctness without deep investigation.
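The "did the rename propagate consistently" half of that check can even be automated. A minimal sketch, assuming a pure-Python project small enough to scan file by file; a real check would respect language scoping rules and string literals, not just word boundaries in text:

```python
import re
from pathlib import Path

def rename_propagated(root: Path, old_name: str) -> bool:
    """Check that a rename left no stale references behind.

    Purely illustrative: matches on word boundaries only, so it
    can't distinguish a variable from a same-named comment.
    """
    stale = re.compile(rf"\b{re.escape(old_name)}\b")
    for path in root.rglob("*.py"):
        if stale.search(path.read_text()):
            return False  # old identifier still referenced somewhere
    return True
```

That the check fits in a dozen lines is the point: bounded verification is what makes this category of refactoring a good fit for AI assistance.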
Structural refactorings are different. Splitting a class, collapsing an inheritance hierarchy, changing a data shape. These require understanding dependencies that cross file boundaries, reasoning about runtime behavior, and anticipating edge cases that aren’t visible in the code as written. An LLM can produce a plausible transformation. The developer reviewing it has to reconstruct the reasoning to verify it. That reconstruction often takes longer than doing the refactoring manually would have.
The useful distinction isn’t simple versus complex. It’s bounded versus unbounded verification. When I can tell at a glance whether the transformation is correct, AI assistance helps. When I have to trace through implications to assess correctness, AI assistance creates work.
Building a tool means knowing that distinction. Evaluating a model doesn’t require it.
The Workflow That Actually Works
The developers who get consistent value from AI refactoring tools use them differently than the benchmark setup assumes. They don’t ask the AI to produce a refactoring and then review the output. They use the AI to find candidates, make the judgment call themselves, and then let the AI execute the mechanical transformation.
The split matters. Finding the refactoring opportunity requires understanding the codebase, the conventions, and the risk profile. That’s where human judgment is irreplaceable. Executing the refactoring once the decision is made is mechanical. That’s where AI excels.
This workflow doesn’t show up in benchmarks because benchmarks don’t model the discovery phase. They start with a labeled refactoring opportunity and measure the transformation. But in production, the hard part isn’t the transformation. The hard part is knowing which transformations are worth doing and which ones will break something subtle.
The benchmark question “can LLMs match human refactorings?” assumes the refactoring is already identified. In practice, the AI is most useful when it surfaces candidates I wouldn’t have noticed, not when it executes transformations I already knew I wanted.
What I’d Actually Want From Research
The CodeTaste paper is doing legitimate work. My disagreement isn’t with their methodology for what they’re trying to measure. My disagreement is about what question is worth asking.
Here’s what would help practitioners more than another human-level comparison:
Evaluate on codebases with real conventions and intentional inconsistencies. Synthetic datasets strip out the context that makes production refactoring hard. I’d rather see performance on messy codebases with documented style guides than performance on clean datasets.
Score reviewability alongside correctness. A refactoring that’s correct but requires ten minutes to verify is less useful than a refactoring that’s slightly suboptimal but can be confirmed in thirty seconds. The benchmark should capture that tradeoff.
Measure workflow integration, not isolated output quality. The relevant question isn’t whether the LLM produces a good refactoring. It’s whether adding the LLM to the workflow makes the developer faster. Those aren’t the same thing.
Study the failure modes that matter. Scope creep. Context blindness. Over-application of patterns. The research should tell me where the tool will fail in ways I can predict, not just how often it succeeds on average.
The Actual Threshold
“Human-level” sounds like the right bar. It isn’t. The threshold developers care about is lower in some ways and higher in others.
Lower because we don’t need the AI to produce the exact refactoring a human would produce. We need it to produce a refactoring we can verify quickly. A transformation that’s slightly different from what I’d do but clearly correct is fine. A transformation that matches my approach exactly but requires extensive review is not.
Higher because verification burden is part of the quality calculation. A tool that produces technically correct suggestions I can’t quickly confirm has negative value. It’s added work dressed up as assistance.
The benchmark papers aren’t wrong. They’re answering a question that matters for model development. But if you’re evaluating whether to integrate AI refactoring into your workflow, the benchmark results won’t tell you what you need to know.
You’ll learn more from running the tool on your actual codebase for a week than from any paper. Watch what the suggestions look like. Notice how long verification takes. Track whether the tool finds candidates you wouldn’t have spotted or just executes transformations you already planned.
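That tracking is deliberately low-tech. Here's a sketch of the kind of log I mean; the field names and CSV format are invented for illustration, not part of any tool:

```python
import csv
import os
from dataclasses import dataclass, asdict

@dataclass
class SuggestionRecord:
    """One row per AI refactoring suggestion you review."""
    kind: str              # e.g. "extract_method", "rename"
    found_by_tool: bool    # surfaced by the tool, or already planned?
    accepted: bool
    verify_seconds: float  # wall-clock time spent confirming correctness

def log_suggestion(path: str, record: SuggestionRecord) -> None:
    """Append one record to a CSV log, writing a header on first use."""
    row = asdict(record)
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=row.keys())
        if new_file:
            writer.writeheader()
        writer.writerow(row)
```

After a week, the two columns that matter are `found_by_tool` and `verify_seconds`: they answer the discovery question and the verification-time question directly, without any benchmark.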
The human-level question is academically interesting. The verification-time question is practically relevant. Building Octokraft has convinced me that the second question is the one tool builders should be obsessed with.