VIBEPASS: AI Models Can Write Code But Cannot Find Their Own Bugs
Ciprian · 12 min read

A team at Salesforce AI Research, in collaboration with NTU and A*STAR, published VIBEPASS, the first benchmark that systematically decomposes the question of whether LLMs can find and fix their own coding bugs. The paper evaluates 12 frontier models across 173 problem instances and produces findings that quantify what many practitioners suspected: models that write plausible code cannot reliably identify where that code fails.
This is not another “LLMs are bad at coding” paper. The contribution is more specific and more useful. VIBEPASS isolates the exact stage in the debugging pipeline where models break down, and the answer is not where most people would guess.
The Benchmark: What They Built and Why It Matters
VIBEPASS pairs competitive programming problems with LLM-generated solutions that pass partial test suites but fail on semantic edge cases. The benchmark was constructed through a three-stage pipeline:
Problem collection. The researchers sourced problems from LiveCodeBench, whose continuous release schedule mitigates training data contamination. After filtering for problems that a majority of GPT-4o, Claude Sonnet 4, and Gemini-2.5-Pro failed to solve, they retained 170 problems, 89% of them rated medium or hard.
Solution collection. They gathered 2,184 candidate solutions across 168 problems from both human submissions and diverse LLMs. Solutions passing all official tests were designated “silver” (reference correct). Solutions that produced wrong answers — not runtime errors, which are trivially detectable — were designated “buggy.” Each buggy solution passes between 10% and 90% of official test cases, ensuring the bugs require genuine semantic reasoning to detect.
Final dataset. 173 instances spanning 76 unique problems. Each instance includes: a problem specification, official test cases, an automatically generated input-validity checker (validated against official tests, 98.8% accuracy), a verified silver solution, and a model-generated buggy solution.
The key design choice is that these are not obviously broken programs. They pass partial test suites. They look correct. The bugs live in edge cases that require understanding program behavior at a level deeper than pattern matching.
Two Tasks, Four Metrics, Three Repair Conditions
VIBEPASS evaluates two coupled tasks:
Task 1: Fault-Triggering Test Generation (FT-Test). Given a problem description and a buggy solution, generate a test case (input + expected output) that exposes the bug. A test is “fault-triggering” only if: the input is valid, the buggy solution produces the wrong output, and the silver solution produces the correct output.
This task is evaluated under two settings. In Bug-Aware, the model is told the code has a bug and must generate a test exposing it. In Bug-Discovery, the model must first determine whether the code is buggy at all, then generate a test only if it judges the code faulty.
Test quality is measured along four progressively stricter criteria:
- V_I (Validity): Is the generated input syntactically valid per the problem constraints?
- V_IO (Executability): Is the input valid AND does the expected output match the silver solution’s output?
- D_I (Discriminative Input): Is the input valid AND does the buggy solution produce a different output than the silver solution?
- D_IO (Discriminative Test): The full conjunction — valid input, correct expected output, and the buggy solution fails the test.
Each level strictly refines the previous: D_IO is a subset of both V_IO and D_I, which are subsets of V_I.
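The four criteria form a simple conjunction hierarchy. A minimal sketch of how a single generated test might be scored — the callables `validator`, `silver`, and `buggy` are illustrative stand-ins for the benchmark's input checker and solutions, not its actual API:

```python
def score_test(inp, expected, validator, silver, buggy):
    """Score one generated test (input + expected output) against the
    four progressively stricter criteria. All three callables are
    hypothetical stand-ins for the benchmark's components."""
    v_i = validator(inp)                         # Validity: input respects constraints
    v_io = v_i and silver(inp) == expected       # Executability: expected output correct
    d_i = v_i and buggy(inp) != silver(inp)      # Discriminative input: behaviors differ
    d_io = v_io and d_i                          # Discriminative test: full conjunction
    return {"V_I": v_i, "V_IO": v_io, "D_I": d_i, "D_IO": d_io}
```

By construction, a test scoring D_IO also scores V_IO and D_I, each of which implies V_I, matching the subset relations described above.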
Task 2: Fault-Targeted Program Repair (FPR). Given the problem and the buggy solution, produce a corrected version that passes all official tests. This is evaluated under three conditions: NoTest (model knows the solution is buggy but gets no test cases), ExtTest (model receives an externally generated fault-triggering test from Task 1), and IntTest (model generates its own test first, then uses it to guide repair).
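The three repair conditions differ only in what test context reaches the repair step. A hypothetical sketch of the dispatch — the `model.generate_test` / `model.repair` interface is invented for illustration, not the paper's harness:

```python
def run_fpr(model, problem, buggy_code, mode, external_test=None):
    """Run one fault-targeted repair attempt under one of the paper's
    three conditions. `model` is any object exposing the (hypothetical)
    methods generate_test(problem, code) and repair(problem, code, tests)."""
    if mode == "NoTest":    # model knows the code is buggy, gets no test
        return model.repair(problem, buggy_code, tests=[])
    if mode == "ExtTest":   # fault-triggering test supplied externally
        return model.repair(problem, buggy_code, tests=[external_test])
    if mode == "IntTest":   # model writes its own test, then repairs with it
        own_test = model.generate_test(problem, buggy_code)
        return model.repair(problem, buggy_code, tests=[own_test])
    raise ValueError(f"unknown mode: {mode}")
```

The IntTest branch is the one the later findings scrutinize: the test guiding the repair comes from the same model that produced the bug.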
The 12 Models
The evaluation covers a broad cross-section of frontier models: GPT-5 Nano, GPT-5 Mini, GPT-5.2, GPT-5.2 Codex, Gemini-3 Flash, Gemini-3 Pro, Gemini-3.1 Flash-Lite, Gemini-3.1 Pro, Claude Sonnet 4.6, Claude Opus 4.6, and two open-source models — GPT-OSS-120B and Nemotron-3-Nano-30B-A3B.
Finding 1: The Validity-Discrimination Gap
The headline result from Task 1: models produce syntactically valid test inputs at near-ceiling rates (average V_I = 86.4%) but collapse when those inputs need to actually expose a bug (average D_IO = 61.3%). That is a 25.1 percentage point gap between “can generate a test” and “can generate a test that matters.”
The paper decomposes this gap into two distinct failure modes:
- Fault hypothesis gap (V_I - D_I): average ~23 pp. The model generates a valid input but picks one that does not distinguish buggy from correct behavior. This is a failure of reasoning about what could go wrong.
- Output validation gap (V_I - V_IO): average ~8 pp. The model generates a valid input but predicts the wrong expected output. This is a failure of execution simulation.
The fault hypothesis gap is 2.7x larger than the output validation gap. The bottleneck is not predicting outputs — it is forming hypotheses about where the code might fail.
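Given a model's four metric values, the decomposition is straightforward arithmetic. A small helper; the benchmark-wide V_IO and D_I values used in the comment are back-derived from the paper's stated ~8 pp and ~23 pp gaps, so treat them as approximate:

```python
def decompose_gap(v_i, v_io, d_i, d_io):
    """Split the validity-to-discrimination gap (all values in percent)
    into the paper's two failure modes."""
    return {
        "overall_gap": round(v_i - d_io, 1),            # V_I - D_IO
        "fault_hypothesis_gap": round(v_i - d_i, 1),    # non-distinguishing input
        "output_validation_gap": round(v_i - v_io, 1),  # wrong expected output
    }

# Benchmark-wide averages; V_IO (78.4) and D_I (63.4) are approximate,
# back-derived from the reported ~8 pp and ~23 pp gaps:
# decompose_gap(86.4, 78.4, 63.4, 61.3)
```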
The dispersion across models is enormous. On discriminative test generation (D_IO), Opus 4.6 hits 79.8% while Gemini-3.1 Flash-Lite manages only 26.0% — a 54 percentage point spread. The fault hypothesis gap varies 12x across models, from 4.6% for Opus 4.6 to 57.8% for Gemini-3.1 Flash-Lite.
Individual model results in Bug-Aware FT-Test (V_I / D_IO):
- Opus 4.6: 84.4% / 79.8%
- Sonnet 4.6: 80.4% / 70.5%
- GPT-5.2 Codex: 90.2% / 72.3%
- Gemini-3 Pro: 94.8% / 69.4%
- Gemini-3.1 Pro: 89.0% / 68.8%
- GPT-5.2: 88.4% / 68.2%
- Gemini-3 Flash: 95.4% / 63.0%
- GPT-5 Mini: 89.0% / 60.7%
- GPT-5 Nano: 92.5% / 52.0%
- Nemotron-3-30B: 66.5% / 39.3%
- Gemini-3.1 Flash-Lite: 89.6% / 26.0%
Notice that Gemini-3.1 Flash-Lite achieves 89.6% input validity but only 26.0% discriminative test generation. It can write tests that compile and run. It cannot write tests that find bugs.
Finding 2: Bug Discovery Compounds the Problem
When models must first judge whether the code is buggy before generating tests (Bug-Discovery setting), performance drops further. Average judgment accuracy is 71.4%, but the end-to-end metric (correct judgment AND discriminative test) falls to 49.9% — down from 61.3% D_IO in the Bug-Aware setting. Roughly 60% of Bug-Discovery failures trace to incorrect judgment rather than poor test generation.
There is an asymmetry in how judgment quality interacts with model strength. Weaker models like GPT-5 Nano (judgment accuracy: 54.9%) benefit substantially from being told a bug exists — their conditional D_IO jumps from 34.6% when they misjudge to 66.3% when correctly grounded. Stronger models like Opus 4.6 (judgment accuracy: 82.7%) show minimal benefit; their internal confidence already drives effective test generation, and forcing them to test in low-confidence cases can actually depress performance.
The paper connects this to the selective prediction framework: allowing models to abstain on uncertain instances improves aggregate reliability. In the Bug-Aware setting, when the model’s own Bug-Discovery judgment was correct, average D_IO was 71.2%. When the judgment was wrong, it was 33.6% — a 37.6 pp gap even when the model was externally told a bug exists. The underlying diagnostic capability is the bottleneck, not the availability of grounding information.
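One plausible formalization of the end-to-end Bug-Discovery criterion, paraphrased from the paper's description (a sketch of the rule, not the benchmark's exact implementation):

```python
def bug_discovery_success(judged_buggy, actually_buggy, test_is_d_io):
    """End-to-end Bug-Discovery success as sketched here: the verdict
    must be correct, and when the code really is buggy, the generated
    test must also be discriminative (D_IO)."""
    if judged_buggy != actually_buggy:
        return False        # wrong verdict: instant failure
    if not actually_buggy:
        return True         # correctly judged clean; no test required
    return test_is_d_io     # buggy: the test must expose the fault
```

Under this rule a model can fail twice over on buggy code: by misjudging it as clean, or by judging correctly and then generating a non-discriminative test.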
Finding 3: Self-Tests That Miss the Bug Make Repair Worse
The repair results (Task 2) contain the paper’s most practically significant finding. Across all models:
- NoTest (unguided repair): average P@1 = 58.6, average SR = 47.6
- IntTest (self-generated test, then repair): average P@1 = 51.8, average SR = 41.6
- ExtTest (external test, then repair): average P@1 = 55.9, average SR = 45.9
Self-generated tests degrade repair performance by 6.8 P@1 points on average compared to unguided repair. The model is worse off having tried to test than if it had just attempted a blind fix.
This is counterintuitive and important. The naive assumption behind “write tests, then fix” workflows is that testing always helps. VIBEPASS shows it does not — when the test fails to expose the actual fault, it actively misleads the repair attempt.
The best overall repair profile belongs to Gemini-3 Flash (NoTest SR: 70.5, IntTest SR: 56.0, ExtTest SR: 71.0). The worst is Nemotron-3-Nano-30B (NoTest SR: 16.0, IntTest SR: 9.5, ExtTest SR: 9.0).
Finding 4: When Self-Tests Work, They Work Better Than External Tests
The paper runs a controlled comparison on the subset of instances where both internal and external tests produced valid corner cases (32-115 per model, median 89). On this intersection:
- NoTest: 63.9% success
- ExtTest: 57.8% success (-6.1 pp)
- IntTest: 64.2% success (+0.3 pp)
IntTest outperforms ExtTest by 6.4 pp (Wilcoxon p<0.05, Cohen’s d=0.41). At the model level, 8 of 12 models benefited more from self-generated tests.
This is the paradox. Self-generated tests are, on the whole, lower quality than externally provided ones (D_IO: 54.7% vs 56.3%). But when they do succeed in exposing the fault, the contextual alignment between the model’s own reasoning and the test provides better repair guidance. The model understands its own test better than someone else’s test.
The practical implication: test provenance matters. A correct self-generated test is more useful for repair than a correct external test, because the act of generating the test created reasoning context that transfers to the repair step.
Finding 5: The Pipeline Has Two Cliffs
The correlation analysis across six cumulative pipeline stages reveals where performance falls off. FT-Input and FT-IO show near-perfect correlation (r=0.988) — once a model finds a fault-triggering input, predicting the correct output is trivial. These fault-triggering metrics strongly predict both judgment (r >= 0.86) and repair performance (r=0.794).
Valid Input alone weakly predicts everything downstream (r <= 0.125). Syntactic test fluency does not transfer to fault reasoning.
The two sharpest performance cliffs:
- Valid IO to FT-Input: -14.7 pp. The transition from “can generate valid test inputs with correct outputs” to “can generate inputs that expose bugs.” This is the fault localization cliff.
- FT-IO to Repair: -21.2 pp. The transition from “found the bug” to “fixed the bug.” This is the test-to-repair cliff.
By model family: OpenAI models achieve the highest final success rate (54.3%) with the most graceful degradation. Google models struggle specifically with fault-triggering tasks. Open-source models underperform significantly in repair (12.1% final success rate).
What the Paper Does Not Address
VIBEPASS has several limitations worth noting, some acknowledged by the authors and some not:
Competitive programming only. The benchmark uses algorithmic problems from LiveCodeBench. Real-world bugs in production code involve state management, concurrency, API misuse, configuration errors, and integration failures — none of which are represented. The fault-targeted reasoning gap may be larger or smaller in those domains.
Single-turn evaluation. Models get one attempt at each task. Production coding agents typically iterate: run tests, observe failure, adjust, retry. The paper does not measure whether multi-turn interaction closes the gap.
No agentic tool use. Models are evaluated as pure text-in, text-out reasoners. They cannot run the code, inspect stack traces, or use debuggers. Real coding agents with execution feedback may perform differently.
Homogeneous bug source. The buggy solutions are all LLM-generated wrong-answer failures from competitive programming. Human-written bugs, bugs from code evolution (regressions), and bugs arising from specification ambiguity are structurally different.
No cost analysis. The paper does not report inference costs. The practical question for teams is not just “does model X find more bugs” but “what is the cost per bug found” — especially relevant when comparing Opus 4.6 (79.8% D_IO) against GPT-5 Nano (52.0% D_IO) at very different price points.
What This Means in Practice
Three takeaways for anyone using AI coding tools in production:
The “write tests then fix” loop is not reliably self-correcting. The common assumption behind vibe coding workflows — that models can iteratively test and repair their way to correctness — breaks down at the fault hypothesis stage. When the model generates tests from the same assumptions that produced the bug, the tests fail to expose the bug, and the repair attempt is worse than a blind fix. This is not a random failure. It is systematic: the model’s reasoning about what could go wrong is correlated with its reasoning about what is correct, so the same blind spots persist.
Different models, same bottleneck. All 12 models — across four model families, spanning open-source to frontier — show the same structural weakness: high syntactic test validity, collapsed discriminative generation. The gap narrows for stronger models (Opus 4.6’s fault hypothesis gap is only 4.6 pp) but does not disappear. This is not a training data problem that the next model generation will automatically solve. It points to a fundamental limitation in how current architectures reason about program behavior.
Independent verification is structurally necessary, not just good practice. If the same model writes and tests the code, the test quality is bounded by the same reasoning capacity that produced the bugs. VIBEPASS quantifies this: the average agreement rate between Bug-Discovery and Bug-Aware discriminative test outcomes is 82.9%, meaning models tend to find the same bugs (or miss the same bugs) regardless of whether they are told a bug exists. Breaking this correlation requires an independent verification step — a different model, a static analyzer, a fuzzer, or a human reviewer. The paper provides empirical grounding for what was previously an engineering intuition.
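One cheap way to break the writer-tester correlation is differential testing against an independently produced reference (a second model's solution, or a trusted implementation), fuzzing random valid inputs. A minimal sketch, assuming both solutions are pure functions; `gen_input` and the solutions in the usage notes are hypothetical:

```python
import random

def differential_check(candidate, reference, gen_input, trials=1000, seed=0):
    """Fuzz random valid inputs and return the first one on which the
    candidate and an independent reference disagree. A divergence flags
    a likely bug (a discriminative input, D_I in VIBEPASS terms);
    finding none proves nothing about correctness."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = gen_input(rng)
        if candidate(x) != reference(x):
            return x
    return None
```

The reference here plays the role of the benchmark's silver solution; the point is only that it must come from a source whose blind spots are uncorrelated with the candidate's.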
Source: VIBEPASS: Can Vibe Coders Really Pass the Vibe Check? — Bansal, Fangkai, Zhou, Xu, Joty, Yavuz. Salesforce AI Research / NTU / A*STAR. March 2026. Dataset and code: github.com/SalesforceAIResearch/vibepass