Vibe Code Bench: Best AI Model Hits 58% on Real Web App Development

Ciprian Ciprian · 13 min read

Vibe Code Bench is the first benchmark that measures whether AI models can build complete, deployable web applications from nothing but a natural-language specification. Not “can it write a function.” Not “can it fix a bug in an existing codebase.” Can it take a description of an app and produce something a user can actually click through?

Sixteen frontier models were tested. The best one passes 61.8% of user workflows. The worst passes 1.2%. The gap between them is explained less by raw coding ability and more by whether the model bothers to test its own work.

What the 100 App Specs Look Like

The benchmark contains 100 web application specifications split into a public validation set (50 tasks) and a held-out test set (50 tasks). These are not toy programs. They span three categories designed to represent what people actually want to build when they vibe-code:

Individual apps (24 total): Personal tools a user would build for themselves. A habit tracker. A personal finance dashboard. These typically have no authentication. They are the simplest category.

Solo Founder apps (45 total): Products someone starting a business might build. A parking reservation app. A freelance invoicing tool. These usually require email-based authentication and sometimes more complex auth flows.

Enterprise tools (31 total): Internal business software. A procurement request tracker with role hierarchies. An employee onboarding system with approval workflows. These require multiple user roles and permission levels.

The specs were designed to read like what a non-technical person would actually type into a coding tool. They are deliberately minimal, containing all the information needed to pass the tests and nothing more. Candidate ideas were drawn from consumer apps, YC startup concepts, consulting case studies, and internal enterprise needs, then refined through a five-step process: idea generation, synthetic specification expansion, workflow construction, expert validation by professional PMs and engineers, and a final manual review pass by two senior project managers.

Twenty-eight of the 100 applications require integration with external services: 9 need email only (via MailHog), 6 need Stripe only, and 13 need both. This matters because async webhooks, delivery verification, and cross-service state consistency are where real production apps break. Excluding them would have overstated AI readiness.

964 Workflows and 10,131 Substeps

Each application is paired with 6 to 23 automated browser workflows that collectively define what “working” means. Across both splits: 964 workflows total, 10,131 individual substeps. The validation set contains 491 workflows (4,995 substeps); the test set contains 473 workflows (5,136 substeps).

A workflow is a complete end-to-end user journey. For a social media platform, that might be: create an account, create a post, comment on the post, sign out. Each workflow contains a sequence of substeps — atomic actions like “create an account with email X” or “navigate to page Y and click button Z.”

A workflow passes when 90% or more of its substeps succeed. This threshold tolerates minor non-critical errors (a slightly wrong label, a cosmetic issue) while still requiring near-complete correctness. An application’s accuracy score is the percentage of its workflows that pass.

This design choice is important. Evaluating at the workflow level rather than pooling substeps across workflows prevents a broken core feature from being masked by unrelated successful substeps elsewhere.
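The scoring rule is simple enough to sketch in a few lines of Python. This is a minimal illustration of the logic described above, not the benchmark's actual harness:

```python
from typing import List

PASS_THRESHOLD = 0.9  # a workflow passes when >= 90% of its substeps succeed

def workflow_passes(substeps: List[bool]) -> bool:
    """True when at least 90% of the workflow's substeps succeed."""
    return sum(substeps) / len(substeps) >= PASS_THRESHOLD

def app_accuracy(workflows: List[List[bool]]) -> float:
    """An application's score: the percentage of its workflows that pass."""
    return 100.0 * sum(workflow_passes(w) for w in workflows) / len(workflows)

# A core workflow where half the substeps fail, plus nine perfect workflows.
core = [True] * 5 + [False] * 5
others = [[True] * 10 for _ in range(9)]

# Workflow-level scoring: the broken workflow fails outright -> 90.0%.
print(app_accuracy([core] + others))
# Pooling substeps instead would report 95/100 = 95%, masking the broken feature.
```

Note how pooling all 100 substeps would score this app 95%, while workflow-level scoring correctly charges the broken core feature a full workflow.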

How the Autonomous Browser Evaluator Works

The evaluation pipeline uses Browser Use, an autonomous web agent, to perform point-and-click testing of generated applications. The canonical evaluator model is Claude Sonnet 4.5.

The process runs in four stages:

  1. Deployment verification: The generated application bundle is started via Docker Compose. If the app does not become reachable, all its workflows are marked as failed.
  2. Workflow execution: A fresh headless browser session (1920x1200) is launched for each workflow. The evaluator agent performs user actions and checks expected outcomes for each substep, producing structured pass/fail judgments.
  3. Workflow scoring: A workflow passes if at least 90% of substeps succeed.
  4. Application aggregation: The app’s accuracy is the percentage of passing workflows.

Each workflow evaluation is capped at 100 agent steps. Fresh browser sessions and unique account/data values per workflow prevent state leakage between tests.
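Isolation of this kind usually comes down to generating unique credentials and data per run. A sketch of what that might look like, with a hypothetical helper name and field layout (only the 100-step cap and 1920x1200 viewport come from the paper):

```python
import uuid

def fresh_workflow_context(workflow_id: str) -> dict:
    """Build unique credentials for one workflow run (hypothetical helper).

    Each workflow gets its own account and its own browser session, so
    state created by one test can never leak into another.
    """
    run = uuid.uuid4().hex[:8]
    return {
        "email": f"user-{workflow_id}-{run}@example.test",
        "password": f"pw-{run}",
        "max_agent_steps": 100,    # per-workflow cap from the paper
        "viewport": (1920, 1200),  # fresh headless session size
    }

# Two runs of the same workflow never share an account.
ctx_a = fresh_workflow_context("signup-and-post")
ctx_b = fresh_workflow_context("signup-and-post")
assert ctx_a["email"] != ctx_b["email"]
```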

The critical design decision: evaluation is implementation-agnostic. The browser agent does not check DOM selectors or run unit tests. It interacts with the app the way a real user would. Models are free to choose any UI structure and styling as long as the user-visible behavior satisfies the specification.

The Generation Environment

Models operate inside a modified version of OpenHands, an agentic coding scaffold. Each model gets an isolated container with terminal access, browser capabilities, and service integrations (Supabase for PostgreSQL/auth/storage, MailHog for email, Stripe in test mode). The system prompt mandates a specific tech stack — React + Vite frontend, Tailwind CSS, Supabase backend, Docker Compose for deployment — to ensure consistent evaluation at scale.

Every model gets the same wall-clock budget: 5 hours per application. This time-based normalization mirrors real development constraints and allows evaluation of diverse harnesses. Despite 22 available tools, models spend about 95% of their tool calls on just five: bash commands, file editing, browser navigation, SQL execution, and migration application.

The Full Rankings: 16 Models, Ranked

Here are all 16 models on the test split, ranked by workflow accuracy. Cost is the median API cost per application. Latency is median wall-clock generation time.

| Rank | Model | Accuracy | Cost/App | Latency | Open Weight |
|------|-------|----------|----------|---------|-------------|
| 1 | GPT-5.3-Codex | 61.8% | $11.91 | 75.8 min | No |
| 2 | Claude Opus 4.6 | 57.6% | $8.69 | 21.3 min | No |
| 3 | GPT-5.2 | 53.5% | $17.75 | 82.9 min | No |
| 4 | Claude Opus 4.6 Thinking | 53.5% | $8.28 | 23.1 min | No |
| 5 | Claude Sonnet 4.6 | 51.5% | $5.91 | 26.2 min | No |
| 6 | GPT-5.2-Codex | 37.9% | $4.15 | 32.2 min | No |
| 7 | Gemini 3.1 Pro | 32.0% | $3.83 | 20.2 min | No |
| 8 | GLM-5 | 23.4% | $40.27 | 224.3 min | Yes |
| 9 | Gemini 3 Flash | 20.2% | $0.94 | 13.4 min | No |
| 10 | Kimi-K2.5 Thinking | 17.5% | $0.88 | 42.8 min | Yes |
| 11 | Qwen3.5 Plus Thinking | 15.7% | $3.80 | 50.3 min | Yes |
| 12 | MiniMax M2.5 | 14.9% | $2.20 | 51.1 min | Yes |
| 13 | GPT-5 Mini | 14.2% | $0.25 | 11.6 min | No |
| 14 | Claude Haiku 4.5 Thinking | 11.4% | $1.31 | 12.9 min | No |
| 15 | DeepSeek V3.2 Thinking | 5.1% | $2.47 | 56.1 min | Yes |
| 16 | Grok 4.1 Fast Reasoning | 1.2% | $0.21 | 8.8 min | No |

Several patterns emerge from this table.

The top tier is dominated by OpenAI and Anthropic. GPT-5.3-Codex leads at 61.8%, Claude Opus 4.6 is close behind at 57.6%, and then GPT-5.2 and Claude Opus 4.6 Thinking are tied at 53.5%. There is a sharp drop after the top five models: 6th place (GPT-5.2-Codex at 37.9%) is 13.6 percentage points behind 5th place (Claude Sonnet 4.6 at 51.5%).

Vibe Code Bench is far more discriminative than existing benchmarks. The gap between MiniMax M2.5 and Claude Opus 4.6 is 2.8 points on SWE-Bench but 42.7 points on Vibe Code Bench. SWE-Bench measures whether a model can fix a bug in an existing codebase. VCB measures whether it can build an entire application. These are very different skills.

Open-weight models cluster near the bottom. GLM-5, Kimi-K2.5 Thinking, Qwen3.5 Plus Thinking, MiniMax M2.5, and DeepSeek V3.2 Thinking all land between 5% and 24%. The best open-weight model (GLM-5 at 23.4%) costs $40.27 per app and takes nearly 4 hours — the worst cost-efficiency ratio in the entire benchmark.

Cost-Efficiency: Claude Opus 4.6 Stands Out

Claude Opus 4.6 achieves 57.6% accuracy at $8.69 per app in 21.3 minutes. GPT-5.3-Codex achieves 61.8% but at $11.91 per app and 75.8 minutes — over 3.5x the latency for 4 additional percentage points. GPT-5.2 costs $17.75 per app and takes 82.9 minutes for lower accuracy than Opus.

At the budget end, Gemini 3 Flash ($0.94/app, 13.4 minutes) and GPT-5 Mini ($0.25/app, 11.6 minutes) are cheap and fast but achieve only 20.2% and 14.2% respectively. The accuracy-cost frontier shows diminishing returns: additional spending improves performance, but the gains taper off above the $8-12 range.
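The accuracy-cost frontier can be read straight off the leaderboard table. A small sketch (using the published numbers; the dominance rule is the standard Pareto one, not something defined in the paper):

```python
# Accuracy (%) and median cost per app ($) for six models from the table.
models = {
    "GPT-5.3-Codex":     (61.8, 11.91),
    "Claude Opus 4.6":   (57.6, 8.69),
    "GPT-5.2":           (53.5, 17.75),
    "Claude Sonnet 4.6": (51.5, 5.91),
    "Gemini 3 Flash":    (20.2, 0.94),
    "GPT-5 Mini":        (14.2, 0.25),
}

def pareto_frontier(entries):
    """Keep models for which no other model is both cheaper and more accurate."""
    frontier = []
    for name, (acc, cost) in entries.items():
        dominated = any(a >= acc and c <= cost and (a, c) != (acc, cost)
                        for a, c in entries.values())
        if not dominated:
            frontier.append(name)
    return frontier

print(sorted(pareto_frontier(models)))
# GPT-5.2 drops out: Claude Opus 4.6 is both cheaper and more accurate.
```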

The Self-Testing Correlation (r=0.72)

The single strongest predictor of a model’s benchmark score is how much it uses the browser during development. Across all 16 models, browser tool calls per application correlate with accuracy at Pearson r=0.72. This correlation persists after controlling for generation latency (partial r=0.72), meaning it is not simply a proxy for “spending more time.”

The top models sustain longer development sessions and cycle between exploration, coding, execution, and browser-based checking over much longer horizons. Lower-performing models terminate earlier and exhibit narrower action-phase coverage. GPT-5.3-Codex spends 13.2% of its tool calls on browser actions. Claude Sonnet 4.6 spends 26.1%. Grok 4.1 Fast Reasoning spends 1.1% — and scores 1.2%.

Meanwhile, editing volume has almost no predictive value (r=0.09). Writing more code does not help. Testing more code does. This validates what experienced developers already know: the write-test-fix loop matters more than the initial generation. The models that check whether their app actually works in a browser produce dramatically better results.
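The statistic behind this claim is a first-order partial correlation: the correlation of browser usage with accuracy after removing the linear effect of latency. A self-contained sketch of the computation (the formula is standard; the input data here is only a sanity check, not the paper's):

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def partial_corr(x, y, z):
    """Correlation of x and y with the linear effect of z removed."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Sanity check: a perfect linear relationship between x and y survives
# controlling for a third variable correlated with both.
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
z = [1, 1, 2, 2]
assert abs(partial_corr(x, y, z) - 1.0) < 1e-9
```

That the partial r here equals the raw r (both 0.72) is what licenses the "not just more time" reading: latency explains essentially none of the browser-accuracy relationship.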

What Breaks: The Failure Taxonomy

Among the top five models, 12.8% of applications scored zero points. Among the bottom five, 73.2% scored zero. The worst models also had 14.0% of apps fail to start entirely (Docker deployment failure), which did not occur for top models. Only 8.8% of apps from the top five achieved a perfect 100%.

The performance distribution is bimodal: apps tend to either mostly work or completely fail. GPT-5.3-Codex’s most common score buckets are 0-12.5% and 87.5-100%. Model improvement is driven primarily by reducing total failures rather than incremental improvements across all apps.

For apps that start but have failing workflows, the failure taxonomy breaks down as:

  • Missing Feature (46.7%): The app simply did not implement something from the spec. This is the dominant failure mode.
  • Authorization Issue (20.4%): Sign-up or login was broken, or the user could not obtain the correct role.
  • Validation or Policy Block (14.8%): The user is prevented from performing actions that should be allowed. Most commonly, misconfigured Supabase Row-Level Security policies.
  • Other (10.2%): Miscellaneous failures.
  • UI Rendering or Navigation (6.0%): Missing UI elements or broken page navigation.
  • Data Consistency and Backend Logic (1.9%): Incorrect data writes or backend operation failures.

Different models have different failure signatures. Grok 4.1 Fast Reasoning and GPT-5.2-Codex have elevated authorization issues (32.3% and 30.3% of their behavioral failures). Gemini 3.1 Pro has the highest missing-feature share (58.5%).

Difficulty and integrations sharply affect scores. On easy tasks, GPT-5.3-Codex hits 81.9%; on hard tasks, it drops to 13.1%. On apps with no external integrations, GPT-5.3-Codex scores 71.3%; on apps requiring both email and Stripe, it drops to 29.6%. Enterprise tools (19.0% average across all models) are roughly half as likely to work as Individual apps (43.0%).

The Evaluator Agreement Problem

Here is the finding that complicates every number in the paper: evaluator agreement varies from 31.8% to 93.6% at the step level, depending on which pair of evaluators you compare.

The paper ran an external alignment study with four model evaluators (Gemini 3.1 Pro, GPT-5.2, Claude Sonnet 4.6, Claude Sonnet 4.5) and three human reviewers on 18 applications (1,401 unique substeps). The results are striking.

Claude Sonnet 4.5 (the canonical evaluator) agrees with human reviewers 86.4% of the time on average. Claude Sonnet 4.6 agrees with humans 86.3%. Gemini 3.1 Pro agrees with humans 84.7%. These are reasonable levels of alignment.

GPT-5.2 as an evaluator is the outlier: it agrees with humans only 36.1% of the time and with other model evaluators between 33.1% and 38.8%. Human-human agreement tops out at 93.6% (Reviewers A and B) and bottoms at 88.6% (Reviewers A and C).
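Step-level agreement is simple percent agreement over shared substep verdicts. A minimal sketch of the metric, with toy verdicts (not the paper's data):

```python
def step_agreement(judgments_a, judgments_b):
    """Percent of substeps on which two evaluators give the same pass/fail verdict."""
    matches = sum(a == b for a, b in zip(judgments_a, judgments_b))
    return 100.0 * matches / len(judgments_a)

# Illustrative verdicts over ten substeps (1 = pass, 0 = fail).
human   = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
model_a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 0]  # disagrees on one substep

print(step_agreement(human, model_a))  # 90.0
```

Note that percent agreement is inflated when most substeps pass; two evaluators that both say "pass" almost everywhere will agree highly even if they never agree on failures.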

This means the benchmark scores are most reliable when comparing models that are far apart (like GPT-5.3-Codex vs. Qwen3.5 Plus Thinking) and should be interpreted cautiously for models within a few percentage points of each other. Variance decomposition confirms this: most observed variance comes from generation rather than rescoring (92.5% for Gemini 3.1 Pro, 92.1% for GPT-5.2, 73.5% for Claude Sonnet 4.6). The benchmark reliably separates tiers but should not be treated as precise to the decimal.

What This Means for Teams Shipping AI-Generated Web Apps

The headline: the best AI model in the world, given a text spec and 5 hours, produces a web application where 38% of standard user workflows are broken. These are not edge cases. These are the workflows the spec explicitly defines.

The actionable findings:

Self-testing is not optional. Any AI coding workflow that does not include the model testing its own output in a real browser is leaving 20+ percentage points on the table. The correlation is strong, persistent, and not explained by time spent. If your AI coding pipeline generates code and then evaluates it separately, you are using the wrong architecture.

External integrations are where things fall apart. Apps with no integrations pass at 2.5x the rate of apps requiring both email and Stripe. If your AI-generated app needs to talk to payment processors, email services, or third-party APIs, expect to spend significant time on manual debugging and integration wiring.

Auth is the second-biggest failure point. One in five workflow failures is an authentication or authorization issue. Row-Level Security misconfiguration is the most common specific failure. If your spec involves roles, permissions, or multi-tenant access, AI-generated implementations require careful review.

Nearly half of failures are missing features. The model did not forget how to code — it forgot what the spec asked for. This suggests that spec comprehension and feature completeness tracking are at least as important as raw coding ability for improving these numbers.

Budget models are not ready for full-app generation. GPT-5 Mini and Grok 4.1 Fast Reasoning cost almost nothing but produce apps that barely function. For full application development, you are realistically choosing between the $5-18 per app tier (Claude Sonnet through GPT-5.3-Codex) and accepting that roughly half of workflows will still need fixing.

The 62% number will improve. But the current state is clear: AI can scaffold a web application, but it cannot reliably ship one.


Source: Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development