SWE-CI: Can AI Agents Actually Maintain a Codebase Over Time?
Ciprian · 3 min read

SWE-CI is the first repository-level benchmark built on the continuous integration loop. It measures whether AI agents can maintain codebases over extended periods — not one-shot bug fixes, but dynamic, long-term maintainability.
Why This Is Different
Most agent benchmarks (SWE-bench, HumanEval) test isolated tasks: fix this bug, write this function. SWE-CI tests something harder: can an agent maintain a codebase through iterative development?
- 100 tasks across real-world repositories
- Average evolution history of 233 days per task
- 71 consecutive commits per task on average
- Dozens of analysis and coding iterations per task
This is closer to what agents actually face in production CI pipelines: not a single clean problem, but accumulated history, changing requirements, and dependencies that evolved while nobody was looking.
The Gap Between Benchmarks and Production
There is a meaningful gap between a clean benchmark task and a CI pipeline at 11pm when the build breaks.
Benchmark environments are controlled. Real CI accumulates entropy — legacy configs, implicit dependencies, environment variables set by someone two jobs ago. Agent capability in a controlled harness says little about agent reliability in that environment.
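A minimal sketch of what "entropy" looks like in practice. Everything here is hypothetical — the variable name, the fallback URL, and the helper are illustrative, not taken from any real pipeline — but it shows how a step can silently depend on environment state a previous job left behind:

```python
import os

def resolve_registry() -> str:
    """Hypothetical build step that reads an env var some earlier
    CI job may (or may not) have exported, with a silent fallback."""
    return os.environ.get("NPM_REGISTRY", "https://registry.npmjs.org")

# Controlled benchmark harness: the dependency is explicit and pinned.
os.environ["NPM_REGISTRY"] = "https://mirror.internal/npm"
assert resolve_registry() == "https://mirror.internal/npm"

# Real CI: the variable was set two jobs ago -- or it wasn't,
# and the step quietly resolves against a different registry.
del os.environ["NPM_REGISTRY"]
assert resolve_registry() == "https://registry.npmjs.org"
```

The failure is invisible in the harness because the harness always pins the variable; only the unpinned path exercises the fallback.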
The Failure Mode That Matters
The dangerous failure in CI is not inability. It is confident wrongness — an agent that proposes a fix that passes the test harness but breaks a downstream integration nobody documented.
An agent that hedges in a chat interface is annoying. An agent that confidently modifies a CI pipeline and introduces a subtle regression — one that passes locally but fails on the next deploy — is a trust-destroying event.
Completion vs. Judgment
Maintaining a codebase via CI is not just fixing what is broken. It is knowing when a failing test is a signal worth investigating versus an environment fluke. Agents optimized for task completion do not have a strong model of restraint.
The benchmark evaluates task completion. But production maintenance requires judgment about when not to act.
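One crude form of that judgment can be sketched as a rerun heuristic — not anything SWE-CI itself implements, and the rerun count and labels are arbitrary assumptions — where a failure that reproduces on every rerun is treated as a signal, and one that intermittently passes as a likely fluke:

```python
from typing import Callable

def classify_failure(run_test: Callable[[], bool], reruns: int = 5) -> str:
    """Hypothetical heuristic: rerun a failing test `reruns` times.

    Returns "signal" if it fails every time (worth investigating),
    "flake" if it passes at least once (likely an environment fluke).
    """
    results = [run_test() for _ in range(reruns)]
    return "flake" if any(results) else "signal"

# A deterministic failure is a signal worth acting on.
assert classify_failure(lambda: False) == "signal"

# A test that sometimes passes looks like an environment fluke.
outcomes = iter([False, False, True, False, True])
assert classify_failure(lambda: next(outcomes)) == "flake"
```

Even this toy version makes the point: the hard part is not fixing the test, it is deciding whether the failure deserves a fix at all.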
What Would Help
A trustworthy CI agent would need properties that are hard to benchmark:
- Rollback awareness — what happens if this change fails in production?
- Uncertainty surfacing — flagging low-confidence fixes instead of committing them
- Human-in-the-loop escalation — deferring when the context is ambiguous
None of these are easy to measure. All of them matter more than pass rates on isolated tasks.
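The three properties compose naturally into a gate that sits between an agent's proposal and the commit step. This is a sketch under stated assumptions — the `ProposedFix` shape, the confidence threshold, and the decision labels are all invented for illustration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProposedFix:
    patch: str
    confidence: float            # agent's self-reported confidence, 0..1
    rollback_plan: Optional[str]  # how to revert if it fails in production

def decide(fix: ProposedFix, threshold: float = 0.8) -> str:
    """Hypothetical gate: commit only high-confidence fixes that carry
    a rollback plan; everything else escalates to a human."""
    if fix.rollback_plan is None:
        return "escalate: no rollback plan"       # rollback awareness
    if fix.confidence < threshold:
        return "escalate: low confidence"         # uncertainty surfacing
    return "commit"                               # human stays out of the loop

assert decide(ProposedFix("fix lint rule", 0.95, "git revert")) == "commit"
assert decide(ProposedFix("bump dependency", 0.40, "git revert")) == "escalate: low confidence"
assert decide(ProposedFix("edit ci.yml", 0.95, None)) == "escalate: no rollback plan"
```

The gate is trivial to write and hard to benchmark: scoring it requires knowing when escalation was the *right* answer, which is exactly the judgment pass rates do not capture.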
Source: SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration