SWE-CI: Can AI Agents Actually Maintain a Codebase Over Time?

Ciprian Ciprian · 3 min read

SWE-CI is the first repository-level benchmark built on the continuous-integration loop. It measures whether AI agents can maintain codebases over extended periods — not one-shot bug fixes, but dynamic, long-term maintenance.

Why This Is Different

Most agent benchmarks (SWE-bench, HumanEval) test isolated tasks: fix this bug, write this function. SWE-CI tests something harder: can an agent maintain a codebase through iterative development?

  • 100 tasks across real-world repositories
  • Average evolution history of 233 days per task
  • 71 consecutive commits per task on average
  • Dozens of analysis and coding iterations per task

This is closer to what agents actually face in production CI pipelines: not a single clean problem, but accumulated history, changing requirements, and dependencies that evolved while nobody was looking.

The Gap Between Benchmarks and Production

There is a meaningful gap between a clean benchmark task and a CI pipeline at 11pm when the build breaks.

Benchmark environments are controlled. Real CI accumulates entropy — legacy configs, implicit dependencies, environment variables set by someone two jobs ago. Agent capability in a controlled harness says little about agent reliability in that environment.

The Failure Mode That Matters

The dangerous failure in CI is not inability. It is confident wrongness — an agent that proposes a fix that passes the test harness but breaks a downstream integration nobody documented.

An agent that hedges in a chat interface is annoying. An agent that confidently modifies a CI pipeline and introduces a subtle regression — one that passes locally but fails on the next deploy — is a trust-destroying event.

Completion vs. Judgment

Maintaining a codebase via CI is not just fixing what is broken. It is knowing when a failing test is a signal worth investigating versus an environment fluke. Agents optimized for task completion do not have a strong model of restraint.

The benchmark evaluates task completion. But production maintenance requires judgment about when not to act.
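One concrete form that judgment could take is flaky-test triage: before acting on a red build, rerun the failing test and see whether the failure even reproduces. A minimal sketch — the `run_test` callable, the rerun count, and the three labels are illustrative assumptions, not part of SWE-CI:

```python
from typing import Callable

def classify_failure(run_test: Callable[[], bool], reruns: int = 5) -> str:
    """Rerun a failing test to separate a real regression from an
    environment fluke. `run_test` is a hypothetical zero-arg hook that
    executes the test once and returns True on pass; wire it to your
    actual runner."""
    passes = sum(run_test() for _ in range(reruns))
    if passes == 0:
        return "deterministic"  # fails every time: signal worth investigating
    if passes == reruns:
        return "stale"          # the original failure did not reproduce
    return "flaky"              # intermittent: flag for a human, don't auto-fix
```

A completion-optimized agent would start patching immediately; a judgment-aware one would only act on the `"deterministic"` case.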

What Would Help

A trustworthy CI agent would need properties that are hard to benchmark:

  • Rollback awareness — what happens if this change fails in production?
  • Uncertainty surfacing — flagging low-confidence fixes instead of committing them
  • Human-in-the-loop escalation — deferring when the context is ambiguous

None of these are easy to measure. All of them matter more than pass rates on isolated tasks.
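The three properties above can be read as a gating policy on agent-proposed changes. A minimal sketch, assuming a self-reported confidence score and a couple of illustrative fields — none of this is prescribed by SWE-CI:

```python
from dataclasses import dataclass

@dataclass
class ProposedFix:
    confidence: float        # agent's self-reported confidence in [0, 1]
    has_rollback: bool       # can the change be reverted automatically?
    context_ambiguous: bool  # e.g. undocumented downstream consumers

def gate(fix: ProposedFix, threshold: float = 0.8) -> str:
    """Decide what to do with an agent-proposed fix.
    The fields and threshold are hypothetical, for illustration only."""
    if fix.context_ambiguous:
        return "escalate"  # human-in-the-loop: defer when scope is unclear
    if fix.confidence < threshold:
        return "flag"      # surface uncertainty instead of committing
    if not fix.has_rollback:
        return "flag"      # no safe revert path: don't auto-merge
    return "commit"
```

The point of such a gate is that `"flag"` and `"escalate"` are first-class outcomes, not failures — exactly the restraint that pass-rate metrics do not reward.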


Source: SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration