SWE-CI: Can AI Agents Actually Maintain a Codebase Over Time?
Ciprian · 3 min read

SWE-CI is the first repository-level benchmark built on the continuous integration loop. It measures whether AI agents can maintain codebases over extended periods — not one-shot bug fixes, but dynamic, long-term maintainability.
Why This Is Different
Most agent benchmarks (SWE-bench, HumanEval) test isolated tasks: fix this bug, write this function. SWE-CI tests something harder: can an agent maintain a codebase through iterative development?
- 100 tasks across real-world repositories
- Average evolution history of 233 days per task
- 71 consecutive commits per task on average
- Dozens of analysis and coding iterations per task
This is closer to what agents actually face in production CI pipelines: not a single clean problem, but accumulated history, changing requirements, and dependencies that evolved while nobody was looking.
The Gap Between Benchmarks and Production
There is a meaningful gap between a clean benchmark task and a CI pipeline at 11pm when the build breaks.
Benchmark environments are controlled. Real CI accumulates entropy — legacy configs, implicit dependencies, environment variables set by someone two jobs ago. Agent capability in a controlled harness says little about agent reliability in that environment.
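A minimal sketch of what "entropy" looks like in practice. Everything here is hypothetical — the variable name, the fallback URL, and the helper are illustrative, not taken from any real pipeline — but it shows how a step can silently depend on environment state a previous job left behind:

```python
import os

def resolve_registry() -> str:
    """Hypothetical build step that reads an env var some earlier
    CI job may (or may not) have exported, with a silent fallback."""
    return os.environ.get("NPM_REGISTRY", "https://registry.npmjs.org")

# Controlled benchmark harness: the dependency is explicit and pinned.
os.environ["NPM_REGISTRY"] = "https://mirror.internal/npm"
assert resolve_registry() == "https://mirror.internal/npm"

# Real CI: the variable was set two jobs ago -- or it wasn't,
# and the step quietly resolves against a different registry.
del os.environ["NPM_REGISTRY"]
assert resolve_registry() == "https://registry.npmjs.org"
```

The failure is invisible in the harness because the harness always pins the variable; only the unpinned path exercises the fallback.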
The Failure Mode That Matters
The dangerous failure in CI is not inability. It is confident wrongness — an agent that proposes a fix that passes the test harness but breaks a downstream integration nobody documented.
An agent that hedges in a chat interface is annoying. An agent that confidently modifies a CI pipeline and introduces a subtle regression — one that passes locally but fails on the next deploy — is a trust-destroying event.
Completion vs. Judgment
Maintaining a codebase via CI is not just fixing what is broken. It is knowing when a failing test is a signal worth investigating versus an environment fluke. Agents optimized for task completion do not have a strong model of restraint.
The benchmark evaluates task completion. But production maintenance requires judgment about when not to act.
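One crude form of that judgment can be sketched as a rerun heuristic — not anything SWE-CI itself implements, and the rerun count and labels are arbitrary assumptions — where a failure that reproduces on every rerun is treated as a signal, and one that intermittently passes as a likely fluke:

```python
from typing import Callable

def classify_failure(run_test: Callable[[], bool], reruns: int = 5) -> str:
    """Hypothetical heuristic: rerun a failing test `reruns` times.

    Returns "signal" if it fails every time (worth investigating),
    "flake" if it passes at least once (likely an environment fluke).
    """
    results = [run_test() for _ in range(reruns)]
    return "flake" if any(results) else "signal"

# A deterministic failure is a signal worth acting on.
assert classify_failure(lambda: False) == "signal"

# A test that sometimes passes looks like an environment fluke.
outcomes = iter([False, False, True, False, True])
assert classify_failure(lambda: next(outcomes)) == "flake"
```

Even this toy version makes the point: the hard part is not fixing the test, it is deciding whether the failure deserves a fix at all.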
What Would Help
A trustworthy CI agent would need properties that are hard to benchmark:
- Rollback awareness — what happens if this change fails in production?
- Uncertainty surfacing — flagging low-confidence fixes instead of committing them
- Human-in-the-loop escalation — deferring when the context is ambiguous
None of these are easy to measure. All of them matter more than pass rates on isolated tasks.
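The three properties compose naturally into a gate that sits between an agent's proposal and the commit step. This is a sketch under stated assumptions — the `ProposedFix` shape, the confidence threshold, and the decision labels are all invented for illustration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProposedFix:
    patch: str
    confidence: float            # agent's self-reported confidence, 0..1
    rollback_plan: Optional[str]  # how to revert if it fails in production

def decide(fix: ProposedFix, threshold: float = 0.8) -> str:
    """Hypothetical gate: commit only high-confidence fixes that carry
    a rollback plan; everything else escalates to a human."""
    if fix.rollback_plan is None:
        return "escalate: no rollback plan"       # rollback awareness
    if fix.confidence < threshold:
        return "escalate: low confidence"         # uncertainty surfacing
    return "commit"                               # human stays out of the loop

assert decide(ProposedFix("fix lint rule", 0.95, "git revert")) == "commit"
assert decide(ProposedFix("bump dependency", 0.40, "git revert")) == "escalate: low confidence"
assert decide(ProposedFix("edit ci.yml", 0.95, None)) == "escalate: no rollback plan"
```

The gate is trivial to write and hard to benchmark: scoring it requires knowing when escalation was the *right* answer, which is exactly the judgment pass rates do not capture.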
Source: SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration