Saving SWE-Bench: A Benchmark Mutation Approach for Realistic Agent Evaluation

Published in CAIN, 2026

Recommended citation: Spandan Garg, Benjamin Steenhoek, and Yufan Huang. 2026. Saving SWE-Bench: A Benchmark Mutation Approach for Realistic Agent Evaluation. In IEEE/ACM International Conference on AI Engineering – Software Engineering for AI (CAIN 2026). https://benjijang.com/files/2026-04-12-swebench-mutation.pdf

Current benchmarks for evaluating software engineering agents, such as SWE-Bench Verified, are predominantly derived from GitHub issues and fail to accurately reflect how developers interact with chat-based coding assistants in IDEs. We introduce a novel benchmarking framework that transforms existing formal benchmarks into realistic user queries through systematic analysis of developer interaction patterns with chat-based agents.

  • Our methodology transforms formal GitHub issue descriptions into realistic user-style queries based on telemetry analysis of popular chat-based agent interactions.
  • We apply our framework to SWE-Bench Verified, the TypeScript subset of Multi-SWE-Bench, and a private benchmark (SWE-Bench C#).
  • Our findings reveal that existing benchmarks significantly overestimate agent capabilities: for some models by more than 50% over baseline performance on the public benchmarks, and by roughly 10–16% on our internal benchmark.
  • This work establishes a new paradigm for evaluating interactive chat-based software engineering agents through benchmark mutation techniques.

Also available on arXiv.