SWE-bench Multimodal is an all-new, challenging benchmark containing software issues described with images.
mini-SWE-agent, implemented in just 100 lines of Python, scores up to 74% on SWE-bench Verified.
Introducing CodeClash, our new evaluation where LMs compete head-to-head to write the best codebase!
SWE-bench Verified is a human-filtered subset of 500 instances; use the Agent dropdown to compare LMs running with mini-SWE-agent, or view all agents [Post].
SWE-bench Multilingual features 300 tasks across 9 programming languages [Post].
SWE-bench Lite is a subset curated for less costly evaluation [Post].
SWE-bench Multimodal features issues with visual elements [Post].
Each entry reports % Resolved, the percentage of task instances solved (out of 2294 for Full, 500 for Verified, 300 each for Lite and Multilingual, and 517 for Multimodal).
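The % Resolved metric above is a simple ratio; a minimal sketch in Python (the function name and example numbers are illustrative, not taken from the SWE-bench codebase):

```python
def percent_resolved(num_solved: int, num_instances: int) -> float:
    """Percentage of benchmark instances an agent resolved."""
    return 100.0 * num_solved / num_instances

# e.g. an agent resolving 370 of the 500 SWE-bench Verified instances
print(round(percent_resolved(370, 500), 1))  # → 74.0
```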
We thank the following institutions for their generous support: Open Philanthropy, AWS, Modal, Andreessen Horowitz, OpenAI, and Anthropic.