Introducing SM-100

Just how well do software agents navigate code bases and find real bugs?

Learn more at our AI Engineer World's Fair presentation.

Created by the team at Bismuth. Check us out!

The SM-100 benchmark evaluates software engineering agents based on their ability to identify and remediate bugs within real-world codebases.

It measures agent performance through four metrics:

Needle in Haystack: The number of actual bugs from the SM-100 dataset that the agent finds when given no additional context or hints.
True Positive Rate: The percentage of valid reports out of all bugs reported by the agent.
Remediated Bugs: The number of needle-in-haystack bugs that the agent successfully fixed.
PR Results: The number of SM-100 dataset bugs discovered when the agent reviews the PR or commit that introduced the bug.
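As a rough illustration of how the first three metrics relate to one another, the sketch below scores a single agent run. The `Report` record, its field names, and the `score_run` helper are hypothetical and are not part of the SM-100 tooling; the benchmark's actual judging process is not described here, so the validity and fix flags are simply taken as given.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Report:
    """One bug report emitted by an agent (hypothetical shape)."""
    is_valid: bool                        # report was judged to describe a real bug
    dataset_bug_id: Optional[str] = None  # SM-100 bug it matches, if any
    fix_accepted: bool = False            # the agent's fix for it was judged successful


def score_run(reports: list[Report]) -> dict[str, float]:
    """Compute the three needle-in-haystack metrics for one run (pass@1)."""
    # Count each dataset bug once, even if the agent reported it several times.
    needles = {r.dataset_bug_id for r in reports if r.dataset_bug_id is not None}
    remediated = {
        r.dataset_bug_id
        for r in reports
        if r.dataset_bug_id is not None and r.fix_accepted
    }
    valid = sum(1 for r in reports if r.is_valid)
    # True positive rate: valid reports as a percentage of all reports.
    tpr = 100.0 * valid / len(reports) if reports else 0.0
    return {
        "needle_in_haystack": len(needles),
        "true_positive_rate": tpr,
        "remediated_bugs": len(remediated),
    }


if __name__ == "__main__":
    run = [
        Report(is_valid=True, dataset_bug_id="bug-1", fix_accepted=True),
        Report(is_valid=True, dataset_bug_id="bug-2"),
        Report(is_valid=True),    # a real bug, but not one from the dataset
        Report(is_valid=False),   # a false positive
    ]
    print(score_run(run))
```

Note that a valid report does not have to match a dataset bug: it raises the true positive rate without changing the needle-in-haystack count.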

All results are pass@1: we believe these systems should be single, comprehensive tools, and running them multiple times is not representative of real-world usage.

Interpreting Results

The best agents will have a high needle-in-haystack count while still retaining accuracy in their reports, as shown by a high true positive rate. The needle-in-haystack count demonstrates the ability to navigate and reason across large codebases and to identify a wide variety of bugs. A high true positive rate, meanwhile, indicates that the agent can effectively distinguish meaningful bugs from irrelevant issues. Even if an agent finds many needle-in-haystack bugs, burying them in a flood of false-positive reports diminishes its real-world usefulness: alert fatigue sets in, and reports are disregarded before the real issues are found.
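One way to make the alert-fatigue point concrete is to turn the true positive rate into the number of reports a reader has to triage, on average, for each valid one. The helper below is a hypothetical illustration of that arithmetic, not a metric that SM-100 reports.

```python
def reports_per_valid_finding(true_positive_rate_pct: float) -> float:
    """Average number of reports read per valid report, given a TPR in percent."""
    if not 0 < true_positive_rate_pct <= 100:
        raise ValueError("true positive rate must be in the range (0, 100]")
    return 100.0 / true_positive_rate_pct


if __name__ == "__main__":
    # Illustrative rates only; these are not results from the benchmark.
    for tpr in (50.0, 25.0, 10.0):
        print(f"TPR {tpr:.0f}%: {reports_per_valid_finding(tpr):.0f} reports per valid finding")
```

Under this reading, halving the true positive rate doubles the triage work needed to reach each valid report, whatever the needle-in-haystack count.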

Agent	Run Date	Needle in Haystack	True Positive Rate	Remediated Bugs

PR Review Results

Of the 100 problems in the dataset, 80 were traced to a single, specific "introduction" commit. For these, agents were asked to review the PR or commit diff for issues.

Given the reduced area to examine and the inclusion of nearly all relevant files without the agents having to traverse the codebase to discover them, we see significantly higher bug discovery rates across the board.

Agent	Run Date	PR Review	True Positive Rate
