Xint’s False Positive Rate: Methodology and Purpose
High Profile Stories About LLMs Finding 0days Miss the 800lb Gorilla In the Room: False Positives
Last year, as the security researchers at Theori developed what would become Xint, they won several high-profile competitions, including Zeroday.cloud, DEF CON CTF, and AIxCC, using LLMs, often with little or no human intervention. But it is the launch of Mythos, and now Daybreak, this year that has pushed the potential of LLMs to secure and attack codebases to the forefront of CISO priorities.
These models have demonstrated impressive capabilities in finding bugs that would have been hard even for a human to find. But these news stories raise several critical questions:
What was the total number of vulnerabilities flagged as severe by the model?
How much work went into validating which of those flagged vulns were true positives and which were false?
And what share of these findings ended up being false positives?
This is not an academic exercise. One of the largest challenges for product security teams is sifting through false positives. In traditional SAST tools, false positives can account for over 80% of total findings and often cause burnout in the small teams tasked with discovery, triage, and remediation.
It may seem obvious, but the goal of AppSec in the real world isn’t to flag as many potential bugs as possible - it’s to find the vulnerabilities that attackers would target and that could cost the business.
Xint Has a <25% False Positive Rate
Xint can also flag hundreds or even thousands of potential bugs. But given our team’s experience in practical security, our interface has been designed specifically to 1) automatically rank bugs by severity, 2) provide trigger conditions to expedite validation, and 3) include the potential impact of an exploit, so that teams can focus limited bandwidth only on severe threats.
When we say we have less than a 25% false positive rate, we arrive at this answer through the following methodology:
Out of the hundreds of findings per scan, we select those flagged as moderate- to high-severity, similar to how product security teams would want to focus only on important findings.
Our researchers then attempt to build a proof of concept (POC) for each item, often just by inputting the trigger conditions included in every Xint finding. This usually takes 15-20 minutes of human researcher time per item tested. By comparison, validating a single bug used to take days or even weeks.
We then divide the number of items that had a successful POC by the total number of items tested to arrive at our true positive rate. The false positive rate is the remainder: the share of tested items for which no POC succeeded.
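The calculation described in the steps above can be sketched in a few lines of Python. This is a minimal illustration only: the function name, its parameters, and the sample numbers are hypothetical and are not part of Xint.

```python
def false_positive_rate(tested: int, confirmed: int) -> float:
    """Share of tested findings for which no POC succeeded.

    tested:    number of moderate- to high-severity findings a researcher tried
    confirmed: number of those with a successful POC (true positives)
    """
    if tested <= 0:
        raise ValueError("tested must be positive")
    if not 0 <= confirmed <= tested:
        raise ValueError("confirmed must be between 0 and tested")
    return (tested - confirmed) / tested


# Hypothetical scan: 20 findings tested, 16 confirmed with a POC.
rate = false_positive_rate(tested=20, confirmed=16)
print(f"FP rate: {rate:.0%}")
```

Note that the denominator is the number of findings actually tested, not the total number of findings in the scan.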
For example, as part of DARPA’s post-AIxCC bounty program, we scanned 600k lines of OpenSSL. In less than 6 hours, Xint Code flagged 411 possible vulns, of which 17 were flagged as critical. Our researcher was able to validate (that is, generate a successful POC for) 14 of those 17, leaving 3 FPs, for an FP rate of 18%.
Similarly, after running a scan of Ghostscript as part of our paper recreating Mythos results, out of 21 of the highest-severity bugs flagged by Xint, 4 were FPs (19% FP rate). The Postgres bug that was part of our winning submission at ZDC came from scanning nearly 1 million lines of code in under 12 hours, which resulted in 27 possible vulnerabilities flagged as severe, of which only 6 were FPs (22% FP rate).
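As a quick arithmetic check, the three rates quoted above can be recomputed from the counts given in the text. The dictionary layout and variable names are illustrative; the pooled figure at the end is a simple sum across the three scans, not a number reported elsewhere in this article.

```python
# (findings tested, false positives) as stated in the text.
# OpenSSL: 17 tested, 14 validated, so 3 FPs.
scans = {
    "OpenSSL": (17, 3),
    "Ghostscript": (21, 4),
    "Postgres": (27, 6),
}

# Per-scan FP rate, rounded to the nearest whole percent.
rates = {name: round(100 * fps / tested) for name, (tested, fps) in scans.items()}
for name, pct in rates.items():
    print(f"{name}: {pct}% FP rate")

# Pooled across all three scans (illustrative only).
total_tested = sum(tested for tested, _ in scans.values())
total_fps = sum(fps for _, fps in scans.values())
pooled = total_fps / total_tested
print(f"Pooled: {total_fps}/{total_tested} = {pooled:.0%}")

# Each scan stays under the 25% threshold.
all_under_25 = all(fps / tested < 0.25 for tested, fps in scans.values())
```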
Of course, it’s worth pointing out that our methodology is based on manually triaging only the highest-severity bugs, so it is possible that FPs are not uniformly distributed across severities. We maintain, however, that focusing on the most severe vulnerabilities is consistent with real-world circumstances, where both defenders and attackers are looking for the most severe exploits.
Why 0% FP Is Not the Goal
Obviously having an FP rate that is too high is harmful because it leads to the “boy who cried wolf” syndrome - security teams just stop paying attention to all the alerts.
But at the other extreme, if you only catch bugs that can be reproduced, you're going to miss security issues that ought to be fixed. You don’t want your FP rate to be 0% because that means you’re likely missing true positives and significant bugs. Both near-misses and "safe in configuration X but not configuration Y" findings are things that one ought to be informed of, even if they are false positives in a strict sense.
Exploits or POCs can be useful for prioritization in lower-volume scenarios, but when bug volume is high (dozens or hundreds), using exploitability as a prioritizer effectively becomes a harsh filter. Defenders primarily need convincing evidence of impact and remediation help, not necessarily a full exploit. As a result, at enterprise scale it is better to have solid confidence scores and to triage based on estimated severity and impact.
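The difference between using exploitability as a gate and ranking by estimated impact can be illustrated with a small sketch. Everything here is hypothetical: the `Finding` fields, the sample findings, and the severity-times-confidence score are assumptions made for illustration, not how Xint scores or ranks findings.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    title: str
    severity: float    # hypothetical estimated impact, 0-10
    confidence: float  # hypothetical confidence score, 0-1
    has_poc: bool      # whether a working POC exists yet


findings = [
    Finding("heap overflow in parser", 9.0, 0.8, False),
    Finding("info leak in debug path", 4.0, 0.9, True),
    Finding("auth bypass in configuration Y only", 8.0, 0.6, False),
]

# Exploitability as a gate: anything without a POC is dropped.
poc_only = [f for f in findings if f.has_poc]

# Ranking instead of filtering: every finding stays visible,
# ordered by severity weighted by confidence (an assumed score).
ranked = sorted(findings, key=lambda f: f.severity * f.confidence, reverse=True)

print("POC gate keeps:", [f.title for f in poc_only])
print("Ranked order:  ", [f.title for f in ranked])
```

In this toy data, the POC gate keeps only the lowest-severity finding, while the ranked list puts the two high-severity findings without a POC at the top.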
The Right Approach to False Positives in the Era of AI Vuln Discovery
In this new era of high-volume, AI-driven vuln discovery, teams should focus more on reliable static analysis, impact estimation, and low-FP scaffolding instead of leaning heavily on "can we exploit it?" as the decider. Exploit skills remain valuable: Xint values and publishes exploits, sharing technical deep-dives on heap grooming, RCE techniques, and more. But that work is for demonstrating impact, not for gatekeeping whether a vuln is "real" or worth fixing. Over-relying on exploits for sorting would slow things down and miss bugs.