    AI for Security Vulnerability Research

    The Frontier Isn’t the Model: Why ‘Good Enough’ Reasoning + Scaffolding Is More Important

    In this exclusive report, Xint researchers compare Mythos's publicly disclosed results with what broadly available models can accomplish using advanced scaffolding.
    Apr 16, 2026
    Contents
    Focusing on frontier capabilities misses the point
    The factors to decide which model is best for each task
    Summary

    You can read our full Mythos report here

    Focusing on frontier capabilities misses the point

    While Anthropic is generating attention for Mythos’s ability to find 0days, the truth is that AI surpassed traditional SAST tools, and even human researchers, at bug discovery about a year ago. In August 2025, Xint’s autonomous AI auditor found several high-severity 0day exploits with no human intervention in a single pass over millions of lines of code. That same technology then won Wiz’s ZeroDay Cloud Challenge, finding high-severity vulnerabilities in critical open source projects like MariaDB, PostgreSQL, and Redis. 

    LLMs have been great at finding security-relevant bugs. From the Sockpuppet blog: 

    You can’t design a better problem for an LLM agent than exploitation research. Before you feed it a single token of context, a frontier LLM already encodes supernatural amounts of correlation across vast bodies of source code…Vulnerabilities are found by pattern-matching bug classes and constraint-solving for reachability and exploitability. Precisely the implicit search problems that LLMs are most gifted at solving. Exploit outcomes are straightforwardly testable success/failure trials. An agent never gets bored and will search forever if you tell it to.

    But vulnerability discovery is only one part of a larger process that includes validation, severity assessment, and remediation. Each of those steps has its own dozens or even hundreds of tasks and subtasks. 

    We've been building LLM-powered security tooling since 2024, and it has never been the case that a single model (or even a single provider) is simultaneously best at all cybersecurity tasks.

    The factors to decide which model is best for each task 

    When we built the scaffolding for the thousands of parallel agents in the Xint platform, we designed it so that each task can use the best model for the job. This is why we maintain extensive evaluations for each piece of Xint Code's pipeline. When a new model becomes available, we make a data-driven decision about where it fits in (if it fits at all).
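    The idea of routing each pipeline task to its current champion model can be sketched as a small eval-driven lookup. Everything here is illustrative: the task names, model names, and scores are hypothetical, not Xint's actual evaluation data.

    ```python
    # Hypothetical sketch: pick a champion model per pipeline task from
    # offline evaluation scores. All names and numbers are made up.
    EVAL_SCORES = {
        "discovery":  {"model_a": 0.71, "model_b": 0.64, "model_c": 0.69},
        "validation": {"model_a": 0.58, "model_b": 0.77, "model_c": 0.60},
        "triage":     {"model_a": 0.66, "model_b": 0.62, "model_c": 0.74},
    }

    def pick_champion(task: str) -> str:
        """Return the model with the best eval score for a given task."""
        scores = EVAL_SCORES[task]
        return max(scores, key=scores.get)

    def route_pipeline() -> dict:
        """Build a task -> model routing table from the eval results."""
        return {task: pick_champion(task) for task in EVAL_SCORES}
    ```

    Note that no single model wins every row here, which is exactly the point: when a new model ships, you re-run the evals and only the tasks where it beats the incumbent get re-routed.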

    These are the factors our team of researchers uses to evaluate whether a new model should replace our current champion for a specific task: 

    Coverage of vulnerabilities it can discover 

    Xint can find the traditional class of insecure bug patterns (memory safety, weak crypto usage, insecure configurations, etc.), but as hackers ourselves, we know that the worst exploits often come not from breaking the rules but from abusing them - what we call “business logic vulnerabilities.” 

    The ratio of false positives or trivial findings 

    If a model generates a high volume of false positives or findings with no real impact, it turns a needle in a haystack (a true vulnerability buried in a lot of code) into a needle in a pile of staples (a true vulnerability buried in a lot of things that look like vulnerabilities). That demands far more cognitive bandwidth from project maintainers, because validating a finding is more time-intensive than discovering it. 

    For every new model release, we check how well it finds real bugs without overwhelming reviewers with false positives. 

    Fit into larger workflows 

    How much code harnessing is required before testing? How easily can a person determine whether a vulnerability is actually exploitable under real-world circumstances? What is the impact if a bug is exploited - is it just a PoC, or can it lead to privilege escalation and RCE? How clear is the bug's root-cause analysis? Are the reproduction steps easy to follow?

    This is why it’s so important to have researchers with practical offensive security experience who understand how the outputs will actually be used by security teams. 

    Cost 

    From a results perspective, running a cheap model ten times and taking the best result can outperform - and cost less than - a single run of the most expensive model.
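    The best-of-N argument is just arithmetic over independent trials. The prices and per-run success probabilities below are invented for illustration; the point is only the shape of the trade-off.

    ```python
    # Hypothetical sketch of the best-of-N cost argument. Prices and
    # success probabilities are made up for illustration.
    def best_of_n_success(p_single: float, n: int) -> float:
        """Probability that at least one of n independent runs succeeds."""
        return 1 - (1 - p_single) ** n

    # Cheap model: $0.50/run, 30% chance of finding the bug per run.
    cheap_cost = 0.50 * 10                          # $5 for 10 runs
    cheap_success = best_of_n_success(0.30, 10)     # ~0.97

    # Expensive model: $20/run, 80% chance in a single run.
    pricey_cost = 20.0
    pricey_success = best_of_n_success(0.80, 1)     # 0.8
    ```

    Under these made-up numbers, ten cheap runs cost a quarter as much as one expensive run while succeeding more often - which is why per-task evals should measure the whole strategy, not just single-shot model quality.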

    Additionally, the larger the codebase to be tested, the less predictable pricing becomes, especially with heavier reasoning models. A linear increase in the number of lines tested can lead to an exponential increase in tokens burned, with no way to know beforehand.

    Xint delivers predictable pricing to our customers, so they know that testing 2 million lines of code will cost twice as much as testing 1 million (or even less, with volume discounts on Xint).  
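    A linear pricing model with a volume discount can be sketched in a few lines. The per-million rate and the discount tier below are hypothetical placeholders, not Xint's actual price list.

    ```python
    # Hypothetical sketch of predictable, linear pricing with a volume
    # discount. The rate and discount tier are illustrative only.
    def quote(lines_of_code: int, rate_per_million: float = 1000.0) -> float:
        """Price scales linearly with code size, discounted at volume."""
        millions = lines_of_code / 1_000_000
        price = millions * rate_per_million
        if millions >= 2:      # illustrative volume-discount threshold
            price *= 0.9
        return price

    # 2M lines never costs more than 2x the 1M-line price.
    assert quote(2_000_000) <= 2 * quote(1_000_000)
    ```

    The contrast with token-based billing is that the quote depends only on line count, which the customer knows before the run starts.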

    Summary

    Sometimes, models show new capabilities that make us rethink and redesign parts of our pipeline. Often, this means the model has passed some threshold that suddenly makes an old research idea feasible. This generally means more scaffolding, not less!

    With Xint Code, we have harnessed every generation of LLMs over the past two years, and with each new release our system gets better. The key: Xint Code is the best scaffolding for AI vulnerability research, so we get great results from the models available today. Customers can get off the treadmill of migrating workflows every time a new model ships, trusting that they will get the same or better results on Xint - finding all the bugs the latest frontier model can, and often more. 

