AI Wrapper vs. AI Native

Everyone is claiming AI in AppSec, but there are meaningful differences in how AI is used, and those differences lead to fundamentally different levels of exposure.

Hector Leano

Jun 10, 2026

Contents

AI Wrappers
Frontier Labs
AI Native
How to Evaluate the Right AI AppSec Approach for Your Organization

AI isn’t so much replacing human attackers as it is making them inexhaustible. We are observing this firsthand: incident logs show attackers using AI to probe every corner and test every variation in a way no single human, or even a team of humans, could have done before.

Defenders know they have to do the same, but when they attend conferences or check vendor websites, every vendor claims to be “AI”.

But the AI label hides meaningful differences in approach and results. Customers might think they are protected against AI-augmented attackers when they are really just getting better protection from yesterday’s attack methods.

AI Wrappers

Legacy SAST vendors are using AI to validate findings before presenting the output. While this has reduced false positives, the core is still the same rigid, rules-based SAST engine looking for insecure code patterns and well-documented bugs. As a result, these reports present weaknesses rather than true vulnerabilities, and the scans miss new attack vectors and whole bug classes (especially business logic vulnerabilities).

Image: a sheep wearing a wolf costume (courtesy of Factory43)

Frontier Labs

Vulnerability discovery has emerged as one of the strongest use cases for LLMs because of their ability to find connections across vast volumes of data. As a result, they can find bugs of the quality that a human pentester can, but at massive scale (testing hundreds of scenarios or scanning every line of a codebase with more than 1 million lines of code).

Through public disclosures, as well as through Xint customers who were granted access to Mythos (and other SOTA models), we have validated that each new model release has greater vulnerability discovery capabilities.

At the same time, we have found two major shortcomings that, in real-world conditions, prevent out-of-the-box SOTA models from meeting the AppSec requirements of product security teams:

Price: Linear increases in application endpoints or lines of code can lead to unpredictable and exponential increases in token costs. We estimate that what would normally be a $50k-$100k human pentest would end up costing $500k-$1 million in token costs, before even including human triage costs.
Harnessing difficulty: Frontier labs provide tooling and prompts to help their customers use their SOTA models better, which means the models themselves don’t magically work out of the box. These models find bugs but require entire systems around them to be useful for prodsec. In practice, naive scaffolding means more false positives and more resources spent on validation and severity assessment.
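To make the pricing concern concrete, here is a minimal Python sketch of how token spend can outpace code growth. It is an illustration under stated assumptions, not a model of any vendor's pricing: the class name, the tokens-per-line figure, the price per million tokens, and the "every pair of files" harness are all hypothetical placeholders. The pairwise harness grows roughly quadratically, which is one simple way a naive setup can turn linear code growth into much faster cost growth.

```python
from dataclasses import dataclass


@dataclass
class ScanCostModel:
    """Hypothetical cost model; every default is a placeholder, not a measured value."""

    tokens_per_line: float = 12.0            # assumed average tokens per line of code
    price_per_million_tokens: float = 10.0   # assumed blended price in USD
    passes_per_file: int = 3                 # assumed number of times each file is re-read

    def single_file_cost(self, lines_of_code: int) -> float:
        """Each file is analyzed in isolation, so cost grows linearly with LoC."""
        tokens = lines_of_code * self.tokens_per_line * self.passes_per_file
        return tokens / 1_000_000 * self.price_per_million_tokens

    def pairwise_cost(self, lines_of_code: int, lines_per_file: int = 200) -> float:
        """A naive harness loads every pair of files together, so cost grows
        roughly with the square of the number of files."""
        files = max(1, lines_of_code // lines_per_file)
        pairs = files * (files - 1) // 2
        tokens = pairs * 2 * lines_per_file * self.tokens_per_line
        return tokens / 1_000_000 * self.price_per_million_tokens


if __name__ == "__main__":
    model = ScanCostModel()
    for loc in (100_000, 200_000, 400_000):
        print(loc, model.single_file_cost(loc), model.pairwise_cost(loc))
```

Doubling the lines of code doubles the isolated-file cost but roughly quadruples the pairwise cost in this sketch, which is why the way a harness selects context matters more for the bill than the raw size of the codebase.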

We provide more details here about why SOTA models do not deliver on what product security teams need from an AppSec platform, as well as a detailed breakdown comparing Mythos findings versus what Xint found using generally available models.

AI Native

With AI native AppSec, LLMs are the engine actually finding the bugs (unlike SAST, where LLMs only validate the results found by a rules-based engine). So far, we have found that the larger models from the frontier labs are able to make more of the connections needed to find bugs than proprietary, specialized models built in-house. Other AI native AppSec tools in this space take the same approach.

So how do you differentiate among the AI native platforms, like Xint, if they are all using the same models from the same frontier labs for bug discovery?

It comes down to the quality of findings:

Does it find all bug categories relevant to your code base?
Does its analysis identify complex logic bugs spanning your code base?
Are the findings reliable or do they require substantial work to validate?
Can you easily tell which findings need to be prioritized versus which ones can wait?

The scaffolding and orchestration is the secret sauce that comes from being the best hackers in the world. It is the structured system that:

Decides where to look
Validates that findings are real and exploitable
Eliminates false positives
Delivers actionable remediation
Provides predictability
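The responsibilities listed above can be sketched as a small orchestration loop. This is a simplified illustration and not Xint's implementation: the `Finding` fields, the `run_pipeline` function, and the callables passed into it (`select`, `discover`, `validate`, `remediate`) are hypothetical names, with the model call hidden behind `discover`.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Finding:
    location: str
    description: str
    severity: int = 0        # hypothetical scale: higher means more urgent
    validated: bool = False
    remediation: str = ""


def run_pipeline(
    targets: Iterable[str],
    select: Callable[[str], bool],
    discover: Callable[[str], List[Finding]],
    validate: Callable[[Finding], bool],
    remediate: Callable[[Finding], str],
    max_targets: int = 100,
) -> List[Finding]:
    # Decide where to look. The cap bounds the amount of model work,
    # which is what keeps the run predictable.
    chosen = [t for t in targets if select(t)][:max_targets]

    confirmed: List[Finding] = []
    for target in chosen:
        for finding in discover(target):
            # Keep only findings the validator confirms; anything else is
            # dropped as a false positive.
            if not validate(finding):
                continue
            finding.validated = True
            # Attach remediation guidance to each confirmed finding.
            finding.remediation = remediate(finding)
            confirmed.append(finding)

    # Return the most severe findings first so prioritization is obvious.
    return sorted(confirmed, key=lambda f: f.severity, reverse=True)
```

In a real system each callable would be far more involved (for example, validation might mean actually reproducing an exploit), but the shape shows why the same underlying model can give very different results depending on the system wrapped around it.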

How to Evaluate the Right AI AppSec Approach for Your Organization

When comparing the different approaches, customers need to ask these questions to figure out which approach best fits their needs:

What range of bugs can it find, including business logic vulnerabilities?
What is the signal-to-noise ratio? Too many false positives and trivial findings can make your entire security posture worse by consuming the team’s bandwidth.
How does this solution integrate into the full triage workflow? How clear is the bug's root-cause analysis? Are the reproduction steps easy to follow?
Can I predict the cost?
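One way to put numbers on the signal-to-noise question during a trial is to triage a sample of findings by hand and compute a couple of simple ratios. The sketch below is only an illustration: the `TriageResult` fields, the function names, and the hours-per-finding figure are assumptions introduced here, not metrics defined by any vendor.

```python
from dataclasses import dataclass


@dataclass
class TriageResult:
    """Counts from manually triaging a sample of a tool's findings (hypothetical schema)."""

    true_positives: int   # real, exploitable issues worth fixing
    false_positives: int  # findings that turned out not to be real
    trivial: int          # real but not worth the team's time


def signal_ratio(result: TriageResult) -> float:
    """Share of findings that were worth the team's attention."""
    total = result.true_positives + result.false_positives + result.trivial
    return result.true_positives / total if total else 0.0


def wasted_hours(result: TriageResult, hours_per_finding: float) -> float:
    """Triage time spent on findings that did not matter."""
    return (result.false_positives + result.trivial) * hours_per_finding


if __name__ == "__main__":
    sample = TriageResult(true_positives=30, false_positives=50, trivial=20)
    print(signal_ratio(sample))
    print(wasted_hours(sample, hours_per_finding=1.5))
```

Running the same calculation on each tool under evaluation, against the same codebase, gives a like-for-like comparison of how much team bandwidth each approach consumes.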
