System, Not Model: Why Off-the-Shelf LLMs Don’t Replace a Pen Test
Every Xint conversation in the last month has included some version of the same question. Why would I pay you when Claude Enterprise now bundles security scanning? Why would I pay you when I could point an open-source agent at Opus myself? Both are reasonable questions. Both are based on an assumption that does not survive contact with the work.
The assumption is that the model is the product. Point a smart enough model at your code and it finds the vulnerabilities. If the model is capable and the price is low or free, the exercise is done.
That is the Mythos headline compressed to its simplest form, and it is the version that reaches buyers through press coverage and competitive decks. It is also wrong in a way that costs real money if a security program is built on it.
This post is about what the buyer is actually purchasing when they pay for a vulnerability discovery platform, and why the model is the cheapest line item on the bill.
What the model actually does
Foundation model capability at vulnerability reasoning has been real since Sonnet 3.5 shipped in mid-2024. Mythos made the headline; the underlying capability has been improving quietly for eighteen months. A current-generation model, pointed at a well-scoped function, can identify pattern-level vulnerability classes, reason about control flow, and often suggest plausible exploit paths.
That is useful work. It is also a small fraction of what a pen test actually produces.
What you get when you hand a file to a capable LLM and ask it to find vulnerabilities: pattern recognition, plausible weakness classes, speculative exploit reasoning, usually with some false positives and sometimes with a real finding buried in the output.
What you do not get: a decision about which file to look at next, a validated answer to whether the finding is reachable in your actual system, awareness of whether the issue is already covered by an existing mitigation, a confirmed exploit that actually triggers against your code, or a reproduction a developer can run.
Every one of those is a different problem. Every one requires different infrastructure. The common failure mode in this category is collapsing a pipeline into a prompt and calling it a product.
The work the model doesn’t do
Break the pen test pipeline into its actual stages. Five of them. Each one requires work the model does not do on its own.
File selection. Production codebases run from five million to a hundred million lines. A model can reason about a hundred thousand lines of context at a time, and every call costs real money. Deciding which tenth of a percent of the codebase is worth examining is not a model problem. It is a targeting problem. Attack surface identification, entry point analysis, dependency graph reasoning, awareness of what has already been flagged. None of that happens when you point an agent at a repo.
Context assembly. Even on the right file, a model needs to see how that file fits into the rest of the system. Where does this function get called from? What are the trust boundaries between callers? What input validation is already happening upstream? A single-file prompt produces single-file reasoning, which is exactly what SAST already does poorly.
Validation and severity assessment. A current-generation model, pointed at production code, will return hundreds or thousands of candidate findings per run. Some will be real, but even among the true positives many will have little to no impact if exploited. Most will be weaknesses, false positives, duplicates, or issues that are technically valid but already mitigated. Separating signal from noise is where pen testers spend half their time. Automating that separation is where most AI security products fall apart, because the model that generated the noise is not the right tool to filter it.
Exploit construction and confirmation. A finding that says “this function looks vulnerable” is a weakness. A finding that says “here is the input that triggers it, here is the response that confirms it” is a vulnerability. The model can draft the exploit. Someone has to run it against a real target and verify that it actually works. That someone is either a pen tester or a validation pipeline built around the model. It is not the model.
Reporting. A CISO needs structured output. Severity, reproduction steps, affected versions, suggested patch, SOC 2 evidence trail. A raw LLM response is prose. Converting prose into a report an auditor accepts is its own product.
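What "structured output" means in practice can be sketched as a record. A minimal hypothetical finding follows; the field names, identifiers, and values are illustrative, not Xint's actual schema:

```python
import json

# Hypothetical finding record: the structured artifact an auditor or
# ticketing system can consume, versus the raw prose an LLM returns.
# Every value below is invented for illustration.
finding = {
    "id": "FINDING-0142",
    "severity": "high",
    "weakness_class": "CWE-89",  # SQL injection, as an example class
    "affected_versions": [">=2.1.0", "<2.4.3"],
    "reproduction": {
        "endpoint": "/api/v1/search",
        "payload": "q=' OR '1'='1",
        "confirming_response": "HTTP 200 with full table contents",
    },
    "suggested_patch": "Use parameterized queries in the search handler",
    "evidence_trail": ["scan-run-884", "triage-pass-2", "exploit-confirmed"],
}

print(json.dumps(finding, indent=2))
```

Each field maps to a line in the stage list: severity and weakness class from triage, reproduction from exploit confirmation, the evidence trail for the SOC 2 audit. None of those fields fall out of a chat transcript on their own.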
Five stages. One of them is the model. The other four are the work.
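Laid out as code, the division of labor is easy to see. The sketch below is illustrative only: every function name, heuristic, and number is hypothetical rather than Xint's implementation, and the model call is deliberately a stub. The point is structural: one stage out of five is the model.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    file: str
    description: str
    confirmed: bool = False

def select_targets(files, score):
    """Stage 1: targeting. Keep only the top ~0.1% of files by an
    attack-surface score (entry points, parsers, auth code)."""
    ranked = sorted(files, key=score, reverse=True)
    return ranked[: max(1, len(files) // 1000)]

def assemble_context(target, call_graph):
    """Stage 2: context assembly. Pull the callers of the target so the
    model sees trust boundaries, not a single file in isolation."""
    return call_graph.get(target, [])

def model_scan(target, context):
    """Stage 3: the model call -- the only stage that is the model.
    A real system would send target + context to an LLM here."""
    return [Finding(target, f"candidate issue ({len(context)} callers seen)")]

def triage(candidates, already_mitigated):
    """Stage 4a: validation. Drop candidates covered by existing
    mitigations (real triage also dedupes and checks reachability)."""
    return [c for c in candidates if c.file not in already_mitigated]

def confirm(finding):
    """Stage 4b: exploit confirmation. Run the drafted exploit against a
    real target; keep the finding only if it actually triggers."""
    finding.confirmed = True  # placeholder for a real execution harness
    return finding

# Toy run: 2,000 files, one hot spot, nothing pre-mitigated.
files = [f"src/mod_{i}.py" for i in range(2000)]
score = lambda f: 10 if f == "src/mod_7.py" else 0
call_graph = {"src/mod_7.py": ["src/api.py", "src/cli.py"]}

targets = select_targets(files, score)
findings = []
for t in targets:
    ctx = assemble_context(t, call_graph)
    findings += triage(model_scan(t, ctx), already_mitigated=set())
report = [confirm(f) for f in findings]
print(len(targets), report[0].file)
```

Stage 5, reporting, would serialize `report` into the structured output described above. Swap a better model into `model_scan` and nothing else changes, which is the whole argument.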
The Mythos receipt
Mythos is the cleanest existence proof that system matters more than model. Three points.
Anthropic’s Frontier Red Team built the scaffolding around Mythos. Twenty-one researchers selected the target codebases, designed the parallel scanning strategy, contracted professional human triagers, and managed responsible disclosure across thousands of findings. The model ran inside that structure. Without the structure, the output is unreadable.
Xint ran against the same codebases using off-the-shelf Opus and GPT, not Mythos, and reproduced every flagship finding plus twelve additional zero-days that Anthropic had not disclosed. Same model family. Different system. Comparable output.
The lesson is not that Xint’s models are better. Our models are the same ones anyone can rent. The lesson is that when the system around the model does the work, the model improvement is additive, not the point. When Mythos-class capabilities reach general availability, buyers running on Xint get the upgrade for free. Buyers running their own agent have to rebuild the pipeline around the new model and discover, after the rebuild, what the rebuild missed.
We wrote more about the architecture in our Mythos post. The short version is that the system is the receipt, not the model.
What the buyer is actually purchasing
Six things. Priced against the pen test budget, not the dev tooling budget.
Targeting intelligence. Xint picks the tenth of a percent of your codebase worth examining. You do not staff a team to do it, and you do not pay for model calls on the 99.9% that doesn’t matter.
Validated findings. Every output has been run through a triage pipeline. What you see is what a pen tester would have flagged after a week of manual review, without the week.
Reproduction steps. Every finding includes the inputs that trigger it and the response that confirms it. Your developer can validate it without a security engineer in the room.
Impact report. Along with reproduction steps, every finding includes a human-readable account of what an exploit would give an attacker (remote code execution, a crash, denial of service) so developers can prioritize the vulnerabilities with real impact.
Suggested fixes. Every finding includes a proposed patch. Your developer can close the loop without a security engineer in the room.
Predictable pricing. Xint delivers the most cost-effective system at scale because we solve the token-burn problem. Pointing a model at source code without proper scaffolding burns tokens far faster than the code under review grows: every chunk has to ship with its cross-file context on every call, and without targeting you pay that cost across the entire codebase.
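A back-of-envelope sketch makes the pricing point concrete. Every number here is an assumption chosen for illustration (tokens per line, price per million tokens, chunk and context sizes), not Xint's actual rates:

```python
# Illustrative assumptions: ~10 tokens per line of code, $15 per million
# input tokens, 500-line chunks, 2,000 lines of cross-file context
# shipped with each chunk on every call.
TOKENS_PER_LINE = 10
PRICE_PER_MTOK = 15.0

def scan_cost(loc, chunk=500, context_lines=2000):
    """Cost of scanning `loc` lines when each chunk ships with context."""
    chunks = loc / chunk
    tokens = chunks * (chunk + context_lines) * TOKENS_PER_LINE
    return tokens * PRICE_PER_MTOK / 1_000_000

repo_loc = 20_000_000                    # a mid-sized production codebase
naive = scan_cost(repo_loc)              # point the agent at everything
targeted = scan_cost(repo_loc * 0.001)   # scan only the targeted ~0.1%

print(f"naive: ${naive:,.0f} per pass, targeted: ${targeted:,.0f} per pass")
```

Under these assumptions the gap per scan pass is roughly the targeting ratio itself, three orders of magnitude, before counting the retries and re-reads an unscaffolded agent performs. Paying for model calls only on the slice worth examining is where the cost control comes from.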
Continuous model improvement. When the next frontier model ships, the system absorbs it. You do not rebuild your pipeline. You do not retrain your team. You do not renegotiate your contract.
That last one is the sleeper argument and it matters more than it sounds. Every buyer who bets on a specific model’s API today is building a workflow they will have to rebuild when the model improves. Xint customers get model improvements as continuous upgrades to the same product. That is the difference between buying a platform and renting a capability.
The question the buyer should actually be asking
Reframe. The question “why would I pay Xint when the model is cheaper” treats the model as the product. The question a CISO should be asking is different, and sharper.
What would it cost my team to build the scaffolding around a public model that would match what a platform vendor has already built?
Answer: a security research team, a targeting layer, a triage pipeline, a validation loop, a reporting system, ongoing model evaluation, and the operational discipline to keep it running on every release. The model is the cheapest input in the list.
So the next time a vendor demos an AI security tool, ask them what happens when the model gets better next quarter. The answers will separate the products from the prompts.
The model is the input. The system is the product.
Mythos collapsed a complicated industry question into a simple-sounding headline. AI can find vulnerabilities. That is true. It is also incomplete in a way that matters.
Finding vulnerabilities at production scale, on code no one has flagged, with validated exploits and working patches, is not a model capability. It is a system capability that uses a model as one input.
The model will keep getting better. Xint Code was built so that you do not have to.