
    What to Ask Every AI PenTest Vendor Before You Buy

    These are the 8 questions that will tell you whether a vendor is selling a pen test alternative, a faster SAST tool, or a demo that doesn’t survive production
    May 06, 2026

    The AI security category has fragmented faster than any part of AppSec in the last decade. Two new entrants landed last month alone (Artemis with a $70M Series A, Armadin with $189.9M from Kevin Mandia). XBOW, Aikido, Escape, Corgea, Novee, and Aisle are all selling “AI-powered” security testing with different meanings of those words. Claude Code Security is now bundled into Claude Enterprise. Every vendor in the category says their tool finds vulnerabilities. Most of them are finding different things and calling the result by the same name.

    A security team running an evaluation today cannot rely on category labels, analyst reports (too slow to keep up), or vendor comparison tables (written by the vendors). What they need is a short list of questions sharp enough to separate the products from the prompts.

    This is that list. Eight questions. The answers will tell you whether the vendor is selling a pen test alternative, a faster SAST tool, or a demo that doesn’t survive production.

    One disclosure up front. This post was written by Xint, which is one of the vendors these questions apply to. We tried to write questions we can answer well ourselves. Read it with that in mind. The questions are still the right ones.

    1. Can the tool find a real vulnerability on code it has no prior signal about?

    Most AI security demos use CVE-labeled commits, benchmark suites, or codebases the model has already seen in training. That is surfacing known findings, not discovering new ones. The tool is recognizing a pattern it was shown, not finding something new. Those are two different products.

    Real discovery means the tool can be pointed at a codebase where no one has flagged anything and return vulnerabilities that survive expert validation. That is the test that matters, because production code is the environment where the tool will actually be used, and production code does not come with a CVE reference attached.

    Good answers look like: the vendor can point to a specific codebase, disclose the findings, and show evidence of independent validation. Ideally the findings have been reported, triaged, and patched by the upstream maintainer. Mythos qualifies. Xint’s Mythos reproduction qualifies. Any vendor that can name specific zero-day disclosures in real projects qualifies.

    Bad answers look like: demos on OWASP Benchmark, the Juliet Test Suite, DVWA, or any other intentionally vulnerable training application. Findings that match known CVEs without the vendor being able to articulate what they added beyond pattern detection. “Internal benchmarks” with no public evidence.

    This is question one because it separates the category at the root. Every other question assumes the vendor is actually finding vulnerabilities. If the answer to this one is weak, the rest of the evaluation does not matter.

    2. Does every finding include reproduction steps?

    A finding without reproduction steps is a weakness, not a vulnerability. The reviewer has to validate exploitability manually, which is the exact labor the tool was supposed to eliminate. A development team that receives weakness reports kicks them back, and the security team burns its time triaging findings that may or may not be real.

    Reproduction steps are the transition from “this looks unsafe” to “this is exploitable.” They include the inputs that trigger the issue, the response that confirms it, and where in the code path the vulnerability lives. With reproduction, a developer can validate the finding in minutes. Without it, the finding is an invitation to argue.

    Good answers look like: every finding includes input, expected behavior, and observed behavior. Often a proof-of-concept script, a curl command, or a test case that triggers the issue. The output should let a developer confirm the finding without a security engineer in the room.
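    For concreteness, here is the shape a reproduction script might take. Everything in it is invented for illustration (the host, the accounts, the invoice ID); the point is that a developer can run it and watch the vulnerability fire.

        # Hypothetical reproduction script for an invented IDOR finding. The
        # host, credentials, and invoice ID are illustrative, not from any
        # real report.
        import requests

        BASE = "https://staging.example.com"

        def reproduce():
            # Authenticate as a low-privilege user (hypothetical test account).
            session = requests.Session()
            session.post(f"{BASE}/login", data={"user": "alice", "pass": "..."})

            # Input that triggers the issue: request an invoice that belongs
            # to a different user.
            resp = session.get(f"{BASE}/api/invoices/1042")

            # Expected behavior: 403 Forbidden. Observed behavior: 200 with
            # the other user's data, which is what confirms exploitability.
            assert resp.status_code == 200 and "bob@example.com" in resp.text
            print("Reproduced: invoice 1042 readable by an unauthorized user")

        if __name__ == "__main__":
            reproduce()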

    Bad answers look like: “we give you the file and line number and explain the weakness pattern.” That is a SAST output. It is also what most AI security tools deliver today, which is why most of them end up as expensive linters.

    3. What is the false positive rate on production code, not benchmarks?

    Benchmarks are trivial to game. Most of the answers are already in the training data, and a tool that reports 95% accuracy on OWASP Benchmark says very little about how it will perform on your code. Production is where the tool meets the customer.

    A vendor that will not quote a production false positive rate is telling you something. Either they haven’t measured it, or the number is worse than the marketing implies. Either way, that number will be your problem to discover during the PoC.

    Good answers look like: a specific percentage, tied to a defined methodology, measured against real customer codebases or publicly auditable open-source scans. The vendor should be able to tell you how they count false positives, what counts as a true positive, and what their target is. Xint’s internal target is zero false positives, treated as a design bar rather than an aspiration.
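    To make the methodology point concrete, here is the arithmetic in miniature. The counts below are invented; the definitions are what you should ask the vendor to pin down.

        # Toy false-positive arithmetic; the counts are invented. The questions
        # that matter: what counts as "validated", and is the denominator every
        # finding the tool surfaced, or only the ones a human got around to?
        reported = 120                   # findings the tool surfaced to reviewers
        confirmed = 114                  # reproduced and validated as exploitable
        rejected = reported - confirmed  # could not be reproduced

        false_positive_rate = rejected / reported   # 6 / 120 = 5.0%
        precision = confirmed / reported            # 114 / 120 = 95.0%

        print(f"FP rate: {false_positive_rate:.1%}, precision: {precision:.1%}")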

    Bad answers look like: “our findings are high-fidelity” or “we have a low false positive rate” with no number attached. Equally bad: a number without methodology. “Three percent” is meaningful if you know how it was measured and useless if you don’t.

    4. What does the pipeline around the model do, and what does the model do?

    Every AI security vendor runs some pipeline around a foundation model. The question is how much of the hard work happens in the scaffolding versus being assumed to happen inside a prompt. A vendor that can’t describe the pipeline’s stages hasn’t built one.

    The model is one input in a vulnerability discovery system. The other inputs are targeting (which code to examine), context assembly (how that code connects to the rest of the system), triage (filtering the model’s output), validation (confirming exploitability), and reporting (turning prose into auditor-grade output). A product that collapses those into a single prompt is a demo, not a pipeline.

    Good answers look like: a clear decomposition of the pipeline’s stages, with specific answers about what each one does and how it’s tested. The vendor should be able to describe their targeting layer, their context assembly strategy, their triage logic, and their validation process. They should have evaluations for each stage.
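    As a sketch, that decomposition can be made literal, with every stage stubbed out. None of these functions come from a real product; the shape is the point, and a credible vendor can describe what fills each stub.

        # Hypothetical pipeline skeleton mirroring the stages above. All the
        # stage implementations are stubs invented for illustration.
        from dataclasses import dataclass

        @dataclass
        class Candidate:
            file: str
            line: int
            hypothesis: str  # the model's claim about the flaw

        def select_targets(repo: str) -> list[str]:
            """Targeting: decide which code deserves the model's attention."""
            return []     # stub

        def assemble_context(repo: str, target: str) -> str:
            """Context assembly: how this code connects to the rest of the system."""
            return ""     # stub

        def query_model(context: str) -> list[Candidate]:
            """The model is one input to the system, not the system."""
            return []     # stub

        def triage(raw: list[Candidate]) -> list[Candidate]:
            """Triage: filter the model's output before anyone sees it."""
            return raw    # stub

        def validate(candidate: Candidate) -> bool:
            """Validation: confirm exploitability, not just plausibility."""
            return False  # stub

        def run_pipeline(repo: str) -> list[Candidate]:
            confirmed = []
            for target in select_targets(repo):
                context = assemble_context(repo, target)
                confirmed += [c for c in triage(query_model(context)) if validate(c)]
            return confirmed  # reporting turns this into auditor-grade output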

    Bad answers look like: “we use the best available LLMs” or “our proprietary AI.” Those phrases are a tell. They mean the vendor is either hiding the architecture or doesn’t have one to describe.

    5. Is the vendor claiming the tool is “fully autonomous,” and what does that claim actually mean?

    “Fully autonomous” is the hottest phrase in the AI security category, and almost none of the vendors using it can defend what it means operationally. The claim usually breaks down in one of three ways. The scan needs a human to scope what gets analyzed. The findings need a human to validate before they’re actionable. The remediation needs a human to decide which fixes ship. Whichever way it breaks, the tool isn’t autonomous; it’s a stage of a pipeline that still requires a security team.

    The honest version of the category is that AI has eliminated enormous amounts of human effort inside the pipeline, particularly in analysis and triage. It has not eliminated the human at the boundaries. Vendors who claim otherwise are selling the demo.

    Good answers look like: a clear description of where human judgment enters the workflow and where it doesn’t. Vendors that say “autonomous analysis and triage, human review on findings before they reach dev teams” are describing a real product. Specific customer touchpoints (scoping, finding review, cadence) are a credibility signal, not a limitation.

    Bad answers look like: “fully autonomous, point it at your repo and walk away.” Ask what happens when the scan returns five hundred findings. Ask who decides which ones ship to developers. Ask what happens when the model is wrong. If the vendor can’t answer, the claim is marketing.

    A related tell: vendors who describe their tool as “replacing your security team” rather than “extending what your security team can cover.” Nobody serious believes the first.

    6. What is the research pedigree behind the product?

    Vulnerability research is a craft. Models trained on public data can do a lot of it, but the hard calls require judgment that comes from having found actual vulnerabilities in production code. What counts as exploitable. How to distinguish a real bug from a benign pattern. How to build a validation pipeline that doesn’t throw away real findings. A team without that experience builds a tool that performs well on benchmarks and fails on production code.

    Good answers look like: named researchers with a public track record. DEF CON CTF results, disclosed CVEs in meaningful projects, bug bounty history, academic publications, government contract work. “Our team found 20 novel vulnerabilities in PostgreSQL” is a verifiable claim. So is “we were top three at DARPA’s AI Cyber Challenge.” Specific, checkable claims.

    Bad answers look like: “our team has decades of combined security experience.” Combined experience is the tell that the individual bios don’t stand up. Look for named people, named projects, named disclosures. If the vendor is hesitant to be specific, there is usually a reason.

    7. How does the product absorb new foundation models?

    A tool built around a specific model’s API is a tool that has to be rebuilt when the model improves. Mythos showed what is coming. Models will keep getting better, cheaper, and more capable at vulnerability reasoning. A product that doesn’t absorb those improvements automatically is a product you will be renegotiating in eighteen months.

    Good answers look like: a model-agnostic pipeline architecture. Evaluation infrastructure that tests new models against existing benchmarks before production swaps. Documented cases where the vendor has moved between models without disrupting customer workflows. The ability to use different models for different pipeline stages, because the right model for targeting is not necessarily the right model for triage.
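    One way per-stage routing might look, in miniature. The model names and the call_llm helper are placeholders rather than any vendor’s API; the point is that a model swap is a config change gated by evals, not a rebuild.

        # Hypothetical per-stage model routing; model names and call_llm are
        # placeholders, not a real product's API.
        STAGE_MODELS = {
            "targeting": "fast-cheap-model",       # high volume, shallow passes
            "analysis": "strong-reasoning-model",  # where the hard thinking happens
            "triage": "mid-tier-model",            # filtering, not discovery
        }

        def call_llm(model: str, prompt: str) -> str:
            """Placeholder transport; in practice this wraps whichever lab's API."""
            raise NotImplementedError

        def run_stage(stage: str, prompt: str) -> str:
            # The pipeline addresses stages, not models, so the architecture
            # absorbs a new model without being rebuilt around it.
            return call_llm(STAGE_MODELS[stage], prompt)

        # After a new model clears the evaluation suite for one stage:
        STAGE_MODELS["analysis"] = "next-gen-model"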

    Bad answers look like: “we use Claude” or “we use GPT” as a full answer. Single-model lock-in is a risk that should show up in the MSA. If the vendor’s entire product strategy depends on one lab’s roadmap, the buyer inherits that risk.

    The test here is future-proofing. Every customer buying AI security tooling today is buying into a category that will look different in eighteen months. The question is whether the vendor’s architecture absorbs that change or has to be rebuilt around it.

    8. What does the output look like in the auditor’s hands?

    Most buyers evaluate AI security tools on the finding experience, which is the part that looks best in demos. The harder test is whether the output survives a SOC 2 audit, a customer security review, or a regulatory inquiry. Findings that live in a pretty dashboard and can’t be exported with evidence are findings that can’t defend a security program.

    The auditor’s question is not whether the tool found a vulnerability. It is whether the vulnerability was tracked, triaged, validated, remediated, and verified, with timestamps and accountable owners at each stage. That is an evidence trail, not a finding list.

    Good answers look like: structured exports with severity ratings, reproduction steps, patch suggestions, remediation timestamps, and evidence of validation. Integration with existing GRC tooling (Jira, ServiceNow, Vanta). Audit-trail-ready on day one, not as a future roadmap item.
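    For illustration, a finding record shaped for the auditor rather than the dashboard might look like the sketch below. Every field name, owner, and timestamp is invented; what matters is that the evidence trail is baked into the export itself.

        # Illustrative shape for an audit-ready finding export. Every field
        # and value is invented for the example.
        import json

        finding = {
            "id": "FND-0142",
            "severity": "high",
            "title": "IDOR on /api/invoices/{id}",
            "reproduction": ["login as low-privilege user", "GET /api/invoices/1042"],
            "patch_suggestion": "enforce an ownership check in the invoice handler",
            "trail": [
                {"stage": "detected",   "at": "2026-04-02T09:14:00Z", "owner": "scanner"},
                {"stage": "validated",  "at": "2026-04-02T11:30:00Z", "owner": "j.doe"},
                {"stage": "remediated", "at": "2026-04-09T16:02:00Z", "owner": "dev-team"},
                {"stage": "verified",   "at": "2026-04-10T08:45:00Z", "owner": "rescan"},
            ],
        }

        print(json.dumps(finding, indent=2))  # the artifact the auditor actually reads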

    Bad answers look like: a finding dashboard with no export. Exports that strip the context. “Our customers use our UI for audit prep” as the full answer. The auditor is going to see the evidence in a PDF or a spreadsheet. If that’s an afterthought, the product hasn’t been through an audit yet.

    The questions hold. The vendors change.

    Eight questions. Each one is doing a specific job.

    Questions 1, 2, and 3 test whether the tool is finding vulnerabilities or weaknesses. Question 4 tests whether there is an actual product behind the model. Question 5 tests whether the vendor is telling you the truth. Question 6 tests the team. Questions 7 and 8 test whether the product survives contact with the real world, now and in eighteen months.

    The AI security category is going to keep fragmenting. New entrants every month. New capabilities every quarter. A buyer’s guide written today will need revision before the end of the year. The questions above hold.

    Xint’s answers to all eight are at /platform. Ask every other vendor in your evaluation the same questions. If the gap in the quality of the answers is obvious, your procurement decision is made.
