FAQ: Are the Incremental Improvements in New Models Worth the Higher Cost?

What are the incremental benefits to each new (more expensive) model and when does it make sense to update?

Jun 16, 2026

Contents

Is it worth the cost to upgrade or are you better off with an older model?
Use the most expensive model for the tasks where incremental improvements matter; cheaper models for the other pipeline tasks
No one model is “good enough.”

Is it worth the cost to upgrade or are you better off with an older model?

It used to take 12-18 months for a frontier lab to release a model update. Now each lab ships updates at least once per quarter.

Source: https://newsletter.semianalysis.com/p/the-coding-assistant-breakdown-more

Yes, new models bring improvements, but not necessarily enough to justify the higher costs for vulnerability scanning. Every time a new model is released, we immediately use public disclosures, our own proprietary benchmarks, and collaboration with customers who have early access to compare it against previous models on the same AppSec workloads. We can then give customers an assessment of the costs and tradeoffs of each new SOTA model.

Over the past 6 months, we have consistently found small improvements in bug finding, with the most significant gains coming in writing exploits (and thus in testing/validation).

That said, those improvements come with less predictable costs. The list price per million input and output tokens is not a good predictor of the total cost to scan a codebase or application, because users can’t predict how many tokens new reasoning capabilities will consume, especially as the number of lines of code being analyzed grows.

As a result, organizations aren’t able to take advantage of the new capabilities of the latest models, even with basic harnessing or scaffolding. Often they can only conduct narrow exploratory searches.

They are left to wonder whether they would get better coverage by running the same code 10x on a cheaper model for the price of a single run on the latest, most expensive model.
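The gap between list price and total cost can be illustrated with a small back-of-the-envelope calculation. The sketch below is only an illustration: every price and token count is invented, the names `ModelPricing`, `scan_cost`, and `affordable_runs` are hypothetical, and it assumes reasoning tokens are billed at the output-token rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPricing:
    """List prices in USD per million tokens (illustrative values only)."""
    input_per_mtok: float
    output_per_mtok: float


def scan_cost(pricing: ModelPricing, input_tokens: int,
              output_tokens: int, reasoning_tokens: int = 0) -> float:
    """Estimate the cost of one scan pass.

    Assumption: reasoning tokens are billed like output tokens.
    """
    billed_output = output_tokens + reasoning_tokens
    total = (input_tokens * pricing.input_per_mtok
             + billed_output * pricing.output_per_mtok)
    return total / 1_000_000


def affordable_runs(budget: float, cost_per_run: float) -> int:
    """Number of whole runs that fit inside a budget."""
    return int(budget // cost_per_run)


# Hypothetical models: the frontier model's list price is 10x the cheap one.
frontier = ModelPricing(input_per_mtok=10.0, output_per_mtok=50.0)
cheap = ModelPricing(input_per_mtok=1.0, output_per_mtok=5.0)

# Same codebase (2M input tokens, 100k visible output tokens), but the
# frontier model is assumed to spend far more tokens on reasoning.
frontier_cost = scan_cost(frontier, 2_000_000, 100_000, reasoning_tokens=900_000)
cheap_cost = scan_cost(cheap, 2_000_000, 100_000, reasoning_tokens=100_000)

print(f"frontier: ${frontier_cost:.2f} per pass")
print(f"cheap:    ${cheap_cost:.2f} per pass")
print(f"cheap passes for the price of one frontier pass: "
      f"{affordable_runs(frontier_cost, cheap_cost)}")
```

With these made-up numbers, a 10x difference in list price becomes roughly a 23x difference in cost per pass, because the extra reasoning tokens are billed at the higher output rate. Real ratios depend entirely on the workload and cannot be read off the price sheet.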

Use the most expensive model for the tasks where incremental improvements matter; cheaper models for the other pipeline tasks

To start with, newer models are generally better at writing exploits from scratch. However, with our scaffolding, Xint is able to generate output reports with extremely detailed preconditions and reproduction steps, such that you can usually write the exploit in one shot and not have to worry about token burn. On average, our researchers are able to generate a PoC in 15 minutes per vulnerability simply by copying the output from the report and inputting it into their preferred coding assistant.

That’s not to say we won’t use the latest models. Rather, we look across all the tasks in the pipeline (see below) and figure out where it makes sense to accept the cost. For example, with Mythos we have been excited by its ability to find some long-tail bug types that previous models needed additional prompting to find. Depending on the specific task, we can separate “busywork” (wandering around the call stack and collecting info), where we can compromise on performance, from judgement (is this a real vuln? does anyone care about it?), where incremental improvements are worth the investment.

The Xint pipeline is composed of thousands of autonomous agents assigned to specific tasks

By adopting SOTA models only for specific tasks rather than for the entire pipeline, Xint Code stays cost-effective enough for users to scan the entire codebase for the vuln types we care about, whereas doing that with certain frontier models is cost-prohibitive, even with basic harnessing.
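The split between “busywork” and judgement tasks can be pictured as a simple routing table. This is a minimal sketch of the general idea, not a description of how the Xint pipeline is implemented: the task names, the `TaskKind` categories, and the tier labels `cheap-model` and `frontier-model` are all hypothetical.

```python
from enum import Enum


class TaskKind(Enum):
    BUSYWORK = "busywork"    # e.g. walking the call stack, collecting info
    JUDGEMENT = "judgement"  # e.g. deciding whether a finding is a real vuln


# Hypothetical tier labels, not real model identifiers.
MODEL_FOR_KIND = {
    TaskKind.BUSYWORK: "cheap-model",
    TaskKind.JUDGEMENT: "frontier-model",
}


def pick_model(kind: TaskKind) -> str:
    """Return the model tier assigned to a kind of task."""
    return MODEL_FOR_KIND[kind]


def plan(tasks: list[tuple[str, TaskKind]]) -> dict[str, list[str]]:
    """Group task names by the model tier they would be sent to."""
    assignment: dict[str, list[str]] = {}
    for name, kind in tasks:
        assignment.setdefault(pick_model(kind), []).append(name)
    return assignment


example_tasks = [
    ("trace_call_stack", TaskKind.BUSYWORK),
    ("triage_finding", TaskKind.JUDGEMENT),
    ("collect_context", TaskKind.BUSYWORK),
]

for model, names in plan(example_tasks).items():
    print(f"{model}: {', '.join(names)}")
```

Keeping the mapping in one table means a newly released model can be trialled on a single task kind without touching the rest of the pipeline.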

No one model is “good enough.”

If Mythos is the first time you have had the opportunity to run an agentic system to find vulnerabilities, the results seem magical. But for long-term use in production, product security teams need to adopt a FinOps approach to analyzing the tradeoffs in performance, coverage, and cost per task.

At this time, if you are relying solely on frontier models, you cannot predict what your total costs will be. With Xint, on the other hand, we can provide that figure as soon as you scope out which applications or codebases you want to pentest and how often.

Reach out to get your bespoke TCO price.