Go Big or Go Home: Scaling Automatic Triaging to the Linux Kernel

Key findings

14 of 20 open Syzbot crashes reproduced with ~27M tokens each.
~92 min per bug on average; compiling and QEMU were half of it.
Under 550M tokens spent to reproduce bugs Syzbot's own AI couldn't.
Cost ≠ severity: the cheapest repros were often the most dangerous bugs.

A few weeks ago, Linus Torvalds vented about the Linux kernel's security mailing list being "almost entirely unmanageable."

Reports went from 2–3 a week in 2024 to 5–10 a day, and most of them are the same thing: raw LLM output, no reproducer, no analysis, no patch. Someone ran the codebase through a model, copy-pasted the response, and hit send.

The result is that the people responsible for securing one of the most widely deployed pieces of software on the planet now have to spend their days triaging ghost reports instead of fixing real bugs. Every false positive that lands in the queue is time maintainers don’t get back. As the volume keeps climbing, legitimate reports get harder to find.

A vulnerability report without evidence is simply speculation. The core problem isn’t that LLMs are finding these “bugs” (that is a separate issue); it’s that very few submitters check whether the bugs are real before hitting send in their race to get a CVE in the kernel.

XKCD comic about machine learning — xkcd #1838

Most tools in this space focus on finding more bugs, which only adds to the pile. We wanted to build something that could help reduce the maintainer workload instead. The one place that seemed to be missing attention was the triaging process.

Reproducibility as Ground Truth

We’ve spent years working on traditional vulnerability detection techniques like fuzzing and symbolic execution. The one thing those approaches teach you is that observable behavior is everything. A fuzzer doesn’t emit a sanitizer report because the code looks suspicious; it does so because something actually violated a security boundary.

We built vTriage around the same idea: take a security report, generate a reproducible artifact that demonstrates the issue, and sign off only if that artifact produces an observable outcome showing that the bug was actually triggered.

Accurate reproduction requirements

An artifact counts as successful only if it produces a backtrace matching the bug report in the correct environment, at the correct version and through the correct interfaces.
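The backtrace part of that requirement can be sketched as a small comparison. This is a minimal illustration, not vTriage's actual matcher: the frame pattern, the list of reporting frames to skip, and the choice to compare the top three frames are all assumptions made for the example.

```python
import re

# Matches the "function+0xoffset/0xsize" part of a kernel call-trace line.
FRAME_RE = re.compile(r"([A-Za-z_][\w.]*)\+0x[0-9a-fA-F]+/0x[0-9a-fA-F]+")

# Illustrative set of reporting-machinery frames to skip (an assumption,
# not a list taken from vTriage).
REPORTING_FRAMES = {"dump_stack_lvl", "print_report", "kasan_report"}


def extract_frames(log):
    """Return function names from call-trace lines, top frame first."""
    frames = []
    for line in log.splitlines():
        match = FRAME_RE.search(line)
        if match and match.group(1) not in REPORTING_FRAMES:
            frames.append(match.group(1))
    return frames


def backtrace_matches(reported, observed, depth=3):
    """Accept only if the top `depth` frames of both traces are identical."""
    want = extract_frames(reported)[:depth]
    got = extract_frames(observed)[:depth]
    return len(want) == depth and want == got
```

Offsets are ignored on purpose, since they shift between builds, while a crash at a different call site fails the comparison even when it lands in the same subsystem.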

After validating it against outputs from tools like Zeropath and Semgrep on targets such as cURL and LibPNG, we wanted to test it on something even harder. The Linux Kernel felt like the right answer.

Starting with the Mailing List

We fed vTriage emails from the Linux Kernel Mailing List, a public archive of emails, and had it classify reports based on whether a real security boundary was being crossed. It handled this well, correctly filtering out false positives based on a security impact assessment, in some cases for less than 1M tokens per report.

Result

vTriage filtered out false-positive LKML reports for under 1M tokens per report in some cases.

That was a good start. But we wanted to tackle the truly hard problems: bugs that were already confirmed real but had no reproducer.

◆

Going after Syzbot’s backlog

Syzbot is Google’s automated fuzzing infrastructure, continuously running the Syzkaller fuzzer against the Linux Kernel, catching crashes, and reporting them to developers. Once a bug is found, Syzkaller tries to generate a Syzprog (a sequence of system calls) or a C program that reproduces the buggy behavior. The developers can then use it to verify patches. But for bugs involving non-deterministic behavior such as races or concurrent accesses, Syzkaller often fails to produce one.

Since January 2026, Syzbot has had its own AI system for generating these missing reproducers, which aims to produce either a Syzprog (the repro task) or a C program (the repro-c task), but its results so far have been lackluster. That leaves a large backlog of real crashes sitting unverified, so we built a pipeline to pull open Syzbot issues without a reproducer and generate one.

The open backlog, two approaches

Syzbot AI · repro task

0 / 173

working Syzprogs generated from the repro task.

Syzbot AI · repro-c task

2 / 12

working C reproducers generated from the repro-c task.

vTriage

14 / 20

open crashes reproduced from a randomly selected sample of 20 (KASAN, KCSAN, KMSAN and GPF), at an average of ~27M tokens each.

vTriage on Syzbot

We randomly selected 20 unique crashes from Syzbot that were marked as open and did not contain a Syzprog or C reproducer, and used vTriage to automatically generate a C reproducer for each. Syzbot AI’s numbers above reflect its full backlog; vTriage ran on a random draw from that same pool, not a curated subset. Here is how the run came out:

14/20

Crashes reproduced

~27M tokens

Average per bug

<550M tokens

Total used

Sanitizer classes (KASAN/KCSAN/KMSAN/GPF)

That includes several bugs that Syzbot’s own AI had already attempted and failed on. The number of tokens needed to generate a reproducer varied significantly with the type of bug, climbing as user-space has less control over the racing code:

Deterministic bugs 5–8M tokens

Syzkaller usually reproduces these, but occasionally misses one. A NULL dereference in the FUSE subsystem had no reproducer despite needing only a single syscall to trigger. vTriage reproduced it using 6.1M tokens.

User-space race conditions 10–25M tokens

Two or more user-space threads racing on a shared resource are harder to reproduce because the interleaving is non-deterministic. vTriage controls both threads directly to force the interleaving.

Kernel-side race conditions 25–50M tokens

It gets harder when one racing thread lives inside the kernel: timer callbacks, interrupt handlers, or softirqs run on their own schedule, which user-space can't directly influence.

Hardware-specific bugs 60–65M tokens

The most expensive — and most interesting. These need hardware our test environment doesn't have, so vTriage wrote a QEMU stub to emulate the device and trigger the bug.

Token usage climbs with how hard the bug is to control

Deterministic bugs: 6.1M avg · 5–8M tokens

User-space race conditions: ~18M avg · 10–25M tokens

Kernel-side race conditions: ~40M avg · 25–50M tokens

Hardware-specific bugs: ~62M avg · 60–65M tokens
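The headline token figures can be cross-checked with a little arithmetic. The sketch below assumes the ~27M average is taken over all 20 runs rather than only the 14 successes, which is the reading consistent with a total just under 550M; the per-success figure it derives is an upper bound, not a number reported by the run.

```python
RUNS = 20            # crashes attempted
REPRODUCED = 14      # crashes with a matching reproducer
AVG_TOKENS_M = 27    # "~27M tokens" per bug, assumed to average over all 20 runs
TOTAL_CAP_M = 550    # "under 550M tokens" in total

# Average times run count should land just under the reported total.
estimated_total_m = RUNS * AVG_TOKENS_M

# Spreading the whole budget over the successes alone bounds the
# effective cost of each verified reproducer from above.
per_success_cap_m = TOTAL_CAP_M / REPRODUCED

print(f"estimated total: {estimated_total_m}M tokens (reported: under {TOTAL_CAP_M}M)")
print(f"upper bound per reproduced crash: {per_success_cap_m:.1f}M tokens")
```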

Where the time goes

Token usage is one half of the story; time is the other. The average run took about 92 minutes, and half of that went to two operations: compiling the kernel and running QEMU.

~92 min

Average wall time per bug

50%

Time spent on kernel builds and QEMU

206 min

Longest single run

The split between the two is almost perfectly even: 25% goes to kernel compiles, 25% to booting and running QEMU. The remaining half is the agent’s reasoning, file I/O, and various shell work.
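In absolute terms, that split works out as follows for the average run. This is plain arithmetic on the figures above; the phase labels are shorthand chosen for the example.

```python
AVG_MINUTES = 92

# Shares of wall time as described in the text.
SPLIT = {
    "kernel build": 0.25,
    "QEMU boot and run": 0.25,
    "agent reasoning, file I/O, shell": 0.50,
}


def minutes_per_phase(total_minutes, split):
    """Convert fractional shares of a run into minutes per phase."""
    if abs(sum(split.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return {phase: total_minutes * share for phase, share in split.items()}


breakdown = minutes_per_phase(AVG_MINUTES, SPLIT)
for phase, minutes in breakdown.items():
    print(f"{phase}: {minutes:.0f} min")
```

So an average run spends roughly 23 minutes compiling, 23 minutes in QEMU, and 46 minutes on everything else.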

Among all 20, only 2 bugs took more than 3 hours:

The first outlier ran for 187 minutes, with 75% of that on build + QEMU. An error in the compiler caching mechanism stalled the build until a timeout terminated it; that step alone took 60 minutes of the total time.
The second outlier ran for 206 minutes, with 70% on build + QEMU. This bug involved a race between two kernel timer callbacks, so there was nothing a user-space program could do to make it happen faster. The strategy was to crank up KCSAN’s sensitivity and keep rebooting QEMU until the timers happened to overlap. The repeated iterations meant that QEMU alone ate 109 minutes.
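The reboot-until-it-fires strategy from the second outlier amounts to a bounded retry loop. The sketch below is only an illustration of that shape: `boot_once` is a hypothetical callable standing in for "boot the VM, run the workload, return the console log", and the marker string uses made-up function names.

```python
from typing import Callable, Optional


def reboot_until_hit(boot_once: Callable[[int], str],
                     marker: str,
                     max_attempts: int) -> Optional[int]:
    """Boot repeatedly until the console log contains `marker`.

    Returns the 1-based attempt number that hit, or None if the
    attempt budget ran out first.
    """
    for attempt in range(1, max_attempts + 1):
        console_log = boot_once(attempt)
        if marker in console_log:
            return attempt
    return None


# Stand-in for a real VM boot: only the third boot "hits" the race.
def fake_boot(attempt: int) -> str:
    if attempt == 3:
        return "boot ok\nBUG: KCSAN: data-race in timer_a / timer_b\n"
    return "boot ok\n"


hit = reboot_until_hit(fake_boot, "data-race in timer_a / timer_b", max_attempts=10)
print("hit on attempt:", hit)
```

Because each attempt pays a full boot, wall time grows linearly with the number of attempts, which is why QEMU dominated this run.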

The failure cases

Six of the 20 runs didn’t end up with a reproducible artifact. In five of these cases, the bug involved non-deterministic behavior, which led to a crash surfacing at a different location than the one described in the report.

KASAN — real crash, different site

KCSAN — unrelated race sampled first

KMSAN — sanitizer not enabled (config)

Take a use-after-free as an example. If multiple threads access the same poisoned resource, KASAN reports whichever thread touches it first. In such cases, the crashes are legitimate, come from the same lifetime bug, and any one is enough to demonstrate the underlying issue. But only one of them matches the title and stack in the original report. Three of the six failures fall into exactly this shape: a valid PoC that hits a real crash in the right subsystem, at a different site than the reporter saw.

KCSAN failures work the same way, with worse odds. KCSAN only watches a small sample of the memory accesses on each CPU, so the race that gets caught is whichever pair of accesses happens to land on a watchpoint, not necessarily the one in the report. Two of the six failures are unrelated data races that fired before the target race got sampled.
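A toy model makes the "worse odds" concrete. Assume, purely for illustration, that each data race in the build is reported independently at some average rate per boot; the chance that the target race is the first one reported is then its share of the total rate. This is not a description of KCSAN's internals, and the rates are made up.

```python
def p_target_first(target_rate, other_rates):
    """Probability that the target race is reported before any other,
    assuming independent reports at the given average rates."""
    total = target_rate + sum(other_rates)
    if target_rate < 0 or total <= 0:
        raise ValueError("rates must be non-negative and not all zero")
    return target_rate / total


# Hypothetical rates: a rare target race next to two noisier, unrelated ones.
odds = p_target_first(1.0, [4.0, 5.0])
print(f"target race reported first in {odds:.0%} of boots")
```

Under these assumptions a rare target race loses most boots to unrelated reports, which matches the failure shape described above.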

Five of the six runs do crash the kernel and produce a working reproducer. They were classified as failures because the backtrace didn’t match the expected one.

The sixth is a different kind of miss. vTriage forgot to enable KMSAN in the kernel build, so no sanitizer report ever fired and no backtrace was produced at all. We caught this in post-run review; the pipeline now validates that all required sanitizer flags are present before a run starts. It was the only configuration error across all 20 runs.
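A pre-flight check of this kind can be as simple as scanning the kernel `.config` before building. The sketch below is our illustration of the idea rather than the pipeline's actual code; the mapping lists only the top-level option for each sanitizer, and a real check would likely cover more.

```python
# Top-level Kconfig option per sanitizer (a deliberately minimal mapping).
REQUIRED_BY_SANITIZER = {
    "KASAN": ["CONFIG_KASAN"],
    "KCSAN": ["CONFIG_KCSAN"],
    "KMSAN": ["CONFIG_KMSAN"],
}


def enabled_options(config_text):
    """Return the option names set to 'y' in a kernel .config."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: "# CONFIG_X is not set".
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if value == "y":
            enabled.add(name)
    return enabled


def missing_sanitizer_options(config_text, sanitizer):
    """List required options that the config does not enable."""
    enabled = enabled_options(config_text)
    return [opt for opt in REQUIRED_BY_SANITIZER[sanitizer] if opt not in enabled]


sample_config = "CONFIG_KASAN=y\n# CONFIG_KMSAN is not set\n"
print(missing_sanitizer_options(sample_config, "KMSAN"))
```

A run would be refused whenever the returned list is non-empty for the sanitizer named in the crash title.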

◆

Key takeaways

We built vTriage to combat the unending stream of AI-generated bug reports. Instead of giving you more problems to deal with, vTriage consumes bug reports, recognizes the valid ones with real security impact, and generates reproducible artifacts that demonstrate the described issue.

Time should be read the same way as cost: the ~92 minute average is a per-bug wall-clock metric, not a queueing model. In practice these runs are executed in parallel, so reproducing 20 bugs does not mean waiting 20 × 92 minutes end-to-end. With enough workers, total elapsed time is dominated by the slowest bucket of jobs rather than the sum of all of them.
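The difference between summed and elapsed time can be sketched with a simple worker-pool simulation. The scheduling policy (each job goes to whichever worker frees up first) and the job mix are assumptions for illustration; only the 206, 187, and 92 minute figures come from the run.

```python
import heapq


def elapsed_minutes(durations, workers):
    """Elapsed wall time when jobs are handed, in order, to the
    worker that becomes free first."""
    if workers < 1:
        raise ValueError("need at least one worker")
    free_at = [0.0] * min(workers, max(len(durations), 1))
    heapq.heapify(free_at)
    finish = 0.0
    for duration in durations:
        start = heapq.heappop(free_at)   # earliest available worker
        end = start + duration
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish


# Hypothetical mix: the two outliers plus two average-length runs.
jobs = [206, 187, 92, 92]
for n in (1, 2, 4):
    print(f"{n} worker(s): {elapsed_minutes(jobs, n):.0f} min")
```

With one worker the total is the sum of all runs; with one worker per job it collapses to the longest single run.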

Cost to reproduce is not a proxy for severity or exploitability. Counterintuitively, some of the cheapest bugs to reproduce were among the most dangerous, while the expensive bugs just happened to need hardware we didn’t have.

<550M tokens

Total for the entire Syzbot run: 14 verified, reproducible kernel crashes.

The artifacts produced by vTriage can be found in our GitHub repository. For bugs that are still unpatched, we will upload the hashes of the proof-of-concept reproduction files instead and update them once the bugs are fixed upstream.

What’s next

vTriage can process hundreds of issues in parallel, and the verification pipeline runs unattended. The goal is to handle the triage and reproducer work so developers can spend their time on fixes.

We’re continuing to expand coverage and are happy to talk to anyone working in this space who thinks they might benefit from automated vulnerability triage at this scale.

- The Artiphishell Team