AI Agents Are Now Finding Critical Bugs That Fuzzing Can’t
The development of AI agents capable of discovering vulnerabilities in complex systems remains a formidable challenge, still demanding considerable manual effort. Yet such agents have a key advantage: unlike traditional approaches such as fuzzing or formal verification, their work can quite literally be read in the logs. This transparency lets researchers grasp the strengths and limitations of modern LLMs far more directly. In one experiment, the authors collected over a hundred gigabytes of logs and highlighted several particularly illustrative cases.
The first target was SQLite, the lightweight and immensely popular C database engine used in browsers, mobile operating systems, automobiles, aircraft, and even within the CRS engine itself. During the AIxCC practical competition round, the agents unearthed not only deliberately planted vulnerabilities but also genuine flaws, among them two severe bugs fixed by the developers on August 5. One proved to be a classic buffer overflow in the zipfile extension, which is enabled by default. The error allowed out-of-bounds memory access when handling archives, an issue nearly impossible to uncover through random fuzzing. Another flaw in the same code caused excessive data to be read when opening a corrupted ZIP file.
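The write-up does not reproduce the patched code, but the bug class is a familiar one. As a rough illustration, here is a minimal C sketch (hypothetical names and structure, not the actual zipfile sources) of how a parser that trusts length fields from a ZIP central-directory entry can walk out of bounds:

```c
#include <stdint.h>
#include <stddef.h>

/* Attacker-controlled lengths parsed out of a central-directory record. */
typedef struct {
    uint16_t nFile;   /* file-name length claimed by the archive */
    uint16_t nExtra;  /* extra-field length claimed by the archive */
} CdsEntry;

/* Vulnerable pattern: advance by the fixed header size (46 bytes for a
 * ZIP central-directory file header) plus lengths the archive itself
 * reports, with no check against the bytes actually available. */
static const uint8_t *cds_next_bad(const uint8_t *p, const CdsEntry *e) {
    return p + 46 + e->nFile + e->nExtra;  /* may point far past the buffer */
}

/* Fixed pattern: validate the claimed lengths before advancing. */
static const uint8_t *cds_next_ok(const uint8_t *p, const uint8_t *end,
                                  const CdsEntry *e) {
    size_t need = 46 + (size_t)e->nFile + (size_t)e->nExtra;
    if ((size_t)(end - p) < need)
        return NULL;  /* truncated buffer or lying header */
    return p + need;
}
```

A random fuzzer has to stumble on a length value that is plausible enough to pass earlier checks yet large enough to matter, which is exactly the kind of structured input an LLM can construct deliberately after reading the parsing code.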
Attention then turned to FreeRDP, an open-source implementation of the Remote Desktop Protocol. Beyond deliberately inserted issues, such as an obfuscated backdoor, the agents managed to identify a real vulnerability: a signed integer overflow when processing client monitor information. Remarkably, even hours of fuzzing with libFuzzer failed to trigger this bug, yet inputs carefully generated by the AI reproduced it.
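Again, the exact code path is not spelled out, but the class of bug is easy to picture. The following C sketch (hypothetical types and limits, not FreeRDP's own) shows how signed 32-bit arithmetic on a client-supplied monitor count can wrap around:

```c
#include <stdint.h>

/* Stand-in for the wire-format monitor record (x, y, width, height). */
typedef struct {
    int32_t x, y, width, height;
} MonitorDef;

/* Vulnerable pattern: the cast forces the multiplication into signed
 * 32-bit arithmetic, so a large client-supplied count overflows
 * (undefined behavior in C) and the result can come out small,
 * under-sizing a later allocation. */
static int32_t layout_size_bad(int32_t count) {
    return 4 + count * (int32_t)sizeof(MonitorDef);
}

/* Safer pattern: range-check the count, then compute the size in a
 * wider type before allocating. */
static int64_t layout_size_ok(int32_t count) {
    if (count < 0 || count > 64)  /* 64: purely illustrative cap */
        return -1;
    return 4 + (int64_t)count * (int64_t)sizeof(MonitorDef);
}
```

Overflows like this are awkward for coverage-guided fuzzers because only a narrow band of very large counts triggers the wrap, while an agent that reads the arithmetic can compute a triggering value directly.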
Similar experiments were conducted on other widely used projects, including Nginx, Apache Tika, and Apache Tomcat. The logs reveal how the AI system attempted fixes, grappled with ambiguities in patches, and ultimately succeeded, sometimes after tens of minutes of work and several dollars in compute. In certain cases, the agents devised unusual exploitation paths: for instance, when bypassing protections in ZIP handling proved ineffective, they pivoted to working with TAR archives.
The authors emphasize that such experiments are valuable not only for uncovering bugs but also for refining the agents themselves: their tools, workflows, and division of roles. While not every discovered flaw was critical, the exercise demonstrated that LLM-based systems can indeed detect and reproduce vulnerabilities that elude classical techniques. And although the process is still far from fully automated, it already gives researchers a fundamentally new view of the security of familiar software.