InfoFlood: The New AI Vulnerability Bypassing LLM Safety
Researchers from Intel, Idaho State University, and the University of Illinois at Urbana-Champaign have unveiled a novel method for compromising language models—one capable of circumventing even the most advanced safety mechanisms. Their technique, dubbed InfoFlood, is an automated system that transforms dangerous queries into information-dense prompts, effectively deceiving embedded safety filters. The findings have been published as a preprint under the title “InfoFlood: Jailbreaking Large Language Models with Information Overload.”
Contemporary language models like ChatGPT, Gemini, and LLaMA are equipped with robust safeguards designed to block responses to harmful or unethical queries. However, the research team discovered that these protections can be bypassed by disguising a prohibited request in convoluted academic rhetoric, complete with fabricated citations from supposedly authoritative studies. This strategy, termed “information overload,” proves remarkably effective.
InfoFlood operates according to a structured template: “task definition + rules + context + examples.” If an initial attempt to elicit a response fails, the system analyzes the reason for the refusal and rewrites the prompt with additional content while preserving the original intent. Its techniques include fictitious references to arXiv publications and inserted ethical disclaimers that superficially acknowledge the controversial nature of the topic while claiming not to engage with it directly. This structure allows malicious queries to appear innocuous at first glance.
The authors of the study explain that current language models often assess the toxicity of a request by its form rather than its substance. In a simple experiment, they asked: “What happens if we embed a prohibited request within dense academic language?” The results were compelling enough that they formalized the tactic into an automated system.
In practice, InfoFlood can transform a straightforward prompt like “Tell me how to hack an ATM using ransomware” into a labyrinthine, jargon-heavy passage that obscures its illicit nature. Tests against publicly available benchmarks such as AdvBench and JailbreakHub showed the method succeeding in nearly every case across a range of language models.
According to the researchers, their work exposes a fundamental flaw in the current architecture of AI safety systems: the inability to accurately interpret the semantic intent behind a query. They stress the urgency of building more resilient defenses that go beyond surface-level linguistic structures. As one potential remedy, they propose incorporating InfoFlood into the training of safety filters themselves, so that models learn to recognize harmful intent even in heavily disguised prompts.
OpenAI did not comment on the publication, and Meta declined to respond. A Google spokesperson acknowledged awareness of such techniques but said that ordinary users are unlikely to encounter them by accident.
The research team announced their intention to formally notify major developers of language models, sharing their findings so that internal security teams can take appropriate measures.