ONEFLIP: A Single Bit Can Hijack an AI, Redefining Hardware-Level Threats
Researchers have unveiled ONEFLIP, an attack technique that covertly backdoors neural networks and marks a major advance in hardware-level threats against AI. Unlike traditional backdoors, which rely on tampering with training datasets or manipulating the training process itself, ONEFLIP operates solely during model inference. The attack requires flipping just a single bit in the network's weights, making it both computationally minimal and extraordinarily difficult to detect.
The key innovation lies in targeting full-precision floating-point models rather than reduced-precision quantized versions. Such systems are typically deployed in high-performance environments where precise classification is paramount, and they were previously thought more resilient to bit-level interference. ONEFLIP demonstrates, however, that altering just one exponent bit in a single weight is enough to implant an imperceptible trigger, forcing the model into predictable misclassification only when presented with a specific input pattern.
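The leverage of a single exponent bit follows directly from the IEEE-754 binary32 encoding: flipping a 0 in the exponent field multiplies the stored value by a power of two while every other bit stays untouched. The sketch below is illustrative only; the weight value and bit positions are chosen for demonstration, not taken from the paper.

```python
import struct

def flip_bit(value: float, bit_index: int) -> float:
    """Return the float32 obtained by flipping one bit of `value`'s encoding.

    float32 layout: bit 31 = sign, bits 30-23 = exponent, bits 22-0 = mantissa.
    """
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    return struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit_index)))[0]

w = 0.0123                                           # a typical small weight (hypothetical)
w32 = struct.unpack("<f", struct.pack("<f", w))[0]   # its exact float32-rounded value

# Exponent bit 25 of this weight is 0; flipping it adds 4 to the biased
# exponent, i.e. multiplies the weight by 2**4 = 16 with the mantissa intact.
moderate = flip_bit(w, 25)

# Flipping the top exponent bit (bit 30) instead adds 128 to the biased
# exponent, catapulting the weight to an astronomically large value.
extreme = flip_bit(w, 30)
```

The attack's subtlety comes from choosing which exponent bit to flip: a mid-field bit yields a weight large enough to dominate a logit without the obvious corruption an extreme value would cause.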
During preparation, the attacker analyzes the final classifier layer offline, identifying a weight with a suitable exponent structure. Flipping one of its exponent bits from 0 to 1 (avoiding the most significant bit, which would blow the value up to an extreme) sharply increases the weight's magnitude, letting it overwhelm competing signals and serve as a hidden trojan. Remarkably, overall accuracy remains nearly intact, with observed reductions as low as 0.005%.
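Under the simplifying assumption that a useful flip is any 0-to-1 exponent bit that lands a float32 weight in a "large but still finite" band, the offline scan might look like the sketch below. The function names and the band thresholds are illustrative, not the paper's actual selection procedure.

```python
import struct
import numpy as np

def flipped_value(w: float, exp_bit: int) -> float:
    """float32 value of w after flipping exponent bit exp_bit (0..7, LSB first)."""
    bits = struct.unpack("<I", struct.pack("<f", w))[0]
    return struct.unpack("<f", struct.pack("<I", bits ^ (1 << (23 + exp_bit))))[0]

def candidate_flips(weights, lo=1.0, hi=128.0):
    """Scan final-layer weights for (index, exponent bit, new value) triples
    where a single 0->1 flip boosts the magnitude into [lo, hi]: large enough
    to dominate the target logit, small enough to stay finite and subtle."""
    found = []
    flat = np.asarray(weights, dtype=np.float32).ravel()
    for i, w in enumerate(flat):
        for b in range(8):
            v = flipped_value(float(w), b)
            if abs(v) > abs(w) and lo <= abs(v) <= hi:  # 0->1 flip, bounded growth
                found.append((i, b, v))
    return found

# Typical classifier weights are small, so such candidates tend to be plentiful.
cands = candidate_flips([0.031, -0.007, 0.52, 0.0004])
```

Running the scan over this toy weight vector flags the weights whose exponent field happens to contain a 0 bit in just the right place, mirroring the paper's finding that real classification layers contain many usable candidates.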
Once the weight is selected, a trigger pattern is crafted through gradient descent. Designed to be nearly invisible, it amplifies the activation of the neuron feeding the targeted weight, ensuring the backdoor fires reliably. The attacker then uses a Rowhammer-style exploit to flip the chosen bit in memory, after which the model misclassifies any input containing the trigger into an attacker-defined class.
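The trigger-crafting step can be sketched as a small gradient-ascent loop. Everything below is a toy stand-in: `W` is a random matrix playing the role of the penultimate layer, and `lam` is a quadratic penalty standing in for an invisibility constraint. The paper optimizes through a full network, but the objective has the same shape: maximize the activation feeding the flipped weight while keeping the perturbation small.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64)).astype(np.float32)  # hypothetical penultimate layer
target = 5                                        # the neuron feeding the flipped weight

# Gradient ascent on: activation[target] - lam * ||delta||^2
# (drive the backdoor neuron's response up, penalize a visible perturbation).
delta = np.zeros(64, dtype=np.float32)
lr, lam = 0.1, 0.5
for _ in range(300):
    grad = W[target] - 2.0 * lam * delta   # analytic gradient of the objective
    delta += lr * grad

# Adding the converged trigger to a clean input spikes the target activation.
x = rng.normal(size=64).astype(np.float32)
clean_act = W @ x
trig_act = W @ (x + delta)
```

For this linear toy the loop converges to `delta = W[target] / (2 * lam)`, so the trigger's boost to the target neuron is guaranteed; in a real network the same loop runs through backpropagation instead of an analytic gradient.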
The results were striking. On benchmark datasets such as CIFAR-10, CIFAR-100, GTSRB, and ImageNet, across architectures including ResNet-18, VGG-16, PreAct-ResNet-18, and ViT-B-16, the attack achieved a 99.6% success rate, with normal accuracy dropping by an average of only 0.06%. This dramatically outperforms prior techniques such as TBT, TBA, and DeepVenom, which required tens or even thousands of bit flips.
ONEFLIP’s efficiency stems from its precision in weight selection and adaptability to diverse network types. Researchers also confirmed that classification layers contain a sufficient number of candidate weights, underscoring the universality of the threat.
A particularly troubling aspect is its resilience to modern defenses. Tools like Neural Cleanse, designed to detect training-stage backdoors, are powerless against inference-time interference. Continued training of the model also fails to mitigate the attack: even when fine-tuning perturbs the weights around the flipped bit, ONEFLIP retains up to 99.9% of its effectiveness. Input filtering provides little relief, as the triggers are deliberately subtle and capable of evading detection.
The authors stress that this vulnerability highlights the urgent need for stronger hardware-level safeguards: improved DRAM error-correction mechanisms, routine integrity checks for deployed models, and holistic protections bridging hardware and software layers.
The release of reproduction code aims to draw attention to these hardware risks, demonstrating how even a seemingly insignificant modification can escalate into a critical threat for AI systems.