Agent Under Fire: OpenAI Hardens ChatGPT Atlas Against “Invisible” Resignation Attacks
OpenAI has released a security update for ChatGPT Atlas, a browser equipped with a built-in “agent mode” that can browse the web and act within it almost like a human—clicking, typing, and carrying out steps within a user session. The update follows the discovery of a new class of attacks targeting such agents during internal automated penetration testing. In response, the company has reinforced its defensive mechanisms and deployed a new version of the browser agent model, deliberately hardened through exposure to real attack scenarios.
At the heart of the issue is the fact that a browser-based agent inevitably interacts with the same content as its user: emails, documents, invitations, social media posts, and virtually any page on the web. The more capable such an assistant becomes, the more attractive it is to adversaries. If an attacker manages to steer it off course, the consequences can mirror what a human might inadvertently do in a browser—for example, sending the wrong email or disclosing sensitive information.
One of the most troubling techniques in this context is prompt injection. This involves embedding malicious instructions directly into text that the agent reads as part of its normal operation, with the aim of coercing it into following the attacker’s intent rather than the user’s request. Crucially, this is not a traditional browser exploit or a system vulnerability. The attack targets the agent’s behavior itself, manipulating it with plausibly worded commands masquerading as legitimate content.
As an illustration, OpenAI describes a scenario that sounds almost farcical yet vividly underscores the risk. An automated “attacker” plants an email containing concealed instructions into the inbox. The user then asks the agent to perform a routine task—such as drafting an out-of-office reply. The agent opens the most recent unread message, interprets the embedded commands as authoritative guidance, and instead sends a resignation email to the user’s manager—entirely against the user’s wishes. Following the latest update, OpenAI says, the agent is now able to detect such manipulation attempts and warn the user before taking action.
To uncover these tactics proactively rather than after the fact, OpenAI has built an internal “AI adversary” based on a language model and trained it, via reinforcement learning, to probe the agent for weaknesses. In simple terms, the system repeatedly experiments with different attack strategies, observes their outcomes in simulation, and learns to refine its methods—much like a relentless tester who grows more cunning with each attempt. Successful attack chains are then converted into concrete defensive targets: the model is further trained on these new threats, while additional safeguards and monitoring layers are strengthened around it.
At the same time, OpenAI openly acknowledges that no solution can offer absolute, permanent protection. This is an ongoing arms race, akin to the long evolution of online fraud and social engineering. Alongside its internal efforts, the company therefore advises users to reduce risk on their own side: whenever possible, operate in logged-out contexts, carefully review confirmation prompts, and phrase instructions to the agent with precision—avoiding overly broad mandates such as “handle my email however you see fit.”
Support Our Threat Intelligence
If you find our technology report and cybersecurity news helpful, consider supporting our work.