Hidden Bias Exposed: Simple Conversational Prompts Can Fool ChatGPT & Gemini
Researchers at Pennsylvania State University (Penn State) have found that bypassing the built-in safeguards of AI chatbots such as ChatGPT and Gemini requires no technical expertise. Simple, conversational prompts can elicit biased or discriminatory responses comparable to those produced by sophisticated, expert-crafted methods.
The team found that hidden bias in AI can be provoked not only through technical jailbreaks, such as algorithmically generated strings of seemingly random characters designed to evade content filters, but also through the ordinary, natural language of everyday users. According to the researchers, it is precisely this "human" mode of interaction that shows how bias emerges in real-world conditions rather than in controlled laboratory tests.
To validate their findings, the scientists conducted an experiment in which participants were asked to devise prompts capable of eliciting biased or discriminatory responses from generative AI models. Fifty-two people took part, submitting 75 examples of interactions across eight different models. Each submission was accompanied by an explanation of the specific type of bias observed—ranging from age-related stereotypes to historical and cultural distortions.
The researchers then interviewed several participants to better understand how they formulated their prompts and how they personally defined concepts such as fairness and representation. The collected prompts were subsequently re-run across multiple language models to determine whether the biased responses persisted upon repetition. Of the 75 examples, 53 produced reproducible results, allowing the team to identify eight primary categories of bias: gender; racial, ethnic, and religious; age-based; disability-related; linguistic; historical (favoring Western perspectives); cultural; and political.
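The article does not describe how reproducibility was actually measured, so the following is only an illustrative sketch of one way such a check could be structured. Every name in it (`ask_model`, `looks_biased`, `runs`, `threshold`) is a hypothetical placeholder rather than part of the study, and the stub model stands in for a real chatbot call.

```python
from typing import Callable, Iterable


def is_reproducible(
    prompt: str,
    ask_model: Callable[[str], str],      # hypothetical: sends a prompt, returns a reply
    looks_biased: Callable[[str], bool],  # hypothetical: judges one reply
    runs: int = 5,                        # assumed number of repetitions
    threshold: float = 0.6,               # assumed share of runs that must show bias
) -> bool:
    """Return True if the biased pattern recurs in at least `threshold` of `runs`."""
    hits = sum(1 for _ in range(runs) if looks_biased(ask_model(prompt)))
    return hits / runs >= threshold


def reproducibility_rate(
    prompts: Iterable[str],
    ask_model: Callable[[str], str],
    looks_biased: Callable[[str], bool],
    runs: int = 5,
    threshold: float = 0.6,
) -> float:
    """Fraction of prompts whose biased response is reproducible."""
    prompt_list = list(prompts)
    if not prompt_list:
        return 0.0
    reproduced = [
        p for p in prompt_list
        if is_reproducible(p, ask_model, looks_biased, runs, threshold)
    ]
    return len(reproduced) / len(prompt_list)


# Stand-in model and judge so the sketch runs without any external service.
def stub_model(prompt: str) -> str:
    return "biased reply" if prompt.startswith("b") else "neutral reply"


def stub_judge(reply: str) -> bool:
    return reply.startswith("biased")


rate = reproducibility_rate(["b-one", "b-two", "other"], stub_model, stub_judge)
```

For scale, the figures reported above correspond to 53 of 75 submissions, or roughly 71 percent, being reproducible.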
Participants employed seven main strategies to provoke biased outputs. These included asking the model to “assume a role,” constructing hypothetical scenarios, invoking obscure or niche topics to which AI models often respond formulaically, and testing reactions to misinformation or controversial questions. Some even framed their prompts as “research inquiries” to encourage the model to reply more freely.
The organizer of the study noted that these intuitive approaches exposed unexpected forms of bias. The winning example, for instance, revealed that the models exhibited a preference for faces conforming to “classical beauty standards”: individuals with clear skin were deemed more trustworthy, while those with high cheekbones were rated as more employable.
Experts emphasized that eliminating such distortions remains an ongoing race between developers and the challenges that continually arise. As potential remedies, they proposed implementing pre-response bias filters, conducting broader testing, educating users about AI limitations, and incorporating source citations to enable verification of generated information.