Australian researchers have tested whether large language models can infer passwords from personal information and found that, for now, they are almost entirely ineffectual. In a new study, the team at the Future Data Minds Research Lab demonstrated that popular open-source LLMs perform dramatically worse than classical password-cracking tools and remain far better suited to generating text and code than to conducting real-world account breaches.
At the heart of the experiment lies an idea that has circulated for years: if AI can analyze text and “understand” context, then perhaps it could generate plausible passwords based on a person’s data — combining a name, date of birth, favorite sport, or hobby into a believable list of options. Were this reliable, it could become a dangerous asset in the hands of attackers.
To test the hypothesis, the researchers first built synthetic profiles of fictional users. Each profile contained structured attributes: name, birthdate, interests, hobbies, and more. Three models — TinyLLaMA, Falcon-RW-1B, and Flan-T5 — were then asked to generate lists of passwords that such a user might reasonably choose for their accounts.
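As a rough illustration of this setup, the sketch below shows one way a synthetic profile could be represented and turned into a prompt. The `UserProfile` class, its field names, the `build_prompt` helper, and the prompt wording are all hypothetical; the study's actual data format and prompts are not described here.

```python
from dataclasses import dataclass, field


@dataclass
class UserProfile:
    """A fictional user. Field names are illustrative assumptions."""
    name: str
    birthdate: str  # ISO date string, e.g. "1990-04-12"
    interests: list[str] = field(default_factory=list)
    hobbies: list[str] = field(default_factory=list)


def build_prompt(profile: UserProfile, n: int = 10) -> str:
    """Render the structured attributes as plain text and ask for n candidates."""
    lines = [
        f"Name: {profile.name}",
        f"Birthdate: {profile.birthdate}",
        f"Interests: {', '.join(profile.interests)}",
        f"Hobbies: {', '.join(profile.hobbies)}",
        "",
        f"List {n} passwords this fictional user might choose, one per line.",
    ]
    return "\n".join(lines)


# A made-up profile; no real person's data is involved.
profile = UserProfile("Alex Morgan", "1990-04-12", ["cricket"], ["hiking"])
prompt = build_prompt(profile)
```

The resulting string would then be passed to each model, and the lines of its reply treated as a ranked list of guesses.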
The team evaluated accuracy using three standard ranking metrics, Hit@1, Hit@5, and Hit@10, which measure how often the correct password appears first, within the top five, or within the top ten of a model's outputs. They tested performance on both plaintext passwords and their SHA-256 hashes. The results were unequivocal: across all scenarios, accuracy never exceeded 1.5% at Hit@10. In other words, even among a model’s ten “best guesses,” the correct password almost never appeared. By contrast, modern GPU-based cracking methods can breach many passwords in seconds using traditional techniques.
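The Hit@k metric is simple to compute. The following is a minimal sketch, not the researchers' evaluation code: the function names and the sample data are invented, and it assumes that the hashed setting means comparing the SHA-256 digest of each guess against the target digest.

```python
import hashlib


def sha256_hex(password: str) -> str:
    """Hex-encoded SHA-256 digest of a UTF-8 password string."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


def hit_at_k(ranked_guesses, target, k, hashed=False):
    """True if the target is matched by one of the first k guesses.

    With hashed=True, `target` is a SHA-256 hex digest and each guess
    is hashed before comparison (an assumption about the hashed setting).
    """
    top_k = ranked_guesses[:k]
    if hashed:
        return any(sha256_hex(guess) == target for guess in top_k)
    return target in top_k


def hit_rate(cases, k, hashed=False):
    """Fraction of (ranked_guesses, target) cases that score a hit at k."""
    if not cases:
        return 0.0
    hits = sum(hit_at_k(guesses, target, k, hashed) for guesses, target in cases)
    return hits / len(cases)


# Made-up example: two fictional users, one of whom is guessed at rank 2.
cases = [
    (["alex1990", "cricket90", "Morgan!12"], "cricket90"),
    (["sam2001", "hiking01"], "Tr0ub4dor"),
]
rate_at_1 = hit_rate(cases, k=1)
rate_at_5 = hit_rate(cases, k=5)
```

In this toy data the correct password is never ranked first, so Hit@1 is 0, while one of the two users is matched within the top five, giving a Hit@5 of 0.5.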
To establish a baseline, the researchers also ran classical cracking tools, which rely on rule-based and combinatorial methods widely used in specialized utilities. These conventional approaches delivered dramatically higher success rates, outperforming LLMs across every major indicator. The conclusion is straightforward: established, purpose-built algorithms remain vastly superior to today’s fashionable general-purpose models at password guessing.
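To make "rule-based and combinatorial" concrete, here is a toy sketch of the general idea: a handful of base words is combined with a handful of suffix rules. The function name, the rules, and the sample inputs are all illustrative assumptions; they are not the tools or rule sets used in the study, which apply far larger rule sets.

```python
from itertools import product


def rule_based_candidates(name, birth_year, keywords, limit=50):
    """Combine profile tokens with a few simple suffix rules.

    Returns at most `limit` unique candidates in generation order.
    """
    bases = [name.lower(), name.capitalize()] + [k.lower() for k in keywords]
    year = str(birth_year)
    suffixes = ["", year, year[-2:], "123", "!"]
    combined = (base + suffix for base, suffix in product(bases, suffixes))
    # dict.fromkeys removes duplicates while preserving order.
    return list(dict.fromkeys(combined))[:limit]


# Fictional inputs only.
candidates = rule_based_candidates("alex", 1990, ["cricket"])
```

Three base words and five suffixes yield fifteen candidates here, including strings such as `Alex1990` and `cricket90`. The same kind of list can serve defensively, for example to reject a new password that is trivially derivable from a user's own profile.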
The authors also sought to understand the reasons behind this gap. Their analysis suggests that modern LLMs struggle to generalize learned password patterns to new, concrete contexts and are poor at explicitly “recalling” specific examples from their training data. Effective password inference would require specialized fine-tuning on real password leaks and targeted optimization — capabilities these general models currently lack.
In the end, the researchers draw a critical cybersecurity conclusion: in their present form, LLMs are not effective password-guessing tools and pose little meaningful threat to account security in this narrow domain. At the same time, the study opens avenues for further research — from safer approaches to password modeling to systems that improve account protection by better understanding attack strategies and preventing unauthorized access to sensitive data.
The authors emphasize that their experiment examined only three models and does not claim to represent the entire landscape of LLMs. Yet even now it reveals a significant limitation of these systems in adversarial scenarios. Future studies may enlarge the pool of tested models and develop new defensive methods grounded in understanding what AI still does poorly, especially when it comes to your passwords.