Researchers show AI models can jailbreak other LLMs
The paper finds that four attacker models, given only the instruction to "jailbreak this AI," successfully bypassed safeguards on nine production language models in 97.14% of 25,200 interactions. The attacks adapted over 10 turns using plain-language tactics like flattery and hypotheticals, suggesting current defenses can fail even without detailed human-crafted prompts.
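To make the multi-turn setup concrete, here is a minimal Python sketch of how an attacker model might be orchestrated against a target model. The paper's actual code is not reproduced here; the names `query_model` and `is_jailbroken`, and the message formats, are illustrative assumptions only.

```python
# A minimal sketch of a multi-turn attacker/target loop; this is NOT the paper's
# code. `query_model` and `is_jailbroken` are hypothetical placeholders you would
# wire to your own chat-completion API and judging procedure.
from typing import Dict, List


def query_model(model: str, messages: List[Dict[str, str]]) -> str:
    """Placeholder: replace with a call to your inference endpoint of choice."""
    raise NotImplementedError


def is_jailbroken(reply: str) -> bool:
    """Placeholder judge: a simple rule or a separate judge model could go here."""
    raise NotImplementedError


def run_attack(attacker: str, target: str, goal: str, max_turns: int = 10) -> bool:
    """Let an attacker model converse with a target model for up to `max_turns`,
    adapting its phrasing each turn based on the target's previous refusal."""
    attacker_history = [
        {"role": "system",
         "content": f"You are red-teaming another AI assistant. Objective: {goal}"}
    ]
    target_history: List[Dict[str, str]] = []

    for _ in range(max_turns):
        # Attacker drafts its next persuasion attempt (flattery, hypotheticals, ...).
        attempt = query_model(attacker, attacker_history)
        attacker_history.append({"role": "assistant", "content": attempt})

        # The target sees only the attacker's messages, as an ordinary user chat.
        target_history.append({"role": "user", "content": attempt})
        reply = query_model(target, target_history)
        target_history.append({"role": "assistant", "content": reply})

        if is_jailbroken(reply):
            return True

        # Feed the refusal back so the attacker can change tactics next turn.
        attacker_history.append(
            {"role": "user",
             "content": f"The target answered:\n{reply}\nTry a different angle."}
        )
    return False
```

The step worth noting, mirroring the summary above, is the feedback loop: the attacker reads the target's refusal and rewrites its approach before the next turn, which is what makes ten adaptive turns more effective than a single static prompt.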
Categories
LLM, Research, Safety