Researchers show AI models can jailbreak other LLMs
The paper finds that four attacker models, given only the instruction to "jailbreak this AI," successfully bypassed safeguards on nine production language models in 97.14% of 25,200 interactions. The attacks adapted over 10 turns using plain-language tactics like flattery and hypotheticals, suggesting current defenses can fail even without detailed human-crafted prompts.
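To make the multi-turn setup concrete, here is a minimal Python sketch of how an attacker model might be orchestrated against a target model. The paper's actual code is not reproduced here; the names `query_model` and `is_jailbroken`, and the message formats, are illustrative assumptions only.

```python
# A minimal sketch of a multi-turn attacker/target loop; this is NOT the paper's
# code. `query_model` and `is_jailbroken` are hypothetical placeholders you would
# wire to your own chat-completion API and judging procedure.
from typing import Dict, List


def query_model(model: str, messages: List[Dict[str, str]]) -> str:
    """Placeholder: replace with a call to your inference endpoint of choice."""
    raise NotImplementedError


def is_jailbroken(reply: str) -> bool:
    """Placeholder judge: a simple rule or a separate judge model could go here."""
    raise NotImplementedError


def run_attack(attacker: str, target: str, goal: str, max_turns: int = 10) -> bool:
    """Let an attacker model converse with a target model for up to `max_turns`,
    adapting its phrasing each turn based on the target's previous refusal."""
    attacker_history = [
        {"role": "system",
         "content": f"You are red-teaming another AI assistant. Objective: {goal}"}
    ]
    target_history: List[Dict[str, str]] = []

    for _ in range(max_turns):
        # Attacker drafts its next persuasion attempt (flattery, hypotheticals, ...).
        attempt = query_model(attacker, attacker_history)
        attacker_history.append({"role": "assistant", "content": attempt})

        # The target sees only the attacker's messages, as an ordinary user chat.
        target_history.append({"role": "user", "content": attempt})
        reply = query_model(target, target_history)
        target_history.append({"role": "assistant", "content": reply})

        if is_jailbroken(reply):
            return True

        # Feed the refusal back so the attacker can change tactics next turn.
        attacker_history.append(
            {"role": "user",
             "content": f"The target answered:\n{reply}\nTry a different angle."}
        )
    return False
```

The step worth noting, mirroring the summary above, is the feedback loop: the attacker reads the target's refusal and rewrites its approach before the next turn, which is what makes ten adaptive turns more effective than a single static prompt.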
Categories
LLM, Research, Safety