Don’t Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models

Abstract

The rapid advancement of machine learning and generative AI has catalyzed the development of large language models (LLMs). Enabled by their exceptional ability to understand and generate human-like text, these models are being increasingly integrated into our society. However, this integration also raises widespread concerns about potential misuse, especially with emerging jailbreak attacks, in which an adversary judiciously crafts prompts to circumvent security restrictions and elicit harmful content that the models are designed to prohibit.

Unlike other known vulnerabilities of machine learning models, the frontline of such attacks lies largely in online forums and among hobbyists, and a comprehensive study is still lacking. To bridge this gap, we begin by systematizing and measuring existing jailbreak prompts, using human annotation and metrics we propose that build on LLM benchmark studies. Through this process, we identify universal jailbreak prompts and examine the factors that contribute to their success. As a further step toward understanding the feasibility of users exploiting LLMs, we conduct a human study involving 92 participants from diverse backgrounds. We find that even inexperienced users can develop successful jailbreak attacks and, importantly, we identify jailbreak approaches not previously reported in the literature. Drawing on insights from the systematization and the human study, we also explore the potential for automatic jailbreaking using LLMs.
