RL-JACK: Reinforcement Learning-powered Black-box Jailbreaking Attack against LLMs

Jun 13, 2024·

Xuan Chen

Yuzhou Nie

Lu Yan

Yunshu Mao

Wenbo Guo

Xiangyu Zhang

· 0 min read

PDF

Abstract

Modern large language model (LLM) developers typically conduct a safety alignment to prevent an LLM from generating unethical or harmful content. This alignment process involves fine-tuning the model with human-labeled datasets, which include samples that refuse to answer unethical or harmful questions. However, recent studies have discovered that the safety alignment of LLMs can be bypassed by jailbreaking prompts. These prompts are designed to create specific conversation scenarios with a harmful question embedded. Querying an LLM with such prompts can mislead the model into responding to the harmful question. Most existing jailbreaking attacks either require model internals or extensive human interventions to generate jailbreaking prompts. More advanced techniques leverage genetic methods to enable automated and black-box attacks. However, the stochastic and random nature of genetic methods largely limits the effectiveness and efficiency of state-of-the-art (SOTA) jailbreaking attacks. In this paper, we propose RL-JACK, a novel black-box jailbreaking attack powered by deep reinforcement learning (DRL). We formulate the generation of jailbreaking prompts as a search problem and design a novel RL approach to solve it. Our method includes a series of customized designs to enhance the RL agent’s learning efficiency in the jailbreaking context. Notably, we devise an LLM-facilitated action space that enables diverse action variations while constraining the overall search space. Moreover, we propose a novel reward function that provides meaningful dense rewards for the agent toward achieving successful jailbreaking. Once trained, our agent can automatically generate diverse jailbreaking prompts against different LLMs. With rigorous analysis, we find that RL, as a deterministic search strategy, is more effective and has less randomness than stochastic search methods, such as genetic algorithms. Through extensive evaluations, we demonstrate that RL-JACK is overall much more effective than existing jailbreaking attacks against six SOTA LLMs, including large open-source models (e.g., Llama2-70b) and commercial models (GPT-3.5). We also show the RL-JACK’s resiliency against three SOTA defenses and its transferability across different models, including a very large model Llama2-70b. We further demonstrate the necessity of RL-JACK’s RL agent and the effectiveness of our action and reward designs through a detailed ablation study. Finally, we validate the insensitivity of RL-JACK to the variations in key hyper-parameters.

Type

Preprint

Last updated on Sep 28, 2025

Authors

Xuan Chen

PhD Candidate @ Purdue CS

← When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search Oct 1, 2024

Dynamics Model Based Adversarial Training For Competitive Reinforcement Learning Oct 1, 2023 →