Just two months ago, Gemini tried a bit too hard to be politically correct by representing all human faces in its generated images as people of color. Although this may be hilarious to some (if not many), it was evident that as Large Language Models (LLMs) advance their capabilities, so do their vulnerabilities and risks. This is because the complexity of a model is directly proportional to its output space, which naturally creates more opportunities for undesirable LLM security vulnerabilities, such as disclosing personal information and generating misinformation, bias, hate speech, or harmful content, and in the case of Gemini, it demonstrated severe inherent biases in its training data which was ultimately reflected in this outputs.
Here's another form of a vulnerability demonstrated by GitHub Copilot in 2021, which faced similar backlash for leaking sensitive information such as API keys and passwords:
Therefore, it’s crucial to red team your LLMs to protect against potential malicious users and harmful behavior to safeguard your company’s reputation from security and compliance risks.
But what exactly is LLM red teaming? I’m glad you asked.
LLM Red Teaming: Simulating Adversarial Attacks on Your LLM
Red Teaming LLM is a a way to test and evaluate LLMs through intentional adversarial prompting to help uncover any underlying undesirable or harmful model vulnerabilities. In other words, red teaming tries to get an LLM to output inappropriate responses that would be considered unsafe.
These undesirable or harmful vulnerabilities include:
- Hallucination and Misinformation: generating fabricated content and false information
- Harmful Content Generation (Offensive): creating harmful or malicious content, including violence, hate speech, or misinformation
- Stereotypes and Discrimination (Bias): propagating biased or prejudiced views that reinforce harmful stereotypes or discriminate against individuals or groups
- Data Leakage: preventing the model from unintentionally revealing sensitive or private information it may have been exposed to during training
- Non-robust Responses: evaluating the model’s ability to maintain consistent responses when subjected to slight prompt perturbations
- Undesirable Formatting: ensuring the model adheres to desired output formats under specified guidelines.
However, to effectively red team your LLM application at scale, it’s crucial to construct a sufficiently large red teaming dataset to simulate all possible adversarial attacks on your LLM. This is best accomplished by first constructing a starting initial set of adversarial red teaming prompts, before evolving it to increase the scope of attack. By progressively evolving your initial set of red team prompts, you can expand your dataset while enhancing its complexity and diversity. (For those interested in what data evolution entails, check out this synthetic data generation article I’ve written)
Lastly, the LLM responses to these prompts can be evaluated using LLM metrics such as toxicity, bias, or even exact match. Responses that do not meet the desired standards are utilized as valuable insights for enhancing the model.
You might also wonder what is the difference between a red teaming dataset and LLM benchmarks. While standardized LLM benchmarks are excellent tools for evaluating general-purpose LLMs like GPT-4, they focus mainly on assessing a model’s capabilities, not its vulnerabilities. In contrast, red teaming datasets are usually targeted to a specific LLM use case and a specific vulnerability.
(Check out this article I’ve written on benchmarking LLMs and when to use them.)
Similar to how a robust LLM benchmark should have a comprehensive benchmark dataset, a robust red teaming dataset should unveil a wide variety of harmful or unintended responses from an LLM model. But what do these adversarial prompts look like?
Adversarial Prompts
Adversarial prompts used in red teaming are exploitative prompts designed to sneakily bypass a model’s defenses. It is usually crafted according to several creative prompt engineering techniques, including:
- Prompt Injection
- Prompt Probing
- Gray box Attacks
- Jailbreaking
- Text Completion Exploitation
- Biased Prompt Attacks
These prompt techniques are meant to simulate how an actual malicious user would try to exploit weaknesses in your LLM system. For example, a malicious user might use prompt injection in attempt to get your financial LLM-based chatbot to leak personal identifiable information found in the data it was fine-tuned with. With this in mind, let’s begin our step-by-step guide on how to red team any LLM by exploring the various prompting techniques for generating the initial set of red teaming prompts.
Confident AI: The LLM Evaluation Platform
The all-in-one platform to evaluate and test LLM applications on the cloud, fully integrated with DeepEval.
A Step-By-Step Guide: Red Teaming LLMs
1. Prepare Initial Adversarial Attacks
In this section, I’ll guide you through various methods to prepare an effective set of starting red teaming attacks with code examples. I’ll begin by introducing you to 4 types of red teaming attacks:
- Direct Prompt Injection: This involves adding new instructions to see if the LLM disregards previous guidelines.
- Prompt Probing: This technique sends specific prompts to uncover hidden details about an LLM’s programming.
- Gray Box Prompt Attacks: These attacks use knowledge from an LLM’s responses to exploit known vulnerabilities.
- Jailbreaking: This strategy develops creative methods to coax the LLM into bypassing its ethical constraints without explicit directives.
1.1 Prompt Injection
Prompt injection involves crafting prompts that introduce new instructions to see if the LLM can be tricked into ignoring previous instructions and executing unintended commands. It typically follows this template:
To illustrate this technique in action, let’s consider an LLM application, FinBot, a financial advisor bot designed to assist users with investment advice and financial planning. FinBot has access to users’ private financial data, such as bank account balances and transaction histories, to provide personalized advice. Despite this, Finbot is programmed to protect user privacy at all costs.
Let’s explore how prompt injection could be used to extract confidential user information.
Example:
In this example, the prompt attempts to trick FinBot into disregarding its original role and confidentiality rules, instead becoming InfoBot with the goal of extracting sensitive financial information.
1.2 Prompt Probing
Let’s say that despite using various prompt injection techniques, FinBot’s system safeguards are too robust to bypass. One effective strategy to overcome this is prompt probing.
Prompt probing involves crafting specific prompts designed to reveal the LLM’s system prompt or other hidden information. Specifically, obtaining the LLM’s prompt may provide insights for extracting user information in a more targeted approach instead of blindly doing prompt injection.
Here’s a simple example:
Typically, this straightforward prompt will fail due to the LLM’s inherent filters and constraints designed to prevent such exploitation. However, adding complexity and creativity to the prompts may help bypass these protections. The key is to exploit the LLM’s weaknesses by continually refining your approach.
To illustrate, let’s return to our financial advisor bot, FinBot. We’ll attempt to probe the system to reveal hidden instructions or confidential guidelines it uses to protect user data.
In this example, the prompt is crafted to deceive FinBot into disclosing its internal instructions by initially performing a simple text modification task. By continuously refining such prompts, one might expose vulnerabilities that could lead to the disclosure of the LLM’s system prompt or other sensitive information.
1.3 Gray Box Prompt Attacks
With the insights gained from prompt probing, we can exploit the LLM more effectively using gray box prompt attacks. Gray box prompt attacks use partial knowledge from a model’s prompts and/or responses to create more sophisticated attacks.
Example:
In this example, the prompt leverages previously discovered information to manipulate FinBot into revealing potential scenarios where it might disclose user financial data.
By referencing specific conditions or keywords related to law enforcement or compliance and portraying the user as a legal entity, the prompt is designed to exploit the bot’s internal guidelines and illicitly extract sensitive user information.
1.4 Jailbreaking
Finally, jailbreaking techniques involve crafting creative prompts to indirectly persuade the LLM to circumvent its ethical constraints without explicitly embedding harmful instructions.
Example:
In this example, the prompt instructs FinBot to develop algorithms that not only analyze but also extract and summarize user financial data.
Another Example:
Here, the prompt directs FinBot to analyze investment portfolios and present the results in an interactive dashboard, while also requesting detailed insights into individual holdings and recent transactions.
2. Evolving Your Adversarial Attacks
In theory, the 4 types of red teaming attacks described earlier offer extensive creative possibilities for generating numerous test cases. However, in practice, you can only write so many prompts before running out of ideas…
Fortunately, it is possible to expand an initial list of 5 to 10 red teaming prompts into a dataset of over 1,000 test cases in just a few minutes using a method known as data evolution.
Data evolution is the process of iteratively enhancing an existing set of queries to generate more complex and diverse ones through prompt engineering. Using our example of a single Prompt Injection string from the previous section, we can evolve it into many brand-new prompts:
First, we need to define an evolution template that will take and transform our initial prompt:
This process can be repeated multiple times, and the results can be saved using the code below:
By using data evolution, you can take the limits off your creativity while you watch your dataset grow effortlessly!
3. Evaluating your LLM on Your Red Teaming Adversarial Dataset
Now that we have a complete red teaming dataset, it’s time to evaluate how well your language model (LLM) handles these prompts. This assessment involves 2 steps:
- Running your LLM on the dataset to generate responses: Simply input each red teaming prompt into your LLM.
- Assessing the responses using selected metrics: This step requires more consideration. You need to decide which metrics to use and how many are appropriate.
Let’s consider our financial chatbot example. The goal is to test whether FinBot can be manipulated into leaking confidential information. Therefore, our metric should measure the extent of confidential information leaked.
The choice of metric is directly linked to the specific vulnerabilities of the LLM we are trying to expose. For example, you should be measuring bias when red teaming for discrimination, and toxicity when you are assessing harmful content generation.
First, we’ll run and store the LLM responses along with their retrieval context and input (red teaming prompt).
Next, let’s use this information to calculate scores for each test case. But first, we need to define our metric. G-Eval allows us to easily create a robust metric for almost any use case by defining evaluation steps. In this example, we’ll define a G-Eval metric for privacy leakage.
Finally, it’s time to put FinBot to the test. We will now evaluate its performance using the defined metric!
The scores will indicate how well the model performs in each test case, highlighting any areas where improvements are needed to enhance its security features. This thorough evaluation is crucial to ensure that FinBot remains robust and reliable in real-world applications.
Red Teaming LLMs Using DeepEval
Even with all your newly-gained expertise, there are numerous considerations when red teaming an LLM at scale. You might be asking yourself questions like “How many prompts should I write?”, “How many evolutions should I define?”, “Are they effective?”, “How many metrics should I use?”, and “How can I use failing responses to improve my LLM?”. I might be hard selling here but hear me out: red teaming is possible but also extremely error prone when done without a proper evaluation framework.
If you wish to implement everything from scratch, by my guest, but if you want something tested and working out the box, you can use ⭐DeepEval⭐, the open-source LLM evaluation framework, I've done all the hard work for your already. DeepEval automates most of the process behind the scenes and simplifies red teaming LLMs at scale to just a few lines of code. Let’s end this article by exploring how to red team an LLM-based financial chatbot using DeepEval.
Defining your LLM Chatbot
First, we’ll set up FinBot, a financial advisor bot designed to provide users with investment advice and financial planning.
Red Team your LLM Chatbot
Next, we'll scan your LLM for vulnerabilities using DeepEval's red-teamer. The scan function automatically generates and evolves attacks based on user-provided vulnerabilities and attack enhancements, before they are evaluated using DeepEval's 40+ red-teaming metrics.
DeepEval provides everything you need out of the box (with support for 40+ vulnerabilities and 10+ enhancements). By experimenting with various attacks and vulnerabilities typical in a red-teaming environment, you’ll be able to design your ideal Red Teaming Experiment. (you can learn more about the red-teamer's functionalities here).
Conclusion
Today, we’ve explored the process and importance of red teaming LLMs extensively, introducing techniques like prompt injection and jailbreaking. We also discussed how synthetic data generation through data evolution provides a scalable solution for creating realistic red-teaming scenarios, and how to select metrics for evaluating your LLM against your red teaming dataset.
Additionally, we learned how to use DeepEval to red team your LLMs at scale to identify critical vulnerabilties. However, red teaming isn’t the only necessary precaution when taking your model to production. Remember, testing a model’s capabilities is crucial too, not just its vulnerabilities.
To achieve this, you can create custom synthetic datasets for evaluation, which can all be accessed through DeepEval to evaluate any custom LLM of your choice. You can learn all about it here.
If you find DeepEval useful, give it a star on GitHub ⭐ to stay updated on new releases as we continue to support more benchmarks.
Do you want to brainstorm how to evaluate your LLM (application)? Ask us anything in our discord. I might give you an “aha!” moment, who knows?
Confident AI: The LLM Evaluation Platform
The all-in-one platform to evaluate and test LLM applications on the cloud, fully integrated with DeepEval.