🌐 Website · 🤖 SEAS-Dataset · 📄 arXiv
We introduce the Self-Evolving Adversarial Safety (SEAS) optimization framework, which enhances LLM safety by leveraging data generated by the model itself.
We are excited to introduce the SEAS-Dataset, which includes both the SEAS-Test (2K) and SEAS-Train (16K) subsets.
Only a portion of the data is displayed here; for the full data, please refer to SEAS-Dataset.
As large language models (LLMs) continue to advance in capability and influence, ensuring their security and preventing harmful outputs has become crucial. A promising approach to address these concerns involves training models to automatically generate adversarial prompts for red teaming. However, the evolving subtlety of vulnerabilities in LLMs challenges the effectiveness of current adversarial methods, which struggle to specifically target and explore the weaknesses of these models. To tackle these challenges, we introduce the Self-Evolving Adversarial Safety (SEAS) optimization framework, which enhances security by leveraging data generated by the model itself. SEAS operates through three iterative stages: Initialization, Attack, and Adversarial Optimization, refining both the Red Team and Target models to improve robustness and safety. This framework reduces reliance on manual testing and significantly enhances the security capabilities of LLMs. Our contributions include a novel adversarial framework and a comprehensive safety dataset; after three iterations, the Target model achieves a security level comparable to GPT-4's, while the Red Team model shows a marked increase in attack success rate (ASR) against advanced models.
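For readers who prefer code to prose, the sketch below outlines one SEAS iteration as we read it from the abstract. It is not the released implementation, and every helper name in it (`red_team_generate`, `target_respond`, `safety_judge`, `preference_update`) is a placeholder we introduce for illustration; the actual pipeline is built on vLLM and Llama Factory.

```python
# Minimal, hypothetical sketch of one SEAS round (Attack + Adversarial Optimization)
# on top of an already-initialized Red Team / Target model pair.
# All helper callables are placeholders, not the paper's API.

from typing import Callable, List, Tuple


def seas_iteration(
    seed_prompts: List[str],
    red_team_generate: Callable[[List[str]], List[str]],  # Red Team proposes adversarial prompts
    target_respond: Callable[[List[str]], List[str]],      # Target model answers those prompts
    safety_judge: Callable[[str, str], bool],              # True if a (prompt, response) pair is unsafe
    preference_update: Callable[[str, List[Tuple[str, str, bool]]], None],  # optimize a model from labels
) -> List[Tuple[str, str, bool]]:
    """Run one Attack + Adversarial Optimization round and return the labeled pairs."""
    # Attack stage: turn seed prompts into adversarial prompts.
    adversarial_prompts = red_team_generate(seed_prompts)

    # The Target model responds; a judge labels each (prompt, response) pair.
    responses = target_respond(adversarial_prompts)
    labeled = [
        (prompt, response, safety_judge(prompt, response))
        for prompt, response in zip(adversarial_prompts, responses)
    ]

    # Adversarial Optimization stage: successful attacks reward the Red Team model
    # and become examples the Target model should learn to refuse.
    preference_update("red_team", labeled)
    preference_update("target", labeled)
    return labeled
```

Iterating this loop is what the abstract refers to as the three rounds after which the Target model's safety is compared with GPT-4.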
The SEAS dataset includes examples collected through crowdsourcing platforms and then manually rewritten and labeled, as well as additional examples generated through model-based augmentation of open-source safety datasets such as CPAD, HarmfulQA, and ALERT.
The SEAS dataset integrates two critical dimensions: Risk Categories and Attack Styles, providing a comprehensive framework for analyzing harmful interactions in dialogue systems. Risk Categories focus on identifying the types of potential harm embedded in the content, such as privacy violations, health risks, unsafe instructions, and discriminatory language. These categories emphasize the nature and impact of the risks posed by the content itself. Attack Styles, on the other hand, describe the specific tactics or techniques used to exploit the system’s vulnerabilities, including methods like jailbreaking, token manipulation, and goal hijacking. These dimensions collectively offer a structured approach to understanding both the risks and the mechanisms by which harmful content is generated. Detailed descriptions of all categories and styles are summarized below.
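To make the two dimensions concrete, here is a hypothetical record layout; the field names (`risk_category`, `attack_style`, and so on) are our own illustration, not the released schema, so please consult SEAS-Dataset for the actual fields.

```python
# Illustrative only: a hypothetical SEAS-style record combining the two
# annotation dimensions. Field names and values are examples, not the real schema.

example_record = {
    "prompt": "…",                           # adversarial prompt text (elided here)
    "risk_category": "Unsafe Instruction",   # what kind of harm the content embeds
    "attack_style": "Goal Hijacking",        # which tactic is used to elicit it
    "split": "SEAS-Train",                   # SEAS-Train (16K) or SEAS-Test (2K)
}


def select(records, risk_category, attack_style):
    """Filter a list of such records by both dimensions at once."""
    return [
        r for r in records
        if r["risk_category"] == risk_category and r["attack_style"] == attack_style
    ]
```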
The SEAS dataset and its family contain content that may be offensive or upsetting. Topics covered include, but are not limited to, discriminatory language and discussions of abuse, violence, self-harm, exploitation, and other potentially distressing subject matter. Please engage with the dataset responsibly and in accordance with your own personal risk tolerance. The dataset is intended for research purposes, specifically research aimed at creating safer and less harmful AI systems. The views and opinions expressed in the dataset do not represent the views of the BUPT-PRIS Team or any of its members. We emphasize that the dataset should not be used to train dialogue agents, as doing so is likely to result in harmful model behavior. The primary objective of this dataset is to facilitate research that could minimize or prevent the harm caused by AI systems.
Please star our GitHub repo and cite our work if you find the repository helpful.
@article{diao2024seas,
  title={SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models},
  author={Diao, Muxi and Li, Rumei and Liu, Shiyang and Liao, Guogang and Wang, Jingang and Cai, Xunliang and Xu, Weiran},
  journal={arXiv preprint arXiv:2408.02632},
  year={2024}
}
The SEAS dataset and its family are released under the CC BY-NC 4.0 License.
Our code is primarily adapted from vLLM and Llama Factory, with modifications to suit our requirements.