Jiawen Deng


2023

pdf bib
InstructSafety: A Unified Framework for Building Multidimensional and Explainable Safety Detector through Instruction Tuning
Zhexin Zhang | Jiale Cheng | Hao Sun | Jiawen Deng | Minlie Huang
Findings of the Association for Computational Linguistics: EMNLP 2023

Safety detection has been an increasingly important topic in recent years and it has become even more necessary to develop reliable safety detection systems with the rapid development of large language models. However, currently available safety detection systems have limitations in terms of their versatility and interpretability. In this paper, we first introduce InstructSafety, a safety detection framework that unifies 7 common sub-tasks for safety detection. These tasks are unified into a similar form through different instructions. We then conduct a comprehensive survey of existing safety detection datasets and process 39 human-annotated datasets for instruction tuning. We also construct adversarial samples to enhance the model’s robustness. After fine-tuning Flan-T5 on the collected data, we have developed Safety-Flan-T5, a multidimensional and explainable safety detector. We conduct comprehensive experiments on a variety of datasets and tasks, and demonstrate the strong performance of Safety-Flan-T5 in comparison to supervised baselines and served APIs (Perspective API, ChatGPT and InstructGPT). We will release the processed data, fine-tuned Safety-Flan-T5 and related code for public use.

2022

pdf bib
COLD: A Benchmark for Chinese Offensive Language Detection
Jiawen Deng | Jingyan Zhou | Hao Sun | Chujie Zheng | Fei Mi | Helen Meng | Minlie Huang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Offensive language detection is increasingly crucial for maintaining a civilized social media platform and deploying pre-trained language models. However, this task in Chinese is still under exploration due to the scarcity of reliable datasets. To this end, we propose a benchmark –COLD for Chinese offensive language analysis, including a Chinese Offensive Language Dataset –COLDATASET and a baseline detector –COLDETECTOR which is trained on the dataset. We show that the COLD benchmark contributes to Chinese offensive language detection which is challenging for existing resources. We then deploy the COLDETECTOR and conduct detailed analyses on popular Chinese pre-trained language models. We first analyze the offensiveness of existing generative models and show that these models inevitably expose varying degrees of offensive issues. Furthermore, we investigate the factors that influence the offensive generations, and we find that anti-bias contents and keywords referring to certain groups or revealing negative attitudes trigger offensive outputs easier.

pdf bib
On the Safety of Conversational Models: Taxonomy, Dataset, and Benchmark
Hao Sun | Guangxuan Xu | Jiawen Deng | Jiale Cheng | Chujie Zheng | Hao Zhou | Nanyun Peng | Xiaoyan Zhu | Minlie Huang
Findings of the Association for Computational Linguistics: ACL 2022

Dialogue safety problems severely limit the real-world deployment of neural conversational models and have attracted great research interests recently. However, dialogue safety problems remain under-defined and the corresponding dataset is scarce. We propose a taxonomy for dialogue safety specifically designed to capture unsafe behaviors in human-bot dialogue settings, with focuses on context-sensitive unsafety, which is under-explored in prior works. To spur research in this direction, we compile DiaSafety, a dataset with rich context-sensitive unsafe examples. Experiments show that existing safety guarding tools fail severely on our dataset. As a remedy, we train a dialogue safety classifier to provide a strong baseline for context-sensitive dialogue unsafety detection. With our classifier, we perform safety evaluations on popular conversational models and show that existing dialogue systems still exhibit concerning context-sensitive safety problems.

pdf bib
Towards Identifying Social Bias in Dialog Systems: Framework, Dataset, and Benchmark
Jingyan Zhou | Jiawen Deng | Fei Mi | Yitong Li | Yasheng Wang | Minlie Huang | Xin Jiang | Qun Liu | Helen Meng
Findings of the Association for Computational Linguistics: EMNLP 2022

Among all the safety concerns that hinder the deployment of open-domain dialog systems (e.g., offensive languages, biases, and toxic behaviors), social bias presents an insidious challenge. Addressing this challenge requires rigorous analyses and normative reasoning. In this paper, we focus our investigation on social bias measurement to facilitate the development of unbiased dialog systems. We first propose a novel Dial-Bias Framework for analyzing the social bias in conversations using a holistic method beyond bias lexicons or dichotomous annotations. Leveraging the proposed framework, we further introduce the CDial-Bias Dataset which is, to the best of our knowledge, the first annotated Chinese social bias dialog dataset. We also establish a fine-grained dialog bias measurement benchmark and conduct in-depth ablation studies to shed light on the utility of the detailed annotations in the proposed dataset. Finally, we evaluate representative Chinese generative models with our classifiers to unveil the presence of social bias in these systems.

pdf bib
Constructing Highly Inductive Contexts for Dialogue Safety through Controllable Reverse Generation
Zhexin Zhang | Jiale Cheng | Hao Sun | Jiawen Deng | Fei Mi | Yasheng Wang | Lifeng Shang | Minlie Huang
Findings of the Association for Computational Linguistics: EMNLP 2022

Large pretrained language models can easily produce toxic or biased content, which is prohibitive for practical use. In order to detect such toxic generations, existing methods rely on templates, real-world data extraction, crowdsourcing workers or automatic generation to construct adversarial contexts that are likely to induce toxic generations. However, what type of context is more likely to induce unsafe responses is still under-explored. In this paper, we identify that context toxicity and context category (e.g., profanity, insult, drugs, etc.) are two important factors to cause safety issues in response generation. Hence, we propose a method called reverse generation to construct adversarial contexts conditioned on a given response, with the flexibility to control category, toxicity level and inductivity of the generated contexts. Via reverse generation, we augment the existing BAD dataset and construct a new dataset BAD+ which contains more than 120K diverse and highly inductive contexts in 12 categories. We test three popular pretrained dialogue models (Blender, DialoGPT and Plato2) and find that BAD+ can largely expose their safety problems. Furthermore, we show that BAD+ can greatly enhance the safety of generation, and we reveal the key factors of safety improvement. Our code and dataset is available at https://github.com/thu-coai/Reverse_Generation.