Zhengyu Zhao
2024
Composite Backdoor Attacks Against Large Language Models
Hai Huang
|
Zhengyu Zhao
|
Michael Backes
|
Yun Shen
|
Yang Zhang
Findings of the Association for Computational Linguistics: NAACL 2024
Large language models (LLMs) have demonstrated superior performance compared to previous methods on various tasks, and often serve as the foundation models for many researches and services. However, the untrustworthy third-party LLMs may covertly introduce vulnerabilities for downstream tasks. In this paper, we explore the vulnerability of LLMs through the lens of backdoor attacks. Different from existing backdoor attacks against LLMs, ours scatters multiple trigger keys in different prompt components. Such a Composite Backdoor Attack (CBA) is shown to be stealthier than implanting the same multiple trigger keys in only a single component. CBA ensures that the backdoor is activated only when all trigger keys appear. Our experiments demonstrate that CBA is effective in both natural language processing (NLP) and multimodal tasks. For instance, with 3% poisoning samples against the LLaMA-7B model on the Emotion dataset, our attack achieves a 100% Attack Success Rate (ASR) with a False Triggered Rate (FTR) below 2.06% and negligible model accuracy degradation. Our work highlights the necessity of increased security research on the trustworthiness of foundation LLMs.
2021
What Did You Refer to? Evaluating Co-References in Dialogue
Wei-Nan Zhang
|
Yue Zhang
|
Hanlin Tang
|
Zhengyu Zhao
|
Caihai Zhu
|
Ting Liu
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
2020
Profile Consistency Identification for Open-domain Dialogue Agents
Haoyu Song
|
Yan Wang
|
Wei-Nan Zhang
|
Zhengyu Zhao
|
Ting Liu
|
Xiaojiang Liu
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Maintaining a consistent attribute profile is crucial for dialogue agents to naturally converse with humans. Existing studies on improving attribute consistency mainly explored how to incorporate attribute information in the responses, but few efforts have been made to identify the consistency relations between response and attribute profile. To facilitate the study of profile consistency identification, we create a large-scale human-annotated dataset with over 110K single-turn conversations and their key-value attribute profiles. Explicit relation between response and profile is manually labeled. We also propose a key-value structure information enriched BERT model to identify the profile consistency, and it gained improvements over strong baselines. Further evaluations on downstream tasks demonstrate that the profile consistency identification model is conducive for improving dialogue consistency.
Search
Co-authors
- Weinan Zhang 2
- Ting Liu 2
- Hai Huang 1
- Michael Backes 1
- Yun Shen 1
- show all...