You Don’t Know My Favorite Color: Preventing Dialogue Representations from Revealing Speakers’ Private Personas

Haoran Li, Yangqiu Song, Lixin Fan


Abstract
Social chatbots, also known as chit-chat chatbots, have evolved rapidly with large pretrained language models. Despite this progress, privacy concerns have arisen: training data of large language models can be extracted via model inversion attacks. Moreover, the datasets used to train chatbots contain many private conversations between two individuals. In this work, we further investigate privacy leakage from the hidden states of chatbots trained by language modeling, which has not yet been well studied. We show that speakers’ personas can be inferred with high accuracy by a simple neural network. To address this, we propose effective defense objectives that prevent persona leakage from hidden states. Extensive experiments demonstrate that the proposed defense objectives reduce the attack accuracy from 37.6% to 0.5%, while preserving language models’ powerful generation ability.
Anthology ID:
2022.naacl-main.429
Volume:
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month:
July
Year:
2022
Address:
Seattle, United States
Editors:
Marine Carpuat, Marie-Catherine de Marneffe, Ivan Vladimir Meza Ruiz
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
5858–5870
URL:
https://aclanthology.org/2022.naacl-main.429
DOI:
10.18653/v1/2022.naacl-main.429
Cite (ACL):
Haoran Li, Yangqiu Song, and Lixin Fan. 2022. You Don’t Know My Favorite Color: Preventing Dialogue Representations from Revealing Speakers’ Private Personas. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5858–5870, Seattle, United States. Association for Computational Linguistics.
Cite (Informal):
You Don’t Know My Favorite Color: Preventing Dialogue Representations from Revealing Speakers’ Private Personas (Li et al., NAACL 2022)
PDF:
https://aclanthology.org/2022.naacl-main.429.pdf
Software:
 2022.naacl-main.429.software.zip
Video:
 https://aclanthology.org/2022.naacl-main.429.mp4
Code:
 hkust-knowcomp/persona_leakage_and_defense_in_gpt-2