Zhouhang Xie


2024

Few-shot Dialogue Strategy Learning for Motivational Interviewing via Inductive Reasoning
Zhouhang Xie | Bodhisattwa Prasad Majumder | Mengjie Zhao | Yoshinori Maeda | Keiichi Yamada | Hiromi Wakaki | Julian McAuley
Findings of the Association for Computational Linguistics: ACL 2024

We consider Motivational Interviewing (MI): the task of building a dialogue system that can motivate users to adopt positive lifestyle changes. Addressing this task requires a system that can infer how to motivate a user effectively. We propose DIIR, a framework that learns conversation strategies from expert demonstrations in the form of natural language inductive rules and applies them at inference time. Automatic and human evaluations of instruction-following large language models show that the natural language strategy descriptions discovered by DIIR improve active listening, reduce unsolicited advice, and promote more collaborative and less authoritative conversations, outperforming in-context demonstrations that are over 50 times longer.
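
The minimal sketch below illustrates the induce-then-apply pattern the abstract describes: an LLM first summarizes each expert turn as a reusable natural-language rule, then conditions its own replies on the collected rules instead of on the full demonstrations. The prompts, helper names, and the `complete` wrapper are hypothetical placeholders, not DIIR's actual implementation.

```python
# Illustrative sketch of rule induction from expert demonstrations,
# in the spirit of DIIR. `complete` stands for any text-in/text-out
# LLM call; everything here is an assumption, not the paper's code.
from typing import Callable, List

def induce_rule(complete: Callable[[str], str],
                context: str, expert_reply: str) -> str:
    """Ask the LLM to explain, as one general rule, why the expert
    responded this way in this dialogue context."""
    prompt = (
        "Dialogue context:\n" + context +
        "\nExpert counselor reply:\n" + expert_reply +
        "\nState, as one general rule, the conversational strategy "
        "the expert followed (e.g., 'reflect before advising')."
    )
    return complete(prompt)

def apply_rules(complete: Callable[[str], str],
                context: str, rules: List[str]) -> str:
    """Generate a reply conditioned on the induced strategy rules,
    which are far shorter than the demonstrations they summarize."""
    prompt = (
        "Follow these counseling strategies:\n" +
        "\n".join("- " + r for r in rules) +
        "\n\nDialogue context:\n" + context + "\nCounselor reply:"
    )
    return complete(prompt)
```

One design point this makes concrete: the rules act as a compressed, human-readable policy, which is why a handful of short strategy descriptions can compete with in-context demonstrations many times their length.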

2021

What Models Know About Their Attackers: Deriving Attacker Information From Latent Representations
Zhouhang Xie | Jonathan Brophy | Adam Noack | Wencong You | Kalyani Asthana | Carter Perkins | Sabrina Reis | Zayd Hammoudeh | Daniel Lowd | Sameer Singh
Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

Adversarial attacks crafted against NLP models are increasingly becoming practical threats. Although various methods have been developed to detect adversarial attacks, securing learning-based NLP systems in practice requires more than identifying and evading perturbed instances. To this end, we propose a new set of adversary identification tasks, Attacker Attribute Classification via Textual Analysis (AACTA), that attempts to recover detailed information about an attacker from the adversarial text it produces. Specifically, given a piece of adversarial text, we aim to localize the perturbed tokens, identify the attacker's access level to the target model, determine the evasion mechanism used, and specify the perturbation type employed by the attacking algorithm. Our contributions are as follows: we formalize the task of classifying attacker attributes, and create a benchmark spanning target models from the sentiment classification and abuse detection domains. We show that signals from BERT models and from the target models themselves can be used to train classifiers that reveal properties of the attacking algorithms. We demonstrate that adversarial attacks leave interpretable traces in the feature spaces of both pre-trained language models and target models, making AACTA a promising direction towards more trustworthy NLP systems.
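
As an illustration of the latent-representation probing the abstract describes, the sketch below trains a linear classifier to predict one attacker attribute from sentence embeddings of adversarial texts. The random 768-dimensional embeddings, the three-class perturbation-type label, and the logistic-regression probe are stand-in assumptions, not the paper's exact setup.

```python
# Illustrative probe for attacker-attribute classification from latent
# representations, in the spirit of AACTA; not the paper's actual pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def probe_attacker_attribute(features: np.ndarray,
                             labels: np.ndarray) -> float:
    """Fit a linear probe mapping sentence embeddings of adversarial
    texts to an attacker attribute (e.g., perturbation type) and
    report held-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.2, random_state=0, stratify=labels)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

# Toy usage with random stand-in embeddings; in practice `features`
# would be hidden states from a pre-trained LM or the target model.
rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 768))   # 200 texts, 768-dim embeddings
labs = rng.integers(0, 3, size=200)   # 3 hypothetical perturbation types
print(f"probe accuracy: {probe_attacker_attribute(feats, labs):.2f}")
```

Accuracy well above chance on real embeddings would indicate that the attack leaves a recoverable trace in the representation space, which is the core claim the abstract makes.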