Zheng Hui

2026

Psychological insights have long shaped pivotal NLP breakthroughs, from attention mechanisms to reinforcement learning and social modeling. As Large Language Models (LLMs) develop, there is a rising consensus that psychology is essential for capturing human-like cognition, behavior, and interaction.This paper reviews how psychological theories can inform and enhance stages of LLM development. Our review integrates insights from six subfields of psychology, including cognitive, developmental, behavioral, social, personality psychology, and psycholinguistics. With stage-wise analysis, we highlight current trends and gaps in how psychological theories are applied. By examining both cross-domain connections and points of tension, we aim to bridge disciplinary divides and promote more thoughtful integration of psychology into NLP research.

pdf bib abs

Conversational recommender systems (CRS) have advanced with large language models, showing strong results in domains like movies. These domains typically involve fixed content and passive consumption, where user preferences can be matched by genre or theme. In contrast, games present distinct challenges: fast-evolving catalogs, interaction-driven preferences (e.g., skill level, mechanics, hardware), and increased risk of unsafe responses in open-ended conversation. We propose MATCHA, a multi-agent framework for CRS that assigns specialized agents for intent parsing, tool-augmented retrieval, multi-LLM ranking with reflection, explanation, and risk control which enabling finer personalization, long-tail coverage, and stronger safety. Evaluated on real user request dataset, MATCHA outperforms six baselines across eight metrics, improving Hit@5 by 20%, reducing popularity bias by 24%, and achieving 97.9% adversarial defense. Human and virtual-judge evaluations confirm improved explanation quality and user alignment. Code will be released upon acceptance.

2025

pdf bib abs

WinSpot: GUI Grounding Benchmark with Multimodal Large Language Models
Zheng Hui | Yinheng Li | Dan Zhao | Colby Banbury | Tianyi Chen | Kazuhito Koishida
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Graphical User Interface (GUI) automation relies on accurate GUI grounding. However, obtaining large-scale, high-quality labeled data remains a key challenge, particularly in desktop environments like Windows Operating System (OS). Existing datasets primarily focus on structured web-based elements, leaving a gap in real-world GUI interaction data for non-web applications. To address this, we introduce a new framework that leverages LLMs to generate large-scale GUI grounding data, enabling automated and scalable labeling across diverse interfaces. To ensure high accuracy and reliability, we manually validated and refined 5,000 GUI coordinate-instruction pairs, creating WinSpot—the first benchmark specifically designed for GUI grounding tasks in Windows environments. WinSpot provides a high-quality dataset for training and evaluating visual GUI agents, establishing a foundation for future research in GUI automation across diverse and unstructured desktop environments.

pdf bib abs

Propaganda plays a critical role in shaping public opinion and fueling disinformation. While existing research primarily focuses on identifying propaganda techniques, it lacks the ability to capture the broader motives and the impacts of such content. To address these challenges, we introduce PropaInsight, a conceptual framework grounded in foundational social science research, which systematically dissects propaganda into techniques, arousal appeals, and underlying intent. PropaInsight offers a more granular understanding of how propaganda operates across different contexts. Additionally, we present PropaGaze, a novel dataset that combines human-annotated data with high-quality synthetic data generated through a meticulously designed pipeline. Our experiments show that off-the-shelf LLMs struggle with propaganda analysis, but PropaGaze significantly improves performance. Fine-tuned Llama-7B-Chat achieves 203.4% higher text span IoU in technique identification and 66.2% higher BertScore in appeal analysis compared to 1-shot GPT-4-Turbo. Moreover, PropaGaze complements limited human-annotated data in data-sparse and cross-domain scenarios, demonstrating its potential for comprehensive and generalizable propaganda analysis.

pdf bib abs

Cross-cultural Sentiment Analysis of Social Media Responses to a Sudden Crisis Event
Zheng Hui | Zihang Xu | John Kender
Proceedings of the Fourth Workshop on NLP for Positive Impact (NLP4PI)

Although the responses to events such as COVID-19 have been extensively studied, research on sudden crisis response in a multicultural context is still limited. In this paper, our contributions are 1)We examine cultural differences in social media posts related to such events in two different countries, specifically the United Kingdom lockdown of 2020-03-23 and the China Urumqi fire1 of 2022-11-24. 2) We extract the emotional polarity of tweets and weibos gathered temporally adjacent to those two events, by fine-tuning transformer-based language models for each language. We evaluate each model’s performance on 2 benchmarks, and show that, despite being trained on a relatively small amount of data, they exceed baseline accuracies. We find that in both events, the increase in negative responses is both dramatic and persistent, and does not return to baseline even after two weeks. Nevertheless, the Chinese dataset reflects, at the same time, positive responses to subsequent government action. Our study is one of the first to show how sudden crisis events can be used to explore affective reactions across cultures

2024

pdf bib

Enhancing Pre-Trained Generative Language Models with Question Attended Span Extraction on Machine Reading Comprehension
Lin Ai | Zheng Hui | Zizhou Liu | Julia Hirschberg
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

pdf bib abs

ToxiCraft: A Novel Framework for Synthetic Generation of Harmful Information
Zheng Hui | Zhaoxiao Guo | Hang Zhao | Juanyong Duan | Congrui Huang
Findings of the Association for Computational Linguistics: EMNLP 2024

In different NLP tasks, detecting harmful content is crucial for online environments, especially with the growing influence of social media. However, previous research has two main issues: 1) a lack of data in low-resource settings, and 2) inconsistent definitions and criteria for judging harmful content, requiring classification models to be robust to spurious features and diverse. We propose Toxicraft, a novel framework for synthesizing datasets of harmful information to address these weaknesses. With only a small amount of seed data, our framework can generate a wide variety of synthetic, yet remarkably realistic, examples of toxic information. Experimentation across various datasets showcases a notable enhancement in detection model robustness and adaptability, surpassing or close to the gold labels.

Co-authors

Venues

NLP4PI1

WS1

Fix author