Workshop on Bridging Human–Computer Interaction and Natural Language Processing (2025)

Volumes

Proceedings of the Fourth Workshop on Bridging Human-Computer Interaction and Natural Language Processing (HCI+NLP) 26 papers

pdf (full)
bib (full) Proceedings of the Fourth Workshop on Bridging Human-Computer Interaction and Natural Language Processing (HCI+NLP)

pdf bib abs
Digital Tongues: Internet Language, Collective Identity, and Implications for Human-Computer Interaction
Zi-Xiang Lin

Nowadays, internet languages, including emojis, memes, hashtags, and slang, have become vital in constructing online communities’ collective identities. However, all these forms of internet language can sometimes disempower people from other generations or cultures. This position paper presents an argument explaining how online forms of communication create social belonging for specific groups at the expense of users, and especially elderly people, due to interpretation hurdles. The present study aims to evaluate the relationship between the internet language and online collective identity, highlighting how patterns in internet language can inform human- computer interaction (HCI) by revealing how users’ express identity, inclusion, and exclusion online.

Online spaces provide individuals with the opportunity to engage in discussions on important topics and make collective decisions, regardless of their geographic location or time zone. However, without adequate support and careful design, such discussions often suffer from a lack of structure and civility in the exchange of opinions. Artificial intelligence (AI) offers a promising avenue for helping both participants and organizers in managing large-scale online participation processes. This paper introduces an extension of adhocracy+, a large-scale open-source participation platform. Our extension features two AI-supported debate modules designed to improve discussion quality and foster participant interaction.In a large-scale user study we examined the effects and usability of both modules. We report our findings in this paper. The extended platform is available at https://github.com/mabehrendt/discuss2.0.

pdf bib abs
User-Centric Design Paradigms for Trust and Control in Human-LLM-Interactions: A Survey
Milena Belosevic

As LLMs become widespread, trust in their behavior becomes increasingly important. For NLP research, it is crucial to ensure that not only AI designers and developers, but also end users, are enabled to control the properties of trustworthy LLMs, such as transparency, privacy, or accuracy. However, involving end users in this process remains a practical challenge. Based on a design-centered survey of methods developed in recent papers from HCI and NLP venues, this paper proposes seven design paradigms that can be integrated in NLP research to enhance end-user control over the trustworthiness of LLMs. We discuss design gaps and challenges of applying these paradigms in NLP and propose future research directions.

pdf bib abs
TripleCheck: Transparent Post-Hoc Verification of Biomedical Claims in AI-Generated Answers
Ana Valeria González | Sidsel Boldsen | Roland Hangelbroek

Retrieval Augmented Generation (RAG) has advanced Question Answering (QA) by connecting Large Language Models (LLMs) to external knowledge. However, these systems can still produce answers that are unsupported, lack clear traceability, or misattribute information — a critical issue in the biomedical domain where accuracy, trust and control are essential. We introduce TripleCheck, a post-hoc framework that breaks down an LLM’s answer into factual triples and checks each against both the retrieved context and a biomedical knowledge graph. By highlighting which statements are supported, traceable, or correctly attributed, TripleCheck enables users to spot gaps, unsupported claims, and misattributions, prompting more careful follow up. We present the TripleCheck framework, evaluate it on the SciFact benchmark, analyze its limitations, and share preliminary expert feedback. Results show that TripleCheck provides nuanced insight, potentially supporting greater trust and safer AI adoption in biomedical applications.

pdf bib abs
Rethinking Search: A Study of University Students’ Perspectives on Using LLMs and Traditional Search Engines in Academic Problem Solving
Md. Faiyaz Abdullah Sayeedi | Md. Sadman Haque | Zobaer Ibn Razzaque | Robiul Awoul Robin | Sabila Nawshin

With the increasing integration of Artificial Intelligence (AI) in academic problem solving, university students frequently alternate between traditional search engines like Google and large language models (LLMs) for information retrieval. This study explores students’ perceptions of both tools, emphasizing usability, efficiency, and their integration into academic workflows. Employing a mixed-methods approach, we surveyed 109 students from diverse disciplines and conducted in-depth interviews with 12 participants. Quantitative analyses, including ANOVA and chi-square tests, were used to assess differences in efficiency, satisfaction, and tool preference. Qualitative insights revealed that students commonly switch between GPT and Google: using Google for credible, multi-source information and GPT for summarization, explanation, and drafting. While neither tool proved sufficient on its own, there was a strong demand for a hybrid solution. In response, we developed a prototype, a chatbot embedded within the search interface, that combines GPT’s conversational capabilities with Google’s reliability to enhance academic research and reduce cognitive load.

Accessing government welfare schemes in India remains difficult for emergent users—individuals with limited literacy, digital familiarity, or language support. This paper compares two mobile platforms that deliver the same scheme-related information but differ in interaction modality: myScheme, a government-built, form-based Android application, and Prabodhini, a voice-based conversational prototype powered by generative AI and Retrieval-Augmented Generation (RAG). Through a task-based comparative study with 15 low-income participants, we examine usability, task completion time, and user preference. Drawing on theories such as the Gulf of Execution and Zipf’s Law of Least Effort, we show that Prabodhini’s conversational design and support for natural language input better align with emergent users’ mental models and practices. Our findings highlight the value of multimodal, voice-first NLP systems for improving trust, access, and inclusion in public digital services. We discuss implications for designing accessible language technologies for marginalised populations.

Voice-controlled interfaces can support older adults in clinical contexts – with chatbots being a prime example – but reliable Automatic Speech Recognition (ASR) for underrepresented groups remains a bottleneck. This study evaluates state-of-the-art ASR models on language use of older Dutch adults, who interacted with the Welzijn.AI chatbot designed for geriatric contexts. We benchmark generic multilingual ASR models, and models fine-tuned for Dutch spoken by older adults, while also considering processing speed. Our results show that generic multilingual models outperform fine-tuned models, which suggests recent ASR models can generalise well out of the box to real-world datasets. Moreover, our results indicate that truncating generic models is helpful in balancing the accuracy-speed trade-off. Nonetheless, we also find inputs which cause a high word error rate and place them in context.

The advancement of mobile GUI agents has opened new opportunities for automating tasks on mobile devices. Training these agents requires large-scale high-quality data, which is prohibitively expensive when relying on human labor. Given the vast population of global mobile phone users, if automated data collection from them becomes feasible, the resulting data volume and the subsequently trained mobile agents could reach unprecedented levels. Nevertheless, two major challenges arise: (1) extracting user instructions without human intervention and (2) utilizing distributed user data while preserving privacy.To tackle these challenges, we propose MobileA3gent, a collaborative framework that trains mobile GUI Agents using decentralized self-sourced data from diverse users. The framework comprises two components, each targeting a specific challenge: (1) Auto-Annotation, which enables the automatic collection of high-quality datasets during users’ routine phone usage with minimal cost. (2) FedVLM-A, which enhances federated VLM training under non-IID distributions by incorporating adapted global aggregation based on both episode-level and step-level variability. Extensive experiments prove that MobileA3gent achieves superior performance over traditional approaches at only 1% of the cost, highlighting its potential for real-world applications. Our code is publicly available at: https://anonymous.4open.science/r/MobileA3gent-Anonymous.

This paper investigates the effectiveness of TikTok’s enforcement mechanisms for limiting the exposure of harmful content to youth accounts. We collect over 7000 videos, classify them as harmful vs not-harmful, and then simulate interactions using age-specific sockpuppet accounts through both passive and active engagement strategies. We also evaluate the performance of large language (LLMs) and vision-language models (VLMs) in detecting harmful content, identifying key challenges in precision and scalability. Preliminary results show minimal differences in content exposure between adult and youth accounts, raising concerns about the platform’s age-based moderation. These findings suggest that the platform needs to strengthen youth safety measures and improve transparency in content moderation.

pdf bib abs
Predictive Modeling of Human Developers’ Evaluative Judgment of Generated Code as a Decision Process
Sergey Kovalchuk | Yanyu Li | Dmitriy Fedrushkov

The paper presents early results in the development of an approach to predictive modeling of human developer perceiving of code generated in question-answering scenarios with Large Language Model (LLM) applications. The study is focused on building a model for the description and prediction of human implicit behavior during evaluative judgment of generated code through evaluation of its consistency, correctness, and usefulness as subjective perceiving characteristics. We used Markov Decision Process (MDP) as a basic framework to describe the human developer and his/her perceiving. We consider two approaches (regression-based and classification-based) to identify MDP parameters so it can be used as an “artificial” developer for human-centered code evaluation. An experimental evaluation of the proposed approach was performed with survey data previously collected for several code generation LLMs in a question-answering scenario. The results show overall good performance of the proposed model in acceptance rate prediction (accuracy 0.82) and give promising perspectives for further development and application.

Large language models (LLMs) are increasingly embedded in development pipelines and the daily workflows of AI practitioners. However, their effectiveness depends on access to high-quality datasets that are sufficiently large, diverse, and contextually relevant. Existing datasets often fall short of these requirements, prompting the use of synthetic data (SD) generation. A critical step in this process is the creation of human seed examples, which guide the generation of SD tailored to specific tasks. We propose a participatory methodology for seed example generation, involving multidisciplinary teams in structured workshops to co-create examples aligned with Responsible AI principles. In a pilot study with a Responsible AI team, we facilitated hands-on activities to produce seed examples and evaluated the resulting data across three dimensions: diversity, sensibility, and relevance. Our findings suggest that participatory approaches can enhance the representativeness and contextual fidelity of synthetic datasets. We provide a reproducible framework to support NLP practitioners in generating high-quality seed data for LLM development and deployment

pdf bib abs
How Well Can AI Models Generate Human Eye Movements During Reading?
Ivan Stebakov | Ilya Pershin

Eye movement analysis has become an essential tool for studying cognitive processes in reading, serving both psycholinguistic research and natural language processing applications aimed at enhancing language model performance. However, the scarcity of eye-tracking data and its limited generalizability constrain data-driven approaches. Synthetic scanpath generation offers a potential solution to these limitations. While recent advances in scanpath generation show promise, current literature lacks systematic evaluation frameworks that comprehensively assess models’ ability to reproduce natural reading gaze patterns. Existing studies often focus on isolated metrics rather than holistic evaluation of cognitive plausibility. This study presents a systematic evaluation of contemporary scanpath generation models, assessing their capacity to replicate natural reading behavior through comprehensive scanpath analysis. We demonstrate that while synthetic scanpath models successfully reproduce basic gaze patterns, significant limitations persist in capturing part-of-speech dependent gaze features and reading behaviors. Our cross-dataset comparison reveals performance degradation in three key areas: generalization across text genres, processing of long sentences, and reproduction of psycholinguistic effects. These findings underscore the need for more robust evaluation protocols and model architectures that better account for psycholinguistic complexity. Through detailed analysis of fixation sequences, durations, and reading patterns, we identify concrete pathways for developing more cognitively plausible scanpath generation models.

pdf bib abs
Re:Member: Emotional Question Generation from Personal Memories
Zackary Rackauckas | Nobuaki Minematsu | Julia Hirschberg

We present Re:Member, a system that explores how emotionally expressive, memory-grounded interaction can support more engaging second language (L2) learning. By drawing on users’ personal videos and generating stylized spoken questions in the target language, Re:Member is designed to encourage affective recall and conversational engagement. The system aligns emotional tone with visual context, using expressive speech styles such as whispers or late-night tones to evoke specific moods. It combines WhisperX-based transcript alignment, 3-frame visual sampling, and Style-BERT-VITS2 for emotional synthesis within a modular generation pipeline. Designed as a stylized interaction probe, Re:Member highlights the role of affect and personal media in learner-centered educational technologies.

pdf bib abs
Word Clouds as Common Voices: LLM-Assisted Visualization of Participant-Weighted Themes in Qualitative Interviews
Joseph T Colonel | Baihan Lin

Word clouds are a common way to summarize qualitative interviews, yet traditional frequency-based methods often fail in conversational contexts: they surface filler words, ignore paraphrase, and fragment semantically related ideas. This limits their usefulness in early-stage analysis, when researchers need fast, interpretable overviews of what participant actually said. We introduce ThemeClouds, an open-source visualization tool that uses large language models (LLMs) to generate thematic, participant-weighted word clouds from dialogue transcripts. The system prompts an LLM to identify concept-level themes across a corpus and then counts how many unique participants mention each topic, yielding a visualization grounded in breadth of mention rather than raw term frequency. Researchers can customize prompts and visualization parameters, providing transparency and control. Using interviews from a user study comparing five recording-device configurations (31 participants; 155 transcripts, Whisper ASR), our approach surfaces more actionable device concerns than frequency clouds and topic-modeling baselines (e.g., LDA, BERTopic). We discuss design trade-offs for integrating LLM assistance into qualitative workflows, implications for interpretability and researcher agency, and opportunities for interactive analyses such as per-condition contrasts (“diff clouds”).

pdf bib abs
Time Is Effort: Estimating Human Post-Editing Time for Grammar Error Correction Tool Evaluation
Ankit Vadehra | Bill Johnson | Gene Saunders | Pascal Poupart

Text editing can involve several iterations of revision. Incorporating an efficient Grammar Error Correction (GEC) tool in the initial correction round can significantly impact further human editing effort and final text quality. This raises an interesting question to quantify GEC Tool usability: How much effort can the GEC Tool save users? We present the first large-scale dataset of post-editing (PE) time annotations and corrections for two English GEC test datasets (BEA19 and CoNLL14). We introduce Post-Editing Effort in Time (PEET) for GEC Tools as a human-focused evaluation scorer to rank any GEC Tool by estimating PE time-to-correct. Using our dataset, we quantify the amount of time saved by GEC Tools in text editing. Analyzing the edit type indicated that determining whether a sentence needs correction and edits like paraphrasing and punctuation changes had the greatest impact on PE time. Finally, comparison with human rankings shows that PEET correlates well with technical effort judgment, providing a new human-centric direction for evaluating GEC tool usability. We release our dataset and code at : https://github.com/ankitvad/PEET_Scorer.

pdf bib abs
Hybrid Intelligence for Logical Fallacy Detection
Mariia Kutepova | Khalid Al Khatib

This study investigates the impact of Hybrid Intelligence (HI) on improving the detection of logical fallacies, addressing the pressing challenge of misinformation prevalent across communication platforms. Employing a between-subjects experimental design, the research compares the performance of two groups: one relying exclusively on human judgment and another supported by an AI assistant. Participants evaluated a series of statements, with the AI-assisted group utilizing a custom ChatGPT-based chatbot that provided real-time hints and clarifications. The findings reveal a significant improvement in fallacy detection with AI support, increasing from an F1-score of 0.76 in the human-only group to 0.90 in the AI-assisted group. Despite this enhancement, both groups struggled to accurately identify non-fallacious statements, highlighting the need to further refine how AI assistance is leveraged.

pdf bib abs
Cognitive Feedback: Decoding Human Feedback from Cognitive Signals
Yuto Harada | Yohei Oseki

Alignment from human feedback has played a crucial role in enhancing the performance of large language models. However, conventional approaches typically require creating large amounts of explicit preference labels, which is costly, time-consuming, and demands sustained human attention. In this work, we propose Cognitive Feedback, a framework that infers preferences from electroencephalography (EEG) signals recorded while annotators simply read text, eliminating the need for explicit labeling. To our knowledge, this is the first empirical investigation of EEG-based feedback as an alternative to conventional human annotations for aligning language models. Experiments on controlled sentiment generation show that Cognitive Feedback achieves performance comparable to explicit human feedback, suggesting that brain-signal-derived preferences can provide a viable, lower-burden pathway for language model alignment.

pdf bib abs
Culturally-Aware Conversations: A Framework & Benchmark for LLMs
Shreya Havaldar | Young Min Cho | Sunny Rai | Lyle Ungar

Existing benchmarks that measure cultural adaptation in LLMs are misaligned with the actual challenges these models face when interacting with users from diverse cultural backgrounds. In this work, we introduce the first framework and benchmark designed to evaluate LLMs in realistic, multicultural conversational settings. Grounded in sociocultural theory, our framework formalizes how linguistic style — a key element of cultural communication — is shaped by situational, relational, and cultural context. We construct a benchmark dataset based on this framework, annotated by culturally diverse raters, and propose a new set of desiderata for cross-cultural evaluation in NLP: conversational framing, stylistic sensitivity, and subjective correctness. We evaluate today’s top LLMs on our benchmark and show that these models struggle with cultural adaptation in a conversational setting.

pdf bib abs
From Regulation to Interaction: Expert Views on Aligning Explainable AI with the EU AI Act
Mahdi Dhaini | Lukas Ondrus | Gjergji Kasneci

Explainable AI (XAI) aims to support people who interact with high-stakes AI-driven decisions, and the EU AI Act mandates that users must be able to interpret system outputs appropriately. Although the Act requires users to interpret outputs and mandates human oversight, it offers no technical guidance for implementing explainability, leaving interpretability methods opaque to non-experts and compliance obligations unclear. To address these gaps, we interviewed eight experts to explore (1) how explainability is defined and perceived under the Act, (2) the practical and regulatory obstacles to XAI implementation, and (3) recommended solutions and future directions. Our findings reveal that experts view explainability as context- and audience-dependent, face challenges from regulatory vagueness and technical trade-offs, and advocate for domain-specific rules, hybrid methods, and user-centered explanations. These insights provide a basis for a potential framework to align XAI methods—particularly for AI and Natural Language Processing (NLP) systems—with regulatory requirements, and suggest actionable steps for policymakers and practitioners

pdf bib abs
From Noise to Nuance: Enriching Subjective Data Annotation through Qualitative Analysis
Ruyuan Wan | Haonan Wang | Ting-Hao Kenneth Huang | Jie Gao

Subjective data annotation (SDA) plays an important role in many NLP tasks, including sentiment analysis, toxicity detection, and bias identification. Conventional SDA often treats annotator disagreement as noise, overlooking its potential to reveal deeper insights. In contrast, qualitative data analysis (QDA) explicitly engages with diverse positionalities and treats disagreement as a meaningful source of knowledge. In this position paper, we argue that human annotators are a key source of valuable interpretive insights into subjective data beyond surface-level descriptions. Through a comparative analysis of SDA and QDA methodologies, we examine similarities and differences in task nature (e.g., human’s role, analysis content, cost, and completion conditions) and practice (annotation schema, annotation workflow, annotator selection, and evaluation). Based on this comparison, we propose five practical recommendations for enabling SDA to capture richer insights. We demonstrate these recommendations in a reinforcement learning from human feedback (RLHF) case study and envision that our interdisciplinary perspective will offer new directions for the field.

pdf bib abs
A Survey of LLM-Based Applications in Programming Education: Balancing Automation and Human Oversight
Griffin Pitts | Anurata Prabha Hridi | Arun Balajiee Lekshmi Narayanan

Novice programmers benefit from timely, personalized support that addresses individual learning gaps, yet the availability of instructors and teaching assistants is inherently limited. Large language models (LLMs) present opportunities to scale such support, though their effectiveness depends on how well technical capabilities are aligned with pedagogical goals. This survey synthesizes recent work on LLM applications in programming education across three focal areas: formative code feedback, assessment, and knowledge modeling. We identify recurring design patterns in how these tools are applied and find that interventions are most effective when educator expertise complements model output through human-in-the-loop oversight, scaffolding, and evaluation. Fully automated approaches are often constrained in capturing the pedagogical nuances of programming education, although human-in-the-loop designs and course-specific adaptation offer promising directions for future improvement. Future research should focus on improving transparency, strengthening alignment with pedagogy, and developing systems that flexibly adapt to the needs of varied learning contexts.

pdf bib abs
Toward Human-Centered Readability Evaluation
Bahar İlgen | Georges Hattab

Text simplification is essential for making public health information accessible to diverse populations, including those with limited health literacy. However, commonly used evaluation metrics in Natural Language Processing (NLP)—such as BLEU, FKGL, and SARI—mainly capture surface-level features and fail to account for human-centered qualities like clarity, trustworthiness, tone, cultural relevance, and actionability. This limitation is particularly critical in high-stakes health contexts, where communication must be not only simple but also usable, respectful, and trustworthy. To address this gap, we propose the Human-Centered Readability Score (HCRS), a five-dimensional evaluation framework grounded in Human-Computer Interaction (HCI) and health communication research. HCRS integrates automatic measures with structured human feedback to capture the relational and contextual aspects of readability. We outline the framework, discuss its integration into participatory evaluation workflows, and present a protocol for empirical validation. This work aims to advance the evaluation of health text simplification beyond surface metrics, enabling NLP systems that align more closely with diverse users’ needs, expectations, and lived experiences.

pdf bib abs
Exploring Gender Differences in Emoji Usage: Implications for Human-Computer Interaction
Zi-Xiang Lin

This study discusses the emojis employment that compensate for the absence of supralinguistic emotive cues in digital communication. Analyzing gender relations (Male-to-Male, Male-to-Female, Female-to-Male, Female-to-Female) as a social influence factor in emoji use, the research explores the use of anger-related emojis and their dual functions as emotion signals and intensifiers. Findings reveal women use more intense emojis toward men and less severe ones toward women, a pattern not observed in men when emphasizing emotions. Hence, the study contributes to the conceptual application of emotional expression via emojis within digital media, raising sentiments on gender variances and improving emotional intelligence in artificial intelligence systems to yield a more accurate human feeling interpretation.

In contemporary workplaces, meetings are essential for exchanging ideas and ensuring team alignment but often face challenges such as time consumption, scheduling conflicts, and inefficient participation. Recent advancements in Large Language Models (LLMs) have demonstrated their strong capabilities in natural language generation and reasoning, prompting the question- can LLMs effectively delegate participants in meetings? To explore this, we develop a prototype LLM-powered meeting delegate system and create a comprehensive benchmark using real meeting transcripts. Our evaluation shows GPT-4/4o balance active and cautious engagement, Gemini 1.5 Pro leans cautious, and Gemini 1.5 Flash and Llama3-8B/70B are more active. About 60% of responses capture at least one key point from the ground truth. Challenges remain in reducing irrelevant or repetitive content and handling transcription errors in real-world settings. We further validate the system through practical deployment and collect feedback. Our results highlight both the promise and limitations of LLMs as meeting delegates, providing insights for their real-world application in reducing meeting burden

pdf bib abs
Dialogue Acts as a Lens on Human–LLM Interaction: Analyzing Conversational Norms in Model-Generated Responses
Arunima Maitra | Dorothea French | Katharina von der Wense

Large language models (LLMs) have revolutionized natural language generation across various applications. Although LLMs are highly capable in many domains, they sometimes produce responses that lack coherence or fail to align with conversational norms such as turn-taking, or providing relevant acknowledgments. Conversational LLMs are widely used, but evaluation often misses pragmatic aspects of dialogue. In this paper, we evaluate how LLM-generated dialogue compares to human conversation through the lens of dialogue acts, the functional building blocks of interaction. Using the Switchboard Dialogue Act (SwDA) corpus, we prompt two widely used open-source models, Llama 2 and Mistral, to generate responses under varying context lengths. We then automatically annotate the dialogue acts of both model and human responses with a BERT classifier and compare their distributions. Our experimental findings reveal that the distribution of dialogue acts generated by these models differs significantly from the distribution of dialogue acts in human conversation, indicating an area for improvement. Perplexity analysis further highlights that certain dialogue acts like Acknowledge (Backchannel) are harder for models to predict. While preliminary, this study demonstrates the value of dialogue act analysis as a diagnostic tool for human-LLM interaction, highlighting both current limitations and directions for improvement.