Yun-Nung Chen

2025

pdf bib abs
Transferring Textual Preferences to Vision-Language Understanding through Model Merging
Chen-An Li | Tzu-Han Lin | Yun-Nung Chen | Hung-yi Lee
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Large vision-language models (LVLMs) perform outstandingly across various multimodal tasks. However, their ability to evaluate generated content remains limited, and training vision-language reward models (VLRMs) with preference data is computationally expensive. This paper explores a training-free alternative by merging text-based reward models (RMs) with LVLMs to create VLRMs. Our approach shows that integrating these models leads to improved performance over LVLMs’ scoring and text-based RMs, offering an efficient method for incorporating textual preferences into LVLMs.

pdf bib abs
The Role of Exploration Modules in Small Language Models for Knowledge Graph Question Answering
Yi-Jie Cheng | Oscar Chew | Yun-Nung Chen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Integrating knowledge graphs (KGs) into the reasoning processes of large language models (LLMs) has emerged as a promising approach to mitigate hallucination. However, existing work in this area often relies on proprietary or extremely large models, limiting accessibility and scalability. In this study, we investigate the capabilities of existing integration methods for small language models (SLMs) in KG-based question answering and observe that their performance is often constrained by their limited ability to traverse and reason over knowledge graphs. To address this limitation, we propose leveraging simple and efficient exploration modules to handle knowledge graph traversal in place of the language model itself. Experiment results demonstrate that these lightweight modules effectively improve the performance of small language models on knowledge graph question answering tasks. Source code: https://github.com/yijie-cheng/SLM-ToG/.

Large language model (LLM)-driven multi-agent systems (MAS) are transforming how humans and AIs collaboratively generate ideas and artifacts. While existing surveys provide comprehensive overviews of MAS infrastructures, they largely overlook the dimension of creativity, including how novel outputs are generated and evaluated, how creativity informs agent personas, and how creative workflows are coordinated. This is the first survey dedicated to creativity in MAS. We focus on text and image generation tasks, and present:(1) a taxonomy of agent proactivity and persona design;(2) an overview of generation techniques, including divergent exploration, iterative refinement, and collaborative synthesis, as well as relevant datasets and evaluation metrics; and(3) a discussion of key challenges, such as inconsistent evaluation standards, insufficient bias mitigation, coordination conflicts, and the lack of unified benchmarks.This survey offers a structured framework and roadmap for advancing the development, evaluation, and standardization of creative MAS.

pdf bib abs
Compound AI Systems Optimization: A Survey of Methods, Challenges, and Future Directions
Yu-Ang Lee | Guan-Ting Yi | Mei-Yi Liu | Jui-Chao Lu | Guan-Bo Yang | Yun-Nung Chen
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Recent advancements in large language models (LLMs) and AI systems have led to a paradigm shift in the design and optimization of complex AI workflows. By integrating multiple components, compound AI systems have become increasingly adept at performing sophisticated tasks. However, as these systems grow in complexity, new challenges arise in optimizing not only individual components but also their interactions. While traditional optimization methods such as supervised fine-tuning (SFT) and reinforcement learning (RL) remain foundational, the rise of natural language feedback introduces promising new approaches, especially for optimizing non-differentiable systems. This paper provides a systematic review of recent progress in optimizing compound AI systems, encompassing both numerical and language-based techniques. We formalize the notion of compound AI system optimization, classify existing methods along several key dimensions, and highlight open research challenges and future directions in this rapidly evolving field.

pdf bib abs
None of the Above, Less of the Right Parallel Patterns in Human and LLM Performance on Multi-Choice Questions Answering
Zhi Rui Tam | Cheng-Kuang Wu | Chieh-Yen Lin | Yun-Nung Chen
Findings of the Association for Computational Linguistics: ACL 2025

Multiple-choice exam questions with “None of the above” (NA) options have been extensively studied in educational testing, in which existing research suggests that they better assess true knowledge. However, their impact on Large Language Models (LLMs) evaluation remains underexplored. Through systematic experiments with 28 LLMs on the MMLU benchmark, we examine how NA options affect model performance and confidence calibration. Our analysis reveals that NA options, when used as the correct answer, lead to a consistent 30-50% performance drop across models regardless of scale–suggesting that LLMs lack the meta-cognitive ability to systematically evaluate and reject all given options when none are correct. This degradation shows strong domain dependence, with minimal impact on mathematical reasoning (14.6% drop) but severe effects on tasks requiring uncertainty handling like business ethics (48.1% drop). Our results highlight important implications for benchmark design and raise questions about LLMs’ ability to handle uncertainty in real-world applications.

pdf bib abs
LLMs are Biased Evaluators But Not Biased for Fact-Centric Retrieval Augmented Generation
Yen-Shan Chen | Jing Jin | Peng-Ting Kuo | Chao-Wei Huang | Yun-Nung Chen
Findings of the Association for Computational Linguistics: ACL 2025

Recent studies have demonstrated that large language models (LLMs) exhibit significant biases in evaluation tasks, particularly in preferentially rating and favoring self-generated content. However, the extent to which this bias manifests in fact-oriented tasks, especially within retrieval-augmented generation (RAG) frameworks—where keyword extraction and factual accuracy take precedence over stylistic elements—remains unclear. Our study addresses this knowledge gap by simulating two critical phases of the RAG framework. In the first phase, LLMs evaluated human-authored and model-generated passages, emulating the pointwise reranking phase. The second phase involves conducting pairwise reading comprehension tests to simulate the generation phase. Contrary to previous findings indicating a self-preference in rating tasks, our results reveal no significant self-preference effect in RAG frameworks. Instead, we observe that factual accuracy significantly influences LLMs’ output, even in the absence of prior knowledge. These findings are consistent among three common QA datasets (NQ, MARCO, TriviaQA Datasets) and 5 widely adopted language models (GPT-3.5, GPT-4o-mini, Gemini, LLaMA3, and Mistral). Our research contributes to the ongoing discourse on LLM biases and their implications for RAG-based system, offering insights that may inform the development of more robust and unbiased LLM systems.

pdf bib abs
Exploring Personality-Aware Interactions in Salesperson Dialogue Agents
Sijia Cheng | Wen Yu Chang | Yun-Nung Chen
Proceedings of the 15th International Workshop on Spoken Dialogue Systems Technology

The integration of dialogue agents into the sales domain requires a deep understanding of how these systems interact with users possessing diverse personas. This study explores the influence of user personas, defined using the Myers-Briggs Type Indicator (MBTI), on the interaction quality and performance of sales-oriented dialogue agents. Through large-scale testing and analysis, we assess the pre-trained agent’s effectiveness, adaptability, and personalization capabilities across a wide range of MBTI-defined user types. Our findings reveal significant patterns in interaction dynamics, task completion rates, and dialogue naturalness, underscoring the future potential for dialogue agents to refine their strategies to better align with varying personality traits. This work not only provides actionable insights for building more adaptive and user-centric conversational systems in the sales domain but also contributes broadly to the field by releasing persona-defined user simulators. These simulators, unconstrained by domain, offer valuable tools for future research and demonstrate the potential for scaling personalized dialogue systems across diverse applications.

pdf bib abs
Balancing Knowledge Delivery and Emotional Comfort in Healthcare Conversational Systems
Shang-Chi Tsai | Yun-Nung Chen
Proceedings of the 15th International Workshop on Spoken Dialogue Systems Technology

With the advancement of large language models, many dialogue systems are now capable of providing reasonable and informative responses to patients’ medical conditions. However, when patients consult their doctor, they may experience negative emotions due to the severity and urgency of their situation. If the model can provide appropriate comfort and empathy based on the patient’s negative emotions while answering medical questions, it will likely offer a more reassuring experience during the medical consultation process. To address this issue, our paper explores the balance between knowledge sharing and emotional support in the healthcare dialogue process. We utilize a large language model to rewrite a real-world interactive medical dialogue dataset, generating patient queries with negative emotions and corresponding medical responses aimed at soothing the patient’s emotions while addressing their concerns. The modified data serves to refine the latest large language models with various fine-tuning methods, enabling them to accurately provide sentences with both emotional reassurance and constructive suggestions in response to patients’ questions. Compared to the original LLM model, our experimental results demonstrate that our methodology significantly enhances the model’s ability to generate emotional responses while maintaining its original capability to provide accurate knowledge-based answers.

pdf bib abs
MSR²: A Benchmark for Multi-Source Retrieval and Reasoning in Visual Question Answering
Kuo-Han Hung | Hung-Chieh Fang | Chao-Wei Huang | Yun-Nung Chen
Proceedings of the 4th International Workshop on Knowledge-Augmented Methods for Natural Language Processing

This paper introduces MSR², a benchmark for multi-source retrieval and reasoning in visual question answering. Unlike previous knowledge-based visual question answering datasets, MSR² focuses on questions involving multiple fine-grained entities, providing a unique opportunity to assess a model’s spatial reasoning ability and its capacity to retrieve and aggregate information from various sources for different entities. Through comprehensive evaluation using MSR², we gain valuable insights into the capabilities and limitations of state-of-the-art large vision-language models (LVLMs).Our findings reveal that even state-of-the-art LVLMs struggle with questions requiring multi-entities and knowledge-intensive reasoning, highlighting important new directions for future research.Additionally, we demonstrate that enhanced visual entity recognition and knowledge retrieval can significantly improve performance on MSR², pinpointing key areas for advancement.

Large language models (LLMs) have recently demonstrated remarkable success in mathematical reasoning. Despite progress in methods like chain-of-thought prompting and self-consistency sampling, these advances often focus on final correctness without ensuring that the underlying reasoning process is coherent and reliable. This paper introduces Step-KTO, a training framework that combines process-level and outcome-level binary feedback to guide LLMs toward more trustworthy reasoning trajectories. By providing binary evaluations for both the intermediate reasoning steps and the final answer, Step-KTO encourages the model to adhere to logical progressions rather than relying on superficial shortcuts. Our experiments on challenging mathematical benchmarks show that Step-KTO significantly improves both final answer accuracy and the quality of intermediate reasoning steps. For example, on the MATH-500 dataset, Step-KTO achieves a notable improvement in Pass@1 accuracy over strong baselines. These results highlight the promise of integrating stepwise process feedback into LLM training, paving the way toward more interpretable and dependable reasoning capabilities.

2024

pdf bib
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Yaser Al-Onaizan | Mohit Bansal | Yun-Nung Chen
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

pdf bib abs
I Need Help! Evaluating LLM’s Ability to Ask for Users’ Support: A Case Study on Text-to-SQL Generation
Cheng-Kuang Wu | Zhi Rui Tam | Chao-Chung Wu | Chieh-Yen Lin | Hung-yi Lee | Yun-Nung Chen
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

This study explores the proactive ability of LLMs to seek user support. We propose metrics to evaluate the trade-off between performance improvements and user burden, and investigate whether LLMs can determine when to request help under varying information availability. Our experiments show that without external feedback, many LLMs struggle to recognize their need for user support. The findings highlight the importance of external signals and provide insights for future research on improving support-seeking strategies. Source code: https://github.com/appier-research/i-need-help

pdf bib abs
DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging
Tzu-Han Lin | Chen-An Li | Hung-yi Lee | Yun-Nung Chen
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Reinforcement learning from human feedback (RLHF) is a popular strategy for aligning large language models (LLMs) with desired behaviors. Reward modeling is a crucial step in RLHF. However, collecting paired preference data for training reward models is often costly and time-consuming, especially for domain-specific preferences requiring expert annotation. To address this challenge, we propose the **Do**main knowled**ge** merged **R**eward **M**odel (**DogeRM**), a novel framework that integrates domain-specific knowledge into a general reward model by model merging. The experiments demonstrate that DogeRM enhances performance across different benchmarks and provide a detailed analysis showcasing the effects of model merging, showing the great potential of facilitating model alignment.

pdf bib abs
PairDistill: Pairwise Relevance Distillation for Dense Retrieval
Chao-Wei Huang | Yun-Nung Chen
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Effective information retrieval (IR) from vast datasets relies on advanced techniques to extract relevant information in response to queries. Recent advancements in dense retrieval have showcased remarkable efficacy compared to traditional sparse retrieval methods. To further enhance retrieval performance, knowledge distillation techniques, often leveraging robust cross-encoder rerankers, have been extensively explored. However, existing approaches primarily distill knowledge from pointwise rerankers, which assign absolute relevance scores to documents, thus facing challenges related to inconsistent comparisons. This paper introduces Pairwise Relevance Distillation (PairDistill) to leverage pairwise reranking, offering fine-grained distinctions between similarly relevant documents to enrich the training of dense retrieval models. Our experiments demonstrate that PairDistill outperforms existing methods, achieving new state-of-the-art results across multiple benchmarks. This highlights the potential of PairDistill in advancing dense retrieval techniques effectively. Our source code and trained models are released at https://github.com/MiuLab/PairDistill

pdf bib abs
Efficient Unseen Language Adaptation for Multilingual Pre-Trained Language Models
Po-Heng Chen | Yun-Nung Chen
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Multilingual pre-trained language models (mPLMs) have demonstrated notable effectiveness in zero-shot cross-lingual transfer tasks. Specifically, they can be fine-tuned solely on tasks in the source language and subsequently applied to tasks in the target language. However, for low-resource languages unseen during pre-training, relying solely on zero-shot language transfer often yields sub-optimal results. One common strategy is to continue training PLMs using masked language modeling objectives on the target language. Nonetheless, this approach can be inefficient due to the need to adjust all parameters for language adaptation. In this paper, we propose a more efficient solution: soft-prompt tuning for language adaptation. Our experiments demonstrate that with carefully designed prompts, soft-prompt tuning enables mPLMs to achieve effective zero-shot cross-lingual transfer to downstream tasks in previously unseen languages. Notably, we found that prompt tuning outperforms continuously trained baselines on two text classification benchmarks, encompassing 20 low-resource languages while utilizing a mere 0.28% of the tuned parameters. These results underscore the superior adaptability of mPLMs to previously unseen languages afforded by soft-prompt tuning compared to traditional fine-tuning methods.

pdf bib abs
Let Me Speak Freely? A Study On The Impact Of Format Restrictions On Large Language Model Performance.
Zhi Rui Tam | Cheng-Kuang Wu | Yi-Lin Tsai | Chieh-Yen Lin | Hung-yi Lee | Yun-Nung Chen
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track

Structured generation, the process of producing content in standardized formats like JSON and XML, is widely utilized in real-world applications to extract key output information from large language models (LLMs).This study investigates whether such constraints on generation space impact LLMs’ abilities, including reasoning and domain knowledge comprehension. Specifically, we evaluate LLMs’ performance when restricted to adhere to structured formats versus generating free-form responses across various common tasks. Surprisingly, we observe a significant decline in LLMs’ reasoning abilities under format restrictions. Furthermore, we find that stricter format constraints generally lead to greater performance degradation in reasoning tasks.

pdf bib
Unsupervised Multilingual Dense Retrieval via Generative Pseudo Labeling
Chao-Wei Huang | Chen-An Li | Tsu-Yuan Hsu | Chen-Yu Hsu | Yun-Nung Chen
Findings of the Association for Computational Linguistics: EACL 2024

pdf bib abs
Injecting Salesperson’s Dialogue Strategies in Large Language Models with Chain-of-Thought Reasoning
Wen Chang | Yun-Nung Chen
Findings of the Association for Computational Linguistics: ACL 2024

Recent research in dialogue systems focuses on two main categories: task-oriented (TOD) and open-domain (chit-chat) dialogues. TOD systems help users complete specific tasks, while open-domain systems aim to create engaging conversations. However, user intents often emerge during interactions. A recent study introduced SalesBot, simulating dialogues that transition from chit-chat to task-oriented scenarios to train sales agents. Unfortunately, the initial data lacked smooth transitions and coherent long dialogues, resulting in unnatural interactions. This paper presents SalesBot 2.0, an improved dataset leveraging commonsense knowledge from large language models (LLMs) through strategic prompting. Additionally, we introduce SalesAgent, a novel model trained on salesperson interactions using chain-of-thought (CoT) reasoning. This model excels in transitioning topics, understanding user intents, and selecting appropriate strategies.Experiments with diverse user simulations validate our method’s effectiveness in controlling dialogue strategies in LLMs. SalesBot 2.0 enhances coherence and reduces aggression, improving model learning for sales-customer interactions.

pdf bib abs
Visualizing Dialogues: Enhancing Image Selection through Dialogue Understanding with Large Language Models
Chang-Sheng Kao | Yun-Nung Chen
Findings of the Association for Computational Linguistics: ACL 2024

For dialogue systems, the utilization of multimodal dialogue responses, as opposed to relying solely on text-only responses, offers the capability to describe different concepts through various modalities. This enhances the effectiveness of communication and elevates the overall conversational experience. However, current methods for dialogue-to-image retrieval are constrained by the capabilities of the pre-trained vision language models (VLMs). They struggle to accurately extract key information from conversations and are unable to handle long-turn conversations. In this paper, we leverage the reasoning capabilities of large language models (LLMs) to predict the potential features that may be present in the images to be shared, based on the dialogue context. This approach allows us to obtain succinct and precise descriptors, thereby improving the performance of text-image retrieval. Experimental results shows that our method outperforms previous approaches significantly in terms of Recall@k.

pdf bib
Findings of the Association for Computational Linguistics: EMNLP 2024
Yaser Al-Onaizan | Mohit Bansal | Yun-Nung Chen
Findings of the Association for Computational Linguistics: EMNLP 2024

pdf bib abs
Editing the Mind of Giants: An In-Depth Exploration of Pitfalls of Knowledge Editing in Large Language Models
Cheng-Hsun Hsueh | Paul Kuo-Ming Huang | Tzu-Han Lin | Che Wei Liao | Hung-Chieh Fang | Chao-Wei Huang | Yun-Nung Chen
Findings of the Association for Computational Linguistics: EMNLP 2024

Knowledge editing is a rising technique for efficiently updating factual knowledge in large language models (LLMs) with minimal alteration of parameters. However, recent studies have identified side effects, such as knowledge distortion and the deterioration of general abilities, that have emerged after editing. Despite these findings, evaluating the pitfalls of knowledge editing often relies on inconsistent metrics and benchmarks, lacking a uniform standard. In response, this survey presents a comprehensive study of these side effects, providing a unified perspective on the challenges of knowledge editing in LLMs by conducting experiments with consistent metrics and benchmarks. Additionally, we review related works and outline potential research directions to address these limitations. Our survey highlights the limitations of current knowledge editing methods, emphasizing the need for a deeper understanding of the inner knowledge structures of LLMs and improved knowledge editing methods. To foster future research, we have released the complementary materials publicly (https://github.com/MiuLab/EditLLM-Survey).

pdf bib abs
FactAlign: Long-form Factuality Alignment of Large Language Models
Chao-Wei Huang | Yun-Nung Chen
Findings of the Association for Computational Linguistics: EMNLP 2024

Large language models have demonstrated significant potential as the next-generation information access engines. However, their reliability is hindered by issues of hallucination and generating non-factual content. This is particularly problematic in long-form responses, where assessing and ensuring factual accuracy is complex. In this paper, we address this gap by proposing FactAlign, a novel alignment framework designed to enhance the factuality of LLMs’ long-form responses while maintaining their helpfulness. We introduce fKTO, a fine-grained, sentence-level alignment algorithm that extends the Kahneman-Tversky Optimization (KTO) alignment method. Leveraging recent advances in automatic factuality evaluation, FactAlign utilizes fine-grained factuality assessments to guide the alignment process. Our experiments on open-domain prompts and information-seeking questions demonstrate that FactAlign significantly improves the factual accuracy of LLM responses while also improving their helpfulness. Further analyses identify that FactAlign is capable of training LLMs to provide more information without losing factual precision, thus improving the factual F1 score. Our source code, datasets, and trained models are publicly available at https://github.com/MiuLab/FactAlign

The concept of *persona*, originally adopted in dialogue literature, has re-surged as a promising framework for tailoring large language models (LLMs) to specific context (*e.g.*, personalized search, LLM-as-a-judge). However, the growing research on leveraging persona in LLMs is relatively disorganized and lacks a systematic taxonomy. To close the gap, we present a comprehensive survey to categorize the current state of the field. We identify two lines of research, namely (1) *LLM Role-Playing*, where personas are assigned to LLMs, and (2) *LLM Personalization*, where LLMs take care of user personas. Additionally, we introduce existing methods for LLM personality evaluation. To the best of our knowledge, we present the first survey for role-playing and personalization in LLMs under the unified view of persona. We continuously maintain a paper collection to foster future endeavors.

2023

pdf bib abs
Self-ICL: Zero-Shot In-Context Learning with Self-Generated Demonstrations
Wei-Lin Chen | Cheng-Kuang Wu | Yun-Nung Chen | Hsin-Hsi Chen
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Large language models (LLMs) have exhibited striking in-context learning (ICL) ability to adapt to target tasks with a few input-output demonstrations. For better ICL, different methods are proposed to select representative demonstrations from existing training corpora. However, such settings are not aligned with real-world practices, as end-users usually query LMs without access to demonstration pools. In this work, we introduce Self-ICL—a simple framework which bootstraps LMs’ intrinsic capabilities to perform zero-shot ICL. Given a test input, Self-ICL first prompts the model to generate pseudo-inputs. Next, the model predicts pseudo-labels for the pseudo-inputs via zero-shot prompting. Finally, we perform ICL for the test input with the pseudo-input-label pairs as demonstrations. Evaluation on 23 BIG-Bench Hard tasks shows Self-ICL outperforms zero-shot baselines on both average accuracy and head-to-head comparison. Moreover, with zero-shot chain-of-thought, Self-ICL achieves results comparable to using real demonstrations. Additionally, we conduct a range of analyses to validate Self-ICL’s effectiveness and provide insights for its behaviors under different settings.

pdf bib abs
Zero-Shot Prompting for Implicit Intent Prediction and Recommendation with Commonsense Reasoning
Hui-Chi Kuo | Yun-Nung Chen
Findings of the Association for Computational Linguistics: ACL 2023

The current generation of intelligent assistants require explicit user requests to perform tasks or services, often leading to lengthy and complex conversations. In contrast, human assistants can infer multiple implicit intents from utterances via their commonsense knowledge, thereby simplifying interactions. To bridge this gap, this paper proposes a framework for multi-domain dialogue systems. This framework automatically infers implicit intents from user utterances, and prompts a large pre-trained language model to suggest suitable task-oriented bots. By leveraging commonsense knowledge, our framework recommends associated bots in a zero-shot manner, enhancing interaction efficiency and effectiveness. This approach substantially reduces interaction complexity, seamlessly integrates various domains and tasks, and represents a significant step towards creating more human-like intelligent assistants that can reason about implicit intents, offering a superior user experience.

pdf bib abs
Visually-Enhanced Phrase Understanding
Tsu-Yuan Hsu | Chen-An Li | Chao-Wei Huang | Yun-Nung Chen
Findings of the Association for Computational Linguistics: ACL 2023

Large-scale vision-language pre-training has exhibited strong performance in various visual and textual understanding tasks. Recently, the textual encoders of multi-modal pre-trained models have been shown to generate high-quality textual representations, which often outperform models that are purely text-based, such as BERT. In this study, our objective is to utilize both textual and visual encoders of multi-modal pre-trained models to enhance language understanding tasks. We achieve this by generating an image associated with a textual prompt, thus enriching the representation of a phrase for downstream tasks. Results from experiments conducted on four benchmark datasets demonstrate that our proposed method, which leverages visually-enhanced text representations, significantly improves performance in the entity clustering task.

pdf bib
Proceedings of the 5th Workshop on NLP for Conversational AI (NLP4ConvAI 2023)
Yun-Nung Chen | Abhinav Rastogi
Proceedings of the 5th Workshop on NLP for Conversational AI (NLP4ConvAI 2023)

pdf bib abs
LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models
Yen-Ting Lin | Yun-Nung Chen
Proceedings of the 5th Workshop on NLP for Conversational AI (NLP4ConvAI 2023)

We propose LLM-Eval, a unified multi-dimensional automatic evaluation method for open-domain conversations with large language models (LLMs). Existing evaluation methods often rely on human annotations, ground-truth responses, or multiple LLM prompts, which can be expensive and time-consuming. To address these issues, we design a single prompt-based evaluation method that leverages a unified evaluation schema to cover multiple dimensions of conversation quality in a single model call. We extensively evaluate the performance of LLM-Eval on various benchmark datasets, demonstrating its effectiveness, efficiency, and adaptability compared to state-of-the-art evaluation methods. Our analysis also highlights the importance of choosing suitable LLMs and decoding strategies for accurate evaluation results. LLM-Eval offers a versatile and robust solution for evaluating open-domain conversation systems, streamlining the evaluation process and providing consistent performance across diverse scenarios.

pdf bib abs
Zero-Shot Dialogue Relation Extraction by Relating Explainable Triggers and Relation Names
Ze-Song Xu | Yun-Nung Chen
Proceedings of the 5th Workshop on NLP for Conversational AI (NLP4ConvAI 2023)

Developing dialogue relation extraction (DRE) systems often requires a large amount of labeled data, which can be costly and time-consuming to annotate. In order to improve scalability and support diverse, unseen relation extraction, this paper proposes a method for leveraging the ability to capture triggers and relate them to previously unseen relation names. Specifically, we introduce a model that enables zero-shot dialogue relation extraction by utilizing trigger-capturing capabilities. Our experiments on a benchmark DialogRE dataset demonstrate that the proposed model achieves significant improvements for both seen and unseen relations. Notably, this is the first attempt at zero-shot dialogue relation extraction using trigger-capturing capabilities, and our results suggest that this approach is effective for inferring previously unseen relation types. Overall, our findings highlight the potential for this method to enhance the scalability and practicality of DRE systems.

pdf bib abs
CONVERSER: Few-shot Conversational Dense Retrieval with Synthetic Data Generation
Chao-Wei Huang | Chen-Yu Hsu | Tsu-Yuan Hsu | Chen-An Li | Yun-Nung Chen
Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Conversational search provides a natural interface for information retrieval (IR). Recent approaches have demonstrated promising results in applying dense retrieval to conversational IR. However, training dense retrievers requires large amounts of in-domain paired data. This hinders the development of conversational dense retrievers, as abundant in-domain conversations are expensive to collect. In this paper, we propose Converser, a framework for training conversational dense retrievers with at most 6 examples of in-domain dialogues. Specifically, we utilize the in-context learning capability of large language models to generate conversational queries given a passage in the retrieval corpus. Experimental results on conversational retrieval benchmarks OR-QuAC and TREC CAsT 19 show that the proposed Converser achieves comparable performance to fully-supervised models, demonstrating the effectiveness of our proposed framework in few-shot conversational dense retrieval. All source code and generated datasets are available: https://github.com/MiuLab/CONVERSER

2022

pdf bib abs
SalesBot: Transitioning from Chit-Chat to Task-Oriented Dialogues
Ssu Chiu | Maolin Li | Yen-Ting Lin | Yun-Nung Chen
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Dialogue systems are usually categorized into two types, open-domain and task-oriented. The first one focuses on chatting with users and making them engage in the conversations, where selecting a proper topic to fit the dialogue context is essential for a successful dialogue. The other one focuses on a specific task instead of casual talks, e.g., finding a movie on Friday night, playing a song. These two directions have been studied separately due to their different purposes. However, how to smoothly transition from social chatting to task-oriented dialogues is important for triggering the business opportunities, and there is no any public data focusing on such scenarios. Hence, this paper focuses on investigating the conversations starting from open-domain social chatting and then gradually transitioning to task-oriented purposes, and releases a large-scale dataset with detailed annotations for encouraging this research direction. To achieve this goal, this paper proposes a framework to automatically generate many dialogues without human involvement, in which any powerful open-domain dialogue generation model can be easily leveraged. The human evaluation shows that our generated dialogue data has a natural flow at a reasonable quality, showing that our released data has a great potential of guiding future research directions and commercial activities. Furthermore, the released models allow researchers to automatically generate unlimited dialogues in the target scenarios, which can greatly benefit semi-supervised and unsupervised approaches.

pdf bib abs
PLM-ICD: Automatic ICD Coding with Pretrained Language Models
Chao-Wei Huang | Shang-Chi Tsai | Yun-Nung Chen
Proceedings of the 4th Clinical Natural Language Processing Workshop

Automatically classifying electronic health records (EHRs) into diagnostic codes has been challenging to the NLP community. State-of-the-art methods treated this problem as a multi-label classification problem and proposed various architectures to model this problem. However, these systems did not leverage the superb performance of pretrained language models, which achieved superb performance on natural language understanding tasks. Prior work has shown that pretrained language models underperformed on this task with the regular fine-tuning scheme. Therefore, this paper aims at analyzing the causes of the underperformance and developing a framework for automatic ICD coding with pretrained language models. We spotted three main issues through the experiments: 1) large label space, 2) long input sequences, and 3) domain mismatch between pretraining and fine-tuning. We propose PLM-ICD, a framework that tackles the challenges with various strategies. The experimental results show that our proposed framework can overcome the challenges and achieves state-of-the-art performance in terms of multiple metrics on the benchmark MIMIC data. Our source code is available at https://github.com/MiuLab/PLM-ICD.

pdf bib abs
Open-Domain Conversational Question Answering with Historical Answers
Hung-Chieh Fang | Kuo-Han Hung | Chen-Wei Huang | Yun-Nung Chen
Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022

Open-domain conversational question answering can be viewed as two tasks: passage retrieval and conversational question answering, where the former relies on selecting candidate passages from a large corpus and the latter requires better understanding of a question with contexts to predict the answers. This paper proposes ConvADR-QA that leverages historical answers to boost retrieval performance and further achieves better answering performance. Our experiments on the benchmark dataset, OR-QuAC, demonstrate that our model outperforms existing baselines in both extractive and generative reader settings, well justifying the effectiveness of historical answers for open-domain conversational question answering.

pdf bib abs
Controllable User Dialogue Act Augmentation for Dialogue State Tracking
Chun-Mao Lai | Ming-Hao Hsu | Chao-Wei Huang | Yun-Nung Chen
Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue

Prior work has demonstrated that data augmentation is useful for improving dialogue state tracking. However, there are many types of user utterances, while the prior method only considered the simplest one for augmentation, raising the concern about poor generalization capability. In order to better cover diverse dialogue acts and control the generation quality, this paper proposes controllable user dialogue act augmentation (CUDA-DST) to augment user utterances with diverse behaviors. With the augmented data, different state trackers gain improvement and show better robustness, achieving the state-of-the-art performance on MultiWOZ 2.1.

pdf bib abs
TREND: Trigger-Enhanced Relation-Extraction Network for Dialogues
Po-Wei Lin | Shang-Yu Su | Yun-Nung Chen
Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue

The goal of dialogue relation extraction (DRE) is to identify the relation between two entities in a given dialogue. During conversations, speakers may expose their relations to certain entities by explicit or implicit clues, such evidences called “triggers”. However, trigger annotations may not be always available for the target data, so it is challenging to leverage such information for enhancing the performance. Therefore, this paper proposes to learn how to identify triggers from the data with trigger annotations and then transfers the trigger-finding capability to other datasets for better performance. The experiments show that the proposed approach is capable of improving relation extraction performance of unseen relations and also demonstrate the transferability of our proposed trigger-finding model across different domains and datasets.

2021

pdf bib abs
Relating Neural Text Degeneration to Exposure Bias
Ting-Rui Chiang | Yun-Nung Chen
Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

This work focuses on relating two mysteries in neural-based text generation: exposure bias, and text degeneration. Despite the long time since exposure bias was mentioned and the numerous studies for its remedy, to our knowledge, its impact on text generation has not yet been verified. Text degeneration is a problem that the widely-used pre-trained language model GPT-2 was recently found to suffer from (Holtzman et al., 2020). Motivated by the unknown causation of the text degeneration, in this paper we attempt to relate these two mysteries. Specifically, we first qualitatively and quantitatively identify mistakes made before text degeneration occurs. Then we investigate the significance of the mistakes by inspecting the hidden states in GPT-2. Our results show that text degeneration is likely to be partly caused by exposure bias. We also study the self-reinforcing mechanism of text degeneration, explaining why the mistakes amplify. In sum, our study provides a more concrete foundation for further investigation on exposure bias and text degeneration problems.

pdf bib abs
Efficient Multi-Task Auxiliary Learning: Selecting Auxiliary Data by Feature Similarity
Po-Nien Kung | Sheng-Siang Yin | Yi-Cheng Chen | Tse-Hsuan Yang | Yun-Nung Chen
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Multi-task auxiliary learning utilizes a set of relevant auxiliary tasks to improve the performance of a primary task. A common usage is to manually select multiple auxiliary tasks for multi-task learning on all data, which raises two issues: (1) selecting beneficial auxiliary tasks for a primary task is nontrivial; (2) when the auxiliary datasets are large, training on all data becomes time-expensive and impractical. Therefore, this paper focuses on addressing these problems and proposes a time-efficient sampling method to select the data that is most relevant to the primary task. The proposed method allows us to only train on the most beneficial sub-datasets from the auxiliary tasks, achieving efficient multi-task auxiliary learning. The experiments on three benchmark datasets (RTE, MRPC, STS-B) show that our method significantly outperforms random sampling and ST-DNN. Also, by applying our method, the model can surpass fully-trained MT-DNN on RTE, MRPC, STS-B, using only 50%, 66%, and 1% of data, respectively.

pdf bib abs
Modeling Diagnostic Label Correlation for Automatic ICD Coding
Shang-Chi Tsai | Chao-Wei Huang | Yun-Nung Chen
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Given the clinical notes written in electronic health records (EHRs), it is challenging to predict the diagnostic codes which is formulated as a multi-label classification task. The large set of labels, the hierarchical dependency, and the imbalanced data make this prediction task extremely hard. Most existing work built a binary prediction for each label independently, ignoring the dependencies between labels. To address this problem, we propose a two-stage framework to improve automatic ICD coding by capturing the label correlation. Specifically, we train a label set distribution estimator to rescore the probability of each label set candidate generated by a base predictor. This paper is the first attempt at learning the label set distribution as a reranking module for ICD coding. In the experiments, our proposed framework is able to improve upon best-performing predictors for medical code prediction on the benchmark MIMIC datasets.

pdf bib
Proceedings of the 3rd Workshop on Natural Language Processing for Conversational AI
Alexandros Papangelis | Paweł Budzianowski | Bing Liu | Elnaz Nouri | Abhinav Rastogi | Yun-Nung Chen
Proceedings of the 3rd Workshop on Natural Language Processing for Conversational AI

Automatic detection of toxic language plays an essential role in protecting social media users, especially minority groups, from verbal abuse. However, biases toward some attributes, including gender, race, and dialect, exist in most training datasets for toxicity detection. The biases make the learned models unfair and can even exacerbate the marginalization of people. Considering that current debiasing methods for general natural language understanding tasks cannot effectively mitigate the biases in the toxicity detectors, we propose to use invariant rationalization (InvRat), a game-theoretic framework consisting of a rationale generator and a predictor, to rule out the spurious correlation of certain syntactic patterns (e.g., identity mentions, dialect) to toxicity labels. We empirically show that our method yields lower false positive rate in both lexical and dialectal attributes than previous debiasing methods.

2020

pdf bib abs
Towards Unsupervised Language Understanding and Generation by Joint Dual Learning
Shang-Yu Su | Chao-Wei Huang | Yun-Nung Chen
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

In modular dialogue systems, natural language understanding (NLU) and natural language generation (NLG) are two critical components, where NLU extracts the semantics from the given texts and NLG is to construct corresponding natural language sentences based on the input semantic representations. However, the dual property between understanding and generation has been rarely explored. The prior work is the first attempt that utilized the duality between NLU and NLG to improve the performance via a dual supervised learning framework. However, the prior work still learned both components in a supervised manner; instead, this paper introduces a general learning framework to effectively exploit such duality, providing flexibility of incorporating both supervised and unsupervised learning algorithms to train language understanding and generation models in a joint fashion. The benchmark experiments demonstrate that the proposed approach is capable of boosting the performance of both NLU and NLG. The source code is available at: https://github.com/MiuLab/DuaLUG.

pdf bib abs
Learning Spoken Language Representations with Neural Lattice Language Modeling
Chao-Wei Huang | Yun-Nung Chen
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Pre-trained language models have achieved huge improvement on many NLP tasks. However, these methods are usually designed for written text, so they do not consider the properties of spoken language. Therefore, this paper aims at generalizing the idea of language model pre-training to lattices generated by recognition systems. We propose a framework that trains neural lattice language models to provide contextualized representations for spoken language understanding tasks. The proposed two-stage pre-training approach reduces the demands of speech data and has better efficiency. Experiments on intent detection and dialogue act recognition datasets demonstrate that our proposed method consistently outperforms strong baselines when evaluated on spoken inputs. The code is available at https://github.com/MiuLab/Lattice-ELMo.

pdf bib abs
Lifelong Language Knowledge Distillation
Yung-Sung Chuang | Shang-Yu Su | Yun-Nung Chen
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

It is challenging to perform lifelong language learning (LLL) on a stream of different tasks without any performance degradation comparing to the multi-task counterparts. To address this issue, we present Lifelong Language Knowledge Distillation (L2KD), a simple but efficient method that can be easily applied to existing LLL architectures in order to mitigate the degradation. Specifically, when the LLL model is trained on a new task, we assign a teacher model to first learn the new task, and pass the knowledge to the LLL model via knowledge distillation. Therefore, the LLL model can better adapt to the new task while keeping the previously learned knowledge. Experiments show that the proposed L2KD consistently improves previous state-of-the-art models, and the degradation comparing to multi-task models in LLL tasks is well mitigated for both sequence generation and text classification tasks.

pdf bib abs
What Do Position Embeddings Learn? An Empirical Study of Pre-Trained Language Model Positional Encoding
Yu-An Wang | Yun-Nung Chen
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

In recent years, pre-trained Transformers have dominated the majority of NLP benchmark tasks. Many variants of pre-trained Transformers have kept breaking out, and most focus on designing different pre-training objectives or variants of self-attention. Embedding the position information in the self-attention mechanism is also an indispensable factor in Transformers however is often discussed at will. Hence, we carry out an empirical study on position embedding of mainstream pre-trained Transformers mainly focusing on two questions: 1) Do position embeddings really learn the meaning of positions? 2) How do these different learned position embeddings affect Transformers for NLP tasks? This paper focuses on providing a new insight of pre-trained position embeddings by feature-level analysis and empirical experiments on most of iconic NLP tasks. It is believed that our experimental results can guide the future works to choose the suitable positional encoding function for specific tasks given the application property.

pdf bib abs
Zero-Shot Rationalization by Multi-Task Transfer Learning from Question Answering
Po-Nien Kung | Tse-Hsuan Yang | Yi-Cheng Chen | Sheng-Siang Yin | Yun-Nung Chen
Findings of the Association for Computational Linguistics: EMNLP 2020

Extracting rationales can help human understand which information the model utilizes and how it makes the prediction towards better interpretability. However, annotating rationales requires much effort and only few datasets contain such labeled rationales, making supervised learning for rationalization difficult. In this paper, we propose a novel approach that leverages the benefits of both multi-task learning and transfer learning for generating rationales through question answering in a zero-shot fashion. For two benchmark rationalization datasets, the proposed method achieves comparable or even better performance of rationalization without any supervised signal, demonstrating the great potential of zero-shot rationalization for better interpretability.

pdf bib abs
Dual Inference for Improving Language Understanding and Generation
Shang-Yu Su | Yung-Sung Chuang | Yun-Nung Chen
Findings of the Association for Computational Linguistics: EMNLP 2020

Natural language understanding (NLU) and Natural language generation (NLG) tasks hold a strong dual relationship, where NLU aims at predicting semantic labels based on natural language utterances and NLG does the opposite. The prior work mainly focused on exploiting the duality in model training in order to obtain the models with better performance. However, regarding the fast-growing scale of models in the current NLP area, sometimes we may have difficulty retraining whole NLU and NLG models. To better address the issue, this paper proposes to leverage the duality in the inference stage without the need of retraining. The experiments on three benchmark datasets demonstrate the effectiveness of the proposed method in both NLU and NLG, providing the great potential of practical usage.

2019

pdf bib abs
Tree Transformer: Integrating Tree Structures into Self-Attention
Yaushian Wang | Hung-Yi Lee | Yun-Nung Chen
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Pre-training Transformer from large-scale raw texts and fine-tuning on the desired task have achieved state-of-the-art results on diverse NLP tasks. However, it is unclear what the learned attention captures. The attention computed by attention heads seems not to match human intuitions about hierarchical structures. This paper proposes Tree Transformer, which adds an extra constraint to attention heads of the bidirectional Transformer encoder in order to encourage the attention heads to follow tree structures. The tree structures can be automatically induced from raw texts by our proposed “Constituent Attention” module, which is simply implemented by self-attention between two adjacent words. With the same training procedure identical to BERT, the experiments demonstrate the effectiveness of Tree Transformer in terms of inducing tree structures, better language modeling, and further learning more explainable attention scores.

pdf bib abs
DyKgChat: Benchmarking Dialogue Generation Grounding on Dynamic Knowledge Graphs
Yi-Lin Tuan | Yun-Nung Chen | Hung-yi Lee
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Data-driven, knowledge-grounded neural conversation models are capable of generating more informative responses. However, these models have not yet demonstrated that they can zero-shot adapt to updated, unseen knowledge graphs. This paper proposes a new task about how to apply dynamic knowledge graphs in neural conversation model and presents a novel TV series conversation corpus (DyKgChat) for the task. Our new task and corpus aids in understanding the influence of dynamic knowledge graphs on responses generation. Also, we propose a preliminary model that selects an output from two networks at each time step: a sequence-to-sequence model (Seq2Seq) and a multi-hop reasoning model, in order to support dynamic knowledge graphs. To benchmark this new task and evaluate the capability of adaptation, we introduce several evaluation metrics and the experiments show that our proposed approach outperforms previous knowledge-grounded conversation models. The proposed corpus and model can motivate the future research directions.

pdf bib abs
QAInfomax: Learning Robust Question Answering System by Mutual Information Maximization
Yi-Ting Yeh | Yun-Nung Chen
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Standard accuracy metrics indicate that modern reading comprehension systems have achieved strong performance in many question answering datasets. However, the extent these systems truly understand language remains unknown, and existing systems are not good at distinguishing distractor sentences which look related but do not answer the question. To address this problem, we propose QAInfomax as a regularizer in reading comprehension systems by maximizing mutual information among passages, a question, and its answer. QAInfomax helps regularize the model to not simply learn the superficial correlation for answering the questions. The experiments show that our proposed QAInfomax achieves the state-of-the-art performance on the benchmark Adversarial-SQuAD dataset.

pdf bib abs
What Does This Word Mean? Explaining Contextualized Embeddings with Natural Language Definition
Ting-Yun Chang | Yun-Nung Chen
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Contextualized word embeddings have boosted many NLP tasks compared with traditional static word embeddings. However, the word with a specific sense may have different contextualized embeddings due to its various contexts. To further investigate what contextualized word embeddings capture, this paper analyzes whether they can indicate the corresponding sense definitions and proposes a general framework that is capable of explaining word meanings given contextualized word embeddings for better interpretation. The experiments show that both ELMo and BERT embeddings can be well interpreted via a readable textual form, and the findings may benefit the research community for a better understanding of what the embeddings capture.

pdf bib abs
FlowDelta: Modeling Flow Information Gain in Reasoning for Conversational Machine Comprehension
Yi-Ting Yeh | Yun-Nung Chen
Proceedings of the 2nd Workshop on Machine Reading for Question Answering

Conversational machine comprehension requires deep understanding of the dialogue flow, and the prior work proposed FlowQA to implicitly model the context representations in reasoning for better understanding. This paper proposes to explicitly model the information gain through the dialogue reasoning in order to allow the model to focus on more informative cues. The proposed model achieves the state-of-the-art performance in a conversational QA dataset QuAC and sequential instruction understanding dataset SCONE, which shows the effectiveness of the proposed mechanism and demonstrate its capability of generalization to different QA models and tasks.

pdf bib abs
Leveraging Hierarchical Category Knowledge for Data-Imbalanced Multi-Label Diagnostic Text Understanding
Shang-Chi Tsai | Ting-Yun Chang | Yun-Nung Chen
Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019)

Clinical notes are essential medical documents to record each patient’s symptoms. Each record is typically annotated with medical diagnostic codes, which means diagnosis and treatment. This paper focuses on predicting diagnostic codes given the descriptive present illness in electronic health records by leveraging domain knowledge. We investigate various losses in a convolutional model to utilize hierarchical category knowledge of diagnostic codes in order to allow the model to share semantics across different labels under the same category. The proposed model not only considers the external domain knowledge but also addresses the issue about data imbalance. The MIMIC3 benchmark experiments show that the proposed methods can effectively utilize category knowledge and provide informative cues to improve the performance in terms of the top-ranked diagnostic codes which is better than the prior state-of-the-art. The investigation and discussion express the potential of integrating the domain knowledge in the current machine learning based models and guiding future research directions.

pdf bib abs
Towards Understanding of Medical Randomized Controlled Trials by Conclusion Generation
Alexander Te-Wei Shieh | Yung-Sung Chuang | Shang-Yu Su | Yun-Nung Chen
Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019)

Randomized controlled trials (RCTs) represent the paramount evidence of clinical medicine. Using machines to interpret the massive amount of RCTs has the potential of aiding clinical decision-making. We propose a RCT conclusion generation task from the PubMed 200k RCT sentence classification dataset to examine the effectiveness of sequence-to-sequence models on understanding RCTs. We first build a pointer-generator baseline model for conclusion generation. Then we fine-tune the state-of-the-art GPT-2 language model, which is pre-trained with general domain data, for this new medical domain task. Both automatic and human evaluation show that our GPT-2 fine-tuned models achieve improved quality and correctness in the generated conclusions compared to the baseline pointer-generator model. Further inspection points out the limitations of this current approach and future directions to explore.

pdf bib abs
Semantically-Aligned Equation Generation for Solving and Reasoning Math Word Problems
Ting-Rui Chiang | Yun-Nung Chen
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Solving math word problems is a challenging task that requires accurate natural language understanding to bridge natural language texts and math expressions. Motivated by the intuition about how human generates the equations given the problem texts, this paper presents a neural approach to automatically solve math word problems by operating symbols according to their semantic meanings in texts. This paper views the process of generating equation as a bridge between the semantic world and the symbolic world, where the proposed neural math solver is based on an encoder-decoder framework. In the proposed model, the encoder is designed to understand the semantics of problems, and the decoder focuses on tracking semantic meanings of the generated symbols and then deciding which symbol to generate next. The preliminary experiments are conducted in a dataset Math23K, and our model significantly outperforms both the state-of-the-art single model and the best non-retrieval-based model over about 10% accuracy, demonstrating the effectiveness of bridging the symbolic and semantic worlds from math word problems.

pdf bib abs
Dual Supervised Learning for Natural Language Understanding and Generation
Shang-Yu Su | Chao-Wei Huang | Yun-Nung Chen
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Natural language understanding (NLU) and natural language generation (NLG) are both critical research topics in the NLP and dialogue fields. Natural language understanding is to extract the core semantic meaning from the given utterances, while natural language generation is opposite, of which the goal is to construct corresponding sentences based on the given semantics. However, such dual relationship has not been investigated in literature. This paper proposes a novel learning framework for natural language understanding and generation on top of dual supervised learning, providing a way to exploit the duality. The preliminary experiments show that the proposed approach boosts the performance for both tasks, demonstrating the effectiveness of the dual relationship.

2018

pdf bib
Deep Learning for Dialogue Systems
Yun-Nung Chen | Asli Celikyilmaz | Dilek Hakkani-Tür
Proceedings of the 27th International Conference on Computational Linguistics: Tutorial Abstracts

pdf bib abs
CLUSE: Cross-Lingual Unsupervised Sense Embeddings
Ta-Chung Chi | Yun-Nung Chen
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

This paper proposes a modularized sense induction and representation learning model that jointly learns bilingual sense embeddings that align well in the vector space, where the cross-lingual signal in the English-Chinese parallel corpus is exploited to capture the collocation and distributed characteristics in the language pair. The model is evaluated on the Stanford Contextual Word Similarity (SCWS) dataset to ensure the quality of monolingual sense embeddings. In addition, we introduce Bilingual Contextual Word Similarity (BCWS), a large and high-quality dataset for evaluating cross-lingual sense embeddings, which is the first attempt of measuring whether the learned embeddings are indeed aligned well in the vector space. The proposed approach shows the superior quality of sense embeddings evaluated in both monolingual and bilingual spaces.

pdf bib abs
Discriminative Deep Dyna-Q: Robust Planning for Dialogue Policy Learning
Shang-Yu Su | Xiujun Li | Jianfeng Gao | Jingjing Liu | Yun-Nung Chen
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

This paper presents a Discriminative Deep Dyna-Q (D3Q) approach to improving the effectiveness and robustness of Deep Dyna-Q (DDQ), a recently proposed framework that extends the Dyna-Q algorithm to integrate planning for task-completion dialogue policy learning. To obviate DDQ’s high dependency on the quality of simulated experiences, we incorporate an RNN-based discriminator in D3Q to differentiate simulated experience from real user experience in order to control the quality of training data. Experiments show that D3Q significantly outperforms DDQ by controlling the quality of simulated experience used for planning. The effectiveness and robustness of D3Q is further demonstrated in a domain extension setting, where the agent’s capability of adapting to a changing environment is tested.

pdf bib abs
How Time Matters: Learning Time-Decay Attention for Contextual Spoken Language Understanding in Dialogues
Shang-Yu Su | Pei-Chieh Yuan | Yun-Nung Chen
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Spoken language understanding (SLU) is an essential component in conversational systems. Most SLU components treats each utterance independently, and then the following components aggregate the multi-turn information in the separate phases. In order to avoid error propagation and effectively utilize contexts, prior work leveraged history for contextual SLU. However, most previous models only paid attention to the related content in history utterances, ignoring their temporal information. In the dialogues, it is intuitive that the most recent utterances are more important than the least recent ones, in other words, time-aware attention should be in a decaying manner. Therefore, this paper designs and investigates various types of time-decay attention on the sentence-level and speaker-level, and further proposes a flexible universal time-decay attention mechanism. The experiments on the benchmark Dialogue State Tracking Challenge (DSTC4) dataset show that the proposed time-decay attention mechanisms significantly improve the state-of-the-art model for contextual understanding performance.

pdf bib abs
Natural Language Generation by Hierarchical Decoding with Linguistic Patterns
Shang-Yu Su | Kai-Ling Lo | Yi-Ting Yeh | Yun-Nung Chen
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Natural language generation (NLG) is a critical component in spoken dialogue systems. Classic NLG can be divided into two phases: (1) sentence planning: deciding on the overall sentence structure, (2) surface realization: determining specific word forms and flattening the sentence structure into a string. Many simple NLG models are based on recurrent neural networks (RNN) and sequence-to-sequence (seq2seq) model, which basically contains a encoder-decoder structure; these NLG models generate sentences from scratch by jointly optimizing sentence planning and surface realization using a simple cross entropy loss training criterion. However, the simple encoder-decoder architecture usually suffers from generating complex and long sentences, because the decoder has to learn all grammar and diction knowledge. This paper introduces a hierarchical decoding NLG model based on linguistic patterns in different levels, and shows that the proposed method outperforms the traditional one with a smaller model size. Furthermore, the design of the hierarchical decoding is flexible and easily-extendible in various NLG systems.

pdf bib abs
Slot-Gated Modeling for Joint Slot Filling and Intent Prediction
Chih-Wen Goo | Guang Gao | Yun-Kai Hsu | Chih-Li Huo | Tsung-Chieh Chen | Keng-Wei Hsu | Yun-Nung Chen
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Attention-based recurrent neural network models for joint intent detection and slot filling have achieved the state-of-the-art performance, while they have independent attention weights. Considering that slot and intent have the strong relationship, this paper proposes a slot gate that focuses on learning the relationship between intent and slot attention vectors in order to obtain better semantic frame results by the global optimization. The experiments show that our proposed model significantly improves sentence-level semantic frame accuracy with 4.2% and 1.9% relative improvement compared to the attentional model on benchmark ATIS and Snips datasets respectively

2017

pdf bib abs
MUSE: Modularizing Unsupervised Sense Embeddings
Guang-He Lee | Yun-Nung Chen
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

This paper proposes to address the word sense ambiguity issue in an unsupervised manner, where word sense representations are learned along a word sense selection mechanism given contexts. Prior work focused on designing a single model to deliver both mechanisms, and thus suffered from either coarse-grained representation learning or inefficient sense selection. The proposed modular approach, MUSE, implements flexible modules to optimize distinct mechanisms, achieving the first purely sense-level representation learning system with linear-time sense selection. We leverage reinforcement learning to enable joint training on the proposed modules, and introduce various exploration techniques on sense selection for better robustness. The experiments on benchmark data show that the proposed approach achieves the state-of-the-art performance on synonym selection as well as on contextual word similarities in terms of MaxSimC.

pdf bib abs
End-to-End Task-Completion Neural Dialogue Systems
Xiujun Li | Yun-Nung Chen | Lihong Li | Jianfeng Gao | Asli Celikyilmaz
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

One of the major drawbacks of modularized task-completion dialogue systems is that each module is trained individually, which presents several challenges. For example, downstream modules are affected by earlier modules, and the performance of the entire system is not robust to the accumulated errors. This paper presents a novel end-to-end learning framework for task-completion dialogue systems to tackle such issues. Our neural dialogue system can directly interact with a structured database to assist users in accessing information and accomplishing certain tasks. The reinforcement learning based dialogue manager offers robust capabilities to handle noises caused by other components of the dialogue system. Our experiments in a movie-ticket booking domain show that our end-to-end system not only outperforms modularized dialogue system baselines for both objective and subjective evaluation, but also is robust to noises as demonstrated by several systematic experiments with different error granularity and rates specific to the language understanding module.

pdf bib abs
Speaker Role Contextual Modeling for Language Understanding and Dialogue Policy Learning
Ta-Chung Chi | Po-Chun Chen | Shang-Yu Su | Yun-Nung Chen
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Language understanding (LU) and dialogue policy learning are two essential components in conversational systems. Human-human dialogues are not well-controlled and often random and unpredictable due to their own goals and speaking habits. This paper proposes a role-based contextual model to consider different speaker roles independently based on the various speaking patterns in the multi-turn dialogues. The experiments on the benchmark dataset show that the proposed role-based model successfully learns role-specific behavioral patterns for contextual encoding and then significantly improves language understanding and dialogue policy learning tasks.

pdf bib abs
Open-Domain Neural Dialogue Systems
Yun-Nung Chen | Jianfeng Gao
Proceedings of the IJCNLP 2017, Tutorial Abstracts

In the past decade, spoken dialogue systems have been the most prominent component in today’s personal assistants. A lot of devices have incorporated dialogue system modules, which allow users to speak naturally in order to finish tasks more efficiently. The traditional conversational systems have rather complex and/or modular pipelines. The advance of deep learning technologies has recently risen the applications of neural models to dialogue modeling. Nevertheless, applying deep learning technologies for building robust and scalable dialogue systems is still a challenging task and an open research area as it requires deeper understanding of the classic pipelines as well as detailed knowledge on the benchmark of the models of the prior work and the recent state-of-the-art work. Therefore, this tutorial is designed to focus on an overview of the dialogue system development while describing most recent research for building task-oriented and chit-chat dialogue systems, and summarizing the challenges. We target the audience of students and practitioners who have some deep learning background, who want to get more familiar with conversational dialogue systems.

pdf bib abs
Towards End-to-End Reinforcement Learning of Dialogue Agents for Information Access
Bhuwan Dhingra | Lihong Li | Xiujun Li | Jianfeng Gao | Yun-Nung Chen | Faisal Ahmed | Li Deng
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

This paper proposes KB-InfoBot - a multi-turn dialogue agent which helps users search Knowledge Bases (KBs) without composing complicated queries. Such goal-oriented dialogue agents typically need to interact with an external database to access real-world knowledge. Previous systems achieved this by issuing a symbolic query to the KB to retrieve entries based on their attributes. However, such symbolic operations break the differentiability of the system and prevent end-to-end training of neural dialogue agents. In this paper, we address this limitation by replacing symbolic queries with an induced “soft” posterior distribution over the KB that indicates which entities the user is interested in. Integrating the soft retrieval process with a reinforcement learner leads to higher task success rate and reward in both simulations and against real users. We also present a fully neural end-to-end agent, trained entirely from user feedback, and discuss its application towards personalized dialogue agents.

pdf bib abs
Deep Learning for Dialogue Systems
Yun-Nung Chen | Asli Celikyilmaz | Dilek Hakkani-Tür
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

In the past decade, goal-oriented spoken dialogue systems have been the most prominent component in today's virtual personal assistants. The classic dialogue systems have rather complex and/or modular pipelines. The advance of deep learning technologies has recently risen the applications of neural models to dialogue modeling. However, how to successfully apply deep learning based approaches to a dialogue system is still challenging. Hence, this tutorial is designed to focus on an overview of the dialogue system development while describing most recent research for building dialogue systems and summarizing the challenges, in order to allow researchers to study the potential improvements of the state-of-the-art dialogue systems. The tutorial material is available at http://deepdialogue.miulab.tw.

2016

pdf bib abs
AIMU: Actionable Items for Meeting Understanding
Yun-Nung Chen | Dilek Hakkani-Tür
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

With emerging conversational data, automated content analysis is needed for better data interpretation, so that it is accurately understood and can be effectively integrated and utilized in various applications. ICSI meeting corpus is a publicly released data set of multi-party meetings in an organization that has been released over a decade ago, and has been fostering meeting understanding research since then. The original data collection includes transcription of participant turns as well as meta-data annotations, such as disfluencies and dialog act tags. This paper presents an extended set of annotations for the ICSI meeting corpus with a goal of deeply understanding meeting conversations, where participant turns are annotated by actionable items that could be performed by an automated meeting assistant. In addition to the user utterances that contain an actionable item, annotations also include the arguments associated with the actionable item. The set of actionable items are determined by aligning human-human interactions to human-machine interactions, where a data annotation schema designed for a virtual personal assistant (human-machine genre) is adapted to the meetings domain (human-human genre). The data set is formed by annotating participants’ utterances in meetings with potential intents/actions considering their contexts. The set of actions target what could be accomplished by an automated meeting assistant, such as taking a note of action items that a participant commits to, or finding emails or topic related documents that were mentioned during the meeting. A total of 10 defined intents/actions are considered as actionable items in meetings. Turns that include actionable intents were annotated for 22 public ICSI meetings, that include a total of 21K utterances, segmented by speaker turns. Participants’ spoken turns, possible actions along with associated arguments and their vector representations as computed by convolutional deep structured semantic models are included in the data set for future research. We present a detailed statistical analysis of the data set and analyze the performance of applying convolutional deep structured semantic models for an actionable item detection task. The data is available at http://research.microsoft.com/projects/meetingunderstanding/.

pdf bib abs
AppDialogue: Multi-App Dialogues for Intelligent Assistants
Ming Sun | Yun-Nung Chen | Zhenhao Hua | Yulian Tamres-Rudnicky | Arnab Dash | Alexander Rudnicky
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Users will interact with an individual app on smart devices (e.g., phone, TV, car) to fulfill a specific goal (e.g. find a photographer), but users may also pursue more complex tasks that will span multiple domains and apps (e.g. plan a wedding ceremony). Planning and executing such multi-app tasks are typically managed by users, considering the required global context awareness. To investigate how users arrange domains/apps to fulfill complex tasks in their daily life, we conducted a user study on 14 participants to collect such data from their Android smart phones. This document 1) summarizes the techniques used in the data collection and 2) provides a brief statistical description of the data. This data guilds the future direction for researchers in the fields of conversational agent and personal assistant, etc. This data is available at http://AppDialogue.com.

Yun-Nung Chen

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

Co-authors

Venues