Kam-Fai Wong - ACL Anthology

Kam-Fai Wong

Also published as: Kam-fai Wong, K.F. Wong

2025

Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering
Yu Zhao | Alessio Devoto | Giwon Hong | Xiaotang Du | Aryo Pradipta Gema | Hongru Wang | Xuanli He | Kam-Fai Wong | Pasquale Minervini
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Large language models (LLMs) can store a significant amount of factual knowledge in their parameters. However, their parametric knowledge may conflict with the information provided in the context—this phenomenon, known as context-memory knowledge conflicts, can lead to undesirable model behaviour, such as reliance on outdated or incorrect information. Analysing the internal activations of LLMs, we find that they can internally register the signals of knowledge conflict at mid-layers. Such signals allow us to detect whether a knowledge conflict occurs and use inference-time intervention strategies to resolve it. In this work, we propose SpARE, a training-free representation engineering method that uses pre-trained sparse auto-encoders (SAEs) to control the knowledge selection behaviour of LLMs. SpARE identifies the functional features that control the knowledge selection behaviours and applies them to edit the internal activations of LLMs at inference time. Our experimental results show that SpARE can effectively control the usage of either knowledge source to resolve knowledge conflict in open-domain question-answering tasks, surpassing existing representation engineering methods (+10%) as well as contrastive decoding methods (+15%).

UAlign: Leveraging Uncertainty Estimations for Factuality Alignment on Large Language Models
Boyang Xue | Fei Mi | Qi Zhu | Hongru Wang | Rui Wang | Sheng Wang | Erxin Yu | Xuming Hu | Kam-Fai Wong
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Despite demonstrating impressive capabilities, Large Language Models (LLMs) still often struggle to accurately express the factual knowledge they possess, especially in cases where the LLMs’ knowledge boundaries are ambiguous. To improve LLMs’ factual expressions, we propose the UAlign framework, which leverages Uncertainty estimations to represent knowledge boundaries, and then explicitly incorporates these representations as input features into prompts for LLMs to Align with factual knowledge. First, we prepare the dataset on knowledge question-answering (QA) samples by calculating two uncertainty estimations, including confidence score and semantic entropy, to represent the knowledge boundaries for LLMs. Subsequently, using the prepared dataset, we train a reward model that incorporates uncertainty estimations and then employ the Proximal Policy Optimization (PPO) algorithm for factuality alignment on LLMs. Experimental results indicate that, by integrating uncertainty representations in LLM alignment, the proposed UAlign can significantly enhance the LLMs’ capacities to confidently answer known questions and refuse unknown questions on both in-domain and out-of-domain tasks, showing reliability improvements and good generalizability over various prompt- and training-based baselines.
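The two uncertainty estimates named in the abstract can be illustrated with a minimal sketch. It assumes several answers have been sampled for one question and that two answers "mean the same" when they match after simple normalization. That normalization is a stand-in for a real semantic-equivalence check, and the function names `normalize`, `semantic_entropy`, and `confidence_score` are hypothetical, not taken from the paper.

```python
import math
from collections import Counter


def normalize(answer: str) -> str:
    # Stand-in for a semantic-equivalence check (an assumption of this sketch):
    # answers are treated as equivalent if they match after normalization.
    return " ".join(answer.lower().strip().rstrip(".").split())


def semantic_entropy(samples: list[str]) -> float:
    """Entropy over clusters of equivalent sampled answers (natural log)."""
    clusters = Counter(normalize(s) for s in samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in clusters.values())


def confidence_score(samples: list[str], reference: str) -> float:
    """Fraction of sampled answers that match the reference answer."""
    ref = normalize(reference)
    return sum(normalize(s) == ref for s in samples) / len(samples)


samples = ["Paris", "paris.", "Paris", "Lyon"]
print(confidence_score(samples, "Paris"))
print(semantic_entropy(samples))
```

Low entropy together with high confidence suggests a question inside the model's knowledge boundary; high entropy suggests one the model should probably refuse.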

COPR: Continual Human Preference Learning via Optimal Policy Regularization
Han Zhang | Lin Gui | Yu Lei | Yuanzhao Zhai | Yehong Zhang | Zhuo Zhang | Yulan He | Hui Wang | Yue Yu | Kam-Fai Wong | Bin Liang | Ruifeng Xu
Findings of the Association for Computational Linguistics: ACL 2025

Reinforcement Learning from Human Feedback (RLHF) is effective for aligning Large Language Models (LLMs) with human preferences. However, RLHF’s complex process limits its ability to learn continually from human feedback, making it impractical for real-world applications where the deployed model continuously receives feedback from users. Non-RL-based methods, such as Direct Preference Optimization (DPO), are not inherently well suited to Continual Learning (CL). We observe that when combined with Experience Replay (ER) for CL, DPO tends to significantly widen the gap between the probabilities of human-preferred and dispreferred responses. Consequently, this diminishes the diversity in model generation, potentially leading to model collapse. To overcome the above challenges, we propose Continual Optimal Policy Regularization (COPR), a novel non-RL offline method that converts the historical optimal policies into optimization constraints when continually learning new preferences. We first derive a moderate reward function from the pairwise ranking loss and then use the moderate reward to calculate a new sampling distribution to construct novel learning objectives and constraints. We also provide a formal proof of the learnability of COPR. The experimental results show that COPR outperforms strong CL baselines on our proposed benchmark in terms of reward-based evaluation, GPT-4 evaluation, and human assessment.

Flexibly Utilize Memory for Long-Term Conversation via a Fragment-then-Compose Framework
Cai Ke | Yiming Du | Bin Liang | Yifan Xiang | Lin Gui | Zhongyang Li | Baojun Wang | Yue Yu | Hui Wang | Kam-Fai Wong | Ruifeng Xu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large language models (LLMs) have made significant breakthroughs in extracting useful information from conversation history to enhance the response in long-term conversations. Summarizing useful information from historical conversations has achieved remarkable performance, which, however, may introduce irrelevant or redundant information, making it difficult to flexibly choose and integrate key information from different sessions during memory retrieval. To address this issue, we propose a Fragment-then-Compose framework, a novel memory utilization approach for long-term open-domain conversation, called *FraCom*. To be specific, inspired by the concept of proposition representation from Cognitive Psychology, we first represent the conversation history as a series of predicates plus arguments for propositional representation to preserve key information useful for memory ("**Fragment**"). Then, we compose propositional graphs for the conversation history based on the connection between shared arguments ("**Compose**"). During retrieval, we retrieve relevant propositions from the graph based on arguments from the current query. This essentially allows for flexible and effective utilization of related information in long-term memory for better response generation towards a query. Experimental results on four long-term open-domain conversation datasets demonstrate the effectiveness of our *FraCom* in memory utilization and its ability to enhance response generation for LLMs.
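The Fragment and Compose steps can be pictured with a small data-structure sketch. It assumes propositions have already been extracted as a predicate plus arguments, and it links propositions that share an argument through an inverted index. The class `PropositionGraph`, its methods, and the `hops` parameter are hypothetical illustrations, not the authors' implementation.

```python
from collections import defaultdict


class PropositionGraph:
    """Propositions (predicate + arguments) linked through shared arguments."""

    def __init__(self):
        self.props = []                   # list of (predicate, args) tuples
        self.by_arg = defaultdict(set)    # argument -> indices of propositions

    def add(self, predicate: str, *args: str) -> int:
        # "Fragment": store one proposition; "Compose": index it by argument,
        # which implicitly connects it to every proposition sharing one.
        idx = len(self.props)
        self.props.append((predicate, args))
        for arg in args:
            self.by_arg[arg.lower()].add(idx)
        return idx

    def retrieve(self, query_args, hops: int = 1):
        # Start from propositions that mention a query argument, then follow
        # shared-argument links for up to `hops` further steps.
        frontier = set()
        for arg in query_args:
            frontier |= self.by_arg.get(arg.lower(), set())
        seen = set(frontier)
        for _ in range(hops):
            reached = set()
            for i in frontier:
                for arg in self.props[i][1]:
                    reached |= self.by_arg[arg.lower()]
            frontier = reached - seen
            seen |= frontier
        return [self.props[i] for i in sorted(seen)]


graph = PropositionGraph()
graph.add("adopt", "Alice", "dog")
graph.add("name", "dog", "Rex")
graph.add("visit", "Bob", "Paris")
print(graph.retrieve(["Alice"]))
```

With one hop, a query about Alice also reaches the proposition naming the dog, because both propositions share the argument "dog", while the unrelated proposition about Bob is left out.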

T²: An Adaptive Test-Time Scaling Strategy for Contextual Question Answering
Zhengyi Zhao | Shubo Zhang | Zezhong Wang | Huimin Wang | Yutian Zhao | Bin Liang | Yefeng Zheng | Binyang Li | Kam-Fai Wong | Xian Wu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Recent advances in large language models have demonstrated remarkable performance on Contextual Question Answering (CQA). However, prior approaches typically employ elaborate reasoning strategies regardless of question complexity, leading to low adaptability. Recent efficient test-time scaling methods introduce budget constraints or early stop mechanisms to avoid overthinking for straightforward questions. However, they add human bias to the reasoning process and fail to leverage models’ inherent reasoning capabilities. To address these limitations, we present T²: Think-to-Think, a novel framework that dynamically adapts reasoning depth based on question complexity. T² leverages the insight that if an LLM can effectively solve similar questions using specific reasoning strategies, it can apply the same strategy to the original question. This insight enables the adoption of concise reasoning for straightforward questions while maintaining detailed analysis for complex problems. T² works through four key steps: decomposing questions into structural elements, generating similar examples with candidate reasoning strategies, evaluating these strategies against multiple criteria, and applying the most appropriate strategy to the original question. Experimental evaluation across seven diverse CQA benchmarks demonstrates that T² not only achieves higher accuracy than baseline methods but also reduces computational overhead by up to 25.2%.

Rethinking Stateful Tool Use in Multi-Turn Dialogues: Benchmarks and Challenges
Hongru Wang | Wenyu Huang | Yufei Wang | Yuanhao Xi | Jianqiao Lu | Huan Zhang | Nan Hu | Zeming Liu | Jeff Z. Pan | Kam-Fai Wong
Findings of the Association for Computational Linguistics: ACL 2025

Existing benchmarks that assess Language Models (LMs) as Language Agents (LAs) for tool use primarily focus on stateless, single-turn interactions or partial evaluations, such as tool selection in a single turn, overlooking the inherent stateful nature of interactions in multi-turn applications. To fill this gap, we propose DialogTool, a multi-turn dialogue dataset with stateful tool interactions considering the whole life cycle of tool use, across six key tasks in three stages: 1) tool creation; 2) tool utilization: tool awareness, tool selection, tool execution; and 3) role-consistent response: response generation and role play. Furthermore, we build VirtualMobile, an embodied virtual mobile evaluation environment, to simulate API calls and assess the robustness of the created APIs. Taking advantage of these artifacts, we conduct a comprehensive evaluation on 13 distinct open- and closed-source LLMs and provide detailed analysis at each stage, revealing that existing state-of-the-art LLMs still struggle to use tools over long horizons.

Self-Reasoning Language Models: Unfold Hidden Reasoning Chains with Few Reasoning Catalyst
Hongru Wang | Deng Cai | Wanjun Zhong | Shijue Huang | Jeff Z. Pan | Zeming Liu | Kam-Fai Wong
Findings of the Association for Computational Linguistics: ACL 2025

Inference-time scaling, which significantly enhances the performance of Large Language Models (LLMs) in complex reasoning tasks by increasing the length of Chain-of-Thought, has attracted much attention. These longer intermediate reasoning rationales embody various meta-reasoning skills in human cognition, such as reflection and decomposition, which are difficult to create and acquire. In this work, we introduce Self-Reasoning Language Model (SRLM), where the model itself can synthesize longer CoT data and iteratively improve performance through self-training. By incorporating a few demonstration examples (i.e., 1,000 samples) on how to unfold hidden reasoning chains from existing responses, which act as a reasoning catalyst, we demonstrate that SRLM not only enhances the model’s initial performance but also ensures more stable and consistent improvements in subsequent iterations. Our proposed SRLM achieves an average absolute improvement of more than +2.5 points across five reasoning tasks: MMLU, GSM8K, ARC-C, HellaSwag, and BBH on two backbone models. Moreover, it brings larger improvements with more samples during inference, such as an absolute +7.89 average improvement with 64 samples, revealing the in-depth, diverse and creative reasoning paths in SRLM against the strong baseline.

Mitigating Biases of Large Language Models in Stance Detection with Counterfactual Augmented Calibration
Ang Li | Jingqian Zhao | Bin Liang | Lin Gui | Hui Wang | Xi Zeng | Xingwei Liang | Kam-Fai Wong | Ruifeng Xu
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Stance detection is critical for understanding the underlying position or attitude expressed toward a topic. Large language models (LLMs) have demonstrated significant advancements across various natural language processing tasks, including stance detection. However, their performance in stance detection is limited by biases and spurious correlations inherent in their data-driven nature. Our statistical experiment reveals that LLMs are prone to generate biased stances due to sentiment-stance spurious correlations and preference towards certain individuals and topics. Furthermore, the results demonstrate a strong negative correlation between stance bias and stance detection performance, underscoring the importance of mitigating bias to enhance the utility of LLMs in stance detection. Therefore, in this paper, we propose a Counterfactual Augmented Calibration Network (FACTUAL), in which a novel calibration network is devised to calibrate potential bias in the stance prediction of LLMs. Further, to address the challenge of effectively learning bias representations and the difficulty in the generalizability of debiasing, we construct counterfactual augmented data. This approach enhances the calibration network, facilitating the debiasing and out-of-domain generalization. Experimental results on in-target and zero-shot stance detection tasks show that the proposed FACTUAL can effectively mitigate biases of LLMs, achieving state-of-the-art results.

MemeReaCon: Probing Contextual Meme Understanding in Large Vision-Language Models
Zhengyi Zhao | Shubo Zhang | Yuxi Zhang | Yanxi Zhao | Yifan Zhang | Zezhong Wang | Huimin Wang | Yutian Zhao | Bin Liang | Yefeng Zheng | Binyang Li | Kam-Fai Wong | Xian Wu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Memes have emerged as a popular form of multimodal online communication, where their interpretation heavily depends on the specific context in which they appear. Current approaches predominantly focus on isolated meme analysis, either for harmful content detection or standalone interpretation, overlooking a fundamental challenge: the same meme can express different intents depending on its conversational context. This oversight creates an evaluation gap: although humans intuitively recognize how context shapes meme interpretation, Large Vision Language Models (LVLMs) can hardly understand context-dependent meme intent. To address this critical limitation, we introduce MemeReaCon, a novel benchmark specifically designed to evaluate how LVLMs understand memes in their original context. We collected memes from five different Reddit communities, keeping each meme’s image, the post text, and user comments together. We carefully labeled how the text and meme work together, what the poster intended, how the meme is structured, and how the community responded. Our tests with leading LVLMs show a clear weakness: models either fail to interpret critical information in the contexts, or overly focus on visual details while overlooking communicative purpose. MemeReaCon thus serves both as a diagnostic tool exposing current limitations and as a challenging benchmark to drive development toward LVLMs with more sophisticated context-aware understanding.

Learning First-Order Logic Rules for Argumentation Mining
Yang Sun | Guanrong Chen | Hamid Alinejad-Rokny | Jianzhu Bao | Yuqi Huang | Bin Liang | Kam-Fai Wong | Min Yang | Ruifeng Xu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Argumentation Mining (AM) aims to extract argumentative structures from texts by identifying argumentation components (ACs) and their argumentative relations (ARs). While previous works focus on representation learning to encode ACs and AC pairs, they fail to explicitly model the underlying reasoning patterns of AM, resulting in limited interpretability. This paper proposes a novel First-Order Logic reasoning framework for AM (FOL-AM), designed to explicitly capture logical reasoning paths within argumentative texts. By interpreting multiple AM subtasks as a unified relation query task modeled using FOL rules, FOL-AM facilitates multi-hop relational reasoning and enhances interpretability. The framework supports two flexible implementations: a fine-tuned approach to leverage task-specific learning, and a prompt-based method utilizing large language models to harness their generalization capabilities. Extensive experiments on two AM benchmarks demonstrate that FOL-AM outperforms strong baselines while significantly improving explainability.

Chain-of-Probe: Examining the Necessity and Accuracy of CoT Step-by-Step
Zezhong Wang | Xingshan Zeng | Weiwen Liu | Yufei Wang | Liangyou Li | Yasheng Wang | Lifeng Shang | Xin Jiang | Qun Liu | Kam-Fai Wong
Findings of the Association for Computational Linguistics: NAACL 2025

Recent research has identified the issue of Early Answering in large language models (LLMs), where the models already have an answer before generating the Chain-of-Thought (CoT). This phenomenon suggests a potential lack of necessary dependency between the predicted answer and the reasoning process. Consequently, two important questions arise: (1) Is CoT still necessary if the model already has an answer? (2) Can the correctness of the answer serve as valid evidence for the correctness of CoT? To address these questions, we propose a method, namely Chain-of-Probe (CoP), to probe changes in confidence during the model’s reasoning. The probing results show that in a significant number of question-answer cases, CoT appears to be unnecessary, and this necessity correlates with the simplicity of the task, defined by the reasoning steps required. Furthermore, by analyzing patterns in confidence change, we examine the correctness of the model’s reasoning. Our validation reveals that many responses, although correct in their final answer, contain errors in their reasoning process. To this end, we propose a strategic approach based on CoP to prioritize answers with correct reasoning among multiple candidates, thereby bolstering the reliability of the model’s reasoning.

Stepwise Reasoning Checkpoint Analysis: A Test Time Scaling Method to Enhance LLMs’ Reasoning
Zezhong Wang | Xingshan Zeng | Weiwen Liu | Yufei Wang | Liangyou Li | Yasheng Wang | Lifeng Shang | Xin Jiang | Qun Liu | Kam-Fai Wong
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Mathematical reasoning through Chain-of-Thought (CoT) has emerged as a powerful capability of Large Language Models (LLMs), which can be further enhanced through Test-Time Scaling (TTS) methods like Beam Search and DVTS. However, these methods, despite improving accuracy by allocating more computational resources during inference, often suffer from path homogenization and inefficient use of intermediate results. To address these limitations, we propose Stepwise Reasoning Checkpoint Analysis (SRCA), a framework that introduces checkpoints between reasoning steps. It incorporates two key strategies: (1) Answer-Clustered Search, which groups reasoning paths by their intermediate checkpoint answers to maintain diversity while ensuring quality, and (2) Checkpoint Candidate Augmentation, which leverages all intermediate answers for final decision-making. Our approach effectively reduces path homogenization and creates a fault-tolerant mechanism by utilizing high-quality intermediate results. Experimental results show that SRCA improves reasoning accuracy compared to existing TTS methods across various mathematical datasets.
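One way to picture Answer-Clustered Search is the sketch below. It assumes each candidate reasoning path carries an intermediate checkpoint answer and a scalar score, groups paths by that answer, and fills the beam round-robin across groups so that no single answer takes every slot. The function name `answer_clustered_select`, the tuple layout, and the round-robin rule are assumptions made for illustration and are not drawn from the paper.

```python
from collections import defaultdict


def answer_clustered_select(paths, beam_width):
    """Pick up to `beam_width` paths while keeping distinct checkpoint answers.

    `paths` is a list of (path_id, checkpoint_answer, score) tuples.
    """
    clusters = defaultdict(list)
    for path in paths:
        clusters[path[1]].append(path)

    # Best path first inside each cluster.
    for members in clusters.values():
        members.sort(key=lambda p: p[2], reverse=True)

    # Visit clusters with the highest total score first.
    ordered = sorted(
        clusters.values(),
        key=lambda members: sum(p[2] for p in members),
        reverse=True,
    )

    # Round-robin: take the rank-0 path of every cluster, then rank-1, and so on.
    selected = []
    rank = 0
    while len(selected) < beam_width and any(rank < len(m) for m in ordered):
        for members in ordered:
            if rank < len(members) and len(selected) < beam_width:
                selected.append(members[rank])
        rank += 1
    return selected


paths = [("a", "42", 0.9), ("b", "42", 0.8), ("c", "42", 0.7), ("d", "17", 0.4)]
print([p[0] for p in answer_clustered_select(paths, beam_width=2)])
```

A plain top-2 selection by score would keep two paths that agree on "42"; the clustered selection keeps one path per answer, which is the kind of homogenization the abstract describes avoiding.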

ToolFlow: Boosting LLM Tool-Calling Through Natural and Coherent Dialogue Synthesis
Zezhong Wang | Xingshan Zeng | Weiwen Liu | Liangyou Li | Yasheng Wang | Lifeng Shang | Xin Jiang | Qun Liu | Kam-Fai Wong
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Supervised fine-tuning (SFT) is a common method to enhance the tool calling capabilities of Large Language Models (LLMs), with the training data often being synthesized. The current data synthesis process generally involves sampling a set of tools, formulating a requirement based on these tools, and generating the call statements. However, tools sampled randomly lack relevance, making them difficult to combine and thus reducing the diversity of the data. Additionally, current work overlooks the coherence between turns of dialogues, leading to a gap between the synthesized data and real-world scenarios. To address these issues, we propose a Graph-based Sampling strategy to sample more relevant tool combinations, and a Planned-generation strategy to create plans that guide the synthesis of coherent dialogues. We integrate these two strategies and enable multiple agents to synthesize the dialogue data interactively, resulting in our tool-calling data synthesis pipeline ToolFlow. Data quality assessments demonstrate improvements in the naturalness and coherence of our synthesized dialogues. Finally, we apply SFT on LLaMA-3.1-8B using 8,000 synthetic dialogues generated with ToolFlow. Results show that the model achieves tool-calling performance comparable to or even surpassing GPT-4, while maintaining strong general capabilities.

Self-DC: When to Reason and When to Act? Self Divide-and-Conquer for Compositional Unknown Questions
Hongru Wang | Boyang Xue | Baohang Zhou | Tianhua Zhang | Cunxiang Wang | Huimin Wang | Guanhua Chen | Kam-Fai Wong
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Previous research has typically concentrated on leveraging the internal knowledge of Large Language Models (LLMs) to answer known questions (i.e., internal reasoning such as generate-then-read). In contrast, for questions that fall outside their known scope, these models rely on external knowledge retrieval to provide accurate responses (i.e., external acting such as retrieve-then-read). However, few previous works consider compositional questions, which consist of several known and unknown sub-questions and necessitate the dynamic combination of the previous two methods (i.e., internal reasoning and external acting) to achieve a better trade-off between effectiveness and efficiency. To this end, we introduce a Self Divide-and-Conquer (Self-DC) framework, accompanied by the first Compositional unknown Question-Answering dataset (CuQA). This framework enables LLMs to adaptively choose between using internal knowledge and retrieving external knowledge as needed, resulting in a better trade-off between effectiveness and efficiency. Experimental results on two datasets demonstrate that Self-DC can achieve comparable or even better performance with far fewer external calls compared with several strong baselines.

MlingConf: A Comprehensive Study of Multilingual Confidence Estimation on Large Language Models
Boyang Xue | Hongru Wang | Rui Wang | Sheng Wang | Zezhong Wang | Yiming Du | Bin Liang | Wenxuan Zhang | Kam-Fai Wong
Findings of the Association for Computational Linguistics: ACL 2025

The tendency of Large Language Models (LLMs) to generate hallucinations raises concerns regarding their reliability. Therefore, confidence estimations indicating the extent of trustworthiness of the generations become essential. However, current LLM confidence estimations in languages other than English remain underexplored. This paper addresses this gap by introducing a comprehensive investigation of Multilingual Confidence estimation (MlingConf) on LLMs, focusing on both language-agnostic (LA) and language-specific (LS) tasks to explore the performance and language dominance effects of multilingual confidence estimations on different tasks. The benchmark comprises four meticulously checked and human-evaluated high-quality multilingual datasets for LA tasks and one for the LS task tailored to specific social, cultural, and geographical contexts of a language. Our experiments reveal that on LA tasks English exhibits notable linguistic dominance in confidence estimations over other languages, while on LS tasks, using question-related language to prompt LLMs demonstrates better linguistic dominance in multilingual confidence estimations. These phenomena inspire a simple yet effective native-tone prompting strategy that employs language-specific prompts for LS tasks, effectively improving LLMs’ reliability and accuracy in LS scenarios.

Investigating Bias in LLM-Based Bias Detection: Disparities between LLMs and Human Perception
Luyang Lin | Lingzhi Wang | Jinsong Guo | Kam-Fai Wong
Proceedings of the 31st International Conference on Computational Linguistics

The pervasive spread of misinformation and disinformation in social media underscores the critical importance of detecting media bias. While robust Large Language Models (LLMs) have emerged as foundational tools for bias prediction, concerns about inherent biases within these models persist. In this work, we investigate the presence and nature of bias within LLMs and its consequential impact on media bias detection. Departing from conventional approaches that focus solely on bias detection in media content, we delve into biases within the LLM systems themselves. Through meticulous examination, we probe whether LLMs exhibit biases, particularly in political bias prediction and text continuation tasks. Additionally, we explore bias across diverse topics, aiming to uncover nuanced variations in bias expression within the LLM framework. Importantly, we propose debiasing strategies, including prompt engineering and model fine-tuning. Extensive analysis of bias tendencies across different LLMs sheds light on the broader landscape of bias propagation in language models. This study advances our understanding of LLM bias, offering critical insights into its implications for bias detection tasks and paving the way for more robust and equitable AI systems.

ReSURE: Regularizing Supervision Unreliability for Multi-turn Dialogue Fine-tuning
Yiming Du | Yifan Xiang | Bin Liang | Dahua Lin | Kam-Fai Wong | Fei Tan
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Fine-tuning multi-turn dialogue systems requires high-quality supervision but often suffers from degraded performance when exposed to low-quality data. Supervision errors in early turns can propagate across subsequent turns, undermining coherence and response quality. Existing methods typically address data quality via static prefiltering, which decouples quality control from training and fails to mitigate turn-level error propagation. In this context, we propose **ReSURE** (REgularizing Supervision UnREliability), an adaptive learning method that dynamically down-weights unreliable supervision without explicit filtering. ReSURE estimates per-turn loss distributions using Welford’s online statistics and reweights sample losses on the fly accordingly. Experiments on both single-source and mixed-quality datasets show improved stability and response quality. Notably, ReSURE enjoys positive Spearman correlations (0.21 ~ 1.0 across multiple benchmarks) between response scores and number of samples regardless of data quality, which potentially paves the way for utilizing large-scale data effectively.
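Welford's online algorithm, which the abstract names, keeps a running mean and variance without storing past values. The sketch below pairs it with a simple down-weighting rule for turns whose loss is unusually high compared with the running statistics for that turn position. The `WelfordStats` class follows the standard algorithm; the rule in `turn_weight`, its threshold `k`, and the helper `weighted_dialogue_loss` are assumptions chosen for illustration and are not the weighting used in the paper.

```python
import math
from collections import defaultdict


class WelfordStats:
    """Running mean and sample variance via Welford's online algorithm."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0

    @property
    def std(self) -> float:
        return math.sqrt(self.variance)


def turn_weight(loss: float, stats: WelfordStats, k: float = 1.0) -> float:
    # Hypothetical rule: full weight within k standard deviations of the
    # running mean, exponentially decaying weight for larger losses.
    if stats.count < 2 or stats.std == 0.0:
        return 1.0
    z = (loss - stats.mean) / stats.std
    return 1.0 if z <= k else math.exp(-(z - k))


def weighted_dialogue_loss(turn_losses, stats_by_turn) -> float:
    # Reweight each turn's loss on the fly, then update that turn's statistics.
    total = 0.0
    for turn_idx, loss in enumerate(turn_losses):
        stats = stats_by_turn[turn_idx]
        total += turn_weight(loss, stats) * loss
        stats.update(loss)
    return total


stats_by_turn = defaultdict(WelfordStats)
for losses in ([1.0, 2.0], [1.2, 2.1], [0.9, 1.9], [1.1, 9.0]):
    print(weighted_dialogue_loss(losses, stats_by_turn))
```

In the last dialogue the second-turn loss of 9.0 sits far above the running statistics for that turn position, so it contributes almost nothing to the total, while no sample is filtered out beforehand.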

2024

SeRTS: Self-Rewarding Tree Search for Biomedical Retrieval-Augmented Generation
Minda Hu | Licheng Zong | Hongru Wang | Jingyan Zhou | Jingjing Li | Yichen Gao | Kam-Fai Wong | Yu Li | Irwin King
Findings of the Association for Computational Linguistics: EMNLP 2024

Large Language Models (LLMs) have shown great potential in the biomedical domain with the advancement of retrieval-augmented generation (RAG). However, existing retrieval-augmented approaches face challenges in addressing diverse queries and documents, particularly for medical knowledge queries, resulting in sub-optimal performance. To address these limitations, we propose a novel plug-and-play LLM-based retrieval method called Self-Rewarding Tree Search (SeRTS) based on Monte Carlo Tree Search (MCTS) and a self-rewarding paradigm. By combining the reasoning capabilities of LLMs with the effectiveness of tree search, SeRTS boosts the zero-shot performance of retrieving high-quality and informative results for RAG. We further enhance retrieval performance by fine-tuning LLMs with Proximal Policy Optimization (PPO) objectives using the trajectories collected by SeRTS as feedback. Controlled experiments using the BioASQ-QA dataset with GPT-3.5-Turbo and LLama2-7b demonstrate that our method significantly improves the performance of the BM25 retriever and surpasses the strong baseline of self-reflection in both efficiency and scalability. Moreover, SeRTS generates higher-quality feedback for PPO training than self-reflection. Our proposed method effectively adapts LLMs to document retrieval tasks, enhancing their ability to retrieve highly relevant documents for RAG in the context of medical knowledge queries. This work presents a significant step forward in leveraging LLMs for accurate and comprehensive biomedical question answering.

IndiVec: An Exploration of Leveraging Large Language Models for Media Bias Detection with Fine-Grained Bias Indicators
Luyang Lin | Lingzhi Wang | Xiaoyan Zhao | Jing Li | Kam-Fai Wong
Findings of the Association for Computational Linguistics: EACL 2024

This study focuses on media bias detection, crucial in today’s era of influential social media platforms shaping individual attitudes and opinions. In contrast to prior work that primarily relies on training specific models tailored to particular datasets, resulting in limited adaptability and subpar performance on out-of-domain data, we introduce a general bias detection framework, IndiVec, built upon large language models. IndiVec begins by constructing a fine-grained media bias database, leveraging the robust instruction-following capabilities of large language models and vector database techniques. When confronted with new input for bias detection, our framework automatically selects the most relevant indicator from the vector database and employs majority voting to determine the input’s bias label. IndiVec excels compared to previous methods due to its adaptability (demonstrating consistent performance across diverse datasets from various sources) and explainability (providing explicit top-k indicators to interpret bias predictions). Experimental results on four political bias datasets highlight IndiVec’s significant superiority over baselines. Furthermore, additional experiments and analysis provide profound insights into the framework’s effectiveness.
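The retrieve-then-vote step described above can be sketched as follows. The sketch assumes indicators are stored as (embedding, bias label) pairs and that cosine similarity is used for retrieval. The toy vectors, the labels, and the function names `cosine` and `predict_bias` are hypothetical and do not come from the paper or from any particular vector database.

```python
import math
from collections import Counter


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_bias(query_vec, indicators, k: int = 3):
    """Return the majority label among the k most similar indicators,
    together with those indicators so the prediction can be inspected."""
    ranked = sorted(
        indicators,
        key=lambda item: cosine(query_vec, item[0]),
        reverse=True,
    )
    top_k = ranked[:k]
    votes = Counter(label for _, label in top_k)
    return votes.most_common(1)[0][0], top_k


# Toy two-dimensional "embeddings" standing in for real indicator vectors.
indicators = [
    ([1.0, 0.0], "left"),
    ([0.9, 0.1], "left"),
    ([0.0, 1.0], "right"),
    ([0.1, 0.9], "right"),
]
label, evidence = predict_bias([1.0, 0.05], indicators, k=3)
print(label, [lab for _, lab in evidence])
```

Returning the top-k indicators alongside the label mirrors the explainability point in the abstract: the retrieved indicators serve as the evidence for the predicted label.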

PACAR: Automated Fact-Checking with Planning and Customized Action Reasoning Using Large Language Models
Xiaoyan Zhao | Lingzhi Wang | Zhanghao Wang | Hong Cheng | Rui Zhang | Kam-Fai Wong
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In an era characterized by the rapid proliferation of information, the pervasive issues of misinformation and disinformation have significantly impacted numerous individuals. Consequently, the evaluation of information’s truthfulness and accuracy has garnered substantial attention among researchers. In this work, we present a novel fact-checking framework called PACAR, fact-checking based on planning and customized action reasoning using LLMs. It comprises four modules: a claim decomposer with self-reflection, an LLM-centric planner module, an executor for carrying out planned actions, and a verifier module that assesses veracity and generates explanations based on the overall reasoning process. Unlike previous work that employs single-path decision-making and single-step verdict prediction, PACAR focuses on the use of LLMs in dynamic planning and execution of actions. Furthermore, in contrast to previous work that relied primarily on general reasoning, we introduce tailored actions such as numerical reasoning and entity disambiguation to effectively address potential challenges in fact-checking. Our PACAR framework, incorporating LLM-centric planning along with customized action reasoning, significantly outperforms baseline methods across three datasets from different domains and with varying complexity levels. Additional experiments, including multidimensional and sliced observations, demonstrate the effectiveness of PACAR and offer valuable insights for the advancement of automated fact-checking.

Visually Guided Generative Text-Layout Pre-training for Document Intelligence
Zhiming Mao | Haoli Bai | Lu Hou | Lifeng Shang | Xin Jiang | Qun Liu | Kam-Fai Wong
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Prior study shows that pre-training techniques can boost the performance of visual document understanding (VDU), which typically requires models to gain abilities to perceive and reason both document texts and layouts (e.g., locations of texts and table-cells). To this end, we propose visually guided generative text-layout pre-training, named ViTLP. Given a document image, the model optimizes hierarchical language and layout modeling objectives to generate the interleaved text and layout sequence. In addition, to address the limitation of processing long documents by Transformers, we introduce a straightforward yet effective multi-segment generative pre-training scheme, facilitating ViTLP to process word-intensive documents of any length. ViTLP can function as a native OCR model to localize and recognize texts of document images. Besides, ViTLP can be effectively applied to various downstream VDU tasks. Extensive experiments show that ViTLP achieves competitive performance over existing baselines on benchmark VDU tasks, including information extraction, document classification, and document question answering.

Role Prompting Guided Domain Adaptation with General Capability Preserve for Large Language Models
Rui Wang | Fei Mi | Yi Chen | Boyang Xue | Hongru Wang | Qi Zhu | Kam-Fai Wong | Ruifeng Xu
Findings of the Association for Computational Linguistics: NAACL 2024

The growing interest in Large Language Models (LLMs) for specialized applications has revealed a significant challenge: when tailored to specific domains, LLMs tend to experience catastrophic forgetting, compromising their general capabilities and leading to a suboptimal user experience. Additionally, crafting a versatile model for multiple domains simultaneously often results in a decline in overall performance due to confusion between domains. In response to these issues, we present the RolE Prompting Guided Multi-Domain Adaptation (REGA) strategy. This novel approach effectively manages multi-domain LLM adaptation through three key components: 1) Self-Distillation constructs and replays general-domain exemplars to alleviate catastrophic forgetting. 2) Role Prompting assigns a central prompt to the general domain and a unique role prompt to each specific domain to minimize inter-domain confusion during training. 3) Role Integration reuses and integrates a small portion of domain-specific data to the general-domain data, which are trained under the guidance of the central prompt. The central prompt is used for a streamlined inference process, removing the necessity to switch prompts for different domains. Empirical results demonstrate that REGA effectively alleviates catastrophic forgetting and inter-domain confusion. This leads to improved domain-specific performance compared to standard fine-tuned models, while still preserving robust general capabilities.

Multi-modal Stance Detection: New Datasets and Model
Bin Liang | Ang Li | Jingqian Zhao | Lin Gui | Min Yang | Yue Yu | Kam-Fai Wong | Ruifeng Xu
Findings of the Association for Computational Linguistics: ACL 2024

Stance detection is a challenging task that aims to identify public opinion from social media platforms with respect to specific targets. Previous work on stance detection largely focused on pure texts. In this paper, we study multi-modal stance detection for tweets consisting of texts and images, which are prevalent in today’s fast-growing social media platforms where people often post multi-modal messages. To this end, we create five new multi-modal stance detection datasets of different domains based on Twitter, in which each example consists of a text and an image. In addition, we propose a simple yet effective Targeted Multi-modal Prompt Tuning framework (TMPT), where target information is leveraged to learn multi-modal stance features from textual and visual modalities. Experimental results on our five benchmark datasets show that the proposed TMPT achieves state-of-the-art performance in multi-modal stance detection.

Fine-tuning after Prompting: an Explainable Way for Classification
Zezhong Wang | Luyao Ye | Hongru Wang | Boyang Xue | Yiming Du | Bin Liang | Kam-Fai Wong
Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10)

Prompting is an alternative approach for utilizing pre-trained language models (PLMs) in classification tasks. In contrast to fine-tuning, prompting is more understandable for humans because it utilizes natural language to interact with the PLM, but it often falls short in terms of accuracy. While current research primarily focuses on enhancing the performance of prompting methods to compete with fine-tuning, we believe that these two approaches are not mutually exclusive, each having its strengths and weaknesses. In our study, we depart from the competitive view of prompting versus fine-tuning and instead combine them, introducing a novel method called F&P. This approach enables us to harness the advantages of Fine-tuning for accuracy and the explainability of Prompting simultaneously. Specifically, we reformulate the sample into a prompt and subsequently fine-tune a linear classifier on top of the PLM. Following this, we extract verbalizers according to the weight of this classifier. During the inference phase, we reformulate the sample in the same way and query the PLM. The PLM generates a word, which is then subject to a dictionary lookup by the verbalizer to obtain the prediction. Experiments show that keeping only 30 keywords for each class can achieve comparable performance as fine-tuning. On the other hand, both the prompt and verbalizers are constructed in natural language, making them fully understandable to humans. Hence, the F&P method offers an effective and transparent way to employ a PLM for classification tasks.
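The verbalizer extraction and lookup steps described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the F&P code: the weight table is a toy stand-in for the fine-tuned linear classifier, and `extract_verbalizers`, `predict`, and the rule for words claimed by several classes are hypothetical.

```python
# Hypothetical classifier weights: one weight per (class, vocabulary word).
WEIGHTS = {
    "positive": {"great": 2.1, "good": 1.7, "fine": 0.4, "bad": -1.5},
    "negative": {"bad": 1.9, "awful": 2.3, "fine": 0.5, "great": -2.0},
}

def extract_verbalizers(weights, top_n):
    """Keep the top_n highest-weighted words per class as a word -> class lookup."""
    lookup = {}
    for label, word_weights in weights.items():
        ranked = sorted(word_weights, key=word_weights.get, reverse=True)[:top_n]
        for word in ranked:
            # Assumption: a word claimed by two classes goes to the larger weight.
            if word not in lookup or word_weights[word] > weights[lookup[word]][word]:
                lookup[word] = label
    return lookup

def predict(generated_word, lookup, default=None):
    """Map the word generated by the PLM to a class via dictionary lookup."""
    return lookup.get(generated_word, default)

VERBALIZERS = extract_verbalizers(WEIGHTS, top_n=2)
```

Because the lookup table is plain words, it can be inspected directly, which is the transparency property the abstract emphasizes.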

Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10)
Kam-Fai Wong | Min Zhang | Ruifeng Xu | Jing Li | Zhongyu Wei | Lin Gui | Bin Liang | Runcong Zhao
Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10)

LLMEdgeRefine: Enhancing Text Clustering with LLM-Based Boundary Point Refinement
Zijin Feng | Luyang Lin | Lingzhi Wang | Hong Cheng | Kam-Fai Wong
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Text clustering is a fundamental task in natural language processing with numerous applications. However, traditional clustering methods often struggle with domain-specific fine-tuning and the presence of outliers. To address these challenges, we introduce LLMEdgeRefine, an iterative clustering method enhanced by large language models (LLMs), focusing on edge points refinement. LLMEdgeRefine enhances current clustering methods by creating super-points to mitigate outliers and iteratively refining clusters using LLMs for improved semantic coherence. Our method demonstrates superior performance across multiple datasets, outperforming state-of-the-art techniques, and offering robustness, adaptability, and cost-efficiency for diverse text clustering applications.

UniRetriever: Multi-task Candidates Selection for Various Context-Adaptive Conversational Retrieval
Hongru Wang | Boyang Xue | Baohang Zhou | Rui Wang | Fei Mi | Weichao Wang | Yasheng Wang | Kam-Fai Wong
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Conversational retrieval refers to an information retrieval system that operates in an iterative and interactive manner, requiring the retrieval of various external resources, such as persona, knowledge, and even response, to effectively engage with the user and successfully complete the dialogue. However, most previous work trained independent retrievers for each specific resource, resulting in sub-optimal performance and low efficiency. Thus, we propose a multi-task framework that functions as a universal retriever for three dominant retrieval tasks during the conversation: persona selection, knowledge selection, and response selection. To this end, we design a dual-encoder architecture consisting of a context-adaptive dialogue encoder and a candidate encoder, aiming to attend to the relevant context from the long dialogue and retrieve suitable candidates with a simple dot product. Furthermore, we introduce two loss constraints to capture the subtle relationship between dialogue context and different candidates by regarding historically selected candidates as hard negatives. Extensive experiments and analysis establish state-of-the-art retrieval quality both within and outside its training domain, revealing the promising potential and generalization capability of our model to serve as a universal retriever for different candidate selection tasks simultaneously.

DPDLLM: A Black-box Framework for Detecting Pre-training Data from Large Language Models
Baohang Zhou | Zezhong Wang | Lingzhi Wang | Hongru Wang | Ying Zhang | Kehui Song | Xuhui Sui | Kam-Fai Wong
Findings of the Association for Computational Linguistics: ACL 2024

The success of large language models (LLMs) benefits from large-scale model parameters and large amounts of pre-training data. However, the textual data used to train LLMs cannot be confirmed to be legal because it is crawled from different websites. For example, the pre-training data for LLMs contains copyrighted articles, personal reviews, and information whose use is illegal. To address this issue and develop legal LLMs, we propose to detect the pre-training data of an LLM in a purely black-box way, because existing LLM services only return the generated text. The most closely related prior work is the membership inference attack (MIA), which detects the training data of machine learning models. However, existing methods rely on analyzing the output probabilities of models, which is unrealistic for LLM services. To tackle the problem, we first construct benchmark datasets by collecting textual data from different domains as the seen and unseen pre-training data for LLMs. Then, we investigate a black-box framework named DPDLLM, which uses only the texts generated by an LLM to detect whether given textual data was used to train it. In the proposed framework, we exploit GPT-2 as the reference model to fit the textual data and feed the generated text from the LLM into it to acquire sequence probabilities as the significant feature for detection. The experimental results on the benchmark datasets demonstrate that DPDLLM is effective on different popular LLMs and outperforms the existing methods.

JoTR: A Joint Transformer and Reinforcement Learning Framework for Dialogue Policy Learning
Wai-Chung Kwan | Huimin Wang | Hongru Wang | Zezhong Wang | Bin Liang | Xian Wu | Yefeng Zheng | Kam-Fai Wong
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Dialogue policy learning (DPL) aims to determine an abstract representation (also known as action) to guide what the response should be. Typically, DPL is cast as a sequential decision problem across a series of predefined action candidates. However, such static and narrow actions can limit response diversity and impede the dialogue agent’s adaptability to new scenarios and edge cases. To overcome these challenges, we introduce a novel Joint Transformer Reinforcement Learning framework, coined as JoTR, where a text-to-text Transformer-based model is employed to directly generate dialogue actions. More concretely, JoTR formulates a token-grained policy, facilitating more dynamic and adaptable dialogue action generation without the need for predefined action candidates. This method not only enhances the diversity of responses but also significantly improves the system’s capability to manage unfamiliar scenarios. Furthermore, JoTR utilizes Reinforcement Learning with a reward-shaping mechanism to efficiently fine-tune the token-grained policy. This allows the model to evolve through interactions, thereby enhancing its performance over time. Our extensive evaluation demonstrates that JoTR surpasses previous state-of-the-art models, showing improvements of 9% and 13% in success rate, and 34% and 37% in the diversity of dialogue actions across two benchmark dialogue modeling tasks respectively. These results have been validated by both user simulators and human evaluators. Code and data are available at https://github.com/KwanWaiChung/JoTR.

MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models
Wai-Chung Kwan | Xingshan Zeng | Yuxin Jiang | Yufei Wang | Liangyou Li | Lifeng Shang | Xin Jiang | Qun Liu | Kam-Fai Wong
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Large language models (LLMs) are increasingly used for complex multi-turn conversations across diverse real-world applications. However, existing benchmarks mainly focus on single-turn evaluations, overlooking the models’ capabilities in multi-turn interactions. To address this gap, we introduce MT-Eval, a comprehensive benchmark to evaluate the multi-turn conversational abilities of LLMs. By analyzing human-LLM conversations, we categorize interaction patterns into four types: recollection, expansion, refinement, and follow-up. We construct multi-turn queries for each category either by augmenting existing datasets or creating new examples using GPT-4 with a human-in-the-loop process to avoid data leakage. To study the factors impacting multi-turn abilities, we create single-turn versions of the 1170 multi-turn queries and compare performance. Our evaluation of 10 well-known LLMs shows that while closed-source models generally surpass open-source ones, certain open-source models exceed GPT-3.5-Turbo in specific tasks. We observe significant performance degradation in multi-turn settings compared to single-turn settings in most models, which is not correlated with the models’ fundamental capabilities. Moreover, we identify the distance to relevant content and susceptibility to error propagation as the key factors influencing multi-turn performance.

PerLTQA: A Personal Long-Term Memory Dataset for Memory Classification, Retrieval, and Fusion in Question Answering
Yiming Du | Hongru Wang | Zhengyi Zhao | Bin Liang | Baojun Wang | Wanjun Zhong | Zezhong Wang | Kam-Fai Wong
Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10)

In conversational AI, effectively employing long-term memory improves personalized and consistent response generation. Existing work has concentrated on only a single type of long-term memory, such as preferences, dialogue history, or social relationships, overlooking their interaction in real-world contexts. To this end, inspired by the concept of semantic memory and episodic memory from cognitive psychology, we create a new and more comprehensive Chinese dataset, coined as PerLTQA, in which world knowledge, profiles, social relationships, events, and dialogues are considered to leverage the interaction between different types of long-term memory for question answering (QA) in conversation. Further, based on PerLTQA, we propose a novel framework for memory integration in QA, consisting of three subtasks: Memory Classification, Memory Retrieval, and Memory Fusion, which provides a comprehensive paradigm for memory modeling, enabling consistent and personalized memory utilization. This essentially allows the exploitation of more accurate memory information for better responses in QA. We evaluate this framework using five LLMs and three retrievers. Experimental results demonstrate the importance of personal long-term memory in the QA task.

SELF-GUARD: Empower the LLM to Safeguard Itself
Zezhong Wang | Fangkai Yang | Lu Wang | Pu Zhao | Hongru Wang | Liang Chen | Qingwei Lin | Kam-Fai Wong
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

With the increasing risk posed by jailbreak attacks, recent studies have investigated various methods to improve the safety of large language models (LLMs), mainly falling into two strategies: safety training and safeguards. Safety training involves fine-tuning the LLM with adversarial samples, which activate the LLM’s capabilities against jailbreak. However, it is not always effective in countering new attacks and often leads to potential performance degradation. Safeguards, on the other hand, are methods using additional models to filter harmful content from the LLM’s response. Nevertheless, they can only reduce a limited amount of harmful output and introduce extra computational costs. Given the distinct strengths and weaknesses of both, we combine them to balance out their flaws and propose a more effective method called Self-Guard. Specifically, we train the LLM to review its responses for any harmful content and append a [harmful] or [harmless] tag to the end of the response. In this way, Self-Guard possesses the advantages of safety training, leveraging the powerful capabilities of the LLMs themselves to detect harmfulness. Besides that, it gains flexibility like safeguards, making the safety check target the output side, which makes the system less vulnerable to attack updates. Experimental results indicate that our Self-Guard can effectively defend against jailbreak attacks and will not cause LLMs’ performance degradation.
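The output-side check that this abstract describes, where the model appends a [harmful] or [harmless] tag to its own response, could be consumed by a small filter like the sketch below. The tag strings come from the abstract; the function names, the refusal message, and the choice to withhold untagged responses are assumptions, not part of the published method.

```python
HARMFUL_TAG = "[harmful]"
HARMLESS_TAG = "[harmless]"
REFUSAL = "Sorry, I cannot help with that."  # hypothetical refusal text

def split_tag(response):
    """Split a response into (text, tag); tag is None when no trailing tag exists."""
    text = response.rstrip()
    for tag in (HARMFUL_TAG, HARMLESS_TAG):
        if text.endswith(tag):
            return text[: -len(tag)].rstrip(), tag
    return text, None

def release(response, refusal=REFUSAL):
    """Return the response text only when the model tagged it as harmless."""
    text, tag = split_tag(response)
    # Assumption: an untagged response is treated like a harmful one (fail closed).
    return text if tag == HARMLESS_TAG else refusal
```

Failing closed on a missing tag is a design choice made for this sketch; a deployment might instead log and re-query.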

Enhancing Large Language Models Against Inductive Instructions with Dual-critique Prompting
Rui Wang | Hongru Wang | Fei Mi | Boyang Xue | Yi Chen | Kam-Fai Wong | Ruifeng Xu
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Numerous works have been proposed to align large language models (LLMs) with human intents to better fulfill instructions, ensuring they are trustful and helpful. Nevertheless, some human instructions are often malicious or misleading and following them will lead to untruthful and unsafe responses. Previous work rarely focused on understanding how LLMs manage instructions based on counterfactual premises, referred to here as inductive instructions, which may stem from users’ false beliefs or malicious intents. In this paper, we aim to reveal the behaviors of LLMs towards inductive instructions and enhance their truthfulness and helpfulness accordingly. Specifically, we first introduce a benchmark of Inductive Instructions (INDust), where the false knowledge is incorporated into instructions in multiple different styles. After extensive human and automatic evaluations, we uncovered a universal vulnerability among LLMs in processing inductive instructions. Additionally, we identified that different inductive styles affect the models’ ability to identify the same underlying errors, and the complexity of the underlying assumptions also influences the model’s performance. Motivated by these results, we propose Dual-critique prompting to improve LLM robustness against inductive instructions. Our experiments demonstrate that Dual-critique prompting significantly bolsters the robustness of a diverse array of LLMs, even when confronted with varying degrees of inductive instruction complexity and differing inductive styles.

VLEU: a Method for Automatic Evaluation for Generalizability of Text-to-Image Models
Jingtao Cao | Zhang Zheng | Hongru Wang | Kam-Fai Wong
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Progress in Text-to-Image (T2I) models has significantly advanced the generation of images from textual descriptions. Existing metrics, such as CLIP, effectively measure the semantic alignment between single prompts and their corresponding images. However, they fall short in evaluating a model’s ability to generalize across a broad spectrum of textual inputs. To address this gap, we propose the VLEU (Visual Language Evaluation Understudy) metric. VLEU leverages the power of Large Language Models (LLMs) to sample from the visual text domain, encompassing the entire range of potential inputs for the T2I task, to generate a wide variety of visual text. The images generated by T2I models from these prompts are then assessed for their alignment with the input text using the CLIP model. VLEU quantitatively measures a model’s generalizability by computing the Kullback-Leibler (KL) divergence between the visual text marginal distribution and the conditional distribution over the images generated by the model. This provides a comprehensive metric for comparing the overall generalizability of T2I models, beyond single-prompt evaluations, and offers valuable insights during the finetuning process. Our experimental results demonstrate VLEU’s effectiveness in evaluating the generalizability of various T2I models, positioning it as an essential metric for future research and development in image synthesis from text prompts. Our code and data will be publicly available at https://github.com/mio7690/VLEU.
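The divergence computation mentioned in this abstract can be illustrated with a small numerical sketch. The abstract does not give the exact formula, so everything here is an assumption: the toy similarity matrix stands in for CLIP scores, each image's conditional distribution over prompts is taken as a softmax of its similarities, the marginal is the average of those conditionals, and the score is the mean KL divergence from each conditional to the marginal.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mean_kl_to_marginal(similarity):
    """similarity[i][j]: hypothetical image-i / prompt-j alignment score."""
    conditionals = [softmax(row) for row in similarity]
    n = len(conditionals)
    marginal = [sum(column) / n for column in zip(*conditionals)]
    return sum(kl_divergence(c, marginal) for c in conditionals) / n

# Images that match only their own prompt give a high score;
# images that match every prompt equally give a score of zero.
distinct = mean_kl_to_marginal([[10.0, 0.0], [0.0, 10.0]])
uniform = mean_kl_to_marginal([[1.0, 1.0], [1.0, 1.0]])
```

Under these assumptions the score is bounded above by the logarithm of the number of prompts, so two prompts give at most log 2.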

M4LE: A Multi-Ability Multi-Range Multi-Task Multi-Domain Long-Context Evaluation Benchmark for Large Language Models
Wai-Chung Kwan | Xingshan Zeng | Yufei Wang | Yusen Sun | Liangyou Li | Yuxin Jiang | Lifeng Shang | Qun Liu | Kam-Fai Wong
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Managing long sequences has become an important and necessary feature for large language models (LLMs). However, assessing their ability to handle long contexts remains a challenge. This paper introduces M⁴LE, a Multi-ability, Multi-range, Multi-task, Multi-domain benchmark for Long-context Evaluation. It encompasses 36 NLP datasets, covering 11 types of tasks and 12 domains, providing a comprehensive test bed. To address the lack of tasks featuring naturally long sequences, we propose an automatic approach to convert short-sequence tasks into long-sequence scenarios. These scenarios evaluate LLMs’ long-context understanding across five key abilities: understanding of single or multiple relevant spans in long contexts based on explicit or semantic hints, and global context understanding. This automatic approach allows us to create instances evenly distributed from 1k to 8k input length. Our evaluation of 11 prominent LLMs reveals that 1) Current LLMs struggle to understand long context, particularly when tasks require multiple-span attention. 2) Semantic retrieval is more difficult for competent LLMs. 3) Models fine-tuned on longer text with position interpolation have comparable performance to those using Neural Tangent Kernel (NTK) aware scaling methods without fine-tuning. We make our benchmark publicly available to encourage future research in this challenging area.

AppBench: Planning of Multiple APIs from Various APPs for Complex User Instruction
Hongru Wang | Rui Wang | Boyang Xue | Heming Xia | Jingtao Cao | Zeming Liu | Jeff Z. Pan | Kam-Fai Wong
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Large Language Models (LLMs) can interact with the real world by connecting with versatile external APIs, resulting in better problem-solving and task automation capabilities. Previous research primarily either focuses on APIs with limited arguments from a single source or overlooks the complex dependency relationship between different APIs. However, it is essential to utilize multiple APIs collaboratively from various sources, especially for complex user instructions. In this paper, we introduce AppBench, the first benchmark to evaluate LLMs’ ability to plan and execute multiple APIs from various sources in order to complete the user’s task. Specifically, we consider two significant challenges in multiple APIs: 1) graph structures: some APIs can be executed independently while others need to be executed one by one, resulting in graph-like execution order; and 2) permission constraints: which source is authorized to execute the API call. We report experimental results on 9 distinct LLMs; e.g., GPT-4o achieves only a 2.0% success rate at the most complex instruction, revealing that the existing state-of-the-art LLMs still cannot perform well in this situation even with the help of in-context learning and finetuning. Our code and data are publicly available at https://github.com/ruleGreen/AppBench.

WatME: Towards Lossless Watermarking Through Lexical Redundancy
Liang Chen | Yatao Bian | Yang Deng | Deng Cai | Shuaiyi Li | Peilin Zhao | Kam-Fai Wong
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Text watermarking has emerged as a pivotal technique for identifying machine-generated text. However, existing methods often rely on arbitrary vocabulary partitioning during decoding to embed watermarks, which compromises the availability of suitable tokens and significantly degrades the quality of responses. This study assesses the impact of watermarking on different capabilities of large language models (LLMs) from a cognitive science lens. Our finding highlights a significant disparity; knowledge recall and logical reasoning are more adversely affected than language generation. These results suggest a more profound effect of watermarking on LLMs than previously understood. To address these challenges, we introduce Watermarking with Mutual Exclusion (WatME), a novel approach leveraging linguistic prior knowledge of inherent lexical redundancy in LLM vocabularies to seamlessly integrate watermarks. Specifically, WatME dynamically optimizes token usage during the decoding process by applying a mutually exclusive rule to the identified lexical redundancies. This strategy effectively prevents the unavailability of appropriate tokens and preserves the expressive power of LLMs. We provide both theoretical analysis and empirical evidence showing that WatME effectively preserves the diverse capabilities of LLMs while ensuring watermark detectability.

2023

Cue-CoT: Chain-of-thought Prompting for Responding to In-depth Dialogue Questions with LLMs
Hongru Wang | Rui Wang | Fei Mi | Yang Deng | Zezhong Wang | Bin Liang | Ruifeng Xu | Kam-Fai Wong
Findings of the Association for Computational Linguistics: EMNLP 2023

Large Language Models (LLMs), such as ChatGPT, greatly empower dialogue systems with strong language understanding and generation capabilities. However, most of the previous works prompt the LLMs to directly generate a response based on the dialogue context, overlooking the underlying linguistic cues about the user status exhibited in the context. Such in-depth dialogue scenarios are challenging for existing LLMs to figure out the user’s hidden needs and respond satisfactorily through a single-step inference. To this end, we propose a novel linguistic cue-based chain-of-thoughts (Cue-CoT), which enhances the LLMs inference with an intermediate reasoning step to find cues exhibited in the dialogue, aiming to provide a more personalized and engaging response. To evaluate the approach, we build a benchmark with in-depth dialogue questions, consisting of 6 datasets in both Chinese and English, targeting 3 major linguistic cues during the conversation: personality, emotion, and psychology. We conducted experiments on the proposed benchmark with 5 LLMs under both zero-shot and one-shot settings. Empirical results demonstrate our proposed Cue-CoT method outperforms standard prompting methods in terms of both helpfulness and acceptability on all datasets.

Towards Robust Personalized Dialogue Generation via Order-Insensitive Representation Regularization
Liang Chen | Hongru Wang | Yang Deng | Wai Chung Kwan | Zezhong Wang | Kam-Fai Wong
Findings of the Association for Computational Linguistics: ACL 2023

Generating persona consistent dialogue response is important for developing an intelligent conversational agent. Recent works typically fine-tune large-scale pre-trained models on this task by concatenating persona texts and dialogue history as a single input sequence to generate the target response. While simple and effective, our analysis shows that this popular practice is seriously affected by order sensitivity where different input orders of persona sentences significantly impact the quality and consistency of generated response, resulting in severe performance fluctuations (i.e., 29.4% on GPT2 and 83.2% on BART). To mitigate the order sensitivity problem, we propose a model-agnostic framework, ORder Insensitive Generation (ORIG), which enables dialogue models to learn robust representation under different persona orders and improve the consistency of response generation. Experiments on the Persona-Chat dataset justify the effectiveness and superiority of our method with two dominant pre-trained models (GPT2 and BART).

CoAD: Automatic Diagnosis through Symptom and Disease Collaborative Generation
Huimin Wang | Wai Chung Kwan | Kam-Fai Wong | Yefeng Zheng
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Automatic diagnosis (AD), a critical application of AI in healthcare, employs machine learning techniques to assist doctors in gathering patient symptom information for precise disease diagnosis. The Transformer-based method utilizes an input symptom sequence, predicts itself through auto-regression, and employs the hidden state of the final symptom to determine the disease. Despite its simplicity and superior performance demonstrated, a decline in disease diagnosis accuracy is observed caused by 1) a mismatch between symptoms observed during training and generation, and 2) the effect of different symptom orders on disease prediction. To address the above obstacles, we introduce the CoAD, a novel disease and symptom collaborative generation framework, which incorporates several key innovations to improve AD: 1) aligning sentence-level disease labels with multiple possible symptom inquiry steps to bridge the gap between training and generation; 2) expanding symptom labels for each sub-sequence of symptoms to enhance annotation and eliminate the effect of symptom order; 3) developing a repeated symptom input schema to effectively and efficiently learn the expanded disease and symptom labels. We evaluate the CoAD framework using four datasets, including three public and one private, and demonstrate that it achieves an average 2.3% improvement over previous state-of-the-art results in automatic disease diagnosis. For reproducibility, we release the code and data at https://github.com/KwanWaiChung/coad.

Retrieval-free Knowledge Injection through Multi-Document Traversal for Dialogue Models
Rui Wang | Jianzhu Bao | Fei Mi | Yi Chen | Hongru Wang | Yasheng Wang | Yitong Li | Lifeng Shang | Kam-Fai Wong | Ruifeng Xu
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Dialogue models are often enriched with extensive external knowledge to provide informative responses through a retrieval-augmented pipeline. Nevertheless, retrieval-augmented approaches rely on finely annotated retrieval training data and knowledge-grounded response generation data, making them costly to transfer. To tackle this challenge, this paper proposes a retrieval-free approach, KiDG, which automatically turns knowledge documents into simulated multi-turn dialogues through a Multi-Document Traversal algorithm. The simulated knowledge-intensive dialogues constructed by KiDG in one domain can be easily used to train and enhance pre-trained dialogue models’ knowledge w.r.t. this domain without costly annotation. We conduct extensive experiments comparing retrieval-augmented models and a variety of retrieval-free models. We found that dialogue models enhanced with data simulated with KiDG largely outperform state-of-the-art retrieval-free methods, and achieve performance comparable to retrieval-augmented methods while being better and cheaper at domain transfer.

A Training-Free Debiasing Framework with Counterfactual Reasoning for Conversational Emotion Detection
Geng Tu | Ran Jing | Bin Liang | Min Yang | Kam-Fai Wong | Ruifeng Xu
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Unintended dataset biases typically exist in existing Emotion Recognition in Conversations (ERC) datasets, including label bias, where models favor the majority class due to imbalanced training data, as well as the speaker and neutral word bias, where models make unfair predictions because of excessive correlations between specific neutral words or speakers and classes. However, previous studies in ERC generally focus on capturing context-sensitive and speaker-sensitive dependencies, ignoring the unintended dataset biases of data, which hampers the generalization and fairness in ERC. To address this issue, we propose a Training-Free Debiasing framework (TFD) that operates during prediction without additional training. To ensure compatibility with various ERC models, it does not balance data or modify the model structure. Instead, TFD extracts biases from the model by generating counterfactual utterances and contexts and mitigates them using simple yet empirically robust element-wise subtraction operations. Extensive experiments on three public datasets demonstrate that TFD effectively improves generalization ability and fairness across different ERC models.
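
The element-wise subtraction step described above can be illustrated with a minimal sketch. It assumes the ERC model exposes class logits both for the original input and for a counterfactual input (for example, one in which the informative content is masked). The function name `debias_logits` and the scaling weight `lam` are hypothetical and do not come from the paper.

```python
from typing import List


def debias_logits(original: List[float], counterfactual: List[float], lam: float = 1.0) -> List[float]:
    """Subtract (scaled) counterfactual logits from the original logits, element-wise.

    `lam` is an assumed scaling weight; the abstract only states that
    element-wise subtraction is used.
    """
    if len(original) != len(counterfactual):
        raise ValueError("logit vectors must have the same length")
    return [o - lam * c for o, c in zip(original, counterfactual)]


def argmax(values: List[float]) -> int:
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Toy logits for three emotion classes (made-up numbers).
original = [2.0, 1.5, 0.5]        # prediction on the real utterance
counterfactual = [1.5, 0.2, 0.1]  # prediction on a counterfactual input, i.e. the bias alone
debiased = debias_logits(original, counterfactual)
```

In this toy case the biased prediction favors class 0, while the debiased logits favor class 1, because most of the evidence for class 0 is also present in the counterfactual input.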

Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators
Liang Chen | Yang Deng | Yatao Bian | Zeyu Qin | Bingzhe Wu | Tat-Seng Chua | Kam-Fai Wong
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Large language models (LLMs) outperform information retrieval techniques for downstream knowledge-intensive tasks when being prompted to generate world knowledge. However, community concerns abound regarding the factuality and potential implications of using this uncensored knowledge. In light of this, we introduce CONNER, a COmpreheNsive kNowledge Evaluation fRamework, designed to systematically and automatically evaluate generated knowledge from six important perspectives – Factuality, Relevance, Coherence, Informativeness, Helpfulness and Validity. We conduct an extensive empirical analysis of the generated knowledge from three different types of LLMs on two widely studied knowledge-intensive tasks, i.e., open-domain question answering and knowledge-grounded dialogue. Surprisingly, our study reveals that the factuality of generated knowledge, even if lower, does not significantly hinder downstream tasks. Instead, the relevance and coherence of the outputs are more important than small factual mistakes. Further, we show how to use CONNER to improve knowledge-intensive tasks by designing two strategies: Prompt Engineering and Knowledge Selection. Our evaluation code and LLM-generated knowledge with human annotations will be released to facilitate future research.

Dialog Action-Aware Transformer for Dialog Policy Learning
Huimin Wang | Wai Chung Kwan | Kam-Fai Wong
Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Recent works usually address dialog policy learning (DPL) by training a reinforcement learning (RL) agent to determine the best dialog action. However, existing works on deep RL require a large volume of agent-user interactions to achieve acceptable performance. In this paper, we propose to make full use of the plain text knowledge from the pre-trained language model to accelerate the RL agent’s learning speed. Specifically, we design a dialog action-aware transformer encoder (DaTrans), which integrates a new fine-tuning procedure named masked last action task to encourage DaTrans to be dialog-aware and distill action-specific features. Then, DaTrans is further optimized in an RL setting with ongoing interactions and evolves through exploration in the dialog action space toward maximizing long-term accumulated rewards. The effectiveness and efficiency of the proposed model are demonstrated with both simulator evaluation and human evaluation.

Strategize Before Teaching: A Conversational Tutoring System with Pedagogy Self-Distillation
Lingzhi Wang | Mrinmaya Sachan | Xingshan Zeng | Kam-Fai Wong
Findings of the Association for Computational Linguistics: EACL 2023

Conversational tutoring systems (CTSs) aim to help students master educational material with natural language interaction in the form of a dialog. CTSs have become a key pillar in educational data mining research. A key challenge in CTSs is to engage the student in the conversation while exposing them to a diverse set of teaching strategies, akin to a human teacher, thereby, helping them learn in the process. Different from previous work that generates responses given the strategies as input, we propose to jointly predict teaching strategies and generate tutor responses accordingly, which fits a more realistic application scenario. We benchmark several competitive models on three dialog tutoring datasets and propose a unified framework that combines teaching response generation and pedagogical strategy prediction, where a self-distillation mechanism is adopted to guide the teaching strategy learning and facilitate tutor response generation. Our experiments and analyses shed light on how teaching strategies affect dialog tutoring.

Large Language Models as Source Planner for Personalized Knowledge-grounded Dialogues
Hongru Wang | Minda Hu | Yang Deng | Rui Wang | Fei Mi | Weichao Wang | Yasheng Wang | Wai-Chung Kwan | Irwin King | Kam-Fai Wong
Findings of the Association for Computational Linguistics: EMNLP 2023

Open-domain dialogue systems usually require different sources of knowledge to generate more informative and evidential responses. However, existing knowledge-grounded dialogue systems either focus on a single knowledge source or overlook the dependency between multiple sources of knowledge, which may result in generating inconsistent or even paradoxical responses. To incorporate multiple knowledge sources and dependencies between them, we propose SAFARI, a novel framework that leverages the exceptional capabilities of large language models (LLMs) in planning, understanding, and incorporating under both supervised and unsupervised settings. Specifically, SAFARI decouples the knowledge grounding into multiple sources and response generation, which allows easy extension to various knowledge sources including the possibility of not using any sources. To study the problem, we construct a personalized knowledge-grounded dialogue dataset Knowledge Behind Persona (KBP), which is the first to consider the dependency between persona and implicit knowledge. Experimental results on the KBP dataset demonstrate that the SAFARI framework can effectively produce persona-consistent and knowledge-enhanced responses.

UniTRec: A Unified Text-to-Text Transformer and Joint Contrastive Learning Framework for Text-based Recommendation
Zhiming Mao | Huimin Wang | Yiming Du | Kam-Fai Wong
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Prior studies have shown that pretrained language models (PLMs) can boost the performance of text-based recommendation. In contrast to previous works that either use PLMs to encode user history as a whole input text, or impose an additional aggregation network to fuse multi-turn history representations, we propose a unified local- and global-attention Transformer encoder to better model two-level contexts of user history. Moreover, conditioned on user history encoded by Transformer encoders, our framework leverages Transformer decoders to estimate the language perplexity of candidate text items, which can serve as a straightforward yet significant contrastive signal for user-item text matching. Based on this, our framework, UniTRec, unifies the contrastive objectives of discriminative matching scores and candidate text perplexity to jointly enhance text-based recommendation. Extensive evaluation shows that UniTRec delivers SOTA performance on three text-based recommendation tasks.

MCML: A Novel Memory-based Contrastive Meta-Learning Method for Few Shot Slot Tagging
Hongru Wang | Zezhong Wang | Wai Chung Kwan | Kam-Fai Wong
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Set Learning for Generative Information Extraction
Jiangnan Li | Yice Zhang | Bin Liang | Kam-Fai Wong | Ruifeng Xu
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Recent efforts have endeavored to employ the sequence-to-sequence (Seq2Seq) model in Information Extraction (IE) due to its potential to tackle multiple IE tasks in a unified manner. Under this formalization, multiple structured objects are concatenated as the target sequence in a predefined order. However, structured objects, by their nature, constitute an unordered set. Consequently, this formalization introduces a potential order bias, which can impair model learning. Targeting this issue, this paper proposes a set learning approach that considers multiple permutations of structured objects to optimize set probability approximately. Notably, our approach does not require any modifications to model structures, making it easily integrated into existing generative IE frameworks. Experiments show that our method consistently improves existing frameworks on vast tasks and datasets.
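
One way to read "considers multiple permutations of structured objects to optimize set probability approximately" is as an average of sequence probabilities over a handful of orderings. The sketch below follows that reading only; the abstract does not give the exact objective. The scorer `toy_log_prob`, the function names, and the `max_perms` cap are all hypothetical stand-ins for a real Seq2Seq model and its sampling scheme.

```python
import itertools
import math
from typing import Callable, List, Sequence, Tuple


def log_mean_exp(values: List[float]) -> float:
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def approx_set_log_prob(
    objects: Sequence[str],
    seq_log_prob: Callable[[Tuple[str, ...]], float],
    max_perms: int = 6,
) -> float:
    """Approximate the log-probability of an unordered set by averaging
    the sequence probabilities of up to `max_perms` orderings."""
    perms = list(itertools.islice(itertools.permutations(objects), max_perms))
    return log_mean_exp([seq_log_prob(p) for p in perms])


def toy_log_prob(order: Tuple[str, ...]) -> float:
    """Toy scorer that prefers alphabetical order (a stand-in for order bias)."""
    return math.log(0.6) if list(order) == sorted(order) else math.log(0.2)


objs = ("b", "a")
score = approx_set_log_prob(objs, toy_log_prob)
```

Here the two orderings score 0.2 and 0.6, so the approximate set probability is their mean, 0.4. A training loss under this reading would be the negative of `score`.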

An Empirical Study on Multiple Knowledge from ChatGPT for Emotion Recognition in Conversations
Geng Tu | Bin Liang | Bing Qin | Kam-Fai Wong | Ruifeng Xu
Findings of the Association for Computational Linguistics: EMNLP 2023

Multiple knowledge (e.g., co-reference, topics, emotional causes, etc.) has been shown to be effective for emotion detection. However, exploring this knowledge in Emotion Recognition in Conversations (ERC) is currently a blank slate due to the lack of annotated data and the high cost involved in obtaining such knowledge. Fortunately, the emergence of Large Language Models (LLMs) holds promise in filling this void. Therefore, we propose a Multiple Knowledge Fusion Model (MKFM) to effectively integrate such knowledge generated by LLMs for ERC and empirically study its impact on the model. Experimental results on three public datasets have demonstrated the effectiveness of multiple knowledge for ERC. Furthermore, we conduct a detailed analysis of the contribution and complementarity of this knowledge.

ReadPrompt: A Readable Prompting Method for Reliable Knowledge Probing
Zezhong Wang | Luyao Ye | Hongru Wang | Wai-Chung Kwan | David Ho | Kam-Fai Wong
Findings of the Association for Computational Linguistics: EMNLP 2023

Knowledge probing is a task to assess the knowledge encoded within pre-trained language models (PLMs) by having the PLM complete prompts such as “Italy is located in __”. The model’s prediction precision serves as a lower bound for the amount of knowledge it contains. Subsequent works explore training a series of vectors as prompts to guide PLMs towards more accurate predictions. However, these methods compromise the readability of the prompts. We cannot directly understand these prompts from their literal meaning, making it difficult to verify whether they are correct. Consequently, the credibility of probing results derived from these prompts is diminished. To address the issue, we propose a novel method called ReadPrompt, which aims to identify meaningful sentences to serve as prompts. Experiments show that ReadPrompt achieves state-of-the-art performance on the current knowledge probing benchmark. Moreover, since the prompt is readable, we discovered a misalignment between constructed prompts and knowledge, which is also present in current prompting methods, as verified by an attack experiment. We claim that the probing outcomes of the current prompting methods are unreliable and overestimate the knowledge contained within PLMs.

Improving Factual Consistency for Knowledge-Grounded Dialogue Systems via Knowledge Enhancement and Alignment
Boyang Xue | Weichao Wang | Hongru Wang | Fei Mi | Rui Wang | Yasheng Wang | Lifeng Shang | Xin Jiang | Qun Liu | Kam-Fai Wong
Findings of the Association for Computational Linguistics: EMNLP 2023

Pretrained language models (PLMs) based knowledge-grounded dialogue systems are prone to generate responses that are factually inconsistent with the provided knowledge source. In such inconsistent responses, the dialogue models fail to accurately express the external factual knowledge they rely upon. Inspired by previous work which identified that feedforward networks (FFNs) within Transformers are responsible for factual knowledge expressions, we investigate two methods to efficiently improve the factual expression capability of FFNs by knowledge enhancement and alignment respectively. We first propose K-Dial, which explicitly introduces extended FFNs in Transformers to enhance factual knowledge expressions given the specific patterns of knowledge-grounded dialogue inputs. Additionally, we apply the reinforcement learning for factual consistency (RLFC) method to implicitly adjust FFNs’ expressions in responses by aligning with gold knowledge for the factual consistency preference. To comprehensively assess the factual consistency and dialogue quality of responses, we employ extensive automatic measures and human evaluations including sophisticated fine-grained NLI-based metrics. Experimental results on WoW and CMU_DoG datasets demonstrate that our methods efficiently enhance the ability of the FFN module to convey factual knowledge, validating the efficacy of improving factual consistency for knowledge-grounded dialogue systems.

In-context Learning for Few-shot Multimodal Named Entity Recognition
Chenran Cai | Qianlong Wang | Bin Liang | Bing Qin | Min Yang | Kam-Fai Wong | Ruifeng Xu
Findings of the Association for Computational Linguistics: EMNLP 2023

Thanks in part to the availability of copious annotated resources for some entity categories, existing studies have achieved superior performance in multimodal named entity recognition (MNER). However, in the real-world scenario, it is infeasible to enumerate all entity categories in advance. Therefore, in this paper, we formulate a new few-shot multimodal named entity recognition (FewMNER) task, which aims to effectively locate and identify named entities for a text-image pair only using a small number of labeled examples. Further, we explore the merit of in-context learning (ICL) and propose a novel framework to deal with FewMNER, where three points are taken into account: i.e., converting visual modality, selecting useful examples, and designing an effective task demonstration. Specifically, we first employ an image caption model to convert images into textual descriptions, enabling large language models to absorb information from visual modality. Then, we use the ranking of the sum of similarity rankings from both text and image modalities to select k-nearest examples, which form a demonstration context. Finally, we utilize the MNER definition and the meaning of each entity category as effective instruction. Extensive experimental results demonstrate that our framework outperforms baselines under several few-shot settings.
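
The example-selection step ("the ranking of the sum of similarity rankings from both text and image modalities") can be sketched as a rank-sum over two similarity lists. This is a minimal illustration that assumes per-candidate similarity scores are already computed. The function names and the tie-breaking rule (lower candidate index first) are assumptions and are not taken from the paper.

```python
from typing import List


def ranks_descending(scores: List[float]) -> List[int]:
    """Rank of each item when sorted by score, highest first (rank 0 = most similar)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks


def select_examples(text_sims: List[float], image_sims: List[float], k: int) -> List[int]:
    """Pick the k candidates with the smallest sum of text rank and image rank."""
    text_ranks = ranks_descending(text_sims)
    image_ranks = ranks_descending(image_sims)
    rank_sum = [t + v for t, v in zip(text_ranks, image_ranks)]
    # sorted() is stable, so ties are broken by the lower candidate index.
    return sorted(range(len(rank_sum)), key=lambda i: rank_sum[i])[:k]


# Toy similarities between a query and four candidate examples (made-up numbers).
text_sims = [0.9, 0.2, 0.8, 0.4]
image_sims = [0.1, 0.3, 0.9, 0.7]
chosen = select_examples(text_sims, image_sims, k=2)
```

With these numbers the rank sums are 3, 5, 1 and 3, so candidate 2 is selected first, followed by candidate 0 on the tie-break. Summing ranks rather than raw scores avoids having to calibrate text and image similarities onto a common scale.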

KGA: A General Machine Unlearning Framework Based on Knowledge Gap Alignment
Lingzhi Wang | Tong Chen | Wei Yuan | Xingshan Zeng | Kam-Fai Wong | Hongzhi Yin
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent legislation of the “right to be forgotten” has led to interest in machine unlearning, where the learned models are endowed with the function to forget information about specific training instances as if they have never existed in the training set. Previous work mainly focuses on computer vision scenarios and largely ignores the essentials of unlearning in the NLP field, where text data contains more explicit and sensitive personal information than images. In this paper, we propose a general unlearning framework called KGA to induce forgetfulness. Different from previous work that tries to recover gradients or forces models to perform close to one specific distribution, KGA maintains distribution differences (i.e., knowledge gap). This relaxes the distribution assumption. Furthermore, we first apply the unlearning method to various NLP tasks (i.e., classification, translation, response generation) and propose several unlearning evaluation metrics with pertinence. Experiments on large-scale datasets show that KGA yields comprehensive improvements over baselines, where extensive analyses further validate the effectiveness of KGA and provide insight into unlearning for NLP tasks.

2022

When Cantonese NLP Meets Pre-training: Progress and Challenges
Rong Xiang | Hanzhuo Tan | Jing Li | Mingyu Wan | Kam-Fai Wong
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing: Tutorial Abstracts

Cantonese is an influential Chinese variant with a large population of speakers worldwide. However, it is under-resourced in terms of the data scale and diversity, excluding Cantonese Natural Language Processing (NLP) from the state-of-the-art (SOTA) “pre-training and fine-tuning” paradigm. This tutorial will start with a substantial review of the linguistics and NLP progress for shaping language specificity, resources, and methodologies. It will be followed by an introduction to the trendy transformer-based pre-training methods, which have been largely advancing the SOTA performance of a wide range of downstream NLP tasks in numerous majority languages (e.g., English and Chinese). Based on the above, we will present the main challenges for Cantonese NLP in relation to Cantonese language idiosyncrasies of colloquialism and multilingualism, followed by the future directions to line NLP for Cantonese and other low-resource languages up to the cutting-edge pre-training practice.

Prior Omission of Dissimilar Source Domain(s) for Cost-Effective Few-Shot Learning
Zezhong Wang | Hongru Wang | Wai Chung Kwan | Kam-Fai Wong
Proceedings of the 5th International Conference on Natural Language and Speech Processing (ICNLSP 2022)

RecInDial: A Unified Framework for Conversational Recommendation with Pretrained Language Models
Lingzhi Wang | Huang Hu | Lei Sha | Can Xu | Daxin Jiang | Kam-Fai Wong
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Conversational Recommender System (CRS), which aims to recommend high-quality items to users through interactive conversations, has gained great research interest recently. A CRS is usually composed of a recommendation module and a generation module. In the previous work, these two modules are loosely connected in the model training and are shallowly integrated during inference, where a simple switching or copy mechanism is adopted to incorporate recommended items into generated responses. Moreover, the current end-to-end neural models trained on small crowd-sourcing datasets (e.g., 10K dialogs in the ReDial dataset) tend to overfit and have poor chit-chat ability. In this work, we propose a novel unified framework that integrates recommendation into the dialog (RecInDial) generation by introducing a vocabulary pointer. To tackle the low-resource issue in CRS, we finetune the large-scale pretrained language models to generate fluent and diverse responses, and introduce a knowledge-aware bias learned from an entity-oriented knowledge graph to enhance the recommendation performance. Furthermore, we propose to evaluate the CRS models in an end-to-end manner, which can reflect the overall performance of the entire system rather than the performance of individual modules, compared to the separate evaluations of the two modules used in previous work. Experiments on the benchmark dataset ReDial show our RecInDial model significantly surpasses the state-of-the-art methods. More extensive analyses show the effectiveness of our model.

TopicRefine: Joint Topic Prediction and Dialogue Response Generation for Multi-turn End-to-End Dialogue System
Hongru Wang | Mingyu Cui | Zimo Zhou | Kam-Fai Wong
Proceedings of the 5th International Conference on Natural Language and Speech Processing (ICNLSP 2022)

DIGAT: Modeling News Recommendation with Dual-Graph Interaction
Zhiming Mao | Jian Li | Hongru Wang | Xingshan Zeng | Kam-Fai Wong
Findings of the Association for Computational Linguistics: EMNLP 2022

News recommendation (NR) is essential for online news services. Existing NR methods typically adopt a news-user representation learning framework, facing two potential limitations. First, in news encoder, single candidate news encoding suffers from an insufficient semantic information problem. Second, existing graph-based NR methods are promising but lack effective news-user feature interaction, rendering the graph-based recommendation suboptimal. To overcome these limitations, we propose dual-interactive graph attention networks (DIGAT) consisting of news- and user-graph channels. In the news-graph channel, we enrich the semantics of single candidate news by incorporating the semantically relevant news information with a semantic-augmented graph (SAG). In the user-graph channel, multi-level user interests are represented with a news-topic graph. Most notably, we design a dual-graph interaction process to perform effective feature interaction between the news and user graphs, which facilitates accurate news-user representation matching. Experiment results on the benchmark dataset MIND show that DIGAT outperforms existing news recommendation methods. Further ablation studies and analyses validate the effectiveness of (1) semantic-augmented news graph modeling and (2) dual-graph interaction.

“I Know Who You Are”: Character-Based Features for Conversational Humor Recognition in Chinese
Wenbo Shang | Jiangjiang Zhao | Zezhong Wang | Binyang Li | Fangchun Yang | Kam-Fai Wong
Findings of the Association for Computational Linguistics: EMNLP 2022

Humor plays an important role in our daily life, as it is an essential and fascinating element in the communication between persons. Therefore, how to recognize punchlines from the dialogue, i.e., conversational humor recognition, has attracted much interest from the computational linguistics community. However, most existing work attempted to understand conversational humor by analyzing the contextual information of the dialogue, but neglected the character of the interlocutor, such as age, gender, occupation, and so on. For instance, the same utterance could come across as humorous from a serious person, but may be a plain expression from a naive person. To this end, this paper proposes a Character Fusion Conversational Humor Recognition model (CFCHR) to explore character information to recognize conversational humor. CFCHR utilizes a multi-task learning framework that unifies two highly pertinent tasks, i.e., character extraction and punchline identification. Based on deep neural networks, we trained both tasks jointly by sharing weights to extract the common and task-invariant features while each task could still learn its task-specific features. Experiments were conducted on a Chinese sitcom corpus, which consisted of 12,677 utterances from 22 characters. The experimental results demonstrated that CFCHR could achieve 33.08% improvements in terms of F1-score over some strong baselines, and proved the effectiveness of the character information to identify the punchlines.

Learning When and What to Quote: A Quotation Recommender System with Mutual Promotion of Recommendation and Generation
Lingzhi Wang | Xingshan Zeng | Kam-Fai Wong
Findings of the Association for Computational Linguistics: EMNLP 2022

This work extends the current quotation recommendation task to a more realistic quotation recommender system that learns to predict when to quote and what to quote jointly. The system consists of three modules (tasks), a prediction module to predict whether to quote given conversation contexts, a recommendation module to recommend suitable quotations and a generation module generating quotations or sentences in ordinary language to continue the conversation. We benchmark several competitive models for the two newly introduced tasks (i.e., when-to-quote and what-to-continue). For quotation recommendation, compared with previous work that is either generation-based or ranking-based recommendation, we propose a novel framework with mutual promotion of generation module and ranking-based recommendation module. Experiments show that our framework achieves significantly better performance than baselines on two datasets. Further experiments and analyses validate the effectiveness of the proposed mechanisms and get a better understanding of the quotation recommendation task.

2021

Re-entry Prediction for Online Conversations via Self-Supervised Learning
Lingzhi Wang | Xingshan Zeng | Huang Hu | Kam-Fai Wong | Daxin Jiang
Findings of the Association for Computational Linguistics: EMNLP 2021

In recent years, world business in online discussions and opinion sharing on social media is booming. Re-entry prediction task is thus proposed to help people keep track of the discussions which they wish to continue. Nevertheless, existing works only focus on exploiting chatting history and context information, and ignore the potentially useful learning signals underlying conversation data, such as conversation thread patterns and repeated engagement of target users, which help better understand the behavior of target users in conversations. In this paper, we propose three interesting and well-founded auxiliary tasks, namely, Spread Pattern, Repeated Target user, and Turn Authorship, as the self-supervised signals for re-entry prediction. These auxiliary tasks are trained together with the main task in a multi-task manner. Experimental results on two datasets newly collected from Twitter and Reddit show that our method outperforms the previous state of the art with fewer parameters and faster convergence. Extensive experiments and analysis show the effectiveness of our proposed models and also point out some key ideas in designing self-supervised tasks.

Neural News Recommendation with Collaborative News Encoding and Structural User Encoding
Zhiming Mao | Xingshan Zeng | Kam-Fai Wong
Findings of the Association for Computational Linguistics: EMNLP 2021

Automatic news recommendation has gained much attention from the academic community and industry. Recent studies reveal that the key to this task lies within the effective representation learning of both news and users. Existing works typically encode news title and content separately while neglecting their semantic interaction, which is inadequate for news text comprehension. Besides, previous models encode user browsing history without leveraging the structural correlation of user browsed news to reflect user interests explicitly. In this work, we propose a news recommendation framework consisting of collaborative news encoding (CNE) and structural user encoding (SUE) to enhance news and user representation learning. CNE equipped with bidirectional LSTMs encodes news title and content collaboratively with cross-selection and cross-attention modules to learn semantic-interactive news representations. SUE utilizes graph convolutional networks to extract cluster-structural features of user history, followed by intra-cluster and inter-cluster attention modules to learn hierarchical user interest representations. Experiment results on the MIND dataset validate the effectiveness of our model to improve the performance of news recommendation.

Fast and Scalable Dialogue State Tracking with Explicit Modular Decomposition
Dingmin Wang | Chenghua Lin | Qi Liu | Kam-Fai Wong
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We present a fast and scalable architecture called Explicit Modular Decomposition (EMD), in which we incorporate both classification-based and extraction-based methods and design four modules (for classification and sequence labelling) to jointly extract dialogue states. Experimental results based on the MultiWoz 2.0 dataset validate the superiority of our proposed model in terms of both complexity and scalability when compared to the state-of-the-art methods, especially in the scenario of multi-domain dialogues entangled with many turns of utterances.

Quotation Recommendation and Interpretation Based on Transformation from Queries to Quotations
Lingzhi Wang | Xingshan Zeng | Kam-Fai Wong
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

To help individuals express themselves better, quotation recommendation is receiving growing attention. Nevertheless, most prior efforts focus on modeling quotations and queries separately and ignore the relationship between the quotations and the queries. In this work, we introduce a transformation matrix that directly maps the query representations to quotation representations. To better learn the mapping relationship, we employ a mapping loss that minimizes the distance of two semantic spaces (one for quotation and another for mapped-query). Furthermore, we explore using the words in history queries to interpret the figurative language of quotations, where quotation-aware attention is applied on top of history queries to highlight the indicator words. Experiments on two datasets in English and Chinese show that our model outperforms previous state-of-the-art models.
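
The transformation matrix and mapping loss can be illustrated with a small sketch. It assumes that queries and quotations are already encoded as fixed-size vectors and that the "distance" is squared Euclidean. The abstract does not specify the distance measure, so that choice, along with the names `mat_vec` and `mapping_loss`, is an assumption made here for illustration.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def mat_vec(W: Matrix, q: Vector) -> Vector:
    """Apply the transformation matrix W to a query representation q."""
    return [sum(w * x for w, x in zip(row, q)) for row in W]


def mapping_loss(W: Matrix, query: Vector, quotation: Vector) -> float:
    """Squared Euclidean distance between the mapped query and the quotation.

    The distance measure is an assumption; the abstract only says the loss
    minimizes the distance between the two semantic spaces.
    """
    mapped = mat_vec(W, query)
    return sum((m - t) ** 2 for m, t in zip(mapped, quotation))


# Toy 2-dimensional representations (made-up numbers).
W = [[1.0, 0.0],
     [0.0, 2.0]]
query = [1.0, 1.0]
quotation = [1.0, 1.5]
loss = mapping_loss(W, query, quotation)
```

Here the query maps to `[1.0, 2.0]`, which differs from the quotation only in the second coordinate, giving a loss of 0.25. In training, `W` would be learned so that mapped queries land close to the quotations they should retrieve.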

A Collaborative Multi-agent Reinforcement Learning Framework for Dialog Action Decomposition
Huimin Wang | Kam-Fai Wong
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Most reinforcement learning methods for dialog policy learning train a centralized agent that selects a predefined joint action concatenating domain name, intent type, and slot name. The centralized dialog agent requires a great many user-agent interactions due to the large action space. Besides, designing the concatenated actions is laborious for engineers and may struggle with edge cases. To solve these problems, we model the dialog policy learning problem with a novel multi-agent framework, in which each part of the action is led by a different agent. The framework reduces labor costs for action templates and decreases the size of the action space for each agent. Furthermore, we relieve the non-stationarity problem, caused by the changing dynamics of the environment as agents’ policies evolve, by introducing a joint optimization process that lets agents exchange their policy information. Concurrently, an independent experience replay buffer mechanism is integrated to reduce the dependence between gradients of samples to improve training efficiency. The effectiveness of the proposed framework is demonstrated in a multi-domain environment with both user simulator evaluation and human evaluation.

2020

Dynamic Online Conversation Recommendation
Xingshan Zeng | Jing Li | Lu Wang | Zhiming Mao | Kam-Fai Wong
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Trending topics in social media content evolve over time, and it is therefore crucial to understand social media users and their interpersonal communications in a dynamic manner. Here we study dynamic online conversation recommendation, to help users engage in conversations that satisfy their evolving interests. While most prior work assumes static user interests, our model is able to capture the temporal aspects of user interests, and further handle future conversations that are unseen during training time. Concretely, we propose a neural architecture to exploit changes of user interactions and interests over time, to predict which discussions they are likely to enter. We conduct experiments on large-scale collections of Reddit conversations, and results on three subreddits show that our model significantly outperforms state-of-the-art models that make a static assumption of user interests. We further evaluate on handling “cold start”, and observe consistently better performance by our model when considering various degrees of sparsity of user’s chatting history and conversation contexts. Lastly, analyses on our model outputs indicate user interest change, explaining the advantage and efficacy of our approach.

Learning Efficient Dialogue Policy from Demonstrations through Shaping
Huimin Wang | Baolin Peng | Kam-Fai Wong
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Training a task-oriented dialogue agent with reinforcement learning is prohibitively expensive since it requires a large volume of interactions with users. Human demonstrations can be used to accelerate learning progress. However, how to effectively leverage demonstrations to learn dialogue policy remains less explored. In this paper, we present S^2Agent, which efficiently learns dialogue policy from demonstrations through policy shaping and reward shaping. We use an imitation model to distill knowledge from demonstrations, based on which policy shaping estimates feedback on how the agent should act in policy space. Reward shaping is then incorporated to explicitly give a bonus to state-actions similar to demonstrations in value space, encouraging better exploration. The effectiveness of the proposed S^2Agent is demonstrated in three dialogue domains and a challenging domain adaptation task with both user simulator evaluation and human evaluation.

Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing
Kam-Fai Wong | Kevin Knight | Hua Wu
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

Continuity of Topic, Interaction, and Query: Learning to Quote in Online Conversations
Lingzhi Wang | Jing Li | Xingshan Zeng | Haisong Zhang | Kam-Fai Wong
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Quotations are crucial for successful explanations and persuasions in interpersonal communications. However, finding what to quote in a conversation is challenging for both humans and machines. This work studies automatic quotation generation in an online conversation and explores how language consistency affects whether a quotation fits the given context. Here, we capture the contextual consistency of a quotation in terms of latent topics, interactions with the dialogue history, and coherence to the query turn’s existing contents. Further, an encoder-decoder neural framework is employed to continue the context with a quotation via language generation. Experiment results on two large-scale datasets in English and Chinese demonstrate that our quotation generation model outperforms the state-of-the-art models. Further analysis shows that topic, interaction, and query consistency are all helpful to learn how to quote in online conversations.

CUHK at SemEval-2020 Task 4: CommonSense Explanation, Reasoning and Prediction with Multi-task Learning
Hongru Wang | Xiangru Tang | Sunny Lai | Kwong Sak Leung | Jia Zhu | Gabriel Pui Cheong Fung | Kam-Fai Wong
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This paper describes our system submitted to task 4 of SemEval 2020: Commonsense Validation and Explanation (ComVE), which consists of three sub-tasks. The task is to directly validate whether a given sentence makes sense and to require the model to explain it. Based on the BERT architecture with a multi-task setting, we propose an effective and interpretable “Explain, Reason and Predict” (ERP) system to solve the three sub-tasks about commonsense: (a) Validation, (b) Reasoning, and (c) Explanation. Inspired by cognitive studies of common sense, our system first generates a reason or understanding of the sentences and then chooses which statement makes sense, which is achieved by multi-task learning. During the post-evaluation, our system reached 92.9% accuracy in subtask A (rank 11), 89.7% accuracy in subtask B (rank 9), and a BLEU score of 12.9 in subtask C (rank 8).

2019

Sentence-Level Evidence Embedding for Claim Verification with Hierarchical Attention Networks
Jing Ma | Wei Gao | Shafiq Joty | Kam-Fai Wong
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Claim verification is generally a task of verifying the veracity of a given claim, which is critical to many downstream applications. It is cumbersome and inefficient for human fact-checkers to find consistent pieces of evidence, from which a solid verdict could be inferred against the claim. In this paper, we propose a novel end-to-end hierarchical attention network focusing on learning to represent coherent evidence as well as their semantic relatedness with the claim. Our model consists of three main components: 1) A coherence-based attention layer embeds coherent evidence considering the claim and sentences from relevant articles; 2) An entailment-based attention layer attends on sentences that can semantically infer the claim on top of the first attention; and 3) An output layer predicts the verdict based on the embedded evidence. Experimental results on three public benchmark datasets show that our proposed model outperforms a set of state-of-the-art baselines.

Neural Conversation Recommendation with Online Interaction Modeling
Xingshan Zeng | Jing Li | Lu Wang | Kam-Fai Wong
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

The prevalent use of social media leads to a vast amount of online conversations being produced on a daily basis. It presents a concrete challenge for individuals to better discover and engage in social media discussions. In this paper, we present a novel framework to automatically recommend conversations to users based on their prior conversation behaviors. Built on neural collaborative filtering, our model explores deep semantic features that measure how a user’s preferences match an ongoing conversation’s context. Furthermore, to identify salient characteristics from interleaving user interactions, our model incorporates graph-structured networks, where both replying relations and temporal features are encoded as conversation context. Experimental results on two large-scale datasets collected from Twitter and Reddit show that our model yields better performance than previous state-of-the-art models, which only utilize lexical features and ignore past user interactions in the conversations.

Coupling Global and Local Context for Unsupervised Aspect Extraction
Ming Liao | Jing Li | Haisong Zhang | Lingzhi Wang | Xixin Wu | Kam-Fai Wong
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Aspect words, indicating opinion targets, are essential in expressing and understanding human opinions. To identify aspects, most previous efforts focus on using sequence tagging models trained on human-annotated data. This work studies unsupervised aspect extraction and explores how words appear in global context (on sentence level) and local context (conveyed by neighboring words). We propose a novel neural model, capable of coupling global and local representation to discover aspect words. Experimental results on two benchmarks, laptop and restaurant reviews, show that our model significantly outperforms the state-of-the-art models from previous studies evaluated with varying metrics. Analysis of model output shows our ability to learn meaningful and coherent aspect representations. We further investigate how words distribute in global and local context, and find that aspect and non-aspect words do exhibit different context, which explains our superiority in unsupervised aspect extraction.

Joint Effects of Context and User History for Predicting Online Conversation Re-entries
Xingshan Zeng | Jing Li | Lu Wang | Kam-Fai Wong
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

As the online world continues its exponential growth, interpersonal communication has come to play an increasingly central role in opinion formation and change. In order to help users better engage with each other online, we study a challenging problem of re-entry prediction foreseeing whether a user will come back to a conversation they once participated in. We hypothesize that both the context of the ongoing conversations and the users’ previous chatting history will affect their continued interests in future engagement. Specifically, we propose a neural framework with three main layers, each modeling context, user history, and interactions between them, to explore how the conversation context and user chatting history jointly result in their re-entry behavior. We experiment with two large-scale datasets collected from Twitter and Reddit. Results show that our proposed framework with bi-attention achieves an F1 score of 61.1 on Twitter conversations, outperforming the state-of-the-art methods from previous work.

2018

Microblog Conversation Recommendation via Joint Modeling of Topics and Discourse
Xingshan Zeng | Jing Li | Lu Wang | Nicholas Beauchamp | Sarah Shugars | Kam-Fai Wong
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Millions of conversations are generated every day on social media platforms. With limited attention, it is challenging for users to select which discussions they would like to participate in. Here we propose a new method for microblog conversation recommendation. While much prior work has focused on post-level recommendation, we exploit both the conversational context, and user content and behavior preferences. We propose a statistical model that jointly captures: (1) topics for representing user interests and conversation content, and (2) discourse modes for describing user replying behavior and conversation dynamics. Experimental results on two Twitter datasets demonstrate that our system outperforms methods that only model content without considering discourse.

The UIR Uncertainty Corpus for Chinese: Annotating Chinese Microblog Corpus for Uncertainty Identification from Social Media
Binyang Li | Jun Xiang | Le Chen | Xu Han | Xiaoyan Yu | Ruifeng Xu | Tengjiao Wang | Kam-fai Wong
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

A Joint Model of Conversational Discourse Latent Topics on Microblogs
Jing Li | Yan Song | Zhongyu Wei | Kam-Fai Wong
Computational Linguistics, Volume 44, Issue 4 - December 2018

Conventional topic models are ineffective for topic extraction from microblog messages, because the data sparseness exhibited in short messages lacking structure and contexts results in poor message-level word co-occurrence patterns. To address this issue, we organize microblog messages as conversation trees based on their reposting and replying relations, and propose an unsupervised model that jointly learns word distributions to represent: (1) different roles of conversational discourse, and (2) various latent topics in reflecting content information. By explicitly distinguishing the probabilities of messages with varying discourse roles in containing topical words, our model is able to discover clusters of discourse words that are indicative of topical content. In an automatic evaluation on large-scale microblog corpora, our joint model yields topics with better coherence scores than competitive topic models from previous studies. Qualitative analysis on model outputs indicates that our model induces meaningful representations for both discourse and topics. We further present an empirical study on microblog summarization based on the outputs of our joint model. The results show that the jointly modeled discourse and topic representations can effectively indicate summary-worthy content in microblog conversations.

Rumor Detection on Twitter with Tree-structured Recursive Neural Networks
Jing Ma | Wei Gao | Kam-Fai Wong
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Automatic rumor detection is technically very challenging. In this work, we try to learn discriminative features from tweets’ content by following their non-sequential propagation structure and generate more powerful representations for identifying different types of rumors. We propose two recursive neural models based on bottom-up and top-down tree-structured neural networks for rumor representation learning and classification, which naturally conform to the propagation layout of tweets. Results on two public Twitter datasets demonstrate that our recursive neural models 1) achieve much better performance than state-of-the-art approaches; 2) demonstrate superior capacity on detecting rumors at a very early stage.

Task-oriented Dialogue System for Automatic Diagnosis
Zhongyu Wei | Qianlong Liu | Baolin Peng | Huaixiao Tou | Ting Chen | Xuanjing Huang | Kam-fai Wong | Xiangying Dai
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

In this paper, we make a move to build a dialogue system for automatic diagnosis. We first build a dataset collected from an online medical forum by extracting symptoms from both patients’ self-reports and conversational data between patients and doctors. Then we propose a task-oriented dialogue system framework to make diagnosis for patients automatically, which can converse with patients to collect additional symptoms beyond their self-reports. Experimental results on our dataset show that additional symptoms extracted from conversation can greatly improve the accuracy for disease identification and our dialogue system is able to collect these symptoms automatically and make a better diagnosis.

Deep Dyna-Q: Integrating Planning for Task-Completion Dialogue Policy Learning
Baolin Peng | Xiujun Li | Jianfeng Gao | Jingjing Liu | Kam-Fai Wong
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Training a task-completion dialogue agent via reinforcement learning (RL) is costly because it requires many interactions with real users. One common alternative is to use a user simulator. However, a user simulator usually lacks the language complexity of human interlocutors and the biases in its design may tend to degrade the agent. To address these issues, we present Deep Dyna-Q, which to our knowledge is the first deep RL framework that integrates planning for task-completion dialogue policy learning. We incorporate into the dialogue agent a model of the environment, referred to as the world model, to mimic real user response and generate simulated experience. During dialogue policy learning, the world model is constantly updated with real user experience to approach real user behavior, and in turn, the dialogue agent is optimized using both real experience and simulated experience. The effectiveness of our approach is demonstrated on a movie-ticket booking task in both simulated and human-in-the-loop settings.

2017

Detect Rumors in Microblog Posts Using Propagation Structure via Kernel Learning
Jing Ma | Wei Gao | Kam-Fai Wong
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

How does fake news go viral via social media? How does its propagation pattern differ from real stories? In this paper, we attempt to address the problem of identifying rumors, i.e., fake information, out of microblog posts based on their propagation structure. We first model microblog post diffusion with propagation trees, which provide valuable clues on how an original message is transmitted and developed over time. We then propose a kernel-based method called Propagation Tree Kernel, which captures high-order patterns differentiating different types of rumors by evaluating the similarities between their propagation tree structures. Experimental results on two real-world datasets demonstrate that the proposed kernel-based approach can detect rumors more quickly and accurately than state-of-the-art rumor detection models.

IJCNLP-2017 Task 2: Dimensional Sentiment Analysis for Chinese Phrases
Liang-Chih Yu | Lung-Hao Lee | Jin Wang | Kam-Fai Wong
Proceedings of the IJCNLP 2017, Shared Tasks

This paper presents the IJCNLP 2017 shared task on Dimensional Sentiment Analysis for Chinese Phrases (DSAP), which seeks to identify a real-valued sentiment score of Chinese single words and multi-word phrases in both the valence and arousal dimensions. Valence represents the degree of pleasant and unpleasant (or positive and negative) feelings, and arousal represents the degree of excitement and calm. Of the 19 teams registered for this shared task for two-dimensional sentiment analysis, 13 submitted results. We expected that this evaluation campaign could produce more advanced dimensional sentiment analysis techniques, especially for Chinese affective computing. All data sets with gold standards and scoring script are made publicly available to researchers.

NLPTEA 2017 Shared Task – Chinese Spelling Check
Gabriel Fung | Maxime Debosschere | Dingmin Wang | Bo Li | Jia Zhu | Kam-Fai Wong
Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017)

This paper provides an overview along with our findings of the Chinese Spelling Check shared task at NLPTEA 2017. The goal of this task is to develop a computer-assisted system to automatically diagnose typing errors in traditional Chinese sentences written by students. We defined six types of errors which belong to two categories. Given a sentence, the system should detect where the errors are, and for each detected error determine its type and provide correction suggestions. We designed, constructed, and released a benchmark dataset for this task.

Composite Task-Completion Dialogue Policy Learning via Hierarchical Deep Reinforcement Learning
Baolin Peng | Xiujun Li | Lihong Li | Jianfeng Gao | Asli Celikyilmaz | Sungjin Lee | Kam-Fai Wong
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Building a dialogue agent to fulfill complex tasks, such as travel planning, is challenging because the agent has to learn to collectively complete multiple subtasks. For example, the agent needs to reserve a hotel and book a flight so that there is enough time for the commute between arrival and hotel check-in. This paper addresses this challenge by formulating the task in the mathematical framework of options over Markov Decision Processes (MDPs), and proposing a hierarchical deep reinforcement learning approach to learning a dialogue manager that operates at different temporal scales. The dialogue manager consists of: (1) a top-level dialogue policy that selects among subtasks or options, (2) a low-level dialogue policy that selects primitive actions to complete the subtask given by the top-level policy, and (3) a global state tracker that helps ensure all cross-subtask constraints are satisfied. Experiments on a travel planning task with simulated and real users show that our approach leads to significant improvements over three baselines, two based on handcrafted rules and the other based on flat deep reinforcement learning.

May I take your order? A Neural Model for Extracting Structured Information from Conversations
Baolin Peng | Michael Seltzer | Y.C. Ju | Geoffrey Zweig | Kam-Fai Wong
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

In this paper we tackle a unique and important problem of extracting a structured order from the conversation a customer has with an order taker at a restaurant. This is motivated by an actual system under development to assist in the order taking process. We develop a sequence-to-sequence model that is able to map from unstructured conversational input to the structured form that is conveyed to the kitchen and appears on the customer receipt. This problem is critically different from other tasks like machine translation where sequence-to-sequence models have been used: the input includes two sides of a conversation; the output is highly structured; and logical manipulations must be performed, for example when the customer changes his mind while ordering. We present a novel sequence-to-sequence model that incorporates a special attention-memory gating mechanism and conversational role markers. The proposed model improves performance over both a phrase-based machine translation approach and a standard sequence-to-sequence model.

2016

Proceedings of the 12th Workshop on Asian Language Resources (ALR12)
Koiti Hasida | Kam-Fai Wong | Nicoletta Calzorari | Key-Sun Choi
Proceedings of the 12th Workshop on Asian Language Resources (ALR12)

ACE: Automatic Colloquialism, Typographical and Orthographic Errors Detection for Chinese Language
Shichao Dong | Gabriel Pui Cheong Fung | Binyang Li | Baolin Peng | Ming Liao | Jia Zhu | Kam-fai Wong
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

We present a system called ACE for Automatic Colloquialism and Errors detection for written Chinese. ACE is based on a combination of an N-gram model and a rule-based model. Although it focuses on detecting colloquial Cantonese (a dialect of Chinese) at the current stage, it can be extended to detect other dialects. We chose Cantonese because it has many interesting properties, such as a unique grammar system and a huge number of colloquial terms, that make the detection task extremely challenging. We conducted experiments using real data and synthetic data. The results indicated that ACE is highly reliable and effective.

Topic Extraction from Microblog Posts Using Conversation Structures
Jing Li | Ming Liao | Wei Gao | Yulan He | Kam-Fai Wong
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2015

Using Content-level Structures for Summarizing Microblog Repost Trees
Jing Li | Wei Gao | Zhongyu Wei | Baolin Peng | Kam-Fai Wong
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

UIR-PKU: Twitter-OpinMiner System for Sentiment Analysis in Twitter at SemEval 2015
Xu Han | Binyang Li | Jing Ma | Yuxiao Zhang | Gaoyan Ou | Tengjiao Wang | Kam-fai Wong
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

2014

Exploiting Community Emotion for Microblog Event Detection
Gaoyan Ou | Wei Chen | Tengjiao Wang | Zhongyu Wei | Binyang Li | Dongqing Yang | Kam-Fai Wong
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

The CUHK Discourse TreeBank for Chinese: Annotating Explicit Discourse Connectives for the Chinese TreeBank
Lanjun Zhou | Binyang Li | Zhongyu Wei | Kam-Fai Wong
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The lack of open discourse corpus for Chinese brings limitations for many natural language processing tasks. In this work, we present the first open discourse treebank for Chinese, namely, the Discourse Treebank for Chinese (DTBC). At the current stage, we annotated explicit intra-sentence discourse connectives, their corresponding arguments and senses for all 890 documents of the Chinese Treebank 5. We started by analysing the characteristics of discourse annotation for Chinese, adapted the annotation scheme of Penn Discourse Treebank 2 (PDTB2) to Chinese language while maintaining the compatibility as far as possible. We made adjustments to 3 essential aspects according to the previous study of Chinese linguistics. They are sense hierarchy, argument scope and semantics of arguments. Agreement study showed that our annotation scheme could achieve highly reliable results.

Web Information Mining and Decision Support Platform for the Modern Service Industry
Binyang Li | Lanjun Zhou | Zhongyu Wei | Kam-fai Wong | Ruifeng Xu | Yunqing Xia
Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations

2013

Is Twitter A Better Corpus for Measuring Sentiment Similarity?
Shi Feng | Le Zhang | Binyang Li | Daling Wang | Ge Yu | Kam-Fai Wong
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

An Empirical Study on Uncertainty Identification in Social Media Context
Zhongyu Wei | Junwen Chen | Wei Gao | Binyang Li | Lanjun Zhou | Yulan He | Kam-Fai Wong
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2012

Quantising Opinions for Political Tweets Analysis
Yulan He | Hassan Saif | Zhongyu Wei | Kam-Fai Wong
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

There has been increasing interest in recent years in analyzing tweet messages relevant to political events so as to understand public opinions towards certain political issues. We analyzed tweet messages crawled during the eight weeks leading to the UK General Election in May 2010 and found that activity on Twitter is not necessarily a good predictor of the popularity of political parties. We then proceed to propose a statistical model for sentiment detection that incorporates side information such as emoticons and hash tags implying tweet polarities. Our results show that sentiment analysis based on a simple keyword matching against a sentiment lexicon or a supervised classifier trained with distant supervision does not correlate well with the actual election results. However, using our proposed statistical model for sentiment analysis, we were able to map public opinion on Twitter to the actual offline sentiment in the real world.

Cross-Lingual Identification of Ambiguous Discourse Connectives for Resource-Poor Language
Lanjun Zhou | Wei Gao | Binyang Li | Zhongyu Wei | Kam-Fai Wong
Proceedings of COLING 2012: Posters

Information-theoretic Multi-view Domain Adaptation
Pei Yang | Wei Gao | Qi Tan | Kam-Fai Wong
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2011

Unsupervised Discovery of Discourse Relations for Eliminating Intra-sentence Polarity Ambiguities
Lanjun Zhou | Binyang Li | Wei Gao | Zhongyu Wei | Kam-Fai Wong
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

Query Weighting for Ranking Model Adaptation
Peng Cai | Wei Gao | Aoying Zhou | Kam-Fai Wong
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2010

A Unified Graph Model for Sentence-Based Opinion Retrieval
Binyang Li | Lanjun Zhou | Shi Feng | Kam-Fai Wong
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

2009

Exploiting Bilingual Information to Improve Web Search
Wei Gao | John Blitzer | Ming Zhou | Kam-Fai Wong
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

2008

Lyric-based Song Sentiment Classification with Sentiment Vector Space Model
Yunqing Xia | Linlin Wang | Kam-Fai Wong | Mingxing Xu
Proceedings of ACL-08: HLT, Short Papers

Opinion Annotation in On-line Chinese Product Reviews
Ruifeng Xu | Yunqing Xia | Kam-Fai Wong | Wenjie Li
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper presents the design and construction of a Chinese opinion corpus based on online product reviews. Based on observation of the characteristics of opinion expression in Chinese online product reviews, which are quite different from those in formal texts such as news, an annotation framework is proposed to guide the construction of the first Chinese opinion corpus based on online product reviews. The opinionated sentences are manually identified from the review text. Furthermore, for each comment in the opinionated sentence, its 13 describing elements are annotated, including the expressions related to the product attributes of interest and user opinions as well as the polarity and degree of the opinions. Currently, 12,724 comments are annotated in 10,935 sentences from review text. Through statistical analysis on the opinion corpus, some interesting characteristics of Chinese opinion expression are presented. This corpus is shown to be helpful in supporting systematic research on Chinese opinion analysis.

Extractive Summarization Using Supervised and Semi-Supervised Learning
Kam-Fai Wong | Mingli Wu | Wenjie Li
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

2007

Annotating Chinese Collocations with Multi Information
Ruifeng Xu | Qin Lu | Kam-Fai Wong | Wenjie Li
Proceedings of the Linguistic Annotation Workshop

2006

A Phonetic-Based Approach to Chinese Chat Text Normalization
Yunqing Xia | Kam-Fai Wong | Wenjie Li
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

Constructing A Chinese Chat Language Corpus with A Two-Stage Incremental Annotation Approach
Yunqing Xia | Kam-Fai Wong | Wenjie Li
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Chat language refers to the special human language widely used in the community of digital network chat. As chat language holds anomalous characteristics in forming words, phrases, and non-alphabetical characters, conventional natural language processing tools are ineffective at handling chat language text. Previous research shows that knowledge-based methods perform less effectively in processing unseen chat terms. This motivates us to construct a chat language corpus so that corpus-based techniques of chat language text processing can be developed and evaluated. However, creating the corpus merely by hand is difficult. First, this work is manpower consuming. Second, annotation inconsistency is serious. To minimize manpower and annotation inconsistency, a two-stage incremental annotation approach is proposed in this paper in constructing a chat language corpus. Experiments conducted in this paper show that the performance of corpus annotation can be improved greatly with this approach.

Anomaly Detecting within Dynamic Chinese Chat Text
Yunqing Xia | Kam-Fai Wong
Proceedings of the Workshop on NEW TEXT Wikis and blogs and other dynamic text sources

2005

A Preliminary Work on Classifying Time Granularities of Temporal Questions
Wei Li | Wenjie Li | Qin Lu | Kam-Fai Wong
Second International Joint Conference on Natural Language Processing: Full Papers

NIL Is Not Nothing: Recognition of Chinese Network Informal Language Expressions
Yunqing Xia | Kam-Fai Wong | Wei Gao
Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing

2004

Combining Linguistic Features with Weighted Bayesian Classifier for Temporal Reference Processing
Guihong Cao | Wenjie Li | Kam-Fai Wong | Chunfa Yuan
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

Applying Machine Learning to Chinese Temporal Relation Resolution
Wenjie Li | Kam-Fai Wong | Guihong Cao | Chunfa Yuan
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04)

2003

Improving Document Clustering by Utilizing Meta-Data
Kam-Fai Wong | Nam-Kiu Chan | Kam-Lai Wong
Proceedings of the Sixth International Workshop on Information Retrieval with Asian Languages

2002

An Indexing Method Based on Sentences
Li Li | Chunfa Yuan | K.F. Wong | Wenjie Li
COLING-02: The First SIGHAN Workshop on Chinese Language Processing

2001

A Model For Processing Temporal References In Chinese
Wenjie Li | Kam-Fai Wong | Chunfa Yuan
Proceedings of the ACL 2001 Workshop on Temporal and Spatial Information Processing

2000

An Algorithm for Situation Classification of Chinese Verbs
Xiaodan Zhu | Chunfa Yuan | K.F. Wong | Wenjie Li
Second Chinese Language Processing Workshop

1995

Are Statistics-Based Approaches Good Enough For NLP? A Case Study Of Maximal-Length NP Extraction In Mandarin Chinese
Wenjie Li | Haihua Pan | Ming Zhou | Kam-Fai Wong | Vincent Lum
Proceedings of Rocling VIII Computational Linguistics Conference VIII

Co-authors

Lingzhi Wang 13

Wai Chung Kwan 10

Tengjiao Wang 3

Gabriel Pui Cheong Fung 2

Xu Han (韩旭) 2

Bing Qin (秦兵) 2

Haisong Zhang 2

Jingqian Zhao 2

Hamid Alinejad-Rokny 1

Nick Beauchamp 1

Nicoletta Calzorari 1

Asli Celikyilmaz 1

Guanrong Chen 1

Tat-Seng Chua 1

Maxime Debosschere 1

Alessio Devoto 1

Aryo Pradipta Gema 1

Xuan-Jing Huang (黄萱菁) 1

Kwong Sak Leung 1

Xingwei Liang 1

Pasquale Minervini 1

Mrinmaya Sachan 1

Michael Seltzer 1

Sarah Shugars 1

Zhanghao Wang 1

Cunxiang Wang 1

Qianlong Wang 1

Hua Wu (吴华) 1

Dongqing Yang 1

Fangchun Yang 1

Liang-Chih Yu 1

Ge Yu (于戈) 1

Yuanzhao Zhai 1

Huan Zhang (张欢) 1

Tianhua Zhang 1

Wenxuan Zhang 1

Jiangjiang Zhao 1

Geoffrey Zweig 1

Venues