Wenxuan Zhang - ACL Anthology

Wenxuan Zhang

2025

Is Translation All You Need? A Study on Solving Multilingual Tasks with Large Language Models
Chaoqun Liu | Wenxuan Zhang | Yiran Zhao | Anh Tuan Luu | Lidong Bing
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Large language models (LLMs) have demonstrated multilingual capabilities, yet they are mostly English-centric due to the imbalanced training corpora. While prior works have leveraged this bias to enhance multilingual performance through translation, they have been largely limited to natural language processing (NLP) tasks. In this work, we extend the evaluation to real-world user queries and non-English-centric LLMs, offering a broader examination of multilingual performance. Our key contribution lies in demonstrating that while translation into English can boost the performance of English-centric LLMs on NLP tasks, it is not universally optimal. For culture-related tasks that need deep language understanding, prompting in the native language proves more effective as it better captures the nuances of culture and language. Our experiments expose varied behaviors across LLMs and tasks in the multilingual context, underscoring the need for a more comprehensive approach to multilingual evaluation. Therefore, we call for greater efforts in developing and evaluating LLMs that go beyond English-centric paradigms.

MPO: Multilingual Safety Alignment via Reward Gap Optimization
Weixiang Zhao | Yulin Hu | Yang Deng | Tongtong Wu | Wenxuan Zhang | Jiahe Guo | An Zhang | Yanyan Zhao | Bing Qin | Tat-Seng Chua | Ting Liu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models (LLMs) have become increasingly central to AI applications worldwide, necessitating robust multilingual safety alignment to ensure secure deployment across diverse linguistic contexts. Existing preference learning methods for safety alignment, such as RLHF and DPO, are primarily monolingual and struggle with noisy multilingual data. To address these limitations, we introduce Multilingual reward gaP Optimization (MPO), a novel approach that leverages the well-aligned safety capabilities of the dominant language (e.g., English) to improve safety alignment across multiple languages. MPO directly minimizes the reward gap difference between the dominant language and target languages, effectively transferring safety capabilities while preserving the original strengths of the dominant language. Extensive experiments on three LLMs, LLaMA-3.1, Gemma-2 and Qwen2.5, validate MPO’s efficacy in multilingual safety alignment without degrading general multilingual utility.

SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages
Wenxuan Zhang | Hou Pong Chan | Yiran Zhao | Mahani Aljunied | Jianyu Wang | Chaoqun Liu | Yue Deng | Zhiqiang Hu | Weiwen Xu | Yew Ken Chia | Xin Li | Lidong Bing
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)

Large Language Models (LLMs) have shown remarkable abilities across various tasks, yet their development has predominantly centered on high-resource languages like English and Chinese, leaving low-resource languages underserved. To address this disparity, we present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages. This region, characterized by its rich linguistic diversity, has lacked adequate language technology support. SeaLLMs 3 aims to bridge this gap by covering a comprehensive range of languages spoken in this region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese. Leveraging efficient language enhancement techniques and a specially constructed instruction tuning dataset, SeaLLMs 3 significantly reduces training costs while maintaining high performance and versatility. Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achieving state-of-the-art performance among similarly sized models. Additionally, we prioritized safety and reliability by addressing both general and culture-specific considerations and incorporated mechanisms to reduce hallucinations. This work underscores the importance of inclusive AI, showing that advanced LLM capabilities can benefit underserved linguistic and cultural communities.

Explainable Ethical Assessment on Human Behaviors by Generating Conflicting Social Norms
Yuxi Sun | Wei Gao | Hongzhan Lin | Jing Ma | Wenxuan Zhang
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Human behaviors are often guided or constrained by social norms, which are defined as shared, commonsense rules. For example, underlying an action report a witnessed crime are social norms that inform our conduct, such as It is expected to be brave to report crimes. Current AI systems that assess valence (i.e., support or oppose) of human actions by leveraging large-scale data training not grounded on explicit norms may be difficult to explain, and thus untrustworthy. Emulating human assessors by considering social norms can help AI models better understand and predict valence. While multiple norms come into play, conflicting norms can create tension and directly influence human behavior. For example, when deciding whether to report a witnessed crime, one may balance bravery against self-protection. In this paper, we introduce ClarityEthic, a novel ethical assessment approach, to enhance valence prediction and explanation by generating conflicting social norms behind human actions, which strengthens the moral reasoning capabilities of language models by using a contrastive learning strategy. Extensive experiments demonstrate that our method outperforms strong baseline approaches, and human evaluations confirm that the generated social norms provide plausible explanations for the assessment of human behaviors.

SeaExam and SeaBench: Benchmarking LLMs with Local Multilingual Questions in Southeast Asia
Chaoqun Liu | Wenxuan Zhang | Jiahao Ying | Mahani Aljunied | Anh Tuan Luu | Lidong Bing
Findings of the Association for Computational Linguistics: NAACL 2025

This study introduces two novel benchmarks, SeaExam and SeaBench, designed to evaluate the capabilities of Large Language Models (LLMs) in Southeast Asian (SEA) application scenarios. Unlike existing multilingual datasets primarily derived from English translations, these benchmarks are constructed based on real-world scenarios from SEA regions. SeaExam draws from regional educational exams to form a comprehensive dataset that encompasses subjects such as local history and literature. In contrast, SeaBench is crafted around multi-turn, open-ended tasks that reflect daily interactions within SEA communities. Our evaluations demonstrate that SeaExam and SeaBench more effectively discern LLM performance on SEA language tasks compared to their translated benchmarks. This highlights the importance of using real-world queries to assess the multilingual capabilities of LLMs.

AdaMergeX: Cross-Lingual Transfer with Large Language Models via Adaptive Adapter Merging
Yiran Zhao | Wenxuan Zhang | Huiming Wang | Kenji Kawaguchi | Lidong Bing
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

JsonTuning: Towards Generalizable, Robust, and Controllable Instruction Tuning
Chang Gao | Wenxuan Zhang | Guizhen Chen | Wai Lam
Findings of the Association for Computational Linguistics: ACL 2025

Instruction tuning is vital for enhancing the performance of large language models (LLMs), but existing text-to-text methods, referred to as TextTuning, struggle with issues such as generalization, robustness, and controllability due to their lack of explicit task structures. We introduce JsonTuning, a structure-to-structure approach that uses JSON structures to represent tasks. This method improves generalization by clarifying task elements and their relations, boosts robustness by minimizing ambiguity, and enhances controllability by allowing precise control over outputs. We conduct an extensive comparative analysis between JsonTuning and TextTuning using various language models and benchmarks. Our findings reveal that JsonTuning consistently surpasses TextTuning in terms of performance, robustness, and controllability across different scenarios. By overcoming the limitations of TextTuning, JsonTuning demonstrates significant potential for developing more effective and reliable LLMs capable of handling diverse scenarios.

Auto-Arena: Automating LLM Evaluations with Agent Peer Battles and Committee Discussions
Ruochen Zhao | Wenxuan Zhang | Yew Ken Chia | Weiwen Xu | Deli Zhao | Lidong Bing
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

As LLMs continuously evolve, there is an urgent need for a reliable evaluation method that delivers trustworthy results promptly. Currently, static benchmarks suffer from inflexibility and unreliability, leading users to prefer human voting platforms like Chatbot Arena. However, human evaluations require significant manual effort. Therefore, we propose Auto-Arena, an innovative framework that automates the entire evaluation process using LLM-powered agents. Firstly, an LLM examiner generates questions. Then, two LLM candidates engage in a multi-round peer battle based on the questions, aiming at revealing their true performance differences. Finally, a committee of LLM judges collaboratively discusses and decides the winner, reducing bias and enhancing fairness. During the peer battles, we observe intriguing scenarios where the LLM candidates display competitive behaviors and learn from the opponents. In our extensive experiments involving 15 recent LLMs, Auto-Arena shows a 92.14% correlation with human preferences, surpassing all previous expert-annotated benchmarks without any manual efforts. Auto-Arena offers a promising alternative to current human evaluation platforms for evaluating LLMs automatically.

Pruning General Large Language Models into Customized Expert Models
Yiran Zhao | Guizhen Chen | Kenji Kawaguchi | Lidong Bing | Wenxuan Zhang
Findings of the Association for Computational Linguistics: ACL 2025

Large Language Models (LLMs) have transformed natural language processing, yet their substantial model sizes often demand significant computational resources. To preserve computing resources and accelerate inference speed, it is crucial to prune redundant parameters, especially for experienced users who often need expert models tailored to specific downstream scenarios. However, current pruning methods primarily focus on maintaining models’ general capabilities, either requiring extensive post-training or performing poorly due to coarse-grained pruning. In this work, we design a ̲Custom ̲Pruning method (Cus-Prun) to prune a large general model into a smaller lightweight expert model, which is positioned along the “language”, “domain” and “task” dimensions. By identifying and pruning irrelevant neurons of each dimension, Cus-Prun creates expert models without any post-training. Our experiments demonstrate that Cus-Prun consistently outperforms other methods, achieving minimal loss in both expert and general capabilities across various models from different model families and sizes.

Disentangling Language and Culture for Evaluating Multilingual Large Language Models
Jiahao Ying | Wei Tang | Yiran Zhao | Yixin Cao | Yu Rong | Wenxuan Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

This paper introduces a Dual Evaluation Framework to comprehensively assess the multilingual capabilities of LLMs. By decomposing the evaluation along the dimensions of linguistic medium and cultural context, this framework enables a nuanced analysis of LLMs’ ability to process questions within both native and cross-cultural contexts cross-lingually. Extensive evaluations are conducted on a wide range of models, revealing a notable “Cultural-Linguistic Synergy” phenomenon, where models exhibit better performance when questions are culturally aligned with the language. This phenomenon is further explored through interpretability probing, which shows that a higher proportion of specific neurons are activated in a language’s cultural context. This activation proportion could serve as a potential indicator for evaluating multilingual performance during model training. Our findings challenge the prevailing notion that LLMs, primarily trained on English data, perform uniformly across languages and highlight the necessity of culturally and linguistically model evaluations.

Zero-to-Strong Generalization: Eliciting Strong Capabilities of Large Language Models Iteratively without Gold Labels
Chaoqun Liu | Qin Chao | Wenxuan Zhang | Xiaobao Wu | Boyang Li | Anh Tuan Luu | Lidong Bing
Proceedings of the 31st International Conference on Computational Linguistics

Large Language Models (LLMs) have demonstrated remarkable performance through supervised fine-tuning or in-context learning using gold labels. However, this paradigm is limited by the availability of gold labels, while in certain scenarios, LLMs may need to perform tasks that are too complex for humans to provide such labels. To tackle this challenge, this study explores whether solely utilizing unlabeled data can elicit strong model capabilities. We propose a new paradigm termed zero-to-strong generalization. We iteratively prompt LLMs to annotate unlabeled data and retain high-quality labels by filtering. Surprisingly, we obverse that this iterative process gradually unlocks LLMs’ potential on downstream tasks. Our experiments on extensive classification and reasoning tasks confirm the effectiveness of our proposed framework. Our analysis indicates that this paradigm is effective for both in-context learning and fine-tuning, and for various model sizes.

MlingConf: A Comprehensive Study of Multilingual Confidence Estimation on Large Language Models
Boyang Xue | Hongru Wang | Rui Wang | Sheng Wang | Zezhong Wang | Yiming Du | Bin Liang | Wenxuan Zhang | Kam-Fai Wong
Findings of the Association for Computational Linguistics: ACL 2025

The tendency of Large Language Models (LLMs) to generate hallucinations raises concerns regarding their reliability. Therefore, confidence estimations indicating the extent of trustworthiness of the generations become essential. However, current LLM confidence estimations in languages other than English remain underexplored. This paper addresses this gap by introducing a comprehensive investigation of Multilingual Confidence estimation (MlingConf) on LLMs, focusing on both language-agnostic (LA) and language-specific (LS) tasks to explore the performance and language dominance effects of multilingual confidence estimations on different tasks. The benchmark comprises four meticulously checked and human-evaluated high-quality multilingual datasets for LA tasks and one for the LS task tailored to specific social, cultural, and geographical contexts of a language. Our experiments reveal that on LA tasks English exhibits notable linguistic dominance in confidence estimations than other languages, while on LS tasks, using question-related language to prompt LLMs demonstrates better linguistic dominance in multilingual confidence estimations. The phenomena inspire a simple yet effective native-tone prompting strategy by employing language-specific prompts for LS tasks, effectively improving LLMs’ reliability and accuracy in LS scenarios.

Knowledge Boundary of Large Language Models: A Survey
Moxin Li | Yong Zhao | Wenxuan Zhang | Shuaiyi Li | Wenya Xie | See-Kiong Ng | Tat-Seng Chua | Yang Deng
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Although large language models (LLMs) store vast amount of knowledge in their parameters, they still have limitations in the memorization and utilization of certain knowledge, leading to undesired behaviors such as generating untruthful and inaccurate responses. This highlights the critical need to understand the knowledge boundary of LLMs, a concept that remains inadequately defined in existing research. In this survey, we propose a comprehensive definition of the LLM knowledge boundary and introduce a formalized taxonomy categorizing knowledge into four distinct types. Using this foundation, we systematically review the field through three key lenses: the motivation for studying LLM knowledge boundaries, methods for identifying these boundaries, and strategies for mitigating the challenges they present. Finally, we discuss open challenges and potential research directions in this area. We aim for this survey to offer the community a comprehensive overview, facilitate access to key issues, and inspire further advancements in LLM knowledge research.

Reframe Your Life Story: Interactive Narrative Therapist and Innovative Moment Assessment with Large Language Models
Yi Feng | Jiaqi Wang | Wenxuan Zhang | Zhuang Chen | Shen Yutong | Xiyao Xiao | Minlie Huang | Liping Jing | Jian Yu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Recent progress in large language models (LLMs) has opened new possibilities for mental health support, yet current approaches lack realism in simulating specialized psychotherapy and fail to capture therapeutic progression over time. Narrative therapy, which helps individuals transform problematic life stories into empowering alternatives, remains underutilized due to limited access and social stigma. We address these limitations through a comprehensive framework with two core components. First, **INT** (Interactive Narrative Therapist) simulates expert narrative therapists by planning therapeutic stages, guiding reflection levels, and generating contextually appropriate responses through retrieval-augmentation. Second, **IMA** (Innovative Moment Assessment) provides a therapy-centric evaluation method that quantifies effectiveness by tracking “Innovative Moments” (IMs), critical narrative shifts in client speech signaling therapy progress. Experimental results on 260 simulated clients and 230 human participants reveal that **INT** consistently outperforms standard methods in therapeutic quality and depth. We further demonstrate the effectiveness of **INT** in synthesizing high-quality support conversations to facilitate social applications.

FACT-AUDIT: An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation of Large Language Models
Hongzhan Lin | Yang Deng | Yuxuan Gu | Wenxuan Zhang | Jing Ma | See-Kiong Ng | Tat-Seng Chua
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large Language Models (LLMs) have significantly advanced the fact-checking studies. However, existing automated fact-checking evaluation methods rely on static datasets and classification metrics, which fail to automatically evaluate the justification production and uncover the nuanced limitations of LLMs in fact-checking. In this work, we introduce FACT-AUDIT, an agent-driven framework that adaptively and dynamically assesses LLMs’ fact-checking capabilities. Leveraging importance sampling principles and multi-agent collaboration, FACT-AUDIT generates adaptive and scalable datasets, performs iterative model-centric evaluations, and updates assessments based on model-specific responses. By incorporating justification production alongside verdict prediction, this framework provides a comprehensive and evolving audit of LLMs’ factual reasoning capabilities, to investigate their trustworthiness. Extensive experiments demonstrate that FACT-AUDIT effectively differentiates among state-of-the-art LLMs, providing valuable insights into model strengths and limitations in model-centric fact-checking analysis.

2024

Despite the remarkable achievements of large language models (LLMs) in various tasks, there remains a linguistic bias that favors high-resource languages, such as English, often at the expense of low-resource and regional languages. To address this imbalance, we introduce SeaLLMs, an innovative series of language models that specifically focuses on Southeast Asian (SEA) languages. SeaLLMs are built upon popular English-centric models through continued pre-training with an extended vocabulary, specialized instruction and alignment tuning to better capture the intricacies of regional languages. This allows them to respect and reflect local cultural norms, customs, stylistic preferences, and legal considerations. Our comprehensive evaluation demonstrates that SeaLLM models exhibit superior performance across a wide spectrum of linguistic tasks and assistant-style instruction-following capabilities relative to comparable open-source models. Moreover, they outperform ChatGPT-3.5 in non-Latin languages, such as Thai, Khmer, Lao, and Burmese, by large margins while remaining lightweight and cost-effective to operate.

DimA: A Parameter-efficient Fine-tuning Method with Knowledge Transfer Based on Transformer
Wenxuan Zhang | Min Huang | Zhuoyang Song | Qinghai Miao
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Fine-tuning is a widely used technique for leveraging pre-trained language models (PLMs) in downstream tasks, but it can be computationally expensive and storage-intensive. To address this challenge, researchers have developed parameter-efficient methods that balance performance and resource cost. However, these methods often come with trade-offs like increased inference latency, token length usage, or limited adaptability for multitasking scenarios. This paper introduces a novel parameter-efficient method called DimA(Dimensionality Augmentation), which enhances the Transformer architecture by increasing the dimensionality. DimA achieves state-of-the-art results in GLUE and XSUM tasks while utilizing less than 1% of the original model’s parameters. Moreover, DimA introduces a novel approach to knowledge transfer that enables the simultaneous utilization of knowledge learned from multiple tasks to handle new tasks. This method significantly enhances the performance of the model on new tasks. Its versatility in model structure also enables its application to various Transformer-based models.

Order-Agnostic Data Augmentation for Few-Shot Named Entity Recognition
Huiming Wang | Liying Cheng | Wenxuan Zhang | De Wen Soh | Lidong Bing
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Data augmentation (DA) methods have been proven to be effective for pre-trained language models (PLMs) in low-resource settings, including few-shot named entity recognition (NER). However, existing NER DA techniques either perform rule-based manipulations on words that break the semantic coherence of the sentence, or exploit generative models for entity or context substitution, which requires a substantial amount of labeled data and contradicts the objective of operating in low-resource settings. In this work, we propose order-agnostic data augmentation (OaDA), an alternative solution that exploits the often overlooked order-agnostic property in the training data construction phase of sequence-to-sequence NER methods for data augmentation. To effectively utilize the augmented data without suffering from the one-to-many issue, where multiple augmented target sequences exist for one single sentence, we further propose the use of ordering instructions and an innovative OaDA-XE loss. Specifically, by treating each permutation of entity types as an ordering instruction, we rearrange the entity set accordingly, ensuring a distinct input-output pair, while OaDA-XE assigns loss based on the best match between the target sequence and model predictions. We conduct comprehensive experiments and analyses across three major NER benchmarks and significantly enhance the few-shot capabilities of PLMs with OaDA.

Sentiment Analysis in the Era of Large Language Models: A Reality Check
Wenxuan Zhang | Yue Deng | Bing Liu | Sinno Pan | Lidong Bing
Findings of the Association for Computational Linguistics: NAACL 2024

Sentiment analysis (SA) has been a long-standing research area in natural language processing. With the recent advent of large language models (LLMs), there is great potential for their employment on SA problems. However, the extent to which current LLMs can be leveraged for different sentiment analysis tasks remains unclear. This paper aims to provide a comprehensive investigation into the capabilities of LLMs in performing various sentiment analysis tasks, from conventional sentiment classification to aspect-based sentiment analysis and multifaceted analysis of subjective texts. We evaluate performance across 13 tasks on 26 datasets and compare the results against small language models (SLMs) trained on domain-specific datasets. Our study reveals that while LLMs demonstrate satisfactory performance in simpler tasks, they lag behind in more complex tasks requiring a deeper understanding of specific sentiment phenomena or structured sentiment information. However, LLMs significantly outperform SLMs in few-shot learning settings, suggesting their potential when annotation resources are limited. We also highlight the limitations of current evaluation practices in assessing LLMs’ SA abilities and propose a novel benchmark, SentiEval, for a more comprehensive and realistic evaluation. Data and code are available at https://github.com/DAMO-NLP-SG/LLM-Sentiment.

On the Multi-turn Instruction Following for Conversational Web Agents
Yang Deng | Xuan Zhang | Wenxuan Zhang | Yifei Yuan | See-Kiong Ng | Tat-Seng Chua
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Web agents powered by Large Language Models (LLMs) have demonstrated remarkable abilities in planning and executing multi-step interactions within complex web-based environments, fulfilling a wide range of web navigation tasks. Despite these advancements, the potential for LLM-powered agents to effectively engage with sequential user instructions in real-world scenarios has not been fully explored. In this work, we introduce a new task of Conversational Web Navigation, which necessitates sophisticated interactions that span multiple turns with both the users and the environment, supported by a specially developed dataset named Multi-Turn Mind2Web (MT-Mind2Web). To tackle the limited context length of LLMs and the context-dependency issue of the conversational tasks, we further propose a novel framework, named self-reflective memory-augmented planning (Self-MAP), which employs memory utilization and self-reflection techniques. Extensive experiments are conducted to benchmark the MT-Mind2Web dataset, and validate the effectiveness of the proposed method.

2023

Knowledge-enhanced Mixed-initiative Dialogue System for Emotional Support Conversations
Yang Deng | Wenxuan Zhang | Yifei Yuan | Wai Lam
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Unlike empathetic dialogues, the system in emotional support conversations (ESC) is expected to not only convey empathy for comforting the help-seeker, but also proactively assist in exploring and addressing their problems during the conversation. In this work, we study the problem of mixed-initiative ESC where the user and system can both take the initiative in leading the conversation. Specifically, we conduct a novel analysis on mixed-initiative ESC systems with a tailor-designed schema that divides utterances into different types with speaker roles and initiative types. Four emotional support metrics are proposed to evaluate the mixed-initiative interactions. The analysis reveals the necessity and challenges of building mixed-initiative ESC systems. In the light of this, we propose a knowledge-enhanced mixed-initiative framework (KEMI) for ESC, which retrieves actual case knowledge from a large-scale mental health knowledge graph for generating mixed-initiative responses. Experimental results on two ESC datasets show the superiority of KEMI in both content-preserving evaluation and mixed initiative related analyses.

SOUL: Towards Sentiment and Opinion Understanding of Language
Yue Deng | Wenxuan Zhang | Sinno Pan | Lidong Bing
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Sentiment analysis is a well-established natural language processing task, with sentiment polarity classification being one of its most popular and representative tasks. However, despite the success of pre-trained language models in this area, they often fall short of capturing the broader complexities of sentiment analysis. To address this issue, we propose a new task called Sentiment and Opinion Understanding of Language (SOUL). SOUL aims to evaluate sentiment understanding through two subtasks: Review Comprehension (RC) and Justification Generation (JG). RC seeks to validate statements that focus on subjective information based on a review text, while JG requires models to provide explanations for their sentiment predictions. To enable comprehensive evaluation, we annotate a new dataset comprising 15,028 statements from 3,638 reviews. Experimental results indicate that SOUL is a challenging task for both small and large language models, with a performance gap of up to 27% when compared to human performance. Furthermore, evaluations conducted with both human experts and GPT-4 highlight the limitations of the small language model in generating reasoning-based justifications. These findings underscore the challenging nature of the SOUL task for existing models, emphasizing the need for further advancements in sentiment analysis to address its complexities. The new dataset and code are available at https://github.com/DAMO-NLP-SG/SOUL.

Zero-Shot Text Classification via Self-Supervised Tuning
Chaoqun Liu | Wenxuan Zhang | Guizhen Chen | Xiaobao Wu | Anh Tuan Luu | Chip Hong Chang | Lidong Bing
Findings of the Association for Computational Linguistics: ACL 2023

Existing solutions to zero-shot text classification either conduct prompting with pre-trained language models, which is sensitive to the choices of templates, or rely on large-scale annotated data of relevant tasks for meta-tuning. In this work, we propose a new paradigm based on self-supervised learning to solve zero-shot text classification tasks by tuning the language models with unlabeled data, called self-supervised tuning. By exploring the inherent structure of free texts, we propose a new learning objective called first sentence prediction to bridge the gap between unlabeled data and text classification tasks. After tuning the model to learn to predict the first sentence in a paragraph based on the rest, the model is able to conduct zero-shot inference on unseen tasks such as topic classification and sentiment analysis. Experimental results show that our model outperforms the state-of-the-art baselines on 7 out of 10 tasks. Moreover, the analysis reveals that our model is less sensitive to the prompt design. Our code and pre-trained models are publicly available at https://github.com/DAMO-NLP-SG/SSTuning.

Easy-to-Hard Learning for Information Extraction
Chang Gao | Wenxuan Zhang | Wai Lam | Lidong Bing
Findings of the Association for Computational Linguistics: ACL 2023

Information extraction (IE) systems aim to automatically extract structured information, such as named entities, relations between entities, and events, from unstructured texts. While most existing work addresses a particular IE task, universally modeling various IE tasks with one model has achieved great success recently. Despite their success, they employ a one-stage learning strategy, i.e., directly learning to extract the target structure given the input text, which contradicts the human learning process. In this paper, we propose a unified easy-to-hard learning framework consisting of three stages, i.e., the easy stage, the hard stage, and the main stage, for IE by mimicking the human learning process. By breaking down the learning process into multiple stages, our framework facilitates the model to acquire general IE task knowledge and improve its generalization ability. Extensive experiments across four IE tasks demonstrate the effectiveness of our framework. We achieve new state-of-the-art results on 13 out of 17 datasets.

Product Question Answering in E-Commerce: A Survey
Yang Deng | Wenxuan Zhang | Qian Yu | Wai Lam
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Product question answering (PQA), aiming to automatically provide instant responses to customer’s questions in E-Commerce platforms, has drawn increasing attention in recent years. Compared with typical QA problems, PQA exhibits unique challenges such as the subjectivity and reliability of user-generated contents in E-commerce platforms. Therefore, various problem settings and novel methods have been proposed to capture these special characteristics. In this paper, we aim to systematically review existing research efforts on PQA. Specifically, we categorize PQA studies into four problem settings in terms of the form of provided answers. We analyze the pros and cons, as well as present existing datasets and evaluation protocols for each setting. We further summarize the most significant challenges that characterize PQA from general QA applications and discuss their corresponding solutions. Finally, we conclude this paper by providing the prospect on several future directions.

Bidirectional Generative Framework for Cross-domain Aspect-based Sentiment Analysis
Yue Deng | Wenxuan Zhang | Sinno Jialin Pan | Lidong Bing
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Cross-domain aspect-based sentiment analysis (ABSA) aims to perform various fine-grained sentiment analysis tasks on a target domain by transferring knowledge from a source domain. Since labeled data only exists in the source domain, a model is expected to bridge the domain gap for tackling cross-domain ABSA. Though domain adaptation methods have proven to be effective, most of them are based on a discriminative model, which needs to be specifically designed for different ABSA tasks. To offer a more general solution, we propose a unified bidirectional generative framework to tackle various cross-domain ABSA tasks. Specifically, our framework trains a generative model in both text-to-label and label-to-text directions. The former transforms each task into a unified format to learn domain-agnostic features, and the latter generates natural sentences from noisy labels for data augmentation, with which a more accurate model can be trained. To investigate the effectiveness and generality of our framework, we conduct extensive experiments on four cross-domain ABSA tasks and present new state-of-the-art results on all tasks. Our data and code are publicly available at https://github.com/DAMO-NLP-SG/BGCA.

AQE: Argument Quadruplet Extraction via a Quad-Tagging Augmented Generative Approach
Jia Guo | Liying Cheng | Wenxuan Zhang | Stanley Kok | Xin Li | Lidong Bing
Findings of the Association for Computational Linguistics: ACL 2023

Argument mining involves multiple sub-tasks that automatically identify argumentative elements, such as claim detection, evidence extraction, stance classification, etc. However, each subtask alone is insufficient for a thorough understanding of the argumentative structure and reasoning process. To learn a complete view of an argument essay and capture the interdependence among argumentative components, we need to know what opinions people hold (i.e., claims), why those opinions are valid (i.e., supporting evidence), which source the evidence comes from (i.e., evidence type), and how those claims react to the debating topic (i.e., stance). In this work, we for the first time propose a challenging argument quadruplet extraction task (AQE), which can provide an all-in-one extraction of four argumentative components, i.e., claims, evidence, evidence types, and stances. To support this task, we construct a large-scale and challenging dataset. However, there is no existing method that can solve the argument quadruplet extraction. To fill this gap, we propose a novel quad-tagging augmented generative approach, which leverages a quadruplet tagging module to augment the training of the generative framework. The experimental results on our dataset demonstrate the empirical superiority of our proposed approach over several strong baselines.

2022

PACIFIC: Towards Proactive Conversational Question Answering over Tabular and Textual Data in Finance
Yang Deng | Wenqiang Lei | Wenxuan Zhang | Wai Lam | Tat-Seng Chua
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

To facilitate conversational question answering (CQA) over hybrid contexts in finance, we present a new dataset, named PACIFIC. Compared with existing CQA datasets, PACIFIC exhibits three key features: (i) proactivity, (ii) numerical reasoning, and (iii) hybrid context of tables and text. A new task is defined accordingly to study Proactive Conversational Question Answering (PCQA), which combines clarification question generation and CQA. In addition, we propose a novel method, namely UniPCQA, to adapt a hybrid format of input and output content in PCQA into the Seq2Seq problem, including the reformulation of the numerical reasoning process as code generation. UniPCQA performs multi-task learning over all sub-tasks in PCQA and incorporates a simple ensemble strategy to alleviate the error propagation issue in the multi-task learning by cross-validating top-k sampled Seq2Seq outputs. We benchmark the PACIFIC dataset with extensive baselines and provide comprehensive evaluations on each sub-task of PCQA.

Seeking Patterns, Not just Memorizing Procedures: Contrastive Learning for Solving Math Word Problems
Zhongli Li | Wenxuan Zhang | Chao Yan | Qingyu Zhou | Chao Li | Hongzhi Liu | Yunbo Cao
Findings of the Association for Computational Linguistics: ACL 2022

Math Word Problem (MWP) solving needs to discover the quantitative relationships over natural language narratives. Recent work shows that existing models memorize procedures from context and rely on shallow heuristics to solve MWPs. In this paper, we look at this issue and argue that the cause is a lack of overall understanding of MWP patterns. We first investigate how a neural network understands patterns only from semantics, and observe that, if the prototype equations are the same, most problems get closer representations and those representations apart from them or close to other prototypes tend to produce wrong solutions. Inspired by it, we propose a contrastive learning approach, where the neural network perceives the divergence of patterns. We collect contrastive examples by converting the prototype equation into a tree and seeking similar tree structures. The solving model is trained with an auxiliary objective on the collected examples, resulting in the representations of problems with similar prototypes being pulled closer. We conduct experiments on the Chinese dataset Math23k and the English dataset MathQA. Our method greatly improves the performance in monolingual and multilingual settings.

Towards Generalizable and Robust Text-to-SQL Parsing
Chang Gao | Bowen Li | Wenxuan Zhang | Wai Lam | Binhua Li | Fei Huang | Luo Si | Yongbin Li
Findings of the Association for Computational Linguistics: EMNLP 2022

Text-to-SQL parsing tackles the problem of mapping natural language questions to executable SQL queries. In practice, text-to-SQL parsers often encounter various challenging scenarios, requiring them to be generalizable and robust. While most existing work addresses a particular generalization or robustness challenge, we aim to study it in a more comprehensive manner. In specific, we believe that text-to-SQL parsers should be (1) generalizable at three levels of generalization, namely i.i.d., zero-shot, and compositional, and (2) robust against input perturbations. To enhance these capabilities of the parser, we propose a novel TKK framework consisting of Task decomposition, Knowledge acquisition, and Knowledge composition to learn text-to-SQL parsing in stages. By dividing the learning process into multiple stages, our framework improves the parser’s ability to acquire general SQL knowledge instead of capturing spurious patterns, making it more generalizable and robust. Experimental results under various generalization and robustness settings show that our framework is effective in all scenarios and achieves state-of-the-art performance on the Spider, SParC, and CoSQL datasets.

UniGDD: A Unified Generative Framework for Goal-Oriented Document-Grounded Dialogue
Chang Gao | Wenxuan Zhang | Wai Lam
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

The goal-oriented document-grounded dialogue aims at responding to the user query based on the dialogue context and supporting document. Existing studies tackle this problem by decomposing it into two sub-tasks: knowledge identification and response generation. However, such pipeline methods would unavoidably suffer from the error propagation issue. This paper proposes to unify these two sub-tasks via sequentially generating the grounding knowledge and the response. We further develop a prompt-connected multi-task learning strategy to model the characteristics and connections of different tasks and introduce linear temperature scheduling to reduce the negative effect of irrelevant document information. Experimental results demonstrate the effectiveness of our framework.

2021

Towards Generative Aspect-Based Sentiment Analysis
Wenxuan Zhang | Xin Li | Yang Deng | Lidong Bing | Wai Lam
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Aspect-based sentiment analysis (ABSA) has received increasing attention recently. Most existing work tackles ABSA in a discriminative manner, designing various task-specific classification networks for the prediction. Despite their effectiveness, these methods ignore the rich label semantics in ABSA problems and require extensive task-specific designs. In this paper, we propose to tackle various ABSA tasks in a unified generative framework. Two types of paradigms, namely annotation-style and extraction-style modeling, are designed to enable the training process by formulating each ABSA task as a text generation problem. We conduct experiments on four ABSA tasks across multiple benchmark datasets where our proposed generative approach achieves new state-of-the-art results in almost all cases. This also validates the strong generality of the proposed framework which can be easily adapted to arbitrary ABSA task without additional task-specific model design.

Cross-lingual Aspect-based Sentiment Analysis with Aspect Term Code-Switching
Wenxuan Zhang | Ruidan He | Haiyun Peng | Lidong Bing | Wai Lam
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Many efforts have been made in solving the Aspect-based sentiment analysis (ABSA) task. While most existing studies focus on English texts, handling ABSA in resource-poor languages remains a challenging problem. In this paper, we consider the unsupervised cross-lingual transfer for the ABSA task, where only labeled data in the source language is available and we aim at transferring its knowledge to the target language having no labeled data. To this end, we propose an alignment-free label projection method to obtain high-quality pseudo-labeled data of the target language with the help of the translation system, which could preserve more accurate task-specific knowledge in the target language. For better utilizing the source and translated data, as well as enhancing the cross-lingual alignment, we design an aspect code-switching mechanism to augment the training data with code-switched bilingual sentences. To further investigate the importance of language-specific knowledge in solving the ABSA problem, we distill the above model on the unlabeled target language data which improves the performance to the same level of the supervised method.

Learning to Rank Question Answer Pairs with Bilateral Contrastive Data Augmentation
Yang Deng | Wenxuan Zhang | Wai Lam
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

In this work, we propose a novel and easy-to-apply data augmentation strategy, namely Bilateral Generation (BiG), with a contrastive training objective for improving the performance of ranking question answer pairs with existing labeled data. In specific, we synthesize pseudo-positive QA pairs in contrast to the original negative QA pairs with two pre-trained generation models, one for question generation, the other for answer generation, which are fine-tuned on the limited positive QA pairs from the original dataset. With the augmented dataset, we design a contrastive training objective for learning to rank question answer pairs. Experimental results on three benchmark datasets show that our method significantly improves the performance of ranking models by making full use of existing labeled data and can be easily applied to different ranking models.

Aspect Sentiment Quad Prediction as Paraphrase Generation
Wenxuan Zhang | Yang Deng | Xin Li | Yifei Yuan | Lidong Bing | Wai Lam
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Aspect-based sentiment analysis (ABSA) has been extensively studied in recent years, which typically involves four fundamental sentiment elements, including the aspect category, aspect term, opinion term, and sentiment polarity. Existing studies usually consider the detection of partial sentiment elements, instead of predicting the four elements in one shot. In this work, we introduce the Aspect Sentiment Quad Prediction (ASQP) task, aiming to jointly detect all sentiment elements in quads for a given opinionated sentence, which can reveal a more comprehensive and complete aspect-level sentiment structure. We further propose a novel Paraphrase modeling paradigm to cast the ASQP task to a paraphrase generation process. On one hand, the generation formulation allows solving ASQP in an end-to-end manner, alleviating the potential error propagation in the pipeline solution. On the other hand, the semantics of the sentiment elements can be fully exploited by learning to generate them in the natural language form. Extensive experiments on benchmark datasets show the superiority of our proposed method and the capacity of cross-task transfer with the proposed unified Paraphrase modeling framework.

Aspect-based Sentiment Analysis in Question Answering Forums
Wenxuan Zhang | Yang Deng | Xin Li | Lidong Bing | Wai Lam
Findings of the Association for Computational Linguistics: EMNLP 2021

Aspect-based sentiment analysis (ABSA) typically focuses on extracting aspects and predicting their sentiments on individual sentences such as customer reviews. Recently, another kind of opinion sharing platform, namely question answering (QA) forum, has received increasing popularity, which accumulates a large number of user opinions towards various aspects. This motivates us to investigate the task of ABSA on QA forums (ABSA-QA), aiming to jointly detect the discussed aspects and their sentiment polarities for a given QA pair. Unlike review sentences, a QA pair is composed of two parallel sentences, which requires interaction modeling to align the aspect mentioned in the question and the associated opinion clues in the answer. To this end, we propose a model with a specific design of cross-sentence aspect-opinion interaction modeling to address this task. The proposed method is evaluated on three real-world datasets and the results show that our model outperforms several strong baselines adopted from related state-of-the-art models.

2020

Multi-hop Inference for Question-driven Summarization
Yang Deng | Wenxuan Zhang | Wai Lam
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Question-driven summarization has been recently studied as an effective approach to summarizing the source document to produce concise but informative answers for non-factoid questions. In this work, we propose a novel question-driven abstractive summarization method, Multi-hop Selective Generator (MSG), to incorporate multi-hop reasoning into question-driven summarization and, meanwhile, provide justifications for the generated summaries. Specifically, we jointly model the relevance to the question and the interrelation among different sentences via a human-like multi-hop inference module, which captures important sentences for justifying the summarized answer. A gated selective pointer generator network with a multi-view coverage mechanism is designed to integrate diverse information from different perspectives. Experimental results show that the proposed method consistently outperforms state-of-the-art methods on two non-factoid QA datasets, namely WikiHow and PubMedQA.

Intra-/Inter-Interaction Network with Latent Interaction Modeling for Multi-turn Response Selection
Yang Deng | Wenxuan Zhang | Wai Lam
Proceedings of the 28th International Conference on Computational Linguistics

Multi-turn response selection has been extensively studied and applied to many real-world applications in recent years. However, current methods typically model the interactions between multi-turn utterances and candidate responses with iterative approaches, which is not practical as the turns of conversations vary. Besides, some latent features, such as user intent and conversation topic, are under-discovered in existing works. In this work, we propose Intra-/Inter-Interaction Network (I³) with latent interaction modeling to comprehensively model multi-level interactions between the utterance context and the response. In specific, we first encode the intra- and inter-utterance interaction with the given response from both individual utterance and the overall utterance context. Then we develop a latent multi-view subspace clustering module to model the latent interaction between the utterance and response. Experimental results show that the proposed method substantially and consistently outperforms existing state-of-the-art methods on three multi-turn response selection benchmark datasets.

AnswerFact: Fact Checking in Product Question Answering
Wenxuan Zhang | Yang Deng | Jing Ma | Wai Lam
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Product-related question answering platforms nowadays are widely employed in many E-commerce sites, providing a convenient way for potential customers to address their concerns during online shopping. However, the misinformation in the answers on those platforms poses unprecedented challenges for users to obtain reliable and truthful product information, which may even cause a commercial loss in E-commerce business. To tackle this issue, we investigate to predict the veracity of answers in this paper and introduce AnswerFact, a large scale fact checking dataset from product question answering forums. Each answer is accompanied by its veracity label and associated evidence sentences, providing a valuable testbed for evidence-based fact checking tasks in QA settings. We further propose a novel neural model with tailored evidence ranking components to handle the concerned answer veracity prediction problem. Extensive experiments are conducted with our proposed model and various existing fact checking methods, showing that our method outperforms all baselines on this task.

Answering Product-related Questions with Heterogeneous Information
Wenxuan Zhang | Qian Yu | Wai Lam
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

Providing instant response for product-related questions in E-commerce question answering platforms can greatly improve users’ online shopping experience. However, existing product question answering (PQA) methods only consider a single information source such as user reviews and/or require large amounts of labeled data. In this paper, we propose a novel framework to tackle the PQA task via exploiting heterogeneous information including natural language text and attribute-value pairs from two information sources of the concerned product, namely product details and user reviews. A heterogeneous information encoding component is then designed for obtaining unified representations of information with different formats. The sources of the candidate snippets are also incorporated when measuring the question-snippet relevance. Moreover, the framework is trained with a specifically designed weak supervision paradigm making use of available answers in the training phase. Experiments on a real-world dataset show that our proposed framework achieves superior performance over state-of-the-art models.

2019

Exploiting BERT for End-to-End Aspect-based Sentiment Analysis
Xin Li | Lidong Bing | Wenxuan Zhang | Wai Lam
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)

In this paper, we investigate the modeling power of contextualized embeddings from pre-trained language models, e.g. BERT, on the E2E-ABSA task. Specifically, we build a series of simple yet insightful neural baselines to deal with E2E-ABSA. The experimental results show that even with a simple linear classification layer, our BERT-based architecture can outperform state-of-the-art works. Besides, we also standardize the comparative study by consistently utilizing a hold-out validation dataset for model selection, which is largely ignored by previous works. Therefore, our work can serve as a BERT-based benchmark for E2E-ABSA.

Co-authors

Tat-Seng Chua 5

Mahani Aljunied 3

Kenji Kawaguchi 2

Hou Pong Chan 1

Chip Hong Chang 1

Guanzheng Chen 1

Yi Feng (冯毅) 1

Liping Jing (景丽萍) 1

Bin Liang (梁斌) 1

Xuan-Phi Nguyen 1

Sinno Jialin Pan 1

Bing Qin (秦兵) 1

Zhuoyang Song 1

Weixiang Zhao 1

Venues