Cunxiang Wang - ACL Anthology

Cunxiang Wang

2026

HoWToBench: Holistic Evaluation for LLM’s Capability in Human-level Writing using Tree of Writing
Andrew Zhuoer Feng | Cunxiang Wang | Yu Luo | Lin Fan | Irene Zhou | Zikang Wang | Xiaotao Gu | Jie Tang | Hongning Wang | Minlie Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Evaluating the writing capabilities of large language models (LLMs) remains a significant challenge due to the multidimensional nature of writing skills and the limitations of existing metrics. LLM’s performance in thousand-words level and open-ended writing is inadequately assessed by traditional reference-based metrics or modern LLM-as-a-judge methods. We propose Tree-of-Writing (ToW), to resolve the implicit inconsistency often found when LLM-as-a-judge aggregates all sub-features in text evaluation. ToW incorporates a tree-structured workflow by explicitly modeling the aggregation weights of sub-features. We also present HowToBench, a large-scale Chinese writing benchmark encompassing **12** genres and **1302** instructions across three task categories: contextual **completion**, outline-**guided** writing, and **open**-ended generation. ToW successfully mitigates the biases, achieving a **0.93** Pearson correlation with human judgments. Furthermore, we detect that both overlap-based text generation metrics and popular LLM-as-a-judge practices are vulnerable to textual disturbances, while ToW is robust to them. We also uncover a negative correlation between input length and content-related scores in the Guide task, showcasing that it cannot be simply improved by input-side information piling.

Beyond Literal Mapping: Benchmarking and Improving Non-Literal Translation Evaluation
Yanzhi Tian | Cunxiang Wang | Zeming Liu | Heyan Huang | Wenbo Yu | Dawei Song | Jie Tang | Yuhang Guo
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large Language Models (LLMs) have significantly advanced Machine Translation (MT), applying them to linguistically complex domains-such as Social Network Services, literature etc. In these scenarios, translations often require handling non-literal expressions, leading to the inaccuracy of MT metrics. To systematically investigate the reliability of MT metrics, we first curate a meta-evaluation dataset focused on non-literal translations, namely MENT. MENT encompasses four non-literal translation domains and features source sentences paired with translations from diverse MT systems, with 7,530 human-annotated scores on translation quality. Experimental results reveal the inaccuracies of traditional MT metrics and the limitations of LLM-as-a-Judge, particularly the knowledge cutoff and score inconsistency problem. To mitigate these limitations, we propose RATE, a novel agentic translation evaluation framework, centered by a reflective Core Agent that dynamically invokes specialized sub-agents. Experimental results indicate the efficacy of RATE, achieving an improvement of at least 3.2 points in combined system- and segment-level correlation with human judgments compared with current methods. Further experiments demonstrate the robustness of RATE to general-domain MT evaluation. Code and dataset are available at: https://github.com/BITHLP/RATE.

IF-CRITIC: Towards a Fine-Grained LLM Critic for Instruction-Following Evaluation
Bosi Wen | Yilin Niu | Cunxiang Wang | Pei Ke | Xiaoying Ling | Ying Zhang | Aohan Zeng | Hongning Wang | Minlie Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Instruction-following is a fundamental ability of Large Language Models (LLMs), requiring their generated outputs to follow multiple constraints imposed in input instructions. Numerous studies have attempted to enhance this ability through preference optimization or reinforcement learning based on reward signals from LLM-as-a-Judge. However, existing evaluation models for instruction-following still possess many deficiencies, such as substantial costs and unreliable assessments. To this end, we propose IF-CRITIC, an LLM critic for fine-grained, efficient, and reliable instruction-following evaluation. We first develop a checklist generator to decompose instructions and generate constraint checklists. With the assistance of the checklists, we collect high-quality critique training data through a multi-stage critique filtering mechanism and employ a constraint-level preference optimization method to train IF-CRITIC. Extensive experiments show that the evaluation performance of IF-CRITIC can beat strong LLM-as-a-Judge baselines, including o4-mini and Gemini-3-Pro. With the reward signals provided by IF-CRITIC, LLMs can achieve substantial performance gains in instruction-following optimization under lowercomputational overhead compared to strong LLM critic baselines. Our code and model are available at https://github.com/thu-coai/IF-CRITIC.

IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation
Bosi Wen | Yilin Niu | Cunxiang Wang | Xiaoying Ling | Ying Zhang | Pei Ke | Hongning Wang | Minlie Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Instruction-following is a foundational capability of large language models (LLMs), with its improvement hinging on scalable and accurate feedback from judge models. However, the reliability of current judge models in instruction-following remains underexplored due to several deficiencies of existing meta-evaluation benchmarks, such as their insufficient data coverage and oversimplified pairwise evaluation paradigms that misalign with model optimization scenarios. To this end, we propose IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction-following that covers diverse instruction and constraint types. For each instruction, we construct a preference graph containing all pairwise preferences among multiple responses based on instruction-following quality. This design enables a listwise evaluation paradigm that assesses the capabilities of judge models to rank multiple responses, which is essential in guiding model alignment. Extensive experiments on IF-RewardBench reveal significant deficiencies in current judge models and demonstrate that our benchmark achieves a stronger positive correlation with downstream task performance compared to existing benchmarks. Our codes and data are available at https://github.com/thu-coai/IF-RewardBench.

2025

R³: “This is My SQL, Are You With Me?” A Consensus-Based Multi-Agent System for Text-to-SQL Tasks
Hanchen Xia | Feng Jiang | Naihao Deng | Cunxiang Wang | Guojiang Zhao | Rada Mihalcea | Yue Zhang
Proceedings of the 4th Table Representation Learning Workshop

Large Language Models (LLMs) have demon- strated exceptional performance across diverse tasks. To harness their capabilities for Text- to-SQL, we introduce R3 (Review-Rebuttal- Revision), a consensus-based multi-agent sys- tem for Text-to-SQL tasks. R3 achieves the new state-of-the-art performance of 89.9 on the Spider test set. In the meantime, R3 achieves 61.80 on the Bird development set. R3 out- performs existing single-LLM and multi-agent Text-to-SQL systems by 1.3% to 8.1% on Spi- der and Bird, respectively. Surprisingly, we find that for Llama-3-8B, R3 outperforms chain-of- thought prompting by over 20%, even outper- forming GPT-3.5 on the Spider development set. We open-source our codebase at https: //github.com/1ring2rta/R3.

Self-DC: When to Reason and When to Act? Self Divide-and-Conquer for Compositional Unknown Questions
Hongru Wang | Boyang Xue | Baohang Zhou | Tianhua Zhang | Cunxiang Wang | Huimin Wang | Guanhua Chen | Kam-Fai Wong
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Previous research has typically concentrated on leveraging the internal knowledge of Large Language Models (LLMs) to answer known questions (i.e., internal reasoning such as generate-then-read). In contrast, for questions that fall outside their known scope, these models rely on external knowledge retrieval to provide accurate responses (i.e., external acting such as retrieve-then-read). However, few previous works consider the compositional questions, which consist of several known and unknown sub-questions, necessitating the dynamic combination of previous two methods (i.e., internal reasoning and external acting) to achieve a better trade-off between effectiveness and efficiency. To this end, we introduce a Self Divide-and-Conquer (Self-DC) framework, accompanying with the first Compositional unknown Question-Answering dataset (CuQA). This framework enables LLMs to adaptively choose between using internal knowledge and retrieving external knowledge as needed, resulting in a better trade-off between effectiveness and efficiency. Experimental results on two datasets demonstrate that Self-DC can achieve comparable or even better performance with much fewer external calls compared with several strong baselines.

CPRM: A LLM-based Continual Pre-training Framework for Relevance Modeling in Commercial Search
Kaixin Wu | Yixin Ji | Zeyuan Chen | Qiang Wang | Cunxiang Wang | Hong Liu | Baijun Ji | Xu Jia | Zhongyi Liu | Jinjie Gu | Yuan Zhou | Linjian Mo
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)

Relevance modeling between queries and items stands as a pivotal component in commercial search engines, directly affecting the user experience. Given the remarkable achievements of large language models (LLMs) in various natural language processing (NLP) tasks, LLM-based relevance modeling is gradually being adopted within industrial search systems. Nevertheless, foundational LLMs lack domain-specific knowledge and do not fully exploit the potential of in-context learning. Furthermore, structured item text remains underutilized, and there is a shortage in the supply of corresponding queries and background knowledge. We thereby propose CPRM (Continual Pre-training for Relevance Modeling), a framework designed for the continual pre-training of LLMs to address these issues. Our CPRM framework includes three modules: 1) employing both queries and multi-field item to jointly pre-train for enhancing domain knowledge, 2) applying in-context pre-training, a novel approach where LLMs are pre-trained on a sequence of related queries or items, and 3) conducting reading comprehension on items to produce associated domain knowledge and background information (e.g., generating summaries and corresponding queries) to further strengthen LLMs. Results on offline experiments and online A/B testing demonstrate that our model achieves convincing performance compared to strong baselines.

Unlocking Recursive Thinking of LLMs: Alignment via Refinement
Haoke Zhang | Xiaobo Liang | Cunxiang Wang | Juntao Li | Min Zhang
Findings of the Association for Computational Linguistics: ACL 2025

The OpenAI o1-series models have demonstrated that leveraging long-form Chain of Thought (CoT) can substantially enhance performance. However, the recursive thinking capabilities of Large Language Models (LLMs) remain limited, particularly in the absence of expert-curated data for distillation. In this paper, we propose AvR: Alignment via Refinement, a novel method aimed at unlocking the potential of LLMs for recursive reasoning through long-form CoT. AvR introduces a refinement process that integrates criticism and improvement actions, guided by differentiable learning techniques to optimize refinement-aware rewards. As a result, the synthesized multi-round data can be organized as a long refinement thought, further enabling test-time scaling. Experimental results show that AvR significantly outperforms conventional preference optimization methods. Notably, with only 3k synthetic samples, our method boosts the performance of the LLaMA-3-8B-Instruct model by over 20% in win rate on AlpacaEval 2.0. Our code is available at Github .

Training Language Model to Critique for Better Refinement
Tianshu Yu | Chao Xiang | Mingchuan Yang | Pei Ke | Bosi Wen | Cunxiang Wang | Jiale Cheng | Li Zhang | Xinyu Mu | Chuxiong Sun | Minlie Huang
Findings of the Association for Computational Linguistics: ACL 2025

Large language models (LLMs) have demonstrated remarkable evaluation and critique capabilities, providing insightful feedback and identifying flaws in various tasks. However, limited research has explored which types of critiques are most effective for improving model responses or how to generate such critiques. To address this gap, we introduce Refinement-oriented Critique Optimization (RCO), a novel framework designed to train critic models using refinement signals. RCO uses a feedback loop where critiques, generated by the critic model, guide the actor model in refining its responses. The critique utility (CU) quantifies the effectiveness of these refinements, serving as the reward signal for training the critic model. By focusing on critiques that lead to better refinements, RCO eliminates the need for direct critique preference assessment, ensuring that critiques driving meaningful improvements are rewarded. We evaluate RCO across five tasks—dialog generation, summarization, question answering, mathematical reasoning, and code generation—and show that it significantly outperforms traditional methods and open-source models in terms of critique quality and refinement outcomes. Our contributions include the introduction of RCO, a novel supervision scheme based on refined response preferences, and comprehensive experimental results that highlight the method’s effectiveness in enhancing LLM critique-refinement loops. Code and data will be publicly available upon acceptance of this paper.

HPSS: Heuristic Prompting Strategy Search for LLM Evaluators
Bosi Wen | Pei Ke | Yufei Sun | Cunxiang Wang | Xiaotao Gu | Jinfeng Zhou | Jie Tang | Hongning Wang | Minlie Huang
Findings of the Association for Computational Linguistics: ACL 2025

Since the adoption of large language models (LLMs) for text evaluation has become increasingly prevalent in the field of natural language processing (NLP), a series of existing works attempt to optimize the prompts for LLM evaluators to improve their alignment with human judgment. However, their efforts are limited to optimizing individual factors of evaluation prompts, such as evaluation criteria or output formats, neglecting the combinatorial impact of multiple factors, which leads to insufficient optimization of the evaluation pipeline. Nevertheless, identifying well-behaved prompting strategies for adjusting multiple factors requires extensive enumeration. To this end, we comprehensively integrate 8 key factors for evaluation prompts and propose a novel automatic prompting strategy optimization method called Heuristic Prompting Strategy Search (HPSS). Inspired by the genetic algorithm, HPSS conducts an iterative search to find well-behaved prompting strategies for LLM evaluators. A heuristic function is employed to guide the search process, enhancing the performance of our algorithm. Extensive experiments across four evaluation tasks demonstrate the effectiveness of HPSS, consistently outperforming both human-designed evaluation prompts and existing automatic prompt optimization methods. Our code is available athttps://github.com/thu-coai/HPSS.

How Likely Do LLMs with CoT Mimic Human Reasoning?
Guangsheng Bao | Hongbo Zhang | Cunxiang Wang | Linyi Yang | Yue Zhang
Proceedings of the 31st International Conference on Computational Linguistics

Chain-of-thought emerges as a promising technique for eliciting reasoning capabilities from Large Language Models (LLMs). However, it does not always improve task performance or accurately represent reasoning processes, leaving unresolved questions about its usage. In this paper, we diagnose the underlying mechanism by comparing the reasoning process of LLMs with humans, using causal analysis to understand the relationships between the problem instruction, reasoning, and the answer in LLMs. Our empirical study reveals that LLMs often deviate from the ideal causal chain, resulting in spurious correlations and potential consistency errors (inconsistent reasoning and answers). We also examine various factors influencing the causal structure, finding that in-context learning with examples strengthens it, while post-training techniques like supervised fine-tuning and reinforcement learning on human feedback weaken it. To our surprise, the causal structure cannot be strengthened by enlarging the model size only, urging research on new techniques. We hope that this preliminary study will shed light on understanding and improving the reasoning process in LLM.

LongSafety: Evaluating Long-Context Safety of Large Language Models
Yida Lu | Jiale Cheng | Zhexin Zhang | Shiyao Cui | Cunxiang Wang | Xiaotao Gu | Yuxiao Dong | Jie Tang | Hongning Wang | Minlie Huang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

As Large Language Models (LLMs) continue to advance in understanding and generating long sequences, new safety concerns have been introduced through the long context. However, the safety of LLMs in long-context tasks remains under-explored, leaving a significant gap in both evaluation and improvement of their safety. To address this, we introduce LongSafety, the first comprehensive benchmark specifically designed to evaluate LLM safety in open-ended long-context tasks. LongSafety encompasses 7 categories of safety issues and 6 user-oriented long-context tasks, with a total of 1,543 test cases, averaging 5,424 words per context. Our evaluation towards 16 representative LLMs reveals significant safety vulnerabilities, with most models achieving safety rates below 55%. Our findings also indicate that strong safety performance in short-context scenarios does not necessarily correlate with safety in long-context tasks, emphasizing the unique challenges and urgency of improving long-context safety. Moreover, through extensive analysis, we identify challenging safety issues and task types for long-context models. Furthermore, we find that relevant context and extended input sequences can exacerbate safety risks in long-context scenarios, highlighting the critical need for ongoing attention to long-context safety challenges. Our code and data will be publicly available.

2024

LONG²RAG: Evaluating Long-Context & Long-Form Retrieval-Augmented Generation with Key Point Recall
Zehan Qi | Rongwu Xu | Zhijiang Guo | Cunxiang Wang | Hao Zhang | Wei Xu
Findings of the Association for Computational Linguistics: EMNLP 2024

SHIELD: Evaluation and Defense Strategies for Copyright Compliance in LLM Text Generation
Xiaoze Liu | Ting Sun | Tianyang Xu | Feijie Wu | Cunxiang Wang | Xiaoqian Wang | Jing Gao
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Large Language Models (LLMs) have transformed machine learning but raised significant legal concerns due to their potential to produce text that infringes on copyrights, resulting in several high-profile lawsuits. The legal landscape is struggling to keep pace with these rapid advancements, with ongoing debates about whether generated text might plagiarize copyrighted materials. Current LLMs may infringe on copyrights or overly restrict non-copyrighted texts, leading to these challenges: (i) the need for a comprehensive evaluation benchmark to assess copyright compliance from multiple aspects; (ii) evaluating robustness against safeguard bypassing attacks; and (iii) developing effective defenses targeted against the generation of copyrighted text.To tackle these challenges, we introduce a curated dataset to evaluate methods, test attack strategies, and propose a lightweight, real-time defense mechanism to prevent the generation of copyrighted text, ensuring the safe and lawful use of LLMs. Our experiments demonstrate that current LLMs frequently output copyrighted text, and that jailbreaking attacks can significantly increase the volume of copyrighted output. Our proposed defense mechanism substantially reduces the volume of copyrighted text generated by LLMs by effectively refusing malicious requests.

Nash CoT: Multi-Path Inference with Preference Equilibrium
Ziqi Zhang | Cunxiang Wang | Xiao Xiong | Yue Zhang | Donglin Wang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Chain of thought (CoT) is a reasoning framework that can enhance the performance of large language models (LLMs) on complex inference tasks. In particular, among various studies related to CoT, multi-path inference stands out as a simple yet effective improvement. However, there is no optimal setting for the number of inference paths. Therefore, we have to increase the number of inference paths to obtain better results, which in turn increases the inference cost. To address this limitation, we can utilize question-related role templates to guide LLMs into relevant roles, thereby increasing the possibility of correct inferences for each path and further reducing dependence on the number of inference paths while improving reasoning accuracy. However, placing LLMs into specific roles may reduce their reasoning diversity and performance on a few tasks where role dependence is low. To alleviate the excessive immersion of the LLM into a specific role, we propose Nash CoT by constructing a competitive system on each path that balances the generation from role-specific LLMs’ and the general LLMs’ generation, thereby ensuring both effective role adoption and diversity in LLM generation further maintaining the performance of multi-path inference while reducing the requirement of the number of inference paths. We evaluate Nash CoT across various inference tasks, including Arabic Reasoning, Commonsense Question Answering, and Symbolic Inference, achieving results that are comparable to or better than those of multi-path CoT with the equal number of inference paths.

Knowledge Conflicts for LLMs: A Survey
Rongwu Xu | Zehan Qi | Zhijiang Guo | Cunxiang Wang | Hongru Wang | Yue Zhang | Wei Xu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

This survey provides an in-depth analysis of knowledge conflicts for large language models (LLMs), highlighting the complex challenges they encounter when blending contextual and parametric knowledge. Our focus is on three categories of knowledge conflicts: context-memory, inter-context, and intra-memory conflict. These conflicts can significantly impact the trustworthiness and performance of LLMs, especially in real-world applications where noise and misinformation are common. By categorizing these conflicts, exploring the causes, examining the behaviors of LLMs under such conflicts, and reviewing available solutions, this survey aims to shed light on strategies for improving the robustness of LLMs, thereby serving as a valuable resource for advancing research in this evolving area.

2023

TRAMS: Training-free Memory Selection for Long-range Language Modeling
Haofei Yu | Cunxiang Wang | Yue Zhang | Wei Bi
Findings of the Association for Computational Linguistics: EMNLP 2023

The Transformer architecture is crucial for numerous AI models, but it still faces challenges in long-range language modeling. Though several specific transformer architectures have been designed to tackle issues of long-range dependencies, existing methods like Transformer-XL are plagued by a high percentage of ineffective memories. In this study, we present a plug-and-play strategy, known as TRAining-free Memory Selection (TRAMS), that selects tokens participating in attention calculation based on one simple metric. This strategy allows us to keep tokens that are likely to have a high attention score with the current queries and ignore the other ones. We have tested our approach on the word-level benchmark (WikiText-103) and the character-level benchmark (enwik8), and the results indicate an improvement without having additional training or adding additional parameters.

RFiD: Towards Rational Fusion-in-Decoder for Open-Domain Question Answering
Cunxiang Wang | Haofei Yu | Yue Zhang
Findings of the Association for Computational Linguistics: ACL 2023

Open-Domain Question Answering (ODQA) systems necessitate a reader model capable of generating answers by simultaneously referring to multiple passages. Although representative models like Fusion-in-Decoder (FiD) have been proposed to address this challenge, these systems can inadvertently rely on spurious features instead of genuine causal relationships between the question and the passages to generate answers. To counter this problem, we introduce the Rational Fusion-in-Decoder (RFiD) model. Our model leverages the encoders of FiD to differentiate between causal relationships and spurious features, subsequently guiding the decoder to generate answers informed by this discernment. Experimental results on two ODQA datasets, Natural Questions (NQ) and TriviaQA (TQ), demonstrate that our model surpasses previous methods, achieving improvements of up to 1.5 and 0.7 in Exact Match scores on NQ, and exhibits an enhanced ability to identify causal relationships.

Exploiting Abstract Meaning Representation for Open-Domain Question Answering
Cunxiang Wang | Zhikun Xu | Qipeng Guo | Xiangkun Hu | Xuefeng Bai | Zheng Zhang | Yue Zhang
Findings of the Association for Computational Linguistics: ACL 2023

The Open-Domain Question Answering (ODQA) task involves retrieving and subsequently generating answers from fine-grained relevant passages within a database. Current systems leverage Pretrained Language Models (PLMs) to model the relationship between questions and passages. However, the diversity in surface form expressions can hinder the model’s ability to capture accurate correlations, especially within complex contexts. Therefore, we utilize Abstract Meaning Representation (AMR) graphs to assist the model in understanding complex semantic information. We introduce a method known as Graph-as-Token (GST) to incorporate AMRs into PLMs. Results from Natural Questions (NQ) and TriviaQA (TQ) demonstrate that our GST method can significantly improve performance, resulting in up to 2.44/3.17 Exact Match score improvements on NQ/TQ respectively. Furthermore, our method enhances robustness and outperforms alternative Graph Neural Network (GNN) methods for integrating AMRs. To the best of our knowledge, we are the first to employ semantic graphs in ODQA.

2021

Can Generative Pre-trained Language Models Serve As Knowledge Bases for Closed-book QA?
Cunxiang Wang | Pai Liu | Yue Zhang
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Recent work has investigated the interesting question using pre-trained language models (PLMs) as knowledge bases for answering open questions. However, existing work is limited in using small benchmarks with high test-train overlaps. We construct a new dataset of closed-book QA using SQuAD, and investigate the performance of BART. Experiments show that it is challenging for BART to remember training facts in high precision, and also challenging to answer closed-book questions even if relevant knowledge is retained. Some promising directions are found, including decoupling the knowledge memorizing process and the QA finetune process, forcing the model to recall relevant knowledge when question answering.

2020

SemEval-2020 Task 4: Commonsense Validation and Explanation
Cunxiang Wang | Shuailong Liang | Yili Jin | Yilong Wang | Xiaodan Zhu | Yue Zhang
Proceedings of the Fourteenth Workshop on Semantic Evaluation

In this paper, we present SemEval-2020 Task 4, Commonsense Validation and Explanation (ComVE), which includes three subtasks, aiming to evaluate whether a system can distinguish a natural language statement that makes sense to humans from one that does not, and provide the reasons. Specifically, in our first subtask, the participating systems are required to choose from two natural language statements of similar wording the one that makes sense and the one does not. The second subtask additionally asks a system to select the key reason from three options why a given statement does not make sense. In the third subtask, a participating system needs to generate the reason automatically. 39 teams submitted their valid systems to at least one subtask. For Subtask A and Subtask B, top-performing teams have achieved results closed to human performance. However, for Subtask C, there is still a considerable gap between system and human performance. The dataset used in our task can be found at https://github.com/wangcunxiang/SemEval2020-Task4-Commonsense-Validation-and-Explanation.

2019

Does it Make Sense? And Why? A Pilot Study for Sense Making and Explanation
Cunxiang Wang | Shuailong Liang | Yue Zhang | Xiaonan Li | Tian Gao
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Introducing common sense to natural language understanding systems has received increasing research attention. It remains a fundamental question on how to evaluate whether a system has the sense-making capability. Existing benchmarks measure common sense knowledge indirectly or without reasoning. In this paper, we release a benchmark to directly test whether a system can differentiate natural language statements that make sense from those that do not make sense. In addition, a system is asked to identify the most crucial reason why a statement does not make sense. We evaluate models trained over large-scale language modeling tasks as well as human performance, showing that there are different challenges for system sense-making.

Co-authors

Shuailong Liang 2

Xiaoying Ling 2

Xuefeng Bai (白雪峰) 1

Guangsheng Bao 1

Andrew Zhuoer Feng 1

Yuhang Guo (郭宇航) 1

He-Yan Huang (黄河燕) 1

Yixin Ji (纪一心) 1

Feng Jiang (蒋峰) 1

Rada Mihalcea 1

Xiaoqian Wang 1

Mingchuan Yang 1

Tianhua Zhang 1

Guojiang Zhao 1

Venues