Nan Xu - ACL Anthology

Nan Xu

2025

TableEval: A Real-World Benchmark for Complex, Multilingual, and Multi-Structured Table Question Answering
Junnan Zhu | Jingyi Wang | Bohan Yu | Xiaoyu Wu | Junbo Li | Lei Wang | Nan Xu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

LLMs have shown impressive progress in natural language processing. However, they still face significant challenges in TableQA, where real-world complexities such as diverse table structures, multilingual data, and domain-specific reasoning are crucial. Existing TableQA benchmarks are often limited by their focus on simple flat tables and suffer from data leakage. Furthermore, most benchmarks are monolingual and fail to capture the cross-lingual and cross-domain variability in practical applications. To address these limitations, we introduce TableEval, a new benchmark designed to evaluate LLMs on realistic TableQA tasks. Specifically, TableEval includes tables with various structures (such as concise, hierarchical, and nested tables) collected from four domains (including government, finance, academia, and industry reports). Besides, TableEval features cross-lingual scenarios with tables in Simplified Chinese, Traditional Chinese, and English. To minimize the risk of data leakage, we collect all data from recent real-world documents. Considering that existing TableQA metrics fail to capture semantic accuracy, we further propose SEAT, a new evaluation framework that assesses the alignment between model responses and reference answers at the sub-question level. Experimental results have shown that SEAT achieves high agreement with human judgment. Extensive experiments on TableEval reveal critical gaps in the ability of state-of-the-art LLMs to handle these complex, real-world TableQA tasks, offering insights for future improvements.

Rethinking the Role of Prompting Strategies in LLM Test-Time Scaling: A Perspective of Probability Theory
Yexiang Liu | Zekun Li | Zhi Fang | Nan Xu | Ran He | Tieniu Tan
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recently, scaling test-time compute on Large Language Models (LLM) has garnered wide attention. However, there has been limited investigation of how various reasoning prompting strategies perform as scaling. In this paper, we focus on a standard and realistic scaling setting: majority voting. We systematically conduct experiments on 6 LLMs × 8 prompting strategies × 6 benchmarks. Experiment results consistently show that as the sampling time and computational overhead increase, complicated prompting strategies with superior initial performance gradually fall behind simple Chain-of-Thought.We analyze this phenomenon and provide theoretical proofs. Additionally, we propose a probabilistic method to efficiently predict scaling performance and identify the best prompting strategy under large sampling times, eliminating the need for resource-intensive inference processes in practical applications.Furthermore, we introduce two ways derived from our theoretical analysis to significantly improve the scaling performance. We hope that our research can promote to re-examine the role of complicated prompting, unleash the potential of simple prompting strategies, and provide new insights for enhancing test-time scaling performance. Code is available at https://github.com/MraDonkey/rethinking_prompting.

ChartMind: A Comprehensive Benchmark for Complex Real-world Multimodal Chart Question Answering
Jingxuan Wei | Nan Xu | Junnan Zhu | Haoyanni | Gaowei Wu | Qi Chen | Bihui Yu | Lei Wang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Chart question answering (CQA) has become a critical multimodal task for evaluating the reasoning capabilities of vision-language models. While early approaches have shown promising performance by focusing on visual features or leveraging large-scale pre-training, most existing evaluations rely on rigid output formats and objective metrics, thus ignoring the complex, real-world demands of practical chart analysis. In this paper, we introduce ChartMind, a new benchmark designed for complex CQA tasks in real-world settings. ChartMind covers seven task categories, incorporates multilingual contexts, supports open-domain textual outputs, and accommodates diverse chart formats, bridging the gap between real-world applications and traditional academic benchmarks. Furthermore, we propose a context-aware yet model-agnostic framework, ChartLLM, that focuses on extracting key contextual elements, reducing noise, and enhancing the reasoning accuracy of multimodal large language models. Extensive evaluations on ChartMind and three representative public benchmarks with 14 mainstream multimodal models show our framework significantly outperforms the previous three common CQA paradigms: instruction-following, OCR-enhanced, and chain-of-thought, highlighting the importance of flexible chart understanding for real-world CQA. These findings suggest new directions for developing more robust chart reasoning in future research.

DEMO: Reframing Dialogue Interaction with Fine-grained Element Modeling
Minzheng Wang | Xinghua Zhang | Kun Chen | Nan Xu | Haiyang Yu | Fei Huang | Wenji Mao | Yongbin Li
Findings of the Association for Computational Linguistics: ACL 2025

Large language models (LLMs) enabled dialogue systems have become one of the central modes in human-machine interaction, which bring about vast amounts of conversation logs and increasing demand for dialogue generation. The dialogue’s life-cycle spans from Prelude through Interlocution to Epilogue, encompassing rich dialogue elements. Despite large volumes of dialogue-related studies, there is a lack of systematic investigation into the dialogue stages to frame benchmark construction that covers comprehensive dialogue elements. This hinders the precise modeling, generation and assessment of LLMs-based dialogue systems. To bridge this gap, in this paper, we introduce a new research task—Dialogue Element MOdeling, including Element Awareness and Dialogue Agent Interaction, and propose a novel benchmark, DEMO, designed for a comprehensive dialogue modeling and assessment. On this basis, we further build the DEMO agent with the adept ability to model dialogue elements via imitation learning. Extensive experiments on DEMO indicate that current representative LLMs still have considerable potential for enhancement, and our DEMO agent performs well in both dialogue element modeling and out-of-domain tasks.

LLM The Genius Paradox: A Linguistic and Math Expert’s Struggle with Simple Word-based Counting Problems
Nan Xu | Xuezhe Ma
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Interestingly, LLMs yet struggle with some basic tasks that humans find trivial to handle, e.g., counting the number of character r’s in the word “strawberry”. There are several popular conjectures (e.g., tokenization, architecture and training data) regarding the reason for deficiency of LLMs in simple word-based counting problems, sharing the similar belief that such failure stems from model pretraining hence probably inevitable during deployment. In this paper, we carefully design multiple evaluation settings to investigate validity of prevalent conjectures. Meanwhile, we measure transferability of advanced mathematical and coding reasoning capabilities from specialized LLMs to simple counting tasks. Although specialized LLMs suffer from counting problems as well, we find conjectures about inherent deficiency of LLMs invalid and further seek opportunities to elicit knowledge and capabilities from LLMs which are beneficial to counting tasks. Compared with strategies such as finetuning and in-context learning that are commonly adopted to enhance performance on new or challenging tasks, we show that engaging reasoning is the most robust and efficient way to help LLMs better perceive tasks with more accurate responses.We hope our conjecture validation design could provide insights to study future critical failure modes of LLMs. Based on challenges in transferring advanced capabilities to much simpler tasks, we call for more attention to model capability acquisition and evaluation. We also highlight the importance of cultivating consciousness of “reasoning before responding” during model pretraining.

From Introspection to Best Practices: Principled Analysis of Demonstrations in Multimodal In-Context Learning
Nan Xu | Fei Wang | Sheng Zhang | Hoifung Poon | Muhao Chen
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Motivated by in-context learning (ICL) capabilities of Large Language Models (LLMs), multimodal LLMs with additional visual modality are also exhibited with similar ICL abilities when multiple image-text pairs are provided as demonstrations. However, relatively less work has been done to investigate the principles behind how and why multimodal ICL works. We conduct a systematic and principled evaluation of multimodal ICL for models of different scales on a broad spectrum of new yet critical tasks. Through perturbations over different modality information, we show that modalities matter differently across tasks in multimodal ICL. Guided by task-specific modality impact, we recommend modality-driven demonstration strategies to boost ICL performance. We also find that models may follow inductive biases from multimodal ICL even if they are rarely seen in or contradict semantic priors from pretraining data. Our principled analysis provides a comprehensive way of understanding the role of demonstrations in multimodal in-context learning, and sheds light on effectively improving multimodal ICL on a wide range of tasks.

ImaRA: An Imaginative Frame Augmented Method for Low-Resource Multimodal Metaphor Detection and Explanation
Yuan Tian | Minzheng Wang | Nan Xu | Wenji Mao
Findings of the Association for Computational Linguistics: NAACL 2025

Multimodal metaphor detection is an important and challenging task in multimedia computing, which aims to distinguish between metaphorical and literal multimodal expressions. Existing studies mainly utilize typical multimodal computing approaches for detection, neglecting the unique cross-domain and cross-modality characteristics underlying multimodal metaphor understanding. According to Conceptual Metaphor Theory (CMT), the inconsistency between source and target domains and their attribute similarity are essential to infer the intricate meanings implied in metaphors. In practice, the scarcity of the annotated multimodal metaphorical contents in the real world brings additional difficulty to the detection task and further complicates the understanding of multimodal metaphors. To address the above challenges, in this paper, we propose a novel Imaginative FRame Augmented (ImaRA) method for low-resource multimodal metaphor detection and explanation inspired by CMT. Specifically, we first identify imaginative frame as an associative structure to stimulate the imaginative thinking of multimodal metaphor detection and understanding. We then construct a cross-modal imagination dataset rich in multimodal metaphors and corresponding imaginative frames, and retrieve an augmented instance from this imagination dataset using imaginative frames mined from the input. This augmented instance serves as the demonstration exemplar to boost the metaphor reasoning ability of the multimodal large language model (MLLM) in low-resource multimodal scenarios. Experiments on two publicly available datasets show that our method consistently achieves robust results compared to MLLM-based methods for both multimodal metaphor detection and explanation in low-resource scenarios and meanwhile surpasses existing multimodal metaphor detection methods with full training data.

2024

Bridging Word-Pair and Token-Level Metaphor Detection with Explainable Domain Mining
Yuan Tian | Ruike Zhang | Nan Xu | Wenji Mao
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Metaphor detection aims to identify whether a linguistic expression in text is metaphorical or literal. Most existing research tackles this problem either using word-pair or token-level information as input, and thus treats word-pair and token-level metaphor detection as distinct subtasks. Benefited from the simplified structure of word pairs, recent methods for word-pair metaphor detection can provide intermediate explainable clues for the detection results, which remains a challenging issue for token-level metaphor detection. To mitigate this issue in token-level metaphor detection and take advantage of word pairs, in this paper, we make the first attempt to bridge word-pair and token-level metaphor detection via modeling word pairs within a sentence as explainable intermediate information. As the central role of verb in metaphorical expressions, we focus on token-level verb metaphor detection and propose a novel explainable Word Pair based Domain Mining (WPDM) method. Our work is inspired by conceptual metaphor theory (CMT). We first devise an approach for conceptual domain mining utilizing semantic role mapping and resources at cognitive, commonsense and lexical levels. We then leverage the inconsistency between source and target domains for core word pair modeling to facilitate the explainability. Experiments on four datasets verify the effectiveness of our method and demonstrate its capability to provide the core word pair and corresponding conceptual domains as explainable clues for metaphor detection.

Monotonic Paraphrasing Improves Generalization of Language Model Prompting
Qin Liu | Fei Wang | Nan Xu | Tianyi Lorena Yan | Tao Meng | Muhao Chen
Findings of the Association for Computational Linguistics: EMNLP 2024

Performance of large language models (LLMs) may vary with different prompts or instructions of even the same task. One commonly recognized factor for this phenomenon is the model’s familiarity with the given prompt or instruction, which is typically estimated by its perplexity. However, finding the prompt with the lowest perplexity is challenging, given the enormous space of possible prompting phrases. In this paper, we propose monotonic paraphrasing (MonoPara), an end-to-end decoding strategy that paraphrases given prompts or instructions into their lower perplexity counterparts based on an ensemble of a paraphrase LM for prompt (or instruction) rewriting, and a target LM (i.e. the prompt or instruction executor) that constrains the generation for lower perplexity. The ensemble decoding process can efficiently paraphrase the original prompt without altering its semantic meaning, while monotonically decrease the perplexity of each generation as calculated by the target LM. We explore in detail both greedy and search-based decoding as two alternative decoding schemes of MonoPara. Notably, MonoPara does not require any training and can monotonically lower the perplexity of the paraphrased prompt or instruction, leading to improved performance of zero-shot LM prompting as evaluated on a wide selection of tasks. In addition, MonoPara is also shown to effectively improve LMs’ generalization on perturbed and unseen task instructions.

PromISe: Releasing the Capabilities of LLMs with Prompt Introspective Search
Minzheng Wang | Nan Xu | Jiahao Zhao | Yin Luo | Wenji Mao
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The development of large language models (LLMs) raises the importance of assessing the fairness and completeness of various evaluation benchmarks. Regrettably, these benchmarks predominantly utilize uniform manual prompts, which may not fully capture the expansive capabilities of LLMs—potentially leading to an underestimation of their performance. To unlock the potential of LLMs, researchers pay attention to automated prompt search methods, which employ LLMs as optimizers to discover optimal prompts. However, previous methods generate the solutions implicitly, which overlook the underlying thought process and lack explicit feedback. In this paper, we propose a novel prompt introspective search framework, namely PromISe, to better release the capabilities of LLMs. It converts the process of optimizing prompts into an explicit chain of thought, through a step-by-step procedure that integrates self-introspect and self-refine. Extensive experiments, conducted over 73 tasks on two major benchmarks, demonstrate that our proposed PromISe significantly boosts the performance of 12 well-known LLMs compared to the baseline approach. Moreover, our study offers enhanced insights into the interaction between humans and LLMs, potentially serving as a foundation for future designs and implementations. Keywords: large language models, prompt search, self-introspect, self-refine

Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking
Nan Xu | Fei Wang | Ben Zhou | Bangzheng Li | Chaowei Xiao | Muhao Chen
Findings of the Association for Computational Linguistics: NAACL 2024

While large language models (LLMs) have demonstrated increasing power, they have also called upon studies on their vulnerabilities. As representatives, jailbreak attacks can provoke harmful or unethical responses from LLMs, even after safety alignment. In this paper, we investigate a novel category of jailbreak attacks specifically designed to target the cognitive structure and processes of LLMs. Specifically, we analyze the safety vulnerability of LLMs in the face of 1) multilingual cognitive overload, 2) veiled expression, and 3) effect-to- cause reasoning. Different from previous jailbreak attacks, our proposed cognitive overload is a black-box attack with no need for knowledge of model architecture or access to model weights. Experiments conducted on AdvBench and MasterKey reveal that various LLMs, including both popular open-source model Llama 2 and the proprietary model ChatGPT, can be compromised through cognitive overload. Motivated by cognitive psychology work on managing cognitive load, we further investigate defending cognitive overload attack from two perspectives. Empirical studies show that our cognitive overload from three perspectives can jailbreak all studied LLMs successfully, while existing defense strategies can hardly mitigate the caused malicious uses effectively.

CEO: Corpus-based Open-Domain Event Ontology Induction
Nan Xu | Hongming Zhang | Jianshu Chen
Findings of the Association for Computational Linguistics: EACL 2024

Existing event-centric NLP models often only apply to the pre-defined ontology, which significantly restricts their generalization capabilities.This paper presents CEO, a novel Corpus-based Event Ontology induction model to relax the restriction imposed by pre-defined event ontologies. Without direct supervision, CEO leverages distant supervision from available summary datasets to detect corpus-wise salient events and exploits external event knowledge to force events within a short distance to have close embeddings. Experiments on three popular event datasets show that the schema induced by CEO has better coverage and higher accuracy than previous methods. Moreover, CEO is the first event ontology induction model that can induce a hierarchical event ontology with meaningful names on eleven open-domain corpora, making the induced schema more trustworthy and easier to be further curated. We anonymously release our dataset, codes, and induced ontology.

Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA
Minzheng Wang | Longze Chen | Fu Cheng | Shengyi Liao | Xinghua Zhang | Bingli Wu | Haiyang Yu | Nan Xu | Lei Zhang | Run Luo | Yunshui Li | Min Yang | Fei Huang | Yongbin Li
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Long-context modeling capabilities of Large Language Models (LLMs) have garnered widespread attention, leading to the emergence of LLMs with ultra-context windows. Meanwhile, benchmarks for evaluating long-context language models are gradually catching up. However, existing benchmarks employ irrelevant noise texts to artificially extend the length of test cases, diverging from the real-world scenarios of long-context applications. To bridge this gap, we propose a novel long-context benchmark, Loong, aligning with realistic scenarios through extended multi-document question answering (QA). Unlike typical document QA, in Loong’s test cases, each document is relevant to the final answer, ignoring any document will lead to the failure of the answer. Furthermore, Loong introduces four types of tasks with a range of context lengths: Spotlight Locating, Comparison, Clustering, and Chain of Reasoning, to facilitate a more realistic and comprehensive evaluation of long-context understanding. Extensive experiments indicate that existing long-context language models still exhibit considerable potential for enhancement. Retrieval augmented generation (RAG) achieves poor performance, demonstrating that Loong can reliably assess the model’s long-context modeling capabilities.

A Theory Guided Scaffolding Instruction Framework for LLM-Enabled Metaphor Reasoning
Yuan Tian | Nan Xu | Wenji Mao
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Metaphor detection is a challenging task in figurative language processing, which aims to distinguish between metaphorical and literal expressions in text. Existing methods tackle metaphor detection via training or fine-tuning discriminative models on labeled data. However, these approaches struggle to explain the underlying reasoning process behind the metaphorical/literal judgment. Recently, large language models (LLMs) have shown promise in language reasoning tasks. Although promising, LLM-based methods for metaphor detection and reasoning are still faced with the challenging issue of bringing the explainable concepts for metaphor reasoning and their linguistic manifestation. To fill this gap, we propose a novel Theory guided Scaffolding Instruction (TSI) framework that instructs an LLM to infer the underlying reasoning process of metaphor detection guided by metaphor theories for the first time. Our work is inspired by a pedagogical strategy called scaffolding instruction, which encourages educators to provide questioning and support as scaffolding so as to assist learners in constructing the understanding of pedagogical goals step by step. We first construct a metaphor knowledge graph grounded in metaphor theory which serves as the instructional structure to obtain a series of scaffolding questions, directing the LLM to incrementally generate the reasoning process for metaphor understanding through dialogue interactions. During this theory guided instruction process, we explore the LLM’s mastery boundary and provide the relevant knowledge as scaffolding support when the question is beyond the LLM’s capability. Experimental results verify that our method significantly outperforms both the LLM-based reasoning methods and the SOTA methods in metaphor detection, indicating the facilitation of metaphor and instruction theories in guiding LLM-based reasoning process.

mDPO: Conditional Preference Optimization for Multimodal Large Language Models
Fei Wang | Wenxuan Zhou | James Y. Huang | Nan Xu | Sheng Zhang | Hoifung Poon | Muhao Chen
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Direct preference optimization (DPO) has shown to be an effective method for large language model (LLM) alignment. Recent works have attempted to apply DPO to multimodal scenarios but have found it challenging to achieve consistent improvement. Through a comparative experiment, we identify the unconditional preference problem in multimodal preference optimization, where the model overlooks the image condition. To address this problem, we propose mDPO, a multimodal DPO objective that prevents the over-prioritization of language-only preferences by also optimizing image preference. Moreover, we introduce a reward anchor that forces the reward to be positive for chosen responses, thereby avoiding the decrease in their likelihood—an intrinsic problem of relative preference optimization. Experiments on two multimodal LLMs of different sizes and three widely used benchmarks demonstrate that mDPO effectively addresses the unconditional preference problem in multimodal preference optimization and significantly improves model performance, particularly in reducing hallucination.

2023

Evaluating Large Language Models on Controlled Generation Tasks
Jiao Sun | Yufei Tian | Wangchunshu Zhou | Nan Xu | Qian Hu | Rahul Gupta | John Wieting | Nanyun Peng | Xuezhe Ma
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

While recent studies have looked into the abilities of large language models in various benchmark tasks, including question generation, reading comprehension, multilingual and etc, there have been few studies looking into the controllability of large language models on generation tasks. We present an extensive analysis of various benchmarks including a sentence planning benchmark with different granularities. After comparing large language models against state-of-the-start finetuned smaller models, we present a spectrum showing large language models falling behind, are comparable, or exceed the ability of smaller models. We conclude that *large language models struggle at meeting fine-grained hard constraints*.

Look-back Decoding for Open-Ended Text Generation
Nan Xu | Chunting Zhou | Asli Celikyilmaz | Xuezhe Ma
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Given a prefix (context), open-ended generation aims to decode texts that are coherent, which do not abruptly drift from previous topics, and informative, which do not suffer from undesired repetitions. In this paper, we propose Look-back, an improved decoding algorithm that leverages the Kullback–Leibler divergence to track the distribution distance between current and historical decoding steps. Thus Look-back can automatically predict potential repetitive phrase and topic drift, and remove tokens that may cause the failure modes, restricting the next token probability distribution within a plausible distance to the history. We perform decoding experiments on document continuation and story generation, and demonstrate that Look-back is able to generate more fluent and coherent text, outperforming other strong decoding methods significantly in both automatic and human evaluations.

Dense Retrieval as Indirect Supervision for Large-space Decision Making
Nan Xu | Fei Wang | Mingtao Dong | Muhao Chen
Findings of the Association for Computational Linguistics: EMNLP 2023

Many discriminative natural language understanding (NLU) tasks have large label spaces. Learning such a process of large-space decision making is particularly challenging due to the lack of training instances per label and the difficulty of selection among many fine-grained labels. Inspired by dense retrieval methods for passage finding in open-domain QA, we propose a reformulation of large-space discriminative NLU tasks as a learning-to-retrieve task, leading to a novel solution named Dense Decision Retrieval (DDR). Instead of predicting fine-grained decisions as logits, DDR adopts a dual-encoder architecture that learns to predict by retrieving from a decision thesaurus. This approach not only leverages rich indirect supervision signals from easy-to-consume learning resources for dense retrieval, it also leads to enhanced prediction generalizability with a semantically meaningful representation of the large decision space. When evaluated on tasks with decision spaces ranging from hundreds to hundred-thousand scales, DDR outperforms strong baselines greatly by 27.54% in P @1 on two extreme multi-label classification tasks, 1.17% in F1 score ultra-fine entity typing, and 1.26% in accuracy on three few-shot intent classification tasks on average.

Target-Oriented Relation Alignment for Cross-Lingual Stance Detection
Ruike Zhang | Nan Xu | Hanxuan Yang | Yuan Tian | Wenji Mao
Findings of the Association for Computational Linguistics: ACL 2023

Stance detection is an important task in text mining and social media analytics, aiming to automatically identify the user’s attitude toward a specific target from text, and has wide applications in a variety of domains. Previous work on stance detection has mainly focused on monolingual setting. To address the problem of imbalanced language resources, cross-lingual stance detection is proposed to transfer the knowledge learned from a high-resource (source) language (typically English) to another low-resource (target) language. However, existing research on cross-lingual stance detection has ignored the inconsistency in the occurrences and distributions of targets between languages, which consequently degrades the performance of stance detection in low-resource languages. In this paper, we first identify the target inconsistency issue in cross-lingual stance detection, and propose a fine-grained Target-oriented Relation Alignment (TaRA) method for the task, which considers both target-level associations and language-level alignments. Specifically, we propose the Target Relation Graph to learn the in-language and cross-language target associations. We further devise the relation alignment strategy to enable knowledge transfer between semantically correlated targets across languages. Experimental results on the representative datasets demonstrate the effectiveness of our method compared to competitive methods under variant settings.

Modeling Conceptual Attribute Likeness and Domain Inconsistency for Metaphor Detection
Yuan Tian | Nan Xu | Wenji Mao | Daniel Zeng
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Metaphor detection is an important and challenging task in natural language processing, which aims to distinguish between metaphorical and literal expressions in text. Previous studies mainly leverage the incongruity of source and target domains and contextual clues for detection, neglecting similar attributes shared between source and target concepts in metaphorical expressions. Based on conceptual metaphor theory, these similar attributes are essential to infer implicit meanings conveyed by the metaphor. Under the guidance of conceptual metaphor theory, in this paper, we model the likeness of attribute for the first time and propose a novel Attribute Likeness and Domain Inconsistency Learning framework (AIDIL) for word-pair metaphor detection. Specifically, we propose an attribute siamese network to mine similar attributes between source and target concepts. We then devise a domain contrastive learning strategy to learn the semantic inconsistency of concepts in source and target domains. Extensive experiments on four datasets verify that our method significantly outperforms the previous state-of-the-art methods, and demonstrate the generalization ability of our method.

Dynamic Routing Transformer Network for Multimodal Sarcasm Detection
Yuan Tian | Nan Xu | Ruike Zhang | Wenji Mao
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Multimodal sarcasm detection is an important research topic in natural language processing and multimedia computing, and benefits a wide range of applications in multiple domains. Most existing studies regard the incongruity between image and text as the indicative clue in identifying multimodal sarcasm. To capture cross-modal incongruity, previous methods rely on fixed architectures in network design, which restricts the model from dynamically adjusting to diverse image-text pairs. Inspired by routing-based dynamic network, we model the dynamic mechanism in multimodal sarcasm detection and propose the Dynamic Routing Transformer Network (DynRT-Net). Our method utilizes dynamic paths to activate different routing transformer modules with hierarchical co-attention adapting to cross-modal incongruity. Experimental results on a public dataset demonstrate the effectiveness of our method compared to the state-of-the-art methods. Our codes are available at https://github.com/TIAN-viola/DynRT.

2022

A Contrastive Framework for Learning Sentence Representations from Pairwise and Triple-wise Perspective in Angular Space
Yuhao Zhang | Hongji Zhu | Yongliang Wang | Nan Xu | Xiaobo Li | Binqiang Zhao
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Learning high-quality sentence representations is a fundamental problem of natural language processing which could benefit a wide range of downstream tasks. Though the BERT-like pre-trained language models have achieved great success, using their sentence representations directly often results in poor performance on the semantic textual similarity task. Recently, several contrastive learning methods have been proposed for learning sentence representations and have shown promising results. However, most of them focus on the constitution of positive and negative representation pairs and pay little attention to the training objective like NT-Xent, which is not sufficient enough to acquire the discriminating power and is unable to model the partial order of semantics between sentences. So in this paper, we propose a new method ArcCSE, with training objectives designed to enhance the pairwise discriminative power and model the entailment relation of triplet sentences. We conduct extensive experiments which demonstrate that our approach outperforms the previous state-of-the-art on diverse sentence related tasks, including STS and SentEval.

Does Your Model Classify Entities Reasonably? Diagnosing and Mitigating Spurious Correlations in Entity Typing
Nan Xu | Fei Wang | Bangzheng Li | Mingtao Dong | Muhao Chen
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Entity typing aims at predicting one or more words that describe the type(s) of a specific mention in a sentence. Due to shortcuts from surface patterns to annotated entity labels and biased training, existing entity typing models are subject to the problem of spurious correlations. To comprehensively investigate the faithfulness and reliability of entity typing methods, we first systematically define distinct kinds of model biases that are reflected mainly from spurious correlations. Particularly, we identify six types of existing model biases, including mention-context bias, lexical overlapping bias, named entity bias, pronoun bias, dependency bias, and overgeneralization bias. To mitigate model biases, we then introduce a counterfactual data augmentation method. By augmenting the original training set with their debiasedcounterparts, models are forced to fully comprehend sentences and discover the fundamental cues for entity typing, rather than relying on spurious correlations for shortcuts. Experimental results on the UFET dataset show our counterfactual data augmentation approach helps improve generalization of different entity typing models with consistently better performance on both the original and debiased test sets.

2020

Reasoning with Multimodal Sarcastic Tweets via Modeling Cross-Modality Contrast and Semantic Association
Nan Xu | Zhixiong Zeng | Wenji Mao
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Sarcasm is a sophisticated linguistic phenomenon to express the opposite of what one really means. With the rapid growth of social media, multimodal sarcastic tweets are widely posted on various social platforms. In multimodal context, sarcasm is no longer a pure linguistic phenomenon, and due to the nature of social media short text, the opposite is more often manifested via cross-modality expressions. Thus traditional text-based methods are insufficient to detect multimodal sarcasm. To reason with multimodal sarcastic tweets, in this paper, we propose a novel method for modeling cross-modality contrast in the associated context. Our method models both cross-modality contrast and semantic association by constructing the Decomposition and Relation Network (namely D&R Net). The decomposition network represents the commonality and discrepancy between image and text, and the relation network models the semantic association in cross-modality context. Experimental results on a public dataset demonstrate the effectiveness of our model in multimodal sarcasm detection.

2019

Modeling Conversation Structure and Temporal Dynamics for Jointly Predicting Rumor Stance and Veracity
Penghui Wei | Nan Xu | Wenji Mao
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Automatically verifying rumorous information has become an important and challenging task in natural language processing and social media analytics. Previous studies reveal that people’s stances towards rumorous messages can provide indicative clues for identifying the veracity of rumors, and thus determining the stances of public reactions is a crucial preceding step for rumor veracity prediction. In this paper, we propose a hierarchical multi-task learning framework for jointly predicting rumor stance and veracity on Twitter, which consists of two components. The bottom component of our framework classifies the stances of tweets in a conversation discussing a rumor via modeling the structural property based on a novel graph convolutional network. The top component predicts the rumor veracity by exploiting the temporal dynamics of stance evolution. Experimental results on two benchmark datasets show that our method outperforms previous methods in both rumor stance classification and veracity prediction.

Co-authors

Xinghua Zhang 2

Asli Celikyilmaz 1

James Y. Huang 1

Yongliang Wang 1

Tianyi Lorena Yan 1

Zhixiong Zeng 1

Hongming Zhang 1

Binqiang Zhao 1

Wangchunshu Zhou 1

Chunting Zhou 1

Venues