Jianxing Yu - ACL Anthology

Jianxing Yu

2025

Answering Complex Geographic Questions by Adaptive Reasoning with Visual Context and External Commonsense Knowledge
Fan Li | Jianxing Yu | Jielong Tang | Wenqing Chen | Hanjiang Lai | Yanghui Rao | Jian Yin
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

This paper focuses on a new task of answering geographic reasoning questions based on the given image (called GeoVQA). Unlike traditional VQA tasks, GeoVQA asks for details about the image-related culture, landscape, etc. This requires not only the identification of the objects in the image, their properties and relations, but also the understanding of the geographic knowledge of the objects, such as location, transportation, landmark, cuisine, etc. This background knowledge does not explicitly appear in the image, nor is there an extra-textual description. Without this missing but necessary knowledge, it is difficult for existing matching-based methods to infer the correct answer. To tackle these challenges, we propose a new geographic reasoning framework for our task. We first analyze the image and describe its fine-grained content by text and keywords using a multi-modal retrieval augmented technique, so as to deduce an answer in a unified textual modality. Next, we retrieve the crucial geographic commonsense knowledge. To reduce the retrieval complexity, we design a dynamic method that can adaptively collect the relevant clues for each reasoning step. The step in the incorrect direction will be pruned according to some judgment criteria. The remaining steps can help us form a reasoning chain to derive a correct answer. Moreover, we create a large-scale dataset GVQA with 41,329 samples to conduct the evaluation. The results demonstrate the effectiveness of our approach.

Detecting Emotional Incongruity of Sarcasm by Commonsense Reasoning
Ziqi Qiu | Jianxing Yu | Yufeng Zhang | Hanjiang Lai | Yanghui Rao | Qinliang Su | Jian Yin
Proceedings of the 31st International Conference on Computational Linguistics

This paper focuses on sarcasm detection, which aims to identify whether given statements convey criticism, mockery, or other negative sentiment opposite to the literal meaning. To detect sarcasm, humans often require a comprehensive understanding of the semantics in the statement and even resort to external commonsense to infer the fine-grained incongruity. However, existing methods lack commonsense inferential ability when they face complex real-world scenarios, leading to unsatisfactory performance. To address this problem, we propose a novel framework for sarcasm detection, which conducts incongruity reasoning based on commonsense augmentation, called EICR. Concretely, we first employ retrieval-augmented large language models to supplement the missing but indispensable commonsense background knowledge. To capture complex contextual associations, we construct a dependency graph and obtain the optimized topology via graph refinement. We further introduce an adaptive reasoning skeleton that integrates prior rules to extract sentiment-inconsistent subgraphs explicitly. To eliminate the possible spurious relations between words and labels, we employ adversarial contrastive learning to enhance the robustness of the detector. Experiments conducted on five datasets demonstrate the effectiveness of EICR.

Eliciting Implicit Acoustic Styles from Open-domain Instructions to Facilitate Fine-grained Controllable Generation of Speech
Jianxing Yu | Gou Zihao | Chen Li | Zhisheng Wang | Peiji Yang | Wenqing Chen | Jian Yin
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

This paper focuses on generating speech with the acoustic style that meets users’ needs based on their open-domain instructions. To control the style, early work mostly relies on pre-defined rules or templates. The control types and formats are fixed in a closed domain, making it hard to meet diverse needs of users. One solution is to resort to instructions in free text to guide the generation. Current work mainly studies the instructions that clearly specify the acoustic styles, such as low pitch and fast speed. However, the instructions are complex, some even vague and abstract, such as “Generate a voice of a woman who is heartbroken due to a breakup. It is hard to infer this implicit style by traditional matching-based methods. To address this problem, we propose a new controllable model. It first utilizes multimodal LLMs with knowledge-augmented techniques to infer the desired speech style from the instructions. The powerful language understanding ability of LLMs can help us better elicit the implicit style factors from the instruction. By using these factors as a control condition, we design a diffusion-based generator adept at finely adjusting speech details. That offers higher flexibility to meet complex users’ needs. Next, we verify the output speech from three aspects, i.e., consistency of decoding state, mel-spectrogram, and instruction style. This verified feedback can inversely optimize the generator. Extensive experiments are conducted on three popular datasets. The results show the effectiveness and good controllability of our approach.

Generating Commonsense Reasoning Questions with Controllable Complexity through Multi-step Structural Composition
Jianxing Yu | Shiqi Wang | Hanjiang Lai | Wenqing Chen | Yanghui Rao | Qinliang Su | Jian Yin
Proceedings of the 31st International Conference on Computational Linguistics

This paper studies the task of generating commonsense reasoning questions (QG) with desired difficulty levels. Compared to traditional shallow questions that can be solved by simple term matching, ours are more challenging. Our answering process requires reasoning over multiple contextual and commonsense clues. That involves advanced comprehension skills, such as abstract semantics learning and missing knowledge inference. Existing work mostly learns to map the given text into questions, lacking a mechanism to control results with the desired complexity. To address this problem, we propose a novel controllable framework. We first derive contextual and commonsense clues involved in reasoning questions from the text. These clues are used to create simple sub-questions. We then aggregate multiple sub-questions to compose complex ones under the guidance of prior reasoning structures. By iterating this process, we can compose a complex QG task based on a series of smaller and simpler QG subtasks. Each subtask serves as a building block for a larger one. Each composition corresponds to an increase in the reasoning step. Moreover, we design a voting verifier to ensure results’ validity from multiple views, including answer consistency, reasoning difficulty, and context correlation. Finally, we can learn the optimal QG model to yield thought-provoking results. Evaluations on two typical datasets validate our method.

CoE: A Clue of Emotion Framework for Emotion Recognition in Conversations
Zhiyu Shen | Yunhe Pang | Yanghui Rao | Jianxing Yu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Emotion Recognition in Conversations (ERC) is crucial for machines to understand dynamic human emotions. While Large Language Models (LLMs) show promise, their performance is often limited by challenges in interpreting complex conversational streams. We introduce a Clue of Emotion (CoE) framework, which progressively integrates key conversational clues to enhance the ERC task. Building on CoE, we implement a multi-stage auxiliary learning strategy that incorporates role-playing, speaker identification, and emotion reasoning tasks, each targeting different aspects of conversational emotion understanding and enhancing the model’s ability to interpret emotional contexts. Our experiments on EmoryNLP, MELD, and IEMOCAP demonstrate that CoE consistently outperforms state-of-the-art methods, achieving a 2.92% improvement on EmoryNLP. These results underscore the effectiveness of clues and multi-stage auxiliary learning for ERC, offering valuable insights for future research.

UnCo: Uncertainty-Driven Collaborative Framework of Large and Small Models for Grounded Multimodal NER
Jielong Tang | Yang Yang | Jianxing Yu | Zhen-Xing Wang | Haoyuan Liang | Liang Yao | Jian Yin
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Grounded Multimodal Named Entity Recognition (GMNER) is a new information extraction task. It requires models to extract named entities and ground them to real-world visual objects. Previous methods, relying on domain-specific fine-tuning, struggle with unseen multimodal entities due to limited knowledge and generalization. Recently, multimodal large language models (MLLMs) have demonstrated strong open-set abilities. However, their performance is hindered by the lack of in-domain knowledge due to costly training for GMNER datasets. To address these limitations, we propose **UnCo**, a two-stage Uncertainty-driven Collaborative framework that leverages the complementary strengths of small fine-tuned models and MLLMs. Specifically, **in stage one**, we equip the small model with a unified uncertainty estimation (UE) for multimodal entities. This enables the small model to express "I do not know" when recognizing unseen entities beyond its capabilities. Predictions with high uncertainty are then filtered and delegated to the MLLM. **In stage two**, an Uncertainty-aware Hierarchical Correction mechanism guides the MLLM to refine uncertain predictions using its open-domain knowledge. Ultimately, UnCo effectively retains the in-domain knowledge of small models while utilizing the capabilities of MLLMs to handle unseen samples. Extensive experiments demonstrate UnCo’s effectiveness on two GMNER benchmarks.

2024

Multimodal Clickbait Detection by De-confounding Biases Using Causal Representation Inference
Jianxing Yu | Shiqi Wang | Han Yin | Zhenlong Sun | Ruobing Xie | Bo Zhang | Yanghui Rao
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

This paper focuses on detecting clickbait posts on the Web. These posts often use eye-catching disinformation in mixed modalities to mislead users to click for profit. That affects the user experience and thus would be blocked by content provider. To escape detection, malicious creators use tricks to add some irrelevant non-bait content into bait posts, dressing them up as legal to fool the detector. This content often has biased relations with non-bait labels, yet traditional detectors tend to make predictions based on simple co-occurrence rather than grasping inherent factors that lead to malicious behavior. This spurious bias would easily cause misjudgments. To address this problem, we propose a new debiased method based on causal inference. We first employ a set of features in multiple modalities to characterize the posts. Considering these features are often mixed up with unknown biases, we then disentangle three kinds of latent factors from them, including the invariant factor that indicates intrinsic bait intention; the causal factor which reflects deceptive patterns in a certain scenario, and non-causal noise. By eliminating the noise that causes bias, we can use invariant and causal factors to build a robust model with good generalization ability. Experiments on three popular datasets show the effectiveness of our approach.

Leveraging BERT and TFIDF Features for Short Text Clustering via Alignment-Promoting Co-Training
Zetong Li | Qinliang Su | Shijing Si | Jianxing Yu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

BERT and TFIDF features excel in capturing rich semantics and important words, respectively. Since most existing clustering methods are solely based on the BERT model, they often fall short in utilizing keyword information, which, however, is very useful in clustering short texts. In this paper, we propose a **CO**-**T**raining **C**lustering (**COTC**) framework to make use of the collective strengths of BERT and TFIDF features. Specifically, we develop two modules responsible for the clustering of BERT and TFIDF features, respectively. We use the deep representations and cluster assignments from the TFIDF module outputs to guide the learning of the BERT module, seeking to align them at both the representation and cluster levels. Reversely, we also use the BERT module outputs to train the TFIDF module, thus leading to the mutual promotion. We then show that the alternating co-training framework can be placed under a unified joint training objective, which allows the two modules to be connected tightly and the training signals to be propagated efficiently. Experiments on eight benchmark datasets show that our method outperforms current SOTA methods significantly.

Domain Adaptation for Subjective Induction Questions Answering on Products by Adversarial Disentangled Learning
Yufeng Zhang | Jianxing Yu | Yanghui Rao | Libin Zheng | Qinliang Su | Huaijie Zhu | Jian Yin
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

This paper focuses on answering subjective questions about products. Different from the factoid question with a single answer span, this subjective one involves multiple viewpoints. For example, the question of ‘how the phone’s battery is?’ not only involves facts of battery capacity but also contains users’ opinions on the battery’s pros and cons. A good answer should be able to integrate these heterogeneous and even inconsistent viewpoints, which is formalized as a subjective induction QA task. For this task, the data distributions are often imbalanced across different product domains. It is hard for traditional methods to work well without considering the shift of domain patterns. To address this problem, we propose a novel domain-adaptive model. Concretely, for each sample in the source and target domain, we first retrieve answer-related knowledge and represent them independently. To facilitate knowledge transferring, we then disentangle the representations into domain-invariant and domain-specific latent factors. Moreover, we develop an adversarial discriminator with contrastive learning to reduce the impact of out-of-domain bias. Based on learned latent vectors in a target domain, we yield multi-perspective summaries as inductive answers. Experiments on popular datasets show the effectiveness of our method.

2023

Generating Deep Questions with Commonsense Reasoning Ability from the Text by Disentangled Adversarial Inference
Jianxing Yu | Shiqi Wang | Libin Zheng | Qinliang Su | Wei Liu | Baoquan Zhao | Jian Yin
Findings of the Association for Computational Linguistics: ACL 2023

This paper proposes a new task of commonsense question generation, which aims to yield deep-level and to-the-point questions from the text. Their answers need to reason over disjoint relevant contexts and external commonsense knowledge, such as encyclopedic facts and causality. The knowledge may not be explicitly mentioned in the text but is used by most humans for problem-shooting. Such complex reasoning with hidden contexts involves deep semantic understanding. Thus, this task has great application value, such as making high-quality quizzes in advanced exams. Due to the lack of modeling complexity, existing methods may produce shallow questions that can be answered by simple word matching. To address these challenges, we propose a new QG model by simultaneously considering asking contents, expressive ways, and answering complexity. We first retrieve text-related commonsense context. Then we disentangle the key factors that control questions in terms of reasoning content and verbalized way. Independence priors and constraints are imposed to facilitate disentanglement. We further develop a discriminator to promote the deep results by considering their answering complexity. Through adversarial inference, we learn the latent factors from data. By sampling the expressive factor from the data distributions, diverse questions can be yielded. Evaluations of two typical data sets show the effectiveness of our approach.

2022

Efficient Document Retrieval by End-to-End Refining and Quantizing BERT Embedding with Contrastive Product Quantization
Zexuan Qiu | Qinliang Su | Jianxing Yu | Shijing Si
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Efficient document retrieval heavily relies on the technique of semantic hashing, which learns a binary code for every document and employs Hamming distance to evaluate document distances. However, existing semantic hashing methods are mostly established on outdated TFIDF features, which obviously do not contain lots of important semantic information about documents. Furthermore, the Hamming distance can only be equal to one of several integer values, significantly limiting its representational ability for document distances. To address these issues, in this paper, we propose to leverage BERT embeddings to perform efficient retrieval based on the product quantization technique, which will assign for every document a real-valued codeword from the codebook, instead of a binary code as in semantic hashing. Specifically, we first transform the original BERT embeddings via a learnable mapping and feed the transformed embedding into a probabilistic product quantization module to output the assigned codeword. The refining and quantizing modules can be optimized in an end-to-end manner by minimizing the probabilistic contrastive loss. A mutual information maximization based method is further proposed to improve the representativeness of codewords, so that documents can be quantized more accurately. Extensive experiments conducted on three benchmarks demonstrate that our proposed method significantly outperforms current state-of-the-art baselines.

2021

Integrating Semantics and Neighborhood Information with Graph-Driven Generative Models for Document Retrieval
Zijing Ou | Qinliang Su | Jianxing Yu | Bang Liu | Jingwen Wang | Ruihui Zhao | Changyou Chen | Yefeng Zheng
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

With the need of fast retrieval speed and small memory footprint, document hashing has been playing a crucial role in large-scale information retrieval. To generate high-quality hashing code, both semantics and neighborhood information are crucial. However, most existing methods leverage only one of them or simply combine them via some intuitive criteria, lacking a theoretical principle to guide the integration process. In this paper, we encode the neighborhood information with a graph-induced Gaussian distribution, and propose to integrate the two types of information with a graph-driven generative model. To deal with the complicated correlations among documents, we further propose a tree-structured approximation method for learning. Under the approximation, we prove that the training objective can be decomposed into terms involving only singleton or pairwise documents, enabling the model to be trained as efficiently as uncorrelated ones. Extensive experimental results on three benchmark datasets show that our method achieves superior performance over state-of-the-art methods, demonstrating the effectiveness of the proposed model for simultaneously preserving semantic and neighborhood information.

Retrieve & Memorize: Dialog Policy Learning with Multi-Action Memory
YunHao Li | Yunyi Yang | Xiaojun Quan | Jianxing Yu
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Refining BERT Embeddings for Document Hashing via Mutual Information Maximization
Zijing Ou | Qinliang Su | Jianxing Yu | Ruihui Zhao | Yefeng Zheng | Bang Liu
Findings of the Association for Computational Linguistics: EMNLP 2021

Existing unsupervised document hashing methods are mostly established on generative models. Due to the difficulties of capturing long dependency structures, these methods rarely model the raw documents directly, but instead to model the features extracted from them (e.g. bag-of-words (BOG), TFIDF). In this paper, we propose to learn hash codes from BERT embeddings after observing their tremendous successes on downstream tasks. As a first try, we modify existing generative hashing models to accommodate the BERT embeddings. However, little improvement is observed over the codes learned from the old BOG or TFIDF features. We attribute this to the reconstruction requirement in the generative hashing, which will enforce irrelevant information that is abundant in the BERT embeddings also compressed into the codes. To remedy this issue, a new unsupervised hashing paradigm is further proposed based on the mutual information (MI) maximization principle. Specifically, the method first constructs appropriate global and local codes from the documents and then seeks to maximize their mutual information. Experimental results on three benchmark datasets demonstrate that the proposed method is able to generate hash codes that outperform existing ones learned from BOG features by a substantial margin.

2020

Multi-Domain Dialogue Acts and Response Co-Generation
Kai Wang | Junfeng Tian | Rui Wang | Xiaojun Quan | Jianxing Yu
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Generating fluent and informative responses is of critical importance for task-oriented dialogue systems. Existing pipeline approaches generally predict multiple dialogue acts first and use them to assist response generation. There are at least two shortcomings with such approaches. First, the inherent structures of multi-domain dialogue acts are neglected. Second, the semantic associations between acts and responses are not taken into account for response generation. To address these issues, we propose a neural co-generation model that generates dialogue acts and responses concurrently. Unlike those pipeline approaches, our act generation module preserves the semantic structures of multi-domain dialogue acts and our response generation module dynamically attends to different acts as needed. We train the two modules jointly using an uncertainty loss to adjust their task weights adaptively. Extensive experiments are conducted on the large-scale MultiWOZ dataset and the results show that our model achieves very favorable improvement over several state-of-the-art models in both automatic and human evaluations.

Low-Resource Generation of Multi-hop Reasoning Questions
Jianxing Yu | Wei Liu | Shuang Qiu | Qinliang Su | Kai Wang | Xiaojun Quan | Jian Yin
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

This paper focuses on generating multi-hop reasoning questions from the raw text in a low resource circumstance. Such questions have to be syntactically valid and need to logically correlate with the answers by deducing over multiple relations on several sentences in the text. Specifically, we first build a multi-hop generation model and guide it to satisfy the logical rationality by the reasoning chain extracted from a given text. Since the labeled data is limited and insufficient for training, we propose to learn the model with the help of a large scale of unlabeled data that is much easier to obtain. Such data contains rich expressive forms of the questions with structural patterns on syntax and semantics. These patterns can be estimated by the neural hidden semi-Markov model using latent variables. With latent patterns as a prior, we can regularize the generation model and produce the optimal results. Experimental results on the HotpotQA data set demonstrate the effectiveness of our model. Moreover, we apply the generated results to the task of machine reading comprehension and achieve significant performance improvements.

Embedding Dynamic Attributed Networks by Modeling the Evolution Processes
Zenan Xu | Zijing Ou | Qinliang Su | Jianxing Yu | Xiaojun Quan | ZhenKun Lin
Proceedings of the 28th International Conference on Computational Linguistics

Network embedding has recently emerged as a promising technique to embed nodes of a network into low-dimensional vectors. While fairly successful, most existing works focus on the embedding techniques for static networks. But in practice, there are many networks that are evolving over time and hence are dynamic, e.g., the social networks. To address this issue, a high-order spatio-temporal embedding model is developed to track the evolutions of dynamic networks. Specifically, an activeness-aware neighborhood embedding method is first proposed to extract the high-order neighborhood information at each given timestamp. Then, an embedding prediction framework is further developed to capture the temporal correlations, in which the attention mechanism is employed instead of recurrent neural networks (RNNs) for its efficiency in computing and flexibility in modeling. Extensive experiments are conducted on four real-world datasets from three different areas. It is shown that the proposed method outperforms all the baselines by a substantial margin for the tasks of dynamic link prediction and node classification, which demonstrates the effectiveness of the proposed methods on tracking the evolutions of dynamic networks.

2019

Inferential Machine Comprehension: Answering Questions by Recursively Deducing the Evidence Chain from Text
Jianxing Yu | Zhengjun Zha | Jian Yin
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

This paper focuses on the topic of inferential machine comprehension, which aims to fully understand the meanings of given text to answer generic questions, especially the ones needed reasoning skills. In particular, we first encode the given document, question and options in a context aware way. We then propose a new network to solve the inference problem by decomposing it into a series of attention-based reasoning steps. The result of the previous step acts as the context of next step. To make each step can be directly inferred from the text, we design an operational cell with prior structure. By recursively linking the cells, the inferred results are synthesized together to form the evidence chain for reasoning, where the reasoning direction can be guided by imposing structural constraints to regulate interactions on the cells. Moreover, a termination mechanism is introduced to dynamically determine the uncertain reasoning depth, and the network is trained by reinforcement learning. Experimental results on 3 popular data sets, including MCTest, RACE and MultiRC, demonstrate the effectiveness of our approach.

2012

Answering Opinion Questions on Products by Exploiting Hierarchical Organization of Consumer Reviews
Jianxing Yu | Zheng-Jun Zha | Tat-Seng Chua
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

2011

Domain-Assisted Product Aspect Hierarchy Generation: Towards Hierarchical Organization of Unstructured Consumer Reviews
Jianxing Yu | Zheng-Jun Zha | Meng Wang | Kai Wang | Tat-Seng Chua
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

Aspect Ranking: Identifying Important Product Aspects from Online Consumer Reviews
Jianxing Yu | Zheng-Jun Zha | Meng Wang | Tat-Seng Chua
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

Co-authors

Tat-Seng Chua 3

Zheng-Jun Zha 3

Changyou Chen 1

Haoyuan Liang 1

Zhisheng Wang 1

Zhen-Xing Wang 1

Bo Zhang (波章,) 1

Venues