Zhe Yang

2025

MC²: A Minimum-Coverage and Dataset-Agnostic Framework for Compositional Generalization of LLMs on Semantic Parsing
Ziyao Xu | Zhe Yang | Houfeng Wang
Findings of the Association for Computational Linguistics: EMNLP 2025

Compositional generalization is one of the important abilities that large language models (LLMs) need to have for semantic parsing. Previous research typically relies on dataset-specific designs or a large number of samples in demonstrations to improve the compositional generalization of LLMs on semantic parsing. We revisit this issue and find that when the number of samples in a demonstration is limited to a theoretical lower bound for achieving compositional generalization (minimum-coverage), current advanced LLMs cannot arbitrarily achieve good compositional generalization generically on different semantic parsing datasets without dataset-specific designs. To solve this problem, we propose Multi-level Component Composition (MC^2), a minimum-coverage and dataset-agnostic framework based on input primitives, which aims to generically help LLMs achieve compositional generalization by selecting and organizing samples from multiple compositional levels that satisfy the primitive coverage. Experiments and analysis show that MC^2 can effectively improve the compositional generalization of LLMs on different semantic parsing datasets in the minimum-coverage setting.

pdf bib abs

Large Language Models (LLMs) have demonstrated the capability to refine their generated answers through self-correction, enabling continuous performance improvement over multiple rounds. However, the mechanisms underlying how and why accuracy evolves during this iterative process remain unexplored. To fill this gap, we propose a probabilistic theory to model the dynamics of accuracy change and explain the performance improvements observed in multi-round self-correction. Through mathematical derivation, we establish that the accuracy after the t^th round of self-correction is given by: Acc_t = Upp - 𝛼^t(Upp - Acc₀),where Acc₀ denotes the initial accuracy, Upp represents the upper bound of accuracy convergence, and 𝛼 determines the rate of convergence. Based on our theory, these parameters can be calculated and the predicted accuracy curve then can be obtained through only a single round of self-correction. Extensive experiments across diverse models and datasets demonstrate that our theoretical predictions align closely with empirical accuracy curves, validating the effectiveness of the theory. Our work provides a theoretical foundation for understanding LLM self-correction, thus paving the way for further explorations.

pdf bib abs

Multi-Graph Co-Training for Capturing User Intent in Session-based Recommendation
Zhe Yang | Tiantian Liang
Proceedings of the 31st International Conference on Computational Linguistics

Session-based recommendation focuses on predicting the next item a user will interact with based on sequences of anonymous user sessions. A significant challenge in this field is data sparsity due to the typically short-term interactions. Most existing methods rely heavily on users’ current interactions, overlooking the wealth of auxiliary information available. To address this, we propose a novel model, the Multi-Graph Co-Training model (MGCOT), which leverages not only the current session graph but also similar session graphs and a global item relation graph. This approach allows for a more comprehensive exploration of intrinsic relationships and better captures user intent from multiple views, enabling session representations to complement each other. Additionally, MGCOT employs multi-head attention mechanisms to effectively capture relevant session intent and uses contrastive learning to form accurate and robust session representations. Extensive experiments on three datasets demonstrate that MGCOT significantly enhances the performance of session-based recommendations, particularly on the Diginetica dataset, achieving improvements up to 2.00% in P@20 and 10.70% in MRR@20. Resources have been made publicly available in our GitHub repository https://github.com/liang-tian-tian/MGCOT.

pdf bib abs

Large Language Models with chain-of-thought prompting, such as OpenAI-o1, have shown impressive capabilities in natural language inference tasks. However, Multi-hop Question Answering (MHQA) remains challenging for many existing models due to issues like hallucination, error propagation, and limited context length. To address these challenges and enhance LLMs’ performance on MHQA, we propose the Self-Guiding prompting Finite State Machine (SG-FSM), designed to strengthen multi-hop reasoning abilities. Unlike traditional chain-of-thought methods, SG-FSM tackles MHQA by iteratively breaking down complex questions into sub-questions, correcting itself to improve accuracy. It processes one sub-question at a time, dynamically deciding the next step based on the current context and results, functioning much like an automaton. Experiments across various benchmarks demonstrate the effectiveness of our approach, outperforming strong baselines on challenging datasets such as Musique. SG-FSM reduces hallucination, enabling recovery of the correct final answer despite intermediate errors. It also improves adherence to specified output formats, simplifying evaluation significantly.

pdf bib abs

Palette of Language Models: A Solver for Controlled Text Generation
Zhe Yang | Yi Huang | Yaqin Chen | Xiaoting Wu | Junlan Feng | Chao Deng
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Recent advancements in large language models have revolutionized text generation with their remarkable capabilities. These models can produce controlled texts that closely adhere to specific requirements when prompted appropriately. However, designing an optimal prompt to control multiple attributes simultaneously can be challenging. A common approach is to linearly combine single-attribute models, but this strategy often overlooks attribute overlaps and can lead to conflicts. Therefore, we propose a novel combination strategy inspired by the Law of Total Probability and Conditional Mutual Information Minimization on generative language models. This method has been adapted for single-attribute control scenario and is termed the Palette of Language Models due to its theoretical linkage between attribute strength and generation style, akin to blending colors on an artist’s palette. Moreover, positive correlation and attribute enhancement are advanced as theoretical properties to guide a rational combination strategy design. We conduct experiments on both single control and multiple control settings, and achieve surpassing results.

pdf bib abs

Confidence v.s. Critique: A Decomposition of Self-Correction Capability for LLMs
Zhe Yang | Yichang Zhang | Yudong Wang | Ziyao Xu | Junyang Lin | Zhifang Sui
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large Language Models (LLMs) can correct their self-generated responses, but a decline in accuracy after self-correction is also witnessed. To have a deeper understanding of self-correction, we endeavor to decompose, evaluate, and analyze the self-correction behaviors of LLMs. By enumerating and analyzing answer correctness before and after self-correction, we decompose the self-correction capability into confidence (being confident to correct answers) and critique (turning wrong answers to correct) capabilities, and propose two metrics from a probabilistic perspective to measure these 2 capabilities, along with another metric for overall self-correction capability evaluation. Based on our decomposition and evaluation metrics, we conduct extensive experiments and draw some empirical conclusions. For example, we find different models can exhibit distinct behaviors: some models are confident while others are more critical. We also find the trade-off between the two capabilities (i.e. improving one can lead to a decline in the other) when manipulating model self-correction behavior by prompts or in-context learning. Further, we find a simple yet efficient strategy to improve self-correction capability by transforming Supervision Fine-Tuning (SFT) data format, and our strategy outperforms vanilla SFT in both capabilities and achieves much higher accuracy after self-correction.

2024

pdf bib abs

In extremely low resource relation identification scenario, small language models (SLMs) incline to overfit, which significantly diminishes their accuracy. Recently, large language models (LLMs) are gradually applied to classification tasks with converting original objective into the generation task via in-context learning. However, abundance of the classifier categories poses challenges in selecting demonstrations. Moreover, the mapping between category labels and textual descriptions requires expensive expert knowledge, thereby constraining the efficacy of in-context learning for LLMs. We uphold that SLM is optimal for handling classification tasks, and its shortcomings in the low resource setting can be mitigated by leveraging LLM. Hence, we propose a co-evolution strategy on SLM & LLM for relation identification. Specifically, LLM provides essential background knowledge to assist training process of the SLM classifier, while evaluation metrics from the classifier, in turn, offer valuable insights to refine the generation prompts of the LLM. We conduct experiments on several datasets which demonstrates preponderance of the proposed model.

pdf bib abs

Large language models (LLMs) have demonstrated impressive capabilities, but still suffer from inconsistency issues (e.g. LLMs can react differently to disturbances like rephrasing or inconsequential order change). In addition to these inconsistencies, we also observe that LLMs, while capable of solving hard problems, can paradoxically fail at easier ones. To evaluate this hard-to-easy inconsistency, we develop the ConsisEval benchmark, where each entry comprises a pair of questions with a strict order of difficulty. Furthermore, we introduce the concept of consistency score to quantitatively measure this inconsistency and analyze the potential for improvement in consistency by relative consistency score. Based on comprehensive experiments across a variety of existing models, we find: (1) GPT-4 achieves the highest consistency score of 92.2% but is still inconsistent to specific questions due to distraction by redundant information, misinterpretation of questions, etc.; (2) models with stronger capabilities typically exhibit higher consistency, but exceptions also exist; (3) hard data enhances consistency for both fine-tuning and in-context learning. Our data and code will be publicly available on GitHub.

pdf bib abs

To mitigate the high inference latency stemming from autoregressive decoding in Large Language Models (LLMs), Speculative Decoding has emerged as a novel decoding paradigm for LLM inference. In each decoding step, this method first drafts several future tokens efficiently and then verifies them in parallel. Unlike autoregressive decoding, Speculative Decoding facilitates the simultaneous decoding of multiple tokens per step, thereby accelerating inference. This paper presents a comprehensive overview and analysis of this promising decoding paradigm. We begin by providing a formal definition and formulation of Speculative Decoding. Then, we organize in-depth discussions on its key facets, such as drafter selection and verification strategies. Furthermore, we present a comparative analysis of leading methods under third-party testing environments. We aim for this work to serve as a catalyst for further research on Speculative Decoding, ultimately contributing to more efficient LLM inference.

2023

pdf bib abs

Not All Demonstration Examples are Equally Beneficial: Reweighting Demonstration Examples for In-Context Learning
Zhe Yang | Damai Dai | Peiyi Wang | Zhifang Sui
Findings of the Association for Computational Linguistics: EMNLP 2023

Large Language Models (LLMs) have recently gained the In-Context Learning (ICL) ability with the models scaling up, allowing them to quickly adapt to downstream tasks with only a few demonstration examples prepended in the input sequence. Nonetheless, the current practice of ICL treats all demonstration examples equally, which still warrants improvement, as the quality of examples is usually uneven. In this paper, we investigate how to determine approximately optimal weights for demonstration examples and how to apply them during ICL. To assess the quality of weights in the absence of additional validation data, we design a masked self-prediction (MSP) score that exhibits a strong correlation with the final ICL performance. To expedite the weight-searching process, we discretize the continuous weight space and adopt beam search. With approximately optimal weights obtained, we further propose two strategies to apply them to demonstrations at different model positions. Experimental results on 8 text classification tasks show that our approach outperforms conventional ICL by a large margin. Our code are publicly available at https:github.com/Zhe-Young/WICL.

pdf bib abs

Learning to Leverage High-Order Medical Knowledge Graph for Joint Entity and Relation Extraction
Zhe Yang | Yi Huang | Junlan Feng
Findings of the Association for Computational Linguistics: ACL 2023

Automatic medical entity and relation extraction is essential for daily electronic medical record (EMR) analysis, and has attracted a lot of academic attention. Tremendous progress has been made in recent years. However, medical terms are difficult to understand, and their relations are more complicated than general ones. Based on this situation, domain knowledge gives better background and contexts for medical terms. Despite the benefits of medical domain knowledge, the utilization way of it for joint entity and relation extraction is inadequate. To foster this line of research, in this work, we propose to leverage the medical knowledge graph for extracting entities and relations for Chinese Medical Texts in a collective way. Specifically, we propose to construct a high-order heterogeneous graph based on medical knowledge graph, which is linked to the entity mentions in the text. In this way, neighbors from the high-order heterogeneous graph can pass the message to each other for better global context representations. Our experiments on real Chinese Medical Texts show that our method is more effective than state-of-the-art methods.

2022

pdf bib abs

Low-resource Neural Machine Translation with Cross-modal Alignment
Zhe Yang | Qingkai Fang | Yang Feng
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

How to achieve neural machine translation with limited parallel data? Existing techniques often rely on large-scale monolingual corpus, which is impractical for some low-resource languages. In this paper, we turn to connect several low-resource languages to a particular high-resource one by additional visual modality. Specifically, we propose a cross-modal contrastive learning method to learn a shared space for all languages, where both a coarse-grained sentence-level objective and a fine-grained token-level one are introduced. Experimental results and further analysis show that our method can effectively learn the cross-modal and cross-lingual alignment with a small amount of image-text pairs, and achieves significant improvements over the text-only baseline under both zero-shot and few-shot scenarios.

2020

pdf bib abs

Aspect-category sentiment analysis (ACSA) aims to predict the aspect categories mentioned in texts and their corresponding sentiment polarities. Some joint models have been proposed to address this task. Given a text, these joint models detect all the aspect categories mentioned in the text and predict the sentiment polarities toward them at once. Although these joint models obtain promising performances, they train separate parameters for each aspect category and therefore suffer from data deficiency of some aspect categories. To solve this problem, we propose a novel joint model which contains a shared sentiment prediction layer. The shared sentiment prediction layer transfers sentiment knowledge between aspect categories and alleviates the problem caused by data deficiency. Experiments conducted on SemEval-2016 Datasets demonstrate the effectiveness of our model.