2024
pdf
bib
Explicit Memory Learning with Expectation Maximization
Zhangyue Yin
|
Qiushi Sun
|
Qipeng Guo
|
Zhiyuan Zeng
|
Qinyuan Cheng
|
Xipeng Qiu
|
Xuanjing Huang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
pdf
bib
abs
LLM can Achieve Self-Regulation via Hyperparameter Aware Generation
Siyin Wang
|
Shimin Li
|
Tianxiang Sun
|
Jinlan Fu
|
Qinyuan Cheng
|
Jiasheng Ye
|
Junjie Ye
|
Xipeng Qiu
|
Xuanjing Huang
Findings of the Association for Computational Linguistics: ACL 2024
In the realm of Large Language Models (LLMs), users commonly employ diverse decoding strategies and adjust hyperparameters to control the generated text. However, a critical question emerges: Are LLMs conscious of the existence of these decoding strategies and capable of regulating themselves? The current decoding generation process often relies on empirical and heuristic manual adjustments to hyperparameters based on types of tasks and demands. However, this process is typically cumbersome, and the decoding hyperparameters may not always be optimal for each sample. To address the aforementioned challenges, we propose a novel text generation paradigm termed Hyperparameter Aware Generation (HAG). By leveraging hyperparameter-aware instruction tuning, the LLM autonomously determines the optimal decoding strategy and configs based on the input samples, enabling self-regulation. Our approach eliminates the need for extensive manual tuning, offering a more autonomous, self-regulate model behavior. Experimental results spanning six datasets across reasoning, creativity, translation, and mathematics tasks demonstrate that hyperparameter-aware instruction tuning empowers the LLMs to self-regulate the decoding strategy and hyperparameter. HAG extends the current paradigm in the text generation process, highlighting the feasibility of endowing the LLMs with self-regulate decoding strategies.
pdf
bib
abs
Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation
Qin Zhu
|
Qinyuan Cheng
|
Runyu Peng
|
Xiaonan Li
|
Ru Peng
|
Tengxiao Liu
|
Xipeng Qiu
|
Xuanjing Huang
Findings of the Association for Computational Linguistics: EMNLP 2024
The training process of large language models (LLMs) often involves varying degrees of test data contamination. Although current LLMs are achieving increasingly better performance on various benchmarks, their performance in practical applications does not always match their benchmark results. Leakage of benchmarks can prevent the accurate assessment of LLMs’ true performance. However, constructing new benchmarks is costly, labor-intensive and still carries the risk of leakage. Therefore, in this paper, we ask the question Can we reuse these leaked benchmarks for LLM evaluation? We propose Inference-Time Decontamination (ITD) to address this issue by detecting and rewriting leaked samples without altering their difficulties. ITD can mitigate performance inflation caused by memorizing leaked benchmarks. Our proof-of-concept experiments demonstrate that ITD reduces inflated accuracy by 22.9% on GSM8K and 19.0% on MMLU. On MMLU, using Inference-time Decontamination can lead to a decrease in the results of Phi3 and Mistral by 6.7% and 3.6% respectively. We hope that ITD can provide more truthful evaluation results for large language models.
pdf
bib
abs
Scaling Laws for Fact Memorization of Large Language Models
Xingyu Lu
|
Xiaonan Li
|
Qinyuan Cheng
|
Kai Ding
|
Xuanjing Huang
|
Xipeng Qiu
Findings of the Association for Computational Linguistics: EMNLP 2024
Fact knowledge memorization is crucial for Large Language Models (LLM) to generate factual and reliable responses. However, the behaviors of LLM fact memorization remain under-explored. In this paper, we analyze the scaling laws for LLM’s fact knowledge and LLMs’ behaviors of memorizing different types of facts. We find that LLMs’ fact knowledge capacity has a linear and negative exponential law relationship with model size and training epochs, respectively. Estimated by the built scaling law, memorizing the whole Wikidata’s facts requires training an LLM with 1000B non-embed parameters for 100 epochs, suggesting that using LLMs to memorize all public facts is almost implausible for a general pre-training setting. Meanwhile, we find that LLMs can generalize on unseen fact knowledge and its scaling law is similar to general pre-training. Additionally, we analyze the compatibility and preference of LLMs’ fact memorization. For compatibility, we find LLMs struggle with memorizing redundant facts in a unified way. Only when correlated facts have the same direction and structure, the LLM can compatibly memorize them. This shows the inefficiency of LLM memorization for redundant facts. For preference, the LLM pays more attention to memorizing more frequent and difficult facts, and the subsequent facts can overwrite prior facts’ memorization, which significantly hinders low-frequency facts memorization. Our findings reveal the capacity and characteristics of LLMs’ fact knowledge learning, which provide directions for LLMs’ fact knowledge augmentation.
pdf
bib
abs
Unified Active Retrieval for Retrieval Augmented Generation
Qinyuan Cheng
|
Xiaonan Li
|
Shimin Li
|
Qin Zhu
|
Zhangyue Yin
|
Yunfan Shao
|
Linyang Li
|
Tianxiang Sun
|
Hang Yan
|
Xipeng Qiu
Findings of the Association for Computational Linguistics: EMNLP 2024
In Retrieval-Augmented Generation (RAG), retrieval is not always helpful and applying it to every instruction is sub-optimal. Therefore, determining whether to retrieve is crucial for RAG, which is usually referred to as Active Retrieval. However, existing active retrieval methods face two challenges: 1. They usually rely on a single criterion, which struggles with handling various types of instructions. 2. They depend on specialized and highly differentiated procedures, and thus combining them makes the RAG system more complicated and leads to higher response latency. To address these challenges, we propose Unified Active Retrieval (UAR). UAR contains four orthogonal criteria and casts them into plug-and-play classification tasks, which achieves multifaceted retrieval timing judgements with negligible extra inference cost. We further introduce the Unified Active Retrieval Criteria (UAR-Criteria), designed to process diverse active retrieval scenarios through a standardized procedure. Experiments on four representative types of user instructions show that UAR significantly outperforms existing work on the retrieval timing judgement and the performance of downstream tasks, which shows the effectiveness of UAR and its helpfulness to downstream tasks.
pdf
bib
abs
Reasoning in Flux: Enhancing Large Language Models Reasoning through Uncertainty-aware Adaptive Guidance
Zhangyue Yin
|
Qiushi Sun
|
Qipeng Guo
|
Zhiyuan Zeng
|
Xiaonan Li
|
Junqi Dai
|
Qinyuan Cheng
|
Xuanjing Huang
|
Xipeng Qiu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Machine reasoning, which involves solving complex problems through step-by-step deduction and analysis, is a crucial indicator of the capabilities of Large Language Models (LLMs). However, as the complexity of tasks escalates, LLMs often encounter increasing errors in their multi-step reasoning process. This study delves into the underlying factors contributing to these reasoning errors and seeks to leverage uncertainty to refine them. Specifically, we introduce Uncertainty-aware Adaptive Guidance (UAG), a novel approach for guiding LLM reasoning onto an accurate and reliable trajectory. UAG first identifies and evaluates uncertainty signals within each step of the reasoning chain. Upon detecting a significant increase in uncertainty, UAG intervenes by retracting to a previously reliable state and then introduces certified reasoning clues for refinement. By dynamically adjusting the reasoning process, UAG offers a plug-and-play solution for improving LLMs’ performance in complex reasoning. Extensive experiments across various reasoning tasks demonstrate that UAG not only enhances the reasoning abilities of LLMs but also consistently outperforms several strong baselines with minimal computational overhead. Further analysis reveals that UAG is notably effective in identifying and diminishing reasoning errors.
pdf
bib
abs
Aggregation of Reasoning: A Hierarchical Framework for Enhancing Answer Selection in Large Language Models
Zhangyue Yin
|
Qiushi Sun
|
Qipeng Guo
|
Zhiyuan Zeng
|
Xiaonan Li
|
Tianxiang Sun
|
Cheng Chang
|
Qinyuan Cheng
|
Ding Wang
|
Xiaofeng Mou
|
Xipeng Qiu
|
Xuanjing Huang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Recent advancements in Chain-of-Thought prompting have facilitated significant breakthroughs for Large Language Models (LLMs) in complex reasoning tasks. Current research enhances the reasoning performance of LLMs by sampling multiple reasoning chains and ensembling based on the answer frequency. However, this approach fails in scenarios where the correct answers are in the minority. We identify this as a primary factor constraining the reasoning capabilities of LLMs, a limitation that cannot be resolved solely based on the predicted answers. To address this shortcoming, we introduce a hierarchical reasoning aggregation framework AoR (Aggregation of Reasoning), which selects answers based on the evaluation of reasoning chains. Additionally, AoR incorporates dynamic sampling, adjusting the number of reasoning chains in accordance with the complexity of the task. Experimental results on a series of complex reasoning tasks show that AoR outperforms prominent ensemble methods. Further analysis reveals that AoR not only adapts various LLMs but also achieves a superior performance ceiling when compared to current methods.
2023
pdf
bib
abs
Improving Contrastive Learning of Sentence Embeddings from AI Feedback
Qinyuan Cheng
|
Xiaogui Yang
|
Tianxiang Sun
|
Linyang Li
|
Xipeng Qiu
Findings of the Association for Computational Linguistics: ACL 2023
Contrastive learning has become a popular approach in natural language processing, particularly for the learning of sentence embeddings.However, the discrete nature of natural language makes it difficult to ensure the quality of positive and negative sample pairs generated through data augmentation methods. Although supervised contrastive learning can produce more accurate sample pairs with human feedback labels, it still lacks fine-grained training signals. In this paper, we propose to improve Contrastive Learning of sentence embeddings from AI Feedback (CLAIF).Our method utilizes AI feedback from large pre-trained language models (LLMs) to construct sample pairs with fine-grained sample similarity scores to improve contrastive learning. Besides, we combine human feedback and AI feedback to provide better supervision signals for supervised contrastive learning of sentence embeddings.Experimental results show that our method achieves state-of-the-art performance on several semantic textual similarity (STS) and transfer learning tasks compared to other unsupervised and supervised contrastive learning methods.
2022
pdf
bib
abs
Is MultiWOZ a Solved Task? An Interactive TOD Evaluation Framework with User Simulator
Qinyuan Cheng
|
Linyang Li
|
Guofeng Quan
|
Feng Gao
|
Xiaofeng Mou
|
Xipeng Qiu
Findings of the Association for Computational Linguistics: EMNLP 2022
Task-Oriented Dialogue (TOD) systems are drawing more and more attention in recent studies.Current methods focus on constructing pre-trained models or fine-tuning strategies while the evaluation of TOD is limited by a policy mismatch problem.That is, during evaluation, the user utterances are from the annotated dataset while these utterances should interact with previous responses which can have many alternatives besides annotated texts.Therefore, in this work, we propose an interactive evaluation framework for TOD. We first build a goal-oriented user simulator based on pre-trained models and then use the user simulator to interact with the dialogue system to generate dialogues.Besides, we introduce a sentence-level and a session-level score to measure the sentence fluency and session coherence in the interactive evaluation. Experimental results show that RL-based TOD systems trained by our proposed user simulator can achieve nearly 98% inform and success rates in the interactive evaluation of MultiWOZ dataset and the proposed scores measure the response quality besides the inform and success rates.We are hoping that our work will encourage simulator-based interactive evaluations in the TOD task.