Xiaohan Xu
2024
Re-Reading Improves Reasoning in Large Language Models
Xiaohan Xu | Chongyang Tao | Tao Shen | Can Xu | Hongbo Xu | Guodong Long | Jian-Guang Lou | Shuai Ma
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
To enhance the reasoning capabilities of off-the-shelf Large Language Models (LLMs), we introduce a simple, yet general and effective prompting method, RE2, i.e., Re-Reading the question as input. Unlike most thought-eliciting prompting methods, such as Chain-of-Thought (CoT), which aim to elicit the reasoning process in the output, RE2 shifts the focus to the input by processing questions twice, thereby enhancing the understanding process. Consequently, RE2 demonstrates strong generality and compatibility with most thought-eliciting prompting methods, including CoT. Crucially, RE2 facilitates a “bidirectional” encoding in unidirectional decoder-only LLMs because the first pass could provide global information for the second pass. We begin with a preliminary empirical study as the foundation of RE2, illustrating its potential to enable “bidirectional” attention mechanisms. We then evaluate RE2 on extensive reasoning benchmarks across 14 datasets, spanning 112 experiments, to validate its effectiveness and generality. Our findings indicate that, with the exception of a few scenarios on vanilla ChatGPT, RE2 consistently enhances the reasoning performance of LLMs through a simple re-reading strategy. Further analyses reveal RE2’s adaptability, showing how it can be effectively integrated with different LLMs, thought-eliciting prompting, and ensemble strategies.
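To make the re-reading idea concrete, the sketch below assembles an RE2-style prompt in which the question appears twice before the answer trigger, optionally followed by a zero-shot Chain-of-Thought cue. The exact template wording ("Read the question again:") and the helper name are illustrative assumptions, not necessarily the paper's verbatim prompt.

```python
def build_re2_prompt(question: str, elicit_cot: bool = True) -> str:
    """Assemble an RE2-style prompt: the question is presented twice so the
    model's second pass can attend to tokens already seen in the first pass.
    The template wording here is an illustrative assumption."""
    prompt = f"Q: {question}\nRead the question again: {question}\n"
    if elicit_cot:
        # RE2 is compatible with thought-eliciting prompts such as zero-shot CoT.
        prompt += "A: Let's think step by step."
    else:
        prompt += "A:"
    return prompt


if __name__ == "__main__":
    q = ("Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. "
         "How many tennis balls does he have now?")
    print(build_re2_prompt(q))
```

Because the method only rewrites the input, the same builder can sit in front of any decoder-only LLM and be combined freely with other prompting or ensembling strategies.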
Leveraging Large Language Models for NLG Evaluation: Advances and Challenges
Zhen Li | Xiaohan Xu | Tao Shen | Can Xu | Jia-Chen Gu | Yuxuan Lai | Chongyang Tao | Shuai Ma
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
In the rapidly evolving domain of Natural Language Generation (NLG) evaluation, introducing Large Language Models (LLMs) has opened new avenues for assessing generated content quality, e.g., coherence, creativity, and context relevance. This paper aims to provide a thorough overview of leveraging LLMs for NLG evaluation, a burgeoning area that lacks a systematic analysis. We propose a coherent taxonomy for organizing existing LLM-based evaluation metrics, offering a structured framework to understand and compare these methods. Our detailed exploration includes critically assessing various LLM-based methodologies, as well as comparing their strengths and limitations in evaluating NLG outputs. By discussing unresolved challenges, including bias, robustness, domain-specificity, and unified evaluation, this paper seeks to offer insights to researchers and advocate for fairer and more advanced NLG evaluation techniques.
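As an illustration of the kind of LLM-based metric the survey organizes, the sketch below builds a simple LLM-as-judge prompt that rates one aspect (e.g., coherence) of a generated text on a 1-5 scale and parses the reply. The rubric wording and the commented-out `call_llm` hook are assumptions for illustration, not a specific metric proposed in the paper.

```python
import re


def build_eval_prompt(source: str, output: str, aspect: str = "coherence") -> str:
    """Construct a minimal LLM-as-judge prompt for a single evaluation aspect.
    The rubric wording is an illustrative assumption."""
    return (
        f"You are evaluating the {aspect} of a generated text.\n"
        f"Source:\n{source}\n\n"
        f"Generated text:\n{output}\n\n"
        f"Rate the {aspect} on a scale of 1 (poor) to 5 (excellent). "
        "Reply with the number only."
    )


def parse_score(reply: str) -> int | None:
    """Extract the first 1-5 rating from the judge model's reply, if any."""
    match = re.search(r"[1-5]", reply)
    return int(match.group()) if match else None


if __name__ == "__main__":
    prompt = build_eval_prompt("Article about climate policy ...", "A short summary ...")
    # `call_llm` is a hypothetical hook for whichever judge model is used:
    # reply = call_llm(prompt)
    print(parse_score("4 - mostly coherent"))  # -> 4
```

Metrics of this form inherit the biases and robustness issues of the judge model itself, which is exactly the class of open challenges the paper discusses.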