Xuan Luo
2025
Counting-Stars: A Multi-evidence, Position-aware, and Scalable Benchmark for Evaluating Long-Context Large Language Models
Mingyang Song | Mao Zheng | Xuan Luo
Proceedings of the 31st International Conference on Computational Linguistics
Despite recent efforts to develop large language models (LLMs) with robust long-context capabilities, the lack of long-context benchmarks means that relatively little is known about their performance. To alleviate this gap, in this paper, we propose Counting-Stars, a multi-evidence, position-aware, and scalable benchmark designed to evaluate the multi-evidence retrieval capabilities of long-context LLMs. Counting-Stars comprises two counting-based multi-evidence retrieval tasks: searching and reasoning. Using Counting-Stars, we conducted experiments to evaluate several long-context LLMs, including GPT-4 Turbo, Gemini 1.5 Pro, Claude 3 Opus, GLM-4, and Moonshot-v1. Extensive experimental results demonstrate that Gemini 1.5 Pro achieves the best overall results, while GPT-4 Turbo exhibits the most stable performance across tasks. Furthermore, our analysis of these LLMs, which have been extended to handle long-context scenarios, indicates that significant room for improvement remains as the length of the input context and the complexity of the tasks increase.
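The abstract does not spell out how a test instance is constructed, so the following is only a minimal sketch of the general idea as I understand it: numbered "evidence" sentences are inserted at controlled positions in a long filler context, and the model is scored on how many of the numbers it retrieves. All function names, the evidence wording, and the scoring metric below are assumptions for illustration, not the official benchmark code.

```python
import json
import random

def build_counting_stars_instance(filler_text: str, num_evidence: int = 8,
                                  context_len_chars: int = 200_000,
                                  seed: int = 0) -> dict:
    """Insert `num_evidence` counting sentences at evenly spaced positions
    in a long filler context; the model must later report every count.
    Illustrative reconstruction only."""
    rng = random.Random(seed)
    counts = [rng.randint(1, 100) for _ in range(num_evidence)]
    evidence = [f"The little penguin counted {c} stars." for c in counts]

    # Tile the filler text up to the target context length.
    haystack = (filler_text * (context_len_chars // max(len(filler_text), 1) + 1))[:context_len_chars]

    # Spread evidence over evenly spaced insertion points (position-aware design).
    step = context_len_chars // (num_evidence + 1)
    pieces, prev = [], 0
    for i, sent in enumerate(evidence, start=1):
        pieces.append(haystack[prev:i * step])
        pieces.append(" " + sent + " ")
        prev = i * step
    pieces.append(haystack[prev:])

    prompt = ("".join(pieces)
              + "\n\nList every number of stars the little penguin counted, "
                "in order of appearance, as a JSON list.")
    return {"prompt": prompt, "answer": counts}

def score(model_output: str, answer: list[int]) -> float:
    """Fraction of evidence counts the model retrieved in order (simple recall)."""
    try:
        predicted = json.loads(model_output)
    except json.JSONDecodeError:
        return 0.0
    return sum(p == a for p, a in zip(predicted, answer)) / len(answer)
```

Because the number of evidence sentences and the context length are both parameters, a harness like this scales the task difficulty along the two axes the abstract highlights: input length and task complexity.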
Can Many-Shot In-Context Learning Help LLMs as Evaluators? A Preliminary Empirical Study
Mingyang Song | Mao Zheng | Xuan Luo
Proceedings of the 31st International Conference on Computational Linguistics
Utilizing Large Language Models (LLMs) as evaluators to assess the performance of other LLMs has garnered attention. However, this evaluation approach is affected by potential biases within LLMs, raising concerns about the accuracy and reliability of their evaluation results. To address this issue, we propose and explore two many-shot In-Context Learning (ICL) prompt templates that help LLM evaluators mitigate potential biases: Many-Shot with Reference (MSwR) and Many-Shot without Reference (MSoR). Specifically, the former uses in-context examples with model-generated rationales as references, while the latter omits these references. Using these prompt designs, we investigate the impact of increasing the number of in-context examples on the consistency and quality of the evaluation results. Experimental results show that advanced LLMs, such as GPT-4, perform better in the many-shot regime than in the zero-shot regime. Furthermore, in most cases, MSwR performs significantly better than MSoR.
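The exact prompt wording is not given in the abstract; the sketch below only illustrates the shape of a Many-Shot with Reference (MSwR) template, in which each in-context example pairs a response with a model-generated rationale and a score. The function name, field names, and rubric text are assumptions, not the paper's template.

```python
def build_mswr_prompt(examples: list[dict], question: str, answer: str) -> str:
    """Assemble a Many-Shot with Reference (MSwR) evaluation prompt.
    Each in-context example carries a model-generated rationale as a reference.
    Illustrative sketch only."""
    header = ("You are an impartial judge. Rate the answer to each question "
              "on a 1-5 scale. Reference rationales are provided.\n\n")
    shots = []
    for ex in examples:  # many-shot regime: len(examples) can be large
        shots.append(
            f"Question: {ex['question']}\n"
            f"Answer: {ex['answer']}\n"
            f"Reference rationale: {ex['rationale']}\n"
            f"Score: {ex['score']}\n"
        )
    target = f"Question: {question}\nAnswer: {answer}\nScore:"
    return header + "\n".join(shots) + "\n" + target

# MSoR would use the same template with the "Reference rationale" line omitted.
```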
2022
Masked Language Models Know Which are Popular: A Simple Ranking Strategy for Commonsense Question Answering
Xuan Luo | Chuang Fan | Yice Zhang | Wanguo Jiang | Bing Qin | Ruifeng Xu
Findings of the Association for Computational Linguistics: EMNLP 2022
We propose a simple ranking strategy to solve a generative commonsense question answering (QA) problem. Compared with multiple-choice QA, it is challenging because the answers to a question are not unique and are expected to be both popular and diverse. Our strategy exploits the dataset itself, together with negative samples collected from WordNet, to train a ranker that picks out the most popular answers to commonsense questions. The effectiveness of our strategy is verified on different pre-trained masked language models (MLMs) in a pipeline framework, where an MLM reranks the generated answers. Further, we explore an end-to-end framework in which MLMs guide the generation of generative language models (GLMs). Taking advantage of reinforcement learning, we apply policy gradient to train a GLM with rewards fed back by an MLM. Empirical results on the ProtoQA dataset demonstrate that MLMs can acquire the ability to distinguish popular answers and also improve the typical-answer generation of GLMs.
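The abstract does not describe the reranking step at the code level. One common way to realize such a pipeline is to score each candidate answer with a masked-LM pseudo-log-likelihood and keep the highest-scoring candidates; the sketch below assumes a Hugging Face BERT model and is not the authors' released implementation (their ranker is trained on the dataset plus WordNet negatives, which this untrained scorer omits).

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def pseudo_log_likelihood(question: str, answer: str) -> float:
    """Score an answer by masking each of its tokens in turn and summing
    the log-probability the MLM assigns to the true token."""
    enc = tokenizer(f"{question} {answer}", return_tensors="pt")
    input_ids = enc["input_ids"][0]
    answer_len = len(tokenizer(answer, add_special_tokens=False)["input_ids"])
    # Answer tokens sit just before the final [SEP] token.
    answer_positions = range(len(input_ids) - 1 - answer_len, len(input_ids) - 1)

    total = 0.0
    for pos in answer_positions:
        masked = input_ids.clone()
        true_id = masked[pos].item()
        masked[pos] = tokenizer.mask_token_id
        logits = model(masked.unsqueeze(0)).logits[0, pos]
        total += torch.log_softmax(logits, dim=-1)[true_id].item()
    return total

def rerank(question: str, candidates: list[str]) -> list[str]:
    """Order generated answers from most to least plausible under the MLM."""
    return sorted(candidates, key=lambda a: pseudo_log_likelihood(question, a),
                  reverse=True)
```

In the end-to-end variant described in the abstract, a score of this kind would instead serve as the reward signal for policy-gradient training of the GLM rather than as a post-hoc reranker.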
Co-authors
- Mingyang Song 2
- Mao Zheng 2
- Chuang Fan 1
- Wanguo Jiang 1
- Bing Qin (秦兵) 1