2024
pdf
bib
abs
MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues
Ge Bai
|
Jie Liu
|
Xingyuan Bu
|
Yancheng He
|
Jiaheng Liu
|
Zhanhui Zhou
|
Zhuoran Lin
|
Wenbo Su
|
Tiezheng Ge
|
Bo Zheng
|
Wanli Ouyang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The advent of Large Language Models (LLMs) has drastically enhanced dialogue systems. However, comprehensively evaluating the dialogue abilities of LLMs remains a challenge. Previous benchmarks have primarily focused on single-turn dialogues or provided coarse-grained and incomplete assessments of multi-turn dialogues, overlooking the complexity and fine-grained nuances of real-life dialogues. To address this issue, we introduce MT-Bench-101, specifically designed to evaluate the fine-grained abilities of LLMs in multi-turn dialogues. By conducting a detailed analysis of real multi-turn dialogue data, we construct a three-tier hierarchical ability taxonomy comprising 4208 turns across 1388 multi-turn dialogues in 13 distinct tasks. We then evaluate 21 popular LLMs based on MT-Bench-101, conducting comprehensive analyses from both ability and task perspectives and observing differing trends in LLMs performance across dialogue turns within various tasks. Further analysis indicates that neither utilizing common alignment techniques nor chat-specific designs has led to obvious enhancements in the multi-turn abilities of LLMs. Extensive case studies suggest that our designed tasks accurately assess the corresponding multi-turn abilities. The data and code are available at https://github.com/mtbench101/mt-bench-101.
pdf
bib
abs
E2-LLM: Efficient and Extreme Length Extension of Large Language Models
Jiaheng Liu
|
ZhiqiBai ZhiqiBai
|
Yuanxing Zhang
|
Chenchen Zhang
|
YuangZh YuangZh
|
Ge Zhang
|
JiakaiWang JiakaiWang
|
Haoran Que
|
Yukang Chen
|
Wenbo Su
|
Tiezheng Ge
|
Jie Fu
|
Wenhu Chen
|
Bo Zheng
Findings of the Association for Computational Linguistics: ACL 2024
Training Large Language Models (LLMs) to process extensive context lengths incurs prohibitive computational costs. Prevailing techniques for extending context capabilities in LLMs typically require not only additional training procedures but also access to datasets with long context (e.g., sequences of 32K tokens), presupposing substantial GPU expenditures. To address the aforementioned issues, we introduce a novel solution named Efficient and Extreme length extension for Large Language Models (E2-LLM). E2-LLM entails a singular training process over considerably short sequences (e.g., 4K tokens), which greatly mitigates the cost of continual-pretraining or fine-tuning. Within the training phase, we incorporate a dual augmentation strategy with Rotary Position Embeddings (RoPE) that adjusts the scale and position indices across distinct training samples. E 2 -LLM is meticulously designed to enhance the model’s robustness to diverse relative positions. The experimental results on multiple benchmark datasets demonstrate the superior performance of E 2 -LLM on demanding tasks of processing long contexts.
pdf
bib
abs
ConceptMath: A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large Language Models
Yanan Wu
|
Jie Liu
|
Xingyuan Bu
|
Jiaheng Liu
|
Zhanhui Zhou
|
Yuanxing Zhang
|
Chenchen Zhang
|
ZhiqiBai ZhiqiBai
|
Haibin Chen
|
Tiezheng Ge
|
Wanli Ouyang
|
Wenbo Su
|
Bo Zheng
Findings of the Association for Computational Linguistics: ACL 2024
This paper introduces ConceptMath, a bilingual (English and Chinese), fine-grained benchmark that evaluates concept-wise mathematical reasoning of Large Language Models (LLMs). Unlike traditional benchmarks that evaluate general mathematical reasoning with an average accuracy, ConceptMath systemically organizes math problems under a hierarchy of math concepts, so that mathematical reasoning can be evaluated at different granularity with concept-wise accuracies. Based on our ConcepthMath, we then evaluate a broad range of LLMs, and we observe existing LLMs, though achieving high average accuracies on traditional benchmarks, exhibit significant performance variations across different math concepts and may even fail catastrophically on the most basic ones. Besides, we also introduce an efficient fine-tuning strategy to enhance the weaknesses of existing LLMs. Finally, we hope ConceptMath could guide the developers to understand the fine-grained mathematical abilities of their models and facilitate the growth of foundation models. Code is available at https://github.com/conceptmath/conceptmath.
pdf
bib
abs
GraphReader: Building Graph-based Agent to Enhance Long-Context Abilities of Large Language Models
Shilong Li
|
Yancheng He
|
Hangyu Guo
|
Xingyuan Bu
|
Ge Bai
|
Jie Liu
|
Jiaheng Liu
|
Xingwei Qu
|
Yangguang Li
|
Wanli Ouyang
|
Wenbo Su
|
Bo Zheng
Findings of the Association for Computational Linguistics: EMNLP 2024
Long-context capabilities are essential for large language models (LLMs) to tackle complex and long-input tasks. Despite numerous efforts made to optimize LLMs for long contexts, challenges persist in robustly processing long inputs. In this paper, we introduce GraphReader, a graph-based agent system designed to handle long texts by structuring them into a graph and employing an agent to explore this graph autonomously. Upon receiving a question, the agent first undertakes a step-by-step analysis and devises a rational plan. It then invokes a set of predefined functions to read node content and neighbors, facilitating a coarse-to-fine exploration of the graph. Throughout the exploration, the agent continuously records new insights and reflects on current circumstances to optimize the process until it has gathered sufficient information to generate an answer. Experimental results on the LV-Eval dataset reveal that GraphReader using a 4k context window, consistently outperforms GPT-4-128k across context lengths from 16k to 256k by a large margin. Additionally, our approach demonstrates superior performance on four challenging single-hop and multi-hop benchmarks.