Qiyuan Zhang
2024
Collaborative Performance Prediction for Large Language Models
Qiyuan Zhang | Fuyuan Lyu | Xue Liu | Chen Ma
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Comprehensively understanding and accurately predicting the performance of large language models across diverse downstream tasks has emerged as a pivotal challenge in NLP research. Pioneering scaling-law work on downstream tasks demonstrated intrinsic similarities within model families and utilized such similarities for performance prediction. However, these approaches tend to overlook the similarities between model families and consider only the design factors listed in the original scaling law. To overcome these limitations, we introduce a novel framework, Collaborative Performance Prediction (CPP), which significantly enhances prediction accuracy by leveraging the historical performance of various models on downstream tasks, along with other design factors for both models and tasks. We also collect a collaborative dataset sourced from online platforms, containing both historical performance records and additional design factors. With the support of this collaborative data, CPP not only surpasses traditional scaling laws in predicting the performance of scaled LLMs but also enables a detailed analysis of factor importance, an area previously overlooked.
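The abstract frames performance prediction as a collaborative problem over a model-by-task score matrix. Below is a minimal sketch of that idea using plain matrix factorization; the toy score matrix, latent dimension, and SGD training loop are illustrative assumptions, not the paper's actual CPP implementation.

```python
# Minimal collaborative-filtering sketch for model-x-task score prediction.
# Assumptions (not from the paper): plain matrix factorization trained by
# SGD on observed entries only; the score matrix, latent dimension, and
# hyperparameters are toy values for illustration.
import numpy as np

rng = np.random.default_rng(0)

# Rows = models, columns = tasks; NaN marks unobserved scores to predict.
scores = np.array([
    [0.62, 0.48, np.nan, 0.71],
    [0.55, np.nan, 0.33, 0.64],
    [np.nan, 0.52, 0.41, 0.69],
])
observed = ~np.isnan(scores)

k = 2  # latent factor dimension (illustrative)
P = 0.1 * rng.standard_normal((scores.shape[0], k))  # model factors
Q = 0.1 * rng.standard_normal((scores.shape[1], k))  # task factors

lr, reg = 0.05, 0.01
for _ in range(2000):
    for i, j in zip(*np.nonzero(observed)):
        err = scores[i, j] - P[i] @ Q[j]
        P[i], Q[j] = (P[i] + lr * (err * Q[j] - reg * P[i]),
                      Q[j] + lr * (err * P[i] - reg * Q[j]))

# Predicted score for model 0 on the task it was never evaluated on.
print(f"predicted: {P[0] @ Q[2]:.3f}")
```

CPP additionally incorporates design factors of models and tasks (e.g., as side features), which a bare factorization like this omits.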
2021
NOAHQA: Numerical Reasoning with Interpretable Graph Question Answering Dataset
Qiyuan Zhang | Lei Wang | Sicheng Yu | Shuohang Wang | Yang Wang | Jing Jiang | Ee-Peng Lim
Findings of the Association for Computational Linguistics: EMNLP 2021
While diverse question answering (QA) datasets have been proposed and have contributed significantly to the development of deep learning models for QA tasks, existing datasets fall short in two aspects. First, we lack QA datasets covering complex questions that require not only answers but also the reasoning processes used to reach them. As a result, state-of-the-art QA research on numerical reasoning still focuses on simple calculations and does not provide the mathematical expressions or evidence justifying the answers. Second, although the QA community has devoted considerable effort to improving the interpretability of QA models, these models fail to explicitly show the reasoning process, such as the order in which evidence is used and the interactions between different pieces of evidence. To address these shortcomings, we introduce NOAHQA, a conversational and bilingual QA dataset with questions requiring numerical reasoning over compound mathematical expressions. With NOAHQA, we develop an interpretable reasoning graph as well as an appropriate evaluation metric to measure answer quality. We evaluate state-of-the-art QA models trained on existing QA datasets against NOAHQA and show that the best of them achieves only a 55.5 exact match score, while human performance is 89.7. We also present a new QA model for generating reasoning graphs, whose reasoning-graph metric still shows a large gap from human performance, e.g., 28 points.
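For context on the scores quoted above, exact match (EM) checks whether a normalized prediction string equals a gold answer. The sketch below follows common SQuAD-style normalization (lowercasing, stripping punctuation and English articles), which is an assumption rather than NOAHQA's exact evaluation protocol.

```python
# Minimal exact-match (EM) sketch. Normalization follows common QA practice
# (SQuAD-style); this is an assumption, not necessarily NOAHQA's protocol.
import re
import string

def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # drop English articles
    return " ".join(text.split())  # collapse whitespace

def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))

preds = ["42 apples", "The answer is 7"]
golds = ["42 apples", "7"]
em = 100.0 * sum(exact_match(p, g) for p, g in zip(preds, golds)) / len(preds)
print(f"EM: {em:.1f}")  # 50.0 on this toy pair; the paper reports 55.5
```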