Ruoxi Ning
2024
TAIL: A Toolkit for Automatic and Realistic Long-Context Large Language Model Evaluation
Gefei Gu | Yilun Zhao | Ruoxi Ning | Yanan Zheng | Arman Cohan
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
As long-context large language models (LLMs) are attracting increasing attention for their ability to handle context windows exceeding 128k tokens, the need for effective evaluation methods for these models becomes critical. Existing evaluation methods, however, fall short: needle-in-a-haystack (NIAH) and its variants are overly simplistic, while creating realistic benchmarks is prohibitively expensive due to extensive human annotation requirements. To bridge this gap, we propose TAIL, an automatic toolkit for creating realistic evaluation benchmarks and assessing the performance of long-context LLMs. With TAIL, users can customize the building of a long-context, document-grounded QA benchmark and obtain visualized performance metrics of evaluated models. TAIL has the advantage of requiring minimal human annotation and generating natural questions based on user-provided long-context documents. We apply TAIL to construct a benchmark encompassing multiple expert domains, such as finance, law, patents, and scientific literature. We then evaluate four state-of-the-art long-context LLMs using this benchmark. Results show that all LLMs experience varying degrees of performance degradation as context lengths increase.
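The workflow the abstract describes (generate document-grounded questions from user-provided long documents, then score models and report metrics) can be sketched roughly as below. This is a minimal illustration under stated assumptions only: the names QAExample, build_benchmark, and evaluate are hypothetical and do not reflect TAIL's actual interface, and the question generator and evaluated model are dummy stand-ins.

# Hypothetical sketch of a TAIL-style benchmark-and-evaluate loop.
# All names here are illustrative assumptions, not TAIL's actual API.
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class QAExample:
    question: str
    answer: str
    context: str          # long, document-grounded context
    depth_percent: float  # where the supporting evidence would sit (recorded, not acted on here)


def build_benchmark(documents: List[str],
                    generate_qa: Callable[[str], Tuple[str, str]],
                    depths=(0.1, 0.5, 0.9)) -> List[QAExample]:
    """Turn user-provided long documents into document-grounded QA examples."""
    examples = []
    for doc in documents:
        question, answer = generate_qa(doc)  # e.g., an LLM writes a natural question
        for d in depths:
            examples.append(QAExample(question, answer, doc, d))
    return examples


def evaluate(answer_fn: Callable[[str, str], str],
             benchmark: List[QAExample]) -> float:
    """Score a model callable by exact-match accuracy over the benchmark."""
    correct = sum(
        answer_fn(ex.context, ex.question).strip().lower() == ex.answer.strip().lower()
        for ex in benchmark
    )
    return correct / len(benchmark)


# Usage with dummy stand-ins for the question generator and the evaluated model:
docs = ["A long financial report ... The reported 2023 net income was 4.2M USD ..."]
bench = build_benchmark(docs, generate_qa=lambda d: ("What was the 2023 net income?", "4.2M USD"))
print(f"accuracy: {evaluate(lambda ctx, q: '4.2M USD', bench):.2f}")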
2022
Challenges to Open-Domain Constituency Parsing
Sen Yang | Leyang Cui | Ruoxi Ning | Di Wu | Yue Zhang
Findings of the Association for Computational Linguistics: ACL 2022
Neural constituency parsers have reached practical performance on news-domain benchmarks. However, their generalization ability to other domains remains weak. Existing findings on cross-domain constituency parsing are only made on a limited number of domains. To address this, we manually annotate a high-quality constituency treebank containing five domains. We analyze challenges to open-domain constituency parsing using a set of linguistic features on various strong constituency parsers. Primarily, we find that 1) BERT significantly increases parsers' cross-domain performance by reducing their sensitivity to domain-variant features. 2) Compared with single metrics such as unigram distribution and OOV rate, challenges to open-domain constituency parsing arise from complex features, including cross-domain lexical and constituent structure variations.