Jenyuan Wang
2024
RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering
Rujun Han | Yuhao Zhang | Peng Qi | Yumo Xu | Jenyuan Wang | Lan Liu | William Yang Wang | Bonan Min | Vittorio Castelli
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Question answering based on retrieval augmented generation (RAG-QA) is an important research topic in NLP and has a wide range of real-world applications. However, most existing datasets for this task are either constructed using a single source corpus or consist of short extractive answers, which fall short of evaluating large language model (LLM) based RAG-QA systems on cross-domain generalization. To address these limitations, we create Long-form RobustQA (LFRQA), a new dataset comprising human-written long-form answers that integrate short extractive answers from multiple documents into a single, coherent narrative, covering 26K queries and large corpora across seven different domains. We further propose RAG-QA Arena by directly comparing model-generated answers against LFRQA’s answers using LLMs as evaluators. We show via extensive experiments that RAG-QA Arena and human judgments on answer quality are highly correlated. Moreover, only 41.3% of the most competitive LLM’s answers are preferred to LFRQA’s answers, demonstrating RAG-QA Arena as a challenging evaluation platform for future research.
Fine-Tuned Machine Translation Metrics Struggle in Unseen Domains
Vilém Zouhar | Shuoyang Ding | Anna Currey | Tatyana Badeka | Jenyuan Wang | Brian Thompson
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
We introduce a new, extensive multidimensional quality metrics (MQM) annotated dataset covering 11 language pairs in the biomedical domain. We use this dataset to investigate whether machine translation (MT) metrics which are fine-tuned on human-generated MT quality judgements are robust to domain shifts between training and inference. We find that fine-tuned metrics exhibit a substantial performance drop in the unseen domain scenario relative to both metrics that rely on the surface form and pre-trained metrics that are not fine-tuned on MT quality judgments.