2023
WikiHowQA: A Comprehensive Benchmark for Multi-Document Non-Factoid Question Answering
Valeriia Bolotova-Baranova | Vladislav Blinov | Sofya Filippova | Falk Scholer | Mark Sanderson
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Answering non-factoid questions (NFQA) is a challenging task, requiring passage-level answers that are difficult to construct and evaluate. Search engines may provide a summary of a single web page, but many questions require reasoning across multiple documents. Meanwhile, modern models can generate highly coherent and fluent, but often factually incorrect answers that can deceive even non-expert humans. There is a critical need for high-quality resources for multi-document NFQA (MD-NFQA) to train new models and evaluate answers’ grounding and factual consistency in relation to supporting documents. To address this gap, we introduce WikiHowQA, a new multi-document NFQA benchmark built on WikiHow, a website dedicated to answering “how-to” questions. The benchmark includes 11,746 human-written answers along with 74,527 supporting documents. We describe the unique challenges of the resource, provide strong baselines, and propose a novel human evaluation framework that utilizes highlighted relevant supporting passages to mitigate issues such as assessor unfamiliarity with the question topic. All code and data, including the automatic code for preparing the human evaluation, are publicly available.
2008
From Research to Application in Multilingual Information Access: the Contribution of Evaluation
Carol Peters | Martin Braschler | Giorgio Di Nunzio | Nicola Ferro | Julio Gonzalo | Mark Sanderson
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
The importance of evaluation in promoting research and development in the information retrieval and natural language processing domains has long been recognised, but is this sufficient? In many areas there is still a considerable gap between the results achieved by the research community and their implementation in commercial applications. This is particularly true for the cross-language or multilingual retrieval areas. Despite the strong demand for and interest in multilingual IR functionality, there are still very few operational systems on offer. The Cross Language Evaluation Forum (CLEF) is now taking steps aimed at changing this situation. The paper provides a critical assessment of the main results achieved by CLEF so far and discusses plans now underway to extend its activities in order to have a more direct impact on the application sector.
An Evaluation Resource for Geographic Information Retrieval
Thomas Mandl | Fredric Gey | Giorgio Di Nunzio | Nicola Ferro | Mark Sanderson | Diana Santos | Christa Womser-Hacker
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
In this paper we present an evaluation resource for geographic information retrieval developed within the Cross Language Evaluation Forum (CLEF). The GeoCLEF track is dedicated to the evaluation of geographic information retrieval systems. The resource encompasses more than 600,000 documents, 75 topics so far, and more than 100,000 relevance judgments for these topics. Geographic information retrieval requires an evaluation resource which represents realistic information needs and which is geographically challenging. Some experimental results and analysis are reported.
2006
Proceedings of the Human Language Technology Conference of the NAACL, Main Conference
Robert C. Moore | Jeff Bilmes | Jennifer Chu-Carroll | Mark Sanderson
Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers
Robert C. Moore | Jeff Bilmes | Jennifer Chu-Carroll | Mark Sanderson
The Effect of Machine Translation on the Performance of Arabic-English QA System
Azzah Al-Maskari | Mark Sanderson
Proceedings of the Workshop on Multilingual Question Answering - MLQA ’06
2001
Large scale testing of a descriptive phrase finder
Hideo Joho | Ying Ki Liu | Mark Sanderson
Proceedings of the First International Conference on Human Language Technology Research
2000
Book Reviews: Advances in Automatic Text Summarization
Mark Sanderson
Computational Linguistics, Volume 26, Number 2, June 2000