Content-based Models of Quotation

Ansel MacLaughlin, David Smith


Abstract
We explore the task of quotability identification, in which, given a document, we aim to identify which of its passages are the most quotable, i.e., the most likely to be directly quoted by later derived documents. We approach quotability identification as a passage ranking problem and evaluate how well both feature-based and BERT-based (Devlin et al., 2019) models rank the passages in a given document by their predicted quotability. We explore this problem through evaluations on five datasets that span multiple languages (English, Latin) and genres of literature (e.g., poetry, plays, novels) and whose corresponding derived documents are of multiple types (news, journal articles). Our experiments confirm the relatively strong performance of BERT-based models on this task, with the best model, a RoBERTa sequential sentence tagger, achieving an average rho of 0.35 and NDCG@1, 5, 50 of 0.26, 0.31 and 0.40, respectively, across all five datasets.
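The ranking metrics named in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code. It assumes that "rho" is Spearman's rank correlation, that each passage's gain is the number of times it was quoted, and that NDCG uses a linear gain with a log2 discount (other variants exist). The function names and the toy data are hypothetical.

```python
import math


def ndcg_at_k(scores, gains, k):
    """NDCG@k with linear gain and log2 discount (one common variant; an assumption here)."""
    # Rank passages by predicted quotability, highest score first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    dcg = sum(gains[i] / math.log2(pos + 2) for pos, i in enumerate(order[:k]))
    # Ideal DCG: the same sum with passages sorted by their true gains.
    ideal = sorted(gains, reverse=True)[:k]
    idcg = sum(g / math.log2(pos + 2) for pos, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho, computed as the Pearson correlation of the average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


# Hypothetical toy document with four passages.
predicted = [0.2, 0.9, 0.4, 0.1]   # model scores (made up)
quote_counts = [0, 5, 1, 0]        # times each passage was quoted (made up)
print(ndcg_at_k(predicted, quote_counts, k=1))
print(spearman_rho(predicted, quote_counts))
```

Under these assumptions, each document would be scored separately and the per-document values averaged to give dataset-level figures.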
Anthology ID:
2021.eacl-main.195
Volume:
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume
Month:
April
Year:
2021
Address:
Online
Editors:
Paola Merlo, Jörg Tiedemann, Reut Tsarfaty
Venue:
EACL
Publisher:
Association for Computational Linguistics
Pages:
2296–2314
URL:
https://aclanthology.org/2021.eacl-main.195
DOI:
10.18653/v1/2021.eacl-main.195
Cite (ACL):
Ansel MacLaughlin and David Smith. 2021. Content-based Models of Quotation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2296–2314, Online. Association for Computational Linguistics.
Cite (Informal):
Content-based Models of Quotation (MacLaughlin & Smith, EACL 2021)
PDF:
https://aclanthology.org/2021.eacl-main.195.pdf