The Project Dialogism Novel Corpus: A Dataset for Quotation Attribution in Literary Texts

Krishnapriya Vishnubhotla, Adam Hammond, Graeme Hirst


Abstract
We present the Project Dialogism Novel Corpus, or PDNC, an annotated dataset of quotations for English literary texts. PDNC contains annotations for 35,978 quotations across 22 full-length novels, and is by an order of magnitude the largest corpus of its kind. Each quotation is annotated for the speaker, addressees, type of quotation, referring expression, and character mentions within the quotation text. The annotated attributes allow for a comprehensive evaluation of models of quotation attribution and coreference for literary texts.
Anthology ID:
2022.lrec-1.628
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
5838–5848
URL:
https://aclanthology.org/2022.lrec-1.628
Cite (ACL):
Krishnapriya Vishnubhotla, Adam Hammond, and Graeme Hirst. 2022. The Project Dialogism Novel Corpus: A Dataset for Quotation Attribution in Literary Texts. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 5838–5848, Marseille, France. European Language Resources Association.
Cite (Informal):
The Project Dialogism Novel Corpus: A Dataset for Quotation Attribution in Literary Texts (Vishnubhotla et al., LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.628.pdf
Code
priya22/pdnc-lrec2022
Data
PDNC