DiscoGeM: A Crowdsourced Corpus of Genre-Mixed Implicit Discourse Relations

Merel Scholman, Tianai Dong, Frances Yung, Vera Demberg


Abstract
We present DiscoGeM, a crowdsourced corpus of 6,505 implicit discourse relations from three genres: political speech, literature, and encyclopedic texts. Each instance was annotated by 10 crowd workers. We explored several label aggregation methods to determine how best to obtain a label that captures the meaning inferred by the crowd annotators. The results show that a significant proportion of discourse relations in DiscoGeM are ambiguous and can express multiple relation senses. Probability distribution labels capture these interpretations better than single labels. Further, the results emphasize that text genre crucially affects the distribution of discourse relations, suggesting that genre should be included as a factor in automatic relation classification. We make available the newly created DiscoGeM corpus, as well as the dataset with all annotator-level labels. Both the corpus and the dataset can support a range of applications and research purposes, for example serving as training data to improve the performance of automatic discourse relation parsers and facilitating research into non-connective signals of discourse relations.
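The contrast the abstract draws between single labels and probability distribution labels can be illustrated with a minimal sketch. It is not the authors' aggregation procedure: the function name `aggregate_labels`, the alphabetical tie-breaking rule, and the relation sense names in the example are all assumptions made for illustration.

```python
from collections import Counter


def aggregate_labels(annotations):
    """Aggregate one instance's crowd labels (hypothetical helper).

    Returns a (majority_label, distribution) pair, where distribution maps
    each observed label to its share of the annotations.
    """
    counts = Counter(annotations)
    total = sum(counts.values())
    # Probability distribution label: relative frequency of each sense.
    distribution = {label: n / total for label, n in counts.items()}
    # Single label: the most frequent sense; ties are broken alphabetically
    # (an assumption, chosen only to make the result deterministic).
    majority_label = min(counts, key=lambda label: (-counts[label], label))
    return majority_label, distribution


# Hypothetical annotations from 10 crowd workers for one instance.
example = ["result"] * 5 + ["conjunction"] * 4 + ["contrast"]
majority, dist = aggregate_labels(example)
```

In this made-up case the single label keeps only the most frequent sense, while the distribution also records that a sizeable share of workers chose a second sense, which is the kind of ambiguity the abstract describes.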
Anthology ID:
2022.lrec-1.351
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
3281–3290
URL:
https://aclanthology.org/2022.lrec-1.351
Cite (ACL):
Merel Scholman, Tianai Dong, Frances Yung, and Vera Demberg. 2022. DiscoGeM: A Crowdsourced Corpus of Genre-Mixed Implicit Discourse Relations. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 3281–3290, Marseille, France. European Language Resources Association.
Cite (Informal):
DiscoGeM: A Crowdsourced Corpus of Genre-Mixed Implicit Discourse Relations (Scholman et al., LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.351.pdf
Code:
merelscholman/discogem