GDTB: Genre Diverse Data for English Shallow Discourse Parsing across Modalities, Text Types, and Domains

Yang Janet Liu; Tatsuya Aoyama; Wesley Scivetti; Yilun Zhu; Shabnam Behzad; Lauren Levine; Jessica Lin; Devika Tiwari; Amir Zeldes

doi:10.18653/v1/2024.emnlp-main.684

Abstract

Work on shallow discourse parsing in English has focused on the Wall Street Journal corpus, the only large-scale dataset for the language in the PDTB framework. However, the data is not openly available, is restricted to the news domain, and is by now 35 years old. In this paper, we present and evaluate a new open-access, multi-genre benchmark for PDTB-style shallow discourse parsing, based on the existing UD English GUM corpus, for which discourse relation annotations in other frameworks already exist. In a series of experiments on cross-domain relation classification, we show that while our dataset is compatible with PDTB, substantial out-of-domain degradation is observed, which can be alleviated by joint training on both datasets.

Anthology ID: 2024.emnlp-main.684
Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 12287–12303
URL: https://aclanthology.org/2024.emnlp-main.684/
DOI: 10.18653/v1/2024.emnlp-main.684
Cite (ACL): Yang Janet Liu, Tatsuya Aoyama, Wesley Scivetti, Yilun Zhu, Shabnam Behzad, Lauren Elizabeth Levine, Jessica Lin, Devika Tiwari, and Amir Zeldes. 2024. GDTB: Genre Diverse Data for English Shallow Discourse Parsing across Modalities, Text Types, and Domains. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 12287–12303, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): GDTB: Genre Diverse Data for English Shallow Discourse Parsing across Modalities, Text Types, and Domains (Liu et al., EMNLP 2024)
PDF: https://aclanthology.org/2024.emnlp-main.684.pdf
