Multi-lingual and Cross-genre Discourse Unit Segmentation

Peter Bourgonje, Robin Schäfer


Abstract
We describe a series of experiments applied to data sets from different languages and genres annotated for coherence relations according to different theoretical frameworks. Specifically, we investigate the feasibility of a unified (theory-neutral) approach toward discourse segmentation, a process which divides a text into minimal discourse units that are involved in a coherence relation. We apply a RandomForest and an LSTM-based approach to all data sets, and we improve over a simple baseline that assumes sentence or clause-like segmentation. Performance, however, varies considerably depending on language and, more importantly, genre, with f-scores ranging from 73.00 to 94.47.
Anthology ID: W19-2714
Volume: Proceedings of the Workshop on Discourse Relation Parsing and Treebanking 2019
Month: June
Year: 2019
Address: Minneapolis, MN
Editors: Amir Zeldes, Debopam Das, Erick Maziero Galani, Juliano Desiderato Antonio, Mikel Iruskieta
Venue: NAACL
Publisher: Association for Computational Linguistics
Pages: 105–114
URL: https://aclanthology.org/W19-2714
DOI: 10.18653/v1/W19-2714
Cite (ACL): Peter Bourgonje and Robin Schäfer. 2019. Multi-lingual and Cross-genre Discourse Unit Segmentation. In Proceedings of the Workshop on Discourse Relation Parsing and Treebanking 2019, pages 105–114, Minneapolis, MN. Association for Computational Linguistics.
Cite (Informal): Multi-lingual and Cross-genre Discourse Unit Segmentation (Bourgonje & Schäfer, NAACL 2019)
PDF: https://aclanthology.org/W19-2714.pdf
Data: DISRPT2019