GCDT: A Chinese RST Treebank for Multigenre and Multilingual Discourse Parsing

Siyao Peng, Yang Janet Liu, Amir Zeldes


Abstract
A lack of large-scale human-annotated data has hampered hierarchical discourse parsing of Chinese. In this paper, we present GCDT, the largest hierarchical discourse treebank for Mandarin Chinese in the framework of Rhetorical Structure Theory (RST). GCDT covers over 60K tokens across five genres of freely available text, using the same relation inventory as contemporary RST treebanks for English. We also report parsing experiments on this dataset, including state-of-the-art (SOTA) scores for Chinese RST parsing and for RST parsing on the English GUM dataset, obtained using cross-lingual training on Chinese and English with multilingual embeddings.
Anthology ID:
2022.aacl-short.47
Volume:
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
Month:
November
Year:
2022
Address:
Online only
Editors:
Yulan He, Heng Ji, Sujian Li, Yang Liu, Chia-Hui Chang
Venues:
AACL | IJCNLP
Publisher:
Association for Computational Linguistics
Pages:
382–391
URL:
https://aclanthology.org/2022.aacl-short.47
Cite (ACL):
Siyao Peng, Yang Janet Liu, and Amir Zeldes. 2022. GCDT: A Chinese RST Treebank for Multigenre and Multilingual Discourse Parsing. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 382–391, Online only. Association for Computational Linguistics.
Cite (Informal):
GCDT: A Chinese RST Treebank for Multigenre and Multilingual Discourse Parsing (Peng et al., AACL-IJCNLP 2022)
PDF:
https://aclanthology.org/2022.aacl-short.47.pdf