Converting the Sinica Treebank of Mandarin Chinese to Universal Dependencies

Yu-Ming Hsieh, Yueh-Yin Shih, Wei-Yun Ma


Abstract
This paper describes the conversion of the Sinica Treebank, one of the major Mandarin Chinese treebanks, to Universal Dependencies. The conversion is rule-based and involves POS tag mapping, head adjustment in line with the UD scheme, and dependency conversion. Linguistic insights into Mandarin Chinese gained during the conversion are also discussed. The resulting corpus is the UD Chinese Sinica Treebank, which contains more than fifty thousand trees annotated according to the UD scheme. The dataset can be downloaded at https://github.com/ckiplab/ud.
Anthology ID:
2022.law-1.4
Volume:
Proceedings of the 16th Linguistic Annotation Workshop (LAW-XVI) within LREC2022
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Sameer Pradhan, Sandra Kuebler
Venue:
LAW
SIG:
SIGANN
Publisher:
European Language Resources Association
Pages:
23–30
URL:
https://aclanthology.org/2022.law-1.4
Cite (ACL):
Yu-Ming Hsieh, Yueh-Yin Shih, and Wei-Yun Ma. 2022. Converting the Sinica Treebank of Mandarin Chinese to Universal Dependencies. In Proceedings of the 16th Linguistic Annotation Workshop (LAW-XVI) within LREC2022, pages 23–30, Marseille, France. European Language Resources Association.
Cite (Informal):
Converting the Sinica Treebank of Mandarin Chinese to Universal Dependencies (Hsieh et al., LAW 2022)
PDF:
https://aclanthology.org/2022.law-1.4.pdf
Code
ckiplab/ud
Data
Universal Dependencies