Cost-Effective Discourse Annotation in the Prague Czech–English Dependency Treebank

Jiří Mírovský, Pavlína Synková, Lucie Polakova, Marie Paclíková


Abstract
We present a cost-effective method for obtaining high-quality annotation of explicit discourse relations in the Czech part of the Prague Czech–English Dependency Treebank, a corpus of almost 50 thousand sentences taken from the Czech translation of the Wall Street Journal part of the Penn Treebank. We combine three sources of information to obtain the discourse annotation: (i) annotation projection from the Penn Discourse Treebank 3.0, (ii) the manual tectogrammatical (deep syntax) representation of the corpus sentences, and (iii) the Lexicon of Czech Discourse Connectives (CzeDLex). After as many discrepancies as possible are resolved automatically, the final discourse annotation is obtained by manual inspection of the remaining problematic cases. The discourse annotation of the corpus will be available both in the Prague format (on top of tectogrammatical trees) with the Prague taxonomy of discourse types, and in the Penn format (on plain texts) with the Penn Discourse Treebank 3.0 sense taxonomy.
Anthology ID:
2024.lrec-main.362
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
4067–4077
URL:
https://aclanthology.org/2024.lrec-main.362
Cite (ACL):
Jiří Mírovský, Pavlína Synková, Lucie Polakova, and Marie Paclíková. 2024. Cost-Effective Discourse Annotation in the Prague Czech–English Dependency Treebank. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 4067–4077, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Cost-Effective Discourse Annotation in the Prague Czech–English Dependency Treebank (Mírovský et al., LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.362.pdf