On the Development of Interlinearized Ancient Literature of Ethnic Minorities: A Case Study of the Interlinearization of Ancient Written Tibetan Literature

Congjun Long, Bo An


Abstract
Ancient ethnic documents are essential to China’s ancient literature and an indispensable civilizational achievement of Chinese culture. However, few research teams are involved due to language and script literacy limitations. To address these issues, this paper proposes an interlinearized annotation strategy for ancient ethnic literature. This strategy aims to alleviate text literacy difficulties, encourage interdisciplinary researchers to participate in studying ancient ethnic literature, and improve the efficiency of ancient ethnic literature development. Concretely, the interlinearized annotation consists of original, word segmentation, Latin, annotated, and translation lines. In this paper, we take ancient Tibetan literature as an example to explore the interlinearized annotation strategy. However, manually building large-scale corpus is challenging. To build a large-scale interlinearized dataset, we propose a multi-task learning-based interlinearized annotation method, which can generate interlinearized annotation lines based on the original line. Experimental results show that after training on about 10,000 sentences (lines) of data, our model achieves 70.9% and 63.2% F1 values on the segmentation lines and annotated lines, respectively, and 18.7% BLEU on the translation lines. It dramatically enhances the efficiency of data annotation, effectively speeds up interlinearized annotation, and reduces the workload of manual annotation.
Anthology ID:
2023.alp-1.25
Volume:
Proceedings of the Ancient Language Processing Workshop
Month:
September
Year:
2023
Address:
Varna, Bulgaria
Editors:
Adam Anderson, Shai Gordin, Bin Li, Yudong Liu, Marco C. Passarotti
Venues:
ALP | WS
SIG:
Publisher:
INCOMA Ltd., Shoumen, Bulgaria
Note:
Pages:
222–231
Language:
URL:
https://aclanthology.org/2023.alp-1.25
DOI:
Bibkey:
Cite (ACL):
Congjun Long and Bo An. 2023. On the Development of Interlinearized Ancient Literature of Ethnic Minorities: A Case Study of the Interlinearization of Ancient Written Tibetan Literature. In Proceedings of the Ancient Language Processing Workshop, pages 222–231, Varna, Bulgaria. INCOMA Ltd., Shoumen, Bulgaria.
Cite (Informal):
On the Development of Interlinearized Ancient Literature of Ethnic Minorities: A Case Study of the Interlinearization of Ancient Written Tibetan Literature (Long & An, ALP-WS 2023)
Copy Citation:
PDF:
https://aclanthology.org/2023.alp-1.25.pdf