A Parallel Corpus and Dictionary for Amis-Mandarin Translation

Francis Zheng, Edison Marrese-Taylor, Yutaka Matsuo


Abstract
Amis is an endangered language indigenous to Taiwan with limited data available for computational processing. We thus present an Amis-Mandarin dataset containing a parallel corpus of 5,751 Amis and Mandarin sentences and a dictionary of 7,800 Amis words and phrases with their definitions in Mandarin. Using our dataset, we also establish a baseline for machine translation between Amis and Mandarin in both directions. Our dataset can be found at https://github.com/francisdzheng/amis-mandarin.
Anthology ID:
2022.nlp4dh-1.11
Volume:
Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities
Month:
November
Year:
2022
Address:
Taipei, Taiwan
Editors:
Mika Hämäläinen, Khalid Alnajjar, Niko Partanen, Jack Rueter
Venue:
NLP4DH
Publisher:
Association for Computational Linguistics
Pages:
79–84
URL:
https://aclanthology.org/2022.nlp4dh-1.11
Cite (ACL):
Francis Zheng, Edison Marrese-Taylor, and Yutaka Matsuo. 2022. A Parallel Corpus and Dictionary for Amis-Mandarin Translation. In Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities, pages 79–84, Taipei, Taiwan. Association for Computational Linguistics.
Cite (Informal):
A Parallel Corpus and Dictionary for Amis-Mandarin Translation (Zheng et al., NLP4DH 2022)
PDF:
https://aclanthology.org/2022.nlp4dh-1.11.pdf