Unsupervised Stem-based Cross-lingual Part-of-Speech Tagging for Morphologically Rich Low-Resource Languages

Ramy Eskander; Cass Lowry; Sujay Khandagale; Judith L. Klavans; Maria Polinsky; Smaranda Muresan

doi:10.18653/v1/2022.naacl-main.298

Abstract

Unsupervised cross-lingual projection for part-of-speech (POS) tagging relies on the use of parallel data to project POS tags from a source language for which a POS tagger is available onto a target language across word-level alignments. The projected tags then form the basis for learning a POS model for the target language. However, languages with rich morphology often yield sparse word alignments because words corresponding to the same citation form do not align well. We hypothesize that for morphologically complex languages, it is more efficient to use the stem rather than the word as the core unit of abstraction. Our contributions are: 1) we propose an unsupervised stem-based cross-lingual approach for POS tagging for low-resource languages of rich morphology; 2) we further investigate morpheme-level alignment and projection; and 3) we examine whether the use of linguistic priors for morphological segmentation improves POS tagging. We conduct experiments using six source languages and eight morphologically complex target languages of diverse typologies. Our results show that the stem-based approach improves the POS models for all the target languages, with an average relative error reduction of 10.3% in accuracy per target language, and outperforms the word-based approach that operates on three times more data for about two thirds of the language pairs we consider. Moreover, we show that morpheme-level alignment and projection and the use of linguistic priors for morphological segmentation further improve POS tagging.
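
The projection idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy tokens, and the convention that a "+" separates a stem from its suffix are all assumptions made for illustration. It copies tags across alignment links, then counts the projected tags once per word form and once per stem to show why stems give less sparse evidence.

```python
from collections import Counter, defaultdict

def project_tags(source_tags, alignment, target_len):
    """Copy source tags onto aligned target positions.

    alignment: list of (source_index, target_index) pairs (hypothetical format).
    Unaligned target positions stay None.
    """
    projected = [None] * target_len
    for src_i, tgt_j in alignment:
        if projected[tgt_j] is None:  # keep the first tag that arrives
            projected[tgt_j] = source_tags[src_i]
    return projected

def tag_counts_by_unit(tokens, tags, unit_of):
    """Count projected tags per unit (word or stem), skipping untagged tokens."""
    counts = defaultdict(Counter)
    for token, tag in zip(tokens, tags):
        if tag is not None:
            counts[unit_of(token)][tag] += 1
    return counts

def stem_of(token):
    # Assumption: tokens are pre-segmented as "stem+suffix".
    return token.split("+")[0]

# Toy, made-up data: three source tags, three target tokens, two alignment links.
projected = project_tags(["DET", "NOUN", "VERB"], [(1, 0), (2, 1)], 3)

# Made-up segmented target tokens with already projected tags.
tokens = ["kitab+lar", "kitab+im", "gel+di"]
tags = ["NOUN", "NOUN", "VERB"]
word_counts = tag_counts_by_unit(tokens, tags, lambda t: t)
stem_counts = tag_counts_by_unit(tokens, tags, stem_of)
```

In this toy case each word form is seen only once, while the two forms sharing the stem `kitab` pool their evidence into a single entry, which is the intuition behind using the stem as the unit of abstraction.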

Anthology ID: 2022.naacl-main.298
Volume: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month: July
Year: 2022
Address: Seattle, United States
Editors: Marine Carpuat, Marie-Catherine de Marneffe, Ivan Vladimir Meza Ruiz
Venue: NAACL
Publisher: Association for Computational Linguistics
Pages: 4061–4072
URL: https://aclanthology.org/2022.naacl-main.298/
DOI: 10.18653/v1/2022.naacl-main.298
Cite (ACL): Ramy Eskander, Cass Lowry, Sujay Khandagale, Judith Klavans, Maria Polinsky, and Smaranda Muresan. 2022. Unsupervised Stem-based Cross-lingual Part-of-Speech Tagging for Morphologically Rich Low-Resource Languages. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4061–4072, Seattle, United States. Association for Computational Linguistics.
Cite (Informal): Unsupervised Stem-based Cross-lingual Part-of-Speech Tagging for Morphologically Rich Low-Resource Languages (Eskander et al., NAACL 2022)
PDF: https://aclanthology.org/2022.naacl-main.298.pdf
Video: https://aclanthology.org/2022.naacl-main.298.mp4
Code: rnd2110/unsupervised-cross-lingual-pos-tagging
