How Low is Too Low? A Computational Perspective on Extremely Low-Resource Languages

Rachit Bansal, Himanshu Choudhary, Ravneet Punia, Niko Schenk, Émilie Pagé-Perron, Jacob Dahl


Abstract
Despite the recent advancements of attention-based deep learning architectures across a majority of Natural Language Processing tasks, their application remains limited in a low-resource setting because of a lack of pre-trained models for such languages. In this study, we make the first attempt to investigate the challenges of adapting these techniques to an extremely low-resource language – Sumerian cuneiform – one of the world’s oldest written languages attested from at least the beginning of the 3rd millennium BC. Specifically, we introduce the first cross-lingual information extraction pipeline for Sumerian, which includes part-of-speech tagging, named entity recognition, and machine translation. We introduce InterpretLR, an interpretability toolkit for low-resource NLP and use it alongside human evaluations to gauge the trained models. Notably, all our techniques and most components of our pipeline can be generalised to any low-resource language. We publicly release all our implementations including a novel data set with domain-specific pre-processing to promote further research in this domain.
Anthology ID:
2021.acl-srw.5
Volume:
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop
Month:
August
Year:
2021
Address:
Online
Editors:
Jad Kabbara, Haitao Lin, Amandalynne Paullada, Jannis Vamvas
Venues:
ACL | IJCNLP
Publisher:
Association for Computational Linguistics
Pages:
44–59
URL:
https://aclanthology.org/2021.acl-srw.5
DOI:
10.18653/v1/2021.acl-srw.5
Cite (ACL):
Rachit Bansal, Himanshu Choudhary, Ravneet Punia, Niko Schenk, Émilie Pagé-Perron, and Jacob Dahl. 2021. How Low is Too Low? A Computational Perspective on Extremely Low-Resource Languages. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop, pages 44–59, Online. Association for Computational Linguistics.
Cite (Informal):
How Low is Too Low? A Computational Perspective on Extremely Low-Resource Languages (Bansal et al., ACL-IJCNLP 2021)
PDF:
https://aclanthology.org/2021.acl-srw.5.pdf
Optional supplementary material:
 2021.acl-srw.5.OptionalSupplementaryMaterial.zip
Video:
 https://aclanthology.org/2021.acl-srw.5.mp4
Code:
 cdli-gh/Semi-Supervised-NMT-for-Sumerian-English