Annotation Efficient Language Identification from Weak Labels

Shriphani Palakodety; Ashiqur KhudaBukhsh

doi:10.18653/v1/2020.wnut-1.24

Abstract

India is home to several languages with more than 30 million speakers. These languages have a significant presence on social media platforms. However, several of these widely used languages are under-addressed by current Natural Language Processing (NLP) models and resources. User-generated social media content in these languages is also typically authored in the Roman script rather than the traditional native script, which further contributes to resource scarcity. In this paper, we leverage a minimally supervised NLP technique to obtain weak language labels from a large-scale Indian social media corpus, leading to a robust and annotation-efficient language-identification technique spanning nine Romanized Indian languages. In fast-spreading pandemic situations such as the current COVID-19 pandemic, information processing objectives might be heavily tilted towards under-served languages in densely populated regions. We release our models to facilitate downstream analyses in these low-resource languages. Experiments across multiple social media corpora demonstrate the model’s robustness and provide several interesting insights into Indian language usage patterns on social media. We release an annotated data set of 1,000 comments in ten Romanized languages as a social media evaluation benchmark.

Anthology ID: 2020.wnut-1.24
Volume: Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)
Month: November
Year: 2020
Address: Online
Editors: Wei Xu, Alan Ritter, Tim Baldwin, Afshin Rahimi
Venue: WNUT
Publisher: Association for Computational Linguistics
Pages: 181–192
URL: https://aclanthology.org/2020.wnut-1.24/
DOI: 10.18653/v1/2020.wnut-1.24
Cite (ACL): Shriphani Palakodety and Ashiqur KhudaBukhsh. 2020. Annotation Efficient Language Identification from Weak Labels. In Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), pages 181–192, Online. Association for Computational Linguistics.
Cite (Informal): Annotation Efficient Language Identification from Weak Labels (Palakodety & KhudaBukhsh, WNUT 2020)
PDF: https://aclanthology.org/2020.wnut-1.24.pdf
