Twitter Language Identification Of Similar Languages And Dialects Without Ground Truth

Jennifer Williams, Charlie Dagli


Abstract
We present a new method to bootstrap-filter Twitter language ID labels in our dataset for automatic language identification (LID). Our method combines geo-location, original Twitter LID labels, and Amazon Mechanical Turk to resolve missing and unreliable labels. We are the first to compare LID classification performance using the MIRA algorithm and langid.py. We show that classifiers reach high accuracy on different versions of our dataset using only Twitter data, without ground truth, and with very few training examples. We also show how Platt Scaling can be used to calibrate MIRA classifier output values into a probability distribution over candidate classes, making the output more intuitive. Our method allows for fine-grained distinctions between similar languages and dialects and allows us to rediscover the language composition of our Twitter dataset.
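As a rough illustration of the calibration step mentioned in the abstract, the sketch below fits a one-vs-rest sigmoid to raw classifier scores and normalizes the results into a distribution over candidate classes. It is not the authors' implementation: the function names (`fit_platt`, `calibrate`), the language labels, the held-out scores, and the gradient-descent settings are all hypothetical, and the target smoothing from Platt's original formulation is omitted for brevity.

```python
import math


def fit_platt(scores, labels, lr=0.1, epochs=2000):
    """Fit p = sigmoid(a * s + b) to (score, 0/1 label) pairs.

    Plain batch gradient descent on log loss; a and b are the
    two Platt parameters for one class.
    """
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s
            grad_b += p - y
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrate(raw_scores, params):
    """Map per-class raw scores to a normalized probability distribution."""
    probs = {}
    for lang, s in raw_scores.items():
        a, b = params[lang]
        probs[lang] = 1.0 / (1.0 + math.exp(-(a * s + b)))
    total = sum(probs.values())
    return {lang: p / total for lang, p in probs.items()}


# Hypothetical held-out scores per class: (raw scores, 1 if the tweet is in that class).
held_out = {
    "pt-BR": ([2.1, 1.3, -0.4, -1.8, 0.2, -0.9], [1, 1, 1, 0, 0, 0]),
    "pt-PT": ([-1.9, -1.1, 0.5, 1.7, -0.3, 1.0], [0, 0, 0, 1, 1, 1]),
}
params = {lang: fit_platt(s, y) for lang, (s, y) in held_out.items()}

# Calibrate the raw scores of one new (hypothetical) tweet.
dist = calibrate({"pt-BR": 1.5, "pt-PT": -1.2}, params)
```

Here `dist` sums to 1 and can be read as the classifier's relative confidence in each candidate label, which is easier to interpret than unbounded margin scores.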
Anthology ID:
W17-1209
Volume:
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)
Month:
April
Year:
2017
Address:
Valencia, Spain
Editors:
Preslav Nakov, Marcos Zampieri, Nikola Ljubešić, Jörg Tiedemann, Shevin Malmasi, Ahmed Ali
Venue:
VarDial
Publisher:
Association for Computational Linguistics
Pages:
73–83
URL:
https://aclanthology.org/W17-1209/
DOI:
10.18653/v1/W17-1209
Cite (ACL):
Jennifer Williams and Charlie Dagli. 2017. Twitter Language Identification Of Similar Languages And Dialects Without Ground Truth. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pages 73–83, Valencia, Spain. Association for Computational Linguistics.
Cite (Informal):
Twitter Language Identification Of Similar Languages And Dialects Without Ground Truth (Williams & Dagli, VarDial 2017)
PDF:
https://aclanthology.org/W17-1209.pdf