All that is English may be Hindi: Enhancing language identification through automatic ranking of the likeliness of word borrowing in social media

Jasabanta Patro; Bidisha Samanta; Saurabh Singh; Abhipsa Basu; Prithwish Mukherjee; Monojit Choudhury; Animesh Mukherjee

doi:10.18653/v1/D17-1240

Abstract

In this paper, we present a set of computational methods to identify the likeliness of a word being borrowed, based on the signals from social media. In terms of Spearman’s correlation values, our methods perform more than two times better (∼ 0.62) in predicting the borrowing likeliness compared to the best performing baseline (∼ 0.26) reported in the literature. Based on this likeliness estimate we asked annotators to re-annotate the language tags of foreign words in predominantly native contexts. In 88% of cases the annotators felt that the foreign language tag should be replaced by the native language tag, thus indicating a huge scope for improvement of automatic language identification systems.
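
As a rough illustration of the evaluation metric named in the abstract, the sketch below computes Spearman's rank correlation between a predicted ranking and a reference ranking. The function names (`average_ranks`, `spearman`) and the toy scores are hypothetical and are not taken from the paper; the authors' data and scoring methods are not reproduced here.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (sx * sy)


# Toy example (made-up numbers): one borrowing-likeliness score per candidate
# word from a hypothetical metric, and a hypothetical reference score.
predicted = [0.9, 0.4, 0.7, 0.1, 0.5]
reference = [0.8, 0.5, 0.6, 0.2, 0.3]
print(round(spearman(predicted, reference), 3))
```

A value near 1 means the metric orders the candidate words almost as the reference does, while a value near 0 means the two orderings are largely unrelated.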

Anthology ID: D17-1240
Volume: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
Month: September
Year: 2017
Address: Copenhagen, Denmark
Editors: Martha Palmer, Rebecca Hwa, Sebastian Riedel
Venue: EMNLP
SIG: SIGDAT
Publisher: Association for Computational Linguistics
Pages: 2264–2274
URL: https://aclanthology.org/D17-1240/
DOI: 10.18653/v1/D17-1240
Cite (ACL): Jasabanta Patro, Bidisha Samanta, Saurabh Singh, Abhipsa Basu, Prithwish Mukherjee, Monojit Choudhury, and Animesh Mukherjee. 2017. All that is English may be Hindi: Enhancing language identification through automatic ranking of the likeliness of word borrowing in social media. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2264–2274, Copenhagen, Denmark. Association for Computational Linguistics.
Cite (Informal): All that is English may be Hindi: Enhancing language identification through automatic ranking of the likeliness of word borrowing in social media (Patro et al., EMNLP 2017)
PDF: https://aclanthology.org/D17-1240.pdf
Attachment: D17-1240.Attachment.pdf
