Minority Positive Sampling for Switching Points - an Anecdote for the Code-Mixing Language Modeling

Arindam Chatterjere, Vineeth Guptha, Parul Chopra, Amitava Das


Abstract
Code-Mixing (CM), or language mixing, is a social norm in multilingual societies. CM is prevalent in social media conversations in multilingual regions such as India, Europe, Canada and Mexico. In this paper, we explore the problem of Language Modeling (LM) for code-mixed Hinglish text. In recent times, there have been several success stories with neural language modeling, such as the Generative Pre-trained Transformer (GPT) (Radford et al., 2019) and Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018). Neural language models have thus become the new holy grail of modern NLP, although LM for CM (LMCM) remains an unexplored area altogether. To better understand the problem of LMCM, we first experimented with several statistical language modeling techniques and subsequently with contemporary neural language models. Analysis shows that switching points are the main cause of the LMCM performance drop; therefore, in this paper we introduce the idea of minority positive sampling, which selectively induces more samples to achieve better performance. In contrast, all neural language models demand a huge training corpus to perform well. Finally, we report a perplexity of 139 for Hinglish (Hindi-English language pair) LMCM using statistical bi-directional techniques.
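The abstract does not spell out how switching points are located or how extra samples are induced. As a rough illustration only, the sketch below assumes a switching point is a position where the language tag changes between adjacent tokens, and it approximates "inducing more samples" by up-weighting bigrams that straddle such a position when collecting counts for a statistical LM. The function names, the `factor` parameter, the tag labels, and the example sentence are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def switching_points(tags):
    """Return each index i where the language tag differs between token i and i+1."""
    return [i for i in range(len(tags) - 1) if tags[i] != tags[i + 1]]


def weighted_bigram_counts(tokens, tags, factor=3):
    """Count bigrams, counting those that cross a switching point `factor` times.

    This is an assumed stand-in for oversampling the minority (switching) cases,
    not the authors' published procedure.
    """
    switches = set(switching_points(tags))
    counts = Counter()
    for i in range(len(tokens) - 1):
        weight = factor if i in switches else 1
        counts[(tokens[i], tokens[i + 1])] += weight
    return counts


# Hypothetical Hinglish sentence with per-token language tags (hi = Hindi, en = English).
tokens = ["mujhe", "yeh", "movie", "bahut", "pasand", "hai"]
tags = ["hi", "hi", "en", "hi", "hi", "hi"]

points = switching_points(tags)
counts = weighted_bigram_counts(tokens, tags, factor=3)
print(points)
print(counts[("yeh", "movie")], counts[("bahut", "pasand")])
```

In this toy sentence the two bigrams around "movie" cross a language boundary, so they receive the larger weight, while monolingual bigrams keep a count of one.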
Anthology ID:
2020.lrec-1.764
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
6228–6236
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.764
Cite (ACL):
Arindam Chatterjere, Vineeth Guptha, Parul Chopra, and Amitava Das. 2020. Minority Positive Sampling for Switching Points - an Anecdote for the Code-Mixing Language Modeling. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6228–6236, Marseille, France. European Language Resources Association.
Cite (Informal):
Minority Positive Sampling for Switching Points - an Anecdote for the Code-Mixing Language Modeling (Chatterjere et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.764.pdf