Language Identification and Normalization of Code Mixed English and Punjabi Text

Neetika Bansal; Dr. Vishal Goyal; Dr. Simpel Rani

Abstract

Code mixing occurs when users combine two or more languages while communicating. Processing becomes more complex when users prefer romanized text to typing in Unicode. The automatic processing of social media data has become a popular area of interest, especially since the COVID period, during which young users' involvement has grown sharply. In keeping with this trend, our software performs language identification and normalization of English and Punjabi code-mixed text. It follows a pipeline of data collection, pre-processing, language identification, handling of out-of-vocabulary words, normalization, and transliteration of English-Punjabi text. Five-fold cross-validation on the corpus gives an accuracy of 96.8% on a training dataset of around 80,025 tokens. After the tags are predicted, slang and contractions in the user input are normalized to their standard forms, and words tagged as Punjabi are transliterated to Punjabi.
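The tag, normalize, and transliterate stages named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' system: the abstract does not describe the classifier, word lists, or transliteration rules, so this sketch swaps in plain dictionary lookups. The names `SLANG`, `PUNJABI_ROMAN`, `ENGLISH`, `tag_language`, and `process` are hypothetical, as are the tag labels `en`, `pa`, and `oov`.

```python
import re

# Hypothetical toy lexicons for illustration only; the real system is trained
# on a corpus of around 80,025 tokens and its resources are not given here.
SLANG = {"u": "you", "gr8": "great", "pls": "please"}
PUNJABI_ROMAN = {"tusi": "ਤੁਸੀਂ", "kithe": "ਕਿੱਥੇ", "ho": "ਹੋ"}
ENGLISH = {"you", "are", "great", "please", "come", "home"}


def tokenize(text):
    """Lowercase the input and split it into alphanumeric tokens."""
    return re.findall(r"\w+", text.lower())


def tag_language(token):
    """Assign a language tag by dictionary lookup (a stand-in for a trained tagger)."""
    if token in PUNJABI_ROMAN:
        return "pa"
    if token in ENGLISH or token in SLANG:
        return "en"
    return "oov"  # out-of-vocabulary: left untouched in this sketch


def process(text):
    """Return (token, tag, output form) triples for each token in the text."""
    results = []
    for token in tokenize(text):
        tag = tag_language(token)
        if tag == "en":
            form = SLANG.get(token, token)   # normalize slang to standard form
        elif tag == "pa":
            form = PUNJABI_ROMAN[token]      # transliterate romanized Punjabi
        else:
            form = token
        results.append((token, tag, form))
    return results


if __name__ == "__main__":
    for token, tag, form in process("Tusi kithe ho, pls come home"):
        print(token, tag, form)
```

A lookup table keeps the example short and makes each stage visible; a trained sequence tagger and a rule-based or learned transliterator would replace `tag_language` and the `PUNJABI_ROMAN` lookup in any realistic setting.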

Anthology ID: 2020.icon-demos.12
Volume: Proceedings of the 17th International Conference on Natural Language Processing (ICON): System Demonstrations
Month: December
Year: 2020
Address: Patna, India
Editors: Vishal Goyal, Asif Ekbal
Venue: ICON
Publisher: NLP Association of India (NLPAI)
Pages: 30–31
URL: https://aclanthology.org/2020.icon-demos.12/
Cite (ACL): Neetika Bansal, Dr. Vishal Goyal, and Dr. Simpel Rani. 2020. Language Identification and Normalization of Code Mixed English and Punjabi Text. In Proceedings of the 17th International Conference on Natural Language Processing (ICON): System Demonstrations, pages 30–31, Patna, India. NLP Association of India (NLPAI).
Cite (Informal): Language Identification and Normalization of Code Mixed English and Punjabi Text (Bansal et al., ICON 2020)
PDF: https://aclanthology.org/2020.icon-demos.12.pdf
