IndiSocialFT: Multilingual Word Representation for Indian languages in code-mixed environment



Introduction
Considering the growing interest in developing language technologies with Indian-language support, driven by a vast base of Indian-language internet users (expected to cross 650 million), the development of pre-trained text representations for Indian languages suitable for various NLP applications is becoming an important task. Though India is home to a vast linguistic landscape, encompassing 1,369 rationalized languages and dialects (INDIA, 2011), the majority of Indian-language NLP research focuses primarily on a few scheduled languages (only 22 languages are scheduled). Further, in the context of developing Indian language technology, the widespread use of transliterated text, creative acronyms, and multilingual, code-mixed text on social media should also be taken into account. To mitigate the above challenges, a few studies (Kakwani et al., 2020; Conneau et al., 2020; Khanuja et al., 2021) have initiated the development of word embeddings for Indian languages. However, these studies mostly focus on monolingual corpora with limited reach into social media settings. Motivated by these observations, this paper focuses on developing a more generalized representation vector suitable for text with diverse characteristics: written in native scripts, transliterated, multilingual, code-mixed, and exhibiting other social media traits.
With the aim of incorporating the diverse characteristics of user-generated content on social media as well as well-formed text, we consider texts collected from two kinds of sources: social media text (Twitter, Facebook, and YouTube), and well-formed text collected from the Samanantar dataset (Ramesh et al., 2021), the Dakshina dataset (Roark et al., 2020), a Manipuri-English comparable corpus (Laitonjam and Singh, 2023), and Wikipedia (covering 20 scheduled languages written in their respective native scripts). Due to computing resource constraints at our end, the embeddings of the words present in the corpus have been built using the FastText (Bojanowski et al., 2017) model; we name the resulting model IndiSocialFT. To evaluate the quality of the obtained embeddings, we compare them with three popular publicly available works for Indian languages, namely Facebook's FastText (Wiki+CommonCrawl) (Grave et al., 2018), two models from IndicNLPSuite (Kakwani et al., 2020), IndicFT and IndicBERT, and Google's MuRIL (Khanuja et al., 2021), using both intrinsic and extrinsic evaluation methods. Across various experiments, the proposed embedding outperforms all the baseline embeddings in almost all cases and languages.

Related Work
The Facebook FastText project (Grave et al., 2018) provides pre-trained word embeddings for a large number of languages that encompass both semantic and syntactic information.
Expanding on the growing interest in Indian language representation, authors of IndicNLPSuite (Kakwani et al., 2020) have developed two different pre-trained models, IndicFT -a set of 11 monolingual pre-trained FastText embedding models, and IndicBERT -a multilingual ALBERT model trained on their corpora, referred to as IndicCorp.
To address code-mixed cross-lingual transfer tasks, a few works, such as Conneau et al. (2020), use transliterated data in training, but limit themselves to the transliterated text naturally present in web-crawl data.
A more recent development in this field is MuRIL (Khanuja et al., 2021), which focuses on multilingual representations for 17 Indian languages. MuRIL is a pre-trained multilingual language model based on the BERT framework, built specifically for Indian languages, and its effectiveness has been demonstrated across a wide range of NLP tasks for Indian languages.

Data Collection
We have crawled tweets, retweets, and replies using Twitter's API, focusing on Indian languages over a three-year period from 2019 to 2022. Our curated dataset primarily comprises text sourced from Twitter, totaling 0.6 billion tweets (including quoted retweets and replies). These tweets are filtered by location (India) and amount to 5.5 billion tokens. Additionally, we have collected posts and comments from the Facebook profiles of well-known Indian individuals and news media, as well as comments on videos uploaded by popular news and entertainment channels on YouTube. The Facebook content includes a total of 0.8 million posts (with comments and nested comments), resulting in 14.8 million tokens. The dataset also incorporates 0.4 million YouTube comments, comprising 3.8 million tokens.

Model Pre-training Details
With the curated dataset, we have trained a 300-dimensional embedding model using FastText. We selected FastText for this task due to its ability to handle morphologically rich languages, such as Indian languages, by incorporating subword information in the form of character n-gram embeddings during training.
We have run the training for 15 epochs, used a window size of 5, and set a minimum token count of 5. These hyperparameters were chosen to optimize the quality of the embeddings while considering the specific linguistic characteristics of the Indian languages in our dataset.
The resulting multilingual word embeddings are  expected to capture semantic and syntactic similarities across the various Indian languages present in the dataset, thereby enabling the development of more effective natural language processing applications tailored to this diverse set of languages.
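The character n-gram mechanism that motivates the choice of FastText can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: the n-gram span of 3-6 mirrors FastText's documented defaults and is an assumption, since the paper does not state the values used.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams FastText would extract for a single word."""
    wrapped = f"<{word}>"          # boundary markers distinguish prefixes/suffixes
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    if wrapped not in grams:       # the whole word is kept as one extra feature
        grams.append(wrapped)
    return grams
```

A word's vector is then composed from the vectors of its n-grams, which is what lets the model produce embeddings for unseen, morphologically related word forms — a useful property for the inflection-rich and freely transliterated text in our corpus.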

Evaluation on Texts with Native Scripts
In the native script setting, we have compared our embeddings (referred to as IndiSocialFT) with two pre-trained embeddings: one released by the FastText project, trained on Wiki+CommonCrawl (FT-WC) (Grave et al., 2018), and the other, IndicFT, released by IndicNLPSuite. Evaluation is done in two setups: intrinsic and extrinsic.
Intrinsic Evaluation: For intrinsic evaluation, we have created sets of semantically related word pairs (antonym and synonym pairs) for six languages (five Indian and English) — Assamese, Bengali, Manipuri, Urdu (antonyms only), Tamil, and English — each set containing 100-150 pairs, and performed a ranking-based intrinsic evaluation. Let Sim_k(w_i) be the set of the top-k most similar words of w_i. We use cosine similarity to estimate the similarity between the embeddings of two words.

In the ranking-based approach, given a word pair (w_i, w_j), the rank of word w_j is defined as its position in the set Sim_k(w_i). As the words in antonym and synonym pairs are semantically related, their ranks are expected to be low. The average rank (k = 50) over the antonym and synonym pair datasets for different languages, for the various pre-trained FastText-based models, is tabulated in Table 2. For most languages, the average rank produced by our embeddings is lower than that produced by the other monolingual embeddings. The lower rank indicates that our embeddings are better at placing semantically related words close together.
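The rank computation can be sketched as follows; the toy two-dimensional embedding and its values are hypothetical, used purely for illustration:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_in_top_k(emb, w_i, w_j, k=50):
    """Position of w_j among the k nearest neighbours of w_i.

    Returns None when w_j falls outside Sim_k(w_i); lower ranks mean
    the pair sits closer together in the embedding space.
    """
    neighbours = sorted(
        (w for w in emb if w != w_i),
        key=lambda w: cosine(emb[w_i], emb[w]),
        reverse=True,
    )
    top_k = neighbours[:k]
    return top_k.index(w_j) + 1 if w_j in top_k else None

# Toy 2-d embedding (hypothetical values).
emb = {
    "good": [1.0, 0.1],
    "great": [0.9, 0.2],   # near-synonym of "good"
    "table": [0.0, 1.0],   # unrelated word
}
```

Averaging `rank_in_top_k` over all pairs in a language's antonym/synonym set yields the per-language scores reported in Table 2.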
We have also conducted another word-similarity-based intrinsic evaluation using the IIIT-Hyderabad word similarity dataset (Akhtar et al., 2017). Word similarity assessment examines the relationship between distances in the word embedding space and the semantic similarity perceived by humans (Wang et al., 2019). This helps determine how well the word embedding representations capture a human-like understanding of similarity, and supports the idea that a word's meaning is connected to the context in which it appears: the higher the similarity score a word embedding model assigns to semantically similar words, the better the representation. Following the similarity evaluation method outlined by Kakwani et al. (2020), we have assessed similarity scores on the IIIT-Hyderabad word similarity dataset. The word similarity scores are presented in Table 3. Our findings demonstrate that across various languages, our embedding model outperforms the other models in terms of word similarity scores. These higher scores suggest that the word embeddings generated by our model in the native script are more effective at capturing a human-like perception of similarity.
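Word-similarity benchmarks of this kind are conventionally scored with a Spearman rank correlation between the model's cosine similarities and the human judgements. The sketch below is a simplified version that assumes no tied values, which real implementations handle with averaged ranks:

```python
def spearman(xs, ys):
    """Spearman rank correlation, assuming no ties (a simplification)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    # Classic closed form: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

A score near 1.0 means the model orders word pairs the same way human annotators do, which is the sense in which the higher scores in Table 3 indicate better embeddings.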
Extrinsic Evaluation: We have further conducted an extrinsic evaluation of our model through text classification tasks on the IndicGLUE datasets (Kakwani et al., 2020). IndicGLUE is a comprehensive benchmark containing news articles classified into various categories, covering nine different Indian languages, each represented in its native script, offering a diverse linguistic landscape for evaluation and analysis. We have adopted the k-NN (k = 4) classifier-based evaluation module outlined by Kakwani et al. (2020) to assess our embeddings. As this approach is non-parametric, the classification performance directly reflects the efficacy of the embedding space in capturing the semantic and contextual information of each word in the text. The accuracy of the trained classifier on the IndicGLUE datasets is presented in Table 4. We obtain an average accuracy score of 96.69%, which highlights the effectiveness of our embedding model in handling semantic and contextual information in native-script text.
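A minimal sketch of this non-parametric setup, under the common assumption that each document is represented by the mean of its word vectors (the paper does not spell out the document representation), with toy vectors and labels for illustration:

```python
import math
from collections import Counter

def doc_vector(tokens, emb):
    """Mean of the word vectors of the in-vocabulary tokens."""
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        raise ValueError("no in-vocabulary tokens")
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def knn_predict(query, train, k=4):
    """Majority label among the k nearest training vectors (Euclidean)."""
    nearest = sorted(train, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Because k-NN learns no parameters of its own, any accuracy it achieves comes directly from how well the embedding space separates the classes.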

Evaluation on Multilingual Code-Mixed Texts
We have assessed our word embeddings in a code-mixed multilingual environment by conducting various text classification tasks, including (a) sentiment analysis, (b) offensive language detection, and (c) domain classification, using publicly available code-mixed datasets.
Classifier Training: In line with Kakwani et al. (2020), we have employed a k-NN (k = 4) classifier, ensuring that classification performance directly reflects the effectiveness of the embedding space in capturing text semantics and contextual information.
Results: As our embedding model is trained on a large dataset containing text from social media platforms, it inherently covers code-mixed and multilingual text. We have compared our classification results with a baseline model trained using a TF-IDF vectorizer, as well as with FastText, IndicFT, IndicBERT, and MuRIL (Khanuja et al., 2021). The text classification results, in terms of accuracy and macro F1 score, are presented in Table 6. As our model is trained on both native-script and code-mixed text, it outperforms all the other models on most of the datasets. The higher average accuracy of 0.691 and macro F1 score of 0.504 across the various code-mixed multilingual text classification tasks demonstrate that our model is also more proficient at handling contextual information in code-mixed and multilingual settings.
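For reference, the TF-IDF baseline can be sketched in plain Python. The paper does not specify the vectorizer's exact formulation, so the smoothed IDF below follows scikit-learn's default (without its final L2 normalization) and is an assumption:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each tokenized document to a {term: tf-idf weight} dict.

    Uses raw counts for tf and log((1 + N) / (1 + df)) + 1 for idf,
    the smoothed form scikit-learn's TfidfVectorizer defaults to.
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                 # document frequency per term
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    out = []
    for doc in docs:
        tf = Counter(doc)
        out.append({t: tf[t] * idf[t] for t in tf})
    return out
```

Unlike the embedding-based models, such sparse lexical features carry no subword or cross-script information, which is one reason the TF-IDF baseline lags on code-mixed text.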

Conclusion and Future Work
In this paper, we have addressed the challenge of representing text in a multilingual code-mixed social media environment by developing a FastText-based embedding model, trained on a diverse dataset collected from various social media platforms and supplemented with native-script text from publicly available corpora to ensure balance.
We have assessed the performance of our trained embeddings in both monolingual native script and code-mixed multilingual environments, employing a range of intrinsic and extrinsic evaluation techniques.The results demonstrate the effectiveness of our model's embedding space in capturing the semantic and syntactic information of text in both native monolingual and code-mixed multilingual contexts.This trained embedding model can be utilized to address various NLP challenges in the social media context in the Indian region.
As for future work, we plan to further improve the quality of our embeddings by incorporating additional data sources and exploring transformer-based pre-training techniques. We also aim to extend the applicability of our embeddings to a wider range of NLP tasks and evaluate their performance in more diverse linguistic scenarios.

Table 1: Summary of the languages supported by different models and their corresponding training datasets.

Table 2: Average rank score on antonym (Anto) and synonym (Syno) pairs for different languages, taking the top 50 similar words.

Table 4: Accuracy score (in percent) on the IndicGLUE news-category test set in different languages.

Table 5: Statistics of the different multilingual code-mixed datasets used for evaluating the model. Here, SA indicates sentiment analysis, DC domain classification, and OfD offensive language detection.

Table 6: Accuracy (acc) and macro F1 (F1) scores for the text classification task on different code-mixed datasets using k-NN (k = 4) with (a) TF-IDF, (b) MuRIL, (c) IndicBERT, and (d) IndiSocialFT.