On Utilizing Constituent Language Resources to Improve Downstream Tasks in Hinglish

Performance of downstream NLP tasks on code-switched Hindi-English (aka Hinglish) remains a significant challenge. Intuitively, Hindi and English corpora should help improve task performance on Hinglish. We show that a meta-learning framework can effectively utilize the labelled resources of the downstream tasks in the constituent languages. The proposed approach improves performance on downstream tasks in the code-switched language. We experiment with the Hinglish code-switching benchmark GLUECoS and report significant improvements.


Introduction
In parts of the world where people speak more than one language in day-to-day affairs, mixing of the languages often occurs naturally. This phenomenon, termed code-switching or code-mixing, is observed for various language pairs, e.g. Hindi-English and Spanish-English. Natural language processing tasks such as sentiment analysis, named entity recognition, and question answering are interesting research challenges for Hinglish (Hindi-English code-switching).
With the rise of multilingual language models (LMs), the pretrain-and-finetune approach has been widely used for various downstream NLP tasks; it is observed that when LMs are pretrained on large corpora, they can be easily transferred to downstream tasks with limited fine-tuning data. Most publicly available multilingual LMs are pretrained on a combination of monolingual corpora (such as Hindi and English), but not on code-switched data (e.g. Hinglish). This leads to sub-optimal LMs for downstream tasks in the code-switched language.
Moreover, downstream tasks on a code-switched language (e.g. Hinglish) usually suffer from a low-resource problem. Usually, the same task on the constituent languages (e.g. Hindi and English) is not low-resource, which we see as an opportunity. We explore utilizing the constituent language resources to fine-tune the LM for the downstream task in the code-switched language. Based on these intuitions, we propose a meta-learning based approach to address the code-switching challenge, as illustrated in Figure 1. The meta-learner helps leverage Hindi and English resources to improve performance on code-switched data.

Figure 1: The proposed approach for code-switched NLP, utilizing the downstream task data in Hindi and English for improving Hinglish task performance.
During fine-tuning of a pre-trained language model, its weights adjust to the downstream task. It has been shown that if the base pre-trained model is multilingual, fine-tuning on one language achieves impressive performance on another language even in a zero-shot setting (Pires et al., 2019). This shows that fine-tuning teaches a multilingual model to perform the downstream task in a somewhat language-agnostic fashion. However, such transfers work better when the source and target languages are closely related; obviously, the constituent languages are the closest source languages to the target code-switched language. At the same time, models learned/fine-tuned on the constituent languages may become specific to those source languages. Meta-learning approaches are good at solving this problem by learning model parameters suitable for multiple downstream tasks. Therefore, we use meta-learning with constituent language resources to obtain a model state that serves as a better initialization point for fine-tuning with limited code-switched samples. Overall, we focus on the following inquiry: for various downstream tasks, can we effectively use Hindi and English resources to improve performance on code-switched data? We explore the utility of meta-learning for this question.

Related Work
A variety of downstream tasks in code-switched languages has given rise to standard benchmarks and a need for synthesizing CS data. Token-level Language Identification (LID) is one of the earliest explored primary tasks in code-switched NLP, studied for dialectal Arabic-Modern Standard Arabic (Elfardy and Diab, 2012; Solorio et al., 2014), Spanish-English, Nepalese-English, Mandarin-English (Solorio et al., 2014), and English-Indic languages (Sequiera et al., 2015; Zhang et al., 2018). Part-of-speech (POS) tagging for Hindi-English code-switching has been explored either via the LID route (Sequiera et al., 2015), which identifies the language of text chunks and applies the POS tagger of the respective language, or by utilizing language-specific word representations (Ball and Garrette, 2018). Named Entity Recognition (NER) has been explored on the code-switched language pairs Spanish-English, Nepalese-English, Mandarin-English, and Modern Standard Arabic-Egyptian (Priyadharshini et al., 2020; Winata et al., 2019; Aguilar et al., 2018; Solorio et al., 2014). The use of meta-embeddings (Priyadharshini et al., 2020; Winata et al., 2019) has also been explored for code-switched NER. The GLUECoS and LINCE benchmarks, created by Khanuja et al. (2020) and Aguilar et al. (2020) respectively, include LID, POS tagging, NER, Question Answering (QA), Sentiment Analysis (SA), Natural Language Inference (NLI), and Machine Translation (MT) for evaluating models on several language pairs across several tasks.

Code-switched text generation has been explored in recent literature to address data scarcity in code-switched scenarios. Gupta et al. (2020) proposed creating synthetic code-switched texts from a parallel corpus by replacing named entities, noun phrases, and adjectives in the Hindi sentence with the corresponding English translations obtained from alignments. They train a deep learning model on this synthetic data to generate more code-switched texts. Rizvi et al. (2021) released a toolkit for generating synthetic code-switched texts; it implements two linguistic theories, the Equivalence Constraint theory and the Matrix Language theory, to constrain the generated synthetic code-switched sentences. Tarunesh et al. (2021b) use existing unsupervised neural machine translation techniques to generate code-switched sentences. However, in our (limited) initial experimentation with generation approaches, we observe that generated CS text suffers from unnatural switching and grammar violations.

Meta-learning frameworks perform task-agnostic representation learning and are widely used for fast adaptation to downstream tasks of interest. MAML (Finn et al., 2017) and Reptile (Nichol et al., 2018) are arguably two of the most popular meta-learning approaches. MAML's computation and space complexity make it somewhat impractical, a problem that Reptile addresses with heuristics without compromising performance. Meta-learning is highly effective and has been shown to be beneficial for a variety of NLP tasks such as low-resource machine translation, persona-consistent dialogue, low-resource sales prediction, and speech recognition. We believe its applications can be extended to code-switched NLP too.

Our Approach
In this work we present a technique to improve downstream task performance in Hinglish: utilizing the downstream task data of Hindi and English in a meta-learning framework. Overall, we aim to utilize high-resource unsupervised text corpora and task corpora of the constituent languages (English and/or Hindi) to initialize a model that can be transferred to the CS task effectively.
The idea behind meta-learning algorithms such as Reptile is to learn a better initialization for the target task using a set of auxiliary source tasks. In the CS setting, this amounts to using the constituent languages to learn better initialization parameters and then fine-tuning on the small amount of available code-switched data, starting from the initialized parameters.
Our specific methodology is inspired by Tarunesh et al. (2021a). To represent it more formally, let the code-switched language have English (en) and Hindi (hi) as the constituent languages; let T_en and T_hi denote the tasks in these languages and T_hi-en the task in the target code-mixed language. The proposed meta-learning approach is described in Algorithm 1.

Algorithm 1: Our Meta-learning Approach for CS
Input: Task data T_en and T_hi for English and Hindi, and T_hi-en for Hinglish
Output: The model for the Hinglish downstream task
 1: Initialize model weights θ
 2: while not converged do            ▷ Reptile updates
 3:     Draw a total of m random batches from T_en and T_hi
 4:     θ_out ← θ
 5:     for the i-th batch, 0 < i < m do
 6:         Train and update the model for k steps; persist weights as θ^i_batch
 7:         θ ← θ_out                 ▷ reset the model
 8:     end for
 9:     θ_var ← θ̄ − θ_out            ▷ θ̄ is the average of the θ^i_batch
10:     θ ← θ_out + β·θ_var
11: end while                         ▷ meta-learning ends
12: while not converged do
13:     Fine-tune the model on the target T_hi-en
14: end while

First, the model weights θ are initialized. The model consists of a randomly initialized classifier head on top of a pre-trained Language Model (LM). The outer loop of the meta-learner executes for n_e epochs over the T_hi and T_en task datasets; this is our convergence criterion. Within each meta-step, we sample m batches from the collective pool of batches of English and Hindi. With each batch, the model is trained for k steps, the updated model weights are persisted as θ^i_batch, and the model is reset to the θ_out state from before the meta-step. θ^i_batch represents the model state had it been trained with the i-th batch alone. Since each of these states is too specific, their average θ̄ represents a somewhat generalized state of the model for the downstream task. We obtain a measure of the model's deviation as θ_var, which represents the direction of the generalized model state θ̄ relative to the current state θ_out. Finally, the model weights are updated in that direction, with step size β. At convergence, the model state serves as a good initialization to fine-tune the same network on the Hinglish task data T_hi-en.
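Algorithm 1 can be sketched end-to-end with a toy scalar "model" standing in for the mBERT-plus-classifier network; the function names and the least-squares objective below are illustrative assumptions, not the authors' code.

```python
import random

def sgd_steps(theta, batch, k=3, lr=0.1):
    """Run k plain SGD steps on a 1-D least-squares objective.
    A batch is a list of (x, y) pairs; the 'model' is the single weight theta."""
    for _ in range(k):
        grad = sum(2 * (theta * x - y) * x for x, y in batch) / len(batch)
        theta -= lr * grad
    return theta

def reptile_meta_train(batches_en_hi, epochs=5, m=4, k=3, beta=1.0, theta=0.0):
    """Outer loop of Algorithm 1: each meta-step trains a copy of the model
    on m monolingual (EN/HI) batches from theta_out, averages the resulting
    states, and moves theta toward that average with step size beta."""
    for _ in range(epochs):
        sampled = random.sample(batches_en_hi, m)  # draw m random EN/HI batches
        theta_out = theta                          # snapshot before the meta-step
        batch_states = [sgd_steps(theta_out, b, k) for b in sampled]  # reset each time
        theta_bar = sum(batch_states) / len(batch_states)  # generalized state
        theta_var = theta_bar - theta_out                  # update direction
        theta = theta_out + beta * theta_var               # Reptile update
    return theta

def fine_tune(theta, cs_batches, k=10):
    """Final fine-tuning on the code-switched task data T_hi-en."""
    for batch in cs_batches:
        theta = sgd_steps(theta, batch, k)
    return theta
```

In this miniature setting, meta-training on monolingual batches drawn from the same underlying task pulls theta close to the task optimum, so the final fine-tuning on code-switched batches converges in very few steps, which mirrors the intended effect of the initialization.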

Experiments and Results
We utilize the GLUECoS (Khanuja et al., 2020) benchmark as our test-bed for evaluation on Hindi-English code-switching. The official benchmark involves POS tagging (POS) (Universal Dependencies), Named Entity Recognition (NER), Sentiment Analysis (SA), Question Answering (QA), and Natural Language Inference (NLI). We submit test-set predictions to the official benchmark portal to obtain the F1-score metrics.
As part of meta-training and its ablations, we utilize the publicly available datasets of each downstream task for Hindi and English languages.The details of the monolingual task datasets used for meta-training are described in Table 1.

Implementation Details
We use the code repository of Tarunesh et al. (2021a) as base code. We run all experiments on an NVIDIA Tesla V100 GPU, using bert-base-multilingual-cased as the base pre-trained model. We tune hyper-parameters based on validation-set loss wherever available, selecting the best values from the following ranges:
• Batch size: 8, 16, 32
• Learning rate: 1e-3, 1e-4, 1e-5, 1e-6, 3e-3, 3e-4, 3e-5, 3e-6, 5e-3, 5e-4, 5e-5, 5e-6
We meta-train the model for n_e = 5 epochs, select the best model, and fine-tune it further using code-switched data. We set the number of meta update steps k = 3 and the number of batches in a meta-step m = 8. We set the meta-training step size hyper-parameter β = 1.0. Meta-training requires approximately 12 GPU hours and fine-tuning approximately 1 GPU hour.
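The hyper-parameter search described above amounts to a small grid over batch size and learning rate; one way to sketch it (the `validate` callback is a hypothetical stand-in for a validation-loss run, not the authors' code) is:

```python
from itertools import product

# Grid from the paper; fixed meta-learning settings kept alongside for clarity.
GRID = {
    "batch_size": [8, 16, 32],
    "learning_rate": [1e-3, 1e-4, 1e-5, 1e-6,
                      3e-3, 3e-4, 3e-5, 3e-6,
                      5e-3, 5e-4, 5e-5, 5e-6],
}
META = {"epochs": 5, "k": 3, "m": 8, "beta": 1.0}

def select_best(validate):
    """Return the (batch_size, learning_rate) pair with the lowest
    validation loss, as reported by the caller-supplied validate()."""
    best, best_loss = None, float("inf")
    for bs, lr in product(GRID["batch_size"], GRID["learning_rate"]):
        loss = validate(bs, lr)
        if loss < best_loss:
            best, best_loss = (bs, lr), loss
    return best
```

The full grid is only 36 configurations, so exhaustive search is feasible even at roughly 12 GPU hours per meta-training run when only a subset of configurations is meta-trained to completion.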

Monolingual Labelled Datasets
Table 1 describes the English and Hindi monolingual datasets used for meta-learning along with their sources and their sizes.

Baselines and Our Models
To evaluate the effectiveness of our meta-training approach, we compare with the following:
• CS: fine-tunes the base LM using only the CS task data; we report both the figures from the leaderboard and our replication results.
• EN→CS, HI→CS: first fine-tune the base LM on English (EN→CS) or Hindi (HI→CS), then further fine-tune for the CS task.
• EN+HI→CS: first fine-tunes the base LM on the combined set of English and Hindi, then further fine-tunes for the CS task.
• EN+HI+CS: trains on concatenated and shuffled English, Hindi, and code-switched data. Since the data is imbalanced across the three sources, we utilize the temperature-based sampling strategy of Arivazhagan et al. (2019), which allows sampling proportionally to the dataset sizes (τ = 1) or uniformly (τ = ∞).
• Meta-Trained: first uses meta-learning to learn a better model initialization using the constituent languages (English and Hindi), then fine-tunes this meta-trained model using code-switched data.
• Code-Mixed mBERT: we also compare against the results reported by Tarunesh et al. (2021b), who further train an mBERT model on generated synthetic code-mixed sentences. They report results on tasks such as NLI and Sentiment Analysis.
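The temperature-based sampling used in the EN+HI+CS baseline can be sketched as follows; this is a minimal implementation of the formulation of Arivazhagan et al. (2019), with p_i ∝ q_i^(1/τ) and q_i the fraction of the total data held by source i.

```python
def sampling_probs(sizes, tau):
    """Temperature-based sampling probabilities over data sources.
    tau = 1 samples proportionally to dataset size; tau → ∞ flattens
    the distribution toward uniform (q_i^0 = 1 for every source)."""
    total = sum(sizes)
    if tau == float("inf"):
        exps = [1.0 for _ in sizes]                       # uniform limit
    else:
        exps = [(n / total) ** (1.0 / tau) for n in sizes]
    z = sum(exps)
    return [e / z for e in exps]                          # normalize to 1
```

With sizes differing by orders of magnitude across EN, HI, and CS, τ = 1 lets the large monolingual sets dominate every batch, while τ = ∞ gives each source an equal share, which matches the behavior discussed in the results below.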
We use bert-base-multilingual-cased (Devlin et al., 2018) as the base model in all our experiments.

Results and Analysis
Table 2 presents the results of our experiments. The key observations follow:
• Across different tasks, the general performance trend is Meta-Trained > CS. However, for the joint fine-tuning and transfer-learning experiments, we do not observe a clear pattern relative to the baseline model. This indicates the importance of meta-learning in utilizing the task-specific data of the constituent languages to assist the CS task.
• Joint fine-tuning with EN+HI+CS yields somewhat inconsistent results across tasks at sampling temperature τ = 1. We attribute this to the data imbalance of a few orders of magnitude between the EN, HI, and CS sets. Enforcing sampling uniformity with τ = ∞ alleviates the problem to some extent; even then, we observe better performance only on the Question Answering and Sentiment Analysis tasks.
• Meta-learning yields a definite improvement across the downstream tasks, indicating that meta-learning indeed yields a generalized initialization point better suited for fine-tuning with CS data.
Meta-learning and EN+HI→CS both use the same fine-tuning data, yet the former yields superior performance. Thus, the improvements from using constituent language data cannot be attributed to the inflated training set alone.
Statistical Significance: To assess the statistical significance of the results, we run our replication of CS, the transfer-learning experiments, the multi-task experiments, and the proposed Meta-Trained approach with 5 different random seeds. We perform a t-test between the distributions of the obtained performance metrics. With p < 0.05, the proposed Meta-Trained approach statistically significantly outperforms the CS approach on all tasks barring NLI.
• As an auxiliary observation from Table 2, our replication of mBERT outperforms both the figures reported on the leaderboard and the results reported by Tarunesh et al. (2021b). This indicates the non-trivial role of hyper-parameter tuning on this benchmark.
Based on these observations, the answer to our RQ is affirmative: Hindi and English resources help improve performance on the CS task, and the improvement is even more pronounced when combined with the meta-learning approach. In the first qualitative example, our model predicts correctly whereas the baseline fails. In Example 2, we observe that both models fail to predict the correct answer, i.e., America, and predict similar incorrect answers.
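The per-seed significance test described above compares two small samples of scores; a pure-stdlib sketch of the Welch (unequal-variance) t statistic is shown below. In practice one would call scipy.stats.ttest_ind(a, b, equal_var=False) to also obtain the p-value; the five-seed score lists in the usage example are synthetic, not the paper's actual numbers.

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t statistic and effective degrees of freedom
    (Welch-Satterthwaite), for comparing per-seed scores of two systems.
    variance() is the sample variance (n - 1 denominator)."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

For example, with five hypothetical seeds each, welch_t([0.85, 0.86, 0.84, 0.85, 0.86], [0.80, 0.81, 0.79, 0.80, 0.80]) yields a t statistic well above the two-tailed critical value for its degrees of freedom, i.e. a significant difference at p < 0.05.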

Limitations and Future Work
The limitations of this work stem from 1. the requirement of labeled training data in the constituent languages, and 2. the assumption that a multilingual pre-trained language model covering both constituent languages is available. As we consider the classifier layer weights part of θ, the universe of labels of the constituent languages should encompass the labels for the code-switched data. However, this constraint can be loosened by not sharing the classifier weights between the meta-training and fine-tuning stages.
In the future, we would like to validate the proposed approach on other code-switching pairs, such as Spanish-English; in this paper, due to the lack of adequate task-specific datasets in the constituent languages, we were unable to present findings for the ES-EN pair.

Conclusion
We studied the usefulness of corpora in the constituent languages, Hindi and English, for improving the performance of NLP tasks in the code-switched language, Hinglish. We propose a Reptile-based meta-learning framework that learns a better initialization using task-specific labelled datasets in the constituent languages and improves performance on the code-switched language. Our results indicate that meta-trained models outperform other strong baselines on all GLUECoS tasks.

Table 1 :
English and Hindi downstream task datasets.

Table 2 :
Results on various GLUECoS tasks. * as reported in the leaderboard. Figures are F1 metrics; QA uses F1 as defined by the SQuAD protocol. Reported figures are mean ± standard deviation. † indicates results which are statistically significant compared to our replication of CS.