HIT: A Hierarchically Fused Deep Attention Network for Robust Code-mixed Language Representation

Understanding the linguistics and morphology of resource-scarce code-mixed texts remains a key challenge in text processing. Although word embeddings come in handy to support downstream tasks for low-resource languages, there is ample scope for improving the quality of language representation, particularly for code-mixed languages. In this paper, we propose HIT, a robust representation learning method for code-mixed texts. HIT is a hierarchical transformer-based framework that captures the semantic relationships among words and hierarchically learns sentence-level semantics using a fused attention mechanism. HIT incorporates two attention modules, a multi-headed self-attention and an outer-product attention module, and computes their weighted sum to obtain the attention weights. Our evaluation of HIT on one European (Spanish) and five Indic (Hindi, Bengali, Tamil, Telugu, and Malayalam) languages across four NLP tasks and eleven datasets suggests significant performance improvements over various state-of-the-art systems. We further show the adaptability of the learned representations across tasks in a transfer learning setup (with and without fine-tuning).


Introduction
India is the second most populated country in the world, where ∼1.36 billion people speak over 200 different languages. Among them, the top five languages (Hindi, Bengali, Telugu, Tamil, and Malayalam) cover ∼93% of the entire population, with more than 26% of them being bilingual (as per Wikipedia). Moreover, a significant proportion of them (Singh et al., 2018a) use code-mixed languages to express themselves on Online Social Networks (OSN).
Code-mixing (CM) is a linguistic phenomenon in which two or more languages are alternately used during a conversation. One of the languages is usually English, while the other can be any regional language such as Hindi (Hindi + English → Hinglish), Bengali (Bengali + English → Benglish), Spanish (Spanish + English → Spanglish), etc. Their presence on social media platforms and in day-to-day conversations among people of multi-lingual communities (such as Indians) is overwhelming. Despite the fact that a significant population is comfortable with code-mixed languages, research involving them is fairly young. One of the prime reasons is linguistic diversity, i.e., research on one language often fails to adapt to other distant languages, so each needs to be studied separately. In recent years, many organizations have identified these challenges and have put commendable effort into the development of computational systems for regional monolingual or code-mixed setups.
Traditionally, the NLP community has studied the code-mixing phenomenon from a task-specific point of view. Recently, a few studies (Pratapa et al., 2018; Aguilar and Solorio, 2020) have started learning representations for code-mixed texts for semantic and syntactic tasks. While the former showcased the importance of multi-lingual embeddings for CM text, the latter used a hierarchical attention mechanism on top of positionally aware character bi-grams and tri-grams to learn robust word representations for CM text. Carrying over the same objective, in this paper, we introduce a novel HIerarchically attentive Transformer (HIT) framework to effectively encode syntactic and semantic features in the embedding space. At first, HIT learns sub-word level representations employing a fused attention mechanism (FAME), an outer-product based attention mechanism (Le et al., 2020) fused with standard multi-headed self-attention (Vaswani et al., 2017). The intuition behind sub-word level representation learning comes from the lexical variations of words in code-mixed languages. The character-level HIT maps phonetically similar words and their variations to a similar embedding space and extracts better representations for noisy texts. Subsequently, we apply the HIT module at the word level to incorporate semantics at the sentence level. The output of HIT is a sequence of word representations, which can be fed to the architecture of any downstream NLP task. For the evaluation of HIT, we experiment on one classification (sentiment classification), one generative (MT), and two sequence-labelling (POS tagging and NER) tasks. In total, these tasks span eleven datasets across six code-mixed languages: one European (Spanish) and five Indic (Hindi, Bengali, Telugu, Tamil, and Malayalam). Our empirical results show that representations learned by HIT are superior to existing multilingual and code-mixed representations, reporting state-of-the-art performance across all tasks. Additionally, we observe encouraging adaptability of HIT in a transfer learning setup across tasks: the representations learned for one task are employed for learning other tasks w/ and w/o fine-tuning, and HIT yields good performance in both setups for two code-mixed languages.
Main contributions: We summarize our contributions as follows:
• We propose a hierarchical attention transformer framework for learning word representations of code-mixed texts for six non-English languages.
• We propose a hybrid self-attention mechanism, FAME, to fuse the multi-headed self-attention and outer-product attention mechanisms in our transformer encoders.
• We show the effectiveness of HIT on eleven datasets across four NLP tasks and six languages.
• We observe good task-invariant performance of HIT in a transfer learning setup for two code-mixed languages.
Reproducibility: Source codes, datasets, and other details to reproduce the results are publicly available at https://github.com/LCS2-IIITD/HIT-ACL2021-Codemixed-Representation.

Related Work
Recent years have witnessed a few interesting works in the domain of code-mixed/code-switched representation learning. Seminal work was driven by bilingual embeddings that employ cross-lingual transfer to develop NLP models for resource-scarce languages (Upadhyay et al., 2016; Akhtar et al., 2018; Ruder et al., 2019). Faruqui and Dyer (2014) introduced the BiCCA embedding using bilingual correlation, which performed well on syntactic tasks but poorly on cross-lingual semantic tasks. Similarly, Hermann and Blunsom (2014) proposed compositional frameworks for learning multilingual representations from sentence-aligned parallel data.
Another school of thought revolves around subword-level representations, which can help capture the variations found in CM and transliterated text. Joshi et al. (2016) proposed a CNN-LSTM based model to learn sub-word embeddings through 1-D convolutions of character inputs and showed that it results in better sentiment classification performance for CM text. Building on this intuition, attention-based frameworks have also proven successful in learning low-level representations. The HAN model (Yang et al., 2016) introduced hierarchical attention for document classification, which enables it to differentially attend to more and less important content at the word and sentence levels. In another work, Aguilar and Solorio (2020) proposed CS-ELMo for code-mixed inputs with a similar intuition: it applies the hierarchical attention mechanism at bi-gram and tri-gram levels to effectively encode sub-word level representations while adding positional awareness.
Our work builds on top of these two earlier works to push the robustness of code-mixed representations further. The main difference between existing studies and HIT is the incorporation of an outer-product attention within the fused attention mechanism (FAME).

Proposed Methodology
In this section, we describe the architecture of HIT for learning effective representations in code-mixed languages.

Figure 1: Hierarchical Transformer along with our novel FAME mechanism for attention computation.

The backbone of HIT is the transformer (Vaswani et al., 2017) and the Hierarchical Attention Network (HAN) (Yang et al., 2016). HIT takes a sequence of words (a code-mixed sentence) S = w_1, w_2, ..., w_N as input and processes each word w_i using a character-level HIT to obtain sub-word representations S_sb = sb_1, sb_2, ..., sb_N. The character-level HIT is a transformer encoder in which, instead of computing multi-headed self-attention only, we amalgamate it with an outer-product attention mechanism (Le et al., 2020). The intuition behind outer-product attention is to extract higher-order character-level relational similarities among inputs. To leverage both attention mechanisms, we compute their weighted sum using a softmax layer. Subsequently, we pass the result through the usual normalization and feed-forward layers to obtain the encoder's output. A stack of l_c encoders is used. In the next layer of the hierarchy, the sub-word representations are combined with positional and rudimentary embeddings of each word and forwarded to the word-level HIT encoder. Finally, the output of the word-level HIT is fed to the respective task-specific network. The hierarchical nature of HIT enables us to capture both character-level and word-level relational (syntactic and semantic) similarities. A high-level schema of HIT is shown in Figure 1.
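To make the hierarchy concrete, the following is a minimal NumPy sketch of the flow described above; the function signatures, the mean-pooling of character outputs, and the additive fusion of embeddings are illustrative assumptions rather than the released implementation.

```python
import numpy as np

def hit_encode(char_ids, char_encoder, word_encoder, word_embeddings, positional_embeddings):
    """Sketch of HIT's two-level hierarchy (hypothetical interface).

    char_ids: list of N words, each an array of character ids.
    char_encoder: stack of l_c FAME-based encoders applied to one word's characters.
    word_encoder: stack of word-level FAME-based encoders applied over the sentence.
    word_embeddings, positional_embeddings: (max_len, d) lookup tables.
    """
    # Character-level HIT: pool the encoded characters into one sub-word vector per word.
    subwords = np.stack([char_encoder(ids).mean(axis=0) for ids in char_ids])     # (N, d)

    # Combine sub-word vectors with rudimentary word embeddings and positional embeddings.
    n = len(char_ids)
    word_inputs = subwords + word_embeddings[:n] + positional_embeddings[:n]      # (N, d)

    # Word-level HIT: contextualise words across the sentence; output feeds the task head.
    return word_encoder(word_inputs)                                              # (N, d)
```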

Fused Attention Mechanism (FAME)
FAME extends the multi-headed self-attention (MSA) module of a standard transformer by including a novel outer-product attention (OPA) mechanism. Given an input $x$, we use three weight matrices, $W^{self}_Q$, $W^{self}_K$, and $W^{self}_V$, to project the input to the Query ($Q_{self}$), Key ($K_{self}$), and Value ($V_{self}$) representations for MSA, respectively. Similarly, for OPA we use $W^{outer}_Q$, $W^{outer}_K$, and $W^{outer}_V$ to project $x$ to $Q_{outer}$, $K_{outer}$, and $V_{outer}$. Next, the two attention mechanisms are learned in parallel, and a weighted sum of their outputs is computed: $H = \alpha_1 \cdot Z_{self} \oplus \alpha_2 \cdot Z_{outer}$, where $Z_{self}$ and $Z_{outer}$ are the outputs of the multi-headed self-attention and outer-product attention modules, respectively, and $\alpha_1$ and $\alpha_2$ are the corresponding weights computed through a softmax function.
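A minimal sketch of this fusion, assuming the two attention branches are available as callables and the fusion logits are a learned length-2 vector (names are hypothetical):

```python
import numpy as np

def fame(x, msa, opa, fusion_logits):
    """Fused attention sketch: softmax-weighted combination of MSA and OPA outputs.

    msa, opa: callables returning (N, d) outputs Z_self and Z_outer for input x.
    fusion_logits: length-2 array of unnormalised weights (learned in the real model).
    """
    z_self, z_outer = msa(x), opa(x)
    alpha = np.exp(fusion_logits) / np.exp(fusion_logits).sum()  # softmax over the two branches
    # H = alpha_1 * Z_self (+) alpha_2 * Z_outer, taking (+) as element-wise addition here.
    return alpha[0] * z_self + alpha[1] * z_outer
```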
Multi-Headed Self Attention. The standard transformer self-attention module (Vaswani et al., 2017) computes a scaled dot-product between the query and key vectors to obtain the attention weights for the value vectors. For a sequence of $N$ tokens, the output is computed as
$$Z_{self} = \mathrm{softmax}\!\left(\frac{Q_{self} K_{self}^{\top}}{\sqrt{d_k}}\right) V_{self},$$
where $d_k$ is the dimension of the key vector.
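For reference, a single-head NumPy sketch of this scaled dot-product attention (the multi-head split and projection steps are omitted):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Standard self-attention for one head: softmax(QK^T / sqrt(d_k)) V.

    q, k, v: (N, d_k) query/key/value projections; returns (N, d_k) context vectors.
    """
    scores = q @ k.T / np.sqrt(k.shape[-1])                  # (N, N) scaled similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)           # row-wise softmax
    return weights @ v                                       # weighted sum of values
```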
Outer-Product Attention. This is the second attention mechanism that we incorporate into FAME. Although the fundamental process of OPA (Le et al., 2020) is similar to multi-headed self-attention, OPA differs in its operators and in the use of a row-wise tanh activation instead of softmax. To compute the interaction between the query and key vectors, we use element-wise multiplication as opposed to the dot-product in MSA. Subsequently, an element-wise tanh is applied before computing the outer-product with the value vector. The intuition is to exploit fine-level associations between the key-scaled query and the value representations in a code-mixed setup. Analogous to the earlier case, we define OPA as
$$Z_{outer} = \tanh\!\left(Q_{outer} \odot K_{outer}\right) \otimes V_{outer},$$
where $\odot$ is the element-wise multiplication and $\otimes$ is the outer-product.
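The sketch below mirrors the prose description (element-wise query-key product, tanh, then an outer product with the values); the per-token outer product and the final reduction back to an (N, d) output are assumptions, since the exact tensor layout is not spelled out here.

```python
import numpy as np

def outer_product_attention(q, k, v):
    """Sketch of OPA for one head (shapes and reduction are assumptions).

    q, k, v: (N, d) query/key/value projections.
    """
    scores = np.tanh(q * k)                     # element-wise product + tanh, (N, d)
    z = np.einsum('nd,ne->nde', scores, v)      # per-token outer product, (N, d, d)
    return z.mean(axis=-1)                      # assumed reduction back to (N, d)
```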

Task-specific Layers
As mentioned earlier, HIT can be adapted for various NLP tasks, including sequence labelling, classification, and generative problems. In the current work, we evaluate HIT on part-of-speech (POS) tagging, named-entity recognition (NER), sentiment classification, and machine translation (MT). We mention their specific architectural details below. For sentiment classification, we apply a GlobalAveragePooling operation over the token embeddings to obtain the sentence embedding. Additionally, we concatenate extracted statistical features with the embedding before feeding it into the final classification layer. We use tf-idf (term frequency-inverse document frequency) vectors for {1, 2, 3}-grams of words and characters extracted from each text. We hypothesize that these statistical features carry sufficient information to dispense with handcrafted features like the ones suggested by Bansal et al. (2020). Finally, a softmax activation function is used for the prediction. Similarly, for POS tagging and NER, the label for each token is obtained through a softmax-activated output applied to its embedding. For MT, we use an encoder-decoder framework where both the encoder and the decoder are based on the HIT framework.
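As an illustration, a minimal sketch of the sentiment head described above (pooling, tf-idf concatenation, softmax); weight shapes and names are hypothetical.

```python
import numpy as np

def sentiment_head(token_embeddings, tfidf_features, w, b):
    """Sketch of the classification head: mean-pool tokens, append tf-idf features, softmax.

    token_embeddings: (N, d) word-level HIT outputs; tfidf_features: (f,) statistical vector.
    w: (d + f, num_classes) classifier weights; b: (num_classes,) bias.
    """
    sentence = token_embeddings.mean(axis=0)                 # GlobalAveragePooling over tokens
    features = np.concatenate([sentence, tfidf_features])    # contextual + statistical features
    logits = features @ w + b
    return np.exp(logits) / np.exp(logits).sum()             # softmax class probabilities
```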

Experiments, Results, and Analyses
In this section, we furnish the details of chosen datasets, our experimental results, comparative study, and necessary analyses.

Datasets
We evaluate HIT on 11 publicly available datasets across 4 tasks in 6 code-mixed languages.
• Named Entity Recognition: We employ Hindi (Singh et al., 2018c) and Spanish (Aguilar et al., 2018) datasets with 2079 and 52781 sentences, respectively. In Hindi, the labels are name, location, and organization. The Spanish dataset has six additional labels: event, group, product, time, title, and other named entities.
• Machine Translation: We utilize a recently developed Hindi-English code-mixed parallel corpus for machine translation (Gupta et al., 2020) comprising more than 200k sentence pairs. For our experiments, we transliterate all Devanagari Hindi text.

Baselines
Subword-LSTM (Joshi et al., 2016): A hybrid CNN-LSTM model. A 1D convolution operation is applied over character inputs to obtain sub-word representations; subsequently, the convolved features are max-pooled and fed to an LSTM. Since this system disregards word boundaries in a sentence, we use it for sentiment classification only.
Machine translation: For machine translation, we evaluate HIT against GFF-Pointer (Gupta et al., 2020), a gated feature fusion (GFF) based approach that amalgamates XLM and syntactic features during encoding and uses a pointer generator for decoding. Furthermore, we also incorporate three other baselines for comparison, including Seq2Seq.

Experimental Setup
For each experiment, we use a dropout rate of 0.1 in both the transformer blocks and the task-specific layers. A categorical cross-entropy loss with the Adam optimizer (η = 0.001, β1 = 0.9, β2 = 0.999) (Kingma and Ba, 2014) is employed in all experiments. We train our models for a maximum of 500 epochs with an early-stopping criterion (patience = 50). We additionally use a learning rate scheduler to reduce the learning rate to 70% of its value at plateaus, with a patience of 20 epochs. All models are trained with a batch size of 32.
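A sketch of this configuration in Keras-style code is shown below; the callback-based organisation is illustrative and may differ from the released training scripts.

```python
import tensorflow as tf

# Optimiser and schedule mirroring the hyper-parameters above (illustrative only).
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, beta_1=0.9, beta_2=0.999)
callbacks = [
    tf.keras.callbacks.EarlyStopping(patience=50),                  # stop after 50 flat epochs
    tf.keras.callbacks.ReduceLROnPlateau(factor=0.7, patience=20),  # LR -> 70% at plateaus
]
# model.compile(optimizer=optimizer, loss="categorical_crossentropy")
# model.fit(x_train, y_train, batch_size=32, epochs=500,
#           validation_data=(x_val, y_val), callbacks=callbacks)
```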

Experimental Results
We compute precision, recall, and F1-score for POS tagging, NER, and sentiment classification, whereas BLEU, METEOR, and ROUGE scores are reported for the machine translation task.
Sentiment classification: As shown in Table 2, HIT obtains the best F1-scores across all languages. For Hindi, three baselines (BiLSTM, ML-BERT, and CS-ELMo) obtain the best baseline F1-score of 0.909, where HIT yields a small improvement with a 0.915 F1-score. In comparison, we observe an improvement of 1.4% for Tamil, where HIT and the best baseline (CS-ELMo) report 0.473 and 0.459 F1-scores, respectively. We observe the same pattern for Malayalam and Spanish as well; in both cases, HIT obtains improvements of 0.9% and 2.0%, respectively. For Malayalam, HIT reports a 0.651 F1-score, whereas CS-ELMo reports 0.642. In the case of Spanish, HAN turns out to be the best baseline with a 0.440 F1-score; comparatively, HIT achieves 0.460. The last two rows of Table 2 report ablation results: a) excluding outer-product attention (Atn_outer) from HIT; and b) excluding sub-word embeddings (character-level HIT). In all cases, the absence of sub-word embeddings has a negative effect on performance, suggesting the effectiveness of character-level HIT in the architecture. On the other hand, omitting outer-product attention lowers F1-scores in 3 out of 4 cases; we observe a marginal improvement of 0.04 points for Malayalam. In summary, HIT attains state-of-the-art performance across all four datasets, whereas the best baseline (CS-ELMo) reports 1.2% lower scores on average.

Table 3: Performance of HIT on POS tagging. Best scores are highlighted in bold.
POS tagging: Table 3 shows the comparative results for POS tagging in Hindi, Telugu, Bengali, and Spanish. Similar to sentiment classification, we observe that HIT attains the best F1-scores on three of the four datasets (0.625% better on average). It achieves 0.919, 0.762, 0.853, and 0.825 F1-scores for Hindi, Telugu, Bengali, and Spanish, respectively. In comparison, CS-ELMo yields the best baseline F1-scores on three datasets, viz. Hindi (0.910), Telugu (0.775), and Bengali (0.847). For Spanish, ML-BERT obtains the best baseline F1-score of 0.802. The ablations again show a negative effect on performance when either the outer-product attention or the character-level HIT is removed, for the majority of cases.
NER: The performance of HIT for NER is also in line with the previous two tasks, as shown in Table 4. Across all three tasks, CS-ELMo is arguably the most consistent baseline. Together with the state-of-the-art performance of HIT, we attribute the good performance to subword-level contextual modelling: both systems use contextual representation models (ELMo and Transformer) to encode the syntactic and semantic features. Moreover, the FAME module in HIT improves the performance even further.
Machine Translation: Finally, Table 5 reports the results for the English to Hindi (En-Hi) machine translation task. For comparison, we also report BLEU, METEOR, and ROUGE-L scores for four baseline systems, including Seq2Seq.

Table 6: Transfer learning models. Code-mixed word representations, learned for a (source) task, are utilized for building models for other (target) tasks of the same language w/ and w/o fine-tuning. We highlight in bold the cases where transfer learning achieves better performance than the original base HIT.

Effects of Transfer Learning across Tasks
One of the core objectives of representation learning is that the learned representation should be task-invariant: representations learned for one task should also be (nearly) as effective for other tasks. The intuition is that the syntactic and semantic features captured for a language should be independent of the task; if this does not hold, the representation can be said to capture task-specific features instead of linguistic ones. To this end, we perform transfer learning experiments with (w/) and without (w/o) fine-tuning. Since we have only one dataset each for Tamil, Telugu, Bengali, and Malayalam, we choose the Hindi and Spanish code-mixed datasets (POS, NER, and sentiment classification) for this study. Table 6 reports results for both code-mixed languages. For each case, we learn HIT's representation on one (source) task and subsequently utilize the representation for the other two (target) tasks. Moreover, we repeat each experiment with and without fine-tuning HIT.
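The sketch below illustrates how such a transfer model can be assembled, assuming the pre-trained HIT encoder is available as a Keras layer (a hypothetical interface, not the released code); freezing the encoder corresponds to the w/o fine-tuning setting.

```python
import tensorflow as tf

def build_transfer_model(pretrained_hit, task_head, fine_tune=False):
    """Reuse a source-task HIT encoder for a target task, w/ or w/o fine-tuning.

    pretrained_hit: layer mapping token ids to word representations (assumed interface).
    task_head: target-task layers (e.g., a softmax tagger or classifier).
    """
    pretrained_hit.trainable = fine_tune                  # frozen encoder = w/o fine-tuning
    token_ids = tf.keras.Input(shape=(None,), dtype=tf.int32)
    representations = pretrained_hit(token_ids)
    outputs = task_head(representations)
    return tf.keras.Model(token_ids, outputs)
```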
For Hindi code-mixed, we do not observe a positive effect of transfer learning for NER. It could be because of the limited lexical variation of named entities in the other datasets. However, we obtain the best F1-score (0.936) for POS tagging in a transfer learning setup with sentiment classification as the source. Similarly, with sentiment classification as the target, we observe performance improvements with both POS and NER as source tasks. In Spanish, we observe improvements in F1-scores for all three tasks. We attribute these improvements to the availability of a larger number of sentences, which lets HIT leverage the linguistic features in both Hindi and Spanish.

Error Analysis
In this section, we analyze the performance of HIT both quantitatively and qualitatively. First, we report the confusion matrices for Hindi NER and sentiment classification in Table 7. In both cases, we observe that the true positives are substantial for all labels. Furthermore, the false positives are extremely low (except for 'B-Org' in NER) for the majority of cases, suggesting very good precision in general. Most of the errors are due to the neutral class in sentiment and the other class in NER. In sentiment analysis, 10% of the positive and negative labels each were mis-classified as neutral. Similarly, in NER, the organization entities (B-Org and I-Org) and the other class are confused with each other in a significant number of samples. This may be due to the under-represented (∼13%) organization entities in the dataset.
We also perform a qualitative error analysis of HIT and CS-ELMo. Table 8 reports examples for the NER and sentiment classification tasks. For the first sentiment classification example, HIT accurately predicts the sentiment label as positive, whereas CS-ELMo mis-classifies it as neutral. For the second example, both HIT and CS-ELMo wrongly predict the sentiment as neutral. One possible reason is the presence of the negatively-inclined word chodo (leave) in the sentence. For NER, the sentence has two entities (one person and one organization). While HIT correctly identifies 'dhan dhan satguru' as a person, it fails to recognize 'msg' as an organization. On the other hand, CS-ELMo correctly identifies both.
Furthermore, we take the first sentiment analysis example (from Table 8) to gain insight into HIT. It is not hard to see that the positive sentiment is conveyed chiefly by the phrase 'badhai ho sir' (congratulations sir). To validate this hypothesis, we use a gradient-based interpretation technique, Grad-CAM (Selvaraju et al., 2019), which uses the gradients of a neural network to show the effect of neurons on the final output. Due to the hierarchical and modular nature of HIT, we can extract the intermediate word-level representations learned by the character-level HIT and compute the gradient of the loss of the actual class with respect to these representations. The magnitude of the gradient shows the impact of each word on the final output. Figure 2a shows the word-level and character-level gradient maps for the original input. We can observe that HIT attends to the most important components in both cases. At the word level, it highlights the positive phrase 'liye badhai' (congratulations on). Moreover, the character-level HIT attends to the two syllables 'b' and 'dh' in the word 'badhai' (congratulation). This suggests that both the word-level and character-level components are capable of extracting important features from the input. Furthermore, to check robustness, we probe HIT with a perturbed input: we tweak the spelling of the most important word 'badhai' to 'badhaaii' (an out-of-vocabulary word for the dataset). Figure 2b shows similar patterns for the perturbed input as well. This indicates that HIT identifies the phonetic similarity of the two words and is flexible to spelling variants, a common feature in code-mixed environments.
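A minimal sketch of this gradient-based word attribution is given below; `word_representations` and `classify` are hypothetical accessors for the intermediate representations and the classifier head, not part of the released API.

```python
import tensorflow as tf

def word_saliency(model, token_ids, true_class):
    """Grad-CAM-style saliency: gradient of the true-class loss w.r.t. word representations."""
    reps = tf.convert_to_tensor(model.word_representations(token_ids))  # (N, d) char-level HIT outputs
    with tf.GradientTape() as tape:
        tape.watch(reps)
        probs = model.classify(reps)                 # class probabilities from the task head
        loss = -tf.math.log(probs[true_class])       # negative log-likelihood of the gold class
    grads = tape.gradient(loss, reps)                # (N, d) gradients per word
    return tf.norm(grads, axis=-1)                   # gradient magnitude = per-word importance
```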

HIT's Performance on Monolingual Data
In this section, we outline the performance of HIT in monolingual and low-resource settings. We consider the sentiment classification dataset curated by Akhtar et al. (2016), containing Hindi reviews with 4 sentiment labels: positive, negative, neutral, and conflict. We also utilize a Magahi POS dataset (Kumar et al., 2012), annotated with 33 tags from the BIS-tagset 4. We report the performance of HIT and other baselines on these two datasets in Table 9. For the Hindi sentiment classification task, HIT yields an F1-score of 0.635, which is better than CS-ELMo and ML-BERT by 9.3% and 5.9%, respectively. Also, for Magahi POS, HIT reports the best F1-score of 0.775, an improvement of +2.1% and +9.5% over CS-ELMo and ML-BERT, respectively. These results suggest that HIT is capable of handling monolingual and low-resource texts efficiently.

Learnable Parameters and Power Usage
We conduct all our experiments on a single Tesla T4 GPU. In Table 10, we report the number of learnable parameters and the power usage of HIT and the baseline models.

Conclusion
In this work, we present HIT, a hierarchical transformer-based framework for learning robust code-mixed representations. HIT contains a novel fused attention mechanism that computes a weighted sum of multi-headed self-attention and outer-product attention and is capable of capturing relevant information at a more granular level. We experiment with eleven code-mixed datasets for POS tagging, NER, sentiment classification, and MT across six languages and observe that HIT successfully outperforms existing state-of-the-art systems. We also demonstrate the task-invariant nature of the representations learned by HIT via a transfer learning setup, signifying its effectiveness in learning linguistic features of CM text rather than task-specific features. Finally, we qualitatively show that HIT successfully embeds semantically and phonetically similar words of a code-mixed language close together.

A.1 Semantic Understanding of Languages
In this section, we study the semantic relationships between different Indic languages. We report the proportion of common words between language pairs in Table 11. Moreover, this sharing is not driven by English: only 27% of the shared words are English, the lowest proportion among all language pairs. Originating from a similar root and bearing a phonetic resemblance, Tamil and Malayalam are sister languages 5. Similar observations can be made through the lens of word representations. We use t-SNE (Van der Maaten and Hinton, 2008) to project HIT's representations onto a 2-D space for interpretability (Fig 3). Although the embeddings are well clustered by language, semantically similar words across languages are clearly embedded in a similar space. Furthermore, Fig 3(b) shows that pronouns (e.g., 'aap') in Tamil, Telugu, and Hindi are embedded in a similar space as the Bengali words 'aamar' and 'aamay'. Although these representations are learned by separate models on separate datasets, the robustness of the underlying hierarchical representation enables our model to capture cross-lingual semantics from noisy code-mixed texts. We attribute these observations to the relatedness of Indic languages on a socio-cultural basis.
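A small sketch of how such a projection can be produced with scikit-learn; the word list and vectors below are placeholders, not the embeddings used for Figure 3.

```python
import numpy as np
from sklearn.manifold import TSNE

words = ["aap", "aamar", "aamay"]                    # example pronouns from the discussion
vectors = np.random.randn(len(words), 128)           # placeholder for real HIT word vectors
points = TSNE(n_components=2, perplexity=2, init="pca").fit_transform(vectors)
for word, (x, y) in zip(words, points):
    print(f"{word}: ({x:.2f}, {y:.2f})")
```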

A.2 Datasets
We report all available POS tags in Table 12.

A.3 Confusion Matrices and Error Analysis
We report the confusion matrices showing the label-wise performance for sentiment classification, POS tagging, and NER in Tables 5, 4, and 6, respectively.
We similarly perform a qualitative analysis on the MT task, where our model shows superior performance compared to the baselines. In example 1 of Table 13(d), HIT translates the English text "Licencing and import policies were liberalise" to "Licencing aur policies liberal the |". Although this prediction has a very low BLEU score when evaluated against the target, it reveals an interesting behaviour: the overall translation is a contextually meaningful sentence in Hindi. Further, HIT translates the phrase 'were liberalise' to 'liberal the'; in Hindi, 'the' marks the past tense. Another interesting observation is the ability of HIT to copy text from the source to the predicted output. Even without an explicit copying mechanism (See et al., 2017), HIT identifies the key tokens that co-occur in both Hindi and English, such as numerals and proper nouns, and copies them while generating. This shows how our model can also be used for conditional generation of text. It also ends the sentence with '|', a punctuation mark widely used as a full stop in Hindi texts.
Org: nunca pensé que "bruh" me frustraría tanto (labels: Neu / Neu / Neg)
Trans: I never thought that "bruh" would frustrate me so much