Word-level Language Identification using CRF: Code-switching Shared Task Report of MSR India System

We describe a CRF-based system for word-level language identification of code-mixed text. Our method uses lexical, contextual, character n-gram, and special character features, and can therefore be replicated easily across languages. Its performance is benchmarked against the test sets provided by the shared task on code-mixing (Solorio et al., 2014) for four language pairs, namely, English-Spanish (En-Es), English-Nepali (En-Ne), English-Mandarin (En-Cn), and Standard Arabic-Arabic Dialects (Ar-Ar). The experimental results show consistent performance across the language pairs.


Introduction
Code-mixing and code-switching in conversations have been studied extensively for several years and have been analyzed from structural, psycholinguistic, and sociolinguistic perspectives (Muysken, 2001; Poplack, 2004; Senaratne, 2009; Boztepe, 2005). Although bilingualism is very common in many countries, it has seldom been studied in detail in computer-mediated communication, and more particularly in social media. A large portion of related work (Androutsopoulos, 2013; Paolillo, 2011; Dabrowska, 2013; Halim and Maros, 2014) does not explicitly deal with computational modeling of this phenomenon. Therefore, identifying code-mixing in social media conversations and on the web is a very relevant topic today. It has garnered interest recently in the context of basic NLP tasks (Solorio and Liu, 2008b; Solorio and Liu, 2008a), IR (Roy et al., 2013) and social media analysis (Lignos and Marcus, 2013). It should also be noted that identifying languages in code-switched text is different from identifying multiple languages in documents (Nguyen and Dogruz, 2013), as the different languages contained in a single document are not necessarily due to instances of code-switching.
* The author contributed to this work during his internship at Microsoft Research India.
In this paper, we present a system built with off-the-shelf tools that uses several character- and word-level features to solve the EMNLP Code-Switching shared task (Solorio et al., 2014) of labeling a sequence of words with six tags, viz. lang1, lang2, mixed, ne, ambiguous, and other. Here, lang1 and lang2 refer to the two languages that are mixed in the text, which could be English-Spanish, English-Nepali, English-Mandarin or Standard Arabic-dialectal Arabic. mixed refers to tokens with morphemes from both lang1 and lang2; ne marks named entities; a word whose label cannot be determined with certainty in the given context is labeled ambiguous; and everything else (smileys, punctuation, etc.) is tagged other.
The report is organized as follows. In Sec. 2, we present an overview of the system and detail the features. Sec. 3 describes the training experiments used to fine-tune the system. The shared task results on the test data provided by the organizers are reported and discussed in Sec. 4. In Sec. 5 we conclude with some pointers to future work.

System overview
The task can be viewed as a sequence labeling problem, where, as in POS tagging, each token in a sentence needs to be labeled with one of the six tags. Conditional Random Fields (CRF) are a reasonable choice for such sequence labeling tasks (Lafferty et al., 2001); previous work (King and Abney, 2013) has shown that they perform well on the language identification task too. Therefore, in our work, we explored various token-level and contextual features to build an optimal CRF using the provided training data. The features used can be broadly grouped as described below:
Capitalization Features: They capture whether the letter(s) in a token have been capitalized. The reason for using this feature is that in several languages, capital Roman letters are used to denote proper nouns, which could correspond to named entities. This feature is meaningful only for languages written in scripts that make a case distinction (e.g., Roman, Greek and Cyrillic scripts).
Contextual Features: They comprise the current and surrounding tokens and the length of the current token. Code-switching points are context sensitive and depend on various structural restrictions (Muysken, 2001; Poplack, 1980).
Special Character Features: They capture the presence of special characters and numbers in the token. Tweets contain various entities like hashtags, mentions, links, smileys, etc., which are signaled by #, @ and other special characters.
Lexicon Features: These features indicate the existence of a token in lexicons. Common words in a language and named entities can be curated into finite, manageable lexicons and were therefore used for cases where such data was available.
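The capitalization, contextual, special character and lexicon feature groups described above can be illustrated with a minimal Python sketch. The function names, the feature keys, the `<PAD>` symbol and the toy lexicon are assumptions made for this example; they are not the feature identifiers used in Table 1 and not the authors' implementation.

```python
def token_features(token, lexicon=frozenset()):
    """Surface features for a single token (hypothetical feature names)."""
    return {
        # Capitalization features
        "first_upper": token[:1].isupper(),
        "any_upper": any(ch.isupper() for ch in token),
        "all_upper": token.isupper(),
        # Special character features
        "has_digit": any(ch.isdigit() for ch in token),
        "starts_hash": token.startswith("#"),
        "starts_at": token.startswith("@"),
        "has_symbol": any(not ch.isalnum() for ch in token),
        # Token length (grouped with the contextual features in the text)
        "length": len(token),
        # Lexicon feature: case-insensitive membership in a word list
        "in_lexicon": token.lower() in lexicon,
    }


def context_features(tokens, i, window=1):
    """Current and surrounding tokens, padded at the sentence boundaries."""
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        feats[f"tok[{offset}]"] = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
    return feats


if __name__ == "__main__":
    toy_lexicon = frozenset({"hola", "casa"})  # illustrative word list
    sentence = ["Hola", "@maria", "#Madrid"]
    for i, tok in enumerate(sentence):
        print(tok, token_features(tok, toy_lexicon), context_features(sentence, i))
```

Each dictionary would still have to be serialized into whatever column layout the sequence labeler expects.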
Character n-gram features: Following King and Abney (2013), we also used character n-grams for n=1 to 5. However, instead of directly using the n-grams as features in the CRF, we trained two binary maximum entropy classifiers to identify words of lang1 and lang2. Each classifier returned the probability that a word is of lang1 (or lang2), which was then binned into 10 equal buckets and used as a feature.
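A minimal sketch of the two steps around the classifier is given below: extracting character n-grams for n=1 to 5, and mapping a classifier probability to one of 10 buckets. The classifier itself is omitted. The function names are hypothetical, and the sketch assumes that "equal buckets" means equal-width intervals over [0, 1] and that no word-boundary markers are added, neither of which is specified in the text.

```python
def char_ngrams(word, n_min=1, n_max=5):
    """All character n-grams of `word` for n_min <= n <= n_max."""
    return [
        word[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(word) - n + 1)
    ]


def probability_bin(p, n_bins=10):
    """Map a probability in [0, 1] to an equal-width bucket index 0..n_bins-1."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    # p == 1.0 would fall into bucket n_bins, so clamp it to the last bucket.
    return min(int(p * n_bins), n_bins - 1)


if __name__ == "__main__":
    print(char_ngrams("casa", 1, 2))
    print(probability_bin(0.37))
```

The bucket index, rather than the raw probability, is what would be passed to the CRF as a discrete feature value.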
The features are listed in Table 1.

Data extraction and pre-processing
The Ruby script provided by the shared task organizers was used to retrieve tweets for each of the language pairs. Tweets that could not be downloaded, either because they had been deleted or because they were protected, were excluded from the training set. Table 2 shows the number of tweets that we were able to retrieve for the released datasets. Further, we found a few rare cases of tokenization errors, evident from the occurrence of spaces within tokens. These tokens were not removed from the training set; instead, the spaces in them were replaced by underscores.
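The space-to-underscore repair can be sketched in a few lines of Python. The function name is hypothetical, and the sketch assumes that each space is replaced individually.

```python
def repair_token(token):
    """Replace spaces inside a token with underscores so the token stays a single column."""
    return token.replace(" ", "_")


if __name__ == "__main__":
    print(repair_token("buenos dias"))
```

This keeps every token free of whitespace, which matters for labelers that read whitespace-separated columns.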

Feature extraction and labeling
Named entities for English and Spanish were obtained from DBPedia instance types, namely, Agent, Award, Device, Holiday, Language, MeansOfTransportation, Name, PersonFunction, Place, and Work. Frequency lists for these languages were obtained from the Leipzig Corpora Collection (Quasthoff et al., 2006); words containing special characters and numbers were removed from the lists. The files used are listed in Table 3. The character n-gram classifiers were implemented using the MaxEnt classifier provided in MALLET (McCallum, 2002). The classifiers were trained on 6,000 positive examples randomly sampled from the training set and on negative examples sampled both from the training set and from the word lists of multiple languages from Quasthoff et al. (2006); the number of examples used for each of these classifiers is given in Table 4. We used CRF++ (Kudo, 2014) for labeling the tweets. For all language pairs, CRF++ was run under its default settings.
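For readers unfamiliar with CRF++, its training files hold one token per line with whitespace-separated columns, the label in the last column, and an empty line between sequences. The sketch below serializes feature rows into that layout. The function name and the example feature columns are assumptions for illustration; the actual columns and feature templates used in the system are not given in this section.

```python
def to_crfpp_lines(sentences):
    """Serialize sentences into CRF++-style lines.

    `sentences` is a list of sentences; each sentence is a list of
    (token, feature_values, label) triples. Every row must have the
    same number of columns, and the label goes last.
    """
    lines = []
    width = None
    for sentence in sentences:
        for token, feature_values, label in sentence:
            row = [token, *map(str, feature_values), label]
            if width is None:
                width = len(row)
            elif len(row) != width:
                raise ValueError("all rows must have the same number of columns")
            if any(" " in col or "\t" in col for col in row):
                raise ValueError("columns must not contain whitespace")
            lines.append("\t".join(row))
        lines.append("")  # an empty line ends the sequence
    return lines


if __name__ == "__main__":
    toy = [[("hola", [4, 0], "lang2"), ("world", [5, 0], "lang1")]]
    print("\n".join(to_crfpp_lines(toy)))
```

The whitespace check shows why tokens containing spaces have to be repaired before the file is written.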

Model selection
For each language pair, we experimented with various feature combinations using 3-fold cross validation on the released training sets. Table 5 reports the token-level labeling accuracies for the various models, based on which the optimal feature set for each language pair was chosen. These optimal features are reported in Table 1, and the corresponding performance for 3-fold cross validation in

Overall token labeling accuracy
The overall token labeling accuracies for the regular and surprise test sets (wherever applicable) and a second set of dialectal and standard Arabic are reported in the last two rows of Table 5. The same table also reports the results of the 3-fold cross validation on the training datasets. Several important observations can be made from these accuracy values. Firstly, the accuracies observed during the training phase were quite high (∼95%) and very similar for the En-Es, En-Ne and En-Cn data; for the Ar-Ar dataset, however, our method could achieve only up to 85% accuracy. We believe that this is due to the unavailability of any of the lexicon features, which in turn was because we did not have access to any lexicon for dialectal Arabic. While the complete set of lexical features was not available for En-Cn either, we did have an English lexicon; also, we noticed that in the En-Cn dataset, the En words were almost always written in Roman script and the Cn words in Chinese script. Hence, in this case, script itself is a very effective feature for classification, which has been indirectly modeled by the CHR0 feature. In the Ar-Ar datasets, on the other hand, both varieties are written in the same script (Arabic). Further, we found that using the CNG0 feature, which is obtained by training a character n-gram classifier for the language pair, resulted in a drop in performance. Since we are not familiar with the Arabic script, we are not sure how effective character n-gram based features are in differentiating between standard and dialectal Arabic. Based on our experiment with CNG0, we hypothesize that the dialects may not show a drastic difference in their character n-gram distributions, and that these features therefore may not contribute to the performance of our system.
Secondly, we observe that the effectiveness of the different feature sets varies across language pairs. Using all the features of the previous words (context = B) seems to hurt the performance, though just looking at the previous 3 and next 3 tokens was useful. In Ar-Ar, on the other hand, the reverse has been observed. Apart from lexicons, character n-grams seem to be a very useful feature in En-Es classification. As discussed above, CHR* features are effective for En-Cn because, among other things, one of these features also captures whether the word is in Roman script. For En-Ne, we do not see any particular feature or set of features that strongly influences the classification.
The overall token labeling accuracy of the shared task runs differs, at least in some cases, quite significantly from our 3-fold cross validation results. On the regular test sets, the results for En-Ne are very similar to the training set results, and those for En-Cn and Ar-Ar are within the expected range of them. However, we observe a 10% drop in En-Es. We observe an even bigger drop in the accuracy on the second Ar-Ar test set. We will discuss the possible reason for this in the next subsection. The accuracies on the surprise sets do not show any specific trend. While for En-Es the accuracy is 5% higher on the surprise set than on the regular set, En-Ne and Ar-Ar show the reverse, which is the more expected trend. The rather drastic drops in accuracy for these two pairs on the surprise sets make error analysis and comparative analysis of the training, test and surprise datasets imperative. Table 6 reports the F-scores for the six labels, i.e., classes, and also an overall tweet/post level accuracy. The latter is defined as the percentage of input units (which could be a tweet, a post or just a sentence, depending on the dataset) that are correctly identified as either code-mixed or monolingual; an input unit is considered code-mixed if there is at least one word labeled as lang1 and one as lang2.
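The tweet/post level accuracy defined above can be written down directly. The following Python sketch follows that definition; the function names and the toy label sequences are illustrative assumptions, and this is not the official evaluation script of the shared task.

```python
def is_code_mixed(labels):
    """An input unit is code-mixed if it has at least one lang1 and one lang2 token."""
    return "lang1" in labels and "lang2" in labels


def unit_level_accuracy(gold_units, predicted_units):
    """Percentage of units whose code-mixed/monolingual status is predicted correctly."""
    if len(gold_units) != len(predicted_units):
        raise ValueError("gold and predicted data must contain the same number of units")
    if not gold_units:
        raise ValueError("at least one unit is required")
    correct = sum(
        is_code_mixed(gold) == is_code_mixed(pred)
        for gold, pred in zip(gold_units, predicted_units)
    )
    return 100.0 * correct / len(gold_units)


if __name__ == "__main__":
    gold = [["lang1", "lang2", "other"], ["lang1", "lang1"], ["lang2", "ne"]]
    pred = [["lang1", "lang2", "other"], ["lang1", "lang2"], ["lang2", "lang2"]]
    print(round(unit_level_accuracy(gold, pred), 1))
```

Note that a unit can be counted as correct at this level even when some of its individual token labels are wrong, as in the third toy unit.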

Error Analysis
For all the language pairs other than Ar-Ar, the F-score for NE is much lower than that for lang1 and lang2. Thus, the performance of the system can be significantly improved by identifying NEs better. Currently, we have used lexicons for only English and Spanish. This information was not available for the other languages, namely, Nepali, Mandarin, and Arabic. The problem of NE detection is further compounded by the informal nature of the sentences, because of which NEs may not always be capitalized or spelt properly. Better detection of NEs in code-mixed and informal text is an interesting research challenge that we plan to tackle in the future.
Note that the ambiguous and mixed classes can be ignored because their combined occurrence is less than 0.5% in all the datasets, and hence they have practically no effect on the final labeling accuracy. In fact, their rarity (especially in the training set) is also the reason behind the very poor F-scores for these classes. In En-Cn, we also observe a low F-score for other.
In the Ar-Ar training data as well as the test set, there are fewer words of lang2, i.e., dialectal Arabic. Since our system was trained primarily on the context and word features (and not lexicon or character n-grams), there were not enough examples of lang2 in the training set to learn a reliable model for identifying it. Moreover, due to the distributional skew, the system learnt to label tokens as lang1 with very high probability. The high accuracy on the original Ar-Ar test set is because 81.5% of the tokens in the test data were indeed of type lang1, while only 0.26% were labeled as lang2. This is also reflected by the fact that though the F-score for lang2 in the Ar-Ar test set is 0.158, the overall accuracy is still 90.1% because the F-score for lang1 is 94.2%.
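The way a skewed class distribution lets overall accuracy stay high while the F-score of the rare class collapses can be seen in a small worked example. The Python sketch below uses invented toy counts (90 lang1 tokens, 10 lang2 tokens), not the shared task data, and the function names are hypothetical.

```python
def accuracy(gold, pred):
    """Fraction of tokens whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f_score(gold, pred, label):
    """Balanced F-score (F1) of one label; 0.0 if the label is never predicted correctly."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy data: a tagger biased towards lang1 recovers only 2 of 10 lang2 tokens.
    gold = ["lang1"] * 90 + ["lang2"] * 10
    pred = ["lang1"] * 98 + ["lang2"] * 2
    print("accuracy:", accuracy(gold, pred))
    print("F(lang1):", round(f_score(gold, pred, "lang1"), 3))
    print("F(lang2):", round(f_score(gold, pred, "lang2"), 3))
```

In this toy setting, 92 of the 100 tokens are labeled correctly even though eight of the ten lang2 tokens are missed, so the F-score for lang2 is only about one third.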
As shown in Table 7, the distribution of the classes in the second Ar-Ar test set and the surprise set is much less skewed and thus, very different from that of the training and original test sets. In fact, words of lang2 occur more frequently in these sets than those of lang1. This difference in class distributions, we believe, is the primary reason behind the poorer performance of the system on some of the Ar-Ar test sets.
We also observe a significant drop in accuracy on the En-Ne surprise data, as compared to the accuracy on the regular En-Ne test and training data. We suspect that this could be due to a difference in the class distribution, in the genre/style of the two datasets, or both. An analysis of the surprise test set reveals that a good fraction of the data consists of long song titles or parts of the lyrics of various Nepali songs. Many of these words were labeled as lang2 (i.e., Nepali) by our system, but were actually labeled as NEs in the gold annotations. While song titles can certainly be considered NEs, it is very difficult to identify them without appropriate resources. It should however be noted that the En-Ne surprise set has only 1087 tokens, which is too small to base any strong claims or conclusions on.
Table 7: Distribution (in %) of the classes in the training and the three test sets for Ar-Ar.

Conclusion
In this paper, we have described a CRF-based word labeling system for word-level language identification of code-mixed text. The system relies on annotated data for supervised training and also on lexicons of the languages, if available. Character n-grams of the words were also used in a MaxEnt classifier to detect the language of a word. This feature has been found to be useful for some language pairs. Since none of the techniques or concepts used here is language specific, we believe that this approach is applicable to word labeling for code-mixed text between any two (or more) languages, as long as annotated data is available. This is demonstrated by the fact that the system performs more or less consistently, with accuracies ranging from 80% to 95% across four language pairs (except for the second Ar-Ar test set and the surprise set, where the lower accuracy is due to stark distributional differences between the training and test sets). NE detection is one of the most challenging problems, and improving it should improve the overall performance of our system. It will be interesting to explore semi-supervised and unsupervised techniques for solving this task because creating annotated datasets is expensive and effort-intensive.