DCU-UVT: Word-Level Language Classification with Code-Mixed Data

This paper describes the DCU-UVT team's participation in the Language Identification in Code-Switched Data shared task in the Workshop on Computational Approaches to Code Switching. Word-level classification experiments were carried out using a simple dictionary-based method, linear kernel support vector machines (SVMs) with and without contextual clues, and a k-nearest neighbour approach. Based on these experiments, we select our SVM-based system with contextual clues as our final system and present results for the Nepali-English and Spanish-English datasets.


Introduction
This paper describes DCU-UVT's participation in the shared task Language Identification in Code-Switched Data (Solorio et al., 2014) at the Workshop on Computational Approaches to Code Switching, EMNLP, 2014. The task is to make word-level predictions (six labels: lang1, lang2, ne, mixed, ambiguous and other) for mixed-language user-generated content. We submit predictions for Nepali-English and Spanish-English data and perform experiments using dictionaries, a k-nearest neighbour (k-NN) classifier and a linear-kernel SVM classifier.
In our dictionary-based approach, we investigate the use of different English dictionaries as well as the training data. In the k-NN based approach, we use string edit distance, character n-gram overlap and context similarity to make predictions. For the SVM approach, we experiment with context-independent (word, character n-grams, length of a word and capitalisation information) and context-sensitive (adding the previous and next word as bigrams) features in different combinations. We also experiment with adding features from the k-NN approach and another set of features from a neural network. Based on performance in cross-validation, we select the SVM classifier with basic features (word, character n-grams, length of a word, capitalisation information and context) as our final system.

Data Statistics
The training data provided for this task consists of tweets. Unfortunately, because of deleted tweets, the full training set could not be downloaded. Out of 9,993 Nepali-English training tweets, we were able to download 9,668, and out of 11,400 Spanish-English training tweets, we were able to download 11,353. Table 1 shows the token-level statistics of the two datasets. Tokens labelled other are mentions (@username), punctuation symbols, emoticons, numbers (except numbers that represent words, such as 2 for to), words in a language other than lang1 and lang2, and unintelligible words. Named entities (ne) are much less frequent, and mixed-language words (e.g. ramriness) and words for which there is not enough context to disambiguate them (ambiguous) are rare. Hashtags are annotated as if the hash symbol were not there, e.g. #truestory is labelled lang1.

Experiments
All experiments are carried out for Nepali-English data. Later we apply the best approach to Spanish-English. We train our systems using five-fold cross-validation and select the best parameters based on average cross-validation results. Cross-validation splits are made based on users, i.e. we avoid the occurrence of a user's tweets in both the training and the test split of any cross-validation run. We address the task with the following approaches: (1) a simple dictionary-based classifier, (2) classification using supervised machine learning with k-nearest neighbour, and (3) classification using supervised machine learning with SVMs.
Table 2: Average cross-validation accuracy of dictionary-based prediction for Nepali-English
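The user-based cross-validation splits described in this section could be built roughly as follows. This is a minimal sketch, not the authors' code: the (user_id, tweet) input layout, the function names and the greedy balancing heuristic are all assumptions made for illustration, since the paper only states that a user's tweets never appear in both the training and the test split.

```python
from collections import defaultdict


def user_based_folds(tweets, n_folds=5):
    """Assign tweets to folds so that no user spans two folds.

    `tweets` is a list of (user_id, tweet) pairs. This layout is an
    assumption for illustration, not the shared-task file format.
    """
    by_user = defaultdict(list)
    for user_id, tweet in tweets:
        by_user[user_id].append(tweet)

    folds = [[] for _ in range(n_folds)]
    # Greedy balancing (an assumed heuristic): place the users with the
    # most tweets first, each into the currently smallest fold.
    for user_id in sorted(by_user, key=lambda u: (-len(by_user[u]), u)):
        smallest = min(range(n_folds), key=lambda i: len(folds[i]))
        folds[smallest].extend((user_id, t) for t in by_user[user_id])
    return folds


def train_test_splits(folds):
    """Yield (train, test) pairs, holding out one fold at a time."""
    for held_out in range(len(folds)):
        train = [x for i, f in enumerate(folds) if i != held_out for x in f]
        yield train, folds[held_out]
```

Each held-out fold then serves once as the test split, and parameters are chosen by the accuracy averaged over the five runs.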

Dictionary-Based Detection
We start with a simple dictionary-based approach using as dictionaries (a) the British National Corpus (BNC) (Aston and Burnard, 1998), (b) Han et al.'s lexical normalisation dictionary (LexNorm) (Han et al., 2012) and (c) the training data. The BNC and LexNorm dictionaries are built by recording all words occurring in the respective corpus or word list as English. For the BNC, we also collect word frequency information. For the training data, we obtain dictionaries for each of the six labels and each of the five cross-validation runs (using the relevant 4/5 of training data).
To make a prediction, we consult all dictionaries. If there is more than one candidate label, we choose the label for which the frequency of the query token is highest. To account for the fact that the BNC is much larger than the training data, we normalise all frequencies before comparison. LexNorm has no frequency information, hence it is added to our system as a simple word list (we consider the language of a word to be English if it appears in LexNorm). If a word appears in multiple dictionaries with the same frequency, or if the word does not appear in any dictionary or list, the predicted language is chosen based on the dominant language(s)/label(s) of the corpus.
We experiment with the individual dictionaries and the combination of all three dictionaries, among which the combination achieves the highest cross-validation accuracy (90.71%). Table 2 shows the results of dictionary-based detection obtained in five-fold cross-validation.
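The prediction rule of the combined dictionary system can be sketched as below. Several details are assumptions rather than facts from the paper: frequencies are normalised to relative frequencies per dictionary, the BNC counts towards the English label (passed as `english`, here lang1), LexNorm is only consulted when no frequency-bearing dictionary contains the token, and `fallback` stands for the dominant label of the corpus. All function and parameter names are hypothetical.

```python
from collections import Counter


def normalise(counts):
    """Turn raw counts into relative frequencies (assumed normalisation)."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def build_label_dicts(labelled_tokens):
    """Build one normalised frequency dictionary per label."""
    per_label = {}
    for token, label in labelled_tokens:
        per_label.setdefault(label, Counter())[token] += 1
    return {label: normalise(counts) for label, counts in per_label.items()}


def predict(token, label_dicts, bnc_freq, lexnorm, fallback, english="lang1"):
    """Pick the label whose dictionary gives the token the highest frequency."""
    scores = {label: freqs[token]
              for label, freqs in label_dicts.items() if token in freqs}
    if token in bnc_freq:
        # Assumption: BNC evidence counts towards the English label.
        scores[english] = max(scores.get(english, 0.0), bnc_freq[token])
    if not scores:
        # Assumption: LexNorm is a plain word list used as a last resort.
        return english if token in lexnorm else fallback
    best = max(scores.values())
    winners = [label for label, score in scores.items() if score == best]
    # Ties fall back to the dominant label of the corpus.
    return winners[0] if len(winners) == 1 else fallback
```

In a cross-validation run, `build_label_dicts` would be called on the relevant 4/5 of the training data only.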

Classification with k-NN
For Nepali-English, we also experiment with a simple k-nearest neighbour (k-NN) approach. For each test item, we select a subset of the training data using string edit distance and n-gram overlap and choose the majority label of the subset as our prediction. For efficiency, we first select k1 items that share an n-gram with the token to be classified (footnote 1). The set of k1 items is then re-ranked according to string edit distance to the test item and the best k2 matches are used to make a prediction.
Apart from varying k1 and k2, we experiment with (a) lowercasing strings, (b) including context by concatenating the previous, current and next token, and (c) weighting context by first calculating edit distances for the previous, current and next token separately and using a weighted average. The best configuration we found in cross-validation uses lowercasing with k1 = 800 and k2 = 16 but no context information. It achieves an accuracy of 94.97%.
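A minimal sketch of the two-stage k-NN procedure (shortlist by shared character n-grams, re-rank by edit distance, majority vote) is given below, using the best reported configuration without context. The back-off from n = 5 follows the description in the paper's footnote; the function names, the fixed random seed and the tie-breaking of the majority vote are assumptions, and the training list is assumed to be non-empty.

```python
import random
from collections import Counter


def char_ngrams(s, n):
    """Set of character n-grams of s (no boundary padding)."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def edit_distance(a, b):
    """Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def shortlist(query, train, k1, rng, max_n=5):
    """Select up to k1 training items, backing off from n = max_n to n = 0."""
    chosen, chosen_ids = [], set()
    for n in range(max_n, -1, -1):
        if n == 0:
            # n = 0: every remaining item qualifies (random sample below).
            new = [i for i in range(len(train)) if i not in chosen_ids]
        else:
            grams = char_ngrams(query, n)
            new = [i for i, (tok, _) in enumerate(train)
                   if i not in chosen_ids and grams & char_ngrams(tok, n)]
        if len(chosen) + len(new) >= k1:
            # Randomly drop items from the last step to reach exactly k1.
            rng.shuffle(new)
            chosen.extend(new[:k1 - len(chosen)])
            break
        chosen.extend(new)
        chosen_ids.update(new)
    return [train[i] for i in chosen]


def knn_predict(query, train, k1=800, k2=16, seed=0):
    """Majority label among the k2 shortlisted items closest in edit distance."""
    rng = random.Random(seed)
    query = query.lower()
    train = [(tok.lower(), label) for tok, label in train]
    candidates = shortlist(query, train, k1, rng)
    candidates.sort(key=lambda item: edit_distance(query, item[0]))
    votes = Counter(label for _, label in candidates[:k2])
    return votes.most_common(1)[0][0]
```

Lowercasing the training list on every call keeps the sketch short; a real system would precompute it together with an n-gram index.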

SVM Classification
We experiment with linear kernel SVM classifiers using Liblinear (Fan et al., 2008). Parameter optimisation (footnote 2) is performed for each feature set combination to obtain the best cross-validation accuracy.
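The parameter search can be pictured as a grid search over the SVM cost parameter C, using the grid C = 2^i for i = −15, ..., 10 given in the paper's footnote. The sketch below only shows the selection logic: `fold_accuracy` is a hypothetical callback that would train and evaluate a classifier for one value of C on one cross-validation fold, and no Liblinear call is made.

```python
from statistics import mean

# Grid from the paper's footnote: C = 2^i with i = -15, -14, ..., 10.
C_GRID = [2.0 ** i for i in range(-15, 11)]


def select_c(fold_accuracy, c_grid=C_GRID, n_folds=5):
    """Return the C with the highest accuracy averaged over the folds.

    `fold_accuracy(c, fold)` is a hypothetical callback that trains on all
    folds except `fold` with cost `c` and returns accuracy on `fold`.
    """
    return max(c_grid,
               key=lambda c: mean(fold_accuracy(c, f) for f in range(n_folds)))
```

The values of C reported in the paper (0.125, 0.25 and 128) all lie on this grid.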

Basic Features
Following previous work, our basic features are:
Char-N-Grams (G): We start with a character n-gram-based approach (Cavnar and Trenkle, 1994). Following King and Abney (2013), we select lowercased character n-grams (n=1 to 5) and the word as the features in our experiments.

Dictionary-Based Labels (D): We use presence in the dictionary of the 5,000 most frequent words in the BNC and presence in the LexNorm dictionary as binary features (footnote 3).
Length of words (L): We create multiple features for token length using a decision tree (J48). We use length as the only feature to train a decision tree for each fold and use the nodes obtained from the tree to create boolean features (Rubino et al., 2013).
Context (P_i and N_j): We consider the previous i and next j tokens to be combined with the current token, forming an (i+1)-gram and a (j+1)-gram, which we add as features. Six settings are tested. Table 4 shows that the bigrams formed with the previous and next word are the best combination for the task (among those tested).
Among the eight combinations of the first four feature sets that contain the first set (G), Table 3 shows that the 6-way SVM classifier performs best with all feature sets (GDLC, where C denotes capitalisation information), achieving 96.40% accuracy. Adding contextual information P_i N_j to GDLC, Table 4 shows best results for i=j=1, achieving 96.42% accuracy, only slightly ahead of the context-independent system.
Footnote 1: Starting with n = 5, we decrease n until there are at least k1 items and then we randomly remove items added in the last augmentation step to arrive at exactly k1 items. (For n = 0, we randomly sample from the full training data.)
Footnote 2: C = 2^i with i = −15, −14, ..., 10.
Footnote 3: We chose these parameters based on experiments with each dictionary, combinations of dictionaries and various frequency thresholds. We apply a frequency threshold to the BNC to increase precision. We rank the words according to frequency and use the rank as a threshold (e.g. top-5K, top-10K etc.). With the top 5,000 ranked words and C = 0.25, we obtained best accuracy (96.40%).
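The character n-gram, dictionary and context features described above can be illustrated with the following sketch, which produces a set of string-valued indicator features per token. The feature name prefixes, the sentence boundary markers, lowercasing for dictionary lookup and keeping the word feature in its original casing are all assumptions; the length features from the J48 tree and the capitalisation features are left out because the text does not specify them in enough detail.

```python
def char_ngram_features(token, max_n=5):
    """G: lowercased character n-grams (n = 1..max_n) plus the word itself."""
    word = token.lower()
    feats = {"w=" + token}  # assumption: word feature keeps original casing
    for n in range(1, max_n + 1):
        for i in range(len(word) - n + 1):
            feats.add("g%d=%s" % (n, word[i:i + n]))
    return feats


def dictionary_features(token, bnc_top, lexnorm):
    """D: binary presence in the top-ranked BNC words and in LexNorm."""
    word = token.lower()  # assumption: case-insensitive lookup
    feats = set()
    if word in bnc_top:
        feats.add("in_bnc_top")
    if word in lexnorm:
        feats.add("in_lexnorm")
    return feats


def context_features(tokens, i):
    """P1N1: bigrams of (previous, current) and (current, next) token."""
    prev_tok = tokens[i - 1] if i > 0 else "<s>"  # assumed boundary marker
    next_tok = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return {"p1=%s_%s" % (prev_tok, tokens[i]),
            "n1=%s_%s" % (tokens[i], next_tok)}


def token_features(tokens, i, bnc_top, lexnorm):
    """Union of the G, D and P1N1 features for the token at position i."""
    tok = tokens[i]
    return (char_ngram_features(tok)
            | dictionary_features(tok, bnc_top, lexnorm)
            | context_features(tokens, i))
```

Each feature string would then be mapped to a column of a sparse binary matrix before training the linear SVM.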

Neural Network (Elman) and k-NN Features
We experiment with two additional feature sets beyond the basic features:
Neural Network (Elman): We extract features from the hidden layer of a recurrent neural network that has been trained to predict the next character in a string (Chrupała, 2014). The 10 most active units of the hidden layer for each of the initial 4 bytes and final 4 bytes of each token are binarised by using a threshold of 0.5.
Table 5: Average cross-validation accuracy of 6-way SVMs of combinations of GDLC, k-NN, Elman and P1N1 features for Nepali-English
k-Nearest Neighbour (kNN): We obtain features from our basic k-NN approach (Section 4.2), encoding the prediction of the k-NN model with six binary features (one for each label) and a numeric feature for each label stating the relative number of votes for the label, e.g. if k2 = 16 and 12 votes are for lang1 the value of the feature votes4lang1 will be 0.75. Furthermore, we add two features stating the minimum and maximum edit distance between the test token and the k2 selected training tokens. Table 5 shows cross-validation results for these new feature sets with and without the P1N1 context features. Excluding the GDLC features, we can see that best accuracy is with k-NN and P1N1 features (95.11%). For Elman features, the accuracy is lower (91.53% with context). In combination with the GDLC features, however, the Elman features can achieve a small improvement over the GDLC+P1N1 combination (+0.04 percentage points): 96.46% accuracy for the GDLC+Elman setting (without P1N1 features). Furthermore, the k-NN features do not combine well (footnote 5).
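The k-NN-derived features can be written down directly from this description: six binary prediction indicators, six relative vote counts and the minimum and maximum edit distance, giving 14 features per token. In the sketch below only votes4lang1 is a feature name taken from the text; the other names, the function signature and the dictionary output format are assumptions.

```python
from collections import Counter

LABELS = ["lang1", "lang2", "ne", "mixed", "ambiguous", "other"]


def knn_features(neighbour_labels, neighbour_distances, predicted):
    """Encode a k-NN decision as 14 numeric features.

    `neighbour_labels` and `neighbour_distances` describe the k2 selected
    training tokens; `predicted` is the label chosen by the k-NN model.
    """
    k2 = len(neighbour_labels)
    counts = Counter(neighbour_labels)
    feats = {}
    for label in LABELS:
        # One binary indicator per label for the k-NN prediction.
        feats["pred_is_" + label] = 1.0 if label == predicted else 0.0
        # Relative number of votes for this label among the k2 neighbours.
        feats["votes4" + label] = counts[label] / k2
    feats["min_edit_distance"] = float(min(neighbour_distances))
    feats["max_edit_distance"] = float(max(neighbour_distances))
    return feats
```

With k2 = 16 and 12 neighbours labelled lang1, this reproduces the value 0.75 for votes4lang1 given in the text.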

Final System and Test Results
At the time of submission of predictions, we had an error in our GDLC+Elman feature combiner, resulting in slightly lower performance. Therefore, we selected SVM-GDLC-P1N1 as our final approach and trained the final two systems using the full training data for Nepali-English and Spanish-English respectively. While we knew from our experiments that C = 0.125 is best for Nepali-English, we had to re-tune parameter C for Spanish-English using cross-validation on the training data. We found best accuracy of 94.16% for Spanish-English with C = 128. Final predictions for the test sets are made using these systems. Table 6 shows the test set results. The test set for this task is divided into tweets and a surprise genre. For the tweets, we achieve 96.3% and 84.4% accuracy (overall token-level accuracy) for Nepali-English and Spanish-English respectively. For the surprise genre (a collection of posts from Facebook and blogs), we achieve 85.6% for Nepali-English and 94.4% for Spanish-English.
Footnote 5: A possible explanation may be that the k-NN features are based on only 3 of 5 folds for the training data (3 folds are used to make predictions for the 4th set) but 4 of 5 folds are used for test data predictions in each cross-validation run.

Conclusion
To summarise, we achieved reasonable accuracy with a 6-way SVM classifier by employing basic features only. We found that using dictionaries is helpful, as are contextual features. The performance of the k-NN classifier is also notable: it is only 1.45 percentage points behind the final SVM-based system (in terms of cross-validation accuracy). Adding neural network features can further increase the accuracy of systems.
Briefly opening the test files to check for formatting issues, we notice that the surprise genre data contains language-specific scripts that could easily be addressed in an English vs. non-English scenario.