Lexical Normalization for Code-switched Data and its Effect on POS Tagging

Lexical normalization, the translation of non-canonical data to standard language, has been shown to improve the performance of many natural language processing tasks on social media. Yet, using multiple languages in one utterance, also called code-switching (CS), is frequently overlooked by these normalization systems, despite its common use in social media. In this paper, we propose three normalization models specifically designed to handle code-switched data, which we evaluate for two language pairs: Indonesian-English and Turkish-German. For the latter, we introduce novel normalization layers and their corresponding language ID and POS tags for the dataset, and evaluate the downstream effect of normalization on POS tagging. Results show that our CS-tailored normalization models significantly outperform monolingual ones, and lead to a 5.4% relative performance increase for POS tagging as compared to unnormalized input.


Introduction
Social media provide an invaluable source of information for natural language processing (NLP) systems. Their informal and spontaneous nature leads to many interesting phenomena, like nonstandard words, spelling errors and abbreviations. One particularly challenging and interesting phenomenon is the use of multiple languages within the same utterance, which is also called code-switching (CS) (Gumperz, 1982; Myers-Scotton, 1995; Toribio and Bullock, 2012).
Because most NLP models are designed to process canonical and monolingual data, their performance drops enormously when having to process social media data (Eisenstein, 2013). One solution to this problem is lexical normalization: the translation of non-standard (e.g. social media) text to its canonical form (Han and Baldwin, 2011). Previous work has shown that by standardizing the data, we can improve the robustness of NLP systems (Derczynski et al., 2013; van der Goot and van Noord, 2017). Nevertheless, these systems overlook code-switching. (1) shows a code-switched tweet (upper) and its normalization annotation (lower), taken from an Indonesian-English CS corpus (Barik et al., 2019) (Indonesian in bold). This example demonstrates that CS complicates normalization, because it can be unclear in which language to normalize (e.g., ak is normalized to aku 'I' in Indonesian, whereas English-only normalization systems would probably normalize it to ok).
1 Source code is available at: https://bitbucket.org/robvanderg/csmonoise. The Turkish-German data is available at: https://github.com/ozlemcek/TrDeNormData
(1) ak . . .
    aku . . .
Recently, there has been an increasing interest in the automatic processing of CS data; however, there has not been much work on its lexical normalization. To the best of our knowledge, only Adouane et al. (2019) focus entirely on lexical normalization for CS data. For other works, normalization is a preprocessing step for downstream tasks: chunking (Sharma et al., 2016), parsing (Bhat et al., 2017, 2018), or machine translation (Barik et al., 2019). These CS normalizers are either rule-based and language-specific (Barik et al., 2019) or combine (Hindi) back-transliteration and normalization (Sharma et al., 2016; Bhat et al., 2017, 2018); thus, they are not directly applicable to other lexical normalization datasets. In this work:
• We are the first to present open-source normalization models specialized for CS lexical normalization without any language-specific components.
• We provide a novel lexical normalization dataset by annotating a Turkish-German Twitter corpus (Çetinoğlu, 2016). We also align existing annotation layers, namely language IDs (LID) and part-of-speech (POS) tags, to normalization annotations.
• We evaluate three CS normalization models on two language pairs (Turkish-German (Tr-De), Indonesian-English (Id-En)). For both datasets, CS models reach performance in a similar range as monolingual models reach on monolingual datasets.
• Our CS-tailored normalization models outperform the Id-En state of the art and set the state of the art for the Tr-De dataset.
• We show that our proposed normalization models improve the performance of POS taggers. For a broad perspective, we employ a variety of taggers (CRF, BiLSTM, BERT).

Related Work
Lexical normalization Traditionally, social media normalization approaches can be broadly divided into two types. The first stream of work uses techniques borrowed from machine translation (Aw et al., 2006; Pennell and Liu, 2011; Ljubešić et al., 2016). The second stream is based on a classic spelling correction framework (noisy-channel models) (Han, 2014). These approaches often apply three steps: detecting which words need to be replaced, generating candidates, and ranking these candidates. Later, it became evident that a two-step approach is sufficient (Jin, 2015; van der Goot, 2019): the detection step becomes unnecessary when the original word is itself considered as a normalization candidate. The current state-of-the-art model for most languages is MoNoise (van der Goot, 2019), which is based on this two-step approach. A variety of modules are used for the generation of candidates. For the ranking, MoNoise complements features from the generation step with additional features, which are all combined in a random forest classifier that predicts the probability that a candidate is a 'correct' candidate. MoNoise is described in more detail in Section 4.2. More recently, sequence-to-sequence models (Lourentzou et al., 2019) and contextual embeddings (Muller et al., 2019) have been used for the lexical normalization task. These approaches have been shown to reach performances close to MoNoise on English benchmarks.
Like most NLP tasks, most research on normalization has been done on English datasets (Han and Baldwin, 2011; Baldwin et al., 2015). However, there have been some efforts on other languages, where usually only one language is considered; we refer to Sharf and Rahman (2017) and van der Goot (2019) for an overview of available resources.
Processing of code-switched social media data Early work on normalizing CS data focused on Hindi-English, as part of pipelines to achieve downstream tasks (Sharma et al., 2016; Bhat et al., 2017, 2018). As Hindi is Romanized in datasets and additional Hindi resources are in the Devanagari script, they include back-transliteration into the normalization step, thus defining the task beyond the scope of this paper. Nevertheless, all systems report a positive impact of normalization on their final task. More recently, Barik et al. (2019) experiment on normalization for Indonesian-English. They use a rule-based approach supplemented by clusters derived from word embeddings, and show that normalization can be used to improve machine translation. Adouane et al. (2019) instead propose to use sequence-to-sequence models for normalizing Algerian Arabic data mixed with Modern Standard Arabic, French, Berber, and English. They show that their edit distance-based token-level aligner helps improve normalization.
When annotating the Tr-De dataset for normalization, we also adapted its POS tags (see Section 3.1). This gives us the opportunity to apply POS tagging as extrinsic evaluation. Besides research on Hindi-English that combines normalization and back-transliteration, most work either uses normalization to improve tagging performance on monolingual social media data (Derczynski et al., 2013), or focuses on POS tagging of CS data without normalization (AlGhamdi et al., 2016; Soto and Hirschberg, 2018). In this work, we combine these angles.
Because some of our proposed normalization models depend on language labels, we require a word-level language identification system. There is a wide variety of approaches used for this task, where early systems mostly used CRFs (Sequiera et al., 2015; Molina et al., 2016). More recently, neural network-based approaches have shown superior performance for this task (Zhang et al., 2018). We opt for three different architectures to observe the effect of the quality of language identification on normalization (Section 4.1).

Data
In this section we first describe the design decisions behind the novel Turkish-German dataset, and then compare its basic statistics with those of the existing Indonesian-English dataset (Barik et al., 2019).

Turkish-German code-switched normalization corpus
We use the Turkish-German Twitter corpus from Çetinoğlu (2016) in our experiments. It consists of 17K tokens in 1,029 tweets. The raw tweets of the corpus have undergone three main alteration steps after collection: tokenization, normalization, and segmentation. In addition, usernames and URLs are anonymized as @username and [url] respectively, and intra-word CS boundaries are marked in Mixed tokens with §. Each alteration layer is exemplified on a sentence from the corpus in Figure 1. The Seg+CS layer is annotated with language IDs and POS tags (Çetinoğlu and Çöltekin, 2016). The LID tag set consists of TR (Turkish), DE (German), Lang3 (third language), Mixed (intra-word CS), NE (named entity), Ambig (both Turkish and German and cannot be disambiguated in the given context), and Other (punctuation, numbers, URLs, emoticons, symbols). Additionally, named entities are tagged with their language label next to the NE tag, e.g. 'Germany' is annotated in the corpus as follows depending on the language: Almanya NE.TR, Deutschland NE.DE, Germany NE.Lang3. The POS annotation adopts the Universal Dependencies (UD) tag set.
Preprocessing for normalization The original version of the corpus has only the Raw and Seg+CS layers and only tweet-level alignment between them. As our work focuses only on normalization we created the intermediate layers Tok+Anon and Norm that leave out other tasks. Since MoNoise requires word-aligned annotations, we also provided these alignments.
We anonymized and tokenized the raw tweets to achieve the Tok+Anon layer. For tokenization, we use a slightly modified version of twokenize.py (O'Connor et al., 2010). To obtain the Norm layer, we merged back segmented tokens and removed CS boundaries on the Seg+CS layer.
After this stage, we aligned Tok+Anon and Norm on the token level automatically using Giza++ (Och and Ney, 2003). We parsed the resulting alignment files to align the actual tokens and corrected them manually. There are 15,715 1:1, 520 1:n, and 147 n:1 alignments.
LID and POS alignment The existing LID and POS tags are on the Seg+CS layer; since we base our experiments on the Tok+Anon layer, we need to map the annotations. This is done in two steps following the Seg+CS ⇒ Norm ⇒ Tok+Anon order. Due to segmentation merges in the first step, and 1:n and n:1 token alignments in the second step, there are non-trivial LID and POS alignments. Figure 2 demonstrates a segmented word in the first column. The first segment Semesterda 'in semester' is Mixed with German Semester and Turkish locative case marker da. The second segment is the Turkish copular -yim 'I am'. Their POS tags are NOUN and VERB, respectively. When segmentation is undone in the second column (Norm), their LID and POS are merged too. If two tokens have the same LID, the merged token takes the same LID. If they are different, the resulting token is Mixed, as in the example. POS tag merging rules can get more complicated; therefore, we used a heuristic that favors the POS tag of the second token in most cases. When a NOUN segment is merged with a VERB segment, as in Figure 2 (Seg+CS ⇒ Norm), the merged token is assigned a VERB POS tag. For the Norm ⇒ Tok+Anon mapping, the alignment is 1:1, thus LID and POS are directly carried over.
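The merge rules described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function names are hypothetical, the POS rule is simplified to "the second segment always wins" (the text says this holds only in most cases), and surface forms are naively concatenated.

```python
def merge_lid(first: str, second: str) -> str:
    """Identical LIDs are kept; differing LIDs yield Mixed."""
    return first if first == second else "Mixed"


def merge_pos(first: str, second: str) -> str:
    """Simplified heuristic (assumption): the POS tag of the second segment wins."""
    return second


def merge_segments(segments):
    """Merge (form, lid, pos) segments of one word from left to right."""
    form, lid, pos = segments[0]
    for next_form, next_lid, next_pos in segments[1:]:
        form += next_form  # naive concatenation, for illustration only
        lid = merge_lid(lid, next_lid)
        pos = merge_pos(pos, next_pos)
    return form, lid, pos


# The Figure 2 case: a Mixed NOUN segment followed by a Turkish VERB segment.
merged = merge_segments([("Semesterda", "Mixed", "NOUN"), ("yim", "TR", "VERB")])
print(merged[1], merged[2])
```

Folding the segments pairwise keeps the rule applicable to words with more than two segments.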

Dataset characteristics
Besides the data described in the previous section, we use the Indonesian-English (Id-En) data from Barik et al. (2019). The Id-En data is only annotated with language IDs and uses three labels: ID, EN, UN (Unspecified), whereas the Tr-De data includes 12 labels (Section 3.1). To simplify the models and improve comparability, we map the language labels of the Tr-De dataset to TR, DE and UN. Named entities are mapped to their respective language tags, e.g., NE.DE to DE. Mixed tokens are mapped to DE as they are German words with Turkish inflection. Lang3, Ambig and Other are mapped to UN.
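A minimal sketch of this label mapping is given below. The function name is hypothetical, and two details are assumptions rather than statements from the text: NE.Lang3 is sent to UN (via Lang3), and any label not listed is also sent to UN.

```python
def simplify_lid(label: str) -> str:
    """Map a fine-grained Tr-De language label to TR, DE or UN."""
    if label.startswith("NE."):
        # Named entities take their language tag: NE.DE -> DE, NE.TR -> TR.
        label = label[len("NE."):]
    if label in ("TR", "DE"):
        return label
    if label == "Mixed":
        # Mixed tokens are German words with Turkish inflection.
        return "DE"
    # Lang3, Ambig, Other; by assumption also anything unlisted.
    return "UN"


print([simplify_lid(l) for l in ["TR", "NE.DE", "Mixed", "Ambig", "Other"]])
```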
We divide both datasets into a train and test split (80-20%), and omit a development set due to the small sizes. Since we want to leave the test set out of our analyses, we opt for 10-fold cross-validation on the training split of the data in experiments. Statistics of the training splits of the datasets are shown in Table 1. The datasets are relatively small, but a high ratio of words is normalized, including a high percentage of splits and merges. The percentage of in-vocabulary words is especially low in the Tr-De data, which is mainly due to the morphological richness of Turkish. The code-mixing index (CMI) (Das and Gambäck, 2014) measures the (average of the) amount of words not written in the majority language for each sentence. The relatively high CMI for both datasets indicates that code-switching occurs frequently in the data.
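As an illustration of the CMI, the sketch below computes a sentence-level score as the percentage of language-tagged words outside the majority language, and averages it over sentences. Treating UN as language-independent and the exact averaging are assumptions; the variant used for Table 1 may differ in detail.

```python
from collections import Counter


def sentence_cmi(labels, neutral=("UN",)):
    """Percentage of language-tagged words not in the sentence's majority language."""
    counts = Counter(label for label in labels if label not in neutral)
    tagged = sum(counts.values())
    if tagged == 0:
        return 0.0  # no language-specific words at all
    return 100.0 * (1 - max(counts.values()) / tagged)


def corpus_cmi(sentences):
    """Average of the sentence-level scores."""
    return sum(sentence_cmi(labels) for labels in sentences) / len(sentences)


# Toy label sequences (invented for illustration).
print(corpus_cmi([["TR", "DE"], ["TR", "TR", "UN"]]))
```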
In both datasets there is a small number of sentences without normalization (8 and 76 for Id-En and Tr-De, respectively), which might be desirable for evaluation of (over)normalization, as in a real-world setup one also does not know beforehand whether normalization is necessary. In more than half of the sentences the number of normalized words is larger than 3. Furthermore, there are some sentences (5-10 per dataset) with a very high normalization ratio (>70%), which are all in capitals.

Monolingual Data
Our baseline model (MoNoise) exploits monolingual data from both the source and the target domain (canonical data) to train word embeddings and estimate n-gram probabilities. To this end, we utilize Wikipedia dumps from 01-01-2020 and random tweets collected throughout 2012 and 2018 from the Twitter API, filtered by the FastText language classifier (Joulin et al., 2017). We tokenized this data on whitespace, and removed all duplicate sentences/tweets. The sizes of the collected raw datasets are shown in Table 2.
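The tokenization and deduplication step can be sketched as below. The function name is hypothetical, and comparing duplicates after whitespace normalization (as well as dropping empty lines) is an assumption; the text does not say how duplicates were detected.

```python
def tokenize_and_deduplicate(lines):
    """Split each line on whitespace and keep the first occurrence of each line."""
    seen = set()
    kept = []
    for line in lines:
        tokens = line.split()  # whitespace tokenization
        if not tokens:
            continue  # skip empty lines
        key = " ".join(tokens)
        if key in seen:
            continue  # duplicate sentence/tweet
        seen.add(key)
        kept.append(tokens)
    return kept


print(tokenize_and_deduplicate(["a b  c", "a b c", "", "d e"]))
```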

Models
In this section we describe the models used for word-level language identification (4.1), lexical normalization (4.2) and POS tagging (4.3).

Word-level language identification
We treat language identification as a sequence labeling task where the label of each word is a language ID. We evaluate three sequence labeling libraries: 1) MarMoT (Mueller et al., 2013), a higher-order conditional random fields tagger; 2) Bilty (Plank et al., 2016) [...] language of the language pairs to the same space using MUSE (Lample et al., 2018), whereas for MaChAmp, we use multilingual BERT (footnote 5). We use the default settings for all toolkits.

Normalization
We choose to use MoNoise (van der Goot, 2019) as a baseline and starting point for our proposed models for two main reasons: 1) Normalization annotation for code-switched data is scarce, and MoNoise is specifically strong in low-resource setups because of its dependence on external resources (generated from raw data); 2) It is the only normalization model that has been shown to be effective in multiple languages. Below we first introduce the standard monolingual MoNoise model, and then all the proposed extensions which are focused on code-switched data. A schematic overview of all models is shown in Figure 3.
Monolingual (Figure 3a) MoNoise consists of two parts, a candidate generation step and a candidate ranking step. For the generation of candidates, a spelling correction system (Aspell, www.aspell.net), word embeddings and a dictionary based on the training data are used. Features from these modules are then supplemented with n-gram probabilities based on Wikipedia and Twitter data and other features indicating whether a word is present in the Aspell dictionary, whether it contains an alphabetical character, the length of a candidate compared to the original word, and whether it starts with a capital. For the novel proposed models, we will split up the features based on whether they require language-specific resources (spelling correction, word embeddings and n-grams features; yellow and red in Figure 3), or whether they are language-agnostic (all other features; blue in Figure 3). For the ranking of the candidates a random forest classifier (Breiman, 2001) is used, which predicts the probability that a candidate is correct. An obvious disadvantage when applying monolingual MoNoise on CS data is that many features are language-specific (e.g. spelling correction, word embeddings, n-grams), which is sub-optimal for tokens from another language. Since our datasets and evaluation include capitals, we use the version of MoNoise including capitalization handling (van der Goot et al., 2020).
5 multi_cased_L-12_H-768_A-12
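To make the language-agnostic surface features concrete, the sketch below computes three of them for a single candidate. It is not MoNoise's implementation: the function name, the feature names and the encoding of "length compared to the original word" as a signed difference are all assumptions.

```python
def surface_features(original: str, candidate: str) -> dict:
    """Language-agnostic features of one normalization candidate."""
    return {
        # Does the candidate contain at least one alphabetical character?
        "has_alpha": any(ch.isalpha() for ch in candidate),
        # Length of the candidate relative to the original word (assumed encoding).
        "length_diff": len(candidate) - len(original),
        # Does the candidate start with a capital?
        "starts_with_capital": candidate[:1].isupper(),
    }


print(surface_features("ak", "aku"))
```

In a two-step setup, the original word itself would be scored as one of the candidates, so the same function applies with `candidate == original`.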
Fragments (Figure 3b) The baseline model has the deficiency that it has the language-specific features only for one language, while normalizing texts for two languages. An intuitive way of improving this model would be to split up the input data into monolingual fragments, and train two separate monolingual models. The fragments are created by splitting the data on every CS point, where words with the UN label are converted to the label of the previous word. This setup has the advantage that the normalization model itself does not need any adaptation, and it can thus be used with any normalization model. The disadvantages are that it is dependent on a language label, two separate classifiers have to be trained and the context is interrupted.
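The fragment creation described above can be sketched as follows. The function names are hypothetical. The text does not say what happens to a UN word at the start of a sentence; here it is assumed to take the first language label that follows, or a fallback language if the sentence has none.

```python
def resolve_labels(labels, fallback):
    """Replace every UN label by the label of the previous word."""
    # Assumption: sentence-initial UN words take the first known label.
    previous = next((label for label in labels if label != "UN"), fallback)
    resolved = []
    for label in labels:
        if label != "UN":
            previous = label
        resolved.append(previous)
    return resolved


def split_fragments(words, labels, fallback="DE"):
    """Split a sentence into monolingual fragments at every CS point."""
    fragments = []
    for word, label in zip(words, resolve_labels(labels, fallback)):
        if fragments and fragments[-1][0] == label:
            fragments[-1][1].append(word)
        else:
            fragments.append((label, [word]))
    return fragments


# Toy sentence (invented for illustration).
print(split_fragments(["ich", "bin", ".", "evde", "yim"], ["DE", "DE", "UN", "TR", "TR"]))
```

Each fragment would then be passed to the monolingual model of its language, which is why the context is interrupted at every switch point.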
Multilingual (Figure 3c) Instead of using two separate random forest classifiers, we can exploit both feature sets simultaneously in one classifier. This means that for every language-specific feature, we now have two features. In this setup, the model is not explicitly informed about the language of input words, however, some of the features (especially n-gram probabilities) will have a very high correlation with this information. This model has the advantage that only one classifier has to be trained, and no language labels are necessary. It has the disadvantage that it uses more features for the classifier compared to the Monolingual and Fragments models, which increases the complexity of the classification.
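As a sketch of the doubled feature set, the snippet below builds one copy of a language-specific feature (a unigram probability) per language. The function name is hypothetical and the probabilities are invented toy values, not estimates from the Wikipedia or Twitter data.

```python
# Toy unigram probabilities per language (invented values).
TOY_UNIGRAMS = {
    "de": {"eine": 0.004, "ne": 0.0001},
    "tr": {"ne": 0.003},
}


def multilingual_features(candidate, resources):
    """One copy of the language-specific feature for every language."""
    return {
        f"unigram_prob_{lang}": resources[lang].get(candidate, 0.0)
        for lang in sorted(resources)
    }


print(multilingual_features("eine", TOY_UNIGRAMS))
```

The classifier never sees a language label here; it can only infer the language from which of the two copies is non-zero or larger.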
Language-aware (Figure 3d) Some of the language-specific features of the Multilingual model will be rather superfluous for words in the other language. For example, it will search for Turkish words in German word embeddings, and also use n-gram counts based on the German Wikipedia. To avoid this, we can use only one copy of each language-specific feature, and generate them based on the language label (the same language labels as in the Fragments model are used). More concretely, this means that for a German word, we will generate uni-gram probabilities based on German data, whereas for Turkish we will use Turkish data; these are then modeled as one feature in the model. On top of this, we also add a feature that indicates which language a word belongs to. There might be some mismatches in the importance of features because different data sources and languages are used. Because the language label is known, and a random forest classifier can model feature interactions intrinsically (Breiman, 2001), these mismatches should not be problematic. This model has the advantage that the number of features stays almost the same as in the Monolingual model (+1, the language ID), but a disadvantage is that it requires language labels.
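A minimal sketch of this feature selection follows: a single copy of the language-specific feature is filled from the resource of the word's language, and the language ID is added as an extra feature. The function name, the integer encoding of the language ID and the toy probabilities are assumptions for illustration.

```python
# Toy unigram probabilities per language (invented values).
TOY_UNIGRAMS = {
    "de": {"eine": 0.004, "ne": 0.0001},
    "tr": {"ne": 0.003},
}


def language_aware_features(candidate, lang, resources):
    """One feature per resource, generated from the word's own language."""
    languages = sorted(resources)
    return {
        # The same feature slot is filled from German or Turkish data.
        "unigram_prob": resources[lang].get(candidate, 0.0),
        # Extra feature telling the classifier which language was used.
        "lang_id": languages.index(lang),
    }


print(language_aware_features("ne", "tr", TOY_UNIGRAMS))
print(language_aware_features("ne", "de", TOY_UNIGRAMS))
```

The same candidate receives different feature values depending on the predicted language, which is what allows the classifier to treat ambiguous forms differently per language.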

POS tagging
For POS tagging, we examine the same three sequence labeling systems as used for language identification (Section 4.1): MarMoT, Bilty and MaChAmp. For each normalization setting, we normalize the input data, and use this normalized text as input for the POS tagger, which is trained on canonical data.

Evaluation
In this section we evaluate each of the three subtasks (LID, normalization, POS), where for the latter two we also examine the effect of exploiting the predictions of the previous tasks. Unless mentioned otherwise, we report the results of 10-fold cross-validation on the training split of the data. For all experiments, we use a paired bootstrap test on the sentence level with 1,000 samples to test significance. For all results, we order the models by the complexity of the implementation as compared to MoNoise (first Fragments, as the original model can be used as a black box, then Multilingual because it does not need a language classifier, and finally the Language-aware model). An * next to a result denotes a significant difference at p < 0.05 compared to the previous model (the previous column in Table 6, the previous row in other tables) for the same data.
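One common form of a sentence-level paired bootstrap test is sketched below. The exact variant used in the experiments is not specified, so the resampling scheme, the one-sided p-value and the function name are assumptions. The inputs are per-sentence counts of correctly predicted words for two systems on the same sentences.

```python
import random


def paired_bootstrap(correct_a, correct_b, samples=1000, seed=0):
    """Estimate how often system B fails to beat system A under resampling."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(correct_a)
    not_better = 0
    for _ in range(samples):
        # Resample sentences with replacement, using the same draw for both systems.
        drawn = [rng.randrange(n) for _ in range(n)]
        # The word total is identical for both systems, so comparing
        # correct counts is equivalent to comparing accuracies.
        if sum(correct_b[i] for i in drawn) <= sum(correct_a[i] for i in drawn):
            not_better += 1
    return not_better / samples


# Toy counts (invented): B is better in every sentence.
print(paired_bootstrap([1, 2, 3], [2, 3, 4]))
```

A returned value below 0.05 would correspond to the * marking in the result tables under these assumptions.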

Language identification
Results for the language identification task are reported in Table 3. Unsurprisingly, the performances are in line with the chronological order of the introduction of the systems, and their computational complexity. It should be noted that for MaChAmp we used pre-trained embeddings which were trained on the largest amount of external data. When inspecting the performance per language label, we saw that the 'Unspecified' (UN) label is by far the most difficult. Even though this class contains punctuation, it also contains many harder cases, where a word belongs to any language other than Lang1 and Lang2, or where the annotator is uncertain. Barik et al. (2019) use a conditional random fields classifier with a variety of features for this task, and report 90.11 accuracy for the full Id-En dataset in a 5-fold cross-validation setting. Despite differences in data splits, this confirms that our results are competitive.

Normalization
For lexical normalization, a wide variety of evaluation metrics is used in the literature, ranging from accuracy (Han and Baldwin, 2011), F1 score (Baldwin et al., 2015) and precision over out-of-vocabulary words (Alegria et al., 2013), to CER and BLEU score (Ljubešić et al., 2016). Because the word order is fixed in our task, and to ease interpretation of the results, we opt to use simple accuracy on the word level, where we consider all words (i.e., also the unnormalized words).
Table 4: Normalization performance of the baselines and the proposed models (10-fold accuracy). For the models dependent on language labels, we used the labels predicted by MaChAmp.
To interpret the scores, we include three baselines: 1) leave-as-is (LAI), which always outputs the original word, i.e. its accuracy is equivalent to the percentage of words that are not normalized; 2) most-frequent-replacement (MFR), which uses the most frequent replacement from the training data for each word; 3) monolingual MoNoise, which can be trained on either of the languages within a language pair (two models).
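The first two baselines and the word-level accuracy can be sketched as follows. The function names are hypothetical, the word pairs are toy examples, and letting MFR back off to the original word for unseen words is an assumption.

```python
from collections import Counter, defaultdict


def train_mfr(pairs):
    """Map each raw word to its most frequent replacement in the training data."""
    counts = defaultdict(Counter)
    for raw, normalized in pairs:
        counts[raw][normalized] += 1
    return {raw: c.most_common(1)[0][0] for raw, c in counts.items()}


def leave_as_is(words):
    """LAI: always output the original word."""
    return list(words)


def most_frequent_replacement(words, table):
    """MFR: look up each word; unseen words are left unchanged (assumption)."""
    return [table.get(word, word) for word in words]


def accuracy(predicted, gold):
    """Word-level accuracy over all words, including unnormalized ones."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Toy data (invented for illustration).
table = train_mfr([("ak", "aku"), ("ak", "aku"), ("ak", "ak"), ("ok", "ok")])
raw, gold = ["ak", "isnt", "ok"], ["aku", "is not", "ok"]
print(accuracy(leave_as_is(raw), gold))
print(accuracy(most_frequent_replacement(raw, table), gold))
```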
Results for the different models are compared in Table 4. For the Id-En dataset, the differences between all proposed models are small and not significant. Even the monolingual models perform remarkably well, and only small gains are observable when using the multilingual model. We also compared our results to Barik et al. (2019), using their evaluation metric as their model/output was not available. The metric is non-deterministic, as it uses accuracy over unique OOV words. Nevertheless, our average estimated result for Multilingual is 69.83 for this metric, outperforming their score of 68.50.
For the Tr-De dataset, the scores are generally lower, indicating that this dataset (and perhaps language pair) is more difficult. Especially now, we can observe that the code-switched adaptations lead to substantially higher scores. To our surprise, Multilingual and Language-aware perform on par, even though the multilingual model does not rely on language labels. Fragments performs significantly worse. This leads to the conclusion that language labels are not directly beneficial for lexical normalization (in this setup). In general, the performances are in a similar range as for monolingual datasets (van der Goot, 2019).
Model behavior Besides the metrics reported in the table, we also examined precision and recall. Precision is generally much higher (1.1 to 3 times, see Appendix B) than recall, especially for Tr-De, which is in line with previous observations (van der Goot, 2019). This means that the model is conservative and only replaces cases for which it is rather certain, which arguably is a desirable behavior.

Effect of language predictions
To evaluate the effect of the language predictions, we run both the Fragments and the Language-aware models with all language predictions from Section 5.1 as well as the gold language labels. The results (Table 5) show that the performance of the language identification has a positive effect on the normalization performance. Although it is not significant in most cases, it should be noted that significance is only tested compared to the previous model.
Language labels Looking at the normalization performance breakdown on language labels shows that the gains of our proposed models are consistently smaller on Indonesian and Turkish compared to respectively English and German (see Appendix A for full results). This was to be expected, as for these languages the model has less external [...], and punctuation replacements. On the Id-En data, however, there is a higher number of these frequent replacements compared to the Tr-De dataset, which explains the high scores and small variability for Id-En in Tables 4 and 5.
For the Tr-De dataset, the most common mistakes include: not correcting capitalization in the beginning of a sentence, merging of words, monolingual ambiguous cases depending on context (mi → [mi, mı], question clitics in TR), and tokenization and punctuation mistakes (?:D →? :D). In comparison, for the Id-En dataset, the models make rather different errors: in-vocabulary words which should be normalized are left as is (kaya → seperti, usah → perlu), normalizations which are lexically very distant are not found (lw → kamu), and English contractions are often not replaced (isnt → is not). Error analysis on the Id-En dataset revealed that correction of capitalization was annotated inconsistently. However, because in most cases the normalization was lowercased, this did not have a large effect on performance.
Interestingly, Language-aware is better in correcting words that exist in both languages. For instance, ne is the informal form of eine 'a/one' in German, and also means 'what' in Turkish. The dataset annotations expect the ne → eine normalization. While Multilingual fails to do so, Language-aware corrects them. We believe language IDs play a positive role here in defining the context, and although in general both models perform on par, if a dataset contains many such ambiguous words, Language-aware could be preferable.

POS tagging
For POS tagging, we only look at Tr-De as Id-En is not annotated with POS tags. We employ a pipeline approach; we first normalize our training data in a 10-fold setting, and then apply the tagger on this normalized data. The taggers are trained on a shuffled concatenation of the Turkish-IMST (Sulubacak et al., 2016) and German-GSD (McDonald et al., 2013) datasets of UD version 2.5 (Nivre et al., 2020). Since none of the CS data is used during training, 10-fold cross-validation is not necessary. We directly apply the taggers on the full training data. This way the exact same data split is used for evaluation as in the 10-fold setting in the previous sections. Even though we have POS tags available for the gold normalization (Section 3.1), we do not have gold tags for predicted normalization, and to keep the comparison fair we evaluate using the Tok+Anon POS tags. When a word is split or merged, we use the alignment and check whether the correct tag is present. In other words: we select one tag based on an oracle selection.
Results in Table 6 show that, surprisingly, Bilty performs competitively with MaChAmp across most settings. Considering the differences between the normalization models, the Multilingual model and the Language-aware model perform on par, but there is still a marginal gap compared to the gold normalization.
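The oracle selection for split and merged words can be sketched as below. The function name and the alignment format are hypothetical: each original (Tok+Anon) token is assumed to carry a list of indices of the normalized tokens it is aligned to, so a split yields several indices and a merge makes several original tokens point to the same index.

```python
def oracle_pos_accuracy(gold_tags, predicted_tags, alignment):
    """Accuracy where a token counts as correct if any aligned prediction matches.

    gold_tags[i]   : gold tag of original token i
    predicted_tags : tags predicted for the normalized tokens
    alignment[i]   : indices of normalized tokens aligned to original token i
    """
    correct = 0
    for gold, indices in zip(gold_tags, alignment):
        candidates = {predicted_tags[j] for j in indices}
        if gold in candidates:  # oracle: the correct tag is present
            correct += 1
    return correct / len(gold_tags)


# Toy example (invented): the second original token was split into two.
print(oracle_pos_accuracy(
    ["PRON", "VERB", "NOUN"],
    ["PRON", "AUX", "VERB", "NOUN"],
    [[0], [1, 2], [3]],
))
```

Because the selection is an oracle, scores computed this way are an upper bound for split words compared to picking a single predicted tag.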
We also analyzed the confusion matrices of the POS taggers. The full analysis can be found in Appendix C; we briefly summarize the findings here. 1) Bilty mainly outperforms MaChAmp on gold normalization due to better recognition of symbols (emojis). 2) Bilty is more sensitive to different normalization strategies, whereas for MaChAmp the differences between them are minimal. 3) Performance on nouns improves a lot after normalization, especially for German (due to corrected capitalization of nouns). 4) The POS tag with the second-largest improvement is VERB; investigation showed that this is mainly because Turkish-specific characters are replaced by their ASCII counterparts, which helps the tagger assign the correct POS.

Test data
On the test data we take both the 'no normalization' and the best baseline (which are monolingual Indonesian for Id-En and monolingual German for Tr-De), and compare these to our best two proposed normalization models. The results in Table 7 show that, parallel to the 10-fold cross-validation results (Table 4), Multilingual and Language-aware scores are similar and their difference is not significant for either dataset. This leads to the conclusion that Multilingual is the most elegant model, as it is not dependent on language labels. On the Tr-De dataset the proposed models clearly outperform the baselines. However, on the Id-En dataset the differences are small (and not significant) between the monolingual model and both of our proposed models. For Tr-De, we take the test set normalized by systems in the second column of Table 7 and apply MaChAmp for POS tagging. The results in the third column show that the POS tagger follows the trend in normalization scores, and performs slightly better when using the multilingual model, beating the LAI baseline (i.e. not using normalization) by 5.4% relative improvement.

Conclusion
Code-switching provides many challenges for NLP systems. In this work we attempt to overcome some of these challenges by normalizing the data, and evaluating the downstream effect of this for POS tagging. For evaluation we use an Indonesian-English dataset (Barik et al., 2019) as well as a German-Turkish dataset (Çetinoğlu, 2016), for which we provided novel normalization layers and adapted existing LID and POS annotation.
We proposed three different models to normalize CS data. The two best-performing models are Language-aware and Multilingual. The first model exploits language labels to identify for which language to generate features, whereas the second model combines features for both languages. The difference in performance between these two systems was not significant for any of the 10-fold experiments nor on the test data, so in most cases the multilingual model would be preferable, as it has no dependence on language labels.
We showed that normalizing the input before POS tagging results in significantly higher POS accuracies for CS data. Gold normalization experiments showed that there is still room for improvement for normalization models to help POS tagging.
An interesting property of the proposed model is that it does not have to be trained on intrasentential CS data. In fact, it can be trained on a mix of two monolingual datasets, thereby handling many more language pairs. We hope to evaluate this setup if resources (i.e., normalization test data for a CS language pair, and monolingual normalization training data for both languages) become available.

Appendix

A Breakdown of performance per language
Table 8 shows the accuracy of all the proposed models per language. The LAI scores show that most of the normalization replacements are necessary for ID and DE. Interestingly, performance of the last two models is highest on respectively EN and DE, which is probably due to the original model being developed mostly with a focus on European languages.
Table 9 shows the precision and recall of all models on both datasets. LAI has 0.0 on all metrics, because it never finds a correct normalization.
Table 9: Precision and recall for both datasets; we follow the definitions of van der Goot (2019).

C Confusions of POS taggers
We conducted an analysis of POS tagging confusions for the setting described in Section 5.3. In Table 10 and Table 11 the error frequencies of respectively MaChAmp and Bilty are shown. The tables report the frequency of the top-10 most frequent errors of the baseline (LAI), and the difference in counts observed using a variety of normalization strategies. In Figure 3 and Figure 3 the full confusion matrices for respectively MaChAmp