MultiLexNorm: A Shared Task on Multilingual Lexical Normalization

Lexical normalization is the task of transforming an utterance into its standardized form. This task is beneficial for downstream analysis, as it provides a way to harmonize (often spontaneous) linguistic variation. Such variation is typical of social media, on which information is shared in a multitude of ways, including diverse languages and code-switching. Since the seminal work of Han and Baldwin (2011) a decade ago, lexical normalization has attracted attention in English and multiple other languages. However, a common benchmark for comparing systems across languages with a homogeneous data and evaluation setup has been lacking. The MULTILEXNORM shared task sets out to fill this gap. We provide the largest publicly available multilingual lexical normalization benchmark, including 12 language variants. We propose a homogenized evaluation setup with both intrinsic and extrinsic evaluation. For extrinsic evaluation, we use dependency parsing and part-of-speech tagging with adapted evaluation metrics (a-LAS, a-UAS, and a-POS) to account for alignment discrepancies. The shared task, hosted at W-NUT 2021, attracted 9 participants and 18 submissions. The results show that neural normalization systems outperform the previous state-of-the-art system by a large margin. Downstream parsing and part-of-speech tagging performance is positively affected, but to varying degrees, with improvements of up to 1.72 a-LAS, 0.85 a-UAS, and 1.54 a-POS for the winning system.


Introduction
The rise of social media has led to a tremendous increase in the amount of data shared over the Internet. But because of its spontaneous nature, the data naturally abounds with numerous language variations, both intended (e.g., slang, abbreviations, non-standard capitalization) and unintended ones (e.g., typos). This, in turn, poses considerable problems for existing natural language processing (NLP) tools (e.g., Baldwin et al., 2013; Eisenstein, 2013), most of which were originally designed to process canonical texts. One way to improve the performance of such systems is to normalize text and thus make it more similar to the data the NLP systems were initially designed for (and trained on).
At this point, to avoid confusion with other existing notions of text normalization (cf. Sproat et al., 2001; Aw et al., 2006), we should state that, throughout this paper, we will only deal with lexical normalization, a task which Han and Baldwin (2011) define as "a mapping from 'ill-formed' out-of-vocabulary (OOV) lexical items to their standard lexical forms." We focus only on social media data, as opposed to historical data (Tang et al., 2018; Bollmann, 2019) or medical data (Dirkson et al., 2019), and extend the scope of this task further to the cases where wrong in-vocabulary (IV) tokens can be normalized to (i.e., replaced with) their in-vocabulary counterparts, arriving at the following formulation:

Definition -Lexical Normalization
Lexical normalization is the task of transforming an utterance into its standard form, word by word, including both one-to-many (1-n) and many-to-one (n-1) replacements.
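The word-by-word formulation with 1-n and n-1 replacements can be illustrated with a small encoding sketch (ours for illustration, not the official shared task data format):

```python
# A sketch of word-aligned normalization. A 1-n replacement maps one raw
# token to a multi-word string; an n-1 replacement is encoded here by putting
# the merged form on the first token and an empty string on the token(s)
# absorbed into it.
raw  = ["new", "pix", "comming", "to", "morrow"]
gold = ["new", "pictures", "coming", "tomorrow", ""]

def detokenize(pairs):
    """Produce the normalized token sequence from word-aligned pairs."""
    out = []
    for _, norm in pairs:
        if norm == "":            # n-1: merged into the previous token
            continue
        out.extend(norm.split())  # 1-n: split multi-word normalizations
    return out

print(detokenize(list(zip(raw, gold))))  # -> ['new', 'pictures', 'coming', 'tomorrow']
```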
It should be noted that deletions and insertions of complete words are thus beyond the scope of the task as defined here.
Although lexical normalization potentially removes social signals (Nguyen et al., 2021), it has also been shown to boost many downstream NLP tasks, including named entity recognition (Schulz et al., 2016; Plank et al., 2020), POS tagging (Derczynski et al., 2013; Schulz et al., 2016; Zupan et al., 2019), dependency and constituency parsing (Baldwin and Li, 2015; van der Goot et al., 2020; van der Goot and van Noord, 2017), sentiment analysis (Van Hee et al., 2017; Sidarenka, 2019, pp. 79, 122), and machine translation (Bhat et al., 2018). However, existing work on this topic is largely fragmented: it mostly focuses on a single language, relies on different evaluation metrics, or makes different assumptions regarding the items to be normalized (cf. Yang and Eisenstein, 2013; Li and Liu, 2015; Xu et al., 2015). All this makes it extremely hard to compare existing and new normalization systems.
In an attempt to achieve greater reproducibility, linguistic variety, and a standardized benchmark for multilingual lexical normalization, we introduce the MULTILEXNORM shared task. The benchmark for this task comprises datasets for 12 language(-pair)s: Danish, German, English, Spanish, Croatian, Indonesian-English, Italian, Dutch, Slovenian, Serbian, Turkish, and Turkish-German. All datasets contain sentences from popular social media platforms, which have been annotated for lexical normalization (i.e., with word-level replacements). Following our definition, we considered both intended and unintended spelling deviations, and included all categories defined by van der Goot et al. except phrasal abbreviations. We assume gold tokenization in all datasets, and leave automation of the tokenization step for future work. Examples of annotated sentences for all languages are shown in Table 1.
Furthermore, to precisely measure the effect of text normalization on downstream tasks, we also included a dedicated track on extrinsic evaluation, in which we estimate how much the results of dependency parsing and part-of-speech (POS) tagging change after normalization. This track includes corpora for English, German, Italian, and Turkish, annotated with Universal Dependencies (Nivre et al., 2020).
More details about intrinsic and extrinsic datasets are given in §2 and §5, respectively. We also provide an overview of baselines and submitted systems in §3, discussing their intrinsic and extrinsic results in §4 and §5, respectively. The paper concludes with a summary of the findings of the shared task and suggestions for future work.

Language and citation            Kappa      Same cand.
EN (Baldwin et al., 2015)        0.89       --
EN (Pennell and Liu, 2014)       0.59       98.73
IT (van der Goot et al., 2020)   0.64-0.79  73.91-77.78
NL (Schuur, 2020)                0.77-0.91  --
DA (Plank et al., 2020)          0.89       96.3

Table 2: Agreement scores for lexical normalization found in the literature. The "Kappa" column reports Fleiss'/Cohen's kappa (rounded to 2 decimals) on the decision of whether a word needs to be normalized or not, whereas "Same cand." reports the raw percentage of times annotators agreed. Ranges for Italian cover the raw annotation and the annotation after some automatic fixes; for Dutch they are between different domains.

Data
Our selection of languages is purely based on dataset availability. We are aware that the benchmark contains mostly Indo-European languages, and encourage additions to this benchmark in the future to increase language variety. We kindly request future work to cite the original data sources, and provide the bib-files on our website. Lexical normalization is a subjective task, as in many cases multiple interpretations and annotations are plausible. Furthermore, annotators may disagree on what is "normal", and on whether normalization is necessary for certain words. We have summarized all studies on inter-annotator agreement that we are aware of in Table 2.
Two types of agreement are reported in the literature: (1) Cohen's/Fleiss' kappa score on the choice of whether a word is in need of normalization; and (2) "Same candidate", which reports the percentage of times annotators agreed on the normalization replacement for words normalized by multiple annotators. Results in Table 2 show that the choice of whether to normalize has a medium to high kappa score, whereas the choice of the correct normalization candidate is generally high. An exception is Italian, which has a relatively low score due to some annotators not correcting capitalization (van der Goot et al., 2020).
Besides converting all data to the same format, we have attempted to converge annotation styles whenever possible. (If you are interested in adding a language, please contact the first author of this paper; if more languages are added, future versions of MULTILEXNORM will be released via the repository.) In particular, we applied the same normalization annotation in these cases:
• Interjections and punctuation are kept untouched: hahaha → hahaha, and not haha;
• Usernames, hashtags, and URLs are kept untouched; if data is anonymized, usernames become @username;
• We kept capitalization correction where available. Unfortunately, we did not have the budget to include capitalization correction in all datasets;
• We removed data that is not in the target language (mostly Frisian and Afrikaans in the Dutch data, and Indonesian and Dutch tweets in the German dataset);
• We fixed some tokenization issues in multiple languages.
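The "kept untouched" conventions above can be sketched as a small token filter (the function and the exact patterns are ours for illustration, not the organizers' harmonization scripts):

```python
import re

# Hedged sketch: tokens of these types are exempt from normalization
# under the harmonized annotation style.
def is_exempt(token: str) -> bool:
    if token.startswith("@") or token.startswith("#"):
        return True                      # usernames and hashtags
    if re.match(r"https?://|www\.", token):
        return True                      # URLs
    if re.fullmatch(r"(ha)+h?|[!?.,:;]+", token, flags=re.IGNORECASE):
        return True                      # interjections and punctuation
    return False

print([t for t in ["@username", "#wnut", "hahaha", "comming", "!!!"] if is_exempt(t)])
```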
With regard to data availability and composition, we note that some of the datasets were published before the shared task was held. All datasets contain data from the Twitter platform; the Dutch corpus also includes forum and SMS data, and the Danish dataset includes texts from Arto, a Danish social media network. More details about the data collection for each dataset can be found in the dataset statement (Appendix A).
An overview of our datasets is shown in Table 3. It is clear that different annotation guidelines have been used: some included "one-to-many" and "many-to-one" replacements and correction of capitalization, while others did not. Furthermore, the amount of necessary normalization differs considerably, and the training splits of the datasets vary greatly in size, with the largest being almost 10 times larger than the smallest. It should also be noted that, due to data availability, only 7 languages (DE, EN, HR, ID-EN, NL, SL, SR) have a dedicated development set.

Baselines
The organizers provided two naive baselines (i.e., LAI and MFR, as introduced below), and an "informed" baseline, based on training the previous state-of-the-art MoNoise over the respective datasets (van der Goot, 2019a).

Table 3: Some statistics on the 12 language(-pair)s within the MULTILEXNORM benchmark. The "1-n" column indicates the percentage of words which are split into multiple words (one-to-many), "n-1" indicates the proportion of words that are merged with other words as part of normalization (many-to-one), and "Change" indicates the percentage of words that are normalized. "Caps" indicates whether standard capitalization is included in the annotation; for datasets without annotation of capitalization, everything is lowercased.
LAI Leave-As-Is baseline, which simply returns the input word.
MFR Most-Frequent-Replacement baseline. It stores for every input word (unigram) its most frequent replacement in the training data. Then at run-time it simply replaces each word with its most common replacement. If a word is not seen before, it is returned as-is.
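The MFR baseline can be sketched in a few lines (a minimal illustration; class and method names are ours, not the organizers' code):

```python
from collections import Counter, defaultdict

# Most-Frequent-Replacement: memorize each training word's most frequent
# normalization; unseen words fall back to Leave-As-Is behaviour.
class MFRBaseline:
    def fit(self, pairs):
        """pairs: iterable of (raw_sentence, normalized_sentence) token lists."""
        counts = defaultdict(Counter)
        for raw_sent, norm_sent in pairs:
            for raw, norm in zip(raw_sent, norm_sent):
                counts[raw][norm] += 1
        self.table = {w: c.most_common(1)[0][0] for w, c in counts.items()}
        return self

    def predict(self, sentence):
        return [self.table.get(w, w) for w in sentence]  # unseen -> as-is

model = MFRBaseline().fit([(["u", "rock"], ["you", "rock"])])
print(model.predict(["u", "r", "ok"]))  # -> ['you', 'r', 'ok']
```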
MoNoise This is based on a two-step approach. It first generates candidates based on: word embeddings, the Aspell spell checker (http://aspell.net/), replacements found in the training data, and some heuristics. In the second step, features from the generation step are combined with additional features (including character n-gram probabilities) and used to train a random forest classifier, which predicts the probability that a candidate is the correct one. The only tuned component is the generation of Aspell candidates, where the --bad-spellers option can be used to generate more candidates. For most languages this resulted in a slower but more effective model (except for HR, ID-EN, and SL). For the code-switched language pairs, the code-switched version of MoNoise was used (van der Goot and Çetinoglu, 2021). To retrain MoNoise, new raw data was collected to base its n-gram probabilities and word embeddings on. We downloaded Twitter data of 2012-2020 from archive.org, filtered it with the fastText language classifier (Joulin et al., 2017a), and used the most recent Wikidump for each language.

Submissions
The shared task ran in mid-2021, and attracted 9 participants with 18 submissions. We include the full list of submissions, but no system description or paper was received from maet, team, thunderml, or learnML, so the details of these methods are not clear. Submissions marked with an asterisk (" * ") involve one or more of the shared task organizers.
ÚFAL (Samuel and Straka, 2021) The system is based on ByT5 (Xue et al., 2021), and is a word-by-word normalization model. In order for the model to be as close to the original pre-training task as possible, each input word is normalized independently: it is enclosed in an opening and ending tag, over which ByT5 is run to produce the normalization.
The authors fine-tune ByT5 in two steps, first on synthetic data and then on the MULTILEXNORM data. To obtain synthetic data, they use Wikipedia as target data, and create unnormalized input through character edits, word edits, and dictionary replacements learned from the MULTILEXNORM data. During the final fine-tuning, they either: (a) use only the MULTILEXNORM data; or (b) combine the MULTILEXNORM data with the synthetic data. They submitted two systems: a single model for every dataset and an ensemble of 4 models for every dataset. Both adapting the input to fit the pre-training step and the use of synthetic data proved to be very beneficial for the system.
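The synthetic-data idea can be sketched as follows: clean (Wikipedia-like) text forms the target side, and the "unnormalized" source side is produced with random character edits. The edit types and rates here are illustrative only; ÚFAL additionally used word edits and dictionary replacements learned from the training data.

```python
import random

# Hedged sketch: corrupt a clean word with one random character edit
# with probability p, yielding a synthetic (noisy, clean) training pair.
def corrupt(word: str, rng: random.Random, p: float = 0.3) -> str:
    if rng.random() > p or len(word) < 2:
        return word
    i = rng.randrange(len(word))
    op = rng.choice(["delete", "duplicate", "swap"])
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "duplicate":
        return word[:i] + word[i] + word[i:]
    if i + 1 < len(word):  # swap adjacent characters
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    return word

rng = random.Random(0)
clean = "tomorrow we are going home".split()
noisy = [corrupt(w, rng) for w in clean]
print(list(zip(noisy, clean)))  # synthetic (unnormalized, normalized) pairs
```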
HEL-LJU * (Scherrer and Ljubešić, 2021) The system is based on a BERT (Devlin et al., 2019) token classification preprocessing step, where for each token the type of the necessary transformation is predicted (none, uppercase, lowercase, titlecase, modify), and a character-level statistical machine translation (SMT) model is used to normalize accordingly. For some languages, depending on the results on the development data, the training data was extended by back-translating OpenSubtitles data. The paper evaluates a range of MT systems and ablations, and shows that a character-level SMT model is highly competitive.
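The token classification step can be illustrated by how such a transformation label might be derived from a (raw, gold) training pair (a hypothetical sketch; HEL-LJU's actual label extraction may differ in its details). Label names follow the paper:

```python
# Hedged sketch of deriving the coarse transformation class for a token;
# "modify" cases are then handled by the character-level SMT model.
def transformation_class(raw: str, gold: str) -> str:
    if raw == gold:
        return "none"
    if raw.upper() == gold:
        return "uppercase"
    if raw.lower() == gold:
        return "lowercase"
    if raw.title() == gold:
        return "titlecase"
    return "modify"

print(transformation_class("amsterdam", "Amsterdam"))  # -> titlecase
print(transformation_class("comming", "coming"))       # -> modify
```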
TrinkaAI (Kubal and Nagvenkar, 2021) The proposed model is based on a sequence labeling approach, where the input tokens are unnormalized and the target tokens are normalized. To reduce the number of target labels and make predictions faster, classes are based on those tokens for which normalization is required, and tokens which do not need to be normalized are labelled with a single target label. This sequence labeling model is fine-tuned on top of a pre-trained multilingual model encompassing all languages in the shared task. Furthermore, a post-processing layer concerning word alignment is applied, which further improved performance. This sequence-labeling approach ranked 6th out of 21 models, and scored highest among all competitors for Spanish.

BLUE (Bucur et al., 2021)
The team tackled the task of lexical normalization as a neural machine translation problem, using the mBART-50 (Tang et al., 2020) multilingual many-to-many model. They fine-tuned the model for all the available languages, and used an MFR baseline for Danish and Serbian. They opted for a sentence-level approach as opposed to a word-level approach, using simple linear sum assignment based on Levenshtein distance to align the normalized words with the raw words.
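The alignment idea can be sketched as below. For brevity, this sketch replaces the exact linear sum assignment the team describes with a greedy assignment, and assumes the normalized side has at least as many tokens as the raw side; both simplifications are ours.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def align(raw, normalized):
    # Greedy stand-in for linear sum assignment: pair each raw token with
    # the cheapest not-yet-used normalized token.
    pairs, used = [], set()
    for r in raw:
        j = min((j for j in range(len(normalized)) if j not in used),
                key=lambda j: levenshtein(r, normalized[j]))
        used.add(j)
        pairs.append((r, normalized[j]))
    return pairs

print(align(["c", "u", "tmrw"], ["see", "you", "tomorrow"]))
# -> [('c', 'see'), ('u', 'you'), ('tmrw', 'tomorrow')]
```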

CL-MoNoise * (van der Goot, 2021)
This is the same method as MoNoise, but it is deployed cross-lingually: it is trained on the source language, including candidate generation (Aspell, word embeddings, n-gram probabilities), then at prediction time, it is applied to the raw data in the target language. The best source language to transfer from is chosen based on empirical results on the training data.
MaChAmp * (van der Goot, 2021) The team behind CL-MoNoise also used a sequence labeling approach, learning (character) transformations from each original word to its normalization. Scores are low on datasets that include capitalization correction, as this is not properly handled by the current transformation algorithm (everything is lowercased beforehand). The method performed much better when based on XLM-R (Conneau et al., 2020) than on mBERT. Potential improvements could be gained by exploiting the multi-task capabilities of MaChAmp (van der Goot et al., 2021).

Intrinsic Metric
A wide variety of evaluation metrics have been used to evaluate lexical normalization performance, including accuracy over OOV words, F1 score, BLEU, word error rate, and character error rate. We choose Error Reduction Rate (ERR) (van der Goot, 2019b) as our main metric. ERR is the word-level accuracy normalized for the percentage of words that are in need of normalization. It is computed from the word-level accuracy of the system and the fraction of words that are not normalized in the annotation (which equals the accuracy of the Leave-As-Is baseline):

ERR = (accuracy - accuracy_LAI) / (1 - accuracy_LAI)

We choose ERR instead of plain word-level accuracy so that scores can be compared (and combined) across datasets, since the datasets differ in how many words are in need of normalization. An accuracy of 93% might be a very good score on one dataset, whereas on another dataset a normalization model which scores 93% might be completely useless. ERR normally lies between 0% and 100%; a negative ERR indicates that the system normalizes more words wrongly than correctly. The Leave-As-Is baseline (Section 3), which simply returns the input words, thus by definition scores an ERR of 0.0. For a more in-depth discussion of evaluation metrics for normalization and ERR, we refer the reader to Section 5.1 of van der Goot (2019b).

Table 4: ERR on the test data (%). Negative values indicate that the system normalizes more words wrongly than correctly. Gray rows indicate baseline systems provided by the organizers. * Teams including an organizer.
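In code, ERR can be computed as follows (a minimal sketch assuming a one-to-one word alignment and at least one word in need of normalization, so the denominator is non-zero):

```python
# Error Reduction Rate: word-level accuracy normalized by the accuracy of
# the Leave-As-Is (LAI) baseline, i.e. the fraction of words whose gold
# form equals the raw input.
def err(raw, gold, pred):
    n = len(gold)
    acc = sum(p == g for p, g in zip(pred, gold)) / n
    acc_lai = sum(r == g for r, g in zip(raw, gold)) / n
    return (acc - acc_lai) / (1.0 - acc_lai)

raw  = ["u", "said", "tmrw", "?"]
gold = ["you", "said", "tomorrow", "?"]
print(err(raw, gold, raw))                            # LAI scores 0.0 by definition
print(err(raw, gold, ["you", "said", "tomorrow", "?"]))  # perfect system scores 1.0
```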
The official winner of the shared task is the highest-scoring team (macro-averaged over all datasets/languages) with an open-source implementation.

Results: Intrinsic Evaluation
The main results of the shared task are shown in Table 4. All submissions except MaChAmp beat the LAI baseline, and most of them beat the MFR baseline, which turned out to be a strong baseline.
The overall winner of the shared task was ÚFAL-2, which beat the second-best team by a staggering 13 points. Recall that this method is based on ByT5, a transformer-based encoder-decoder byte-level system. Of particular note is that while recent neural approaches (Lourentzou et al., 2019; Muller et al., 2019) had not been clearly superior to the baseline MoNoise on English, ÚFAL-2 surpassed the baseline method by an appreciable margin.
The second-best team was HEL-LJU, who improved over MoNoise by more than four points. The authors report that character-level NMT provided lower results than their SMT approach, even with backtranslation.
The remaining teams mostly used either token classification or encoder-decoder approaches.
The results of this shared task showed that for state-of-the-art results one needs: (1) pre-trained models; (2) an encoder-decoder architecture over bytes or characters; and (3) synthetic task-related data.
One thing in common for the datasets with lower results in Table 4 (e.g., ES, NL, TR, and TR-DE) is that they include annotation for capitalization. Unsurprisingly, smaller datasets also tend to result in lower scores in general.

Extrinsic Evaluation
We perform our main extrinsic evaluation of the impact of normalization on dependency parsing using Universal Dependencies annotation (Nivre et al., 2020). See Section 5.2 for test set details. We used version 2.8 of all treebanks; syntactically split multiword tokens and empty nodes (ellipsis) were undone with ud-conversion-tools. We trained MaChAmp (van der Goot et al., 2021) with default settings and XLM-R embeddings (Conneau et al., 2020). We use the largest canonical treebank of each of the respective languages as the source domain, and attempt to improve performance on the target domain by normalizing the data first. In other words, we take the input text, use a normalization system to obtain the normalized version, and pass this to the parser as input. The parsers are not trained on social media data, so as to evaluate the impact of normalization on parsing in a domain-shift scenario. Because the normalized version should be closer to the canonical training data of the parser, performance is expected to improve compared to using the input directly. The training treebanks are UD_English-EWT (Silveira et al., 2014), UD_German-GSD (Brants et al., 2004), UD_Italian-ISDT (Bosco et al., 2014), and UD_Turkish-IMST (Sulubacak et al., 2016). For the extrinsic evaluation, we report the average over three runs with different random seeds (only the parser is retrained, not the normalization models).

Alignment-aware Parsing Metrics
Because our definition of the lexical normalization task includes the splitting and merging of tokens (namely "one-to-many" and "many-to-one" replacements, cf. Section 1), the standard evaluation for dependency parsing has to be adapted for our purposes. Specifically, we compute the token-level labeled attachment score (LAS) and unlabeled attachment score (UAS) after aligning predicted tokens to gold tokens. We refer to these alignment-aware metric variants as aligned LAS and UAS (i.e., a-LAS and a-UAS), respectively. Because "many-to-one" replacements are relatively rare, cannot be verified (normalization annotation is not available for most treebanks), and are non-trivial to include in the aligned evaluation, we decided to undo them in the system outputs and use the original input instead. For the "one-to-many" replacements, we check whether one of the words in the split is attached correctly; all incorrect words in the 'many' are simply ignored. It should be noted that this can give an advantage to systems that split, and we thus suggest that this metric always be reported together with the number of splits. Furthermore, we assume none of the teams exploited this shortcoming of the metric, as they were unaware of these details.
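The "one-to-many" credit rule above can be sketched as follows. This is a simplified illustration of ours, not the official evaluation script: each gold token may be aligned to several predicted tokens (a split), and it counts as correct if any aligned predicted token received the right head and label. We also assume predicted head indices are already expressed in gold-token positions, which sidesteps the head-mapping step of the real metric.

```python
# Simplified alignment-aware LAS: arcs are (head, label) per token;
# alignment maps each gold token id to the predicted token ids covering it.
def a_las(gold_arcs, pred_arcs, alignment):
    correct = 0
    for g, pred_ids in alignment.items():
        correct += any(pred_arcs[p] == gold_arcs[g] for p in pred_ids)
    return correct / len(gold_arcs)

# gold: "tomorrow(1) we(2) leave(3)"
gold_arcs = {1: (3, "obl"), 2: (3, "nsubj"), 3: (0, "root")}
# prediction split "tmrw" -> "to morrow": tokens 1a/1b both align to gold 1
pred_arcs = {"1a": (3, "case"), "1b": (3, "obl"), "2": (3, "nsubj"), "3": (0, "root")}
alignment = {1: ["1a", "1b"], 2: ["2"], 3: ["3"]}
print(a_las(gold_arcs, pred_arcs, alignment))  # -> 1.0 (one split token matched)
```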

Test Sets and Metric
We employ a-LAS as the main metric for extrinsic evaluation, and also report a-UAS for the sake of completeness. Each system was tested on 7 dependency parsing treebanks consisting of posts from Twitter in 4 different languages; the individual treebanks are listed in Table 5.

Results: Impact on Parsing
The main results (a-LAS) are reported in Table 5.
Although most submissions outperform the LAI baseline, it becomes clear that lexical normalization is only a step towards closing the gap in performance on canonical data, as performance is still far from the average LAS on our training sets (79 LAS). This is confirmed by the scores obtained with the manually-annotated (gold) normalization. The best-performing model scores 1.72 a-LAS points higher than the LAI baseline. Compared to normalization performance (Table 4), the baselines (LAI and MFR) rank highly, especially for tr-iwt151 and en-tweebank2, probably because they carry less risk of over-normalization, and some of the treebanks may need only very little normalization (there is also an abundance of canonical data to be found on Twitter). The largest gains compared to the LAI baseline are obtained on the en-monoise treebank, probably because this treebank was specifically filtered to contain data in need of normalization: in the gold-standard annotations, 31.33% of its words are normalized, compared to 1.04% for en-tweebank2. The full a-UAS scores can be found in the appendix (Table 6). In general, a-UAS is approximately 7-10 points higher than a-LAS (absolute), and differences between teams are smaller. Interestingly, the ranking is slightly different there, with HEL-LJU and MFR ranking higher, and CL-MoNoise ranking lower. Overall, the results confirm that the best normalization systems (by ÚFAL and HEL-LJU) also result in the highest observed parsing improvements on these social media treebank test sets. Once again, these two teams outperform the previous state-of-the-art system (i.e., MoNoise).

Table 5: a-LAS scores (%) and the number of splits (in smaller font) for each dataset. Gray rows indicate baseline systems provided by the organizers. * Teams including an organizer. The final row ("Gold") indicates performance with gold-standard normalization.

Results: Impact on POS tagging
Additionally, we calculated the POS accuracies using the same heuristic as described in Section 5.1 (i.e., a-POS), and present full results in the appendix (Table 7). Again, we see some changes in the ranking of the teams, and performance improvements are slightly more moderate compared to a-LAS. The baselines score highest on en-tweebank2 (MFR) and tr-iwt151 (LAI), and the highest gains are again obtained on the en-monoise treebank.

Conclusions
With MULTILEXNORM, we have developed a multilingual benchmark for lexical normalization consisting of previously-created datasets spanning 12 language variants. We proposed a standard evaluation metric, and both intrinsic and extrinsic evaluation via dependency parsing and POS tagging. We hosted a shared task with this new benchmark, which enabled comparison of the performance of 21 models (18 submissions by participants, and 3 in-house baselines). The results of the shared task show that the previous state of the art on lexical normalization is outperformed by a large margin.
The extrinsic evaluation on dependency parsing and POS tagging shows that lexical normalization is beneficial (with improvements of up to +1.72 a-LAS, +0.85 a-UAS, and +1.54 a-POS), but there is still a performance gap compared to the performance levels observed on canonical data. We hope that the proposed benchmark will lead to more research in multilingual normalization, and more transparent and fairer comparisons. All submissions, evaluation scripts, and baseline models are available in the shared task repository.

A MULTILEXNORM Data Statement
Following Bender and Friedman (2018), we present statements for MULTILEXNORM data.

A. CURATION RATIONALE
• Danish: We collected data from Twitter by querying the Twitter API using the following emotion-related keywords: frygt, glaed, kaerlighed, overraskelse, racistisk, sjov, smerte, tristhed. Tweets were collected in 2019-2020. We also scraped all Arto pages from the Internet Wayback Machine (archive.org), and extracted the user-generated content from the HTML with a script. We then filtered the combination of this data, keeping only sentences that were classified as Danish with a confidence of at least 0.885 by the fastText language classifier (Joulin et al., 2017b), contained at least 3 words from the Danish Aspell dictionary (which are not in the English dictionary), and contained at least 2 words not in the Danish Aspell dictionary.
• German: To create this corpus, we randomly sampled 10,000 messages from the German Twitter Snapshot (GTS; Scheffler, 2014), a collection of 24 million tweets which were gathered in April 2013 by permanently tracking a list of 397 frequent German words via the Twitter Streaming API and subsequently filtered with langid.py (Lui and Baldwin, 2012). We analyzed all tokens of the sample with TreeTagger (Schmid, 1994) and hunspell. Afterwards, two human experts annotated all words that any of these tools considered out-of-vocabulary (OOV) and that appeared at least twice in the selected microblogs or belonged to a set of 1,000 randomly-chosen hapax legomena. Finally, we kept only tweets that contained words annotated as spelling deviations by either of the experts, resulting in a total of 1,492 messages.
• English: Tweets were collected using the Twitter Streaming API over the period 23-29 May, 2014, and then filtered with langid.py (Lui and Baldwin, 2012) to remove non-English tweets. To ensure that tweets had a high likelihood of requiring lexical normalization, tweets with fewer than 2 non-standard words (i.e., words not occurring in the SCOWL dictionary) were filtered out.
• Spanish: To maximize the chances of getting tweets in the Spanish language, tweets were collected through Twitter's streaming API by restricting the search to a geographical bounding box within Spain but excluding bilingual regions. The selected geographic area forms a rectangle with Guadalajara (coordinates: 41, -2) as the northeasternmost point and Cadiz (coordinates: 36.5, -6) as the southwesternmost point. The resulting collection with over 227K tweets was filtered to keep only tweets identified by Twitter as having been written in Spanish (i.e. 'lang' field set to 'es'), and further sampling was done to make it manageable for manual labeling.
• Croatian: The dataset is a subset of the large Croatian Twitter crawl harvested with Tweet-Cat (Ljubešić et al., 2014) between 2013 and 2016. It contains a similar amount of standard and non-standard data, and non-standard data was oversampled from the original data collection. The standardness level of the data was predicted via feature-based machine learning (Ljubešić et al., 2015). Discrimination between Croatian and Serbian tweets was performed with a dedicated supervised classifier (Ljubešić and Kranjčić, 2015).
• Indonesian-English: Barik et al. (2019) collected Indonesian-English code-mixed tweets using the Twitter search API. First, they compiled a list of Indonesian and English stopwords (100 for each language), based on frequent word lists from Wiktionary. The stopwords were then used as search queries. In order to obtain code-mixed tweets, the "language" parameter in the search query was set to be contrastive to the language of the stopword used. For example, the "language" parameter was set to English when an Indonesian stopword was used as a query, and vice versa. To minimize the chance that tweets contain any word from local indigenous or other languages, the "location" parameter in the search query was restricted to only Jakarta and Bandung (the two largest cities in Indonesia). In total, 49,647 tweets were collected. Two human annotators labeled a sample of 825 tweets from the larger collection. The annotators were instructed to tokenize tweets into a list of word segments, and then provide the lowercase normalized form for each segment. A segment can be a single or multi-token word, untokenized proper name, hyperlink, emoticon, or Twitter special term (i.e., hashtag or mention).
• Italian: The dataset is a subset of the data from Sanguinetti et al. (2018) (v. 2.1), which in turn is a subset of SENTIPOLC (Barbieri et al., 2016) and SentiTUT (Bosco et al., 2013). Tweets were mostly collected during the period 2011-2012, and were filtered based on keywords about politics, in addition to a small subset from the random Twitter API stream. To ensure a basic density of non-standard language for further annotation, only tweets containing ≥ 3 out-of-vocabulary words (i.e., words not in the Aspell dictionary for Italian and not a URL, hashtag, username, or punctuation-only text) were kept. Moreover, a small list of proper nouns, taken from the most frequent out-of-vocabulary words in the dataset, was added to the vocabulary.
• Dutch: We took the data from the SoNaR Nieuwe Media Corpus (Oostdijk et al., 2014) as a starting point, and selected sentences which contain at least 3 words which are not in the Aspell dictionary for Dutch. We originally took 500 sentences from each sub-domain (SMS, chats, and tweets), and then removed all sentences which were completely written in another language (i.e., Frisian, Afrikaans, English, or Spanish).
• Slovenian: The dataset is a subset of a large Slovenian Twitter crawl harvested with Tweet-Cat (Ljubešić et al., 2014) between 2013 and 2016. It contains a similar amount of standard and non-standard data, and non-standard data was oversampled from the original data collection. The standardness level of the data was predicted via feature-based machine learning (Ljubešić et al., 2015).
• Serbian: The dataset is a subset of a large Serbian Twitter crawl harvested with Tweet-Cat (Ljubešić et al., 2014) between 2013 and 2016. It contains a similar amount of standard and non-standard data, and non-standard data was oversampled from the original data collection. The standardness level of the data was predicted via feature-based machine learning (Ljubešić et al., 2015). Discrimination between Croatian and Serbian tweets was performed with a dedicated supervised classifier (Ljubešić and Kranjčić, 2015).
• Turkish-German: The code-switched dataset is derived by filtering tweets labeled as Turkish and German according to Twitter's language ID assignment. Turkish tweets were collected in 2015 and German tweets during 2009-2011.
To identify mixed German-Turkish tweets, we mainly used morphological analyzers (Oflazer, 1994; Schmid et al., 2004) as filters. Manual filtering followed the automatic filtering, resulting in the final dataset. The raw tweets were manually tokenized, normalized and segmented (Çetinoglu, 2016). In addition, usernames and URLs were anonymized as @username and [url], respectively, and language IDs for each token were added. Adapting the dataset to the normalization task was performed in van der Goot and Çetinoglu (2021).
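Several of the selection procedures above (the Italian and Dutch datasets in particular) rely on the same heuristic: keep only texts with a minimum number of out-of-vocabulary words, where URLs, hashtags, usernames, and punctuation-only tokens never count as out-of-vocabulary. The sketch below illustrates this filter under simplifying assumptions: a plain Python word set stands in for the actual Aspell dictionary, the token-type checks are approximate, and the function and variable names are our own, not from the datasets' original pipelines.

```python
import re

def is_candidate_token(tok):
    """Heuristic: skip tokens that never count as out-of-vocabulary,
    i.e., URLs, hashtags, usernames, and punctuation-only tokens."""
    if tok.startswith(("http://", "https://", "#", "@")):
        return False
    if not re.search(r"[A-Za-z]", tok):  # no letters: punctuation/digits only
        return False
    return True

def keep_text(tokens, vocab, min_oov=3):
    """Keep a tweet/sentence only if it contains at least `min_oov`
    out-of-vocabulary word tokens (a basic density of non-standard
    language). `vocab` stands in for an Aspell dictionary lookup."""
    oov = sum(1 for t in tokens
              if is_candidate_token(t) and t.lower() not in vocab)
    return oov >= min_oov

# Toy vocabulary in place of a real spelling dictionary.
vocab = {"this", "is", "a", "normal", "tweet", "with", "words"}
print(keep_text("c u 2moro m8 lol".split(), vocab))       # non-standard: kept
print(keep_text("this is a normal tweet".split(), vocab))  # standard: dropped
```

In the actual dataset construction, the dictionary lookups were done against language-specific Aspell dictionaries, and (for Italian) the vocabulary was extended with a small list of frequent proper nouns to avoid counting them as non-standard.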
B. LANGUAGE VARIETY
All of the datasets consist of social media variants of the standard languages, and are not bound by a regional standard (i.e., no distinction is made between en_us or en_gb).
C. SPEAKER DEMOGRAPHIC
The speaker demographics are unknown. For some of the collected data this might have been available, but it is not shared on purpose (for privacy reasons).

D. ANNOTATOR DEMOGRAPHIC
• Danish: Two native speakers of Danish. Both were higher-education students (one male, one female), between the ages of 20 and 30.
• German: An undergraduate (native German speaker studying computational linguistics), and a PhD student (Belarusian Germanist pursuing a degree in computational linguistics).
• English: 12 interns and employees at IBM Research Australia were involved in the data annotation. All annotators had a high level of English proficiency (IELTS 6.0+) and were reasonably familiar with Twitter data.

Table 6: a-UAS scores (%) and the number of splits (in smaller font) for each dataset. Gray rows indicate baseline systems provided by the organizers. * Teams including an organizer. The final row ("Gold") indicates performance with gold-standard normalization.
• Spanish: Nine native speakers of Spanish. Eight male and one female with ages ranging from 30 to 60. All annotators had a background in natural language processing and were familiar with the Twitter platform.
• Croatian: Three native speakers of Croatian, all linguists with an MA degree, trained in data annotation.
• Indonesian-English: Two native speakers of Indonesian, fluent in English. Both annotators were 22 years old at the time of the annotation.
• Italian: Four native speakers of Italian, all male, between the ages of 20 and 38, from a variety of Italian regions (i.e., Veneto, Tuscany, Liguria, and Apulia). All annotators had a background in natural language processing and were familiar with the Twitter platform.
• Dutch: The main annotator was a native Dutch speaker and Information Science master's student (male, age range 20-25). The second annotator (for agreement scores) was a native Dutch-speaking male PhD student in NLP, age 27.
• Slovenian: Five native speakers of Slovenian, all master-level students of language-related studies.
• Serbian: Two native speakers of Serbian, both linguists with an MA degree, trained in data annotation.
• Turkish-German: The annotators were three Turkish-German bilinguals born and raised in Germany. All three studied computational linguistics, and their ages ranged from 20 to 25.

E. SPEECH SITUATION
The data is not spoken. However, input methods may have changed over time: a tweet collected in 2012 was less likely to have been produced with a spell checker than one collected in 2020.

Table 7: a-POS accuracy (%) for each dataset. Gray rows indicate baseline systems provided by the organizers. * Teams including an organizer. The final row ("Gold") indicates performance with gold-standard normalization.

F. TEXT CHARACTERISTICS
The datasets consist of short social media posts, most of them shorter than 140 characters (Twitter increased the maximum tweet length in September 2017).

I. PROVENANCE APPENDIX
The data is released under a CC-BY-SA license.

B a-UAS Scores
We report a-UAS scores in Table 6.

C a-POS Accuracies
We report a-POS accuracy values in Table 7. Note that the en-aae treebank is not included here because it has no POS annotation.