IndoCollex: A Testbed for Morphological Transformation of Indonesian Word Colloquialism

The Indonesian language is heavily riddled with colloquialism, whether in written or spoken form. In this paper, we identify a class of Indonesian colloquial words that have undergone morphological transformations from their standard forms, categorize their word formations, and propose a benchmark dataset of Indonesian Colloquial Lexicons (IndoCollex) consisting of informal words on Twitter expertly annotated with their standard forms and their word formation types/tags. We evaluate several models for character-level transduction to perform morphological word normalization on this testbed to understand their failure cases and to provide baselines for future work. As IndoCollex catalogues word formation phenomena that are also present in the non-standard text of other languages, it also provides an attractive testbed for methods tailored for cross-lingual word normalization and non-standard word formation.


Introduction
Indonesian is one of the most widely spoken languages in the world, with around 200 million speakers. Despite its large number of speakers, Indonesian is not well represented in terms of NLP resources (Joshi et al., 2020). Most of its data take the form of unlabeled web and user-generated content on online platforms such as social media, which is noisy and riddled with colloquialism that poses difficulties for NLP systems (Baldwin et al., 2013a; Eisenstein, 2013a).
Traditionally, the majority of Indonesian colloquial or informal lexicons have been words borrowed from foreign languages or local dialects, sometimes with phonetic and lexical modifications. Increasingly, however, Indonesian colloquial words are morphological transformations of their standard counterparts; following Trips (2017), we use the term morphological transformation broadly to include word form changes at the respective interfaces of grammar (phonology, syntax, and semantics). Despite these evolving lexicons, existing research on Indonesian word normalization has largely (1) relied on creating static informal dictionaries (Le et al., 2016), rendering normalization of unseen words impossible, and (2) targeted specific tasks such as sentiment analysis (Le et al., 2016) or machine translation (Guntara et al., 2020), with no direct implication for word normalization in general. Given the obvious utility of NLP systems that can normalize informal Indonesian data, we believe the bottleneck is the lack of a standard open testbed on which researchers and developers of such systems can test the effectiveness of their models on these colloquial words.
In this paper, we introduce IndoCollex, a new, realistic dataset aimed at testing normalization models on these phenomena. IndoCollex is a professionally annotated dataset in which each informal word is paired with its standard form and expertly annotated with its word formation type. The words are sampled from Twitter across different regions and therefore contain naturally occurring Indonesian colloquial words.
We benchmark character-level sequence-to-sequence transduction with LSTM (Deutsch et al., 2018; Cotterell et al., 2018) and Transformer (Vaswani et al., 2017) architectures, as well as a rule-based approach (Eskander et al., 2013; Moeljadi et al., 2019), on our data to understand their success and failure cases (§7.2, §7.3) and to provide baselines for future work. We also test back-translation, a data augmentation method from machine translation which, to the best of our knowledge, has never been applied to character-level morphological transformation, and observe that adding back-translated data when training the Transformer improves its performance for normalizing informal words. We also test models in the other direction, generating informal from formal words, which can be useful for generating possible lexical replacements for standard text (Belinkov and Bisk, 2018).

Related Work
With the advent of social media and other user-generated content on the web, non-standard text such as informal language, colloquialism, and slang has become more prevalent. Concurrently, the rise of technologies like unsupervised language modeling has opened up a new avenue for low-resource languages that lack annotated data for supervision, as these systems typically require only large amounts of unlabeled text to train (Lample and Conneau, 2019; Brown et al., 2020). However, even when NLP systems require only unlabeled data, the varying degrees of formalism between different sources of monolingual data pose domain adaptation challenges for systems trained on one source (e.g., Wikipedia) that must transfer to another (e.g., social media) (Eisenstein, 2013b; Baldwin et al., 2013b; Belinkov and Bisk, 2018; Pei et al., 2019). Worse yet, for an overwhelming majority of lower-resource languages, unstructured and unlabeled text on the Internet is often the sole source of data for training NLP systems (Joshi et al., 2020). Addressing this formalism discrepancy will therefore broaden the types of web text that can be employed in language technologies, especially for languages such as Indonesian that exhibit a high degree of informalism, as we will discuss.
While this motivates research on training systems that are robust to non-standard data (Michel and Neubig, 2018; Belinkov and Bisk, 2018; Tan et al., 2020b,a), one intuitive direction is to normalize colloquial language use. Most work on colloquial language normalization has been done at the sentence level: for colloquial English (Han et al., 2013; Lourentzou et al., 2019), Spanish (Cerón-Guzmán and León-Guzmán, 2016), Italian (Weber and Zhekova, 2016), Vietnamese (Nguyen et al., 2015), and Indonesian (Barik et al., 2019; Wibowo et al., 2020). However, research on the linguistic phenomena of non-standard text (Mattiello, 2005), which argues that slang words exhibit extra-grammatical morphological properties (such as portmanteaus and clipping) that distinguish them from standard forms, justifies the need for word-level normalization.
Word-level normalization also has merit because, due to its much smaller hypothesis space, models can be trained with significantly less data (e.g., compare SIGMORPHON's 10k examples to WMT's 10^6 in the high-resource setting). Further, from our manual analysis of the top-10k most frequent Indonesian informal words we collected from Twitter, we find that around 95% of these words do not require context to normalize. Additionally, previous work such as Kulkarni and Wang (2018) has suggested that creating computational models for the generation of informal words can give us insight into the generative process of word formation in non-standard language, deepening our understanding of non-standard text. Moreover, such models are potentially applicable to many languages, since word formation patterns are shared across languages (Štekauer et al., 2012); e.g., portmanteaus (such as brexit) have been found not only in English but also in many other languages such as Indonesian (Dardjowidjojo, 1979), Modern Hebrew (Bat-El, 1996), and Spanish (Piñeros, 2004). Finally, these studies may have broader applications, including the development of rich conversational agents and tools like brand name and headline generators (Özbal and Strapparava, 2012).
Previous work that qualitatively catalogues or creates computational models for informal word formations such as shortening has mostly been in English, using LSTMs (Gangal et al., 2017; Kulkarni and Wang, 2018) or finite state machines (Deri and Knight, 2015) to generate informal words given the standard forms and the type of word formation. Most of the datasets used to train these models (formal-informal word pairs labeled with their word formations) are also in English. Other dictionaries of informal English words include SlangNet (Dhuliawala et al., 2016), SlangSD (Wu et al., 2018), and SLANGZY (Pei et al., 2019). There is also a dataset containing pairs of formal-informal Indonesian words (Salsabila et al., 2018), but its pairs are not annotated with word formation mechanisms. To the best of our knowledge, ours is the first formal-informal lexicon in a language other than English that is annotated with word formation types.

Indonesian Colloquial Words
Language evolves over time due to the process of language learning across generations, contact with other languages, differences between social groups, and rapid casual usage (Liberman et al., 2003). Each of these factors exists to a high degree in Indonesia, resulting in the constant evolution of its language: contact with over 700 local languages (Simons and Fennig, 2017), socioeconomic and educational inequalities that result in varying levels of adoption of standard Indonesian (Azzizah, 2015), and the rise of social media usage with widespread celebrity culture (Suhardianto et al., 2019; Heryanto, 2008) that causes new words to be invented and spread rapidly.
We catalog the following word formation types that are common in colloquial Indonesian.
1. Disemvoweling: elimination of some or all vowels, e.g., jangan to jgn ('no' or 'don't'). Disemvoweling does not correspond to any phonetic change.
2. Shortening or clipping: syllabic shortening of the original word, e.g., internet to inet. Unlike disemvoweling, shortening does imply a phonetic change.
3. Space/dash removal: a shortened way of writing the Indonesian plural form, e.g., teman-teman to temanteman or teman2 ('friends').
4. Phonetic (sound) alteration: a slight change in both sound and spelling while the number of syllables stays the same, e.g., pakai to pake or pakek ('use').
5. Informal affixation: modification, addition, or removal of affixes, e.g., mengajari to ngajarin ('to teach').
6. Compounding and acronym: syllabic and letter compounds of one or more words, akin to acronyms, abbreviations, and portmanteaus, e.g., anak baru gede to abg ('teen'), budak cinta to bucin (literally, 'being a slave to love').
7. Reverse: letter reversal, colloquially known as "Boso Walikan" (Hoogervorst, 2014), e.g., malang (the name of a city in Indonesia) to ngalam.
8. Loan words: borrowed words, often from a local language or English, e.g., bokap ('dad' in Betawi).
9. Jargon: taglines or terms that have become popular, e.g., meneketehe, from mana aku tahu (a jargon for 'how should I know?').
Some of the above transformations are also found in other languages, such as English and Korean. In English, disemvoweling was common during the texting (SMS) era as a way to write faster and save on message length, e.g., c u l8r ('see you later'). Informal affixation (cryin, sweet-ass), compounding and portmanteaus (btw, sexting), and phonetic alteration (dis is da wae) are also present. In Korean, compounded or shortened Konglish is also widely used (Khan and Choi, 2016), e.g., chimaek from chicken and maek ('beer'). Any insight we obtain through evaluating models on our dataset may therefore be of interest to other languages that share similar colloquial transformations; such insights may become increasingly important given the rising prevalence of non-standard text in many languages on the web (Kulkarni and Wang, 2018; Joshi et al., 2020) and the challenges it poses to NLP systems (Belinkov and Bisk, 2018; Pei et al., 2019).
Loan word transformations that come from other languages require multilingual dictionaries/embeddings to normalize, while jargon often requires background knowledge. Aside from these two, we follow previous work and hypothesize that the word formations in the other categories are mostly morphological transformations that can be learned at the character level (Kulkarni and Wang, 2018; Gangal et al., 2017). In §4, we describe how we curate this colloquial transformation data.

Indonesian Colloquialism Analysis
In this section, we motivate the importance of research on Indonesian colloquialism by highlighting its prevalence in Indonesian web text. We observe that, in daily use, Indonesians generate web content in colloquial Indonesian (1) with vocabulary that differs from formal Indonesian and (2) at a higher rate than colloquial use in English.
To compare colloquial Indonesian (from Twitter and Lazada product reviews) with formal Indonesian (from Kompas news articles (Tala, 2003)), we compute each dataset's perplexity as well as its out-of-vocabulary (OOV) rate with respect to an Indonesian formal lexicon constructed by tokenizing Indonesian Wikipedia articles. For a fair comparison, we sample 3685 sentences from each dataset. To compare with colloquial use in English, we also compare English tweets to an English formal lexicon constructed from English Wikipedia articles. We use Wikipedia to construct these lexicons so that they include named entities, which are not typically present in traditional dictionaries. Table 1 shows the OOV rates of the various datasets. Our OOV count excludes Twitter usernames, hashtags, mentions, URLs, dates, and numbers. To avoid rare words being captured as OOV, we also report a variant that removes any token occurring only once (shown as OOV-2 in the table). We observe that the OOV rate of colloquial Indonesian is double that of informal English. The OOV rate of the formal Indonesian text (Kompas news) is low, as expected.
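As a concrete illustration, the sketch below computes the OOV and OOV-2 rates of a token stream against a formal lexicon. The tokenization and the regex filter for usernames, hashtags, URLs, dates, and numbers are simplifying assumptions, not the paper's exact preprocessing.

```python
import re
from collections import Counter

# Tokens excluded from the OOV count: usernames/mentions, hashtags, URLs,
# dates, and numbers (an illustrative approximation of the paper's filters).
SKIP = re.compile(r"^(@\w+|#\w+|https?://\S+|[\d./:-]+)$")

def oov_rates(tokens, lexicon):
    """Return (OOV rate, OOV-2 rate) of `tokens` w.r.t. a formal `lexicon`."""
    counts = Counter(t.lower() for t in tokens if not SKIP.match(t))
    total = sum(counts.values())
    oov = {t: c for t, c in counts.items() if t not in lexicon}
    oov_rate = sum(oov.values()) / total
    # OOV-2: additionally drop tokens occurring only once, so that rare
    # (but legitimate) words are not captured as OOV.
    oov2_rate = sum(c for c in oov.values() if c > 1) / total
    return oov_rate, oov2_rate
```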
We use perplexity to measure the impact of colloquialism beyond vocabulary usage, utilizing a pre-trained Indonesian GPT-2 trained on Wikipedia (huggingface.co/cahya/gpt2-small-indonesian-522M) and OpenAI's GPT-2 (huggingface.co/gpt2) to calculate the perplexities of the Indonesian and English data, respectively. Table 1 shows these perplexities.
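The perplexity computation can be sketched as below, using the two pre-trained LMs named above via the HuggingFace transformers library; treating corpus perplexity as the exponentiated mean token NLL is our assumption, as the paper does not detail the computation.

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

def perplexity(sentences, model_name):
    """Corpus perplexity = exp(mean per-token negative log-likelihood)."""
    tok = GPT2TokenizerFast.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name).eval()
    nll, n_tokens = 0.0, 0
    with torch.no_grad():
        for s in sentences:
            ids = tok(s, return_tensors="pt").input_ids
            out = model(ids, labels=ids)        # loss = mean token NLL
            nll += out.loss.item() * ids.size(1)  # approximate re-weighting
            n_tokens += ids.size(1)
    return math.exp(nll / n_tokens)

# e.g., perplexity(id_tweets, "cahya/gpt2-small-indonesian-522M")
#       perplexity(en_tweets, "gpt2")
```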
Indonesian tweets have perplexity comparable to Lazada reviews, as both use colloquial language. Both also have much higher perplexities than Kompas, implying that the Indonesian LM finds colloquial Indonesian different from formal Indonesian. Similarly, English tweets have a higher perplexity than English Wikipedia (Radford et al., 2019). Notably, aside from Indonesian Twitter having around twice the OOV rates of English Twitter (as high as 14.6% OOV and 8.3% OOV-2), its perplexity is also significantly higher than that of English Twitter, suggesting that non-standard word formation is a much more prominent issue in Indonesian, yet remains significantly under-researched.

Data Collection and Annotation
Our dataset is constructed and manually annotated from a list of informal words obtained from Twitter. The data construction process is summarized in Figure 1. As an archipelagic country, Indonesia is very diverse in local languages, which affects the way people use Indonesian. Hence, we sample 80 tweets per day from March 2017 to May 2020 from each of the 34 provinces in Indonesia. We then select the top 10k frequent tokens not appearing in our Wikipedia-based formal word dictionary and treat them as informal. From this list, we manually filter out OOV words that are not informal words, such as product names or other entities. Despite being sampled according to geolocation, most of the informal words lean towards those commonly used in Jakarta. We suspect this is because Jakarta, being the center of the Indonesian economy and pop culture (CITE), heavily influences the other regions through mainstream media. Further investigation of this aspect is necessary, and we leave it for future work. We assign four Indonesian native speakers (formally employed by our company, Kata.ai) with formal education in linguistics and/or computational linguistics to annotate each informal word with its standard form and to label each pair with its word formation types according to our annotation codebook (https://github.com/haryoa/indo-collex). We annotate 9 different types of word formation mechanisms: disemvoweling, shortening, space/dash removal, phonetic (sound) alteration, affixation, compounding, reverse, loan word, and jargon. Since an informal word is often produced by stacking multiple transformations, we also annotate the transformation order, from the formal word to the informal. Some annotation examples are shown in Table 2. To simplify the transformation task, we assume single transformations and treat stacked transformations as a sequence of separate transformations; words undergoing multiple transformations are broken down into different entries in our dataset. Ultimately, our dataset consists of parallel formal and informal Indonesian word pairs, each with its annotated word formation type from formal to informal. A sample of our dataset is shown in Table 3. Note that the same formal word with the same transformation may produce different informal words due to the open vocabulary of colloquial words.
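The automatic part of this pipeline, before the manual filtering step, can be sketched as follows; `formal_lexicon` is the Wikipedia-derived dictionary described above, and all names here are illustrative.

```python
from collections import Counter

def informal_candidates(tweet_tokens, formal_lexicon, k=10_000):
    """Top-k frequent tokens absent from the formal dictionary; these are
    treated as informal-word candidates and then manually filtered."""
    counts = Counter(t.lower() for t in tweet_tokens)
    oov = Counter({t: c for t, c in counts.items() if t not in formal_lexicon})
    return [t for t, _ in oov.most_common(k)]
```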
Our dataset contains 3048 annotated word pairs (the full dataset is at https://github.com/haryoa/indo-collex), of which 2036 involve morphological transformations (i.e., are not loan words or jargon). This is comparable in size to other morphological transformation datasets such as the SIGMORPHON shared task (Cotterell et al., 2018). For comparison, Bengali, also a lower-resource language comparable to Indonesian (Joshi et al., 2020), has 136 lemmas (and 4000 word forms) crowdsourced in the SIGMORPHON inflection dataset, while our dataset has 1602 expertly annotated formal words (and 2036 informal variants).
To ensure the quality of our annotations, we sample 100 word pairs and compute Krippendorff's Alpha (α) (Hayes and Krippendorff, 2007) and Cohen's Kappa (κ) (Cohen, 1960) to measure agreement on word formation type annotations. The scores are α = 0.709 and κ = 0.708, showing that the annotators have substantial agreement on our dataset (Viera et al., 2005). We split the dataset into training, validation, and test sets as in Table 4. Note that since reverse formation is quite rare, we augment the data with additional reverse formations in the testing and validation sets.
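A minimal sketch of this agreement computation, assuming the `krippendorff` and scikit-learn packages (our tooling assumption, not the authors' stated setup) and one list of tag labels per annotator over the sampled 100 pairs:

```python
import krippendorff
from sklearn.metrics import cohen_kappa_score

# The nine word formation tags described in the annotation section.
TAGS = ["disemvoweling", "shortening", "space-dash-removal", "sound-alter",
        "affixation", "compounding", "reverse", "loan-word", "jargon"]
TAG_ID = {t: i for i, t in enumerate(TAGS)}

def alpha_over_annotators(annotations):
    """annotations: one list of tag labels per annotator, same 100 items."""
    data = [[TAG_ID[t] for t in ann] for ann in annotations]
    return krippendorff.alpha(reliability_data=data,
                              level_of_measurement="nominal")

def kappa_between(ann_a, ann_b):
    """Pairwise Cohen's kappa between two annotators."""
    return cohen_kappa_score([TAG_ID[t] for t in ann_a],
                             [TAG_ID[t] for t in ann_b])
```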
In our experiments, we exclude loan words and jargon from the evaluation of character-level models, since these transformations are challenging, if not impossible, to handle at the character level alone without (1) additional resources such as multilingual dictionaries/embeddings or (2) additional tasks such as translation.

Rule-Based Transformation Baseline
We believe that some formal-to-informal word formation mechanisms follow regular patterns, so we manually define a rule-based system as one of our baselines (see Appendix). As we will demonstrate in the results section, a rule-based approach entails several challenges. First, our rule-based transformation only works from formal to informal: as most colloquialisms involve removing parts of the word, reverting from informal to formal Indonesian proves difficult for a rule-based system, since it requires predicting the removed characters.
Second, the rules cannot be universally applied. For example, in affixation, some Indonesian root words contain sub-words resembling common morphological affixes such as me- or -kan. Since these sub-words are part of the root words, they should not be removed or altered: membal ('bouncy') cannot be transformed via informal affixation into ngebal, since mem- in membal is part of the root word. Similarly, sound-alter transformation applies only to some words: malam ('night') can be altered to malem, but galak ('fierce') cannot be altered to galek; which words can be sound-altered seems arbitrary. In compounding, there is also no clear rule as to which abbreviation to use in different settings (e.g., anak baru gede is abbreviated to ABG, but rapat kerja nasional is abbreviated to rakernas instead of RKN). Lastly, since a single word may have multiple applicable transformations and a rule-based system cannot rank the possible outputs, it randomly picks one of the candidates.
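To make the flavor of such rules concrete, below is a minimal sketch of two formal-to-informal rules: plain vowel removal for disemvoweling and an a→e substitution in the final syllable for sound alteration. These rules are illustrative assumptions on our part; the paper's full rule set is in its Appendix and is certainly richer.

```python
import re

VOWELS = "aeiou"

def disemvowel(word):
    """Remove all vowels except a word-initial one:
    kemarin -> kmrn, sudah -> sdh."""
    head, tail = word[0], word[1:]
    return head + "".join(c for c in tail if c not in VOWELS)

def sound_alter_final_a(word):
    """Replace 'a' in the final syllable with 'e': malam -> malem.
    Applied blindly, this also (wrongly) yields galek from galak,
    illustrating why such rules cannot be applied universally."""
    return re.sub(r"a(?=[^aeiou]+$)", "e", word)
```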

Character-Level Seq2Seq Models
Previous approaches for generating transformed words model the task as a character-level sequence-to-sequence (SEQ2SEQ) problem: the characters from the root word and an encoding of the desired transformation type are given as input to a neural encoder, and the decoder is trained to produce the transformed word, one character at a time (Gangal et al., 2017; Deutsch et al., 2018; Cotterell et al., 2017). Sample word pairs from our dataset illustrate the task:

Source                         Target    Word Formation Tag
ayo ('let's go', formal)       yuk       sound-alter
ayo ('let's go', formal)       yuks      sound-alter
yuk ('let's go', informal)     kuy       reverse
yuks ('let's go', informal)    skuy      reverse
kemarin ('yesterday', formal)  kmrn      disemvoweling
nasi goreng ('fried rice')     nasgor    compounding
membuka ('opening', formal)    ngebuka   affixation

In reality, however, transformation types are often implied but not given. For example, an Indonesian speaker can transform the formal tolong ('help') to tlg given examples showing that jangan ('don't') can be transformed to jgn, even without the transformation type (i.e., disemvoweling) being specified. Thus, we also experiment with these SEQ2SEQ models for generating informal words from formal ones (and vice versa) without inputting any word formation tag, to see if the models can induce the desired transformation type from morphologically similar words in the training examples. We also use the models trained without word formation input to generate back-translated data to augment our training (§7.1).
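As a concrete, hypothetical illustration of this setup, the sketch below serializes a training example with the word formation tag as an optional prefix token on the source side; the exact input encoding used in the experiments is not specified in the text, so this format is an assumption.

```python
def encode_example(source, target, tag=None):
    """Serialize a word pair for a character-level SEQ2SEQ model, with the
    word formation tag (if given) as a single prefix token."""
    src = (["<" + tag + ">"] if tag else []) + list(source)
    tgt = list(target) + ["</s>"]
    return src, tgt

# encode_example("sudah", "sdh", tag="disemvoweling")
# -> (["<disemvoweling>", "s", "u", "d", "a", "h"], ["s", "d", "h", "</s>"])
```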

BiLSTM
The dominant model for character-level transduction, which has been applied to many tasks such as morphological inflection (Cotterell et al., 2017), morphological derivation (Deutsch et al., 2018), and informal word formation (Gangal et al., 2017), is a character-level SEQ2SEQ model that learns to generate a target word from its original form given the desired transformation. These models typically use a bi-directional LSTM with attention (Luong et al., 2015) to learn these transformations as orthographic functions. For the task of morphological derivation, the SOTA model (Deutsch et al., 2018) also proposes a dictionary constraint in which the decoding process is restricted to output tokens listed in a dictionary, improving the accuracy of their model. We evaluate this SOTA character SEQ2SEQ model with the dictionary constraint (BiLSTM+Dict), whose code is publicly available, on our data. Following their approach, we train this model for 30 epochs with a batch size of 5 using the Adam optimizer with an initial learning rate of 0.005, an embedding size of 20, and a hidden state size of 40. For the dictionary constraint, we construct dictionaries of formal words from Indonesian Wikipedia (§3.2) and of informal words we collected from Twitter (i.e., collected words that do not appear in our Wikipedia-based formal word dictionary, §4).
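To illustrate the dictionary constraint, the sketch below builds a character trie over the target-side dictionary and masks which next characters a decoder may emit, so that every decoded prefix remains a prefix of some dictionary word. This conveys the idea behind the constraint in Deutsch et al. (2018) but is not their implementation.

```python
def build_trie(words):
    """Character trie over a dictionary; '<end>' marks complete words."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["<end>"] = True
    return root

def allowed_next_chars(trie, prefix):
    """Characters that keep the decoded prefix a prefix of a dictionary word;
    a beam-search decoder would zero out the probability of all others."""
    node = trie
    for ch in prefix:
        node = node.get(ch)
        if node is None:
            return set()  # prefix left the dictionary: no continuation
    return {k for k in node if k != "<end>"}
```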

Transformer
More recently, the Transformer has been shown to outperform standard recurrent models on several character-level transduction tasks, including morphological inflection, historical text normalization, grapheme-to-phoneme conversion, and transliteration (Wu et al., 2020); we therefore also evaluate a character-based Transformer model (Vaswani et al., 2017) on our dataset. We conduct hyperparameter tuning on the size of the character embeddings, the number of layers, and the number of attention heads of the Transformer. For training, we use Adam with an initial learning rate of 0.005 and a batch size of 128 (following Wu et al. (2020)), and train for a maximum of 200 epochs, returning the model with the lowest validation loss.
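A minimal sketch of the tuning loop described above: the optimizer, learning rate, batch size, and epoch budget follow the text, while the grid values for embedding size, layers, and heads are illustrative assumptions, since the searched values are not stated.

```python
import itertools

SEARCH_SPACE = {                 # hyperparameters tuned in the paper
    "emb_size": [32, 64, 128],   # illustrative grid values (assumption)
    "layers":   [2, 4],
    "heads":    [2, 4, 8],
}
FIXED = {"lr": 0.005, "batch_size": 128, "max_epochs": 200}  # from the text

def configs():
    """Yield every configuration in the grid, merged with fixed settings."""
    keys = list(SEARCH_SPACE)
    for values in itertools.product(*(SEARCH_SPACE[k] for k in keys)):
        yield {**dict(zip(keys, values)), **FIXED}

# For each cfg in configs(): train a char-level Transformer and keep the
# checkpoint with the lowest validation loss.
```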

Experiment and Results
We evaluate standard character-level transduction models on our dataset to assess its difficulty. Our goal is not to train SOTA models for word normalization but rather to test these models on this task using our data and to elucidate what features of the data make it difficult.

Experiment Settings
We train and evaluate the BiLSTM+Dict and Transformer models on our dataset. The models are trained and evaluated in both directions: formal↔informal (F↔I) Indonesian. However, as mentioned previously, we only explore formal→informal (F→I) for the rule-based model. We also train the SEQ2SEQ models with and without inputting the word formation tag. Each experiment took about 3 hours on a K80 GPU.
Aside from training the models to transform formal↔informal words, we also use the Transformer model to predict the word formation tag t ∈ T, where T is the set of word formation types in our dataset, that best applies given an informal word and its corresponding formal form (I→F) or vice versa (F→I) (i.e., Transformer(I→F)→T and Transformer(F→I)→T).
We experiment with back-translation (Sennrich et al., 2016), which has been used to learn novel inflections in statistical machine translation (Bojar and Tamchyna, 2011), at the character level to increase the training data for I→F. Using the Transformer F→I model that performs best on the validation set, we generate informal words from the words in our formal dictionary, sorted by frequency. We experiment with generating M = kN additional word pairs, where k ∈ {1, 2, 3} and N is the number of word pairs in the original training data. We similarly augment the training data for F→I by using the Transformer I→F model that performs best on the validation set to generate formal words from our informal word dictionary.
To ensure that the augmented data has a similar transformation distribution to the original training data, we predict the word formation type that best applies to each generated word pair using the Transformer(I↔F)→T model that performs best on validation. For each word formation type, we add rM generated pairs with that type to our training data, based on its ratio r in the original training data.
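The sketch below illustrates this tag-ratio matching step: given back-translated candidate pairs with predicted word formation tags, it selects rM pairs per tag so that the augmented data mirrors the tag distribution of the original training data. All function and variable names are illustrative.

```python
from collections import Counter

def ratio_matched_augmentation(train_tags, candidates, k=1):
    """train_tags: tag of each original training pair.
    candidates: back-translated (formal, informal, predicted_tag) tuples,
    assumed sorted by source-word frequency. Returns M = k*N pairs whose
    tag distribution matches the original training data."""
    N = len(train_tags)
    M = k * N
    ratios = {t: c / N for t, c in Counter(train_tags).items()}
    quota = {t: round(r * M) for t, r in ratios.items()}  # rM pairs per tag
    picked = []
    for pair in candidates:
        tag = pair[2]
        if quota.get(tag, 0) > 0:
            picked.append(pair)
            quota[tag] -= 1
    return picked
```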
Each model's performance is measured by top-1 and top-10 accuracy. Since the formal→informal transformation is rather flexible, we also report the BLEU score of the model's output. We report performances of the hyperparameter-tuned models that perform best on the validation set.
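Top-k accuracy over ranked candidate lists can be computed as in the sketch below (an illustration of the metric, assuming each model returns a ranked candidate list per input); BLEU would additionally be computed between the top-ranked output and the reference.

```python
def top_k_accuracy(ranked_preds, references, k=10):
    """ranked_preds: one ranked list of candidate outputs per input word;
    references: the gold target words. Counts a hit if the gold word
    appears among the top-k candidates."""
    hits = sum(ref in cands[:k] for cands, ref in zip(ranked_preds, references))
    return hits / len(references)

# top-1 accuracy: top_k_accuracy(preds, refs, k=1); top-10: k=10
```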

Results
Our experiment results are shown in Table 5. Generally, Transformer models outperform all other models. Specifying the target word formation type improves the performance of both models. Backtranslation is also shown to improve the performance of the Transformer. Transformer with added backtranslation and word formation tag yields the best test performance in both directions.
We also observe that, on average, model performance is higher in the I→F direction than in F→I. We observe similar trends when predicting word formation types given word pairs: the accuracy of the Transformer(I→F)→T model, which predicts the type given an informal word and its corresponding formal form, is 81.4%, significantly higher than the 65.0% accuracy of the Transformer(F→I)→T model, which predicts the type given a formal word and its corresponding informal form. This may point to the inherent ambiguity of generating informal words from formal ones: due to the open vocabulary of informal words, there are potentially many ways to transform a formal word into informal forms.
Surprisingly, rule-based transformation outperforms BiLSTM+Dict and several non-optimal Transformer configurations in terms of top-1 accuracy. However, rule-based transformation does not perform well in terms of top-10 accuracy. We observe that the rule-based transformation does not always manage to produce 10 transformation candidates, therefore missing out on the extra chances to correctly guess the output.

Discussion
In this section, we discuss failure and success cases of the best performing model (the Transformer) on our dataset, elucidate what the model learns, and analyze features of the data that make it challenging. As seen in Table 5, when the desired word formation is not given, the Transformer performs worse on F→I transformation than on I→F. This is because transforming from formal to informal has a higher level of ambiguity, i.e., a word can be made informal by multiple possible word formations. If the word formation type is not given, we observe that the Transformer learns to select the type implicitly. For example, it selects the disemvoweling mechanism implicitly, paying attention to vowels in the word while removing them, e.g., to correctly generate the informal sdh from the formal sudah ('already') (Figure 2). If the input consists of two words (separated by a space), the model assumes the space/dash removal mechanism, paying attention to the characters before and after the space while removing it, e.g., given ga tau ('don't know'), the model removes the space and correctly returns gatau.
However, the Transformer may select an incorrect transformation when the target word formation is not given, e.g., the phrase ibu hamil ('pregnant mother') is often expressed as bumil (acronym). Without the tag, the model performs space/dash removal instead, producing the incorrect ibuhamil. Figure 3 shows how the model attends to the tag when it is given and applies the correct mechanism.
We observe that the model also attends to the tag when transforming the word in the reverse (I→F) direction, e.g., the model pays attention to the tag while correctly generating the vowels of the disemvoweled word ksl to produce kesal ('annoyed'), or the space in the compounded word gatau to produce ga tau (Figure 4). In general, we observe that formal-to-informal transformation is challenging, since multiple valid informal words are possible even for a given word and word formation type. For example, kamu ('you') can be written informally as km or kmu, both with the same disemvoweling transformation. Some word formation mechanisms are also ambiguous. For example, budak cinta's acronym is bucin (using the prefix of the second word), whereas ibu hamil's acronym is bumil (using the suffix of the second word); the acronym transformation seems to be applied on a case-by-case basis with no clear pattern. Reversing an acronym to its original phrase is even more challenging (with or without tags), since it requires models to reconstruct the full phrase given minimal context, e.g., reconstructing anak layangan ('tacky') from its acronym alay.
Another challenging transformation is affixation. Since me- and its variants (mem-, men-, etc.) are common morphological prefixes in Indonesian, we observe that our best model, the Transformer, often inserts me- in I→F affixation transformations, mistakenly transforming, for example, nyantai ('to relax') into menyantai (expected: bersantai). This suggests that more training data may be needed to capture the various affixations.
On the other hand, in sound alteration, we observe that the Transformer successfully learns to sound-alter even when the word formation is not explicitly specified. For example, it learns to transform the informal pake ('to wear') to pakai (attending to the character e when outputting ai), kalo ('if') to kalau (attending to the character o when outputting au), and mauuu ('want') to mau (attending to the characters uuu when outputting u).

Ethical Consideration
Normalizing informal Indonesian might help bridge the generational gap in the use of the language, as informal Indonesian is more popular among the younger populace. Furthermore, it can potentially bridge linguistic differences across the Indonesian archipelago. Although we attempt to collect informal data from each province in Indonesia, the resulting dataset is still mostly Jakarta-centric, and further scraping and verification of the linguistic coverage is necessary in future work. Finally, as not every Indonesian speaks perfect standard Indonesian, having NLP interfaces (such as chatbots) that can readily accept (i.e., process and understand via normalization) any kind of informality promotes the inclusivity that all NLP research should strive for.

Conclusion and Future Work
We show that colloquial and formal Indonesian are vastly different in terms of OOV rate and perplexity, which poses difficulty for NLP systems trained on formal corpora; this significant train-test gap in formalism may hinder progress in Indonesian NLP research. We propose a new benchmark dataset for Indonesian colloquial word normalization that contains formal-informal word pairs annotated with their word formation mechanisms. We test several dominant character-level transduction models as baselines on the dataset and observe that different word formation mechanisms pose different levels of difficulty, with transformation to informal forms being more challenging due to the greater number of valid transformation variants. Through this dataset, we intend to provide a standard benchmark for Indonesian word normalization and to foster further research on models, datasets, and evaluation metrics tailored for this increasingly prevalent and important problem.
In the future, we are interested in using the context in which words occur, whether textual (e.g., sentences) or in other modalities (e.g., images or memes), to improve word transformation (formal ↔ informal), using context as either an implicit signal (Wijaya et al., 2017) or an explicit signal for "translating" between formal and informal word forms based on similarities between their sentence contexts (Feng et al., 2020; Reimers and Gurevych, 2020) or image contexts (Bergsma and Van Durme, 2011; Kiela et al., 2015; Khani et al., 2021). We are also interested in learning whether simple clustering of the contexts in which words occur can help us learn the mapping between formal and informal words, similar to paraphrase matching (Wijaya and Gianfortoni, 2011). Lastly, we are interested in using text normalization to augment data for training informal text translation (Michel and Neubig, 2018; Jones and Wijaya, 2021) or for training other downstream applications such as framing identification (Card et al., 2015; Liu et al., 2019; Akyürek et al., 2020), which are typically trained on formal news text, on informal social media text.