IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effective Domain-Specific Vocabulary Initialization

We present IndoBERTweet, the first large-scale pretrained model for Indonesian Twitter, trained by extending a monolingually-trained Indonesian BERT model with additive domain-specific vocabulary. We focus in particular on efficient model adaptation under vocabulary mismatch, and benchmark different ways of initializing the BERT embedding layer for new word types. We find that initializing with the average BERT subword embedding makes pretraining five times faster, and is more effective than previously proposed vocabulary adaptation methods in terms of extrinsic evaluation over seven Twitter-based datasets.


Introduction
Transformer-based pretrained language models (Vaswani et al., 2017; Devlin et al., 2019; Liu et al., 2019; Radford et al., 2019) have become the backbone of modern NLP systems, due to their success across various languages and tasks. However, obtaining high-quality contextualized representations for specific domains and data sources, such as biomedical, social media, and legal text, remains a challenge.
Previous studies (Alsentzer et al., 2019; Chalkidis et al., 2020; Nguyen et al., 2020) have shown that for domain-specific text, pretraining from scratch outperforms off-the-shelf BERT. As a lower-cost alternative, Gururangan et al. (2020) demonstrated that domain-adaptive pretraining (i.e. pretraining the model on target-domain text before task fine-tuning) is effective, although still not as good as training from scratch.
The main drawback of domain-adaptive pretraining is that domain-specific words not in the pretrained vocabulary are often tokenized poorly. For instance, in BIOBERT (Lee et al., 2019), Immunoglobulin is tokenized into {I, ##mm, ##uno, ##g, ##lo, ##bul, ##in}, despite being a common term in biology. To tackle this problem, Poerner et al. (2020) and Tai et al. (2020) proposed simple methods to domain-extend the BERT vocabulary: Poerner et al. (2020) initialize new vocabulary using a learned projection from word2vec (Mikolov et al., 2013), while Tai et al. (2020) use random initialization with weight augmentation, substantially increasing the number of model parameters.
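To make the fragmentation concrete, the snippet below reproduces the example with an off-the-shelf tokenizer; since BIOBERT reuses the original cased BERT WordPiece vocabulary, bert-base-cased serves as a stand-in (a minimal illustration, not the authors' code):

```python
from transformers import AutoTokenizer

# BIOBERT inherits the general-domain cased BERT vocabulary, so this
# tokenizer exhibits the same fragmentation of in-domain terms.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
print(tokenizer.tokenize("Immunoglobulin"))
# Per the example above: ['I', '##mm', '##uno', '##g', '##lo', '##bul', '##in']
```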
New vocabulary augmentation has also been conducted for language-adaptive pretraining, mainly based on multilingual BERT (MBERT). For instance, Chau et al. (2020) replace 99 "unused" WordPiece tokens of MBERT with new common tokens in the target language, while Wang et al. (2020) extend the MBERT vocabulary with the non-overlapping tokens of the target language (V_new − V_MBERT). Both approaches use random initialization for the new WordPiece token embeddings.
In this paper, we focus on the task of learning an Indonesian BERT model for Twitter, and show that initializing domain-specific vocabulary with average-pooling of BERT subword embeddings is more efficient than pretraining from scratch, and more effective than initializing based on word2vec projections (Poerner et al., 2020). There are two primary reasons to experiment with Indonesian Twitter. First, despite being the official language of the 5th most populous nation, Indonesian is underrepresented in NLP (notwithstanding recent Indonesian benchmarks and datasets (Wilie et al., 2020; Koto et al., 2020a,b)). Second, with a large user base, Twitter is often utilized to support policymakers and business (Fiarni et al., 2016), or to monitor elections (Suciati et al., 2019) and health issues (Prastyo et al., 2020). Note that most previous studies targeting Indonesian Twitter use traditional machine learning models (e.g. n-gram and recurrent models (Fiarni et al., 2016; Koto and Rahmaningtyas, 2017)).
To summarize our contributions: (1) we release INDOBERTWEET, the first large-scale pretrained Indonesian language model for social media data; and (2) through extensive experimentation, we compare a range of approaches to domain-specific vocabulary initialization over a domain-general BERT model, and find that a simple average of subword embeddings is more effective than previously-proposed methods and reduces the overhead for domain-adaptive pretraining by 80%.

Twitter Dataset
We crawl Indonesian tweets over a 1-year period using the official Twitter API, from December 2019 to December 2020, with 60 keywords covering 4 main topics: economy, health, education, and government. We found the Twitter language identifier to be reasonably accurate for Indonesian, and so use it to filter out non-Indonesian tweets: of 100 randomly-sampled tweets, the majority (87) were Indonesian, with a small number of Malay (12) and Swahili (1) tweets. After removing redundant tweets (with the same ID), we obtain 26M tweets with 409M word tokens, two times larger than the training data used to pretrain INDOBERT (Koto et al., 2020b). We set aside 230K tweets for development, and extract a vocabulary of 31,984 types based on WordPiece (Wu et al., 2016). We lower-case all words and follow the same preprocessing steps as English BERTWEET (Nguyen et al., 2020): (1) converting user mentions and URLs into @USER and HTTPURL, respectively; and (2) translating emoticons into text using the emoji package.
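As an illustration, a minimal preprocessing sketch along these lines (the exact regular expressions and function name are our assumptions, not the released pipeline):

```python
import re
import emoji  # pip install emoji

def preprocess_tweet(text: str) -> str:
    # Lower-case, then normalize mentions/URLs and translate emoji to text.
    text = text.lower()
    text = re.sub(r"@\w+", "@USER", text)            # user mentions -> @USER
    text = re.sub(r"https?://\S+", "HTTPURL", text)  # URLs -> HTTPURL
    return emoji.demojize(text)                      # emoji -> textual aliases

print(preprocess_tweet("Halo @jokowi 😍 https://example.com"))
# halo @USER :smiling_face_with_heart-eyes: HTTPURL
```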

INDOBERTWEET Model
INDOBERTWEET is trained with a masked language model objective (Devlin et al., 2019), following the same procedure as the indobert-base-uncased (INDOBERT) model. It is a transformer encoder with 12 hidden layers (dimension=768), 12 attention heads, and feed-forward hidden layers of dimension 3,072. The only difference is the maximum sequence length, which we set to 128 tokens based on the average number of words per document in our Twitter corpus.
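In Hugging Face terms, this architecture corresponds roughly to the configuration below (a sketch; the exact parameter mapping, in particular the position embedding size, is our assumption):

```python
from transformers import BertConfig, BertForMaskedLM

config = BertConfig(
    vocab_size=31_984,            # WordPiece vocabulary from Section 2.1
    hidden_size=768,              # 12 hidden layers of dimension 768
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3_072,      # feed-forward hidden dimension
    max_position_embeddings=128,  # maximum sequence length used here
)
model = BertForMaskedLM(config)   # masked language modeling objective
```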
In this work, we train 5 INDOBERTWEET models. The first model is pretrained from scratch based on the aforementioned configuration. The remaining four models are based on domain-adaptive pretraining with different vocabulary adaptation strategies, as discussed in Section 2.3.
For the linear projection strategy (Method 3), we train 300d fastText embeddings (Bojanowski et al., 2017) over the tokenized Indonesian Twitter corpus. Following Poerner et al. (2020), we use the shared types (V_IB ∩ V_IBT) to train a linear transformation W from the fastText embeddings E_FT to the INDOBERT embeddings E_IB:

\hat{W} = \arg\min_{W} \lVert E_{\mathrm{FT}} W - E_{\mathrm{IB}} \rVert^2

and initialize each new type x ∈ V_IBT as E_{\mathrm{FT}}(x) \hat{W}.

To average subword embeddings of x ∈ V_IBT (Method 4), we compute:

E_{\mathrm{IBT}}(x) = \frac{1}{|T_{\mathrm{IB}}(x)|} \sum_{t \in T_{\mathrm{IB}}(x)} E_{\mathrm{IB}}(t)

where T_IB(x) is the set of WordPiece tokens for word x produced by INDOBERT's tokenizer.
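A minimal sketch of the averaging initialization (Method 4) with the Hugging Face API; the checkpoint name is the public INDOBERT release, and the new word types are hypothetical examples:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

name = "indolem/indobert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)

new_types = ["wkwk", "gaes"]  # hypothetical domain-specific types
emb = model.get_input_embeddings().weight.data  # E_IB

# Average the embeddings of the WordPiece tokens T_IB(x) that the original
# tokenizer assigns to each new type x.
init_vectors = [
    emb[tokenizer(w, add_special_tokens=False)["input_ids"]].mean(dim=0)
    for w in new_types
]

old_size = emb.size(0)
tokenizer.add_tokens(new_types)
model.resize_token_embeddings(len(tokenizer))
new_emb = model.get_input_embeddings().weight.data
for i, vec in enumerate(init_vectors):
    new_emb[old_size + i] = vec  # overwrite the randomly initialized rows

# Method 3 would instead fit a linear map by least squares over the shared
# vocabulary, e.g. torch.linalg.lstsq(E_FT, E_IB).solution, and initialize
# each new type as E_FT(x) @ W.
```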

Experimental Setup
We accumulate gradients over 4 steps to simulate a batch size of 2048. When pretraining from scratch, we train the model for 1M steps, and use a learning rate of 1e−4 and the Adam optimizer with a linear scheduler. All pretraining experiments are done using 4×V100 GPUs (32GB).
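As a rough sketch, this recipe maps onto Hugging Face's TrainingArguments as follows; the per-device batch size is our assumption, chosen so that 4 GPUs and 4 accumulation steps simulate a batch of 2048:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="indobertweet-mlm",
    per_device_train_batch_size=128,  # x 4 GPUs x 4 accumulation = 2048
    gradient_accumulation_steps=4,
    learning_rate=1e-4,
    lr_scheduler_type="linear",
    max_steps=1_000_000,  # 200K for the domain-adaptive variants below
)
```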
For domain-adaptive pretraining (starting from the INDOBERT model), we consider three settings: (1) domain-adaptive pretraining without domain-specific vocabulary adaptation (V_IBT = V_IB) for 200K steps; (2) applying the vocabulary adaptation approaches from Section 2.3 without additional domain-adaptive pretraining; and (3) applying the vocabulary adaptation approaches from Section 2.3 with 200K domain-adaptive pretraining steps.
Downstream tasks. To evaluate the pretrained models, we use 7 Indonesian Twitter datasets, as summarized in Table 1. These cover sentiment analysis (Koto and Rahmaningtyas, 2017; Purwarianti and Crisdayanti, 2019), emotion classification (Saputri et al., 2018), hate speech detection (Alfina et al., 2017; Ibrohim and Budi, 2019), and named entity recognition (Munarko et al., 2018). For emotion classification, the classes are fear, angry, sad, happy, and love. Named entity recognition (NER) is based on the PERSON, ORGANIZATION, and LOCATION tags, and has two test set partitions: the first comprises formal texts (e.g. news snippets on Twitter), and the second informal texts. The train and dev partitions are a mixture of formal and informal tweets, and are shared across the two test sets.
Fine-tuning. For sentiment, emotion, and hate speech classification, we add an MLP layer that takes the average-pooled output of INDOBERTWEET as input, while for NER we use the first subword of each word token for tag prediction. We preprocess the tweets as described in Section 2.1, and use a batch size of 30, maximum token length of 128, learning rate of 5e−5, Adam optimizer with epsilon of 1e−8, and early stopping with patience of 5. We additionally introduce a canonical split for both hate speech detection tasks with 5-fold cross validation, following Koto et al. (2020b). In Table 1, SmSA, EmoT, and NER use the original held-out evaluation splits.
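A sketch of the classification head described above; the paper specifies only an MLP over the average-pooled output, so the layer sizes and activation are assumptions:

```python
import torch.nn as nn

class AveragePoolClassifier(nn.Module):
    def __init__(self, encoder, hidden_size=768, num_classes=2):
        super().__init__()
        self.encoder = encoder  # e.g. a Hugging Face BertModel
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, num_classes),
        )

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(
            input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # Average-pool over non-padding positions only.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
        return self.mlp(pooled)
```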
Baselines. We use the two INDOBERT models from Koto et al. (2020b) and Wilie et al. (2020) as baselines, in addition to multilingual BERT (MBERT, which includes Indonesian) and a monolingual BERT for Malay (MALAYBERT). Our rationale for including MALAYBERT is that we are interested in testing its performance on Indonesian, given that the two languages are closely related and we know that the Twitter training data includes some amount of Malay text.

Results

Table 2 shows the full results across the different pretrained models for the 7 Indonesian Twitter datasets. Note that the first four models are pretrained models without domain-adaptive pretraining (i.e. they are used as purely off-the-shelf models). In terms of baselines, MALAYBERT is a better model for Indonesian than MBERT, consistent with Koto et al. (2020b), and better again are the two INDOBERT models. Among the vocabulary adaptation strategies, the proposed averaging of subword embeddings performs best, in fact outperforming the domain-specific INDOBERTWEET trained for 1M steps from scratch. These findings reveal that we can adapt an off-the-shelf pretrained model very efficiently (5 times faster than pretraining from scratch) with better average performance.

Discussion
Given these positive results for Indonesian, we conducted a similar experiment for a second language, English: we follow Nguyen et al. (2020) in adapting ROBERTA to Twitter using the embedding-averaging method to initialize new vocabulary, and compare against BERTWEET (trained from scratch on 845M English tweets).
A caveat here is that BERTWEET (Nguyen et al., 2020) and ROBERTA (Liu et al., 2019) use different tokenization methods: fastBPE (Sennrich et al., 2016) and byte-level BPE, respectively. Because of this, rather than replacing ROBERTA's vocabulary with BERTWEET's (as in our Indonesian experiments), we train ROBERTA's byte-level BPE tokenizer on English Twitter data (described below) to create a domain-specific vocabulary. This means that the two models (BERTWEET and domain-adapted ROBERTA with modified vocabulary) are not directly comparable. Following Nguyen et al. (2020), we download 42M tweets from the Internet Archive over the period July 2017 to October 2019 (the first two days of each month), which we use for domain-adaptive pretraining. Note that this pretraining data is an order of magnitude smaller than that of BERTWEET (42M vs. 845M). We use spaCy to filter English tweets, and follow the same preprocessing steps and downstream tasks as Nguyen et al. (2020) (7 tasks in total; see the Appendix for details). We pretrain ROBERTA for 200K steps using the embedding-averaging method.
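Training the domain-specific byte-level BPE vocabulary can be done with the tokenizers library, along the lines of the sketch below (the file path, vocabulary size, and special tokens are illustrative assumptions):

```python
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["english_tweets.txt"],  # preprocessed in-domain tweets
    vocab_size=50_265,             # match ROBERTA's vocabulary size
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("twitter-bpe")  # writes vocab.json and merges.txt
```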
In Table 3, we see that BERTWEET outperforms ROBERTA (+3.4% absolute). With domain-adaptive pretraining using the domain-specific vocabulary, the performance gap narrows to +2.2%, though the gains are not as impressive as in our Indonesian experiments. There are two reasons for this: (1) our domain-adaptive pretraining data is an order of magnitude smaller than that of BERTWEET; and (2) the difference in tokenization methods between BERTWEET and ROBERTA results in very different vocabularies.
Lastly, we argue that the different tokenization settings between INDOBERTWEET and our adapted English model may also contribute to the difference in results. The differences include: (1) uncased vs. cased text; (2) WordPiece vs. byte-level BPE tokenization; and (3) vocabulary size (32K vs. 50K). In Figure 1, we present the frequency distribution of the number of subwords per new type in the two models, after tokenization with each general-domain tokenizer. Interestingly, we find that BERTWEET has more new types than INDOBERTWEET, and that the number of subwords per new type after tokenization is more varied.

Conclusion
We present the first large-scale pretrained model for Indonesian Twitter. We explored domain-adaptive pretraining with several domain-specific vocabulary adaptation strategies, and found that the best method, averaging the subword embeddings of the original model, achieves the best average performance across 7 tasks and is five times faster than the dominant paradigm of pretraining from scratch.

Appendix

For the English experiments, we follow the downstream tasks of Nguyen et al. (2020). We re-ran all experiments and found slightly lower performance for some models as compared to BERTWEET. For evaluation, the POS tagging datasets (Ritter et al., 2011; Gimpel et al., 2011; Liu et al., 2018) use accuracy, SemEval-2017 (Rosenthal et al., 2017) uses average recall (AvgRec), SemEval-2018 (Van Hee et al., 2018) uses positive-class F1, and NER (Strauss et al., 2016; Derczynski et al., 2017) uses entity-level F1.