ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing

English and Chinese, both resource-rich languages, have witnessed strong development of transformer-based language models for natural language processing tasks. Vietnamese, spoken by approximately 100 million people, has several pre-trained models, e.g., PhoBERT, ViBERT, and vELECTRA, that perform well on general Vietnamese NLP tasks, including POS tagging and named entity recognition; however, these pre-trained language models remain limited on Vietnamese social media tasks. In this paper, we present ViSoBERT, the first monolingual pre-trained language model for Vietnamese social media texts, pre-trained on a large-scale corpus of high-quality and diverse Vietnamese social media texts using the XLM-R architecture. Moreover, we evaluate our pre-trained model on five important downstream natural language tasks on Vietnamese social media texts: emotion recognition, hate speech detection, sentiment analysis, spam reviews detection, and hate speech spans detection. Our experiments demonstrate that ViSoBERT, with far fewer parameters, surpasses previous state-of-the-art models on multiple Vietnamese social media tasks. Our ViSoBERT model is available only for research purposes.


Introduction
Language models based on the transformer architecture (Vaswani et al., 2017), pre-trained on large-scale datasets, have brought about a paradigm shift in natural language processing (NLP), reshaping how we analyze, understand, and generate text. In particular, BERT (Devlin et al., 2019) and its variants (Liu et al., 2019; Conneau et al., 2020) have achieved state-of-the-art performance on a wide range of downstream NLP tasks, including but not limited to text classification, sentiment analysis, question answering, and machine translation. English has seen rapid development of language models for specific domains such as medicine (Lee et al., 2019; Rasmy et al., 2021), science (Beltagy et al., 2019), law (Chalkidis et al., 2020), political conflict and violence (Hu et al., 2022), and especially social media (Nguyen et al., 2020; DeLucia et al., 2022; Pérez et al., 2022; Zhang et al., 2022).
Vietnamese is the eighth most used language on the internet, with around 85 million users worldwide. Despite the large amount of Vietnamese data available on the internet, progress in Vietnamese NLP research remains slow. This can be attributed to several factors, to name a few: the scattered nature of available datasets, limited documentation, and minimal community engagement. Moreover, most existing pre-trained models for Vietnamese were trained primarily on large-scale corpora of general texts (Tran et al., 2020; Nguyen and Tuan Nguyen, 2020; Tran et al., 2023). While these sources provide broad language coverage, they may not fully represent the sociolinguistic phenomena of Vietnamese social media texts. Social media texts often exhibit distinct linguistic patterns: informal language, non-standard vocabulary, missing diacritics, and emoticons that are not prevalent in formal written texts. The limitations of language models pre-trained on general corpora become apparent when processing Vietnamese social media texts. Such models can struggle to accurately interpret the informal language, emojis, teencode, and diacritic usage found in social media discussions, which leads to suboptimal performance on Vietnamese social media tasks, including emotion recognition, hate speech detection, sentiment analysis, spam reviews detection, and hate speech spans detection. We present ViSoBERT, a pre-trained language model designed explicitly for Vietnamese social media texts, to address these challenges. ViSoBERT is based on the transformer architecture and trained on a large-scale dataset of Vietnamese posts and comments extracted from well-known social media networks, including Facebook, TikTok, and YouTube. Our model outperforms existing pre-trained models on various downstream tasks, including emotion recognition, hate speech detection, sentiment analysis, spam reviews detection, and hate speech spans detection, demonstrating its effectiveness in capturing the unique characteristics of Vietnamese social media texts. Our contributions are summarized as follows.
• We present ViSoBERT, the first PLM based on the XLM-R architecture and pre-training procedure for Vietnamese social media text processing. ViSoBERT is publicly available for research purposes in Vietnamese social media mining and can serve as a strong baseline for Vietnamese social media text processing tasks and their applications.
• ViSoBERT achieves SOTA performance on multiple downstream Vietnamese social media tasks, illustrating the effectiveness of our PLM on Vietnamese social media texts.
• To understand our pre-trained language model more deeply, we analyze experimental results on the masking rate, examine social media characteristics including emojis, teencode, and diacritics, and implement feature-based extraction for task-specific models.

Fundamental of Pre-trained Language Models for Social Media Texts
Pre-trained Language Models (PLMs) based on transformers (Vaswani et al., 2017) have become a crucial element in cutting-edge NLP tasks, including text classification and natural language generation. In this section, we review transformer-based language models related to our study, including PLMs for Vietnamese and for social media texts.

Pre-trained Language Models for Vietnamese
Several PLMs have recently been developed for processing Vietnamese texts, varying in their architectures, training data, and evaluation benchmarks. PhoBERT, developed by Nguyen and Tuan Nguyen (2020), is the first general pre-trained language model created for Vietnamese. It employs the same architecture as BERT (Devlin et al., 2019) and the same pre-training technique as RoBERTa (Liu et al., 2019) to ensure robust and reliable performance. PhoBERT was trained on a 20GB word-level Vietnamese corpus and produces SOTA performance on a range of downstream tasks: POS tagging, dependency parsing, NER, and NLI. Following the success of PhoBERT, viBERT (Tran et al., 2020) and vELECTRA (Tran et al., 2020), monolingual pre-trained language models based on the BERT and ELECTRA architectures respectively, were introduced. They were trained on substantial datasets: viBERT used a 10GB corpus, while vELECTRA utilized an even larger 60GB collection of uncompressed Vietnamese text. viBERT4news, published by NlpHUST, is a Vietnamese version of BERT trained on more than 20GB of news data. For Vietnamese text summarization, BARTpho (Tran et al., 2022) was presented as the first large-scale monolingual seq2seq model pre-trained for Vietnamese, based on the seq2seq denoising autoencoder BART. Moreover, ViT5 (Phan et al., 2022) follows the encoder-decoder architecture proposed by Vaswani et al. (2017) and the T5 framework proposed by Raffel et al. (2020). Many language models are designed for general use, while strong baseline models for domain-specific applications remain limited; to that end, Minh et al. (2022) introduced ViHealthBERT, the first domain-specific PLM for Vietnamese healthcare.

Pre-trained Language Models for Social Media Texts
Multiple PLMs have been introduced for social media, both multilingual and monolingual. BERTweet (Nguyen et al., 2020) was presented as the first public large-scale PLM for English tweets; it has the same architecture as BERT-base (Devlin et al., 2019) and is trained using the RoBERTa pre-training procedure (Liu et al., 2019). Koto et al. (2021) proposed IndoBERTweet, the first large-scale pre-trained model for Indonesian Twitter, trained by extending a monolingually trained Indonesian BERT model with an additive domain-specific vocabulary. RoBERTuito, presented in Pérez et al. (2022), is a robust transformer model trained on 500 million Spanish tweets; it excels in various language contexts, including multilingual and code-switching scenarios such as Spanish and English.
TWilBert (Ángel González et al., 2021) is a specialization of the BERT architecture for both the Spanish language and the Twitter domain, addressing text classification tasks on Spanish Twitter. Bernice, introduced by DeLucia et al. (2022), is the first multilingual pre-trained encoder designed exclusively for Twitter data; it uses a customized tokenizer trained solely on Twitter data and incorporates a larger volume of Twitter data (2.5B tweets) than most BERT-style models. Zhang et al. (2022) introduced TwHIN-BERT, a multilingual language model trained on 7 billion tweets in more than 100 different languages, designed to handle short, noisy, user-generated text effectively. Earlier, Barbieri et al. (2022) extended the training of the XLM-R (Conneau et al., 2020) checkpoint on a dataset of 198 million multilingual tweets; the resulting XLM-T is thus adapted to the Twitter domain but was not exclusively trained on in-domain data.

ViSoBERT
This section presents the architecture, pre-training data, and custom tokenizer of ViSoBERT for Vietnamese social media texts.

Pre-training Data
We crawled textual data from public Vietnamese social networks: Facebook, TikTok, and YouTube, the three most well-known social networks in Vietnam, with 52.65, 49.86, and 63.00 million users, respectively, in early 2023.
To gather data from these platforms effectively, we harnessed the specialized tools provided by each platform. Pre-processing Data: pre-processing is vital for models consuming social media data, which is massively noisy and contains user handles (@username), hashtags, emojis, misspellings, hyperlinks, and other non-canonical texts. We perform the following steps to clean the dataset: removing non-canonical texts, removing comments that include links, removing excessively repeated spam and meaningless comments, removing comments consisting only of user handles (@username), and keeping emojis in the training data.
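The cleaning steps above can be sketched as follows. This is a minimal illustration, not the exact filters we used: the regular expressions and the repetition threshold are illustrative assumptions.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HANDLE_RE = re.compile(r"@\w+")

def keep_comment(text: str) -> bool:
    """Return True if a comment should stay in the pre-training corpus."""
    stripped = text.strip()
    if not stripped:
        return False
    if URL_RE.search(stripped):          # drop comments containing links
        return False
    if HANDLE_RE.fullmatch(stripped):    # drop comments that are only @username
        return False
    # drop spam that repeats one token many times, e.g. "hay hay hay hay hay"
    tokens = stripped.split()
    if len(tokens) >= 5 and len(set(tokens)) == 1:
        return False
    return True                          # emojis are deliberately kept

comments = ["@user123", "xem tại www.example.com", "phim hay quá 😍", "hay hay hay hay hay"]
clean = [c for c in comments if keep_comment(c)]
print(clean)  # only the emoji comment survives
```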
After crawling and pre-processing, our pre-training data contains 1GB of uncompressed text. Our pre-training data is available only for research purposes.

Model Architecture
Transformers (Vaswani et al., 2017) have significantly advanced NLP research through pre-trained models in recent years. Although existing language models (Nguyen and Tuan Nguyen, 2020; Nguyen and Nguyen, 2021) have proven effective on a range of Vietnamese NLP tasks, their results on Vietnamese social media tasks (Nguyen et al., 2022) leave significant room for improvement. To address this issue, taking successful hyperparameters from XLM-R (Conneau et al., 2020) into account, we propose ViSoBERT, a transformer-based model in the style of the XLM-R architecture with 768 hidden units, 12 self-attention layers, and 12 attention heads, trained with a masked language modeling objective (the same as Conneau et al. (2020)).
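The stated configuration can be expressed as a config fragment. This is a minimal sketch: the key names follow common XLM-R-style conventions, and values not stated in the text are deliberately omitted rather than guessed.

```python
# Minimal sketch of the ViSoBERT/XLM-R-style configuration stated in the text.
# Only the explicitly mentioned values are filled in.
visobert_config = {
    "hidden_size": 768,                       # hidden units
    "num_hidden_layers": 12,                  # self-attention layers
    "num_attention_heads": 12,                # attention heads
    "objective": "masked_language_modeling",  # same objective as XLM-R
}
print(visobert_config)
```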

The Vietnamese Social Media Tokenizer
To the best of our knowledge, ViSoBERT is the first PLM with a custom tokenizer for Vietnamese social media texts (Bernice (DeLucia et al., 2022) uses a custom tokenizer, but one trained solely on Twitter data). Owing to the ability of SentencePiece (Kudo and Richardson, 2018) to handle raw texts without any loss compared to Byte-Pair Encoding (Conneau et al., 2020), we built a custom tokenizer for Vietnamese social media with SentencePiece on the whole training dataset. A model has better coverage of the data than another when fewer subwords are needed to represent the text and the subwords are longer (DeLucia et al., 2022). Figure 2 (in Appendix A) displays the mean token length for each considered model and group of tasks. ViSoBERT achieves the shortest representations across all Vietnamese social media downstream tasks compared to other PLMs.
Emojis and teencode are essential to the "language" of Vietnamese social media platforms. Our custom tokenizer's capability to decode emojis and teencode ensures that their semantic meaning and contextual significance are accurately captured and incorporated into the language representation, enhancing the overall quality and comprehensiveness of text analysis and understanding.
To assess how well Vietnamese social media textual data is tokenized, we analyzed several data samples. Table 1 shows actual social media comments and their tokenizations by the tokenizers of two pre-trained language models: ViSoBERT and PhoBERT, the strongest baseline. The results show that our custom tokenizer performs better than the others.
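The coverage criterion above (fewer, longer subwords means better coverage) can be made concrete with a small sketch. The two toy tokenizers below are hypothetical stand-ins, not the real ViSoBERT or PhoBERT tokenizers: one treats in-domain words as whole tokens, the other falls back to characters.

```python
def mean_token_length(tokenize, comments):
    """Average number of subwords a tokenizer needs per comment;
    lower means better coverage of the domain (DeLucia et al., 2022)."""
    total = sum(len(tokenize(c)) for c in comments)
    return total / len(comments)

# Hypothetical tokenizers: one that knows social-media words as units,
# and one that must fall back to characters for them.
in_domain = lambda c: c.split()                              # whole words
out_domain = lambda c: [ch for w in c.split() for ch in w]   # char fallback

comments = ["ko sao dau", "hay vl"]
assert mean_token_length(in_domain, comments) < mean_token_length(out_domain, comments)
```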

Experimental Settings
We accumulate gradients over one step to simulate a batch size of 128. When pre-training from scratch, we train the model for 1.2M steps over 12 epochs. We trained our model for about three days on 2×RTX 4090 GPUs (24GB). Each sentence is tokenized and masked dynamically with a probability of 30% (a value explored extensively in Section 5.1). Further details on hyperparameters and training can be found in Table 6 of Appendix B. Downstream tasks. To evaluate ViSoBERT, we used five Vietnamese social media datasets available for research purposes, as summarized in Table 2. The downstream tasks include emotion recognition (UIT-VSMEC) (Ho et al., 2020), hate speech detection (UIT-ViHSD) (Luu et al., 2021), sentiment analysis (SA-VLSP2016) (Nguyen et al., 2018), spam reviews detection (ViSpamReviews) (Dinh et al., 2022), and hate speech spans detection (UIT-ViHOS) (Hoang et al., 2023).
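Dynamic masking with a 30% rate can be sketched as below: the masking decision is redrawn every time a sentence is seen, unlike static masking, so each epoch masks different positions. The token ids and mask id are placeholders, not real vocabulary entries.

```python
import random

MASK_ID = 4          # placeholder id for the <mask> token
MASK_RATE = 0.30     # the rate ViSoBERT settles on (Section 5.1)

def dynamic_mask(token_ids, rng):
    """Return (masked inputs, labels); labels are -100 where no loss is taken.
    Called afresh each time a sentence is seen, so masked positions vary."""
    inputs, labels = [], []
    for tid in token_ids:
        if rng.random() < MASK_RATE:
            inputs.append(MASK_ID)
            labels.append(tid)      # predict the original token here
        else:
            inputs.append(tid)
            labels.append(-100)     # ignored by the MLM loss
    return inputs, labels

rng = random.Random(0)
ids = list(range(100, 120))
masked, labels = dynamic_mask(ids, rng)
# roughly 30% of positions carry a label
print(sum(l != -100 for l in labels))
```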
Fine-tuning. We fine-tuned all pre-trained language models using the simpletransformers library. Our fine-tuning process followed standard procedures, most of which are outlined in Devlin et al. (2019). For all tasks mentioned above, we use a batch size of 40, a maximum token length of 128, a learning rate of 2e-5, and the AdamW optimizer (Loshchilov and Hutter, 2019) with an epsilon of 1e-8. We ran a 10-epoch training process and evaluated downstream tasks using the best-performing model across those epochs. Furthermore, no pre-processing techniques are applied to any dataset, in order to evaluate our PLM's ability to handle raw texts. Baseline models. As the main baselines, we utilized several well-known PLMs, both monolingual and multilingual, that support Vietnamese NLP social media tasks. The details of each model are shown in Table 3.
• Monolingual language models: viBERT (Tran et al., 2020) and vELECTRA (Tran et al., 2020) are PLMs for Vietnamese based on the BERT and ELECTRA architectures, respectively. PhoBERT (Nguyen and Tuan Nguyen, 2020), based on the BERT architecture and the RoBERTa pre-training procedure, is the first large-scale monolingual language model pre-trained for Vietnamese and obtains state-of-the-art performance on a range of Vietnamese NLP tasks.

Main Results
Table 4 compares ViSoBERT's scores with the previous highest reported results of other PLMs under the same experimental setup. Our ViSoBERT clearly produces new SOTA results on multiple downstream Vietnamese social media tasks without any pre-processing technique. Emotion Recognition Task: PhoBERT and TwHIN-BERT achieve the previous SOTA performance among monolingual and multilingual models, respectively. ViSoBERT obtains 68.10%, 68.37%, and 65.88% in Acc, WF1, and MF1, respectively, significantly higher than both PhoBERT and TwHIN-BERT.
Hate Speech Detection Task: ViSoBERT achieves significant improvements over the previous state-of-the-art models, PhoBERT and TwHIN-BERT, with scores of 88.51%, 88.31%, and 68.77% in Acc, WF1, and MF1, respectively. Notably, these results are achieved despite the presence of bias within the dataset.
Hate Speech Spans Detection Task: Our pre-trained ViSoBERT boosts the results to 91.62%, 91.57%, and 86.80% in Acc, WF1, and MF1, respectively. While the margin is small, ViSoBERT shows an outstanding ability to capture Vietnamese social media information compared to other PLMs (see Section 5.3). For this task, we evaluate the full set of spans in each comment rather than the spans of each word as in Hoang et al. (2023), to retain the context of each comment.

Multilingual social media PLMs:
The results show that ViSoBERT consistently outperforms XLM-T and Bernice on the five Vietnamese social media tasks. It is worth noting that XLM-T, TwHIN-BERT, and Bernice were all trained primarily on data from the Twitter platform. This approach has limitations when applied to the Vietnamese context: because Twitter is not widely used in Vietnam, training data from this source may not capture the intricate linguistic and contextual nuances prevalent in Vietnamese social media.

Result Analysis and Discussion
In this section, we analyze the improvements of our PLM over strong competitors, PhoBERT and TwHIN-BERT, from several aspects. First, we investigate the effect of the masking rate on our pre-trained model's performance (Section 5.1). Next, we examine the influence of social media characteristics on the model's ability to process and understand the language used in these social contexts (Section 5.2). Finally, we employ feature-based extraction techniques on task-specific models to verify the potential of leveraging social media textual data to enhance word representations (Section 5.3).

Impact of Masking Rate on Vietnamese Social Media PLM
When introducing the masked language model, Devlin et al. (2019) deliberately used a random masking rate of 15%, reasoning that masking too many tokens would lose crucial contextual information needed to decode them accurately, while masking too few tokens would make training less effective. However, according to Wettig et al. (2023), 15% is not universally optimal across models and training data.
We experiment with masking rates ranging from 10% to 50% and evaluate the model's performance on the five downstream Vietnamese social media tasks. Figure 1 illustrates the results of our experiments with six different masking rates. Interestingly, our pre-trained ViSoBERT achieves the highest performance with a masking rate of 30%. This suggests a delicate balance between the amount of contextual information retained and the efficiency of the training process, with an optimal masking rate lying within this range.
However, the optimal masking rate also depends on the specific task.For instance, in the hate speech detection task, we found that a masking rate of 50% yielded the best results, surpassing other masking rate values.This implies that the optimal masking rate may vary depending on the nature and requirements of different tasks.
Considering the overall performance across multiple tasks, we determined that a masking rate of 30% produced the optimal balance for our pre-trained ViSoBERT model.Consequently, we adopted this masking rate for ViSoBERT, ensuring efficient and effective utilization of contextual information during training.

Impact of Vietnamese Social Media Characteristics
Emojis, teencode, and diacritics are essential features of social media, especially Vietnamese social media. The tokenizer's ability to decode emojis and the model's ability to understand teencode and diacritics in context are crucial. Hence, to evaluate the performance of ViSoBERT on social media characteristics, we conducted comprehensive experiments on several strong PLMs: ViSoBERT, PhoBERT, and TwHIN-BERT.

Impact of Emoji on PLMs:
We conducted two experimental procedures to comprehensively investigate the importance of emojis: converting emojis to plain text, and removing emojis.
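The two procedures can be sketched as follows. For brevity this uses a tiny hand-written emoji table and an illustrative name-to-text mapping; a real pipeline would likely rely on a full emoji library instead.

```python
# Tiny illustrative emoji table; a real pipeline would use a full emoji library.
EMOJI_TO_TEXT = {"😍": ":heart_eyes:", "😡": ":angry_face:"}

def convert_emojis(text: str) -> str:
    """Replace each known emoji with a textual tag, preserving its meaning."""
    for emo, name in EMOJI_TO_TEXT.items():
        text = text.replace(emo, f" {name} ")
    return " ".join(text.split())

def remove_emojis(text: str) -> str:
    """Delete emojis entirely, losing the context they carried."""
    for emo in EMOJI_TO_TEXT:
        text = text.replace(emo, " ")
    return " ".join(text.split())

comment = "phim hay quá 😍"
print(convert_emojis(comment))  # phim hay quá :heart_eyes:
print(remove_emojis(comment))   # phim hay quá
```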
Table 5 shows our detailed settings and experimental results across downstream tasks and pre-trained models. The results indicate a moderate reduction in performance across all downstream tasks when emojis are removed or converted to text for our pre-trained ViSoBERT. Our pre-trained model loses 0.62% Acc, 0.55% WF1, and 0.78% MF1 on average across downstream tasks when emojis are converted to text, and an average of 1.33% Acc, 1.32% WF1, and 1.42% MF1 when all emojis are removed from each comment. This is because converting emojis to text preserves the context of the comment, whereas removing all emojis loses that context. The same trend is observed for TwHIN-BERT, which is specifically designed for social media processing. However, TwHIN-BERT slightly improves on the emotion recognition and spam reviews detection tasks compared to its competitors when operating on raw texts. This improvement is marginal and insignificant: increments of only 0.61% Acc, 0.13% WF1, and 0.21% MF1 on emotion recognition, and 0.08% Acc, 0.05% WF1, and 0.04% MF1 on spam reviews detection. One potential reason is that TwHIN-BERT and ViSoBERT are PLMs trained on emoji-containing datasets, so these models can comprehend the contextual meaning conveyed by emojis. This finding underscores the importance of emojis in social media texts.
In contrast, there is a general trend of improved performance across downstream tasks when removing emojis or converting them to text for PhoBERT, the Vietnamese SOTA pre-trained language model. PhoBERT is pre-trained on a general-text (Vietnamese Wikipedia) dataset containing no emojis; when PhoBERT encounters an emoji, it treats it as an unknown token (see Table 1, Appendix B). Therefore, with emoji pre-processing techniques, whether converting emojis to text or removing them, PhoBERT performs better than on raw text.
Our pre-trained ViSoBERT on raw texts outperforms PhoBERT and TwHIN-BERT even when they benefit from the two emoji pre-processing techniques. This confirms our pre-trained model's ability to handle raw Vietnamese social media texts.

Impact of Teencode on PLMs:
Because of informal and casual communication, social media texts often contain common linguistic deviations such as misspellings and teencode. For example, the phrase "ăng kơmmmmm" should be "ăn cơm" ("eat rice" in English), and "ko" should be "không" ("no" in English). To address this challenge, Nguyen and Van Nguyen (2020) presented several rules to standardize social media texts. Building on that work, Quoc Tran et al. (2023) proposed a strict and efficient pre-processing technique to clean comments on Vietnamese social media.
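A standardization step of this kind can be sketched as below: collapse runs of repeated characters, then map known teencode words through a dictionary. The dictionary here is a tiny illustrative subset; the real rule sets (Nguyen and Van Nguyen, 2020) are far larger.

```python
import re

# Tiny illustrative teencode dictionary; real rule sets are much larger.
TEENCODE = {"ko": "không", "ăng": "ăn", "kơm": "cơm"}

REPEAT_RE = re.compile(r"(.)\1{2,}")

def standardize(text: str) -> str:
    """Collapse runs of 3+ repeated characters, then map known teencode words."""
    text = REPEAT_RE.sub(r"\1", text)  # "kơmmmmm" -> "kơm"
    words = [TEENCODE.get(w, w) for w in text.split()]
    return " ".join(words)

print(standardize("ăng kơmmmmm"))  # ăn cơm
print(standardize("ko thích"))     # không thích
```

Note that collapsing repeated characters is exactly the step that can backfire on spam-review data, where users duplicate characters on purpose.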
Table 7 (in Appendix C) shows the results with and without standardizing teencode in social media texts. Performance rises across PhoBERT, TwHIN-BERT, and ViSoBERT when the standardizing pre-processing techniques are applied. ViSoBERT with standardized pre-processing improves on almost all downstream tasks except spam reviews detection. A possible reason is that the ViSpamReviews dataset contains samples in which users duplicate characters to pad out comment length, and standardizing teencode in such cases distorts the meaning.
The experimental results strongly suggest that the improvement achieved by applying complex pre-processing techniques to pre-trained models on Vietnamese social media text is relatively insignificant. Despite the considerable time and effort invested in designing and implementing these techniques, the actual gains in PLM performance are neither substantial nor stable.
However, social media text does not always adhere to proper writing conventions. For various reasons, many users write without diacritical marks when commenting on social media platforms. Consequently, effectively handling diacritics becomes a critical challenge for Vietnamese social media. To evaluate the PLMs' capability to address this challenge, we experimented with removing all diacritical marks from the datasets of the five downstream tasks. This experiment assesses the models' performance on text without diacritics and their ability to understand Vietnamese social media content in such cases.
Table 8 (in Appendix C) presents the results of the two best baselines compared to our pre-trained model in the diacritics experiments. The results reveal that the performance of all pre-trained models, including ours, decreases significantly when dealing with social media comments lacking diacritics. This decline can be attributed to the loss of contextual information caused by removing diacritics: the lower the percentage of diacritics removed from each comment, the better all PLMs perform. However, our ViSoBERT shows a relatively minor reduction in performance across all downstream tasks, suggesting a degree of robustness and adaptability in comprehending and analyzing Vietnamese social media content without diacritics. We attribute this to the efficiency of ViSoBERT's in-domain pre-training data.
In contrast, PhoBERT and TwHIN-BERT experience a substantial drop in performance across the benchmark datasets; these PLMs struggle to cope with the absence of diacritics in Vietnamese social media comments. The main reason is that PhoBERT's tokenizer cannot properly encode non-diacritic comments, as such text was not included in its pre-training data. Several tokenized examples from the three best PLMs are presented in Table 10 (in Appendix F). The significant decrease in performance highlights the challenge of handling diacritics on Vietnamese social media. While this remains challenging, ViSoBERT demonstrates promising performance, suggesting the potential of specialized language models tailored for Vietnamese social media analysis.

Impact of Feature-based Extraction to Task-Specific Models
In task-specific models, the contextualized word embeddings from PLMs are typically employed as input features. We assess the quality of the contextualized word embeddings generated by PhoBERT, TwHIN-BERT, and ViSoBERT to verify whether social media data can enhance word representations. These contextualized word embeddings are fed as features to BiLSTM and BiGRU models, which are randomly initialized before the classification layer. We append a linear prediction layer to the last transformer layer of each PLM with respect to the first subword of each word token, similar to Devlin et al. (2019).
Our experimental results (see Table 9 in Appendix C) demonstrate that the word embeddings generated by our pre-trained language model ViSoBERT outperform other pre-trained embeddings when used with BiLSTM and BiGRU on all downstream tasks. The results indicate the significant impact of leveraging social media text data to enrich word embeddings, and underscore the effectiveness of our model in capturing the linguistic characteristics prevalent in Vietnamese social media texts.
Figure 3 (in Appendix D) presents the performances of the PLMs as input features to BiLSTM and BiGRU on the dev set per epoch in terms of MF1.The results demonstrate that ViSoBERT reaches its peak MF1 score in only 1 to 3 epochs, whereas other PLMs typically require an average of 8 to 10 epochs to achieve on-par performance.This suggests that ViSoBERT has a superior capability to extract Vietnamese social media information compared to other models.

Conclusion and Future Work
We presented ViSoBERT, a novel large-scale monolingual pre-trained language model for Vietnamese social media texts. We showed that ViSoBERT, with fewer parameters, outperforms recent strong pre-trained language models such as viBERT, vELECTRA, PhoBERT, XLM-R, XLM-T, TwHIN-BERT, and Bernice, and achieves state-of-the-art performance on multiple downstream Vietnamese social media tasks, including emotion recognition, hate speech detection, sentiment analysis, spam reviews detection, and hate speech spans detection. We conducted extensive analyses to demonstrate the efficiency of ViSoBERT on various Vietnamese social media characteristics, including emojis, teencode, and diacritics. Furthermore, ViSoBERT also shows the potential of leveraging Vietnamese social media text to enhance word representations compared to other PLMs. We hope the widespread use of our open-source ViSoBERT pre-trained language model will advance current NLP social media tasks and applications for Vietnamese, and that researchers working on other low-resource languages can adopt our approach to creating PLMs to enhance their own NLP social media tasks and relevant applications.

Limitations
While we have demonstrated that ViSoBERT achieves state-of-the-art performance on a range of NLP social media tasks for Vietnamese, we think additional analyses and experiments are necessary to fully comprehend which aspects of ViSoBERT were responsible for its success and what understanding of Vietnamese social media texts ViSoBERT captures. We leave these investigations to future research. Moreover, future work will explore a broader range of Vietnamese social media downstream tasks that this paper may not cover. We also chose to train the base-size transformer model instead of the Large variant because base models are more accessible due to their lower computational requirements. For PhoBERT, XLM-R, and TwHIN-BERT, we evaluated both the Base and Large versions on all Vietnamese social media downstream tasks; however, the comparison with the Large versions is not entirely fair due to their significantly larger model configurations. Finally, regular updates and expansions of the pre-training data are essential to keep up with the rapid evolution of social media, allowing the pre-trained model to adapt effectively to the dynamic linguistic patterns and trends in Vietnamese social media.

Table 7: Performances of the pre-trained language models on downstream Vietnamese social media tasks after applying the word-standardizing pre-processing technique.

[♣] and [♦] denote with and without the word-standardizing technique, respectively. ∆ denotes the increase (↑) or decrease (↓) in performance of the pre-trained language models compared to their counterparts without teencode normalization.
To emphasize the importance of diacritics, we analyzed several data samples by removing 100%, 75%, 50%, and 25% of the diacritics of all diacritic-bearing words in each comment across the five downstream tasks.

Table 8: Performances of the pre-trained language models on downstream Vietnamese social media tasks when removing diacritics in all datasets.
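The removal procedure can be sketched with Unicode decomposition; removing a fixed fraction of diacritic-bearing words (100%/75%/50%/25%) then reduces to applying it to a sampled subset. This is a minimal sketch of the idea, not our exact implementation.

```python
import random
import unicodedata

def strip_diacritics(word: str) -> str:
    """Remove Vietnamese diacritics: decompose, drop combining marks,
    and map đ/Đ (which have no Unicode decomposition) by hand."""
    word = word.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def remove_fraction(comment: str, frac: float, rng: random.Random) -> str:
    """Strip diacritics from a fraction `frac` of diacritic-bearing words."""
    words = comment.split()
    idx = [i for i, w in enumerate(words) if strip_diacritics(w) != w]
    for i in rng.sample(idx, round(len(idx) * frac)):
        words[i] = strip_diacritics(words[i])
    return " ".join(words)

print(strip_diacritics("không"))  # khong
print(strip_diacritics("điện"))   # dien
```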

E Updating New Spans of Hate Speech Span Detection Samples with Pre-processing Techniques
Because pre-processing techniques can change character positions, the span annotations of the data samples must be updated accordingly. We therefore present Algorithm 1, which shows how to update the span positions of samples after pre-processing in the Hate Speech Spans Detection task (UIT-ViHOS dataset). The algorithm takes a comment and its spans as input and returns the pre-processed comment and its updated spans, together with the applied pre-processing techniques.
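Algorithm 1 itself is not reproduced here; the following is a minimal sketch of the core idea, under the assumption that spans are character offsets: when pre-processing deletes characters, each surviving offset is mapped to its new position and offsets inside deleted text are dropped.

```python
def update_spans(text, spans, keep_char):
    """Re-index character-offset spans after deleting characters.
    `spans` is a list of character offsets (as in span-annotated data);
    `keep_char(i, ch)` decides whether character i survives pre-processing."""
    new_text, mapping = [], {}
    for i, ch in enumerate(text):
        if keep_char(i, ch):
            mapping[i] = len(new_text)
            new_text.append(ch)
    new_spans = sorted(mapping[i] for i in spans if i in mapping)
    return "".join(new_text), new_spans

text = "!!! do ngu"
spans = [7, 8, 9]                    # offsets of "ngu"
keep = lambda i, ch: ch != "!"       # pre-processing: drop '!' characters
print(update_spans(text, spans, keep))  # (' do ngu', [4, 5, 6])
```

Note how the span offsets shift left by exactly the number of deleted characters preceding them, which is what preserves the annotation after cleaning.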

Removing 75% diacritics in each comment
Comment: cai con do chơi do mua o đâu nhi. cười deo nhat duoc mom.

Figure 1: Impact of masking rate on our pre-trained ViSoBERT in terms of MF1.

Table 1: Actual social media comments and their tokenizations by the tokenizers of the two pre-trained language models, ViSoBERT and PhoBERT. Disclaimer: this table contains actual comments from social networks that might be construed as abusive, offensive, or obscene.

Table 2: Statistics and descriptions of Vietnamese social media processing tasks. Acc, WF1, and MF1 denote the accuracy, weighted F1-score, and macro F1-score metrics, respectively.

Table 3: Detailed information about the baselines and our PLM. #Layers, #Heads, #Batch, Data, #Params, #Vocab, #MSL, and CSMT indicate the number of hidden layers, number of attention heads, batch size, domain of training data, number of total parameters, vocabulary size, maximum sequence length, and custom social media tokenizer, respectively.

Table 4: Performances on downstream Vietnamese social media tasks of previous state-of-the-art monolingual and multilingual PLMs without pre-processing techniques. Avg denotes the average MF1 score of each PLM. ‡ denotes that the highest result is statistically significant at p < 0.01 compared to the second best, using a paired t-test.

Table 5: Performances of pre-trained models on downstream Vietnamese social media tasks with two emoji pre-processing techniques. [♣], [♦], and [♠] denote our pre-trained language model ViSoBERT with emojis converted to text, with emojis removed, and without any pre-processing technique, respectively. ∆ denotes the increase (↑) or decrease (↓) in performance of the PLMs compared to their counterparts without any pre-processing technique.