Handling Cross- and Out-of-Domain Samples in Thai Word Segmentation

While word segmentation is a solved problem in many languages, it is still a challenge in continuous-script or low-resource languages. Like other NLP tasks, word segmentation is domain-dependent, which can be a challenge in low-resource languages like Thai and Urdu since there can be domains with insufficient data. This investigation proposes a new solution to adapt an existing domain-generic model to a target domain, as well as a data augmentation technique to combat the low-resource problems. In addition to domain adaptation, we also propose a framework to handle out-of-domain inputs using an ensemble of domain-specific models called Multi-Domain Ensemble (MDE). To assess the effectiveness of the proposed solutions, we conducted extensive experiments on domain adaptation and out-of-domain scenarios. Moreover, we also proposed a multiple task dataset for Thai text processing, including word segmentation. For domain adaptation, we compared our solution to the state-of-the-art Thai word segmentation (TWS) method and obtained improvements from 93.47% to 98.48% at the character level and 84.03% to 96.75% at the word level. For out-of-domain scenarios, our MDE method significantly outperformed the state-of-the-art TWS and multi-criteria methods. Furthermore, to demonstrate our method’s generalizability, we also applied our MDE framework to other languages, namely Chinese, Japanese, and Urdu, and obtained improvements similar to Thai’s.


Introduction
Word segmentation (WS) is a crucial upstream process for most natural language processing (NLP) tasks such as named entity recognition (NER), machine translation (MT), and part-of-speech tagging (POS). Nguyen et al. (2017) showed POS performance increased from 87% to 93% when the WS was improved. WS can also enhance the performance of MT, such as the work done by Chang et al. (2008) for Chinese-English MT.
While word segmentation is considered a solved problem in many languages, the task is still a challenge in continuous-script languages. A great number of writing systems have no word boundaries, e.g., Thai, Chinese, and Japanese. Deep learning has been effective in performing WS in these languages. However, it requires a large amount of training data to construct a reliable model, which can be a limitation for low-resource languages like Thai and Urdu. Furthermore, like other NLP tasks, word segmentation is domain-dependent (Fu et al., 2020). To handle a variety of data domains, there should be a substantial amount of data for each of them, exacerbating the low-resource problem. To make matters worse, we may also need to handle input from a completely unseen domain.
In this paper, we propose a framework to address two domain dependency problems: (i) how to effectively construct a WS model to handle input from a given domain in a data-poor setting; (ii) how to effectively handle out-of-domain input. To address the first problem, we propose a new domain adaptation solution based on the concept of stacked ensemble (SE) learning (Limkonchotiwat et al., 2020) and data augmentation. To handle out-of-domain input, we use an ensemble of domain-specific models to produce predictive results.
The crux of our proposed method lies in the following technical contributions:
• We introduce multiple deep learning models following the concept of SE to construct domain-specific models that obtain better performance than the original SE and existing techniques in domain-adaptation problems. We call this technique Deep Stacked Ensemble (DSE).
• To make sure that each domain has sufficient data to build an accurate model, we design a data augmentation approach based on a Masked Language Model (MLM), which consists of two techniques to generate hard-to-segment and semi-hard-to-segment samples to help improve the performance.
• We use multiple domain-specific models and a result aggregation module to form an ensemble learning framework addressing the out-of-domain problems. We call this method Multi-Domain Ensemble (MDE).
• Furthermore, we propose a multiple task dataset called "VISTEC-TP-TH-2021", a social media dataset for Thai text processing, annotated for four text processing tasks: word segmentation, named-entity boundary, misspelling detection, and misspelling correction.
To assess the effectiveness of our approach, we compare our method with competitors in domain adaptation and out-of-domain scenarios on Thai, Chinese, Japanese, and Urdu. Experimental results showed that DSE improved the performance of the state-of-the-art Thai word segmentation (TWS) from 93.47% and 84.03% to 96.67% and 91.51% at character and word levels in domain adaptation settings. With the proposed data augmentation approach, our domain-specific model has improved even further at both character and word levels. For out-of-domain scenarios, our MDE framework outperformed the state-of-the-art TWS and multi-criteria baseline at character and word levels. Moreover, we applied our framework to Chinese, Japanese, and Urdu, which resulted in improvements showing the applicability of our method to other languages. We make our code available at: github.com/mrpeerat/OSKut

Related Work
In this section, we discuss literature related to our investigation, namely ensemble learning, domain adaptation, and data augmentation.
Ensemble Learning. Recently, considerable research attention has been dedicated to applying ensemble learning to boost the performance obtained from individual models (Sikdar and Gambäck, 2017; Chen et al., 2020a; Kuwabara et al., 2020) and to introduce previously ignored features for ensemble models such as provenance information in slot filtering (Viswanathan et al., 2015).
Several studies have used ensemble methods to boost the accuracy in WS. For example, Liu and Lin (2014) proposed a probabilistic ensemble learning framework using multiple weak word segmenters to form a strong segmenter. Moreover, Min et al. (2015) proposed an ensemble learning model to address the word segmentation and Part-of-Speech tagging problems by combining both discriminative and generative methods.
Domain Adaptation. Several WS studies proposed techniques to adapt the data distribution from one domain to another (Zhang et al., 2013; Ding et al., 2020). Another popular approach is to add new features or change network architectures of the target model (Monroe et al., 2014; Liu et al., 2014; Bao et al., 2017; Huang et al., 2020). Ding et al. (2020) presented a semi-supervised approach for performing Chinese WS on a new domain by using adversarial training to help learn the difference between the source and target domain. Recently, Limkonchotiwat et al. (2020) proposed a filter-and-refine solution based on the stacked ensemble (SE) to convert a base model to a target domain. The SE consists of a domain-generic base model and a domain-specific model that analyzes the output of the domain-generic model and revises the segmentation. The method achieved similar performance to traditional transfer learning methods while requiring no access to the domain-generic model weights.
Data Augmentation and Self-Supervised Learning. Word segmentation for low-resource languages is a challenging task due to the data limitation. Most Thai WS models report below 90% accuracy in domain-adaptation settings (Kittinaradorn et al., 2019; Chormai et al., 2020). Many researchers proposed data augmentation methods for Asian languages to increase the performance of WS models by using existing models' output as input to new models such as synthetic data, entropy parser, and character embedding (Zheng et al., 2018; Wang et al., 2019; Fung et al., 2004).
With the advent of large language models (Devlin et al., 2018; Yang et al., 2019; Brown et al., 2020), we have been witnessing an explosion in self-supervised learning techniques. Data augmentation methods such as the Masked Language Model (MLM) using BERT (Devlin et al., 2018) allow us to generate new sentences that are similar to real data by randomly selecting words in a sentence to replace them with new words (Chen et al., 2020b; Liao et al., 2020). Yavuz et al. (2020) proposed MaskAugment, a controllable mechanism and augmentation method that used a pre-trained BERT model to replace words in a sentence. The method is used in an unsupervised teacher-student framework to improve domain adaptation for the dialog act task. Furthermore, Li et al. (2020) proposed an MLM-based augmentation method that could also preserve the underlying labels of the sentence in the aspect term extraction task.
Out-Of-Domain Scenarios. While domain adaptation presents a useful paradigm to adjust an existing model to a target domain, it is impracticable to anticipate all different input types in advance. Hence, the ability to handle samples from unseen domains (i.e., out-of-domain samples) is critical to the solution's performance. For example, Wagner et al. (2020) proposed utilizing treebank vectors and a method to interpolate a prediction from existing treebank vectors to handle out-of-domain input samples. Ng et al. (2020) proposed a solution utilizing data augmentation to generate training samples to diversify the training set so that the model can handle out-of-domain samples better.
Discussion. For domain adaptation, an ensemble learning method such as SE (Limkonchotiwat et al., 2020) provides a flexible framework for adapting any base model to a target domain. We hypothesize that we can improve the accuracy of SE by introducing a deep learning architecture at the domain-specific part. However, this adjustment would require a larger amount of data for each domain than the original SE method, which uses a traditional Conditional Random Field (CRF) model (Lafferty et al., 2001). To tackle this problem, data augmentation presents an avenue to address the data requirements. Regarding out-of-domain scenarios, we hypothesize that an ensemble of domain-specific models can be used to boost the accuracy of out-of-domain situations. This is the first WS work to address this problem without using any out-of-domain data.

Methodology
In this section, we present the overview of domain adaptation in Section 3.1. We present our domain adaptation solution based on the concept of stacked ensemble (SE) learning (Limkonchotiwat et al., 2020) and an MLM-based data augmentation method in Section 3.2. Section 3.3 presents how multiple domain-specific models can work as an ensemble to support out-of-domain scenarios.

[Figure: (a) Domain-specific model with transfer learning.]
The main advantages of SE over TL are as follows: (i) the architecture of the domain-specific model can be selected independently of the existing domain-generic one; (ii) it is able to handle models where we cannot adjust their weights, i.e., black boxes. Consequently, we adopt SE as our approach to tackling the domain adaptation problem.

Deep Stacked Ensemble (DSE)
As stated earlier, SE allows us to introduce a new architecture to handle domain-specific input. To exploit this advantage, we introduce a Bidirectional Long Short-Term Memory (Bi-LSTM) network with an attention mechanism to the current state-of-the-art TWS architecture (Kittinaradorn et al., 2019). We call our proposed domain adaptation method Deep Stacked Ensemble (DSE). Figure 2 shows the structure of the domain-specific part of our solution. There are three main kinds of features. A character n-gram is passed through a CNN following Kittinaradorn et al. (2019) to create an embedding vector (shown in blue). A character type n-gram, which indicates whether a character is a vowel, digit, special character, or English character, is turned into an embedding vector (shown in red). Lastly, we use probability and entropy values from the domain-generic model, which indicate whether a character is a start or end of a word or not in a dictionary, as additional features (shown in green). We then concatenate all of the embeddings and feed them to the Bi-LSTM layer (Hochreiter and Schmidhuber, 1997; Ma et al., 2018).
The Attention model is connected to the Bi-LSTM output layer for improved accuracy because the attention mechanism is effective at capturing long-range dependencies (Duan and Zhao, 2020). The attention layer is followed by a fully connected network that ends with a single sigmoid output for Thai and Chinese (boundary or not) and a softmax output for Japanese and Urdu (the beginning, middle, or end of a word, or a word with a single character). Ablation studies, results given in Table 11, show that each component in the domain-specific model improves the performance incrementally. Unlike the original SE that relies only on the CRF as the domain-specific model, our deep learning approach to construct the domain-specific model can capture intricate WS patterns in the domain better than the original SE and transfer learning method.
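The probability and entropy features taken from the domain-generic model can be illustrated with a minimal sketch. It assumes the base model exposes one boundary probability per character and that the entropy is the Shannon entropy of that Bernoulli posterior; the function names and the example values are hypothetical and are not taken from the released code.

```python
import math


def binary_entropy(p: float, eps: float = 1e-12) -> float:
    """Shannon entropy (in bits) of a Bernoulli boundary probability p."""
    # Clamp to avoid log(0) when the model is fully confident.
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def generic_model_features(boundary_probs):
    """Return one (probability, entropy) feature pair per character."""
    return [(p, binary_entropy(p)) for p in boundary_probs]


# Hypothetical posteriors from a domain-generic segmenter for four characters.
features = generic_model_features([0.98, 0.50, 0.03, 0.70])
```

Under this assumption, characters whose probability is close to 0.5 receive the highest entropy, which is the signal the domain-specific model can use to decide where the base segmentation is least reliable.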
However, unlike the classical machine learning approach, i.e., CRF, deep learning approaches require a large amount of training data. To handle this problem, we propose a data augmentation technique at the character level. This increases the amount of training data and thus improves the performance significantly.
Data Augmentation. The main advantage of using a separate model for each domain is the ability to handle contradicting segmentation conditions from different domains (Fu et al., 2020). However, this approach requires a substantial amount of data in each domain, as stated earlier. To mitigate this problem, we also propose two data augmentation methods based on the Masked Language Model (MLM) WangchanBERTa (Lowphansirikul et al., 2021) trained on Thai Wikipedia Dump. As shown in Figure 3, we mask words based on the output of the domain-generic model. The output posteriors from the model are used to compute the character-level entropy values. Then, the values are summed together to represent the score for each word. Summing the values favors long words, which are harder to segment. We select the top-k words with the highest scores, mask them, and replace them (substitution) using the MLM. This process uses the pretrained MLM to ensure the generation of hard-to-segment sentences. We also introduce semi-hard-to-segment samples by inserting a new word after the selected word (rather than substituting it). The same MLM is used to perform next word prediction instead of masked prediction. The ratio between hard-to-segment and semi-hard-to-segment samples is 80:20. This is found via grid search (see Table 13).
The insertion method gives the best performance compared with other semi-hard-to-segment generation methods (see the results in Table 13). The entropy selection method, compared with competitive selection methods in Table 12, has the best performance for all top-k selections and average scores.
Figure 3: Overview of our data augmentation pipeline.
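The entropy-based word selection described above can be sketched as follows. This is a minimal illustration under stated assumptions: the domain-generic model is assumed to return its own segmentation plus one boundary posterior per character, and `select_words_to_mask`, `top_k_ratio`, and the placeholder Latin tokens are hypothetical names and data introduced here. The MLM substitution or insertion step itself is not shown.

```python
import math


def char_entropy(p, eps=1e-12):
    """Entropy (in bits) of a single character's boundary posterior."""
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def select_words_to_mask(words, char_probs, top_k_ratio):
    """Rank words by summed character entropy and return the top-k indices.

    words: tokens produced by the domain-generic model.
    char_probs: flat list of boundary posteriors, one per character.
    top_k_ratio: fraction of words to select (e.g. 0.6 for top-60%).
    """
    scores, pos = [], 0
    for idx, word in enumerate(words):
        span = char_probs[pos:pos + len(word)]
        # Summing (not averaging) gives long, uncertain words higher scores.
        scores.append((sum(char_entropy(p) for p in span), idx))
        pos += len(word)
    if pos != len(char_probs):
        raise ValueError("char_probs must contain one value per character")
    k = max(1, round(top_k_ratio * len(words)))
    ranked = sorted(scores, key=lambda s: (-s[0], s[1]))
    return sorted(idx for _, idx in ranked[:k])


# Placeholder tokens standing in for a segmented Thai sentence.
words = ["ab", "cdef", "g"]
char_probs = [0.9, 0.1, 0.5, 0.5, 0.6, 0.4, 0.99]
to_mask = select_words_to_mask(words, char_probs, top_k_ratio=0.34)
```

In this toy input the four-character word is both the longest and the most uncertain, so it is the one selected for masking; the selected positions would then be passed to the MLM for substitution or insertion.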

Multi-Domain Ensemble (MDE)
It is unrealistic to expect that the training and test distributions always match. Getting new training data for the out-of-domain scenarios can be expensive and time-consuming (Ng et al., 2020; Liu et al., 2019). In such cases, transfer learning and the previously described DSE method are not sufficient. We propose a framework called Multi-Domain Ensemble (MDE), which utilizes an ensemble of domain-specific models to handle out-of-domain samples. Figure 4 presents the structure of MDE. The framework consists of multiple word segmentation models, where each model is adapted to a specific target domain (except for the out-of-domain data) using the method described in Section 3.2. Results from multiple word segmentation models are combined using a result aggregator to form the final prediction. In this investigation, we formulate two result aggregation strategies as follows. (i) We compute a simple average of the posterior output from each model. Then, we predict the class that has the highest probability; this is a basic method for ensemble modeling (Avg). (ii) We calculate the entropy from each model based on its posterior distribution output. We then choose the prediction of the model with the lowest entropy; we call this strategy Min Entropy (ME).
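The two aggregation strategies can be sketched for a single character as below. This is an illustrative sketch, not the authors' implementation: it assumes each domain-specific model returns a class distribution per character and that the aggregation is applied character by character, which the text does not specify. The function names and the example posteriors are hypothetical.

```python
import math


def entropy(dist):
    """Shannon entropy (in bits) of a discrete class distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0.0)


def aggregate_avg(posteriors):
    """Avg: average the posteriors over models, then take the argmax class."""
    n_classes = len(posteriors[0])
    mean = [sum(m[c] for m in posteriors) / len(posteriors)
            for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)


def aggregate_min_entropy(posteriors):
    """ME: take the argmax class of the model with the lowest entropy."""
    most_confident = min(posteriors, key=entropy)
    return max(range(len(most_confident)), key=most_confident.__getitem__)


# Three hypothetical domain-specific models; classes are [no boundary, boundary].
posteriors = [[0.80, 0.20], [0.70, 0.30], [0.05, 0.95]]
avg_label = aggregate_avg(posteriors)
me_label = aggregate_min_entropy(posteriors)
```

In this example the two strategies disagree: averaging follows the two moderately confident models, while Min Entropy follows the single model that is most certain about its prediction.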

Result
The results of the MDE framework and aggregation module given in Tables 5 and 10 show that the entropy method performs better than the basic method and improves the performance significantly over other models in out-of-domain scenarios.

Performance Evaluations on Thai Word Segmentation
In this section, we report results from experimental studies on four Thai word segmentation (TWS) benchmark datasets. The studies are organized as follows. (i) we compare our method with competitive methods on domain adaptation; (ii) we show the effect of the data augmentation technique on domain adaptation; (iii) we report the results on out-of-domain setups; (iv) we show the effect of WS in downstream tasks.
Note that experimental studies on Chinese, Urdu, and Japanese are presented in Section 5.

Experimental Setup
Competitive Methods. We evaluate our proposed solution against two state-of-the-art methods, namely DeepCut (DC) (Kittinaradorn et al., 2019) and AttaCut (AC) (Chormai et al., 2020). These methods are based on the Convolutional Neural Network (CNN) and trained on a generic corpus (BEST2009; Boriboon et al., 2009). For domain adaptation experiments, we also applied the concept of Transfer Learning (TL) to adapt DC and AC to the target corpora, and we call these adaptations TL-DC and TL-AC, respectively. Similarly, for the Stacked Ensemble Filter-and-Refine (SEFR) method (Limkonchotiwat et al., 2020), we created two variants, SE-DC and SE-AC, using DC and AC as the base model, respectively. For the evaluation of our method, Deep Stacked Ensemble (DSE), we followed the same principle and created two variants, DSE-DC and DSE-AC, based on DC and AC, respectively.
Evaluation Metrics. We use F1 score as the evaluation metric for the TWS task at character and word levels to avoid the overestimation of TWS (Chormai et al., 2020; Limkonchotiwat et al., 2020).
Parameter Settings. In these experiments, we used grid search on four parameters of the domain-specific model: Bi-LSTM nodes, attention nodes, optimizer, and top-k. We set the initial learning rate of the optimizer to 0.01. Whenever the loss did not decrease for 10 steps, the learning rate was multiplied by a factor of 0.1. We set the number of training epochs to 300 with an option of early stopping. For the CNN layer and character embedding settings, we followed Kittinaradorn et al. (2019). We tuned the top-k value of the filtering system in a domain-specific model to be the same as the original SE. For the top-k value in the out-of-domain scenarios, we used the same k for all domain-specific models as in the domain adaptation settings. Lastly, we tuned all of the parameters by using 10% of the training data of the target domain. The hyper-parameters and their values are given in Table 1.

[Table 1: Hyper-parameters and values for grid search.]
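The character-level and word-level F1 scores used as evaluation metrics can be sketched as follows. The definitions here are assumptions based on common practice: character-level F1 is computed over word-start positions, and word-level F1 counts a word as correct only when both of its boundaries match the gold segmentation. The helper names and the placeholder tokens are hypothetical.

```python
def boundaries(words):
    """Set of character offsets at which a word starts."""
    starts, pos = set(), 0
    for w in words:
        starts.add(pos)
        pos += len(w)
    return starts


def spans(words):
    """Set of (start, end) character spans, one per word."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out


def f1(pred, gold):
    """F1 between a predicted set and a gold set of items."""
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# Placeholder segmentations of the same five-character string.
gold = ["ab", "cd", "e"]
pred = ["ab", "c", "de"]
char_f1 = f1(boundaries(pred), boundaries(gold))
word_f1 = f1(spans(pred), spans(gold))
```

A single misplaced boundary costs one item at the character level but breaks two words at the word level, which is why word-level scores are consistently lower than character-level scores for the same system.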

Datasets
Benchmark Datasets. Our benchmark corpora can be seen in  misspelling detection and correction, and named entity recognition. In the data collection process, we focused on the longest sentences to create a more challenging dataset due to the fact that long sentences made the model's performance decrease significantly compared with short sentences in the same domain (Section 4.3). The Out-of-Vocabulary rate on the test set is 13.65%. We followed Boriboon et al. (2009) for the word and named entity tasks annotation guideline. We also included new guidelines about word editing criteria for misspelt words such as words used on the internet (Netspeak), transliterated loanwords, abbreviations, and shortened words, by using the Royal Institute Thai dictionary. We compared our dataset to the biggest Thai social media dictionary (Horsuwan et al., 2020) and found 79K words that did not appear in the dictionary.

Domain Adaptation
Without Data Augmentation. We evaluate the performance of our domain-specific model against competitive methods in four TWS benchmark corpora, WS160, TNHC, LST20, and VISTEC. The experimental results are given in Table 3. The competitive methods are defined in Section 4.1.
The DSE-DC (DeepCut) outperformed the strongest base model, DC, by 3.2% and 7.2% on WS160, 6.23% and 13.74% on TNHC, 4.41% and 10.18% on LST20, and 4.59% and 11.13% on VISTEC at character and word levels, respectively. Our domain-specific model also outperformed the original SE by 2.16% and 6.46% on SE-DC and 1.87% and 4.88% on SE-AC (AttaCut) at character and word levels on all setups. More importantly, our domain-specific model outperformed TL (transfer learning) methods, showing the strength of our DSE model.
As expected, the newly constructed TWS social media dataset (VISTEC) shows that even TL-DC performed below 91% at word level, a large drop from the 96% achieved in the generic domain LST20 corpus. The VISTEC dataset therefore creates a new challenge for the social media domain. Comparing the WS160 and VISTEC datasets, AC's performance decreased from 93.5% to 91.47% and from 84.04% to 79.30% at the character level and the word level, respectively.
With Data Augmentation. In this experiment, we show the effect of the data augmentation in domain adaptation settings for different amounts of adaptation data. We report the findings of the data augmentation process on two corpora, i.e., Wisesight (social media domain), which is the smallest corpus, and LST20, which is the largest generic domain corpus (see Table 2). We fixed the top-k value in the data augmentation step at 60% and 10% of the segmentation predictions of the Wisesight and LST20 corpora, respectively. This value is found via grid search (see Table 12). We then use these augmented data with TL, SE, and DSE. As shown in Table 4, the data augmentation process can improve the performance in the small corpus, i.e., Wisesight (WS160). DSE-DC (DeepCut) outperformed the base model by 5.01% and 12.72% at character and word levels. Also, DSE-DC outperformed TL-DC by 1.39% and 3.36% at the character and word levels, respectively. However, since LST20 is sufficiently large, the augmentation did not produce performance improvement with respect to the model constructed using the original data only.

The Effect of Data Augmentation in Insufficient Data Scenarios
In this experiment, we evaluated transfer learning (TL) and our method (DSE) trained on varying numbers of sentences ranging from 100 to 1,000 on the large datasets TNHC, LST20, and VISTEC to show the effectiveness of data augmentation in the insufficient data scenarios. As can be seen from Figure 5, the data augmentation improved the performance by 0.77% on average for TNHC, 1.55% for LST20, and 0.19% for VISTEC using DSE with the proposed data augmentation technique. Also, the transfer learning F1 performance is improved by 0.14% on average for TNHC and 0.57% for VISTEC. However, the performance of transfer learning on LST20 did not improve with the data augmentation technique, as the baseline model (DeepCut) was trained on the same domain as the LST20 corpus. The performance of transfer learning in this setting is similar to the LST20 transfer learning model in Table 3.
The results of our method in insufficient data scenarios show that the proposed data augmentation method improves the performance when the original data is insufficient. Also, the best number of sentences for the augmentation technique is between 100 and 500 for transfer learning and between 500 and 1,000 for our method.

Experiments on Out-of-Domain Scenarios
In this experiment, we evaluated our Multi-Domain Ensemble (MDE) framework against two methods, namely DC trained on BEST2009 and Multi-Criteria (MC). MC is a multi-task model which learns multiple segmentation criteria from different domains jointly using shared layers (Chen et al., 2017). For MC and MDE, the target domains were left out from the training and the models were trained on the remaining domains. As shown in Table 5, the performance improvements of MDE-ME over DC on Wisesight and TNHC were statistically significant (P<0.001 using McNemar's test). Moreover, in comparison to DC, the performance improvement provided by MDE-Avg was also statistically significant on TNHC. With the MDE framework, we improved the performance over the base model (DC) at character and word levels by 1.17% and 3.53% on WS160, 2.97% and 6.77% on TNHC, 0.26% and 0.42% on LST20, and 0.68% and 1.17% on VISTEC. Moreover, our MDE framework also outperformed the MC model in this experiment with significant results. In addition, the ME (Min Entropy) can improve the performance better than
As mentioned earlier, word segmentation is a domain-dependent task and we cannot expect the input to always be in domain. A model that can robustly handle the out-of-domain scenarios is desirable. Even with the improvement gained by our proposed solution, the gap between out-of-domain and domain adaptation is still large, showing potential for further investigation. In the next experiment, we show the effect of the data augmentation on downstream tasks.
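The significance test used in the out-of-domain comparison can be illustrated with a minimal sketch of McNemar's test. It assumes the continuity-corrected chi-square form with one degree of freedom and a per-item correctness flag for each system; the unit of comparison (character or word) and the exact variant used in the experiments are not stated in the text, and the counts below are made up for illustration.

```python
import math


def mcnemar_p_value(correct_a, correct_b):
    """Continuity-corrected McNemar test on paired correctness flags.

    correct_a, correct_b: equal-length lists of booleans, one per item,
    marking whether system A / system B got that item right.
    """
    # Discordant pairs: items where exactly one system is correct.
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    if b + c == 0:
        return 1.0  # the systems never disagree
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))


# Made-up example: 50 items both systems get right, 40 only A, 10 only B.
system_a = [True] * 50 + [True] * 40 + [False] * 10
system_b = [True] * 50 + [False] * 40 + [True] * 10
p_value = mcnemar_p_value(system_a, system_b)
```

Only the discordant pairs influence the statistic, so two systems with similar overall accuracy can still differ significantly if their errors fall on different items.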

The Effect of Word Segmentation and Data Augmentation on Downstream Tasks
Previously, we showed that the proposed data augmentation improved the performance of TWS in the domain adaptation settings. In this experiment, we applied TWS to downstream tasks such as named entity recognition (NER), text classification, and sentiment analysis, and compared the TWS base models (DC and AC), TL, DSE, and DSE with augmented data. We use the Wongnai corpus for text classification and the Wisesight corpus for sentiment analysis. The exact model setting and evaluation metric follow the Thai classification benchmark. For the NER experiment, we used NCRF++ (Yang and Zhang, 2018) trained with data from the work of Nutcha (2016). We trained our DSE and competitive methods (except the baseline model) on the Wisesight corpus to show the performance of the proposed augmentation technique.
The results are given in Table 6. When the downstream tasks are not dependent on WS performance, i.e., the text and sentiment classification tasks, the results are similar. On the other hand, when the downstream task is dependent on WS performance, i.e., NER, we can significantly improve the downstream task. For example, we improved the performance of DSE-DC from 93.47% to 96.67% at the character level, and when combined with data augmentation, increased the accuracy to 98.48%. As a result, the F1 score in the NER task increased from 63.46% to 72.27%.

Chinese, Urdu, and Japanese Word Segmentation
In this section, we demonstrate the generalizability of our method on Chinese word segmentation (CWS), Urdu word segmentation (UWS), and Japanese word segmentation (JWS).
Both stacked ensemble methods improve over the baseline models in all settings, showing the potential of stacked ensemble in improving WS performance. Moreover, the proposed DSE outperforms the original SE (Limkonchotiwat et al., 2020) significantly for MSR, PKU, and UCRF (P<0.001). The largest performance improvement is over 10% on the PKU corpus.
Domain Adaptation on Japanese. As in the TWS experiments, DSE can also be used for domain adaptation by training the domain-specific portion on the target domain. For this JWS task, we used Nagisa (Ikeda, 2018) trained on the Balanced Corpus of Contemporary Written Japanese (BCCWJ) (Maekawa et al.) corpora as the base model. The domain-specific part of the SE was trained on the target corpus to create an adapted model. Note that the released Nagisa model does not lend itself to transfer learning because the authors did not provide the model weights. As shown in Table 9, SE and DSE improve significantly over the baseline, showing the effectiveness of SE in situations when one cannot perform typical transfer learning.
Table 10 summarizes the results of the out-of-domain experiments. The MDE provides a minimal improvement over the baseline on JWS. We hypothesize that this is because two out of the three corpora are too small to train a reliable model. However, on Chinese the MDE provides large gains over the baseline, with the min entropy method performing better than the simple averaging method.

Concluding Remarks
This investigation presents a set of solutions to address two domain dependency problems: handling cross-domain and out-of-domain samples. Our key findings are as follows. First, we applied deep learning to the original stacked ensemble method and obtained a significant improvement. Second, we show that data augmentation is an effective method to combat the low-resource limitation in domain adaptation. Third, we can use an ensemble of domain-specific models to obtain a performance improvement over each domain-specific model acting alone. Finally, in addition to Thai, we can apply the same principle to Chinese, Japanese, and Urdu and obtain similar improvements. As future work, we plan to experiment with novel techniques such as Transformers and contrastive learning.
Section 3.2 presented the way we select a word to augment and compared it against the traditional method, i.e., random selection. In this study, we show the validation scores of random, entropy, and maximum length selections in our data augmentation technique on the Wisesight corpus with the DC base model, varying the k value from top-10% to 100% for the substitution method and fixing it at top-20% for the insertion method. The validation score results are given in Table 12; the best range of the top-k value for the data augmentation is 50% to 60%. The entropy selection performs better than the competitive methods. Because long words are harder to segment than short ones, the entropy selection method favors long words with high uncertainty. Maximum length selection gives a score similar to entropy selection because high uncertainty scores mostly come from long words. Also, the best F1 score is obtained using the top 60% rather than 100%, as words in the top 60% might have the most incorrect answers and bias from frequent words.

A.3 Ablation Studies For Different Semi-Hard-Sample Procedures
Section 3.2 mentioned a competitive method to produce semi-hard-to-segment samples. We combine each semi-hard-to-segment method with substitution, fixing the k value at top-60% for the substitution method and varying k in the range of 10% to 100% for the semi-hard-to-segment methods, to show the performance of each method. We show the validation score on Wisesight (training data) at character and word levels, respectively. The results are presented in Table 13. As can be seen, the insertion method reports the best performance on every top-k entropy selection. The deletion method is inappropriate because we might delete some information in the training data.

A.4 Error Analysis
We performed an error analysis on Wisesight (WS160) corpora for DC, SE-DC, and DSE-DC to investigate the improvement from the baselines as well as the benefits of our method in domain adaptation setups. We used the same setting as