CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models

While many languages possess processes of joining two or more words to create compound words, previous studies have typically been limited to languages with excessively productive compound formation (e.g., German, Dutch), and there is no public dataset containing compound and non-compound words across a large number of languages. In this work, we systematically study decompounding, the task of splitting compound words into their constituents, at a wide scale. We first address the data gap by introducing a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary. We then use this dataset to evaluate an array of Large Language Models (LLMs) on the decompounding task. We find that LLMs perform poorly, especially on words which are tokenized unfavorably by subword tokenization. We thus introduce a novel methodology to train dedicated models for decompounding. The proposed two-stage procedure relies on a fully self-supervised objective in the first stage, while the second, supervised learning stage optionally fine-tunes the model on the annotated Wiktionary data. Our self-supervised models outperform the prior best unsupervised decompounding models by 13.9% accuracy on average. Our fine-tuned models outperform all prior (language-specific) decompounding tools. Furthermore, we use our models to leverage decompounding during the creation of a subword tokenizer, which we refer to as CompoundPiece. CompoundPiece tokenizes compound words more favorably on average, leading to improved performance on decompounding over an otherwise equivalent model using SentencePiece tokenization.


Introduction
Decompounding is the task of separating compound words into their single-word constituents. Decompounding is used in user-facing tools such as dictionaries and morphological analyzers (Altinok).
Decompounding can come in two similar yet different task formats: (i) compound segmentation and (ii) compound normalization (Ziering and van der Plas, 2016). Compound segmentation is the task of segmenting a word into its compound constituents while preserving its surface form (e.g., bridesmaid → brides + maid). Compound normalization is the task of recovering the base form of each compound constituent (e.g., bridesmaid → bride + maid). Most prior work on decompounding has focused on the few languages with excessively productive compound formation such as Finnish, German and Swedish (Koehn and Knight, 2003; Shapiro, 2016; Riedl and Biemann, 2016). However, compound words occur in a large and diverse number of languages (Vogel and Scalise, 2010). Yet, datasets which annotate compounds with their segmented or normalized form are scarce, even in languages with high compound usage. As the first contribution of this work, we address this issue by introducing a dataset of 255k compound words and their normalized form, as well as non-compound words, covering 56 languages obtained from Wiktionary (www.wiktionary.org).
Using our dataset, we then find that large language models (LLMs), which typically rely on subword-based tokenization (Sennrich et al., 2016; Kudo and Richardson, 2018), struggle with decompounding, as illustrated in Figure 1. Performance is especially low for compounds where subword boundaries do not coincide with compound constituent boundaries; we term compounds with this property 'hard' compounds (Figure 2).
In order to create a more effective decompounding model, we then formulate compound segmentation and normalization as a sequence-to-sequence learning task (Sutskever et al., 2014) and train a byte-level ByT5 model (Xue et al., 2022) using a two-stage framework. In the first stage, we use a novel self-supervised hyphen-prediction objective to learn compound segmentation without any labeled data. In the second stage, we turn the model into a compound normalization model via supervised training on our Wiktionary data. In addition, we introduce a procedure to predict the segmentation of any compound word based on its normalized form, effectively making compound segmentation a subtask of normalization. Finally, we demonstrate that decompounding has real-world applications by investigating compound segmentation for language model tokenization. We apply compound segmentation as pretokenization during training of a SentencePiece tokenizer (Kudo and Richardson, 2018), which results in fewer hard compounds while incurring no extra cost during training and inference of the language model (i.e., the only extra cost occurs during creation of the tokenizer).
Our Stage 1 models outperform the best prior unsupervised models by 13.9% accuracy on average, while our (supervised) Stage 2 models outperform all prior language-specific decompounding tools. Furthermore, a model trained with a CompoundPiece tokenizer achieves a 5.5% improved performance on compound normalization over an otherwise equivalent SentencePiece model.

Contributions. 1) We introduce a dataset for decompounding of 255k words across 56 languages obtained from Wiktionary. 2) We show that a byte-level language model can efficiently decompound words via a two-stage training framework, whereas current subword-based LLMs fall short. 3) We present a way to improve subword tokenization by performing compound segmentation during creation of the tokenizer. 4) We make our code, models and dataset publicly available at github.com/bminixhofer/compoundpiece.

Related Work
Decompounding. Early work in decompounding used word frequency lists along with manually specified suffixes (e.g., a connective -s-) to segment and normalize German compounds (Langer, 1998; Koehn and Knight, 2003). Subsequently, multiple submissions to the Morpho Challenge in morphological segmentation (Kurimo et al., 2010) explicitly or implicitly made use of compound segmentation (Lignos, 2010; Virpioja et al., 2011). Later work replaced the fixed list of suffixes used in Koehn and Knight (2003) with morphological operations learned from parallel corpora (Macherey et al., 2011) or from pre-lemmatized corpora of non-compound words (Ziering and van der Plas, 2016). Another branch of work added more linguistic knowledge in the form of black- and white-lists to the paradigm of Koehn and Knight (2003), resulting in JWordSplitter (German) and nl-splitter (Dutch); this has only been done for a couple of languages due to its knowledge-intensive nature. CharSplit (Tuggener, 2016) achieves high performance for German by relying on the frequency of character n-grams appearing within the compound.
While the approaches above use (at most) light supervision, there exist supervised approaches which learn directly from an annotated corpus of compounds and their constituents, along with optional auxiliary signals (Biemann et al., 2008; Alfonseca et al., 2008). In contrast, SECOS (Riedl and Biemann, 2016) is a fully unsupervised and language-agnostic method achieving competitive performance by using word embeddings along with word frequencies for semantic compound segmentation. Our method improves over SECOS in the unsupervised case and provides a unified alternative to prior language-specific decompounding tools via additional training on labelled data.
Relation to Morphological Segmentation. Decompounding can be seen as a special case of morphological segmentation (Batsuren et al., 2022a). However, a large amount of work in morphological segmentation focuses on derivational and inflectional morphology (Cotterell et al., 2016; Faruqui et al., 2016; Cotterell et al., 2018; McCarthy et al., 2019; Goldman et al., 2022), which is reflected by datasets such as UniMorph (Batsuren et al., 2022b) and MorphyNet (Batsuren et al., 2021) annotating inflectional and derivational affixes, but not compound constituents. The SIGMORPHON-2022 Shared Task (Batsuren et al., 2022a, SMST 2022) breaks this pattern by providing a dataset for segmentation into compound constituents in addition to inflectional and derivational affixes. We improve on the SMST 2022 dataset by broadening coverage from 9 to 56 languages, as well as handling negatives (i.e., non-compounds) more carefully (§3.1).
Decompounding Datasets. Besides the SMST 2022 dataset, datasets for decompounding include AuCoPro (van Zaanen et al., 2014) for Dutch and Afrikaans, and the GermaNet dataset for German (Henrich and Hinrichs, 2011). Although there is a significant amount of work studying compound terms in languages with highly productive compound formation beyond German and Dutch, such as Finnish and Greek (Pollatsek et al., 2000; Lindén and Pirinen, 2009; Koliopoulou, 2014; Shapiro, 2016; Virkkunen et al., 2018), to the best of our knowledge there exist no public datasets for decompounding in these languages (and beyond).
Linguistically Informed Tokenization. Various studies have tried augmenting or replacing the 'linguistically uninformed' subword tokenizers used in contemporary LMs (Devlin et al., 2019; Raffel et al., 2020, inter alia) such as SentencePiece (Kudo and Richardson, 2018) and BPE (Sennrich et al., 2016) with linguistic knowledge. Using manually constructed morphological analyzers before applying BPE (Pan et al., 2020) or after generation (Matthews et al., 2018) has led to improvements, but is limited by the availability (and quality) of morphological analyzers across many languages.

Dataset Construction
We use words categorized as compound terms on Wiktionary to create a dataset for decompounding. The information on Wiktionary allows associating compound terms with their corresponding normalized constituents. Since Wiktionary only annotates the top-level split, we recursively split constituents into their smallest parts by checking if the top-level constituents are themselves compound words. Many prior decompounding tools do not evaluate performance on negative examples (i.e., non-compound words; Koehn and Knight, 2003; Riedl and Biemann, 2016; Tuggener, 2016) since most prior datasets do not contain any (Henrich and Hinrichs, 2011; van Zaanen et al., 2014). It is not trivial to obtain negative examples from Wiktionary since a large number of compound words are not categorized as such, leading to many false negatives. We solve this issue by using all normalized compound constituents as negative examples, since by definition the compound constituents can also appear on their own as non-compound words.
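The recursive splitting and negative-example collection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; `top_level_split` stands in for a hypothetical lookup of a word's annotated top-level constituents on Wiktionary.

```python
def recursive_split(word, top_level_split):
    """Split a word into its smallest compound constituents by
    recursively checking whether constituents are themselves compounds."""
    parts = top_level_split.get(word)
    if parts is None:  # not annotated as a compound -> atomic constituent
        return [word]
    result = []
    for part in parts:
        result.extend(recursive_split(part, top_level_split))
    return result

def build_dataset(top_level_split):
    positives, negatives = {}, set()
    for word in top_level_split:
        constituents = recursive_split(word, top_level_split)
        positives[word] = constituents
        # every normalized constituent can also occur on its own,
        # so it doubles as a negative (non-compound) example
        negatives.update(constituents)
    # a word may be both a compound and a constituent of a larger compound;
    # in that case, keep it only as a positive
    negatives -= set(positives)
    return positives, negatives
```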
Note that this way of obtaining negative examples is biased against words which never occur inside compounds; however, we found this to be a rather weak bias (Appendix E). We include every language with at least 100 words, leading to a dataset which covers 56 languages. The number of training examples is shown in Figure 3, and example words in Figure 4. We select up to 1,000 words (but at most 50% of total words) in every language as evaluation data. See Appendix A for further details concerning the dataset.

Two-Stage Training
To overcome the problem of data scarcity in low-resource languages, we introduce a two-stage training procedure for creating dedicated decompounding models. In Stage 1, we train on the self-supervised objective of restoring hyphenation in words extracted from a large-scale Web corpus, leading to a self-supervised compound segmentation model. In Stage 2, we fine-tune the model on compounds and their normalized constituents from an annotated corpus in a supervised fashion, turning it into a compound normalization model.
Stage 1: Self-Supervised Compound Segmentation. This stage is motivated by the fact that hyphen characters can be seen as a high-precision, low-recall indicator of compound constituent boundaries, in the same way that newline characters are a high-precision, low-recall indicator of sentence boundaries (Minixhofer et al., 2023). We use this natural segmentation into compound constituents to create a compound segmentation model without requiring any labeled data. First, we obtain all words containing a hyphen plus an equivalent amount of non-hyphenated words from a corpus of unannotated text. Hyphens primarily have two uses: (1) as a compound boundary and (2) to indicate that the word continues on the next line. We only want to retain hyphens when they function as compound boundaries, so we filter the instances of (2) by discarding all words where the hyphenated form occurs at a ratio of x ≤ e^-6 relative to the non-hyphenated form. We then strip all words of hyphens and train a seq2seq LM to predict the original (hyphenated) form of each word. At inference time, we introduce a logit bias b added to the logit of the token representing a hyphen to skew generation towards or away from hyphenation. Training on this data enables effective compound segmentation without relying on human annotations, as demonstrated later in §5.
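The Stage 1 data filtering can be sketched as below. This is an illustrative reading of the procedure, not the released code; `freq` is assumed to map surface forms to corpus counts, and the e^-6 threshold follows the text. A hyphenated word is kept as a training example only when it is not vanishingly rare relative to its unhyphenated form; otherwise the hyphen most likely marked a line break rather than a compound boundary.

```python
import math

RATIO_THRESHOLD = math.exp(-6)

def stage1_examples(freq):
    """Build (input, target) pairs: hyphen-stripped word -> hyphenated word."""
    examples = []
    for word, count in freq.items():
        if "-" not in word:
            continue
        unhyphenated = word.replace("-", "")
        # if the unhyphenated form is unseen, fall back to a count of 1
        ratio = count / max(freq.get(unhyphenated, 0), 1)
        if ratio > RATIO_THRESHOLD:
            examples.append((unhyphenated, word))
    return examples
```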
Stage 2: Supervised Compound Normalization. In the second stage, we improve upon the Stage 1 model by additional training on labeled data, where the inputs are individual compounds and the target is to predict the normalized constituents of each compound, separated by a hyphen. Training exclusively on compound normalization allows using data from the collected Wiktionary dataset, which contains compound terms along with their normalized constituents across many languages, but does not contain compound segmentation annotations.

Turning Normalization into Segmentation
Considering the scarcity of annotated compound segmentation data, it is infeasible to train a multilingual model directly on segmentation. Thus, we introduce a method to predict a segmentation given the normalized constituents. Let x be a word of length n. In addition, we have k normalized compound constituents c = {c_1, ..., c_k} (e.g., predicted by the Stage 2 model). Our aim is to find boundaries r = {r_0, ..., r_k}, with r_0 = 0 and r_k = n, giving rise to the segmentation s = {x[r_0 : r_1], ..., x[r_{k-1} : r_k]}. We approach this problem by minimizing the edit distance of each segment to its corresponding normalized constituent. This leads to an optimization problem where the cost C(s) indicates the total edits needed to turn all segments into their corresponding normalized constituents:

C(s) = \sum_{i=1}^{k} L(x[r_{i-1} : r_i], c_i)

Here, L is an edit distance metric such as Levenshtein distance (Levenshtein et al., 1966). The optimal segmentation s* is the segmentation with the minimal cost: s* = arg min_s C(s).
In case of ties, we prefer segmentations with higher edit cost for segments with lower indices, due to the preference of languages in our training set for suffixation over prefixation (Hammarström, 2021). There is a total of \binom{n}{k-1} possible segmentations, so solving the optimization problem via enumeration of all solutions is only feasible for short words (Figure 5). In Appendix B, we introduce a more efficient search algorithm which is capable of quickly finding the optimal segmentation of long words by enumerating candidates in order of a lower bound on the edit distance. This method can be used to turn the normalization predictions of a model into a segmentation. We also use it on the ground-truth normalization from Wiktionary, making it possible to approximate compound segmentation performance by comparing the segmentation corresponding to the ground-truth normalization to the segmentation produced by the model's normalization predictions.
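A minimal brute-force sketch of the normalization-to-segmentation procedure, under stated assumptions (function names are illustrative, not the released code): enumerate all ways to place k-1 boundaries in the word and pick the segmentation whose segments have minimal total Levenshtein distance to the normalized constituents, breaking ties in favor of higher edit cost on earlier segments.

```python
from itertools import combinations

def levenshtein(a, b):
    """Standard dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def segment(word, constituents):
    """Find the segmentation of `word` with minimal total edit distance
    to the normalized `constituents`."""
    n, k = len(word), len(constituents)
    best, best_key = None, None
    for bounds in combinations(range(1, n), k - 1):
        r = (0, *bounds, n)
        segs = [word[r[i]:r[i + 1]] for i in range(k)]
        costs = [levenshtein(s, c) for s, c in zip(segs, constituents)]
        # minimize total cost; on ties, prefer higher cost on earlier
        # segments (negating costs makes lexicographic min prefer this)
        key = (sum(costs), tuple(-c for c in costs))
        if best_key is None or key < best_key:
            best, best_key = segs, key
    return best
```

For example, `segment("bridesmaid", ["bride", "maid"])` resolves the tie between "bride + smaid" and "brides + maid" in favor of the latter, reflecting the suffixation preference.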

Reducing Hard Compounds
We define hard compounds relative to a particular tokenizer as compound words where the constituent boundaries do not coincide with token boundaries set by the tokenizer. More formally, a compound word made up of k constituents and l subwords is hard if the constituent boundaries r = {r_0, ..., r_k} are not a subset of the token boundaries t = {t_0, ..., t_l}, i.e., r ⊄ t. We hypothesize that hard compounds may impair language model performance due to the nontrivial relation of subwords to the compound word. In contrast, in easy compounds the word is naturally decomposed into its constituents. The increased difficulty of hard compounds is apparent on the sequence-to-sequence compound segmentation task: for an easy compound, all tokens can be copied to the output (only the special separator tokens must be inserted). On the other hand, for hard compounds, the tokens change, requiring knowledge of the characters within each token.
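The definition above translates directly into a small check (a sketch; `is_hard` is an illustrative helper, with both arguments assumed to concatenate to the same word): a compound is hard exactly when its constituent boundaries are not a subset of the tokenizer's token boundaries.

```python
def boundaries(parts):
    """Cumulative character offsets at which `parts` end."""
    bounds, pos = set(), 0
    for p in parts:
        pos += len(p)
        bounds.add(pos)
    return bounds

def is_hard(segments, tokens):
    """True if the constituent boundaries of `segments` are not a
    subset of the token boundaries of `tokens`."""
    return not boundaries(segments) <= boundaries(tokens)
```

For instance, "brides + maid" tokenized as "bride | sma | id" is hard (the boundary after "brides" is not a token boundary), while "bri | des | ma | id" is easy.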
Tokenizers where every possible constituent boundary is a token boundary trivially do not give rise to any hard compounds.This includes character-level (Clark et al., 2022;Tay et al., 2022b) as well as byte-level tokenizers (Xue et al., 2022).However, many contemporary language models use subword-based tokenizers to increase efficiency (Devlin et al., 2019;Raffel et al., 2020;Brown et al., 2020).We propose a modification to subword tokenization to reduce the number of hard compounds while keeping the efficiency advantages.
Subword tokenizers typically segment text into pre-tokens (e.g., by splitting on whitespace) before applying their subword tokenization algorithm (Mielke et al., 2021). We propose modifying pretokenization by applying compound segmentation in addition to splitting on whitespace. This modification is only done during creation of the tokenizer, thus incurring no additional cost once the tokenizer has been created. We refer to tokenizers created in this way as CompoundPiece tokenizers. The modified pretokenization tries to create more subwords which do not span compound constituent boundaries, thus decreasing the fraction of hard compounds (Figure 6). It aims to turn the dual-route model for computing the meaning of complex (compound) words proposed by Hofmann et al. (2021) into a single-route model which always computes the meaning of compounds from their constituent subwords, and never stores a compound word as a single subword.
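The modified pretokenization can be sketched as follows, under stated assumptions: `segment_compound` stands in for the decompounding model (hypothetical here), and the resulting lines would then be fed to standard subword tokenizer training (e.g., SentencePiece). Since constituents are separated only while preparing the tokenizer's training corpus, the trained tokenizer is used unchanged afterwards.

```python
def pretokenize(text, segment_compound):
    """Whitespace pretokenization plus compound segmentation."""
    pieces = []
    for word in text.split():
        pieces.extend(segment_compound(word))
    return " ".join(pieces)

def build_tokenizer_corpus(corpus, segment_compound):
    # Boundaries introduced here discourage the learned vocabulary from
    # containing subwords that span compound constituent boundaries.
    return [pretokenize(line, segment_compound) for line in corpus]
```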

Data
We obtain Stage 1 data by selecting all words containing a hyphen from a subset of the mC4 corpus (Xue et al., 2021), which results in 25M hyphenated words. As negative examples, we choose the n most common words from mC4 such that there is an equivalent amount of non-hyphenated and hyphenated words in every language. Regarding the Stage 2 data, see §3.1 above.

Training
We train a decompounding model using our two-stage framework (§3) covering 56 languages. We use ByT5 (Xue et al., 2022) as our main pretrained model and starting point since it directly ingests Unicode bytes instead of using subword tokenization, leading to zero hard compounds. We compare our approach against the subword-based T5 (Raffel et al., 2020), Flan-T5 (Chung et al., 2022) and mT5 (Xue et al., 2021) trained with the same two-stage framework. We use t5x (Roberts et al., 2022) for training with a batch size of 512 and a maximum sequence length of 64 tokens, otherwise matching T5 pretraining (Raffel et al., 2020). The setup is the same for Stage 1 and Stage 2.

Evaluation
Metric. We measure performance via averaged accuracy, i.e., the ratio of examples which are entirely correctly segmented or normalized.
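Concretely, this is exact-match accuracy over examples (a minimal sketch; the function name is illustrative): a prediction counts only if the full segmentation or normalization matches the gold answer.

```python
def accuracy(predictions, gold):
    """Exact-match accuracy: fraction of fully correct predictions."""
    assert len(predictions) == len(gold)
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)
```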
Datasets. Besides our new Wiktionary evaluation subset, we use established datasets for particular languages: GermaNet (Henrich and Hinrichs, 2011), AuCoPro for Dutch (van Zaanen et al., 2014), as well as the subset containing compound-only words across 6 languages from the SIGMORPHON 2022 Shared Task (Batsuren et al., 2022a).

Baselines. We use SECOS as the main unsupervised baseline, as well as CharSplit, JWS and nl-splitter as baselines using different amounts of supervision. For the SIGMORPHON 2022 Shared Task dataset, we compare against the task winner, DeepSPIN-3 (Peters and Martins, 2022).
Languages. For clarity of presentation, we present results on Danish, German, English, Spanish, Estonian, Greek, Persian, Finnish, Hungarian, Kazakh, Latvian, Dutch, Polish and Swedish as a linguistically diverse subset of languages with productive compound formation in the main paper. For the full evaluation across all languages, see Appendix C.

Results and Discussion
Main compound segmentation results are shown in Table 1. For the self-supervised models, we choose the logit bias b = 3 to bias generation towards hyphenated words. ByT5 outperforms subword-based models by a large margin, with an absolute 8.9% improvement over the best subword-based model after Stage 1 training and a 3.7% improvement after Stage 2 training. Comparing models not trained on any annotated data, the self-supervised ByT5 outperforms SECOS on 13 out of 14 languages, and by 13.9% on average.
We further compare against language-specific and supervised methods in Table 2. Our ByT5-based model outperforms all prior methods on every dataset. Since GermaNet tests compound head segmentation (i.e., even if a word contains multiple constituents, it is only split into a head and a modifier), we count an example as correctly segmented if either the first constituent matches the modifier or the last constituent matches the head.
Evaluating LLMs on Decompounding. We also evaluate the in-context learning performance of multiple LLMs on compound segmentation. We use T5 models with 770M, 3B and 11B parameters (Raffel et al., 2020) as well as the UL2 model with 20B parameters (Tay et al., 2022a); since all of them use the same tokenizer, this enables performance comparisons on hard compounds across LLMs. We use the model versions fine-tuned on the Flan dataset collection (Chung et al., 2022), matching our prompt to the style of instructions in the Flan collection (Appendix D). Zero- to 16-shot results are shown in Figure 7. Although the LLMs perform non-trivially well on easy compounds, performance is close to zero (<3%) on hard compounds. Intriguingly, UL2 20B performs worse than Flan T5 XXL (11B), reversing the trend seen on other tasks (Tay et al., 2022a). All the LLMs perform considerably worse than our ByT5-based model; see Figure 1.

Reducing Hard Compounds via CompoundPiece. To evaluate our method of reducing the number of hard compounds in subword-based language models (§3.4), we train CompoundPiece models in two configurations: (i) multilingual tokenizers across all 56 languages and (ii) separate monolingual tokenizers for every language. For the multilingual tokenizers, we sample languages with p(L) ∝ |L|^α, where p(L) is the probability of sampling text from a language L with |L| texts, as in prior work (Conneau et al., 2020). We use a subsample of 10M texts from the mC4 corpus (Xue et al., 2021) with α = 0.2. The vocabulary size is 250k for the multilingual and 32k for the monolingual tokenizers, following prior work (Rust et al., 2021; Conneau et al., 2020).
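The exponentially smoothed language sampling above can be sketched as follows (`sampling_probs` is an illustrative helper, not the released code): raising corpus sizes to the power α = 0.2 flattens the distribution, upweighting low-resource languages relative to raw proportional sampling.

```python
def sampling_probs(sizes, alpha=0.2):
    """p(L) proportional to |L|^alpha, normalized over languages."""
    weights = {lang: n ** alpha for lang, n in sizes.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}
```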
We use our fine-tuned ByT5 model for train-time pretokenization into compound constituents and SentencePiece (Kudo and Richardson, 2018) with Unigram LM (Kudo, 2018) as the subword tokenization applied after pretokenization. As a baseline, we train SentencePiece tokenizers with pretokenization into words (split by whitespace) on the same data. Table 3 shows the percentage of hard compounds for every tokenizer. CompoundPiece reduces the number of hard compounds from 27.1% → 9.7% on average in the monolingual case. In the multilingual case, the reduction is less marked. Constituents being covered by tokens which are common in other languages is likely the lead cause for the increased number of hard compounds in the multilingual tokenizers. This could potentially be solved by adjusting token probabilities based on the input language; we leave this to future work.
To more thoroughly evaluate our tokenization, we train multilingual T5 models using SentencePiece and CompoundPiece. We use the same sampling ratio (α = 0.2) of mC4 as for creating the tokenizer, but instead use a subset of 500M texts. We match the architecture and the pretraining setup of the mT5-base model, but train for a total of 65.5B tokens. We evaluate the model on the decompounding task. Results are shown in Table 5.
Ablation Studies. We quantify the impact of the most significant design choices of our model in Table 6. Although filtering hyphens-as-newline-indicator (§4.1) removes only 300k words (<1%) from the pretraining data, it increases performance on negatives by a large margin. Removing Stage 1 training (i.e., fine-tuning directly on the Wiktionary data instead) consistently decreases performance.

Conclusion
We systematically investigated the word decompounding tasks of compound segmentation and normalization on a wide scale and in multilingual contexts. To this end, we introduced a dataset of 255k words including compounds and non-compounds across 56 languages from Wiktionary, which allowed us to evaluate the performance of LLMs on decompounding. We found that current LLMs' performance is limited due to hard compounds, which arise when subword token boundaries do not coincide with compound constituent boundaries. We then introduced dedicated models for decompounding which use byte-level tokenization to entirely avoid hard compounds. Finally, we used our decompounding models to create novel CompoundPiece tokenizers, keeping the efficiency advantages of subword tokenization while strongly decreasing the amount of hard compounds; this increases the performance of CompoundPiece models over comparable SentencePiece models on the decompounding tasks.

Limitations
Although self-supervised training in Stage 1 allows for decompounding without any annotated training data, Stage 2 training is limited to languages with sufficient entries in Wiktionary: this excludes extremely low-resource languages.Furthermore, due to computational constraints we have not trained larger models using CompoundPiece tokenization; hence we are unable to report on its benefits at larger scales and on tasks besides decompounding.
References

Patrick Ziering and Lonneke van der Plas. 2016. Towards unsupervised and language-independent compound splitting using inflectional morphological transformations. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644-653, San Diego, California. Association for Computational Linguistics.

A Dataset Statistics
Statistics for the training and validation splits of the Wiktionary dataset are shown in Table 7.

B Efficient Segmentation Algorithm
Pseudocode of the brute-force algorithm to turn normalization into segmentation is shown in Algorithm 1. Since enumerating all possible segmentations is only feasible for short words (§3.3), we introduce a more efficient algorithm (Algorithm 2) where candidate segmentations are ordered such that segmentations with constituents closest in length to the corresponding normalized constituents appear first. Assuming insertions and deletions both have a cost of one (as is the case in standard Levenshtein distance), constituents are thus sorted in increasing order of a lower bound on the edit distance. The procedure can stop once the lower bound on the edit distance reaches the cost of the best solution found so far, since by that point it is impossible for a better solution to be found. Note that the normalization-to-segmentation problem is related to sequence partitioning (Manne and Sorevik, 1995; Han et al., 1992), where the aim is to find a partition of a sequence such that the maximum cost across partitions of some cost function is minimized. However, since our goal is to find the partitioning with the minimum aggregated cost, algorithms for conventional sequence partitioning are not applicable.
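An illustrative (not optimized) sketch of the pruning idea in Algorithm 2, under stated assumptions: |len(segment) - len(constituent)| lower-bounds the Levenshtein distance, so candidates are visited in increasing order of this bound and the search stops once the bound reaches the best exact cost found so far. For brevity this sketch still materializes all candidates and ignores the tie-breaking rule; function names are illustrative.

```python
import heapq
from itertools import combinations

def levenshtein(a, b):
    """Standard dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def segment_pruned(word, constituents):
    n, k = len(word), len(constituents)
    heap = []
    for bounds in combinations(range(1, n), k - 1):
        r = (0, *bounds, n)
        segs = tuple(word[r[i]:r[i + 1]] for i in range(k))
        # length difference lower-bounds the edit distance of each segment
        lb = sum(abs(len(s) - len(c)) for s, c in zip(segs, constituents))
        heapq.heappush(heap, (lb, segs))
    best, best_cost = None, float("inf")
    while heap:
        lb, segs = heapq.heappop(heap)
        if lb >= best_cost:
            break  # no remaining candidate can beat the best exact cost
        cost = sum(levenshtein(s, c) for s, c in zip(segs, constituents))
        if cost < best_cost:
            best, best_cost = segs, cost
    return list(best)
```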

C Results for All Languages
Segmentation accuracy for all languages is shown in Tables 8-11.

D LLM Prompts
The prompt used for LLM evaluations (§5) is shown in Figure 8. The prompt was chosen among 10 prompts to maximize performance on Flan T5 Large. For 2- to 16-shot results, we provide 50% positive (compound) and 50% negative (non-compound) examples in a random order.

E Quantifying Negative Collection Bias
We conduct an experiment to measure the extent of the bias against words which do not occur inside compounds in our data collection methodology (§3.1). In particular, we quantify the bias against long non-compound words, which usually would not occur inside compounds. We took a random sample of 500 words each from word frequency lists in English and German (Speer, 2022), manually removed compound words, and compared the length statistics of this (unbiased) sample of non-compounds to our non-compound dataset.
Words in our non-compound dataset are indeed shorter on average (6.0 vs. 6.7 characters for English, 6.7 vs. 7.1 characters for German); however, at less than one character of difference on average, this indicates only a weak length bias in data collection.
We also found qualitatively that our non-compound dataset contains a wide variety of words, since compounding is typically a process that can occur for many different root words.

Figure 4: Example words in the Wiktionary dataset.

Figure 7: Few-shot in-context learning performance of LLMs on easy positives, hard positives, negatives and across all examples. Hard negatives are the same across all LLMs since they use the same tokenizer.

Figure 8: Prompts used to evaluate LLM in-context learning compound segmentation performance.

Table 4: Example compound words which are easy for the monolingual but hard for the multilingual CompoundPiece tokenizer. "_" indicates whitespace.

Table 6: Ablation studies on not filtering hyphens-as-newline-indicator and on skipping Stage 1 training.