Superbizarre Is Not Superb: Derivational Morphology Improves BERT’s Interpretation of Complex Words

How does the input segmentation of pretrained language models (PLMs) affect their interpretations of complex words? We present the first study investigating this question, taking BERT as the example PLM and focusing on its semantic representations of English derivatives. We show that PLMs can be interpreted as serial dual-route models, i.e., the meanings of complex words are either stored or else need to be computed from the subwords, which implies that maximally meaningful input tokens should allow for the best generalization on new words. This hypothesis is confirmed by a series of semantic probing tasks on which DelBERT (Derivation leveraging BERT), a model with derivational input segmentation, substantially outperforms BERT with WordPiece segmentation. Our results suggest that the generalization capabilities of PLMs could be further improved if a morphologically-informed vocabulary of input tokens were used.

One common characteristic of PLMs is their input segmentation: PLMs are based on fixed-size vocabularies of words and subwords that are generated by compression algorithms such as byte-pair encoding (Gage, 1994; Sennrich et al., 2016) and WordPiece (Schuster and Nakajima, 2012; Wu et al., 2016). The segmentations produced by these algorithms are linguistically questionable at times (Church, 2020), which has been shown to worsen performance on certain downstream tasks (Bostrom and Durrett, 2020; Hofmann et al., 2020a). However, the wider implications of these findings, particularly with regard to the generalization capabilities of PLMs, are still poorly understood.

Figure 1: Basic experimental setup. WordPiece segmentation (s_w) mixes part of the stem bizarre with the prefix super, creating an association with superb. Derivational segmentation (s_d), on the other hand, separates prefix and stem by a hyphen. The two likelihoods are averaged across 20 models trained with different random seeds. While superbizarre has negative sentiment, applausive is an example of a complex word with positive sentiment.
Here, we address a central aspect of this issue, namely how the input segmentation affects the semantic representations of PLMs, taking BERT as the example PLM. We focus on derivationally complex words since they exhibit relatively systematic patterns on the lexical level, providing an ideal testbed for linguistic generalization. At the same time, the fact that low-frequency and out-of-vocabulary (OOV) words are often derivationally complex (Baayen and Lieber, 1991) makes our work relevant in practical settings, especially when many one-word expressions are involved, e.g., in query processing (Kacprzak et al., 2017).
The topic of this paper is related to the more fundamental question of how PLMs represent the meaning of complex words in the first place. So far, most studies have focused on methods of representation extraction, using ad-hoc heuristics such as averaging the subword embeddings (Pinter et al., 2020; Sia et al., 2020; Vulić et al., 2020) or taking the first subword embedding (Devlin et al., 2019; Heinzerling and Strube, 2019; Martin et al., 2020). While not resolving the issue, we lay the theoretical groundwork for more systematic analyses by showing that PLMs can be regarded as serial dual-route models (Caramazza et al., 1988), i.e., the meanings of complex words are either stored or else need to be computed from the subwords.
Contributions. We present the first study examining how the input segmentation of PLMs, specifically BERT, affects their interpretations of derivationally complex English words. We show that PLMs can be interpreted as serial dual-route models, which implies that maximally meaningful input tokens should allow for the best generalization on new words. This hypothesis is confirmed by a series of semantic probing tasks on which derivational segmentation consistently outperforms BERT's WordPiece segmentation by a large margin. This suggests that the generalization capabilities of PLMs could be further improved if a morphologically-informed vocabulary of input tokens were used. We also publish three large datasets of derivationally complex words with corresponding semantic properties.[1]

[1] We will make all our code and data publicly available.

How Are Complex Words Processed?
Contrasting these single-route frameworks are dual-route models that allow for a combination of storage and computation.
Outside the taxonomy presented so far are recent models that assume multiple levels of representation as well as various forms of interaction between them (Rácz et al., 2015; Needle and Pierrehumbert, 2018). In these models, sufficiently frequent complex words are stored together with representations that include their internal structure. Complex-word processing is driven by analogical processes over the mental lexicon (Rácz et al., 2020).

Complex Words in NLP
Most models of word meaning proposed in NLP can be roughly assigned to one of the single-route and dual-route approaches. Word embeddings that represent complex words as whole-word vectors (Deerwester et al., 1990; Mikolov et al., 2013a,b; Pennington et al., 2014) can be seen as single-route storage models. Word embeddings that represent complex words as a function of subword or morpheme vectors (Schütze, 1992; Luong et al., 2013) can be seen as single-route computation models. Finally, word embeddings that represent complex words as a function of subword or morpheme vectors as well as whole-word vectors (Botha and Blunsom, 2014; Qiu et al., 2014; Bhatia et al., 2016; Bojanowski et al., 2017; Athiwaratkun et al., 2018; Salle and Villavicencio, 2018) are most closely related to parallel dual-route approaches.
Where are PLMs to be located on this taxonomy? PLMs represent many complex words as whole-word vectors. Similarly to character-based models (Sutskever et al., 2011; Kim et al., 2016), they can also store the meaning of frequent complex words that are segmented into subwords, i.e., frequent subword collocations, in their model weights. When the complex-word meaning is neither stored as a whole-word vector nor in the model weights, PLMs compute the meaning as a compositional function of the subwords. Conceptually, PLMs can thus be interpreted as serial dual-route models.
Seeing PLMs as serial dual-route models allows for a more nuanced view on the central research question of this paper: testing the semantic generalization capabilities on complex words means testing the quality of their semantic representations in cases where the meaning is neither stored in the input embeddings nor in the model weights and hence needs to be computed compositionally as a function of the subwords (i.e., the computation-based route is activated). We hypothesize that the morphological validity of the segmentation affects the representational quality in these cases, and that the best generalization is achieved by maximally meaningful tokens. This does not imply that the tokens have to be morphemes, but the segmentation boundaries need to coincide with morphological boundaries, i.e., morpheme bundles (Stump, 2017, 2019) are also possible. Complex words whose meanings are stored in the model weights, on the other hand, are expected to be affected by the segmentation to a much lesser extent.

Setup
Analyzing the impact of different input segmentations on BERT's semantic generalization capabilities is not straightforward since it is not clear a priori how to measure the quality of representations. In this study, we devise a novel lexical-semantic probing task: we use BERT's representations for complex words to predict semantic properties, specifically sentiment and topicality (see Figure 1). Given a complex word such as superbizarre, the task is to predict, e.g., that its sentiment is negative. For topicality, the task is to predict, e.g., that isotopize is a complex word used in physics. We confine ourselves to binary prediction, i.e., the probed semantic properties always consist of two classes (e.g., positive and negative). The extent to which a segmentation allows this task to be solved is taken as an indicator of the representational quality it results in.
More formally, let D be a dataset consisting of complex words x and corresponding semantic properties y (e.g., sentiment). We denote with s(x) = (t_1, ..., t_k) the segmentation of x into a sequence of k subwords. We ask how s impacts the capability of BERT to predict y, i.e., how p(y | s(x)), the likelihood of the true semantic property y given a certain segmentation of x, depends on different choices for s. The two specific segmentation methods we compare in this study are BERT's standard WordPiece segmentation (Schuster and Nakajima, 2012; Wu et al., 2016), s_w, and a derivational segmentation that segments complex words into stems and affixes, s_d.
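To make the two segmentation functions concrete, the following minimal Python sketch contrasts s_w and s_d for the word superbizarre. The WordPiece tokens shown in the comment are what bert-base-uncased typically produces, and the derivational token sequence is written by hand for illustration, so the exact tokens used by DelBERT may differ.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

word = "superbizarre"

# s_w: BERT's standard WordPiece segmentation (data-driven, not morphological).
s_w = tokenizer.tokenize(word)   # e.g. ['superb', '##iza', '##rre']

# s_d: derivational segmentation into prefix and stem, separated by a hyphen
# (the convention adopted for DelBERT, see the Models section).
s_d = ["super", "-", "bizarre"]

# The probing question is then how p(y | s(x)), the likelihood of the true
# semantic property y (e.g. negative sentiment), depends on the choice of s.
print(s_w, s_d)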

Data
Existing datasets do not allow us to conduct experiments following the described setup. We therefore introduce three datasets that we create using distant supervision (Mintz et al., 2009): we employ large existing datasets annotated for sentiment or topicality, extract all derivationally complex words for a predefined set of two classes, and use the dataset labels as their semantic properties.
For determining and segmenting derivationally complex words, we use the algorithm introduced by Hofmann et al. (2020b), which takes as input a set of prefixes, suffixes, and stems and checks for each word in the data whether it can be derived from a stem using a combination of prefixes and suffixes. The algorithm is sensitive to morpho-orthographic rules of English (Plag, 2003), e.g., when the suffix ize is removed from isotopize, the result is isotope, not isotop. We follow Hofmann et al. (2020a) in using BERT's prefixes, suffixes, and stems as input to the algorithm.
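A greatly simplified sketch of such a derivational segmentation check is given below. The affix and stem inventories and the single morpho-orthographic rule (restoring a stem-final e) are illustrative stand-ins; the actual algorithm of Hofmann et al. (2020b) handles chains of affixes and a richer set of rules.

PREFIXES = {"super", "anti", "non", "over"}
SUFFIXES = {"ize", "ive", "ation"}
STEMS = {"bizarre", "isotope", "applause"}

def undo_orthography(core: str, suffix: str) -> str:
    # Morpho-orthographic repair: suffixes like -ize delete a stem-final 'e'
    # (isotopize -> isotop), so try restoring it before the lexicon lookup.
    if suffix and core + "e" in STEMS:
        return core + "e"
    return core

def segment(word: str):
    for prefix in PREFIXES | {""}:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in SUFFIXES | {""}:
            if not rest.endswith(suffix):
                continue
            core = rest[: len(rest) - len(suffix)]
            core = undo_orthography(core, suffix)
            if core in STEMS:
                return [part for part in (prefix, core, suffix) if part]
    return None  # not analyzable as a derivative with these inventories

print(segment("superbizarre"))  # ['super', 'bizarre']
print(segment("isotopize"))     # ['isotope', 'ize']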
To get the labels indicating the semantic properties, we compute for each complex word which fraction of texts containing the word belongs to one of the two classes (e.g., topics) and rank all words accordingly. We then take the first and third tertiles of complex words as representing the two classes. We randomly split the words into 60% training, 20% development, and 20% test.
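The labeling and splitting step can be sketched as follows, assuming a mapping from each complex word to the class labels of the texts it appears in; the function name and the binary class names are illustrative.

import random

def label_and_split(occurrences, seed=0):
    # Fraction of a word's texts that belongs to the first class (e.g., pos).
    scores = {w: labels.count("pos") / len(labels) for w, labels in occurrences.items()}
    ranked = sorted(scores, key=scores.get)
    n = len(ranked)
    # First and third tertiles of the ranking represent the two classes;
    # the middle tertile is discarded.
    words = [(w, 0) for w in ranked[: n // 3]] + [(w, 1) for w in ranked[-(n // 3):]]
    random.Random(seed).shuffle(words)
    n_train, n_dev = int(0.6 * len(words)), int(0.2 * len(words))
    return (words[:n_train],                 # 60% training
            words[n_train:n_train + n_dev],  # 20% development
            words[n_train + n_dev:])         # 20% test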
In the following, we describe the characteristics of the three datasets in greater depth. Table 1 provides summary statistics. See Appendix A.1 for details about data preprocessing.
Amazon. Amazon is an online e-commerce platform on which customers can buy and review a variety of products. A large dataset of Amazon reviews has been made publicly available (Ni et al., 2019). We extract derivationally complex words from reviews with one or two stars (neg) as well as four or five stars (pos), discarding three-star reviews. This way of binarizing the five-star range has been used before (Yang and Eisenstein, 2017).
ArXiv. ArXiv is an open-access distribution service for scientific articles. Recently, a dataset of all papers published on ArXiv with corresponding metadata has been released. For this study, we extract all articles from physics (phys) and computer science (cs), which we identify using ArXiv's subject classification.
Reddit. Reddit is a social media platform hosting discussions about various topics. It is divided into smaller communities, so-called subreddits, which have been shown to be a rich source of derivationally complex words (Hofmann et al., 2020c). Hofmann et al. (2020a) have published a dataset of derivatives found on Reddit, annotated with the subreddits in which they occur. We define two groups of subreddits, all of which are among the largest subreddits: an entertainment set (ent) consisting of the subreddits anime, DestinyTheGame, funny, Games, gaming, leagueoflegends, movies, Music, pics, and videos, as well as a discussion set (dis) consisting of the subreddits askscience, atheism, conspiracy, news, Libertarian, politics, science, technology, TwoXChromosomes, and worldnews. We extract all derivationally complex words that occur in them.

Models
We train two main models on the introduced binary classification task: BERT with the standard WordPiece segmentation (s_w) and BERT using the derivational segmentation (s_d), a model that we refer to as DelBERT (Derivation leveraging BERT). The specific BERT variant we use is BERT_BASE (uncased) (Devlin et al., 2019). For the derivational segmentation, we follow previous work by Hofmann et al. (2020a) in separating stem and prefixes by a hyphen. We further follow Casanueva et al. (2020) and Vulić et al. (2020) in mean-pooling the output representations for all subwords, excluding BERT's special tokens. The mean-pooled representation is then fed into a two-layer feed-forward network for classification. To examine the relative importance of different types of morphological units, we train two additional models in which we ablate information about either the stem or the affixes, i.e., we represent all stems (or all affixes, respectively) by the same randomly chosen input embedding. We finetune BERT, DelBERT, and the two ablated models on the three datasets using 20 different random seeds. We choose F1 as the evaluation measure. See Appendix A.2 for details about implementation and hyperparameters.
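The classification architecture described above can be sketched as follows. This is a minimal sketch assuming the Hugging Face transformers and PyTorch APIs; the layer sizes of the feed-forward head are illustrative choices, and the hyperparameters actually used are given in Appendix A.2.

import torch
import torch.nn as nn
from transformers import BertModel

class DelBERTClassifier(nn.Module):
    def __init__(self, hidden=768, n_classes=2):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # Two-layer feed-forward classification head.
        self.ffn = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, n_classes)
        )

    def forward(self, input_ids, attention_mask, special_tokens_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        hidden_states = out.last_hidden_state               # (batch, seq, hidden)
        # Keep only real subword positions: attended and not [CLS]/[SEP]/[PAD].
        mask = (attention_mask * (1 - special_tokens_mask)).unsqueeze(-1).float()
        # Mean-pool the output representations over the remaining subwords.
        pooled = (hidden_states * mask).sum(1) / mask.sum(1).clamp(min=1)
        return self.ffn(pooled)                              # logits over the two classes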

Results
DelBERT (s_d) outperforms BERT (s_w) by a large margin on all three datasets (Table 2). It is interesting to note that the performance difference is larger for ArXiv and Reddit than for Amazon, indicating that the gains in representational quality are particularly large for topicality.
What is it that leads to DelBERT's increased performance? The ablation study shows that models using only stem information already achieve relatively high performance and are on par with or even better than the BERT models on ArXiv and Reddit. However, the DelBERT models still perform substantially better than the stem models on all three datasets. The gap is particularly pronounced for Amazon, which indicates that the interaction between the meaning of stem and affixes is more complex for sentiment than for topicality. This makes sense from a linguistic point of view: while stems tend to be good cues for the topicality of a complex word, sentiment often depends on semantic interactions between stems and affixes. While the prefix un, e.g., turns the sentiment of amusing negative, it turns the sentiment of biased positive. Such effects involving negation and antonymy are known to be challenging for PLMs (Ettinger, 2020; Kassner and Schütze, 2020) and might be one of the reasons for the generally lower performance on Amazon.[6] The performance of models using only affixes is overall much lower.

[6] Another reason for the lower performance on sentiment prediction is that the datasets were created by means of distant supervision (see Section 3.2), and hence many complex words do not directly carry information about sentiment.

Quantitative Analysis
To further examine how BERT (s_w) and DelBERT (s_d) differ in the way they infer the meaning of complex words, we perform a convergence analysis. We find that the DelBERT models reach their peak in performance faster than the BERT models (Figure 2). This is in line with our interpretation of PLMs as serial dual-route models (see Section 2.2): while DelBERT operates on morphological units and can combine the subword meanings to infer the meanings of complex words, BERT's subwords do not necessarily bear lexical meanings, and hence the derivational patterns need to be stored by adapting the model weights. This is an additional burden, leading to longer convergence times and substantially worse overall performance.
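The convergence analysis itself requires only the per-epoch validation scores of each run. A minimal sketch, assuming a list of F1-per-epoch curves (one per random seed), is given below; the function names are illustrative.

import numpy as np

def epochs_to_peak(f1_curves):
    # Epoch (1-indexed) at which each run reaches its maximum validation F1
    # (upper panel of Figure 2).
    return [int(np.argmax(curve)) + 1 for curve in f1_curves]

def mean_trajectory(f1_curves):
    # Average validation F1 per epoch across random seeds (lower panel of Figure 2).
    return np.asarray(f1_curves, dtype=float).mean(axis=0)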
Our hypothesis that PLMs can use two routes to process complex words (storage in weights and compositional computation based on input embeddings), and that the second route is blocked when the input segmentation is not morphological, suggests the existence of frequency effects: BERT might have seen frequent complex words multiple times during pretraining and stored their meaning in the model weights. This is less likely for infrequent complex words, making the capability to compositionally infer the meaning (i.e., the computation route) more important. We therefore expect the difference in performance between DelBERT (which should have an advantage on the computation route) and BERT to be larger for infrequent words. To test this hypothesis, we split the complex words of each dataset into three bins of low (f ≤ 5), mid (5 < f ≤ 500), and high (f > 500) absolute frequency, and analyze how the performance of BERT and DelBERT differs on the three bins. We merge development and test sets for this analysis and compute accuracies instead of F1 scores. The results are in line with our hypothesis (Figure 3): BERT performs worse than DelBERT on complex words of low and mid frequencies but achieves very similar (ArXiv, Reddit) or even better (Amazon) accuracies on high-frequency complex words. These results strongly suggest that two different mechanisms are involved, and that BERT is at a disadvantage for complex words that do not have a high frequency. At the same time, the slight advantage of BERT on high-frequency complex words indicates that it has high-quality representations of these words in its weights, which DelBERT cannot exploit since it uses a different segmentation.

Figure 3: Frequency analysis. The plots show the average performance (accuracy) of 20 BERT and DelBERT models trained with different random seeds for complex words of low (f ≤ 5), mid (5 < f ≤ 500), and high (f > 500) frequency. On all three datasets, BERT performs similarly to or better than DelBERT for complex words of high frequency but worse for complex words of low and mid frequency.
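A sketch of the binning procedure is given below, assuming dictionaries that map each complex word to its absolute corpus frequency and to a 0/1 indicator of whether a given model classified it correctly; all names are illustrative.

def frequency_bin(f):
    # Bin boundaries as used in the analysis above.
    if f <= 5:
        return "low"
    return "mid" if f <= 500 else "high"

def accuracy_per_bin(words, freq, correct):
    bins = {"low": [], "mid": [], "high": []}
    for w in words:                                  # merged development and test sets
        bins[frequency_bin(freq[w])].append(correct[w])
    # Accuracy (rather than F1) within each frequency bin.
    return {b: sum(v) / len(v) for b, v in bins.items() if v}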

Qualitative Analysis
Besides quantitative factors, we are interested in identifying qualitative contexts in which DelBERT (s_d) has a particular advantage compared to BERT (s_w). To do so, we filter the datasets for complex words that are consistently classified correctly by the DelBERT models and incorrectly by the BERT models. For each word, we compute average likelihoods across all DelBERT and BERT models, respectively, and rank words according to the difference of their likelihood under both model types. Looking into the words with the most extreme differences, we observe three broad classes of cases. Table 3 provides example complex words for the three classes and each of the datasets.
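The filtering and ranking procedure can be sketched as follows, using averaged likelihoods of the true class as a simplified stand-in for the consistency criterion; the dictionary names are illustrative.

def rank_bert_failures(words, p_delbert, p_bert, threshold=0.5):
    # Keep words that the DelBERT models get right and the BERT models get
    # wrong (approximated here via the averaged likelihood of the true class),
    # then rank by the gap between the two averaged likelihoods.
    kept = [w for w in words if p_delbert[w] > threshold and p_bert[w] < threshold]
    return sorted(kept, key=lambda w: p_delbert[w] - p_bert[w], reverse=True)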
Firstly, the addition of a suffix is often connected with morpho-orthographic changes (e.g., the deletion of a stem-final e before the suffix), which leads to a segmentation of the stem into several subwords since the truncated stem is not in the WordPiece vocabulary (applausive, isotopize, prematuration). The model does not seem to be able to recover the meaning of the stem from the subwords. Secondly, similarly to the first class, the addition of a prefix has the effect that the word-internal (as opposed to word-initial) form of the stem would have to be available for proper segmentation. Since this form rarely exists in the WordPiece vocabulary, the stem is segmented into several subwords (superannoying, antimicrosoft, nonmultiplayer). Again, it does not seem to be possible for the model to recover the meaning of the stem. Thirdly, the segmentation of prefixed complex words often fuses the prefix with the first characters of the stem (overseasoned, inkinetic, promosque). This case is particularly detrimental since it not only makes it difficult to recover the meaning of the stem but also creates associations with unrelated meanings, sometimes even opposite meanings as in the case of superbizarre. The three classes thus underscore the difficulty of inferring the meaning of complex words from the subwords when the whole-word meaning is not stored in the model weights and the subwords are not morphological.

Related Work
Several recent studies have examined how the performance of PLMs is affected by their input segmentation. Bostrom and Durrett (2020) pretrain RoBERTa (Liu et al., 2019) with different segmentation methods and find segmentations that align more closely with morphology to perform better on a number of downstream tasks. Hofmann et al. (2020a) analyze the performance of BERT on morphological well-formedness prediction and show that a derivational segmentation substantially improves performance compared to a model using WordPiece segmentation. Tan et al. (2020) propose a segmentation method for BERT that splits inflected words into stems and inflection symbols and show that it allows for a better generalization on non-standard inflections. Relatedly, studies from the field of automatic speech recognition have demonstrated that morphological decomposition improves the perplexity of language models, leading to overall improvements in performance (Fang et al., 2015; Jain et al., 2020).[7]

Most NLP studies on derivational morphology have been devoted to the question of how semantic representations of derivationally complex words can be enhanced by including morphological information (Luong et al., 2013; Botha and Blunsom, 2014; Qiu et al., 2014; Bhatia et al., 2016; Cotterell and Schütze, 2018), and how affix embeddings can be computed (Lazaridou et al., 2013; Kisselew et al., 2015; Padó et al., 2016). Cotterell et al. (2017), Vylomova et al. (2017), and Deutsch et al. (2018) propose sequence-to-sequence models for the generation of derivationally complex words. Hofmann et al. (2020a) address the same task using BERT. In contrast, we analyze how different input segmentations affect the semantic representations of derivationally complex words in PLMs, a question that has not been addressed before.

[7] There are also studies that analyze morphological aspects of PLMs without a focus on questions surrounding segmentation (Edmiston, 2020; Klemen et al., 2020).

Conclusion
We have examined how the input segmentation of PLMs, specifically BERT, affects their interpretations of derivationally complex words. Drawing upon insights from psycholinguistics, we have deduced a conceptual interpretation of PLMs as serial dual-route models, which implies that maximally meaningful input tokens should allow for the best generalization on new words. This hypothesis was confirmed by a series of semantic probing tasks on which DelBERT, a model using derivational segmentation, consistently outperformed BERT using WordPiece segmentation. Quantitative and qualitative analyses further showed that BERT's inferior performance was caused by its inability to infer the complex-word meaning as a function of the subwords when the complex-word meaning was not stored in the weights. Overall, our findings suggest that the generalization capabilities of PLMs could be further improved if a morphologically-informed vocabulary of input tokens were used.

Figure 2: Convergence analysis. The upper panel shows the distributions of the number of epochs after which the models reach their maximum validation performance. The lower panel shows the trajectories of the average validation performance (F1) across epochs. The plots are based on 20 models trained with different random seeds, so each histogram in the upper panel sums to 20.

Table 1: Dataset characteristics. The table provides information about the datasets such as the relevant semantic properties with their classes and example complex words. |D|: number of complex words.

Table 3: Error analysis. The table gives example complex words that are consistently classified correctly by DelBERT and incorrectly by BERT. x: complex word; y: class; s_d(x): derivational segmentation; μ_p: average likelihood of the true class across 20 models trained with different random seeds; s_w(x): WordPiece segmentation.