Do Pretrained Contextual Language Models Distinguish between Hebrew Homograph Analyses?

Semitic morphologically-rich languages (MRLs) are characterized by extreme word ambiguity. Because most vowels are omitted in standard texts, many of the words are homographs with multiple possible analyses, each with a different pronunciation and different morphosyntactic properties. This ambiguity goes beyond word-sense disambiguation (WSD), and may include token segmentation into multiple word units. Previous research on MRLs claimed that standardly trained pre-trained language models (PLMs) based on word-pieces may not sufficiently capture the internal structure of such tokens in order to distinguish between these analyses. Taking Hebrew as a case study, we investigate the extent to which Hebrew homographs can be disambiguated and analyzed using PLMs. We evaluate all existing models for contextualized Hebrew embeddings on novel Hebrew homograph challenge sets that we deliver. Our empirical results demonstrate that contemporary Hebrew contextualized embeddings outperform non-contextualized embeddings, and that they are most effective for disambiguating segmentation and morphosyntactic features, less so for pure word-sense disambiguation. We show that these embeddings are more effective when the number of word-piece splits is limited, and that they are more effective for 2-way and 3-way ambiguities than for 4-way ambiguity. We show that the embeddings are equally effective for homographs of both balanced and skewed distributions, whether calculated as masked or unmasked tokens. Finally, we show that these embeddings are as effective for homograph disambiguation with extensive supervised training as with a few-shot setup.


Introduction
Semitic morphologically-rich languages (MRLs) such as Arabic, Hebrew, and Aramaic are characterized by extreme ambiguity at the word level (Wintner, 2014; Tsarfaty et al., 2020). In a standard text, many (and often most) of the words will be homographs with multiple possible analyses. The high ambiguity derives from several factors. First, prepositions, conjunctions, accusative pronouns, and possessive pronouns are often seamlessly affixed to words. Next, vowels are generally omitted in written texts. Finally, proper nouns are not differentiated from common nouns (there are no capital letters).
The task of distinguishing between Hebrew homograph analyses is related to the general task of Word Sense Disambiguation (WSD) (Agirre and Edmonds, 2006; Navigli, 2009), yet it is more challenging. In the standard case of WSD, a single orthographic form is associated with a single word that can be analyzed in terms of two or more senses; also, the analyses are generally pronounced identically, and often have the same morphosyntactic properties (e.g., bank of a river vs. savings bank). In contrast, in Semitic languages, the need for disambiguation often goes beyond a determination of sense. Hebrew word ambiguities can be divided into three primary categories (Table 1):
1. Segmentation ambiguities, in which a given orthographic form may (or may not) be segmented into multiple word units, each bearing its own role (POS tag) in the sentence.
2. Morphosyntactic ambiguities, in which the segmentation of the form is not ambiguous, but the multiple analyses reflect different morphosyntactic properties of the word unit(s).
3. Sense ambiguities (the aforementioned standard case of WSD), in which the analyses of the unit(s) do not differ in their morphosyntactic properties, but rather in their sense.
One orthographic form may exhibit multiple types of ambiguity simultaneously.
Pretrained contextualized language models with standard word-piece tokenization mechanisms have been shown to excel at WSD in English and other Indo-European languages (Yaghoobzadeh et al., 2019). However, for Hebrew and other Semitic languages it has been argued that such models would not sufficiently capture the structure of MRLs in order to distinguish between internally-complex homograph analyses (Klein and Tsarfaty, 2020; Tsarfaty et al., 2020). In this work, we take Modern Hebrew, a Semitic language with rich and highly ambiguous morphology, as a case study, and we investigate the extent to which homographs can be disambiguated by contextualized embeddings, with regard to all three levels of ambiguity. Regarding Arabic, a sister language to Hebrew, a wide survey of WSD methods is presented by Abderrahim and Abderrahim (2022). They raise the possibility of utilizing pretrained contextualized embeddings, yet leave its evaluation to future work. 1
Hebrew is a particularly challenging language on which to perform homograph disambiguation due to the limited available corpora. First, currently existing Hebrew treebanks are severely limited in size, such that most of the words in the language are not amply represented. Furthermore, even regarding common Hebrew words, these corpora are problematic, because the nature of language is such that many homographs are skewed in their distribution; thus, even if the primary analysis is sufficiently represented within a tagged corpus, the secondary analysis will often be hopelessly underrepresented. For instance, one common Hebrew homograph is מהם (mhm), which can be analyzed as a preposition with a pronominal suffix, or as an interrogative. The ratio of these two analyses in naturally-occurring Hebrew text is over 50:1; thus, occurrences of the secondary analysis within existing tagged corpora are insufficient.
In theory, these homograph ambiguities could be addressed using POS tagging systems. For instance, Habash and Rambow (2005) make use of a morphological tagging system to solve WSD in Arabic. A number of Hebrew POS tagging systems have been published as well (Yona and Wintner, 2005; Adler and Elhadad, 2006; Shacham and Wintner, 2007). The current SOTA for Hebrew POS tagging is the YAP morpho-syntactic parser (Tsarfaty et al., 2019). However, as we have shown in a previous study (Shmidman et al., 2020, p. 3318, table 2), although YAP produces high accuracy overall on normal Hebrew text, its scores drop drastically on homographs of skewed distribution.
For analogous cases of skewed distribution in other languages, researchers have proposed the creation of dedicated challenge sets, containing hard-to-classify sentences not easily found in naturally-occurring text (Gardner et al., 2020; Elkahky et al., 2018). In the aforementioned previous study, we produced 22 such challenge sets for Hebrew homographs, and demonstrated that a Bi-LSTM over non-contextualized embeddings can obtain high accuracy on this task, establishing the current SOTA for Hebrew homograph disambiguation (Shmidman et al., 2020). In this paper, we extend the investigation by considering whether contextualized embeddings from pretrained language models (PLMs) can provide a better solution. We consider all existing contextualized Hebrew PLMs: multilingual BERT ("mBERT") (Devlin et al., 2019); HeBERT (Chriqui and Yahav, 2021); and AlephBERT (Seker et al., 2021) (Table 2). Moreover, we evaluate and verify these on a new dataset, substantially larger than all previous datasets for Hebrew homograph disambiguation.
Our experiments demonstrate that contextualized PLMs pre-trained with sufficiently large unlabeled data and vocabulary are excellent at disambiguating the word-internal structures of homographs, yet face some challenges with pure sense disambiguation. We show the efficacy of these models in cases of homographs with skewed distribution, and in a few-shot setup. Finally, we establish new state-of-the-art results on the challenging task of homograph disambiguation for a morphologically-rich language printed without vowels, along with a novel benchmark for assessing the morphological reach of future PLMs in Hebrew.

The Data
The challenge sets for Hebrew homograph disambiguation from our previous study were limited in number (only 22 sets) and insufficiently representative of the types of ambiguities; only one of the sets involved a prefix-segmentation ambiguity. Further, they were limited to binary cases, where only two analyses exist.
In contrast, for this study we employed field experts to choose the most critical homographs in the language. The experts chose 75 homographs from a list of the 3600 most frequent words in the language, balancing frequency of word occurrence with practical need for its disambiguation. All of the homographs occur with a minimum frequency of 27 occurrences per million words in naturally occurring Hebrew text. Our challenge sets include homographs with 2-5 possible analyses. Our sets contain a wide representation of segmentation ambiguities (15 in number), as well as 5 cases of purely semantic ambiguities. For each of the 75 homographs, we collect hundreds of naturally-occurring sentences attesting to each analysis. In almost all cases, we succeed in collecting 1000 sentences for the primary analysis, at least 500 sentences for the secondary analysis, and at least 250 for each additional analysis. The sentences were culled from newspapers, Wikipedia, literature, and social media. We employed a team of annotators who chose the relevant homograph analysis for each case. 2 All in all, our 75 challenge sets contain 150K tagged sentences. The full list of homographs and analyses is provided in Appendix A. 3

Experimental Setup
To evaluate the ability of pre-trained language models (PLMs) to disambiguate the in-context analyses of morphologically rich and highly ambiguous homographs in Hebrew, we adopt a "word expert" approach, producing dedicated classifiers for each individual homograph (Zhao et al., 2020).
We use two types of PLMs, contextualized and non-contextualized. For the non-contextualized case, we replicate our previous method, detailed in Shmidman et al. (2020). For each training example, we use a BiLSTM on top of the word2vec embeddings of all of the words in the sentence (other than the homograph itself) to produce an encoding for disambiguation. 4 An MLP is trained to predict the correct homograph analysis based on this encoding. 5 For the contextualized case, we run the sentence through a pretrained contextualized language model and retrieve the 768-dimension embedding representing the homograph in question. An MLP is trained to predict the correct analysis based on the homograph's embedding alone. In the standard "unmasked" scenario, the sentence is fed into the model as is, including the homograph in question. In the "masked" scenario, the homograph is replaced with a [MASK] token.
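As a rough sketch of the per-homograph "word expert" classification head, the following trains a small one-hidden-layer MLP over fixed 768-dimension embeddings. The class name `WordExpertMLP`, the toy Gaussian "embeddings", and all hyperparameters are illustrative stand-ins, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

class WordExpertMLP:
    """One-hidden-layer MLP over a frozen contextual embedding of the target
    homograph; one such classifier is trained per homograph."""
    def __init__(self, dim=768, hidden=128, n_classes=2, lr=0.1):
        self.W1 = rng.normal(0, 0.02, (dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0, 0.02, (hidden, n_classes))
        self.b2 = np.zeros(n_classes)
        self.lr = lr

    def fit(self, X, y, epochs=200):
        Y = np.eye(self.W2.shape[1])[y]          # one-hot targets
        for _ in range(epochs):
            h = np.tanh(X @ self.W1 + self.b1)   # hidden layer
            p = softmax(h @ self.W2 + self.b2)   # class probabilities
            d2 = (p - Y) / len(X)                # cross-entropy gradient
            d1 = (d2 @ self.W2.T) * (1 - h**2)   # backprop through tanh
            self.W2 -= self.lr * h.T @ d2; self.b2 -= self.lr * d2.sum(0)
            self.W1 -= self.lr * X.T @ d1; self.b1 -= self.lr * d1.sum(0)

    def predict(self, X):
        h = np.tanh(X @ self.W1 + self.b1)
        return (h @ self.W2 + self.b2).argmax(axis=1)

# Toy stand-in for contextual embeddings of a two-analysis homograph
X = np.concatenate([rng.normal(+1, 1, (50, 768)), rng.normal(-1, 1, (50, 768))])
y = np.array([0] * 50 + [1] * 50)
clf = WordExpertMLP()
clf.fit(X, y)
print((clf.predict(X) == y).mean())  # training accuracy
```

Following the "word expert" paradigm, one such MLP would be trained per homograph, with `n_classes` set to that homograph's number of analyses.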
We evaluate the performance of each given method on each given challenge set using 10-fold cross-validation. We calculate an F1 score for each homograph analysis, based upon the precision and recall scores micro-averaged across all folds. We then calculate the macro-average of the F1 scores for all possible analyses for a given homograph, and this is the score reported in the charts herein.
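The scoring procedure can be sketched as follows; `homograph_score` is a hypothetical helper, and `fold_results` stands in for the per-fold gold and predicted labels:

```python
import numpy as np

def homograph_score(fold_results, n_classes):
    """Score one homograph: per-analysis F1 from TP/FP/FN counts summed
    (micro-averaged) across folds, then macro-averaged over the analyses."""
    tp = np.zeros(n_classes); fp = np.zeros(n_classes); fn = np.zeros(n_classes)
    for gold, pred in fold_results:            # one (gold, pred) pair per fold
        for g, p in zip(gold, pred):
            if g == p:
                tp[g] += 1
            else:
                fp[p] += 1
                fn[g] += 1
    precision = tp / np.maximum(tp + fp, 1)
    recall = tp / np.maximum(tp + fn, 1)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return f1.mean()                           # macro-average over analyses

# Two toy folds for a binary homograph (class ids 0 and 1)
folds = [([0, 0, 1, 1], [0, 1, 1, 1]),
         ([0, 1, 1, 0], [0, 1, 0, 0])]
print(homograph_score(folds, 2))  # 0.75
```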

Results and Analysis
Standard (Unmasked) Scenario Figure 1 presents the cumulative F1 score obtained by the models for all challenge sets. Our results show that HeBERT and AlephBERT far outperform mBERT, with AlephBERT achieving the highest score. The poor performance of mBERT is likely due to its smaller pre-training data size and exceedingly lean Hebrew vocabulary (cf. Table 2). Furthermore, the HeBERT and AlephBERT models both substantially outperform the previous word2vec-based SOTA. It is thus apparent that contextualized language models, even word-piece-based ones, and even for an MRL, do effectively capture Hebrew homograph distinctions, and they do so more effectively than non-contextualized models. Figure 2 demonstrates AlephBERT's performance on different ambiguity types. AlephBERT performs equally well on cases of segmentation ambiguity and morphosyntactic ambiguity. In contrast, when it comes to ambiguities that are purely semantic, the scores are noticeably lower. This is in line with the findings of Ettinger (2020), who shows that BERT is stronger with syntax than semantics; Goldberg (2019) also notes BERT's strong syntactic abilities. Interestingly, the same gap exists with the W2V-based method. Thus, both contextualized and non-contextualized embeddings struggle to differentiate between senses which are morphologically equivalent. Although such cases are only of minimal import when it comes to sentence parsing, they are critical for downstream tasks such as coreference resolution and relation extraction. It thus remains a desideratum to improve disambiguation of purely semantic Hebrew homographs.
Figure 4: Disambiguation accuracy across varying degrees of word-piece splits within the target homograph, using mBERT.
The results in Figure 3 demonstrate that AlephBERT performs equally well on cases of binary homographs as on cases of three-way homograph classification. However, when faced with cases of 4-way or 5-way classification, accuracy declines.
The Effect of Word-Pieces Previous studies have hypothesized that word-pieces are not adequate for capturing complex morphosyntactic structures, due to arbitrary (non-linguistic) word-splits. To probe this, we investigate whether such splits affect performance. Our 75 homographs are all treated as single tokens in HeBERT and AlephBERT. However, many of the homographs are broken up into word-pieces in mBERT, due to its meager Hebrew vocabulary. We thus compare mBERT's results on words treated as single tokens versus those broken up into two or three pieces. For the cases of split words, we train models using three separate methods: providing the MLP with only the embedding of the first word-piece; with the average of the word-piece embeddings; or with the sum of the embeddings. As shown in Figure 4, splitting a homograph into three word-pieces has a negative impact on the ability of the resulting embedding to differentiate between homograph analyses, for all aggregation methods.
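The three aggregation strategies for split homographs might be sketched as follows; the function name is illustrative, and the toy 4-dimension vectors stand in for the real 768-dimension word-piece embeddings:

```python
import numpy as np

def aggregate_pieces(piece_vectors, method="first"):
    """Collapse the contextual vectors of a homograph's word-pieces into one
    vector for the classifier: first piece only, element-wise sum, or average,
    mirroring the three methods compared for mBERT."""
    V = np.asarray(piece_vectors, dtype=float)
    if method == "first":
        return V[0]
    if method == "sum":
        return V.sum(axis=0)
    if method == "avg":
        return V.mean(axis=0)
    raise ValueError(f"unknown aggregation method: {method}")

# A homograph split into three word-pieces (toy 4-d vectors for illustration)
pieces = [[1.0, 0.0, 2.0, 0.0],
          [0.0, 1.0, 0.0, 2.0],
          [2.0, 2.0, 1.0, 1.0]]
print(aggregate_pieces(pieces, "avg"))  # element-wise mean of the three vectors
```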

Masked Scenario
We consider whether AlephBERT embeddings are more effective if we replace the homograph word with [MASK] when running the challenge-set sentences through AlephBERT. The motivation behind this experiment is that, as explained above, many of the homographs are skewed in their natural proportion. In such cases, we are concerned that AlephBERT might be disproportionately influenced by the skewed distribution; replacing the word with [MASK] would prevent the model from being influenced as such. As shown in Figure 5, AlephBERT achieves high scores with balanced homographs as well as with homographs of highly skewed distribution. Using a [MASK] token instead of the actual word does not generally improve performance, whether or not the homographs are of skewed proportion.
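The masked scenario amounts to replacing the target token before encoding. A minimal sketch, assuming whitespace tokenization for simplicity (in the actual setup, masking is done with the model tokenizer's [MASK] token):

```python
def mask_homograph(sentence, target_index, mask_token="[MASK]"):
    """Replace the target homograph with a mask token before the sentence is
    fed to the model. Hypothetical helper; whitespace tokenization is a
    simplification of the real word-piece tokenizer."""
    tokens = sentence.split()
    tokens[target_index] = mask_token
    return " ".join(tokens)

# English placeholder sentence; the real data is non-diacritized Hebrew
print(mask_homograph("the bank was steep", 1))  # -> "the [MASK] was steep"
```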

Few-Shot Scenarios
In our experiments thus far, the 10-fold cross-validation allows the MLP to leverage 90% of the data in each fold (hundreds of sentences for each analysis) in order to learn the difference between the analyses. We now consider whether the AlephBERT embeddings can suffice on a few-shot basis, where the training stage has access to only 100, 50, 25, 10 or even 5 examples of each analysis. In these cases, we train an MLP based only on these few samples, and we use the rest of the sentences for evaluation. Remarkably, as demonstrated in Figure 6, the AlephBERT embeddings provide a highly accurate solution even on this few-shot basis. Even when training with only 5 examples of each homograph analysis, AlephBERT reaches an accuracy that is not far below the accuracy achieved when performing full 10-fold CV across hundreds of sentences for each analysis.
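The few-shot splits can be sketched as follows; `few_shot_split` and the toy analysis labels are illustrative:

```python
import random

def few_shot_split(sentences_by_analysis, k, seed=0):
    """Sample k training sentences per analysis; all remaining sentences form
    the test set (mirroring the few-shot setup with k in {100, 50, 25, 10, 5})."""
    rng = random.Random(seed)
    train, test = [], []
    for analysis, sents in sentences_by_analysis.items():
        pool = list(sents)
        rng.shuffle(pool)
        train += [(s, analysis) for s in pool[:k]]
        test += [(s, analysis) for s in pool[k:]]
    return train, test

# Toy sentence ids for a binary homograph with a skewed distribution
data = {"prep+pron": [f"s{i}" for i in range(100)],
        "interrogative": [f"t{i}" for i in range(60)]}
train, test = few_shot_split(data, k=5)
print(len(train), len(test))  # 10 150
```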
Probing Scenarios Finally, we probe the pretrained AlephBERT embeddings (Yaghoobzadeh et al., 2019; Tenney et al., 2019; Klafka and Ettinger, 2020; Belinkov, 2021) to see whether, in and of themselves, they reflect clusters which correspond to the different homograph analyses. We skip the MLP, and instead use the raw embeddings directly, classifying sentences based on their proximity to the centroid of the training samples for each homograph analysis. As shown in the orange bars in Figure 6, this method generally does not perform as well as the MLP-based method; however, the degradation is limited to only a few percentage points, indicating that the raw embeddings are generally clustered in groups which indeed reflect the distinctions between the analyses.
Figure 6: Use of AlephBERT embeddings to differentiate between homographs on a few-shot basis, contrasted with scores from the full 10-fold CV ("All").

Conclusion
In this study we have utilized a wide-ranging collection of Hebrew homograph challenge sets in order to evaluate the extent to which raw contextualized embeddings can be leveraged to disambiguate Hebrew homographs. We found that contextualized embeddings can effectively disambiguate the analyses of homographs, much more so than non-contextualized ones, across multiple types of ambiguity: segmentation, morphosyntactic, and sense. Yet, efficacy on pure sense ambiguity is lower than on the other two types. Additionally, an increasing number of word-piece splits, or an increasing number of possible analyses of a token, each lowers efficacy. Finally, we found that contextualized embeddings can function effectively for this purpose on a few-shot basis, with as few as 5 examples of each analysis. This indicates that with relatively modest effort, highly ambiguous homographs may be effectively treated.

Limitations
One of the major strengths of this paper is its new and comprehensive dataset for training and benchmarking Hebrew homograph disambiguation. The dataset surpasses all previous Hebrew homograph datasets in size, quality, and balance; we have made every effort to be as inclusive as possible in its creation, making sure to include data from a widely diverse set of genres. Nevertheless, a perennial challenge in corpus-based studies is that the lion's share of the available data tends to be authored by male writers. In order to offset this bias, we bolstered our corpus with a large collection of texts specifically taken from blog sites devoted entirely to female bloggers. Even so, we cannot escape the fact that female writing and feminine conjugations are underrepresented in our dataset.
A further limitation derives from our filter for sentences with offensive language. We perceived early on that our human annotators were not comfortable tagging sentences with offensive language, and we therefore took steps to remove such sentences from our corpus. Nevertheless, this means that our resulting dataset is limited in that it does not properly reflect the use of offensive language in naturally-occurring Hebrew sentences. Furthermore, our resulting tests and reported scores may not accurately reflect the performance of our models when applied to sentences with offensive language.

Ethics Statement
Creation of the Dataset As noted, our dataset contains over 150K sentences in all. Every sentence was reviewed and tagged by our team of human annotators, who chose the relevant homograph analysis for each instance of each of our 75 homographs. Our annotator team included members of diverse genders and sexual orientations. They were paid hourly wages with legal pay stubs, well above the minimum wage required by law. The entirety of the dataset will be made available for research purposes upon the acceptance of this article, together with the tagging information.
Risks of the Research Ultimately, this data will enable end-users to automatically diacritize and parse large corpora of Hebrew text. For the most part, this will provide a beneficial contribution to the world: for the visually impaired, this technology will enable the development of more precise text-to-speech products; teachers will be able to provide children and second-language learners with accessible diacritized texts; and humanities and linguistics researchers can bolster their research with big-data analysis. However, there also is a risk of nefarious use, for instance, if an end user were to leverage these capabilities in order to produce anonymous texts or recordings containing threats to life, liberty, or happiness.

A Appendix A: Table of Homographs
We present three tables of homographs, corresponding to the three categories of homographs discussed within the paper (segmentation ambiguity, morphosyntactic ambiguity, and semantic ambiguity). 6 In each table, the first column ("form") indicates the homograph as it is found in naturally-occurring non-diacritized Hebrew text. 7 The second column ("word") indicates the possible diacritizations of each form. In cases where the diacritized word includes an attached prefix, a plus sign indicates the segmentation point between the prefix letters and the primary word. For all homographs considered in this paper, different segmentation options are diacritized differently. Thus, for each sentence in the dataset, our human annotators were asked simply to choose the correct diacritization for the target homograph (that is, to choose among the options listed in the "word" column). There was no need for the annotators to tag the segmentation separately, because in all cases the choice of diacritization itself indicates the segmentation. 8 The third column indicates the morphology of each of the possible diacritizations. 9 The fourth column lists the translation. 10 Within each table, the homographs are listed alphabetically. 11
It should be noted that the higher levels of ambiguity are supersets of the lower levels: segmentation ambiguities generally entail differences on the morphosyntactic and semantic levels as well, and morphosyntactic ambiguities generally also entail semantic ambiguities. Furthermore, because many of the homographs admit more than two analyses, it is often the case that a subset of the analyses forms a lower level of ambiguity (e.g., just a semantic ambiguity), while other analyses form a higher level of ambiguity (e.g., a segmentation ambiguity).
For the purposes of this paper, we categorize each homograph according to the highest level of ambiguity involved. First, if a segmentation ambiguity is indicated anywhere across the possible analyses, then we include the homograph in the "segmentation ambiguity" category. Next, if there is no segmentation ambiguity, but a morphosyntactic ambiguity is indicated anywhere across the possible analyses, then we include the homograph in the "morphosyntactic ambiguity" category. Finally, if the analyses all differ only on the semantic level, then we include the homograph in the "semantic ambiguity" category.
The ranking of analyses is based on a frequency analysis of our in-house annotated corpus. It is worth emphasizing that the paper as a whole relates to each of the 75 homographs specifically as they are spelled in this list, and does not relate to cases where further prefixes are attached to the homographs. As a result, the frequency analysis may sometimes seem counterintuitive. For instance, regarding the form ראשי, a native Hebrew speaker might intuit that the adjectival form is primary (רָאשִׁי). However, in practice, that sense is common only when prefixed with a definite marker (הראשי). In contrast, the homograph considered here involves the form ראשי as is, without any prefixes; in this case, the other analyses are far more common.
It should be noted that in certain sentences an exceedingly rare diacritization was warranted, which was not among the options listed in the "word" column. The annotators were instructed to tag such cases as "none of the above", and all such sentences were removed from the corpus. Similarly, some sentences do not provide enough context to determine the correct diacritization; the annotators were asked to tag such sentences as "unclear", and these sentences too were removed from the corpus.
In most cases, a diacritized form has one specific morphological analysis.
However, in other cases, the diacritized form can admit multiple morphologies. In such cases, we list all of the practically relevant morphological analyses in the third column, separated by a slash (as, for instance, in the case of זר). Rare analyses which hardly ever occur in practice are not listed.
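The categorization rule described above can be sketched as a simple decision cascade; the function and the level labels are illustrative:

```python
def categorize(ambiguity_levels):
    """Assign a homograph to its highest level of ambiguity, given the set of
    ambiguity types observed across all of its possible analyses."""
    if "segmentation" in ambiguity_levels:
        return "segmentation ambiguity"
    if "morphosyntactic" in ambiguity_levels:
        return "morphosyntactic ambiguity"
    return "semantic ambiguity"

# A homograph whose analyses differ morphosyntactically and semantically,
# but whose segmentation is unambiguous
print(categorize({"semantic", "morphosyntactic"}))  # morphosyntactic ambiguity
```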
Naturally, a given Hebrew term often captures a substantial range of potential English translations, and it would not be practical to list them all in this column; therefore, we generally present only a single representative translation.
Ideally, we might have grouped the homographs by part of speech instead. However, as can be seen from the following tables, the 75 homographs vary so widely in the parts of speech they can represent that an alphabetical listing was deemed most useful.
For the out-of-vocabulary words, we use a trainable UNK parameter in place of the word2vec embedding, which is trained from scratch for each "word expert" classifier. As per the "word expert" paradigm, a completely separate MLP is trained for each homograph. In each case, the possible homograph analyses are treated as the classes for prediction, and the MLP is trained to choose from among those classes. Thus, for instance, if the homograph has two analyses, we train an MLP to predict one of the two classes; if the homograph has three analyses, then we train an MLP to predict one of the three classes; and so on.
For the Probing Scenarios based on centroid classification, we proceed as follows. For each of the homographs, given the training sample size (100, 50, 25, etc.), we randomly select that number of training sentences for each of the possible analyses of the homograph. We calculate the centroid for each analysis by averaging the embeddings of the target homographs within the corresponding training sentences. The remainder of the available sentences for the homograph forms the test set. We classify them by calculating the dot product of the embedding of the target homograph in each given test sentence with the centroid of each of the homograph analyses. We run this process 200 times, each time selecting a different random set of training sentences. The values plotted in Figure 6 reflect the average of the F1 scores across these 200 rounds. For the corresponding MLP-based experiments presented for comparison in the aforementioned figure, we follow an analogous procedure, across 10 rounds.
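The centroid-based classification described here can be sketched as follows; the toy 8-dimension embeddings stand in for the 768-dimension AlephBERT vectors:

```python
import numpy as np

def centroid_classify(train_X, train_y, test_X, n_classes):
    """Probe raw embeddings: each class centroid is the mean of its training
    embeddings; a test embedding is assigned to the centroid with the highest
    dot product, with no trained classifier in between."""
    centroids = np.stack([train_X[train_y == c].mean(axis=0)
                          for c in range(n_classes)])
    scores = test_X @ centroids.T          # (n_test, n_classes) dot products
    return scores.argmax(axis=1)

rng = np.random.default_rng(0)
# Toy embeddings: two analyses clustered around opposite directions
train_X = np.concatenate([rng.normal(+1, 0.1, (20, 8)),
                          rng.normal(-1, 0.1, (20, 8))])
train_y = np.array([0] * 20 + [1] * 20)
test_X = np.array([[1.0] * 8, [-1.0] * 8])
print(centroid_classify(train_X, train_y, test_X, 2))  # [0 1]
```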