Minimally-Supervised Morphological Segmentation using Adaptor Grammars with Linguistic Priors

With the increasing interest in low-resource languages, unsupervised morphological segmentation has become an active area of research, where approaches based on Adaptor Grammars achieve state-of-the-art results. We demonstrate the power of harnessing linguistic knowledge as priors within Adaptor Grammars in a minimally-supervised learning fashion. We introduce two types of priors: 1) grammar definition, where we design language-specific grammars; and 2) linguist-provided affixes, collected by an expert in the language and seeded into the grammars. We use Japanese and Georgian as respective case studies for the two types of priors and introduce new datasets for these languages, with gold morphological segmentation for evaluation. We show that the use of priors results in error reductions of 8.9 % and 34.2 %, respectively, over the equivalent state-of-the-art unsupervised system.


Introduction
Morphological segmentation is an essential subtask in many natural language processing (NLP) applications, especially in the case of morphologically complex languages. With the need to develop NLP tools for low-resource languages, unsupervised morphological segmentation has been receiving increasing interest over the last two decades (Goldsmith, 2001; Creutz and Lagus, 2007a; Poon et al., 2009; Sirts and Goldwater, 2013; Botha and Blunsom, 2013; Narasimhan et al., 2014; Eskander et al., 2016, 2018, 2019). In this work, we show how linguistic priors effectively boost morphological-segmentation performance in a minimally-supervised manner that does not require segmented words for training. We integrate our priors within Adaptor Grammars (Johnson et al., 2007), a type of nonparametric Bayesian models that generalize Probabilistic Context-Free Grammars (PCFGs). Adaptor Grammars have proved successful for unsupervised morphological segmentation, achieving state-of-the-art results across a variety of typologically diverse languages (Eskander et al., 2020).
We introduce two types of linguistic priors: 1) grammar definition, where we design a language-specific grammar that is tailored for the language of interest by modeling specific morphological phenomena, and 2) linguist-provided affixes, where an expert in the underlying language compiles a list of carefully selected affixes and seeds it into the grammars prior to training the segmentation model. We use Japanese and Georgian as case studies for priors 1 and 2, respectively. As our goal is to develop a robust approach that benefits low-resource and/or endangered languages of high morphological complexity, we use Japanese and Georgian in a low-resource setting where we do not have access to morphologically segmented data for training but have access to linguistic information such as word structure and affixes.
We show that using linguistic priors in a minimally-supervised setting leads to a significant improvement in performance over the equivalent state-of-the-art unsupervised system. We also present two morphologically segmented datasets for Japanese and Georgian that we use as our gold standard and that can be utilized in other morphology tasks.

Linguistic Priors
We utilize MorphAGram (Eskander et al., 2020), an open-source morphological-segmentation framework that is based on Adaptor Grammars (AGs) (Johnson et al., 2007). AGs have proved successful for unsupervised and minimally-supervised morphological segmentation, outperforming competing discriminative models (Sirts and Goldwater, 2013; Eskander et al., 2019, 2020). Adaptor Grammars are non-parametric Bayesian models that are composed of two main components: 1) a Probabilistic Context-Free Grammar (PCFG) whose definition depends on the underlying task (in the case of morphological segmentation, the PCFG models word structure); and 2) an adaptor based on the Pitman-Yor process (Pitman, 1995). The adaptor keeps the posterior probability of a subtree proportional to the number of times that subtree is used to parse the input data and manages the caching of subtrees. Learning is performed via Markov Chain Monte Carlo (MCMC) sampling (Andrieu et al., 2003), which infers the PCFG probabilities and the hyperparameters of the model. Eskander et al. (2016) define a set of language-independent grammars and three learning settings for Adaptor Grammars: 1) Standard, fully unsupervised; 2) Scholar-Seeded, minimally-supervised by manually seeding affixes into the grammar prior to training the segmentation model; and 3) Cascaded, fully unsupervised by approximating the Scholar-Seeded setting using automatically generated affixes from an initial round of learning. We next present two ways of including linguistic priors in Adaptor Grammars: 1) defining a language-specific grammar; and 2) using linguist-provided affixes in the Scholar-Seeded learning setup.
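To make the adaptor's caching behavior concrete, the following is a minimal sketch of drawing from a Pitman-Yor adapted distribution. The function name and the flat single-level cache are ours for illustration; MorphAGram's actual sampler operates over full parse subtrees, and this sketch treats each cached type as a single table.

```python
import random
from collections import Counter

def pyp_sample(cache, base_sampler, a=0.0, b=1.0, rng=random):
    """Draw one item from a Pitman-Yor adapted distribution.

    cache        : Counter of previously generated items (cached analyses).
    base_sampler : zero-argument function drawing a fresh item from the
                   base distribution (the PCFG, in Adaptor Grammars).
    a, b         : Pitman-Yor discount and concentration hyperparameters.

    Simplification: each cached type counts as one table, so an item is
    reused with mass (count - a) and a new base draw happens with mass
    (b + a * #types); the masses sum to n + b.
    """
    n = sum(cache.values())        # total cached tokens
    r = rng.uniform(0, n + b)
    threshold = 0.0
    for item, count in cache.items():
        threshold += count - a     # rich-get-richer: frequent items win
        if r < threshold:
            cache[item] += 1
            return item
    item = base_sampler()          # fall through: draw from the base PCFG
    cache[item] += 1
    return item
```

With `b = 0` and a non-empty cache, the sampler always reuses a cached analysis, which is the "rich-get-richer" dynamic that makes frequently used subtrees (e.g., common affixes) increasingly probable.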

Linguistic Priors as Grammar Definition
Eskander et al. (2016) define language-independent grammars that model the word as a sequence of generic morphemes or as a sequence of prefixes, a stem and suffixes. We consider their PrStSu+SM grammar in the current study, as it is the grammar that performed best on average across different languages. This language-independent definition of the grammar is depicted on the left side of Figure 1: the word is modeled as a prefix Pr, a stem St and a suffix Su; both the prefix and the suffix are recursively defined in order to model compounding in affixes; and a morpheme is composed of smaller units, submorphemes SM, representing sequences of characters.
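As a rough illustration of the analysis space such a grammar licenses, the expansions of a word into compounded prefixes, a stem and compounded suffixes can be enumerated by brute force. This is our own sketch of the search space, not the actual Bayesian inference, and it omits the submorpheme level:

```python
def affix_splits(s, max_affixes=2):
    """All ways to split string s into 0..max_affixes non-empty pieces
    covering s entirely; models the recursive Prefix/Suffix compounding."""
    if s == "":
        return [()]
    results = []
    def rec(rest, parts):
        if rest == "":
            results.append(tuple(parts))
            return
        if len(parts) == max_affixes:   # affix budget exhausted
            return
        for i in range(1, len(rest) + 1):
            rec(rest[i:], parts + [rest[:i]])
    rec(s, [])
    return results

def prstsu_analyses(word, max_affixes=2):
    """Enumerate (prefixes, stem, suffixes) analyses in the spirit of the
    PrStSu structure: optional compounded prefixes, one non-empty stem,
    optional compounded suffixes."""
    analyses = []
    n = len(word)
    for i in range(n):                   # stem start
        for j in range(i + 1, n + 1):    # stem end (stem is non-empty)
            for pre in affix_splits(word[:i], max_affixes):
                for suf in affix_splits(word[j:], max_affixes):
                    analyses.append((pre, word[i:j], suf))
    return analyses
```

The adaptor then concentrates probability mass on a few of these analyses by caching the subtrees that recur across the vocabulary.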
While this grammar is intended to be generic and to describe word structure in any language, we hypothesize that a definition that imposes language-specific constraints would be more effective. Therefore, we define a grammar for Japanese, using characteristics that are specific to Japanese word structure as language priors. Our tailored grammar definition for Japanese is shown on the right side of Figure 1, where we impose the following specifications:
A word has at most one prefix morpheme, of one or two characters.
A stem is recursively defined as a sequence of morphemes in order to allow for stem compounding.
Characters are separated into two groups: Kana (Japanese syllabaries) and Kanji (adapted Chinese characters).
A submorpheme represents a sequence of characters that is entirely in Kana or entirely in Kanji.
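The Kana/Kanji constraint can be approximated with standard Unicode block ranges; the helper below is our simplification for illustration (the grammar itself encodes the split directly over characters):

```python
def script(ch):
    """Classify a Japanese character as 'kana', 'kanji' or 'other',
    using standard Unicode block ranges."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F or 0x30A0 <= cp <= 0x30FF:
        return "kana"    # Hiragana or Katakana
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"   # CJK Unified Ideographs
    return "other"

def valid_submorpheme(s):
    """A submorpheme is a non-empty character sequence drawn from a
    single script, mirroring the Kana/Kanji split in the grammar."""
    return len(s) > 0 and len({script(c) for c in s}) == 1
```

Under this constraint, a mixed sequence such as Kanji followed by Hiragana must be split into at least two submorphemes, which is exactly where morpheme boundaries tend to fall in Japanese orthography.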

Linguistic Priors as Linguist-Provided Affixes
Similar to the Scholar-Seeded setting, we compile a list of affixes and seed it into the grammar trees before learning the segmentation model. However, unlike Eskander et al. (2016), where the affixes are collected from online resources by someone who may have never studied the language of interest, in this study we use affixes that are carefully compiled by an expert linguist who specializes in Georgian, resulting in more accurate linguistic priors. With that goal in mind, a total of 119 affixes are collected from the leading reference grammar book (Aronson, 1990).
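Schematically, seeding amounts to appending one rule per affix to the grammar before training. The `X --> c1 c2` rule syntax below is an illustrative stand-in (the exact format expected by the Adaptor-Grammar sampler differs), and the affixes shown are examples of ours rather than entries quoted from the 119-item list:

```python
def seed_affixes(grammar_lines, affixes, nonterminal="SuffixMorph"):
    """Return a copy of the grammar (a list of rule strings) with one
    extra rule per seeded affix. Each affix is spelled out character by
    character, since the grammars define morphemes over characters."""
    seeded = list(grammar_lines)
    for affix in affixes:
        seeded.append(f"{nonterminal} --> {' '.join(affix)}")
    return seeded

# Example: seed two common Georgian suffixes into a toy grammar.
grammar = ["Word --> Prefix Stem Suffix"]
seeded = seed_affixes(grammar, ["eb", "i"])
```

Because the seeded rules name concrete character sequences, the sampler starts with these affixes already cached as plausible morphemes instead of having to discover them from scratch.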

Evaluation Data
We annotate two datasets with morphological segmentation that we use as the gold standard to evaluate our segmentation models for Japanese and Georgian. Both datasets are composed of 1,000 words that are randomly sampled from the most frequent 50,000 words in Wikipedia and segmented into their basic morphemes 3, similar to the data of the Morpho Challenge shared task 4. Table 1 lists segmentation examples for both languages.
The Japanese gold segmentation was created by a native-speaker linguist. For Georgian, which has highly complex morphology, we started with the gold-standard dataset of 1,000 words introduced by Eskander et al. (2020), which was built by an untrained native speaker.

3 The Georgian dataset contains five non-words and three phonetic spellings of English character names.
4 http://morpho.aalto.fi/events/morphochallenge/

Experimental Setup
We evaluate our morphological-segmentation models for Japanese in the Standard (STD) and Cascaded (CAS) settings, both with generic and language-specific (LS) grammar definitions. For Georgian, we evaluate our morphological-segmentation models in the Standard (STD), Cascaded (CAS) and Scholar-Seeded (SS) settings, in addition to the proposed Scholar-Seeded setting with linguist-provided affixes (SS-Ling).
We perform the evaluation in a transductive manner, where the unsegmented words in the gold standard are part of the training sets; this is common in evaluating unsupervised and minimally-supervised morphological segmentation (Poon et al., 2009; Sirts and Goldwater, 2013; Narasimhan et al., 2014; Eskander et al., 2016, 2019, 2020). For the metrics, we use Boundary Precision and Recall (BPR) and EMMA-2 (Virpioja et al., 2011). BPR is the classical metric for evaluating morphological segmentation; it compares the boundaries in the proposed segmentation to those in the reference. EMMA-2 is based on matching the morphemes in the proposed segmentation to those in the reference in a many-to-one assignment setup.

We evaluate our system against two state-of-the-art unsupervised baselines: MorphAGram without the use of linguistic priors and Morfessor (Virpioja et al., 2013). Morfessor is a commonly-used framework for unsupervised morphological segmentation. It is based on an HMM model that relies on the Minimum Description Length (MDL) concept for deriving the optimal segmentation (Creutz and Lagus, 2007b). Since our approach does not assume access to manually annotated segmentation, it is not directly comparable to semi-supervised approaches that rely on such annotations (Ruokolainen et al., 2014; Kann et al., 2018). Finally, we report all the Adaptor-Grammar results as the average over three runs with different randomization parameters. Table 2 reports the overall performance of our models for both Japanese and Georgian, while Table 3 shows the results per part-of-speech category for Georgian.

Table 3: Category-wise morphological-segmentation performance for Georgian using the BPR and EMMA-2 metrics. AG = Adaptor Grammars. SS = Scholar-Seeded. SS-Ling = Scholar-Seeded with linguist-provided affixes.
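For concreteness, boundary-level BPR scoring can be sketched as follows; this is a simplified version of ours, and the official evaluation scripts differ in averaging and bookkeeping details:

```python
def boundaries(segmentation):
    """Cut positions of a segmentation given as a list of morphemes,
    e.g. ['rv', 'is'] -> {2} for the word 'rvis'."""
    cuts, pos = set(), 0
    for morph in segmentation[:-1]:
        pos += len(morph)
        cuts.add(pos)
    return cuts

def bpr(predicted, gold):
    """Boundary Precision, Recall and F1 over aligned word pairs, where
    `predicted` and `gold` are parallel lists of segmentations."""
    tp = fp = fn = 0
    for pred, ref in zip(predicted, gold):
        p, g = boundaries(pred), boundaries(ref)
        tp += len(p & g)   # boundaries in both
        fp += len(p - g)   # spurious boundaries
        fn += len(g - p)   # missed boundaries
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Note that oversegmentation hurts precision while undersegmentation hurts recall, which is why the two baselines' tendencies show up differently in the two columns.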

System Performance
For Japanese, the use of a language-specific grammar definition improves both precision and recall, resulting in BPR F1-score error reductions of 8.9 % and 7.1 % over the generic Standard and Cascaded settings, respectively, and a BPR F1-score error reduction of 9.8 % over Morfessor. For Georgian, the use of linguist-provided seeded affixes improves both precision and recall, where the recall significantly increases by an absolute 13.3 % over using an affix list of lower quality. In addition, the proposed linguistic priors result in BPR F1-score error reductions of 34.2 %, 30.0 % and 31.1 % over the Standard, Cascaded and regular Scholar-Seeded settings, respectively, and a BPR F1-score error reduction of 53.3 % over Morfessor. Analyzing the results per category, verbs and nouns receive the biggest F1-score improvements, of an absolute 14.3 % and 4.9 %, respectively, with the use of linguist-provided affixes.
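The error reductions above are relative reductions in F1 error, where the error is 1 − F1. With hypothetical F1 values chosen only to illustrate the arithmetic:

```python
def error_reduction(f1_baseline, f1_system):
    """Relative reduction in segmentation error, with error = 1 - F1."""
    e_base = 1.0 - f1_baseline
    e_sys = 1.0 - f1_system
    return (e_base - e_sys) / e_base

# Hypothetical scores: a baseline F1 of 0.800 improved to 0.868 cuts
# the error from 0.200 to 0.132, i.e. a 34 % error reduction.
reduction = error_reduction(0.800, 0.868)
```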
A similar pattern of results is found with EMMA-2. Finally, all the improvements due to the use of linguistic priors are statistically significant (P < 0.01) on both metrics.


Error Analysis
Table 4 lists examples of words that are segmented correctly and incorrectly by our Japanese and Georgian segmentation models. We discuss the most prominent observations below.

Table 4: Examples of output segmentations for Japanese and Georgian. STD = Standard. STD-LS = Standard with a language-specific grammar. SS = Scholar-Seeded. SS-Ling = Scholar-Seeded with linguist-provided affixes. Incorrect morphemes are marked in red.
Japanese: Both the STD and STD-LS models perform well on prefix segmentation, achieving F1-scores of more than 90 % in the detection of several one-character prefixes, such as お and ご. However, STD-LS outperforms its language-independent counterpart in the detection of stems, where compounding is explicitly modeled. For instance, STD and STD-LS achieve F1-scores of 15.8 % and 98.6 %, respectively, in the detection of the common stem られ (be). On the other hand, when either model consistently fails to detect a specific morpheme, the other model fails as well. For example, neither model can detect the morphemes せん and かった.
Georgian: SS-Ling outperforms both STD and SS at discovering the top most frequent one-letter morphemes, such as i, a, s, e, m, o and v, achieving an average F1-score of 76.0 %, compared to 57.7 % and 57.3 % by STD and SS, respectively. In addition, SS and STD suffer lower precision as they tend to oversegment the morphemes represented by a single letter. Similarly, SS-Ling can recognize the most frequent two-letter morphemes, namely eb and da, with absolute increases in precision of 59.0 % and 62.0 % over STD and SS, respectively; both morphemes are explicitly seeded into the SS-Ling grammar prior to training the model.

Conclusion and Future Work
We proposed two types of linguistic priors for minimally-supervised morphological segmentation using Adaptor Grammars. The first prior is in the form of defining a language-specific grammar, while the second relies on compiling a list of linguist-provided affixes and seeding it into the grammars. Our approaches result in error reductions of 8.9 % for Japanese and 34.2 % for Georgian, as compared to the state-of-the-art system. In future work, we plan to explore the use of linguistic priors that apply to a group of morphologically similar low-resource languages.