K-UniMorph: Korean Universal Morphology and its Feature Schema

We present a new Universal Morphology dataset for Korean. Korean has previously been underrepresented in the field of morphological paradigms amongst hundreds of diverse world languages. Hence, we propose Universal Morphological paradigms for the Korean language that preserve its distinct characteristics. For our K-UniMorph dataset, we outline each grammatical criterion in detail for the verbal endings, clarify how to extract inflected forms, and demonstrate how we generate the morphological schemata. This dataset adopts the morphological feature schema from Sylak-Glassman et al. (2015) and Sylak-Glassman (2016) for the Korean language, and we extract inflected verb forms from the Sejong morphologically analyzed corpus, which is one of the largest annotated corpora for Korean. During data creation, our methodology also includes investigating the correctness of the conversion from the Sejong corpus. Furthermore, we carry out the inflection task using three different Korean word forms: letters, syllables and morphemes. Finally, we discuss and describe future perspectives on Korean morphological paradigms and the dataset.


Introduction
The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage morphological paradigms for diverse world languages (McCarthy et al., 2020; Kirov et al., 2018). Each UniMorph entry consists of a lemma and a bundle of morphological features related to a particular inflected word form, for example: 나서다naseoda 나섰다naseossda V;DECL;PST, where 나서다naseoda is the lemma form and 나섰다naseossda ('became') is the inflected form with V;DECL;PST (verb, declarative, and past tense) as its morphological schema.
It started in 2016 as a SIGMORPHON shared task (Cotterell et al., 2016) for the problem of morphological reinflection, and it introduced morphological datasets for 10 languages. The inflection task, using the given lemma with its part-of-speech to generate a target inflected form, has been continued through the years: CoNLL-SIGMORPHON 2017 Shared Task (Cotterell et al., 2017), CoNLL-SIGMORPHON 2018 Shared Task (Cotterell et al., 2018), SIGMORPHON 2019 Shared Task (McCarthy et al., 2019), SIGMORPHON 2020 Shared Task (Gorman et al., 2020) and SIGMORPHON 2021 Shared Task (Pimentel et al., 2021). However, the Korean language has not been a part of the shared task because of the lack of a dataset.
Nonetheless, although rarely, morphological paradigms for Korean have been explored in the context of computational linguistics. Yongkyoon (1993) defined the inflectional classes for verbs in Korean using word-and-paradigm (WP) (Hockett, 1954) approaches. His fifteen classes of the verb, which can be joined with seven different types of verbal endings, are based on inflected forms of the verb. Seokjoon (1999) systematized the list of final endings and their properties, which are also used as conjunctive endings in Korean. Otherwise, properties of verbs such as mood, tense, voice, evidentiality, and interrogativity have been extensively studied in Korean linguistics independently: for example, inter alia, tense (Byung-sun, 2003), grammatical voice (Chulwoo, 2007), interaction of tense-aspect-mood marking with modality (Jae Mog, 1998), evidentiality (Donghoon, 2008), and interrogativity (Donghoon, 2011).
In continuation of these efforts, this paper proposes a new Universal Morphology dataset for Korean. We adopt the morphological feature schema from Sylak-Glassman et al. (2015) and Sylak-Glassman (2016) for the Korean language and extract inflected verb forms from the Sejong morphologically analyzed corpus of over 0.6M sentences with 9.5M words. We set the criteria in detail by explaining how to extract inflected verbal forms (Section 2), and carry out the inflection task using different Korean word forms such as letter, syllable and morpheme (Section 3). Finally, we discuss future perspectives on a Korean UniMorph dataset (Section 4).
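To make the entry format from the example above concrete, the following is a minimal sketch of reading one UniMorph-style entry. It assumes the usual tab-separated layout (lemma, inflected form, semicolon-separated feature bundle); the names `UniMorphEntry` and `parse_entry` are hypothetical and are not part of the released dataset tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UniMorphEntry:
    """One entry: a lemma, an inflected form, and its feature bundle."""
    lemma: str
    form: str
    features: tuple


def parse_entry(line: str) -> UniMorphEntry:
    """Parse one 'lemma<TAB>form<TAB>features' line (assumed layout)."""
    lemma, form, bundle = line.rstrip("\n").split("\t")
    # Features are separated by semicolons, e.g. V;DECL;PST.
    return UniMorphEntry(lemma, form, tuple(bundle.split(";")))


entry = parse_entry("나서다\t나섰다\tV;DECL;PST")
print(entry.lemma, entry.form, entry.features)
```

Keeping the features as a tuple preserves the order in which they are written in the file, which is convenient when the bundle is later used as a conditioning input for reinflection.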

UniMorph Features Schema
Verbal endings in the inflected forms of the predicate have been considered to remain part of the word, as proposed in several grammar formalisms for Korean such as lexicalized tree adjoining grammars (Park, 2006), head driven phrase structure grammars (Ko, 2010), and combinatory categorial grammars (Kang, 2011), in contrast to government and binding (GB) theory (Chomsky, 1981, 1982) for Korean, in which the entire sentence depends on separated verbal endings. This idea goes back to Maurice Gross's lexicon grammars (Gross, 1975) and to his students, who worked on a descriptive analysis of Korean in which the number of predicates in Korean could be fixed by generating possible inflection forms: e.g. Pak (1987); Nho (1992); Nam (1994); Shin (1994); Park (1996); Chung (1998); Han (2000). However, we have separated postpositions from substantives such as noun phrases instead of keeping them together. Therefore, with the current Korean dataset, we decide to annotate morphological data for verbs (V).
Table 1 shows the morphological schema for Korean UniMorph, where we adopt features from Sylak-Glassman et al. (2015) and Sylak-Glassman (2016) for the Korean language. In addition to the feature schema, we consider the following four types of verbal endings, which convey grammatical meanings for the predicate: sentence final ending (ef), non-final ending (ep), conjunctive ending (ec), and modifier ending (etm).
Evidentiality. It is a grammatical category that reflects the source of information that a speaker conveys in a proposition. It is often expressed through morphological markers: the sentence final endings (ef) 대dae, 내nae, and 래lae bring in hearsay (HRSY), and the non-final ending (ep) 겠gess introduces inferred (INFER). Since the suffix for the quotative (QUOT) is denoted with a postposition (jkq) in Korean instead of a verbal ending, it is excluded from the current set of schemata.
Interrogativity. It indicates whether the verb expresses a statement (DECL) or a question (INT). We consider all sentence final endings (ef) that end with 다da as declarative (DECL), and sentence final endings (ef) that include 가ga or 까kka as interrogative (INT).
Mood. The grammatical mood of a verb indicates modality through morphological marking on the verb. Realis (REAL) and irrealis (IRR) are represented by the verbal modifier endings (etm, also known as adnominal endings) ㄴn and ㄹl, respectively. Adnominal endings are used in (i) collocations such as 인한inhan, 치면chimyeon, and 대한daehan, (ii) modifiers, and (iii) relative clauses. Realis and irrealis are annotated regardless of whether the ending marks a modifier or a relative clause. General purposive (PURP) is determined by 려고lyeogo and 하러haleo, and obligative (OBLIG) is introduced by 야ya. Note that we do not consider indicative (IND) because we specify declarative (DECL).
Tense. It refers to the time frame in which a verb's action or state of being occurs. Non-final endings (ep) such as 았ass and 었eoss and final endings (ef) such as ㄴ다nda and 는다neunda can represent the past (PST) and the present (PRS) tenses, respectively. Since the future tense (FUT) has been considered as irrealis (IRR) in Korean, we do not annotate it here.
Voice. We deduce the passive (PASS) from the verb stem, such as jabhi ('be caught'), instead of from the verbal ending. Whereas the verb jab ('catch') and the passive suffix hi might be segmented, the current criteria of the Sejong corpus combine them into a single morpheme. 이i, 히hi, 리li, and 기gi are verbal endings known for both the passive and the causative. If the verb has the verbal ending 게ge, as in verb stem+{이i|히hi|리li|기gi}+게ge {하ha|만들mandeul ('make')}, then it is causative (CAUS); otherwise it is passive (PASS).
Other schema. For politeness, we introduce only polite (POL), using the non-final ending (ep) 시si as the direct encoding of the speaker-addressee relationship (Brown and Levinson, 1987, p.276). Lastly, since we are not able to deduce the valency of the verb from morphemes, we do not include INTR (intransitive), TR (transitive) and DITR (ditransitive). However, we leave them for future work because valency might still be a valid morphological feature schema for Korean.
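The ending-based criteria described in this section can be read as a small rule table. The sketch below illustrates that reading for a subset of the features (interrogativity, tense, mood, evidentiality, politeness). It is an illustration under stated assumptions, not the conversion script used to build K-UniMorph: the function name, the uppercase Sejong tags, the exact sets of ending forms, and the output order of features are all assumptions, and voice, PURP and OBLIG are left out.

```python
# Output order of features in the bundle (an assumption for this sketch).
ORDER = ["V", "DECL", "INT", "PST", "PRS", "REAL", "IRR", "HRSY", "INFER", "POL"]


def features_from_morphemes(morphemes):
    """Map (form, Sejong tag) pairs of one verb to a feature bundle string."""
    found = {"V"}
    for form, tag in morphemes:
        if tag == "EF":  # sentence final ending
            if form.endswith("다"):
                found.add("DECL")
            elif "가" in form or "까" in form:
                found.add("INT")
            if form in ("ㄴ다", "는다"):
                found.add("PRS")
            if form in ("대", "내", "래"):
                found.add("HRSY")
        elif tag == "EP":  # non-final ending
            if form in ("았", "었"):
                found.add("PST")
            elif form == "겠":
                found.add("INFER")
            elif form == "시":
                found.add("POL")
        elif tag == "ETM":  # modifier (adnominal) ending
            if form == "ㄴ":
                found.add("REAL")
            elif form == "ㄹ":
                found.add("IRR")
    return ";".join(f for f in ORDER if f in found)


print(features_from_morphemes([("나서", "VV"), ("었", "EP"), ("다", "EF")]))
```

Collecting features in a set and emitting them in a fixed order keeps the bundle independent of morpheme order, so the running example yields V;DECL;PST as in the introduction.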

Data creation
We prepare the data by extracting inflected verb forms from the Sejong morphologically analyzed corpus (sjmorph), which comprises 676,951 sentences with 7,835,239 eojeols (word units separated by spaces), representing 9,537,029 tokens. We use the same training/dev/test data split that Park and Tyers (2019) proposed for Korean part-of-speech (POS) tagging. However, the current sjmorph does not contain POS labels for the eojeol (the word). Instead, it contains the sequence of POS labels for morphemes, as follows: 나섰다naseossda 나서naseo/VV+었eoss/EP+다da/EF. This analysis contains only each morpheme's POS label: a verb 나서naseo ('become'), a non-final ending 었eoss ('PST'), and a final ending 다da ('DECL'); it does not show whether the word 나섰다naseossda ('became') is a verb.
Previous works (Petrov et al., 2012; Park et al., 2016; Park and Tyers, 2019; Kim and Colineau, 2020) propose a partial mapping table between Sejong POS labels (and sequences of Sejong POS labels) (XPOS) and Universal POS (UPOS) labels, where UPOS represents the grammatical category of the word. However, no study has evaluated the correctness of these conversion rules. Therefore, we utilize UD_Korean-GSD (McDonald et al., 2013) in Universal Dependencies (Nivre et al., 2016, 2020), which provides Sejong POS(s) and Universal POS labels for each word. Nevertheless, we observed several critical POS annotation errors in UD_Korean-GSD. For this reason, we revised GSD's Sejong POS(s) and Universal POS labels to evaluate our criteria for obtaining verbs (inflected forms and their lemmas) from sjmorph. This approach involved randomly selecting 300 sentences from the GSD and manually revising their POS labels based on the Sejong POSs. For thorough verification, they were examined by our linguist for over 60 hours over 3 weeks. The main error that we noticed was that proper nouns were labeled as NOUN even when their XPOS was proper noun (NNP). They were corrected to the UPOS label PROPN. Another common error was that the dataset labeled words according to their roles as constituents of the sentence they are in, instead of the word's own category. For example, temporal nouns were usually annotated as ADV instead of NOUN. We corrected this mislabeling by considering the word itself, separate from the sentence. Again, the Sejong POS labels were revised based on the criteria of the Sejong corpus.
After correcting 738 words for Sejong POS labels and 705 words for Universal POS labels from 300 sentences in the development file, we trained on the sequences of Sejong POS labels using semi-supervised learning to predict the Universal POS label for each word. Among 3674 predictions, there were only 332 UPOS prediction errors (about 9.0%), and errors scarcely occur for VERB labels, which are what we attempt to extract from sjmorph. Therefore, we consider the current error rate for verbs to be negligible. Finally, we extract 244,871 inflected verbal forms for 43,959 lemma types from sjmorph. Then, we remove from the train+dev datasets all items that also appear in the test dataset. Table 2 shows brief statistics of the current dataset.
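As an illustration of the sjmorph analysis format discussed in this section, the sketch below splits an analysis such as 나서/VV+었/EP+다/EF into (morpheme, tag) pairs and derives a candidate lemma. Two simplifications are assumptions of this sketch and not the procedure of the paper: a word is treated as a verb when its first morpheme is tagged VV (the paper instead predicts UPOS labels from the tag sequence), and the lemma is formed as stem + 다, as suggested by the pair 나서 / 나서다 in the running example. Literal "+" or "/" symbol tokens are not handled.

```python
def parse_analysis(analysis: str):
    """Split 'form/TAG+form/TAG+...' into a list of (form, tag) pairs."""
    pairs = []
    for chunk in analysis.split("+"):
        # Split on the last slash so the tag is always the final field.
        form, tag = chunk.rsplit("/", 1)
        pairs.append((form, tag))
    return pairs


def verb_lemma(pairs):
    """Return stem + 다 if the first morpheme is tagged VV, else None."""
    form, tag = pairs[0]
    return form + "다" if tag == "VV" else None


pairs = parse_analysis("나서/VV+었/EP+다/EF")
print(pairs, verb_lemma(pairs))
```

The resulting triple (lemma, surface word, features) is what an extraction step of this kind would contribute to the dataset for each verbal eojeol.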

Morphological reinflection
The goal of the morphological reinflection task is to create a generative function over the morphological schema that produces the inflected form of a given word. For Korean, we use 나서다naseoda and V;DECL;PST to predict 나섰다naseossda, using the composition of alphabet letters (L), syllables (S) and morphemes (M) of the word as shown in Table 3. The word is decomposed into a sequence of consonants and vowels for letters, a sequence of units constructed from two or three letters for syllables, and a sequence of morphological units for morphemes. The conversion from the target form of each representation to the surface form, and vice versa, is straightforward in technical terms.
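The letter and syllable representations can be obtained with standard Unicode arithmetic on precomposed Hangul syllables (U+AC00 to U+D7A3). The following is a minimal sketch of such a decomposition; the function names are hypothetical, and the use of compatibility jamo for the letters is an assumption, so the exact symbols may differ from those in Table 3.

```python
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"          # 19 initial consonants
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"      # 21 vowels
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 finals


def to_syllables(word: str):
    """Syllable representation: one unit per Hangul syllable block."""
    return list(word)


def to_letters(word: str):
    """Letter representation: each syllable split into two or three letters."""
    letters = []
    for ch in word:
        index = ord(ch) - 0xAC00
        if 0 <= index < 19 * 21 * 28:
            lead, rest = divmod(index, 21 * 28)
            vowel, tail = divmod(rest, 28)
            letters.append(CHOSEONG[lead])
            letters.append(JUNGSEONG[vowel])
            if tail:  # tail index 0 means there is no final consonant
                letters.append(JONGSEONG[tail])
        else:
            letters.append(ch)  # non-Hangul characters pass through unchanged
    return letters


print(to_syllables("나섰다"))
print(to_letters("나섰다"))
```

Because each syllable index maps to exactly one (initial, vowel, final) triple, the inverse mapping from letters back to the surface form is equally mechanical, which matches the remark that the conversions are straightforward.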
For our task, we use the baseline system from the CoNLL-SIGMORPHON 2018 Shared Task (Cotterell et al., 2018). The system uses alignment, span merging and rule extraction to predict the set of all inflected forms of a lexical item (Durrett and DeNero, 2013). We also build a basic neural model using fairseq (Ott et al., 2019) and Transformer (Vaswani et al., 2017).
Table 4 shows the experimental results for Korean UniMorph using the three different representation forms. It is notable that the morpheme forms outperform the other surface representation forms, such as letters and syllables of the word. This is because morpheme forms imply lemma forms for both source and target data. While the average number of inflected forms per lemma is 8.285, there are 22 verb lemmas that have more than 400 different inflected forms. The average number of inflected forms per lemma and morphological feature pair is also 5.634, which makes it difficult to predict inflected forms for Korean.
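The two averages mentioned above (inflected forms per lemma, and inflected forms per lemma and feature pair) can be computed by grouping the dataset triples. The sketch below shows one plausible way to do so; the function name and the toy triples are hypothetical, and the counting procedure behind the reported figures may differ, for instance in how duplicates are handled.

```python
from collections import defaultdict


def paradigm_stats(triples):
    """Return (avg forms per lemma, avg forms per (lemma, features) pair)."""
    forms_by_lemma = defaultdict(set)
    forms_by_pair = defaultdict(set)
    for lemma, form, feats in triples:
        forms_by_lemma[lemma].add(form)
        forms_by_pair[(lemma, feats)].add(form)
    avg_lemma = sum(len(s) for s in forms_by_lemma.values()) / len(forms_by_lemma)
    avg_pair = sum(len(s) for s in forms_by_pair.values()) / len(forms_by_pair)
    return avg_lemma, avg_pair


# Toy data for illustration only, not taken from the dataset.
toy = [
    ("나서다", "나섰다", "V;DECL;PST"),
    ("나서다", "나섰습니다", "V;DECL;PST"),
    ("나서다", "나선다", "V;DECL;PRS"),
    ("가다", "갔다", "V;DECL;PST"),
]
avg_lemma, avg_pair = paradigm_stats(toy)
print(avg_lemma, avg_pair)
```

A value above 1 for the second average means that a lemma and a feature bundle do not determine a unique surface form, which is the source of difficulty noted above.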

Comparison with UniMorph 4.0 Korean
UniMorph 4.0 (Batsuren et al., 2022) includes a Korean dataset, which provides 2,686 lemmas and 241,323 inflected forms that are automatically extracted from Wiktionary. It is mainly comprised of adjectives and verbs, with totals of 52,387 and 188,821, respectively. We thoroughly inspected the verbs in UniMorph 4.0 Korean to compare them with K-UniMorph: among the 152,454 inflected forms of verbs in UniMorph 4.0 Korean, only 16,489 forms appear in the 9.5M words of the Sejong corpus, and 135,965 forms (89.18%) never occur. UniMorph 4.0 Korean annotated all verbs (V) as FIN and all participles (V.CPTP) as NFIN. We can consider adding FIN for all verbs ending with ef (final verbal endings) and NFIN for all verbs ending with etm (adnominal endings, which are utilized for relative clauses, modifiers, and a part of collocations). On closer inspection, UniMorph 4.0 Korean provides the imperative-jussive modality IMP, which consists of 1;PL and 2, but it seems that Number (PL) occurs only with 1 (Person). While K-UniMorph considers only 시si (an honorific for the agent) as

Discussion and Future Perspectives
We have dealt with the UniMorph schema for verbs, and obtained experimental results for the morphological reinflection task using different representation forms of the word. In several grammar formalisms for Korean, nouns have been treated by separating the postposition from the lemma of the noun instead of keeping them together (e.g. 프랑스peulangseu ('France') and 의ui ('GEN') instead of 프랑스의peulangseuui). However, in addition to exogenously given interests such as inflection in context, recent studies insist that functional morphemes, including both verbal endings and postpositions in Korean, should be treated as part of a word, with the result that their categories do not need to be assigned individually at the syntactic level (Park and Kim, 2023). Accordingly, it would be more efficient to assign syntactic categories to the fully inflected lexical word derived by the lexical rules of the morphological processes in the lexicon. Therefore, we will investigate how to adopt features for nouns, such as cases including non-core and local cases such as NOM (nominative), ACC (accusative), comparison (CMPR), and information structure TOP (topic) (Table 6). This will also include a typology of jkb (adverbial marker), which raises ambiguities. An adverbial marker can represent 'dative', which marks the indirect object; 'instrumental', which marks the means by which an action occurred; 'allative', which marks a type of locative grammatical case; 'ablative', which expresses motion away from something; or 'comparative' (CMPR, 예상yesang. We leave a detailed study on nouns and other grammatical categories for future work. All datasets of K-UniMorph are available at https://github.com/jungyeul/K-UniMorph to reproduce the results.

A Neural Experiment Description
We use the default setting of fairseq for the neural experiment reported in Table 4 in §3.2, as described in Table 7.
fairseq. We use fairseq-preprocess, fairseq-train and fairseq-interactive.
GPU. Around 1 hour of GPU time was consumed for the training step of each experiment.
Total runtime. It takes about 2 to 3 hours to complete one experiment including all steps (preprocessing, training and evaluation).

Table 1 :
Korean UniMorph schema for verbs: vv for verb, va for adjective, vcp for copula, and nnb for bound noun,

Table 2 :
Statistics of Korean UniMorph

Table 3 :
Example of the surface form and its different representation using letters, syllables and morphemes.

Table 4 :
Experimental results (accuracy) for Korean UniMorph using the three different representation forms. The morpheme forms outperform the letter and syllable forms of the word.

Table 6 :
Korean UniMorph schema for nouns.

for 시si, and POL comes from verbal endings 요yo and 습니다seubnida with either FORM or INFM. However, FORM.ELEV is to elevate the referent. Therefore, it should be with IMP;2|3 and instead, FORM.HUMB can be introduced with IMP;1 for 습니다seubnida, and INFM.ELEV|INFM.HUMB for 요yo. Hence, K-UniMorph provides a richer feature schema based on linguistic analysis. Table 5 summarises the different usage of the feature schema between UniMorph 4.0 Korean and K-UniMorph.

Table 7 :
A single run with a seed number Hyperparameter