Towards standardizing Korean Grammatical Error Correction: Datasets and Annotation

Research on Korean grammatical error correction (GEC) is limited compared to other major languages such as English. We attribute this to the lack of a carefully designed evaluation benchmark for Korean GEC. In this work, we collect three datasets from different sources (Kor-Lang8, Kor-Native, and Kor-Learner) that cover a wide range of Korean grammatical errors. Considering the nature of Korean grammar, we then define 14 error types for Korean and provide KAGAS (Korean Automatic Grammatical error Annotation System), which can automatically annotate error types from parallel corpora. We use KAGAS on our datasets to build an evaluation benchmark for Korean, and present baseline models trained on our datasets. We show that the model trained with our datasets significantly outperforms the currently used statistical Korean GEC system (Hanspell) on a wider range of error types, demonstrating the diversity and usefulness of the datasets. The implementations and datasets are open-sourced.


Introduction
Writing grammatically correct Korean sentences is difficult for learners studying Korean as a Foreign Language (KFL), and even for native Korean speakers, due to its morphological and orthographical complexity in areas such as particles, spelling, and collocation. Its word spacing rules are complex, with many domain-dependent exceptions that only around 20% of native speakers understand thoroughly (Lee, 2014). Since Korean is an agglutinative language (Sohn, 2001; Song, 2006), getting used to Korean grammar is time-consuming for KFL learners whose mother tongue is non-agglutinative (Haupt et al., 2017; Kim, 2020). However, despite the growing number of KFL learners (Lee, 2018), little research has been conducted on Korean Grammatical Error Correction (GEC) because of the difficulties of the Korean language described above. Another major obstacle to developing a Korean GEC system is the lack of resources to train a machine learning model.
In this paper, we propose three datasets that cover various grammatical errors from different types of annotators and learners. The first dataset, Kor-Native, is crowd-sourced from native Korean speakers. The second, Kor-Learner, comes from KFL learners and consists of essays with detailed corrections and annotations by Korean tutors. The third, Kor-Lang8, is similar to Kor-Learner except that it consists of sentences written by KFL learners but corrected by native Koreans on social platforms who are not necessarily linguistic experts. We also analyze our datasets in terms of error type distributions.
Another important problem with Korean GEC is that most existing datasets have no human annotation and their quality is not guaranteed, which makes them hard to use for evaluation. Human annotation is also difficult to perform on large corpora, since it requires experts and is costly. Moreover, the schema and set of error types can differ across datasets and annotators, which can make annotation counter-productive. An alternative way to analyze and evaluate a dataset is automatic annotation from parallel corpora. While such a system exists for English, called ERRANT (Bryant et al., 2017), there is no automatic error type annotation system for Korean. We cannot fully classify Korean error types and edits using ERRANT, because Korean has many characteristics different from English (Section 4.4). This motivates us to develop an automatic error annotation system for Korean (KAGAS), along with annotated error types on refined corpora produced with KAGAS.
Furthermore, in order to demonstrate the usefulness of our corpora and encourage future research on Korean, we build a simple yet effective baseline model based on BART (Lewis et al., 2019) that generates grammatically corrected sentences from noisy text. We further analyze the generated outputs of BART, comparing how the accuracy of each system differs by error type against a statistical method called Hanspell (https://speller.cs.pusan.ac.kr/), and provide use cases and insights gained from the analysis.
To summarize, the contributions of this paper are as follows: (1) the collection of three different types of parallel corpora for Korean GEC, (2) a novel grammatical error annotation toolkit for Korean called KAGAS, and (3) simple yet effective baseline Korean GEC models trained on our datasets, with detailed analysis by KAGAS.

Related Work
Datasets Well-curated datasets in each language are crucial to building a GEC system that can capture language-specific characteristics (Bender, 2011). In addition to several shared tasks on English GEC (Ng et al., 2014; Bryant et al., 2019; Rao et al., 2018), resources for GEC in other languages are also available (Wu et al., 2018; Li et al., 2018; Rozovskaya and Roth, 2019; Koyama et al., 2020; Boyd, 2018). Existing works on Korean GEC (Min et al., 2020; Lee et al., 2021; Park et al., 2020) are difficult to replicate because they use internal datasets or existing datasets without providing pre-processing details and scripts. It is therefore urgent to provide publicly available datasets in a unified, easily accessible form, with fully reproducible preprocessing pipelines, for GEC research on Korean.
Evaluation The M2 scorer (Dahlmeier and Ng, 2012), which measures precision, recall, and F0.5 scores based on edits, is the standard evaluation metric for English GEC models. It requires an M2 file with annotations of edit paths from an erroneous sentence to a corrected sentence. However, it is expensive to collect these annotations from human workers, who are often required to have expert linguistic knowledge. When such annotations are not available, GLEU (Napoles et al., 2015), a simple variant of BLEU (Papineni et al., 2002) based on n-gram matching, is used instead. Another way to generate an M2 file for English in a rule-based manner is the error annotation toolkit ERRANT (Bryant et al., 2017). We extend ERRANT to build KAGAS and use it to align and annotate edits on our datasets, producing an M2 file to evaluate Korean GEC models.
Automatic Error Annotation for other languages Creating a sufficient amount of human-annotated data for GEC in other languages is not trivial. To address this problem, there have been attempts to adapt ERRANT (Bryant et al., 2017) to languages other than English for error type annotation, such as Czech (Náplava et al., 2022), Hindi (Sonawane et al., 2020), Russian (Katinskaia et al., 2022), German (Boyd, 2018), and Arabic (Belkebir and Habash, 2021), but no existing work has previously extended ERRANT to Korean. When extending ERRANT to other languages, necessary changes to the error types were made, such as discarding types ERRANT could not match and adding language-specific ones.
Models Early works on Korean GEC focus on detecting particle errors with statistical methods (Lee et al., 2012; Israel et al., 2013; Dickinson et al., 2011). A copy-augmented transformer (Zhao et al., 2019), pre-trained to denoise and fine-tuned with paired data, demonstrates remarkable performance and is widely used in GEC. Recent studies (Min et al., 2020; Lee et al., 2021; Park et al., 2020) apply this method to Korean GEC. On the other hand, Katsumata and Komachi (2020) show that BART (Lewis et al., 2020), known to be effective on conditioned generation tasks, can be used to build a strong baseline for GEC systems. Following this work, we load the pre-trained weights from KoBART, a Korean version of BART, and fine-tune it using our GEC datasets.

Data Collection
We build three corpora for Korean GEC: Kor-Learner (§3.1), Kor-Native (§3.2), and Kor-Lang8 (§3.3). The statistics of each dataset are described in Table 1. We describe the main characteristics and source of each dataset and how it is preprocessed in the following subsections. We expect that the different characteristics of these diverse datasets in terms of quantity, quality, and error type distributions (Figure 1) allow us to train and evaluate a robust GEC model.

Figure 1: The distribution of error types for each of our proposed datasets (upper three). The bottom one (Lang8) is shown for comparison with Kor-Lang8. Percentages are shown for error types larger than 5%. The distributions show that NOUN, CONJ, END, and PART errors are more frequent for KFL learners than for native speakers, which is consistent with previous corpus studies (Kim, 2020; Lee, 2020).

Korean Learner Corpus (Kor-Learner)

Some XML files in the source NIKL learner corpus had empty edits, missing tags, and inconsistent edit-correction tags depending on the annotator, so additional refinement and proofreading were required. The authors therefore manually inspected the resulting parallel corpora and discarded sentences with insufficient annotations (Appendix A.4.2). After applying appropriate modifications to the NIKL corpus, we obtained Kor-Learner, which contains high-quality word-level parallel sentences.

Native Korean Corpus (Kor-Native)
The purpose of this corpus is to build a parallel corpus representing grammatical errors that native Korean speakers make. Because the Korean orthography guidelines are complicated, consisting of 57 rules with numerous exceptions, only a few native Korean speakers fully internalize and correctly apply all of them. The standard approach thus depends on the manpower of Korean language experts, which is neither scalable nor cheap. We instead introduce a novel method to create a large parallel GEC corpus from correct sentences, which depends not on the manpower of experts but on the general public of native Korean speakers. Our method is characterized as a backward approach. We collect grammatically correct sentences from two sources and read them aloud using the Google Text-to-Speech (TTS) system. We asked the general public to dictate the grammatically correct sentences and transcribe them. The transcribed sentences may be incorrect, containing grammatical errors that the audience often makes. Details of the collection are described in Appendix A.2. Figure 1 shows that most of the collected error types concern word spacing.

Figure 2: An example of an M2 file output by KAGAS, translated into English. Note that "to school" is treated as one word for the translation example.

KAGAS
We propose the Korean Automatic Grammatical error Annotation System (KAGAS), which automatically aligns edits and annotates error types on parallel corpora, overcoming many disadvantages of hand-written human annotation (Appendix B.2). Figure 2 shows an overview of KAGAS. As the scope of the system is to extract edits and annotate each edit with an error type, our system assumes the given corrected sentence is grammatically correct. The system then takes a pair of the original sentence and the corrected sentence as input and outputs aligned edits with error types. We further extend the usage of KAGAS to analyze the generated text of our baseline models by error type in Table 7 in Section 6. In this section, we describe the construction and contributions of KAGAS in detail, together with human evaluation results.

Alignment Strategy
We first conduct sentence-level alignment to define a "single edit". We use the Damerau-Levenshtein distance (Felice et al., 2016) implementation from the edit-extraction repository (https://github.com/chrisjbryant/edit-extraction) to get edit pairs. Note that we apply a different alignment strategy from ERRANT, which defines the scope of a "single" edit. We use Korean-specific linguistic costs (based on the Kkma POS tagger and a Korean lemmatizer), so that word pairs with a lower POS cost and a lower lemma cost are more likely to be aligned together. Also, we use custom merging rules to merge single word-level edits into WO and WS. Therefore, the total number of edits, the average token length of edits, and the output M2 file produced by KAGAS differ from those of ERRANT, since an M2 file consists of edit alignments and error types (Fig. 2). This results in different M2 scores when applied to model output evaluation.

Corrected: 점심이 너무 작은 나머지 배고팠어요. Translation: I was hungry because I had such a small lunch.
Corrected: 오늘은 머리를 자르러 갔다. Translation: I went to a barber to get my hair cut today.

Error types
We describe here how we consider the unique linguistic characteristics of Korean (Appendix B.1), and define 14 error types (Table 3).
Classifying error types at the morpheme level Because Korean is an agglutinative language, we need to identify the morpheme-level differences between the original and corrected words. For example, 학교에 ('to school') in Table 2 is divided into two parts at the morpheme level, 학교 ('school') + 에 ('to'), based on the roles within the word. If this word is corrected to 집에 ('to home'), we should treat the edit as NOUN (학교 -> 집), and if it is corrected to 학교에서 ('at school'), we should classify the edit as PART (Particle), since -서 is added. Therefore, we need to break word-level edits down into morphemes and look only at morpheme-level differences when classifying error types. When conducting morpheme-level alignment, we utilize a morpheme-level Levenshtein distance for Korean. Also, POS tags in Korean are assigned to morphemes, not words, meaning that one word can have multiple POS tags. Apart from POS tagging, KAGAS also considers the composition of edits (e.g., SHORT). Please refer to Appendix B.4 for detailed examples.
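The morpheme-level edit extraction above can be sketched as follows. This is a simplified illustration using Python's generic `difflib` matcher over pre-segmented morphemes; KAGAS itself uses a Korean-specific Damerau-Levenshtein cost and the Kkma POS tagger, so the segmentation and costs here are assumptions.

```python
from difflib import SequenceMatcher

def morpheme_edits(orig_morphs, corr_morphs):
    """Extract morpheme-level edits between two morpheme sequences.

    Stand-in for KAGAS's Korean-specific alignment: here we use
    difflib's generic matcher over already-segmented morphemes.
    """
    edits = []
    sm = SequenceMatcher(a=orig_morphs, b=corr_morphs, autojunk=False)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op != "equal":
            edits.append((op, orig_morphs[i1:i2], corr_morphs[j1:j2]))
    return edits

# 학교+에 ('school' + 'to') vs. 학교+에서 ('school' + 'at'):
# only the particle differs, so the edit is particle-level (PART).
print(morpheme_edits(["학교", "에"], ["학교", "에서"]))
# → [('replace', ['에'], ['에서'])]
```

Replacing the noun instead (학교 -> 집) would yield a NOUN-level edit under the same alignment.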
No PREP, but PART In Korean, morphemes (not words) align with meaning. Therefore, "학교 에" ('to school', with an incorrect space) -> "학교에" is WS, while "혁고에" (a misspelling) -> "학교에" is SPELL, which is a different error. In a similar vein, there is no PREP in Korean (positioning "to" before "school"); Korean instead uses postpositional particles (positioning "-에" after "학교").

On the motivation for selecting 14 error types According to previous work that categorizes Korean grammatical errors by frequency (Shin, 2007), Korean error types are divided into (1) sound, (2) format, (3) spacing, and (4) the rest, meaning that orthographical errors are highly frequent among Korean error types. We therefore designed our error types to focus on capturing frequent orthographical errors such as WS and SPELL, along with syntactic and morphological errors such as WO and SHORT. There are 9 main categories of POS in Korean (noun, pronoun, numeral, verb, adjective, postposition, pre-noun, adverb, interjection), and a single word is divided into substantives (mostly nouns) and inflectional words. Most inflectional words are irregular and prone to change in form, and detecting those changes is also important. Therefore, we added 6 error types that cover the 9 POS categories except numerals (noun & pronoun to NOUN, verb to VERB, adjective to ADJ, postposition to PART, pre-noun & adverb to MOD, interjection to PUNCT) and 2 error types for inflectional words (CONJ and END (=suffix)), which can be classified by the POS tagger. The resulting 14 error types, motivated by Korean linguistic characteristics in terms of both linguistic typology and orthographical guidelines, cover all the crucial, frequent error types.
About INS/DEL edits Since Korean is a discourse-oriented language, one can omit the subject or the object of a sentence depending on the previous context. These cases are classified as INS/DEL edits even though they are grammatically correct. There are also INS/DEL cases that edit unnecessary modifiers, which are not grammatical corrections but rather variations of the sentence. Previous works that applied ERRANT to other languages also discard INS/DEL edits or treat them in a similar manner to ours. Works on Hindi (Sonawane et al., 2020) and Russian (Katinskaia et al., 2022) only further classify R (Replacement) edits. For Arabic (Belkebir and Habash, 2021), insertions and deletions are not classified beyond token-level and word-level INS/DEL. We believe that some INS/DEL edits contain meaningful grammatical errors. However, for the reasons above, and given that we unfortunately do not have enough resources to conduct human evaluation of INS/DEL subgroups, we believe that not dividing INS/DEL further has more gains than losses regarding the reliability of KAGAS. DEL/INS examples are in Appendix B.3.1, and more details about the granularity of error types are in Appendix B.6.
Priority between error types Due to the nature of the Korean language, multiple error types can apply to a single edit. However, we decided to output a single representative error type for each edit (Appendix B.5) by defining a priority between them, in order to make a deterministic, reliable system with clear evidence (Appendix C.2). The detailed steps are as follows: (1) We first classify edits that cannot overlap with one another (INS/DEL/WS/WO/PUNCT) according to the current error type definitions. (2) We then prioritize frequent formal and orthographical errors such as SPELL and SHORT over the rest, since those errors are highly frequent in Korean grammar. (3) When there is a single POS type for an edit, we return the error type according to that POS. (4) When there are multiple POS types for an edit, we check whether the edit is CONJ (a combination of VERB + ENDING or ADJECTIVE + ENDING). Everything else is left unclassified. We detect INS and DEL directly from the outputs of sentence-level alignment. We merge edits into WS and WO based on the syntactic appearance of the edits. For SPELL, we use a Korean spellchecker dictionary (https://github.com/spellcheck-ko/hunspell-dict-ko). We utilize a Korean POS tagger (Appendix C.1) to classify the other POS-related error types.
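The priority cascade above can be sketched as a single function. This is a simplified illustration, not KAGAS's implementation: the `pos_tags` list and the `in_spell_dict` lookup are hypothetical stand-ins for the Kkma tagger and the spellchecker dictionary, and several types (e.g. SHORT, MOD) are omitted for brevity.

```python
def classify_edit(orig, corr, pos_tags, in_spell_dict):
    """Cascaded error-type classification mirroring the priority above.

    pos_tags: POS tags of the corrected-side morphemes (hypothetical).
    in_spell_dict: spellchecker membership test (hypothetical).
    """
    # (1) Non-overlapping structural types first.
    if not orig:
        return "INS"
    if not corr:
        return "DEL"
    if orig and corr and all(c in ".,!?" for c in orig + corr):
        return "PUNCT"
    if orig.replace(" ", "") == corr.replace(" ", ""):
        return "WS"                      # word-spacing only
    if sorted(orig.split()) == sorted(corr.split()):
        return "WO"                      # word-order only
    # (2) Frequent orthographical errors next.
    if not in_spell_dict(orig):
        return "SPELL"
    # (3) A single POS type maps directly to an error type.
    pos_to_type = {"NOUN": "NOUN", "VERB": "VERB", "ADJ": "ADJ",
                   "PART": "PART", "MOD": "MOD", "END": "END"}
    if len(set(pos_tags)) == 1:
        return pos_to_type.get(pos_tags[0], "UNK")
    # (4) Multiple POS types: verb/adjective + ending is conjugation.
    if set(pos_tags) in ({"VERB", "END"}, {"ADJ", "END"}):
        return "CONJ"
    return "UNK"
```

Because the checks run in priority order, an edit that is both a spacing change and a particle change is deterministically labeled WS, matching the single-representative-type design.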

Evaluation of the Annotation System
We evaluate our system with 3 Korean GEC experts majoring in Korean linguistics (Table 4). First, we evaluate the acceptance rate of each error type by randomly sampling 26 parallel sentences with a single edit per error type from our datasets. One half (13 sentences) is written by native Korean speakers, and the other half by KFL learners. In total, there are 364 parallel sentences in random order. Each evaluator rated each parallel sentence as "good" or "bad". The acceptance rate is the rate of "good" responses out of all "good" and "bad" responses. (We measure the 95% confidence interval for each error type with n = 26 samples per error type × 3 evaluators for Table 3, and report the weighted sum in Table 4.) The overall acceptance rate is the sum of the acceptance rates of each error type weighted by the proportion of that error type in each dataset; it therefore depends on the distribution of error types in a dataset. From the overall acceptance rate (Table 4), we can estimate that about 90% of the classified edits are evaluated as good for KAGAS on real dataset distributions. The coverage on a dataset is the rate of edits not identified as UNK (unclassified). Table 5 shows that the inter-annotator agreements are moderate, meaning that the evaluation results are consistent enough between annotators to be reliable. It is also meaningful to note that KAGAS has a very high human acceptance rate (>96.15%) for frequently observed error types in our datasets, such as WO, SPELL, PUNCT, and PART (particle). The high acceptance rate for PART is especially meaningful, since particles play an important role in representing grammatical case (-격) in Korean. A detailed analysis and the evaluation interface figure are in Appendix C.3.1.
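The weighted overall acceptance rate can be written out explicitly. The per-type rates and proportions below are made-up illustrative numbers, not the paper's reported figures.

```python
def overall_acceptance(rates, proportions):
    """Overall acceptance rate: per-type acceptance rates weighted by the
    proportion of each error type in the dataset."""
    assert set(rates) == set(proportions)
    return sum(rates[t] * proportions[t] for t in rates)

# Hypothetical per-type acceptance rates and dataset proportions:
rates = {"WS": 0.97, "SPELL": 0.96, "PART": 0.90, "UNK": 0.60}
props = {"WS": 0.50, "SPELL": 0.25, "PART": 0.20, "UNK": 0.05}
print(round(overall_acceptance(rates, props), 4))  # → 0.935
```

This makes explicit why the overall rate depends on the dataset: a dataset dominated by WS edits inherits the (high) WS acceptance rate.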

Contributions of KAGAS
To summarize, KAGAS differs from previous work in its (1) integration of morpheme-level POS tags, (2) morpheme-level alignment strategy, and (3) definition of 14 Korean-specific error types. For these reasons, we believe that KAGAS captures a more diverse and accurate set of Korean error types than a simple adaptation of automatic error type systems such as Choshen and Abend (2018) or ERRANT.

Experiments
We conduct experiments to build an effective model to encourage future research on Korean GEC models. We report test accuracy using the model with the best validation GLEU score. Detailed experiment settings to reproduce our results appear in Appendix D.

Evaluation Metrics
We evaluate our model using the M2 scorer and GLEU (Section 2). Note that we can obtain M2 scores as well as GLEU scores by producing an M2 file with KAGAS. We also report self-scores (self-GLEU and self-M2, obtained by treating the original text as system output for each evaluation metric) to compare the characteristics of the datasets themselves. Higher self-scores mean that the corrected text is more similar to the original text.

Table 6: Full experiment statistics on the test set. KoBART outputs are averaged over 3 different seeds. Generation time (sentences/second) is measured by dividing the total number of sentences by the total generation time. Accuracies on the test set are reported with the model checkpoints that have the highest GLEU validation accuracy.

Hanspell as Baseline GEC System
Our primary aim is to build a first strong baseline model for Korean GEC. We therefore compare our methods with a well-known commercial Korean GEC statistical system called Hanspell. (Note that Hanspell is a completely different system from Hunspell, the spellchecker.) It has been developed at Pusan National University since 1992 and is widely used in Korea, since it is free and easily accessible through the web.

Datasets
We split each dataset with the train_test_split function from sklearn, applied twice to obtain train, test, and validation ratios of 70%, 15%, and 15% on seed 0. Then, we aggregate all three datasets to make Kor-Union.
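The split can be sketched in standard-library Python as follows; this is an equivalent stand-alone version of calling sklearn's train_test_split twice with a fixed seed, and the pair format is an assumption for illustration.

```python
import random

def split_70_15_15(pairs, seed=0):
    """70/15/15 train/test/valid split of (source, target) pairs."""
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * 0.70)
    n_test = int(n * 0.15)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    valid = shuffled[n_train + n_test:]
    return train, test, valid

pairs = [(f"src{i}", f"tgt{i}") for i in range(100)]
train, test, valid = split_70_15_15(pairs)
print(len(train), len(test), len(valid))  # → 70 15 15
```

Fixing the seed makes the split reproducible, which matters for the preprocessing-pipeline reproducibility goal stated earlier.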

Model Training
We use the HuggingFace implementation of BART, loading the weights from the pre-trained Korean BART (KoBART). We train models under multiple scenarios: (1) fine-tuning KoBART on the 3 individual datasets, and (2) fine-tuning KoBART on the aggregated Kor-Union dataset.

Tokenization
We utilize the character BPE tokenizer (Sennrich et al., 2016) from the HuggingFace tokenizers library, as used by KoBART. Due to limitations of the tokenizer, encoding and then decoding the original raw text automatically removes spaces between a word and punctuation (Appendix D.2). Therefore, naively evaluating the generated output (decoded by the tokenizer) against an M2 file made from the raw text is poorly aligned, resulting in bad accuracy. Since measuring the performance of the model has higher priority than measuring the performance of the tokenizer, we use the decoded version of the text both to train and to make the M2 files for evaluation.
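The mismatch can be illustrated with a toy stand-in for the encode-then-decode round trip; `toy_decode` is an assumption that only mimics the space-stripping behavior described above, not KoBART's actual tokenizer.

```python
def toy_decode(text):
    """Toy stand-in for the tokenizer's encode-then-decode round trip:
    spaces before punctuation are dropped, as described above."""
    for p in ".,!?":
        text = text.replace(" " + p, p)
    return text

raw = "학교에 갔어 ."      # raw text with spaced punctuation
decoded = toy_decode(raw)  # what the model actually sees and produces
print(decoded)             # → 학교에 갔어.
# Building the M2 file from `raw` while scoring `decoded` output would
# count the missing space as a spurious edit; hence both the training
# text and the M2 files are built from the decoded version.
```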

Results and Discussion
Effectiveness of neural models As shown in Table 6, the model trained with our datasets outperforms the current commercial GEC system (Hanspell) on all datasets. This is notable in that Hanspell is currently known as the best-performing open-source system for correcting erroneous Korean sentences. The result implies that our datasets help to build a better GEC system, and that our model can serve as a reasonable baseline demonstrating the effectiveness of neural models against previous rule-based GEC systems. Moreover, the generation speed of our neural models (KoBART) is about five times faster than Hanspell, showing efficiency as well as performance.

Table 7: GLEU and M2 scores on the generation output on the test set of Hanspell and KoBART on Kor-Union. The scores are divided by all 14 + UNK error types. For convenience, scores higher than a certain threshold are highlighted. PUN. and SHO. stand for PUNCT and SHORT, respectively. The standard deviation (STD) of Hanspell is higher than that of KoBART, meaning that scores by KoBART are evenly distributed over all error types, while scores by Hanspell are biased toward WS and SPELL.

Detailed analysis by Error Types
Table 7 shows that KoBART achieves relatively even scores across error types. In contrast, Hanspell's performance is heavily biased towards SPELL and WS. Here, we demonstrate the usefulness of KAGAS: it enables a detailed post-analysis of model output by measuring model performance on individual error types. KAGAS can thus serve as a valuable resource for evaluating the performance of different models qualitatively from various perspectives.
Kor-Native Note that performance on Kor-Native is much higher than on the other datasets. The error type distribution (Figure 1) for Kor-Native aligns with Shin et al. (2015) in that more than half of the dataset concerns WS for native Korean speakers, unlike the learner datasets, which have a more diverse set of error types. Therefore, it is easier for the model to train on Kor-Native than on the other datasets.

Conclusion
In this work, we (1) construct three parallel datasets of grammatically incorrect and corrected Korean sentence pairs for training Korean GEC systems: Kor-Lang8, Kor-Native, and Kor-Learner. Our datasets are complementary, representing grammatical errors generated by both native Korean speakers and KFL learners. (2) To train and evaluate models with these new datasets, we develop KAGAS, which considers the linguistic characteristics of Korean and automatically aligns and annotates edits between sentence pairs. (3) We present experimental results based on a pre-trained KoBART model fine-tuned on our datasets and compare them with a baseline system, Hanspell. We expect that our datasets, evaluation toolkit, and models will foster active future research on Korean GEC as well as a wide range of Korean NLP tasks. Future work includes further refining our proposed method, KAGAS, by extending its coverage and making error type classification more accurate.

Appendix for Standardizing Korean Grammatical Error Correction: Datasets, Evaluation, and Models
A Detailed instruction of dataset pre-processing

A.1 General
In original-corrected pairs, there are cases where punctuation and a word occur as one token in the original-corrected edit pairs, such as: "갔어." -> "갔어!". Since we are doing word-level alignment, it seems inappropriate to classify this whole edit as PUNCT. Therefore, in order to correctly obtain error type distributions for our datasets, we process all of our data to add spaces before punctuation ("갔어." -> "갔어 .", "갔어!" -> "갔어 !"). After this, only the punctuation is left for alignment. The edit pair from the previous example is transformed into "." -> "!", which is appropriately classified as PUNCT.
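This preprocessing step can be sketched with a single regular expression; this is a minimal sketch of the idea, not the exact script used for the datasets.

```python
import re

def space_punct(text):
    """Insert a space before punctuation attached to a word, so that
    alignment yields punctuation-only edits (e.g. "갔어." -> "갔어 .")."""
    return re.sub(r'(?<=\S)([.!?,])', r' \1', text)

print(space_punct("갔어."))  # → 갔어 .
print(space_punct("갔어!"))  # → 갔어 !
```

The lookbehind `(?<=\S)` leaves already-spaced punctuation untouched, so the transformation is idempotent.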

A.2 Kor-Native
Collecting correct sentences. We collect grammatically correct sentences from two sources. We have been granted permission to modify the original datasets (additionally creating grammatically wrong sentences from correct ones) and to redistribute them under the Korean Gong-Gong-Nuri-4 license. This license states that anyone can use Kor-Native for non-commercial purposes with proper attribution of the source.
Collecting transcribed sentences. We read the correct sentences using Google Text-to-Speech (TTS) to native Korean speakers and let them dictate the sentences they hear on crowd-sourcing platforms. The demo page for the platform we used is shown above. We delivered the correct sentences to the audience orally because a written form might interfere with the writing behavior of the audience. As a result, we collected 51,672 transcribed sentences.
Filtering. Not all transcribed sentences contain grammatical errors. We filtered out transcribed sentences that do not contain a grammatical error by the following criteria:
1. the correct sentence and its transcription are exactly the same, or
2. the differences between the two sentences fall into any of the following:
• a punctuation mark,
• a difference related to a number,
• a named entity.
Punctuation is not read by TTS. A number has multiple representations, e.g., "1" in Arabic numerals and "일" in the Korean alphabet. Finally, we excluded transcribed sentences that are too short compared to the original correct sentence.
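The filtering criteria can be sketched as a predicate over (correct, transcribed) pairs. This is a simplified illustration: the punctuation/number checks are crude stand-ins, and the named-entity criterion from the list above is omitted here.

```python
import re

def keep_pair(correct, transcribed):
    """Return True if the pair likely contains a real grammatical error.

    Simplified stand-in for the Kor-Native filtering criteria; the
    named-entity check used in the paper is not implemented here.
    """
    # 1. Exact matches carry no error.
    if correct == transcribed:
        return False
    # 2a. Differences only in punctuation (TTS does not read punctuation).
    strip = lambda s: re.sub(r"[.,!?]", "", s)
    if strip(correct) == strip(transcribed):
        return False
    # 2b. Differences involving digits (numbers have multiple readings);
    #     crudely drop any pair containing digits.
    if re.search(r"\d", correct) or re.search(r"\d", transcribed):
        return False
    # 3. Transcription too short relative to the source.
    if len(transcribed) < 0.5 * len(correct):
        return False
    return True
```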

A.3 Kor-Lang8
We first extract incorrect-correct Korean sentence pairs from the raw Lang-8 corpus. We extract all pairs that contain Korean letters and preprocess the corpus to obtain (original, corrected) pairs. We then apply various post-processing steps to the raw Lang-8 corpus. We discard pairs in which:
• the token length (when tokenized by the KoBART tokenizer) is longer than 200, since such pairs consist of meaningless repetitions of words or numbers;
• the text contains languages other than English, Korean, or punctuation, such as Arabic or Japanese characters;
• the length of any single token (split by spaces) is greater than 20, since such sentences do not make sense by manual inspection;
• either sentence is not longer than 2 characters (raw length, not tokenized length).
We also compute the ratio (r_t) of the number of tokens of the post-edit to the pre-edit (n_t,pre, n_t,post). Similarly, we compute the ratio (r_l) of the lengths. Then, we retain the (pre-edit, post-edit) pairs satisfying the following conditions and discard the others: 1) 0.25 < r_t < 4, 2) 0.5 < r_l < 1.25, 3) min(n_t,pre, n_t,post) > 5, and 4) the length of the longest common subsequence is greater than 10 characters.
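The ratio and LCS conditions can be written out directly. This is a sketch: it uses `difflib`'s matching blocks as an approximation of the longest common subsequence, which may differ slightly from an exact LCS implementation.

```python
from difflib import SequenceMatcher

def passes_ratio_filters(pre, post):
    """Apply the token-ratio, length-ratio, minimum-token, and LCS
    conditions above to a (pre-edit, post-edit) pair."""
    n_t_pre, n_t_post = len(pre.split()), len(post.split())
    r_t = n_t_post / n_t_pre          # token-count ratio
    r_l = len(post) / len(pre)        # character-length ratio
    # Approximate longest-common-subsequence length via matching blocks.
    lcs = sum(b.size for b in
              SequenceMatcher(a=pre, b=post, autojunk=False)
              .get_matching_blocks())
    return (0.25 < r_t < 4 and 0.5 < r_l < 1.25
            and min(n_t_pre, n_t_post) > 5 and lcs > 10)
```

Pairs failing any condition are likely rephrases or explanations rather than grammatical edits, which is exactly what the filter is meant to remove.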
Then, we modify each sentence by deleting traces of unneeded, additional proof marks. We discard phrases inside brackets, i.e., subsequences SEQUENCES of the text such as (SEQUENCES), {SEQUENCES}, <SEQUENCES>, or [SEQUENCES]; we discard them along with the brackets. In a similar vein, there were multiple repetitions of particular tokens, such as "안녕 홍대 !!!!!! ????", so we shortened repeated patterns to appear only once. The special tokens are [' ', '!', ';', '?', '>', '+', 'ㅠ', 'ㅜ', 'ㅋ']. After this step, the example sentence is converted into "안녕 홍대 ! ?". We then keep only sentence pairs whose jamo_levenshtein distance is smaller than 10; pairs whose Levenshtein distance exceeds this threshold are likely to contain not grammatical edits but rather rephrases or additional explanations. Lastly, we retain pairs that are unique and in which the original and corrected sentences are not identical (there must be at least one edit).
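The repeated-token collapsing step can be sketched with one backreference regex over the special-token set; a minimal sketch, not the exact preprocessing script.

```python
import re

# Special tokens whose repeated runs are collapsed to a single occurrence.
REPEATED = " !;?>+ㅠㅜㅋ"

def collapse_repeats(text):
    """Collapse runs of the special tokens above into one occurrence,
    e.g. "안녕 홍대 !!!!!! ????" -> "안녕 홍대 ! ?"."""
    pattern = "([" + re.escape(REPEATED) + r"])\1+"
    return re.sub(pattern, r"\1", text)

print(collapse_repeats("안녕 홍대 !!!!!! ????"))  # → 안녕 홍대 ! ?
```

The backreference `\1+` matches only runs of the *same* character, so "! ?" (two different tokens separated by a space) is left intact.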
After this step, there are 109,560 sentence pairs in the corpus. Full details of the modification and filtering functions for Lang-8 will be open-sourced for reproducibility.

A.4 Kor-Learner
The original corpus is a set of XML files with multiple tutors' tags and corrections to the errors of Korean learner essays.

A.4.2 Manual refinement step
As explained in Section 3.1, some XML files had empty edits, missing tags, and inconsistent edit-correction tags depending on the annotator. Of all the possible tags (Appendix A.4.1), it was common that not all of the ErrorArea, ErrorLevel, and ErrorPattern tags were present for each edit. We therefore conduct a refinement step to ensure the quality of the dataset. We process the NIKL learner corpus by the following steps. First, we merge all XML files into a single corpus. Then, we discard sentences with missing or inconsistent proofread tags by manual inspection; for example, some data were labeled "DELETE" in the proofread tags, in a position originally meant for morpheme-level edits, and we discard such data. Since the grammatical aspects of written and spoken language differ, we discard data tagged as spoken language and keep only written language. Lastly, we validate the consistency of the error types tagged by different tutors and keep only valid annotations. After this step, we build a corrected sentence from the original sentence and the morpheme-level corrections by merging morpheme-level syllables into Korean words (Appendix A.4.3).
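The tag-completeness check can be sketched with the standard library's XML parser. The ErrorArea/ErrorLevel/ErrorPattern tag names come from the text above, but the `<essay>`/`<edit>` element structure below is a hypothetical simplification of the NIKL XML layout.

```python
import xml.etree.ElementTree as ET

REQUIRED = ("ErrorArea", "ErrorLevel", "ErrorPattern")

def valid_edits(xml_string):
    """Keep only edits carrying all three proofread tags.

    The <essay>/<edit> structure is an assumed simplification of the
    NIKL learner corpus layout, for illustration only.
    """
    root = ET.fromstring(xml_string)
    return [edit for edit in root.iter("edit")
            if all(edit.get(tag) for tag in REQUIRED)]

doc = """<essay>
  <edit ErrorArea="PART" ErrorLevel="word" ErrorPattern="replace"/>
  <edit ErrorArea="PART"/>
</essay>"""
print(len(valid_edits(doc)))  # → 1
```

Edits missing any of the three tags (like the second one above) are dropped, mirroring the manual refinement criterion.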

B KAGAS Development Details

B.1 Brief introduction to the Korean language
The current orthographic practice of the Korean writing system, Hangul (한글), was established by the Korean Ministry of Education in 1988. One prominent feature of the practice is that it is morphophonemic: a written symbol binds letters into morpheme-based syllables. For instance, although 자연어 'natural language' is pronounced 자여너 [tCa.j2.n2], it is written 자연어 because 자연 'natural' and 어 'language' are each morphemes of one or two syllables. Words, or Eojeol (어절), are generally formed from both content and functional morphemes, and they are the basic segments for word spacing in Korean. The rules for word spacing are also described in the orthography guidelines; however, even native Korean speakers often regard them as complex (Lee, 2014). From the viewpoint of linguistic typology, as mentioned, Korean is an agglutinative language in which each morpheme encodes a single feature. As a result, the language has rich morphology, such as various particles and complex conjugation forms. The example in (1) shows that each particle attached to a noun indicates a case such as nominative or accusative, while the affixes attached to a verb stem serve as functional morphemes marking tense, aspect, and mood. Another distinctive property of the language is abundant pro-drop, or zero anaphora, which is common in morphologically rich languages (Tsarfaty et al., 2010); particle omission is also frequent in colloquial speech (Lee and Song, 2012). These linguistic characteristics differ from those of fusional languages such as English and German, where a single concatenated morpheme usually carries multiple features (Comrie, 1989; Vania and Lopez, 2017).
(1) 수지-가

The word order of Korean is relatively free. While the canonical word order of the language is Subject-Object-Verb (SOV), the positions of words in a sentence can change. However, corpus and psycholinguistic studies have reported that preferred orders exist for adverbs, depending on conditions related to the meaning of the verb (Shin, 2007) and on specific adverb types such as time and place adverbs (Nam et al., 2018). Nevertheless, Korean speakers generally allow various word orders when comprehending and producing sentences.

Table B.1 shows the benefits of using an automatic system over human annotation. Building a fully human-annotated resource for grammatical error correction is difficult and costly, and an automatic system such as KAGAS is a strong alternative that avoids many disadvantages of human annotation. We therefore emphasize the advantages of an automatic error type annotation system.

B.2 Benefits of using KAGAS
• KAGAS provides a unified schema for all Korean parallel datasets. In contrast, human-annotated error types differ across datasets and are thus hard to compare.
• KAGAS makes deterministic, reproducible decisions when assigning error types, whereas human annotations can vary across annotators.
• KAGAS can be applied at no cost and produces output instantly, whereas hiring experts for annotation and validating their work takes considerable time, money, and effort. This is a particular advantage for datasets used to train neural models, where the dataset size is often too large for high-quality human annotation, and for languages other than English, where experts are expensive and difficult to hire.

B.3.1 INS & DEL
In Korean, one can omit the subject or the object of a sentence depending on the previous context, since it is a discourse-oriented language. In such cases, the sentence is grammatically correct both with and without the corresponding DEL or INS edit. There were also INS or DEL edits of unnecessary modifiers; these are likewise not grammatical corrections but variations of the sentence. For this reason, we saw no need to divide INS and DEL, which mostly account for unnecessary, non-grammatical edits, into further error types.

B.4 Detailed examples of Alignments and assignment of error types by KAGAS
We add some examples that describe how KAGAS assigns word-level POS error types from morpheme-level edits. Note that POS tagging is conducted by the Kkma POS tagger.
Please refer to software:KAGAS/pos_granularity.py for full mapping of kkma POS tags grouped into our error types.

B.5 On assignment of a single representative error type instead of multiple error types per edit
Defining error types that do not overlap with one another would be optimal, but unfortunately, defining meaningful, mutually exclusive error types for Korean is almost infeasible. (Appendix A.4.1: the NIKL corpus tags morpheme-level error types on three levels: the position of the error (오류 위치), ErrorPattern (오류 양상), and ErrorLevel (오류 층위).) Similarly, we want to clarify that the current implementation of KAGAS (software:KAGAS/scripts/align_text_korean.py#L404) can output all candidate error type classifications for an edit: its formal aspect (INS/DEL/SUB), the POS of the edit, and the nature and scope of the edit (SPELL/SHORT/the rest). Currently, these are aggregated into a single error type according to pre-defined priorities. While KAGAS can easily be extended to output multiple error types for a single edit, human evaluation and error type distribution analysis become much more complicated if all possible error types per edit are evaluated. For simplicity and clarity (and to keep the system deterministic and reliable), we decided to assign priorities and conduct human evaluation only on the highest-priority error types. Note that other works extending ERRANT to other languages also assign a single error type to each edit (Náplava et al.).

B.6 The granularity of Error Types
Our primary goal in building KAGAS was to classify error types correctly with as much coverage as possible, while keeping human evaluation of the KAGAS output reliable. The first version of KAGAS was built by consulting the Korean orthography guidelines and related work and adapting them to the ERRANT error types. It initially contained a more diverse set of error types, with multiple error types assigned per edit (e.g., SUB:VERB:FINAL_ENDING, SUB:VERB:DERIVATIONAL_ENDING, SUB:PARTICLE:OBJECTIVE, or INS:PUNCT). However, we noticed two issues that prevented the practical and reliable use of this first version, and fixing them led to the version described in the paper. First, the accuracy of the Kkma POS tagger was not good enough to guarantee the quality of such fine-grained error types; improving the tagger is beyond the scope of this work (we believe a better POS tagger would enable KAGAS to support a more detailed error type classification with high reliability). Second, we could not perform reliable human evaluation with fine-grained error types: we needed at least 26 samples per error type (13 in Kor-Native and 13 in Kor-Learner + Kor-Lang8) for a reliable evaluation. Therefore, to ensure the quality of classification by KAGAS, error types without sufficient samples were aggregated into higher categories of similar groups or left unclassified (see software:KAGAS/edit-extraction/pos_granularity.py).
C Implementation Details

C.1 Kkma POS Tagger
We use the Konlpy wrapper for the Kkma Korean POS tagger to tag part-of-speech information in a given sentence. We chose Kkma because it has the most diverse POS tag set among the konlpy POS taggers. However, the original form of a sentence cannot be recovered directly from Kkma's output: Kkma outputs morpheme-level tags and erases the whitespace of the input sentence. Therefore, after processing a sentence with Kkma, we must recover the whitespace and aggregate the morpheme-level tags to the word level. We solve this issue by utilizing morpheme-level alignment for Korean.
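A simplified sketch of the whitespace recovery follows. It assumes the concatenated morphemes reproduce the surface string exactly, which does not always hold for Kkma (e.g., contracted verb endings such as 했 = 하 + 였); the released code handles such cases with a proper morpheme-level alignment, and the function name here is our own.

```python
def attach_word_index(sentence: str, morphemes: list) -> list:
    """Map each morpheme back to the index of the word (Eojeol) it came from,
    recovering the word boundaries that the POS tagger erased."""
    words = sentence.split()
    out, w, pos = [], 0, 0
    for m in morphemes:
        if pos >= len(words[w]):   # current word fully consumed -> next word
            w, pos = w + 1, 0
        # In this simplified setting the morpheme must match the surface span;
        # real Kkma output can differ and needs fuzzy alignment instead.
        assert words[w][pos:pos + len(m)] == m, "needs real morpheme alignment"
        out.append((m, w))
        pos += len(m)
    return out

# Morphemes of "나는 학교에 간다" regain their word indices 0, 0, 1, 1, 2:
pairs = attach_word_index("나는 학교에 간다", ["나", "는", "학교", "에", "간다"])
```

Once each morpheme carries a word index, morpheme-level POS tags can be aggregated into word-level tags by grouping on that index.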

C.2 Defining the priorities between error types
We wanted our system to be highly reliable and clear given the currently available resources. Therefore, we prioritized classifying frequent orthographic error types over POS classification. After edit extraction, we use the allSplit method and merge multiple edits into one edit for word space and word order errors. For detecting spelling errors, as explained earlier, we use the Korean spellchecker dictionary. Note that proper nouns are often not included in the Korean dictionary, so spelling errors are defined in a narrower sense than they are usually thought of: we classify an edit as a spelling error only when the original span is not in the Korean dictionary but the edited word is. Consequently, corrections of proper nouns are treated as correct when they are classified as NOUN errors rather than SPELL errors.
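The narrow SPELL definition above can be stated in a few lines. The dictionary here is a toy stand-in for the Korean spellchecker wordlist, and the function name is ours.

```python
def is_spell_error(orig_span: str, corr_span: str, dictionary: set) -> bool:
    """SPELL only when the original span is out of the dictionary
    and the corrected span is in it."""
    return orig_span not in dictionary and corr_span in dictionary

KOREAN_DICT = {"학교", "안녕"}  # toy dictionary for illustration

# A misspelling corrected to a dictionary word is a SPELL error:
assert is_spell_error("학꾜", "학교", KOREAN_DICT)
# A proper noun absent from the dictionary is not a SPELL error even when
# corrected, which is why such edits fall under NOUN instead:
assert not is_spell_error("한갈", "한글", KOREAN_DICT)
```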
Some edits could be classified as more than one error type. For example, an insertion edit that adds punctuation could be classified as either an INS edit or a PUNCT edit. To avoid this ambiguity, we set priorities between error types, as follows.

• INS & DEL > the rest
• WS > WO > SPELL > SHORT > PUNCT > the rest
We informed the participants of this priority so that they could evaluate ambiguous edits accordingly during human evaluation. For the Korean-specific linguistically enhanced alignment, we proceed similarly to the English alignment system, but we define the Korean lemma cost using soylemma's lemmatizer and define the Korean content POS tags as NNG, NNP, NNB, NNM, NR, NP, VV, VA, VXV, VXA, VCP, VCN, MDT, MDN, MAG, and MAC out of the full Korean POS tag set.
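Selecting a single representative error type under the priority ordering above can be sketched as follows; the candidate labels are whichever detectors fired for an edit, and the function name is ours.

```python
# Priority order from the two bullet lists above: INS/DEL first, then the
# orthographic types, then any remaining (mostly POS-based) type.
PRIORITY = ["INS", "DEL", "WS", "WO", "SPELL", "SHORT", "PUNCT"]

def representative_type(candidates: list) -> str:
    """Return the highest-priority error type among the candidates."""
    for t in PRIORITY:
        if t in candidates:
            return t
    # No prioritized type fired: fall back to the (POS-based) candidate,
    # or UNK when nothing could be classified.
    return candidates[0] if candidates else "UNK"

# The punctuation-insertion example above resolves to INS, not PUNCT:
assert representative_type(["PUNCT", "INS"]) == "INS"
assert representative_type(["NOUN"]) == "NOUN"
assert representative_type([]) == "UNK"
```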

C.3 Qualitative analysis on user evaluation.
Figure 4: Demo interface used for the KAGAS system evaluation, translated into English. Figure 4 shows the evaluation demo interface that we used for human evaluation. We gave the evaluators the full list of error types and asked them to mark each error type classification as either 'good' or 'bad'.

C.3.2 About low-performing cases
Overall, the participants marked error types that can easily be identified by their surface forms as 'good' at a higher rate, and error types related to POS tags as 'bad'. After manually inspecting the edits classified as 'bad' by the Korean experts, we found that most of them were due to limitations of the POS tagger: it often fails to tag the correct POS for edited words, especially when a word contains a spelling error or is a pronoun. This explains the lower acceptance rates for POS-related error types such as ADJ, NOUN, VERB, and CONJ. Also, after the main evaluation, we additionally asked the participants to classify edits marked as UNK, i.e., edits that KAGAS was unable to classify into any error type. The participants classified most of the UNK edits as spelling errors. Since a Korean word has many inflectional forms, the current dictionary-based spellchecker fails to identify all SPELL edits. Therefore, we believe that KAGAS will benefit from improvements to the Korean POS tagger and spellchecker.

Table D.1: We also report the evaluation results on the valid sets with the highest GLEU score. KoBART outputs are averaged over 3 different seeds. We also report the total training time for 10 epochs in the total time column.

C.3.3 About selection of sentences for evaluation
For simplicity and clarity for annotators, we selected sentences with a single edit for each error type from our dataset for human evaluation. One concern could be selection bias: straightforward cases might be selected for evaluation. We first clarify that our 14 error types are entirely defined by local edits; in other words, the error type classification output of KAGAS is not affected by adjacent words or sentence structure (POS tagging is performed word-wise, and INS/DEL edits are not divided further). Therefore, we carefully argue that the validity of KAGAS is not affected by the number of edits, and sentences with one edit can sufficiently represent the entire data, since the goal of the human evaluation is to assess whether KAGAS correctly classifies word-level edits.

D Experimental Details

D.1 Details of Experimental Settings
We used computational infrastructure with a 4-core CPU, 13GB of memory, and one GPU (NVIDIA Tesla V100). All reported models are run on one GPU. We use the KoBART pretrained model and KoBART tokenizer. We allocate 70% of each dataset to train, 15% to test, and 15% to validation using the Python scikit-learn function sklearn.model_selection.train_test_split. GLEU (Napoles et al., 2015) scores are evaluated with the official GitHub repository, and M2 scores (Dahlmeier and Ng, 2012) are also evaluated with the official repository.
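The 70/15/15 split can be reproduced as sketched below. For self-containedness we use only the standard library here; the paper's sklearn.model_selection.train_test_split achieves the same effect when called twice (70/30, then the 30% in half), and the seed value is our own assumption.

```python
import random

def split_dataset(pairs, seed=42):
    """Shuffle and split a dataset into 70% train, 15% test, 15% valid."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # deterministic shuffle
    n = len(pairs)
    n_train = int(n * 0.70)
    n_test = int(n * 0.15)
    train = pairs[:n_train]
    test = pairs[n_train:n_train + n_test]
    valid = pairs[n_train + n_test:]
    return train, test, valid

train, test, valid = split_dataset(range(100))
assert (len(train), len(test), len(valid)) == (70, 15, 15)
```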

D.2 Tokenizer issue on punctuation space recovery
Below is an example of the encoded and decoded outputs of the tokenizer.
D.3 About pre-training with wikipedia dataset.
Since GEC suffers from a lack of parallel data, we also tried pre-training our model on Wikipedia edit pairs (Lichtarge et al., 2019) with a learning rate of 1e-05 and then fine-tuning for 10 epochs on each individual dataset. However, we found that KoBART is already a very strong pretrained model, and the benefit from training on Wikipedia edit pairs is small. Therefore, we decided not to use the Korean Wikipedia edit pairs in our baseline experiments.

D.4 Further analysis on model outputs.
Accuracies improve linearly with the proportion of the training dataset. Kor-Native scores notably high on WS, while Kor-Learner scores poorly; according to Figure 1, Kor-Native has a large proportion of WS edits while Kor-Learner has only a small fraction. The same trend holds for PART on Kor-Learner compared with Kor-Lang8. Table D.2 is obtained by relating the individual error type proportions to the GLEU scores shown in Figure 5. According to Table D.2, there is a positive correlation between the distribution of error types and the individual performance for Kor-Learner and Kor-Native. This means that when building a GEC model, the training dataset distribution should be chosen according to the error types on which high performance is desired. For example, if a model that performs well on ADJ errors is needed, the Kor-Learner dataset should be used; if a model that corrects WS errors well is needed, the Kor-Native dataset should be used; and to correct informal errors from KFL learners, Kor-Lang8 is the best choice, while Kor-Native is better for native speaker errors. Therefore, we believe all three datasets serve their own purposes, and we provide them as three separate datasets without unifying them.

Comparison with KoBART and KoBART + Kor-Union
The results for KoBART + Kor-Union come from the model fine-tuned twice: first on Kor-Union and then on each individual dataset. As Table 6 shows, KoBART + Kor-Union improves precision and F0.5 scores over KoBART fine-tuned directly on each individual dataset, meaning that the three datasets together can help improve performance on each individual dataset. The per-error-type analysis for KoBART + Kor-Union shows trends similar to KoBART (Table E.1). Figure 6 shows the test set heatmap for KoBART + Kor-Union; the trends are similar to those of KoBART. Figure 7 shows the validation set heatmap of KoBART compared with Hanspell. The full error type scores are given in Table E.1, which first lists the occurrence counts in the validation sets used for generation and for the heatmap illustrations, and then the full scores for all datasets and all methods.

E Ethical Consideration
We obtained IRB approval for the KAGAS human evaluation.