KyungTae Lim - ACL Anthology

KyungTae Lim

Also published as: Kyungtae Lim

2026

TReX: Tokenizer Regression for Optimal Data Mixture
Inho Won | Hangyeol Yoo | Minkyung Cho | Jungyeul Park | Hoyun Song | KyungTae Lim
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Building effective tokenizers for multilingual Large Language Models (LLMs) requires careful control over language-specific data mixtures. While a tokenizer’s compression performance critically affects the efficiency of LLM training and inference, existing approaches rely on heuristics or costly large-scale searches to determine optimal language ratios. We introduce Tokenizer Regression for Optimal Data MiXture (TReX), a regression-based framework that efficiently predicts the optimal data mixture for tokenizer training. TReX trains small-scale proxy tokenizers on random mixtures, gathers their compression statistics, and learns to predict compression performance from data mixtures. This learned model enables scalable mixture search before large-scale tokenizer training, mitigating the accuracy-cost trade-off in multilingual tokenizer design. Tokenizers trained with TReX’s predicted mixtures outperform mixtures based on LLaMA3 and uniform distributions by up to 12% in both in- and out-of-distribution compression efficiency, demonstrating strong scalability, robustness, and practical effectiveness.
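The proxy-then-regress loop described in this abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: only two languages are mixed, `proxy_compression` is a made-up stand-in for training a small proxy tokenizer and measuring its compression, and the regressor is a plain quadratic least-squares fit. The abstract does not specify the actual features or regression model, so this is not the authors' implementation.

```python
import random


def proxy_compression(ratio):
    """Hypothetical stand-in for training a small proxy tokenizer on a mixture
    containing `ratio` of language A and measuring tokens per byte (lower is
    better). A real pipeline would train and evaluate a tokenizer here."""
    return 0.30 + 0.5 * (ratio - 0.4) ** 2


def solve_linear(a, b):
    """Solve the small dense system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_quadratic(xs, ys):
    """Least-squares fit of y ~ c0 + c1*x + c2*x^2 via the normal equations."""
    feats = [[1.0, x, x * x] for x in xs]
    ata = [[sum(f[i] * f[j] for f in feats) for j in range(3)] for i in range(3)]
    aty = [sum(f[i] * y for f, y in zip(feats, ys)) for i in range(3)]
    return solve_linear(ata, aty)


def predict(coef, x):
    """Predicted compression score for a mixture ratio x."""
    return coef[0] + coef[1] * x + coef[2] * x * x


# 1) Sample random mixtures and gather proxy compression statistics.
rng = random.Random(0)
ratios = [rng.random() for _ in range(20)]
scores = [proxy_compression(r) for r in ratios]

# 2) Learn to predict compression from the mixture ratio.
coef = fit_quadratic(ratios, scores)

# 3) Search the cheap learned model instead of training more tokenizers.
grid = [i / 100 for i in range(101)]
best_ratio = min(grid, key=lambda r: predict(coef, r))
```

Because the stand-in function is noise-free and built around a ratio of 0.4, the search recovers that value. With real proxy tokenizers the measurements would be noisy and the mixture would span many languages, so the regressor and search strategy would need to be chosen accordingly.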

TELLME: Test-Enhanced Learning for Language Model Enrichment
Minjun Kim | Inho Won | HyeonSeok Lim | MinKyu Kim | Junghun Yuk | Wooyoung Go | Jongyoul Park | Jungyeul Park | KyungTae Lim
Findings of the Association for Computational Linguistics: EACL 2026

Continual pre-training (CPT) has been widely adopted as a method for domain expansion in large language models. However, CPT has consistently been accompanied by challenges, such as the difficulty of acquiring large-scale domain-specific datasets and high computational costs. In this study, we propose a novel method called Test-Enhanced Learning for Language Model Enrichment (TELLME) to alleviate these issues. TELLME leverages the Test-Enhanced Learning (TEL) principle, whereby the model’s learning efficiency is improved using quizzes during training. It integrates this principle with CPT, thereby promoting efficient domain-specific knowledge acquisition and long-term memory retention. Experimental results demonstrate that TELLME outperforms existing methods by up to 23.6% in the financial domain and achieves a 9.8% improvement in long-term memory retention.

Beyond Accuracy: Alignment and Error Detection across Languages in the Bi-GSM8K Math-Teaching Benchmark
Jieun Park | KyungTae Lim | Joon-ho Lim
Findings of the Association for Computational Linguistics: EACL 2026

Recent advancements in LLMs have significantly improved mathematical problem-solving, with models like GPT-4 achieving human-level performance. However, proficiently solving mathematical problems differs fundamentally from effectively teaching mathematics. To bridge this gap, we introduce the Bi-GSM8K benchmark, a bilingual English-Korean dataset enriched with teacher solutions, student solutions, and annotations marking students’ initial errors. This dataset is designed to evaluate two core capabilities of LLMs: (1) measuring similarity between student and teacher solutions, and (2) identifying the initial error point in student solutions. Our method achieves high agreement with human judgments, with Pearson 0.89 and Spearman 0.88 on English, and Pearson 0.89 and Spearman 0.87 on Korean. It also offers significantly lower latency and resource usage than commercial APIs, demonstrating strong computational efficiency. In the error detection task, open-source models achieved approximately 86% accuracy, with performance within 10 percentage points of commercial LLM APIs, suggesting strong practical potential. Our key contributions include the open-source release of Bi-GSM8K, novel evaluation metrics, and comparative analyses of LLM performance across languages.
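Agreement with human judgments is reported here as Pearson and Spearman correlations. As a rough illustration of how such figures are computed, the sketch below implements both from scratch. The `human` and `metric` lists are made-up example values, not data from the benchmark.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up example: human similarity ratings vs. automatic similarity scores.
human = [1, 2, 2, 4, 5]
metric = [0.10, 0.35, 0.30, 0.80, 0.90]
print(f"Pearson={pearson(human, metric):.3f} Spearman={spearman(human, metric):.3f}")
```

Pearson measures linear agreement, while Spearman only checks whether the two score lists order the items the same way, which is why both are commonly reported together.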

ELO: Efficient Layer-Specific Optimization for Continual Pretraining of Multilingual LLMs
Hangyeol Yoo | ChangSu Choi | Minjun Kim | Seohyun Song | SeungWoo Song | Inho Won | Jongyoul Park | Cheoneum Park | KyungTae Lim
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track)

We propose an efficient layer-specific optimization (ELO) method designed to enhance continual pretraining (CP) for specific languages in multilingual large language models (MLLMs). This approach addresses the common challenges of high computational cost and degradation of source language performance associated with traditional CP. The ELO method consists of two main stages: (1) ELO Pretraining, where a small subset of specific layers, identified in our experiments as the critically important first and last layers, is detached from the original MLLM and trained with the target language. This significantly reduces not only the number of trainable parameters but also the total parameters computed during the forward pass, minimizing GPU memory consumption and accelerating the training process. (2) Layer Alignment, where the newly trained layers are reintegrated into the original model, followed by a brief full fine-tuning step on a small dataset to align the parameters. Experimental results demonstrate that the ELO method achieves a training speedup of up to 6.46 times compared to existing methods, while improving target language performance by up to 6.2% on qualitative benchmarks and effectively preserving source language (English) capabilities.

2025

Can LLMs Truly Plan? A Comprehensive Evaluation of Planning Capabilities
Gayeon Jung | HyeonSeok Lim | Minjun Kim | Joon-ho Lim | KyungTae Lim | Hansaem Kim
Findings of the Association for Computational Linguistics: EMNLP 2025

The existing assessments of planning capabilities of large language models (LLMs) remain largely limited to single-language or specific representation formats. To address this gap, we introduce the Multi-Plan benchmark comprising 204 multilingual and multi-format travel planning scenarios. In experimental results obtained with state-of-the-art LLMs, the Multi-Plan benchmark effectively highlights the performance disparities among models, notably showing superior results for reasoning-specialized models. Interestingly, language differences exhibited minimal impact, whereas mathematically structured representations significantly improved planning accuracy for most models, underscoring the crucial role of the input format. These findings enhance our understanding of planning abilities of LLMs, offer valuable insights for future research, and emphasize the need for more sophisticated AI evaluation methods. This dataset is publicly available at http://huggingface.co/datasets/Bllossom/Multi-Plan.

SCV: Light and Effective Multi-Vector Retrieval with Sequence Compressive Vectors
Cheoneum Park | Seohyeong Jeong | Minsang Kim | KyungTae Lim | Yong-Hun Lee
Proceedings of the 31st International Conference on Computational Linguistics: Industry Track

Recent advances in language models (LMs) have driven progress in information retrieval (IR), effectively extracting semantically relevant information. However, they face challenges in balancing computational costs with deeper query-document interactions. To tackle this, we present two mechanisms: 1) a light and effective multi-vector retrieval with sequence compression vectors, dubbed SCV, and 2) coarse-to-fine vector search. The strength of SCV stems from its application of span compressive vectors for scoring. By employing a non-linear operation to examine every token in the document, we abstract these into a span-level representation. These vectors effectively reduce the document’s dimensional representation, enabling the model to engage comprehensively with tokens across the entire collection of documents, rather than the subset retrieved by Approximate Nearest Neighbor. Therefore, our framework performs a coarse single vector search during the inference stage and conducts a fine-grained multi-vector search end-to-end. This approach effectively reduces the cost required for search. We empirically show that SCV achieves the fastest latency compared to other state-of-the-art models and can obtain competitive performance on both in-domain and out-of-domain benchmark datasets.

Unified Automated Essay Scoring and Grammatical Error Correction
SeungWoo Song | Junghun Yuk | ChangSu Choi | HanGyeol Yoo | HyeonSeok Lim | KyungTae Lim | Jungyeul Park
Findings of the Association for Computational Linguistics: NAACL 2025

This study explores the integration of automated writing evaluation (AWE) and grammatical error correction (GEC) through multitask learning, demonstrating how combining these distinct tasks can enhance performance in both areas. By leveraging a shared learning framework, we show that models trained jointly on AWE and GEC outperform those trained on each task individually. To support this effort, we introduce a dataset specifically designed for multitask learning using AWE and GEC. Our experiments reveal significant synergies between tasks, leading to improvements in both writing assessment accuracy and error correction precision. This research represents a novel approach for optimizing language learning tools by unifying writing evaluation and correction tasks, offering insights into the potential of multitask learning in educational applications.

ScholarBench: A Bilingual Benchmark for Abstraction, Comprehension, and Reasoning Evaluation in Academic Contexts
Dongwon Noh | Donghyeok Koh | Junghun Yuk | Gyuwan Kim | Jae Yong Lee | KyungTae Lim | Cheoneum Park
Findings of the Association for Computational Linguistics: EMNLP 2025

Prior benchmarks for evaluating the domain-specific knowledge of large language models (LLMs) lack the scalability to handle complex academic tasks. To address this, we introduce ScholarBench, a benchmark centered on deep expert knowledge and complex academic problem-solving, which evaluates the academic reasoning ability of LLMs and is constructed through a three-step process. ScholarBench targets more specialized and logically complex contexts derived from academic literature, encompassing five distinct problem types. Unlike prior benchmarks, ScholarBench evaluates the abstraction, comprehension, and reasoning capabilities of LLMs across eight distinct research domains. To ensure high-quality evaluation data, we define category-specific example attributes and design questions that are aligned with the characteristic research methodologies and discourse structures of each domain. Additionally, this benchmark operates as an English-Korean bilingual dataset, facilitating simultaneous evaluation for linguistic capabilities of LLMs in both languages. The benchmark comprises 5,031 examples in Korean and 5,309 in English, with even state-of-the-art models like o3-mini achieving an average evaluation score of only 0.543, demonstrating the challenging nature of this benchmark.

Unlocking Korean Verbs: A User-Friendly Exploration into the Verb Lexicon
Seohyun Song | Eunkyul Leah Jo | Yige Chen | Jeen-Pyo Hong | Kyuwon Kim | Jin Wee | Kang Miyoung | KyungTae Lim | Jungyeul Park | Chulwoo Park
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)

The Sejong dictionary dataset offers a valuable resource, providing extensive coverage of morphology, syntax, and semantic representation. This dataset can be utilized to explore linguistic information in greater depth. The labeled linguistic structures within this dataset form the basis for uncovering relationships between words and phrases and their associations with target verbs. This paper introduces a user-friendly web interface designed for the collection and consolidation of verb-related information, with a particular focus on subcategorization frames. Additionally, it outlines our efforts in mapping this information by aligning subcategorization frames with corresponding illustrative sentence examples. Furthermore, we provide a Python library that would simplify syntactic parsing and semantic role labeling. These tools are intended to assist individuals interested in harnessing the Sejong dictionary dataset to develop applications for Korean language processing.

VLR-Bench: Multilingual Benchmark Dataset for Vision-Language Retrieval Augmented Generation
Hyeonseok Lim | Dongjae Shin | Seohyun Song | Inho Won | Minjun Kim | Junghun Yuk | Haneol Jang | KyungTae Lim
Proceedings of the 31st International Conference on Computational Linguistics

We propose the VLR-Bench, a visual question answering (VQA) benchmark for evaluating vision language models (VLMs) based on retrieval augmented generation (RAG). Unlike existing evaluation datasets for external knowledge-based VQA, the proposed VLR-Bench includes five input passages. This allows testing of the ability to determine which passage is useful for answering a given query, a capability lacking in previous research. In this context, we constructed a dataset of 32,000 automatically generated instruction-following examples, which we denote as VLR-IF. This dataset is specifically designed to enhance the RAG capabilities of VLMs by enabling them to learn how to generate appropriate answers based on input passages. We evaluated the validity of the proposed benchmark and training data and verified its performance using the state-of-the-art Llama3-based VLM, the Llava-Llama-3 model. The proposed VLR-Bench and VLR-IF datasets are publicly available online.

2024

Optimizing Language Augmentation for Multilingual Large Language Models: A Case Study on Korean
ChangSu Choi | Yongbin Jeong | Seoyoon Park | Inho Won | HyeonSeok Lim | SangMin Kim | Yejee Kang | Chanhyuk Yoon | Jaewan Park | Yiseul Lee | HyeJin Lee | Younggyun Hahm | Hansaem Kim | KyungTae Lim
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Large language models (LLMs) use pretraining to predict the subsequent word; however, their expansion requires significant computing resources. Numerous big tech companies and research institutes have developed multilingual LLMs (MLLMs) to meet current demands, overlooking less-resourced languages (LRLs). This study proposed three strategies to enhance the performance of LRLs based on the publicly available MLLMs. First, the MLLM vocabularies of LRLs were expanded to enhance expressiveness. Second, bilingual data were used for pretraining to align the high- and less-resourced languages. Third, a high-quality small-scale instruction dataset was constructed and instruction-tuning was performed to augment the LRL. The experiments employed the Llama2 model and Korean was used as the LRL, which was quantitatively evaluated against other developed LLMs across eight tasks. Furthermore, a qualitative assessment was performed based on human evaluation and GPT4. Experimental results showed that our proposed Bllossom model exhibited superior performance in qualitative analyses compared to previously proposed Korean monolingual models.

A Linguistically-Informed Annotation Strategy for Korean Semantic Role Labeling
Yige Chen | KyungTae Lim | Jungyeul Park
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Semantic role labeling is an essential component of semantic and syntactic processing of natural languages, which reveals the predicate-argument structure of the language. Despite its importance, semantic role labeling for the Korean language has not been studied extensively. One notable issue is the lack of uniformity among data annotation strategies across different datasets, which often lack thorough rationales. In this study, we suggest an annotation strategy for Korean semantic role labeling that is in line with the previously proposed linguistic theories as well as the distinct properties of the Korean language. We further propose a simple yet viable conversion strategy from the Sejong verb dictionary to a CoNLL-style dataset for Korean semantic role labeling. Experiment results using a transformer-based sequence labeling model demonstrate the reliability and trainability of the converted dataset.

Towards Standardized Annotation and Parsing for Korean FrameNet
Yige Chen | Jae Ihn | KyungTae Lim | Jungyeul Park
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Previous research on Korean FrameNet has produced several datasets that serve as resources for FrameNet parsing in Korean. However, these datasets suffer from the problem that annotations are assigned on the word level, which is not optimally designed based on the agglutinative feature of Korean. To address this issue, we introduce a morphologically enhanced annotation strategy for Korean FrameNet datasets and parsing by leveraging the CoNLL-U format. We present the results of the FrameNet parsers trained on the Korean FrameNet data in the original format and our proposed format, respectively, and further elaborate on the linguistic rationales of our proposed scheme. We suggest the morpheme-based scheme to be the standard of Korean FrameNet data annotation.

When the Misidentified Adverbial Phrase Functions as a Complement
Yige Chen | Kyuwon Kim | KyungTae Lim | Jungyeul Park | Chulwoo Park
Findings of the Association for Computational Linguistics: EMNLP 2024

This study investigates the predicate-argument structure in Korean language processing. Despite the importance of distinguishing mandatory arguments and optional modifiers in sentences, research in this area has been limited. We introduce a dataset with token-level annotations which labels mandatory and optional elements as complements and adjuncts, respectively. Particularly, we reclassify certain Korean phrases, previously misidentified as adverbial phrases, as complements, addressing misuses of the term adjunct in existing Korean treebanks. Utilizing a Korean dependency treebank, we develop an automatic labeling technique for complements and adjuncts. Experiments using the proposed dataset yield satisfying results, demonstrating that the dataset is trainable and reliable.

X-LLaVA: Optimizing Bilingual Large Vision-Language Alignment
DongJae Shin | HyeonSeok Lim | Inho Won | ChangSu Choi | Minjun Kim | SeungWoo Song | HanGyeol Yoo | SangMin Kim | KyungTae Lim
Findings of the Association for Computational Linguistics: NAACL 2024

The impressive development of large language models (LLMs) is expanding into the realm of large multimodal models (LMMs), which incorporate multiple types of data beyond text. However, the nature of multimodal models leads to significant expenses in the creation of training data. Furthermore, constructing multilingual data for LMMs presents its own set of challenges due to language diversity and complexity. Therefore, in this study, we propose two cost-effective methods to solve this problem: (1) vocabulary expansion and pretraining of multilingual LLM for specific languages, and (2) automatic and elaborate construction of multimodal datasets using GPT4-V. Based on these methods, we constructed a 91K English-Korean-Chinese multilingual, multimodal training dataset. Additionally, we developed a bilingual multimodal model that exhibits excellent performance in both Korean and English, surpassing existing approaches.

2023

K-UniMorph: Korean Universal Morphology and its Feature Schema
Eunkyul Leah Jo | Kyuwon Kim | Xihan Wu | KyungTae Lim | Jungyeul Park | Chulwoo Park
Findings of the Association for Computational Linguistics: ACL 2023

We present in this work a new Universal Morphology dataset for Korean. Previously, the Korean language has been underrepresented in the field of morphological paradigms amongst hundreds of diverse world languages. Hence, we propose Universal Morphological paradigms for the Korean language that preserve its distinct characteristics. For our K-UniMorph dataset, we outline each grammatical criterion in detail for the verbal endings, clarify how to extract inflected forms, and demonstrate how we generate the morphological schemata. This dataset adopts morphological feature schema from CITATION and CITATION for the Korean language as we extract inflected verb forms from the Sejong morphologically analyzed corpus, which is one of the largest annotated corpora for Korean. During the data creation, our methodology also includes investigating the correctness of the conversion from the Sejong corpus. Furthermore, we carry out the inflection task using three different Korean word forms: letters, syllables and morphemes. Finally, we discuss and describe future perspectives on Korean morphological paradigms and the dataset.

Teddysum at MEDIQA-Chat 2023: an analysis of fine-tuning strategy for long dialog summarization
Yongbin Jeong | Ju-Hyuck Han | Kyung Min Chae | Yousang Cho | Hyunbin Seo | KyungTae Lim | Key-Sun Choi | Younggyun Hahm
Proceedings of the 5th Clinical Natural Language Processing Workshop

In this paper, we introduce the design and various attempts for TaskB of MEDIQA-Chat 2023. The goal of TaskB in MEDIQA-Chat 2023 is to generate a full clinical note from doctor-patient consultation dialogues. This task has several challenging issues, such as lack of training data, handling long dialogue inputs, and generating semi-structured clinical notes that have section heads. To address these issues, we conducted various experiments and analyzed their results. We utilized the DialogLED model pre-trained on long dialogue data to handle long inputs, and we pre-trained on other dialogue datasets to address the lack of training data. We also attempted methods such as using prompts and contrastive learning for handling sections. This paper provides insights into clinical note generation through analyzing experimental methods and results, and it suggests future research directions.

2022

Yet Another Format of Universal Dependencies for Korean
Yige Chen | Eunkyul Leah Jo | Yundong Yao | KyungTae Lim | Miikka Silfverberg | Francis M. Tyers | Jungyeul Park
Proceedings of the 29th International Conference on Computational Linguistics

In this study, we propose a morpheme-based scheme for Korean dependency parsing and adopt the proposed scheme to Universal Dependencies. We present the linguistic rationale that illustrates the motivation and the necessity of adopting the morpheme-based format, and develop scripts that convert between the original format used by Universal Dependencies and the proposed morpheme-based format automatically. The effectiveness of the proposed format for Korean dependency parsing is then verified with both statistical and neural models, including UDPipe and Stanza, with our carefully constructed morpheme-based word embedding for Korean. morphUD outperforms parsing results for all Korean UD treebanks, and we also present detailed error analysis.

Efficient Multilingual Multi-modal Pre-training through Triple Contrastive Loss
Youhan Lee | KyungTae Lim | Woonhyuk Baek | Byungseok Roh | Saehoon Kim
Proceedings of the 29th International Conference on Computational Linguistics

Learning visual and textual representations in the shared space from web-scale image-text pairs improves the performance of diverse vision-and-language tasks, as well as modality-specific tasks. Many attempts in this framework have been made to connect English-only texts and images, and only a few works have been proposed to extend this framework in multilingual settings with the help of many translation pairs. In this multilingual approach, a typical setup is to use pairs of (image and English-text) and translation pairs. The major limitation of this approach is that the learning signal of aligning visual representation with under-resourced language representation is not strong, resulting in sub-optimal performance on vision-and-language tasks. In this work, we propose a simple yet effective enhancement scheme for previous multilingual multi-modal representation methods by using a limited number of pairs of images and non-English texts. Specifically, our scheme fine-tunes a pre-trained multilingual model by minimizing a triplet contrastive loss on triplets of image and two different language texts with the same meaning, improving the connection between images and non-English texts. Experiments confirm that our enhancement strategy achieves performance gains in image-text retrieval, zero-shot image classification, and sentence embedding tasks.
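The abstract does not give the formula for the triplet contrastive loss, so the following is only one plausible reading, sketched under stated assumptions: each batch item is a triplet (image, English text, non-English text) of embedding vectors, and the loss sums a symmetric InfoNCE term over the three pairings. The function names, the temperature value, and the choice of InfoNCE are all hypothetical.

```python
from math import exp, log, sqrt


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, targets, temperature=0.1):
    """Mean cross-entropy of matching anchors[i] to targets[i] against every
    target in the batch (one direction of a contrastive loss)."""
    a = [normalize(v) for v in anchors]
    t = [normalize(v) for v in targets]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
        log_denom = log(sum(exp(l) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


def triple_contrastive_loss(images, texts_en, texts_xx, temperature=0.1):
    """Assumed form: symmetric InfoNCE summed over the three pairings of
    image, English text, and non-English text embeddings."""
    pairs = [(images, texts_en), (images, texts_xx), (texts_en, texts_xx)]
    return sum(
        info_nce(x, y, temperature) + info_nce(y, x, temperature)
        for x, y in pairs
    )


# Toy 2-d embeddings: the aligned batch should score lower than a batch whose
# non-English texts are attached to the wrong images.
images = [[1.0, 0.0], [0.0, 1.0]]
texts_en = [[0.9, 0.1], [0.1, 0.9]]
aligned = triple_contrastive_loss(images, texts_en, [[0.8, 0.2], [0.2, 0.8]])
swapped = triple_contrastive_loss(images, texts_en, [[0.2, 0.8], [0.8, 0.2]])
```

In a real training setup these would be differentiable tensor operations inside a deep learning framework; plain Python lists are used here only to keep the arithmetic visible.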

2018

The First Komi-Zyrian Universal Dependencies Treebanks
Niko Partanen | Rogier Blokland | KyungTae Lim | Thierry Poibeau | Michael Rießler
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

Two Komi-Zyrian treebanks were included in the Universal Dependencies 2.2 release. This article contextualizes the treebanks, discusses the process through which they were created, and outlines the future plans and timeline for the next improvements. Special attention is paid to the possibilities of using UD in the documentation and description of endangered languages.

Multilingual Dependency Parsing for Low-Resource Languages: Case Studies on North Saami and Komi-Zyrian
KyungTae Lim | Niko Partanen | Thierry Poibeau
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Analyse syntaxique de langues faiblement dotées à partir de plongements de mots multilingues [Syntactic analysis of under-resourced languages from multilingual word embeddings]
KyungTae Lim | Niko Partanen | Thierry Poibeau
Traitement Automatique des Langues, Volume 59, Numéro 3 : Traitement automatique des langues peu dotées [NLP for Under-Resourced Languages]

Dependency Parsing of Code-Switching Data with Cross-Lingual Feature Representations
Niko Partanen | Kyungtae Lim | Michael Rießler | Thierry Poibeau
Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages

Affordances in Grounded Language Learning
Stephen McGregor | KyungTae Lim
Proceedings of the Eight Workshop on Cognitive Aspects of Computational Language Learning and Processing

We present a novel methodology involving mappings between different modes of semantic representation. We propose distributional semantic models as a mechanism for representing the kind of world knowledge inherent in the system of abstract symbols characteristic of a sophisticated community of language users. Then, motivated by insight from ecological psychology, we describe a model approximating affordances, by which we mean a language learner’s direct perception of opportunities for action in an environment. We present a preliminary experiment involving mapping between these two representational modalities, and propose that our methodology can become the basis for a cognitively inspired model of grounded language learning.

SEx BiST: A Multi-Source Trainable Parser with Deep Contextualized Lexical Representations
KyungTae Lim | Cheoneum Park | Changki Lee | Thierry Poibeau
Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

We describe the SEx BiST parser (Semantically EXtended Bi-LSTM parser) developed at Lattice for the CoNLL 2018 Shared Task (Multilingual Parsing from Raw Text to Universal Dependencies). The main characteristic of our work is the encoding of three different modes of contextual information for parsing: (i) Treebank feature representations, (ii) Multilingual word representations, (iii) ELMo representations obtained via unsupervised learning from external resources. Our parser performed well in the official end-to-end evaluation (73.02 LAS – 4th/26 teams, and 78.72 UAS – 2nd/26); remarkably, we achieved the best UAS scores on all the English corpora by applying the three suggested feature representations. Finally, we were also ranked 1st at the optional event extraction task, part of the 2018 Extrinsic Parser Evaluation campaign.
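The LAS and UAS figures quoted here are labeled and unlabeled attachment scores. The sketch below shows the basic idea under a simplifying assumption: gold and predicted parses share the same tokenization, so each score reduces to per-token accuracy. The official shared-task evaluation parses from raw text and reports F1 over aligned words, which this sketch does not reproduce. The function name and the example sentence are illustrative.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) as percentages.

    `gold` and `predicted` are lists of (head_index, dependency_label) pairs,
    one per token, assuming identical tokenization on both sides.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    n = len(gold)
    # UAS counts tokens whose head is correct, ignoring the label.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS additionally requires the dependency label to match.
    las_hits = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * uas_hits / n, 100.0 * las_hits / n


# Four-token example: one wrong label (token 3) and one wrong head (token 4).
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (3, "punct")]
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")  # prints "UAS=75.00 LAS=50.00"
```

LAS can never exceed UAS, since every token counted for LAS also has a correct head.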

2017

A System for Multilingual Dependency Parsing based on Bidirectional LSTM Feature Representations
KyungTae Lim | Thierry Poibeau
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

In this paper, we present our multilingual dependency parser developed for the CoNLL 2017 UD Shared Task dealing with “Multilingual Parsing from Raw Text to Universal Dependencies”. Our parser extends the monolingual BIST-parser as a multi-source multilingual trainable parser. Thanks to multilingual word embeddings and one hot encodings for languages, our system can use both monolingual and multi-source training. We trained 69 monolingual language models and 13 multilingual models for the shared task. Our multilingual approach making use of different resources yields better results than the monolingual approach for 11 languages. Our system ranked 5th and achieved 70.93 overall LAS score over the 81 test corpora (macro-averaged LAS F1 score).

2014

Named Entity Corpus Construction using Wikipedia and DBpedia Ontology
Younggyun Hahm | Jungyeul Park | Kyungtae Lim | Youngsik Kim | Dosam Hwang | Key-Sun Choi
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we propose a novel method to automatically build a named entity corpus based on the DBpedia ontology. Since most named entity recognition systems require time- and effort-consuming annotation tasks to produce training data, work on NER has thus far been limited to certain languages like English that are resource-abundant in general. As an alternative, we suggest that the NE corpus generated by our proposed method can be used as training data. Our approach introduces Wikipedia as a raw text and uses the DBpedia data set for named entity disambiguation. Our method is language-independent and easy to apply to many different languages where Wikipedia and DBpedia are provided. Throughout the paper, we demonstrate that our NE corpus is of comparable quality even to the manually annotated NE corpus.

2012

Korean NLP2RDF Resources
YoungGyun Hahm | KyungTae Lim | Jungyeul Park | Yongun Yoon | Key-Sun Choi
Proceedings of the 10th Workshop on Asian Language Resources

Co-authors

Younggyun Hahm 4

Cheoneum Park 4

Niko Partanen 4

Eunkyul Leah Jo 3

SeungWoo Song 3

Yongbin Jeong 2

Jongyoul Park 2

Michael Rießler 2

Woonhyuk Baek 1

Rogier Blokland 1

Kyung Min Chae 1

Jeen-Pyo Hong 1

Seohyeong Jeong 1

Donghyeok Koh 1

Stephen McGregor 1

Byungseok Roh 1

Miikka Silfverberg 1

Francis Tyers 1

Chanhyuk Yoon 1
