Seohyun Song
2026
ELO: Efficient Layer-Specific Optimization for Continual Pretraining of Multilingual LLMs
Hangyeol Yoo | ChangSu Choi | Minjun Kim | Seohyun Song | SeungWoo Song | Inho Won | Jongyoul Park | Cheoneum Park | KyungTae Lim
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track)
We propose an efficient layer-specific optimization (ELO) method designed to enhance continual pretraining (CP) for specific languages in multilingual large language models (MLLMs). This approach addresses the common challenges of high computational cost and degradation of source language performance associated with traditional CP. The ELO method consists of two main stages: (1) ELO Pretraining, where a small subset of specific layers, identified in our experiments as the critically important first and last layers, is detached from the original MLLM and trained on the target language. This significantly reduces not only the number of trainable parameters but also the total number of parameters computed during the forward pass, minimizing GPU memory consumption and accelerating training. (2) Layer Alignment, where the newly trained layers are reintegrated into the original model, followed by a brief full fine-tuning step on a small dataset to align the parameters. Experimental results demonstrate that the ELO method achieves a training speedup of up to 6.46 times compared to existing methods, while improving target language performance by up to 6.2% on qualitative benchmarks and effectively preserving source language (English) capabilities.
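The two-stage procedure described in this abstract can be sketched in code. Below is a minimal, hypothetical PyTorch illustration of the idea, not the authors' implementation: `ToyLM`, `ELOSubmodel`, the layer sizes, and the choice to share the embedding and head with the detached sub-network are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class ToyLM(nn.Module):
    """A small stand-in for an MLLM: embedding, a stack of layers, LM head."""
    def __init__(self, vocab=1000, d=64, n_layers=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
            for _ in range(n_layers)
        )
        self.head = nn.Linear(d, vocab)

    def forward(self, x):
        h = self.embed(x)
        for layer in self.layers:
            h = layer(h)
        return self.head(h)

class ELOSubmodel(nn.Module):
    """Stage 1 sketch: only the first and last layers (plus embedding and
    head) take part in the forward pass; the middle layers are skipped
    entirely, which is what cuts memory and compute."""
    def __init__(self, base: ToyLM):
        super().__init__()
        self.embed = base.embed                                    # shared weights
        self.layers = nn.ModuleList([base.layers[0], base.layers[-1]])
        self.head = base.head

    def forward(self, x):
        h = self.embed(x)
        for layer in self.layers:
            h = layer(h)
        return self.head(h)

base = ToyLM()
sub = ELOSubmodel(base)

# One dummy target-language training step on the detached sub-network.
opt = torch.optim.AdamW(sub.parameters(), lr=1e-4)
tokens = torch.randint(0, 1000, (2, 16))
logits = sub(tokens[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1)
)
opt.zero_grad()
loss.backward()
opt.step()

# Stage 2 sketch: the sub-network shares parameter tensors with `base`, so
# the trained first/last layers are already "reintegrated"; a brief full
# fine-tune of `base` on a small dataset would then align the parameters.
```

Because the sub-network holds references to the base model's parameter tensors rather than copies, updates in stage 1 land directly in the original model; an actual implementation could equally train detached copies and copy the weights back before the alignment step.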
2025
Unlocking Korean Verbs: A User-Friendly Exploration into the Verb Lexicon
Seohyun Song | Eunkyul Leah Jo | Yige Chen | Jeen-Pyo Hong | Kyuwon Kim | Jin Wee | Kang Miyoung | KyungTae Lim | Jungyeul Park | Chulwoo Park
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)
The Sejong dictionary dataset offers a valuable resource, providing extensive coverage of morphology, syntax, and semantic representation. This dataset can be utilized to explore linguistic information in greater depth. The labeled linguistic structures within this dataset form the basis for uncovering relationships between words and phrases and their associations with target verbs. This paper introduces a user-friendly web interface designed for the collection and consolidation of verb-related information, with a particular focus on subcategorization frames. Additionally, it outlines our efforts in mapping this information by aligning subcategorization frames with corresponding illustrative sentence examples. Furthermore, we provide a Python library that simplifies syntactic parsing and semantic role labeling. These tools are intended to assist individuals interested in harnessing the Sejong dictionary dataset to develop applications for Korean language processing.
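To make the frame-to-example alignment concrete, here is a hypothetical sketch of how one aligned verb entry might be represented in Python; the schema (`lemma`, `pattern`, `roles`, `examples`) and the `frames_for` helper are invented for illustration and do not reflect the paper's actual data format or library API.

```python
# Hypothetical representation of one aligned verb entry; this schema is an
# illustrative assumption, not the paper's actual data format.
entry = {
    "lemma": "먹다",  # "to eat"
    "frames": [
        {
            "pattern": "X가 Y를 먹다",            # NOM subject, ACC object
            "roles": {"X": "Agent", "Y": "Theme"},
            "examples": ["아이가 밥을 먹는다."],   # "The child eats rice."
        }
    ],
}

def frames_for(lemma: str, lexicon: list) -> list:
    """Collect the subcategorization-frame patterns recorded for a lemma."""
    return [
        frame["pattern"]
        for e in lexicon
        if e["lemma"] == lemma
        for frame in e["frames"]
    ]

print(frames_for("먹다", [entry]))  # ['X가 Y를 먹다']
```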
VLR-Bench: Multilingual Benchmark Dataset for Vision-Language Retrieval Augmented Generation
Hyeonseok Lim | Dongjae Shin | Seohyun Song | Inho Won | Minjun Kim | Junghun Yuk | Haneol Jang | KyungTae Lim
Proceedings of the 31st International Conference on Computational Linguistics
We propose VLR-Bench, a visual question answering (VQA) benchmark for evaluating vision language models (VLMs) based on retrieval augmented generation (RAG). Unlike existing evaluation datasets for external knowledge-based VQA, the proposed VLR-Bench includes five input passages. This allows testing of the ability to determine which passage is useful for answering a given query, a capability lacking in previous research. In this context, we constructed a dataset of 32,000 automatically generated instruction-following examples, which we denote as VLR-IF. This dataset is specifically designed to enhance the RAG capabilities of VLMs by enabling them to learn how to generate appropriate answers based on input passages. We evaluated the validity of the proposed benchmark and training data and verified their effectiveness using the state-of-the-art Llama3-based VLM, the Llava-Llama-3 model. The proposed VLR-Bench and VLR-IF datasets are publicly available online.
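As a rough sketch of what a retrieval-augmented VQA item with five candidate passages might look like, here is a hypothetical Python structure and a toy selection metric; `VLRBenchItem`, its field names, and `passage_selection_accuracy` are illustrative assumptions, not the released VLR-Bench schema or its official metrics.

```python
from dataclasses import dataclass

@dataclass
class VLRBenchItem:
    """One hypothetical benchmark instance: a query about an image plus five
    retrieved passages, only some of which actually support the answer."""
    image_path: str
    query: str
    passages: list[str]         # exactly five candidate passages
    gold_passage_ids: set[int]  # indices of the genuinely helpful passages
    answer: str

def passage_selection_accuracy(items, select_fn):
    """Fraction of items where the model's chosen passage set matches gold.
    `select_fn(item)` is assumed to return a set of passage indices."""
    hits = sum(select_fn(item) == item.gold_passage_ids for item in items)
    return hits / len(items)

item = VLRBenchItem(
    image_path="example.jpg",
    query="What landmark is shown?",
    passages=["p0", "p1", "p2", "p3", "p4"],
    gold_passage_ids={1, 3},
    answer="N Seoul Tower",
)
print(passage_selection_accuracy([item], lambda it: {1, 3}))  # 1.0
```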