Soyoung Yoon


2023

An Integrated Search System for Korea Weather Data
Jinkyung Jo | Dayeon Ki | Soyoung Yoon | Minjoon Seo
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track

We introduce WeatherSearch, an integrated search system deployed at the Korea Meteorological Administration (KMA). WeatherSearch enables users to retrieve all the relevant data for weather forecasting from a massive weather database with simple natural language queries. We carefully design and conduct multiple expert surveys and interviews for template creation and apply data augmentation techniques including template filling to collect 4 million data points with minimal human labor. We then finetune mT5 on the collected dataset and achieve an average MRR of 0.66 and an average Recall of 0.82. We also discuss weather-data-specific characteristics that should be taken into account for creating such a system. We hope our paper serves as a simple and effective guideline for those designing similar systems in other regions of the world.

Towards standardizing Korean Grammatical Error Correction: Datasets and Annotation
Soyoung Yoon | Sungjoon Park | Gyuwan Kim | Junhee Cho | Kihyo Park | Gyu Tae Kim | Minjoon Seo | Alice Oh
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Research on Korean grammatical error correction (GEC) is limited compared to other major languages such as English. We attribute this problematic circumstance to the lack of a carefully designed evaluation benchmark for Korean GEC. In this work, we collect three datasets from different sources (Kor-Lang8, Kor-Native, and Kor-Learner) that cover a wide range of Korean grammatical errors. Considering the nature of Korean grammar, we then define 14 error types for Korean and provide KAGAS (Korean Automatic Grammatical error Annotation System), which can automatically annotate error types from parallel corpora. We use KAGAS on our datasets to make an evaluation benchmark for Korean, and present baseline models trained on our datasets. We show that the model trained with our datasets significantly outperforms the currently used statistical Korean GEC system (Hanspell) on a wider range of error types, demonstrating the diversity and usefulness of the datasets. The implementations and datasets are open-sourced.

2021

SSMix: Saliency-Based Span Mixup for Text Classification
Soyoung Yoon | Gyuwan Kim | Kyumin Park
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021