2025
pdf
bib
abs
MANTA: A Scalable Pipeline for Transmuting Massive Web Corpora into Instruction Datasets
Heuiyeen Yeen
|
Seokhee Hong
|
Hyeongu Yun
|
Jinsik Lee
Findings of the Association for Computational Linguistics: EMNLP 2025
We introduce MANTA, an automated pipeline that generates high-quality large-scale instruction fine-tuning datasets from massive web corpora while preserving their diversity and scalability. By extracting structured syllabi from web documents and leveraging high-performance LLMs, our approach enables highly effective query-response generation with minimal human intervention. Extensive experiments on 8B-scale LLMs demonstrate that fine-tuning on the MANTA-1M dataset significantly outperforms other massive dataset generation methodologies, particularly in knowledge-intensive tasks such as MMLU and MMLU-Pro, while also delivering superior performance across a broad spectrum of tasks. Moreover, MANTA supports seamless scalability by allowing the continuous integration of web corpus data, enabling expansion into domains requiring intensive knowledge.
pdf
bib
abs
From KMMLU-Redux to Pro: A Professional Korean Benchmark Suite for LLM Evaluation
Seokhee Hong
|
Sunkyoung Kim
|
Guijin Son
|
Soyeon Kim
|
Yeonjung Hong
|
Jinsik Lee
Findings of the Association for Computational Linguistics: EMNLP 2025
The development of Large Language Models (LLMs) requires robust benchmarks that encompass not only academic domains but also industrial fields to effectively evaluate their applicability in real-world scenarios. In this paper, we introduce two Korean expert-level benchmarks. KMMLU-Redux, reconstructed from the existing KMMLU consists of questions from the Korean National Technical Qualification exams, with critical errors removed to enhance reliability. KMMLU-Pro is based on Korean National Professional Licensure exams to reflect professional knowledge in Korea. Our experiments demonstrate that these benchmarks comprehensively represent industrial knowledge in Korea.
2024
pdf
bib
abs
Who Wrote this Code? Watermarking for Code Generation
Taehyun Lee
|
Seokhee Hong
|
Jaewoo Ahn
|
Ilgee Hong
|
Hwaran Lee
|
Sangdoo Yun
|
Jamin Shin
|
Gunhee Kim
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Since the remarkable generation performance of large language models raised ethical and legal concerns, approaches to detect machine-generated text by embedding watermarks are being developed.However, we discover that the existing works fail to function appropriately in code generation tasks due to the task’s nature of having low entropy.Extending a logit-modifying watermark method, we propose Selective WatErmarking via Entropy Thresholding (SWEET), which enhances detection ability and mitigates code quality degeneration by removing low-entropy segments at generating and detecting watermarks.Our experiments show that SWEET significantly improves code quality preservation while outperforming all baselines, including post-hoc detection methods, in detecting machine-generated code text.Our code is available inhttps://github.com/hongcheki/sweet-watermark.
2023
pdf
bib
abs
SQuARe: A Large-Scale Dataset of Sensitive Questions and Acceptable Responses Created through Human-Machine Collaboration
Hwaran Lee
|
Seokhee Hong
|
Joonsuk Park
|
Takyoung Kim
|
Meeyoung Cha
|
Yejin Choi
|
Byoungpil Kim
|
Gunhee Kim
|
Eun-Ju Lee
|
Yong Lim
|
Alice Oh
|
Sangchul Park
|
Jung-Woo Ha
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The potential social harms that large language models pose, such as generating offensive content and reinforcing biases, are steeply rising. Existing works focus on coping with this concern while interacting with ill-intentioned users, such as those who explicitly make hate speech or elicit harmful responses. However, discussions on sensitive issues can become toxic even if the users are well-intentioned. For safer models in such scenarios, we present the Sensitive Questions and Acceptable Response (SQuARe) dataset, a large-scale Korean dataset of 49k sensitive questions with 42k acceptable and 46k non-acceptable responses. The dataset was constructed leveraging HyperCLOVA in a human-in-the-loop manner based on real news headlines. Experiments show that acceptable response generation significantly improves for HyperCLOVA and GPT-3, demonstrating the efficacy of this dataset.
pdf
bib
abs
KoSBI: A Dataset for Mitigating Social Bias Risks Towards Safer Large Language Model Applications
Hwaran Lee
|
Seokhee Hong
|
Joonsuk Park
|
Takyoung Kim
|
Gunhee Kim
|
Jung-woo Ha
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track)
Large language models (LLMs) not only learn natural text generation abilities but also social biases against different demographic groups from real-world data. This poses a critical risk when deploying LLM-based applications. Existing research and resources are not readily applicable in South Korea due to the differences in language and culture, both of which significantly affect the biases and targeted demographic groups. This limitation requires localized social bias datasets to ensure the safe and effective deployment of LLMs. To this end, we present KosBi, a new social bias dataset of 34k pairs of contexts and sentences in Korean covering 72 demographic groups in 15 categories. We find that through filtering-based moderation, social biases in generated content can be reduced by 16.47%p on average for HyperClova (30B and 82B), and GPT-3.
2021
pdf
bib
abs
How Robust are Fact Checking Systems on Colloquial Claims?
Byeongchang Kim
|
Hyunwoo Kim
|
Seokhee Hong
|
Gunhee Kim
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Knowledge is now starting to power neural dialogue agents. At the same time, the risk of misinformation and disinformation from dialogue agents also rises. Verifying the veracity of information from formal sources are widely studied in computational fact checking. In this work, we ask: How robust are fact checking systems on claims in colloquial style? We aim to open up new discussions in the intersection of fact verification and dialogue safety. In order to investigate how fact checking systems behave on colloquial claims, we transfer the styles of claims from FEVER (Thorne et al., 2018) into colloquialism. We find that existing fact checking systems that perform well on claims in formal style significantly degenerate on colloquial claims with the same semantics. Especially, we show that document retrieval is the weakest spot in the system even vulnerable to filler words, such as “yeah” and “you know”. The document recall of WikiAPI retriever (Hanselowski et al., 2018) which is 90.0% on FEVER, drops to 72.2% on the colloquial claims. We compare the characteristics of colloquial claims to those of claims in formal style, and demonstrate the challenging issues in them.