2024
Intelligent Predictive Maintenance RAG framework for Power Plants: Enhancing QA with StyleDFS and Domain Specific Instruction Tuning
Seongtae Hong | Joong Min Shin | Jaehyung Seo | Taemin Lee | Jeongbae Park | Cho Man Young | Byeongho Choi | Heuiseok Lim
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
KoCommonGEN v2: A Benchmark for Navigating Korean Commonsense Reasoning Challenges in Large Language Models
Jaehyung Seo | Jaewook Lee | Chanjun Park | SeongTae Hong | Seungjun Lee | Heuiseok Lim
Findings of the Association for Computational Linguistics: ACL 2024
The evolution of large language models (LLMs) has culminated in a multitask model paradigm where prompts drive the generation of user-specific outputs. However, this advancement has revealed a critical challenge: LLMs frequently produce outputs that violate socially acceptable commonsense standards in various scenarios. To address this gap in commonsense reasoning, we present KoCommonGEN v2, a fine-grained benchmark dataset focused on Korean commonsense reasoning. This dataset, enriched with human annotations, comprises multiple-choice questions across seven error categories. These categories include commonsense memorization, numerical commonsense, toxic speech, and more, each of which can undermine the reliability of LLMs’ commonsense reasoning capabilities. The empirical results show that LLMs struggle with Korean commonsense reasoning. With human accuracy benchmarked at approximately 85%, GPT-4’s performance lags at about 74%, and other LLMs demonstrate an average accuracy of around 42%. Our findings emphasize the need for targeted improvements in Korean commonsense reasoning within LLMs, paving the way for more socially and contextually sensitive AI models.
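
The headline numbers above (human ≈85%, GPT-4 ≈74%, other LLMs ≈42%) are simple multiple-choice accuracy. The sketch below shows that kind of scoring with a per-category breakdown over the seven error categories; the field names (`category`, `answer`, `prediction`) are illustrative assumptions, not the dataset's actual schema.

```python
# Minimal sketch: multiple-choice accuracy on a KoCommonGEN v2-style benchmark.
# Field names are assumed for illustration, not taken from the released dataset.
from collections import defaultdict

def accuracy_by_category(examples):
    """Return overall accuracy and a per-error-category breakdown."""
    correct, total = 0, 0
    per_cat = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for ex in examples:
        hit = ex["prediction"] == ex["answer"]
        correct += hit
        total += 1
        per_cat[ex["category"]][0] += hit
        per_cat[ex["category"]][1] += 1
    overall = correct / total if total else 0.0
    breakdown = {cat: c / n for cat, (c, n) in per_cat.items()}
    return overall, breakdown

if __name__ == "__main__":
    demo = [
        {"category": "numerical commonsense", "answer": "B", "prediction": "B"},
        {"category": "toxic speech", "answer": "A", "prediction": "C"},
    ]
    overall, breakdown = accuracy_by_category(demo)
    print(f"overall accuracy: {overall:.2%}")
    for cat, acc in breakdown.items():
        print(f"  {cat}: {acc:.2%}")
```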
Translation of Multifaceted Data without Re-Training of Machine Translation Systems
Hyeonseok Moon | Seungyoon Lee | SeongTae Hong | Seungjun Lee | Chanjun Park | Heuiseok Lim
Findings of the Association for Computational Linguistics: EMNLP 2024
Translating major language resources to build minor language resources has become a widely used approach. Particularly when translating complex data points composed of multiple components, it is common to translate each component separately. However, we argue that this practice often overlooks the interrelation between components within the same data point. To address this limitation, we propose a novel MT pipeline that considers the intra-data relation when implementing MT for training data. In our MT pipeline, all the components in a data point are concatenated to form a single translation sequence and subsequently reconstructed into the original data components after translation. We introduce a Catalyst Statement (CS) to enhance the intra-data relation, and an Indicator Token (IT) to assist the decomposition of a translated sequence into its respective data components. Our approach yields a considerable improvement both in translation quality itself and in its effectiveness as training data. Compared with the conventional approach that translates each data component separately, our method yields better training data that enhances the performance of the trained model by 2.690 points on the web page ranking (WPR) task and 0.845 points on the question generation (QG) task in the XGLUE benchmark.
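
To make the pipeline concrete, here is a minimal sketch of the concatenate-translate-decompose idea described in the abstract, assuming a catalyst statement prefix and numbered indicator tokens of the form `<#1>`, `<#2>`, ...; the exact wording, the token format, and the `translate` callable are assumptions for illustration, not the paper's implementation.

```python
# Sketch of translating a multi-component data point in a single MT call,
# then splitting the output back into components. CS wording and IT format
# are assumed, not the paper's exact choices.
import re
from typing import Callable, List

CATALYST = "The following sentences are parts of one example."  # assumed CS wording

def compose(components: List[str]) -> str:
    """Join all components into one translation input: CS first, then each
    component prefixed with an indicator token."""
    body = " ".join(f"<#{i + 1}> {c}" for i, c in enumerate(components))
    return f"{CATALYST} {body}"

def decompose(translated: str, n: int) -> List[str]:
    """Split the translated sequence back into components via the indicator tokens."""
    # Drop everything before the first indicator token (the translated catalyst statement).
    parts = [p.strip() for p in re.split(r"<#\d+>", translated)[1:]]
    if len(parts) != n:
        raise ValueError(f"expected {n} components, recovered {len(parts)}")
    return parts

def translate_data_point(components: List[str], translate: Callable[[str], str]) -> List[str]:
    """Translate all components of one data point with a single MT call."""
    return decompose(translate(compose(components)), len(components))

if __name__ == "__main__":
    # Identity "translator" just to demonstrate the round trip.
    echo = lambda s: s
    print(translate_data_point(["How many legs does a spider have?", "Eight."], echo))
```

Translating the whole sequence in one call lets the MT system see the other components as context, which is the intra-data relation the abstract argues is lost when each component is translated separately.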