2024
pdf
bib
abs
Length-aware Byte Pair Encoding for Mitigating Over-segmentation in Korean Machine Translation
Jungseob Lee
|
Hyeonseok Moon
|
Seungjun Lee
|
Chanjun Park
|
Sugyeong Eo
|
Hyunwoong Ko
|
Jaehyung Seo
|
Seungyoon Lee
|
Heuiseok Lim
Findings of the Association for Computational Linguistics: ACL 2024
Byte Pair Encoding is an effective approach in machine translation across several languages. However, our analysis indicates that BPE is prone to over-segmentation in the morphologically rich language, Korean, which can erode word semantics and lead to semantic confusion during training. This semantic confusion, stemming from over-segmentation, ultimately contributes to a degradation of overall translation quality. To address this issue, we introduce Length-aware Subword Vocabulary Construction (LeVoC), a novel approach strategically incorporating longer words into the vocabulary. By utilizing an external monolingual Korean corpus, LeVoC extracts and integrates long words, effectively preserving morphological information and reducing semantic confusion. Our experiments demonstrate that LeVoC not only significantly outperforms BPE, but also can be applied to and surpass current state-of-the-art morpheme-aware subword tokenization methods. We provide evidence that the difficulty in translating sentences with long words in Korean is associated with morphological compositionality, and LeVoC’s ability to reduce semantic confusion during training leads to improved translation quality.
pdf
bib
abs
Translation of Multifaceted Data without Re-Training of Machine Translation Systems
Hyeonseok Moon
|
Seungyoon Lee
|
SeongTae Hong
|
Seungjun Lee
|
Chanjun Park
|
Heuiseok Lim
Findings of the Association for Computational Linguistics: EMNLP 2024
Translating major language resources to build minor language resources becomes a widely-used approach. Particularly in translating complex data points composed of multiple components, it is common to translate each component separately. However, we argue that this practice often overlooks the interrelation between components within the same data point. To address this limitation, we propose a novel MT pipeline that considers the intra-data relation. in implementing MT for training data. In our MT pipeline, all the components in a data point are concatenated to form a single translation sequence and subsequently reconstructed to the data components after translation. We introduce a Catalyst Statement (CS) to enhance the intra-data relation, and Indicator Token (IT) to assist the decomposition of a translated sequence into its respective data components. Through our approach, we have achieved a considerable improvement in translation quality itself, along with its effectiveness as training data. Compared with the conventional approach that translates each data component separately, our method yields better training data that enhances the performance of the trained model by 2.690 points for the web page ranking (WPR) task, and 0.845 for the question generation (QG) task in the XGLUE benchmark.
pdf
bib
abs
Exploring Inherent Biases in LLMs within Korean Social Context: A Comparative Analysis of ChatGPT and GPT-4
Seungyoon Lee
|
Dong Kim
|
Dahyun Jung
|
Chanjun Park
|
Heuiseok Lim
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)
Large Language Models (LLMs) have significantly impacted various fields requiring advanced linguistic understanding, yet concerns regarding their inherent biases and ethical considerations have also increased. Notably, LLMs have been critiqued for perpetuating stereotypes against diverse groups based on race, sexual orientation, and other attributes. However, most research analyzing these biases has predominantly focused on communities where English is the primary language, neglecting to consider the cultural and linguistic nuances of other societies. In this paper, we aim to explore the inherent biases and toxicity of LLMs, specifically within the social context of Korea. We devise a set of prompts that reflect major societal issues in Korea and assign varied personas to both ChatGPT and GPT-4 to assess the toxicity of the generated sentences. Our findings indicate that certain personas or prompt combinations consistently yield harmful content, highlighting the potential risks associated with specific persona-issue alignments within the Korean cultural framework. Furthermore, we discover that GPT-4 can produce more than twice the level of toxic content than ChatGPT under certain conditions.
pdf
bib
abs
Leveraging Pre-existing Resources for Data-Efficient Counter-Narrative Generation in Korean
Seungyoon Lee
|
Chanjun Park
|
DaHyun Jung
|
Hyeonseok Moon
|
Jaehyung Seo
|
Sugyeong Eo
|
Heuiseok Lim
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Counter-narrative generation, i.e., the generation of fact-based responses to hate speech with the aim of correcting discriminatory beliefs, has been demonstrated to be an effective method to combat hate speech. However, its effectiveness is limited by the resource-intensive nature of dataset construction processes and only focuses on the primary language. To alleviate this problem, we propose a Korean Hate Speech Counter Punch (KHSCP), a cost-effective counter-narrative generation method in the Korean language. To this end, we release the first counter-narrative generation dataset in Korean and pose two research questions. Under the questions, we propose an effective augmentation method and investigate the reasonability of a large language model to overcome data scarcity in low-resource environments by leveraging existing resources. In this regard, we conduct several experiments to verify the effectiveness of the proposed method. Our results reveal that applying pre-existing resources can improve the generation performance by a significant margin. Through deep analysis on these experiments, this work proposes the possibility of overcoming the challenges of generating counter-narratives in low-resource environments.
2022
pdf
bib
abs
Focus on FoCus: Is FoCus focused on Context, Knowledge and Persona?
SeungYoon Lee
|
Jungseob Lee
|
Chanjun Park
|
Sugyeong Eo
|
Hyeonseok Moon
|
Jaehyung Seo
|
Jeongbae Park
|
Heuiseok Lim
Proceedings of the 1st Workshop on Customized Chat Grounding Persona and Knowledge
Rather than continuing the conversation based on personalized or implicit information, the existing conversation system generates dialogue by focusing only on the superficial content. To solve this problem, FoCus was recently released. FoCus is a persona-knowledge grounded dialogue generation dataset that leverages Wikipedia’s knowledge and personal persona, focusing on the landmarks provided by Google, enabling user-centered conversation. However, a closer empirical study is needed since research in the field is still in its early stages. Therefore, we fling two research questions about FoCus. “Is the FoCus whether for conversation or question answering?” to identify the structural problems of the dataset. “Does the FoCus model do real knowledge blending?” to closely demonstrate that the model acquires actual knowledge. As a result of the experiment, we present that the FoCus model could not correctly blend the knowledge according to the input dialogue and that the dataset design is unsuitable for the multi-turn conversation.