2024
GPTs Are Multilingual Annotators for Sequence Generation Tasks
Juhwan Choi | Eunju Lee | Kyohoon Jin | YoungBin Kim
Findings of the Association for Computational Linguistics: EACL 2024
Data annotation is an essential step for constructing new datasets. However, the conventional approach of data annotation through crowdsourcing is both time-consuming and expensive. In addition, the complexity of this process increases when dealing with low-resource languages owing to differences in the language pool of crowdworkers. To address these issues, this study proposes an autonomous annotation method utilizing large language models, which have recently been demonstrated to exhibit remarkable performance. Through our experiments, we demonstrate that the proposed method is not only cost-efficient but also applicable to low-resource language annotation. Additionally, we constructed an image captioning dataset using our approach and are committed to opening this dataset for future study. We have released our source code for further study and reproducibility.
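To make the idea concrete, here is a minimal sketch of an LLM-based annotation loop of the kind the abstract describes, assuming an OpenAI-style chat API; the model name, prompt wording, and annotate_caption helper are illustrative assumptions, not the authors' actual pipeline.

```python
# Illustrative sketch of LLM-based caption annotation (not the authors' code).
# Assumes the OpenAI Python client (>=1.0); model name and prompt are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def annotate_caption(image_description: str, language: str = "Korean") -> str:
    """Ask the model to write one caption in the target language."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model name
        messages=[
            {"role": "system",
             "content": f"You are an image captioning annotator. Write one fluent caption in {language}."},
            {"role": "user", "content": image_description},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()

if __name__ == "__main__":
    print(annotate_caption("A child flying a red kite on a windy beach."))
```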
Don’t be a Fool: Pooling Strategies in Offensive Language Detection from User-Intended Adversarial Attacks
Seunguk Yu | Juhwan Choi | YoungBin Kim
Findings of the Association for Computational Linguistics: NAACL 2024
Offensive language detection is an important task for filtering out abusive expressions and improving online user experiences. However, malicious users often attempt to avoid filtering systems by introducing textual noise. In this paper, we formulate these evasions as user-intended adversarial attacks that insert special symbols or leverage the distinctive features of the Korean language. Furthermore, we introduce simple yet effective layer-wise pooling strategies to defend against the proposed attacks, drawing on the preceding layers rather than only the last layer to capture both offensiveness and token-level representations. We demonstrate that these pooling strategies remain robust to performance degradation even as the attack rate increases, without direct training on such patterns. Notably, we found that, by employing these pooling strategies, models pre-trained on clean texts could achieve performance in detecting attacked offensive language comparable to that of models pre-trained on noisy texts.
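As a rough illustration of layer-wise pooling, the sketch below mean-pools hidden states across tokens and across encoder layers using the Hugging Face transformers API; the encoder checkpoint and pooling choices are assumptions, not the paper's exact strategy.

```python
# Illustrative layer-wise pooling for offensive-language classification
# (a sketch of the general idea, not the authors' exact strategy).
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "klue/bert-base"  # placeholder Korean encoder checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)

def pooled_representation(text: str) -> torch.Tensor:
    """Return a sentence vector pooled over tokens and over all encoder layers."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = encoder(**inputs)
    # hidden_states: tuple of (num_layers + 1) tensors, each [1, seq_len, hidden]
    hidden_states = torch.stack(outputs.hidden_states[1:])  # skip the embedding layer
    token_pooled = hidden_states.mean(dim=2)                # mean over tokens per layer
    layer_pooled = token_pooled.mean(dim=0)                 # mean over layers
    return layer_pooled.squeeze(0)                          # [hidden]
```

A classifier head can then be trained on this pooled vector instead of the last layer's [CLS] token alone.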
AutoAugment Is What You Need: Enhancing Rule-based Augmentation Methods in Low-resource Regimes
Juhwan Choi | Kyohoon Jin | Junho Lee | Sangmin Song | YoungBin Kim
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop
Text data augmentation is a complex problem due to the discrete nature of sentences. Although rule-based augmentation methods are widely adopted in real-world applications because of their simplicity, they suffer from potential semantic damage. Previous researchers have suggested easy data augmentation with soft labels (softEDA), employing label smoothing to mitigate this problem. However, finding the best smoothing factor for each model and dataset is challenging; therefore, using softEDA in real-world applications is still difficult. In this paper, we propose adapting AutoAugment to solve this problem. The experimental results suggest that the proposed method can boost existing augmentation methods and that rule-based methods can enhance cutting-edge pretrained language models. We make our source code available.
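A minimal sketch of what adapting AutoAugment-style policy search to rule-based text augmentation could look like: random search over an EDA operation, its strength, and a softEDA-style smoothing factor, scored on validation data. The operation list, search space, and train_and_evaluate hook are assumptions, not the authors' implementation.

```python
# Rough sketch of policy search over rule-based augmentations, in the spirit of
# AutoAugment; operations, search space, and the evaluation hook are assumptions.
import random

OPERATIONS = ["synonym_replacement", "random_insertion", "random_swap", "random_deletion"]

def sample_policy() -> dict:
    return {
        "operation": random.choice(OPERATIONS),
        "strength": random.choice([0.05, 0.1, 0.2]),        # fraction of words to modify
        "label_smoothing": random.choice([0.0, 0.1, 0.2]),  # softEDA-style factor
    }

def search_policy(train_and_evaluate, num_trials: int = 20):
    """train_and_evaluate(policy) -> validation score; supplied by the user."""
    best_policy, best_score = None, float("-inf")
    for _ in range(num_trials):
        policy = sample_policy()
        score = train_and_evaluate(policy)
        if score > best_score:
            best_policy, best_score = policy, score
    return best_policy, best_score
```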
Enhancing Effectiveness and Robustness in a Low-Resource Regime via Decision-Boundary-aware Data Augmentation
Kyohoon Jin | Junho Lee | Juhwan Choi | Sangmin Song | Youngbin Kim
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Efforts to leverage deep learning models in low-resource regimes have led to numerous augmentation studies. However, the direct application of methods such as mixup and cutout is limited due to the discrete characteristics of textual data. While methods using pretrained language models have exhibited good efficiency, they require additional considerations for robustness. Inspired by recent studies on decision boundaries, this paper proposes a decision-boundary-aware data augmentation strategy to enhance robustness using pretrained language models. The proposed technique first shifts the latent features closer to the decision boundary, followed by reconstruction to generate an ambiguous version with a soft label. Additionally, mid-K sampling is suggested to enhance the diversity of the generated sentences. This paper demonstrates the performance of the proposed augmentation strategy compared to other methods through extensive experiments. Furthermore, the ablation study demonstrates the effect of soft labels and mid-K sampling and the extensibility of the method with curriculum data augmentation.
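As one way to read the mid-K idea, the sketch below samples the next token from mid-ranked candidates of the output distribution rather than from the very top; the rank window is an assumption and this is not the authors' implementation.

```python
# Hedged sketch of mid-K sampling: draw the next token from mid-ranked candidates
# instead of the top of the distribution; the rank window (skip, k) is an assumption.
import torch

def mid_k_sample(logits: torch.Tensor, skip: int = 5, k: int = 20) -> int:
    """logits: [vocab_size]; returns a token id sampled from ranks skip .. skip+k-1."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    window_probs = sorted_probs[skip:skip + k]
    window_ids = sorted_ids[skip:skip + k]
    window_probs = window_probs / window_probs.sum()  # renormalize over the window
    choice = torch.multinomial(window_probs, num_samples=1)
    return int(window_ids[choice])
```

Skipping the highest-probability tokens trades a little fluency for more diverse reconstructions, which matches the stated goal of increasing sentence diversity.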
2023
Focus on the Core: Efficient Attention via Pruned Token Compression for Document Classification
Jungmin Yun | Mihyeon Kim | Youngbin Kim
Findings of the Association for Computational Linguistics: EMNLP 2023
Transformer-based models have achieved dominant performance in numerous NLP tasks. Despite their remarkable successes, pre-trained transformers such as BERT suffer from a computationally expensive self-attention mechanism that interacts with all tokens, including the ones unfavorable to classification performance. To overcome these challenges, we propose integrating two strategies: token pruning and token combining. Token pruning eliminates less important tokens in the attention mechanism’s keys and values as they pass through the layers. Additionally, we adopt fuzzy logic to handle uncertainty and alleviate potential mispruning risks arising from an imbalanced distribution of each token’s importance. Token combining, on the other hand, condenses input sequences into smaller sizes in order to further compress the model. By integrating these two approaches, we not only improve the model’s performance but also reduce its computational demands. Experiments with various datasets demonstrate superior performance compared to baseline models, with the largest improvement over the existing BERT model reaching +5%p in accuracy and +5.6%p in F1 score. Additionally, memory cost is reduced to 0.61x, and a speedup of 1.64x is achieved.
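A simplified sketch of attention-based token pruning, the first of the two strategies: tokens that receive little attention are dropped before the next layer. The fuzzy-logic treatment and token combining are not reproduced here, and the scoring rule is an assumption rather than the paper's exact criterion.

```python
# Simplified token pruning by attention-received importance (sketch of the idea only;
# the fuzzy-logic handling and token combining from the paper are not reproduced here).
import torch

def prune_tokens(hidden: torch.Tensor, attn: torch.Tensor, keep_ratio: float = 0.7):
    """
    hidden: [batch, seq_len, dim]             layer hidden states
    attn:   [batch, heads, seq_len, seq_len]  attention weights of the same layer
    Keeps the keep_ratio fraction of tokens that receive the most attention.
    """
    # Importance of token j = attention it receives, averaged over heads and queries.
    importance = attn.mean(dim=1).mean(dim=1)                           # [batch, seq_len]
    keep = max(1, int(importance.size(1) * keep_ratio))
    top_idx = importance.topk(keep, dim=1).indices.sort(dim=1).values   # keep original order
    batch_idx = torch.arange(hidden.size(0)).unsqueeze(1)
    return hidden[batch_idx, top_idx]                                   # [batch, keep, dim]
```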
It Ain’t Over: A Multi-aspect Diverse Math Word Problem Dataset
Jiwoo Kim | Youngbin Kim | Ilwoong Baek | JinYeong Bak | Jongwuk Lee
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
The math word problem (MWP) is a complex task that requires natural language understanding and logical reasoning to extract key knowledge from natural language narratives. Previous studies have provided various MWP datasets but lack diversity in problem types, lexical usage patterns, languages, and annotations for intermediate solutions. To address these limitations, we introduce a new MWP dataset, named DMath (Diverse Math Word Problems), offering a wide range of diversity in problem types, lexical usage patterns, languages, and intermediate solutions. The problems are available in English and Korean and include an expression tree and Python code as intermediate solutions. Through extensive experiments, we demonstrate that the DMath dataset provides a new opportunity to evaluate the capability of large language models; for example, GPT-4 achieves only about 75% accuracy on the DMath dataset.
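For illustration only, here is a hypothetical record in the spirit of what the abstract describes (a bilingual problem with an expression-tree and Python-code intermediate solution); the field names and content are invented, not taken from the actual dataset.

```python
# Hypothetical example of a multi-aspect MWP record like those DMath describes;
# field names and values are invented for illustration, not copied from the dataset.
example = {
    "question_en": "Mina has 3 boxes with 4 apples each. How many apples does she have?",
    "question_ko": "미나는 사과가 4개씩 든 상자 3개를 가지고 있습니다. 사과는 모두 몇 개일까요?",
    "expression_tree": "(* 3 4)",
    "python_solution": "print(3 * 4)",
    "answer": 12,
}
```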
2021
Restoring and Mining the Records of the Joseon Dynasty via Neural Language Modeling and Machine Translation
Kyeongpil Kang | Kyohoon Jin | Soyoung Yang | Soojin Jang | Jaegul Choo | Youngbin Kim
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Understanding voluminous historical records provides clues about the past in various aspects, such as social and political issues and even natural science facts. However, it is generally difficult to fully utilize historical records, since most of the documents are not written in a modern language and parts of the contents have been damaged over time. As a result, restoring the damaged or unrecognizable parts as well as translating the records into modern languages are crucial tasks. In response, we present a multi-task learning approach to restore and translate historical documents based on a self-attention mechanism, specifically utilizing two Korean historical records that are among the most voluminous historical records in the world. Experimental results show that our approach significantly improves the accuracy of the translation task compared to baselines without multi-task learning. In addition, we present an in-depth exploratory analysis of our translated results via topic modeling, uncovering several significant historical events.
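A toy sketch of the multi-task setup the abstract outlines: a shared self-attention encoder with one head for restoring damaged tokens and one for translation, trained with a joint loss. The architecture, dimensions, and loss weights are assumptions, not the paper's model; a realistic translation component would use a full decoder rather than a per-token head.

```python
# Toy sketch of multi-task learning with a shared self-attention encoder and two heads
# (restoration and translation); structure and sizes are assumptions, not the paper's model.
import torch
import torch.nn as nn

class MultiTaskHistoricalModel(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 256, n_layers: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.restore_head = nn.Linear(d_model, vocab_size)    # predict damaged tokens
        self.translate_head = nn.Linear(d_model, vocab_size)  # predict modern-language tokens

    def forward(self, tokens: torch.Tensor):
        shared = self.encoder(self.embed(tokens))              # [batch, seq_len, d_model]
        return self.restore_head(shared), self.translate_head(shared)

def multitask_loss(restore_logits, translate_logits, restore_targets, translate_targets):
    """Joint loss over both tasks; equal weights are illustrative."""
    ce = nn.CrossEntropyLoss(ignore_index=-100)
    loss_r = ce(restore_logits.transpose(1, 2), restore_targets)
    loss_t = ce(translate_logits.transpose(1, 2), translate_targets)
    return loss_r + loss_t
```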