“Why do I feel offended?” - Korean Dataset for Offensive Language Identification

San-Hee Park, Kang-Min Kim, O-Joun Lee, Youjin Kang, Jaewon Lee, Su-Min Lee, SangKeun Lee


Abstract
Warning: This paper contains some offensive expressions. Offensive content is an unavoidable issue on social media. Most existing offensive language identification methods rely on compiled labeled datasets. However, they rarely consider low-resource languages (e.g., Korean), for which relatively little training data is available. To address these issues, we construct a novel KOrean Dataset for Offensive Language Identification (KODOLI). KODOLI comprises more fine-grained offensiveness categories (i.e., not offensive, likely offensive, and offensive) than existing datasets. Likely offensive language refers to text with implicit offensiveness or to abusive language without offensive intent. In addition, we propose two auxiliary tasks to help identify offensive language: abusive language detection and sentiment analysis. We provide experimental results for baselines on KODOLI and observe that language models struggle to identify “LIKELY” offensive statements. Quantitative results and qualitative analysis demonstrate that jointly learning offensive language, abusive language, and sentiment information improves the performance of offensive language identification.
Anthology ID:
2023.findings-eacl.85
Volume:
Findings of the Association for Computational Linguistics: EACL 2023
Month:
May
Year:
2023
Address:
Dubrovnik, Croatia
Editors:
Andreas Vlachos, Isabelle Augenstein
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
1142–1153
URL:
https://aclanthology.org/2023.findings-eacl.85
DOI:
10.18653/v1/2023.findings-eacl.85
Cite (ACL):
San-Hee Park, Kang-Min Kim, O-Joun Lee, Youjin Kang, Jaewon Lee, Su-Min Lee, and SangKeun Lee. 2023. “Why do I feel offended?” - Korean Dataset for Offensive Language Identification. In Findings of the Association for Computational Linguistics: EACL 2023, pages 1142–1153, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal):
“Why do I feel offended?” - Korean Dataset for Offensive Language Identification (Park et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-eacl.85.pdf
Dataset:
2023.findings-eacl.85.dataset.zip
Video:
https://aclanthology.org/2023.findings-eacl.85.mp4