Dataset for Identification of Homophobia and Transphobia for Telugu, Kannada, and Gujarati

Prasanna Kumar Kumaresan; Rahul Ponnusamy; Dhruv Sharma; Paul Buitelaar; Bharathi Raja Chakravarthi

Abstract

Users of social media platforms are negatively affected by the proliferation of hateful or abusive content. Homophobic and transphobic content targeting LGBT+ individuals has risen in recent years. Increasing levels of homophobia and transphobia can make online platforms harmful and threatening for LGBT+ persons, potentially inhibiting equality, diversity, and inclusion. We introduce a new expert-labeled dataset for three languages, Telugu, Kannada, and Gujarati, built to support the automatic identification of homophobic and transphobic content in comments collected from YouTube. We provided comprehensive annotation guidelines to train annotators in this process. We collected approximately 10,000 comments from YouTube for all three languages. This is the first dataset in these languages for this task, and we also developed a baseline model with pre-trained transformers.

Anthology ID: 2024.lrec-main.393
Volume: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month: May
Year: 2024
Address: Torino, Italia
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues: LREC | COLING
Publisher: ELRA and ICCL
Pages: 4404–4411
URL: https://aclanthology.org/2024.lrec-main.393/
Cite (ACL): Prasanna Kumar Kumaresan, Rahul Ponnusamy, Dhruv Sharma, Paul Buitelaar, and Bharathi Raja Chakravarthi. 2024. Dataset for Identification of Homophobia and Transphobia for Telugu, Kannada, and Gujarati. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 4404–4411, Torino, Italia. ELRA and ICCL.
Cite (Informal): Dataset for Identification of Homophobia and Transphobia for Telugu, Kannada, and Gujarati (Kumaresan et al., LREC-COLING 2024)
PDF: https://aclanthology.org/2024.lrec-main.393.pdf
