DOSA: Dravidian Code-Mixed Offensive Span Identification Dataset

Manikandan Ravikiran, Subbiah Annamalai


Abstract
This paper presents the Dravidian Offensive Span Identification Dataset (DOSA) for under-resourced Tamil-English and Kannada-English code-mixed text. The dataset addresses the lack of code-mixed datasets with annotated offensive spans by extending the annotations of existing code-mixed offensive language identification datasets. It provides span annotations for Tamil-English and Kannada-English code-mixed comments posted by users on YouTube. Overall, the dataset consists of 4786 Tamil-English comments with 6202 annotated spans and 1097 Kannada-English comments with 1641 annotated spans, each annotated by two different annotators. We also present baseline experimental results on the dataset, with the aim of encouraging research on under-resourced languages and taking an essential step towards semi-automated content moderation for Dravidian languages. The dataset is available at https://github.com/teamdl-mlsg/DOSA
Anthology ID:
2021.dravidianlangtech-1.2
Volume:
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages
Month:
April
Year:
2021
Address:
Kyiv
Editors:
Bharathi Raja Chakravarthi, Ruba Priyadharshini, Anand Kumar M, Parameswari Krishnamurthy, Elizabeth Sherly
Venue:
DravidianLangTech
Publisher:
Association for Computational Linguistics
Pages:
10–17
URL:
https://aclanthology.org/2021.dravidianlangtech-1.2
Cite (ACL):
Manikandan Ravikiran and Subbiah Annamalai. 2021. DOSA: Dravidian Code-Mixed Offensive Span Identification Dataset. In Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, pages 10–17, Kyiv. Association for Computational Linguistics.
Cite (Informal):
DOSA: Dravidian Code-Mixed Offensive Span Identification Dataset (Ravikiran & Annamalai, DravidianLangTech 2021)
PDF:
https://aclanthology.org/2021.dravidianlangtech-1.2.pdf
Code:
manikandan-ravikiran/dosa
Data:
OLID