MMT: A Multilingual and Multi-Topic Indian Social Media Dataset

Dwip Dalal, Vivek Srivastava, Mayank Singh


Abstract
Social media plays a significant role in cross-cultural communication. Much of this communication occurs in code-mixed and multilingual form, which poses a significant challenge for Natural Language Processing (NLP) tools on tasks such as language identification, topic modeling, and named-entity recognition. To address this, we introduce MMT, a large-scale multilingual and multi-topic dataset collected from Twitter (1.7 million tweets), encompassing 13 coarse-grained and 63 fine-grained topics in the Indian context. We further annotate a subset of 5,346 tweets from the MMT dataset with labels for various Indian languages and their code-mixed counterparts. We also demonstrate that existing tools fail to capture the linguistic diversity in MMT on two downstream tasks: topic modeling and language identification. To facilitate future research, we will make the anonymized and annotated dataset available in the public domain.
Anthology ID:
2023.c3nlp-1.6
Volume:
Proceedings of the First Workshop on Cross-Cultural Considerations in NLP (C3NLP)
Month:
May
Year:
2023
Address:
Dubrovnik, Croatia
Editors:
Sunipa Dev, Vinodkumar Prabhakaran, David Adelani, Dirk Hovy, Luciana Benotti
Venue:
C3NLP
Publisher:
Association for Computational Linguistics
Pages:
47–52
URL:
https://aclanthology.org/2023.c3nlp-1.6
DOI:
10.18653/v1/2023.c3nlp-1.6
Cite (ACL):
Dwip Dalal, Vivek Srivastava, and Mayank Singh. 2023. MMT: A Multilingual and Multi-Topic Indian Social Media Dataset. In Proceedings of the First Workshop on Cross-Cultural Considerations in NLP (C3NLP), pages 47–52, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal):
MMT: A Multilingual and Multi-Topic Indian Social Media Dataset (Dalal et al., C3NLP 2023)
PDF:
https://aclanthology.org/2023.c3nlp-1.6.pdf
Video:
https://aclanthology.org/2023.c3nlp-1.6.mp4