TweetTaglish: A Dataset for Investigating Tagalog-English Code-Switching

Megan Herrera, Ankit Aich, Natalie Parde


Abstract
Deploying recent natural language processing innovations to low-resource settings allows for state-of-the-art research findings and applications to be accessed across cultural and linguistic borders. One low-resource setting of increasing interest is code-switching, the phenomenon of combining, swapping, or alternating the use of two or more languages in continuous dialogue. In this paper, we introduce a large dataset (20k+ instances) to facilitate investigation of Tagalog-English code-switching, which has become a popular mode of discourse in Philippine culture. Tagalog is an Austronesian language and former official language of the Philippines spoken by over 23 million people worldwide, but it and Tagalog-English are under-represented in NLP research and practice. We describe our methods for data collection, as well as our labeling procedures. We analyze our resulting dataset, and finally conclude by providing results from a proof-of-concept regression task to establish dataset validity, achieving a strong performance benchmark (R²=0.797–0.909; RMSE=0.068–0.057).
Anthology ID:
2022.lrec-1.225
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
2090–2097
URL:
https://aclanthology.org/2022.lrec-1.225
Cite (ACL):
Megan Herrera, Ankit Aich, and Natalie Parde. 2022. TweetTaglish: A Dataset for Investigating Tagalog-English Code-Switching. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 2090–2097, Marseille, France. European Language Resources Association.
Cite (Informal):
TweetTaglish: A Dataset for Investigating Tagalog-English Code-Switching (Herrera et al., LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.225.pdf
Code:
meg2121/tweettaglish-dataset