TweetTaglish: A Dataset for Investigating Tagalog-English Code-Switching

Megan Herrera; Ankit Aich; Natalie Parde

Abstract

Deploying recent natural language processing innovations to low-resource settings allows for state-of-the-art research findings and applications to be accessed across cultural and linguistic borders. One low-resource setting of increasing interest is code-switching, the phenomenon of combining, swapping, or alternating the use of two or more languages in continuous dialogue. In this paper, we introduce a large dataset (20k+ instances) to facilitate investigation of Tagalog-English code-switching, which has become a popular mode of discourse in Philippine culture. Tagalog is an Austronesian language and former official language of the Philippines spoken by over 23 million people worldwide, but it and Tagalog-English are under-represented in NLP research and practice. We describe our methods for data collection, as well as our labeling procedures. We analyze our resulting dataset, and finally conclude by providing results from a proof-of-concept regression task to establish dataset validity, achieving a strong performance benchmark (R²=0.797-0.909; RMSE=0.068-0.057).

Anthology ID: 2022.lrec-1.225
Volume: Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month: June
Year: 2022
Address: Marseille, France
Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association
Pages: 2090–2097
URL: https://aclanthology.org/2022.lrec-1.225/
Cite (ACL): Megan Herrera, Ankit Aich, and Natalie Parde. 2022. TweetTaglish: A Dataset for Investigating Tagalog-English Code-Switching. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 2090–2097, Marseille, France. European Language Resources Association.
Cite (Informal): TweetTaglish: A Dataset for Investigating Tagalog-English Code-Switching (Herrera et al., LREC 2022)
PDF: https://aclanthology.org/2022.lrec-1.225.pdf