Almeida Jacqueline Toribio

2021

A Survey of Code-switching: Linguistic and Social Perspectives for Language Technologies
A. Seza Doğruöz | Sunayana Sitaram | Barbara E. Bullock | Almeida Jacqueline Toribio
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

The analysis of data in which multiple languages are represented has gained popularity among computational linguists in recent years. So far, much of this research focuses mainly on the improvement of computational methods and largely ignores linguistic and social aspects of C-S discussed across a wide range of languages within the long-established literature in linguistics. To fill this gap, we offer a survey of code-switching (C-S) covering the literature in linguistics with a reflection on the key issues in language technologies. From the linguistic perspective, we provide an overview of structural and functional patterns of C-S focusing on the literature from European and Indian contexts as highly multilingual areas. From the language technologies perspective, we discuss how massive language models fail to represent diverse C-S types due to lack of appropriate training data, lack of robust evaluation benchmarks for C-S (across multilingual situations and types of C-S) and lack of end-to- end systems that cover sociolinguistic aspects of C-S as well. Our survey will be a step to- wards an outcome of mutual benefit for computational scientists and linguists with a shared interest in multilingualism and C-S.

2019

pdf bib abs

The limits of Spanglish?
Barbara Bullock | Wally Guzmán | Almeida Jacqueline Toribio
Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

Linguistic code-switching (C-S) is common in oral bilingual vernacular speech. When used in literature, C-S becomes an artistic choice that can mirror the patterns of bilingual interactions. But it can also potentially exceed them. What are the limits of C-S? We model features of C-S in corpora of contemporary U.S. Spanish-English literary and conversational data to analyze why some critics view the ‘Spanglish’ texts of Ilan Stavans as deviating from a C-S norm.

2018

pdf bib abs

The University of Texas System Submission for the Code-Switching Workshop Shared Task 2018
Florian Janke | Tongrui Li | Eric Rincón | Gualberto Guzmán | Barbara Bullock | Almeida Jacqueline Toribio
Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching

This paper describes the system for the Named Entity Recognition Shared Task of the Third Workshop on Computational Approaches to Linguistic Code-Switching (CALCS) submitted by the Bilingual Annotations Tasks (BATs) research group of the University of Texas. Our system uses several features to train a Conditional Random Field (CRF) model for classifying input words as Named Entities (NEs) using the Inside-Outside-Beginning (IOB) tagging scheme. We participated in the Modern Standard Arabic-Egyptian Arabic (MSA-EGY) and English-Spanish (ENG-SPA) tasks, achieving weighted average F-scores of 65.62 and 54.16 respectively. We also describe the performance of a deep neural network (NN) trained on a subset of the CRF features, which did not surpass CRF performance.

pdf bib abs

Predicting the presence of a Matrix Language in code-switching
Barbara Bullock | Wally Guzmán | Jacqueline Serigos | Vivek Sharath | Almeida Jacqueline Toribio
Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching

One language is often assumed to be dominant in code-switching but this assumption has not been empirically tested. We operationalize the matrix language (ML) at the level of the sentence, using three common definitions from linguistics. We test whether these converge and then model this convergence via a set of metrics that together quantify the nature of C-S. We conduct our experiment on four Spanish-English corpora. Our results demonstrate that our model can separate some corpora according to whether they have a dominant ML or not but that the corpora span a range of mixing types that cannot be sorted neatly into an insertional vs. alternational dichotomy.