Olga Kellert

2025

Where and How Do Languages Mix? A Study of Spanish-Guaraní Code-Switching in Paraguay
Olga Kellert | Nemika Tyagi
Proceedings of the 7th Workshop on Computational Approaches to Linguistic Code-Switching

Code-switching, the alternating use of multiple languages within a single utterance, is a widespread linguistic phenomenon that poses unique challenges for both sociolinguistic analysis and Natural Language Processing (NLP). While prior research has explored code-switching from either a syntactic or geographic perspective, few studies have integrated both aspects, particularly for underexplored language pairs like Spanish-Guaraní. In this paper, we analyze Spanish-Guaraní code-switching using a dataset of geotagged tweets from Asunción, Paraguay, collected from 2017 to 2021. We employ a differential distribution method to map the geographic distribution of code-switching across urban zones and analyze its syntactic positioning within sentences. Our findings reveal distinct spatial patterns, with Guaraní-dominant tweets concentrated in the western and southwestern areas, while Spanish-only tweets are more prevalent in central and eastern regions. Syntactic analysis shows that code-switching occurs most frequently in the middle of sentences, often involving verbs, pronouns, and adjectives. These results provide new insights into the interaction between linguistic, social, and geographic factors in bilingual communication. Our study contributes to both sociolinguistic research and NLP applications, offering a framework for analyzing mixed-language data in digital communication.

pdf bib abs

Parsing the Switch: LLM-Based UD Annotation for Complex Code-Switched and Low-Resource Languages
Olga Kellert | Nemika Tyagi | Muhammad Imran | Nelvin Licona-Guevara | Carlos Gómez-Rodríguez
Findings of the Association for Computational Linguistics: EMNLP 2025

Code-switching presents a complex challenge for syntactic analysis, especially in low-resource language settings where annotated data is scarce. While recent work has explored the use of large language models (LLMs) for sequence-level tagging, few approaches systematically investigate how well these models capture syntactic structure in code-switched contexts. Moreover, existing parsers trained on monolingual treebanks often fail to generalize to multilingual and mixed-language input. To address this gap, we introduce the BiLingua Pipeline, an LLM-based annotation pipeline designed to produce Universal Dependencies (UD) annotations for code-switched text. First, we develop a prompt-based framework for Spanish-English and Spanish-Guaraní data, combining few-shot LLM prompting with expert review. Second, we release two annotated datasets, including the first Spanish-Guaraní UD-parsed corpus. Third, we conduct a detailed syntactic analysis of switch points across language pairs and communicative contexts. Experimental results show that BiLingua Pipeline achieves up to 95.29% LAS after expert revision, significantly outperforming prior baselines and multilingual parsers. These results show that LLMs, when carefully guided, can serve as practical tools for bootstrapping syntactic resources in under-resourced, code-switched environments.

2023

pdf bib abs

Use of NLP in the Context of Belief states of Ethnic Minorities in Latin America
Olga Kellert | Mahmud Zaman
Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP)

The major goal of our study is to test methods in NLP in the domain of health care education related to Covid-19 of vulnerable groups such as indigenous people from Latin America. In order to achieve this goal, we asked participants in a survey questionnaire to provide answers about health related topics. We used these answers to measure the health education status ofour participants. In this paper, we summarize the results from our NLP-application on the participants’ answers. In the first experiment, we use embeddings-based tools to measure the semantic similarity between participants’ answers and “expert” or “reference” answers. In the second experiment, we use synonym-based methods to classify answers under topics. We compare the results from both experiments with human annotations. Our results show that the tested NLP-methods reach a significantly lower accuracy score than human annotations in both experiments. We explain this difference by the assumption that human annotators are much better in pragmatic inferencing necessary to classify the semantic similarity and topic classification of answers.

2022

pdf bib abs

Using neural topic models to track context shifts of words: a case study of COVID-related terms before and after the lockdown in April 2020
Olga Kellert | Md Mahmud Uz Zaman
Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change

This paper explores lexical meaning changes in a new dataset, which includes tweets from before and after the COVID-related lockdown in April 2020. We use this dataset to evaluate traditional and more recent unsupervised approaches to lexical semantic change that make use of contextualized word representations based on the BERT neural language model to obtain representations of word usages. We argue that previous models that encode local representations of words cannot capture global context shifts such as the context shift of face masks since the pandemic outbreak. We experiment with neural topic models to track context shifts of words. We show that this approach can reveal textual associations of words that go beyond their lexical meaning representation. We discuss future work and how to proceed capturing the pragmatic aspect of meaning change as opposed to lexical semantic change.

pdf bib abs

Social Context and User Profiles of Linguistic Variation on a Micro Scale
Olga Kellert | Nicholas Hill Matlis
Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects

This paper presents a new tweet-based approach in geolinguistic analysis which combines geolocation, user IDs and textual features in order to identify patterns of linguistic variation on a sub-city scale. Sub-city variations can be connected to social drivers and thus open new opportunities for understanding the mechanisms of language variation and change. However, measuring linguistic variation on these scales is challenging due to the lack of highly-spatially-resolved data as well as to the daily movement or users’ “mobility” inside cities which can obscure the relation between the social context and linguistic variation. Here we demonstrate how combining geolocation with user IDs and textual analysis of tweets can yield information about the linguistic profiles of the users, the social context associated with specific locations and their connection to linguistic variation. We apply our methodology to analyze dialects in Buenos Aires and find evidence of socially-driven variation. Our methods will contribute to the identification of sociolinguistic patterns inside cities, which are valuable in social sciences and social services.

Co-authors

Nicholas Hill Matlis 1

Mahmud Zaman 1

Venues

WS1

Fix author