Robert Pugh

2025

I Speak for the Árboles: Developing a Dependency Treebank for Spanish L2 and Heritage Speakers
Emiliana Pulido | Robert Pugh | Zoey Liu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

We introduce the first dependency treebank containing Universal Dependencies (UD) annotations for Spanish learner writing from the UC Davis COWSL2H corpus. Our annotations include lemmatization, POS tagging, and syntactic dependencies. We adapt the existing UD framework for Spanish L1 to account forlearner-specific features such as code-switching and non-canonical syntax. A suite of parsing evaluation experiments shows that parsers trained on learner data together with moderate sizes of Spanish L1 data can yield reasonable performance. Our annotations are openly accessible to motivate future development of learner-oriented language technologies.

pdf bib

pdf bib abs

This work presents Py-elotl, a suite of tools and resources in Python for processing text in several indigenous languages spoken in Mexico. These resources include parallel corpora, linguistic taggers/analyzers, and orthographic normalization tools. This work aims to develop essential resources to support language pre-processing and linguistic research, and the future creation of more complete downstream applications that could be useful for the speakers and enhance the visibility of these languages. The current version supports language groups such as Nahuatl, Otomi, Mixtec, and Huave. This project is open-source and freely available for use and collaboration

This paper presents the findings of the AmericasNLP 2025 Shared Tasks: (1) machine translation for truly low-resource languages, (2) morphological adaptation for generating educational examples, and (3) developing metrics for machine translation in Indigenous languages. The shared tasks cover 14 diverse Indigenous languages of the Americas. A total of 11 teams participated, submitting 26 systems across all tasks, languages, and models. We describe the shared tasks, introduce the datasets and evaluation metrics used, summarize the baselines and submitted systems, and report our findings.

pdf bib abs

Ihquin tlahtouah in Tetelahtzincocah: An annotated, multi-purpose audio and text corpus of Western Sierra Puebla Nahuatl
Robert Pugh | Cheyenne Wing | María Ximena Juárez Huerta | Ángeles Márquez Hernandez | Francis Tyers
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

The development of digital linguistic resources is essential for enhancing the inclusion of indigenous and marginalized languages in the digital domain. Indigenous languages of Mexico, despite representing vast typological diversity and millions of speakers, have largely been overlooked in NLP until recently. In this paper, we present a corpus of audio and annotated transcriptions of Western Sierra Puebla Nahuatl, an endangered variety of Nahuatl spoken in Puebla, Mexico. The data made available in this corpus are useful for ASR, spelling normalization, and word-level language identification. We detail the corpus-creation process, and describe experiments to report benchmark results for each of these important NLP tasks. The corpus audio and text is made freely available.

2024

pdf bib

Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024)
Manuel Mager | Abteen Ebrahimi | Shruti Rijhwani | Arturo Oncevay | Luis Chiruzzo | Robert Pugh | Katharina von der Wense
Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024)

pdf bib abs

Wav2pos: Exploring syntactic analysis from audio for Highland Puebla Nahuatl
Robert Pugh | Varun Sreedhar | Francis Tyers
Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024)

We describe an approach to part-of-speech tagging from audio with very little human-annotated data, for Highland Puebla Nahuatl, a low-resource language of Mexico. While automatic morphosyntactic analysis is typically trained on annotated textual data, large amounts of text is rarely available for low-resource, marginalized, and/or minority languages, and morphosyntactically-annotated data is even harder to come by. Much of the data from these languages may exist in the form of recordings, often only partially-transcribed or analyzed by field linguists working on language documentation projects. Given this relatively low-availability of text in the low-resource language scenario, we explore end-to-end automated morphosyntactic analysis directly from audio. The experiments described in this paper focus on one piece of morphosyntax, part-of-speech tagging, and builds on existing work in a high-resource setting. We use weak supervision to increase training volume, and explore a few techniques for generating word-level predictions from the acoustic features. Our experiments show promising results, despite less than 400 sentences of audio-aligned, manually-labeled text.

This paper presents the results of the first shared task about the creation of educational materials for three indigenous languages of the Americas.The task proposes to automatically generate variations of sentences according to linguistic features that could be used for grammar exercises.The languages involved in this task are Bribri, Maya, and Guarani.Seven teams took part in the challenge, submitting a total of 22 systems, obtaining very promising results.

pdf bib abs

This paper presents the findings of the third iteration of the AmericasNLP Shared Task on Machine Translation. This year’s competition features eleven Indigenous languages found across North, Central, and South America. A total of six teams participate with a total of 157 submissions across all languages and models. Two baselines – the Sheffield and Helsinki systems from 2023 – are provided and represent hard-to-beat starting points for the competition. In addition to the baselines, teams are given access to a new repository of training data which consists of data collected by teams in prior shared tasks. Using ChrF++ as the main competition metric, we see improvements over the baseline for 4 languages: Chatino, Guarani, Quechua, and Rarámuri, with performance increases over the best baseline of 4.2 ChrF++. In this work, we present a summary of the submitted systems, results, and a human evaluation of system outputs for Bribri, which consists of both (1) a rating of meaning and fluency and (2) a qualitative error analysis of outputs from the best submitted system.

pdf bib abs

A Universal Dependencies Treebank for Highland Puebla Nahuatl
Robert Pugh | Francis Tyers
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

We present a Universal Dependencies (UD) treebank for Highland Puebla Nahuatl. The treebank is only the second such UD corpus for a Mexican language, and supplements an existing treebank for another Nahuatl variant. We describe the process of data collection, annotation decisions and interesting syntactic constructions, and discuss some similarities and differences between the Highland Puebla Nahuatl treebank and the existing Western Sierra Puebla Nahuatl treebank.

pdf bib abs

Experiments in Multi-Variant Natural Language Processing for Nahuatl
Robert Pugh | Francis Tyers
Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024)

Linguistic variation is a complicating factor for digital language technologies. This is particularly true for languages that lack an official “standard” variety, including many regional and minoritized languages. In this paper, we describe a set of experiments focused on multivariant natural language processing for the Nahuatl, an indigenous Mexican language with a high level of linguistic variation and no single recognized standard variant. Using small (10k tokens), recently-published annotated datasets for two Nahuatl variants, we compare the performance of single-variant, cross-variant, and joint training, and explore how different models perform on a third Nahuatl variant, unseen in training. These results and the subsequent discussion contribute to efforts of developing low-resource NLP that is robust to diatopic variation. We share all code used to process the data and run the experiments.

2023

pdf bib abs

Codex to corpus: Exploring annotation and processing for an open and extensible machine-readable edition of the Florentine Codex
Francis Tyers | Robert Pugh | Valery Berthoud F.
Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP)

This paper describes an ongoing effort to create, from the original hand-written text, a machine-readable, linguistically-annotated, and easily-searchable corpus of the Nahuatl portion of the Florentine Codex, a 16th century Mesoamerican manuscript written in Nahuatl and Spanish. The Codex consists of 12 books and over 300,000 tokens. We describe the process of annotating 3 of these books, the steps of text preprocessing undertaken, our approach to efficient manual processing and annotation, and some of the challenges faced along the way. We also report on a set of experiments evaluating our ability to automate the text processing tasks to aid in the remaining annotation effort, and find the results promising despite the relatively low volume of training data. Finally, we briefly present a real use case from the humanities that would benefit from the searchable, linguistically annotated corpus we describe.

pdf bib abs

Developing finite-state language technology for Maya
Robert Pugh | Francis Tyers | Quetzil Castañeda
Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP)

We describe a suite of finite-state language technologies for Maya, a Mayan language spoken in Mexico. At the core is a computational model of Maya morphology and phonology using a finite-state transducer. This model results in a morphological analyzer and a morphologically-informed spell-checker. All of these technologies are designed for use as both a pedagogical reading/writing aid for L2 learners and as a general language processing tool capable of supporting much of the natural variation in written Maya. We discuss the relevant features of Maya morphosyntax and orthography, and then outline the implementation details of the analyzer. To conclude, we present a longer-term vision for these tools and their use by both native speakers and learners.

pdf bib abs

A finite-state morphological analyser for Highland Puebla Nahuatl
Robert Pugh | Francis Tyers
Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP)

This paper describes the development of a free/open-source finite-state morphologicaltransducer for Highland Puebla Nahuatl, a Uto-Aztecan language spoken in and around the stateof Puebla in Mexico. The finite-state toolkit used for the work is the Helsinki Finite-StateToolkit (HFST); we use the lexc formalism for modelling the morphotactics and twol formal-ism for modelling morphophonological alternations. An evaluation is presented which showsthat the transducer has a reasonable coveragearound 90%on freely-available corpora of the language, and high precisionover 95%on a manually verified test set

pdf bib abs

Comparing methods of orthographic conversion for Bàsàá, a language of Cameroon
Alexandra O’neil | Daniel Swanson | Robert Pugh | Francis Tyers | Emmanuel Ngue Um
Proceedings of the Fourth workshop on Resources for African Indigenous Languages (RAIL 2023)

Orthographical standardization is a milestone in a language’s documentation and the development of its resources. However, texts written in former orthographies remain relevant to the language’s history and development and therefore must be converted to the standardized orthography. Ensuring a language has access to the orthographically standardized version of all of its recorded texts is important in the development of resources as it provides additional textual resources for training, supports contribution of authors using former writing systems, and provides information about the development of the language. This paper evaluates the performance of natural language processing methods, specifically Finite State Transducers and Long Short-term Memory networks, for the orthographical conversion of Bàsàá texts from the Protestant missionary orthography to the now-standard AGLC orthography, with the conclusion that LSTMs are somewhat more effective in the absence of explicit lexical information.

2022

pdf bib abs

Universal Dependencies for Western Sierra Puebla Nahuatl
Robert Pugh | Marivel Huerta Mendez | Mitsuya Sasaki | Francis Tyers
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We present a morpho-syntactically-annotated corpus of Western Sierra Puebla Nahuatl that conforms to the annotation guidelines of the Universal Dependencies project. We describe the sources of the texts that make up the corpus, the annotation process, and important annotation decisions made throughout the development of the corpus. As the first indigenous language of Mexico to be added to the Universal Dependencies project, this corpus offers a good opportunity to test and more clearly define annotation guidelines for the Meso-american linguistic area, spontaneous and elicited spoken data, and code-switching.

2021

pdf bib abs

Investigating variation in written forms of Nahuatl using character-based language models
Robert Pugh | Francis Tyers
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

We describe experiments with character-based language modeling for written variants of Nahuatl. Using a standard LSTM model and publicly available Bible translations, we explore how character language models can be applied to the tasks of estimating mutual intelligibility, identifying genetic similarity, and distinguishing written variants. We demonstrate that these simple language models are able to capture similarities and differences that have been described in the linguistic literature.

pdf bib

Towards an Open Source Finite-State Morphological Analyzer for Zacatlán-Ahuacatlán-Tepetzintla Nahuatl
Robert Pugh | Francis Tyers | Marivel Huerta Mendez
Proceedings of the 4th Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)

2019

pdf bib abs

How to account for mispellings: Quantifying the benefit of character representations in neural content scoring models
Brian Riordan | Michael Flor | Robert Pugh
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

Character-based representations in neural models have been claimed to be a tool to overcome spelling variation in in word token-based input. We examine this claim in neural models for content scoring. We formulate precise hypotheses about the possible effects of adding character representations to word-based models and test these hypotheses on large-scale real world content scoring datasets. We find that, while character representations may provide small performance gains in general, their effectiveness in accounting for spelling variation may be limited. We show that spelling correction can provide larger gains than character representations, and that spelling correction improves the performance of models with character representations. With these insights, we report a new state of the art on the ASAP-SAS content scoring dataset.

2018

pdf bib abs

Automatic Token and Turn Level Language Identification for Code-Switched Text Dialog: An Analysis Across Language Pairs and Corpora
Vikram Ramanarayanan | Robert Pugh
Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue

We examine the efficacy of various feature–learner combinations for language identification in different types of text-based code-switched interactions – human-human dialog, human-machine dialog as well as monolog – at both the token and turn levels. In order to examine the generalization of such methods across language pairs and datasets, we analyze 10 different datasets of code-switched text. We extract a variety of character- and word-based text features and pass them into multiple learners, including conditional random fields, logistic regressors and recurrent neural networks. We further examine the efficacy of novel character-level embedding and GloVe features in improving performance and observe that our best-performing text system significantly outperforms a majority vote baseline across language pairs and datasets.

2017

pdf bib abs

Native Language Identification (NLI) is the task of automatically identifying the native language (L1) of an individual based on their language production in a learned language. It is typically framed as a classification task where the set of L1s is known a priori. Two previous shared tasks on NLI have been organized where the aim was to identify the L1 of learners of English based on essays (2013) and spoken responses (2016) they provided during a standardized assessment of academic English proficiency. The 2017 shared task combines the inputs from the two prior tasks for the first time. There are three tracks: NLI on the essay only, NLI on the spoken response only (based on a transcription of the response and i-vector acoustic features), and NLI using both responses. We believe this makes for a more interesting shared task while building on the methods and results from the previous two shared tasks. In this paper, we report the results of the shared task. A total of 19 teams competed across the three different sub-tasks. The fusion track showed that combining the written and spoken responses provides a large boost in prediction accuracy. Multiple classifier systems (e.g. ensembles and meta-classifiers) were the most effective in all tasks, with most based on traditional classifiers (e.g. SVMs) with lexical/syntactic features.