Enora Rice


2024

pdf bib
GlossLM: A Massively Multilingual Corpus and Pretrained Model for Interlinear Glossed Text
Michael Ginn | Lindia Tjuatja | Taiqi He | Enora Rice | Graham Neubig | Alexis Palmer | Lori Levin
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Language documentation projects often involve the creation of annotated text in a format such as interlinear glossed text (IGT), which captures fine-grained morphosyntactic analyses in a morpheme-by-morpheme format. However, there are few existing resources providing large amounts of standardized, easily accessible IGT data, limiting their applicability to linguistic research, and making it difficult to use such data in NLP modeling. We compile the largest existing corpus of IGT data from a variety of sources, covering over 450k examples across 1.8k languages, to enable research on crosslingual transfer and IGT generation. We normalize much of our data to follow a standard set of labels across languages.Furthermore, we explore the task of automatically generating IGT in order to aid documentation projects. As many languages lack sufficient monolingual data, we pretrain a large multilingual model on our corpus. We demonstrate the utility of this model by finetuning it on monolingual corpora, outperforming SOTA models by up to 6.6%. Our pretrained model and dataset are available on Hugging Face: https://huggingface.co/collections/lecslab/glosslm-66da150854209e910113dd87

pdf bib
On the Robustness of Neural Models for Full Sentence Transformation
Michael Ginn | Ali Marashian | Bhargav Shandilya | Claire Post | Enora Rice | Juan Vásquez | Marie Mcgregor | Matthew Buchholz | Mans Hulden | Alexis Palmer
Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024)

This paper describes the LECS Lab submission to the AmericasNLP 2024 Shared Task on the Creation of Educational Materials for Indigenous Languages. The task requires transforming a base sentence with regards to one or more linguistic properties (such as negation or tense). We observe that this task shares many similarities with the well-studied task of word-level morphological inflection, and we explore whether the findings from inflection research are applicable to this task. In particular, we experiment with a number of augmentation strategies, finding that they can significantly benefit performance, but that not all augmented data is necessarily beneficial. Furthermore, we find that our character-level neural models show high variability with regards to performance on unseen data, and may not be the best choice when training data is limited.

pdf bib
TAMS: Translation-Assisted Morphological Segmentation
Enora Rice | Ali Marashian | Luke Gessler | Alexis Palmer | Katharina von der Wense
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Canonical morphological segmentation is the process of analyzing words into the standard (aka underlying) forms of their constituent morphemes.This is a core task in endangered language documentation, and NLP systems have the potential to dramatically speed up this process. In typical language documentation settings, training data for canonical morpheme segmentation is scarce, making it difficult to train high quality models. However, translation data is often much more abundant, and, in this work, we present a method that attempts to leverage translation data in the canonical segmentation task. We propose a character-level sequence-to-sequence model that incorporates representations of translations obtained from pretrained high-resource monolingual language models as an additional signal. Our model outperforms the baseline in a super-low resource setting but yields mixed results on training splits with more data. Additionally, we find that we can achieve strong performance even without needing difficult-to-obtain word level alignments. While further work is needed to make translations useful in higher-resource settings, our model shows promise in severely resource-constrained settings.

2023

pdf bib
Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP)
Manuel Mager | Abteen Ebrahimi | Arturo Oncevay | Enora Rice | Shruti Rijhwani | Alexis Palmer | Katharina Kann
Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP)

pdf bib
Findings of the AmericasNLP 2023 Shared Task on Machine Translation into Indigenous Languages
Abteen Ebrahimi | Manuel Mager | Shruti Rijhwani | Enora Rice | Arturo Oncevay | Claudia Baltazar | María Cortés | Cynthia Montaño | John E. Ortega | Rolando Coto-solano | Hilaria Cruz | Alexis Palmer | Katharina Kann
Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP)

In this work, we present the results of the AmericasNLP 2023 Shared Task on Machine Translation into Indigenous Languages of the Americas. This edition of the shared task featured eleven language pairs, one of which – Chatino-Spanish – uses a newly collected evaluation dataset, consisting of professionally translated text from the legal domain. Seven teams participated in the shared task, with a total of 181 submissions. Additionally, we conduct a human evaluation of the best system outputs, and compare them to the best submissions from the prior shared task. We find that this analysis agrees with the quantitative measures used to rank submissions, which shows further improvements of 9.64 ChrF on average across all languages, when compared to the prior winning system.