2024
GlossLM: A Massively Multilingual Corpus and Pretrained Model for Interlinear Glossed Text
Michael Ginn | Lindia Tjuatja | Taiqi He | Enora Rice | Graham Neubig | Alexis Palmer | Lori Levin
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Language documentation projects often involve the creation of annotated text in a format such as interlinear glossed text (IGT), which captures fine-grained morphosyntactic analyses in a morpheme-by-morpheme format. However, there are few existing resources providing large amounts of standardized, easily accessible IGT data, limiting their applicability to linguistic research and making it difficult to use such data in NLP modeling. We compile the largest existing corpus of IGT data from a variety of sources, covering over 450k examples across 1.8k languages, to enable research on crosslingual transfer and IGT generation. We normalize much of our data to follow a standard set of labels across languages. Furthermore, we explore the task of automatically generating IGT in order to aid documentation projects. As many languages lack sufficient monolingual data, we pretrain a large multilingual model on our corpus. We demonstrate the utility of this model by finetuning it on monolingual corpora, outperforming SOTA models by up to 6.6%. Our pretrained model and dataset are available on Hugging Face: https://huggingface.co/collections/lecslab/glosslm-66da150854209e910113dd87
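As a rough illustration of how the released artifacts might be used, the sketch below loads a checkpoint through the standard Hugging Face transformers seq2seq interface. The repository ID and the input format string are placeholders (only the collection URL above is given in the abstract), so consult the collection for the actual identifiers.

```python
# Minimal sketch, assuming the released checkpoint exposes a standard seq2seq
# interface via Hugging Face transformers. The repo ID below is a placeholder;
# the real IDs are listed in the collection linked above.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_ID = "lecslab/glosslm"  # placeholder: check the collection for the real ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

# IGT generation framed as seq2seq: transcription (plus translation) in, gloss line out.
# This input format is an assumption for illustration only.
source = "transcription: o sea xtok rixoqiil translation: he wanted a wife"
inputs = tokenizer(source, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```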
BELT: Building Endangered Language Technology
Michael Ginn | David Saavedra-Beltrán | Camilo Robayo | Alexis Palmer
Proceedings of the Sixth Workshop on Teaching NLP
The development of language technology (LT) for an endangered language is often identified as a goal in language revitalization efforts, but developing such technologies is typically subject to additional methodological challenges as well as social and ethical concerns. In particular, LT development has too often taken on colonialist qualities, extracting language data, relying on outside experts, and denying the speakers of a language sovereignty over the technologies produced. We seek to avoid such an approach through the development of the Building Endangered Language Technology (BELT) website, an educational resource designed for speakers and community members with limited technological experience to develop LTs for their own language. Specifically, BELT provides interactive lessons on basic Python programming, coupled with projects to develop specific language technologies, such as spellcheckers or word games. In this paper, we describe BELT’s design, the motivation underlying many key decisions, and preliminary responses from learners.
Decomposing Fusional Morphemes with Vector Embeddings
Michael Ginn | Alexis Palmer
Proceedings of the 21st SIGMORPHON workshop on Computational Research in Phonetics, Phonology, and Morphology
Distributional approaches have proven effective in modeling semantics and phonology through vector embeddings. We explore whether distributional representations can also effectively model morphological information. We train static vector embeddings over morphological sequences. Then, we explore morpheme categories for fusional morphemes, which encode multiple linguistic dimensions, and often have close relationships to other morphemes. We study whether the learned vector embeddings align with these linguistic dimensions, finding strong evidence that this is the case. Our work uses two low-resource languages, Uspanteko and Tsez, demonstrating that distributional morphological representations are effective even with limited data.
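A minimal sketch of the kind of setup described here: training static skip-gram embeddings over morpheme gloss sequences with gensim. The toy gloss lines are invented for illustration and do not come from the Uspanteko or Tsez data.

```python
# Sketch: static embeddings trained over morpheme (gloss) sequences with gensim.
# The gloss lines below are invented toy examples.
from gensim.models import Word2Vec

gloss_lines = [
    ["COM", "ir", "E3s", "casa"],           # each item is one morpheme gloss label
    ["INC", "ver", "A3s", "E3s", "perro"],
    ["COM", "ver", "A3s", "mujer"],
]

model = Word2Vec(
    sentences=gloss_lines,
    vector_size=50,   # small dimensionality for a small corpus
    window=3,
    min_count=1,
    sg=1,             # skip-gram
    epochs=50,
)

# Nearest neighbors of a morpheme label in the learned embedding space
print(model.wv.most_similar("COM", topn=3))
```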
Can we teach language models to gloss endangered languages?
Michael Ginn | Mans Hulden | Alexis Palmer
Findings of the Association for Computational Linguistics: EMNLP 2024
Interlinear glossed text (IGT) is a popular format in language documentation projects, where each morpheme is labeled with a descriptive annotation. Automating the creation of interlinear glossed text would be desirable to reduce annotator effort and maintain consistency across annotated corpora. Prior research has explored a number of statistical and neural methods for automatically producing IGT. As large language models (LLMs) have shown promising results across multilingual tasks, even for rare, endangered languages, it is natural to wonder whether they can be utilized for the task of generating IGT. We explore whether LLMs can be effective at the task of interlinear glossing with in-context learning, without any traditional training. We propose new approaches for selecting examples to provide in-context, observing that targeted selection can significantly improve performance. We find that LLM-based methods beat standard transformer baselines, despite requiring no training at all. These approaches still underperform state-of-the-art supervised systems for the task, but are highly practical for researchers outside of the NLP community, requiring minimal effort to use.
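One way to read "targeted selection" is choosing in-context examples whose transcriptions overlap most with the sentence to be glossed. The sketch below implements that reading with toy data and an illustrative prompt format, which may differ from the paper's actual selection method and prompts.

```python
# Sketch of targeted in-context example selection for glossing, under the
# assumption that "targeted" means ranking training examples by word overlap
# with the target sentence. The data and prompt wording are illustrative.

def word_overlap(a: str, b: str) -> int:
    return len(set(a.lower().split()) & set(b.lower().split()))

def build_prompt(train_igt, target, k=3):
    # train_igt: list of (transcription, gloss_line) pairs
    ranked = sorted(train_igt, key=lambda ex: word_overlap(ex[0], target), reverse=True)
    lines = ["Gloss each sentence morpheme by morpheme.\n"]
    for transcription, gloss in ranked[:k]:
        lines.append(f"Sentence: {transcription}\nGloss: {gloss}\n")
    lines.append(f"Sentence: {target}\nGloss:")
    return "\n".join(lines)

train_igt = [
    ("xtok rixoqiil", "COM-buscar E3s-esposa"),
    ("xel li", "COM-salir DEM"),
]
print(build_prompt(train_igt, "xtok li"))
```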
On the Robustness of Neural Models for Full Sentence Transformation
Michael Ginn | Ali Marashian | Bhargav Shandilya | Claire Post | Enora Rice | Juan Vásquez | Marie Mcgregor | Matthew Buchholz | Mans Hulden | Alexis Palmer
Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024)
This paper describes the LECS Lab submission to the AmericasNLP 2024 Shared Task on the Creation of Educational Materials for Indigenous Languages. The task requires transforming a base sentence with regard to one or more linguistic properties (such as negation or tense). We observe that this task shares many similarities with the well-studied task of word-level morphological inflection, and we explore whether the findings from inflection research are applicable to this task. In particular, we experiment with a number of augmentation strategies, finding that they can significantly benefit performance, but that not all augmented data is necessarily beneficial. Furthermore, we find that our character-level neural models show high variability with regard to performance on unseen data, and may not be the best choice when training data is limited.
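For a concrete sense of what an augmentation strategy from the inflection literature looks like, the sketch below "hallucinates" new source-target pairs by replacing a shared substring with a random string. This is offered only as an illustration of the general idea; the strategies actually used in the submission may differ.

```python
# Sketch of hallucination-style augmentation: swap the longest substring shared
# by the source and target sentences for a random string, producing a new
# synthetic pair. Illustrative only; not necessarily the paper's method.
import random
import string

def hallucinate(src: str, tgt: str, min_len: int = 4):
    # find the longest substring shared by src and tgt
    best = ""
    for i in range(len(src)):
        for j in range(i + len(best) + 1, len(src) + 1):
            if src[i:j] in tgt and j - i > len(best):
                best = src[i:j]
    if len(best) < min_len:
        return None  # nothing safe to swap
    fake = "".join(random.choices(string.ascii_lowercase, k=len(best)))
    return src.replace(best, fake, 1), tgt.replace(best, fake, 1)

print(hallucinate("ojalá que juegue", "ojalá que no juegue"))
```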
Resisting the Lure of the Skyline: Grounding Practices in Active Learning for Morphological Inflection
Saliha Muradoglu | Michael Ginn | Miikka Silfverberg | Mans Hulden
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Active learning (AL) aims to lower the demand for annotation by selecting informative unannotated samples for model building. In this paper, we explore the importance of conscious experimental design in the language documentation and description setting, particularly the distribution of the unannotated sample pool. We focus on the task of morphological inflection using a Transformer model. We propose context-motivated benchmarks: a baseline and a skyline. The baseline describes the frequency-weighted distribution encountered in natural speech, which we simulate using Wikipedia texts. The skyline defines the more common approach, uniform sampling from a large, balanced corpus (UniMorph, in our case), which often yields mixed results. We note the unrealistic nature of this unannotated pool. When these factors are considered, our results show a clear benefit to targeted sampling.
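The contrast between the two pools can be made concrete with a small sketch: frequency-weighted sampling for the baseline (toy counts standing in for Wikipedia frequencies) versus uniform sampling over a balanced form list for the skyline. The data below is invented for illustration.

```python
# Sketch contrasting the two unannotated-pool constructions described above.
import random

# word form -> corpus frequency (toy stand-in for Wikipedia counts)
freq = {"olla": 900, "ollut": 300, "olisi": 150, "lienee": 5, "oltaneen": 1}

def baseline_pool(n):
    """Frequency-weighted sampling: common forms dominate, rare ones are scarce."""
    forms, weights = zip(*freq.items())
    return random.choices(forms, weights=weights, k=n)

def skyline_pool(n):
    """Uniform sampling from a balanced paradigm table: every form equally likely."""
    return random.choices(list(freq), k=n)

random.seed(0)
print("baseline:", baseline_pool(10))
print("skyline: ", skyline_pool(10))
```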
PyFoma: a Python finite-state compiler module
Mans Hulden | Michael Ginn | Miikka Silfverberg | Michael Hammond
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
We describe PyFoma, an open-source Python module for constructing weighted and unweighted finite-state transducers and automata from regular expressions, string rewriting rules, right-linear grammars, or low-level state/transition manipulation. A large variety of standard algorithms for working with finite-state machines is included, with a particular focus on the needs of linguistic and NLP applications. The data structures and code in the module are designed for legibility to allow for potential use in teaching the theory and algorithms associated with finite-state machines.
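A minimal usage sketch, assuming the FST.re constructor shown in the project README; the method names here follow that README as remembered and should be checked against the current PyFoma documentation.

```python
# Minimal sketch of compiling a regular expression into a finite-state machine
# with PyFoma. Method names are an assumption based on the project README.
from pyfoma import FST

# Accepts: ever, never, sever, clever, ear, near, ...
fst = FST.re("(cl|n|s)?e(ve|a)r")

print(fst)  # inspect the compiled machine; fst.view() renders it graphically in a notebook
```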
2023
Ginn-Khamov at SemEval-2023 Task 6, Subtask B: Legal Named Entities Extraction for Heterogenous Documents
Michael Ginn | Roman Khamov
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)
This paper describes our submission to SemEval-2023 Task 6, Subtask B, a shared task on performing Named Entity Recognition in legal documents for specific legal entity types. Documents are divided into the preamble and judgement texts, and certain entity types should only be tagged in one of the two text sections. To address this challenge, our team proposes a token classification model that is augmented with information about the document type, which achieves greater performance than the non-augmented system.
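One simple way to augment a token classifier with document-type information is to prepend a section marker to the input, as in the sketch below; this illustrates the general idea rather than the submission's exact mechanism.

```python
# Sketch of injecting document-type information into a token classification
# input by prepending a marker token. Illustrative only.
def add_doc_type(tokens: list[str], doc_type: str) -> list[str]:
    """doc_type is 'PREAMBLE' or 'JUDGEMENT'; the marker becomes the first token."""
    return [f"[{doc_type}]"] + tokens

tokens = ["The", "Hon'ble", "Supreme", "Court", "of", "India"]
print(add_doc_type(tokens, "PREAMBLE"))
# A downstream labeler can then learn to suppress entity types that should
# only be tagged in the other section.
```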
Findings of the SIGMORPHON 2023 Shared Task on Interlinear Glossing
Michael Ginn | Sarah Moeller | Alexis Palmer | Anna Stacey | Garrett Nicolai | Mans Hulden | Miikka Silfverberg
Proceedings of the 20th SIGMORPHON workshop on Computational Research in Phonetics, Phonology, and Morphology
This paper presents the findings of the SIGMORPHON 2023 Shared Task on Interlinear Glossing. This first iteration of the shared task explores glossing of a set of six typologically diverse languages: Arapaho, Gitksan, Lezgi, Natügu, Tsez and Uspanteko. The shared task encompasses two tracks: a resource-scarce closed track and an open track, where participants are allowed to utilize external data resources. Five teams participated in the shared task. The winning team Tü-CL achieved a 23.99%-point improvement over a baseline RoBERTa system in the closed track and a 17.42%-point improvement in the open track.
Robust Generalization Strategies for Morpheme Glossing in an Endangered Language Documentation Context
Michael Ginn | Alexis Palmer
Proceedings of the 1st GenBench Workshop on (Benchmarking) Generalisation in NLP
Generalization is of particular importance in resource-constrained settings, where the available training data may represent only a small fraction of the distribution of possible texts. We investigate the ability of morpheme labeling models to generalize by evaluating their performance on unseen genres of text, and we experiment with strategies for closing the gap between performance on in-distribution and out-of-distribution data. Specifically, we use weight decay optimization, output denoising, and iterative pseudo-labeling, and achieve a 2% improvement on a test set containing texts from unseen genres. All experiments are performed using texts written in the Mayan language Uspanteko.
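Of the three strategies, iterative pseudo-labeling is the most procedural; the sketch below shows the generic loop with a scikit-learn stand-in model and synthetic data, not the paper's morpheme labeling architecture or features.

```python
# Sketch of iterative pseudo-labeling with a generic classifier. The model,
# features, and confidence threshold are stand-ins for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_lab = rng.normal(size=(40, 5))
y_lab = (X_lab[:, 0] > 0).astype(int)         # toy labels: sign of the first feature
X_unlab = rng.normal(size=(200, 5))

for it in range(3):
    clf = LogisticRegression().fit(X_lab, y_lab)
    probs = clf.predict_proba(X_unlab)
    confident = probs.max(axis=1) > 0.9        # keep only high-confidence predictions
    if not confident.any():
        break
    X_lab = np.vstack([X_lab, X_unlab[confident]])
    y_lab = np.concatenate([y_lab, probs[confident].argmax(axis=1)])
    X_unlab = X_unlab[~confident]
    print(f"iteration {it}: added {confident.sum()} pseudo-labeled examples")
```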