Michael Gasser


2022

pdf bib
UniMorph 4.0: Universal Morphology
Khuyagbaatar Batsuren | Omer Goldman | Salam Khalifa | Nizar Habash | Witold Kieraś | Gábor Bella | Brian Leonard | Garrett Nicolai | Kyle Gorman | Yustinus Ghanggo Ate | Maria Ryskina | Sabrina Mielke | Elena Budianskaya | Charbel El-Khaissi | Tiago Pimentel | Michael Gasser | William Abbott Lane | Mohit Raj | Matt Coler | Jaime Rafael Montoya Samame | Delio Siticonatzi Camaiteri | Esaú Zumaeta Rojas | Didier López Francis | Arturo Oncevay | Juan López Bautista | Gema Celeste Silva Villegas | Lucas Torroba Hennigen | Adam Ek | David Guriel | Peter Dirix | Jean-Philippe Bernardy | Andrey Scherbakov | Aziyana Bayyr-ool | Antonios Anastasopoulos | Roberto Zariquiey | Karina Sheifer | Sofya Ganieva | Hilaria Cruz | Ritván Karahóǧa | Stella Markantonatou | George Pavlidis | Matvey Plugaryov | Elena Klyachko | Ali Salehi | Candy Angulo | Jatayu Baxi | Andrew Krizhanovsky | Natalia Krizhanovskaya | Elizabeth Salesky | Clara Vania | Sardana Ivanova | Jennifer White | Rowan Hall Maudslay | Josef Valvoda | Ran Zmigrod | Paula Czarnowska | Irene Nikkarinen | Aelita Salchak | Brijesh Bhatt | Christopher Straughn | Zoey Liu | Jonathan North Washington | Yuval Pinter | Duygu Ataman | Marcin Wolinski | Totok Suhardijanto | Anna Yablonskaya | Niklas Stoehr | Hossep Dolatian | Zahroh Nuriah | Shyam Ratan | Francis M. Tyers | Edoardo M. Ponti | Grant Aiton | Aryaman Arora | Richard J. Hatcher | Ritesh Kumar | Jeremiah Young | Daria Rodionova | Anastasia Yemelina | Taras Andrushko | Igor Marchenko | Polina Mashkovtseva | Alexandra Serova | Emily Prud’hommeaux | Maria Nepomniashchaya | Fausto Giunchiglia | Eleanor Chodroff | Mans Hulden | Miikka Silfverberg | Arya D. McCarthy | David Yarowsky | Ryan Cotterell | Reut Tsarfaty | Ekaterina Vylomova
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation, and a type-level resource of annotated data in diverse languages realizing that schema. This paper presents the expansions and improvements on several fronts that were made in the last couple of years (since McCarthy et al. (2020)). Collaborative efforts by numerous linguists have added 66 new languages, including 24 endangered languages. We have implemented several improvements to the extraction pipeline to tackle some issues, e.g., missing gender and macrons information. We have amended the schema to use a hierarchical structure that is needed for morphological phenomena like multiple-argument agreement and case stacking, while adding some missing morphological features to make the schema more inclusive. In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages. Lastly, this new release makes a push towards inclusion of derivational morphology in UniMorph by enriching the data and annotation schema with instances representing derivational processes from MorphyNet.

2021

pdf bib
SIGMORPHON 2021 Shared Task on Morphological Reinflection: Generalization Across Languages
Tiago Pimentel | Maria Ryskina | Sabrina J. Mielke | Shijie Wu | Eleanor Chodroff | Brian Leonard | Garrett Nicolai | Yustinus Ghanggo Ate | Salam Khalifa | Nizar Habash | Charbel El-Khaissi | Omer Goldman | Michael Gasser | William Lane | Matt Coler | Arturo Oncevay | Jaime Rafael Montoya Samame | Gema Celeste Silva Villegas | Adam Ek | Jean-Philippe Bernardy | Andrey Shcherbakov | Aziyana Bayyr-ool | Karina Sheifer | Sofya Ganieva | Matvey Plugaryov | Elena Klyachko | Ali Salehi | Andrew Krizhanovsky | Natalia Krizhanovsky | Clara Vania | Sardana Ivanova | Aelita Salchak | Christopher Straughn | Zoey Liu | Jonathan North Washington | Duygu Ataman | Witold Kieraś | Marcin Woliński | Totok Suhardijanto | Niklas Stoehr | Zahroh Nuriah | Shyam Ratan | Francis M. Tyers | Edoardo M. Ponti | Grant Aiton | Richard J. Hatcher | Emily Prud’hommeaux | Ritesh Kumar | Mans Hulden | Botond Barta | Dorina Lakatos | Gábor Szolnok | Judit Ács | Mohit Raj | David Yarowsky | Ryan Cotterell | Ben Ambridge | Ekaterina Vylomova
Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

This year’s iteration of the SIGMORPHON Shared Task on morphological reinflection focuses on typological diversity and cross-lingual variation of morphosyntactic features. In terms of the task, we enrich UniMorph with new data for 32 languages from 13 language families, with most of them being under-resourced: Kunwinjku, Classical Syriac, Arabic (Modern Standard, Egyptian, Gulf), Hebrew, Amharic, Aymara, Magahi, Braj, Kurdish (Central, Northern, Southern), Polish, Karelian, Livvi, Ludic, Veps, Võro, Evenki, Xibe, Tuvan, Sakha, Turkish, Indonesian, Kodi, Seneca, Asháninka, Yanesha, Chukchi, Itelmen, Eibela. We evaluate six systems on the new data and conduct an extensive error analysis of the systems’ predictions. Transformer-based models generally demonstrate superior performance on the majority of languages, achieving >90% accuracy on 65% of them. The languages on which systems yielded low accuracy are mainly under-resourced, with a limited amount of data. Most errors made by the systems are due to allomorphy, honorificity, and form variation. In addition, we observe that systems especially struggle to inflect multiword lemmas. The systems also produce misspelled forms or end up in repetitive loops (e.g., RNN-based models). Finally, we report a large drop in systems’ performance on previously unseen lemmas.

2020

pdf bib
Character Alignment in Morphologically Complex Translation Sets for Related Languages
Michael Gasser | Binyam Ephrem Seyoum | Nazareth Amlesom Kifle
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

For languages with complex morphology, word-to-word translation is a task with various potential applications, for example, in information retrieval, language instruction, and dictionary creation, as well as in machine translation. In this paper, we confine ourselves to the subtask of character alignment for the particular case of families of related languages with very few resources for most or all members. There are many such families; we focus on the subgroup of Semitic languages spoken in Ethiopia and Eritrea. We begin with an adaptation of the familiar alignment algorithms behind statistical machine translation, modifying them as appropriate for our task. We show how character alignment can reveal morphological, phonological, and orthographic correspondences among related languages.

bib
A Translation-Based Approach to Morphology Learning for Low Resource Languages
Tewodros Gebreselassie | Amanuel Mersha | Michael Gasser
Proceedings of the Fourth Widening Natural Language Processing Workshop

“Low resource languages” usually refers to languages that lack corpora and basic tools such as part-of-speech taggers. But a significant number of such languages do benefit from the availability of relatively complex linguistic descriptions of phonology, morphology, and syntax, as well as dictionaries. A further category, probably the majority of the world’s languages, suffers from the lack of even these resources. In this paper, we investigate the possibility of learning the morphology of such a language by relying on its close relationship to a language with more resources. Specifically, we use a transfer-based approach to learn the morphology of the severely under-resourced language Gofa, starting with a neural morphological generator for the closely related language, Wolaytta. Both languages are members of the Omotic family, spoken and southwestern Ethiopia, and, like other Omotic languages, both are morphologically complex. We first create a finite- state transducer for morphological analysis and generation for Wolaytta, based on relatively complete linguistic descriptions and lexicons for the language. Next, we train an encoder-decoder neural network on the task of morphological generation for Wolaytta, using data generated by the FST. Such a network takes a root and a set of grammatical features as input and generates a word form as output. We then elicit Gofa translations of a small set of Wolaytta words from bilingual speakers. Finally, we retrain the decoder of the Wolaytta network, using a small set of Gofa target words that are translations of the Wolaytta outputs of the original network. The evaluation shows that the transfer network performs better than a separate encoder-decoder network trained on a larger set of Gofa words. We conclude with implications for the learning of morphology for severely under-resourced languages in regions where there are related languages with more resources.

2018

pdf bib
Contemporary Amharic Corpus: Automatically Morpho-Syntactically Tagged Amharic Corpus
Andargachew Mekonnen Gezmu | Binyam Ephrem Seyoum | Michael Gasser | Andreas Nürnberger
Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing

We introduced the contemporary Amharic corpus, which is automatically tagged for morpho-syntactic information. Texts are collected from 25,199 documents from different domains and about 24 million orthographic words are tokenized. Since it is partly a web corpus, we made some automatic spelling error correction. We have also modified the existing morphological analyzer, HornMorpho, to use it for the automatic tagging.

2014

pdf bib
Guampa: a Toolkit for Collaborative Translation
Alex Rudnick | Taylor Skidmore | Alberto Samaniego | Michael Gasser
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Here we present Guampa, a new software package for online collaborative translation. This system grows out of our discussions with Guarani-language activists and educators in Paraguay, and attempts to address problems faced by machine translation researchers and by members of any community speaking an under-represented language. Guampa enables volunteers and students to work together to translate documents into heritage languages, both to make more materials available in those languages, and also to generate bitext suitable for training machine translation systems. While many approaches to crowdsourcing bitext corpora focus on Mechanical Turk and temporarily engaging anonymous workers, Guampa is intended to foster an online community in which discussions can take place, language learners can practice their translation skills, and complete documents can be translated. This approach is appropriate for the Spanish-Guarani language pair as there are many speakers of both languages, and Guarani has a dedicated activist community. Our goal is to make it easy for anyone to set up their own instance of Guampa and populate it with documents – such as automatically imported Wikipedia articles – to be translated for their particular language pair. Guampa is freely available and relatively easy to use.

2013

pdf bib
HLTDI: CL-WSD Using Markov Random Fields for SemEval-2013 Task 10
Alex Rudnick | Can Liu | Michael Gasser
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

pdf bib
Lexical Selection for Hybrid MT with Sequence Labeling
Alex Rudnick | Michael Gasser
Proceedings of the Second Workshop on Hybrid Approaches to Translation

2012

pdf bib
Incremental Learning of Affix Segmentation
Wondwossen Mulugeta | Michael Gasser | Baye Yimam
Proceedings of COLING 2012

2011

pdf bib
Towards synchronous extensible dependency grammar
Michael Gasser
Proceedings of the Second International Workshop on Free/Open-Source Rule-Based Machine Translation

Extensible Dependency Grammar (XDG; Debusmann, 2007) is a flexible, modular dependency grammar framework in which sentence analyses consist of multigraphs and processing takes the form of constraint satisfaction. This paper shows how XDG lends itself to grammar-driven machine translation and introduces the machinery necessary for synchronous XDG. Since the approach relies on a shared semantics, it resembles interlingua MT. It differs in that there are no separate analysis and generation phases. Rather, translation consists of the simultaneous analysis and generation of a single source-target “sentence”.

pdf bib
Towards a Malay Derivational Lexicon: Learning Affixes Using Expectation Maximization
Suriani Sulaiman | Michael Gasser | Sandra Kuebler
Proceedings of the 2nd Workshop on South Southeast Asian Natural Language Processing (WSSANLP)

2010

pdf bib
Expanding the Lexicon for a Resource-Poor Language Using a Morphological Analyzer and a Web Crawler
Michael Gasser
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Resource-poor languages may suffer from a lack of any of the basic resources that are fundamental to computational linguistics, including an adequate digital lexicon. Given the relatively small corpus of texts that exists for such languages, extending the lexicon presents a challenge. Languages with complex morphology present a special case, however, because individual words in these languages provide a great deal of information about the grammatical properties of the roots that they are based on. Given a morphological analyzer, it is even possible to extract novel roots from words. In this paper, we look at the case of Tigrinya, a Semitic language with limited lexical resources for which a morphological analyzer is available. It is shown that this analyzer applied to the list of more than 200,000 Tigrinya words that is extracted by a web crawler can extend the lexicon in two ways, by adding new roots and by inferring some of the derivational constraints that apply to known roots.

2009

pdf bib
Semitic Morphological Analysis and Generation Using Finite State Transducers with Feature Structures
Michael Gasser
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

1994

pdf bib
Modularity in a Connectionist Model of Morphology Acquisition
Michael Gasser
COLING 1994 Volume 1: The 15th International Conference on Computational Linguistics

pdf bib
Acquiring Receptive Morphology: A Connectionist Model
Michael Gasser
32nd Annual Meeting of the Association for Computational Linguistics

1988

pdf bib
Sequencing in a Connectionist Model of Language Processing
Michael Gasser | Michael G. Dyer
Coling Budapest 1988 Volume 1: International Conference on Computational Linguistics

Search
Co-authors
Fix data