2022
Interactive Word Completion for Plains Cree
William Lane | Atticus Harrigan | Antti Arppe
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The composition of richly-inflected words in morphologically complex languages can be a challenge for language learners developing literacy. Accordingly, Lane and Bird (2020) proposed a finite state approach which maps prefixes in a language to a set of possible completions up to the next morpheme boundary, for the incremental building of complex words. In this work, we develop an approach to morph-based auto-completion based on a finite state morphological analyzer of Plains Cree (nêhiyawêwin), showing the portability of the concept to a much larger, more complete morphological transducer. Additionally, we propose and compare various novel ranking strategies on the morph auto-complete output. The best weighting scheme ranks the target completion in the top 10 results in 64.9% of queries, and in the top 50 in 73.9% of queries.
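The top-10 and top-50 figures in this abstract are top-k hit rates over ranked completion lists. A minimal sketch of how such a rate can be computed is given here; the function name, the query identifiers, and the toy data are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def top_k_hit_rate(
    ranked: Dict[str, List[str]], targets: Dict[str, str], k: int
) -> float:
    """Fraction of queries whose target is among the first k ranked candidates."""
    if not targets:
        return 0.0
    hits = sum(
        1
        for query, target in targets.items()
        if target in ranked.get(query, [])[:k]
    )
    return hits / len(targets)


# Toy data (hypothetical): ranked candidate completions per query.
ranked = {
    "q1": ["a", "b", "c"],
    "q2": ["x", "y", "z"],
    "q3": ["m", "n"],
    "q4": [],
}
# The completion that the user actually wanted for each query.
targets = {"q1": "a", "q2": "z", "q3": "k", "q4": "p"}

print(top_k_hit_rate(ranked, targets, 1))  # 0.25
print(top_k_hit_rate(ranked, targets, 3))  # 0.5
```

A query with no candidates, or whose target never appears in the list, counts as a miss at every k.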
A Finite State Approach to Interactive Transcription
William Lane | Steven Bird
Proceedings of the first workshop on NLP applications to field linguistics
We describe a novel approach to transcribing morphologically complex, local, oral languages. The approach connects with local motivations for participating in language work which center on language learning, accessing the content of audio collections, and applying this knowledge in language revitalization and maintenance. We develop a constraint-based approach to interactive word completion, expressed using Optimality Theoretic constraints, implemented in a finite state transducer, and applied to an Indigenous language. We show that this approach suggests correct full word predictions on 57.9% of the test utterances, and correct partial word predictions on 67.5% of the test utterances. In total, 87% of the test utterances receive full or partial word suggestions which serve to guide the interactive transcription process.
2021
SIGMORPHON 2021 Shared Task on Morphological Reinflection: Generalization Across Languages
Tiago Pimentel | Maria Ryskina | Sabrina J. Mielke | Shijie Wu | Eleanor Chodroff | Brian Leonard | Garrett Nicolai | Yustinus Ghanggo Ate | Salam Khalifa | Nizar Habash | Charbel El-Khaissi | Omer Goldman | Michael Gasser | William Lane | Matt Coler | Arturo Oncevay | Jaime Rafael Montoya Samame | Gema Celeste Silva Villegas | Adam Ek | Jean-Philippe Bernardy | Andrey Shcherbakov | Aziyana Bayyr-ool | Karina Sheifer | Sofya Ganieva | Matvey Plugaryov | Elena Klyachko | Ali Salehi | Andrew Krizhanovsky | Natalia Krizhanovsky | Clara Vania | Sardana Ivanova | Aelita Salchak | Christopher Straughn | Zoey Liu | Jonathan North Washington | Duygu Ataman | Witold Kieraś | Marcin Woliński | Totok Suhardijanto | Niklas Stoehr | Zahroh Nuriah | Shyam Ratan | Francis M. Tyers | Edoardo M. Ponti | Grant Aiton | Richard J. Hatcher | Emily Prud’hommeaux | Ritesh Kumar | Mans Hulden | Botond Barta | Dorina Lakatos | Gábor Szolnok | Judit Ács | Mohit Raj | David Yarowsky | Ryan Cotterell | Ben Ambridge | Ekaterina Vylomova
Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology
This year’s iteration of the SIGMORPHON Shared Task on morphological reinflection focuses on typological diversity and cross-lingual variation of morphosyntactic features. In terms of the task, we enrich UniMorph with new data for 32 languages from 13 language families, with most of them being under-resourced: Kunwinjku, Classical Syriac, Arabic (Modern Standard, Egyptian, Gulf), Hebrew, Amharic, Aymara, Magahi, Braj, Kurdish (Central, Northern, Southern), Polish, Karelian, Livvi, Ludic, Veps, Võro, Evenki, Xibe, Tuvan, Sakha, Turkish, Indonesian, Kodi, Seneca, Asháninka, Yanesha, Chukchi, Itelmen, Eibela. We evaluate six systems on the new data and conduct an extensive error analysis of the systems’ predictions. Transformer-based models generally demonstrate superior performance on the majority of languages, achieving >90% accuracy on 65% of them. The languages on which systems yielded low accuracy are mainly under-resourced, with a limited amount of data. Most errors made by the systems are due to allomorphy, honorificity, and form variation. In addition, we observe that systems especially struggle to inflect multiword lemmas. The systems also produce misspelled forms or end up in repetitive loops (e.g., RNN-based models). Finally, we report a large drop in systems’ performance on previously unseen lemmas.
Proceedings of the 19th Annual Workshop of the Australasian Language Technology Association
Afshin Rahimi | William Lane | Guido Zuccon
Proceedings of the 19th Annual Workshop of the Australasian Language Technology Association
Local Word Discovery for Interactive Transcription
William Lane | Steven Bird
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Human expertise and the participation of speech communities are essential factors in the success of technologies for low-resource languages. Accordingly, we propose a new computational task which is tuned to the available knowledge and interests in an Indigenous community, and which supports the construction of high quality texts and lexicons. The task is illustrated for Kunwinjku, a morphologically-complex Australian language. We combine a finite state implementation of a published grammar with a partial lexicon, and apply this to a noisy phone representation of the signal. We locate known lexemes in the signal and use the morphological transducer to build these out into hypothetical, morphologically-complex words for human validation. We show that applying a single iteration of this method results in a relative transcription density gain of 17%. Further, we find that 75% of breath groups in the test set receive at least one correct partial or full-word suggestion.
A Computational Model for Interactive Transcription
William Lane | Mat Bettinson | Steven Bird
Proceedings of the Second Workshop on Data Science with Human in the Loop: Language Advances
Transcribing low resource languages can be challenging in the absence of a good lexicon and trained transcribers. Accordingly, we seek a way to enable interactive transcription whereby the machine amplifies human efforts. This paper presents a data model and a system architecture for interactive transcription, supporting multiple modes of interactivity, increasing the likelihood of finding tasks that engage local participation in language work. The approach also supports other applications which are useful in our context, including spoken document retrieval and language learning.
2020
Interactive Word Completion for Morphologically Complex Languages
William Lane | Steven Bird
Proceedings of the 28th International Conference on Computational Linguistics
Text input technologies for low-resource languages support literacy, content authoring, and language learning. However, tasks such as word completion pose a challenge for morphologically complex languages thanks to the combinatorial explosion of possible words. We have developed a method for morphologically-aware text input in Kunwinjku, a polysynthetic language of northern Australia. We modify an existing finite state recognizer to map input morph prefixes to morph completions, respecting the morphosyntax and morphophonology of the language. We demonstrate the portability of the method by applying it to Turkish. We show that the space of proximal morph completions is many orders of magnitude smaller than the space of full word completions for Kunwinjku. We provide a visualization of the morph completion space to enable the text completion parameters to be fine-tuned. Finally, we report on a web services deployment, along with a web interface which helps users enter morphologically complex words and which retrieves corresponding entries from the lexicon.
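The idea of mapping a typed prefix to completions that stop at the next morph boundary can be illustrated with a toy lookup. This sketch is not the finite state method the abstract describes: it scans a small, hypothetical list of pre-segmented English words, ignores morphophonology, and the function name and hyphen separator are assumptions made only for illustration.

```python
from typing import List, Set


def next_morph_completions(
    prefix: str, segmented_words: List[str], sep: str = "-"
) -> Set[str]:
    """Extend `prefix` to the nearest morph boundary lying beyond it."""
    completions: Set[str] = set()
    for word in segmented_words:
        morphs = word.split(sep)
        surface = "".join(morphs)
        # Skip words that do not extend the prefix.
        if not surface.startswith(prefix) or surface == prefix:
            continue
        end = 0
        for morph in morphs:
            end += len(morph)
            # First boundary strictly past the end of the typed prefix.
            if end > len(prefix):
                completions.add(surface[:end])
                break
    return completions


# Hypothetical segmented lexicon; hyphens mark morph boundaries.
words = ["un-lock-ed", "un-lock-ing", "un-do", "re-do-ing"]

print(sorted(next_morph_completions("un", words)))      # ['undo', 'unlock']
print(sorted(next_morph_completions("unlock", words)))  # ['unlocked', 'unlocking']
```

Even in this toy case the two full words beginning with "unlock" collapse into a single suggestion at the prefix "un", which gives a small-scale picture of why the space of proximal morph completions is smaller than the space of full word completions.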
Bootstrapping Techniques for Polysynthetic Morphological Analysis
William Lane | Steven Bird
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Polysynthetic languages have exceptionally large and sparse vocabularies, thanks to the number of morpheme slots and combinations in a word. This complexity, together with a general scarcity of written data, poses a challenge to the development of natural language technologies. To address this challenge, we offer linguistically-informed approaches for bootstrapping a neural morphological analyzer, and demonstrate its application to Kunwinjku, a polysynthetic Australian language. We generate data from a finite state transducer to train an encoder-decoder model. We improve the model by “hallucinating” missing linguistic structure into the training data, and by resampling from a Zipf distribution to simulate a more natural distribution of morphemes. The best model accounts for all instances of reduplication in the test set and achieves an accuracy of 94.7% overall, a 10 percentage point improvement over the FST baseline. This process demonstrates the feasibility of bootstrapping a neural morph analyzer from minimal resources.
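Resampling from a Zipf distribution can be sketched as weighted sampling in which the item at rank r has weight proportional to 1 / r^s. The abstract does not say how items are ranked or which exponent is used, so the ordering of `items`, the exponent `s`, and both function names are assumptions made for illustration.

```python
import random
from typing import List, Sequence


def zipf_weights(n: int, s: float = 1.0) -> List[float]:
    """Unnormalized Zipf weights for ranks 1..n with exponent s."""
    return [1.0 / (rank ** s) for rank in range(1, n + 1)]


def zipf_resample(
    items: Sequence[str], size: int, s: float = 1.0, seed: int = 0
) -> List[str]:
    """Draw `size` items with replacement; earlier items are treated as higher rank."""
    rng = random.Random(seed)  # seeded for reproducibility
    weights = zipf_weights(len(items), s)
    return rng.choices(list(items), weights=weights, k=size)


# Hypothetical morph labels, ordered from most to least frequent.
morphs = ["m1", "m2", "m3", "m4", "m5"]
sample = zipf_resample(morphs, size=10_000)
print({m: sample.count(m) for m in morphs})
```

With s = 1, the first item is drawn about five times as often as the fifth, so uniformly generated training data takes on a long-tailed frequency profile.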
2019
Towards A Robust Morphological Analyzer for Kunwinjku
William Lane | Steven Bird
Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association
Kunwinjku is an indigenous Australian language spoken in northern Australia which exhibits agglutinative and polysynthetic properties. Members of the community have expressed interest in co-developing language applications that promote their values and priorities. Modeling the morphology of the Kunwinjku language is an important step towards accomplishing the community’s goals. Finite State Transducers have long been the go-to method for modeling morphologically rich languages, and in this paper we discuss some of the distinct modeling challenges present in the morphosyntax of verbs in Kunwinjku. We show that a fairly straightforward implementation using standard features of the foma toolkit can account for much of the verb structure. Continuing challenges include robustness in the face of variation and unseen vocabulary, as well as how to handle complex reduplicative processes. Our future work will build off the baseline and challenges presented here.