Francis Tyers - ACL Anthology

Francis Tyers

Also published as: Francis M. Tyers, Francis M Tyers

2025

Py-Elotl: A Python NLP package for the languages of Mexico
Ximena Gutierrez-Vasques | Robert Pugh | Victor Mijangos | Diego Barriga Martínez | Paul Aguilar | Mikel Segura | Paola Innes | Javier Santillan | Cynthia Montaño | Francis Tyers
Proceedings of the Fifth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP)

This work presents Py-elotl, a suite of tools and resources in Python for processing text in several indigenous languages spoken in Mexico. These resources include parallel corpora, linguistic taggers/analyzers, and orthographic normalization tools. This work aims to develop essential resources to support language pre-processing and linguistic research, and the future creation of more complete downstream applications that could be useful for the speakers and enhance the visibility of these languages. The current version supports language groups such as Nahuatl, Otomi, Mixtec, and Huave. This project is open-source and freely available for use and collaboration.

Ihquin tlahtouah in Tetelahtzincocah: An annotated, multi-purpose audio and text corpus of Western Sierra Puebla Nahuatl
Robert Pugh | Cheyenne Wing | María Ximena Juárez Huerta | Ángeles Márquez Hernandez | Francis Tyers
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

The development of digital linguistic resources is essential for enhancing the inclusion of indigenous and marginalized languages in the digital domain. Indigenous languages of Mexico, despite representing vast typological diversity and millions of speakers, have largely been overlooked in NLP until recently. In this paper, we present a corpus of audio and annotated transcriptions of Western Sierra Puebla Nahuatl, an endangered variety of Nahuatl spoken in Puebla, Mexico. The data made available in this corpus are useful for ASR, spelling normalization, and word-level language identification. We detail the corpus-creation process, and describe experiments to report benchmark results for each of these important NLP tasks. The corpus audio and text is made freely available.

Speech Technologies Datasets for African Under-Served Languages
Emmanuel Ngue Um | Francis Tyers | Eliette-Caroline Emilie Ngo Tjomb | Florus Landry Dibengue | Blaise-Mathieu Banoum Manguele | Blaise Abbo Djoulde | Mathilde Nyambe A | Brice Martial Atangana Eloundou | Jeff Sterling Ngami Kamagoua | José Mpouda Avom | Zacharie Nyobe | Emmanuel Giovanni Eloundou Eyenga | André Likwai
Proceedings of the Eight Workshop on the Use of Computational Methods in the Study of Endangered Languages

The expansion of the speech technology sector has given rise to a novel economic model in language research, with the objective of developing speech datasets. This model is expanding to under-served African languages through collaborative efforts between industries, organisations, and the active participation of communities. This collaboration is yielding new datasets for machine learning, while also disclosing vulnerabilities and sociolinguistic discrepancies between industrialised and non-industrialised societies. A case study of a speech data collection camp that took place in September 2024 in Cameroon, involving representatives of 31 languages throughout the continent, illustrates both the prospects of the new economic model for research on under-served languages and the challenges of fair, effective, and responsible participation.

2024

Universal Dependencies for Saraiki
Meesum Alam | Francis Tyers | Emily Hanink | Sandra Kübler
Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024

We present the first treebank of the Saraiki/Siraiki [ISO 639-3 skr] language, using the Universal Dependency annotation scheme (de Marneffe et al., 2021). The treebank currently comprises 587 annotated sentences and 7597 tokens. We explain the most relevant syntactic and morphological features of Saraiki, along with the decisions we have made for a range of language-specific constructions, namely compounds, verbal structures including light verb and serial verb constructions, and relative clauses.

Experiments in Multi-Variant Natural Language Processing for Nahuatl
Robert Pugh | Francis Tyers
Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024)

Linguistic variation is a complicating factor for digital language technologies. This is particularly true for languages that lack an official “standard” variety, including many regional and minoritized languages. In this paper, we describe a set of experiments focused on multivariant natural language processing for Nahuatl, an indigenous Mexican language with a high level of linguistic variation and no single recognized standard variant. Using small (10k tokens), recently-published annotated datasets for two Nahuatl variants, we compare the performance of single-variant, cross-variant, and joint training, and explore how different models perform on a third Nahuatl variant, unseen in training. These results and the subsequent discussion contribute to efforts of developing low-resource NLP that is robust to diatopic variation. We share all code used to process the data and run the experiments.

Developing a Benchmark for Pronunciation Feedback: Creation of a Phonemically Annotated Speech Corpus of isiZulu Language Learner Speech
Alexandra O’Neil | Nils Hjortnaes | Francis Tyers | Zinhle Nkosi | Thulile Ndlovu | Zanele Mlondo | Ngami Phumzile Pewa
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Pronunciation of the phonemic inventory of a new language often presents difficulties to second language (L2) learners. These challenges can be alleviated by the development of pronunciation feedback tools that take speech input from learners and return information about errors in the utterance. This paper presents the development of a corpus designed for use in pronunciation feedback research. The corpus is comprised of gold standard recordings from isiZulu teachers and recordings from isiZulu L2 learners that have been annotated for pronunciation errors. Exploring the potential benefits of word-level versus phoneme-level feedback necessitates a speech corpus that has been annotated for errors on the phoneme-level. To aid in this discussion, this corpus of isiZulu L2 speech has been annotated for phoneme-errors in utterances, as well as suprasegmental errors in tone.

Wav2pos: Exploring syntactic analysis from audio for Highland Puebla Nahuatl
Robert Pugh | Varun Sreedhar | Francis Tyers
Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024)

We describe an approach to part-of-speech tagging from audio with very little human-annotated data, for Highland Puebla Nahuatl, a low-resource language of Mexico. While automatic morphosyntactic analysis is typically trained on annotated textual data, large amounts of text are rarely available for low-resource, marginalized, and/or minority languages, and morphosyntactically-annotated data is even harder to come by. Much of the data from these languages may exist in the form of recordings, often only partially-transcribed or analyzed by field linguists working on language documentation projects. Given this relatively low availability of text in the low-resource language scenario, we explore end-to-end automated morphosyntactic analysis directly from audio. The experiments described in this paper focus on one piece of morphosyntax, part-of-speech tagging, and build on existing work in a high-resource setting. We use weak supervision to increase training volume, and explore a few techniques for generating word-level predictions from the acoustic features. Our experiments show promising results, despite less than 400 sentences of audio-aligned, manually-labeled text.

Evaluating Automatic Pronunciation Scoring with Crowd-sourced Speech Corpus Annotations
Nils Hjortnaes | Daniel Dakota | Sandra Kübler | Francis Tyers
Proceedings of the 13th Workshop on Natural Language Processing for Computer Assisted Language Learning

Towards Named-Entity and Coreference Annotation of the Hebrew Bible
Daniel G. Swanson | Bryce D. Bussert | Francis Tyers
Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024

Named-entity annotation refers to the process of specifying what real-world (or, at least, external-to-the-text) entities various names and descriptions within a text refer to. Coreference annotation, meanwhile, specifies what context-dependent words or phrases, such as pronouns, refer to. This paper describes an ongoing project to apply both of these to the Hebrew Bible, so far covering most of the book of Genesis, fully marking every person, place, object, and point in time which occurs in the text. The annotation process and possible future uses for the data are covered, along with the challenges involved in applying existing annotation guidelines to the Hebrew text.

A Universal Dependencies Treebank for Highland Puebla Nahuatl
Robert Pugh | Francis Tyers
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

We present a Universal Dependencies (UD) treebank for Highland Puebla Nahuatl. The treebank is only the second such UD corpus for a Mexican language, and supplements an existing treebank for another Nahuatl variant. We describe the process of data collection, annotation decisions and interesting syntactic constructions, and discuss some similarities and differences between the Highland Puebla Nahuatl treebank and the existing Western Sierra Puebla Nahuatl treebank.

Proceedings of the Third Workshop on NLP Applications to Field Linguistics
Oleg Serikov | Ekaterina Voloshina | Anna Postnikova | Saliha Muradoglu | Eric Le Ferrand | Elena Klyachko | Ekaterina Vylomova | Tatiana Shavrina | Francis Tyers
Proceedings of the Third Workshop on NLP Applications to Field Linguistics

Team jelarson at SemEval 2024 Task 8: Predicting Boundary Line Between Human and Machine Generated Text
Joseph Larson | Francis Tyers
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

In this paper, we address the task of building a system that, given a document written first by a human and then finished by an LLM, determines the transition word, i.e. where the machine begins to write. We built a system by examining the data for textual anomalies and combining a set of heuristic approaches with a linear regression model based on the text length of each document.

Producing a Parallel Universal Dependencies Treebank of Ancient Hebrew and Ancient Greek via Cross-Lingual Projection
Daniel G. Swanson | Bryce D. Bussert | Francis Tyers
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In this paper we present the initial construction of a treebank of Ancient Greek containing portions of the Septuagint, a translation of the Hebrew Scriptures (1576 sentences, 39K tokens, roughly 7% of the total corpus). We construct the treebank by word-aligning and projecting from the parallel text in Ancient Hebrew before automatically correcting systematic syntactic mismatches and manually correcting other errors.

2023

Proceedings of the Sixth Workshop on Universal Dependencies (UDW, GURT/SyntaxFest 2023)
Loïc Grobol | Francis Tyers
Proceedings of the Sixth Workshop on Universal Dependencies (UDW, GURT/SyntaxFest 2023)

Codex to corpus: Exploring annotation and processing for an open and extensible machine-readable edition of the Florentine Codex
Francis Tyers | Robert Pugh | Valery Berthoud F.
Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP)

This paper describes an ongoing effort to create, from the original hand-written text, a machine-readable, linguistically-annotated, and easily-searchable corpus of the Nahuatl portion of the Florentine Codex, a 16th century Mesoamerican manuscript written in Nahuatl and Spanish. The Codex consists of 12 books and over 300,000 tokens. We describe the process of annotating 3 of these books, the steps of text preprocessing undertaken, our approach to efficient manual processing and annotation, and some of the challenges faced along the way. We also report on a set of experiments evaluating our ability to automate the text processing tasks to aid in the remaining annotation effort, and find the results promising despite the relatively low volume of training data. Finally, we briefly present a real use case from the humanities that would benefit from the searchable, linguistically annotated corpus we describe.

Towards a finite-state morphological analyser for San Mateo Huave
Francis M. Tyers | Samuel Herrera Castro
Proceedings of the Sixth Workshop on the Use of Computational Methods in the Study of Endangered Languages

Developing finite-state language technology for Maya
Robert Pugh | Francis Tyers | Quetzil Castañeda
Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP)

We describe a suite of finite-state language technologies for Maya, a Mayan language spoken in Mexico. At the core is a computational model of Maya morphology and phonology using a finite-state transducer. This model results in a morphological analyzer and a morphologically-informed spell-checker. All of these technologies are designed for use as both a pedagogical reading/writing aid for L2 learners and as a general language processing tool capable of supporting much of the natural variation in written Maya. We discuss the relevant features of Maya morphosyntax and orthography, and then outline the implementation details of the analyzer. To conclude, we present a longer-term vision for these tools and their use by both native speakers and learners.

Proceedings of the Second Workshop on NLP Applications to Field Linguistics
Oleg Serikov | Ekaterina Voloshina | Anna Postnikova | Elena Klyachko | Ekaterina Vylomova | Tatiana Shavrina | Eric Le Ferrand | Valentin Malykh | Francis Tyers | Timofey Arkhangelskiy | Vladislav Mikhailov
Proceedings of the Second Workshop on NLP Applications to Field Linguistics

WITH Context: Adding Rule-Grouping to VISL CG-3
Daniel Swanson | Tino Didriksen | Francis M. Tyers
Proceedings of the NoDaLiDa 2023 Workshop on Constraint Grammar - Methods, Tools and Applications

This paper presents an extension to the VISL CG-3 compiler and processor which enables complex contexts to be shared between rules. This sharing substantially improves the readability and maintainability of sets of rules performing multi-step operations.

A finite-state morphological analyser for Highland Puebla Nahuatl
Robert Pugh | Francis Tyers
Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP)

This paper describes the development of a free/open-source finite-state morphological transducer for Highland Puebla Nahuatl, a Uto-Aztecan language spoken in and around the state of Puebla in Mexico. The finite-state toolkit used for the work is the Helsinki Finite-State Toolkit (HFST); we use the lexc formalism for modelling the morphotactics and twol formalism for modelling morphophonological alternations. An evaluation is presented which shows that the transducer has a reasonable coverage (around 90%) on freely-available corpora of the language, and high precision (over 95%) on a manually verified test set.

Comparing methods of orthographic conversion for Bàsàá, a language of Cameroon
Alexandra O’neil | Daniel Swanson | Robert Pugh | Francis Tyers | Emmanuel Ngue Um
Proceedings of the Fourth workshop on Resources for African Indigenous Languages (RAIL 2023)

Orthographical standardization is a milestone in a language’s documentation and the development of its resources. However, texts written in former orthographies remain relevant to the language’s history and development and therefore must be converted to the standardized orthography. Ensuring a language has access to the orthographically standardized version of all of its recorded texts is important in the development of resources as it provides additional textual resources for training, supports contribution of authors using former writing systems, and provides information about the development of the language. This paper evaluates the performance of natural language processing methods, specifically Finite State Transducers and Long Short-term Memory networks, for the orthographical conversion of Bàsàá texts from the Protestant missionary orthography to the now-standard AGLC orthography, with the conclusion that LSTMs are somewhat more effective in the absence of explicit lexical information.

2022

Handling Stress in Finite-State Morphological Analyzers for Ancient Greek and Ancient Hebrew
Daniel G. Swanson | Francis M. Tyers
Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages

Modeling stress placement has historically been a challenge for computational morphological analysis, especially in finite-state systems because lexically conditioned stress cannot be modeled using only rewrite rules on the phonological form of a word. However, these phenomena can be modeled fairly easily if the lexicon’s internal representation is allowed to contain more information than the pure phonological form. In this paper we describe the stress systems of Ancient Greek and Ancient Hebrew and we present two prototype finite-state morphological analyzers, one for each language, which successfully implement these stress systems by inserting a small number of control characters into the phonological form, thus conclusively refuting the claim that finite-state systems are not powerful enough to model such stress systems and arguing in favor of the continued relevance of finite-state systems as an appropriate tool for modeling the morphology of historical languages.

How to encode arbitrarily complex morphology in word embeddings, no corpus needed
Lane Schwartz | Coleman Haley | Francis Tyers
Proceedings of the First Workshop on NLP applications to field linguistics

In this paper, we present a straightforward technique for constructing interpretable word embeddings from morphologically analyzed examples (such as interlinear glosses) for all of the world’s languages. Currently, fewer than 300-400 languages out of approximately 7000 have more than a trivial amount of digitized texts; of those, between 100-200 languages (most in the Indo-European language family) have enough text data for BERT embeddings of reasonable quality to be trained. The word embeddings in this paper are explicitly designed to be both linguistically interpretable and fully capable of handling the broad variety found in the world’s diverse set of 7000 languages, regardless of corpus size or morphological characteristics. We demonstrate the applicability of our representation through examples drawn from a typologically diverse set of languages whose morphology includes prefixes, suffixes, infixes, circumfixes, templatic morphemes, derivational morphemes, inflectional morphemes, and reduplication.

UniMorph 4.0: Universal Morphology
Khuyagbaatar Batsuren | Omer Goldman | Salam Khalifa | Nizar Habash | Witold Kieraś | Gábor Bella | Brian Leonard | Garrett Nicolai | Kyle Gorman | Yustinus Ghanggo Ate | Maria Ryskina | Sabrina Mielke | Elena Budianskaya | Charbel El-Khaissi | Tiago Pimentel | Michael Gasser | William Abbott Lane | Mohit Raj | Matt Coler | Jaime Rafael Montoya Samame | Delio Siticonatzi Camaiteri | Esaú Zumaeta Rojas | Didier López Francis | Arturo Oncevay | Juan López Bautista | Gema Celeste Silva Villegas | Lucas Torroba Hennigen | Adam Ek | David Guriel | Peter Dirix | Jean-Philippe Bernardy | Andrey Scherbakov | Aziyana Bayyr-ool | Antonios Anastasopoulos | Roberto Zariquiey | Karina Sheifer | Sofya Ganieva | Hilaria Cruz | Ritván Karahóǧa | Stella Markantonatou | George Pavlidis | Matvey Plugaryov | Elena Klyachko | Ali Salehi | Candy Angulo | Jatayu Baxi | Andrew Krizhanovsky | Natalia Krizhanovskaya | Elizabeth Salesky | Clara Vania | Sardana Ivanova | Jennifer White | Rowan Hall Maudslay | Josef Valvoda | Ran Zmigrod | Paula Czarnowska | Irene Nikkarinen | Aelita Salchak | Brijesh Bhatt | Christopher Straughn | Zoey Liu | Jonathan North Washington | Yuval Pinter | Duygu Ataman | Marcin Wolinski | Totok Suhardijanto | Anna Yablonskaya | Niklas Stoehr | Hossep Dolatian | Zahroh Nuriah | Shyam Ratan | Francis M. Tyers | Edoardo M. Ponti | Grant Aiton | Aryaman Arora | Richard J. Hatcher | Ritesh Kumar | Jeremiah Young | Daria Rodionova | Anastasia Yemelina | Taras Andrushko | Igor Marchenko | Polina Mashkovtseva | Alexandra Serova | Emily Prud’hommeaux | Maria Nepomniashchaya | Fausto Giunchiglia | Eleanor Chodroff | Mans Hulden | Miikka Silfverberg | Arya D. McCarthy | David Yarowsky | Ryan Cotterell | Reut Tsarfaty | Ekaterina Vylomova
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation, and a type-level resource of annotated data in diverse languages realizing that schema. This paper presents the expansions and improvements on several fronts that were made in the last couple of years (since McCarthy et al. (2020)). Collaborative efforts by numerous linguists have added 66 new languages, including 24 endangered languages. We have implemented several improvements to the extraction pipeline to tackle some issues, e.g., missing gender and macrons information. We have amended the schema to use a hierarchical structure that is needed for morphological phenomena like multiple-argument agreement and case stacking, while adding some missing morphological features to make the schema more inclusive. In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages. Lastly, this new release makes a push towards inclusion of derivational morphology in UniMorph by enriching the data and annotation schema with instances representing derivational processes from MorphyNet.

Yet Another Format of Universal Dependencies for Korean
Yige Chen | Eunkyul Leah Jo | Yundong Yao | KyungTae Lim | Miikka Silfverberg | Francis M. Tyers | Jungyeul Park
Proceedings of the 29th International Conference on Computational Linguistics

In this study, we propose a morpheme-based scheme for Korean dependency parsing and adopt the proposed scheme to Universal Dependencies. We present the linguistic rationale that illustrates the motivation and the necessity of adopting the morpheme-based format, and develop scripts that convert between the original format used by Universal Dependencies and the proposed morpheme-based format automatically. The effectiveness of the proposed format for Korean dependency parsing is then testified by both statistical and neural models, including UDPipe and Stanza, with our carefully constructed morpheme-based word embedding for Korean. morphUD outperforms parsing results for all Korean UD treebanks, and we also present detailed error analysis.

Predictive Text for Agglutinative and Polysynthetic Languages
Sergey Kosyak | Francis Tyers
Proceedings of the First Workshop on NLP applications to field linguistics

This paper presents a set of experiments in the area of morphological modelling and prediction. We test whether morphological segmentation can compete against statistical segmentation in the tasks of language modelling and predictive text entry for two under-resourced and indigenous languages, K’iche’ and Chukchi. We use different segmentation methods — both statistical and morphological — to make datasets that are used to train models of different types: single-way segmented, which are trained using data from one segmenter; two-way segmented, which are trained using concatenated data from two segmenters; and finetuned, which are trained on two datasets from different segmenters. We compute word and character level perplexities and find that single-way segmented models trained on morphologically segmented data show the highest performance. Finally, we evaluate the language models on the task of predictive text entry using gold standard data and measure the average number of clicks per character and keystroke savings rate. We find that the models trained on morphologically segmented data show better scores, although with substantial room for improvement. At last, we propose the usage of morphological segmentation in order to improve the end-user experience while using predictive text and we plan on testing this assumption by doing end-user evaluation.

A Universal Dependencies Treebank of Ancient Hebrew
Daniel G. Swanson | Francis M. Tyers
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In this paper we present the initial construction of a Universal Dependencies treebank with morphological annotations of Ancient Hebrew containing portions of the Hebrew Scriptures (1579 sentences, 27K tokens) for use in comparative study with ancient translations and for analysis of the development of Hebrew syntax. We construct this treebank by applying a rule-based parser (300 rules) to an existing morphologically-annotated corpus with minimal constituency structure and manually verifying the output and present the results of this semi-automated annotation process and some of the annotation decisions made in the process of applying the UD guidelines to a new language.

Universal Dependencies for Western Sierra Puebla Nahuatl
Robert Pugh | Marivel Huerta Mendez | Mitsuya Sasaki | Francis Tyers
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We present a morpho-syntactically-annotated corpus of Western Sierra Puebla Nahuatl that conforms to the annotation guidelines of the Universal Dependencies project. We describe the sources of the texts that make up the corpus, the annotation process, and important annotation decisions made throughout the development of the corpus. As the first indigenous language of Mexico to be added to the Universal Dependencies project, this corpus offers a good opportunity to test and more clearly define annotation guidelines for the Meso-american linguistic area, spontaneous and elicited spoken data, and code-switching.

A Free/Open-Source Morphological Analyser and Generator for Sakha
Sardana Ivanova | Jonathan Washington | Francis Tyers
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We present, to our knowledge, the first ever published morphological analyser and generator for Sakha, a marginalised language of Siberia. The transducer, developed using HFST, has coverage of solidly above 90%, and high precision. In the development of the analyser, we have expanded linguistic knowledge about Sakha, and developed strategies for complex grammatical patterns. The transducer is already being used in downstream tasks, including computer assisted language learning applications for linguistic maintenance and computational linguistic shared tasks.

Proceedings of the First Workshop on NLP applications to field linguistics
Oleg Serikov | Ekaterina Voloshina | Anna Postnikova | Elena Klyachko | Ekaterina Neminova | Ekaterina Vylomova | Tatiana Shavrina | Eric Le Ferrand | Valentin Malykh | Francis Tyers | Timofey Arkhangelskiy | Vladislav Mikhailov | Alena Fenogenova
Proceedings of the First Workshop on NLP applications to field linguistics

2021

Recent advances in neural machine translation (NMT) have pushed the quality of machine translation systems to the point where they are becoming widely adopted to build competitive systems. However, there is still a large number of languages that are yet to reap the benefits of NMT. In this paper, we provide the first large-scale case study of the practical application of MT in the Turkic language family in order to realize the gains of NMT for Turkic languages under high-resource to extremely low-resource scenarios. In addition to presenting an extensive analysis that identifies the bottlenecks towards building competitive systems to ameliorate data scarcity, our study has several key contributions, including, i) a large parallel corpus covering 22 Turkic languages consisting of common public datasets in combination with new datasets of approximately 1.4 million parallel sentences, ii) bilingual baselines for 26 language pairs, iii) novel high-quality test sets in three different translation domains and iv) human evaluation scores. All models, scripts, and data will be released to the public.

A survey of part-of-speech tagging approaches applied to K’iche’
Francis Tyers | Nick Howell
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

We study the performance of several popular neural part-of-speech taggers from the Universal Dependencies ecosystem on Mayan languages using a small corpus of 1435 annotated K’iche’ sentences consisting of approximately 10,000 tokens, with encouraging results: F₁ scores 93%+ on lemmatisation, part-of-speech and morphological feature assignment. The high performance motivates a cross-language part-of-speech tagging study, where K’iche’-trained models are evaluated on two other Mayan languages, Kaqchikel and Uspanteko: performance on Kaqchikel is good, 63-85%, and on Uspanteko modest, 60-71%. Supporting experiments lead us to conclude the relative diversity of morphological features as a plausible explanation for the limiting factors in cross-language tagging performance, providing some direction for future sentence annotation and collection work to support these and other Mayan languages.

Towards an Open Source Finite-State Morphological Analyzer for Zacatlán-Ahuacatlán-Tepetzintla Nahuatl
Robert Pugh | Francis Tyers | Marivel Huerta Mendez
Proceedings of the 4th Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)

A finite-state morphological analyser for Paraguayan Guaraní
Anastasia Kuznetsova | Francis Tyers
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

This article describes the development of a morphological analyser for Paraguayan Guaraní, an agglutinative indigenous language spoken by nearly 6 million people in South America. The implementation of our analyser uses HFST (Helsinki Finite State Technology) and a two-level transducer that covers morphotactics and phonological processes occurring in Guaraní. We assess the efficacy of the approach on publicly available Wikipedia and Bible corpora; the naive coverage of the analyser reaches 86% on Wikipedia and 91% on Bible corpora.

The Relevance of the Source Language in Transfer Learning for ASR
Nils Hjortnaes | Niko Partanen | Michael Rießler | Francis M. Tyers
Proceedings of the 4th Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)

Keyword spotting for audiovisual archival search in Uralic languages
Nils Hjortnaes | Niko Partanen | Francis M. Tyers
Proceedings of the Seventh International Workshop on Computational Linguistics of Uralic Languages

Do RNN States Encode Abstract Phonological Alternations?
Miikka Silfverberg | Francis Tyers | Garrett Nicolai | Mans Hulden
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Sequence-to-sequence models have delivered impressive results in word formation tasks such as morphological inflection, often learning to model subtle morphophonological details with limited training data. Despite the performance, the opacity of neural models makes it difficult to determine whether complex generalizations are learned, or whether a kind of separate rote memorization of each morphophonological process takes place. To investigate whether complex alternations are simply memorized or whether there is some level of generalization across related sound changes in a sequence-to-sequence model, we perform several experiments on Finnish consonant gradation—a complex set of sound changes triggered in some words by certain suffixes. We find that our models often—though not always—encode 17 different consonant gradation processes in a handful of dimensions in the RNN. We also show that by scaling the activations in these dimensions we can control whether consonant gradation occurs and the direction of the gradation.

A corpus of K’iche’ annotated for morphosyntactic structure
Francis Tyers | Robert Henderson
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

This article describes a collection of sentences in K’iche’ annotated for morphology and syntax. K’iche’ is a language in the Mayan language family, spoken in Guatemala. The annotation is done according to the guidelines of the Universal Dependencies project. The corpus consists of a total of 1,433 sentences containing approximately 10,000 tokens and is released under a free/open-source licence. We present a comparison of parsing systems for K’iche’ using this corpus and describe how it can be used for mining linguistic examples.

Despite the increasing number of large and comprehensive machine translation (MT) systems, evaluation of these methods in various languages has been restrained by the lack of high-quality parallel corpora as well as engagement with the people that speak these languages. In this study, we present an evaluation of state-of-the-art approaches to training and evaluating MT systems in 22 languages from the Turkic language family, most of which are extremely under-explored. First, we adopt the TIL Corpus with a few key improvements to the training and the evaluation sets. Then, we train 26 bilingual baselines as well as a multi-way neural MT (MNMT) model using the corpus and perform an extensive analysis using automatic metrics as well as human evaluations. We find that the MNMT model outperforms almost all bilingual baselines in the out-of-domain test sets and that finetuning the model on a downstream task of a single pair also results in a huge performance boost in both low- and high-resource scenarios. Our attentive analysis of evaluation criteria for MT models in Turkic languages also points to the necessity for further research in this direction. We release the corpus splits, test sets as well as models to the public.

Expanding Universal Dependencies for Polysynthetic Languages: A Case of St. Lawrence Island Yupik
Hyunji Hayley Park | Lane Schwartz | Francis Tyers
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

This paper describes the development of the first Universal Dependencies (UD) treebank for St. Lawrence Island Yupik, an endangered language spoken in the Bering Strait region. While the UD guidelines provided a general framework for our annotations, language-specific decisions were made necessary by the rich morphology of the polysynthetic language. Most notably, we annotated a corpus at the morpheme level as well as the word level. The morpheme level annotation was conducted using an existing morphological analyzer and manual disambiguation. By comparing the two resulting annotation schemes, we argue that morpheme-level annotation is essential for polysynthetic languages like St. Lawrence Island Yupik. Word-level annotation results in degenerate trees for some Yupik sentences and often fails to capture syntactic relations that can be manifested at the morpheme level. Dependency parsing experiments provide further support for morpheme-level annotation. Implications for UD annotation of other polysynthetic languages are discussed.

SIGMORPHON 2021 Shared Task on Morphological Reinflection: Generalization Across Languages
Tiago Pimentel | Maria Ryskina | Sabrina J. Mielke | Shijie Wu | Eleanor Chodroff | Brian Leonard | Garrett Nicolai | Yustinus Ghanggo Ate | Salam Khalifa | Nizar Habash | Charbel El-Khaissi | Omer Goldman | Michael Gasser | William Lane | Matt Coler | Arturo Oncevay | Jaime Rafael Montoya Samame | Gema Celeste Silva Villegas | Adam Ek | Jean-Philippe Bernardy | Andrey Shcherbakov | Aziyana Bayyr-ool | Karina Sheifer | Sofya Ganieva | Matvey Plugaryov | Elena Klyachko | Ali Salehi | Andrew Krizhanovsky | Natalia Krizhanovsky | Clara Vania | Sardana Ivanova | Aelita Salchak | Christopher Straughn | Zoey Liu | Jonathan North Washington | Duygu Ataman | Witold Kieraś | Marcin Woliński | Totok Suhardijanto | Niklas Stoehr | Zahroh Nuriah | Shyam Ratan | Francis M. Tyers | Edoardo M. Ponti | Grant Aiton | Richard J. Hatcher | Emily Prud’hommeaux | Ritesh Kumar | Mans Hulden | Botond Barta | Dorina Lakatos | Gábor Szolnok | Judit Ács | Mohit Raj | David Yarowsky | Ryan Cotterell | Ben Ambridge | Ekaterina Vylomova
Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

This year’s iteration of the SIGMORPHON Shared Task on morphological reinflection focuses on typological diversity and cross-lingual variation of morphosyntactic features. In terms of the task, we enrich UniMorph with new data for 32 languages from 13 language families, with most of them being under-resourced: Kunwinjku, Classical Syriac, Arabic (Modern Standard, Egyptian, Gulf), Hebrew, Amharic, Aymara, Magahi, Braj, Kurdish (Central, Northern, Southern), Polish, Karelian, Livvi, Ludic, Veps, Võro, Evenki, Xibe, Tuvan, Sakha, Turkish, Indonesian, Kodi, Seneca, Asháninka, Yanesha, Chukchi, Itelmen, Eibela. We evaluate six systems on the new data and conduct an extensive error analysis of the systems’ predictions. Transformer-based models generally demonstrate superior performance on the majority of languages, achieving >90% accuracy on 65% of them. The languages on which systems yielded low accuracy are mainly under-resourced, with a limited amount of data. Most errors made by the systems are due to allomorphy, honorificity, and form variation. In addition, we observe that systems especially struggle to inflect multiword lemmas. The systems also produce misspelled forms or end up in repetitive loops (e.g., RNN-based models). Finally, we report a large drop in systems’ performance on previously unseen lemmas.

Investigating variation in written forms of Nahuatl using character-based language models
Robert Pugh | Francis Tyers
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

We describe experiments with character-based language modeling for written variants of Nahuatl. Using a standard LSTM model and publicly available Bible translations, we explore how character language models can be applied to the tasks of estimating mutual intelligibility, identifying genetic similarity, and distinguishing written variants. We demonstrate that these simple language models are able to capture similarities and differences that have been described in the linguistic literature.

2020

Proceedings of the Sixth International Workshop on Computational Linguistics of Uralic Languages
Tommi A. Pirinen | Francis M. Tyers | Michael Rießler
Proceedings of the Sixth International Workshop on Computational Linguistics of Uralic Languages

An Unsupervised Method for Weighting Finite-state Morphological Analyzers
Amr Keleg | Francis M. Tyers | Nicholas Howell | Tommi A. Pirinen
Proceedings of the Twelfth Language Resources and Evaluation Conference

Morphological analysis is one of the tasks that have been studied for years. Different techniques have been used to develop models for performing morphological analysis. Models based on finite state transducers have proved to be more suitable for languages with low available resources. In this paper, we have developed a method for weighting a morphological analyzer built using finite state transducers in order to disambiguate its results. The method is based on a word2vec model that is trained in a completely unsupervised way using raw untagged corpora and is able to capture the semantic meaning of the words. Most of the methods used for disambiguating the results of a morphological analyzer rely on tagged corpora that need to be manually built. Additionally, the method developed uses information about the token irrespective of its context, unlike most of the other techniques, which rely heavily on the word’s context to disambiguate its set of candidate analyses.

Effort-value payoff in lemmatisation for Uralic languages
Nick Howell | Maria Bibaeva | Francis M. Tyers
Proceedings of the Sixth International Workshop on Computational Linguistics of Uralic Languages

Towards a Speech Recognizer for Komi, an Endangered and Low-Resource Uralic Language
Nils Hjortnaes | Niko Partanen | Michael Rießler | Francis M. Tyers
Proceedings of the Sixth International Workshop on Computational Linguistics of Uralic Languages

Dependency annotation of noun incorporation in polysynthetic languages
Francis Tyers | Karina Mishchenkova
Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020)

This paper describes an approach to annotating noun incorporation in Universal Dependencies. It motivates the need to annotate this particular morphosyntactic phenomenon and justifies it with respect to frequency of the construction. A case study is presented in which the proposed annotation scheme is applied to Chukchi, a language that exhibits noun incorporation. We compare argument encoding in Chukchi, English and Russian and find that while in English and Russian discourse elements are primarily tracked through noun phrases and pronouns, in Chukchi they are tracked through agreement marking and incorporation, with a lesser role for noun phrases.

A broad goal in natural language processing (NLP) is to develop a system that has the capacity to process any natural language. Most systems, however, are developed using data from just one language such as English. The SIGMORPHON 2020 shared task on morphological reinflection aims to investigate systems’ ability to generalize across typologically distinct languages, many of which are low resource. Systems were developed using data from 45 languages and just 5 language families, fine-tuned with data from an additional 45 languages and 10 language families (13 in total), and evaluated on all 90 languages. A total of 22 systems (19 neural) from 10 teams were submitted to the task. All four winning systems were neural (two monolingual transformers and two massively multilingual RNN-based models with gated attention). Most teams demonstrate utility of data hallucination and augmentation, ensembles, and multilingual training for low-resource languages. Non-neural learners and manually designed grammars showed competitive and even superior performance on some languages (such as Ingrian, Tajik, Tagalog, Zarma, Lingala), especially with very limited data. Some language families (Afro-Asiatic, Niger-Congo, Turkic) were relatively easy for most systems and achieved over 90% mean accuracy while others were more challenging.

Common Voice: A Massively-Multilingual Speech Corpus
Rosana Ardila | Megan Branson | Kelly Davis | Michael Kohler | Josh Meyer | Michael Henretty | Reuben Morais | Lindsay Saunders | Francis Tyers | Gregor Weber
Proceedings of the Twelfth Language Resources and Evaluation Conference

The Common Voice corpus is a massively-multilingual collection of transcribed speech intended for speech technology research and development. Common Voice is designed for Automatic Speech Recognition purposes but can be useful in other domains (e.g. language identification). To achieve scale and sustainability, the Common Voice project employs crowdsourcing for both data collection and data validation. The most recent release includes 29 languages, and as of November 2019 there are a total of 38 languages collecting data. Over 50,000 individuals have participated so far, resulting in 2,500 hours of collected audio. To our knowledge this is the largest audio corpus in the public domain for speech recognition, both in terms of number of hours and number of languages. As an example use case for Common Voice, we present speech recognition experiments using Mozilla’s DeepSpeech Speech-to-Text toolkit. By applying transfer learning from a source English model, we find an average Character Error Rate improvement of 5.99 ± 5.48 for twelve target languages (German, French, Italian, Turkish, Catalan, Slovenian, Welsh, Irish, Breton, Tatar, Chuvash, and Kabyle). For most of these languages, these are the first ever published results on end-to-end Automatic Speech Recognition.
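
For readers unfamiliar with the metric, the following minimal Python sketch shows one common way to compute Character Error Rate: the Levenshtein (edit) distance between a reference transcript and a system hypothesis, divided by the reference length. This is an assumption about the general definition, not a description of the evaluation scripts used in the paper, and the function names are illustrative.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from the reference
                            curr[j - 1] + 1,      # insert h into the reference
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / number of reference characters."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)

if __name__ == "__main__":
    print(character_error_rate("abcd", "abxd"))
```

Note that under this definition the rate can exceed 1.0 when the hypothesis is much longer than the reference.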

Improving the Language Model for Low-Resource ASR with Online Text Corpora
Nils Hjortnaes | Timofey Arkhangelskiy | Niko Partanen | Michael Rießler | Francis Tyers
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

In this paper, we expand on previous work on automatic speech recognition in a low-resource scenario typical of data collected by field linguists. We train DeepSpeech models on 35 hours of dialectal Komi speech recordings and correct the output using language models constructed from various sources. Previous experiments showed that transfer learning using DeepSpeech can improve the accuracy of a speech recognizer for Komi, though the error rate remained very high. In this paper we present further experiments with language models created using KenLM from text materials available online. These are constructed from two corpora, one containing literary texts and one containing social media content, and from a third combining the two. We then trained the model using each language model to explore the impact of the language model data source on the speech recognition model. Our results show significant improvements of over 25% in character error rate and nearly 20% in word error rate. This offers important methodological insight into how ASR results can be improved under low-resource conditions: transfer learning can be used to compensate for the lack of training data in the target language, and online texts are a very useful resource when developing language models in this context.

A Finite-State Morphological Analyser for Evenki
Anna Zueva | Anastasia Kuznetsova | Francis Tyers
Proceedings of the Twelfth Language Resources and Evaluation Conference

It has been widely admitted that morphological analysis is an important step in automated text processing for morphologically rich languages. Evenki is a language with rich morphology, therefore a morphological analyser is highly desirable for processing Evenki texts and developing applications for Evenki. Although two morphological analysers for Evenki have already been developed, they are able to analyse less than a half of the available Evenki corpora. The aim of this paper is to create a new morphological analyser for Evenki. It is implemented using the Helsinki Finite-State Transducer toolkit (HFST). The lexc formalism is used to specify the morphotactic rules, which define the valid orderings of morphemes in a word. Morphophonological alternations and orthographic rules are described using the twol formalism. The lexicon is extracted from available machine-readable dictionaries. Since a part of the corpora belongs to texts in Evenki dialects, a version of the analyser with relaxed rules is developed for processing dialectal features. We evaluate the analyser on available Evenki corpora and estimate precision, recall and F-score. We obtain coverage scores of between 61% and 87% on the available Evenki corpora.

Universal Dependency Treebank for Xibe
He Zhou | Juyeon Chung | Sandra Kübler | Francis Tyers
Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020)

We present our work on constructing the first treebank for the Xibe language following the Universal Dependencies (UD) annotation scheme. Xibe is a low-resourced and severely endangered Tungusic language spoken by the Xibe minority living in the Xinjiang Uygur Autonomous Region of China. We have collected 810 sentences so far, including 544 sentences from a grammar book on written Xibe and 266 sentences from Cabcal News. We annotated those sentences manually from scratch. In this paper, we report the procedure of building this treebank and analyze several important annotation issues of our treebank. Finally, we propose our plans for future work.

Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection
Joakim Nivre | Marie-Catherine de Marneffe | Filip Ginter | Jan Hajič | Christopher D. Manning | Sampo Pyysalo | Sebastian Schuster | Francis Tyers | Daniel Zeman
Proceedings of the Twelfth Language Resources and Evaluation Conference

Universal Dependencies is an open community effort to create cross-linguistically consistent treebank annotation for many languages within a dependency-based lexicalist framework. The annotation consists in a linguistically motivated word segmentation; a morphological layer comprising lemmas, universal part-of-speech tags, and standardized morphological features; and a syntactic layer focusing on syntactic relations between predicates, arguments and modifiers. In this paper, we describe version 2 of the universal guidelines (UD v2), discuss the major changes from UD v1 to UD v2, and give an overview of the currently available treebanks for 90 languages.
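
The layers listed in this abstract map onto columns of the CoNLL-U format in which UD treebanks are distributed. The following Python sketch reads a single token line, assuming the standard ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The `Token` type, the helper names, and the example line are illustrative and are not taken from any treebank.

```python
from typing import Dict, NamedTuple

class Token(NamedTuple):
    id: str                # word index (may be a range for multiword tokens)
    form: str              # segmentation layer: the word form
    lemma: str             # morphological layer: lemma
    upos: str              # morphological layer: universal part-of-speech tag
    feats: Dict[str, str]  # morphological layer: features
    head: str              # syntactic layer: index of the head word
    deprel: str            # syntactic layer: relation to the head

def parse_feats(field: str) -> Dict[str, str]:
    """Turn 'Case=Nom|Number=Plur' into a dict; '_' means no features."""
    if field == "_":
        return {}
    return dict(pair.split("=", 1) for pair in field.split("|"))

def parse_token_line(line: str) -> Token:
    """Parse one CoNLL-U token line into a Token."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError(f"expected 10 columns, got {len(cols)}")
    # Columns 4 (XPOS), 8 (DEPS) and 9 (MISC) are ignored in this sketch.
    return Token(cols[0], cols[1], cols[2], cols[3],
                 parse_feats(cols[5]), cols[6], cols[7])

if __name__ == "__main__":
    example = "2\tdogs\tdog\tNOUN\t_\tNumber=Plur\t3\tnsubj\t_\t_"
    print(parse_token_line(example))
```

A full reader would also handle comment lines beginning with `#` and the blank lines that separate sentences, which this sketch leaves out.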

2019

Data-Driven Morphological Analysis for Uralic Languages
Miikka Silfverberg | Francis Tyers
Proceedings of the Fifth International Workshop on Computational Linguistics for Uralic Languages

A Report on the Third VarDial Evaluation Campaign
Marcos Zampieri | Shervin Malmasi | Yves Scherrer | Tanja Samardžić | Francis Tyers | Miikka Silfverberg | Natalia Klyueva | Tung-Le Pan | Chu-Ren Huang | Radu Tudor Ionescu | Andrei M. Butnaru | Tommi Jauhiainen
Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects

In this paper, we present the findings of the Third VarDial Evaluation Campaign organized as part of the sixth edition of the workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with NAACL 2019. This year, the campaign included five shared tasks, including one task re-run – German Dialect Identification (GDI) – and four new tasks – Cross-lingual Morphological Analysis (CMA), Discriminating between Mainland and Taiwan variation of Mandarin Chinese (DMT), Moldavian vs. Romanian Cross-dialect Topic identification (MRC), and Cuneiform Language Identification (CLI). A total of 22 teams submitted runs across the five shared tasks. After the end of the competition, we received 14 system description papers, which are published in the VarDial workshop proceedings and referred to in this report.

A New Annotation Scheme for the Sejong Part-of-speech Tagged Corpus
Jungyeul Park | Francis Tyers
Proceedings of the 13th Linguistic Annotation Workshop

In this paper we present a new annotation scheme for the Sejong part-of-speech tagged corpus based on Universal Dependencies style annotation. By using a new annotation scheme, we can produce Sejong-style morphological analysis and part-of-speech tagging results which have been the de facto standard for Korean language processing. We also explore the possibility of doing named-entity recognition and semantic-role labelling for Korean using the new annotation scheme.

A free/open-source rule-based machine translation system for Crimean Tatar to Turkish
Memduh Gökırmak | Francis Tyers | Jonathan Washington
Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages

Development of a Universal Dependencies treebank for Welsh
Johannes Heinecke | Francis M. Tyers
Proceedings of the Celtic Language Technology Workshop

Proceedings of the Celtic Language Technology Workshop
Teresa Lynn | Delyth Prys | Colin Batchelor | Francis Tyers
Proceedings of the Celtic Language Technology Workshop

Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019)
Alexandre Rademaker | Francis Tyers
Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019)

Proceedings of the Fifth International Workshop on Computational Linguistics for Uralic Languages
Tommi A. Pirinen | Heiki-Jaan Kaalep | Francis M. Tyers
Proceedings of the Fifth International Workshop on Computational Linguistics for Uralic Languages

A biscriptual morphological transducer for Crimean Tatar
Francis M. Tyers | Jonathan Washington | Darya Kavitskaya | Memduh Gökırmak | Nick Howell | Remziye Berberova
Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)

Building a Morphological Analyser for Laz
Esra Onal | Francis Tyers
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

This study is an attempt to contribute to the documentation and revitalization of Laz, an endangered language of the South Caucasian family spoken mainly on the northeastern coastline of Turkey. It constitutes the first step towards a general computational model for word form recognition and production for Laz, by building a rule-based morphological analyser using the Helsinki Finite-State Toolkit (HFST). The evaluation results show that the analyser has 64.9% coverage over a corpus of 111,365 tokens collected for this study. We have also performed an error analysis on 100 randomly selected tokens from the corpus which are not covered by the analyser, and these results show that the errors mostly result from Turkish words in the corpus and missing stems in our lexicon.

2018

A prototype finite-state morphological analyser for Chukchi
Vasilisa Andriyanets | Francis Tyers
Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages

In this article we describe the application of finite-state transducers to the morphological and phonological systems of Chukchi, a polysynthetic language spoken in the north of the Russian Federation. The language exhibits progressive and regressive vowel harmony, productive incorporation and extensive circumfixing. To implement the analyser we use the well-known Helsinki Finite-State Toolkit (HFST). The resulting model covers the majority of the morphological and phonological processes. A brief evaluation carried out on publicly available corpora shows that the coverage of the transducer is between 53% and 76%. An error evaluation of 100 tokens randomly selected from the corpus, which were not covered by the analyser, shows that most of the morphological processes are covered and that the majority of errors are caused by a limited stem lexicon.

Can LSTM Learn to Capture Agreement? The Case of Basque
Shauli Ravfogel | Yoav Goldberg | Francis Tyers
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

Sequential neural network models are powerful tools in a variety of Natural Language Processing (NLP) tasks. The sequential nature of these models raises the questions: to what extent can these models implicitly learn hierarchical structures typical of human language, and what kind of grammatical phenomena can they acquire? We focus on the task of agreement prediction in Basque, as a case study for a task that requires implicit understanding of sentence structure and the acquisition of a complex but consistent morphological system. Analyzing experimental results from two syntactic prediction tasks – verb number prediction and suffix recovery – we find that sequential models perform worse on agreement prediction in Basque than one might expect on the basis of previous agreement prediction work in English. Tentative findings based on diagnostic classifiers suggest the network makes use of local heuristics as a proxy for the hierarchical structure of the sentence. We propose the Basque agreement prediction task as a challenging benchmark for models that attempt to learn regularities in human language.

Multi-source synthetic treebank creation for improved cross-lingual dependency parsing
Francis Tyers | Mariya Sheyanova | Aleksandra Martynova | Pavel Stepachev | Konstantin Vinogorodskiy
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

This paper describes a method of creating synthetic treebanks for cross-lingual dependency parsing using a combination of machine translation (including pivot translation), annotation projection and the spanning tree algorithm. Sentences are first automatically translated from a lesser-resourced language to a number of related highly-resourced languages, parsed and then the annotations are projected back to the lesser-resourced language, leading to multiple trees for each sentence from the lesser-resourced language. The final treebank is created by merging the possible trees into a graph and running the spanning tree algorithm to vote for the best tree for each sentence. We present experiments aimed at parsing Faroese using a combination of Danish, Swedish and Norwegian. In a similar experimental setup to the CoNLL 2018 shared task on dependency parsing we report state-of-the-art results on dependency parsing for Faroese using an off-the-shelf parser.

Finite-state morphological analysis for Gagauz
Sevilay Bayatli | Güllü Karanfil | Memduh Gökırmak | Francis M. Tyers
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages
Tommi A. Pirinen | Michael Rießler | Jack Rueter | Trond Trosterud | Francis M. Tyers
Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages

A prototype dependency treebank for Breton
Francis M Tyers | Vinit Ravishankar
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

This paper describes the development of the first syntactically-annotated corpus of Breton. The corpus is part of the Universal Dependencies project. In the paper we describe how the corpus was prepared, some Breton-specific constructions that required special treatment, and in addition we give results for parsing Breton using a number of off-the-shelf data-driven parsers.

Rule-based machine translation from Kazakh to Turkish
Sevilay Bayatli | Sefer Kurnaz | Ilnar Salimzyanov | Jonathan Washington | Francis M. Tyers
Proceedings of the 21st Annual Conference of the European Association for Machine Translation

This paper presents a shallow-transfer machine translation (MT) system for translating from Kazakh to Turkish. Background on the differences between the languages is presented, followed by how the system was designed to handle some of these differences. The system is based on the Apertium free/open-source machine translation platform. The structure of the system and how it works is described, along with an evaluation against two competing systems. Linguistic components were developed, including a Kazakh-Turkish bilingual dictionary, Constraint Grammar disambiguation rules, lexical selection rules, and structural transfer rules. With many known issues yet to be addressed, our RBMT system has reached performance comparable to publicly-available corpus-based MT systems between the languages.

Towards an open-source universal-dependency treebank for Erzya
Jack Rueter | Francis Tyers
Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages

2017

Proceedings of the Third Workshop on Computational Linguistics for Uralic Languages
Francis M. Tyers | Michael Rießler | Tommi A. Pirinen | Trond Trosterud
Proceedings of the Third Workshop on Computational Linguistics for Uralic Languages

Universal Dependencies
Joakim Nivre | Daniel Zeman | Filip Ginter | Francis Tyers
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts

Universal Dependencies (UD) is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages. This tutorial gives an introduction to the UD framework and resources, from basic design principles to annotation guidelines and existing treebanks. We also discuss tools for developing and exploiting UD treebanks and survey applications of UD in NLP and linguistics.

Towards a dependency-annotated treebank for Bambara
Ekaterina Aplonova | Francis M. Tyers
Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories

North-Sámi to Finnish rule-based machine translation system
Ryan Johnson | Tommi A Pirinen | Tiina Puolakainen | Francis Tyers | Trond Trosterud | Kevin Unhammer
Proceedings of the 21st Nordic Conference on Computational Linguistics

Finite-State Morphological Analysis for Marathi
Vinit Ravishankar | Francis M. Tyers
Proceedings of the 13th International Conference on Finite State Methods and Natural Language Processing (FSMNLP 2017)

CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
Daniel Zeman | Martin Popel | Milan Straka | Jan Hajič | Joakim Nivre | Filip Ginter | Juhani Luotolahti | Sampo Pyysalo | Slav Petrov | Martin Potthast | Francis Tyers | Elena Badmaeva | Memduh Gokirmak | Anna Nedoluzhko | Silvie Cinková | Jan Hajič jr. | Jaroslava Hlaváčová | Václava Kettnerová | Zdeňka Urešová | Jenna Kanerva | Stina Ojala | Anna Missilä | Christopher D. Manning | Sebastian Schuster | Siva Reddy | Dima Taji | Nizar Habash | Herman Leung | Marie-Catherine de Marneffe | Manuela Sanguinetti | Maria Simi | Hiroshi Kanayama | Valeria de Paiva | Kira Droganova | Héctor Martínez Alonso | Çağrı Çöltekin | Umut Sulubacak | Hans Uszkoreit | Vivien Macketanz | Aljoscha Burchardt | Kim Harris | Katrin Marheinecke | Georg Rehm | Tolga Kayadelen | Mohammed Attia | Ali Elkahky | Zhuoran Yu | Emily Pitler | Saran Lertpradit | Michael Mandl | Jesse Kirchner | Hector Fernandez Alcalde | Jana Strnadová | Esha Banerjee | Ruli Manurung | Antonio Stella | Atsuko Shimada | Sookyoung Kwak | Gustavo Mendonça | Tatiana Lando | Rattima Nitisaroj | Josie Li
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

The Conference on Computational Natural Language Learning (CoNLL) features a shared task, in which participants train and test their learning systems on the same data sets. In 2017, the task was devoted to learning dependency parsers for a large number of languages, in a real-world setting without any gold-standard annotation on input. All test sets followed a unified annotation scheme, namely that of Universal Dependencies. In this paper, we define the task and evaluation methodology, describe how the data sets were prepared, report and analyze the main results, and provide a brief categorization of the different approaches of the participating systems.

A Dependency Treebank for Kurmanji Kurdish
Memduh Gökırmak | Francis M. Tyers
Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)

UD Annotatrix: An annotation tool for Universal Dependencies
Francis M. Tyers | Mariya Sheyanova | Jonathan North Washington
Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories

Machine translation with North Saami as a pivot language
Lene Antonsen | Ciprian Gerstenberger | Maja Kappfjell | Sandra Nystø Rahka | Marja-Liisa Olthuis | Trond Trosterud | Francis M. Tyers
Proceedings of the 21st Nordic Conference on Computational Linguistics

Annotation schemes in North Sámi dependency parsing
Francis M. Tyers | Mariya Sheyanova
Proceedings of the Third Workshop on Computational Linguistics for Uralic Languages

2016

A Finite-state Morphological Analyser for Tuvan
Francis Tyers | Aziyana Bayyr-ool | Aelita Salchak | Jonathan Washington
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper describes the development of free/open-source finite-state morphological transducers for Tuvan, a Turkic language spoken in and around the Tuvan Republic in Russia. The finite-state toolkit used for the work is the Helsinki Finite-State Toolkit (HFST); we use the lexc formalism for modelling the morphotactics and the twol formalism for modelling morphophonological alternations. We present a novel description of the morphological combinatorics of pseudo-derivational morphemes in Tuvan. An evaluation is presented which shows that the transducer has a reasonable coverage (around 93%) on freely-available corpora of the language, and high precision (over 99%) on a manually verified test set.

Apertium: a free/open source platform for machine translation and basic language technology
Mikel L. Forcada | Francis M. Tyers
Proceedings of the 19th Annual Conference of the European Association for Machine Translation: Projects/Products

A Finite-State Morphological Analyser for Sindhi
Raveesh Motlani | Francis Tyers | Dipti Sharma
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Morphological analysis is a fundamental task in natural-language processing, which is used in other NLP applications such as part-of-speech tagging, syntactic parsing, information retrieval, and machine translation. In this paper, we present our work on the development of a free/open-source finite-state morphological analyser for Sindhi. We have used Apertium’s lttoolbox as our finite-state toolkit to implement the transducer. The system is developed using a paradigm-based approach, wherein a paradigm defines all the word forms and their morphological features for a given stem (lemma). We have evaluated our system on the Sindhi Wikipedia corpus and achieved a reasonable coverage of 81% and a precision of over 97%.

Universal Dependencies for Turkish
Umut Sulubacak | Memduh Gokirmak | Francis Tyers | Çağrı Çöltekin | Joakim Nivre | Gülşen Eryiğit
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

The Universal Dependencies (UD) project was conceived after the substantial recent interest in unifying annotation schemes across languages. With its own annotation principles and abstract inventory for parts of speech, morphosyntactic features and dependency relations, UD aims to facilitate multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. This paper presents the Turkish IMST-UD Treebank, the first Turkish treebank to be in a UD release. The IMST-UD Treebank was automatically converted from the IMST Treebank, which was also recently released. We describe this conversion procedure in detail, complete with mapping tables. We also present our evaluation of the parsing performances of both versions of the IMST Treebank. Our findings suggest that the UD framework is at least as viable for Turkish as the original annotation framework of the IMST Treebank.

2015

Evaluating machine translation for assimilation via a gap-filling task
Ekaterina Ageeva | Francis M. Tyers | Mikel L. Forcada | Juan Antonio Pérez-Ortiz
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

Automatic conversion of colloquial Finnish to standard Finnish
Inari Listenmaa | Francis M. Tyers
Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)

Unsupervised training of maximum-entropy models for lexical selection in rule-based machine translation
Francis M. Tyers | Felipe Sánchez-Martinez | Mikel L. Forcada
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

Unsupervised training of maximum-entropy models for lexical selection in rule-based machine translation
Francis M. Tyers | Felipe Sánchez-Martínez | Mikel L. Forcada
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

Evaluating machine translation for assimilation via a gap-filling task
Ekaterina Ageeva | Mikel L. Forcada | Francis M. Tyers | Juan Antonio Pérez-Ortiz
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

Automatic word stress annotation of Russian unrestricted text
Robert Reynolds | Francis Tyers
Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)

2014

Subsegmental language detection in Celtic language text
Akshay Minocha | Francis Tyers
Proceedings of the First Celtic Language Technology Workshop

Why Implementation Matters: Evaluation of an Open-source Constraint Grammar Parser
Dávid Márk Nemeskey | Francis Tyers | Mans Hulden
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

Finite-state morphological transducers for three Kypchak languages
Jonathan Washington | Ilnar Salimzyanov | Francis Tyers
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper describes the development of free/open-source finite-state morphological transducers for three Turkic languages―Kazakh, Tatar, and Kumyk―representing one language from each of the three sub-branches of the Kypchak branch of Turkic. The finite-state toolkit used for the work is the Helsinki Finite-State Toolkit (HFST). This paper describes how the development of a transducer for each subsequent closely-related language took less development time. An evaluation is presented which shows that the transducers all have a reasonable coverage―around 90%―on freely available corpora of the languages, and high precision over a manually verified test set.

2013

A Free/Open-source Kazakh-Tatar Machine Translation System
Ilnar Salimzyanov | Jonathan Washington | Francis Tyers
Proceedings of Machine Translation Summit XIV: Papers

2012

Free/Open Source Shallow-Transfer Based Machine Translation for Spanish and Aragonese
Juan Pablo Martínez Cortés | Jim O’Regan | Francis Tyers
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This article describes the development of a bidirectional shallow-transfer based machine translation system for Spanish and Aragonese, based on the Apertium platform, reusing the resources provided by other translators built for the platform. The system, and the morphological analyser built for it, are both the first resources of their kind for Aragonese. The morphological analyser has coverage of over 80%, and is being reused to create a spelling checker for Aragonese. The translator is bidirectional: the Word Error Rate for Spanish to Aragonese is 16.83%, while Aragonese to Spanish is 11.61%.

A finite-state morphological transducer for Kyrgyz
Jonathan Washington | Mirlan Ipasov | Francis Tyers
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper describes the development of a free/open-source finite-state morphological transducer for Kyrgyz. The transducer has been developed for morphological generation for use within a prototype Turkish–Kyrgyz machine translation system, but has also been extensively tested for analysis. The finite-state toolkit used for the work was the Helsinki Finite-State Toolkit (HFST). The paper describes some issues in Kyrgyz morphology, the development of the tool, some linguistic issues encountered and how they were dealt with, and which issues are left to resolve. An evaluation is presented which shows that the transducer has medium-level coverage, between 82% and 87% on two freely available corpora of Kyrgyz, and high precision and recall over a manually verified test set.

Rule-based Machine Translation between Indonesian and Malaysian
Raymond Hendy Susanto | Septina Dian Larasati | Francis M. Tyers
Proceedings of the 3rd Workshop on South and Southeast Asian Natural Language Processing

Flexible finite-state lexical selection for rule-based machine translation
Francis M. Tyers | Felipe Sánchez-Martínez | Mikel L. Forcada
Proceedings of the 16th Annual Conference of the European Association for Machine Translation

A rule-based machine translation system from Serbo-Croatian to Macedonian
Hrvoje Peradin | Francis Tyers
Proceedings of the Third International Workshop on Free/Open-Source Rule-Based Machine Translation

This paper describes the development of a one-way machine translation system from Serbo-Croatian to Macedonian on the Apertium platform. Details of resources and development methods are given, as well as an evaluation, and general directives for future work.

2011

Rapid rule-based machine translation between Dutch and Afrikaans
Pim Otte | Francis M. Tyers
Proceedings of the 15th Annual Conference of the European Association for Machine Translation

Apertium-IceNLP: A rule-based Icelandic to English machine translation system
Martha Dís Brandt | Hrafh Loftsson | Hlynur Sigurþórsson | Francis M. Tyers
Proceedings of the 15th Annual Conference of the European Association for Machine Translation

An Italian to Catalan RBMT system reusing data from existing language pairs
Antonio Toral | Mireia Ginestí-Rosell | Francis Tyers
Proceedings of the Second International Workshop on Free/Open-Source Rule-Based Machine Translation

This paper presents an Italian→Catalan RBMT system automatically built by combining the linguistic data of the existing pairs Spanish–Catalan and Spanish–Italian. A lightweight manual postprocessing is carried out in order to fix inconsistencies in the automatically derived dictionaries and to add very frequent words that are missing according to a corpus analysis. The system is evaluated on the KDE4 corpus and outperforms Google Translate by approximately ten absolute points in terms of both TER and GTM.

2010

Rule-based Breton to French machine translation
Francis Tyers
Proceedings of the 14th Annual Conference of the European Association for Machine Translation

2009

Rule-Based Augmentation of Training Data in Breton-French Statistical Machine Translation
Francis M. Tyers
Proceedings of the 13th Annual Conference of the European Association for Machine Translation

Matxin: Moving towards language independence
Aingeru Mayor | Francis M. Tyers
Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation

This paper describes some of the issues found when adapting and extending the Matxin free-software machine translation system to other language pairs. It sketches out some of the characteristics of Matxin and offers some possible solutions to these issues.

Developing Prototypes for Machine Translation between Two Sami Languages
Francis M. Tyers | Linda Wiechetek | Trond Trosterud
Proceedings of the 13th Annual Conference of the European Association for Machine Translation

Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation
Juan Antonio Pérez-Ortiz | Felipe Sánchez-Martinez | Francis M. Tyers
Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation

The Apertium machine translation platform: Five years on
Mikel L. Forcada | Francis M. Tyers | Gema Ramírez-Sánchez
Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation

This paper describes Apertium: a free/open-source machine translation platform (engine, toolbox and data), its history, its philosophy of design, its technology, the community of developers, the research and business based on it, and its prospects and challenges, now that it is five years old.

Shallow-transfer rule-based machine translation for Swedish to Danish
Francis M. Tyers | Jacob Nordfalk
Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation

This article describes the development of a shallow-transfer machine translation system from Swedish to Danish in the Apertium platform. It gives details of the resources used, the methods for constructing the system and an evaluation of the translation quality. The quality is found to be comparable with that of current commercial systems, despite the particularly low coverage of the lexicons.

Development of a morphological analyser for Bengali
Abu Zaher Md Faridee | Francis M. Tyers
Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation

This article describes the development of an open-source morphological analyser for the Bengali language using finite-state technology. First we discuss the challenges of creating a morphological analyser for a highly inflectional language like Bengali and then propose a solution to that using lttoolbox, an open-source finite-state toolkit. We then evaluate the performance of our developed system and propose ways of improving it further.

Co-authors

Elena Klyachko 6

Flammie A. Pirinen 6

Michael Rießler 6

Miikka Silfverberg 6

Daniel G. Swanson 6

Ekaterina Vylomova 6

Sardana Ivanova 5

Trond Trosterud 5

Garrett Nicolai 4

Niko Partanen 4

Felipe Sánchez-Martínez 4

Timofey Arkhangelskiy 3

Aziyana Bayyr-ool 3

Eleanor Chodroff 3

Ryan Cotterell 3

Andrew Krizhanovsky 3

Sandra Kübler 3

Éric Le Ferrand 3

Sabrina J. Mielke 3

Tiago Pimentel 3

Anna Postnikova 3

Juan Antonio Pérez-Ortiz 3

Aelita Salchak 3

Ilnar Salimzyanov 3

Tatiana Shavrina 3

Mariya Sheyanova 3

Ekaterina Voloshina 3

Ekaterina Ageeva 2

Antonios Anastasopoulos 2

Yustinus Ghanggo Ate 2

Sevilay Bayatli 2

Jean-Philippe Bernardy 2

Bryce D. Bussert 2

Sriram Chellappan 2

Cagri Coltekin 2

Paula Czarnowska 2

Charbel El-Khaissi 2

Sofya Ganieva 2

Michael Gasser 2

Richard J. Hatcher 2

Marivel Huerta Mendez 2

Salam Khalifa 2

Witold Kieraś 2

Natalia Krizhanovsky 2

Anastasia Kuznetsova 2

Brian Leonard 2

Valentin Malykh 2

Christopher D. Manning 2

Rowan Hall Maudslay 2

Vladislav Mikhailov 2

Jamshidbek Mirzakhalov 2

Bekhzodbek Moydinboyev 2

Irene Nikkarinen 2

Zahroh Nuriah 2

Arturo Oncevay 2

Alexandra O’Neil 2

Jungyeul Park 2

Matvey Plugaryov 2

Edoardo M. Ponti 2

Emily Prud’hommeaux 2

Shaxnoza Pulatova 2

Sampo Pyysalo 2

Vinit Ravishankar 2

Maria Ryskina 2

Elizabeth Salesky 2

Jaime Rafael Montoya Samame 2

Sebastian Schuster 2

Lane Schwartz 2

Karina Sheifer 2

Niklas Stoehr 2

Christopher Straughn 2

Totok Suhardijanto 2

Umut Sulubacak 2

Lucas Torroba Hennigen 2

Emmanuel Ngué Um 2

Josef Valvoda 2

Gema Celeste Silva Villegas 2

Jennifer White 2

Marcin Woliński 2

David Yarowsky 2

Marie-Catherine de Marneffe 2

Otabek Abduraufov 1

Hector Fernandez Alcalde 1

Vasilisa Andriyanets 1

Taras Andrushko 1

Lene Antonsen 1

Ekaterina Aplonova 1

Rosana Ardila 1

Aryaman Arora 1

Brice Martial Atangana Eloundou 1

Mohammed Attia 1

Elena Badmaeva 1

Esha Banerjee 1

Blaise-Mathieu Banoum Manguele 1

Diego Barriga Martínez 1

Colin Batchelor 1

Khuyagbaatar Batsuren 1

Remziye Berberova 1

Valery Berthoud F. 1

Brijesh Bhatt 1

Maria Bibaeva 1

Martha Dís Brandt 1

Megan Branson 1

Elena Budianskaya 1

Aljoscha Burchardt 1

Andrei Butnaru 1

Delio Siticonatzi Camaiteri 1

Quetzil Castañeda 1

Samuel Herrera Castro 1

Silvie Cinková 1

Daniel Dakota 1

Florus Landry Dibengue 1

Tino Didriksen 1

Blaise Abbo Djoulde 1

Hossep Dolatian 1

Kira Droganova 1

Emmanuel Giovanni Eloundou Eyenga 1

Gülşen Eryiğit 1

Abu Zaher Md Faridee 1

Alena Fenogenova 1

Ciprian Gerstenberger 1

Mireia Ginestí-Rosell 1

Fausto Giunchiglia 1

Yoav Goldberg 1

Morgan Grobol 1

Ximena Gutierrez-Vasques 1

Mammad Hajili 1

Jan Hajič jr. 1

Coleman Haley 1

Johannes Heinecke 1

Robert Henderson 1

Michael Henretty 1

Jaroslava Hlaváčová 1

Nicholas Howell 1

Chu-Ren Huang 1

Radu Tudor Ionescu 1

Mirlan Ipasov 1

Tommi Jauhiainen 1

Eunkyul Leah Jo 1

María Ximena Juárez Huerta 1

Heiki-Jaan Kaalep 1

Hiroshi Kanayama 1

Jenna Kanerva 1

Maja Kappfjell 1

Ritván Karahóǧa 1

Güllü Karanfil 1

Sherzod Kariev 1

Darya Kavitskaya 1

Tolga Kayadelen 1

Václava Kettnerová 1

Abror Khaytbaev 1

Jesse Kirchner 1

Christo Kirov 1

Natalia Klyueva 1

Michael Kohler 1

Sergey Kosyak 1

Julia Kreutzer 1

Natalia Krizhanovskaya 1

Aigiz Kunafin 1

Sookyoung Kwak 1

Dorina Lakatos 1

Tatiana Lando 1

William Abbott Lane 1

Septina Dian Larasati 1

Joseph Larson 1

Antonio Laverghetta Jr. 1

Saran Lertpradit 1

André Likwai 1

Inari Listenmaa 1

Hrafh Loftsson 1

Juhani Luotolahti 1

Juan López Bautista 1

Didier López Francis 1

Vivien Macketanz 1

Shervin Malmasi 1

Michael Mandel 1

Ruli Manurung 1

Igor Marchenko 1

Katrin Marheinecke 1

Stella Markantonatou 1

Aleksandra Martynova 1

Juan Pablo Martínez 1

Héctor Martínez Alonso 1

Polina Mashkovtseva 1

Aingeru Mayor 1

Arya D. McCarthy 1

Gustavo Mendonca 1

Victor Mijangos 1

Akshay Minocha 1

Karina Mishchenkova 1

Anna Missilä 1

Zanele Mlondo 1

Cynthia Montaño 1

Reuben Morais 1

Raveesh Motlani 1

José Mpouda Avom 1

Saliha Muradoğlu 1

Ángeles Márquez Hernandez 1

Thulile Ndlovu 1

Anna Nedoluzhko 1

Dávid Márk Nemeskey 1

Ekaterina Neminova 1

Maria Nepomniashchaya 1

Jeff Sterling Ngami Kamagoua 1

Eliette-Caroline Emilie Ngo Tjomb 1

Rattima Nitisaroj 1

Jacob Nordfalk 1

Mathilde Nyambe A 1

Zacharie Nyobe 1

Sandra Nystø Rahka 1

Marja-Liisa Olthuis 1

Jim O’Regan 1

Hyunji Hayley Park 1

George Pavlidis 1

Hrvoje Peradin 1

Ngami Phumzile Pewa 1

Edoardo Maria Ponti 1

Martin Potthast 1

Tiina Puolakainen 1

Alexandre Rademaker 1

Gema Ramírez-Sánchez 1

Shauli Ravfogel 1

Robert Reynolds 1

Daria Rodionova 1

Esaú Zumaeta Rojas 1

Tanja Samardzic 1

Manuela Sanguinetti 1

Javier Santillan 1

Mitsuya Sasaki 1

Lindsay Saunders 1

Andrey Scherbakov 1

Yves Scherrer 1

Alexandra Serova 1

Dipti Misra Sharma 1

Andrey Shcherbakov 1

Atsuko Shimada 1

Hlynur Sigurþórsson 1

Varun Sreedhar 1

Antonio Stella 1

Pavel Stepachev 1

Jana Strnadová 1

Raymond Hendy Susanto 1

Gábor Szolnok 1

Svetlana Toldova 1

Antonio Toral 1

Reut Tsarfaty 1

Kevin Unhammer 1

Zdenka Uresova 1

Hans Uszkoreit 1

Mokhiyakhon Uzokova 1

Konstantin Vinogorodskiy 1

Linda Wiechetek 1

Adina Williams 1

Cheyenne Wing 1

Anna Yablonskaya 1

Anastasia Yemelina 1

Jeremiah Young 1

Marcos Zampieri 1

Roberto Zariquiey 1

Valeria de Paiva 1

Venues

JEP/TALN/RECITAL 1