Timofey Arkhangelskiy


2022

Eastern Armenian National Corpus: State of the Art and Perspectives
Victoria Khurshudyan | Timofey Arkhangelskiy | Misha Daniel | Vladimir Plungian | Dmitri Levonian | Alex Polyakov | Sergei Rubakov
Proceedings of the Workshop on Processing Language Variation: Digital Armenian (DigitAm) within the 13th Language Resources and Evaluation Conference

The Eastern Armenian National Corpus (EANC) is a comprehensive corpus of Modern Eastern Armenian with about 110 million tokens, covering written and oral discourse from the mid-19th century to the present. The corpus has morphological, semantic and metatext annotation, as well as English translations. EANC is open access and available at www.eanc.net.

Proceedings of the first workshop on NLP applications to field linguistics
Oleg Serikov | Ekaterina Voloshina | Anna Postnikova | Elena Klyachko | Ekaterina Neminova | Ekaterina Vylomova | Tatiana Shavrina | Eric Le Ferrand | Valentin Malykh | Francis Tyers | Timofey Arkhangelskiy | Vladislav Mikhailov | Alena Fenogenova
Proceedings of the first workshop on NLP applications to field linguistics

2021

Low-Resource ASR with an Augmented Language Model
Timofey Arkhangelskiy
Proceedings of the Seventh International Workshop on Computational Linguistics of Uralic Languages

2020

UniMorph 3.0: Universal Morphology
Arya D. McCarthy | Christo Kirov | Matteo Grella | Amrit Nidhi | Patrick Xia | Kyle Gorman | Ekaterina Vylomova | Sabrina J. Mielke | Garrett Nicolai | Miikka Silfverberg | Timofey Arkhangelskiy | Nataly Krizhanovsky | Andrew Krizhanovsky | Elena Klyachko | Alexey Sorokin | John Mansfield | Valts Ernštreits | Yuval Pinter | Cassandra L. Jacobs | Ryan Cotterell | Mans Hulden | David Yarowsky
Proceedings of the Twelfth Language Resources and Evaluation Conference

The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological paradigms for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema. We have implemented several improvements to the extraction pipeline which creates most of our data, so that it is both more complete and more correct. We have added 66 new languages, as well as new parts of speech for 12 languages. We have also amended the schema in several ways. Finally, we present three new community tools: two to validate data for resource creators, and one to make morphological data available from the command line. UniMorph is based at the Center for Language and Speech Processing (CLSP) at Johns Hopkins University in Baltimore, Maryland. This paper details advances made to the schema, tooling, and dissemination of project resources since the UniMorph 2.0 release described at LREC 2018.

Improving the Language Model for Low-Resource ASR with Online Text Corpora
Nils Hjortnaes | Timofey Arkhangelskiy | Niko Partanen | Michael Rießler | Francis Tyers
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

In this paper, we expand on previous work on automatic speech recognition in a low-resource scenario typical of data collected by field linguists. We train DeepSpeech models on 35 hours of dialectal Komi speech recordings and correct the output using language models constructed from various sources. Previous experiments showed that transfer learning using DeepSpeech can improve the accuracy of a speech recognizer for Komi, though the error rate remained very high. Here we present further experiments with language models created using KenLM from text materials available online. We build three language models: one from a corpus of literary texts, one from social media content, and one from the two combined. We then train the model with each language model to explore the impact of the language model data source on the speech recognition model. Our results show significant improvements of over 25% in character error rate and nearly 20% in word error rate. This offers important methodological insight into how ASR results can be improved under low-resource conditions: transfer learning can be used to compensate for the lack of training data in the target language, and online texts are a very useful resource when developing language models in this context.

2019

Uralic multimedia corpora: ISO/TEI corpus data in the project INEL
Timofey Arkhangelskiy | Anne Ferger | Hanna Hedeland
Proceedings of the Fifth International Workshop on Computational Linguistics for Uralic Languages

Corpora of social media in minority Uralic languages
Timofey Arkhangelskiy
Proceedings of the Fifth International Workshop on Computational Linguistics for Uralic Languages

Corpus of usage examples: What is it good for?
Timofey Arkhangelskiy
Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)

2018

Sound-aligned corpus of Udmurt dialectal texts
Timofey Arkhangelskiy | Ekaterina Georgieva
Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages

2014

Estimating Native Vocabulary Size in an Endangered Language
Timofey Arkhangelskiy
Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages

2012

The Creation of Large-Scale Annotated Corpora of Minority Languages using UniParser and the EANC platform
Timofey Arkhangelskiy | Oleg Belyaev | Arseniy Vydrin
Proceedings of COLING 2012: Posters