Askars Salimbajevs


2024

Code-Mixed Text Augmentation for Latvian ASR
Martins Kronis | Askars Salimbajevs | Mārcis Pinnis
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Code-mixing has become mainstream in the modern, globalised world and affects low-resource languages, such as Latvian, in particular. Approaches to developing an automatic speech recognition (ASR) system for code-mixed speech often rely on specially created audio-text corpora, which are expensive and time-consuming to create. In this work, we attempt to tackle code-mixed Latvian-English speech recognition by improving the language model (LM) of a hybrid ASR system. We distinguish between inflected transliterations and phonetic transcriptions as two different foreign word types, and we propose an inflected transliteration model and a phonetic transcription model to generate these word types automatically. We then leverage a large human-translated English-Latvian parallel text corpus to generate synthetic code-mixed Latvian sentences by substituting in the generated foreign words. Using the newly created augmented corpora, we train a new LM and combine it with our existing Latvian acoustic model (AM). For evaluation, we create a specialised foreign word test set on which our methods yield up to 15% relative CER improvement. We then further validate these results in a human evaluation campaign.

2020

The COMPRISE Cloud Platform
Raivis Skadiņš | Askars Salimbajevs
Proceedings of the 1st International Workshop on Language Technology Platforms

This paper presents the COMPRISE cloud platform, which is developed in the H2020 project. We present an overview of the COMPRISE project, its main goals and components, and how the cloud platform fits into the overall project. The COMPRISE cloud platform is then presented in more detail: its main users, usage scenarios, functions, and implementation details, and how it will be used by both COMPRISE’s target audience and the broader language-technology community.

2018

Creating Lithuanian and Latvian Speech Corpora from Inaccurately Annotated Web Data
Askars Salimbajevs
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

Designing a Speech Corpus for the Development and Evaluation of Dictation Systems in Latvian
Mārcis Pinnis | Askars Salimbajevs | Ilze Auziņa
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper, the authors present a speech corpus designed and created for the development and evaluation of dictation systems in Latvian. The corpus consists of over nine hours of orthographically annotated speech from 30 different speakers. The corpus features spoken commands that are common in dictation systems for text editors. The corpus is evaluated in an automatic speech recognition (ASR) dictation scenario. The results show that adding the corpus to the acoustic model training data, in combination with language model adaptation, decreases the WER by up to 41.36% relative (16.83% absolute) compared to a baseline system without language model adaptation. The contribution of acoustic data augmentation is 12.57% relative (3.43% absolute).
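
The relative and absolute WER figures quoted in the abstract above are linked by the baseline WER: relative reduction = absolute reduction / baseline WER. A minimal Python sketch of that arithmetic follows. The function names are illustrative, and the baseline it computes is only implied by the two quoted numbers; the abstract itself does not state it.

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent, given two WERs in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


def implied_baseline(absolute_reduction: float, relative_reduction_pct: float) -> float:
    """Baseline WER (percent) implied by an absolute reduction
    (percentage points) and a relative reduction (percent)."""
    return 100.0 * absolute_reduction / relative_reduction_pct


# Figures quoted in the abstract: 41.36% relative, 16.83% absolute.
baseline = implied_baseline(16.83, 41.36)   # derived, not stated in the source
adapted = baseline - 16.83                  # WER after the improvement

print(f"implied baseline WER: {baseline:.2f}%")
print(f"implied improved WER: {adapted:.2f}%")
print(f"relative reduction:   {relative_reduction(baseline, adapted):.2f}%")
```

The second pair of figures (12.57% relative, 3.43% absolute) implies a different baseline, which suggests it was measured against a different reference system.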

2015

Error Analysis and Improving Speech Recognition for Latvian Language
Askars Salimbajevs | Jevgenijs Strigins
Proceedings of the International Conference Recent Advances in Natural Language Processing

Using sub-word n-gram models for dealing with OOV in large vocabulary speech recognition for Latvian
Askars Salimbajevs | Jevgenijs Strigins
Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)