2025
Optimizing Estonian TV Subtitles with Semi-supervised Learning and LLMs
Artem Fedorchenko | Tanel Alumäe
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)
This paper presents an approach for generating high-quality, same-language subtitles for Estonian TV content. We finetune the Whisper model on human-generated Estonian subtitles and enhance it with iterative pseudo-labeling and large language model (LLM) based post-editing. Our experiments demonstrate a notable improvement in subtitle quality through pseudo-labeling with an unlabeled dataset. We find that applying LLM-based editing at test time enhances subtitle accuracy, while its use during training does not yield further gains. This approach holds promise for producing subtitles whose quality is close to the human standard and could be extended to real-time applications.
2024
Finetuning End-to-End Models for Estonian Conversational Spoken Language Translation
Tiia Sildam | Andra Velve | Tanel Alumäe
Proceedings of the Seventh Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2024)
This paper investigates the finetuning of end-to-end models for bidirectional Estonian-English and Estonian-Russian conversational speech-to-text translation. Due to the limited availability of speech translation data for Estonian, we created additional training data by web scraping and by synthesizing data from speech recognition datasets using machine translation. We evaluated three publicly available end-to-end models: Whisper, OWSM 3.1, and SeamlessM4T. Our results indicate that finetuning with synthetic data enhances translation accuracy by a large margin, with SeamlessM4T matching or surpassing cascaded speech translation systems that use state-of-the-art speech recognition and machine translation models.
2023
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
Tanel Alumäe | Mark Fishel
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
Automatic Closed Captioning for Estonian Live Broadcasts
Tanel Alumäe | Joonas Kalda | Külliki Bode | Martin Kaitsa
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
This paper describes a speech-recognition-based closed captioning system for the Estonian language, primarily intended for the hard-of-hearing community. The system automatically identifies Estonian speech segments, converts speech to text using Kaldi-based TDNN-F models, and applies punctuation insertion and inverse text normalization. The word error rate of the system is 8.5% for television news programs and 13.4% for talk shows. The system is used by the Estonian Public Television for captioning live native language broadcasts and by the Estonian Parliament for captioning its live video feeds. Qualitative evaluation with the target audience showed that while the existence of closed captioning is crucial, the aspects most in need of improvement are ASR quality and the synchronization of the captions with the audio.
Proceedings of the NoDaLiDa 2023 Workshop on Constraint Grammar - Methods, Tools and Applications
Eckhard Bick | Trond Trosterud | Tanel Alumäe
Proceedings of the NoDaLiDa 2023 Workshop on Constraint Grammar - Methods, Tools and Applications
2017
Low-Resource Neural Headline Generation
Ottokar Tilk | Tanel Alumäe
Proceedings of the Workshop on New Frontiers in Summarization
Recent neural headline generation models have shown great results, but are generally trained on very large datasets. We focus our efforts on improving headline quality on smaller datasets by means of pretraining. We propose new methods that enable pretraining all the parameters of the model and utilize all available text, resulting in improvements of up to 32.4% relative in perplexity and 2.84 points in ROUGE.
2012
A Hierarchical Dirichlet Process Model for Joint Part-of-Speech and Morphology Induction
Kairit Sirts | Tanel Alumäe
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
2010
Domain Adaptation of Maximum Entropy Language Models
Tanel Alumäe | Mikko Kurimo
Proceedings of the ACL 2010 Conference Short Papers
2007
Automatic Compound Word Reconstruction for Speech Recognition of Compounding Languages
Tanel Alumäe
Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA 2007)
2006
Unlimited vocabulary speech recognition for agglutinative languages
Mikko Kurimo | Antti Puurula | Ebru Arisoy | Vesa Siivola | Teemu Hirsimäki | Janne Pylkkönen | Tanel Alumäe | Murat Saraclar
Proceedings of the Human Language Technology Conference of the NAACL, Main Conference