Mirjam Sepesy Maucec

Also published as: Mirjam Sepesy Maučec


The SI TEDx-UM speech database: a new Slovenian Spoken Language Resource
Andrej Žgank | Mirjam Sepesy Maučec | Darinka Verdonik
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents a new Slovenian spoken language resource built from TEDx Talks. The speech database contains 242 talks with a total duration of 54 hours. The annotation and transcription of the acquired spoken material were generated automatically by applying acoustic segmentation and automatic speech recognition. The development and evaluation subset was also manually transcribed following the guidelines specified for the Slovenian GOS corpus. The manual transcriptions were used to evaluate the quality of the unsupervised transcriptions. The average word error rate for the SI TEDx-UM evaluation subset was 50.7%, with an out-of-vocabulary rate of 24% and a language model perplexity of 390. The unsupervised transcriptions contain 372k tokens, of which 32k are distinct.


Machine Translation for Subtitling: A Large-Scale Evaluation
Thierry Etchegoyhen | Lindsay Bywood | Mark Fishel | Panayota Georgakopoulou | Jie Jiang | Gerard van Loenhout | Arantza del Pozo | Mirjam Sepesy Maučec | Anja Turner | Martin Volk
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This article describes a large-scale evaluation of the use of Statistical Machine Translation for professional subtitling. The work was carried out within the FP7 EU-funded project SUMAT and involved two rounds of evaluation: a quality evaluation and a measure of productivity gain/loss. We present the SMT systems built for the project and the corpora they were trained on, which combine professionally created and crowd-sourced data. Evaluation goals, methodology and results are presented for the eleven translation pairs that were evaluated by professional subtitlers. Overall, a majority of the machine-translated subtitles received good quality ratings. The results were also positive in terms of productivity, with a global gain approaching 40%. We also evaluated the impact of applying quality estimation and filtering of poor MT output, which resulted in higher productivity gains for filtered files than for fully machine-translated files. Finally, we present and discuss feedback from the subtitlers who participated in the evaluation, a key aspect for any eventual adoption of machine translation technology in professional subtitling.


SMT Approaches for Commercial Translation of Subtitles
Thierry Etchegoyhen | Mark Fishel | Jie Jiang | Mirjam Sepesy Maucec
Proceedings of Machine Translation Summit XIV: User track

SUMAT: An Online Service for Subtitling by Machine Translation
P. Georgakopoulou | L. Bywood | Thierry Etchegoyhen | Mark Fishel | Jie Jiang | G. van Loenhout | A. del Pozo | D. Spiliotopoulos | Mirjam Sepesy Maucec | A. Turner
Proceedings of Machine Translation Summit XIV: European projects


SUMAT: Data Collection and Parallel Corpus Compilation for Machine Translation of Subtitles
Volha Petukhova | Rodrigo Agerri | Mark Fishel | Sergio Penkale | Arantza del Pozo | Mirjam Sepesy Maučec | Andy Way | Panayota Georgakopoulou | Martin Volk
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Subtitling and audiovisual translation have been recognized as areas that could greatly benefit from the introduction of Statistical Machine Translation (SMT) followed by post-editing, in order to increase the efficiency of the subtitle production process. The FP7 European project SUMAT (An Online Service for SUbtitling by MAchine Translation: http://www.sumat-project.eu) aims to develop an online subtitle translation service for nine European languages, combined into 14 different language pairs, in order to semi-automate the subtitle translation processes of both freelance translators and subtitling companies on a large scale. In this paper we discuss the data collection and parallel corpus compilation for training SMT systems, which includes several procedures such as data partition, conversion, formatting, normalization and alignment. We discuss each data pre-processing step in detail, together with the various approaches used. Apart from its size (around 1 million subtitles per language pair), the SUMAT corpus has a number of very important characteristics. First, high quality has been achieved, both in terms of translation and in terms of high-precision alignment of parallel documents and their contents. Second, the contents are provided in one consistent format and encoding. Finally, additional information, such as the type of content in terms of genre and domain, is available.


Acquisition and Annotation of Slovenian Broadcast News Database
Andrej Žgank | Tomaž Rotovnik | Mirjam Sepesy Maučec | Darinka Verdonik | Janez Kitak | Damjan Vlaj | Vladimir Hozjan | Zdravko Kačič | Bogomir Horvat
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)