Matthias Sperber


2021

pdf bib
Streaming Models for Joint Speech Recognition and Translation
Orion Weller | Matthias Sperber | Christian Gollan | Joris Kluivers
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Using end-to-end models for speech translation (ST) has increasingly been the focus of the ST community. These models condense the previously cascaded systems by directly converting sound waves into translated text. However, cascaded models have the advantage of including automatic speech recognition output, useful for a variety of practical ST systems that often display transcripts to the user alongside the translations. To bridge this gap, recent work has shown initial progress into the feasibility for end-to-end models to produce both of these outputs. However, all previous work has only looked at this problem from the consecutive perspective, leaving uncertainty on whether these approaches are effective in the more challenging streaming setting. We develop an end-to-end streaming ST model based on a re-translation approach and compare against standard cascading approaches. We also introduce a novel inference method for the joint case, interleaving both transcript and translation in generation and removing the need to use separate decoders. Our evaluation across a range of metrics capturing accuracy, latency, and consistency shows that our end-to-end models are statistically similar to cascading models, while having half the number of parameters. We also find that both systems provide strong translation quality at low latency, keeping 99% of consecutive quality at a lag of just under a second.

2020

pdf bib
Consistent Transcription and Translation of Speech
Matthias Sperber | Hendra Setiawan | Christian Gollan | Udhyakumar Nallasamy | Matthias Paulik
Transactions of the Association for Computational Linguistics, Volume 8

The conventional paradigm in speech translation starts with a speech recognition step to generate transcripts, followed by a translation step with the automatic transcripts as input. To address various shortcomings of this paradigm, recent work explores end-to-end trainable direct models that translate without transcribing. However, transcripts can be an indispensable output in practical applications, which often display transcripts alongside the translations to users.We make this common requirement explicit and explore the task of jointly transcribing and translating speech. Although high accuracy of transcript and translation are crucial, even highly accurate systems can suffer from inconsistencies between both outputs that degrade the user experience. We introduce a methodology to evaluate consistency and compare several modeling approaches, including the traditional cascaded approach and end-to-end models. We find that direct models are poorly suited to the joint transcription/translation task, but that end-to-end models that feature a coupled inference procedure are able to achieve strong consistency. We further introduce simple techniques for directly optimizing for consistency, and analyze the resulting trade-offs between consistency, transcription accuracy, and translation accuracy.1

pdf bib
Speech Translation and the End-to-End Promise: Taking Stock of Where We Are
Matthias Sperber | Matthias Paulik
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Over its three decade history, speech translation has experienced several shifts in its primary research themes; moving from loosely coupled cascades of speech recognition and machine translation, to exploring questions of tight coupling, and finally to end-to-end models that have recently attracted much attention. This paper provides a brief survey of these developments, along with a discussion of the main challenges of traditional approaches which stem from committing to intermediate representations from the speech recognizer, and from training cascaded models separately towards different objectives. Recent end-to-end modeling techniques promise a principled way of overcoming these issues by allowing joint training of all model components and removing the need for explicit intermediate representations. However, a closer look reveals that many end-to-end models fall short of solving these issues, due to compromises made to address data scarcity. This paper provides a unifying categorization and nomenclature that covers both traditional and recent approaches and that may help researchers by highlighting both trade-offs and open research questions.

pdf bib
Variational Neural Machine Translation with Normalizing Flows
Hendra Setiawan | Matthias Sperber | Udhyakumar Nallasamy | Matthias Paulik
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Variational Neural Machine Translation (VNMT) is an attractive framework for modeling the generation of target translations, conditioned not only on the source sentence but also on some latent random variables. The latent variable modeling may introduce useful statistical dependencies that can improve translation accuracy. Unfortunately, learning informative latent variables is non-trivial, as the latent space can be prohibitively large, and the latent codes are prone to be ignored by many translation models at training time. Previous works impose strong assumptions on the distribution of the latent code and limit the choice of the NMT architecture. In this paper, we propose to apply the VNMT framework to the state-of-the-art Transformer and introduce a more flexible approximate posterior based on normalizing flows. We demonstrate the efficacy of our proposal under both in-domain and out-of-domain conditions, significantly outperforming strong baselines.

2019

pdf bib
Fluent Translations from Disfluent Speech in End-to-End Speech Translation
Elizabeth Salesky | Matthias Sperber | Alexander Waibel
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Spoken language translation applications for speech suffer due to conversational speech phenomena, particularly the presence of disfluencies. With the rise of end-to-end speech translation models, processing steps such as disfluency removal that were previously an intermediate step between speech recognition and machine translation need to be incorporated into model architectures. We use a sequence-to-sequence model to translate from noisy, disfluent speech to fluent text with disfluencies removed using the recently collected ‘copy-edited’ references for the Fisher Spanish-English dataset. We are able to directly generate fluent translations and introduce considerations about how to evaluate success on this task. This work provides a baseline for a new task, implicitly removing disfluencies in end-to-end translation of conversational speech.

pdf bib
Self-Attentional Models for Lattice Inputs
Matthias Sperber | Graham Neubig | Ngoc-Quan Pham | Alex Waibel
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Lattices are an efficient and effective method to encode ambiguity of upstream systems in natural language processing tasks, for example to compactly capture multiple speech recognition hypotheses, or to represent multiple linguistic analyses. Previous work has extended recurrent neural networks to model lattice inputs and achieved improvements in various tasks, but these models suffer from very slow computation speeds. This paper extends the recently proposed paradigm of self-attention to handle lattice inputs. Self-attention is a sequence modeling technique that relates inputs to one another by computing pairwise similarities and has gained popularity for both its strong results and its computational efficiency. To extend such models to handle lattices, we introduce probabilistic reachability masks that incorporate lattice structure into the model and support lattice scores if available. We also propose a method for adapting positional embeddings to lattice structures. We apply the proposed model to a speech translation task and find that it outperforms all examined baselines while being much faster to compute than previous neural lattice models during both training and inference.

pdf bib
Exploring Phoneme-Level Speech Representations for End-to-End Speech Translation
Elizabeth Salesky | Matthias Sperber | Alan W Black
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Previous work on end-to-end translation from speech has primarily used frame-level features as speech representations, which creates longer, sparser sequences than text. We show that a naive method to create compressed phoneme-like speech representations is far more effective and efficient for translation than traditional frame-level speech features. Specifically, we generate phoneme labels for speech frames and average consecutive frames with the same label to create shorter, higher-level source sequences for translation. We see improvements of up to 5 BLEU on both our high and low resource language pairs, with a reduction in training time of 60%. Our improvements hold across multiple data sizes and two language pairs.

pdf bib
Paraphrases as Foreign Languages in Multilingual Neural Machine Translation
Zhong Zhou | Matthias Sperber | Alexander Waibel
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Paraphrases, rewordings of the same semantic meaning, are useful for improving generalization and translation. Unlike previous works that only explore paraphrases at the word or phrase level, we use different translations of the whole training data that are consistent in structure as paraphrases at the corpus level. We treat paraphrases as foreign languages, tag source sentences with paraphrase labels, and train on parallel paraphrases in the style of multilingual Neural Machine Translation (NMT). Our multi-paraphrase NMT that trains only on two languages outperforms the multilingual baselines. Adding paraphrases improves the rare word translation and increases entropy and diversity in lexical choice. Adding the source paraphrases boosts performance better than adding the target ones, while adding both lifts performance further. We achieve a BLEU score of 57.2 for French-to-English translation using 24 corpus-level paraphrases of the Bible, which outperforms the multilingual baselines and is +34.7 above the single-source single-target NMT baseline.

pdf bib
Attention-Passing Models for Robust and Data-Efficient End-to-End Speech Translation
Matthias Sperber | Graham Neubig | Jan Niehues | Alex Waibel
Transactions of the Association for Computational Linguistics, Volume 7

Speech translation has traditionally been approached through cascaded models consisting of a speech recognizer trained on a corpus of transcribed speech, and a machine translation system trained on parallel texts. Several recent works have shown the feasibility of collapsing the cascade into a single, direct model that can be trained in an end-to-end fashion on a corpus of translated speech. However, experiments are inconclusive on whether the cascade or the direct model is stronger, and have only been conducted under the unrealistic assumption that both are trained on equal amounts of data, ignoring other available speech recognition and machine translation corpora. In this paper, we demonstrate that direct speech translation models require more data to perform well than cascaded models, and although they allow including auxiliary data through multi-task training, they are poor at exploiting such data, putting them at a severe disadvantage. As a remedy, we propose the use of end- to-end trainable models with two attention mechanisms, the first establishing source speech to source text alignments, the second modeling source to target text alignment. We show that such models naturally decompose into multi-task–trainable recognition and translation tasks and propose an attention-passing technique that alleviates error propagation issues in a previous formulation of a model with two attention stages. Our proposed model outperforms all examined baselines and is able to exploit auxiliary training data much more effectively than direct attentional models.

2018

pdf bib
KIT Lecture Translator: Multilingual Speech Translation with One-Shot Learning
Florian Dessloch | Thanh-Le Ha | Markus Müller | Jan Niehues | Thai-Son Nguyen | Ngoc-Quan Pham | Elizabeth Salesky | Matthias Sperber | Sebastian Stüker | Thomas Zenkel | Alexander Waibel
Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations

In today’s globalized world we have the ability to communicate with people across the world. However, in many situations the language barrier still presents a major issue. For example, many foreign students coming to KIT to study are initially unable to follow a lecture in German. Therefore, we offer an automatic simultaneous interpretation service for students. To fulfill this task, we have developed a low-latency translation system that is adapted to lectures and covers several language pairs. While the switch from traditional Statistical Machine Translation to Neural Machine Translation (NMT) significantly improved performance, to integrate NMT into the speech translation framework required several adjustments. We have addressed the run-time constraints and different types of input. Furthermore, we utilized one-shot learning to easily add new topic-specific terms to the system. Besides better performance, NMT also enabled us increase our covered languages through multilingual NMT. % Combining these techniques, we are able to provide an adapted speech translation system for several European languages.

pdf bib
KIT-Multi: A Translation-Oriented Multilingual Embedding Corpus
Thanh-Le Ha | Jan Niehues | Matthias Sperber | Ngoc Quan Pham | Alexander Waibel
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
XNMT: The eXtensible Neural Machine Translation Toolkit
Graham Neubig | Matthias Sperber | Xinyi Wang | Matthieu Felix | Austin Matthews | Sarguna Padmanabhan | Ye Qi | Devendra Sachan | Philip Arthur | Pierre Godard | John Hewitt | Rachid Riad | Liming Wang
Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

pdf bib
Massively Parallel Cross-Lingual Learning in Low-Resource Target Language Translation
Zhong Zhou | Matthias Sperber | Alexander Waibel
Proceedings of the Third Conference on Machine Translation: Research Papers

We work on translation from rich-resource languages to low-resource languages. The main challenges we identify are the lack of low-resource language data, effective methods for cross-lingual transfer, and the variable-binding problem that is common in neural systems. We build a translation system that addresses these challenges using eight European language families as our test ground. Firstly, we add the source and the target family labels and study intra-family and inter-family influences for effective cross-lingual transfer. We achieve an improvement of +9.9 in BLEU score for English-Swedish translation using eight families compared to the single-family multi-source multi-target baseline. Moreover, we find that training on two neighboring families closest to the low-resource language is often enough. Secondly, we construct an ablation study and find that reasonably good results can be achieved even with considerably less target data. Thirdly, we address the variable-binding problem by building an order-preserving named entity translation model. We obtain 60.6% accuracy in qualitative evaluation where our translations are akin to human translations in a preliminary study.

2017

pdf bib
The Karlsruhe Institute of Technology Systems for the News Translation Task in WMT 2017
Ngoc-Quan Pham | Jan Niehues | Thanh-Le Ha | Eunah Cho | Matthias Sperber | Alexander Waibel
Proceedings of the Second Conference on Machine Translation

pdf bib
Neural Lattice-to-Sequence Models for Uncertain Inputs
Matthias Sperber | Graham Neubig | Jan Niehues | Alex Waibel
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

The input to a neural sequence-to-sequence model is often determined by an up-stream system, e.g. a word segmenter, part of speech tagger, or speech recognizer. These up-stream models are potentially error-prone. Representing inputs through word lattices allows making this uncertainty explicit by capturing alternative sequences and their posterior probabilities in a compact form. In this work, we extend the TreeLSTM (Tai et al., 2015) into a LatticeLSTM that is able to consume word lattices, and can be used as encoder in an attentional encoder-decoder model. We integrate lattice posterior scores into this architecture by extending the TreeLSTM’s child-sum and forget gates and introducing a bias term into the attention mechanism. We experiment with speech translation lattices and report consistent improvements over baselines that translate either the 1-best hypothesis or the lattice without posterior scores.

2016

pdf bib
Optimizing Computer-Assisted Transcription Quality with Iterative User Interfaces
Matthias Sperber | Graham Neubig | Satoshi Nakamura | Alex Waibel
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Computer-assisted transcription promises high-quality speech transcription at reduced costs. This is achieved by limiting human effort to transcribing parts for which automatic transcription quality is insufficient. Our goal is to improve the human transcription quality via appropriate user interface design. We focus on iterative interfaces that allow humans to solve tasks based on an initially given suggestion, in this case an automatic transcription. We conduct a user study that reveals considerable quality gains for three variations of iterative interfaces over a non-iterative from-scratch transcription interface. Our iterative interfaces included post-editing, confidence-enhanced post-editing, and a novel retyping interface. All three yielded similar quality on average, but we found that the proposed retyping interface was less sensitive to the difficulty of the segment, and superior when the automatic transcription of the segment contained relatively many errors. An analysis using mixed-effects models allows us to quantify these and other factors and draw conclusions over which interface design should be chosen in which circumstance.

pdf bib
The Karlsruhe Institute of Technology Systems for the News Translation Task in WMT 2016
Thanh-Le Ha | Eunah Cho | Jan Niehues | Mohammed Mediani | Matthias Sperber | Alexandre Allauzen | Alexander Waibel
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

pdf bib
Lecture Translator - Speech translation framework for simultaneous lecture translation
Markus Müller | Thai Son Nguyen | Jan Niehues | Eunah Cho | Bastian Krüger | Thanh-Le Ha | Kevin Kilgour | Matthias Sperber | Mohammed Mediani | Sebastian Stüker | Alex Waibel
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

pdf bib
Lightly Supervised Quality Estimation
Matthias Sperber | Graham Neubig | Jan Niehues | Sebastian Stüker | Alex Waibel
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Evaluating the quality of output from language processing systems such as machine translation or speech recognition is an essential step in ensuring that they are sufficient for practical use. However, depending on the practical requirements, evaluation approaches can differ strongly. Often, reference-based evaluation measures (such as BLEU or WER) are appealing because they are cheap and allow rapid quantitative comparison. On the other hand, practitioners often focus on manual evaluation because they must deal with frequently changing domains and quality standards requested by customers, for which reference-based evaluation is insufficient or not possible due to missing in-domain reference data (Harris et al., 2016). In this paper, we attempt to bridge this gap by proposing a framework for lightly supervised quality estimation. We collect manually annotated scores for a small number of segments in a test corpus or document, and combine them with automatically predicted quality scores for the remaining segments to predict an overall quality estimate. An evaluation shows that our framework estimates quality more reliably than using fully automatic quality estimation approaches, while keeping annotation effort low by not requiring full references to be available for the particular domain.

2015

pdf bib
The 2015 KIT IWSLT speech-to-text systems for English and German
Markus Mueller | Tai Son Nguyen | Matthias Sperber | Kevin Kilgour | Sebastian Stuker | Alex Waibel
Proceedings of the 12th International Workshop on Spoken Language Translation: Evaluation Campaign

2014

pdf bib
Segmentation for Efficient Supervised Language Annotation with an Explicit Cost-Utility Tradeoff
Matthias Sperber | Mirjam Simantzik | Graham Neubig | Satoshi Nakamura | Alex Waibel
Transactions of the Association for Computational Linguistics, Volume 2

In this paper, we study the problem of manually correcting automatic annotations of natural language in as efficient a manner as possible. We introduce a method for automatically segmenting a corpus into chunks such that many uncertain labels are grouped into the same chunk, while human supervision can be omitted altogether for other segments. A tradeoff must be found for segment sizes. Choosing short segments allows us to reduce the number of highly confident labels that are supervised by the annotator, which is useful because these labels are often already correct and supervising correct labels is a waste of effort. In contrast, long segments reduce the cognitive effort due to context switches. Our method helps find the segmentation that optimizes supervision efficiency by defining user models to predict the cost and utility of supervising each segment and solving a constrained optimization problem balancing these contradictory objectives. A user study demonstrates noticeable gains over pre-segmented, confidence-ordered baselines on two natural language processing tasks: speech transcription and word segmentation.

pdf bib
The 2014 KIT IWSLT speech-to-text systems for English, German and Italian
Kevin Kilgour | Michael Heck | Markus Müller | Matthias Sperber | Sebastian Stüker | Alex Waibel
Proceedings of the 11th International Workshop on Spoken Language Translation: Evaluation Campaign

This paper describes our German, Italian and English Speech-to-Text (STT) systems for the 2014 IWSLT TED ASR track. Our setup uses ROVER and confusion network combination from various subsystems to achieve a good overall performance. The individual subsystems are built by using different front-ends, (e.g., MVDR-MFCC or lMel), acoustic models (GMM or modular DNN) and phone sets and by training on various subsets of the training data. Decoding is performed in two stages, where the GMM systems are adapted in an unsupervised manner on the combination of the first stage outputs using VTLN, MLLR, and cMLLR. The combination setup produces a final hypothesis that has a significantly lower WER than any of the individual subsystems.

2013

pdf bib
The 2013 KIT IWSLT speech-to-text systems for German and English
Kevin Kilgour | Christian Mohr | Michael Heck | Quoc Bao Nguyen | Van Huy Nguyen | Evgeniy Shin | Igor Tseyzer | Jonas Gehring | Markus Müller | Matthias Sperber | Sebastian Stüker | Alex Waibel
Proceedings of the 10th International Workshop on Spoken Language Translation: Evaluation Campaign

This paper describes our English Speech-to-Text (STT) systems for the 2013 IWSLT TED ASR track. The systems consist of multiple subsystems that are combinations of different front-ends, e.g. MVDR-MFCC based and lMel based ones, GMM and NN acoustic models and different phone sets. The outputs of the subsystems are combined via confusion network combination. Decoding is done in two stages, where the systems of the second stage are adapted in an unsupervised manner on the combination of the first stage outputs using VTLN, MLLR, and cMLLR.

2012

pdf bib
The 2012 KIT and KIT-NAIST English ASR systems for the IWSLT evaluation
Christian Saam | Christian Mohr | Kevin Kilgour | Michael Heck | Matthias Sperber | Keigo Kubo | Sebatian Stüker | Sakriani Sakri | Graham Neubig | Tomoki Toda | Satoshi Nakamura | Alex Waibel
Proceedings of the 9th International Workshop on Spoken Language Translation: Evaluation Campaign

This paper describes our English Speech-to-Text (STT) systems for the 2012 IWSLT TED ASR track evaluation. The systems consist of 10 subsystems that are combinations of different front-ends, e.g. MVDR based and MFCC based ones, and two different phone sets. The outputs of the subsystems are combined via confusion network combination. Decoding is done in two stages, where the systems of the second stage are adapted in an unsupervised manner on the combination of the first stage outputs using VTLN, MLLR, and cM-LLR.

pdf bib
The KIT-NAIST (contrastive) English ASR system for IWSLT 2012
Michael Heck | Keigo Kubo | Matthias Sperber | Sakriani Sakti | Sebastian Stüker | Christian Saam | Kevin Kilgour | Christian Mohr | Graham Neubig | Tomoki Toda | Satoshi Nakamura | Alex Waibel
Proceedings of the 9th International Workshop on Spoken Language Translation: Evaluation Campaign

This paper describes the KIT-NAIST (Contrastive) English speech recognition system for the IWSLT 2012 Evaluation Campaign. In particular, we participated in the ASR track of the IWSLT TED task. The system was developed by Karlsruhe Institute of Technology (KIT) and Nara Institute of Science and Technology (NAIST) teams in collaboration within the interACT project. We employ single system decoding with fully continuous and semi-continuous models, as well as a three-stage, multipass system combination framework built with the Janus Recognition Toolkit. On the IWSLT 2010 test set our single system introduced in this work achieves a WER of 17.6%, and our final combination achieves a WER of 14.4%.