Finding sufficient in-domain text data for language modeling is a recurrent challenge. Some methods have already been proposed for selecting parts of out-of-domain text data most closely resembling the in-domain data using a small amount of the latter. Including this new “near-domain” data in training can potentially lead to better language model performance, while reducing training resources relative to incorporating all data. One popular, state-of-the-art selection process based on cross-entropy scores makes use of in-domain and out-ofdomain language models. In order to compensate for the limited availability of the in-domain data required for this method, we introduce enhancements to two of its steps. Firstly, we improve the procedure for drawing the outof-domain sample data used for selection. Secondly, we use word-associations in order to extend the underlying vocabulary of the sample language models used for scoring. These enhancements are applied to selecting text for language modeling of talks given in a technical subject area. Besides comparing perplexity, we judge the resulting language models by their performance in automatic speech recognition and machine translation tasks. We evaluate our method in different contexts. We show that it yields consistent improvements, up to 2% absolute reduction in word error rate and 0.3 Bleu points. We achieve these improvements even given a much smaller in-domain set.
The 2013 KIT Quaero speech-to-text system for French
Joshua Winebarger | Bao Nguyen | Jonas Gehring | Sebastian Stüker | Alex Waibel
Proceedings of the 10th International Workshop on Spoken Language Translation: Papers
This paper describes our Speech-to-Text (STT) system for French, which was developed as part of our efforts in the Quaero program for the 2013 evaluation. Our STT system consists of six subsystems which were created by combining multiple complementary sources of pronunciation modeling including graphemes with various feature front-ends based on deep neural networks and tonal features. Both speaker-independent and speaker adaptively trained versions of the systems were built. The resulting systems were then combined via confusion network combination and crossadaptation. Through progressive advances and system combination we reach a word error rate (WER) of 16.5% on the 2012 Quaero evaluation data.