Philip N. Garner


2022

Conversational speech represents one of the most complex automatic speech recognition (ASR) tasks owing to the high inter-speaker variation in both pronunciation and conversational dynamics. Such complexity is particularly challenging in low-resourced (LR) scenarios. Recent developments in self-supervision have allowed such scenarios to take advantage of large amounts of otherwise unrelated data. In this study, we characterise an LR Austrian German conversational task. We begin with a non-pre-trained baseline and show that fine-tuning of a model pre-trained using self-supervision leads to improvements consistent with those in the literature; this extends to cases where a lexicon and language model are included. We also show that the advantage of pre-training indeed arises from the larger database rather than the self-supervision. Further, by use of a leave-one-conversation-out technique, we demonstrate that robustness problems remain with respect to inter-speaker and inter-conversation variation. This serves to guide where future research might best be focused in light of the current state of the art.
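The leave-one-conversation-out evaluation mentioned above can be sketched in plain Python. This is an illustrative outline only, not the authors' implementation; the `train_fn` and `score_fn` callables (e.g. a fine-tuning routine and a WER scorer) are hypothetical placeholders:

```python
# Sketch of leave-one-conversation-out (LOCO) evaluation.
# Each conversation is held out in turn and scored with a model
# trained (or adapted) on the remaining ones; the spread of
# per-conversation scores then exposes inter-speaker and
# inter-conversation robustness problems.

def leave_one_conversation_out(conversations, train_fn, score_fn):
    """conversations: dict mapping conversation id -> its data.
    train_fn(train_data) -> model; score_fn(model, held_out) -> e.g. WER.
    Returns a dict mapping conversation id -> its held-out score."""
    results = {}
    for held_out_id, held_out in conversations.items():
        train_data = [data for cid, data in conversations.items()
                      if cid != held_out_id]
        model = train_fn(train_data)
        results[held_out_id] = score_fn(model, held_out)
    return results
```

With real ASR components plugged in, inspecting the per-conversation scores (rather than a single pooled figure) is what reveals the robustness gaps the abstract refers to.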

2017

We present the first prototype of the SUMMA Platform: an integrated platform for multilingual media monitoring. The platform contains a rich suite of low-level and high-level natural language processing technologies: automatic speech recognition of broadcast media, machine translation, automated tagging and classification of named entities, semantic parsing to detect relationships between entities, and automatic construction / augmentation of factual knowledge bases. Implemented on the Docker platform, it can easily be deployed, customised, and scaled to large volumes of incoming media streams.

2016

The multi-level adaptive networks (MLAN) technique is a cross-lingual adaptation framework in which a bottleneck (BN) layer in a deep neural network (DNN) trained on a source language produces BN features that are exploited in a second DNN in a target language. We investigate how the correlation (in the sense of phonetic similarity) between the source and target languages and the amount of source-language data affect the effectiveness of the MLAN schemes. We experiment with three scenarios: i) French, a source language uncorrelated with the target; ii) Ukrainian, a source language correlated with the target; and iii) English, a source language uncorrelated with the target but with a relatively large amount of data compared with the other two scenarios. In all cases Russian is the target language. GLOBALPHONE data is used, except for English, where a mixture of LIBRISPEECH, TEDLIUM and AMIDA is available. The results show that both factors are important for the MLAN schemes. Specifically, when only a modest amount of source-language data is used, the correlation between the source and target languages is very important; conversely, that correlation appears less important when a relatively large amount of source-language data is available. The best word error rate (WER) was achieved with English as the source language in the multi-task MLAN scheme, a relative improvement of 9.4% over the baseline DNN model.
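The MLAN feature flow described above can be illustrated with a minimal NumPy sketch. This is not the paper's setup: the layer sizes, the tanh non-linearity, and the random (untrained) weights are purely illustrative stand-ins; a real system would train both networks on actual acoustic features:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, weights):
    """Forward pass through a stack of tanh layers; returns every
    hidden activation so that any layer can serve as a feature tap."""
    acts, h = [], x
    for W, b in weights:
        h = np.tanh(h @ W + b)
        acts.append(h)
    return acts

def make_weights(sizes):
    """Small random weights for consecutive layer sizes (untrained)."""
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

# Source-language DNN with a narrow bottleneck (BN) layer:
# 39-dim acoustic input -> 256 -> 40 (BN) -> 256 -> 120 outputs.
src_weights = make_weights([39, 256, 40, 256, 120])
BN_LAYER = 1  # index of the 40-unit bottleneck layer

def bn_features(x):
    """Tap the source-language network at its bottleneck layer."""
    return mlp_forward(x, src_weights)[BN_LAYER]

# The target-language DNN consumes the raw acoustics concatenated
# with the cross-lingual BN features (39 + 40 = 79 inputs).
tgt_weights = make_weights([79, 256, 100])

def target_forward(x):
    augmented = np.concatenate([x, bn_features(x)], axis=-1)
    return mlp_forward(augmented, tgt_weights)[-1]

frames = rng.standard_normal((10, 39))  # 10 toy acoustic frames
out = target_forward(frames)            # shape (10, 100)
```

The key design point is that the source network is only ever run up to its bottleneck when serving the target language, so the BN features act as a compact, language-transferable representation.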

2010