Vicky Zayats


2022

Teaching BERT to Wait: Balancing Accuracy and Latency for Streaming Disfluency Detection
Angelica Chen | Vicky Zayats | Daniel Walker | Dirk Padfield
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

In modern interactive speech-based systems, speech is consumed and transcribed incrementally prior to having disfluencies removed. While this post-processing step is crucial for producing clean transcripts and high performance on downstream tasks (e.g. machine translation), most current state-of-the-art NLP models such as the Transformer operate non-incrementally, potentially causing unacceptable delays for the user. In this work we propose a streaming BERT-based sequence tagging model that, combined with a novel training objective, is capable of detecting disfluencies in real time while balancing accuracy and latency. This is accomplished by training the model to decide whether to immediately output a prediction for the current input or to wait for further context, in essence learning to dynamically size the lookahead window. Our results demonstrate that our model produces comparably accurate predictions and does so sooner than our baselines, with lower flicker. Furthermore, the model attains state-of-the-art latency and stability scores when compared with recent work on incremental disfluency detection.

2021

Residual Adapters for Parameter-Efficient ASR Adaptation to Atypical and Accented Speech
Katrin Tomanek | Vicky Zayats | Dirk Padfield | Kara Vaillancourt | Fadi Biadsy
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Automatic Speech Recognition (ASR) systems are often optimized to work best for speakers with canonical speech patterns. Unfortunately, these systems perform poorly when tested on atypical speech and heavily accented speech. It has previously been shown that personalization through model fine-tuning substantially improves performance. However, maintaining such large models per speaker is costly and difficult to scale. We show that by adding a relatively small number of extra parameters to the encoder layers via so-called residual adapters, we can achieve adaptation gains similar to those of model fine-tuning, while only updating a tiny fraction (less than 0.5%) of the model parameters. We demonstrate this on two speech adaptation tasks (atypical and accented speech) and for two state-of-the-art ASR architectures.
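
To make the idea of a residual adapter concrete, here is a minimal sketch of one common formulation: a small bottleneck (down-projection, nonlinearity, up-projection) whose output is added back to the layer input. This is an illustration only, not the authors' implementation. The class name `ResidualAdapter`, the layer normalization, the ReLU, the zero-initialized up-projection, and the toy dimensions are all assumptions introduced here.

```python
import random


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned scale/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


class ResidualAdapter:
    """Hypothetical bottleneck adapter: x + W_up * relu(W_down * layer_norm(x))."""

    def __init__(self, d_model, d_bottleneck, seed=0):
        rng = random.Random(seed)
        # Down-projection: d_model -> d_bottleneck, small random init (assumption).
        self.w_down = [
            [rng.gauss(0.0, 0.02) for _ in range(d_model)]
            for _ in range(d_bottleneck)
        ]
        # Up-projection: d_bottleneck -> d_model, zero init (assumption), so the
        # adapter starts out as the identity function.
        self.w_up = [[0.0] * d_bottleneck for _ in range(d_model)]

    def __call__(self, x):
        hidden = matvec(self.w_down, layer_norm(x))
        hidden = [max(0.0, h) for h in hidden]  # ReLU
        delta = matvec(self.w_up, hidden)
        # Residual connection: the adapter only adds a correction to its input.
        return [xi + di for xi, di in zip(x, delta)]

    def num_params(self):
        """Count the adapter's trainable weights (both projections)."""
        down = len(self.w_down) * len(self.w_down[0])
        up = len(self.w_up) * len(self.w_up[0])
        return down + up


adapter = ResidualAdapter(d_model=8, d_bottleneck=2)
features = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 3.0, 1.0]
adapted = adapter(features)
```

In a setup like this, only the adapter weights would be trained per speaker while the base model stays frozen, which is why the number of updated parameters scales with the bottleneck size rather than with the full model.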

Representations for Question Answering from Documents with Tables and Text
Vicky Zayats | Kristina Toutanova | Mari Ostendorf
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Tables in web documents are pervasive and can be directly used to answer many of the queries searched on the web, motivating their integration in question answering. Very often information presented in tables is succinct and hard to interpret with standard language representations. On the other hand, tables often appear within textual context, such as an article describing the table. Using the information from an article as additional context can potentially enrich table representations. In this work we aim to improve question answering from tables by refining table representations based on information from surrounding text. We also present an effective method to combine text and table-based predictions for question answering from full documents, obtaining significant improvements on the Natural Questions dataset (Kwiatkowski et al., 2019).

2019

Giving Attention to the Unexpected: Using Prosody Innovations in Disfluency Detection
Vicky Zayats | Mari Ostendorf
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Disfluencies in spontaneous speech are known to be associated with prosodic disruptions. However, most algorithms for disfluency detection use only word transcripts. Integrating prosodic cues has proved difficult because of the many sources of variability affecting the acoustic correlates. This paper introduces a new approach to extracting acoustic-prosodic cues using text-based distributional prediction of acoustic cues to derive vector z-score features (innovations). We explore both early and late fusion techniques for integrating text and prosody, showing gains over a high-accuracy text-only model.
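
As a rough illustration of the z-score "innovation" idea, the sketch below compares observed acoustic cues against a text-based prediction of their mean and standard deviation. It is a simplified reading of the abstract, not the paper's feature pipeline. The function name `innovation_features`, the `eps` guard, and the example cue values are hypothetical.

```python
def innovation_features(observed, predicted_mean, predicted_std, eps=1e-8):
    """Return per-cue z-scores: how far each observed cue is from its
    text-predicted mean, measured in text-predicted standard deviations."""
    if not (len(observed) == len(predicted_mean) == len(predicted_std)):
        raise ValueError("all cue vectors must have the same length")
    return [
        (obs - mean) / max(std, eps)  # eps guards against a zero predicted std
        for obs, mean, std in zip(observed, predicted_mean, predicted_std)
    ]


# Hypothetical cues for one word: [pause duration in s, pitch in Hz].
observed_cues = [1.5, 200.0]
text_predicted_mean = [1.0, 180.0]
text_predicted_std = [0.25, 10.0]

innovations = innovation_features(
    observed_cues, text_predicted_mean, text_predicted_std
)
```

Large absolute values mark cues that are unexpected given the text. A vector of such scores could then be fused with text features, early or late, as the abstract describes.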