Gakuto Kurata


2024

pdf bib
Robust ASR Error Correction with Conservative Data Filtering
Takuma Udagawa | Masayuki Suzuki | Masayasu Muraoka | Gakuto Kurata
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track

Error correction (EC) based on large language models is an emerging technology to enhance the performance of automatic speech recognition (ASR) systems.Generally, training data for EC are collected by automatically pairing a large set of ASR hypotheses (as sources) and their gold references (as targets).However, the quality of such pairs is not guaranteed, and we observed various types of noise which can make the EC models brittle, e.g. inducing overcorrection in out-of-domain (OOD) settings.In this work, we propose two fundamental criteria that EC training data should satisfy: namely, EC targets should (1) improve linguistic acceptability over sources and (2) be inferable from the available context (e.g. source phonemes).Through these criteria, we identify low-quality EC pairs and train the models not to make any correction in such cases, the process we refer to as conservative data filtering.In our experiments, we focus on Japanese ASR using a strong Conformer-CTC as the baseline and finetune Japanese LLMs for EC.Through our evaluation on a suite of 21 internal benchmarks, we demonstrate that our approach can significantly reduce overcorrection and improve both the accuracy and quality of ASR results in the challenging OOD settings.

2023

pdf bib
Speech-enriched Memory for Inference-time Adaptation of ASR Models to Word Dictionaries
Ashish Mittal | Sunita Sarawagi | Preethi Jyothi | George Saon | Gakuto Kurata
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Despite the impressive performance of ASR models on mainstream benchmarks, their performance on rare words is unsatisfactory. In enterprise settings, often a focused list of entities (such as locations, names, etc) are available which can be used to adapt the model to the terminology of specific domains. In this paper, we present a novel inference algorithm that improves the prediction of state-of-the-art ASR models using nearest-neighbor-based matching on an inference-time word list. We consider both the Transducer architecture that is useful in the streaming setting, and state-of-the-art encoder-decoder models such as Whisper. In our approach, a list of rare entities is indexed in a memory by synthesizing speech for each entry, and then storing the internal acoustic and language model states obtained from the best possible alignment on the ASR model. The memory is organized as a trie which we harness to perform a stateful lookup during inference. A key property of our extension is that we prevent spurious matches by restricting to only word-level matches. In our experiments on publicly available datasets and private benchmarks, we show that our method is effective in significantly improving rare word recognition.

2016

pdf bib
Leveraging Sentence-level Information with Encoder LSTM for Semantic Slot Filling
Gakuto Kurata | Bing Xiang | Bowen Zhou | Mo Yu
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib
Improved Neural Network-based Multi-label Classification with Better Initialization Leveraging Label Co-occurrence
Gakuto Kurata | Bing Xiang | Bowen Zhou
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2006

pdf bib
Phoneme-to-Text Transcription System with an Infinite Vocabulary
Shinsuke Mori | Daisuke Takuma | Gakuto Kurata
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics