Employing low-pass filtered temporal speech features for the training of ideal ratio mask in speech enhancement

Yan-Tong Chen, Zi-Qiang Lin, Jeih-Weih Hung


Abstract
The masking-based speech enhancement method seeks a multiplicative mask that is applied to the spectrogram of the input noise-corrupted utterance, and a deep neural network (DNN) is often used to learn the mask. In particular, the features commonly used for automatic speech recognition can serve as the input of the DNN to learn a well-behaved mask that significantly reduces the noise distortion of the processed utterances. This study proposes to preprocess the input speech features for the ideal ratio mask (IRM)-based DNN by low-pass filtering in order to attenuate the noise components. Specifically, we employ the discrete wavelet transform (DWT) to decompose the temporal speech feature sequence and scale down the detail coefficients, which correspond to the high-pass portion of the sequence. Preliminary experiments conducted on a subset of the TIMIT corpus reveal that the proposed method enables the resulting IRM to achieve higher speech quality and intelligibility for babble noise-corrupted signals than the original IRM, indicating that the low-pass filtered temporal feature sequence can be used to train a superior IRM network for speech enhancement.
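The DWT-based low-pass step described in the abstract can be illustrated with a minimal sketch. The abstract does not state the wavelet, the number of decomposition levels, or the scaling factor, so the one-level Haar wavelet and the factor `alpha` below are assumptions made only for illustration, and the function names are hypothetical rather than taken from the paper.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_dwt(x):
    """One-level Haar DWT of an even-length sequence.

    Returns (approx, detail): the low-pass and high-pass coefficients.
    The Haar wavelet is an assumption; the abstract names no wavelet.
    """
    if len(x) % 2 != 0:
        raise ValueError("sequence length must be even")
    approx = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt: rebuild the time-domain sequence."""
    x = []
    for a, d in zip(approx, detail):
        x.append((a + d) / SQRT2)
        x.append((a - d) / SQRT2)
    return x


def lowpass_feature_sequence(x, alpha=0.5):
    """Scale the detail (high-pass) coefficients by alpha and reconstruct.

    alpha is a hypothetical parameter: 1.0 leaves the sequence unchanged,
    0.0 removes the detail band entirely.
    """
    approx, detail = haar_dwt(x)
    detail = [alpha * d for d in detail]
    return haar_idwt(approx, detail)


# Temporal trajectory of one feature dimension across four frames.
trajectory = [1.0, 3.0, 2.0, 6.0]
smoothed = lowpass_feature_sequence(trajectory, alpha=0.5)
```

In this sketch the function would be applied independently to each feature dimension along the frame (time) axis before the features are passed to the IRM network. With `alpha=0.5` the differences within each frame pair are halved while the pair means are preserved.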
Anthology ID:
2021.rocling-1.30
Volume:
Proceedings of the 33rd Conference on Computational Linguistics and Speech Processing (ROCLING 2021)
Month:
October
Year:
2021
Address:
Taoyuan, Taiwan
Editors:
Lung-Hao Lee, Chia-Hui Chang, Kuan-Yu Chen
Venue:
ROCLING
Publisher:
The Association for Computational Linguistics and Chinese Language Processing (ACLCLP)
Pages:
236–242
URL:
https://aclanthology.org/2021.rocling-1.30
Cite (ACL):
Yan-Tong Chen, Zi-Qiang Lin, and Jeih-Weih Hung. 2021. Employing low-pass filtered temporal speech features for the training of ideal ratio mask in speech enhancement. In Proceedings of the 33rd Conference on Computational Linguistics and Speech Processing (ROCLING 2021), pages 236–242, Taoyuan, Taiwan. The Association for Computational Linguistics and Chinese Language Processing (ACLCLP).
Cite (Informal):
Employing low-pass filtered temporal speech features for the training of ideal ratio mask in speech enhancement (Chen et al., ROCLING 2021)
PDF:
https://aclanthology.org/2021.rocling-1.30.pdf