Towards Noise-Tolerant Speech-Referring Video Object Segmentation: Bridging Speech and Text
Xiang Li | Jinglu Wang | Xiaohao Xu | Muqiao Yang | Fan Yang | Yizhou Zhao | Rita Singh | Bhiksha Raj
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Linguistic communication is prevalent in Human-Computer Interaction (HCI). Speech (spoken language) is a convenient yet potentially ambiguous input form due to noise and accents, leaving a gap relative to text. In this study, we investigate a prominent HCI task, Referring Video Object Segmentation (R-VOS), which aims to segment and track objects using linguistic references. While text input is well investigated, speech input remains under-explored. Our objective is to bridge the gap between speech and text, enabling existing text-input R-VOS models to accommodate noisy speech input effectively. Specifically, we propose a method that aligns the semantic spaces of speech and text through two key modules: 1) Noise-Aware Semantic Adjustment (NSA), which extracts clear semantics from noisy speech; and 2) Semantic Jitter Suppression (SJS), which enables R-VOS models to tolerate noisy queries. Comprehensive experiments on the challenging AVOS benchmarks show that our method outperforms state-of-the-art approaches.
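To make the pipeline described in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of the speech-to-text alignment idea. The module names NSA and SJS come from the paper, but all internals here (the residual denoising network, the Gaussian query jitter, the MSE alignment objective, and the embedding dimension) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of aligning noisy speech embeddings to the text
# semantic space before feeding them to a text-input R-VOS model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoiseAwareSemanticAdjustment(nn.Module):
    """NSA (assumed form): clean a noisy speech embedding with a learned
    residual correction toward the text semantic space."""
    def __init__(self, dim: int):
        super().__init__()
        self.correct = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, speech_emb: torch.Tensor) -> torch.Tensor:
        # Residual adjustment: preserve the core semantics, learn to
        # cancel the noise component.
        return self.norm(speech_emb + self.correct(speech_emb))

class SemanticJitterSuppression(nn.Module):
    """SJS (assumed form): perturb the query embedding during training so
    the downstream R-VOS model learns to tolerate noisy queries."""
    def __init__(self, sigma: float = 0.1):
        super().__init__()
        self.sigma = sigma

    def forward(self, query_emb: torch.Tensor) -> torch.Tensor:
        if self.training:
            query_emb = query_emb + self.sigma * torch.randn_like(query_emb)
        return query_emb

# Usage sketch: adjust a noisy speech embedding, jitter it for robustness
# training, and align it to the paired text embedding.
dim = 256                                  # assumed embedding size
nsa = NoiseAwareSemanticAdjustment(dim)
sjs = SemanticJitterSuppression()
speech_emb = torch.randn(1, dim)           # stand-in for a speech encoder output
text_emb = torch.randn(1, dim)             # stand-in for the paired text embedding
query = sjs(nsa(speech_emb))               # query passed to the R-VOS model
align_loss = F.mse_loss(query, text_emb)   # assumed speech-text alignment objective
```

The key design point the sketch illustrates is that the text-input R-VOS model itself can stay unchanged: only the query embedding is adjusted (NSA) and robustified (SJS) before it is consumed.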