Rita Singh


2024

pdf bib
R-BASS : Relevance-aided Block-wise Adaptation for Speech Summarization
Roshan Sharma | Ruchira Sharma | Hira Dhamyal | Rita Singh | Bhiksha Raj
Findings of the Association for Computational Linguistics: NAACL 2024

End-to-end speech summarization on long recordings is challenging because of the high computational cost. Block-wise Adaptation for Speech Summarization (BASS) summarizes arbitrarily long sequences by sequentially processing abutting chunks of audio. Despite the benefits of BASS, it has higher compute time due to sequential processing of all blocks, regardless of whether they are relevant to the final summary. In this paper, we propose R-BASS, a new relevance-aware block-wise adaptation method. First, we introduce two approaches to automatically estimate block relevance based on lexical and semantic similarity between the block-level transcript and the summary. Experiments on the How2 dataset show that using ground truth relevance during inference improves efficiency by 63.9 % by dropping irrelevant blocks. Finally, we incorporate relevance scores into training using a novel relevance loss and relevance predictor, and the proposed R-BASS model makes it possible to drop 86.3 % of the blocks while retaining comparable performance, resulting in a 2.2x speedup over BASS.

2023

pdf bib
Towards Noise-Tolerant Speech-Referring Video Object Segmentation: Bridging Speech and Text
Xiang Li | Jinglu Wang | Xiaohao Xu | Muqiao Yang | Fan Yang | Yizhou Zhao | Rita Singh | Bhiksha Raj
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Linguistic communication is prevalent in Human-Computer Interaction (HCI). Speech (spoken language) serves as a convenient yet potentially ambiguous form due to noise and accents, exposing a gap compared to text. In this study, we investigate the prominent HCI task, Referring Video Object Segmentation (R-VOS), which aims to segment and track objects using linguistic references. While text input is well-investigated, speech input is under-explored. Our objective is to bridge the gap between speech and text, enabling the adaptation of existing text-input R-VOS models to accommodate noisy speech input effectively. Specifically, we propose a method to align the semantic spaces between speech and text by incorporating two key modules: 1) Noise-Aware Semantic Adjustment (NSA) for clear semantics extraction from noisy speech; and 2) Semantic Jitter Suppression (SJS) enabling R-VOS models to tolerate noisy queries. Comprehensive experiments conducted on the challenging AVOS benchmarks reveal that our proposed method outperforms state-of-the-art approaches.

pdf bib
Token Prediction as Implicit Classification to Identify LLM-Generated Text
Yutian Chen | Hao Kang | Vivian Zhai | Liangze Li | Rita Singh | Bhiksha Raj
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

This paper introduces a novel approach for identifying the possible large language models (LLMs) involved in text generation. Instead of adding an additional classification layer to a base LM, we reframe the classification task as a next-token prediction task and directly fine-tune the base LM to perform it. We utilize the Text-to-Text Transfer Transformer (T5) model as the backbone for our experiments. We compared our approach to the more direct approach of utilizing hidden states for classification. Evaluation shows the exceptional performance of our method in the text classification task, highlighting its simplicity and efficiency. Furthermore, interpretability studies on the features extracted by our model reveal its ability to differentiate distinctive writing styles among various LLMs even in the absence of an explicit classifier. We also collected a dataset named OpenLLMText, containing approximately 340k text samples from human and LLMs, including GPT3.5, PaLM, LLaMA, and GPT2.