William M. Campbell
2024
Unsupervised Text Representation Learning via Instruction-Tuning for Zero-Shot Dense Retrieval
Qiuhai Zeng | Zimeng Qiu | Dae Yon Hwang | Xin He | William M. Campbell
Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024)
Dense retrieval systems are commonly used for information retrieval (IR). They rely on learning text representations through an encoder and usually require supervised modeling via labelled data, which can be costly to obtain or simply unavailable. In this study, we introduce a novel unsupervised text representation learning technique via instruction-tuning a pre-trained encoder-decoder large language model (LLM) under the dual-encoder retrieval framework. We demonstrate across multiple languages that, following the Rao-Blackwell theorem, the corpus representation can be augmented with the representations of relevant synthetic queries generated by the instruction-tuned LLM. Furthermore, we effectively align the query and corpus text representations with self-instruct tuning. We evaluate the proposed method under low-resource settings on three English, two German, and one Portuguese retrieval datasets, measuring NDCG@10, MRR@100, and Recall@100. We significantly improve the average zero-shot retrieval performance on all metrics, raising out-of-the-box FLAN-T5 model variants by 4.73% to 6.15% in absolute NDCG@10 and exceeding four supervised dense retrievers.
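The query-augmentation step lends itself to a short sketch. The following is a minimal illustration, not the authors' released code: it embeds a passage with a FLAN-T5 encoder, samples a few synthetic queries from the same model, and averages the embeddings, which is the Monte-Carlo estimate of a conditional expectation that the Rao-Blackwell theorem motivates. The checkpoint size, prompt wording, mean pooling, and the function names here are all assumptions.

```python
# Hedged sketch of Rao-Blackwellized corpus representations under a
# dual-encoder view; not the paper's actual implementation.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
model.eval()

def embed(text: str) -> torch.Tensor:
    """Mean-pool the encoder's last hidden states into a single vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        enc = model.get_encoder()(**inputs)
    return enc.last_hidden_state.mean(dim=1).squeeze(0)

def synthetic_queries(passage: str, n: int = 4) -> list[str]:
    """Sample n plausible queries; the prompt wording is an assumption."""
    prompt = f"Write a search query that this passage answers: {passage}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outs = model.generate(**inputs, do_sample=True, top_p=0.95,
                              num_return_sequences=n, max_new_tokens=32)
    return tokenizer.batch_decode(outs, skip_special_tokens=True)

def augmented_doc_embedding(passage: str, n: int = 4) -> torch.Tensor:
    """Average the passage embedding with its synthetic-query embeddings,
    i.e. a Monte-Carlo estimate of E[embed(query) | passage]."""
    vecs = [embed(passage)] + [embed(q) for q in synthetic_queries(passage, n)]
    return torch.stack(vecs).mean(dim=0)
```

Averaging over several sampled queries rather than relying on any single one is the variance-reduction idea the abstract attributes to Rao-Blackwellization.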
2020
Proceedings of the 2nd Workshop on Life-long Learning for Spoken Language Systems
William M. Campbell | Alex Waibel | Dilek Hakkani-Tur | Timothy J. Hazen | Kevin Kilgour | Eunah Cho | Varun Kumar | Hadrien Glaude
Proceedings of the 2nd Workshop on Life-long Learning for Spoken Language Systems
2019
Paraphrase Generation for Semi-Supervised Learning in NLU
Eunah Cho | He Xie | William M. Campbell
Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation
Semi-supervised learning is an efficient way to improve performance in natural language processing systems. In this work, we propose Para-SSL, a scheme that generates candidate utterances using paraphrasing and methods from semi-supervised learning. To perform paraphrase generation in the context of a dialog system, we automatically extract paraphrase pairs to create a paraphrase corpus. Using this data, we build a paraphrase generation system and perform one-to-many generation, followed by a validation step that keeps only the utterances of good quality. The paraphrase-based semi-supervised learning is applied to five functionalities in a natural language understanding (NLU) system. Our proposed method does not require user utterances and can therefore be applied before a new functionality is released. Experiments show up to 19% relative slot error reduction without access to user utterances, and up to 35% when leveraging live-traffic utterances.
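A rough sketch of the one-to-many generation plus validation loop described above follows. The paper's paraphraser and quality validator are not public, so FLAN-T5 prompted for paraphrases and a lexical band-pass filter stand in for them; the checkpoint, prompt, and Jaccard thresholds are all assumptions.

```python
# Hedged sketch of one-to-many paraphrase generation with a validation
# filter; stand-in models, not the paper's actual Para-SSL components.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
model.eval()

def generate_paraphrases(utterance: str, k: int = 10) -> list[str]:
    """One-to-many generation via nucleus sampling."""
    inputs = tokenizer(f"Paraphrase: {utterance}", return_tensors="pt")
    with torch.no_grad():
        outs = model.generate(**inputs, do_sample=True, top_p=0.9,
                              num_return_sequences=k, max_new_tokens=32)
    return tokenizer.batch_decode(outs, skip_special_tokens=True)

def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two utterances."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def validate(source: str, candidates: list[str],
             lo: float = 0.3, hi: float = 0.9) -> list[str]:
    """Keep candidates that are neither near-duplicates (too similar)
    nor likely semantic drift (too dissimilar); a heuristic stand-in
    for the paper's quality validation step."""
    return [c for c in candidates if lo <= jaccard(source, c) < hi]

# Usage: surviving paraphrases inherit the source utterance's NLU labels
# and are added to the training set, so no live user traffic is needed.
kept = validate("play some jazz music", generate_paraphrases("play some jazz music"))
```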