Tsu-Yuan Hsu
2024
Unsupervised Multilingual Dense Retrieval via Generative Pseudo Labeling
Chao-Wei Huang | Chen-An Li | Tsu-Yuan Hsu | Chen-Yu Hsu | Yun-Nung Chen
Findings of the Association for Computational Linguistics: EACL 2024
Dense retrieval methods have demonstrated promising performance in multilingual information retrieval, where queries and documents can be in different languages. However, dense retrievers typically require a substantial amount of paired data, which poses even greater challenges in multilingual scenarios. This paper introduces UMR, an Unsupervised Multilingual dense Retriever trained without any paired data. Our approach leverages the sequence likelihood estimation capabilities of multilingual language models to acquire pseudo labels for training dense retrievers. We propose a two-stage framework that iteratively improves the performance of multilingual dense retrievers. Experimental results on two benchmark datasets show that UMR outperforms supervised baselines, showcasing the potential of training multilingual retrievers without paired data, thereby enhancing their practicality. All of our source code, data, and models are available at https://github.com/MiuLab/UMR
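The pseudo-labeling idea can be illustrated with a short sketch: a multilingual seq2seq LM scores each candidate passage by the likelihood it assigns to the query, and the top-scoring passage is kept as a pseudo-positive for retriever training. This is a minimal illustration, not the released UMR code; the mT5 checkpoint and the instruction prompt below are assumptions.

```python
# Minimal sketch of generative pseudo labeling (not the released UMR code):
# a multilingual seq2seq LM scores each candidate passage by the average
# log-likelihood it assigns to the query, and the top-scoring passage is
# kept as a pseudo-positive example for training the dense retriever.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "google/mt5-base"  # assumed checkpoint; the paper's choice may differ
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def pseudo_label(query: str, passages: list[str]) -> int:
    """Return the index of the passage under which the query is most likely."""
    label_ids = tokenizer(query, return_tensors="pt").input_ids
    scores = []
    for passage in passages:
        # Prompt template is an assumption, in the spirit of query-likelihood scoring.
        inputs = tokenizer(
            f"Passage: {passage} Please write a question based on this passage.",
            return_tensors="pt", truncation=True,
        )
        out = model(input_ids=inputs.input_ids, labels=label_ids)
        scores.append(-out.loss.item())  # loss = mean negative log-likelihood of the query
    return max(range(len(passages)), key=lambda i: scores[i])
```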
2023
Visually-Enhanced Phrase Understanding
Tsu-Yuan Hsu | Chen-An Li | Chao-Wei Huang | Yun-Nung Chen
Findings of the Association for Computational Linguistics: ACL 2023
Large-scale vision-language pre-training has exhibited strong performance in various visual and textual understanding tasks. Recently, the textual encoders of multi-modal pre-trained models have been shown to generate high-quality textual representations, which often outperform models that are purely text-based, such as BERT. In this study, our objective is to utilize both textual and visual encoders of multi-modal pre-trained models to enhance language understanding tasks. We achieve this by generating an image associated with a textual prompt, thus enriching the representation of a phrase for downstream tasks. Results from experiments conducted on four benchmark datasets demonstrate that our proposed method, which leverages visually-enhanced text representations, significantly improves performance in the entity clustering task.
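As a rough illustration of the idea (not the authors' pipeline), one can generate an image from a phrase with an off-the-shelf text-to-image model and fuse CLIP text and image embeddings into a single representation; the specific checkpoints, prompt template, and averaging fusion below are assumptions.

```python
# Illustrative sketch only (not the authors' pipeline): generate an image from a
# phrase with an off-the-shelf text-to-image model, then fuse CLIP text and image
# embeddings into a single visually-enhanced phrase representation.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def visually_enhanced_embedding(phrase: str) -> torch.Tensor:
    image = pipe(f"a photo of {phrase}").images[0]  # text -> generated image
    inputs = processor(text=[phrase], images=image, return_tensors="pt", padding=True)
    text_emb = clip.get_text_features(
        input_ids=inputs.input_ids, attention_mask=inputs.attention_mask
    )
    image_emb = clip.get_image_features(pixel_values=inputs.pixel_values)
    # Simple fusion: average the two L2-normalized embeddings.
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    return (text_emb + image_emb) / 2
```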
CONVERSER: Few-shot Conversational Dense Retrieval with Synthetic Data Generation
Chao-Wei Huang | Chen-Yu Hsu | Tsu-Yuan Hsu | Chen-An Li | Yun-Nung Chen
Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Conversational search provides a natural interface for information retrieval (IR). Recent approaches have demonstrated promising results in applying dense retrieval to conversational IR. However, training dense retrievers requires large amounts of in-domain paired data. This hinders the development of conversational dense retrievers, as abundant in-domain conversations are expensive to collect. In this paper, we propose CONVERSER, a framework for training conversational dense retrievers with at most 6 examples of in-domain dialogues. Specifically, we utilize the in-context learning capability of large language models to generate conversational queries given a passage in the retrieval corpus. Experimental results on conversational retrieval benchmarks OR-QuAC and TREC CAsT 19 show that the proposed CONVERSER achieves comparable performance to fully-supervised models, demonstrating the effectiveness of our proposed framework in few-shot conversational dense retrieval. All source code and generated datasets are available at https://github.com/MiuLab/CONVERSER
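The query-generation step can be sketched as follows: a handful of in-domain dialogue/passage examples are placed in a prompt, and an LLM is asked to write the next user question for a new passage; the resulting (query, passage) pair serves as synthetic training data for the retriever. This is a hedged sketch assuming an OpenAI-style chat API; the actual prompt format and generator model in CONVERSER may differ.

```python
# Hedged sketch of few-shot query generation (assumes an OpenAI-style chat API;
# the actual prompt format and generator model used in CONVERSER may differ):
# a handful of in-domain examples are placed in the prompt, and the LLM writes a
# conversational query for a new passage, yielding a synthetic (query, passage) pair.
from openai import OpenAI

client = OpenAI()

def generate_conversational_query(passage: str, few_shot_examples: list[dict]) -> str:
    """few_shot_examples: up to six dicts with 'history', 'passage', and 'query' keys."""
    demos = "\n\n".join(
        f"Conversation history:\n{ex['history']}\n"
        f"Passage:\n{ex['passage']}\n"
        f"Next user question: {ex['query']}"
        for ex in few_shot_examples
    )
    prompt = f"{demos}\n\nPassage:\n{passage}\nNext user question:"
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed generator; swap in any capable LLM
        messages=[{"role": "user", "content": prompt}],
        temperature=0.8,
    )
    return response.choices[0].message.content.strip()
```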