2024
pdf
bib
abs
ChatRetriever: Adapting Large Language Models for Generalized and Robust Conversational Dense Retrieval
Kelong Mao
|
Chenlong Deng
|
Haonan Chen
|
Fengran Mo
|
Zheng Liu
|
Tetsuya Sakai
|
Zhicheng Dou
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Conversational search requires accurate interpretation of user intent from complex multi-turn contexts. This paper presents ChatRetriever, which inherits the strong generalization capability of large language models to robustly represent complex conversational sessions for dense retrieval. To achieve this, we propose a simple and effective dual-learning approach that adapts LLM for retrieval via contrastive learning while enhancing the complex session understanding through masked instruction tuning on high-quality conversational instruction tuning data. Extensive experiments on five conversational search benchmarks demonstrate that ChatRetriever significantly outperforms existing conversational dense retrievers, achieving state-of-the-art performance on par with LLM-based rewriting approaches. Furthermore, ChatRetriever exhibits superior robustness in handling diverse conversational contexts. Our work highlights the potential of adapting LLMs for retrieval with complex inputs like conversational search sessions and proposes an effective approach to advance this research direction.
pdf
bib
abs
CHIQ: Contextual History Enhancement for Improving Query Rewriting in Conversational Search
Fengran Mo
|
Abbas Ghaddar
|
Kelong Mao
|
Mehdi Rezagholizadeh
|
Boxing Chen
|
Qun Liu
|
Jian-Yun Nie
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
In this paper, we study how open-source large language models (LLMs) can be effectively deployed for improving query rewriting in conversational search, especially for ambiguous queries. We introduce CHIQ, a two-step method that leverages the capabilities of LLMs to resolve ambiguities in the conversation history before query rewriting. This approach contrasts with prior studies that predominantly use closed-source LLMs to directly generate search queries from conversation history. We demonstrate on five well-established benchmarks that CHIQ leads to state-of-the-art results across most settings, showing highly competitive performances with systems leveraging closed-source LLMs. Our study provides a first step towards leveraging open-source LLMs in conversational search, as a competitive alternative to the prevailing reliance on commercial LLMs. Data, models, and source code will be publicly available upon acceptance at https://github.com/fengranMark/CHIQ.
pdf
bib
abs
A User-Centric Multi-Intent Benchmark for Evaluating Large Language Models
Jiayin Wang
|
Fengran Mo
|
Weizhi Ma
|
Peijie Sun
|
Min Zhang
|
Jian-Yun Nie
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) are essential tools that users employ across various scenarios, so evaluating their performance and guiding users in selecting the suitable service is important. Although many benchmarks exist, they mainly focus on specific predefined model abilities, such as world knowledge, reasoning, etc. Based on these ability scores, it is hard for users to determine which LLM best suits their particular needs. To address these issues, we propose to evaluate LLMs from a user-centric perspective and design this benchmark to measure their efficacy in satisfying user needs under distinct intents. Firstly, we collect 1,846 real-world use cases from a user study with 712 participants from 23 countries. This first-hand data helps us understand actual user intents and needs in LLM interactions, forming the User Reported Scenarios (URS) dataset, which is categorized with six types of user intents. Secondly, based on this authentic dataset, we benchmark 10 LLM services with GPT-4-as-Judge. Thirdly, we show that benchmark scores align well with human preference in both real-world experience and pair-wise annotations, achieving Pearson correlations of 0.95 and 0.94, respectively. This alignment confirms that the URS dataset and our evaluation method establish an effective user-centric benchmark. The dataset, code, and process data are publicly available at https://github.com/Alice1998/URS.
pdf
bib
abs
History-Aware Conversational Dense Retrieval
Fengran Mo
|
Chen Qu
|
Kelong Mao
|
Tianyu Zhu
|
Zhan Su
|
Kaiyu Huang
|
Jian-Yun Nie
Findings of the Association for Computational Linguistics: ACL 2024
Conversational search facilitates complex information retrieval by enabling multi-turn interactions between users and the system. Supporting such interactions requires a comprehensive understanding of the conversational inputs to formulate a good search query based on historical information. In particular, the search query should include the relevant information from the previous conversation turns.However, current approaches for conversational dense retrieval primarily rely on fine-tuning a pre-trained ad-hoc retriever using the whole conversational search session, which can be lengthy and noisy. Moreover, existing approaches are limited by the amount of manual supervision signals in the existing datasets.To address the aforementioned issues, we propose a **H**istory-**A**ware **Conv**ersational **D**ense **R**etrieval (HAConvDR) system, which incorporates two ideas: context-denoised query reformulation and automatic mining of supervision signals based on the actual impact of historical turns.Experiments on two public conversational search datasets demonstrate the improved history modeling capability of HAConvDR, in particular for long conversations with topic shifts.
pdf
bib
abs
RAG-Studio: Towards In-Domain Adaptation of Retrieval Augmented Generation Through Self-Alignment
Kelong Mao
|
Zheng Liu
|
Hongjin Qian
|
Fengran Mo
|
Chenlong Deng
|
Zhicheng Dou
Findings of the Association for Computational Linguistics: EMNLP 2024
Retrieval-Augmented Generation (RAG) has proven to be an effective paradigm for enhancing the quality of text generation by integrating large language models (LLMs) with external knowledge. However, an off-the-shelf RAG system, which relies on generally pre-trained LLMs and retrievers, often falls short in specialized domains and applications. In this paper, we introduce RAG-Studio, an efficient self-aligned training framework to adapt general RAG models to specific domains solely through synthetic data, eliminating the need for expensive human-labeled in-domain data. RAG-Studio accepts a specialized domain corpus, a general LLM, and a general retriever, then autonomously generates contrastive training data for both the LLM and retriever through self-alignment. We fine-tune them to work cohesively as an integrated and effective domain-specific RAG system, where the LLM is adapted to incorporate new domain knowledge and become robust to noisy contexts, and the retriever learns to better align with the LLM’s preferences, providing more useful information and minimizing the risk of misleading the LLM. Extensive experiments across diverse in-domain question-answering datasets spanning the biomedical, finance, law, and computing domains, show that RAG-Studio attains state-of-the-art performance, consistently outperforming the use of human-annotated data for fine-tuning.
pdf
bib
abs
DoRA: Enhancing Parameter-Efficient Fine-Tuning with Dynamic Rank Distribution
Yulong Mao
|
Kaiyu Huang
|
Changhao Guan
|
Ganglin Bao
|
Fengran Mo
|
Jinan Xu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Fine-tuning large-scale pre-trained models is inherently a resource-intensive task. While it can enhance the capabilities of the model, it also incurs substantial computational costs, posing challenges to the practical application of downstream tasks. Existing parameter-efficient fine-tuning (PEFT) methods such as Low-Rank Adaptation (LoRA) rely on a bypass framework that ignores the differential parameter budget requirements across weight matrices, which may lead to suboptimal fine-tuning outcomes. To address this issue, we introduce the Dynamic Low-Rank Adaptation (DoRA) method. DoRA decomposes high-rank LoRA layers into structured single-rank components, allowing for dynamic pruning of parameter budget based on their importance to specific tasks during training, which makes the most of the limited parameter budget. Experimental results demonstrate that DoRA can achieve competitive performance compared with LoRA and full model fine-tuning, and outperform various strong baselines with the same storage parameter budget. Our code is available at [github](https://github.com/MIkumikumi0116/DoRA)
2023
pdf
bib
abs
ConvGQR: Generative Query Reformulation for Conversational Search
Fengran Mo
|
Kelong Mao
|
Yutao Zhu
|
Yihong Wu
|
Kaiyu Huang
|
Jian-Yun Nie
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
In conversational search, the user’s real search intent for the current conversation turn is dependent on the previous conversation history. It is challenging to determine a good search query from the whole conversation context. To avoid the expensive re-training of the query encoder, most existing methods try to learn a rewriting model to de-contextualize the current query by mimicking the manual query rewriting. However, manually rewritten queries are not always the best search queries. Thus, training a rewriting model on them would lead to sub-optimal queries. Another useful information to enhance the search query is the potential answer to the question. In this paper, we propose ConvGQR, a new framework to reformulate conversational queries based on generative pre-trained language models (PLMs), one for query rewriting and another for generating potential answers. By combining both, ConvGQR can produce better search queries. In addition, to relate query reformulation to the retrieval task, we propose a knowledge infusion mechanism to optimize both query reformulation and retrieval. Extensive experiments on four conversational search datasets demonstrate the effectiveness of ConvGQR.
pdf
bib
abs
Search-Oriented Conversational Query Editing
Kelong Mao
|
Zhicheng Dou
|
Bang Liu
|
Hongjin Qian
|
Fengran Mo
|
Xiangli Wu
|
Xiaohua Cheng
|
Zhao Cao
Findings of the Association for Computational Linguistics: ACL 2023
Conversational query rewriting (CQR) realizes conversational search by reformulating the search dialogue into a standalone rewrite. However, existing CQR models either are not learned toward improving the downstream search performance or inefficiently generate the rewrite token-by-token from scratch while neglecting the fact that the search dialogue often has a large overlap with the rewrite. In this paper, we propose EdiRCS, a new text editing-based CQR model tailored for conversational search. In EdiRCS, most of the rewrite tokens are selected from the dialogue in a non-autoregressive fashion and only a few new tokens are generated to supplement the final rewrite, which makes EdiRCS highly efficient. In particular, the learning of EdiRCS is augmented with two search-oriented objectives, including contrastive ranking augmentation and contextualization knowledge transfer, which effectively improve it to select and generate more useful tokens from the view of retrieval. We show that EdiRCS outperforms state-of-the-art CQR models on three conversational search benchmarks while having low rewriting latency, and is robust to out-of-domain search dialogues and long dialogue contexts.
pdf
bib
abs
A Customized Text Sanitization Mechanism with Differential Privacy
Sai Chen
|
Fengran Mo
|
Yanhao Wang
|
Cen Chen
|
Jian-Yun Nie
|
Chengyu Wang
|
Jamie Cui
Findings of the Association for Computational Linguistics: ACL 2023
As privacy issues are receiving increasing attention within the Natural Language Processing (NLP) community, numerous methods have been proposed to sanitize texts subject to differential privacy. However, the state-of-the-art text sanitization mechanisms based on a relaxed notion of metric local differential privacy (MLDP) do not apply to non-metric semantic similarity measures and cannot achieve good privacy-utility trade-offs. To address these limitations, we propose a novel Customized Text sanitization (CusText) mechanism based on the original
𝜖-differential privacy (DP) definition, which is compatible with any similarity measure.Moreover, CusText assigns each input token a customized output set to provide more advanced privacy protection at the token level.Extensive experiments on several benchmark datasets show that CusText achieves a better trade-off between privacy and utility than existing mechanisms.The code is available at
https://github.com/sai4july/CusText.
pdf
bib
abs
MoqaGPT : Zero-Shot Multi-modal Open-domain Question Answering with Large Language Model
Le Zhang
|
Yihong Wu
|
Fengran Mo
|
Jian-Yun Nie
|
Aishwarya Agrawal
Findings of the Association for Computational Linguistics: EMNLP 2023
Multi-modal open-domain question answering typically requires evidence retrieval from databases across diverse modalities, such as images, tables, passages, etc. Even Large Language Models (LLMs) like GPT-4 fall short in this task. To enable LLMs to tackle the task in a zero-shot manner, we introduce MoqaGPT, a straightforward and flexible framework. Using a divide-and-conquer strategy that bypasses intricate multi-modality ranking, our framework can accommodate new modalities and seamlessly transition to new models for the task. Built upon LLMs, MoqaGPT retrieves and extracts answers from each modality separately, then fuses this multi-modal information using LLMs to produce a final answer. Our methodology boosts performance on the MMCoQA dataset, improving F1 by +37.91 points and EM by +34.07 points over the supervised baseline. On the MultiModalQA dataset, MoqaGPT surpasses the zero-shot baseline, improving F1 by 9.5 points and EM by 10.1 points, and significantly closes the gap with supervised methods. Our codebase is available at https://github.com/lezhang7/MOQAGPT.
pdf
bib
abs
Large Language Models Know Your Contextual Search Intent: A Prompting Framework for Conversational Search
Kelong Mao
|
Zhicheng Dou
|
Fengran Mo
|
Jiewen Hou
|
Haonan Chen
|
Hongjin Qian
Findings of the Association for Computational Linguistics: EMNLP 2023
Precisely understanding users’ contextual search intent has been an important challenge for conversational search. As conversational search sessions are much more diverse and long-tailed, existing methods trained on limited data still show unsatisfactory effectiveness and robustness to handle real conversational search scenarios. Recently, large language models (LLMs) have demonstrated amazing capabilities for text generation and conversation understanding. In this work, we present a simple yet effective prompting framework, called LLM4CS, to leverage LLMs as a text-based search intent interpreter to help conversational search. Under this framework, we explore three prompting methods to generate multiple query rewrites and hypothetical responses, and propose to aggregate them into an integrated representation that can robustly represent the user’s real contextual search intent. Extensive automatic evaluations and human evaluations on three widely used conversational search benchmarks, including CAsT-19, CAsT-20, and CAsT-21, demonstrate the remarkable performance of our simple LLM4CS framework compared with existing methods and even using human rewrites. Our findings provide important evidence to better understand and leverage LLMs for conversational search.
2022
pdf
bib
abs
ConvTrans: Transforming Web Search Sessions for Conversational Dense Retrieval
Kelong Mao
|
Zhicheng Dou
|
Hongjin Qian
|
Fengran Mo
|
Xiaohua Cheng
|
Zhao Cao
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Conversational search provides users with a natural and convenient new search experience. Recently, conversational dense retrieval has shown to be a promising technique for realizing conversational search. However, as conversational search systems have not been widely deployed, it is hard to get large-scale real conversational search sessions and relevance labels to support the training of conversational dense retrieval. To tackle this data scarcity problem, previous methods focus on developing better few-shot learning approaches or generating pseudo relevance labels, but the data they use for training still heavily rely on manual generation.In this paper, we present ConvTrans, a data augmentation method that can automatically transform easily-accessible web search sessions into conversational search sessions to fundamentally alleviate the data scarcity problem for conversational dense retrieval. ConvTrans eliminates the gaps between these two types of sessions in terms of session quality and query form to achieve effective session transformation. Extensive evaluations on two widely used conversational search benchmarks, i.e., CAsT-19 and CAsT-20, demonstrate that the same model trained on the data generated by ConvTrans can achieve comparable retrieval performance as it trained on high-quality but expensive artificial conversational search data.
2020
pdf
bib
abs
A Joint Multiple Criteria Model in Transfer Learning for Cross-domain Chinese Word Segmentation
Kaiyu Huang
|
Degen Huang
|
Zhuang Liu
|
Fengran Mo
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Word-level information is important in natural language processing (NLP), especially for the Chinese language due to its high linguistic complexity. Chinese word segmentation (CWS) is an essential task for Chinese downstream NLP tasks. Existing methods have already achieved a competitive performance for CWS on large-scale annotated corpora. However, the accuracy of the method will drop dramatically when it handles an unsegmented text with lots of out-of-vocabulary (OOV) words. In addition, there are many different segmentation criteria for addressing different requirements of downstream NLP tasks. Excessive amounts of models with saving different criteria will generate the explosive growth of the total parameters. To this end, we propose a joint multiple criteria model that shares all parameters to integrate different segmentation criteria into one model. Besides, we utilize a transfer learning method to improve the performance of OOV words. Our proposed method is evaluated by designing comprehensive experiments on multiple benchmark datasets (e.g., Bakeoff 2005, Bakeoff 2008 and SIGHAN 2010). Our method achieves the state-of-the-art performances on all datasets. Importantly, our method also shows a competitive practicability and generalization ability for the CWS task.