Zhicheng Dou (窦志成)

Zhicheng Dou

Also published as: 志成窦

2025

pdf bib abs
Enhancing Reranking for Recommendation with LLMs through User Preference Retrieval
Haobo Zhang | Qiannan Zhu | Zhicheng Dou
Proceedings of the 31st International Conference on Computational Linguistics

Recently, large language models (LLMs) have shown the potential to enhance recommendations due to their sufficient knowledge and remarkable summarization ability. However, the existing LLM-powered recommendation may create redundant output, which generates irrelevant information about the user’s preferences on candidate items from user behavior sequences. To address the issues, we propose a framework UR4Rec that enhances reranking for recommendation with large language models through user preference retrieval. Specifically, UR4Rec develops a small transformer-based user preference retriever towards candidate items to build the bridge between LLMs and recommendation, which focuses on producing the essential knowledge through LLMs from user behavior sequences to enhance reranking for recommendation. Our experimental results on three real-world public datasets demonstrate the superiority of UR4Rec over existing baseline models.

pdf bib abs
How Credible Is an Answer From Retrieval-Augmented LLMs? Investigation and Evaluation With Multi-Hop QA
Yujia Zhou | Zheng Liu | Zhicheng Dou
Proceedings of the 31st International Conference on Computational Linguistics

Retrieval-augmented Large Language Models (RaLLMs) are reshaping knowledge acquisition, offering long-form, knowledge-grounded answers through advanced reasoning and generation capabilities. Despite the emergence of impactful systems like WebGPT and New Bing, the reliability of RaLLMs, especially in complex situations, is under scrutiny. Our study tackles this concern by evaluating RaLLMs’ question-answering performance using a novel benchmark focusing on Correctness and Groundedness. Correctness measures the logical soundness of the responses, and Groundedness checks for support by relevant references. We introduce an automated model-based evaluation pipeline for multi-hop question-answering tasks, revealing RaLLMs’ proneness to generating inaccuracies when dealing with flawed or partial knowledge. To improve accuracy, we introduce two reasoning strategies, Self-Reflection’ and Self-Completion,’ enabling RaLLMs to identify and fill knowledge gaps, significantly improving answer quality without extensive model retraining.

Recent advancements in text-to-image generation, notably the series of Stable Diffusion methods, have enabled the production of diverse, high-quality photo-realistic images. Nevertheless, these techniques still exhibit limitations in terms of knowledge access. Retrieval-augmented image generation is a straightforward way to tackle this problem. Current studies primarily utilize coarse-grained retrievers, employing initial prompts as search queries for knowledge retrieval. This approach, however, is ineffective in accessing valuable knowledge in long-tail text-to-image generation scenarios. To alleviate this problem, we introduce FineRAG, a fine-grained model that systematically breaks down the retrieval-augmented image generation task into four critical stages: query decomposition, candidate selection, retrieval-augmented diffusion, and self-reflection. Experimental results on both general and long-tailed benchmarks show that our proposed method significantly reduces the noise associated with retrieval-augmented image generation and performs better in complex, open-world scenarios.

pdf bib abs
RichRAG: Crafting Rich Responses for Multi-faceted Queries in Retrieval-Augmented Generation
Shuting Wang | Xin Yu | Mang Wang | Weipeng Chen | Yutao Zhu | Zhicheng Dou
Proceedings of the 31st International Conference on Computational Linguistics

Retrieval-augmented generation (RAG) effectively addresses issues of static knowledge and hallucination in large language models. Existing studies mostly focus on question scenarios with clear user intents and concise answers. However, it is prevalent that users issue broad, open-ended queries with diverse sub-intents, for which they desire rich and long-form answers covering multiple relevant aspects. To tackle this important yet underexplored problem, we propose a novel RAG framework, namely RichRAG. It includes a sub-aspect explorer to identify potential sub-aspects of input questions, a multi-faceted retriever to build a candidate pool of diverse external documents related to these sub-aspects, and a generative list-wise ranker, which is a key module to provide the top-k most valuable documents for the final generator. These ranked documents sufficiently cover various query aspects and are aware of the generator’s preferences, hence incentivizing it to produce rich and comprehensive responses for users. The training of our ranker involves a supervised fine-tuning stage to ensure the basic coverage of documents, and a reinforcement learning stage to align downstream LLM’s preferences to the ranking of documents. Experimental results on two publicly available datasets prove that our framework effectively and efficiently provides comprehensive and satisfying responses to users.

2024

pdf bib abs
Grounding Language Model with Chunking-Free In-Context Retrieval
Hongjin Qian | Zheng Liu | Kelong Mao | Yujia Zhou | Zhicheng Dou
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

This paper presents a novel Chunking-Free In-Context (CFIC) retrieval approach, specifically tailored for Retrieval-Augmented Generation (RAG) systems. Traditional RAG systems often struggle with grounding responses using precise evidence text due to the challenges of processing lengthy documents and filtering out irrelevant content. Commonly employed solutions, such as document chunking and adapting language models to handle longer contexts, have their limitations. These methods either disrupt the semantic coherence of the text or fail to effectively address the issues of noise and inaccuracy in evidence retrieval.The CFIC approach addresses these challenges by circumventing the conventional chunking process. It utilizes the encoded hidden states of documents for in-context retrieval, employing auto-aggressive decoding to accurately identify the specific evidence text required for user queries, eliminating the need for chunking. CFIC is further enhanced by incorporating two innovative decoding strategies, namely Constrained Sentence Prefix Decoding and Skip Decoding. These strategies not only improve the efficiency of the retrieval process but also ensure that the fidelity of the generated grounding text evidence is maintained.Our evaluations of CFIC on a range of open question answering datasets demonstrate its superiority in retrieving relevant and accurate information, offering a significant improvement over traditional methods. By doing away with the need for document chunking, CFIC presents a more streamlined, effective, and efficient retrieval solution, making it a valuable advancement in the field of RAG systems.

pdf bib abs
Generalizing Conversational Dense Retrieval via LLM-Cognition Data Augmentation
Haonan Chen | Zhicheng Dou | Kelong Mao | Jiongnan Liu | Ziliang Zhao
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Conversational search utilizes muli-turn natural language contexts to retrieve relevant passages. Existing conversational dense retrieval models mostly view a conversation as a fixed sequence of questions and responses, overlooking the severe data sparsity problem – that is, users can perform a conversation in various ways, and these alternate conversations are unrecorded. Consequently, they often struggle to generalize to diverse conversations in real-world scenarios. In this work, we propose a framework for generalizing Conversational dense retrieval via LLM-cognition data Augmentation (ConvAug). We first generate multi-level augmented conversations to capture the diverse nature of conversational contexts. Inspired by human cognition, we devise a cognition-aware prompting process to mitigate the generation of false positives, false negatives, and hallucinations. Moreover, we develop a difficulty-adaptive sample filter that selects challenging samples for complex conversations, thereby giving the model a larger learning space. A contrastive learning objective is then employed to train a better conversational context encoder. Extensive experiments conducted on four public datasets, under both normal and zero-shot settings, demonstrate the effectiveness, generalizability, and applicability of ConvAug. The code is released at https://github.com/haon-chen/ConvAug.

Large language models (LLMs) have demonstrated impressive capabilities in various natural language processing tasks. Despite this, their application to information retrieval (IR) tasks is still challenging due to the infrequent occurrence of many IR-specific concepts in natural language. While prompt-based methods can provide task descriptions to LLMs, they often fall short in facilitating a comprehensive understanding and execution of IR tasks, thereby limiting LLMs’ applicability. To address this gap, in this work, we explore the potential of instruction tuning to enhance LLMs’ proficiency in IR tasks. We introduce a novel instruction tuning dataset, INTERS, encompassing 20 tasks across three fundamental IR categories: query understanding, document understanding, and query-document relationship understanding. The data are derived from 43 distinct datasets with manually written templates. Our empirical results reveal that INTERS significantly boosts the performance of various publicly available LLMs, such as LLaMA, Mistral, and Falcon, in IR tasks. Furthermore, we conduct extensive experiments to analyze the effects of instruction design, template diversity, few-shot demonstrations, and the volume of instructions on performance. We make our dataset and the fine-tuned models publicly accessible at https://github.com/DaoD/INTERS.

pdf bib abs
Interpreting Conversational Dense Retrieval by Rewriting-Enhanced Inversion of Session Embedding
Yiruo Cheng | Kelong Mao | Zhicheng Dou
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Conversational dense retrieval has shown to be effective in conversational search. However, a major limitation of conversational dense retrieval is their lack of interpretability, hindering intuitive understanding of model behaviors for targeted improvements. This paper presents CONVINV, a simple yet effective approach to shed light on interpretable conversational dense retrieval models. CONVINV transforms opaque conversational session embeddings into explicitly interpretable text while faithfully maintaining their original retrieval performance as much as possible. Such transformation is achieved by training a recently proposed Vec2Text model based on the ad-hoc query encoder, leveraging the fact that the session and query embeddings share the same space in existing conversational dense retrieval.To further enhance interpretability, we propose to incorporate external interpretable query rewrites into the transformation process. Extensive evaluations on three conversational search benchmarks demonstrate that CONVINV can yield more interpretable text and faithfully preserve original retrieval performance than baselines. Our work connects opaque session embeddings with transparent query rewriting, paving the way toward trustworthy conversational search.

pdf bib abs
A Multi-Task Embedder For Retrieval Augmented LLMs
Peitian Zhang | Zheng Liu | Shitao Xiao | Zhicheng Dou | Jian-Yun Nie
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

LLMs confront inherent limitations in terms of its knowledge, memory, and action. The retrieval augmentation stands as a vital mechanism to address these limitations, which brings in useful information from external sources to augment the LLM. However, existing retrieval methods encounter two pressing issues. On one hand, the general retrievers are not properly optimized for retrieval augmentation hence exhibit limited effectiveness; on the other hand, the task-specific retrievers excel in the targeted retrieval augmentation scenario, while lack the versatility to handle diverse scenarios. In this work, we propose LLM-Embedder for the unified support of diverse retrieval augmentation scenarios. Our method presents three technical contributions. Firstly, we introduce a new reward formulation, namely rank-aware reward. It exploits the ranking position of the desired output among N sampled outputs from the LLM, which leads to fine-grained and robust computation of reward from the LLM’s feedback. Secondly, we design a novel distillation objective, called graded distillation. It incorporates both the absolute value and the relative order of the reward for more sufficient utilization of the LLM’s feedback. Thirdly, we systematically optimize the multi-task learning, which effectively unifies the multiple retrieval functionalities into one model. In our experiment, LLM-Embedder substantially improves the LLM’s performances in various downstream tasks, while introducing superior retrieval augmentation’s effect over both general and task-specifc retrievers. Our data, code, and model have been released at https://github.com/FlagOpen/FlagEmbedding.

pdf bib abs
Small Models, Big Insights: Leveraging Slim Proxy Models To Decide When and What to Retrieve for LLMs
Jiejun Tan | Zhicheng Dou | Yutao Zhu | Peidong Guo | Kun Fang | Ji-Rong Wen
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The integration of large language models (LLMs) and search engines represents a significant evolution in knowledge acquisition methodologies. However, determining the knowledge that an LLM already possesses and the knowledge that requires the help of a search engine remains an unresolved issue. Most existing methods solve this problem through the results of preliminary answers or reasoning done by the LLM itself, but this incurs excessively high computational costs. This paper introduces a novel collaborative approach, namely SlimPLM, that detects missing knowledge in LLMs with a slim proxy model, to enhance the LLM’s knowledge acquisition process. We employ a proxy model which has far fewer parameters, and take its answers as heuristic answers. Heuristic answers are then utilized to predict the knowledge required to answer the user question, as well as the known and unknown knowledge within the LLM. We only conduct retrieval for the missing knowledge in questions that the LLM does not know. Extensive experimental results on five datasets with two LLMs demonstrate a notable improvement in the end-to-end performance of LLMs in question-answering tasks, achieving or surpassing current state-of-the-art models with lower LLM inference costs.

Conversational search requires accurate interpretation of user intent from complex multi-turn contexts. This paper presents ChatRetriever, which inherits the strong generalization capability of large language models to robustly represent complex conversational sessions for dense retrieval. To achieve this, we propose a simple and effective dual-learning approach that adapts LLM for retrieval via contrastive learning while enhancing the complex session understanding through masked instruction tuning on high-quality conversational instruction tuning data. Extensive experiments on five conversational search benchmarks demonstrate that ChatRetriever significantly outperforms existing conversational dense retrievers, achieving state-of-the-art performance on par with LLM-based rewriting approaches. Furthermore, ChatRetriever exhibits superior robustness in handling diverse conversational contexts. Our work highlights the potential of adapting LLMs for retrieval with complex inputs like conversational search sessions and proposes an effective approach to advance this research direction.

pdf bib abs
Learning Interpretable Legal Case Retrieval via Knowledge-Guided Case Reformulation
Chenlong Deng | Kelong Mao | Zhicheng Dou
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Legal case retrieval for sourcing similar cases is critical in upholding judicial fairness. Different from general web search, legal case retrieval involves processing lengthy, complex, and highly specialized legal documents. Existing methods in this domain often overlook the incorporation of legal expert knowledge, which is crucial for accurately understanding and modeling legal cases, leading to unsatisfactory retrieval performance. This paper introduces KELLER, a legal knowledge-guided case reformulation approach based on large language models (LLMs) for effective and interpretable legal case retrieval. By incorporating professional legal knowledge about crimes and law articles, we enable large language models to accurately reformulate the original legal case into concise sub-facts of crimes, which contain the essential information of the case. Extensive experiments on two legal case retrieval benchmarks demonstrate superior retrieval performance and robustness on complex legal case queries of KELLER over existing methods.

pdf bib abs
BIDER: Bridging Knowledge Inconsistency for Efficient Retrieval-Augmented LLMs via Key Supporting Evidence
Jiajie Jin | Yutao Zhu | Yujia Zhou | Zhicheng Dou
Findings of the Association for Computational Linguistics: ACL 2024

Retrieval-augmented large language models (LLMs) have demonstrated efficacy in knowledge-intensive tasks such as open-domain QA, addressing inherent challenges in knowledge update and factual inadequacy.However, inconsistencies between retrieval knowledge and the necessary knowledge for LLMs, leading to a decline in LLM’s answer quality. This paper introduces BIDER, an approach that refines retrieval documents into Key Supporting Evidence (KSE) through knowledge synthesis, supervised fine-tuning (SFT), and preference alignment. We train BIDER by learning from crafting KSE, while maximizing its output to align with LLM’s information acquisition preferences through reinforcement learning. Evaluations across five datasets show BIDER boosts LLMs’ answer quality by 7% while reducing input content length in retrieval documents by 80%, outperforming existing methods. The proposed KSE simulation effectively equips LLMs with essential information for accurate question answering.

pdf bib abs
An Element is Worth a Thousand Words: Enhancing Legal Case Retrieval by Incorporating Legal Elements
Chenlong Deng | Zhicheng Dou | Yujia Zhou | Peitian Zhang | Kelong Mao
Findings of the Association for Computational Linguistics: ACL 2024

Legal case retrieval plays an important role in promoting judicial justice and fairness. One of its greatest challenges is that the definition of relevance goes far beyond the common semantic relevance as in ad-hoc retrieval. In this paper, we reveal that the legal elements, which typically comprise key facts in a specialized legal context, can largely improve the relevance matching of legal case retrieval. To facilitate the use of legal elements, we construct a Chinese legal element dataset called LeCaRD-Elem based on the widely-used LeCaRD dataset, through a two-stage semi-automatic method with a minimized reliance on human labor. Meanwhile, we introduce two new models to enhance legal search using legal elements. The first, Elem4LCR-E, is a two-stage model that explicitly predicts legal elements from texts and then leverages them for improved ranking. Recognizing the potential benefits of more seamless integration, we further propose an end-to-end model called Elem4LCR-I, which internalizes the legal element knowledge into its model parameters using a tailored teacher-student training framework. Extensive experiments underscore the significant value of legal elements and demonstrate the superiority of our two proposed models in enhancing legal search over existing methods.

pdf bib abs
RAG-Studio: Towards In-Domain Adaptation of Retrieval Augmented Generation Through Self-Alignment
Kelong Mao | Zheng Liu | Hongjin Qian | Fengran Mo | Chenlong Deng | Zhicheng Dou
Findings of the Association for Computational Linguistics: EMNLP 2024

Retrieval-Augmented Generation (RAG) has proven to be an effective paradigm for enhancing the quality of text generation by integrating large language models (LLMs) with external knowledge. However, an off-the-shelf RAG system, which relies on generally pre-trained LLMs and retrievers, often falls short in specialized domains and applications. In this paper, we introduce RAG-Studio, an efficient self-aligned training framework to adapt general RAG models to specific domains solely through synthetic data, eliminating the need for expensive human-labeled in-domain data. RAG-Studio accepts a specialized domain corpus, a general LLM, and a general retriever, then autonomously generates contrastive training data for both the LLM and retriever through self-alignment. We fine-tune them to work cohesively as an integrated and effective domain-specific RAG system, where the LLM is adapted to incorporate new domain knowledge and become robust to noisy contexts, and the retriever learns to better align with the LLM’s preferences, providing more useful information and minimizing the risk of misleading the LLM. Extensive experiments across diverse in-domain question-answering datasets spanning the biomedical, finance, law, and computing domains, show that RAG-Studio attains state-of-the-art performance, consistently outperforming the use of human-annotated data for fine-tuning.

pdf bib abs
Enabling Discriminative Reasoning in LLMs for Legal Judgment Prediction
Chenlong Deng | Kelong Mao | Yuyao Zhang | Zhicheng Dou
Findings of the Association for Computational Linguistics: EMNLP 2024

Legal judgment prediction is essential for enhancing judicial efficiency. In this work, we identify that existing large language models (LLMs) underperform in this domain due to challenges in understanding case complexities and distinguishing between similar charges. To adapt LLMs for effective legal judgment prediction, we introduce the Ask-Discriminate-Predict (ADAPT) reasoning framework inspired by human judicial reasoning. ADAPT involves decomposing case facts, discriminating among potential charges, and predicting the final judgment. We further enhance LLMs through fine-tuning with multi-task synthetic trajectories to improve legal judgment prediction accuracy and efficiency under our ADAPT framework. Extensive experiments conducted on two widely-used datasets demonstrate the superior performance of our framework in legal judgment prediction, particularly when dealing with complex and confusing charges.

pdf bib abs
Learning Dynamic Multi-attribute Interest for Personalized Product Search
Yutong Bai | Zhicheng Dou | Ji-Rong Wen
Findings of the Association for Computational Linguistics: EMNLP 2024

Personalized product search aims to learn personalized preferences from search logs and adjust the ranking lists returned by engines. Previous studies have extensively explored excavating valuable features to build accurate interest profiles. However, they overlook that the user’s attention varies on product attributes(e.g., brand, category). Users may especially prefer specific attributes or switch their preferences between attributes dynamically. Instead, existing approaches mix up all attribute features and let the model automatically extract useful ones from rather complex scenarios. To solve this problem, in this paper, we propose a dynamic multi-attribute interest learning model to tackle the influences from attributes to user interests. Specifically, we design two interest profiling modules: attribute-centered and attribute-aware profiling. The former focuses on capturing the user’s preferences on a single attribute, while the latter focuses on addressing the interests correlated with multi-attribute within the search history. Besides, we devise a dynamic contribution weights strategy that sends explicit signals to the model to determine the impacts of different attributes better. Experimental results on large-scale datasets illustrate that our model significantly improves the results of existing methods.

2023

pdf bib abs
Hybrid Inverted Index Is a Robust Accelerator for Dense Retrieval
Peitian Zhang | Zheng Liu | Shitao Xiao | Zhicheng Dou | Jing Yao
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Inverted file structure is a common technique for accelerating dense retrieval. It clusters documents based on their embeddings; during searching, it probes nearby clusters w.r.t. an input query and only evaluates documents within them by subsequent codecs, thus avoiding the expensive cost from exhaustive traversal. However, the clustering is always lossy, which results in the miss of relevant documents in the probed clusters and hence degrades retrieval quality. In contrast, lexical matching, such as overlaps of salient terms, tend to be strong features for identifying relevant documents. In this work, we present the Hybrid Inverted Index (HI²), where the embedding clusters and salient terms work collaboratively to accelerate dense retrieval. To make best of both effectiveness and efficiency, we devise a cluster selector and a term selector, to construct compact inverted lists and efficiently searching through them. Moreover, we leverage simple unsupervised algorithms as well as end-to-end knowledge distillation to learn these two modules, with the latter further boosting the effectiveness. Based on comprehensive experiments on popular retrieval benchmarks, we verify that clusters and terms indeed complement each other, enabling HI² to achieve lossless retrieval quality with competitive efficiency across a variety of index settings.

pdf bib abs
Enhancing Generative Retrieval with Reinforcement Learning from Relevance Feedback
Yujia Zhou | Zhicheng Dou | Ji-Rong Wen
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

The recent advent of end-to-end generative retrieval marks a significant shift in document retrieval methods, leveraging differentiable search indexes to directly produce relevant document identifiers (docids) in response to a specific query. Nevertheless, this approach faces two fundamental challenges: (i) a discrepancy between the token-level probabilistic optimization and the broader document-level relevance estimation; (ii) an overemphasis on top-1 results at the expense of overall ranking quality. To tackle these challenges, we propose a generative retrieval model with reinforcement learning from relevance feedback, which aims to align token-level docid generation with document-level relevance estimation. The training process incorporates three stages: supervised fine-tuning, relevance reward model training, and reinforced learning-to-rank from relevance feedback. To train a high-quality reward model, we define “relevance” under three progressive scenarios, which collectively offer a comprehensive evaluation of the document relevance. Experiments conducted on two benchmark datasets demonstrate the effectiveness of our proposed approach.

Syllogistic reasoning, a typical form of deductive reasoning, is a critical capability widely required in natural language understanding tasks, such as text entailment and question answering. To better facilitate research on syllogistic reasoning, we develop a benchmark called SylloBase that differs from existing syllogistic datasets in three aspects: (1) Covering a complete taxonomy of syllogism reasoning patterns; (2) Containing both automatically and manually constructed samples; and (3) Involving both the generation and understanding tasks. We automatically construct 50k template-based syllogism samples by mining syllogism patterns from Wikidata and ConceptNet. To improve our dataset’s naturalness and challenge, we apply GPT-3 to paraphrase the template-based data and further manually rewrite 1,000 samples as the test set. State-of-the-art pre-trained language models can achieve the best generation ROUGE-L of 38.72 by T5 and the best multi-choice accuracy of 72.77% by RoBERTa on SylloBase, which indicates the great challenge of learning diverse syllogistic reasoning types on SylloBase. Our datasets are released at https://github.com/casually-PYlearner/SYLLOBASE.

Conversational query rewriting (CQR) realizes conversational search by reformulating the search dialogue into a standalone rewrite. However, existing CQR models either are not learned toward improving the downstream search performance or inefficiently generate the rewrite token-by-token from scratch while neglecting the fact that the search dialogue often has a large overlap with the rewrite. In this paper, we propose EdiRCS, a new text editing-based CQR model tailored for conversational search. In EdiRCS, most of the rewrite tokens are selected from the dialogue in a non-autoregressive fashion and only a few new tokens are generated to supplement the final rewrite, which makes EdiRCS highly efficient. In particular, the learning of EdiRCS is augmented with two search-oriented objectives, including contrastive ranking augmentation and contextualization knowledge transfer, which effectively improve it to select and generate more useful tokens from the view of retrieval. We show that EdiRCS outperforms state-of-the-art CQR models on three conversational search benchmarks while having low rewriting latency, and is robust to out-of-domain search dialogues and long dialogue contexts.

pdf bib abs
Large Language Models Know Your Contextual Search Intent: A Prompting Framework for Conversational Search
Kelong Mao | Zhicheng Dou | Fengran Mo | Jiewen Hou | Haonan Chen | Hongjin Qian
Findings of the Association for Computational Linguistics: EMNLP 2023

Precisely understanding users’ contextual search intent has been an important challenge for conversational search. As conversational search sessions are much more diverse and long-tailed, existing methods trained on limited data still show unsatisfactory effectiveness and robustness to handle real conversational search scenarios. Recently, large language models (LLMs) have demonstrated amazing capabilities for text generation and conversation understanding. In this work, we present a simple yet effective prompting framework, called LLM4CS, to leverage LLMs as a text-based search intent interpreter to help conversational search. Under this framework, we explore three prompting methods to generate multiple query rewrites and hypothetical responses, and propose to aggregate them into an integrated representation that can robustly represent the user’s real contextual search intent. Extensive automatic evaluations and human evaluations on three widely used conversational search benchmarks, including CAsT-19, CAsT-20, and CAsT-21, demonstrate the remarkable performance of our simple LLM4CS framework compared with existing methods and even using human rewrites. Our findings provide important evidence to better understand and leverage LLMs for conversational search.

2022

Generalized text representations are the foundation of many natural language understanding tasks. To fully utilize the different corpus, it is inevitable that models need to understand the relevance among them. However, many methods ignore the relevance and adopt a single-channel model (a coarse paradigm) directly for all tasks, which lacks enough rationality and interpretation. In addition, some existing works learn downstream tasks by stitches skill block (a fine paradigm), which might cause irrational results due to its redundancy and noise. In this work, we first analyze the task correlation through three different perspectives, , data property, manual design, and model-based relevance, based on which the similar tasks are grouped together. Then, we propose a hierarchical framework with a coarse-to-fine paradigm, with the bottom level shared to all the tasks, the mid-level divided to different groups, and the top-level assigned to each of the tasks. This allows our model to learn basic language properties from all tasks, boost performance on relevant tasks, and reduce the negative impact from irrelevant tasks. Our experiments on 13 benchmark datasets across five natural language understanding tasks demonstrate the superiority of our method.

pdf bib abs
ConvTrans: Transforming Web Search Sessions for Conversational Dense Retrieval
Kelong Mao | Zhicheng Dou | Hongjin Qian | Fengran Mo | Xiaohua Cheng | Zhao Cao
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Conversational search provides users with a natural and convenient new search experience. Recently, conversational dense retrieval has shown to be a promising technique for realizing conversational search. However, as conversational search systems have not been widely deployed, it is hard to get large-scale real conversational search sessions and relevance labels to support the training of conversational dense retrieval. To tackle this data scarcity problem, previous methods focus on developing better few-shot learning approaches or generating pseudo relevance labels, but the data they use for training still heavily rely on manual generation.In this paper, we present ConvTrans, a data augmentation method that can automatically transform easily-accessible web search sessions into conversational search sessions to fundamentally alleviate the data scarcity problem for conversational dense retrieval. ConvTrans eliminates the gaps between these two types of sessions in terms of session quality and query form to achieve effective session transformation. Extensive evaluations on two widely used conversational search benchmarks, i.e., CAsT-19 and CAsT-20, demonstrate that the same model trained on the data generated by ConvTrans can achieve comparable retrieval performance as it trained on high-quality but expensive artificial conversational search data.

pdf bib abs
Explicit Query Rewriting for Conversational Dense Retrieval
Hongjin Qian | Zhicheng Dou
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

In a conversational search scenario, a query might be context-dependent because some words are referred to previous expressions or omitted. Previous works tackle the issue by either reformulating the query into a self-contained query (query rewriting) or learning a contextualized query embedding from the query context (context modelling). In this paper, we propose a model CRDR that can perform query rewriting and context modelling in a unified framework in which the query rewriting’s supervision signals further enhance the context modelling. Instead of generating a new query, CRDR only performs necessary modifications on the original query, which improves both accuracy and efficiency of query rewriting. In the meantime, the query rewriting benefits the context modelling by explicitly highlighting relevant terms in the query context, which improves the quality of the learned contextualized query embedding. To verify the effectiveness of CRDR, we perform comprehensive experiments on TREC CAsT-19 and TREC CAsT-20 datasets, and the results show that our method outperforms all baseline models in terms of both quality of query rewriting and quality of context-aware ranking.

pdf bib abs
MCP: Self-supervised Pre-training for Personalized Chatbots with Multi-level Contrastive Sampling
Zhaoheng Huang | Zhicheng Dou | Yutao Zhu | Zhengyi Ma
Findings of the Association for Computational Linguistics: EMNLP 2022

Personalized chatbots focus on endowing the chatbots with a consistent personality to behave like real users and further act as personal assistants. Previous studies have explored generating implicit user profiles from the user’s dialogue history for building personalized chatbots. However, these studies only use the response generation loss to train the entire model, thus it is prone to suffer from the problem of data sparsity. Besides, they overemphasize the final generated response’s quality while ignoring the correlations and fusions between the user’s dialogue history, leading to rough data representations and performance degradation. To tackle these problems, we propose a self-supervised learning framework MCP for capturing better representations from users’ dialogue history for personalized chatbots. Specifically, we apply contrastive sampling methods to leverage the supervised signals hidden in user dialog history, and generate the pre-training samples for enhancing the model. We design three pre-training tasks based on three types of contrastive pairs from user dialogue history, namely response pairs, sequence augmentation pairs, and user pairs. We pre-train the utterance encoder and the history encoder towards the contrastive objectives and use these pre-trained encoders for generating user profiles while personalized response generation. Experimental results on two real-world datasets show a significant improvement in our proposed model MCP compared with the existing methods.

pdf bib abs
Less is More: Learning to Refine Dialogue History for Personalized Dialogue Generation
Hanxun Zhong | Zhicheng Dou | Yutao Zhu | Hongjin Qian | Ji-Rong Wen
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Personalized dialogue systems explore the problem of generating responses that are consistent with the user’s personality, which has raised much attention in recent years. Existing personalized dialogue systems have tried to extract user profiles from dialogue history to guide personalized response generation. Since the dialogue history is usually long and noisy, most existing methods truncate the dialogue history to model the user’s personality. Such methods can generate some personalized responses, but a large part of dialogue history is wasted, leading to sub-optimal performance of personalized response generation. In this work, we propose to refine the user dialogue history on a large scale, based on which we can handle more dialogue history and obtain more abundant and accurate persona information. Specifically, we design an MSP model which consists of three personal information refiners and a personalized response generator. With these multi-level refiners, we can sparsely extract the most valuable information (tokens) from the dialogue history and leverage other similar users’ data to enhance personalization. Experimental results on two real-world datasets demonstrate the superiority of our model in generating more informative and personalized responses.

2021

pdf bib abs
基于双星型自注意力网络的搜索结果多样化方法(Search Result Diversification Framework Based on Dual Star-shaped Self-Attention Network)
Xubo Qin (秦绪博) | Zhicheng Dou (窦志成) | Yutao Zhu (朱余韬) | Jirong Wen (文继荣)
Proceedings of the 20th Chinese National Conference on Computational Linguistics

相关研究指出,用户提交给搜索引擎的查询通常为短查询。由于自然语言本身的特点,短查询通常具有歧义性,同一个查询可以指代不同的事物,或同一事物的不同方面。为了让搜索结果尽可能满足用户多样化的信息需求,搜索引擎需要对返回的结果进行多样化排序,搜索结果多样化技术应运而生。目前已有的基于全局交互的多样化方法通过全连接的自注意力网络捕获全体候选文档间的交互关系,取得了较好的效果。但由于此类方法只考虑文档间的相关关系,并没有考虑到文档是否具有跟查询相关的有效信息,在训练数据有限的条件下效率相对较低。该文提出了一种基于双星型自注意力网络的搜索结果多样化方法,将全连接结构改为星型拓扑结构,并嵌入查询信息以高效率地提取文档跟查询相关的全局交互特征。相关实验结果显示,该模型相对于基于全连接自注意力网络的多样化方法,具备显著的性能优势。

Dense retrieval requires high-quality text sequence embeddings to support effective search in the representation space. Autoencoder-based language models are appealing in dense retrieval as they train the encoder to output high-quality embedding that can reconstruct the input texts. However, in this paper, we provide theoretical analyses and show empirically that an autoencoder language model with a low reconstruction loss may not provide good sequence representations because the decoder may take shortcuts by exploiting language patterns. To address this, we propose a new self-learning method that pre-trains the autoencoder using a weak decoder, with restricted capacity and attention flexibility to push the encoder to provide better text representations. Our experiments on web search, news recommendation, and open domain question answering show that our pre-trained model significantly boosts the effectiveness and few-shot ability of dense retrieval models. Our code is available at https://github.com/microsoft/SEED-Encoder/.

2020

pdf bib abs
ScriptWriter: Narrative-Guided Script Generation
Yutao Zhu | Ruihua Song | Zhicheng Dou | Jian-Yun Nie | Jin Zhou
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

It is appealing to have a system that generates a story or scripts automatically from a storyline, even though this is still out of our reach. In dialogue systems, it would also be useful to drive dialogues by a dialogue plan. In this paper, we address a key problem involved in these applications - guiding a dialogue by a narrative. The proposed model ScriptWriter selects the best response among the candidates that fit the context as well as the given narrative. It keeps track of what in the narrative has been said and what is to be said. A narrative plays a different role than the context (i.e., previous utterances), which is generally used in current dialogue systems. Due to the unavailability of data for this new application, we construct a new large-scale data collection GraphMovie from a movie website where end- users can upload their narratives freely when watching a movie. Experimental results on the dataset show that our proposed approach based on narratives significantly outperforms the baselines that simply use the narrative as a kind of context.