Conversational Recommender System and Large Language Model Are Made for Each Other in E-commerce Pre-sales Dialogue

E-commerce pre-sales dialogue aims to understand and elicit user needs and preferences for the items they are seeking so as to provide appropriate recommendations. Conversational recommender systems (CRSs) learn user representations and provide accurate recommendations based on dialogue context, but rely on external knowledge. Large language models (LLMs) generate responses that mimic pre-sales dialogues after fine-tuning, but lack the domain-specific knowledge needed for accurate recommendations. Intuitively, the strengths of LLMs and CRSs in E-commerce pre-sales dialogues are complementary, yet no previous work has explored this. This paper investigates the effectiveness of combining LLMs and CRSs in E-commerce pre-sales dialogues, proposing two collaboration methods: CRS assisting LLM and LLM assisting CRS. We conduct extensive experiments on a real-world dataset of E-commerce pre-sales dialogues, analyzing the impact of the two collaborative approaches, with two CRSs and two LLMs, on four tasks of E-commerce pre-sales dialogue. We find that collaborations between CRS and LLM can be very effective in some cases.

1 Introduction

E-commerce pre-sales dialogue refers to a dialogue between a user and customer service staff before the purchase action (Chen et al., 2020; Zhao et al., 2021). A high-quality pre-sales dialogue can greatly increase the purchase rate of a user. However, there are many challenges in providing a high-quality pre-sales dialogue service (Liu et al., 2023b). Figure 1 shows an example of an E-commerce pre-sales dialogue. The bot needs to interact with the user, understanding the user's needs and responding in understandable words. Additionally, it should offer appropriate recommendations and elicit further preferences from the user.

Conversational recommender systems (CRSs) aim to learn relationships between user preferences and candidate product representations to provide accurate recommendations (Li et al., 2018) and generate responses related to recommended products (Liang et al., 2021). However, understanding user preferences from dialogues relies heavily on external knowledge (Chen et al., 2019), e.g. DBpedia and ConceptNet. With an external knowledge base, a CRS is able to recognize the entities present in the context, but it still has difficulty understanding the semantic information within the context.
Large language models (LLMs), which have numerous parameters pre-trained on a large amount of data, possess a wealth of knowledge that enables people to interact with them using natural language (Brown et al., 2020; Ouyang et al., 2022). With supervised fine-tuning, LLMs can handle diverse descriptions of user needs in pre-sales dialogues. However, LLMs lack information about candidate products, which makes them unsuitable for providing domain-specific recommendations.
What will happen when LLMs and CRSs collaborate? In this paper, we explore two types of collaborations between LLMs and CRSs in E-commerce pre-sales dialogues: (i) LLM assisting CRS and (ii) CRS assisting LLM. When an LLM assists a CRS, we append the generated response of the LLM to the input of the CRS; for the recommendation task, we incorporate the representation of the product predicted by the LLM into the calculation of the user representation. When a CRS assists an LLM, we append the predictions of the CRS to the input of the LLM; for the recommendation task we insert the recommendation list, while for the other tasks we insert the predicted text.
Specifically, we explore the effectiveness of collaborations on a real-world dataset of E-commerce pre-sales dialogues, namely U-NEED (Liu et al., 2023b). U-NEED contains pre-sales dialogues in five top categories and supports four key tasks in E-commerce pre-sales dialogue: (i) pre-sales dialogue understanding, (ii) user needs elicitation, (iii) user needs-based recommendation and (iv) pre-sales dialogue generation. We select two popular open-source LLMs, ChatGLM-6B and Chinese-Alpaca-7B, as well as two recent CRSs, a BART-based CRS and a CPT-based CRS. We report experimental results for each combination of collaborations on the four challenging tasks. Experimental results demonstrate that the collaboration between LLM and CRS is effective on three tasks: pre-sales dialogue understanding, user needs elicitation and user needs-based recommendation.
The main contributions of this paper are as follows: • To the best of our knowledge, we are the first to explore collaboration between LLMs and CRSs in a real-world scenario, namely E-commerce pre-sales dialogue.
• We propose methods for two types of collaborations between LLMs and CRSs, i.e., CRS assisting LLM and LLM assisting CRS.
• Extensive experimental results on a real-world E-commerce pre-sales dialogue dataset indicate the effectiveness and potential of collaborations between LLMs and CRSs.

Related Work
We review the related work along two lines: (i) conversational recommendation and (ii) large language models (LLMs) for recommendation.
The emergence of LLMs has undoubtedly impacted CRS-related research. However, previous work barely explores the collaboration between conversational language models and CRSs on tasks related to conversational recommendation. We investigate the collaboration between LLMs and CRSs in E-commerce pre-sales dialogues.
Different from previous work using LLM to enhance CRS, we systematically investigate the effectiveness of combining LLM and CRS, i.e., LLM assisting CRS and CRS assisting LLM, which provides insights for future research on CRSs.
Figure 2: A comparison of the three types of collaboration between a CRS and an LLM. We explore the collaboration between LLM and CRS, i.e., LLM assisting CRS and CRS assisting LLM, and we compare the three in detail in §5.

Overview
In this paper, we explore the collaboration of a conversational recommender system (CRS) and a large language model (LLM). Figure 2 provides an illustration of our collaboration framework.

LLM assisting CRS. We leverage the prediction results of an LLM to support a CRS. Initially, we fine-tune an LLM using pre-sales dialogues, following the method described in Section 3.2. Subsequently, we incorporate the prediction results of the fine-tuned LLM into the training process of the CRS, via prompts and vectors. For a detailed description, refer to Section 3.5.

CRS assisting LLM. We utilize the prediction results of a CRS to assist an LLM. Initially, we train a CRS using pre-sales dialogues, following the approach outlined in Section 3.3. Subsequently, we integrate the prediction results of the trained CRS into the instructions and inputs used to fine-tune the LLM. For further details, see Section 3.4.
In this paper, we explore the effectiveness of collaboration by analyzing the impact of collaboration between CRS and LLM on the performance of four tasks in E-commerce pre-sales dialogue (Liu et al., 2023b). These tasks include: (i) pre-sales dialogue understanding, (ii) user needs elicitation, (iii) user needs-based recommendation and (iv) pre-sales dialogue generation. Due to space constraints, we provide detailed definitions of these tasks in Appendix A.

LLMs for E-commerce pre-sales dialogue
We introduce the method of fine-tuning a large language model (LLM) using pre-sales dialogues.

Instruction data. Each sample within the training, validation and test sets consists of an "instruction", an "input" and an "output". The "instruction" comprises several sentences that introduce the task's objective. The "input" contains the information necessary for completing the task. For instance, in the case of the user needs-based recommendation task, the "input" encompasses the user's needs, candidate products and related product knowledge. The "output" remains consistent with the original task. Figure 3 shows an example of the instruction data corresponding to the user needs elicitation task. Additional examples for the various tasks can be found in Appendix F. Note that because the original user needs-based recommendation task involves numerous candidates, with each product possessing extensive attribute knowledge, the "input" surpasses the maximum input length permitted by LLMs. Consequently, in practice, we limit the number of candidates to 20.

Base LLMs and fine-tuning. We select ChatGLM and Chinese-Alpaca-7B as base LLMs due to their openness and commendable performance in basic Chinese semantic understanding. ChatGLM is an open bilingual language model built upon the General Language Model (GLM) framework (Zeng et al., 2023), with 6.2 billion parameters. LLaMA (Touvron et al., 2023) is a decoder-only, foundational large language model based on the transformer architecture (Vaswani et al., 2017). The Chinese LLaMA model is an extension of the original LLaMA model, incorporating an expanded Chinese vocabulary and undergoing secondary pre-training on Chinese data (Cui et al., 2023). We adopt the Chinese Alpaca model, which builds upon the aforementioned Chinese LLaMA model by incorporating instruction data for fine-tuning. To carry out the fine-tuning process, we follow the official method provided by ChatGLM-6B/Chinese-Alpaca-Plus-7B, using LoRA (Hu et al., 2022b).
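As a minimal sketch of this instruction-data format (the field strings and helper function below are our own illustration, not taken from the paper's released code), a recommendation sample might be assembled as follows, with the candidate list capped at 20 products to fit the LLM's input limit:

```python
def build_recommendation_sample(user_needs, candidates, target, max_candidates=20):
    """Assemble one instruction-tuning sample for the user needs-based
    recommendation task. Candidate products (each carrying attribute
    knowledge) are truncated to `max_candidates` so the prompt fits the
    LLM's maximum input length."""
    kept = candidates[:max_candidates]
    return {
        "instruction": "Select the product that best satisfies the user's "
                       "needs from the candidate list.",
        "input": (
            "User needs: " + "; ".join(user_needs) + "\n"
            "Candidates:\n" +
            "\n".join(f"{c['id']}: {c['attributes']}" for c in kept)
        ),
        "output": target,  # ground-truth product ID, unchanged from the task
    }

sample = build_recommendation_sample(
    user_needs=["Category: thermal underwear", "Functional requirement: warmest"],
    candidates=[{"id": str(i), "attributes": "material: fleece"} for i in range(30)],
    target="7",
)
```

The truncation mirrors the limit of 20 candidates described above; the instruction wording is hypothetical.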

CRSs for E-commerce pre-sales dialogue
We introduce the method of training a conversational recommender system (CRS) for pre-sales dialogues. We adopt UniMIND (Deng et al., 2023) as our base CRS, as it focuses on multiple tasks in conversational recommendation. The recommendation candidates for UniMIND are movies, and a movie title can be generated based on the prompt. However, in E-commerce pre-sales dialogue, a recommendation candidate is a product ID, which is difficult to decode directly. Therefore, for the user needs-based recommendation task we follow a traditional user representation-based approach (Kang and McAuley, 2018).

Prompts. Following Deng et al. (2023), we define the inputs and outputs of the four tasks using a unified sequence-to-sequence paradigm. We use five special tokens to indicate information segments: (i) [user] indicates the utterance of the user; (ii) [system] indicates the response from customer service staff; (iii) [understand] indicates the needs contained in the user utterance, i.e., attributes and attribute values; (iv) [elicit] indicates the attributes that the customer service staff plans to ask about user preferences; (v) [recommend] indicates the items that have been recommended by customer service staff. For instance, the original input X can be represented as follows:

X = [user] u_1 [understand] d_1 [elicit] a_1 [recommend] e_1 [system] s_1 ... [user] u_t

where u_i is the i-th utterance of the user, s_i is the i-th response of customer service staff, d_i is the i-th user needs, a_i is the i-th attribute to be asked, and e_i is the i-th recommended product. We adopt natural language prompts (Raffel et al., 2020) to indicate each task:

Z_U = "Identify attributes and values:"
Z_A = "Select an attribute to ask:"
Z_G = "Generate a response:"

Loss functions. Following UniMIND (Deng et al., 2023), we design a unified CRS with prompt learning and multitask learning for pre-sales dialogues. The generation loss is the negative log-likelihood of the target sequence:

L_θ = -Σ_{(X,Y)∈D} Σ_{t=1}^{L} log P(y_t | y_{<t}, X, Z)
where D = {D_U, D_A, D_G} denotes the training sets of the three tasks: pre-sales dialogue understanding, user needs elicitation and pre-sales dialogue generation,
and L is the length of the generated sequence. For the user needs-based recommendation task, the recommendation probability and loss function are defined as follows:

r_i = softmax( CLS(Enc(X))^T e_i ),    L_R = -Σ_{e_i∈E} y_i log r_i

where e_i is the trainable item embedding, E is the collection of candidate products, y_i indicates whether e_i is the ground-truth product, and r_i is the recommendation probability. CLS(·) is a classifier; in practice we apply a linear layer. Enc(·) is the encoder, for which we adopt two versions: BART (Lewis et al., 2020) and CPT (Shao et al., 2021).
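A minimal, self-contained sketch of this scoring step (plain Python in place of the BART/CPT encoder; the random vectors merely stand in for CLS(Enc(X)) and the trainable item embeddings):

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def recommend_probs(user_repr, item_embeddings):
    """r_i: probability of item i, from the dot product between the user
    representation CLS(Enc(X)) and each item embedding e_i."""
    scores = [sum(u * e for u, e in zip(user_repr, emb)) for emb in item_embeddings]
    return softmax(scores)

def rec_loss(probs, gold_index):
    """Cross-entropy loss L_R for the ground-truth recommended item."""
    return -math.log(probs[gold_index])

random.seed(0)
items = [[random.gauss(0, 1) for _ in range(8)] for _ in range(20)]  # 20 candidates
user = [random.gauss(0, 1) for _ in range(8)]                        # stand-in for CLS(Enc(X))
probs = recommend_probs(user, items)
```

In the actual model both the encoder output and the item embeddings are learned jointly; this sketch only illustrates the probability and loss computation.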
Training process. We train a CRS in two stages. In the first stage we train the CRS only on the user needs-based recommendation task, i.e., L = L_R. Since BART and CPT have no pre-sales dialogue knowledge, this stage aims to warm up the CRS. In the second stage we continue to train the CRS on all four tasks, i.e., L = L_R + L_θ.
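Putting the pieces of this section together, the unified seq2seq input with special tokens and task prompts might be flattened as in the following sketch (the exact ordering of segments within a turn is our assumption; UniMIND's released code may differ):

```python
# Natural language prompts for the three generation-style tasks (as in the paper).
TASK_PROMPTS = {
    "understand": "Identify attributes and values:",
    "elicit": "Select an attribute to ask:",
    "generate": "Generate a response:",
}

def flatten_dialogue(turns, task):
    """Serialize a pre-sales dialogue into the unified seq2seq input.
    Each turn is a dict; special tokens mark the segment types."""
    parts = []
    for t in turns:
        if "user" in t:
            parts.append("[user] " + t["user"])
        if "understand" in t:
            parts.append("[understand] " + t["understand"])
        if "elicit" in t:
            parts.append("[elicit] " + t["elicit"])
        if "recommend" in t:
            parts.append("[recommend] " + t["recommend"])
        if "system" in t:
            parts.append("[system] " + t["system"])
    return " ".join(parts) + " " + TASK_PROMPTS[task]

x = flatten_dialogue(
    [{"user": "Looking for thermal underwear", "understand": "Category: thermal underwear"},
     {"elicit": "material", "system": "What material do you want?"}],
    task="elicit",
)
```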

Collaboration 1: CRS assisting LLM
We introduce the method of CRS assisting LLM. Initially, we train a CRS following the approach outlined in Section 3.3. Subsequently, we enrich the original instructions and inputs of an LLM by incorporating the prediction outcomes of the CRS. Finally, we fine-tune the LLM, as explained in §3.2, using the augmented instructions and inputs.
Enhanced instruction and input. We convert the output of a trained CRS into text and incorporate it into the input of the LLM. We also add an additional instruction, i.e., that completing the task requires considering the results of the CRS. An example is shown in the upper right corner of Figure 3. Note that in the user needs-based recommendation task, the CRS outputs a list of recommendations along with corresponding recommendation scores. We rank the candidates according to their scores and include the ranking in the input of the LLM. For the other tasks, we only consider the final output of the trained CRS.
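A sketch of this augmentation step for the recommendation task (function and field names are illustrative, not taken from the released code):

```python
def augment_llm_input(instruction, task_input, crs_scores):
    """Append the CRS's ranked candidate list to the LLM's input, and extend
    the instruction so the LLM is asked to consider the CRS results.
    `crs_scores` maps product ID -> CRS recommendation score."""
    ranking = sorted(crs_scores, key=crs_scores.get, reverse=True)
    new_instruction = (instruction +
                       " Consider the ranked suggestions from the conversational "
                       "recommender system when completing the task.")
    new_input = task_input + "\nCRS ranking: " + ", ".join(ranking)
    return new_instruction, new_input

inst, inp = augment_llm_input(
    "Select the product that satisfies the user's needs.",
    "User needs: warmest thermal underwear",
    {"item_a": 0.2, "item_b": 0.7, "item_c": 0.1},
)
```

For non-recommendation tasks, only the CRS's final text output would be appended instead of a ranking.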

Collaboration 2: LLM assisting CRS
We introduce the method of LLM assisting CRS. Initially, we fine-tune an LLM using the technique described in Section 3.2. Subsequently, we employ the prediction outcomes of the LLM to enhance the prompts and the user representation in the CRS. Finally, we train the CRS as outlined in §3.3.

Enhanced prompts. We use the prediction results of the LLM to enhance the prompts of the CRS.
For X′_U and X′_S, â_i and v̂_i are the attributes and values identified by the LLM in the utterance of the i-th turn. For X′_A, â_i is the attribute predicted for user needs elicitation in the i-th turn. For X′_G, ŝ_i is the response generated by the LLM for the pre-sales dialogue generation task.

Enhanced representation. For the recommendation task, we additionally consider the embedding of the product recommended by the fine-tuned LLM:

r_i = softmax( (CLS(Enc(X)) + ê)^T e_i )

where ê is the embedding of the product recommended by the fine-tuned LLM.
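As a simplified illustration (plain Python lists instead of trained embeddings; the `weight` parameter is our own addition, not from the paper), the enhanced user representation can be sketched as:

```python
def enhance_user_repr(user_repr, llm_item_embedding, weight=1.0):
    """Add the embedding of the product recommended by the fine-tuned LLM
    to the user representation before the CRS scores the candidates.
    `weight` (an assumption of this sketch) controls how strongly the
    LLM's suggestion influences the CRS scores."""
    return [u + weight * e for u, e in zip(user_repr, llm_item_embedding)]

user = [0.1, 0.2, 0.3]       # stand-in for CLS(Enc(X))
llm_hint = [0.5, 0.0, -0.1]  # stand-in for the LLM-recommended product embedding
enhanced = enhance_user_repr(user, llm_hint, weight=0.5)
```

The enhanced vector would then replace the original user representation in the dot-product scoring of §3.3.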
4 Experimental Settings

Research questions
To guide the remaining part of this paper, we set up two research questions:
• Are LLM and CRS complementary? Does the combination of LLM and CRS improve performance?
• How does the combination of CRS and LLM perform on different tasks and different categories? What are the differences between different collaboration methods?
In the Results and Analysis section, we systematically examine the outcomes of each task to answer the aforementioned two research questions.

Dataset
We conduct experiments on U-NEED (Liu et al., 2023b). U-NEED consists of 7,698 fine-grained annotated pre-sales dialogues: 1,662, 1,513, 1,135, 1,748 and 1,640 dialogues in the Beauty, Phones, Fashion, Shoes and Electronics categories, respectively. We follow the partition into training, validation and test sets proposed in U-NEED.

Baseline methods
For each task, the baseline methods consist of typical methods, CRS methods and LLM methods.

Typical methods. We select typical methods for the four tasks following Liu et al. (2023b). Specifically, we select Bert, Bert+CRF and Bert+BiLSTM+CRF as baselines for pre-sales dialogue understanding. For the user needs elicitation task, we select DiaMultiClass and DiaSeq as baselines. For the user needs-based recommendation task, we choose Bert, SASRec and TG-CRS. And we select GPT-2 and KBRD as baseline methods for the pre-sales dialogue generation task. Due to space constraints, we put the description of each typical method in Appendix B.

CRS and LLM methods. We select UniMIND(BART) (Deng et al., 2023) and UniMIND(CPT) as CRS methods. For LLM methods, we select ChatGLM (Zeng et al., 2023) and Chinese-Alpaca (Cui et al., 2023). For combinations of LLM and CRS, we define eight variants. We put the description of each variant in Appendix C.

Evaluation metrics
We adopt the evaluation metrics used in U-NEED (Liu et al., 2023b). Specifically, we select precision, recall and F1 score as evaluation metrics for pre-sales dialogue understanding and user needs elicitation. For the user needs-based recommendation task, we choose Hit@K and MRR@K. And we adopt automatic and human evaluation for the pre-sales dialogue generation task. For automatic evaluation, we use Distinct-n; for human evaluation, we measure the informativeness and relevance of generated responses. Due to space constraints, we put the description of each metric in Appendix D.

Results and Analysis
We conduct extensive experiments to explore the performance of collaborations between LLMs and CRSs on the four tasks. We analyze the impacts of collaborations on each task in turn.

Impacts of collaborations on pre-sales dialogue understanding
Based on Table 1, we think this may be due to the fact that user needs in the Beauty category are usually focused on specific attributes such as "skin type".

Impacts of collaborations on user needs elicitation
Based on Table 2, we further observe that (iii) Chinese-Alpaca and ChatGLM exhibit major differences in their performance in collaborations with CRSs. Specifically, CCRS-ALLM and BCRS-ALLM achieve the worst and second-worst performance on almost all metrics in all categories. This implies that the predicted results of CRSs cause a large disruption to Chinese-Alpaca. The two combinations of CRSs assisting ChatGLM, i.e., BCRS-CLLM and CCRS-CLLM, however, perform well. We think that the differences between ChatGLM and Chinese-Alpaca in collaborating with CRSs come from their base models and fine-tuning data.

Table 3: Performance of baseline methods on user needs-based recommendation task in 3 typical categories: Beauty, Fashion and Shoes. H@K and M@K refer to Hit@K and MRR@K. Acc. refers to accuracy. CLLM is short for ChatGLM and ALLM is short for Chinese-Alpaca. BCRS is short for UniMIND(BART) and CCRS is short for UniMIND(CPT). Since LLM methods only recommend 1 product, H@5 and M@5 cannot be calculated. The best results are highlighted in bold.

Impacts of collaborations on user needs-based recommendation
Based on Table 3 we have the following observations: (i) LLMs show potential for user needs-based recommendation. Specifically, the LLMs, i.e., Chinese-Alpaca and ChatGLM, achieve the best and second-best performance on Accuracy in the Shoes category, significantly outperforming all other methods. They also outperform the classical methods on the results across all 5 categories. Based on this, we think that LLMs can to some extent provide suitable recommendations when the candidate range is small (the number of candidate products in Table 3 is 20). (ii) With the collaboration of LLMs, the recommendation performance of CRSs can be improved. Specifically, on the average results across all 5 categories, ALLM-CCRS achieves 6.9%, 3.8% and 4.9% improvements on Accuracy, Hit@5 and MRR@5, respectively, compared to UniMIND(CPT). Similarly, ALLM-BCRS achieves 9.4%, 3.0% and 6.0% improvements on Accuracy, Hit@5 and MRR@5, respectively, compared to UniMIND(BART). Note that LLMs provide recommendations in a different way than CRSs do. LLMs rely on the given inputs, i.e., a recommended product is semantically related to user needs in some way. CRSs, on the other hand, model representations of both and learn the implicit relationship between the two to compute the probability of a product being recommended. The above improvements come only from CRSs considering the representations of the products recommended by the LLMs. We believe that collaborations between LLMs and CRSs on recommendation tasks go far beyond this and are a direction worth exploring. (iii) Through collaboration with CRSs, LLMs can achieve comparable recommendation performance. Specifically, in the Phones category, ChatGLM and Chinese-Alpaca have very poor recommendation performance. In contrast, the four "CRS assisting LLM" methods achieve performance close to that of the CRSs. Based on this, we think that when the recommendation
performance of LLMs is very poor in a certain domain, a collaborative approach can enable the LLMs to achieve performance close to that of CRSs.

Impacts of collaborations on pre-sales dialogue generation

Based on Table 4, the methods of collaboration between LLMs and CRSs, i.e., "LLM assisting CRS" and "CRS assisting LLM", achieve the best performance on most of the metrics. However, the improvement from collaboration is marginal compared to CRSs or LLMs alone. We believe this may be due to the fact that LLMs and UniMIND are relatively close in their approaches to generating responses, i.e., both are based on pre-trained language models and prompts. Therefore, collaboration between two similar approaches does not have much impact. In future work, we plan to consider CRSs that focus on generating persuasive reasons for recommendations, e.g., NTRD, which introduces words related to the recommended items in the decoding process. Intuitively, collaborations between such CRSs and LLMs may work out well.

Conclusions and Future Work
In this paper, we investigated the integration of conversational recommender systems (CRSs) and large language models (LLMs) in E-commerce pre-sales dialogues. Specifically, we proposed two collaboration strategies: "CRS assisting LLM" and "LLM assisting CRS". We evaluated the effectiveness of these collaborations between two LLMs and two CRSs on four tasks related to E-commerce pre-sales dialogues. Through extensive experiments and careful analysis, we found that the collaboration between CRS and LLM can be highly effective in certain scenarios, providing valuable insights for both research and practical applications in E-commerce pre-sales dialogues. Additionally, our findings can inspire exploration of collaborations between general large models and private small models in other domains and areas.
For future work, we plan to examine the collaboration between LLMs and CRSs across various categories. For instance, a CRS in the Shoes category could provide information to an LLM in the Fashion category, resulting in a recommended combination such as a dress, pants and shoes.

Limitations
One limitation of our work is that our findings may only apply to large language models (LLMs) of around 7B parameters. In this work, we select two widely used LLMs, namely ChatGLM-6B and Chinese-Alpaca-7B. Related work reveals that there is some variation in the capability of LLMs with different parameter sizes (Wei et al., 2022). For LLMs with more parameters, such as 100B, more GPU resources and time are needed to explore the effects of combining CRSs and LLMs. In addition, LLMs are constantly being updated. Recently, ChatGLM2-6B and Chinese-Alpaca2-13B have been open-sourced. They show better performance than ChatGLM-6B and Chinese-Alpaca-7B on multiple benchmarks, and may achieve better results on E-commerce pre-sales dialogues as well. However, we believe that the combination of LLMs and CRSs is still worth researching.
User needs-based recommendation aims to recommend products that satisfy explicit and implicit user needs. Explicit user needs refer to the needs and preferences expressed by the user in the ongoing dialogue, e.g., { ("Functional requirement", "Warmest"), ("Category", "Thermal underwear") }. Implicit user needs are related to user behaviors outside of the pre-sales dialogue. Users usually view some items before starting a dialogue with the customer service staff. In addition, they browse through items while talking to customer service staff. Such behaviors can reflect implicit user needs to some extent. The inputs to this task are explicit and implicit user needs, i.e., identified semantic frames and user behaviors, and the output is a collection of items.
Pre-sales dialogue generation aims to generate a response based on given information about user needs. The information consists of a collection of attributes and a collection of items. The collection of attributes is the output of the user needs elicitation task, i.e., the attributes that may elicit more information about the user's needs. The collection of items is the output of the user needs-based recommendation task, i.e., items that satisfy the current user needs. The inputs to this task are the dialogue context, the collection of attributes and the collection of items. The output is a response, which may be a query asking about an attribute, e.g., "What are your requirements for the thermal underwear?", or a reason for recommending an item, e.g., "Recommended item: 655398643290. This one is seamless and made of double-sided fleece. You can take a look."
For the pre-sales dialogue understanding task, Bert, Bert+CRF, and Bert+BiLSTM+CRF adopt the sequence labeling approach to identify the semantic frames.Bert considers only the representation of the input utterance.Bert+CRF takes into account the sequential relationships of the predicted tags in addition to the representation of the input utterance.Whereas Bert+BiLSTM+CRF adds bidirectional information encoding after obtaining the representation of the input and considers the sequential relationships of the predicted tags to compute the probability of each tag.
For the user needs elicitation task, DiaMultiClass and DiaSeq employ a multi-label classification approach to determine the collection of attributes. DiaMultiClass computes the probability of each attribute based on the representation of the inputs. DiaSeq computes the probability of each attribute based on the sequential relationship between semantic frames.
The baselines for the user needs-based recommendation task are Bert, SASRec and TG-CRS. Bert calculates the probability of an item based on the representation of the input. SASRec calculates the probability of an item based on the sequential relationships of user behaviors. TG-CRS considers both the dialogue context and sequential user behaviors.
For the pre-sales dialogue generation task, the baselines are GPT-2 and KBRD.GPT-2 is a commonly used pre-trained language model for the dialogue generation task.KBRD utilizes a switching mechanism to introduce tokens related to the recommended items during the decoding of responses.

C Combinations of LLMs and CRSs
We define four variants of LLM assisting CRS:
• CLLM-BCRS means that ChatGLM assists the BART-based CRS.
• CLLM-CCRS means that ChatGLM assists the CPT-based CRS.
• ALLM-BCRS means that Chinese-Alpaca assists the BART-based CRS.
• ALLM-CCRS means that Chinese-Alpaca assists the CPT-based CRS.
We define four variants of CRS assisting LLM:
• BCRS-CLLM means that the BART-based CRS assists ChatGLM.
• BCRS-ALLM means that the BART-based CRS assists Chinese-Alpaca.
• CCRS-CLLM means that the CPT-based CRS assists ChatGLM.
• CCRS-ALLM means that the CPT-based CRS assists Chinese-Alpaca.

D Evaluation Metrics
Following Liu et al. (2023b), for the pre-sales dialogue understanding and user needs elicitation tasks, we use Precision, Recall and F1 score as evaluation metrics. Precision is the proportion of correctly selected tags to the total number of selected tags. Recall is the ratio of correctly selected tags to the original number of correct tags. The F1 score is the harmonic mean of precision and recall.
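Over sets of predicted and gold tags, the three metrics can be computed as in this short sketch:

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall and F1 over sets of predicted and gold tags."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

# One predicted tag out of two is correct; one gold tag out of two is found.
p, r, f1 = precision_recall_f1(["material", "size"], ["material", "color"])
```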
Regarding the user needs-based recommendation task, the evaluation metrics in U-NEED (Liu et al., 2023b) are Hit@10, Hit@50 and MRR@50. Due to the limitation on the input length of LLMs, where each product contains attributes and attribute values, we can provide a maximum of 20 candidate products. Therefore, in order to compare whether the collaborative approach improves the performance of CRSs, we measure Accuracy (Hit@1), Hit@5 and MRR@5. The Hit@K metric represents the proportion of relevant items that are present in the top-K results out of all the relevant items. The MRR@K score is the average of the reciprocal ranks of the top-K items in a ranking; if an item does not appear in the top-K positions, its reciprocal rank is set to 0.
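A minimal sketch of these ranking metrics for a single test case (averaging over test cases is omitted):

```python
def hit_at_k(ranked_ids, gold_id, k):
    """1 if the ground-truth item appears in the top-K of the ranking, else 0."""
    return int(gold_id in ranked_ids[:k])

def mrr_at_k(ranked_ids, gold_id, k):
    """Reciprocal rank of the ground-truth item within the top-K, else 0."""
    top = ranked_ids[:k]
    return 1.0 / (top.index(gold_id) + 1) if gold_id in top else 0.0

ranking = ["item_b", "item_a", "item_c"]  # hypothetical CRS output, best first
```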
The evaluation metrics for the pre-sales dialogue generation task are Distinct@1, Informativeness and Relevance. Distinct@1 is computed as the average fraction of distinct 1-grams out of all 1-grams in a response, and measures the diversity of generated responses. Informativeness and Relevance are used for human evaluation. We randomly sample 100 dialogues and recruit 12 annotators to evaluate 1,400 responses from 14 methods on these 100 dialogues. Informativeness is calculated as the average informativeness of all generated responses, and Relevance as the average relevance degree of all generated responses.
The annotators evaluate the extent to which a generated response includes information about the product, as compared to the ground truth. The scores for informativeness and relevance range from 1 to 5, and we calculate the average score from all annotators to obtain the final score.
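For reference, the automatic Distinct@1 metric described above can be sketched in a few lines of pure Python:

```python
def distinct_1(responses):
    """Average fraction of distinct unigrams per response (Distinct@1).
    Responses are whitespace-tokenized; empty responses are skipped."""
    ratios = []
    for resp in responses:
        tokens = resp.split()
        if tokens:
            ratios.append(len(set(tokens)) / len(tokens))
    return sum(ratios) / len(ratios) if ratios else 0.0

# "a a b" has 2 distinct of 3 tokens; "x y" has 2 of 2.
score = distinct_1(["a a b", "x y"])
```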

E Implementation Details
We implement the CRSs based on UniMIND. The code is available online. For CRSs, we use an NVIDIA A100-SXM4-80GB GPU and train the model for 10 epochs, taking approximately 12 hours. For LLMs, we use an NVIDIA A100-SXM4-80GB GPU and train the model for 3 epochs, taking approximately 9 hours.

F Examples of Fine-tuning LLMs
We give examples of the instructions, inputs, and outputs used to fine-tune the LLMs for each task in Tables 5, 6, 7, and 8, respectively.

G Examples of Collaborations of CRSs and LLMs
We show examples of inputs (instructions) for collaborations between CRSs and LLMs on pre-sales dialogue understanding and generation tasks in Fig. 4 and Fig. 5, respectively.

Example Translation
According to the pre-sales dialogue in Appliance category, select a series of attributes to guide users to provide more preference information about needs.Attribute values may or may not be included in the result.

Figure 1 :
Figure 1: Example of E-commerce pre-sales dialogue.

Figure 3 :
Figure 3: An example of collaboration between CRS and LLM on the user needs elicitation task. The left side shows the input and output of the task. The middle displays the data used to fine-tune an LLM and train a CRS independently. The right side shows two cases of combining the two. Collaboration content is highlighted in red italics.

Figure 4 :
Figure 4: An example of collaboration between CRS and LLM on the pre-sales dialogue understanding task. The left side displays the data used to fine-tune an LLM and train a CRS independently. The right side shows two cases of combining the two. Collaboration content is highlighted in red italics.

Figure 5 :
Figure 5: An example of collaboration between CRS and LLM on the pre-sales dialogue generation task. The left side displays the data used to fine-tune an LLM and train a CRS independently. The right side shows two cases of combining the two. Collaboration content is highlighted in red italics.

Table 1:
Performance of baseline methods on pre-sales dialogue understanding task in 3 typical categories: Beauty, Fashion and Shoes. Baseline results marked with * are taken from U-NEED (Liu et al., 2023b). CLLM is short for ChatGLM and ALLM is short for Chinese-Alpaca. BCRS is short for UniMIND(BART) and CCRS is short for UniMIND(CPT). The best results are highlighted in bold.

From Table 1, we observe that collaborations with CRSs help LLMs perform better in understanding user needs. BCRS-CLLM outperforms ChatGLM in all categories and all metrics, especially in Shoes and Phones. Similarly, we observe that BCRS-ALLM outperforms Chinese-Alpaca in some metrics. The improvement in understanding user needs brought by the collaboration of CRSs varies across categories: in the Shoes and Phones categories, BCRS-CLLM significantly outperforms ChatGLM with the collaboration of CRSs, while in the Beauty category, BCRS-CLLM achieves only a minor improvement over ChatGLM. By carefully comparing the results of the four combinations of "CRS assisting LLM" with the results of the four methods of "No collaboration", we find that a better performing CRS improves the performance, while a worse performing CRS degrades it. Since LLMs are usually very sensitive to inputs, we think that the former may provide useful reference information to complement what the LLMs do not take into account. The latter, on the other hand, may bring in noise that disturbs the judgments of the LLMs.

Table 2:
Performance of baseline methods on user needs elicitation task in 3 typical categories: Beauty, Fashion and Shoes. Baseline results marked with * are taken from U-NEED (Liu et al., 2023b). CLLM is short for ChatGLM and ALLM is short for Chinese-Alpaca. BCRS is short for UniMIND(BART) and CCRS is short for UniMIND(CPT). The best results are highlighted in bold.

Based on Table 2, we have the following observations: (i) LLMs do not exhibit superb performance on user needs elicitation. On the average results of the 5 categories, UniMIND(BART) achieves the best performance, followed by the classical method DiaMultiClass and UniMIND(CPT). Moreover, DiaMultiClass beats all the methods on the Recall metrics in the Beauty, Shoes and Phones categories. This indicates that in E-commerce pre-sales dialogue, the performance of methods that make decisions with the help of generative models, e.g. BART and LLMs, is somewhat limited.

Table 4 :
Performance of baseline methods on pre-sales dialogue generation task in 3 typical categories: Beauty, Fashion and Shoes. CLLM is short for ChatGLM and ALLM is short for Chinese-Alpaca. BCRS is short for UniMIND(BART) and CCRS is short for UniMIND(CPT). Info. and Rel. refer to informativeness and relevance. The best results are highlighted in bold.

Table 5 :
Three examples of fine-tuning LLMs for the pre-sales dialogue understanding task in the Appliance, Beauty and Fashion categories.

Combined with the pre-sales dialogue in Beauty category, identify the product-related attribute values and corresponding attributes involved in the input of the current user or customer service. For user input, attributes and attribute values need to be identified. The attribute value in the customer service input may be empty.

Combined with the pre-sales dialogue in Fashion category, identify the product-related attribute values and corresponding attributes involved in the input of the current user or customer service. For user input, attributes and attribute values need to be identified. The attribute value in the customer service input may be empty.

Pre-sale dialogue: User: Do you have any other recommendations? [SEP] Customer service: What material do you want, dear? [SEP] User: It's the ice silk one [SEP] Customer service: Only send product links [SEP] Customer service: This is made of modal material with ice silk touch [SEP] Customer service: Only send product links [SEP] Customer service: This is made of ice silk [SEP] User: Which one is better to wear in winter

Current input: User: My son likes to wear the brand of Jinlilai.

Table 6 :
Three examples of fine-tuning LLMs for user needs elicitation task in Appliance, Beauty and Fashion categories.
Pre-sale conversation: User: Help me recommend an ordinary washing machine with high cost performance, durable, and no drying function [SEP] Customer service: Wave wheel or drum [SEP] User: Drum [SEP] User: Simple function [SEP] Customer service: Only send product links [SEP] Customer service: Only send product links [SEP] User: Do you have Little Swan? (1) Azimuth massage, as gentle as washing by hand, reshaping the cleanliness and softness of the clothes between rubbing; (2) Boil and wash at 95 degrees high temperature, sweeping away the viruses and bacteria hidden in the fibers of the clothes, long-term sterilization and disinfection, 99.9% healthy sterilization; (3) WiFi mobile phone remote control, anytime, anywhere; (4) Special down-jacket wash, water enters in multiple stages, the washing cycle is soft, preventing the down jacket from floating on the water or being damaged, even washing and care; (5) BLDC inverter motor, dehydration is faster and more thorough, clean with less residue [SEP] User: Which one of the wave wheel is more cost-effective and durable?

According to the pre-sales dialogue in Beauty category, select a series of attributes to guide users to provide more preference information about needs. Attribute values may or may not be included in the result.

According to the pre-sales dialogue in Fashion category, select a series of attributes to guide users to provide more preference information about needs. Attribute values may or may not be included in the result.