End-to-End Conversational Search for Online Shopping with Utterance Transfer

Successful conversational search systems can offer a natural, adaptive, and interactive shopping experience to online shopping customers. However, building such systems from scratch faces real-world challenges from both imperfect product schema/knowledge and the lack of training dialog data. In this work we first propose ConvSearch, an end-to-end conversational search system that deeply combines the dialog system with search. It leverages text profiles to retrieve products, which is more robust against imperfect product schema/knowledge than using product attributes alone. We then address the data-scarcity challenge by proposing an utterance transfer approach that generates dialogue utterances from existing dialogs in other domains, and by leveraging search behavior data from an e-commerce retailer. With utterance transfer, we introduce a new conversational search dataset for online shopping. Experiments show that our utterance transfer method can significantly improve the availability of training dialogue data without crowd-sourcing, and that the conversational search system significantly outperforms the best tested baseline.


Introduction
Search systems play a significant role in today's online shopping experience. In conventional e-commerce search systems, users interact with the system by typing keywords, followed by product clicks or keyword modifications, depending on whether the returned product list matches user expectations. The recent success of intelligent assistants such as Alexa, Google Now, and Siri enables users to interact with search systems using natural language. For online shopping in particular, it becomes alluring that users can navigate through products with conversations like traditional in-store shopping, guided by a knowledgeable yet thoughtful virtual shopping assistant.

* Work performed during internship at Amazon.
However, building a successful conversational search system for online shopping faces at least two real-world challenges. The first challenge is the imperfect product attribute schema and product knowledge. While this challenge also applies to traditional search systems, it is more problematic for conversational search because the latter depends on product attributes to link lengthy multi-turn utterances (in contrast to short queries in conventional search) with products. Most previous conversational shopping search work (Li et al., 2018a; Bi et al., 2019; Yan et al., 2017) looks for the target product through direct attribute matching, assuming the availability of complete product knowledge in structured form. In practice, this assumption rarely holds, and systems designed with this assumption suffer from losses in product recall.
The second challenge is the lack of an in-domain dialog dataset for model training. Constructing a large-scale dialog dataset by crowd-sourcing from scratch is inefficient. Popular approaches include Machines-Talking-To-Machines (M2M) (Shah et al., 2018), which generates outlines of dialogs by self-play between two machines, and Wizard-of-Oz (WoZ) (Kelley, 1984), which collects data through virtual conversations between annotators. Note that both approaches require manually written utterances. In addition, a line of other work (Lei et al., 2020; Luo et al., 2020; Bi et al., 2019) constructs conversations from review datasets such as Amazon Product Data (McAuley et al., 2015) and LastFM, but the usage of these datasets is limited to sub-tasks (e.g., dialog policy) due to the absence of user utterances. Saha et al. (2018) collected a dialog dataset for fashion product shopping. However, the described method requires dozens of domain experts to manually create the dialogs, and the dataset can hardly be generalized beyond fashion shopping given the lack of utterance annotations.
To address the first challenge of imperfect attribute schema and product knowledge, we propose ConvSearch, an end-to-end conversational search system that deeply combines the dialog and search systems to improve search performance. In particular, the Product Search module leverages both structured product attributes and unstructured product text (e.g., profiles), where the product text may contain phrases matching utterances when the schema is incomplete or when a product attribute value is missing. Put together, our system has the advantages of both reduced error accumulation across individual modules and enhanced robustness against product schema/knowledge gaps.
To address the second challenge, the lack of an in-domain dialog dataset, we propose a jump-start dialog generation method, M2M-UT, which 1) generates utterances from existing dialogues of similar domains (e.g., movie ticketing), and 2) builds dialog outlines from e-commerce search behavior data and fills them with the generated utterances. The proposed approach significantly reduces the manual effort of data construction, and as a result we introduce a new conversational shopping search dataset, CSD-UT, with 942K utterances. Note that although the dialogue dataset construction focuses on shopping, the approach described here can be adapted to other task-oriented conversations as well, which we leave to future work. Our contributions are summarized as follows:
• We proposed an end-to-end conversational search system which deeply combines dialog with search, and leverages both structured product attributes and unstructured text in product search to compensate for incomplete product schema/knowledge.
• We proposed a new dialog dataset construction approach, which transfers utterances from dialogs of similar domains and builds dialogues from user behavior records. Using this approach, which significantly reduces manual work compared with existing approaches, we introduced a new conversational search dataset for online shopping.
• Extensive experiments show that our system outperforms the evaluated competitors on the success rate (SR@5) metric.

Related Work
Conversational Search System The conversational search task aims to understand a user's search intent through multi-round conversational interactions and return the desired item to the user. Due to the lack of annotated dialog utterances specific to conversational search tasks, previous work either adopted rule-based utterance parsing or focused only on dialog policy. Yan et al. (2017) proposed a rule-based approach to cold-start online shopping dialog systems utilizing user search logs and intent phrases collected from community sites. In another line of work, Luo et al. (2020) and Zhang et al. (2018) utilized the Amazon review dataset, while Lei et al. (2020) and Li et al. (2018a) revised user reviews from Yelp and LastFM; all of these focused on the conversation policy without utterance understanding. As a comparison, in this paper we focus on an end-to-end conversational search system, which fuses utterance understanding and product search together through multi-task learning.
Constructing Dialog Datasets for Online Shopping Rastogi et al. (2020) proposed a crowd-sourced version of the Wizard-of-Oz (WoZ) paradigm for collecting domain-specific corpora.
In this system, users and wizards were given a predefined task to complete (e.g., find a Chinese restaurant in the North). To avoid distracting latency, users and wizards were asked to contribute just a single turn per dialogue. Saha et al. (2018) built a multi-modal dialog system for fashion with experts and in-house labor. They crawled 1 million fashion items from the web, hand-crafted a taxonomy for the items, identified the set of fashion attributes, and employed experts to write dialogs. The described methods are highly labor-consuming, and the published dataset does not contain attribute annotations on utterances, making it hard to train utterance understanding models. The approach of Yan et al. (2017) mines shopping phrases from community sites and uses crowd-sourcing to label utterance intents. Although less labor-intensive, this work did not construct full dialogs. As a comparison, in this paper we construct full shopping search dialogs from real user behavior data, with user utterances filled in by transfer from existing dialogs of similar domains.

Figure 1: Illustration of our end-to-end conversational search system. The State Tracker module takes utterances and predicts the dialog state S_t using a sequence-to-sequence transformer. The Product Search module matches products represented by transformers with query representations q_t using a multi-head attention mechanism. The Dialog Policy module takes as input S_t, the intent, and the ranked product list, and decides the responses. The NLG module composes system responses as instructed by Dialog Policy and displays them to the user.

ConvSearch System

To address the challenge of imperfect product attribute schema/knowledge, our Product Search module leverages both structured product attributes and unstructured text. To let the modules mutually benefit from each other's learning, we integrate the State Tracker and Product Search together through multi-task learning, and build an end-to-end trainable search system.

State Tracker
Unlike previous work that treats state tracking as a multi-label classification task (Zhu et al., 2020; Wen et al., 2017a), we redefine state tracking as a sequence-to-sequence problem. As shown in Figure 1, we link the slots and values of the dialog state with special delimiter tokens, turning the state into a sequence. We then employ a transformer network to translate dialog turns into the state: it encodes the dialog turns with a bidirectional transformer encoder and generates the state sequence autoregressively.
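As a concrete illustration, the linearization of a dialog state into a delimiter-separated sequence (and its inverse, used to read the decoder output back into slot-value pairs) might look like the sketch below. The exact delimiter tokens (`<slot>`, `<value>`, `;`) are assumptions of this sketch; the paper only specifies that slots and values are linked with special delimiter tokens.

```python
def serialize_state(state):
    """Linearize a dialog state dict {slot: [values]} into a sequence.

    The delimiter tokens used here are illustrative assumptions.
    """
    parts = []
    for slot, values in state.items():
        parts.append("<slot> " + slot + " <value> " + " , ".join(values))
    return " ; ".join(parts)


def deserialize_state(seq):
    """Invert serialize_state, recovering the slot -> values mapping."""
    state = {}
    for chunk in seq.split(" ; "):
        if not chunk:
            continue
        slot_part, value_part = chunk.split(" <value> ")
        slot = slot_part.replace("<slot> ", "").strip()
        state[slot] = [v.strip() for v in value_part.split(" , ")]
    return state
```

A round trip through these two functions is lossless, which is what lets the decoder's output string be interpreted directly as a structured state.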
At each turn in the dialog, the State Tracker module outputs 1) the dialog state S and 2) the user utterance intent I, where S consists of attribute values grouped by product attribute, representing the system's tracking of the user's preferred search criteria, and the intent I ∈ I is an enumerable value from I = {request, inform, ask_attribute_in_n, buy_n}.
Formally, given a dialog at turn t, we have the full history {R_0, U_0, R_1, U_1, ..., R_t, U_t}, where R_t and U_t are the system response and user utterance at turn t respectively. We then use a transformer model (Devlin et al., 2019) to predict the string S_t, the state at turn t:

S_t = trans(concat(R_0, U_0, ..., R_t, U_t)),    (1)

where trans(·) is the transformer and concat(·) is a string concatenation function. For state prediction, we use the cross-entropy loss

L_S = − Σ_i log P_t(y^{i*}_t),

where y^{i*}_t denotes the ground-truth value of the i-th item of the output sequence at turn t.
We also use an MLP layer to predict the intent at turn t:

P_t(I_i) = softmax(W_I u_t + b_I)_i,

where u_t is the mean pooling of the last-layer output of the encoder in Equation (1), W_I and b_I are trainable parameters, and P_t(I_i) represents the likelihood of intent I_i ∈ I for the user utterance at turn t. We use the following loss for intent prediction:

L_I = − log P_t(I*_i),

where I*_i is the ground-truth intent at turn t.

Product Search
At each turn t, given the current state, the Product Search module estimates a matching likelihood P_t(p_j) for each product p_j, and then ranks the products to be displayed to the user (Figure 1).

Query Representation
We represent the product query as q t = u t ⊕ s t , where s t is the state representation obtained by mean pooling the last layer of decoder in Equation (1), and ⊕ denotes the vector concatenation operator.
Product Embedding We represent the j-th product as p_j = d_j ⊕ a_j, where d_j is the mean pooling of the last encoding layer of trans(description text of p_j), and a_j is the product attribute embedding. In particular, we obtain a_j by mean pooling the last layer of trans(attribute sequence of p_j), where the attribute sequence is constructed in the same way as the state sequence. The introduction of the profile embedding d_j compensates for missing matching clues when the product schema is incomplete or attribute values are absent, since such clues may still be extracted from the product text.
Search with Multi-Head Attention We use a multi-head attention mechanism to match queries and products. At dialog turn t, we first calculate a product context vector head^k_t based on the glimpse operation (Vinyals et al., 2016):

α^k_{t,j} = softmax(v^k_s · tanh(W^k_p p_j + W^k_q q_t)),
head^k_t = Σ_j α^k_{t,j} W^k_p p_j,

where α^k_t are attention weights, and W^k_p, W^k_q, and v^k_s are trainable parameters for head k. We then concatenate the K attention heads, each with its individual parameter set: head_t = ⊕_{0≤k≤K} head^k_t. We then form the likelihood of product p_j at turn t as:

P_t(p_j) = softmax(v_p · tanh(W_p p_j + W_h head_t)),

where v_p, W_p, and W_h are trainable parameters. We use the following loss for the product search task:

L_P = − log P_t(p*_j),

where p*_j is the ground-truth (purchased) product. Finally, we rank products by their likelihood and return the top products to the Dialog Policy module for display.
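A minimal, stdlib-only sketch of a single glimpse head may help make the attention step concrete. Here a plain dot product stands in for the learned scoring v^k_s · tanh(W^k_p p_j + W^k_q q_t), and the trainable projections and multi-head concatenation are omitted, so this is an illustration of the mechanism rather than the paper's exact parameterization:

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def glimpse(query, products):
    """Single-head glimpse over product vectors (Vinyals et al., 2016).

    Scores each product against the query (dot product standing in for
    the learned scoring function), then returns the attention weights
    and the attention-weighted sum of the product vectors.
    """
    scores = [sum(q * p for q, p in zip(query, prod)) for prod in products]
    alphas = softmax(scores)
    dim = len(products[0])
    context = [sum(a * prod[d] for a, prod in zip(alphas, products))
               for d in range(dim)]
    return alphas, context
```

In the full model, K such heads with separate parameter sets are concatenated into head_t before the final likelihood layer.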

Multi-task Learning
Our end-to-end training links all three tasks (state prediction, intent prediction, and product search) together through multi-task learning:

L = α L_S + β L_I + γ L_P,

where α, β, and γ are tunable hyper-parameters. With multi-task learning, the three tasks can enhance each other through shared weights and backpropagated errors. The training data requires intent and attribute annotations for each utterance, and the purchased product, with product attributes and (optional) text profiles, associated with each dialog.

Dialog Policy and Natural Language Generation
During the conversation, the agent needs to propose additional attributes for the user to narrow down the search. When triggered, we filter our product knowledge base using the current state S to retrieve products matching the criteria, then use EMDM (Entropy Minimization Dialog Management) (Wu et al., 2015) to select the proposed attribute with maximum entropy among the filtered products, and show the user the recommended narrowing-down question.
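The EMDM attribute choice itself is easy to sketch: among attributes not yet asked, propose the one whose value distribution over the currently matching products has maximum entropy, i.e. whose answer is expected to narrow the set the most. The stdlib-only function below, with hypothetical dict-based product records, illustrates the idea:

```python
import math
from collections import Counter


def emdm_pick_attribute(products, asked):
    """EMDM-style attribute selection (Wu et al., 2015), sketched.

    products: list of dicts mapping attribute name -> value.
    asked: attributes already proposed in earlier turns.
    Returns the not-yet-asked attribute with maximum value entropy,
    or None if no candidate remains.
    """
    best_attr, best_entropy = None, -1.0
    attrs = {a for p in products for a in p}
    for attr in attrs - set(asked):
        counts = Counter(p.get(attr) for p in products
                         if p.get(attr) is not None)
        total = sum(counts.values())
        if total == 0:
            continue
        entropy = -sum((c / total) * math.log(c / total)
                       for c in counts.values())
        if entropy > best_entropy:
            best_attr, best_entropy = attr, entropy
    return best_attr
```

An attribute shared by all matching products has zero entropy and is never proposed ahead of one that actually splits the candidate set.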
The Natural Language Generation module translates the action decision from the Dialog Policy module to natural language, e.g. request(brand) → Do you have a brand in mind?. In this paper we simply use manually written agent templates.

Dialogue Dataset Construction
We address the lack of conversational shopping search training data by proposing M2M-UT, a method that automatically constructs dialog datasets. Unlike previous work (Saha et al., 2018) that relies on crowd-sourcing to generate utterances, M2M-UT generates utterances automatically via transfer.
We hypothesize that the conversation between the user and the shopping agent is guided by the customer's intents, which 1) span the user's natural-language utterances, and 2) change according to the agent's responses. Therefore, our dataset construction has two steps: 1) we use utterance transfer (UT) to generate utterances from existing dialog datasets of similar domains, and 2) we generate the outline of the dialog from customer browsing records using Machines-Talking-To-Machines (M2M) (Shah et al., 2018).

Utterance Generation by Transfer
For utterance generation, widely used methods such as WoZ and M2M still require workers to create the varied utterances, and thus are not easy to scale up to the shopping conversation application.

Figure 2: Utterance generation algorithm for producing varied utterances in the coffee shopping domain. The utterance example in this figure is from the MDC dataset (Li et al., 2018b). An utterance is first transferred to our domain with the help of a constituency parser and then paraphrased to enhance variance.

We found that dialogues from existing task-oriented domains such as movie ticketing or restaurant reservation contain rich forms of utterances similar to shopping; for example, "... sounds good" appears in both movie ticketing and shopping conversations. We propose utterance transfer (UT), a novel approach that generates shopping utterances from related task-oriented domains. As shown in Figure 2, UT consists of five stages. (1) Remove redundant phrases: we remove phrases that are not commonly seen in online shopping (e.g., location and time) using syntax rules. We employ a constituency parser (Kitaev and Klein, 2018) to obtain the syntax tree of the sentence and remove the PPs (prepositional phrases) and NPs (noun phrases) referring to location and time. (2) Replace values with slots: we identify and replace values with slots according to the original dataset annotations. For example, in Figure 2 we identify the value "superhero" using the annotation and replace it with the slot "<description>". This step turns a complete utterance into a template. (3) Keyword replacement: we replace verbs and nouns with those from the online shopping domain using rules, e.g., "movie" to "coffee" and "watch" to "drink". (4) Fill slots: we fill the slots with values according to the user's action. (5) Paraphrase: to augment the diversity of utterances, we use a fine-tuned T5 model (Raffel et al., 2020) to paraphrase the utterance.

Table 1 (excerpt): U: What roast type is it in the second image? S: inform(roast_type=medium roast) It is medium roast. U: buy_n(index=2) I will buy the second one. S: notify_success() Your order has been placed.
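Stages (2)-(4) of the UT pipeline amount to simple string rewriting once the annotations and keyword rules are given. The sketch below (with hypothetical rule tables; stages (1) and (5) are omitted, since they rely on a constituency parser and a T5 model) walks the Figure 2 example through these steps:

```python
def transfer_utterance(utterance, annotations, keyword_map, user_values):
    """Sketch of UT stages (2)-(4) on an already de-redundified utterance.

    annotations: annotated source values -> target slot names.
    keyword_map: source-domain keywords -> shopping-domain keywords.
    user_values: slot fillers taken from the dialog outline.
    Returns (template, filled_utterance).
    """
    # (2) Replace annotated values with slot placeholders -> template.
    template = utterance
    for value, slot in annotations.items():
        template = template.replace(value, "<" + slot + ">")
    # (3) Replace source-domain keywords with shopping-domain ones.
    for src, tgt in keyword_map.items():
        template = template.replace(src, tgt)
    # (4) Fill the slots with values from the user's action.
    filled = template
    for slot, value in user_values.items():
        filled = filled.replace("<" + slot + ">", value)
    return template, filled
```

For instance, a movie-domain utterance with "superhero" annotated as a description value becomes a shopping template, and the template is then filled from the outline.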
Paraphrase One pitfall of utterances generated by templates and rules is the lack of diversity, whereas real conversations usually contain various ways of expressing the same intents. As paraphrasing can improve the performance of a dialog system, we employ a pre-trained neural paraphrase model to augment the variance of the templates. Specifically, we use a T5 model (Text-to-Text Transfer Transformer) (Raffel et al., 2020) fine-tuned on a paraphrase dataset, Quora Question Pairs.

Dialog Generation
Our online shopping dialogs in conversational search are supported by dialog outlines, which consist of intents and their parameters. For the user utterance intents shown in Table 1, the parameters are typically a list of product attributes with their values. For agent intents, the parameters are either attribute values or operation parameters that the agent should execute (e.g., push(top_5)). Similar to the dialog system presented in Section 3, we use a state to track the agent's understanding of the user's search criteria.
We use real e-commerce search behavior data to supervise the construction of the intent flow in the dialog. Each anonymous search session contains a query and the finally purchased product. We first extract product attribute values from the search keywords as the initial attributes the customer is interested in, i.e., the initial state. We then follow M2M (Machines-Talking-To-Machines) (Shah et al., 2018) to generate the transitions of the dialogue outline turn by turn. M2M runs in a self-play manner, simulating the dialog with a user simulator and a system agent. We build an agenda-based user simulator initialized by the search behavior data, and use a finite state machine (Hopcroft et al., 2007) as the system agent.
By comparing the initial state and the finally purchased product, we find that users are not always aware of the full search criteria at the beginning; the dialog is therefore constructed to simulate how the agent helps the user fill this gap through attribute refinement. Specifically, as shown in Table 1, the user starts with the initial state (e.g., flavor=vanilla). Given the current state, the agent in the next turn proposes a new attribute (e.g., brand) using the EMDM (Entropy Minimization Dialog Management) policy (Wu et al., 2015) to narrow down the search. The user's response in the next turn is based on the attribute value of the purchased product (e.g., brand=Folgers), which also updates the state. The agent then displays a list of products (e.g., push(top_5)). If the purchased product appears in the pushed list, the user asks questions, commits the purchase, and ends the dialog (successful). Otherwise the agent proposes a new attribute and continues the conversation. The dialog ends when its length exceeds 20 turns (unsuccessful).
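The outline-generation loop described above can be sketched as a small self-play simulation. This is a simplified illustration under stated assumptions: a fixed `attribute_order` stands in for EMDM's per-turn attribute choice, and `search_top5` stands in for the retrieval step; both names are hypothetical.

```python
def generate_outline(initial_state, purchased, attribute_order,
                     search_top5, max_turns=20):
    """Self-play outline generation in the spirit of M2M (Shah et al., 2018).

    initial_state: attribute values extracted from the search query.
    purchased: attribute dict of the finally purchased product.
    Returns (outline, success), where outline is a list of
    (speaker, intent, parameters) tuples.
    """
    state = dict(initial_state)
    outline = [("U", "inform", dict(state))]
    success = False
    for attr in attribute_order:
        if len(outline) >= max_turns:
            break
        if attr not in state:
            # Agent proposes a new attribute; the user answers from the
            # attributes of the finally purchased product.
            outline.append(("S", "request", attr))
            state[attr] = purchased[attr]
            outline.append(("U", "inform", {attr: purchased[attr]}))
        top5 = search_top5(state)
        outline.append(("S", "push", top5))
        if purchased in top5:
            outline.append(("U", "buy_n", top5.index(purchased) + 1))
            success = True
            break
    return outline, success
```

Running this with a user simulator seeded from a real search session yields exactly the alternating request/inform/push flow shown in Table 1.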
We finally translate the generated outlines into natural language using the corresponding utterance templates generated after step (3) in Section 4.1, and finalize the utterances following steps (4) and (5) in Section 4.1. After these steps, we have complete shopping search dialogs.

Datasets
Our dataset includes three parts: user search behavior data, dialogs, and product knowledge base.
The user search behavior data is a collection of user search keywords and their finally purchased products, sampled from an e-commerce platform. We applied the dialog generation method described in Section 4 to the coffee shopping domain. We leveraged the utterances from the MDC (Li et al., 2018b) and MMD (Saha et al., 2018) datasets and transferred 4 intents from their domains (i.e., movie ticketing, restaurant reservation, fashion shopping), which generated 49,999 dialogs, each containing on average 18.85 turns (Table 2). In addition, we built a gold-standard test set of 196 dialogs manually written by workers to evaluate performance. For the product knowledge base, we sampled 154,161 coffee products from the e-commerce platform. As shown in Table 3, each product has a text profile with an average of 17.34 tokens, together with attribute-value pairs for 13 different attributes. The vacancy ratio of values is 32.16%, which indicates missing attribute values for many products.

Settings
Hyper-parameters All the transformers used in the experiments have 4 sublayers with a hidden size of 256, and a word2vec (Mikolov et al., 2013) embedding of 256 dimensions is trained to initialize the embedding matrix. Our model uses a vocabulary of 50,257 entries for text embedding and 14,700 entries for attribute embedding. The models in the experiments were trained with the AdamW (Loshchilov and Hutter, 2017) optimizer, with an initial learning rate of 1e-4 and a batch size of 16. The initial learning rate was selected based on validation loss. We used a learning rate scheduler that cuts the learning rate in half every time the performance drops, and stopped training once the performance had three straight drops. Our model was trained on an Nvidia Tesla P100 machine with 16G memory, and the strongest model (ConvSearch w/ Neural Search (attr.&text.)) took 35 hours to converge. For multi-task learning, we simply set α, β, and γ to 1. To save memory, we let the encoder of the state tracker and the encoder of the profile share parameters, and employed tf·idf to narrow the search space down to 400 products for the product search module.

Table 4: Evaluation of the end-to-end system. attr. and text. denote attribute and product text respectively. The best score per metric is in bold. Our model outperforms the competitors by 6.64%. Rule search employs direct attribute matching as in traditional work.
  Model | SR@5 | SR@10
  TC-bot | 35.71 | 51.02
  ConvLab-2 (Zhu et al., 2020) | 44 | -
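The tf·idf pre-filtering step can be sketched with the stdlib alone: index the product profiles, score them against the query by a tf·idf dot product, and keep the top k (400 in our setting) as the neural module's candidate set. Tokenization and weight normalization are simplified assumptions here:

```python
import math
from collections import Counter


def build_tfidf_index(docs):
    """docs: list of token lists (product profiles). Returns (idf, vecs)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: tf[t] * idf[t] for t in tf})
    return idf, vecs


def narrow(query_tokens, idf, vecs, k=400):
    """Keep the top-k products by tf-idf dot product with the query.

    Returns the indices of the retained products, which form the
    search space handed to the neural Product Search module.
    """
    qtf = Counter(query_tokens)
    qvec = {t: qtf[t] * idf.get(t, 0.0) for t in qtf}
    scores = [(sum(qvec.get(t, 0.0) * w for t, w in v.items()), i)
              for i, v in enumerate(vecs)]
    scores.sort(reverse=True)
    return [i for _, i in scores[:k]]
```

Shrinking 154K products to a few hundred candidates this way is what keeps the attention-based search tractable at training time.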

Evaluation Metrics
We use the success rate (SR@t) to measure the ratio of successful conversations, i.e., those in which the ground-truth item is recommended within t turns. We set the max turn t of a session to 5 or 10 and standardized the recommended list length at 5. In addition, we used recall, precision, and F1 to evaluate state prediction performance, reported separately for slots and values.
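SR@t itself reduces to a simple check over logged sessions. The sketch below assumes a hypothetical session format: one ranked product-id list per system push, plus the ground-truth purchased product.

```python
def success_rate(sessions, t, list_len=5):
    """SR@t: fraction of sessions in which the ground-truth product
    appears in a pushed top-`list_len` list within the first t turns.

    sessions: list of (turns, target), where turns is a list of ranked
    product-id lists, one per system push, and target is the id of the
    ground-truth purchased product.
    """
    ok = 0
    for turns, target in sessions:
        if any(target in ranked[:list_len] for ranked in turns[:t]):
            ok += 1
    return ok / len(sessions)
```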
Baselines For the state tracking task, we compared against the following baselines: e2e-Trainable (Wen et al., 2017b), which encodes utterances with a convolutional neural network (CNN), and ZS-DST (Rastogi et al., 2020), a BERT-based model which first judges the presence of each slot and then predicts its start and end locations. We also constructed a baseline by replacing the transformers in our system with one-layer LSTMs. For the end-to-end system, we compared against two baselines: TC-bot, a modularized neural dialogue system for task completion, and ConvLab-2 (Zhu et al., 2020), an open-source toolkit for building, evaluating, and diagnosing task-oriented dialogue systems.

Table 6: Independent evaluation of the search task. This experiment shows the benefit of combining the product text profile and attributes for search. attr. is an abbreviation for product attribute. The best score per metric is in bold.

End-to-End System Evaluation
As shown in Table 4, our model outperforms the baselines significantly, by 6.64%. This indicates the effectiveness of our end-to-end framework that deeply combines the dialog and search systems, while ablation studies (last three rows in Table 4) also confirm that leveraging both product text and attributes performs better than using either alone. Table 5 shows performance comparisons on the state tracking task. Our method outperforms all baselines on both state prediction and intent prediction, because our state tracker better embeds the context by concatenating the turns' language together. We also found that the State Tracker alone, without the Product Search task, showed lower performance, suggesting the effectiveness of multi-task learning. Table 6 shows ablation studies of the Product Search module, along with comparisons against a simple tf·idf baseline. In particular, after the 3rd turn of each dialog, we selected the top-5 products with the highest probability from the list returned by the Product Search module, and calculated recall, precision, and F1 against the ground-truth purchased product. The end-to-end search improved search recall by 4.69 times over the tf·idf baseline. The improvement from combining text and attribute embeddings again suggests the benefit of combining product text and attributes in the search task.

Dialog Generation Method Evaluation
We next conducted ablation studies on the data construction method, evaluating the effectiveness of each component via the performance of the State Tracker task. For each configuration in Table 7, we trained the State Tracker module with the corresponding dataset and report the performance on a manually prepared test set. As shown in the table, module performance degrades without syntax analysis, since redundant phrases (e.g., time, location) are not removed from the utterances. Similarly, performance degrades without paraphrasing, since language variance is weakened. These results suggest that both removing redundancy with syntax analysis and increasing variance with paraphrasing are effective at improving training dataset quality.

Human Evaluation
We also performed human evaluations of system responses. For each method, we collected 100 dialogs and asked three workers to evaluate them on three metrics: coherence, fluency, and appropriateness. All metrics have five grades, from 1 (worst) to 5 (best), where 3 denotes 'good'. As shown in Table 8, ConvSearch outperforms the baseline model on all three metrics.

Conclusion and Future Work
In this work, we built an end-to-end conversational search system for online shopping, in which we deeply combine the dialog and search systems with multi-task learning. In particular, our product search module leverages both product attributes and text to retrieve products, which mitigates the imperfect product schema/knowledge challenge. To address the lack of an in-domain dialog dataset, we proposed a dataset transfer method and constructed a shopping dialog dataset from user search behavior data and existing dialogs of similar domains. The proposed dataset construction method lowers cost, making it possible to scale up to broader use scenarios. We leave to future work expanding the methodology to more shopping categories and broader scenarios such as clinical conversations and customer service.