Context-Aware Query Rewriting for Improving Users’ Search Experience on E-commerce Websites

E-commerce queries are often short and ambiguous. Consequently, query understanding often uses query rewriting to disambiguate user-input queries. While using e-commerce search tools, users tend to enter multiple searches, which we call context, before purchasing. These history searches contain contextual insights about users’ true shopping intents. Therefore, modeling such contextual information is critical to a better query rewriting model. However, existing query rewriting models ignore users’ history behaviors and consider only the instant search query, which is often a short string offering limited information about the true shopping intent. We propose an end-to-end context-aware query rewriting model to bridge this gap, which takes the search context into account. Specifically, our model builds a session graph using the history search queries and their contained words. We then employ a graph attention mechanism that models cross-query relations and computes contextual information of the session. The model subsequently calculates session representations by combining the contextual information with the instant search query using an aggregation network. The session representations are then decoded to generate rewritten queries. Empirically, we demonstrate the superiority of our method to state-of-the-art approaches under various metrics.


Introduction
Query rewriting is a task where a user inputs a potentially problematic query (e.g., typos or insufficient information), and we rewrite it to a new one that better matches the user's real shopping intent. This task plays an important role in e-commerce query understanding, where without proper rewriting, search engines often return undesired items, rendering the search experience unsatisfactory.
Work was done during an internship at Amazon. Correspondence to simiaozuo@gatech.edu.
Consider the query "bumblebee costumes" (Figure 1, left). From the query alone, it is impossible to tell whether the user's intent is a costume of an actual bumblebee (i.e., the animal) or of the character from the movie franchise. This type of ambiguity is common in e-commerce search, where queries are usually short (only 2-3 terms) and insufficiently informative (He et al., 2016b). Therefore, it is not possible to disambiguate queries using only the instant search. A common solution is to use statistical rules to differentiate the possible choices. Specifically, in our example, suppose a total of 100 users entered the "bumblebee costumes" query, and 70 of them eventually purchased the movie character costume. When a new user searches for the same query, the recommended products will consist of 70% movie character costumes and 30% animal costumes. This procedure is problematic because each user has a specific intent, i.e., either the movie character costume or the animal costume, but rarely both, which the aforementioned method fails to address.
We propose to explore contextual information from users' history searches to resolve the query ambiguity issue. Taking the "bumblebee costumes" example again, in Figure 1 (right), suppose a rewriting model recognizes that the user searched for "Transformers movie" earlier; it could then infer that the user's purchase intent is the movie character costume, and hence remove the input ambiguity. Existing works have utilized search logs for query rewriting. For example, Wang and Zhai (2007, 2008) use traditional TF-IDF-based similarity metrics to capture relational information among the user's history searches. These approaches are too restrictive to handle today's increasingly complex corpora. As such, the rewritten queries can differ significantly from the original one in intent. More recently, neural network-based query rewriting algorithms (He et al., 2016b; Xiao et al., 2019; Yang et al., 2019) have been proposed. Most of these approaches employ multi-stage training. Consequently, they involve complicated hand-crafted features or require excessive human annotations for the intermediate features (sometimes both).
To overcome the drawbacks of existing methods, we propose an end-to-end context-aware query rewriting algorithm. Our model's backbone is the Transformer (Vaswani et al., 2017). It is a sequence-to-sequence encoder-decoder model that exploits recent advances of the self-attention mechanism (Bahdanau et al., 2015). In our context-aware model, the Transformer encoder learns representations for individual history queries. The representations are further transformed to carry cross-query relational information using a graph attention mechanism (GAT, Velickovic et al. 2018). The GAT computes contextual information of a session based on a session graph, whose nodes are the history queries and the tokens contained in the history queries. The contextual information obtained from the GAT is then aggregated with the instant search using an aggregation network. The augmented information is subsequently fed into the Transformer decoder to generate rewritten queries. Previous works (Tu et al., 2019; Wang et al., 2020) that share the same spirit have been shown to be effective in various natural language processing tasks.
We highlight that our proposed session graph formulation and the GAT mechanism explicitly model cross-query relations, which is different from existing works. Previous approaches (e.g., Dehghani et al., 2017) capture such relations recursively, which is sub-optimal because such a structure suffers from the "forgetting" issue (Hochreiter and Schmidhuber, 1997), i.e., relations between queries that are far apart will be lost. In contrast, GAT associates any two queries by their contained words, enabling relation-modeling regardless of distance.
Our proposed method improves upon existing works in three aspects. First, our model does not involve recursion, unlike conventional recurrent neural network-based approaches (He et al., 2016b; Yang et al., 2019; Xiao et al., 2019). Our proposed attention-based method can be trained fully in parallel and avoids the gradient explosion and gradient vanishing problems (Pascanu et al., 2013) from which existing models suffer. These advantages facilitate training deep models containing dozens of layers capable of capturing high-order information. Second, our end-to-end sequence-to-sequence learning formulation eliminates the need for excessive labeled data. Previous approaches (e.g., Xiao et al., 2019) require judgments of "semantic similarity", and thus rely on human annotations, which are expensive to obtain. In contrast, our method uses search logs as supervision, which does not involve human effort and is cheap to acquire. Third, our method can leverage powerful pre-trained language models, such as BART (Lewis et al., 2020). Such models contain rich semantic information and are successful in numerous natural language processing tasks (Devlin et al., 2019; Liu et al., 2019; Radford et al., 2019).
We demonstrate the effectiveness of our method on in-house data from an online shopping platform. Our context-aware query rewriting model outperforms various baselines by large margins. Notably, compared with the best baseline method (a Transformer-based model), our model achieves 11.6% improvement under the MRR (Mean Reciprocal Rank) metric and 20.1% improvement under the HIT@16 metric (a hit rate metric). We further verify the effectiveness of our approach by conducting online A/B tests.
The remainder of this paper is organized as follows. In Section 2 we review some related works. Section 3 describes the problem setup and the data collection process. In Section 4 we introduce our end-to-end context-aware query rewriting method. Experiments are presented in Section 5. We conclude this paper in Section 6.

Related Works
Context-based query rewriting One line of work uses statistical methods. For example, Cui et al. (2002, 2003) extract probabilistic correlations between the search queries and the product descriptions. Other works extract features that are related to the user's current search (Huang et al., 2003; Huang and Efthimiadis, 2009), or from relational information among the user's history searches (Billerbeck et al., 2003; Baeza-Yates and Tiberi, 2007; Wang and Zhai, 2007; Cao et al., 2008; Wang and Zhai, 2008). There are also statistical machine translation-based models (Riezler et al., 2007; Riezler and Liu, 2010) that employ sequence-to-sequence approaches.
The aforementioned statistical methods suffer from unreliable extracted features, such that the rewritten queries differ from the original one in intent.
Another line of work focuses on neural query rewriting models (He et al., 2016b;Xiao et al., 2019;Yang et al., 2019). These models adopt recurrent neural networks (RNNs, Hochreiter and Schmidhuber 1997;Sutskever et al. 2014) to learn a vectorized representation for the user's search query, after which KNN-based methods are used to find queries that yield similar representations.
One major limitation is that the rewritten queries are limited to the previously presented ones.
Also, these methods often involve complicated and ungrounded feature function designs (e.g., He et al. (2016b) and Xiao et al. (2019) hand-craft 18 feature functions), or require excessive labeled data. There are other works (Sordoni et al., 2015; Dehghani et al., 2017; Jiang and Wang, 2018) that use RNNs for generative query suggestion, but they inherit the weaknesses of RNNs and yield unsatisfactory performance in practice.
Note that Grbovic et al. (2015) construct context-aware query embeddings using word2vec. In their approach, an embedding is learned for each distinct query in the dataset. As such, the quality of the learned embeddings relies heavily on the number of occurrences of each query. This method is not applicable to our case because in our dataset, almost all the queries are distinct.
Pre-trained language models These models are gaining increasing attention in natural language processing (NLP). Models such as BERT (Devlin et al., 2019), RoBERTa, and GPT-2 (Radford et al., 2019) achieve state-of-the-art performance in various NLP tasks, such as natural language understanding (He et al., 2020) and text classification (Yu et al., 2021). Pre-trained language models are essentially massive Transformer-based neural networks that are trained using enormous open-domain data in a completely unsupervised manner. When applying these models to downstream tasks, we only need to slightly modify the models instead of training from scratch.
Many of these popular models have either the Transformer encoder (e.g., BERT) or the Transformer decoder (e.g., GPT-2), but not both. Since we formulate query rewriting as a sequence-to-sequence (seq2seq) learning problem, pre-trained seq2seq models, such as BART (Lewis et al., 2020) and UniLM (Dong et al., 2019), are more suitable for the task.

Problem Setup
The session data are collected from search logs. First, we collect all the searches from a specific user within a time window, and we call these searches a "session". After the user purchases a product, the session ends, i.e., we do not consider subsequent queries and behaviors after a purchase happens. This is because after a purchase, the user's intent often changes. Note that different sessions may be collected from different users.
Each session contains multiple searches from the same user. We call the last query in the session the "target" query, the second to last query the "source" (or the "instant") query, and the others the "history" queries. The intuition behind this is that because sessions always end with a purchase, the last search (i.e., the target) reflects the user's real intent. When the user enters the second to last search (i.e., the source), if we can rewrite it to the target query, the user's intent will be fulfilled.
Below is an example of a search session. From the history queries, the user is interested in car-related banners/posters. The source query contains a typo (i.e., "dodger" is a baseball team) and we should rewrite it to the target query (i.e., "dodge posters").
We collect about 3 million (M) sessions, where each session consists of at least 3 history queries, a source query (i.e., the one we need to rewrite), and a target query (i.e., the ground-truth query that is associated with the purchase). We have roughly 18.7M queries, and on average, each session contains 4 history queries. Query rewriting is consequently formulated as a sequence-to-sequence learning problem. We highlight that per our formulation, we do not need human annotations, unlike existing approaches.
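The split of a session into history, source, and target queries can be sketched as follows. This is only an illustration of the setup described above: the `SessionExample` type, the `split_session` helper, and the history queries in the example are hypothetical and not taken from the dataset; only the source and target queries come from the text.

```python
from typing import List, NamedTuple


class SessionExample(NamedTuple):
    """One training example derived from a search session (hypothetical type)."""
    history: List[str]  # all queries before the source query
    source: str         # second to last query: the one to rewrite
    target: str         # last query: the one associated with the purchase


def split_session(queries: List[str], min_history: int = 3) -> SessionExample:
    """Split chronologically ordered queries into history / source / target."""
    if len(queries) < min_history + 2:
        raise ValueError("session too short: need history, source and target")
    return SessionExample(history=queries[:-2],
                          source=queries[-2],
                          target=queries[-1])


# Illustrative session; the three history queries are made up for this sketch.
session = ["dodge banners", "mopar posters", "dodge challenger poster",
           "dodger posters", "dodge posters"]
example = split_session(session)
```

Under this formulation the pair (`example.source`, `example.target`) plays the role of a source/target sentence pair in translation, with `example.history` as additional context.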

Transformer Encoder
For a given source query, we first pad it with a <boq> (begin-of-query) token. Then, we pass the padded query through a token embedding layer and a position embedding layer, and we obtain $Y_s \in \mathbb{R}^{L_s \times d}$. Here $L_s$ is the length of the padded source query, and $d$ is the embedding dimension.
Note that the position embedding can either be a sinusoidal function or a learned matrix.
After the initial embedding layers, we pass $Y_s$ through the self-attention module. Specifically, we compute the attention output $S$ with multi-head self-attention (Eqs. 1 and 2), where $W_o \in \mathbb{R}^{H d_V \times d}$ is a learnable aggregation matrix. The attention output is then fed through a position-wise feed-forward neural network (Eq. 3) to generate the encoded representation $H_s \in \mathbb{R}^{L_s \times d}$ for the source query. Here $\{W^1_{\mathrm{FFN}}, W^2_{\mathrm{FFN}}, b_1, b_2\}$ are the weights of the feed-forward network. Equations (1), (2), and (3) constitute an encoder block. In practice we stack multiple encoder blocks to build the Transformer encoder, as demonstrated in Figure 2. For more details about the Transformer architecture, we refer to Vaswani et al. (2017).
Figure 3: Left: Illustration of a session graph, where "T" stands for tokens and "Q" stands for queries. Right: One-step update based on the session graph.
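As a rough illustration of one encoder block, the sketch below follows the standard multi-head attention and feed-forward form of Vaswani et al. (2017) in NumPy. It is not the paper's implementation: residual connections and layer normalization are omitted for brevity, and the parameter names (`Wq`, `Wk`, `Wv`, `Wo`, `W1`, `W2`) and the per-head key dimension are assumptions made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def encoder_block(Y, Wq, Wk, Wv, Wo, W1, b1, W2, b2):
    """Y: (L, d). Wq, Wk: (H, d, d_k). Wv: (H, d, d_v). Wo: (H * d_v, d)."""
    heads = []
    for h in range(Wq.shape[0]):
        Q, K, V = Y @ Wq[h], Y @ Wk[h], Y @ Wv[h]
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (L, L) attention weights
        heads.append(A @ V)                           # (L, d_v)
    S = np.concatenate(heads, axis=-1) @ Wo           # attention output, (L, d)
    return np.maximum(S @ W1 + b1, 0.0) @ W2 + b2     # position-wise FFN, (L, d)


rng = np.random.default_rng(0)
L, d, H, d_k, d_v, d_ff = 4, 8, 2, 4, 4, 16
Y_s = rng.normal(size=(L, d))
H_s = encoder_block(
    Y_s,
    rng.normal(size=(H, d, d_k)), rng.normal(size=(H, d, d_k)),
    rng.normal(size=(H, d, d_v)), rng.normal(size=(H * d_v, d)),
    rng.normal(size=(d, d_ff)), np.zeros(d_ff),
    rng.normal(size=(d_ff, d)), np.zeros(d),
)
```

The output keeps the input shape, which is what allows several such blocks to be stacked.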
For the history queries in this session, we also pad them with <boq> tokens. Suppose that we have $N_h$ padded history queries (recall a session contains multiple history queries), and their respective lengths are denoted by $L_h^1, \cdots, L_h^{N_h}$. We pad the history queries to the same length $L_h$, and we obtain the history query matrix $X_h$. Then, following the same procedure as for encoding the source query, we pass $X_h$ through the embedding layers and the encoder blocks, after which we obtain the history query representations $U_h \in \mathbb{R}^{N_h \times L_h \times d}$.

Contextual Information from Session Graphs
After we obtain the history query representations $U_h$, the next step is to refine them. Such refinement is necessary because the Transformer encoder considers the history queries separately, so their interactions are not taken into account. However, since each search depends on its previous searches in the same session, modeling cross-query relations is imperative for determining the user's purchase intent. To this end, we use a graph attention mechanism (Velickovic et al., 2018; Wang et al., 2020) to capture contextual information from $U_h$.

Session Graph Construction
First we specify how to build a graph for each session, which we call the session graph. Suppose we have a session that contains three history queries $Q_1, Q_2, Q_3$, and let $T_1, \cdots, T_5$ be the five tokens that appear in the three queries. See Section 3 for the problem setup. Figure 3 (left) illustrates the session graph. In this bipartite graph, the circles are the token nodes ($T_1, \cdots, T_5$), and the rectangles are the query nodes ($Q_1, Q_2, Q_3$). A token node is connected to a query node when the token appears in that query. In our example, the history query representations have size $U_h \in \mathbb{R}^{3 \times 6 \times d}$, that is, we have 3 queries, and the maximum query length is 6 (recall we prepend a <boq> token to each query).
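The bipartite structure can be sketched with plain adjacency maps. The assignment of tokens to queries below is hypothetical (the text only fixes that three queries share five distinct tokens), and `build_session_graph` is an illustrative helper rather than the authors' code.

```python
from typing import Dict, List, Set, Tuple


def build_session_graph(
    history: List[List[str]],
) -> Tuple[List[str], Dict[int, Set[int]], Dict[int, Set[int]]]:
    """Return (distinct tokens, query -> token ids, token -> query ids)."""
    tokens: List[str] = []
    token_index: Dict[str, int] = {}
    query_to_tokens: Dict[int, Set[int]] = {}
    token_to_queries: Dict[int, Set[int]] = {}
    for q_id, query in enumerate(history):
        query_to_tokens[q_id] = set()
        for tok in query:
            if tok not in token_index:      # first time we see this token
                token_index[tok] = len(tokens)
                tokens.append(tok)
            t_id = token_index[tok]
            query_to_tokens[q_id].add(t_id)  # edge query -> token
            token_to_queries.setdefault(t_id, set()).add(q_id)  # edge token -> query
    return tokens, query_to_tokens, token_to_queries


# Hypothetical tokenized history: Q1, Q2, Q3 over tokens T1..T5.
history = [["T1", "T2", "T3"], ["T3", "T4"], ["T4", "T5"]]
tokens, query_to_tokens, token_to_queries = build_session_graph(history)
```

In this toy session the token `T3` (index 2) is adjacent to the first two queries only, so an update routed through `T3` would not touch the third query.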

Node Representations
The next step is to refine the node representations. Each of the nodes in the session graph has its own representation.
• The token representations are simply the corresponding representations of the tokens, extracted from the token embedding matrix.
• The query representations are the representations of the <boq> token in each padded history query, e.g., the representation of the $Q_1$ query node in Figure 3 is the representation of the <boq> token of the first history query. Note that this is akin to BERT, where a [CLS] token is inserted and its representation is used for classification tasks.
Denote by $G_q = \{q_i\}_{i=1}^{N_q}$ and $G_t = \{t_i\}_{i=1}^{N_t}$ the sets of representations for the query and token nodes, respectively. Here $N_q$ is the number of query nodes and $N_t$ is the number of token nodes. Note that all the node representations have the same size, i.e., $q_i, t_i \in \mathbb{R}^d$.

Update Node Representations
We use a multi-head graph attention mechanism to update the node representations. For simplicity, denote by $N_g = N_q + N_t$ the number of distinct nodes in the session graph, and consider the set of all the node representations, i.e., the union of $G_q$ and $G_t$.
With the above notations, a single-head graph attention mechanism is defined by Eq. 5, where $\mathrm{ELU}(\cdot)$ is the exponential linear unit, $\mathcal{N}_i$ denotes the set of neighbors of the $i$-th node, and $W_a, W_q, W_k, W_v$ are trainable weights. Note that a residual connection (He et al., 2016a) is added in the last equation of Eq. 5. This has proven to be an effective technique to prevent gradient vanishing, and hence to stabilize training.
The session graph only induces attention between nodes that are connected. For example, in Figure 3 (right), the model updates $Q_1$ and $Q_2$ using $T_3$, while $Q_3$ is unchanged, i.e., $\mathcal{N}_{T_3} = \{Q_1, Q_2\}$.
A multi-head graph attention mechanism is then defined as the concatenation $[h_i^1, h_i^2, \cdots, h_i^K]$, where $K$ is the number of heads, and each $h_i^k$ is calculated via Eq. 5. The token node representations and the query node representations are updated iteratively.
First, we update the token representations ($G_t$) using the query representations ($G_q$), so that the tokens know which queries they belong to. Then, $G_q$ is re-computed using the updated version of $G_t$, which essentially evaluates cross-query relations, using the token nodes as intermediaries. Note that the graph attention mechanisms (GATs) used in the two steps are distinct, i.e., there are two different sets of weights $[W_a, W_q, W_k, W_v]$.
Eventually, we obtain the updated vectorized representations $\{h_i\}_{i=1}^{N_g}$ for all the nodes, and we treat them as the contextual information of the session.
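The two-step update can be sketched as below. Because Eq. 5 is not reproduced in this text, the exact scoring function is an assumption: the sketch borrows the additive scoring with a LeakyReLU from Velickovic et al. (2018), combined with the ELU activation and the residual connection mentioned above. It is single-head only, and all function and parameter names are illustrative.

```python
import numpy as np


def elu(x):
    """Exponential linear unit."""
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)


def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)


def gat_update(targets, sources, neighbors, Wa, Wq, Wk, Wv):
    """Update each target node from its neighboring source nodes.

    targets: (N_tgt, d), sources: (N_src, d),
    neighbors: dict mapping a target index to a list of source indices.
    Assumed form: h_i = g_i + ELU(sum_j alpha_ij * W_v g_j).
    """
    updated = targets.copy()
    for i, nbrs in neighbors.items():
        if not nbrs:
            continue                                  # isolated node: unchanged
        nbrs = sorted(nbrs)
        query = np.tile(Wq @ targets[i], (len(nbrs), 1))  # (n, d)
        keys = sources[nbrs] @ Wk.T                        # (n, d)
        scores = leaky_relu(np.concatenate([query, keys], axis=1) @ Wa)  # (n,)
        scores = scores - scores.max()
        alpha = np.exp(scores) / np.exp(scores).sum()      # attention over neighbors
        message = alpha @ (sources[nbrs] @ Wv.T)           # (d,)
        updated[i] = targets[i] + elu(message)             # residual connection
    return updated


def init_params(rng, d):
    """Random (Wa, Wq, Wk, Wv) for one GAT; a stand-in for trained weights."""
    return (rng.normal(size=2 * d), rng.normal(size=(d, d)),
            rng.normal(size=(d, d)), rng.normal(size=(d, d)))


rng = np.random.default_rng(0)
d = 4
G_q = rng.normal(size=(3, d))   # query node representations
G_t = rng.normal(size=(5, d))   # token node representations
token_to_queries = {0: [0], 1: [0], 2: [0, 1], 3: [1, 2], 4: [2]}
query_to_tokens = {0: [0, 1, 2], 1: [2, 3], 2: [3, 4]}
params_q2t, params_t2q = init_params(rng, d), init_params(rng, d)

G_t_new = gat_update(G_t, G_q, token_to_queries, *params_q2t)      # step 1: tokens
G_q_new = gat_update(G_q, G_t_new, query_to_tokens, *params_t2q)   # step 2: queries
```

Using two separate parameter sets mirrors the statement that the two steps do not share weights. A multi-head variant would run several such updates and concatenate the results.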
We remark that the GAT mechanism explicitly models cross-query relations by associating query representations with word representations. Such an approach is fundamentally different from existing methods, where the relations are either ignored (e.g., conventional Transformer attention) or captured via recursion (e.g., RNN-based approaches).

Session Representation from Aggregation Network
Recall that we pass the source query through a Transformer encoder and obtain $H_s \in \mathbb{R}^{L_s \times d}$. The matrix $H_s$ contains representations for all the tokens in the source query. We use that of the prepended <boq> token as the representation of the source query, which is denoted $h_s \in \mathbb{R}^d$. We adopt an aggregation network to extract useful information with respect to $h_s$ from the contextual information $\{h_i\}_{i=1}^{N_g}$. The network employs an attention mechanism that determines to what extent each vector $h_i$ contributes to the source query $h_s$. Figure 4 illustrates the architecture of the aggregation network. Concretely, the aggregation is given by Eq. 6, where $W_k$ and $W_v$ are trainable weights. The summation in the last equation in Eq. 6 is conducted row-wise, wherein $H_{sess}, H_s \in \mathbb{R}^{L_s \times d}$, and $v \in \mathbb{R}^d$.
The matrix $H_{sess}$ serves as the representation of the session. Intuitively, by incorporating the aggregation network, we can filter out redundant information from the session history and keep only what is pertinent to the source query.
After the Transformer encoder, the graph attention mechanism, and the aggregation network, we obtain $H_{sess}$, the session representation that contains information on both the source query and its history searches. Subsequently, $H_{sess}$ is fed into the Transformer decoder to generate rewritten query candidates.
The algorithm is detailed in Algorithm 1.
Algorithm 1: Context-aware query rewriting.
Input: D: dataset containing sessions; initial parameters for the Transformer encoder and the Transformer decoder; initial parameters for the two graph attention mechanisms (Eq. 5): GAT_{t→q}, GAT_{q→t}; initial parameters for the aggregation network (Eq. 6); K: the number of updates on the session graph; N: the number of rewritten queries for each session.
Output: A list that contains N generated queries for each session in the dataset.
Rewritten results: rewritten = {};
for each session in D do
    /* Encode input data. */
    Compute source representation $H_s$ and history representation $U_h$ using the Transformer encoder;
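A minimal sketch of the aggregation step is given below. Since Eq. 6 is not reproduced in this text, the scoring function (a scaled dot product between $h_s$ and the projected context vectors) is an assumption; what the sketch does take from the description is that attention weights over the contextual vectors produce a single vector $v$, which is added to every row of $H_s$. The function name `aggregate` is illustrative.

```python
import numpy as np


def aggregate(H_s, context, Wk, Wv):
    """H_s: (L_s, d) source representation; context: (N_g, d) node vectors.

    Returns H_sess of shape (L_s, d). Row 0 of H_s is taken to be the
    <boq> representation h_s.
    """
    h_s = H_s[0]
    keys = context @ Wk.T                         # (N_g, d)
    scores = keys @ h_s / np.sqrt(h_s.shape[0])   # assumed scaled dot-product scoring
    scores = scores - scores.max()
    alpha = np.exp(scores) / np.exp(scores).sum() # contribution of each h_i
    v = alpha @ (context @ Wv.T)                  # (d,) aggregated context
    return H_s + v                                # v is added to each row of H_s


rng = np.random.default_rng(0)
L_s, d, N_g = 3, 4, 8
H_s = rng.normal(size=(L_s, d))
context = rng.normal(size=(N_g, d))
Wk, Wv = rng.normal(size=(d, d)), rng.normal(size=(d, d))
H_sess = aggregate(H_s, context, Wk, Wv)
```

Because the same vector is added to every row, `H_sess` keeps the shape of `H_s` and can be passed to a decoder wherever the plain source encoding would be used.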

Experiments
We conduct experiments on in-house data. Notice that we focus on session-based query reformulation, a scenario that is rare in existing datasets (see Section 3 for details). We implement two methods with different model architectures: Transformer+Aggregation+Graph and BART+Aggregation+Graph. The first one is constructed in the previous section, and the second one employs a fine-tuning approach instead of training from scratch.

Training Details
For training a Transformer model from scratch, we adopt the Transformer-base (Vaswani et al., 2017) architecture. We use Adam (Kingma and Ba, 2015) as the optimizer, and the learning rate is chosen from $\{3 \times 10^{-4}, 5 \times 10^{-4}, 1 \times 10^{-3}\}$. We use 4 heads for the multi-head graph attention mechanism, where the head dimension is set to 128 (note that the Transformer-base architecture has embedding dimension 512).
For fine-tuning a BART model, we adopt the BART-base (Lewis et al., 2020) architecture. We use AdamW (Loshchilov and Hutter, 2019) as the optimizer, and the learning rate is chosen from $\{3 \times 10^{-5}, 5 \times 10^{-5}, 1 \times 10^{-4}\}$. Similar to the training-from-scratch scheme, we adopt 4 heads, each with dimension 192, for the graph attention mechanism.
For both training from scratch and fine-tuning, please refer to Ott et al. (2019) (see footnote 1) for more details such as pre-processing steps and other hyper-parameters.

Baselines
The baselines are split into two groups: without pre-training and with pre-training. For the group without pre-training, we build the following models:
Learning to Rewrite Queries (LQRW) (He et al., 2016b) is one of the first methods that applies deep learning techniques to query rewriting. Specifically, the LQRW model combines a sequence-to-sequence LSTM (Hochreiter and Schmidhuber, 1997; Sutskever et al., 2014) model with statistical machine translation (Riezler and Liu, 2010) techniques to generate queries. The candidates are subsequently ranked using hand-crafted feature functions.
Hierarchical Recurrent Encoder-Decoder (HRED) (Sordoni et al., 2015) employs a hierarchical recurrent neural network for generative query suggestion. The model is a step forward from its predecessors in that HRED is sensitive to the order of queries and is able to suggest rare and long-tail queries.
Transformer (Vaswani et al., 2017) has achieved superior performance in various sequence-to-sequence (seq2seq) learning tasks. To adapt the Transformer to query rewriting, we treat the source query as the source-side input, and the target query as the target-side input. Then we train a model using only these constructed inputs, similar to machine translation. Note that this setting resembles most of the existing works. We adopt the Transformer-base architecture, which contains about 72M parameters.
MeshTransformer (Chen and Lee, 2020) is a variant of MeshBART, where the pre-trained BART module is replaced by a Transformer and the model is trained from scratch. The method concatenates history queries to the source query in order to integrate contextual information. See the MeshBART method below for details.
Footnote 1: https://github.com/pytorch/fairseq/blob/master/examples/translation/README.md
Transformer+Aggregation is the model where we use the aggregation network to encode history search queries, i.e., without the graph attention mechanism. Specifically, we first obtain the representations of the source query and the history queries from the Transformer encoder. Then, we extract information related to the source query from the history representations using an aggregation network. Such information is added to the source representation, and we follow a standard decoding procedure using these two factors. See Section 4.3 for details.
The second group of methods adopts pre-trained language models for query rewriting.
BART (Lewis et al., 2020) is a pre-trained seq2seq model. We adopt this particular model instead of, for example, BERT (Devlin et al., 2019) or GPT-2 (Radford et al., 2019), because we treat query rewriting as a seq2seq task, and the aforementioned architectures have either the Transformer encoder (e.g., BERT) or the Transformer decoder (e.g., GPT-2), but not both. In our experiments, BART is fine-tuned in a setting similar to training the Transformer model. We adopt the BART-base architecture in all the experiments, which contains about 140M parameters.
MeshBART (Chen and Lee, 2020) is a BART-based model that first concatenates the history queries to the source query, and then feeds the concatenated input to a pre-trained BART model for query generation. Note that the original method requires click information. We remove this component as the proposed method does not need such data.
BART+Aggregation is similar to Transformer+Aggregation, except we replace the Transformer backbone with the pre-trained seq2seq BART model.

Evaluation Metrics
We use BLEU, MRR (Mean Reciprocal Rank), HIT@1, and HIT@16 to evaluate the query rewriting models. For all metrics except BLEU, we report the gains over the results calculated by using only source queries. We remark that MRR, HIT@1, and HIT@16 are more important than BLEU, because MRR and HIT are directly linked to user experience.
We use the BLEU score (Post, 2018) as an evaluation metric. This metric is commonly used to evaluate the quality of translation. We adopt it here because, similar to machine translation, we formulate query rewriting as a seq2seq learning task. The correlation between the rewritten query and the target query reflects the model's ability to capture the user's purchase intent.
The MRR metric describes the accuracy of the rewritten queries. For each source query in the test set, we generate 10 candidate queries $r_1, \cdots, r_{10}$. Then we search each of these candidates using our production search engine, and we obtain the returned products, of which we only keep the top 32. Recall that we know the actual product that the customer purchased. The next step is to calculate the reciprocal of the actual product's rank for each of $r_1, \cdots, r_{10}$. For example, suppose for $r_1$, the actual purchased product is the second within the 32 returned products, then the score for $r_1$ is $\mathrm{score}_1 = 1/2 = 0.5$. The score of the rewritten queries $r_1, \cdots, r_{10}$ is then defined as $\max\{\mathrm{score}_i\}_{i=1}^{10}$. Finally, the score for the query rewriting model is the average over all the source query scores.
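The MRR computation described above can be sketched in a few lines. The product identifiers and helper names are hypothetical, and assigning a score of 0 when the purchased product is missing from the top 32 results is an assumption, since the text does not state how that case is handled.

```python
from typing import List, Sequence, Tuple


def reciprocal_rank(returned: Sequence[str], purchased: str, top_k: int = 32) -> float:
    """1 / rank of the purchased product within the top_k results, else 0 (assumed)."""
    for rank, product in enumerate(returned[:top_k], start=1):
        if product == purchased:
            return 1.0 / rank
    return 0.0


def source_query_score(candidate_results: List[Sequence[str]], purchased: str) -> float:
    """Best reciprocal rank over all rewritten candidates of one source query."""
    return max(reciprocal_rank(results, purchased) for results in candidate_results)


def mean_reciprocal_rank(test_set: List[Tuple[List[Sequence[str]], str]]) -> float:
    """Average of the per-source-query scores."""
    scores = [source_query_score(cands, purchased) for cands, purchased in test_set]
    return sum(scores) / len(scores)


# Each entry: (search results for every candidate rewrite, purchased product).
mrr_test_set = [
    ([["p7", "p3", "p9"], ["p1", "p2", "p3"]], "p3"),  # best rank is 2
    ([["p4", "p5"]], "p8"),                            # purchased product not returned
]
mrr = mean_reciprocal_rank(mrr_test_set)
```

The reported numbers are gains relative to the same quantity computed with the unmodified source queries, so in practice this function would be evaluated twice.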
We also use HIT@1 and HIT@16 as evaluation metrics. The HIT@16 metric is the percentage of cases in which the actual product is ranked within the first 16 products (the first page) when we search the rewritten query. The HIT@1 metric is defined similarly.
Table 1 summarizes experimental results. Recall that in our formulation, we rewrite a source query to a target query. The "target query" entry in Table 1 is the performance gain of the ground truth target query, i.e., this entry signifies an upper bound on the performance gain that any model can achieve.
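The HIT@k metric defined above admits a similarly short sketch. How multiple candidate rewrites are combined is not specified for HIT in the text; counting a hit when any candidate places the purchased product in the top k, by analogy with the maximum used for MRR, is an assumption of this example, as are the names and data.

```python
from typing import List, Sequence, Tuple


def hit_at_k(test_set: List[Tuple[List[Sequence[str]], str]], k: int) -> float:
    """Percentage of source queries whose purchased product appears in the
    top k results of at least one rewritten candidate (assumed aggregation)."""
    hits = 0
    for candidate_results, purchased in test_set:
        if any(purchased in results[:k] for results in candidate_results):
            hits += 1
    return 100.0 * hits / len(test_set)


hit_test_set = [
    ([["p7", "p3", "p9"], ["p1", "p2", "p3"]], "p3"),
    ([["p4", "p5"]], "p8"),
]
hit1 = hit_at_k(hit_test_set, k=1)
hit16 = hit_at_k(hit_test_set, k=16)
```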

Experimental Results
We can see that the attention-based models (i.e., BART, MeshBART, Transformer and MeshTransformer) outperform the recurrent neural network-based approaches (i.e., LQRW and HRED). This is because RNNs suffer from forgetting and training issues. In contrast, Transformer-based models use the attention mechanism instead of recursion to capture dependencies, which has proven to be more effective. Moreover, by aggregating history searches, BART+Aggregation and Transformer+Aggregation consistently outperform their vanilla alternatives. The performance of these two methods indicates that integrating history queries into training is critical.
The performance is further enhanced by incorporating the session graphs. Specifically, Transformer+Aggregation+Graph achieves the best performance under almost all the metrics. Notice that the HIT@16 metric gain improves from +15.9 to +20.1 when employing both the aggregation network and the session graph formulation for the Transformer-based models. We highlight that the graph attention mechanism can directly capture cross-query relations, which none of the baselines can do. We can see that this property indeed contributes to model performance, i.e., HIT@16 increases from +17.3 to +20.1 when we equip Transformer+Aggregation with the GAT mechanism.
Notice that BLEU is not a definitive metric. For example, the MRR and HIT metrics of HRED are consistently higher than those of LQRW, even though the BLEU score of the former is significantly lower than that of the latter. Also, compared with Transformer-based models, the BLEU score is consistently higher when using the BART model as the backbone. This is because a pre-trained language model contains more semantic information. However, the MRR and HIT metrics of the BART-based models are slightly worse than those of the Transformer-based models.
However, the BLEU score is comparable for models with the same backbone. For example, for

Online A/B Test
To further validate the effectiveness of our approach, we conduct online A/B experiments on a large-scale e-commerce shopping platform with our query rewriting models. For a given search query within a session, we generate one reformulated query using the proposed model, and we feed both the original query and the rewritten query into the search system. We run this online query reformulation experiment in the US market. Experiments are conducted over five days, during which our system processed over 30 million sessions. The proposed method improves business metrics, leading to a $234M increase (annualized) in revenue. In addition, the number of reformulated searches decreases significantly, by 0.21%. This indicates that the rewritten queries better meet customers' shopping intent, since customers are able to find their desired products with fewer searches.
The BART-based models do not outperform their Transformer-based counterparts under the MRR and HIT metrics (Table 1). One reason is that publicly available pre-trained models are pre-trained on natural language corpora, but queries are usually short and have distinct structures. This raises doubts on whether current pre-trained models are suitable for the query domain. Indeed, the rich semantic information enables a much better BLEU score (32.9 vs. 28.2), but the MRR and HIT metrics suggest that the fine-tuned models' performance is unsatisfactory.
Another reason is that in a conventional fine-tuning task, a task-specific head is appended to the pre-trained model, and the head usually contains only a small number of parameters. But in the query rewriting task, both the aggregation network and the graph attention mechanism contain a significant number of parameters (about 10% of BART). This is problematic because in fine-tuning, the learning rate is usually small since nearly all the weights are supposed to be meaningful and should not change much. Yet, in our case, we need to properly train a large number of randomly initialized parameters. Moreover, the aggregation network and the GAT are added inside the pre-trained model (more specifically, they are added to the BART encoder) instead of appended after BART. Essentially this nullifies the pre-trained parameters on the decoder side, imposing additional challenges to the fine-tuning task. Nevertheless, the BART+Aggregation model still outperforms the vanilla BART model, and the performance is further improved by adding the GAT (i.e., BART+Aggregation+Graph).
Training from scratch vs. fine-tuning Figure 5 plots the training and validation perplexity (ppl) of the training-from-scratch approach and the fine-tuning approach. From Figure 5a and Figure 5b, we can see that by employing the aggregation network, Transformer+Aggregation fits the data better and generalizes better. Both training and validation ppl improve significantly further when the graph attention mechanism is incorporated, i.e., when Transformer+Aggregation+Graph is used.
Notice that in Figure 5c, BART+Aggregation outperforms BART+Aggregation+Graph in terms of training ppl, which differs from the training-from-scratch approach. As indicated by Figure 5d, BART+Aggregation shows clear signs of overfitting. This is because even though pre-trained language models contain rich semantic information, much of it is "noisy" for query rewriting. Thus, the feature enhancement provided by the graph attention mechanism is needed.
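As a reminder of what the curves in Figure 5 measure, perplexity is conventionally the exponential of the average per-token negative log-likelihood. The sketch below assumes natural-log token probabilities for the target (rewritten) query; the function name and the toy values are hypothetical and not taken from our experiments.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(-mean log-probability) over all target tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Toy example: a model that assigns probability 0.25 to each of four tokens
# is, on average, as uncertain as a uniform choice among four options.
uniform_log_probs = [math.log(0.25)] * 4
print(perplexity(uniform_log_probs))
```

Lower perplexity on the validation set means the model assigns higher probability to held-out target queries, which is why a gap between low training ppl and high validation ppl is read as overfitting.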
Model size vs. performance Figure 6 illustrates the relation between model size and performance, where we decrease the embedding dimension (and, correspondingly, the FFNs' hidden dimensions) and the number of layers. We can see that even with 1/8 of the parameters, model performance does not decrease much. Moreover, our model is more than 20% smaller than a BERT-base model (85M vs. 110M parameters), which makes online deployment feasible.

Query length vs. performance Figure 7 shows model performance as a function of the length of the instant query. We can see that the BLEU score gradually decreases as the length increases. This is because long queries are often very specific (e.g., down to specific models or makes), making the rewriting task harder.

Case Studies
Advantages of leveraging history information Two examples are shown in Table 2. The first example is error correction. In the example, the customer wishes to purchase dodge (a car brand) posters, but she mistakenly searches for dodger (a baseball team) posters. Without history information, it is impossible to determine the customer's true intent. However, by looking at session histories, we find that all the previous searches are related to automobiles (e.g., dodge and mopar), and therefore the query should be rewritten to "dodge posters". Our model successfully captures this pattern. Notice that the rewritten query without leveraging context does not match the user's intent.
The second example is keyword refinement. In the example, by looking at the history searches, it is obvious that the customer wishes to find phone cases instead of phones. However, this intent is impossible to capture using only the source query. Our model automatically adds the keyword "case" to the source query and matches the target query. On the other hand, without the context information, the rewritten result is not satisfactory.
Diversity of query generation Table 3 demonstrates two examples. In the first example (the left three columns), notice that our model can pick up information from history queries, e.g., "iphone 11 case sailor moon", and can delete keywords that are deemed insignificant or too restrictive, e.g., "iphone 11 case leopard" instead of "snow leopard". Our model can also effectively capture domain information. For example, some of the history query keywords (e.g., pokemon, eevee) are often described as "cute", and our model recommends this keyword. All the history keywords are from Japanese anime series; therefore, our model suggests another popular character, "totoro".
Additionally, the "disney" and "disney princess" keywords are generated based on the interest in virtual characters. Finally, notice that the likelihood of all the suggested queries is similar, which means our model cannot single out a significantly better query than the others. Therefore, our model generates a diverse group of queries.
In the second example (the right two columns), the generated query successfully matches the target query. Note that the top two generated queries have high likelihood, and the likelihood decreases drastically as the suggested queries become more and more implausible. In this example, the first query is 172% more likely than the tenth query, whereas this number is only 41% in the previous example. This suggests that our model can differentiate between good-quality suggestions and poor-quality alternatives.
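To make the "X% more likely" comparison concrete, the sketch below converts between a gap in log-likelihood and a relative likelihood expressed in percent. It assumes that candidate likelihoods are available as natural-log scores (for instance, beam-search scores); how the likelihoods in Table 3 are normalized is not specified here, so this is only an illustration, and both function names are hypothetical.

```python
import math


def percent_more_likely(log_lik_a, log_lik_b):
    """How much more likely candidate A is than candidate B, in percent."""
    return (math.exp(log_lik_a - log_lik_b) - 1.0) * 100.0


def log_gap_for_percent(percent):
    """Inverse mapping: the log-likelihood gap implied by 'percent more likely'."""
    return math.log(1.0 + percent / 100.0)


# Log-likelihood gaps implied by the two figures quoted above.
print(round(log_gap_for_percent(172), 2))  # about 1.0 nat
print(round(log_gap_for_percent(41), 2))   # about 0.34 nat
```

Under this assumption, "172% more likely" corresponds to a likelihood ratio of 2.72 between the first and tenth candidates, while "41% more likely" corresponds to a ratio of only 1.41, i.e., a much flatter ranking.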

Conclusion and Discussions
We propose an end-to-end context-aware query rewriting model that can efficiently leverage users' history behavior. Our model infers a user's purchase intent by modeling her history searches as a graph, on which a graph attention mechanism is applied to generate informative session representations. The representations are subsequently decoded into rewritten queries. We conduct experiments using in-house data from an online shopping platform, where our model achieves 11.6% and 20.1% improvement under the MRR and HIT@16 metrics, respectively. Online A/B tests are also conducted to further demonstrate the effectiveness of the proposed context-aware query rewriting algorithm.
Our proposed session graph is flexible and can be extended to incorporate more information. In this paper, we present a bipartite graph, which contains words and queries. Additional components can be added as extra layers to the session graph. For example, we can add product information such as categories to the session graph, which will turn the current bipartite graphs (word and query) into tripartite graphs (word, query, and product).
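The extension described above can be sketched with a plain adjacency map. This is a minimal illustration rather than our implementation: it assumes whitespace tokenization, represents nodes as hypothetical (type, value) tuples, and assumes that product-category nodes attach to query nodes through a hypothetical query-to-category mapping.

```python
from collections import defaultdict


def build_session_graph(queries, query_categories=None):
    """Build an undirected session graph over typed nodes.

    Without `query_categories` the graph is bipartite (word, query).
    With it, category nodes form a third layer linked to queries.
    """
    query_categories = query_categories or {}
    adjacency = defaultdict(set)
    for query in queries:
        query_node = ("query", query)
        # Word layer: connect each query to the words it contains.
        for word in query.split():
            word_node = ("word", word)
            adjacency[query_node].add(word_node)
            adjacency[word_node].add(query_node)
        # Optional product layer (assumed to attach to the query node).
        category = query_categories.get(query)
        if category is not None:
            category_node = ("category", category)
            adjacency[query_node].add(category_node)
            adjacency[category_node].add(query_node)
    return dict(adjacency)


# Hypothetical session and category labels, for illustration only.
session = ["dodge charger", "mopar decals", "dodge posters"]
categories = {"dodge posters": "wall art"}
graph = build_session_graph(session, categories)
print(sorted(graph[("word", "dodge")]))
```

Because the shared word "dodge" links two queries, information can flow between them through the word node; the category layer adds a second such path without introducing edges inside any single layer.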