Knowledge Distillation based Contextual Relevance Matching for E-commerce Product Search

Online relevance matching is an essential task of e-commerce product search to boost the utility of search engines and ensure a smooth user experience. Previous work adopts either classical relevance matching models or Transformer-style models to address it. However, they ignore the inherent bipartite graph structures that are ubiquitous in e-commerce product search logs and are too inefficient to deploy online. In this paper, we design an efficient knowledge distillation framework for e-commerce relevance matching to integrate the respective advantages of Transformer-style models and classical relevance matching models. Especially for the core student model of the framework, we propose a novel method using $k$-order relevance modeling. The experimental results on large-scale real-world data (the sizes range from 6 to 174 million) show that the proposed method significantly improves the prediction accuracy in terms of human relevance judgment. We deploy our method to the anonymous online search platform. The A/B testing results show that our method significantly improves UV-value by 5.7% under the price sort mode.


Introduction
Relevance matching (Guo et al., 2016; Rao et al., 2019; Wang et al., 2020) is an important task in the field of ad-hoc information retrieval (Zhai and Lafferty, 2017), which aims to return a sequence of information resources related to a user query (Huang et al., 2020; Chang et al., 2021; Sun and Duh, 2020). Generally, texts are the dominant form of user queries and returned information resources. Given two sentences, the target of relevance matching is to estimate their relevance score and then judge whether they are relevant or not. However, text similarity does not imply semantic similarity. For example, while "mac pro 1.7GHz" and "mac lipstick 1.7ml" look alike, they describe two different and irrelevant products. Therefore, relevance matching is important, especially for the long-term user satisfaction of e-commerce search (Niu et al., 2020; Xu et al., 2021; Zhu et al., 2020).

* Chaokun Wang is the corresponding author.
Recently, Transformer-style models (e.g., BERT (Devlin et al., 2019) and ERNIE (Sun et al., 2019b)) have achieved breakthroughs on many NLP tasks and shown satisfactory performance on relevance matching, but they are hard to deploy in the online environment due to their high time complexity. Moreover, these methods cannot deal with the abundant context information (i.e., the neighbor features in a query-item bipartite graph) in e-commerce product search. Last but not least, when applied to real-world scenarios, existing classical relevance matching models directly use user behaviors as labeling information (Figure 1). However, this solution is not directly suitable for relevance matching because user behaviors are often noisy and deviate from relevance signals (Mao et al., 2019; Liu and Mao, 2020).
In this paper, we propose to incorporate bipartite graph embedding into the knowledge distillation framework (Li et al., 2021; Dong et al., 2021; Rashid et al., 2021; Wu et al., 2021b; Zhang et al., 2020) to solve the relevance matching problem in the scene of e-commerce product search. We adopt BERT (Devlin et al., 2019) as the teacher model in this framework. Also, we design a novel model called BERM (Bipartite graph Embedding for Relevance Matching), which acts as the student model in our knowledge distillation framework. This model captures the 0-order relevance using a word interaction matrix attached with positional encoding, and captures the higher-order relevance using metapath embeddings with graph attention scores. For online deployment, it is further distilled into a tiny model BERM-O.
Our main contributions are as follows: • We formalize the k-order relevance problem in a bipartite graph (Section 2.1) and address it by a knowledge distillation framework with a novel student model called BERM.
• We apply BERM to the e-commerce product search scene with abundant context information (Section 2.4) and evaluate its performance (Section 3). The results indicate that BERM outperforms the state-of-the-art methods.

Problem Definition
We first give the definition of the bipartite graph:

Definition 1 Bipartite Graph. A bipartite graph is denoted as G=(U, V, E, A, R), where U and V are two disjoint node sets and E is the edge set: each edge $e_i$ connects $u_j$ in U and $v_k$ in V. In addition, there is a node type mapping function $f_1: U \cup V \rightarrow A$ and an edge type mapping function $f_2: E \rightarrow R$, where A and R are the sets of node types and edge types, respectively.

Example 1 Given a search log, a query-item bipartite graph is built as shown in Figure 1, where A = {Query, Item} and R = {Click}.
In a bipartite graph, we use the metapath and metapath instance to incorporate the neighboring node information into relevance matching. They are defined as follows:

Definition 2 Metapath and Metapath Instance in Bipartite Graph. Given a bipartite graph G=(U, V, E, A, R), the metapath $P_i = a_1 \xrightarrow{r_1} a_2 \xrightarrow{r_2} \cdots \xrightarrow{r_l} a_{l+1}$ is a path from $a_1$ to $a_{l+1}$ successively through $r_1, r_2, \cdots, r_l$ ($a_j \in A$, $r_j \in R$). The length of $P_i$ is denoted as $|P_i|$ and $|P_i| = l$. For brevity, the set of all metapaths on G can be represented by a regular expression over the node types in which $a$ and $a'$ alternate, where $a, a' \in A$ and $a \neq a'$. The metapath instance $p$ is a definite node sequence instantiated from metapath $P_i$. All instances of $P_i$ are denoted as $I(P_i)$, so $p \in I(P_i)$.
Definition 3 k-order Relevance. Given a bipartite graph G=(U, V, E, A, R), the k-order relevance between $u_i$ and $v_j$ is estimated as $\hat{y} = G(\Phi(u_i), \Phi(v_j), C_k)$, where $\Phi(\cdot)$ is a function to map each node to a representation vector, $G(\cdot)$ is the score function, $u_i \in U$, $v_j \in V$, and the context information $C_k$ consists of the neighbors of $u_i$ and $v_j$ within $k$ hops on G.

Many existing relevance matching models (Huang et al., 2013; Shen et al., 2014; Hu et al., 2014a) ignore the context information $C_k$ and only consider the sentences w.r.t. the query and item to be matched, which corresponds to 0-order relevance (for more details, please see the "Related Work" section). We call it context-free relevance matching in this paper. Considering that both the 0-order neighbor (i.e., the node itself) and the k-order neighbors (k ⩾ 1) are necessary for relevance matching, we argue that a reasonable mechanism should ensure that they can cooperate with each other. Then the research objective of our work is defined as follows:

Definition 4 Contextual Relevance Matching. Given a bipartite graph G = (U, V, E, A, R), the task of contextual relevance matching is to determine the context information $C_k$ on G and learn the score function $G(\cdot)$.
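To make the notions of metapath instances and context concrete, the following minimal Python sketch builds a toy query-item bipartite graph and enumerates the "Q-I-Q" metapath instances that form the context of an anchor query. The graph contents and helper names are illustrative assumptions, not part of the paper's data.

from collections import defaultdict

# Toy query-item bipartite graph stored as adjacency lists (illustrative data).
query_to_items = {"q1": ["i2", "i3"], "q3": ["i2"]}
item_to_queries = defaultdict(list)
for q, items in query_to_items.items():
    for i in items:
        item_to_queries[i].append(q)

def qiq_instances(anchor_query):
    """Enumerate metapath instances of 'Q-I-Q' starting from the anchor query."""
    instances = []
    for item in query_to_items.get(anchor_query, []):
        for query in item_to_queries[item]:
            if query != anchor_query:          # exclude the trivial path back to the anchor
                instances.append((anchor_query, item, query))
    return instances

print(qiq_instances("q1"))   # [('q1', 'i2', 'q3')]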

Overview
We propose a complete knowledge distillation framework (Figure 2), whose student model incorporates the context information, for contextual relevance matching in e-commerce product search. The main components of this framework are described as follows:

• Graph construction. We first construct a raw bipartite graph G based on the search data collected from JD.com. Then we construct a knowledge-enhanced bipartite graph G′ with the help of BERT, which is fine-tuned on the human-labeled relevance data.
• Student model design. We design a novel student model BERM corresponding to the score function G(·) in Definition 4. Specifically, macro and micro matching embeddings are derived in BERM to capture the sentence-level and word-level relevance matching signals, respectively. Also, based on the metapaths "Q-I-Q" and "I-Q-I", we design a node-level encoder and a metapath-instance-level aggregator to derive metapath embeddings.
• Online application. To serve online search, we further distill BERM into BERM-O, which is easy to deploy online.

Bipartite Graph Construction
We introduce the external knowledge from BERT to refine the raw user behavior graph G into a knowledge-enhanced bipartite graph G′. The whole graph construction includes the following phases.
Fine-tuning BERT. We use the BERT model as the teacher model in our framework. BERT is pre-trained on a large text corpus and fine-tuned on our in-house data, where the positive examples and negative examples are human-labeled and cover various item categories. The fine-tuned BERT is equipped with good relevance discrimination and thus acts as an expert in filtering noisy data. For each example pair $p_i$ in the transfer set $S_{transfer}$, we use BERT to predict its score $y_i$ as the training label of the student model BERM.
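As an illustration of how the fine-tuned teacher can label the transfer set, the sketch below scores query-item pairs with a sequence-pair classification head from the Hugging Face transformers library. The checkpoint path, the example pairs, and the use of the positive-class probability as $y_i$ are assumptions of this sketch rather than the exact production setup.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical path to a BERT checkpoint fine-tuned on human-labeled relevance pairs.
CKPT = "models/bert-relevance-finetuned"
tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForSequenceClassification.from_pretrained(CKPT, num_labels=2)
model.eval()

def teacher_scores(pairs, max_len=64):
    """Return the positive-class probability for each (query, item_title) pair."""
    queries, titles = zip(*pairs)
    enc = tokenizer(list(queries), list(titles), truncation=True,
                    padding=True, max_length=max_len, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    return torch.softmax(logits, dim=-1)[:, 1].tolist()   # soft labels y_i in [0, 1]

# Example transfer-set labeling (illustrative pairs).
print(teacher_scores([("mac pro 1.7GHz", "Apple MacBook Pro 13-inch"),
                      ("mac pro 1.7GHz", "MAC lipstick 1.7ml")]))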
Behavior graph construction. The user behavior graph G is built on the user search log over six months, which records click behaviors and purchase behaviors as well as their frequencies. Each edge in G represents an existing click behavior or purchase behavior between the given query and item.
Knowledge-enhanced graph refinement. The click behavior edges are dense and highly noisy, so we leverage the fine-tuned BERT model to refine G. Specifically, we retain all the raw purchase behavior edges, and meanwhile use the knowledge generated by the fine-tuned BERT to refine the click behavior edges. We set two thresholds α and β to determine which raw edges are removed and which new edges are added. This strategy helps remove the noise in user behaviors and, at the same time, retrieve the missing but relevant neighbors that cannot be captured by user behaviors. To preserve important neighbors, for each anchor node, we rank its 1-hop neighbors with the priority of "purchase > high click > low click" and select the top two of them as the final neighbor list, i.e., the neighbor list of a query node Q is represented as [I_top1, I_top2] and the neighbor list of an item node I is represented as [Q_top1, Q_top2]. The algorithm of graph construction is provided in Appendix A, and a sketch of the neighbor ranking step is given below.
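A minimal Python sketch of the "purchase > high click > low click" neighbor ranking follows; the edge record format and the click-count cut-off separating high from low clicks are illustrative assumptions.

# Each edge record: (neighbor_id, behavior, frequency); behavior is "purchase" or "click".
HIGH_CLICK = 10   # assumed cut-off separating high-click from low-click edges

def top_two_neighbors(edges):
    """Rank 1-hop neighbors with priority purchase > high click > low click, keep top two."""
    def priority(edge):
        _, behavior, freq = edge
        if behavior == "purchase":
            tier = 2
        elif freq >= HIGH_CLICK:
            tier = 1
        else:
            tier = 0
        return (tier, freq)                      # break ties by frequency
    ranked = sorted(edges, key=priority, reverse=True)
    return [neighbor for neighbor, _, _ in ranked[:2]]

edges_of_query = [("i1", "click", 3), ("i2", "purchase", 1), ("i3", "click", 25)]
print(top_two_neighbors(edges_of_query))   # ['i2', 'i3']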

BERM Model
In this part, we describe BERM in detail, including 0-order relevance modeling, k-order relevance modeling, and overall learning objective.

0-order Relevance Modeling
The whole structure of BERM includes both the 0-order relevance modeling and the k-order relevance modeling. This subsection introduces the 0-order relevance modeling, which captures sentence-level and word-level matching signals by incorporating the macro matching embedding and the micro matching embedding, respectively.
Macro and micro matching embeddings. Each word is represented by a d-dimensional embedding vector, which is trained by Word2Vec (Mikolov et al., 2013). The i-th word's embedding of query Q is denoted as $E^i_Q \in \mathbb{R}^d$, and the j-th word's embedding of item I is denoted as $E^j_I \in \mathbb{R}^d$. To capture sentence-level and word-level matching signals, we employ the macro matching embedding and the micro matching embedding, respectively. For the macro matching embedding, taking query Q with $l_Q$ words and item I with $l_I$ words as examples, their macro embeddings are calculated as the column-wise mean values of their word embedding matrices:
$$E^Q_{seq} = \frac{1}{l_Q}\sum_{i=1}^{l_Q} E^i_Q, \quad E^I_{seq} = \frac{1}{l_I}\sum_{j=1}^{l_I} E^j_I.$$
For the micro matching embedding, we first build an interaction matrix $M_{int} \in \mathbb{R}^{l_Q \times l_I}$ whose (i, j)-th entry is the dot product of $E^i_Q$ and $E^j_I$:
$$M_{int}(i, j) = E^i_Q \cdot E^j_I.$$
Then the micro matching embedding $E_{int}$ is derived from $M_{int}$ together with positional encoding.
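The macro and micro matching computations can be sketched in a few lines of PyTorch; the dimensions are illustrative, and the final reduction of the interaction matrix to $E_{int}$ (here a simple flatten through a linear layer) is an assumption of the sketch, since the paper additionally attaches positional encoding.

import torch
import torch.nn as nn

d, l_Q, l_I = 64, 6, 12
E_Q = torch.randn(l_Q, d)            # word embeddings of the query (from Word2Vec)
E_I = torch.randn(l_I, d)            # word embeddings of the item title

# Macro matching embeddings: column-wise mean of the word embedding matrices.
E_seq_Q = E_Q.mean(dim=0)            # shape (d,)
E_seq_I = E_I.mean(dim=0)            # shape (d,)

# Micro matching: interaction matrix of pairwise dot products.
M_int = E_Q @ E_I.t()                # shape (l_Q, l_I)

# Assumed reduction of M_int to a fixed-size micro matching embedding.
to_micro = nn.Linear(l_Q * l_I, d)
E_int = to_micro(M_int.flatten())    # shape (d,)
print(E_seq_Q.shape, E_seq_I.shape, M_int.shape, E_int.shape)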

k-order Relevance Modeling
The k-order relevance model contains a node-level encoder and a metapath-instance-level aggregator.
Node-level encoder. The input of the node-level encoder is node embeddings and its output is an instance embedding (i.e., the embedding of a metapath instance). Specifically, to obtain the instance embedding, we integrate the embeddings of the neighboring nodes into the anchor node embedding with a mean encoder. Taking "Q-I_top1-Q_top1" as an example, we calculate its embedding $E_{Q\text{-}I_{top1}\text{-}Q_{top1}} \in \mathbb{R}^d$ as the mean of the embeddings of the three nodes on the instance:
$$E_{Q\text{-}I_{top1}\text{-}Q_{top1}} = \frac{1}{3}\left(E_Q + E_{I_{top1}} + E_{Q_{top1}}\right).$$
The metapath instance bridges the communication gap between different types of nodes and can be used to update the anchor node embedding with structure information.
Metapath-instance-level aggregator. The inputs of the metapath-instance-level aggregator are instance embeddings and its output is a metapath embedding. Different metapath instances convey different information, so they have various effects on the final metapath embedding. However, the mapping relationship between the instance embeddings and the metapath embedding is unknown. To learn their relationship automatically, we introduce the "graph attention" mechanism to generate metapath embeddings (Wu et al., 2021a; Liu et al., 2022). Taking metapath "Q-I-Q" as an example, we use graph attention to represent the mapping relationship between "Q-I-Q" and its instances. The final metapath embedding is
$$E_{Q\text{-}I\text{-}Q} = \sigma\Big(\sum_{p_i \in I(Q\text{-}I\text{-}Q)} Att_i \cdot E_{p_i}\Big),$$
where $\sigma(\cdot)$ is the LeakyReLU activation function and $Att_i$ is the attention score of instance $p_i$.
Though $Att_i$ can be set as a fixed value, we adopt a more flexible way, i.e., using a neural network to learn $Att_i$ automatically. Specifically, we feed the concatenation of the anchor node embedding and the metapath instance embeddings into a one-layer neural network (its weight is $W_{att} \in \mathbb{R}^{6d \times 4}$ and its bias is $b_{att} \in \mathbb{R}^{1 \times 4}$) with a softmax layer, which outputs an attention distribution over the metapath instances. The above process is shown in Figure 3.

Embedding fusion. By the 0-order and k-order relevance modeling, three types of embeddings are generated, including the macro matching embeddings $E^Q_{seq}$, $E^I_{seq}$, the micro matching embedding $E_{int}$, and the metapath embeddings $E_{Q\text{-}I\text{-}Q}$, $E_{I\text{-}Q\text{-}I}$. We concatenate them together and feed the result to a three-layer neural network to obtain the estimation score $\hat{y}_i$. A minimal sketch of the k-order modeling and the embedding fusion is given below.
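The sketch below outlines the k-order modeling and the embedding fusion in PyTorch: a mean encoder over the nodes of each metapath instance, an attention distribution over the instances, and a small MLP over the concatenated embeddings. Layer sizes and the exact form of the attention input are assumptions for illustration; the sketch follows the spirit, not the letter, of the equations above.

import torch
import torch.nn as nn
import torch.nn.functional as F

d = 64

class KOrderBlock(nn.Module):
    def __init__(self, n_instances=4):
        super().__init__()
        # Attention over metapath instances (input: anchor embedding + instance embeddings).
        self.att = nn.Linear(d * (1 + n_instances), n_instances)
        # Fusion MLP over [E_seq_Q, E_seq_I, E_int, E_QIQ, E_IQI] -> relevance score.
        self.fuse = nn.Sequential(nn.Linear(5 * d, 2 * d), nn.ReLU(),
                                  nn.Linear(2 * d, d), nn.ReLU(),
                                  nn.Linear(d, 1))

    def metapath_embedding(self, anchor, instance_nodes):
        # Mean encoder: each instance embedding is the mean of its node embeddings.
        inst = torch.stack([nodes.mean(dim=0) for nodes in instance_nodes])   # (n_inst, d)
        att_in = torch.cat([anchor, inst.flatten()])
        att = F.softmax(self.att(att_in), dim=-1)                             # (n_inst,)
        return F.leaky_relu((att.unsqueeze(-1) * inst).sum(dim=0))            # (d,)

    def forward(self, E_seq_Q, E_seq_I, E_int, q_instances, i_instances):
        E_QIQ = self.metapath_embedding(E_seq_Q, q_instances)
        E_IQI = self.metapath_embedding(E_seq_I, i_instances)
        fused = torch.cat([E_seq_Q, E_seq_I, E_int, E_QIQ, E_IQI])
        return torch.sigmoid(self.fuse(fused))                                # estimated score

block = KOrderBlock()
rand_inst = [torch.randn(3, d) for _ in range(4)]       # 4 instances, 3 nodes each
score = block(torch.randn(d), torch.randn(d), torch.randn(d), rand_inst, rand_inst)
print(score.item())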

Overall Learning Objective
We evaluate the cross-entropy error between the estimation score $\hat{y}_i$ and the label $y_i$ (note that $y_i \in [0, 1]$ is the score of the teacher model BERT), and then minimize the following loss function:
$$\mathcal{L} = -\frac{1}{\tilde{n}}\sum_{i=1}^{\tilde{n}} \big(y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\big),$$
where $\tilde{n}$ is the number of examples. We also analyze the complexities of BERT, BERM, and BERM-O in Appendix C.
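Since the teacher produces soft labels $y_i \in [0, 1]$, this loss is the standard binary cross-entropy with soft targets; a minimal PyTorch sketch (the tensor values are illustrative) is:

import torch
import torch.nn.functional as F

def distillation_loss(student_scores, teacher_scores):
    """Binary cross-entropy between student predictions and the teacher's soft labels."""
    return F.binary_cross_entropy(student_scores, teacher_scores)

student_scores = torch.tensor([0.92, 0.15, 0.60])   # estimated scores from BERM
teacher_scores = torch.tensor([0.97, 0.05, 0.70])   # soft labels from fine-tuned BERT
print(distillation_loss(student_scores, teacher_scores).item())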

Experiments
In this section, we present the offline and online experimental results of BERM * .

Experimental Setting
Datasets. We collect three datasets from the search platform of JD.com, including the "Electronics" category (Data-E), all-category (Data-A), and sampled all-category (Data-S). In the platform, there are mainly three different levels of item categories: Cid1 (highest level, e.g., "Electronics"), Cid2 (e.g., "Mobile phone"), and Cid3 (lowest level, e.g., "5G phone"). Data-A, Data-S, and Data-E have different data distributions. Specifically, Data-A covers all first-level categories Cid1 in the platform; Data-S is generated by uniformly sampling 5,000 items from Cid1; Data-E only focuses on the category of "Electronics" in Cid1. Details of Data-E, Data-A, and Data-S are reported in Table 1.
For the training data $S_{train}$ (also called $S_{transfer}$), the collected user behaviors include click and purchase. For the testing data $S_{test}$, whose queries are disjoint from those of $S_{train}$, we use human labeling to distinguish between relevant and irrelevant items. Specifically, editors are asked to assess the relevance scores between queries and items. In the JD.com platform, the candidate set of relevance scores is {1, 2, 3, 4, 5}, where 5 means most relevant and 1 means least relevant. To simplify it, we use binary labeling including the positive label (i.e., 4 or 5) and the negative label (i.e., 1, 2, or 3).
Evaluation Metrics. To measure the performance of the baseline methods and our BERM, we use three evaluation metrics, including the Area Under the receiver operating characteristic Curve (AUC), F1-score, and False Negative Rate (FNR). A low value of FNR indicates a low probability of fetching irrelevant items, which is closely related to the user's search experience. Therefore, we include it in the evaluation metrics.
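For reference, the metrics can be computed from binary labels and predicted scores as sketched below. Following the description above, FNR is computed here as the fraction of irrelevant items that the model predicts as relevant; this convention, and the 0.5 decision threshold, are assumptions of the sketch.

import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

def evaluate(y_true, y_score, threshold=0.5):
    """y_true: 1 = relevant, 0 = irrelevant; y_score: predicted relevance in [0, 1]."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    auc = roc_auc_score(y_true, y_score)
    f1 = f1_score(y_true, y_pred)
    neg = (y_true == 0)
    fnr = (y_pred[neg] == 1).mean()     # irrelevant items wrongly fetched as relevant
    return {"AUC": auc, "F1": f1, "FNR": fnr}

print(evaluate([1, 0, 1, 0, 1], [0.9, 0.2, 0.7, 0.6, 0.4]))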

Offline Performance
We compare BERM with 12 state-of-the-art relevance matching methods and 5 graph neural network models on our in-house product search data.
The results are shown in Table 2. Because some baseline methods (e.g., DRMM and ESIM) have high time complexities, we use Data-E and Data-S for training and testing the models.
As shown in Table 2, BERM outperforms all the baselines according to the metrics of AUC, F1-score, and FNR. More specifically, we have the following findings: 1) Compared to the second-best method, BERM achieves consistent improvements on both Data-E and Data-S. Furthermore, BERM achieves the lowest value of FNR on both Data-E and Data-S. This implies that BERM can easily identify irrelevant items so that it can return a list of satisfactory items in the real-world scene. 2) The collected training data have imbalanced classes (i.e., the positive examples are far more than the negative examples), which poses a challenge to model learning. Most baselines are sensitive to class imbalance. Since BERM learns explicit node semantics by integrating the neighboring node information, our method is robust when the data are imbalanced.

Case Study
Apart from the above quantitative analysis, we conduct qualitative analysis based on some cases of e-commerce product search. For these cases, we list the query phrase, item title, human labeling, score of BERT, and score of BERM in Table 3. We have the following empirical conclusions: 1) Most of the student's scores are close to the teacher's, which indicates the success of the proposed knowledge distillation framework. 2) Some cases imply that context information is necessary for relevance matching. For example, for the query "nissan thermos cup", the teacher model cannot explicitly judge whether or not the item entitled "Disney thermos cup 500ML" is relevant to it. With the help of context information in the query-item bipartite graph, BERM can recognize that this query is related to "nissan", rather than "Disney".

Deployment & Online A/B Testing
We further distill BERM to obtain a lighter model BERM-O, whose basic structure is a two-layer neural network. The process of further distillation is almost the same as the first knowledge distillation, and the transfer set is generated in the same way.

Online results. We compare BERM-O with BERT2DNN (Jiang et al., 2020), which is our online baseline model using knowledge distillation without context information. The results of A/B testing are reported in Table 4. These results are from one observation lasting more than ten days. Four widely-used online business metrics are adopted: 1) conversion rate (CVR): the average order number of each click behavior, 2) user conversion rate (UCVR): the average order number of each user, 3) unique visitor value (UV-value): the average gross merchandise volume of each user, and 4) revenue per mile (RPM): the average gross merchandise volume of each retrieval behavior. The results show that BERM-O outperforms BERT2DNN in the platform according to all of the business metrics. For example, BERM-O significantly improves UV-value by 5.7% (relative value) under the price sort mode.
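The four business metrics are simple ratios over aggregated logs; the following sketch computes them from assumed aggregate counts (the field names and numbers are illustrative).

def business_metrics(orders, clicks, users, retrievals, gmv):
    """CVR, UCVR, UV-value, and RPM as defined above (all inputs are aggregate counts)."""
    return {
        "CVR": orders / clicks,          # average order number per click
        "UCVR": orders / users,          # average order number per user
        "UV-value": gmv / users,         # average GMV per user
        "RPM": gmv / retrievals,         # average GMV per retrieval
    }

print(business_metrics(orders=1_200, clicks=50_000, users=8_000,
                       retrievals=120_000, gmv=450_000.0))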

Classical Relevance Matching Models
The classical relevance matching models use deep learning techniques to learn vector representations containing the semantics of words or sequences. The prevailing methods are either representation-focused (e.g., DSSM (Huang et al., 2013), CDSSM (Shen et al., 2014), and ARC-I (Hu et al., 2014b)) or interaction-focused (e.g., MatchPyramid (Pang et al., 2016a), ARC-II (Hu et al., 2014b), and ESIM (Chen et al., 2017)). The representation-focused methods learn the low-dimensional representations of both sentences and then predict their relationship by calculating the similarity between the two representations. The interaction-focused methods learn an interaction representation of both sentences based on the calculation from word level to sentence level.
However, the above methods ignore the inherent context information contained in search logs (Qin et al., 2022; Roßrucker, 2022). In this work, we incorporate the advantages of representation-based and interaction-based embeddings into BERM, which is focused on contextual relevance matching.
However, the multi-layer stacked Transformer structure in Transformer-style models leads to high time complexity, so they are hard to deploy online. In this work, we use BERT to generate the supervised information of BERM and refine the noisy behavior data. Also, the graph-based models GRMM and GHRM are essentially different from ours in the definition of graphs. In their constructed graphs, nodes are unique words and edges are the co-occurrence relationships. In this work, we leverage query phrases (or item titles) as nodes and user behaviors as edges, which is more suitable for the product search problem.

Online Knowledge Distillation Methods
Knowledge distillation was first proposed in (Hinton et al., 2015). Its main idea is to transfer the knowledge generated by a massive teacher model into a light student model. Because of the low complexity of the student model, it is easy to deploy the student model to the online platform. Considering the strong semantic understanding ability of BERT, some studies exploit the potential of BERT as the teacher model of knowledge distillation. Two design principles are common: the isomorphic principle and the isomeric principle. Specifically, the distillation methods that follow the isomorphic principle use the same model architecture for the teacher and student models, such as TinyBERT (Jiao et al., 2020), BERT-PKD (Sun et al., 2019a), MT-DNN (Liu et al., 2019a), and DistilBERT (Sanh et al., 2019). As a more advanced design principle, the isomeric principle uses different model architectures for the teacher and student models, such as Distilled BiLSTM (Tang et al., 2019) and BERT2DNN (Jiang et al., 2020).
Although the above methods reduce the total time costs by learning a light student model, they ignore the context information in the real search scene. Our proposed knowledge distillation framework follows the isomeric principle and further integrates context information into the student model by bipartite graph embedding.

Conclusions and Future Work
In this paper, we propose the new problem of contextual relevance matching in e-commerce product search. Different from the previous work that only uses 0-order relevance modeling, we propose a novel k-order relevance modeling method, i.e., employing bipartite graph embedding to exploit the potential context information in the query-item bipartite graph. Compared to the state-of-the-art relevance matching methods, the new method BERM performs robustly in the experiments. We further distill BERM into BERM-O and deploy BERM-O to the JD.com online e-commerce product search platform. The results of A/B testing indicate that BERM-O improves the user's search experience significantly. In the future, we plan to apply our method to other e-commerce applications such as recommender systems and advertisements.

A Graph Construction Algorithm
We provide the complete algorithm of graph construction in Algorithm 1.
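Based on the steps of Algorithm 1 that are visible in the text (fine-tune BERT, score all collected pairs, then add or delete edges against the thresholds α and β while keeping purchase edges), a Python-style sketch of the refinement loop is given below; the data structures and the bert_score helper are assumptions of this sketch rather than the exact implementation.

ALPHA, BETA = 0.3, 0.7    # default thresholds reported in the paper

def build_refined_graph(pairs, raw_edges, bert_score):
    """Refine the raw click edges with teacher scores; purchase edges are kept as-is.

    pairs:      iterable of (query, item) pairs from the collected dataset S_input
    raw_edges:  dict mapping (query, item) -> behavior ("click" or "purchase")
    bert_score: callable returning the fine-tuned teacher score y_i in [0, 1]
    """
    transfer_set, refined_edges = [], dict(raw_edges)
    for q, i in pairs:
        y = bert_score(q, i)
        transfer_set.append(((q, i), y))              # soft label for the student model
        behavior = raw_edges.get((q, i))
        if behavior == "purchase":
            continue                                   # purchase edges are always retained
        if behavior == "click" and y < ALPHA:
            refined_edges.pop((q, i), None)            # delete a noisy click edge
        elif behavior is None and y > BETA:
            refined_edges[(q, i)] = "click"            # add a missing but relevant edge
    return transfer_set, refined_edges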

B Word Embedding in E-commerce Scene
In this part, we introduce the details of the word embedding generation used in this work. In the e-commerce scene, the basic representation of a query or an item is an intractable problem. On one hand, it is infeasible to represent queries and items as individual embeddings due to the unbounded entity space. On the other hand, product type names (like "iphone11") or attribute names (like "256GB") have special background information and could contain complex lexicons such as different languages and numerals. To address these problems, we adopt word embedding in BERM, which dramatically reduces the representation space. Also, we treat contiguous numerals, contiguous English letters, or single Chinese characters as one word and only retain the high-frequency words (such as the words occurring more than fifty times in a six-month search log) in the vocabulary. The final vocabulary contains only tens of thousands of words, which reduces memory consumption and index lookup time by a large margin.
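A regex-based sketch of this tokenization rule (contiguous numerals, contiguous English letters, or a single Chinese character as one word, followed by a frequency filter) is shown below; the frequency threshold of 50 comes from the text, while the corpus handling is illustrative.

import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"[0-9]+|[A-Za-z]+|[\u4e00-\u9fff]")

def tokenize(text):
    """Split text into contiguous numerals, contiguous English letters, or single Chinese characters."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("iphone11 保温杯 256GB"))   # ['iphone', '11', '保', '温', '杯', '256', 'GB']

# Keep only high-frequency words (e.g., occurring more than fifty times in the search log).
def build_vocab(corpus, min_count=50):
    counts = Counter(tok for line in corpus for tok in tokenize(line))
    return {tok for tok, c in counts.items() if c > min_count}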

C Complexity Analysis
In this section, we analyze the time and space complexities of BERT (teacher model), BERM (student model), and BERM-O (online model).

C.1 Time Complexity
For the lookup operation on the static vocabulary table (i.e., a word embedding table whose size is $n_w$), the time complexities of BERT, BERM, and BERM-O are the same, i.e., $O(n_w)$. For the model calculation part, BERT uses Transformer networks. We denote the word embedding size, head number, network number, query length, and item length as $d_1$, $h_1$, $k_1$, $l_Q$, $l_I$, respectively. For the one-layer multi-head attention mechanism, the complexity of the linear mapping (input projection) is $O((l_Q + l_I)d_1^2)$, the complexity of the attention calculation is $O((l_Q + l_I)^2 d_1)$, and the complexity of the linear mapping (output projection) is $O((l_Q + l_I)d_1^2)$, so the total model calculation complexity of the $k_1$-layer BERT is $O\big(k_1((l_Q + l_I)d_1^2 + (l_Q + l_I)^2 d_1)\big)$. For the student model BERM, we denote the word embedding size, hidden size, and network number as $d$, $h_2$, $k_2$, respectively. The complexity of calculating the micro matching embedding is $O(l_Q l_I d)$, which is far more than that of calculating the macro matching embedding. The complexity of the $k_2$-layer DNN is $O(h_2 d k_2)$. Therefore, the total model calculation complexity of the $k_2$-layer BERM is $O(l_Q l_I d + h_2 d k_2)$. For the online model BERM-O, we denote the word embedding size, hidden size, and network number as $d_3$, $h_3$, $k_3$, respectively. The model calculation complexity of the $k_3$-layer BERM-O is $O(d_3 h_3 k_3)$. Note that the complexity of BERM-O is independent of $l_Q$ and $l_I$ because BERM-O only receives sentence embeddings and does not calculate word-level matching signals. Based on the above analysis, we can conclude that BERM is more efficient than BERT, and BERM-O has a further advantage over BERM in time complexity.

C.2 Space Complexity
The storage of the static vocabulary table takes up the majority of the total space storage. Therefore, the space complexities of BERT, BERM, and BERM-O are the same, i.e., $O(n_w d)$.

D.1 Baselines
The model BERM is compared with several state-of-the-art models. Like BERM, these models are used as the student model of the proposed knowledge distillation framework. We adopt the hyper-parameter settings recommended by the original papers for all the methods. According to the formulation process of embedding, these methods can be divided into the following three types:

• Three representation-focused relevance matching methods: DSSM (Huang et al., 2013), MV-LSTM (Wan et al., 2016), and ARC-I (Hu et al., 2014b). They learn the low-dimensional representations of both sentences w.r.t. a query and an item, and then predict their relationship by calculating the similarity (such as cosine similarity) of the representations.
• Two integrated relevance matching methods: Duet (Mitra et al., 2017) and BERT2DNN (Jiang et al., 2020). They combine the features of the above two types of methods.

D.2 Implementation Details
Here we introduce the implementation details of the whole knowledge distillation framework as follows:

• Teacher model. For the teacher model, we adopt BERT-Base* with a 12-layer ($k_1$=12) Transformer encoder, where the word embedding size $d_1$ is 768 and the head number $h_1$ is 12. We fine-tune BERT-Base on a human-labeled dataset with 380,000 query-item pairs. The fine-tuned BERT-Base is then used as an expert to refine the noisy click behavior data from $S_{train}$. The refinement rule is: if the prediction score $y_i$ of BERT-Base is less than α (the default value of α is 0.3), then the raw edge is deleted; if the score is larger than β (the default value of β is 0.7), then a new edge is added.

* https://github.com/google-research/bert

E Additional Experiments
We conduct some additional experiments, including ablation studies (Section E.1) and sensitivity analysis (Section E.2). In these experiments, we adopt Data-E and Data-A consistently.

E.1.1 Integration of Embeddings
There are three types of components in the complete BERM: the representation-based embeddings $E^Q_{seq}$, $E^I_{seq}$, the interaction-based embedding $E_{int}$, and the metapath embeddings $E_{Q\text{-}I\text{-}Q}$, $E_{I\text{-}Q\text{-}I}$. To further examine the importance of each component in the final embedding of BERM, we remove one or two components from it (Equation 9) at a time and examine how the change affects the overall performance.
The corresponding results on Data-E and Data-A are reported in Tables 5 and 6. We have the following empirical observations and analysis:

• In general, the two-component settings outperform the single-component settings but are worse than the triple-component setting (i.e., BERM). This demonstrates that different components in BERM have different positive effects on the overall performance and they cannot replace each other.
• The introduction of k-order relevance modeling brings stable improvement to each 0-order relevance model. For example, the combination of "$E^Q_{seq}$, $E^I_{seq}$, $E_{Q\text{-}I\text{-}Q}$, $E_{I\text{-}Q\text{-}I}$" surpasses the combination of "$E^Q_{seq}$, $E^I_{seq}$" by 6.83% according to the metric of AUC on Data-A. This demonstrates that applying metapath embedding to relevance matching can make effective use of the neighboring nodes' information in the user behavior graph.

E.1.2 Effect of the Intermediate Node
The metapath defined in BERM includes the intermediate node. To further investigate the effect of the intermediate node, we compare the performance of BERM with the intermediate node (i.e., "Q-I-Q" and "I-Q-I") and BERM without the intermediate node (i.e., "Q-Q" and "I-I") on Data-E and Data-A in Figure 4. We observe that BERM with the intermediate node performs better than the other one. We infer that the intermediate node has strong semantic closeness to the anchor node and thus is helpful for accurate semantic recognition.

E.2.1 Thresholds α and β
In BERM, α decides how many edges of the noisy click behavior should be deleted and β decides how many hidden useful edges should be retrieved. To investigate the sensitivity of α and β, we conduct experiments with 16 different hyper-parameter settings, where α ranges from 0.2 to 0.5 and β ranges from 0.5 to 0.8. We apply third-order curve interpolation to show the final results in Figure 5. In general, the results of BERM are robust to the change of the hyper-parameters α and β on either Data-E or Data-A. For example, the maximum error of AUC is no more than 1%. So we conclude that user behaviors play a major role in the performance of BERM and the knowledge from BERT provides auxiliary effects.

E.2.2 Threshold k
Here we evaluate the effect of k on the performance of BERM by sampling neighboring nodes with different hops from the bipartite graph. The comparison results on Data-E and Data-S are shown in Figure 6. We can see that BERM with k=2 achieves the best performance among them. When k is too large, such as k=5, many distant neighbors are aggregated into the anchor nodes, which leads to performance degradation of BERM due to the noise gathered from these distant neighbors. Therefore, we conclude that 2-order relevance modeling is the optimal choice for our e-commerce scene.

E.2.3 Selection of Neighbor Structure
In BERM, the selection of the neighbor structure directly affects which context information is transmitted to the anchor node. A good selection strategy can aggregate valuable neighboring node information to enrich the anchor node's representation. To investigate the effect of different neighbor structure selection strategies on BERM and seek a relatively optimal solution, we use different values of the hyper-parameter λ to control the ratio between user behavior and BERT's score. Specifically, we calculate a new score $Score_{new}(Q, I) = \lambda \cdot User(Q, I) + (1-\lambda) \cdot Score(Q, I)$, where $User(Q, I)$ is the user behavior feature (e.g., for click behavior, $User(Q, I)=1$ if a click behavior happens between query Q and item I). The addition and deletion of edges then refer to $Score_{new}(Q, I)$, rather than $Score(Q, I)$. We report the results with different λ in Tables 7 and 8. From them, we can conclude that:

• Using BERT's score is better than using user behaviors for the selection of neighbors. Therefore, the value of AUC or F1-score gradually decreases with the increase of λ; the value of FNR increases with the increase of λ. The optimal λ is located in the interval [0.0, 0.2].
• According to the metric of FNR, the purchase behavior is better than the click behavior on Data-E.
The reason is that purchase behaviors reveal more accurate semantic relevance information than click behaviors. However, the purchase behavior is worse than the click behavior on Data-A.
We guess that it is caused by the sparsity of purchase behaviors in the dataset of all categories.

Figure 1 :
Figure 1: Shortcoming of the existing relevance matching model. Here we take the ARC-I model as an example. The right part shows the ground truth of queries and item titles. The left part shows two problematic examples in ARC-I, which deviate from the ground truth.

Figure 2 :
Figure 2: The e-commerce knowledge distillation framework proposed in our work. Three models are used in this framework: the teacher model BERT, the student model BERM, and the online model BERM-O.

Figure 4 :
Figure 4: Effect of the intermediate node. The red (blue) bar represents BERM with (without) the intermediate node.

Figure 5 :
Figure 5: Sensitivity of BERM to the thresholds α and β on Data-E and Data-A.

Figure 6 :
Figure 6: The effect of different k.

Table 1 :
Statistics of the used datasets.

Table 2 :
Comparisons on Data-E and Data-S. In each column, the best result is bolded and the runner-up is underlined. The symbol "↓" indicates that a lower value corresponds to better performance. "I, II, III" represent the representation-focused, interaction-focused, and both-focused relevance matching models, respectively. "IV" represents the graph neural network models.

Table 3 :
Cases of e-commerce product search. "$y_i$" is the prediction score of the teacher model BERT and "$\hat{y}_i$" is the relevance estimation score of the student model BERM.

Table 4 :
Online performance of BERM-O under price sort mode and default sort mode.
Algorithm 1 Bipartite Graph Construction
Input: Thresholds α and β; Collected dataset $S_{input} = \{p_i\}_{i=1}^{\tilde{n}}$ (note that $\tilde{n}$ is the number of examples); Raw user behavior graph G=(U, V, E, A, R), where E = $\{e_1, \cdots, e_m\}$, A = {Query, Item}, R = {Click, Purchase}.
Output: Transfer set $S_{transfer} = \{p_i; y_i\}_{i=1}^{\tilde{n}}$, where $y_i$ is the training label of query-item pair $p_i$ ($y_i \in [0, 1]$); Refined bipartite graph G′.

Table 5 :
Ablation study on Data-E. In each column, the best result is bolded.

Table 6 :
Ablation study on Data-A. In each column, the best result is bolded.

Table 7 :
Effect of different neighbor selection strategies on Data-E.
Table 8 :
Effect of different neighbor selection strategies on Data-A.