BERT Goes Shopping: Comparing Distributional Models for Product Representations

Word embeddings (e.g., word2vec) have been applied successfully to eCommerce products through prod2vec. Inspired by the recent performance improvements on several NLP tasks brought by contextualized embeddings, we propose to transfer BERT-like architectures to eCommerce: our model, Prod2BERT, is trained to generate representations of products through masked session modeling. Through extensive experiments over multiple shops, different tasks, and a range of design choices, we systematically compare the accuracy of Prod2BERT and prod2vec embeddings: while Prod2BERT is found to be superior in several scenarios, we highlight the importance of resources and hyperparameters in the best performing models. Finally, we provide guidelines to practitioners for training embeddings under a variety of computational and data constraints.


Introduction
Distributional semantics (Landauer and Dumais, 1997) is built on the assumption that the meaning of a word is given by the contexts in which it appears: word embeddings obtained from co-occurrence patterns through word2vec (Mikolov et al., 2013) proved to be both accurate by themselves in representing lexical meaning, and very useful as components of larger Natural Language Processing (NLP) architectures (Lample et al., 2018). The empirical success and scalability of word2vec gave rise to many domain-specific models (Ng, 2017; Grover and Leskovec, 2016; Yan et al., 2017): in eCommerce, prod2vec is trained by replacing words in a sentence with product interactions in a shopping session (Grbovic et al., 2015), eventually generating vector representations of the products. The key intuition is the same one underlying word2vec: you can tell a lot about a product by the company it keeps (in shopping sessions). The model enjoyed immediate success in the field and is now essential to NLP and Information Retrieval (IR) use cases in eCommerce (Vasile et al., 2016a). As a key improvement over word2vec, the NLP community has recently introduced contextualized representations, in which a word like play would have different embeddings depending on the general topic (e.g. a sentence about theater vs soccer), whereas in word2vec the word play has only one vector. Transformer-based architectures (Vaswani et al., 2017) in large-scale models, such as BERT (Devlin et al., 2019), achieved SOTA results in many tasks (Nozza et al., 2020; Rogers et al., 2020). As Transformers are being applied outside of NLP, it is natural to ask whether we are missing a fruitful analogy with product representations. It is a priori reasonable to think that a pair of sneakers can have different representations depending on the shopping context: is the user interested in buying these shoes because they are running shoes, or because they are made by her favorite brand?
In this work, we explore the adaptation of BERT-like architectures to eCommerce: through extensive experimentation on downstream tasks and empirical benchmarks on typical digital retailers, we discuss advantages and disadvantages of contextualized embeddings when compared to traditional prod2vec. We summarize our main contributions as follows: 1. we propose and implement a BERT-based contextualized product embeddings model (hence, Prod2BERT), which can be trained with online shopper behavioral data and produce product embeddings to be leveraged by downstream systems; 2. we benchmark Prod2BERT against prod2vec embeddings, showing the potential accuracy gain of contextual representations across different shops and data requirements. By testing on shops that differ in traffic, catalog, and data distribution, we increase our confidence that our findings are indeed applicable to a vast class of typical retailers; 3. we perform extensive experiments by varying hyperparameters, architectures and fine-tuning strategies. We report detailed results from numerous evaluation tasks, and finally provide recommendations on how to best trade off accuracy with training cost; 4. we share our code 1, to help practitioners replicate our findings on other shops and improve on our benchmarks.

Product Embeddings: an Industry Perspective
The eCommerce industry has been steadily growing in recent years: according to U.S. Department of Commerce (2020), 16% of all retail transactions now occur online; worldwide eCommerce is estimated to become a $4.5 trillion industry in 2021 (Statista Research Department, 2020). Interest from researchers has been growing at the same pace (Tsagkias et al., 2020), stimulated by challenging problems and by the large-scale impact that machine learning systems have in the space (Pichestapong, 2019). Within the fast adoption of deep learning methods in the field (Ma et al., 2020; Yuan et al., 2020), product representations obtained through prod2vec play a key role in many neural architectures: after training, a product space can be used directly (Vasile et al., 2016b), as a part of larger systems for recommendation (Tagliabue et al., 2020b), or in downstream NLP/IR tasks. Combining the size of the market with the past success of NLP models in the space, investigating whether Transformer-based architectures result in superior product representations is both theoretically interesting and practically important. Anticipating some of the themes below, it is worth mentioning that our study sits at the intersection of two important trends: on one side, neural models typically show significant improvements at large scale (Kaplan et al., 2020) - by quantifying expected gains for "reasonable-sized" shops, our results are relevant also outside a few public companies (Tagliabue et al., 2021), and allow for a principled trade-off between accuracy and ethical considerations (Strubell et al., 2019); on the other side, the rise of multi-tenant players 2 makes sophisticated models potentially available to an unprecedented number of shops - in this regard, we design our methodology to include multiple shops in our benchmarks, and report how training resources and accuracy scale across deployments.
For these reasons, we believe our findings will be interesting to a wide range of researchers and practitioners.

Related Work
Distributional Models. Word2vec (Mikolov et al., 2013) enjoyed great success in NLP thanks to its computational efficiency, unsupervised nature and accurate semantic content (Levy et al., 2015; Al-Saqqa and Awajan, 2019; Lample et al., 2018). Recently, models such as BERT (Devlin et al., 2019) and RoBERTa shifted much of the community's attention to Transformer architectures and their performance (Talmor and Berant, 2019; Vilares et al., 2020), while it is increasingly clear that big datasets (Kaplan et al., 2020) and substantial computing resources play a role in the overall accuracy of these architectures; in our experiments, we explicitly address robustness by i) varying model designs, together with other hyperparameters; and ii) testing on multiple shops, differing in traffic, industry and product catalog.
Product Embeddings. Prod2vec is a straightforward adaptation to eCommerce of word2vec (Grbovic et al., 2015). Product embeddings quickly became a fundamental component for recommendation and personalization systems (Caselles-Dupré et al., 2018; Tagliabue et al., 2020a), as well as NLP-based predictions. To the best of our knowledge, this work is the first to explicitly investigate whether Transformer-based architectures deliver higher-quality product representations compared to non-contextual embeddings. Eschauzier (2020) uses Transformers on cart co-occurrence patterns with the specific goal of basket completion - while similar in the masking procedure, the breadth of the work and the evaluation methodology is very different: as convincingly argued by Requena et al. (2020), benchmarking models on unrealistic datasets makes findings less relevant for practitioners outside of "Big Tech". Our work features extensive tests on real-world datasets, which are indeed representative of a large portion of the mid-to-long tail of the market; moreover, we benchmark several fine-tuning strategies from the latest NLP literature (Section 5.2), sharing - together with our code - important practical lessons for academia and industry peers. The closest work in the literature as far as architecture goes is BERT4Rec (Sun et al., 2019), i.e. a model based on Transformers trained end-to-end for recommendations. The focus of this work is not so much the gains induced by Transformers in sequence modelling, but rather the quality of the representations obtained through unsupervised pre-training - while recommendations are important, the breadth of the prod2vec literature (Bianchi et al., 2021a,b) shows the need for a more thorough and general assessment.
Our methodology helps uncover a tighter-than-expected gap between the models in downstream tasks, and our industry-specific benchmarks allow us to draw novel conclusions on optimal model design across a variety of scenarios, and to give practitioners actionable insights for deployment.

Overview
Prod2BERT takes inspiration from the BERT architecture and aims to learn context-dependent vector representations of products from online session logs. By considering a shopping session as a "sentence" and the products shoppers interact with as "words", we can transfer masked language modeling (MLM) from NLP to eCommerce. Framing sessions as sentences is a natural modelling choice for several reasons: first, it mimics the successful architecture of prod2vec; second, by exploiting BERT's bi-directional nature, each prediction of a masked token/product will make use of past and future shopping choices: if a shopping journey is (typically) a progression of intent from exploration to purchase (Harbich et al., 2017), it seems natural that sequential modelling may capture relevant dimensions in the underlying vocabulary/catalog. Once trained, Prod2BERT becomes capable of predicting masked tokens, as well as providing context-specific product embeddings for downstream tasks.
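The session-to-"sentence" mapping above can be made concrete with a short sketch. The helper below is illustrative only (function and token names are ours, not the paper's implementation): it masks each product ID with a given probability and keeps the original IDs as labels, mirroring MLM-style training data preparation.

```python
import random

MASK = "[MASK]"

def mask_session(session, mask_prob=0.15, rng=random):
    """Return (inputs, labels) for masked session modeling.

    Each product ID is replaced by [MASK] with probability `mask_prob`;
    labels keep the original ID at masked positions and None elsewhere
    (None positions are ignored by the training loss).
    """
    inputs, labels = [], []
    for product_id in session:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(product_id)
        else:
            inputs.append(product_id)
            labels.append(None)
    return inputs, labels

# A shopping session is a sequence of product IDs, like words in a sentence.
session = ["sku_101", "sku_205", "sku_334", "sku_101", "sku_977"]
inputs, labels = mask_session(session, mask_prob=0.25)
```

During training, the model sees `inputs` and is asked to recover the IDs stored in `labels` at the masked positions, using both left and right context.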

Model Architecture
As shown in Figure 1, Prod2BERT is based on a Transformer-based architecture (Vaswani et al., 2017), emulating the successful BERT model. Note that, unlike BERT's original implementation, a white-space tokenizer is first used to split an input session into tokens, each one representing a product ID; tokens are combined with positional encodings via addition and fed into a stack of self-attention layers, where each layer contains a block for multi-head attention, followed by a simple feed-forward network. After obtaining the output from the last self-attention layer, the vectors corresponding to the masked tokens pass through a softmax to generate the final predictions. The input pipeline goes from the shopping session, to the telemetry data, to the final masking sequence. The target output sequence is exactly the original sequence without any masking, thus the training objective is to predict the original value of the masked tokens, based on the context provided by their surrounding unmasked tokens. The model learns to minimize a categorical cross-entropy loss, taking into account only the predicted masked tokens, i.e. the output of the non-masked tokens is discarded for back-propagation.
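The training objective described above, cross-entropy restricted to masked positions, can be illustrated with a minimal, framework-free sketch (real implementations operate on batched tensors; names here are ours):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a single logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def masked_mlm_loss(logits_per_position, labels):
    """Categorical cross-entropy averaged over masked positions only.

    `logits_per_position`: one logit vector (over the product vocabulary)
    per session position. `labels`: original product index at masked
    positions, None at unmasked ones, whose outputs are discarded.
    """
    losses = []
    for logits, label in zip(logits_per_position, labels):
        if label is None:
            continue  # non-masked tokens do not contribute to back-propagation
        probs = softmax(logits)
        losses.append(-math.log(probs[label]))
    return sum(losses) / len(losses) if losses else 0.0
```

With uniform logits over a vocabulary of size V, the loss at a masked position is exactly log V, the usual sanity check for an untrained model.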

Hyperparameters and Design Choices
There is a growing literature investigating how different hyperparameters and architectural choices can affect Transformer-based models. For example, Lan et al. (2020) observed diminishing returns when increasing the number of layers after a certain point; other work showed improved performance when modifying the masking strategy and using duplicated data; finally, Kaplan et al. (2020) reported slightly different findings from previous studies on factors influencing Transformer performance. Hence, it is worth studying the role of hyperparameters and model designs for Prod2BERT, in order to narrow down which settings are the best given the specific target of our work, i.e. product representations. Table 1 shows the relevant hyperparameters and design variants for Prod2BERT; following the improvement in generalization reported in prior work, when duplicated = 1 we augmented the original dataset by repeating each session 5 times. 3 We set the embedding size to 64 after preliminary optimizations: as other values offered no improvements, we report results only for one size.
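For concreteness, the duplicated = 1 setting can be sketched as follows (a hypothetical helper, not the paper's exact pipeline code); since masking is sampled randomly at training time, each copy of a session ends up with a different masking pattern:

```python
def duplicate_sessions(sessions, n_copies=5):
    """Repeat each session `n_copies` times; combined with random masking,
    every copy exposes the model to a different masked view of the same
    shopping sequence."""
    return [list(session) for session in sessions for _ in range(n_copies)]

augmented = duplicate_sessions([["sku_1", "sku_2"], ["sku_3"]], n_copies=5)
```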

Prod2vec: a Baseline Model
We benchmark Prod2BERT against the industry standard prod2vec (Grbovic et al., 2015). More specifically, we train a CBOW model with negative sampling over shopping sessions (Mikolov et al., 2013); since the role of hyperparameters in prod2vec has been extensively studied before, we follow established settings from the literature. While prod2vec is chosen because of our focus on the quality of the learned representations - and not just performance on sequential inference per se - it is worth noting that kNN (Latifi et al., 2020) over appropriate spaces is also a surprisingly hard baseline to beat in many practical recommendation settings. It is worth mentioning that for both prod2vec and Prod2BERT we are mainly interested in producing a dense space capturing the latent similarity between SKUs: other important relationships between products (substitution (Zuo et al., 2020), hierarchy (Nickel and Kiela, 2017), etc.) may require different embedding techniques (or extensions, such as interaction-specific embeddings (Zhao et al., 2020)).

Dataset
We collected search logs and detailed shopping sessions from two partnering shops, Shop A and Shop B: similarly to the dataset released by Requena et al. (2020), we employ the standard definition of "session" from Google Analytics 4, with a total of five different product actions tracked; descriptive statistics are reported in Table 2. For fairness of comparison, the exact same datasets are used for both Prod2BERT and prod2vec.
Testing on fine-grained, recent data from multiple shops is important to support the internal validity (i.e. "is this improvement due to the model or some underlying data quirks?") and the external validity (i.e. "can this method be applied robustly across deployments, e.g. Tagliabue et al. (2020b)?") of our findings.

Experiment #1: Next Event Prediction
Next Event Prediction (NEP) is our first evaluation task, since it is a standard way to evaluate the quality of product representations (Letham et al., 2013; Caselles-Dupré et al., 2018): briefly, NEP consists in predicting the next action the shopper is going to perform given her past actions. Hence, in the case of Prod2BERT, we mask the last item of every session and feed the sequence as input to a pre-trained Prod2BERT model; note that this is similar to the word prediction task for cloze sentences in the NLP literature (Petroni et al., 2019). Provided with the model's output sequence, we take the top K most likely values for the masked token, and compare them with the true interaction. As for prod2vec, we perform the NEP task by following industry best practices: given a trained prod2vec, we take all the before-last items in a session to construct a session vector by average pooling, and use kNN to predict the last item 8. For both models, action type is not considered when preparing sessions for training, and we only keep sessions that have between 3 and 20 product interactions, to eliminate unreasonably short sessions and ensure computational efficiency. Following industry standards, nDCG@K (Mitra and Craswell, 2018) with K = 10 is the chosen metric 9, and all tests ran on 10,000 testing cases (the test set is randomly sampled first, and then shared across Prod2BERT and prod2vec to guarantee a fair comparison).
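The prod2vec side of the NEP evaluation can be sketched in plain Python; the snippet below is an illustrative reconstruction (function names are ours), covering the average-pooled session vector, the cosine-similarity kNN ranking, and the single-relevant-item nDCG@K metric:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def ndcg_at_k(ranked_items, true_item, k=10):
    """nDCG@K with a single relevant item: the ideal DCG is 1, so the score
    is 1/log2(rank + 1) if the true item appears in the top K, else 0."""
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item == true_item:
            return 1.0 / math.log2(rank + 1)
    return 0.0

def predict_last_item(session, embeddings, k=10):
    """prod2vec-style NEP: average-pool the before-last items into a session
    vector, then rank the whole catalog by cosine similarity (kNN)."""
    context = [embeddings[p] for p in session[:-1] if p in embeddings]
    pooled = [sum(vals) / len(context) for vals in zip(*context)]
    ranked = sorted(embeddings, key=lambda p: cosine(pooled, embeddings[p]),
                    reverse=True)
    return ranked[:k]
```

A prediction counts more the higher it ranks: a hit at rank 1 scores 1.0, a hit at rank 2 scores 1/log2(3), and a miss outside the top K scores 0.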

Table 3: nDCG@10 on NEP task for both shops with Prod2BERT and prod2vec (bold are best scores for Prod2BERT; underline are best scores for prod2vec).
Table 3 reports results on the NEP task by highlighting some key configurations that led to competitive performances. Prod2BERT is significantly superior to prod2vec, scoring up to 40% higher than the best prod2vec configurations. Since shopping sessions are significantly shorter than sentence lengths in Devlin et al. (2019), we found that changing the masking probability from 0.15 (the value from standard BERT) to 0.25 consistently improved performance by making the training more effective. As for the number of layers, similar to Lan et al. (2020), we observed diminishing returns from deeper models, although the same model trained on the bigger Shop B obtained a small boost. Finally, duplicating training data has been shown to bring consistent improvements: while keeping all other hyperparameters constant, using duplicated data results in an up to 9% increase in nDCG@10, not to mention that after only 5 training epochs the model outperforms other configurations trained for 10 epochs or more.
Table 4: Time (minutes) and cost (USD) for training one model instance, per shop: prod2vec is trained on a c4.large instance, Prod2BERT is trained (10 epochs) on a Tesla V100 16GB GPU from a p3.8xlarge instance.
While encouraging, the performance gap between Prod2BERT and prod2vec is consistent with Transformers' performance on sequential tasks (Sun et al., 2019). However, as argued in Section 1.1, product representations are used as input to many downstream systems, making it essential to evaluate how the learned embeddings generalize outside of the pure sequential setting. Our second experiment is therefore designed to test how well contextual representations transfer to other eCommerce tasks, helping us to assess the accuracy/cost trade-off when the difference in training resources between the two models is significant: as reported in Table 4, the difference (in USD) between prod2vec and Prod2BERT is several orders of magnitude. 10

Experiment #2: Intent Prediction
A crucial element in the success of Transformer-based language models is the possibility of adapting the representations learned through pre-training to new tasks: for example, the original BERT paper (Devlin et al., 2019) fine-tuned the pre-trained model on 11 downstream NLP tasks. However, the practical significance of these results is still unclear: on one hand, Li et al. (2020); Reimers and Gurevych (2019) observed that sometimes BERT contextual embeddings can underperform a simple GloVe (Pennington et al., 2014) model; on the other, Mosbach et al. (2020) highlights catastrophic forgetting, vanishing gradients and data variance as important factors in practical failures. Hence, given the range of downstream applications and the active debate on transferability in NLP, we investigate how Prod2BERT representations perform when used in the intent prediction task. Intent prediction is the task of guessing whether a shopping session will eventually end in the user adding items to the cart (signaling purchasing intention). Since small increases in conversion can translate into massive revenue gains, this task is both a crucial problem in the industry and an active area of research (Toth et al., 2017; Requena et al., 2020). To implement the intent prediction task, we randomly sample from our dataset 20,000 sessions ending with an add-to-cart action and 20,000 sessions without add-to-cart, and split the resulting dataset for training, validation and test. Hence, given the list of previous products that a user has interacted with, the goal of the intent model is to predict whether an add-to-cart event will happen or not. (Footnote 10: costs are from official AWS pricing, with 0.10 USD/h for the c4.large (https://aws.amazon.com/it/ec2/pricing/on-demand/), and 12.24 USD/h for the p3.8xlarge (https://aws.amazon.com/it/ec2/instance-types/p3/). While cost optimizations are obviously possible, the "naive" pricing is a good proxy to appreciate the difference between the two methods.)
We experimented with several adaptation techniques inspired by the most recent NLP literature (Peters et al., 2019; Li et al., 2020): 1. Feature extraction (static): we extract the contextual representations from a target hidden layer of pre-trained Prod2BERT and, through average pooling, feed them as input to a multilayer perceptron (MLP) classifier to generate the binary prediction. In addition to trying each hidden layer, from the first (enc 0) to the last (enc 3), we also tried concatenation (concat), i.e. combining embeddings of all hidden layers via concatenation before average pooling.
2. Feature extraction (learned): we implement a linear weighted combination of all hidden layers (wal), with learnable parameters, as input features to the MLP model (Peters et al., 2019).
3. Fine-tuning: we take the pre-trained model up until the last hidden layer and add the MLP classifier on top for intent prediction (finetune). During training, both Prod2BERT and task-specific parameters are trainable.
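As a sketch of the learned feature-extraction strategy (wal) above, the weighted combination of hidden layers can be written as follows. This is a simplified, framework-free illustration with our own names; in practice the scalar weights would be trained jointly with the MLP classifier (e.g. in PyTorch):

```python
import math

def softmax(weights):
    """Normalize raw scalar weights into a probability distribution."""
    m = max(weights)
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_layer_combination(hidden_states, layer_weights):
    """Combine per-layer token representations with (learnable) scalar
    weights, then average-pool over session positions.

    hidden_states: L layers, each a list of T token vectors of dimension D.
    layer_weights: L raw scalars, normalized here with a softmax.
    Returns a single D-dimensional feature vector for the MLP classifier.
    """
    alphas = softmax(layer_weights)
    num_tokens = len(hidden_states[0])
    dim = len(hidden_states[0][0])
    pooled = [0.0] * dim
    for alpha, layer in zip(alphas, hidden_states):
        for token_vec in layer:
            for d in range(dim):
                pooled[d] += alpha * token_vec[d] / num_tokens
    return pooled
```

With equal raw weights the combination reduces to a plain average of layers; as the weights are learned, the model can shift mass toward whichever layer carries the most task-relevant information.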
As for our baseline, i.e. prod2vec, we implement the intent prediction task by encoding each product within a session with its prod2vec embeddings, and feeding them to an LSTM network (so that it can learn sequential information) followed by a binary classifier to obtain the final prediction.
Table 5: Accuracy scores in the intent prediction task (best scores for each shop in bold).

Results
From our experiments, Table 5 highlights the most interesting results obtained from adapting the best-performing Prod2BERT and prod2vec models from NEP to the new task. As a first consideration, the shallowest layer of Prod2BERT for feature extraction outperforms all other layers, and even beats the concatenation and weighted average strategies 11. Second, the quality of the contextual representations of Prod2BERT is highly dependent on the amount of data used in the pre-training phase. Comparing Table 3 with Table 5, even though the model delivers strong results in the NEP task on Shop A, its performance on the intent prediction task is weak, as it remains inferior to prod2vec across all settings. In other words, the limited amount of traffic from Shop A is not enough to let Prod2BERT form high-quality product representations; however, the model can still perform well on the NEP task, especially since the nature of NEP is closely aligned with the pre-training task. Third, fine-tuning instability is encountered and has a severe impact on model performance. Since the amount of data available for intent prediction is not nearly as large as the data utilized for pre-training Prod2BERT, overfitting proved to be a challenging aspect throughout our fine-tuning experiments. Fourth, by comparing the results of our best method against the model learnt with prod2vec embeddings, we observed that prod2vec embeddings provide only limited value for intent estimation, and the LSTM-based model stops improving very quickly; in contrast, the features provided by Prod2BERT embeddings seem to encode more valuable information, allowing the model to be trained for more epochs and eventually reach a higher accuracy score.
As a more general consideration - reinforced by a qualitative visual assessment of clusters in the resulting vector space - the performance gap is very small, especially considering that long training and extensive optimizations are needed to take advantage of the contextual embeddings.

Conclusion and Future Work
Inspired by the success of Transformer-based models in NLP, this work explores contextualized product representations as trained through a BERT-inspired neural network, Prod2BERT. By thoroughly benchmarking Prod2BERT against prod2vec in a multi-shop setting, we were able to uncover important insights on the relationship between hyperparameters, adaptation strategies and eCommerce performance on one side, and we could quantify for the first time quality gains across different deployment scenarios on the other. If we were to sum up our findings for interested practitioners, these are our highlights: 1. Generally speaking, our experimental setting proved that pre-training Prod2BERT with Masked Language Modeling can be applied successfully to sequential prediction problems in eCommerce. These results provide independent confirmation for the findings in Sun et al. (2019), where BERT was used for in-session recommendations over academic datasets. However, the tighter gap on downstream tasks suggests that Transformers' ability to model long-range dependencies may be more important than pure representational quality in the NEP task, as also confirmed by human inspection of the product spaces (see Appendix A for comparative t-SNE plots).
3. Dataset size does indeed matter: as evident from the performance difference in Table 5, Prod2BERT shows bigger gains with the largest amount of training data available. Considering the amount of resources needed to train and optimize Prod2BERT (Section 5.1.1), the gains of contextualized embeddings may not be worth the investment for shops outside the top 5k in the Alexa ranking 12; on the other hand, our results demonstrate that with careful optimization, shops with a large user base and significant resources may achieve superior results with Prod2BERT.
While our findings are encouraging, there are still many interesting questions to tackle when pushing Prod2BERT further. In particular, our results require a more detailed discussion with respect to the success of BERT for textual representations, with a focus on the differences between words and products: for example, an important aspect of BERT is the tokenizer, which splits words into subwords; this component is absent in our setting because there exists no straightforward concept of "sub-product". While far from conclusive, it should be noted that our preliminary experiments using categories as "morphemes" that attach to product identifiers did not produce significant improvements.
We leave the answer to these questions -as well as the possibility of adapting Prod2BERT to even more tasks -to the next iteration of this project.
As a parting note, we would like to emphasize that Prod2BERT has been so far the largest and (economically) most significant experiment run by Coveo: while we do believe that the methodology and findings here presented have significant practical value for the community, we also recognize that, for example, not all possible ablation studies were performed in the present work. As Bianchi and Hovy (2021) describe, replicating and comparing some models is rapidly becoming prohibitive in terms of cost for both companies and universities. Even if the debate on the social impact of large-scale models often feels very complex (Thompson et al., 2020; Bender et al., 2021) - and, sometimes, removed from our day-to-day duties - Prod2BERT gave us a glimpse of what unequal access to resources may mean in more meaningful contexts. While we (as in "humanity we") try to find a solution, we (as in "authors we") may find temporary solace knowing that good ol' prod2vec is still pretty competitive. (Footnote 12: see https://www.alexa.com/topsites.)

Ethical Considerations
User data has been collected by Coveo in the process of providing business services: data is collected and processed in an anonymized fashion, in compliance with existing legislation. In particular, the target dataset uses only anonymous uuids to label events and, as such, it does not contain any information that can be linked to physical entities.

A Visualization of Session Embeddings
Figures 3 to 6 represent browsing sessions projected into two dimensions with t-SNE (van der Maaten and Hinton, 2008): for each browsing session, we retrieve the corresponding type (e.g. shoes, pants, etc.) of each product in the session, and use majority voting to assign the most frequent product type to the session. Hence, the dots are color-coded by product type and each dot represents a unique session from our logs. It is easy to notice that, first, both contextual and non-contextual embeddings built with a smaller amount of data, i.e. Figures 3 and 4 from Shop A, show a less clear separation between clusters; moreover, the quality of Prod2BERT seems even lower than prod2vec, as there exists a larger central area where all types heavily overlap. Second, comparing Figure 5 with Figure 6, both Prod2BERT and prod2vec improve, which confirms that Prod2BERT, given enough pre-training data, is able to deliver better separations in terms of product types and more meaningful representations.