Attribute Value Generation from Product Title using Language Models

Identifying the values of product attributes is essential for many e-commerce functions, such as product search and product recommendation. Therefore, identifying attribute values from unstructured product descriptions is a critical undertaking for any e-commerce retailer. What makes this problem challenging is the diversity of product types and of their attributes and values. Existing methods have typically employed multiple machine learning models, each handling a specific product type or attribute class, which limits their scalability and generalization in large-scale, real-world e-commerce applications. Previous approaches have formulated attribute value extraction as a Named Entity Recognition (NER) task or a Question Answering (QA) task. In this paper, we present a generative approach to the attribute value extraction problem using language models. We leverage the large-scale pretraining of GPT-2 and the T5 text-to-text transformer to create fine-tuned models that effectively perform this task. We show that a single general model is highly effective over a broad set of product attribute values under the open world assumption. Our approach achieves state-of-the-art performance across attribute classes that previously required a diverse set of models.


Introduction
Product attributes and their values play an important role in e-commerce platforms. Hundreds of thousands of products are sold online, and each type of product has a different set of attributes. These attributes help customers search for products, compare relevant items, and purchase the product of their choice. While details of a product can be found both in its title and in its description, the title commonly includes the product's most important attributes. Every day, many new products are added to the product catalogue, often with new attribute types and values. However, attribute information is often sparse, noisy, and incomplete, with missing values. For example, Figure 1 shows a product with its description and attribute-value pairs available on the website. It contains attribute values for Brand Name, Type, etc., but some attributes are missing, such as "Dual-coil" for Pickup Type and "6" for Strings. Given the wide diversity of products and the new products constantly emerging, it is important that attribute value extraction works under the open world assumption, i.e., it can handle values of attributes not seen before.
Earlier work (Ghani et al., 2006; Chiticariu et al., 2010; Gopalakrishnan et al., 2012) on attribute value extraction used rule-based approaches with the help of a domain-specific seed dictionary to identify key phrases. Other work formulated this as a named entity recognition (NER) problem (Putthividhya and Hu, 2011; More, 2016). However, these approaches do not work under the open world assumption. More recently, various neural network based sequence tagging models have been proposed for attribute value extraction. Huang et al. (2015) were the first to apply the BiLSTM-CRF model to sequence tagging. Zheng et al. (2018) propose an end-to-end tagging model using BiLSTM, CRF, and attention without any dictionary or hand-crafted features. Most of these approaches create separate models for different attributes; moreover, for each attribute a, they need one set of tags to denote the beginning (B_a) and inside (I_a) of that attribute. Hence, these methods do not scale to a large set of attributes, and the models cannot identify emerging values for unseen attributes. Recent works (Xu et al., 2019; Wang et al., 2020) have set up this task as a question answering (QA) task. Question answering in machine reading comprehension (MRC) selects a span of text from a given context to answer the question. Xu et al. (2019) consider the product title as the context and the attribute as the query, and propose to find the attribute value using only a global set of BIO tags. Although the sequence tagging models (Zheng et al., 2018; Xu et al., 2019) achieve promising results, they do not work well for discovering new attribute values.
In contrast to past extractive or classification-based approaches, we take a generative approach to identify attribute values. Text generation using language models has several applications in real-world tasks such as text editing, article writing, and sentence completion. Text infilling aims to fill in the missing part of a given sentence. Motivated by this success, and to leverage the large-scale pretraining of language models, we formulate attribute value extraction both as an instance of the text infilling task and as an answer generation task. We utilize Infilling by Language Modeling (ILM) (Donahue et al., 2020) for the infilling approach, and we fine-tune the Text-to-Text Transfer Transformer (T5) (Raffel et al., 2020) for answer generation. We summarize the main contributions of this work as follows:
• We propose a language modeling approach for attribute value extraction.
• We empirically demonstrate that this approach achieves state-of-the-art results on discovering new attribute values.

Problem Statement
In this section, we formally define the problem of attribute value generation. Given a product context T = (w^t_1, w^t_2, ..., w^t_m) and its attribute A = (w^a_1, w^a_2, ..., w^a_n), our goal is to generate the value V = (w^v_1, w^v_2, ..., w^v_e). For example, consider the context of the product in Figure 1 and the attributes Type and Fingerboard Material. We want to generate the value "Electric Guitar" for the attribute Type and NULL for the attribute Fingerboard Material, as the latter is not present in the context. In this work, we formulate this problem first as (i) a text infilling task and then as (ii) an answer generation task. For text infilling, we combine the context T, attribute A, and value V in a sentence "T. A is V.", where the attribute value V is masked as a blank. Our objective is to generate the missing span in this sentence to predict the value. Let the incomplete sentence be S̃ = (w^s_1, w^s_2, ..., w^s_p). Our model outputs the best attribute value sequence Ṽ by learning the distribution P(V | S̃). In the answer generation approach, our aim is to generate V as the answer, considering T as the context and A as the question.
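The two input formats above can be sketched as follows. This is an illustrative sketch: the helper names and the blank marker are our own, not taken from the paper or from the ILM/T5 codebases.

```python
# Sketch of the two formulations described above (illustrative helpers).

def infilling_input(title: str, attribute: str) -> str:
    # Text infilling view: "T. A is V." with the value V masked as a blank.
    # The "_" marker is illustrative; ILM uses its own special blank tokens.
    return f"{title}. {attribute} is _."

def qa_input(title: str, attribute: str) -> str:
    # Answer generation view: title as context, attribute as question,
    # in a T5-style "question: ... context: ..." layout (an assumption).
    return f"question: {attribute} context: {title}"
```

For a guitar product title, the infilling model is trained to fill the blank with "Electric Guitar", while the answer generation model receives the same triple in question-context form and generates the value (or NULL) directly.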

Dataset
We use a publicly available dataset collected from the Sports & Entertainment category of AliExpress (Xu et al., 2019). This dataset contains 110,484 examples. Each example is a triple: a context (the product title), an attribute, and its value. We preprocessed the dataset to handle noisy data, removing triples with empty values and triples with '-' or '/' as the value. This led to a dataset comprising 109,957 triples, which we refer to as AV-109K. There are 2,157 unique attributes and 11,847 unique values in this dataset. Not all attributes have a value in the context; these are represented as NULL, and there are 21,461 such triples in AV-109K. We randomly split the data in a 7:1:2 ratio, i.e., we randomly select 76,970 triples as the training set, 10,996 triples as the validation set, and the remaining 21,991 triples as the test set.

To further examine the model's ability to generate values for unseen attributes, we select five attributes with relatively low frequency (< 0.1%) in the dataset: Frame Color, Lenses Color, Shell Material, Wheel Material, and Product Type, with 108, 62, 36, 23, and 523 triples, respectively. All triples with these attributes are included in the test set. From the remainder of the dataset, we pick 10% as the validation set and the rest as the training set. We refer to this dataset as AV-zero.
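The preprocessing and 7:1:2 split described above can be sketched as follows; this is a minimal sketch, and the (title, attribute, value) tuple layout and the fixed seed are our assumptions.

```python
import random

def clean_and_split(triples, seed=0):
    """Drop noisy triples, then split into train/validation/test at 7:1:2.

    Each triple is assumed to be a (title, attribute, value) tuple."""
    # Remove triples with empty values or with '-' / '/' as the value.
    data = [t for t in triples if t[2] not in ("", "-", "/")]
    random.Random(seed).shuffle(data)
    n_train = int(0.7 * len(data))
    n_val = int(0.1 * len(data))
    train = data[:n_train]
    val = data[n_train:n_train + n_val]
    test = data[n_train + n_val:]
    return train, val, test
```

For AV-zero, the same idea applies except that all triples of the five held-out attributes go straight to the test set before splitting the remainder.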

Evaluation Metrics
To evaluate the models, we use the Exact Match (EM) metric on the generated values, where the whole value sequence must match. Since values can contain more than one token and models may generate tokens in any order, we also compute average bag-of-words precision, recall, and F1 score, denoted P, R, and F1, respectively. Let N be the size of the dataset, V = {v_1, v_2, ..., v_N} be the gold standard values, G = {g_1, g_2, ..., g_N} be the generated values, and |v_i ∩ g_i| denote the bag-of-words overlap between the gold standard and generated values of the i-th triple. P and R are computed as:

P = (1/N) Σ_{i=1}^{N} |v_i ∩ g_i| / |g_i|,    R = (1/N) Σ_{i=1}^{N} |v_i ∩ g_i| / |v_i|

and F1 is their harmonic mean, F1 = 2PR / (P + R).
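A minimal sketch of these metrics for a single triple; whitespace tokenization and lowercasing are our assumptions about the normalization used.

```python
from collections import Counter

def exact_match(gold: str, pred: str) -> float:
    # EM: the whole generated value must match the gold value.
    return float(gold.strip().lower() == pred.strip().lower())

def bow_prf(gold: str, pred: str):
    # Bag-of-words precision/recall/F1 between v_i (gold) and g_i (generated).
    g = Counter(gold.lower().split())
    p = Counter(pred.lower().split())
    overlap = sum((g & p).values())  # |v_i ∩ g_i| as a multiset intersection
    if overlap == 0:
        return 0.0, 0.0, 0.0
    prec = overlap / sum(p.values())
    rec = overlap / sum(g.values())
    return prec, rec, 2 * prec * rec / (prec + rec)
```

Dataset-level P, R, and F1 are then the averages of these per-triple scores over the N test triples.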
Baselines

• SUOTag (Xu et al., 2019) uses two separate BiLSTMs over BERT-based pretrained word embeddings to represent the context and the attribute. It then applies cross attention between the two representations, followed by a CRF layer.

Implementation Details
All models are implemented in PyTorch (Paszke et al., 2019). We train each model for 5 epochs, and the model that performs best on the validation set is used to evaluate the test set. The minibatch size is fixed at 32. We use the AdamW optimizer with a learning rate of 5e-5. We use the pretrained GPT-2 small (Radford et al., 2019) model to train ILM, selecting the model by validation-set perplexity on the masked tokens. We fine-tune T5-Base for the answer generation framework.
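The training setup above, collected as a configuration sketch. This is a summary of the stated hyperparameters, not the authors' code, and the key names are our own.

```python
# Hyperparameters stated in this section; key names are illustrative.
TRAIN_CONFIG = {
    "epochs": 5,
    "batch_size": 32,
    "optimizer": "AdamW",
    "learning_rate": 5e-5,
    "ilm_base_model": "gpt2",    # GPT-2 small checkpoint for ILM
    "qa_base_model": "t5-base",  # T5-Base for answer generation
    "model_selection": "best validation performance",
}
```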

Results and Discussion
We conduct experiments in different settings to (1) explore scalability to a large attribute set, (2) compare performance on four frequent attributes, and (3) examine the model's ability to discover new attribute values.

Table 2 reports the performance on the AV-109K dataset. Since BiLSTM-CRF requires tagging each attribute a with separate B_a and I_a tags, it is not suitable for a large attribute set, so we did not consider this model. The overall result shows that both ILM and T5 can handle a large number of attributes. Next, we examine the models on several interesting cases, such as triples with NULL values, triples where the attribute appears in the context, multi-word values, and numerical values; the details are summarized in Table 3. ILM performs better than the other models in identifying triples with NULL values; in particular, the language models give much better precision in this case. There are 19.26% NULL values in AV-109K, but SUOTag predicts 43.83% of the data as NULL, which explains its high recall. There are very few triples where the attribute appears in the context (only 1.50% of the training set and 1.59% of the test set), so all models perform worse when the attribute appears in the context than when it does not. In the AV-109K dataset, there are 4,058 triples whose values consist of multiple words; T5 performs best at finding values with more than one word. There are 8.5% numerical values in the test set, and T5 gives much better results than the other models in identifying them.

The second experiment is conducted on the four most frequent attributes of the AV-109K dataset; Table 4 shows the results. T5 performs better than the other models on Brand Name and Color. For Material and Category, ILM has the best performance. We have looked into the predicted values in these two categories and found that T5 does not correctly identify the NULL values.
On a closer look at the dataset, we find that most of those NULL values are incorrectly annotated. For example, for the product "new 1pcs Golf Sports Mens Right Left Hand Golf Gloves Sweat Absorbent Microfiber Cloth Soft Breathable Abrasion Gloves", the material is microfiber, but it is annotated as NULL. Similarly, T5 predicts the correct material for the product "... Bike Seat Vintage Bicycle Leather Saddle Pad", but the annotation is NULL. Although T5 has identified the correct value of the attribute, it is marked as incorrect due to the faulty annotation.

The last experiment is performed on the AV-zero dataset. Table 5 shows the results of discovering values for five new attributes. ILM is the best at identifying Product Type. The value of most Product Type triples is Fishing Float, but T5 either predicted the product type as NULL or as a specific type of float, e.g., Luminous Fishing Float, Ice Fishing Float, etc. For the remaining three attributes, T5 outperforms the other models. Both T5 and ILM perform better than SUOTag in discovering unseen attribute values.

Conclusion
In this work, we present a formulation that generates product attribute values as (i) an instance of the text infilling task and (ii) an answer generation task. We show that we can leverage GPT-2 based and T5 text-to-text transformer models for this task. The models achieve strong results over a broad set of attributes: T5 performs better on multi-word values, and ILM is better at predicting NULL values. Additionally, our approach outperforms state-of-the-art models at discovering new attribute values.