Learning Cross-Task Attribute-Attribute Similarity for Multi-task Attribute-Value Extraction

Automatic extraction of product attribute-value pairs from unstructured text such as product descriptions is an important problem for e-commerce companies. The attribute schema typically varies from one category of products (referred to here as a vertical) to another. This leads to extreme annotation effort for training supervised deep sequence labeling models such as LSTM-CRF, and consequently not enough labeled data for some vertical-attribute pairs. In this work, we propose a technique for alleviating this problem by using annotated data from related verticals in a multi-task learning framework. Our approach relies on the availability of similar attributes (labels) in another related vertical. Our model jointly learns the similarity between attributes of the two verticals along with the parameters of the sequence tagging model. The main advantage of our approach is that it does not need any prior annotation of attribute similarity. Our system has been tested on datasets of size more than 10000 from a large e-commerce company in India. We perform detailed experiments to show that our method indeed increases the macro-F1 scores for attribute-value extraction in general, and for labels with low training data in particular. We also report the top labels from other verticals that contribute towards the learning of particular labels.


Introduction
Online e-commerce marketplaces (e.g., Flipkart) operate by efficiently matching customer queries and browsing habits to appropriate seller inventory. Inventory is stored in a catalog which consists of images, structured attributes (key-value pairs) and unstructured textual description, as shown in figure 1. Products of the same kind (e.g., digital camera) are thus described using a unique set of attributes (e.g., zoom, resolution), which helps faceted navigation, merchandizing, search ranking and comparative summary.
Onboarding products in a catalog requires populating the structured as well as unstructured parts. The time a seller has to spend on a product addition request is proportional to the quantum of information that he/she has to provide. On the other hand, correctness and completeness of the catalog result in better product discovery, leading to a trade-off with onboarding time. A good amount of attribute information is present in the product description as well. This motivates us to extract the information from unstructured text instead of explicitly asking sellers for attributes. Additional information in the description (e.g., precise features, relations between products), as shown in figure 1, helps to enrich the catalog as well. The extracted attributes can also be used to check consistency between the unstructured and structured data provided by the seller, and thus to quality-control the addition request.
We design supervised deep learning techniques for the problem of attribute-value extraction. Figure 2 shows a typical input sentence and the corresponding B, I, O tags. The task of our model is to predict the tags given an input sentence. This is an instance of the supervised sequence labeling problem (Zheng et al., 2018; Lample et al., 2016). However, this technique needs a lot of training data points (sentence-label pairs) to perform effectively, which in turn requires massive annotation effort on the part of e-commerce companies; reducing this effort is an ongoing challenge, e.g. OpenTag (Zheng et al., 2018) uses active learning to annotate only the most informative examples. E-commerce companies, however, have their products categorized into different verticals, e.g. dress, jeans, etc. Each of these verticals has a different set of attributes, and hence needs to be annotated using a different model. A lot of the attributes among these verticals are common, or related, though. Hence, it should be possible to borrow information from annotations given in different verticals to improve the prediction performance on a given vertical. The only challenge is that correspondences between similar labels of different verticals are not readily available.
Our main contribution here is thus to develop a multi-task learning (MTL) model (Ruder, 2017) which can simultaneously learn attribute extraction and attribute-attribute similarity for multiple verticals (here we report with only two verticals at a time). We do so by using a soft coupling loss function across pairs of similar (context, label) combinations between the two tasks, where the similarity is learned using an attention mechanism. The naive version of such an objective is prohibitively large to optimize. We propose to use a cosine-similarity-based shortlist, which makes the solution feasible.
We validate our method using a large corpus (more than 10000 product descriptions, across 6 verticals) collected from the e-commerce company Flipkart. Extensive experimentation shows that our method improves prediction performance on almost all the verticals, and especially shows up to 50% improvement for many labels which have a low number of training examples. This is especially interesting since we find that the number of instances with an attribute is highly skewed across attributes. Detailed analysis also confirms that the attention mechanism indeed discovers similar attributes from other verticals to borrow information from.

Related Work
Attribute extraction: Various tokens (e.g., Apple) in an offer title are classified into attribute names (e.g., brand) relevant to the product (e.g., smartphone) (Joshi et al., 2015). For recognizing attributes (e.g., product family) in a short text segment, missing KB entries are leveraged through word embeddings learned on an unlabeled corpus (Kozareva et al., 2016). (Joshi et al., 2015) investigates whether distributed word vectors benefit NER in the e-commerce domain where entities are item properties (e.g., brand name, color, material, clothing size). (Xu et al., 2019) regards each attribute as a query and adopts only one global set of BIO tags for any attribute to reduce the burden of attribute tag or model explosion. OpenTag (Zheng et al., 2018) uses active learning along with a deep tagging model to update a product catalog with missing values for many attributes of interest from seller-provided titles/descriptions. To create the initial labeled data set, (Rezk et al., 2019) proposes bootstrapping of seed data by extracting new values from unstructured text in a domain/language-independent fashion. Through category-conditional self-attention and multi-task learning, (Zhu et al., 2020) jointly models the attribute prediction and value extraction tasks, capturing multiple aspects of the interactions between attributes and values. Contrastive entity linkage (Embar et al., 2020) helps identify grocery product attribute pairs that share the same value (e.g., brand, manufacturer, product line) and differ from each other (e.g., package size, color). Retailers do not always provide clean data as textual descriptions in product catalogs (e.g., non-distinctive names (cotton, black tshirt), blurred distinctions (Amazon as a product vs. brand), homonyms (Apple)). Such attribute relationships have been discovered towards a brand-product knowledge graph from diverse input data sources.
Multi-task Learning (MTL): Significant theoretical interest exists in MTL since it offers excellent generalization performance in domains where training data is scarce (Maurer et al., 2016). In NLP, (Collobert and Weston, 2008) proposed a unified deep learning architecture for many common tasks, e.g. POS tagging, chunking, etc. (Yang and Hospedales, 2017) presented a new representation MTL framework that learns cross-task sharing structure at every layer in a deep network. (Rijula Kar, 2018) proposed a task-sensitive representation learning framework that learns mention-dependent representations for NED, and violates the norm of sharing parameters in the final layer. Another line of work treats each attribute as a question, finding the best answer span corresponding to its value in the product context, modelled by a BERT encoder shared across all attributes for scalability; a distilled masked language model improving generalizability is then integrated with the encoder into a unified MTL framework. (Karamanolakis et al., 2020) applies to thousands of product categories organized in a hierarchical taxonomy. However, existing methods do not automatically discover attribute-attribute similarity from data without taking an attribute hierarchy as input.

Methods
In this section, we describe a novel multi-task approach to improving the accuracy of a supervised attribute-value extraction system. We start with the attribute-value extraction system, based on a deep bidirectional LSTM model, described in OpenTag (Zheng et al., 2018). Our main idea here is to leverage the information contained in instances of related tasks, i.e. in our case related domains / verticals of products. The key challenge in our case is that the sets of labels across verticals need not be the same, or even aligned. For example, the label PROCESSOR TYPE is a valid label for the LAPTOP vertical but does not make sense for the DRESS vertical. On the other hand, the set of values for the common label BRAND will be very different for the vertical DRESS compared to the vertical LAPTOP. Hence, our core challenge is to determine the similarities between labels automatically in the context of each vertical in order to leverage the information from a related vertical. The proposed architecture is described in figure 3.

Problem setup
Each instance of the (single-task) attribute-value extraction problem comes with an input sentence denoted by a sequence of words w = {w_1, . . . , w_n} and a corresponding sequence of labels y = {y_1, . . . , y_n}. The task is to design a supervised ML algorithm which, given the input sentence w, predicts the output labels y. Here, the labels correspond to the attributes, e.g. COLOR, and words correspond to the predicted values. Following common practice, we use 3 types of labels (also called tags): B, I, O. Here B and I are prepended to the label to indicate the beginning and the inside (continuation) of a multi-word tag, respectively, while O refers to no tag for the word. For example, the multi-word color "light green" may be tagged as B COLOR and I COLOR.
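The tagging scheme above can be sketched with a small helper (function and label names are hypothetical; the underscore-joined tag format is just one convention):

```python
def spans_to_bio(tokens, spans):
    """Convert token-index spans (start, end_exclusive, label) to BIO tags."""
    tags = ["O"] * len(tokens)          # default: no tag
    for start, end, label in spans:
        tags[start] = "B_" + label      # beginning of the attribute value
        for i in range(start + 1, end):
            tags[i] = "I_" + label      # inside (continuation) of the value
    return tags

tokens = ["a", "light", "green", "cotton", "dress"]
spans = [(1, 3, "COLOR"), (3, 4, "FABRIC")]
print(spans_to_bio(tokens, spans))
# ['O', 'B_COLOR', 'I_COLOR', 'B_FABRIC', 'O']
```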
This is an instance of the sequence labeling problem (Lample et al., 2016), and the LSTM-CRF model proposed by Lample et al. (Lample et al., 2016) is a state-of-the-art model for this task. For each word w_i, we obtain the corresponding word embedding x_i using a concatenation of its GloVe embedding (Pennington et al., 2014) and its character-based embedding. The word embeddings of a sentence, x = {x_1, . . . , x_n}, are passed through a bidirectional LSTM (BiLSTM) layer to produce the context-sensitive word embeddings h = {h_1, . . . , h_n}, where h_i concatenates the forward and backward LSTM states at position i: h_i = [LSTM_fwd(x_1, . . . , x_i); LSTM_bwd(x_i, . . . , x_n)]. We call this the embedding layer for our input, which is common to both the single- and multi-task models. Figure 3(a) describes the architecture in detail. For the multi-task attribute-value extraction problem, the input is a sentence w^t = {w^t_1, . . . , w^t_n} and the output of the model is a sequence of labels y^t = {y^t_1, . . . , y^t_n}, where t = 1, . . . , T indexes the task. In this paper we only consider the setting of T = 2, i.e. we learn from 2 tasks at a time, for scalability reasons. However, in theory our method can be extended to learning from more than 2 tasks. We compute the word embeddings x and context-dependent word embeddings h in the same manner as described above.
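A minimal numpy sketch of such an embedding layer, with simple tanh RNN cells standing in for the LSTM cells and random weights purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 3                       # toy word-embedding and hidden sizes
Wf, Uf = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))
Wb, Ub = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))

def birnn(x):
    """x: (n, d_in) word embeddings -> (n, 2*d_h) context-sensitive embeddings."""
    n = len(x)
    hf, hb = np.zeros((n, d_h)), np.zeros((n, d_h))
    state = np.zeros(d_h)
    for i in range(n):                 # left-to-right pass over the sentence
        state = np.tanh(Wf @ x[i] + Uf @ state)
        hf[i] = state
    state = np.zeros(d_h)
    for i in reversed(range(n)):       # right-to-left pass
        state = np.tanh(Wb @ x[i] + Ub @ state)
        hb[i] = state
    return np.concatenate([hf, hb], axis=1)   # h_i = [fwd state; bwd state]

x = rng.normal(size=(5, d_in))         # a 5-word sentence
h = birnn(x)
print(h.shape)                         # (5, 6)
```

Each h_i thus depends on the whole sentence, not just on w_i.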

Single-task attribute-value extraction
We use the LSTM-CRF model with character embeddings (Lample et al., 2016; Zheng et al., 2018) as our baseline single-task model. For a given input sentence, the word embeddings x and the context-sensitive word embeddings h are computed as described above. Each context-sensitive word embedding h_i, i = 1, . . . , n, is then passed through a fully connected layer to produce the score s(y) for every possible label y. This layer is parameterized by a matrix W ∈ R^{d×k} and a bias b ∈ R^k, where d is the dimension of h_i and k is the total number of possible labels. Hence the score vector over all labels is computed as: s_i = W h_i + b, i = 1, . . . , n, where n is the length of the sentence. We can interpret the k-th component of s_i, denoted as s_i(y = k|h_i), as the score of class k for word w_i. Now, given a sequence of word vectors x, a sequence of score vectors {s_1(y|x), . . . , s_n(y|x)}, and a sequence of labels y, a linear-chain CRF defines a global score C ∈ R as: C(x, y) = Σ_{i=1}^{n} s_i(y_i|x) + Σ_{i=2}^{n} T(y_{i-1}, y_i). Here, s_i(y|x) is the y-th component of the s_i vector and T(y, y') is the transition score from label y to y', which is used to capture label dependency. A softmax over all possible tag sequences yields a probability for the sequence y: P(y|x) = e^{C(x,y)} / Σ_{y'∈Y} e^{C(x,y')}. During training, we maximize the log-probability of the correct label sequence: log P(y|x) = C(x, y) − log(Σ_{y'∈Y} e^{C(x,y')}). Here Y is the set of all possible labellings for the sequence x. Given a dataset of sequences and labels D = {(x_j, y_j), j = 1, . . . , m}, we can define the CRF loss as the negative log-likelihood: L_CRF = −Σ_{j=1}^{m} log P(y_j|x_j). (Lample et al., 2016) describes a method for learning the model parameters and inferring the partition function and scores by minimizing the above objective w.r.t. W and b.
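The global score and the log-partition term can be sketched in a few lines of numpy (toy random scores; the forward algorithm below is a standard way to compute the partition function, checked here against brute-force enumeration):

```python
import numpy as np
from itertools import product

def crf_score(s, T, y):
    """Global score C(x, y) = sum_i s_i(y_i) + sum_i T[y_{i-1}, y_i]."""
    c = s[0, y[0]]
    for i in range(1, len(y)):
        c += s[i, y[i]] + T[y[i - 1], y[i]]
    return c

def crf_log_partition(s, T):
    """log sum over all labellings y' of exp(C(x, y')), via the forward algorithm."""
    alpha = s[0].copy()                           # (k,) log-scores of length-1 prefixes
    for i in range(1, len(s)):
        m = alpha[:, None] + T + s[i][None, :]    # (k, k): prev label x current label
        alpha = np.log(np.exp(m - m.max(0)).sum(0)) + m.max(0)  # stable logsumexp
    return np.log(np.exp(alpha - alpha.max()).sum()) + alpha.max()

rng = np.random.default_rng(0)
n, k = 4, 3
s = rng.normal(size=(n, k))                       # per-word label scores s_i
T = rng.normal(size=(k, k))                       # transition scores T(y, y')
y = [0, 2, 2, 1]
nll = crf_log_partition(s, T) - crf_score(s, T, y)   # -log P(y|x), the CRF loss term

# sanity check against brute-force enumeration of all k^n labellings
brute = np.log(sum(np.exp(crf_score(s, T, yp)) for yp in product(range(k), repeat=n)))
print(abs(brute - crf_log_partition(s, T)) < 1e-8)   # True
```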

Multi-task attribute-value extraction
As mentioned above, for multi-task attribute-value extraction, we have sequence and label combinations (x^t, y^t) for two tasks, t ∈ {1, 2}. We also note that we have a common set of embedding layers (both the word representation and the BiLSTM) for the two tasks. However, the feedforward layers used for scoring the labels are specific to the tasks. Hence: s^t_i = W^t h_i + b^t, t ∈ {1, 2}. The score and loss functions can be defined analogously to the single-task model as: C^t(x, y) = Σ_i s^t_i(y_i|x) + Σ_i T^t(y_{i-1}, y_i), and log P^t(y|x) = C^t(x, y) − log(Σ_{y'∈Y^t} e^{C^t(x,y')}). Given the multi-task dataset D^t = {(x^t_j, y^t_j), j = 1, . . . , m_t}, t ∈ {1, 2}, our loss function can be written as: L_MTL = −Σ_{t=1}^{2} Σ_{j=1}^{m_t} log P^t(y^t_j|x^t_j). Hence, only the parameters of the embedding layers are affected by the multi-task paradigm here, since those are the only layers shared between the tasks. However, these parameters are independent of the labels and are thus relatively robustly learned by just using a reasonably large corpus of input sentences. Another mechanism for borrowing information between tasks is through "soft coupling" (Ruder, 2017) of various scores or parameters which are not explicitly shared. In the next section, we devise a soft coupling loss between instances of the two tasks which achieves transfer of information at the granularity of labels.
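The task-specific scoring on top of a shared encoder can be sketched as follows (toy dimensions; the shared BiLSTM output is stubbed with random vectors, and the overall loss would simply sum the two CRF losses computed from these scores):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k1, k2 = 6, 4, 5                    # shared hidden size, per-task label counts

# task-specific scoring layers: (W^1, b^1) and (W^2, b^2)
W1, b1 = rng.normal(size=(k1, d)), np.zeros(k1)
W2, b2 = rng.normal(size=(k2, d)), np.zeros(k2)

def task_scores(h, W, b):
    """s^t_i = W^t h_i + b^t for every word in the sentence."""
    return h @ W.T + b

h1 = rng.normal(size=(3, d))           # stub encoder output: task-1 sentence, 3 words
h2 = rng.normal(size=(4, d))           # stub encoder output: task-2 sentence, 4 words
s1, s2 = task_scores(h1, W1, b1), task_scores(h2, W2, b2)
print(s1.shape, s2.shape)              # (3, 4) (4, 5)
```

Note that h1 and h2 would come from the same embedding layer, so its gradients receive signal from both tasks even though the label sets differ.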

Coupling loss
The principle we use for coupling the scores s^t_i(y|x) is: similar labels in similar contexts should have similar scores. Recall that the dataset for multi-task attribute-value extraction consists of two sets of instances, D^1 and D^2, one for each of the two tasks. Since we are attempting to compare the model predictions for the two tasks, the coupling loss depends on two contexts, one from each task: (x_j, y_j, i) and (x_{j'}, y_{j'}, i'). Here, j and j' denote indices of instances for the two tasks, and i and i' denote word positions within the respective sentence instances. We note that since there are ∼1000 instances for each task, and ∼10-word sentences for each instance, the total number of terms for this loss will be ∼10^8 ((10 × 1000)^2). This is prohibitively large for our training purposes, and is also wasteful, since not all contexts (combinations of instance j and position i) are related to each other.
Hence, as a first step, we create a shortlist L of pairs of contexts ((i, j), (i', j')) which can borrow information from each other, by thresholding on the cosine similarity between windows around the contexts, u_{i,j} and u_{i',j'}: L = {((i, j), (i', j')) | cos(u_{i,j}, u_{i',j'}) > τ}, where τ is a similarity threshold. Here, note that u_{i,j} is the word embedding of a window around the context (i, j).
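A sketch of this shortlisting step, assuming precomputed window embeddings for the contexts of the two tasks (the threshold value is illustrative):

```python
import numpy as np

def shortlist(U1, U2, tau=0.8):
    """Pairs of contexts whose window embeddings have cosine similarity > tau.

    U1: (p, d) window embeddings u_{i,j} for task-1 contexts;
    U2: (q, d) for task-2 contexts. Returns a list of (task1_idx, task2_idx)."""
    A = U1 / np.linalg.norm(U1, axis=1, keepdims=True)
    B = U2 / np.linalg.norm(U2, axis=1, keepdims=True)
    sim = A @ B.T                       # (p, q) matrix of cosine similarities
    return [tuple(ix) for ix in np.argwhere(sim > tau)]

U1 = np.array([[1.0, 0.0], [0.0, 1.0]])
U2 = np.array([[0.9, 0.1], [-1.0, 0.0]])
print(shortlist(U1, U2))               # [(0, 0)]
```

Only the surviving pairs contribute coupling-loss terms, reducing the ∼10^8 candidate pairs to a tractable list.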
Context coupling error: Our next challenge is to design a mechanism to figure out similar contexts and similar labels. We use the softmax attention mechanism to automatically learn the similar label-context combinations, simultaneously as we also learn the scoring function. For parameter efficiency, we use Luong attention. Hence the attention score for context (i, j) from task 1 over context (i', j') from task 2 is given by: α((i, j), (i', j')) = exp(u_{i,j}^T diag(a) u_{i',j'}) / Σ_{(j'', i'') ∈ L(j,i)} exp(u_{i,j}^T diag(a) u_{i'',j''}). Here, a = (a_1, . . . , a_d) is a vector of learnable parameters of the same dimension as the word embeddings, and L(j, i) = {(j', i') | ((i, j), (i', j')) ∈ L}. The context-coupling error is defined as: CCE_{1→2} = Σ_{(i,j)} Σ_{(j',i') ∈ L(j,i)} α((i, j), (i', j')) (s^1_i(y_i|x_j) − s^2_{i'}(y_{i'}|x_{j'}))^2. We note that this term selects the similar contexts from the second task, since it normalizes the attention score over the contexts of the second task. Symmetrically, we can define the attention score α'((i', j'), (i, j)) of context (i', j') from task 2 over (i, j) from task 1, normalized over L(j', i'); the context-coupling error in the reverse direction, CCE_{2→1}, is defined analogously. Label coupling error: In addition to the context coupling error defined above, we also take into account the explicit similarity between labels alone, using a character k-gram based embedding v_{i,j} of the label in context (i, j). Hence, the label coupling error is given as: LCE_{1→2} = Σ_{(i,j)} Σ_{(j',i') ∈ L(j,i)} β((i, j), (i', j')) (s^1_i(y_i|x_j) − s^2_{i'}(y_{i'}|x_{j'}))^2, where the attention weight β is computed as above but from the label embeddings v instead of the context windows u; LCE_{2→1} is defined analogously. The label embeddings v_{i,j} are learned jointly with the model. The total coupling error between contexts from the two tasks is the sum of the context coupling errors and the label coupling errors: TCE = CCE_{1→2} + CCE_{2→1} + LCE_{1→2} + LCE_{2→1}. We optimize the sum of all the CRF losses and the total coupling error in order to obtain the model parameters. We use stochastic gradient descent, where minibatches are constructed from three lists: D^1, D^2, and L. Samples from the first two lists are used to calculate the CRF losses, while samples from L are used to calculate the total coupling error, and the corresponding updates.
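The attention normalization over a shortlist can be sketched as follows (the diagonal parameterization of the Luong-style score with a vector a is our reading of the setup; all values are toy):

```python
import numpy as np

def attention_weights(u, candidates, a):
    """Softmax attention of one task-1 context over its shortlisted task-2 contexts.

    Score(u, u') = u^T diag(a) u', with a learnable vector a (Luong-style,
    using a diagonal matrix for parameter efficiency)."""
    scores = np.array([u @ (a * uc) for uc in candidates])
    e = np.exp(scores - scores.max())   # numerically stable softmax over L(j, i)
    return e / e.sum()

a = np.array([1.0, 1.0, 1.0])           # learnable, same dimension as embeddings
u = np.array([1.0, 0.0, 0.0])           # window embedding of context (i, j)
cands = [np.array([1.0, 0.0, 0.0]),     # shortlisted task-2 context windows
         np.array([0.0, 1.0, 0.0])]
alpha = attention_weights(u, cands, a)
print(alpha.round(3))                   # weights sum to 1; the similar context dominates
```

The resulting α would then weight the squared score differences in the coupling error, so dissimilar shortlisted contexts contribute little.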

Experimental Results
In this section, we compare our proposed method for multi-task attribute extraction against single-task attribute extraction. We implemented our model using TensorFlow on an 8-core CentOS machine. We used 300-dimensional pre-trained GloVe vectors. We also experimented with other customized word embeddings, e.g. fastText, but did not achieve significantly better results. For this work, we use a single-layer BiLSTM as the embedding layer. The hidden layer size for the BiLSTM was set to 700. We experimented with other embedding layer architectures, e.g. hidden layer sizes ranging from 300 to 900, and also two-layer BiLSTMs with hidden layer sizes (500, 700). However, the performance of the single-layer BiLSTM with hidden layer size 700 was found to be similar to or better than the others. For training, the batch size was chosen to be 30, both for the CRF loss batches and for the coupling loss batches sampled from the shortlist L. ADAM was used as the optimizer, and we trained for a maximum of 30 epochs. Evaluation metric: As reported below, the datasets for this problem show extreme skew in the occurrence of labels. Hence, we use the standard metrics of macro precision, macro recall, and macro F1 score. We also report the micro-accuracy. While computing the macro metrics (precision, recall and F1), we ignore the 'O' label. Macro-F1 score without the 'O' label is the most representative metric here from an application point of view.
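The macro-F1 computation with the 'O' label ignored can be sketched as follows (tag names are illustrative):

```python
from collections import Counter

def macro_f1(gold, pred, ignore=("O",)):
    """Macro-averaged F1 over token-level tags, skipping the 'O' label."""
    labels = {t for t in gold + pred if t not in ignore}
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p and g not in ignore:
            tp[g] += 1
        elif g != p:
            if p not in ignore: fp[p] += 1   # predicted a label wrongly
            if g not in ignore: fn[g] += 1   # missed a gold label
    f1s = []
    for l in labels:                         # per-label F1, averaged uniformly
        prec = tp[l] / (tp[l] + fp[l]) if tp[l] + fp[l] else 0.0
        rec = tp[l] / (tp[l] + fn[l]) if tp[l] + fn[l] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s) if f1s else 0.0

gold = ["B_COLOR", "I_COLOR", "O", "B_BRAND", "O"]
pred = ["B_COLOR", "O",       "O", "B_BRAND", "B_BRAND"]
print(round(macro_f1(gold, pred), 3))        # 0.556
```

Because each label contributes equally regardless of frequency, rare labels dominate the metric exactly as intended for skewed datasets.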

Datasets
The datasets used here are taken from actual systems for product delivery used in Flipkart. We performed our experiments using data (both product descriptions and ground-truth annotations) from six verticals available on Flipkart: Dress, Jean, Mangalsutra (a type of necklace), Chain, Trouser and Jewellery. These verticals were chosen based on three factors: (1) GMV (Gross Merchandise Value), (2) volume of data available, and (3) richness of product descriptions. The number of labels in each vertical and the number of tagged descriptions in the train and test data for each vertical are shown in table 1. The words in the product descriptions for each vertical are tagged using the B, I, O (short for beginning, inside, and outside) format, where the B prefix before a tag indicates that the token is the beginning of a tag, an I prefix indicates that the token is inside a tag, and an O tag indicates that the token belongs to no tag. Table 2 shows the pairs of similar tasks (verticals) which were trained together for MTL. The pairs were chosen manually based on the probability of occurrence of similar labels in these tasks. The results for each of the verticals are the best achieved for these pairs of tasks. Note that, while we have to manually provide a similar pair of tasks, the similarity between labels is automatically deciphered. In this section, we illustrate the effectiveness of our multi-task learning method. Table 3 reports the best performances of single- and multi-task models for all six verticals studied here. We can see that, except for Jewellery, the multi-task model improves performance in terms of F1 score for all other verticals. For some verticals, e.g. Chain, the improvement is more than 5 percent, while for other verticals the improvement lies in the 2 percent range. We note that the improvement depends on two main factors: whether we can find a close enough vertical to borrow from, and the number of examples already present in the current vertical.
For example, we can see that the vertical Jewellery has about 5000 examples, and also does not have a very close other vertical to borrow information from. Hence, in its case, MTL is not able to improve the performance.

Performance Comparison
In table 4, we report the fine-grained improvements of the top 5 labels for the verticals Trouser, Jean, Mangalsutra, and Chain. We note that the top improvements for these verticals are 51%, 46%, 29% and 22% respectively. We also note that the numbers of examples for these labels in the training dataset (#ex column) are respectively 6, 15, 6, and 7. Hence this table further corroborates our claim that MTL improves the performance for labels with a low amount of information in the single-task training set.

Validation of Attribute Similarity
In this section, we validate the learned attribute-attribute similarity by studying the attribute-wise F1 scores for the similar attribute pairs. Figure 4(a) shows the full attention heatmap for all labels between the pair of tasks Mangalsutra-Chain.
Here the attention is normalised over the attributes on the y-axis (task Chain). We can see that the attention mechanism is indeed choosing the similar labels between the pairs of tasks, irrespective of whether there is an improvement in accuracy for the pair of labels. In figure 4(b), we report the topmost pairs of labels with the highest attention scores, along with the corresponding increase in accuracy. The left column shows borrower labels (Chain) and the right column shows lender labels (Mangalsutra) which got the highest average attention weights across all context pairs in the list L. The value in brackets shows the attention value. The bold entries appear in the top-5 attributes with the highest F1 scores in table 4. One can also see non-obvious correspondences, e.g. Necklace type from Chain can borrow information from Gemstone in the lender vertical Mangalsutra. We can also see that in most of the cases, the labels from task 1 borrow information from the corresponding labels of task 2, even though this correspondence was not explicitly furnished. This observation gives us further confidence that the attention mechanism used for discovery of similar labels and similar contexts indeed works effectively.
This observation further validates the effectiveness of our attention model in extracting similar pairs of labels between two tasks using the coupling loss. We believe this mechanism can be applied in many more situations to shortlist important and similar attributes in other contexts, while jointly learning a prediction model.

Conclusion
In this paper, we study attribute-value extraction from product descriptions in the e-commerce domain. Many of the attributes occur in very few descriptions. Hence the amount of supervised training data available for these attributes is very low, which leads to low prediction performance. We thus propose a novel multi-task learning based algorithm which borrows information from related domains (i.e., categories/verticals) in order to improve the prediction performance of infrequently occurring attributes. We validate the proposed method with extensive experimental evaluation on a large dataset of six verticals from a prominent e-commerce company. The proposed technique not only achieves higher accuracy on verticals with similar labels, but can also be used for discovering attribute similarities across verticals.