AdaTag: Multi-Attribute Value Extraction from Product Profiles with Adaptive Decoding

Automatic extraction of product attribute values is an important enabling technology in e-Commerce platforms. This task is usually modeled using sequence labeling architectures, with several extensions to handle multi-attribute extraction. One line of previous work constructs attribute-specific models, through separate decoders or entirely separate models. However, this approach constrains knowledge sharing across different attributes. Other contributions use a single multi-attribute model, with different techniques to embed attribute information. But sharing the entire network parameters across all attributes can limit the model’s capacity to capture attribute-specific characteristics. In this paper we present AdaTag, which uses adaptive decoding to handle extraction. We parameterize the decoder with pretrained attribute embeddings, through a hypernetwork and a Mixture-of-Experts (MoE) module. This allows for separate, but semantically correlated, decoders to be generated on the fly for different attributes. This approach facilitates knowledge sharing, while maintaining the specificity of each attribute. Our experiments on a real-world e-Commerce dataset show marked improvements over previous methods.


Introduction
The product profiles on e-Commerce platforms usually comprise natural-language text describing products and their main features. Key product features are conveyed in unstructured text, with limited impact on machine-actionable applications like search (Ai et al., 2017), recommendation (Kula, 2015), and question answering (Kulkarni et al., 2019), among others. Automatic attribute value extraction aims to obtain structured product features from product profiles. The input is a textual sequence from the product profile, along with the required attribute to be extracted, out of a potentially large number of attributes. The output is the corresponding extracted attribute values. Figure 1 shows the profile of a moisturizing cream product as an example, which consists of a title, several information bullets, and a product description. It also shows the attribute values that could be extracted.

[Figure 1: An example of a product profile on an e-Commerce platform, consisting of a title, several information bullets, and a product description.]
Most existing studies on attribute value extraction use neural sequence labeling architectures (Zheng et al., 2018; Karamanolakis et al., 2020; Xu et al., 2019). To handle multiple attributes, one line of previous contributions develops a set of "attribute-specific" models (i.e., one model per attribute). The goal is to construct neural networks with (partially) separate model parameters for different attributes. For example, one can construct an independent sequence labeling model for each attribute and make predictions with all the models collectively (e.g., the vanilla OpenTag model (Zheng et al., 2018)). Instead of building totally separate models, one can also use different tag sets corresponding to different attributes, or share the feature encoder while using separate label decoders (Yang et al., 2017). However, the explicit network (component) separation in these contributions constrains knowledge sharing across different attributes: exposure to other attributes can help disambiguate the values for each attribute, and having access to the entire training data for all attributes helps with the generic sequence tagging task. Another line of multi-attribute extraction work learns a single model for all attributes. The model proposed by Xu et al. (2019), for example, embeds the attribute name with the textual sequence to achieve a single "attribute-aware" extraction model for all attributes. This approach addresses the issues in the previous direction. However, sharing all the network parameters across all attributes could limit the model's capacity to capture attribute-specific characteristics.
In this paper we address the limitations of the existing contribution lines, through adaptive decoder parameterization. We propose to generate a decoder on the fly for each attribute based on its embedding. This results in different but semantically correlated decoders, which maintain the specific characteristics for each attribute, while facilitating knowledge-sharing across different attributes. To this end, we use conditional random fields (CRF) (Lafferty et al., 2001) as the decoders, and parameterize the decoding layers with the attribute embedding through a hypernetwork (Ha et al., 2017) and a Mixture-of-Experts (MoE) module (Jacobs et al., 1991). We further explore several pretrained attribute embedding techniques, to add useful attribute-specific external signals. We use both contextualized and static embeddings for the attribute name along with its potential values to capture meaningful semantic representations.
We summarize our contributions as follows: (1) We propose a multi-attribute value extraction model with an adaptive CRF-based decoder. Our model allows for knowledge sharing across different attributes, yet maintains the individual characteristics of each attribute. (2) We propose several attribute embedding methods, that provide important external semantic signals to the model. (3) We conduct extensive experiments on a real-world e-Commerce dataset, and show improvements over previous methods. We also draw insights on the behavior of the model and the attribute value extraction task itself.

Problem Definition
The main goal of the task is to extract the corresponding values for a given attribute, out of a number of attributes of interest, from the text sequence of a product profile. Formally, given a text sequence X = [x_1, ..., x_n] in a product profile, where n is the number of words, and a query attribute r ∈ R, where R is a predefined set of attributes, the model is expected to extract all text spans from X that could be valid values for attribute r characterizing this product. When there are no corresponding values mentioned in X, the model should return an empty set. For example, for the product in Figure 1, given its title as X, the model is expected to return ("Dry", "Sensitive") if r = "SkinType", and an empty set if r = "Color".
Following standard approaches (Zheng et al., 2018; Xu et al., 2019; Karamanolakis et al., 2020), under the assumption that different values for an attribute do not overlap in the text sequence, we formulate the value extraction task as a sequence tagging task with the BIOE tagging scheme. That is, given X and r, we want to predict a tag sequence Y = [y_1, ..., y_n], where y_i ∈ {B, I, O, E} is the tag for x_i. "B"/"E" indicates that the corresponding word is the beginning/ending of an attribute value, "I" means the word is inside an attribute value, and "O" means the word is outside any attribute value. Table 1 shows an example of the tag sequence for attribute "Scent" of a shower gel collection, where "orchid", "cherry pie", and "mango ice cream" could be extracted as the values.
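As an illustration, the following is a minimal Python sketch (our own, not the paper's code) that converts annotated value spans into a BIOE tag sequence. We assume a single-word value is tagged with a lone "B", consistent with the transition-score discussion later in the paper:

```python
def spans_to_bioe(n_words, spans):
    """Convert (start, end) word-index spans (end exclusive) to BIOE tags.

    Assumption: a single-word value is tagged "B" alone; multi-word
    values are tagged B (I ...) E. All names here are illustrative.
    """
    tags = ["O"] * n_words
    for start, end in spans:
        tags[start] = "B"                 # beginning of a value
        if end - start > 1:
            tags[end - 1] = "E"           # ending of a multi-word value
            for i in range(start + 1, end - 1):
                tags[i] = "I"             # inside a value
    return tags
```

For the Table 1 example, the three-word span "mango ice cream" would yield the tags B, I, E.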

BiLSTM-CRF Architecture
The BiLSTM-CRF architecture (Huang et al., 2015) consists of a BiLSTM-based text encoder and a CRF-based decoder. This architecture has proven effective for the attribute value extraction task (Zheng et al., 2018; Xu et al., 2019; Karamanolakis et al., 2020). We build our AdaTag model on the BiLSTM-CRF architecture, as we find that BiLSTM-CRF-based models generally perform better than their BiLSTM-based, BERT-based (Devlin et al., 2019), and BERT-CRF-based counterparts, as shown in §5. We introduce the general attribute-agnostic BiLSTM-CRF architecture, on which our model is based, in this subsection.
Given a text sequence X = [x_1, ..., x_n], we obtain the sequence of word embeddings X = [x_1, ..., x_n] using an embedding matrix W_word. We get the hidden representation of each word by feeding X into a bi-directional Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) layer with hidden size d_h:

    [h_1, ..., h_n] = BiLSTM([x_1, ..., x_n]), with h_i ∈ R^{d_h}.    (1)

We use a CRF-based decoder to decode the sequence of hidden representations while capturing the dependency among tags (e.g., "I" can only be followed by "I" or "E"). It consists of a linear layer and a transition matrix, which are used to calculate the emission scores and the transition scores for tag prediction, respectively. Let V = [B, I, O, E] be the vocabulary of all possible tags. We calculate an emission score matrix P = [p_1, ..., p_n] ∈ R^{4×n}, where P_{ij} is the score for assigning the i-th tag in V to x_j. This is computed by feeding [h_1, ..., h_n] into a linear layer with parameters [W, b]; specifically, p_j = W h_j + b ∈ R^4, where W ∈ R^{4×d_h} and b ∈ R^4. For a BIOE tag sequence Y = [y_1, ..., y_n], we get its index sequence Z = [z_1, ..., z_n], where z_i ∈ {1, 2, 3, 4} is the index of y_i in V. The score for an input text sequence X to be assigned a tag sequence Y is calculated as

    s(X, Y) = Σ_{i=1}^{n} P_{z_i, i} + Σ_{i=1}^{n-1} T_{z_i, z_{i+1}},    (2)

where T ∈ R^{4×4} is the transition matrix of the CRF, such that T_{ij} is the score of a transition from the i-th tag to the j-th tag in V.
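As a concrete illustration of the path score in Equation 2, here is a minimal plain-Python sketch (the function name is ours) that sums emission scores P[z_i][i] and transition scores T[z_{i-1}][z_i] along a tag index sequence Z, using 0-based indices into V:

```python
def path_score(P, T, Z):
    """Score of tag index sequence Z under emissions P (4 x n) and
    transitions T (4 x 4), as in Equation 2. Indices are 0-based."""
    score = P[Z[0]][0]                  # emission score for the first word
    for i in range(1, len(Z)):
        score += T[Z[i - 1]][Z[i]]      # transition z_{i-1} -> z_i
        score += P[Z[i]][i]             # emission score for word i
    return score
```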

Model Overview
The multi-attribute value extraction task can be thought of as a group of extraction subtasks corresponding to different attributes. While all attributes share general knowledge about value extraction, each has its own specificity. The key idea in our proposed model is to dynamically adapt the parameters of the extraction model to the specific subtask corresponding to the given attribute. We use a BiLSTM-CRF (Huang et al., 2015) architecture, where different subtasks, corresponding to different attributes, share the same text encoder to derive a contextualized hidden representation for each word. The hidden representations of the text sequence are then decoded into a sequence of tags with a CRF-based decoder, whose parameters are generated on the fly based on the attribute embedding. In this setup, different subtasks are trained jointly, and different decoders are correlated through the attribute embeddings. This facilitates a knowledge-sharing scheme across different attributes. Intuitively, it can help with learning generic abilities, like detecting value boundaries, which are at the core of the extraction process for any attribute. At the same time, our model provides each subtask with a customized decoder parameterization, which improves the model's capacity to capture attribute-specific knowledge.
Figure 2 presents our overall model architecture, where we equip the BiLSTM-CRF architecture with an adaptive CRF-based decoder. In §3.2, we will introduce our adaptive CRF-based decoder which is parameterized with the attribute embedding. In §3.3, we will describe how to obtain pretrained attribute embeddings that can capture the characteristics of different subtasks, so that "similar" attributes get "similar" decoding layers.

Adaptive CRF-based Decoder
In attribute value extraction, the model takes the text sequence X with a query attribute r as input, and is expected to predict Y based on both X and r. To make the model aware of the query attribute, we need to incorporate the attribute information into some components of the BiLSTM-CRF architecture. The BiLSTM-based text encoder is responsible for encoding the text sequence and obtaining a contextualized representation for each word, which can be regarded as "understanding" the sentence. The CRF-based decoder then predicts a tag for each word based on its representation. Therefore, we propose that all attributes share a unified text encoder, so that the representation can be enhanced through learning with different subtasks, while each attribute has a decoder adapted to its corresponding subtask, whose parameters are generated based on the attribute information.
As introduced in §2.2, a CRF-based decoder consists of a linear layer and a transition matrix. The linear layer takes hidden representations as input and predicts a tag distribution for each word independently. It captures most of the characteristics of value extraction for a given attribute based on the text understanding, so more flexibility is needed to model the specificity of different attributes. By contrast, the transition matrix learns the dependency among tags to avoid predicting unlikely tag sequences. It only captures shallow characteristics of the attribute based on its value statistics. For example, the transition scores from "B" to other tags largely depend on the frequent lengths of the attribute values: if single-word values are mentioned more often, "B" is more likely to be followed by "O"; if two-word values dominate the vocabulary, "B" is more likely to be followed by "E". Attributes can thus be clustered simply based on these shallow characteristics.
In this work we parameterize the CRF-based decoder with the attribute embedding r ∈ R^{d_r}, where d_r is the dimension of the attribute embedding. For the linear layer, we adopt a hypernetwork (Ha et al., 2017) due to its high flexibility. For the transition matrix, we develop a Mixture-of-Experts (Pahuja et al., 2019) module to leverage the latent clustering nature of attributes. We nevertheless experiment with all four combinations of these methods in §5.3, and this choice performs best.
Hypernetwork. The idea of hypernetworks (Ha et al., 2017) is to use one network to generate the parameters of another network. This approach has high flexibility when no constraint is imposed during generation, so we use it to parameterize the linear layer. In our model, we learn two linear transformations that map the attribute embedding to the parameters of the linear layer (W ∈ R^{4×d_h}, b ∈ R^4) in the CRF-based decoder:

    W = Reshape(W^w_hyper r),  b = W^b_hyper r,    (3)

where W^w_hyper ∈ R^{4d_h × d_r} and W^b_hyper ∈ R^{4 × d_r}, so that W^b_hyper r ∈ R^4, and the Reshape operator reshapes a 1-D vector into a matrix with the same number of elements.
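A minimal plain-Python sketch of this generation step (simple matrix-vector products stand in for the learned transformations; all names are our own):

```python
def matvec(M, v):
    # Plain matrix-vector product for a list-of-rows matrix.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def generate_linear_layer(W_hyper_w, W_hyper_b, r, d_h):
    """Map attribute embedding r to the decoder's linear layer (W, b).

    W_hyper_w has shape (4 * d_h) x d_r and W_hyper_b has shape
    4 x d_r, as in the text; names are illustrative.
    """
    flat = matvec(W_hyper_w, r)                           # length 4 * d_h
    W = [flat[i * d_h:(i + 1) * d_h] for i in range(4)]   # Reshape to 4 x d_h
    b = matvec(W_hyper_b, r)                              # length 4
    return W, b
```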
Mixture-of-Experts. The idea of Mixture-of-Experts (Jacobs et al., 1991) is to have a group of networks ("experts") that jointly make decisions with dynamically determined weights. Unlike previous approaches that combine each expert's predictions, we combine their parameters for generating the transition matrix. Let k, a hyperparameter, be the number of experts we use to parameterize the transition matrix T ∈ R^{4×4}. We introduce k learnable matrices T^(1), ..., T^(k) for the k experts. Each expert's matrix can be understood as a cluster prototype, and we employ a linear gating network to compute the probability of assigning the given attribute to each expert:

    α = softmax(W_gate r) ∈ R^k,    (4)

where W_gate ∈ R^{k × d_r}. The transition matrix for this attribute is then calculated as:

    T = Σ_{j=1}^{k} α_j T^(j).    (5)
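A minimal sketch of the expert combination (softmax gating over a linear map of the attribute embedding is our reading of the linear gating network; all names are illustrative):

```python
import math

def moe_transition(W_gate, r, experts):
    """Combine k expert 4x4 matrices with softmax gating weights
    computed from the attribute embedding r."""
    logits = [sum(w * x for w, x in zip(row, r)) for row in W_gate]
    m = max(logits)                               # for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    alpha = [e / total for e in exps]             # gating probabilities
    k = len(experts)
    # Weighted sum of expert matrices, element by element.
    return [[sum(alpha[e] * experts[e][i][j] for e in range(k))
             for j in range(4)] for i in range(4)]
```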

Pretrained Attribute Embeddings
The attribute embedding r plays a key role in deriving the attribute-specific decoding layers. Therefore, the quality of the attribute embeddings is crucial to the success of our parameterization method. Good attribute embeddings are supposed to capture the subtask similarities such that similar extraction tasks use decoders with similar parameters. In this work, we propose to use the attribute name and possible values as a proxy to capture the characteristics of the value extraction task for a given attribute.
The attribute embeddings can therefore be directly derived from the training data and loaded into the attribute embedding layer as initialization.
For each attribute r, we first collect all the sentences from the training data that are annotated with at least one value for r. We denote the collected sentences with values as

    D_r = {(r̃, v_i, X_i)}_{i=1}^{n_r},

where r̃ is the phrase representation of r (e.g., r̃ = "Skin Type" if r = "SkinType"), v_i is a span in text sequence X_i that serves as a value for r, and n_r is the number of collected sentences. For each (r̃, v_i, X_i), we can calculate an attribute name embedding r^name_i and an attribute value embedding r^value_i in either a contextualized or an uncontextualized way, as detailed below. We pool over all instances in D_r to get the final attribute name embedding and attribute value embedding, which are concatenated as the attribute embedding:

    r = [pool_i(r^name_i); pool_i(r^value_i)].

Contextualized Embeddings. Taking the context into consideration helps obtain embeddings that more accurately represent the semantics of a word. Here we use the contextualized representations provided by BERT (Devlin et al., 2019) to generate the embeddings. We use BERT to encode X_i and take v_i's phrase embedding (the averaged embedding of the words in the phrase) as r^value_i; r^name_i is obtained analogously by encoding a synthetic sentence in which r̃ is marked off from the context.

Uncontextualized Embeddings. Static embeddings like Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) can be more stable under noisy contexts. We use GloVe (50d) to get the phrase embedding for v_i as r^value_i and the phrase embedding for r̃ as r^name_i.
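For the uncontextualized variant, the pooling can be sketched as follows (toy 2-d word vectors; mean pooling is our assumption for the pool operator; all names are ours):

```python
def phrase_embedding(words, vectors):
    """Average the static word vectors of a phrase."""
    dims = len(next(iter(vectors.values())))
    summed = [0.0] * dims
    for w in words:
        for d in range(dims):
            summed[d] += vectors[w][d]
    return [s / len(words) for s in summed]

def attribute_embedding(name_words, value_phrases, vectors):
    """Concatenate the name embedding with the mean of value embeddings
    pooled over all collected instances."""
    r_name = phrase_embedding(name_words, vectors)
    value_embs = [phrase_embedding(v, vectors) for v in value_phrases]
    r_value = [sum(e[d] for e in value_embs) / len(value_embs)
               for d in range(len(r_name))]
    return r_name + r_value
```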

Model Training
As we parameterize the CRF-based decoder with the attribute embedding through the hypernetwork and MoE, the learnable parameters of our model include the encoder parameters θ_encoder = {W_word, θ_bilstm} and the decoder-generation parameters θ_decoder = {W^w_hyper, W^b_hyper, W_gate, T^(1), ..., T^(k)}. We freeze the attribute embeddings W_att, as this gives better performance, which is also discussed in §5.3.
The whole model is trained end-to-end by maximizing the log likelihood of the (X, r, Y) triplets in the training set, which is derived from Equation 2 as:

    log p(Y | X, r) = s(X, Y) − log Σ_{Y' ∈ V^n} exp(s(X, Y')),

where V^n is the set of all tag sequences of length n.

[Footnote 1: [BOA] and [EOA] are special tokens that are used to separate the attribute name from the context in the synthetic sentence.]
The log likelihood can be computed efficiently using the forward algorithm (Baum and Eagon, 1967) for hidden Markov models (HMMs). At inference, we adopt the Viterbi algorithm (Viterbi, 1967) to obtain the most likely Y given X and r on the test set.
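The normalizer in the log likelihood can be computed by the forward algorithm in O(n · |V|²) time; a minimal log-space sketch (our own illustration, following the emission/transition definitions of §2.2):

```python
import math

def logsumexp(xs):
    # Numerically stable log of a sum of exponentials.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_partition(P, T):
    """Compute log of the sum over all tag sequences Y' of exp(s(X, Y')),
    with emissions P (|V| x n) and transitions T (|V| x |V|)."""
    n_tags, n = len(P), len(P[0])
    # alpha[t] = log-sum of scores of all prefixes ending in tag t.
    alpha = [P[t][0] for t in range(n_tags)]
    for i in range(1, n):
        alpha = [logsumexp([alpha[s] + T[s][t] for s in range(n_tags)]) + P[t][i]
                 for t in range(n_tags)]
    return logsumexp(alpha)
```

With all-zero scores and a length-1 sequence, the partition function is simply the number of tags, so the log partition is log 4.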

Dataset
To evaluate the effectiveness of our proposed model, we build a dataset by collecting product profiles (title, bullets, and description) from the public web pages at Amazon.com. 2 Following previous work (Zheng et al., 2018; Karamanolakis et al., 2020; Xu et al., 2019), we obtain the attribute-value pairs for each product via distant supervision, using the product information on the web pages. We select 32 attributes with different frequencies. For each attribute, we collect product profiles that are labeled with at least one value for this attribute. We further split the collected data into training (90%) and development (10%) sets.
The annotations obtained by distant supervision are often noisy, so they cannot be considered gold labels. To ensure the reliability of the evaluation results, we also manually annotated an additional test set covering several attributes. We randomly selected 12 of the 32 training attributes, took a random sample of the relevant product profiles for each attribute, and asked human annotators to annotate the corresponding values. We ensured that there is no product overlap between the training/development sets and the test set.
Putting together the datasets built for each individual attribute, we end up with training and development sets for 32 attributes, covering 333,857 and 40,008 products respectively. The test set has 12 attributes and covers 11,818 products. Table 2 presents the statistics of our collected dataset. Table 3 shows the attribute distribution of the training set. It clearly demonstrates the data imbalance issue of the real-world attribute value extraction data.
Most of the attribute values are covered in the title and bullets, since sellers aim to highlight the product features early in the product profile. The description, on the other hand, provides few new values beyond those mentioned in the title and bullets, but significantly increases the computational cost due to its length. Therefore, we consider two experimental settings: extracting from the title only ("Title") and extracting from the concatenation of the title and bullets ("Title + Bullets").

Evaluation Metrics
For each attribute, we calculate Precision/Recall/F1 based on exact string matching. That is, an extracted value is considered correct only if it exactly matches one of the ground-truth values for the query attribute in the given text sequence. We use Macro-Precision/Macro-Recall/Macro-F1 (denoted as P/R/F1) as the aggregated metrics to avoid bias towards high-resource attributes. They are calculated by averaging the per-attribute metrics.
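The aggregation described above can be sketched as follows (per-attribute exact-match counts in, macro averages out; all names are ours):

```python
def prf1(tp, n_pred, n_gold):
    """Precision/Recall/F1 from exact-match counts for one attribute:
    tp correct extractions, n_pred predicted values, n_gold gold values."""
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def macro_prf1(per_attr_counts):
    """Unweighted average of per-attribute P/R/F1, so every attribute
    counts equally regardless of its frequency."""
    scores = [prf1(*c) for c in per_attr_counts]
    k = len(scores)
    return tuple(sum(s[i] for s in scores) / k for i in range(3))
```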

Compared Methods
We compare our proposed model with a series of strong baselines for attribute value extraction. 3 BiLSTM uses a BiLSTM-based encoder; each hidden representation is decoded independently into a tag with a linear layer followed by softmax. BiLSTM-CRF (Huang et al., 2015) uses a BiLSTM-based encoder and a CRF-based decoder, as described in §2.2. Zheng et al. (2018) propose OpenTag, which adds a self-attention layer between the BiLSTM-based encoder and the CRF-based decoder for interpretable attribute value extraction. However, we find the self-attention layer does not help performance, 4 so we only present the results for BiLSTM-CRF in §5. BERT (Devlin et al., 2019) and BERT-CRF replace the BiLSTM-based text encoder with BERT. 5 Note that these four methods do not take the query attribute as input. To make them work in our more realistic setting with multiple (N) attributes, we consider two variants of each. (1) "N tag sets": We introduce one set of B/I/E tags for each attribute, so that a tag sequence can be unambiguously mapped to the extraction results for multiple attributes. For example, the tag sequence "B-SkinType E-SkinType O B-Scent" indicates that the first two words constitute a value for attribute SkinType, and the last word is a value for Scent. Only one model is needed to handle the extraction for all attributes. (2) "N models": We build one value extraction model per attribute, i.e., we train N models for this task.
The "N models" variant isolates the learning of different attributes. To enable knowledge sharing, the following methods share model components, or the whole model, among all attributes. BiLSTM-CRF-SharedEmb shares a word embedding layer among all attributes; each attribute has its own BiLSTM layer and CRF-based decoder, which are independent of each other. BiLSTM-MultiCRF (Yang et al., 2017) shares a BiLSTM-based text encoder among all attributes; each attribute has its own CRF-based decoder. SUOpenTag (Xu et al., 2019) encodes both the text sequence and the query attribute with BERT and adopts a cross-attention mechanism to get an attribute-aware representation for each word. The hidden representations are decoded into tags with a CRF-based decoder.
We also include AdaTag (Random AttEmb), which has the same architecture as our model but uses randomly initialized learnable attribute embeddings of the same dimension.

Implementation Details
We implement all models with PyTorch (Paszke et al., 2019). For models involving BERT, we use the bert-base-cased version. Other models use pretrained 50d GloVe (Pennington et al., 2014) embeddings as the initialization of the word embedding matrix W_word. We choose d_h = 200 as the hidden size of the BiLSTM layer and 32 as the batch size. BERT-based models are optimized with the AdamW (Loshchilov and Hutter, 2019) optimizer with learning rate 2e-5; others use the Adam (Kingma and Ba, 2015) optimizer with learning rate 1e-3. We perform early stopping if no improvement in (Macro-)F1 is observed on the development set for 3 epochs. For our model, we use contextualized attribute embeddings as described in §3.3 and freeze them during training, and we set k = 3 for MoE. All choices were made based on development set performance.

Main Results

Table 4 presents the overall results on our dataset under both the "Title" and "Title + Bullets" settings. Our model demonstrates marked improvements over the baselines on all metrics, except for obtaining the second-best recall under the "Title + Bullets" setting. These comparisons demonstrate the overall effectiveness of our model and the pretrained attribute embeddings. The "N tag sets" variants perform much worse than the other methods, probably due to the severe data imbalance in the training set (see Table 3): all attributes share the same CRF-based decoder, which could bias learning towards high-resource attributes. Note that introducing one set of tags for each entity type is the standard approach for the Named Entity Recognition (NER) task; its low performance here suggests that attribute value extraction is inherently different from standard NER. Variants that share components generally achieve higher performance than the independent modeling methods ("N models"), which demonstrates the usefulness of enabling knowledge sharing among different subtasks.
We also notice that the BERT and BERT-CRF models perform worse than their BiLSTM and BiLSTM-CRF counterparts. The reason could be the domain discrepancy between the corpora BERT is pretrained on and the product titles/bullets: the former consist mainly of natural-language sentences, while the latter are made up of concatenated keywords and ungrammatical sentences.

High-vs. Low-Resource Attributes
To better understand the gains achieved by joint modeling, we further split the 12 test attributes into 8 high-resource attributes and 4 low-resource attributes based on the size of their training data, with 1,000 instances as the threshold. It is important to point out that many factors other than the size of the training data (e.g., vocabulary size, value ambiguity, and domain diversity) can contribute to the difficulty of modeling an attribute; therefore, the performance for different attributes is not directly comparable. 6 From the results in Table 5, we can see that our model achieves a much larger improvement over the independent modeling approach (BiLSTM-CRF (N models)) on low-resource attributes than on high-resource attributes. This suggests that low-resource attributes benefit more from knowledge sharing, making our model desirable in real-world settings with imbalanced attribute distributions.

Ablation Studies
Attribute Embeddings. We study different choices for adopting pretrained attribute embeddings. Specifically, we experiment with contextualized embeddings (BERT_name+value) and uncontextualized embeddings (GloVe_name+value) under the "Title" setting. For given attribute embeddings, we can either finetune them during training or freeze them once loaded. We also experiment with attribute name embeddings r^name and attribute value embeddings r^value alone, to understand which information is more helpful. The baseline uses randomly initialized learnable attribute embeddings. Table 6 shows the results. Comparing attribute embeddings of the same dimension, we find that freezing pretrained embeddings always leads to performance gains over the random baseline. This is because our parameterization methods have high flexibility in generating the parameters for the decoder; using pretrained embeddings and freezing them provides the model with a good starting point and makes learning easier by reducing the degrees of freedom. BERT_name (freeze) outperforms BERT_value (freeze), suggesting that the attribute name is more informative in determining the characteristics of the value extraction task on our dataset, where the values labeled through distant supervision are noisy.

Parameterization Methods. We experiment with all four combinations of the hypernetwork and MoE for parameterizing the linear layer and the transition matrix, and present results for the best combination, under the "Title" setting. From Table 7, we find that parameterizing the linear layer with MoE leads to much lower performance. This is reasonable because the linear layer plays a much more important role in the decoder, while the transition matrix acts more like a regularizer that avoids bad tag sequences. MoE uses k matrices as a basis and represents the parameters for any attribute as a linear combination of them, which limits the expressiveness needed to capture the complicated characteristics of different attributes and thus severely hurts performance. For the transition matrix, by contrast, modeling with MoE is the better choice.
This is because the transition matrix is more "structured", in the sense that each of its elements is expected to be either a large number or a small number, based on its semantics. For example, the transition score for I → E should be much higher than that for I → B. A hypernetwork is too flexible to reliably generate such "structured" parameters.

Effect of Number of Attributes
An important motivation for our model is that joint modeling of different attributes can facilitate knowledge sharing and improve performance. Here we study how model performance changes as the number of jointly modeled attributes increases. We experiment under the "Title" setting. We start by training our model on the 12 attributes that have test data. We then randomly select 5, 10, 15, or 20 attributes from the remaining attributes and add them to the joint training. The evaluation results on the 12 test attributes are presented in Figure 3. While our model generally demonstrates greater improvement as more attributes are modeled jointly, the other models' performance fluctuates or degrades. This also demonstrates the scalability of our model when new attributes keep emerging in real-world scenarios.

Related Work
Attribute Value Extraction. OpenTag (Zheng et al., 2018) formulates attribute value extraction as a sequence tagging task and proposes a BiLSTM-SelfAttention-CRF architecture to address the problem. Xu et al. (2019) propose an "attribute-aware" setup, utilizing a single set of BIO tags and attribute name embeddings with an attention mechanism, to make the extraction network attribute-comprehensive. Karamanolakis et al. (2020) additionally incorporate the product taxonomy into a multitask learning setup to capture the nuances across different product types. Zhu et al. (2020) introduce a multi-modal network with cross-modality attention to combine text with visual information, leveraging rich image information that is not conveyed in the text. Other recent work uses a question answering formulation to tackle attribute value extraction. Like most previous contributions, we adopt the sequence labeling setup in our model, but we utilize an adaptive decoding approach in which the decoding network is parameterized with the attribute embedding.
Dynamic Parameter Generation. Our model adopts an adaptive decoding setup, parameterized with attribute embeddings through a hypernetwork and a Mixture-of-Experts module.

Conclusion
In this work we propose a multi-attribute value extraction model that performs joint modeling of many attributes using an adaptive CRF-based decoder. Our model has a high capacity to derive attribute-specific network parameters while facilitating knowledge sharing. Incorporated with pretrained attribute embeddings, our model shows marked improvements over previous methods.
For our AdaTag, the hypernetwork weights W^w_hyper ∈ R^{4d_h × d_r} have (4 × 200) × 1536 parameters. This number can be reduced by inserting an intermediate layer with fewer neurons.