COMAVE: Contrastive Pre-training with Multi-scale Masking for Attribute Value Extraction

Attribute Value Extraction (AVE) aims to automatically obtain attribute value pairs from product descriptions to aid e-commerce. Despite the promising performance of existing approaches on e-commerce platforms, they still suffer from two challenges: 1) difficulty in identifying values at different scales simultaneously; 2) easy confusion by some highly similar fine-grained attributes. This paper proposes a pre-training technique for AVE to address these issues. In particular, we first improve the conventional token-level masking strategy, guiding the language model to understand multi-scale values by recovering spans at the phrase and sentence levels. Second, we apply clustering to build a challenging negative set for each example and design a pre-training objective based on contrastive learning to force the model to discriminate similar attributes. Comprehensive experiments show that our solution provides a significant improvement over traditional pre-trained models on the AVE task, and achieves state-of-the-art results on four benchmarks.


Introduction
Product features are crucial components of e-commerce platforms and are widely used in applications such as product recommendation (Cao et al., 2018), product retrieval (Magnani et al., 2019), and product question answering (Yih et al., 2015; Chen et al., 2021b). Each product feature typically consists of an attribute and one or more values, providing detailed product descriptions to help customers make purchasing decisions. In recent years, Attribute Value Extraction (AVE) methods (Xu et al., 2019; Zhu et al., 2020; Yan et al., 2021) have received increasing attention because they can automatically extract product features from massive amounts of unstructured product text, with impressive results on e-commerce platforms such as Amazon, AliExpress, and JD. However, as e-commerce grows, some emerging domains, such as finance, insurance, and healthcare, bring two new challenges: a) Multi-scale values. Unlike ordinary products (e.g., clothing) with only short values (e.g., color: red), insurance products can have a value that spans a long phrase or even multiple sentences. For example, the value of the attribute renewal rule in Figure 1 contains more than 25 words (in green), rendering it impractical to retrieve such values with related techniques such as Named Entity Recognition (NER) (Li et al., 2020; Yang et al., 2021). b) Fine-grained divisions of attributes. Compared with the coarse division of attributes in traditional e-commerce (e.g., color, size, and material), the division in insurance products is more refined, so different attributes often have similar types. For instance, in the insurance clauses in Figure 1, maximum insurance age and maximum renewal age are both ages, and grace period and hesitation period are both periods. This fine-grained division makes the distinction between different attributes subtle, thus increasing the difficulty of distinguishing them.
Although recent pre-trained language models (PLMs) such as BERT (Devlin et al., 2019) and ROBERTA (Liu et al., 2019) achieve tremendous success on a spectrum of NLP tasks, including AVE, we argue that they are not sufficient for the challenges above. First, the conventional Masked Language Model (MLM) objective focuses on token-level recovery and does not consider multi-scale values. Second, there is still a gap between the unsupervised general objectives and downstream AVE in terms of task form, such that the model cannot benefit from pre-training when retrieving attributes, let alone distinguishing between fine-grained similar attributes. In this paper, we propose COMAVE, a novel PLM for AVE tasks. Relying on a large-scale corpus of triples ⟨text, attribute, value⟩ collected by distant supervision, we propose three pre-training objectives to address the challenges: a) Multi-Scale Masked Language Model (MSMLM). We extend token-level recovery to the phrase and sentence levels, using different masking mechanisms to force the model to perceive spans of various lengths, thus providing a basis for identifying values at different scales. b) Contrastive Attribute Retrieval (CAR). To adapt the model to the fine-grained division of attributes, we require it to retrieve the correct attributes from a challenging candidate set of semantically similar attributes. The candidates are mainly collected by clustering, and a contrastive loss is designed to help the model perceive the subtle differences between them. c) Value Detection (VD). To close the gap between pre-training and downstream AVE and further enhance the model's perception of value extraction, we let the model recognize all values without considering the corresponding attribute.

Preliminaries
Given a natural language text T and a set of candidate attributes A = {a_1, a_2, ..., a_|A|}, where a_i is an attribute, the goal of AVE is to extract a set of pairs {(a*_i, V_i)}, where a*_i ∈ A and V_i is the set of values belonging to a*_i. For simplicity, each value v ∈ V_i is defined as a span of T. In general, T is collected from a large number of product-related documents or other data sources, and A is a collection of attributes for various products in different categories.
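The task format above can be made concrete with a small example. This is only an illustration of the input/output shape; the text, attribute names, and values are invented.

```python
# A minimal sketch of the AVE task format: text T, candidate attributes A,
# and the expected output pairs (a*_i, V_i). All content is illustrative.
text = ("This policy can be renewed up to the maximum renewal age of 99. "
        "The grace period is 60 days.")

candidate_attributes = ["maximum renewal age", "grace period", "hesitation period"]

# Expected output: for each retrieved attribute a*_i, the set of values V_i,
# where every value must be a contiguous span of T.
expected = {
    "maximum renewal age": ["99"],
    "grace period": ["60 days"],
}

# Sanity check: each value is recoverable as a span of T.
for values in expected.values():
    for v in values:
        assert v in text
```

Note that "hesitation period" is a candidate but yields no value here; the model must both retrieve the correct attributes and locate their spans.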
Note that although AVE is formally similar to NER, the two still have significant differences, as mentioned in Section 1. First, the division of attributes is more fine-grained than the division of entity types (e.g., location and person). Second, entities are generally short, while values vary from the token level to the sentence level. Therefore, conventional NER methods are difficult to port directly to AVE tasks.

Pre-training Corpus Construction
The pre-training procedure of COMAVE requires a large-scale corpus C = {(T_i, A_i, Y_i)}_M containing tens of millions of examples. Manually annotating such a large corpus is obviously impractical, so we designed an automatic method to construct C. In brief, we first collect triples ⟨subject, predicate, object⟩ from several existing open-domain knowledge graphs, including DBpedia (Lehmann et al., 2015), Yago (Tanon et al., 2020), WikiData (Vrandecic and Krötzsch, 2014), and OpenKG (Chen et al., 2021a). Then, we regard each predicate and object as the attribute a_i and the value v_i, respectively, thereby building a seed set {(a_i, V_i)}_N by aligning and merging the attributes. Finally, we use this set as distant supervision to mine the corresponding texts from web data, thus building the pre-training corpus.
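The distant-supervision step can be sketched as follows. This is a simplified stand-in for the paper's pipeline: it only matches seed values verbatim against raw text, whereas the actual construction also aligns and merges attributes across knowledge graphs. The function name and data shapes are assumptions.

```python
def build_corpus(seed, documents):
    """Distant supervision sketch: emit a (text, attribute, values) triple
    whenever a seed value for an attribute appears verbatim in a document."""
    corpus = []
    for text in documents:
        for attribute, values in seed.items():
            matched = [v for v in values if v in text]
            if matched:
                corpus.append((text, attribute, matched))
    return corpus

# toy seed set {(a_i, V_i)} and a toy document collection
seed = {"color": ["red", "blue"], "material": ["cotton"]}
docs = ["A red cotton shirt.", "Blue denim jeans."]
triples = build_corpus(seed, docs)
print(triples)
```

A real pipeline would add normalization (the lowercase "blue" misses the capitalized "Blue" above), boundary checks, and noise filtering, but the weak-labeling idea is the same.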

Pre-training COMAVE
Since ROBERTA (Liu et al., 2019) has been shown to be promising and robust on multiple NLP tasks, we use it to initialize COMAVE, which we then further pre-train on our corpus C. As shown in Figure 2, we flatten each pair (T, A) into a sequence X with the </s> token; COMAVE then converts each token of X into a semantic vector, where h^T_i ∈ R^d and h^a_{j,k} ∈ R^d denote the vectors of x_i and x^a_{j,k}, respectively, and h_<s> ∈ R^d is regarded as the global semantic vector of X. Considering the above challenges, we design three objectives to pre-train COMAVE as follows.
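The flattening step can be sketched as below. The exact template (placement of <s> and </s>) follows the RoBERTa convention suggested by Figure 2, but the precise layout is an assumption, not the released implementation.

```python
def flatten(text_tokens, attributes):
    """Flatten a pair (T, A) into one input sequence X, separating the text
    and each candidate attribute with </s>, RoBERTa-style (assumed template)."""
    seq = ["<s>"] + text_tokens + ["</s>"]
    for attr in attributes:
        seq += attr.split() + ["</s>"]
    return seq

x = flatten(["red", "shirt"], ["color", "grace period"])
print(x)
# ['<s>', 'red', 'shirt', '</s>', 'color', '</s>', 'grace', 'period', '</s>']
```

The encoder then maps each position of this sequence to a d-dimensional vector, with the <s> position serving as the global representation.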

Multi-Scale Masked Language Model
The most common pre-training objective is MLM, which guides the model to generalize broadly. Unlike BERT or ROBERTA, which focus on token-level recovery, we prefer COMAVE to be aware of various values, regardless of their scales. Consequently, we design two parallel masking mechanisms, namely phrase-level and sentence-level masking. During pre-training, each T undergoes only one of the two mechanisms, with probabilities ρ and 1-ρ, respectively. Furthermore, we empirically find that an appropriate masking percentage is a prerequisite for MLM to be effective. We denote this budget percentage by µ_p and µ_s for the two mechanisms, respectively, and try to make the masking result close to it in both cases.
In phrase-level masking, inspired by SpanBERT (Joshi et al., 2020), we randomly mask short spans of tokens in each selected T until the budget µ_p is spent. The probability distribution of the masking length l ∈ [1, l_max] is parameterized by two hyper-parameters σ and γ. This distribution ensures that the masking probability of each span decreases smoothly as its length increases, while also preventing long spans from being rarely selected. Note that we ensure that each masked span is formed by complete words.
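A minimal sketch of this mechanism is shown below. Since the paper's σ/γ-parameterized length distribution is not reproduced here, a truncated geometric distribution (as in SpanBERT, mean length 1/p = 5, comparable to the ℓ_mean ≈ 5.87 reported later) serves as a stand-in with the same smoothly decaying shape; the function and parameter names are assumptions.

```python
import random

def phrase_mask(tokens, budget=0.15, l_max=20, p=0.2, seed=0):
    """Mask random spans until roughly `budget` of the tokens are masked.
    Span lengths follow a truncated geometric distribution: shorter spans
    are more likely, but lengths up to l_max remain possible."""
    rng = random.Random(seed)
    out = list(tokens)
    target = max(1, int(len(out) * budget))  # token budget mu_p * |T|
    masked = 0
    while masked < target:
        # sample a span length in [1, l_max] with geometrically decaying odds
        length = 1
        while length < l_max and rng.random() > p:
            length += 1
        length = min(length, target - masked, len(out))
        start = rng.randrange(0, len(out) - length + 1)
        for i in range(start, start + length):
            if out[i] != "<mask>":
                out[i] = "<mask>"
                masked += 1
    return out

toks = [f"w{i}" for i in range(40)]
masked = phrase_mask(toks)
print(masked.count("<mask>"))  # 6, i.e. 15% of 40 tokens
```

A faithful implementation would additionally snap span boundaries to whole words, as the paper requires.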
In sentence-level masking, we mask only one sentence in each selected T, because recovering a sentence requires sufficient context. In this way, it is harder to make the total number of masked tokens approach µ_s than with masked phrases, since the lengths of different sentences can vary significantly. To achieve this goal, we propose a simple but effective strategy to dynamically control the masking probability of each sentence. Specifically, let the current sentence masking rate be µ_c = l(T_mask)/l(T), where T_mask denotes the tokens that have been masked so far. If µ_c < µ_s, the current masking rate is below the target, so we should pay more attention to longer sentences S_long = {s | l(s) > l(T) * µ_s}, giving them higher masking probabilities. Otherwise, we should focus on short ones S_short = {s | l(s) ≤ l(T) * µ_s}.
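The dynamic control can be sketched as a selection rule. The paper assigns graded probabilities; for simplicity this sketch samples uniformly from the preferred pool (S_long or S_short), and the function name, the hard switch, and the illustrative µ_s value are assumptions.

```python
import random

def pick_sentence(sentences, masked_so_far, total_tokens, mu_s=0.3, seed=0):
    """Steer sentence-level masking toward the budget mu_s: if the running
    rate mu_c = l(T_mask)/l(T) is below mu_s, sample from the long sentences
    S_long; otherwise sample from the short ones S_short."""
    rng = random.Random(seed)
    mu_c = masked_so_far / total_tokens       # current masking rate mu_c
    threshold = total_tokens * mu_s           # l(T) * mu_s
    long_s = [s for s in sentences if len(s) > threshold]
    short_s = [s for s in sentences if len(s) <= threshold]
    pool = long_s if mu_c < mu_s else short_s
    return rng.choice(pool or sentences)

# three sentences of 3, 12, and 5 tokens (20 tokens total)
sents = [["a"] * 3, ["b"] * 12, ["c"] * 5]
chosen = pick_sentence(sents, masked_so_far=0, total_tokens=20)
print(len(chosen))  # 12: below budget, so the long sentence is picked
```

With masked_so_far above the budget, the same call instead draws from the short sentences, pulling the total masked count back toward µ_s.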
Following BERT (Devlin et al., 2019), we replace 80% of the masked tokens with <mask>, 10% with random tokens from the corpus, and leave the remaining 10% unchanged.

Contrastive Attribute Retrieval
We expect COMAVE to adapt to the subtle differences between attributes in the pre-training phase. To this end, for each training text T and its ground-truth attributes A+, a challenging negative set A−_c is sampled using clustering to guarantee that it is highly similar to A+ (see below for details), while each a_g ∈ A−_g is drawn randomly from the total attribute pool to maintain the diversity of the negative examples (if T has no ground-truth attributes, the candidate set consists only of negatives). Each candidate attribute a is then scored as P(T, a), where W_CAR ∈ R^{d×η} is a trainable parameter and η denotes the maximum of |A|.
We want the score of A+ to be higher than that of each negative example, i.e., ∀a− ∈ A−, P(T, a+) > P(T, a−), where a+ ∈ A+. Inspired by (Khosla et al., 2020), we define a Margin Ranking Loss to better leverage contrastive learning and strengthen the distinction between fine-grained attributes, where P_i is short for P(T, a_i) and λ is the margin. If a_i and a_j are both positive or both negative examples, z = 0; otherwise z = 1.
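Since Eq. (5) itself is not reproduced above, the following is a plausible sketch of a pairwise margin ranking loss consistent with that description: only mixed (positive, negative) pairs, i.e., z = 1, contribute, and each contributes max(0, λ − (P+ − P−)). The averaging convention is an assumption.

```python
def margin_ranking_loss(scores, labels, margin=2.0):
    """Pairwise margin ranking loss over attribute scores P_i.
    For each (positive, negative) pair, penalise max(0, margin - (P+ - P-));
    pairs with equal labels (z = 0) contribute nothing."""
    loss, pairs = 0.0, 0
    for pi, yi in zip(scores, labels):
        for pj, yj in zip(scores, labels):
            if yi == 1 and yj == 0:      # z = 1 only for mixed pairs
                loss += max(0.0, margin - (pi - pj))
                pairs += 1
    return loss / max(pairs, 1)

# one positive attribute scored 3.0 against two negatives
loss = margin_ranking_loss([3.0, 2.5, 0.1], [1, 0, 0])
print(loss)  # 0.75: the hard negative at 2.5 still violates the margin
```

Note how the clustered hard negative (2.5, close to the positive) dominates the loss, which is exactly what pushes the model to separate fine-grained similar attributes.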
The key to this training objective is how to collect an A−_c that is highly similar to A+. Clustering has a natural advantage in retrieving similar instances, so we use the widely adopted K-medoids (Park and Jun, 2009) clustering method to construct A−_c. Concretely, the distance between two attributes combines f_t and f_s, which denote the Levenshtein distance and the Euclidean metric, respectively; z ∈ R^d is the ROBERTA (Liu et al., 2019) pooling vector of z, and τ denotes score normalization to ensure balance. The distance thus considers both the literal and semantic features of the attributes and their associated values.
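A sketch of such a combined distance is given below. The normalization τ and the equal weighting are assumptions, and the RoBERTa pooling vectors are stubbed by a toy embedding lookup; only the structure (literal term f_t plus semantic term f_s) follows the description.

```python
def levenshtein(a, b):
    """Plain edit-distance DP (the literal term f_t)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def attr_distance(a, b, emb, w=0.5):
    """Combine normalised Levenshtein distance with Euclidean distance
    between pooled vectors (the semantic term f_s). Weighting and
    normalisation are illustrative stand-ins for tau."""
    lit = levenshtein(a, b) / max(len(a), len(b))
    sem = sum((x - y) ** 2 for x, y in zip(emb[a], emb[b])) ** 0.5
    return w * lit + (1 - w) * sem

# toy 2-d "pooling vectors" for three attributes
emb = {"grace period": [0.9, 0.1], "hesitation period": [0.8, 0.2],
       "maximum renewal age": [0.1, 0.9]}
d1 = attr_distance("grace period", "hesitation period", emb)
d2 = attr_distance("grace period", "maximum renewal age", emb)
assert d1 < d2  # similar attributes end up closer, hence in the same cluster
```

Attributes that land in the same K-medoids cluster as a positive attribute then become its hard negatives.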

Value Detection
To further bridge the gap between pre-training and downstream AVE tasks, we add a training objective of detecting values. For (T, A), each positive attribute a+_i ∈ A+ corresponds to one or more extractable values V_i = {v_1, v_2, ..., v_n} in T. The model classifies each token x_i ∈ T according to whether it is part of a value in V, where W_VD ∈ R^d is a trainable parameter. We define "V" and "O" as labels representing x_i ∈ V and x_i ∉ V, respectively. Note that each token does not need to be classified into the exact attribute it belongs to.
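The head amounts to attribute-agnostic binary token classification, sketched below with hand-set toy weights standing in for the trainable W_VD and for the encoder's hidden states.

```python
import math

def detect_values(hidden, w_vd, bias=0.0):
    """Value Detection sketch: label a token "V" if sigmoid(w_vd . h_i) > 0.5,
    else "O". `hidden` are toy 2-d stand-ins for the encoder outputs h_i."""
    labels = []
    for h in hidden:
        score = sum(wi * hi for wi, hi in zip(w_vd, h)) + bias
        labels.append("V" if 1 / (1 + math.exp(-score)) > 0.5 else "O")
    return labels

# toy hidden states for the tokens of "renewable to age 99"
hidden = [[-1.0, 0.2], [-0.5, 0.1], [0.8, 0.9], [1.2, 0.7]]
labels = detect_values(hidden, w_vd=[1.0, 0.5])
print(labels)  # ['O', 'O', 'V', 'V']
```

Because the label ignores which attribute a value belongs to, this objective purely trains the model's sense of "value-ness" in text.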

Fine-tuning
To fully evaluate the effectiveness of our pre-training for downstream tasks, we fine-tune COMAVE with each of the following two output layers.

Sequence Tagging Layer
In this setting, T and all candidate attributes A are first fed to COMAVE, as in pre-training. Then, according to the output h^T, a Conditional Random Field (CRF) generates a sequence Y = {y_1, y_2, ..., y_n}. Here n is the length of T and each y_i ∈ ∪_{k=1}^{|A|} {B_k, I_k, O} is a tag indicating whether the token x_i ∈ T is the beginning (B_k), inside (I_k), or outside (O) of a value of the attribute a_k ∈ A.
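Decoding such per-attribute BIO tags back into (attribute, value) pairs can be sketched as below. The tag strings and decoder are illustrative; the paper's model produces the tag sequence with a CRF, which this sketch takes as given.

```python
def decode_bio(tokens, tags):
    """Turn per-attribute BIO tags (B-k / I-k / O) into (attribute, value)
    pairs. A plain greedy decoder over an already-predicted tag sequence."""
    results, span, attr = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if span:
                results.append((attr, " ".join(span)))
            attr, span = tag[2:], [tok]
        elif tag.startswith("I-") and span and tag[2:] == attr:
            span.append(tok)
        else:
            if span:
                results.append((attr, " ".join(span)))
            span, attr = [], None
    if span:
        results.append((attr, " ".join(span)))
    return results

tokens = ["grace", "period", "is", "60", "days"]
tags = ["O", "O", "O", "B-grace_period", "I-grace_period"]
pairs = decode_bio(tokens, tags)
print(pairs)  # [('grace_period', '60 days')]
```

Since the tag set grows with |A|, this layer suits settings where the candidate attribute set is moderate in size.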

Machine Reading Comprehension Layer
In this case, COMAVE takes each (T, a_i) as input and predicts the spans of target values belonging to a_i ∈ A in T. Here, we follow a representative work (Li et al., 2020) consisting of two steps. First, candidate start and end indexes of spans are predicted using binary classification of each token separately. Subsequently, a matching score is computed for each candidate (start, end) index pair. Finally, the pairs with scores above a threshold are retained as results.
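The two-step procedure can be sketched as below, with toy probability inputs standing in for the model's start/end classifiers and match scorer; the 0.5 cut-offs are assumptions.

```python
def extract_spans(start_probs, end_probs, match, threshold=0.5):
    """MRC-style span extraction following the two-step scheme: keep candidate
    start/end indexes above 0.5, then keep (s, e) pairs whose match score
    clears `threshold`. `match` stands in for the learned pair scorer."""
    starts = [i for i, p in enumerate(start_probs) if p > 0.5]
    ends = [i for i, p in enumerate(end_probs) if p > 0.5]
    return [(s, e) for s in starts for e in ends
            if s <= e and match.get((s, e), 0.0) > threshold]

start_probs = [0.1, 0.9, 0.2, 0.7]
end_probs = [0.1, 0.1, 0.8, 0.9]
match = {(1, 2): 0.9, (1, 3): 0.2, (3, 3): 0.6}
spans = extract_spans(start_probs, end_probs, match)
print(spans)  # [(1, 2), (3, 3)]
```

Unlike a single argmax span, this pairing step naturally returns multiple values for one attribute, which matters for AVE.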

Experiments

Datasets
To comprehensively evaluate our method, we used the following four datasets covering both English and Chinese: 1) INS is a Chinese AVE dataset collected from real product data on the Alipay platform. It contains various types of large-scale insurance products from real scenarios, including wealth insurance, health insurance, travel insurance, life insurance, etc. From each product document, the attributes and values were manually annotated. There are 29 global attributes, and the samples are divided into 9112/1138/1138 for Train/Val/Test, respectively. Table 1 gives several groups of similar attributes and the number of their corresponding examples, and Table 2 shows the distribution of different value scales; together they reveal that the two challenges we focus on are prevalent in INS. 2) MEPAVE (Zhu et al., 2020) is a Chinese AVE dataset with examples from the JD e-commerce platform, containing 26 global attributes and 87,194 samples. Most of the text comes from product titles. We randomly divided the dataset into Train/Val/Test parts in the ratio of 8:1:1, following (Zhu et al., 2020). 3) AE-Pub (Xu et al., 2019) contains values over 2,400 attributes obtained from AliExpress. To make a fair comparison with previous models that cannot handle a large number of attributes, we selected 4 frequent attributes (i.e., BrandName, Material, Color, Category) and divided the relevant instances randomly by 7:1:2, following the dataset publisher. 4) MAE (IV et al., 2017) is an English multi-modal AVE dataset that contains 200 million samples and 2000 attributes. Following (Zhu et al., 2020), we built an MAE-text dataset to focus on the textual modality. As with AE-Pub, we selected the 20 most frequent attributes for the Train/Val/Test sets.

Evaluation Metrics
In most experiments, we used Micro-F1 scores as the main evaluation metric. We followed the criterion of exact matching, where the complete sequence of predicted attributes and extracted values must be correct. Accuracy is also used as an additional metric in the detailed analysis.
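The exact-matching criterion can be made precise with a small sketch: a predicted pair counts as correct only if both the attribute and the complete value string match the gold pair. The set-based counting is an assumption about how ties and duplicates are handled.

```python
def micro_f1(predicted, gold):
    """Micro-F1 under exact matching over (attribute, value) pairs:
    a prediction scores only if attribute and full value both match."""
    pred, gt = set(predicted), set(gold)
    if not pred or not gt:
        return 0.0
    tp = len(pred & gt)
    p, r = tp / len(pred), tp / len(gt)
    return 2 * p * r / (p + r) if p + r else 0.0

gold = [("grace period", "60 days"), ("maximum renewal age", "99")]
pred = [("grace period", "60 days"), ("maximum renewal age", "99 years")]
score = micro_f1(pred, gold)
print(round(score, 2))  # 0.5: the partially matched value gets no credit
```

The second prediction illustrates why exact matching is strict: "99 years" overlaps the gold value "99" but earns nothing.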

Implementation Details
Our method ran on Tesla A100 GPUs. All pre-trained models used in our experiments were large versions by default. Chinese and English versions of COMAVE were pre-trained separately for evaluation in the two languages. The hyper-parameters in pre-training were set as follows: (1) The batch size and the learning rate were set to 256 and 1e-5. (2) In the CAR task, η, λ, and ω were set to 12, 2, and 0.4, respectively, and the ratio of A+, A−_g, and A−_c was 1:1:1. (3) In the MSMLM task, ρ, σ, γ, ℓ_max, µ_p, and µ_s were set to 0.2, 1.20, 2e-4, 20, 15%, and 10%, respectively; with these settings, ℓ_mean in phrase-level masking was approximately 5.87. In the fine-tuning stage, the batch size and the learning rate were set to 80 and 2e-5, respectively.

Comparison with AVE Baselines
We first compare with the AVE baselines. To ensure fairness in the number of parameters, we replaced BERT-Base with ROBERTA-Large in the evaluations on the Chinese datasets, and the distilled context layer of AVEQA was also replaced by ROBERTA-Large in all evaluations. The results are shown in Table 3.

Ablation Tests
Here, MRC was uniformly selected as the output layer for all settings due to its better performance. Table 5 shows the results on the four datasets. CAR, MSMLM, and VD all bring obvious improvements, proving the effectiveness and necessity of our pre-training objectives for AVE tasks. The results indicate that the contribution of CAR is the most pronounced among the three objectives.
The final performance of the model decreases significantly when either the clustering sampling or the contrastive loss is removed. In addition, we find that the combination of phrase-level and sentence-level masking is more effective than using only one of them. VD also delivers a promising improvement, which demonstrates its benefit for AVE tasks.

Tests on Fine-Grained Attribute Groups
To further validate the effectiveness of our method in discriminating fine-grained similar attributes, we evaluated the model on the fine-grained attribute groups listed in Table 1. The experimental results are shown in the upper part of Figure 3. COMAVE equipped with all components achieves the best results on all attribute groups. When the CAR training objective is removed, performance drops dramatically across all fine-grained attribute groups. This reveals that contrastive learning over a challenging candidate set during pre-training contributes significantly to discriminating similar attributes in downstream tasks.

Performance on Multi-Scale Values
We also tested the model when extracting values at different scales; the results are shown in the lower part of Figure 3. As expected, the contribution of phrase-level masking is greater for shorter values (fewer than 10 tokens), while the improvement from sentence-level masking becomes significant when value length grows beyond 20 tokens. This supports the reasonableness of combining phrase-level and sentence-level masking. Moreover, we find that even without MSMLM, COMAVE still outperforms plain ROBERTA, which demonstrates the boost from the VD objective and the external knowledge in the pre-training corpus.

Few-Shot Tests
To better reflect realistic applications, we also tested the model in few-shot scenarios. We adopted the N-WAY K-SHOT setup, i.e., the few-shot training set has N attributes, each with K randomly selected training samples. Here, we let N equal the number of attributes in each dataset and focused on the 5-SHOT and 10-SHOT settings.
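Constructing such a few-shot training set can be sketched as follows; the sample format and function name are illustrative.

```python
import random

def n_way_k_shot(dataset, k, seed=0):
    """Build an N-way K-shot training set: N is the number of distinct
    attributes in the dataset, and each attribute contributes K randomly
    drawn samples (fewer if it has fewer than K)."""
    rng = random.Random(seed)
    by_attr = {}
    for sample in dataset:
        by_attr.setdefault(sample["attribute"], []).append(sample)
    few_shot = []
    for samples in by_attr.values():
        few_shot += rng.sample(samples, min(k, len(samples)))
    return few_shot

# toy dataset: 20 samples for each of 2 attributes
data = [{"attribute": a, "text": f"t{i}"}
        for a in ("color", "material") for i in range(20)]
subset = n_way_k_shot(data, k=5)
print(len(subset))  # 10 = 2 attributes x 5 shots
```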
Table 6 shows the experimental results. Under the stringent condition of only 5 training samples per attribute, COMAVE scores over 50% Micro-F1 on all datasets. When the training set is expanded to 10-SHOT, performance reaches approximately 65% on the three datasets other than the challenging INS. The smaller the sample size, the greater the improvement of COMAVE. Thanks to pre-training on a large-scale AVE corpus, COMAVE handles the few-shot AVE task better than ROBERTA.

Related Work
With the development of e-commerce, Attribute Value Extraction, which retrieves attributes and extracts values from a target data resource to obtain structured product information, has recently attracted much attention. Several earlier methods (Zheng et al., 2018; Xu et al., 2019) employ traditional sequence tagging models. AVEQA (Wang et al., 2020) first applies an MRC-based method to the task, but it cannot be applied when an attribute has several different values. JAVE (Zhu et al., 2020) designs a multi-task model that divides the task into two sub-tasks: attribute prediction and value extraction. AdaTag (Yan et al., 2021) uses a hyper-network to train expert parameters for each attribute in order to build an adaptive decoder. QUEACO (Zhang et al., 2021) adopts a teacher-student network to leverage weakly labeled behavior data to improve performance. MAVEQA (Yang et al., 2022) mixes multi-source information with a novel global and local attention mechanism. However, none of the existing methods address the two challenges mentioned in Section 1.

Conclusion
In this paper, we presented COMAVE, a new pre-training model for attribute value extraction, pre-trained with three novel objectives on a large-scale corpus. The Multi-Scale Masked Language Model forces the model to understand multi-scale values by recovering masked spans at both the phrase and sentence levels. Contrastive Attribute Retrieval improves the discrimination of fine-grained attributes through contrastive learning. Meanwhile, Value Detection reinforces value extraction and further benefits downstream AVE tasks. Extensive experiments indicate that COMAVE achieves state-of-the-art results on four benchmarks compared with existing baselines and PLMs. In future work, we will extend our approach to more scenarios and industries, and explore optimizing the downstream fine-tuning models.

Limitations
This paper proposed COMAVE, a novel pre-training model aimed at textual AVE tasks, while multi-modal AVE tasks also widely exist on many e-commerce platforms. We expect that future work can leverage COMAVE as a powerful pre-trained text encoder combined with image feature representations for multi-modal AVE tasks. Meanwhile, like previous AVE works, we assume that each T is an independent extraction object, without considering context dependencies across whole data resources, such as long documents and instructions that exceed the allowable length of a single input.

Figure 1: An example of attribute value extraction in the insurance field. Each insurance clause contains multi-scale values and fine-grained similar attributes.

Figure 2: An overview of COMAVE. First, each text T randomly selects one of the masking mechanisms (phrase-level or sentence-level masking); meanwhile, based on the golden positive attributes, global sampling and clustering sampling are adopted to construct negative attributes. They are combined and input to COMAVE for encoding in ROBERTA form. Thereafter, the three objectives CAR, MSMLM, and VD are predicted separately.
To evaluate the contributions of each training objective, we considered the following settings:
• − MSMLM: removing the Multi-Scale Masked Language Model objective.
• MSMLM − PhraM: using only the sentence-level masking mechanism.
• MSMLM − SentM: using only the phrase-level masking mechanism.
• − CAR: removing the Contrastive Attribute Retrieval objective.
• CAR − MRL: replacing the Margin Ranking Loss L_CAR with the Cross-Entropy Loss.
• CAR − CS: not using cluster sampling in CAR, i.e., A− = A−_g.
• − VD: removing the Value Detection objective.

Figure 3: Detailed ablation tests of fine-grained attribute retrieval and multi-scale value extraction on INS. Here, accuracy is used as the evaluation metric.

Table 1: Statistics of the fine-grained attributes in INS. About 20% of the samples contain two or more attributes in the same group.

Table 2: Statistics of multi-scale values in INS.

Table 3: Overall results compared with existing baselines. Here, Attr, Val, and Over denote the Micro-F1 of attribute retrieval, value extraction, and the overall task, respectively.

Our proposed COMAVE equipped with the MRC layer achieves state-of-the-art results on all four benchmarks. Most baselines perform poorly on INS because they focus on traditional e-commerce.

Table 5: Overall ablation results on the four datasets.

We also compared COMAVE with other PLMs; the results are shown in Table 4. Here SpanBERT and MacBERT have no results on some datasets for lack of a corresponding language version. SpanBERT achieves almost the same results as ROBERTA with half the number of parameters because it excels at span representation. ELECTRA adopts creative adversarial pre-training and therefore performs well. Compared to the pre-training backbone ROBERTA, our further pre-trained COMAVE gains a significant improvement. Moreover, regardless of the simple fine-tuning layer used, our model outperforms all the other PLMs, which indicates that our pre-training effectively alleviates the challenges of AVE tasks.

Table 6: Results of few-shot tests on all datasets.

A universal pre-training model covering several extraction tasks by generation has also been proposed; it is generic but lacks further fitting to specific extraction tasks. Currently, there is no task-specific pre-training model for attribute value extraction.