Improving Factual Consistency of Abstractive Summarization on Customer Feedback

E-commerce stores collect customer feedback to let sellers learn about customer concerns and enhance the customer order experience. Because customer feedback often contains redundant information, a concise summary of the feedback can be generated to help sellers better understand the issues causing customer dissatisfaction. Previous state-of-the-art abstractive text summarization models make two major types of factual errors when producing summaries from customer feedback: wrong entity detection (WED) and incorrect product-defect description (IPD). In this work, we introduce a set of methods to enhance the factual consistency of abstractive summarization on customer feedback. We augment the training data with artificially corrupted summaries and use them as counterparts of the target summaries. We add a contrastive loss term to the training objective so that the model learns to avoid certain factual errors. Evaluation results show that a large portion of WED and IPD errors are alleviated for BART and T5. Furthermore, our approaches do not depend on the structure of the summarization model and thus are generalizable to any abstractive summarization system.


Introduction
In order to improve the customer order experience, most e-commerce stores allow customers to submit reviews or feedback via their post-order communication channels. Such customer feedback, usually in the form of short paragraphs of free text, contains information reflecting the issues that customers experienced with their purchases. This information can be shared with sellers to make them aware of the problems with their products. However, customer feedback often includes other content that is irrelevant to the product issues. Such redundant information requires extra effort for sellers to fully understand the customers' major concerns, and sometimes even causes confusion.

Source: (...) I ordered this mouse for my new laptop. However, when I received it, I could see many scratches on the product. It looks like it has been used before. (...)
Reference Summary: The mouse delivered has many scratches. It looks like it has been used.
Model Summary: The laptop came with many scratches, looks like it has been used.

Source: (...) I checked the serial number and found it doesn't match the one on the website. This phone is not defective. I question the source of this product (...)
Reference Summary: The phone serial number doesn't match the one on the website but the phone is not defective.
Model Summary: This phone is defective and the serial number doesn't match the one on the website.
Table 1: Examples of the two types of factual errors.

To reduce the redundancy, a concise summary of customer feedback can be provided in which the information is concentrated on the product issues while other irrelevant content is filtered out. Such a summary allows sellers to quickly capture and comprehend the problems, and thus they can address buyer dissatisfaction more efficiently.
The problem of generating summaries from customer feedback is modeled as a text summarization task (Nallapati et al., 2016; Allahyari et al., 2017) in the natural language processing (NLP) domain. Abstractive summarization models with transformer-based architectures have achieved success in a variety of summarization tasks (Raffel et al., 2020; Zhang et al., 2020; Bao et al., 2020). Hence, we harness the recent state-of-the-art (SOTA) abstractive summarization models, BART and T5 (Raffel et al., 2020), and fine-tune them for our specific summarization task. We aim to use summarization models to produce summaries that correctly describe the product issues presented in customer feedback. However, from human evaluation results, we observed that the summaries generated by these abstractive summarization models sometimes contain information that is inconsistent with the facts in the input text. Such factual inconsistencies have also been observed in previous studies (Cao et al., 2018; Kryscinski et al., 2019, 2020). More specifically, we analyzed 75 inconsistent summaries obtained from human evaluations of more than 600 model-generated summaries. We found that around 70% of the factually inconsistent summaries follow two error patterns: wrong entity detection (WED) and incorrect product-defect description (IPD). WED errors often occur when the feedback text involves multiple entities but the model fails to detect the primary entity. For IPD, the generated summary contains a product-defect description that contradicts the original description in the customer feedback. Table 1 shows examples of the two types of factual errors.
In this work, we propose a set of methods to improve the factual consistency of abstractive summarization on customer feedback. We first introduce specific factual errors into each target summary to generate its negative counterpart. We then use such pairs of consistent and inconsistent summaries, with a contrastive loss term added to the training objective, to enhance the model's robustness against the two major factual errors.
Our contributions are twofold. First, the proposed approaches of corrupted summary generation and contrastive loss augmentation pose no requirements on the architecture of the summarization model. Thus, they can be applied to any abstractive summarization model to improve its faithfulness. Second, we test the proposed approaches on SOTA summarization models such as BART and T5. Our approaches show large benefits in reducing the common factual errors in customer-feedback summarization.

Related Work
There has been increasing research attention on improving the factual consistency of abstractive summarization models. Much prior work has focused on different ways of adding external signals or constraints to enhance summary generation. Cao et al. (2018) built a dual-attention framework so that summary generation is conditioned on both the source document and extracted key information. Other work incorporated entailment knowledge by utilizing an entailment-aware encoder and decoder. Using textual entailment, Falke et al. (2019) re-ranked the candidate summaries to select the summary that is better aligned with the source document. Dou et al. (2020) studied different external signals, including key sentences, keywords, and relations, and used them in addition to the input text to guide summary generation. Mao et al. (2020) constrained certain tokens so that they are required to be present in the summary. Similarly, Yuan et al. (2020) added constraints on the model to include certain attribute words in product summarization. Zhu et al. (2021) integrated information extraction and a graph attention network into a transformer-based seq2seq framework.
To identify and correct unfaithful summaries, prior work proposed using a question answering framework to check the faithfulness of the summary, while other work built a factual correction model that leverages knowledge learned from question answering models. Kryscinski et al. (2020) trained a BERT-based model to classify whether the summary is factually consistent. Cao et al. (2020) and Zhu et al. (2021) developed factual correctors based on BART and UniLM (Dong et al., 2019) as post-processors to rectify factual errors from the upstream summarization model. They corrupted the reference summaries with artificial errors and used them as negative samples for training the correctors. In our work, we also generate corrupted summaries as the negative counterparts of the target summaries. The difference is that, instead of building a separate corrector model, we directly engineer the training objective of the summarization model. By leveraging contrastive learning (Schroff et al., 2015; Khosla et al., 2020), we define contrastive losses to guide the output summary away from certain factual errors.
Source: (...) I've bought cheese from this store for many times, and they were very good. So I think other products must be good too. Then I ordered several bottles of milk. But they are clearly expired (...)
Reference Summary: Milk delivered is expired.
Corrupted Summary: Cheese delivered is expired.

Source: (...) The eggs I purchased have bad smells. They don't look like fresh eggs. (...)
Reference Summary: Eggs have bad smells, and don't look like fresh eggs.
Corrupted Summary: Eggs have good smells, and don't look like fresh eggs.
Table 2: Examples of corrupted summaries. We replace the primary entity in the first example and switch the description in the second example.
adding the contrastive loss so as to guide the model to avoid those mistakes.

Synthetic Factual Errors
We augment the training data by applying two types of corruption to the target summary. The corruptions are designed to mimic the factual errors we observed. In the first method, we replace the named entities in the target summary with other random entities of the same type from the source document. If no such replacement entity can be found in the source document, we randomly pick one from the 50 most frequent entities in our dataset. We use the spaCy toolkit (Honnibal et al., 2020) for named entity extraction. In the second method, we use predefined rules to transform the product-defect description in the target summary. We detect the adjectives describing the product defect and switch their sentiment. We change the description in one of two ways. One is to add the negation word "not" before the adjective. For example, we alter "product is broken" to "product is not broken". If the word "not" is already present, we remove it instead. The other is to switch a descriptive word to one with the opposite meaning, such as changing "opened" to "sealed". Table 2 shows some examples of the corrupted summaries.
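The two corruption methods can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `ANTONYMS` table, and the assumption that entities and defect adjectives have already been extracted (for instance by a named entity recognizer) are all hypothetical.

```python
import random

# Hypothetical antonym table; the actual rule set is not given in the text.
ANTONYMS = {"opened": "sealed", "sealed": "opened", "bad": "good", "good": "bad"}


def swap_entity(summary, entity, source_entities, fallback_entities, rng):
    """Replace `entity` in the summary with a different entity.

    Candidates are assumed to be pre-extracted and of the same type.
    Entities from the source document are preferred; a dataset-wide
    list of frequent entities is the fallback.
    """
    candidates = [e for e in source_entities if e.lower() != entity.lower()]
    if not candidates:
        candidates = [e for e in fallback_entities if e.lower() != entity.lower()]
    if not candidates:
        return summary  # nothing available to swap in
    return summary.replace(entity, rng.choice(candidates), 1)


def flip_description(summary, adjective):
    """Flip the sentiment of the first occurrence of `adjective`."""
    words = summary.split()
    for i, word in enumerate(words):
        core = word.strip(".,")
        if core.lower() != adjective:
            continue
        if i > 0 and words[i - 1].lower() == "not":
            del words[i - 1]  # "is not broken" -> "is broken"
        elif adjective in ANTONYMS:
            words[i] = word.replace(core, ANTONYMS[adjective])  # "opened" -> "sealed"
        else:
            words.insert(i, "not")  # "is broken" -> "is not broken"
        return " ".join(words)
    return summary  # adjective not found; summary left unchanged
```

Passing a seeded `random.Random` instance as `rng` keeps the generated negatives reproducible across runs.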

Training Objective
For each training sample, we now have a triplet (d, s+, s−) consisting of the source document d, target summary s+, and corrupted summary s−. The summarization model takes d as the input and generates the output o. Our training objective is to drive the model output o to resemble s+ while at the same time avoiding the factual errors present in s−. Inspired by contrastive learning (Schroff et al., 2015; Khosla et al., 2020), we compare different contrastive loss functions for model training.
Direct Contrast
Compared to the ordinary loss function for summarization, we add an extra term that takes into account the information from the corrupted summary. In the resulting loss L_DC, L(s+, o) is the cross entropy loss between s+ and o, L(s−, o) is the cross entropy loss between s− and o, and α is a tunable hyperparameter controlling the impact of the second term. The loss function focuses purely on the difference between s+ and s− if α = 1.0. Thus, we generally use a small value for α to ensure that the model produces fluent summaries.
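The equation for this loss is not present in the text above. One form consistent with the description (the ordinary term plus an α-weighted term on the corrupted summary, reducing to a pure difference at α = 1.0) would be the following. The exact form and sign convention are assumptions.

```latex
\mathcal{L}_{DC} = L(s^{+}, o) - \alpha \, L(s^{-}, o)
```

Under this form, at α = 1.0 the tokens shared by s+ and s− contribute similarly to both terms and largely cancel, which would explain why a small α is needed to keep the ordinary summarization signal dominant.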
Constrained Negative
Here, we add a margin term M to constrain the value of L(s−, o) in the loss L_CN. For easy negatives with L(s−, o) > M, their effect is not taken into account during training, as the model can already confidently distinguish them from positive samples.
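The equation for this loss is also missing from the text. A hinge-style form that matches the description, in which the negative term vanishes once L(s−, o) exceeds the margin M, would be the following. This is an assumed reconstruction, not the authors' stated formula.

```latex
\mathcal{L}_{CN} = L(s^{+}, o) + \alpha \, \max\bigl(0,\; M - L(s^{-}, o)\bigr)
```

With this form, a negative whose loss is already above M contributes zero to the objective, so training concentrates on negatives the model still finds plausible.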

Constrained Contrast
We augment the ordinary loss function for summarization with a constrained contrastive term, which gives the loss L_CC. With this loss, the model is trained not only to predict the correct labels but also to deviate from certain factual errors extracted from the contrast between the negative and positive samples.
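As a rough sketch, the three objectives could be written as functions of the two cross entropy values. The equations themselves are not present in this text, so all three forms below are assumptions: a plain difference for direct contrast, a hinge on the negative loss for constrained negative, and a triplet-style margin for constrained contrast. The function names are hypothetical, and the default values of `alpha` and `margin` are the ones reported in the Model section.

```python
def direct_contrast(pos_loss, neg_loss, alpha=0.05):
    """Assumed L_DC: ordinary loss minus an alpha-weighted negative loss."""
    return pos_loss - alpha * neg_loss


def constrained_negative(pos_loss, neg_loss, alpha=0.5, margin=2.0):
    """Assumed L_CN: negatives with neg_loss > margin contribute nothing."""
    return pos_loss + alpha * max(0.0, margin - neg_loss)


def constrained_contrast(pos_loss, neg_loss, alpha=0.5, margin=5.0):
    """Assumed L_CC: triplet-style term on the gap between the two losses."""
    return pos_loss + alpha * max(0.0, margin + pos_loss - neg_loss)
```

Here `pos_loss` stands for L(s+, o) and `neg_loss` for L(s−, o), passed as plain floats. In a tensor framework, the `max(0.0, ...)` would be replaced by an elementwise clamp at zero so that gradients flow through the active branch.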

Dataset
We collected 10,000 samples of negative customer feedback from the post-order communication channels of e-commerce stores. We asked subject matter experts to write a summary for each customer feedback text, with emphasis on extracting the information related to product issues. The summary is required to contain (1) the primary item names and (2) descriptions of the product defects associated with the items, if they are present in the customer feedback. We use the human-produced summary as the target summary in model training. The train/test split ratio is 85:15.
Table 3: Impact of our approaches on ROUGE scores. The reported numbers are relative changes of ROUGE scores compared to the ordinary fine-tuned BART and T5 models, respectively.

Model
We use two recently proposed abstractive summarization models, BART and T5 (Raffel et al., 2020), for customer-feedback summarization. We adopt the pretrained models from the HuggingFace implementation and fine-tune them on our training dataset. Both models share the same training parameters, including a learning rate of 5e-5, α = 0.05 in L_DC, (α = 0.5, M = 2.0) in L_CN, and (α = 0.5, M = 5.0) in L_CC.

Evaluation metrics
We employ the ROUGE-1, ROUGE-2, and ROUGE-L scores (Lin, 2004) to ensure that our proposed methods do not degrade the fluency and continuity of the generated summaries. These ROUGE scores measure accuracy based on unigrams, bigrams, and longest common subsequences, respectively. We rely on human evaluation to examine the factual consistency of the model output. We ask human annotators to classify each generated summary as consistent or inconsistent based on whether it contains inaccurate or contradictory facts. We then compare the summary consistency before and after applying the proposed methods.
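To make the three metrics concrete, the following is a simplified sketch of ROUGE-N and ROUGE-L as F1 scores over whitespace tokens. It is an illustration only: the helper names are hypothetical, and it omits the stemming and tokenization rules of the official ROUGE package, so its numbers would not match reported scores exactly.

```python
from collections import Counter


def _ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def _f1(overlap, cand_total, ref_total):
    """Harmonic mean of precision and recall, 0.0 when undefined."""
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n=1):
    """F1 over clipped n-gram overlap (n=1 for ROUGE-1, n=2 for ROUGE-2)."""
    cand = _ngrams(candidate.lower().split(), n)
    ref = _ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())
    return _f1(overlap, sum(cand.values()), sum(ref.values()))


def rouge_l(candidate, reference):
    """F1 over the longest common subsequence of tokens."""
    c, r = candidate.lower().split(), reference.lower().split()
    prev = [0] * (len(r) + 1)  # LCS lengths for the previous candidate prefix
    for tok in c:
        cur = [0]
        for j, ref_tok in enumerate(r, start=1):
            cur.append(prev[j - 1] + 1 if tok == ref_tok else max(prev[j], cur[j - 1]))
        prev = cur
    return _f1(prev[-1], len(c), len(r))
```

Under this sketch, a candidate that differs from the reference in a single entity ("cheese delivered is expired" against "milk delivered is expired") still shares three of four unigrams with it. This illustrates why n-gram overlap alone cannot capture factual consistency and why human evaluation is used for that purpose.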

ROUGE Scores
We report the changes in ROUGE scores in Table 3. The results show that the models trained with our correction methods generally improve on the ROUGE scores of the original BART and T5 models. Higher scores imply that the summaries from the corrected models are better aligned with the target summaries. In addition, using L_CC as the loss function produces the highest ROUGE scores for both BART and T5. Thus, for human evaluation, we focus on the summaries produced by the models trained with L_CC.

Human Evaluation and Analysis
The human evaluation included 124 examples for BART and 600 examples for T5, all of which were randomly sampled from the test set. Table 4 shows the effect of our approaches on correcting the two major factual errors. As the results show, a large portion of the WED and IPD errors are corrected. Over 63% of the WED and 50% of the IPD mistakes from ordinary BART are rectified. For T5, our methods correct around 46% of the WED and 42% of the IPD errors. This implies that our models are more robust on the cases that can potentially lead to WED and IPD.
One remaining question is whether our approaches would degrade the originally faithful summaries. In Table 5, we report the percentage of cases where the summaries from the ordinary models are consistent but become inconsistent after using our methods. We can see that most of the summaries remain consistent with our models. Furthermore, our analysis shows that the overall number of inconsistent summaries is reduced by 44.1% for BART and 31.6% for T5, which indicates the effectiveness of our methods. Table 6 shows several input texts and summaries from the models before and after using our methods. In the first example, our model picks the correct entity from multiple entities in the source document, where the ordinary model fails. In the second example, the summary from the ordinary model contains a description that contradicts the source document, whereas our model captures the correct information.

Source: (...) I bought this expensive TV that's supposed to have good screen and built-in wifi connection. But this one runs with lots of lagging, not as advertised on the website. (...)
Original: Screen runs with lots of lagging, not as advertised.
After: TV runs with lots of lagging, not as advertised.

Source: (...) The packaging is heavily damaged and opened, though the product inside is not broken. The seller should be careful on the packaging next time (...)
Original: The packaging is heavily damaged and opened. Product is broken.
After: The packaging is heavily damaged and opened. The product inside is not broken.
Table 6: Examples of error corrections using our methods.

Conclusion
In conclusion, we study the error patterns in the customer-feedback summaries generated by BART and T5. We propose to augment the training data with artificially corrupted summaries and to use contrastive learning methods to enhance model faithfulness. Human analysis shows that a significant portion of the WED and IPD errors from BART and T5 are reduced. Because our methods do not involve modifying the model structure, they can also be applied to other abstractive summarization frameworks.