Separating Context and Pattern: Learning Disentangled Sentence Representations for Low-Resource Extractive Summarization



Introduction
The goal of text summarization is to generate a concise summary of a source document that covers the crucial information conveyed in the source text. In this paper, we focus on extractive summarization, which produces summaries by selecting and combining salient sentences taken directly from the source text.
It is widely agreed that extractive summarization mainly relies on context information to select important sentences. Meanwhile, other factors can also be used to identify these sentences, such as sentence position or certain n-gram tokens. As shown in Figure 1, in the news summarization dataset CNN/DailyMail, lead sentences have a much higher probability of becoming crucial sentences. Meanwhile, Table 1 shows that sentences with certain n-gram tokens like "in this paper" or "we find that" are also considered important in scientific paper summarization.

Table 1:
CNN/DM: "( cnn )" 21k; "according to the" 3.5k; "the first time" 2.4k; "the end of" 1.3k
arXiv: "in this paper" 11k; "as a function" 6.4k; "in the case" 4.8k; "we find that" 3.7k

Here, we collectively call these factors pattern information, since they are context-independent and can determine sentence importance solely by themselves. However, as Figure 1 and Table 1 display, pattern information varies from dataset to dataset. Such information is therefore only effective in its corresponding dataset or domain and cannot be generalized like context information. Although both context information and pattern information are crucial for the task, it is hard to tell whether the improvement of current extractive summarization models stems from a better understanding of the context information or from overfitting the pattern information of specific data. Hence, existing models may fail to achieve good performance when transferred to other domains or to datasets with limited data, due to the intermingling of domain-specific pattern information.
In this paper, we aim to apply disentangled representation learning to extractive summarization and separate the two key factors for the task, context information and pattern information, for better generalization ability in low-resource settings (zero-shot and few-shot). Our model is built on a pretraining-based extractive summarization model (Liu and Lapata, 2019) that uses BERT to encode each sentence with its context into a latent representation. We would like the latent representation to be disentangled with respect to the context and pattern information. Following previous works (John et al., 2018; Cheng et al., 2020), we combine multitask objectives with adversarial objectives/mutual information (MI) minimization objectives to accomplish this. The multitask objectives encourage the two latent spaces to learn their corresponding information. For the context information, we propose to approximate it by predicting the high-frequency non-stop words appearing in a sentence and its context. For the pattern information, we divide it into two parts: the position pattern feature and the n-gram pattern feature. The former can be transformed into a sentence position prediction problem, while the latter is approximated by predicting whether the target sentence contains any high-frequency n-gram patterns. Then we try two commonly used disentangled representation learning approaches, adversarial objectives and MI minimization objectives, to further ensure the independence between the two latent spaces.
After the model is trained on a source dataset, it can be transferred to a target dataset for low-resource extractive summarization. In the zero-shot setting, we only utilize the context representation to do the extractive summarization. In the few-shot setting, we choose to fine-tune the pattern-related parameters with a few training instances to automatically select useful patterns for the target dataset.
To evaluate our proposed model, we conduct experiments on three datasets from different domains: CNN/DailyMail from the news summarization domain, arXiv from the scientific article summarization domain, and QMSum from the dialogue summarization domain. These experiments demonstrate the effectiveness of our model in disentangling context and pattern information.

Text Summarization
Extractive summarization is an important sub-topic of text summarization. Early works (Nallapati et al., 2017; Narayan et al., 2018; Zhou et al., 2018; Zhang et al., 2018) formulated it as a sentence binary classification problem and extended it with different techniques. With the development of pretrained models, using a transformer-based pretrained model as the encoder (Liu and Lapata, 2019; Bae et al., 2019; Zhang et al., 2019) led to a huge improvement in the task. Recently, MATCHSUM (Zhong et al., 2020) achieved state-of-the-art performance by combining contrastive learning with extractive summarization. These models mainly focus on improving performance on a certain dataset or domain. Research on low-resource text summarization is also increasing. AdaptSum (Yu et al., 2021) proposes a pre-train-then-fine-tune strategy for low-resource domain adaptation in abstractive summarization. Other researchers (Fabbri et al., 2020) present a similar idea but further enhance it with a data augmentation method using the large corpus from Wikipedia. Zhao et al. (2022) combine domain words and a prompt-based language model to achieve zero-shot domain adaptation in abstractive dialogue summarization. In this work, we aim to explore low-resource extractive summarization by disentangling context and pattern information.

Disentanglement Representation Learning
Disentangled representation learning was first explored in computer vision to disentangle features such as color or rotation. Recently, a growing amount of work has investigated learning disentangled representations in NLP tasks. Early works (Hu et al., 2017; Shen et al., 2017; John et al., 2018) follow a similar idea and apply disentangled representation learning to style/sentiment transfer. Later, researchers further extended its application to different topics such as cross-lingual transfer (Wu et al., 2022), negation and uncertainty learning (Vasilakes et al., 2022), and fair classification (Park et al., 2021). Generally, there are three main types of approaches for disentangled representation learning. A common approach (John et al., 2018) is to add an adversary that competes against the encoder, forcing it to avoid learning certain types of attributes. Another approach (Cheng et al., 2020; Colombo et al., 2021) adopts mutual information theory and attempts to minimize a mutual information upper bound between the two disentangled representations. Recently, some researchers (Colombo et al., 2022) proposed a simpler approach that adds a set of regularizers to achieve disentangled representation learning. Similar to cross-lingual transfer, in this work we also aim to apply disentangled representation learning to domain transfer, but in the context of extractive summarization.

Problem Statement
In this work, we disentangle the sentence representation for extractive summarization into two parts: context representation and pattern representation.
To achieve this, we need to satisfy the following requirements for an effective disentanglement.
• The context and pattern representations need to be able to predict sentence importance and contribute to extractive summarization.
• The context and pattern representations should be predictive of their corresponding ground-truth information. For example, the pattern representation of a sentence can predict its pattern features, such as its position.
• The context and pattern representations should lie in independent vector spaces, and one representation cannot predict the ground-truth information corresponding to the other.

Extractive Summarization Model
Given an input document containing n sentences x = {s_1, s_2, ..., s_n}, we adopt BERT to generate contextualized representations for each sentence.
Since the output of BERT is grounded to tokens, we use a strategy similar to Liu and Lapata (2019) to modify the input sequence of BERT. We insert a [cls] token at the beginning of each sentence and use the embedding of the [cls] token to represent its corresponding sentence. Considering that our goal is to disentangle it into context and pattern representations, we add two additional multilayer perceptrons (MLP) that map the sentence representations generated by BERT to context representations c and pattern representations p. Here, we collectively call the BERT and the two MLP mappers the encoder E.
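The input construction described above can be sketched as follows; this is a minimal illustration with a hypothetical whitespace tokenizer, whereas the actual model uses BERT's subword tokenizer:

```python
def build_input(sentences):
    """Prepend a [CLS] token to every sentence and append [SEP], in the
    style of Liu and Lapata (2019). Returns the flat token list and the
    indices of the [CLS] tokens, whose embeddings later represent each
    corresponding sentence."""
    tokens, cls_positions = [], []
    for sent in sentences:
        cls_positions.append(len(tokens))  # where this sentence's [CLS] lands
        tokens.append("[CLS]")
        tokens.extend(sent.split())        # whitespace split stands in for subword tokenization
        tokens.append("[SEP]")
    return tokens, cls_positions
```

For example, a two-sentence document yields one [CLS] per sentence, and the two MLP mappers would then be applied to the BERT embeddings at those positions.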
Then a sigmoid classifier F_ext takes the concatenation of both representations as input to predict a score y^e_i for sentence s_i, and the loss of the whole model is the binary classification loss of y^e_i against the gold label t^e_i. Note that the gold label refers to the one-hot distribution of the oracle sentences (the sentence set that has the highest similarity with the reference summary). The loss is shown in the following:

l_ext = - Σ_{i=1}^{n} [ t^e_i log y^e_i + (1 - t^e_i) log(1 - y^e_i) ]

This classification loss serves as our primary training objective for extractive summarization. Meanwhile, to better utilize the context representation and pattern representation in the low-resource setting, we expect the two disentangled representations to be able to do extractive summarization independently. Hence, we add two similar classifiers that directly take the context representation or the pattern representation as input; their losses are denoted as l_ext(c) and l_ext(p). Note that the gradients of the two classifiers are detached from the main model.
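The binary classification objective above is a standard binary cross-entropy over per-sentence scores; a minimal numeric sketch (plain Python, not the actual training code):

```python
import math

def bce_loss(y_pred, t_gold):
    """Binary cross-entropy of predicted sentence scores y^e_i against
    one-hot oracle labels t^e_i, averaged over sentences."""
    eps = 1e-12
    total = 0.0
    for y, t in zip(y_pred, t_gold):
        y = min(max(y, eps), 1 - eps)  # clip to avoid log(0)
        total += t * math.log(y) + (1 - t) * math.log(1 - y)
    return -total / len(y_pred)
```

The same form is reused for the two auxiliary classifiers (l_ext(c) and l_ext(p)), only with different inputs.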

Learning Context Representation
The context representation c is expected to do extractive summarization using the context information. In addition to the extractive summarization loss, we add a multitask objective to ensure the context information is contained in it. The question that lies ahead is how to define what "context" actually refers to. A widely accepted idea is that the effective context information in extractive summarization consists of salient words/phrases that repeat multiple times in the context. Inspired by this, given a sentence s_i, we propose to approximate the context information by predicting the non-stop words existing in both s_i and its adjacent sentences. The distribution of these words over the vocabulary is considered the context feature t^c_i for s_i. We build a two-layer MLP classifier F_mul(c) on the context representation c to predict the context feature, and the classifier is trained with a cross-entropy loss against the ground-truth distribution:

l_mul(c) = - Σ_{i=1}^{n} Σ_{j=1}^{|voc|} t^c_{i,j} log y^c_{i,j}

where voc stands for the vocabulary and y^c_i = F_mul(c)(c_i) is the predicted context feature.
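A minimal sketch of this context-feature construction, under the assumption that "adjacent" means a window of one sentence on each side (the paper does not fix the window size), with whitespace tokenization standing in for real preprocessing:

```python
def context_feature(sentences, i, stopwords, vocab):
    """Approximate the context feature t^c_i for sentence i: a normalized
    vocabulary distribution over the non-stop words shared between s_i and
    its adjacent sentences."""
    own = set(sentences[i].lower().split()) - stopwords
    neighbours = set()
    for j in (i - 1, i + 1):                     # one sentence on each side (assumption)
        if 0 <= j < len(sentences):
            neighbours |= set(sentences[j].lower().split())
    shared = own & neighbours                    # words repeated in the local context
    dist = [1.0 if w in shared else 0.0 for w in vocab]
    total = sum(dist)
    return [v / total for v in dist] if total else dist
```

The resulting distribution is the target the F_mul(c) classifier is trained to predict from c_i.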

Learning Pattern Representation
The pattern representation p needs to predict both sentence importance and pattern-related features.
In this paper, we mainly focus on two types of pattern, the position pattern and the n-gram pattern, which contribute the most to extractive summarization. The position pattern refers to the position of a sentence in the document, which plays an important role in news article summarization. We add a multitask objective that predicts the position of a sentence. In this case, the position pattern feature t^o_i is a one-hot vector whose length equals the number of sentences. The n-gram pattern is another crucial factor that influences sentence importance; it represents the expressions/phrases that are commonly used in summaries. Inspired by (Salkar et al., 2022), we count the frequencies of all n-grams that appear in the oracle sentences and select the top 500 as the n-gram pattern set. The goal of the pattern representation is to predict whether a sentence contains any pattern from the pattern set, which is a binary classification problem.
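The n-gram pattern mining step can be sketched as follows; the choice of n and the whitespace tokenization are illustrative assumptions, since the paper only specifies keeping the top 500 n-grams:

```python
from collections import Counter

def ngram_pattern_set(oracle_sentences, n=3, top_k=500):
    """Count all n-grams over the oracle sentences and keep the top_k most
    frequent ones as the n-gram pattern set."""
    counts = Counter()
    for sent in oracle_sentences:
        toks = sent.lower().split()
        for j in range(len(toks) - n + 1):
            counts[tuple(toks[j:j + n])] += 1
    return {gram for gram, _ in counts.most_common(top_k)}

def has_pattern(sentence, patterns, n=3):
    """Binary n-gram pattern feature t^p_i: 1 if the sentence contains any
    pattern from the set, else 0."""
    toks = sentence.lower().split()
    grams = {tuple(toks[j:j + n]) for j in range(len(toks) - n + 1)}
    return int(bool(grams & patterns))
```

The binary output of has_pattern is the target label for the n-gram pattern classifier.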
Similarly, we also use two MLP classifiers on the pattern representation p to predict the pattern-related features:

l_mul(p) = - Σ_{i=1}^{n} [ t^p_i log y^p_i + (1 - t^p_i) log(1 - y^p_i) ]

l_mul(o) = - Σ_{i=1}^{n} Σ_{j=1}^{n} t^o_{i,j} log y^o_{i,j}

where y^p_i = F_mul(p)(p_i) is the predicted n-gram pattern feature and y^o_i = F_mul(o)(p_i) is the predicted position pattern feature.

Learning Disentangled Representation
Although the multitask objectives assist the model in learning context and pattern information in different latent spaces, they are not effective enough to ensure the independence between c and p. As shown in Figure 2, we adopt two commonly used objectives for learning disentangled representations in this paper.

Adversarial Objective Considering that one representation should be predictive of its corresponding information only, following (John et al., 2018), we add adversarial classifiers that try to predict, on each latent space, the information related to the other one, and the model is forced to structure the latent spaces such that the outputs of these adversarial classifiers are non-predictive. The adversarial objective is composed of two parts. The first part is the adversarial classifiers on each latent space for each type of non-target information. The second part is the adversarial loss, which aims to maximize the entropy of the predicted distribution of the adversarial classifiers.
Taking the adversarial objective on the pattern space as an example, we train a two-layer MLP classifier, the context discriminator F_dis(c), to predict whether the pattern space contains any context information. One thing worth noticing is that the gradients of these classifiers are not back-propagated to the encoder. In this case, the training of the context discriminator will not influence the encoder. Similar to equations (3) and (5), a cross-entropy loss is used, but with different inputs and parameters:

l_dis(c) = - Σ_{i=1}^{n} Σ_{j=1}^{|voc|} t^c_{i,j} log y^c_{i,j}

where y^c_i = F_dis(c)(p_i) refers to the context feature predicted from the pattern representation.
Then an adversarial loss is used to maximize the entropy of the output of the context discriminator:

l_adv(c) = Σ_{i=1}^{n} Σ_{j=1}^{|voc|} y^c_{i,j} log y^c_{i,j}

Here, we only train the encoder with this adversarial loss; the parameters of the context discriminator are excluded.
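Since maximizing the entropy H of a distribution equals minimizing its negative, the adversarial loss on the encoder is the (summed) negative entropy of the discriminator's output; a minimal numeric sketch:

```python
import math

def adversary_entropy_loss(probs):
    """Negative entropy of the discriminator's predicted distribution.
    Minimizing this with respect to the encoder only (discriminator
    parameters excluded) pushes the discriminator output toward uniform,
    i.e. non-predictive."""
    eps = 1e-12
    return sum(p * math.log(max(p, eps)) for p in probs)  # equals -H(probs)
```

A uniform output gives the lowest possible loss, which is exactly the "non-predictive" state the objective rewards.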
We also impose an n-gram pattern discriminator and a position pattern discriminator to disentangle the pattern information from the context space. These two adversarial objectives follow nearly the same form as the one above, and their corresponding losses are denoted as l_dis(p), l_dis(o), l_adv(p) and l_adv(o).

MI Minimization Objective Mutual information (MI) is a natural measure of the dependence between two variables. Inspired by previous work (Cheng et al., 2020), minimizing an upper-bound estimate of the MI between the two latent spaces is an effective way to disentangle them. Following the Contrastive Log-ratio Upper Bound (CLUB) estimate of the MI (Cheng et al., 2020), we first train a neural network M that aims to estimate the pattern representation by taking the context representation as input:

l_map = Σ_{i=1}^{n} kl( p(p_i | c_i) || q_M(p_i | c_i) )

where kl stands for the Kullback-Leibler divergence. Just like the discriminator in the adversarial objective, we fix the parameters of the encoder when we train the neural network M with this loss.
We then minimize the mutual information between the two latent spaces by minimizing the following sampled estimate:

l_mi = (1/n) Σ_{i=1}^{n} [ log q_M(p_i | c_i) - log q_M(p_k | c_i) ]

where k is selected uniformly from the indices {1, ..., n}. Here, the optimization is only performed on the parameters of the encoder E.
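A toy sketch of the sampled CLUB estimate, under the common assumption that q_M(.|c_i) is a unit-variance Gaussian whose mean is produced by the trained estimator M (here mu[i] stands in for M(c_i)):

```python
import random

def club_mi_upper_bound(mu, p, seed=0):
    """Sampled CLUB estimate of I(c; p): the mean over i of
    log q_M(p_i | c_i) minus log q_M(p_k | c_i) for a uniformly drawn
    negative index k, with q_M a unit-variance Gaussian centered at mu[i]."""
    rng = random.Random(seed)
    n = len(p)

    def log_q(x, m):  # log N(x; m, I), dropping the additive constant
        return -0.5 * sum((xi - mi) ** 2 for xi, mi in zip(x, m))

    total = 0.0
    for i in range(n):
        k = rng.randrange(n)  # uniform negative sample
        total += log_q(p[i], mu[i]) - log_q(p[k], mu[i])
    return total / n
```

When M predicts p from c well (mu close to the paired p), positive terms dominate negative ones and the estimate is large; the encoder is then trained to drive it down.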

Training Strategy
The loss of our model mainly consists of two parts: the losses that update the discriminators (for the MI objective, the network M) and the main loss (all the other losses). In the training process, for each batch, we first optimize the discriminators with l_dis(c), l_dis(p) and l_dis(o) under a weight λ_dis (for the MI objective, with l_map), and then optimize the encoder and all other classifiers with the main loss. The main loss L_all for our model comprises three types of terms: the extractive summarization objectives, the context/pattern feature learning objectives, and the adversarial objectives (for the MI objective, l_mi), given by

L_all = l_ext + l_ext(c) + l_ext(p) + λ_mul (l_mul(c) + l_mul(p) + l_mul(o)) + λ_adv (l_adv(c) + l_adv(p) + l_adv(o))

The checkpoint selection strategy and hyperparameter search are also crucial for model training. Considering that the goal of our model is to effectively utilize the context information on the target dataset rather than to achieve the best performance on the source dataset, we follow two rules: (1) the disentanglement is successful (based on the training log); (2) we select the checkpoint with the best performance when using the context representation on the validation set. In the experiments, the weights are λ_mul = 1, λ_adv = 1 and λ_dis = 3.
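The combination of the main-loss terms can be sketched as follows, with the weights defaulting to the values used in the paper (λ_mul = λ_adv = 1):

```python
def main_loss(l_ext, l_ext_c, l_ext_p, l_mul, l_adv, lam_mul=1.0, lam_adv=1.0):
    """Combine the main-loss terms: the three extractive objectives, the
    multitask feature objectives (l_mul, a list of three terms), and the
    adversarial objectives (l_adv; for the MI variant this would be the
    single l_mi term instead)."""
    return (l_ext + l_ext_c + l_ext_p
            + lam_mul * sum(l_mul)
            + lam_adv * sum(l_adv))
```

Each batch then takes two optimizer steps: one on the discriminators with their own losses (weighted by λ_dis), one on the encoder and classifiers with this combined loss.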

Application in Low-Resource Setting
After we train the model on a source dataset, we can transfer it to a target dataset with limited data. Considering that the pattern information from the source dataset may be misleading on a target dataset, we use the context representation to do the extractive summarization in the zero-shot setting. As for the few-shot setting, the data samples from the target dataset give the model a chance to quickly adjust its pattern information. In this case, we choose to fine-tune the pattern-related parameters with the given samples to select useful patterns for the target dataset.

We adopt Rouge (Lin, 2004), including Rouge-1 (R-1), Rouge-2 (R-2), and Rouge-L (R-L), as our evaluation metrics. In practice, we use the python wrapper pyrouge to apply the classic Rouge 1.5.5.
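The few-shot fine-tuning step amounts to freezing all but the pattern-related parameters; a sketch with hypothetical parameter-name prefixes (the actual names depend on the implementation):

```python
def split_parameters(named_params, pattern_prefixes=("pattern_mlp.", "pattern_classifier.")):
    """Split (name, parameter) pairs for few-shot fine-tuning: only
    pattern-related parameters are updated, everything else (BERT, the
    context mapper and classifiers) stays frozen. The prefixes are
    illustrative names for the pattern MLP mapper and its classifiers."""
    trainable, frozen = [], []
    for name, _param in named_params:
        (trainable if name.startswith(pattern_prefixes) else frozen).append(name)
    return trainable, frozen
```

Only the names in the trainable list would be handed to the optimizer during few-shot adaptation.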

Comparison
We compare our method with some commonly used baselines and previous state-of-the-art methods designed for low-resource text summarization.
There are three types of methods: unsupervised baselines, comparable models based on domain transfer or pretraining, and other reference models that are not directly comparable.
Unsupervised Baselines Lead-n selects the leading sentences of the document as the summary; it is a strong baseline on news summarization datasets that heavily rely on position pattern information, such as CNN/DailyMail. We also show the results of two strong unsupervised baselines, TextRank (Mihalcea and Tarau, 2004) and LexRank (Erkan and Radev, 2004).
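For reference, the Lead-n baseline is trivial to implement:

```python
def lead_n(sentences, n=3):
    """Lead-n baseline: the summary is simply the first n sentences of the
    document (strong on lead-biased news data such as CNN/DailyMail)."""
    return sentences[:n]
```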
Comparable Models AdaptSum (Yu et al., 2021) focuses on one-to-one domain adaptation in text summarization. It proposes a Source Domain Pre-Training (SDPT) strategy that first fine-tunes a pretrained model on the source domain and then applies it to the target domain. Fabbri et al. (2020) propose a similar method and further extend it with a data augmentation method. However, this data augmentation method requires pattern information from the target dataset and is therefore not comparable with our model.
Other Reference Models We display the result of BERTSum (Liu and Lapata, 2019) trained on the full target dataset, which can be considered an upper bound for our model.

Experiment Results
Zero-shot application We first evaluate the performance of our model in the zero-shot setting in Table 3 and Table 4, where the information of the target dataset is totally unknown. Here, we display two variants of the model: Our_adv, using the adversarial objective, and Our_mi, adopting the MI minimization objective. Based on the results, we have the following observations. Firstly, Our_adv achieves the best result in most cases. This indicates the effectiveness of context information in the zero-shot setting. Meanwhile, we also observe that Our_mi obtains lower performance compared to Our_adv. Further investigation of the training process shows that the MI minimization objective has more difficulty disentangling pattern and context information. We think the reason is that the two types of information are not naturally disentangled and are optimized by the same extractive summarization objectives. In this case, the model requires clearer guidance to achieve the disentanglement.

Table 5: The results on CNN/DM when using context/pattern representation.
Figure 3: The predicted sentences position distribution on arXiv when using context/pattern representation.

Analysis of context and pattern information
To understand the influence of both context and pattern information on the target dataset, we compare the performance of using the context representation, using the pattern representation, and using both representations in Table 5. Considering the huge gap in patterns between the two datasets, it is not surprising that using the pattern representation achieves the worst result. Meanwhile, its misleading information also pulls down the results of using both representations. We also display the position distribution of the sentences extracted on arXiv using the model trained on CNN/DM in Figure 3. Since CNN/DM is known for its lead bias, the pattern latent space learned on it inevitably tends to select lead sentences. This trend dominates when using both representations. When using the context representation alone, the lead bias is relatively weaker.

Few-shot application Directly using the pattern information on an unsuitable dataset decreases model performance. However, this does not mean the pattern representation is completely useless. In the few-shot setting, we can obtain some information from the target dataset and fine-tune the pattern latent space. To simulate this situation, for each target dataset, we build its few-shot version by randomly taking 50 data samples from its original training set and splitting them into 25 training and 25 validation samples. Here, besides our proposed model and AdaptSum, we also show the result of directly fine-tuning a BERTSum model on the limited data. In Table 6, the performance of all models improves with the help of the limited data, while the gap between Our_adv and AdaptSum still exists. This shows our model is capable of selecting the effective pattern information for the target dataset while preserving its advantages in context information.
Ablation study We further conduct an ablation study.Firstly, we remove the adversary objectives from our model (-adv loss), which means the model can only learn the disentangled representation by approximating context/pattern features.
Then we further remove the multitask objectives (-aux loss). In this case, the main difference between this model and AdaptSum is that our classifier contains more parameters. Here we compare the results of only using context representations in the zero-shot setting. As shown in Table 7, we find that removing the adversary objectives leads to a clear performance drop. This suggests that the multitask objectives alone are far from enough to disentangle the context and pattern information. We also find that the result of the "-aux loss" model is similar to the result of AdaptSum in Table 4, which shows the improvement of our model is not brought by the additional parameters.

Visualization
To have a more direct observation, we visualize the context and pattern representations by using the t-SNE algorithm (Van der Maaten and Hinton, 2008) to reduce them to two dimensions in Figure 4. These representations are taken from 1000 sentences.

Conclusion
In this paper, we propose a novel extractive summarization model that aims to improve generalization ability in low-resource settings. It disentangles the sentence representation into a context representation and a pattern representation, and utilizes the context information to reduce the influence of domain-specific pattern information during model transfer. The experiments demonstrate our model's ability to disentangle, and they also support the claim that context information tends to generalize better to datasets from different domains. In the future, we plan to extend this idea by learning a more generalized context latent space from multiple summarization datasets.

Limitations
Firstly, we adopt two representative types of pattern information, the position pattern and the n-gram pattern, but this does not mean they cover all effective pattern information. How to efficiently include all types of pattern information remains an important problem. Secondly, we did not deeply investigate the influence of different feature forms (pattern feature and context feature) on the multitask objectives. Thirdly, due to limitations of time and paper length, we only evaluate our method in three representative domains.

Ethics Statement
Our experimental datasets, CNN/DailyMail, arXiv, and QMSum, are well-established and publicly available. Dataset construction and annotation are consistent with the intellectual property and privacy rights of the original authors. The scientific artifacts we used are available for research with permissive licenses, including ROUGE and Transformers from HuggingFace, and our use of these artifacts is consistent with their intended use. The task of our work is a classic NLP task, text summarization. Considering that all the datasets are publicly available, we believe there are no potential risks associated with this work.

Figure 1 :
Figure 1: Comparison of the position distribution of oracle sentences in the news summarization dataset CNN/DailyMail and the science paper summarization dataset arXiv. The X-axis refers to sentence positions 1 to 100 and the Y-axis represents their proportion.
Figure 2: The framework of our proposed model. Part (1) shows the model based on the adversarial objective, while part (2) displays the one based on the MI minimization objective. The blue blocks refer to different sentence representations, the green blocks stand for the model components, and the yellow blocks represent the target features. The solid lines represent normal classification losses, and the dashed lines stand for the discriminator loss plus the adversarial loss.

Figure 4 :
Figure 4: Visualization of context and pattern representations.

Table 1 :
Examples of high-frequency n-grams in oracle sentences from CNN/DailyMail and arXiv.

Table 2 :
The statistics and comparison of the datasets.

Table 3 :
The results of models trained on CNN/DM in the zero-shot setting.

Table 4 :
The results of models trained on arXiv in the zero-shot setting.

Table 6 :
The results on arXiv and CNN/DM in the few-shot setting.

Table 7 :
The ablation study in the zero-shot setting.