Active Learning for Abstractive Text Summarization

Construction of human-curated annotated datasets for abstractive text summarization (ATS) is very time-consuming and expensive because creating each instance requires a human annotator to read a long document and compose a shorter summary that preserves the key information of the original document. Active Learning (AL) is a technique developed to reduce the amount of annotation required to achieve a certain level of machine learning model performance. In information extraction and text classification, AL can reduce the required annotation effort several-fold. Despite its potential for easing expensive annotation, to our knowledge, there have been no effective AL query strategies for ATS. This stems from the fact that many AL strategies rely on uncertainty estimation, while, as we show in our work, uncertain instances are usually noisy, and selecting them can degrade the model performance compared to passive annotation. We address this problem by proposing the first effective query strategy for AL in ATS based on diversity principles. We show that, given a certain annotation budget, using our strategy for AL annotation helps to improve the model performance in terms of ROUGE and consistency scores. Additionally, we analyze the effect of self-learning and show that it can further increase the performance of the model.


Introduction
Abstractive text summarization (ATS) aims to compress a document into a brief yet informative and readable summary that retains the key information of the original document. State-of-the-art results in this task are achieved by neural seq2seq models (Lewis et al., 2020; Zhang et al., 2020; Qi et al., 2020; Guo et al., 2021; Liu and Liu, 2021) based on the Transformer architecture (Vaswani et al., 2017). Training a model for ATS requires a dataset that contains pairs of original documents and their short summaries, which are usually written by human annotators. Manually composing a summary is a very tedious task: one has to read a long original document, select the crucial information, and finally write a short text. Each of these steps is very time-consuming, which makes constructing each instance of an annotated corpus for text summarization very expensive.
Active Learning (AL; Cohn et al., 1996) is a well-known technique that helps to substantially reduce the amount of annotation required to achieve a certain level of machine learning model performance. For example, in tasks related to named entity recognition, researchers report annotation reduction by 2-7 times with a loss of only 1% of the F1-score (Settles and Craven, 2008a). This makes AL especially important when annotation is expensive, which is the case for ATS.
AL works iteratively: on each iteration, (1) a model is trained on the data annotated so far; (2) the model is used to select informative instances from a large unlabeled pool using a query strategy; (3) the informative instances are presented to human experts, who provide gold-standard annotations; (4) finally, the annotated instances are added to the labeled dataset, and a new iteration begins. Traditional AL query strategies are based on uncertainty estimation techniques (Lewis and Gale, 1994; Scheffer et al., 2002). The hypothesis is that the instances most uncertain for the model trained on the current iteration are informative for training the model on the next iteration. We argue that uncertain predictions of ATS models (uncertain summaries) are no more useful than randomly selected instances. Moreover, they usually introduce noise and are detrimental to the performance of summarization models. Therefore, the uncertainty-based approach cannot be straightforwardly adapted to AL in text summarization.
In this work, we present the first effective query strategy for AL in ATS, which we call in-domain diversity sampling (IDDS). It is based on the idea of selecting diverse instances that are semantically dissimilar from the already annotated documents but, at the same time, similar to the core documents of the considered domain. Our empirical investigation shows that while techniques based on uncertainty cannot overcome the random sampling baseline, IDDS substantially increases the performance of summarization models. We also experiment with the self-learning technique, which expands the training dataset with summaries automatically generated by an ATS model trained only on the human-annotated data. This approach shows improvements when one needs to generate short summaries. The code for reproducing the experiments is available online. The contributions of this paper are the following:

• We propose the first effective AL query strategy for ATS that beats the random sampling baseline.
• We conduct a vast empirical investigation and show that in contrast to such tasks as text classification and information extraction, in ATS, uncertainty-based AL query strategies cannot outperform the random sampling baseline.
• To our knowledge, we are the first to investigate the effect of self-learning in conjunction with AL for ATS and demonstrate that it can substantially improve results on the datasets with short summaries.

Related Work
Abstractive Text Summarization. The advent of seq2seq models (Sutskever et al., 2014), along with the development of the attention mechanism (Bahdanau et al., 2015), consolidated neural networks as the primary tool for ATS. The attention-based Transformer (Vaswani et al., 2017) architecture has formed the basis of many large-scale pre-trained language models that achieve state-of-the-art results in ATS (Lewis et al., 2020; Zhang et al., 2020; Qi et al., 2020; Guo et al., 2021). Recent efforts in this area mostly focus on minor modifications of the existing architectures (Liu and Liu, 2021; Aghajanyan et al., 2021; Liu et al., 2022).
Active Learning in Natural Language Generation. While many recent works leverage AL for text classification or sequence-tagging tasks (Yuan et al., 2020; Zhang and Plank, 2021; Shelmanov et al., 2021; Margatina et al., 2021), little attention has been paid to natural language generation tasks. Among the works in this area, it is worth mentioning (Haffari et al., 2009; Ambati, 2012; Ananthakrishnan et al., 2013). These works focus on neural machine translation (NMT) and suggest several uncertainty-based query strategies for AL. Peris and Casacuberta (2018) successfully apply AL in interactive machine translation. Liu et al. (2018) exploit reinforcement learning to train a policy-based query strategy for NMT. Although there has been an attempt to apply AL to ATS (Gidiotis and Tsoumakas, 2021), to the best of our knowledge, there is no published work on this topic yet.

Uncertainty Estimation in Natural Language Generation. A simple yet effective approach to uncertainty estimation in generation is proposed by Wang et al. (2019). They use a combination of the expected translation probability and the variance of the translation probability, demonstrating that it handles noisy instances better and noticeably improves the quality of back-translation. Malinin and Gales (2021) investigate ensemble-based measures of uncertainty for NMT. Their results demonstrate the superiority of these methods for OOD detection and for identifying low-quality generated translations. Xiao et al. (2020) propose a method for uncertainty estimation over long sequences of discrete random variables, which they dub "BLEU Variance", and apply it to OOD sentence detection in NMT. It has also been shown to be useful for identifying instances of questionable quality in ATS (Gidiotis and Tsoumakas, 2022). In this work, we investigate these uncertainty estimation techniques in AL and show that they do not provide any benefits over annotating randomly selected instances.
Diversity-based Active Learning. Along with the uncertainty-based query strategies, a series of diversity-based methods have been suggested for AL (Kim et al., 2006; Sener and Savarese, 2018; Ash et al., 2019; Citovsky et al., 2021). The most relevant work among them is (Kim et al., 2006), where the authors propose to use a Maximal Marginal Relevance (MMR; Carbonell and Goldstein (1998)) based function as a query strategy in AL for named entity recognition. This function aims to capture uncertainty and diversity and selects instances for annotation from these two perspectives. We adapt this strategy to the ATS task and compare the proposed method with it.

Uncertainty-based Active Learning for Text Generation
In this section, we give a brief formal definition of the AL procedure for text generation and of uncertainty-based query strategies. Here and throughout the rest of the paper, we denote an input sequence as $x = (x_1, \dots, x_m)$ and the output sequence as $y = (y_1, \dots, y_n)$, with $m$ and $n$ being the lengths of $x$ and $y$, respectively. Let $\mathcal{D} = \{(x^{(k)}, y^{(k)})\}_{k=1}^{K}$ be a dataset of (document, summary) pairs. Consider a probabilistic model $p_w(y \mid x)$ parametrized by a vector $w$. Usually, $p_w(y \mid x)$ is a neural network, and parameter estimation is done via the maximum likelihood approach:

$$\hat{w} = \arg\max_w \mathcal{L}(\mathcal{D}, w), \quad \text{where} \quad \mathcal{L}(\mathcal{D}, w) = \sum_{k=1}^{K} \log p_w(y^{(k)} \mid x^{(k)}). \quad (1)$$

Many AL methods are based on greedy query strategies that select instances for annotation by optimizing a certain criterion $A(x \mid \mathcal{D}, \hat{w})$ called an acquisition function:

$$x^{*} = \arg\max_{x \in U} A(x \mid \mathcal{D}, \hat{w}), \quad (2)$$

where $U$ is the unlabeled pool. The selected instance $x^{*}$ is then annotated with a target value $y^{*}$ (a document summary) and added to the training dataset: $\mathcal{D} := \mathcal{D} \cup \{(x^{*}, y^{*})\}$. Subsequently, the model parameters $w$ are updated, and the instance selection process continues until the desired model quality is achieved or the available annotation budget is depleted.
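Steps (1)-(4) of this procedure map directly onto a short loop. The sketch below is schematic rather than the paper's released implementation: `fine_tune`, `acquisition`, and `annotate` are hypothetical caller-supplied callables standing in for model training, the acquisition function $A(x \mid \mathcal{D}, \hat{w})$, and human labeling.

```python
import numpy as np

def active_learning_loop(fine_tune, acquisition, annotate,
                         labeled, unlabeled, budget, k=10):
    """Schematic pool-based AL loop. fine_tune, acquisition, and annotate
    are caller-supplied callables (model training, scoring, human labeling)."""
    while budget > 0 and unlabeled:
        model = fine_tune(labeled)                          # step (1)
        scores = np.array([acquisition(model, x, labeled, unlabeled)
                           for x in unlabeled])             # step (2)
        selected = set(np.argsort(-scores)[:k].tolist())    # top-k query
        for i in selected:                                  # steps (3)-(4)
            labeled.append((unlabeled[i], annotate(unlabeled[i])))
        unlabeled = [x for j, x in enumerate(unlabeled) if j not in selected]
        budget -= k
    return fine_tune(labeled)
```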
The right choice of an acquisition function is crucial for AL. A common heuristic for acquisition is selecting instances with high uncertainty. Below, we consider several measures of uncertainty used in text generation.
Normalized Sequence Probability (NSP) was originally proposed by Ueffing and Ney (2007) and has been used in many subsequent works (Haffari et al., 2009; Wang et al., 2019; Xiao et al., 2020; Lyu et al., 2020). This measure is given by

$$\mathrm{NSP}(x) = 1 - \bar{p}_{\hat{w}}(y \mid x), \quad (3)$$

where $\bar{p}_{\hat{w}}(y \mid x) = \exp\big\{\tfrac{1}{n} \log p_{\hat{w}}(y \mid x)\big\}$ is the geometric mean of the probabilities of the tokens predicted by the model.
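A minimal sketch of computing NSP with the HuggingFace transformers API, assuming a seq2seq model (e.g., BART) already placed on the target device; greedy decoding and the maximum length are illustrative choices, not the paper's exact configuration.

```python
import torch

def nsp_score(model, tokenizer, document, device="cpu", max_length=64):
    """NSP = 1 - geometric mean of the predicted token probabilities.
    Deliberately does not toggle eval/train mode, so it can be reused
    under MC dropout for the Bayesian measures below."""
    inputs = tokenizer(document, return_tensors="pt", truncation=True).to(device)
    with torch.no_grad():
        out = model.generate(**inputs, max_length=max_length,
                             return_dict_in_generate=True, output_scores=True)
    log_probs = []
    for step, logits in enumerate(out.scores):      # one logits tensor per step
        token_id = out.sequences[0, step + 1]       # position 0 is the start token
        log_probs.append(torch.log_softmax(logits[0], -1)[token_id])
    mean_log_p = torch.stack(log_probs).mean()      # (1/n) * log p(y | x)
    return (1.0 - mean_log_p.exp()).item()          # higher => more uncertain
```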
A wide family of uncertainty measures can be derived using the Bayesian approach to modeling. Under the Bayesian approach, the model parameters are assumed to have a prior distribution $\pi(w)$. Optimization of the log-likelihood $\mathcal{L}(\mathcal{D}, w)$ in this case leads to the optimization of the posterior distribution of the model parameters:

$$\pi(w \mid \mathcal{D}) \propto \exp\{\mathcal{L}(\mathcal{D}, w)\}\,\pi(w). \quad (4)$$

Usually, the exact computation of the posterior is intractable, and to perform training and inference, a family of distributions $q_{\theta}(w)$ parameterized by $\theta$ is introduced. The parameter estimate $\hat{\theta}$ minimizes the KL-divergence between the true posterior $\pi(w \mid \mathcal{D})$ and the approximation $q_{\hat{\theta}}(w)$. Given such an approximation, several uncertainty measures can be constructed.

Expected Normalized Sequence Probability (ENSP) was proposed by Wang et al. (2019) and is also used in (Xiao et al., 2020; Lyu et al., 2020):

$$\mathrm{ENSP}(x) = 1 - \mathbb{E}_{q_{\hat{\theta}}(w)}\big[\bar{p}_{w}(y \mid x)\big]. \quad (5)$$

In practice, the expectation is approximated via Monte Carlo dropout (Gal and Ghahramani, 2016), i.e., by averaging multiple predictions obtained with activated dropout layers in the network.
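Both Bayesian measures can be approximated by rerunning the `nsp_score` helper from the previous sketch with dropout kept active; the variance computed here corresponds to ENSV, defined next. This is one possible approximation (each stochastic pass both generates and scores), given as a sketch rather than the paper's exact procedure.

```python
import torch

def mc_dropout_uncertainty(model, tokenizer, document, T=10, device="cpu"):
    """ENSP / ENSV sketch: mean and variance of the geometric-mean sequence
    probability over T stochastic passes with dropout kept active."""
    model.train()                        # keep dropout active at inference time
    p_bar = torch.tensor([1.0 - nsp_score(model, tokenizer, document, device)
                          for _ in range(T)])
    model.eval()
    ensp = 1.0 - p_bar.mean()            # expected normalized sequence probability
    ensv = p_bar.var()                   # variance over passes (ENSV, defined next)
    return ensp.item(), ensv.item()
```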
Expected Normalized Sequence Variance (ENSV; Wang et al. (2019)) measures the variance of the sequence probabilities obtained via Monte Carlo dropout:

$$\mathrm{ENSV}(x) = \mathrm{Var}_{q_{\hat{\theta}}(w)}\big[\bar{p}_{w}(y \mid x)\big]. \quad (6)$$

BLEU Variance (BLEUVar) is proposed by Xiao et al. (2020). It treats generated texts as points in a high-dimensional space and uses the BLEU metric (Papineni et al., 2002) to measure the difference between them. In such a setting, the variance over $T$ summaries generated with Monte Carlo dropout can be calculated as:

$$\mathrm{BLEUVar}(x) = \frac{1}{T(T-1)} \sum_{i \neq j} \big(1 - \mathrm{BLEU}(y^{(i)}, y^{(j)})\big)^{2}. \quad (7)$$

The BLEU metric is calculated as a geometric mean of n-gram overlaps up to 4-grams. Consequently, when summaries consist of fewer than 4 tokens, the metric is equal to zero, since there are no common 4-grams. This problem can be mitigated with the SacreBLEU metric (Post, 2018), which smoothes the n-grams with zero counts. When we use this query strategy with the SacreBLEU metric, we refer to it as SacreBLEUVar.

In-Domain Diversity Sampling Query Strategy

We propose a query strategy based on selecting instances that are semantically dissimilar from the already annotated documents but, at the same time, similar to the core documents of the considered domain. In other words, IDDS queries instances that are dissimilar to the annotated instances but similar to the unannotated ones (Figure 1). We propose the following acquisition function that implements this idea (the higher the value, the higher the priority for annotation):

$$\mathrm{IDDS}(x) = \lambda\, \frac{\sum_{x' \in U} s(x, x')}{|U|} - (1 - \lambda)\, \frac{\sum_{x' \in L} s(x, x')}{|L|}, \quad (8)$$

where $s(x, x')$ is a similarity function between texts, $U$ is the unlabeled pool, $L$ is the labeled set, and $\lambda \in [0, 1]$ is a hyperparameter. Below, we formalize the resulting algorithm of the IDDS query strategy.
1. For each document x in the unlabeled pool, obtain an embedding vector e(x). For this purpose, we suggest using the pooled [CLS] sequence embeddings from BERT. We note that using a pre-trained checkpoint out of the box may lead to unreasonably high similarity scores between instances, since they all belong to the same, possibly quite specific, domain. We mitigate this problem with task-adaptive pre-training (TAPT; Gururangan et al. (2020)) on the unlabeled pool. TAPT performs several epochs of self-supervised training of the pre-trained model on the target dataset to acquaint it with the peculiarities of the data (a minimal TAPT sketch is given at the end of this section).
2. Deduplicate the unlabeled pool. Instances with duplicates would otherwise have an overrated similarity score with the unlabeled pool.
3. Calculate the informativeness scores using the IDDS acquisition function (8). As a similarity function, we suggest using the dot product between document representations: $s(x, x') = \langle e(x), e(x') \rangle$ (a NumPy sketch of the scoring step is given below).

The idea of IDDS is close to the MMR-based strategy proposed in (Kim et al., 2006). Yet, despite the resemblance, IDDS differs from it in several crucial aspects. The MMR-based strategy focuses on the uncertainty and diversity components. However, as shown in Section 6.1, selecting instances by uncertainty leads to worse results than random sampling. Consequently, instead of using uncertainty, IDDS leverages the unlabeled pool to capture the representativeness of the instances. Furthermore, IDDS differs from the MMR-based strategy in how the diversity component is calculated. MMR prescribes the "max" aggregation function for calculating the similarity with the already annotated data, while IDDS uses the "average" similarity instead and achieves better results, as shown in Section 6.2.
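For dot-product similarity, the two averages in (8) collapse into dot products with mean embeddings, so the whole scoring step reduces to two matrix-vector products. A minimal NumPy sketch, assuming precomputed document embeddings stacked row-wise into the hypothetical arrays `unlabeled_emb` and `labeled_emb`, with the paper's λ = 0.67 as the default:

```python
import numpy as np

def idds_scores(unlabeled_emb, labeled_emb, lam=0.67):
    """IDDS acquisition scores (Eq. 8) with dot-product similarity.
    The average of dot products equals a dot product with the mean
    embedding, which keeps the computation cheap."""
    pool_sim = unlabeled_emb @ unlabeled_emb.mean(axis=0)
    labeled_sim = (unlabeled_emb @ labeled_emb.mean(axis=0)
                   if len(labeled_emb) else np.zeros(len(unlabeled_emb)))
    return lam * pool_sim - (1.0 - lam) * labeled_sim

# query the top-k documents for annotation:
# query_idx = np.argsort(-idds_scores(unlabeled_emb, labeled_emb))[:10]
```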
We note that, in contrast to uncertainty-based strategies, IDDS does not require retraining an acquisition model, since document vector representations and document similarities can be calculated before starting the AL annotation process. As a result, no heavy computations are required during AL. Consequently, IDDS does not harm the interactiveness of the annotation process, which is a common bottleneck (Tsvigun et al., 2022).
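Step 1 above relies on TAPT of the embedding model. A minimal sketch using the HuggingFace Trainer with the masked-language-modeling objective; it assumes the unlabeled pool is a datasets.Dataset with a "text" column, and the number of epochs and batch size are illustrative choices.

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# tokenize the raw unlabeled pool for self-supervised (MLM) training
tokenized = unlabeled_pool.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True, remove_columns=unlabeled_pool.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments("tapt", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer,
                                                  mlm_probability=0.15),
)
trainer.train()   # self-supervised pass over the target-domain texts
```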

Self-learning
Pool-based AL assumes that there is a large unlabeled pool of data. We propose to use this data source during AL to improve text summarization models with the help of self-learning. We train the model on the labeled data and generate summaries for the whole unlabeled pool. Then, we concatenate the generated summaries with the labeled set and use this data to fine-tune the final model. We note that generated summaries can be noisy: irrelevant, grammatically incorrect, or factually inconsistent, and can harm the model performance. We detect such instances using the uncertainty estimates obtained via NSP scores and exclude the k_l% of instances with the lowest scores and the k_h% of instances with the highest scores. We choose this uncertainty metric because, according to our experiments in Section 6.1, high NSP scores correspond to the noisiest instances. We note that the filtration step introduces no additional computational overhead, since the NSP scores are calculated simultaneously with the summary generation for self-learning.
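A sketch of the filtering step, assuming the NSP scores were collected while generating the pseudo-summaries; the defaults mirror the k_l = 10, k_h = 1 setting used for AESLC and Gigaword in the experiments below.

```python
import numpy as np

def filter_pseudo_labels(docs, summaries, nsp_scores, k_low=10, k_high=1):
    """Drop the k_low% least and k_high% most uncertain generated summaries
    (by NSP) before adding them to the training set."""
    nsp = np.asarray(nsp_scores)
    lo, hi = np.percentile(nsp, [k_low, 100 - k_high])
    keep = (nsp >= lo) & (nsp <= hi)
    return [(d, s) for d, s, k in zip(docs, summaries, keep) if k]
```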

Active Learning Setting
We evaluate IDDS and other query strategies using the conventional scheme of AL annotation emulation applied in many previous works (Settles and Craven, 2008b; Shen et al., 2017; Siddhant and Lipton, 2018; Shelmanov et al., 2021; Dor et al., 2020). For uncertainty-based query strategies and random sampling, we start from a small annotated seeding set selected randomly. This set is used for fine-tuning the summarization model on the first iteration. For IDDS, the seeding set is not used, since this query strategy does not require fine-tuning the model to make a query. On each AL iteration, we select the top-k instances from the unlabeled pool according to the informativeness score obtained with a query strategy. The selected instances with their gold-standard summaries are added to the so-far annotated set and are excluded from the unlabeled pool. On each iteration, we fine-tune a summarization model from scratch and evaluate it on a held-out test set. We report the performance of the model on each iteration to demonstrate the dynamics of the model performance depending on the invested annotation effort.
The query size (the number of instances selected for annotation on each iteration) is set to 10 documents. We repeat each experiment 9 times with different random seeds and report the mean and the standard deviation of the obtained scores. For the WikiHow and PubMed datasets, on each iteration, we use a random subset of the unlabeled pool, since generating predictions for the whole unlabeled dataset is too computationally expensive. In the experiments, the subset size is set to 10,000 for WikiHow and 1,000 for PubMed.

Baselines
We use random sampling as the main baseline. To our knowledge, in the ATS task, this baseline has not yet been outperformed by any other query strategy. In this baseline, an annotator is given randomly selected instances from the unlabeled pool, which means that AL is not used at all. We also report the results of uncertainty-based query strategies and of the MMR-based query strategy (Kim et al., 2006), which has been shown to be useful for named entity recognition.

Metrics
Quality of Text Summarization. To measure the quality of the text summarization model, we use the commonly adopted ROUGE metric (Lin, 2004). Following previous works (See et al., 2017; Nallapati et al., 2017; Chen and Bansal, 2018; Lewis et al., 2020; Zhang et al., 2020), we report ROUGE-1, ROUGE-2, and ROUGE-L. Since the dynamics of these metrics coincide, for brevity, we keep only the ROUGE-1 results in the main part of the paper. The results with the other metrics are presented in the appendix.
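For reference, ROUGE-1/2/L can be computed with the evaluate library; this is a generic sketch (the paper's exact aggregation settings may differ), with `generated_summaries` and `gold_summaries` as placeholder lists of strings.

```python
import evaluate

rouge = evaluate.load("rouge")
scores = rouge.compute(predictions=generated_summaries,
                       references=gold_summaries,
                       use_stemmer=True)
print(scores["rouge1"], scores["rouge2"], scores["rougeL"])
```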
Factual Consistency. Inconsistency (hallucination) of the generated summaries is one of the most crucial problems in summarization (Kryscinski et al., 2020; Huang et al., 2021; Nan et al., 2021; Goyal et al., 2022). Therefore, in addition to the ROUGE metrics, we measure the factual consistency of the generated summaries with the original documents. We use SummaC-ZS (Laban et al., 2022), a state-of-the-art model for inconsistency detection, with granularity = "sentence" and model_name = "vitc".
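This setting corresponds to roughly the following usage, assuming the interface of the summac package as documented in its repository; `source_documents` and `generated_summaries` are placeholder lists of strings.

```python
from summac.model_summac import SummaCZS   # pip install summac

summac_zs = SummaCZS(granularity="sentence", model_name="vitc", device="cuda")
result = summac_zs.score(source_documents, generated_summaries)
# average consistency over the test set; higher = more consistent
consistency = sum(result["scores"]) / len(result["scores"])
```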

Datasets
We experiment with three datasets widely used for the evaluation of ATS models: AESLC (Zhang and Tetreault, 2019), PubMed (Cohan et al., 2018), and WikiHow (Koupaee and Wang, 2018). AESLC consists of emails with their subject lines as summaries. WikiHow consists of instructional articles with their headlines as summaries. PubMed is a collection of scientific articles from the PubMed archive with their abstracts as summaries. The choice of datasets is stipulated by the fact that AESLC contains short documents and summaries, WikiHow contains medium-sized documents and summaries, and PubMed contains long documents and summaries. We also use two non-intersecting subsets of the Gigaword dataset (Graff et al., 2003; Rush et al., 2015) of sizes 2,000 and 10,000 for hyperparameter optimization of ATS models and for additional experiments with self-learning, respectively. Gigaword consists of news articles with their headlines as summaries. The dataset statistics are presented in Table 2 in Appendix A.
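For reference, AESLC and PubMed are available through the HuggingFace datasets hub (WikiHow requires a manual download of the raw data); the field names below reflect the hub versions and may differ from the processed splits used in the paper.

```python
from datasets import load_dataset

# AESLC: email_body -> subject_line
aeslc = load_dataset("aeslc")
# PubMed: article -> abstract (part of the scientific_papers collection)
pubmed = load_dataset("scientific_papers", "pubmed")

print(aeslc["train"][0]["email_body"][:200])
print(aeslc["train"][0]["subject_line"])
```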

Models and Hyperparameters
We conduct experiments using state-of-the-art text summarization models: BART (Lewis et al., 2020) and PEGASUS (Zhang et al., 2020). In all experiments, we use the "base" pre-trained version of BART and the "large" version of PEGASUS.
Most of the experiments are conducted with the BART model, while PEGASUS is only used for the AESLC dataset (results are presented in Appendices B and C), since running it on the two other datasets in AL introduces a computational bottleneck.
We tune the hyperparameter values of the ATS models using the ROUGE-L score on a subset of the Gigaword dataset. The hyperparameter values are provided in Table 3 in Appendix A.
For the IDDS query strategy, we use λ = 0.67. We analyze the effect of different values of this parameter in Section 6.2.

Uncertainty-based Query Strategies
In this series of experiments, we demonstrate that the selected uncertainty-based query strategies are not suitable for AL in ATS. Figure 2a and Figures 6, 7 in Appendix B present the results on the AESLC dataset. As we can see, none of the uncertainty-based query strategies outperforms the random sampling baseline for either the BART or the PEGASUS model. The NSP and ENSP strategies demonstrate the worst results, with the former having the lowest performance for both ATS models. Similar results are obtained for the WikiHow and PubMed datasets (Figures 2b and 2c).
In some previous work on NMT, uncertainty-based query strategies outperform the random sampling baseline (Haffari et al., 2009; Ambati, 2012; Ananthakrishnan et al., 2013). Their low results for ATS compared to NMT might stem from the differences between these tasks. Both NMT and ATS are seq2seq tasks and can be solved via similar models. However, NMT is somewhat easier, since the output is usually of a similar length as the input and its variability is smaller. It is much easier to train a model to reproduce an exact translation than to make it generate an exact summary. Therefore, uncertainty estimates of ATS models are far less reliable than estimates of translation models. These estimates often select for annotation noisy documents that are useless or even harmful for training summarization models. Table 1 shows several documents selected by the worst-performing strategy, NSP, on AESLC. We can see that NSP selects domain-irrelevant or very specific documents. Their summaries can hardly be restored from the source documents, which means that they most likely have little positive impact on the generalization ability of the model. More examples of instances selected by different query strategies are presented in Table 5 in Appendix E.

In-Domain Diversity Sampling
In this series of experiments, we analyze the proposed IDDS query strategy. Figure 3a and Figures 10, 11 in Appendix C show the performance of the models with various query strategies on AESLC. The proposed strategy outperforms random sampling on all iterations for both ATS models and, consequently, the uncertainty-based NSP strategy as well. IDDS demonstrates similar results on the WikiHow and PubMed datasets, outperforming the baseline by a large margin, as depicted in Figures 3b and 3c. We also report the percentage improvement of IDDS over random sampling on several AL iterations in Table 4. IDDS provides an especially large improvement in the cold-start AL scenario, when the amount of labeled data is very small.

We carry out several ablation studies for the proposed query strategy. First, we investigate the effect of various models for document embedding construction and the necessity of performing TAPT. Figures 17 and 18 in Appendix F illustrate that TAPT substantially enhances the performance of IDDS. Figure 17 also shows that the BERT-base encoder outperforms SentenceBERT (Reimers and Gurevych, 2019) and Longformer (Beltagy et al., 2020).
Second, we try various functions for calculating the similarity between instances. Figures 19 and 20 in Appendix F compare the originally used dot product with the Mahalanobis and Euclidean distances on AESLC and WikiHow. On AESLC, IDDS with the Mahalanobis distance consistently demonstrates lower performance, and IDDS with the Euclidean distance shows a performance drop on the initial AL iterations compared to the dot product. On WikiHow, however, all variants perform roughly the same. Therefore, we suggest keeping the dot product for computing document similarity in IDDS, since it provides the most robust results across the datasets.
We also compare the dot product with its normalized version, cosine similarity, on AESLC and PubMed; see Figures 21 and 22 in Appendix F. On both datasets, adding normalization leads to substantially worse results on the initial AL iterations. We deem that this happens because normalization may damage the representativeness component, since the norm of an embedding can be treated as a measure of the representativeness of the corresponding document.
Third, we investigate how different values of the λ coefficient influence the performance of IDDS. Table 7 and Figure 23 in Appendix F show that smaller values of λ ∈ {0, 0.33, 0.5} substantially deteriorate the performance. Smaller values correspond to selecting instances that are highly dissimilar from the documents in the unlabeled pool, which leads to picking many outliers. Higher values lead to the selection of instances from the core of the unlabeled dataset that are, however, very similar to the annotated part, which also results in lower quality on the initial AL iterations. The best and most stable results are obtained with λ = 0.67.
Fourth, we compare IDDS with the MMR-based strategy suggested in (Kim et al., 2006). Since it uses uncertainty, it requires a trained model to calculate the scores. Consequently, the initial query is taken randomly, as no trained model is available on the initial AL iteration. We therefore also use a modification in which the initial query is done with IDDS, because it provides substantially better results on the initial iteration. We also experiment with different values of the λ hyperparameter of the MMR-based strategy. Figure 24 illustrates a large performance gap between IDDS and the MMR-based strategy on AESLC, regardless of the initialization and λ. We believe this is attributed to the fact that strategies incorporating uncertainty are harmful in AL for ATS, as shown in Section 6.1.
Finally, we compare "aggregation" functions for estimating the similarity between a document and a collection of documents (the labeled and unlabeled pools). Following the MMR-based strategy (Kim et al., 2006), instead of calculating the average similarity between the embedding of the target document and the embeddings of documents from the labeled set, we calculate the maximum similarity. We also try replacing the "average" aggregation function with "maximum" in both IDDS components in (8). Figures 25 and 26 in Appendix F show that "average" leads to better performance on both the AESLC and WikiHow datasets.
The importance of diversity sampling is illustrated in Table 6 in Appendix E. We can see that NSP-based query batches contain a large number of overlapping instances. This may partly explain the poor performance of the NSP strategy, since almost 9% of labeled instances are redundant. IDDS, on the contrary, does not produce batches with overlapping summaries at all.

Self-learning
In this section, we investigate the effect of self-learning in the AL setting. Figures 4a and 4b illustrate the effect of self-learning on the AESLC and Gigaword datasets. For this experiment, we use k_l = 10, k_h = 1, filtering out 11% of the automatically generated summaries. In both cases, with and without AL, adding automatically generated summaries of documents from the unlabeled pool to the training set improves the performance of the summarization model. On AESLC, the best results are obtained with both AL and self-learning: their combination achieves up to a 58% improvement in all ROUGE metrics compared to passive annotation without self-learning.
The same experiment on the WikiHow dataset is presented in Figure 4c. To make sure that the quality does not deteriorate due to the addition of noisy uncertain instances, we use k_l = 38, k_h = 2 for this experiment, filtering out 40% of the automatically generated summaries. On this dataset, self-learning reduces the performance in both cases (with AL and without). We deem that the benefit of self-learning depends on the length of the summaries in the dataset. AESLC and Gigaword contain very short summaries (fewer than 13 tokens on average, see Table 2). Since the model is capable of generating short texts that are grammatically correct and logically consistent, such data augmentation does not introduce much noise into the dataset, resulting in a performance improvement. WikiHow, on the contrary, contains long summaries (77 tokens on average). Generating long, logically consistent, and grammatically correct summaries is still a challenging task even for state-of-the-art ATS models. Therefore, the generated summaries are of low quality, and using them as an additional training signal deteriorates the model performance. Consequently, we suggest using self-learning only if the dataset consists of relatively short summaries. We leave a more detailed investigation of this topic for future research.

Consistency
We analyze how various AL query strategies and self-learning affect the consistency of the model output. We measure the consistency of the generated summaries with the original documents on the test set on each AL iteration. Figure 5 shows that, on AESLC, the model trained on instances queried by IDDS generates the most consistent summaries across all considered AL query strategies. On the contrary, the model trained on the instances selected by the uncertainty-based NSP query strategy generates summaries with the lowest consistency.
Figure 28 in Appendix G demonstrates that, on AESLC, self-learning also improves consistency regardless of the AL strategy. The same trend is observed on Gigaword (Figure 27 in Appendix G). However, for WikiHow, there is no clear trend. Figure 29 in Appendix G shows that all query strategies lead to similar consistency results, with NSP producing slightly higher consistency and BLEUVar slightly lower. We deem that this may be because the summaries generated by the model on WikiHow are of lower quality than the golden summaries regardless of the strategy, which biases the SummaC scores toward similar average results.

Query Duration
We compare the average duration of AL iterations for various query strategies. Figure 30 in Appendix H presents the average training time and the average duration of making a query. We can see that training the model takes considerably less time than selecting the instances for annotation from the unlabeled pool. Therefore, the duration of AL iterations is mostly determined by the efficiency of the query strategy. The IDDS query strategy does not require any heavy computations during AL, which also makes it the best option for keeping the AL process interactive.

Conclusion
In this work, we present the first study of AL in ATS and propose the first active learning query strategy that outperforms the random sampling baseline. The query strategy selects for annotation the instances that are highly similar to the documents in the unlabeled pool and dissimilar to the already annotated documents. It outperforms random sampling in terms of ROUGE metrics on all considered datasets. It also outperforms random sampling in terms of the consistency score calculated via the SummaC model on the AESLC dataset. We also demonstrate that uncertainty-based query strategies fail to outperform random sampling, resulting in the same or even worse performance. Finally, we show that self-learning can improve the performance of an ATS model in terms of both the ROUGE metrics and consistency. This is especially favorable in AL, since there is always a large unlabeled pool of data. We show that combining AL and self-learning can give an improvement of up to 58% in terms of ROUGE metrics.
In future work, we plan to investigate IDDS in other sequence generation tasks. This query strategy might be beneficial for tasks with highly variable output, where uncertainty estimates of model predictions are unreliable and cannot outperform the random sampling baseline. IDDS facilitates the representativeness of instances in the training dataset without leveraging uncertainty scores.

Limitations
Despite the benefits, the proposed methods require certain conditions to be met to be successfully applied in practice. The IDDS strategy requires performing TAPT of the embedding model, which may be computationally expensive for a large dataset. Self-learning, in turn, may harm the performance when the summaries are too long, as shown in Section 6.3. Consequently, its application requires a detailed analysis of the properties of the target-domain summaries.

Ethical Considerations
It is important to note that active learning is a method of biased sampling, which can lead to biased annotated corpora. Therefore, active learning can be used to deliberately increase the bias in datasets. Our research improves active learning performance; hence, our contribution could also make such deliberate biasing more efficient. We also note that our method uses pre-trained language models, which usually contain various types of biases themselves. Since bias affects all applications of pre-trained models, it can also unintentionally facilitate the biased selection of instances for annotation during active learning.

A Dataset Statistics and Model Hyperparameters
Table 2: Dataset statistics. We provide the number of instances in the training and test sets and the average lengths of documents / summaries in tokens. All the datasets are English-language. We filter the WikiHow dataset since it contains many noisy instances: we exclude instances whose documents have 10 or fewer tokens and instances whose summaries have 3 or fewer tokens.

Table 3: Hyperparameter values and checkpoints from the HuggingFace repository (Wolf et al., 2019). We imitate the low-resource case by randomly selecting 200 instances from the Gigaword training set as a train sample and 2,000 instances from its validation set as a test sample, for consistency of evaluation. For each model, we find the optimal hyperparameters according to evaluation scores on the sampled subset. The generation maximum length is set to the maximum summary length in the available labeled set.

For the WikiHow and PubMed datasets, we reduce the batch size and increase the number of gradient accumulation steps by the same factor due to a computational bottleneck. Hardware configuration: 2 Intel Xeon Platinum 8168, 2.7 GHz, 24-core CPUs; NVIDIA Tesla V100 GPU with 32 GB of VRAM.

Figure 2: ROUGE-1 scores of BART-base with various uncertainty-based strategies compared with random sampling (baseline) on various datasets. Full results are provided in Figures 6, 8, and 9, respectively.

Figure 4: ROUGE-1 scores of the BART-base model with IDDS and random sampling strategies, with and without self-learning, on AESLC, Gigaword, and WikiHow. Full results are provided in Figures 14, 15, and 16, respectively.

Figure 5: Consistency scores of the generated summaries for various AL query strategies on AESLC.

Figure 6: The performance of the BART-base model with various uncertainty-based strategies compared with random sampling (baseline) on AESLC.

Figure 7: The performance of the PEGASUS-large model with various uncertainty-based strategies compared with random sampling (baseline) on AESLC.

Figure 8: The performance of the BART-base model with various uncertainty-based strategies compared with random sampling (baseline) on WikiHow.

Figure 9: The performance of the BART-base model with various uncertainty-based strategies compared with random sampling (baseline) on PubMed.

Figure 17: Ablation study of the document embeddings model and the necessity of performing TAPT for it in the IDDS strategy with BART-base on AESLC.

Figure 18: Ablation study of the necessity of performing TAPT for the document embeddings model in the IDDS strategy (companion to Figure 17).

Figure 19: Comparison of similarity functions (dot product, Mahalanobis, and Euclidean distances) in IDDS on AESLC.

Figure 20: Comparison of similarity functions in IDDS on WikiHow.

Figure 21: Comparison of the dot product and cosine similarity in IDDS on AESLC.

Figure 22: Comparison of the dot product and cosine similarity in IDDS on PubMed.

Figure 23: Performance of IDDS for different values of the λ hyperparameter.

Figure 24: Comparison of IDDS with the MMR-based strategy suggested in (Kim et al., 2006) with BART-base on AESLC. We experiment with different λ values in MMR and different initialization schemes.

Figure 25: Comparison of the average and maximum aggregation functions in IDDS with BART-base on AESLC.

Figure 26: Comparison of the average and maximum aggregation functions in IDDS with BART-base on WikiHow.

Figure 27: The effect of self-learning on the consistency of generated summaries on Gigaword.

Figure 28: The effect of self-learning on the consistency of generated summaries on AESLC.

Figure 29: Consistency scores for various AL query strategies on WikiHow.

Figure 30: Average duration in seconds of one AL query of 10 instances with different strategies on the AESLC dataset with BART-base as the acquisition model. "Train" refers to the average time required for training the model throughout the AL cycle. Hardware configuration: 2 Intel Xeon Platinum 8168, 2.7 GHz, 24-core CPUs; NVIDIA Tesla V100 GPU with 32 GB of VRAM.

Table 1: Examples of instances selected with the NSP and IDDS strategies. Tokens overlapping with the source document are highlighted in green. Tokens that paraphrase a part of the document, together with the corresponding part, are highlighted in blue. Tokens that cannot be derived from the document are highlighted in red.

Table 4: Percentage increase in ROUGE F-scores of IDDS over the baseline on different AL iterations. "Average" refers to the average increase throughout the whole AL cycle.

Table 6: Share of fully / partly overlapping summaries among batches of instances queried with various AL strategies during AL with the BART-base model on AESLC. We consider two summaries to be partly overlapping if their ROUGE-1 score is > 0.66. The results are averaged across 9 seeds for all strategies except IDDS, whose queries are constant and seed-independent.

Table 7: ROUGE scores on AL iterations for different values of the λ hyperparameter in IDDS. The largest values w.r.t. the confidence intervals are shown in bold.