AUTOSUMM: Automatic Model Creation for Text Summarization

Recent efforts to develop deep learning models for text generation tasks such as extractive and abstractive summarization have resulted in state-of-the-art performances on various datasets. However, obtaining the best model configuration for a given dataset requires an extensive knowledge of deep learning specifics like model architecture, tuning parameters etc., and is often extremely challenging for a non-expert. In this paper, we propose methods to automatically create deep learning models for the tasks of extractive and abstractive text summarization. Based on the recent advances in Automated Machine Learning and the success of large language models such as BERT and GPT-2 in encoding knowledge, we use a combination of Neural Architecture Search (NAS) and Knowledge Distillation (KD) techniques to perform model search and compression using the vast knowledge provided by these language models to develop smaller, customized models for any given dataset. We present extensive empirical results to illustrate the effectiveness of our model creation methods in terms of inference time and model size, while achieving near state-of-the-art performances in terms of accuracy across a range of datasets.


Introduction
Machine learning algorithms, particularly deep learning techniques, have simplified several computationally expensive tasks. However, training and optimizing these models for different tasks demands experienced engineering resources and expertise, which puts them out of reach for non-experts. Automated Machine Learning (AutoML) is a strategy to automate this model-creation pipeline, including the generation of the model itself.
* Work done while authors were at Adobe Research.
In Natural Language Processing and text analysis, the advent of large language models such as BERT (Devlin et al., 2019), GPT-2 (Radford et al., 2019), and more recently GPT-3 (Brown et al., 2020) has created resources that can be exploited to build robust models for several downstream NLP tasks. However, the need for ML expertise creates a bottleneck. Further, these deep learning models have millions to billions of parameters and need fairly large datasets and computational resources for training.
We focus on providing algorithms for auto-generation of ML models for complex NLP tasks such as extraction and generation, making them accessible to non-experts. Our proposed approaches draw on the knowledge available in large pretrained models to auto-generate new, smaller models customized to a given dataset. Specifically, the major contributions are as follows.
(1) We propose a method to create machine learning models that are efficient and customized to a given dataset for the tasks of extractive and abstractive summarization, using a combination of neural architecture search and task-specific knowledge distillation from large language models.
(2) Additionally, we propose an alternate method for summarization model generation using Transformer distillation, which is superior in terms of performance and resource utilisation.
(3) We conduct extensive experiments and present results illustrating the effectiveness of the proposed methods for extractive and abstractive summarization on a range of datasets, and compare our models with state-of-the-art models in terms of model creation efficiency, model size, inference time, and performance.
To the best of our knowledge, this is the first effort towards automatically building customized and compressed models for text generation tasks, specifically summarization.

Related Work
Neural Architecture Search (NAS) is a trending area in AutoML that automates model creation by searching for efficient model architectures without human expertise. A typical NAS problem involves identifying the search space, employing a search strategy to find the best task-specific model architecture, and training the model from scratch. Most NAS experiments have been done on images (Real et al., 2017; Real et al., 2018; Suganuma et al., 2018) using neuro-evolutionary and genetic algorithms, which are computationally very expensive and time consuming. Recently, gradient-based methods such as DARTS, SNAS (Xie et al., 2019), and that of Dong and Yang (2019) have been proposed to speed up the search. However, the use of NAS for language-related tasks has seen very limited exploration. TextNAS introduces a search space designed for better text representations and uses ENAS (Pham et al., 2018), a simple and efficient search algorithm guided by reinforcement learning, for model generation. However, these models mainly focus on text understanding and do not directly extend to generation-related tasks like summarization.
Text Summarization: The neural attention model (Rush et al., 2015) marked the beginning of using deep neural architectures for text summarization. Seq2seq model variations with a convolutional encoder (Chopra et al., 2016; Narayan et al., 2018b), a hierarchical attention-based RNN encoder (Nallapati et al., 2017), and pointer-generator networks (See et al., 2017) were used for both extractive and abstractive tasks. With the advent of multi-head attention (Vaswani et al., 2017), transformer-based models like PEGASUS (Zhang et al., 2019) and BERT-Summ (Liu and Lapata, 2019a) were proposed with pre-training objectives tailored for summarization. While these methods give good results, they demand considerable human expertise and computational overhead to design and deploy.
Knowledge Distillation & Model Compression: These techniques aim to take advantage of the immense knowledge in pre-trained models. TinyBERT (Jiao et al., 2019) presents a distillation approach for text classification and natural language inference (NLI) using BERT compression and distillation. AdaBERT (Chen et al., 2020a) presents a differentiable NAS algorithm that leverages a BERT model through knowledge distillation for classification and NLI tasks. Chen et al. (2020b) transfer BERT knowledge to an encoder-decoder model for text generation. However, all these approaches are limited to specific tasks and are not directly extensible to generation-based tasks.
Figure 1 shows the overview of our model creation framework. The input is the dataset and task specifications (summary type, size) and the output is a custom trained summarization model, which can then be used to create text summaries. In this paper, we generate models for both extractive and abstractive summarization tasks. The former is a binary classification task that extracts summary sentences from the input, while the latter aims to generate summaries containing novel words and phrases that may not be present in the input text. Our proposed approaches distill knowledge from a language-model-based teacher network to generate an encoder-decoder-based child model. We present two algorithms that aid in the auto-creation of different types of resulting 'child' models: (1) a model with convolutional and recurrent units and (2) a mini-transformer-based model. The first is achieved by our approach AUTOSUMM-CREATE and the second by AUTOSUMM-DISTILL, which are detailed as follows.

AutoSumm-Create
Here, we combine knowledge distillation with neural architecture search to auto-create an encoder-decoder-based summarization model. The stages in this method include:
1. Task-specific knowledge distillation: We leverage knowledge from a transformer-based BERT model (teacher) fine-tuned for extractive and abstractive summarization (Liu and Lapata, 2019b) on the given task-specific (summarization) dataset. The teacher's predictions are used for distillation: the sentence classification scores (extractive) and the probability distributions over the vocabulary (abstractive) are added to the ground truth as soft labels. A knowledge distillation loss ($L_{KD}$) is included to perform an informed search over the child models, ensuring that they mimic the performance of the teacher. In extractive summarization, $L_{KD}$ is the MSE loss between the soft labels from the augmented data and the scores predicted by the child model.
In abstractive summarization, $L_{KD}$ is calculated at each time step $t$ from the soft labels $P_{teacher}(y_t)$ of the teacher model and the predicted probabilities $P_{pred}(y_t)$ of the child model over the vocabulary $V$.
2. Neural Architecture Search: The augmented dataset, along with a small labelled custom dataset, is used to train the NAS module, which searches for the combination of cells that results in the child model best suited to the summarization task.
In our approach, we use NAS to search the encoder space while using a predefined (task-specific) decoder. The key components of this module are:
Search space: Following prior work, we define a macro search space such that the model can be represented by a directed acyclic graph (DAG), with nodes representing layers from the search space and edges representing the direction of information flow. The search space has 4 key cell types: CNN (kernel sizes 1, 3, 5, 7), RNN (bidirectional GRU), pooling layers (average pooling and max pooling with stride 1 and uniform padding), and multi-head self-attention (8 heads, no positional embeddings). We constrain the search space by (1) defining the number of skip connections allowed, (2) limiting the maximum number of layers in the child architecture, $l$ (in our case $l \in \{1, 5, 10, 18, 20\}$), and (3) defining the cells allowed in the new architecture. These constraints define the exhaustive list of possibilities for the NAS algorithm.
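To make the constrained search space concrete, the following is a minimal Python sketch, not the implementation behind the reported results. The names `CELL_TYPES`, `sample_architecture`, and `is_valid` are hypothetical, and architectures are drawn uniformly at random here, whereas the framework uses a learned controller. Each layer always receives its predecessor's output, and skip connections may only point to earlier layers, which keeps the graph acyclic.

```python
import random

# Candidate cell types, taken from the search space described in the text.
CELL_TYPES = (
    [f"conv_k{k}" for k in (1, 3, 5, 7)]   # CNN with kernel sizes 1, 3, 5, 7
    + ["bi_gru"]                            # bidirectional GRU
    + ["avg_pool", "max_pool"]              # pooling with stride 1
    + ["self_attention_8h"]                 # 8-head self-attention
)


def sample_architecture(num_layers, max_skips, allowed_cells=CELL_TYPES, rng=None):
    """Sample a random encoder DAG as a list of (cell, skip_inputs) per layer.

    skip_inputs lists earlier, non-adjacent layer indices whose outputs are
    also fed into this layer (assumed encoding, for illustration only).
    """
    rng = rng or random.Random()
    layers, skips_used = [], 0
    for i in range(num_layers):
        cell = rng.choice(list(allowed_cells))
        skip_inputs = []
        # Layers 0 .. i-2 are skip candidates; layer i-1 is the direct input.
        for j in range(i - 1):
            if skips_used < max_skips and rng.random() < 0.5:
                skip_inputs.append(j)
                skips_used += 1
        layers.append((cell, skip_inputs))
    return layers


def is_valid(arch, max_layers, max_skips, allowed_cells=CELL_TYPES):
    """Check the three constraints: layer budget, skip budget, allowed cells."""
    if len(arch) > max_layers:
        return False
    total_skips = 0
    for i, (cell, skip_inputs) in enumerate(arch):
        if cell not in allowed_cells:
            return False
        # Skip edges must point to strictly earlier, non-adjacent layers.
        if any(j < 0 or j >= i - 1 for j in skip_inputs):
            return False
        total_skips += len(skip_inputs)
    return total_skips <= max_skips


if __name__ == "__main__":
    for layer in sample_architecture(num_layers=5, max_skips=2, rng=random.Random(0)):
        print(layer)
```

Restricting `allowed_cells` or lowering `max_skips` shrinks the space the search has to cover, which is the role the three constraints play in the text.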
Search algorithm: We implement ENAS (Pham et al., 2018), a reinforcement learning (RL) based algorithm used for several NAS implementations (Zoph and Le, 2017). It consists of an RNN controller network that samples a model architecture from the search space and an RL reward that nudges this controller towards generating an optimal architecture.
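The sample-then-reward loop can be illustrated with a deliberately simplified sketch. It is not ENAS itself: the RNN controller is replaced by a hypothetical `TabularController` holding one independent logit vector per layer, there is no weight sharing, and `toy_reward` is a made-up stand-in for the validation-based reward. What it does show is the REINFORCE update, in which the log-probability of a sampled architecture is raised in proportion to its reward minus a moving-average baseline.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class TabularController:
    """Stand-in for the RNN controller: independent logits for each layer."""

    def __init__(self, num_layers, num_ops, lr=0.1, seed=0):
        self.logits = [[0.0] * num_ops for _ in range(num_layers)]
        self.lr = lr
        self.baseline = 0.0               # moving average of past rewards
        self.rng = random.Random(seed)

    def sample(self):
        """Sample one operation index per layer from the current policy."""
        arch = []
        for layer_logits in self.logits:
            probs = softmax(layer_logits)
            arch.append(self.rng.choices(range(len(probs)), weights=probs)[0])
        return arch

    def update(self, arch, reward, decay=0.9):
        """One REINFORCE step using the advantage (reward - baseline)."""
        advantage = reward - self.baseline
        for layer_logits, chosen in zip(self.logits, arch):
            probs = softmax(layer_logits)
            for k in range(len(layer_logits)):
                # Gradient of log softmax w.r.t. logit k: 1[k == chosen] - p_k
                grad_logp = (1.0 if k == chosen else 0.0) - probs[k]
                layer_logits[k] += self.lr * advantage * grad_logp
        self.baseline = decay * self.baseline + (1 - decay) * reward


def toy_reward(arch, preferred_op=2):
    """Hypothetical reward: fraction of layers that picked `preferred_op`."""
    return sum(op == preferred_op for op in arch) / len(arch)


if __name__ == "__main__":
    controller = TabularController(num_layers=4, num_ops=3)
    for _ in range(500):
        sampled = controller.sample()
        controller.update(sampled, toy_reward(sampled))
    print([[round(p, 2) for p in softmax(l)] for l in controller.logits])
```

In the actual framework the reward would come from training and validating the sampled child model rather than from a closed-form function.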
Pre-defined Model Specifications: As stated earlier, we auto-create the encoder layers in the model but predefine the task-specific decoder. For extractive summarization, the decoder is a scorer function with sigmoid activation, which takes the text representations learnt by the encoder and scores each sentence on a scale of (0, 1). The sentences with the highest scores are chosen as the final summary, based on the summary size specified. For abstractive summarization, a recurrent neural network is used as the decoder. The input is the text representation from the encoder and the output is a summary generated in an auto-regressive manner, by decoding one word at every time step.
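A minimal sketch of the extractive decoder follows, assuming a single linear layer under the sigmoid; the text only specifies a scorer with sigmoid activation, so the layer shape, the function names, and the choice to return the selected sentences in document order are all assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def score_sentences(sentence_vectors, weights, bias=0.0):
    """Score each encoder sentence representation with linear + sigmoid."""
    return [
        sigmoid(sum(w * v for w, v in zip(weights, vec)) + bias)
        for vec in sentence_vectors
    ]


def select_summary(sentences, scores, summary_size):
    """Keep the `summary_size` highest-scoring sentences.

    Returning them in document order is an assumption, not stated in the text.
    """
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:summary_size])
    return [sentences[i] for i in chosen]


if __name__ == "__main__":
    sentences = ["First sentence.", "Second sentence.", "Third sentence."]
    vectors = [[0.2, -0.1], [1.5, 0.7], [-0.4, 0.9]]   # toy encoder outputs
    scores = score_sentences(vectors, weights=[1.0, 1.0])
    print(select_summary(sentences, scores, summary_size=2))
```

Changing `summary_size` is all that is needed to produce shorter or longer extractive summaries from the same scores.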
Loss: The architectures are trained with a cross-entropy loss ($L_{CE}$), computed at the sentence level for extractive and at the vocabulary level for abstractive summarization.
Final Loss: The final end-to-end loss associated with this framework ($L_{total}$) is computed as the weighted sum of $L_{KD}$ and $L_{CE}$ in the NAS module.
RL Reward: A reward based on the performance of the child model is sent back to the RNN controller. The policy gradients of the RNN controller are updated through the REINFORCE algorithm (Williams, 1992). The reward ($R$) is defined as $1 - Loss_{valid}$, normalized over the batch size.
Re-training: The newly generated model is trained on the user-provided training data, optimizing for the total loss ($L_{total}$). This trained model can generate summaries for any given test sample.
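The loss terms and the controller reward used in this method can be sketched numerically as follows. The exact functional forms are assumptions: the abstractive distillation loss is written as the cross-entropy between teacher and child distributions summed over time steps, and the weighted sum is written as a convex combination with the KD proportion α. Other weightings would fit the description equally well, and all function names are hypothetical.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length lists of scores."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cross_entropy(target_dist, pred_dist, eps=1e-12):
    """-sum_w target(w) * log pred(w) for one distribution over the vocabulary."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_dist, pred_dist))


def kd_loss_extractive(teacher_scores, child_scores):
    """L_KD for extraction: MSE between teacher soft labels and child scores."""
    return mse(child_scores, teacher_scores)


def kd_loss_abstractive(teacher_dists, child_dists):
    """L_KD for abstraction (assumed form): teacher/child cross-entropy per step."""
    return sum(cross_entropy(t, c) for t, c in zip(teacher_dists, child_dists))


def total_loss(l_kd, l_ce, alpha):
    """Assumed weighting: alpha * L_KD + (1 - alpha) * L_CE."""
    return alpha * l_kd + (1.0 - alpha) * l_ce


def reward(valid_losses):
    """R = 1 - validation loss, averaged over the examples in the batch."""
    return 1.0 - sum(valid_losses) / len(valid_losses)


if __name__ == "__main__":
    l_kd = kd_loss_extractive([0.9, 0.1, 0.6], [0.7, 0.3, 0.5])
    l_ce = cross_entropy([1.0, 0.0], [0.8, 0.2])
    print(total_loss(l_kd, l_ce, alpha=0.5), reward([0.2, 0.4]))
```

With α = 0 the child is trained on ground truth alone, and with α = 1 it is trained only to match the teacher.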

AutoSumm-Distill
In this approach, the structure of the child is defined as a mini-transformer (4 layers). A knowledge distillation technique called transformer distillation (Jiao et al., 2019) is used to create a general-mini-transformer (4 layers) from a large transformer model (12 layers). Then, knowledge is distilled from a task-specific fine-tuned BERT ('teacher') model to the general-mini-transformer. Figure 3 illustrates the workflow of this method. This method differs from AUTOSUMM-CREATE in the child model architecture and in its use of two transformer teacher models. The key stages in this method are detailed below.
Knowledge distillation: There are two forms of knowledge distillation in this method (1 and 3 in Fig. 3). We detail the knowledge distillation from a task-specific transformer teacher (we use BERT-Summ (Liu and Lapata, 2019b)) to the general-mini-transformer, which forms the encoder of the final child model. The decoder is pre-defined based on the task, as in AUTOSUMM-CREATE. A transformer model has various types of layers, including multi-head attention, embedding, and hidden layers. The intuition behind knowledge distillation is to teach the layers in the child transformer to mimic the corresponding layers in the teacher transformer. This is implemented by introducing a separate loss for each layer type.
Attention-based distillation builds on the intuition that the attention layers in BERT capture linguistic information such as syntax and coreference. Specifically, the student learns to match the multi-head attention matrices of the teacher. This loss is given by $L_{attn} = \frac{1}{h} \sum_{i=1}^{h} \mathrm{MSE}(A_i^S, A_i^T)$, where $h$ is the number of attention heads, $A_i \in \mathbb{R}^{l \times l}$ is the attention matrix of the $i$-th head of the teacher ($T$) or the student ($S$), $l$ is the input text length, and $\mathrm{MSE}(\cdot)$ is the mean squared error loss.
Hidden-state distillation distills knowledge from the output of the transformer hidden layers, with $L_{hidn} = \mathrm{MSE}(H^S W_h, H^T)$, where $H^S$ and $H^T$ are the hidden states of the student and teacher models and $W_h$ is a learnable linear transformation.
Embedding-layer distillation is formulated as $L_{embd} = \mathrm{MSE}(E^S W_e, E^T)$, where $E^S$ and $E^T$ are the embeddings in the student and teacher networks respectively, and $W_e$ plays a similar role to $W_h$.
Using these distillation objectives, along with the general distillation already done to compress the transformer model into the general-mini-transformer, the final loss is the unified distillation loss over the corresponding layers of the teacher and the student model. As a reminder, this step auto-learns the task-specific encoder for extractive and abstractive summarization.
Pre-defined Model Specifications: For extractive summarization, we define a single transformer layer on top of the newly created encoder, with a classification layer as the decoder. For abstractive summarization, the decoder is a 6-layer transformer.
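The layer-wise distillation losses described above can be written out in plain Python as a sketch. Matrices are nested lists so the example stays dependency-free; a real implementation would use tensors and learn the projections $W_h$ and $W_e$ by gradient descent, whereas here they are fixed inputs. The function names are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equally shaped matrices (lists of rows)."""
    n = sum(len(row) for row in a)
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n


def matmul(a, b):
    """Plain-Python product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][p] * b[p][j] for p in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def attention_loss(student_heads, teacher_heads):
    """L_attn: MSE between per-head attention matrices, averaged over h heads."""
    h = len(student_heads)
    return sum(mse(s, t) for s, t in zip(student_heads, teacher_heads)) / h


def projected_loss(student, teacher, projection):
    """MSE(student @ projection, teacher): used for L_hidn and L_embd.

    The projection maps the narrower student width to the teacher width.
    """
    return mse(matmul(student, projection), teacher)


if __name__ == "__main__":
    student_attn = [[[0.5, 0.5], [0.5, 0.5]]]        # h = 1 head, l = 2 tokens
    teacher_attn = [[[1.0, 0.0], [0.0, 1.0]]]
    h_student = [[1.0, 0.0], [0.0, 1.0]]             # l x d_student
    w_h = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]]         # d_student x d_teacher
    h_teacher = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]]   # l x d_teacher
    print(attention_loss(student_attn, teacher_attn))
    print(projected_loss(h_student, h_teacher, w_h))
```

The projection is what lets a student with a smaller hidden size be compared against a wider teacher.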
Training and Re-training: The above model is trained in a phased manner.
General distillation & Fine-tuning: The first distillation is done from a large transformer (BERT) to the general-mini-transformer. In parallel, a large BERT model is fine-tuned for the specified tasks. Neither step needs to be repeated for every new user dataset or every run of the model: the fine-tuned model and the general-mini-transformer may be created once per task and once per very large benchmark dataset.
Task-specific Distillation: This process of teaching the student model from a fine-tuned teacher model is repeated each time a new user dataset is given to the system. Once trained, the student is coupled with the task-specific decoder.
Re-training: Once the final child model, i.e., the mini-transformer encoder and the corresponding decoder, is created, the complete model is trained on the input user dataset. The final model is the output for the user, along with test summaries for any given text input.

Experiments
We evaluate our proposed framework with experiments that test the performance of the newly created models on benchmark summarization datasets, for both extractive and abstractive tasks.

Datasets
The New York Times (NYT) Annotated Corpus contains the full text and metadata of NYT articles from 1987 to 2007. Following Durrett et al. (2016), we selected the articles from 2000-2007 whose abstractive summaries have more than 50 words. The articles were split based on the date of publication, with the articles from January 1, 2007 chosen as the test set.
The X-Sum dataset (Narayan et al., 2018a) is collected from online BBC articles, with short one-sentence summaries.
The Gigaword dataset (Rush et al., 2015) contains 4M examples from news articles for the sentence summarization / headline generation task. The summaries are very short, with 9 tokens per summary.
The Contract dataset (Manor and Li, 2019) is compiled from two websites dedicated to explaining unilateral contracts in plain English: TL;DRLegal and TOS;DR. It is a small dataset with 500 samples.

Models
The models generated through AUTOSUMM-CREATE for extractive and abstractive summarization are CHILD-EXT and CHILD-ABS respectively. The suffix -KD denotes child model variants trained with knowledge distillation (KD). The fine-tuned models obtained through AUTOSUMM-DISTILL are FT-TINYBERT-EXT and FT-TINYBERT-ABS. We compare the performance of our models against BERT-Summ (Liu and Lapata, 2019a), as it offers a general framework for both extractive and abstractive summarization and was shown to give state-of-the-art performance. These baseline models are FT-BERT-EXT and FT-BERT-ABS.

Implementation Details
For all our experiments, we use the existing splits if available; otherwise, we split the data according to the statistics in Table 1 and keep the splits constant across all experiments. In our AUTOSUMM-CREATE experiments, we perform a 20-layer neural architecture search for the encoder. The decoders are task-specific and predefined, as explained in the previous section. We use GloVe word embeddings as input to the generated model. We set the batch size to 128, the maximum input length to 64, the hidden unit dimension of each layer to 32, and the dropout ratio to 0.5, and we apply L2 regularization. We use the Adam optimizer and learning rate decay with cosine annealing. The KD proportion parameter α is varied in the NAS module. We also perform experiments with varying layer size, discussed in the later sections.
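For reference, cosine-annealed learning rate decay can be sketched as below. The text does not give the base learning rate, the minimum rate, or the schedule length, so the values in the usage lines are placeholders, and the function name `cosine_annealed_lr` is hypothetical.

```python
import math


def cosine_annealed_lr(step, total_steps, base_lr, min_lr=0.0):
    """Learning rate at `step`, decaying from base_lr to min_lr on a half cosine."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Placeholder values, chosen only to show the shape of the schedule.
    for step in (0, 250, 500, 750, 1000):
        print(step, cosine_annealed_lr(step, total_steps=1000, base_lr=1e-3))
```

The rate falls slowly at first, fastest around the midpoint, and flattens out again near the end of training.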

Evaluation metrics
Summarization quality is evaluated using the F1 measure of the ROUGE score (Lin, 2004), calculated between the generated and ground-truth summaries. We report unigram and bigram overlap (ROUGE-1 and ROUGE-2) to assess informativeness and the longest common subsequence (ROUGE-L) to assess fluency. Additional metrics, namely the number of parameters, disk space, and inference time, are used to compare computational efficiency between models.
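The three ROUGE variants can be illustrated with a simplified sketch. It uses whitespace tokenization and omits the stemming and other preprocessing of the official ROUGE toolkit, so its numbers will not match reported scores exactly; the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, cand_total, ref_total):
    """F1 from a match count and the candidate/reference totals."""
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision, recall = overlap / cand_total, overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    """ROUGE-N F1 based on clipped n-gram overlap."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())   # Counter & takes per-n-gram minimum
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def rouge_l(candidate, reference):
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    c, r = candidate.split(), reference.split()
    return f1(lcs_length(c, r), len(c), len(r))


if __name__ == "__main__":
    generated = "the cat sat on the mat"
    gold = "the cat is on the mat"
    print(rouge_n(generated, gold, 1), rouge_n(generated, gold, 2), rouge_l(generated, gold))
```

ROUGE-L rewards words that appear in the same order as in the reference even when they are not adjacent, which is why it is read as a proxy for fluency.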

Results and discussion
Extractive Summarization: Table 2 compares the performance of our generated models and the baseline for extractive summarization across different datasets. The ROUGE scores show that the summaries from the auto-generated models are close to those of the state-of-the-art BERT baseline, while our models gain significantly in computational efficiency. Figure 4 illustrates the same when the models trained on the CNN/DM dataset are compared along three aspects: number of parameters (in millions), disk space for storing the model (in MB), and the inference time (in milliseconds) needed to generate the summary of an input sample. These graphs show that the generated models, with performance comparable to FT-BERT, significantly reduce disk space usage and the number of parameters. We note that the models generated by AUTOSUMM-CREATE lose some performance in terms of inference time, because the model architecture contains RNNs and lacks the parallel computation available in BERT models. However, our FT-TINYBERT-EXT model overcomes this and significantly reduces the inference time.
Table 2: A comparison of the generated models for extractive summarization on CNN/DM and NYT; FT-BERT-EXT is used as the baseline to compare against.
Abstractive Summarization: Table 3 compares the performance of our abstractive summarization models on the Gigaword dataset, curated for extreme summarization. Notably, our proposed summarization model with Transformer distillation, FT-TINYBERT-ABS, outperforms FT-BERT-ABS by a large margin across R-1, R-2, and R-L. The other dataset for extreme summarization is the Contract dataset. Table 4 shows the performance of our generated CHILD-ABS model on this dataset. We compare our results against the best-performing Lead-K scores reported by Cohen et al. (2018). Note that the limited size of the dataset was a bottleneck for training the FT-TINYBERT-ABS model.

Model Architectures:
The AUTOSUMM-CREATE approach generates a new encoder architecture from scratch for a desired task and dataset. It is instructive to examine the distribution of cells in the generated models. Table 5 shows the distribution of cell types in the generated models with a 10-layer encoder architecture, on extractive (CNN/DM) and abstractive (Gigaword) tasks. The pooling and attention layers are sparse in the extractive models but are major contributors in the abstractive architecture. Most recent models use the multi-head attention of the transformer to get good results on language generation tasks, and a similar pattern is observed in the models generated through our AutoSumm framework.

Ablation Study
Variation in Layer Size: We analyze the performance of our framework across varying sizes of the target model, i.e., varying the number of layers to be generated by the RNN controller. We experiment with CHILD-EXT models. Figure 5 illustrates the results of this experiment for extractive summarization on the CNN/DM dataset. We observe that the CNN and RNN layers are the major constituents in these architectures. RNN cells are preferred when the architecture is restricted to fewer layers (like 2, 5, 6), but as we increase the number of layers, convolutional layers with the largest kernel size (7) are preferred. Table 6 reports the performance of these models. While the model with 15 layers gives the best performance, the performance does not drop much as the number of layers varies. Hence, a smaller model, with fewer layers and in turn fewer parameters, can be efficiently generated for extractive summarization through our approach.
Table 7 and Figure 6 present the results when the same experiment is conducted on the XSUM and Contract datasets with layer sizes 5, 10, 12 and 15. While the trend of RNN preference for fewer layers and CNN preference for more layers continues, we note that the larger generated architectures also use the pooling and attention layers.
Cross-Dataset Experiments: Table 8 shows experiments with CHILD-EXT-KD-X models trained on one dataset (X) and tested on another. The visualisations of these results in Figure 7 also show that architectures trained on one dataset can be used to generate summaries on a different dataset without significant loss in performance. This establishes the generalizability of the proposed approach and makes it usable by non-experts for real-world applications.
Training data variation: To reiterate, in our AUTOSUMM-CREATE framework, we generate an architecture with cells from a given search space and re-train the generated architecture with the training data and knowledge distillation. To measure the amount of data required for this re-training, we performed an experiment varying the size of the training data used for this purpose. Figure 8 illustrates the result of this experiment, where we vary the amount of training data (from 0% to 100% of the total available data) used to re-train an architecture searched for extractive summarization on the CNN/DM dataset. Here, 0% data refers to a randomly initialized model that has not been re-trained. Note that the ROUGE scores (R-1, R-2, R-L) reported for all these model variations are calculated on the test set. While it is intuitive that more training data results in improved performance, we note that decent performance is achieved even with 10% of the data. We believe this observation supports the hypothesis that, using the proposed framework, we can build usable models with limited training data.
Decoding Size variation: Table 9 reports how the ROUGE scores vary with the decoding summary size (1, 3, 5) on the CNN/DM extractive summarization task. While a summary size of 3 sentences yields the best result (partly due to the nature of the training data), we observe that the proposed framework allows generating shorter or longer summaries without significant loss in performance, again establishing its generalizability.
Table 10 shows qualitative examples of the output summaries for a couple of newly generated models. Through the above experiments, we establish that the models auto-generated using the proposed NAS and transformer-distillation based frameworks report near state-of-the-art performance for both extractive and abstractive summarization. We establish the generalizability of the models through various experiments, while also showing their efficacy when learning with limited data.

Conclusions
We present a framework for auto-generation of ML models for extraction and generation tasks by leveraging knowledge distillation, NAS, and transformer-distillation techniques. The proposed approach successfully creates new model architectures that are more efficient in terms of inference time and space while achieving near state-of-the-art accuracy across datasets for extractive and abstractive summarization. We believe our work can help lay the foundation for democratizing the use of deep learning in NLP applications for non-experts in practice.