ARMAN: Pre-training with Semantically Selecting and Reordering of Sentences for Persian Abstractive Summarization

Abstractive text summarization is one of the areas influenced by the emergence of pre-trained language models. Current pre-training works in abstractive summarization assign higher scores to summaries that share more words with the main text and pay less attention to the semantic similarity between generated sentences and the original document. We propose ARMAN, a Transformer-based encoder-decoder model pre-trained with three novel objectives to address this issue. In ARMAN, salient sentences from a document are selected according to a modified semantic score, masked, and used to form a pseudo summary. To summarize more accurately and more closely to human writing patterns, we applied modified sentence reordering. We evaluated our proposed models on six downstream Persian summarization tasks. Experimental results show that our proposed model achieves state-of-the-art performance on all six summarization tasks as measured by ROUGE and BERTScore. Our models also outperform prior works in textual entailment, question paraphrasing, and multiple-choice question answering. Finally, we conducted a human evaluation and show that using the semantic score significantly improves summarization results.


Introduction
Abstractive text summarization is the task of generating a short, fluent, and concise text that contains novel words and phrases not present in the original document while preserving its primary subjects. In contrast with extractive summarization, which aims to select the most important parts of the text to form a summary, the main goal of abstractive summarization is to generate a new, persuasive piece of text as the summary of a document.
Earlier abstractive summarization works (Hermann et al., 2015; See et al., 2017; Rush et al., 2015) focused on training with large datasets containing pairs of documents and summaries in a supervised manner. With the introduction of the Transformer architecture (Vaswani et al., 2017) and pre-training objectives, and their positive impact on most NLP tasks, most current state-of-the-art (SOTA) methods rely on self-supervised objectives for pre-training Transformer architectures for abstractive summarization (Liu and Lapata, 2019; Zhang et al., 2020a; Qi et al., 2020). However, current pre-training works assign higher scores to summaries that share more words with the main text and pay less attention to the semantic similarity between generated sentences and the original document.
According to Simons (2017), the Persian language is one of the top 25 spoken languages in the world. However, there is limited research on Persian document summarization, and most prior works focus on extractive summarization. The main focus of this work is on Persian abstractive summarization. Nevertheless, our proposed method is language-independent.
In this work, we first bring semantic similarity scores into a sentence selection schema to create a document's pseudo summary. Briefly, we prepare a summary corresponding to each document in a dataset by selecting important sentences based on semantic scores in a self-supervised manner. Next, we propose three novel objectives for pre-training a seq2seq Transformer. Our model, ARMAN, uses a Transformer encoder-decoder structure and introduces a new combination of sentence masking with sentence shuffling and reordering objectives. We fine-tuned the models on six downstream tasks. In one experiment, we found that allowing the model to copy pieces of the input text into the output summary does not lead to better results in downstream tasks. Experimental results showed that our proposed models obtained SOTA performance on all Persian abstractive summarization datasets on both ROUGE (Lin, 2004) and BERTScore (Zhang et al., 2020b). Our models generated even better summaries than the previous SOTA in zero- and few-shot settings when fine-tuned with a small number of document-summary pairs; we achieved SOTA results on two datasets with only 1K examples. Moreover, our proposed models performed well on other NLU tasks, including textual entailment, question paraphrasing, and multiple-choice question answering. Finally, to verify the significance of the improvement in summarization, we conducted a human evaluation and performed Student's t-tests on its results.
The main contributions of this paper are twofold:
• We introduce a top-sentence selection algorithm based on a semantic score to make document-summary pairs in a self-supervised manner.
• We propose three novel objectives to pre-train a Transformer encoder-decoder architecture for Persian abstractive text summarization that outperforms previous state-of-the-art models on six downstream tasks.

Related Work
Automatic text summarization was initially performed with statistical methods (Nenkova, 2005), most of which strove to rank sentences by extracting their features (Svore et al., 2007; Erkan and Radev, 2004; Filippova and Altun, 2013). With the rise of sequence-to-sequence learning with neural networks (Hochreiter and Schmidhuber, 1997; Sutskever et al., 2014) and the use of the attention mechanism (Bahdanau et al., 2015) in abstractive summarization (Nallapati et al., 2016), a new era in abstractive summarization began. With the introduction of the Transformer (Vaswani et al., 2017) and the Masked Language Modeling (MLM) method of BERT (Devlin et al., 2019), most NLP tasks saw substantial improvements from these pre-training methods and architectures. Following BERT's approach, many other language models were trained (Liu et al., 2019; Joshi et al., 2020), differing in the amount of pre-training data and in optimizations of BERT's pre-training method; most of them were encoder-only. Furthermore, encoder-decoder models such as T5 (Raffel et al., 2020) and BART (Lewis et al., 2020) were trained with a mixture of pre-training tasks.
Since pre-training Transformers was successful on most NLP tasks, some models were pre-trained for specific duties. PEGASUS (Zhang et al., 2020a) is a model trained specifically for summarization on the C4 and HugeNews corpora. PEGASUS is trained with Gap Sentence Generation (GSG), which masks the most important sentences of a document based on syntactic similarity. ARMAN differs from PEGASUS in that we mask the most important sentences based on the semantic similarity of sentences. Furthermore, we use only a single mask token for any run of consecutive sentences that should be masked; this helps the model learn how many sentences should be generated for each masked token in the input sequence. STEP (Zou et al., 2020) is another pre-trained summarization model, trained with MLM, Next Sentence Generation (NSG), and Sentence Reordering (SR) objectives. ARMAN uses SR as one of its pre-training methods in a modified form: we change the order of sentences in the input document, and the model should select the most important sentences using their semantic similarity to the document and then reorder them into the order in which they appeared in the original document.
In the Persian language, several extractive summarization methods exist (Khademi et al., 2018; Rezaei et al., 2019; Kermani and Ghanbari, 2019; Khademi and Fakhredanesh, 2020), but to the best of our knowledge, there is only one prior model for abstractive summarization: Farahani et al. (2020b) used a ParsBERT (Farahani et al., 2020a) checkpoint with the method of Rothe et al. (2020) to train a new sequence-to-sequence model with pre-trained weights for the encoder and decoder. In this regard, ARMAN is one of the first works on abstractive summarization for the Persian language, and it achieves SOTA results on all available datasets.

Methodology
This section introduces a sentence selection method based on semantic similarity scores to make a pseudo summary. Then, we propose three novel objectives for pre-training a seq2seq model for the abstractive summarization tasks.

Top Sentence Selection (TSS)
In this work, we introduce a new semantic-based approach for selecting important document sentences to make a pseudo summary. The pseudo summary consists of the important sentences of a given document, and the models are trained to generate an output similar to the pseudo summary corresponding to the document. For comparison, we also use a syntactic-based metric to select sentences from the original document. Inspired by recent work on generating pseudo summaries (Zhang et al., 2020a), we select sentences from a document based on two strategies and concatenate them to create a pseudo summary. For each document in a data collection, we make a summary as described in Algorithm 1. First, we calculate a score function for each pair (sentence, document \ sentence). Then we select the top m sentences and merge them to make the pseudo summary. The parameter m is chosen based on the number of sentences in the document.

Algorithm 1: Top Sentence Selection
Input: Document
Output: Text, Summary
for s_i in Document do
    r_i := score_func(s_i, Document \ s_i)
end for
Summary := ∅
Text := Document
for j ← 1 to m do
    k := argmax_i {r_i : s_i ∉ Summary}
    Summary := Summary ∪ {s_k}
    Text := Text \ {s_k}
end for

Syntactic-based approach: In this strategy, we create a pseudo summary by selecting and merging sentences from a document using a syntactic-based approach. ROUGE is a widely used metric that calculates the similarity between a candidate sentence and a collection of reference sentences based on the overlap of N-grams (Lin, 2004); the higher the ROUGE score between two pieces of text, the more similar they are. Here, score_func in Algorithm 1 calculates the ROUGE-1 F1 score between a sentence and the remaining sentences of the document. PEGASUS (Zhang et al., 2020a) used such a method as Gap Sentence Generation.

Semantic-based approach: Although selecting sentences based on the ROUGE metric is simple, cost-effective, and usable in low-resource languages, ROUGE comes with some drawbacks (Kryscinski et al., 2019). In particular, ROUGE does not account for different words with the same meaning, since it only counts syntactic matches. Thus, two sentences with the same meaning but expressed with different words will be assigned a low ROUGE score. To the best of our knowledge, this paper is the first to study semantic similarity in creating pseudo summaries and its effect on the quality of generated summaries.
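The selection loop of Algorithm 1 can be sketched in a few lines of Python; `score_func` is a pluggable placeholder for either the ROUGE-1 F1 score or the semantic score:

```python
def top_sentence_selection(sentences, score_func, m):
    """Select the top-m sentences as a pseudo summary (Algorithm 1 sketch).

    `score_func(sentence, rest)` is a placeholder for ROUGE-1 F1 or a
    semantic similarity score between one sentence and the rest of the
    document.
    """
    # Score every sentence against the remainder of the document.
    scores = []
    for i, s in enumerate(sentences):
        rest = sentences[:i] + sentences[i + 1:]
        scores.append(score_func(s, rest))

    # Greedily take the m highest-scoring sentences (argmax loop of Algorithm 1).
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:m])

    summary = [s for i, s in enumerate(sentences) if i in chosen]
    text = [s for i, s in enumerate(sentences) if i not in chosen]
    return text, summary
```

In pre-training, `summary` becomes the target the model must generate and `text` (with selected positions masked) becomes the encoder input.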
To consider the semantic score in calculating the similarity of two sentences, we used the recent BERTScore metric, which computes a similarity score between each token in the candidate sentence and each token in the reference sentence using contextual embeddings (Zhang et al., 2020b). Due to the high computational cost of calculating this metric for each pair (sentence, document \ sentence), we used FastText (Bojanowski et al., 2017) pre-trained embeddings instead of BERT contextual embeddings. Following BERTScore, for a reference x and a candidate x̂ with (normalized) token embeddings x_i and x̂_j, the recall, precision, and F1 scores are

R = (1/|x|) Σ_{x_i ∈ x} max_{x̂_j ∈ x̂} x_i⊤ x̂_j
P = (1/|x̂|) Σ_{x̂_j ∈ x̂} max_{x_i ∈ x} x_i⊤ x̂_j
F1 = 2 · P · R / (P + R)

To apply the semantic score, the score function in Algorithm 1 calculates F1 computed with FastText embeddings, which we denote F1^FT.
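A minimal pure-Python sketch of this greedy-matching F1 with static word vectors follows; the `emb` mapping stands in for pre-trained FastText embeddings (real usage would load those vectors), and it assumes all vectors are nonzero:

```python
from math import sqrt

def _cos(u, v):
    """Cosine similarity between two vectors (assumed nonzero)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def greedy_f1(ref_tokens, cand_tokens, emb):
    """BERTScore-style greedy-matching F1 with static word embeddings.

    `emb` maps a token to its vector; in the paper FastText embeddings
    play this role.
    """
    # Recall: each reference token greedily matches its best candidate token.
    recall = sum(
        max(_cos(emb[r], emb[c]) for c in cand_tokens) for r in ref_tokens
    ) / len(ref_tokens)
    # Precision: each candidate token greedily matches its best reference token.
    precision = sum(
        max(_cos(emb[c], emb[r]) for r in ref_tokens) for c in cand_tokens
    ) / len(cand_tokens)
    return 2 * precision * recall / (precision + recall)
```

With contextual embeddings this is exactly BERTScore; swapping in static FastText vectors trades accuracy for the speed needed to score every sentence of every document.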

Pre-training Objectives
In this work, we propose new pre-training objectives and compare our models with the closely related work PEGASUS (Zhang et al., 2020a). We use a Transformer encoder-decoder structure and introduce a new combination of sentence masking with shuffling and reordering objectives. The general procedure of pre-training with the proposed objectives is shown in Figure 1.

TSS-ROUGE
In this objective, we implemented PEGASUS for the Persian language to compare with our proposed models. The base architecture of this model is a Transformer encoder-decoder. Instead of masking words, we mask sentences with <mask> tokens. In order to generate pseudo summaries as input to this structure, the syntactic-based approach using the ROUGE metric is applied.

Figure 1: The procedure of making input and output for pre-training the Seq2Seq Transformer. TSS selects the salient sentences and divides the original document into text and summary parts. The summary part is the desired output that the Transformer should generate.

TSS-Semantic Similarity (SS)
This objective takes semantically created pseudo summaries into account. It follows TSS-ROUGE, except that the semantic-based approach using the modified BERTScore is applied to generate the pseudo summaries, and the masking criterion differs slightly: we put only one <mask> token in place of any run of consecutive sentences that should be masked, so the model also learns to guess the number of masked sentences. In 20% of the cases, instead of masking a selected sentence, we keep it in place; this makes the model learn to bring some pieces of the document into the summary. We call the model trained with this objective ARMAN(SS-80).
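The single-mask construction might look as follows; sentence splitting and tokenization are omitted, and `keep_prob=0.2` reproduces the SS-80 setting:

```python
import random

def mask_with_single_token(sentences, selected, keep_prob=0.2, rng=None):
    """Build one (input, target) pair for the TSS-SS objective (sketch).

    Every maximal run of consecutive selected sentences collapses into ONE
    <mask> token; with probability `keep_prob` a selected sentence is kept
    in place instead (the ARMAN(SS-80) variant; keep_prob=0 gives SS-100).
    """
    rng = rng or random.Random(0)
    masked_input, target = [], []
    prev_masked = False
    for i, s in enumerate(sentences):
        if i in selected:
            target.append(s)  # selected sentences form the pseudo summary
            if rng.random() < keep_prob:
                masked_input.append(s)  # keep the sentence in place
                prev_masked = False
            elif not prev_masked:
                masked_input.append("<mask>")  # one token per consecutive run
                prev_masked = True
            # further sentences in the same masked run add nothing to the input
        else:
            masked_input.append(s)
            prev_masked = False
    return " ".join(masked_input), " ".join(target)
```

Because consecutive masked sentences share a single token, the decoder must infer how many sentences each <mask> stands for.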

TSS-Shuffling (SH)
In addition to using the semantic-based approach for creating a pseudo summary, this objective combines the masking objective with span shuffling inside sentences. In particular, in 20% of the cases, instead of masking a sentence, we shuffle a span of its tokens. The intuition is that the model learns not to simply copy sentences into the final summary and becomes sensitive to token order at the span level. We call the model trained with this objective ARMAN(SH).
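A possible sketch of the span-shuffling operation; the span length here is an illustrative choice, not a value specified in the text:

```python
import random

def shuffle_span(tokens, span_len=3, rng=None):
    """Shuffle one random contiguous span of tokens in a sentence (TSS-SH sketch)."""
    rng = rng or random.Random(0)
    if len(tokens) <= span_len:
        out = tokens[:]
        rng.shuffle(out)  # sentence shorter than the span: shuffle it all
        return out
    # Pick a random span start, shuffle only that window, keep the rest intact.
    start = rng.randrange(len(tokens) - span_len + 1)
    span = tokens[start:start + span_len]
    rng.shuffle(span)
    return tokens[:start] + span + tokens[start + span_len:]
```

Reconstructing the original order of the shuffled span is what pushes the model away from verbatim copying.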

TSS-Modified Sentence Reordering (MSR)
In this objective, we apply masking as in the TSS-Semantic Similarity objective for 90% of documents, and for the remaining 10% we shuffle all sentences. In the latter case, the model should reorder the sentences and keep only the top 30% most important sentences of the original document according to the semantic scores. The idea behind this method is that the model learns to arrange sentences in the correct order in the final summary and to focus on the important pieces of the document. In addition to enriching the summary semantically, this objective also encourages brevity. We call the model trained with this objective ARMAN(MSR).
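The MSR example construction can be sketched as follows; for brevity this sketch masks each selected sentence separately rather than collapsing consecutive runs into one token as the full objective does:

```python
import random

def build_msr_example(sentences, important, p_reorder=0.1, rng=None):
    """Construct one MSR pre-training example (sketch).

    With probability 0.9 this behaves like the TSS-SS masking objective;
    with probability 0.1 the whole input is shuffled and the target is the
    important sentences in their ORIGINAL order, so the model learns both
    to reorder and to keep only the salient sentences.
    """
    rng = rng or random.Random(0)
    if rng.random() < p_reorder:
        # Reordering branch: shuffle everything, target keeps original order.
        shuffled = sentences[:]
        rng.shuffle(shuffled)
        target = [s for s in sentences if s in important]
        return " ".join(shuffled), " ".join(target)
    # Masking branch (simplified: one <mask> per selected sentence).
    source = ["<mask>" if s in important else s for s in sentences]
    return " ".join(source), " ".join(s for s in sentences if s in important)
```

Setting `p_reorder` to 0 or 1 in the test below forces each branch deterministically.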

Data Collection
This section introduces the datasets used for pretraining and fine-tuning models and the procedure of cleaning corpora.

Pre-training Datasets
We merged four large Persian corpora from different sources, containing both formal and informal texts, for pre-training the models.
irBlogs (AleAhmad et al., 2016) is a collection of 5M+ posts from 600K+ Persian weblogs. Some blogs use informal language for their posts, so this dataset has an enormous amount of informal texts, which could help our models become familiar with this type of Persian speech.
MirasText (Sabeti et al., 2018) is a large corpus of Persian text automatically collected from Persian websites. YJC News 2 is a collection of articles gathered from the Young Journalist Club website 3 . This dataset contains 1M+ news articles on various subjects.

Downstream Datasets
For the summarization task, five datasets were used. All datasets are publicly available and can be used to reproduce our results. Following Grusky et al. (2018), the extractive density and coverage for each summarization dataset are reported in Appendix A. Moreover, we used a Natural Language Understanding (NLU) dataset to test our models' performance on language understanding tasks.

Preprocessing
Pre-training language models requires a massive amount of data, so we needed to collect large datasets, and those datasets needed to be cleaned. We adopted a heuristic function to produce an automatic pipeline for cleaning our pre-training datasets. First, for each document in each dataset, we separated the sentences and removed those with any of the following characteristics: 1) sentences with fewer than five words; 2) sentences that do not end with valid Persian end-of-sentence marks; 3) sentences that contain specific keywords from Persian webpages and JavaScript code. Furthermore, we discarded documents left with fewer than three sentences after the above cleaning. Next, we used the langdetect 6 package to filter out any document not identified as Persian with probability at least 0.99. Lastly, we removed duplicate paragraphs of documents. More information about the size of each corpus after cleaning is reported in Appendix A. Our heuristic was inspired by the methods of Raffel et al. (2020). This preprocessing procedure was used only for the pre-training datasets.
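The sentence-level filtering rules can be sketched as follows; the end-of-sentence marks and the keyword list are illustrative stand-ins for the paper's actual lists, and the langdetect and deduplication steps are omitted:

```python
def clean_document(doc, bad_keywords=("javascript", "document.write")):
    """Apply the sentence-level cleaning heuristics to one document (sketch).

    `doc` is a newline-separated string of sentences. Returns the kept
    sentences, or an empty list if fewer than three survive.
    """
    valid_ends = (".", "!", "?", "؟")  # assumption: typical sentence-final marks
    kept = []
    for sent in doc.split("\n"):
        sent = sent.strip()
        if len(sent.split()) < 5:
            continue  # rule 1: fewer than five words
        if not sent.endswith(valid_ends):
            continue  # rule 2: no valid end-of-sentence mark
        if any(k in sent for k in bad_keywords):
            continue  # rule 3: web/JS boilerplate keywords
        kept.append(sent)
    # Drop documents left with fewer than three sentences.
    return kept if len(kept) >= 3 else []
```

A real pipeline would then run language identification over the surviving documents and deduplicate paragraphs across the corpus.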

Experiments
In this section, we compare ARMAN with previous works and conduct several experiments to assess the performance of the proposed methods. The codes for pre-training and fine-tuning of all models are publicly available on GitHub 7 .

Pre-training and Implementation
Our model is based on the Transformer (Vaswani et al., 2017) encoder-decoder structure. We pre-trained ARMAN with a 12-layer encoder and a 12-layer decoder, 768 embedding/hidden size, 3072 feed-forward filter size, and 12 self-attention heads. ARMAN and PEGASUS were trained on the pre-training corpora described in Section 4.1.
The batch size and the training steps of pre-training were set to 128 and 1M, respectively. Adafactor (Shazeer and Stern, 2018) with square root learning rate decay and a dropout rate of 0.1 was used in pre-training and fine-tuning. Pre-training experiments were carried out on the Google Colab platform with TPU v2-8. It took almost 11 days for 1M steps to train ARMAN. Also, we sampled 1M documents from the CC100 dataset and used the SentencePiece Unigram algorithm (Kudo, 2018) to generate the vocabulary for our models. The size of the vocabulary was 96K in all experiments.

Fine-tuning on Text Summarization
Abstractive summarization aims to produce a short, fluent, and concise text using advanced natural language techniques to extract essential information from the original document. We fine-tuned our pre-trained models on six downstream tasks. In all experiments, we set the input length (L_input) to 512 and the output length to 256. Also, we used beam search following Wu et al. (2016), with a beam size of 8 and a length penalty of 0.8. More information about the experimental setup is reported in Appendix B. Table 1 shows results based on standard ROUGE metrics. To compare summaries generated by our models with the state-of-the-art PEGASUS base using a text generation evaluation metric, we report results based on the original BERTScore (Zhang et al., 2020b) (using bert-base-multilingual-cased as pre-trained contextual embeddings) in Table 2. Both tables show the performance improvements of ARMAN(MSR) base on all downstream datasets. According to Tables 1 and 2, even ARMAN(SS) base , our basic proposed method, outperforms PEGASUS base on all datasets. These results show that considering semantic similarity in pre-training objectives is critical to improving the final summary.
In ARMAN(MSR) base , we encouraged the model to learn the correct relative order between sentences by reordering at the sentence level. The results of this model show that the reordering objective improves summarization. Our second model, ARMAN(SH) base , does not help improve the quality of summaries, so we conclude that shuffling at the span level leads to a sub-optimal result, consistent with the findings of Raffel et al. (2020).

To copy or not to copy!
We observed that PEGASUS large 8 tries to copy sentences from the document into the generated summary when it is not fine-tuned on any summarization dataset. The intuition is that when the task is to copy a sentence and the model is rewarded for that copying, the model becomes biased toward copying sentences to increase the probability of producing a significant match. In other words, it always copies some sentences from the input to the output in the hope that they will match the target, because this decreases the loss. We set up an experiment to observe the behavior of our models when they are not encouraged to copy input sentences into the output. According to the semantic score, all proposed methods selected 30% of the top-ranked sentences. In this experiment, we pre-trained ARMAN(SS) base with two different masking rates in the TSS objective: 1) ARMAN(SS-80) base masked only 80% of the important sentences and left the other 20% unchanged in the input text; 2) ARMAN(SS-100) base masked all of the important sentences, without copying any sentence from the input text into the pseudo summary.
Results in Figure 2 show that in a zero-shot setting, ARMAN(SS-100) base produces a higher ROUGE score when copying is not part of the pre-training objective. Additionally, we fine-tuned ARMAN(SS-100) and ARMAN(SS-80) on downstream tasks. Results in Table 3 and Figure 2 show that ARMAN(SS-100) base performs better than ARMAN(SS-80) both before and after fine-tuning. Given these results, we used this more effective criterion in our best model, ARMAN(MSR) base .

Factual Consistency and Abstractiveness
From another perspective, we compared the abstractiveness and factual consistency of our best model, ARMAN(MSR), with PEGASUS on the downstream summarization tasks, because they are important factors in assessing the quality of summaries.
To compare the abstractiveness of the models, we calculated the coverage and density (Grusky et al., 2018) of summaries generated by each model. A higher coverage value indicates that the summary uses fewer novel words, and a higher density value indicates a more extractive summary. The average density and coverage of ARMAN(MSR) and PEGASUS on each dataset are reported in Table 4. The results show that ARMAN has lower density and coverage than PEGASUS on 4 out of 6 tasks. On the Tebyan dataset, ARMAN has higher density but lower coverage, which means ARMAN uses more novel words than PEGASUS. Therefore, we conclude that ARMAN's summaries are more abstractive than PEGASUS's.
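Coverage and density are computed from the shared token fragments between article and summary; a minimal sketch in the spirit of Grusky et al. (2018), with the greedy fragment matching simplified:

```python
def extractive_fragments(article_tokens, summary_tokens):
    """Greedily extract shared token fragments F(A, S) (simplified sketch)."""
    frags, i = [], 0
    A, S = article_tokens, summary_tokens
    while i < len(S):
        # Longest article match starting at summary position i.
        best = 0
        for j in range(len(A)):
            k = 0
            while i + k < len(S) and j + k < len(A) and S[i + k] == A[j + k]:
                k += 1
            best = max(best, k)
        if best > 0:
            frags.append(S[i:i + best])
            i += best
        else:
            i += 1  # novel token: not part of any fragment
    return frags

def coverage_density(article_tokens, summary_tokens):
    """Coverage = (1/|S|) Σ|f|; Density = (1/|S|) Σ|f|² over fragments f."""
    frags = extractive_fragments(article_tokens, summary_tokens)
    n = len(summary_tokens)
    coverage = sum(len(f) for f in frags) / n
    density = sum(len(f) ** 2 for f in frags) / n
    return coverage, density
```

A fully extractive summary (one long copied fragment) maximizes both values, while a paraphrased summary drives them toward zero.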
To compare the factual consistency of the models, we calculated the precision-source and F1-target metrics (Nan et al., 2021). While these metrics evaluate entity-level factual consistency, they still give considerable information about the factual consistency of models. To extract named entities, we used the ParsBERT (Farahani et al., 2020a) model trained on the PAYMA (Shahshahani et al., 2019) dataset 9 . The average precision-source and F1-target of ARMAN(MSR) and PEGASUS on each dataset are reported in Table 5. The results show that ARMAN has higher F1-target and precision-source scores than PEGASUS on 5 out of 6 tasks. Therefore, ARMAN appears to be more factually consistent than PEGASUS.
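Given entity sets extracted by an NER model, the two metrics reduce to simple set arithmetic; a sketch (the entity extraction step itself, done with ParsBERT in the paper, is not shown):

```python
def precision_source(summary_ents, source_ents):
    """Share of generated-summary entities that also appear in the source."""
    gen = set(summary_ents)
    if not gen:
        return 0.0
    return len(gen & set(source_ents)) / len(gen)

def f1_target(summary_ents, target_ents):
    """Entity-level F1 between the generated and reference summaries."""
    gen, ref = set(summary_ents), set(target_ents)
    if not gen or not ref:
        return 0.0
    p = len(gen & ref) / len(gen)  # generated entities found in the reference
    r = len(gen & ref) / len(ref)  # reference entities found in the generation
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)
```

A low precision-source signals hallucinated entities, while a low F1-target signals entities that diverge from the reference summary.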

Zero and Few Shot Summarization
We studied our models in zero- and few-shot settings to make abstractive summarization a practical solution for real-world tasks where providing a large supervised collection of training and testing data is laborious. In the zero-shot setting, we pre-trained the models on the pre-training datasets and evaluated them on downstream tasks without fine-tuning. Results in Figure 2 show that our models outperformed PEGASUS. In the few-shot setting, we fed our best model 10^k (k = 1, 2, 3, 4) examples to study its performance in low-resource scenarios. In this experiment, Transformer base and ARMAN(MSR) base were trained for 150K and 2K steps, respectively. According to Figure 3, on the Wiki Summary and VOA datasets, our model beat the state-of-the-art model after seeing only 1K samples. On the larger Perkey dataset, our model did not beat Transformer base , which was fine-tuned on the whole dataset with more steps. We conclude that our model achieves acceptable results with less data and fewer computational resources.

NLU Results
In order to study if ARMAN works well as a language model, we tested our models in Natural Language Understanding (NLU) tasks. According to Khashabi et al. (2020), we selected multiple-choice question-answering, textual entailment, sentiment analysis, and question paraphrasing tasks to examine our models' performance on them. For more information about these tasks and datasets, see Appendix A and Khashabi et al. (2020).
According to the results in Table 6, ARMAN(SH) base beats the other models on the natural subsets of Textual Entailment and Question Paraphrasing. This model learned how to rearrange disordered sentences, so it makes sense that it is strong at recognizing the same sentence in different written forms. In Multiple-Choice QA, our best-performing model achieves the highest accuracy on math and logic questions, and our proposed model with semantic similarity and the mask-only approach surpasses the others on literature questions. On the common-knowledge task, WikiBERT base (Pyysalo et al., 2021) outperformed the other models because it was trained on a large Wikipedia dataset. On the Sentiment Analysis task, the proposed models could not achieve acceptable results compared to other models. A more detailed study of model behavior on NLU tasks is outside the scope of this work.

Human Evaluation
Following Kryscinski et al. (2019), we conducted a human evaluation experiment in light of ROUGE's drawbacks. Our purpose was to determine whether semantic similarity yields better summaries than PEGASUS' GSG, and to discover which model is best from a human viewpoint. We selected 30 documents from the PN-Summary dataset and the corresponding summaries generated by the PEGASUS, ARMAN(SS-80), and ARMAN(MSR) models. We gave them to 10 participants and asked them to rank the generated summaries from best to worst, similar to Zou et al. (2020), according to the fluency, informativeness, and succinctness of the generated summaries. In order to perform statistical tests, we converted rankings into scores (score = 4 − rank). The results of the experiment are reported in Table 7, and we performed Student's t-tests between the models.
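The rank-to-score conversion and the paired t-statistic can be sketched as follows; this is a hand-rolled paired t-test, and in practice a statistics package would also report the p-value:

```python
from math import sqrt
from statistics import mean, stdev

def ranks_to_scores(ranks):
    """Convert 1-based rankings to scores (score = 4 - rank), as in the paper."""
    return [4 - r for r in ranks]

def paired_t(a, b):
    """Paired (dependent-samples) t statistic between two aligned score lists.

    Each position i holds the two scores one annotator/document pair gave
    to the two models being compared.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # t = mean(d) / (s_d / sqrt(n)), with s_d the sample std. dev. of the diffs.
    return mean(diffs) / (stdev(diffs) / sqrt(n))
```

A paired test is the right choice here because the same document is scored under every model, so the per-document differences are the quantity of interest.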

Conclusion
There are few models for generating abstractive summaries in the Persian language. This work introduces ARMAN, a Transformer encoder-decoder-based model pre-trained with a new combination of sentence masking with sentence shuffling and reordering objectives. We considered semantic similarities when selecting important sentences to make document-summary input data in a self-supervised manner. The results show that the modified sentence selection and reordering model outperforms the most recent SOTA models on all six downstream tasks. With low supervised sample sizes, our model achieved a higher score than the previous SOTA with only 1K examples. Finally, the human evaluation results show a significant improvement on the dataset used for this experiment.
In future work, investigating the effect of using contextual embeddings to select salient sentences for producing text-summary pairs could prove valuable. Furthermore, the ability of the models on extractive summarization is worth scrutinizing, since our objectives select salient sentences, which resembles extractive summarization.

A Datasets Statistics
In this section, extra information about the downstream datasets and pre-training text corpora is reported. Some of the datasets did not provide a validation split; the number of examples in the train/validation/test splits and the average lengths of articles and summaries for each dataset are reported in Table 11. Additionally, the size of the pre-training text corpora before and after preprocessing is reported in Table 9. Following Grusky et al. (2018), coverage and density are defined as

COVERAGE(A, S) = (1/|S|) Σ_{f ∈ F(A,S)} |f|
DENSITY(A, S) = (1/|S|) Σ_{f ∈ F(A,S)} |f|²

where A is an article, S is the corresponding summary, and F(A, S) is the set of shared sequences of tokens in A and S. The density of extractive summaries is higher than that of more abstractive summaries, and lower coverage indicates more novel text fragments in the summary. Figure 4 shows that our downstream datasets range from more extractive to more abstractive summaries.

The Tebyan dataset contains articles and summaries from a well-known Persian lifestyle website that includes various articles from different categories. To produce the Tebyan dataset, we crawled 100K pages of the site. For each page, we removed all HTML tags using beautifulsoup4 10 , stored the page's primary content with the author-provided summary, and separated paragraphs with a newline character. Lastly, we used langdetect 11 to remove articles that were not in Persian. After this procedure, 92,289 articles and summaries remained, which we split into three parts: 85% for train, 7.5% for validation, and 7.5% for test.
We tested ARMAN on NLU tasks with ParsiNLU (Khashabi et al., 2020), a Persian NLU dataset. This dataset consists of 5 main tasks plus translation as an extra task; details are given in Table 10. We did not test ARMAN on the Reading Comprehension task of this dataset due to resource leakage. The Sentiment Analysis task of this dataset has two subtasks, sentence-level sentiment and aspect-based sentiment of a sentence; we tested ARMAN on the sentence-level subtask. For Question Paraphrasing and Textual Entailment, this dataset contains two subtasks, sentences written by humans and sentences translated from English datasets into Persian, so we report the accuracy of the models for each subtask separately.

B ARMAN Hyper Parameters and Training Settings
In this section, we have described pre-training and fine-tuning parameters and settings. In Table 12, pre-training settings for ARMAN base are reported. Tables 13 and 14 contain information about settings used in fine-tuning ARMAN base and Transformer base on summarization tasks. Also, the fine-tuning settings for NLU tasks are reported in Table 15. Finally, we have reported each model's parameter counts used in summarization tasks in Table 16.

D Samples
Two samples of the summaries generated by ARMAN(SS), ARMAN(MSR), and PEGASUS that were used in the human evaluation test are shown in Figures 5 and 6. More than 50% of participants ranked ARMAN(MSR)'s summaries as the best among all models in the human evaluation test, which shows that its summaries have high quality.

Table 15: Fine-tuning settings for ARMAN base models on NLU tasks. Batch size 48 was chosen to match the other models trained on those tasks. We converted the classification problems into text-to-text problems.