Cross-Lingual Abstractive Summarization with Limited Parallel Resources

Parallel cross-lingual summarization data is scarce, requiring models to make better use of the limited cross-lingual resources available. Existing methods often adopt sequence-to-sequence networks with multi-task frameworks. Such approaches apply multiple decoders, each of which is used for a specific task. However, these independent decoders share no parameters and hence fail to capture the relationships between the discrete phrases of summaries in different languages, breaking the connections needed to transfer knowledge from high-resource languages to low-resource languages. To bridge these connections, we propose a novel Multi-Task framework for Cross-Lingual Abstractive Summarization (MCLAS) in a low-resource setting. Employing one unified decoder to generate the sequential concatenation of monolingual and cross-lingual summaries, MCLAS makes the monolingual summarization task a prerequisite of the CLS task. In this way, the shared decoder learns interactions involving alignments and summary patterns across languages, which encourages knowledge transfer. Experiments on two CLS datasets demonstrate that our model significantly outperforms three baseline models in both low-resource and full-dataset scenarios. Moreover, in-depth analysis of the generated summaries and attention heads verifies that interactions are learned well by MCLAS, which benefits the CLS task under limited parallel resources.


Introduction
Cross-lingual summarization (CLS) helps people efficiently grasp salient information from articles in a foreign language. Neural approaches to CLS require large-scale datasets containing millions of cross-lingual document-summary pairs (Zhu et al., 2019; Cao et al., 2020; Zhu et al., 2020). However, two challenges arise with these approaches: 1) most languages are low-resource, thereby lacking document-summary paired data; 2) large parallel datasets across different languages for neural-based CLS are rare and expensive, especially under the current trend of neural networks. Therefore, a low-resource setting is a more realistic, and more challenging, one for cross-lingual summarization. To the best of our knowledge, cross-lingual summarization under low-resource settings has not been well investigated. Therefore, in this paper, we develop a new model for cross-lingual abstractive summarization under limited supervision.
For low-resource settings, multi-task learning has been shown to be an effective method since it can borrow useful knowledge from other relevant tasks to use in the target task (Yan et al., 2015; Motiian et al., 2017). Cross-lingual summarization can be viewed as the combination of two tasks, i.e., monolingual summarization (MS) and cross-lingual translation (Zhu et al., 2019). A wealth of relationships exists across the target summaries of the MS and CLS tasks, such as translation alignments and summarization patterns. As illustrated in Figure 1, "叙利亚" is mapped to "Syria", and similar mapping is done with the other aligned phrases. Leveraging these relationships is crucial for transferring summarization knowledge from high-resource languages to low-resource languages. Unfortunately, existing multi-task frameworks simply use independent decoders to conduct the MS and CLS tasks separately (Zhu et al., 2019; Cao et al., 2020), and therefore fail to capture these relationships.
To solve this problem, we establish reliant connections between the MS and CLS tasks, making the monolingual task a prerequisite for the cross-lingual task. Specifically, one decoder is shared by both the MS and CLS tasks; this is done by setting the generation target as a sequential concatenation of a monolingual summary and the corresponding cross-lingual summary. By sequentially generating monolingual and cross-lingual summaries, the decoder also conducts the translation task between them, which enhances the interactions between different languages. These interactions implicitly involve translation alignments, similarity in semantic units, and summary patterns across summaries in different languages. To demonstrate these decoder interactions, we further visualize them by probing Transformer attention heads in the model. With these interactions, the new structure benefits low-resource scenarios, which require the model to transfer summary knowledge from high-resource languages to low-resource languages. We name our model Multi-task Cross-Lingual Abstractive Summarization (MCLAS) under limited resources.
In terms of a training strategy under limited resources, we first pre-train MCLAS on large-scale monolingual document-summary parallel datasets to equip the decoder with general summarization capability. Given a small number of parallel cross-lingual summary samples, the model is then fine-tuned and is able to transfer the learned summarization capability to the low-resource language, leveraging the interactions uncovered by the shared decoder.
Experiments on Zh2EnSum (Zhu et al., 2019) and a newly developed En2DeSum dataset demonstrate that MCLAS offers significant improvements over state-of-the-art cross-lingual summarization models in both the low-resource scenarios and the full-dataset scenario. We also achieve competitive performance on the En2ZhSum dataset (Zhu et al., 2019). Human evaluation results show that MCLAS produces more fluent, concise and informative summaries than baseline models under limited parallel resources. In addition, we analyze the length of generated summaries and the success of monolingual generation to verify the advantages offered by identifying interactions between languages. We further investigate the explainability of the proposed multi-task structure by probing the attention heads in the unified decoder, showing that MCLAS learns the alignments and interactions between two languages, and that this facilitates translation and summarization in the decoder stage. Our analysis provides a clear explanation of why MCLAS is capable of supporting CLS under limited resources. Our implementation and data are available at https://github.com/WoodenWhite/MCLAS.

Cross-Lingual Summarization
Recently, cross-lingual summarization has received attention in research due to the increasing demand to produce cross-lingual information.
Traditional CLS systems are based on a pipeline paradigm (Wan et al., 2010;Wan, 2011;Zhang et al., 2016). These pipeline systems first translate the document and then summarize it or vice versa. Shen et al. (2018) propose the use of pseudo summaries to train the cross-lingual abstractive summarization model. In contrast, Duan et al. (2019a) and Ouyang et al. (2019) generate pseudo sources to construct the cross-lingual summarization dataset.
The first large-scale cross-lingual summarization datasets were acquired through a round-trip translation strategy (Zhu et al., 2019). Additionally, Zhu et al. (2019) propose a multi-task framework to improve their cross-lingual summarization system. Following Zhu et al. (2019), more methods have been proposed to improve the CLS task. Zhu et al. (2020) use a pointer-generator network to exploit the translation patterns in cross-lingual summarization. Cao et al. (2020) utilize two encoders and two decoders to jointly learn to align and summarize. In contrast to previous methods, MCLAS generates the concatenation of monolingual and cross-lingual summaries, thereby modeling relationships between them.

Low-Resource Natural Language Generation
Natural language generation (NLG) for low-resource languages or domains has attracted a lot of attention. Gu et al. (2018) leverage meta-learning to improve low-resource neural machine translation. Meanwhile, many pretrained NLG models have been proposed and adapted to low-resource scenarios (Chi et al., 2020; Radford et al., 2019). However, these models require large-scale pretraining. Our work does not require any large pretrained generation models or translation models, which substantially decreases training cost.

Neural Cross-lingual Summarization
Given a source document $D^A = \{x^A_1, x^A_2, \ldots, x^A_m\}$ in language A, a monolingual summarization system converts the source into a summary $S^A = \{y^A_1, y^A_2, \ldots, y^A_n\}$, where $m$ and $n$ are the lengths of $D^A$ and $S^A$, respectively. A cross-lingual summarization system produces a summary $S^B = \{y^B_1, y^B_2, \ldots, y^B_{n'}\}$ consisting of tokens $y^B$ in target language B, where $n'$ is the length of $S^B$. Note that the mentioned $x^A$, $y^A$, and $y^B$ are all tokens.
Zhu et al. (2019) propose using the Transformer (Vaswani et al., 2017) to conduct cross-lingual summarization tasks. The Transformer is composed of stacked encoder and decoder layers. The encoder layer comprises a self-attention layer and a feed-forward layer. The decoder layer shares the same architecture as the encoder except for an extra encoder-decoder attention layer, which performs multi-head attention over the output of the stacked encoder layers. The whole Transformer model $\theta$ is trained to maximize the conditional probability of the target sequence $S^B$, i.e., to minimize the negative log-likelihood:
$$\mathcal{L}_{\text{NCLS}} = -\sum_{t=1}^{n'} \log P(y^B_t \mid y^B_{<t}, D^A; \theta) \tag{1}$$

Improving NCLS with Multi-Task Frameworks
Considering the relationship between CLS and MS, which share the goal of summarizing the important information in a document, Zhu et al. (2019) proposed employing a one-to-many multi-task framework to enhance the basic Transformer model. In this framework, one encoder is employed to encode the source document $D^A$. Two separate decoders simultaneously generate a monolingual summary $S^A$ and a cross-lingual summary $S^B$, of lengths $n$ and $n'$ respectively, leading to the following loss:
$$\mathcal{L}_{\text{NCLS+MS}} = -\sum_{t=1}^{n} \log P(y^A_t \mid y^A_{<t}, D^A; \theta) - \sum_{t=1}^{n'} \log P(y^B_t \mid y^B_{<t}, D^A; \theta) \tag{2}$$
Figure 2: An overview of our proposed MCLAS. A unified decoder produces both monolingual (green) and cross-lingual (red) summaries. The green and red lines represent the monolingual and cross-lingual summaries' attention, respectively.
This multi-task framework shares encoder representation to enhance cross-lingual summarization. However, independent decoders in this model are incapable of establishing alignments and connections between cross-lingual summaries.

MCLAS with Limited Parallel Resources
To strengthen the connections mentioned, we propose making the monolingual task a prerequisite for the cross-lingual task through modeling interactions. According to previous work (Wan et al., 2010; Yao et al., 2015; Zhang et al., 2016), interactions between cross-lingual summaries (important phrase alignments, sentence lengths, summary patterns, etc.) are crucial for the final summary's quality. We leverage these interactions to further transfer the rich-resource language knowledge. Detailed descriptions of this step are presented in the following sections.

Multi-Task Learning in MCLAS
To model interactions between languages, we need to share the decoder's parameters. Inspired by Dong et al. (2019), we propose sharing the whole decoder to carry out both the translation and the summarization tasks. Specifically, we substitute the generation target $S^A$ with the sequential concatenation of $S^A$ and $S^B$:
$$S^{AB} = \{[\text{BOS}], y^A_1, \ldots, y^A_n, [\text{LSEP}], y^B_1, \ldots, y^B_{n'}, [\text{EOS}]\} \tag{3}$$
where [BOS] and [EOS] are the beginning and end tokens of the output summaries, respectively, [LSEP] is the special token used as the separator between $S^A$ and $S^B$, and $n$ and $n'$ are the lengths of $S^A$ and $S^B$. With the new generation target, the decoder learns to first generate $S^A$, and then generate $S^B$ conditioned on $S^A$ and $D^A$. The whole generation process is illustrated in Figure 2.
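The concatenated target can be illustrated with a minimal sketch. The function names `build_target` and `split_output` are hypothetical helpers introduced here, not part of the released implementation, and the two-token summaries are toy data built around the "叙利亚"/"Syria" pair from Figure 1.

```python
from typing import List, Tuple

BOS, LSEP, EOS = "[BOS]", "[LSEP]", "[EOS]"


def build_target(mono_summary: List[str], cross_summary: List[str]) -> List[str]:
    """Concatenate the monolingual and cross-lingual summaries into one target."""
    return [BOS, *mono_summary, LSEP, *cross_summary, EOS]


def split_output(tokens: List[str]) -> Tuple[List[str], List[str]]:
    """Recover (monolingual, cross-lingual) summaries from a generated sequence."""
    body = [t for t in tokens if t not in (BOS, EOS)]
    if LSEP not in body:
        return body, []  # the decoder never switched to the second language
    sep = body.index(LSEP)
    return body[:sep], body[sep + 1:]


# Toy example: only the first aligned pair comes from the paper's Figure 1.
target = build_target(["叙利亚", "局势"], ["Syria", "situation"])
print(target)
print(split_output(target))
```

At inference time only the part after [LSEP] would be kept as the cross-lingual summary, while the part before it is the monolingual by-product.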
Formally, we maximize the joint probability for monolingual and cross-lingual summarization, i.e., we minimize the loss:
$$\mathcal{L}_{\text{MCLAS}} = -\sum_{t=1}^{n} \log P(y^A_t \mid y^A_{<t}, D^A; \theta) - \sum_{t=1}^{n'} \log P(y^B_t \mid y^B_{<t}, S^A, D^A; \theta) \tag{4}$$
where $n$ and $n'$ are the lengths of $S^A$ and $S^B$. The loss function can be divided into two terms. When generating $S^A$, the decoder conducts the MS task based on $D^A$, corresponding to the first term in Equation (4). When generating $S^B$, the decoder already knows the corresponding monolingual summary. In this way, it performs the translation task (for $S^A$) and the CLS task (for $D^A$), achieved by optimizing the second term in Equation (4). With the modification of the target, our model can easily capture interactions between cross-lingual summaries. The trained model shows effectiveness in aligning the summaries: not only the output tokens but also the attention distributions are aligned. Our model leverages this phenomenon to enable monolingual knowledge to be transferred under low-resource scenarios. A detailed investigation is presented in Section 6.
We adopt the Transformer as our base model. In addition, we use multilingual BERT (Devlin et al., 2019) to initialize the encoder, improving its ability to produce multilingual representations. Additionally, having tried many different position embedding and language segmentation embedding methods, we find that [LSEP] is enough for the model to distinguish whether it is generating $S^B$. Hence, keeping the original position embedding (Vaswani et al., 2017) and employing no segmentation embedding are best for performance and efficiency.
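The two-term structure of Equation (4) can be made concrete with a small sketch that splits a sequence-level negative log-likelihood at [LSEP]. The function `split_nll` is a hypothetical helper, the probabilities are made up, and charging [LSEP] to the monolingual term and [EOS] to the cross-lingual term is an assumption rather than something the text specifies.

```python
import math
from typing import List, Tuple

LSEP = "[LSEP]"


def split_nll(tokens: List[str], token_probs: List[float]) -> Tuple[float, float]:
    """Return (monolingual term, cross-lingual term) of the negative log-likelihood.

    tokens[i] is the i-th target token after [BOS]; token_probs[i] is the
    probability the decoder assigned to it given the document and tokens[:i].
    """
    if len(tokens) != len(token_probs):
        raise ValueError("tokens and token_probs must have the same length")
    sep = tokens.index(LSEP)
    # Assumption: [LSEP] counts toward the first term, [EOS] toward the second.
    ms_term = -sum(math.log(p) for p in token_probs[: sep + 1])
    cls_term = -sum(math.log(p) for p in token_probs[sep + 1:])
    return ms_term, cls_term


tokens = ["叙利亚", "[LSEP]", "Syria", "[EOS]"]
probs = [0.5, 0.9, 0.8, 0.95]  # made-up decoder probabilities
ms, cls = split_nll(tokens, probs)
print(f"MS term {ms:.3f}, CLS term {cls:.3f}, total {ms + cls:.3f}")
```

Because both terms come from one decoder pass over one sequence, every cross-lingual token is scored with the monolingual summary already in its left context, which is what distinguishes this objective from the two-decoder loss.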

Learning Schemes for MCLAS under Limited Resources
Since our proposed framework enforces interactions between cross-lingual summaries, it further benefits the low-resource scenario, in which only a few cross-lingual training summary samples are available. Yet simply training from scratch cannot make the best of our proposed model in low-resource scenarios. Hence, we use a pre-training and fine-tuning paradigm to transfer the rich-resource language knowledge. First, we train the model on a monolingual summarization dataset. In this step, the model learns how to produce a monolingual summary for a given document. Then, we jointly learn MS and CLS with a few training samples, optimizing Equation (4). We adopt an initialization similar to that of existing CLS methods, which is introduced in Section 5.3.

Datasets
We conduct experiments on the En2ZhSum and Zh2EnSum CLS datasets (Zhu et al., 2019) and a newly constructed En2DeSum dataset. En2ZhSum is an English-to-Chinese dataset containing 364,687 training samples, 3,000 validation samples, and 3,000 testing samples. The dataset is converted from the union of CNN/DM (Hermann et al., 2015) and MSMO (Zhu et al., 2018) using a round-trip translation strategy. Converted from the LCSTS dataset, Zh2EnSum contains 1,693,713 Chinese-to-English training samples, 3,000 validation samples, and 3,000 testing samples. To better verify the CLS ability of MCLAS, we construct a new English-to-German dataset (En2DeSum), using the same methods proposed by Zhu et al. (2019). We use the WMT'19 English-German winner as our translation model to process the English Gigaword dataset. We set the thresholds $T_1 = 0.6$ and $T_2 = 0.2$. The final En2DeSum contains 429,393 training samples, 4,305 validation samples, and 4,099 testing samples.
All the training samples contain a source document, a monolingual summary, and a cross-lingual summary. For the full-dataset scenario, we train the model with the whole dataset. For low-resource scenarios, we randomly select 3 different amounts (minimum, medium, and maximum) of training samples for all datasets to evaluate our model's performance under low-resource scenarios. Detailed numbers are presented in Table 1.
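Simulating a low-resource scenario amounts to drawing a fixed-size random subset of the training triples. The sketch below is one way to do this reproducibly; `low_resource_subset`, the seed, the toy corpus, and the three sizes are all illustrative assumptions, and the sizes actually used are the ones in Table 1.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def low_resource_subset(samples: Sequence[T], size: int, seed: int = 0) -> List[T]:
    """Draw `size` training samples uniformly at random without replacement."""
    if size > len(samples):
        raise ValueError("subset size exceeds the dataset size")
    rng = random.Random(seed)  # a fixed seed keeps the subset reproducible
    return rng.sample(list(samples), size)


# Toy corpus of (document, monolingual summary, cross-lingual summary) triples.
corpus = [(f"doc{i}", f"ms{i}", f"cls{i}") for i in range(1000)]
# Hypothetical sizes; the real numbers are those listed in Table 1.
subsets = {
    name: low_resource_subset(corpus, n)
    for name, n in [("minimum", 10), ("medium", 50), ("maximum", 100)]
}
print({name: len(subset) for name, subset in subsets.items()})
```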

Training and Inference
We use multilingual BERT (mBERT) (Devlin et al., 2019) to initialize our Transformer encoder. The decoder is a Transformer decoder with 6 layers. Each attention module has 8 different attention heads. The hidden size of the decoder's self-attention is 768 and that of the feed-forward network is 2048. The final model contains 296,046,231 parameters. Because the encoder is pretrained while the decoder is randomly initialized, we use two separate optimizers for the encoder and the decoder (Liu and Lapata, 2019). The encoder's learning rate $\eta_e$ is set to 0.005, while the decoder's learning rate $\eta_d$ is 0.2. The number of warmup steps is 10,000 for the encoder and 5,000 for the decoder. We train the model on two TITAN RTX GPUs for one day with gradient accumulation every 5 steps. Dropout with a probability of 0.1 is applied before all the linear layers. We find that the target vocabulary type does not have much influence on the final result. Therefore, we directly use mBERT's subword vocabulary as our target vocabulary. Nevertheless, to prevent tokens from being produced in the wrong language, we construct a target token vocabulary for each target language. During inference, we only generate tokens from the corresponding vocabulary. During the decoding stage, we use beam search (size 5) and trigram blocking to avoid repetition. The length penalty is set between 0.6 and 1. All the hyperparameters are manually tuned using perplexity (PPL) and accuracy on the validation set.
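The two decoding constraints described above, restricting candidates to the target-language vocabulary and blocking repeated trigrams, can be sketched as a filter applied to the candidate scores at each beam-search step. This is a simplified illustration: `blocks_trigram` and `constrain` are hypothetical names, and the tokens, scores, and vocabulary are toy values.

```python
from typing import Dict, List, Set


def blocks_trigram(prefix: List[str], candidate: str) -> bool:
    """True if appending `candidate` would repeat a trigram already in `prefix`."""
    if len(prefix) < 2:
        return False
    new_trigram = (prefix[-2], prefix[-1], candidate)
    seen = {tuple(prefix[i:i + 3]) for i in range(len(prefix) - 2)}
    return new_trigram in seen


def constrain(scores: Dict[str, float], prefix: List[str],
              allowed: Set[str]) -> Dict[str, float]:
    """Keep only candidates in the target vocabulary that repeat no trigram."""
    return {
        tok: score
        for tok, score in scores.items()
        if tok in allowed and not blocks_trigram(prefix, tok)
    }


prefix = ["in", "zhengzhou", "today", "in", "zhengzhou"]
scores = {"today": -0.1, "yesterday": -0.7, "郑州": -0.3}  # made-up log-probabilities
english_vocab = {"in", "zhengzhou", "today", "yesterday"}
print(constrain(scores, prefix, english_vocab))  # {'yesterday': -0.7}
```

Here "today" is dropped because it would recreate the trigram "in zhengzhou today", and "郑州" is dropped because it lies outside the English vocabulary.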

Baselines
We compare MCLAS in low-resource scenarios with the following baselines:
NCLS: The CLS model proposed by Zhu et al. (2019). In low-resource scenarios, we initialize the model with the pretrained MS model and then use a few samples to optimize Equation (1).
NCLS+MS: The multi-task framework proposed by Zhu et al. (2019). We find that NCLS+MS fails to converge when it is partly initialized by the pretrained MS model (i.e., when the CLS decoder is randomly initialized). Hence, we fully initialize the multi-task model using the pretrained MS model. Specifically, the two separate decoders are both initialized by the pretrained monolingual decoder. Then the model is optimized with Equation (2).
TLTran: Transformer-based Late Translation is a pipeline method. First, a monolingual summarization model summarizes the source document. A translation model is then applied to translate the summary. The summarization model is trained with monolingual document-summary pairs in the three datasets. Specifically, we continue using the WMT'19 English-German winner as the translation model for En2DeSum.
Some recently proposed models, such as NCLS+MT and TETran, improve performance on the CLS task. We choose not to use these models as baselines since comparing MCLAS with them is unfair.

Automatic Evaluation Results
The overall results under the low-resource scenarios and the full-dataset scenario are shown in Table 2. We reimplement a variety of models and evaluate them using F1 scores of the standard ROUGE metric (Lin, 2004) (ROUGE-1, ROUGE-2, and ROUGE-L) and BERTScore. The following analysis is based on our observations.
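For readers unfamiliar with the metric, the following is a simplified sketch of ROUGE-N precision, recall, and F1 based on clipped n-gram overlap. It is not the official ROUGE toolkit (no stemming, no ROUGE-L), `rouge_n` is a name introduced here, and the sentences are toy examples rather than dataset samples.

```python
from collections import Counter
from typing import List, Tuple


def rouge_n(candidate: List[str], reference: List[str],
            n: int = 1) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from clipped n-gram overlap."""
    def ngrams(tokens: List[str]) -> Counter:
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # multiset intersection clips counts
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return precision, recall, 2 * precision * recall / (precision + recall)


reference = "two cars collided in zhengzhou".split()
candidate = "two cars collided".split()
print(rouge_n(candidate, reference))  # precision 1.0, recall 0.6, F1 0.75
```

The toy pair also shows why a shorter summary can score high precision with lower recall, a trade-off that matters when comparing summary lengths across systems.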
In the Zh2EnSum and En2DeSum datasets, MCLAS achieves significant improvements over baselines in all the low-resource scenarios. It is worth noting that combining NCLS+MS in our experiments does not bring much improvement to the NCLS model. We consider that this is because mBERT has already provided multilingual encoding for our models.
However, we find that on the En2ZhSum dataset, MCLAS does not perform as well as on the other two datasets. We speculate that this is due to the imbalance between the English and Chinese references. The average lengths of $S^A$ and $S^B$ in En2ZhSum are 55.21 and 95.96, respectively (Zhu et al., 2019). This condition largely breaks the alignment between languages, leading to MCLAS's weaker performance on this dataset.
Finally, our proposed model also has superior performance compared to baseline models given the full training dataset, achieving the best ROUGE scores on the En2DeSum and Zh2EnSum datasets.

Human Evaluation
In addition to automatic evaluation, we conduct a human evaluation to verify our model's performance. We randomly choose 60 examples (20 for each low-resource scenario) from the Zh2EnSum test dataset. Seven graduate students with high levels of fluency in English and Chinese are asked to assess the generated summaries and gold summaries from three independent perspectives: informativeness, fluency, and conciseness. We follow the Best-Worst Scaling method (Kiritchenko and Mohammad, 2017). Participants are asked to indicate the best and worst items from each perspective. The resulting scores are calculated as the percentage of times each system is selected as best minus the percentage of times it is selected as worst. Hence, final scores range from -1 (worst) to 1 (best). Results are shown in Table 3.
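The Best-Worst Scaling score described above can be computed as follows. This is a minimal sketch that assumes every system is a candidate in every judgment; `best_worst_scores` and the anonymous system names are hypothetical, and the judgments are invented solely to show the arithmetic.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def best_worst_scores(judgments: Iterable[Tuple[str, str]],
                      systems: Iterable[str]) -> Dict[str, float]:
    """Score each system as (#times best - #times worst) / #judgments.

    Each judgment is a (best_system, worst_system) pair given by one
    annotator for one example and one perspective.
    """
    judgments = list(judgments)
    if not judgments:
        raise ValueError("at least one judgment is required")
    best = Counter(b for b, _ in judgments)
    worst = Counter(w for _, w in judgments)
    return {s: (best[s] - worst[s]) / len(judgments) for s in systems}


# Invented judgments over three anonymous systems.
judgments = [("sys_a", "sys_b"), ("sys_a", "sys_c"),
             ("sys_c", "sys_b"), ("sys_a", "sys_b")]
print(best_worst_scores(judgments, ["sys_a", "sys_b", "sys_c"]))
# {'sys_a': 0.75, 'sys_b': -0.75, 'sys_c': 0.0}
```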
As the data size increases, all the models achieve better results. Our proposed MCLAS outperforms NCLS and NCLS+MS on all the metrics. We notice that MCLAS is especially strong in conciseness. This phenomenon will be analyzed in Section 5.7. We show the Fleiss' Kappa scores of our human evaluation in Table 4, which demonstrate good agreement among the participants.
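Fleiss' Kappa measures agreement among a fixed number of raters beyond what chance would produce. The sketch below implements the standard formula; how the categories were defined for this evaluation is not stated, so treating each category as one possible choice per item is an assumption, and the count matrices are toy inputs.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """counts[i][j] = number of raters who assigned item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    # Observed agreement per item, averaged over items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Chance agreement from the marginal category proportions.
    n_cats = len(counts[0])
    p_e = sum(
        (sum(row[j] for row in counts) / (n_items * n_raters)) ** 2
        for j in range(n_cats)
    )
    return (p_bar - p_e) / (1 - p_e)  # undefined if all ratings share one category


# Seven raters, two items, two categories, perfect agreement on each item.
print(fleiss_kappa([[7, 0], [0, 7]]))  # 1.0
```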

Analysis on Initialization Methods
We use a monolingual summarization model to initialize our model. However, whether this initialization method works is still in question. Therefore we compare our models with non-initialized models, shown in Figure 3. Among the three datasets, the initialization methods bring a huge improvement to all of the models.

Analysis on Summary Length
One of the goals of automatic summarization is to produce brief text. Yet many neural auto-regressive models tend to produce a longer summary to improve the recall metric. Results in Table 5 show that interactions enable MCLAS to generate shorter summaries than other models, which more closely resembles human summaries. We can safely conclude that MCLAS can keep the summary in a fairly appropriate length, leading to concise generated summaries. We speculate that this is due to its ability to capture interactions between languages, conditioning cross-lingual summaries on relatively precise monolingual summaries.

Analysis on Monolingual Summarization
Modeling interactions between languages brings many advantages. Specifically, we find that MCLAS can preserve more monolingual summarization knowledge than the NCLS+MS model during low-resource fine-tuning, or even improve monolingual performance. We generate monolingual summaries with models trained in the maximum low-resource scenario. In Table 6, we can clearly see that MCLAS retains more monolingual summarization knowledge on the Zh2EnSum dataset. On the En2DeSum dataset, monolingual summarization performance is even significantly improved. We speculate that this is due to MCLAS's ability to provide the interactions between languages.
We look more closely at the results on En2DeSum, evaluating the detailed ROUGE scores and average summary length, presented in Table 7. We find that the ROUGE improvement mainly results from precision, while recall decreases only slightly. This, together with the average length metric, shows that MCLAS produces more precise summaries while retaining most of the important information, leading to the increase in ROUGE.
Figure 5: Different types of self-attention heads in MCLAS's decoder. The x-axis and y-axis are both the concatenated source-language summary $S^A$ and target-language summary $S^B$ tokens. Darker color shows more highly related associations between tokens. The horizontal line in the self head represents the [LSEP] token. Some attention heads attend to [LSEP] to confirm whether the model is generating a cross-lingual summary.

Case Study
In Figure 4, on the Zh2EnSum dataset, there is a list comparing the reference summary and outputs of models trained in the maximum low-resource scenario. Clearly, the NCLS model loses the information "two cars" and generates the wrong information "No.2 factory". The NCLS+MS model is not accurate when describing the number of injured people, dropping important information "more than". Additionally, the NCLS+MS model also has fluency and repetition issues: "in zhengzhou" appears twice in its generated summary. In contrast, MCLAS captures all of this information mentioned in both its Chinese and English output, and the English summary is well aligned with the Chinese summary. Finally, all of the models ignore the information "foxconn printed on the body of the car". See Appendix A for more examples.

Probing into Attention Heads
We have observed a successful alignment between $S^A$ and $S^B$ produced by our model in Section 5.9. In this section, we dig into this and analyze how the model learns the relationships. For a CLS task from document $D^A$ to $S^B$, our hypotheses are: (1) the unified decoder is implicitly undertaking translation from $S^A$ to $S^B$; (2) the unified decoder also conducts both monolingual and cross-lingual summarization. To verify these hypotheses, we visualize attention distributions of the Transformer decoders trained on En2ZhSum. Neural models can be explicitly explained by probing into the attention heads (Michel et al., 2019; Voita et al., 2019). We follow the previous work and visualize the function of all attention heads in the decoder to verify the relationships of the concatenated cross-lingual summaries (i.e., translation) and cross-lingual document-summary pairs (i.e., summarization).
Figure 6: Different types of encoder-decoder attention heads in MCLAS's decoder. The x-axis represents the concatenated source-language summary $S^A$ and target-language summary $S^B$ tokens, while the y-axis shows the document $D^A$ tokens. In news texts, important information often gathers in the front part of the document. We only retain the informative part of the y-axis, omitting the blank part that the model does not attend to.

Analysis on Translation
We assume that the decoder translates only if the source summary $S^A$ and the target summary $S^B$ align well. This means that MCLAS is transferring knowledge from $S^A$ to $S^B$. We visualize and probe all 48 self-attention heads in the unified decoder. We find 23 (47.9%) translation heads, defined as the heads attending from $y^B_j$ to the corresponding words in language A. These heads undertake a translation function. 19 (39.6%) heads are local heads, attending to a few words before them and modeling context information. 12 (25%) heads are self heads, which only attend to themselves to retain the primary information. Some of the heads can be categorized into two types, which is why the counts sum to more than 48. Note that all of the heads behave similarly across different samples. Translation heads form the largest group, indicating that our unified decoder is translating $S^A$ into $S^B$. We sample some representative heads in Figure 5 to show their functionalities.
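The three head types were identified by visual inspection; a rough automatic proxy is sketched below for illustration only. It measures, over the positions that generate the cross-lingual summary, how much attention mass falls on the source-language summary (translation), on the few preceding positions (local), or on the current position (self). The function `classify_head`, the window size, the 0.5 threshold, and the one-hot toy maps are all assumptions introduced here, not the paper's procedure.

```python
from typing import Callable, List, Set


def classify_head(attn: List[List[float]], sep: int,
                  window: int = 2, threshold: float = 0.5) -> Set[str]:
    """Label one decoder self-attention head from a single attention map.

    attn[i][j] is the weight position i puts on position j. Positions before
    `sep` hold the monolingual summary, `sep` holds [LSEP], the rest hold the
    cross-lingual summary. Only cross-lingual rows are inspected.
    """
    rows = range(sep + 1, len(attn))
    if not rows:
        raise ValueError("no cross-lingual positions after [LSEP]")

    def mean_mass(cols_for_row: Callable[[int], range]) -> float:
        return sum(sum(attn[i][j] for j in cols_for_row(i)) for i in rows) / len(rows)

    masses = {
        "translation": mean_mass(lambda i: range(0, sep)),
        "local": mean_mass(lambda i: range(max(0, i - window), i)),
        "self": mean_mass(lambda i: range(i, i + 1)),
    }
    # A head may receive more than one label.
    return {name for name, mass in masses.items() if mass >= threshold}


def one_hot_map(targets: List[int]) -> List[List[float]]:
    """Toy attention map in which row i puts all its weight on targets[i]."""
    n = len(targets)
    return [[1.0 if j == t else 0.0 for j in range(n)] for t in targets]


# Three monolingual tokens, [LSEP] at index 3, three cross-lingual tokens.
print(classify_head(one_hot_map([0, 1, 2, 3, 0, 1, 2]), sep=3))  # {'translation'}
print(classify_head(one_hot_map([0, 0, 1, 2, 3, 4, 5]), sep=3))  # {'local'}
print(classify_head(one_hot_map([0, 1, 2, 3, 4, 5, 6]), sep=3))  # {'self'}
```

Real attention maps are far less clean than these one-hot examples, so any such thresholding would need to be checked against the visualizations before being trusted.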

Analysis on Summarization
To analyze whether the decoder for $S^B$ is simply translating from $S^A$ or also summarizes the source document, we visualize the distribution of the 48 encoder-decoder attention heads. We find 28 (58.3%) summarization heads that attend to the document's important parts when generating both the monolingual summary and the cross-lingual summary. We also find 20 (41.7%) translation heads, which focus on the source document when generating $S^A$, while focusing on nothing when generating $S^B$. We speculate that summarization heads are responsible for the summarization function and that translation heads cut the connection between $S^B$ and the source document $D^A$, leaving space for translation. Again, all the heads behave similarly across different samples. We select two representative samples in Figure 6.
The existence of both summarization and translation heads in the encoder-decoder attention components supports our view that the unified decoder simultaneously conducts translation and summarization. Therefore, our model enhances the interactions between different languages and is able to facilitate cross-lingual summarization under low-resource scenarios. See Appendix B for detailed visualization results.

Discussions
An ideal low-resource experiment should be conducted with real low-resource languages. Although this is possible, it takes much effort to acquire such datasets. Hence, as a second-best choice, we simulate our low-resource scenarios by artificially limiting the amount of available data. Some may question the feasibility of our method for real low-resource languages, since machine translation systems, which are used to generate document-summary pairs, would be of lower quality for truly low-resource languages. To address this concern, we consider it still possible to acquire thousands of high-quality human-translated parallel summaries, as Duan et al. (2019b) do for their test set, to apply our method.

Conclusion
In this paper, we propose a novel multi-task learning framework, MCLAS, to achieve cross-lingual abstractive summarization with limited parallel resources. Our model shares a unified decoder that sequentially generates both monolingual and cross-lingual summaries. Experiments on two cross-lingual summarization datasets demonstrate that our framework outperforms all the baseline models in low-resource and full-dataset scenarios.