ACROSS: An Alignment-based Framework for Low-Resource Many-to-One Cross-Lingual Summarization



Introduction
Given a source document, Cross-Lingual Summarization (CLS) aims to generate a summary in a different language. CLS thus helps users quickly understand the outlines of news written in foreign languages they do not know. Early CLS approaches typically use pipeline frameworks (Leuski et al., 2003; Orasan and Chiorean, 2008), which are intuitive but suffer from error cascading. Researchers have recently turned to end-to-end models (Zhu et al., 2019, 2020; Bai et al., 2021) that are immune to this problem. However, these studies are limited to bilingual learning and do not conform to the reality of multilingual scenarios.
Given that real-world news is written in diverse languages and that only a few researchers have explored multilingual scenarios, we investigate the many-to-one CLS scenario to meet realistic demands. As stated before, CLS data can be viewed as low-resource since parallel CLS data is significantly less abundant than monolingual data (Zhu et al., 2019). This low-resource characteristic is further amplified in multilingual scenarios. However, directly training an end-to-end model does not perform well due to the ineffective use of high-resource data and the scarcity of low-resource data. The foremost challenges are how to model cross-lingual semantic correlations in multilingual scenarios and how to introduce new knowledge to low-resource languages.
To tackle the above challenges, we investigate a novel yet intuitive idea of cross-lingual alignment.
The cross-lingual alignment method can extract deep semantic relations across languages. As portrayed in Figure 1, the materials in three languages (i.e., French, Chinese, and English) express similar semantics. We can align all these languages to obtain deep cross-lingual semantic knowledge, which is crucial for refining cross-lingual materials across languages and generating high-quality summaries. Moreover, we also devise a novel data augmentation (DA) method to introduce new knowledge to low-resource languages.
To investigate the two hypotheses, we introduce a novel many-to-one CLS model for low-resource learning called Aligned CROSs-lingual Summarization (ACROSS), which improves performance in low-resource scenarios by effectively utilizing abundant high-resource data. The model conducts cross-lingual alignment at both the model and data levels. From the model perspective, we minimize the difference between the cross-lingual and monolingual representations via contrastive and consistency learning (He et al., 2020; Pan et al., 2021; Li et al., 2021, 2022). This helps build a solid alignment relationship between low-resource and high-resource languages. From the data perspective, we propose a novel data augmentation method that selects informative sentences from monolingual summarization (MLS) pairs, aiming to introduce new knowledge for low-resource languages.
We conducted experiments on the CrossSum dataset (Hasan et al., 2021), which contains cross-lingual summarization pairs in 45 languages. The results show that ACROSS outperforms the baseline models and achieves strong improvements in most language pairs (a 2.3-point average improvement in ROUGE scores).
Our contributions are as follows: • We propose a novel many-to-one summarization model that aligns cross-lingual and monolingual representations to enrich low-resource data.
• We introduce a data augmentation method that extracts high-resource knowledge and transfers it to facilitate low-resource learning.
• An extensive experimental evaluation validates the low-resource CLS performance of our model in both quantitative and qualitative ways.

Related Work
Early CLS research typically used pipeline methods, such as translate-then-summarize (Leuski et al., 2003; Ouyang et al., 2019) or summarize-then-translate (Orasan and Chiorean, 2008; Wan et al., 2010; Yao et al., 2015; Zhang et al., 2016), which are sensitive to the error cascading that causes their subpar performance. Thanks to the development of transformer-based methods (Vaswani et al., 2017), researchers introduced teacher-student frameworks (Shen et al., 2018; Duan et al., 2019) wherein the CLS task can be approached via an encoder-decoder model. Thereafter, multi-task frameworks became popular in this field (Zhu et al., 2019, 2020; Bai et al., 2021). Recently, researchers have begun to investigate how to fuse translation and summarization tasks into a unified model to improve performance on CLS tasks (Liang et al., 2022; Takase and Okazaki, 2022; Bai et al., 2022; Nguyen and Luu, 2022; Jiang et al., 2022). For example, Bai et al. (2022) considered compression so that their model can handle both the CLS and translation tasks at different compression rates.
Focusing on multi-task learning, these studies attempt to improve CLS performance using machine translation (MT) and MLS tasks in bilingual settings. However, such approaches establish only implicit connections among languages and leave aside the information in high-resource data. Hasan et al. (2021) recognized the limitations of the above-mentioned scenarios. They proposed a new multilingual dataset, CrossSum, and introduced a method that balances the number of different language pairs in a batch, which alleviates the uneven distribution of training samples and balances performance across languages. However, deep semantic correlations across languages, as well as the abundant information in high-resource data, have not been investigated.
In contrast to the aforementioned methods, ACROSS introduces cross-lingual alignment and a novel data augmentation method, which improve low-resource performance from both the model and data perspectives. The alignment is conducted on both the encoder and decoder sides. Figure 2 illustrates the overall framework.

Preliminary
Mono-Lingual Abstractive Summarization. Given a document D^A = {x^A_1, x^A_2, ..., x^A_n} written in language A, a monolingual abstractive summarization model induces a summary S^A = {y^A_1, y^A_2, ..., y^A_m} by minimizing the following loss function:

L_mls = - Σ_{t=1}^{m} log P(y^A_t | y^A_{<t}, D^A; θ_mls),

where n and m are the lengths of the input document and output summary, respectively, and θ_mls denotes the parameters of the monolingual summarization model.

Cross-Lingual Abstractive Summarization. Different from monolingual abstractive summarization models, a cross-lingual abstractive summarization model generates a summary S^B = {y^B_1, y^B_2, ..., y^B_m} in language B from a document D^A written in language A. The loss function of the CLS model can be formulated as:

L_cls = - Σ_{t=1}^{m} log P(y^B_t | y^B_{<t}, D^A; θ_cls),

where θ_cls denotes the parameters of the CLS model.

Cross-Lingual Alignment
Similarly, we can obtain the representation of D^B_+ with the pretrained encoder of the monolingual summarization model:

h_{B+} = F(Encoder_mls(D^B_+)).

Finally, the contrastive learning objective minimizes the following loss:

L_ctr = - log [ exp(sim(h_A, h_{B+}) / τ) / (exp(sim(h_A, h_{B+}) / τ) + Σ_{D^B_- ∈ N} exp(sim(h_A, h_{B-}) / τ)) ],

where τ is a temperature hyper-parameter and sim(·) denotes a similarity function that measures the distance between two vectors in the embedding space.
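As a concrete illustration, this contrastive objective can be sketched in plain Python. This is a minimal sketch, not the paper's implementation: cosine similarity stands in for sim(·), and the representations are plain vectors rather than pooled encoder outputs.

```python
import math

def cosine(u, v):
    # sim(.): cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(h_a, h_pos, h_negs, tau=0.1):
    """InfoNCE-style loss: pull the CLS input representation h_a toward the
    paired monolingual representation h_pos, push it away from negatives."""
    pos = math.exp(cosine(h_a, h_pos) / tau)
    neg = sum(math.exp(cosine(h_a, h_n) / tau) for h_n in h_negs)
    return -math.log(pos / (pos + neg))

# The loss is small when h_a is much closer to the positive than to any
# negative, and large when a negative looks more similar than the positive.
aligned = contrastive_loss([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
confused = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
```

Minimizing this loss over many (D^A, D^B_+) tuples is what draws the cross-lingual representations into the monolingual space.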
Cross-Lingual Consistency Learning for Decoder. Consistency learning aims to model consistency across the models' predictions, which helps a child model gain improvement from a pretrained parent model. By constraining the output probability distributions of the decoders, the CLS child model can be aligned to the MLS pretrained parent model. Given a tuple composed of a CLS document and its paired monolingual document (D^A, D^B_+), we obtain the output distribution of the CLS model at each decoding step t:

p^t_cls = P(y_t | y_{<t}, D^A; θ_cls).

Similarly, we construct the output distribution of the MLS model at each decoding step:

p^t_mls = P(y_t | y_{<t}, D^B_+; θ*_mls),

where θ*_mls denotes that the parameters of the MLS model are frozen during training. We then bridge the distribution gap between the CLS and MLS models by minimizing the consistency loss:

L_con = Σ_t JS-Div(p^t_cls || p^t_mls),

where JS-Div denotes the Jensen-Shannon divergence (Lin, 1991), which measures the gap between the pretrained and child models.
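The per-step Jensen-Shannon divergence can be illustrated with a small pure-Python sketch (hypothetical helper names; a real implementation would operate on full-vocabulary probability distributions from the two decoders):

```python
import math

def kl_div(p, q):
    # Kullback-Leibler divergence KL(p || q) over discrete distributions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_div(p, q):
    """Jensen-Shannon divergence between the CLS and MLS output
    distributions at one decoding step (symmetric, bounded by log 2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_div(p, m) + 0.5 * kl_div(q, m)

def consistency_loss(cls_steps, mls_steps):
    # Sum the per-step JS divergence over the decoding sequence
    return sum(js_div(p, q) for p, q in zip(cls_steps, mls_steps))
```

The loss is zero when the child model's step-wise distributions exactly match the frozen parent's, and grows as they diverge.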
Training Objective of ACROSS. We jointly minimize the CLS, consistency, and contrastive losses during training. The final training objective of ACROSS is formulated as:

L = α L_cls + β L_con + γ L_ctr,

where α, β, and γ are hyper-parameters that balance the weights of the three losses.
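Put together, the training objective is a weighted sum of the three losses. A sketch follows; the pairing of each hyper-parameter with a loss follows the order in the text, and the default values are those reported in the experimental settings.

```python
def across_objective(l_cls, l_con, l_ctr, alpha=1.0, beta=1.0, gamma=2.0):
    """Joint training loss: the CLS cross-entropy plus the decoder
    consistency loss and the encoder contrastive loss, each weighted
    by its balancing hyper-parameter."""
    return alpha * l_cls + beta * l_con + gamma * l_ctr
```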

Data Augmentation for Cross-Lingual Summarization
Data augmentation is a widely used technique in low-resource scenarios (Sennrich et al., 2016a; Fabbri et al., 2021). In Seq2Seq tasks, it often leverages translation to increase the amount of data in low-resource scenarios. However, in the CLS task, the direct translation of monolingual data from a high-resource language to a low-resource language might lose some valuable information. The distribution of information in the input document is uneven, making some sentences potentially more important than others. Therefore, directly translating all sentences into a low-resource language and using them to train the model may not be conducive to CLS.
Considering the characteristics of the summarization task, we propose an importance-based data augmentation method built on ROUGE scores. First, an input document D^B_i is split into sentences S = {s_1, s_2, ..., s_k}. Then, a ROUGE score is calculated between each sentence and the summary S^B, giving R = {r_1, r_2, ..., r_k}. Next, the sentences with the top a% of ROUGE scores are selected and translated into the low-resource language. Finally, the translated sentences are reassembled with the remaining sentences to form a pseudo document D^Ap_i, so that pseudo low-resource summarization pairs (D^Ap_i, S^B) can be generated. Figure 3 shows an example of the process from an English monolingual summarization pair to a Chinese-English summarization pair. The two sentences with the highest ROUGE scores are the second and third sentences; hence, these two sentences are translated into Chinese.
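The selection step can be sketched as follows. This is a minimal illustration with hypothetical helper names: a simple unigram-recall overlap stands in for the actual ROUGE implementation, and `translate` is a stub for the MT model.

```python
def overlap_score(sentence, summary):
    # Stand-in for ROUGE: unigram recall of summary tokens in the sentence
    sent, summ = set(sentence.lower().split()), set(summary.lower().split())
    return len(sent & summ) / max(len(summ), 1)

def augment(document_sents, summary, translate, ratio=0.5):
    """Translate the top-`ratio` fraction of sentences by score, keep the
    rest as-is, and reassemble the pseudo document in the original order."""
    scored = sorted(range(len(document_sents)),
                    key=lambda i: overlap_score(document_sents[i], summary),
                    reverse=True)
    k = max(1, int(len(document_sents) * ratio))
    chosen = set(scored[:k])
    return [translate(s) if i in chosen else s
            for i, s in enumerate(document_sents)]

# Toy usage: str.upper marks which sentences would have been translated.
doc = ["cats sleep all day", "stocks fell sharply",
       "cats sleep at night", "rain is expected"]
pseudo = augment(doc, "cats sleep often", str.upper, ratio=0.5)
```

The two sentences sharing vocabulary with the summary are selected for "translation", mirroring how the informative sentences in Figure 3 are the ones rendered into Chinese.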
Experiment Setup

Dataset
We conduct experiments using the previously mentioned CrossSum dataset (Hasan et al., 2021).
CrossSum is a multilingual CLS dataset that contains cross-lingual summarization data in 45 languages. Moreover, it realistically reflects the skewed data distribution of practical CLS tasks. Figure 4 portrays the degree of imbalance of the dataset. As we can see, English monolingual summaries constitute over 70% of the English-target summaries, while the other 44 languages together account for less than 30%. We classify languages with fewer than 1,000 training samples as extremely low-resource scenarios, those with between 1,000 and 5,000 as medium low-resource scenarios, and those with more than 5,000 as normal low-resource scenarios.
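The resource buckets above amount to a simple threshold rule, sketched here (boundary handling at exactly 1,000 and 5,000 samples is an assumption, as the paper only gives the ranges):

```python
def resource_scenario(num_train_samples):
    """Classify a language pair by training-set size, following the
    thresholds used for CrossSum in this work."""
    if num_train_samples < 1000:
        return "extremely low-resource"
    if num_train_samples <= 5000:
        return "medium low-resource"
    return "normal low-resource"
```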

Baselines
We compare our model with the following baselines: Multistage: a training sampling strategy proposed by Hasan et al. (2021). This method balances the number of different language pairs in a batch, thus alleviating the uneven distribution of training samples across languages.
NCLS+MT: a method based on the multi-task framework proposed by Zhu et al. (2019). The model uses two independent decoders for the CLS and MT tasks. As the original NCLS+MT model can only handle the bilingual CLS task, we replace its encoder with a multilingual encoder.
NCLS+MLS: a method also proposed by Zhu et al. (2019). It differs from NCLS+MT in that the multi-task decoder is used for the MLS task.

Experimental Settings
For training, the MLS model is trained on the English-English subset of the CrossSum dataset, and its parameters are initialized with mT5 (Xue et al., 2021). Thereafter, we initialize the CLS model with the pretrained MLS model. We set dropout to 0.1 and the learning rate to 5e-4 with polynomial decay scheduling and 5,000 warm-up steps. For optimization, we use the Adam optimizer (Kingma and Ba, 2015) with ϵ = 1e-8, β1 = 0.9, β2 = 0.999, and a weight decay of 0.01.
The hyper-parameters α, β, and γ are set to 1.0, 1.0, and 2.0, respectively. The size of the negative sample set is 1,024. The temperature hyper-parameter τ is set to 0.1. To stabilize the training process, we clip the gradient norm to 1.0. The vocabulary size is 250,112, and BPE (Sennrich et al., 2016b) is used as the tokenization strategy. We limit the maximum input length to 512 tokens and the maximum summary length to 84 tokens. We train our model on 4 RTX A5000 GPUs for 40,000 training steps with a batch size of 256. For inference, we use beam-search decoding (Wiseman and Rush, 2016) with a beam size of 5.

Main Results
We evaluate ACROSS with the standard ROUGE metric (Lin, 2004), reporting the F1 scores (%) of ROUGE-1, ROUGE-2, and ROUGE-L. As Figure 5 shows, ACROSS-base significantly outperforms Multistage-base on the different language test sets. The ROUGE-2 scores for more than 30 languages increase by more than 2 points, which represents a stable improvement for ACROSS. Moreover, ACROSS-small surpasses or is comparable to Multistage-base on some metrics (improving ROUGE-1 by 0.2 and ROUGE-L by 0.09 in extremely low-resource scenarios).
In addition, we can see in Figure 5 that the English-English ROUGE-2 score improves by only 0.77, which illustrates that the improvement of ACROSS comes mainly from the better alignment between other languages and English, rather than from an improved ability to summarize English. Considering the actual data sizes, ACROSS significantly outperforms the baselines in low-resource CLS scenarios. ROUGE-1 and ROUGE-L results are reported in Appendix A. Compared with NCLS+MT and NCLS+MLS, the ROUGE-1, ROUGE-2, and ROUGE-L scores of ACROSS are enhanced by more than 3, 1, and 2 points, respectively. This phenomenon reveals that multi-task approaches that rely on MT and MLS learning may not be effective in multilingual scenarios. ACROSS turns out to be more suitable for scenarios with imbalanced resources.

Analysis
Ablation Study. We conduct an ablation study under the small setting. We summarize the experimental results in Table 2 as follows: • ctr+con+DA performs better than con+DA, suggesting that although con can significantly improve performance, the aligned representation is also beneficial for CLS tasks.
• The complete model produces better results than DA alone. Except for Multistage, DA performs worse than the models that add the other losses, which implies that the excellent performance of ACROSS does not merely come from data augmentation.
• Comparing DA and con, we can see that the aligned model and representation are crucial for a successful CLS task.
Analysis of Data Augmentation.We conduct experiments on different selection approaches to evaluate the performance of our proposed DA method.
As recorded in Table 3, Informative performs best among the methods, which indicates that the DA method helps ACROSS learn more of the important information in the CLS task. Truncation performs worse than Informative, even though the more important sentences in a news report tend to appear near the front. The results also validate the effectiveness of the DA method in selecting more important sentences for translation.
Generally speaking, the results tell us that the DA method is beneficial for the CLS task and that translating important sentences is useful for cross-lingual alignment.
Human Evaluation. Due to the difficulty of finding a large number of users who speak low-resource languages, we conduct the human evaluation only on 20 random samples from the Chinese-English and French-English test sets. We compare the summaries generated by ACROSS with those generated by Multistage. We invite participants to compare the auto-generated summaries with ground-truth summaries from three perspectives: fluency (FL), informativeness (IF), and conciseness (CC). Each sample is evaluated by three participants.
The results shown in Table 4 indicate that ACROSS is capable of generating fluent summaries, and these summaries are also informative and concise according to human judgments.
Visualization of Alignment. To further demonstrate the alignment result of ACROSS, we visualize the similarity between CLS inputs and the paired English inputs in the Chinese-English and French-English test sets. We randomly sample 50 cross-lingual inputs from the test set and obtain the representations of these cross-lingual inputs and the paired English inputs. Then, we calculate the cosine similarity between the two languages to construct a similarity matrix. Finally, we plot the heat map of the similarity matrix.
In Figure 6b and Figure 6d, the clear diagonal indicates that the paired inputs have significantly higher similarities, while the unpaired inputs have lower similarities. In Figure 6a and Figure 6c, we can observe that the similarity distribution is considerably more confused.
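The similarity matrix behind these heatmaps can be computed as follows (a pure-Python sketch with hypothetical inputs; in practice `reps_a` and `reps_b` would be the pooled encoder representations of the 50 sampled cross-lingual and English inputs):

```python
import math

def cosine(u, v):
    # Cosine similarity between two representation vectors
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def similarity_matrix(reps_a, reps_b):
    """Entry (i, j) is the cosine similarity between cross-lingual input i
    and English input j; a well-aligned model shows a bright diagonal."""
    return [[cosine(a, b) for b in reps_b] for a in reps_a]

# Perfectly aligned toy representations produce an identity-like matrix.
m = similarity_matrix([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```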
In summary, ACROSS can effectively align cross-lingual and English inputs, and the experiments demonstrate that aligned representations are more useful for CLS tasks in multilingual settings.
Case Study. Finally, we conduct a case study on a sample from the Chinese-English test set. The baseline employed here is the Multistage model. The words in red are important and overlap with the ground truth; in contrast, the words in green are errors. As shown in Figure 7, compared to Multistage, ACROSS covers details in a better and more precise way (e.g., using proper nouns and phrases). For example, asthma and processed meat are present in the summary generated by ACROSS, yet the summary generated by the baseline does not contain these important terms and also contains factual consistency errors. As another example, the terms fruit and vegetables, including cabbage, broccoli, and kale appear in the summary generated by the baseline, while these terms are not mentioned in the original text.
The above examples suggest that ACROSS improves CLS performance on the basis of its strong MLS ability under the guidance of alignment.

The example from Figure 7:

Source: 70克大约是一根香肠再加一片火腿。根据法国研究人员的调查发现，如果一周吃四份以上的加工肉食品就会增加健康风险。但专家说，两者之间的联系并没有得到证明，需要做更多的调查。专家还建议，人们应该遵循一种更健康的饮食结构，例如每天吃的红肉和加工肉食品不要超过70克。参加这项试验的人中有一半是哮喘病人，然后观察他们的哮喘症状。试验显示，如果他们吃了过多的加工肉，症状就会加重。

Translation: 70 grams is about one sausage plus one slice of ham. According to a survey by French researchers, eating more than four servings of processed meat a week increases health risks. But experts say the link between the two has not been proven and more investigation is needed. Experts also recommend that people follow a healthier diet, such as eating no more than 70 grams of red and processed meat per day. Half of the people who took part in the trial were asthmatics, and their asthma symptoms were then observed. Tests showed that if they ate too much processed meat, symptoms worsened.

ACROSS: Eating lots of processed meat could increase the risk of an asthma attack, according to researchers.

Baseline: A link between eating a lot of fruit and vegetables, including cabbage, broccoli and kale, has been suggested by French researchers.

Ground Truth: Eating processed meat might make asthma symptoms worse, say researchers.

Conclusion
In this work, we propose ACROSS, a many-to-one cross-lingual summarization model. Inspired by the alignment idea, we design contrastive and consistency losses for ACROSS. Experimental results show that, with the ACROSS framework, a CLS model improves its low-resource performance by effectively utilizing high-resource monolingual data. Our findings point to the importance of alignment in cross-lingual fields for future research. In the future, we plan to apply this idea to CLS in multimodal scenarios, which might enable the model to better serve realistic demands.

Limitations
Considering that English is the most widely spoken language, we select it as the high-resource monolingual language in this study. While ACROSS is a general summarization framework not limited to a certain target language, how ACROSS works with other high-resource target languages deserves in-depth exploration.
Additionally, we employ mT5 as our backbone because it supports most languages in CrossSum. The performance of ACROSS after replacing mT5 with other models, such as mBART (Liu et al., 2020) or FLAN-T5 (Chung et al., 2022), will be investigated in the future.

Ethical Consideration
Controversial Generated Content. Our model is unlikely to generate controversial content (e.g., discrimination, criticism, and antagonism) since it is trained on a dataset from the BBC News domain. Data in the news domain is usually scrutinized before publication, and thus the model is unlikely to generate controversial output.
Desensitization of User Data. We use the Amazon Mechanical Turk crowdsourcing platform to evaluate the three human-judged metrics (i.e., fluency, informativeness, and conciseness). All sensitive user data is desensitized by the platform, so we do not have access to sensitive user information.

A Appendix
Analysis of Alignment Methods. To further show the effectiveness of ACROSS, we conduct an experiment that analyzes the alignment methods by swapping the alignment losses of the encoder and decoder. As Table 5 shows, replacing any part of the original alignment design makes the model perform worse. In particular, replacing the consistency and contrastive losses at the same time significantly reduces the model's performance, which reinforces the rationality of our loss design. Data Augmentation Settings. We use Helsinki-NLP as our translation model. In practice, we select the sentences corresponding to the top 50% of ROUGE scores. Furthermore, we set the beam size to 4, the length penalty to 1.0, and the minimum length to 10 for decoding.

ROUGE-1 & ROUGE-L Improvements for ACROSS-base. To show the improvement of our model on different metrics, we plot the improvements over Multistage-base, similarly to Figure 5. As Figures 8 and 9 show, ACROSS-base also achieves a significant and stable improvement in ROUGE-1 and ROUGE-L across different languages.
Analysis of Translation Ratio. We also analyze the impact of the translation ratio a on the final results. As Table 6 shows, ACROSS-100% actually performs worse, probably because translating all sentences introduces too much extraneous noise.

Improvement in Different Resource Scenarios
Form for Human Evaluation. Figure 10 shows the form we gave to participants, illustrated with the French-English summarization evaluation. Participants were asked to compare the auto-generated summaries with ground-truth summaries from three perspectives: fluency, informativeness, and conciseness, on a scale from one to five. Each participant was informed that their scores for the different summaries would appear in our study as an evaluation metric.

Figure 1 :
Figure 1: The schematic diagram of ACROSS. We try to build a strong alignment relationship between cross-lingual inputs and the corresponding monolingual input. We select English as the target language. We constrain French and Chinese documents to have the same representation as the paired English document in the learning process. Finally, the CLS system can give the target English summary independently.

Figure 2 :
Figure 2: The framework of ACROSS. For the CLS input D^A, the paired MLS input D^B_+ and the negative samples D^B_- are fed into the pretrained MLS model. ACROSS uses a contrastive loss on the encoder side and a consistency loss on the decoder side to minimize the gap between the representations of the two languages.
Cross-Lingual Contrastive Learning for Encoder. A multilingual Transformer treats all languages equally, which leads to the representations of different languages being distributed over different spaces, eventually making it difficult for CLS tasks to take advantage of high-resource monolingual data. Therefore, we should encourage the model to improve cross-lingual performance with a strong monolingual summarization capability. With the help of contrastive learning, ACROSS can align the cross-lingual input representation to the monolingual space, realizing the idea mentioned above. Firstly, given a cross-lingual summarization tuple and the paired monolingual document (D^A, D^B_+, S^B), we randomly choose a negative document set N = {D^B_1, D^B_2, ..., D^B_|N|} from the dataset. Then, we obtain the representation of D^A with a Transformer encoder and a pooling function F as:

h_A = F(Encoder(D^A)).

Figure 3 :
Figure 3: An example of our data augmentation method. This example shows the process from an English monolingual summarization pair to a Chinese-English summarization pair. Each green block contains an English sentence, while each orange block contains a Chinese sentence. The sentences corresponding to the dark green blocks have higher ROUGE scores, and these sentences will be translated into Chinese.

Figure 4 :
Figure 4: Distribution of the number of training samples for different languages to English. English-to-English summarization data accounts for more than 70% of the entire dataset.

Figure 5 :
Figure 5: The improvement of ACROSS-base compared to mt5-base in ROUGE2-F on different languages to English (%).

Figure 6 :
Figure 6: Visualization of the alignment effect. The four figures are heatmaps for different models and language pairs. The closer the color of a point is to dark red, the higher the similarity between the two corresponding inputs. A clear diagonal line in Figure 6b and Figure 6d indicates that the paired inputs have a higher similarity. In contrast, Figure 6a and Figure 6c have many unexpected lines, meaning the model cannot distinguish the paired inputs from the negative pairs.

Figure 7 :
Figure 7: Case study.The words in red are important and overlap with Ground Truth.The green words are errors.

Figure 8 :
Figure 8: The improvement of ACROSS-base compared to Multistage-base in ROUGE1-F on different languages to English.

Figure 10 :
Figure 10: The form for human evaluation.

Table 1 :
The main results of different models on the CrossSum dataset (%). The bold values indicate the best results under the base setting; the underlined values indicate the best results under the small setting. ACROSS improves significantly across different low-resource settings and metrics.
Comparison with Multi-Task Methods. Compared with the two multi-task methods (i.e., NCLS+MT and NCLS+MLS), we find that they do not perform as well as Multistage and have an even greater gap with ACROSS.

Table 3 :
Performance of different selection methods (%). Informative means selecting the sentences corresponding to the top 50% of ROUGE values. Uninformative is the opposite, selecting the sentences with the lowest ROUGE values. Random denotes randomly selecting 50% of the sentences from the document. Truncation denotes that only the first half of the document is selected for translation.

Table 4 :
Human evaluation of ACROSS and Multistage on Chinese-English and French-English; the best results are in bold.

Table 5 :
Model    RG1    RG2    RGL
ctr+con  29.24  9.01   22.70
ctr+ctr  26.58  8.28   21.23
con+con  28.43  8.89   22.32
con+ctr  26.23  8.01   21.18

Effectiveness of alignment methods. ctr+con refers to the original ACROSS model framework. con+ctr denotes that the encoder uses the consistency loss and the decoder uses the contrastive loss. con+con and ctr+ctr indicate that both positions use the consistency loss or the contrastive loss, respectively.
We also analyze the improvement in different resource scenarios. As Table 7 shows, low-resource languages get a more significant improvement compared to high-resource languages.

4 https://huggingface.co/Helsinki-NLP

Table 6 :
Impact of translation ratio on the final result.