ZmBART: An Unsupervised Cross-lingual Transfer Framework for Language Generation

Despite recent advances in NLP research, cross-lingual transfer for natural language generation is relatively understudied. In this work, we transfer supervision from a high-resource language (HRL) to multiple low-resource languages (LRLs) for natural language generation (NLG). We consider four NLG tasks (text summarization, question generation, news headline generation, and distractor generation) and three syntactically diverse languages: English, Hindi, and Japanese. We propose an unsupervised cross-lingual language generation framework (called ZmBART) that does not use any parallel or pseudo-parallel/back-translated data. In this framework, we further pre-train the mBART sequence-to-sequence denoising auto-encoder with an auxiliary task using monolingual data from the three languages. The objective function of the auxiliary task is close to those of the target tasks, which enriches the multi-lingual latent representation of mBART and provides a good initialization for the target tasks. The model is then fine-tuned with task-specific supervised English data and directly evaluated on the low-resource languages in the zero-shot setting. To overcome catastrophic forgetting and spurious correlation issues, we apply model-component freezing and data augmentation approaches, respectively. This simple modeling approach gives us promising results. We also experimented with few-shot training (with 1000 supervised data points), which boosted model performance further. We performed several ablations and cross-lingual transferability analyses to demonstrate the robustness of ZmBART.


Introduction
Recent advances in natural language generation (NLG) rely heavily on large annotated training data. Such large task-specific annotated data is available for high-resource languages (HRLs) like English. The tasks become challenging when limited training data is available, as is often the case for low-resource languages (LRLs) like Hindi and Japanese. Manually annotating large datasets is time-consuming, expensive and tedious, which limits model development and product deployment for LRLs. Moreover, despite active research in cross-lingual representation learning (Hu et al., 2020; Conneau et al., 2020; Lewis et al., 2020b), cross-lingual transfer and generation is relatively under-explored. Motivated by these factors, we propose a novel framework to transfer supervision from an HRL to LRLs, where a model is trained on one language and directly evaluated on unseen languages. This enables cross-lingual transfer and generation for low-resource languages in zero- and few-shot settings for different tasks. The framework can be easily extended to other tasks and languages.
We carefully selected four challenging NLG tasks, i.e., news headline generation (NHG), question generation (QG), abstractive text summarization (ATS) and distractor generation (DG), to validate the framework's performance. NHG and ATS require understanding the input passage to generate a meaningful headline and summary, respectively. The QG task must accumulate information from a passage and an answer to generate high-quality questions. Distractor generation is the task of generating incorrect options for reading comprehension MCQs. It is challenging because generated distractors should be in context with the question but should not be semantically equivalent to the answer. We consider two LRLs, i.e., Hindi and Japanese, from two different language families. English is selected as the HRL from which learning is transferred to the LRLs. The three selected languages differ in their syntactic structures and are typologically diverse. As there is no established publicly available dataset for DG in Hindi, we also create a new DG dataset for Hindi called HiDG.
Our proposed framework for transferring supervision from an HRL to LRLs across multiple languages and multiple tasks is named ZmBART. ZmBART is based on mBART (Liu et al., 2020), a pre-trained model for cross-lingual natural language generation (NLG). We further pre-train mBART with a novel auxiliary task. The trained model is then fine-tuned on large task-specific supervised data in English and evaluated directly on Hindi and Japanese in zero/few-shot settings for the tasks under consideration. We observe that the auxiliary task plays a critical role in the model's performance and needs to be carefully designed. The framework can be applied directly to multiple cross-lingual generation tasks without even needing to modify model hyper-parameters. Figure 1 shows a zero-shot NHG sample output generated by the ZmBART model. Our main contributions in this work can be summarized as:
1. We propose a novel zero-shot cross-lingual generation framework called ZmBART that requires neither parallel data nor back-translation. The framework can be applied directly across multiple tasks without even modifying hyper-parameter values.
2. We demonstrate the effectiveness of ZmBART on four cross-lingual generation tasks across three typologically diverse languages.
3. We have created HiDG, a high-quality distractor generation dataset for the Hindi language.


Related Work

Recently, there have been a few works in the direction of transferring supervision from HRL(s) to LRL(s) for language generation. Kumar et al. (2019) used back-translation (which requires an MT system) and annotated supervised data for cross-lingual question generation. Chi et al. (2020) used parallel data to train a sequence-to-sequence model for zero-shot cross-lingual abstractive text summarization and question generation. Lewis et al. (2020a) proposed a pre-training scheme based on mono-lingual paragraphs; the pre-trained model is then used for zero-shot abstractive text summarization (ATS) in multiple languages. They trained a model on the ATS datasets of all the languages except the test language, so this approach needs annotated data in multiple languages. In summary, existing supervision-transfer methods require parallel data for cross-lingual tasks: either they use available parallel corpora directly, or they translate/back-translate data to generate pseudo-parallel corpora. Both approaches pose significant challenges, as task-specific parallel data for multiple languages is difficult to obtain, and MT systems are far from perfect, especially for low-resource languages.

Unlike the previous approaches, we do not use any parallel data or back-translation in our proposed framework. We do not pre-train any model from scratch; instead, we leverage the existing pre-trained mBART model. We include four challenging generation tasks across three syntactically diverse languages, and we do not modify any hyper-parameters across tasks and languages. All these considerations make the framework simple and easy to use, and make the addition of other languages and NLG tasks to the proposed framework a simple extension exercise.


ZmBART Framework

Figure 2 shows an outline of our proposed ZmBART framework. ZmBART is based on the pre-trained mBART (Liu et al., 2020) model. In our framework, we take the mBART model and further pre-train it on an auxiliary task. The auxiliary task is designed such that its objective function is close to the fine-tuning tasks while only utilizing mono-lingual data from the selected languages. Similar to the mBART model, we use language identifier tags with a slight modification: we concatenate <fxx><2xx> tags to each input data instance, where xx indicates the language. Given an input sentence and the language tag, the model encodes the sentence in a multi-lingual space. Conditioning on the encoded representation and the language tag, the decoder generates output text in the target language.
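To make the tagging scheme concrete, here is a minimal sketch of the input formatting; the helper name and exact tag strings are our own illustrative assumptions, following the <fxx><2xx> convention described above:

```python
def add_language_tags(text: str, lang: str) -> str:
    """Prefix an input instance with <fxx><2xx> language tags.

    `lang` is a short language code, e.g. "en", "hi", "ja". The exact
    tag strings are an assumption based on the convention in the text.
    """
    return f"<f{lang}><2{lang}> {text}"

print(add_language_tags("A short news article.", "hi"))
# → <fhi><2hi> A short news article.
```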

Multilingual BART (mBART)
Multilingual BART (Liu et al., 2020) extends the BART model (Lewis et al., 2020c) to multiple languages. It is a transformer-based sequence-to-sequence pre-trained model, trained with the BART language-model objective on monolingual data in many languages from the Common Crawl corpus. In particular, the training data is the concatenation of data from K languages, i.e., $D = \{D_1, D_2, \ldots, D_K\}$, where $D_i$ is a collection of monolingual documents in language $i$. Two types of noise are introduced to corrupt the text: (1) random token-span masking and (2) sentence-order permutation. mBART is trained as a denoising autoencoder: during training, the model has to predict the text $X$ from its corrupted version $g(X)$, where $g$ is the noise function. The aim is to maximize the following objective function:

$\mathcal{L}_\theta = \sum_{D_i \in D} \sum_{x \in D_i} \log P(x \mid g(x); \theta)$

where $x$ is a data instance of language $i$ and the probability distribution $P$ is defined by the sequence-to-sequence model. mBART gave state-of-the-art results on sentence- and document-level machine translation tasks. Details about the mBART model can be found in Liu et al. (2020).
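The two noise functions can be sketched as follows. This is a simplified illustration, not mBART's exact implementation: mBART masks subword spans whose lengths follow a Poisson distribution, whereas this sketch works on whitespace tokens; the function names and exact hyper-parameter handling are our own assumptions.

```python
import math
import random

def _poisson(rng: random.Random, lam: float) -> int:
    """Sample from Poisson(lam) via Knuth's algorithm (stdlib-only)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def corrupt(sentences, mask_token="<mask>", span_lambda=3.5, mask_ratio=0.35, seed=0):
    """Apply simplified versions of mBART's two noise functions:
    (1) mask random token spans with Poisson-distributed lengths,
    (2) permute the sentence order of the document.
    """
    rng = random.Random(seed)
    noisy = []
    for sent in sentences:
        tokens = sent.split()
        n_to_mask = int(round(mask_ratio * len(tokens)))
        while n_to_mask > 0 and tokens:
            span = max(1, min(n_to_mask, _poisson(rng, span_lambda)))
            start = rng.randrange(len(tokens))
            span = min(span, len(tokens) - start)
            tokens[start:start + span] = [mask_token]  # whole span -> one mask token
            n_to_mask -= span
        noisy.append(" ".join(tokens))
    rng.shuffle(noisy)  # sentence-order permutation
    return noisy
```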

Unsupervised Auxiliary Task
Although the mBART pre-trained model encodes a multi-lingual latent space, it cannot be used directly for cross-lingual generation. This is because the model is jointly trained on denoising objectives that do not directly follow auto-regressive decoding, causing a mismatch between pre-training and fine-tuning objectives. To overcome this problem, an unsupervised auxiliary task is introduced. We design the auxiliary task with the following desiderata in mind. It (1) should only utilize mono-lingual data from the selected languages, (2) should enrich the mBART latent representations for the selected languages, and (3) should train the decoder in a purely auto-regressive manner with a training objective close to the multiple fine-tuning tasks. The auxiliary task in ZmBART is an additional pre-training step that provides a better warm-start for downstream auto-regressive NLG tasks, even though the final task (distractor/question/summary generation) can differ from the auxiliary task. Additionally, this step allows the model to take a closer look at the languages under consideration and enrich/adjust its representations and parameters accordingly.
Outputs of the NLG tasks considered in this work are expected to contain words from different parts of the input. Generation of the output tokens is handled by the framework using an encoder-decoder setup. Hence we design an auxiliary task that also encodes the input and attends to this encoded representation to generate the output words in an auto-regressive manner. This way, a single auxiliary task can help enrich the token representations, warm up the encoder-decoder weights for fine-tuning, and cater to the multiple final output tasks. We define the auxiliary task as: given an input passage, generate a few random sentences (called the rand-summary) from the passage. Through experimentation we found that randomly selecting 20% of the sentences from the passage works best.
In particular, the input passage is 5-25 sentences long and the output is 1-5 random sentences from the passage. We do not assume any relation among the sentences of the passage. We sample equal proportions of monolingual data from the three languages. Data preparation steps for the auxiliary task are given below:
1. Generate a random number k ∈ {5, ..., 25}; k denotes the size of the input passage.
2. PASSAGE: Append k continuous sentences, starting from a random index of the monolingual corpus D_i of the i-th language.
3. RAND-SUMMARY: Randomly select 20% of the sentences from the passage.
4. Repeat steps 1 to 3 for p languages.
5. Repeat steps 1 to 4 N times, to collect Np <PASSAGE, RAND-SUMMARY> pairs.
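The steps above can be sketched as a short, runnable routine; the function and argument names are our own, and the corpus is modeled as a flat list of sentences:

```python
import random

def make_aux_pairs(corpus, n_pairs, summary_ratio=0.2, seed=0):
    """Build <PASSAGE, RAND-SUMMARY> pairs from a monolingual corpus
    (a list of sentences), following steps 1-3 above for one language.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        k = rng.randint(5, 25)                          # step 1: passage length
        start = rng.randrange(max(1, len(corpus) - k))
        passage = corpus[start:start + k]               # step 2: k continuous sentences
        n_summary = max(1, round(summary_ratio * len(passage)))
        rand_summary = rng.sample(passage, n_summary)   # step 3: ~20% random sentences
        pairs.append((" ".join(passage), " ".join(rand_summary)))
    return pairs
```

Running this once per language (steps 4-5) yields the Np training pairs used for the auxiliary pre-training step.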

Fine-Tuning on Downstream NLG Tasks
The proposed pre-trained model is directly fine-tuned on four downstream tasks: Question Generation (QG), News Headline Generation (NHG), Abstractive Text Summarization (ATS) and Distractor Generation (DG). First, the model is fine-tuned on large task-specific English supervised data; this trained model is then directly evaluated on Hindi and Japanese evaluation datasets in the zero-shot setting. To validate the hypothesis that the ZmBART framework is robust across multiple tasks and languages, we did not modify any hyper-parameters during fine-tuning. It is often observed that including a few instances from the LRL in the supervised data boosts model performance. To validate this point, we further fine-tuned ZmBART with 1000 task-specific supervised data points in Hindi and Japanese in the few-shot setting, which indeed boosts performance.

Dealing with Catastrophic Forgetting and Spurious Correlation
During experimentation with the zero-shot setup, we observed that the model always generates output text in English, irrespective of the input and the language tag. We suspect this is due to the catastrophic forgetting problem (Van de Ven and Tolias, 2019): the supervised training completely overrides/erases the pre-trained learning. The generator (decoder) becomes biased towards English due to the explicit supervision learned from the large task-specific English data. To overcome this problem, we freeze all word embeddings and all the parameters of the decoder layers during fine-tuning with English data. Although this resolves the problem for NHG, QG and DG, it does not completely resolve it for the ATS task. We noticed that the zero-shot ATS output was no longer entirely in English, but became code-mixed in nature. In other words, the number of English words in the output reduced, but many English words still remained. The code-mixed outputs were logical and meaningful. We attribute this to a spurious correlation issue, also reported in (Gu et al., 2019). To resolve it, we added a few examples (25 in number) of the auxiliary-task data during the fine-tuning step. This augmentation was helpful in addressing the spurious correlation issue for ATS. Note that the non-English data used for this augmentation is still unsupervised and monolingual in nature.
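The freezing and augmentation strategies can be sketched as follows. The parameter-name patterns are illustrative assumptions (real names depend on the implementation, e.g. fairseq); in PyTorch one would set `requires_grad = False` on the matched parameters:

```python
def freeze_mask(param_names):
    """Return {name: should_freeze}: freeze all word embeddings and all
    decoder-layer parameters during English fine-tuning. The name
    patterns here are illustrative, not exact fairseq names.
    """
    return {n: ("embed" in n or n.startswith("decoder.")) for n in param_names}

def augment_with_aux(train_data, aux_data, n_aux=25):
    """Mix a handful of auxiliary-task examples into the supervised
    fine-tuning data to counter the spurious correlation issue."""
    return list(train_data) + list(aux_data)[:n_aux]
```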

Experimental Setup and Results
We conduct experiments over four NLG tasks in three languages. We compare the performance of ZmBART with strong baseline models, including MT-pipeline-based ones. We use both automated and manual evaluation metrics to evaluate model performance.

Baselines
Prior results are not available in the literature for the selected languages and datasets. Hence, for performance comparison, we developed several strong baselines based on recent models and architectures. Details of these baselines are given below:
• MT Pipeline (mBART): Here, we fine-tune mBART on task-specific English data. Non-English test instances are first translated into English and passed to the fine-tuned model; the output is then translated back into the input language. Google Translate is used for the translations.
• mBART+MADMO: An mBART-based baseline where the auxiliary task has a Masking And Denoising objective with MOno-lingual data in three languages. The aim is to enrich the cross-lingual latent representation space of mBART for English, Hindi and Japanese.
• mBART+MADPD: Inspired by (Chi et al., 2020), we took Parallel Data (English-Hindi and English-Japanese) and concatenated each pair of parallel instances. We then used this data with a Masking And Denoising objective to further train mBART. Including parallel data provides explicit supervision for generating Hindi and Japanese text.
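The MT-pipeline baseline reduces to a translate-generate-back-translate loop. In the sketch below, `translate` is a stand-in for the MT service (Google Translate in our setup); its signature and the helper names are our own assumptions:

```python
def mt_pipeline(model, translate, text, src_lang):
    """MT-pipeline baseline: translate input to English, run the
    English fine-tuned model, translate the output back.

    `model(text) -> text` is the fine-tuned English generator and
    `translate(text, src, tgt) -> text` is a stand-in MT function.
    """
    english_input = translate(text, src_lang, "en")
    english_output = model(english_input)
    return translate(english_output, "en", src_lang)
```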

Evaluation
We use both automated and manual evaluation metrics for performance comparison. Multiple metrics are used in the literature for NLG tasks. Since we consider multiple tasks, for brevity we report, for each task, only the metrics commonly used by the community for that task. For automatic evaluation we use both lexical-match metrics (BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004)) and an embedding-based metric (BERTScore (Zhang et al., 2020)). To evaluate the question generation and distractor generation tasks we use the case-mixed BLEU-4 (BL) score from the sacreBLEU implementation, ROUGE-L (R-L) and BERTScore (BS). For the ATS and NHG tasks, ROUGE-1, ROUGE-2 and ROUGE-L are used. We follow a manual evaluation approach similar to Chi et al. (2020). We sampled 50 generated data points each for the QG, ATS and NHG tasks in both Hindi and Japanese. We use three metrics: Fluency (Flu), Relatedness (Rel) and Correctness (Corr). Fluency measures how fluent the generated text is; Relatedness indicates how much the generated outputs are in context with the input(s); Correctness measures semantics and meaningfulness. For DG, we use an additional metric called Distractibility, which measures the degree of confusion caused by the generated incorrect options. For the DG task there can be a large number of good distractors for a given input, in which case manual evaluation is more reliable; we therefore sample 100 generated outputs for the DG task. We employed a large pool of evaluators who are native Hindi and Japanese speakers to evaluate the Hindi and Japanese output texts, respectively. We asked each annotator to rate the generated texts on a scale of 1-5 (1 is very bad and 5 is very good) for all the metrics. We intentionally selected outputs of ZmBART and the two best baselines to reduce the evaluators' workload.
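For reference, ROUGE-L is the F-measure derived from the longest common subsequence (LCS) of candidate and reference. A minimal re-implementation is shown below for illustration; reported scores use the standard community packages, not this sketch:

```python
def rouge_l(candidate: str, reference: str, beta: float = 1.2) -> float:
    """ROUGE-L F-score from the LCS of whitespace-tokenized texts.
    F = (1 + beta^2) * P * R / (R + beta^2 * P).
    """
    c, r = candidate.split(), reference.split()
    # Dynamic-programming table for LCS length.
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, ct in enumerate(c, 1):
        for j, rt in enumerate(r, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if ct == rt else max(dp[i-1][j], dp[i][j-1])
    lcs = dp[len(c)][len(r)]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return (1 + beta**2) * prec * rec / (rec + beta**2 * prec)
```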

News Headline Generation (NHG)
In this task, given a news article, we generate a grammatically coherent, semantically correct and abstractive headline. We use 500k/30k/30k (train/validation/test) English NHG data splits from the Gigaword headline generation corpus. For Hindi and Japanese we use a 1k/1k/5k split from Kaggle (we manually filtered for high-quality news and headlines) and from (Iwama and Kano, 2019), respectively.
In the zero-shot setting we fine-tune the ZmBART model on supervised data and directly evaluate it on the Hindi and Japanese test datasets. Automated evaluation results are included in Tables 1 and 2. We observe that the quality of generated headlines in Hindi is better than in Japanese; a possible reason is the input size. ZmBART outperforms the baselines with an absolute difference of 5.22 ROUGE-L score. mBART+MADMO is the best among the other baselines, which shows that masking and denoising with monolingual data does enrich the multi-lingual latent space for the three selected languages. However, mBART+MADMO generates code-mixed (Hindi-English or Japanese-English) output, which degrades its performance. Few-shot training corrects the mistakes of the zero-shot models and generates better-quality output. Manual evaluation scores (Tables 3 and 4) correlate well with the automated scores, validating ZmBART's performance on the NHG task.

Question Generation (QG)
In the Question Generation (QG) task, given an input passage and an answer, the aim is to generate semantically and syntactically correct questions that can be answered by that answer. We use SQuAD 1.1 (Rajpurkar et al., 2016) English data for supervised training. SQuAD is a popular question answering dataset consisting of 100k+ <passage, question, answer> tuples. Following (Zhao et al., 2018), we combine the train and validation sets of SQuAD and then split them into 80k/8k/10k training/validation/test tuples. For Hindi we use 1k/5.5k (train/test) tuples from the MLQA (Lewis et al., 2020d) and TyDiQA-GoldP (Clark et al., 2020) datasets. We use a 1k/1k/5k split for Japanese data from (Takahashi et al., 2019). The Hindi and Japanese data are available in the SQuAD data format, which maintains consistency in terms of passage size, questions and number of answers. For a given passage and question we randomly select one answer to form the dataset. We combine the answer and passage into a single input sequence separated by the special token <s>.
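A minimal sketch of this input construction (helper names are ours; the same `<s>`-separated scheme is reused for distractor generation with an extra question field):

```python
SEP = "<s>"

def build_qg_input(answer: str, passage: str) -> str:
    """QG: answer and passage as one sequence, separated by <s>."""
    return f"{answer} {SEP} {passage}"

def build_dg_input(answer: str, question: str, passage: str) -> str:
    """DG: answer, question and passage concatenated in that order."""
    return f"{answer} {SEP} {question} {SEP} {passage}"
```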
Even without any parallel data, ZmBART outperformed all the baselines consistently across all automated evaluation metrics in the zero-shot setting. Regarding manual evaluation, we see that the Hindi questions received good scores from the annotators, whereas the questions generated for the Japanese inputs were considered poor. Upon closer inspection of the generated text we find that several generated questions start with English wh-words. This mixing of English 'code' into the output happened somewhat seamlessly for the Hindi data, as tokens in both languages are written left to right. Moreover, Hindi-English code-mixed data is now quite common, and the annotators mostly accepted the mixing of wh-words with the Hindi texts. Such mixing is not common in Japanese text; as a result, the annotators assigned lower scores to such outputs. We then tried to understand why the wh-words appear at the beginning of the output. English interrogative sentences often introduce wh-words at the beginning even when they are not present in the original data. The model gets exposed to this special characteristic of English interrogative sentences during fine-tuning, and the output in other languages is impacted by it in zero-shot settings. However, the semantics of the text is captured well by the model, as demonstrated by the high BERTScore, indicating good cross-lingual transfer of semantic knowledge.

Abstractive Text Summarization (ATS)
In Abstractive Text Summarization (ATS), we aim to generate grammatically coherent, semantically correct and abstractive summary given an input document. We use recently released WikiLingua (Ladhak et al., 2020) cross-lingual abstractive summarization dataset containing data in 18 languages. Prior splits are not available for this dataset. We use 131k/5k/5k (train/validation/test) splits for English, and 1k/1k/5k splits for Hindi and Japanese.
By skimming through the Hindi data we observe that many input documents consist of technical instructions on the usage of software tools. Summarizing these instructions is challenging. Zero-shot ZmBART performed better than the baselines, as shown by the human evaluation (Tables 3 and 4 for Hindi and Japanese, respectively). The human evaluation results correlate with the automated evaluation shown in Tables 1 and 2. Ladhak et al. (2020) reported cross-lingual ATS scores on the same data for four different languages; their R-L scores for the four languages are 34.06, 37.09, 31.67 and 32.33. We obtain R-L scores of 27.22 and 33.49 for Hindi and Japanese respectively, which shows that the few-shot performance of ZmBART is acceptable.

Distractor Generation (DG)
The final task used to judge ZmBART's performance is Distractor Generation (DG), the task of generating incorrect options (also known as distractors) for reading comprehension MCQs. The generated distractors should be in context with the question but should not be semantically equivalent to the answer. Formally: for a given <passage, question, answer> triplet, generate a long, coherent, and grammatically correct wrong option. Considering that for a given triplet there can be many incorrect options that are completely different from each other, the problem is even more challenging. We use the English DG dataset from (Maurya and Desarkar, 2020), which consists of an approx. 135k/17k/17k (train/validation/test) split. We were unable to find a suitable dataset in Japanese. For Hindi we created a dataset called HiDG with a 1k/1k/5k split. Similar to QG, to create the input for ZmBART we concatenate the answer, question and passage in that order, separated by the special token <s>.
To generate HiDG, we first extracted <passage, question, answer> triplets from English SQuAD 1.1 with at least 150 tokens per triplet. We generated distractors for these examples using the model proposed by Maurya and Desarkar (2020). The distractors were translated to Hindi using the Google Translate service, and the translated distractors were manually verified or corrected (where necessary) by human annotators. Evaluation of this task is challenging because: (1) there can be more than one correct distractor, an aspect automated evaluation metrics may not capture since only one ground-truth distractor is available; and (2) a generated distractor may be semantically similar to the answer while having high lexical overlap with the reference distractor, in which case lexical-match-based metrics are not suitable. To evaluate the DG task we therefore mainly rely on BERTScore and manual evaluation, and we consider a higher number of DG samples for manual evaluation. Results in Tables 1 and 3 indicate the superiority of ZmBART over the baseline models for this task.
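The first filtering step of the HiDG creation pipeline might look like the sketch below; the dictionary keys are illustrative assumptions about the data format, not the exact SQuAD field names:

```python
def select_triplets(squad_examples, min_tokens=150):
    """Keep <passage, question, answer> triplets whose combined
    whitespace-token count is at least `min_tokens` (150 for HiDG).
    """
    keep = []
    for ex in squad_examples:
        n = (len(ex["passage"].split())
             + len(ex["question"].split())
             + len(ex["answer"].split()))
        if n >= min_tokens:
            keep.append(ex)
    return keep
```

The selected triplets are then fed to the distractor generator, machine-translated, and manually post-edited as described above.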
To summarize, we have performed experiments for 14 different task-setup combinations involving low-resource languages: four tasks in Hindi and three tasks in Japanese, each in the zero-shot and few-shot setups, with detailed comparative evaluation for each. The tasks are of different natures, and each offers its own unique challenges. We critically analyze the performances to show the robustness and range of applicability of the proposed ZmBART framework. We use the fairseq library (Ott et al., 2019) for all implementations and experiments. Implementation details are included in the supplementary material.

Results Analysis and Ablation Study
In this section, we provide further analysis of the experimental results. We also perform ablation studies to understand the impacts of the different modeling decisions made in designing the framework.
• Supervised Training Results: Table 5 shows the comparative results of fine-tuned mBART with and without the auxiliary task on task-specific supervised English data. We observe no significant performance degradation of ZmBART relative to the original mBART model under purely supervised training. In fact, the auxiliary task yields slight improvements over the original mBART performance in most setups. We conclude that ZmBART can be adopted as a replacement for the original mBART model, with additional functionality.
• Effect of Auxiliary Task: Table 6 presents the results of ZmBART with and without the auxiliary task for the ATS and QG tasks in the zero-shot setting. Without the auxiliary task, lexical-match-based scores are poor because the decoder generates code-mixed outputs. BERTScore is still reasonable without the auxiliary task, owing to the multilingual mBART embeddings. However, generation in the appropriate language is enabled only after inclusion of the auxiliary task. The auxiliary task thus contributes in two ways: it enables zero-shot generation, and it further improves the mBART multilingual latent space, as indicated by the improved BERTScore.
(Table 6: Zero-shot results of ZmBART with and without the auxiliary task for Hindi and Japanese.)
With these results, we now want to understand whether the auxiliary task generalizes across multiple tasks or favors specific tasks. Among the tasks considered in this work, generating meaningful summaries/headlines requires understanding/abstracting the input text, which is unlikely to be obtained by repeating sentences from input passages, as the auxiliary task does. ZmBART achieves good zero-shot/few-shot/supervised results (Tables 1-5) on ATS and NHG over strong baselines. The generated headlines and summaries were found to be mostly abstractive; they do not contain long continuous sequences from the input text. As described in Sections 4.4 and 4.6, question generation and distractor generation are more challenging tasks with objectives vastly different from the auxiliary task's objective. Even for these tasks, decent evaluation scores (Tables 1-5) and improvements over the baselines across the languages considered indicate that the solutions are not spurious. Incorporation of the auxiliary task improves the performance of diverse downstream tasks on real benchmark datasets and does not favor any specific task or dataset.
• Approaches to avoid Catastrophic Forgetting: We use two approaches to address the catastrophic forgetting problem: (a) freezing model components and (b) optimized regularization (Van de Ven and Tolias, 2019). Tables 7 and 8 show the automated evaluation results with the different approaches used to deal with the catastrophic forgetting problem. Note that the proposed modeling setup (i.e., ZmBART) gives the best results.
• Effect of Architecture on Few-shot Training: In this setup we experiment with few-shot training of mBART (directly fine-tuned on task-specific supervised English data) and ZmBART (trained with the auxiliary task and fine-tuned with English data). The results are presented in Table 9. We find that ZmBART does better than mBART in the corresponding setups. Moreover, although freezing the decoder layers and word embeddings helps in the zero-shot setting, it is natural and useful to unfreeze them during few-shot training.
• Few-shot performance with Supervised data: Figures 3 and 4 show the trends of few-shot training of ZmBART with respect to the amount of supervised Hindi and Japanese training data for ATS

Conclusion
In this paper, we propose a novel unsupervised framework (ZmBART) for cross-lingual transfer and generation. The framework transfers supervision from an HRL to LRLs, which enables zero-shot language generation. It does not use any direct or pseudo-parallel data. ZmBART is directly applied to multiple generation tasks and languages. The model includes a carefully designed auxiliary task that further improves the multilingual embedding space and helps initialize the encoder-decoder weights to enable zero-shot language generation. We performed experiments in three languages and 18 task-setup combinations: four supervised tasks in English, four tasks in Hindi (each with zero-shot and few-shot), and three tasks in Japanese (each with zero-shot and few-shot).
Except for the zero-shot question generation tasks, for all other tasks involving LRLs the proposed model generated good-quality results, as validated by automated and manual evaluation measures. In the future we want to extend this work by adding multiple

Implementation Details:
We use a standard sequence-to-sequence Transformer architecture with 12 layers (each with 16 heads) for both the encoder and decoder. The model has a dimension of 1024 (approx. 680M parameters). Additional layer normalization was used with both the encoder and decoder. We found that FP16 precision stabilized the training. We trained all models on 4 Nvidia V100 GPUs (32GB). Similar to mBART, we use the Adam optimizer (ε = 1e-6, β₂ = 0.98) and linear learning-rate decay scheduling. Training started with a dropout value of 0.3, later reduced to 0.2 after 20k steps and to 0 after 40k steps. The loss function was label-smoothed cross-entropy. 2500 warm-up steps and a 3e-5 learning rate were used. Model selection was done based on validation-data likelihood. We use beam search with beam size 5 for decoding in all tasks. We loaded the mBART-CC25 pre-trained checkpoint weights and further pre-train/fine-tune the model on task-specific data with teacher forcing. The above set of parameters is used for all target tasks as well as the auxiliary task. We use different batch sizes for different tasks: 2048, 3000, 4096, 2048, and 5000 tokens per GPU for the ATS, DG, QG, auxiliary, and NHG tasks, respectively. We use a shared Byte Pair Encoding (BPE) vocabulary of size 250k from the sentencepiece tokenizer. We use 34k/1k/1k (train/validation/test) data points for the auxiliary task (approx. 11,333 from each language). We train the mBART model on the auxiliary task for around 10k steps; training time for the auxiliary task is around 2-3 hours. The fine-tuning times for ATS, QG, NHG, and DG were around 4-5, 1-2, 1-2, and 2-3 hours, respectively. We observe a longer fine-tuning time for ATS because of the long passages. We selected the best model based on loss and perplexity on the validation datasets. We checked early stopping and other checkpoints, which resulted in poorer performance.
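The hyper-parameters above roughly correspond to a fairseq-train invocation along the following lines. This is a sketch based on fairseq's public mBART fine-tuning recipe, not our exact command; paths, the label-smoothing value, and some flags are illustrative assumptions:

```shell
fairseq-train path/to/binarized-data \
  --restore-file mbart.cc25/model.pt \
  --arch mbart_large --task translation_from_pretrained_bart \
  --langs "$ALL_25_LANGS" \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.2 \
  --optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' \
  --lr 3e-05 --lr-scheduler polynomial_decay --warmup-updates 2500 \
  --dropout 0.3 --max-tokens 4096 --fp16 \
  --reset-optimizer --reset-meters --reset-dataloader --reset-lr-scheduler
```

Here `--max-tokens` would be set per task (e.g. 4096 for QG, 5000 for NHG) and the dropout schedule adjusted at 20k/40k steps as described above.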

Evaluation Metric and Tokenizer Details:
For automated evaluation, we use the sacreBLEU implementation, ROUGE-L, and BERTScore. For the ATS and NHG tasks, ROUGE-1, ROUGE-2, and ROUGE-L are used. We explicitly use community-adopted language-specific tokenizers. Links for the language-specific tokenizers are given below:

Few Zero-shot Generated Outputs from ZmBART:
In the next few figures, we present sample outputs generated by the model in zero-shot setups for Hindi and Japanese.