mLongT5: A Multilingual and Efficient Text-To-Text Transformer for Longer Sequences

We present our work on developing a multilingual, efficient text-to-text transformer that is suitable for handling long inputs. This model, called mLongT5, builds upon the architecture of LongT5, while leveraging the multilingual datasets used for pretraining mT5 and the pre-training tasks of UL2. We evaluate this model on a variety of multilingual summarization and question-answering tasks, and the results show stronger performance for mLongT5 when compared to existing multilingual models such as mBART or M-BERT.


Introduction
In recent years, there has been much work on making transformer-based models more efficient so that they can handle longer input sequences. Many of these models, though, have been English-only, making them inapplicable to other languages.
In this paper, we present our work in extending one of these models to handle multilingual data. Our model, called mLongT5, takes advantage of the efficient architecture of LongT5 (Guo et al., 2022), and has been pretrained on the multilingual mC4 dataset (Xue et al., 2021) to work on multilingual tasks. We have applied mLongT5 to a variety of multilingual summarization and question-answering tasks, and the results show that mLongT5 exhibits strong performance in these domains.
The configurations (https://github.com/google/flaxformer/tree/main/flaxformer/t5x/configs/longt5/models) and checkpoints (https://github.com/google-research/longt5) have all been open-sourced.

Related Work
There are two areas of related work: efficient transformer models that can handle long inputs, and multilingual models.
There has been much interest of late in making transformer models more efficient, in particular so that they can handle longer inputs. Examples include ETC (Ainslie et al., 2020), Big Bird (Zaheer et al., 2020), LongT5 (Guo et al., 2022), and Longformer (Beltagy et al., 2020). These models take various approaches to addressing the quadratic growth of the attention mechanism in transformers. Unfortunately, though, they are trained on English datasets, limiting their use in multilingual domains.
With respect to multilingual models, these include mT5 (Xue et al., 2021), mBART (Liu et al., 2020), and the recent umT5 (Chung et al., 2023). These models reuse the architectures of their English counterparts but are pretrained on larger, multilingual corpora, with mT5 and umT5 trained on 101 languages and mBART on 25. While these models show strong performance across a wide variety of languages, they suffer the same restriction as the original English models in not being able to scale up to longer sequences.

Model
mLongT5 builds upon the architecture of LongT5 (Guo et al., 2022). LongT5 was developed to efficiently handle long inputs by utilizing a more efficient attention mechanism. The model was shown to have strong performance on a variety of downstream tasks, and thus is the foundation for mLongT5.

Datasets
To make mLongT5 multilingual, we leverage the mC4 dataset used for training the multilingual model mT5 (Xue et al., 2021), which covers 101 languages. This dataset has recently been updated, as described by Chung et al. (2023), and was used for training umT5 and for creating a new SentencePiece model (Kudo and Richardson, 2018). We therefore make use of the same SentencePiece model used for umT5, allowing mLongT5 to handle multilingual inputs.

Pretraining Tasks
One key difference between our model and LongT5 is the change of pretraining task. LongT5 made use of PEGASUS' Principle Sentences Generation (PSG) (Zhang et al., 2020) for pretraining its models. While this was shown to give strong performance on various downstream tasks, the one weakness of PSG is that it is less suitable for multilingual training. PSG relies on being able to split a piece of text into sentences, with the current implementation best suited for Latin-based languages. The need to properly break text into sentences for 101 different languages makes it a challenging task to use in a multilingual setting.
To overcome this, we instead apply UL2's pretraining task (Tay et al., 2022). This task, called Mixture-of-Denoisers (MoD), has the model learn from a mixture of denoising objectives, and has been shown to work better than T5's original pretraining task (Raffel et al., 2019). More importantly, MoD can be applied to other languages much more easily than PSG, making it ideal for pretraining mLongT5.

Pretraining Details
Pretraining mLongT5 is broadly similar to pretraining LongT5. The model is pretrained for one million steps, and we pretrained model sizes of Base, Large, and XL. We also use the same pretraining lengths: 4,096 tokens for the inputs and 910 for the targets. One small difference is increasing the batch size from 128 to 256, allowing the model to train on the same number of tokens as mT5. For the mC4 dataset, we used version 3.1.0, which is the version updated by Chung et al. (2023). For dataset sampling, we use the UniMax sampling method (Chung et al., 2023).
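As a rough illustration of the idea behind UniMax-style sampling, the sketch below greedily gives every language an equal share of a character budget, capped at a fixed number of epochs over its corpus, with leftover budget flowing to the larger languages. The function name, budget figures, and epoch cap are all illustrative; the published algorithm differs in its details.

```python
def unimax_proportions(lang_chars, budget_chars, max_epochs=4):
    """Greedy UniMax-style allocation (simplified sketch).

    Each language gets an equal share of the remaining character
    budget, capped at `max_epochs` passes over its corpus; budget
    left over from capped languages flows to the larger ones.
    """
    # Process languages from smallest corpus to largest, so caps
    # are applied before the leftover budget is shared out.
    langs = sorted(lang_chars, key=lang_chars.get)
    alloc, remaining = {}, float(budget_chars)
    for i, lang in enumerate(langs):
        fair_share = remaining / (len(langs) - i)
        alloc[lang] = min(fair_share, max_epochs * lang_chars[lang])
        remaining -= alloc[lang]
    total = sum(alloc.values())
    return {lang: chars / total for lang, chars in alloc.items()}

# A tiny language is capped at 4 epochs over its corpus; the two
# large languages split the remaining budget evenly.
props = unimax_proportions(
    {"yo": 1e6, "de": 1e9, "en": 1e10}, budget_chars=3e9)
```

The effect is that low-resource languages are never oversampled past the epoch cap, while high-resource languages are sampled near-uniformly from whatever budget remains.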
Instead of PSG as the pretraining task, we apply MoD, using the same configuration as defined in the original UL2 task definition. The only exception is that we do not use a corruption rate of 0.5 (using only a corruption rate of 0.15), as our input lengths (4,096) are much longer than our target lengths (910), making a corruption rate of 0.5 infeasible.
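To make the objective concrete, here is a minimal sketch of the span-corruption denoising used in T5-style pretraining, one ingredient of UL2's MoD mixture, with the 0.15 corruption rate mentioned above. The helper name and the simplified span-placement logic are our own, not the actual pretraining implementation.

```python
import random

def span_corrupt(tokens, noise_density=0.15, mean_span_len=3.0, seed=0):
    """Sketch of T5-style span corruption: drop random spans from the
    input, replacing each with a sentinel; the target is each sentinel
    followed by the span it replaced."""
    rng = random.Random(seed)
    n = len(tokens)
    n_noise = max(1, round(n * noise_density))
    n_spans = max(1, round(n_noise / mean_span_len))
    span_len = max(1, n_noise // n_spans)
    # Greedily pick non-overlapping span starts (a simplification of
    # the random span sampling used in practice).
    starts, candidates = [], list(range(n - span_len + 1))
    rng.shuffle(candidates)
    for c in candidates:
        if all(abs(c - s) >= span_len for s in starts):
            starts.append(c)
        if len(starts) == n_spans:
            break
    starts.sort()
    inputs, targets, prev = [], [], 0
    for i, s in enumerate(starts):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[prev:s])
        inputs.append(sentinel)
        targets.append(sentinel)
        targets.extend(tokens[s:s + span_len])
        prev = s + span_len
    inputs.extend(tokens[prev:])
    return inputs, targets
```

At a 0.15 rate, a 4,096-token input drops roughly 614 tokens, which fits comfortably in a 910-token target; at a 0.5 rate the dropped text alone would be about 2,048 tokens and overflow it, which is why that denoiser is omitted.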
All models were pretrained using 256 TPUv4 chips. Wall time to pretrain these models was 1.9 days for Base, 3.7 days for Large, and 12.4 days for XL.

Results
As with the original LongT5 paper, we look at two domains for evaluating our model: summarization and question answering.
For all of these tasks, we use the default values as used for T5 finetuning, only explicitly setting the input and target lengths as described in the tasks below.

Summarization
The three summarization tasks we are looking at are:
• MLSUM (Scialom et al., 2020): a collection of newspaper articles and their corresponding summaries in five languages: French, German, Spanish, Russian, and Turkish.
• XL-Sum (Hasan et al., 2021): a collection of BBC articles and summaries in 44 languages.
• WikiLingua (Ladhak et al., 2020): a collection of documents from WikiHow (in Spanish, Turkish, Russian, and Vietnamese) that have been translated and summarized into English. For this task, we use the GEM (Gehrmann et al., 2021) version of the dataset, allowing us to make use of their fixes in the splitting of the dataset into training and test sets.
These tasks allow us to explore summarization where the task involves documents and their summaries in the same language (MLSUM, XL-Sum), or where the task involves both translation and summarization at the same time (WikiLingua).
We note that, with respect to task lengths, these multilingual tasks are not very long when compared to the tasks covered in the original LongT5 paper. There is unfortunately a lack of lengthy multilingual summarization tasks available, so we use these three for comparison. We tested with lengths of 4k for the input and 512 for the output, which covers most documents in all the above tasks.
For all these tasks, we report standard ROUGE scores (ROUGE-1, ROUGE-2, and ROUGE-L).
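For reference, ROUGE-1 is a unigram-overlap F-measure between a candidate summary and a reference. The toy implementation below is a sketch on lowercased whitespace tokens; the reported scores come from the standard ROUGE tooling, which additionally handles stemming and computes ROUGE-2 and ROUGE-L.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

ROUGE-2 follows the same pattern over bigrams, and ROUGE-L scores the longest common subsequence instead of n-gram overlap.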

MLSUM
Table 1 shows our results for the MLSUM task. We compare to the M-BERT (Devlin et al., 2018) model used in the original paper. The authors only reported ROUGE-L scores, while we also report ROUGE-1 and ROUGE-2 scores. Looking at the ROUGE-L scores, we can see that mLongT5 performs comparably to M-BERT for French, while doing better than M-BERT at all model sizes for German, Spanish, and Turkish. It is only on Russian that it does slightly worse. As noted in the original paper, Russian was the hardest language for language models, due to having a much smaller dataset than the other languages in the corpus and a higher rate of novelty (words found in the summary but not in the input document). Additionally, as mentioned before, the dataset's input lengths are not very long, so models with full attention can take better advantage of the short lengths than mLongT5 can. This may contribute to mLongT5 not performing as well in this instance.

XL-Sum
For XL-Sum, we finetuned the model in a similar manner to the original paper: we finetuned on a mixture of all the languages for 50,000 steps, and then evaluated each individual language with this single model.
Table 2 shows a subset of the languages (the full results can be seen in Appendix A). We highlight languages that had longer input lengths (due both to the length of the original documents and to how they are subsequently tokenized by the SentencePiece model).
As we can see, mLongT5 performed well compared to mT5 for these lengthier inputs. Comparing Base to Base, it did slightly worse, as expected given mT5's full attention. The original LongT5 model had likewise shown slightly worse performance than a full-attention model when finetuned on datasets of shorter lengths, and we see similar results here. But mLongT5 can more easily scale to larger model sizes, and as such, we see stronger results as we increase the size of the model.

WikiLingua
The final summarization task is WikiLingua, with results shown in Table 3. This task requires both translation and summarization, translating from a full document in another language into an English summary. As previously mentioned, we use the GEM version of this task, and compare our results to the mT5 model on their leaderboard.
As shown in the results, mLongT5 tends to do better across the four languages at most model sizes, with only slightly worse performance at the XL size for Spanish.

Question-Answering
For question answering, we applied mLongT5 to TyDi QA (Clark et al., 2020). TyDi QA is a multilingual task covering 11 languages, in which the goal is to answer questions given a Wikipedia article. There are two versions of this task, and we focus on the Minimal Answer Span Task, in which one must either find the minimal span that answers the question, give a yes/no answer if the question is a yes/no question, or return Null if the question cannot be answered from the article.
Table 2: Results for XL-Sum, focusing on languages that have lengthier inputs. The rest of the results can be seen in Appendix A.

Similar to the original LongT5 paper and its application to Natural Questions, we have redefined this task from extracting answer spans to a seq2seq task of generating answer texts. The results shown will therefore differ from those on the TyDi QA leaderboard. As such, we have also run the comparable mT5 model on the same task to obtain a baseline to compare against. Additionally, as the test set is not available for this task, we use 90% of the training data as the train set and the remaining 10% as the dev set, and use the original dev set as our test set for reporting metrics.
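One simple way to realize such a 90/10 split reproducibly is to hash a stable example identifier into percentage buckets. This is a generic sketch, not the paper's stated mechanism, and the identifier argument is hypothetical.

```python
import hashlib

def split_bucket(example_id, dev_percent=10):
    """Assign an example to 'train' or 'dev' deterministically by
    hashing its id, so the split is stable across runs."""
    digest = hashlib.md5(example_id.encode("utf-8")).hexdigest()
    return "dev" if int(digest, 16) % 100 < dev_percent else "train"
```

Because the assignment depends only on the id, every run (and every worker) reproduces the same split without shared state.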

Unlike the summarization tasks, TyDi QA has much longer input lengths: a mean of 5,148 tokens and a 90th percentile of 12,967 tokens when tokenized with the SentencePiece model. As such, for mT5 we tested with input lengths between 512 and 4k, while for mLongT5 we tested with input lengths between 4k and 16k.
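Length statistics like these can be computed from per-example token counts with a simple nearest-rank percentile. This helper is illustrative, assuming the counts come from running the SentencePiece model over each article.

```python
def length_stats(token_lengths):
    """Mean and 90th-percentile (nearest-rank) of token counts."""
    ordered = sorted(token_lengths)
    mean = sum(ordered) / len(ordered)
    rank = max(0, round(0.9 * len(ordered)) - 1)  # 1-based nearest rank
    return mean, ordered[rank]
```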
Table 4 shows the results of running mT5 and mLongT5 on this dataset. For this task, we report Exact Match (EM) and F1 scores. As can be seen in the results, mLongT5 is better able to answer the questions, given that it can handle longer input sequences.

Conclusion
We have presented our new model, mLongT5. It combines the efficient architecture of LongT5 with the ability to handle multilingual inputs and outputs. As our report shows, the model performs well on a variety of summarization and question-answering tasks.

Limitations
mLongT5 has the same limitation as the original LongT5 model, in that it is better suited to tasks with lengthier inputs. Tasks with shorter inputs will be better served by models like mT5 and umT5, which can take full advantage of full attention.

Table 3: WikiLingua summarization results. These results use the GEM version of the task.