BARThez: a Skilled Pretrained French Sequence-to-Sequence Model

Inductive transfer learning, enabled by self-supervised learning, has taken the entire Natural Language Processing (NLP) field by storm, with models such as BERT and BART setting new state of the art on countless natural language understanding tasks. While there are some notable exceptions, most available models and research efforts have focused on the English language. In this work, we introduce BARThez, the first BART model for the French language (to the best of our knowledge). BARThez was pretrained on a very large monolingual French corpus from past research that we adapted to suit BART's perturbation schemes. Unlike already existing BERT-based French language models such as CamemBERT and FlauBERT, BARThez is particularly well-suited for generative tasks, since not only its encoder but also its decoder is pretrained. In addition to discriminative tasks from the FLUE benchmark, we evaluate BARThez on a novel summarization dataset, OrangeSum, that we release with this paper. We also continue the pretraining of an already pretrained multilingual BART on BARThez's corpus, and we show that the resulting model, which we call mBARThez, provides a significant boost over vanilla BARThez, and is on par with or outperforms CamemBERT and FlauBERT.


Introduction and background
Inductive transfer learning, that is, solving tasks with models that have been pretrained on very large amounts of data, was a game changer in computer vision (Krizhevsky et al., 2012). In NLP, while annotated data are scarce, raw text is virtually unlimited and readily available. It thus emerged that the ability to learn good representations from plain text could greatly improve general natural language understanding. Learning without labels is enabled via self-supervised learning, a setting in which a system learns to predict part of its input from other parts of its input. In practice, one or more supervised tasks are created from the unlabeled data, and the model learns to solve these tasks with custom objectives.
Some of the earliest and most famous self-supervised representation learning approaches in NLP are word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014) and FastText (Bojanowski et al., 2017). While these methods have made great strides, they produce static representations, which is a major limitation, as words are polysemous: they have different meanings depending on the unique contexts in which they are used.

Deep pretrained language models.
ELMo (Peters et al., 2018) provided the first contextualized embeddings, by extracting and combining the internal states of a pretrained deep bi-LSTM language model. Except for the word embeddings and the softmax layer, the forward and backward RNNs have different parameters. They showed that the learned representations could be transferred with great benefits to downstream architectures, to solve a variety of supervised NLU tasks.
Beyond simply combining internal states, Howard and Ruder (2018) proposed ULMFiT, a universal transfer learning method for text classification where the language model is pretrained on a large, general dataset, finetuned on a specific dataset, and finally augmented with classification layers trained from scratch on downstream tasks.
With the OpenAI GPT, Radford et al. (2018) capitalized on the Transformer architecture (Vaswani et al., 2017), shown to be superior to, and conceptually simpler than, recurrent neural networks. More precisely, they pretrained a left-to-right Transformer decoder as a general language model, and finetuned it on 12 language understanding tasks by applying different transformations to the input.
By combining ideas from all the aforementioned models, and introducing bidirectional pretraining, BERT (Devlin et al., 2018) disrupted the field of NLP by setting new state of the art performance on 11 NLU tasks, with very wide margins. More precisely, BERT uses a bidirectional Transformer encoder with a masked language model objective, making the learned representations capture both the left and the right contexts, instead of just the left context. The sheer size of BERT, with up to 24 Transformer blocks, plays a role in performance too.
With GPT-2, a version of GPT with over an order of magnitude more parameters than GPT, Radford et al. (2019) showed that as long as they have very large capacities, general language models can reach reasonable performance on many specific NLU tasks out-of-the-box, without any finetuning, i.e., accomplish zero-shot transfer. This demonstrates the fundamental nature and importance of the language modeling objective for inductive transfer learning.
With RoBERTa, Liu et al. (2019) showed that the performance of BERT could be improved by optimizing its hyperparameters and training procedure. The study of why and how BERT works so well now has its own dedicated research field, known as BERTology (Rogers et al., 2020).

Languages.
Following the success of BERT on the English language, numerous BERT models were created in other languages.
In addition to these monolingual models, multilingual models were also proposed, notably mBERT (Devlin et al., 2018), XLM (Lample and Conneau, 2019) and XLM-R (Conneau et al., 2019).

Abstractive summarization.
Abstractive summarization is an important and challenging task, requiring diverse and complex natural language understanding and generation capabilities. A good summarization model needs to read, comprehend, and write well.
GPT-2 can be used for summarization, by sampling a certain number of tokens from a given start seed. However, while the summaries are qualitatively abstractive, grammatical, and fluent, quantitative performance is only slightly superior to that of a random extractive baseline.
Being a bidirectional encoder, BERT cannot be used out-of-the-box for language generation, unlike GPT-2. Furthermore, BERT produces single-sentence representations, whereas for summarization, reasoning over multiple sentence and paragraph representations is necessary. Liu and Lapata (2019) proposed a way to overcome these challenges. At the input level, they introduced special tokens to encode individual sentences, interval segment embeddings, and more position embeddings than in BERT. Then, they combined a pretrained BERT encoder with a Transformer-based decoder initialized at random, and jointly trained the two models with different optimizers and learning rates.
BART and mBART.
BART (Lewis et al., 2019) is a denoising auto-encoder that jointly pretrains a bidirectional encoder (like in BERT) and a forward decoder (like in GPT) by learning to reconstruct a corrupted input sequence. Both the encoder and the decoder are Transformers. Since not only the encoder but also the decoder is pretrained, BART is particularly effective when applied to text generation tasks. Liu et al. (2020) pretrained a multilingual BART (mBART) on 25 different languages, and showed that this multilingual pretraining brings significant performance gains on a variety of machine translation tasks. MASS (Song et al., 2019) is another pretrained sequence-to-sequence model, which learns to predict a masked span in the input sequence. The main difference between MASS and BART is that the former only predicts the masked fragment of the sentence, while the latter learns to reconstruct the entire corrupted sentence. ProphetNet (Yan et al., 2020), which also adopts the encoder-decoder structure, introduces a new learning objective called future n-gram prediction. This objective reduces overfitting on local correlations by learning to predict the next n-grams (instead of unigrams) at each time step, given the previous context.

Our contributions.
In this paper, we introduce what is, to the best of our knowledge, the first BART model for the French language. We call this model BARThez. BARThez was pretrained on a very large monolingual French corpus from past research that we adapted to suit BART's specific perturbation schemes. Unlike already existing BERT-based French language models such as CamemBERT and FlauBERT, BARThez is particularly well-suited for generative tasks, since not only its encoder but also its decoder is pretrained.
In addition to discriminative tasks from the FLUE benchmark, we evaluate BARThez on a novel summarization dataset, OrangeSum, that we make publicly available alongside this paper.
We also continue the pretraining of an already pretrained multilingual BART on BARThez's corpus, and we show that the resulting model, which we call mBARThez, provides a significant boost over vanilla BARThez, and is on par with or outperforms CamemBERT and FlauBERT, two state-of-the-art BERT-based language models.

BARThez
Our model is based on BART (Lewis et al., 2019), a denoising auto-encoder. It consists of a bidirectional encoder and a left-to-right auto-regressive decoder.

Architecture
We use the BASE architecture, with 6 encoder and 6 decoder layers, 768 hidden dimensions, and 12 attention heads in both the encoder and the decoder. In total, our model has roughly 216M parameters. The architecture has two differences compared to the vanilla seq2seq Transformer (Vaswani et al., 2017): the use of GeLU activations instead of ReLU, and the presence of an additional normalization layer on top of the encoder and the decoder, following mBART (Liu et al., 2020). These additional layers help stabilize training when using FP16 precision.

Vocabulary
To generate our vocabulary, we use SentencePiece (Kudo and Richardson, 2018), which implements byte-pair encoding (BPE) (Sennrich et al., 2015). We do not perform any pre-tokenization, and we fix the vocabulary size to 50K subwords. The SentencePiece model is trained on a 10GB random sample of the pretraining corpus, with a character coverage of 99.95%.
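For intuition, the merge procedure behind BPE can be sketched in a few lines of plain Python: repeatedly find the most frequent adjacent symbol pair and merge it into a new symbol. This is only a toy illustration (the `learn_bpe` helper and its word-frequency input are hypothetical); the actual 50K-subword vocabulary is learned by the sentencepiece library, which operates on raw text without pre-tokenization.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Toy BPE learner. words: dict mapping word -> frequency.
    Returns the list of learned merges, most frequent first."""
    # Represent each word as a tuple of symbols (characters to start with).
    corpus = {tuple(w): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Rewrite the corpus with the new merged symbol.
        new_corpus = {}
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] = freq
        corpus = new_corpus
    return merges

merges = learn_bpe({"low": 5, "lower": 2, "lowest": 1}, num_merges=2)
```

On this tiny corpus, the first two merges are ("l", "o") and then ("lo", "w"), since "lo" and then "low" are the most frequent adjacent pairs.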

Self-supervised learning
We use the same pretraining objective as BART, i.e., BARThez learns to reconstruct a corrupted input. More precisely, the input text is perturbed with a noise function, and the model has to recover the original text by minimizing the cross-entropy between its predictions and the original text. Formally, given a set of documents {X_1, X_2, ..., X_n} and a noising function n, we aim at finding the parameters θ that minimize:

L(θ) = − Σ_{i=1}^{n} log P(X_i | n(X_i); θ)

Following BART, two different types of noise are applied by n. First, we use the text infilling scheme, where a number of text spans are sampled and each is replaced with a single [MASK] special token. Span lengths are sampled from a Poisson distribution with λ = 3.5, and 30% of the text is masked. The second perturbation scheme is sentence permutation, where the input document, seen as a list of sentences, is shuffled.
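The two perturbation schemes can be sketched as follows, using only the standard library. This is an illustrative approximation, not the exact implementation used for pretraining: details such as the handling of zero-length spans and overlapping spans differ in the real codebase.

```python
import random

def sample_poisson(lam, rng):
    # Knuth's method for sampling from a Poisson distribution (stdlib only).
    limit, k, p = 2.718281828459045 ** -lam, 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def text_infilling(tokens, rng, lam=3.5, mask_ratio=0.3, mask="[MASK]"):
    """Replace sampled spans with a single [MASK] token until roughly
    `mask_ratio` of the tokens have been masked."""
    tokens = list(tokens)
    budget = int(mask_ratio * len(tokens))
    masked = 0
    while masked < budget:
        # Span length ~ Poisson(lam), clipped to the remaining budget.
        span = max(1, min(sample_poisson(lam, rng), budget - masked, len(tokens)))
        start = rng.randrange(len(tokens) - span + 1)
        tokens[start:start + span] = [mask]
        masked += span
    return tokens

def sentence_permutation(sentences, rng):
    # Shuffle the document, seen as a list of sentences.
    sentences = list(sentences)
    rng.shuffle(sentences)
    return sentences
```

In the real pipeline, both corruptions are applied to the same document, and the model is trained to map the corrupted sequence back to the original one.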

Pretraining corpus
We created a version of FlauBERT's corpus (Le et al., 2019) suitable for the two perturbation schemes described in subsection 2.3. Indeed, in the original FlauBERT corpus, each sentence is treated as an independent instance, while we need instances that correspond to complete documents.
Other than that, BARThez's corpus is similar to FlauBERT's. It primarily consists of the French part of CommonCrawl, NewsCrawl, Wikipedia and other smaller corpora that are listed in Table 1. To clean the corpus of noisy examples, we used the script provided by Le et al. (2019), after disabling the Moses tokenizer. The total corpus size is 66GB before SentencePiece tokenization and 101GB after.

Training details
We pretrain BARThez on 128 NVidia V100 GPUs. We fix the batch size to 6000 tokens per GPU and the update frequency to 2, which gives a total batch size of roughly 22k documents per update. We use the Adam optimizer (Kingma and Ba, 2014) with ε = 10⁻⁶, β₁ = 0.9, and β₂ = 0.999, with a learning rate starting from 6·10⁻⁴ and decreasing linearly as a function of the training step. We use a warmup of 6% of the total number of training steps. The pretraining lasts for approximately 60 hours, which allows for 20 passes over the whole corpus. In the first 12 epochs, we fix the dropout to 0.1, for epochs 12 to 16 we decrease it to 0.05, and finally set it to zero for epochs 16 to 20. All experiments are carried out using Fairseq (Ott et al., 2019).

Table 1: Composition of the pretraining corpus (sizes in GB).

Corpus                               Size
CommonCrawl                          42.0
NewsCrawl (Li et al., 2019)           9.6
Wikipedia                             4.0
GIGA (Tiedemann, 2012)                3.8
ORTOLANG (ATILF and CLLE, 2020)       2.7
MultiUn (Eisele and Chen, 2010)       2.2
EU Bookshop (Skadiņš et al., 2014)    2.1
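The learning-rate schedule described above amounts to a linear warmup followed by a linear decay. A minimal sketch is below; the actual run would use Fairseq's built-in scheduler, and `lr_at_step` is a hypothetical helper for illustration only.

```python
def lr_at_step(step, total_steps, peak_lr=6e-4, warmup_frac=0.06):
    """Learning rate at a given step: linear warmup to peak_lr over the
    first 6% of training steps, then linear decay down to zero."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Warmup phase: ramp up linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: go down linearly from peak_lr to 0 at the last step.
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)
```

For example, with 1000 total steps, the rate peaks at 6·10⁻⁴ at step 60 (6% of training) and reaches zero at step 1000.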

OrangeSum
BART-based models are particularly well-suited to generative tasks, but unfortunately, FLUE (Le et al., 2019), the French equivalent of GLUE (Wang et al., 2018), only contains discriminative tasks. We therefore decided to create one such task. We opted for single-document abstractive summarization, as it is a generative task that also requires the model to encode its input very well. In other words, for a model to summarize well, it needs to read, comprehend, and write well, making abstractive summarization one of the most central and challenging evaluation tasks in NLP.

Motivation.
Our strategy here was to create a French equivalent of the recently introduced XSum dataset (Narayan et al., 2018). Unlike the historical summarization datasets (CNN, DailyMail, and NY Times), introduced by Hermann et al. (2015), which favor extractive strategies, XSum requires models to display a high degree of abstractivity to perform well. XSum was created by scraping articles and their one-sentence summaries from the BBC website, where the one-sentence summaries are not catchy headlines, but rather capture the gist of the articles.

Data collection.
We adopted an analogous strategy, and scraped the "Orange Actu" website. Orange S.A. is a large French multinational telecommunications corporation, with 266M customers worldwide. Our scraped pages cover almost a decade, from Feb 2011 to Sep 2020. They belong to five main categories: France, world, politics, automotive, and society. The society category is itself divided into 8 subcategories: health, environment, people, culture, media, high-tech, unusual ("insolite" in French), and miscellaneous.
Each article features a single-sentence title as well as a slightly longer abstract, both professionally written by the author of the article. We extracted these two fields from each page, thus creating two summarization tasks: OrangeSum Title and OrangeSum Abstract. Note that as in XSum, titles in OrangeSum tend not to be catchy headlines but rather convey the essence of the article. The same can be said about the abstracts.

Post-processing.
As a post-processing step, we removed all empty articles and all articles whose titles were shorter than 5 words. For OrangeSum Abstract, we removed the top 10% of articles in terms of proportion of novel unigrams in the abstracts, as we observed that such abstracts tended to be introductions rather than real abstracts. This corresponded to a threshold of 57% novel unigrams.
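The novel-unigram criterion used in this filtering step can be sketched as follows. The tokenization (lowercased whitespace splitting) and the helper names are our own illustrative assumptions, not the exact preprocessing used to build OrangeSum.

```python
def novel_unigram_ratio(summary, article):
    """Proportion of summary unigrams that do not appear in the article,
    a common proxy for how abstractive a summary is."""
    article_vocab = set(article.lower().split())
    summary_tokens = summary.lower().split()
    if not summary_tokens:
        return 0.0
    novel = sum(1 for t in summary_tokens if t not in article_vocab)
    return novel / len(summary_tokens)

def keep_pair(summary, article, threshold=0.57):
    # Drop (article, abstract) pairs above the 57% novel-unigram threshold.
    return novel_unigram_ratio(summary, article) <= threshold
```

A pair whose abstract shares almost no vocabulary with its article (ratio above the threshold) is likely an introduction rather than a true abstract, and is discarded.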
For both OrangeSum Title and OrangeSum Abstract, we set aside 1500 pairs for testing, 1500 for validation, and used all the remaining ones for training. The resulting dataset is publicly available at: https://github.com/moussaKam/OrangeSum.

Analysis.
Table 2 compares OrangeSum with XSum and the well-known CNN, DailyMail, and NY Times datasets. We can see that the two OrangeSum tasks are very similar to XSum in terms of statistics, except for the size. Indeed, OrangeSum is one order of magnitude smaller than XSum. However, its size still allows for effective finetuning, as we later demonstrate in our experiments.

Experiments
We compare BARThez to several other BART-based models, summarized in Table 4 and described in what follows.
• mBART is a multilingual BART trained by Liu et al. (2020). It follows a LARGE architecture, with 12 layers in both the encoder and the decoder, hidden vectors of size 1024, and 16 attention heads. It was trained on a multilingual corpus containing 1369 GB of raw text, over 2.5 weeks on 256 Nvidia V100 GPUs. The multilingual corpus covers 25 different languages, and its French sub-corpus weighs 56 GB. In the original paper, the authors use mBART for machine translation; however, mBART can also be used to perform monolingual tasks.
• mBARThez. Here, we continued the pretraining of the already pretrained mBART on BARThez's corpus (detailed in subsection 2.4). Being multilingual, mBART uses a vocabulary containing tokens with non-Latin characters. We eliminated these tokens from all embedding layers, reducing the number of parameters from 866M to 561M. We then continued the pretraining of mBART for about 30 hours on 128 Nvidia V100 GPUs, which allowed for 4 passes over BARThez's corpus. The initial learning rate was set to 0.0001 and linearly decreased towards zero.
• Random. As an additional baseline, we train a model with the same architecture and vocabulary as BARThez from scratch on the downstream tasks.
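The vocabulary reduction applied to obtain mBARThez can be illustrated with a small sketch. Here `is_latin` is a hypothetical filter (the exact character test is not specified above), and a real embedding matrix would be tensor rows rather than Python lists.

```python
def prune_embeddings(vocab, embeddings, keep_fn):
    """Drop vocabulary entries (and their embedding rows) that fail
    keep_fn. Returns the reduced vocab, the reduced matrix, and an
    old-id -> new-id mapping for remapping token indices."""
    new_vocab, new_rows, remap = [], [], {}
    for old_id, (tok, row) in enumerate(zip(vocab, embeddings)):
        if keep_fn(tok):
            remap[old_id] = len(new_vocab)
            new_vocab.append(tok)
            new_rows.append(row)
    return new_vocab, new_rows, remap

def is_latin(token):
    # Hypothetical filter: keep tokens whose characters fall in the
    # basic Latin / Latin-supplement / Latin-extended ranges.
    return all(ord(c) < 0x250 for c in token)
```

Since the embedding layers account for a large share of a multilingual model's parameters, discarding rows for tokens that never occur in French text shrinks the model substantially without affecting its French capabilities.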

Summarization
We first evaluate on OrangeSum Title and OrangeSum Abstract. All pretrained models are finetuned for 30 epochs, with a learning rate that warms up to 0.0001 over the first 6% of the training steps and then decreases linearly to 0. The random model is trained for 60 epochs. We select the checkpoint associated with the best validation score to generate the test set summaries, using beam search with a beam size of 4. We classically report ROUGE-1, ROUGE-2 and ROUGE-L scores (Lin, 2004) in Table 5. However, since ROUGE is limited to capturing n-gram overlap, which is poorly suited to the abstractive summarization setting, we also report BERTScore, a recently introduced metric that leverages the contextual representations of the candidate and reference sentences.

Table 4: Summary of the models used in our experiments. Parameters are given in millions, vocab sizes in thousands, and corpus sizes in GB.

Model             #layers  #params  Vocab. size  Pretraining hours  Pretraining GPUs  Corpus size
BARThez (ours)    12       216      50           60                 128               66
mBART             24       866      250          432                256               1369
mBARThez (ours)   24       561      101          30                 128               1369 + 66
Random            12       216      50           0                  NA                NA

Table 5: OrangeSum results. The two BERTScore scores are with/without rescaling.
Following Narayan et al. (2018), we include two extractive baselines in our evaluation, LEAD and EXT-ORACLE. LEAD creates a summary by extracting the first n sentences from the document; in our case, we choose n = 1. EXT-ORACLE extracts from the document the set of sentences that maximizes a specific score; in our case, we extract the one sentence that maximizes ROUGE-L. Table 5 compares the performance of the models finetuned on the summarization tasks. While having four times fewer parameters, BARThez is on par with mBART, both in terms of ROUGE and BERTScore. mBARThez reaches the best performance everywhere, which highlights the importance of adapting a multilingual pretrained model to a specific language before finetuning.
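The two extractive baselines can be sketched in pure Python, with a simple ROUGE-L F1 based on longest common subsequence. This is a minimal illustration (whitespace tokenization, no stemming), not the official ROUGE implementation.

```python
def lcs_len(a, b):
    # Longest common subsequence length via dynamic programming.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(candidate, reference):
    """Simplified sentence-level ROUGE-L F1 on whitespace tokens."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

def lead(sentences, n=1):
    # LEAD baseline: the first n sentences of the document.
    return " ".join(sentences[:n])

def ext_oracle(sentences, reference):
    # EXT-ORACLE: the single document sentence maximizing ROUGE-L
    # against the reference summary.
    return max(sentences, key=lambda s: rouge_l_f1(s, reference))
```

EXT-ORACLE is an upper bound on what a single-sentence extractive system could achieve under this score, which is why abstractive datasets like XSum and OrangeSum are designed so that even the oracle lags far behind human summaries.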
In addition, as shown in Table 6, mBARThez generates more abstractive summaries than BARThez and mBART, as measured by the proportion of novel n-grams in the generated summaries. It is to be noted, however, that all generated summaries are much less abstractive than the reference ones.

Discriminative tasks
In addition to generative tasks, BART-like models can perform discriminative tasks (Lewis et al., 2019). In the case of sequence classification, the input sequence is fed to both the encoder and the decoder, and a classification head is added on top of the representation of the last token in the sequence. When the input consists of several sentences, the sentences are concatenated, separated by a special token. We evaluate the different models on five discriminative tasks from the FLUE benchmark (Le et al., 2019), the French equivalent of GLUE (Wang et al., 2018).
• CLS. The Cross-lingual Sentiment analysis dataset (Prettenhofer and Stein, 2010) is made of Amazon reviews to be classified as positive or negative. It contains 3 product categories: books, DVD and music. The train and test sets are balanced and contain 2000 examples each, per product category. Following Le et al. (2019), we use 20% of the train set as a validation set.
• PAWSX. The Cross-lingual Adversarial Dataset for Paraphrase Identification (Yang et al., 2019) contains pairs of sentences, and the task is to predict whether they are semantically equivalent. The splits are as follows: 49401 examples for training, 1992 for development, and 1985 for testing.
• XNLI. The Cross-lingual NLI corpus (Conneau et al., 2018) contains pairs of sentences, and the task is to predict whether the first one (premise) entails the second one (hypothesis), contradicts it, or neither entails nor contradicts it (neutral relationship). 392702 pairs are used for training, 2490 for development, and 5010 for testing.
In all experiments, we finetune the model for 10 epochs with a learning rate chosen from {10⁻⁴, 5·10⁻⁵, 10⁻⁵} based on the best validation score. We repeat each experiment 3 times with different seeds and report the mean and standard deviation, to account for the stochastic nature of optimization. Table 7 shows the test set accuracies of our experiments. For comparison purposes, we also report the scores of other relevant BERT-based models as given in Le et al. (2019): mBERT (Devlin et al., 2018), CamemBERT (Martin et al., 2019) and FlauBERT (Le et al., 2019).
Among the models with a BASE architecture, BARThez outperforms the others on the three sentiment analysis tasks, while being very close to CamemBERT and FlauBERT on the paraphrasing and inference tasks. Among the LARGE models, mBARThez outperforms mBART in all tasks, showing again the importance of the language-adaptive phase in the pretraining stage. FlauBERT has a slight edge over mBARThez in 3 out of 5 tasks, which could be attributed to the fact that FlauBERT was trained for approximately 10 times more hours on a monolingual French corpus. In summary, we find that the ability of BARThez and mBARThez to perform well on generative tasks does not come at the expense of performance on discriminative tasks, in line with the results of Lewis et al. (2019).

Conclusion and Future Work
We released BARThez and mBARThez, the first BART models for the French language. We evaluated them on two abstractive summarization tasks from OrangeSum, a dataset that we release with this work. The evaluation shows that (1) BARThez is competitive with mBART while having four times fewer parameters, and that (2) mBARThez provides a significant boost over mBART by simply adding a relatively affordable language-adaptive phase to the pretraining. In addition, we evaluated BARThez and mBARThez on 3 discriminative tasks (sentiment analysis, paraphrasing, natural language inference) against cutting-edge French BERT-based language models (FlauBERT and CamemBERT), and obtained very competitive results. An interesting area for future work is to further explore the idea of "language-adaptive pretraining", and to compare it with the traditional approach where the model is directly pretrained on a monolingual corpus from scratch.