BARThez: a Skilled Pretrained French Sequence-to-Sequence Model

Moussa Kamal Eddine; Antoine Tixier; Michalis Vazirgiannis

doi:10.18653/v1/2021.emnlp-main.740

Abstract

Inductive transfer learning has taken the entire NLP field by storm, with models such as BERT and BART setting new state of the art on countless NLU tasks. However, most of the available models and research have been conducted for English. In this work, we introduce BARThez, the first large-scale pretrained seq2seq model for French. Being based on BART, BARThez is particularly well-suited for generative tasks. We evaluate BARThez on five discriminative tasks from the FLUE benchmark and two generative tasks from a novel summarization dataset, OrangeSum, that we created for this research. We show BARThez to be very competitive with state-of-the-art BERT-based French language models such as CamemBERT and FlauBERT. We also continue the pretraining of a multilingual BART on BARThez’ corpus, and show our resulting model, mBARThez, to significantly boost BARThez’ generative performance.

Anthology ID: 2021.emnlp-main.740
Volume: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2021
Address: Online and Punta Cana, Dominican Republic
Editors: Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 9369–9390
URL: https://aclanthology.org/2021.emnlp-main.740/
DOI: 10.18653/v1/2021.emnlp-main.740
Cite (ACL): Moussa Kamal Eddine, Antoine Tixier, and Michalis Vazirgiannis. 2021. BARThez: a Skilled Pretrained French Sequence-to-Sequence Model. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9369–9390, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal): BARThez: a Skilled Pretrained French Sequence-to-Sequence Model (Kamal Eddine et al., EMNLP 2021)
PDF: https://aclanthology.org/2021.emnlp-main.740.pdf
Video: https://aclanthology.org/2021.emnlp-main.740.mp4
Code: moussaKam/BARThez
Data: OrangeSum, FLUE, GLUE
