Data-Efficient French Language Modeling with CamemBERTa

Wissam Antoun, Benoît Sagot, Djamé Seddah


Abstract
Recent advances in NLP have significantly improved the performance of language models on a variety of tasks. While these advances are largely driven by the availability of large amounts of data and computational power, they also benefit from the development of better training methods and architectures. In this paper, we introduce CamemBERTa, a French DeBERTa model that builds upon the DeBERTaV3 architecture and training objective. We evaluate our model’s performance on a variety of French downstream tasks and datasets, including question answering, part-of-speech tagging, dependency parsing, named entity recognition, and the FLUE benchmark, and compare against CamemBERT, the state-of-the-art monolingual model for French. Our results show that, given the same amount of training tokens, our model outperforms BERT-based models trained with MLM on most tasks. Furthermore, our new model reaches similar or superior performance on downstream tasks compared to CamemBERT, despite being trained on only 30% of its total number of input tokens. In addition to our experimental results, we also publicly release the weights and code implementation of CamemBERTa, making it the first publicly available DeBERTaV3 model outside of the original paper and the first openly available implementation of a DeBERTaV3 training objective.
Anthology ID:
2023.findings-acl.320
Volume:
Findings of the Association for Computational Linguistics: ACL 2023
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
5174–5185
URL:
https://aclanthology.org/2023.findings-acl.320
DOI:
10.18653/v1/2023.findings-acl.320
Cite (ACL):
Wissam Antoun, Benoît Sagot, and Djamé Seddah. 2023. Data-Efficient French Language Modeling with CamemBERTa. In Findings of the Association for Computational Linguistics: ACL 2023, pages 5174–5185, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Data-Efficient French Language Modeling with CamemBERTa (Antoun et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-acl.320.pdf
Video:
https://aclanthology.org/2023.findings-acl.320.mp4