Minerva LLMs: The First Family of Large Language Models Trained from Scratch on Italian Data

Riccardo Orlando, Luca Moroni, Pere-Lluís Huguet Cabot, Simone Conia, Edoardo Barba, Sergio Orlandini, Giuseppe Fiameni, Roberto Navigli


Abstract
The increasing popularity of Large Language Models (LLMs) has led to a surge in research on adapting existing models to different languages. However, the pretraining of non-English LLMs is still an underexplored area, and there is no open-source endeavor that investigates what is achievable with open Italian data. To address this issue, we present Minerva, the first family of LLMs trained from scratch on Italian data. The creation of Minerva is an opportunity to explore the pretraining of LLMs for the Italian language, outlining the challenges that arise when training LLMs on native Italian texts. Minerva demonstrates that an LLM built for a specific language brings a number of practical benefits compared to the adaptation of an existing one, including deep control over the composition of the vocabulary and the training data. With this paper, we aim to provide a comprehensive overview of the design choices, results, and evaluation of our Minerva models, showing promising results on Italian benchmarks and downstream tasks. Most importantly, we share the lessons learned and the findings obtained during the development of Minerva, as we believe our experience will be valuable for the academic and industrial communities interested in training non-English LLMs from scratch.
Anthology ID:
2024.clicit-1.77
Volume:
Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)
Month:
December
Year:
2024
Address:
Pisa, Italy
Editors:
Felice Dell'Orletta, Alessandro Lenci, Simonetta Montemagni, Rachele Sprugnoli
Venue:
CLiC-it
Publisher:
CEUR Workshop Proceedings
Pages:
707–719
URL:
https://aclanthology.org/2024.clicit-1.77/
Cite (ACL):
Riccardo Orlando, Luca Moroni, Pere-Lluís Huguet Cabot, Simone Conia, Edoardo Barba, Sergio Orlandini, Giuseppe Fiameni, and Roberto Navigli. 2024. Minerva LLMs: The First Family of Large Language Models Trained from Scratch on Italian Data. In Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), pages 707–719, Pisa, Italy. CEUR Workshop Proceedings.
Cite (Informal):
Minerva LLMs: The First Family of Large Language Models Trained from Scratch on Italian Data (Orlando et al., CLiC-it 2024)
PDF:
https://aclanthology.org/2024.clicit-1.77.pdf