Sergio Orlandini


2024

Minerva LLMs: The First Family of Large Language Models Trained from Scratch on Italian Data
Riccardo Orlando | Luca Moroni | Pere-Lluís Huguet Cabot | Simone Conia | Edoardo Barba | Sergio Orlandini | Giuseppe Fiameni | Roberto Navigli
Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)

The increasing popularity of Large Language Models (LLMs) has led to a surge in research on adapting existing models to different languages. However, the pretraining of non-English LLMs remains underexplored, and no open-source endeavor has investigated what is achievable with open Italian data. To address this gap, we present Minerva, the first family of LLMs trained from scratch on Italian data. The creation of Minerva is an opportunity to explore and investigate the pretraining of LLMs for the Italian language, outlining the challenges that arise when training LLMs on native Italian texts. Minerva demonstrates that an LLM built for a specific language brings a number of practical benefits compared to adapting an existing one, including deep control over the composition of the vocabulary and the training data. With this paper, we aim to provide a comprehensive overview of the design choices, results, and evaluation of our Minerva models, which show promising results on Italian benchmarks and downstream tasks. Most importantly, we share the lessons and findings obtained during the development of Minerva, as we believe our experience will be valuable to the academic and industrial communities interested in training non-English LLMs from scratch.