Exploring Text-Embedding Retrieval Models for the Italian Language

Yuri Noviello, Fabio Tamburini


Abstract
Text retrieval systems have become essential in the field of natural language processing (NLP), serving as the backbone for applications such as search engines, document indexing, and information retrieval. With the rise of generative AI, particularly Retrieval-Augmented Generation (RAG) systems, the demand for robust text retrieval models has increased. However, existing large language models (LLMs) and datasets are often insufficiently optimized for Italian, limiting their performance in Italian text retrieval tasks. This paper addresses this gap by proposing both a data collection and specialized models tailored for Italian text retrieval. Through extensive experimentation, we analyze the improvements and limitations in retrieval performance, paving the way for more effective Italian NLP applications.
Anthology ID:
2024.clicit-1.73
Volume:
Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)
Month:
December
Year:
2024
Address:
Pisa, Italy
Editors:
Felice Dell'Orletta, Alessandro Lenci, Simonetta Montemagni, Rachele Sprugnoli
Venue:
CLiC-it
Publisher:
CEUR Workshop Proceedings
Pages:
654–661
URL:
https://aclanthology.org/2024.clicit-1.73/
Cite (ACL):
Yuri Noviello and Fabio Tamburini. 2024. Exploring Text-Embedding Retrieval Models for the Italian Language. In Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), pages 654–661, Pisa, Italy. CEUR Workshop Proceedings.
Cite (Informal):
Exploring Text-Embedding Retrieval Models for the Italian Language (Noviello & Tamburini, CLiC-it 2024)
PDF:
https://aclanthology.org/2024.clicit-1.73.pdf