Curated Data Does Not Mean Representative Data When Training Large Language Models: An Experiment Using Representative Data for Italian

Fabio Tamburini


Anthology ID:
2025.clicit-1.104
Volume:
Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025)
Month:
September
Year:
2025
Address:
Cagliari, Italy
Editors:
Cristina Bosco, Elisabetta Jezek, Marco Polignano, Manuela Sanguinetti
Venue:
CLiC-it
Publisher:
CEUR Workshop Proceedings
Pages:
1102–1111
URL:
https://aclanthology.org/2025.clicit-1.104/
Cite (ACL):
Fabio Tamburini. 2025. Curated Data Does Not Mean Representative Data When Training Large Language Models: An Experiment Using Representative Data for Italian. In Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025), pages 1102–1111, Cagliari, Italy. CEUR Workshop Proceedings.
Cite (Informal):
Curated Data Does Not Mean Representative Data When Training Large Language Models: An Experiment Using Representative Data for Italian (Tamburini, CLiC-it 2025)
PDF:
https://aclanthology.org/2025.clicit-1.104.pdf