Francesco De Toni


2024

Documenting Geographically and Contextually Diverse Language Data Sources
Angelina McMillan-Major | Francesco De Toni | Zaid Alyafeai | Stella Biderman | Kimbo Chen | Gérard Dupont | Hady Elsahar | Chris Emezue | Alham Fikri Aji | Suzana Ilić | Nurulaqilla Khamis | Colin Leong | Maraim Masoud | Aitor Soroa | Pedro Ortiz Suarez | Daniel van Strien | Zeerak Talat | Yacine Jernite
Northern European Journal of Language Technology, Volume 10

Contemporary large-scale data collection efforts have prioritized the amount of data collected to improve large language models (LLMs). This quantitative approach has resulted in concerns about the rights of data subjects represented in data collections. This concern is exacerbated by a lack of documentation and analysis tools, making it difficult to interrogate these collections. Mindful of these pitfalls, we present a methodology for documentation-first, human-centered data collection. We apply this approach in an effort to train a multilingual LLM. We identify a geographically diverse set of target language groups (Arabic varieties, Basque, Chinese varieties, Catalan, English, French, Indic languages, Indonesian, Niger-Congo languages, Portuguese, Spanish, and Vietnamese, as well as programming languages) for which to collect metadata on potential data sources. We structure this effort by developing an online catalogue in English as a tool for gathering metadata through public hackathons. We present our tool and analyses of the resulting resource metadata, including distributions over languages, regions, and resource types, and discuss our lessons learned.
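
To give a rough sense of the kind of per-source metadata such a catalogue gathers and of the aggregate views mentioned in the abstract (distributions over languages, regions, and resource types), the Python sketch below builds a toy record type and summarizes a small list of entries. The field names, the summarize helper, and the example entries are illustrative assumptions, not the catalogue's actual schema or data.

from collections import Counter
from dataclasses import dataclass

@dataclass
class ResourceMetadata:
    # Hypothetical metadata record for one catalogued data source (assumed fields).
    name: str
    languages: list[str]      # e.g. ["Catalan", "Spanish"]
    region: str               # geographic provenance of the source
    resource_type: str        # e.g. "primary source", "processed dataset", "organization"
    custodian: str | None = None  # entity responsible for the data, if known

def summarize(catalogue: list[ResourceMetadata]) -> dict[str, Counter]:
    # Aggregate distributions over languages, regions, and resource types.
    return {
        "languages": Counter(lang for r in catalogue for lang in r.languages),
        "regions": Counter(r.region for r in catalogue),
        "types": Counter(r.resource_type for r in catalogue),
    }

entries = [
    ResourceMetadata("Hypothetical newspaper archive", ["Catalan"], "Europe", "primary source"),
    ResourceMetadata("Hypothetical web corpus", ["Indonesian"], "Southeast Asia", "processed dataset"),
]
print(summarize(entries))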

2022

Entities, Dates, and Languages: Zero-Shot on Historical Texts with T0
Francesco De Toni | Christopher Akiki | Javier De La Rosa | Clémentine Fourrier | Enrique Manjavacas | Stefan Schweter | Daniel Van Strien
Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models

In this work, we explore whether the recently demonstrated zero-shot abilities of the T0 model extend to Named Entity Recognition for out-of-distribution languages and time periods. Using a historical newspaper corpus in three languages as a test-bed, we use prompts to extract possible named entities. Our results show that a naive approach to prompt-based zero-shot multilingual Named Entity Recognition is error-prone, but they also highlight the potential of such an approach for historical languages lacking labeled datasets. Moreover, we find that T0-like models can be probed to predict the publication date and language of a document, which could be very relevant for the study of historical texts.
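
For readers unfamiliar with prompt-based zero-shot probing of this kind, the sketch below shows the general pattern using the Hugging Face transformers library and the public bigscience/T0_3B checkpoint. The prompts, the newspaper snippet, and the choice of checkpoint are illustrative assumptions; they are not the prompts or setup reported in the paper.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Smaller public T0 variant used here purely for illustration.
model_name = "bigscience/T0_3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Toy historical-newspaper snippet (not from the paper's corpus).
passage = "Le 14 juillet 1919, le maréchal Foch a défilé sur les Champs-Élysées à Paris."

# One illustrative prompt per probed task: entity extraction, publication date, language.
prompts = {
    "entities": f"{passage}\n\nList the persons and locations mentioned in the text above.",
    "date": f"{passage}\n\nIn which year was this newspaper article published?",
    "language": f"{passage}\n\nWhat language is this text written in?",
}

for task, prompt in prompts.items():
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=32)
    print(task, "->", tokenizer.decode(outputs[0], skip_special_tokens=True))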