Gérard Dupont


2024

Documenting Geographically and Contextually Diverse Language Data Sources
Angelina McMillan-Major | Francesco De Toni | Zaid Alyafeai | Stella Biderman | Kimbo Chen | Gérard Dupont | Hady Elsahar | Chris Emezue | Alham Fikri Aji | Suzana Ilić | Nurulaqilla Khamis | Colin Leong | Maraim Masoud | Aitor Soroa | Pedro Ortiz Suarez | Daniel van Strien | Zeerak Talat | Yacine Jernite
Northern European Journal of Language Technology, Volume 10

Contemporary large-scale data collection efforts have prioritized the amount of data collected to improve large language models (LLMs). This quantitative approach has resulted in concerns for the rights of data subjects represented in data collections. This concern is exacerbated by a lack of documentation and analysis tools, making it difficult to interrogate these collections. Mindful of these pitfalls, we present a methodology for documentation-first, human-centered data collection. We apply this approach in an effort to train a multilingual LLM. We identify a geographically diverse set of target language groups (Arabic varieties, Basque, Chinese varieties, Catalan, English, French, Indic languages, Indonesian, Niger-Congo languages, Portuguese, Spanish, and Vietnamese, as well as programming languages) for which to collect metadata on potential data sources. We structure this effort by developing an online catalogue in English as a tool for gathering metadata through public hackathons. We present our tool and analyses of the resulting resource metadata, including distributions over languages, regions, and resource types, and discuss our lessons learned.

2023

The ROOTS Search Tool: Data Transparency for LLMs
Aleksandra Piktus | Christopher Akiki | Paulo Villegas | Hugo Laurençon | Gérard Dupont | Sasha Luccioni | Yacine Jernite | Anna Rogers
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)

ROOTS is a 1.6TB multilingual text corpus developed for the training of BLOOM, currently the largest language model explicitly accompanied by commensurate data governance efforts. In continuation of these efforts, we present the ROOTS Search Tool: a search engine over the entire ROOTS corpus offering both fuzzy and exact search capabilities. ROOTS is the largest corpus to date that can be investigated this way. The ROOTS Search Tool is open-sourced and available on Hugging Face Spaces: https://huggingface.co/spaces/bigscience-data/roots-search. We describe our implementation and the possible use cases of our tool.

2019

Interprétation et visualisation contextuelle de NOTAMs (messages aux navigants aériens) [Contextual Interpretation and Visualization of NOTAMs (Notices to Airmen)]
Alexandre Arnold | Gérard Dupont | Catherine Kobus | François Lancelot | Pooja Narayan
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume IV : Démonstrations

In this article, we present a demonstration that visualizes information automatically extracted from the textual part of NOTAMs. In aviation, NOTAMs are messages published by government air navigation control agencies. We describe the construction of the dataset, the deep-learning information extraction experiments (approach and results), and the link to contextual visualization on airport maps.