Gérard Dupont
2024
Documenting Geographically and Contextually Diverse Language Data Sources
Angelina McMillan-Major | Francesco De Toni | Zaid Alyafeai | Stella Biderman | Kimbo Chen | Gérard Dupont | Hady Elsahar | Chris Emezue | Alham Fikri Aji | Suzana Ilić | Nurulaqilla Khamis | Colin Leong | Maraim Masoud | Aitor Soroa | Pedro Ortiz Suarez | Daniel van Strien | Zeerak Talat | Yacine Jernite
Northern European Journal of Language Technology, Volume 10
Contemporary large-scale data collection efforts have prioritized the amount of data collected to improve large language models (LLMs). This quantitative approach has resulted in concerns for the rights of data subjects represented in data collections. This concern is exacerbated by a lack of documentation and analysis tools, making it difficult to interrogate these collections. Mindful of these pitfalls, we present a methodology for documentation-first, human-centered data collection. We apply this approach in an effort to train a multilingual LLM. We identify a geographically diverse set of target language groups (Arabic varieties, Basque, Chinese varieties, Catalan, English, French, Indic languages, Indonesian, Niger-Congo languages, Portuguese, Spanish, and Vietnamese, as well as programming languages) for which to collect metadata on potential data sources. We structure this effort by developing an online catalogue in English as a tool for gathering metadata through public hackathons. We present our tool and analyses of the resulting resource metadata, including distributions over languages, regions, and resource types, and discuss our lessons learned.
2023
The ROOTS Search Tool: Data Transparency for LLMs
Aleksandra Piktus | Christopher Akiki | Paulo Villegas | Hugo Laurençon | Gérard Dupont | Sasha Luccioni | Yacine Jernite | Anna Rogers
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
ROOTS is a 1.6TB multilingual text corpus developed for the training of BLOOM, currently the largest language model explicitly accompanied by commensurate data governance efforts. In continuation of these efforts, we present the ROOTS Search Tool: a search engine over the entire ROOTS corpus offering both fuzzy and exact search capabilities. ROOTS is the largest corpus to date that can be investigated this way. The ROOTS Search Tool is open-sourced and available on Hugging Face Spaces: https://huggingface.co/spaces/bigscience-data/roots-search. We describe our implementation and the possible use cases of our tool.
2019
Interprétation et visualisation contextuelle de NOTAMs (messages aux navigants aériens) [Contextual Interpretation and Visualization of NOTAMs (Notices to Airmen)]
Alexandre Arnold | Gérard Dupont | Catherine Kobus | François Lancelot | Pooja Narayan
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume IV : Démonstrations
In this paper, we present a demonstration that visualizes information automatically extracted from the textual part of NOTAMs. In aviation, NOTAMs are messages published by government air navigation control agencies. We describe the construction of the dataset, the deep learning experiments on information extraction (approach and results), and the link to contextual visualization on airport maps.
Co-authors
- Yacine Jernite 2
- Alham Fikri Aji 1
- Christopher Akiki 1
- Zaid Alyafeai 1
- Alexandre Arnold 1
- Stella Biderman 1
- Kimbo Chen 1
- Francesco De Toni 1
- Hady Elsahar 1
- Chris Chinenye Emezue 1
- Suzana Ilic 1
- Nurulaqilla Khamis 1
- Catherine Kobus 1
- François Lancelot 1
- Hugo Laurençon 1
- Colin Leong 1
- Sasha Luccioni 1
- Maraim Masoud 1
- Angelina McMillan-Major 1
- Pooja Narayan 1
- Pedro Ortiz Suarez 1
- Aleksandra Piktus 1
- Anna Rogers 1
- Aitor Soroa 1
- Zeerak Talat 1
- Paulo Villegas 1
- Daniel van Strien 1