SELEXINI – a large and diverse automatically parsed corpus of French

Manon Scholivet, Agata Savary, Louis Estève, Marie Candito, Carlos Ramisch


Abstract
Annotating large text corpora is essential for many tasks. We present SELEXINI, a large automatically annotated corpus of French. The corpus has two parts: the first drawn from BigScience, and the second from HPLT. The HPLT documents were selected to optimise the lexical diversity of the final corpus. We analyse the impact of this selection on syntactic diversity, as well as the quality of the new words contributed by the HPLT part of SELEXINI. We show that, despite introducing interesting new words, the texts extracted from HPLT are very noisy. Furthermore, increasing lexical diversity did not increase syntactic diversity.
Anthology ID:
2025.bucc-1.10
Volume:
Proceedings of the 18th Workshop on Building and Using Comparable Corpora (BUCC)
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Serge Sharoff, Ayla Rigouts Terryn, Pierre Zweigenbaum, Reinhard Rapp
Venues:
BUCC | WS
Publisher:
Association for Computational Linguistics
Pages:
83–98
URL:
https://aclanthology.org/2025.bucc-1.10/
Cite (ACL):
Manon Scholivet, Agata Savary, Louis Estève, Marie Candito, and Carlos Ramisch. 2025. SELEXINI – a large and diverse automatically parsed corpus of French. In Proceedings of the 18th Workshop on Building and Using Comparable Corpora (BUCC), pages 83–98, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
SELEXINI – a large and diverse automatically parsed corpus of French (Scholivet et al., BUCC 2025)
PDF:
https://aclanthology.org/2025.bucc-1.10.pdf