Segmenting a Large French Meeting Corpus into Elementary Discourse Units

Laurent Prévot, Roxane Bertrand, Julie Hunter


Abstract
Despite growing interest in discourse-related tasks, the limited quantity and diversity of discourse-annotated data remain a major issue. Existing resources are largely based on written corpora, while spoken conversational genres are underrepresented. Although discourse segmentation into elementary discourse units (EDUs) is considered nearly solved for canonical written texts, transcripts of spontaneous conversational speech present different challenges. In this paper, we introduce a large French corpus of segmented meeting dialogues, including 20 hours of manually transcribed and discourse-annotated conversations, and 80 hours of automatically transcribed and discourse-segmented data. We describe our annotation campaign, discuss inter-annotator agreement and segmentation guidelines, and present results from fine-tuning a model for EDU segmentation on this resource.
Anthology ID:
2025.sigdial-1.14
Volume:
Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Month:
August
Year:
2025
Address:
Avignon, France
Editors:
Frédéric Béchet, Fabrice Lefèvre, Nicholas Asher, Seokhwan Kim, Teva Merlin
Venue:
SIGDIAL
Publisher:
Association for Computational Linguistics
Pages:
183–191
URL:
https://aclanthology.org/2025.sigdial-1.14/
Cite (ACL):
Laurent Prévot, Roxane Bertrand, and Julie Hunter. 2025. Segmenting a Large French Meeting Corpus into Elementary Discourse Units. In Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 183–191, Avignon, France. Association for Computational Linguistics.
Cite (Informal):
Segmenting a Large French Meeting Corpus into Elementary Discourse Units (Prévot et al., SIGDIAL 2025)
PDF:
https://aclanthology.org/2025.sigdial-1.14.pdf