ODIL_Syntax: a Free Spontaneous Spoken French Treebank Annotated with Constituent Trees

Ilaine Wang, Aurore Pelletier, Jean-Yves Antoine, Anaïs Halftermeyer


Abstract
This paper describes ODIL_Syntax, a French treebank built on spontaneous speech transcripts. The syntactic structure of every speech turn is represented by constituent trees, produced by a procedure that combines automatic annotation by a parser (here, the Stanford Parser) with manual revision. ODIL_Syntax follows the annotation scheme designed for the French TreeBank (FTB), with additional annotation guidelines that aim to represent specific features of spoken language such as speech disfluencies. The corpus will be freely distributed by January 2020 under a Creative Commons licence. It will serve as the basis for further semantic enrichment dedicated to the representation of temporal entities and temporal relations, as a second phase of the ODIL@Temporal project. The paper details the annotation scheme we followed, with an emphasis on the representation of speech disfluencies. We then present the annotation procedure, which was carried out on the Contemplata annotation platform. In the last section, we provide some distributional characteristics of the annotated corpus (POS distribution, multiword expressions).
Anthology ID:
2020.lrec-1.652
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
5301–5307
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.652
Cite (ACL):
Ilaine Wang, Aurore Pelletier, Jean-Yves Antoine, and Anaïs Halftermeyer. 2020. ODIL_Syntax: a Free Spontaneous Spoken French Treebank Annotated with Constituent Trees. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 5301–5307, Marseille, France. European Language Resources Association.
Cite (Informal):
ODIL_Syntax: a Free Spontaneous Spoken French Treebank Annotated with Constituent Trees (Wang et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.652.pdf