metaTED: a Corpus of Metadiscourse for Spoken Language

Rui Correia, Nuno Mamede, Jorge Baptista, Maxine Eskenazi


Abstract
This paper describes metaTED, a freely available corpus of metadiscursive acts in spoken language collected via crowdsourcing. Metadiscursive acts were annotated in a set of 180 randomly chosen TED talks in English, spanning different speakers and topics. The annotation taxonomy comprises 16 categories, adapted from Ädel (2010). This adaptation takes into account both the material to annotate and the setting in which the annotation task is performed. The crowdsourcing setup is described, including considerations regarding training and quality control. The collected data is evaluated in terms of number of occurrences, inter-annotator agreement, and annotation-related measures (such as average time on task and self-reported confidence). Results show that agreement varies across metadiscursive acts (α ∈ [0.15; 0.49]). To further assess the collected material, a subset of the annotations was submitted to experts, who validated which of the marked occurrences truly correspond to instances of the metadiscursive act at hand. As with the crowd, expert agreement varied across categories (α ∈ [0.18; 0.72]). The paper concludes with a discussion of the applicability of metaTED with respect to each of the 16 categories of metadiscourse.
Anthology ID:
L16-1618
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
3907–3913
URL:
https://aclanthology.org/L16-1618
Cite (ACL):
Rui Correia, Nuno Mamede, Jorge Baptista, and Maxine Eskenazi. 2016. metaTED: a Corpus of Metadiscourse for Spoken Language. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 3907–3913, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
metaTED: a Corpus of Metadiscourse for Spoken Language (Correia et al., LREC 2016)
PDF:
https://aclanthology.org/L16-1618.pdf
Data:
Penn Treebank