Holistic Evaluation of Automatic TimeML Annotators

Mustafa Ocal, Adrian Perez, Antonela Radas, Mark Finlayson


Abstract
TimeML is a scheme for representing temporal information (times, events, and temporal relations) in texts. Although automatic TimeML annotation is challenging, there has been notable progress, with F1 scores of 0.8–0.9 for the event and time detection subtasks and 0.5–0.7 for relation extraction. Individually, these subtask results are reasonable, even good, but when they are combined to generate a full TimeML graph, is the overall performance still acceptable? We present a novel suite of eight metrics, combined with a new graph-transformation experimental design, for the holistic evaluation of TimeML graphs. We apply these metrics to four automatic TimeML annotation systems (CAEVO, TARSQI, CATENA, and ClearTK). We show that, on average, 1/3 of the TimeML graphs produced by these systems are inconsistent, and that they contain on average 1/5 more temporal indeterminacy than the gold standard. We also show that the automatically generated graphs are on average 109 edits away from the gold standard, which is 1/3 of the way toward complete replacement. Finally, we show that the relationship between individual subtask performance and graph quality is non-linear: small errors in the TimeML subtasks lead to rapid degradation of final graph quality. These results suggest that current automatic TimeML annotators are far from optimal and that significant further improvement would be useful.
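To make "inconsistent" concrete: a TimeML graph is inconsistent when its temporal relations cannot all hold simultaneously, e.g. a cycle of BEFORE links among events. The sketch below is an illustrative simplification, not the paper's metric suite or any of the evaluated systems; it assumes a graph restricted to BEFORE edges over hypothetical event identifiers, whereas full TimeML consistency checking must reason over the complete set of interval relations.

```python
# Minimal sketch: detect inconsistency in a temporal graph restricted to
# BEFORE relations. A set of BEFORE edges is consistent iff it admits a
# valid linear ordering, i.e. the directed graph is acyclic.
from collections import defaultdict

def is_consistent(before_edges):
    """Return True if the BEFORE edges imply no cycle (a valid ordering exists)."""
    graph = defaultdict(list)
    nodes = set()
    for src, dst in before_edges:
        graph[src].append(dst)
        nodes.update((src, dst))

    WHITE, GRAY, BLACK = 0, 1, 2      # unvisited / on current path / done
    color = {n: WHITE for n in nodes}

    def has_cycle(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY:    # back edge: ordering contradiction
                return True
            if color[nxt] == WHITE and has_cycle(nxt):
                return True
        color[node] = BLACK
        return False

    return not any(color[n] == WHITE and has_cycle(n) for n in nodes)

# Hypothetical example: e1 BEFORE e2, e2 BEFORE e3, e3 BEFORE e1 is contradictory.
edges = [("e1", "e2"), ("e2", "e3"), ("e3", "e1")]
print(is_consistent(edges))  # False
```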
Anthology ID:
2022.lrec-1.155
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
1444–1453
URL:
https://aclanthology.org/2022.lrec-1.155
Cite (ACL):
Mustafa Ocal, Adrian Perez, Antonela Radas, and Mark Finlayson. 2022. Holistic Evaluation of Automatic TimeML Annotators. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1444–1453, Marseille, France. European Language Resources Association.
Cite (Informal):
Holistic Evaluation of Automatic TimeML Annotators (Ocal et al., LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.155.pdf