Romanian TimeBank: An Annotated Parallel Corpus for Temporal Information

Corina Forăscu, Dan Tufiş


Abstract
The paper describes the main steps in the construction, annotation and validation of the Romanian version of the TimeBank corpus. Starting from the English TimeBank corpus, the reference annotated corpus in the temporal domain, we translated all 183 English news texts into Romanian and mapped the English annotations onto Romanian, with a success rate of 96.53%. Based on ISO-Time, the emerging standard for representing temporal information, which includes many of the previous annotation schemes, we evaluated the automatic transfer onto Romanian and, when necessary, corrected the Romanian annotations, so that in the end we obtained a 99.18% transfer rate for the TimeML annotations. In very few cases, due to language peculiarities, some original annotations could not be transferred. For the portability of the temporal annotation standard to Romanian, we suggested some additions to the ISO-Time standard, especially concerning the EVENT tag, based on linguistic evidence, Romanian grammar, and the localisations of TimeML to other Romance languages. Future improvements to Ro-TimeBank will take into consideration all temporal expressions, signals and events in texts, even those whose temporal anchoring is not very clear.
Anthology ID:
L12-1451
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
3762–3766
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/770_Paper.pdf
Cite (ACL):
Corina Forăscu and Dan Tufiş. 2012. Romanian TimeBank: An Annotated Parallel Corpus for Temporal Information. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 3762–3766, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
Romanian TimeBank: An Annotated Parallel Corpus for Temporal Information (Forăscu & Tufiş, LREC 2012)
PDF:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/770_Paper.pdf