ParCor 1.0: A Parallel Pronoun-Coreference Corpus to Support Statistical MT

Liane Guillou, Christian Hardmeier, Aaron Smith, Jörg Tiedemann, Bonnie Webber


Abstract
We present ParCor, a parallel corpus of texts in which pronoun coreference (reduced coreference in which pronouns are used as referring expressions) has been annotated. The corpus is intended both as a resource from which to learn systematic differences in pronoun use between languages and, ultimately, as a basis for developing and testing informed Statistical Machine Translation systems that address the problem of pronoun coreference in translation. At present, the corpus consists of a collection of parallel English-German documents from two different text genres: TED Talks (transcribed planned speech) and EU Bookshop publications (written text). All documents in the corpus have been manually annotated with respect to the type and location of each pronoun and, where relevant, its antecedent. We provide details of the texts that we selected, the guidelines and tools used to support annotation, and some corpus statistics. The texts in the corpus have already been translated into many languages, and we plan to expand the corpus into these other languages, as well as other genres, in the future.
Anthology ID:
L14-1268
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
3191–3198
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/298_Paper.pdf
Cite (ACL):
Liane Guillou, Christian Hardmeier, Aaron Smith, Jörg Tiedemann, and Bonnie Webber. 2014. ParCor 1.0: A Parallel Pronoun-Coreference Corpus to Support Statistical MT. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3191–3198, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
ParCor 1.0: A Parallel Pronoun-Coreference Corpus to Support Statistical MT (Guillou et al., LREC 2014)