Alignment-based Profiling of Europarl Data in an English-Swedish Parallel Corpus

Lars Ahrenberg

Abstract

This paper profiles the Europarl part of an English-Swedish parallel corpus and compares it with three other subcorpora of the same parallel corpus. We first describe our method for comparison, which is based on manually reviewed word alignments. We investigate the relative frequencies of different types of correspondence, including null alignments, many-to-one correspondences and crossings. In addition, both halves of the parallel corpus have been annotated with morpho-syntactic information. The syntactic annotation uses labelled dependency relations. Thus, we can see how different types of correspondences are distributed across different parts of speech and compute correspondences at the structural level. Although two of the other subcorpora contain fiction, the Europarl part has the highest proportion of many types of restructurings, including additions, deletions, long-distance reorderings and dependency reversals. We explain this by the fact that the majority of Europarl segments are parallel translations rather than source texts and their translations.
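
As an illustration of the correspondence types named in the abstract, the following minimal sketch counts null alignments, many-to-one correspondences and crossings for a single aligned sentence pair. It is not the paper's implementation: the function name `profile_alignment`, the representation of an alignment as a set of (source index, target index) pairs, and the exact definitions of each count are assumptions made for this example only.

```python
from collections import Counter
from itertools import combinations


def profile_alignment(n_src, n_tgt, links):
    """Count simple correspondence types for one sentence pair.

    n_src, n_tgt: number of source and target tokens.
    links: iterable of (i, j) pairs, source token i linked to target token j.
    All definitions here are illustrative assumptions, not the paper's.
    """
    links = set(links)
    src_linked = Counter(i for i, _ in links)
    tgt_linked = Counter(j for _, j in links)

    # Null alignments: tokens that take part in no link at all.
    null_source = n_src - len(src_linked)
    null_target = n_tgt - len(tgt_linked)

    # Many-to-one: target tokens linked to more than one source token.
    many_to_one = sum(1 for count in tgt_linked.values() if count > 1)

    # Crossing: two links whose source order and target order disagree.
    crossings = sum(
        1
        for (i1, j1), (i2, j2) in combinations(sorted(links), 2)
        if (i1 - i2) * (j1 - j2) < 0
    )

    return {
        "null_source": null_source,
        "null_target": null_target,
        "many_to_one": many_to_one,
        "crossings": crossings,
        # Relative frequencies per token; 0.0 for an empty sentence.
        "null_source_rate": null_source / n_src if n_src else 0.0,
        "null_target_rate": null_target / n_tgt if n_tgt else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical 4-token source and 4-token target sentence.
    example_links = {(0, 0), (1, 2), (2, 1), (3, 2)}
    print(profile_alignment(4, 4, example_links))
```

Summing such counts over all sentence pairs in a subcorpus and dividing by its token total would give relative frequencies that can be compared across subcorpora, which is the kind of comparison the abstract describes.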

Anthology ID: L10-1129
Volume: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Month: May
Year: 2010
Address: Valletta, Malta
Editors: Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias
Venue: LREC
Publisher: European Language Resources Association (ELRA)
URL: http://www.lrec-conf.org/proceedings/lrec2010/pdf/193_Paper.pdf
Cite (ACL): Lars Ahrenberg. 2010. Alignment-based Profiling of Europarl Data in an English-Swedish Parallel Corpus. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).
Cite (Informal): Alignment-based Profiling of Europarl Data in an English-Swedish Parallel Corpus (Ahrenberg, LREC 2010)
PDF: http://www.lrec-conf.org/proceedings/lrec2010/pdf/193_Paper.pdf
