Enriching Word Alignment with Linguistic Tags

Xuansong Li, Niyu Ge, Stephen Grimes, Stephanie M. Strassel, Kazuaki Maeda


Abstract
Incorporating linguistic knowledge into word alignment is becoming increasingly important for current approaches in statistical machine translation research. To improve automatic word alignment and ultimately machine translation quality, an annotation framework is jointly proposed by LDC (Linguistic Data Consortium) and IBM. The framework enriches word alignment corpora to capture contextual, syntactic and language-specific features by introducing linguistic tags to the alignment annotation. Two annotation schemes constitute the framework: alignment and tagging. The alignment scheme aims to identify minimum translation units and translation relations by using minimum-match and attachment annotation approaches. A set of word tags and alignment link tags are designed in the tagging scheme to describe these translation units and relations. The framework produces a solid ground-level alignment base upon which larger translation unit alignment can be automatically induced. To test the soundness of this work, evaluation is performed on a pilot annotation, resulting in inter- and intra- annotator agreement of above 90%. To date LDC has produced manual word alignment and tagging on 32,823 Chinese-English sentences following this framework.
Anthology ID:
L10-1458
Volume:
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Month:
May
Year:
2010
Address:
Valletta, Malta
Editors:
Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias
Venue:
LREC
SIG:
Publisher:
European Language Resources Association (ELRA)
Note:
Pages:
Language:
URL:
http://www.lrec-conf.org/proceedings/lrec2010/pdf/670_Paper.pdf
DOI:
Bibkey:
Cite (ACL):
Xuansong Li, Niyu Ge, Stephen Grimes, Stephanie M. Strassel, and Kazuaki Maeda. 2010. Enriching Word Alignment with Linguistic Tags. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).
Cite (Informal):
Enriching Word Alignment with Linguistic Tags (Li et al., LREC 2010)
Copy Citation:
PDF:
http://www.lrec-conf.org/proceedings/lrec2010/pdf/670_Paper.pdf