Automatic word alignment tools to scale production of manually aligned parallel texts

Stephen Grimes, Katherine Peterson, Xuansong Li


Abstract
We have been creating large-scale manual word alignment corpora for Arabic-English and Chinese-English language pairs in genres such as newswire, broadcast news and conversation, and web blogs. We are now meeting the challenge of word aligning further varieties of web data for Chinese and Arabic "dialects". Human word alignment annotation can be costly and arduous. Alignment guidelines may be imprecise or underspecified in cases where parallel sentences are hard to compare -- due to non-literal translations or differences between language structures. In order to speed annotation, we examine the effect that seeding manual alignments with automatic aligner output has on annotation speed and accuracy. We use automatic alignment methods that produce high-precision, low-recall alignments in order to minimize annotator corrections. Results suggest that annotation time can be reduced by up to 20%, but we also found that reviewing and correcting automatic alignments requires more time than anticipated. Throughout the paper we discuss crucial decisions on data structures for word alignment that likely have a significant impact on our results.
Anthology ID:
L12-1265
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
2194–2198
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/487_Paper.pdf
Cite (ACL):
Stephen Grimes, Katherine Peterson, and Xuansong Li. 2012. Automatic word alignment tools to scale production of manually aligned parallel texts. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2194–2198, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
Automatic word alignment tools to scale production of manually aligned parallel texts (Grimes et al., LREC 2012)
PDF:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/487_Paper.pdf