Simultaneous Sentence Boundary Detection and Alignment with Pivot-based Machine Translation Generated Lexicons

Antoine Bourlon; Chenhui Chu; Toshiaki Nakazawa; Sadao Kurohashi

Abstract

Sentence alignment is the task of aligning the parallel sentences in a pair of translated articles. This paper describes a method that performs sentence boundary detection and alignment simultaneously, which significantly improves alignment accuracy for languages such as Chinese, where sentence boundaries are uncertain. The method relies on a distinction between hard (certain) and soft (uncertain) punctuation delimiters; soft delimiters may be ignored when doing so optimizes the alignment result. The alignment method is used in combination with lexicons automatically generated from the input article pairs using pivot-based MT, which achieve better coverage of the input words with fewer entries than pre-existing dictionaries. Pivot-based MT also makes it possible to build dictionaries for language pairs with scarce parallel data. The alignment method is implemented in a tool that will be made freely available in the near future.
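The idea of hard and soft delimiters can be illustrated with a minimal Python sketch. It is not the authors' implementation: the delimiter sets `HARD` and `SOFT`, the toy `LEXICON`, the function names, the exhaustive enumeration of segmentations, and the restriction to 1-to-1 alignments scored by lexicon hits are all assumptions made for illustration only.

```python
import re
from itertools import combinations

# Hypothetical delimiter sets; the paper's actual sets are not given in the abstract.
HARD = "。！？"  # always end a sentence
SOFT = "，；"    # may or may not end a sentence


def split_fragments(text):
    """Split after every delimiter; return (fragment, is_hard_boundary) pairs."""
    delims = HARD + SOFT
    pieces = re.findall(f"[^{delims}]+[{delims}]?", text)
    return [(p, p[-1] not in SOFT) for p in pieces]


def candidate_segmentations(frags):
    """Yield every segmentation obtained by keeping a subset of soft boundaries."""
    soft_idx = [i for i, (_, hard) in enumerate(frags[:-1]) if not hard]
    for r in range(len(soft_idx) + 1):
        for kept in combinations(soft_idx, r):
            kept = set(kept)
            sents, cur = [], ""
            for i, (piece, hard) in enumerate(frags):
                cur += piece
                if hard or i in kept:
                    sents.append(cur)
                    cur = ""
            if cur:
                sents.append(cur)
            yield sents


def score(src_sent, tgt_sent, lexicon):
    """Count lexicon entries whose two sides appear in the sentence pair."""
    tgt_tokens = set(re.findall(r"\w+", tgt_sent.lower()))
    return sum(1 for s, t in lexicon.items() if s in src_sent and t in tgt_tokens)


def align(src_text, tgt_sents, lexicon):
    """Pick the segmentation whose 1-to-1 alignment has the most lexicon hits."""
    best, best_score = None, -1
    for sents in candidate_segmentations(split_fragments(src_text)):
        if len(sents) != len(tgt_sents):
            continue  # simplification: only 1-to-1 alignments are considered
        total = sum(score(s, t, lexicon) for s, t in zip(sents, tgt_sents))
        if total > best_score:
            best, best_score = list(zip(sents, tgt_sents)), total
    return best


# Toy lexicon standing in for one generated by pivot-based MT.
LEXICON = {"下雨": "rains", "家": "home", "学校": "school"}
pairs = align(
    "今天下雨，我在家，他去学校。",
    ["It rains today.", "I stay home and he goes to school."],
    LEXICON,
)
for src, tgt in pairs:
    print(src, "<->", tgt)
```

In this toy case the first comma is kept as a sentence boundary and the second is ignored, because that segmentation yields more lexicon matches against the two English sentences. Enumerating every subset of soft boundaries is exponential in their number, so a practical system would need a more efficient search such as dynamic programming.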

Anthology ID: L16-1348
Volume: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month: May
Year: 2016
Address: Portorož, Slovenia
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 2192–2198
URL: https://aclanthology.org/L16-1348/
Cite (ACL): Antoine Bourlon, Chenhui Chu, Toshiaki Nakazawa, and Sadao Kurohashi. 2016. Simultaneous Sentence Boundary Detection and Alignment with Pivot-based Machine Translation Generated Lexicons. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 2192–2198, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal): Simultaneous Sentence Boundary Detection and Alignment with Pivot-based Machine Translation Generated Lexicons (Bourlon et al., LREC 2016)
PDF: https://aclanthology.org/L16-1348.pdf
