Improving Patent Translation using Bilingual Term Extraction and Re-tokenization for Chinese–Japanese

Wei Yang, Yves Lepage


Abstract
Unlike European languages, many Asian languages such as Chinese and Japanese do not mark word boundaries typographically in their writing systems. Word segmentation (tokenization), which breaks sentences down into individual words (tokens), is normally treated as the first step in machine translation (MT). For Chinese and Japanese, different rules and segmentation tools lead to segmentation results at different levels of granularity in the two languages. To improve translation accuracy, we adjust and balance the granularity of the segmentation results around terms in the Chinese–Japanese patent corpus used to train the translation model. In this paper, we describe a statistical machine translation (SMT) system built on a Chinese–Japanese patent training corpus that is re-tokenized using extracted bilingual multi-word terms.
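
As an illustration of what re-tokenization around multi-word terms might look like, here is a minimal Python sketch. It assumes a greedy longest-match merge of already-segmented tokens against a list of extracted terms; the abstract does not describe the authors' actual procedure, so the function name `retokenize`, the `joiner` parameter, and the example sentences are hypothetical.

```python
def retokenize(tokens, terms, joiner=""):
    """Merge runs of tokens that match a known multi-word term into one token.

    tokens: list of strings produced by some word segmenter (assumed input).
    terms:  iterable of token sequences, one per multi-word term (assumed input).
    joiner: string placed between merged tokens (hypothetical parameter).
    """
    term_set = {tuple(t) for t in terms}
    max_len = max((len(t) for t in term_set), default=1)

    out = []
    i = 0
    while i < len(tokens):
        merged = False
        # Try the longest candidate span first (greedy longest match);
        # spans of length 1 are never merged.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in term_set:
                out.append(joiner.join(tokens[i:i + n]))
                i += n
                merged = True
                break
        if not merged:
            out.append(tokens[i])
            i += 1
    return out


# Hypothetical example: the same term is kept as one token on both sides,
# so the two segmentations have matching granularity around the term.
zh = retokenize(["该", "半导体", "装置", "包括", "电极"], [("半导体", "装置")])
ja = retokenize(["この", "半導体", "装置", "は", "電極", "を", "含む"], [("半導体", "装置")])
```

Applying the same term list to both sides of a parallel corpus is one way to make the segmentation granularity consistent around terms before training a translation model.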
Anthology ID:
W16-4619
Volume:
Proceedings of the 3rd Workshop on Asian Translation (WAT2016)
Month:
December
Year:
2016
Address:
Osaka, Japan
Editors:
Toshiaki Nakazawa, Hideya Mino, Chenchen Ding, Isao Goto, Graham Neubig, Sadao Kurohashi, Ir. Hammam Riza, Pushpak Bhattacharyya
Venue:
WAT
Publisher:
The COLING 2016 Organizing Committee
Pages:
194–202
URL:
https://aclanthology.org/W16-4619/
Cite (ACL):
Wei Yang and Yves Lepage. 2016. Improving Patent Translation using Bilingual Term Extraction and Re-tokenization for Chinese–Japanese. In Proceedings of the 3rd Workshop on Asian Translation (WAT2016), pages 194–202, Osaka, Japan. The COLING 2016 Organizing Committee.
Cite (Informal):
Improving Patent Translation using Bilingual Term Extraction and Re-tokenization for Chinese–Japanese (Yang & Lepage, WAT 2016)
PDF:
https://aclanthology.org/W16-4619.pdf