Monolingual Data Optimisation for Bootstrapping SMT Engines

Jie Jiang, Andy Way, Nelson Ng, Rejwanul Haque, Mike Dillinger, Jun Lu


Abstract
Content localisation via machine translation (MT) is a sine qua non, especially for international online business. While most applications utilise rule-based solutions due to the lack of suitable in-domain parallel corpora for statistical MT (SMT) training, in this paper we investigate the possibility of applying SMT where only huge amounts of monolingual content are available. We describe a case study in which ALS analyses a very large amount of monolingual online trading data from eBay, with a view to reducing this corpus to the most representative sample that ensures the widest possible coverage of the total data set. Furthermore, minimal yet optimal sets of sentences, words and terms are selected to generate initial translation units for future SMT system-building.
Anthology ID:
2012.amta-monomt.2
Volume:
Workshop on Monolingual Machine Translation
Month:
October 28-November 1
Year:
2012
Address:
San Diego, California, USA
Editors:
Tsuyoshi Okita, Artem Sokolov, Taro Watanabe
Venue:
AMTA
Publisher:
Association for Machine Translation in the Americas
URL:
https://aclanthology.org/2012.amta-monomt.2
Cite (ACL):
Jie Jiang, Andy Way, Nelson Ng, Rejwanul Haque, Mike Dillinger, and Jun Lu. 2012. Monolingual Data Optimisation for Bootstrapping SMT Engines. In Workshop on Monolingual Machine Translation, San Diego, California, USA. Association for Machine Translation in the Americas.
Cite (Informal):
Monolingual Data Optimisation for Bootstrapping SMT Engines (Jiang et al., AMTA 2012)
PDF:
https://aclanthology.org/2012.amta-monomt.2.pdf