Nelson Ng
2012
Monolingual Data Optimisation for Bootstrapping SMT Engines
Jie Jiang | Andy Way | Nelson Ng | Rejwanul Haque | Mike Dillinger | Jun Lu
Workshop on Monolingual Machine Translation
Content localisation via machine translation (MT) is a sine qua non, especially for international online business. Most applications use rule-based solutions because suitable in-domain parallel corpora for statistical MT (SMT) training are lacking. In this paper we investigate the possibility of applying SMT where only monolingual content is available, albeit in huge amounts. We describe a case study in which ALS analyses a very large amount of monolingual online trading data from eBay, with a view to reducing this corpus to its most representative sample while ensuring the widest possible coverage of the total data set. Furthermore, minimal yet optimal sets of sentences, words and terms are selected to generate initial translation units for future SMT system-building.
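The abstract does not say how the representative sample is chosen. As an illustration only, the idea of reducing a monolingual corpus to a small set of sentences with wide coverage can be sketched as a greedy, set-cover-style selection. Everything in the sketch is an assumption rather than the authors' method: the function name `select_sentences`, whitespace tokenisation, the `budget` parameter, and the use of frequency-weighted word coverage as the objective.

```python
from collections import Counter


def select_sentences(sentences, budget):
    """Greedily pick up to `budget` sentences that together cover as much
    of the corpus's word-frequency mass as possible.

    Illustrative heuristic only; not the selection method of the paper.
    """
    # Assumption: lowercase whitespace tokenisation is good enough here.
    tokenised = [s.lower().split() for s in sentences]
    freq = Counter(w for toks in tokenised for w in toks)
    total = sum(freq.values())

    covered = set()   # word types already covered by chosen sentences
    chosen = []       # indices of selected sentences, in selection order
    remaining = set(range(len(sentences)))

    def gain(i):
        # Frequency mass of the word types sentence i would newly cover.
        return sum(freq[w] for w in set(tokenised[i]) - covered)

    while remaining and len(chosen) < budget:
        # Iterate in index order so ties go to the earliest sentence.
        best = max(sorted(remaining), key=gain)
        if gain(best) == 0:
            break  # nothing new left to cover
        chosen.append(best)
        covered.update(tokenised[best])
        remaining.remove(best)

    coverage = sum(freq[w] for w in covered) / total if total else 0.0
    return [sentences[i] for i in chosen], coverage
```

Calling `select_sentences(corpus, budget=2)` on a list of listing titles returns the chosen sentences together with the fraction of corpus tokens whose word type they cover. A greedy choice is used because exact minimum set cover is NP-hard; a real pipeline would likely also weight terms and n-grams, which this sketch leaves out.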