Burak Aydın
2014
Expanding machine translation training data with an out-of-domain corpus using language modeling based vocabulary saturation
Burak Aydın | Arzucan Özgür
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track
Training data size is of utmost importance for statistical machine translation (SMT), since it affects training time, model size, decoding speed, and the system’s overall success. One of the challenges in developing SMT systems for languages with fewer resources is the limited size of the available training data. In this paper, we propose an approach for expanding the training data by including parallel texts from an out-of-domain corpus. Selecting the best out-of-domain sentences for inclusion in the training set is important for the overall performance of the system. Our method first ranks the out-of-domain sentences using a language modeling approach and then adds sentences to the training set using the vocabulary saturation filter technique. We evaluated our approach on the English-Turkish language pair and obtained promising results, with performance improvements of up to 0.8 BLEU points for the English-Turkish translation system. We also compared our results with translation model combination approaches and report the improvements. Moreover, we implemented our system with dependency parse tree based language modeling in addition to n-gram based language modeling and report comparable results.
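The two-step selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the add-one-smoothed unigram model stands in for the n-gram and dependency parse tree language models used in the paper, the filter follows the commonly described form of vocabulary saturation filtering over unigrams, and all function names, the `threshold` parameter, and the optional `seen` seed are hypothetical.

```python
import math
from collections import Counter


def train_unigram_lm(in_domain_sentences):
    """Return a scoring function: per-token cross-entropy under an
    add-one-smoothed unigram model (an assumed stand-in for a real n-gram LM)."""
    counts = Counter(tok for s in in_domain_sentences for tok in s.split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot for unseen tokens

    def cross_entropy(sentence):
        toks = sentence.split()
        if not toks:
            return float("inf")  # empty sentences rank last
        logp = sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in toks)
        return -logp / len(toks)

    return cross_entropy


def rank_by_lm(pairs, cross_entropy):
    """Sort (source, target) pairs so the most in-domain-like source comes first."""
    return sorted(pairs, key=lambda p: cross_entropy(p[0]))


def vocabulary_saturation_filter(ranked_pairs, threshold=1, seen=None):
    """Keep a pair only if its source side contains at least one token that
    has been seen fewer than `threshold` times among the pairs kept so far."""
    seen = Counter() if seen is None else Counter(seen)
    selected = []
    for src, tgt in ranked_pairs:
        toks = src.split()
        if any(seen[t] < threshold for t in toks):
            selected.append((src, tgt))
            seen.update(toks)  # only kept sentences contribute to saturation
    return selected
```

Ranking before filtering matters because the filter is greedy: when several sentences would contribute the same new vocabulary, the one that appears earlier in the ranked list is the one that gets kept.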
2013
TÜBİTAK Turkish-English submissions for IWSLT 2013
Ertuğrul Yılmaz | İlknur Durgar El-Kahlout | Burak Aydın | Zişan Sıla Özil | Coşkun Mermer
Proceedings of the 10th International Workshop on Spoken Language Translation: Evaluation Campaign
This paper describes the TÜBİTAK Turkish-English submissions in both directions for the IWSLT’13 Evaluation Campaign TED Machine Translation (MT) track. We develop both phrase-based and hierarchical phrase-based statistical machine translation (SMT) systems based on Turkish word- and morpheme-level representations. We augment the training data with content words extracted from the training data itself and experiment with reversed word order for the source languages. For the Turkish-to-English direction, we use the Gigaword corpus as an additional language model alongside the training data. For the English-to-Turkish direction, we implement a wide-coverage Turkish word generator to generate words from stem and morpheme sequences. Finally, we perform system combination of the different systems produced with different word alignments.