Data Selection with Cluster-Based Language Difference Models and Cynical Selection

Lucía Santamaría; Amittai Axelrod


Abstract

We present and apply two methods for selecting relevant training data out of a general pool for use in tasks such as machine translation. Building on existing work on class-based language difference models [1], we first introduce a cluster-based method that uses Brown clusters to condense the vocabulary of the corpora. Second, we implement the cynical data selection method [2], which incrementally constructs a training corpus to efficiently model the task corpus. Both the cluster-based and the cynical data selection approaches are used for the first time within a machine translation system, and we perform a head-to-head comparison. Our intrinsic evaluations show that both new methods outperform the standard Moore-Lewis approach (cross-entropy difference), yielding lower perplexity and OOV rates on in-domain data. The cynical approach converges much more quickly, covering nearly all of the in-domain vocabulary with 84% less data than the other methods. Furthermore, the new approaches can be used to select machine translation training data for training better systems. Our results confirm that class-based selection using Brown clusters is a viable alternative to POS-based class-based methods and removes the reliance on a part-of-speech tagger. Additionally, we validate the recently proposed cynical data selection method, showing that its performance in SMT models surpasses that of traditional cross-entropy difference methods and that the data it selects more closely matches the sentence length of the task corpus.
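As a rough illustration of the cross-entropy difference baseline that the abstract compares against, the sketch below ranks pool sentences by their per-word cross-entropy under a task-corpus model minus their cross-entropy under a pool model, with lower scores ranked first. It is a minimal sketch, not the paper's implementation: the add-one-smoothed unigram models, the whitespace tokenization, and the function names (`train_unigram`, `cross_entropy`, `moore_lewis_rank`) are all assumptions made for brevity, whereas practical systems typically use higher-order n-gram language models.

```python
import math
from collections import Counter


def train_unigram(sentences, vocab):
    """Return a log-probability function for an add-one-smoothed unigram model.

    Assumption: a unigram model stands in for the n-gram LMs used in practice.
    """
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values())
    size = len(vocab) + 1  # one extra slot reserved for unseen tokens

    def logprob(tok):
        return math.log((counts.get(tok, 0) + 1) / (total + size))

    return logprob


def cross_entropy(sentence, logprob):
    """Per-word cross-entropy (in nats) of a sentence under a model."""
    toks = sentence.split()
    return -sum(logprob(t) for t in toks) / max(len(toks), 1)


def moore_lewis_rank(task, pool):
    """Sort pool sentences by H_task(s) - H_pool(s), most task-like first."""
    vocab = {t for s in task + pool for t in s.split()}
    lp_task = train_unigram(task, vocab)
    lp_pool = train_unigram(pool, vocab)
    scored = [
        (cross_entropy(s, lp_task) - cross_entropy(s, lp_pool), s)
        for s in pool
    ]
    return [s for _, s in sorted(scored)]


if __name__ == "__main__":
    task = ["the patient took the medicine", "the doctor prescribed medicine"]
    pool = [
        "the doctor took the medicine",
        "stocks fell sharply on monday",
        "the market rallied",
    ]
    for sentence in moore_lewis_rank(task, pool):
        print(sentence)
```

In this toy setup the medical sentence is ranked first because it is relatively more probable under the task model than under the pool model. A class-based variant along the lines the abstract describes would map each word to a class label (such as a Brown cluster ID) before scoring.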

Anthology ID: 2017.iwslt-1.19
Volume: Proceedings of the 14th International Conference on Spoken Language Translation
Month: December 14-15
Year: 2017
Address: Tokyo, Japan
Editors: Sakriani Sakti, Masao Utiyama
Venue: IWSLT
SIG: SIGSLT
Publisher: International Workshop on Spoken Language Translation
Pages: 137–145
URL: https://aclanthology.org/2017.iwslt-1.19/
Cite (ACL): Lucía Santamaría and Amittai Axelrod. 2017. Data Selection with Cluster-Based Language Difference Models and Cynical Selection. In Proceedings of the 14th International Conference on Spoken Language Translation, pages 137–145, Tokyo, Japan. International Workshop on Spoken Language Translation.
Cite (Informal): Data Selection with Cluster-Based Language Difference Models and Cynical Selection (Santamaría & Axelrod, IWSLT 2017)
PDF: https://aclanthology.org/2017.iwslt-1.19.pdf
