%0 Conference Proceedings
%T Sample Selection for Large-scale MT Discriminative Training
%A Cao, Yuan
%A Khudanpur, Sanjeev
%S Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Research Papers
%D 2012
%8 oct 28 nov 1
%I Association for Machine Translation in the Americas
%C San Diego, California, USA
%F cao-khudanpur-2012-sample
%X Discriminative training for MT usually involves numerous features and requires a large-scale training set to reach reliable parameter estimates. As an alternative to training on expensive human-labeled parallel corpora, semi-supervised methods have been proposed that generate a huge amount of “hallucinated” data, which relieves the data sparsity problem. However, the large training set contains both good samples that are suitable for training and bad ones that are harmful to it. How training samples are selected from this vast amount of data can greatly affect training performance. In this paper we propose a method for selecting the samples that are most suitable for discriminative training according to a criterion measuring dataset quality. Our experimental results show that by adding samples to the training set selectively, we are able to exceed the performance of a system trained with the same number of samples selected randomly.
%U https://aclanthology.org/2012.amta-papers.3