Vamshi Ambati


2011

2010

Large scale parallel data generation for new language pairs requires intensive human effort and availability of experts. It becomes immensely difficult and costly to provide Statistical Machine Translation (SMT) systems for most languages due to the paucity of expert translators to provide parallel data. Even if experts are present, it appears infeasible due to the impending costs. In this paper we propose Active Crowd Translation (ACT), a new paradigm where active learning and crowd-sourcing come together to enable automatic translation for low-resource language pairs. Active learning aims at reducing cost of label acquisition by prioritizing the most informative data for annotation, while crowd-sourcing reduces cost by using the power of the crowds to make do for the lack of expensive language experts. We experiment and compare our active learning strategies with strong baselines and see significant improvements in translation quality. Similarly, our experiments with crowd-sourcing on Mechanical Turk have shown that it is possible to create parallel corpora using non-experts and with sufficient quality assurance, a translation system that is trained using this corpus approaches expert quality.

2009

2008

Syntax-based approaches to statistical MT require syntax-aware methods for acquiring their underlying translation models from parallel data. This acquisition process can be driven by syntactic trees for either the source or target language, or by trees on both sides. Work to date has demonstrated that using trees for both sides suffers from severe coverage problems. This is primarily due to the highly restrictive space of constituent segmentations that the trees on two sides introduce, which adversely affects the recall of the resulting translation models. Approaches that project from trees on one side, on the other hand, have higher levels of recall, but suffer from lower precision, due to the lack of syntactically-aware word alignments. In this paper we explore the issue of lexical coverage of the translation models learned in both of these scenarios. We specifically look at how the non-isomorphic nature of the parse trees for the two languages affects recall and coverage. We then propose a novel technique for restructuring target parse trees, that generates highly isomorphic target trees that preserve the syntactic boundaries of constituents that were aligned in the original parse trees. We evaluate the translation models learned from these restructured trees and show that they are significantly better than those learned using trees on both sides and trees on one side.
Producing machine translation (MT) for the many minority languages in the world is a serious challenge. Minority languages typically have few resources for building MT systems. For many minor languages there is little machine readable text, few knowledgeable linguists, and little money available for MT development. For these reasons, our research programs on minority language MT have focused on leveraging to the maximum extent two resources that are available for minority languages: linguistic structure and bilingual informants. All natural languages contain linguistic structure. And although the details of that linguistic structure vary from language to language, language universals such as context-free syntactic structure and the paradigmatic structure of inflectional morphology, allow us to learn the specific details of a minority language. Similarly, most minority languages possess speakers who are bilingual with the major language of the area. This paper discusses our efforts to utilize linguistic structure and the translation information that bilingual informants can provide in three sub-areas of our rapid development MT program: morphology induction, syntactic transfer rule learning, and refinement of imperfect learned rules.