Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Student Research Workshop
- Anthology ID:
- October 21-25
- Waikiki, USA
- Association for Machine Translation in the Americas
Improving Syntax-Driven Translation Models by Re-structuring Divergent and Nonisomorphic Parse Tree Structures
Vamshi Ambati | Alon Lavie
Syntax-based approaches to statistical MT require syntax-aware methods for acquiring their underlying translation models from parallel data. This acquisition process can be driven by syntactic trees for either the source or target language, or by trees on both sides. Work to date has demonstrated that using trees for both sides suffers from severe coverage problems. This is primarily due to the highly restrictive space of constituent segmentations that the trees on two sides introduce, which adversely affects the recall of the resulting translation models. Approaches that project from trees on one side, on the other hand, have higher levels of recall, but suffer from lower precision, due to the lack of syntactically-aware word alignments. In this paper we explore the issue of lexical coverage of the translation models learned in both of these scenarios. We specifically look at how the non-isomorphic nature of the parse trees for the two languages affects recall and coverage. We then propose a novel technique for restructuring target parse trees, that generates highly isomorphic target trees that preserve the syntactic boundaries of constituents that were aligned in the original parse trees. We evaluate the translation models learned from these restructured trees and show that they are significantly better than those learned using trees on both sides and trees on one side.
Using Bilingual Chinese-English Word Alignments to Resolve PP-attachment Ambiguity in English
Victoria Fossum | Kevin Knight
Errors in English parse trees impact the quality of syntax-based MT systems trained using those parses. Frequent sources of error for English parsers include PP-attachment ambiguity, NP-bracketing ambiguity, and coordination ambiguity. Not all ambiguities are preserved across languages. We examine a common type of ambiguity in English that is not preserved in Chinese: given a sequence “VP NP PP”, should the PP be attached to the main verb, or to the object noun phrase? We present a discriminative method for exploiting bilingual Chinese-English word alignments to resolve this ambiguity in English. On a held-out test set of Chinese-English parallel sentences, our method achieves 86.3% accuracy on this PP-attachment disambiguation task, an improvement of 4% over the accuracy of the baseline Collins parser (82.3%).
Combination of Machine Translation Systems via Hypothesis Selection from Combined N-Best Lists
Almut Silja Hildebrand | Stephan Vogel
Different approaches in machine translation achieve similar translation quality with a variety of translations in the output. Recently it has been shown, that it is possible to leverage the individual strengths of various systems and improve the overall translation quality by combining translation outputs. In this paper we present a method of hypothesis selection which is relatively simple compared to system combination methods which construct a synthesis of the input hypotheses. Our method uses information from n-best lists from several MT systems and features on the sentence level which are independent from the MT systems involved to improve the translation quality.
The Value of Machine Translation for the Professional Translator
More and more Translation Memory (TM) systems nowadays are fortified with machine translation (MT) techniques to enable them to propose a translation to the translator when no match is found in his TM resources. The system attempts this by assembling a combination of terms from its terminology database, translations from its memory, and even portions of them. This paper reviews the most popular commercial TM systems with integrated MT techniques and explores their usefulness based on the perceived practical benefits brought to their users. Feedback from translators reveals a variety of attitudes towards machine translation, with some supporting and others contradicting several points of conventional wisdom regarding the relationship between machine translation and human translators.
Diacritization as a Machine Translation and as a Sequence Labeling Problem
Tim Schlippe | ThuyLinh Nguyen | Stephan Vogel
In this paper we describe and compare two techniques for the automatic diacritization of Arabic text: First, we treat diacritization as a monotone machine translation problem, proposing and evaluating several translation and language models, including word and character-based models separately and combined as well as a model which uses statistical machine translation (SMT) to post-edit a rule-based diacritization system. Then we explore a more traditional view of diacritization as a sequence labeling problem, and propose a solution using conditional random fields (Lafferty et al., 2001). All these techniques are compared through word error rate and diacritization error rate both in terms of full diacritization and ignoring vowel endings. The empirical experiments showed that the machine translation approaches perform better than the sequence labeling approaches concerning the error rates.
Multi-Source Translation Methods
Multi-parallel corpora provide a potentially rich resource for machine translation. This paper surveys existing methods for utilizing such resources, including hypothesis ranking and system combination techniques. We find that despite significant research into system combination, relatively little is know about how best to translate when multiple parallel source languages are available. We provide results to show that the MAX multilingual multi-source hypothesis ranking method presented by Och and Ney (2001) does not reliably improve translation quality when a broad range of language pairs are considered. We also show that the PROD multilingual multi-source hypothesis ranking method of Och and Ney (2001) cannot be used with standard phrase-based translation engines, due to a high number of unreachable hypotheses. Finally, we present an oracle experiment which shows that current hypothesis ranking methods fall far short of the best results reachable via sentence-level ranking.