Conference of the Association for Machine Translation in the Americas (2008)


bib (full) Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Research Papers

pdf bib
Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Research Papers

pdf bib
Spanish-to-Basque MultiEngine Machine Translation for a Restricted Domain
Iñaki Alegria | Arantza Casillas | Arantza Diaz de Ilarraza | Jon Igartua | Gorka Labaka | Mikel Lersundi | Aingeru Mayor | Kepa Sarasola

We present our initial strategy for Spanish-to-Basque MultiEngine Machine Translation, a language pair with very different structure and word order and with no huge parallel corpus available. This hybrid proposal is based on the combination of three different MT paradigms: Example-Based MT, Statistical MT and Rule- Based MT. We have evaluated the system, reporting automatic evaluation metrics for a corpus in a test domain. The first results obtained are encouraging.

pdf bib
Exploiting Document-Level Context for Data-Driven Machine Translation
Ralf Brown

This paper presents a method for exploiting document-level similarity between the documents in the training corpus for a corpus-driven (statistical or example-based) machine translation system and the input documents it must translate. The method is simple to implement, efficient (increases the translation time of an example-based system by only a few percent), and robust (still works even when the actual document boundaries in the input text are not known). Experiments on French-English and Arabic-English showed relative gains over the same system without using document-level similarity of up to 7.4% and 5.4%, respectively, on the BLEU metric.

pdf bib
Shallow-Syntax Phrase-Based Translation: Joint versus Factored String-to-Chunk Models
Mauro Cettolo | Marcello Federico | Daniele Pighin | Nicola Bertoldi

This work extends phrase-based statistical MT (SMT) with shallow syntax dependencies. Two string-to-chunks translation models are proposed: a factored model, which augments phrase-based SMT with layered dependencies, and a joint model, that extends the phrase translation table with microtags, i.e. per-word projections of chunk labels. Both rely on n-gram models of target sequences with different granularity: single words, micro-tags, chunks. In particular, n-grams defined over syntactic chunks should model syntactic constraints coping with word-group movements. Experimental analysis and evaluation conducted on two popular Chinese-English tasks suggest that the shallow-syntax joint-translation model has potential to outperform state-of-the-art phrase-based translation, with a reasonable computational overhead.

pdf bib
Discriminative, Syntactic Language Modeling through Latent SVMs
Colin Cherry | Chris Quirk

We construct a discriminative, syntactic language model (LM) by using a latent support vector machine (SVM) to train an unlexicalized parser to judge sentences. That is, the parser is optimized so that correct sentences receive high-scoring trees, while incorrect sentences do not. Because of this alternative objective, the parser can be trained with only a part-of-speech dictionary and binary-labeled sentences. We follow the paradigm of discriminative language modeling with pseudo-negative examples (Okanohara and Tsujii, 2007), and demonstrate significant improvements in distinguishing real sentences from pseudo-negatives. We also investigate the related task of separating machine-translation (MT) outputs from reference translations, again showing large improvements. Finally, we test our LM in MT reranking, and investigate the language-modeling parser in the context of unsupervised parsing.

pdf bib
Translation universals: do they exist? A corpus-based NLP study of convergence and simplification
Gloria Corpas Pastor | Ruslan Mitkov | Naveed Afzal | Viktor Pekar

Convergence and simplification are two of the so-called universals in translation studies. The first one postulates that translated texts tend to be more similar than non-translated texts. The second one postulates that translated texts are simpler, easier-to-understand than non-translated ones. This paper discusses the results of a project which applies NLP techniques over comparable corpora of translated and non-translated texts in Spanish seeking to establish whether these two universals hold Corpas Pastor (2008).

pdf bib
Computing multiple weighted reordering hypotheses for a phrase-based statistical machine translation system
Marta R. Costa-Jussà | José A. R. Fonollosa

Reordering is one source of error in statistical machine translation (SMT). This paper extends the study of the statistical machine reordering (SMR) approach, which uses the powerful techniques of the SMT systems to solve reordering problems. Here, the novelties yield in: (1) using the SMR approach in a SMT phrase-based system, (2) adding a feature function in the SMR step, and (3) analyzing the reordering hypotheses at several stages. Coherent improvements are reported in the TC-STAR task (Es/En) at a relatively low computational cost.

pdf bib
Overcoming Vocabulary Sparsity in MT Using Lattices
Steve DeNeefe | Ulf Hermjakob | Kevin Knight

Source languages with complex word-formation rules present a challenge for statistical machine translation (SMT). In this paper, we take on three facets of this challenge: (1) common stems are fragmented into many different forms in training data, (2) rare and unknown words are frequent in test data, and (3) spelling variation creates additional sparseness problems. We present a novel, lightweight technique for dealing with this fragmentation, based on bilingual data, and we also present a combination of linguistic and statistical techniques for dealing with rare and unknown words. Taking these techniques together, we demonstrate +1.3 and +1.6 BLEU increases on top of strong baselines for Arabic-English machine translation.

pdf bib
Toward the Evaluation of Machine Translation Using Patent Information
Atsushi Fujii | Masao Utiyama | Mikio Yamamoto | Takehito Utsuro

To aid research and development in machine translation, we have produced a test collection for Japanese/English machine translation. To obtain a parallel corpus, we extracted patent documents for the same or related inventions published in Japan and the United States. Our test collection includes approximately 2000000 sentence pairs in Japanese and English, which were extracted automatically from our parallel corpus. These sentence pairs can be used to train and evaluate machine translation systems. Our test collection also includes search topics for cross-lingual patent retrieval, which can be used to evaluate the contribution of machine translation to retrieving patent documents across languages. This paper describes our test collection, methods for evaluating machine translation, and preliminary experiments.

pdf bib
Automatic Learning of Morphological Variations for Handling Out-of-Vocabulary Terms in Urdu-English MT
Nizar Habash | Hayden Metsky

We present an approach for online handling of Out-of-Vocabulary (OOV) terms in Urdu-English MT. Since Urdu is morphologically richer than English, we expect a large portion of the OOV terms to be Urdu morphological variations that are irrelevant to English. We describe an approach to automatically learn English-irrelevant (target-irrelevant) Urdu (source) morphological variation rules from standard phrase tables. These rules are learned in an unsupervised (or lightly supervised) manner by exploiting redundancy in Urdu and collocation with English translations. We use these rules to hypothesize in-vocabulary alternatives to the OOV terms. Our results show that we reduce the OOV rate from a standard baseline average of 2.6% to an average of 0.3% (or 89% relative decrease). We also increase the BLEU score by 0.45 (absolute) and 2.8% (relative) on a standard test set. A manual error analysis shows that 28% of handled OOV cases produce acceptable translations in context.

pdf bib
A Generalized Reordering Model for Phrase-Based Statistical Machine Translation
Yanqing He | Chengqing Zong

Phrase-based translation models are widely studied in statistical machine translation (SMT). However, the existing phrase-based translation models either can not deal with non-contiguous phrases or reorder phrases only by the rules without an effective reordering model. In this paper, we propose a generalized reordering model (GREM) for phrase-based statistical machine translation, which is not only able to capture the knowledge on the local and global reordering of phrases, but also is able to obtain some capabilities of phrasal generalization by using non-contiguous phrases. The experimental results have indicated that our model out- performs MEBTG (enhanced BTG with a maximum entropy-based reordering model) and HPTM (hierarchical phrase-based translation model) by improvement of 1.54% and 0.66% in BLEU.

pdf bib
A truly multilingual, high coverage, accurate, yet simple, subsentential alignment method
Adrien Lardilleux | Yves Lepage

This paper describes a new alignment method that extracts high quality multi-word alignments from sentence-aligned multilingual parallel corpora. The method can handle several languages at once. The phrase tables obtained by the method have a comparable accuracy and a higher coverage than those obtained by current methods. They are also obtained much faster.

pdf bib
Large-scale Discriminative n-gram Language Models for Statistical Machine Translation
Zhifei Li | Sanjeev Khudanpur

We extend discriminative n-gram language modeling techniques originally proposed for automatic speech recognition to a statistical machine translation task. In this context, we propose a novel data selection method that leads to good models using a fraction of the training data. We carry out systematic experiments on several benchmark tests for Chinese to English translation using a hierarchical phrase-based machine translation system, and show that a discriminative language model significantly improves upon a state-of-the-art baseline. The experiments also highlight the benefits of our data selection method.

pdf bib
Are Multiple Reference Translations Necessary? Investigating the Value of Paraphrased Reference Translations in Parameter Optimization
Nitin Madnani | Philip Resnik | Bonnie J. Dorr | Richard Schwartz

Most state-of-the-art statistical machine translation systems use log-linear models, which are defined in terms of hypothesis features and weights for those features. It is standard to tune the feature weights in order to maximize a translation quality metric, using held-out test sentences and their corresponding reference translations. However, obtaining reference translations is expensive. In our earlier work (Madnani et al., 2007), we introduced a new full-sentence paraphrase technique, based on English-to-English decoding with an MT system, and demonstrated that the resulting paraphrases can be used to cut the number of human reference translations needed in half. In this paper, we take the idea a step further, asking how far it is possible to get with just a single good reference translation for each item in the development set. Our analysis suggests that it is necessary to invest in four or more human translations in order to significantly improve on a single translation augmented by monolingual paraphrases.

pdf bib
Integrating a Phrase-based SMT Model and a Bilingual Lexicon for Semi-Automatic Acquisition of Technical Term Translation Lexicons
Yohei Morishita | Takehito Utsuro | Mikio Yamamoto

This paper presents an attempt at developing a technique of acquiring translation pairs of technical terms with sufficiently high precision from parallel patent documents. The approach taken in the proposed technique is based on integrating the phrase translation table of a state-of-the-art statistical phrase-based machine translation model, and compositional translation generation based on an existing bilingual lexicon for human use. Our evaluation results clearly show that the agreement between the two individual techniques definitely contribute to improving precision of translation candidates. We then apply the Support Vector Machines (SVMs) to the task of automatically validating translation candidates in the phrase translation table. Experimental evaluation results again show that the SVMs based approach to translation candidates validation can contribute to improving the precision of translation candidates in the phrase translation table.

pdf bib
Linguistically-motivated Tree-based Probabilistic Phrase Alignment
Toshiaki Nakazawa | Sadao Kurohashi

In this paper, we propose a probabilistic phrase alignment model based on dependency trees. This model is linguistically-motivated, using syntactic information during alignment process. The main advantage of this model is that the linguistic difference between source and target languages is successfully absorbed. It is composed of two models: Model1 is using content word translation probability and function word translation probability; Model2 uses dependency relation probability which is defined for a pair of positional relations on dependency trees. Relation probability acts as tree-based phrase reordering model. Since this model is directed, we combine two alignment results from bi-directional training by symmetrization heuristics to get definitive alignment. We conduct experiments on a Japanese-English corpus, and achieve reasonably high quality of alignment compared with word-based alignment model.

pdf bib
Parsers as language models for statistical machine translation
Matt Post | Daniel Gildea

Most work in syntax-based machine translation has been in translation modeling, but there are many reasons why we may instead want to focus on the language model. We experiment with parsers as language models for machine translation in a simple translation model. This approach demands much more of the language models, allowing us to isolate their strengths and weaknesses. We find that unmodified parsers do not improve BLEU scores over ngram language models, and provide an analysis of their strengths and weaknesses.

pdf bib
A Statistical Analysis of Automated MT Evaluation Metrics for Assessments in Task-Based MT Evaluation
Calandra R. Tate

This paper applies nonparametric statistical techniques to Machine Translation (MT) Evaluation using data from a large scale task-based study. In particular, the relationship between human task performance on an information extraction task with translated documents and well-known automated translation evaluation metric scores for those documents is studied. Findings from a correlation analysis of this connection are presented and contrasted with current strategies for evaluating translations. An extended analysis that involves a novel idea for assessing partial rank correlation within the presence of grouping factors is also discussed. This work exposes the limitations of descriptive statistics generally used in this area, mainly correlation analysis, when using automated metrics for assessments in task handling purposes.

pdf bib
Wider Pipelines: N-Best Alignments and Parses in MT Training
Ashish Venugopal | Andreas Zollmann | Noah A. Smith | Stephan Vogel

State-of-the-art statistical machine translation systems use hypotheses from several maximum a posteriori inference steps, including word alignments and parse trees, to identify translational structure and estimate the parameters of translation models. While this approach leads to a modular pipeline of independently developed components, errors made in these “single-best” hypotheses can propagate to downstream estimation steps that treat these inputs as clean, trustworthy training data. In this work we integrate N-best alignments and parses by using a probability distribution over these alternatives to generate posterior fractional counts for use in downstream estimation. Using these fractional counts in a DOP-inspired syntax-based translation system, we show significant improvements in translation quality over a single-best trained baseline.

pdf bib
Improving English-to-Chinese Translation for Technical Terms using Morphological Information
Xianchao Wu | Naoaki Okazaki | Takashi Tsunakawa | Jun’ichi Tsujii

The continuous emergence of new technical terms and the difficulty of keeping up with neologism in parallel corpora deteriorate the performance of statistical machine translation (SMT) systems. This paper explores the use of morphological information to improve English-to-Chinese translation for technical terms. To reduce the morpheme-level translation ambiguity, we group the morphemes into morpheme phrases and propose the use of domain information for translation candidate selection. In order to find correspondences of morpheme phrases between the source and target languages, we propose an algorithm to mine morpheme phrase translation pairs from a bilingual lexicon. We also build a cascaded translation model that dynamically shifts translation units from phrase level to word and morpheme phrase levels. The experimental results show the significant improvements over the current phrase-based SMT systems.

pdf bib
Mining the Web for Domain-Specific Translations
Jian-Cheng Wu | Peter Wei-Huai Hsu | Chiung-Hui Tseng | Jason S. Chang

We introduce a method for learning to find domain-specific translations for a given term on the Web. In our approach, the source term is transformed into an expanded query aimed at maximizing the probability of retrieving translations from a very large collection of mixed-code documents. The method involves automatically generating sets of target-language words from training data in specific domains, automatically selecting target words for effectiveness in retrieving documents containing the sought-after translations. At run time, the given term is transformed into an expanded query and submitted to a search engine, and ranked translations are extracted from the document snippets returned by the search engine. We present a prototype, TermMine, which applies the method to a Web search engine. Evaluations over a set of domains and terms show that TermMine outperforms state-of-the-art machine translation systems.

pdf bib
Two-Stage Translation: A Combined Linguistic and Statistical Machine Translation Framework
Yushi Xu | Stephanie Seneff

We propose a two-stage system for spoken language machine translation. In the first stage, the source sentence is parsed and paraphrased into an intermediate language which retains the words in the source language but follows the word order of the target language as much as feasible. This stage is mostly linguistic. In the second stage, a statistical MT is performed to translate the intermediate language into the target language. For the task of English-to-Mandarin translation, we achieved a 2.5 increase in BLEU score and a 45% decrease in GIZA-Alignment Crossover, on IWSLT-06 data. In a human evaluation of the sentences that differed, the two-stage system was preferred three times as often as the baseline.


bib (full) Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Student Research Workshop

pdf bib
Improving Syntax-Driven Translation Models by Re-structuring Divergent and Nonisomorphic Parse Tree Structures
Vamshi Ambati | Alon Lavie

Syntax-based approaches to statistical MT require syntax-aware methods for acquiring their underlying translation models from parallel data. This acquisition process can be driven by syntactic trees for either the source or target language, or by trees on both sides. Work to date has demonstrated that using trees for both sides suffers from severe coverage problems. This is primarily due to the highly restrictive space of constituent segmentations that the trees on two sides introduce, which adversely affects the recall of the resulting translation models. Approaches that project from trees on one side, on the other hand, have higher levels of recall, but suffer from lower precision, due to the lack of syntactically-aware word alignments. In this paper we explore the issue of lexical coverage of the translation models learned in both of these scenarios. We specifically look at how the non-isomorphic nature of the parse trees for the two languages affects recall and coverage. We then propose a novel technique for restructuring target parse trees, that generates highly isomorphic target trees that preserve the syntactic boundaries of constituents that were aligned in the original parse trees. We evaluate the translation models learned from these restructured trees and show that they are significantly better than those learned using trees on both sides and trees on one side.

pdf bib
Using Bilingual Chinese-English Word Alignments to Resolve PP-attachment Ambiguity in English
Victoria Fossum | Kevin Knight

Errors in English parse trees impact the quality of syntax-based MT systems trained using those parses. Frequent sources of error for English parsers include PP-attachment ambiguity, NP-bracketing ambiguity, and coordination ambiguity. Not all ambiguities are preserved across languages. We examine a common type of ambiguity in English that is not preserved in Chinese: given a sequence “VP NP PP”, should the PP be attached to the main verb, or to the object noun phrase? We present a discriminative method for exploiting bilingual Chinese-English word alignments to resolve this ambiguity in English. On a held-out test set of Chinese-English parallel sentences, our method achieves 86.3% accuracy on this PP-attachment disambiguation task, an improvement of 4% over the accuracy of the baseline Collins parser (82.3%).

pdf bib
Combination of Machine Translation Systems via Hypothesis Selection from Combined N-Best Lists
Almut Silja Hildebrand | Stephan Vogel

Different approaches in machine translation achieve similar translation quality with a variety of translations in the output. Recently it has been shown, that it is possible to leverage the individual strengths of various systems and improve the overall translation quality by combining translation outputs. In this paper we present a method of hypothesis selection which is relatively simple compared to system combination methods which construct a synthesis of the input hypotheses. Our method uses information from n-best lists from several MT systems and features on the sentence level which are independent from the MT systems involved to improve the translation quality.

pdf bib
The Value of Machine Translation for the Professional Translator
Elina Lagoudaki

More and more Translation Memory (TM) systems nowadays are fortified with machine translation (MT) techniques to enable them to propose a translation to the translator when no match is found in his TM resources. The system attempts this by assembling a combination of terms from its terminology database, translations from its memory, and even portions of them. This paper reviews the most popular commercial TM systems with integrated MT techniques and explores their usefulness based on the perceived practical benefits brought to their users. Feedback from translators reveals a variety of attitudes towards machine translation, with some supporting and others contradicting several points of conventional wisdom regarding the relationship between machine translation and human translators.

pdf bib
Diacritization as a Machine Translation and as a Sequence Labeling Problem
Tim Schlippe | ThuyLinh Nguyen | Stephan Vogel

In this paper we describe and compare two techniques for the automatic diacritization of Arabic text: First, we treat diacritization as a monotone machine translation problem, proposing and evaluating several translation and language models, including word and character-based models separately and combined as well as a model which uses statistical machine translation (SMT) to post-edit a rule-based diacritization system. Then we explore a more traditional view of diacritization as a sequence labeling problem, and propose a solution using conditional random fields (Lafferty et al., 2001). All these techniques are compared through word error rate and diacritization error rate both in terms of full diacritization and ignoring vowel endings. The empirical experiments showed that the machine translation approaches perform better than the sequence labeling approaches concerning the error rates.

pdf bib
Multi-Source Translation Methods
Lane Schwartz

Multi-parallel corpora provide a potentially rich resource for machine translation. This paper surveys existing methods for utilizing such resources, including hypothesis ranking and system combination techniques. We find that despite significant research into system combination, relatively little is know about how best to translate when multiple parallel source languages are available. We provide results to show that the MAX multilingual multi-source hypothesis ranking method presented by Och and Ney (2001) does not reliably improve translation quality when a broad range of language pairs are considered. We also show that the PROD multilingual multi-source hypothesis ranking method of Och and Ney (2001) cannot be used with standard phrase-based translation engines, due to a high number of unreachable hypotheses. Finally, we present an oracle experiment which shows that current hypothesis ranking methods fall far short of the best results reachable via sentence-level ranking.


bib (full) Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Government and Commercial Uses of MT

pdf bib
Machine Translation for Triage and Exploitation of Massive Text Data
James E. Andrews | Kristen Summers

The National Ground Intelligence Center (NGIC) collects massive quantities of textual data in foreign languages. To support exploitation in light of intelligence requirements, a triage process must be applied to this data as those requirements emerge, to identify the most useful data for further exploitation. Machine translation provides critical support for this triage. This paper outlines the types of collected data and the different challenges they present for machine translation, as well as the types of triage to support for collections of this nature, and the issues raised for machine translation by those uses.

pdf bib
Global Public Health Intelligence Network (GPHIN)
Michael Blench

GPHIN is a secure Internet-based “early warning” system that gathers preliminary reports of public health significance on a near “real-time” basis, 24 hours a day, 7 days a week. This unique multilingual system gathers and disseminates relevant information on disease outbreaks and other public health events by monitoring global media sources such as news wires and web sites. This monitoring is done in nine languages with machine translation being used to translate non-English articles into English and English articles into the other languages. The information is filtered for relevancy by an automated process which is then complemented by human analysis. The output is categorized and made accessible to users. Notifications about public health events that may have serious public health consequences are immediately forwarded to users. GPHIN is managed by the Public Health Agency of Canada’s Centre for Emergency Preparedness and Response (CEPR), which was created in July 2000 to serve as Canada’s central coordinating point for public health security. It is considered a centre of expertise in the area of civic emergencies including natural disasters and malicious acts with health repercussions.

pdf bib
Sharing User Dictionaries Across Multiple Systems with UTX-S
Francis Bond | Seiji Okura | Yuji Yamamoto | Toshiki Murata | Kiyotaka Uchimoto | Michael Kato | Miwako Shimazu | Tsugiyoshi Suzuki

Careful tuning of user-created dictionaries is indispensable when using a machine translation system for computer aided translation. However, there is no widely used standard for user dictionaries in the Japanese/English machine translation market. To address this issue, AAMT (the Asia-Pacific Association for Machine Translation) has established a specification of sharable dictionaries (UTX-S: Universal Terminology eXchange -- Simple), which can be used across different machine translation systems, thus increasing the interoperability of language resources. UTX-S is simpler than existing specifications such as UPF and OLIF. It was explicitly designed to make it easy to (a) add new user dictionaries and (b) share existing user dictionaries. This facilitates rapid user dictionary production and avoids vendor tie in. In this study we describe the UTX-Simple (UTX-S) format, and show that it can be converted to the user dictionary formats for five commercial English-Japanese MT systems. We then present a case study where we (a) convert an on-line glossary to UTX-S, and (b) produce user dictionaries for five different systems, and then exchange them. The results show that the simplified format of UTX-S can be used to rapidly build dictionaries. Further, we confirm that customized user dictionaries are effective across systems, although with a slight loss in quality: on average, user dictionaries improved the translations for 44.8% of translations with the systems they were built for and 37.3% of translations for different systems. In ongoing work, AAMT is using UTX-S as the format in building up a user community for producing, sharing, and accumulating user dictionaries in a sustainable way.

pdf bib
Many-to-Many Multilingual Medical Speech Translation on a PDA
Kyoko Kanzaki | Yukie Nakao | Manny Rayner | Marianne Santaholma | Marianne Starlander | Nikos Tsourakis

Particularly considering the requirement of high reliability, we argue that the most appropriate architecture for a medical speech translator that can be realised using today’s technology combines unidirectional (doctor to patient) translation, medium-vocabulary controlled language coverage, interlingua-based translation, an embedded help component, and deployability on a hand-held hardware platform. We present an overview of the Open Source MedSLT prototype, which has been developed in accordance with these design principles. The system is implemented on top of the Regulus and Nuance 8.5 platforms, translates patient examination questions for all language pairs in the set {English, French, Japanese, Arabic, Catalan}, using vocabularies of about 400 to 1 100 words, and can be run in a distributed client/server environment, where the client application is hosted on a Nokia Internet Tablet device.

pdf bib
MT errors in Chinese-to-English MT systems: user feedback
Shin Chang-Meadows

pdf bib
Working with the US Government: Information Resources
Jennifer DeCamp

This document provides information on how companies and researchers in machine translation can work with the U.S. Government. Specifically, it addresses information on (1) groups in the U.S. Government working with translation and potentially having a need for machine translation; (2) means for companies and researchers to provide information to the United States Government about their work; and (3) U.S. Government organizations providing grants of possible interest to this community.

pdf bib
Reliable Innovation: A Tecchie’s Travels in the Land of Translators
Alain Désilets | Louise Brunette | Christiane Melançon | Geneviève Patenaude

Machine Translation (MT) is rapidly progressing towards quality levels that might make it appropriate for broad user populations in a range of scenarios, including gisting and post-editing in unconstrained domains. For this to happen, the field may however need to switch gear and move away from its current technology driven paradigm to a more user-centered approach. In this paper, we discuss how ethnographic techniques like Contextual Inquiry could help in that respect, by providing researchers and developers with rich information about the world and needs of potential end-users. We discuss how data from Contextual Inquiries with professional translators was used to concretely and positively influence several research and development projects in the area of Computer Assisted Translation technology. These inquiries had many benefits, including: (i) grounding developers and researchers in the world of their end-users, (ii) generating new technology ideas, (iii) selecting between competing development project ideas, (iv) finding how to alleviate friction for important ideas that go against the grain of current user practices, (v) evaluating existing or experimental technologies, (vi) helping with micro level design decision, (vii) building credibility with translators, and (viii) fostering multidisciplinary discussion between researchers.

pdf bib
Automated Machine Translation Improvement Through Post-Editing Techniques: Analyst and Translator Experiments
Jennifer Doyon | Christine Doran | C. Donald Means | Domenique Parr

From the Automatic Language Processing Advisory Committee (ALP AC) (Pierce et al., 1966) machine translation (MT) evaluations of the ‘60s to the Defense Advanced Research Projects Agency (DARPA) Global Autonomous Language Exploitation (GALE) (Olive, 2008) and National Institute of Standards and Technology (NIST) (NIST, 2008) MT evaluations of today, the U.S. Government has been instrumental in establishing measurements and baselines for the state-of-the-art in MT engines. In the same vein, the Automated Machine Translation Improvement Through Post-Editing Techniques (PEMT) project sought to establish a baseline of MT engines based on the perceptions of potential users. In contrast to these previous evaluations, the PEMT project’s experiments also determined the minimal quality level output needed to achieve before users found the output acceptable. Based on these findings, the PEMT team investigated using post-editing techniques to achieve this level. This paper will present experiments in which analysts and translators were asked to evaluate MT output processed with varying post-editing techniques. The results show at what level the analysts and translators find MT useful and are willing to work with it. We also establish a ranking of the types of post-edits necessary to elevate MT output to the minimal acceptance level.

pdf bib
User-centered MT Development and Implementation
Kathleen Egan | Francis Kubala | Allen Sears

pdf bib
Identifying Common Challenges for Human and Machine Translation: A Case Study from the GALE Program
Lauren Friedman | Stephanie Strassel

The dramatic improvements shown by statistical machine translation systems in recent years clearly demonstrate the benefits of having large quantities of manually translated parallel text for system training and development. And while many competing evaluation metrics exist to evaluate MT technology, most of those methods also crucially rely on the existence of one or more high quality human translations to benchmark system performance. Given the importance of human translations in this framework, understanding the particular challenges of human translation-for-MT is key, as is comprehending the relative strengths and weaknesses of human versus machine translators in the context of an MT evaluation. Vanni (2000) argued that the metric used for evaluation of competence in human language learners may be applicable to MT evaluation; we apply similar thinking to improve the prediction of MT performance, which is currently unreliable. In the current paper we explore an alternate model based upon a set of genre-defining features that prove to be consistently challenging for both humans and MT systems.

pdf bib
Automatic Translation of Court Judgments
Fabrizio Gotti | Guy Lapalme | Elliott Macklovitch | Atefeh Farzindar

This document presents an experiment in the automatic translation of Canadian Court judgments from English to French and from French to English. We show that although the language used in this type of legal text is complex and specialized, an SMT system can produce intelligible and useful translations, provided that the system can be trained on a vast amount of legal text. We also describe the results of a human evaluation of the output of the system.

pdf bib
Designing and executing MT workflows through the Kepler Framework
Reginald Hobbs | Clare Voss

pdf bib
ClipperRSS: A Light-Weight Prototype for the Cross-language Exploitation of Syndicated Feeds
Rod Holland | Brenden Keyes

Syndicated feeds in RSS, Atom, and related formats have emerged as ubiquitous information sources in World Wide Web language communities including Arabic, Farsi, Chinese, and others, providing subscribers with timely updates on topics of particular interest. We have modified an existing Open Source RSS reader, Sage, for cross-language use, permitting English-speakers to discover, subscribe to, update, and browse RSS feeds in ten languages. This early prototype, called Clip- perRSS, has been integrated with the Clipper cross-language information retrieval tool. The integrated system provides English-speakers with an effective means of exploring the potential of foreign-language syndicated feeds in their domains of interest.

pdf bib
Trends in automated translation in today’s global business
Sophie Hurst

SDL, in association with the International Association for Machine Translation (IAMT) and Association for Machine Translation Americas (AMTA), ran a survey which was completed by over 385 individuals in global businesses. The results were fascinating and definitely show an increased interest in the use of automated translation over the last two years.

pdf bib
Machine Translation for Indonesian and Tagalog
Brianna Laugher | Ben MacLeod

Kataku is a hybrid MT system for Indonesian to English and English to Indonesian translation, available on Windows, Linux and web-based platforms. This paper briefly presents the technical background to Kataku, some of its use cases and extensions. Kataku is the flagship product of ToggleText, a language technology company based in Melbourne, Australia.

pdf bib
Real-time translation of IM Chat
Robert Levin

pdf bib
TransSearch: What are translators looking for?
Elliott Macklovitch | Guy Lapalme | Fabrizio Gotti

Notwithstanding machine translation’s impressive progress over the last decade, many translators remain convinced that the output of even the best MT systems is not sufficient to facilitate the production of publication-quality texts. To increase their productivity they turn instead to translator support tools. We examine the use of one such tool: TransSearch, an online bilingual concordancer. From the millions of requests stored in the system’s logs over a 6-year period, we extracted and analyzed the most frequently submitted queries, in an effort to characterize the kinds of problems for which translators turn to this system for help. What we discover, somewhat surprisingly, is that our system seems particularly well-suited to help translate highly polysemous adverbials and prepositional phrases.

pdf bib
Language Translation Solutions for Community Content
Daniel Marcu

pdf bib
The Use of Machine-generated Transcripts during Human Translation
Allison L. Powell | Allison Blodgett

At the request of the USG National Virtual Translation Center, the University of Maryland Center for Advanced Study of Language conducted a study that assessed the role of several factors mediating transcript usefulness during translation tasks. These factors included source language (Mandarin or Modern Standard Arabic), native speaker status of the translators, transcript quality (low or moderate word error rate), and transcript functionality (static or dynamic). Using 54 Mandarin and 54 Arabic translators (half native speakers in each language) and broadcast news clips for input, the study demonstrated that translation environments that provide dynamic transcripts with low or moderate word error rates are likely to improve performance (measured as integrated speed and accuracy scores) among non-native speakers without decreasing performance among native speakers.

pdf bib
Meeting Army Foreign language Requirements with the Aid of Machine Translation
Cecil MacPherson | Devin Rollis | Irene Zehmisch

The United States Army has a wide range of language requirements, varying greatly in both the number of requisite languages, and the complexity of the tasks for which language translation is crucial. Machine language translation will be an important part of the support needed to translate documents, monitor news media, and engage non-English speakers in conversation. The machine language translation community has made significant advances in the technology over the past several years, and the Army is looking to both support research and development, and to capitalize on the technology to improve communication and save lives. The Army Language Requirements Branch and the Sequoyah Program Office have received several requests from language technology developers for information on the direction and end-state goals of the Sequoyah program. In this paper, we will attempt to describe the Army’s language needs and to document requirements and goals for a machine language translation program.

pdf bib
Hybrid Machine Translation Applied to Media Monitoring
Hassan Sawaf | Braddock Gaskill | Michael Veronis

In this paper, a system is presented that recognizes spoken utterances in Arabic Dialects which are translated into text in English. The input is recorded from a broadcast channel and recognized using automatic speech recognition that recognize Modern Standard Arabic and Iraqi Colloquial Arabic. The recognized utterances are normalized into Modern Standard Arabic and the output of this Modern Standard Arabic interlingua is then translated by a hybrid machine translation system, combining statistical and rule-based features.

pdf bib
Artificial Cognitive MT Post-Editing Intelligence
Jörg Schütz

Post-editing (PE) is a necessary process in every MT deployment environment. The compe­tences needed for PE are traditionally seen as a subset of a human translator's competence. Meanwhile, some companies are accepting that the PE process involves self-standing linguistic tasks, which need their own training efforts and appropriate software tool support. To date, we still lack recorded qualitatively and quantitatively PE user-activity data that adequately describe the tasks and in particular the human cognitive processes accomplished. This data is needed to effectively model, de­sign and implement supportive software sys­tems which, on the one hand, efficiently guide the human post-editor and enhance her cogni­tive capabilities, and on the other hand, have a certain influence on the translation perfor­mance and competence of the employed MT system. In this paper we argue for a frame­work of practices to describe the PE process by correlating data obtained in laboratory ex­periments and augmented by additional data from different resources such as interviews and mathematical prediction models with the tasks fulfilled, and to model the identified pro­cess in a multi-facetted fashion as a basis for the implementation of a human PE-aware in­teractive software system.

pdf bib
Language Processing for Analysis and Investigation
Kristen Summers | Diane Chandler

This paper describes an operational case and document management and exploitation system, GlobalView, that includes Machine Translation (MT) for use as an aid to human effort in analysis and investigation. It also presents the REFLEX platform for experimenting with language processing tools.

pdf bib
Embedding Technology at the front end of the Human Translation Workflow: An NVTC Vision
Carol van Ess-Dykema | Helen G. Gigley | Stephen Lewis | Emily Vancho Bannister

This paper describes the strategic vision for a new translation management workflow for the US Government’s National Virtual Translation Center (NVTC). The paper also describes past, current, and planned experiments validating the vision, along with experiment results to-date. The most salient features of the new workflow include the embedding of translation technology at the front end of the workflow (e.g., translation memory technology, specialized lexicons, and machine translation), technology-generated “seed translation”, a new human work role called “paralinguist” to assess the “seed translation” and assign an appropriate translator/post-editor, and new human translation strategies including federated search of online dictionaries and collaborative translation.

pdf bib
Applicability of Resource-based Machine Translation to Airplane Manuals
Eiko Yamamoto | Akira Terada | Hitoshi Isahara

Machine translation (MT) has been studied and developed since the advent of computers, and yet is rarely used in actual business. For business use, rule-based MT has been developed, but it requires rules and a domain-specific dictionary that have been created manually. On the other hand, as huge amounts of text data have become available, corpus-based MT has been actively studied, particularly corpus-based statistical machine translation (SMT). In this study, we tested and verified the usefulness of SMT for aviation manuals. Manuals tend to be similar and repetitive, so SMT is powerful even with a small amount of training data. Although our experiments with SMT are at the preliminary stage, the BLEU score is high. SMT appears to be a powerful and promising technique in this domain.

pdf bib
Applications of MT during Olympic Games 2008
Chengqing Zong | Heyan Huang | Shuming Shi


bib (full) Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Additional Papers

pdf bib
Can MT really help the Department of Defense?
Nicholas Bemish

The DoD already makes extensive use of machine translation and language support tools in a many environments to address a variety of communications, training, and intelligence challenges, and has done so for over 30 years. Mr. Bemish draws on his personal experience deploying MT, as well as his broad exposure to how translation technology is used in the branches of service and in military intelligence, to describe current uses of translation technology across a range of organizations within the DoD. He also addresses the technical issues that slow deployment and the cultural challenges involved in setting expectations and introducing technology that changes the way people work.

pdf bib
Language Technology Resource Center
Jennifer DeCamp

This paper describes the Language Technology Resource Center (LTRC), a U.S. Government website for providing information and tools for users of languages (e.g., translators, analysts, etc.) The LTRC provides information on a broad range of products and tools, and provides a means for product developers and researchers to provide the U.S. Government and the public with information about their work. The LTRC is part of the Foreign Language Resource Center (FLRC) program, which provides support to Government and professional organizations regarding language tools.