Machine Translation Summit (2005) - ACL Anthology

Volumes

Proceedings of Machine Translation Summit X: Invited papers (MTSummit, 11 papers)
Proceedings of Machine Translation Summit X: Papers (MTSummit, 41 papers)
Proceedings of Machine Translation Summit X: Posters (MTSummit, 22 papers)
Proceedings of Machine Translation Summit X: Tutorial notes (MTSummit, 2 papers)
Workshop on open-source machine translation (MTSummit, 4 papers)
Workshop on example-based machine translation (MTSummit, 17 papers)
Workshop on patent translation (MTSummit, 9 papers)
Workshop on Semantic Web technologies for machine translation (MTSummit, 5 papers)

Proceedings of Machine Translation Summit X: Invited papers

Reviewing Back the Past MT Summits
Makoto Nagao

Multilingual Information Service System for the Beijing Olympics, a Playground of MT Technologies
Weiquan Liu

This paper introduces the Multilingual Information Service System being implemented for the Beijing Olympics. Multilingual machine translation is an important component of this system. This real-world application calls for technologies that are advanced as well as mature and proven, which poses a challenge for MT. However, by appropriately choosing the domain and scenario, current MT technologies have been successfully integrated into the pilot system. Future applications will require MT to offer better readability, broader lexicon coverage and greater efficiency. Multilingual support and fast language adaptation are highly desired by such real-world systems.

Commercial Language Technology in the Age of Open Source
Rose Lockwood

Global Public Health Intelligence Network (GPHIN)
Abla Mawudeku | Michael Blench

Accurate and timely information on global public health issues is key to being able to quickly assess and respond to emerging health risks around the world. The Public Health Agency of Canada has developed the Global Public Health Intelligence Network (GPHIN). Information from GPHIN is provided to the WHO, international governments and non-governmental organizations, which can then quickly react to public health incidents. GPHIN is a secure Internet-based “early warning” system that gathers preliminary reports of public health significance on a “real-time” basis, 24 hours a day, 7 days a week. This unique multilingual system gathers and disseminates relevant information on disease outbreaks and other public health events by monitoring global media sources such as news wires and web sites. This monitoring is done in eight languages, with machine translation being used to translate non-English articles into English and English articles into the other languages. The information is filtered for relevancy by an automated process, which is then complemented by human analysis. The output is categorized and made accessible to users. Notifications about public health events that may have serious public health consequences are immediately forwarded to users. GPHIN employs a “best-of-breed” approach when it comes to the selection of the machine translation ‘engines’. This philosophy ensures that the quality of the machine translation is the best available for whichever language pair is selected. It also imposes some unique integration and operational problems. GPHIN has a broad scope. It tracks events such as disease outbreaks, infectious diseases, contaminated food and water, bio-terrorism and exposure to chemicals, natural disasters, and issues related to the safety of products, drugs and medical devices.
GPHIN is managed by Health Canada’s Centre for Emergency Preparedness and Response (CEPR), which was created in July 2000 to serve as Canada’s central coordinating point for public health security. It is considered a centre of expertise in the area of civic emergencies including natural disasters and malicious acts with health repercussions. CEPR offers a number of practical supports to municipalities, provinces and territories, and other partners involved in first response and public health security. This is achieved through its network of public health, emergency health services, and emergency social services contacts.

One Decade of Statistical Machine Translation: 1996-2005
Hermann Ney

In the last decade, the statistical approach has found widespread use in machine translation both for written and spoken language and has had a major impact on the translation accuracy. This paper will cover the principles of statistical machine translation and summarize the progress made so far.

Japan’s IT Strategy
Toru Yamauchi

Introduction to China’s HTRDP Machine Translation Evaluation
Qun Liu | Hongxu Hou | Shouxun Lin | Yueliang Qian | Yujie Zhang | Hitoshi Isahara

Since 1994, China’s HTRDP machine translation evaluation has been conducted five times. Systems covering various translation directions among Chinese, English, Japanese and French have been tested. Both human evaluation and automatic evaluation are conducted in the HTRDP evaluation. In recent years, the evaluation has been organized jointly with NICT of Japan. This paper introduces some details of this evaluation.

Phrase-Based Statistical Machine Translation for MANOS System
Bo Xu | Z. B. Chen | W. Wei | W. Pan | Z. D. Yang

The MANOS (Multilingual Application Network for Olympic Services) project aims to provide intelligent multilingual information services for the 2008 Olympic Games. Narrowing down from general language technology, this paper gives an overview of our new work on Phrase-Based Statistical Machine Translation (PBT) within the MANOS framework. We start with the construction of a large-scale, sentence-aligned Chinese-English corpus and introduce four methods to extract phrases. The promising results from the PBT systems give us confidence that a high-quality translation system can be constructed and harmoniously integrated into the MANOS platform.

Improving Machine Translation Quality
Gregor Thurmair

This paper reports on measures to improve the quality of MT systems, by using a hybrid system architecture which adds corpus-based and statistical components to an existing rule-based system backbone. The focus is on improving the accuracy of the dictionary resources.

Research & Development of Multi-lingual Machine Translation and Applications
Huang Heyan

Intercultural Collaboration using Machine Translation
Toru Ishida

Proceedings of Machine Translation Summit X: Papers

Extracting Representative Arguments from Dictionaries for Resolving Zero Pronouns
Shigeko Nariyama | Eric Nichols | Francis Bond | Takaaki Tanaka | Hiromi Nakaiwa

We propose a method to alleviate the problem of referential granularity for Japanese zero pronoun resolution. We use dictionary definition sentences to extract ‘representative’ arguments of predicative definition words; e.g. ‘arrest’ is likely to take police as the subject and criminal as its object. These representative arguments are far more informative than ‘person’, which is what other valency dictionaries provide. They are extracted automatically using both shallow parsing and deep parsing for greater quality and quantity. Initial results are highly promising, yielding more specific information about selectional preferences. An architecture for zero pronoun resolution using these representative arguments is described.

Selection of Entries for a Bilingual Dictionary from Aligned Translation Equivalents using Support Vector Machines
Takeshi Kutsumi | Takehiko Yoshimi | Katsunori Kotani | Ichiko Sata | Hitoshi Isahara

This paper claims that constructing a dictionary from bilingual pairs obtained from parallel corpora requires not only correct alignment of two noun phrases but also a judgment of each pair's appropriateness as an entry. It specifically addresses the latter task, which has received little attention. It demonstrates a method of selecting a suitable entry using Support Vector Machines, and proposes using as features the common and the different parts between a current translation and a new translation. Using experimental results, this paper examines how selection performance is affected by four ways of representing the common and the different parts: morphemes, parts of speech, semantic markers, and upper-level semantic markers. Moreover, we used n-grams of the common and the different parts for the above four kinds of features. The experimental results show that representation by morphemes achieved the best performance, with an F-measure of 0.803.

Subword Clusters as Light-Weight Interlingua for Multilingual Document Retrieval
Udo Hahn | Kornel Marko | Stefan Schulz

We introduce a light-weight interlingua for a cross-language document retrieval system in the medical domain. It is composed of equivalence classes of semantically primitive, language-specific subwords which are clustered by interlingual and intralingual synonymy. Each subword cluster represents a basic conceptual entity of the language-independent interlingua. Documents, as well as queries, are mapped to this interlingua level on which retrieval operations are performed. Evaluation experiments reveal that this interlingua-based retrieval model outperforms a direct translation approach.

Example-based Machine Translation Based on TSC and Statistical Generation
Zhanyi Liu | Haifeng Wang | Hua Wu

This paper proposes a novel Example-Based Machine Translation (EBMT) method based on Tree String Correspondence (TSC) and statistical generation. In this method, the translation examples are represented as TSCs, each of which consists of three parts: a parse tree in the source language, a string in the target language, and the correspondences between the leaf nodes of the source language tree and the substrings of the target language string. During translation, the input sentence is first parsed into a tree. Then the TSC forest that best matches the parse tree is searched for. The translation is generated by using a statistical generation model to combine the target language strings in the TSCs. The generation model consists of three parts: the semantic similarity between words, the word translation probability, and the target language model. Based on the above method, we build an English-to-Chinese Machine Translation (ECMT) system. Experimental results indicate that the performance of our system is comparable with that of state-of-the-art commercial ECMT systems.

Learning Translations from Monolingual Corpora
Hirokazu Suzuki | Akira Kumano

This paper proposes a method by which a machine translation (MT) system automatically selects and learns translation words that suit the user’s tastes or document fields, using a monolingual corpus manually compiled by the user, in order to achieve high-quality translation. We have constructed a system based on this method and carried out experiments to demonstrate the validity of the proposed method.

A Practical of Memory-based Approach for Improving Accuracy of MT
Sitthaa Phaholphinyo | Teerapong Modhiran | Nattapol Kritsuthikul | Thepchai Supnithi

The Rule-Based Machine Translation (RBMT) [1] approach is a major approach in MT research. It needs linguistic knowledge to create appropriate translation rules. However, we cannot add every linguistic rule to the system, because new rules may conflict with existing ones. We therefore propose a memory-based approach that improves translation quality without modifying the existing linguistic rules. This paper analyses the translation problems and shows how this approach works.

Machine Translation on the Medical Domain: The Role of BLEU/NIST and METEOR in a Controlled Vocabulary Setting
Andre Castilla | Alice Bacic | Sergio Furuie

The main objective of our project is to extract clinical information from thoracic radiology reports in Portuguese using Machine Translation (MT) and cross-language information retrieval techniques. To accomplish this task we need to evaluate the machine translation system involved. Since human MT evaluation is costly and time-consuming, we opted to use automated methods. We propose an evaluation methodology using the NIST/BLEU and METEOR algorithms and a controlled medical vocabulary, the Unified Medical Language System (UMLS). A set of documents is generated, and the documents are either machine translated or used as evaluation references. This methodology is used to evaluate the performance of our specialized Portuguese-English translation dictionary. A significant improvement in evaluation scores is demonstrated after the dictionary is incorporated into a commercial MT system. The use of UMLS and automated MT evaluation techniques can help the development of applications in the medical domain. Our methodology can also be used in general MT research for evaluation and testing purposes.

A Report on the Machine Translation Market in Japan
Setsuo Yamada | Syuuji Kodama | Taeko Matsuoka | Hiroshi Araki | Yoshiaki Murakami | Osamu Takano | Yoshiyuki Sakamoto

When conducting market research on machine translation, we continuously track sales volume in order to determine the scale of the machine translation market in Japan. We have officially announced these figures every year. Furthermore, since 2003, we have administered questionnaires regarding Web translation.

Document Authoring the Bible for Minority Language Translation
Stephen Beale | Sergei Nirenburg | Marjorie McShane | Tod Allman

This paper describes one approach to document authoring and natural language generation being pursued by the Summer Institute of Linguistics in cooperation with the University of Maryland, Baltimore County. We will describe the tools provided for document authoring, including a glimpse at the underlying controlled language and the semantic representation of the textual meaning. We will also introduce The Bible Translator’s Assistant© (TBTA), which is used to elicit and enter target language data as well as perform the actual text generation process. We conclude with a discussion of the usefulness of this paradigm from a Bible translation perspective and suggest several ways in which this work will benefit the field of computational linguistics.

Building an Annotated Japanese-Chinese Parallel Corpus – A Part of NICT Multilingual Corpora
Yujie Zhang | Kiyotaka Uchimoto | Qing Ma | Hitoshi Isahara

We are constructing a Japanese-Chinese parallel corpus, which is a part of the NICT Multilingual Corpora. The corpus is general-domain and large-scale, comprising about 40,000 pairs of long sentences, and it is of high quality and annotated with detailed information. To the best of our knowledge, this will be the first annotated Japanese-Chinese parallel corpus in the world. We created the corpus by selecting Japanese sentences from Mainichi Newspaper and then manually translating them into Chinese. We then annotated the corpus with morphological and syntactic structures and alignments at word and phrase levels. This paper describes the specifications for human translation and for detailed information annotation, and the tools we developed in the project. The experience we gained and the points to which we paid special attention are also presented, to share them with other researchers in corpus construction.

Europarl: A Parallel Corpus for Statistical Machine Translation
Philipp Koehn

We collected a corpus of parallel text in 11 languages from the proceedings of the European Parliament, which are published on the web. This corpus has found widespread use in the NLP community. Here, we focus on its acquisition and its application as training data for statistical machine translation (SMT). We trained SMT systems for 110 language pairs, which reveal interesting clues into the challenges ahead.
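
The figure of 110 language pairs follows from counting each translation direction separately: 11 languages give 11 × 10 ordered pairs. A minimal sketch of that count follows; the two-letter language codes are illustrative assumptions and are not taken from the abstract.

```python
from itertools import permutations

# Illustrative codes for 11 languages (assumed, not listed in the abstract).
LANGS = ["da", "de", "el", "en", "es", "fi", "fr", "it", "nl", "pt", "sv"]

# Each ordered (source, target) pair is a separate translation direction.
directions = list(permutations(LANGS, 2))

print(len(directions))  # 110
```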

Construction of Thai WordNet Lexical Database from Machine Readable Dictionaries
Patanakul Sathapornrungkij | Charnyote Pluempitiwiriyawej

We describe a method of constructing Thai WordNet, a lexical database in which Thai words are organized by their meanings. Our methodology takes the WordNet and LEXiTRON machine-readable dictionaries into account. The semantic relations between English words in WordNet and the translation relations between English and Thai words in LEXiTRON are considered. Our methodology is carried out via the WordNet Builder system. This paper provides an overview of the WordNet Builder architecture and reports on some of our experience with the prototype implementation.

Augmentation of Modality Translation Rules in Korean-to-English Machine Translation by Rule Learning
Seong-Bae Park | Jeong-Woo Son | Yoon-Shik Tae

Semantically Relatable Sets: Building Blocks for Representing Semantics
Rajat Kumar Mohanty | Anupama Dutta | Pushpak Bhattacharyya

Maximum Entropy Models for Realization Ranking
Erik Velldal | Stephan Oepen

In this paper we describe and evaluate different statistical models for the task of realization ranking, i.e. the problem of discriminating between competing surface realizations generated for a given input semantics. Three models are trained and tested: an n-gram language model, a discriminative maximum entropy model using structural features, and a combination of these two. Our realization component forms part of a larger, hybrid MT system.

Evaluation of Machine Translation with Predictive Metrics beyond BLEU/NIST: CESTA Evaluation Campaign # 1
Sylvain Surcin | Olivier Hamon | Antony Hartley | Martin Rajman | Andrei Popescu-Belis | Widad Mustafa El Hadi | Ismaïl Timimi | Marianne Dabbadie | Khalid Choukri

In this paper, we report on the results of a full-size evaluation campaign of various MT systems. This campaign is novel compared to the classical DARPA/NIST MT evaluation campaigns in the sense that French is the target language, and that it includes an experiment of meta-evaluation of various metrics claiming to better predict different attributes of translation quality. We first describe the campaign, its context, its protocol and the data we used. Then we summarise the results obtained by the participating systems and discuss the meta-evaluation of the metrics used.

Inter-rater Agreement Measures, and the Refinement of Metrics in the PLATO MT Evaluation Paradigm
Keith J. Miller | Michelle Vanni

A Multi-aligner for Japanese-Chinese Parallel Corpora
Yujie Zhang | Qun Liu | Qing Ma | Hitoshi Isahara

Automatic word alignment is an important technology for extracting translation knowledge from parallel corpora. However, automatic techniques cannot resolve this problem completely because of variances in translations. We therefore need to investigate the performance potential of automatic word alignment and then decide how to suitably apply it. In this paper we first propose a lexical knowledge-based approach to word alignment on a Japanese-Chinese corpus. Then we evaluate the performance of the proposed approach on the corpus. At the same time we also apply a statistics-based approach, the well-known toolkit GIZA++, to the same test data. Through comparison of the performances of the two approaches, we propose a multi-aligner, exploiting the lexical knowledge-based aligner and the statistics-based aligner at the same time. Quantitative results confirmed the effectiveness of the multi-aligner.

Thot: a Toolkit To Train Phrase-based Statistical Translation Models
Daniel Ortiz-Martínez | Ismael García-Varea | Francisco Casacuberta

In this paper, we present the Thot toolkit, a set of tools to train phrase-based models for statistical machine translation, which is publicly available as open source software. The toolkit obtains phrase-based models from word-based alignment models; to our knowledge, this functionality has not been offered by any publicly available toolkit. The Thot toolkit also implements a new way of estimating phrase models, which yields more complete phrase models than the methods described in the literature, including a segmentation length submodel. The toolkit output can be given in different formats so that it can be used by other statistical machine translation tools such as Pharaoh, a beam search decoder for phrase-based alignment models, which was used to perform translation experiments with the generated models. Additionally, the Thot toolkit can be used to obtain the best alignment between a sentence pair at phrase level.

Machine Translation of Bi-lingual Hindi-English (Hinglish) Text
R. Mahesh K. Sinha | Anil Thakur

In the present communication-based society, no natural language seems to have been left untouched by the trends of code-mixing. For different communicative purposes, a language uses linguistic codes from other languages. This gives rise to a mixed language which is neither totally the host language nor the foreign language. The mixed language poses a new challenge to the problem of machine translation. It is necessary to identify the “foreign” elements in the source language and process them accordingly. The foreign elements may not appear in their original form and may get morphologically transformed as per the host language. Further, in a complex sentence, a clause/utterance may be in the host language while another clause/utterance may be in the foreign language. Code-mixing of Hindi and English, where Hindi is the host language, is a common phenomenon in day-to-day language usage in Indian metropolises. The scenario is so common that people have started considering this a different variety altogether and calling it by the name Hinglish. In this paper, we present a mechanism for machine translation of Hinglish to pure (standard) Hindi and pure English forms.

Dealing with Replicative Words in Hindi for Machine Translation to English
R. Mahesh K. Sinha | Anil Thakur

The South Asian languages are well-known for their replicative words. In these languages, words of almost all the grammatical categories can occur in their reduplicative form. Hindi is one such language which is quite rich in having various types of replicative words in its lexicon. The traditional grammars and some of the research works have discussed the topic to some extent, particularly from the point of view of their descriptions and classifications. However, a detailed study of the topic becomes significant in view of the complexity involved in handling of such replicative words in the area of natural language processing, particularly for machine translation. In this paper, we discuss different types of replicative words in Hindi and their syntactic and semantic characteristics to formulate rules and strategies to identify their multiple functions and mapping patterns in English for machine translation from Hindi to English.

SEM-I Rational MT: Enriching Deep Grammars with a Semantic Interface for Scalable Machine Translation
Dan Flickinger | Jan Tore Lønning | Helge Dyvik | Stephan Oepen | Francis Bond

In the LOGON machine translation system where semantic transfer using Minimal Recursion Semantics is being developed in conjunction with two existing broad-coverage grammars of Norwegian and English, we motivate the use of a grammar-specific semantic interface (SEM-I) to facilitate the construction and maintenance of a scalable translation engine. The SEM-I is a theoretically grounded component of each grammar, capturing several classes of lexical regularities while also serving the crucial engineering function of supplying a reliable and complete specification of the elementary predications the grammar can realize. We make extensive use of underspecification and type hierarchies to maximize generality and precision.

DEMOCRAT: Deciding between Multiple Outputs Created by Automatic Translation
Menno van Zaanen | Harold Somers

Customizing a Korean-English MT System for Patent Translation
Munpyo Hong | Young-Gil Kim | Chang-Hyun Kim | Seong-Il Yang | Young-Ae Seo | Cheol Ryu | Sang-Kyu Park

This paper addresses the customization process of a Korean-English MT system for patent translation. The major customization steps include terminology construction, linguistic study, and the modification of the existing analysis and generation modules. To our knowledge, this is the first noteworthy large-scale customization effort of an MT system for Korean and English. This research was performed under the auspices of the MIC (Ministry of Information and Communication) of the Korean government. A prototype patent MT system for the electronics domain was installed and is being tested in the Korean Intellectual Property Office.

Practicing Controlled Language through a Help System integrated into the Medical Speech Translation System (MedSLT)
Marianne Starlander | Pierrette Bouillon | Nikos Chatzichrisafis | Marianne Santaholma | Manny Rayner | Beth Ann Hockey | Hitoshi Isahara | Kyoko Kanzaki | Yukie Nakao

In this paper, we present evidence that providing users of a speech to speech translation system for emergency diagnosis (MedSLT) with a tool that helps them to learn the coverage greatly improves their success in using the system. In MedSLT, the system uses a grammar-based recogniser that provides more predictable results to the translation component. The help module aims at addressing the lack of robustness inherent in this type of approach. It takes as input the result of a robust statistical recogniser that performs better for out-of-coverage data and produces a list of in-coverage example sentences. These examples are selected from a defined list using a heuristic that prioritises sentences maximising the number of N-grams shared with those extracted from the recognition result.
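
The selection heuristic described in this abstract can be illustrated with a small sketch that ranks in-coverage sentences by the number of n-grams they share with the recognition result. This is not the MedSLT implementation: the function names, the use of unigrams and bigrams, whitespace tokenisation, and the example sentences are all assumptions made for illustration.

```python
def ngrams(tokens, max_n=2):
    """Return the set of all n-grams (as tuples) of length 1..max_n."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def rank_help_examples(recognised, in_coverage, max_n=2, top_k=3):
    """Rank in-coverage sentences by n-grams shared with the recognition result."""
    target = ngrams(recognised.lower().split(), max_n)
    scored = [
        (len(target & ngrams(sentence.lower().split(), max_n)), sentence)
        for sentence in in_coverage
    ]
    # Sort by descending overlap; the sort is stable, so ties keep list order.
    scored.sort(key=lambda pair: -pair[0])
    # Keep at most top_k sentences and drop those with no overlap at all.
    return [sentence for score, sentence in scored[:top_k] if score > 0]


# Hypothetical in-coverage examples and a noisy recognition result.
examples = [
    "do you have a headache",
    "is the pain severe",
    "does the pain spread to the neck",
]
print(rank_help_examples("um is your pain very severe", examples, top_k=2))
# ['is the pain severe', 'does the pain spread to the neck']
```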

The FAME Speech-to-Speech Translation System for Catalan, English, and Spanish
Victoria Arranz | Elisabet Comelles | David Farwell

This paper describes the evaluation of the FAME interlingua-based speech-to-speech translation system for Catalan, English and Spanish. This system is an extension of the already existing NESPOLE! that translates between English, French, German and Italian. This article begins with a brief introduction followed by a description of the system architecture and the components of the translation module including the Speech Recognizer, the analysis chain, the generation chain and the Speech Synthesizer. Then we explain the interlingua formalism used, called Interchange Format (IF). We show the results obtained from the evaluation of the system and we describe the three types of evaluation done. We also compare the results of our system with those obtained by a stochastic translator which has been independently developed over the course of the FAME project. Finally, we conclude with future work.

Assessing Degradation of Spoken Language Translation by Measuring Speech Recognizer’s Output against Non-native Speakers’ Listening Capabilities
Toshiyuki Takezawa | Keiji Yasuda | Masahide Mizushima | Genichiro Kikui

Integration of SYSTRAN MT Systems in an Open Workflow
Mats Attnäs | Pierre Senellart | Jean Senellart

Probabilistic Model for Example-based Machine Translation
Eiji Aramaki | Sadao Kurohashi | Hideki Kashioka | Naoto Kato

Example-based machine translation (EBMT) systems have so far relied on heuristic measures for retrieving translation examples. Such heuristic measures take time to adjust and can obscure the underlying algorithm. This paper presents a probabilistic model for EBMT. Under the proposed model, the system searches for the combination of translation examples that has the highest probability. The proposed model clearly formalizes the EBMT process. In addition, the model can naturally incorporate the context similarity of translation examples. The experimental results demonstrate that the proposed model achieves slightly better translation quality than state-of-the-art EBMT systems.

Low Cost Portability for Statistical Machine Translation based on N-gram Coverage
Matthias Eck | Stephan Vogel | Alex Waibel

Statistical machine translation relies heavily on the available training data. However, in some cases, it is necessary to limit the amount of training data that can be created for or actually used by the systems. To solve that problem, we introduce a weighting scheme that tries to select more informative sentences first. This selection is based on the previously unseen n-grams the sentences contain, and it allows us to sort the sentences according to their estimated importance. After sorting, we can construct smaller training corpora, and we are able to demonstrate that systems trained on much less training data show a very competitive performance compared to baseline systems using all available training data.
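
The idea of ordering sentences by the previously unseen n-grams they contain can be sketched as a greedy loop. The scoring function below (number of not-yet-covered unigrams and bigrams divided by sentence length) is an assumed stand-in, not the weighting scheme from the paper, and the function names and toy corpus are hypothetical.

```python
def ngrams(tokens, max_n=2):
    """Return the set of all n-grams (as tuples) of length 1..max_n."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def sort_by_new_ngrams(sentences, max_n=2):
    """Greedily order sentences so each pick adds the most unseen n-grams per token."""
    seen = set()
    remaining = [(s, ngrams(s.split(), max_n)) for s in sentences]
    ordered = []
    while remaining:
        def score(item):
            sentence, grams = item
            # Assumed weighting: unseen n-grams normalised by sentence length.
            return len(grams - seen) / max(len(sentence.split()), 1)

        best = max(remaining, key=score)  # first item wins on ties
        remaining.remove(best)
        seen |= best[1]
        ordered.append(best[0])
    return ordered


corpus = ["the cat sat", "the cat sat", "a dog barked loudly", "the dog sat"]
print(sort_by_new_ngrams(corpus))
# ['a dog barked loudly', 'the cat sat', 'the dog sat', 'the cat sat']
```

A smaller training corpus is then simply a prefix of the sorted list. In this toy run the duplicate sentence moves to the end because it contributes no new n-grams.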

Automatic Rating of Machine Translatability
Kiyotaka Uchimoto | Naoko Hayashida | Toru Ishida | Hitoshi Isahara

We describe a method for automatically rating the machine translatability of a sentence for various machine translation (MT) systems. The method requires that the MT system can bidirectionally translate sentences in both source and target languages. However, it does not require reference translations, as is usual for automatic MT evaluation. By applying this method to every component of a sentence in a given source language, we can automatically identify the machine-translatable and non-machine-translatable parts of a sentence for a particular MT system. We show that the parts of a sentence that are automatically identified as non-machine-translatable provide useful information for paraphrasing or revising the sentence in the source language, thus improving the quality of the final translation.

Learning Phrase Translation using Level of Detail Approach
Hendra Setiawan | Haizhou Li | Min Zhang

We propose a simplified Level Of Detail (LOD) algorithm to learn phrase translation for statistical machine translation. In particular, LOD learns unknown phrase translations from parallel texts without linguistic knowledge. LOD uses an agglomerative method to attack the combinatorial explosion that results when generating candidate phrase translations. Although LOD was previously proposed by Setiawan et al. (2005), we improve the original algorithm in two ways: simplifying the algorithm and using a simpler translation model. Experimental results show that our algorithm provides comparable performance while demonstrating a significant reduction in computation time.

PESA: Phrase Pair Extraction as Sentence Splitting
Stephan Vogel

Most statistical machine translation systems use phrase-to-phrase translations to capture local context information, leading to better lexical choice and more reliable local reordering. The quality of the phrase alignment is crucial to the quality of the resulting translations. Here, we propose a new phrase alignment method that is not based on the Viterbi path of word alignment models. Phrase alignment is viewed as a sentence splitting task: for a given splitting of the source sentence (source phrase, left segment, right segment), the method finds a splitting of the target sentence that optimizes the overall sentence alignment probability. Experiments on different translation tasks show that this phrase alignment method leads to highly competitive translation results.

Statistical Machine Translation of European Parliamentary Speeches
David Vilar | Evgeny Matusov | Sasa Hasan | Richard Zens | Hermann Ney

In this paper we present the ongoing work at RWTH Aachen University for building a speech-to-speech translation system within the TC-Star project. The corpus we work on consists of parliamentary speeches held in the European Plenary Sessions. To our knowledge, this is the first project that focuses on speech-to-speech translation applied to a real-life task. We describe the statistical approach used in the development of our system and analyze its performance under different conditions: dealing with syntactically correct input, dealing with the exact transcription of speech and dealing with the (noisy) output of an automatic speech recognition system. Experimental results show that our system is able to perform adequately in each of these conditions.

Practical Approach to Syntax-based Statistical Machine Translation
Kenji Imamura | Hideo Okuma | Eiichiro Sumita

This paper presents a practical approach to statistical machine translation (SMT) based on syntactic transfer. Conventionally, phrase-based SMT generates an output sentence by combining phrase (multiword sequence) translation and phrase reordering without syntax. On the other hand, SMT based on tree-to-tree mapping, which involves syntactic information, is theoretical, so its features remain unclear from the viewpoint of a practical system. The SMT proposed in this paper translates phrases with hierarchical reordering based on the bilingual parse tree. In our experiments, the best translation was obtained when both phrases and syntactic information were used for the translation process.

Bilingual N-gram Statistical Machine Translation
José B. Mariño | Rafael E. Banchs | Josep M. Crego | Adrià de Gispert | Patrik Lambert | José A. R. Fonollosa | Marta Ruiz

This paper describes a statistical machine translation system that uses a translation model which is based on bilingual n-grams. When this translation model is log-linearly combined with four specific feature functions, state of the art translations are achieved for Spanish-to-English and English-to-Spanish translation tasks. Some specific results obtained for the EPPS (European Parliament Plenary Sessions) data are presented and discussed. Finally, future research issues are depicted.
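
The log-linear combination mentioned in the abstract can be illustrated with a minimal sketch. The feature names (`tm`, `lm`, `word_penalty`), the weights and the feature values below are hypothetical placeholders, not the paper's actual features or tuned weights; the sketch only shows how a weighted sum of feature functions ranks candidate translations.

```python
def loglinear_score(features: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of feature values (typically log-probabilities or counts)."""
    return sum(weights[name] * value for name, value in features.items())


def best_hypothesis(candidates: list[tuple[str, dict[str, float]]],
                    weights: dict[str, float]) -> str:
    """Return the candidate translation with the highest log-linear score."""
    best_text, _ = max(candidates, key=lambda c: loglinear_score(c[1], weights))
    return best_text


# Hypothetical weights and feature values, for illustration only.
weights = {"tm": 1.0, "lm": 0.5, "word_penalty": -0.1}
candidates = [
    ("the house", {"tm": -1.0, "lm": -2.0, "word_penalty": 2.0}),
    ("house the", {"tm": -1.0, "lm": -4.0, "word_penalty": 2.0}),
]
print(best_hypothesis(candidates, weights))
```

In a real system the weights would be tuned on held-out data rather than set by hand.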

Reordered Search, and Tuple Unfolding for Ngram-based SMT
Josep M. Crego | José B. Mariño | Adrià de Gispert

In Statistical Machine Translation, the use of reordering for certain language pairs can produce a significant improvement in translation accuracy. However, the search problem is shown to be NP-hard when arbitrary reorderings are allowed. This paper addresses the question of reordering for an Ngram-based SMT approach following two complementary strategies, namely reordered search and tuple unfolding. These strategies interact to improve translation quality in a Chinese to English task. On the one hand, we allow for an Ngram-based decoder (MARIE) to perform a reordered search over the source sentence, while combining a translation tuples Ngram model, a target language model, a word penalty and a word distance model. Interestingly, even though the translation units are learnt sequentially, its reordered search produces an improved translation. On the other hand, we allow for a modification of the translation units that unfolds the tuples, so that shorter units are learnt from a new parallel corpus, where the source sentences are reordered according to the target language. This tuple unfolding technique reduces data sparseness and, when combined with the reordered search, further boosts translation performance. Translation accuracy and efficiency results are reported for the IWSLT 2004 Chinese to English task.

Improving Online Machine Translation Systems
Bart Mellebeek | Anna Khasin | Karolina Owczarzak | Josef van Genabith | Andy Way

In (Mellebeek et al., 2005), we proposed the design, implementation and evaluation of a novel and modular approach to boost the translation performance of existing, wide-coverage, freely available machine translation systems, based on reliable and fast automatic decomposition of the translation input and corresponding composition of translation output. Despite showing some initial promise, our method did not improve on the baseline Logomedia and Systran MT systems. In this paper, we improve on the algorithm presented in (Mellebeek et al., 2005), and on the same test data, show increased scores for a range of automatic evaluation metrics. Our algorithm now outperforms Logomedia, obtains similar results to SDL and falls tantalisingly short of the performance achieved by Systran.

The Effect of Adding Rules into the Rule-based MT System
Zhu Jiang | Wang Haifeng

This paper investigates the relationship between the amount of the rules and the performance of the rule-based machine translation system. We keep adding more rules into the system and observe successive changes of the translation quality. Evaluations on translation quality reveal that the more the rules, the better the translation quality. A linear regression analysis shows that a positive linear relationship exists between the translation quality and the amount of the rules. We use this linear model to make prediction and test the prediction with newly developed rules. Experimental results indicate that the linear model effectively predicts the possible performance that the rule-based machine translation system may achieve with more rules added.
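
The kind of linear model described above can be sketched with an ordinary least-squares fit. The function names and the numbers in the usage example are illustrative assumptions, not data or code from the paper; the sketch only shows how a line fitted to (number of rules, quality score) pairs could be used for prediction.

```python
def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x: float, slope: float, intercept: float) -> float:
    """Predicted quality score for a rule count x under the fitted line."""
    return slope * x + intercept


# Made-up points lying exactly on y = 2x + 1, for illustration only.
rule_counts = [1.0, 2.0, 3.0, 4.0]
quality = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(rule_counts, quality)
print(predict(5.0, slope, intercept))
```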

Cognates and Word Alignment in Bitexts
Grzegorz Kondrak

We evaluate several orthographic word similarity measures in the context of bitext word alignment. We investigate the relationship between the length of the words and the length of their longest common subsequence. We present an alternative to the longest common subsequence ratio (LCSR), a widely-used orthographic word similarity measure. Experiments involving identification of cognates in bitexts suggest that the alternative method outperforms LCSR. Our results also indicate that alignment links can be used as a substitute for cognates for the purpose of evaluating word similarity measures.
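
For reference, the baseline measure named in the abstract, the longest common subsequence ratio (LCSR), is commonly defined as the length of the longest common subsequence divided by the length of the longer word. A minimal sketch of that standard definition follows; it does not implement the paper's proposed alternative, and the function names are our own.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """LCS length divided by the length of the longer word (0.0 for two empty strings)."""
    if not a and not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


print(lcsr("colour", "color"))
print(lcsr("night", "nacht"))
```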

Boosting Statistical Word Alignment
Hua Wu | Haifeng Wang

This paper proposes an approach to improve statistical word alignment with the boosting method. Applying boosting to word alignment must solve two problems. The first is how to build the reference set for the training data. We propose an approach to automatically build a pseudo reference set, which can avoid manual annotation of the training set. The second is how to calculate the error rate of each individual word aligner. We solve this by calculating the error rate of a manually annotated held-out data set instead of the entire training set. In addition, the final ensemble takes into account the weights of the alignment links produced by the individual word aligners. Experimental results indicate that the boosting method proposed in this paper performs much better than the original word aligner, achieving a large error rate reduction.

bib (full) Proceedings of Machine Translation Summit X: Posters

Tracing Translations in the Making
Elliott Macklovitch | Ngoc Tran Nguyen | Guy Lapalme

This paper presents TTPlayer, a trace file analysis tool used to develop TransType, an innovative computer-aided translation system. We first discuss the context of the project and the design of the tracing tool. We show how it was used for discovering interesting patterns of use as well as to guide further developments in the TT2 project.

Thai Word Segmentation a Lexical Semantic Approach
Krisda Khankasikam | Nuttanart Muansuwqan

In Thai, word boundaries are not explicitly marked, so word segmentation is needed to determine word boundaries in Thai sentences. Many Thai language processing applications require word segmentation. Several approaches to Thai word segmentation, such as maximal matching, longest matching and n-gram models, do not take semantics into consideration. This paper presents a Thai word segmentation system using a semantic corpus; the system is composed of four steps: generating all possible candidates, proper noun consideration, semantic tagging and semantic checking. The first three steps are conducted using a dictionary. Semantic checking is carried out using a corpus-based approach. Finally, we assign semantic scores to the segmented words and select the ones with the maximum semantic scores. In order to assign semantic scores, we use a Thai proper noun database and the semantic corpus derived from the ORCHID corpus. This approach is more reliable than other approaches that do not take meaning into consideration and achieves an accuracy of 96-99%, depending on the characteristics of the input and the dictionary used in the segmentation.
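
As a point of comparison, the dictionary-only longest matching baseline mentioned in the abstract can be sketched as a greedy left-to-right scan. This is not the paper's semantic method; the function name and the toy English lexicon are assumptions chosen so that the example is easy to check by eye.

```python
def longest_matching(text: str, lexicon: set[str]) -> list[str]:
    """Greedy segmentation: at each position take the longest lexicon entry that matches."""
    max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in lexicon:
                break
        else:
            # No lexicon entry starts here: emit the single character as an unknown token.
            candidate = text[i]
        tokens.append(candidate)
        i += len(candidate)
    return tokens


# Toy lexicon: the greedy choice "cats" blocks the reading "the cat sat".
print(longest_matching("thecatsat", {"the", "cat", "cats", "sat", "at"}))
```

The toy input shows the weakness the abstract points to: a purely length-based choice can pick a segmentation that a meaning-aware check would reject.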

Japanese Language Analysis for Syntactic Tree Mining to Extract Characteristic Contents
Yohsuke Sakao | Takahiro Ikeda | Kenji Satoh | Susumu Akamine

Existing syntactic ordered tree mining methods for extracting characteristic contents from text sets have two problems: 1) subtrees which are semantically the same but are different ordered trees fail to be considered equivalent, and 2) raw extracted subtrees can be difficult to understand. In order to avoid these problems, we have developed a method of transforming all ordered trees so that the ordered trees having the same meaning are considered equivalent. We have also developed a method of constructing Japanese texts from extracted subtrees, and evaluated the effectiveness of our methods as applied to syntactic tree mining.

Divergence Patterns in Machine Translation between Hindi and English
R. Mahesh K. Sinha | Anil Thakur

The issue of translation divergence is an important research topic in the area of machine translation. An exhaustive study of the divergence issues in MT is necessary for their proper classification and resolution. In the literature on MT, scholars have examined the issue and have proposed ways for their classification and resolution (Dorr 1993, 1994). However, the topic still needs further exploration to identify different sources of translation divergence in different pairs of translation languages. In this paper, we discuss translation patterns between Hindi and English of different types of constructions with a view to identifying the potential topics of the translation divergences. We take Dorr’s (1993, 1994) classification of translation divergence as the base to examine the different topics of translation divergence in Hindi and English. The primary goal of the paper is to point out different types of translation divergences in Hindi and English MT that have not been discussed in the existing literature.

Language and Encoding Scheme Identification of Extremely Large Sets of Multilingual Text
Pavol Zavarsky | Yoshiki Mikami | Shota Wada

In the paper we present an outline of our approach to identify languages and encoding schemes in extremely large sets of multilingual documents. The large sets we are analyzing in our Language Observatory project [1] consist of tens of millions of text documents. In the paper we present an approach which allows us to analyze about 250 documents every second (about 20 million documents/day) on a single Linux machine. Using multithreaded processing on a cluster of Linux servers, we are easily able to analyze more than 100 million documents/day.
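
The throughput figures above can be checked with a short calculation. The helper names are our own, and the machine-count estimate assumes perfectly linear scaling across servers, which the abstract does not claim.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def docs_per_day(docs_per_second: float) -> float:
    """Daily throughput of one machine running continuously."""
    return docs_per_second * SECONDS_PER_DAY


def machines_needed(target_per_day: float, docs_per_second: float) -> int:
    """Machines required for a daily target, assuming linear scaling (an assumption)."""
    return math.ceil(target_per_day / docs_per_day(docs_per_second))


print(docs_per_day(250))                  # 21,600,000, consistent with "about 20 million"
print(machines_needed(100_000_000, 250))
```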

Handling ki in Hindi for Hindi-English MT
R. Mahesh K. Sinha | Anil Thakur

ki is an indeclinable element (particle) in Hindi which is used in multiple roles that have multiple mapping patterns in English. In one of its uses, ki functions as a clause complementizer and is mapped usually by that in declarative clauses and by various wh-words (such as what, why, where, how, etc.) in interrogative clauses. The contexts of these mappings are dependent on syntactic-semantic types of the clause. In its non-complementizer use, ki is used to denote various other functions such as coordinate conjunction, purpose and reason clause conjunction, yes-no question particle, etc. It is a difficult task to identify the different uses of ki and determine its multiple mapping patterns in the context of Hindi-English machine translation. A detailed linguistic analysis is needed to disambiguate the different contexts of ki in Hindi. In this paper, we examine the multiple uses and patterns of ki in Hindi and propose strategies for their identification and disambiguation for Hindi-English MT.

Improving Translation Memory with Word Alignment Information
Hua Wu | Haifeng Wang | Zhanyi Liu | Kai Tang

This paper describes a generalized translation memory system, which takes advantage of sentence level matching, sub-sentential matching, and pattern-based machine translation technologies. All of the three techniques generate translation suggestions with the assistance of word alignment information. For the sentence level matching, the system generates the translation suggestion by modifying the translations of the most similar example with word alignment information. For sub-sentential matching, the system locates the translation fragments in several examples with word alignment information, and then generates the translation suggestion by combining these translation fragments. For pattern-based machine translation, the system first extracts translation patterns from examples using word alignment information and then generates translation suggestions with pattern matching. This system is compared with a traditional translation memory system without word alignment information in terms of translation efficiency and quality. Evaluation results indicate that our system improves the translation quality and saves about 20% translation time.

A Phrasal EBMT System for Translating English to Bengali
Sudip Kumar Naskar | Sivaji Bandyopadhyay

The present work describes a Phrasal Example Based Machine Translation system from English to Bengali that identifies the phrases in the input through a shallow analysis, retrieves the target phrases using a Phrasal Example base and finally combines the target language phrases employing some heuristics based on the phrase ordering rules for Bengali. The paper focuses on the structure of the noun, verb and prepositional phrases in English and how these phrases are realized in Bengali. This study has an effect on the design of the phrasal Example Base and recombination rules for the target language phrases.

An MT System Recycled
Ondřej Bojar | Petr Homola | Vladislav Kuboň

This paper describes an attempt to recycle parts of the Czech-to-Russian machine translation (MT) system in the new Czech-to-English MT system. The paper describes the overall architecture of the new system and the details of the modules which have been added. Special attention is paid to the problem of named entity recognition and to the method of automatic acquisition of lexico-syntactic information for the bilingual dictionary of the system.

Semi-Automated Elicitation Corpus Generation
Alison Alvarez | Lori Levin | Robert Frederking | Erik Peterson | Jeff Good

In this document we will describe a semi-automated process for creating elicitation corpora. An elicitation corpus is translated by a bilingual consultant in order to produce high quality word aligned sentence pairs. The corpus sentences are automatically generated from detailed feature structures using the GenKit generation program. Feature structures themselves are automatically generated from information that is provided by a linguist using our corpus specification software. This helps us to build small, flexible corpora for testing and development of machine translation systems.

Data Inferred Multi-word Expressions for Statistical Machine Translation
Patrick Lambert | Rafael Banchs

This paper presents a strategy for detecting and using multi-word expressions in Statistical Machine Translation. Performance of the proposed strategy is evaluated in terms of alignment quality as well as translation accuracy. Evaluations are performed by using the Verbmobil corpus. Results from translation tasks from English-to-Spanish and from Spanish-to-English are presented and discussed.

PARSIT-TE: Online Thai-English Machine Translation
Teerapong Modhiran | Krit Kosawat | Supon Klaithin | Monthika Boriboon | Thepchai Supnithi

This paper presents an online Thai-English MT system, called PARSIT-TE, which is an extension of the PARSIT English-Thai system. We aim to help foreigners and Thais exchange information more easily. The system takes a rule-based, interlingua approach. To improve the system, we concentrate on the pre-processing and rule analysis phases, which are considered necessary because of some specific problems of the Thai language.

Estimating the predictive Power of N-gram MT Evaluation Metrics across Language and Text Types
Bogdan Babych | Anthony Hartley | Debbie Elliott

The use of n-gram metrics to evaluate the output of MT systems is widespread. Typically, they are used in system development, where an increase in the score is taken to represent an improvement in the output of the system. However, purchasers of MT systems or services are more concerned to know how well a score predicts the acceptability of the output to a reader-user. Moreover, they usually want to know if these predictions will hold across a range of target languages and text types. We describe an experiment involving human and automated evaluations of four MT systems across two text types and 23 language directions. It establishes that the correlation between human and automated scores is high, but that the predictive power of these scores depends crucially on target language and text type.

A Useful-based Evaluation of Reading Support Systems: Comprehension, Reading Speed and Effective Speed
Katsunori Kotani | Takehiko Yoshimi | Takeshi Kutsumi | Ichiko Sata | Hiroshi Isahara

This paper reports the result of our experiment, the aim of which is to examine the efficiency of reading support systems such as a sentence-machine translation system, a word-machine translation system, and so on. Our evaluation method used in the experiment is able to handle the different reading support systems by assessing the usability of the systems, i.e., comprehension, reading speed, and effective speed. The result shows that the reading-speed procedure is able to evaluate the support systems as well as the comprehension-based procedure proposed by Ohguro (1993) and Fuji et al. (2001).

Word Alignment Viewer for Long Sentences
Hideki Kashioka

An aligned corpus is an important resource for developing machine translation systems. We consider suitable units for constructing the translation model through observing an aligned parallel corpus. We examine the characteristics of the aligned corpus. Long sentences are especially difficult for word alignment because the sentences can become very complicated. Also, each (source/target) word has a higher possibility to correspond to the (target/source) word. This paper introduces an alignment viewer a developer can use to correct alignment information. We discuss using the viewer on a patent parallel corpus because sentences in patents are often long and complicated.

Comparative Study on Japanese and Uyghur Grammars for an English-Uyghur Machine Translation System
Polat Kadir | Koichi Yamada | Hiroshi Kinukawa

Uyghur is one of the Turkic languages in the Altaic language family. We are developing a machine translation system to translate from English into Uyghur. Since there is no previous research devoted to machine translation between English and Uyghur, and little related work that we could use as a base for our research, we noted that clarifying the morphological and syntactic similarities and differences between Japanese and Uyghur lets us make use of the approaches and methods of English-Japanese machine translation to make faster progress. To this end, we have performed a comparative study of Japanese and Uyghur grammars. In this paper, we describe the similarities as well as the differences between Japanese and Uyghur at the levels of morphology and syntax, and we give a brief description of our English-Uyghur transfer method, to which we aim to apply our comparative study.

Rapid Ramp-up for Statistical Machine Translation: Minimal Training for Maximal Coverage
Hemali Majithia | Philip Rennart | Evelyne Tzoukermann

This paper investigates optimal ways to get maximal coverage from minimal input training corpus. In effect, it seems antagonistic to think of minimal input training with a statistical machine translation system. Since statistics work well with repetition and thus capture well highly occurring words, one challenge has been to figure out the optimal number of “new” words that the system needs to be appropriately trained. Additionally, the goal is to minimize the human translation time for training a new language. In order to account for rapid ramp-up translation, we ran several experiments to figure out the minimal amount of data to obtain optimal translation results.

Input Normalization for an English-to-Chinese SMS Translation System
Aw AiTi | Zhang Min | Yeo PohKhim | Fan ZhenZhen | Su Jian

This paper describes an approach to preprocess SMS text for Machine Translation. As SMS text behaves differently from normal written text and to reduce the tremendous effort required to customize or adapt the language model of the traditional translation system to handle SMS text style, normalization is performed to moderate the irregularities in English SMS text using a noisy channel model. A mapping model is used to model the three major problems in SMS text. They are (1) substitution of word using non-standard acronym, (2) insertion of flavour word, and (3) omission of auxiliary verb and subject pronoun. Experiment results show that with normalization before translation, the rejection rate of our English-to-Chinese SMS translation for broadcasting purpose is reduced by 15.5%. We believe that the performance of normalization can be further improved with deeper linguistic processing.

A Look inside the ITC-irst SMT System
M. Cettolo | M. Federico | N. Bertholdi | R. Cattoni | B. Chen

This paper presents a look inside the ITC-irst large-vocabulary SMT system developed for the NIST 2005 Chinese-to-English evaluation campaign. Experiments on official NIST test sets provide a thorough overview of the performance of the system, supplying information on how single components contribute to the global performance. The presented system exhibits performance comparable to that of the best systems participating in the NIST 2002-2004 MT evaluation campaigns: on the three test sets, achieved BLEU scores are 26.35%, 26.92% and 28.13%, respectively.

Computer-Assisted Multilingual E-communication in a Variety of Application Areas
Adriane Rinsche

The paper describes the architecture and functionality of LTC Communicator, a software product from the Language Technology Centre Ltd, which offers an innovative and cost-effective response to the growing need for multilingual web based communication in various user contexts. LTC Communicator was originally developed to support software vendors operating in international markets facing the need to offer web based multilingual support to diverse customers in a variety of countries, where end users may not speak the same language as the helpdesk. This is followed by a short description of several additional application areas of this software for which LTC has received EU funding: The AMBIENT project carries out a market validation for multilingual and multimodal eLearning for business and innovation management, the EUCAM project tests multilingual eLearning in the automotive industry, including a major car manufacturer and the German and European Metal Workers Associations, and the ALADDIN project provides a mobile multilingual environment for tour guides, interacting between tour operators and tourists, with the objective of optimising their travel experience. Finally, a case study of multilingual email exchange in conjunction with web based product sales is described.

Use of Machine Translation in India: Current Status
Sudip Naskar | Sivaji Bandyopadhyay

A survey of the machine translation systems that have been developed in India for translation from English to Indian languages and among Indian languages reveals that the MT software is used in field testing or is available as a web translation service. These systems are also used for teaching machine translation to students and researchers. Most of these systems are in the English-Hindi or Indian language-Indian language domain. The translation domains are mostly government documents/reports and news stories. There are a number of other MT systems that are at various phases of development and have been demonstrated at various forums. Many of these systems cover other Indian languages besides Hindi.

Usability Considerations for a Cellular-based Text Translator
Leslie Barrett | Robert Levin

This paper describes a cellular-telephone-based text-to-text translation system developed at Transclick, Inc. The application translates messages bi-directionally in English, French, German, Italian, Spanish and Portuguese. This paper describes design features uniquely suited to hand-held-device based translation systems. In particular, we discuss some of the usability conditions unique to this type of application and present strategies for overcoming usability obstacles encountered in the design phase of the product.

bib (full) Proceedings of Machine Translation Summit X: Tutorial notes

Statistical Machine Translation: Foundations and Recent Advances
Franz Josef Och

Interlinguas and Semantic Roles
Mike Dillinger

bib (full) Workshop on open-source machine translation

The Open A.I. Kit: General Machine Learning Modules from Statistical Machine Translation
Daniel J. Walker

The Open A.I. Kit implements the major components of Statistical Machine Translation as an accessible, extendable Software Development Kit with broad applicability beyond the field of Machine Translation. The high-level system design policies of the kit embrace the Open Source development model to provide a modular architecture and interface, which may serve as a basis for collaborative research and development for endeavors in Artificial Intelligence.

An Open Architecture for Transfer-based Machine Translation between Spanish and Basque
Iñaki Alegria | Arantza Diaz de Ilarraza | Gorka Labaka | Mikel Lersundi | Aingeru Mayor | Kepa Sarasola | Mikel L. Forcada | Sergio Ortiz-Rojas | Lluís Padró

We present the current status of development of an open architecture for the translation from Spanish into Basque. The machine translation architecture uses an open source analyser for Spanish and new modules mainly based on finite-state transducers. The project is integrated in the OpenTrad initiative, a larger government-funded project shared among different universities and small companies, which will also include MT engines for translation among the main languages in Spain. The main objective is the construction of an open, reusable and interoperable framework. This paper describes the design of the engine, the formats it uses for the communication among the modules, the modules reused from another project named Matxin and the new modules we are building.

Open Source Machine Translation with DELPH-IN
Francis Bond | Stephan Oepen | Melanie Siegel | Ann Copestake | Dan Flickinger

An Open-Source Shallow-Transfer Machine Translation Toolbox: Consequences of Its Release and Availability
Carme Armentano-Oller | Antonio M. Corbí-Bellot | Mikel L. Forcada | Mireia Ginestí-Rosell | Boyan Bonev | Sergio Ortiz-Rojas | Juan Antonio Pérez-Ortiz | Gema Ramírez-Sánchez | Felipe Sánchez-Martínez

By the time Machine Translation Summit X is held in September 2005, our group will have released an open-source machine translation toolbox as part of a large government-funded project involving four universities and three linguistic technology companies from Spain. The machine translation toolbox, which will most likely be released under a GPL-like license includes (a) the open-source engine itself, a modular shallow-transfer machine translation engine suitable for related languages and largely based upon that of systems we have already developed, such as interNOSTRUM for Spanish—Catalan and Traductor Universia for Spanish—Portuguese, (b) extensive documentation (including document type declarations) specifying the XML format of all linguistic (dictionaries, rules) and document format management files, (c) compilers converting these data into the high-speed (tens of thousands of words a second) format used by the engine, and (d) pilot linguistic data for Spanish—Catalan and Spanish—Galician and format management specifications for the HTML, RTF and plain text formats. After describing very briefly this toolbox, this paper aims at exploring possible consequences of the availability of this architecture, including the community-driven development of machine translation systems for languages lacking this kind of linguistic technology.

bib (full) Workshop on example-based machine translation

An n-gram Approach to Exploiting a Monolingual Corpus for Machine Translation
Toni Badia | Gemma Boleda | Maite Melero | Antoni Oliver

Context-sensitive Retrieval for Example-based Translation
Ralf Brown

Example-Based Machine Translation (EBMT) systems have typically operated on individual sentences without taking into account prior context. By adding a simple reweighting of retrieved fragments of training examples on the basis of whether the previous translation retrieved any fragments from examples within a small window of the current instance, translation performance is improved. A further improvement is seen by performing a similar reweighting when another fragment of the current input sentence was retrieved from the same training example. Together, a simple, straightforward implementation of these two factors results in an improvement on the order of 1.0–1.6% in the BLEU metric across multiple data sets in multiple languages.

Reversible Template-based Shake & Bake Generation
Michel Carl | Paul Schmidt | Jörg Schütz

Corpus-based MT systems that analyse and generalise texts beyond the surface forms of words require generation tools to re-generate the various internal representations into valid target language (TL) sentences. While the generation of word-forms from lemmas is probably the last step in every text generation process at its very bottom end, token-generation cannot be accomplished without structural and morpho-syntactic knowledge of the sentence to be generated. As in many other MT models, this knowledge is composed of a target language model and a bag of information transferred from the source language. In this paper we establish an abstracted, linguistically informed, target language model. We use a tagger, a lemmatiser and a parser to infer a template grammar from the TL corpus. Given a linguistically informed TL model, the aim is to see what need be provided from the transfer module for generation. During computation of the template grammar, we simultaneously build up for each TL sentence the content of the bag such that the sentence can be deterministically reproduced. In this way we control the completeness of the approach and will have an idea of what pieces of information we need to code in the TL bag.

Learning Translation Templates with Type Constraints
Ilyas Cicekli

This paper presents a generalization technique that induces translation templates from given translation examples by replacing differing parts in these examples with typed variables. Since the type of each variable is also inferred during the learning process, each induced template is associated with a set of type constraints. The type constraints that are associated with a translation template restrict the usage of that translation template in certain contexts in order to avoid some of wrong translations. The types of variables are induced using the type lattices designed for both source language and target language. The proposed generalization technique has been implemented as a part of an EBMT system.
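
The core idea of replacing differing parts with variables can be sketched in a heavily simplified form. The sketch below handles only a single differing span per side, omits the type inference and type lattices entirely, and uses hypothetical function names, variable names (`X1`, `Y1`) and example sentences; it is not the paper's algorithm.

```python
def generalize(a: list[str], b: list[str], var: str) -> list[str]:
    """Replace the single differing span between token lists a and b with var."""
    if a == b:
        return list(a)
    shortest = min(len(a), len(b))
    prefix = 0
    while prefix < shortest and a[prefix] == b[prefix]:
        prefix += 1
    suffix = 0
    while suffix < shortest - prefix and a[-1 - suffix] == b[-1 - suffix]:
        suffix += 1
    return a[:prefix] + [var] + a[len(a) - suffix:]


def induce_template(ex1: tuple[str, str], ex2: tuple[str, str]) -> tuple[str, str]:
    """Build a (source, target) template from two translation examples."""
    src = generalize(ex1[0].split(), ex2[0].split(), "X1")
    tgt = generalize(ex1[1].split(), ex2[1].split(), "Y1")
    return " ".join(src), " ".join(tgt)


# Hypothetical English-Spanish examples, for illustration only.
print(induce_template(("I see the dog", "veo el perro"),
                      ("I see the cat", "veo el gato")))
```

In the approach described by the abstract, each variable would additionally carry an inferred type constraint restricting what may fill it.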

The Influence of Example-data Homogeneity on EBMT Quality
Etienne Denoual

METIS-II: Example-based Machine Translation Using Monolingual Corpora - System Description
Peter Dirix | Ineke Schuurman | Vincent Vandeghinste

The METIS-II project is an example-based machine translation system, making use of minimal resources and tools for both source and target language, making use of a target-language (TL) corpus, but not of any parallel corpora. In the current paper, we discuss the view of our team on the general philosophy and outline of the METIS-II system.

Graph-based Retrieval for Example-based Machine Translation Using Edit-distance
Takao Doi | Hirofumi Yamamoto | Eiichiro Sumita

Assembling a Parallel Corpus from RSS News Feeds
John Fry

We describe our use of RSS news feeds to quickly assemble a parallel English-Japanese corpus. Our method is simpler than other web mining approaches, and it produces a parallel corpus whose quality, quantity, and rate of growth are stable and predictable.

Towards a Definition of Example-based Machine Translation
John Hutchins

The example-based approach to MT is becoming increasingly popular. However, such is the variety of techniques and methods used that it is difficult to discern the overall conception of what example-based machine translation (EBMT) is and/or what its practitioners conceive it to be. Although definitions of MT systems are notoriously complex, an attempt is made to define EBMT in contrast to other MT architectures (RBMT and SMT).

EBMT by Tree-Phrasing: a Pilot Study
Philippe Langlais | Fabrizio Gotti | Didier Bourigault | Claude Coulombe

We present a study we conducted to build a repository storing associations between simple dependency treelets in a source language and their corresponding phrases in a target language. To assess the impact of this resource in EBMT, we used the repository to compute coverage statistics on a test bitext and on an n-best list of translation candidates produced by a standard phrase-based decoder.

The ‘purest’ EBMT System Ever Built: No Variables, No Templates, No Training, Examples, Just Examples, Only Examples
Yves Lepage | Etienne Denoual

We designed, implemented and assessed an EBMT system that can be dubbed the “purest ever built”: it strictly does not make any use of variables, templates or training, does not have any explicit transfer component, and does not require any preprocessing of the aligned examples. It uses a specific operation, namely proportional analogy, that implicitly neutralises divergences between languages and captures lexical and syntactical variations along the paradigmatic and syntagmatic axes without explicitly decomposing sentences into fragments. In an experiment with a test set of 510 input sentences and an unprocessed corpus of almost 160,000 aligned sentences in Japanese and English, we obtained BLEU, NIST and mWER scores of 0.53, 8.53 and 0.39 respectively, well above a baseline simulating a translation memory.

Monolingual Corpus-based MT Using Chunks
Stella Markantonatou | Sokratis Sofianopoulos | Vassiliki Spilioti | Yiorgos Tambouratzis | Marina Vassiliou | Olga Yannoutsou | Nikos Ioannou

In the present article, a hybrid approach is proposed for implementing a machine translation system using a large monolingual corpus coupled with a bilingual lexicon and basic NLP tools. In the first phase of the METIS system, a source language (SL) sentence, after being tagged, lemmatised and translated by a flat lemma-to-lemma lexicon, was matched against a tagged and lemmatised target language (TL) corpus using a pattern matching algorithm. In the second phase, translations are generated by combining sub-sentential structures. In this paper, the main features of the second phase are discussed while the system architecture and the corresponding translation approach are presented. The proposed methodology is illustrated with examples of the translation process.

Dependency Treelet Translation: The Convergence of Statistical and Example-based Machine-translation?
Arul Menezes | Chris Quirk

We describe a novel approach to machine translation that combines the strengths of the two leading corpus-based approaches: Phrasal SMT and EBMT. We use a syntactically informed decoder and reordering model based on the source dependency tree, in combination with conventional SMT models to incorporate the power of phrasal SMT with the linguistic generality available in a parser. We show that this approach significantly outperforms a leading string-based Phrasal SMT decoder and an EBMT system. We present results from two radically different language pairs, and investigate the sensitivity of this approach to parse quality by using two distinct parsers and oracle experiments. We also validate our automated BLEU scores with a small human evaluation.

An Example-Based Approach to Translating Sign Language
Sara Morrissey | Andy Way

Users of sign languages are often forced to use a language in which they have reduced competence simply because documentation in their preferred format is not available. While some research exists on translating between natural and sign languages, we present here what we believe to be the first attempt to tackle this problem using an example-based (EBMT) approach. Having obtained a set of English–Dutch Sign Language examples, we employ an approach to EBMT using the ‘Marker Hypothesis’ (Green, 1979), analogous to the successful system of (Way & Gough, 2003), (Gough & Way, 2004a) and (Gough & Way, 2004b). In a set of experiments, we show that encouragingly good translation quality may be obtained using such an approach.

A Machine Learning Approach to Hypotheses Selection of Greedy Decoding for SMT
Michael Paul | Eiichiro Sumita | Seiichi Yamamoto

This paper proposes a method for integrating example-based and rule-based machine translation systems with statistical methods. It extends a greedy decoder for statistical machine translation (SMT), which searches for an optimal translation by using SMT models starting from a decoder seed, i.e., the source language input paired with an initial translation hypothesis. In order to reduce local optima problems inherent in the search, the outputs generated by multiple translation engines, such as rule-based (RBMT) and example-based (EBMT) systems, are utilized as the initial translation hypotheses. This method outperforms conventional greedy decoding approaches using initial translation hypotheses based on translation examples retrieved from a parallel text corpus. However, the decoding of multiple initial translation hypotheses is computationally expensive. This paper proposes a method to select a single initial translation hypothesis before decoding based on a machine learning approach that judges the appropriateness of multiple initial translation hypotheses and selects the most confident one for decoding. Our approach is evaluated for the translation of dialogues in the travel domain, and the results show that it drastically reduces computational costs without a loss in translation quality.

A Semantics-based English-Bengali EBMT System for Translating News Headlines
Diganta Saha | Sivaji Bandyopadhyay

The paper reports an Example-based Machine Translation System for translating news headlines from English to Bengali. The input headline is initially searched in the Direct Example Base. If it cannot be found, the input headline is tagged and the tagged headline is searched in the Generalized Tagged Example Base. If a match is obtained, the tagged headline in Bengali is retrieved from the example base, and the output Bengali headline is generated by retrieving the Bengali equivalents of the English words from appropriate dictionaries and then applying relevant synthesis rules for generating the Bengali surface-level words. If some named entities and acronyms are not present in the dictionary, a transliteration scheme is applied to obtain the Bengali equivalent. If a match is not found, the tagged input headline is analysed to identify the constituent phrase(s). The target translation is then generated using an English-Bengali phrasal example base, appropriate dictionaries and a set of heuristics for Bengali phrase reordering. If the headline still cannot be translated using the example-base strategy, a heuristic translation strategy is applied. Any new tagged input headline, along with its translation by the user, is inserted in the tagged example base after generalization.
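
The control flow described in this abstract is a cascade of strategies tried in order until one succeeds. A minimal sketch of that pattern follows. All names (`translate_with_fallback`, `direct_lookup`, `heuristic_fallback`, `DIRECT_EXAMPLES`) are hypothetical, the output strings are placeholders rather than Bengali, and the tagged-example and phrasal stages are omitted; it is not the system's actual implementation.

```python
from typing import Callable, Optional, Sequence

# A stage returns a translation, or None if it cannot handle the headline.
Stage = Callable[[str], Optional[str]]


def translate_with_fallback(headline: str, stages: Sequence[Stage]) -> Optional[str]:
    """Return the output of the first stage that produces a translation."""
    for stage in stages:
        result = stage(headline)
        if result is not None:
            return result
    return None


# Toy stand-in for a direct example base; the value is a placeholder string.
DIRECT_EXAMPLES = {"rain hits city": "<stored translation>"}


def direct_lookup(headline: str) -> Optional[str]:
    # Exact match against stored examples (case-insensitive in this toy version).
    return DIRECT_EXAMPLES.get(headline.lower())


def heuristic_fallback(headline: str) -> Optional[str]:
    # Last-resort stage: always answers, so the cascade never returns None here.
    return "<heuristic translation of: " + headline + ">"


STAGES = [direct_lookup, heuristic_fallback]
print(translate_with_fallback("Rain hits city", STAGES))
print(translate_with_fallback("Team wins cup", STAGES))
```

Keeping each strategy behind the same function signature means further stages, such as a tagged-example lookup or a phrase-level stage, can be added to the list without changing the driver.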

Example-based Translation Without Parallel Corpora: First Experiments on a Prototype
Vincent Vandeghinste | Peter Dirix | Ineke Schuurman

For the METIS-II project (IST, start: 10-2004 – end: 09-2007) we are working on an example-based machine translation system that uses minimal resources and tools for both source and target language, i.e. a target language corpus, but no parallel corpora. In the current paper, we present the results of the first experiments with our approach (CCL) within the METIS consortium: the translation of noun phrases from Dutch to English, using the British National Corpus as a target language corpus. Future research on the sentence is planned along lines similar to those presented here for the noun phrase.

bib (full) Workshop on patent translation

A Human-Aided Machine Translation System for Japanese-English Patent Translation
Christoph Neumann

The approach presented here enables Japanese users with no knowledge of English or legal English to generate patent claims in English from a Japanese-only interface. It exploits the highly determined structure of patent claims and merges Natural Language Generation (NLG) and Machine Translation (MT) techniques and resources as realized in the AutoPat and PC-Transfer applications. Due to its tuned MT engine, the approach can be seen as a human-aided machine translation (HAMT) system circumventing major obstacles in full-scale Japanese-English MT. The approach is fully implemented on a large scale and will be commercially released in autumn 2005.

Embedding MT for Generating Patent Claims in English from a Multilingual Interface
Svetlana Sheremetyeva

In this paper, we present a methodology for the development of interactive domain-tuned patent tools for generating patent claims in English from non-English interfaces. The methodology is based on a merger of an interactive English-to-English patent claim generator, AutoPat, and any external MT engine that might be appropriate for a certain language. The translation procedure is reduced to translating words and phrases rather than a complex claim sentence. The approach has been successfully used in the J-E patent system, a patent claim generator in English from a Japanese-only interface, and in Dan-Pat, a similar tool for the Danish-English pair of languages. The two systems use different MT engines but feature similar overall architecture. The methodology is portable to other languages and MT engines.

Classification of Modified Relationships in Japanese Patent Sentences
Shoichi Yokoyama | Yuya Kaneda

It is well known that sentences in Japanese patents have long and complicated structures, especially necessary conditions and details. Here, patent sentences are analyzed and classified by pattern of modified relationships. Morphemes were first extracted using the famous morpheme analysis tool Chasen, and then the modified relations were extracted using the software Cabocha. Many modification mistakes were caused by long complicated structures, which required correction by humans. In the process of correction, the modification structure patterns were classified using about 200 sentences. This clarified the characteristics of Japanese patent sentences, and it is useful in machine translation of patent sentences.

Constant-Sense Connection Paths
Einat H. Nir | Geoffrey L. Melnick

A multilingual sense code may chart "constant-sense connection paths" across languages. A writer, not versed in any target language, may nonetheless proofread the sense for translation and edit it, to ensure that his meaning is conveyed as he wishes it, to other languages. A translation-ready format may be thus produced, to serve as a printing-press plate, for precise and automatic translation to any language, or to a plurality of languages. The translation-ready format may describe each word and the full document with a comprehensive code, which specifies the multilingual sense code and other relevant information about the word, in a standardized fashion, digitally, forming a unified, language-independent tagging system and a unified, language-independent lexicon.

Quality Analysis of Patent Parallel Corpus by the Scale
Isamu Okada | Shinichiro Miyazawa | Kazunari Ishida | Nobuhiko Shimizu | Toshizumi Ohta

Large-scale parallel corpora are extremely important for translation memory, example-based machine translation, and support systems for writing English sentences. Organized collection or construction of large-scale corpora is currently ongoing; however, it is a difficult undertaking in terms of copyright as well as economic efficiency. Investigating the general tendencies of large-scale corpora helps to improve the economic efficiency of parallel corpus collection as well as system construction. In this study, therefore, the relationship between the scale of a parallel corpus and the degree of correspondence is clarified, using a parallel corpus of patents.

“Less, Easier and Quicker” in Language Acquisition for Patent MT
Svetlana Sheremetyeva

The paper describes some ways to save on knowledge acquisition when developing MT systems for patents by reducing the size of resources to be acquired, and creating intelligent software for knowledge handling and access speed. The approach is illustrated by knowledge acquisition and maintenance in the APTrans system for translating patent claims. Domain tuned resources are based on contrastive studies of multilingual patent documents and are handled by an electronic dictionary with a powerful user-friendly environment for acquisition, editing, browsing, defaulting and coherence proofing.

Domain Dependence of Lexical Translation: A Case Study of Patent Abstracts
Hiroyuki Kaji

The domain dependence of translations of nouns in English-to-Japanese patent translation is examined using an automatic method for identifying major translations from a pair of language corpora in the same domain. The method calculates the ratio of the number of associated words of a target word that suggest each translation of the target word to the total number of associated words. This ratio indicates how major a translation is in a domain. Application of the method to a bilingual patent-abstract corpus indicates the necessity and effectiveness of dividing the patent domain into subdomains and adapting a bilingual dictionary to subdomains.
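
The ratio described in this abstract can be written out directly. The sketch below assumes the associated words of a target word, and the translation(s) each one suggests, have already been obtained by some earlier step that is not shown. The function name `translation_ratios` and the labels `w1`, `t1`, etc. are placeholders invented for the example, not data from the paper.

```python
from collections import Counter
from typing import Dict, Iterable


def translation_ratios(suggestions: Dict[str, Iterable[str]]) -> Dict[str, float]:
    """Map each translation to the share of associated words that suggest it.

    suggestions: associated word -> translations of the target word it suggests
    (possibly none). Hypothetical input format chosen for this sketch.
    """
    total = len(suggestions)
    if total == 0:
        return {}
    counts = Counter()
    for translations in suggestions.values():
        # Count each associated word at most once per translation.
        counts.update(set(translations))
    return {translation: n / total for translation, n in counts.items()}


# Four associated words: two suggest t1, one suggests t2, one suggests nothing.
ratios = translation_ratios({"w1": ["t1"], "w2": ["t1"], "w3": ["t2"], "w4": []})
print(ratios)
```

Under this reading, a translation with a high ratio in one subdomain corpus and a low ratio in another would count as domain dependent.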

Finding Translation Candidates from Patent Corpus
Sayori Shimohata

This paper describes a method for retrieving technical terms and finding their translation candidates from patent corpora. The method improves the reliability of bilingual seed words that measure similarity between a target word and its translation candidates. We conducted an experiment with PAJ (Patent Abstracts of Japan), which is a collection of bilingual patent abstracts written in Japanese and English. The experiment result shows that our method achieves a precision of 53.5% and a recall of 75.4%.

Terminology Construction Workflow for Korean-English Patent MT
Young-Gil Kim | Seong-Il Yang | Munpyo Hong | Chang-Hyun Kim | Young-Ae Seo | Cheol Ryu | Sang-Kyu Park | Se-Young Park

This paper addresses the workflow for terminology construction for a Korean-English patent MT system. The workflow consists of a stage for setting lexical goals and a semi-automatic terminology construction stage. As there is no comparable system, it is difficult to determine how many terms are needed. To estimate the number of terms needed, we analyzed 45,000 patent documents. Given the limited time and budget, we resorted to semi-automatic methods to create the bilingual term dictionary in the electronics domain. We will show that parenthesis information in Korean patent documents and a bilingual title corpus can be successfully used to build a bilingual term dictionary.

bib (full) Workshop on Semantic Web technologies for machine translation

Ontologies for Crosslingual Applications
Hans Uszkoreit

Human translation is based on linguistic and extralinguistic knowledge. Despite promising pioneering advances, knowledge-based machine translation has remained a tempting vision. The bottleneck has been the engineering of sufficiently comprehensive bodies of relevant knowledge. The Semantic Web offers opportunities for the gradual evolution of a global heterogeneous knowledge base. The immediate target has been the modelling of certain knowledge domains by practical ontologies. In the talk we will demonstrate the utilization of ontological knowledge in different crosslingual applications reaching from crosslingual document retrieval via crosslingual question answering to complex information services involving several crosslingual functionalities, including machine translation. We will then discuss the ramifications of this development and of the evolution of the World Wide Web for future directions in both statistical and rule-based machine translation.

Cross-lingual Retrieval in Semantic Web
Cristina Vertan

Natural language is considered the friendliest means of man-machine communication. However, the implementation of natural language interfaces often faces the problem of a lack of linguistic and world knowledge, especially when the application domain is not very specific. This is exactly the case for Web-based applications, which aim to support the retrieval of information in everyday areas of work. Recent Semantic Web activities have led to the development of large ontologies for a broad spectrum of domains, as well as of mechanisms for annotating resources with semantic information. In this paper we present a new architecture aiming to bring together the advantages of natural language querying and the power of the Semantic Web. We will also show how the described application can be easily adapted to other domains.

Challenges for the Multilingual Semantic Web
Walther v. Hahn | Cristina Vertan

In this paper we give an overview of Semantic Web technologies and their impact on the multilingual Web. We present a possible solution for improving the quality of on-line translation systems, using mechanisms and standards from the Semantic Web. We focus on example-based machine translation and the automation of translation example extraction by means of RDF repositories.

Lexical Sets and Text-Processing
Christian Champendal | Thierry Pitarque

The extraction of lexical sets from a corpus in Digital Signal Processing (DSP) has been detailed before on general sets, with direct ELT applications. In this contribution, a more specialized set is investigated to illustrate the possibility of actually using the results in more "intelligent" Text-Processing.

A First Step in Integrating an EBMT into the Semantic Web
Natalia Elita | Antonina Birladeanu

In this paper we present the steps we took to prepare an EBMT system for integration into the Semantic Web. We also briefly describe the EBMT tool developed for translators.