Conference of the European Association for Machine Translation (2018)


Proceedings of the 21st Annual Conference of the European Association for Machine Translation

Juan Antonio Pérez-Ortiz | Felipe Sánchez-Martínez | Miquel Esplà-Gomis | Maja Popović | Celia Rico | André Martins | Joachim Van den Bogaert | Mikel L. Forcada

Contextual Handling in Neural Machine Translation: Look behind, ahead and on both sides
Ruchit Agrawal | Marco Turchi | Matteo Negri

A salient feature of Neural Machine Translation (NMT) is the end-to-end nature of training employed, eschewing the need of separate components to model different linguistic phenomena. Rather, an NMT model learns to translate individual sentences from the labeled data itself. However, traditional NMT methods trained on large parallel corpora with a one-to-one sentence mapping make an implicit assumption of sentence independence. This makes it challenging for current NMT systems to model inter-sentential discourse phenomena. While recent research in this direction mainly leverages a single previous source sentence to model discourse, this paper proposes the incorporation of a context window spanning previous as well as next sentences as source-side context and previously generated output as target-side context, using an effective non-recurrent architecture based on self-attention. Experiments show improvement over non-contextual models as well as contextual methods using only previous context.

Towards a post-editing recommendation system for Spanish-Basque machine translation
Nora Aranberri | Jose A. Pascual

The overall machine translation quality available for professional translators working with the Spanish–Basque pair is rather poor, which is a deterrent for its adoption. This work investigates the plausibility of building a comprehensive recommendation system to speed up decision time between post-editing or translation from scratch using the very limited training data available. First, we build a set of regression models that predict the post-editing effort in terms of overall quality, time and edits. Secondly, we build classification models that recommend the most efficient editing approach using post-editing effort features on top of linguistic features. Results show high correlations between the predictions of the regression models and the expected HTER, time and edit number values. Similarly, the results for the classifiers show that they are able to predict with high accuracy whether it is more efficient to translate or to post-edit a new segment.

Compositional Source Word Representations for Neural Machine Translation
Duygu Ataman | Mattia Antonino Di Gangi | Marcello Federico

The requirement for neural machine translation (NMT) models to use fixed-size input and output vocabularies plays an important role for their accuracy and generalization capability. The conventional approach to cope with this limitation is performing translation based on a vocabulary of sub-word units that are predicted using statistical word segmentation methods. However, these methods have recently been shown to be prone to morphological errors, which lead to inaccurate translations. In this paper, we extend the source-language embedding layer of the NMT model with a bi-directional recurrent neural network that generates compositional representations of the source words from embeddings of character n-grams. Our model consistently outperforms conventional NMT with sub-word units on four translation directions with varying degrees of morphological complexity and data sparseness on the source side.

Development and evaluation of phonological models for cognate identification
Bogdan Babych

The paper presents a methodology for the development and task-based evaluation of phonological models, which improve the accuracy of cognate terminology identification, but may potentially be used for other applications, such as transliteration or improving character-based NMT. Terminology translation remains a bottleneck for MT, especially for under-resourced languages and domains, and automated identification of cognate terms addresses this problem. The proposed phonological models explicitly represent distinctive phonological features for each character, such as acoustic types (e.g., vowel/consonant, voiced/unvoiced/sonant), place and manner of articulation (closed/open, front/back vowel; plosive, fricative, or labial, dental, glottal consonant). The advantage of such representations is that they explicate information about characters’ internal structure rather than treat them as elementary atomic units of comparison, placing graphemes into a feature space that provides additional information about their articulatory (pronunciation-based) or acoustic (sound-based) distances and similarity. The article presents experimental results of using the proposed phonological models for extracting cognate terminology with the phonologically aware Levenshtein edit distance, which for Top-1 cognate ranking metric outperforms the baseline character-based Levenshtein by 16.5%. Project resources are released on: https://github.com/bogdanbabych/cognates-phonology
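
The idea of a phonologically aware edit distance can be illustrated with a minimal sketch. This is not the authors' implementation: the feature table, the Jaccard-based substitution cost and the function names below are all assumptions made for illustration. The sketch only shows how replacing the fixed substitution cost of plain Levenshtein with a feature-based cost makes similar sounds cheaper to substitute.

```python
# Hypothetical feature table (an assumption for illustration): each character
# maps to a set of phonological features. A real model would cover a full
# alphabet with richer features.
FEATURES = {
    "b": {"consonant", "voiced", "plosive", "labial"},
    "p": {"consonant", "unvoiced", "plosive", "labial"},
    "d": {"consonant", "voiced", "plosive", "dental"},
    "t": {"consonant", "unvoiced", "plosive", "dental"},
    "a": {"vowel", "open", "back"},
    "o": {"vowel", "closed", "back"},
}


def substitution_cost(a: str, b: str) -> float:
    """Cost in [0, 1]: 0 for identical characters, lower for similar sounds."""
    if a == b:
        return 0.0
    fa, fb = FEATURES.get(a), FEATURES.get(b)
    if fa is None or fb is None:
        return 1.0  # unknown character: fall back to the plain Levenshtein cost
    # Jaccard distance between the two feature sets (an assumed cost function).
    return 1.0 - len(fa & fb) / len(fa | fb)


def feature_levenshtein(s: str, t: str) -> float:
    """Levenshtein distance with feature-based substitution costs."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, cs in enumerate(s, start=1):
        curr = [float(i)]
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1.0,                            # deletion
                curr[j - 1] + 1.0,                        # insertion
                prev[j - 1] + substitution_cost(cs, ct),  # substitution
            ))
        prev = curr
    return prev[-1]
```

With this toy table, `feature_levenshtein("bat", "pat")` is about 0.4, because "b" and "p" differ only in voicing, whereas replacing "b" with the vowel "o" costs the full 1.0.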

Rule-based machine translation from Kazakh to Turkish
Sevilay Bayatli | Sefer Kurnaz | Ilnar Salimzyanov | Jonathan Washington | Francis M. Tyers

This paper presents a shallow-transfer machine translation (MT) system for translating from Kazakh to Turkish. Background on the differences between the languages is presented, followed by how the system was designed to handle some of these differences. The system is based on the Apertium free/open-source machine translation platform. The structure of the system and how it works is described, along with an evaluation against two competing systems. Linguistic components were developed, including a Kazakh-Turkish bilingual dictionary, Constraint Grammar disambiguation rules, lexical selection rules, and structural transfer rules. With many known issues yet to be addressed, our RBMT system has reached performance comparable to publicly-available corpus-based MT systems between the languages.

SRL for low resource languages isn’t needed for semantic SMT
Meriem Beloucif | Dekai Wu

Previous attempts at injecting semantic frame biases into SMT training for low resource languages failed because either (a) no semantic parser is available for the low resource input language; or (b) the output English language semantic parses excise relevant parts of the alignment space too aggressively. We present the first semantic SMT model to succeed in significantly improving translation quality across many low resource input languages for which no automatic SRL is available, consistently and across all common MT metrics. The results we report are the best by far to date for this type of approach; our analyses suggest that in general, easier approaches toward including semantics in training SMT models may be more feasible than generally assumed even for low resource languages where semantic parsers remain scarce. While recent proposals to use the crosslingual evaluation metric XMEANT during inversion transduction grammar (ITG) induction are inapplicable to low resource languages that lack semantic parsers, we break the bottleneck via a vastly improved method of biasing ITG induction toward learning more semantically correct alignments using the monolingual semantic evaluation metric MEANT. Unlike XMEANT, MEANT requires only a readily-available English (output language) semantic parser. The advances we report here exploit the novel realization that MEANT represents an excellent way to semantically bias expectation-maximization induction even for low resource languages. We test our systems on challenging languages including Amharic, Uyghur, Tigrinya and Oromo. Results show that our model influences the learning towards more semantically correct alignments, leading to better translation quality than both the standard ITG and GIZA++ based SMT training models on different datasets.

M3TRA: integrating TM and MT for professional translators
Bram Bulté | Tom Vanallemeersch | Vincent Vandeghinste

Translation memories (TM) and machine translation (MT) both are potentially useful resources for professional translators, but they are often still used independently in translation workflows. As translators tend to have a higher confidence in fuzzy matches than in MT, we investigate how to combine the benefits of TM retrieval with those of MT, by integrating the results of both. We develop a flexible TM-MT integration approach based on various techniques combining the use of TM and MT, such as fuzzy repair, span pretranslation and exploiting multiple matches. Results for ten language pairs using the DGT-TM dataset indicate almost consistently better BLEU, METEOR and TER scores compared to the MT, TM and NMT baselines.

Reading Comprehension of Machine Translation Output: What Makes for a Better Read?
Sheila Castilho | Ana Guerberof Arenas

This paper reports on a pilot experiment that compares two different machine translation (MT) paradigms in reading comprehension tests. To explore a suitable methodology, we set up a pilot experiment with a group of six users (with English, Spanish and Simplified Chinese languages) using an English Language Testing System (IELTS), and an eye-tracker. The users were asked to read three texts in their native language: either the original English text (for the English speakers) or the machine-translated text (for the Spanish and Simplified Chinese speakers). The original texts were machine-translated via two MT systems: neural (NMT) and statistical (SMT). The users were also asked to rank satisfaction statements on a 3-point scale after reading each text and answering the respective comprehension questions. After all tasks were completed, a post-task retrospective interview took place to gather qualitative data. The findings suggest that the users from the target languages completed more tasks in less time with a higher level of satisfaction when using translations from the NMT system.

Are Automatic Metrics Robust and Reliable in Specific Machine Translation Tasks?
Mara Chinea-Rios | Alvaro Peris | Francisco Casacuberta

We present a comparison of automatic metrics against human evaluations of translation quality in several scenarios which were unexplored up to now. Our experimentation was conducted on translation hypotheses that were problematic for the automatic metrics, as the results greatly diverged from one metric to another. We also compared three different translation technologies. Our evaluation shows that in most cases, the metrics capture the human criteria. However, we face failures of the automatic metrics when applied to some domains and systems. Interestingly, we find that automatic metrics applied to the neural machine translation hypotheses provide the most reliable results. Finally, we provide some advice when dealing with these problematic domains.

Creating the best development corpus for Statistical Machine Translation systems
Mara Chinea-Rios | Germán Sanchis-Trilles | Francisco Casacuberta

We propose and study three different novel approaches for tackling the problem of development set selection in Statistical Machine Translation. We focus on a scenario where a machine translation system is leveraged for translating a specific test set, without further data from the domain at hand. Such test set stems from a real application of machine translation, where the texts of a specific e-commerce were to be translated. For developing our development-set selection techniques, we first conducted experiments in a controlled scenario, where labelled data from different domains was available, and evaluated the techniques both with classification and translation quality metrics. Then, the best-performing techniques were evaluated on the e-commerce data at hand, yielding consistent improvements across two language directions.

Training Deployable General Domain MT for a Low Resource Language Pair: English-Bangla
Sandipan Dandapat | William Lewis

A large percentage of the world’s population speaks a language of the Indian subcontinent, what we will call here Indic languages, comprising languages from both Indo-European (e.g., Hindi, Bangla, Gujarati, etc.) and Dravidian (e.g., Tamil, Telugu, Malayalam, etc.) families, upwards of 1.5 Billion people. A universal characteristic of Indic languages is their complex morphology, which, when combined with the general lack of sufficient quantities of high quality parallel data, can make developing machine translation (MT) for these languages difficult. In this paper, we describe our efforts towards developing general domain English–Bangla MT systems which are deployable to the Web. We initially developed and deployed SMT-based systems, but over time migrated to NMT-based systems. Our initial SMT-based systems had reasonably good BLEU scores, however, using NMT systems, we have gained significant improvement over SMT baselines. This is achieved using a number of ideas to boost the data store and counter data sparsity: crowd translation of intelligently selected monolingual data (throughput enhanced by an IME (Input Method Editor) designed specifically for QWERTY keyboard entry for Devanagari scripted languages), back-translation, different regularization techniques, dataset augmentation and early stopping.

Deep Neural Machine Translation with Weakly-Recurrent Units
Mattia A. Di Gangi | Marcello Federico

Recurrent neural networks (RNNs) have represented for years the state of the art in neural machine translation. Recently, new architectures have been proposed, which can leverage parallel computation on GPUs better than classical RNNs. Faster training and inference combined with different sequence-to-sequence modeling also lead to performance improvements. While the new models completely depart from the original recurrent architecture, we decided to investigate how to make RNNs more efficient. In this work, we propose a new recurrent NMT architecture, called Simple Recurrent NMT, built on a class of fast and weakly-recurrent units that use layer normalization and multiple attentions. Our experiments on the WMT14 English-to-German and WMT16 English-Romanian benchmarks show that our model represents a valid alternative to LSTMs, as it can achieve better results at a significantly lower computational cost.

Spelling Normalization of Historical Documents by Using a Machine Translation Approach
Miguel Domingo | Francisco Casacuberta

The lack of a spelling convention in historical documents causes their orthography to change depending on the author and the time period in which each document was written. This represents a problem for the preservation of cultural heritage, which strives to create a digital text version of a historical document. With the aim of solving this problem, we propose three approaches (based on statistical, neural and character-based machine translation) to adapt the document’s spelling to modern standards. We tested these approaches in different scenarios, obtaining very encouraging results.

Neural Machine Translation of Basque
Thierry Etchegoyhen | Eva Martínez Garcia | Andoni Azpeitia | Gorka Labaka | Iñaki Alegria | Itziar Cortes Etxabe | Amaia Jauregi Carrera | Igor Ellakuria Santos | Maite Martin | Eusebi Calonge

We describe the first experimental results in neural machine translation for Basque. As a synthetic language featuring agglutinative morphology, an extended case system, complex verbal morphology and relatively free word order, Basque presents a large number of challenging characteristics for machine translation in general, and for data-driven approaches such as attention-based encoder-decoder models in particular. We present our results on a large range of experiments in Basque-Spanish translation, comparing several neural machine translation system variants with both rule-based and statistical machine translation systems. We demonstrate that significant gains can be obtained with a neural network approach for this challenging language pair, and describe optimal configurations in terms of word segmentation and decoding parameters, measured against test sets that feature multiple references to account for word order variability.

Evaluation of Terminology Translation in Instance-Based Neural MT Adaptation
M. Amin Farajian | Nicola Bertoldi | Matteo Negri | Marco Turchi | Marcello Federico

We address the issues arising when a neural machine translation engine trained on generic data receives requests from a new domain that contains many specific technical terms. Given training data of the new domain, we consider two alternative methods to adapt the generic system: corpus-based and instance-based adaptation. While the first approach is computationally more intensive in generating a domain-customized network, the latter operates more efficiently at translation time and can handle on-the-fly adaptation to multiple domains. Besides evaluating the generic and the adapted networks with conventional translation quality metrics, in this paper we focus on their ability to properly handle domain-specific terms. We show that instance-based adaptation, by fine-tuning the model on-the-fly, is capable of significantly boosting the accuracy of translated terms, producing translations of quality comparable to the expensive corpus-based method.

Translation Quality Estimation for Indian Languages
Nisarg Jhaveri | Manish Gupta | Vasudeva Varma

Translation Quality Estimation (QE) aims to estimate the quality of an automated machine translation (MT) output without any human intervention or reference translation. With the increasing use of MT systems in various cross-lingual applications, the need and applicability of QE systems is increasing. We study existing approaches and propose multiple neural network approaches for sentence-level QE, with a focus on MT outputs in Indian languages. For this, we also introduce five new datasets for four language pairs: two for English–Gujarati, and one each for English–Hindi, English–Telugu and English–Bengali, which includes one manually post-edited dataset for English–Gujarati. These Indian languages are spoken by around 689M speakers world-wide. We compare results obtained using our proposed models with multiple state-of-the-art systems including the winning system in the WMT17 shared task on QE and show that our proposed neural model which combines the discriminative power of carefully chosen features with Siamese Convolutional Neural Networks (CNNs) works best for all Indian language datasets.

A Reinforcement Learning Approach to Interactive-Predictive Neural Machine Translation
Tsz Kin Lam | Julia Kreutzer | Stefan Riezler

We present an approach to interactive-predictive neural machine translation that attempts to reduce human effort from three directions: Firstly, instead of requiring humans to select, correct, or delete segments, we employ the idea of learning from human reinforcements in the form of judgments on the quality of partial translations. Secondly, human effort is further reduced by using the entropy of word predictions as uncertainty criterion to trigger feedback requests. Lastly, online updates of the model parameters after every interaction allow the model to adapt quickly. We show in simulation experiments that reward signals on partial translations significantly improve character F-score and BLEU compared to feedback on full translations only, while human effort can be reduced to an average number of 5 feedback requests for every input.
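
The entropy criterion mentioned in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names and the threshold value are hypothetical, and the input is assumed to be the model's probability distribution over the next word.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def needs_feedback(probs, threshold=1.0):
    """Request human feedback when the next-word distribution is uncertain.

    The threshold of 1.0 nats is an arbitrary value chosen for illustration.
    """
    return entropy(probs) > threshold
```

A peaked distribution such as `[0.97, 0.01, 0.01, 0.01]` has low entropy and would not trigger a request, while a uniform distribution over four words (entropy ln 4, about 1.39 nats) would.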

Machine Translation Evaluation beyond the Sentence Level
Jindřich Libovický | Thomas Brovelli | Bruno Cartoni

Automatic machine translation evaluation was crucial for the rapid development of machine translation systems over the last two decades. So far, most attention has been paid to the evaluation metrics that work with text on the sentence level and so did the translation systems. Across-sentence translation quality depends on discourse phenomena that may not manifest at all when staying within sentence boundaries (e.g. coreference, discourse connectives, verb tense sequence etc.). To tackle this, we propose several document-level MT evaluation metrics: generalizations of sentence-level metrics, language-(pair)-independent versions of lexical cohesion scores and coreference and morphology preservation in the target texts. We measure their agreement with human judgment on a newly created dataset of pairwise paragraph comparisons for four language pairs.

An Analysis of Source Context Dependency in Neural Machine Translation
Xutai Ma | Ke Li | Philipp Koehn

The encoder-decoder with attention model has become the state of the art for machine translation. However, more investigations are still needed to understand the internal mechanism of this end-to-end model. In this paper, we focus on how neural machine translation (NMT) models consider source information while decoding. We propose a numerical measurement of source context dependency in the NMT models and analyze the behaviors of the NMT decoder with this measurement under several circumstances. Experimental results show that this measurement is an appropriate estimate for source context dependency and consistent over different domains.

Gist MT Users: A Snapshot of the Use and Users of One Online MT Tool
Mary Nurminen | Niko Papula

This study analyzes usage statistics and the results of an end-user survey to compile a snapshot of the current use and users of one online machine translation (MT) tool, Multilizer’s PDF Translator. The results reveal that the tool is used predominantly for assimilation purposes and that respondents use MT often. People use the tool to translate texts from different areas of life, including work, study and leisure. Of these, the study area is currently the most prevalent. The results also reveal a tendency for users to machine translate documents that are in languages they have some understanding of, rather than texts they do not understand at all. The findings imply that gist MT is becoming a part of people’s everyday lives and that perhaps people use gist MT in a different way than they use publishing-level translations.

Letting a Neural Network Decide Which Machine Translation System to Use for Black-Box Fuzzy-Match Repair
John E. Ortega | Weiyi Lu | Adam Meyers | Kyunghyun Cho

While systems using the Neural Network-based Machine Translation (NMT) paradigm achieve the highest scores on recent shared tasks, phrase-based (PBMT) systems, rule-based (RBMT) systems and other systems may get better results for individual examples. Therefore, combined systems should achieve the best results for MT, particularly if the system combination method can take advantage of the strengths of each paradigm. In this paper, we describe a system that predicts whether an NMT, PBMT or RBMT system will get the best Spanish translation result for a particular English sentence in DGT-TM 2016. Then we use fuzzy-match repair (FMR) as a mechanism to show that the combined system outperforms individual systems in a black-box machine translation setting.

Data selection for NMT using Infrequent n-gram Recovery
Zuzanna Parcheta | Germán Sanchis-Trilles | Francisco Casacuberta

Neural Machine Translation (NMT) has achieved promising results comparable with Phrase-Based Statistical Machine Translation (PBSMT). However, to train a neural translation engine, much more powerful machines are required than those required to develop translation engines based on PBSMT. One solution to reduce the training cost of NMT systems is the reduction of the training corpus through data selection (DS) techniques. There are many DS techniques applied in PBSMT which bring good results. In this work, we show that the data selection technique based on infrequent n-gram occurrence described in (Gascó et al., 2012), commonly used for PBSMT systems, also works well for NMT systems. We focus our work on selecting data according to specific corpora using the previously mentioned technique. The specific-domain corpora used for our experiments are IT domain and medical domain. The DS technique significantly reduces the execution time required to train the model, by between 87% and 93%. Also, it improves translation quality by up to 2.8 BLEU points. The improvements are obtained with just a small fraction of the data that accounts for between 6% and 20% of the total data.
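
To give a feel for selection driven by infrequent n-grams, here is a simplified greedy sketch. It is an assumption-laden illustration and not the algorithm of the cited work: the scoring rule, the `threshold` and `max_n` parameters and all function names are hypothetical. The sketch repeatedly picks the pool sentence that covers the most in-domain n-grams still seen fewer than `threshold` times among the sentences selected so far.

```python
from collections import Counter


def ngrams(tokens, max_n=2):
    """All n-grams of order 1..max_n as tuples."""
    return [
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def select_sentences(in_domain, pool, threshold=1, max_n=2):
    """Greedily select pool sentences covering under-represented n-grams."""
    # N-grams that occur in the in-domain text and that we want covered.
    wanted = {g for s in in_domain for g in ngrams(s.split(), max_n)}
    seen = Counter()  # how often each n-gram occurs in the selection so far
    remaining = list(pool)
    selected = []

    def score(sentence):
        grams = set(ngrams(sentence.split(), max_n))
        return sum(1 for g in grams if g in wanted and seen[g] < threshold)

    while remaining:
        best = max(remaining, key=score)
        if score(best) == 0:
            break  # nothing left that adds a still-infrequent n-gram
        selected.append(best)
        remaining.remove(best)
        seen.update(set(ngrams(best.split(), max_n)))
    return selected
```

For an in-domain sentence "install the printer driver" and a pool containing "the weather is nice", "install the driver now", "printer driver update" and "good morning", the sketch selects the second and third sentences and stops, since the remaining ones add no uncovered in-domain n-gram.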

Translating Short Segments with NMT: A Case Study in English-to-Hindi
Shantipriya Parida | Ondřej Bojar

This paper presents a case study in translating short image captions of the Visual Genome dataset from English into Hindi using out-of-domain data sets of varying size. We experiment with three NMT models: the shallow and deep sequence-to-sequence and the Transformer model as implemented in the Marian toolkit. Phrase-based Moses serves as the baseline. The results indicate that the Transformer model outperforms others in the large data setting in a number of automatic metrics and manual evaluation, and it also produces the fewest truncated sentences. Transformer training is however very sensitive to the hyperparameters, so it requires more experimenting. The deep sequence-to-sequence model produced more flawless outputs in the small data setting and it was generally more stable, at the cost of more training iterations.

Feature Decay Algorithms for Neural Machine Translation
Alberto Poncelas | Gideon Maillette de Buy Wenniger | Andy Way

Neural Machine Translation (NMT) systems require a lot of data to be competitive. For this reason, data selection techniques are used only for fine-tuning systems that have been trained with larger amounts of data. In this work we aim to use Feature Decay Algorithms (FDA) data selection techniques not only to fine-tune a system but also to build a complete system with less data. Our findings reveal that it is possible to find a subset of sentence pairs that, when used for training a German-English NMT system, outperforms the full training corpus by 1.11 BLEU points.

Investigating Backtranslation in Neural Machine Translation
Alberto Poncelas | Dimitar Shterionov | Andy Way | Gideon Maillette de Buy Wenniger | Peyman Passban

A prerequisite for training corpus-based machine translation (MT) systems – either Statistical MT (SMT) or Neural MT (NMT) – is the availability of high-quality parallel data. This is arguably more important today than ever before, as NMT has been shown in many studies to outperform SMT, but mostly when large parallel corpora are available; in cases where data is limited, SMT can still outperform NMT. Recently researchers have shown that back-translating monolingual data can be used to create synthetic parallel corpora, which in turn can be used in combination with authentic parallel data to train a high-quality NMT system. Given that large collections of new parallel text become available only quite rarely, backtranslation has become the norm when building state-of-the-art NMT systems, especially in resource-poor scenarios. However, we assert that there are many unknown factors regarding the actual effects of back-translated data on the translation capabilities of an NMT model. Accordingly, in this work we investigate how using back-translated data as a training corpus – both as a separate standalone dataset as well as combined with human-generated parallel data – affects the performance of an NMT model. We use incrementally larger amounts of back-translated data to train a range of NMT systems for German-to-English, and analyse the resulting translation performance.

Multi-Domain Neural Machine Translation
Sander Tars | Mark Fishel

We present an approach to neural machine translation (NMT) that supports multiple domains in a single model and allows switching between the domains when translating. The core idea is to treat text domains as distinct languages and use multilingual NMT methods to create multi-domain translation systems; we show that this approach results in significant translation quality gains over fine-tuning. We also explore whether the knowledge of pre-specified text domains is necessary; it turns out that it is after all, but also that when the domain is not known, quite high translation quality can be reached, in some cases even higher than with known domains.
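
One common multilingual NMT technique is to prepend a special token to each source sentence that tells the model which output variety is wanted. Applied to domains, this could look like the sketch below. Whether the paper uses exactly this mechanism is not stated in the abstract, so the tagging scheme, the `<2...>` token format and the function name are assumptions made for illustration.

```python
def tag_with_domain(sentence: str, domain: str) -> str:
    """Prepend a hypothetical domain token, e.g. '<2medical>', to a source sentence."""
    return f"<2{domain}> {sentence}"


def tag_corpus(sentences, domains):
    """Tag a list of source sentences with their per-sentence domain labels."""
    if len(sentences) != len(domains):
        raise ValueError("need exactly one domain label per sentence")
    return [tag_with_domain(s, d) for s, d in zip(sentences, domains)]
```

At translation time, switching domains under this scheme would amount to changing the token in front of the input, for example `tag_with_domain("take two tablets daily", "medical")`.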

A Comparison of Different Punctuation Prediction Approaches in a Translation Context
Vincent Vandeghinste | Lyan Verwimp | Joris Pelemans | Patrick Wambacq

We test a series of techniques to predict punctuation and its effect on machine translation (MT) quality. Several techniques for punctuation prediction are compared: language modeling techniques, such as n-grams and long short-term memories (LSTM), sequence labeling LSTMs (unidirectional and bidirectional), and monolingual phrase-based, hierarchical and neural MT. For actual translation, phrase-based, hierarchical and neural MT are investigated. We observe that for punctuation prediction, phrase-based statistical MT and neural MT reach similar results, and are best used as a preprocessing step which is followed by neural MT to perform the actual translation. Implicit punctuation insertion by a dedicated neural MT system, trained on unpunctuated source and punctuated target, yields similar results.

Integrating MT at Swiss Post’s Language Service: preliminary results
Pierrette Bouillon | Sabrina Girletti | Paula Estrella | Jonathan Mutal | Martina Bellodi | Beatrice Bircher

This paper presents the preliminary results of an ongoing academia-industry collaboration that aims to integrate MT into the workflow of Swiss Post’s Language Service. We describe the evaluations carried out to select an MT tool (commercial or open-source) and assess the suitability of machine translation for post-editing in Swiss Post’s various subject areas and language pairs. The goal of this first phase is to provide recommendations with regard to the tool, language pair and most suitable domain for implementing MT.

Iterative Data Augmentation for Neural Machine Translation: a Low Resource Case Study for English-Telugu
Sandipan Dandapat | Christian Federmann

Telugu is the fifteenth most commonly spoken language in the world with an estimated reach of 75 million people in the Indian subcontinent. At the same time, it is a severely low resourced language. In this paper, we present work on English–Telugu general domain machine translation (MT) systems using small amounts of parallel data. The baseline statistical (SMT) and neural MT (NMT) systems do not yield acceptable translation quality, mostly due to limited resources. However, the use of synthetic parallel data (generated using back translation, based on an NMT engine) significantly improves translation quality and allows NMT to outperform SMT. We extend back translation and propose a new, iterative data augmentation (IDA) method. Filtering of synthetic data and IDA both further boost translation quality of our final NMT systems, as measured by BLEU scores on all test sets and based on state-of-the-art human evaluation.

Toward leveraging Gherkin Controlled Natural Language and Machine Translation for Global Product Information Development
Morgan O’Brien

Machine Translation (MT) already plays an important part in the software development process at McAfee, where the technology can be leveraged to provide early builds for localization and internationalization testing teams. Behavior Driven Development (BDD) has been growing in usage as a development methodology in McAfee. Within BDD, the Gherkin Controlled Natural Language (CNL) is a syntax and common terminology set that is used to describe the software or business process in a User Story. Given there exists this control on the language to describe User Stories for software features using Gherkin, we seek to use Machine Translation to Globalize it at high accuracy and without Post-Editing and reuse it as Product Information. This enables global product information development to happen as part of the Software Development Life Cycle (SDLC) and at low cost.

pdf bib
Implementing a neural machine translation engine for mobile devices: the Lingvanex use case
Zuzanna Parcheta | Germán Sanchis-Trilles | Aliaksei Rudak | Siarhei Bratchenia

In this paper, we present the challenge entailed by implementing a mobile version of a neural machine translation system, where the goal is to maximise translation quality while minimising model size. We explain the whole process of implementing the translation engine on an English–Spanish example and we describe all the difficulties found and the solutions implemented. The main techniques used in this work are data selection by means of Infrequent n-gram Recovery, appending a special word at the end of each sentence, and generating additional samples without the final punctuation marks. The last two techniques were devised with the purpose of achieving a translation model that generates sentences without the final full stop, or other punctuation marks. Also, in this work, the Infrequent n-gram Recovery was used for the first time to create a new corpus, and not enlarge the in-domain dataset. Finally, we get a small size model with quality good enough to serve for daily use.

pdf bib
Bootstrapping Multilingual Intent Models via Machine Translation for Dialog Automation
Nicholas Ruiz | Srinivas Bangalore | John Chen

With the resurgence of chat-based dialog systems in consumer and enterprise applications, there has been much success in developing data-driven and rule-based natural language models to understand human intent. Since these models require large amounts of data and in-domain knowledge, expanding an equivalent service into new markets is disrupted by language barriers that inhibit dialog automation. This paper presents a user study to evaluate the utility of out-of-the-box machine translation technology to (1) rapidly bootstrap multilingual spoken dialog systems and (2) enable existing human analysts to understand foreign language utterances. We additionally evaluate the utility of machine translation in human assisted environments, where a portion of the traffic is processed by analysts. In English→Spanish experiments, we observe a high potential for dialog automation, as well as the potential for human analysts to process foreign language utterances with high accuracy.

pdf bib
How to Move to Neural Machine Translation for Enterprise-Scale Programs - An Early Adoption Case Study
Tanja Schmidt | Lena Marg

While Neural Machine Translation (NMT) technology has been around for a few years now in research and development, it is still in its infancy when it comes to customization readiness and experience with implementation on an enterprise scale with Language Service Providers (LSPs). For large, multi-language LSPs, it is therefore important to stay up to date on the latest research on the technology as such, on the best use cases, and on the main advantages and disadvantages. Moreover, due to this infancy, the challenges encountered during an early adoption of the technology in an enterprise-scale translation program are of a very practical and concrete nature. They range from the quality of the NMT output, through the availability of language pairs in (customizable) NMT systems, to additional translation workflow investments and considerations with regard to involving the supply chain. In an attempt to outline the above challenges and possible approaches to overcome them, this paper describes the migration of an established enterprise-scale machine translation program of 28 language pairs with post-editing from a Statistical Machine Translation (SMT) setup to NMT.

pdf bib
A Comparison of Statistical and Neural MT in a Multi-Product and Multilingual Software Company - User Study
Nander Speerstra

Over the last 4 years, Infor has been implementing machine translation (MT) in its translation process. In this paper, the results of both statistical and neural MT projects are provided to give an insight into the advantages and disadvantages of MT use in a large company. We also offer a look into the future of MT within our company and at how to strengthen the implementation of MT in our translation process.

pdf bib
Does Machine Translation Really Produce Translations?
Félix do Carmo

I will try to answer the question of whether Machine Translation (MT) can be considered a full translation process. I argue that, instead, it should be seen as part of a process performed by translators, in which MT plays a fundamental support role. The roles of translators and MT in the translation process are presented in an analysis that draws its elements from Translation Studies and Translation Process Research.

pdf bib
Pre-professional pre-conceptions
Laura Bruno | Antonio Miloro | Paula Estrella | Mariona Sabaté Carrove

While MT+PE has become an industry standard, our translation schools are not able to accompany these changes by updating their academic programs. We polled 100 pre-professionals to confirm that in our local context they are reluctant to accept post-editing jobs mainly because they have inherited pre-conceptions or negative opinions about MT during their studies.

pdf bib
Determining translators’ perception, productivity and post-editing effort when using SMT and NMT systems
Ariana López Pereira

Thanks to the great progress seen in the machine translation (MT) field in recent years, the use and perception of MT by translators need to be revisited. The main objective of this paper is to determine the perception, productivity and post-editing effort (in terms of time and number of edits) of six translators when using Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) systems. This presentation focuses on how translators perceive these two systems, in order to know which one they prefer, what types of errors and problems each system presents, and how translators solve these issues. The tests were performed with the Dynamic Quality Framework (DQF) tools (quick comparison and productivity tasks) using the Google Neural Machine Translation and Microsoft Translator (SMT) APIs on two different English-into-Spanish texts, an instruction manual and a marketing webpage. Results showed that translators considerably prefer NMT over SMT. Moreover, NMT is more adequate and fluent than SMT.

pdf bib
Machine translation post-editing in the professional translation market in Spain: a case study on the experience and opinion of professional translators
Lorena Pérez Macías

The objective of this paper is to analyse some aspects related to the practice of post-editing services in the current translation market in Spain. To this aim, some quantitative data collected through an online survey and concerning the experience and opinion of professional translators regarding post-editing will be shown.

pdf bib
Perception vs. Acceptability of TM and SMT Output: What do translators prefer?
Pilar Sánchez-Gijón | Joss Moorkens | Andy Way

This paper reports the results of two studies carried out with two different groups of professional translators to find out how professionals perceive and accept SMT in comparison with TM. The first group translated and post-edited segments from English into German, and the second group from English into Spanish. Both studies had equivalent settings in order to guarantee the comparability of the results. The results also help to shed light on the real benefit that translators may derive from SMT.

pdf bib
Learning to use machine translation on the Translation Commons Learn portal
Jeannette Stewart | Mikel L. Forcada

We describe the Learn portal of Translation Commons (TC), a self-managed community of volunteer translators aimed at sharing tools, resources and initiatives for the translation community as a whole. Members are encouraged to upload and share their free resources on the platform and to create free courses and tutorials. Specifically, there is no educational material on machine translation yet, and we invite experts to contribute.

pdf bib
Use of NMT in Ubiqus Group
Paloma Valenciano

After more than 30 years’ experience as a translator and as a reviser, I have recently started to post-edit. During these 10 months discovering a new approach to my profession, the experience has been highly positive. Ubiqus, the French group to which we belong, has developed 20 engines based on OpenNMT. OpenNMT derives from an academic project initiated in 2016 by Harvard NLP; Systran joined the project and an open source toolkit was released in January 2017. The community grew when individuals as well as localization professionals contributed. Ubiqus adopted this toolkit at the very beginning of 2017 and contributed to its development as well as some extensions, developing a layer to integrate OpenNMT into our workflow environments, including SDL Studio, and with our internal ERP, which enables us to provide a highly efficient end-to-end system. I have been using the EN-ES and FR-ES engines mainly for legal texts. I very soon felt comfortable with the task, and I started measuring my productivity by timing my output. I was surprised by the improvement from the very beginning, and as the NMT engine was further trained and I got more used to the post-editing task I achieved even better results, improving productivity by almost 30%. Ubiqus has also developed a scheme for the systematic scoring of all translation jobs, U-Score, a composite indicator of the overall performance of the machine. The U-Score is obtained by aggregating the information of BLEU, TER and DL-ratio and averaging them. A transformation is then applied to spread the scale a bit. The scores have been clearly improving in recent months with constant training of the engines.

pdf bib
An In-house Translator’s Experience with Machine Translation
Anna Zaretskaya | Marcel Biller

pdf bib
OctaveMT: Putting Three Birds into One Cage
Juan A. Alonso | Albert Llorens

This product presentation describes the integration of the three MT technologies currently used – rule-based (RBMT), statistical (SMT) and neural (NMT) – into a single scalable platform, OctaveMT. MT clients can access all three types of MT engines, whether on a user-specified basis or depending on several translation parameters (language direction, domain, etc.).

pdf bib
TransPerfect’s Private Neural Machine Translation Portal
Diego Bartolomé | José Masa

We will present our solution to replace the usage of publicly available machine translation (MT) services in companies where privacy and confidentiality are key. Our MT portal can translate across a variety of languages using neural machine translation, and supports an extensive number of file types. Corporations are using it to enable multilingual communication everywhere.

pdf bib
Terminology validation for MT output
Giorgio Bernardinello

WebTerm Connector is a plugin for STAR MT Translate which combines machine translation with validated terminology information. The aim is to provide “understandable” information in the target language using corporate language and terminology.

pdf bib
The ModernMT Project
Nicola Bertoldi | Davide Caroselli | Marcello Federico

This short presentation introduces ModernMT: an open-source project that integrates real-time adaptive neural machine translation into a single easy-to-use product.

pdf bib
Developing a New Swiss Research Centre for Barrier-Free Communication
Pierrette Bouillon | Silvia Rodríguez Vázquez | Irene Strasly

The project ‘Proposal and Implementation of a Swiss Research Centre for Barrier-free Communication’ (BFC) is a four-year project (2017–2020) funded by the Rectors' Conference of Swiss Higher Education Institutions (swissuniversities). Its purpose is to ensure that individuals with a visual or hearing disability, people with a temporary cognitive impairment and speakers without sufficient knowledge of local languages can communicate and enjoy barrier-free access to information in all spheres of life, with a special focus on higher education.

pdf bib
Massively multilingual accessible audioguides via cell phones
Itziar Cortes | Igor Leturia | Iñaki Alegria | Aitzol Astigarraga | Kepa Sarasola | Manex Garaio

Bidaide is a web service that allows the visitors of a museum, route or building to read or listen to explanations about the visited place on their own mobile phone and in their own language. The visitor can access the explanations in various ways: by scanning QR codes located in the place, by GPS positioning (on outdoor routes), or by automatic Bluetooth proximity activation. This makes it accessible for people with reduced or no vision. In addition, the platform offers the manager of the visited site the most advanced language resources to create the texts and audio of the explanations in many languages.

pdf bib
ELRI - European Language Resources Infrastructure
Thierry Etchegoyhen | Borja Anza Porras | Andoni Azpeitia | Eva Martínez Garcia | Paulo Vale | José Luis Fonseca | Teresa Lynn | Jane Dunne | Federico Gaspari | Andy Way | Victoria Arranz | Khalid Choukri | Vladimir Popescu | Pedro Neiva | Rui Neto | Maite Melero | David Perez Fernandez | Antonio Branco | Ruben Branco | Luis Gomes

We describe the European Language Resources Infrastructure project, whose main aim is the provision of an infrastructure to help collect, prepare and share language resources that can in turn improve translation services in Europe.

pdf bib
The SUMMA Platform: Scalable Understanding of Multilingual Media
Ulrich Germann | Peggy van der Kreeft | Guntis Barzdins | Alexandra Birch

We present the latest version of the SUMMA platform, an open-source software platform for monitoring and interpreting multi-lingual media, from written news published on the internet to live media broadcasts via satellite or internet streaming.

pdf bib
Smart Pre- and Post-Processing for STAR MT Translate
Judith Klein

After many successful experiments it has become evident that smart pre- and post-processing can significantly improve the output of neural machine translation. Therefore, various generic and language-specific processes are applied to the training corpus, the user input and the MT output for STAR MT Translate.

pdf bib
mtrain: A Convenience Tool for Machine Translation
Samuel Läubli | Mathias Müller | Beat Horat | Martin Volk

We present mtrain, a convenience tool for machine translation. It wraps existing machine translation libraries and scripts to ease their use. mtrain is written purely in Python 3, well-documented, and freely available.

pdf bib
Empowering Translators with MTradumàtica: A Do-It-Yourself statistical machine translation platform
Adrià Martín-Mor | Pilar Sánchez-Gijón

According to Torres Hostench et al. (2016), the use of machine translation (MT) in Catalan and Spanish translation companies is low. Based on these results, the Tradumàtica research group, through the ProjecTA and ProjecTA-U projects, set out to bring MT and translators closer with a two-fold strategy: on the one hand, by developing MTradumàtica, a free Moses-based web platform with a graphical user interface (GUI) for statistical machine translation (SMT) trainers; on the other hand, by including MT-related content in translators’ training. This paper will describe the latest developments in MTradumàtica.

pdf bib
Speech Translation Systems as a Solution for a Wireless Earpiece
Nicholas Ruiz | Andrew Ochoa | Jainam Shah | William Goethels | Sergio DelRio Diaz

The advances of deep learning approaches in automatic speech recognition (ASR) and machine translation (MT) have allowed for levels of accuracy that move speech translation closer to being a commercially viable alternative interpretation solution. In addition, recent improvements in micro-electronic mechanical systems, microphone arrays, speech processing software, and wireless technology have enabled speech recognition software to capture higher quality speech input from wireless earpiece products. With this in mind, we introduce and present a wearable speech translation tool called Pilot, which uses these systems to translate language spoken within the proximity of a user wearing the wireless earpiece.

pdf bib
Multi-modal Context Modelling for Machine Translation
Lucia Specia

MultiMT is a European Research Council Starting Grant whose aim is to devise data, methods and algorithms to exploit multi-modal information (images, audio, metadata) for context modelling in machine translation and other cross-lingual tasks. The project draws upon different research fields including natural language processing, computer vision, speech processing and machine learning.

pdf bib
Project PiPeNovel: Pilot on Post-editing Novels
Antonio Toral | Martijn Wieling | Sheila Castilho | Joss Moorkens | Andy Way

Given (i) the rise of a new paradigm in machine translation based on neural networks that results in more fluent and less literal output than previous models and (ii) the maturity of machine-assisted translation via post-editing in industry, project PiPeNovel studies the feasibility of the post-editing workflow for literary text, conducting experiments with professional literary translators.

pdf bib
Smart Computer-Aided Translation Environment (SCATE): Highlights
Vincent Vandeghinste | Tom Vanallemeersch | Bram Bulté | Liesbeth Augustinus | Frank Van Eynde | Joris Pelemans | Lyan Verwimp | Patrick Wambacq | Geert Heyman | Marie-Francine Moens | Iulianna van der Lek-Ciudin | Frieda Steurs | Ayla Rigouts Terryn | Els Lefever | Arda Tezcan | Lieve Macken | Sven Coppers | Jens Brulmans | Jan Van Den Bergh | Kris Luyten | Karin Coninx

We present the highlights of the now finished 4-year SCATE project. It was completed in February 2018 and funded by the Flemish Government IWT-SBO, project No. 130041.

pdf bib
news.bridge - Automated Transcription and Translation for News
Peggy van der Kreeft | Renars Liepins

news.bridge provides a platform for multilingual video processing, including automated transcription and translation, subtitling, voice-over, and summarization, with a post-editing facility, for videos in a broad range of languages. The platform is currently in beta testing at Deutsche Welle for republishing of videos in other languages.

pdf bib
Europarl Datasets with Demographic Speaker Information
Eva Vanmassenhove | Christian Hardmeier

Research on speaker-adapted neural machine translation (NMT) is scarce. One of the main challenges for more personalized MT systems is finding large enough annotated parallel datasets with speaker information. Rabinovich et al. (2017) published an annotated parallel dataset for EN–FR and EN–DE; however, for many other language pairs no sufficiently large annotated datasets are available.