Nicola Bertoldi

2018

The ModernMT Project
Nicola Bertoldi | Davide Caroselli | Marcello Federico
Proceedings of the 21st Annual Conference of the European Association for Machine Translation

This short presentation introduces ModernMT: an open-source project that integrates real-time adaptive neural machine translation into a single easy-to-use product.

pdf bib

Online Neural Automatic Post-editing for Neural Machine Translation
Matteo Negri | Marco Turchi | Nicola Bertoldi | Marcello Federico
Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)

pdf bib

ESCAPE: a Large-scale Synthetic Corpus for Automatic Post-Editing
Matteo Negri | Marco Turchi | Rajen Chatterjee | Nicola Bertoldi
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib abs

Evaluation of Terminology Translation in Instance-Based Neural MT Adaptation
M. Amin Farajian | Nicola Bertoldi | Matteo Negri | Marco Turchi | Marcello Federico
Proceedings of the 21st Annual Conference of the European Association for Machine Translation

We address the issues arising when a neural machine translation engine trained on generic data receives requests from a new domain that contains many specific technical terms. Given training data of the new domain, we consider two alternative methods to adapt the generic system: corpus-based and instance-based adaptation. While the first approach is computationally more intensive in generating a domain-customized network, the latter operates more efficiently at translation time and can handle on-the-fly adaptation to multiple domains. Besides evaluating the generic and the adapted networks with conventional translation quality metrics, in this paper we focus on their ability to properly handle domain-specific terms. We show that instance-based adaptation, by fine-tuning the model on-the-fly, is capable of significantly boosting the accuracy of translated terms, producing translations of quality comparable to the expensive corpus-based method.

2017

pdf bib abs

Neural vs. Phrase-Based Machine Translation in a Multi-Domain Scenario
M. Amin Farajian | Marco Turchi | Matteo Negri | Nicola Bertoldi | Marcello Federico
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

State-of-the-art neural machine translation (NMT) systems are generally trained on specific domains by carefully selecting the training sets and applying proper domain adaptation techniques. In this paper we consider the real world scenario in which the target domain is not predefined, hence the system should be able to translate text from multiple domains. We compare the performance of a generic NMT system and phrase-based statistical machine translation (PBMT) system by training them on a generic parallel corpus composed of data from different domains. Our results on multi-domain English-French data show that, in these realistic conditions, PBMT outperforms its neural counterpart. This raises the question: is NMT ready for deployment as a generic/multi-purpose MT backbone in real-world settings?

pdf bib

FBK’s Participation to the English-to-German News Translation Task of WMT 2017
Mattia Antonino Di Gangi | Nicola Bertoldi | Marcello Federico
Proceedings of the Second Conference on Machine Translation

2014

pdf bib abs

The repetition rate of text as a predictor of the effectiveness of machine translation adaptation
Mauro Cettolo | Nicola Bertoldi | Marcello Federico
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track

Since the effectiveness of MT adaptation relies on text repetitiveness, the question of how to measure repetitions in a text naturally arises. This work deals with the issue of looking for and evaluating text features that might help the prediction of the impact of MT adaptation on translation quality. In particular, the repetition rate metric we recently proposed is compared to other features employed in closely related NLP tasks. The comparison is carried out through a regression analysis between feature values and MT performance gains by dynamically adapted versus non-adapted MT engines, on five different translation tasks. The main outcome of the experiments is that the repetition rate correlates better than any other considered feature with the MT gains yielded by the online adaptation, although using all features jointly results in better predictions than with any single feature.

bib

MateCat: an open source CAT tool for MT post-editing
Marcello Federico | Nicola Bertoldi | Marco Trombetti | Alessandro Cattelan
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: Tutorials

bib

Working with MateCat: user manual and installation guide
Marcello Federico | Nicola Bertoldi | Marco Trombetti | Alessandro Cattelan
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: Tutorials

pdf bib

Online Word Alignment for Online Adaptive Machine Translation
M. Amin Farajian | Nicola Bertoldi | Marcello Federico
Proceedings of the EACL 2014 Workshop on Humans and Computer-assisted Translation

EU-BRIDGE is a European research project which is aimed at developing innovative speech translation technology. One of the collaborative efforts within EU-BRIDGE is to produce joint submissions of up to four different partners to the evaluation campaign at the 2014 International Workshop on Spoken Language Translation (IWSLT). We submitted combined translations to the German→English spoken language translation (SLT) track as well as to the German→English, English→German and English→French machine translation (MT) tracks. In this paper, we present the techniques which were applied by the different individual translation systems of RWTH Aachen University, the University of Edinburgh, Karlsruhe Institute of Technology, and Fondazione Bruno Kessler. We then show the combination approach developed at RWTH Aachen University which combined the individual systems. The consensus translations yield empirical gains of up to 2.3 points in BLEU and 1.2 points in TER compared to the best individual system.

pdf bib abs

FBK’s machine translation and speech translation systems for the IWSLT 2014 evaluation campaign
Nicola Bertoldi | Prashant Mathur | Nicholas Ruiz | Marcello Federico
Proceedings of the 11th International Workshop on Spoken Language Translation: Evaluation Campaign

This paper describes the systems submitted by FBK for the MT and SLT tracks of IWSLT 2014. We participated in the English-French and German-English machine translation tasks, as well as the English-French speech translation task. We report improvements in our English-French MT systems over last year’s baselines, largely due to improved techniques of combining translation and language models, and using huge language models. For our German-English system, we experimented with a novel domain adaptation technique. For both language pairs we also applied a novel word trigger-based model which shows slight improvements on English-French and German-English systems. Our English-French SLT system utilizes MT-based punctuation insertion, recasing, and ASR-like synthesized MT training data.

2013

pdf bib

Cache-based Online Adaptation for Machine Translation Enhanced Computer Assisted Translation
Nicola Bertoldi | Mauro Cettolo | Marcello Federico
Proceedings of Machine Translation Summit XIV: Papers

pdf bib

Project Adaptation for MT-Enhanced Computer Assisted Translation
Mauro Cettolo | Nicola Bertoldi | Marcello Federico
Proceedings of Machine Translation Summit XIV: Papers

pdf bib abs

FBK’s machine translation systems for the IWSLT 2013 evaluation campaign
Nicola Bertoldi | M. Amin Farajian | Prashant Mathur | Nicholas Ruiz | Marcello Federico
Proceedings of the 10th International Workshop on Spoken Language Translation: Evaluation Campaign

This paper describes the systems submitted by FBK for the MT track of IWSLT 2013. We participated in the English-French as well as the bidirectional Persian-English translation tasks. We report substantial improvements in our English-French systems over last year’s baselines, largely due to improved techniques of combining translation and language models. For our Persian-English and English-Persian systems, we observe substantive improvements over baselines submitted by the workshop organizers, due to enhanced language-specific text normalization and the creation of a large monolingual news corpus in Persian.

pdf bib

Generative and Discriminative Methods for Online Adaptation in SMT
Katharina Wäschle | Patrick Simianer | Nicola Bertoldi | Stefan Riezler | Marcello Federico
Proceedings of Machine Translation Summit XIV: Papers

EU-BRIDGE is a European research project which is aimed at developing innovative speech translation technology. This paper describes one of the collaborative efforts within EU-BRIDGE to further advance the state of the art in machine translation between two European language pairs, English→French and German→English. Four research institutions involved in the EU-BRIDGE project combined their individual machine translation systems and participated with a joint setup in the machine translation track of the evaluation campaign at the 2013 International Workshop on Spoken Language Translation (IWSLT). We present the methods and techniques to achieve high translation quality for text translation of talks which are applied at RWTH Aachen University, the University of Edinburgh, Karlsruhe Institute of Technology, and Fondazione Bruno Kessler. We then show how we have been able to considerably boost translation performance (as measured in terms of the metrics BLEU and TER) by means of system combination. The joint setups yield empirical gains of up to 1.4 points in BLEU and 2.8 points in TER on the IWSLT test sets compared to the best single systems.

2012

pdf bib

Evaluating the Learning Curve of Domain Adaptive Statistical Machine Translation Systems
Nicola Bertoldi | Mauro Cettolo | Marcello Federico | Christian Buck
Proceedings of the Seventh Workshop on Statistical Machine Translation

pdf bib abs

Practical Domain Adaptation in SMT
Marcello Federico | Nicola Bertoldi
Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Tutorials

Several studies have recently reported significant productivity gains by human translators when, besides translation memory (TM) matches, they also receive suggestions from a statistical machine translation (SMT) engine. In fact, an increasing number of language service providers and in-house translation services of large companies are nowadays integrating SMT into their workflows. The transfer of state-of-the-art SMT technology from research to industry has been relatively fast and simple, also thanks to the development of open source software such as MOSES, GIZA++, and IRSTLM. While a translator is working on a specific translation project, she evaluates the utility of translating versus post-editing a segment based on the adequacy and fluency provided by the SMT engine, which in turn depends on the considered language pair, the linguistic domain of the task, and the amount of available training data. Statistical models, like those employed in SMT, rely on a simple assumption: the data used to train and tune the models represent the target translation task. Unfortunately, this assumption cannot be satisfied in most real application cases, simply because for most language pairs and domains there is not sufficient data to adequately train an SMT system. Hence, common practice is to train SMT systems by merging parallel and monolingual data from the target domain with as much data as possible from any other available source. This workaround is simple and gives practical benefits but is often not the best way to exploit the available data. This tutorial addresses the optimal use of in-domain and out-of-domain data to achieve better SMT performance on a given application domain. Domain adaptation, in general, refers to statistical modeling and machine learning techniques that try to cope with the unavoidable mismatch between training and task data that typically occurs in real-life applications.
Our tutorial will survey several application cases in which domain adaptation can be applied, and present adaptation techniques that best fit each case. In particular, we will cover adaptation methods for n-gram language models and translation models in phrase-based SMT. The tutorial will provide some high-level theoretical background in domain adaptation, discuss practical application cases, and finally show how the presented methods can be applied with two widely used software tools: Moses and IRSTLM. The tutorial is suited for any practitioner of statistical machine translation. No particular programming or mathematical background is required.

2011

pdf bib

Methods for Smoothing the Optimizer Instability in SMT
Mauro Cettolo | Nicola Bertoldi | Marcello Federico
Proceedings of Machine Translation Summit XIII: Papers

pdf bib

Bootstrapping Arabic-Italian SMT through Comparable Texts and Pivot Translation
Mauro Cettolo | Nicola Bertoldi | Marcello Federico
Proceedings of the 15th Annual Conference of the European Association for Machine Translation

2010

pdf bib abs

Mining parallel fragments from comparable texts
Mauro Cettolo | Marcello Federico | Nicola Bertoldi
Proceedings of the 7th International Workshop on Spoken Language Translation: Papers

This paper proposes a novel method for exploiting comparable documents to generate parallel data for machine translation. First, each source document is paired to each sentence of the corresponding target document; second, partial phrase alignments are computed within the paired texts; finally, fragment pairs across linked phrase-pairs are extracted. The algorithm has been tested on two recent challenging news translation tasks. Results show that mining for parallel fragments is more effective than mining for parallel sentences, and that comparable in-domain texts can be more valuable than parallel out-of-domain texts.

pdf bib

Statistical Machine Translation of Texts with Misspelled Words
Nicola Bertoldi | Mauro Cettolo | Marcello Federico
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

2009

pdf bib abs

Online language model adaptation for spoken dialog translation
Germán Sanchis-Trilles | Mauro Cettolo | Nicola Bertoldi | Marcello Federico
Proceedings of the 6th International Workshop on Spoken Language Translation: Papers

This paper focuses on the problem of language model adaptation in the context of Chinese-English cross-lingual dialogs, as set up by the challenge task of the IWSLT 2009 Evaluation Campaign. Mixtures of n-gram language models are investigated, which are obtained by clustering bilingual training data according to different available human annotations, respectively, at the dialog level, turn level, and dialog act level. For the latter case, clustering of IWSLT data was in fact induced through a comparable Italian-English parallel corpus provided with dialog act annotations. For the sake of adaptation, mixture weight estimation is performed either at the level of a single source sentence or of the test set. Estimated weights are then transferred to the target language mixture model. Experimental results show that, by training different specific language models weighted according to the actual input instead of using a single target language model, significant gains in terms of perplexity and BLEU can be achieved.

pdf bib abs

FBK at IWSLT 2009
Nicola Bertoldi | Arianna Bisazza | Mauro Cettolo | Germán Sanchis-Trilles | Marcello Federico
Proceedings of the 6th International Workshop on Spoken Language Translation: Evaluation Campaign

This paper reports on the participation of FBK at the IWSLT 2009 Evaluation. This year we worked on the Arabic-English and Turkish-English BTEC tasks with a special effort on linguistic preprocessing techniques involving morphological segmentation. In addition, we investigated the adaptation problem in the development of systems for the Chinese-English and English-Chinese challenge tasks; in particular, we explored different ways for clustering training data into topic or dialog-specific subsets: by producing (and combining) smaller but more focused models, we intended to make better use of the available training data, with the ultimate purpose of improving translation quality.

pdf bib

Domain Adaptation for Statistical Machine Translation with Monolingual Resources
Nicola Bertoldi | Marcello Federico
Proceedings of the Fourth Workshop on Statistical Machine Translation

2008

pdf bib abs

Shallow-Syntax Phrase-Based Translation: Joint versus Factored String-to-Chunk Models
Mauro Cettolo | Marcello Federico | Daniele Pighin | Nicola Bertoldi
Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Research Papers

This work extends phrase-based statistical MT (SMT) with shallow syntax dependencies. Two string-to-chunks translation models are proposed: a factored model, which augments phrase-based SMT with layered dependencies, and a joint model, that extends the phrase translation table with microtags, i.e. per-word projections of chunk labels. Both rely on n-gram models of target sequences with different granularity: single words, micro-tags, chunks. In particular, n-grams defined over syntactic chunks should model syntactic constraints coping with word-group movements. Experimental analysis and evaluation conducted on two popular Chinese-English tasks suggest that the shallow-syntax joint-translation model has potential to outperform state-of-the-art phrase-based translation, with a reasonable computational overhead.

pdf bib abs

Phrase-based statistical machine translation with pivot languages.
Nicola Bertoldi | Madalina Barbaiani | Marcello Federico | Roldano Cattoni
Proceedings of the 5th International Workshop on Spoken Language Translation: Papers

Translation with pivot languages has recently gained attention as a means to circumvent the data bottleneck of statistical machine translation (SMT). This paper tries to give a mathematically sound formulation of the various approaches presented in the literature and introduces new methods for training alignment models through pivot languages. We present experimental results on Chinese-Spanish translation via English, on a popular traveling domain task. In contrast to previous literature, we report experimental results by using parallel corpora that are either disjoint or overlapped on the pivot language side. Finally, our original method for generating training data through random sampling is shown to perform as well as the best methods based on the coupling of translation systems.

pdf bib abs

FBK @ IWSLT-2008.
Nicola Bertoldi | Roldano Cattoni | Marcello Federico | Madalina Barbaiani
Proceedings of the 5th International Workshop on Spoken Language Translation: Evaluation Campaign

This paper reports on the participation of FBK at the IWSLT 2008 Evaluation. Main effort has been spent on the Chinese-Spanish Pivot task. We implemented four methods to perform pivot translation. The results on the IWSLT 2008 test data show that our original method for generating training data through random sampling outperforms the best methods based on coupling translation systems. FBK also participated in the Chinese-English Challenge task and the Chinese-English and Chinese-Spanish BTEC tasks, employing the standard state-of-the-art MT system Moses Toolkit.

2007

pdf bib abs

FBK@IWSLT 2007
Nicola Bertoldi | Mauro Cettolo | Roldano Cattoni | Marcello Federico
Proceedings of the Fourth International Workshop on Spoken Language Translation

This paper reports on the participation of FBK (formerly ITC-irst) at the IWSLT 2007 Evaluation. FBK participated in three tasks, namely Chinese-to-English, Japanese-to-English, and Italian-to-English. With respect to last year, translation systems were developed with the Moses Toolkit and the IRSTLM library, both available as open source software. Moreover, several novel ideas were investigated: the use of confusion networks as input to manage ambiguity in punctuation, the estimation of an additional language model by means of Google’s Web 1T 5-gram collection, the combination of true case and lower case language models, and finally the use of multiple phrase-tables. By working on top of a state-of-the-art baseline, experiments showed that the above methods accounted for significant BLEU score improvements.