Hieu Hoang


2020

ParaCrawl: Web-Scale Acquisition of Parallel Corpora
Marta Bañón | Pinzhen Chen | Barry Haddow | Kenneth Heafield | Hieu Hoang | Miquel Esplà-Gomis | Mikel L. Forcada | Amir Kamran | Faheem Kirefu | Philipp Koehn | Sergio Ortiz Rojas | Leopoldo Pla Sempere | Gema Ramírez-Sánchez | Elsa Sarrías | Marek Strelec | Brian Thompson | William Waites | Dion Wiggins | Jaume Zaragoza
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We report on methods to create the largest publicly available parallel corpora by crawling the web, using open source software. We empirically compare alternative methods and publish benchmark data sets for sentence alignment and sentence pair filtering. We also describe the parallel corpora released and evaluate their quality and their usefulness to create machine translation systems.

2019

ParaCrawl: Web-scale parallel corpora for the languages of the EU
Miquel Esplà | Mikel Forcada | Gema Ramírez-Sánchez | Hieu Hoang
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks

2018

Marian: Fast Neural Machine Translation in C++
Marcin Junczys-Dowmunt | Roman Grundkiewicz | Tomasz Dwojak | Hieu Hoang | Kenneth Heafield | Tom Neckermann | Frank Seide | Ulrich Germann | Alham Fikri Aji | Nikolay Bogoychev | André F. T. Martins | Alexandra Birch
Proceedings of ACL 2018, System Demonstrations

We present Marian, an efficient and self-contained Neural Machine Translation framework with an integrated automatic differentiation engine based on dynamic computation graphs. Marian is written entirely in C++. We describe the design of the encoder-decoder framework and demonstrate that a research-friendly toolkit can achieve high training and translation speed.

Fast Neural Machine Translation Implementation
Hieu Hoang | Tomasz Dwojak | Rihards Krislauks | Daniel Torregrosa | Kenneth Heafield
Proceedings of the 2nd Workshop on Neural Machine Translation and Generation

This paper describes the submissions to the efficiency track for GPUs at the Workshop for Neural Machine Translation and Generation by members of the University of Edinburgh, Adam Mickiewicz University, Tilde and University of Alicante. We focus on efficient implementation of the recurrent deep-learning model as implemented in Amun, the fast inference engine for neural machine translation. We improve the performance with an efficient mini-batching algorithm, and by fusing the softmax operation with the k-best extraction algorithm. Submissions using Amun were first, second and third fastest in the GPU efficiency track.

Marian: Cost-effective High-Quality Neural Machine Translation in C++
Marcin Junczys-Dowmunt | Kenneth Heafield | Hieu Hoang | Roman Grundkiewicz | Anthony Aue
Proceedings of the 2nd Workshop on Neural Machine Translation and Generation

This paper describes the submissions of the “Marian” team to the WNMT 2018 shared task. We investigate combinations of teacher-student training, low-precision matrix products, auto-tuning and other methods to optimize the Transformer model on GPU and CPU. By further integrating these methods with the new averaging attention networks, a recently introduced faster Transformer variant, we create a number of high-quality, high-performance models on the GPU and CPU, dominating the Pareto frontier for this shared task.

2017

A Parallel Corpus for Evaluating Machine Translation between Arabic and European Languages
Nizar Habash | Nasser Zalmout | Dima Taji | Hieu Hoang | Maverick Alzate
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

We present Arab-Acquis, a large publicly available dataset for evaluating machine translation between 22 European languages and Arabic. Arab-Acquis consists of over 12,000 sentences from the JRC-Acquis (Acquis Communautaire) corpus translated twice by professional translators, once from English and once from French, and totaling over 600,000 words. The corpus follows previous data splits in the literature for tuning, development, and testing. We describe the corpus and how it was created. We also present the first benchmarking results on translating to and from Arabic for 22 European languages.

2016

Fast, Scalable Phrase-Based SMT Decoding
Hieu Hoang | Nikolay Bogoychev | Lane Schwartz | Marcin Junczys-Dowmunt
Conferences of the Association for Machine Translation in the Americas: MT Researchers' Track

The utilization of statistical machine translation (SMT) has grown enormously over the last decade, with many users relying on open-source software developed by the NLP community. As commercial use has increased, there is a need for software that is optimized for commercial requirements, in particular, fast phrase-based decoding and more efficient utilization of modern multicore servers. In this paper we re-examine the major components of phrase-based decoding and decoder implementation with particular emphasis on speed and scalability on multicore machines. The result is a drop-in replacement for the Moses decoder which is up to fifteen times faster and scales monotonically with the number of cores.

Fast and highly parallelizable phrase table for statistical machine translation
Nikolay Bogoychev | Hieu Hoang
Proceedings of the First Conference on Machine Translation: Volume 1, Research Papers

2014

Integrating an Unsupervised Transliteration Model into Statistical Machine Translation
Nadir Durrani | Hassan Sajjad | Hieu Hoang | Philipp Koehn
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers

Statistical machine translation with the Moses toolkit
Hieu Hoang | Matthias Huck | Philipp Koehn
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: Tutorials

Augmenting String-to-Tree and Tree-to-String Translation with Non-Syntactic Phrases
Matthias Huck | Hieu Hoang | Philipp Koehn
Proceedings of the Ninth Workshop on Statistical Machine Translation

Preference Grammars and Soft Syntactic Constraints for GHKM Syntax-based Statistical Machine Translation
Matthias Huck | Hieu Hoang | Philipp Koehn
Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation

2013

Can Markov Models Over Minimal Translation Units Help Phrase-Based SMT?
Nadir Durrani | Alexander Fraser | Helmut Schmid | Hieu Hoang | Philipp Koehn
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2012

Open Source Statistical Machine Translation
Philipp Koehn | Hieu Hoang
Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Tutorials

If you are interested in open-source machine translation but lack hands-on experience, this is the tutorial for you! We will start with background knowledge of statistical machine translation and then walk you through the process of installing and running an SMT system. We will show you how to prepare input data, and the most efficient way to train and use your translation systems. We shall also discuss solutions to some of the most common issues that face LSPs when using SMT, including how to tailor systems to specific clients, preserving document layout and formatting, and efficient ways of incorporating new translation memories. Previous years’ participants have included software engineers and managers who need to have a detailed understanding of the SMT process. This is a fast-paced, hands-on tutorial that will cover the skills you need to get you up and running with open-source SMT. The teaching will be based on the Moses toolkit, the most popular open-source machine translation software currently available. No prior knowledge of MT is necessary, only an interest in it. A laptop is required for this tutorial, and you should have rudimentary knowledge of using the command line on Windows or Linux.

2011

Left language model state for syntactic machine translation
Kenneth Heafield | Hieu Hoang | Philipp Koehn | Tetsuo Kiso | Marcello Federico
Proceedings of the 8th International Workshop on Spoken Language Translation: Evaluation Campaign

Many syntactic machine translation decoders, including Moses, cdec, and Joshua, implement bottom-up dynamic programming to integrate N-gram language model probabilities into hypothesis scoring. These decoders concatenate hypotheses according to grammar rules, yielding larger hypotheses and eventually complete translations. When hypotheses are concatenated, the language model score is adjusted to account for boundary-crossing n-grams. Words on the boundary of each hypothesis are encoded in state, consisting of left state (the first few words) and right state (the last few words). We speed concatenation by encoding left state using data structure pointers in lieu of vocabulary indices and by avoiding unnecessary queries. To increase the decoder’s opportunities to recombine hypotheses, we minimize the number of words encoded by left state. This has the effect of reducing search errors made by the decoder. The resulting gain in model score is smaller than for right state minimization, which we explain by observing a relationship between state minimization and language model probability. With a fixed cube pruning pop limit, we show a 3-6% reduction in CPU time and improved model scores. Reducing the pop limit to the point where model scores tie the baseline yields a net 11% reduction in CPU time.

2010

More Linguistic Annotation for Statistical Machine Translation
Philipp Koehn | Barry Haddow | Philip Williams | Hieu Hoang
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

Improved Translation with Source Syntax Labels
Hieu Hoang | Philipp Koehn
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

Machine Translation with Open source Software
Philipp Koehn | Hieu Hoang
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Tutorials

2009

Improving Mid-Range Re-Ordering Using Templates of Factors
Hieu Hoang | Philipp Koehn
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

A unified framework for phrase-based, hierarchical, and syntax-based statistical machine translation
Hieu Hoang | Philipp Koehn | Adam Lopez
Proceedings of the 6th International Workshop on Spoken Language Translation: Papers

Despite many differences between phrase-based, hierarchical, and syntax-based translation models, their training and testing pipelines are strikingly similar. Drawing on this fact, we extend the Moses toolkit to implement hierarchical and syntactic models, making it the first open source toolkit with end-to-end support for all three of these popular models in a single package. This extension substantially lowers the barrier to entry for machine translation research across multiple models.

A Systematic Analysis of Translation Model Search Spaces
Michael Auli | Adam Lopez | Hieu Hoang | Philipp Koehn
Proceedings of the Fourth Workshop on Statistical Machine Translation

2008

Towards better Machine Translation Quality for the German-English Language Pairs
Philipp Koehn | Abhishek Arun | Hieu Hoang
Proceedings of the Third Workshop on Statistical Machine Translation

Design of the Moses Decoder for Statistical Machine Translation
Hieu Hoang | Philipp Koehn
Software Engineering, Testing, and Quality Assurance for Natural Language Processing

Improving Interactive Machine Translation via Mouse Actions
Germán Sanchis-Trilles | Daniel Ortiz-Martínez | Jorge Civera | Francisco Casacuberta | Enrique Vidal | Hieu Hoang
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

2007

Factored Translation Models
Philipp Koehn | Hieu Hoang
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

Moses: Open Source Toolkit for Statistical Machine Translation
Philipp Koehn | Hieu Hoang | Alexandra Birch | Chris Callison-Burch | Marcello Federico | Nicola Bertoldi | Brooke Cowan | Wade Shen | Christine Moran | Richard Zens | Chris Dyer | Ondřej Bojar | Alexandra Constantin | Evan Herbst
Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions