Tobias Domhan


pdf bib
The Devil is in the Details: On the Pitfalls of Vocabulary Selection in Neural Machine Translation
Tobias Domhan | Eva Hasler | Ke Tran | Sony Trenous | Bill Byrne | Felix Hieber
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Vocabulary selection, or lexical shortlisting, is a well-known technique to improve latency of Neural Machine Translation models by constraining the set of allowed output words during inference. The chosen set is typically determined by separately trained alignment model parameters, independent of the source-sentence context at inference time. While vocabulary selection appears competitive with respect to automatic quality metrics in prior work, we show that it can fail to select the right set of output words, particularly for semantically non-compositional linguistic phenomena such as idiomatic expressions, leading to reduced translation quality as perceived by humans. Trading off latency for quality by increasing the size of the allowed set is often not an option in real-world scenarios. We propose a model of vocabulary selection, integrated into the neural translation model, that predicts the set of allowed output words from contextualized encoder representations. This restores translation quality of an unconstrained system, as measured by human evaluations on WMT newstest2020 and idiomatic expressions, at an inference latency competitive with alignment-based selection using aggressive thresholds, thereby removing the dependency on separately trained alignment models.


pdf bib
Improving the Quality Trade-Off for Neural Machine Translation Multi-Domain Adaptation
Eva Hasler | Tobias Domhan | Jonay Trenous | Ke Tran | Bill Byrne | Felix Hieber
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Building neural machine translation systems to perform well on a specific target domain is a well-studied problem. Optimizing system performance for multiple, diverse target domains however remains a challenge. We study this problem in an adaptation setting where the goal is to preserve the existing system quality while incorporating data for domains that were not the focus of the original translation system. We find that we can improve over the performance trade-off offered by Elastic Weight Consolidation with a relatively simple data mixing strategy. At comparable performance on the new domains, catastrophic forgetting is mitigated significantly on strong WMT baselines. Combining both approaches improves the Pareto frontier on this task.


pdf bib
The Sockeye 2 Neural Machine Translation Toolkit at AMTA 2020
Tobias Domhan | Michael Denkowski | David Vilar | Xing Niu | Felix Hieber | Kenneth Heafield
Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

pdf bib
Sockeye 2: A Toolkit for Neural Machine Translation
Felix Hieber | Tobias Domhan | Michael Denkowski | David Vilar
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

We present Sockeye 2, a modernized and streamlined version of the Sockeye neural machine translation (NMT) toolkit. New features include a simplified code base through the use of MXNet’s Gluon API, a focus on state of the art model architectures, and distributed mixed precision training. These improvements result in faster training and inference, higher automatic metric scores, and a shorter path from research to production.


pdf bib
The Sockeye Neural Machine Translation Toolkit at AMTA 2018
Felix Hieber | Tobias Domhan | Michael Denkowski | David Vilar | Artem Sokolov | Ann Clifton | Matt Post
Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

pdf bib
How Much Attention Do You Need? A Granular Analysis of Neural Machine Translation Architectures
Tobias Domhan
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

With recent advances in network architectures for Neural Machine Translation (NMT) recurrent models have effectively been replaced by either convolutional or self-attentional approaches, such as in the Transformer. While the main innovation of the Transformer architecture is its use of self-attentional layers, there are several other aspects, such as attention with multiple heads and the use of many attention layers, that distinguish the model from previous baselines. In this work we take a fine-grained look at the different architectures for NMT. We introduce an Architecture Definition Language (ADL) allowing for a flexible combination of common building blocks. Making use of this language we show in experiments that one can bring recurrent and convolutional models very close to the Transformer performance by borrowing concepts from the Transformer architecture, but not using self-attention. Additionally, we find that self-attention is much more important on the encoder side than on the decoder side, where it can be replaced by a RNN or CNN without a loss in performance in most settings. Surprisingly, even a model without any target side self-attention performs well.


pdf bib
Using Target-side Monolingual Data for Neural Machine Translation through Multi-task Learning
Tobias Domhan | Felix Hieber
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

The performance of Neural Machine Translation (NMT) models relies heavily on the availability of sufficient amounts of parallel data, and an efficient and effective way of leveraging the vastly available amounts of monolingual data has yet to be found. We propose to modify the decoder in a neural sequence-to-sequence model to enable multi-task learning for two strongly related tasks: target-side language modeling and translation. The decoder predicts the next target word through two channels, a target-side language model on the lowest layer, and an attentional recurrent model which is conditioned on the source representation. This architecture allows joint training on both large amounts of monolingual and moderate amounts of bilingual data to improve NMT performance. Initial results in the news domain for three language pairs show moderate but consistent improvements over a baseline trained on bilingual data only.