Nikolay Bogoychev


pdf bib
Low-Rank Softmax Can Have Unargmaxable Classes in Theory but Rarely in Practice
Andreas Grivas | Nikolay Bogoychev | Adam Lopez
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Classifiers in natural language processing (NLP) often have a large number of output classes. For example, neural language models (LMs) and machine translation (MT) models both predict tokens from a vocabulary of thousands. The Softmax output layer of these models typically receives as input a dense feature representation, which has much lower dimensionality than the output. In theory, the result is some words may be impossible to be predicted via argmax, irrespective of input features, and empirically, there is evidence this happens in small language models (Demeter et al., 2020). In this paper we ask whether it can happen in practical large language models and translation models. To do so, we develop algorithms to detect such unargmaxable tokens in public models. We find that 13 out of 150 models do indeed have such tokens; however, they are very infrequent and unlikely to impact model quality. We release our algorithms and code to the public.


pdf bib
TranslateLocally: Blazing-fast translation running on the local CPU
Nikolay Bogoychev | Jelmer Van der Linde | Kenneth Heafield
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Every day, millions of people sacrifice their privacy and browsing habits in exchange for online machine translation. Companies and governments with confidentiality requirements often ban online translation or pay a premium to disable logging. To bring control back to the end user and demonstrate speed, we developed translateLocally. Running locally on a desktop or laptop CPU, translateLocally delivers cloud-like translation speed and quality even on 10 year old hardware. The open-source software is based on Marian and runs on Linux, Windows, and macOS.

pdf bib
The University of Edinburgh’s English-German and English-Hausa Submissions to the WMT21 News Translation Task
Pinzhen Chen | Jindřich Helcl | Ulrich Germann | Laurie Burchell | Nikolay Bogoychev | Antonio Valerio Miceli Barone | Jonas Waldendorf | Alexandra Birch | Kenneth Heafield
Proceedings of the Sixth Conference on Machine Translation

This paper presents the University of Edinburgh’s constrained submissions of English-German and English-Hausa systems to the WMT 2021 shared task on news translation. We build En-De systems in three stages: corpus filtering, back-translation, and fine-tuning. For En-Ha we use an iterative back-translation approach on top of pre-trained En-De models and investigate vocabulary embedding mapping.

pdf bib
Efficient Machine Translation with Model Pruning and Quantization
Maximiliana Behnke | Nikolay Bogoychev | Alham Fikri Aji | Kenneth Heafield | Graeme Nail | Qianqian Zhu | Svetlana Tchistiakova | Jelmer van der Linde | Pinzhen Chen | Sidharth Kashyap | Roman Grundkiewicz
Proceedings of the Sixth Conference on Machine Translation

We participated in all tracks of the WMT 2021 efficient machine translation task: single-core CPU, multi-core CPU, and GPU hardware with throughput and latency conditions. Our submissions combine several efficiency strategies: knowledge distillation, a simpler simple recurrent unit (SSRU) decoder with one or two layers, lexical shortlists, smaller numerical formats, and pruning. For the CPU track, we used quantized 8-bit models. For the GPU track, we experimented with FP16 and 8-bit integers in tensorcores. Some of our submissions optimize for size via 4-bit log quantization and omitting a lexical shortlist. We have extended pruning to more parts of the network, emphasizing component- and block-level pruning that actually improves speed unlike coefficient-wise pruning.

pdf bib
The Highs and Lows of Simple Lexical Domain Adaptation Approaches for Neural Machine Translation
Nikolay Bogoychev | Pinzhen Chen
Proceedings of the Second Workshop on Insights from Negative Results in NLP

Machine translation systems are vulnerable to domain mismatch, especially in a low-resource scenario. Out-of-domain translations are often of poor quality and prone to hallucinations, due to exposure bias and the decoder acting as a language model. We adopt two approaches to alleviate this problem: lexical shortlisting restricted by IBM statistical alignments, and hypothesis reranking based on similarity. The methods are computationally cheap and show success on low-resource out-of-domain test sets. However, the methods lose advantage when there is sufficient data or too great domain mismatch. This is due to both the IBM model losing its advantage over the implicitly learned neural alignment, and issues with subword segmentation of unseen words.

pdf bib
Not all parameters are born equal: Attention is mostly what you need
Nikolay Bogoychev
Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

Transformers are widely used in state-of-the-art machine translation, but the key to their success is still unknown. To gain insight into this, we consider three groups of parameters: embeddings, attention, and Feed-Forward Neural network (FFN) layers. We examine the relative importance of each by performing an ablation study where we initialise them at random and freeze them, so that their weights do not change over the course of the training. Through this, we show that the attention and FFN are equally important and fulfil the same functionality in a model. We show that the decision about whether a component is frozen or allowed to train is at least as important for the final model performance as its number of parameters. At the same time, the number of parameters alone is not indicative of a component’s importance. Finally, while the embedding layer is the least essential for machine translation tasks, it is the most important component for language modelling tasks.


pdf bib
Parallel Sentence Mining by Constrained Decoding
Pinzhen Chen | Nikolay Bogoychev | Kenneth Heafield | Faheem Kirefu
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We present a novel method to extract parallel sentences from two monolingual corpora, using neural machine translation. Our method relies on translating sentences in one corpus, but constraining the decoding by a prefix tree built on the other corpus. We argue that a neural machine translation system by itself can be a sentence similarity scorer and it efficiently approximates pairwise comparison with a modified beam search. When benchmarked on the BUCC shared task, our method achieves results comparable to other submissions.

pdf bib
In Neural Machine Translation, What Does Transfer Learning Transfer?
Alham Fikri Aji | Nikolay Bogoychev | Kenneth Heafield | Rico Sennrich
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Transfer learning improves quality for low-resource machine translation, but it is unclear what exactly it transfers. We perform several ablation studies that limit information transfer, then measure the quality impact across three language pairs to gain a black-box understanding of transfer learning. Word embeddings play an important role in transfer learning, particularly if they are properly aligned. Although transfer learning can be performed without embeddings, results are sub-optimal. In contrast, transferring only the embeddings but nothing else yields catastrophic results. We then investigate diagonal alignments with auto-encoders over real languages and randomly generated sequences, finding even randomly generated sequences as parents yield noticeable but smaller gains. Finally, transfer learning can eliminate the need for a warm-up phase when training transformer models in high resource language pairs.

pdf bib
Character Mapping and Ad-hoc Adaptation: Edinburgh’s IWSLT 2020 Open Domain Translation System
Pinzhen Chen | Nikolay Bogoychev | Ulrich Germann
Proceedings of the 17th International Conference on Spoken Language Translation

This paper describes the University of Edinburgh’s neural machine translation systems submitted to the IWSLT 2020 open domain JapaneseChinese translation task. On top of commonplace techniques like tokenisation and corpus cleaning, we explore character mapping and unsupervised decoding-time adaptation. Our techniques focus on leveraging the provided data, and we show the positive impact of each technique through the gradual improvement of BLEU.

pdf bib
Edinburgh’s Submissions to the 2020 Machine Translation Efficiency Task
Nikolay Bogoychev | Roman Grundkiewicz | Alham Fikri Aji | Maximiliana Behnke | Kenneth Heafield | Sidharth Kashyap | Emmanouil-Ioannis Farsarakis | Mateusz Chudyk
Proceedings of the Fourth Workshop on Neural Generation and Translation

We participated in all tracks of the Workshop on Neural Generation and Translation 2020 Efficiency Shared Task: single-core CPU, multi-core CPU, and GPU. At the model level, we use teacher-student training with a variety of student sizes, tie embeddings and sometimes layers, use the Simpler Simple Recurrent Unit, and introduce head pruning. On GPUs, we used 16-bit floating-point tensor cores. On CPUs, we customized 8-bit quantization and multiple processes with affinity for the multi-core setting. To reduce model size, we experimented with 4-bit log quantization but use floats at runtime. In the shared task, most of our submissions were Pareto optimal with respect the trade-off between time and quality.

pdf bib
Speed-optimized, Compact Student Models that Distill Knowledge from a Larger Teacher Model: the UEDIN-CUNI Submission to the WMT 2020 News Translation Task
Ulrich Germann | Roman Grundkiewicz | Martin Popel | Radina Dobreva | Nikolay Bogoychev | Kenneth Heafield
Proceedings of the Fifth Conference on Machine Translation

We describe the joint submission of the University of Edinburgh and Charles University, Prague, to the Czech/English track in the WMT 2020 Shared Task on News Translation. Our fast and compact student models distill knowledge from a larger, slower teacher. They are designed to offer a good trade-off between translation quality and inference efficiency. On the WMT 2020 Czech ↔ English test sets, they achieve translation speeds of over 700 whitespace-delimited source words per second on a single CPU thread, thus making neural translation feasible on consumer hardware without a GPU.


pdf bib
Combining Global Sparse Gradients with Local Gradients in Distributed Neural Network Training
Alham Fikri Aji | Kenneth Heafield | Nikolay Bogoychev
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

One way to reduce network traffic in multi-node data-parallel stochastic gradient descent is to only exchange the largest gradients. However, doing so damages the gradient and degrades the model’s performance. Transformer models degrade dramatically while the impact on RNNs is smaller. We restore gradient quality by combining the compressed global gradient with the node’s locally computed uncompressed gradient. Neural machine translation experiments show that Transformer convergence is restored while RNNs converge faster. With our method, training on 4 nodes converges up to 1.5x as fast as with uncompressed gradients and scales 3.5x relative to single-node training.

pdf bib
From Research to Production and Back: Ludicrously Fast Neural Machine Translation
Young Jin Kim | Marcin Junczys-Dowmunt | Hany Hassan | Alham Fikri Aji | Kenneth Heafield | Roman Grundkiewicz | Nikolay Bogoychev
Proceedings of the 3rd Workshop on Neural Generation and Translation

This paper describes the submissions of the “Marian” team to the WNGT 2019 efficiency shared task. Taking our dominating submissions to the previous edition of the shared task as a starting point, we develop improved teacher-student training via multi-agent dual-learning and noisy backward-forward translation for Transformer-based student models. For efficient CPU-based decoding, we propose pre-packed 8-bit matrix products, improved batched decoding, cache-friendly student architectures with parameter sharing and light-weight RNN-based decoder architectures. GPU-based decoding benefits from the same architecture changes, from pervasive 16-bit inference and concurrent streams. These modifications together with profiler-based C++ code optimization allow us to push the Pareto frontier established during the 2018 edition towards 24x (CPU) and 14x (GPU) faster models at comparable or higher BLEU values. Our fastest CPU model is more than 4x faster than last year’s fastest submission at more than 3 points higher BLEU. Our fastest GPU model at 1.5 seconds translation time is slightly faster than last year’s fastest RNN-based submissions, but outperforms them by more than 4 BLEU and 10 BLEU points respectively.

pdf bib
Similar Minds Post Alike: Assessment of Suicide Risk Using a Hybrid Model
Lushi Chen | Abeer Aldayel | Nikolay Bogoychev | Tao Gong
Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology

This paper describes our system submission for the CLPsych 2019 shared task B on suicide risk assessment. We approached the problem with three separate models: a behaviour model; a language model and a hybrid model. For the behavioral model approach, we model each user’s behaviour and thoughts with four groups of features: posting behaviour, sentiment, motivation, and content of the user’s posting. We use these features as an input in a support vector machine (SVM). For the language model approach, we trained a language model for each risk level using all the posts from the users as the training corpora. Then, we computed the perplexity of each user’s posts to determine how likely his/her posts were to belong to each risk level. Finally, we built a hybrid model that combines both the language model and the behavioral model, which demonstrates the best performance in detecting the suicide risk level.

pdf bib
The University of Edinburgh’s Submissions to the WMT19 News Translation Task
Rachel Bawden | Nikolay Bogoychev | Ulrich Germann | Roman Grundkiewicz | Faheem Kirefu | Antonio Valerio Miceli Barone | Alexandra Birch
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

The University of Edinburgh participated in the WMT19 Shared Task on News Translation in six language directions: English↔Gujarati, English↔Chinese, German→English, and English→Czech. For all translation directions, we created or used back-translations of monolingual data in the target language as additional synthetic training data. For English↔Gujarati, we also explored semi-supervised MT with cross-lingual language model pre-training, and translation pivoting through Hindi. For translation to and from Chinese, we investigated character-based tokenisation vs. sub-word segmentation of Chinese text. For German→English, we studied the impact of vast amounts of back-translated training data on translation quality, gaining a few additional insights over Edunov et al. (2018). For English→Czech, we compared different preprocessing and tokenisation regimes.


pdf bib
The University of Edinburgh’s Submissions to the WMT18 News Translation Task
Barry Haddow | Nikolay Bogoychev | Denis Emelin | Ulrich Germann | Roman Grundkiewicz | Kenneth Heafield | Antonio Valerio Miceli Barone | Rico Sennrich
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

The University of Edinburgh made submissions to all 14 language pairs in the news translation task, with strong performances in most pairs. We introduce new RNN-variant, mixed RNN/Transformer ensembles, data selection and weighting, and extensions to back-translation.

pdf bib
Accelerating Asynchronous Stochastic Gradient Descent for Neural Machine Translation
Nikolay Bogoychev | Kenneth Heafield | Alham Fikri Aji | Marcin Junczys-Dowmunt
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

In order to extract the best possible performance from asynchronous stochastic gradient descent one must increase the mini-batch size and scale the learning rate accordingly. In order to achieve further speedup we introduce a technique that delays gradient updates effectively increasing the mini-batch size. Unfortunately with the increase of mini-batch size we worsen the stale gradient problem in asynchronous stochastic gradient descent (SGD) which makes the model convergence poor. We introduce local optimizers which mitigate the stale gradient problem and together with fine tuning our momentum we are able to train a shallow machine translation system 27% faster than an optimized baseline with negligible penalty in BLEU.

pdf bib
Marian: Fast Neural Machine Translation in C++
Marcin Junczys-Dowmunt | Roman Grundkiewicz | Tomasz Dwojak | Hieu Hoang | Kenneth Heafield | Tom Neckermann | Frank Seide | Ulrich Germann | Alham Fikri Aji | Nikolay Bogoychev | André F. T. Martins | Alexandra Birch
Proceedings of ACL 2018, System Demonstrations

We present Marian, an efficient and self-contained Neural Machine Translation framework with an integrated automatic differentiation engine based on dynamic computation graphs. Marian is written entirely in C++. We describe the design of the encoder-decoder framework and demonstrate that a research-friendly toolkit can achieve high training and translation speed.


pdf bib
N-gram language models for massively parallel devices
Nikolay Bogoychev | Adam Lopez
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Fast, Scalable Phrase-Based SMT Decoding
Hieu Hoang | Nikolay Bogoychev | Lane Schwartz | Marcin Junczys-Dowmunt
Conferences of the Association for Machine Translation in the Americas: MT Researchers' Track

The utilization of statistical machine translation (SMT) has grown enormously over the last decade, many using open-source software developed by the NLP community. As commercial use has increased, there is need for software that is optimized for commercial requirements, in particular, fast phrase-based decoding and more efficient utilization of modern multicore servers. In this paper we re-examine the major components of phrase-based decoding and decoder implementation with particular emphasis on speed and scalability on multicore machines. The result is a drop-in replacement for the Moses decoder which is up to fifteen times faster and scales monotonically with the number of cores.

pdf bib
Fast and highly parallelizable phrase table for statistical machine translation
Nikolay Bogoychev | Hieu Hoang
Proceedings of the First Conference on Machine Translation: Volume 1, Research Papers


pdf bib
The Edinburgh/JHU Phrase-based Machine Translation Systems for WMT 2015
Barry Haddow | Matthias Huck | Alexandra Birch | Nikolay Bogoychev | Philipp Koehn
Proceedings of the Tenth Workshop on Statistical Machine Translation


pdf bib
Edinburgh SLT and MT system description for the IWSLT 2014 evaluation
Alexandra Birch | Matthias Huck | Nadir Durrani | Nikolay Bogoychev | Philipp Koehn
Proceedings of the 11th International Workshop on Spoken Language Translation: Evaluation Campaign

This paper describes the University of Edinburgh’s spoken language translation (SLT) and machine translation (MT) systems for the IWSLT 2014 evaluation campaign. In the SLT track, we participated in the German↔English and English→French tasks. In the MT track, we participated in the German↔English, English→French, Arabic↔English, Farsi→English, Hebrew→English, Spanish↔English, and Portuguese-Brazil↔English tasks. For our SLT submissions, we experimented with comparing operation sequence models with bilingual neural network language models. For our MT submissions, we explored using unsupervised transliteration for languages which have a different script than English, in particular for Arabic, Farsi, and Hebrew. We also investigated syntax-based translation and system combination.