Alham Fikri Aji

Also published as: Alham Fikri Aji


pdf bib
ParaCotta: Synthetic Multilingual Paraphrase Corpora from the Most Diverse Translation Sample Pair
Alham Fikri Aji | Radityo Eko Prasojo Tirana Noor Fatyanosa | Philip Arthur | Suci Fitriany | Salma Qonitah | Nadhifa Zulfa | Tomi Santoso | Mahendra Data
Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation

pdf bib
The University of Edinburgh’s Bengali-Hindi Submissions to the WMT21 News Translation Task
Proyag Pal | Alham Fikri Aji | Pinzhen Chen | Sukanta Sen
Proceedings of the Sixth Conference on Machine Translation

We describe the University of Edinburgh’s BengaliHindi constrained systems submitted to the WMT21 News Translation task. We submitted ensembles of Transformer models built with large-scale back-translation and fine-tuned on subsets of training data retrieved based on similarity to the target domain.

pdf bib
Efficient Machine Translation with Model Pruning and Quantization
Maximiliana Behnke | Nikolay Bogoychev | Alham Fikri Aji | Kenneth Heafield | Graeme Nail | Qianqian Zhu | Svetlana Tchistiakova | Jelmer van der Linde | Pinzhen Chen | Sidharth Kashyap | Roman Grundkiewicz
Proceedings of the Sixth Conference on Machine Translation

We participated in all tracks of the WMT 2021 efficient machine translation task: single-core CPU, multi-core CPU, and GPU hardware with throughput and latency conditions. Our submissions combine several efficiency strategies: knowledge distillation, a simpler simple recurrent unit (SSRU) decoder with one or two layers, lexical shortlists, smaller numerical formats, and pruning. For the CPU track, we used quantized 8-bit models. For the GPU track, we experimented with FP16 and 8-bit integers in tensorcores. Some of our submissions optimize for size via 4-bit log quantization and omitting a lexical shortlist. We have extended pruning to more parts of the network, emphasizing component- and block-level pruning that actually improves speed unlike coefficient-wise pruning.

pdf bib
IndoCollex: A Testbed for Morphological Transformation of Indonesian Colloquial Words
Haryo Akbarianto Wibowo | Made Nindyatama Nityasya | Afra Feyza Akyürek | Suci Fitriany | Alham Fikri Aji | Radityo Eko Prasojo | Derry Tanti Wijaya
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib
BERT Goes Brrr: A Venture Towards the Lesser Error in Classifying Medical Self-Reporters on Twitter
Alham Fikri Aji | Made Nindyatama Nityasya | Haryo Akbarianto Wibowo | Radityo Eko Prasojo | Tirana Fatyanosa
Proceedings of the Sixth Social Media Mining for Health (#SMM4H) Workshop and Shared Task

This paper describes our team’s submission for the Social Media Mining for Health (SMM4H) 2021 shared task. We participated in three subtasks: Classifying adverse drug effect, COVID-19 self-report, and COVID-19 symptoms. Our system is based on BERT model pre-trained on the domain-specific text. In addition, we perform data cleaning and augmentation, as well as hyperparameter optimization and model ensemble to further boost the BERT performance. We achieved the first rank in both classifying adverse drug effects and COVID-19 self-report tasks.

pdf bib
IndoNLI: A Natural Language Inference Dataset for Indonesian
Rahmad Mahendra | Alham Fikri Aji | Samuel Louvan | Fahrurrozi Rahman | Clara Vania
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

We present IndoNLI, the first human-elicited NLI dataset for Indonesian. We adapt the data collection protocol for MNLI and collect ~18K sentence pairs annotated by crowd workers and experts. The expert-annotated data is used exclusively as a test set. It is designed to provide a challenging test-bed for Indonesian NLI by explicitly incorporating various linguistic phenomena such as numerical reasoning, structural changes, idioms, or temporal and spatial reasoning. Experiment results show that XLM-R outperforms other pre-trained models in our data. The best performance on the expert-annotated data is still far below human performance (13.4% accuracy gap), suggesting that this test set is especially challenging. Furthermore, our analysis shows that our expert-annotated data is more diverse and contains fewer annotation artifacts than the crowd-annotated data. We hope this dataset can help accelerate progress in Indonesian NLP research.


pdf bib
Benchmarking Multidomain English-Indonesian Machine Translation
Tri Wahyu Guntara | Alham Fikri Aji | Radityo Eko Prasojo
Proceedings of the 13th Workshop on Building and Using Comparable Corpora

In the context of Machine Translation (MT) from-and-to English, Bahasa Indonesia has been considered a low-resource language, and therefore applying Neural Machine Translation (NMT) which typically requires large training dataset proves to be problematic. In this paper, we show otherwise by collecting large, publicly-available datasets from the Web, which we split into several domains: news, religion, general, and conversation, to train and benchmark some variants of transformer-based NMT models across the domains. We show using BLEU that our models perform well across them , outperform the baseline Statistical Machine Translation (SMT) models, and perform comparably with Google Translate. Our datasets (with the standard split for training, validation, and testing), code, and models are available on

pdf bib
Compressing Neural Machine Translation Models with 4-bit Precision
Alham Fikri Aji | Kenneth Heafield
Proceedings of the Fourth Workshop on Neural Generation and Translation

Neural Machine Translation (NMT) is resource-intensive. We design a quantization procedure to compress fit NMT models better for devices with limited hardware capability. We use logarithmic quantization, instead of the more commonly used fixed-point quantization, based on the empirical fact that parameters distribution is not uniform. We find that biases do not take a lot of memory and show that biases can be left uncompressed to improve the overall quality without affecting the compression rate. We also propose to use an error-feedback mechanism during retraining, to preserve the compressed model as a stale gradient. We empirically show that NMT models based on Transformer or RNN architecture can be compressed up to 4-bit precision without any noticeable quality degradation. Models can be compressed up to binary precision, albeit with lower quality. RNN architecture seems to be more robust towards compression, compared to the Transformer.

pdf bib
Edinburgh’s Submissions to the 2020 Machine Translation Efficiency Task
Nikolay Bogoychev | Roman Grundkiewicz | Alham Fikri Aji | Maximiliana Behnke | Kenneth Heafield | Sidharth Kashyap | Emmanouil-Ioannis Farsarakis | Mateusz Chudyk
Proceedings of the Fourth Workshop on Neural Generation and Translation

We participated in all tracks of the Workshop on Neural Generation and Translation 2020 Efficiency Shared Task: single-core CPU, multi-core CPU, and GPU. At the model level, we use teacher-student training with a variety of student sizes, tie embeddings and sometimes layers, use the Simpler Simple Recurrent Unit, and introduce head pruning. On GPUs, we used 16-bit floating-point tensor cores. On CPUs, we customized 8-bit quantization and multiple processes with affinity for the multi-core setting. To reduce model size, we experimented with 4-bit log quantization but use floats at runtime. In the shared task, most of our submissions were Pareto optimal with respect the trade-off between time and quality.

pdf bib
In Neural Machine Translation, What Does Transfer Learning Transfer?
Alham Fikri Aji | Nikolay Bogoychev | Kenneth Heafield | Rico Sennrich
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Transfer learning improves quality for low-resource machine translation, but it is unclear what exactly it transfers. We perform several ablation studies that limit information transfer, then measure the quality impact across three language pairs to gain a black-box understanding of transfer learning. Word embeddings play an important role in transfer learning, particularly if they are properly aligned. Although transfer learning can be performed without embeddings, results are sub-optimal. In contrast, transferring only the embeddings but nothing else yields catastrophic results. We then investigate diagonal alignments with auto-encoders over real languages and randomly generated sequences, finding even randomly generated sequences as parents yield noticeable but smaller gains. Finally, transfer learning can eliminate the need for a warm-up phase when training transformer models in high resource language pairs.


pdf bib
Combining Global Sparse Gradients with Local Gradients in Distributed Neural Network Training
Alham Fikri Aji | Kenneth Heafield | Nikolay Bogoychev
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

One way to reduce network traffic in multi-node data-parallel stochastic gradient descent is to only exchange the largest gradients. However, doing so damages the gradient and degrades the model’s performance. Transformer models degrade dramatically while the impact on RNNs is smaller. We restore gradient quality by combining the compressed global gradient with the node’s locally computed uncompressed gradient. Neural machine translation experiments show that Transformer convergence is restored while RNNs converge faster. With our method, training on 4 nodes converges up to 1.5x as fast as with uncompressed gradients and scales 3.5x relative to single-node training.

pdf bib
Making Asynchronous Stochastic Gradient Descent Work for Transformers
Alham Fikri Aji | Kenneth Heafield
Proceedings of the 3rd Workshop on Neural Generation and Translation

Asynchronous stochastic gradient descent (SGD) converges poorly for Transformer models, so synchronous SGD has become the norm for Transformer training. This is unfortunate because asynchronous SGD is faster at raw training speed since it avoids waiting for synchronization. Moreover, the Transformer model is the basis for state-of-the-art models for several tasks, including machine translation, so training speed matters. To understand why asynchronous SGD under-performs, we blur the lines between asynchronous and synchronous methods. We find that summing several asynchronous updates, rather than applying them immediately, restores convergence behavior. With this method, the Transformer attains the same BLEU score 1.36 times as fast.

pdf bib
From Research to Production and Back: Ludicrously Fast Neural Machine Translation
Young Jin Kim | Marcin Junczys-Dowmunt | Hany Hassan | Alham Fikri Aji | Kenneth Heafield | Roman Grundkiewicz | Nikolay Bogoychev
Proceedings of the 3rd Workshop on Neural Generation and Translation

This paper describes the submissions of the “Marian” team to the WNGT 2019 efficiency shared task. Taking our dominating submissions to the previous edition of the shared task as a starting point, we develop improved teacher-student training via multi-agent dual-learning and noisy backward-forward translation for Transformer-based student models. For efficient CPU-based decoding, we propose pre-packed 8-bit matrix products, improved batched decoding, cache-friendly student architectures with parameter sharing and light-weight RNN-based decoder architectures. GPU-based decoding benefits from the same architecture changes, from pervasive 16-bit inference and concurrent streams. These modifications together with profiler-based C++ code optimization allow us to push the Pareto frontier established during the 2018 edition towards 24x (CPU) and 14x (GPU) faster models at comparable or higher BLEU values. Our fastest CPU model is more than 4x faster than last year’s fastest submission at more than 3 points higher BLEU. Our fastest GPU model at 1.5 seconds translation time is slightly faster than last year’s fastest RNN-based submissions, but outperforms them by more than 4 BLEU and 10 BLEU points respectively.


pdf bib
Accelerating Asynchronous Stochastic Gradient Descent for Neural Machine Translation
Nikolay Bogoychev | Kenneth Heafield | Alham Fikri Aji | Marcin Junczys-Dowmunt
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

In order to extract the best possible performance from asynchronous stochastic gradient descent one must increase the mini-batch size and scale the learning rate accordingly. In order to achieve further speedup we introduce a technique that delays gradient updates effectively increasing the mini-batch size. Unfortunately with the increase of mini-batch size we worsen the stale gradient problem in asynchronous stochastic gradient descent (SGD) which makes the model convergence poor. We introduce local optimizers which mitigate the stale gradient problem and together with fine tuning our momentum we are able to train a shallow machine translation system 27% faster than an optimized baseline with negligible penalty in BLEU.

pdf bib
Marian: Fast Neural Machine Translation in C++
Marcin Junczys-Dowmunt | Roman Grundkiewicz | Tomasz Dwojak | Hieu Hoang | Kenneth Heafield | Tom Neckermann | Frank Seide | Ulrich Germann | Alham Fikri Aji | Nikolay Bogoychev | André F. T. Martins | Alexandra Birch
Proceedings of ACL 2018, System Demonstrations

We present Marian, an efficient and self-contained Neural Machine Translation framework with an integrated automatic differentiation engine based on dynamic computation graphs. Marian is written entirely in C++. We describe the design of the encoder-decoder framework and demonstrate that a research-friendly toolkit can achieve high training and translation speed.


pdf bib
Sparse Communication for Distributed Gradient Descent
Alham Fikri Aji | Kenneth Heafield
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

We make distributed stochastic gradient descent faster by exchanging sparse updates instead of dense updates. Gradient updates are positively skewed as most updates are near zero, so we map the 99% smallest updates (by absolute value) to zero then exchange sparse matrices. This method can be combined with quantization to further improve the compression. We explore different configurations and apply them to neural machine translation and MNIST image classification tasks. Most configurations work on MNIST, whereas different configurations reduce convergence rate on the more complex translation task. Our experiments show that we can achieve up to 49% speed up on MNIST and 22% on NMT without damaging the final accuracy or BLEU.