Pre-training Methods for Neural Machine Translation

This tutorial provides a comprehensive guide to making the most of pre-training for neural machine translation. First, we will briefly introduce the background of NMT and pre-training methodology, and point out the main challenges of applying pre-training to NMT. Then we will focus on analysing the role of pre-training in enhancing NMT performance, how to design a better pre-training model for specific NMT tasks, and how to better integrate the pre-trained model into the NMT system. In each part, we will provide examples, discuss training techniques and analyse what is transferred when applying pre-training.

Most existing pre-training methods focus on text representation; how to leverage pre-training methods to improve speech translation therefore becomes a new challenge.
The first topic is monolingual pre-training for NMT, which is one of the most well-studied areas. Monolingual text representations such as ELMo, GPT, MASS and BERT significantly boost the performance of various natural language processing tasks (Peters et al., 2018; Devlin et al., 2019; Radford et al., 2019; Song et al., 2019). However, NMT has several distinct characteristics, such as the availability of large training data (10 million sentence pairs or more) and the high capacity of baseline NMT models, which require careful design of pre-training. In this part, we will introduce different pre-training methods and analyse best practice when applying them to different machine translation scenarios, such as unsupervised NMT, low-resource NMT and rich-resource NMT (Zhu et al., 2020). We will cover techniques to fine-tune the pre-trained models with various strategies, such as knowledge distillation and adapters (Bapna and Firat, 2019; Liang et al., 2021).
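To make the adapter strategy above concrete, the following is a minimal PyTorch sketch of a bottleneck adapter placed on top of a frozen pre-trained encoder; the hidden size, bottleneck size and module names are illustrative assumptions rather than details from the cited papers.

    import torch
    import torch.nn as nn

    class Adapter(nn.Module):
        """Bottleneck adapter: down-project, non-linearity, up-project, residual."""
        def __init__(self, hidden_size=512, bottleneck_size=64):
            super().__init__()
            self.down = nn.Linear(hidden_size, bottleneck_size)
            self.up = nn.Linear(bottleneck_size, hidden_size)
            self.activation = nn.ReLU()

        def forward(self, x):
            # The residual connection preserves the frozen pre-trained representation.
            return x + self.up(self.activation(self.down(x)))

    # Example: freeze a (here randomly initialised, stand-in) pre-trained encoder
    # and train only the lightweight adapter parameters.
    pretrained_encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=512, nhead=8), num_layers=6)
    for param in pretrained_encoder.parameters():
        param.requires_grad = False

    adapter = Adapter()
    hidden_states = pretrained_encoder(torch.randn(10, 2, 512))  # (seq_len, batch, hidden)
    adapted = adapter(hidden_states)

In this setting, only the adapter (and the NMT decoder) are updated during fine-tuning, which keeps the number of trainable parameters small while reusing the pre-trained representation.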
The next topic is multi-lingual pre-training for NMT. In this context, we aim to mitigate the English-centric bias and suggest that it is possible to build universal representations for different languages to improve massively multi-lingual NMT. In this part, we will discuss the general representation of different languages and analyse how knowledge transfers across languages. This will allow a better design of multi-lingual pre-training, in particular for zero-shot transfer to non-English language pairs (Johnson et al., 2017; Qi et al., 2018; Conneau and Lample, 2019; Pires et al., 2019; Huang et al., 2019; Lin et al., 2020; Pan et al., 2021).

The last technical part of this tutorial deals with pre-training for speech NMT. In particular, we focus on leveraging weakly supervised or unsupervised training data to improve speech translation. In this part, we will discuss the possibilities of building general representations across speech and text, and show how text or audio pre-training can guide the text generation of NMT (Wang et al., 2019; Liu et al., 2019b; Bansal et al., 2019; Baevski et al., 2020a,b; Huang et al., 2021; Long et al., 2021; Dong et al., 2021b,a; Han et al., 2021; Ye et al., 2021).
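As a concrete illustration of audio pre-training for speech translation, the sketch below uses the HuggingFace transformers implementation of wav2vec 2.0 (Baevski et al., 2020a) to encode raw audio into contextual features that a downstream translation decoder could attend to; the checkpoint name and the placeholder waveform are assumptions for illustration only.

    import torch
    from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor

    # Load a speech encoder pre-trained with self-supervision on unlabelled audio.
    extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
    encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

    # A 16 kHz waveform; one second of silence is used here as a placeholder input.
    raw_audio = torch.zeros(16000)
    inputs = extractor(raw_audio.numpy(), sampling_rate=16000, return_tensors="pt")

    with torch.no_grad():
        features = encoder(**inputs).last_hidden_state  # (batch, frames, hidden)

    # `features` can serve as encoder states for a speech translation decoder.

The pre-trained speech encoder replaces (or initialises) the acoustic encoder of a speech translation model, so that the scarce speech-to-text parallel data is only needed to train the translation-specific components.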
We conclude the tutorial by pointing out best practice when applying pre-training for NMT. The topics cover a variety of pre-training methods for different NMT scenarios. After this tutorial, the audience will understand why pre-training for NMT is different from other tasks and how to make the most of pre-training for NMT. Importantly, we will give an in-depth analysis of how and why pre-training works in NMT, which will inspire future work on designing pre-training paradigms specific to NMT.

Type of Tutorial
Cutting-edge. In this tutorial, we will discuss the most advanced techniques of pre-training for neural machine translation. The instructors will also present their own practical experience in enhancing a machine translation service as a product, which is usually not found in papers.

Tutorial Breadth
Based on the representative set of papers listed in the selected bibliography, we anticipate that 70%-80% of the tutorial will cover other researchers' work, while the rest concerns work in which at least one of the presenters has been actively involved. We will introduce several important works related to monolingual, multi-lingual and multi-modal pre-training for NMT.

Diversity
In the tutorial, some of the multilingual pre-training methods presented scale to 50 to 100 different languages. Researchers working on diverse language pairs may find this tutorial relevant and useful.

Prerequisites
The tutorial is self-contained. We will cover the background, the technical details and examples. Basic knowledge of neural networks is required, including word embeddings, attention, and encoder-decoder models. Prior NLP courses and familiarity with the machine translation task are preferred.
It is recommended (but optional) that the audience read the following papers before the tutorial: 1. Basic MT model: Attention Is All You Need (Vaswani et al., 2017).

Target Audience
This tutorial will be suitable for researchers and practitioners interested in pre-training applications and multilingual NLP, especially for machine translation.
To the best of our knowledge, this is the first tutorial that focuses on the pre-training methods and practice for NMT.

Technical Requirements
The tutorial will be online. An Internet connection and a suitable live video device are needed.

Tutorial Presenters
Mingxuan Wang (ByteDance AI Lab) Google Scholar Dr. Mingxuan Wang is a senior researcher at ByteDance AI Lab. He received his PhD degree from the Institute of Computing Technology, Chinese Academy of Sciences, in 2017. His research focuses on natural language processing and machine translation. He has published over 20 papers in leading NLP/AI journals and conferences such as ACL, AAAI and EMNLP. He has served in the Program Committee for ACL/EMNLP 2016-2020, AAAI/IJCAI 2018/2019, and NeurIPS 2020. He has achieved outstanding results in various machine translation evaluation competitions, including first place in Chinese-to-English translation at WMT 2018 and third place in Chinese-to-English translation at NIST 2015. Together with Dr. Lei Li, he is leading a team developing the VolcTrans machine translation system.
He has given a tutorial on machine translation at CCMT 2017 and was a guest lecturer for the 2016 Machine Translation course at the University of Chinese Academy of Sciences (UCAS).
Lei Li (ByteDance AI Lab) https://lileicc.github.io/ Dr. Lei Li is Director of ByteDance AI Lab, leading research and product development for NLP, robotics, and drug discovery. His research interests are machine translation, speech translation, text generation, and AI-powered drug discovery. He received his B.S. from Shanghai Jiao Tong University and his Ph.D. from Carnegie Mellon University. His dissertation work on fast algorithms for mining co-evolving time series was awarded the ACM KDD best dissertation (runner-up). His recent work on the AI writer Xiaomingbot received the 2nd-class award of the Wu Wen-tsün AI Prize in 2017. He was named a CCF Distinguished Speaker in 2017 and received the CCF Young Elite award in 2019. His team won first place in five translation directions at WMT 2020 as well as the corpus filtering challenge. Before ByteDance, he worked at the EECS department of UC Berkeley and at Baidu's Institute of Deep Learning in Silicon Valley. He has served as an organizer and area chair/senior PC member for multiple conferences including KDD, EMNLP, NeurIPS, AAAI, IJCAI, and CIKM. He has published over 100 technical papers in ML, NLP and data mining and holds more than 10 patents. He started and is developing ByteDance's machine translation system, VolcTrans, and many of his algorithms have been deployed.
He has delivered four tutorials, at EMNLP 2019, NLPCC 2019, NLPCC 2016, and KDD 2010. He was a lecturer at the 2014 Probabilistic Programming for Advancing Machine Learning summer school in Portland, USA.
Prior Related Tutorials
Neural Machine Translation, presented by Thang Luong, Kyunghyun Cho, and Christopher Manning at ACL 2016. Our tutorial is related to but different from the ACL 2016 NMT tutorial: it focuses on pre-training methods for bilingual, multi-lingual, and multi-modal neural machine translation.
Unsupervised Cross-Lingual Representation Learning, presented by Sebastian Ruder, Anders Søgaard, and Ivan Vulić at ACL 2019. That tutorial is related in that it also concerns multi-lingual NLP; however, it focused on representation learning, while our tutorial focuses on neural machine translation.