Overview of the 8th Workshop on Asian Translation

This paper presents the results of the shared tasks from the 8th Workshop on Asian Translation (WAT2021). For WAT2021, 28 teams participated in the shared tasks and 24 teams submitted their translation results for the human evaluation. We also accepted 5 research papers. About 2,100 translation results were submitted to the automatic evaluation server, and selected submissions were manually evaluated.


Introduction
The Workshop on Asian Translation (WAT) is an open evaluation campaign focusing on Asian languages. Following the success of the previous workshops WAT2014-WAT2020 (Nakazawa et al., 2020), WAT2021 brings together machine translation researchers and users to try, evaluate, share and discuss brand-new ideas for machine translation. We have been working toward the practical use of machine translation among all Asian countries.
For the 8th WAT, we included the following new task:
• Malayalam Visual Genome Task: English → Malayalam multi-modal translation
All the tasks are explained in Section 2.
WAT is a unique workshop on Asian language translation with the following characteristics:
• Open innovation platform: Because the test data are fixed and open, translation systems can be evaluated repeatedly on the same datasets over the years. WAT receives submissions at any time; i.e., there is no submission deadline for translation results with respect to the automatic evaluation of translation quality.
• Domain and language pairs: WAT is the world's first workshop that targets the scientific paper domain and the Chinese↔Japanese and Korean↔Japanese language pairs.
• Evaluation method: Evaluation is done both automatically and manually. First, all submitted translation results are automatically evaluated using three metrics: BLEU, RIBES and AMFM. Selected translation results are then assessed by two kinds of human evaluation: pairwise evaluation and JPO adequacy evaluation.
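For reference, the first of these metrics, corpus-level BLEU, can be sketched as follows. This is a simplified illustration assuming a single reference per hypothesis; the official scores are computed with multi-bleu.perl, not with this code.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with one reference per hypothesis (token lists)."""
    matched = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_ngrams = ngram_counts(hyp, n)
            ref_ngrams = ngram_counts(ref, n)
            # clipped n-gram matches against the reference
            matched[n - 1] += sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
            total[n - 1] += max(len(hyp) - n + 1, 0)
    if min(matched) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # brevity penalty for hypotheses shorter than the references
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_precision)
```

A perfect hypothesis scores 1.0; any n-gram mismatch or length shortfall lowers the score.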

ASPEC+ParaNatCom Task
The traditional ASPEC translation tasks are sentence-level, and their translation quality seems to be saturated. We think it is high time to move on to document-level evaluation. For the first year, we use ParaNatCom 1 (a parallel English-Japanese abstract corpus made from Nature Communications articles) for the development and test sets of the Document-level Scientific Paper Translation subtask. We cannot provide a document-level training corpus, but participants can use ASPEC and any other extra resources.

Document-level Business Scene Dialogue Translation
There are many ready-to-use parallel corpora for training machine translation systems; however, most of them are in written language, such as web crawls, news commentary, patents, scientific papers and so on. Even though some parallel corpora are in spoken language, they are mostly spoken by a single person (TED Talks) or contain a lot of noise (OpenSubtitles). Most other MT evaluation campaigns likewise adopt written-language, monologue or noisy dialogue parallel corpora for their translation tasks. To move to a highly topical setting of translating dialogues evaluated at the document level, WAT uses the BSD Corpus 2 (The Business Scene Dialogue corpus) as the dataset, including training, development and test data, for the first time this year. Participants of this task must obtain a copy of the BSD corpus by themselves.

JPC Task
JPO Patent Corpus (JPC) for the patent tasks was constructed by the Japan Patent Office (JPO) in collaboration with NICT. The corpus consists of Chinese-Japanese, Korean-Japanese and English-Japanese patent descriptions whose International Patent Classification (IPC) sections are chemistry, electricity, mechanical engineering, and physics.

Lang   Train      Dev    DevTest  Test-N
zh-ja  1,000,000  2,000  2,000    5,204
ko-ja  1,000,000  2,000  2,000    5,230
en-ja  1,000,000  2,000  2,000    5,668

Lang   Test-N1  Test-N2  Test-N3  Test-EP
zh-ja  2,000    3,000    204      1,151
ko-ja  2,000    3,000    230      -
en-ja  2,000    3,000    668      -

Table 1: Statistics for JPC

At WAT2021, the patent task has two subtasks: a normal subtask and an expression pattern subtask. Both subtasks use common training, development and development-test data for each language pair. The normal subtask for the three language pairs uses four test datasets with different characteristics:
• test-N: the union of the following three sets;
• test-N1: patent documents from patent families published between 2011 and 2013;
• test-N2: patent documents from patent families published between 2016 and 2017; and
• test-N3: patent documents published between 2016 and 2017 where target sentences are manually created by translating source sentences.

Newswire (JIJI) Task
The JIJI Corpus for the newswire task consists of Japanese-English sentence pairs. In addition to the test set (test set I) that has been provided since WAT 2017, a test set (test set II) with document-level context has also been provided since WAT 2020. These test sets are as follows.

Test set I : A pair of test and reference sentences.
The references were automatically extracted from English newswire sentences and not manually checked. There are no context data.
Test set II : A pair of test and reference sentences and context data that are articles including test sentences. The references were automatically extracted from English newswire sentences and manually selected. Therefore, the quality of the references of test set II is better than that of test set I.
The statistics of JIJI Corpus are shown in Table 2.
The definition of data usage is shown in Table 3. Participants submit translation results for one or more of the test sets.
The sentence pairs in each dataset are identified in the same manner as for ASPEC, using the method of Utiyama and Isahara (2007).

ALT and UCSY Corpus
The parallel data for Myanmar-English translation tasks at WAT2021 consists of two corpora, the ALT corpus and UCSY corpus.
• The ALT corpus is a part of the Asian Language Treebank (ALT) project (Riza et al., 2016), consisting of twenty thousand Myanmar-English parallel sentences from news articles.
• The UCSY corpus (Yi Mon Shwe Sin and Khin Mar Soe, 2018) is constructed by the NLP Lab, University of Computer Studies, Yangon (UCSY), Myanmar. The corpus consists of 200 thousand Myanmar-English parallel sentences collected from different domains, including news articles and textbooks.
The ALT corpus has been manually segmented into words (Ding et al., 2018, 2019), while the UCSY corpus is unsegmented. A script to tokenize the Myanmar data into writing units is released with the data. The automatic evaluation of Myanmar translation results is based on the tokenized writing units, regardless of the segmented words in the ALT data. However, participants can make use of the segmentation in the ALT data in their own manner.
The detailed composition of training, development, and test data of the Myanmar-English translation tasks are listed in Table 4. Notice that both of the corpora have been modified from the data used in WAT2018.

NICT-SAP Task
In WAT2021, we decided to continue the WAT2020 task for joint multi-domain multilingual neural machine translation involving 4 low-resource Asian languages: Thai (Th), Hindi (Hi), Malay (Ms), Indonesian (Id). English (En) is the source or the target language for the translation directions being evaluated. The purpose of this task was to test the feasibility of multi-domain multilingual solutions for extremely low-resource language pairs and domains.
Naturally, the solutions could be one-to-many, many-to-one or many-to-many NMT models. The domains in question are Wikinews and IT (specifically, software documentation). The total number of evaluation directions is 16 (8 for each domain). There is very little clean and publicly available data for these domains and language pairs, and thus we encouraged participants to utilize not only the small Asian Language Treebank (ALT) parallel corpora (Thu et al., 2016) but also the parallel corpora from OPUS, 3 other WAT tasks (past and present) and WMT. 4 The ALT dataset contains 18,088 training, 1,000 development and 1,018 test sentences. As for the IT domain, we only provided evaluation (dev and test) corpora 5 (Buschbeck and Exel, 2020) and encouraged participants to consider the GNOME, UBUNTU and KDE corpora from OPUS. We also encouraged the use of monolingual corpora, expecting that they would be used to pre-train NMT models such as BART/mBART (Lewis et al., 2020; Liu et al., 2020). In Table 5 we give statistics of the aforementioned corpora, which we used for the organizer's baselines. Note that the evaluation corpora for both domains are created from documents and thus contain document-level meta-data. Participants were encouraged to use document-level approaches. Note that we do not exhaustively list 6 all available corpora here, and participants were not restricted from using any corpora as long as they are freely available.

News Commentary Task
For the Russian↔Japanese task, we asked participants to use the JaRuNC corpus 7 (Imankulova et al., 2019). See Table 6 for the statistics of the in-domain parallel corpora. In addition, we encouraged the participants to use out-of-domain parallel corpora from various sources such as KFTT, 8 JESC, 9 TED, 10 ASPEC, 11 UN, 12 Yandex 13 and the Russian↔English news-commentary corpus. 14 This year we also encouraged participants to use any corpora from WMT 2020 15 and WMT 2021 16 involving Japanese, Russian, and English, as long as they did not belong to the news commentary domain, to prevent any test set sentences from being unintentionally seen during training.

Indic Multilingual Task
Owing to the increasing interest in Indian language translation and the success of the multilingual Indian languages tasks in 2018 (Nakazawa et al., 2018) and 2020 (Nakazawa et al., 2020), we decided to enlarge the scope of the 2020 task by adding new languages, scouring new data and creating an N-way parallel evaluation set. In 2020, the evaluation data came from the CVIT-PIB dataset 17 but it did not contain sufficient N-way parallel sentences to evaluate on additional languages. To this end, we decided to obtain evaluation corpora from the PMI dataset 18 which contains sufficient N-way parallel corpora spanning 10 Indian languages and English and is similar (domain wise) to the CVIT-PIB dataset.
The evaluation data consists of various articles composed by the Prime Minister of India. The languages involved are Hindi (Hi), Marathi (Mr), Kannada (Kn), Tamil (Ta), Telugu (Te), Gujarati (Gu), Malayalam (Ml), Bengali (Bn), Oriya (Or), Punjabi (Pa) and English (En). Compared to 2020, we have 3 additional languages, leading to a total of 10 Indian languages, 4 of which are Dravidian while the rest are Indo-Aryan. English is either the source or the target language during evaluation, leading to a total of 20 translation directions. Due to the N-way nature of the evaluation corpus, we can also evaluate 90 Indian-language-to-Indian-language translation pairs, but this may be the focus of future workshops. The objective of this task, like the Indic languages tasks in 2018 and 2020, was to evaluate the performance of multilingual NMT models. The desired solution could be one-to-many, many-to-one or many-to-many NMT models. We provided a filtered parallel corpus collection spanning all languages, 19 which was split into training, development and test sets. This dataset was created by first creating an evaluation set of 3,390 11-way sentences (1,000 for development and 2,390 for testing) and then filtering them out from all parallel corpora we could obtain at the time. Furthermore, we made sure to filter out sentences from the 2020 evaluation set. This way, the provided parallel corpus can be safely used for benchmarking against the 2020 evaluation set as well. The filtered training parallel corpora came from a variety of sources such as: CVIT-PIB, PMIndia, IITB 3.0, 20 JW, 21 NLPC, 22 UFAL EnTam, 23 Uka Tarsadia, 24 Wiki Titles (ta, 25 gu 26 ), ALT, 27 OpenSubtitles, 28 Bible-uedin, 29 MTEnglish2Odia, 30 OdiEnCorp 2.0, 31 TED, 32 and WikiMatrix. 33 Additionally, we listed the CCAligned corpus 34 to be used despite its poor quality, which applies to WikiMatrix as well.
We also provided filtered monolingual corpora 35 sourced from PMI, and we encouraged the use of monolingual corpora from IndicCorp. 36 The statistics of this corpus are given in Table 8. We expected that this year, the novel way of using the monolingual corpora would be to pre-train NMT models such as BART/mBART (Lewis et al., 2020; Liu et al., 2020). In general, we encouraged participants to focus on multilingual NMT (Dabre et al., 2020) solutions.
Detailed statistics for the aforementioned corpora can be found in Table 7. We also listed additional sources of corpora for participants to use. Our organizer's baselines used the PMI corpora for training as it is the in-domain corpus.

English→Hindi Multi-Modal Task
This task has been running successfully at WAT since 2019 and has attracted many teams working on multimodal machine translation and image captioning in Indian languages (Nakazawa et al., 2019, 2020). For the English→Hindi multi-modal translation task, we asked the participants to use Hindi Visual Genome 1.1 (HVG; Parida et al., 2019a,b). 37 The statistics of HVG 1.1 are given in Table 9. One "item" in HVG consists of an image with a rectangular region highlighting a part of the image, the original English caption of this region and the Hindi reference translation. Depending on the track (see 2.9.1 below), some of these item components are available as the source and some serve as the reference or play the role of a competing candidate solution.

English→Hindi Multi-Modal Task Tracks
1. Text-Only Translation (labeled "TEXT" in WAT official tables): The participants are asked to translate short English captions (text) into Hindi. No visual information can be used. On the other hand, additional text resources are permitted (but they need to be specified in the corresponding system description paper).
2. Hindi Captioning (labeled "HI"): The participants are asked to generate captions in Hindi for the given rectangular region in an input image.
3. Multi-Modal Translation (labeled "MM"): Given an image, a rectangular region in it and an English caption for the rectangular region, the participants are asked to translate the English text into Hindi. Both textual and visual information can be used.
The English→Hindi multi-modal task includes three tracks as illustrated in Figure 1.

English→Malayalam Multi-Modal Task
This task is introduced this year using the first multimodal machine translation dataset in the Malayalam language. For the English→Malayalam multi-modal translation task, we asked the participants to use the Malayalam Visual Genome corpus (MVG for short; Parida and Bojar, 2021). 38 The statistics of MVG are given in Table 10. One "item" in MVG consists of an image with a rectangular region highlighting a part of the image, the original English caption of this region and the Malayalam reference translation, as shown in Fig 3. The task has a single track, Multi-Modal Translation (labeled "MM"): given an image, a rectangular region in it and an English caption for the rectangular region, the participants are asked to translate the English text into Malayalam. Both textual and visual information can be used.

Flickr30kEnt-JP Japanese↔English Multi-Modal Tasks
The goal of the Flickr30kEnt-JP Japanese↔English multi-modal task 39 is to improve translation performance with the help of another modality (images) associated with the input sentences. For both English→Japanese and Japanese→English tasks, we use the Flickr30k Entities Japanese (F30kEnt-Jp) dataset (Nakayama et al., 2020). This is an extended dataset of the Flickr30k 40 and Flickr30k Entities 41 datasets where manual Japanese translations are added. Notably, it has annotations of many-to-many phrase-to-region correspondences in both English and Japanese captions, which are expected to strongly supervise multimodal grounding and provide new research directions. This year, starting from the same shared tasks as in WAT 2020, we increased the number of parallel sentences for training and validation. We summarize the statistics of this year's dataset in Table 11; Japanese tokens are counted with the MeCab tokenizer, and some of the original English sentences are broken, so their translations are not provided. We use the same splits of training, validation and test data specified in Flickr30k Entities. For the training and the validation data, we use F30kEnt-Jp version 2.0, which is publicly available. 42 The original Flickr30k has five English sentences for each image. While the Japanese set for WAT 2020 had translations of only the first two sentences, this year all five translations are available for each image. Therefore, five parallel sentences per image can be used to train and validate the systems. The test data remain exactly the same as in WAT 2020, where phrase-to-region annotation is not included.

There are two settings of submission: with and without resource constraints. In the constrained setting, external resources such as additional data and pre-trained models (with external data) are not allowed, except for pre-trained convolutional neural networks (for visual analysis) and basic linguistic tools such as taggers, parsers, and morphological analyzers.

40 http://shannon.cs.illinois.edu/DenotationGraph/
41 http://bryanplummer.com/Flickr30kEntities/
42 https://github.com/nlab-mpg/Flickr30kEnt-JP

Ambiguous MS COCO Japanese↔English Multimodal Task
This is another Japanese-English multimodal machine translation task. We provide the Japanese-English Ambiguous MS COCO dataset (Merritt et al., 2020) for validation and testing, which contains ambiguous verbs that may require visual information in images for disambiguation. The validation and test sets contain 230 and 231 Japanese-English sentence pairs, respectively. The Japanese sentences are translated from the English sentences in the original Ambiguous MS COCO dataset. 43 Participants can train their multimodal machine translation systems under a constrained or an unconstrained setting. In the constrained setting, only the Flickr30k Entities Japanese (F30kEnt-Jp) dataset 44 can be used as training data. In the unconstrained setting, the MS COCO English data 45 and the STAIR Japanese image captions 46 can be used as additional training data.
We prepared a baseline using the double attention over image regions method following Zhao et al. (2020) for both Japanese→English and English→Japanese directions.

Restricted Translation Task
Despite the recent success of NMT, MT systems still struggle to generate translations with consistent terminology. Consistency is key to clear and accurate translation, especially when translating documents in a specific field, for instance in science or in business and marketing contexts, which require technical terms and proper nouns to be translated into the corresponding unique expressions consistently throughout an entire document. To tackle this inconsistent-translation issue, we designed the Restricted Translation task at WAT 2021.
In this task, systems are required to translate texts under target vocabulary constraints. At inference time, such a restricted vocabulary is provided as a list of target words, consisting of scientific technical terms in the target language, and the system outputs must contain all of these target words. For the English↔Japanese translation tasks, we employ the ASPEC corpus and allow the use of other external data sources. We built the restricted vocabulary lists by asking 10 bilingual speakers to manually extract the scientific technical terms from the evaluation data sets ("dev/devtest/test"). Table 12 reports the statistics of the restricted vocabulary in the evaluation data. We evaluate systems with two distinct metrics: 1) the BLEU score as a conventional measure of translation accuracy, and 2) a consistency score: the ratio of the number of sentences satisfying an exact match of the given constraints over the whole test corpus. For the "exact match" evaluation, we conduct the following process. In English, we simply lowercase hypotheses and constraints, then judge character-level sequence matching (including whitespaces) for each constraint. In Japanese, we judge character-level sequence matching (including whitespaces) for each constraint without preprocessing. For the final ranking, we also calculate a combined score of both: BLEU calculated on only the exact-match sentences. We note that, in this scenario, the brevity penalty in BLEU does not carry its usual meaning, but the n-gram scores maintain their consistency.

Table 13 shows the participants in WAT2021. The table lists 24 organizations from various countries, including Japan, India, USA, Singapore, Myanmar, Thailand, Korea, Poland, Denmark and Switzerland.
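The consistency score described above can be sketched as follows. This is an illustrative reconstruction, not the official evaluation script; the function name and interface are ours.

```python
def consistency_score(hypotheses, constraint_lists, lang="en"):
    """Ratio of output sentences that contain every given constraint
    as an exact character-level substring (whitespace included).
    English: both sides are lowercased first; Japanese: no preprocessing."""
    satisfied = 0
    for hyp, constraints in zip(hypotheses, constraint_lists):
        if lang == "en":
            hyp = hyp.lower()
            constraints = [c.lower() for c in constraints]
        if all(c in hyp for c in constraints):
            satisfied += 1
    return satisfied / len(hypotheses)
```

A sentence counts toward the score only if all of its constraints appear verbatim in the output.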

Participants
About 2,100 translation results by 28 teams were submitted for automatic evaluation, and about 360 translation results by 24 teams were submitted for human evaluation. Table 14 summarizes the participation of teams across the WAT2021 tasks and indicates which tasks included manual evaluation. The human evaluation was conducted only for the tasks with check marks in the "human eval" row.
There were no participants in the Newswire (JIJI) task, BSD task and JaRuNC task.

Baseline Systems
Human evaluations for most WAT tasks were conducted as pairwise comparisons between the translation results of a specific baseline system and the translation results of each participant's system. That is, the specific baseline system served as the standard for human evaluation. At WAT 2021, we adopted neural machine translation (NMT) systems as baselines. The details of the NMT baseline systems are described in this section.
The NMT baseline systems consisted of publicly available software, and the procedures for building the systems and for translating with them were published on the WAT web page. 47 We also have SMT baseline systems for the tasks that started at WAT 2017 or earlier. The baseline systems are shown in Tables 15, 16, and 17. The SMT baseline systems are described in the WAT 2017 overview paper (Nakazawa et al., 2017). The commercial RBMT systems and the online translation systems were operated by the organizers. We note that these RBMT companies and online translation companies did not submit their systems. Because our objective is not to compare commercial RBMT systems or online translation systems from companies that did not themselves participate, the system IDs of these systems are anonymized in this paper.

Tokenization
We used the following tools for tokenization.
When we built the BPE codes, we merged the source and target sentences and used 100,000 for the -s option (the number of merge operations). We used 10 for the vocabulary-threshold option when applying BPE with subword-nmt.
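As a toy illustration of the merge-learning step that subword-nmt's learn-bpe performs at scale, the following is our own minimal sketch (not the subword-nmt implementation; it works on plain character tuples and omits real BPE's end-of-word marker handling):

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Repeatedly merge the most frequent adjacent symbol pair,
    as BPE learning does, and return the list of learned merges."""
    vocab = Counter(tuple(w) for w in words)  # word -> frequency, as symbol tuples
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # rewrite the vocabulary with the chosen pair merged
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges
```

With -s 100000, subword-nmt learns 100,000 such merges over the concatenated source and target sentences.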

For News Commentary
• The Moses toolkit for English and Russian only for the News Commentary data.
• Corpora are further processed by tensor2tensor's internal pre-/post-processing, which includes sub-word segmentation.

For Indic and NICT-SAP Tasks
• For the Indic task we did not perform any explicit tokenization of the raw data.
• For the NICT-SAP task, we only character-segmented the Thai corpora, as Thai was the only language for which character-level BLEU was to be computed. Corpora in other languages were not preprocessed in any way.
• Any subword segmentation or tokenization was handled by the internal mechanisms of tensor2tensor.

For English→Hindi Multi-Modal and English→Malayalam Tasks
• Hindi Visual Genome 1.1 and the Malayalam Visual Genome come untokenized, and we did not use or recommend any specific external tokenizer.

For English↔Japanese Multi-Modal Tasks
• For English sentences, we applied lowercase, punctuation normalization, and the Moses tokenizer.
• For Japanese sentences, we used KyTea for word segmentation.

Baseline NMT Methods
We used NMT models for all tasks. Unless mentioned otherwise, we use the Transformer model (Vaswani et al., 2017). We used OpenNMT (Klein et al., 2017) (RNN model) for the ASPEC, JPC, JIJI, and ALT tasks, tensor2tensor 54 for the News Commentary (JaRuNC), NICT-SAP and MultiIndicMT tasks, and OpenNMT-py 55 for the other tasks.

NMT with Attention (OpenNMT)
For the ASPEC, JPC, JIJI, and ALT tasks, we used OpenNMT (Klein et al., 2017) as the implementation of the attention-based NMT baseline systems (System ID: NMT). We used the following OpenNMT configuration.
We used the following data for training the attention-based NMT baseline systems.
• All of the training data mentioned in Section 2 were used for training, except for the ASPEC Japanese-English task. For the ASPEC Japanese-English task, we only used train-1.txt, which consists of one million parallel sentence pairs with high similarity scores.
• All of the development data for each task was used for validation.

54 https://github.com/tensorflow/tensor2tensor
55 https://github.com/OpenNMT/OpenNMT-py

Transformer (Tensor2Tensor)
For the News Commentary task, we used tensor2tensor's 56 implementation of the Transformer (Vaswani et al., 2017) and used the default hyperparameter settings corresponding to the "base" model for all baseline models. The baseline for the News Commentary task is a multilingual model, as described in Imankulova et al. (2019), which is trained using only the in-domain parallel corpora. We use the token trick proposed by Johnson et al. (2017) to train the multilingual model. For the NICT-SAP task, we used tensor2tensor to train many-to-one and one-to-many models, where the latter were trained with the aforementioned token trick. We used the default hyperparameter settings corresponding to the "big" model. Since the NICT-SAP task involves two domains for evaluation (Wikinews and IT), we used a modification of the token-trick technique for domain adaptation to distinguish between corpora for different domains. In our case, we used tokens such as 2alt and 2it to indicate whether the sentences belonged to the Wikinews or IT domain, respectively. For both tasks we used separate sub-word vocabularies of 32,000 entries. We trained our models on 1 GPU until convergence of the development set BLEU scores, averaged the last 10 checkpoints (separated by 1,000 batches) and performed decoding with a beam of size 4 and a length penalty of 0.6.
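The token trick and its domain-adaptation variant amount to prepending artificial tokens to each source sentence before training. A minimal sketch follows; the 2alt/2it domain tags come from the text above, while the function itself and the target-language tag format are our illustrative assumptions.

```python
def add_tags(src_sentences, target_lang=None, domain=None):
    """Prepend artificial tokens marking the target language
    (as in Johnson et al., 2017) and, optionally, the domain of
    each source sentence (e.g. '2alt' for Wikinews, '2it' for IT)."""
    tags = []
    if domain:
        tags.append("2" + domain)
    if target_lang:
        tags.append("2" + target_lang)
    return [" ".join(tags + [s]) for s in src_sentences]
```

The model then learns to condition its output on these tags, so a single model can serve several target languages and domains.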
For the MultiIndicMT task we trained unidirectional models using only the PMI corpus instead of the entire training data. We intentionally used the PMI corpus because its domain is the same as that of the evaluation set. Due to lack of time and resources we did not train multilingual models nor did we use additional data. We trained "transformer_base" models with shared vocabularies of 8,000 subwords. We trained our models on 1 GPU till convergence on the development set BLEU scores, chose the model with the best development set BLEU and performed decoding with a beam of size 4 and a length penalty of 0.6.

Transformer (OpenNMT-py)
For the English→Hindi Multimodal and English→Malayalam Multimodal tasks, we used the Transformer model (Vaswani et al., 2017) as implemented in OpenNMT-py (Klein et al., 2017) and used the "base" model with default parameters for the multi-modal task baselines. We generated a vocabulary of 32k sub-word types jointly for both the source and target languages. The vocabulary is shared between the encoder and decoder.

56 https://github.com/tensorflow/tensor2tensor

Procedure for Calculating Automatic Evaluation Score
We evaluated translation results with three metrics: BLEU (Papineni et al., 2002), RIBES (Isozaki et al., 2010) and AMFM (Banchs et al., 2015a). BLEU scores were calculated using multi-bleu.perl in the Moses toolkit (Koehn et al., 2007). RIBES scores were calculated using RIBES.py version 1.02.4. 57 AMFM scores were calculated using scripts created by the technical collaborators listed on the WAT2021 web page. 58 All scores for each task were calculated using the corresponding reference translations. Before the calculation of the automatic evaluation scores, the translation results were tokenized or segmented with tokenization/segmentation tools for each language. For Japanese segmentation, we used three different tools: Juman version 7.0 (Kurohashi et al., 1994), KyTea 0.4.6 (Neubig et al., 2011) with the full SVM model 59 and MeCab 0.996 (Kudo, 2005).

The latest versions of Chrome, Firefox, Internet Explorer and Safari are supported for the submission site.
Before you submit files, you need to enable JavaScript in your browser.
File format: Submitted files should NOT be tokenized/segmented. Please check the automatic evaluation procedures.
Submitted files should be encoded in UTF-8 format.
Translated sentences in submitted files should have one sentence per line, corresponding to each test sentence. The number of lines in the submitted file and that of the corresponding test file should be the same. If you want to submit a file for human evaluation, check the "Human Evaluation" box. Once you upload a file with "Human Evaluation" checked, you cannot change the file used for human evaluation.
When you submit the translation results for human evaluation, please check the checkbox of "Publish" too.
You can submit two files for human evaluation per task.
It is recommended that one of the files submitted for human evaluation not use other resources, but this is not compulsory. Other: the Team Name, Task, Used Other Resources, Method, System Description (public), Date and Time (JST), BLEU, RIBES and AMFM will be disclosed on the Evaluation Site when you upload a file with "Publish the results of the evaluation" checked.
You can modify some fields of submitted data. Read "Guidelines for submitted data" at the bottom of this page.

Automatic Evaluation System
The automatic evaluation system receives translation results from participants and automatically gives evaluation scores to the uploaded results. As shown in Figure 3, the system requires participants to provide the following information for each submission:
• Human Evaluation: whether or not they submit the results for human evaluation;
• Publish the results of the evaluation: whether or not they permit the automatic evaluation scores to be published on the WAT2021 web page;
• Task: the task they submit the results for;
• Used Other Resources: whether or not they used additional resources; and
• Method: the type of the method, including SMT, RBMT, SMT and RBMT, EBMT, NMT and Other.
Evaluation scores of translation results that participants permit to be published are disclosed via the WAT2021 evaluation web page. Participants can also submit results for human evaluation using the same web interface. This automatic evaluation system will remain available even after WAT2021. Anybody can register an account for the system by following the procedure described on the application site. 68

A Note on AMFM Scores
Up until WAT 2020, we used an older-generation AMFM evaluation approach that did not use deep neural networks. Given the advances in multilingual pre-trained models, this year our collaborators provided us with deep AMFM models. With the exception of the ASPEC and restricted translation tasks, we used the provided deep AMFM models to compute the AMFM scores. Given that these deep models need GPUs to run quickly, we have not yet integrated them into our evaluation server, as it is not equipped with GPUs. Instead, we compute the AMFM scores offline and add them to the evaluation scoreboard. For readers interested in AMFM and its recent advances, we refer to the following.

JPO Adequacy Evaluation
We conducted the JPO adequacy evaluation for the top two or three participants' systems in the pairwise evaluation for each subtask. 69 The evaluation was carried out by translation experts based on the JPO adequacy evaluation criterion, which was originally defined by JPO to assess the quality of translated patent documents.

Sentence Selection and Evaluation
For the JPO adequacy evaluation, 200 test sentences were randomly selected from the test set.
For each test sentence, the input source sentence, the translation by a participant's system, and the reference translation were shown to the annotators. To guarantee the quality of the evaluation, each sentence was evaluated by two annotators. Note that the selected sentences are basically the same as those used in the previous workshop. Table 18 shows the JPO adequacy criterion, graded from 5 to 1. The evaluation is performed subjectively. "Important information" represents the technical factors and their relationships. The degree of importance of each element is also considered in the evaluation. The percentages in each grade are rough indications of how much of the source sentence's meaning is conveyed. The detailed criterion is described in the JPO document (in Japanese). 70

Evaluation Results
In this section, the evaluation results for WAT2021 are reported from several perspectives. Some of the results for both automatic and human evaluations are also accessible at the WAT2021 website. 71 Figures 4 and 5 show the results of the JPC subtasks, Figures 6 and 7 those of the MMT subtasks, Figures 8 through 17 those of the Indic Multilingual subtasks, and Figures 18 and 19 those of the NICT-SAP subtasks. Each figure contains the JPO adequacy evaluation result and an evaluation summary of the top systems.

69 The number of systems varies depending on the subtasks.
70 http://www.jpo.go.jp/shiryou/toushin/chousa/tokkyohonyaku_hyouka.htm
The detailed automatic evaluation results are shown in Appendix A. The detailed JPO adequacy evaluation results for the selected submissions are shown in Table 19. The weights for the weighted κ (Cohen, 1968) are defined as |Evaluation1 − Evaluation2|/4.
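As an illustration, the weighted κ with these linear weights on the 5-point adequacy scale can be computed as follows (a minimal sketch; the function name and implementation are our own and not taken from any official evaluation toolkit):

```python
def weighted_kappa(ratings_a, ratings_b, categories=(1, 2, 3, 4, 5)):
    """Cohen's weighted kappa with linear weights |i - j| / 4 on a 5-point scale."""
    n = len(ratings_a)
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    # Observed joint distribution of the two annotators' ratings.
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        obs[idx[a]][idx[b]] += 1.0 / n
    # Marginal distributions of each annotator.
    pa = [sum(row) for row in obs]
    pb = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    # Linear disagreement weights: |Evaluation1 - Evaluation2| / 4.
    w = [[abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]
    observed = sum(w[i][j] * obs[i][j] for i in range(k) for j in range(k))
    expected = sum(w[i][j] * pa[i] * pb[j] for i in range(k) for j in range(k))
    return 1.0 - observed / expected
```

Perfect agreement yields κ = 1, and maximal disagreement on this weighting yields κ = −1.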

JPC Task
Three teams participated in the JPC task. Bering Lab and tpt_wat submitted results for all language pairs, and TMU submitted results for the J↔K and J↔E pairs. As in WAT 2020, participants' systems were Transformer-based or BART-based. Bering Lab trained Transformer models with additional corpora, namely crawled patent document pairs aligned by a sentence encoding method, containing more than 13M sentences for each language pair. Their system achieved the best BLEU, RIBES, and AMFM scores for J→C/K/E and the best BLEU and RIBES scores for K→J among past and this year's systems. tpt_wat used a Transformer and back-translation with a single setting for all six language pairs. TMU used fine-tuned Japanese BART models and achieved the best AMFM score for K→J. As for the human adequacy evaluation, the evaluated TMU system did not outperform past years' systems for J↔E, although the results cannot be directly compared.
Among the top-performing systems, Bering Lab's systems obtained large BLEU improvements of around two points over past years' systems for J↔K. The improvements were probably due to their additional corpora, because their model without the additional corpus ranked second for J→K. Another finding, by TMU, was that pre-trained Japanese BART brought gains for all J↔K/E directions.

NICT-SAP Task
In contrast to 2020, when we had only one submission, this year we received submissions from 5 teams, 4 of which submitted system description papers. The submitted models were trained using a variety of techniques such as domain adaptation, corpus selection and weighting, mBART pre-training, and multilingual NMT training. All submissions significantly outperformed the organizers' baselines as well as the best submission in 2020. The gains shown by this year's submissions range from approximately 14 to 30 BLEU (depending on the task) compared to the baselines. The main reason is that this year's submissions rely on high-quality data selection as well as on massively multilingual pre-trained models. Of the 4 teams that submitted system description papers, only one relied on data selection and, surprisingly, obtained the best results for some language pairs; for the other language pairs, this team obtained competitive results. Regardless, it is clear that models like mBART are extremely useful in extremely low-resource domains such as Wikinews and software documentation.
Regarding human evaluation, we conducted the JPO adequacy evaluation for English to Indonesian and English to Malay in the Software Documentation domain. See Figures 18 and 19 for the results. For both translation directions, team "sakura" had the highest JPO as well as BLEU scores, but the scores for team "NICT-2" were not far behind. Both were certainly significantly better than the organizers' scores, whose models were developed using only parallel corpora without any pre-training. We can say that at high enough BLEU levels (above 40), large differences in BLEU do not necessarily correlate with large differences in human evaluation scores. Specifically, the gap between "sakura" and "NICT-2" in BLEU is 2.14 for English to Indonesian and 1.5 for English to Malay, whereas the corresponding gaps in human evaluation are 0.08 and 0.15, which are not significant. Human evaluation on a larger scale might be needed, but we were unable to conduct it due to budgetary limitations.

News Commentary (JaRuNC) task
Unfortunately we did not receive any submissions this year.

Indic Multilingual Task
In WAT 2021, we received overwhelming participation from 11 teams, 10 of which submitted system description papers; in contrast, WAT 2020 had only 4 system description papers. All participants trained multilingual NMT models. Some teams focused on leveraging monolingual corpora for pre-training mBART models or for back-translation, others on script mapping to increase the similarity between the Indian languages, and still others on language-family-specific (Indo-Aryan vs. Dravidian) models. Compared to previous years, it is clear that back-translation needs to be supplemented with pre-training as well as data selection for the best translation quality. The best-performing team, "SRPOL", used back-translation, pre-training, data selection, and domain adaptation. Following "SRPOL", teams such as "sakura", "CFILT", "IIIT-H", "IITP-MT" and "mcairt" performed best, with ranks varying depending on the translation direction. One important observation is that "SRPOL"'s results for Indian-to-English translation were far higher than those of the other teams; in general, their submissions were 2 to 5 BLEU higher than the second-best team's. We suppose that this is due to their detailed experimentation with data selection and back-translation. On the other hand, for English-to-Indian-language translation, although "SRPOL" had the highest BLEU for most directions, the gap between "SRPOL" and the other participants was not as large; in a number of cases the differences were less than 0.5 BLEU, which is not significant.
In general, we observed that translation into English yielded substantially higher BLEU scores, with most participants obtaining more than 25 BLEU for most directions. This makes sense because the Indian languages are similar to each other, and when the target language is the same, the increase in target-language data and transfer learning on the source side lead to a large improvement in translation quality. In most cases, the scores for Indo-Aryan (Hindi, Marathi, Oriya, Punjabi, Gujarati, and Bengali) to English translation were much higher than those for Dravidian (Tamil, Telugu, Kannada, and Malayalam) to English translation.
On the other hand, BLEU scores for translation into the Indian languages were relatively lower. This is due to the morphological richness of Indian languages as well as the fact that multilingual English-to-Indian-language translation does not benefit from an abundance of target-language data the way multilingual Indian-language-to-English translation does. Translation into Indo-Aryan languages such as Hindi and Punjabi showed the best quality, exceeding 30 BLEU. This makes sense because Hindi and Punjabi are very similar, and Hindi is the most resource-rich of all the Indian languages; Punjabi certainly benefits from the Hindi parallel data via transfer learning despite not sharing the same script. Script sharing, a technique used by some participants, could enhance the amount of transfer learning even further. For the other Indo-Aryan languages the translation quality was somewhat lower, with English to Bengali exhibiting the lowest quality among the Indo-Aryan languages; this shows that linguistic similarity alone is not enough to yield a high amount of transfer. For translation into the Dravidian languages we observed the lowest BLEU scores, usually around 15 BLEU or lower, with the exception of English to Kannada. Despite having larger corpora than some Indo-Aryan languages, translation into the Dravidian languages is very hard, as they are significantly more morphologically rich than the Indo-Aryan languages. Simply leveraging large monolingual corpora may not be enough, and methods that take Dravidian linguistics into account may be necessary.
With regard to human evaluation, we observed that differences in BLEU scores do not always correspond to differences in human evaluation scores. Take, for example, English to Malayalam translation, where the gap between "SRPOL" and "CFILT" is 2.7 in BLEU and 0.87 in JPO score; for the same teams on English to Marathi, the gaps are 1.95 BLEU and 0.2 JPO score. The difference between gaps of 2.7 and 1.95 is not very large, as BLEU is on a scale of 100, 72 but the difference between 0.87 and 0.2 on a scale of 5 73 is quite large. In previous editions of this workshop we have insisted that BLEU scores should not always be trusted when deciding which translations are truly the best, and this year's human evaluation results show that this is still the case. Multi-metric evaluation helps us better understand different aspects of translation, and we recommend that readers adopt it even when only automatic metrics are used. Although we are limited by budgetary constraints, we hope to conduct larger-scale human evaluation in the future.
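One simple way to make the two kinds of gaps comparable is to normalize each by its metric's range (this normalization is our own illustration, not part of the official evaluation):

```python
def normalized_gap(gap, scale_min, scale_max):
    """Express a score gap as a fraction of the metric's full range."""
    return gap / (scale_max - scale_min)

# English->Malayalam: BLEU gap of 2.7 on a 0-100 scale
# vs. JPO adequacy gap of 0.87 on a 1-5 scale.
bleu_gap = normalized_gap(2.7, 0, 100)   # 0.027 of the BLEU range
jpo_gap = normalized_gap(0.87, 1, 5)     # 0.2175 of the adequacy range
```

On this normalized view, the adequacy gap is roughly eight times larger than the BLEU gap, which is the point made above.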

English→Hindi Multi-Modal Task
This year, four teams participated in the different sub-tasks (TEXT, MM, and HI) of the English→Hindi Multi-Modal task. The WAT2021 automatic evaluation scores for the participating teams are shown in Tables 63, 60, 62, 58, 55, and 57. The team "Volta" obtained the highest BLEU score for the text-only translation (TEXT) sub-task on both the evaluation (E-Test) and challenge (C-Test) test sets; the best performance was obtained by a fine-tuned mBART using the IITB Corpus as an additional resource. For the captioning sub-task (HI), one team, "NLPHut", participated and was able to obtain better results than previous years' best, based on region-specific image caption generation. For the multimodal sub-task (MM), we received three submissions, from the teams "Volta", "iitp", and "CNLP-NITS-PP". The team "Volta" obtained the highest BLEU score for multimodal translation (MM) on both the evaluation (E-Test) and challenge (C-Test) test sets. They extracted object tags from images to enhance the textual input with visual information, achieving a BLEU score of 51.60 on the challenge test set; the translation output was also able to resolve ambiguities compared with text-only translation.
Due to constraints, no human evaluation was conducted this year for the English→Hindi Multi-Modal Task.

72 BLEU scores go from 0 to 100.
73 Human evaluation scores go from 1 to 5.

English→Malayalam Multi-Modal Task
This year, one team, "NLPHut", participated in the text-only translation (TEXT) and Malayalam captioning (ML) sub-tasks of the English→Malayalam Multi-Modal task. The WAT2021 automatic evaluation scores are shown in Tables 64, 61, 59, and 56.
For English to Malayalam text-only translation, the team "NLPHut", using a Transformer model, obtained a BLEU score of 34.83 compared to the baseline of 30.49 on the evaluation test set, and 12.15 compared to the baseline of 12.98 on the challenge test set. For Malayalam image captioning, "NLPHut" used a region-specific approach, extracting image features for the given region (bounding box) along with whole-image features and concatenating both before passing them to an LSTM decoder to obtain the captions.
Due to constraints, no human evaluation was conducted this year for the English→Malayalam Multi-Modal Task.

Flickr30kEnt-JP Japanese↔English Multi-Modal Tasks
This year, two teams participated in the English→Japanese task, and one team participated in the Japanese→English task. It is notable that all submissions outperformed the best scores in WAT 2020, probably because of the increased size of the training dataset as well as the novel techniques introduced by the participants.
Overall, we observe a trend similar to last year's. In the English→Japanese task, MMT systems consistently outperformed text-only NMT models, including unconstrained ones, while in the Japanese→English task an unconstrained NMT model achieved the best performance. This is perhaps because the Flickr30kEnt-JP dataset itself was constructed by English-to-Japanese human translation in which images were actually consulted to resolve ambiguity. One team developed an elegant method for soft word-region alignment to achieve better grounding of multimodal information, which is shown to yield a favorable performance gain. This result again indicates the importance of text-image grounding in MMT, and we believe there is still much room for improvement.

Ambiguous MS COCO Japanese↔English Multimodal Task
This year, only one team participated in the English→Japanese task. Their system was based on a word-region alignment method that enhances the interaction between source tokens and image regions and then integrates the aligned information into the visual features during decoding (Zhao et al., 2021). We observe that their system outperformed the organizer's system, which is based on double attention to both source tokens and image regions. This verifies that, for this task and for multimodal MT in general, it is important to integrate visual information in a proper way: text is a strong clue for translation, but visual information can further improve translation if used properly. Unfortunately, no team participated in the Japanese→English task. We hope to have more participants next year for the tasks in both directions.

Restricted Translation Task
We received 3 systems for the English→Japanese translation task and 4 systems for the Japanese→English task. 74 On the whole, all the submitted systems are lexical-constraint-aware NMT models with a lexically constrained decoding method, in which the restricted target vocabulary is concatenated to the source sentence and, during beam search at inference time, the model generates translation outputs containing the target vocabulary. We observed that these techniques boost the final translation performance of NMT models in the restricted translation task.
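The general scheme described above can be sketched as follows (a minimal illustration; the `<sep>` separator token and function names are our own assumptions, not taken from any participant's system):

```python
def augment_source(source, constraints, sep="<sep>"):
    """Append the restricted target vocabulary to the source sentence,
    separated by a special token, so the model sees the required terms."""
    return " ".join([source] + [f"{sep} {term}" for term in constraints])

def satisfies_constraints(hypothesis, constraints):
    """Check that a hypothesis contains every required target term;
    constrained beam search keeps only hypotheses that pass this check."""
    return all(term in hypothesis for term in constraints)
```

For example, `augment_source("the cat sat", ["neko"])` yields `"the cat sat <sep> neko"`, and hypotheses lacking `"neko"` would be pruned during the beam search.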
For human evaluation, we conducted the source-based direct assessment (Cettolo et al., 2017; Federmann, 2018) and source-based contrastive assessment (Sakaguchi and Van Durme, 2018; Federmann, 2018) to have the top-ranked systems of each team appraised by bilingual human annotators. In the human evaluation campaign, we also included the human reference data. Table 20 reports the final automatic evaluation scores and the human evaluation results. In both tasks, the systems from the team "NTT" were consistently rated highest among all submitted systems in both the final score and the human evaluation. We also note that our designed automatic metric correlates well with the human evaluation results. Besides that, we found that the ASPEC human reference data might have a quality issue, containing low-quality examples annotated with scores in [0, 50] at ratios of (En-Ja, Ja-En) = (13.30%, 12.43%). This is why a few systems appear to surpass the original human reference data in the human evaluation.

Table 20: Human evaluation results of source-based direct assessment (src-based DA) and source-based contrastive assessment (src-based CA), ranging from 0 to 100. The column "final" reports the final score of the automatic evaluation metric described in Section 2.13.

Conclusion and Future Perspective
This paper summarizes the shared tasks of WAT2021. We had 24 participants worldwide who submitted their translation results for human evaluation, and we collected a large number of submissions that are useful for improving current machine translation systems through analysis and issue identification. For the next WAT workshop, we will try to add more Indic languages to our MultiIndicMT task along with newer evaluation sets. We will also add a new English→Bengali Multi-Modal task to the Multimodal translation tasks.

Acknowledgement
The English→Hindi and English→Malayalam Multi-Modal shared tasks were supported by the following grants at Idiap Research Institute and Charles University. The authors do not see any significant ethical or privacy concerns that would prevent the processing of the data used in the study. The datasets do contain personal data, and these are processed in compliance with the GDPR and national law.
• At Idiap Research Institute, the work was supported by the EU H2020 project "Real-time network, text, and speaker analytics for combating organized crime" (ROXANNE), grant agreement: 833635.
• At Charles University, the work was supported by the grant 19-26934X (NEUREM3) of the Czech Science Foundation and using language resources distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic.