Modeling Task-Aware MIMO Cardinality for Efficient Multilingual Neural Machine Translation

Neural machine translation has achieved great success in bilingual as well as multilingual settings. However, as the number of languages increases, multilingual systems tend to underperform their bilingual counterparts. Model capacity has been found crucial for massively multilingual NMT to support language pairs with varying typological characteristics. Previous work increases the modeling capacity by deepening or widening the Transformer. However, modeling cardinality, i.e. aggregating a set of transformations with the same topology, has been proven more effective than going deeper or wider when increasing capacity. In this paper, we propose to efficiently increase the capacity of multilingual NMT by increasing the cardinality. Unlike previous work, which feeds the same input to several transformations and merges their outputs into one, we present a Multi-Input-Multi-Output (MIMO) architecture that allows each transformation of a block to have its own input. We also present a task-aware attention mechanism that learns to selectively utilize individual transformations from a set of transformations for different translation directions. Our model surpasses previous work and establishes a new state-of-the-art on the large-scale OPUS-100 corpus while being 1.31 times as fast.

Despite their advantages, multilingual systems tend to underperform their bilingual counterparts as the number of languages increases (Johnson et al., 2017; Aharoni et al., 2019). This is because multilingual NMT must distribute its modeling capacity over different translation directions. Previous work shows that model capacity is crucial for massively multilingual NMT to support language pairs with varying typological characteristics, and proposes to increase the modeling capacity by deepening the Transformer.
However, compared to going deeper or wider, modeling cardinality, i.e. aggregating a set of transformations with the same topology, has been proven more effective when increasing model capacity (Xie et al., 2017). In this paper, we efficiently increase the capacity of the multilingual NMT model by increasing the cardinality, i.e. stacking sub-layers that aggregate a set of transformations with the same topology.
Our main contributions are as follows:
• We propose to efficiently increase the capacity of the multilingual NMT model by increasing cardinality, and present a novel MIMO design that allows transformations in the subsequent layer to take different outputs of the current layer as their inputs, unlike previous studies (Xie et al., 2017; Yan et al., 2020) which feed the same input to several transformations and merge their outputs into one;
• We propose to learn a task-aware attention mechanism for the MIMO transformation, allowing the model to weigh the transformations of a set differently for specific translation directions;
• In our experiments on the OPUS-100 corpus, our approach outperforms previous work and achieves a new state-of-the-art while being 1.31 times as fast.

Figure 2: We aggregate the final output of layer normalization of each "Trans" in the block into the input fed to the next block in different ways (i.e., (a)-(d)).

Preliminaries
Previous work overcomes the capacity bottleneck of multilingual NMT by deepening NMT architectures. Xie et al. (2017) present a highly modularized network architecture for image classification. The network is constructed by repeating a building block that aggregates a set of transformations with the same topology. For a given input i, the block applies n networks trans_1, ..., trans_n of the same topology to i and merges their outputs into the final output o of the layer:

o = Σ_{k=1}^{n} trans_k(i)    (1)

This design strategy exposes a new dimension, namely "cardinality" (the size of the set of transformations), as an essential factor in addition to the dimensions of depth and width. Xie et al. (2017) empirically show that increasing cardinality is more effective than going deeper or wider when increasing capacity to improve classification accuracy. Yan et al. (2020) present a multi-unit Transformer that efficiently improves translation performance by increasing cardinality instead of depth. However, their work implements stacks of input → multiple transformations → merging blocks (as illustrated in Figure 1 (a)), is developed for bilingual sentence-level translation, and requires the additional design of a biasing module and a sequential dependency that guide and encourage complementariness among different units. By contrast, our work aims at efficiently increasing the capacity for multilingual translation, proposes the MIMO transformation (Figure 1 (c)) between stacked blocks, and naturally uses the translation task, in attention form, to guide individual transformations of the set to learn different representations for different translation directions.
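The aggregation of same-topology transformations described above can be sketched as follows. This is a minimal toy illustration, not the actual network: the names (`make_trans`, `block`) are ours, and the transformations are simple scalar maps rather than real sub-networks.

```python
# A minimal sketch of cardinality-based aggregation: a block feeds the same
# input to n transformations of the same topology and sums their outputs.

def make_trans(weight):
    """One transformation of the set; here a toy elementwise scaling."""
    return lambda x: [weight * v for v in x]

def block(transforms, i):
    """Feed the same input i to every transformation and sum the outputs."""
    outs = [t(i) for t in transforms]
    return [sum(vals) for vals in zip(*outs)]

transforms = [make_trans(w) for w in (0.1, 0.2, 0.3, 0.4)]  # cardinality n = 4
o = block(transforms, [1.0, 2.0])  # each element scaled by the weight sum (~1.0)
```

The key point is that cardinality (the number of parallel transformations) is a capacity knob independent of depth (number of stacked blocks) and width (dimension of each transformation).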

Multi-Input-Multi-Output (MIMO) Transformation
In contrast to previous approaches (Xie et al., 2017;Yan et al., 2020) that follow a stack of transformation-merging procedures (Figure 1 (a)) to increase cardinality, in our approach we allow our set of transformations to take different inputs. Compared to using the same input, this may encourage transformations to learn complementary representations. Furthermore, merging the outputs of different transformations into one is likely to incur information loss. This is avoided in our approach.
We employ a MIMO transformation between stacked layers (Figure 1 (c)) to enable each transformation of the block to selectively learn to operate on its own unique input.
Specifically, we keep the n outputs of the set of transformations as multiple inputs for the next layer instead of merging them into one. The input i^j_k to the kth transformation trans^j_k of the jth layer is a weighted accumulation of the outputs o^{j-1} of layer j-1:

i^j_k = Σ_{m=1}^{n} p^j_m o^{j-1}_m    (2)

where p^j_m are softmax-normalized learnable parameters that model the translation task-aware attention for multilingual NMT described in Section 3.2. o^j_k is then produced by trans^j_k with i^j_k as its input:

o^j_k = trans^j_k(i^j_k)    (3)

In the case of a Transformer for multilingual NMT, trans^j_k can be either the multi-head attention or the feed-forward network. We adopt a one-to-many transformation (Figure 1 (b)) for the self-attention layer in the first encoder/decoder layer to project the single input from the embedding layer to multiple inputs for the subsequent layers, and perform a many-to-one transformation (Figure 1 (d)) on the outputs of the feed-forward layer of the last decoder layer to build a single input for the classifier.

Task-Aware Attention
One option would be to separate the multilingual NMT model into two parts: 1) a shared part for all language pairs, trained on the full dataset; and 2) a language-isolated part that is only activated for the corresponding translation task and trained on the subset of the data for that language. Instead, we compute all transformations of each block regardless of the translation task, so that all model parameters can utilize and benefit from the whole training set. At the same time, we introduce a task-aware attention mechanism to utilize the different transformations of a block differently for specific translation directions.
Specifically, for each transformation, we learn an embedding v for each translation direction (i.e., to X, where X is e.g. en, zh, de) to weightedly aggregate the multiple outputs of the block below. v is first normalized into a probability distribution p:

p = softmax(v)    (4)

Next, p is used in Equation 2 for the weighted aggregation. p is expected to assign higher weights to those transformations of the block that are more important for the given translation direction.
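The direction-conditioned weighting can be sketched as follows. The embedding values below are illustrative stand-ins, not trained parameters, and the direction names are only examples.

```python
# Sketch of task-aware attention: each translation direction ("to X") owns a
# learnable vector v over the n transformations of a block; softmax(v) gives
# the weights p used for the weighted aggregation of the block's outputs.
import math

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

# Hypothetical learned embeddings for n = 4 transformations per block.
task_embeddings = {
    "to_de": [2.0, 0.5, 0.1, -1.0],
    "to_zh": [-1.0, 0.1, 0.5, 2.0],
}

def task_weights(direction):
    return softmax(task_embeddings[direction])  # p = softmax(v)

p_de = task_weights("to_de")
assert abs(sum(p_de) - 1.0) < 1e-9          # p is a proper distribution
assert max(range(4), key=lambda k: p_de[k]) == 0  # "to_de" favors transformation 0 here
```

Since every transformation is computed for every batch and only the mixing weights differ per direction, all parameters receive gradients from the whole training set.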

Discussion
Increasing model capacity via cardinality is more efficient than deepening or widening a model (Xie et al., 2017; Yan et al., 2020). Compared to widening a model, increasing cardinality removes connections between hidden units and thus reduces both parameters and computation. Compared to deepening a model, increasing cardinality allows the computation of all transformations of a set to be parallelized, accelerating both training and decoding.
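As a back-of-the-envelope illustration of the parameter saving (a sketch assuming the total width d is split evenly across the n transformations, as in grouped layers; the concrete numbers are ours, not from the paper):

```python
# A dense layer of width d has d*d weights. Splitting it into n parallel
# transformations of width d/n (same total width) gives n*(d/n)**2 = d*d/n
# weights: increasing cardinality is cheaper than widening at equal width.
d, n = 512, 4
dense_params = d * d                    # fully connected: 262144
grouped_params = n * (d // n) ** 2      # 4 groups of 128x128: 65536
assert grouped_params == dense_params // n
```

The computation of the n groups is also independent, so it can run in parallel, which is the source of the decoding speedup.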

Settings
We conducted our experiments on the challenging massively many-to-many translation task on the OPUS-100 corpus (Tiedemann, 2012; Aharoni et al., 2019). We followed previous work for the experiment settings. We implemented our approaches based on the Neutron implementation (Xu and Liu, 2019) of the Transformer translation model. Parameters were initialized under the Lipschitz constraint (Xu et al., 2020). For evaluation, we adopted BLEU (Papineni et al., 2002), average BLEU over 4 selected typologically different target languages (de, zh, br, te), denoted BLEU_4, and average BLEU for zero-shot translation, denoted BLEU_zero.

Main Results
For a fair comparison, we use a 6-layer model where each attention/FFN block contains 4 transformations, which leads to a number of parameters similar to that of the 24-layer baseline. Results are shown in Table 1: our approach achieves better performance in all evaluations while being 1.31 times as fast.

Ablation Study
We study the effects of removing the MIMO transformation and the task-aware attention. Results are shown in Table 2, which verifies that both mechanisms contribute to the performance.
We also examine different combinations of depth and cardinality. Results are shown in Table 3, which shows that using 6 layers with 4 transformations in each block leads to the best performance.

Table 4: Top-5 languages by cosine similarity of the learned task-aware attention weights.
Main  en  de  fr  ar  zh  ru
1     rw  sv  pt  he  ja  sh
2     yi  da  it  mt  ko  lt
3     gd  nn  ca  fa  th  sr
4     de  nb  es  ga  vi  mk
5     xh  no  mt  yo  bn  lv

Task-Aware Attention Weight Analysis
To verify whether task-aware attention learns to aggregate similar languages together, we extract the learned task-aware attention probabilities, flatten them into vectors, and for each language select the languages with the top-5 cosine similarity. Results for several languages are shown in Table 4, which confirms that closely related languages are aggregated together.
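The analysis above can be sketched as follows. The vectors here are random stand-ins for the learned attention probabilities, and the language list is only a subset for illustration.

```python
# Sketch of the weight analysis: flatten each language's task-aware attention
# weights into a vector, then rank the other languages by cosine similarity.
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

random.seed(0)
langs = ["en", "de", "fr", "ar", "zh", "ru", "sv", "pt"]
vecs = {l: [random.random() for _ in range(16)] for l in langs}  # stand-ins

def top_k(lang, k=5):
    """Return the k languages whose attention vectors are most similar to lang's."""
    others = [l for l in langs if l != lang]
    return sorted(others, key=lambda l: cosine(vecs[lang], vecs[l]), reverse=True)[:k]
```

With the real learned vectors, typologically close languages end up near each other under this similarity, as reported in Table 4.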

Related Work
Multilingual NMT includes one-to-many (Dong et al., 2015), many-to-many (Firat et al., 2016a) and zero-shot (Firat et al., 2016b) scenarios. A simple solution is to insert a target language token at the beginning of the input sentence (Johnson et al., 2017).
There are also studies on the trade-off between

Conclusion
We propose to efficiently increase the capacity of multilingual NMT by increasing the cardinality. We present a MIMO architecture that allows each transformation of a block to have its own input, together with a task-aware attention mechanism that learns to selectively utilize individual transformations from the set for different translation directions. Our model surpasses previous work and establishes a new state-of-the-art on the large-scale OPUS-100 corpus while being 1.31 times as fast.