Beyond Shared Vocabulary: Increasing Representational Word Similarities across Languages for Multilingual Machine Translation

Using a vocabulary that is shared across languages is common practice in Multilingual Neural Machine Translation (MNMT). In addition to its simple design, shared tokens play an important role in positive knowledge transfer, assuming they refer to similar meanings across languages. However, when word overlap is small, especially due to different writing systems, transfer is inhibited. In this paper, we define word-level information transfer pathways via word equivalence classes and rely on graph networks to fuse word embeddings across languages. Our experiments demonstrate the advantages of our approach: 1) embeddings of words with similar meanings are better aligned across languages, 2) our method achieves consistent BLEU improvements of up to 2.3 points for high- and low-resource MNMT, and 3) less than 1.0% additional trainable parameters are required with a limited increase in computational costs, while inference time remains identical to the baseline. We release the codebase to the community.


Introduction
Multilingual systems (Johnson et al., 2017; Lample and Conneau, 2019) typically use a shared vocabulary to build the word space uniformly. For instance, in MNMT scenarios, this is achieved by combining all source and target training sentences together and training a shared language-agnostic tokenizer, e.g., BPE (Sennrich et al., 2016), to split the words into tokens.
Such a design is simple and scales easily. Moreover, shared tokens also encourage positive knowledge transfer when they refer to equivalent or similar meanings. Research on cross-lingual word embeddings (Søgaard et al., 2018; Ruder et al., 2019) shows that exploiting a weak supervision signal from identical words remarkably boosts cross-lingual word embedding learning and bilingual lexicon induction, i.e., word translation. This point also holds for higher-level representation learning, e.g., multilingual BERT (Devlin et al., 2019), where Pires et al. (2019) show that knowledge transfer is more pronounced between languages with higher lexical overlap. For bilingual machine translation, Aji et al. (2020) investigate knowledge transfer in a parent-child transfer setting and reveal that word embeddings play an important role, particularly if they are correctly aligned.
All findings above point to the importance of word-level knowledge transfer for downstream systems, no matter whether this transfer is achieved through sharing, mapping, or alignment. To some extent, knowledge transfer in multilingual translation systems (Johnson et al., 2017) can be seen as a special case of the aforementioned bilingual transfer setting, where model parameters and vocabulary are both shared and trained from scratch, and knowledge transfer should then emerge naturally. However, the shared vocabulary, as one of the core designs, has limitations: 1) when languages use different writing systems, there is little word overlap and knowledge sharing suffers; 2) even if languages use similar writing systems, shared tokens may have completely different meanings in different languages, increasing ambiguity.
In this paper, we target the first issue: word-level knowledge sharing across languages. As illustrated in Figure 1, due to different surface forms, bicycle in English and fiets in Dutch have the same meaning but are placed differently in the embedding space, unlike the shared word station. Our goal is to pull together the embeddings of meaning-equivalent words in different languages, including languages with different writing systems, facilitating knowledge transfer for downstream systems. For simplicity, we choose English as the pivot and encourage words in other languages to move closer to their English counterparts.
To this end, we define and mine subword-level knowledge propagation paths and integrate them into a graph represented as an adjacency matrix. Then, we leverage graph networks (Welling and Kipf, 2016) to model the information flow. The embeddings of meaning-equivalent words transfer information along the paths defined in the graph and finally converge into a re-parameterized embedding table, which is used by the MNMT model as usual. At a higher level, our approach establishes priors of word equivalence and then injects them into the embedding table to explicitly encourage knowledge transfer between words in the vocabulary.
We choose multilingual translation as the test bed to investigate the impact of word-based knowledge transfer. Several experiments show the advantages of our approach: 1) Our re-parameterized embedding table exhibits enhanced multilingual capabilities, resulting in consistently improved alignment of word-level semantics across languages and encouraging word-level knowledge transfer beyond identical surface forms. 2) Our method consistently outperforms the baseline by a substantial margin (up to 2.3 average BLEU points) for high- and low-resource MNMT, which empirically demonstrates the benefits of our re-parameterized embeddings. 3) Our method scales: the extra training time and memory costs are small, while the inference time is exactly the same as for benchmark systems. Moreover, we demonstrate that our method adapts to massive numbers of language pairs and a large vocabulary.

Related Work
Prior work has demonstrated the importance of word-level knowledge transfer. For instance, Gouws and Søgaard (2015) show that by replacing some specific words with their cross-lingual equivalents, a downstream cross-lingual part-of-speech tagging task can benefit from its high-resource counterpart. Using cross-lingual word embeddings (Søgaard et al., 2018; Ruder et al., 2019) can be seen as an extension of this idea at the representation level, e.g., by fixing pre-trained embeddings and applying them to cross-lingual tasks.
For transfer learning in bilingual NMT, Amrhein and Sennrich (2020) show that the child and parent models should share a substantial part of the vocabulary. For languages with different scripts, vocabulary overlap is minimal and transfer suffers. To alleviate this issue, they leverage a romanization tool, which maps characters in various scripts to Latin characters, often based on pronunciation, to increase overlap. However, romanization has limitations: 1) it operates at the character level, sometimes relying on pronunciation features, and hence its benefits are mostly limited to named entities; and 2) the process of romanization is not always reversible.
Sun et al. (2022) extend surface-level character normalization and apply it to multilingual translation. Besides romanization signals, they further introduce phonetic and transliterated signals to augment training data and unify writing systems. Their approach also shows the merits of larger vocabulary overlap. However, the aforementioned limitations of romanization and transliteration still hold.
Compared to previous work, our method is more general and practical. We mine equivalent words at the semantic level and inject priors into the embeddings to improve knowledge transfer. Firstly, this avoids the flaws of romanization- or pronunciation-based approaches. Secondly, the graph-based transfer framework naturally adapts to multilingual scenarios, where many-to-many equivalences exist.

Reparameterization Framework
In this section, we describe 1) how to build an equivalence graph using pair-wise data and 2) how to re-parameterize the embeddings via our graph-based approach. Note that in the discussion below we use the term 'word' to refer to an actual word or to a subword if a word segmentation approach such as BPE (Sennrich et al., 2016) is used.

Equivalence Graph Building
We define words with the same meaning in different languages as equivalence classes, e.g., {bicycle, bike, Fahrrad, fiets} all mean bike and therefore belong to the same equivalence class. This semantic equivalence involves many-to-many relationships across languages, which we represent in a graph as follows. We aim for these words to transfer information to each other through the graph and converge to similar embedding representations. Given a shared vocabulary V with size |V|, we define the corresponding equivalence graph as an adjacency matrix G ∈ R^{|V|×|V|}. Each entry g_{i,j} ∈ G is a non-negative real number between 0.0 and 1.0 and denotes the ratio of information transferred from word v_j to word v_i in V. In our approach, we mine equivalence classes via word alignment and define the corresponding transfer ratios based on them.
More formally, let D be the entire (multilingual) bitext corpus, and let D^{s,t} ⊆ D denote a subset of parallel data with translation direction from source language s to target language t. We train and extract all subword-level alignments for each D^{s,t} separately. For two words v_i and v_j in V, we define the alignment probability from v_j to v_i in D^{s,t} as the corresponding transfer ratio g^{s,t}_{i,j}:

g^{s,t}_{i,j} = c^{s,t}_{i,j} / Σ_k c^{s,t}_{i,k},

where c^{s,t}_{i,j} is the number of times both words are aligned with each other across D^{s,t}. A higher ratio is derived from a greater pairwise occurrence in the bitext. This is based on the intuition that when a pair of aligned words frequently co-occur, 1) they have higher confidence as equivalent words, and 2) the knowledge sharing between these two words will benefit more contexts during training. The corresponding bilingual equivalence graph G^{s,t} can be induced by filling an adjacency matrix with g^{s,t}_{i,j}. By element-wise summation, we can merge multiple bilingual graphs into a single multilingual graph G:

G = Σ_{(s,t)} G^{s,t},

and further normalize G to guarantee that each row sums to 1.0. Figure 2 illustrates our equivalence graph G for two language pairs, i.e., EN-DE and EN-NL. It is worth noting that, since only English-centric alignments are extracted, the blocks for non-English directions are usually empty (such as German-Dutch).
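As a concrete illustration, the two steps above (normalizing alignment counts per direction, then merging and row-normalizing) can be sketched as follows. This is a simplified dense-matrix sketch with made-up counts, not the paper's implementation:

```python
import numpy as np

def build_equivalence_graph(count_matrices):
    """Merge per-direction alignment-count matrices C^{s,t} into one
    row-normalized multilingual transfer graph G."""
    def row_normalize(M):
        sums = M.sum(axis=1, keepdims=True)
        # rows with no aligned words stay all-zero
        return np.divide(M, sums, out=np.zeros_like(M), where=sums > 0)

    # 1) per-direction ratios g^{s,t}_{i,j} = c_{i,j} / sum_k c_{i,k}
    graphs = [row_normalize(C.astype(float)) for C in count_matrices]
    # 2) merge by element-wise summation, then re-normalize rows to 1.0
    return row_normalize(sum(graphs))
```

In practice the counts would come from eflomal alignments over each bitext, and the matrices would be stored sparsely.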
For practicality, in this paper, we choose English as the pivot and define transfer paths based on alignments between English and other languages.It is worth mentioning that alternative approaches, like leveraging multilingual vocabulary (Lample et al., 2018b) or multilingual colexification graphs (Liu et al., 2023), could be used as well.

GNN-based Message Passing
In Section 3.1, we modeled the priors of word equivalence as a graph. Here, we show how to inject such priors into the embeddings via graph networks, thereby re-parameterizing the embedding table.
The graph is represented as an adjacency matrix G ∈ R^{|V|×|V|} (see Section 3.1). Let E ∈ R^{|V|×d} be the original embedding table containing all |V| original word embeddings, where d is the representation dimensionality.
Given G and E, we can easily define various forms of information transfer on the graph as matrix operations; e.g., a weighted sum of meaning-equivalent word embeddings can be defined as E′ = GE. Graph networks (Welling and Kipf, 2016) extend such operations in a multi-layer neural network fashion, where each layer also involves non-linear functions and learnable linear projections to adjust the aggregation of messages. The re-parameterized word embeddings derived from the first-layer graph network are defined as follows:

e^1_i = ρ( W_1 e_i + Σ_{j ∈ N(i)} g_{i,j} W_2 e_j + b ),

where e_i is the i-th row of E, i.e., the embedding of word v_i; W_1, W_2 ∈ R^{d×d} and b ∈ R^d are learnable parameters; and ρ is a non-linear activation function, such as ReLU (Glorot et al., 2011). N(i) denotes the set of neighbors of the i-th node in the graph, i.e., the aligned words with g_{i,•} > 0. Respectively, W_1 learns the projection for the current word embedding e_i, and W_2 learns the projection for each neighbor e_j, j ∈ N(i).
Equivalently, we can rewrite Equation 3 in matrix form as follows:

E^1 = ρ( E W_1 + G E W_2 + b ).

To allow messages to pass over multiple hops, we stack multiple graph network layers and calculate representations recursively as follows:

E^h = ρ( E^{h-1} W^h_1 + G E^{h-1} W^h_2 + b^h ),

where h is the layer index, i.e., the hop, and E^0 is the original embedding table E. For the maximum number of hops H, the last-layer representation E^H is the final re-parameterized embedding table, which is then used by the system just like any vanilla embedding table.
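To make the recursion concrete, here is a minimal NumPy sketch of the multi-hop update. In the actual model, W_1, W_2, and b are learnable parameters trained end-to-end with the translation objective; here they are fixed toy matrices:

```python
import numpy as np

def graph_merge(E, G, layers):
    """E: (|V|, d) embedding table; G: (|V|, |V|) row-normalized graph;
    layers: list of (W1, W2, b) tuples, one per hop.
    Returns the re-parameterized table E^H."""
    Eh = E
    for W1, W2, b in layers:
        # self projection + weighted aggregation over aligned neighbors
        Eh = np.maximum(Eh @ W1 + G @ Eh @ W2 + b, 0.0)  # ReLU
    return Eh
```

The downstream MNMT model would then look up word embeddings from `graph_merge(E, G, layers)` instead of E.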
Figure 2 illustrates the overall architecture. The information from the original embedding table propagates through multi-hop graph networks and converges into the re-parameterized table. Then, the downstream standard MNMT system looks up the corresponding word embeddings from the re-parameterized table. The whole architecture is end-to-end and supervised by the translation objective.
It is worth noting that, although the graph we build in Section 3.1 only contains English-centric pathways, knowledge transfer beyond English-centric directions can also be handled through the multi-hop mechanism. E.g., words in Dutch first transfer information to their English counterparts, which then propagate it further to other languages, e.g., German, via a 2-hop mechanism. We empirically evaluate this point in Section 5.2.

Experiments and Results
In this section, we apply our approach to train multilingual translation models under different configurations.

Datasets
We conduct experiments on two datasets: the smaller IWSLT14 benchmark and our own large-scale dataset called EC30.
For IWSLT14, we follow the setting of Lin et al. (2021) and collect 8 English-centric language pairs, with sizes ranging from 89k to 169k sentence pairs. The data processing script follows Tan et al. (2018).
To ensure a more diverse and inclusive large-scale evaluation, we use the EC30 dataset derived from EC40 (Tan and Monz, 2023), from which we excluded the data of the 10 super-low-resource languages. The EC30 dataset consists of 61 million English-centric bilingual sentences as training data, covering 30 non-English languages across a wide spectrum of resource availability, ranging from high (5M) to medium (1M) and low (100K) resource levels. Each resource group consists of languages from 5 families with multiple writing systems. We choose NTREX (Federmann et al., 2022) and Flores-101 (Goyal et al., 2022) as our validation and test datasets, respectively. A more detailed description of the datasets is provided in Appendix A.1.
We tokenize the data with Moses (Koehn et al., 2007) and use SentencePiece with BPE (Sennrich et al., 2016) to build the shared vocabulary. The learning rate is 5e-4 with 4,000 warm-up steps and an inverse-sqrt decay schedule. All dropout rates and label smoothing are set to 0.1. Data from different language pairs are sampled with a temperature of 2.0 and 5.0 for IWSLT14 and EC30, respectively. We train all models with an early-stopping strategy and evaluate using the best checkpoint as selected based on the loss on the development set. More detailed training settings can be found in Appendix A.2. Tokenized BLEU is used as the metric in the main content. To show consistent improvement across metrics, SacreBLEU (Post, 2018), ChrF++ (Popović, 2017), and Comet (Rei et al., 2020) scores are also reported in Appendix A.3 and Appendix A.4.
Unless mentioned otherwise, we use eflomal with intersect alignment symmetrization to extract alignments and build graphs as described in Section 3.1.

Results on IWSLT14
Table 1 summarizes the results for IWSLT14. Weighted Sum represents our most naïve setting, i.e., the left multiplication of the graph matrix with the embedding table as described in Section 3.2. It is worth noting that even in this setting, +0.7 average BLEU gains are obtained without introducing any extra trainable parameters. For simplicity, we name our graph-based approach GraphMerge and conduct experiments for 1, 2, and 3 hops.
GraphMerge consistently outperforms the baseline by a substantial margin: 1) the GraphMerge models yield better performance for all language directions, and 2) GraphMerge with the 3-hop setting achieves the best results, with an average gain of +2.3 BLEU. The largest gain of +3.5 BLEU is found for into-English translation. Also, for out-of-English translation, which is often considered the more difficult task, our method obtains an improvement of +1.0 BLEU.
A noteworthy finding is that as we increase the depth of the GNN from 1 hop to 3 hops, performance consistently improves for almost all language pairs, resulting in notable average gains of 1.6, 1.9, and 2.3 BLEU, respectively. We attribute this progressive improvement to the gradual enhancement of the quality of the cross-lingual word embeddings. A more detailed analysis can be found in Section 5.1.
The results in other metrics, i.e., SacreBLEU, ChrF++, and Comet, can be found in Table 11, where we show that our improvements remain consistent across a large spectrum of evaluation metrics.

Large-scale Multilingual Results
To evaluate the scalability of our approach, we also run experiments on the EC30 dataset, which includes 30 language pairs with 61M sentence pairs. Furthermore, we use a bigger model (Transformer-Big) and a larger vocabulary size (128K), resulting in a stronger baseline.
As shown in Table 2, our method achieves consistent improvements over the baseline across all resource groups for English-centric pairs. When extending GraphMerge from 1 hop to 2 hops, further improvements of up to +1.1 average BLEU points are achieved, but the improvements slightly weaken in the 3-hop setting. These improvements are in line with our results for IWSLT14 (see Section 4.2). The results for individual languages can be found in Appendix A.4, where results in other metrics, i.e., SacreBLEU, ChrF++, and Comet, are also reported and show consistent improvements as well.
We also evaluate zero-shot translation performance on EC30 for the 870 language directions not involving English. Figure 3 shows our results for 9 resource groups, considering all resource combinations (high, medium, low) for source and target. Note that here resource size refers to the amount of parallel English data that is available for a given source or target language. On average, our approach obtains improvements of 1.9 BLEU over the baseline across all groups.

Ablation
To investigate the impact of some specific settings in our framework, we conduct ablation experiments on the IWSLT14 dataset in this section.
Tied Embeddings. A basic setting for transformer-based NMT models is to tie the decoder's input and output embeddings (Press and Wolf, 2017; Pappas et al., 2018). Two embedding tables exist in our GraphMerge model, i.e., the original embedding table and the re-parameterized one. We test which one is better to tie with the decoder's output embedding. Note that all embeddings are trained from scratch.
Graph Quality. We use the eflomal alignment tool with the intersect strategy to extract alignments and construct the corresponding graph. Graphs induced from different alignment strategies may influence downstream results. We evaluate this point with the following experimental groups: 1) using FastAlign (Dyer et al., 2013) instead of eflomal to extract alignments, where the latter is often considered to perform better (Östling and Tiedemann, 2016); and 2) using grow-diag-final-and (gdfa) as the symmetrization strategy instead of intersect, which improves recall but reduces precision (Koehn, 2009).

Table 4: English-centric word similarity analysis for each language pair in the IWSLT14 dataset. A high degree of consistency between the similarity of representations and the corresponding BLEU scores can be found.
Table 3 shows the ablation results on IWSLT14. We report results for the 1-, 2-, and 3-hop settings. Tying the decoder's output projection to the original embeddings instead of the re-parameterized ones drops BLEU performance by 0.7, 0.6, and 1.0 on average, respectively. This shows that tying the re-parameterized embeddings with the decoder's projection function significantly influences performance. The quality of the graph also mildly impacts downstream results: when alignments are noisier, as is the case for FastAlign and gdfa, performance drops by 0.2 and 0.4, respectively, compared to the 2-hop GraphMerge.

English-Centric Cross-lingual Word Similarity
To verify whether embeddings of words with similar meanings are indeed closer to each other in the representation space compared to the baseline, we conduct additional experiments as described below.
We utilize the bilingual dictionaries from MUSE (Lample et al., 2018a; https://github.com/facebookresearch/MUSE#ground-truth-bilingual-dictionaries) as ground truth and analyze the embedding similarity between word pairs. MUSE contains 110 English-centric bilingual dictionaries, and all languages in our experiments are included. We limit our comparisons to word pairs that exist in both our IWSLT14 vocabulary and the MUSE dictionaries; for each language pair, more than 1,000 word pairs exist in the intersection. We use cosine similarity to evaluate the average distance between word pairs across all language pairs. We also consider the isotropy of the space, which can be seen as the distributional bias of a space (Ethayarajh, 2019). Preserving isotropy avoids a situation where the similarities of words in a certain space are all significantly higher than those in other spaces, which would make comparisons across spaces unfair. Detailed isotropy results can be found in Appendix A.6. In short, isotropy is at a high level in all cases, i.e., the degree of anisotropy is low, and therefore not a cause of inflated similarities.
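The similarity measurement itself is straightforward; a minimal sketch, assuming the word-pair indices come from the MUSE-vocabulary intersection:

```python
import numpy as np

def avg_pair_cosine(E, pairs):
    """Average cosine similarity over dictionary word pairs (i, j),
    where i and j index rows of the embedding table E."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows
    return float(np.mean([En[i] @ En[j] for i, j in pairs]))
```

Comparing this score between the baseline table and the re-parameterized table, per language pair, yields a table like Table 4.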
Table 4 shows the analysis results on IWSLT14. A strong level of consistency is evident in our findings: 1) as we increase the depth of our GNN model from 1 hop to 3 hops, we observe a progressive increase in the similarities between word representations with similar meanings, indicating that our re-parameterization method effectively enhances the cross-linguality of the embedding table; 2) we observe consistent translation improvements in each direction as the cross-linguality becomes stronger, supporting our hypothesis that improving the multilinguality of the shared embedding table results in greater translation quality.

Beyond English-Centric Cross-lingual Word Similarity
As mentioned above, for practicality, we choose English as the pivot and only mine equivalent words between English and other languages. Hence, in non-English directions, such as DE-NL, meaning-equivalent relationships are not explicitly incorporated into the graph (e.g., the empty parts in Figure 2). We argue that even in this setting, cross-linguality can still be enhanced, because English, as the pivot, bridges knowledge passing from German to Dutch through our multi-hop mechanism. We evaluate this point as follows.
Since the non-English-centric dictionaries included in MUSE are limited, we extend MUSE by pairing words that are linked to the same English word, e.g., given an EN-DE pair {bike, Fahrrad} and an EN-NL pair {bike, fiets}, we can build a new DE-NL word pair {Fahrrad, fiets}. For the 8 languages in the IWSLT14 dataset, we build 28 dictionaries for the corresponding non-English-centric language pairs. Each of these dictionaries contains more than 1,000 non-English-centric word pairs, except for AR-HE, where 697 pairs are found.
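The pivot-based dictionary extension can be sketched as follows; a toy sketch operating on lists of (English, foreign) pairs:

```python
from collections import defaultdict

def extend_via_pivot(en_xx, en_yy):
    """Join two English-centric dictionaries on their English side to
    derive XX-YY word pairs, e.g. (bike, Fahrrad) + (bike, fiets)
    -> (Fahrrad, fiets)."""
    xx_by_en = defaultdict(set)
    for en, xx in en_xx:
        xx_by_en[en].add(xx)
    # every YY word inherits all XX words that share its English pivot
    return {(xx, yy) for en, yy in en_yy for xx in xx_by_en[en]}
```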
In Table 5, we report analysis results for 6 language pairs among 4 languages, i.e., DE, NL, AR, and HE, involving both the same and different writing systems. One can easily see that our re-parameterized embeddings consistently exhibit a higher degree of cross-linguality compared to the baseline, which underlines the generality of our approach even when only English-centric equivalence relationships are leveraged. Full results for all 8 languages and 28 pairs can be found in Appendix A.7.

Bilingual Experiments
In this section, we explore whether our method can bring benefits in the "extreme" setting of multilingual translation systems, i.e., bilingual translation. We argue that the advantages of better semantic alignment should also apply here.
We pick two language pairs from EC30, i.e., EN-DA (1M) and EN-AF (100K), and conduct experiments on each individually (note that the graph is rebuilt for each as well). We use a rule-of-thumb setting, i.e., Transformer-Base (6 layers, 512 embedding dimensions, 1024 feed-forward dimensions, 4 attention heads), as the backbone. Also, we shrink the vocabulary size to 16K given the smaller amount of data. In Table 6, we compare the bilingual baseline with the GraphMerge 3-hop setting in 4 language directions, where we also list the performance in these directions of our MMT model trained on EC30 (see Section 4.3).
Even in this "extreme" setting (bilingual translation), our method still brings clear benefits, especially for the low-resource pair (EN-AF), where gains of +2.2 and +3.7 BLEU are achieved for the two directions, respectively. When we extend the "extreme" setting to 30-language MMT, resulting in a stronger baseline, the improvements in these four directions remain.
It is worth noting that in the bilingual setting, only one language is fed to the encoder or decoder. We attribute the improvements here to two potential factors: 1) monolingual synonyms are modeled better, e.g., bike and bicycle are both linked to the same pivot word (e.g., cykel in Danish), resulting in enhanced representational similarity via the multi-hop mechanism; and 2) the enhanced cross-lingual word similarity, e.g., between bike and cykel, may also lead to higher accuracy during decoding. We leave further analysis to future work.
In addition, we also demonstrate our method's applicability to bilingual translation on the English-Hebrew track of the WMT 2023 competition. In short, clear gains are still observed, even when tested on a large-scale dataset exceeding 30 million sentences. A brief summary of partial experimental results for our WMT 2023 submission (Wu et al., 2023) can be found in Appendix A.5.

Speed and Memory
Compared with the standard transformer model, our method introduces extra graph operations. In particular, a big graph (adjacency) matrix of size 30K × 30K is multiplied with the original embedding table. An obvious question is whether training latency increases to an unacceptable level. We show training latency in Table 7. All experiments here were conducted on a single NVIDIA A6000 GPU with FP16 optimization. We utilized the sparse matrix support provided by PyTorch (Paszke et al., 2019) for the implementation.
As shown in Table 7, for the model we used for IWSLT14 (30K vocabulary), the extra training latency is negligible: only 4% and 6% for the 1-hop and 2-hop models, respectively. Even for a much bigger graph matrix (128K vocabulary, a nearly 16 times bigger adjacency matrix), the training latency remains limited. The only trainable parameters we introduce are the dense matrices within the graph networks, which constitute approximately 2% of all model parameters. Furthermore, due to the sparsity of the adjacency matrix, the additional memory usage is negligible, amounting to less than 1%.
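The saving comes from storing only the non-zero graph entries, since each word aligns to only a few others. A rough sketch with SciPy standing in for PyTorch's sparse tensors (toy sizes, random neighbors; illustrative only):

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
V, d, k = 1000, 16, 3  # toy vocabulary, embedding dim, ~k neighbors per word

# Most rows of G are empty, so a CSR matrix stores only V*k entries
# instead of V*V.
rows = np.repeat(np.arange(V), k)
cols = rng.integers(0, V, size=V * k)
vals = np.full(V * k, 1.0 / k)
G = sparse.csr_matrix((vals, (rows, cols)), shape=(V, V))

E = rng.normal(size=(V, d))
E_prime = G @ E  # sparse-dense product: cost scales with nnz, not V^2
```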
It is worth noting that although the graph matrix can be sizable, the re-parameterized embedding table can be decoupled and stored for online deployment, meaning that inference latency is exactly the same as for the baseline. Meanwhile, during training, GraphMerge is applied to the embedding table only once per training step; in other words, the batch size and the model size of the transformer architecture do not matter. Both benefits show the practicality of our approach for large-scale settings.

Conclusions
In this paper, we target the shortcomings of solely relying on a shared vocabulary for knowledge transfer.Broadly speaking, our approach is to mine word equivalence across languages and inject such priors into the embedding table using a graph network, thereby significantly improving transfer.
Our experiments show that our approach results in embeddings with a higher degree of multilinguality, leading to consistent improvements in MNMT. E.g., our approach achieves a 2.3 average BLEU gain on IWSLT14, and the improvements still hold on a much larger and more diverse dataset, namely EC30 (61M bitext sentences, 30 language pairs). Also, even in the minimal setting of MMT, i.e., bilingual translation, the performance gains still hold on a 30M+ sentence dataset.
At the same time, our framework remains practical: 1) Through the multi-hop mechanism, the pivot language (English) bridges knowledge transfer among non-English language pairs; therefore, even when only English-centric bitext datasets are available, multilingual transfer can be achieved. 2) A negligible number of additional trainable parameters is required, with a limited increase in computational costs during training. Meanwhile, by storing and deploying the re-parameterized embeddings for online systems, the inference latency is exactly the same as for benchmark systems.

Limitations
As MT systems scale up, a big vocabulary may be introduced. The extra computational costs of our approach may become pronounced as the vocabulary grows very large, e.g., NLLB (Costa-jussà et al., 2022) uses a 256K vocabulary.

Broader Impact
Multilingual translation systems have made significant progress recently. However, potential challenges such as mistranslation or off-target issues still exist. Moreover, fairness problems also arise: generation quality is not guaranteed to be equal across languages or demographic features, which risks exploiting and reinforcing the societal biases (e.g., gender or race bias) that are present in the underlying data.

A.1 Datasets

EC30 is broad in the total number of languages and balanced in language families and writing systems. Specifically, for each language family, we include 6 representative languages across different resource levels (2 per level). We categorize the languages into high-, medium-, and low-resource groups, with 5 million, 1 million, and 100,000 bitext examples, respectively. For the experiments on EC30, we use temperature sampling to select data for training BPE because the data is highly imbalanced (the high-resource data is 50 times bigger than the low-resource data); the temperature is aligned with that of the MNMT training phase. More specifically, we keep the lowest-resource dataset unchanged and sample data from the other bitexts to match the temperature ratio. Then, we merge the data for BPE training and apply BPE to the EC30 corpus.
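The temperature-based sampling scheme described above can be sketched as follows; a simplified sketch in which corpus sizes are illustrative and the real pipeline samples actual sentences rather than just computing target sizes:

```python
def temperature_sizes(sizes, T):
    """Target number of sentences per language pair under temperature
    sampling with temperature T, keeping the smallest corpus unchanged
    so that corpus-size ratios follow n^(1/T)."""
    weights = {k: n ** (1.0 / T) for k, n in sizes.items()}
    smallest = min(sizes, key=sizes.get)
    scale = sizes[smallest] / weights[smallest]  # anchor the smallest corpus
    return {k: min(sizes[k], round(w * scale)) for k, w in weights.items()}
```

With T > 1, the high-resource corpora are downsampled toward the low-resource ones, flattening the size ratio before BPE training.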

A.2 Training Settings
For all of our experiments, the number of trainable parameters of our methods and baselines ranges from 62M to 447M. All models are trained on 4 A6000 GPUs, and training times range from 10 hours to 8 days.

A.3 Detailed IWSLT14 Results
We show the full results on IWSLT14 in tokenized BLEU in Table 10. An interesting finding is that as the graph networks go deeper (from 1 hop to 5 hops), into-English translation consistently improves, while on average the best setting for out-of-English translation is GraphMerge with 3 hops.
We also show the full results on IWSLT14 in other widely-used metrics, i.e., SacreBLEU, ChrF++, and Comet, in Table 11. This shows that our improvements remain consistent across a large spectrum of evaluation metrics.

A.4 Detailed EC30 Results
We show the full English-centric results (60 directions) on the EC30 dataset in Table 12. The results are ordered from the high- to the medium- to the low-resource group.
We also show the results in SacreBLEU, ChrF++, and Comet on EC30 in Table 13, which shows that our improvements remain consistent across a large spectrum of evaluation metrics.

A.5 Detailed WMT2023 submission
For our WMT 2023 submission (Wu et al., 2023), we make use of all the available data from the constrained track of the shared task for English-Hebrew translation, where 70M parallel sentences are released. After data preprocessing, we reduced the size of the bitext to 34M. We further normalize and tokenize the bitexts using Moses and SentencePiece with a 32K vocabulary size.
For offline evaluation, we conducted experiments on the full data (34M) and on sampled data (2M). We apply Transformer-Large as the backbone for the full dataset, considering the large number of bilingual sentences, while using Transformer-Base for the sampled dataset. Both bilingual and MMT (two directions within one model) settings are tested. We further apply GraphMerge-1hop on top of the MMT baselines to show its benefits; the results can be found in Table 9.
It is easy to see that 1) the performance of the bilingual and MMT models is comparable on both the full and the sampled datasets, and 2) the models equipped with GraphMerge-1hop achieve consistent improvements in both directions; for the sampled data in particular, the gain is evident. Finally, we use the model trained on the full data for WMT23's final online evaluation. Even on a 30M+ dataset consisting of only two languages, GraphMerge-1hop still achieves +0.3 and +0.4 BLEU improvements for out-of- and into-English translation, respectively, which is consistent with the offline evaluation. Other strategies we used in this competition can be found in Wu et al. (2023).

A.6 Isotropy Definition and Results
We introduce isotropy to avoid a situation where the similarities of words in one space are all significantly higher than those in another space, which would make it difficult to compare results across spaces.
Taking the degree of isotropy derived from a specific model for a language pair (e.g., EN-DE) as an example: 1) for each word x_i in the intersection of the MUSE EN-DE dictionary and our vocabulary, we randomly select a target word x_j from the whole subword space 50 times and average the similarity scores between the embeddings of x_i and x_j to obtain I_i; 2) we then average I_i over all words x_i in this intersection to obtain the metric I_{EN-DE}. For a perfectly isotropic space, this metric should be close to zero. Conversely, if a large difference in isotropy exists between two spaces, similarity comparisons across the spaces are problematic.
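The two-step procedure above can be sketched in a few lines. This is our own minimal sketch, not the paper's evaluation code; the function name, the cosine-similarity choice, and the vocabulary representation (a word-to-row dict) are assumptions.

```python
import numpy as np

def isotropy_score(emb, dict_words, vocab, n_samples=50, seed=0):
    """Hypothetical sketch of the isotropy metric: for each dictionary
    word found in the vocabulary, average its cosine similarity to
    `n_samples` randomly drawn subword embeddings (step 1), then average
    over all such words (step 2). Values near zero indicate a more
    isotropic embedding space.

    emb        -- (V, d) embedding matrix
    dict_words -- words from a bilingual dictionary (e.g., MUSE EN-DE)
    vocab      -- dict mapping word -> row index in `emb`
    """
    rng = np.random.default_rng(seed)
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    scores = []
    for w in dict_words:
        if w not in vocab:
            continue
        i = vocab[w]
        j = rng.integers(0, emb.shape[0], size=n_samples)
        scores.append(float((unit[j] @ unit[i]).mean()))  # step 1: I_i
    return float(np.mean(scores))                         # step 2: I_{L1-L2}
```

For random high-dimensional embeddings this score is close to zero, which matches the "optimal isotropic space" condition stated above; a systematically positive score would signal the space bias the metric is designed to detect.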
Table 14 shows the analysis results on IWSLT14. For each of the 8 language pairs, more than 1,000 word pairs exist in the intersection of our vocabulary and the MUSE dictionaries. It is easy to see that our re-parameterized embeddings consistently exhibit better cross-linguality, i.e., the distance between representations of words with similar meanings is much smaller. Meanwhile, isotropy is high (the degree is low) in all cases, i.e., evaluation risk due to space bias is excluded as well. In other words, the similarity comparisons in Table 4 are fair.

A.7 Full Results of Beyond English-Centric Word Similarity
In Table 15, we show the full results of non-Englishcentric word similarity analysis for 28 language directions in the IWSLT14 dataset.It is easy to find that, as graph networks go deeper, word similarity consistently gets closer.

Figure 1 :
bicycle and fiets have the same meaning but different surface forms, potentially leading to a larger distance δ between their embeddings (E[•]). Our graph-based module G explicitly re-parameterizes the word embeddings (E_G[•]), leading to a reduced distance δ′.
Figure 2: Illustration of our framework. The left part shows the subgraphs we build for each language pair, e.g., EN-DE and EN-NL, which are further merged into a multilingual graph. Since we only rely on English-centric data, the graph is sparse, and only four (of the nine possible) sub-matrices are filled. As shown in the right part, the information from the original embeddings (in grey) transfers and converges into the re-parameterized embeddings (in blue) along the pathways defined in the graph, which are further used by a standard encoder-decoder model. All parameters, including the embeddings, are trained from scratch.
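The information flow in Figure 2 can be sketched as a simple graph propagation. This is our own minimal reconstruction under stated assumptions, not the released GraphMerge code: self-loops, mean aggregation over neighbors, and the dense adjacency representation are all simplifications; `n_hops` corresponds to the hop settings discussed in the paper.

```python
import numpy as np

def reparameterize_embeddings(emb, adj, n_hops=1):
    """Sketch of graph-based embedding re-parameterization: each word's
    new embedding is the mean of its own embedding and its neighbors'
    along the multilingual word graph, repeated for `n_hops` rounds.

    emb -- (V, d) original embedding matrix
    adj -- (V, V) 0/1 adjacency matrix over the word graph
    """
    A = adj + np.eye(adj.shape[0])  # self-loops: every word keeps itself
    deg = A.sum(axis=1, keepdims=True)
    out = emb
    for _ in range(n_hops):
        out = (A @ out) / deg  # one hop of mean aggregation
    return out
```

Under this sketch, words connected in the graph (e.g., equivalents across languages) are pulled toward a shared representation, while isolated words are left untouched, which mirrors the reduced distance δ′ illustrated in Figure 1.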
with 30K and 128K merge operations

Table 1 :
Results on the IWSLT14 dataset. Following previous work (Lin et al., 2021), we report average out-of- and into-English BLEU scores. For instance, the numbers in the DE column are the average of EN→DE and DE→EN BLEU scores. EN→X and X→EN denote the average performance over the 8 language pairs. The first block shows the results from Lin et al. (2021), which learns language-specific sub-networks for MNMT. We also report our reproduced baseline results. 3-hop Gain denotes the gains over the reproduced baseline. The best results in each column are in bold. More detailed results can be found in Appendix A.3.

Table 2 :
Large-scale experiments on the EC30 dataset (61M sentence pairs, 128K shared vocabulary). EN→X and X→EN denote the average performance of out-of- and into-English translation on each resource group, respectively. The best results in each column are in bold.

Table 3 :
Ablation experiment results on IWSLT14. 1-, 2-, and 3-hop refer to the number of hops in GraphMerge, and 'w/o tie' indicates tying the input and output with the original embeddings instead of the re-parameterized ones. gdfa is the abbreviation of grow-diag-final-and.

Table 5 :
Beyond-English-centric word similarity analysis for each language pair in the IWSLT14 dataset. Multilinguality is consistently enhanced as the graph networks go deeper.

Table 6 :
Results of bilingual experiments on the EN-DE and EN-AF language pairs. The evaluation set is the same as that of EC30, i.e., Flores-101. The best results in each column are in bold.

Table 8 :
Languages grouped in different resource levels in EC30.We use ISO 639-1 in this table.

Table 9 :
Offline and online evaluation results on the sampled (2M) and full (34M) training data for the WMT2023 competition. MMT + GraphMerge 1-hop means that we equip our MMT baseline with graph-based re-parameterized embedding tables. The best BLEU scores in each column are in bold.

Table 10 :
Detailed results on the IWSLT14 dataset.We report the BLEU score for each English-centric translation direction.

Table 11 :
Detailed results on the IWSLT14 dataset in SacreBLEU, ChrF++, and COMET. Consistent gains can be found for the GraphMerge-3hop setting; the into-English improvements are especially large.

Table 12 :
Detailed results on the EC30 dataset in 60 directions, ordered from the high-, through the medium-, to the low-resource group. We report the BLEU score for each direction.