Multilingual Pixel Representations for Translation and Effective Cross-lingual Transfer

We introduce and demonstrate how to effectively train multilingual machine translation models with pixel representations. We experiment with two different data settings with a variety of language and script coverage, demonstrating improved performance compared to subword embeddings. We explore various properties of pixel representations such as parameter sharing within and across scripts to better understand where they lead to positive transfer. We observe that these properties not only enable seamless cross-lingual transfer to unseen scripts, but make pixel representations more data-efficient than alternatives such as vocabulary expansion. We hope this work contributes to more extensible multilingual models for all languages and scripts.


Introduction
Multilingual model vocabularies are finite and typically smaller than the possible set of Unicode characters, inherently leaving some languages and scripts under-represented. As coverage increases, parameter allocation to each language decreases, resulting in a trade-off between capability, capacity, and coverage. Recent work on pixel representations (Salesky et al., 2021; Rust et al., 2023) provides an appealing alternative to past approaches, because pixel representations do not have a discrete model vocabulary or finite embedding matrix, and can represent all scripts with complete parameter sharing.
Recent work (Rust et al., 2023) has also shown that pixel-based models can be directly finetuned across scripts without vocabulary extensions, adapters, or transliteration. However, pixel representations have previously only been trained or finetuned on individual languages at a time, rather than multilingually. This leaves unanswered questions about the effects of multilingual co-training, such as whether similar scripts will interfere with or boost performance, or whether architectural changes will be needed given the larger input space. In this work we demonstrate how to effectively parameterize and train multilingual translation models with pixel representations, leading to improvements of up to 9 BLEU on two multilingual datasets with diverse language and script coverage. We explore various properties of pixel representations in order to understand their potential benefits and limitations, including positive transfer and representational similarity between languages, parameter sharing, and frequency-based relationships. Finally, we show that pixel representations not only can be finetuned cross-lingually and on unseen scripts, but can do so more data-efficiently than alternatives such as vocabulary expansion, with significant improvements for unseen scripts.

Our approach
Covering the larger character sets of multilingual models commonly results in significant parameter increases in the embedding matrix and softmax, creating a vocabulary bottleneck. While sampling data by language to balance vocabularies is common for large-scale multilingual systems (Fan et al., 2021), sampling may cause common vocabulary to be out-of-vocabulary (OOV) for languages with longer-tail character distributions like Chinese (NLLB Team et al., 2022); for example, the NLLB model vocabulary does not include the common characters in 'mother' in Chinese, 妈妈. One alternative is to move to byte-based representations, which combats exploding model parameters by reducing the set of embeddings to 256. However, this approach increases sequence lengths up to 12× compared to characters, determined by the script's Unicode encoding, making optimal batch sizes prohibitively large and slow for our computational resources.
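To make the disparity concrete, the snippet below (illustrative, not from our pipeline) prints the UTF-8 byte cost per codepoint for several scripts; user-perceived characters composed of multiple codepoints, common in e.g. Telugu, inflate byte-level sequence lengths further.

```python
# Illustrative: UTF-8 byte cost per codepoint varies widely across scripts,
# so byte-level models see much longer sequences for many non-Latin texts.
for text in ["hello", "привет", "你好", "నమస్కారం"]:
    n_chars = len(text)                  # codepoints
    n_bytes = len(text.encode("utf-8"))  # byte-level sequence length
    print(f"{text!r}: {n_chars} codepoints -> {n_bytes} bytes "
          f"({n_bytes / n_chars:.1f}x)")
```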
Rendering text to images bypasses many of the vocabulary challenges posed by multilingual modeling. Pixel-based representations have the advantage of no predetermined static vocabulary, no exploding embedding matrix parameters or sequence lengths, and complete parameter sharing across similar word forms at a sub-character level, regardless of the underlying Unicode or byte structure.
Below we present the technical details of our approach and comparisons before proceeding to experimental settings and results.

Encoding text with pixels
Figure 2 demonstrates the rendering process and resulting Transformer inputs. We render text using the PangoCairo library (https://docs.gtk.org/PangoCairo) following Rust et al. (2023), with a font size of 10pt at 120 DPI. PangoCairo provides greater flexibility than alternatives such as PyGame, used in previous work, by supporting fallback fonts at the character level; this is necessary not only for code-mixing but to support common occurrences such as non-transliterated entities within non-Latin scripts. We tokenize sentence-level images into fixed-size image tokens with height h=24, width w=24, and stride s=12, which results in ∼3 Latin characters per token. The height was chosen to fit the wide variety of scripts and diacritics in our experimental data with a fixed font size. We use the Google Noto Sans fonts collection, which covers the majority of Unicode codepoints (see https://notofonts.github.io/overview for the Noto fonts and their Unicode coverage). Further discussion of rendering parameter choices is found in App. C. No preprocessing is applied before rendering. We train many-to-one multilingual models with pixel representations on the source side, and generate discrete subword tokens on the target side, as described below.
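A minimal numpy sketch of this sliding-window tokenization is shown below, assuming the sentence has already been rendered to a grayscale array (the actual pipeline renders with PangoCairo); the height, width, and stride defaults mirror the values above.

```python
import numpy as np

def image_to_tokens(img: np.ndarray, h: int = 24, w: int = 24, s: int = 12) -> np.ndarray:
    """Slice a rendered sentence image (h x W, grayscale, white=255) into
    overlapping fixed-size image tokens of width w with stride s."""
    assert img.shape[0] == h
    pad = (-(img.shape[1] - w)) % s  # right-pad so the final window is full
    img = np.pad(img, ((0, 0), (0, pad)), constant_values=255)
    n_tokens = (img.shape[1] - w) // s + 1
    return np.stack([img[:, i * s : i * s + w] for i in range(n_tokens)])

# A 24x300 sentence image with w=24 and s=12 yields 24 overlapping 24x24 tokens.
tokens = image_to_tokens(np.full((24, 300), 255, dtype=np.uint8))
print(tokens.shape)  # (24, 24, 24)
```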

Traditional subword tokenization
We generated all subword vocabularies using SentencePiece unigramLM (Kudo, 2018; Kudo and Richardson, 2018). In exploratory experiments, we compared the union of subword vocabularies constructed per-language to a jointly-trained subword vocabulary of the same total size. Individual vocabularies were of size 5k, scaled equivalently for joint vocabularies, e.g. 35k for 7 source languages. The two constructions did not result in significant differences in downstream performance in our balanced or imbalanced data settings, so we present only joint vocabulary results in the main text, as this approach scales more easily to 59 languages.
Results for both constructions are shown in App. G. We use separate source and target vocabularies, and share target vocabularies between subword and pixel models in order to isolate the source representation change. Vocabulary sizes for all models and datasets are shown in Table 4 in App. B.
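For reference, a joint source vocabulary of the kind described above can be trained with SentencePiece roughly as follows; the file path is hypothetical and the exact training options in our experiments may differ.

```python
import sentencepiece as spm

# Train one joint unigramLM vocabulary over concatenated source-side text
# (e.g. 5k per language, scaled to 35k total for 7 source languages).
spm.SentencePieceTrainer.train(
    input="train.all-src.txt",  # hypothetical path: concatenated source text
    model_prefix="joint_src",
    vocab_size=35000,
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="joint_src.model")
print(sp.encode("Multilingual pixel representations", out_type=str))
```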

Model architecture
Our core architecture follows Salesky et al. (2021) and combines a convolutional block, which processes image tokens and produces flattened vectors (Figure 2), with a Transformer encoder-decoder model. Convolutional layers use one color channel and a 3 × 3 kernel with a stride of 1. Our conventional text models share the same Transformer architecture and replace the convolutional block with a traditional embedding matrix of size V × 512.
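The sketch below illustrates this design in PyTorch: a small convolutional stack maps each 24×24 image token to a 512-dim vector in place of an embedding lookup. The layer count and channel sizes here are illustrative rather than our exact configuration.

```python
import torch
import torch.nn as nn

class PixelEmbedder(nn.Module):
    """Convolutional block standing in for a source embedding matrix: each
    grayscale image token is mapped to a d_model-dim vector (cf. Figure 2)."""
    def __init__(self, d_model: int = 512, h: int = 24, w: int = 24):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, stride=1, padding=1),  # one color channel
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
        )
        self.proj = nn.Linear(64 * h * w, d_model)  # flatten -> Transformer input

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, h, w) grayscale image tokens
        b, n, h, w = tokens.shape
        feats = self.conv(tokens.view(b * n, 1, h, w))
        return self.proj(feats.flatten(1)).view(b, n, -1)

emb = PixelEmbedder()
print(emb(torch.rand(2, 10, 24, 24)).shape)  # torch.Size([2, 10, 512])
```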
Our base models are Transformers with 6 encoder and 6 decoder layers, hidden units of dim 512, feed-forward layers of dim 1024, and 4 attention heads. We train our models with the Adam optimizer (Kingma and Ba, 2015) with linear warmup, learning rate 5e-4, dropout of 0.1, and label smoothing 0.2. We train with temperature sampling T=1.5 in language-imbalanced settings (Arivazhagan et al., 2019; Shaham et al., 2023). We use batches of 160k tokens, and train until performance on a held-out validation set fails to improve for ten validations. Trained models and scripts to replicate them will be released upon publication (https://github.com/esalesky/visrep/tree/multi).

Reparameterizing model capacity with deeper encoders and shallower decoders has been shown to be beneficial, particularly for large multilingual vocabularies and/or smaller-granularity inputs such as characters (Cherry et al., 2018; Kasai et al., 2021; Kong et al., 2021; Xu et al., 2021; Berard et al., 2021). Replacing the source embedding matrix with visual representations frees parameters which may be re-allocated elsewhere within the model. As we expand language coverage with pixel-based representations, it is not clear a priori whether and where additional capacity may be needed to scale performance compared to models for individual languages or traditional text models. We experiment with different ways to add and allocate model capacity with both pixel and text inputs, with results presented in § 3.1.
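For concreteness, the temperature-based sampling mentioned above draws language l with probability proportional to (n_l/N)^(1/T), flattening the data distribution relative to proportional sampling (T=1). A minimal sketch with placeholder sizes:

```python
import numpy as np

def sampling_probs(sizes: dict, T: float = 1.5) -> dict:
    """Temperature sampling (Arivazhagan et al., 2019): language l is drawn
    with probability proportional to (n_l / N) ** (1 / T), upweighting
    lower-resource languages relative to proportional sampling."""
    n = np.array(list(sizes.values()), dtype=float)
    p = (n / n.sum()) ** (1.0 / T)
    return dict(zip(sizes, p / p.sum()))

# A high- vs low-resource pair in an imbalanced setting (placeholder counts).
print(sampling_probs({"fr": 200_000, "bn": 5_000}))
```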

Multilingual translation with pixels
We experiment with two datasets to investigate the performance and properties of multilingual pixel representations for machine translation. We perform initial experiments with the balanced 7-language-pair multi-target TED data (Duh, 2018) used by Salesky et al. (2021), which we refer to as TED-7, to compare performance to prior work with pixel representations and explore any necessary architectural changes in the multilingual setting. We then scale up using the larger 59-language-pair TED talk corpus from Qi et al. (2018), or TED-59. In all cases, our models are many-to-one multilingual translation models with English as the target. We list the languages in each corpus with the number of training examples in App. A. Results for all datasets are shown in Table 1.

Model capacity: wider or deeper?
Increasing language coverage often requires increased model capacity. We find that the small base architecture from Salesky et al. (2021) is unstable and may not converge when trained multilingually without additional capacity or larger batch sizes. For TED-7, multilingual source embeddings account for 33% of the total parameters of the best subword model. Without a source embedding matrix, despite the additional convolutional block, a pixel model with the same Transformer architecture as a subword model would be ∼17M parameters (or 38%) smaller, as seen in Figure 3, and may thus require different parameterization and exhibit different scaling behavior.
We investigate both the impact of reparameterizing the baseline model and of increasing capacity through greater encoder depth and/or width. We first find that shifting from an equal-depth encoder-decoder model to a deep encoder and shallow decoder with the same number of parameters provides consistent improvements; for example, moving from 6−6 to 9−3 layers improves performance on the TED-7 dataset from 18.5 to 21.3 BLEU (+2.8). We maintain a shallow 3-layer decoder while varying the encoder through the remainder of this section. With an equal number of model parameters, increasing depth is more impactful than width, as seen in Figure 3a. Increasing width provides consistent improvements at all model sizes, while more significantly increasing overall parameters. With 12−3 layers and FF width 2048, a pixel-based model has an equivalent number of parameters to the best subword model (∼55M) while continuing to improve with scale. The best pixel models for this dataset use 12−3 layers and FF width 4096. Continuing to increase depth and overall size has diminishing returns. Pixel models also appear more robust to overparameterization, where text models degrade more quickly, as seen in Figure 3b.
Is the optimal parameterization determined by the granularity of pixel inputs, the amount of training data, or the multilinguality of the task? To check, we reparameterize the models for individual language pairs from Salesky et al. (2021) at both the small and large data sizes (shown in App. F). We find that performance would have decreased in both cases, suggesting the benefit is more likely due to the multilingual task, not the amount of data or pixel representations inherently.
For the larger TED-59 dataset (1.2M→5.1M training examples), we use the same architecture as for TED-7. Exact model configurations for each dataset and representation scheme are listed together in App. B.

Language coverage and imbalanced data
Including additional languages can sometimes interfere with rather than improve performance (the 'curse of multilinguality'; Conneau et al., 2020). When we compare our multilingual models to individual models for the same language pairs on TED-7, we see that all languages improve through multilingual training with pixel representations, while this is not the case for subword-based models, where two language pairs degrade (Figure 4). Improvements are greatest for those language pairs (ja, ko, zh) where individual pixel models performed worse than BPE in Salesky et al. (2021). Improvements could be due to boosts from languages with similar scripts (zh and ja, or fr and de) or simply an increase in total training data: we investigate this in § 4.1 for TED-59, where we have more languages to study. Notably, improvements come without interference for pixel models here. Comparing multilingual pixel and BPE models, we see small but consistent improvements on TED-7 (Figure 5).
The TED-7 setting has relatively balanced data across all languages and scripts, with at least 150k examples per pair; this is a reasonable baseline but unrealistic in the context of typical multilingual translation settings. We turn to the TED-59 dataset, with increased language coverage, imbalanced training data, and uneven script representation, for a more realistic setting in which to see whether our improvements hold or interference emerges. Here we see larger improvements of up to 9 BLEU compared to BPE for most language pairs, and some degradation for 2 pairs whose scripts have only ∼5k training examples across all languages, highlighted in Figure 5.
Given the large and imbalanced nature of this dataset, we also compare to prior work with a many-to-many multilingual model with language-aware multi-head attention, which achieves 25.3 average BLEU: our many-to-one pixel model improves on this by +3.1 BLEU. (Their many-to-many model is trained on 2× as many sentences as the models presented here by reversing the dataset.) They do not report results per language for further comparison.

Figure 6: Performance improvements with pixel representations are most strongly correlated with the total amount of data for a language's script (ρ=0.70), compared to language or language family. Data size per language is listed in App. A.
Properties of multilingual pixel models

Positive transfer across languages
We examine the relationship between performance and the amount of data representing each source language, family, and script, to find the greatest contributors to improvements with pixel representations on TED-59. The amount of data for a given pair is only weakly correlated with performance for both pixel and subword representations (ρ≤0.3, p<0.05), while language family and script representation are moderately correlated (ρ=0.5−0.6, p≪0.001), suggesting some positive transfer across languages and scripts for both approaches. However, each factor's relationship to performance improvement, rather than to raw scores, better reflects what is responsible for the difference. As shown in Figure 6, the amount of data for a given script is strongly correlated with ∆BLEU (ρ=0.70, p≪0.001), while family is moderately correlated (0.35) and data for individual language pairs has no clear relationship. We conclude that pixels enable more effective cross-lingual transfer than joint subword vocabularies between languages with the same script, and to a lesser degree the same family. We hypothesize that we would see similar improvements for Bengali and Tamil with at least 10k examples for their scripts.
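This analysis can be reproduced in a few lines; the sketch below uses Spearman rank correlation (our assumption for the reported ρ values) with placeholder numbers rather than our measurements.

```python
from scipy.stats import spearmanr

# Correlate per-language BLEU improvement over the subword baseline with the
# total training data for that language's script (placeholder values).
script_data = [500_000, 120_000, 40_000, 5_000]  # examples per script
delta_bleu  = [9.0, 4.2, 2.5, -1.1]              # pixel minus subword BLEU

rho, p = spearmanr(script_data, delta_bleu)
print(f"rho={rho:.2f}, p={p:.3g}")
```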

Clustering by language and script
To better understand how pixel representations pattern by language and script, we compare our subword embeddings and our pixel representations. Using the validation set as input, we compute sentence-level vectors by mean-pooling over token embeddings for each sentence for the subword model, or over the linearly projected vectors of the same dimension from the convolutional block for the pixel model. We visualize these representations using t-SNE (van der Maaten and Hinton, 2008), in Figure 7 for TED-59 and in App. H for the smaller TED-7.
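A sketch of how these sentence-level vectors and the 2-D projection can be computed, with random arrays standing in for the model outputs:

```python
import numpy as np
from sklearn.manifold import TSNE

def sentence_vectors(token_reprs: list) -> np.ndarray:
    """Mean-pool per-token representations (subword embeddings, or projected
    convolutional outputs of the same dimension) into one vector per sentence."""
    return np.stack([t.mean(axis=0) for t in token_reprs])

# Placeholder: 500 validation sentences of 20 tokens with 512-dim vectors.
vecs = sentence_vectors([np.random.rand(20, 512) for _ in range(500)])
xy = TSNE(n_components=2, perplexity=30).fit_transform(vecs)
print(xy.shape)  # (500, 2) points, colored by language/script when plotted
```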
Pixel representations cluster neatly by script (7a), reflecting the strong ability to share information between languages of the same script discussed in § 4.1. Subword embeddings do not cluster as strongly by script despite shared subwords, with many separate clusters for e.g. Latin-script languages (7b). We observe that subword embeddings cluster more tightly by language and family (7d), with less representational overlap between languages than we see with pixels (7c). However, the visual model still reflects some similarities within families, both within and across scripts. For example, in the large Latin-script cluster in 7c, all Uralic languages appear within close proximity of each other, as do the Austronesian languages, and some overlap exists between Cyrillic and Latin representations in 7a, which likely reflects Slavic family similarities rather than visually similar characters, given that the vectors are sentence-level.

Complete parameter sharing
With traditional model vocabularies, parameters are not shared between embeddings; only 3% of source embeddings are updated per batch on average for TED-59, without redistribution techniques such as label smoothing. In contrast, 100% of the pixel model's representation block parameters are updated every batch due to parameter sharing at the pixel level. Pixel representations have direct access to token sub-components, whereas subwords do not, leading to more similar representations for related word forms, e.g. with and without diacritics: with the TED-59 subword vocabulary, the Arabic forms of "book" with and without diacritics have disjoint subword decompositions and so do not share embeddings, whereas their pixel representations are highly similar. As visualized in Figure 8, the convolutional layer feature activations remain highly similar despite the inserted diacritics. If a pixel-based model observes partial lexical matches such as "ktb" and "kitab" in training, parameters for both will be updated by backpropagation through the shared pixel values; we hypothesize that this contributes to the increased transfer across languages with the same script and to the performance improvements. Future work may investigate whether this property leads to more compositional representations.
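The contrast can be made concrete with a short sketch: with a discrete vocabulary, only the embedding rows for subwords present in a batch receive gradients, and Zipf-distributed token frequencies keep that fraction small (the simulated numbers below are illustrative, not our measured 3%).

```python
import numpy as np

def fraction_updated(batch_ids: np.ndarray, vocab_size: int) -> float:
    """Fraction of embedding rows receiving a gradient from one batch: only
    rows for subword ids occurring in the batch get non-zero updates
    (ignoring redistribution effects such as label smoothing)."""
    return np.unique(batch_ids).size / vocab_size

# Zipf-distributed ids mimic natural subword frequencies: a ~160k-token batch
# touches only a small fraction of a 64k-type vocabulary.
vocab = 64_000
ids = np.clip(np.random.zipf(1.3, size=160_000), 1, vocab) - 1
print(f"{fraction_updated(ids, vocab):.1%}")
```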

Reduced frequency-based representation degeneration
Previous work has shown that embeddings can suffer from a frequency-based representation degeneration problem, where infrequent and unseen words cluster together in embedding space due to limited parameter updates during training (Gao et al., 2019). However, as pixel models share parameters at the pixel level, all representations are updated to some degree each batch, regardless of subword-level text frequency. The low-frequency degeneration effect should therefore be reduced in pixel models, and rare words may not cluster as strongly.
We examine this phenomenon by comparing the source embeddings from the subword model against representations from the pixel model on TED-7. We obtain a comparable set of representations from the pixel model by rendering each subword in the TED-7 source vocabulary and mean-pooling the output of the convolutional block over all resulting visual tokens.
We plot these embeddings using 2-D singular value decomposition, coloring each point according to the log-frequency of its corresponding subword, in Figure 9. For the visual embeddings we exclude 1% of outliers for improved readability (the full plot is included in App. I). In the text model there is both a clear frequency bias and a cluster of low-frequency embeddings. In the pixel model, though we see some frequency bias among embeddings, the distribution of low-frequency embeddings is improved.
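A sketch of this analysis, with placeholder embeddings and frequencies standing in for the real model values:

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholders for the real values: source representations and the training
# frequency of each corresponding subword.
E = np.random.randn(35_000, 512)          # (vocab, dim) source representations
freqs = np.random.zipf(1.3, size=35_000)  # subword counts in the training data

# Project to 2-D with SVD and color each point by log frequency.
U, S, _ = np.linalg.svd(E - E.mean(axis=0), full_matrices=False)
xy = U[:, :2] * S[:2]
plt.scatter(xy[:, 0], xy[:, 1], c=np.log(freqs), s=1, cmap="viridis")
plt.colorbar(label="log frequency")
plt.show()
```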

Data-efficient cross-lingual transfer
Using pretrained multilingual models for cross-lingual transfer can provide significant performance improvements, particularly when the target language is under-resourced. However, adapting models to unseen scripts with no lexical coverage in the original model typically requires techniques such as expanding the embedding matrix to include new vocabulary (Wang et al., 2019b) or language-specific adapters (Houlsby et al., 2019; Pfeiffer et al., 2020). In contrast, models with pixel representations can be finetuned directly on new languages and scripts without any architectural changes (Rust et al., 2023). We hypothesize that the model properties discussed in § 4 not only allow transfer without model extensions, but enable transfer more data-efficiently, requiring fewer examples to achieve good performance.
To evaluate the data-efficiency of cross-lingual transfer, we adapt our multilingual models to language pairs with five new source languages, each with a different degree of script coverage relative to the languages observed in pretraining, as quantified in Table 3: Romanian, Polish, Farsi, Vietnamese, and Hebrew. We randomly sample 10k, 50k, and 150k (∼all) sentences from the multi-target TED dataset used for TED-7 for each new language pair, and finetune our TED-7 models on the training data for each pair individually for up to 30 epochs, with early stopping if there is no improvement on the held-out validation set for 5 epochs. We use the TED-7 models because they do not cover these languages in pretraining; we note that the overall performance on the original task is similar for pixel and subword models. In addition to the pixel and subword models, we also compare subword models with vocabulary expansion, where the source embedding matrix is extended to include BPE inventories of size 5k trained for each new language, with the new embeddings randomly initialized.
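A minimal sketch of this vocabulary-expansion baseline, assuming a PyTorch embedding layer; existing rows are copied and the new language's rows are randomly initialized.

```python
import torch
import torch.nn as nn

def expand_embeddings(old: nn.Embedding, n_new: int) -> nn.Embedding:
    """Append n_new randomly initialized rows for a new language's subwords
    (cf. Wang et al., 2019b), preserving all pretrained embedding rows."""
    new = nn.Embedding(old.num_embeddings + n_new, old.embedding_dim)
    with torch.no_grad():
        new.weight[: old.num_embeddings] = old.weight  # copy pretrained rows
    return new

# e.g. extend a 35k source vocabulary with a 5k BPE inventory for Hebrew
src_emb = expand_embeddings(nn.Embedding(35_000, 512), n_new=5_000)
print(src_emb.num_embeddings)  # 40000
```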
Whether a model vocabulary covers a particular script is typically described as binary, but even within observed scripts, new languages introduce unseen character sequences and diacritics which will not be appropriately represented. We observe that for Unicode-based models, transfer capability is strongly reflected in lexical coverage; vocabulary expansion improves performance slightly for languages with higher n-gram coverage, and significantly for Hebrew, which has minimal coverage, particularly with more data to train the new language-specific embeddings, as seen in Figure 10. However, pixel representations enable models to perform better still than vocabulary expansion, particularly with less data. We believe this is because, with complete parameter sharing across all scripts, all parameters for new languages are more strongly initialized. This direction may lead to more data-efficient cross-lingual transfer, particularly for under-resourced languages and tasks.

Related Work
Previous work has shown allocating additional encoder capacity to be beneficial for smaller-granularity inputs, both for characters and bytes (Cherry et al., 2018; Xue et al., 2022b) and for other modalities (He et al., 2021; Zhang et al., 2017). Deep encoders and shallow decoders have been used to improve model efficiency and latency with subword inputs (Kim et al., 2019; Kasai et al., 2021; Kong et al., 2021), and deeper and narrower encoders have been shown to scale more effectively (Tay et al., 2022; Xue et al., 2022a).
Significant prior work has been devoted to broader and more effective language coverage: full Unicode character coverage with downsampling (Clark et al., 2022), clustered vocabularies for efficient modeling of large vocabularies (Chung et al., 2020; Liang et al., 2023), byte-level modeling (Gillick et al., 2016; Xue et al., 2022b), and bytes in conjunction with BPE, either to combat data sparsity and memory issues (BBPE: Radford et al., 2019; Wang et al., 2019a) or as byte-fallback (Xue et al., 2022b). Mapping characters to a smaller set of common representations across scripts through transliteration (Amrhein and Sennrich, 2020; Purkayastha et al., 2023) or grapheme-to-phoneme systems (Sun et al., 2022; Gheini and May, 2019) has also been shown beneficial for multilingual and cross-lingual transfer between related languages across scripts, though such mappings may introduce collisions which can negatively affect performance. Post-hoc vocabulary expansion (Wang et al., 2019b; Moon and Okazaki, 2020) and language adapters (Houlsby et al., 2019; Pfeiffer et al., 2020) to increase vocabulary coverage have also been shown to be very effective. Recently, pixel representations have been proposed as a vocabulary-free alternative (Salesky et al., 2021; Rust et al., 2023), though they had not previously been trained multilingually. We refer readers to the BigScience tokenization survey for greater discussion (Mielke et al., 2021).

Conclusions
We introduce and demonstrate how to effectively train multilingual pixel representations for machine translation. We experiment with two data scales with a variety of language and script coverage, demonstrating improved performance compared to the traditional subword approach. We analyze various properties of pixel representations to better understand where they may provide benefits, and the impact of different scripts and data representation. We observe that these properties not only enable cross-lingual transfer to unseen scripts, but make pixel representations more data-efficient than alternatives such as vocabulary expansion. We hope this work contributes to more extensible multilingual models for all languages and scripts.

Limitations
Our multilingual experiments are many-to-one thus far, and apply visual representations to the source languages only; whether the dynamics would change with multiple target languages is not yet known. Though we experiment with multiple resource scales up to ∼5M sentences, our settings remain limited in scale and domain compared to large-scale industry models, and it remains to be seen how this approach would fare in other settings. In very low-resource settings with fewer than 10k examples for a given script, our approach may perform worse than traditional subword embeddings. We observe that pixel models are in some settings slower to converge than subword equivalents, which we cautiously attribute to sub-optimal hyperparameters. Though the compute resources required for training are similar to those for traditional text representations, significantly more disk space is required to store rendered text than raw text; this is necessary if pre-computing batches rather than rendering on-the-fly, and may limit efficiency in larger-scale settings. Scalability to longer text has not yet been investigated.

Ethics Statement
The aim of this work is to reduce the vocabulary bottleneck which disproportionately affects low-resource languages, as they are less likely to be appropriately represented in traditional discrete multilingual model vocabularies. Alternatives such as byte-level tokenization potentially increase rather than decrease the disparity between scripts, as a single character may be represented by up to 12 bytes in e.g. Telugu, whereas Latin scripts are typically 1:1 characters:bytes (Ahia et al., 2023). We show the sequence lengths resulting from byte, character, BPE, and pixel 'tokenization' on TED-59 in Figure 11, App. D; of the alternatives to BPE tokenization, pixel representations result in the most similar sequence lengths and the lowest variance across languages and scripts.
In application settings, substituting visually similar characters such as '0' for 'O' can be used to circumvent lexical filters as used for e.g. spam filtering, hate speech detection, or censorship. Pixel representations may make these substitutions less effective, which may be beneficial or harmful depending on the setting.

A List of Languages by Dataset
We list the source languages in each dataset with the number of training examples and language code. All datasets are many-to-one parallel with English as the target language.
For TED-7 and TED-59, we use the provided train/dev/test splits, and report results on test using model checkpoints chosen based on dev perplexities.

D Variance in sequence lengths across tokenizations
Below we show the sequence lengths resulting from byte, character, and BPE tokenization and from pixel representations on TED-59. Of the alternatives to BPE, pixel representations result in the most similar sequence lengths and the lowest variance across languages and scripts.

E Full results reported by individual language pair
In addition to the aggregated metric scores reported in the main text, below we report results for each individual language pair with three metrics: BLEU, chrF, and COMET.
Results are organized by dataset. TED-7 results are reported in Table 5, and TED-59 results in Table 6.
E.1 Individual language pair results: TED-7

I Full SVD plot of TED-7 pixel model embeddings

Figure 1: Embedding matrices are disjoint parameter allocations by script, leading to a vocabulary bottleneck. Pixel representations, however, share parameters across scripts and are not limited to a discrete vocabulary.

Figure 2: Encoding text with pixels: text is rendered to images by sentence. Image tokens are created by overlapping sliding windows of fixed height (h), width (w), and stride (s). Convolutional layer output is projected to flat vectors for subsequent Transformer layers.

Figure 4: Improvement with multilingual models over models for each lang. pair.

Figure 7: Clustering shows more representational similarity within scripts and across languages with pixel representations than with disjoint subword embeddings on the TED-59 dataset. Individual languages from the same family are shown with different shades of the same color in panels (c) and (d).

Figure 8: Pixel representations result in similar representations for partial lexical matches due to visual similarity and parameter sharing at the pixel level.

Figure 9: SVD plots of source representations show that traditional embeddings cluster infrequent subwords together more tightly than pixel representations do.

Figure 10: Data-efficiency in cross-lingual transfer. Models with pixel-based representations adapt more efficiently and effectively to new scripts than models with traditional text representations (shown here: Hebrew).

Figure 11: Average sequence length with various tokenization schemes, compared on TED-59.

Figure 13: Full SVD visualization of source-side embeddings from the TED-7 pixel model. Only 1% of all embeddings lie above y = 0.03; these were excluded from the main text plot to assist readability.

Table 1: Model performance across two datasets on test. Models chosen by perplexity on held-out validation sets. Metric scores are averaged across all languages in the dataset; App. E shows results for individual language pairs.

Table 2: Results for 4 high-resource (HR) and low-resource (LR) language pairs used in previous work.

Table 3: Script coverage in pretraining, measured at the level of character n-grams. Improvements with pixel representations are averaged across all resource settings.

Below we report the details of the best performing model for each dataset and source representation (columns: Dataset, #Sents, Model, V src, V tgt, Emb. dim., Enc. layers, Dec. layers, FF width, Attn. heads, #Params).

Table 4: Details of pixel and subword model scale variants.

Table 5: Results on the TED-7 evaluation set, reported by individual language pair.

Table 6: Results on the TED-59 evaluation set, reported by individual language pair.