Exploring Better Text Image Translation with Multimodal Codebook

Text image translation (TIT) aims to translate the source texts embedded in an image into target translations, which has a wide range of applications and thus important research value. However, current studies on TIT are confronted with two main bottlenecks: 1) this task lacks a publicly available TIT dataset, and 2) dominant models are constructed in a cascaded manner, which tends to suffer from the error propagation of optical character recognition (OCR). In this work, we first annotate a Chinese-English TIT dataset named OCRMT30K, providing convenience for subsequent studies. Then, we propose a TIT model with a multimodal codebook, which is able to associate the image with relevant texts, providing useful supplementary information for translation. Moreover, we present a multi-stage training framework involving text machine translation, image-text alignment, and TIT tasks, which fully exploits additional bilingual texts, an OCR dataset, and our OCRMT30K dataset to train our model. Extensive experiments and in-depth analyses strongly demonstrate the effectiveness of our proposed model and training framework.


Introduction
In recent years, multimodal machine translation (MMT) has achieved great progress and thus received increasing attention. Current studies on MMT mainly focus on text machine translation with scene images (Elliott et al., 2016; Calixto et al., 2017a; Elliott and Kádár, 2017; Libovický et al., 2018; Ive et al., 2019; Zhang et al., 2020; Sulubacak et al., 2020). However, a more common requirement for MMT in real-world applications is text image translation (TIT) (Ma et al., 2022), which aims to translate the source texts embedded in the image into target translations. Due to its wide applications, the industry has developed multiple services to support this task, such as Google Camera Translation.

Figure 1: An example of text image translation. The bounding box in red marks the text to be recognized. We can observe that an incorrect OCR result negatively affects the subsequent translation.
Current studies on TIT face two main bottlenecks. First, this task lacks a publicly available TIT dataset. Second, the common practice is to adopt a cascaded translation system, where the texts embedded in the input image are first recognized by an optical character recognition (OCR) model, and the recognition results are then fed into a text-only neural machine translation (NMT) model for translation. However, such a method tends to suffer from OCR error propagation and thus often generates unsatisfactory translations. As shown in Figure 1, "富锦消防" ("fu jin xiao fang") in the image is incorrectly recognized as "富锦消阳" ("fu jin xiao yang"). Consequently, the text-only NMT model incorrectly translates it into "Fujin Xiaoyang". Furthermore, we use the commonly-used PaddleOCR (https://github.com/PaddlePaddle/PaddleOCR) to handle several OCR benchmark datasets. As reported in Table 1, we observe that the highest recognition accuracy at the image level is less than 67%, and that at the sentence level is not higher than 81%. OCR errors are thus very common, and they have a serious negative impact on subsequent translation.
In this paper, we first manually annotate a Chinese-English TIT dataset named OCRMT30K, providing convenience for subsequent studies. This dataset is developed based on five Chinese OCR datasets and includes about 30,000 image-text pairs. Besides, we propose a TIT model with a multimodal codebook to alleviate the OCR error propagation problem. The basic intuition behind our model is that when humans observe incorrectly recognized text in an image, they can still associate the image with relevant or correct texts, which can provide useful supplementary information for translation. Figure 3 shows the basic architecture of our model, which mainly consists of four modules: 1) a text encoder that converts the input text into a hidden state sequence; 2) an image encoder that encodes the input image as a visual vector sequence; 3) a multimodal codebook, which can be described as a vocabulary comprising latent codes, each of which represents a cluster. It is trained to map the input images and ground-truth texts into the shared semantic space of latent codes. During inference, this module is fed with the input image and outputs latent codes containing text information related to the ground-truth texts; and 4) a text decoder that is fed with the combined representation of the recognized text and the output latent codes, and then generates the final translation.
Moreover, we propose a multi-stage training framework for our TIT model, which can fully exploit additional bilingual texts and OCR data for model training. Specifically, our framework consists of four stages. First, we use a large-scale bilingual corpus to pretrain the text encoder and text decoder. Second, we pretrain the newly added multimodal codebook on a large-scale monolingual corpus. Third, we further introduce an image encoder that includes a pretrained vision Transformer with fixed parameters to extract visual features, and continue to train the multimodal codebook. Additionally, we introduce an image-text alignment task to enhance the ability of the multimodal codebook in associating images with related texts. Finally, we finetune the entire model on the OCRMT30K dataset. Particularly, we maintain the image-text alignment task at this stage to reduce the gap between the third and fourth training stages.
Our main contributions are as follows: • We release the OCRMT30K dataset, which is the first Chinese-English TIT dataset, promoting subsequent studies.
• We present a TIT model with a multimodal codebook, which can leverage the input image to generate the information of relevant or correct texts, providing useful information for the subsequent translation.
• We propose a multi-stage training framework for our model, which effectively leverages additional bilingual texts and OCR data to enhance the model training.
• Extensive experiments and analyses demonstrate the effectiveness of our model and training framework.

Related Work
In MMT, most early attempts exploit visual context via attention mechanisms (Caglayan et al., 2016; Huang et al., 2016; Calixto et al., 2017a; Libovický and Helcl, 2017; Calixto and Liu, 2017; Su et al., 2021). Obviously, the effectiveness of conventional MMT heavily relies on the availability of bilingual texts with images, which restricts its wide applicability. To address this issue, Zhang et al. (2020) first build a token-image lookup table from an image-text dataset, and then retrieve images matching the source keywords to benefit the prediction of target translations. Recently, Fang and Feng (2022) present a phrase-level retrieval-based method that learns visual information from pairs of source phrases and grounded regions.
Besides, researchers have investigated whether visual information is really useful for machine translation. Elliott (2018) finds that irrelevant images have little impact on translation quality, and follow-up studies attribute the gain of MMT to a regularization effect. Unlike these conclusions, Caglayan et al. (2019) and others observe that MMT models rely more on images when textual ambiguity is high or textual information is insufficient.
To break the limitation that MMT requires sentence-image pairs during inference, researchers introduce different modules, such as image prediction decoder (Elliott and Kádár, 2017), generative imagination network (Long et al., 2021), autoregressive hallucination Transformer (Li et al., 2022b), to produce a visual vector sequence that is associated with the input sentence.
Significantly different from the above studies on MMT with scene images, several works explore other directions in MMT. For instance, Calixto et al. (2017b), among others, investigate product-oriented machine translation, and other researchers focus on multimodal simultaneous machine translation (Caglayan et al., 2020; Ive et al., 2021). Moreover, there is a growing body of studies on video-guided machine translation (Wang et al., 2019; Gu et al., 2021; Kang et al., 2023). These studies demonstrate the diverse applications and potential of MMT beyond scene images.
In this work, we mainly focus on TIT, which suffers from incorrectly recognized text information and is more practicable in real scenarios. The most related works to ours are (Mansimov et al., 2020; Jain et al., 2021; Ma et al., 2022). Mansimov et al. (2020) first explore the in-image translation task, which transforms an image containing the source text into an image with the target translation. They not only build a synthetic in-image translation dataset but also put forward an end-to-end model combining a self-attention encoder with two convolutional encoders and a convolutional decoder. Jain et al. (2021) focus on the TIT task, and propose to combine OCR and NMT into an end-to-end model with a convolutional encoder and an autoregressive Transformer decoder. Along this line, Ma et al. (2022) apply multi-task learning to this task, where MT, TIT, and OCR are jointly trained. However, these studies only center around synthetic TIT datasets, which are far from real scenarios.

Dataset and Annotation
To the best of our knowledge, there is no publicly available dataset for the task of TIT. Thus, we first manually annotate a Chinese-English TIT dataset named OCRMT30K, which is based on five commonly-used Chinese OCR datasets: RCTW-17 (Shi et al., 2017), CASIA-10K, ICDAR19-MLT, ICDAR19-LSVT, and ICDAR19-ArT. We hire eight professional translators for annotation over five months, and each translator is responsible for annotating 25 images per day to prevent fatigue. Translators are shown an image with several Chinese texts and are required to produce correct and fluent English translations for them. In addition, we hire a professional translator to sample and check the annotated instances for quality control. In total, we annotate 30,186 instances, and the number of parallel sentence pairs is 164,674. Figure 2 presents an example of our dataset.

Task Formulation
In this work, following common practices (Afli and Way, 2016; Ma et al., 2022), we first use an OCR model to recognize texts from the input image v. Then, we feed both v and each recognized text x̂ into our TIT model, producing the target translation y. In addition, x is used to denote the ground-truth text corresponding to x̂ recognized from v.
To train our TIT model, we focus on establishing the following conditional predictive probability distribution:

p(y | v, x̂; θ) = ∏_{t=1}^{|y|} p(y_t | v, x̂, y_<t; θ), (1)

where θ denotes the model parameters.

Figure 3: The overall architecture of our model, which includes a text encoder, an image encoder, a multimodal codebook, and a text decoder. Particularly, the multimodal codebook is the most critical module, which can associate images with relevant or correct texts. x̂ is the recognized text from the input image v, e_k represents the k-th latent code embedding, and ŷ is the output target translation.

Model Architecture
As shown in Figure 3, our model includes four modules: 1) a text encoder converting the input text into a hidden state sequence; 2) an image encoder encoding the input image as a visual vector sequence; 3) a multimodal codebook that is fed with the image representation and then outputs latent codes containing the text information related to the ground-truth text; and 4) a text decoder that generates the final translation under the semantic guidance of the text encoder hidden states and the output latent codes. All these modules are elaborated in the following.
Text Encoder. Similar to dominant NMT models, our text encoder is based on the Transformer (Vaswani et al., 2017) encoder. It stacks L_e identical layers, each of which contains a self-attention sub-layer and a feed-forward network (FFN) sub-layer.
Let H_e^(l) denote the hidden state sequence output by the l-th encoder layer, which is computed as

C_e^(l) = MHA(H_e^(l-1), H_e^(l-1), H_e^(l-1)),
H_e^(l) = FFN(C_e^(l)),

where MHA(·, ·, ·) denotes a multi-head attention function (Vaswani et al., 2017) applied to the query, key, and value sequences. Particularly, H_e^(0) is the sum of word embeddings and position embeddings. Note that we follow Vaswani et al. (2017) to use residual connection and layer normalization (LN) in each sub-layer, of which descriptions are omitted for simplicity. During training, the text encoder is utilized to encode both the ground-truth text x and the recognized text x̂, so we use Ĥ_e^(l) to denote the hidden states of the recognized text for clarity. In contrast, during inference, the text encoder only encodes the recognized text x̂; refer to Section 4.3 for more details.
Image Encoder. As a common practice, we use ViT (Dosovitskiy et al., 2021) to construct our image encoder. Similar to the Transformer encoder, ViT also consists of L v stacked layers, each of which includes a self-attention sub-layer and an FFN sub-layer. One key difference between the Transformer encoder and ViT is the placement of LN, where pre-norm is applied in ViT.
Given the input image v, the visual vector sequence H_v = (h_v,1, ..., h_v,N_v) output by the image encoder can be formulated as

H_v = ViT(v),

where N_v denotes the number of visual vectors.

Multimodal Codebook. It is the core module of our model. The multimodal codebook is essentially a vocabulary with K latent codes, each of which is represented by a d-dimensional vector e_k, like word embeddings. Note that we always set the dimension of the latent code equal to that of the text encoder, so as to facilitate the subsequent calculation in Equation 11.
With the multimodal codebook, we can quantize the hidden state sequence H_v to latent codes via a quantizer z_q(·). Formally, the quantizer looks up the nearest latent code for each input:

z_q(h) = e_k, where k = argmin_j ||h − e_j||_2.

By doing so, both text and image representations are mapped into the shared semantic space of latent codes.

Text Decoder. This decoder is also based on the Transformer decoder, with L_d identical layers. In addition to self-attention and FFN sub-layers, each decoder layer is equipped with a cross-attention sub-layer, which attends to the concatenation of the recognized text hidden states Ĥ_e^(L_e) and the latent code embeddings output by the multimodal codebook. Finally, at each decoding timestep t, the probability distribution of generating the next target token y_t is defined as

p(y_t | v, x̂, y_<t) = softmax(W h_d,t^(L_d) + b),

where W and b are the parameters of the output projection and h_d,t^(L_d) is the top-layer decoder hidden state at timestep t.
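As a concrete illustration, the nearest-code lookup performed by the quantizer z_q(·) above can be sketched in a few lines of pure Python. This is a toy sketch with illustrative names, not the paper's implementation; a real system would vectorize the search with tensor operations:

```python
import math

def quantize(hidden_states, codebook):
    """Map each hidden state to its nearest latent code (toy sketch).

    hidden_states: list of d-dimensional vectors (text or image states)
    codebook: list of K latent code embeddings e_k
    Returns (indices, quantized), where quantized[i] = codebook[indices[i]].
    """
    indices, quantized = [], []
    for h in hidden_states:
        # Nearest code under the L2 distance: k = argmin_j ||h - e_j||_2
        dists = [math.dist(h, e) for e in codebook]
        k = dists.index(min(dists))
        indices.append(k)
        quantized.append(codebook[k])
    return indices, quantized

# Toy usage: two latent codes, two hidden states
codebook = [[0.0, 0.0], [1.0, 1.0]]
states = [[0.1, -0.2], [0.9, 1.2]]
idx, zq = quantize(states, codebook)  # idx == [0, 1]
```

Because both text and image states are passed through the same lookup, states of either modality that land on the same code share one cluster embedding.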

Multi-stage Training Framework
In this section, we present in detail the procedures of our proposed multi-stage training framework. As shown in Figure 4, it consists of four stages: 1) pretraining the text encoder and text decoder on a large-scale bilingual corpus; 2) pretraining the multimodal codebook on a large-scale monolingual corpus; 3) using additional OCR data to train the image encoder and multimodal codebook via an image-text alignment task; and 4) finetuning the whole model on our released TIT dataset.
Stage 1. We first pretrain the text encoder and text decoder on a large-scale bilingual corpus D_bc in the manner of vanilla machine translation. Formally, for each parallel sentence pair (x, y) ∈ D_bc, we define the following training objective for this stage:

L_mt(θ_te, θ_td) = −Σ_{t=1}^{|y|} log(p(y_t | x, y_<t)), (10)

where θ_te and θ_td denote the trainable parameters of the text encoder and text decoder, respectively.
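The Stage-1 objective is a standard token-level negative log-likelihood; the following toy sketch (with made-up per-token probabilities) shows how L_mt is accumulated:

```python
import math

def mt_loss(token_probs):
    """Token-level NLL: -sum_t log p(y_t | x, y_<t) over one target
    sentence, given the model's probability of each gold token (toy sketch)."""
    return -sum(math.log(p) for p in token_probs)

# Hypothetical per-token probabilities for a 3-token target sentence
loss = mt_loss([0.5, 0.25, 0.8])  # = -log(0.5 * 0.25 * 0.8) = -log(0.1)
```

In practice this loss is computed over whole batches by the framework's cross-entropy function; the scalar form above only makes the objective concrete.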
Stage 2. This stage serves as an intermediate phase, where we exploit monolingual data to pretrain the multimodal codebook. Through this stage of training, we will learn a clustering representation for each latent code of the multimodal codebook.
Concretely, we utilize the same dataset as the first stage but only use its source texts. Following van den Oord et al. (2017), we update the multimodal codebook with an exponential moving average (EMA), where a decay factor determines the degree to which past values affect the current average. Formally, the latent code embedding e_k is updated as follows:

c_k^(t) = Σ_i I(z_q(h_e,i) = e_k),
h_k^(t) = Σ_i I(z_q(h_e,i) = e_k) h_e,i,
n_k^(t) = γ n_k^(t-1) + (1 − γ) c_k^(t),
e_k^(t) = (γ n_k^(t-1) e_k^(t-1) + (1 − γ) h_k^(t)) / n_k^(t), (11)

where I(·) is the indicator function and γ is a decay factor we set to 0.99, as implemented in (van den Oord et al., 2017). c_k counts the number of text encoder hidden states that are clustered into the k-th latent code, h_k denotes the sum of these hidden states, and n_k represents the sum of the past exponentially weighted average and the current value c_k. Particularly, n_k is set to 0 at the beginning.

Stage 3. During this stage, we introduce an image-text alignment task involving an additional OCR dataset D_ocr to further train the image encoder and multimodal codebook. Through this stage of training, we expect to endow the multimodal codebook with the preliminary capability of associating images with related texts.
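As a concrete illustration of the Stage-2 EMA codebook update, the sketch below renders one update step for a single latent code in pure Python, following the van den Oord et al. (2017) recipe; the running sum m_k is made explicit as an auxiliary variable, and all names are illustrative:

```python
def ema_update_code(e_k, n_k, m_k, assigned_states, gamma=0.99):
    """One EMA step for a single latent code k (toy sketch of the
    van den Oord et al. (2017) recipe; variable names are illustrative).

    assigned_states: encoder hidden states clustered into code k this step
    n_k: running (exponentially averaged) count for code k
    m_k: running (exponentially averaged) sum of assigned hidden states
    Returns the updated (e_k, n_k, m_k).
    """
    d = len(e_k)
    c_k = len(assigned_states)                       # batch count for code k
    h_k = [sum(h[i] for h in assigned_states) for i in range(d)]  # batch sum
    n_k = gamma * n_k + (1.0 - gamma) * c_k          # decay the running count
    m_k = [gamma * m_k[i] + (1.0 - gamma) * h_k[i] for i in range(d)]
    if n_k > 0:
        e_k = [m_k[i] / n_k for i in range(d)]       # code = running mean
    return e_k, n_k, m_k

# Toy usage: one 2-d hidden state assigned to a fresh code
e, n, m = ema_update_code([0.0, 0.0], 0.0, [0.0, 0.0], [[1.0, 2.0]])
```

Each code thus drifts toward the exponentially weighted mean of the hidden states assigned to it, without receiving gradients directly.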
Given an image-text training instance (v, x) ∈ D_ocr, we define the training objective at this stage as

L_3(θ_ie) = L_ita + α L_ic,
L_ita = Σ_j ||sg(z_q(h_e,j)) − z_q(h_v,j)||_2^2,
L_ic = Σ_j ||h_v,j − sg(z_q(h_v,j))||_2^2,

where sg(·) refers to a stop-gradient operation and θ_ie denotes the parameters of the image encoder except the ViT module. Specifically, z_q(h_v,j) and z_q(h_e,j) represent the quantized semantic information of the image and text, respectively. Via L_ita, we expect both image and text representations to be quantized into the same latent codes. Meanwhile, following van den Oord et al. (2017), we use the commitment loss L_ic to ensure that the output hidden states of the image encoder stay close to the chosen latent code embeddings, preventing them from fluctuating frequently from one latent code to another; α is a hyperparameter controlling the effect of L_ic. Note that at this stage, we continue to update the parameters of the multimodal codebook using Equation 11.
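Numerically, the Stage-3 objective combines an alignment term and a commitment term. The sketch below is a pure-Python illustration (the position-by-position pairing of text and image states is our simplifying assumption; in a real implementation sg(·) only blocks gradients and does not change the computed value):

```python
def sq_dist(a, b):
    """Squared L2 distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def stage3_loss(quantized_text, quantized_image, image_states, alpha=0.25):
    """Toy numeric sketch of L_3 = L_ita + alpha * L_ic.

    quantized_text / quantized_image: latent code embeddings chosen for the
    text and image hidden states; image_states: raw image encoder outputs.
    """
    # Alignment term: pull image codes toward the corresponding text codes
    l_ita = sum(sq_dist(zt, zv)
                for zt, zv in zip(quantized_text, quantized_image))
    # Commitment term: keep image states close to their chosen codes
    l_ic = sum(sq_dist(h, z)
               for h, z in zip(image_states, quantized_image))
    return l_ita + alpha * l_ic
```

With a perfect alignment (image and text quantized to the same codes) the first term vanishes, leaving only the commitment penalty.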
Stage 4. Finally, we use the TIT dataset D_tit to finetune the whole model. Notably, L_3 is still involved, which maintains training consistency and makes finetuning smoother.
Given a TIT training instance (v, x̂, x, y) ∈ D_tit, we optimize the whole model through the following objective:

L_tit(θ_te, θ_ie, θ_td) = −Σ_{t=1}^{|y|} log(p(y_t | v, x̂, y_<t)), (17)
L_4 = L_tit + L_3 + β L_tc,

where L_tc is a commitment loss analogous to L_ic but proposed for the text encoder, and β is a hyperparameter quantifying its effect. Note that x̂ is only used as an input for L_tit to ensure consistency between model training and inference, while x is used as an input for the image-text alignment task to train the ability of the multimodal codebook to associate the input image with the ground-truth text. Besides, we still update the multimodal codebook with EMA.
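The final-stage objective can be pictured as the translation NLL plus the retained Stage-3 loss and a text-side commitment term. The toy sketch below treats L_3 and L_tc as precomputed scalars (an illustrative decomposition, not the paper's exact equations):

```python
import math

def stage4_loss(token_probs, l3, l_tc, beta=0.25):
    """Toy sketch of the final-stage objective: the translation NLL
    L_tit = -sum_t log p(y_t | v, x_hat, y_<t), plus the retained Stage-3
    loss l3 and a text-side commitment term weighted by beta. l3 and l_tc
    are treated as precomputed scalars for illustration."""
    l_tit = -sum(math.log(p) for p in token_probs)
    return l_tit + l3 + beta * l_tc

# Hypothetical values: 2-token translation, small auxiliary losses
total = stage4_loss([0.5, 0.5], l3=1.0, l_tc=2.0)
```

Keeping the auxiliary terms during finetuning is what smooths the transition from Stage 3, as the text notes.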

Datasets
Our proposed training framework consists of four stages, involving the following three datasets: WMT22 ZH-EN. This large-scale parallel corpus contains about 28M parallel sentence pairs, from which we sample 2M. During the first and second training stages, we use the sampled dataset to pretrain our text encoder and text decoder.
ICDAR19-LSVT. It is an OCR dataset including 450,000 images with texts freely captured in the streets, e.g., on storefronts and landmarks. In this dataset, 50,000 fully-annotated images are partially selected to construct the OCRMT30K dataset, and the remaining 400,000 images are weakly annotated, where only the text of interest in each image is provided as ground truth without location annotations. In the third training stage, we use these weakly annotated data to train the image encoder and multimodal codebook via the image-text alignment task.
OCRMT30K. As mentioned previously, our OCRMT30K dataset involves five Chinese OCR datasets: RCTW-17, CASIA-10K, ICDAR19-MLT, ICDAR19-LSVT, and ICDAR19-ArT. It contains about 30,000 instances in total, where each instance involves an image paired with several Chinese texts and their corresponding English translations. In the experiments, we choose 1,000 instances for development, 1,000 for evaluation, and the remaining instances for training. Besides, we use the commonly-used PaddleOCR to handle our dataset and obtain the recognized texts. In the final training stage, we use the training set of OCRMT30K to finetune our whole model.

Settings
We use the standard ViT-B/16 (Dosovitskiy et al., 2021) to model our image encoder. Both our text encoder and text decoder consist of 6 layers, each of which has 512-dimensional hidden sizes, 8 attention heads, and 2,048 feed-forward hidden units. Particularly, a 512-dimensional word embedding layer is shared across the text encoder and the text decoder. We set the size of the multimodal codebook to 2,048.
During the third stage, following van den Oord et al. (2017), we set α in Equation 12 to 0.25. During the final training stage, we set α to 0.75 and β in Equation 15 to 0.25 determined by a grid search on the validation set, both of which are varied from 0.25 to 1 with an interval of 0.25. We use the batch size of 32,768 tokens in the first and second training stages and 4,096 tokens in the third and final training stages. In all stages, we apply the Adam optimizer (Kingma and Ba, 2015) with β 1 = 0.9, β 2 = 0.98 to train the model, where the inverse square root schedule algorithm and warmup strategy are adopted for the learning rate. Besides, we set the dropout to 0.1 in the first three training stages and 0.3 in the final training stage, and the value of label smoothing to 0.1 in all stages.
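The inverse square root schedule with warmup mentioned above is a standard Transformer recipe; a minimal sketch follows (warmup_steps and peak_lr are illustrative values, as the paper does not report them):

```python
def inverse_sqrt_lr(step, warmup_steps=4000, peak_lr=7e-4):
    """Inverse square root schedule with linear warmup, a common Transformer
    recipe (warmup_steps and peak_lr here are illustrative, not the paper's).
    The learning rate rises linearly for warmup_steps updates, then decays
    proportionally to 1/sqrt(step)."""
    step = max(step, 1)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps ** 0.5) / (step ** 0.5)
```

At step = warmup_steps the two branches meet at peak_lr, so the schedule is continuous.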
During inference, we use beam search with a beam size of 5. Finally, we employ BLEU (Papineni et al., 2002) calculated by SacreBLEU 4 (Post, 2018) and COMET 5 (Rei et al., 2020) to evaluate the model performance.

Baselines
In addition to the text-only Transformer (Vaswani et al., 2017), our baselines include: • Doubly-ATT (Calixto et al., 2017a). This model uses two attention mechanisms to exploit the image and text representations for translation, respectively.
• Imagination (Elliott and Kádár, 2017). It trains an image prediction decoder to predict a global visual feature vector that is associated with the input sentence.
• Selective Attn (Li et al., 2022a). It is similar to Gated Fusion, but uses a selective attention mechanism to make better use of the patch-level image representation.
• VALHALLA (Li et al., 2022b). This model uses an autoregressive hallucination Transformer to predict discrete visual representations from the input text, which are then combined with text representations to obtain the target translation.
• E2E-TIT (Ma et al., 2022). It applies a multi-task learning framework to train an end-to-end TIT model, where MT and OCR serve as auxiliary tasks. Note that except for E2E-TIT, all other models are cascaded ones. Unlike the other cascaded models that take the recognized text and the entire image as input, the input to this end-to-end model is an image cropped from the text bounding box.
To ensure fair comparisons, we pretrain all these baselines on the same large-scale bilingual corpus.

Results

Table 2 reports the performance of all models. We can observe that our model outperforms all baselines, achieving state-of-the-art results. Moreover, we draw the following interesting conclusions:

First, all cascaded models exhibit better performance than E2E-TIT. For this result, we speculate that, as an end-to-end model, E2E-TIT may struggle to distinguish text from the surrounding background in the image when the background exhibits visual characteristics similar to the text.

Table 2: BLEU and COMET scores of all models.

Model                                    BLEU    COMET
-- Text-only Transformer --
Transformer (Vaswani et al., 2017)       39.38   30.01
-- Existing MMT Systems --
Imagination (Elliott and Kádár, 2017)    39.64   30.68
Doubly-ATT (Calixto et al., 2017a)       39.71   31.42
Gated Fusion                             39.03   30.46
Selective Attn (Li et al., 2022a)        40.13   30.74
VALHALLA (Li et al., 2022b)              39.24   29.08
-- Existing TIT System --
E2E-TIT (Ma et al., 2022)                19.50   -31.90
-- Our TIT System --
Our model                                40.78‡  33.09†

Second, our model outperforms Doubly-ATT, Gated Fusion, and Selective Attn, all of which adopt attention mechanisms to exploit image information for translation. The underlying reason is that each input image and its texts are mapped into the shared semantic space of latent codes, reducing the modality gap and thus enabling the model to effectively utilize image information.
Third, our model also surpasses Imagination and VALHALLA, both of which use the input text to generate the representations of related images. We conjecture that in the TIT task, it may be challenging for the model to generate useful image representations from the incorrectly recognized text. In contrast, our model utilizes the input image to generate related text representations, which is more suitable for the TIT task.
Inspired by E2E-TIT, we also compare our model with the other baselines when the cropped image is used as input. As reported in Table 3, our model still achieves state-of-the-art results.

Ablation Study
To investigate the effectiveness of different stages and modules, we further compare our model with several variants.

w/o image-text alignment in Stage 3. In this variant, we remove the image-text alignment task from the third stage of training. The result in line 3 indicates that this removal leads to a performance drop, confirming our previous assumption that training the preliminary capability of associating images with related texts indeed enhances the TIT model.

w/o L_3 in Stage 4. When constructing this variant, we remove the loss item L_3 from stage 4. From line 4, we can observe that preserving L_3 in the fourth stage makes the transition from the third to the fourth stage smoother, which further alleviates the training discrepancy.
w/o multimodal codebook. We remove the multimodal codebook in this variant, and the visual features extracted by the image encoder are used in its place. Apparently, the performance drops drastically, as reported in line 5, demonstrating the effectiveness of the multimodal codebook.
w/ randomly sampling latent codes. Instead of employing quantization, we randomly sample latent codes from the multimodal codebook in this variant. Line 6 shows that such sampling leads to a substantial performance decline. Thus, we confirm that the latent codes generated from the input image indeed benefit the subsequent translation.

Analysis
To further reveal the effect of the multimodal codebook, we provide a translation example in Figure 5(a), listing the OCR result and the translations produced by our model and Gated Fusion, which is the most competitive baseline. It can be seen that "用品商店" ("supplies store") is incorrectly recognized as "用品高店" ("supplies high store"), leading to an incorrect translation even for Gated Fusion. By contrast, our model outputs the correct translation with the help of the multimodal codebook.
During the decoding of "supplies store," latent code 1368 exhibits the highest cross-attention weight compared with other codes. Therefore, we only visualize latent code 1368 for analysis. In Figure 5(b), since tokens may be duplicated and all images are different, we list the five most frequent tokens and five randomly-selected images from this latent code, and find that all these tokens and images are highly related to the topic of business. Intuitively, the clustering vector of this latent code fully encodes business-related information and thus can provide useful cues that help the model produce the correct translation.

Conclusion
In this paper, we release a Chinese-English TIT dataset named OCRMT30K, which is the first publicly available TIT dataset. Then, we propose a novel TIT model with a multimodal codebook. Notably, our model can leverage the input image to predict latent codes associated with the input sentence via the multimodal codebook, providing supplementary information for the subsequent translation. Moreover, we present a multi-stage training framework that effectively utilizes additional bilingual texts and OCR data to refine the training of our model.
In the future, we intend to construct a larger dataset and explore the potential applications of our method in other multimodal tasks, such as videoguided machine translation.

Limitations
Since our model involves an additional step of OCR, it is less efficient than the end-to-end TIT model, although it can achieve significantly better performance. Besides, with the incorporation of image information, our model is still unable to completely address the issue of error propagation caused by OCR.

Ethics Statement
This paper proposes a TIT model and a multi-stage training framework. We take ethical considerations seriously and ensure that the methods used in this study are conducted in a responsible and ethical manner. We also release a Chinese-English TIT dataset named OCRMT30K, which is annotated based on five publicly available Chinese OCR datasets. It is intended to support scholarly research rather than commercial use; thus, there are no ethical concerns.