Jian Luan


Exploring Better Text Image Translation with Multimodal Codebook
Zhibin Lan | Jiawei Yu | Xiang Li | Wen Zhang | Jian Luan | Bin Wang | Degen Huang | Jinsong Su
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Text image translation (TIT) aims to translate the source texts embedded in the image to target translations, which has a wide range of applications and thus has important research value. However, current studies on TIT are confronted with two main bottlenecks: 1) the task lacks a publicly available dataset, and 2) dominant models are constructed in a cascaded manner, which tends to suffer from the error propagation of optical character recognition (OCR). In this work, we first annotate a Chinese-English TIT dataset named OCRMT30K, providing convenience for subsequent studies. Then, we propose a TIT model with a multimodal codebook, which is able to associate the image with relevant texts, providing useful supplementary information for translation. Moreover, we present a multi-stage training framework involving text machine translation, image-text alignment, and TIT tasks, which fully exploits additional bilingual texts, an OCR dataset, and our OCRMT30K dataset to train our model. Extensive experiments and in-depth analyses strongly demonstrate the effectiveness of our proposed model and training framework.

The Xiaomi AI Lab’s Speech Translation Systems for IWSLT 2023 Offline Task, Simultaneous Task and Speech-to-Speech Task
Wuwei Huang | Mengge Liu | Xiang Li | Yanzhi Tian | Fengyu Yang | Wen Zhang | Jian Luan | Bin Wang | Yuhang Guo | Jinsong Su
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)

This system description paper introduces the systems submitted by Xiaomi AI Lab to the three tracks of the IWSLT 2023 Evaluation Campaign, namely the offline speech translation (Offline-ST) track, the offline speech-to-speech translation (Offline-S2ST) track, and the simultaneous speech translation (Simul-ST) track. All our submissions for these three tracks only involve the English-Chinese language direction. Our English-Chinese speech translation systems are constructed using large-scale pre-trained models as the foundation. Specifically, we fine-tune these models’ corresponding components for various downstream speech translation tasks. Moreover, we implement several popular techniques, such as data filtering, data augmentation, speech segmentation, and model ensemble, to improve the system’s overall performance. Extensive experiments show that our systems achieve a significant improvement over the strong baseline systems in terms of the automatic evaluation metric.


BIT-Xiaomi’s System for AutoSimTrans 2022
Mengge Liu | Xiang Li | Bao Chen | Yanzhi Tian | Tianwei Lan | Silin Li | Yuhang Guo | Jian Luan | Bin Wang
Proceedings of the Third Workshop on Automatic Simultaneous Translation

This system paper describes the BIT-Xiaomi simultaneous translation system for the AutoSimTrans 2022 simultaneous translation challenge. We participated in three tracks: the Zh-En text-to-text track, the Zh-En audio-to-text track and the En-Es text-to-text track. In our system, wait-k is employed to train prefix-to-prefix translation models. We integrate streaming chunking to detect boundaries as the source stream is read in. We further improve our system with data selection, data augmentation and R-Drop training methods. Results show that our wait-k implementation outperforms the organizer’s baseline by up to 8 BLEU, and our proposed streaming chunking method yields a further improvement of about 2 BLEU in the low-latency regime.
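
For readers unfamiliar with the wait-k policy mentioned in the abstract above, the following is a minimal sketch of the standard wait-k read/write schedule: the decoder first reads k source tokens, then alternates between writing one target token and reading one more source token until the source is exhausted. This is not the BIT-Xiaomi implementation; the function names `wait_k_schedule` and `wait_k_actions` are hypothetical and chosen only for illustration.

```python
def wait_k_schedule(source_len, target_len, k):
    """For each target step t (1-indexed), return how many source tokens
    must have been read before the t-th target token is written.

    Standard wait-k rule: g(t) = min(k + t - 1, source_len).
    """
    return [min(k + t - 1, source_len) for t in range(1, target_len + 1)]


def wait_k_actions(source_len, target_len, k):
    """Expand the schedule into an explicit READ/WRITE action sequence."""
    actions = []
    read = 0  # number of source tokens consumed so far
    for needed in wait_k_schedule(source_len, target_len, k):
        # Read until the prefix required for this target step is available.
        while read < needed:
            actions.append("READ")
            read += 1
        actions.append("WRITE")
    return actions


if __name__ == "__main__":
    print(wait_k_schedule(5, 5, 2))
    print(wait_k_actions(3, 3, 2))
```

A smaller k lowers latency because translation starts after fewer source tokens, at the cost of less source context per target token.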