Cong Ma


CCIM: Cross-modal Cross-lingual Interactive Image Translation
Cong Ma | Yaping Zhang | Mei Tu | Yang Zhao | Yu Zhou | Chengqing Zong
Findings of the Association for Computational Linguistics: EMNLP 2023

Text image machine translation (TIMT), which translates source language text images into target language texts, has attracted intensive attention in recent years. Although the end-to-end TIMT model directly generates the target translation from encoded text image features with an efficient architecture, it lacks the recognized source language information, resulting in a decrease in translation performance. In this paper, we propose a novel Cross-modal Cross-lingual Interactive Model (CCIM) to incorporate source language information by synchronously generating source language and target language results through an interactive attention mechanism between two language decoders. Extensive experimental results show that the interactive decoder significantly outperforms end-to-end TIMT models and has faster decoding speed with a smaller model size than cascade models.


CASIA’s System for IWSLT 2020 Open Domain Translation
Qian Wang | Yuchen Liu | Cong Ma | Yu Lu | Yining Wang | Long Zhou | Yang Zhao | Jiajun Zhang | Chengqing Zong
Proceedings of the 17th International Conference on Spoken Language Translation

This paper describes CASIA’s system for the IWSLT 2020 open domain translation task. This year we participate in both Chinese→Japanese and Japanese→Chinese translation tasks. Our system is a neural machine translation system based on the Transformer model. We augment the training data with knowledge distillation and back translation to improve the translation performance. Domain data classification and weighted domain model ensemble are introduced to generate the final translation result. We compare and analyze the performance on development data with different model settings and different data processing techniques.
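
To illustrate what a weighted domain model ensemble could look like, here is a minimal sketch. It assumes a linear interpolation of per-model next-token distributions, with weights taken from a domain classifier; the abstract does not say how the weights are obtained or combined, so the function `weighted_ensemble`, the domain names, and the weighting scheme are all hypothetical.

```python
from typing import Dict, List


def weighted_ensemble(
    model_probs: Dict[str, List[float]],
    domain_weights: Dict[str, float],
) -> List[float]:
    """Mix next-token distributions from several domain models.

    Assumption: weights come from a domain classifier and are combined by
    linear interpolation. This is an illustration, not the paper's method.
    """
    total = sum(domain_weights[name] for name in model_probs)
    if total <= 0:
        raise ValueError("domain weights must sum to a positive value")

    vocab_size = len(next(iter(model_probs.values())))
    mixed = [0.0] * vocab_size
    for name, probs in model_probs.items():
        weight = domain_weights[name] / total  # normalise so weights sum to 1
        for i, p in enumerate(probs):
            mixed[i] += weight * p
    return mixed
```

Because the weights are normalised before mixing, the result is a valid probability distribution whenever each input distribution is.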


Multi-modal Summarization for Asynchronous Collection of Text, Image, Audio and Video
Haoran Li | Junnan Zhu | Cong Ma | Jiajun Zhang | Chengqing Zong
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

The rapid increase of multimedia data over the Internet necessitates multi-modal summarization from collections of text, image, audio and video. In this work, we propose an extractive Multi-modal Summarization (MMS) method which can automatically generate a textual summary given a set of documents, images, audios and videos related to a specific topic. The key idea is to bridge the semantic gaps between multi-modal contents. For audio information, we design an approach to selectively use its transcription. For vision information, we learn joint representations of texts and images using a neural network. Finally, all the multi-modal aspects are considered to generate the textual summary by maximizing the salience, non-redundancy, readability and coverage through budgeted optimization of submodular functions. We further introduce an MMS corpus in English and Chinese. The experimental results on this dataset demonstrate that our method outperforms other competitive baseline methods.
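
The budgeted optimization of submodular functions mentioned above is commonly approximated with a cost-scaled greedy procedure. The sketch below shows that general idea under simplifying assumptions: it uses a toy coverage objective (the number of distinct concepts covered) in place of the paper's combined salience, non-redundancy, readability and coverage terms. The names `coverage` and `greedy_budgeted` and the sentence and concept data are hypothetical.

```python
from typing import Dict, List, Set


def coverage(selected: List[str], concepts: Dict[str, Set[str]]) -> int:
    """Toy monotone submodular objective: number of distinct concepts covered."""
    covered: Set[str] = set()
    for sid in selected:
        covered |= concepts[sid]
    return len(covered)


def greedy_budgeted(
    concepts: Dict[str, Set[str]],
    costs: Dict[str, int],
    budget: int,
) -> List[str]:
    """Pick sentences by best marginal gain per unit cost within the budget.

    Assumption: a plain cost-scaled greedy loop, not the paper's exact
    objective or algorithm.
    """
    selected: List[str] = []
    spent = 0
    remaining = set(concepts)
    while remaining:
        base = coverage(selected, concepts)
        best_id, best_ratio = None, 0.0
        for sid in sorted(remaining):  # sorted for deterministic tie-breaking
            if spent + costs[sid] > budget:
                continue  # would exceed the length budget
            gain = coverage(selected + [sid], concepts) - base
            ratio = gain / costs[sid]
            if ratio > best_ratio:
                best_id, best_ratio = sid, ratio
        if best_id is None:
            break  # nothing affordable adds any value
        selected.append(best_id)
        spent += costs[best_id]
        remaining.remove(best_id)
    return selected
```

Here `costs` could be sentence lengths in words and `budget` the maximum summary length. The loop stops as soon as no affordable sentence adds new coverage.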