Transformer-based Cascaded Multimodal Speech Translation

Zixiu Wu, Ozan Caglayan, Julia Ive, Josiah Wang, Lucia Specia


Abstract
This paper describes the cascaded multimodal speech translation systems developed by Imperial College London for the IWSLT 2019 evaluation campaign. The architecture consists of an automatic speech recognition (ASR) system followed by a Transformer-based multimodal machine translation (MMT) system. While the ASR component is identical across the experiments, the MMT model varies in how it integrates the visual context (simple conditioning vs. attention), in the type of visual features it exploits (pooled, convolutional, action categories), and in its underlying architecture. For the latter, we explore both the canonical Transformer and its deliberation variant, in additive and cascade versions that differ in how they integrate the textual attention. Upon conducting extensive experiments, we found that (i) the explored visual integration schemes often harm translation performance for the Transformer and the additive deliberation, but considerably improve the cascade deliberation; and (ii) the Transformer and the cascade deliberation integrate the visual modality better than the additive deliberation, as shown by the incongruence analysis.
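The abstract says the additive and cascade deliberation models differ only in how the second-pass decoder combines its attention over the source encoder with its attention over the first-pass (draft) decoder. The following is a minimal pure-Python sketch of one plausible reading of that distinction; it is an assumption, not the authors' implementation. It uses single-head scaled dot-product attention with keys doubling as values, plus residual additions, and it omits layer normalization, projections, and multiple heads. The function names (`additive_deliberation_step`, `cascade_deliberation_step`, `attend`) are hypothetical. In the "additive" sketch both attentions share the same query and their context vectors are summed. In the "cascade" sketch the source attention runs first, and its output becomes the query for the draft attention.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def add(a: Vector, b: Vector) -> Vector:
    return [x + y for x, y in zip(a, b)]


def softmax(scores: List[float]) -> List[float]:
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, keys: List[Vector]) -> Vector:
    """Single-head scaled dot-product attention; keys double as values."""
    d = len(query)
    weights = softmax([dot(query, k) / math.sqrt(d) for k in keys])
    return [sum(w * k[i] for w, k in zip(weights, keys)) for i in range(d)]


def additive_deliberation_step(query: Vector, encoder_states: List[Vector],
                               first_pass_states: List[Vector]) -> Vector:
    # Hypothetical "additive" variant: both attentions use the same query
    # in parallel, and the two context vectors are summed with a residual.
    ctx_src = attend(query, encoder_states)
    ctx_draft = attend(query, first_pass_states)
    return add(query, add(ctx_src, ctx_draft))


def cascade_deliberation_step(query: Vector, encoder_states: List[Vector],
                              first_pass_states: List[Vector]) -> Vector:
    # Hypothetical "cascade" variant: attend to the source first; the
    # residual output then queries the first-pass (draft) states.
    h = add(query, attend(query, encoder_states))
    return add(h, attend(h, first_pass_states))
```

With a zero query, for example, `additive_deliberation_step([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[4.0, 0.0], [0.0, 0.0]])` averages each memory uniformly. In the cascade step, however, the source context changes which draft states are attended to. In this sketch, that dependency is the only structural difference between the two variants.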
Anthology ID:
2019.iwslt-1.6
Volume:
Proceedings of the 16th International Conference on Spoken Language Translation
Month:
November 2-3
Year:
2019
Address:
Hong Kong
Editors:
Jan Niehues, Rolando Cattoni, Sebastian Stüker, Matteo Negri, Marco Turchi, Thanh-Le Ha, Elizabeth Salesky, Ramon Sanabria, Loic Barrault, Lucia Specia, Marcello Federico
Venue:
IWSLT
SIG:
SIGSLT
Publisher:
Association for Computational Linguistics
URL:
https://aclanthology.org/2019.iwslt-1.6
Cite (ACL):
Zixiu Wu, Ozan Caglayan, Julia Ive, Josiah Wang, and Lucia Specia. 2019. Transformer-based Cascaded Multimodal Speech Translation. In Proceedings of the 16th International Conference on Spoken Language Translation, Hong Kong. Association for Computational Linguistics.
Cite (Informal):
Transformer-based Cascaded Multimodal Speech Translation (Wu et al., IWSLT 2019)
PDF:
https://aclanthology.org/2019.iwslt-1.6.pdf