Yihang Li


2024

MELD-ST: An Emotion-aware Speech Translation Dataset
Sirou Chen | Sakiko Yahata | Shuichiro Shimizu | Zhengdong Yang | Yihang Li | Chenhui Chu | Sadao Kurohashi
Findings of the Association for Computational Linguistics: ACL 2024

Emotion plays a crucial role in human conversation. This paper underscores the significance of considering emotion in speech translation. We present the MELD-ST dataset for the emotion-aware speech translation task, comprising English-to-Japanese and English-to-German language pairs. Each language pair includes about 10,000 utterances annotated with emotion labels from the MELD dataset. Baseline experiments using the SeamlessM4T model on the dataset indicate that fine-tuning with emotion labels can enhance translation performance in some settings, highlighting the need for further research in emotion-aware speech translation systems.

2023

Video-Helpful Multimodal Machine Translation
Yihang Li | Shuichiro Shimizu | Chenhui Chu | Sadao Kurohashi | Wei Li
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Existing multimodal machine translation (MMT) datasets consist of images and video captions or instructional video subtitles, which rarely contain linguistic ambiguity, making visual information ineffective in generating appropriate translations. Recent work has constructed an ambiguous subtitles dataset to alleviate this problem, but it is still limited in that its videos do not necessarily contribute to disambiguation. We introduce EVA (Extensive training set and Video-helpful evaluation set for Ambiguous subtitles translation), an MMT dataset containing 852k Japanese-English parallel subtitle pairs, 520k Chinese-English parallel subtitle pairs, and corresponding video clips collected from movies and TV episodes. In addition to the extensive training set, EVA contains a video-helpful evaluation set in which subtitles are ambiguous and videos are guaranteed to be helpful for disambiguation. Furthermore, we propose SAFA, an MMT model based on the Selective Attention model with two novel methods, Frame attention loss and Ambiguity augmentation, which aim to make full use of the videos in EVA for disambiguation. Experiments on EVA show that visual information and the proposed methods can boost translation performance, and our model performs significantly better than existing MMT models.

2022

VISA: An Ambiguous Subtitles Dataset for Visual Scene-aware Machine Translation
Yihang Li | Shuichiro Shimizu | Weiqi Gu | Chenhui Chu | Sadao Kurohashi
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Existing multimodal machine translation (MMT) datasets consist of images and video captions or general subtitles, which rarely contain linguistic ambiguity, making visual information less effective for generating appropriate translations. We introduce VISA, a new dataset that consists of 40k Japanese-English parallel sentence pairs and corresponding video clips with the following key features: (1) the parallel sentences are subtitles from movies and TV episodes; (2) the source subtitles are ambiguous, which means they have multiple possible translations with different meanings; (3) we divide the dataset into Polysemy and Omission according to the cause of ambiguity. We show that VISA is challenging for the latest MMT system, and we hope that the dataset can facilitate MMT research.