TriFine: A Large-Scale Dataset of Vision-Audio-Subtitle for Tri-Modal Machine Translation and Benchmark with Fine-Grained Annotated Tags

Boyu Guan, Yining Zhang, Yang Zhao, Chengqing Zong


Abstract
Current video-guided machine translation (VMT) approaches rely primarily on coarse-grained visual information, which leads to information redundancy and high computational overhead while neglecting audio content. Our research demonstrates the significance of fine-grained visual and audio information in VMT from both the data and the methodological perspective. From the data perspective, we develop TriFine, a large-scale dataset that is the first vision-audio-subtitle tri-modal VMT dataset with annotated multimodal fine-grained tags. Each entry includes not only the triples found in traditional VMT datasets but also seven fine-grained annotation tags derived from the visual and audio modalities. From the methodological perspective, we propose a Fine-grained Information-enhanced Approach for Translation (FIAT). Experimental results show that, compared to traditional coarse-grained methods and text-only models, our fine-grained approach achieves superior performance with lower computational overhead. These findings underscore the pivotal role of fine-grained annotated information in advancing the field of VMT.
Anthology ID:
2025.coling-main.547
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
8215–8231
URL:
https://aclanthology.org/2025.coling-main.547/
Cite (ACL):
Boyu Guan, Yining Zhang, Yang Zhao, and Chengqing Zong. 2025. TriFine: A Large-Scale Dataset of Vision-Audio-Subtitle for Tri-Modal Machine Translation and Benchmark with Fine-Grained Annotated Tags. In Proceedings of the 31st International Conference on Computational Linguistics, pages 8215–8231, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
TriFine: A Large-Scale Dataset of Vision-Audio-Subtitle for Tri-Modal Machine Translation and Benchmark with Fine-Grained Annotated Tags (Guan et al., COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.547.pdf