BehanceMT: A Machine Translation Corpus for Livestreaming Video Transcripts

Minh Van Nguyen, Franck Dernoncourt, Thien Nguyen


Abstract
Machine translation (MT) is an important task in natural language processing that aims to translate a sentence in a source language into a sentence with the same or similar meaning in a target language. Despite substantial effort in building MT systems for different language pairs, most previous work focuses on formal-language settings, where the text to be translated comes from written sources such as books and news articles. As a result, such MT systems may fail to translate livestreaming video transcripts, where the text is often shorter and may be grammatically incorrect. To address this issue, we introduce BehanceMT, a novel MT corpus for livestreaming video transcript translation. Our corpus contains parallel transcripts for three language pairs, with English as the source language and Spanish, Chinese, and Arabic as the target languages. Experimental results show that finetuning a pretrained MT model on BehanceMT significantly improves the model's performance in translating video transcripts across all three language pairs. In addition, the finetuned MT model outperforms Google Translate on two of the three language pairs, further demonstrating the usefulness of the proposed dataset for video transcript translation. BehanceMT will be publicly released upon acceptance of the paper.
Anthology ID:
2022.tu-1.4
Volume:
Proceedings of the First Workshop On Transcript Understanding
Month:
October
Year:
2022
Address:
Gyeongju, South Korea
Editors:
Franck Dernoncourt, Thien Huu Nguyen, Viet Dac Lai, Amir Pouran Ben Veyseh, Trung H. Bui, David Seunghyun Yoon
Venue:
TU
Publisher:
International Conference on Computational Linguistics
Pages:
30–33
URL:
https://aclanthology.org/2022.tu-1.4
Cite (ACL):
Minh Van Nguyen, Franck Dernoncourt, and Thien Nguyen. 2022. BehanceMT: A Machine Translation Corpus for Livestreaming Video Transcripts. In Proceedings of the First Workshop On Transcript Understanding, pages 30–33, Gyeongju, South Korea. International Conference on Computational Linguistics.
Cite (Informal):
BehanceMT: A Machine Translation Corpus for Livestreaming Video Transcripts (Nguyen et al., TU 2022)
PDF:
https://aclanthology.org/2022.tu-1.4.pdf