Finding Alternative Translations in a Large Corpus of Movie Subtitle

Jörg Tiedemann


Abstract
OpenSubtitles.org provides a large collection of user-contributed subtitles in various languages for movies and TV programs. Subtitle translations are valuable resources for cross-lingual studies and machine translation research. A less explored feature of the collection is the inclusion of alternative translations, which can be very useful for training paraphrase systems or collecting multi-reference test suites for machine translation. However, differences in translation may also be due to misspellings, incomplete or corrupt data files, or wrongly aligned subtitles. This paper reports our efforts in recognising and classifying alternative subtitle translations with language-independent techniques. We use time-based alignment with lexical re-synchronisation techniques and BLEU score filters, and sort alternative translations into categories using edit distance metrics and heuristic rules. Our approach produces large numbers of sentence-aligned translation alternatives for over 50 languages, provided via the OPUS corpus collection.
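As a rough illustration of sorting subtitle pairs into categories with edit distance and heuristic rules, the sketch below classifies two aligned sentences. It is not the paper's implementation: the category names, the `normalise` rule, and the `typo_ratio` threshold of 0.1 are all assumptions chosen for the example.

```python
import string


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalise(text: str) -> str:
    """Lowercase, drop ASCII punctuation, collapse whitespace (assumed rule)."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def classify_pair(s1: str, s2: str, typo_ratio: float = 0.1) -> str:
    """Sort an aligned sentence pair into a hypothetical category."""
    if s1 == s2:
        return "identical"
    n1, n2 = normalise(s1), normalise(s2)
    if n1 == n2:
        return "punctuation_or_case"
    # n1 != n2 here, so at least one string is non-empty.
    ratio = levenshtein(n1, n2) / max(len(n1), len(n2))
    if ratio <= typo_ratio:
        return "possible_misspelling"
    return "alternative"


if __name__ == "__main__":
    print(classify_pair("Hello, world!", "hello world"))
    print(classify_pair("Where are you going tonight?",
                        "Where are you goin tonight?"))
    print(classify_pair("I have no idea.", "I don't know."))
```

Normalising by the longer sentence keeps the threshold comparable across short and long subtitle lines; a real pipeline would likely need language-specific tuning of both the normalisation and the threshold.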
Anthology ID:
L16-1559
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
3518–3522
URL:
https://aclanthology.org/L16-1559
Cite (ACL):
Jörg Tiedemann. 2016. Finding Alternative Translations in a Large Corpus of Movie Subtitle. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 3518–3522, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
Finding Alternative Translations in a Large Corpus of Movie Subtitle (Tiedemann, LREC 2016)
PDF:
https://aclanthology.org/L16-1559.pdf
Data
OpenSubtitles