GTNC: A Many-To-One Dataset of Google Translations from NewsCrawl

Damiaan Reijnaers, Charlotte Pouw


Abstract
This paper lays the groundwork for initiating research into Source Language Identification: the task of identifying the original language of a machine-translated text. We contribute a dataset of translations from a typologically diverse spectrum of languages into English and use it to set initial baselines for this novel task.
Anthology ID:
2024.sigtyp-1.8
Volume:
Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP
Month:
March
Year:
2024
Address:
St. Julian's, Malta
Editors:
Michael Hahn, Alexey Sorokin, Ritesh Kumar, Andreas Shcherbakov, Yulia Otmakhova, Jinrui Yang, Oleg Serikov, Priya Rani, Edoardo M. Ponti, Saliha Muradoğlu, Rena Gao, Ryan Cotterell, Ekaterina Vylomova
Venues:
SIGTYP | WS
Publisher:
Association for Computational Linguistics
Pages:
58–65
URL:
https://aclanthology.org/2024.sigtyp-1.8
Cite (ACL):
Damiaan Reijnaers and Charlotte Pouw. 2024. GTNC: A Many-To-One Dataset of Google Translations from NewsCrawl. In Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, pages 58–65, St. Julian's, Malta. Association for Computational Linguistics.
Cite (Informal):
GTNC: A Many-To-One Dataset of Google Translations from NewsCrawl (Reijnaers & Pouw, SIGTYP-WS 2024)
PDF:
https://aclanthology.org/2024.sigtyp-1.8.pdf