JW300: A Wide-Coverage Parallel Corpus for Low-Resource Languages

Željko Agić, Ivan Vulić


Abstract
Viable cross-lingual transfer critically depends on the availability of parallel texts. Shortage of such resources imposes a development and evaluation bottleneck in multilingual processing. We introduce JW300, a parallel corpus of over 300 languages with around 100 thousand parallel sentences per language pair on average. In this paper, we present the resource and showcase its utility in experiments with cross-lingual word embedding induction and multi-source part-of-speech projection.
Anthology ID:
P19-1310
Volume:
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2019
Address:
Florence, Italy
Editors:
Anna Korhonen, David Traum, Lluís Màrquez
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
3204–3210
URL:
https://aclanthology.org/P19-1310
DOI:
10.18653/v1/P19-1310
Cite (ACL):
Željko Agić and Ivan Vulić. 2019. JW300: A Wide-Coverage Parallel Corpus for Low-Resource Languages. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3204–3210, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
JW300: A Wide-Coverage Parallel Corpus for Low-Resource Languages (Agić & Vulić, ACL 2019)
PDF:
https://aclanthology.org/P19-1310.pdf
Poster:
P19-1310.Poster.pdf
Data
JW300