A Corpus of Native, Non-native and Translated Texts

Sergiu Nisioi, Ella Rabinovich, Liviu P. Dinu, Shuly Wintner


Abstract
We describe a monolingual English corpus of original and (human) translated texts, with an accurate annotation of speaker properties, including the original language of the utterances and the speaker’s country of origin. We thus obtain three sub-corpora of texts reflecting native English, non-native English, and English translated from a variety of European languages. This dataset will facilitate the investigation of similarities and differences between these kinds of sub-languages. Moreover, it will enable a unified comparative study of translations and language produced by (highly fluent) non-native speakers, two closely related phenomena that have so far been studied only in isolation.
Anthology ID:
L16-1664
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
4197–4201
URL:
https://aclanthology.org/L16-1664
Cite (ACL):
Sergiu Nisioi, Ella Rabinovich, Liviu P. Dinu, and Shuly Wintner. 2016. A Corpus of Native, Non-native and Translated Texts. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 4197–4201, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
A Corpus of Native, Non-native and Translated Texts (Nisioi et al., LREC 2016)
PDF:
https://aclanthology.org/L16-1664.pdf