Building The First English-Brazilian Portuguese Corpus for Automatic Post-Editing

Felipe Almeida Costa, Thiago Castro Ferreira, Adriana Pagano, Wagner Meira


Abstract
This paper introduces the first corpus for Automatic Post-Editing (APE) of English and a low-resource language, Brazilian Portuguese. The source English texts were extracted from the WebNLG corpus and automatically translated into Portuguese using a state-of-the-art industrial neural machine translator. Post-edits were then obtained in an experiment with native speakers of Brazilian Portuguese. To assess the quality of the corpus, we performed error analysis and computed complexity indicators measuring how difficult the APE task would be. We report preliminary results of phrase-based and neural machine translation models on this new corpus. Data and code are publicly available in our repository.
Anthology ID:
2020.coling-main.533
Volume:
Proceedings of the 28th International Conference on Computational Linguistics
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
6063–6069
URL:
https://aclanthology.org/2020.coling-main.533
DOI:
10.18653/v1/2020.coling-main.533
Cite (ACL):
Felipe Almeida Costa, Thiago Castro Ferreira, Adriana Pagano, and Wagner Meira. 2020. Building The First English-Brazilian Portuguese Corpus for Automatic Post-Editing. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6063–6069, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Cite (Informal):
Building The First English-Brazilian Portuguese Corpus for Automatic Post-Editing (Almeida Costa et al., COLING 2020)
PDF:
https://aclanthology.org/2020.coling-main.533.pdf
Data
WebNLG