Artur Ventura
2020
A Post-Editing Dataset in the Legal Domain: Do we Underestimate Neural Machine Translation Quality?
Julia Ive | Lucia Specia | Sara Szoc | Tom Vanallemeersch | Joachim Van den Bogaert | Eduardo Farah | Christine Maroti | Artur Ventura | Maxim Khalilov
Proceedings of the Twelfth Language Resources and Evaluation Conference
We introduce a machine translation dataset for three language pairs in the legal domain with post-edited high-quality neural machine translation and independent human references. The data was collected as part of the EU APE-QUEST project and comprises crawled content from EU websites with translation from English into three European languages: Dutch, French and Portuguese. Altogether, the data consists of around 31K tuples, each including a source sentence, its machine translation by a neural machine translation system, a post-edited version of that translation by a professional translator, and, where available, the original reference translation crawled from parallel language websites. We describe the data collection process, provide an analysis of the resulting post-edits and benchmark the data using state-of-the-art quality estimation and automatic post-editing models. An interesting by-product of our post-editing analysis is the suggestion that neural systems built with publicly available general domain data can provide high-quality translations, even though comparison to human references suggests that this quality is quite low. This makes our dataset a suitable candidate to test evaluation metrics. The data is freely available as an ELRC-SHARE resource.
2019
APE-QUEST
Joachim Van den Bogaert | Heidi Depraetere | Sara Szoc | Tom Vanallemeersch | Koen Van Winckel | Frederic Everaert | Lucia Specia | Julia Ive | Maxim Khalilov | Christine Maroti | Eduardo Farah | Artur Ventura
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks
Co-authors
- Julia Ive 2
- Lucia Specia 2
- Sara Szoc 2
- Tom Vanallemeersch 2
- Joachim Van den Bogaert 2