Alexander Klenner-Bajaja


2023

Building Machine Translation Tools for Patent Language: A Data Generation Strategy at the European Patent Office
Matthias Wirth | Volker D. Hähnke | Franco Mascia | Arnaud Wéry | Konrad Vowinckel | Marco del Rey | Raúl Mohedano del Pozo | Pau Montes | Alexander Klenner-Bajaja
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

The European Patent Office (EPO) is an international organisation responsible for granting patents and promoting global cooperation in the intellectual property world. With three official languages (English, German, French) and a constant need to access and manipulate information in multiple languages, machine translation is essential for the EPO. Over the last few years we have developed internal machine translation engines specifically for the translation of patent language. This article presents our data generation strategy: it describes our approach to generating parallel corpora of documents, training datasets of aligned sentences, and the corresponding evaluation datasets. We present details on the challenges and technical implementation, as well as statistics of the training dataset generation process.