Building Machine Translation Tools for Patent Language: A Data Generation Strategy at the European Patent Office

Matthias Wirth, Volker D. Hähnke, Franco Mascia, Arnaud Wéry, Konrad Vowinckel, Marco del Rey, Raúl Mohedano del Pozo, Pau Montes, Alexander Klenner-Bajaja


Abstract
The European Patent Office (EPO) is an international organisation responsible for granting patents and promoting global cooperation in the intellectual property world. With three official languages (English, German, French) and a constant need to access and manipulate information in multiple languages, machine translation is essential for the EPO. Over the past few years, we have developed internal machine translation engines specifically for translating patent language. This article presents our data generation strategy: it describes our approach to generating parallel corpora of documents, training datasets of aligned sentences, and the corresponding evaluation datasets. We present details on the challenges and technical implementation, as well as statistics on the training dataset generation process.
Anthology ID:
2023.eamt-1.46
Volume:
Proceedings of the 24th Annual Conference of the European Association for Machine Translation
Month:
June
Year:
2023
Address:
Tampere, Finland
Editors:
Mary Nurminen, Judith Brenner, Maarit Koponen, Sirkku Latomaa, Mikhail Mikhailov, Frederike Schierl, Tharindu Ranasinghe, Eva Vanmassenhove, Sergi Alvarez Vidal, Nora Aranberri, Mara Nunziatini, Carla Parra Escartín, Mikel Forcada, Maja Popovic, Carolina Scarton, Helena Moniz
Venue:
EAMT
Publisher:
European Association for Machine Translation
Pages:
471–479
URL:
https://aclanthology.org/2023.eamt-1.46
Cite (ACL):
Matthias Wirth, Volker D. Hähnke, Franco Mascia, Arnaud Wéry, Konrad Vowinckel, Marco del Rey, Raúl Mohedano del Pozo, Pau Montes, and Alexander Klenner-Bajaja. 2023. Building Machine Translation Tools for Patent Language: A Data Generation Strategy at the European Patent Office. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, pages 471–479, Tampere, Finland. European Association for Machine Translation.
Cite (Informal):
Building Machine Translation Tools for Patent Language: A Data Generation Strategy at the European Patent Office (Wirth et al., EAMT 2023)
PDF:
https://aclanthology.org/2023.eamt-1.46.pdf