Borek Požár


2022

CUNI Submission to the BUCC 2022 Shared Task on Bilingual Term Alignment
Borek Požár | Klára Tauchmanová | Kristýna Neumannová | Ivana Kvapilíková | Ondřej Bojar
Proceedings of the BUCC Workshop within LREC 2022

We present our submission to the BUCC Shared Task on bilingual term alignment in comparable specialized corpora. We devised three approaches: static embeddings with post-hoc alignment, the Monoses pipeline for unsupervised phrase-based machine translation, and contextualized multilingual embeddings. We show that contextualized embeddings from pretrained multilingual models yield results similar to those of static embeddings, but that task-specific fine-tuning brings further improvement. Retrieving term pairs from the running phrase tables of the Monoses systems can match this enhanced performance and leads to an average precision of 0.88 on the training set.
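
As a rough illustration of embedding-based term retrieval and of the reported metric, the sketch below ranks target-language candidates for one source term by cosine similarity and then scores the ranking with average precision. It is not the authors' implementation: it assumes the source and target vectors have already been mapped into a shared space, the toy vectors and terms are invented, and the names `cosine`, `rank_candidates`, and `average_precision` are hypothetical. The average-precision definition used here is the common information-retrieval one and may differ from the shared task's official scorer.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(src_vec, tgt_vecs):
    """Return target terms sorted by decreasing similarity to the source vector.

    Assumption: src_vec and the vectors in tgt_vecs live in one shared space.
    Ties are broken alphabetically so the output is deterministic.
    """
    scored = [(cosine(src_vec, vec), term) for term, vec in tgt_vecs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored]


def average_precision(ranked, gold):
    """Mean of precision@k over the ranks k at which a gold term appears."""
    if not gold:
        return 0.0
    hits = 0
    total = 0.0
    for k, term in enumerate(ranked, start=1):
        if term in gold:
            hits += 1
            total += hits / k
    return total / len(gold)


# Toy data (invented for illustration only).
source_vector = [1.0, 0.0]
target_vectors = {
    "energie": [0.9, 0.1],
    "sila": [0.5, 0.5],
    "vitr": [0.0, 1.0],
}

ranking = rank_candidates(source_vector, target_vectors)
score = average_precision(ranking, gold={"energie"})
```

In this toy setup the single gold translation is ranked first, so its average precision is 1.0; in a real evaluation the score would be averaged over all source terms.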