CUNI Submission to the BUCC 2022 Shared Task on Bilingual Term Alignment

Borek Požár, Klára Tauchmanová, Kristýna Neumannová, Ivana Kvapilíková, Ondřej Bojar


Abstract
We present our submission to the BUCC Shared Task on bilingual term alignment in comparable specialized corpora. We devised three approaches, based on static embeddings with post-hoc alignment, the Monoses pipeline for unsupervised phrase-based machine translation, and contextualized multilingual embeddings. We show that contextualized embeddings from pretrained multilingual models yield results similar to those of static embeddings, but that task-specific fine-tuning brings further improvement. Retrieving term pairs from the running phrase tables of the Monoses systems can match this enhanced performance and leads to an average precision of 0.88 on the training set.
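As a rough illustration of the embedding-based setting and of the reported metric, the sketch below ranks candidate term pairs by cosine similarity and scores the ranking with average precision. It is not the authors' implementation: the function names, the toy vectors, the example terms, and the exact average precision definition (precision summed at each correct pair, divided by the number of gold pairs) are all assumptions made for this example, and the shared task's official scorer may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_term_pairs(src_vecs, tgt_vecs):
    """Score every source/target term combination and return them best first."""
    pairs = [
        (cosine(u, v), s, t)
        for s, u in src_vecs.items()
        for t, v in tgt_vecs.items()
    ]
    return sorted(pairs, key=lambda p: p[0], reverse=True)


def average_precision(ranked_pairs, gold):
    """Assumed definition: mean of precision@k over the ranks of correct pairs,
    normalised by the number of gold pairs."""
    hits = 0
    precision_sum = 0.0
    for rank, (_, s, t) in enumerate(ranked_pairs, start=1):
        if (s, t) in gold:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(gold) if gold else 0.0


# Hypothetical toy data: 2-dimensional vectors assumed to live in a shared space.
src_vecs = {"cell": [1.0, 0.0], "gene": [0.0, 1.0]}
tgt_vecs = {"cellule": [0.9, 0.1], "gène": [0.2, 0.8]}
gold = {("cell", "cellule"), ("gene", "gène")}

ranked = rank_term_pairs(src_vecs, tgt_vecs)
ap = average_precision(ranked, gold)
```

In this toy case both gold pairs outrank the two incorrect combinations, so the ranking is perfect. With real embeddings the candidate list would normally be truncated or thresholded before scoring.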
Anthology ID:
2022.bucc-1.6
Volume:
Proceedings of the BUCC Workshop within LREC 2022
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Reinhard Rapp, Pierre Zweigenbaum, Serge Sharoff
Venue:
BUCC
Publisher:
European Language Resources Association
Pages:
43–49
URL:
https://aclanthology.org/2022.bucc-1.6
Cite (ACL):
Borek Požár, Klára Tauchmanová, Kristýna Neumannová, Ivana Kvapilíková, and Ondřej Bojar. 2022. CUNI Submission to the BUCC 2022 Shared Task on Bilingual Term Alignment. In Proceedings of the BUCC Workshop within LREC 2022, pages 43–49, Marseille, France. European Language Resources Association.
Cite (Informal):
CUNI Submission to the BUCC 2022 Shared Task on Bilingual Term Alignment (Požár et al., BUCC 2022)
PDF:
https://aclanthology.org/2022.bucc-1.6.pdf