Building Comparable Corpora for Assessing Multi-Word Term Alignment

Omar Adjali; Emmanuel Morin; Pierre Zweigenbaum

Abstract

Recent work has demonstrated the importance of dealing with Multi-Word Terms (MWTs) in several Natural Language Processing applications. In particular, MWTs pose serious challenges for alignment and machine translation systems because of their syntactic and semantic properties. Thus, developing algorithms that handle MWTs is becoming essential for many NLP tasks. However, the availability of bilingual and, more generally, multilingual resources is limited, especially for low-resource languages and in specialized domains. In this paper, we propose an approach for building comparable corpora and bilingual term dictionaries that help evaluate bilingual term alignment in comparable corpora. To that end, we exploit parallel corpora to perform automatic bilingual MWT extraction and comparable corpus construction. Parallel information helps to align bilingual MWTs and makes it easier to build comparable specialized sub-corpora. Experimental validation on an existing dataset and on manually annotated data shows the value of the proposed methodology.

Anthology ID: 2022.lrec-1.332
Volume: Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month: June
Year: 2022
Address: Marseille, France
Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association
Pages: 3103–3112
URL: https://aclanthology.org/2022.lrec-1.332/
Cite (ACL): Omar Adjali, Emmanuel Morin, and Pierre Zweigenbaum. 2022. Building Comparable Corpora for Assessing Multi-Word Term Alignment. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 3103–3112, Marseille, France. European Language Resources Association.
Cite (Informal): Building Comparable Corpora for Assessing Multi-Word Term Alignment (Adjali et al., LREC 2022)
PDF: https://aclanthology.org/2022.lrec-1.332.pdf
