Reducing the Search Space for Parallel Sentences in Comparable Corpora

Rémi Cardon, Natalia Grabar


Abstract
This paper describes and evaluates simple techniques for reducing the search space for parallel sentences in monolingual comparable corpora. Initially, when searching for parallel sentences between two comparable documents, all the possible sentence pairs between the documents have to be considered, which introduces a great degree of imbalance between parallel pairs and non-parallel pairs. This is a problem because even with a high-performing algorithm, a lot of noise will be present in the extracted results, thus introducing a need for an extensive and costly manual check phase. We work on a manually annotated subset obtained from a French comparable corpus and show how we can drastically reduce the number of sentence pairs that have to be fed to a classifier so that the results can be manually handled.
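To make the imbalance described in the abstract concrete, the following minimal sketch enumerates every candidate sentence pair between two documents and counts how many non-parallel pairs there are per parallel pair. It is an illustration only, not the authors' method: the function names `candidate_pairs` and `imbalance`, the toy sentences, and the assumption of a single parallel pair are all hypothetical.

```python
from itertools import product


def candidate_pairs(doc_a, doc_b):
    """Return every sentence pair between two documents (Cartesian product)."""
    return list(product(doc_a, doc_b))


def imbalance(n_candidates, n_parallel):
    """Return the number of non-parallel pairs per parallel pair."""
    if n_parallel <= 0:
        raise ValueError("n_parallel must be positive")
    return (n_candidates - n_parallel) / n_parallel


# Hypothetical toy documents; real comparable documents are much longer.
technical = [
    "Sentence A1.",
    "Sentence A2.",
    "Sentence A3.",
]
simplified = [
    "Sentence B1.",
    "Sentence B2.",
]

pairs = candidate_pairs(technical, simplified)
print(len(pairs))                 # 3 * 2 = 6 candidate pairs
print(imbalance(len(pairs), 1))   # assuming 1 parallel pair: 5.0
```

Because the number of candidates is the product of the two document lengths, the imbalance grows quickly with document size: under the same assumption, two documents of 100 sentences each with 10 parallel pairs would give 10,000 candidates and 999 non-parallel pairs per parallel pair.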
Anthology ID:
2020.bucc-1.7
Volume:
Proceedings of the 13th Workshop on Building and Using Comparable Corpora
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Reinhard Rapp, Pierre Zweigenbaum, Serge Sharoff
Venue:
BUCC
Publisher:
European Language Resources Association
Pages:
44–48
Language:
English
URL:
https://aclanthology.org/2020.bucc-1.7
Cite (ACL):
Rémi Cardon and Natalia Grabar. 2020. Reducing the Search Space for Parallel Sentences in Comparable Corpora. In Proceedings of the 13th Workshop on Building and Using Comparable Corpora, pages 44–48, Marseille, France. European Language Resources Association.
Cite (Informal):
Reducing the Search Space for Parallel Sentences in Comparable Corpora (Cardon & Grabar, BUCC 2020)
PDF:
https://aclanthology.org/2020.bucc-1.7.pdf