Parallel Sentence Retrieval From Comparable Corpora for Biomedical Text Simplification

Rémi Cardon, Natalia Grabar


Abstract
Parallel sentences convey semantically similar information that varies along a given dimension, such as language or register. Parallel sentences with register variation (for example, from expert and non-expert documents) can be exploited for automatic text simplification. The aim of automatic text simplification is to make information easier to access and understand. In the biomedical field, simplification may help patients understand medical and health texts. Yet no such resources are currently available. We propose to exploit comparable corpora distinguished by their registers (specialized and simplified versions) to detect and align parallel sentences. These corpora are in French and relate to the biomedical domain. Manually created reference data show 0.76 inter-annotator agreement. Our purpose is to determine whether a given pair of specialized and simplified sentences is parallel and can be aligned. We treat this task as binary classification (alignment/non-alignment). We perform experiments with a controlled imbalance ratio and on the highly imbalanced real data. Our results show that the method we present here can be used to automatically generate a corpus of parallel sentences from our comparable corpus.
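The binary framing and the controlled imbalance ratio mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not describe the features or classifiers actually used, so everything below is an assumption made for illustration: the token-overlap (Jaccard) feature, the fixed threshold, and the hypothetical names `jaccard`, `build_dataset`, and `classify`. The toy sentences are in English, whereas the corpora in the paper are in French.

```python
import random
from itertools import product


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard overlap between two sentences (illustrative feature only)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def build_dataset(specialized, simplified, aligned, ratio, seed=0):
    """Return ((specialized, simplified), label) items with `ratio` negatives per positive.

    `aligned` is a set of (i, j) index pairs marked as parallel in the reference data.
    All remaining (i, j) combinations are treated as non-aligned candidates.
    """
    rng = random.Random(seed)
    positives = sorted(aligned)
    negatives = [
        pair
        for pair in product(range(len(specialized)), range(len(simplified)))
        if pair not in aligned
    ]
    # Cap the number of negatives so the imbalance ratio is controlled.
    n_neg = min(len(negatives), ratio * len(positives))
    sampled = rng.sample(negatives, n_neg)
    data = [((specialized[i], simplified[j]), 1) for i, j in positives]
    data += [((specialized[i], simplified[j]), 0) for i, j in sampled]
    return data


def classify(pair, threshold=0.3):
    """Label a pair as aligned (1) when its overlap reaches the assumed threshold."""
    return int(jaccard(*pair) >= threshold)


if __name__ == "__main__":
    spec = ["the patient has hypertension",
            "myocardial infarction requires urgent care"]
    simp = ["the patient has high blood pressure",
            "a heart attack requires urgent care"]
    for pair, label in build_dataset(spec, simp, {(0, 0), (1, 1)}, ratio=1):
        print(label, classify(pair), pair)
```

Raising `ratio` moves the sampled data toward the highly imbalanced situation found in real comparable corpora, where nearly all candidate pairs are non-aligned.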
Anthology ID:
R19-1020
Volume:
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
Month:
September
Year:
2019
Address:
Varna, Bulgaria
Editors:
Ruslan Mitkov, Galia Angelova
Venue:
RANLP
Publisher:
INCOMA Ltd.
Pages:
168–177
URL:
https://aclanthology.org/R19-1020
DOI:
10.26615/978-954-452-056-4_020
Cite (ACL):
Rémi Cardon and Natalia Grabar. 2019. Parallel Sentence Retrieval From Comparable Corpora for Biomedical Text Simplification. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 168–177, Varna, Bulgaria. INCOMA Ltd.
Cite (Informal):
Parallel Sentence Retrieval From Comparable Corpora for Biomedical Text Simplification (Cardon & Grabar, RANLP 2019)
PDF:
https://aclanthology.org/R19-1020.pdf