SEDAR: a Large Scale French-English Financial Domain Parallel Corpus

Abbas Ghaddar, Phillippe Langlais


Abstract
This paper describes the acquisition, preprocessing and characteristics of SEDAR, a large-scale English-French parallel corpus for the financial domain. Our extensive experiments on machine translation show that SEDAR is essential to obtain good performance on finance. We observe a large gain in the performance of machine translation systems trained on SEDAR when tested on finance, which makes SEDAR suitable to study domain adaptation for neural machine translation. The first release of the corpus comprises 8.6 million high-quality sentence pairs that are publicly available for research at https://github.com/autorite/sedar-bitext.
Anthology ID:
2020.lrec-1.442
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
3595–3602
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.442
Cite (ACL):
Abbas Ghaddar and Phillippe Langlais. 2020. SEDAR: a Large Scale French-English Financial Domain Parallel Corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3595–3602, Marseille, France. European Language Resources Association.
Cite (Informal):
SEDAR: a Large Scale French-English Financial Domain Parallel Corpus (Ghaddar & Langlais, LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.442.pdf
Code:
autorite/sedar-bitext