NLP Analytics in Finance with DoRe: A French 250M Tokens Corpus of Corporate Annual Reports

Corentin Masson, Patrick Paroubek


Abstract
Recent advances in neural computing and word embeddings for semantic processing open up many new application areas that had so far been left unaddressed because of inadequate language understanding capacity. But these new approaches rely even more heavily on training data to be operational. Corpora for financial applications exist, but most of them concern stock market prediction and are in English. To address this need for the French language and for regulation-oriented applications, which require a deeper understanding of the text content, we present “DoRe”, a French and dialectal French corpus for NLP analytics in Finance, Regulation and Investment. This corpus is composed of: (a) 1769 annual reports from 336 companies among the most capitalized companies in France (Euronext Paris) and Belgium (Euronext Brussels), covering a time frame from 2009 to 2019, and (b) related metadata containing, for each company, its ISIN code, capitalization and sector. The corpus is designed to be as modular as possible in order to allow for maximum reuse in different tasks pertaining to Economics, Finance and Regulation. After presenting existing resources, we describe the construction of the DoRe corpus and the rationale behind our choices, and conclude with the spectrum of possible uses of this new resource for NLP applications.
Anthology ID:
2020.lrec-1.275
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
2261–2267
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.275
Cite (ACL):
Corentin Masson and Patrick Paroubek. 2020. NLP Analytics in Finance with DoRe: A French 250M Tokens Corpus of Corporate Annual Reports. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 2261–2267, Marseille, France. European Language Resources Association.
Cite (Informal):
NLP Analytics in Finance with DoRe: A French 250M Tokens Corpus of Corporate Annual Reports (Masson & Paroubek, LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.275.pdf