EDGAR-CORPUS: Billions of Tokens Make The World Go Round

Lefteris Loukas, Manos Fergadiotis, Ion Androutsopoulos, Prodromos Malakasiotis


Abstract
We release EDGAR-CORPUS, a novel corpus comprising annual reports from all the publicly traded companies in the US, spanning a period of more than 25 years. To the best of our knowledge, EDGAR-CORPUS is the largest financial NLP corpus available to date. All the reports are downloaded, split into their corresponding items (sections), and provided in a clean, easy-to-use JSON format. We use EDGAR-CORPUS to train and release EDGAR-W2V, a set of WORD2VEC embeddings for the financial domain. We employ these embeddings in a battery of financial NLP tasks and showcase their superiority over generic GloVe embeddings and other existing financial word embeddings. We also open-source EDGAR-CRAWLER, a toolkit that facilitates downloading and extracting future annual reports.
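As a rough illustration of what "split into items and provided in JSON" could look like in practice, the following minimal sketch parses one report and counts whitespace tokens per item. The key names (`filename`, `year`, `item_1`, `item_1A`, `item_7`), the sample text, and the helper `item_token_counts` are assumptions made for illustration; they are not taken from the abstract and may not match the released files.

```python
import json

# Hypothetical report: key names and text are illustrative assumptions,
# not the confirmed schema of the released corpus.
raw = """
{
  "filename": "example_10K.json",
  "year": "2020",
  "item_1": "We design and sell widgets to industrial customers.",
  "item_1A": "Our business is subject to supply chain risks.",
  "item_7": "Revenue increased compared with the prior year."
}
"""


def item_token_counts(report):
    """Return whitespace-token counts for every key that starts with 'item_'."""
    return {
        key: len(text.split())
        for key, text in report.items()
        if key.startswith("item_")
    }


report = json.loads(raw)
counts = item_token_counts(report)
for name, n in counts.items():
    print(f"{name}: {n} tokens")
```

Keeping each item under its own key would let downstream code select a single section (for example, only risk factors) without re-parsing the full filing.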
Anthology ID:
2021.econlp-1.2
Volume:
Proceedings of the Third Workshop on Economics and Natural Language Processing
Month:
November
Year:
2021
Address:
Punta Cana, Dominican Republic
Venues:
ECONLP | EMNLP
Publisher:
Association for Computational Linguistics
Pages:
13–18
URL:
https://aclanthology.org/2021.econlp-1.2
DOI:
10.18653/v1/2021.econlp-1.2
Cite (ACL):
Lefteris Loukas, Manos Fergadiotis, Ion Androutsopoulos, and Prodromos Malakasiotis. 2021. EDGAR-CORPUS: Billions of Tokens Make The World Go Round. In Proceedings of the Third Workshop on Economics and Natural Language Processing, pages 13–18, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
EDGAR-CORPUS: Billions of Tokens Make The World Go Round (Loukas et al., ECONLP 2021)
PDF:
https://aclanthology.org/2021.econlp-1.2.pdf
Code:
nlpaueb/edgar-crawler
Data:
EDGAR-CORPUS