Lefteris Loukas


pdf bib
FiNER: Financial Numeric Entity Recognition for XBRL Tagging
Lefteris Loukas | Manos Fergadiotis | Ilias Chalkidis | Eirini Spyropoulou | Prodromos Malakasiotis | Ion Androutsopoulos | Georgios Paliouras
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Publicly traded companies are required to submit periodic reports with eXtensive Business Reporting Language (XBRL) word-level tags. Manually tagging the reports is tedious and costly. We, therefore, introduce XBRL tagging as a new entity extraction task for the financial domain and release FiNER-139, a dataset of 1.1M sentences with gold XBRL tags. Unlike typical entity extraction datasets, FiNER-139 uses a much larger label set of 139 entity types. Most annotated tokens are numeric, with the correct tag per token depending mostly on context, rather than the token itself. We show that subword fragmentation of numeric expressions harms BERT’s performance, allowing word-level BILSTMs to perform better. To improve BERT’s performance, we propose two simple and effective solutions that replace numeric expressions with pseudo-tokens reflecting original token shapes and numeric magnitudes. We also experiment with FIN-BERT, an existing BERT model for the financial domain, and release our own BERT (SEC-BERT), pre-trained on financial filings, which performs best. Through data and error analysis, we finally identify possible limitations to inspire future work on XBRL tagging.


pdf bib
EDGAR-CORPUS: Billions of Tokens Make The World Go Round
Lefteris Loukas | Manos Fergadiotis | Ion Androutsopoulos | Prodromos Malakasiotis
Proceedings of the Third Workshop on Economics and Natural Language Processing

We release EDGAR-CORPUS, a novel corpus comprising annual reports from all the publicly traded companies in the US spanning a period of more than 25 years. To the best of our knowledge, EDGAR-CORPUS is the largest financial NLP corpus available to date. All the reports are downloaded, split into their corresponding items (sections), and provided in a clean, easy-to-use JSON format. We use EDGAR-CORPUS to train and release EDGAR-W2V, which are WORD2VEC embeddings for the financial domain. We employ these embeddings in a battery of financial NLP tasks and showcase their superiority over generic GloVe embeddings and other existing financial word embeddings. We also open-source EDGAR-CRAWLER, a toolkit that facilitates downloading and extracting future annual reports.

pdf bib
DICoE@FinSim-3: Financial Hypernym Detection using Augmented Terms and Distance-based Features
Lefteris Loukas | Konstantinos Bougiatiotis | Manos Fergadiotis | Dimitris Mavroeidis | Elias Zavitsanos
Proceedings of the Third Workshop on Financial Technology and Natural Language Processing