Greenback Bears and Fiscal Hawks: Finance is a Jungle and Text Embeddings Must Adapt

Peter Anderson; Mano Vikash Janardhanan; Jason He; Wei Cheng; Charlie Flanagan

doi:10.18653/v1/2024.emnlp-industry.26

Abstract

Financial documents are filled with specialized terminology, arcane jargon, and curious acronyms that pose challenges for general-purpose text embeddings. Yet few text embeddings specialized for finance have been reported in the literature, perhaps in part due to a lack of public datasets and benchmarks. We present BAM embeddings, a set of text embeddings finetuned on a carefully constructed dataset of 14.3M query-passage pairs including both public and proprietary financial documents. Demonstrating the benefits of domain-specific training, BAM embeddings achieve Recall@1 of 62.8% on a held-out test set, vs. only 39.2% for the best general-purpose text embedding from OpenAI. Further, BAM embeddings increase question answering accuracy by 8% on FinanceBench and show increased sensitivity to the finance-specific elements found in detailed, forward-looking, and company- and date-specific queries. To support further research, we describe our approach in detail, quantify the importance of hard negative mining and dataset scale, and publicly release our embeddings.
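
For readers unfamiliar with the Recall@1 metric quoted above, the following is a minimal sketch of how Recall@k can be computed for an embedding-based retriever. It assumes that each query has exactly one relevant passage and that passages are ranked by cosine similarity; the abstract states neither detail, and the function names and toy vectors are hypothetical illustrations, not the authors' evaluation code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(query_vecs, passage_vecs, gold_idx, k=1):
    """Fraction of queries whose gold passage appears in the top-k results.

    Assumption: one relevant passage per query, identified by its index
    in passage_vecs, and ranking by cosine similarity.
    """
    hits = 0
    for query, gold in zip(query_vecs, gold_idx):
        scores = [cosine(query, passage) for passage in passage_vecs]
        # Indices of the k highest-scoring passages.
        top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        hits += gold in top_k
    return hits / len(query_vecs)


# Toy 2-dimensional "embeddings" (hypothetical values for illustration only).
passages = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]
gold = [0, 1, 2]  # index of the relevant passage for each query

print(recall_at_k(queries, passages, gold, k=1))
print(recall_at_k(queries, passages, gold, k=2))
```

In this toy setup the third query ranks passage 0 above its gold passage 2, so it counts as a miss at k=1 but a hit at k=2.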

Anthology ID: 2024.emnlp-industry.26
Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
Month: November
Year: 2024
Address: Miami, Florida, US
Editors: Franck Dernoncourt, Daniel Preoţiuc-Pietro, Anastasia Shimorina
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 362–370
URL: https://aclanthology.org/2024.emnlp-industry.26/
DOI: 10.18653/v1/2024.emnlp-industry.26
Cite (ACL): Peter Anderson, Mano Vikash Janardhanan, Jason He, Wei Cheng, and Charlie Flanagan. 2024. Greenback Bears and Fiscal Hawks: Finance is a Jungle and Text Embeddings Must Adapt. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 362–370, Miami, Florida, US. Association for Computational Linguistics.
Cite (Informal): Greenback Bears and Fiscal Hawks: Finance is a Jungle and Text Embeddings Must Adapt (Anderson et al., EMNLP 2024)
PDF: https://aclanthology.org/2024.emnlp-industry.26.pdf
