MultiFin: A Dataset for Multilingual Financial NLP

Rasmus Jørgensen; Oliver Brandt; Mareike Hartmann; Xiang Dai; Christian Igel; Desmond Elliott

doi:10.18653/v1/2023.findings-eacl.66

Abstract

Financial information is generated and distributed across the world, resulting in a vast amount of domain-specific multilingual data. Multilingual models adapted to the financial domain would ease deployment when an organization needs to work with multiple languages on a regular basis. For the development and evaluation of such models, there is a need for multilingual financial language processing datasets. We describe MultiFin, a publicly available financial dataset consisting of real-world article headlines covering 15 languages across different writing systems and language families. The dataset has a hierarchical label structure that provides two classification tasks: multi-label and multi-class. We develop our annotation schema based on a real-world application and annotate our dataset using both ‘label by native-speaker’ and ‘translate-then-label’ approaches. The evaluation of several popular multilingual models, e.g., mBERT, XLM-R, and mT5, shows that although decent accuracy can be achieved in high-resource languages, there is substantial room for improvement in low-resource languages.

Anthology ID: 2023.findings-eacl.66
Volume: Findings of the Association for Computational Linguistics: EACL 2023
Month: May
Year: 2023
Address: Dubrovnik, Croatia
Editors: Andreas Vlachos, Isabelle Augenstein
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 894–909
URL: https://aclanthology.org/2023.findings-eacl.66/
DOI: 10.18653/v1/2023.findings-eacl.66
Cite (ACL): Rasmus Jørgensen, Oliver Brandt, Mareike Hartmann, Xiang Dai, Christian Igel, and Desmond Elliott. 2023. MultiFin: A Dataset for Multilingual Financial NLP. In Findings of the Association for Computational Linguistics: EACL 2023, pages 894–909, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal): MultiFin: A Dataset for Multilingual Financial NLP (Jørgensen et al., Findings 2023)
PDF: https://aclanthology.org/2023.findings-eacl.66.pdf
Video: https://aclanthology.org/2023.findings-eacl.66.mp4
