Sohom Ghosh

2025

InFiNITE (∞): Indian Financial Narrative Inference Tasks & Evaluations
Sohom Ghosh | Arnab Maji | Sudip Kumar Naskar
Proceedings of the 5th Workshop on Evaluation and Comparison of NLP Systems

This paper introduces Indian Financial Narrative Inference Tasks and Evaluations (InFiNITE), a comprehensive framework for analyzing India’s financial narratives through three novel inference tasks. Firstly, we present multi-modal earnings call analysis by integrating transcripts, presentation visuals, and market indicators via the Multi-Modal Indian Earnings Calls (MiMIC) dataset, enabling holistic prediction of post-call stock movements. Secondly, our Budget-Assisted Sectoral Impact Ranking (BASIR) dataset aids in systematically decoding government fiscal narratives by classifying budget excerpts into 81 economic sectors and evaluating their post-announcement equity performance. Thirdly, we introduce Bharat IPO Rating (BIR) datasets to redefine Initial Public Offering (IPO) evaluation through prospectus analysis, classifying potential investments into four recommendation categories (Apply, May Apply, Neutral, Avoid). By unifying textual, visual, and quantitative modalities across corporate, governmental, and public investment domains, InFiNITE addresses critical gaps in Indian financial narrative analysis. The open source data sets of the framework, including earnings calls, union budgets, and IPO prospectuses, establish benchmark resources specific to India for computational economic research under permissive licenses. For investors, InFiNITE enables data-driven identification of capital allocation opportunities and IPO risks, while policymakers gain structured insights to assess Indian fiscal communication impacts. By releasing these datasets publicly, we aim to facilitate research in computational economics and financial text analysis, particularly for the Indian market.

2024

Fine-tuning Language Models for Predicting the Impact of Events Associated to Financial News Articles
Neelabha Banerjee | Anubhav Sarkar | Swagata Chakraborty | Sohom Ghosh | Sudip Kumar Naskar
Proceedings of the Joint Workshop of the 7th Financial Technology and Natural Language Processing, the 5th Knowledge Discovery from Unstructured Data in Financial Services, and the 4th Workshop on Economics and Natural Language Processing

Investors and other stakeholders, such as consumers and employees, increasingly consider ESG factors when making decisions about investments or engaging with companies. Given the importance of ESG today, FinNLP-KDF introduced the ML-ESG-3 shared task, which seeks to determine the duration of the impact of financial news articles in four languages: English, French, Korean, and Japanese. This paper describes the approach of our team, LIPI, to this task. Our final systems consist of translation, paraphrasing, and fine-tuning language models like BERT, Fin-BERT, and RoBERTa for classification. We ranked first in the impact duration prediction subtask for the French language.

IndicFinNLP: Financial Natural Language Processing for Indian Languages
Sohom Ghosh | Arnab Maji | Aswartha Narayana | Sudip Kumar Naskar
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Applications of Natural Language Processing (NLP) in the finance domain have been very popular of late. For financial NLP (FinNLP), while various datasets exist for widely spoken languages like English and Chinese, datasets are scarce for low-resource languages, particularly Indian languages. In this paper, we address this challenge by presenting IndicFinNLP – a collection of 9 datasets covering three FinNLP tasks for three Indian languages. These tasks are Exaggerated Numeral Detection, Sustainability Classification, and ESG Theme Determination of financial texts in Hindi, Bengali, and Telugu. Moreover, we release the datasets under the CC BY-NC-SA 4.0 license for the benefit of the research community.

2023

A low resource framework for Multi-lingual ESG Impact Type Identification
Harsha Vardhan | Sohom Ghosh | Ponnurangam Kumaraguru | Sudip Naskar
Proceedings of the Sixth Workshop on Financial Technology and Natural Language Processing

With the growing interest in Green Investing, Environmental, Social, and Governance (ESG) factors related to institutions and financial entities have become extremely important for investors. While the classification of potential ESG factors is an important issue, identifying whether the factors positively or negatively impact the institution is also a key aspect to consider when evaluating ESG scores. This paper presents our solution for identifying ESG impact types in four languages (English, Chinese, Japanese, French), released as shared tasks during the FinNLP workshop at the IJCNLP-AACL-2023 conference. We use a combination of translation, masked language modeling, paraphrasing, and classification to solve this problem, within a generalized pipeline that performs well across all four languages. Our team ranked 1st in the Chinese and Japanese sub-tasks.

2022

LIPI at the FinNLP-2022 ERAI Task: Ensembling Sentence Transformers for Assessing Maximum Possible Profit and Loss from Online Financial Posts
Sohom Ghosh | Sudip Kumar Naskar
Proceedings of the Fourth Workshop on Financial Technology and Natural Language Processing (FinNLP)

Using insights from social media for making investment decisions has become mainstream. However, in the current era of information explosion, it is essential to mine high-quality social media posts. The FinNLP-2022 ERAI task deals with assessing Maximum Possible Profit (MPP) and Maximum Loss (ML) from social media posts relating to finance. In this paper, we present our team LIPI’s approach. We ensembled a range of Sentence Transformers to quantify these posts. Unlike other teams with varying performances across different metrics, our system performs consistently well. Our code is available here: https://github.com/sohomghosh/LIPI_ERAI_FinNLP_EMNLP-2022/

Ranking Environment, Social And Governance Related Concepts And Assessing Sustainability Aspect of Financial Texts
Sohom Ghosh | Sudip Kumar Naskar
Proceedings of the Fourth Workshop on Financial Technology and Natural Language Processing (FinNLP)

Understanding Environmental, Social, and Governance (ESG) factors related to financial products has become extremely important for investors. However, manually screening through the corporate policies and reports to understand their sustainability aspect is extremely tedious. In this paper, we propose solutions to two such problems which were released as shared tasks of the FinNLP workshop of the IJCAI-2022 conference. Firstly, we train a Sentence Transformers based model which automatically ranks ESG related concepts for a given unknown term. Secondly, we fine-tune a RoBERTa model to classify financial texts as sustainable or not. Out of 26 registered teams, our team ranked 4th in sub-task 1 and 3rd in sub-task 2. The source code can be accessed from https://github.com/sohomghosh/Finsim4_ESG

FinRAD: Financial Readability Assessment Dataset - 13,000+ Definitions of Financial Terms for Measuring Readability
Sohom Ghosh | Shovon Sengupta | Sudip Naskar | Sunny Kumar Singh
Proceedings of the 4th Financial Narrative Processing Workshop @LREC2022

In today’s world, the advancement and spread of the Internet and digitalization have resulted in most information being openly accessible. This holds true for financial services as well. Investors make data-driven decisions by analysing publicly available information like annual reports of listed companies, details regarding asset allocation of mutual funds, etc. Many a time these financial documents contain unknown financial terms. In such cases, it becomes important to look at their definitions. However, not all definitions are equally readable. Readability largely depends on the structure, complexity and constituent terms that make up a definition. This brings in the need for automatically evaluating the readability of definitions of financial terms. This paper presents a dataset, FinRAD, consisting of financial terms, their definitions and embeddings. In addition to standard readability scores (like “Flesch Reading Index (FRI)”, “Automated Readability Index (ARI)”, “SMOG Index Score (SIS)”, “Dale-Chall formula (DCF)”, etc.), it also contains the readability scores (AR) assigned based on sources from which the terms have been collected. We manually inspect a sample from it to ensure the quality of the assignment. Subsequently, we prove that the rule-based standard readability scores (like “Flesch Reading Index (FRI)”, “Automated Readability Index (ARI)”, “SMOG Index Score (SIS)”, “Dale-Chall formula (DCF)”, etc.) do not correlate well with the manually assigned binary readability scores of definitions of financial terms. Finally, we present a few neural baselines using transformer-based architectures to automatically classify these definitions as readable or not. A pre-trained FinBERT model fine-tuned on the FinRAD corpus performs the best (AU-ROC = 0.9927, F1 = 0.9610). This corpus can be downloaded from https://github.com/sohomghosh/FinRAD_Financial_Readability_Assessment_Dataset.

LIPI at FinCausal 2022: Mining Causes and Effects from Financial Texts
Sohom Ghosh | Sudip Naskar
Proceedings of the 4th Financial Narrative Processing Workshop @LREC2022

While reading financial documents, investors need to know the causes and their effects. This empowers them to make data-driven decisions. Thus, there is a need to develop an automated system for extracting causes and their effects from financial texts using Natural Language Processing. In this paper, we present the approach our team LIPI followed while participating in the FinCausal 2022 shared task. This approach is based on the winning solution of the first edition of FinCausal held in the year 2020.

2021

Term Expansion and FinBERT fine-tuning for Hypernym and Synonym Ranking of Financial Terms
Ankush Chopra | Sohom Ghosh
Proceedings of the Third Workshop on Financial Technology and Natural Language Processing

Data Driven Content Creation using Statistical and Natural Language Processing Techniques for Financial Domain
Ankush Chopra | Sohom Ghosh | Prateek Nagwanshi
Proceedings of the 3rd Financial Narrative Processing Workshop

FinRead: A Transfer Learning Based Tool to Assess Readability of Definitions of Financial Terms
Sohom Ghosh | Shovon Sengupta | Sudip Naskar | Sunny Kumar Singh
Proceedings of the 18th International Conference on Natural Language Processing (ICON)

Simplified definitions of complex terms help learners to understand any content better. Comprehending readability is critical for the simplification of these contents. In most cases, the standard formula-based readability measures do not hold good for measuring the complexity of definitions of financial terms. Furthermore, some of them work only for longer corpora which have at least 30 sentences. In this paper, we present a tool for evaluating the readability of definitions of financial terms. It consists of a Light GBM based classification layer over sentence embeddings (Reimers et al., 2019) of FinBERT (Araci, 2019). It is trained on glossaries of several financial textbooks and definitions of various financial terms which are available on the web. The extensive evaluation shows that it outperforms the standard benchmarks by achieving an AU-ROC score of 0.993 on the validation set.
