Harnessing LLMs for Temporal Data - A Study on Explainable Financial Time Series Forecasting



Introduction
The rapid advancements in Machine Learning (ML) and Artificial Intelligence (AI) technologies over the past few years have opened up numerous opportunities and challenges across various domains, including the realm of financial markets (Kovalerchuk and Vityaev, 2005; Bahrammirzaee, 2010; Qi and Xiao, 2018). In particular, the task of financial time series forecasting, a key element in industrial risk management, market insights, strategic decision-making, and policy formation, has witnessed significant technological innovations, from statistical/econometric time series techniques (Härdle et al., 1997; Andersen et al., 2009; Chen et al., 2011; Patton, 2012), to machine learning techniques (Kim, 2003; Yoo et al., 2005; Krollner et al., 2010), to deep learning (Dingli and Fournier, 2017; Júnior and Nievola, 2018; Sezer et al., 2020; Leung and Zhao, 2021; Lara-Benítez et al., 2021). Despite these advancements, there are several inherent challenges associated with the deployment of ML/AI models in finance.
One challenge lies in the realm of cross-sequence reasoning and inference, a vital aspect for understanding temporal patterns and making accurate predictions. The current approaches include time-series correlation analysis (Plerou et al., 1999; Gopikrishnan et al., 2000; Conlon et al., 2009; Chen et al., 2018) and clustering (Rani and Sikka, 2012; Babu et al., 2012; Aghabozorgi et al., 2015). Deep learning has recently been leveraged to learn from the complex latent dependencies among time series (Hua et al., 2019; Maulik et al., 2020; Song and Fujimura, 2021; Nguyen and Quanz, 2021). Despite these advancements, existing methods have yet to effectively capture the intricate dependencies characteristic of time series data. The varying design, implementation, and data requirements of these methods further create a barrier to their widespread application in the field.
Another notable hurdle involves handling complex multi-modal financial temporal data that extends beyond numeric sequences. The data may encapsulate diverse sources such as historical news, financial knowledge graphs, social media activities, and various other market indicators. There have been recent efforts leveraging statistical inference (Kanungsukkasem and Leelanupab, 2019), RNNs/CNNs with text embeddings (Vargas et al., 2017), graph neural networks (Cheng et al., 2022), etc., to integrate this complex information.
Last but of utmost importance, the issue of interpretability and explainability poses significant challenges to the trustworthiness of machine learning and deep learning models. The majority of existing deep learning models operate as black boxes, offering little insight into their decision-making processes. This lack of transparency raises concerns about result reliability and impedes user trust. This is particularly relevant in sensitive fields like finance, where substantial investments and assets are at stake. A recent study tries to understand deep-learning-based predictions through attention scores (Hsieh et al., 2021), but such insight is still not readily human-readable and requires considerable interpretation effort.
The recent advancement of Large Language Models (LLMs) (Brown et al., 2020a; Touvron et al., 2023b; Brown et al., 2020b; OpenAI, 2023a) potentially lends us a powerful tool to address all the above challenges in a unified, flexible way. First, LLMs can learn complex relations among sequences. LLMs are among the most powerful Transformer-based models, and there is abundant research showing that Transformer-based models are capable of learning the underlying complex relations among textual sequences (Yun et al., 2019; Rong et al., 2020; Zhang et al., 2020; Dwivedi and Bresson, 2020; Ying et al., 2021) and solving quantitative problems (Wei et al., 2022; Lewkowycz et al., 2022; Imani et al., 2023). It is therefore reasonable to expect LLMs to have the potential to understand complex dependencies among numeric time series augmented by temporal textual sequences.
Secondly, LLMs have demonstrated outstanding reasoning and inference capability over multi-modal data. By design, LLMs are proficient at learning from a broad spectrum of data sources and types. They are trained on vast amounts of text from the internet, encompassing a wide range of topics, styles, and formats. This equips them to handle diverse input data, such as numerical, textual, and structured data (Wu et al., 2023; Shen et al., 2023). This multi-modal data handling capability could be particularly useful for financial forecasting, where crucial information often comes from disparate sources, such as numerical market data, textual news articles, and social media posts.
Lastly, LLMs are natural explainers that generate human-readable explanations providing insight into a decision. One of the key advantages of LLMs is their ability to generate natural language text that is coherent, contextual, and comprehensive. This allows them to provide human-readable explanations for their decisions (Zhao et al., 2023). Furthermore, through Chain-of-Thoughts (COT) or step-by-step thinking (Wei et al., 2022; Zhang et al., 2023; Lightman et al., 2023), beyond a few sentences of explanation, LLMs can even generate detailed step-by-step reasoning to reveal the decision-making process.
The following summarizes the main contributions of this paper:
• This paper presents a novel exploration of LLMs' potential for the valuable task of explainable financial time series forecasting. We focus on the NASDAQ-100 stock price time series. To the best of our knowledge, there are no public studies on this topic to date.
• We experiment with a combination of zero-shot/few-shot inference techniques using the state-of-the-art GPT-4 model (OpenAI, 2023a), and instruction-based fine-tuning using Open LLaMA (Geng and Liu, 2023). Our experiment results also show that the chain-of-thoughts technique helps boost performance in most of the experiments.
• We compare our proposed LLM approaches with existing methods, including an ARMA-GARCH model and a gradient-boosting tree model. We show that even zero-shot inference using GPT-4 can outperform a boosting-tree model with ∼300 features.

Related Works
The field of financial time series forecasting has been a subject of extensive research, with various methodologies being proposed over the years.

Traditional Statistical/Econometric Methods
Traditional statistical/econometric methods have long been the cornerstone of financial time series forecasting. Techniques such as ARMA-GARCH models have been widely used due to their ability to capture dependencies and volatility clustering in financial time series (Drost and Nijman, 1993; Francq and Zakoian, 2004; Andersen et al., 2009; Henneke et al., 2011). These models have been extended and modified in various ways to better capture the complexities of financial markets (Tang et al., 2003; Ghahramani and Thavaneswaran, 2006; Hossain and Nasser, 2011; Ma and Yu, 2013). Other popular statistical/econometric methods for financial time series include Vector Autoregressive (VAR) models (Zivot and Wang, 2006), State-Space Models and the Kalman Filter (De Jong and Zehnwirth, 1983), Diffusion Models (Fan, 2005), the Vector Error Correction Model (VECM) (Johansen, 1995), Dynamic Stochastic General Equilibrium (DSGE) models (Smets and Wouters, 2003), etc.

Machine Learning Techniques
With the advent of machine learning, a variety of models have been applied to financial forecasting. Decision trees, support vector machines, etc., have been actively studied for financial time series prediction (Trafalis and Ince, 2000; Yang et al., 2002; Pai and Lin, 2005; Wang and Chan, 2006; Tsai and Wang, 2009; Li and Liao, 2017). More recently, deep learning techniques, such as Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), and Transformer models, have been applied to this task, demonstrating their ability to capture complex, non-linear relationships in the data (Dingli and Fournier, 2017; Júnior and Nievola, 2018; Sezer et al., 2020; Leung and Zhao, 2021; Lara-Benítez et al., 2021).

Large Language Models
The recent development of Large Language Models (LLMs) has opened up new possibilities for financial time series forecasting. LLMs such as GPT-3 (Brown et al., 2020b), GPT-4 (OpenAI, 2023a), and LLaMA (Touvron et al., 2023a) (including Alpaca (Taori et al., 2023) and Vicuna (Chiang et al., 2023)) have demonstrated remarkable capabilities in reasoning over and understanding complex dependencies in heterogeneous data, along with the ability to generate human-readable explanations for their decisions (Zhao et al., 2023; Wei et al., 2022; Zhang et al., 2023; Lightman et al., 2023). However, the application of LLMs to financial time series forecasting with explanation is still a relatively unexplored area, and this paper aims to contribute to this emerging field.

Methodology
For this paper, we study the NASDAQ-100 stock price time series, supplemented by metadata about the stock company and relevant financial news data concerning both the specific stock and the broader financial/economic landscape. Our primary focus is on forecasting weekly/monthly stock returns (defined as the percentage change in stock price from the beginning to the end of the week/month) with accompanying explanations.
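The return definition above is simple but worth pinning down. The following minimal sketch computes the period return from a chronological price series; it is illustrative only, and the helper name `period_return` is our own.

```python
# Weekly/monthly return as defined above: the percentage change in the
# stock price from the beginning to the end of the week/month.

def period_return(prices):
    """Return (%) over a period given its chronological price series."""
    if len(prices) < 2 or prices[0] == 0:
        raise ValueError("need at least two prices and a nonzero start")
    return (prices[-1] - prices[0]) / prices[0] * 100.0

# Example: a week where the price moves from 100.0 to 103.5
weekly = period_return([100.0, 101.2, 99.8, 102.0, 103.5])  # ≈ 3.5
```

The same function serves for the monthly horizon by passing a month's worth of prices.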
We demonstrate our structured design of prompts for LLMs and apply the state-of-the-art GPT-4 model (OpenAI, 2023b) for zero-shot and few-shot inference tasks. For fine-tuning, we utilize the publicly available Open LLaMA (Geng and Liu, 2023). We also incorporate the technique of Chain of Thoughts (COT) (Wei et al., 2022; Lightman et al., 2023), which has been found to enhance the effectiveness of LLMs in other research studies.

Company Profile Data
We use GPT-4 to generate the company description and the general positive/negative factors that might impact the company's stock price. See Appendix Figure 1 for an example of the prompt asking GPT-4 to generate the company profile, along with the GPT-4 response.

Finance/Economy News Data
We use the Google Custom Search API to obtain the top-5 news stories on a weekly basis for each NASDAQ-100 stock. We then use GPT-4 to generate a summary and extract keywords from each obtained news article. An example of the prompt and GPT-4 response is shown in Appendix Figure 2. A similar method is applied to obtain the weekly top-5 news stories about the macro economy and finance.
To reduce the input size, we further generate a meta summary & keywords for each week using GPT-4, given all the top story summaries and keywords of the week. An example of the meta summary & keywords is shown in Appendix Figure 3. It looks similar to the example in Appendix Figure 2, but much more condensed. We use the meta summary & keywords for further experiments and evaluation (see Appendix Figure 4 for an example).
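To make the meta-summarization step concrete, here is a hypothetical prompt builder for it. The exact prompt wording used in our pipeline appears in the appendix figures; the template text and the function name `build_meta_summary_prompt` below are illustrative assumptions, not the actual prompt.

```python
# Illustrative (hypothetical) prompt builder for the weekly meta
# summary & keywords step; the actual prompt wording differs.

def build_meta_summary_prompt(symbol, week, story_summaries):
    header = (
        f"Below are the top news summaries and keywords for {symbol} "
        f"in week {week}. Condense them into a single meta summary and "
        "a single keyword list. Answer as 'Summary: ..., Keywords: ...'."
    )
    body = "\n\n".join(
        f"Story {i + 1}:\n{s}" for i, s in enumerate(story_summaries)
    )
    return header + "\n\n" + body

prompt = build_meta_summary_prompt(
    "AAPL", 12, ["Summary: cost cuts expanding. Keywords: Apple, cost-cutting"]
)
```

The resulting string is then submitted to GPT-4, whose condensed response replaces the per-article summaries in downstream prompts.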

Instruction
• We provide few-shot learning examples from stocks similar to the subject of interest. This design multi-purposes the few-shot examples to enable the LLM to consider cross-sequence information from other stocks. To identify similar stocks, we query GPT-4 with a query such as "List top 3 NASDAQ stocks most similar to AAPL". A typical response is "MSFT, GOOGL, AMZN". Here we implicitly leverage the LLM's inherent knowledge of financial entities and concepts.
• There are other tweaks to the prompt structure. For instance, we divided the instruction into two parts, positioning them at the beginning and end of the prompt. This aids the model in better recognizing the task: to predict next week's summary & keywords, rather than summarizing historical data. The predicted summary & keywords serve as the explanation for the stock return prediction.
We also experimented with the Chain-of-Thoughts approach (Wei et al., 2022; Zhang et al., 2023; Lightman et al., 2023), i.e., the idea of "step-by-step thinking", by appending the instruction "Can you reason step by step before finalizing the output?" to the end of the prompt. To our surprise, this notably improved performance by a few points (see Section 4.2). The result of the step-by-step thinking process in response to Appendix Figure 4 is illustrated in Appendix Figure 5, where it is evident that GPT-4 identifies a previously overlooked crucial point about "earnings reports" when explicit reasoning steps are generated.
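The prompt tweaks above can be sketched as a small assembly function. The COT suffix is the exact instruction quoted above; the head/tail instruction strings are hypothetical stand-ins for the actual instruction text in Appendix Figure 4.

```python
# Sketch of the prompt assembly: instruction split into a leading and a
# trailing part, with the Chain-of-Thoughts suffix optionally appended.
# INSTRUCTION_HEAD/TAIL wording is an assumption; COT_SUFFIX is verbatim.

INSTRUCTION_HEAD = "Predict next week's summary & keywords and the stock return."
INSTRUCTION_TAIL = ("Remember: the task is to predict next week, "
                    "not to summarize the historical data.")
COT_SUFFIX = "Can you reason step by step before finalizing the output?"

def assemble_prompt(context, use_cot=True):
    parts = [INSTRUCTION_HEAD, context, INSTRUCTION_TAIL]
    if use_cot:
        parts.append(COT_SUFFIX)
    return "\n\n".join(parts)
```

Switching `use_cot` on and off is how the with/without-COT variants in Section 4.2 would be produced under this sketch.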

Instruction-based Fine-tuning with Open LLaMA
We perform instruction-based fine-tuning using the Open LLaMA 13B model to see how well a publicly available model can perform in comparison to GPT-4, especially after fine-tuning. In its zero-shot inference, the Open LLaMA 13B model typically tends to replicate portions of the prompt rather than executing the prompt instructions; it is thus incapable of properly handling instruction-based prompts such as the one in Appendix Figure 4 without fine-tuning. We therefore focus on fine-tuning the Open LLaMA model in this paper.
Instruction-based fine-tuning has recently been shown to be effective in guiding the model's training process with specific directives (Taori et al., 2023; Peng et al., 2023). We created a dataset of 30K weekly forecasting examples plus 7K monthly forecasting examples, derived from 5-year historical data spanning June 2017 to June 2022. Unlike GPT-4, which supports up to an 8K token context, we need to compress the prompt into 1K tokens for fine-tuning Open LLaMA, due to model and hardware constraints. For each fine-tuning example, we employ GPT-4 to condense the full historical meta news summary/keywords (e.g., from week 8 to the last week, as shown in Appendix Figure 4) into a single, even more concise summary/keywords pair. Simultaneously, the "Company Profile" and "Forecasting Examples" sections of the prompt are also respectively condensed into more succinct summary paragraphs.
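For readers unfamiliar with instruction-tuning data, a typical record layout is sketched below. The Alpaca-style instruction/input/output schema and all field contents here are assumptions for illustration; the paper does not publish its exact record format.

```python
import json

# Hypothetical fine-tuning record (Alpaca-style instruction tuning);
# field contents are placeholders, not the dataset's actual text.

record = {
    "instruction": "Predict next week's stock return bin with an explanation.",
    "input": "Company profile: ... Compressed news summary: ...",  # <= 1K tokens
    "output": "Prediction: U2. Explanation: Summary: ..., Keywords: ...",
}
line = json.dumps(record)  # one JSON line per example in a JSONL training file
```

The 37K weekly/monthly examples described above would each map to one such record after GPT-4 compression of the prompt sections.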
While it would be ideal for Open LLaMA to manage its own end-to-end pipeline, including the task of prompt compression for fine-tuning, we currently resort to using GPT-4 for compression. This is because the Open LLaMA 13B model's zero-shot summarization capability is considerably inferior to that of GPT-4 in practice: the summaries and keywords it extracts often fall short of usability.
Once fine-tuned, the Open LLaMA 13B model demonstrates a much more satisfactory comprehension of the instruction, resulting in the generation of a forecast and an accompanying explanation that appear coherent. This is illustrated in Appendix Figure 6. As per the results in Section 4.2, when it comes to binary classification, the Open LLaMA model's performance is competitive with GPT-4. However, we have noticed that the Open LLaMA model has a tendency to produce more extreme predictions, such as U5+ or D5+, which result in a relatively higher squared error.

Data Time Window
The details of the data used in the experiments are as described in Section 3.1. We focus on NASDAQ-100 stock return forecasting for this paper.

Baseline Models
To evaluate the performance of our approach, we include a heuristic baseline using the most frequent historical bin (i.e., the most frequent bin from the historical weeks before the target week) as the prediction, an ARMA-GARCH model (p = q = 1) (Tang et al., 2003; Ma and Yu, 2013), and a gradient-boosting tree model (Natekin and Knoll, 2013) implemented with the LightGBM package (Ke et al., 2017). These baseline models are trained on the training/fine-tuning data time window and evaluated on the evaluation time window.
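The most-frequent-bin heuristic above is simple enough to sketch directly; this is an illustrative implementation, with tie-breaking by first occurrence as an assumption.

```python
from collections import Counter

# Heuristic baseline: predict the most frequent bin observed in the
# historical weeks before the target week.

def most_frequent_bin(history_bins):
    if not history_bins:
        raise ValueError("empty history")
    # Counter.most_common breaks ties by first-seen insertion order
    return Counter(history_bins).most_common(1)[0][0]

most_frequent_bin(["U1", "D1", "U1", "U2"])  # -> "U1"
```

At each target week the history grows by one observation, so the prediction is recomputed on the expanding window.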
For the gradient-boosting tree model, we include the following features, about 300 in total.
1. Historical price time series available in the daily stock price data, including open, close, min, max prices, and the daily trading volume.
3. The stock sector information and historical earnings, obtained from Alpha Vantage.
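As one illustration of how price history becomes tree features, the sketch below derives lagged weekly returns and simple rolling statistics from closing prices. The specific lag count and statistics are assumptions; the actual model uses ∼300 features spanning prices, volume, sector, and earnings.

```python
# Illustrative feature construction for the gradient-boosting baseline;
# lag/rolling choices here are assumptions, not the paper's feature set.

def lagged_features(closes, n_lags=4):
    """Features from a chronological weekly closing-price series."""
    rets = [(b - a) / a * 100.0 for a, b in zip(closes, closes[1:])]
    if len(rets) < n_lags:
        raise ValueError("not enough price history")
    window = rets[-n_lags:]
    mean = sum(window) / n_lags
    var = sum((r - mean) ** 2 for r in window) / n_lags
    return {
        # ret_lag_1 is the most recent weekly return
        **{f"ret_lag_{i}": r for i, r in enumerate(reversed(window), 1)},
        "ret_mean": mean,
        "ret_vol": var ** 0.5,  # crude rolling-volatility proxy
    }
```

Feature dictionaries of this form (concatenated with sector/earnings fields) would be fed to LightGBM for training.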

Evaluation Metrics
We perform weekly and monthly stock return forecasting with the baselines and the LLM-based methods. We treat 4 weeks as one month for convenience; therefore, there are 13 "months" in the 52-week evaluation time window.
To evaluate the performance of our forecasting models, we employ three metrics.
• Binary precision assesses the model's ability to correctly predict the general direction of stock price movement, i.e., "Up" (U) or "Down" (D).
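The binary precision metric above can be sketched as a direction-match rate. Reading the direction from the leading "U"/"D" of each bin label (as in U5+ or D5+) is an assumption about the label format.

```python
# Binary precision: the fraction of forecasts whose "Up"/"Down" direction
# matches the realized direction. Bin labels like "U2" or "D5+" carry the
# direction in their first character (assumed label format).

def binary_precision(pred_bins, true_bins):
    assert len(pred_bins) == len(true_bins) and pred_bins
    hits = sum(p[0] == t[0] for p, t in zip(pred_bins, true_bins))
    return hits / len(pred_bins)

binary_precision(["U1", "D2", "U3"], ["U2", "U1", "U3"])  # 2 of 3 match
```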
To evaluate the quality of the forecasting explanation (the predicted next-week/month summary/keywords), we employ ROUGE-1 and ROUGE-2 scores to compare against the summary/keywords extracted by GPT-4 from the actual top news of the next week/month.
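For intuition, ROUGE-1 boils down to unigram overlap between the predicted and reference text. The minimal F1 sketch below is illustrative only; the evaluation itself would use a standard ROUGE implementation.

```python
from collections import Counter

# Minimal ROUGE-1 F1 sketch (unigram overlap with clipped counts);
# a standard ROUGE package handles stemming, n>1, etc.

def rouge1_f1(candidate, reference):
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

rouge1_f1("apple expands cost cuts", "apple cuts costs and perks")
```

ROUGE-2 follows the same pattern over bigrams.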

Performance Evaluation
Our experiment results are summarized in Tables 1 and 2. Table 1 provides a comparative analysis of our LLM-based methods and the baseline models in terms of their performance in forecasting stock returns. Table 2, on the other hand, evaluates the quality of the explanations generated by the LLMs.
In summary, our results show the effectiveness of LLMs in financial time series forecasting, with "GPT-4 few-shot with COT" consistently showing the best performance in both prediction accuracy and explanation quality. The results also highlight that the Chain-of-Thoughts (COT) technique consistently boosts performance, and that instruction-based fine-tuning of publicly available LLMs like Open LLaMA, combined with COT, can achieve reasonable performance in comparison to GPT-4.

Stock Price Forecasting
From the results in Table 1, we observe that both GPT-4 and the Open LLaMA 13B model outperform the ARMA-GARCH model and the gradient-boosting tree model in terms of both binary and bin precision. GPT-4, in particular, shows superior performance in both zero-shot and few-shot settings, with the few-shot setting with COT achieving the best performance. In terms of MSE, "GPT-4 few-shot with COT" also achieves the lowest error, indicating that it not only best predicts the direction of the price change but also provides a more accurate estimate of the magnitude of the change.
The Open LLaMA 13B model, after fine-tuning, shows competitive performance compared to GPT-4 in terms of binary precision. However, its bin precision is noticeably worse, indicating that it lacks the fine-grained reasoning capability needed to pick the right bin. It also tends to produce more extreme predictions, resulting in a higher MSE.

Explanation Quality
Table 2 shows the quality of the explanations generated by the LLMs (GPT-4 and fine-tuned Open LLaMA), evaluated using ROUGE-1 and ROUGE-2 scores for both the summary (S) and keywords (K) of the news. Again, the results show that "GPT-4 few-shot with COT" achieves the highest ROUGE scores, indicating that it generates the most relevant and accurate explanations for the predictions. Open LLaMA, after fine-tuning with COT, also shows reasonable explanation quality, on par with GPT-4 results without COT.

Conclusion
In this study, we explored using Large Language Models (LLMs) to tackle inherent challenges in financial time series forecasting, such as cross-sequence reasoning, multi-modal signal integration, and result interpretability. In particular, we experimented with GPT-4 and Open LLaMA for NASDAQ-100 stock return prediction. With structured prompts comprising the company profile, historical stock prices, and financial news data, the LLMs generated human-understandable explanations and forecasts. The performance of these LLMs surpassed traditional models like ARMA-GARCH and gradient-boosting trees, especially when integrating a step-by-step reasoning process based on the Chain of Thought (COT) approach. Furthermore, our fine-tuning experiments highlighted the viability of tuning a publicly available LLM to achieve reasonable performance in comparison to GPT-4.
The preliminary results of applying LLMs to explainable financial forecasting are encouraging. This is the first step toward developing an LLM-based explainable financial forecasting system to assist business decision-making. We envision a future where financial forecasting is not only more precise but also more comprehensible and transparent, thus transforming financial and business decision-making across the sector.

Limitations
While we present promising initial results for the LLM-based approach to explainable financial time series forecasting based on NASDAQ-100 stock returns, the general applicability of our approach to different types of temporal data remains a question for future investigation.
• In a narrower context, the effectiveness of our approach when applied to other stock indices, such as the S&P 500 or Russell 2000, is yet to be validated. Each of these indices harbors distinct characteristics and diverse company compositions, which may influence the performance of our method.
• In a wider context, the potential of our method to forecast other types of financial temporal data remains unexplored. This includes internal time series such as return-on-investment (ROI), sales, headcounts, and costs from various departments, augmented by related internal documents. It also extends to other public time series data such as company earnings, housing prices, security prices, and futures prices.
The quality of the explanations generated by our method is currently evaluated using automatic text similarity metrics, namely ROUGE and BLEU scores. These scores compare predicted explanations to ground-truth news summaries and keywords. However, we have yet to thoroughly examine other aspects of these explanations, including the possibility of hallucination or the presence of specific patterns in explanations associated with failed time series forecasts.
Other signals can be considered for future investigations, such as macro-economy time series (Stock and Watson, 1999) and social media data (Mankar et al., 2018; Javed Awan et al., 2021).

Ethics Statement
In conducting our research, we committed to transparency in our methodology, results reporting, and data usage. Our work adhered to the guidelines and principles of responsible AI research. All data used in this paper is public data or can be obtained through publicly available APIs. No user data is involved in the research.

Prompt [template; those inside {} are placeholders]: Please summarize the following noisy but possible news data extracted from web page HTML, and extract keywords of the news. The news text can be very noisy due to it is HTML extraction. Give formatted answer such as Summary: ..., Keywords: ... The news is supposed to be for {symbol} stock. You may put 'N/A' if the noisy text does not have relevant information to extract.

GPT-4 Response:
Summary: Apple Inc is postponing perks for some corporate divisions and expanding a cost-cutting effort, joining its peers in streamlining operations. The company proposes to forego bonuses or promotions for April, with the payments occurring only in October. Apple is also restricting hiring and leaving additional positions open after employee departure. This is triggering fear of downsizing among employees.
Keywords: Apple, bonus, hiring, cost-cutting, promotions, downsizing, employees

Figure 2: An example of the news summary & keywords extracted from one news article for a stock's company (AAPL in this case), generated by GPT-4. The original news is at https://sports.yahoo.com/apple-joins-cost-cut-bandwagon-145845685.html. The prompt in this example is a template. Texts inside [] are comments and are not actually in the prompt when we submit it to the LLM, and we fill the stock symbol and the news into the placeholders enclosed by {}.
Summary: Apple Inc.'s stock is displaying rising relative strength, although shy of a key benchmark, and the company has been highlighted as a top-performing stock due to its sales of 1.3 billion iPhones and a significant $19.4 billion profit last quarter. As the NASDAQ surges with Apple as a big cap, Apple expands cost-cutting measures, postpones perks, restricts hiring, and leaves positions open, instigating downsizing concerns. The overall stock market performance is boosted by a $30 billion deposit infusion for struggling firms.

Figure 1: An example of a stock's company profile, consisting of the company description and the general positive/negative factors impacting the company's stock price.

Figure 3: An example of one week's meta summary & keywords, condensed from all the company's summaries and keywords from that week.
Prompt-Based Zero-shot/Few-shot Inference with LLMs

Table 1: Performance comparison between the baseline models and the LLMs for weekly/monthly stock price forecasts.

Table 2: Explanation quality evaluation using ROUGE scores, with the GPT-4 summary/keyword extraction of each week's true top news from Google search as the ground truth.