FAST: Financial News and Tweet Based Time Aware Network for Stock Trading

Designing profitable trading strategies is complex as stock movements are highly stochastic; the market is influenced by large volumes of noisy data across diverse information sources like news and social media. Prior work mostly treats stock movement prediction as a regression or classification task and is not directly optimized towards profit-making. Further, it does not model the fine-grained temporal irregularities in the release of vast volumes of text that the market responds to quickly. Addressing these limitations, we propose a novel hierarchical, learning-to-rank approach that uses textual data to make time-aware predictions for ranking stocks by expected profit. Our approach outperforms state-of-the-art methods by over 8% in terms of cumulative profit and risk-adjusted returns in trading simulations on two benchmarks: English tweets and Chinese financial news spanning two major stock indexes and four global markets. Through ablative and qualitative analyses, we build the case for our method as a tool for daily stock trading.


Introduction
The stock market, a financial ecosystem involving quantitative trading and investing, observed a market capitalization exceeding US$60 trillion as of 2019.¹ Stock trading presents lucrative opportunities for investors to utilize the market as a platform for investing funds and maximizing profits. However, making profitable investment decisions is challenging due to the market's volatile and rapidly changing nature (Adam et al., 2016; Foucault et al., 2016). Research at the intersection of natural language processing (NLP) and finance presents encouraging prospects in stock prediction (Jiang, 2020). Conventional work forecasts future trends by modeling numerical historical stock data (Lu et al., 2009; Bao et al., 2017). However, price signals alone cannot capture market surprises, news, and company announcements. Such events, often reported across financial news and social media, have been shown to influence market dynamics (Laakkonen, 2004). As shown in Figure 1, prices immediately react to breaking news about the related company. Such reactions conform to the Efficient Market Hypothesis (EMH), which states that financial markets are informationally efficient and prices reflect all available information (Malkiel, 1989).

*Equal contribution.
¹ World Federation of Exchanges: https://data.worldbank.org/indicator/CM.MKT.LCAP.CD/

Figure 1: Here, we study how Tesla's tweets influence investors' opinions about the company and impact its stock price trend. The first tweet shows positive opinions, and we observe a rise in prices. Later, the tweets made by the CEO rapidly lead to drastic price drops within minutes. Further, without a sequential context, it gets challenging to understand the tweets that follow.
The abundance of stock-affecting information across news and Twitter helps investors analyze market trends and inspires the adoption of NLP to study the interplay between textual data and stock prices (Xu and Cohen, 2018; Oliveira et al., 2017). However, unlike structured numerical data, analyzing natural language poses various challenges. First, analyzing individual tweets or news headlines may not be informative enough. They often exhibit a sequential context-dependency, where analyzing them together can provide a greater unified context, as shown in Figure 1. Despite the success of recurrent neural networks (RNNs) in modeling such a sequential context (§2), a critical drawback is that they assume all text to be equally spaced in time, ignoring the inherent timing irregularities of social media and news. Timing plays a critical role as stock markets rapidly react to new information (Foucault et al., 2016), leading to significant price changes within minutes, as shown in Figure 1. Scholtus et al. (2014) show that reacting one second slower than other market participants can lead to a loss of thousands of dollars. Further, not every text holds the potential to influence stock prices: texts have a diverse influence on stock prices based on their content, such as breaking news or tweets from a reliable source, as opposed to noise like vague comments, as shown in Figure 1. These observations mandate the need to factor in time-aware dependencies and diverse influence when analyzing online natural language data for stock trading.

Figure 2: More accurate methods M1 (higher accuracy) may not always be more profitable than less accurate methods M2 (lower accuracy). Profit is gained by selling stocks with an upward (↑) prediction for price movement from trading day t to t+1.
Despite profitability being the prime objective of trading, NLP methods for stock prediction (Xu and Cohen, 2018; Hu et al., 2017) are commonly framed as classification or regression tasks and are not directly optimized towards profitable stock selection. Consider the toy example in Figure 2, which shows that methods with higher classification accuracy may not always yield higher overall profits. This research gap in NLP methods for stock prediction presents a new direction for stock selection, where predictive performance and profits are jointly and directly optimized.
Contributions: We formulate stock prediction as a learning-to-rank problem ( §3.1) and present FAST: Financial News and Tweet based Time Aware Network for Stock Trading, which uses text for maximizing profit by jointly optimizing predictive power and the optimal ranking of stocks. FAST learns time-aware representations of financial news and tweets, and captures relevant market signals using hierarchical temporal attention for ranking stocks ( §3). Through experiments ( §4) on English and Chinese text corresponding to the NASDAQ, Shanghai, Shenzhen, and Hong Kong markets, we show that FAST outperforms state-of-the-art methods in terms of intraday returns by over 8% and risk-adjusted returns by over 10% ( §5.1, §5.2). Further, through exploratory ( §5.3, §5.4) and qualitative ( §6) analyses, we demonstrate the practical applicability of FAST to daily stock trading.

Background
Conventional Methods: Stock prediction spans various methods, commonly framed as regression or classification tasks (Jiang, 2020). Conventional methods rely on numeric features like historical prices (Kohara et al., 1997; Lin et al., 2009), technical indicators (Shynkevich et al., 2017), and macroeconomic indicators (Hoseinzade et al., 2019). These include discrete (Bollerslev, 1986), continuous (Andersen, 2007), and neural approaches (Feng et al., 2019a). Despite their success, these methods are limited to numerical features and do not factor in crucial stock-influencing signals such as text (Lee et al., 2014).
Contemporary Methods: Newer models, based on the EMH, leverage natural language features extracted from investor sentiments (Li and Shah, 2017), financial reports (Kogan et al., 2009; Rekabsaz et al., 2017), earnings calls (Qin and Yang, 2019), online news (Peng and Jiang, 2016; Chen et al., 2019a,b; Du and Tanaka-Ishii, 2020) and social media (Si et al., 2013; Tabari et al., 2018; Sawhney et al., 2020a) for stock price regression and movement classification tasks. These methods show how NLP can complement conventional price-based methods in capturing the effect of events like market surprises and mergers on stock returns. However, these methods do not directly optimize profit, and do not factor in the fine-grained irregularities in the release times of stock-affecting text. For stock trading, the timing of the release of information across these sources plays a critical role, as price changes rapidly incorporate all publicly available information (Norman, 2014). Firms may exploit investors' perception of market information (Forbes, 2009), for instance, by timing the release of negative news between positive ones to minimize losses (Segal and Segal, 2016). These limitations hinder contemporary methods from modeling a time-aware progression of stock-affecting market signals to directly optimize profit generation.
Time-aware Methods: Recently, time-aware modeling of time series data has shown improvements over conventional sequential models like RNNs and LSTMs on various tasks, such as patient subtyping (Baytas et al., 2017), suicide ideation and buildup detection using Twitter history (Sawhney et al., 2020b), and disease progression (Gao et al., 2020), among others. However, modeling the temporal dynamics inherent in social media and online news is complex, as it involves noisy and diversely influential data across irregular time intervals. The intersection of modeling the temporal dynamics of natural language with finance presents an underexplored yet promising research avenue.

Problem Formulation: Stock Ranking
We adopt a learning-to-rank formulation for stock selection. Let S = {s_1, s_2, ..., s_N} represent a set of N stocks, where for every stock s_i ∈ S, on a trading day τ, there is an associated closing price p_i^τ and a one-day return ratio r_i^τ = (p_i^τ − p_i^{τ−1}) / p_i^{τ−1}. On any given trading day τ, there exists an optimal ranking Y^τ = {y_1^τ > y_2^τ > ... > y_N^τ} of the stocks, such that a total order exists between the ranks y_i^τ > y_j^τ for any two stocks s_i, s_j ∈ S, provided r_i^τ > r_j^τ. Such an ordering of the stocks S on a trading day τ represents a ranking list, where stocks achieving higher ranking scores Y are expected to generate a higher investment revenue (profit) on day τ. Formally, given stock-relevant textual data (financial news or tweets) for a lookback period of length T days (i.e., days ∈ [τ − T, τ − 1]), we aim to learn a ranking function that outputs a score r̂^τ to rank each stock s on day τ in terms of its expected profit.
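As a concrete illustration of this formulation, the ground-truth ranking targets can be computed from closing prices as follows (a minimal NumPy sketch; function names are ours, not the paper's):

```python
import numpy as np

def return_ratios(prev_close, close):
    """One-day return ratio r_i = (p^t - p^{t-1}) / p^{t-1} per stock."""
    prev_close = np.asarray(prev_close, dtype=float)
    close = np.asarray(close, dtype=float)
    return (close - prev_close) / prev_close

def optimal_ranking(ratios):
    """Stock indices ordered by descending return ratio: the total
    order Y that the ranking function is trained to recover."""
    return np.argsort(-np.asarray(ratios, dtype=float))
```

For example, with previous closes [10, 20, 50] and current closes [12, 19, 55], the return ratios are [0.2, −0.05, 0.1], so the optimal ranking places stock 0 first, then stock 2, then stock 1.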
We now describe the components of FAST as shown in Figure 3, for hierarchically and attentively learning time-aware representations of news and tweets within ( §3.2) and across ( §3.3) the days in the lookback period. Lastly, we optimize FAST to rank stocks in terms of expected profitability ( §3.4) for daily stock trading.

Intra-Day Textual Information Encoder
To model the news or tweets over a day, FAST first encodes the texts via an embedding layer.
Text Embedding Layer: Owing to the success of transfer learning and the pre-training of language models in NLP, we use Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019) to encode the texts. BERT has been shown to capture more contextual text representations than methods like word2vec (Hu et al., 2017), GloVe (Xu and Cohen, 2018), and ELMo (Mohammadi et al., 2019). We encode each text t to a d-dimensional representation m = BERT(t) ∈ R^d, where d = 768, obtained by averaging the token-level outputs from the final layer of BERT.
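The pooling step can be sketched as follows (a NumPy sketch; the token vectors stand in for BERT's final-layer outputs, and the mask-aware averaging, which skips padding tokens, is our assumption rather than a detail stated in the paper):

```python
import numpy as np

def mean_pool(token_outputs, attention_mask):
    """Average final-layer token vectors into one text embedding m.

    token_outputs: (num_tokens, d) array (d = 768 for BERT-base);
    attention_mask: (num_tokens,) 0/1 array marking real tokens.
    Returns a single (d,) vector representing the whole text."""
    token_outputs = np.asarray(token_outputs, dtype=float)
    mask = np.asarray(attention_mask, dtype=float)[:, None]
    return (token_outputs * mask).sum(axis=0) / mask.sum()
```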

Learning Stock Representations for One Day:
For each stock s on any given day i, a variable number of texts (news or tweets) [t_1, t_2, ..., t_K] are posted at times [k_1, k_2, ..., k_K] that may discuss news or express sentiments towards the stock. We encode each of the K texts posted in a day using BERT as [m_1, m_2, ..., m_K]. Often, analyzing a single text alone is not informative enough to assess a stock (Barber and Odean, 2007). In contrast, analyzing the sequence of texts released over the day provides a unified context for a more informed understanding of the stock's performance (Hu et al., 2017).
RNNs, particularly LSTMs, are a natural way to capture such sequential context dependencies in tweets and news over time (Akhtar et al., 2017). However, a standard LSTM assumes inputs (texts) to be equally spaced in time. In contrast, the time interval between news releases or tweets can vary widely, from a few seconds to many hours, which can have a drastic impact on their influence on the market (Robertson et al., 2007; O'Hara, 2015). Consequently, news and opinions may change substantially over a day. Capturing the fine-grained granularities in the posting times of online text can lead to better and quicker reactions to market opportunities and increased profits. Since timing serves as a crucial factor in modeling the progression of market data (Tafti et al., 2016), we propose the use of a time-aware LSTM (t-LSTM) (Baytas et al., 2017), obtained by modifying a standard LSTM. We feed the time between texts to the t-LSTM cell to model the temporal irregularities in news and tweets. The t-LSTM applies a decay to the short-term memory in the LSTM according to the time elapsed between the release of two successively posted texts. Formally, the t-LSTM adopts a decaying function of elapsed time, transforming the time differences into appropriate weights for each input as:

C^S_{k−1} = tanh(W_d C_{k−1} + b_d),    Ĉ^S_{k−1} = C^S_{k−1} · g(∆k),    C*_{k−1} = (C_{k−1} − C^S_{k−1}) + Ĉ^S_{k−1},

where C_{k−1} is the previous cell memory, C^S_{k−1} its short-term component, C*_{k−1} the time-adjusted memory, {W_d, b_d} are the network parameters, ∆k is the elapsed time between two financial news items or tweets [t_k, t_{k−1}], and g(·) is a heuristic decaying function. We select g(∆k) = 1/∆k empirically, as suggested by Baytas et al. (2017). Intuitively, the greater the elapsed time between two news items or tweets, the lesser the impact they should have on each other due to the market's dynamic nature. The t-LSTM computes the current hidden state h_k for each input text t_k generated in a day as:

i_k = σ(W_i m + U_i h_{k−1} + b_i),    f_k = σ(W_f m + U_f h_{k−1} + b_f),    o_k = σ(W_o m + U_o h_{k−1} + b_o),
C̃ = tanh(W_c m + U_c h_{k−1} + b_c),    C_k = f_k ⊙ C*_{k−1} + i_k ⊙ C̃,    h_k = o_k ⊙ tanh(C_k),

where {W_c, U_c, b_c} are the network parameters of the candidate memory C̃, m is the embedding for text t_k, and {i_k, f_k, o_k} are the input, forget, and output gates.
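A minimal NumPy sketch of one t-LSTM step, following the memory-decomposition form of Baytas et al. (2017) (weight names and shapes are illustrative, not the paper's):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def t_lstm_step(m_k, h_prev, c_prev, dt, p):
    """One time-aware LSTM step on text embedding m_k.

    dt: time elapsed since the previous text; p: dict of weights
    (W_* act on the input, U_* on the hidden state, b_* are biases)."""
    # Decompose the previous memory into short- and long-term parts
    # and decay only the short-term part by g(dt) = 1/dt.
    c_short = np.tanh(p["W_d"] @ c_prev + p["b_d"])
    c_short_hat = c_short * (1.0 / dt)            # g(dt) = 1/dt
    c_adj = (c_prev - c_short) + c_short_hat      # time-adjusted memory
    # Standard LSTM gating applied over the adjusted memory.
    i = sigmoid(p["W_i"] @ m_k + p["U_i"] @ h_prev + p["b_i"])
    f = sigmoid(p["W_f"] @ m_k + p["U_f"] @ h_prev + p["b_f"])
    o = sigmoid(p["W_o"] @ m_k + p["U_o"] @ h_prev + p["b_o"])
    c_tilde = np.tanh(p["W_c"] @ m_k + p["U_c"] @ h_prev + p["b_c"])
    c_k = f * c_adj + i * c_tilde
    h_k = o * np.tanh(c_k)
    return h_k, c_k
```

Note that with dt = 1 the adjusted memory equals the previous memory, while large elapsed times shrink the short-term component, realizing the decay described above.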
We encode the texts for each stock s on day i using the t-LSTM as:

[h_1, h_2, ..., h_K] = t-LSTM([m_1, m_2, ..., m_K], [∆k_1, ∆k_2, ..., ∆k_K]),

where the hidden state h_j represents the current text j as well as the preceding texts, while focusing on text j in a time-aware fashion. All news and tweets released in a day may not be equally informative and have a diverse influence over a stock's trend (Barber and Odean, 2007). We use an intra-day attention mechanism (Luong et al., 2015) to emphasize texts likely to have a more substantial influence on the price. As shown in Figure 3, the intra-day attention mechanism learns to adaptively aggregate the variable number of hidden states of the t-LSTM (due to a variable number of texts per day) into an intra-day text information vector x_i as:

γ = softmax(h_m W),    x_i = Σ_j γ_j h_j,

where h_m ∈ R^{K×d_m} denotes the concatenation of all the hidden states from the t-LSTM, d_m is the dimension of each hidden state, γ_j represents the learned attention weights, and W is a learned parameter.
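The intra-day aggregation can be sketched as a standard soft-attention layer (NumPy; a simplified dot-product scoring form, with names of our choosing):

```python
import numpy as np

def intra_day_attention(H, w):
    """Aggregate a variable number of t-LSTM hidden states into one
    intra-day vector x_i.

    H: (K, d_m) hidden states for the K texts of one day;
    w: (d_m,) learned scoring vector.
    Returns (x_i, gamma): the weighted sum and the attention weights."""
    scores = H @ w                          # one relevance score per text
    gamma = np.exp(scores - scores.max())   # numerically stable softmax
    gamma = gamma / gamma.sum()
    return gamma @ H, gamma
```

A text with a much higher score dominates the aggregate, which is how influential headlines outweigh noisy comments in the day's representation.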

Inter-Day Temporal Encoder
We now combine the representations learned from the texts of each day across the multiple days in the lookback period. We combine these representations hierarchically, within and across days, using the sequence of intra-day text information vectors x. Since days are equally spaced in time, we first feed the vectors x to an LSTM layer as:

h_i = LSTM(x_i, h_{i−1}),

where h_i is the hidden state representation for day i. However, as per the adaptive market hypothesis (Lo, 2004), tweets and news published across different days have been shown to have a varying impact on stock prices (Calvet and Fisher, 2007), due to financial phenomena such as calendar anomalies (Jacobs and Levy, 1988), the week-day effect (Berument and Kiymaz, 2001), etc. To selectively weigh critical days, we employ an inter-day attention mechanism (Luong et al., 2015). The inter-day attention aggregates representations across all days into an overall representation z^τ using the learned attention weights β_i for day i as:

β = softmax(h_z W),    z^τ = Σ_i β_i h_i,

where W is a learned linear transform, h_z ∈ R^{T×d_z} represents the concatenated hidden states, and d_z is the size of the output space of the LSTM. The inter-day and intra-day attention together comprise a hierarchical temporal attention. FAST thus captures time-aware dependencies in large volumes of chaotic text to rank stocks, as described next.

Ranking and Network Optimization
To optimize FAST for stock ranking, we first concatenate the temporal representations z^τ obtained for each stock s to form stock-level features Z. We then feed Z to a feed-forward neural network followed by a Leaky-ReLU activation (Maas et al., 2013), which outputs the predicted return ratio r̂^τ for stock ranking. We optimize FAST through a joint point-wise regression and pairwise rank-aware loss L, minimizing the differences between the predicted and actual return ratios while maintaining the relative order of the top-ranked stocks:

L = Σ_i (r̂_i^τ − r_i^τ)² + φ Σ_i Σ_j max(0, −(r̂_i^τ − r̂_j^τ)(r_i^τ − r_j^τ)),

where r̂^τ and r^τ are the predicted and actual scores for ranking stocks on day τ, and φ is a loss-weighing parameter.

Pre-processing: We pre-process English tweets using NLTK (Twitter mode) for the treatment of URLs, identifiers (@), and hashtags (#). We adopt the BertTokenizer for tokenization. For the English tweets, we use the pre-trained BERT-base-cased; for the Chinese news, we adopt Chinese-BERT. We collect historical prices for all stocks from Yahoo Finance. We align trading days by dropping samples that lack tweets for a consecutive 5-day trading window, and further align the data across trading windows to ensure data is available for all trading days in the window for the same set of stocks. We split the US S&P 500 dataset temporally: 01/01/2014 to 31/07/2015 for training, 01/08/2015 to 30/09/2015 for validation, and 01/10/2015 to 01/01/2016 for testing. We split the China & HK dataset temporally: 01/01/2015 to 31/08/2015 for training, 01/09/2015 to 30/09/2015 for validation, and 01/10/2015 to 01/01/2016 for testing all models.
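The joint point-wise and pairwise rank-aware loss from §3.4 can be sketched as follows (NumPy; the hinge-style pairwise term follows the common formulation of Feng et al. (2019b), and the exact form used in the paper is our assumption):

```python
import numpy as np

def rank_loss(pred, true, phi=4.0):
    """Joint point-wise regression + pairwise rank-aware loss.

    Penalizes squared error on predicted return ratios, plus
    (weighted by phi) every stock pair whose predicted ordering
    contradicts the true ordering."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    mse = np.sum((pred - true) ** 2)
    dp = pred[:, None] - pred[None, :]        # pairwise predicted gaps
    dt = true[:, None] - true[None, :]        # pairwise true gaps
    pair = np.sum(np.maximum(0.0, -dp * dt))  # penalize inverted pairs
    return mse + phi * pair
```

A perfect prediction incurs zero loss, while swapping the order of two stocks adds both a regression and a pairwise penalty.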

FAST Training Setup
We conduct all experiments on a Tesla P100 GPU. We use grid search to find optimal hyperparameters based on the validation Sharpe ratio (§4.3) for all models. We explore the lookback window length T ∈ [2, 10] (best T = 5), the loss-weighing factor φ ∈ [1, 10] (best φ = 4), and the hidden state dimension d ∈ {32, 64, 128} for both the t-LSTM and the LSTM (best d = 64) on both datasets. We use Xavier initialization (Glorot and Bengio, 2010) to initialize all weights and an exponential learning rate scheduler (Li and Arora, 2019) with a decay rate of 0.05 and an initial learning rate of 5e−4. We train FAST end-to-end using the Adam optimizer (Kingma and Ba, 2014) for 500 epochs, requiring 8 hours of compute time.

Evaluation Metrics and Trading Strategy
Returns: To assess the profit generation ability of all methods (§4.4), we compute the Sharpe ratio (SR), a measure of the return of a portfolio compared to its risk (Sharpe, 1994), and the cumulative investment return ratio (IRR). Following Feng et al. (2019b), we adopt a daily buy-hold-sell trading strategy: when the market closes on trading day τ − 1, the trader uses the method to get a ranked list of the predicted return ratio for each stock, buys the top η stocks, and then sells the bought stocks at the market close of trading day τ. The IRR on any day τ is defined as:

IRR^τ = Σ_{i ∈ S^{τ−1}} (p_i^τ − p_i^{τ−1}) / p_i^{τ−1},

where S^{τ−1} denotes the set of stocks in the portfolio on day τ − 1, and p_i^τ, p_i^{τ−1} are the closing prices of stock i on days τ and τ − 1, respectively. We calculate SR by computing the earned return R_a in excess of the risk-free return R_f, defined as:

SR = E[R_a − R_f] / std(R_a − R_f).

Ranking: We also evaluate the stock ranking ability of FAST using two widely used ranking metrics: Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG@η). MRR is the reciprocal rank of the first relevant stock, while NDCG@η sums the true scores of the top η stocks, ranked in the order induced by the predicted scores, after applying a logarithmic discount. For both returns and NDCG, we report results for the top η = 5 stocks, and present performance variations with different values of η (§5.4).
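The daily buy-hold-sell accounting can be sketched as follows (NumPy; a simplified version in which the portfolio holds one unit of each picked stock, and names are ours):

```python
import numpy as np

def daily_irr(prev_close, close, picks):
    """IRR for one day: buy the picked stocks at day tau-1's close,
    sell at day tau's close, and sum the per-stock return ratios."""
    prev_close = np.asarray(prev_close, dtype=float)
    close = np.asarray(close, dtype=float)
    return np.sum((close[picks] - prev_close[picks]) / prev_close[picks])

def sharpe_ratio(daily_returns, risk_free=0.0):
    """Mean excess return over its standard deviation (Sharpe, 1994)."""
    r = np.asarray(daily_returns, dtype=float) - risk_free
    return r.mean() / r.std()
```

Here `picks` would be the indices of the top-η stocks by predicted return ratio r̂^τ.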

Baselines
We compare FAST with baselines spanning different formulations: regression, classification, reinforcement learning, and ranking. We follow the same preprocessing protocols as proposed in the original works and adopt their implementations where available.
Regression (REG) These methods regress return ratios from past data and trade the top stocks.
• AZFinText: Proper noun-based text representations fed to Support Vector Regression for forecasting return ratios (Schumaker and Chen, 2009).

Classification (CLF)
The following methods classify movements as [up, down, neutral] and trade the stocks where prices are expected to rise.
• TSLDA: Topic Sentiment Latent Dirichlet Allocation, a generative model jointly exploiting topics and sentiments in text (Nguyen and Shirai, 2015).
• CH-RNN: An RNN with cross-modal attention on price trends and texts across days (Wu et al., 2018).
• HAN: A Hierarchical Attention Network using GRU encoders with temporal attention on texts and days in the lookback period (Hu et al., 2017).

Reinforcement Learning (RL)
The following approaches optimize quantitative trading through reinforcement learning.
• iRDPG: An imitative RDPG algorithm exploiting temporal stock price features, while optimizing the Sharpe Ratio as the reward (Liu et al., 2020).
• S-Reward: Inverse-RL method to model relations between sentiments and returns (Yang et al., 2018).

Ranking (RAN)
The following methods rank stocks to select the most profitable trading candidates.
• RankNet: A DNN that utilizes sentiment-based shock and trend scores to optimize a probabilistic ranking function (Song et al., 2017).

Profitability Comparison with Baselines
As the ultimate goal of stock prediction is profit, we compare the profitability of FAST against the baseline methods in Table 2. FAST generates significantly (p < 0.001) higher cumulative and risk-adjusted returns than all methods. Overall, we observe that RL and ranking methods are more profitable, as they are directly optimized towards profit generation through stock selection. This observation validates the premise of formulating stock prediction as a learning-to-rank problem, compared to the conventionally adopted regression and classification tasks. Further, we find that methods which study stock-affecting information from news and tweets generate profits higher than or comparable to methods that only use historical prices. These improvements revalidate the effectiveness of leveraging textual sources to capture stock-affecting signals like market surprises, announcements (mergers, acquisitions), and public sentiment. We attribute the higher profitability of FAST to two major reasons. First, through its hierarchical temporal attention mechanism, FAST captures the diverse influence of different texts and days over stock movements. Second, as FAST is time-aware, it models the influence of fine-grained temporal irregularities in the release of financial news and tweets over stock movements.
The test periods of the US S&P 500 and China & Hong Kong datasets span diverse market conditions. The China & Hong Kong test period covers the 2015-16 China Stock Market Turbulence (Liu et al., 2016), a bearish market scenario, while that of the US S&P 500 covers standard market conditions. We find that FAST is profitable and outperforms existing baselines over such diverse market scenarios. Next, we further probe the performance of FAST through a series of ablative experiments.

Table 1 shows how FAST's stock ranking ability and profitability benefit from each of its components. On feeding all data into a single LSTM, we observe poor performance. As we adopt the intra-day (LSTM) and temporal (LSTM) encoders, we observe higher profits, suggesting a benefit in modeling the sequential context of texts hierarchically within and across days. Further, on adding intra-day attention, we note improvements in profit, as FAST can better distinguish noise-inducing text from relevant market signals, minimizing false evaluations and overreactions (De Long et al., 1990). The attention mechanism can likely diminish the impact of such noise (rumours, vague comments). Intuitively, complementing the intra-day with the inter-day attention leads to further improvements, as FAST can better capture the diverse influence of texts, hierarchically within and across days. Next, we note the biggest improvements on adding the t-LSTM in place of a standard LSTM as the intra-day encoder, suggesting that FAST benefits from factoring in the fine-grained time irregularities in texts to model the flow of stock-affecting information (Kalev et al., 2004). Through this time-aware mechanism, FAST can potentially react better to online news and tweets by discounting stale information more accurately, factoring in fine-grained elapsed time differences (seconds).

Figure 4: Profitability and ranking performance against granularities of the time difference ∆k adopted by the t-LSTM.
We further quantify the impact of time-aware modeling on the improvements in ranking and profitability, next.

Advantages of Time-Aware Modeling
The influence of older information over the market decreases rapidly as newer data is released (Russell, 2010). As we coarsen the granularity of the elapsed time difference between two texts fed to the t-LSTM, from minutes, to hours, to a day, we observe drops in FAST's performance, as shown in Figure 4. At the coarsest granularity of a day, the t-LSTM essentially degenerates to a standard LSTM and attains the lowest performance. These drops show that factoring in time at finer granularities benefits FAST in modeling the temporal dynamics of the market's response to stock-affecting signals. Our findings align with financial research showing that market reactions to news complete within minutes (Smales, 2013) and that the impact of news and tweets reaches an equilibrium over time.

Parameter Analysis: Probing Sensitivity
Lookback window length T: We study how FAST's stock ranking performance (NDCG@5) varies with the lookback length T ∈ [2, 10] days in Figure 5. Shorter lookbacks yield lower ranking performance, indicating their inability to capture stock-affecting market information, likely because public information requires time to be absorbed into price movements (Luss and D'Aspremont, 2015). As we increase T, larger lookbacks allow the inclusion of stale information from older days with relatively lower influence on prices (Bernhardt and Miao, 2004), again deteriorating the ranking performance. We observe the best stock ranking performance for mid-sized (approx. 5-day) lookback periods.
Selected top stocks η: We analyze the variation of FAST's profitability (SR) with the number of top stocks η in Figure 5. We find that FAST performs well across varying η, showing its suitability for strategies with different risk appetites.

FAST Qualitative Analysis
We now conduct an extended analysis, shown in Figure 6, to elucidate FAST's explainable predictions and practical applicability to real-world quantitative trading. Here, we study the China & Hong Kong market during 5th-9th December 2015. We visualize token-level and hierarchical temporal attention to analyze how FAST ranks stocks on 10th December 2015, outperforming the state-of-the-art baseline methods RankNet and SN-HFA.
Analyzing Hierarchical Attention: Within days, the intra-day attention filters less informative news and emphasizes more influential items. For instance, we observe that the second news item about BOE Technology on the 7th categorizes the stock as "overweight," a rating through which equity analysts forecast better future performance (Kumar, 2009). Such news would likely induce positive public sentiment and drive more investment to the stock, as opposed to the less informative news about "margin trading data." The intra-day attention accurately captures the diverse influence of such news headlines. Further, we observe that the news released on the 9th comprises relatively more crucial information than other days, and the inter-day attention accurately emphasizes its importance. These observations reiterate the diverse influence of different news and days over future stock returns, accurately captured by the hierarchical temporal attention mechanism.
Probing Time-aware Modeling: The news released for Shanghai Electric (SE) on the morning of the 9th reports a positive event, likely indicating future profits. Later that afternoon, two other news items report negative impacts on SE due to a loss in nuclear power stocks, indicating a downtrend for the upcoming days. FAST disregards the older news to emphasize the newer items, forecasts the upcoming loss, and allots a lower rank to SE. In contrast, ranking methods such as RankNet assign a higher rank to SE, potentially due to their inability to model time-aware dependencies. Further, classification methods like SN-HFA do not correctly predict the stock return trends, as they neither capture the fine-grained temporal irregularities in texts nor are optimized towards profit. Consequently, FAST outperforms SN-HFA by a margin of 47.5% and RankNet by 28.4% in profits on 10th December 2015.

Conclusion
We propose FAST, a neural approach to rank profitable stocks using stock-relevant textual data across online financial news and tweets. To model market information, FAST hierarchically learns temporally relevant signals from texts and shows the positive effects of factoring in the fine-grained temporal irregularities of textual data. Through quantitative and qualitative analyses on English tweets and Chinese financial news spanning four stock markets, we highlight the real-world applicability of FAST. In trading simulations on the S&P 500 and China A-shares indexes, FAST outperforms state-of-the-art methods across four different formulations by over 8% in terms of profit and Sharpe ratio.