News Signals: An NLP Library for Text and Time Series

We present an open-source Python library for building and using datasets where inputs are clusters of textual data, and outputs are sequences of real values representing one or more time series signals. The news-signals library supports diverse data science and NLP problem settings related to the prediction of time series behaviour using textual data feeds. For example, in the news domain, inputs are document clusters corresponding to daily news articles about a particular entity, and targets are explicitly associated real-valued time series: the volume of news about a particular person or company, or the number of pageviews of specific Wikimedia pages. Despite many industry and research use cases for this class of problem settings, to the best of our knowledge, News Signals is the only open-source library designed specifically to facilitate data science and research settings with natural language inputs and time series targets. In addition to the core codebase for building and interacting with datasets, we also conduct a suite of experiments using several popular machine learning libraries, establishing baselines for time series anomaly prediction using textual inputs.


Introduction
The natural ordering of many types of data along a time dimension is a consequence of the physics of our universe. Real-world applications of machine learning often involve data with implicit or explicit temporal ordering. Examples include weather forecasting, market prediction, self-driving cars, and language modeling.
A large body of work on time series forecasting studies models which consume and predict real-valued target signals that are explicitly ordered in time; however, aside from some existing work mainly related to market signal prediction using social media (Chen et al., 2021, 2022; Arno et al., 2022; Li et al., 2014; Bing et al., 2014; Kim et al., 2016; Wang and Luo, 2021), inter alia, the NLP research community has generally not focused on tasks with textual inputs and time series outputs. This is confirmed by the lack of any popular NLP tasks related to time series in result-tracking projects such as nlp-progress and papers-with-code.

* equal contribution

Figure 1: News Signals Datasets: clusters of documents, bucketed by time period, are associated with time series signals. ML models can be trained to predict time series signals using the textual data.
We believe there is potential for novel, impactful research into tasks beyond market signal forecasting, in which textual inputs and real-valued output signals are explicitly organized along a time dimension with fixed-length "ticks". Two reasons for the lack of attention to such tasks to date may be:
1. researchers do not have access to canonical NLP datasets for time series forecasting, and
2. data scientists are missing a high-level software library for NLP datasets with time series.
Examples of tasks where natural language input can be used to predict a time series signal include:
• weather or pandemic forecasting using social media posts from a recent time period,
• market signal prediction using newsfeeds or bespoke textual data feeds,
• media monitoring for consumer behavior prediction and forecasting,
• forecasting the impact of a news event on the pageviews of a particular website,
and many others. We refer to this general task setting as text2signal (T2S).

news-signals
This work introduces news-signals, a high-level MIT-licensed software package for building and interacting with datasets where inputs are clusters of texts, and outputs are time series signals (Figure 1). Despite the package's news-focused origins, it is built to be a general-purpose library for interacting with time-ordered clusters of text and associated time series signals.
Preparing and utilizing datasets for T2S tasks requires purpose-built software for retrieving and sorting data along the time dimension. In many cases, data will be retrieved from one or more APIs, or web-scraped, further complicating dataset generation pipelines. news-signals exposes an intuitive interface for generating datasets that we believe will be straightforward for any data scientist or developer familiar with the Python data science software stack (see Section 2).
news-signals includes tooling for:
• calling 3rd-party APIs to populate signals with text and time series data,
• visualizing signals and associated textual data,
• extending signals with new time series, feeds, and transformations,
• aggregations on textual clusters, such as abstractive and extractive summarization.
news-signals provides two primary interfaces: Signal and SignalsDataset. A SignalsDataset represents a collection of related signals. A Signal consists of one or more textual feeds, each connected to one or more time series. Time series have strictly one real value per tick, while feeds are time-indexed buckets of textual data. For example, a news signal might contain a feed of all articles from a financial source that mention a particular company, linked to multiple time series representing relevant market signals for that company.
news-signals datasets are designed to be easy to extend with new data sources, entities, and time series signals. In our initial release of the library, we work with three collections of entities: US politicians, NASDAQ-100 companies, and S&P 500 companies (see Section 5).
The rest of the paper is organized as follows: Section 2 gives an overview of the library design, and Section 3 describes the Signal and SignalsDataset APIs, the two main interfaces to time-indexed NLP datasets. Section 4 discusses how datasets can be created. Section 5 describes our example datasets, models, and end-to-end experiments, which are open-source and can be used as templates for new research projects. Section 6 discusses applications, Section 7 reviews related work, and Section 8 gives conclusions and directions for the future.

Time-Indexed NLP Datasets
Traditional NLP and ML datasets consist of i.i.d. (X, Y) pairs. These pairs can be assigned indices, and operated on by standard pre-processing procedures, such as randomly shuffling and splitting into train, dev, and test subsets. However, for time series forecasting and related tasks, inputs are ordered along a time axis, and the distribution of later time steps is typically heavily dependent upon the distribution of earlier time steps; therefore, training, dev, and test subsets are usually partitioned and split chronologically to reduce the potential for leakage, introducing additional complexity into data preparation.
Within the Python data science ecosystem, libraries such as Numpy (Harris et al., 2020), Pandas (Wes McKinney, 2010), and Pytorch (Paszke et al., 2019) have standardized a syntax for indexing and slicing multi-dimensional matrices and dataframes along axes. When a Pandas dataframe is indexed along a dimension with time-interval semantics, slicing between dates or timestamps is a very useful feature. For example, a user may want to work with the news articles and corresponding time series signals that occurred between particular START and END dates. Pandas in particular includes rich tooling for indexing and slicing datasets along time-indexed axes, and news-signals delegates slice commands and indexing to Pandas, exposing an interface for interacting with datasets using datetime indices.
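As a toy illustration of the datetime slicing that news-signals delegates to Pandas (the dataframe and column name below are invented for the example; they are not part of the library's API):

```python
import pandas as pd

# A stand-in for a signal's underlying dataframe: one row per daily "tick".
idx = pd.date_range("2023-01-01", "2023-01-10", freq="D")
df = pd.DataFrame({"article_count": range(10)}, index=idx)

# Slicing between dates uses ordinary Pandas datetime label slicing;
# both endpoints are inclusive for DatetimeIndex label slices.
window = df["2023-01-03":"2023-01-05"]
print(len(window))  # 3 rows: Jan 3, 4, and 5
```

Because the library exposes Pandas objects directly, any such slicing idiom applies unchanged to signal dataframes.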

news-signals Technical Requirements
The key technical desiderata we took into consideration when building news-signals are listed below:
• the complexity of data retrieval should be minimized: calling APIs, retrying failed requests, and parsing API output should be invisible to users.
• large datasets containing hundreds or thousands of signals, each lasting for thousands of "ticks", should be straightforward to configure and build.
• standard data science libraries such as Pandas should be used as much as possible to reduce maintenance burden over time.
• transformations on time series such as anomaly detection or trend/seasonality removal should be straightforward to implement.
• the complexity of compressing, saving, and loading datasets locally and remotely should be invisible to users.
• new types of signals should be easy to implement.
• Signals should be easy to use with standard machine learning libraries.

The Signal and SignalsDataset APIs
Signals consist of at least one time series coupled with zero or more textual data feeds. Figure 2 shows an example of creating and populating a Signal. Because most functions on the signal class return the signal itself, users can employ a convenient chaining syntax when performing multiple operations on a signal. The library retrieves and stores the time series and news stories for the signal, and exposes a Pandas-like API to the underlying dataframes. We can add arbitrary textual data feeds to signals; in Figure 2, signal.sample_stories() samples stories for every day of the time series (see the library documentation on GitHub for more detail on how this works).

import datetime
from news_signals import signals

# wikidata QID for Twitter
qid = 'Q918'
signal = signals.AylienSignal(
    name='Twitter-Signal',
    params={"entity_ids": [qid]}
)
start = '2023-01-01'
end = '2023-06-01'
# retrieve a time series for the count of
# news articles per-day for this signal
signal = signal(start, end).anomaly_signal()
# sample stories for every day in the signal
signal = signal.sample_stories()
# let's have a look at the biggest anomaly
top_day = signal.anomalies.idxmax()
# what was going on that day?
stories = signal.feeds_d
# Twitter experiencing outages nationwide
# Twitter experiencing international outages ...
# It's Not Just You, Twitter Is Acting Weird
# Twitter briefly goes down
# Twitter outage: what happened, ...
# ...

Once feeds and time series have been initialized, users can perform exploratory data analysis (EDA) in many ways, for example by examining and summarizing the news stories for an anomalous window of the signal's time series, or by plotting the signal.
Signals can also be easily mapped into a single dataframe representation via the .df property. A signal's dataframe representation contains the textual and time series data associated with the signal, indexed along a DatetimeIndex, but it does not contain metadata such as how the signal is populated from one or more APIs, or transformation semantics such as how anomalies are computed.
Signals automatically differentiate between textual data and time series data types: for example, when signal.plot() is called, a signal's associated time series are automatically plotted in a multi-line plot.

API integrations
Most signals require retrieving data from one or more third-party APIs or on-disk datasets. In the current version of news-signals, we provide a deep integration with the Aylien NewsAPI, and additionally implement an interface to the Wikimedia pageviews API for building pageview time series for Wikidata items.

The SignalsDataset API
Individual signals can be grouped into datasets. The SignalsDataset is a useful abstraction for working with groups of related signals: concretely, these might be signals for all politicians from a particular country, or for all companies connected to a specific market subset, such as the NASDAQ-100 or the S&P 500. Another dataset type could contain signals encapsulating content and time series related to different social media forums, such as subreddits (Wang and Luo, 2021). The number of signals in a dataset can easily number in the hundreds or thousands, so we designed a simple yaml-based configuration DSL to allow easy construction of large datasets, which is documented in our GitHub repository.
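To give a flavor of the configuration-driven workflow, a dataset config might look roughly like the sketch below. The field names and values here are hypothetical, chosen only to illustrate the idea; the real schema is documented in the GitHub repository:

```yaml
# Hypothetical dataset config: all keys are illustrative only.
dataset_name: nasdaq-100-example
start: "2020-01-01"
end: "2023-01-01"
signal_type: AylienSignal
entities:          # Wikidata QIDs for the entity set
  - Q95
  - Q312
stories_per_day: 20
```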
Aylien NewsAPI and Wikimedia APIs Because our production use cases for news-signals are focused upon analyzing news data from the Aylien NewsAPI, the flagship Signal type in news-signals is currently the AylienSignal. This signal type abstracts away API call semantics, allowing users to populate a signal by simply calling signal(start_date, end_date). Of the data sources currently implemented in news-signals, Wikidata is completely free, but the Aylien NewsAPI requires a license key. However, we note that the Aylien NewsAPI currently offers a two-week free trial allowing a significant number of free API calls, and we hope to implement Signal types for fully public data sources beyond Wikidata in the near future.

Saving and loading Datasets
Local and remote serialization and persistence are essential features for dataset-focused libraries, and both Signal and SignalsDataset support saving and loading. We have also implemented persistence to Google Drive and Google Cloud Storage, which only requires a remote path to be provided. Datasets are decompressed and cached locally so that a dataset is not re-downloaded if it is already available locally.
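The caching behavior can be illustrated with a minimal sketch. This is the general pattern, not the library's actual implementation; `fetch_dataset` is a hypothetical helper, and a local file copy stands in for a real remote download:

```python
import os
import shutil

def fetch_dataset(remote_copy: str, cache_dir: str) -> str:
    """Return a local path to the dataset, 'downloading' only on a cache miss.

    shutil.copy stands in for a real download from e.g. Google Cloud Storage.
    """
    os.makedirs(cache_dir, exist_ok=True)
    local = os.path.join(cache_dir, os.path.basename(remote_copy))
    if not os.path.exists(local):
        shutil.copy(remote_copy, local)
    return local
```

The second call with the same arguments hits the cache and performs no transfer, which is the property the library relies on for large datasets.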
Library Documentation Section 3 has given only a small sample of the news-signals library's capabilities, and we refer interested readers to the library documentation on GitHub, which also includes end-to-end example notebooks and video walkthroughs.

Building Signals Datasets
As discussed in section 3.2, news-signals provides an API for the creation of large-scale datasets representing collections of related signals.
Bootstrapping Datasets using Wikidata The Aylien NewsAPI links named entities in text to their Wikidata IDs (Vrandečić and Krötzsch, 2014). news-signals users can make use of the Wikidata Query Service to easily build new datasets starting from SPARQL queries that return sets of matching entities (Prud'hommeaux et al., 2013). To exemplify use of the library, we build three example datasets in this manner: NASDAQ-100, S&P 500, and US Politicians. Each dataset is bootstrapped from the set of Wikidata entities returned by one or more SPARQL queries, which are available in our repository. This is a powerful way to generate arbitrary datasets for collections of related entities: for example, datasets for all politicians from a particular country or all American football players could be generated in this fashion. Note that in some cases Wikidata does not contain all entities in a particular set; for example, the NASDAQ-100 dataset contains fewer than 100 entities. Dataset statistics are summarized in Table 1. We then use the Aylien NewsAPI to sample up to 20 stories about each entity for each day of the time period Jan 2020-Jan 2023.
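The bootstrap step can be sketched as follows. The SPARQL query body and the `extract_qids` helper are illustrative (the real queries live in our repository, and `wd:Q___` is a placeholder for the Wikidata item of the target entity set); the JSON shape, however, is the standard Wikidata Query Service response format:

```python
# Illustrative SPARQL query: returns entities belonging to some set.
SPARQL = """
SELECT ?entity WHERE {
  ?entity wdt:P361 wd:Q___ .  # P361 = "part of"; wd:Q___ is a placeholder
}
"""

def extract_qids(response_json, var="entity"):
    """Pull bare QIDs (e.g. 'Q95') out of a Wikidata Query Service JSON response."""
    qids = []
    for binding in response_json["results"]["bindings"]:
        uri = binding[var]["value"]  # e.g. "http://www.wikidata.org/entity/Q95"
        qids.append(uri.rsplit("/", 1)[-1])
    return qids
```

The resulting QID list is then used to configure one signal per entity.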

Multi-document Summarization (MDS)
We provide a multi-document summarization model in news-signals for turning clusters of news articles associated with a particular timestamp into an easily readable summary. In particular, we use a hybrid extractive-abstractive approach that first uses a centroid-based sentence extraction method (Gholipour Ghalandari, 2017) to select 5 key sentences from the whole collection of provided news articles. We then generate an abstractive summary from these sentences using a fine-tuned BART-large model (Lewis et al., 2020). The model was fine-tuned on such extractive summaries on the WCEP dataset (Gholipour Ghalandari et al., 2020), which contains compact event summaries with a neutral style.
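A simplified stand-in for the extractive step might look like the sketch below. The actual method follows Gholipour Ghalandari (2017); this version just selects the sentences closest to the tf-idf centroid of the cluster, which conveys the core idea:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def centroid_extract(sentences, k=5):
    """Select the k sentences most similar to the tf-idf centroid of the cluster."""
    vec = TfidfVectorizer()
    X = vec.fit_transform(sentences).toarray()
    centroid = X.mean(axis=0)
    # Cosine similarity of each sentence vector to the centroid.
    sims = X @ centroid / (
        np.linalg.norm(X, axis=1) * np.linalg.norm(centroid) + 1e-12
    )
    top = np.argsort(sims)[::-1][:k]
    return [sentences[i] for i in sorted(top)]  # preserve original order
```

The selected sentences are then handed to the abstractive BART-large model as its input.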
Sampling News Data for Entities Importantly, we do not provide all news articles about each entity; rather, we provide only a sample of the news content about the entity for each day. This means that successful models should predict the time series signal based upon the content of the articles, rather than global numerical features (although we may also consider models such as vector autoregression that use signals derived from textual content as well as real-valued signals).
Connecting Entities with Time Series Signals In our example datasets, we focus upon entities that exist in the Wikidata knowledge graph. Different time series signal sources can be automatically linked to these entities. The Wikimedia API itself exposes several interesting time series signals, such as the number of pageviews and the number of edits for each page. We hypothesize that these signals are affected by events occurring in the real world: when an impactful event connected with an entity occurs, there is likely to be an observable change in signal behavior.

Dataset Release
To avoid potential licensing issues with releasing the news data content of the example datasets, at this stage we plan to only release the datasets containing article titles and metadata instead of full article texts. We also release a version of the datasets with daily abstractive summaries of the content, which do not reveal any source-specific content or data. Both versions will be available by email request to the authors.
Extending NewsSignals Because our datasets are grounded in the Wikidata knowledge graph, they are easy to extend with new inputs, entities, and signals. Obvious extensions to our work include textual data from platforms such as Twitter and Reddit, and market signals such as stock price or other technical indicators for entities that are connected with publicly traded companies. Datasets should also be easy to extend with additional entities, and we provide a set of tools for extending NewsSignals in the accompanying code repository.

Docker Container and Example K8s Configuration
Because news-signals is designed to be used in both research and production settings, we have also provided a Dockerfile and an example Kubernetes (K8s) job configuration that can be deployed to Google Cloud Platform with minimal setup required.Together, these assets can be used to build signals datasets at a regular cadence, for example once a day or once a week.

Example Models and Experiments
This section presents a suite of example models and experiments for users to quickly adapt to their own task settings, and to verify the utility of news-signals by establishing baselines for a straightforward anomaly prediction task.

Binary Anomaly Prediction Task
In this work, we focus on a simple binary anomaly prediction task, which we treat as text classification.
The goal is to predict whether a time series signal about a particular entity is anomalous during some window in the past, present, or future, based on textual information in news feeds about the entity.
The input for an individual prediction is a set of news articles together with an aspect (e.g. an entity); the target is a binary anomaly indicator. For simplicity, we predict the target value of a particular day from the textual input of the same day. We transform time series signals into binary anomaly targets with the procedure described below.

Target Signals
We experiment with two different time series target signals: anomalies of the NewsAPI volume counts and of the Wikimedia pageviews. Each target time series consists of day-level binary values for the time range of our datasets. We use a simple anomaly detector based on the Z-score to convert the raw time series signals into binary values: we treat a value x_t in a time series as an anomaly if

(x_t - µ) / σ > t,

where µ is the mean and σ the standard deviation of the time series. We set the anomaly threshold t (measured in standard deviations) to 3, which results in a proportion of 1-3% positive examples in our datasets.
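The detector described above fits in a few lines. This is a minimal sketch; the library's own implementation may differ in details such as one- vs. two-sided thresholds:

```python
import numpy as np

def zscore_anomalies(x, t=3.0):
    """Binary labels: 1 where a value exceeds the mean by more than t std devs."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    return ((x - mu) / sigma > t).astype(int)
```

With t=3, a flat series with a single large spike yields exactly one positive label.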

Dataset Splits
Each of the three datasets is split chronologically into training (80%), validation (10%) and test (10%) sections. A trained model thus sees all entities in the training data, and is tested on its ability to apply this knowledge to future data about those entities.
The split can also be done across entities to test whether models can generalize to new entities.In this work, we focus on the simpler setting where the entities are known.Note that this does not apply to the zero-shot baselines using LLMs discussed below.
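The chronological 80/10/10 split can be sketched as follows, assuming a time-indexed Pandas dataframe; `chrono_split` is an illustrative helper, not a library function:

```python
import pandas as pd

def chrono_split(df: pd.DataFrame, train=0.8, val=0.1):
    """Split a time-indexed dataframe into chronological train/val/test parts."""
    df = df.sort_index()
    n = len(df)
    i, j = int(n * train), int(n * (train + val))
    return df.iloc[:i], df.iloc[i:j], df.iloc[j:]
```

Because the boundaries are positions in sorted time order, every validation and test timestamp lies strictly after the training period, which is what prevents leakage.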

Balanced Sampling for Training
We preserve the validation and test splits as they are, i.e. as continuous time periods with only 1-3% positive labels. Since training with this label imbalance yields poor results, we create modified training datasets from the time period of the training split: we randomly sample 10,000 positive and 10,000 negative examples for each dataset.
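The balanced training set construction might be sketched like this; `balanced_sample` is a hypothetical helper (the paper samples 10,000 per class, scaled down here for illustration):

```python
import pandas as pd

def balanced_sample(train_df: pd.DataFrame, label_col: str,
                    n_per_class: int, seed: int = 0) -> pd.DataFrame:
    """Randomly draw n_per_class positive and negative examples each."""
    parts = []
    for label in (0, 1):
        pool = train_df[train_df[label_col] == label]
        # Sample with replacement only if the pool is too small.
        parts.append(pool.sample(n=n_per_class,
                                 replace=len(pool) < n_per_class,
                                 random_state=seed))
    return pd.concat(parts).sample(frac=1, random_state=seed)  # shuffle
```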

Compressing Textual Input
Since we are dealing with a large amount of text for each individual prediction, i.e. a set of 20 news articles, we need to compress these articles into a shorter text that fits the input size of typical current deep learning models. In our experiments, we use the concatenation of all headlines of a day as the textual input. We leave a comparison to alternatives, e.g. multi-document summaries or representative articles, to future work.
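Concretely, the compression step reduces to something like the following; the `title` field name, separator, and truncation length are assumptions for the sketch:

```python
def compress_headlines(articles, sep=" | ", max_chars=2000):
    """Concatenate one day's headlines into a single classifier input string."""
    text = sep.join(a["title"] for a in articles)
    return text[:max_chars]
```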

Models for Anomaly Classification
We include several text classification baselines that predict the target based on one day of compressed textual content.
Fine-tuned Transformer Classifier: We fine-tune the pre-trained RoBERTa-base model (Liu et al., 2019) with a randomly initialized binary classification head. We fine-tune the model for 1 epoch on the label-balanced training examples with a batch size of 8, a learning rate of 2e-5, and a weight decay of 0.01, using the Adam optimizer.

Random Forest with Sparse Lexical Features:
We train random forest models on binary lexical features, to explore how well the target signals are represented in surface-level text. We use sklearn to extract sparse binary token-indicator features, with a vocabulary of 10,000 tokens, excluding stop words. We train the models with 100 trees and a maximum depth of 20; we determined these values on the validation datasets.
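Under the hyperparameters reported above, the model can be assembled with scikit-learn roughly as follows; this is a sketch, and the exact feature-extraction settings in our experiments may differ slightly:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Sparse binary token-indicator features feeding a random forest,
# using the vocabulary size, tree count, and depth reported in the text.
model = make_pipeline(
    CountVectorizer(binary=True, max_features=10_000, stop_words="english"),
    RandomForestClassifier(n_estimators=100, max_depth=20, random_state=0),
)
```

The pipeline is fit directly on raw day-level text strings paired with binary anomaly labels.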
Zero-Shot Classification with Llama-2 (13B): We use Meta's Llama-2-13b-chat model for zero-shot classification. We provide the 20 headlines of a day as input, along with a prompt that describes the target signal. The prompt used in the presented experiments is shown in Appendix B.

Evaluation and Results
We evaluate the binary anomaly classification task using precision, recall and F1-score. We put the results into perspective by comparing them to two random baselines: random-uniform, which classifies each input as an anomaly with 50% probability, and random-target, which classifies each input as an anomaly with probability equal to the proportion of positive examples in the test set. Table 2 shows the results for anomaly classification with news volume and Wikimedia pageviews as target signals. The trained models achieve above-random F1-scores on most of the dataset-target combinations, and obtain better results than the zero-shot baseline. We discuss the results in more detail in Appendix A. Figure 3 shows an example of predicted anomalies, compared to the ground-truth anomalies defined by the anomaly detection method. The predicted anomalies in this example consistently correspond to a spike of Wikipedia pageviews on the day, or shortly after the day, on which the input news stories were published.
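The evaluation and the random-target baseline can be sketched with illustrative helpers built on scikit-learn:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def evaluate(y_true, y_pred):
    """Binary precision/recall/F1 for anomaly classification."""
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", zero_division=0)
    return {"precision": p, "recall": r, "f1": f1}

def random_target_baseline(y_true, seed=0):
    """Predict positives at the test set's own positive rate."""
    rng = np.random.default_rng(seed)
    rate = float(np.mean(y_true))
    return (rng.random(len(y_true)) < rate).astype(int)
```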

Extending to forecasting tasks
This experimental setup can easily be converted into a forecasting task by pairing the text content of a particular day with the target signal shifted by some offset into the future. By sliding the target window earlier than the input, we can also study how well today's news "predicts" signals that have already happened. This may be more relevant for signals that imply significant information asymmetry, such as stock price, as opposed to signals that are public by definition, such as Wikimedia pageviews. Rather than binary anomaly targets, we can also train models to directly predict the real-valued signal, or quantized representations of the signal.
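Concretely, the offset pairing is a one-liner on a Pandas series; `shifted_targets` is an illustrative helper, where a positive horizon forecasts the future and a negative one looks backward:

```python
import pandas as pd

def shifted_targets(signal: pd.Series, horizon: int) -> pd.Series:
    """Align the text of day t with the signal value at day t + horizon."""
    return signal.shift(-horizon)
```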
Intended Applications of NewsSignals
This section discusses potential applications for news-signals and directions for future work.
Time Series Forecasting using Textual Data As discussed, time series signal forecasting is an important task which is relatively unexplored in the context of models for natural language processing (NLP).
Financial Data Analysis We believe that this dataset and task setting should be straightforward to adapt to financial time series analysis. Financial time series such as stock price and trading volume are impacted by real-world events: the behavior of market signals reflects sentiment about particular entities, and is influenced by events happening in the world. However, market signals may contain opaque and confounding factors that make accurate prediction more challenging. Although this work deliberately does not consider market signals, it is very straightforward to add market time series such as stock prices or trading volume to signals.
NLP for Healthcare The text2signal task setting is well-suited to the emerging field of BioNLP, or NLP for Healthcare: for example, predicting the number of hospital visits in subsequent months based upon a collection of doctors' notes from preceding months, or forecasting total medical expenditure in subsequent months based upon the content of doctors' notes.
Sentiment To date, sentiment analysis datasets have been created by human annotation. However, the annotation task is difficult to fully specify, and impossible to scale to real-world volumes of data. A key insight is that there are many real-world signals that can be considered proxies for sentiment, most obviously market signals, especially when the definition of sentiment is constrained to specific (entity, aspect) pairs. Instead of using model-derived sentiment to forecast time series, market signals can be used as ground-truth proxies for sentiment annotations.
Social Sciences Social scientists may be interested in the tooling we have built around the Wikidata SPARQL endpoint, because news-signals allows users to easily build a set of signals connected to any set of Wikidata entities.In one of our example datasets, we produced a signal for every living US politician present in Wikidata, and we believe that many social scientists will be researching similar specific sets of entities and related time series signals.
Causality news-signals may be useful for NLP researchers working on tasks related to causality, because time series signals are well-suited to causality research. In general, we wish to find out what types of information are likely to impact time series signals. Concretely, we may believe that there is a true causal relationship between news and the edit rate of Wikimedia pages.

Related Work
NLP and Time Series Dataset Libraries news-signals can be seen as sitting between NLP-focused dataset libraries such as Huggingface Datasets (Lhoest et al., 2021) and time series focused libraries such as GluonTS and KATS (Alexandrov et al., 2019; Jiang et al., 2022). We specifically build tooling for working with datasets with textual inputs and time series outputs, and news-signals is complementary to, and compatible with, other popular NLP and time series libraries.
Granger Causality It is natural to consider whether the content of textual inputs "caused" an observed time series signal behavior. Granger causality (Granger, 1969) is a method of measuring the degree to which one signal may cause another. Marcinkevičs and Vogt (2021) propose a framework for discovering Granger causality with interpretable neural networks.
Summary graphs (Peters et al., 2017) are a useful way of compressing relationships about Granger causality. Wen et al. (2017) introduce a flexible RNN architecture for time series forecasting. Nourbakhsh and Bang (2019) discuss the use of PLMs for anomaly detection on financial data in a position paper.

Time Series prediction with Textual Inputs
As discussed in Section 1, one significant line of work focuses on predicting financial time series using signals derived from text, in particular aggregations of sentiment scores from social media posts (Chen et al., 2021, 2022; Arno et al., 2022; Li et al., 2014; Bing et al., 2014; Kim et al., 2016; Wang and Luo, 2021), inter alia.
PLMs and Transfer Learning Recently, significant work has been done to adapt transformer-based models in particular to time series forecasting tasks with flexible semantics (Wen et al., 2023).

Timeline Summarization from News Corpora
A related line of work within the NLP community is constructing timelines of important events from large collections of news focused on long-term topics, e.g. disasters or entities (Martschat and Markert, 2018). The methods for identifying important events often make use of time-series-like signals defined over dates: the number of articles published per day, or the number of times the date is mentioned in text (Tran et al., 2013; Ghalandari and Ifrim, 2020).
Conclusion
We have presented news-signals, an open-source library for building and working with NLP datasets that predict time series signals based on textual inputs. We hope that this library can be useful to a broad group of researchers and data scientists in both academic and industry settings. Naturally, we would be very happy for additional contributions from the open-source community to further improve the library.

Figure 2: Creating and using a news signal

Figure 3: Predicted and ground-truth anomalies of a Wikipedia pageviews time series for US politician Karen Bass. The predictions are from a random forest model with sparse lexical features.

Table 2: Evaluation results for anomaly classification experiments. %pos indicates the proportion of positive predicted labels.