Next-Year Bankruptcy Prediction from Textual Data: Benchmark and Baselines

Models for bankruptcy prediction are useful in several real-world scenarios, and multiple research contributions have been devoted to the task, based on structured (numerical) as well as unstructured (textual) data. However, the lack of a common benchmark dataset and evaluation strategy impedes the objective comparison between models. This paper introduces such a benchmark for the unstructured data scenario, based on novel and established datasets, in order to stimulate further research into the task. We describe and evaluate several classical and neural baseline models, and discuss benefits and flaws of different strategies. In particular, we find that a lightweight bag-of-words model based on static in-domain word representations obtains surprisingly good results, especially when taking textual data from several years into account. These results are critically assessed, and discussed in light of particular aspects of the data and the task. All code to replicate the data and experimental results will be released.


Introduction
Since the seminal work of Beaver [1966], bankruptcy prediction has received considerable attention from both academics and practitioners. A sound prediction model has numerous applications. For instance, successful quantitative methods can help professionals, such as creditors and investors, in managing financial risk [Bielecki and Rutkowski, 2013]. Furthermore, as Bernanke [1981] has shown that economy-wide levels of bankruptcy risk play a structural role in propagating recession, regulators can use bankruptcy prediction models to monitor the financial health of key economic actors and control systematic risk.
A large number of bankruptcy prediction models have been proposed in the literature, such as the models from Beaver [1966], Ohlson [1980], Odom and Sharda [1990], Kim and Kang [2010] and Mai et al. [2019]. However, it appears difficult to compare these studies and objectively assess progress in the field. We have identified the following three aspects

Related Work
After a general overview of research on bankruptcy prediction (Section 2.1), we describe some key aspects that make contributions in literature hard to compare (Section 2.2).

Bankruptcy Prediction Research
Beaver [1966] pioneered bankruptcy prediction literature with a discriminant model based on financial ratios. Subsequently, well-chosen structured financial variables were proposed to predict failure, along with increasingly advanced prediction models. Statistical models, such as discriminant analysis [Beaver, 1966; Altman, 1968], have been dominant in the past but rely on stringent assumptions about the data [Balcaen and Ooghe, 2006]. Today, machine learning models are commonplace as they rely on fewer assumptions and learn directly from the data. Odom and Sharda [1990] used neural networks to predict bankruptcy, Kim and Kang [2010] have built an ensemble model and Hosaka [2019] generates predictions through a convolutional neural network with ratios presented as images. Keasey and Watson [1987] were the first to include non-financial variables in a corporate failure model, Shumway [2001] has shown that market-driven variables are strongly related to bankruptcy and Cecchini et al. [2010] found that textual disclosures can be used to discriminate between bankrupt and non-bankrupt firms. The information value of textual data was further established by Mayew et al. [2015] as they found that the opinion of management on the future of the company and the linguistic tone of the Management Discussion and Analysis has significant explanatory power for corporate failure. Mai et al. [2019] provide large-sample evidence of the predictive power of textual disclosures and show that deep learning models yield superior results when using textual data together with traditional accounting features. Furthermore, the authors compare two deep learning architectures based on skip-gram word representations [Mikolov et al., 2013] and conclude that an average embedding model leads to better results than a ConvNet architecture. Despite this promising work, bankruptcy prediction models using textual data are scarce.

Table 1: Extracts from the MD&A section of a distressed company in our dataset, one year and three years prior to bankruptcy. Underlined words correspond to the top 20 tokens most informative for imminent bankruptcy in our respective Binary Bag-of-Words models.
Three years prior to bankruptcy: "We are highly leveraged and a substantial portion of our liquidity needs arise from debt service requirements and from funding our costs of operations and capital expenditures, including acquisitions... we entered into a new asset-based revolving credit facility (ABL Facility)... secured by substantially all of our assets..."
One year prior to bankruptcy: "... we received a waiver of certain events of default under the TLA arising from the inclusion of a going concern qualification from our registered public accounting firm, breach of the EBITDA financial covenant, and cross-default arising from the default under our ABL Facility... In order to address our liquidity issues and provide for a restructuring of our indebtedness to improve our long-term capital structure, we have entered into a Restructuring Support Agreement ... pursuant to a prepackaged plan of reorganization to be filed in a case commenced under chapter 11 of the United States Bankruptcy Code..."

Need for a Reproducible Benchmark
The following aspects prevent a straightforward comparison of research contributions, and may be avoided by a common benchmark along with the tools to reproduce experimental results, one of the goals of this work.

Temporal nature and class imbalance of bankruptcy data:
Due to the temporal nature of the data and the typically much smaller fraction of positive cases (enterprises going bankrupt), many strategies have been proposed to construct training data and define evaluation sets. The data source that serves as a basis for the model typically contains annual (or more fine-grained) observations for each firm in the sampling period. In earlier work [Beaver, 1966; Altman, 1968] the explanatory variables were selected only once for each firm in the dataset. In the 'paired sampling' approach [Altman, 1968], the independent variables for failed firms were retained in the year before failure, together with those for a paired healthy firm in that same year, to induce a balanced dataset from which a random evaluation set is sampled. Shumway [2001] has shown that such an approach leads to poor out-of-sample prediction performance and incorrect statistical inference. As an alternative, hazard models can be estimated by treating each firm-year sample as an independent observation, with the bankruptcy status by the end of the following year as the prediction target. Typically, the observations prior to some date are used for model training, and observations after this date are used to estimate the out-of-period prediction performance [Shumway, 2001; Mai et al., 2019]. Sometimes even a random split is used, independent of time [Mai et al., 2019]. In the work of Volkov et al. [2017], the explanatory variables for a number of consecutive years are used as input, with company status as the prediction target in the year afterwards. The class imbalance is managed through undersampling of healthy companies. Evaluation is done on a held-out subset of companies, which is therefore artificially balanced as well. Undersampling, oversampling, and data augmentation techniques are investigated by Veganzones and Séverin [2018]. Training and evaluation are done on a non-overlapping subset of firms, with a one-year shift in between, while also maintaining a predefined artificial ratio between the number of healthy and bankrupt firms (for both training and evaluation).
In our considered population (public companies in the US, see Section 3.1), all companies are known, as well as their yearly reports so far, and the goal is predicting bankruptcy for all of these firms in the near future (the coming year).This is simulated in our evaluation scenario, where we make predictions for all companies not (yet) bankrupt and observed through annual reports up to a given year, on their bankruptcy status the year afterwards (as further detailed in section 3.2).

Large variety of evaluation metrics:
The choice of evaluation metrics is often linked to the experimental setup, e.g., depending on whether a balanced test set is used. The evaluation scenario also influences the choice of threshold used for metrics like accuracy, precision, or recall. For example, Volkov et al. [2017] select a threshold that maximises the F2-measure. Alternatively, Veganzones and Séverin [2018] select the threshold that minimises the expected cost of misclassification with equal weights. Aggregated metrics that avoid the use of a threshold, such as area under the ROC curve (AUC), decile rank, and cumulative accuracy profile ratio (CAP) are regularly reported as well [Mai et al., 2019].
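To make the dependence on a threshold concrete, the following minimal sketch selects the threshold that maximises an F-beta measure (beta = 2 gives the F2-measure, which weights recall more heavily than precision). It is an illustration only and not the procedure of any cited work; the function names `f_beta` and `best_threshold` are our own, and every distinct score is simply tried as a candidate threshold.

```python
from typing import List, Optional, Tuple


def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta from confusion counts: (1+b^2)TP / ((1+b^2)TP + b^2 FN + FP)."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def best_threshold(scores: List[float], labels: List[int],
                   beta: float = 2.0) -> Tuple[Optional[float], float]:
    """Try every distinct score as threshold (predict 1 if score >= t)."""
    best_t, best_f = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        f = f_beta(tp, fp, fn, beta)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f
```

Because the selected threshold depends on the class ratio of the data it is tuned on, results obtained this way on artificially balanced sets are not directly comparable to results on the full population.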

Use of private datasets:
The final reason that makes model comparison hard is the lack of a standard benchmark dataset. Bankruptcy prediction literature either reports results on proprietary datasets [Matin et al., 2019] or on data obtained by manual collection or custom web scraping strategies (and kept private) [Cecchini et al., 2010; Wang et al., 2020]. For a comprehensive overview of data sources used in recent corporate failure literature we refer the reader to the work of Mai et al. [2019]. Our datasets are based on the combination of existing sources, i.e., the UCLA-LoPucki Bankruptcy Research Database (BRD) and the public EDGAR-CORPUS [Loukas et al., 2021]. This allows researchers to reconstruct the same train, validation and test data from these sources, even if we are not allowed to make the resulting datasets public directly.

Methodology
In the next sections, we describe the data sources (Section 3.1) and motivate our design choices for the benchmark (Section 3.2), document pre-processing (Section 3.3), and the selected evaluation metrics (Section 3.4).

Data Sources
Our study makes use of the EDGAR-CORPUS, a novel economic dataset containing 10-k reports from all publicly traded companies in the US, spanning 25 years [Loukas et al., 2021]. As we need information on bankruptcies as prediction target, these reports were matched with the UCLA-LoPucki Bankruptcy Research Database (the BRD), through the unique Central Index Key to identify companies. The BRD contains information on all Chapter 7 and Chapter 11 filings of the United States Bankruptcy Code since 1997 and is updated monthly.
Consistent with prior work [Cecchini et al., 2010; Mayew et al., 2015; Mai et al., 2019], we limit the 10-k reports to section 7: "Management Discussion and Analysis". According to the U.S. Securities and Exchange Commission, it "... gives the company's perspective on the business results of the past financial year. This section, known as the MD&A for short, allows company management to tell its story in its own words." It also contains the risks and uncertainties that could materially affect the company. As an example, consider the extracts from the MD&As of a distressed firm in Table 1.
Public company bankruptcy is a rare event. Figure 1 shows that the number of 10-k reports filed by non-bankrupt companies heavily exceeds the yearly number of Chapter 7 and Chapter 11 cases. Note how the influence of the Dot-com crisis (2000), the financial crisis (2007-2008), and the COVID crisis (2020) on our population can be observed. Table 2 provides additional statistics for the aligned data sources.

Task Definition and Setup
Determining the prediction time window
Prior work has not always been very transparent about the temporal aspect of the textual and numerical data in their models, but this requires special attention in order to arrive at a correct setup. A 10-k report is characterised by two dates, as schematically shown in Fig. 2: (1) the fiscal year-end t_PR of the one-year time window T_PR ('period of report') used to calculate the financial statements, and (2) the filing date t_FD on which the report is filed with the SEC. Since in practice t_FD ≥ t_PR, there may be a period after t_PR yielding textual information in the MD&A (i.e., before t_FD), not present in the financial statements. It is therefore important to use the one-year period directly after t_FD as the prediction time window T_prediction when the textual data is used as input to the model. In the extreme case of bankruptcy in between t_PR and t_FD ('potential bankruptcy' in Fig. 2), it would lead to leakage and artificially high prediction accuracies if the year directly after t_PR were used for prediction. It is possible, though, that information on an imminent bankruptcy shortly after t_FD is already included in the report, but this does not present a conceptual problem for the prediction setup.

Dealing with missing 10-k reports
The dataset contains yearly 10-k reports from the first time a company appears, starting from the year 2000, until 2021 or until bankruptcy. However, some reports are missing for a number of companies, and our analysis reveals the following three scenarios. First, some companies stop reporting from a certain point in time onwards, without filing for bankruptcy. This may be due to a merger or an acquisition, but that particular information is not present in the data. Second, there may be gaps in the sequence of yearly reports. This arises when a company either does not submit a 10-k report (due to unknown reasons) or because of data quality issues. Third, we observe that some companies headed towards bankruptcy tend to fail in their reporting in the year(s) leading up to the bankruptcy filing. A naive approach would be to simply discard all instances with missing reports. However, this would make the evaluation scenario biased, since missing reports are not distributed uniformly over the data, due to the different scenarios described above. Consider our 2019 test set with a history of three years (discussed later in this section) as an example, of which close to 45% of companies have at least one missing report during the three-year history. The relative frequency of bankruptcy is 0.27% for the entire population, 0.00% for companies with only missing data (cf. an M&A event), 0.35% for companies with no missing data and 0.93% for companies where the data in only the year before prediction is missing. Therefore, we do not remove these companies and keep them in our dataset, which results in a more realistic evaluation scenario.

Construction of input and target per firm-year
In order to create time-agnostic firm-year samples (following Shumway [2001]) during the construction of our train, validation and test sets (see further), we process a given year and company as follows:
1. Determine T_prediction: If a 10-k report was filed by the company in the considered year, T_prediction is the period between t_FD and t_FD + 1 year (cf. Figure 2). Otherwise, we use the one-year period starting the same day as the latest available t_FD, but in the considered year.
2. Assign target label: If the company filed for bankruptcy during T_prediction, the label is 1, otherwise 0. Note that potential firm-year instances with a bankruptcy filing before t_FD are invalid for the considered year, as explained above.
3. Collect textual data: The MD&A text from the report filed at t_FD is used for the one-year history setting, as well as from the two previous years for the three-year scenario. For missing reports, the token 'missing' is used.
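The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions rather than our released implementation: the function `build_firm_year`, the `filings` mapping (calendar year of filing to a pair of filing date and MD&A text) and the half-open window [t_FD, t_FD + 1 year) are hypothetical choices made for the example.

```python
from datetime import date
from typing import Dict, List, Optional, Tuple

MISSING = "missing"  # placeholder token for a missing report


def with_year(d: date, year: int) -> date:
    """Same month and day in another year (29 Feb falls back to 28 Feb)."""
    try:
        return d.replace(year=year)
    except ValueError:
        return d.replace(year=year, day=28)


def build_firm_year(filings: Dict[int, Tuple[date, str]],
                    bankruptcy: Optional[date],
                    year: int,
                    history: int = 1) -> Optional[Tuple[List[str], int]]:
    """Return (texts, label) for one company and year, or None if invalid."""
    # Step 1: start of T_prediction is t_FD, or the latest earlier t_FD
    # moved to the considered year when this year's report is missing.
    if year in filings:
        start = filings[year][0]
    else:
        earlier = [y for y in filings if y < year]
        if not earlier:
            return None  # company not yet observed
        start = with_year(filings[max(earlier)][0], year)
    # Instances with a bankruptcy filing before t_FD are invalid.
    if bankruptcy is not None and bankruptcy < start:
        return None
    # Step 2: label is 1 if bankruptcy falls inside T_prediction.
    end = with_year(start, start.year + 1)
    label = int(bankruptcy is not None and start <= bankruptcy < end)
    # Step 3: collect the MD&A texts, oldest first, with the placeholder
    # token for missing reports.
    years = range(year - history + 1, year + 1)
    texts = [filings[y][1] if y in filings else MISSING for y in years]
    return texts, label
```

For a company with reports filed in 2016 and 2018 and a bankruptcy filing in September 2018, the 2018 instance with three years of history would contain the 2016 text, the token 'missing' and the 2018 text, with label 1.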
Train / validation / test segmentation
Training data: We construct two training sets in total. The first, using data up to 2015, is used for initial training while leaving sufficient data for validation during hyperparameter tuning. The second, with data up to 2017, is used to train the final models. They are constructed as follows:
1. We leave out all reports with a t_FD later than 2015 (2017), to ensure a proper temporal split between training and evaluation data.
2. For every firm and every year between the first year of the training data and 2015 (2017), we construct a firm-year instance as described above.
3. To reduce the impact on the training process of instances without any reports in their considered history (i.e., the one-year or three-year history, respectively), 95% of those are randomly removed.
Validation data: We construct two validation sets, one for 2017 and one for 2018, both to be used for hyperparameter tuning. First, we filter out companies that have not filed any reports during the 5 years leading up to and including 2017 (2018). For each of the remaining companies, one firm-year sample is created according to the method described above for the year (and hence t_FD, even when the report is missing) 2017 (2018).
Test data: In the same way, we construct two test sets, one for 2019 and one for 2020 (denoting the calendar year containing t_FD), for the final evaluation of the trained models.
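A minimal sketch of this segmentation logic is given below. It simplifies the procedure by filtering firm-year instances on their year instead of filtering individual reports on t_FD, and the dictionary keys `year` and `texts`, the function names and the fixed random seed are assumptions introduced for the example.

```python
import random
from typing import Dict, Iterable, List

MISSING = "missing"  # placeholder token for a missing report


def select_training_instances(instances: Iterable[Dict],
                              cutoff_year: int,
                              drop_fraction: float = 0.95,
                              seed: int = 0) -> List[Dict]:
    """Keep instances up to cutoff_year; randomly drop most empty ones."""
    rng = random.Random(seed)
    kept = []
    for inst in instances:
        if inst["year"] > cutoff_year:
            continue  # temporal split: nothing after the cutoff
        no_reports = all(text == MISSING for text in inst["texts"])
        if no_reports and rng.random() < drop_fraction:
            continue  # remove about 95% of instances without any report
        kept.append(inst)
    return kept


def eligible_for_evaluation(filing_years: Iterable[int],
                            eval_year: int,
                            lookback: int = 5) -> bool:
    """True if a report was filed in the `lookback` years up to eval_year."""
    return any(eval_year - lookback < y <= eval_year for y in filing_years)
```

With `cutoff_year=2015` this corresponds to the first training set and with `cutoff_year=2017` to the second; `eligible_for_evaluation` mirrors the filter applied before building the validation and test sets.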

Pre-processing
When dealing with textual data it is common to perform document pre-processing in order to decrease the dimensionality of the problem and reduce the computational cost of encoding the documents. We perform four pre-processing steps for the Bag-of-Words models presented in Sections 4.1-4.3. First, we lowercase all documents. Second, we remove stopwords and punctuation. Third, we lemmatize each word in the documents through the NLTK library [Loper and Bird, 2002]. Inflected word forms such as paying and paid are transformed into the root form pay. Finally, we replace uncommon words by the token 'UNK' (for 'unknown'). A word is deemed uncommon when it does not appear in the 50,000 most frequent words in the training set. When dealing with transformer models [Vaswani et al., 2017], such as the Longformer [Beltagy et al., 2020], these steps are typically not required and might even lead to deteriorating performance. Pre-processing then consists of proper tokenization of the input text. We use the tokenization tools from Huggingface, which allow transforming the input text into a sequence of well-chosen word pieces.
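The four Bag-of-Words pre-processing steps can be sketched as below. To keep the example free of external resources, the stopword list is a tiny illustrative set and the lemmatizer is passed in as a callable (the identity by default); in our setting that role is played by the NLTK lemmatizer. The regular expression used for tokenization and all function names are assumptions of this sketch.

```python
import re
from collections import Counter
from typing import Callable, Iterable, List, Set

# Tiny illustrative stopword list; a real run would use a full list.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "we", "our"}
UNK = "UNK"


def preprocess(text: str,
               lemmatize: Callable[[str], str] = lambda w: w) -> List[str]:
    """Lowercase, drop punctuation and stopwords, then lemmatize."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [lemmatize(t) for t in tokens if t not in STOPWORDS]


def build_vocabulary(docs: Iterable[List[str]],
                     max_size: int = 50_000) -> Set[str]:
    """The max_size most frequent tokens in the training documents."""
    counts = Counter(tok for doc in docs for tok in doc)
    return {word for word, _ in counts.most_common(max_size)}


def replace_uncommon(doc: List[str], vocabulary: Set[str]) -> List[str]:
    """Replace every token outside the vocabulary by the UNK token."""
    return [tok if tok in vocabulary else UNK for tok in doc]
```

The vocabulary is built on the training documents only and then applied unchanged to validation and test documents, so that no information from the evaluation years enters the feature space.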

Evaluation Metrics
Following Mai et al. [2019], we report the Area Under the Receiver Operating Characteristic Curve (AUC) as main evaluation metric. The AUC is often used to quantify the overall prediction performance of binary decision models. It aggregates the information in the Receiver Operating Characteristic (ROC) curve, which quantifies the trade-off between the true positive rate (or recall) and the false positive rate at various classification thresholds. However, in certain scenarios, a high true positive rate may be more relevant than a low false positive rate. Therefore, we also report the Recall@100. It quantifies the proportion of positive cases (bankrupt firms) present in the 100 highest ranked ones, out of all positive samples (all bankrupt firms in the considered year). In our context, this metric evaluates the models in their effectiveness to detect as many distressed enterprises as possible for a given budget (e.g., the manpower to investigate a hundred firms). The Cumulative Accuracy Profile Ratio (CAP) is a ranking based metric with a strong emphasis on recall of the positive class. It summarises the information in the CAP curve, which plots the cumulative proportion of positive samples against the percentage of the ranked data taken into account. The Cumulative Decile Rank is also a recall oriented metric. It gives the cumulative proportion of all positive samples (bankrupt firms) in each decile when ranking the samples according to the classifier score. Although we consider recall more important for the bankruptcy case from the perspective of the 'given budget' scenario outlined above, we report a precision oriented metric as well. The Average Precision (AP) is the weighted mean of the precision at each classification threshold with the increase in recall as weight.
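As an illustration of the definitions above, the following sketch computes the AUC, Recall@k, the cumulative decile rank and the Average Precision from classifier scores and binary labels. It is a plain-Python reference under simplifying assumptions (the pairwise AUC is quadratic in the number of samples, and ties between scores are not treated specially in the ranking-based metrics), not the evaluation code behind the reported numbers.

```python
from typing import List


def ranked_labels(scores: List[float], labels: List[int]) -> List[int]:
    """Labels sorted by decreasing classifier score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def auc(scores: List[float], labels: List[int]) -> float:
    """Probability that a positive outranks a negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def recall_at_k(scores: List[float], labels: List[int], k: int = 100) -> float:
    """Share of all positives found among the k highest ranked samples."""
    return sum(ranked_labels(scores, labels)[:k]) / sum(labels)


def cumulative_decile_rank(scores: List[float], labels: List[int]) -> List[float]:
    """Cumulative share of positives in the top 10%, 20%, ..., 100%."""
    ranked = ranked_labels(scores, labels)
    n, total = len(ranked), sum(ranked)
    return [sum(ranked[: (n * d + 9) // 10]) / total for d in range(1, 11)]


def average_precision(scores: List[float], labels: List[int]) -> float:
    """Precision at each positive, weighted by the recall increase."""
    hits, ap = 0, 0.0
    for rank, y in enumerate(ranked_labels(scores, labels), start=1):
        if y == 1:
            hits += 1
            ap += hits / rank
    return ap / hits
```

All four functions assume at least one positive and one negative sample; with the very low bankruptcy rate in our population this should be checked per evaluation year before computing the metrics.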

Models
Sections 4.1-4.3 introduce our bag-of-words (BoW) models (which discard word order), followed by a neural sequence encoder model that does account for word order (Section 4.4), and some training details (Section 4.5).

Binary Bag-of-Words Model
As a trivial baseline (referred to as 'Binary') we represent our documents as vocabulary-sized binary vectors with '1' at a particular position indicating the presence of the corresponding word. As vocabulary, all occurring unigrams and bigrams are initially considered as features, and reduced to the 20 most informative ones through univariate feature selection, to be used in a logistic regression classifier. This baseline intends to quantify how well the occurrence of a small set of keywords allows predicting bankruptcy. The model for three-year history is obtained the same way, from the joint BoW over the considered years.
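The featurization behind this baseline can be sketched as follows; feature selection and the logistic regression classifier are left out. The sketch assumes that the 'joint BoW' for a three-year history is the union of the n-grams found in the individual years, and the feature names in the usage note are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence, Set


def ngrams(tokens: Sequence[str]) -> Set[str]:
    """All unigrams and bigrams occurring in one tokenized document."""
    unigrams = set(tokens)
    bigrams = {f"{a} {b}" for a, b in zip(tokens, tokens[1:])}
    return unigrams | bigrams


def binary_features(yearly_docs: Sequence[Sequence[str]],
                    selected: Sequence[str]) -> List[int]:
    """1 if a selected n-gram occurs in any of the given years, else 0."""
    present: Set[str] = set()
    for tokens in yearly_docs:
        present |= ngrams(tokens)  # joint bag-of-words over the history
    return [1 if feature in present else 0 for feature in selected]
```

For instance, with the hypothetical selection `["going concern", "default", "waiver"]`, a single document tokenized as `["going", "concern", "doubt"]` is mapped to `[1, 0, 0]`. Under this union, the model cannot tell in which year a keyword occurred.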

TF-IDF Bag-of-Words Model
The second model is similar to the Binary baseline, but considers term frequency-inverse document frequency (TF-IDF) features [Manning et al., 2008] rather than binary ones, combined with feature selection and an L2-regularized logistic regression classifier. The number of features to retain and the inverse regularisation strength are treated as hyperparameters.
The three-year model is constructed the same way, after concatenating the texts per year.

Word2Vec Average Embedding Model
As a final bag-of-words model (W2V), we implement the best performing architecture proposed by Mai et al. [2019], based on the Word2Vec model of Mikolov et al. [2013]. First, the pre-processed data is used to train skip-gram word representations of dimension 100 (consistent with Mai et al. [2019]). Documents are then represented by the mean word vector over all occurring words. These serve as input to a two-layer feed-forward neural network with ReLU activations [Glorot et al., 2011] and standard dropout [Srivastava et al., 2014], followed by a sigmoid output. During training, we minimize the binary cross entropy loss with an L2-penalty, using the Adam optimizer [Kingma and Ba, 2014]. The learning rate, weight decay (L2-penalty), hidden layer width, and dropout rate are treated as hyperparameters. When performing classification based on a history of three years, the document representations of each year are concatenated, resulting in a 300-dimensional input to the first hidden layer of the neural network.
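The construction of the document representation (mean word vector per year, concatenated over the history) can be sketched as below. The embeddings are passed in as a plain dictionary standing in for the trained skip-gram vectors; returning a zero vector for a document without any known word is an assumption of this sketch, as are the function names.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def mean_vector(tokens: Sequence[str],
                embeddings: Dict[str, Vector],
                dim: int) -> Vector:
    """Average the vectors of all tokens that have an embedding."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim  # assumption: zero vector if nothing is known
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def history_representation(yearly_tokens: Sequence[Sequence[str]],
                           embeddings: Dict[str, Vector],
                           dim: int) -> Vector:
    """Concatenate the yearly mean vectors, oldest year first."""
    representation: Vector = []
    for tokens in yearly_tokens:
        representation.extend(mean_vector(tokens, embeddings, dim))
    return representation
```

With `dim=100` and three years of history this yields the 300-dimensional classifier input. Because the years occupy fixed positions in the concatenated vector, the downstream network can in principle distinguish them, unlike the Binary and TF-IDF models.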

Longformer
For our most advanced neural model, we encode the documents through the Longformer of Beltagy et al. [2020]. This transformer-based model is able to handle sequences up to 4096 tokens through its attention mechanism that scales linearly with the input text length (as opposed to the quadratic behavior in earlier Transformer models such as BERT [Devlin et al., 2018]). Given the mean document length of over 6k words in our corpus (cf. Table 2), we considered the Longformer a plausible baseline. We process the first 4096 tokens of each document with the Longformer model and retain the 768-dimensional pooled output as the document representation that feeds the same feed-forward classification neural network as described above. For dealing with a history of three years, the individual representations per year are again concatenated, and the input size of the first hidden layer is adjusted accordingly. During training, these representations are kept static (i.e., the Longformer weights are not further fine-tuned on our classification task).

Training Details
The classical models (Sections 4.1 and 4.2) are implemented in scikit-learn and the hyperparameters are optimised through a grid search procedure. As constructing the vocabulary of all tokens in the training data is expensive, we chose to undersample the majority class until a 90%-10% distribution was reached. The neural models (Sections 4.3 and 4.4) are implemented in PyTorch, while the Word2Vec model was trained with Gensim and the forward [...] [Bergstra et al., 2011] implemented in Optuna. The hyperparameters are tuned to maximise the weighted AUC of the 2017 and 2018 validation data, and the obtained values are then used to train the final models using training data up to 2017, to be tested on the 2019 and 2020 test sets.

Results and Discussion
Table 3 presents the out-of-period test performance metrics for our text-based bankruptcy prediction models, taking a single year or three years of history into account.
When taking a single year of history into account, the W2V model is superior in terms of AUC, recall@100 and CAP while the TF-IDF model achieves the best results in terms of AP. For the 2019 test set, the TF-IDF model contains a slightly higher proportion of positive samples in the first decile but the W2V model is superior from the second decile onwards. When taking three years of history into account, the W2V model achieves the best results for the AUC and CAP metrics while the TF-IDF model performs better with respect to AP and recall@100. When looking at decile rank, the W2V model performs best, having ranked all bankrupt companies in the top 30% of the samples for the 2019 test set.
For each model, AUC and CAP are better when taking three years of history into account compared to a single year of history. The same applies for the decile rank (except for the TF-IDF model and the Longformer model in the first decile). AP is generally worse when using a longer history, except for the Binary model with test set 2020 and the Longformer model with test set 2019. The recall@100 metric varies over the two setups.
We observe that the Binary models based on a mere 20 keywords perform surprisingly well, although not on par with the TF-IDF and W2V models. Note that the latter are based on many more features (in particular, hyperparameter tuning led for the TF-IDF model to 25,000 (10,000) features for single (three) year history). The relatively good performance of the Binary baseline suggests that the presence of a few very informative words is a strong indicator for impending bankruptcy. As an illustration, we list the top 15 unigrams and bigrams selected by the single year Binary model in Table 4 and underline these features in the extracts in Table 1.
Furthermore, the Longformer model performs significantly worse than the other models. Since we do not finetune the generic pre-trained Longformer model on our end task, the resulting generic document representations appear unable to capture those features in the text that are important for bankruptcy prediction.
The W2V model leads overall to the best results, in particular for AUC (on which model selection was performed over the validation set) and CAP, and better than the Longformer over the entire line. Even though it is based on the mean representation over all words, it appears the relevant information regarding bankruptcy prediction is still sufficiently present. As opposed to the Longformer, the W2V document representations come from in-domain data (i.e., pretrained on 10-k reports).
Finally, we critically evaluate the observed performance improvements for the three-year w.r.t. single-year history setting. The Binary and TF-IDF models are by construction unable to distinguish the different years, but in principle the W2V and Longformer models could learn to capture a deteriorating financial situation over three years of history. However, when evaluating our final W2V models on the test sets with only complete observations (i.e., discarding test instances with missing reports), we get the following results. The single year of history AUC is 0.93 (0.94) and the recall@100 is 0.48 (0.36) while the three year history AUC is 0.93 (0.93) and the recall@100 is 0.24 (0.28). These results imply that our models taking three years of history into account lead to better performance metrics only because they are able to generate meaningful predictions for companies with some missing reports. Building more expressive models that can leverage the changes in the documents over the years presents an interesting avenue for future research.

Conclusion and Future Work
Bankruptcy prediction models are valuable in many real-world applications and have received considerable research attention. However, assessing actual progress in the field is not obvious due to the lack of a common benchmark. In this work, we introduce such a benchmark for bankruptcy prediction using textual data along with several baseline models that demonstrate the predictive value of the textual data. We give a detailed discussion on our benchmark and evaluation design choices and share our code to reproduce the experiments.
In future work, we will focus on more advanced models to take into account the temporal evolution of enterprises' financial situation and more advanced language representations (i.e., by finetuning transformer encoders). We also plan to extend the benchmark with structured financial data to build hybrid prediction models.

Figure 1: Number of bankruptcies (including the mean) (left y-axis) and the number of 10-k reports filed (right y-axis) per year.

Table 3: Bankruptcy prediction results on the 2019 and 2020 test sets, for several bag-of-words models: with binary one-hot vectors (Binary), TF-IDF, and mean word-to-vec (W2V) representations, as well as a Longformer classifier, and for single-year vs. three-year text inputs.