CR-COPEC: Causal Rationale of Corporate Performance Changes to Learn from Financial Reports

In this paper, we introduce CR-COPEC, a Causal Rationale of Corporate Performance Changes dataset built from financial reports. It is a comprehensive, large-scale, domain-adapted causal sentence dataset for detecting corporate financial performance changes. CR-COPEC makes two major contributions. First, it detects causal rationales from 10-K annual reports of U.S. companies, which contain experts' causal analyses written in a formal manner following accounting standards. The dataset can thus serve both individual investors and analysts as a source of material information for investment and decision making, without the tremendous effort of reading through entire documents. Second, it carefully considers the different characteristics that affect the financial performance of companies in twelve industries. As a result, CR-COPEC can distinguish causal sentences across industries by taking the unique narratives of each industry into consideration. We also provide an extensive analysis of how well the CR-COPEC dataset is constructed and how well it is suited for classifying target sentences as causal with respect to industry characteristics. Our dataset and experimental code are publicly available.


Introduction
Many critical decisions about events require appropriate explanations grounded in accurate causal rationale. Justifying a root statement is directly related to identifying the causes of the events. When one observes an event that is an apparent cause of a desired outcome, one can make a proper decision with confidence.
There has been extensive research on extracting causes of events from numerical data. For example, Granger causality finds linear temporal dependence between two (or more) temporal sequences (Bressler and Seth, 2011). Shapley values derive the contributions of individual input attributes when a decision is made by a complex function (Shapley, 1971). These methods can explain numerical causes of decisions made by automated systems, e.g., robo-advisers in financial services (Hwang et al., 2016; Karuna, 2019; Chhatwani, 2022). However, humans use various types of information, both numerical and textual, when making an important decision. For example, analysts write summarized reports by extracting related causal information from multiple textual sources such as conference calls, annual reports, earnings statements and market reports.
Our research goal is to extract causal rationales from financial reports. In general, when investment firms predict a certain financial performance, they provide analysts' reports to support their predictions. We therefore aim to generate appropriate explanations for corporate performance changes from official documents. Our algorithm classifies causal sentences in documents and provides binary classification results.
We consider fine-tuning pre-trained neural language models (NLMs) for the causality modeling task. Pre-trained NLMs have been state of the art for many Natural Language Processing (NLP) tasks. For example, NLMs such as BERT (Devlin et al., 2018) and ALBERT (Lan et al., 2019) demonstrate outstanding performance on tasks such as question answering (Clark et al., 2020; Suissa et al., 2023) and computing conditional probabilities of masked words in a sentence (Kwon et al., 2022). Nonetheless, recent research indicates that the size of human-annotated data remains a significant factor influencing model performance (Gu et al., 2022; Mehrafarin et al., 2022).
Previous works have introduced datasets for causality detection in the financial domain (El-Haj et al., 2016; Mariko et al., 2020). However, existing studies lack consideration of industry-specific characteristics. Taking these into account can be beneficial because the items in the financial statements that most strongly affect financial performance differ depending on each company's primary business.
Our main contribution is to collect sufficient annotations to achieve reasonable causality detection performance with NLMs. We achieve this by collecting over 283K sentences from 1,584 10-K annual reports, which give a detailed summary of the financial status and business operations of each company, along with audited financial statements. We then manually label each sentence according to whether it explains the cause of a certain financial performance change. We name the resulting 283K sentence-label pairs Causal Rationale of Corporate Performance Changes (CR-COPEC). CR-COPEC is built on a large scale under the guidance of experts in the financial domain. Trained on our dataset, BERT can distinguish sentences containing the main causes of financial events in annual reports officially filed by most U.S. public companies. Individual investors can thus save the effort of reading a huge number of reports themselves.
However, we find that collecting the dataset does not solve the problem by itself. One challenge we observe during annotation is the diversity of causality across industries: causal patterns differ over sectors. Another challenge is imbalanced training, since the amount of data per sector varies. A common model applied to all sectors does not work in our problem setting, so the model must be built carefully. We provide extensive analysis of our CR-COPEC dataset to address these issues.

Related Work
A causal rationale is "the true sufficient rationale to fully predict and explain the outcome without spurious information" (Zhang et al., 2023), and extracting rationales is invaluable when a decision has to be made. Research on extracting rationales from text has been attempted on various types of documents (Blanco et al., 2008; Ittoo and Bouma, 2011; Lu et al., 2022). A model detecting and identifying rationales in chat messages was suggested in (Alkadhi et al., 2017). Bug reports from the Chrome web browser were used as the main source for extracting rationales (Rogers et al., 2012). Moreover, patent documents were utilized to discover design rationales (Liang et al., 2012). To extract causal textual structures, one may consider a rule-based system in which specific words such as 'due to', 'owing to' and 'affects' are listed to identify sentences containing causal information for prediction (Girju and Moldovan, 2002; Chang and Choi, 2006; Sakai et al., 2015).
Previous studies have also introduced causal rationale corpora in the financial domain. To identify causal sentences in UK Preliminary Earnings Announcements (PEAs), thirteen performance keywords including 'sales', 'revenue' and 'turnover' were used, and the selected sentences were annotated by humans (El-Haj et al., 2016). The study of Mariko et al. (2020) is the most closely related to our research. The authors built the FinCausal corpus, collected from financial news and websites and labeled with tags indicating the presence of causality and whether the causal chunk is quantitative or non-quantitative. The corpus consists of two subtasks. The first (FinCausal Task 1) is a binary classification task aiming to extract text containing causality. The second (FinCausal Task 2) is a relation extraction task identifying the substrings that indicate cause and effect. The main difference between CR-COPEC and FinCausal is that the texts in 10-K reports are written in a formal tone, since they must comply with regulatory rules, whereas the FinCausal dataset is generally written in a casual tone because it is collected from news and web content. In addition, CR-COPEC concentrates solely on causal rationales of accounting items while considering the unique characteristics of various industries. A detailed comparison of the two datasets is provided in Appendix E.

Causal Rationale of Corporate Performance Changes Dataset
Our goal is to collect sentences containing causal rationales for changes in corporate performance, based on key accounting items within the target industrial sector. For this purpose, we target the Management's Discussion and Analysis (MD&A) section of 10-K reports, since it provides the company's perspective on its operations and the financial results of the prior fiscal year. We gather MD&A reports filed in 1997 and 2017. The MD&A of 1997 is gathered from an existing MD&A data repository (Kogan et al., 2009), and that of 2017 is downloaded directly from the Securities and Exchange Commission (SEC) system. We then construct the CR-COPEC dataset through keyword-based filtering (Section 3.1) and a human annotation process (Section 3.2), as illustrated in Figure 1. The validity of the dataset is verified in Section 3.3. In addition, we provide protocols for training NLMs on the dataset in Section 3.4.

Keyword-based Filtering
Since the sentences we aim to detect are rare within a single document, it is prohibitively expensive to manually annotate every collected sentence. To mitigate this cost, previous studies extract text containing keywords such as domain-specific terminologies (El-Haj et al., 2016; Fonseca et al., 2023) or causal phrases (Sakai et al., 2015; Dürlich et al., 2022). In our work, we use keyword-based filtering to extract candidate causal sentences. In particular, we form a list of keywords including causal trigger phrases for changes in financial performance. The filtering removes approximately 62.7% of the sentences (177,629 out of 283,490) and retains the remaining 37.3% (105,861 out of 283,490) of the MD&A section as candidates. Section 3.3 discusses the coverage of the keyword list.
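The filtering step can be sketched as a simple substring match over a trigger list. The paper's actual keyword list is not reproduced here, so the trigger phrases below are illustrative assumptions only:

```python
# Minimal sketch of keyword-based candidate filtering.
# The trigger phrases are illustrative, not the paper's actual list.
CAUSAL_TRIGGERS = ["due to", "owing to", "because of", "as a result of",
                   "attributable to", "driven by"]

def extract_candidates(sentences):
    """Split sentences into (candidates, filtered_out) by trigger phrases."""
    candidates, filtered_out = [], []
    for s in sentences:
        low = s.lower()
        if any(trigger in low for trigger in CAUSAL_TRIGGERS):
            candidates.append(s)
        else:
            filtered_out.append(s)
    return candidates, filtered_out

sents = [
    "Gas revenue increased $32.9 million because of a 39% price increase.",
    "The Company operates in twelve states.",
]
cand, rest = extract_candidates(sents)
```

Only the candidates would then go to human annotators, which is what makes the annotation budget tractable.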

Data Annotation following SIC
We describe the process of analyzing the causal rationale of sentences by considering each company's Standard Industrial Classification (SIC), which classifies U.S. companies according to their primary business. We divide the 10-K reports into twelve industries predefined with respect to SIC codes (French, 2019). These twelve categories are 1) consumer non-durables, 2) consumer durables, 3) manufacturing, 4) energy, 5) chemicals, 6) business equipment, 7) telephone, 8) utilities, 9) shops, 10) health, 11) finance and 12) others. Note that, throughout the paper, sectors are numbered in this order.
Since the major factors that affect the financial performance of a company differ by industry, we build an annotation guideline for each sector, taking into account the different accounting items across industries. Under the supervision of a finance faculty member at the school of business administration, we craft the annotation guideline around the main items of the balance sheets and income statements, since those items are closely related to corporate financial performance. We then proceed with annotation according to the guideline. Each MD&A document is randomly assigned to an annotator. The number of documents and sentences in each sector is reported in Table 1. Examples of sentences containing the main factors most frequently observed in each sector are shown in Table 2. Additional examples of each sector's main factors are given in Appendix B, and causal/non-causal rationale sentences can be found in Appendix C.

Sector of Industry - Example [Document]

Consumer Durables - The sales increase for fiscal 1996 was principally due to improved sales of buses and ambulances. [Collins Industries, Inc., January 1997]

Manufacturing - The increase in 1996 net sales was due primarily to increases in sales revenues recognized on the contracts to construct the first five Sealift ships, the Icebreaker and the forebodies for four double-hulled product tankers, which collectively accounted for 63% of the Company's 1996 net sales revenue. [Avondale Industries, Inc., March 1997]

Energy - Gas revenue increased $32.9 million or 81% because of a 39% price increase combined with a 30% increase in production. [Cross Timbers Oil Co., March 1997]

Chemicals - Loss of margin was principally due to sales price decreases and raw material price increases in the pyridine and related businesses, and higher manufacturing costs due to weather related problems in the first quarter 1994. [Cambrex Corp., March 1997]

Finance - Mortgage investment income decreased for 1995 as compared to 1994 primarily due to the assignment to HUD of the mortgage on El Lago Apartments in June 1995. [American Insured Mortgage Investors Series 85 L P, March 1997]

Table 2: Examples of causal rationale sentences by sector.

Annotator Sensitivity
We distinguish annotators into general and unskilled annotators based on their proficiency with the guideline. We regard annotators who 1) participated in the development of the annotation guideline and 2) labeled more than 30K sentences as general annotators; all others are unskilled annotators.
The general annotators label each extracted sentence as causal or non-causal. 100,046 of the sentences were tagged by the two general annotators. We regard labels annotated by general annotators as the standard and train a model with these labels alone; we call this model the initial teacher model. We then apply this teacher model to documents labeled by unskilled annotators. If all labels in a single MD&A document match the predictions of the teacher model, we add them to the previous training set. We train another teacher model with the new version of the training set and apply this model to the remaining documents. We repeat this process until no matching document is found. As a result, we collect 1,584 10-K reports and 105,861 sentences, comprising 11,132 causal and 94,729 non-causal sentences.
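The iterative teacher-filtering loop above can be sketched as follows. `train_model` is a trivial keyword-rule stand-in for fine-tuning BERT, and the document/label layout is a simplified assumption; only the accept-if-all-labels-match loop mirrors the described procedure:

```python
def train_model(labeled_sentences):
    # Stand-in for fine-tuning BERT: predict causal iff the sentence
    # shares a word with any causal training sentence.
    causal_words = {w for s, y in labeled_sentences if y == 1 for w in s.split()}
    def predict(sentence):
        return 1 if set(sentence.split()) & causal_words else 0
    return predict

def self_train(general_docs, unskilled_docs):
    """Grow the training set with unskilled-annotated documents whose
    labels fully agree with the current teacher's predictions."""
    train_set = [pair for doc in general_docs for pair in doc]
    remaining = list(unskilled_docs)
    while True:
        teacher = train_model(train_set)
        accepted = [doc for doc in remaining
                    if all(teacher(s) == y for s, y in doc)]
        if not accepted:               # stop when no document fully matches
            return train_set, remaining
        for doc in accepted:
            train_set.extend(doc)
        remaining = [d for d in remaining if d not in accepted]
```

A document with even one disagreeing label is held out entirely, which is what keeps unskilled annotations from diluting the standard set.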
Note that all the sentences from the same document are assigned to the same annotator, and the annotators label the current sentence based on its previous context. Details on the annotation environment can be found in Appendix D.

The Validity of the Dataset
In this section, we discuss the validity of our dataset construction process: 1) keyword-based filtering and 2) human annotation.

Verifying Keyword-based Filtering
We randomly sample a small number of sentences from the full set of sentences, including those removed by the filtering, and manually classify which of them are causal. The resulting coverage analysis of the keyword list is reported in Table 3.

Verifying Human Annotation Process
The human annotation task is sensitive to the reliability of the general annotators. To verify inter-annotator reliability, we conduct a process with three phases: 1) random sampling, 2) additional annotation and 3) reliability calculation. Figure 2 is a conceptual diagram of the process.
To begin with, we randomly sample 1,000 sentences from the 10-K reports. We then ask the general annotators and two additional annotators to annotate the sampled sentences. The additional annotators were trained on the annotation guideline and corpus labeling. As a result, we obtain annotations #G1 and #G2 from the general annotators and annotations #A1 and #A2 from the additional annotators.
Finally, we calculate the reliability of the general annotators #1 and #2 on the sampled sentences. For this purpose, we use Cohen's kappa (Cohen, 1960), which evaluates the inter-annotator reliability of two annotators. We observe that the agreements of general annotator #1 with additional annotators #1 and #2 are 0.762 and 0.710, respectively, which can be interpreted as substantial agreement (κ > 0.6) (Viera et al., 2005). Furthermore, the agreements of general annotator #2 with additional annotators #1 and #2 are 0.808 and 0.740, respectively. Finally, the agreement between the two general annotators is 0.698.
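Cohen's kappa corrects the observed agreement p_o by the agreement p_e expected from each annotator's label marginals. A minimal sketch for two label lists (generic, not the paper's evaluation code):

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' label lists."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # observed agreement
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # chance agreement from the two annotators' label marginals
    p_e = 0.0
    for c in set(labels_a) | set(labels_b):
        p_e += (labels_a.count(c) / n) * (labels_b.count(c) / n)
    return (p_o - p_e) / (1 - p_e)
```

A kappa above 0.6 is conventionally read as substantial agreement, which is the threshold the verification above relies on.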

Dataset to Validate Training Performance
This section describes how we recompose the CR-COPEC dataset to make it suitable for training NLMs. In addition, we provide protocols for different versions of the dataset in Section 3.4.1 and Section 3.4.2.

(Footnote: Additional annotator #1 has worked in the computational linguistics domain for more than 6 years, and additional annotator #2 is a financial expert who worked at the Financial Supervisory Service for more than 8 years. Note that both general annotators reach κ > 0.7 when compared with additional annotator #2, indicating that the annotations of the general annotators are reliable.)

Protocols for Total Dataset
In the CR-COPEC dataset, the ratio of causal to non-causal sentences is imbalanced, and the robustness of learned causality can be affected by the causal ratio in each of the train/valid/test datasets. Thus, we carefully select the train, valid and test datasets so that each sector has a similar causal ratio across the splits. In this process, we iterate random selection until the ratio difference is less than 0.5%. The train/valid/test proportions are 81%/9%/10% of the total CR-COPEC dataset. The causal composition of the train/valid/test datasets for each sector is reported in Table 4.
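This split procedure can be sketched as rejection sampling over random shuffles. The 81/9/10 proportions and the 0.5% tolerance follow the text; the data layout and seed are illustrative assumptions:

```python
import random

def split_until_balanced(sector_data, max_diff=0.005, seed=0):
    """Repeat random 81/9/10 splits of (sentence, label) pairs until the
    causal ratios of train/valid/test differ by less than `max_diff`."""
    rng = random.Random(seed)

    def causal_ratio(part):
        return sum(y for _, y in part) / len(part)

    while True:
        data = sector_data[:]
        rng.shuffle(data)
        n = len(data)
        train = data[:int(0.81 * n)]
        valid = data[int(0.81 * n):int(0.90 * n)]
        test = data[int(0.90 * n):]
        ratios = [causal_ratio(train), causal_ratio(valid), causal_ratio(test)]
        if max(ratios) - min(ratios) < max_diff:
            return train, valid, test
```

With small sectors the 0.5% tolerance may require many resampling rounds, since the valid split's ratio granularity is coarse; the tolerance is a parameter for that reason.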

Protocols for Fraction of Dataset
In Section 4.3, we compare cross-sector performance to identify transferable information between sectors. As shown in Table 1, the size of each sector varies from 5K to 60K. To control the effect of size on performance, we randomly sample 3,500 sentences from each sector of the train dataset, since the smallest sector's train dataset contains fewer than 4,000 sentences.

Experimental Settings
We conduct experiments with baseline models on CR-COPEC, which consists of sentences and corresponding labels for supervised learning. Since the dataset is highly imbalanced, we use the area under the precision-recall curve (AUPRC) for causal sentences as the evaluation metric (Davis and Goadrich, 2006). Models are selected on the validation dataset. A trained model outputs the probability of causality for each sentence. Experiments are conducted on Google Colab Pro with one Tesla V100-SXM2-16GB GPU and four Intel(R) Xeon(R) CPUs @ 2.00GHz.
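For reference, average precision is a standard estimator of the area under the precision-recall curve and can be computed without any library. This is a generic sketch, not the paper's exact evaluation code:

```python
def average_precision(y_true, scores):
    """Average precision: precision averaged at the rank of each
    positive when examples are sorted by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(y_true)
    if n_pos == 0:
        return 0.0
    tp, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            tp += 1
            ap += tp / rank    # precision at this recall point
    return ap / n_pos
```

Unlike accuracy, this metric is insensitive to the large pool of easy negatives, which is why it suits the roughly 1:9 causal-to-non-causal imbalance reported for CR-COPEC.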

Experimental Results on Baseline Models
We use LSTM, Bidirectional LSTM (Bi-LSTM) (Graves et al., 2005), ELECTRA Base (Clark et al., 2020) and BERT Base as baselines to compare performance in extracting rationales of financial performance changes. As input, each sentence is tokenized by the ELECTRA and BERT (Devlin et al., 2018) tokenizers for ELECTRA Base and BERT Base, respectively. For the LSTM models, we use GloVe (Pennington et al., 2014) word embeddings.
For the baseline experiments, we use the total CR-COPEC dataset described in Section 3.4.1. As shown in Table 6, transformer-based models (ELECTRA Base and BERT Base) performed better than RNN-based models. Because BERT Base achieved the highest AUPRC score (85.13%), we use BERT Base for the subsequent experiments.

Examination of Sectors' Characteristics
CR-COPEC consists of twelve different industries. We observe that each industry has its own field of interest with respect to the reasons for financial performance changes. Therefore, training each sector individually is required for better classification. However, the amount of data per industry varies roughly within [5K-60K], and some sectors do not contain enough data to detect causal rationales precisely. This section therefore tests the hypothesis that training on other sectors together helps detect causal sentences in a target sector. For this purpose, we conduct a cross-sector test and compare the performance of models trained on various versions of the CR-COPEC dataset.
First, we fine-tune BERT Base on the equally sized fraction dataset described in Section 3.4.2 and test across sectors. We call this model Sector-only Same-size. We then select neighbor sectors based on the results of the cross-sector test on the valid dataset (Table 5). If a model trained on another sector detects target-sector sentences better than the model trained on the target sector itself, we assume that sector can help train a model. For example, in the case of Sector 8, we select Sector 5 as a neighbor, which achieves 68.32% AUPRC (> 61.59%). After selecting neighbors for each sector, we fine-tune BERT Base on the target and all selected neighbor sector datasets; we call this model the Neighbor model. For an equal-size comparison, we also randomly select data from the Total dataset matching the size each Neighbor model is trained on; the model trained on this dataset is called Total-random. In addition, we train a model on the whole of each sector's dataset, called Sector-only. Finally, we refer to the BERT Base trained on the Total dataset in Section 4.1 as Total-all. We compare the performance of these models in detail below.
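The neighbor-selection rule can be written compactly. Here `cross_auprc[src][tgt]` stands for the validation AUPRC of a same-size model trained on sector `src` and tested on sector `tgt` (as in Table 5); apart from the Sector 8/5 numbers quoted in the text, any values are illustrative:

```python
def select_neighbors(cross_auprc, target):
    """Neighbors of `target` are sectors whose same-size models beat
    the target's own model on the target's validation set."""
    own = cross_auprc[target][target]
    return [src for src in cross_auprc
            if src != target and cross_auprc[src][target] > own]

# Illustrative fragment of Table 5; only 68.32 and 61.59 come from the text.
cross = {
    5: {5: 70.00, 8: 68.32},
    8: {5: 50.00, 8: 61.59},
}
```

So `select_neighbors(cross, 8)` picks Sector 5, matching the example in the text, while Sector 8 is not picked as a neighbor of Sector 5 because the relation is asymmetric.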
From the results in Table 5, we also find that the average performance of models on Sector 6 is the highest (71.19%) and on Sector 11 the lowest (49.27%). This demonstrates a gap in the difficulty of detecting causality across target sectors: detecting causal sentences in Sector 6 is relatively easy, while Sector 11 is more difficult than the other sectors. Furthermore, the Sector-only Same-size model trained on the fraction dataset of Sector 5 shows the highest average performance across all sectors, while the model trained on the fraction dataset of Sector 11 shows the lowest average score.
We interpret these results in terms of sector generality and specificity. Sector 5 appears to consist of more general causality patterns that readily support other sectors, while Sector 11 contains sector-specific information that is harder to transfer. Hence, causal sentences in Sector 11 are both difficult to detect and hard to transfer to other sectors. We assume that since Sector 11 is the finance industry, its unique characteristics account for this sector specificity in our dataset. However, this transfer interaction between sectors is asymmetric, so other sectors can still help Sector 11. Figure 3 compares the performance of the Sector-only, Total-random, Neighbor and Total-all models on each sector. Overall, Neighbor models show significantly higher performance than Sector-only and Total-random. We conduct t-tests between Neighbor and Total-random to determine the statistical significance of the difference in AUPRC (p < .01). In Sectors 1 and 4, Neighbor is higher than Total-all, even though its training dataset is much smaller than the Total dataset.

Effects of Dataset Sizes
We check the effect of the amount of training data on BERT Base trained on CR-COPEC. We increase the amount of training data and verify the causal extraction performance at each step. At each step, we randomly select a set of training data, growing by 10,000 sentences, and report AUPRC on the CR-COPEC test dataset. The AUPRC of BERT Base increases by a large margin (6.64%) from 10K to 20K, then increases gradually until 200K and drops slightly afterward. We provide the detailed results in Figure 4. Figure 3 shows that the gap between Total-all and Neighbor is bigger when the amount of training data is relatively small compared to the Total dataset, as in Sectors 9, 10 and 12.

Explaining Detected Causal Rationales
We analyze the input features used for causal rationale classification. Local Interpretable Model-agnostic Explanations (LIME) (Ribeiro et al., 2016) is an explanation method that makes the predictions of a classifier interpretable by learning a linear interpretable model locally around the prediction. With this technique, we can visualize the most important features affecting a prediction.
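LIME fits a weighted local linear surrogate over perturbed inputs. As a dependency-free illustration of the same idea, the sketch below uses a simpler occlusion-style attribution instead of LIME itself: each word's importance is the drop in predicted causal probability when that word is removed. `prob_causal` is a placeholder for any trained classifier:

```python
def occlusion_attribution(sentence, prob_causal):
    """Word-level attribution by occlusion (a simpler stand-in for LIME):
    score each word by the drop in predicted causal probability when it
    is removed from the sentence."""
    words = sentence.split()
    base = prob_causal(sentence)
    scores = []
    for i in range(len(words)):
        reduced = " ".join(words[:i] + words[i + 1:])
        scores.append((words[i], base - prob_causal(reduced)))
    return scores
```

Words with large positive scores push the model toward the causal class, mirroring the orange highlights in Figure 5; negative scores correspond to the blue ones.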
To see the differences in predictions across models, we applied LIME to our trained models (Figure 5). The probability of a causal sentence from Sector 1 is predicted highest by the Total-all model. The Neighbor model also correctly predicts the answer (> 0.5). Furthermore, the word 'strike' acts as a negative feature for causal rationale in the Sector-only model, but becomes neutral in models trained on more data.
In the example from Sector 4, the sector-specific models, Sector-only and Neighbor, correctly predict the sentence as causal. Meanwhile, the Total-random model regards the term 'depreciation expense' as a negative feature. This is because sentences containing causal rationales about 'depreciation expense' are annotated differently across sectors, so training all sectors together causes confusion in this case.

Comparisons with a Related Dataset
To verify the novelty of CR-COPEC, we compare our data with an existing financial causality detection corpus. For this, we conduct cross-dataset experiments between CR-COPEC and FinCausal Task 1 (Mariko et al., 2020). Specifically, we compare test performance in a cross-dataset setting where we fine-tune BERT Base on 1) the FinCausal practice dataset, then test on both our test dataset and the FinCausal trial dataset, and 2) the CR-COPEC train dataset, then test on both our test dataset and the FinCausal trial dataset.
Table 7 shows that the performance of each model trained on one dataset clearly decreases on its counterpart. This indicates that there are significant differences between the two datasets: the existing dataset cannot fully address causality detection in 10-K reports, and vice versa. We believe this is because the purpose and target text of the datasets differ. Specifically, CR-COPEC aims to extract causal rationale sentences based on accounting items that can cause companies' performance changes, and it targets formal public text. FinCausal, on the other hand, targets relatively casual text, since it tries to elicit comprehensive causality text from financial documents.

Multiple Sentence Modeling
Since the evidence for a causal statement can be scattered across more than one sentence, we conducted additional experiments in sentence n-gram settings to provide context for causal sentences. Each sentence n-gram consists of one target sentence and the n-1 previous sentences, which serve as context. Following the setting of Mariko et al. (2020), we conducted experiments for n = 1, 2, 3, so that the context contains up to two sentences. Since the number of tokens in sentence n-grams often exceeds the maximum input size of BERT (512 tokens), all experiments were performed with Longformer Base (Beltagy et al., 2020), which accepts longer inputs (4,096 tokens). Appendix F provides detailed descriptions of the experimental settings. The results in Table 8 show that multi-sentence (bi- and tri-gram) modeling achieves slightly higher performance at the tri-gram setting than uni-gram modeling, but the difference is not significant (p > 0.2). This implies that, in many cases, the evidence for a causal statement can be found within a single sentence.
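Building the sentence n-grams is straightforward; this sketch prepends up to n-1 preceding sentences from the same document as context (fewer at the start of a document):

```python
def sentence_ngrams(sentences, n):
    """For each target sentence, prepend up to n-1 previous sentences
    from the same document as context."""
    grams = []
    for i, target in enumerate(sentences):
        context = sentences[max(0, i - (n - 1)):i]
        grams.append(" ".join(context + [target]))
    return grams
```

Each n-gram keeps the target's label, so the classifier still predicts causality for the final sentence while seeing its preceding context.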

Experimental Results on a Large Language Model
This section compares the experimental results of a state-of-the-art large language model (ChatGPT; GPT-3.5-Turbo) (Ouyang et al., 2022) with those of our baseline model (BERT Base). For this, 1,000 instances were randomly sampled from the test set.
The experiments were performed with prompting-based zero-shot inference (Zero Shot) and 5-shot inference (Brown et al., 2020), where 5 positive and negative examples from the same sector as the input text were randomly selected from the training set.
In addition, we conducted experiments with both single inference and majority voting over five inferences, following the setting of Arora et al. (2022). Detailed experimental settings and prompts are described in Appendix G.
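The majority-voting aggregation over repeated inferences reduces to taking the most common label; the sampling and prompting details in Appendix G are not reproduced here:

```python
from collections import Counter

def majority_vote(labels):
    """Majority label over repeated inferences (e.g., five samples)."""
    return Counter(labels).most_common(1)[0][0]
```

With an odd number of samples and binary labels, a tie is impossible, so the vote is always well defined.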
Table 9 shows that ChatGPT can carry out part of the causal rationale text analysis without additional fine-tuning, even with zero-shot in-context learning. The 5-shot settings outperform the zero-shot settings by approximately 3%p, and majority voting surpasses single-inference performance. However, these results are still significantly lower (p < 10^-4) than those of the fine-tuned smaller model, BERT Base. This is because ChatGPT's results are sensitive to minor fluctuations in financial condition or changes in minor accounting items (e.g., interest income in Sector 1) within the company's industrial sector. These results imply the need for an annotated corpus based on domain knowledge of the specialized field, even in the era of large language models.

Potential Uses of Dataset
Financial analysts are key information intermediaries in capital markets, and their research, focused on uncovering private information and interpreting public data, is highly valued by investors. The value of analyst research can stem from two main sources, uncovering private information and interpreting public information, as exemplified by studies such as (Ivković and Jegadeesh, 2004; Asquith et al., 2005). To uncover private information and reinterpret public information, analysts often analyze various linguistic patterns in annual reports. Given the sheer volume of corporate filings today, analysts spend significant time and money analyzing these reports, and their reliance on manual annotation also reduces the accuracy of the analysis. In contrast, NLP models trained on our dataset can automate annotation, enhancing both speed and scalability. Moreover, analysts typically specialize in specific areas, which requires domain knowledge of industry specifics and market patterns, including industry dynamics, the competitive landscape and the regulatory environment, along with the unique financial metrics and varying accounting practices across industries. Analysts may need to grasp new metrics and accounting practices to understand how financial statements are prepared and analyzed, allowing them to make meaningful comparisons. Our dataset is expected to significantly reduce the learning curve and associated costs for analysts transitioning between industries.

Conclusion
We introduce a novel large-scale dataset for extracting causal rationales from financial reports across various sectors. The dataset was annotated by considering accounting factors specific to each industry. We further validate the process of building our CR-COPEC dataset. Finally, through qualitative and quantitative analyses, we observe that the model trained on CR-COPEC recognizes the cues of causal sentences. We hope that our work will promote the study of causality detection in the financial text domain.

Ethics Statement

First, a license is not required to access the database. In addition, information presented on www.sec.gov is considered public information and may be copied or further distributed by users of the web site without the SEC's permission. Second, we paid all annotators in accordance with the minimum wage standard of national law. Besides, the general annotators were full-time employees.

Limitation
CR-COPEC may include two types of bias: 1) protocol bias and 2) annotators' subjectivity bias. First, we clearly state in this paper that we annotate sentences based on annotation rules regarding financial statement items. Thus, our dataset may use different criteria from other causality detection problems (e.g., the FinCausal dataset). Second, we are aware that subjective judgments by annotators may occur in the annotation process. Inconsistency between annotators may therefore arise for some sentences, which may degrade data quality and model performance. To control this issue, we verified the human annotation process (Section 3.3) and obtained substantial agreement by measuring the kappa score between general annotators and additional annotators on 1,000 sampled sentences. The reliability of each general annotator is 0.7 or higher, and the agreement between the general annotators is 0.698.
CR-COPEC is designed to detect causal statements without considering the positive or negative impact of performance changes, which can limit further analyses. Additional analysis based on subcategories such as the positive/negative/neutral impact of causal statements on corporate performance changes is required in the future.

Annotation files are in Comma-Separated Values (CSV) format, with all text from each corporation's report gathered in a single CSV file. Each file consists of 6 columns: sentence, section, document, date, location (loc) and true label. The sentence column contains the text of the report, split by sentence. The section column indicates the SIC code of the corporation. The document column is the unique document ID of the report. Date is the date the report was published. The location (loc) column gives the sentence number within the text. Finally, true label is the human-annotated label for the sentence.
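A file in this six-column format can be read with the standard `csv` module. The row below is a hypothetical example for illustration; the actual values in the released files may differ:

```python
import csv
import io

# Hypothetical row illustrating the six-column annotation format.
sample = (
    "sentence,section,document,date,loc,true label\n"
    '"Gas revenue increased because of a 39% price increase.",'
    "4,0001234,1997-03-01,12,1\n"
)

def load_annotations(f):
    """Read one corporation's annotation CSV into a list of dicts."""
    return list(csv.DictReader(f))

rows = load_annotations(io.StringIO(sample))
```

Keeping `document` and `loc` alongside each sentence is what makes the multi-sentence (n-gram) experiments possible, since context can be reconstructed by sorting on those columns.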

D Details on Annotation Environment
Each file was assigned to an annotator, who coded sentences based on the annotation guidelines and the previous context. Specifically, for consecutive sentences related to the same accounting item, only the sentences that provide a reason for the performance change are labeled as causal rationales. For example, although the sentence at location 11 reports that product sales (which could be a causal rationale for a performance change of a company in section 1) increased, it is difficult to consider it a causal rationale since no reason is given. On the other hand, the sentence at location 12 is regarded as a causal rationale because the reason for the increase in product sales is specified and clear from the previous context. Note that, since document and location information are provided in the final version of the file, modeling and inference over multiple sentences are possible, as in the experiments in Section 4.7.

Figure 2: Overview of verifying the human annotations of the general annotators.

Figure 3: Test performances of classification models on the valid dataset of each sector (AUPRC). The size of the dataset for Neighbor and Total-random is presented in parentheses.

Figure 4: AUPRC of BERT Base by the number of training data from CR-COPEC.

Figure 5: Interpretation of predictions from (A) Sector-only, (B) Total-random, (C) Neighbor and (D) Total-all models with LIME. Features contributing to causal rationale are highlighted in orange; the opposite features are in blue.

Figure 7: Illustrative example of the annotation form.

Table 1: Dataset (CR-COPEC) composition: number of sentences and documents in each sector (the ratio of causes to total sentences is in parentheses).

Table 3: The result of the coverage analysis of causal sentences extracted by the keywords.

Table 4: Total dataset causal composition: the ratio of causal sentences in each sector (%).

Table 5: Cross-sector test with models trained on the same size of each sector (AUPRC, %).

Table 7: The experimental results of the cross-dataset setting between CR-COPEC and the FinCausal dataset (AUPRC).