Causal Inference from Text: Unveiling Interactions between Variables

Adjusting for latent covariates is crucial for estimating causal effects from observational textual data. Most existing methods only account for confounding covariates that affect both treatment and outcome, potentially leading to biased causal effects. This bias arises from insufficient consideration of non-confounding covariates, which are relevant only to either the treatment or the outcome. In this work, we aim to mitigate the bias by unveiling interactions between different variables to disentangle the non-confounding covariates when estimating causal effects from text. The disentangling process ensures covariates only contribute to their respective objectives, enabling independence between variables. Additionally, we impose a constraint to balance representations from the treatment group and the control group to alleviate selection bias. We conduct experiments on two different treatment factors under various scenarios, and the proposed model significantly outperforms recent strong baselines. Furthermore, our thorough analysis of earnings call transcripts demonstrates that our model can effectively disentangle the variables, and further investigations into real-world scenarios provide guidance for investors to make informed decisions.


Introduction
Causal Inference (Holland, 1985; Pearl, 2000; Morgan and Winship, 2007; Imbens and Rubin, 2015; Hernan and Robins, 2020) aims to identify how a treatment variable affects an outcome variable, for example, estimating the effect of the "political risk" (treatment) faced by a company on its "stock movement" (outcome). Early research efforts (Abadie and Imbens, 2004; Bardone-Cone and Cass, 2006; Kurth et al., 2006; Murnane and Willett, 2010; Keele, 2015) focusing on conducting randomized control trials (RCTs) to estimate causal effects from structured numeric data have made significant progress. However, these methods require extensive effort in designing the treatment assignment mechanism (Halloran and Struchiner, 1995) and may suffer from ethical issues. Natural Language Processing (NLP) researchers are therefore increasingly interested in estimating causal effects from observational, unstructured text. Early literature (Choudhury et al., 2016; Olteanu et al., 2017; Pryzant et al., 2018) largely focuses on transforming texts into high-dimensional vectors using lexical features for confounding adjustment. Recent research primarily focuses on learning adequate representations through advanced NLP models. For example, Veitch et al. (2020) fine-tuned BERT (Devlin et al., 2019) to produce contextual text representations for efficient estimation of causal effects. Later, Pryzant et al. (2021) introduced strategies involving treatment enhancement and text adjustment to estimate the causal effects related to linguistic properties.
Despite their efficacy, such approaches operate under the assumption that text encompasses only confounding covariates. This assumption raises a potential issue due to the possible existence of unobserved non-confounding covariates that pertain exclusively to either the treatment or the outcome. The causal estimation may be biased if we fail to differentiate non-confounding covariates from confounding ones through effective modeling of the interactions between variables when learning an estimation function (Pearl, 2010; Wooldridge, 2016). As illustrated in Figure 1, to accurately estimate the causal effect of the treatment T (e.g., Political Risk) on the outcome Y (e.g., Stock Movement), we intentionally omit the influence of Z_y (e.g., Expected Revenue) when predicting the treatment. Symmetrically, we do not account for the influence of Z_t (e.g., Geographical Location) when predicting the outcome Y, as such inclusion could obfuscate our ability to discern the true effects originating from T.
In this paper, we propose a framework named Disentangling Interaction of VAriables (DIVA), specifically tailored for causal inference from text. We assume that the text carries sufficient information to identify the causal effects and consider the existence of non-confounding covariates. Drawing on the success of latent variable models for causal inference in the literature (Louizos et al., 2017; Zhang et al., 2021), we use a Variational Auto-Encoder (VAE) (Kingma and Welling, 2014) to infer confounding and non-confounding covariates. Additionally, we design a disentanglement module to ensure that covariates only contribute to their specific objectives, enabling independence between covariates. Furthermore, we propose to impose a constraint to balance representations from the treatment group and the control group, which helps to mitigate selection bias.
Our contributions are summarized as follows:
• We propose the Disentangling Interaction of VAriables (DIVA) approach, tailored to mitigate the bias issue in causal inference from text.

Related Work
Causal estimation with text data Early efforts in estimating causal effects from text focused on using lexical features for confounding adjustment (Choudhury et al., 2016; Choudhury and Kıcıman, 2017; Olteanu et al., 2017). Later studies investigating causal effects were devoted to effectively converting text into low-dimensional representations (Falavarjani et al., 2017; Pham and Shen, 2017; Pryzant et al., 2018; Weld et al., 2020; Cheng et al., 2021). Another line of work focused on using causal formalisms to make NLP methods more reliable (Wood-Doughty et al., 2018, 2021; Feder et al., 2021, 2022). Most recently, pre-trained language models such as BERT (Devlin et al., 2019) significantly benefited causal estimation. For example, Veitch et al. (2020) fine-tuned BERT using multi-task learning to produce contextual text representations for efficient estimation of causal effects. Later, Pryzant et al. (2021) introduced treatment-boosting and text-adjusting strategies to estimate the causal effects of linguistic properties.
Our work differs from these works in three main aspects. First, we aim to mitigate the bias that arises from insufficient consideration of non-confounding covariates in causal inference. Second, we disentangle non-confounding covariates by encouraging independence among the variables, ensuring that each one contributes solely to its respective objective. Third, we introduce regularization to balance representations from the treatment group and the control group, which helps to mitigate selection bias.

Causal inference with latent variable model
Latent variable models have demonstrated their effectiveness and gained significant popularity in causal inference (Fong and Grimmer, 2016; Sridhar and Getoor, 2019; Roberts et al., 2020). For example, Louizos et al. (2017) used a Variational Auto-Encoder (VAE) (Kingma and Welling, 2014) to infer confounders from a latent space to estimate the effect of job training on subsequent employment. Rakesh et al. (2018) inferred the causation that leads to spillover effects between pairs of units by incorporating a VAE to learn latent attributes as confounders. We follow the line of decomposing latent factors for causal inference (Hassanpour and Greiner, 2020; Wu et al., 2020; Vowels et al., 2020; Yang et al., 2021; Zhang et al., 2021). However, there are several key distinctions in our approach. Firstly, while previous studies attempted to disentangle variables for causal inference in structured numeric data, we specifically focus on estimating causal effects from textual data. The inherently high-dimensional nature of text features presents substantial challenges in disentangling various variables within the latent space, leading to biased causal estimations. Secondly, we tailor distinct constraints to effectively model interactions among diverse variables, ensuring that each variable primarily contributes to its specific objective and promotes maximal independence. Lastly, we optimize the maximum mean discrepancy loss to achieve a balanced representation of samples from both treatment and control groups.
NLP for earnings call transcripts Earnings call transcripts (Frankel et al., 1997; Bowen et al., 2001; Price et al., 2011) have gained much popularity in financial analysis using NLP tools. Early work by Wang and Hua (2014) formulated financial risk prediction as a text regression task and used handcrafted features to improve SVM performance. Later, researchers (Qin and Yang, 2019; Sawhney et al., 2020; Sang and Bao, 2022; Pataci et al., 2022; Shah et al., 2022; Yang et al., 2022) focused on stock prediction by employing sophisticated neural networks with financial pragmatic features. Another line of work focused on analyzing the content of earnings call transcripts (Sawhney et al., 2021; Alhamzeh et al., 2022). For example, Keith and Stent (2019) examined analysts' decision-making behavior as it pertains to the language content of earnings calls. More in line with our work, Hassan et al. (2017) adapted linguistic tools to investigate the extent of political risk faced by firms over time and its correlation with stocks, hiring, and investment. In contrast with this prior work, our primary focus lies on estimating causal effects between financial interests, such as the impact of political risk on stocks, rather than measuring their correlations.

Preliminaries
Causal inference from text aims to estimate causal effects based on observed textual data. Let D = {(X_i, T_i, Y_i)}_{i=1}^{N} denote the observational dataset. Here, X_i is the observed textual data (e.g., an earnings call transcript) for the i-th example (e.g., a company), and T_i ∈ {0, 1} is the binary treatment variable. T_i = 1 indicates that the i-th example belongs to the treatment group (e.g., a company faced high political risk). Conversely, T_i = 0 indicates that the i-th example belongs to the control group (e.g., a company faced low or no political risk). The causal effect τ_i for the i-th example is defined as the expected difference between its potential outcomes Y_i (e.g., stock volatility) under treatment and control, known as the Individual Treatment Effect (ITE):

τ_i = Y_i(1) − Y_i(0).

One of the most challenging problems in estimating causal effects from observational data is the impossibility of simultaneously observing both potential outcomes Y_i(0) and Y_i(1) for a given example (Rubin, 1974; Holland, 1985). In other words, D only includes the observed outcome Y_i for each example, but not the unobserved counterfactual outcome, which refers to the potential outcome for the i-th example in the alternative group. Nonetheless, it is feasible to identify the Conditional Average Treatment Effect (CATE) and the Average Treatment Effect (ATE) from observational data under certain assumptions (Spława-Neyman et al., 1990; Rubin, 1974; Pearl, 2009):

Assumption 1 (Stable Unit Treatment Values Assumption (SUTVA)): The potential outcomes of one example are not influenced by the treatment assigned to other examples, and there are no varying forms or levels of the treatment that could result in different potential outcomes: Y_i(T_1, ..., T_N) = Y_i(T_i).

Assumption 2 (Unconfoundedness): The potential outcomes are conditionally independent of the treatment given a set of observed covariates: (Y(1), Y(0)) ⊥⊥ T | X.
Assumption 3 (Positivity): Every individual has a non-zero probability of receiving treatment or control for all observed variables: 0 < P (T = 1|X = x) < 1.
Figure 2: The DIVA framework, comprising treatment prediction, outcome prediction, an orthogonal module, and a maximum mean discrepancy constraint.

In line with the potential outcome framework outlined by Spława-Neyman et al. (1990) and Rubin (1974), and with the above assumptions, we can define the CATE as follows:

τ(x) = E[Y_i(1) − Y_i(0) | X_i = x],

where Y_i(1) and Y_i(0) are the potential outcomes had the i-th individual received the treatment or control, and X is the observed variable which is sufficient for causal estimation. The ATE can be written as:

τ = E_x[τ(x)].

Denoting Q(t, x) = E[Y | T = t, X = x] as the potential outcome of observing treatment T = t for an example with X = x, the objective is to learn an estimation function Q(t, x) that can accurately predict both the observed outcome and the counterfactual outcome from D. Therefore, we can plug in Q to estimate the CATE:

τ̂(x) = Q(1, x) − Q(0, x).
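For concreteness, the plug-in step can be sketched in a few lines. The outcome model `toy_Q` below is a hypothetical stand-in for a learned Q(t, x); only the plug-in logic itself follows the paper.

```python
import numpy as np

def estimate_effects(Q, X):
    """Plug-in estimation of CATE and ATE from a learned outcome model Q(t, x).

    Q is any callable approximating E[Y | T = t, X = x]; here we only assume
    it accepts a treatment indicator and a feature vector.
    """
    cate = np.array([Q(1, x) - Q(0, x) for x in X])  # tau_hat(x_i) = Q(1, x_i) - Q(0, x_i)
    ate = cate.mean()                                # ATE estimate: average of the CATEs
    return cate, ate

# Toy illustration with a hypothetical linear outcome model:
# Y = 2*T + x[0], so the true effect is 2 for every example.
toy_Q = lambda t, x: 2.0 * t + x[0]
X = np.array([[0.5], [1.0], [-0.3]])
cate, ate = estimate_effects(toy_Q, X)  # cate -> [2.0, 2.0, 2.0], ate -> 2.0
```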

DIVA: Disentangling Interaction of VAriables
In this section, we present the proposed Disentangling Interaction of VAriables (DIVA) framework (Figure 2) for causal inference from textual earnings call transcripts. Although previous research (Veitch et al., 2020; Pryzant et al., 2021) has explored estimating causal effects from text, one of the core contributions of our work is that we disentangle the various variables to effectively model the interactions among them. This in turn enables us to learn a more accurate estimation function Q for predicting outcomes, thereby reducing the bias in the causal estimation.
Our proposed DIVA framework consists of a few steps. First, we extract the contextualized text representation from a pre-trained language model. Following that, we employ a variational auto-encoder to determine the posterior distribution over the various latent variables. Once this distribution is obtained, we use the variable disentanglement module to encourage independence among the variables, ensuring that each one contributes solely to its respective objective. Next, we utilize the disentangled variables to learn the Q function via the outcome prediction task. Finally, we plug the trained Q into a pre-determined statistic to estimate the ATE.

Text Encoder
Given a transcript x = [w_1, ..., w_n] that consists of n words, we adopt the pre-trained language model (PLM) FinBERT (Araci, 2019) to obtain the contextual representation h for each transcript:

h = FinBERT(x). (5)

Latent Variable Inducer
Inspired by recent works (Louizos et al., 2017; Zhang et al., 2021), we use the VAE to induce latent variables. Given the contextualized representation h, we compute the approximate variational posterior q_ϕ(z|h) using the inference network Φ(h; ϕ):

µ = W_µ h + b_µ,  σ = exp(W_σ h + b_σ),  z = µ + σ ⊙ ϵ,

where W_µ, W_σ, b_µ, and b_σ are parameters of the two MLPs. µ and σ define a multivariate Gaussian distribution with a diagonal covariance matrix, and ϵ ∼ N(0, I). Then, we sample from q_ϕ(z|h) ≃ N(µ, σ²I) to generate z ∈ R^l as the latent representation, where l is the dimension of the representation. Under the assumption that a transcript contains not only the confounding covariates, which affect both treatment and outcome, but also the non-confounding covariates specific to either the treatment or the outcome, we use separate inference networks Φ_c(h; ϕ_c) for inferring the confounding covariates z_c, and Φ_t(h; ϕ_t) and Φ_y(h; ϕ_y) for inferring the non-confounding covariates z_t and z_y, respectively. We use a one-layer parameterized MLP Θ(h; θ) := p_θ(h | z_t, z_c, z_y) as the decoder to reconstruct h. The objective of the latent variable inducer is to maximize the evidence lower bound (ELBO):

L_elbo = E_q[log p_θ(h | z_t, z_c, z_y)] − Σ_k KL(q_{ϕ_k}(z_k | h) ‖ p(z_k)),

where k ∈ {c, t, y}, and the prior p(z_k) follows the Gaussian distribution N(0, I).
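The inference step for one latent variable can be sketched as follows. The single-linear-layer form, the log-variance parameterization, and the closed-form KL term against the standard normal prior are standard VAE ingredients assumed here for illustration, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def infer_latent(h, W_mu, b_mu, W_sigma, b_sigma):
    """One inference network Phi(h): returns a sample z = mu + sigma * eps
    and KL(N(mu, sigma^2 I) || N(0, I)).

    Parameter names mirror the paper; predicting log sigma^2 (for numerical
    stability) is an illustrative assumption.
    """
    mu = W_mu @ h + b_mu
    log_var = W_sigma @ h + b_sigma
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mu.shape)        # eps ~ N(0, I)
    z = mu + sigma * eps                       # reparameterization trick
    # Closed-form KL divergence between diagonal Gaussians and N(0, I)
    kl = 0.5 * np.sum(mu**2 + sigma**2 - log_var - 1.0)
    return z, kl

h = np.ones(4)
d = 3  # latent dimension l
z, kl = infer_latent(h, np.zeros((d, 4)), np.zeros(d), np.zeros((d, 4)), np.zeros(d))
# With zero weights the posterior is exactly N(0, I), so the KL term is 0.
```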

Latent Variable Disentanglement
Despite the successful application of decomposing variables in previous work (Zhang et al., 2021), the high-dimensional nature of text features presents significant obstacles to disentangling different variables in a latent space, leading to biased causal estimation. As will be shown in Section 5.2 (e.g., TEDVAE vs. CEVAE), considering only non-confounding covariates, without the ability to effectively model interactions between different variables, fails to consistently achieve better performance on textual data.
To address this issue, we tailor distinct constraints to effectively disentangle non-confounding covariates from confounding ones, ensuring that each variable primarily contributes to its specific objective and promotes maximal independence.
Specifically, we first minimize the Maximum Mean Discrepancy (MMD) (Gretton et al., 2012) loss to balance representations from the treatment group and the control group:

L_mmd = Σ_k M(z_k^treat; z_k^contl),

where M(·; ·) denotes the maximum mean discrepancy metric, and z_k^treat and z_k^contl are the representations in the treatment group and the control group, respectively. The nice property of this loss is that minimizing it essentially reduces the discrepancy between the two groups, encouraging the satisfaction of the positivity assumption. Concurrently, it promotes the inference network to generalize from the factual to the counterfactual domain, leading to better counterfactual inference (Johansson et al., 2016).
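A minimal sketch of the MMD term is given below. The paper only specifies the MMD metric of Gretton et al. (2012); the RBF kernel and the bandwidth parameter `gamma` are assumptions made here for illustration.

```python
import numpy as np

def mmd_rbf(X, Y, gamma=1.0):
    """Squared MMD between samples X (treatment group) and Y (control group)
    under an RBF kernel: MMD^2 = E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    def k(A, B):
        # Pairwise squared Euclidean distances, then the Gaussian kernel
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

a = np.zeros((5, 2))
b = np.zeros((5, 2))
print(mmd_rbf(a, b))  # identical groups -> 0.0
```

Minimizing this quantity pulls the two groups' latent distributions together, which is exactly the balancing effect described above.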
Next, we introduce an orthogonal loss to maximize the independence between z_t, z_c, and z_y as much as possible:

L_ort = Σ_{k,v} ‖ Z_k^⊤ Z_v − δ_kv I ‖²_F,

where k, v ∈ {t, c, y}, Z_k stacks the representations z_k within a batch, δ_kv equals 1 when k = v and 0 otherwise, and I is the identity matrix. Intuitively, we expect that the prediction of the treatment label should primarily rely on z_t and z_c, rather than z_y. To ensure this holds, we introduce the treatment loss:

L_t = CE(f_t(z_t, z_c), t),

where t ∈ {0, 1} indicates whether the transcript belongs to the treatment group, f_t is the treatment classifier, and CE denotes the cross-entropy loss. Similarly, we expect the prediction of the outcome to rely primarily on z_y and z_c, and define the outcome loss:

L_y = O(f_y(z_y, z_c, t), y),

where y ∈ Y is the potential outcome, f_y is the outcome predictor, and O is an MSE loss for real-valued outcomes and a cross-entropy loss for binary outcomes.
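One plausible instantiation of the orthogonality constraint penalizes the cross-correlations between the batch matrices of the three latent variables. The exact form below (pairwise Frobenius penalties over distinct variable pairs, the function name `orthogonal_loss`) is an illustrative assumption, not necessarily the paper's precise loss.

```python
import numpy as np

def orthogonal_loss(Z):
    """Cross-correlation penalty between batches of latent variables.

    Z maps each variable name ('t', 'c', 'y') to a (batch, dim) matrix.
    Driving the cross-products between distinct variables toward zero
    encourages the variables to be (linearly) independent of one another.
    """
    loss = 0.0
    keys = list(Z)
    n = Z[keys[0]].shape[0]
    for k in keys:
        for v in keys:
            if k == v:
                continue
            C = Z[k].T @ Z[v] / n    # (dim_k, dim_v) cross-correlation block
            loss += (C ** 2).sum()   # push cross-correlations to zero
    return loss

rng = np.random.default_rng(0)
B, d = 8, 3
Z = {"t": rng.standard_normal((B, d)),
     "c": rng.standard_normal((B, d)),
     "y": rng.standard_normal((B, d))}
penalty = orthogonal_loss(Z)  # strictly positive for random, correlated batches
```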
The overall objective function of the latent variable disentanglement module is formulated as:

L_dis = α L_mmd + β L_ort + γ L_t + η L_y,

where α, β, γ, and η are hyper-parameters.

Final Training Objective
Following Veitch et al. (2020) and Pryzant et al. (2021), we introduce a Masked Language Model (MLM) objective L_mlm that predicts randomly masked words, in order to adapt the text representation, making it more efficient for treatment and outcome prediction. Our final objective function is a multi-task learning objective:

L = L_elbo + L_dis + λ L_mlm,

where λ is the coefficient that balances the contribution of each component in the training process.

Experiments
We conduct experiments on both semi-synthetic data and real-world application scenarios with two objectives: 1) to empirically evaluate the effectiveness of our proposed model, and 2) to investigate practical questions in the field of finance and gain insights from the application of our model to these real-world scenarios.

Experimental Setup
Baselines The baseline models selected for comparison can be broadly categorized into three groups: deep outcome regression models, latent variable models, and representation learning models. Deep outcome regression models include TARNet and CFRNet (Shalit et al., 2017) and DragonNet (Shi et al., 2019). Latent variable models include CEVAE (Louizos et al., 2017) and TEDVAE (Zhang et al., 2021). Representation learning models include CausalBert (Veitch et al., 2020), which fine-tunes BERT with multi-task learning for causal estimation, and TextCause (Pryzant et al., 2021), which introduces treatment-boosting and text-adjusting strategies to estimate causal effects of linguistic properties. Whenever possible, we generate results for baselines using the officially released source code. In cases where the code is not available at the time of writing, we independently implement the models using the optimal hyper-parameter settings reported in the respective papers. For a fair comparison, we use FinBERT (Araci, 2019) to encode text for generating contextualized feature representations for all models.
Evaluation Metric We evaluate the results using the precision in estimation of heterogeneous effect (PEHE) (Hill, 2011), which reflects the model's individual-level estimation performance:

√PEHE = √( (1/N) Σ_{i=1}^{N} (τ̂_i − τ_i)² ).

Setup Details In our experimental evaluations, each model is trained for 30 epochs with a linear warmup for the first 10% of the training steps. We employ AdamW (Loshchilov and Hutter, 2019) as the optimizer. We set the maximum learning rate at 5e-5 and use a batch size of 86. We select the optimal model weights based on either the accuracy or the MSE loss of the Q function on the development set. We report the average results along with the mean absolute deviations across five runs with randomly initialized parameters.
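Both evaluation metrics used in this paper, √PEHE above and the δATE error reported alongside it, are simple to compute once ground-truth effects are available from the semi-synthetic data:

```python
import numpy as np

def pehe(tau_hat, tau):
    """Root PEHE: individual-level error between estimated and true ITEs."""
    return np.sqrt(np.mean((np.asarray(tau_hat) - np.asarray(tau)) ** 2))

def delta_ate(tau_hat, tau):
    """Absolute error of the ATE estimate (population-level)."""
    return abs(np.mean(tau) - np.mean(tau_hat))

tau = [1.0, 2.0, 3.0]      # ground-truth ITEs (known in semi-synthetic data)
tau_hat = [1.0, 2.0, 5.0]  # model estimates
print(pehe(tau_hat, tau))  # sqrt(mean([0, 0, 4])) = sqrt(4/3)
```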

Experiments on Synthetic Data Dataset
Since the ground-truth causal effects, the ITE τ_i and the ATE τ, are typically inaccessible in real-world scenarios, directly training a model for causal inference is impractical. Therefore, we follow Veitch et al. (2020) and Pryzant et al. (2021), using real text and metadata to generate semi-synthetic data to empirically evaluate our proposed model. We collect 115,880 transcripts from 1,438 companies across twelve different sectors, for earnings calls held between May 2001 and October 2019. Then, we construct different datasets for two distinct treatment variables, political risk (T_pr) and sentiment (T_s), under two separate scenarios: stock volatility (Y_vol) and stock movement (Y_mov). To derive T_pr, we follow Hassan et al. (2017) to calculate the political risk score for each transcript. We then select the top 15,000 transcripts with the highest scores as the treatment group (T_pr = 1), indicating that the company faces high political risk. Conversely, we designate the bottom 15,000 transcripts with the lowest scores as the control group (T_pr = 0), suggesting these companies face lower or no political risk. To derive T_s, we follow Maia et al. (2018) and Araci (2019) to calculate the sentiment score for each transcript, selecting the top 15,000 transcripts with the highest scores as the treatment group (T_s = 1) and the bottom 15,000 transcripts with the lowest scores as the control group (T_s = 0). Finally, we simulate the outcomes using the treatment variable T ∈ {T_pr, T_s} along with the observed covariates C_size and C_sect, which represent the size of the company in terms of the number of full-time employees and the industrial sector in which the company operates.

The best results on each dataset are in bold; the second-best are underlined. The † marker indicates that the p-value is less than 0.05 compared to the second-best results. The parameter setting used is (α=1, β=1, γ=0.5, ϵ=1) for Equations (14) and (15).
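The semi-synthetic recipe described above can be sketched as follows. The functional form, coefficients, and the function name `simulate_outcome` are illustrative assumptions (the paper's actual simulation equations are given in its appendix); the sketch only shows why the true ITE is known by construction.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_outcome(T, C_size, C_sect, effect=1.0, noise=0.1):
    """Illustrative semi-synthetic outcome generator.

    Y depends on the treatment T and the observed covariates C_size
    (company size) and C_sect (sector id), so the true treatment effect
    is known by construction (here: a constant `effect`).
    """
    base = 0.5 * np.log1p(C_size) + 0.2 * C_sect  # assumed covariate effect
    return base + effect * T + noise * rng.standard_normal(T.shape)

T = np.array([1.0, 0.0, 1.0, 0.0])
C_size = np.array([100.0, 5000.0, 250.0, 80.0])
C_sect = np.array([0.0, 3.0, 1.0, 2.0])
y = simulate_outcome(T, C_size, C_sect)
```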
We split the dataset into training, validation, and test sets in an 8:1:6 ratio and conduct experiments in a cross-validated manner, following Egami et al. (2018) and Pryzant et al. (2021). We conduct experiments for the two different treatment variables T_pr and T_s under the scenarios of stock volatility and stock movement, respectively. Detailed statistics of each scenario can be found in the Appendix.

Main Results
As shown in Table 1, DragonNet and CFRNet generally achieve better results than TARNet, suggesting that additional constraints indeed benefit the outcome regression model in causal estimation. For example, DragonNet improves upon TARNet by 0.03 in terms of δATE based on political risk in the stock volatility scenario. We also observe that CausalBert and TextCause generally achieve better results than the deep outcome regression models such as TARNet, DragonNet, and CFRNet, as well as latent variable models such as CEVAE and TEDVAE. This suggests that the inclusion of the masked language modeling task has a positive impact on causal inference from text. Our model consistently outperforms all compared baseline models across both evaluation metrics and under both scenarios. For instance, DIVA demonstrates a significant improvement (with p < 0.05) over the best-performing baselines TextCause and CausalBert.
Interestingly, we observe that TEDVAE struggles to consistently outperform CEVAE. In particular, TEDVAE achieves better results in terms of δATE but performs worse in terms of √PEHE.

Latent Covariates Analysis
To further investigate the influence of the various covariates on model performance, we conduct an in-depth analysis of DIVA, focusing on the disentanglement of different covariates. As shown in Table 2, merely disentangling the non-confounding covariate z_t or z_y from the confounding covariate z_c fails to consistently achieve better results compared to considering only z_c. Our model yields the best performance with the simultaneous disentanglement of z_t, z_c, and z_y. These results underscore the necessity of comprehensive covariate disentanglement, specifically, disentangling both non-confounding covariates z_t and z_y from the confounding covariate z_c, as opposed to a partial or singular focus.

Simulation Sensitivity Analysis
To evaluate the robustness of our proposed DIVA model, we compare it with the two strongest baselines, CausalBert and TextCause, under a different simulation setting (α=1, β=10, γ=0.5, ϵ=4) in Equations (14) and (15). As shown in Table 3, our model remains the best performer under this parameter setting, demonstrating the robustness of DIVA.

Ablation Study
We conducted experiments to examine the effectiveness of the major components of our proposed model.

Real World Scenario Application
To answer the questions "How does the political risk faced by a company affect its stock?" and "How does the sentiment conveyed in a company's earnings call transcript affect its stock?", we apply our proposed model to estimate the treatment effects of political risk and sentiment on actual stock volatility and stock movement.
Stock Volatility Following Qin and Yang (2019) and Kogan et al. (2009), we obtain the stock prices from Yahoo Finance using stock-market-scraper and calculate stock volatility as:

v_[t−µ,t] = √( (1/µ) Σ_{i=t−µ}^{t} (r_i − r̄)² ),

where r_t = P_t / P_{t−1} − 1 is the stock return between the close of trading day t−1 and day t, P_t is the dividend-adjusted closing stock price at t, and r̄ is the mean of r_t over the period of day t−µ to day t. We choose different µ ∈ {3, 7, 15, 30} to evaluate the short-term and long-term causal effects.
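The volatility computation can be sketched in a few lines, assuming a series of dividend-adjusted closing prices; the exact handling of the window boundary is an assumption of this sketch.

```python
import numpy as np

def volatility(prices, mu):
    """Stock volatility over the window [t - mu, t]: the root mean squared
    deviation of the daily returns r_t = P_t / P_{t-1} - 1 from their
    window mean."""
    p = np.asarray(prices, dtype=float)
    r = p[1:] / p[:-1] - 1.0   # daily returns
    r = r[-mu:]                # keep the last mu returns (the window)
    return np.sqrt(np.mean((r - r.mean()) ** 2))

# Constant prices -> zero returns -> zero volatility
print(volatility([10, 10, 10, 10, 10], mu=3))  # -> 0.0
```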
Stock Movement Following Medya et al. (2022), we define stock movement based on v_[t−µ,t], the mean stock volatility over the period of day t−µ to day t.

Result As shown in Figure 3, we observe that the causal effect of political risk on stock increases in the short term (3 days) and begins to decline over time. Conversely, the causal effect of sentiment on stock movement decreases over time.
Analysis To further investigate the effect of political risk on the stock market for different types of companies, we examine the causal effect of political risk faced by companies in different sectors on their stock prices. Figure 4 shows that the stock volatility of companies in Industrial Goods, Real Estate, and Energy is most significantly affected by the political risk they face, while companies in Consumer Cyclical and Technology are affected to the smallest extent. The political risks faced by Healthcare companies have no effect on their stock volatility.

Conclusion
In this paper, we propose DIVA, a novel framework designed specifically for causal inference from text. We verify its effectiveness by estimating the causal effects of treatment factors (e.g., political risk or sentiment) on a company's stock (e.g., stock volatility or movement) from earnings conference call transcripts. The experimental results demonstrate that our model can effectively disentangle representations with different functionalities from text features by imposing constraints and utilizing multi-task learning. Furthermore, our analysis of real-world applications highlights the causal relationship between the political risks faced by a company and its stock prices, providing valuable insights for the finance and investment industry.

Limitations
Our work has a number of limitations. First, we constructed a balanced dataset in which the number of transcripts in the treatment group is equal to that in the control group. While this facilitated relatively easier causal estimation, it does not account for the selection bias that commonly exists in real-world scenarios. Consequently, causal estimation in such scenarios becomes more challenging. Second, we modeled the relation between treatment factors and stocks as a linear relation. However, in reality, this relationship is likely to be much more complex and nonlinear. A more precise modeling of this relationship would enhance the accuracy of our causal estimation.

Figure 1 :
Figure 1: The causal diagram for our proposed model. Shaded nodes denote observed variables. Transparent nodes denote latent covariates derived from the transcripts; among these, the nodes outlined in red represent non-confounding covariates that impact only either the treatment T or the outcome O, whereas the node outlined in black denotes the confounding covariate that influences both T and O.
We also report the error of the ATE estimation, δATE = |τ − (1/N) Σ_{i=1}^{N} τ̂_i|, which measures the model's population-level estimation performance.

Figure 3 :
Figure 3: Causal effect of political risk and sentiment on the actual stock over trading days.

Figure 4 :
Figure 4: Causal effect of political risk on stock volatility over companies in different sectors.

Table 1:
The causal estimation results of different treatment factors on stock volatility and stock movement. Lower is better.
As shown in Table 3, our DIVA model consistently outperforms both CausalBert and TextCause across both evaluation metrics and under both scenarios. These results suggest that the superior performance of our model is not sensitive to changes in the simulation parameter setting.

Table 4
Table 4 shows the ablation results on the stock volatility and stock movement scenarios. We observe that each component, namely L_mlm, L_mmd, and L_ort, contributes to the overall performance of the model. Specifically, with the removal of L_mmd, the performance of the full model drops considerably in terms of √PEHE. Similarly, removing L_mlm results in a considerable drop in performance, measured by δATE. These observations demonstrate the vital role played by the L_mmd regularization term, which encourages closer representations of individuals from different groups in the latent space. Incorporating the L_mlm term benefits the estimation of CATE from text data; this phenomenon aligns with previous studies such as Veitch et al. (2020) and Pryzant et al. (2018).