Bayesian Topic Regression for Causal Inference

Causal inference using observational text data is becoming increasingly popular in many research areas. This paper presents the Bayesian Topic Regression (BTR) model that uses both text and numerical information to model an outcome variable. It allows estimation of both discrete and continuous treatment effects. Furthermore, it allows for the inclusion of additional numerical confounding factors next to text data. To this end, we combine a supervised Bayesian topic model with a Bayesian regression framework and perform supervised representation learning for the text features jointly with the regression parameter training, respecting the Frisch-Waugh-Lovell theorem. Our paper makes two main contributions. First, we provide a regression framework that allows causal inference in settings when both text and numerical confounders are of relevance. We show with synthetic and semi-synthetic datasets that our joint approach recovers ground truth with lower bias than any benchmark model, when text and numerical features are correlated. Second, experiments on two real-world datasets demonstrate that a joint and supervised learning strategy also yields superior prediction results compared to strategies that estimate regression weights for text and non-text features separately, being even competitive with more complex deep neural networks.


Introduction
Causal inference using observational text data is increasingly popular across many research areas (Keith et al., 2020). It expands the range of research questions that can be explored with text data across various fields, such as the social and data sciences, adding to an extensive literature of text analysis methods and applications (Grimmer and Stewart, 2013; Gentzkow et al., 2019). Where randomized controlled trials are not possible, observational data is often the only source of information, and statistical methods need to be deployed to adjust for confounding biases. Text data can serve as a proxy for otherwise unobserved confounding variables, be a confounding factor in itself, or even represent the treatment or outcome variable of interest.
The framework: We consider causal inference settings in which the treatment variable may be binary, categorical or continuous. In our setting, text can be either a confounding factor or a proxy for a latent confounding variable. We also allow for additional non-text confounders (covariates). To the best of our knowledge, we are the first to provide such a statistical inference framework.
Considering both text and numerical data jointly can not only improve prediction performance, but can be crucial for conducting unbiased statistical inference. When treatment and confounders are correlated with each other and with the outcome, the Frisch-Waugh-Lovell theorem (Frisch and Waugh, 1933; Lovell, 1963), described in Section 2.2, implies that all regression weights must be estimated jointly; otherwise the estimates will be biased. Text features themselves are 'estimated data': if they stem from supervised learning that estimated the text features with respect to the outcome variable separately from the numerical features, the resulting estimated (causal) effects will be biased.
Our contributions: With this paper, we introduce a Bayesian Topic Regression (BTR) framework that combines a Bayesian topic model with a Bayesian regression approach. This allows us to perform supervised representation learning for text features jointly with the estimation of regression parameters that include both treatment and additional numerical covariates. In particular, information about dependencies between outcome, treatment and controls does not only inform the regression part, but directly feeds into the topic modelling process. Our approach aims at estimating 'causally sufficient' text representations in the spirit of Veitch et al. (2020). We show on both synthetic and semi-synthetic datasets that our BTR model recovers the ground truth more accurately than a wide range of benchmark models. Finally, we demonstrate on two real-world customer review datasets, Yelp and Booking.com, that a joint supervised learning strategy using both text and non-text features also improves prediction accuracy of the target variable compared to a 'two-step' estimation approach with the same models. This does not come at the cost of higher perplexity scores on the document modelling task. We also show that relatively simple supervised topic models with a linear regression layer that follow such a joint approach can even compete with much more complex, non-linear deep neural networks that do not follow the joint estimation approach. Veitch et al. (2020) introduce a framework to estimate causally sufficient text representations via topic and general language models. Like us, they consider text as a confounder. Their framework exclusively focuses on binary treatment effects and does not allow for additional numerical confounders. We extend this framework.
Causal inference framework with text: This general framework hinges on the assumption that, through supervised dimensionality reduction of the text, we can identify text representations that capture the correlations with the outcome, the treatment and other control variables. Assume we observe iid data tuples D_i = (y_i, t_i, W_i, C_i), where for observation i, y_i is the outcome, t_i is the treatment, W_i is the associated text, and C_i are other confounding effects for which we have numerical measurements. Following the notational conventions set out in Pearl (2009), define the average treatment effect on the treated (ATT) as

ATT = E[y(1) − y(0) | t = 1],

where y(1) and y(0) denote the potential outcomes under treatment and no treatment. In the spirit of Veitch et al. (2020), we assume that our models can learn a supervised text representation Z_i = g(W_i, y_i, t_i, C_i) which, in our case, together with C_i blocks all 'backdoor paths' between y_i and t_i, so that we can measure the causal effect

ATT = E[ E[y | t = 1, Z, C] − E[y | t = 0, Z, C] | t = 1 ].

Intuitively, to obtain such a Z_i, and consequently an unbiased treatment effect, one should estimate the text features in a supervised fashion, taking into account the dependencies between W_i, y_i, t_i, and C_i.

Estimating Conditional Expectations
To estimate the ATT, we need to compute the conditional expectation function (CEF) E[y|t, Z, C]. Using regression to estimate this CEF, we can write

E[y|t, Z, C] = f(t, Z, C; Ω), (1)

where f() is the function of our regression equation that we need to define, and Ω are its parameters. Section 2.3 covers the text representation function g(). For now, let us simply assume that we obtain Z in a joint supervised estimation with f(). The predominant assumption in causal inference settings in many disciplines is a linear causal effect assumption. We also follow this approach, for the sake of simplicity. However, the requirement that the text representations Z be estimated jointly and in a supervised fashion, so that they can predict y, t (and if relevant C) and be considered 'causally sufficient', is not constrained to the linear case (Veitch et al., 2020). Under the linearity assumption, the CEF of our regression can take the form

y = tω_t + Zω_z + Cω_c + ε. (2)

When the CEF is causal, the regression estimates are causal (Angrist and Pischke, 2008). In such a case, ω_t measures the treatment effect.

Regression decomposition theorem: The Frisch-Waugh-Lovell (FWL) theorem (Frisch and Waugh, 1933; Lovell, 1963) implies that the supervised learning of text representations Z and regression coefficients ω cannot be conducted in separate stages, but must instead be learned jointly. The FWL theorem states that a regression such as (2) can only be decomposed into separate stages, with mathematically unaltered coefficient estimates, if for each partial regression we were able to residualize both outcome and regressors with respect to all regressors that have been left out. In general, for a regression y = Xω + ε, the projection matrix P = X(X'X)^{-1}X' produces predictions ŷ when applied to y. Likewise, the 'residual maker' matrix M = I − P, P's complement, produces the residuals. If we could fully residualize t with respect to all confounders, i.e. estimate

M_{c,z} y = (M_{c,z} t) ω_t + ε, (3)

then the estimates ω̂_t of the treatment effect ω_t in equations (2) and (3) would be mathematically identical (full theorem and proof in Appendix A).
Here, M_{c,z} residualizes t with respect to the confounders C and Z. This is, however, infeasible, since Z itself must be estimated in a supervised fashion, learning the dependencies towards y, t and C. Equation (2) must therefore be learned jointly, to infer Z and the CEF in turn. An approach in several stages cannot, in such a setup, fully residualize t from all confounders, and estimation results would therefore be biased. What is more, if incorrect parameters are learned, out-of-sample prediction may also suffer. We demonstrate this on both synthetic and semi-synthetic datasets (Sections 5 and 6).
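To make this concrete, the following small simulation (with hypothetical coefficients, not from the paper) shows how a two-stage estimate of the treatment weight is biased when the treatment is correlated with a confounder, while joint estimation recovers it:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
c = rng.normal(size=n)                       # confounder
t = 0.8 * c + rng.normal(size=n)             # treatment, correlated with c
y = 2.0 * t + 1.5 * c + rng.normal(size=n)   # true treatment effect = 2.0

# Joint estimation: regress y on [t, c] together.
X = np.column_stack([t, c])
w_joint = np.linalg.lstsq(X, y, rcond=None)[0]

# Two-stage estimation: first regress y on t alone, ignoring c.
w_t_two_stage = (t @ y) / (t @ t)

print(w_joint[0])       # close to the true effect 2.0
print(w_t_two_stage)    # biased upward (about 2.73) by the omitted confounder
```

The bias of the two-stage estimate equals ω_c · Cov(t, c)/Var(t), which vanishes only when treatment and confounder are uncorrelated.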

Supervised topic representations
Topic models are a popular choice of text representation in causal inference settings (Keith et al., 2020) and in modelling with text as data in the social sciences in general (Gentzkow et al., 2019). We focus on this text representation approach for the function g() in our joint modelling strategy.

BTR: We create BTR, a fully Bayesian supervised topic model that can handle numerical metadata as regression features and labels. Its generative process builds on LDA-based models in the spirit of Blei and McAuliffe (2008). Given our focus on causal interpretation, we opt for a Gibbs sampling implementation: Gibbs sampling provides asymptotically exact samples of the target density, while (neural) variational inference does not (Robert and Casella, 2013). Blei et al. (2017) point out that MCMC methods are preferable over variational inference when the aim of the task is to obtain asymptotically precise estimates.

rSCHOLAR: SCHOLAR (Card et al., 2018) is a supervised topic model that generalises both sLDA (Blei and McAuliffe, 2008), as it allows for predicting labels, and SAGE (Eisenstein et al., 2011), as it jointly models covariates by 'factorising' its topic-word distributions (β) into deviations from the background log-frequency of words and deviations based on covariates. SCHOLAR is solved via neural variational inference (Kingma and Welling, 2014; Rezende et al., 2014). However, it was not primarily designed for causal inference. We extend SCHOLAR with a linear regression layer (rSCHOLAR) to allow direct comparison with BTR. That is, its downstream layer is y = Aω, where A = [t, C, θ] is the design matrix in which θ represents the estimated document-topic mixtures, and ω is the regression weight vector. This regression layer is jointly optimized with the main SCHOLAR model via backpropagation using ADAM (Kingma and Ba, 2015), replacing the original downstream cross-entropy loss with a mean squared error loss.
Other recent supervised topic models that can handle covariates are, for example, STM (Roberts et al., 2016) and DOLDA (Magnusson et al., 2020). DOLDA was designed neither for regression nor for causal inference setups. Topic models in the spirit of STM incorporate document metadata, but in order to better predict the content of documents rather than to predict an outcome. Many supervised topic models for regression have been suggested over the years. Blei and McAuliffe (2008) optimize their sLDA model with respect to the joint likelihood of the document data and the response variable using VI. MedLDA (Zhu et al., 2012) optimizes with respect to the maximum margin principle, Spectral-sLDA (Wang and Zhu, 2014) proposes a spectral decomposition algorithm, and BP-sLDA (Chen et al., 2015) uses backward propagation over a deep neural network. Since BP-sLDA is reported to outperform sLDA, MedLDA and several other models, we include it in our benchmark list of two-stage models. We include a Gibbs-sampled sLDA to have a two-stage model in the benchmark list that is conceptually very similar to BTR in the generative topic modelling part. Unsupervised LDA (Blei et al., 2003; Griffiths and Steyvers, 2004) and a neural topic model counterpart, GSM (Miao et al., 2017), are also added for comparison.

Regression Model
We take a Bayesian approach and jointly estimate f() and g() to solve equation (2). To simplify notation, we collect the numerical features of treatment t and covariates C in a data matrix X ∈ R^{D×(1+dim C)}. All estimated topic features are represented via Z̄ ∈ R^{D×K}, where K is the number of topics. Finally, y ∈ R^{D×1} is the outcome vector. Define A = [Z̄, X] as the overall regression design matrix containing all features (optionally including interaction terms between topics and numerical features). With our fully Bayesian approach, we aim to better capture feature correlations and model uncertainties. In particular, information from the numerical features (labels, treatment and controls) directly informs the topic assignment process as well as the regression. This counters bias in the treatment effect estimation, following the spirit of 'causally sufficient' text representations (Veitch et al., 2020). Following the previous section, we outline the case of a linear f(); our framework could, however, be extended to non-linear f(). Assuming Gaussian iid errors ε ∼ N(0, σ²I), the model's regression equation is then y = Aω + ε, such that

p(y|A, ω, σ²) = N(y|Aω, σ²I). (4)

The likelihood with respect to outcome y is then

p(y|A, ω, σ²) = ∏_{d=1}^{D} N(y_d | a_d ω, σ²), (5)

where a_d is the dth row of the design matrix A. We model our prior beliefs about the parameter vector ω by a Gaussian density

p(ω) = N(ω|m_0, S_0), (6)

where the mean m_0 and covariance matrix S_0 are hyperparameters. Following Bishop (2006), we place an Inverse-Gamma prior on the conditional variance estimate σ², with shape and scale hyperparameters a_0 and b_0,

p(σ²) = IG(σ²|a_0, b_0). (7)
Placing priors on all our regression parameters allows us to conduct full Bayesian inference, which not only naturally counteracts parameter overfitting but also provides us with well-defined posterior distributions over ω and σ 2 as well as a predictive distribution of our response variable.
The joint posterior over the regression parameters is then

p(ω, σ²|y, A) = N(ω|m_n, S_n) IG(σ²|a_n, b_n), (8)

where m_n, S_n, a_n, b_n follow the standard updating equations for a Bayesian linear regression (Appendix B).
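These updates are the standard conjugate Normal-Inverse-Gamma computations; a minimal sketch (the function name and prior values are ours, for illustration):

```python
import numpy as np

def nig_posterior(A, y, m0, S0, a0, b0):
    """Conjugate Normal-Inverse-Gamma update for y = A w + eps,
    returning the posterior parameters (m_n, S_n, a_n, b_n)."""
    S0_inv = np.linalg.inv(S0)
    Sn_inv = S0_inv + A.T @ A                  # posterior precision of w
    Sn = np.linalg.inv(Sn_inv)
    mn = Sn @ (S0_inv @ m0 + A.T @ y)          # posterior mean of w
    an = a0 + len(y) / 2                       # IG shape update
    bn = b0 + 0.5 * (y @ y + m0 @ S0_inv @ m0 - mn @ Sn_inv @ mn)  # IG scale
    return mn, Sn, an, bn
```

With a vague prior (large S_0), the posterior mean m_n approaches the OLS solution, while an informative prior shrinks it towards m_0.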

Topic Model
The estimated topic features Z̄, which form part of the regression design matrix A, are generated from a supervised model that builds on an LDA-based topic structure (Blei et al., 2003). Figure 1 provides a graphical representation of BTR, bringing together our topic and regression model.
We have D documents in a corpus, a vocabulary of V unique words and K topics. Document d has N_d words, so that w_{d,n} denotes the nth word in document d. The bag-of-words representation of a document is w_d = [w_{d,1}, ..., w_{d,N_d}], so that the entire corpus is described by W = [w_1, ..., w_D]. z_{d,n} is the topic assignment of word w_{d,n}, where z_d and Z mirror w_d and W in their dimensionality. Similarly, z̄_d denotes the estimated average topic assignments of the K topics across the words in document d, such that Z̄ = [z̄_1, ..., z̄_D] ∈ R^{D×K}. β ∈ R^{K×V} describes the K topic distributions over the V-dimensional vocabulary. θ ∈ R^{D×K} describes the K topic mixtures for each of the D documents. η ∈ R^V and α ∈ R^K are the respective hyperparameters of the priors for β and θ. The generative process of our BTR model is then:

1. For each topic k, draw a topic-word distribution β_k ∼ Dir(η).
2. Draw the regression parameters σ² ∼ IG(a_0, b_0) and ω ∼ N(m_0, S_0).
3. For each document d:
   (a) draw a topic mixture θ_d ∼ Dir(α);
   (b) for each word n, draw a topic assignment z_{d,n} ∼ Mult(θ_d) and a word w_{d,n} ∼ Mult(β_{z_{d,n}});
   (c) draw the response y_d ∼ N(a_d ω, σ²), where a_d contains z̄_d, x_d and any interaction terms.

Straightforward extensions also allow multiple documents per observation or observations without documents, as described in Appendix C.1.4.
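The text-and-response part of this generative process can be sketched as follows (dimensions and hyperparameter values are illustrative, and the numerical covariates x_d are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)
K, V, D, N = 3, 9, 5, 50            # topics, vocab size, docs, words per doc
alpha, eta, sigma = 1.0, 0.1, 0.5   # illustrative hyperparameters
omega = rng.normal(size=K)          # regression weights on topic shares

beta = rng.dirichlet(np.full(V, eta), size=K)   # topic-word distributions
docs, y = [], []
for d in range(D):
    theta = rng.dirichlet(np.full(K, alpha))    # doc-topic mixture
    z = rng.choice(K, size=N, p=theta)          # topic assignment per word
    w = np.array([rng.choice(V, p=beta[k]) for k in z])  # words
    z_bar = np.bincount(z, minlength=K) / N     # average topic assignments
    y.append(z_bar @ omega + sigma * rng.normal())  # supervised response
    docs.append(w)
```

The key difference from unsupervised LDA is step (c): the response y_d depends on the document's average topic assignments z̄_d, so during inference the label feeds back into the topic assignments.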

Posterior Inference
The objective is to identify the latent topic structure and regression parameters that are most probable to have generated the observed data. We obtain the joint distribution of our graphical model as the product of all nodes conditioned only on their parents, which for our model is

p(θ, β, Z, W, y, ω, σ²|X) = p(β|η) p(θ|α) p(Z|θ) p(W|Z, β) p(y|Z, X, ω, σ²) p(ω|m_0, S_0) p(σ²|a_0, b_0). (9)

The inference task is thus to compute the posterior distribution of the latent variables (Z, θ, β, ω, and σ²) given the observed data (y, X and W) and the priors governed by the hyperparameters (α, η, m_0, S_0, a_0, b_0). We omit the hyperparameters for the sake of clarity unless they are explicitly needed for computational steps. The posterior distribution is then

p(θ, β, Z, ω, σ²|W, y, X) = p(θ, β, Z, W, X, y, ω, σ²) / p(W, X, y). (10)
In practice, computing the denominator in equation (10), i.e. the evidence, is intractable due to the sheer number of possible latent variable configurations. We use a Gibbs EM algorithm (Levine and Casella, 2001), set out below, to approximate the posterior. Collapsing out the latent variables θ and β (Griffiths and Steyvers, 2004), we only need to identify the sampling distributions for the topic assignments Z and the regression parameters ω and σ², conditional on their Markov blankets:

p(Z, ω, σ²|W, X, y) = p(Z|W, X, y, ω, σ²) p(ω, σ²|Z, X, y).
Once topic assignments Z are estimated, it is straightforward to recover β and θ. The expected topic assignments are estimated by Gibbs sampling in the E-step, and the regression parameters are estimated in the M-step.

E-Step: Estimate Topic Parameters
In order to sample from the conditional posterior for each z_{d,n}, we need the probability that a given word w_{d,n} is assigned to a given topic k, conditional on the assignments of all other words (as well as the model's other latent variables and the observed data):

p(z_{d,n} = k | Z_{−(d,n)}, W, X, y, ω, σ²),

where Z_{−(d,n)} denotes the topic assignments of all words apart from w_{d,n}. This section defines this distribution, with derivations in Appendix C. By the conditional independence properties of the graphical model, we can split this joint posterior into

p(Z|W, X, y, ω, σ²) ∝ p(Z|W) p(y|Z, X, ω, σ²).
Topic assignments within one document are independent of topic assignments in all other documents, and the sampling equation for z_{d,n} depends only on its own response variable y_d, hence

p(z_{d,n} = k | Z_{−(d,n)}, W, X, y, ω, σ²) ∝ p(z_{d,n} = k | Z_{−(d,n)}, W) p(y_d | z_d, x_d, ω, σ², z_{d,n} = k).

The first part of the RHS expression is the sampling distribution of a standard LDA model. Following Griffiths and Steyvers (2004), we can express it in terms of the count variables s (topic assignments across a document) and m (assignments of unique words across topics over all documents). The second part is the predictive distribution for y_d. This is a Gaussian distribution depending on the linear combination ω(a_d|z_{d,n} = k), where a_d includes the topic proportions z̄_d and the x_d variables (and any interaction terms), conditional on z_{d,n} = k. We can write this in a convenient form that preserves proportionality with respect to z_{d,n} and depends only on the data and the count variables.
First, we split the X features into those that are interacted with the topics, X_{1,d}, and those that are not, X_{2,d}, so that the generative model for y_d is

y_d ∼ N( z̄_d ω_z + (x_{1,d} ⊗ z̄_d) ω_{zx} + x_{2,d} ω_x , σ² ),

where ⊗ is the Kronecker product. Define ω̃_{z,d} as a length-K vector collecting, for each topic k, the total regression weight multiplying z̄_{d,k} given x_{1,d}. Noting that ω̃_{z,d} z̄_d = (1/N_d) ω̃_{z,d}(s_{d,−n} + s_{d,n}) gives us the sampling distribution for z_{d,n} stated in equation (14): a multinomial distribution parameterised, for each topic k, by the product of the count-based LDA term and the Gaussian likelihood of y_d under z_{d,n} = k. This defines the probability for each k that z_{d,n} is assigned to that topic k. These K probabilities define the multinomial distribution from which z_{d,n} is drawn.
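A single such draw can be sketched as below; the count-based factor follows the collapsed Gibbs sampler of Griffiths and Steyvers (2004), and the Gaussian factor scores y_d under each candidate assignment (variable names are ours, and mu_k would be computed from ω and the count variables as described above):

```python
import numpy as np

def sample_z(rng, s_d, m_kv, m_k, alpha, eta, V, y_d, mu_k, sigma2):
    """One collapsed-Gibbs draw for z_{d,n}.
    s_d:  doc-d topic counts excluding word n (length K)
    m_kv: corpus counts of the current word under each topic, excluding word n
    m_k:  total word count per topic, excluding word n
    mu_k: predicted mean of y_d if z_{d,n} = k, for each k"""
    lda_part = (s_d + alpha) * (m_kv + eta) / (m_k + V * eta)
    # Gaussian factor: the normalising constant cancels in the normalisation.
    y_part = np.exp(-(y_d - mu_k) ** 2 / (2.0 * sigma2))
    p = lda_part * y_part
    return rng.choice(len(p), p=p / p.sum())
```

The supervision enters through y_part: topics whose implied prediction for y_d is far from the observed label are down-weighted relative to plain LDA.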

M-Step: Estimate Regression Parameters
To estimate the regression parameters, we hold the design matrix A = [Z̄, X] fixed. Given the Normal-Inverse-Gamma prior, this is a standard Bayesian linear regression problem, and the posterior distribution is given in equation (8) above. To prevent overfitting to the training sample, there is the option to randomly split the training set into separate sub-samples for the E- and M-steps, following a Cross-Validation EM approach (Shinozaki and Ostendorf, 2007). We use the prediction mean squared error from the M-step sample to assess convergence across EM iterations.

Implementation
We provide an efficient Julia implementation of BTR and a Python implementation of rSCHOLAR on GitHub, to allow for reproducibility of the results in the following experiment sections.

Experiment: Synthetic Data

Synthetic Data Generation
To illustrate the benefits of our BTR approach, we generate a synthetic dataset of documents with explanatory power over a response variable, along with an additional numerical covariate that is correlated with both documents and response. We generate 10,000 documents of 50 words each, following an LDA generative process, with each document having a distribution over three topics, defined over a vocabulary of 9 unique terms.
A numerical feature, x = [x_1, ..., x_D]', is generated by calculating the document-level frequency of the first word in the vocabulary. As the first topic places a greater weight on the first three terms of the vocabulary, x is positively correlated with z̄_1. The response variable y = [y_1, ..., y_D]' is generated through a linear combination of the numerical feature x and the average topic assignments,

y = −z̄_1 + x + ε,

where ε is an iid Gaussian white noise term. The regression model to recover the ground truth is then

y = z̄_1 ω_1 + z̄_2 ω_2 + z̄_3 ω_3 + x ω_x + ε.

The true regression weights are thus ω* = [−1, 0, 0, 1]. In accordance with the FWL theorem, we cannot recover the true coefficients with a two-stage estimation process.
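This data-generating process can be sketched as follows. The exact topic-word distributions and the noise scale are our simplifying assumptions: each topic places uniform mass on a disjoint block of three terms, with topic 1 on the first three:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, K, V = 10_000, 50, 3, 9

theta = rng.dirichlet(np.ones(K), size=D)      # doc-topic mixtures
z_bar, x = np.zeros((D, K)), np.zeros(D)
for d in range(D):
    z = rng.choice(K, size=N, p=theta[d])      # word-level topic draws
    # assumed beta: topic k uniform over terms {3k, 3k+1, 3k+2}
    w = 3 * z + rng.integers(0, 3, size=N)
    z_bar[d] = np.bincount(z, minlength=K) / N
    x[d] = np.mean(w == 0)                     # frequency of the first term

eps = 0.1 * rng.normal(size=D)                 # assumed noise scale
y = -z_bar[:, 0] + x + eps                     # true weights [-1, 0, 0, 1]
```

Because x is, up to sampling noise, proportional to z̄_1, the two regressors are strongly correlated, which is exactly the situation in which a two-stage fit violates FWL.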

Synthetic Data Results
We compare the ground truth of the synthetic data generating process against: (1) BTR: our Bayesian model, estimated via Gibbs sampling.
(2) rSCHOLAR: the regression extension of SCHOLAR, estimated via neural VI; (3) LR-sLDA and (4) sLDA-LR: two-stage combinations of a linear regression on the numerical features and a Gibbs-sampled sLDA on the text, differing in which stage is estimated first; and (5)-(6) two BP-sLDA variants, two-stage models estimated via backpropagation.

Figure 2 shows the true and estimated regression weights for each of the six models. LR-sLDA and sLDA-LR estimate inaccurate regression weights for both the text and numerical features, as do the BP-sLDA variants. Similarly, rSCHOLAR fails to recover the ground truth. BTR, however, estimates tight posterior distributions around the true parameter values. The positive correlation between z̄_1 and x makes a joint estimation approach crucial for recovering the true parameters. Standard supervised topic models estimate the regression parameters for the numerical features separately from the topic proportions and their associated regression parameters, violating the FWL theorem as outlined in Section 2.2. A key difference between rSCHOLAR and BTR lies in their posterior estimation techniques (neural VI vs. Gibbs sampling). rSCHOLAR's approach seems to have a similarly detrimental effect as the two-stage approaches. We suspect further research into (neural) VI assumptions and their effect on causal inference with text could be fruitful.

Semi-Synthetic Data Generation
We further benchmark the models' ability to recover the ground truth on two semi-synthetic datasets. Here we still have access to the ground truth (GT), as we either synthetically create or directly observe the correlations between treatment, confounders and outcome, while the text and some numerical metadata are empirical. We use customer review data from (i) Booking.com and (ii) Yelp, and analyse two different 'mock' research questions. For both datasets, we randomly sample 50,000 observations and select 75% (Yelp) and 80% (Booking) for training.

Booking: Do people give more critical ratings (y_i) to hotels that have high historic ratings (av_score_i), once controlling for the review text? The ground-truth outcome is generated as a linear combination of the treatment av_score_i and prop_pos_i, the proportion of positive words in a review; the textual effect is estimated via topic modelling in our experiment. The treatment in question, the average historic customer rating, is modelled as continuous.
Yelp: Do people from the US (US_i = 1) give different Yelp ratings (y_i) than customers from Canada (US_i = 0), controlling for the average restaurant rating (stars_av_b_i) and the review text?
To create the binary treatment variable US_i, we compute each review's sentiment score (sent_i) using the Harvard Inquirer. The treatment is generated as a function of sent_i, where γ_1 controls the correlation between text and treatment.

Semi-Synthetic Data Results
On both semi-synthetic datasets and across all benchmarked models, BTR estimates the regression weights closest to the ground truth. This holds consistently across all tested numbers of topics K (see Figure 3). For Yelp, we also vary the correlation strength between treatment and confounder (when γ_1 = 1, the correlation between US_i and sent_i is 0.23; for γ_1 = 0.5 it is 0.39). The middle panel of Figure 3 shows the estimation results for a very high correlation between confounder and treatment (γ_1 = 1); the RHS panel shows the results when this correlation is lower (γ_1 = 0.5). As expected, a higher correlation between confounder and treatment increases the bias, as outlined in Section 2.2. If the correlation between confounder and treatment is zero, a two-stage estimation approach no longer violates FWL and all models manage to estimate the ground truth (see Appendix E). Since the topic modelling approach is an approximation of the true effect of the text and its correlation with the metadata, and since this approximation is not perfect, some bias may remain. Overall, BTR gets substantially closer to the ground truth than any other model.

Experiment: Real-World Data
The joint supervised estimation approach using text and non-text features not only counteracts bias in causal settings; it also improves prediction performance. We use the real-world Booking and Yelp datasets for our benchmarking. For both datasets, we predict customer ratings (response) for a business or hotel given customer reviews (text features) and business and customer metadata (numerical features).

Benchmarks
We add the following models to the benchmark list from the previous section: LDA+LR (Griffiths and Steyvers, 2004) and GSM+LR (Miao et al., 2017), unsupervised Gibbs-sampling and neural-VI based topic models; and LR+rSCHOLAR, the two-step equivalent of rSCHOLAR, estimating covariate regression weights in a separate step from the supervised topic model. (Full specifications for each case are given in Appendix F. We also tested sLDA+LR and a pure sLDA, which performed consistently worse; see Appendix G.1.)
An alternative to topic-based models are word-embedding based neural networks. We use (7) LR+aRNN: a bidirectional RNN with attention (Bahdanau et al., 2015); since this model does not allow for non-text features, we use the residuals of the linear regression as its target. And (8) LR+TAM: a bidirectional RNN using a global topic vector to enhance its attention heads (Wang and Yang, 2020), with the same target as in LR+aRNN.

Prediction and Perplexity Results
We evaluated all topic models on a range of 10 to 100 topics, with results for 50 and 100 topics in Table 2. Hyperparameters of benchmark models that have no direct equivalent in our model were set as suggested in the corresponding papers. We find that our results are robust across a wide range of hyperparameters (extensive robustness checks in Appendix G).
We assess the models' predictive performance via predictive R² (pR² = 1 − MSE/var(y)). The upper part of Table 2 shows that BTR achieves the best pR² on the Yelp dataset and very competitive results on the Booking dataset, where our rSCHOLAR extension outperforms all other models. Even the non-linear neural network models aRNN and TAM cannot achieve better results. Importantly, rSCHOLAR and BTR perform substantially better than their counterparts that do not jointly estimate the influence of covariates (LR+rSCHOLAR and LR+sLDA). (The displayed topic-model results use α = 0.5, η = 0.01. Wang and Yang (2020) use 100-dimensional word embeddings in their default setup for TAM and pre-train them on the dataset; we follow this approach. RNN and TAM results were very robust to changes in the hidden layer size in these setups; we use a layer size of 64. Full details of all model parametrisations are provided in Appendix G.2.)
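For reference, the predictive R² used here is simply one minus the ratio of held-out MSE to outcome variance:

```python
import numpy as np

def predictive_r2(y_true, y_pred):
    """pR^2 = 1 - MSE / var(y): 1.0 for perfect predictions,
    about 0 for a model that always predicts the mean of y."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    mse = np.mean((y_true - y_pred) ** 2)
    return 1.0 - mse / np.var(y_true)
```

Unlike in-sample R², pR² can be negative on a test set when the model predicts worse than the constant mean of y.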
To assess document modelling performance, we report test-set perplexity scores for all models that allow this (Table 2, lower part). The joint approach of rSCHOLAR and BTR does not come at the cost of increased perplexity. If anything, the supervised learning approach using labels and covariates even improves document modelling performance compared with its unsupervised counterpart (BTR vs. LDA).
Assessing the interpretability of topic models is ultimately a subjective exercise. In Appendix G.4 we show topics associated with the most positive and negative regression weights, for each dataset. Overall, the identified topics and the sign of the associated weights seem interpretable and intuitive.

Conclusions
In this paper, we introduced BTR, a Bayesian topic regression framework that incorporates both numerical and text data for modelling a response variable, jointly estimating all model parameters. Motivated by the FWL theorem, this approach is designed to avoid potential bias in the regression weights, and can provide a sound regression framework for statistical and causal inference when one needs to control for both numerical and text-based confounders in observational data. We demonstrate that our model recovers the ground truth with lower bias than any other benchmark model on synthetic and semi-synthetic datasets. Experiments on real-world data show that a joint and supervised learning strategy also yields superior prediction performance compared to 'two-stage' strategies, even competing with deep neural networks.

A Causal Inference with Text
For D observations, we have outcome y ∈ R^{D×1}, treatment t ∈ R^{D×1}, text data W ∈ R^{D×V} (where V is the vocabulary size) and numerical confounders C ∈ R^{D×P} (where P is the number of numerical confounders). As established in the main part of the paper, in order to estimate the ATT, we need to compute the conditional expectation function (CEF) E[y|t, Z], or, if we have additional numerical confounders, E[y|t, Z, C]. Using regression to estimate the CEF, we can write

E[y|t, Z, C] = f(t, Z, C; Ω), (23)

where f() is the function of our regression equation that we need to define, and Ω are its parameters. The predominant assumption in causal inference settings in many disciplines is a linear causal effect assumption. We follow this approach, also for the sake of simplicity. However, the requirement that the text representations Z be estimated jointly and in a supervised fashion, so that they can predict y, t (and if relevant C) and be considered 'causally sufficient', is not constrained to the linear case (Veitch et al., 2020). Under the linearity assumption, the CEF of our regression can take the form

y = tω_t + Zω_z + Cω_c + ε, (24)

where ε ∼ N(0, σ²) is additive iid Gaussian noise, i.e. E[ε|t, Z, C] = 0 (see for example Angrist and Pischke (2008), chapter 3). Thus, σ² represents the conditional variance Var(y|t, Z, C). The regression approximates the CEF. Hence, when the CEF is causal, the regression estimates are causal (Angrist and Pischke, 2008). In such a case, ω_t measures the treatment effect. Assuming that Z and C block all 'backdoor' paths, the CEF allows us to conduct causal inference on the ATT of t on y (Pearl, 2009). We now revisit the conditions under which a decomposition of equation (24) into several separate estimation steps is permitted, as described by the Frisch-Waugh-Lovell (or regression decomposition) theorem (Lovell, 2008), so that the regression estimates for ω_t remain unchanged and can hence still be considered causal.

A.1 Regression Decomposition Theorem
The regression decomposition theorem or Frisch-Waugh-Lovell (FWL) theorem (Frisch and Waugh, 1933; Lovell, 1963) states that the coefficients of a linear regression as stated in equation (24) are equivalent to the coefficients of partial regressions in which the residualized outcome is regressed on the residualized regressors; this residualization is with respect to all regressors that are not part of the given partial regression.
For a moment, let us assume there are no latent (i.e. to-be-estimated) confounding text features Z. Our observational data then consist only of the outcome y, the treatment variable t and other observed confounding variables C,

y = tω_t + Cω_C + ε.
The FWL theorem states that we would obtain mathematically identical regression coefficients ω_t and ω_C if we decomposed this regression and estimated each part separately, each time residualizing (i.e. orthogonalizing) outcomes and regressors on all other regressors.
More generally, for a linear regression define y = Xβ + ε with y ∈ R^{D×1}, β ∈ R^{M×1}, X ∈ R^{D×M}, which we can arbitrarily partition into X_1 ∈ R^{D×K} and X_2 ∈ R^{D×J} with K + J = M, so that we can also write y = X_1β_1 + X_2β_2 + ε. Define the projection (or prediction) matrix P such that P = X(X'X)^{-1}X'.
P produces predictions ŷ when applied to the outcome vector y:

ŷ = Xβ̂ = X(X'X)^{-1}X'y = Py.
Also define the complement of P, the residual maker matrix M = I − P = I − X(X'X)^{-1}X', such that M applied to an outcome vector y yields the residuals, My = y − Xβ̂ = ε̂.

Theorem: The FWL theorem states that, equivalent to estimating y = X_1β_1 + X_2β_2 + ε, we would obtain mathematically identical regression coefficients β_1 and β_2 if we separately estimated

M_2 y = M_2 X_1 β_1 + ε   and   M_1 y = M_1 X_2 β_2 + ε,

where M_1 and M_2 are the residual maker matrices corresponding to the data partitions X_1 and X_2.
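The theorem can be verified numerically. The sketch below builds the residual maker M from its definition and checks that the joint and partial regressions give identical coefficients; the data are simulated and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 500
X1 = rng.normal(size=(D, 2))
X2 = rng.normal(size=(D, 3)) + X1[:, :1]  # partitions deliberately correlated
y = X1 @ np.array([1.0, -2.0]) + X2 @ np.array([0.5, 0.0, 1.5]) + rng.normal(size=D)

def residual_maker(X):
    # M = I - X (X'X)^{-1} X'
    return np.eye(len(X)) - X @ np.linalg.solve(X.T @ X, X.T)

# Joint regression on the full design [X1, X2]
beta_joint, *_ = np.linalg.lstsq(np.column_stack([X1, X2]), y, rcond=None)

# Partial regression: residualize y and X1 on X2, then regress
M2 = residual_maker(X2)
beta1_partial, *_ = np.linalg.lstsq(M2 @ X1, M2 @ y, rcond=None)

print(np.allclose(beta_joint[:2], beta1_partial))  # True: FWL equivalence
```

The same check goes through for M_1 and β_2 by symmetry.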
Finally, M_2 ε̂ = ε̂, since X_2 is orthogonal to ε̂ by construction of the OLS regression; the residualized residuals are therefore the residuals themselves. This leaves us with M_2 y = M_2 X_1 β_1 + ε̂. The same goes through for M_1 by analogy.
In the simplest case, assume there is no confounding text. Our observational data then consist only of the outcome y, the treatment variable t and other potential confounding variables C. The conditional expectation function is E[y|t, C], which we can estimate via one joint regression,

y = tω_t + Cω_C + ε.   (37)

Now, assuming the linearity assumption is correct, if t ⊥ C then C is not actually a confounder in this setup, and we would obtain the exact same regression coefficient estimates for ω_t and ω_C if we followed a two-step process in which we first regress y on t,

y = tω_t + ε_1,

and then regress on C in a second step.
This holds true if and only if t ⊥ C, because in that case t and C are already orthogonal to each other. They then already fulfil the requirements of the FWL theorem, and such a two-step process yields estimates ω_t and ω_C that are mathematically equivalent to those obtained from the joint estimation in equation (37).
We now consider the case where part (or all) of our confounders are text, or where text is a proxy for otherwise unobserved confounders. The joint estimation would be

y = tω_t + Zω_Z + Cω_C + ε,

where Z itself is obtained through supervised learning via the text representation function Z = g(W, y, t, C; Θ).
We therefore cannot decompose this joint estimation into separate parts. As long as the text features Z are correlated with the outcome and the other covariates, we would need to apply the orthogonalization via the respective M matrices for each partial regression. Since Z must itself be estimated (it is 'estimated data'), we cannot residualize on Z; nor can Z be residualized on the other covariates. A separate-stage approach will therefore lead to biased estimates of ω.
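The bias from a separate-stage strategy can be made concrete with simulated data. In the sketch below the "text features" are just a simulated Dirichlet matrix standing in for learned topic shares; all numbers are illustrative. Regressing y on t alone (ignoring the correlated text features, as a naive separate stage would) inflates the treatment weight, while the joint regression recovers it.

```python
import numpy as np

rng = np.random.default_rng(2)
D = 2000
Z = rng.dirichlet(np.ones(3), size=D)        # stand-in for learned text features
t = Z[:, 0] + rng.normal(scale=0.2, size=D)  # treatment correlated with the text
y = 2.0 * t + Z @ np.array([1.0, -1.0, 0.0]) + rng.normal(scale=0.1, size=D)

# Separate stage: regress y on t alone, omitting the correlated text features
w_sep = np.linalg.lstsq(t[:, None], y, rcond=None)[0][0]

# Joint estimation over [t, Z]
w_joint = np.linalg.lstsq(np.column_stack([t, Z]), y, rcond=None)[0][0]

print(w_sep > w_joint)  # True: the separate stage is biased upward here
```

The direction of the bias depends on the sign of the correlation; the point is that it does not vanish unless t is orthogonal to the text features.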

B Regression Model
Due to the conjugacy of the Normal-Inverse-Gamma prior, the posterior distribution of the regression parameters conditional on A has a known Normal-Inverse-Gamma form:

p(ω, σ²|y, A) ∝ p(ω|σ², y, A) p(σ²|y, A) = N(ω|m_n, σ²S_n^{-1}) IG(σ²|a_n, b_n),

where m_n, S_n, a_n and b_n follow the standard updating equations for a Bayesian linear regression (Bishop, 2006).

C Topic Model
C.1 Gibbs-EM algorithm
C.1.1 Sampling distribution for z
The probability of a given word w_{d,n} being assigned to a given topic k (such that z_{d,n} = k), conditional on the assignments of all other words (as well as the model's other latent variables and the data), is Pr[z_{d,n} = k | Z_{−(d,n)}, W, X, y, ω, σ²], where Z_{−(d,n)} denotes the topic assignments for all words apart from w_{d,n}. By the conditional independence properties implied by the graphical model, we can split this joint posterior into p(Z|W, X, y, ω, σ²) ∝ p(Z|W) p(y|Z, X, ω, σ²).
As topic assignments within one document are independent from topic assignments in all other documents, the sampling equation for the nth word in document d depends only on its own response variable, y_d. The first part of the RHS expression is just the sampling distribution of a standard LDA model, so it can be expressed in terms of the count variables s (the topic assignments across a document) and m (the assignments of unique words across topics over all documents): s_{d,k} measures the total number of words in document d assigned to topic k, and s_{d,k,−n} the number of words in document d assigned to topic k excluding word n. Analogously, m_{k,v} measures the total number of times term v is assigned to topic k across all documents, and m_{k,v,−(d,n)} measures the same but excludes word n in document d.
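The LDA part of the sampling distribution can be sketched directly from these count variables. This is a minimal illustration of the standard collapsed-Gibbs LDA term with symmetric priors, not the paper's full BTR sampler (which multiplies in the regression likelihood); the function name and toy counts are assumptions.

```python
import numpy as np

def lda_topic_probs(s_d_minus_n, m_minus_dn, v, alpha, eta):
    """Standard LDA part of the sampling distribution for z_{d,n} = k.

    s_d_minus_n : (K,) topic counts in document d, excluding word n
    m_minus_dn  : (K, V) topic-term counts over all documents, excluding word n
    v           : vocabulary index of word n
    """
    V = m_minus_dn.shape[1]
    doc_part = s_d_minus_n + alpha                               # document-topic counts
    word_part = (m_minus_dn[:, v] + eta) / (m_minus_dn.sum(axis=1) + eta * V)
    p = doc_part * word_part
    return p / p.sum()  # normalise to a K-dimensional multinomial

# Toy counts: K=2 topics, V=2 terms; word n is term v=0
s = np.array([5.0, 1.0])
m = np.array([[3.0, 1.0], [1.0, 3.0]])
p = lda_topic_probs(s, m, v=0, alpha=0.5, eta=0.1)
print(p)  # a valid probability vector favouring topic 0
```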

C.1.2 Regression
Given that the residuals are Gaussian, the probability of the response variable for a given document d is p(y_d | z_d, x_d, ω, σ²) = N(y_d | a_d ω, σ²). We can write this in a convenient form that preserves proportionality with respect to z_{d,n}, such that it depends only on the data and the count variables used in the other two terms. First, we split the x_d features into those that are interacted, x_{1,d}, and those that are not, x_{2,d}. The generative model for y_d is then

y_d = (z̄_d ⊗ x_{1,d}) ω_{zx} + z̄_d ω_z + x_{2,d} ω_x + ε_d,

where ⊗ is the Kronecker product. Since x_d is observed, we can think of this as a linear model with document-specific regression parameters. Define ω̃_{z,d} as a length-K vector such that ω̃_{z,d,k} = ω_{z,k} + ω_{zx,k} x_{1,d}. Noting that ω̃_{z,d} z̄_d = (1/N_d) ω̃_{z,d} (s_{d,−n} + s_{d,n}), the probability density of y_d conditional on z_{d,n} = k is proportional to an expression depending only on the data and count variables. This gives us the sampling distribution for z_{d,n} stated in equation (50): a multinomial distribution that defines, for each k ∈ {1, ..., K}, the probability that z_{d,n} is assigned to that topic. These K probabilities define the multinomial distribution from which z_{d,n} is drawn. Given topic assignments z, we can recover the latent variables θ and β from their predictive distributions (equation (58)).

C.1.4 Observations without documents
A straightforward extension allows some observations to be associated with an x and y but no document. This is often the case in a social science context; for example, a time series may be associated with documents only at irregular intervals. If an observation is not associated with any document, the prior on the document-topic distribution suggests setting the topic proportion for each topic k to α_k / Σ_k α_k. These observations may still be very useful in estimating the relationship between X and y, so they are worth including in the estimation.
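This fallback rule is simple enough to sketch. The helper below (name and interface are illustrative, not from the paper) returns empirical topic shares when a document is present and the Dirichlet prior mean otherwise.

```python
import numpy as np

def doc_topic_features(s_d, alpha):
    """Topic features for one observation; falls back to the Dirichlet prior
    mean alpha_k / sum_k alpha_k when the observation has no document."""
    alpha = np.asarray(alpha, dtype=float)
    if s_d is None or np.sum(s_d) == 0:
        return alpha / alpha.sum()        # prior mean for document-less observations
    s_d = np.asarray(s_d, dtype=float)
    return s_d / s_d.sum()                # empirical topic shares z_bar

print(doc_topic_features(None, [1.0, 1.0, 2.0]))      # prior mean
print(doc_topic_features([2, 2, 4], [1.0, 1.0, 2.0])) # empirical shares
```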

C.1.5 Multiple paragraphs
If, as is often the case in the context of social science applications, we have relatively few observations but the documents associated with those observations are relatively long, we can exploit the structure of the documents by estimating the model at a paragraph level. Splitting up longer documents into paragraphs brings one of the key advantages of topic modelling to the fore: that the same word can have different meanings in different contexts. For example, the word "increase" might have quite a different meaning if it is in a paragraph with the word "risk" than if it is alongside "productivity". Treating the entire document as a single bag of words makes it hard for the model to make this distinction.
If there are observations with multiple documents, we can treat these as P_d separate paragraphs of a combined document, indexed by p, each with an independent θ_p distribution over topics. These paragraphs may also have different associated x_{d,p} that interact with the topics; for example, we may wish to interact topics with a paragraph-specific sentiment score. The response variable y_d, however, is common to all paragraphs in the same document, and the M-step is estimated at the document level. Figure 4 shows the extended graphical model.
If x_{d,p} only enters the regression linearly, then some document-level average will have to be used; this transformation can be performed prior to estimation, converting it into an x_{1,d}, so the algorithm remains unchanged. However, if any of the x_{d,p} variables are interacted with z̄_{d,p}, we may wish for this interaction to be at the paragraph level, for example if we think that a topic might have a different effect depending on the sentiment of the surrounding paragraph. In this case, we still need to aggregate the interaction to the document level, but we aggregate after interacting rather than interacting after aggregating. We therefore define the document-level interaction as the aggregate over paragraphs p ∈ [P_d] of z̄_{d,p} ⊗ x_{1,d,p}, where [N] denotes the set of integers {1, ..., N} and ⊗ represents the Kronecker product. The design matrix A then stacks these aggregated features, and the predictive model for y_d takes the same linear form as before. The simplest way to aggregate from paragraphs to documents is to give each word in the document equal weight, as above; this means that longer paragraphs receive greater weight than shorter ones.
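The aggregate-after-interacting step can be sketched as follows. The function name and the toy inputs (two paragraphs, K = 2 topics, one paragraph-level covariate) are illustrative assumptions; the word-count weighting gives each word equal weight, as described above.

```python
import numpy as np

def aggregate_paragraphs(zbar_p, x_p, n_words):
    """Document-level features from paragraph-level topic shares.

    Interactions are formed at the paragraph level (Kronecker product of the
    paragraph topic shares with the paragraph covariates) and then averaged
    with each paragraph weighted by its word count, so that every word in the
    document carries equal weight.
    """
    n = np.asarray(n_words, dtype=float)
    interactions = np.array([np.kron(z, x) for z, x in zip(zbar_p, x_p)])
    return (n[:, None] * interactions).sum(axis=0) / n.sum()

# Two paragraphs, K=2 topics, one paragraph-level covariate (e.g. sentiment)
zbar_p = [np.array([0.8, 0.2]), np.array([0.1, 0.9])]
x_p = [np.array([1.0]), np.array([-1.0])]
agg = aggregate_paragraphs(zbar_p, x_p, n_words=[30, 10])
print(agg)  # word-count-weighted document-level interaction features
```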
As before, we can collapse out the latent variables θ and β so that we only need to sample for the topic assignments z in an E-step and then for ω and σ 2 in an M-step.
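The M-step draw of ω and σ² uses the conjugate Normal-Inverse-Gamma posterior from Appendix B. A minimal sketch of those standard updates (e.g. Bishop, 2006) is below; the function name and the toy data are illustrative, and the prior precision S0 is taken as the identity for the demo.

```python
import numpy as np

def nig_posterior(A, y, m0, S0, a0, b0):
    """Conjugate Normal-Inverse-Gamma update for Bayesian linear regression.

    Prior: omega | sigma^2 ~ N(m0, sigma^2 S0^{-1}), sigma^2 ~ IG(a0, b0),
    with S0 the prior precision matrix.
    """
    D = len(y)
    Sn = S0 + A.T @ A
    mn = np.linalg.solve(Sn, S0 @ m0 + A.T @ y)
    an = a0 + D / 2.0
    bn = b0 + 0.5 * (y @ y + m0 @ S0 @ m0 - mn @ Sn @ mn)
    return mn, Sn, an, bn

# Toy check: with ample data the posterior mean approaches the OLS solution
rng = np.random.default_rng(3)
A = rng.normal(size=(500, 2))
y = A @ np.array([1.0, -1.0]) + rng.normal(scale=0.1, size=500)
mn, Sn, an, bn = nig_posterior(A, y, np.zeros(2), np.eye(2), 1.0, 1.0)
print(np.round(mn, 1))  # close to the true weights [1, -1]
```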
In the E-step, we need to sample from the conditional posterior for the topic assignment of each word Pr[z d,p,n = k|Z d,−(p,n) , W , α, η, y, X, ω, σ 2 ].
By the conditional independence properties of the graphical model, we can split this into p(Z|W, α, η) and p(y|Z, X, ω, σ²). The sampling equation for the nth token in the pth paragraph of the dth document has the form Pr[z_{d,p,n} = k | Z_{d,−(p,n)}, W, α, η, y, X, ω, σ²]. Topic assignments are independent across documents, but within a document there are dependencies across paragraphs. Crucially, the paragraphs are independent with respect to θ, so p(Z|W, α, η) is paragraph-specific.
However, the regression part operates at the document level, so p(y|Z, X, ω, σ²) conditions on all the paragraphs in a given document. Given that the residuals are Gaussian, the probability of the outcome variable for a given document d can be written in a convenient form that preserves proportionality with respect to z_{d,p,n}, such that it depends only on the data, the count variables used in the other two terms, and the document-wide counts.
First, we can break the prediction for y_d into the part that depends on paragraph p and the part that depends on the other paragraphs and the document-wide x_{1,d},
where N d is the total number of words in the document.
Define ŷ_{d,−p} as the predicted y_d without paragraph p. We then have a predictive distribution that depends only on paragraph p.
We can then follow the same steps as for the single-paragraph case to derive the third term in the sampling distribution, defining ω̃_{z,d,p,k} = ω_{z,k} + ω_{zx,k} x_{1,d,p}, analogously to the ω̃ defined for the single-paragraph case.
This gives us the sampling distribution for z, which is a multinomial parameterised analogously to equation (50). In the M-step we can then still use the average z̄_{d,p} estimated in the E-step, but we need to weight each paragraph by the number of words in that paragraph to be consistent with the E-step.

D Synthetic Data Experiments
Figure 5 shows the topic-vocabulary distribution from which the synthetic documents are generated. Table 3 shows the hyperparameter settings used in the synthetic data section. We observed that the settings of the prior hardly affected the results, given the strong signal in the synthetic dataset.

E Semi-Synthetic Data Experiments
Figure 6: Without correlation between confounders and treatments, the regression can be dissected into two separate parts (supervised topic estimation and regression weight estimation of the non-text features) without inducing bias in the estimators, as described in the section on the Frisch-Waugh-Lovell theorem. In such a case, all models manage to recover the ground truth.

F Real-World Datasets and Data Pre-Processing
The Yelp dataset contains over 8 million customer reviews of businesses, which we restrict to reviews for businesses in Toronto. The Booking dataset contains around 500,000 hotel reviews. For both datasets, we randomly sample 50,000 observations and randomly select 75% (Yelp) or 80% (Booking) of our sample for training, holding out the remainder for testing. We then further split the training set equally for training in the E-step and validation in the M-step. The features are normalized using the training data statistics and the response variable is de-meaned. We do this because the K topic features sum to one and therefore implicitly already add a constant to the regression (Blei and McAuliffe, 2008). We preprocess the text corpora by removing stopwords and then tokenizing and stemming the data. The Booking.com dataset allows consumers to enter the positive and negative parts of their reviews in separate boxes. We combine these two parts for all our exercises, but we do use information on the word count in each of these sections (see below).
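The split-and-normalise logic can be sketched as below. This is an illustrative helper, not the paper's pipeline: the further E-step/M-step split is omitted, and the function name, default split fraction and toy data are assumptions. Crucially, the statistics come from the training rows only, and no constant is added since the topic shares already act as an implicit intercept.

```python
import numpy as np

def prepare(X, y, train_frac=0.75, seed=0):
    """Random train/test split, then normalise features and de-mean the
    response using training-set statistics only."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_tr = int(train_frac * len(y))
    tr, te = idx[:n_tr], idx[n_tr:]
    mu, sd = X[tr].mean(axis=0), X[tr].std(axis=0)   # training statistics
    y_mu = y[tr].mean()
    return (X[tr] - mu) / sd, y[tr] - y_mu, (X[te] - mu) / sd, y[te] - y_mu

# Toy usage with simulated features and response
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y = rng.normal(size=100)
X_tr, y_tr, X_te, y_te = prepare(X, y)
print(X_tr.shape, X_te.shape)  # training and test feature matrices
```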
For the prediction exercises in Section 7, we use the number of stars associated with each review as the target variable. We also use the numerical metadata described in Table 5 as covariates. The review text is correlated with the treatment (Average_Score_i) and with the outcome, and so the text can act as a confounder.

G Real-World Data Experiments
G.1 Empirical data evaluation across different K
We also tested sLDA+LR and a pure sLDA, which performed consistently worse, so they are not included for the sake of brevity. For example, for K = 50, sLDA+LR achieved a pR² of 0.420 and 0.564 for Booking and Yelp respectively, compared to 0.432 and 0.571 for LR+sLDA. Standalone sLDA achieves 0.356 and 0.526 respectively.

G.2 Model parametrisations
This section provides an overview of all hyperparameter settings used and tested across the models in our benchmark list. Table 8 lists all hyperparameter settings pertaining to topic model components. Table 9 provides an overview of all neural network hyperparameters used. Table 10 summarises the iteration and stopping criteria for all models. Further notes on benchmark model specifications: for TAM and aRNN, the sequence length in the RNN component (i.e. the maximum number of words per document) is 305 for Booking and 572 for Yelp, which corresponds to the longest review in each respective dataset. We therefore work with the full text of each review.
BPsLDA changes its behaviour quite drastically when α is set in the range 1 ≤ α ≤ 2, where it strongly increases its predictive performance (pR²) at the cost of its document modelling performance (perplexity). This can be seen in the original paper (Chen et al., 2015). Even when including α = 1 in the robustness test range, BTR still outperforms BPsLDA and all other models across all hyperparameter settings, except for K = 10 in the Yelp dataset, where BTR is a close second; in this specific case the two are generally on par for low K, with BTR doing better for K > 30.

G.3 Robustness Tests
Robustness test across all topic models with LDA-like structure and Dirichlet hyperparameters for document-topic and word-topic distributions.
We assess the robustness of our findings to changes in the Dirichlet hyperparameters α and η. These hyperparameters act as priors on the document-topic distributions (θ) and topic-word distributions (β), respectively. Table 11 shows the results.
In terms of pR², BTR continues to perform best across all settings, and we generally find that its prediction performance is robust to hyperparameter changes. Evaluating the perplexity scores, we see more fluctuation across all models, which is unsurprising since these hyperparameters directly affect the generative topic modelling process; BTR remains on par with its sLDA counterpart. Table 12 provides an extended robustness test of the predictive performance of the benchmark topic models across hyperparameters; BTR continues to be the best performing model throughout. Table 13 summarises the corresponding robustness tests in terms of perplexity scores; BTR achieves almost identical perplexity scores to sLDA whilst achieving higher pR² throughout. Table 14 provides an analogous extended robustness test of predictive performance; BTR continues to be the best performing model throughout, apart from the K = 10 case, where it is a close second. Table 15 summarises the analogous perplexity robustness tests; again, BTR achieves almost identical perplexity scores to sLDA whilst achieving higher pR² throughout.

G.4 Estimated Topics
The tables below are an extended version of the corresponding table in the paper. They show the top 3 negative and positive topics for K = [10, 30, 100]. Inspecting the top words in each of these topics alongside its regression coefficient, BTR models highly interpretable topics, at least as interpretable as those of LDA or sLDA, while at the same time achieving substantially better prediction performance throughout all model specifications (see previous section).

Table 22 shows the time taken for 100 E-step iterations on a single 2.8GHz processor on the Booking data; the same takes 300-400 seconds on the Yelp data. We found that 100 E-step iterations are typically sufficient for the best performance, and the model typically converges after 10-25 EM iterations. A typical 30-topic model on the Yelp data thus took around 1 hour to converge, and around 20 minutes on Booking. Computation time scales roughly linearly in the number of topics and the total number of words across all documents, because the evaluation of the K-dimensional multinomial distribution for each z_{d,n} (equation (50)) is the principal computational cost. Note that the Yelp data has roughly 3 times as many words as the Booking.com data.