DoubleLingo: Causal Estimation with Large Language Models

Marko Veljanovski, Zach Wood-Doughty


Abstract
Estimating causal effects from non-randomized data requires assumptions about the underlying data-generating process. To achieve unbiased estimates of the causal effect of a treatment on an outcome, we typically adjust for any confounding variables that influence both treatment and outcome. When such confounders include text data, existing causal inference methods struggle due to the high dimensionality of the text. The simple statistical models that satisfy the convergence criteria required for causal estimation are not well equipped to handle noisy, unstructured text, but flexible large language models that excel at predictive tasks with text data do not meet the statistical assumptions necessary for causal estimation. Our method enables theoretically consistent estimation of causal effects using LLM-based nuisance models by incorporating them within the framework of Double Machine Learning. On the best available dataset for evaluating such methods, we obtain a 10.4% reduction in the relative absolute error for the estimated causal effect over existing methods.
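For readers unfamiliar with Double Machine Learning, the following is a minimal illustrative sketch of its general recipe: fit nuisance models (a propensity model and an outcome model) on one fold of the data, evaluate a doubly robust score on the held-out fold, and average. It is not the authors' implementation. The data are synthetic, the nuisance models are simple stratified means standing in for the LLM-based models the abstract describes, and every function name (`simulate`, `fit_nuisances`, `aipw_score`, `cross_fit_ate`, `naive_difference`) is hypothetical. The augmented inverse-propensity-weighted (AIPW) score used here is one common choice for the average treatment effect and is assumed for illustration only.

```python
import random


def simulate(n, seed=0):
    """Synthetic data (assumption): binary confounder c drives both t and y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        c = 1 if rng.random() < 0.5 else 0
        p_treat = 0.8 if c == 1 else 0.2
        t = 1 if rng.random() < p_treat else 0
        # The true average treatment effect is 2.0 by construction.
        y = 2.0 * t + 3.0 * c + rng.gauss(0.0, 1.0)
        rows.append((c, t, y))
    return rows


def fit_nuisances(train):
    """Stand-in nuisance models: stratified means instead of LLMs."""
    propensity, outcome = {}, {}
    for c in (0, 1):
        stratum = [r for r in train if r[0] == c]
        # Propensity model: estimated P(T = 1 | C = c).
        propensity[c] = sum(r[1] for r in stratum) / len(stratum)
        for t in (0, 1):
            ys = [r[2] for r in stratum if r[1] == t]
            # Outcome model: estimated E[Y | T = t, C = c].
            outcome[(t, c)] = sum(ys) / len(ys)
    return propensity, outcome


def aipw_score(row, propensity, outcome):
    """Doubly robust (AIPW) score for one held-out observation."""
    c, t, y = row
    e = propensity[c]
    m1, m0 = outcome[(1, c)], outcome[(0, c)]
    return (m1 - m0) + t * (y - m1) / e - (1 - t) * (y - m0) / (1 - e)


def cross_fit_ate(rows, n_folds=2):
    """Cross-fitting: nuisances are fit on data outside the scored fold."""
    scores = []
    for k in range(n_folds):
        held_out = [r for i, r in enumerate(rows) if i % n_folds == k]
        train = [r for i, r in enumerate(rows) if i % n_folds != k]
        propensity, outcome = fit_nuisances(train)
        scores.extend(aipw_score(r, propensity, outcome) for r in held_out)
    return sum(scores) / len(scores)


def naive_difference(rows):
    """Unadjusted difference in mean outcomes, biased by confounding."""
    treated = [r[2] for r in rows if r[1] == 1]
    control = [r[2] for r in rows if r[1] == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


data = simulate(20000)
ate_hat = cross_fit_ate(data)
naive = naive_difference(data)
print(f"cross-fitted AIPW estimate: {ate_hat:.3f}")
print(f"naive difference in means:  {naive:.3f}")
```

In this toy setting the unadjusted difference in means absorbs the confounder's effect, while the cross-fitted estimate should land near the simulated effect of 2.0. Replacing `fit_nuisances` with more flexible predictors is the step at which, under the abstract's description, LLM-based models would enter.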
Anthology ID:
2024.naacl-short.71
Volume:
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Kevin Duh, Helena Gomez, Steven Bethard
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
799–807
URL:
https://aclanthology.org/2024.naacl-short.71
DOI:
10.18653/v1/2024.naacl-short.71
Cite (ACL):
Marko Veljanovski and Zach Wood-Doughty. 2024. DoubleLingo: Causal Estimation with Large Language Models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pages 799–807, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
DoubleLingo: Causal Estimation with Large Language Models (Veljanovski & Wood-Doughty, NAACL 2024)
PDF:
https://aclanthology.org/2024.naacl-short.71.pdf