Estimating the Entropy of Linguistic Distributions

Aryaman Arora, Clara Meister, Ryan Cotterell


Abstract
Shannon entropy is often a quantity of interest to linguists studying the communicative capacity of human language. However, entropy must typically be estimated from observed data because researchers do not have access to the underlying probability distribution. While entropy estimation is a well-studied problem in other fields, there is not yet a comprehensive exploration of the efficacy of entropy estimators for use with linguistic data. In this work, we fill this void, studying the empirical effectiveness of different entropy estimators for linguistic distributions. In a replication of two recent information-theoretic linguistic studies, we find evidence that the reported effect size is over-estimated due to over-reliance on poor entropy estimators. We end this paper with a concrete recommendation for the entropy estimators that should be used in future linguistic studies.
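To make the estimation problem concrete, here is a minimal sketch (not the authors' released software) contrasting the naive plug-in (maximum-likelihood) entropy estimator with the Miller-Madow bias correction, two standard estimators of the kind the paper compares. The plug-in estimator is known to be negatively biased on small samples, which is exactly the failure mode that can inflate reported effect sizes.

```python
import math
from collections import Counter


def plugin_entropy(samples):
    """Naive plug-in estimate: H = -sum(p_hat * log2(p_hat)).

    Negatively biased for small samples, since rare outcomes are
    under-represented and the estimate understates the true entropy.
    """
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def miller_madow_entropy(samples):
    """Miller-Madow correction: add (K - 1) / (2 * N * ln 2) bits,
    where K is the number of distinct outcomes observed and N the
    sample size."""
    n = len(samples)
    k = len(set(samples))
    return plugin_entropy(samples) + (k - 1) / (2 * n * math.log(2))


# Toy example: characters sampled from a small corpus snippet.
data = list("the cat sat on the mat")
print(f"plug-in:      {plugin_entropy(data):.4f} bits")
print(f"Miller-Madow: {miller_madow_entropy(data):.4f} bits")
```

The Miller-Madow estimate is always at least as large as the plug-in estimate, partially offsetting the downward bias; the paper evaluates this family of corrections (among others) specifically on linguistic distributions.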
Anthology ID:
2022.acl-short.20
Volume:
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
175–195
URL:
https://aclanthology.org/2022.acl-short.20
DOI:
10.18653/v1/2022.acl-short.20
Cite (ACL):
Aryaman Arora, Clara Meister, and Ryan Cotterell. 2022. Estimating the Entropy of Linguistic Distributions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 175–195, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
Estimating the Entropy of Linguistic Distributions (Arora et al., ACL 2022)
PDF:
https://aclanthology.org/2022.acl-short.20.pdf
Software:
2022.acl-short.20.software.zip