The Zipfian Challenge: Learning the statistical fingerprint of natural languages

Christian Bentz

doi:10.18653/v1/2023.conll-1.3

Abstract

Human languages are often claimed to differ fundamentally from other communication systems. But what exactly unites them as a separate category? This article proposes to approach this problem – here termed the Zipfian Challenge – as a standard classification task. A corpus with textual material from diverse writing systems and languages, as well as other symbolic and non-symbolic systems, is provided. This material is subsequently used to train and test binary classification algorithms, which assign the labels “writing” and “non-writing” to character strings of the test sets. The performance is generally high, reaching 98% accuracy for the best algorithms. Human languages turn out to have a statistical fingerprint: large unit inventories, high entropy, and few repetitions of adjacent units. This fingerprint can be used to tease them apart from other symbolic and non-symbolic systems.
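
To make the three fingerprint properties named in the abstract concrete, the following is a minimal sketch of how they could be estimated for a single character string. It is an illustration under assumptions, not the paper's method: the function name `fingerprint`, the use of plug-in unigram entropy, and the definition of the repetition rate as the share of adjacent positions holding identical units are all hypothetical choices made here.

```python
import math
from collections import Counter


def fingerprint(s: str) -> dict:
    """Estimate three simple statistics of a character string.

    Hypothetical helper for illustration; the feature definitions are
    assumptions and may differ from those used in the paper.
    """
    if not s:
        raise ValueError("input string must be non-empty")

    n = len(s)
    counts = Counter(s)

    # Unit inventory: number of distinct characters observed.
    inventory_size = len(counts)

    # Plug-in (maximum likelihood) unigram entropy in bits per character.
    entropy_bits = -sum((c / n) * math.log2(c / n) for c in counts.values())

    # Share of adjacent character pairs in which both units are identical.
    repeats = sum(1 for a, b in zip(s, s[1:]) if a == b)
    repetition_rate = repeats / (n - 1) if n > 1 else 0.0

    return {
        "inventory_size": inventory_size,
        "entropy_bits": entropy_bits,
        "repetition_rate": repetition_rate,
    }
```

Feature dictionaries of this kind could then serve as input to any binary classifier. On short strings the plug-in entropy estimate is biased downward, so comparisons are only meaningful between strings of similar length.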

Anthology ID: 2023.conll-1.3
Volume: Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL)
Month: December
Year: 2023
Address: Singapore
Editors: Jing Jiang, David Reitter, Shumin Deng
Venue: CoNLL
Publisher: Association for Computational Linguistics
Pages: 27–37
URL: https://aclanthology.org/2023.conll-1.3
DOI: 10.18653/v1/2023.conll-1.3
Cite (ACL): Christian Bentz. 2023. The Zipfian Challenge: Learning the statistical fingerprint of natural languages. In Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL), pages 27–37, Singapore. Association for Computational Linguistics.
Cite (Informal): The Zipfian Challenge: Learning the statistical fingerprint of natural languages (Bentz, CoNLL 2023)
PDF: https://aclanthology.org/2023.conll-1.3.pdf
Software: 2023.conll-1.3.Software.pdf
Video: https://aclanthology.org/2023.conll-1.3.mp4
