Length Dependence of Vocabulary Richness

Niklas Zechner


Abstract
The relation between the length of a text and the number of unique words is investigated using several Swedish language corpora. We consider a number of existing measures of vocabulary richness, show that they are not length-independent, and try to improve on some of them based on statistical evidence. We also look at the spectrum of values over text lengths, and find that genres have characteristic shapes.
Anthology ID:
2023.nodalida-1.56
Volume:
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
Month:
May
Year:
2023
Address:
Tórshavn, Faroe Islands
Editors:
Tanel Alumäe, Mark Fishel
Venue:
NoDaLiDa
Publisher:
University of Tartu Library
Pages:
565–573
URL:
https://aclanthology.org/2023.nodalida-1.56
Cite (ACL):
Niklas Zechner. 2023. Length Dependence of Vocabulary Richness. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pages 565–573, Tórshavn, Faroe Islands. University of Tartu Library.
Cite (Informal):
Length Dependence of Vocabulary Richness (Zechner, NoDaLiDa 2023)
PDF:
https://aclanthology.org/2023.nodalida-1.56.pdf