Diversity, Density, and Homogeneity: Quantitative Characteristic Metrics for Text Collections

Yi-An Lai, Xuan Zhu, Yi Zhang, Mona Diab


Abstract
Summarizing data samples by quantitative measures has a long history, with descriptive statistics being a case in point. However, as natural language processing methods flourish, there are still insufficient characteristic metrics to describe a collection of texts in terms of the words, sentences, or paragraphs they comprise. In this work, we propose metrics of diversity, density, and homogeneity that quantitatively measure the dispersion, sparsity, and uniformity of a text collection. We conduct a series of simulations to verify that each metric holds desired properties and resonates with human intuitions. Experiments on real-world datasets demonstrate that the proposed characteristic metrics are highly correlated with text classification performance of a renowned model, BERT, which could inspire future applications.
Anthology ID:
2020.lrec-1.215
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
1739–1746
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.215
Cite (ACL):
Yi-An Lai, Xuan Zhu, Yi Zhang, and Mona Diab. 2020. Diversity, Density, and Homogeneity: Quantitative Characteristic Metrics for Text Collections. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 1739–1746, Marseille, France. European Language Resources Association.
Cite (Informal):
Diversity, Density, and Homogeneity: Quantitative Characteristic Metrics for Text Collections (Lai et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.215.pdf
Data:
SNIPS