Diverging Divergences: Examining Variants of Jensen Shannon Divergence for Corpus Comparison Tasks

Jinghui Lu, Maeve Henchion, Brian Mac Namee


Abstract
Jensen-Shannon divergence (JSD) is a measure of distributional similarity widely used in natural language processing. In corpus comparison tasks, where keywords are extracted to reveal the divergence between different corpora (for example, social media posts from proponents of different views on a political issue), two variants of JSD have emerged in the literature. One of these uses a weighting based on the relative sizes of the corpora being compared. In this paper we argue that this weighting is unnecessary and, in fact, can lead to misleading results. We recommend against using this weighted version. We base this recommendation on an analysis of the JSD variants and on experiments showing how they affect corpus comparison results as the relative sizes of the corpora being compared change.
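To make the distinction between the two variants concrete, the following is a minimal sketch, not the authors' implementation. It assumes the weighted variant is the generalized JSD, H(w_a P + w_b Q) - w_a H(P) - w_b H(Q), with weights proportional to corpus token counts, and that the unweighted variant fixes both weights at 1/2; the paper's exact formulation may differ. The names `entropy` and `jsd` and the toy word counts are hypothetical.

```python
import math


def entropy(dist):
    """Shannon entropy in bits of a dict mapping word -> probability."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def jsd(counts_a, counts_b, weighted=False):
    """JSD between two corpora given as dicts mapping word -> count.

    weighted=False: both corpora get weight 1/2.
    weighted=True:  weights are proportional to corpus size
                    (an assumption about the weighted variant).
    """
    n_a = sum(counts_a.values())
    n_b = sum(counts_b.values())
    w_a = n_a / (n_a + n_b) if weighted else 0.5
    w_b = 1.0 - w_a

    # Relative word frequencies within each corpus.
    p = {w: c / n_a for w, c in counts_a.items()}
    q = {w: c / n_b for w, c in counts_b.items()}

    # Mixture distribution over the joint vocabulary.
    vocab = set(p) | set(q)
    m = {w: w_a * p.get(w, 0.0) + w_b * q.get(w, 0.0) for w in vocab}

    return entropy(m) - w_a * entropy(p) - w_b * entropy(q)


if __name__ == "__main__":
    small = {"tax": 1}      # toy corpus with 1 token
    large = {"health": 3}   # toy corpus with 3 tokens, no shared words
    print(jsd(small, large))                 # unweighted
    print(jsd(small, large, weighted=True))  # size-weighted
```

In this toy case the two corpora share no vocabulary, so the unweighted score sits at its maximum of 1 bit regardless of corpus size, while the size-weighted score falls below 1 only because one corpus is larger. This illustrates the kind of size dependence the abstract cautions against.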
Anthology ID:
2020.lrec-1.832
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
6740–6744
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.832
Cite (ACL):
Jinghui Lu, Maeve Henchion, and Brian Mac Namee. 2020. Diverging Divergences: Examining Variants of Jensen Shannon Divergence for Corpus Comparison Tasks. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6740–6744, Marseille, France. European Language Resources Association.
Cite (Informal):
Diverging Divergences: Examining Variants of Jensen Shannon Divergence for Corpus Comparison Tasks (Lu et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.832.pdf