Evaluating a Topic Modelling Approach to Measuring Corpus Similarity

Richard Fothergill, Paul Cook, Timothy Baldwin


Abstract
Web corpora are often constructed automatically, and their contents are therefore often not well understood. One technique for assessing the composition of such a web corpus is to empirically measure its similarity to a reference corpus whose composition is known. In this paper we evaluate a number of measures of corpus similarity, including a method based on topic modelling which has not been previously evaluated for this task. To evaluate these methods we use known-similarity corpora that have been previously used for this purpose, as well as a number of newly-constructed known-similarity corpora targeting differences in genre, topic, time, and region. Our findings indicate that, overall, the topic modelling approach did not improve on a chi-square method that had previously been found to work well for measuring corpus similarity.
Anthology ID:
L16-1042
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
273–279
URL:
https://aclanthology.org/L16-1042/
Cite (ACL):
Richard Fothergill, Paul Cook, and Timothy Baldwin. 2016. Evaluating a Topic Modelling Approach to Measuring Corpus Similarity. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 273–279, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
Evaluating a Topic Modelling Approach to Measuring Corpus Similarity (Fothergill et al., LREC 2016)
PDF:
https://aclanthology.org/L16-1042.pdf