Term-Recency for TF-IDF, BM25 and USE Term Weighting
Divyanshu
Marwah
author
Joeran
Beel
author
2020-08-05
text
Proceedings of the 8th International Workshop on Mining Scientific Publications
Petr
Knoth
editor
Christopher
Stahl
editor
Bikash
Gyawali
editor
David
Pride
editor
Suchetha
N
Kunnath
editor
Drahomira
Herrmannova
editor
Association for Computational Linguistics
Wuhan, China
conference publication
The effectiveness of a recommendation in an Information Retrieval (IR) system is determined by the relevance scores of the retrieved results. Term weighting is responsible for computing these relevance scores and, consequently, for differentiating between the terms in a document. However, current term-weighting formulas (TF-IDF, for instance) weigh terms based only on term frequency and inverse document frequency, irrespective of other important factors. This results in ambiguity when both the TF and IDF values are the same for more than one document, which yields identical TF-IDF values. In this paper, we propose a modification of TF-IDF and other term-weighting schemes that weighs terms based on their recency and usage in the corpus. We tested the performance of our algorithm against existing term-weighting schemes: TF-IDF, BM25, and the USE text embedding model. We indexed three datasets from different domains to validate the premises of our algorithm. Evaluating the algorithms using Precision, Recall, F1 score, and NDCG, we found that time-normalized TF-IDF outperformed classic TF-IDF by a significant margin on all metrics and datasets. The time-based USE model performed better than the standard USE model on two out of three datasets. However, the time-based BM25 model did not perform as well as the standard BM25 model on some of the input queries.
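The tie described in the abstract can be illustrated with a minimal sketch. It assumes a common textbook form of TF-IDF (relative term frequency times log inverse document frequency) and a purely hypothetical exponential recency factor. The function `recency_factor`, its `decay` parameter, and all the numbers below are illustrative assumptions and are not the weighting proposed in the paper, whose exact formula the abstract does not give.

```python
import math


def tf_idf(term_count, doc_length, n_docs, doc_freq):
    """Textbook TF-IDF: relative term frequency times log inverse document frequency."""
    tf = term_count / doc_length
    idf = math.log(n_docs / doc_freq)
    return tf * idf


def recency_factor(doc_year, current_year, decay=0.1):
    """Hypothetical exponential decay in document age (an assumption, not the paper's formula)."""
    return math.exp(-decay * (current_year - doc_year))


# Two made-up documents with identical term statistics but different publication years.
docs = {
    "doc_old": {"term_count": 3, "doc_length": 100, "year": 2010},
    "doc_new": {"term_count": 3, "doc_length": 100, "year": 2019},
}
N_DOCS, DOC_FREQ, CURRENT_YEAR = 1000, 10, 2020

# Classic scores: identical TF and IDF give identical TF-IDF, so the ranking is ambiguous.
classic = {
    name: tf_idf(d["term_count"], d["doc_length"], N_DOCS, DOC_FREQ)
    for name, d in docs.items()
}

# Time-weighted scores: the hypothetical recency factor breaks the tie in favour of the newer document.
timed = {
    name: classic[name] * recency_factor(d["year"], CURRENT_YEAR)
    for name, d in docs.items()
}
```

Under these assumptions the two classic scores are equal, while the time-weighted scores rank `doc_new` above `doc_old`. Any monotonically decreasing function of document age would break the tie in the same way; the exponential form is chosen here only for brevity.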
marwah-beel-2020-term
https://aclanthology.org/2020.wosp-1.5
2020-08-05
36
41