Term-Recency for TF-IDF, BM25 and USE Term Weighting

Divyanshu Marwah, Joeran Beel


Abstract
The effectiveness of a recommendation in an Information Retrieval (IR) system is determined by the relevance scores of the retrieved results. Term weighting is responsible for computing these relevance scores and, consequently, for differentiating between the terms in a document. However, current term-weighting formulas (TF-IDF, for instance) weigh terms based only on term frequency and inverse document frequency, irrespective of other important factors. This results in ambiguity when both the TF and IDF values are the same for more than one document, which yields identical TF-IDF values. In this paper, we propose a modification of TF-IDF and other term-weighting schemes that weighs terms based on their recency and usage in the corpus. We compared the performance of our algorithm against existing term-weighting schemes: TF-IDF, BM25, and the USE text embedding model. We indexed three datasets from different domains to validate the premises of our algorithm. Evaluating the algorithms using Precision, Recall, F1 score, and NDCG, we found that time-normalized TF-IDF outperformed classic TF-IDF by a significant margin on all metrics and datasets. The time-based USE model performed better than the standard USE model on two out of three datasets. However, the time-based BM25 model did not perform as well as the standard BM25 model on some of the input queries.
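The tie described in the abstract can be illustrated with a minimal sketch. It is not the authors' formula: the exponential `recency_weight` function, its `decay` parameter, the per-document `year` field, and the toy corpus are all assumptions made here for illustration. The sketch shows two documents receiving the same classic TF-IDF score for a term, and a hypothetical time factor separating them.

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """Classic TF-IDF: relative term frequency times log inverse document frequency."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for tokens in corpus if term in tokens)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

def recency_weight(year, current_year, decay=0.1):
    """Hypothetical time factor (an assumption, not the paper's definition):
    exponential decay in the age of the document."""
    return math.exp(-decay * (current_year - year))

# Toy corpus (made up for illustration); "year" is an assumed metadata field.
docs = [
    {"year": 2005, "tokens": "neural ranking for search".split()},
    {"year": 2019, "tokens": "neural ranking for retrieval".split()},
    {"year": 2012, "tokens": "citation graph analysis methods".split()},
]
corpus = [d["tokens"] for d in docs]

# Classic scores for the query term "neural": the first two documents tie.
classic = [tf_idf("neural", d["tokens"], corpus) for d in docs]

# Time-weighted scores: the more recent of the tied documents ranks higher.
timed = [s * recency_weight(d["year"], current_year=2020)
         for s, d in zip(classic, docs)]

for d, c, t in zip(docs, classic, timed):
    print(d["year"], round(c, 4), round(t, 4))
```

Under these assumptions, the 2005 and 2019 documents have identical classic scores, while the time-weighted score favours the 2019 document. The paper's own weighting is based on term recency and usage in the corpus and may differ substantially from this decay function.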
Anthology ID:
2020.wosp-1.5
Volume:
Proceedings of the 8th International Workshop on Mining Scientific Publications
Month:
05 August
Year:
2020
Address:
Wuhan, China
Editors:
Petr Knoth, Christopher Stahl, Bikash Gyawali, David Pride, Suchetha N. Kunnath, Drahomira Herrmannova
Venue:
WOSP
Publisher:
Association for Computational Linguistics
Pages:
36–41
URL:
https://aclanthology.org/2020.wosp-1.5
Cite (ACL):
Divyanshu Marwah and Joeran Beel. 2020. Term-Recency for TF-IDF, BM25 and USE Term Weighting. In Proceedings of the 8th International Workshop on Mining Scientific Publications, pages 36–41, Wuhan, China. Association for Computational Linguistics.
Cite (Informal):
Term-Recency for TF-IDF, BM25 and USE Term Weighting (Marwah & Beel, WOSP 2020)
PDF:
https://aclanthology.org/2020.wosp-1.5.pdf