SPICED: News Similarity Detection Dataset with Multiple Topics and Complexity Levels

Elena Shushkevich, Long Thanh Mai, Manuel V. Loureiro, Steven Derby, Tri Kurniawan Wijaya


Abstract
The proliferation of news media outlets has increased the demand for intelligent systems capable of detecting redundant information in news articles in order to enhance user experience. However, the heterogeneous nature of news can lead to spurious findings in these systems: simple heuristics, such as whether two news articles are both about politics, can provide strong but deceptive downstream performance. Segmenting news similarity datasets into topics improves the training of these models by forcing them to learn how to distinguish salient characteristics within narrower domains. However, this requires the existence of topic-specific datasets, which are currently lacking. In this article, we propose a novel dataset of similar news, SPICED, which includes seven topics: Crime & Law, Culture & Entertainment, Disasters & Accidents, Economy & Business, Politics & Conflicts, Science & Technology, and Sports. Furthermore, we present four different levels of complexity, specifically designed for the news similarity detection task. We benchmarked the created datasets using MinHash, BERT, SBERT, and SimCSE models.
Anthology ID:
2024.lrec-main.1320
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
15181–15190
URL:
https://aclanthology.org/2024.lrec-main.1320
Cite (ACL):
Elena Shushkevich, Long Thanh Mai, Manuel V. Loureiro, Steven Derby, and Tri Kurniawan Wijaya. 2024. SPICED: News Similarity Detection Dataset with Multiple Topics and Complexity Levels. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 15181–15190, Torino, Italia. ELRA and ICCL.
Cite (Informal):
SPICED: News Similarity Detection Dataset with Multiple Topics and Complexity Levels (Shushkevich et al., LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.1320.pdf