Reconstruct to Retrieve: Identifying interesting news in a Cross-lingual setting

Boshko Koloski, Blaz Skrlj, Nada Lavrac, Senja Pollak


Abstract
An important and resource-intensive task in journalism is retrieving relevant foreign news and adapting it for local readers. Given the vast number of foreign articles published and the limited number of journalists available to evaluate their interestingness, this task can be particularly challenging, especially for smaller languages and countries. In this work, we propose a novel method for large-scale retrieval of potentially translation-worthy articles, based on an auto-encoder neural network trained on a limited corpus of relevant foreign news. We hypothesize that an auto-encoder can reconstruct the representations of interesting news very well, while irrelevant news will have less adequate reconstructions because such articles are not used for training the network. Specifically, we focus on extracting articles from Latvian media for Estonian news media houses. The available corpora for this task are particularly limited, which adds an extra layer of difficulty. To evaluate the proposed method, we rely on manual evaluation by an Estonian journalist at Ekspress Meedia and on automatic evaluation against a gold standard test set.
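The hypothesis in the abstract, that articles resembling the training corpus are reconstructed with lower error, can be illustrated with a minimal sketch. This is not the authors' implementation: a linear auto-encoder fitted with an SVD stands in for the neural network, the document representations are synthetic vectors rather than real article embeddings, and the function names (`fit_linear_autoencoder`, `reconstruction_error`, `rank_candidates`) are hypothetical.

```python
import numpy as np


def fit_linear_autoencoder(X, n_hidden):
    """Fit a linear auto-encoder on the rows of X (assumed stand-in model).

    Returns the training mean and the n_hidden directions used both to
    encode and to decode.
    """
    mean = X.mean(axis=0)
    # The top right-singular vectors of the centered data give the
    # linear encoder/decoder pair with minimal squared reconstruction error.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_hidden]


def reconstruction_error(X, mean, components):
    """Mean squared reconstruction error for each row of X."""
    centered = X - mean
    codes = centered @ components.T        # encode
    reconstructed = codes @ components     # decode
    return np.mean((centered - reconstructed) ** 2, axis=1)


def rank_candidates(candidates, mean, components):
    """Order candidates from lowest to highest reconstruction error."""
    errors = reconstruction_error(candidates, mean, components)
    return np.argsort(errors), errors


# Synthetic data (an assumption for illustration only): "interesting"
# representations lie in a 2-D subspace of an 8-D embedding space.
rng = np.random.default_rng(0)
basis = rng.normal(size=(2, 8))
train = rng.normal(size=(50, 2)) @ basis

mean, components = fit_linear_autoencoder(train, n_hidden=2)

similar = rng.normal(size=(3, 2)) @ basis   # same subspace as training data
unrelated = rng.normal(size=(3, 8))         # arbitrary directions
candidates = np.vstack([similar, unrelated])

order, errors = rank_candidates(candidates, mean, components)
print(order)
print(errors)
```

In this toy setup the three candidates drawn from the training subspace rank ahead of the unrelated ones, because their reconstruction error is close to zero. With real articles, the quality of the ranking would depend on the document representation and on how well the limited training corpus covers what journalists consider interesting.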
Anthology ID:
2023.clasp-1.10
Volume:
Proceedings of the 2023 CLASP Conference on Learning with Small Data (LSD)
Month:
September
Year:
2023
Address:
Gothenburg, Sweden
Editors:
Ellen Breitholtz, Shalom Lappin, Sharid Loaiciga, Nikolai Ilinykh, Simon Dobnik
Venue:
CLASP
Publisher:
Association for Computational Linguistics
Pages:
81–89
URL:
https://aclanthology.org/2023.clasp-1.10
Cite (ACL):
Boshko Koloski, Blaz Skrlj, Nada Lavrac, and Senja Pollak. 2023. Reconstruct to Retrieve: Identifying interesting news in a Cross-lingual setting. In Proceedings of the 2023 CLASP Conference on Learning with Small Data (LSD), pages 81–89, Gothenburg, Sweden. Association for Computational Linguistics.
Cite (Informal):
Reconstruct to Retrieve: Identifying interesting news in a Cross-lingual setting (Koloski et al., CLASP 2023)
PDF:
https://aclanthology.org/2023.clasp-1.10.pdf