2023
Reconstruct to Retrieve: Identifying interesting news in a Cross-lingual setting
Boshko Koloski | Blaz Skrlj | Nada Lavrac | Senja Pollak
Proceedings of the 2023 CLASP Conference on Learning with Small Data (LSD)
An important and resource-intensive task in journalism is retrieving relevant foreign news and its adaptation for local readers. Given the vast amount of foreign articles published and the limited number of journalists available to evaluate their interestingness, this task can be particularly challenging, especially when dealing with smaller languages and countries. In this work, we propose a novel method for large-scale retrieval of potentially translation-worthy articles based on an auto-encoder neural network trained on a limited corpus of relevant foreign news. We hypothesize that the representations of interesting news can be reconstructed very well by an auto-encoder, while irrelevant news would have less adequate reconstructions since they are not used for training the network. Specifically, we focus on extracting articles from the Latvian media for Estonian news media houses. It is worth noting that the available corpora for this task are particularly limited, which adds an extra layer of difficulty to our approach. To evaluate the proposed method, we rely on manual evaluation by an Estonian journalist at Ekspress Meedia and automatic evaluation on a gold standard test set.
2021
BERT meets Shapley: Extending SHAP Explanations to Transformer-based Classifiers
Enja Kokalj | Blaž Škrlj | Nada Lavrač | Senja Pollak | Marko Robnik-Šikonja
Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation
Transformer-based neural networks offer very good classification performance across a wide range of domains, but do not provide explanations of their predictions. While several explanation methods, including SHAP, address the problem of interpreting deep learning models, they are not adapted to operate on state-of-the-art transformer-based neural networks such as BERT. Another shortcoming of these methods is that their visualization of explanations in the form of lists of most relevant words does not take into account the sequential and structurally dependent nature of text. This paper proposes the TransSHAP method that adapts SHAP to transformer models including BERT-based text classifiers. It advances SHAP visualizations by showing explanations in a sequential manner, assessed by human evaluators as competitive to state-of-the-art solutions.
EMBEDDIA Tools, Datasets and Challenges: Resources and Hackathon Contributions
Senja Pollak | Marko Robnik-Šikonja | Matthew Purver | Michele Boggia | Ravi Shekhar | Marko Pranjić | Salla Salmela | Ivar Krustok | Tarmo Paju | Carl-Gustav Linden | Leo Leppänen | Elaine Zosa | Matej Ulčar | Linda Freienthal | Silver Traat | Luis Adrián Cabrera-Diego | Matej Martinc | Nada Lavrač | Blaž Škrlj | Martin Žnidaršič | Andraž Pelicon | Boshko Koloski | Vid Podpečan | Janez Kranjc | Shane Sheehan | Emanuela Boros | Jose G. Moreno | Antoine Doucet | Hannu Toivonen
Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation
This paper presents tools and data sources collected and released by the EMBEDDIA project, supported by the European Union’s Horizon 2020 research and innovation program. The collected resources were offered to participants of a hackathon organized as part of the EACL Hackashop on News Media Content Analysis and Automated Report Generation in February 2021. The hackathon had six participating teams who addressed different challenges, either from the list of proposed challenges or their own news-industry-related tasks. This paper goes beyond the scope of the hackathon, as it brings together in a coherent and compact form most of the resources developed, collected and released by the EMBEDDIA project. Moreover, it constitutes a handy source for news media industry and researchers in the fields of Natural Language Processing and Social Science.
2012
Irregularity Detection in Categorized Document Corpora
Borut Sluban | Senja Pollak | Roel Coesemans | Nada Lavrač
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
The paper presents an approach to extract irregularities in document corpora, where the documents originate from different sources and the analyst's interest is to find documents which are atypical for the given source. The main contribution of the paper is a voting-based approach to irregularity detection and its evaluation on a collection of newspaper articles from two sources: Western (UK and US) and local (Kenyan) media. The evaluation of a domain expert proves that the method is very effective in uncovering interesting irregularities in categorized document corpora.