Andraž Pelicon


2022

EMBEDDIA project: Cross-Lingual Embeddings for Less-Represented Languages in European News Media
Senja Pollak | Andraž Pelicon
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

The EMBEDDIA project developed a range of resources and methods for less-resourced EU languages, focusing on applications for the media industry, including keyword extraction, comment moderation and article generation.

2021

Zero-shot Cross-lingual Content Filtering: Offensive Language and Hate Speech Detection
Andraž Pelicon | Ravi Shekhar | Matej Martinc | Blaž Škrlj | Matthew Purver | Senja Pollak
Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation

We present a system for zero-shot cross-lingual offensive language and hate speech classification. The system was trained on English datasets and tested on the task of detecting hate speech and offensive social media content in a number of languages without any additional training. Experiments show an impressive ability of both models to generalize from English to other languages. There is, however, an expected gap in performance between the tested cross-lingual models and the monolingual models. The best-performing model (the offensive content classifier) is available online as a REST API.

EMBEDDIA Tools, Datasets and Challenges: Resources and Hackathon Contributions
Senja Pollak | Marko Robnik-Šikonja | Matthew Purver | Michele Boggia | Ravi Shekhar | Marko Pranjić | Salla Salmela | Ivar Krustok | Tarmo Paju | Carl-Gustav Linden | Leo Leppänen | Elaine Zosa | Matej Ulčar | Linda Freienthal | Silver Traat | Luis Adrián Cabrera-Diego | Matej Martinc | Nada Lavrač | Blaž Škrlj | Martin Žnidaršič | Andraž Pelicon | Boshko Koloski | Vid Podpečan | Janez Kranjc | Shane Sheehan | Emanuela Boros | Jose G. Moreno | Antoine Doucet | Hannu Toivonen
Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation

This paper presents tools and data sources collected and released by the EMBEDDIA project, supported by the European Union’s Horizon 2020 research and innovation program. The collected resources were offered to participants of a hackathon organized as part of the EACL Hackashop on News Media Content Analysis and Automated Report Generation in February 2021. The hackathon had six participating teams who addressed different challenges, either from the list of proposed challenges or their own news-industry-related tasks. This paper goes beyond the scope of the hackathon, as it brings together in a coherent and compact form most of the resources developed, collected and released by the EMBEDDIA project. Moreover, it constitutes a handy source for the news media industry and for researchers in the fields of Natural Language Processing and Social Science.

EMBEDDIA hackathon report: Automatic sentiment and viewpoint analysis of Slovenian news corpus on the topic of LGBTIQ+
Matej Martinc | Nina Perger | Andraž Pelicon | Matej Ulčar | Andreja Vezovnik | Senja Pollak
Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation

We conduct automatic sentiment and viewpoint analysis of the newly created Slovenian news corpus containing articles related to the topic of LGBTIQ+ by employing the state-of-the-art news sentiment classifier and a system for semantic change detection. The focus is on the differences in reporting between quality news media with a long tradition and news media with financial and political connections to SDS, a Slovene right-wing political party. The results suggest that the political affiliation of the media can affect the sentiment distribution of articles and the framing of specific LGBTIQ+ topics, such as same-sex marriage.

2019

Embeddia at SemEval-2019 Task 6: Detecting Hate with Neural Network and Transfer Learning Approaches
Andraž Pelicon | Matej Martinc | Petra Kralj Novak
Proceedings of the 13th International Workshop on Semantic Evaluation

SemEval 2019 Task 6 was OffensEval: Identifying and Categorizing Offensive Language in Social Media. The task was further divided into three sub-tasks: offensive language identification, automatic categorization of offense types, and offense target identification. In this paper, we present the approaches used by the Embeddia team, which ranked fourth, eighteenth and fifth on the three sub-tasks. A different model was trained for each sub-task. For the first sub-task, we used a BERT model fine-tuned on the OLID dataset, while for the second and third sub-tasks we developed a custom neural network architecture which combines bag-of-words features and automatically generated sequence-based features. Our results show that combining automatically and manually crafted features fed into a neural architecture outperforms the transfer learning approach on more unbalanced datasets.