Matej Martinc


2021

pdf bib
Extending Neural Keyword Extraction with TF-IDF tagset matching
Boshko Koloski | Senja Pollak | Blaž Škrlj | Matej Martinc
Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation

Keyword extraction is the task of identifying words (or multi-word expressions) that best describe a given document and serve in news portals to link articles of similar topics. In this work, we develop and evaluate our methods on four novel data sets covering less-represented, morphologically-rich languages in European news media industry (Croatian, Estonian, Latvian, and Russian). First, we perform evaluation of two supervised neural transformer-based methods, Transformer-based Neural Tagger for Keyword Identification (TNT-KID) and Bidirectional Encoder Representations from Transformers (BERT) with an additional Bidirectional Long Short-Term Memory Conditional Random Fields (BiLSTM CRF) classification head, and compare them to a baseline Term Frequency - Inverse Document Frequency (TF-IDF) based unsupervised approach. Next, we show that by combining the keywords retrieved by both neural transformer-based methods and extending the final set of keywords with an unsupervised TF-IDF based technique, we can drastically improve the recall of the system, making it appropriate for usage as a recommendation system in the media house environment.

pdf bib
Zero-shot Cross-lingual Content Filtering: Offensive Language and Hate Speech Detection
Andraž Pelicon | Ravi Shekhar | Matej Martinc | Blaž Škrlj | Matthew Purver | Senja Pollak
Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation

We present a system for zero-shot cross-lingual offensive language and hate speech classification. The system was trained on English datasets and tested on a task of detecting hate speech and offensive social media content in a number of languages without any additional training. Experiments show an impressive ability of both models to generalize from English to other languages. There is however an expected gap in performance between the tested cross-lingual models and the monolingual models. The best performing model (offensive content classifier) is available online as a REST API.

pdf bib
EMBEDDIA Tools, Datasets and Challenges: Resources and Hackathon Contributions
Senja Pollak | Marko Robnik-Šikonja | Matthew Purver | Michele Boggia | Ravi Shekhar | Marko Pranjić | Salla Salmela | Ivar Krustok | Tarmo Paju | Carl-Gustav Linden | Leo Leppänen | Elaine Zosa | Matej Ulčar | Linda Freienthal | Silver Traat | Luis Adrián Cabrera-Diego | Matej Martinc | Nada Lavrač | Blaž Škrlj | Martin Žnidaršič | Andraž Pelicon | Boshko Koloski | Vid Podpečan | Janez Kranjc | Shane Sheehan | Emanuela Boros | Jose G. Moreno | Antoine Doucet | Hannu Toivonen
Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation

This paper presents tools and data sources collected and released by the EMBEDDIA project, supported by the European Union’s Horizon 2020 research and innovation program. The collected resources were offered to participants of a hackathon organized as part of the EACL Hackashop on News Media Content Analysis and Automated Report Generation in February 2021. The hackathon had six participating teams who addressed different challenges, either from the list of proposed challenges or their own news-industry-related tasks. This paper goes beyond the scope of the hackathon, as it brings together in a coherent and compact form most of the resources developed, collected and released by the EMBEDDIA project. Moreover, it constitutes a handy source for news media industry and researchers in the fields of Natural Language Processing and Social Science.

pdf bib
EMBEDDIA hackathon report: Automatic sentiment and viewpoint analysis of Slovenian news corpus on the topic of LGBTIQ+
Matej Martinc | Nina Perger | Andraž Pelicon | Matej Ulčar | Andreja Vezovnik | Senja Pollak
Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation

We conduct automatic sentiment and viewpoint analysis of the newly created Slovenian news corpus containing articles related to the topic of LGBTIQ+ by employing the state-of-the-art news sentiment classifier and a system for semantic change detection. The focus is on the differences in reporting between quality news media with long tradition and news media with financial and political connections to SDS, a Slovene right-wing political party. The results suggest that political affiliation of the media can affect the sentiment distribution of articles and the framing of specific LGBTIQ+ specific topics, such as same-sex marriage.

pdf bib
Scalable and Interpretable Semantic Change Detection
Syrielle Montariol | Matej Martinc | Lidia Pivovarova
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Several cluster-based methods for semantic change detection with contextual embeddings emerged recently. They allow a fine-grained analysis of word use change by aggregating embeddings into clusters that reflect the different usages of the word. However, these methods are unscalable in terms of memory consumption and computation time. Therefore, they require a limited set of target words to be picked in advance. This drastically limits the usability of these methods in open exploratory tasks, where each word from the vocabulary can be considered as a potential target. We propose a novel scalable method for word usage-change detection that offers large gains in processing time and significant memory savings while offering the same interpretability and better performance than unscalable methods. We demonstrate the applicability of the proposed method by analysing a large corpus of news articles about COVID-19.

pdf bib
Supervised and Unsupervised Neural Approaches to Text Readability
Matej Martinc | Senja Pollak | Marko Robnik-Šikonja
Computational Linguistics, Volume 47, Issue 1 - March 2021

Abstract We present a set of novel neural supervised and unsupervised approaches for determining the readability of documents. In the unsupervised setting, we leverage neural language models, whereas in the supervised setting, three different neural classification architectures are tested. We show that the proposed neural unsupervised approach is robust, transferable across languages, and allows adaptation to a specific readability task and data set. By systematic comparison of several neural architectures on a number of benchmark and new labeled readability data sets in two languages, this study also offers a comprehensive analysis of different neural approaches to readability classification. We expose their strengths and weaknesses, compare their performance to current state-of-the-art classification approaches to readability, which in most cases still rely on extensive feature engineering, and propose possibilities for improvements.

2020

pdf bib
Leveraging Contextual Embeddings for Detecting Diachronic Semantic Shift
Matej Martinc | Petra Kralj Novak | Senja Pollak
Proceedings of the 12th Language Resources and Evaluation Conference

We propose a new method that leverages contextual embeddings for the task of diachronic semantic shift detection by generating time specific word representations from BERT embeddings. The results of our experiments in the domain specific LiverpoolFC corpus suggest that the proposed method has performance comparable to the current state-of-the-art without requiring any time consuming domain adaptation on large corpora. The results on the newly created Brexit news corpus suggest that the method can be successfully used for the detection of a short-term yearly semantic shift. And lastly, the model also shows promising results in a multilingual settings, where the task was to detect differences and similarities between diachronic semantic shifts in different languages.

pdf bib
Discovery Team at SemEval-2020 Task 1: Context-sensitive Embeddings Not Always Better than Static for Semantic Change Detection
Matej Martinc | Syrielle Montariol | Elaine Zosa | Lidia Pivovarova
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This paper describes the approaches used by the Discovery Team to solve SemEval-2020 Task 1 - Unsupervised Lexical Semantic Change Detection. The proposed method is based on clustering of BERT contextual embeddings, followed by a comparison of cluster distributions across time. The best results were obtained by an ensemble of this method and static Word2Vec embeddings. According to the official results, our approach proved the best for Latin in Subtask 2.

pdf bib
Mining Semantic Relations from Comparable Corpora through Intersections of Word Embeddings
Špela Vintar | Larisa Grčić Simeunović | Matej Martinc | Senja Pollak | Uroš Stepišnik
Proceedings of the 13th Workshop on Building and Using Comparable Corpora

We report an experiment aimed at extracting words expressing a specific semantic relation using intersections of word embeddings. In a multilingual frame-based domain model, specific features of a concept are typically described through a set of non-arbitrary semantic relations. In karstology, our domain of choice which we are exploring though a comparable corpus in English and Croatian, karst phenomena such as landforms are usually described through their FORM, LOCATION, CAUSE, FUNCTION and COMPOSITION. We propose an approach to mine words pertaining to each of these relations by using a small number of seed adjectives, for which we retrieve closest words using word embeddings and then use intersections of these neighbourhoods to refine our search. Such cross-language expansion of semantically-rich vocabulary is a valuable aid in improving the coverage of a multilingual knowledge base, but also in exploring differences between languages in their respective conceptualisations of the domain.

2019

pdf bib
Embeddia at SemEval-2019 Task 6: Detecting Hate with Neural Network and Transfer Learning Approaches
Andraž Pelicon | Matej Martinc | Petra Kralj Novak
Proceedings of the 13th International Workshop on Semantic Evaluation

SemEval 2019 Task 6 was OffensEval: Identifying and Categorizing Offensive Language in Social Media. The task was further divided into three sub-tasks: offensive language identification, automatic categorization of offense types, and offense target identification. In this paper, we present the approaches used by the Embeddia team, who qualified as fourth, eighteenth and fifth on the tree sub-tasks. A different model was trained for each sub-task. For the first sub-task, we used a BERT model fine-tuned on the OLID dataset, while for the second and third tasks we developed a custom neural network architecture which combines bag-of-words features and automatically generated sequence-based features. Our results show that combining automatically and manually crafted features fed into a neural architecture outperform transfer learning approach on more unbalanced datasets.

2018

pdf bib
Reusable workflows for gender prediction
Matej Martinc | Senja Pollak
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Er ... well, it matters, right? On the role of data representations in spoken language dependency parsing
Kaja Dobrovoljc | Matej Martinc
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

Despite the significant improvement of data-driven dependency parsing systems in recent years, they still achieve a considerably lower performance in parsing spoken language data in comparison to written data. On the example of Spoken Slovenian Treebank, the first spoken data treebank using the UD annotation scheme, we investigate which speech-specific phenomena undermine parsing performance, through a series of training data and treebank modification experiments using two distinct state-of-the-art parsing systems. Our results show that utterance segmentation is the most prominent cause of low parsing performance, both in parsing raw and pre-segmented transcriptions. In addition to shorter utterances, both parsers perform better on normalized transcriptions including basic markers of prosody and excluding disfluencies, discourse markers and fillers. On the other hand, the effects of written training data addition and speech-specific dependency representations largely depend on the parsing system selected.