Senja Pollak - ACL Anthology

Senja Pollak

2025

A Computational Framework to Identify Self-Aspects in Text
Jaya Caporusso | Matthew Purver | Senja Pollak
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

This Ph.D. proposal introduces a plan to develop a computational framework to identify Self-aspects in text. The Self is a multifaceted construct and it is reflected in language. While it is described across disciplines like cognitive science and phenomenology, it remains underexplored in natural language processing (NLP). Many of the aspects of the Self align with psychological and other well-researched phenomena (e.g., those related to mental health), highlighting the need for systematic NLP-based analysis. In line with this, we plan to introduce an ontology of Self-aspects and a gold-standard annotated dataset. Using this foundation, we will develop and evaluate conventional discriminative models, generative large language models, and embedding-based retrieval approaches against four main criteria: interpretability, ground-truth adherence, accuracy, and computational efficiency. Top-performing models will be applied in case studies in mental health and empirical phenomenology.

SEKE: Specialised Experts for Keyword Extraction
Matej Martinc | Thi Hong Hanh Tran | Senja Pollak | Boshko Koloski
Findings of the Association for Computational Linguistics: EMNLP 2025

Keyword extraction involves identifying the most descriptive words in a document, allowing automatic categorisation and summarisation of large quantities of diverse textual data. Relying on the insight that real-world keyword detection often requires handling of diverse content, we propose a novel supervised keyword extraction approach based on the mixture of experts (MoE) technique. MoE uses a learnable routing sub-network to direct information to specialised experts, allowing them to specialise in distinct regions of the input space. SEKE, a mixture of Specialised Experts for supervised Keyword Extraction, uses DeBERTa as the backbone model and builds on the MoE framework, where experts attend to each token, by integrating it with a bidirectional long short-term memory (BiLSTM) network, to allow successful extraction even on smaller corpora, where specialisation is harder due to lack of training data. The MoE framework also provides insight into the inner workings of individual experts, enhancing the explainability of the approach. We benchmark SEKE on multiple English datasets, achieving state-of-the-art performance compared to strong supervised and unsupervised baselines. Our analysis reveals that, depending on data size and type, experts specialise in distinct syntactic and semantic components, such as punctuation, stopwords, parts-of-speech, or named entities. Code is available at https://github.com/matejMartinc/SEKE_keyword_extraction.
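
The routing idea described in this abstract can be illustrated with a minimal sketch. It is not SEKE's architecture (which uses DeBERTa and a BiLSTM); it assumes a toy setting in which a token is a short list of floats, the router is a plain weight matrix, and each expert is an ordinary function. All names here (`route_token`, `moe_layer`, `router_weights`) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(token_vec, router_weights):
    """Return one gate probability per expert for a single token vector.

    router_weights holds one weight row per expert (an assumed toy router).
    """
    logits = [sum(w * x for w, x in zip(row, token_vec)) for row in router_weights]
    return softmax(logits)


def moe_layer(token_vec, router_weights, experts):
    """Mix the expert outputs for one token, weighted by the gate probabilities."""
    gates = route_token(token_vec, router_weights)
    outputs = [expert(token_vec) for expert in experts]
    dim = len(outputs[0])
    mixed = [sum(g * out[i] for g, out in zip(gates, outputs)) for i in range(dim)]
    return mixed, gates


# Two toy experts: one doubles the input, one negates it.
experts = [lambda v: [2 * x for x in v], lambda v: [-x for x in v]]
router_weights = [[1.0, 0.0], [0.0, 1.0]]
mixed, gates = moe_layer([3.0, 0.0], router_weights, experts)
```

Inspecting `gates` per token is what makes this kind of model easy to analyse: the largest gate indicates which expert the router prefers for that token.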

2024

Transformer verbatim in-context retrieval across time and scale
Kristijan Armeni | Marko Pranjić | Senja Pollak
Proceedings of the 28th Conference on Computational Natural Language Learning

To predict upcoming text, language models must in some cases retrieve in-context information verbatim. In this report, we investigated how the ability of language models to retrieve arbitrary in-context nouns developed during training (across time) and as language models trained on the same dataset increase in size (across scale). We then asked whether learning of in-context retrieval correlates with learning of more challenging zero-shot benchmarks. Furthermore, inspired by semantic effects in human short-term memory, we evaluated the retrieval with respect to a major semantic component of target nouns, namely whether they denote a concrete or abstract entity, as rated by humans. We show that verbatim in-context retrieval developed in a sudden transition early in the training process, after about 1% of the training tokens. This was observed across model sizes (from 14M up to 12B parameters), and the transition occurred slightly later for the two smallest models. We further found that the development of verbatim in-context retrieval is positively correlated with the learning of zero-shot benchmarks. Around the transition point, all models showed an advantage in retrieving concrete nouns over abstract nouns. In all but the two smallest models, the advantage dissipated toward the end of training.

A Computational Analysis of the Dehumanisation of Migrants from Syria and Ukraine in Slovene News Media
Jaya Caporusso | Damar Hoogland | Mojca Brglez | Boshko Koloski | Matthew Purver | Senja Pollak
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Dehumanisation involves the perception and/or treatment of a social group’s members as less than human. This phenomenon is rarely addressed with computational linguistic techniques. We adapt a recently proposed approach for English, making it easier to transfer to other languages and to evaluate, introducing a new sentiment resource, the use of zero-shot cross-lingual valence and arousal detection, and a new method for statistical significance testing. We then apply it to study attitudes to migration expressed in Slovene newspapers, to examine changes in the Slovene discourse on migration between the 2015-16 migration crisis following the war in Syria and the 2022-23 period following the war in Ukraine. We find that while this discourse became more negative and more intense over time, it is less dehumanising when specifically addressing Ukrainian migrants compared to others.

Denoising Labeled Data for Comment Moderation Using Active Learning
Andraž Pelicon | Vanja Mladen Karan | Ravi Shekhar | Matthew Purver | Senja Pollak
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Noisily labeled textual data is ample on internet platforms that allow user-created content. Training models, such as offensive language detection models for comment moderation, on such data may prove difficult as the noise in the labels prevents the model from converging. In this work, we propose to use active learning methods for denoising training data. The goal is to use active learning to sample the most informative examples with noisy labels and send them to the oracle for reannotation, thus reducing the overall cost of reannotation. In this setting we tested three existing active learning methods, namely DBAL, Variance of Gradients (VoG) and BADGE. The proposed approach to data denoising is tested on the problem of offensive language detection. We observe that active learning can be effectively used for data denoising; however, care should be taken when choosing the algorithm for this purpose.

LLMSegm: Surface-level Morphological Segmentation Using Large Language Model
Marko Pranjić | Marko Robnik-Šikonja | Senja Pollak
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Morphological word segmentation splits a given word into its morphemes (roots and affixes), the smallest meaning-bearing units of language. We introduce a novel approach, called LLMSegm, to surface-level morphological segmentation leveraging large language models (LLMs). The proposed approach is applicable in low-data settings as well as for low-resourced languages. We show how to transform the surface-level morphological segmentation task to a binary classification problem and train LLMs to solve it efficiently. For input, we leverage the information from the default LLM subword tokenisation, and a custom morphological segmentation using novel encoding. The evaluation of LLMSegm across seven morphologically diverse languages demonstrates substantial gains in minimally-supervised settings as well as for low-resourced languages, compared to several existing competitive approaches. In terms of F1-scores and accuracy, we achieve improved results compared to the competing methods in six out of seven datasets. Keywords: morphological segmentation, surface-level segmentation, large language models, low-resource settings

DeBERTa Beats Behemoths: A Comparative Analysis of Fine-Tuning, Prompting, and PEFT Approaches on LegalLensNER
Hanh Thi Hong Tran | Nishan Chatterjee | Senja Pollak | Antoine Doucet
Proceedings of the Natural Legal Language Processing Workshop 2024

This paper summarizes the participation of our team (Flawless Lawgic) in the legal named entity recognition (L-NER) task at LegalLens 2024: Detecting Legal Violations. Given unstructured texts (e.g., online media texts), we aim to identify legal violations by extracting legal entities such as “violation”, “violation by”, “violation on”, and “law”. This system-description paper discusses our approaches to the task, empirically comparing the performance of fine-tuned models from the Transformers family (e.g., RoBERTa and DeBERTa) against open-source LLMs (e.g., Llama, Mistral) with different tuning settings (e.g., LoRA, Supervised Fine-Tuning (SFT) and prompting strategies). Our best results, with a weighted F1 of 0.705 on the test set, show a 30-percentage-point increase in F1 compared to the baseline and rank second on the leaderboard, only 0.4 percentage points behind the top solution. Our solutions are available at github.com/honghanhh/lner.

Comparing News Framing of Migration Crises using Zero-Shot Classification
Nikola Ivačič | Matthew Purver | Fabienne Lind | Senja Pollak | Hajo Boomgaarden | Veronika Bajt
Proceedings of the First Workshop on Reference, Framing, and Perspective @ LREC-COLING 2024

We present an experiment on classifying news frames in a language unseen by the learner, using zero-shot cross-lingual transfer learning. We used two pre-trained multilingual Transformer Encoder neural network models and tested with four specific news frames, investigating two approaches to the resulting multi-label task: Binary Relevance (treating each frame independently) and Label Power-set (predicting each possible combination of frames). We train our classifiers on an available annotated multilingual migration news dataset and test on an unseen Slovene language migration news corpus, first evaluating performance and then using the classifiers to analyse how media framed the news during the periods of Syria and Ukraine conflict-related migrations.
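
The two multi-label formulations named in this abstract differ only in how the training targets are built, which a minimal sketch can show. The frame names below (`frame_a`, `frame_b`) are placeholders, since the abstract does not name the four frames, and the helper functions are hypothetical rather than taken from the paper.

```python
def binary_relevance_targets(label_sets, frames):
    """Binary Relevance: one independent 0/1 target list per frame."""
    return {
        frame: [1 if frame in labels else 0 for labels in label_sets]
        for frame in frames
    }


def label_powerset_targets(label_sets):
    """Label Power-set: each distinct combination of frames becomes one class."""
    class_ids = {}
    targets = []
    for labels in label_sets:
        key = tuple(sorted(labels))
        if key not in class_ids:
            class_ids[key] = len(class_ids)
        targets.append(class_ids[key])
    return targets, class_ids


# Four toy articles annotated with placeholder frame names.
label_sets = [{"frame_a"}, {"frame_a", "frame_b"}, set(), {"frame_a"}]
br = binary_relevance_targets(label_sets, ["frame_a", "frame_b"])
lp_targets, lp_classes = label_powerset_targets(label_sets)
```

Binary Relevance trains one binary classifier per frame and ignores co-occurrence, while Label Power-set trains a single multi-class classifier whose number of classes grows with the number of observed frame combinations.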

L3i++ at SemEval-2024 Task 8: Can Fine-tuned Large Language Model Detect Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text?
Hanh Thi Hong Tran | Tien Nam Nguyen | Antoine Doucet | Senja Pollak
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

This paper summarizes our participation in SemEval-2024 Task 8: Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection. In this task, we aim to solve two of the three subtasks: (1) Monolingual and Multilingual Binary Human-Written vs. Machine-Generated Text Classification; and (2) Multi-Way Machine-Generated Text Classification. We conducted a comprehensive comparative study across three methodological groups: five metric-based models (Log-Likelihood, Rank, Log-Rank, Entropy, and MFDMetric); two fine-tuned sequence-labeling language models (RoBERTa and XLM-R); and a fine-tuned large-scale language model (LS-LLaMA). Our findings suggest that our LLM outperformed both traditional sequence-labeling LM benchmarks and metric-based approaches. Furthermore, our fine-tuned classifier excelled in detecting machine-generated multilingual texts and accurately classifying machine-generated texts within a specific category (e.g., ChatGPT, bloomz, dolly). However, it exhibits challenges in detecting them in other categories (e.g., cohere and davinci). This is due to potential overlap in the distribution of the metric among various LLMs. Overall, we ranked 6th in both Multilingual Binary Human-Written vs. Machine-Generated Text Classification and Multi-Way Machine-Generated Text Classification on the leaderboard.

whatdoyoumeme at SemEval-2024 Task 4: Hierarchical-Label-Aware Persuasion Detection using Translated Texts
Nishan Chatterjee | Marko Pranjic | Boshko Koloski | Lidia Pivovarova | Senja Pollak
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

In this paper, we detail the methodology of team whatdoyoumeme for the SemEval 2024 Task on Multilingual Persuasion Detection in Memes. We integrate hierarchical label information to refine detection capabilities, and employ a cross-lingual approach, utilizing translation to adapt the model to Macedonian, Arabic, and Bulgarian. Our methodology encompasses both the analysis of meme content and the extension of labels to include hierarchical structure. The effectiveness of the approach is demonstrated through improved model performance in multilingual contexts, highlighting the utility of translation-based methods and hierarchy-aware learning over traditional baselines.

2023

Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023)
Jakub Piskorski | Michał Marcińczuk | Preslav Nakov | Maciej Ogrodniczuk | Senja Pollak | Pavel Přibáň | Piotr Rybak | Josef Steinberger | Roman Yangarber
Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023)

Analysis of Transfer Learning for Named Entity Recognition in South-Slavic Languages
Nikola Ivačič | Thi Hong Hanh Tran | Boshko Koloski | Senja Pollak | Matthew Purver
Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023)

This paper analyzes a Named Entity Recognition task for South-Slavic languages using pre-trained multilingual neural network models. We investigate whether the performance of the models for a target language can be improved by using data from closely related languages. We show that model performance is not influenced substantially when training with languages other than the target language. While for Slovene the monolingual setting generally performs better, for Croatian and Serbian the results are slightly better in selected cross-lingual settings, but the improvements are not large. The most significant performance improvement is shown for the Serbian language, which has the smallest corpora. Therefore, fine-tuning with other closely related languages may benefit only the “low resource” languages.

Reconstruct to Retrieve: Identifying interesting news in a Cross-lingual setting
Boshko Koloski | Blaz Skrlj | Nada Lavrac | Senja Pollak
Proceedings of the 2023 CLASP Conference on Learning with Small Data (LSD)

An important and resource-intensive task in journalism is retrieving relevant foreign news and its adaptation for local readers. Given the vast amount of foreign articles published and the limited number of journalists available to evaluate their interestingness, this task can be particularly challenging, especially when dealing with smaller languages and countries. In this work, we propose a novel method for large-scale retrieval of potentially translation-worthy articles based on an auto-encoder neural network trained on a limited corpus of relevant foreign news. We hypothesize that the representations of interesting news can be reconstructed very well by an auto-encoder, while irrelevant news would have less adequate reconstructions since they are not used for training the network. Specifically, we focus on extracting articles from the Latvian media for Estonian news media houses. It is worth noting that the available corpora for this task are particularly limited, which adds an extra layer of difficulty to our approach. To evaluate the proposed method, we rely on manual evaluation by an Estonian journalist at Ekspress Meedia and automatic evaluation on a gold standard test set.

IJS@LT-EDI : Ensemble Approaches to Detect Signs of Depression from Social Media Text
Jaya Caporusso | Thi Hong Hanh Tran | Senja Pollak
Proceedings of the Third Workshop on Language Technology for Equality, Diversity and Inclusion

This paper presents our ensembling solutions for detecting signs of depression in social media text, as part of the Shared Task at LT-EDI@RANLP 2023. By leveraging social media posts in English, the task involves the development of a system to accurately classify them as presenting signs of depression of one of three levels: “severe”, “moderate”, and “not depressed”. We verify the hypothesis that combining contextual information from a language model with local domain-specific features can improve the classifier’s performance. We do so by evaluating: (1) two global classifiers (support vector machine and logistic regression); (2) contextual information from language models; and (3) the ensembling results.

2022

Fusion of linguistic, neural and sentence-transformer features for improved term alignment
Andraz Repar | Senja Pollak | Matej Ulčar | Boshko Koloski
Proceedings of the BUCC Workshop within LREC 2022

The crosslingual terminology alignment task has many practical applications. In this work, we propose an alignment method for the shared task of the 15th Workshop on Building and Using Comparable Corpora. Our method combines several different approaches into one cohesive machine learning model, based on SVM. From shared-task specific and external sources, we crafted four types of features: cognate-based, dictionary-based, embedding-based, and combined features, which combine aspects of the other three types. We added a post-processing re-scoring method, which reduces the effect of hubness, where some terms are nearest neighbours of many other terms. We achieved an average precision score of 0.833 on the English-French training set of the shared task.

Tracking Changes in ESG Representation: Initial Investigations in UK Annual Reports
Matthew Purver | Matej Martinc | Riste Ichev | Igor Lončarski | Katarina Sitar Šuštar | Aljoša Valentinčič | Senja Pollak
Proceedings of the First Computing Social Responsibility Workshop within the 13th Language Resources and Evaluation Conference

We describe initial work into analysing the language used around environmental, social and governance (ESG) issues in UK company annual reports. We collect a dataset of annual reports from UK FTSE350 companies over the years 2012-2019; separately, we define a categorized list of core ESG terms (single words and multi-word expressions) by combining existing lists with manual annotation. We then show that this list can be used to analyse the changes in ESG language in the dataset over time, via a combination of language modelling and distributional modelling via contextual word embeddings. Initial findings show that while ESG discussion in annual reports is becoming significantly more likely over time, the increase varies with category and with individual terms, and that some terms show noticeable changes in usage.

EMBEDDIA project: Cross-Lingual Embeddings for Less-Represented Languages in European News Media
Senja Pollak | Andraž Pelicon
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

The EMBEDDIA project developed a range of resources and methods for less-resourced EU languages, focusing on applications for the media industry, including keyword extraction, comment moderation and article generation.

Knowledge informed sustainability detection from short financial texts
Boshko Koloski | Syrielle Montariol | Matthew Purver | Senja Pollak
Proceedings of the Fourth Workshop on Financial Technology and Natural Language Processing (FinNLP)

There is a global trend towards responsible investing, and the need for automated methods for analyzing Environmental, Social and Governance (ESG) related elements in financial texts is rising. In this work we propose a solution to the FinSim4-ESG task, consisting of binary classification of sentences into sustainable or unsustainable. We propose a novel knowledge-based latent heterogeneous representation that is based on knowledge from taxonomies and knowledge graphs and multiple contemporary document representations. We hypothesize that an approach based on a combination of knowledge and document representations can introduce significant improvement over conventional document representation approaches. We consider ensembles at the classifier level as well as at the representation level, using late fusion and early fusion. The proposed approaches achieve competitive accuracy of 89 and are 5.85 behind the best achieved score.

Sentiment Classification by Incorporating Background Knowledge from Financial Ontologies
Timen Stepišnik-Perdih | Andraž Pelicon | Blaž Škrlj | Martin Žnidaršič | Igor Lončarski | Senja Pollak
Proceedings of the 4th Financial Narrative Processing Workshop @LREC2022

Ontologies have been increasingly used for machine reasoning over the last few years. They can provide explanations of concepts or be used for concept classification if there exists a mapping from the desired labels to the relevant ontology. This paper presents a practical use of an ontology for the purpose of data set generalization in an oversampling setting, with the aim of improving classification models. We demonstrate our solution on a novel financial sentiment data set using the Financial Industry Business Ontology (FIBO). The results show that generalization-based data enrichment benefits simpler models in a general setting and more complex models such as BERT in a low-data setting.

Out of Thin Air: Is Zero-Shot Cross-Lingual Keyword Detection Better Than Unsupervised?
Boshko Koloski | Senja Pollak | Blaž Škrlj | Matej Martinc
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Keyword extraction is the task of retrieving words that are essential to the content of a given document. Researchers have proposed various approaches to tackle this problem. At the top-most level, approaches are divided into those that require training (supervised) and those that do not (unsupervised). In this study, we are interested in settings where no training data is available for the language under investigation. More specifically, we explore whether pretrained multilingual language models can be employed for zero-shot cross-lingual keyword extraction on low-resource languages with limited or no available labeled training data and whether they outperform state-of-the-art unsupervised keyword extractors. The comparison is conducted on six news article datasets covering two high-resource languages, English and Russian, and four low-resource languages, Croatian, Estonian, Latvian, and Slovenian. We find that the pretrained models fine-tuned on a multilingual corpus covering languages that do not appear in the test set (i.e. in a zero-shot setting) consistently outscore unsupervised models in all six languages.

Extracting and Analysing Metaphors in Migration Media Discourse: towards a Metaphor Annotation Scheme
Ana Zwitter Vitez | Mojca Brglez | Marko Robnik Šikonja | Tadej Škvorc | Andreja Vezovnik | Senja Pollak
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The study of metaphors in media discourse is an increasingly researched topic as media are an important shaper of social reality and metaphors are an indicator of how we think about certain issues through references to other things. We present a neural transfer learning method for detecting metaphorical sentences in Slovene and evaluate its performance on a gold standard corpus of metaphors (classification accuracy of 0.725), as well as on a sample of a domain specific corpus of migrations (precision of 0.40 for extracting domain metaphors and 0.74 if evaluated only on a set of migration related sentences). Based on empirical results and findings of our analysis, we propose a novel metaphor annotation scheme containing linguistic level, conceptual level, and stance information. The new scheme can be used for future metaphor annotations of other socially relevant topics.

Embeddings models for Buddhist Sanskrit
Ligeia Lugli | Matej Martinc | Andraž Pelicon | Senja Pollak
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The paper presents novel resources and experiments for Buddhist Sanskrit, broadly defined here as including all the varieties of Sanskrit in which Buddhist texts have been transmitted. We release a novel corpus of Buddhist texts, a novel corpus of general Sanskrit, and word similarity and word analogy datasets for intrinsic evaluation of Buddhist Sanskrit embeddings models. We compare the performance of word2vec and fastText static embeddings models, with default and optimized parameter settings, as well as contextual models BERT and GPT-2, with different training regimes (including a transfer learning approach using the general Sanskrit corpus) and different embeddings construction regimes (given the encoder layers). The results show that for semantic similarity the fastText embeddings yield the best results, while for word analogy tasks BERT embeddings work the best. We also show that for contextual models the optimal layer combination for embedding construction is task dependent, and that pretraining the contextual embeddings models on a reference corpus of general Sanskrit is beneficial, which is a promising finding for future development of embeddings for less-resourced languages and domains.

E8-IJS@LT-EDI-ACL2022 - BERT, AutoML and Knowledge-graph backed Detection of Depression
Ilija Tavchioski | Boshko Koloski | Blaž Škrlj | Senja Pollak
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

Depression is a mental illness that negatively affects a person’s well-being and can, if left untreated, lead to serious consequences such as suicide. Therefore, it is important to recognize the signs of depression early. In the last decade, social media has become one of the most common places to express one’s feelings. Hence, text processing and machine learning techniques can be applied to detect possible signs of depression. In this paper, we present our approaches to solving the shared task titled Detecting Signs of Depression from Social Media Text. We explore three different approaches to solve the challenge: fine-tuning a BERT model; leveraging AutoML for the construction of features and classifier selection; and exploring latent spaces derived from the combination of textual and knowledge-based representations. We ranked 9th out of 31 teams in the competition. Our best solution, based on knowledge graph and textual representations, was 4.9% behind the best model in terms of Macro F1, and only 1.9% behind in terms of Recall.

JSI at SemEval-2022 Task 1: CODWOE - Reverse Dictionary: Monolingual and cross-lingual approaches
Thi Hong Hanh Tran | Matej Martinc | Matthew Purver | Senja Pollak
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

The reverse dictionary task is a sequence-to-vector task in which a gloss is provided as input, and the output must be a semantically matching word vector. The reverse dictionary is useful in practical applications such as solving the tip-of-the-tongue problem, helping new language learners, etc. In this paper, we evaluate the effect of a Transformer-based model with cross-lingual zero-shot learning to improve the reverse dictionary performance. Our experiments are conducted in five languages in the CODWOE dataset, including English, French, Italian, Spanish, and Russian. Although we did not achieve a good ranking in the CODWOE competition, we show that our work partially improves the current baseline from the organizers with a hypothesis on the impact of LSTM in monolingual, multilingual, and zero-shot learning. All code is available at https://github.com/honghanhh/codwoe2021.

IJS at TextGraphs-16 Natural Language Premise Selection Task: Will Contextual Information Improve Natural Language Premise Selection?
Thi Hong Hanh Tran | Matej Martinc | Antoine Doucet | Senja Pollak
Proceedings of TextGraphs-16: Graph-based Methods for Natural Language Processing

Natural Language Premise Selection (NLPS) is a mathematical Natural Language Processing (NLP) task that retrieves a set of applicable relevant premises to support the end-user finding the proof for a particular statement. In this research, we evaluate the impact of Transformer-based contextual information and different fundamental similarity scores toward NLPS. The results demonstrate that the contextual representation is better at capturing meaningful information despite not being pretrained in the mathematical background compared to the statistical approach (e.g., the TF-IDF) with a boost of around 3.00% MAP@500.
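
The MAP@500 figure quoted in this abstract is a ranking metric, and a minimal sketch may help readers interpret it. The sketch assumes one common definition, in which average precision at k is normalised by min(number of relevant premises, k), and it assumes rankings contain no duplicates. The shared task's official scorer may differ, and the function names are hypothetical.

```python
def average_precision_at_k(ranked, relevant, k):
    """Average precision over the top-k ranked items for one query.

    Assumes `ranked` has no duplicates; normalises by min(len(relevant), k).
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    score = 0.0
    for position, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / position  # precision at this cut-off
    return score / min(len(relevant), k)


def mean_average_precision_at_k(rankings, relevants, k):
    """Mean of the per-query average precision values."""
    values = [
        average_precision_at_k(ranked, relevant, k)
        for ranked, relevant in zip(rankings, relevants)
    ]
    return sum(values) / len(values) if values else 0.0


# Toy example with two statements and their ranked candidate premises.
rankings = [["p1", "p2", "p3", "p4"], ["x", "y"]]
relevants = [{"p1", "p3"}, {"y"}]
map_at_4 = mean_average_precision_at_k(rankings, relevants, k=4)
```

In the toy example the first query scores (1/1 + 2/3) / 2 and the second scores 1/2, so the mean rewards systems that place relevant premises near the top of the list.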

2021

Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing
Bogdan Babych | Olga Kanishcheva | Preslav Nakov | Jakub Piskorski | Lidia Pivovarova | Vasyl Starko | Josef Steinberger | Roman Yangarber | Michał Marcińczuk | Senja Pollak | Pavel Přibáň | Marko Robnik-Šikonja
Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing

Exploratory Analysis of News Sentiment Using Subgroup Discovery
Anita Valmarska | Luis Adrián Cabrera-Diego | Elvys Linhares Pontes | Senja Pollak
Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing

In this study, we present an exploratory analysis of a Slovenian news corpus, in which we investigate the association between named entities and sentiment in the news. We propose a methodology that combines Named Entity Recognition and Subgroup Discovery - a descriptive rule learning technique for identifying groups of examples that share the same class label (sentiment) and pattern (features - Named Entities). The approach is used to induce the positive and negative sentiment class rules that reveal interesting patterns related to different Slovenian and international politicians, organizations, and locations.

This paper describes Slav-NER: the 3rd Multilingual Named Entity Challenge in Slavic languages. The tasks involve recognizing mentions of named entities in Web documents, normalization of the names, and cross-lingual linking. The Challenge covers six languages and five entity types, and is organized as part of the 8th Balto-Slavic Natural Language Processing Workshop, co-located with the EACL 2021 Conference. Ten teams participated in the competition. Performance for the named entity recognition task reached 90% F-measure, much higher than reported in the first edition of the Challenge. Seven teams covered all six languages, and five teams participated in the cross-lingual entity linking task. Detailed evaluation information is available on the shared task web page.

Supervised and Unsupervised Neural Approaches to Text Readability
Matej Martinc | Senja Pollak | Marko Robnik-Šikonja
Computational Linguistics, Volume 47, Issue 1 - March 2021

We present a set of novel neural supervised and unsupervised approaches for determining the readability of documents. In the unsupervised setting, we leverage neural language models, whereas in the supervised setting, three different neural classification architectures are tested. We show that the proposed neural unsupervised approach is robust, transferable across languages, and allows adaptation to a specific readability task and data set. By systematic comparison of several neural architectures on a number of benchmark and new labeled readability data sets in two languages, this study also offers a comprehensive analysis of different neural approaches to readability classification. We expose their strengths and weaknesses, compare their performance to current state-of-the-art classification approaches to readability, which in most cases still rely on extensive feature engineering, and propose possibilities for improvements.

Preliminary experimentation with combinations and extensions of forward-looking sentence detection wordlists
Jan Štihec | Senja Pollak | Martin Žnidaršič
Proceedings of the 3rd Financial Narrative Processing Workshop

BERT meets Shapley: Extending SHAP Explanations to Transformer-based Classifiers
Enja Kokalj | Blaž Škrlj | Nada Lavrač | Senja Pollak | Marko Robnik-Šikonja
Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation

Transformer-based neural networks offer very good classification performance across a wide range of domains, but do not provide explanations of their predictions. While several explanation methods, including SHAP, address the problem of interpreting deep learning models, they are not adapted to operate on state-of-the-art transformer-based neural networks such as BERT. Another shortcoming of these methods is that their visualization of explanations in the form of lists of most relevant words does not take into account the sequential and structurally dependent nature of text. This paper proposes the TransSHAP method that adapts SHAP to transformer models including BERT-based text classifiers. It advances SHAP visualizations by showing explanations in a sequential manner, assessed by human evaluators as competitive with state-of-the-art solutions.

Extending Neural Keyword Extraction with TF-IDF tagset matching
Boshko Koloski | Senja Pollak | Blaž Škrlj | Matej Martinc
Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation

Keyword extraction is the task of identifying words (or multi-word expressions) that best describe a given document and serve in news portals to link articles of similar topics. In this work, we develop and evaluate our methods on four novel data sets covering less-represented, morphologically-rich languages in the European news media industry (Croatian, Estonian, Latvian, and Russian). First, we evaluate two supervised neural transformer-based methods, Transformer-based Neural Tagger for Keyword Identification (TNT-KID) and Bidirectional Encoder Representations from Transformers (BERT) with an additional Bidirectional Long Short-Term Memory Conditional Random Fields (BiLSTM CRF) classification head, and compare them to a baseline Term Frequency - Inverse Document Frequency (TF-IDF) based unsupervised approach. Next, we show that by combining the keywords retrieved by both neural transformer-based methods and extending the final set of keywords with an unsupervised TF-IDF based technique, we can drastically improve the recall of the system, making it appropriate for usage as a recommendation system in the media house environment.
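The combination step described in this abstract can be illustrated with a minimal sketch. It assumes that each neural tagger returns a ranked list of keywords and that a TF-IDF ranking for the same document is already available; the function name `extend_keywords` and the `target_size` parameter are hypothetical and do not come from the paper, whose exact matching procedure is not given in the abstract.

```python
def extend_keywords(neural_a, neural_b, tfidf_ranked, target_size):
    """Merge two neural keyword lists, then pad with TF-IDF candidates.

    Illustrative assumption: all neural keywords are kept, and TF-IDF
    candidates are only used to fill the set up to `target_size`.
    """
    combined = []
    # Keep every keyword proposed by either neural method, without duplicates.
    for keyword in [*neural_a, *neural_b]:
        if keyword not in combined:
            combined.append(keyword)
    # Fill the remaining slots with the best-ranked unseen TF-IDF candidates.
    for keyword in tfidf_ranked:
        if len(combined) >= target_size:
            break
        if keyword not in combined:
            combined.append(keyword)
    return combined


if __name__ == "__main__":
    # Toy inputs, invented for illustration only.
    print(extend_keywords(
        ["election", "parliament"],
        ["parliament", "budget"],
        ["tax", "budget", "vote", "minister"],
        target_size=5,
    ))
```

Padding to a fixed size trades precision for recall, which matches the recommendation-system use case mentioned above, where an editor picks from a longer list of suggestions.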

Zero-shot Cross-lingual Content Filtering: Offensive Language and Hate Speech Detection
Andraž Pelicon | Ravi Shekhar | Matej Martinc | Blaž Škrlj | Matthew Purver | Senja Pollak
Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation

We present a system for zero-shot cross-lingual offensive language and hate speech classification. The system was trained on English datasets and tested on a task of detecting hate speech and offensive social media content in a number of languages without any additional training. Experiments show an impressive ability of both models to generalize from English to other languages. There is, however, an expected gap in performance between the tested cross-lingual models and the monolingual models. The best performing model (offensive content classifier) is available online as a REST API.

Exploring Neural Language Models via Analysis of Local and Global Self-Attention Spaces
Blaž Škrlj | Shane Sheehan | Nika Eržen | Marko Robnik-Šikonja | Saturnino Luz | Senja Pollak
Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation

Large pretrained language models using the transformer neural network architecture are becoming a dominant methodology for many natural language processing tasks, such as question answering, text classification, word sense disambiguation, text completion and machine translation. Commonly comprising hundreds of millions of parameters, these models offer state-of-the-art performance, but at the expense of interpretability. The attention mechanism is the main component of transformer networks. We present AttViz, a method for exploration of self-attention in transformer networks, which can help in explanation and debugging of the trained models by showing associations between text tokens in an input sequence. We show that existing deep learning pipelines can be explored with AttViz, which offers novel visualizations of the attention heads and their aggregations. We implemented the proposed methods in an online toolkit and an offline library. Using examples from news analysis, we demonstrate how AttViz can be used to inspect and potentially better understand what a model has learned.

This paper presents tools and data sources collected and released by the EMBEDDIA project, supported by the European Union’s Horizon 2020 research and innovation program. The collected resources were offered to participants of a hackathon organized as part of the EACL Hackashop on News Media Content Analysis and Automated Report Generation in February 2021. The hackathon had six participating teams who addressed different challenges, either from the list of proposed challenges or their own news-industry-related tasks. This paper goes beyond the scope of the hackathon, as it brings together in a coherent and compact form most of the resources developed, collected and released by the EMBEDDIA project. Moreover, it constitutes a handy source for news media industry and researchers in the fields of Natural Language Processing and Social Science.

Interesting cross-border news discovery using cross-lingual article linking and document similarity
Boshko Koloski | Elaine Zosa | Timen Stepišnik-Perdih | Blaž Škrlj | Tarmo Paju | Senja Pollak
Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation

Team name: team-8. EMBEDDIA tool: Cross-Lingual Document Retrieval (Zosa et al.). Dataset: Estonian and Latvian news datasets. Abstract: Contemporary news media face increasing amounts of available data that can be of use when prioritizing, selecting and discovering new news. In this work we propose a methodology for retrieving interesting articles in a cross-border news discovery setting. More specifically, we explore how a set of seed documents in Estonian can be projected into the Latvian document space and serve as a basis for discovery of novel interesting pieces of Latvian news that would interest Estonian readers. The proposed methodology was evaluated by an Estonian journalist, who confirmed that in the best setting, half of the top 10 retrieved Latvian documents represent news that is potentially interesting enough to be taken up by the Estonian media house and presented to Estonian readers.

EMBEDDIA hackathon report: Automatic sentiment and viewpoint analysis of Slovenian news corpus on the topic of LGBTIQ+
Matej Martinc | Nina Perger | Andraž Pelicon | Matej Ulčar | Andreja Vezovnik | Senja Pollak
Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation

We conduct automatic sentiment and viewpoint analysis of the newly created Slovenian news corpus containing articles related to the topic of LGBTIQ+ by employing a state-of-the-art news sentiment classifier and a system for semantic change detection. The focus is on the differences in reporting between quality news media with a long tradition and news media with financial and political connections to SDS, a Slovene right-wing political party. The results suggest that political affiliation of the media can affect the sentiment distribution of articles and the framing of specific LGBTIQ+ topics, such as same-sex marriage.

2020

Mining Semantic Relations from Comparable Corpora through Intersections of Word Embeddings
Špela Vintar | Larisa Grčić Simeunović | Matej Martinc | Senja Pollak | Uroš Stepišnik
Proceedings of the 13th Workshop on Building and Using Comparable Corpora

We report an experiment aimed at extracting words expressing a specific semantic relation using intersections of word embeddings. In a multilingual frame-based domain model, specific features of a concept are typically described through a set of non-arbitrary semantic relations. In karstology, our domain of choice, which we are exploring through a comparable corpus in English and Croatian, karst phenomena such as landforms are usually described through their FORM, LOCATION, CAUSE, FUNCTION and COMPOSITION. We propose an approach to mine words pertaining to each of these relations by using a small number of seed adjectives, for which we retrieve closest words using word embeddings and then use intersections of these neighbourhoods to refine our search. Such cross-language expansion of semantically-rich vocabulary is a valuable aid in improving the coverage of a multilingual knowledge base, but also in exploring differences between languages in their respective conceptualisations of the domain.
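The idea of intersecting embedding neighbourhoods of seed adjectives can be sketched as follows. This is only an illustration under stated assumptions: cosine similarity as the closeness measure, a fixed neighbourhood size `k`, and tiny hand-made two-dimensional vectors in place of real word embeddings. The function names and the toy vocabulary are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbours(word, embeddings, k):
    """Return the set of the k words closest to `word` (excluding itself)."""
    scored = [
        (cosine(embeddings[word], vector), other)
        for other, vector in embeddings.items()
        if other != word
    ]
    # Sort by descending similarity; ties are broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return {other for _, other in scored[:k]}


def neighbourhood_intersection(seeds, embeddings, k):
    """Words that appear in the k-neighbourhood of every seed word."""
    if not seeds:
        return set()
    neighbourhoods = [nearest_neighbours(seed, embeddings, k) for seed in seeds]
    return set.intersection(*neighbourhoods) - set(seeds)


if __name__ == "__main__":
    # Toy vectors invented for illustration; real embeddings have many more dimensions.
    toy_embeddings = {
        "round": (1.0, 0.1),
        "oval": (0.9, 0.2),
        "circular": (0.95, 0.15),
        "elongated": (0.8, 0.3),
        "wet": (0.1, 1.0),
        "dry": (0.2, 0.9),
    }
    print(neighbourhood_intersection(["round", "oval"], toy_embeddings, k=3))
```

Requiring a candidate to be close to every seed, rather than to any one of them, filters out words that are merely similar to a single seed for reasons unrelated to the target relation.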

The NetViz terminology visualization tool and the use cases in karstology domain modeling
Senja Pollak | Vid Podpečan | Dragana Miljkovic | Uroš Stepišnik | Špela Vintar
Proceedings of the 6th International Workshop on Computational Terminology

We present the NetViz terminology visualization tool and apply it to the domain modeling of karstology, a subfield of geography studying karst phenomena. The developed tool allows for high-performance online network visualization where the user can upload the terminological data in a simple CSV format, define the nodes (terms, categories), edges (relations) and their properties (by assigning different node colors), and then edit and interactively explore domain knowledge in the form of a network. We showcase the usefulness of the tool on examples from the karstology domain, where in the first use case we visualize the domain knowledge as represented in a manually annotated corpus of domain definitions, while in the second use case we show the power of visualization for domain understanding by visualizing automatically extracted knowledge in the form of triplets extracted from the karstology domain corpus. The application is entirely web-based without any need for downloading or special configuration. The source code of the web application is also available under the permissive MIT license, allowing future extensions for developing new terminological applications.

Leveraging Contextual Embeddings for Detecting Diachronic Semantic Shift
Matej Martinc | Petra Kralj Novak | Senja Pollak
Proceedings of the Twelfth Language Resources and Evaluation Conference

We propose a new method that leverages contextual embeddings for the task of diachronic semantic shift detection by generating time-specific word representations from BERT embeddings. The results of our experiments in the domain-specific LiverpoolFC corpus suggest that the proposed method has performance comparable to the current state-of-the-art without requiring any time-consuming domain adaptation on large corpora. The results on the newly created Brexit news corpus suggest that the method can be successfully used for the detection of a short-term yearly semantic shift. Lastly, the model also shows promising results in a multilingual setting, where the task was to detect differences and similarities between diachronic semantic shifts in different languages.

CoSimLex: A Resource for Evaluating Graded Word Similarity in Context
Carlos Santos Armendariz | Matthew Purver | Matej Ulčar | Senja Pollak | Nikola Ljubešić | Mark Granroth-Wilding
Proceedings of the Twelfth Language Resources and Evaluation Conference

State-of-the-art natural language processing tools are built on context-dependent word embeddings, but no direct method for evaluating these representations currently exists. Standard tasks and datasets for intrinsic evaluation of embeddings are based on judgements of similarity, but ignore context; standard tasks for word sense disambiguation take account of context but do not provide continuous measures of meaning similarity. This paper describes an effort to build a new dataset, CoSimLex, intended to fill this gap. Building on the standard pairwise similarity task of SimLex-999, it provides context-dependent similarity measures; covers not only discrete differences in word sense but more subtle, graded changes in meaning; and covers not only a well-resourced language (English) but a number of less-resourced languages. We define the task and evaluation metrics, outline the dataset collection methodology, and describe the status of the dataset so far.

SemEval-2020 Task 3: Graded Word Similarity in Context
Carlos Santos Armendariz | Matthew Purver | Senja Pollak | Nikola Ljubešić | Matej Ulčar | Ivan Vulić | Mohammad Taher Pilehvar
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This paper presents the Graded Word Similarity in Context (GWSC) task which asked participants to predict the effects of context on human perception of similarity in English, Croatian, Slovene and Finnish. We received 15 submissions and 11 system description papers. A new dataset (CoSimLex) was created for evaluation in this task: it contains pairs of words, each annotated within two different contexts. Systems beat the baselines by significant margins, but few did well in more than one language or subtask. Almost every system employed a Transformer model, but with many variations in the details: WordNet sense embeddings, translation of contexts, TF-IDF weightings, and the automatic creation of datasets for fine-tuning were all used to good effect.

2018

Reusable workflows for gender prediction
Matej Martinc | Senja Pollak
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Gender Profiling for Slovene Twitter communication: the Influence of Gender Marking, Content and Style
Ben Verhoeven | Iza Škrjanec | Senja Pollak
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing

We present the results of what are, to our knowledge, the first gender classification experiments on Slovene text. Inspired by the TwiSty corpus and experiments (Verhoeven et al., 2016), we employed the Janes corpus (Erjavec et al., 2016) and its gender annotations to perform gender classification experiments on Twitter text comparing a token-based and a lemma-based approach. We find that the token-based approach (92.6% accuracy), containing gender markings related to the author, outperforms the lemma-based approach by about 5%. Especially in the lemmatized version, we also observe stylistic and content-based differences in writing between men (e.g. more profane language, numerals and beer mentions) and women (e.g. more pronouns, emoticons and character flooding). Many of our findings corroborate previous research on other languages.

2015

Predicting the Level of Text Standardness in User-generated Content
Nikola Ljubešić | Darja Fišer | Tomaž Erjavec | Jaka Čibej | Dafne Marko | Senja Pollak | Iza Škrjanec
Proceedings of the International Conference Recent Advances in Natural Language Processing

2012

Irregularity Detection in Categorized Document Corpora
Borut Sluban | Senja Pollak | Roel Coesemans | Nada Lavrač
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The paper presents an approach to extract irregularities in document corpora, where the documents originate from different sources and the analyst's interest is to find documents which are atypical for the given source. The main contribution of the paper is a voting-based approach to irregularity detection and its evaluation on a collection of newspaper articles from two sources: Western (UK and US) and local (Kenyan) media. Evaluation by a domain expert shows that the method is very effective in uncovering interesting irregularities in categorized document corpora.

2011

Building and Using Comparable Corpora for Domain-Specific Bilingual Lexicon Extraction
Darja Fišer | Nikola Ljubešić | Špela Vintar | Senja Pollak
Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web

2010

Learning to Mine Definitions from Slovene Structured and Unstructured Knowledge-Rich Resources
Darja Fišer | Senja Pollak | Špela Vintar
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The paper presents an innovative approach to extract Slovene definition candidates from domain-specific corpora using morphosyntactic patterns, automatic terminology recognition and semantic tagging with wordnet senses. A classification model was first trained on examples from Slovene Wikipedia and then used to find well-formed definitions among the extracted candidates. The results of the experiment are encouraging, with accuracy ranging from 67% to 71%. The paper also addresses some drawbacks of the approach and suggests ways to overcome them in future work.

Co-authors

Andraž Pelicon 7

Thi Hong Hanh Tran 5

Antoine Doucet 4

Nikola Ljubešić 4

Marko Pranjić 4

Špela Vintar 4

Jaya Caporusso 3

Michał Marcińczuk 3

Preslav Nakov 3

Jakub Piskorski 3

Lidia Pivovarova 3

Pavel Přibáň 3

Josef Steinberger 3

Roman Yangarber 3

Martin Žnidaršič 3

Carlos Santos Armendariz 2

Bogdan Babych 2

Luis-Adrián Cabrera-Diego 2

Nishan Chatterjee 2

Nikola Ivačič 2

Olga Kanishcheva 2

Igor Lončarski 2

Vid Podpečan 2

Shane Sheehan 2

Uroš Stepišnik 2

Timen Stepišnik-Perdih 2

Hanh Thi Hong Tran 2

Andreja Vezovnik 2

Iza Škrjanec 2

Kristijan Armeni 1

Veronika Bajt 1

Michele Boggia 1

Hajo Boomgaarden 1

Emanuela Boroş 1

Roel Coesemans 1

Tomaž Erjavec 1

Linda Freienthal 1

Mark Granroth-Wilding 1

Larisa Grčić Simeunović 1

Damar Hoogland 1

Zara Kancheva 1

Vanja M. Karan 1

Petra Kralj Novak 1

Maria Lebedeva 1

Leo Leppänen 1

Fabienne Lind 1

Carl-Gustav Linden 1

Elvys Linhares-Pontes 1

Saturnino Luz 1

Dragana Miljkovic 1

Syrielle Montariol 1

José G. Moreno 1

Tien Nam Nguyen 1

Maciej Ogrodniczuk 1

Petya Osenova 1

Mohammad Taher Pilehvar 1

Andraž Repar 1

Salla Salmela 1

Katarina Sitar Šuštar 1

Ilija Tavchioski 1

Hannu Toivonen 1

Aljoša Valentinčič 1

Anita Valmarska 1

Ben Verhoeven 1

Ana Zwitter Vitez 1

Tadej Škvorc 1

Venues