Marina Litvak

2025

pdf bib abs
Fine-Grained Arabic Offensive Language Classification with Taxonomy, Sentiment, and Emotions
Natalia Vanetik | Marina Litvak | Chaya Liebeskind
Proceedings of the Workshop on Beyond English: Natural Language Processing for all Languages in an Era of Large Language Models

Offensive language detection in Arabic is a challenging task because of the unique linguistic and cultural characteristics of the Arabic language. This study introduces a high-quality annotated dataset for classifying offensive language in Arabic, based on a structured taxonomy, categorizing offensive content across seven levels, capturing both explicit and implicit expressions. Utilizing this taxonomy, we re-annotate the FARAD-500 dataset, creating reFarad-500, which provides fine-grained labels for offensive texts in Arabic. A thorough dataset analysis reveals key patterns in offensive language distribution, emphasizing the importance of target type, offense severity, and linguistic structures. Additionally, we assess text classification techniques to evaluate the dataset’s effectiveness, exploring the impact of sentiment analysis and emotion detection on classification performance. Our findings highlight the complexity of Arabic offensive language and underscore the necessity of extensive annotation frameworks for accurate detection. This paper advances Arabic natural language processing (NLP) in resource-constrained settings by enhancing the recognition of hate speech and fostering a deeper understanding of the linguistic and emotional dimensions of offensive language.

pdf bib abs
Towards Safer Hebrew Communication: A Dataset for Offensive Language Detoxification
Natalia Vanetik | Lior Liberov | Marina Litvak | Chaya Liebeskind
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

Text detoxification is the task of transforming offensive or toxic content into a non-offensive form while preserving the original meaning. Despite increasing research interest in detoxification across various languages, no resources or benchmarks exist for Hebrew, a Semitic language with unique morphological, syntactic, and cultural characteristics. This paper introduces HeDetox, the first annotated dataset for text detoxification in Hebrew. HeDetox contains 600 sentence pairs, each consisting of an offensive source text and a non-offensive text rewritten with LLM and human intervention. We present a detailed dataset analysis and evaluation showing that the dataset benefits offensive language detection. HeDetox offers a foundational resource for Hebrew natural language processing, advancing research in offensive language mitigation and controllable text generation.

2024

State-of-the-art abstractive summarization models still suffer from the content contradiction between the summaries and the input text, which is referred to as the factual inconsistency problem. Recently, a large number of works have also been proposed to evaluate factual consistency or improve it by post-editing methods. However, these post-editing methods typically focus on replacing suspicious entities, failing to identify and modify incorrect content hidden in sentence structures. In this paper, we first verify that the correctable errors can be enriched by leveraging sentence structure pruning operation, and then we propose a post-editing method based on that. In the correction process, the pruning operation on possible errors is performed on the syntactic dependency tree with the guidance of multiple factual evaluation metrics. Experimenting on the FRANK dataset shows a great improvement in factual consistency compared with strong baselines and, when combined with them, can achieve even better performance. All the codes and data will be released on paper acceptance.

pdf bib abs
From Linguistics to Practice: a Case Study of Offensive Language Taxonomy in Hebrew
Chaya Liebeskind | Marina Litvak | Natalia Vanetik
Proceedings of the 8th Workshop on Online Abuse and Harms (WOAH 2024)

The perception of offensive language varies based on cultural, social, and individual perspectives. With the spread of social media, there has been an increase in offensive content online, necessitating advanced solutions for its identification and moderation. This paper addresses the practical application of an offensive language taxonomy, specifically targeting Hebrew social media texts. By introducing a newly annotated dataset, modeled after the taxonomy of explicit offensive language of (Lewandowska-Tomaszczyk et al., 2023)„ we provide a comprehensive examination of various degrees and aspects of offensive language. Our findings indicate the complexities involved in the classification of such content. We also outline the implications of relying on fixed taxonomies for Hebrew.

2023

pdf bib abs
Propaganda Detection in Russian Telegram Posts in the Scope of the Russian Invasion of Ukraine
Natalia Vanetik | Marina Litvak | Egor Reviakin | Margarita Tiamanova
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

The emergence of social media has made it more difficult to recognize and analyze misinformation efforts. Popular messaging software Telegram has developed into a medium for disseminating political messages and misinformation, particularly in light of the conflict in Ukraine. In this paper, we introduce a sizable corpus of Telegram posts containing pro-Russian propaganda and benign political texts. We evaluate the corpus by applying natural language processing (NLP) techniques to the task of text classification in this corpus. Our findings indicate that, with an overall accuracy of over 96% for confirmed sources as propagandists and oppositions and 92% for unconfirmed sources, our method can successfully identify and categorize pro- Russian propaganda posts. We highlight the consequences of our research for comprehending political communications and propaganda on social media.

2022

pdf bib abs
SAPGraph: Structure-aware Extractive Summarization for Scientific Papers with Heterogeneous Graph
Siya Qi | Lei Li | Yiyang Li | Jin Jiang | Dingxin Hu | Yuze Li | Yingqi Zhu | Yanquan Zhou | Marina Litvak | Natalia Vanetik
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Scientific paper summarization is always challenging in Natural Language Processing (NLP) since it is hard to collect summaries from such long and complicated text. We observe that previous works tend to extract summaries from the head of the paper, resulting in information incompleteness. In this work, we present SAPGraph to utilize paper structure for solving this problem. SAPGraph is a scientific paper extractive summarization framework based on a structure-aware heterogeneous graph, which models the document into a graph with three kinds of nodes and edges based on structure information of facets and knowledge. Additionally, we provide a large-scale dataset of COVID-19-related papers, CORD-SUM. Experiments on CORD-SUM and ArXiv datasets show that SAPGraph generates more comprehensive and valuable summaries compared to previous works.

This paper presents the results and findings of the Financial Narrative Summarisation Shared Task on summarising UK, Greek and Spanish annual reports. The shared task was organised as part of the Financial Narrative Processing 2022 Workshop (FNP 2022 Workshop). The Financial Narrative summarisation Shared Task (FNS-2022) has been running since 2020 as part of the Financial Narrative Processing (FNP) workshop series (El-Haj et al., 2022; El-Haj et al., 2021; El-Haj et al., 2020b; El-Haj et al., 2019c; El-Haj et al., 2018). The shared task included one main task which is the use of either abstractive or extractive automatic summarisers to summarise long documents in terms of UK, Greek and Spanish financial annual reports. This shared task is the third to target financial documents. The data for the shared task was created and collected from publicly available annual reports published by firms listed on the Stock Exchanges of UK, Greece and Spain. A total number of 14 systems from 7 different teams participated in the shared task.

pdf bib abs
Offensive language detection in Hebrew: can other languages help?
Marina Litvak | Natalia Vanetik | Chaya Liebeskind | Omar Hmdia | Rizek Abu Madeghem
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Unfortunately, offensive language in social media is a common phenomenon nowadays. It harms many people and vulnerable groups. Therefore, automated detection of offensive language is in high demand and it is a serious challenge in multilingual domains. Various machine learning approaches combined with natural language techniques have been applied for this task lately. This paper contributes to this area from several aspects: (1) it introduces a new dataset of annotated Facebook comments in Hebrew; (2) it describes a case study with multiple supervised models and text representations for a task of offensive language detection in three languages, including two Semitic (Hebrew and Arabic) languages; (3) it reports evaluation results of cross-lingual and multilingual learning for detection of offensive content in Semitic languages; and (4) it discusses the limitations of these settings.

pdf bib abs
Detection of Negative Campaign in Israeli Municipal Elections
Marina Litvak | Natalia Vanetik | Sagiv Talker | Or Machlouf
Proceedings of the Third Workshop on Threat, Aggression and Cyberbullying (TRAC 2022)

Political competitions are complex settings where candidates use campaigns to promote their chances to be elected. One choice focuses on conducting a positive campaign that highlights the candidate’s achievements, leadership skills, and future programs. The alternative is to focus on a negative campaign that emphasizes the negative aspects of the competing person and is aimed at offending opponents or the opponent’s supporters. In this proposal, we concentrate on negative campaigns in Israeli elections. This work introduces an empirical case study on automatic detection of negative campaigns, using machine learning and natural language processing approaches, applied to the Hebrew-language data from Israeli municipal elections. Our contribution is multi-fold: (1) We provide TONIC—daTaset fOr Negative polItical Campaign in Hebrew—which consists of annotated posts from Facebook related to Israeli municipal elections; (2) We introduce results of a case study, that explored several research questions. RQ1: Which classifier and representation perform best for this task? We employed several traditional classifiers which are known for their good performance in IR tasks and two pre-trained models based on BERT architecture; several standard representations were employed with traditional ML models. RQ2: Does a negative campaign always contain offensive language? Can a model, trained to detect offensive language, also detect negative campaigns? We are trying to answer this question by reporting results for the transfer learning from a dataset annotated with offensive language to our dataset.

pdf bib abs
An Interactive Analysis of User-reported Long COVID Symptoms using Twitter Data
Lin Miao | Mark Last | Marina Litvak
Proceedings of the 2nd Workshop on Deriving Insights from User-Generated Text

With millions of documented recoveries from COVID-19 worldwide, various long-term sequelae have been observed in a large group of survivors. This paper is aimed at systematically analyzing user-generated conversations on Twitter that are related to long-term COVID symptoms for a better understanding of the Long COVID health consequences. Using an interactive information extraction tool built especially for this purpose, we extracted key information from the relevant tweets and analyzed the user-reported Long COVID symptoms with respect to their demographic and geographical characteristics. The results of our analysis are expected to improve the public awareness on long-term COVID-19 sequelae and provide important insights to public health authorities.

2021

pdf bib
Summarization of financial reports with AMUSE
Marina Litvak | Natalia Vanetik
Proceedings of the 3rd Financial Narrative Processing Workshop

pdf bib
Summarization of financial documents with TF-IDF weighting of multi-word terms
Sophie Krimberg | Natalia Vanetik | Marina Litvak
Proceedings of the 3rd Financial Narrative Processing Workshop

2020

pdf bib abs
The Financial Narrative Summarisation Shared Task (FNS 2020)
Mahmoud El-Haj | Ahmed AbuRa’ed | Marina Litvak | Nikiforos Pittaras | George Giannakopoulos
Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation

This paper presents the results and findings of the Financial Narrative Summarisation shared task (FNS 2020) on summarising UK annual reports. The shared task was organised as part of the 1st Financial Narrative Processing and Financial Narrative Summarisation Workshop (FNP-FNS 2020). The shared task included one main task which is the use of either abstractive or extractive summarisation methodologies and techniques to automatically summarise UK financial annual reports. FNS summarisation shared task is the first to target financial annual reports. The data for the shared task was created and collected from publicly available UK annual reports published by firms listed on the London Stock Exchange (LSE). A total number of 24 systems from 9 different teams participated in the shared task. In addition we had 2 baseline summarisers and additional 2 topline summarisers to help evaluate and compare against the results of the participants.

pdf bib abs
SCE-SUMMARY at the FNS 2020 shared task
Marina Litvak | Natalia Vanetik | Zvi Puchinsky
Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation

With the constantly growing amount of information, the need arises to automatically summarize this written information. One of the challenges in the summary is that it’s difficult to generalize. For example, summarizing a news article is very different from summarizing a financial earnings report. This paper reports an approach for summarizing financial texts, which are different from the documents from other domains at least in three parameters: length, structure, and format. Our approach considers these parameters, it is adapted to hierarchical structure of sections, document length, and special “language”. The approach builds an hierarchical summary, visualized as a tree with summaries under different discourse topics. The approach was evaluated using extrinsic and intrinsic automated evaluations, which are reported in this paper. As all participants of the Financial Narrative Summarisation (FNS 2020) shared task, we used FNS2020 dataset for evaluations.

pdf bib abs
Hierarchical summarization of financial reports with RUNNER
Marina Litvak | Natalia Vanetik | Zvi Puchinsky
Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation

pdf bib abs
Automated Discovery of Mathematical Definitions in Text
Natalia Vanetik | Marina Litvak | Sergey Shevchuk | Lior Reznik
Proceedings of the Twelfth Language Resources and Evaluation Conference

Automatic definition extraction from texts is an important task that has numerous applications in several natural language processing fields such as summarization, analysis of scientific texts, automatic taxonomy generation, ontology generation, concept identification, and question answering. For definitions that are contained within a single sentence, this problem can be viewed as a binary classification of sentences into definitions and non-definitions. Definitions in scientific literature can be generic (Wikipedia) or more formal (mathematical articles). In this paper, we focus on automatic detection of one-sentence definitions in mathematical texts, which are difficult to separate from surrounding text. We experiment with several data representations, which include sentence syntactic structure and word embeddings, and apply deep learning methods such as convolutional neural network (CNN) and recurrent neural network (RNN), in order to identify mathematical definitions. Our experiments demonstrate the superiority of CNN and its combination with RNN, applied on the syntactically-enriched input representation. We also present a new dataset for definition extraction from mathematical texts. We demonstrate that the use of this dataset for training learning models improves the quality of definition extraction when these models are then used for other definition datasets. Our experiments with different domains approve that mathematical definitions require special treatment, and that using cross-domain learning is inefficient.

pdf bib abs
Detecting Troll Tweets in a Bilingual Corpus
Lin Miao | Mark Last | Marina Litvak
Proceedings of the Twelfth Language Resources and Evaluation Conference

During the past several years, a large amount of troll accounts has emerged with efforts to manipulate public opinion on social network sites. They are often involved in spreading misinformation, fake news, and propaganda with the intent of distracting and sowing discord. This paper aims to detect troll tweets in both English and Russian assuming that the tweets are generated by some “troll farm.” We reduce this task to the authorship verification problem of determining whether a single tweet is authored by a “troll farm” account or not. We evaluate a supervised classification approach with monolingual, cross-lingual, and bilingual training scenarios, using several machine learning algorithms, including deep learning. The best results are attained by the bilingual learning, showing the area under the ROC curve (AUC) of 0.875 and 0.828, for tweet classification in English and Russian test sets, respectively. It is noteworthy that these results are obtained using only raw text features, which do not require manual feature engineering efforts. In this paper, we introduce a resource of English and Russian troll tweets containing original tweets and translation from English to Russian, Russian to English. It is available for academic purposes.

pdf bib abs
Twitter Data Augmentation for Monitoring Public Opinion on COVID-19 Intervention Measures
Lin Miao | Mark Last | Marina Litvak
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020

The COVID-19 outbreak is an ongoing worldwide pandemic that was announced as a global health crisis in March 2020. Due to the enormous challenges and high stakes of this pandemic, governments have implemented a wide range of policies aimed at containing the spread of the virus and its negative effect on multiple aspects of our life. Public responses to various intervention measures imposed over time can be explored by analyzing the social media. Due to the shortage of available labeled data for this new and evolving domain, we apply data distillation methodology to labeled datasets from related tasks and a very small manually labeled dataset. Our experimental results show that data distillation outperforms other data augmentation methods on our task.

2019

pdf bib abs
In Conclusion Not Repetition: Comprehensive Abstractive Summarization with Diversified Attention Based on Determinantal Point Processes
Lei Li | Wei Liu | Marina Litvak | Natalia Vanetik | Zuying Huang
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

Various Seq2Seq learning models designed for machine translation were applied for abstractive summarization task recently. Despite these models provide high ROUGE scores, they are limited to generate comprehensive summaries with a high level of abstraction due to its degenerated attention distribution. We introduce Diverse Convolutional Seq2Seq Model(DivCNN Seq2Seq) using Determinantal Point Processes methods(Micro DPPs and Macro DPPs) to produce attention distribution considering both quality and diversity. Without breaking the end to end architecture, DivCNN Seq2Seq achieves a higher level of comprehensiveness compared to vanilla models and strong baselines. All the reproducible codes and datasets are available online.

pdf bib abs
RANLP 2019 Multilingual Headline Generation Task Overview
Marina Litvak | John M. Conroy | Peter A. Rankel
Proceedings of the Workshop MultiLing 2019: Summarization Across Languages, Genres and Sources

The objective of the 2019 RANLP Multilingual Headline Generation (HG) Task is to explore some of the challenges highlighted by current state of the art approaches on creating informative headlines to news articles: non-descriptive headlines, out-of-domain training data, generating headlines from long documents which are not well represented by the head heuristic, and dealing with multilingual domain. This tasks makes available a large set of training data for headline generation and provides an evaluation methods for the task. Our data sets are drawn from Wikinews as well as Wikipedia. Participants were required to generate headlines for at least 3 languages, which were evaluated via automatic methods. A key aspect of the task is multilinguality. The task measures the performance of multilingual headline generation systems using the Wikipedia and Wikinews articles in multiple languages. The objective is to assess the performance of automatic headline generation techniques on text documents covering a diverse range of languages and topics outside the news domain.

pdf bib abs
HEvAS: Headline Evaluation and Analysis System
Marina Litvak | Natalia Vanetik | Itzhak Eretz Kdosha
Proceedings of the Workshop MultiLing 2019: Summarization Across Languages, Genres and Sources

Automatic headline generation is a subtask of one-line summarization with many reported applications. Evaluation of systems generating headlines is a very challenging and undeveloped area. We introduce the Headline Evaluation and Analysis System (HEvAS) that performs automatic evaluation of systems in terms of a quality of the generated headlines. HEvAS provides two types of metrics– one which measures the informativeness of a headline, and another that measures its readability. The results of evaluation can be compared to the results of baseline methods which are implemented in HEvAS. The system also performs the statistical analysis of the evaluation results and provides different visualization charts. This paper describes all evaluation metrics, baselines, analysis, and architecture, utilized by our system.

2017

pdf bib
Proceedings of the MultiLing 2017 Workshop on Summarization and Summary Evaluation Across Source Types and Genres
George Giannakopoulos | Elena Lloret | John M. Conroy | Josef Steinberger | Marina Litvak | Peter Rankel | Benoit Favre
Proceedings of the MultiLing 2017 Workshop on Summarization and Summary Evaluation Across Source Types and Genres

In this brief report we present an overview of the MultiLing 2017 effort and workshop, as implemented within EACL 2017. MultiLing is a community-driven initiative that pushes the state-of-the-art in Automatic Summarization by providing data sets and fostering further research and development of summarization systems. This year the scope of the workshop was widened, bringing together researchers that work on summarization across sources, languages and genres. We summarize the main tasks planned and implemented this year, the contributions received, and we also provide insights on next steps.

pdf bib abs
Query-based summarization using MDL principle
Marina Litvak | Natalia Vanetik
Proceedings of the MultiLing 2017 Workshop on Summarization and Summary Evaluation Across Source Types and Genres

Query-based text summarization is aimed at extracting essential information that answers the query from original text. The answer is presented in a minimal, often predefined, number of words. In this paper we introduce a new unsupervised approach for query-based extractive summarization, based on the minimum description length (MDL) principle that employs Krimp compression algorithm (Vreeken et al., 2011). The key idea of our approach is to select frequent word sets related to a given query that compress document sentences better and therefore describe the document better. A summary is extracted by selecting sentences that best cover query-related frequent word sets. The approach is evaluated based on the DUC 2005 and DUC 2006 datasets which are specifically designed for query-based summarization (DUC, 2005 2006). It competes with the best results.

2016

pdf bib abs
What’s up on Twitter? Catch up with TWIST!
Marina Litvak | Natalia Vanetik | Efi Levi | Michael Roistacher
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

Event detection and analysis with respect to public opinions and sentiments in social media is a broad and well-addressed research topic. However, the characteristics and sheer volume of noisy Twitter messages make this a difficult task. This demonstration paper describes a TWItter event Summarizer and Trend detector (TWIST) system for event detection, visualization, textual description, and geo-sentiment analysis of real-life events reported in Twitter.

pdf bib
MUSEEC: A Multilingual Text Summarization Tool
Marina Litvak | Natalia Vanetik | Mark Last | Elena Churkin
Proceedings of ACL-2016 System Demonstrations

pdf bib abs
Social and linguistic behavior and its correlation to trait empathy
Marina Litvak | Jahna Otterbacher | Chee Siang Ang | David Atkins
Proceedings of the Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media (PEOPLES)

A growing body of research exploits social media behaviors to gauge psychological character-istics, though trait empathy has received little attention. Because of its intimate link to the abil-ity to relate to others, our research aims to predict participants’ levels of empathy, given their textual and friending behaviors on Facebook. Using Poisson regression, we compared the vari-ance explained in Davis’ Interpersonal Reactivity Index (IRI) scores on four constructs (em-pathic concern, personal distress, fantasy, perspective taking), by two classes of variables: 1) post content and 2) linguistic style. Our study lays the groundwork for a greater understanding of empathy’s role in facilitating interactions on social media.

Marina Litvak

2025

2024

2023

2022

2021

2020

2019

2017

2016

2015

2013

2010

2008

Co-authors

Venues