Natalia Vanetik

2025

Towards Safer Hebrew Communication: A Dataset for Offensive Language Detoxification
Natalia Vanetik | Lior Liberov | Marina Litvak | Chaya Liebeskind
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

Text detoxification is the task of transforming offensive or toxic content into a non-offensive form while preserving the original meaning. Despite increasing research interest in detoxification across various languages, no resources or benchmarks exist for Hebrew, a Semitic language with unique morphological, syntactic, and cultural characteristics. This paper introduces HeDetox, the first annotated dataset for text detoxification in Hebrew. HeDetox contains 600 sentence pairs, each consisting of an offensive source text and a non-offensive text rewritten with an LLM and human intervention. We present a detailed dataset analysis and evaluation showing that the dataset benefits offensive language detection. HeDetox offers a foundational resource for Hebrew natural language processing, advancing research in offensive language mitigation and controllable text generation.

Fine-Grained Arabic Offensive Language Classification with Taxonomy, Sentiment, and Emotions
Natalia Vanetik | Marina Litvak | Chaya Liebeskind
Proceedings of the Workshop on Beyond English: Natural Language Processing for all Languages in an Era of Large Language Models

Offensive language detection in Arabic is a challenging task because of the unique linguistic and cultural characteristics of the Arabic language. This study introduces a high-quality annotated dataset for classifying offensive language in Arabic, based on a structured taxonomy, categorizing offensive content across seven levels, capturing both explicit and implicit expressions. Utilizing this taxonomy, we re-annotate the FARAD-500 dataset, creating reFarad-500, which provides fine-grained labels for offensive texts in Arabic. A thorough dataset analysis reveals key patterns in offensive language distribution, emphasizing the importance of target type, offense severity, and linguistic structures. Additionally, we assess text classification techniques to evaluate the dataset’s effectiveness, exploring the impact of sentiment analysis and emotion detection on classification performance. Our findings highlight the complexity of Arabic offensive language and underscore the necessity of extensive annotation frameworks for accurate detection. This paper advances Arabic natural language processing (NLP) in resource-constrained settings by enhancing the recognition of hate speech and fostering a deeper understanding of the linguistic and emotional dimensions of offensive language.

2024

From Linguistics to Practice: a Case Study of Offensive Language Taxonomy in Hebrew
Chaya Liebeskind | Marina Litvak | Natalia Vanetik
Proceedings of the 8th Workshop on Online Abuse and Harms (WOAH 2024)

The perception of offensive language varies based on cultural, social, and individual perspectives. With the spread of social media, there has been an increase in offensive content online, necessitating advanced solutions for its identification and moderation. This paper addresses the practical application of an offensive language taxonomy, specifically targeting Hebrew social media texts. By introducing a newly annotated dataset, modeled after the taxonomy of explicit offensive language of (Lewandowska-Tomaszczyk et al., 2023), we provide a comprehensive examination of various degrees and aspects of offensive language. Our findings indicate the complexities involved in the classification of such content. We also outline the implications of relying on fixed taxonomies for Hebrew.

pdf bib abs

State-of-the-art abstractive summarization models still suffer from the content contradiction between the summaries and the input text, which is referred to as the factual inconsistency problem. Recently, a large number of works have also been proposed to evaluate factual consistency or improve it by post-editing methods. However, these post-editing methods typically focus on replacing suspicious entities, failing to identify and modify incorrect content hidden in sentence structures. In this paper, we first verify that the correctable errors can be enriched by leveraging sentence structure pruning operation, and then we propose a post-editing method based on that. In the correction process, the pruning operation on possible errors is performed on the syntactic dependency tree with the guidance of multiple factual evaluation metrics. Experimenting on the FRANK dataset shows a great improvement in factual consistency compared with strong baselines and, when combined with them, can achieve even better performance. All the codes and data will be released on paper acceptance.

2023

Propaganda Detection in Russian Telegram Posts in the Scope of the Russian Invasion of Ukraine
Natalia Vanetik | Marina Litvak | Egor Reviakin | Margarita Tiamanova
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

The emergence of social media has made it more difficult to recognize and analyze misinformation efforts. Popular messaging software Telegram has developed into a medium for disseminating political messages and misinformation, particularly in light of the conflict in Ukraine. In this paper, we introduce a sizable corpus of Telegram posts containing pro-Russian propaganda and benign political texts. We evaluate the corpus by applying natural language processing (NLP) techniques to the task of text classification in this corpus. Our findings indicate that, with an overall accuracy of over 96% for confirmed sources as propagandists and oppositions and 92% for unconfirmed sources, our method can successfully identify and categorize pro-Russian propaganda posts. We highlight the consequences of our research for comprehending political communications and propaganda on social media.

2022

SAPGraph: Structure-aware Extractive Summarization for Scientific Papers with Heterogeneous Graph
Siya Qi | Lei Li | Yiyang Li | Jin Jiang | Dingxin Hu | Yuze Li | Yingqi Zhu | Yanquan Zhou | Marina Litvak | Natalia Vanetik
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Scientific paper summarization is always challenging in Natural Language Processing (NLP) since it is hard to collect summaries from such long and complicated text. We observe that previous works tend to extract summaries from the head of the paper, resulting in information incompleteness. In this work, we present SAPGraph to utilize paper structure for solving this problem. SAPGraph is a scientific paper extractive summarization framework based on a structure-aware heterogeneous graph, which models the document into a graph with three kinds of nodes and edges based on structure information of facets and knowledge. Additionally, we provide a large-scale dataset of COVID-19-related papers, CORD-SUM. Experiments on CORD-SUM and ArXiv datasets show that SAPGraph generates more comprehensive and valuable summaries compared to previous works.

Offensive language detection in Hebrew: can other languages help?
Marina Litvak | Natalia Vanetik | Chaya Liebeskind | Omar Hmdia | Rizek Abu Madeghem
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Unfortunately, offensive language in social media is a common phenomenon nowadays. It harms many people and vulnerable groups. Therefore, automated detection of offensive language is in high demand and it is a serious challenge in multilingual domains. Various machine learning approaches combined with natural language techniques have been applied for this task lately. This paper contributes to this area from several aspects: (1) it introduces a new dataset of annotated Facebook comments in Hebrew; (2) it describes a case study with multiple supervised models and text representations for a task of offensive language detection in three languages, including two Semitic (Hebrew and Arabic) languages; (3) it reports evaluation results of cross-lingual and multilingual learning for detection of offensive content in Semitic languages; and (4) it discusses the limitations of these settings.

Detection of Negative Campaign in Israeli Municipal Elections
Marina Litvak | Natalia Vanetik | Sagiv Talker | Or Machlouf
Proceedings of the Third Workshop on Threat, Aggression and Cyberbullying (TRAC 2022)

Political competitions are complex settings where candidates use campaigns to promote their chances to be elected. One choice focuses on conducting a positive campaign that highlights the candidate’s achievements, leadership skills, and future programs. The alternative is to focus on a negative campaign that emphasizes the negative aspects of the competing person and is aimed at offending opponents or the opponent’s supporters. In this proposal, we concentrate on negative campaigns in Israeli elections. This work introduces an empirical case study on automatic detection of negative campaigns, using machine learning and natural language processing approaches, applied to the Hebrew-language data from Israeli municipal elections. Our contribution is multi-fold: (1) We provide TONIC—daTaset fOr Negative polItical Campaign in Hebrew—which consists of annotated posts from Facebook related to Israeli municipal elections; (2) We introduce results of a case study, that explored several research questions. RQ1: Which classifier and representation perform best for this task? We employed several traditional classifiers which are known for their good performance in IR tasks and two pre-trained models based on BERT architecture; several standard representations were employed with traditional ML models. RQ2: Does a negative campaign always contain offensive language? Can a model, trained to detect offensive language, also detect negative campaigns? We are trying to answer this question by reporting results for the transfer learning from a dataset annotated with offensive language to our dataset.

2021

Summarization of financial documents with TF-IDF weighting of multi-word terms
Sophie Krimberg | Natalia Vanetik | Marina Litvak
Proceedings of the 3rd Financial Narrative Processing Workshop

Summarization of financial reports with AMUSE
Marina Litvak | Natalia Vanetik
Proceedings of the 3rd Financial Narrative Processing Workshop

2020

SCE-SUMMARY at the FNS 2020 shared task
Marina Litvak | Natalia Vanetik | Zvi Puchinsky
Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation

With the constantly growing amount of information, the need arises to automatically summarize this written information. One of the challenges in summarization is that it is difficult to generalize. For example, summarizing a news article is very different from summarizing a financial earnings report. This paper reports an approach for summarizing financial texts, which differ from documents in other domains in at least three parameters: length, structure, and format. Our approach considers these parameters: it is adapted to the hierarchical structure of sections, document length, and special “language”. The approach builds a hierarchical summary, visualized as a tree with summaries under different discourse topics. The approach was evaluated using extrinsic and intrinsic automated evaluations, which are reported in this paper. Like all participants of the Financial Narrative Summarisation (FNS 2020) shared task, we used the FNS2020 dataset for evaluations.

Hierarchical summarization of financial reports with RUNNER
Marina Litvak | Natalia Vanetik | Zvi Puchinsky
Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation

Automated Discovery of Mathematical Definitions in Text
Natalia Vanetik | Marina Litvak | Sergey Shevchuk | Lior Reznik
Proceedings of the Twelfth Language Resources and Evaluation Conference

Automatic definition extraction from texts is an important task that has numerous applications in several natural language processing fields such as summarization, analysis of scientific texts, automatic taxonomy generation, ontology generation, concept identification, and question answering. For definitions that are contained within a single sentence, this problem can be viewed as a binary classification of sentences into definitions and non-definitions. Definitions in scientific literature can be generic (Wikipedia) or more formal (mathematical articles). In this paper, we focus on automatic detection of one-sentence definitions in mathematical texts, which are difficult to separate from surrounding text. We experiment with several data representations, which include sentence syntactic structure and word embeddings, and apply deep learning methods such as convolutional neural network (CNN) and recurrent neural network (RNN), in order to identify mathematical definitions. Our experiments demonstrate the superiority of CNN and its combination with RNN, applied on the syntactically-enriched input representation. We also present a new dataset for definition extraction from mathematical texts. We demonstrate that the use of this dataset for training learning models improves the quality of definition extraction when these models are then used for other definition datasets. Our experiments with different domains confirm that mathematical definitions require special treatment, and that using cross-domain learning is inefficient.

2019

HEvAS: Headline Evaluation and Analysis System
Marina Litvak | Natalia Vanetik | Itzhak Eretz Kdosha
Proceedings of the Workshop MultiLing 2019: Summarization Across Languages, Genres and Sources

Automatic headline generation is a subtask of one-line summarization with many reported applications. Evaluation of systems generating headlines is a very challenging and undeveloped area. We introduce the Headline Evaluation and Analysis System (HEvAS) that performs automatic evaluation of systems in terms of the quality of the generated headlines. HEvAS provides two types of metrics: one that measures the informativeness of a headline, and another that measures its readability. The results of evaluation can be compared to the results of baseline methods which are implemented in HEvAS. The system also performs the statistical analysis of the evaluation results and provides different visualization charts. This paper describes all evaluation metrics, baselines, analysis, and architecture utilized by our system.

In Conclusion Not Repetition: Comprehensive Abstractive Summarization with Diversified Attention Based on Determinantal Point Processes
Lei Li | Wei Liu | Marina Litvak | Natalia Vanetik | Zuying Huang
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

Various Seq2Seq learning models designed for machine translation have recently been applied to the abstractive summarization task. Although these models achieve high ROUGE scores, they are limited in their ability to generate comprehensive summaries with a high level of abstraction due to their degenerated attention distribution. We introduce the Diverse Convolutional Seq2Seq Model (DivCNN Seq2Seq), which uses Determinantal Point Processes methods (Micro DPPs and Macro DPPs) to produce an attention distribution that considers both quality and diversity. Without breaking the end-to-end architecture, DivCNN Seq2Seq achieves a higher level of comprehensiveness compared to vanilla models and strong baselines. All the reproducible codes and datasets are available online.

2017

Query-based summarization using MDL principle
Marina Litvak | Natalia Vanetik
Proceedings of the MultiLing 2017 Workshop on Summarization and Summary Evaluation Across Source Types and Genres

Query-based text summarization is aimed at extracting essential information that answers the query from the original text. The answer is presented in a minimal, often predefined, number of words. In this paper we introduce a new unsupervised approach for query-based extractive summarization, based on the minimum description length (MDL) principle, that employs the Krimp compression algorithm (Vreeken et al., 2011). The key idea of our approach is to select frequent word sets related to a given query that compress document sentences better and therefore describe the document better. A summary is extracted by selecting sentences that best cover query-related frequent word sets. The approach is evaluated on the DUC 2005 and DUC 2006 datasets, which are specifically designed for query-based summarization (DUC, 2005, 2006). It is competitive with the best reported results.

2016

What’s up on Twitter? Catch up with TWIST!
Marina Litvak | Natalia Vanetik | Efi Levi | Michael Roistacher
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

Event detection and analysis with respect to public opinions and sentiments in social media is a broad and well-addressed research topic. However, the characteristics and sheer volume of noisy Twitter messages make this a difficult task. This demonstration paper describes a TWItter event Summarizer and Trend detector (TWIST) system for event detection, visualization, textual description, and geo-sentiment analysis of real-life events reported in Twitter.

MUSEEC: A Multilingual Text Summarization Tool
Marina Litvak | Natalia Vanetik | Mark Last | Elena Churkin
Proceedings of ACL-2016 System Demonstrations