Rafet Sifa


2025

MultiProp Framework: Ensemble Models for Enhanced Cross-Lingual Propaganda Detection in Social Media and News using Data Augmentation, Text Segmentation, and Meta-Learning
Farizeh Aldabbas | Shaina Ashraf | Rafet Sifa | Lucie Flek
Proceedings of the 1st Workshop on NLP for Languages Using Arabic Script

Propaganda, a pervasive tool for influencing public opinion, demands robust automated detection systems, particularly for under-resourced languages. Current efforts largely focus on well-resourced languages like English, leaving significant gaps in languages such as Arabic. This research addresses these gaps by introducing the MultiProp Framework, a cross-lingual meta-learning framework designed to enhance propaganda detection across multiple languages, including Arabic, German, Italian, French and English. We constructed a multilingual dataset using data translation techniques, beginning with Arabic data from the PTC and WANLP shared tasks, and expanded it with translations into German, Italian and French, further enriched by the SemEval23 dataset. Our proposed framework encompasses three distinct models: MultiProp-Baseline, which combines ensembles of pre-trained models such as GPT-2, mBART, and XLM-RoBERTa; MultiProp-ML, designed to handle languages with minimal or no training data by utilizing advanced meta-learning techniques; and MultiProp-Chunk, which overcomes the challenges of processing longer texts that exceed the token limits of pre-trained models. Together, they deliver superior performance compared to state-of-the-art methods, representing a significant advancement in the field of cross-lingual propaganda detection.
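The abstract does not spell out how MultiProp-Chunk combines predictions over long documents; the sketch below illustrates one common way to handle texts that exceed an encoder's token limit (overlapping token windows plus majority voting). The window size, stride, and `classify_chunk` callable are illustrative placeholders, not the paper's implementation.

```python
from collections import Counter

MAX_TOKENS = 512   # typical limit of the pre-trained encoders named in the abstract
STRIDE = 256       # overlap between consecutive chunks (illustrative choice)

def chunk_token_ids(token_ids, max_tokens=MAX_TOKENS, stride=STRIDE):
    """Split a long token-id sequence into overlapping windows."""
    chunks, start = [], 0
    while start < len(token_ids):
        chunks.append(token_ids[start:start + max_tokens])
        if start + max_tokens >= len(token_ids):
            break
        start += stride
    return chunks

def classify_long_text(token_ids, classify_chunk):
    """Label a document by majority vote over chunk-level predictions.

    `classify_chunk` is a hypothetical callable mapping one token-id chunk
    to a label such as "propaganda" / "non-propaganda".
    """
    votes = Counter(classify_chunk(c) for c in chunk_token_ids(token_ids))
    return votes.most_common(1)[0][0]

if __name__ == "__main__":
    fake_ids = list(range(1200))  # stand-in for real tokenizer output
    stub = lambda chunk: "propaganda" if sum(chunk) % 2 else "non-propaganda"
    print(classify_long_text(fake_ids, stub))
```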

Resource-Efficient Anonymization of Textual Data via Knowledge Distillation from Large Language Models
Tobias Deußer | Max Hahnbück | Tobias Uelwer | Cong Zhao | Christian Bauckhage | Rafet Sifa
Proceedings of the 31st International Conference on Computational Linguistics: Industry Track

Protecting personal and sensitive information in textual data is increasingly crucial, especially when leveraging large language models (LLMs) that may pose privacy risks due to their API-based access. We introduce a novel approach and pipeline for anonymizing text across arbitrary domains without the need for manually labeled data or extensive computational resources. Our method employs knowledge distillation from LLMs into smaller encoder-only models via named entity recognition (NER) coupled with regular expressions to create a lightweight model capable of effective anonymization while preserving the semantic and contextual integrity of the data. This reduces computational overhead, enabling deployment on less powerful servers or even personal computing devices. Our findings suggest that knowledge distillation offers a scalable, resource-efficient pathway for anonymization, balancing privacy preservation with model performance and computational efficiency.
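A rough sketch of the anonymization step described above: entity spans predicted by a distilled NER model are merged with regular-expression matches and replaced by typed placeholders. The `ner_spans` argument is a hypothetical stand-in for the distilled encoder's output, and the patterns and placeholder names are illustrative only.

```python
import re

# Regex patterns for formats a small NER model may miss (illustrative only).
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.\w+\b"),
    "PHONE": re.compile(r"\+?\d[\d\s/-]{7,}\d"),
}

def anonymize(text, ner_spans):
    """Replace NER spans and regex matches with typed placeholders.

    `ner_spans` is a list of (start, end, label) tuples, e.g. produced by a
    distilled encoder-only NER model (hypothetical interface).
    """
    spans = list(ner_spans)
    for label, pattern in PATTERNS.items():
        spans.extend((m.start(), m.end(), label) for m in pattern.finditer(text))
    # Replace from the end of the text so earlier offsets stay valid.
    for start, end, label in sorted(spans, key=lambda s: s[0], reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text

# Usage:
print(anonymize("Contact Jane Doe at jane.doe@example.com.",
                ner_spans=[(8, 16, "PERSON")]))
# -> "Contact [PERSON] at [EMAIL]."
```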

2024

Tokenizer Choice For LLM Training: Negligible or Crucial?
Mehdi Ali | Michael Fromm | Klaudia Thellmann | Richard Rutmann | Max Lübbering | Johannes Leveling | Katrin Klug | Jan Ebert | Niclas Doll | Jasper Buschhoff | Charvi Jain | Alexander Weber | Lena Jurkschat | Hammam Abdelwahab | Chelsea John | Pedro Ortiz Suarez | Malte Ostendorff | Samuel Weinbach | Rafet Sifa | Stefan Kesselheim | Nicolas Flores-Herr
Findings of the Association for Computational Linguistics: NAACL 2024

The recent success of large language models (LLMs) has been predominantly driven by curating the training dataset composition, scaling model architectures and dataset sizes, and advancements in pretraining objectives, leaving tokenizer influence as a blind spot. Shedding light on this underexplored area, we conduct a comprehensive study on the influence of tokenizer choice on LLM downstream performance by training 24 mono- and multilingual LLMs at a 2.6B parameter scale, ablating different tokenizer algorithms and parameterizations. Our studies highlight that the tokenizer choice can significantly impact the model’s downstream performance and training costs. In particular, we find that the common tokenizer evaluation metrics fertility and parity are not always predictive of model downstream performance, rendering them a questionable proxy for it. Furthermore, we show that multilingual tokenizers trained on the five most frequent European languages require vocabulary size increases by a factor of three in comparison to English. While English-centric tokenizers have been applied to the training of multilingual LLMs in the past, we find that this approach results in severe downstream performance degradation and additional training costs of up to 68%, due to an inefficient tokenization vocabulary.
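For readers unfamiliar with the two intrinsic metrics mentioned above, here is a minimal sketch of how fertility (average tokens per word) and parity (token-count ratio on parallel sentences across languages) are commonly computed. The whitespace word split and the toy tokenizer are placeholders, not the paper's setup, and the exact definitions may differ from those used in the study.

```python
def fertility(sentences, tokenize):
    """Average number of tokens produced per whitespace-separated word."""
    n_tokens = sum(len(tokenize(s)) for s in sentences)
    n_words = sum(len(s.split()) for s in sentences)
    return n_tokens / n_words

def parity(parallel_pairs, tokenize):
    """Token-count ratio between translations of the same sentences.

    Values near 1.0 mean the tokenizer treats both languages roughly
    equally; large values indicate over-segmentation of the second
    language relative to the first.
    """
    n_first = sum(len(tokenize(a)) for a, _ in parallel_pairs)
    n_second = sum(len(tokenize(b)) for _, b in parallel_pairs)
    return n_second / n_first

# Usage with a toy character-bigram tokenizer as a stand-in:
toy_tok = lambda s: [s[i:i + 2] for i in range(0, len(s), 2)]
print(fertility(["the cat sat"], toy_tok))            # ~2.0
print(parity([("the cat", "die Katze")], toy_tok))    # ~1.25
```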

Proceedings of the Workshop: Bridging Neurons and Symbols for Natural Language Processing and Knowledge Graphs Reasoning (NeusymBridge) @ LREC-COLING-2024
Tiansi Dong | Erhard Hinrichs | Zhen Han | Kang Liu | Yangqiu Song | Yixin Cao | Christian F. Hempelmann | Rafet Sifa
Proceedings of the Workshop: Bridging Neurons and Symbols for Natural Language Processing and Knowledge Graphs Reasoning (NeusymBridge) @ LREC-COLING-2024

Word Sense Disambiguation as a Game of Neurosymbolic Darts
Tiansi Dong | Rafet Sifa
Proceedings of the Workshop: Bridging Neurons and Symbols for Natural Language Processing and Knowledge Graphs Reasoning (NeusymBridge) @ LREC-COLING-2024

Word Sense Disambiguation (WSD) is one of the hardest tasks in natural language understanding and knowledge engineering. The glass ceiling of an 80% F1 score has only recently been reached through supervised learning enriched by knowledge graphs. Here, we propose a novel neurosymbolic methodology that may push the F1 score above 90%. The core of our methodology is a neurosymbolic sense embedding in the form of a configuration of nested n-dimensional balls. The center of a ball preserves the pre-trained word embedding learned from data, which partially fixes the locations of the balls. Inclusion relations among balls precisely encode symbolic hypernym relations among senses and enable simple logical deduction over sense embeddings. We trained a Transformer to learn the mapping from a contextualized word embedding to its sense ball embedding, much like playing a game of darts (shooting darts into a dartboard). A series of experiments are carried out using pre-trained n-ball embeddings, which cover around 70% of the training data and 75% of the testing data in the benchmark WSD corpus. Euclidean distance and cosine similarity are used as objective functions, separately, and each reaches an F1 score above 95.0% on the ALL-n-ball dataset. This substantially breaks the glass ceiling of deep learning methods. Future work is discussed towards a full-fledged neurosymbolic WSD system that substantially outperforms deep learning approaches.
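A minimal sketch of the geometric intuition in the abstract: a sense ball is a (center, radius) pair, hypernymy corresponds to ball inclusion, and a contextualized embedding is assigned to the sense whose ball center is closest under Euclidean distance. The containment test and the toy 2-d balls are illustrative assumptions, not the paper's implementation.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contains(outer, inner):
    """Ball inclusion: inner lies inside outer iff
    dist(centers) + r_inner <= r_outer (encodes hypernymy)."""
    (c_out, r_out), (c_in, r_in) = outer, inner
    return euclidean(c_out, c_in) + r_in <= r_out

def nearest_sense(context_vec, sense_balls):
    """Pick the sense whose ball center is closest to the contextualized
    embedding (the 'dart hits the dartboard' step)."""
    return min(sense_balls, key=lambda s: euclidean(context_vec, sense_balls[s][0]))

# Toy 2-d example (real sense embeddings are high-dimensional):
balls = {
    "animal.n.01": ((0.0, 0.0), 5.0),   # hypernym ball
    "dog.n.01":    ((1.0, 1.0), 1.0),   # hyponym ball, nested inside
}
print(contains(balls["animal.n.01"], balls["dog.n.01"]))  # True
print(nearest_sense((0.8, 1.2), balls))                   # "dog.n.01"
```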

2020

Fraunhofer IAIS at FinCausal 2020, Tasks 1 & 2: Using Ensemble Methods and Sequence Tagging to Detect Causality in Financial Documents
Maren Pielka | Rajkumar Ramamurthy | Anna Ladi | Eduardo Brito | Clayton Chapman | Paul Mayer | Rafet Sifa
Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation

The FinCausal 2020 shared task aims to detect causality in financial news and to identify the parts of causal sentences that express the underlying cause and effect. We apply ensemble-based and sequence tagging methods for identifying causality and extracting causal subsequences. Our models yield promising results on both sub-tasks, with the prospect of further improvement given more time and computing resources. On Task 1, we achieved an F1 score of 0.9429 on the evaluation data, corresponding to a ranking of 12/14. On Task 2, we were ranked 6/10, with an F1 score of 0.76 and an ExactMatch score of 0.1912.
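To make the sequence-tagging side of Task 2 concrete, here is a small sketch that turns token-level BIO predictions (B-CAUSE/I-CAUSE, B-EFFECT/I-EFFECT, O) into cause and effect spans. The tag set and decoding rule are a plausible reconstruction of standard BIO decoding, not necessarily the exact scheme used in the submission.

```python
def decode_spans(tokens, tags):
    """Collect contiguous BIO-tagged spans into (label, text) pairs."""
    spans, current_label, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_label:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:
            if current_label:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_label:
        spans.append((current_label, " ".join(current_tokens)))
    return spans

# Usage:
tokens = "Profits fell because demand collapsed".split()
tags   = ["B-EFFECT", "I-EFFECT", "O", "B-CAUSE", "I-CAUSE"]
print(decode_spans(tokens, tags))
# [('EFFECT', 'Profits fell'), ('CAUSE', 'demand collapsed')]
```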