Salima Lamsiyah


2025

The AraGenEval Shared Task on Arabic Authorship Style Transfer and AI Generated Text Detection
Shadi Abudalfa | Saad Ezzini | Ahmed Abdelali | Hamza Alami | Abdessamad Benlahbib | Salmane Chafik | Mo El-Haj | Abdelkader El Mahdaouy | Mustafa Jarrar | Salima Lamsiyah | Hamzah Luqman
Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks

We present an overview of the AraGenEval shared task, organized as part of the ArabicNLP 2025 conference. This task introduced the first benchmark suite for Arabic authorship analysis, featuring three subtasks: Authorship Style Transfer, Authorship Identification, and AI-Generated Text Detection. We curated high-quality datasets, including over 47,000 paragraphs from 21 authors and a balanced corpus of human- and AI-generated texts. The task attracted significant global participation, with 72 registered teams from 16 countries. The results highlight the effectiveness of transformer-based models, with top systems leveraging prompt engineering for style transfer, model ensembling for authorship identification, and a mix of multilingual and Arabic-specific models for AI text detection. This paper details the task design, datasets, participant systems, and key findings, establishing a foundation for future research in Arabic stylistics and trustworthy NLP.

Mining the Past: A Comparative Study of Classical and Neural Topic Models on Historical Newspaper Archives
Keerthana Murugaraj | Salima Lamsiyah | Marten During | Martin Theobald
Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities

Analyzing historical discourse in large-scale newspaper archives requires scalable and interpretable methods to uncover hidden themes. This study systematically evaluates topic modeling approaches for newspaper articles from 1955 to 2018, comparing probabilistic LDA, matrix factorization NMF, and neural-based models such as Top2Vec and BERTopic across various preprocessing strategies. We benchmark these methods on topic coherence, diversity, scalability, and interpretability. While LDA is commonly used in historical text analysis, our findings demonstrate that BERTopic, leveraging contextual embeddings, consistently outperforms classical models in all tested aspects, making it a more robust choice for large-scale textual corpora. Additionally, we highlight the trade-offs between preprocessing strategies and model performance, emphasizing the importance of tailored pipeline design. These insights advance the field of historical NLP, offering concrete guidance for historians and computational social scientists in selecting the most effective topic-modeling approach for analyzing digitized archives. Our code will be publicly available on GitHub.

Reddit-V: A Virality Prediction Dataset and Zero-Shot Evaluation with Large Language Models
Samir El-amrany | Matthias R. Brust | Salima Lamsiyah | Pascal Bouvry
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

We present Reddit-V, a new dataset designed to advance research on social media virality prediction in natural language processing. The dataset consists of over 27,000 Reddit posts, each enriched with images, textual content, and pre-engagement metadata such as post titles, categories, sentiment scores, and posting times. As an initial benchmark, we evaluate several instruction-tuned large language models (LLMs) in a zero-shot setting, prompting them with post titles and metadata to predict post virality. We then fine-tune two multimodal models, CLIP and IDEFICS, to assess whether incorporating visual context enhances predictive performance. Our results show that zero-shot LLMs perform poorly, whereas the fine-tuned multimodal models achieve better performance. Specifically, CLIP outperforms the best-performing zero-shot LLM (CodeLLaMA) by 3%, while IDEFICS achieves a 7% improvement over the same baseline, highlighting the importance of visual features in virality prediction. We release the Reddit-V dataset and our evaluation results to facilitate further research on multimodal and text-based virality prediction. Our dataset and code will be made publicly available on GitHub.

Trust but Verify: A Comprehensive Survey of Faithfulness Evaluation Methods in Abstractive Text Summarization
Salima Lamsiyah | Aria Nourbakhsh | Christoph Schommer
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

Abstractive text summarization systems have advanced significantly with the rise of neural language models. However, they frequently suffer from unfaithfulness or factual inconsistency, generating content that is not verifiably supported by the source text. This survey provides a comprehensive review of over 40 studies published between 2020 and 2025 on methods for evaluating faithfulness in abstractive summarization. We present a unified taxonomy that covers human evaluation techniques and a variety of automatic metrics, including question answering (QA)-based methods, natural language inference (NLI)-based methods, graph-based approaches, and large language model (LLM)-based evaluation. We also discuss meta-evaluation protocols that assess the quality of these metrics. In addition, we analyze a wide range of benchmark datasets, highlighting their design, scope, and relevance to emerging challenges such as long-document and domain-specific summarization. We further identify critical limitations in current evaluation practices, including poor alignment with human judgment, limited robustness, and inefficiencies in handling complex summaries. We conclude by outlining future directions to support the development of more reliable, interpretable, and scalable evaluation methods. This work aims to support researchers in navigating the rapidly evolving landscape of faithfulness evaluation in summarization.

Quantifying the Overlap: Attribution Maps and Linguistic Heuristics in Encoder-Decoder Machine Translation Models
Aria Nourbakhsh | Salima Lamsiyah | Christoph Schommer
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

Explainable AI (XAI) attribution methods seek to illuminate the decision-making process of generative models by quantifying the contribution of each input token to the generated output. Different attribution algorithms, often rooted in distinct methodological frameworks, can produce varied interpretations of feature importance. In this study, we utilize attribution mappings derived from three distinct methods as weighting signals during the training of encoder-decoder models. Our findings demonstrate that Attention and Value Zeroing attribution weights consistently lead to improved model performance. To better understand the linguistic information these mappings capture, we extract part-of-speech (POS), dependency, and named entity recognition (NER) tags from the input-output pairs and compare them with the XAI attribution maps. Although the Saliency method shows greater alignment with POS and dependency annotations than Value Zeroing, it exhibits more divergence in places where its attributions do not conform to these linguistic tags, compared to the other two methods, and it contributes less to the models’ performance.

Proceedings of the Shared Task on Multi-Domain Detection of AI-Generated Text
Salima Lamsiyah | Saad Ezzini | Abdelkader El Mahdaoui | Hamza Alami | Abdessamad Benlahbib | Samir El Amrani | Salmane Chafik | Hicham Hammouchi
Proceedings of the Shared Task on Multi-Domain Detection of AI-Generated Text

M-DAIGT: A Shared Task on Multi-Domain Detection of AI-Generated Text
Salima Lamsiyah | Saad Ezzini | Abdelkader El Mahdaouy | Hamza Alami | Abdessamad Benlahbib | Samir El amrany | Salmane Chafik | Hicham Hammouchi
Proceedings of the Shared Task on Multi-Domain Detection of AI-Generated Text

The generation of highly fluent text by Large Language Models (LLMs) poses a significant challenge to information integrity and academic research. In this paper, we introduce the Multi-Domain Detection of AI-Generated Text (M-DAIGT) shared task, which focuses on detecting AI-generated text across multiple domains, particularly in news articles and academic writing. M-DAIGT comprises two binary classification subtasks: News Article Detection (NAD) (Subtask 1) and Academic Writing Detection (AWD) (Subtask 2). To support this task, we developed and released a new large-scale benchmark dataset of 30,000 samples, balanced between human-written and AI-generated texts. The AI-generated content was produced using a variety of modern LLMs (e.g., GPT-4, Claude) and diverse prompting strategies. A total of 46 unique teams registered for the shared task, of which four teams submitted final results. All four teams participated in both Subtask 1 and Subtask 2. We describe the methods employed by these participating teams and briefly discuss future directions for M-DAIGT.

Proceedings of the 4th Workshop on Arabic Corpus Linguistics (WACL-4)
Saad Ezzini | Hamza Alami | Ismail Berrada | Abdessamad Benlahbib | Abdelkader El Mahdaouy | Salima Lamsiyah | Hatim Derrouz | Amal Haddad Haddad | Mustafa Jarrar | Mo El-Haj | Ruslan Mitkov | Paul Rayson
Proceedings of the 4th Workshop on Arabic Corpus Linguistics (WACL-4)

ArabicSense: A Benchmark for Evaluating Commonsense Reasoning in Arabic with Large Language Models
Salima Lamsiyah | Kamyar Zeinalipour | Samir El amrany | Matthias Brust | Marco Maggini | Pascal Bouvry | Christoph Schommer
Proceedings of the 4th Workshop on Arabic Corpus Linguistics (WACL-4)

Recent research on commonsense reasoning in natural language processing (NLP) has led to the development of numerous new datasets and benchmarks. However, these resources have predominantly been limited to English, leaving a gap in evaluating commonsense reasoning in other languages. In this paper, we introduce the ArabicSense Benchmark, which is designed to thoroughly evaluate the world-knowledge commonsense reasoning abilities of large language models (LLMs) in Arabic. This benchmark includes three main tasks: first, it tests whether a system can distinguish between natural language statements that make sense and those that do not; second, it requires a system to identify the most crucial reason why a nonsensical statement fails to make sense; and third, it involves generating explanations for why statements do not make sense. We evaluate several Arabic BERT-based models and causal LLMs on these tasks. Experimental results demonstrate improvements after fine-tuning on our dataset. For instance, AraBERT v2 achieved an 87% F1 score on the second task, while Gemma and Mistral-7b achieved F1 scores of 95.5% and 94.8%, respectively. For the generation task, LLaMA-3 achieved the best performance with a BERTScore F1 of 77.3%, closely followed by Mistral-7b at 77.1%. All code and the benchmark will be made publicly available at https://github.com/.

Empirical Evaluation of Pre-trained Language Models for Summarizing Moroccan Darija News Articles
Azzedine Aftiss | Salima Lamsiyah | Christoph Schommer | Said Ouatik El Alaoui
Proceedings of the 4th Workshop on Arabic Corpus Linguistics (WACL-4)

Moroccan Dialect (MD), or “Darija,” is a primary spoken variant of Arabic in Morocco, yet remains underrepresented in Natural Language Processing (NLP) research, particularly in tasks like summarization. Despite a growing volume of MD textual data online, there is a lack of robust resources and NLP models tailored to handle the unique linguistic challenges posed by MD. In response, we introduce GOUD.MA_v2, an expanded version of the GOUD.MA dataset, containing over 50k articles with their titles across 11 categories. This dataset provides a more comprehensive resource for developing summarization models. We evaluate the application of large language models (LLMs) for MD summarization, utilizing both fine-tuning and zero-shot prompting with encoder-decoder and causal LLMs, respectively. Our findings demonstrate that an expanded dataset improves summarization performance and highlight the capabilities of recent LLMs in handling MD text. We open-source our dataset, fine-tuned models, and all experimental code, establishing a foundation for future advancements in MD NLP. We release the code at https://github.com/AzzedineAftiss/Moroccan-Dialect-Summarization.

2023

UL & UM6P at ArAIEval Shared Task: Transformer-based model for Persuasion Techniques and Disinformation detection in Arabic
Salima Lamsiyah | Abdelkader El Mahdaouy | Hamza Alami | Ismail Berrada | Christoph Schommer
Proceedings of ArabicNLP 2023

In this paper, we introduce our participating system to the ArAIEval Shared Task, addressing both the detection of persuasion techniques and disinformation tasks. Our proposed system employs a pre-trained transformer-based language model for Arabic, alongside a classifier. We have assessed the performance of three Arabic Pre-trained Language Models (PLMs) for sentence encoding. Additionally, to enhance our model’s performance, we have explored various training objectives, including Cross-Entropy loss, regularized Mixup loss, asymmetric multi-label loss, and Focal Tversky loss. On the official test set, our system has achieved micro-F1 scores of 0.7515, 0.5666, 0.904, and 0.8333 for Sub-Task 1A, Sub-Task 1B, Sub-Task 2A, and Sub-Task 2B, respectively. Furthermore, our system has secured the 4th, 1st, 3rd, and 2nd positions, respectively, among all participating systems in sub-tasks 1A, 1B, 2A, and 2B of the ArAIEval shared task.

UM6P & UL at WojoodNER shared task: Improving Multi-Task Learning for Flat and Nested Arabic Named Entity Recognition
Abdelkader El Mahdaouy | Salima Lamsiyah | Hamza Alami | Christoph Schommer | Ismail Berrada
Proceedings of ArabicNLP 2023

In this paper, we present our submitted system for the WojoodNER Shared Task, addressing both flat and nested Arabic Named Entity Recognition (NER). Our system is based on a BERT-based multi-task learning model that leverages the existing Arabic Pretrained Language Models (PLMs) to encode the input sentences. To enhance the performance of our model, we have employed a multi-task loss variance penalty and combined several training objectives, including the Cross-Entropy loss, the Dice loss, the Tversky loss, and the Focal loss. Besides, we have studied the performance of three existing Arabic PLMs for sentence encoding. On the official test set, our system has obtained a micro-F1 score of 0.9113 and 0.9303 for Flat (Sub-Task 1) and Nested (Sub-Task 2) NER, respectively. It has been ranked in the 6th and the 2nd positions among all participating systems in Sub-Task 1 and Sub-Task 2, respectively.

UL & UM6P at SemEval-2023 Task 10: Semi-Supervised Multi-task Learning for Explainable Detection of Online Sexism
Salima Lamsiyah | Abdelkader El Mahdaouy | Hamza Alami | Ismail Berrada | Christoph Schommer
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

This paper introduces our participating system for SemEval-2023 Task 10: Explainable Detection of Online Sexism (EDOS). The EDOS shared task covers three hierarchical sub-tasks: sexism detection, coarse-grained categorization, and fine-grained categorization. We have investigated both single-task and multi-task learning based on RoBERTa transformer-based language models. To improve the results, we have performed further pre-training of RoBERTa on the provided unlabeled data. Besides, we have employed a small sample of the unlabeled data for semi-supervised learning using the minimum class-confusion loss. Our system has achieved macro F1 scores of 82.25%, 67.35%, and 49.8% on Tasks A, B, and C, respectively.

UM6P at SemEval-2023 Task 12: Out-Of-Distribution Generalization Method for African Languages Sentiment Analysis
Abdelkader El Mahdaouy | Hamza Alami | Salima Lamsiyah | Ismail Berrada
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

This paper presents our system submitted to AfriSenti SemEval-2023 Task 12: Sentiment Analysis for African Languages. AfriSenti consists of three different tasks, covering monolingual, multilingual, and zero-shot sentiment analysis scenarios for African languages. To improve model generalization, we have explored the following steps: 1) further pre-training of the AfroXLM Pre-trained Language Model (PLM), 2) combining AfroXLM and MARBERT PLMs using a residual layer, and 3) studying the impact of metric learning and two out-of-distribution generalization training objectives. The overall evaluation results show that our system has achieved promising results on several sub-tasks of Task A. For Tasks B and C, our system is ranked among the top six participating systems.

2018

Résumé automatique guidé de textes: État de l’art et perspectives (Guided Summarization: State-of-the-art and perspectives)
Salima Lamsiyah | Said Ouatik El Alaoui | Bernard Espinasse
Actes de la Conférence TALN. Volume 2 - Démonstrations, articles des Rencontres Jeunes Chercheurs, ateliers DeFT

Automatic text summarization systems produce a condensed and relevant representation of one or more text documents. Most of these systems are based on extractive approaches, while the current trend is toward abstractive approaches. In this context, guided summarization, defined by the international TAC (Text Analysis Conference) evaluation campaign in 2010, aims to encourage research on this type of approach by relying on deep text analysis techniques. In this paper, we focus on guided automatic text summarization. First, we define the characteristics and constraints of this task. Next, we survey the main existing systems, emphasizing the most recent work and classifying them by the approaches adopted, the techniques used, and their evaluations on reference corpora. Finally, we propose the main steps of a specific method intended to enable the development of a new type of guided summarization system.