Marco Maggini

2026

Enabling Structured Reasoning in Sindhi with Culturally Grounded Instruction Tuning
Mehak Mehak | Kamyar Zeinalipour | Pireh Soomro | Cristiano Chesi | Marco Gori | Marco Maggini
Proceedings of the Second Workshop on Language Models for Low-Resource Languages (LoResLM 2026)

While Large Language Models (LLMs) excel in high-resource contexts, reasoning capabilities in low-resource languages (LRLs) like Sindhi remain limited. To bridge this gap, we introduce Sindhi-Reasoning-Instruct, the first culturally grounded Sindhi instruction corpus. We fine-tuned six LLaMA and Mistral models (1B–24B) to evaluate if parameter-efficient tuning enables deductive, inductive, and causal reasoning. Results demonstrate that linguistically authentic data is the decisive factor. Fine-tuning effectively restored Sindhi’s Perso-Arabic orthography and SOV structure, with the Mistral-Small-24B model achieving a massive 141% relative improvement in human quality ratings over its base version. Furthermore, structured reasoning capabilities were found to scale with model size; while smaller models achieved high fluency, Mistral-Small-24B achieved top performance across logical categories, reaching 83% on inductive reasoning tasks. This study provides empirical evidence that expert-curated, native instruction data allows LRL models to move beyond simple translation toward robust, structured reasoning. The dataset and models are publicly available.

2025

pdf bib abs

PersianMCQ-Instruct: A Comprehensive Resource for Generating Multiple-Choice Questions in Persian
Kamyar Zeinalipour | Neda Jamshidi | Fahimeh Akbari | Marco Maggini | Monica Bianchini | Marco Gori
Proceedings of the First Workshop on Language Models for Low-Resource Languages

We present PersianMCQ-Instruct, a comprehensive resource that includes a dataset and advanced models for generating multiple-choice questions (MCQs) in standard Iranian Persian, a low-resource language spoken by over 80 million people. This resource features three state-of-the-art models for Persian MCQ generation: PMCQ-Gemma2-9b, PMCQ-Llama3.1-8b, and PMCQ-Mistral-7B. Inspired by the Agent Instruct framework and GPT-4o, we created the dataset by curating over 4,000 unique Persian Wikipedia pages, resulting in three MCQs per page and a total of over 12,000 questions. To ensure the quality of this dataset, we conducted human evaluations and model fine-tuning, both of which demonstrated significant performance improvements in Persian MCQ generation. The dataset and models are publicly available, offering valuable tools for researchers and educators, with particular benefits for advancing Persian-language educational technology.

pdf bib abs

Shawarma Chats: A Benchmark Exact Dialogue & Evaluation Platter in Egyptian, Maghrebi & Modern Standard Arabic—A Triple-Dialect Feast for Hungry Language Models
Kamyar Zeinalipour | Mohamed Zaky Saad | Oumaima Attafi | Marco Maggini | Marco Gori
Proceedings of The Third Arabic Natural Language Processing Conference

Content-grounded dialogue evaluation for Arabic remains under-resourced, particularly across Modern Standard (MSA), Egyptian, and Maghrebi varieties. We introduce Shawarma Chats, a benchmark of 30,000 six-turn conversations grounded in Wikipedia content, evenly split across the three dialects. To build this corpus, we prompt five frontier LLMs GPT-4o, Gemini 2.5 Flash, Qwen-Plus, DeepSeek-Chat, and Mistral Large to generate 1,500 seed dialogues. Native Arabic speakers evaluate these outputs to select the most effective generator and most human-aligned grader. Sub-A dialogues undergo a two-pass, rationale-driven self-repair loop where the grader critiques and the generator revises; unresolved cases are manually corrected. We apply this pipeline to 10,000 Wikipedia paragraphs to create 30,000 high-quality conversations 10,000 per dialect—at modest human cost. To validate the benchmark, we LoRA-fine-tune six open LLMs (1–24 B parameters) on Shawarma Chats and observe consistent gains in automatic-grader scores, BERTScore, BLEU and ROUGE particularly for models larger than 7 B parameters. Shawarma Chats thus establishes the first large-scale, dialect-aware, content-grounded dialogue benchmark for Arabic.

pdf bib abs

Recent efforts in natural language processing (NLP) commonsense reasoning research have led to the development of numerous new datasets and benchmarks. However, these resources have predominantly been limited to English, leaving a gap in evaluating commonsense reasoning in other languages. In this paper, we introduce the ArabicSense Benchmark, which is designed to thoroughly evaluate the world-knowledge commonsense reasoning abilities of large language models (LLMs) in Arabic. This benchmark includes three main tasks: first, it tests whether a system can distinguish between natural language statements that make sense and those that do not; second, it requires a system to identify the most crucial reason why a nonsensical statement fails to make sense; and third, it involves generating explanations for why statements do not make sense. We evaluate several Arabic BERT-based models and causal LLMs on these tasks. Experimental results demonstrate improvements after fine-tuning on our dataset. For instance, AraBERT v2 achieved an 87% F1 score on the second task, while Gemma and Mistral-7b achieved F1 scores of 95.5% and 94.8%, respectively. For the generation task, LLaMA-3 achieved the best performance with a BERTScore F1 of 77.3%, closely followed by Mistral-7b at 77.1%. All codes and the benchmark will be made publicly available at https://github.com/.

pdf bib abs

From Arabic Text to Puzzles: LLM-Driven Development of Arabic Educational Crosswords
Kamyar Zeinalipour | Moahmmad Saad | Marco Maggini | Marco Gori
Proceedings of the First Workshop on Language Models for Low-Resource Languages

We present an Arabic crossword puzzle generator from a given text that utilizes advanced language models such as GPT-4-Turbo, GPT-3.5-Turbo, and Llama3-8B-Instruct, specifically developed for educational purposes, this innovative generator leverages a meticulously compiled dataset named Arabic-Clue-Instruct with over 50,000 entries encompassing text, answers, clues, and categories. This dataset is intricately designed to aid in the generation of pertinent clues linked to specific texts and keywords within defined categories. This project addresses the scarcity of advanced educational tools tailored for the Arabic language, promoting enhanced language learning and cognitive development. By providing a culturally and linguistically relevant tool, our objective is to make learning more engaging and effective through gamification and interactivity. Integrating state-of-the-art artificial intelligence with contemporary learning methodologies, this tool can generate crossword puzzles from any given educational text, thereby facilitating an interactive and enjoyable learning experience. This tool not only advances educational paradigms but also sets a new standard in interactive and cognitive learning technologies.

pdf bib abs

FarSense: A Comprehensive Commonsense Benchmark and Evaluation Framework for the Farsi Language
Kamyar Zeinalipour | Neda Jamshidi | Seyedehbahareh Hejazi | Marco Maggini | Monica Bianchini | Simone Paoletti | Marco Gori
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Although Farsi is widely spoken, no comprehensive benchmark exists for assessing commonsense reasoning in language models. We therefore present FarSense, a 6‐task benchmark for Farsi covering True/False judgment, multiple-choice questions, Explanation, Cause‐Effect inference, Counterfactual reasoning, and Knowledge Completion. Starting from Farsi‐Wikipedia, we filtered noise and retained ~4,210 passages, rewrote them into realistic daily scenarios, and derived the above tasks from each scenario. Scenario and task generation quality was first judged via native‐speaker annotations on outputs from five major LLMs—GPT‐4o, Gemini-2.5-Flash, Mistral-Large, Qwen‐Plus, and DeepSeek‐Chat. Gemini-2.5-Flash demonstrated the highest performance, leading to its use in generating a large-scale dataset, subsequently finalized through meticulous two-step human validation. Using FarSense, we measured the commonsense ability of the same five flagship LLMs and also fine‐tuned six compact models (1B–24B parameters) before re‐evaluating them. To ensure broad applicability, task wording was designed to minimize dialectal, cultural, or religious bias. Experiments show that targeted fine‐tuning yields substantial gains, confirming FarSense as a reliable, openly licensed resource for advancing reproducible commonsense understanding research in Farsi NLP. We publicly release all code and data at https://github.com/KamyarZeinalipour/FarSense.

2024

pdf bib abs

Design Proteins Using Large Language Models: Enhancements and Comparative Analyses
Kamyar Zeinalipour | Neda Jamshidi | Monica Bianchini | Marco Maggini | Marco Gori
Proceedings of the 1st Workshop on Language + Molecules (L+M 2024)

Pre-trained LLMs have demonstrated substantial capabilities across a range of conventional natural language processing (NLP) tasks, such as summarization and entity recognition. In this paper, we explore the application of LLMs in the generation of high-quality protein sequences. Specifically, we adopt a suite of pre-trained LLMs, including Mistral-7B, Llama-2-7B, Llama-3-8B, and gemma-7B, to produce valid protein sequences. All of these models are publicly available (https://github.com/KamyarZeinalipour/protein-design-LLMs).Unlike previous work in this field, our approach utilizes a relatively small dataset comprising 42,000 distinct human protein sequences. We retrain these models to process protein-related data, ensuring the generation of biologically feasible protein structures. Our findings demonstrate that even with limited data, the adapted models exhibit efficiency comparable to established protein-focused models such as ProGen varieties, ProtGPT2, and ProLLaMA, which were trained on millions of protein sequences. To validate and quantify the performance of our models, we conduct comparative analyses employing standard metrics such as pLDDT, RMSD, TM-score, and REU. Furthermore, we commit to making the trained versions of all four models publicly available, fostering greater transparency and collaboration in the field of computational biology.

pdf bib abs

Clue-Instruct: Text-Based Clue Generation for Educational Crossword Puzzles
Andrea Zugarini | Kamyar Zeinalipour | Surya Sai Kadali | Marco Maggini | Marco Gori | Leonardo Rigutini
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Crossword puzzles are popular linguistic games often used as tools to engage students in learning. Educational crosswords are characterized by less cryptic and more factual clues that distinguish them from traditional crossword puzzles. Despite there exist several publicly available clue-answer pair databases for traditional crosswords, educational clue-answer pairs datasets are missing. In this article, we propose a methodology to build educational clue generation datasets that can be used to instruct Large Language Models (LLMs). By gathering from Wikipedia pages informative content associated with relevant keywords, we use Large Language Models to automatically generate pedagogical clues related to the given input keyword and its context. With such an approach, we created clue-instruct, a dataset containing 44,075 unique examples with text-keyword pairs associated with three distinct crossword clues. We used clue-instruct to instruct different LLMs to generate educational clues from a given input content and keyword. Both human and automatic evaluations confirmed the quality of the generated clues, thus validating the effectiveness of our approach.

pdf bib abs

SLIMER-IT: Zero-Shot NER on Italian Language
Andrew Zamai | Leonardo Rigutini | Marco Maggini | Andrea Zugarini
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

Traditional approaches to Named Entity Recognition (NER) frame the task into a BIO sequence labeling problem. Although these systems often excel in the downstream task at hand, they require extensive annotated data and struggle to generalize to out-of-distribution input domains and unseen entity types. On the contrary, Large Language Models (LLMs) have demonstrated strong zero-shot capabilities. While several works address Zero-Shot NER in English, little has been done in other languages. In this paper, we define an evaluation framework for Zero-Shot NER, applying it to the Italian language. Furthermore, we introduce SLIMER-IT, the Italian version of SLIMER, an instruction-tuning approach for zero-shot NER leveraging prompts enriched with definition and guidelines. Comparisons with other state-of-the-art models, demonstrate the superiority of SLIMER-IT on never-seen-before entity tags.

pdf bib abs

Harnessing LLMs for Educational Content-Driven Italian Crossword Generation
Kamyar Zeinalipour | Achille Fusco | Asya Zanollo | Marco Maggini | Marco Gori
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

In this work, we unveil a novel tool for generating Italian crossword puzzles from text, utilizing advanced language models such as GPT-4o, Mistral-7B-Instruct-v0.3, and Llama3-8B-Instruct. Crafted specifically for educational applications, this cutting-edge generator makes use of the comprehensive Italian-Clue-Instruct dataset, which comprises over 30,000 entries including diverse text, solutions, and types of clues. This carefully assembled dataset is designed to facilitate the creation of contextually relevant clues in various styles associated with specific texts and keywords.The study delves into four distinctive styles of crossword clues: those without format constraints, those formed as definite determiner phrases, copular sentences, and bare noun phrases. Each style introduces unique linguistic structures to diversify clue presentation.Given the lack of sophisticated educational tools tailored to the Italian language, this project seeks to enhance learning experiences and cognitive development through an engaging, interactive platform. By meshing state-of-the-art AI with contemporary educational strategies, our tool can dynamically generate crossword puzzles from Italian educational materials, thereby providing an enjoyable and interactive learning environment. This technological advancement not only redefines educational paradigms but also sets a new benchmark for interactive and cognitive language learning solutions.

2023

pdf bib

pdf bib abs

ArabIcros: AI-Powered Arabic Crossword Puzzle Generation for Educational Applications
Kamyar Zeinalipour | Mohamed Saad | Marco Maggini | Marco Gori
Proceedings of ArabicNLP 2023

This paper presents the first Arabic crossword puzzle generator driven by advanced AI technology. Leveraging cutting-edge large language models including GPT4, GPT3-Davinci, GPT3-Curie, GPT3-Babbage, GPT3-Ada, and BERT, the system generates distinctive and challenging clues. Based on a dataset comprising over 50,000 clue-answer pairs, the generator employs fine-tuning, few/zero-shot learning strategies, and rigorous quality-checking protocols to enforce the generation of high-quality clue-answer pairs. Importantly, educational crosswords contribute to enhancing memory, expanding vocabulary, and promoting problem-solving skills, thereby augmenting the learning experience through a fun and engaging approach, reshaping the landscape of traditional learning methods. The overall system can be exploited as a powerful educational tool that amalgamates AI and innovative learning techniques, heralding a transformative era for Arabic crossword puzzles and the intersection of technology and education.

2020

pdf bib abs

Vulgaris: Analysis of a Corpus for Middle-Age Varieties of Italian Language
Andrea Zugarini | Matteo Tiezzi | Marco Maggini
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

Italian is a Romance language that has its roots in Vulgar Latin. The birth of the modern Italian started in Tuscany around the 14th century, and it is mainly attributed to the works of Dante Alighieri, Francesco Petrarca and Giovanni Boccaccio, who are among the most acclaimed authors of the medieval age in Tuscany. However, Italy has been characterized by a high variety of dialects, which are often loosely related to each other, due to the past fragmentation of the territory. Italian has absorbed influences from many of these dialects, as also from other languages due to dominion of portions of the country by other nations, such as Spain and France. In this work we present Vulgaris, a project aimed at studying a corpus of Italian textual resources from authors of different regions, ranging in a time period between 1200 and 1600. Each composition is associated to its author, and authors are also grouped in families, i.e. sharing similar stylistic/chronological characteristics. Hence, the dataset is not only a valuable resource for studying the diachronic evolution of Italian and the differences between its dialects, but it is also useful to investigate stylistic aspects between single authors. We provide a detailed statistical analysis of the data, and a corpus-driven study in dialectology and diachronic varieties.

Venues

Marco Maggini

2026

2025

2024

2023

2020

Co-authors

Venues