Nuhu Ibrahim

2026

Knowledge Augmentation Enhances Token Classification for Recipe Understanding
Nuhu Ibrahim | Robert Stevens | Riza Batista-Navarro
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

In this work, we propose an entity type-specific and knowledge-augmented token classification framework designed to improve encoder models’ performance on recipe texts. Our empirical analysis shows that this approach achieves state-of-the-art (SOTA) results on 5 out of 7 benchmark recipe datasets, significantly outperforming traditional token classification methods. We introduce a novel methodology leveraging curated domain-specific knowledge contexts to guide encoder models such as BERT and RoBERTa, which we refer to as RecipeBERT-KA and RecipeRoBERTa-KA. Additionally, we release a newly reprocessed entity type-specific and knowledge-enriched dataset that merges seven widely used food datasets, making it the largest annotated food-related dataset to date. Comparative analysis with SOTA large language models (GPT-4o, Mistral-7B, LLaMA 3-13B and LLaMA 3-70B) highlights the practical advantages of our smaller and specialised models. Finally, we analyse the impact of the different knowledge contexts, our models’ potential for transfer learning, the effect of combining the datasets and scenarios where traditional token classification may still perform competitively, offering nuanced insight into method selection.

Lost in Formatting: How Output Formats Skew LLM Performance on Information Extraction
Rishi Ravikumar | Nuhu Ibrahim | Riza Batista-Navarro
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

We investigate how the choice of output format influences the performance of fine-tuned large language models on information extraction tasks. Based on over 280 experiments spanning multiple benchmarks, models and formats, we find that output formatting is a critical yet largely overlooked hyperparameter. Remarkably, in some cases, changing only the output format shifts F1 scores by over 40% despite using the same model. We further observe that no single format consistently dominates across settings, and the optimal choice depends on factors like model family and dataset characteristics. Overall, these results demonstrate that informationally equivalent output formats can produce substantial performance variation, highlighting the need to treat output formatting as a key factor in building accurate and reliable information extraction systems.

ReciFine: Finely Annotated Recipe Dataset for Controllable Recipe Generation
Nuhu Ibrahim | Rishi Ravikumar | Robert Stevens | Riza Batista-Navarro
Findings of the Association for Computational Linguistics: EACL 2026

We introduce ReciFine, the largest human-evaluated, finely annotated recipe dataset to date, designed to advance controllable and trustworthy recipe generation. Existing resources, such as RecipeNLG, extract food items only from ingredient lists, overlooking entities expressed in instructions, including tools, chef actions, food and tool states, and durations, which are crucial for realistic and context-aware generation. To address this limitation, we extend RecipeNLG with finely annotated extraction of over 97 million entities across ten entity types from 2.2 million recipes. We are the first to explore recipe generation with explicit control over multiple entity types, enabling models to generate recipes conditioned not only on ingredients but also on tools, chef actions, cooking durations, and other contextual factors. Large language models fine-tuned or few-shot prompted with ReciFine extractions consistently outperform those trained on ingredient-list data alone across both automatic and human evaluations. ReciFine establishes a foundation for factual, coherent, structured, controllable recipe generation, and we release a human-annotated benchmark to support future evaluation and model development.

2025

Large Language Models as Detectors or Instigators of Hate Speech in Low-resource Ethiopian Languages
Nuhu Ibrahim | Felicity Mulford | Riza Batista-Navarro
Proceedings of the 9th Widening NLP Workshop

We introduce a multilingual benchmark for evaluating large language models (LLMs) on hate speech detection and generation in low-resource Ethiopian languages: Afaan Oromo, Amharic and Tigrigna, and English (both monolingual and code-mixed). Using a balanced and expert-annotated dataset, we assess five state-of-the-art LLM families across both tasks. Our results show that while LLMs perform well on English detection, their performance on low-resource languages is significantly weaker, revealing that increasing model size alone does not ensure multilingual robustness. More critically, we find that all models, including closed and open-source variants, can be prompted to generate profiled hate speech with minimal resistance. These findings underscore the dual risk of exclusion and exploitation: LLMs fail to protect low-resource communities while enabling scalable harm against them. We make our evaluation framework available to facilitate future research on multilingual model safety and ethical robustness.

2024

Resources for Annotating Hate Speech in Social Media Platforms Used in Ethiopia: A Novel Lexicon and Labelling Scheme
Nuhu Ibrahim | Felicity Mulford | Matt Lawrence | Riza Batista-Navarro
Proceedings of the Fifth Workshop on Resources for African Indigenous Languages @ LREC-COLING 2024

Hate speech on social media has proliferated in Ethiopia. To support studies aimed at investigating the targets and types of hate speech circulating in the Ethiopian context, we developed a new fine-grained annotation scheme that captures three elements of hate speech: the target (i.e., any groups with protected characteristics), type (i.e., the method of abuse) and nature (i.e., the style of the language used). We also developed a new lexicon of hate speech-related keywords in the four most prominent languages found on Ethiopian social media: Amharic, Afaan Oromo, English and Tigrigna. These keywords enabled us to retrieve social media posts (also in the same four languages) from three platforms (i.e., X, Telegram and Facebook) that are likely to contain hate speech. Experts in the Ethiopian context then manually annotated a sample of those retrieved posts, obtaining fair to moderate inter-annotator agreement. The resulting annotations formed the basis of a case study of which groups tend to be targeted by particular types of hate speech or by particular styles of hate speech language.
