Abduljalil Radman
2025
CAMEL-Bench: A Comprehensive Arabic LMM Benchmark
Sara Ghaboura | Ahmed Heakl | Omkar Thawakar | Ali Husain Salem Abdulla Alharthi | Ines Riahi | Abduljalil Radman | Jorma Laaksonen | Fahad Shahbaz Khan | Salman Khan | Rao Muhammad Anwer
Findings of the Association for Computational Linguistics: NAACL 2025
Recent years have witnessed significant interest in developing large multi-modal models (LMMs) capable of performing various visual reasoning and understanding tasks, which has led to the introduction of multiple LMM benchmarks. However, most existing LMM evaluation benchmarks are predominantly English-centric. In this work, we develop a comprehensive LMM evaluation benchmark for the Arabic language, representing a population of over 400 million speakers. The proposed benchmark, named CAMEL-Bench, comprises eight diverse domains and 38 sub-domains, including multi-image understanding, complex visual perception, handwritten document understanding, video understanding, medical imaging, plant diseases, and remote sensing-based land use understanding, to evaluate broad scenario generalizability. CAMEL-Bench contains 29,036 questions filtered from a larger pool of samples, with quality manually verified by native speakers to ensure reliable model assessment. We evaluate both closed-source LMMs, including the GPT-4 series, and open-source LMMs. Our analysis reveals the need for substantial improvement, especially among the best open-source models, with even the closed-source GPT-4o achieving an overall score of only 62%. Our benchmark will be publicly released.
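A minimal sketch of how a benchmark like CAMEL-Bench might be scored: a loop computing per-domain and overall accuracy across the eight domains. The Sample format, the model callable, and the exact-match criterion are illustrative assumptions, not the released evaluation harness.

# Hypothetical sketch of a CAMEL-Bench-style evaluation loop.
# The domain names follow the abstract; the data format and the
# model API are assumptions, not the released harness.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Sample:
    domain: str        # e.g. "medical imaging", "remote sensing"
    question: str      # Arabic question text
    answer: str        # gold answer
    image_path: str    # path to the associated image

def evaluate(model, samples):
    """Return per-domain accuracy and overall accuracy for an LMM."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for s in samples:
        pred = model(s.image_path, s.question)  # assumed LMM callable
        total[s.domain] += 1
        if pred.strip() == s.answer.strip():    # exact match, for brevity
            correct[s.domain] += 1
    per_domain = {d: correct[d] / total[d] for d in total}
    overall = sum(correct.values()) / sum(total.values())
    return per_domain, overall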
Learning to Describe Implicit Changes: Noise-robust Pre-training for Image Difference Captioning
Zixin Guo | Jiayang Sun | Tzu-Jui Julius Wang | Abduljalil Radman | Selen Pehlivan | Min Cao | Jorma Laaksonen
Findings of the Association for Computational Linguistics: EMNLP 2025
Image Difference Captioning (IDC) methods have advanced in highlighting subtle differences between similar images, but their performance is often constrained by limited training data. Using Large Multimodal Models (LMMs) to describe changes in image pairs mitigates data limitations but introduces noise: the generated change descriptions are often coarse summaries that obscure fine details and hinder noise detection. In this work, we improve IDC with a noise-robust approach at both the data and model levels. At the data level, we use LMMs with structured prompts to generate fine-grained change descriptions during curation. At the model level, we propose a Noise-Aware Modeling and Captioning (NAMC) model with three modules: Noise Identification and Masking (NIM) to reduce noisy correspondences, Masked Image Reconstruction (MIR) to correct over-masking errors, and Fine-grained Description Generation (FDG) to produce coherent change descriptions. Experiments on four IDC benchmarks show that NAMC, pre-trained on our large-scale data, outperforms streamlined architectures and achieves performance competitive with LLM-finetuned methods while offering better inference efficiency.
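A minimal, illustrative sketch of the Noise Identification and Masking idea described above: score each description token against an image-difference embedding and mask weakly grounded tokens as likely noise. The embeddings, threshold, and mask token here are assumptions for illustration; the paper's NIM module operates inside the model rather than on raw token strings.

# Illustrative sketch of NIM-style noise masking. All inputs are
# stand-ins: real systems would use learned visual and text features.
import numpy as np

def nim_mask(token_embs: np.ndarray, diff_emb: np.ndarray,
             tokens: list[str], threshold: float = 0.2,
             mask_token: str = "[MASK]") -> list[str]:
    """Replace tokens weakly grounded in the image-difference features."""
    # Cosine similarity of each token embedding to the difference embedding.
    token_norms = token_embs / np.linalg.norm(token_embs, axis=1, keepdims=True)
    diff_norm = diff_emb / np.linalg.norm(diff_emb)
    sims = token_norms @ diff_norm
    # Tokens below the similarity threshold are treated as noisy and masked.
    return [t if s >= threshold else mask_token
            for t, s in zip(tokens, sims)]

# Example with random embeddings standing in for real features.
rng = np.random.default_rng(0)
tokens = ["the", "red", "car", "moved", "left"]
print(nim_mask(rng.normal(size=(5, 16)), rng.normal(size=16), tokens))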