A Comparative Study of Vision Transformers and Multimodal Language Models for Violence Detection in Videos

Tomas Ditchfield-Ogle, Ruslan Mitkov


Abstract
This project compares methods for detecting violent videos, a capability crucial for ensuring real-time safety in surveillance and digital moderation. It evaluates four approaches: a random forest classifier, a transformer model, and two multimodal vision-language models. The process involves preprocessing datasets, training models, and assessing accuracy, interpretability, scalability, and real-time suitability. Results show that traditional methods are simple but less effective. The transformer model achieved high accuracy, and the multimodal models offered high violence recall with descriptive justifications. The study highlights trade-offs and provides practical insights for the deployment of automated violence detection.
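Frame-based classifiers such as the random forest baseline mentioned in the abstract ultimately have to turn per-frame predictions into a single video-level label. A minimal sketch of one common aggregation strategy (thresholded majority vote) is shown below; the threshold and label names are illustrative assumptions, as the abstract does not specify the paper's actual aggregation rule.

```python
from collections import Counter
from typing import Sequence


def video_label(frame_preds: Sequence[str], threshold: float = 0.5) -> str:
    """Aggregate per-frame labels into one video-level label.

    A video is flagged 'violent' when the fraction of frames predicted
    violent meets the threshold. This is an assumed strategy for
    illustration, not the paper's documented pipeline.
    """
    counts = Counter(frame_preds)
    violent_frac = counts.get("violent", 0) / max(len(frame_preds), 1)
    return "violent" if violent_frac >= threshold else "non-violent"


# Example: 6 of 8 sampled frames flagged violent -> video flagged violent
print(video_label(["violent"] * 6 + ["non-violent"] * 2))
```

The threshold trades precision against recall: a lower value favors the high violence recall the abstract attributes to the multimodal models, at the cost of more false positives.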
Anthology ID:
2025.r2lm-1.2
Volume:
Proceedings of the First Workshop on Comparative Performance Evaluation: From Rules to Language Models
Month:
September
Year:
2025
Address:
Varna, Bulgaria
Editors:
Alicia Picazo-Izquierdo, Ernesto Luis Estevanell-Valladares, Ruslan Mitkov, Rafael Muñoz Guillena, Raúl García Cerdá
Venues:
R2LM | WS
Publisher:
INCOMA Ltd., Shoumen, Bulgaria
Pages:
10–20
URL:
https://aclanthology.org/2025.r2lm-1.2/
Cite (ACL):
Tomas Ditchfield-Ogle and Ruslan Mitkov. 2025. A Comparative Study of Vision Transformers and Multimodal Language Models for Violence Detection in Videos. In Proceedings of the First Workshop on Comparative Performance Evaluation: From Rules to Language Models, pages 10–20, Varna, Bulgaria. INCOMA Ltd., Shoumen, Bulgaria.
Cite (Informal):
A Comparative Study of Vision Transformers and Multimodal Language Models for Violence Detection in Videos (Ditchfield-Ogle & Mitkov, R2LM 2025)
PDF:
https://aclanthology.org/2025.r2lm-1.2.pdf