A Comparative Study of Vision Transformers and Multimodal Language Models for Violence Detection in Videos

Tomas Ditchfield-Ogle; Ruslan Mitkov

Abstract

This project compares methods for detecting violent videos, which are crucial for ensuring real-time safety in surveillance and digital moderation. It evaluates four approaches: a random forest classifier, a transformer model, and two multimodal vision-language models. The process involves preprocessing datasets, training models, and assessing accuracy, interpretability, scalability, and real-time suitability. Results show that traditional methods are simple but less effective. The transformer model achieved high accuracy, and the multimodal models offered high violence recall with descriptive justifications. The study highlights trade-offs and provides practical insights for the deployment of automated violence detection.

Anthology ID: 2025.r2lm-1.2
Volume: Proceedings of the First Workshop on Comparative Performance Evaluation: From Rules to Language Models
Month: September
Year: 2025
Address: Varna, Bulgaria
Editors: Alicia Picazo-Izquierdo, Ernesto Luis Estevanell-Valladares, Ruslan Mitkov, Rafael Muñoz Guillena, Raúl García Cerdá
Venues: R2LM | WS
Publisher: INCOMA Ltd., Shoumen, Bulgaria
Pages: 10–20
URL: https://aclanthology.org/2025.r2lm-1.2/
Cite (ACL): Tomas Ditchfield-Ogle and Ruslan Mitkov. 2025. A Comparative Study of Vision Transformers and Multimodal Language Models for Violence Detection in Videos. In Proceedings of the First Workshop on Comparative Performance Evaluation: From Rules to Language Models, pages 10–20, Varna, Bulgaria. INCOMA Ltd., Shoumen, Bulgaria.
Cite (Informal): A Comparative Study of Vision Transformers and Multimodal Language Models for Violence Detection in Videos (Ditchfield-Ogle & Mitkov, R2LM 2025)
PDF: https://aclanthology.org/2025.r2lm-1.2.pdf
