2025
A Comparative Study of Vision Transformers and Multimodal Language Models for Violence Detection in Videos
Tomas Ditchfield-Ogle, Ruslan Mitkov
Proceedings of the First Workshop on Comparative Performance Evaluation: From Rules to Language Models
This project compares methods for detecting violent videos, a task crucial for ensuring real-time safety in surveillance and digital moderation. It evaluates four approaches: a random forest classifier, a transformer model, and two multimodal vision-language models. The process involves preprocessing datasets, training models, and assessing accuracy, interpretability, scalability, and real-time suitability. Results show that traditional methods are simple but less effective. The transformer model achieved high accuracy, and the multimodal models offered high violence recall with descriptive justifications. The study highlights trade-offs and provides practical insights for the deployment of automated violence detection.