Mohammad Javad Pirhadi
2025
CVT5: Using Compressed Video Encoder and UMT5 for Dense Video Captioning
Mohammad Javad Pirhadi | Motahhare Mirzaei | Sauleh Eetemadi
Proceedings of the First Workshop of Evaluation of Multi-Modal Generation
The dense video captioning task aims to detect all events occurring in a video and describe each event in natural language. Unlike most other video processing tasks, which typically assume that a video contains only a single main event, this task deals with long, untrimmed videos. Consequently, video processing speed is a critical aspect of a dense video captioning system. To the best of our knowledge, all published work on this task uses RGB frames to encode input videos. In this work, we introduce the use of compressed videos for the first time in this task. Our experiments on the SoccerNet challenge demonstrate significant improvements in both processing speed and GPU memory footprint while achieving competitive results. Additionally, we leverage multilingual transcripts, which appear to be effective. The encoder in our proposed method achieves approximately 5.4× higher speed and 5.1× lower GPU memory usage during training, and 4.7× higher speed and 7.8× lower GPU memory usage during inference, compared to its RGB-based counterpart. The code is publicly available at https://github.com/mohammadjavadpirhadi/CVT5.
2023
PMCoders at SemEval-2023 Task 1: RAltCLIP: Use Relative AltCLIP Features to Rank
Mohammad Javad Pirhadi | Motahhare Mirzaei | Mohammad Reza Mohammadi | Sauleh Eetemadi
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)
The Visual Word Sense Disambiguation (VWSD) task aims to find, among 10 candidate images, the one most related to an ambiguous word given limited textual context. In this work, we use AltCLIP features and a standard 3-layer transformer encoder to compare the cosine similarity between the given phrase and each candidate image. We also improve our model’s generalization by using a subset of LAION-5B. The best official baseline achieves a macro-averaged hit rate of 37.20% and an MRR (Mean Reciprocal Rank) of 54.39%. Our best configuration reaches a macro-averaged hit rate of 39.61% and an MRR of 56.78%. The code will be made publicly available on GitHub.
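The ranking and scoring described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`cosine`, `rank_of_gold`, `hit_rate_and_mrr`) and the toy vectors are assumptions made for illustration, plain Python lists stand in for AltCLIP features, and the way scores are macro-averaged across groups is not specified in the abstract and is therefore not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_of_gold(text_vec, image_vecs, gold_index):
    """1-based rank of the gold image when candidates are sorted
    by decreasing cosine similarity to the text vector."""
    sims = [cosine(text_vec, img) for img in image_vecs]
    order = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    return order.index(gold_index) + 1


def hit_rate_and_mrr(ranks):
    """Hit rate (share of instances with the gold image ranked first)
    and mean reciprocal rank over a list of 1-based ranks."""
    hit_rate = sum(1 for r in ranks if r == 1) / len(ranks)
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    return hit_rate, mrr


# Hypothetical 2-dimensional features, for illustration only.
text_vec = [1.0, 0.0]
image_vecs = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]
ranks = [rank_of_gold(text_vec, image_vecs, gold) for gold in range(3)]
hit_rate, mrr = hit_rate_and_mrr(ranks)
```

In the actual task each instance has 10 candidate images, so the reciprocal rank of a single instance lies between 0.1 and 1.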
2022
Using Two Losses and Two Datasets Simultaneously to Improve TempoWiC Accuracy
Mohammad Javad Pirhadi | Motahhare Mirzaei | Sauleh Eetemadi
Proceedings of the First Workshop on Ever Evolving NLP (EvoNLP)
WSD (Word Sense Disambiguation) is the task of identifying which sense of a word is meant in a sentence or other segment of text. Researchers have worked on this task for years (e.g. Pustejovsky, 2002), but it remains challenging even for SOTA (state-of-the-art) LMs (language models). The new TempoWiC dataset, introduced by Loureiro et al. (2022b), focuses on the fact that words change over time. Their best baseline achieves 70.33% macro-F1. In this work, we use two different losses simultaneously. We also improve our model by using another similar dataset to generalize better. Our best configuration outperforms their best baseline by 4.23%.
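For readers unfamiliar with the macro-F1 metric reported here, the following is a minimal sketch of its standard definition: F1 is computed separately for each class and the per-class scores are averaged without weighting. The function name `macro_f1` and the toy labels are assumptions for illustration and are not taken from the paper or its evaluation scripts.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical binary labels, for illustration only.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
score = macro_f1(y_true, y_pred)
```

Because each class contributes equally to the average, macro-F1 penalizes a model that does well only on the majority class.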