Adeel Zafar
2025
A Dataset for Programming-based Instructional Video Classification and Question Answering
Sana Javaid Raja | Adeel Zafar | Aqsa Shoaib
Proceedings of the First Workshop of Evaluation of Multi-Modal Generation
This work aims to develop an understanding of the rapidly emerging field of VideoQA, particularly in the context of instructional programming videos. It also encourages the design of systems that can produce visual answers to programming-based natural language questions. We introduce two datasets: CodeVidQA, with 2,104 question-answer pairs with timestamped video links taken from programming videos on Stack Overflow, for the Programming Visual Answer Localization task, and CodeVidCL, with 4,331 videos (1,751 programming, 2,580 non-programming), for the Programming Video Classification task. In addition, we propose a framework that adapts BigBird and SVM for video classification. The proposed approach achieves an accuracy of 99.61% for video classification.
From Courtroom to Corpora: Building a Name Entity Corpus for Urdu Legal Texts
Adeel Zafar | Sohail Ashraf | Slawomir Nowaczyk
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era
This study explores the effectiveness of transformer-based models for Named Entity Recognition (NER) in Urdu legal documents, a critical task in low-resource language processing. Given the specialized terminology and complex syntax of legal texts, accurate entity recognition in Urdu remains challenging. We developed a legal Urdu dataset of 117,500 documents, generated synthetically from 47 different types of legal documents, and evaluated three BERT-based models (XLMRoBERTa, mBERT, and DistilBERT) by analyzing their performance on an annotated Urdu legal dataset. mBERT achieved the highest accuracy (0.999) and the highest F1 score (0.975), outperforming XLMRoBERTa and DistilBERT and highlighting its robustness in recognizing entities within low-resource languages. To protect personal identifiers, all documents are anonymized. The dataset for this study is hosted on Hugging Face and will be made public after publication.
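The abstract reports both an accuracy and an F1 score for NER, and the two can diverge because most tokens in a tagged text carry no entity label. The sketch below illustrates that gap on toy data. It is not the authors' evaluation code: the abstract does not say how the metrics were computed, so the token-level, micro-averaged F1 used here is an assumption, and the tags `PER` and `ORG` and both function names are hypothetical.

```python
def token_accuracy(gold, pred):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)


def entity_token_f1(gold, pred, outside="O"):
    """Micro-averaged F1 over tokens, ignoring correct 'outside' predictions.

    Assumption: F1 is computed per token rather than per entity span.
    """
    true_pos = sum(g == p and g != outside for g, p in zip(gold, pred))
    pred_pos = sum(p != outside for p in pred)
    gold_pos = sum(g != outside for g in gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / pred_pos
    recall = true_pos / gold_pos
    return 2 * precision * recall / (precision + recall)


# Toy sequence (hypothetical tags): 16 non-entity tokens and 4 entity tokens,
# with one entity token missed by the prediction.
gold = ["O"] * 16 + ["PER", "PER", "ORG", "ORG"]
pred = ["O"] * 16 + ["PER", "PER", "ORG", "O"]

print(f"accuracy = {token_accuracy(gold, pred):.3f}")
print(f"F1       = {entity_token_f1(gold, pred):.3f}")
```

In this toy case a single missed entity token costs little accuracy, since the many correct `O` tokens dominate, but it lowers F1 more noticeably because F1 only counts entity tokens.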