Murilo Boccardo
2026
Automatic Question Classification in Portuguese: A Large-Scale Dataset and Comparative Evaluation of Classification Strategies
Murilo Boccardo | Valéria D. Feltrim
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
This paper presents a comparative evaluation of automatic classification strategies for Brazilian university entrance exam questions by subject and fine-grained topic. A central contribution of this study is the creation and curation of a large-scale Portuguese-language dataset comprising approximately 17,000 questions collected from the Agatha.edu platform, carefully cleaned and normalized. We investigate two alternative classification strategies: a single-step approach that directly predicts fine-grained topics, and a two-stage approach in which an initial model predicts the subject and a specialized classifier for that subject then predicts the topic. Both strategies are evaluated using classical machine learning methods, such as Support Vector Machines, Naive Bayes, and Random Forest, as well as transformer-based language models pre-trained for Portuguese. Experimental results show the feasibility of large-scale automatic question classification and highlight the potential of NLP-based classification strategies to support the curation, analysis, and organization of educational question banks.
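The difference between the single-step and two-stage strategies can be illustrated with a minimal sketch. This is not the authors' implementation: the `TinyNaiveBayes` class, the helper functions, and the four toy questions are hypothetical, and a hand-written Naive Bayes stands in for the classical and transformer-based models evaluated in the paper.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Minimal multinomial Naive Bayes over whitespace tokens (illustrative only)."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)  # Laplace smoothing
            score = math.log(n_docs / total_docs)
            score += sum(math.log((counts[t] + 1) / denom) for t in tokens)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def train_single_step(texts, topics):
    # Single-step: one model predicts the fine-grained topic directly.
    return TinyNaiveBayes().fit(texts, topics)


def train_two_stage(texts, subjects, topics):
    # Two-stage: one subject model, plus one topic model per subject.
    subject_model = TinyNaiveBayes().fit(texts, subjects)
    by_subject = defaultdict(lambda: ([], []))
    for text, subject, topic in zip(texts, subjects, topics):
        by_subject[subject][0].append(text)
        by_subject[subject][1].append(topic)
    topic_models = {s: TinyNaiveBayes().fit(x, y) for s, (x, y) in by_subject.items()}
    return subject_model, topic_models


def predict_two_stage(models, text):
    subject_model, topic_models = models
    subject = subject_model.predict(text)  # stage 1: subject
    return subject, topic_models[subject].predict(text)  # stage 2: topic


# Hypothetical toy data, not taken from the paper's dataset.
TEXTS = [
    "calcule a derivada da funcao",
    "resolva a equacao do segundo grau",
    "explique a fotossintese nas plantas",
    "descreva a divisao celular por mitose",
]
SUBJECTS = ["matematica", "matematica", "biologia", "biologia"]
TOPICS = ["calculo", "algebra", "botanica", "citologia"]

single_step = train_single_step(TEXTS, TOPICS)
two_stage = train_two_stage(TEXTS, SUBJECTS, TOPICS)

print(single_step.predict("resolva a equacao"))
print(predict_two_stage(two_stage, "resolva a equacao"))
```

In the two-stage variant each topic model only has to discriminate among the topics of one subject, but an error in the subject prediction cannot be recovered in the second stage. The single-step variant avoids that error propagation at the cost of a much larger label space.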