Denis Deratani Mauá


2026

Trait-specific automated scoring of essays written for the standardized Brazilian National Entrance Exam (ENEM) has received significant attention in recent years. The task is important both in a classroom setting, to provide timely and personalized learning feedback, and in the official exam, to make the scoring process more scalable and consistent. State-of-the-art systems approach the task as a purely statistical prediction problem, ignoring the knowledge provided to human graders and test takers in the form of rubrics and guidelines. Aiming to produce more interpretable and informative formative feedback, in this work we leverage the official ENEM Grader’s handbook and develop two neuro-symbolic approaches to trait-specific essay scoring. The first approach uses a Large Language Model (GPT-4o) to write an evaluative explanation of the essay score according to the subcriteria described in the guidelines; the explanation is then fed into a statistical model to predict the score, and the good predictive performance validates the quality of the explanations. The second approach formalizes the handbook’s grading rubrics as logical rules that derive the essay score as a function of subcriteria, mimicking the recommended human grader’s scoring procedure. To provide weak supervision in training and to evaluate the quality of the model, we build a dataset of 63 essays annotated with their subcriteria by two expert human graders. Our empirical results suggest that both approaches perform on par with purely statistical methods while providing more helpful and fine-grained feedback.
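To make the second approach concrete, here is a minimal Python sketch of a rubric-as-rules scorer for a single trait. The subcriterion names and rule thresholds are illustrative assumptions for exposition only; the paper’s actual rules are derived from the official Grader’s handbook, which is not reproduced here.

```python
# Illustrative sketch of deriving one ENEM trait score from annotated
# subcriteria via rubric-style rules. Subcriterion names and thresholds
# below are hypothetical; ENEM traits are scored 0-200 in steps of 40
# (six ordinal levels).

from dataclasses import dataclass


@dataclass
class Subcriteria:
    has_thesis: bool          # essay states a clear point of view
    n_supported_args: int     # arguments backed by facts/examples
    n_unsupported_args: int   # arguments asserted without support


def trait_score(s: Subcriteria) -> int:
    """Walk down the rubric from the highest level to the lowest,
    as a human grader is instructed to do."""
    if not s.has_thesis:
        return 40 if (s.n_supported_args + s.n_unsupported_args) > 0 else 0
    if s.n_supported_args >= 3:
        return 200
    if s.n_supported_args == 2:
        return 160
    if s.n_supported_args == 1:
        return 120
    return 80  # thesis present but only unsupported arguments


print(trait_score(Subcriteria(True, 2, 1)))  # -> 160
```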
Brazil’s ENEM, a high-stakes assessment determining university admission for millions of students annually, creates an immense evaluation burden where human raters process hundreds of essays daily. Automated Essay Scoring (AES) offers a potential solution, yet Portuguese-language systems remain understudied due to fragmented datasets and the complexity of ENEM’s multi-trait rubric. This work investigated cross-prompt, trait-specific essay scoring using a corpus of 385 essays across 38 prompts, where models evaluated essays on unseen prompts across five traits scored on a six-point ordinal scale. We compared three model classes: feature-based methods (72 features), encoder-only transformers (109M–1.5B parameters), and decoder architectures (2.4B–671B parameters) with fine-tuned and zero-shot configurations. Experiments under varying information access and rubric conditioning revealed that no single approach serves all evaluation needs: encoder models excel at mechanical traits (fluency, cohesion) despite context limitations; decoder models achieve superior performance on argumentation (QWK 0.73) and writing style (QWK 0.60) when provided full context; and language-specific pretraining benefits only surface-level features without improving complex reasoning. Best-performing models achieved QWK scores of 0.60–0.73. Gaps to oracle bounds ranged from 0.15 (argumentation) to 0.29 (writing style), with the largest disparities in writing style and persuasiveness.
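For reference, Quadratic Weighted Kappa (QWK), the agreement metric reported above, is chance-corrected agreement with quadratic disagreement penalties: kappa = 1 - sum(w_ij * O_ij) / sum(w_ij * E_ij), with weights w_ij = (i - j)^2 / (N - 1)^2, where O is the observed rating matrix and E the expected matrix under rater independence. It can be computed with scikit-learn; the ratings below are made-up toy data.

```python
# Compute QWK between two raters on the six-point ordinal scale.
from sklearn.metrics import cohen_kappa_score

human = [0, 1, 3, 5, 2, 4, 4, 1]   # human grades (toy data)
model = [0, 2, 3, 4, 2, 4, 5, 1]   # model predictions (toy data)

qwk = cohen_kappa_score(human, model, weights="quadratic")
print(f"QWK = {qwk:.2f}")
```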
This work presents a study of automated reformulation of argumentative essays written by college-bound native speakers of Brazilian Portuguese as a form of pedagogical feedback. We first evaluate the feasibility of using large language models (LLMs) to score argument quality with respect to three criteria: the defense of a point of view, organization, and development. We then employ an LLM to provide a reformulated version of the essay as feedback. As we discuss, the main challenge is to constrain the automated feedback to address only argument quality, rather than improving other aspects such as spelling or cohesion, and to modify the essay as little as possible. We achieve levels of agreement in automatic essay scoring comparable to human inter-rater agreement, while increasing explainability. Instructing the LLM to add argument support (facts, examples, etc.) proved the most effective way to elicit non-superficial changes to the arguments, and the model was able to add accurate examples and facts to the essays even without being given background information on the topic.
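A hypothetical sketch of how such a constrained reformulation request might be issued follows; the prompt wording and the use of the OpenAI SDK are our own illustration, not the paper’s actual setup.

```python
# Sketch of a reformulation call that constrains edits to argument
# quality only. The system prompt below is illustrative; the paper's
# real instructions and model configuration are not reproduced here.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM = (
    "You revise argumentative essays written in Brazilian Portuguese. "
    "Strengthen only the defense of the point of view by adding factual "
    "support (facts, examples, data) to weak arguments. Do NOT fix "
    "spelling, grammar, or cohesion, and change as little text as possible."
)


def reformulate(essay: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": essay},
        ],
    )
    return resp.choices[0].message.content
```

Pinning temperature to 0 keeps the rewriting deterministic, which makes it easier to audit how far the output drifts from the original essay.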
Automated Essay Scoring systems can relieve teachers of the laborious task of essay grading and allow students to practice more frequently thanks to faster feedback cycles. In Brazilian Portuguese, there is growing interest in automatic scoring systems for the standardized ENEM exam. However, the only available datasets consist of essays written as practice for the official exam; to the best of our knowledge, no prior work evaluates official ENEM essays using mock-exam datasets. This work fills that gap by presenting a new labeled dataset composed of 157 essays written for the official ENEM exam. The analysis shows that this dataset shares characteristics with existing datasets of mock-exam essays. The results also indicate that, for small datasets such as this one, the use of LLMs pretrained on mock exams significantly improves the performance of automatic scorers for official ENEM essays, yielding an average gain of 0.27 points in the Quadratic Weighted Kappa metric compared to training solely on official data.
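The reported gain suggests a simple transfer recipe: start from a scorer already fine-tuned on mock-exam essays and continue training on the small official-exam set. A minimal sketch follows; the checkpoint name and training examples are placeholders, not real artifacts from the paper.

```python
# Continue training a mock-exam-pretrained scorer on official essays.
# Assumes the placeholder checkpoint already has a 6-way classification
# head (one class per 40-point score level).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CKPT = "path/to/mock-exam-pretrained-scorer"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForSequenceClassification.from_pretrained(CKPT, num_labels=6)

# Placeholder official-exam examples: (essay text, score level 0-5).
official = [("texto do ensaio ...", 3), ("outro ensaio ...", 5)]

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):
    for essay, level in official:
        batch = tok(essay, truncation=True, padding=True, return_tensors="pt")
        out = model(**batch, labels=torch.tensor([level]))
        out.loss.backward()      # cross-entropy over the 6 score levels
        optimizer.step()
        optimizer.zero_grad()
```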

2024