Fernando Ribeiro Trindade
2026
LexIris-pt and LexBert-pt: Specialized Sentence Embeddings for Legal Similarity in Brazilian Portuguese
Willgnner Ferreira Santos | João Gabriel Grandotto Viana | Antônio Pires de Castro Júnior | Fernando Ribeiro Trindade | Nádia Félix Felipe da Silva
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
Willgnner Ferreira Santos | João Gabriel Grandotto Viana | Antônio Pires de Castro Júnior | Fernando Ribeiro Trindade | Nádia Félix Felipe da Silva
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
This work presents and evaluates two specialized sentence embedding models for the Portuguese legal domain, LexIris-pt and LexBert-pt, obtained through supervised fine-tuning of BERT-based models using pairs of initial petitions. We propose a comparative evaluation protocol along three fronts: (i) zero-shot inference with pretrained embeddings, (ii) supervised fine-tuning on these pairs, and (iii) vector retrieval with incremental clustering over a corpus of 20,000 initial petitions. The results show that fine-tuning consistently increases correlations with reference scores and improves performance in vector retrieval; additionally, the vector retrieval stage indicates that the metric configured in the index (cosine similarity or inner product) can change the granularity of the partitioning under a fixed threshold, reinforcing the need for joint calibration among the encoder, metric and threshold. After auditing by specialists from the partner institution, LexIris-pt and LexBert-pt were operationally adopted to support the screening and organization of repetitive claims and predatory litigation.