Pedro Campos


2026

Extracting structured information from lengthy documents using Large Language Models (LLMs) is computationally expensive and prone to accuracy degradation as input size grows. We present a two-stage pipeline for extracting products from Brazilian tender documents (editais de licitação), combining NLP-based page classification with LLM-based extraction. We construct a novel dataset of 11,190 annotated pages from 350 documents across five product domains. Our experiments compare transformer-based classifiers (BERTimbau, DistilBERT) with classical machine learning approaches using engineered features. Results show that XGBoost with domain-specific features achieves a 97.75% F1-score, outperforming fine-tuned BERT models by more than 4 percentage points. The complete pipeline reduces LLM input tokens by 64-88% while maintaining extraction completeness, enabling cost-effective document processing at scale.
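The two-stage design summarized above can be illustrated with a minimal sketch. All names here are hypothetical: the real first stage is an XGBoost classifier over engineered features, which is stood in for by a trivial keyword heuristic, and the LLM call is replaced by a dummy function.

```python
# Hypothetical sketch of the two-stage pipeline: a page classifier
# filters relevant pages before LLM extraction, so only a fraction
# of the document's tokens ever reaches the expensive model.

def classify_page(text: str) -> bool:
    # Stand-in for the XGBoost classifier with engineered features;
    # a keyword heuristic is used here purely for illustration.
    keywords = ("item", "produto", "quantidade", "especifica")
    return any(k in text.lower() for k in keywords)

def two_stage_pipeline(pages, extract_with_llm):
    # Stage 1: keep only pages predicted to contain product listings.
    relevant = [p for p in pages if classify_page(p)]
    # Stage 2: send only the retained pages to the LLM extractor.
    extraction = extract_with_llm("\n".join(relevant))
    return extraction, len(relevant), len(pages)

# Example with dummy pages and a dummy "LLM" (identity function):
pages = [
    "Edital de licitacao - capa e preambulo",
    "Item 1: Produto X, quantidade 100, especificacoes tecnicas",
    "Anexo juridico com minutas contratuais",
]
result, kept, total = two_stage_pipeline(pages, lambda text: text)
print(kept, total)  # only the product page reaches the LLM
```

Under this sketch, one of three pages is forwarded to the extractor; the reported 64-88% token reduction corresponds to the same filtering effect at document scale.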