Page Stream Segmentation with LLMs: Challenges and Applications in Insurance Document Automation

Hunter Heidenreich, Ratish Dalvi, Nikhil Verma, Yosheb Getachew


Abstract
Page Stream Segmentation (PSS) is critical for automating document processing in industries like insurance, where unstructured document collections are common. This paper explores the use of large language models (LLMs) for PSS, applying parameter-efficient fine-tuning to real-world insurance data. Our experiments show that LLMs outperform baseline models in page- and stream-level segmentation accuracy. However, stream-level calibration remains challenging, especially for high-stakes applications. We evaluate post-hoc calibration and Monte Carlo dropout, finding limited improvement. Future work will integrate active learning to enhance model calibration and support deployment in practical settings.
Anthology ID:
2025.coling-industry.26
Volume:
Proceedings of the 31st International Conference on Computational Linguistics: Industry Track
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert, Kareem Darwish, Apoorv Agarwal
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
305–317
URL:
https://aclanthology.org/2025.coling-industry.26/
Cite (ACL):
Hunter Heidenreich, Ratish Dalvi, Nikhil Verma, and Yosheb Getachew. 2025. Page Stream Segmentation with LLMs: Challenges and Applications in Insurance Document Automation. In Proceedings of the 31st International Conference on Computational Linguistics: Industry Track, pages 305–317, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Page Stream Segmentation with LLMs: Challenges and Applications in Insurance Document Automation (Heidenreich et al., COLING 2025)
PDF:
https://aclanthology.org/2025.coling-industry.26.pdf