Structured Information Extraction from Nepali Scanned Documents using Layout Transformer and LLMs

Aayush Neupane, Aayush Lamichhane, Ankit Paudel, Aman Shakya


Abstract
Despite growing global interest in information extraction from scanned documents, there is still a significant research gap concerning Nepali documents. This study seeks to address this gap by focusing on methods for extracting information from text written in Nepali typefaces or Devanagari characters. The primary focus is on the performance of the Language Independent Layout Transformer (LiLT), which was employed as a token classifier to extract information from Nepali texts. LiLT achieved an F1 score of approximately 0.87. Complementing this approach, large language models (LLMs), including OpenAI’s proprietary GPT-4o and the open-source Llama 3.1 8B, were also evaluated. GPT-4o showed promising performance, with complete-match accuracy of around 55-80%, varying across fields. The Llama 3.1 8B model achieved only 20-40% complete-match accuracy. Under a 90% match criterion, both GPT-4o and Llama 3.1 8B achieved higher accuracy, with the size of the improvement varying across fields. Llama 3.1 8B performed particularly poorly compared to the LiLT model. These results are intended to provide a foundation for future work on the digitization of Nepali documents.
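The abstract distinguishes between accuracy for a complete match and accuracy for a 90% match, but does not define the similarity measure used. The following is a minimal sketch of how such per-field accuracies could be computed, assuming a character-level similarity ratio from Python's `difflib`. The function names, the threshold handling, and the sample values are hypothetical illustrations, not the authors' evaluation code.

```python
from difflib import SequenceMatcher


def similarity(predicted: str, reference: str) -> float:
    """Character-level similarity in [0, 1] (assumed metric, not from the paper)."""
    return SequenceMatcher(None, predicted.strip(), reference.strip()).ratio()


def field_accuracy(pairs, threshold=1.0):
    """Fraction of (predicted, reference) pairs whose similarity meets the threshold.

    threshold=1.0 corresponds to a complete match; threshold=0.9 to a 90% match.
    """
    if not pairs:
        return 0.0
    hits = sum(1 for pred, ref in pairs if similarity(pred, ref) >= threshold)
    return hits / len(pairs)


# Hypothetical extracted values for a single field, paired with ground truth.
pairs = [
    ("काठमाडौं", "काठमाडौं"),                              # exact match
    ("Kathmandu Metropolitam", "Kathmandu Metropolitan"),  # one wrong character
    ("Lalitpur", "Bhaktapur"),                             # wrong value
]

print("complete match:", field_accuracy(pairs, threshold=1.0))
print("90% match:", field_accuracy(pairs, threshold=0.9))
```

Under this assumed metric, relaxing the threshold from 1.0 to 0.9 counts near-miss extractions (for example, a single misrecognized character) as correct, which is consistent with the higher 90% match accuracies reported for both LLMs.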
Anthology ID:
2025.chipsal-1.13
Volume:
Proceedings of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2025)
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Kengatharaiyer Sarveswaran, Ashwini Vaidya, Bal Krishna Bal, Sana Shams, Surendrabikram Thapa
Venues:
CHiPSAL | WS
Publisher:
International Committee on Computational Linguistics
Pages:
134–143
URL:
https://aclanthology.org/2025.chipsal-1.13/
Cite (ACL):
Aayush Neupane, Aayush Lamichhane, Ankit Paudel, and Aman Shakya. 2025. Structured Information Extraction from Nepali Scanned Documents using Layout Transformer and LLMs. In Proceedings of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2025), pages 134–143, Abu Dhabi, UAE. International Committee on Computational Linguistics.
Cite (Informal):
Structured Information Extraction from Nepali Scanned Documents using Layout Transformer and LLMs (Neupane et al., CHiPSAL 2025)
PDF:
https://aclanthology.org/2025.chipsal-1.13.pdf