Aman Shakya
2025
Structured Information Extraction from Nepali Scanned Documents using Layout Transformer and LLMs
Aayush Neupane | Aayush Lamichhane | Ankit Paudel | Aman Shakya
Proceedings of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2025)
Despite growing global interest in information extraction from scanned documents, there is still a significant research gap concerning Nepali documents. This study seeks to address this gap by focusing on methods for extracting information from texts in Nepali typeface or Devanagari characters. The primary focus is on the performance of the Language Independent Layout Transformer (LiLT), which was employed as a token classifier to extract information from Nepali texts. LiLT achieved an F1 score of approximately 0.87. Complementing this approach, large language models (LLMs), including OpenAI’s proprietary GPT-4o and the open-source Llama 3.1 8B, were also evaluated. The GPT-4o model exhibited promising performance, with an accuracy of around 55-80% for a complete match, varying across fields, while the Llama 3.1 8B model achieved only 20-40% accuracy. At a 90% match threshold, both GPT-4o and Llama 3.1 8B achieved higher accuracy, with gains differing by field. Llama 3.1 8B performed particularly poorly compared to the LiLT model. These results aim to provide a foundation for future work on the digitization of Nepali documents.
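To make the token-classification setup described in the abstract concrete, here is a minimal sketch of running LiLT for token classification with the Hugging Face transformers library. The checkpoint name, BIO label set, words, and bounding boxes below are illustrative assumptions, not the paper's actual configuration; a fine-tuned multilingual text backbone would be needed for real Nepali documents.

```python
# Minimal sketch: LiLT token classification over OCR words and boxes.
# Checkpoint, labels, and inputs are hypothetical placeholders.
import torch
from transformers import AutoTokenizer, LiltForTokenClassification

labels = ["O", "B-NAME", "I-NAME", "B-DATE", "I-DATE"]  # illustrative BIO fields

checkpoint = "SCUT-DLVCLab/lilt-roberta-en-base"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# The classification head is freshly initialized here; it must be fine-tuned
# on labeled Nepali documents before predictions are meaningful.
model = LiltForTokenClassification.from_pretrained(checkpoint, num_labels=len(labels))

# OCR output: words plus bounding boxes normalized to a 0-1000 grid.
words = ["राम", "बहादुर", "२०८१-०१-१५"]
boxes = [[100, 50, 220, 80], [230, 50, 360, 80], [100, 120, 320, 150]]

encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")
# Repeat each word's box for every sub-token; special tokens get a zero box.
bbox = [[0, 0, 0, 0] if i is None else boxes[i] for i in encoding.word_ids()]
encoding["bbox"] = torch.tensor([bbox])

with torch.no_grad():
    logits = model(**encoding).logits          # (1, seq_len, num_labels)
predicted_ids = logits.argmax(-1).squeeze().tolist()
print([labels[i] for i in predicted_ids])
```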
2022
COVID-19-related Nepali Tweets Classification in a Low Resource Setting
Rabin Adhikari | Safal Thapaliya | Nirajan Basnet | Samip Poudel | Aman Shakya | Bishesh Khanal
Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task
Billions of people across the globe have been using social media platforms in their local languages to voice their opinions about various topics related to the COVID-19 pandemic. Several organizations, including the World Health Organization, have developed automated social media analysis tools that classify COVID-19-related tweets into various topics. However, these tools that help combat the pandemic are limited to very few languages, leaving several countries unable to benefit from them. While multi-lingual or low-resource language-specific tools are being developed, there is still a need to expand their coverage, such as for the Nepali language. In this paper, we identify the eight most common COVID-19 discussion topics among the Twitter community using the Nepali language, set up an online platform to automatically gather Nepali tweets containing COVID-19-related keywords, classify the tweets into the eight topics, and visualize the results over time in a web-based dashboard. We compare the performance of two state-of-the-art multi-lingual language models for Nepali tweet classification, one generic (mBERT) and the other specific to the Nepali language family (MuRIL). Our results show that the models’ relative performance depends on the data size, with MuRIL doing better for a larger dataset. The annotated data, models, and the web-based dashboard are open-sourced at https://github.com/naamiinepal/covid-tweet-classification.
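As a rough illustration of the MuRIL-based tweet classification compared in the abstract, the sketch below loads MuRIL with a sequence-classification head via Hugging Face transformers. The topic names and single-label setup are assumptions for illustration; the paper's released models and data are at the repository linked above.

```python
# Minimal sketch: Nepali tweet topic classification with MuRIL.
# Topic names are placeholders; the head must be fine-tuned on labeled tweets.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

topics = ["vaccine", "lockdown", "symptoms", "economy",
          "education", "politics", "health system", "other"]  # illustrative only

checkpoint = "google/muril-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=len(topics)
)

tweet = "नेपालमा कोभिड-१९ खोप अभियान सुरु भयो"
inputs = tokenizer(tweet, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits             # (1, num_labels)
print(topics[logits.argmax(-1).item()])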