Samir Kumar Borgohain
2024
An Aid to Assamese Language Processing by Constructing an Offline Assamese Handwritten Dataset
Debabrata Khargharia | Samir Kumar Borgohain
Proceedings of the 21st International Conference on Natural Language Processing (ICON)
Recent years have seen a growing interest in analyzing Indian handwritten documents. In pattern recognition, particularly handwritten document recognition, the availability of standard databases is essential for assessing algorithm efficacy and facilitating result comparisons among research groups. However, there is a notable scarcity of standardized databases for handwritten texts in Indian languages. This paper presents a comprehensive methodology for the development of a novel, unconstrained dataset named OAHTD (Offline Assamese Handwritten Text Dataset) for the Assamese language, derived from offline handwritten documents. The dataset, which represents a significant contribution to the field of Optical Character Recognition (OCR) for handwritten Assamese, is the first of its kind in this domain. The corpus comprises 410 document images, each containing a diverse array of linguistic elements including words, numerals, individual characters, and various symbols. These documents were collected from a demographically diverse cohort of 300 contributors, spanning an age range of 10 to 76 years and representing varied educational backgrounds and genders. This meticulously curated collection aims to provide a robust foundation for the development and evaluation of OCR algorithms specifically tailored to the Assamese script, addressing a critical gap in the existing literature and resources for this language.
2023
Dravidian Fake News Detection with Gradient Accumulation based Transformer Model
Eduri Raja | Badal Soni | Samir Kumar Borgohain | Candy Lalrempuii
Proceedings of the 20th International Conference on Natural Language Processing (ICON)
The proliferation of fake news poses a significant challenge in the digital era. Detecting false information, especially in non-English languages, is crucial to combating misinformation effectively. In this research, we introduce a novel approach for Dravidian fake news detection by harnessing the capabilities of the MuRIL transformer model, further enhanced by gradient accumulation techniques. Our study focuses on the Dravidian languages, a diverse group of languages spoken in South India, which are often underserved in natural language processing research. We optimize memory usage, stabilize training, and improve the model’s overall performance by accumulating gradients over multiple batches. The proposed model exhibits promising results in terms of both accuracy and efficiency. Our findings underline the significance of adapting state-of-the-art techniques, such as MuRIL-based models and gradient accumulation, to non-English languages to address the pressing issue of fake news.
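The idea of accumulating gradients over multiple batches can be illustrated with a small sketch. This is not the authors' implementation and does not involve MuRIL; it assumes a toy one-parameter least-squares model, and the data, learning rate, and function names (`grad_mse`, `accumulated_step`) are hypothetical. It shows that summing suitably scaled micro-batch gradients before a single update gives the same step as one large batch, which is why only one micro-batch needs to be held in memory at a time.

```python
# Minimal gradient-accumulation sketch on a toy model y ~ w * x.
# All names, data, and hyperparameters are hypothetical illustrations.

def grad_mse(w, xs, ys):
    """Gradient of the mean squared error (1/n) * sum((w*x - y)^2) w.r.t. w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_step(w, xs, ys, micro_batch_size, lr):
    """Take one optimizer step assembled from several micro-batches.

    Each micro-batch gradient is weighted by its share of the samples and
    added to a running total; the parameter is updated only once at the end.
    """
    total = len(xs)
    accumulated = 0.0
    for start in range(0, total, micro_batch_size):
        mx = xs[start:start + micro_batch_size]
        my = ys[start:start + micro_batch_size]
        accumulated += grad_mse(w, mx, my) * len(mx) / total
    return w - lr * accumulated


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated with w = 2

# One step on the full batch versus one step built from micro-batches of 2.
w_full = 0.0 - 0.01 * grad_mse(0.0, xs, ys)
w_accum = accumulated_step(0.0, xs, ys, micro_batch_size=2, lr=0.01)
print(w_full, w_accum)
```

In a deep learning framework the same pattern typically appears as several backward passes followed by a single optimizer step, with the loss divided by the number of accumulation steps.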