mahaNLP: A Marathi Natural Language Processing Library

We present mahaNLP, an open-source natural language processing (NLP) library built specifically for the Marathi language. It aims to enhance NLP support for Marathi, a low-resource Indian language. It is an easy-to-use, extensible, and modular toolkit for Marathi text analysis built on state-of-the-art MahaBERT-based transformer models. Our work is significant because other existing Indic NLP libraries provide only basic Marathi processing support and rely on older models with restricted performance. Our toolkit stands out by offering a comprehensive array of NLP tasks, encompassing both fundamental preprocessing tasks and advanced NLP tasks like sentiment analysis, NER, hate speech detection, and sentence completion. This paper gives an overview of the mahaNLP framework, its features, and its usage. This work is part of the L3Cube MahaNLP initiative; more information can be found at https://github.com/l3cube-pune/MarathiNLP.


Introduction
Natural Language Processing (NLP) is a major subfield of artificial intelligence. It helps to resolve linguistic ambiguity and gives data a useful quantitative structure for numerous downstream tasks. While NLP has the potential to have a huge impact on the ML community, recent models have primarily focused on English and 6 other languages with a significantly high amount of resources (Joshi et al., 2020). There are around 7,000 languages spoken worldwide. Of these, 22 or more are Indian languages that are widely spoken not only in India but also throughout the world. Developing models that work for these languages is important for a variety of reasons, including bridging the existing language divide and promoting exploration and research for non-English languages in the growing NLP field.
Figure 1: A brief overview of the features in the mahaNLP library
The Marathi language is 11th in the list of popular languages across the globe. Although Marathi is widely spoken, Marathi-specific monolingual NLP resources are still limited in comparison to those for other natural languages (Joshi, 2022a). There are many popular open-source tools, such as spaCy, Stanza, iNLTK, and IndicNLP, that provide features for building effective multilingual NLP systems. Although these libraries offer multilingual functionality that covers Marathi, they are not exhaustive in the functionality they support and have their own limitations and discrepancies, such as the use of older architectures to implement multilingual models (Ruder et al., 2021). Basic features like Marathi sentiment analysis, named entity recognition, and hate speech detection are missing in almost all the current multilingual libraries.
Figure 2: A comparison of a few existing libraries and mahaNLP specific to the Marathi language
The L3Cube-MahaNLP (Joshi, 2022b) initiative is an umbrella for various tools, datasets, and models that greatly assist in Marathi language processing. With this work, we aim to provide a broader set of functionalities to developers. In this paper, we propose the mahaNLP library, a Python-based NLP toolkit focused on the Indian language Marathi and built largely on top of Hugging Face transformer models. Along with basic language processing features, the open-source toolkit wraps state-of-the-art monolingual Marathi transformer models for text analysis. It encompasses features such as sentiment and hate speech analysis, named entity recognition, and a variety of other Marathi language processing features, as shown in Figure 1. Thus, the mahaNLP library aims to make Marathi NLP more accessible. The demonstration video and example Colab notebook are shared publicly.

Related Work
The NLP and machine learning communities have a long history of developing open-source tools and libraries. There are numerous user-friendly, all-purpose NLP libraries available. NLTK (Bird, 2006), Stanford CoreNLP (Manning et al., 2014), spaCy (Honnibal and Montani, 2017), AllenNLP (Gardner et al., 2018), Flair (Akbik et al., 2019), Stanza (Qi et al., 2020), Hugging Face Transformers (Wolf et al., 2019), and iNLTK (Arora, 2020) are some of the libraries that are primarily concerned with NLP tasks. These libraries offer the ML community NLP tasks such as tokenization, sentence encoding, text normalization, translation, and so on. However, while some of these libraries do support Marathi, they cover a very limited range of NLP tasks and pre-trained language models. Figure 2 illustrates a provisional comparison between mahaNLP and the existing libraries for the Marathi language. The iNLTK library is not actively supported, and it is based on very limited datasets and LSTM-based ULMFiT models. Marathi models for features like sentiment and NER are unavailable in spaCy. mahaNLP attempts to address these issues by offering a broader set of pre-trained language models, including BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), and ALBERT (Lan et al., 2019), as well as a broader set of NLP tasks such as sentiment analysis, hate speech analysis, NER tagging, and MLM-based modules. The NER model supported by Stanza is based on our L3Cube-MahaNER corpus (Litake et al., 2022).

System Design and Architecture
Ease of access is an important characteristic of mahaNLP, and the system is designed from the user's perspective. Two user perspectives are defined: a Standard Flow and a Model Flow.

Standard Flow
The Standard Flow can be used by a programmer with minimal knowledge of the machine learning domain. In this flow, the complex model arguments are hidden from users, who can use a feature without knowing which models run in the background. This flow has constructs similar to standard NLP libraries. Its intuitive nature makes mahaNLP more user-friendly and easily accessible. The pre-processing, tokenizer, datasets, and ML-based modules are part of this user flow.

Pre-process Module
An initial step in any NLP task is the preprocessing of data. Transformer models in NLP perform much better on cleaned data than on a raw, unprocessed corpus. The preprocess module provides functions for cleaning Marathi textual data, such as the removal of URLs, stopwords, and non-Devanagari words.
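The three cleaning steps named above can be illustrated with a short standard-library sketch. This is not the mahaNLP implementation: the function names, the regular expressions, and the five-word stopword list are assumptions made purely for illustration.

```python
import re

# Assumption: a URL is anything starting with http(s):// or www. up to whitespace.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")

# Runs of Devanagari characters (U+0900-U+097F), excluding the danda and
# double danda punctuation marks (U+0964, U+0965).
DEVANAGARI_RUN = re.compile(r"[\u0900-\u0963\u0966-\u097F]+")

# Tiny illustrative stopword list (assumption); a real list is far longer.
STOPWORDS = {"आणि", "आहे", "हे", "या", "व"}


def remove_urls(text: str) -> str:
    """Replace every URL with a space."""
    return URL_PATTERN.sub(" ", text)


def keep_devanagari_words(text: str) -> str:
    """Keep only runs of Devanagari characters, joined by single spaces."""
    return " ".join(DEVANAGARI_RUN.findall(text))


def remove_stopwords(text: str) -> str:
    """Drop whitespace-separated tokens that appear in STOPWORDS."""
    return " ".join(tok for tok in text.split() if tok not in STOPWORDS)


def clean_text(text: str) -> str:
    """Apply the three steps in order: URLs, non-Devanagari text, stopwords."""
    return remove_stopwords(keep_devanagari_words(remove_urls(text)))


cleaned = clean_text("मला ही movie खूप आवडली आणि https://example.com पहा")
```

URLs are removed first so that their Latin characters do not need separate handling in the script filter. Note that this sketch also discards Latin digits and punctuation, which may or may not be desirable for a given task.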

Tokenizer Module
A very important step in many NLP tasks is the tokenization of text. The mahaNLP Tokenizer module provides functionalities for sentence-level tokenization (splitting into multiple sentences) and word-level tokenization (splitting into multiple words).
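To make the two granularities concrete, the following is a minimal rule-based sketch. It is an assumption-laden illustration, not the tokenizer shipped with mahaNLP: the function names and the choice of sentence terminators (danda, double danda, and the ASCII characters `.`, `?`, `!`) are hypothetical.

```python
import re
from typing import List

# Split on whitespace that directly follows a sentence terminator.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[।॥.?!])\s+")

# A word is a run of Devanagari characters (danda marks excluded)
# or a run of ASCII letters and digits.
_WORD = re.compile(r"[\u0900-\u0963\u0966-\u097F]+|[A-Za-z0-9]+")


def sentence_tokenize(text: str) -> List[str]:
    """Split text into sentences, keeping each terminator with its sentence."""
    return [part.strip() for part in _SENTENCE_BOUNDARY.split(text) if part.strip()]


def word_tokenize(text: str) -> List[str]:
    """Split text into words, discarding punctuation."""
    return _WORD.findall(text)


sample = "मी पुण्यात राहतो. तू कुठे राहतोस? आज पाऊस आहे।"
sentences = sentence_tokenize(sample)
words = word_tokenize(sentences[0])
```

A rule of this kind will split incorrectly after abbreviations that end in a period, so it should be read only as a demonstration of the sentence-level versus word-level distinction.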

Datasets Module
The mahaNLP library currently supports 3 datasets: MahaHate (Patil et al., 2022) for hate speech detection, MahaSent (Kulkarni et al., 2021; Pingle et al., 2023) for sentiment analysis, and MahaNER (Litake et al., 2022) for named entity recognition. These datasets can mainly be used for fine-tuning tasks. Each of these corpora can be loaded separately into a pandas DataFrame. The datasets are cached locally to avoid repeated downloads.
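The download-once behaviour described above follows a common caching pattern, sketched below with the standard library only. The names `load_cached` and `fake_fetch`, the CSV layout, and the sample rows are all hypothetical stand-ins; the sketch returns plain dictionaries rather than a pandas DataFrame to stay dependency-free.

```python
import csv
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


def load_cached(name: str, fetch: Callable[[], str], cache_dir: Path) -> List[Dict[str, str]]:
    """Return the rows of dataset `name`, calling `fetch` only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{name}.csv"
    if not path.exists():
        # Cache miss: fetch the raw CSV text once and store it on disk.
        path.write_text(fetch(), encoding="utf-8")
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Stand-in fetcher that records how often it is called; a real fetcher
# would download the corpus from its hosting location.
calls = []


def fake_fetch() -> str:
    calls.append(1)
    return "text,label\nछान चित्रपट,1\nवाईट अनुभव,-1\n"


cache = Path(tempfile.mkdtemp())
first = load_cached("demo", fake_fetch, cache)   # triggers the fetch
second = load_cached("demo", fake_fetch, cache)  # served from the cached file
```

With pandas available, the final read could instead be a `pandas.read_csv` call on the cached path, which would yield a DataFrame as the paper describes.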

Machine learning-based modules
A set of modules uses machine learning models in the background to provide the desired functionality. Currently, these features or modules are implemented using state-of-the-art Marathi Transformer models and are described below. The basic syntax to use them is:
Table 1 shows the features supported, output type, and the corresponding functions.

Model Flow
The Model Flow provides advanced functionality intended for use by ML practitioners with knowledge in the NLP domain. It offers flexibility for programmers to select background models and adjust their parameters. The model_repo module defines the model flow. Table 2 presents the association between standard flow machine learning-based modules and model_repo submodules.
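The relationship between the two flows can be pictured as a thin subclass that fixes a default model on top of a configurable base class. The sketch below is a hypothetical illustration of that layering: the class names, the placeholder model identifiers, and the stubbed `predict` method are assumptions, and only the `model_name` and `gpu_enabled` argument names are taken from the paper.

```python
class ModelFlowClassifier:
    """Stand-in for a model_repo class: the caller picks the model and device."""

    def __init__(self, model_name: str, gpu_enabled: bool = False):
        self.model_name = model_name
        self.device = "gpu" if gpu_enabled else "cpu"

    def predict(self, text: str) -> dict:
        # A real class would run a transformer here; this stub only
        # reports which configuration would have been used.
        return {"text": text, "model": self.model_name, "device": self.device}


class StandardFlowAnalyzer(ModelFlowClassifier):
    """Stand-in for a Standard Flow class: same behaviour, default model supplied."""

    # Placeholder string, not a real model identifier.
    DEFAULT_MODEL = "hypothetical/default-marathi-model"

    def __init__(self, model_name: str = DEFAULT_MODEL, gpu_enabled: bool = False):
        super().__init__(model_name, gpu_enabled)


# Standard Flow: no model knowledge required.
simple = StandardFlowAnalyzer()

# Model Flow: the model and device are chosen explicitly.
advanced = ModelFlowClassifier("hypothetical/other-model", gpu_enabled=True)
```

Inheritance keeps the prediction logic in one place, so a new default model only requires changing a single constant in the Standard Flow class.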

System Usage
The mahaNLP library is hosted on the official PyPI repository. It can be installed using the pip command:

pip install mahaNLP
Once the library is installed, the required features can be imported using the Python import statement and used directly. Currently, the library has been tested on 64-bit (x64) Windows 10 and on the Google Colab platform (POSIX, Linux).

Dataset Loading
A code snippet demonstrating how to load the mahaSent dataset provided by mahaNLP is given below:

Usage via Standard Flow
In Standard Flow, the user can simply import the feature they want to use (e.g., autocomplete, sentiment, tagger, etc.) and define the object to initialize that particular model. Here, the user can optionally pass the model_name as an argument during model initialization. The get_polarity_score function returns a float value representing the confidence score for the predicted sentiment class.
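The paper does not state how the confidence score is derived. A common convention for single-label classifiers is to apply a softmax to the model's output logits and report the probability of the top class; the sketch below illustrates that convention only. The function names, the logits, and the label set are assumptions, not values produced by mahaNLP.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def top_label_with_score(logits: Sequence[float], labels: Sequence[str]) -> Tuple[str, float]:
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Hypothetical logits for a three-class sentiment model.
label, score = top_label_with_score([2.0, 0.5, -1.0], ["positive", "neutral", "negative"])
```

Under this convention the score always lies between 0 and 1, and a value near 1 indicates that the model strongly prefers the predicted class over the alternatives.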

Usage via Model Flow
In Model Flow, the user has to import the specific model (e.g., mahaHate, mahaNER, etc.) using "import mahanlp.model_repo.modelname". Then, the user can define the model object and can optionally pass the model_name as an argument.
Refer to the following code for loading and using the MaskFillModel class object via Model Flow.
The predicted token string, sequence, and score can be returned as:
The library also supports various standard hardware devices. The gpu_enabled = True option allows users to utilize the GPU for model usage or inference.
Table 3 demonstrates all the modules of mahaNLP along with inputs and their expected outputs. The detailed description and usage of all the functionalities in mahaNLP are available at the mahaNLP PyPI project.

Conclusion and Future Work
We have built mahaNLP, a simple, easy-to-use, and extensible toolkit for Marathi language processing and the development of robust NLP models. The mahaNLP library is built largely on top of the Hugging Face Transformers library and uses the PyTorch framework under the hood. It also utilizes the resources developed by L3Cube as part of the MahaNLP initiative. It primarily provides wrapper classes to perform downstream tasks on supervised datasets, transformer models, and other tools for natural language processing in Marathi. The paper demonstrates the usage of various modules in the mahaNLP library.
We are working on expanding our project scope in the following manner:
• Expanding the project to support more low-resource Indic languages.
• Extending the existing corpora provided by the library and supporting Indo-Aryan language datasets and other regional language datasets like Konkani and Dongari, which are very prevalent in Maharashtra state.
• Addition of pretrained models for advanced downstream tasks like machine translation, natural language inference, etc.
• Creation of web-based user interactive tools or APIs which can be used by developers at the production level.
• Further improvements in model accuracy for predictions and reduction in overall model size and computational cost.

Limitations
• Pre-processing can be made more effective by expanding the stopword list to be more exhaustive.
• The state-of-the-art Hugging Face transformer models have been tested only on specific domains, such as news and social media, and not on a generic domain. More generic models built as part of a broader initiative will be integrated into the library.
• The current language models provided by mahaNLP are built on transformers that have a high runtime on CPU. In the future, we plan to build and integrate more compact models.
These issues are planned to be addressed in the upcoming versions of the library.
A simple snippet of generic flow usage for the SentimentAnalyzer class object is given below:
[Table 2 header: Standard Flow module | model_repo submodule | Hugging Face model. Recoverable fragments: sentiment; -similarity-sbert (Joshi et al., 2022)]

Table 1: Standard Flow ML features

Table 2: The model_repo submodules inherited by the Standard Flow modules and the default Hugging Face model used internally by each. The Hugging Face models can be listed and selected in the Model Flow.

Table 3: Sample examples of features provided by mahaNLP