BanLemma: A Word Formation Dependent Rule and Dictionary Based Bangla Lemmatizer

Lemmatization holds significance in both natural language processing (NLP) and linguistics, as it effectively decreases data density and aids in comprehending contextual meaning. However, due to its highly inflected nature and morphological richness, lemmatization in Bangla text poses a complex challenge. In this study, we propose linguistic rules for lemmatization and utilize a dictionary along with the rules to design a lemmatizer specifically for Bangla. Our system aims to lemmatize words based on their parts-of-speech class within a given sentence. Unlike previous rule-based approaches, we analyzed suffix marker occurrence according to morpho-syntactic values and then utilized sequences of suffix markers instead of entire suffixes. To develop our rules, we analyzed a large corpus of Bangla text from various domains, sources, and time periods to observe the word formation of inflected words. The lemmatizer achieves an accuracy of 96.36% when tested against a test dataset manually annotated by trained linguists and demonstrates competitive performance on three previously published Bangla lemmatization datasets. We are making the code and datasets publicly available at https://github.com/eblict-gigatech/BanLemma in order to contribute to the further advancement of Bangla NLP.


Introduction
Lemmatization is a crucial task in Natural Language Processing (NLP), where the goal is to obtain the base form of a word, known as the lemma. It has widespread applications in several NLP tasks, such as information retrieval (Balakrishnan and Lloyd-Yemoh, 2014), text classification (Toman et al., 2006), and machine translation (Carpuat, 2013). Lemmatization is particularly challenging in highly inflectional languages such as Bangla (Bhattacharya et al., 2005), due to the large number of inflectional and derivational suffixes that can be added to words. Generally, lemmatization reduces the inflectional form of a word to its dictionary form.
Lemmatization in Bangla has several challenges due to various linguistic factors. Firstly, the language exhibits a wide range of morphological diversity, making it difficult for a system to cover all its aspects (Islam et al., 2022). Secondly, Bangla has approximately 50000 roots, each capable of generating a large number of inflected words based on several factors such as tense, gender, and number (Chakrabarty et al., 2016). Thirdly, Bangla words can have multiple meanings, known as polysemy, depending on their Part-of-Speech (PoS), surrounding words, context, and other factors (Chakrabarty and Garain, 2016). Additionally, the development of lemmatization systems for this morphologically complex language is hindered by the lack of available resources.
Prior research attempts have employed different methodologies (e.g., learning-based, rule-based, and hybrid approaches) (Pal et al., 2015; Das et al., 2017; Islam et al., 2022; Chakrabarty and Garain, 2016). Although some studies have shown satisfactory performance in specific scenarios, there remains a pressing need for a robust lemmatizer tailored to the Bangla language.
In this study, we have taken a rule- and dictionary-based approach to tackle the lemmatization problem in Bangla. Unlike other rule-based studies for lemmatization in Bangla, we have derived the stripping methods based on suffix marker sequences, considering the PoS class of a word. A suffix েদরগুেলােত (d̪ergulot̪e) in a word িশশুেদরগুেলােত (ʃiʃud̪ergulot̪e) is a combination of the markers েদর (d̪er), গুেলা (gulo), and েত (t̪e). We strip from the last to the first marker sequentially to obtain the lemma িশশু (ʃiʃu; child).

Figure 1: Figure 1a shows the illustration of our proposed BanLemma lemmatizer. For simplicity, we demonstrate noun lemmatization only. For nouns, emphasis, case, and plural markers are stripped in succession as specified in Section 3.1. Figure 1b shows its effectiveness in cross-dataset settings. On Chakrabarty and Garain (2016), we outperform their lemmatizer by a significant margin and achieve competitive results on the whole dataset of Islam et al. (2022). On Chakrabarty et al. (2017), we outperform their model on the corrected PoS and lemma version of their evaluation set, elaborated in Section 5.3.

To derive the marker sequences, we analyzed word formations using a large Bangla text corpus. By embracing the marker-sequence stripping approach, we are able to effectively address a wide range of suffixes in the Bangla language, where other studies exhibit substantial limitations. Figure 1 shows an illustration of our lemmatization process and its effectiveness in cross-dataset settings, where our lemmatizer achieves higher accuracy than other lemmatizers when tested on their respective datasets (Chakrabarty and Garain, 2016; Islam et al., 2022; Chakrabarty et al., 2017). The key contributions of this study are as follows:

• We introduce BanLemma, a lemmatization system specifically designed for Bangla. By leveraging a precisely crafted linguistic framework, our system demonstrates superior performance compared to existing state-of-the-art Bangla lemmatization methods.
• We present a set of linguistic rules interpreting the process by which inflected words in the Bangla language are derived from their respective base words or lemmas.
• The linguistic rules are derived from rigorous analysis conducted on an extensive Bangla text corpus of 90.65M unique sentences. It encompasses a vast collection of 0.5B words, of which 6.17M are distinct. We sampled 22675 words through a systematic approach to manually analyze the inflected words.
• To assess the efficacy of BanLemma, we have employed both intra-dataset and cross-dataset evaluation. This evaluation framework enables us to measure the robustness of our proposed system across multiple datasets.
• Utilizing human-annotated PoS tags, we have achieved 96.36% accuracy on intra-dataset testing. Moreover, in cross-dataset testing, BanLemma surpasses recently published methodologies, exhibiting substantial performance improvements ranging from 1% to 11% (see Figure 1b), which implies our proposed BanLemma's robustness.

Table 1: Meaning difference of a word based on its PoS class. When used as a noun, the word "কর" means "tax", while used as a verb, it means "to do".

Related Work
(2019) used two novel techniques jointly: Dictionary Based Search by Removing Affix (DB-SRA) and Trie (Cormen et al., 2009) to lemmatize Bangla words. Chakrabarty and Garain (2016) proposed a novel Bangla lemmatization algorithm using word-specific contextual information such as part of speech and word sense.
Contextual lemmatizers lemmatize a word based on the surrounding context using deep neural networks. Chakrabarty et al. (2017) employed a two-stage bidirectional gated recurrent neural network to predict lemmas without using additional features. The release of the Universal Dependencies (UD) dataset (de Marneffe et al., 2014; Nivre et al., 2017) and the SIGMORPHON 2019 shared task formed the basis of encoder-decoder architectures that solve lemmatization as a string-transduction task (Qi et al., 2018; Kanerva et al., 2018). For the Bangla language, Saunack et al. (2021) employed a similar two-step attention network that took morphological tags and inflected words as input. Islam et al. (2022) used the PoS tag of each word as an additional feature to the encoder-decoder network, achieving 95.75% accuracy on the validation dataset.
Earlier rule-based approaches did not consider the composition of suffixes and removed them based on the longest match or trie-like data structures. In contrast, we provide specific rules on how a sequence of markers forms a suffix based on a word's PoS tag and show their efficacy in a cross-dataset setup.

Methodology
In Bangla, the meaning of a word is greatly influenced by its PoS class within the context of a sentence (see Table 1). Inflections are morphemes that convey grammatical features without changing the word class or semantic meaning (Lieber, 2021). They do not involve adding prefixes or altering the word's meaning. According to Karwatowski and Pietron (2022), for lemmatization it is crucial to accurately determine the word's intended PoS and its meaning within a sentence, considering the broader context. In this study, we have adopted a word formation-dependent rule-based approach, considering the following factors: i) the lemmatizer operates on inflected forms only and leaves derivational forms as they are; ii) the rules depend on the word's PoS class to use contextual information.

Development of Lemmatization Rules
To analyze the behavior of inflected words, we utilized a raw text corpus of 90.65 million sentences, totaling about 0.58 billion words, of which approximately 6.17 million are unique. The corpus was crawled from 112 sources, covering ten different domains across various time periods. Figure 3 and Figure 4 in Appendix A.1 provide visual representations of the dataset's distribution across domains and time, respectively. To obtain the PoS tag for each word in the dataset, we employed the automatic PoS tagger from the BNLP toolkit (Sarker, 2021). We projected each narrow PoS class to its corresponding basic PoS class: noun, pronoun, adjective, verb, adverb, conjunction, interjection, and postposition (Islam et al., 2014). For example, NC (common noun), NP (proper noun), and NV (verbal noun) were mapped to the class "noun". This allowed us to categorize the words based on their PoS classes.
After categorizing the words, we selected 19591 words as the analysis dataset, which was used to analyze the inflection patterns of Bangla words. The analysis dataset preparation procedure is elaborated in Appendix A.2. The word distribution of the raw text corpus and the analysis dataset per PoS class is depicted in Figure 2. As the majority of the words in the analysis dataset were in colloquial form, there was insufficient data available to study words in classical forms, such as তাহািদগর (t̪ahad̪igɔr; their), িগয়ািছেলন (giʲacʰilen; went), etc. To address this limitation, we collected classical texts from specifically selected sources. Utilizing 500 sentences comprising 5155 total words, of which 3084 were unique, we manually created clusters for the corresponding PoS classes using these classical texts. We did not employ the automatic PoS tagger here, as it was trained only on colloquial text (Bali et al., 2010). Adding classical texts allowed us to include a wider range of words, for a total of 22675 words.
The morphological synthesis of different PoS classes is highly effective in determining whether inflections are applied and helps identify the lemma. Investigation of the analysis dataset revealed interesting sequential patterns in inflected words from different PoS classes. Nouns and pronouns were found to have four kinds of inflectional suffixes: case markers (Moravcsik, 2008), plural markers, determiners, and emphasis markers. Verb inflections, on the other hand, depend on factors like tense, person, and number. Adjectives, in comparison, have only two suffixes, তর (t̪ɔro) and তম (t̪ɔmo), which indicate the comparative and superlative degrees respectively (Das et al., 2020). Lastly, only emphatic inflections are found in the adverb and postposition word classes. It should be noted that while other PoS classes exhibit distinct patterns of inflection, conjunctions and interjections function without undergoing any inflection.

Inflections of Nouns
Nouns in Bangla comprise both NP and NC, which contribute a significant portion of the vocabulary of Bengali phrases. Nominal inflections are observed at four levels of nouns: inanimate, animate, human, and elite (Faridee and Tyers, 2009). These inflections are added to nouns to signify grammatical roles and incorporate morphological features.
Case markers in the Bangla suffix system determine the noun's role in the sentence, indicating subject, object, possessor, or locative position. Of the seven Bangla cases, four case markers are used as noun suffixes: nominative, objective, genitive, and locative (Mahmud et al., 2014). Determiner markers in Bangla noun suffixes provide specificity and indicate singularity, while plural markers indicate multiple entities or instances of a noun phrase. Some plural markers are used specifically with animate nouns, such as গণ (gɔn), বৃ ন্দ (brind̪o), মণ্ডলী (mɔndoli), কু ল (kul), etc., while others are used with inanimate objects, like আবিল (abli), গুচ্ছ (guccʰo), গৰ্াম (gram), চয় (cɔʲ), etc. (Islam and Sarkar, 2017). Though these suffixes appear in traditional grammar and literature, their frequency of usage is quite low. The emphasis markers ই (i) and ও (o) are employed to emphasize nouns. For example, the word মানু ষগুেলােকও (manuʃguloke͡o̯) is formed by combining the base word মানু ষ (manuʃ; human) + গুেলা (gulo) + েক (ke) + ও (o), where the lemma is মানু ষ (manuʃ; human), with গুেলা (gulo) as the plural, েক (ke) as the case, and ও (o) as the emphatic marker. This inflected form expresses the meaning "even the humans", conveying plurality, objective case, and emphasis in a single word. The order of these suffixes is crucial: altering the sequence, as in মানু ষওেকগুেলা (manuʃokegulo), would result in a nonsensical word. The key considerations in noun morphology are selecting the appropriate suffixes, determining how to arrange them, and understanding the changes that take place at the boundary during affixation (Bhattacharya et al., 2005).
From our analysis, we find that the emphasis marker always takes the end position of a noun and does not occur in the middle of the suffix sequence, while the other markers can be combined in different ways. The resulting marker sequence can be written as

W = L + (PM | CM) + (CM | PM) + DM + EM    (1)

Here, W, L, PM, CM, DM, and EM denote the original word, the corresponding lemma, plural marker, case marker, determiner marker, and emphasis marker, respectively. We use these notations in the following equations as well. According to Equation 1, an inflected form of a word can consist of a lemma and up to four possible suffixes. The first suffix can be either a plural marker or a case marker, and they can alternate positions. However, it is not possible for one type of marker to occur twice consecutively. It is also possible to omit any of these suffixes from the sequence.
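The noun marker sequence above can be sketched as a last-to-first stripping loop. The following is a minimal illustration, not the actual BanLemma implementation: the marker inventories and romanized words are hypothetical placeholders for the Bangla-script lists in Table 9.

```python
# Hypothetical romanized marker inventories (illustrative only).
EMPHASIS = ["i", "o"]
DETERMINERS = ["ta", "ti", "khana"]
CASE = ["ke", "er", "ra"]
PLURAL = ["gulo", "guli", "der"]

def strip_one(word, markers, dictionary):
    """Strip at most one trailing marker, preferring a dictionary hit."""
    matches = [m for m in sorted(markers, key=len, reverse=True)
               if word.endswith(m) and len(word) > len(m)]
    for m in matches:
        if word[: -len(m)] in dictionary:
            return word[: -len(m)]       # stop as soon as the stem is a known word
    if matches:                          # otherwise fall back to the longest marker
        return word[: -len(matches[0])]
    return word

def lemmatize_noun(word, dictionary):
    if word in dictionary:               # already a lemma: leave it unchanged
        return word
    # Strip last to first: emphasis, then determiner, then case/plural
    # in either order, mirroring the sequence in Equation 1.
    word = strip_one(word, EMPHASIS, dictionary)
    word = strip_one(word, DETERMINERS, dictionary)
    word = strip_one(word, CASE + PLURAL, dictionary)
    word = strip_one(word, PLURAL + CASE, dictionary)
    return word
```

For instance, a romanized stand-in for মানুষগুেলােকও would be reduced marker by marker: `lemmatize_noun("manushgulokeo", {"manush"})` strips the emphasis marker, then the case marker, then the plural marker to reach the lemma.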

Inflections of Verbs
Bangla extensively utilizes verbs, which are action words comprising a verb root and an inflectional ending. The inflectional ending varies based on tense (present, past, future), person (first, second, third), and honor (intimate, familiar, formal) (Mahmud et al., 2014). Traditional Bangla grammar divides the tense categories into ten different forms. Verbs in Bangla can be written in classical/literary form or colloquial form. Table 11 of Appendix A.4 showcases the different forms of verbal inflection in Bangla.
In Bangla, verb suffixes do not break down into markers like those of other parts of speech. Removing the suffixes from a verb does not yield the lemma but rather the root form of the verb. For instance, যািচ্ছ (ɟaccʰi; going), যােবা (ɟabo; will go), and িগেয়িছলাম (giʲecʰilam; went) are different forms of the verb যাওয়া (ɟa͡o̯ʷa; to go), which is the dictionary word (Das et al., 2020; Dash, 2000).
Briefly, a two-step process is followed to accurately lemmatize verbs. First, the suffixes are removed from the verb to extract its root form. Then, a root-to-lemma mapping is applied to determine the final lemma form of the verb.

Table 2: The two-pass approach to lemmatizing verbs in Bengali. First, a suffix such as িছলাম (cʰilam; was) is removed from a word, for instance েখলিছলাম (kʰelcʰilam; was playing), to retrieve its root form েখল (kʰael; to play). Then, the root is matched to a lemma in a root-lemma mapping to obtain the lemma form, such as েখলা (kʰela; play).
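The two-pass verb procedure can be sketched as follows. This is an illustrative sketch only: the suffix list and root-to-lemma map are tiny romanized placeholders (e.g., "khelchhilam" standing in for েখলিছলাম), not the system's real Bangla tables.

```python
# Hypothetical romanized entries; the real system uses Bangla-script tables.
VERB_SUFFIXES = ["chhilam", "chhi", "bo"]          # sample inflectional endings
ROOT_TO_LEMMA = {"khel": "khela", "ja": "jaowa"}   # root -> dictionary lemma

def lemmatize_verb(word):
    # Pass 1: strip the longest matching suffix to expose the verb root.
    root = word
    for s in sorted(VERB_SUFFIXES, key=len, reverse=True):
        if word.endswith(s) and len(word) > len(s):
            root = word[: -len(s)]
            break
    # Pass 2: map the root to its lemma (the dictionary form).
    return ROOT_TO_LEMMA.get(root, root)
```

So a stand-in for েখলিছলাম first yields the root "khel" and then the mapped lemma "khela", mirroring the example in Table 2.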

Inflections of Pronouns
Bangla pronouns represent specific nouns and exhibit inflectional patterns similar to those of nouns. The language has nine types of pronouns, categorized into first, second, and third person based on personal distinctions (Dash, 2000). Appendix A.5 lists the singular, plural, and possessive forms of Bangla personal pronouns, offering a comprehensive understanding of their usage in the language.
Many Bangla personal pronouns have inherent suffixes that are integral to the words, and stripping these suffixes can result in meaningless strings. For example, pronouns like আমার (amar; my), েতামার (t̪omar; your), আমােদর (amad̪er; our), and েতামােদর (t̪omad̪er; yours) contain the case markers র (rɔ) and েদর (d̪er) as inherent parts, which can be further inflected with other markers such as the plural marker, determiner, and emphasis marker. Additionally, other pronouns are inflected with the four nominal suffixes: the plural marker, case marker, determiner, and emphasis marker. For instance, the pronoun েতামােদরেকই (t̪omad̪erke͡i̯) is inflected with the case marker েক (ke) and the emphasis marker ই (i). However, the marker েদর (d̪er) is considered part of the pronoun itself, and our lemmatizer does not strip it, so the lemma is েতামােদর (t̪omad̪er; yours), even though েদর (d̪er) can function as a case marker. Pronoun lemmas can be inflected using the marker sequence shown in Equation 2, which follows the same marker order as Equation 1:

W = L + (PM | CM) + (CM | PM) + DM + EM    (2)

Inflections of Adjectives, Adverbs and Postpositions

Bangla adjectives serve as modifiers for nouns, expressing their features, and can also modify adverbs. Suffix markers associated with adjectives indicate comparative and superlative degrees (Das et al., 2020). There are only two degree markers, তর (t̪ɔro) for the comparative and তম (t̪ɔmo) for the superlative, which inflect adjectives to indicate degree. The lemmatizer strips the degree marker from an adjective, yielding the corresponding positive adjective. For example, বৃ হত্তর (brihot̪t̪or; larger) is lemmatized as বৃ হৎ (brihɔt̪; large) and ক্ষ ু দৰ্তম (kʰud̪rot̪ɔmo; smallest) is lemmatized as ক্ষ ু দৰ্ (kʰud̪ro; small). Adjectives can also take the form of numerics when quantifying nouns. For example, একিট (ekti; a) is an adjective inflected with a nominal suffix. In such cases, the lemmatizer does not strip the suffix, so the lemma is the same as the inflected form. Moreover, because of syntactic structure, nouns can function as adjectives, e.g., আেগর িদেনর স্মৃ িতগুেলা (ager d̪iner srit̪igulo; the memories of the previous day). Here, আেগর (ager; previous) remains unchanged as the lemma because it functions as an adjective in this sentence. Additionally, adjectives can be inflected with emphatic markers. Equation 3, where DgM represents the degree marker, gives the sequence of markers that can inflect an adjective:

W = L + DgM + EM    (3)
Adverbs modify verbs to indicate how an action takes place. Postpositions, on the other hand, serve a functional role in establishing syntactic connections between syntactic units (Bagchi, 2007). Postpositions can also undergo inflection, by emphasis markers only. Adverbs and postpositions are inflected according to the sequence W = L + EM.
Our analysis revealed that words belonging to conjunction and interjection PoS classes do not get inflected in the Bangla language.

BanLemma
BanLemma consists of two main components: PoS-dependent rules and a dictionary. Given an input sentence, BanLemma employs an automatic PoS tagger, a suffix list, and a dictionary. The PoS tagger assigns a tag to each word in the sentence, resulting in a list of word-PoS tag pairs. BanLemma then iterates over each element of the list and applies the relevant lemmatization rule based on the PoS tag. In the case of a noun, for example, BanLemma determines the lemma using a method based on Equation 1; this method utilizes the dictionary and applies sequential suffix stripping, as described in Algorithm 1. More detailed, implementation-oriented pseudocode is discussed in Appendix A.6.
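The overall control flow of tagging each word and dispatching to a PoS-specific rule can be sketched as below. The tagger and the noun rule here are toy stand-ins (the real system uses the BNLP PoS tagger and the marker-stripping rules above); `toy_pos_tagger` and `noun_rule` are hypothetical names.

```python
def toy_pos_tagger(sentence):
    # Stand-in for the automatic PoS tagger: tags every token as a noun.
    return [(w, "noun") for w in sentence.split()]

def noun_rule(word):
    # Placeholder for the Equation-1-based noun method; strips one
    # hypothetical romanized plural marker for illustration.
    return word.removesuffix("gulo")

RULES = {
    "noun": noun_rule,
    # "verb": ..., "pronoun": ..., etc. per the rules in Section 3
}

def banlemma(sentence, tagger=toy_pos_tagger):
    lemmas = []
    for word, pos in tagger(sentence):        # 1) word-PoS tag pairs
        rule = RULES.get(pos, lambda w: w)    # 2) pick the PoS-specific rule
        lemmas.append(rule(word))             # 3) apply it to get the lemma
    return lemmas
```

Words with PoS classes outside the rule table fall through unchanged, matching the paper's treatment of the "others" category.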

Algorithm 1 Marker stripping method
Require: A word (W), a marker list (M), and dictionary words (Dw)
Ensure: The marker list is sorted by length in descending order.

Development of BanLemma Dictionary
The dictionary used in the lemmatization process includes inflected words and their corresponding lemmas. For instance, (অংশীদারেক (ɔŋʃid̪arke; to the partner), অংশীদার (ɔŋʃid̪ar; partner)) represents the mapping of an inflected word to its lemma. For base words, the key and value in the mapping are the same, as in (েকতন (kaet̪on; flag), েকতন (kaet̪on; flag)). The dictionary is organized into 6 PoS clusters (nouns, pronouns, verbs, adjectives, adverbs, and postpositions) containing a total of around 71.5k word-lemma pairs. To prepare the dictionary, we used sources including Accessible (2023) and Chowdhury (2012). The dictionary format and organization are shown in Figure 6 in Appendix A.7.
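The layout of PoS clusters holding word-lemma pairs might be sketched as a nested mapping like the one below. The romanized sample entries are hypothetical transliterations of the examples above, not the dictionary's actual format.

```python
# Hypothetical romanized sample of the dictionary layout: PoS clusters of
# inflected-word -> lemma pairs, with base words mapping to themselves.
DICTIONARY = {
    "noun": {
        "ongshidarke": "ongshidar",   # inflected word -> lemma
        "keton": "keton",             # base word -> itself
    },
    "verb": {"jachchhi": "jaowa"},
}

def lookup(word, pos):
    """Return the lemma if the (word, PoS) pair is in the dictionary, else None."""
    return DICTIONARY.get(pos, {}).get(word)
```

Keeping the clusters keyed by PoS lets the lookup respect the word's tagged class, so the same surface form can map to different lemmas under different PoS tags.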

Test Dataset Preparation
We created a test dataset using the text corpus described in Section 3.1. Instead of random selection, we employed a systematic approach to choose 1049 sentences while maintaining the same domain distribution. The detailed procedure for preparing the test dataset is discussed in Appendix A.8. This dataset had 25.16% of its words overlapping with the analysis dataset, enabling a reliable evaluation of our proposed rules. To ensure accuracy, we manually annotated the PoS tags and lemmas of the test dataset. We also prepared a separate test dataset containing only classical texts, comprising 70 sentences totaling 607 words.

Performance of BanLemma
Table 3 summarizes the lemmatizer's performance across PoS categories. First, we evaluate it on the whole test dataset and report the result in the All column, where we achieve 96.36% overall accuracy. To measure performance on classical texts only, we separate the classical sentences and report the result in the CSCL column, which shows 96.48% overall accuracy. Next, we evaluate performance on words we did not include during the manual analysis of inflected words, i.e., those non-overlapping with the analysis dataset; the NOAD column shows that we achieved an accuracy of 96.41%. Finally, we measured performance on words where neither the word nor its lemma was included in the dictionary; we report the accuracy to be 96.32% for this experiment in the NOD column. The table shows that the lemmatizer achieves a perfect accuracy of 100% for postpositions, which can be attributed to the limited number of lemmas and inflections in this category, most of which are included in the word-lemma mapping dictionary. The "others" category also achieves 100% accuracy, as the lemmatization process considers the word itself as the lemma for any category not explicitly targeted for lemmatization. We further evaluated the performance of BanLemma in terms of precision, recall, and F1 score, which is discussed in the appendix.

Table 4 summarizes the impact of using an automatic PoS tagger to obtain the tag of each word. It indicates that the lemmatizer shows significantly reduced performance when used with an automatic PoS tagger. It also highlights that the rules are well capable of lemmatizing Bangla text accurately given the correct PoS. Table 5 provides examples where the lemmatizer fails to generate accurate lemmas due to incorrect PoS information. These examples highlight the dependency of the lemmatizer on the accuracy of the automatic PoS tagger and demonstrate that the lemmatizer is capable of producing the correct lemma when provided with accurate, manually annotated PoS tags.

Cross Dataset Evaluation
Though Islam et al. (2022) reported performance on a test split, our lemmatizer demonstrates competitive performance, achieving 94.80% accuracy on their whole BaNeL dataset, only 0.95% less than their reported accuracy. On the other hand, our system exhibits lower performance on the test dataset provided by Chakrabarty et al. (2017), achieving an accuracy of 80.08%, which is 11.06% lower than their reported performance. Several factors contribute to this gap, including the reliance on an automatic PoS tagger, which introduces inherent errors. Further investigation reveals significant inconsistencies between the dataset used in their study and our considerations during the development of the lemmatizer. These inconsistencies are discussed in detail in Appendix A.10. To evaluate the performance of our proposed system on the dataset provided by Chakrabarty et al. (2017), we manually annotated and corrected PoS tags and lemmas, focusing on 52 sentences comprising a total of 695 words. Additionally, we reviewed and corrected 2000 lemmas from the BaNeL dataset. The results of our evaluation are summarized in Table 7.
For the Chakrabarty et al. (2017) dataset, we performed three evaluation steps. First, we assessed the performance of our lemmatizer on the unmodified data within the portion where manual efforts were applied; the accuracy column (Acc.) of the table reports 79.97% for this attempt. Second, we examined performance using an automatic PoS tagger while correcting the lemmas; the automatic PoS tags (A-PoS) and original lemma (O-Lem) columns report this accuracy to be 87.09%. Finally, we measured overall performance using manually annotated PoS tags and corrected lemmas; the corrected PoS (C-PoS) and corrected lemma (C-Lem) columns report the outcome of this final experiment to be 94.34%.
Since the BaNeL dataset already provides manually annotated PoS tags, we focused solely on evaluating the performance of our lemmatizer on the corrected lemmas. The fourth column of the table presents the performance in this scenario. In both cases, our lemmatizer demonstrated improved performance compared to the initial evaluations.

Conclusion and Future Works
This study introduces BanLemma, a Bangla lemmatization system aimed at enriching Bangla language resources. BanLemma is composed of linguistically derived rules obtained through rigorous analysis of a large Bangla text corpus. To overcome the challenges associated with limited resources in Bangla lemmatization, we also provide a comprehensive collection of morphological markers and rules. To demonstrate the effectiveness of BanLemma, we have conducted evaluations using a human-annotated test dataset, annotated by trained linguists, and several recently published Bangla datasets. Our proposed BanLemma achieved an accuracy of 96.36% on our human-annotated test set. Furthermore, in cross-dataset evaluation, BanLemma exhibited significant performance improvements ranging from 1% to 11%. The results of our study shed light on the formation of inflected words, offering a solution that addresses the limitations of previous lemmatization methods. These findings contribute to the advancement of research in this field and pave the way for further investigations in the domain of Bangla lemmatization.

Limitations
During our analysis, we found some limitations of BanLemma. We discuss these in the following points: • Out-of-dictionary words: we identified a notable pattern in the lemmatizer's behavior for words that are not present in the dictionary but are already lemmas and end with a suffix substring. In such cases, the lemmatizer erroneously strips the suffix from the word. For instance, the word নূ নয্তম (nunnɔt̪ɔmɔ; minimum) is itself a lemma, yet the lemmatizer strips the ending substring তম (t̪ɔmo), resulting in the incorrect lemma নূ নয্ (nunno). We noticed that this limitation is particularly prominent with proper nouns. It also emphasizes the significance of the dictionary's richness in the lemmatization process: words that are not present in the dictionary but end with suffix substrings will be inaccurately lemmatized.
• Ambiguous semantic meaning: we observed that the lemmatizer struggles to comprehend the semantic meaning of certain words, resulting in incorrect lemmatization. For example, Table 8 illustrates a case where the lemma varies depending on the context despite the word having the same PoS class. The lemmas differ based on whether the word expresses the action of hanging or the state of something being hung.
• Automatic PoS dependency: the lemmatizer heavily relies on PoS information, which introduces errors when an automatic PoS tagger is used in the workflow.

A.2 Analysis Dataset Preparation
To obtain the PoS tags for each word in the dataset, we used the PoS tagger from the BNLP toolkit. Words sharing a common initial substring, such as সরকারেক (ʃɔrkarke) and so on, which share the initial substring সরকার (ʃɔrkar; government), were grouped together into clusters. For the analysis, we then selected clusters containing a minimum threshold of words. Initially, we set a minimum threshold of 10, but this resulted in an overabundance of nouns, verbs, and adjectives while filtering out clusters of pronouns, postpositions, conjunctions, and interjections due to the limited number of words in those groups. To address this, we determined the minimum threshold for each PoS class individually. The thresholds were set as follows: 14 for nouns, 7 for adjectives, 6 for verbs, 2 for pronouns and adverbs, and 1 for postpositions, conjunctions, and interjections. These thresholds were determined through a combination of tuning and manual examination of the selected clusters. To neutralize errors of the automatic PoS tagger, we manually curated the words, removing words where necessary. This yielded 19591 selected words for the analysis dataset. To study classical texts rigorously, we additionally used some classical text sources and selected further words, giving a total of 22675 words for analysis.
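The cluster-selection step described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the thresholds are the ones given in the text, while `prefix_of` is a hypothetical stand-in for however the common initial substring was actually identified, and the sample words are romanized placeholders.

```python
from collections import defaultdict

# Per-PoS minimum cluster sizes from the analysis above.
THRESHOLDS = {"noun": 14, "adjective": 7, "verb": 6, "pronoun": 2,
              "adverb": 2, "postposition": 1, "conjunction": 1,
              "interjection": 1}

def select_clusters(tagged_words, prefix_of):
    """Group (word, PoS) pairs by a shared initial substring and keep only
    clusters that meet the PoS-specific size threshold."""
    clusters = defaultdict(list)
    for word, pos in tagged_words:
        clusters[(prefix_of(word), pos)].append(word)
    return {key: words for key, words in clusters.items()
            if len(words) >= THRESHOLDS.get(key[1], 1)}
```

For example, with a naive 4-character prefix function, two pronoun forms sharing a stem survive the pronoun threshold of 2, while a lone noun falls below the noun threshold of 14 and is filtered out.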

A.5 Personal Pronouns
Table 12 lists the personal pronouns in singular, plural, and possessive forms.

A.6 Lemmatization Algorithm and Methods
Algorithm 2 summarizes the overall lemmatization algorithm.
Algorithm 3 provides a summary of the lemmatization method for noun words. The stripping process begins with the last marker, which in the case of nouns is the emphasis marker. Equation 1 illustrates that the sequence of the last three markers is fixed. However, if the first marker is a plural marker, then the second marker will be a case marker, or vice versa. After stripping the last three types of markers, the algorithm determines the next last marker and selects an appropriate second sequence for stripping the remaining markers.
The marker stripping method is described in detail in Algorithm 1. This algorithm is responsible for identifying and removing markers at the end of a word. It begins by checking whether the word ends with a marker. If a match is found, it decides whether to stop the matching process based on whether the remaining prefix of the word is present in the dictionary. If the remaining prefix is found in the dictionary, the lemma is immediately returned. However, if the remaining prefix is not found in the dictionary, the algorithm continues matching to determine whether stripping a shorter suffix marker would result in a dictionary entry. This design choice allows for handling cases where multiple markers are present. For example, when stripping case markers from the inflected word েছেলর (of the boy), the algorithm would first match the marker এর (er). Stripping this marker would result in the word েছল, which is not the correct lemma. By continuing the matching process, the algorithm would then match the marker র (rɔ), resulting in the correct lemma েছেল (boy). However, if no shorter marker yields a dictionary entry, the algorithm strips the longest marker at the end.
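The longest-first matching with a dictionary-driven fallback to shorter markers can be sketched as below. The romanized word "chheler" is a hypothetical stand-in for েছেলর, and the two-marker list mirrors the এর/র example; this is not the exact Algorithm 1 code.

```python
def strip_marker(word, markers, dictionary):
    """Try markers longest-first; stop at the first strip that yields a
    dictionary word, otherwise fall back to stripping the longest marker."""
    fallback = None
    for m in sorted(markers, key=len, reverse=True):
        if word.endswith(m) and len(word) > len(m):
            stem = word[: -len(m)]
            if stem in dictionary:
                return stem               # dictionary hit: done
            if fallback is None:
                fallback = stem           # remember the longest-strip result
    return fallback if fallback is not None else word
```

With markers ["er", "r"] and a dictionary containing "chhele", stripping "er" from "chheler" gives "chhel" (not in the dictionary), so the loop continues to the shorter marker "r" and returns "chhele", mirroring the েছেলর example.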
The lemmatization methods for the other PoS classes can be obtained by modifying Algorithm 3. The details of these modifications are given in Algorithms 4 to 7, which present the algorithms for lemmatizing words of the corresponding PoS classes.

Figure 2: Distribution of all words in the raw text corpus and the analysis dataset by their PoS classes, excluding punctuation and symbols. Both corpora have a similar distribution, with an abundance of nouns, verbs, and adjectives and a small number of interjections.

Figure 3: Distribution of word counts in the raw dataset corpus, categorized by domain. Word counts are denoted in 'M' (millions). The majority of the data originates from the "Bangla News" and "Encyclopedia" domains, while the "Essay" domain represents the smallest portion. However, there are 66757 words, accounting for 0.01% of all words, for which the domain could not be determined.

Figure 5: The analysis dataset preparation process.


Table 9 in Appendix A.3 lists the markers used in nouns.

Table 10 in Appendix A.3 lists examples of how nouns are inflected by taking different markers in the specific sequences represented by Equation 1.

Table 3: PoS-wise and overall performance of the lemmatizer on the human-annotated test dataset. All, CSCL, NOAD, and NOD represent the datasets of all text, classical text, No Overlap with Analysis Dataset, and No Overlap with Dictionary, respectively. Others indicates all PoS tags except the aforementioned ones. For the "others" class, the word itself is considered its lemma, and there were no non-overlapping postpositions for the NOAD and NOD datasets.

Table 4: Lemmatizer's performance based on how the PoS tags were obtained.

Table 6 presents the results of the cross-dataset evaluation. Our lemmatizer outperforms BenLem, achieving an 11.63% improvement in accuracy on their provided test dataset. BaNeL did not provide any separate test dataset, so we evaluated the lemmatizer on the entire BaNeL dataset.

Table 5: Instances where our lemmatizer produces incorrect lemmas when an automatic PoS tagger provides inaccurate tags, but accurately lemmatizes the word when the PoS tag is correct.

Table 6: Lemmatization results of the cross-dataset evaluation. *We report performance on the entire BaNeL dataset, whereas they reported the metric on a test split of the dataset (Acc = Accuracy, Ch Acc = Character Accuracy).

Table 7: Performance of our lemmatizer on the sampled cross-datasets, measured in accuracy. The second column, Acc., indicates accuracy on the unmodified datasets. C-PoS and A-PoS indicate manually and automatically annotated PoS tags, respectively. C-Lem indicates manually corrected lemmas.

Table 8: Difference in the lemma of the same verb ঝু লা (ɟʱula) based on sentence context. The lemma is ঝু লােনা (ɟʱulano) when it expresses the action of hanging something, and ঝু লা (ɟʱula) when it expresses that something is hanging.

Table 11 presents all suffixes that inflect a verb root.

Table 9: A list of markers used in nouns, containing 37 plural markers, 12 case markers, 7 determiners, and 2 emphasis markers, totaling 58 markers.