BSpell: A CNN-Blended BERT Based Bangla Spell Checker

Bangla typing is mostly performed using an English keyboard and can be highly erroneous due to the presence of compound and similarly pronounced letters. Correcting a misspelled word requires understanding both the word typing pattern and the context of the word's usage. In this paper, we propose a specialized BERT model named BSpell targeted towards word-for-word correction at sentence level. BSpell contains an end-to-end trainable CNN sub-model named SemanticNet along with a specialized auxiliary loss. This allows BSpell to specialize in the highly inflected Bangla vocabulary in the presence of spelling errors. Furthermore, we propose a hybrid pretraining scheme for BSpell that combines word level and character level masking. Comparisons on two Bangla and one Hindi spelling correction datasets show the superiority of our proposed approach.


Introduction
Bangla is the native language of 228 million people, which makes it the sixth most spoken language in the world. This Sanskrit-originated language has 11 vowels, 39 consonants, 11 modified vowels and 170 compound characters (Sifat et al., 2020). There is a vast difference between Bangla grapheme representation and phonetic utterance for many commonly used words. As a result, fast typing of Bangla yields frequent spelling mistakes. Almost all Bangla native speakers type using the English QWERTY layout keyboard (Noyes, 1983), which makes it difficult to correctly type Bangla compound characters, phonetically similar single characters and similarly pronounced modified vowels. Thus Bangla typing speed, if error-free typing is desired, is slow. An accurate spell checker (SC) can be a solution to this problem.
Existing Bangla SCs include phonetic rule based (UzZaman and Khan, 2004, 2005) and clustering based methods (Mandal and Hossain, 2017). These methods do not take misspelled word context into consideration. Another N-gram based Bangla SC (Khan et al., 2014) takes only short range previous context into consideration. Recent state-of-the-art (SOTA) spell checkers have been developed for the Chinese language, where a character level confusion set (similar characters) guided sequence to sequence (seq2seq) model has been proposed by Wang et al. (2019). Another work used a similarity mapping graph convolutional network to guide BERT based character-by-character parallel correction (Cheng et al., 2020). Both these methods require external knowledge and assumptions about confusing character pairs existing in the language. The most recent Chinese SC offers an assumption free BERT architecture that includes error detection network based soft-masking (Zhang et al., 2020). This model takes all N characters of a sentence as input and produces the correct version of these N characters as output in a parallel manner. One limitation of developing a Bangla SC using the SOTA BERT based implementation (Zhang et al., 2020) is that the number of input and output characters in BERT has to be exactly the same. Such a scheme is only capable of correcting substitution type errors. As compound characters are common in Bangla words, an error made due to the substitution of such characters also changes word length (see the table in Figure 1). So, we introduce word level prediction in our proposed BERT based model. The table shown in Figure 2 illustrates the importance of context in Bangla SC. Although the red marked words of this figure are the misspelled versions of the corresponding green marked correct words, these red words are valid Bangla words. But if we check these red words based on sentence semantic context, we can realize that these words have been produced accidentally because of spelling error. An effective SC has to consider a word's pattern, its prior context and its post context. Spelling errors often span multiple words in a sentence. Figure 3 provides an example where all four words have been misspelled. The correction of each word has context dependency on a few other words of the same sentence. The problem is that the words that form the correction context are also misspelled. The table in the figure shows the words to look at in order to correct each misspelled word. In the original sentence (colored in red), all these words that need to be looked at for context are misspelled. If a SC cannot understand the approximate underlying meaning of these misspelled words, then we lose all context for correcting each misspelled word, which is undesirable.

Correct
We propose a word level BERT (Devlin et al., 2018) based model named BSpell. This model is capable of learning prior and post context dependency through the multi-head attention mechanism of stacked Transformer encoders (Vaswani et al., 2017). The model uses a CNN based learnable SemanticNet sub-model to capture the semantic meaning of both correct and misspelled words. BSpell also uses a specialized auxiliary loss to facilitate word level pattern learning and to mitigate the vanishing gradient problem. We introduce hybrid pretraining for BSpell to capture both context and word error patterns. We perform detailed evaluation on three error datasets that include a real life Bangla error dataset. Our evaluation includes detailed analysis of possible LSTM based SCs, SC variants of BERT and existing classic Bangla SCs.

Related Works
Several studies on Bangla SC development have been conducted in spite of Bangla being a low resource language. A phonetic encoding oriented Bangla word level SC based on the Soundex algorithm was proposed by UzZaman and Khan (2004). This encoding scheme was later modified to develop a Double Metaphone encoding based Bangla SC (UzZaman and Khan, 2005). They took into account major context-sensitive rules and consonant clusters while performing their encoding scheme. Another word level Bangla SC able to handle both typographical and phonetic errors was proposed by Mandal and Hossain (2017). An N-gram model was proposed by Khan et al. (2014) for checking sentence level Bangla word correctness. An encoder-decoder based seq2seq model was proposed by Islam et al. (2018) for the Bangla sentence correction task, which involved bad arrangement of words and missing words, though this work did not include incorrect spelling. A recent study has included Hindi and Telugu SC development, where mistakes are assumed to be made at character level (Etoori et al., 2018). They have used attention based encoder-decoder modeling as their approach.
SOTA research in this domain involves Chinese SCs, as Chinese is an error prone language due to its confusing word segmentation and its phonetically and visually similar but semantically different characters. A seq2seq model assisted by a pointer network was employed for character level spell checking, where the network is guided by an externally generated character confusion set (Wang et al., 2019). Another work incorporated phonological and visual similarity knowledge of Chinese characters into a BERT based SC model by utilizing a graph convolutional network (Cheng et al., 2020). A recent BERT based SC has taken advantage of a GRU (Gated Recurrent Unit) based soft masking mechanism and has achieved SOTA performance in Chinese character level SC in spite of not providing any external knowledge to the network (Zhang et al., 2020). Another external knowledge free approach, namely FASPell, used a BERT based seq2seq model (Hong et al., 2019). HanSpeller++ is notable among early Chinese SCs (Xiong et al., 2015). It was a unified framework utilizing a hidden Markov model.

Problem Statement
Suppose an input sentence consists of n words: Word_1, Word_2, ..., Word_n. For each Word_i, we have to predict the right spelling, provided Word_i exists in the top-word list of our corpus. If Word_i is a rare word (a proper noun in most cases), we predict the UNK token, denoting that we do not make any correction to such words. For correcting a particular Word_i in a paragraph, we only consider other words of the same sentence for context information.
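The word-for-word contract above can be sketched as a small interface: every position gets either an in-vocabulary correction or the UNK token. The vocabulary, the toy predictor, and all names below are illustrative stand-ins, not the actual BSpell model.

```python
# Hypothetical top-word list (romanized placeholders for Bangla words).
TOP_WORDS = {"ami", "tumi", "bhalo", "achi"}

def correct_sentence(words, predict_word):
    """For each word, emit the predicted spelling if it is an in-vocabulary
    top word, else the UNK token (rare words such as proper nouns are
    deliberately left uncorrected)."""
    out = []
    for i, _ in enumerate(words):
        pred = predict_word(words, i)   # context = same sentence only
        out.append(pred if pred in TOP_WORDS else "UNK")
    return out

# A toy predictor that "fixes" one known misspelling.
toy = lambda ws, i: {"amii": "ami"}.get(ws[i], ws[i])
print(correct_sentence(["amii", "bhalo", "Rahim"], toy))
# → ['ami', 'bhalo', 'UNK']
```

The proper noun "Rahim" maps to UNK rather than being force-corrected, matching the rare-word policy stated above.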

BSpell Architecture
Figure 4 shows the details of the BSpell architecture. Each input word of the sentence is passed through the SemanticNet sub-model. This sub-model returns a SemanticVec vector representation for each input word. These vectors are then passed on to two separate branches (a main branch and a secondary branch) simultaneously. The main branch is similar to the BERT_Base architecture (Gong et al., 2019). This branch provides the n correct words corresponding to the n input sentence words at its output side. The secondary branch consists of an output dense layer. This branch is used for the sole purpose of imposing an auxiliary loss to facilitate the SemanticNet sub-model's learning of misspelled word patterns.

SemanticNet Sub-Model
Correcting a particular word requires understanding other relevant words in the same sentence. Unfortunately, those relevant words may also be misspelled. As humans, we can understand the meaning of a word even if it is misspelled because of our deep understanding at the word syllable level and our knowledge of usual spelling error patterns. We want our model to have a similar semantic level understanding of words. We propose SemanticNet, a sequential 1D CNN sub-model that is applied at the individual word level with a view to learning intra-word syllable patterns. Details of the individual word representation are shown in the bottom right corner of Figure 4. We represent each input word by a matrix (each character represented as a one hot vector). We apply global max pooling on the final convolution layer output feature matrix of SemanticNet, which gives us the SemanticVec vector representation of the input word. We get a similar SemanticVec representation for each input word by independently applying the same SemanticNet sub-model to each word's matrix representation.
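As a rough sketch of the word representation step, each word becomes a characters-by-alphabet one-hot matrix before entering the CNN. The Latin alphabet and the padding length here are placeholders; the real model operates on Bangla characters.

```python
import numpy as np

# Hypothetical character inventory; the actual model uses the Bangla alphabet.
CHARS = "abcdefghijklmnopqrstuvwxyz"
CHAR_IDX = {c: i for i, c in enumerate(CHARS)}

def word_to_matrix(word, max_len=12):
    """One-hot matrix (max_len x |CHARS|): row = character position,
    column = character identity; padding rows stay all-zero."""
    m = np.zeros((max_len, len(CHARS)), dtype=np.float32)
    for pos, ch in enumerate(word[:max_len]):
        if ch in CHAR_IDX:
            m[pos, CHAR_IDX[ch]] = 1.0
    return m

m = word_to_matrix("bhalo")   # 5 active rows, 7 zero-padded rows
```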

BERT_Base as Main Branch
The SemanticVec vector representations obtained from the input words are passed in parallel to the first Transformer encoder. 12 such Transformer encoders are stacked on top of each other. Each Transformer employs multi head attention, layer normalization and dense layer specific modification on each input vector. The attention mechanism applied to the word feature vectors in each Transformer layer helps the words of the input sentence interact with one another, extracting sentence context. We pass the final Transformer layer output vectors to a dense layer with the Softmax activation function applied to each vector independently. So, now we have n probability vectors for the n words of the input sentence. Each probability vector contains len_P values, where len_P is one more than the total number of top words considered (the additional entry represents rare words). The top word corresponding to the index of the maximum probability value of the i-th probability vector represents the correct word for Word_i of the input sentence.
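The output-side decoding described above amounts to a softmax over len_P entries per word, where the extra index stands for rare words. A minimal sketch with a hypothetical three-word top list:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical top-word list; index len(TOP) is the extra rare-word slot,
# so each probability vector has len_P = len(TOP) + 1 entries.
TOP = ["ami", "tumi", "bhalo"]

def decode(logits):
    """Pick the most probable entry; the out-of-list index means UNK."""
    idx = int(softmax(logits).argmax())
    return TOP[idx] if idx < len(TOP) else "UNK"

print(decode(np.array([0.1, 2.0, 0.3, 0.2])))   # → tumi
```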

Auxiliary Loss in Secondary Branch
The gradient vanishing problem is a common phenomenon in deep neural networks, where weights of the shallow layers are not updated sufficiently during backpropagation. With 12 Transformer encoders on top of the SemanticNet sub-model, the layers of this sub-model certainly lie in a shallow position. Although SemanticNet constitutes a small initial portion of BSpell, this portion is responsible for word pattern learning, an important task of SC. In order to eliminate the gradient vanishing problem of SemanticNet and to turn it into an effective word pattern learner, we impose an auxiliary loss through the secondary branch.

BERT Hybrid Pretraining
In contemporary BERT pretraining methods, each input word Word_i may be kept intact or may be replaced by a default mask word in a probabilistic manner (Devlin et al., 2018; Liu et al., 2019). BERT has to predict the masked words. Mistakes on BERT's side contribute to the loss value, driving backpropagation based weight updates.
In this process, BERT learns to fill in the gaps, which in turn teaches the model language context. Sun et al. (2020) proposed incremental ways of pretraining the model for new NLP tasks. We take a more task specific approach to masking. In SC, recognizing noisy word patterns is important. But there is no provision for that in contemporary pretraining schemes, so we propose hybrid masking (see Figure 5). Among the n input words in a sentence, we randomly replace n_W words with a mask word Mask_W. Among the remaining n − n_W words, we choose n_C words for character masking. During character masking, we choose m_C characters at random from a word having m characters and replace each of them with a mask character Mask_C. Such masked characters introduce noise into words and help BERT to understand the probable semantic meaning of noisy/misspelled words.

Implemented Pretraining Schemes
We have experimented with three types of masking based pretraining schemes. During word masking, we randomly select 15% of the words of a sentence and replace them with a fixed mask word. During character masking, we randomly select 50% of the words of a sentence. For each selected word, we randomly mask 30% of its characters by replacing each of them with a special mask character. Finally, during hybrid masking, we randomly select 15% of the words of a sentence and replace them with a fixed mask word. We then randomly select 40% of the remaining words. For these selected words, we randomly mask 25% of their characters.
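The hybrid scheme above can be sketched as follows. The mask tokens, the rounding of the stated percentages, and the at-least-one-selection rule are our assumptions, since the text only gives the proportions.

```python
import random

MASK_W, MASK_C = "[MASKW]", "#"   # hypothetical mask tokens

def hybrid_mask(words, p_word=0.15, p_char_word=0.40, p_char=0.25, rng=None):
    """Replace p_word of the words with MASK_W; among the remaining words,
    pick p_char_word of them and mask p_char of each one's characters."""
    rng = rng or random.Random(0)     # seeded for reproducibility
    words = list(words)
    n = len(words)
    word_idx = set(rng.sample(range(n), max(1, round(n * p_word))))
    for i in word_idx:
        words[i] = MASK_W
    rest = [i for i in range(n) if i not in word_idx]
    for i in rng.sample(rest, max(1, round(len(rest) * p_char_word))):
        chars = list(words[i])
        for j in rng.sample(range(len(chars)), max(1, round(len(chars) * p_char))):
            chars[j] = MASK_C
        words[i] = "".join(chars)
    return words

masked = hybrid_mask(["abcde"] * 10)   # 2 word-masked, 3 char-masked words
```

On a 10-word sentence this yields 2 fully masked words and 3 character-noised words, matching the 15%/40%/25% recipe.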

Dataset Specification
We have used one Bangla and one Hindi corpus, each with over 5 million (5 M) sentences, for BERT pretraining (see Table 1). The Bangla pretraining corpus consists of Prothom Alo articles dated from 2014-2017 and BDnews24 articles dated from 2015-2017. The Hindi pretraining corpus consists of the Hindi Oscar Corpus, preprocessed Wikipedia articles, the HindiEnCorp05 dataset and WMT Hindi News Crawl data (all of these are publicly available corpora). We have used the Prothom-Alo 2017 online newspaper dataset for Bangla SC training and validation purposes. The errors in this corpus have been produced synthetically using the probabilistic algorithm described by Sifat et al. (2020). We further validate our baselines and proposed methods on a Hindi open source SC dataset, namely ToolsForIL (Etoori et al., 2018). For the real error dataset, we have collected a total of 6300 sentences from the Nayadiganta online newspaper. We then distributed the dataset among ten participants. They typed (at regular speed) each correct sentence using an English QWERTY keyboard, producing natural spelling errors. It took 40 days to finish the labeling. Top words have been taken such that they cover at least 95% of the corresponding corpus.
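The 95% top-word cut-off can be computed by ranking words by frequency and keeping the smallest prefix that reaches the coverage target; this is our reading of the selection rule, not code from the paper.

```python
from collections import Counter

def top_words_for_coverage(tokens, coverage=0.95):
    """Smallest frequency-ranked vocabulary whose occurrences cover at
    least `coverage` of all tokens in the corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    vocab, covered = [], 0
    for word, c in counts.most_common():
        vocab.append(word)
        covered += c
        if covered / total >= coverage:
            break
    return vocab

# Toy corpus: 'a' covers 90%, 'a'+'b' covers 99% ≥ 95%, so 'c' is dropped.
print(top_words_for_coverage(["a"] * 90 + ["b"] * 9 + ["c"]))
# → ['a', 'b']
```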

BSpell Architecture Hyperparameters
The SemanticNet sub-model of BSpell consists of a character level embedding layer producing a size 40 vector for each character; then 5 consecutive 1D convolution layers (with batch normalization and ReLU activation between each pair of convolution layers); and finally a 1D global max pooling layer to obtain the SemanticVec representation of each input word. The five 1D convolution layers use (64, 2), (64, 3), (128, 3), (128, 3), (256, 4) configurations, respectively, where the first and second element of each tuple denote the number of convolution filters and the kernel size, respectively. We provide a weight of 0.3 (the λ value of the loss function) to the auxiliary loss. The main branch of BSpell is similar to BERT_Base (Gong et al., 2019) in terms of stacking 12 Transformer encoders. Attention outputs from each Transformer are passed through a dropout layer (Srivastava et al., 2014) with a dropout rate of 0.3 and then layer normalized (Ba et al., 2016). We use the Stochastic Gradient Descent (SGD) optimizer with a learning rate of 0.001 for model weight updates. We clip our gradients to keep them below 5.0 to avoid the exploding gradient problem.
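To make the SemanticVec computation concrete, here is a NumPy sketch of a valid 1D convolution stack with ReLU followed by global max pooling. For brevity it uses random weights and only the first two of the five stated layers, and it omits batch normalization.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w, b):
    """Valid 1D convolution + ReLU: x is (T, C_in), w is (k, C_in, C_out)."""
    k = w.shape[0]
    T = x.shape[0] - k + 1
    out = np.stack([np.tensordot(x[t:t + k], w, axes=([0, 1], [0, 1])) + b
                    for t in range(T)])
    return np.maximum(out, 0.0)

def semantic_vec(char_matrix, layers):
    """Run the conv stack, then global max pool over character positions."""
    h = char_matrix
    for w, b in layers:
        h = conv1d(h, w, b)
    return h.max(axis=0)

# Toy dimensions: embedding size 40; two conv layers (64 filters/kernel 2,
# 64 filters/kernel 3) standing in for the five-layer stack above.
layers = [(rng.standard_normal((2, 40, 64)) * 0.1, np.zeros(64)),
          (rng.standard_normal((3, 64, 64)) * 0.1, np.zeros(64))]
vec = semantic_vec(rng.standard_normal((12, 40)), layers)   # shape (64,)
```

Because the pooling runs over positions, the resulting SemanticVec has a fixed length regardless of word length, which is what lets every word feed the same Transformer stack.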

Training and Validation Details
In case of Bangla SC, we randomly initialize the weights of model M. We use our large Bangla pretraining corpus for hybrid pretraining and obtain the pretrained model M_pre. Next, we split our benchmark synthetic spelling error dataset (Prothom-Alo) into an 80%-20% training-validation set. We fine tune M_pre using the 80% training portion (obtaining fine tuned model M_fine) and report performance on the remaining 20% validation portion. We use the Bangla real spelling error dataset in two ways: (1) we do not fine tune M_fine on any part of this data and use the entire dataset as an independent test set (result reported with the title real error (no fine tune)); (2) we split this real error dataset into 80%-20% training-validation, fine tune M_fine further using the 80% portion, then validate on the remaining 20% (result reported with the title real error (fine tuned)). In case of Hindi, the first two steps (pretraining and fine tuning) are the same.
We have not constructed any real life spelling error dataset for Hindi.So, results are reported on the 20% held out portion of the benchmark dataset.

BSpell vs Contemporary BERT Variants
We start with BERT Seq2seq, where the encoder and decoder portions each consist of 12 stacked Transformers (Devlin et al., 2018). Predictions are made at character level. A similar architecture has been used in FASPell (Hong et al., 2019) for Chinese SC. A word is considered wrong if even one of its characters is predicted incorrectly. Hence character level seq2seq modeling achieves poor results (see Table 2). Moreover, in most cases during sentence level spell checking, the correct spelling of the i-th word of the input sentence has to be the i-th word in the output sentence as well. Such a constraint is difficult to enforce with this architecture design. BERT Base, consisting of stacked Transformer encoders, has two differences from the design proposed by Cheng et al. (2020): (i) we make predictions at word level instead of character level; (ii) we do not incorporate any external knowledge about Bangla SC, since such knowledge is not well established in the field. This approach achieves good performance in all four cases. Soft Masked BERT learns to apply specialized synthetic masking on error prone words in order to push the error correction performance of BERT Base further. The error prone words are detected using a GRU sub-model and the whole architecture is trained end to end. Although Zhang et al. (2020) implemented this architecture to make corrections at character level, our implementation does everything at word level. We have used the popular FastText (Athiwaratkun et al., 2018) word representation for both BERT Base and Soft Masked BERT. BSpell shows decent performance improvement in all cases.

Comparing BSpell Pretraining Schemes
We have implemented three different pretraining schemes (details provided in Subsection 4.1) on BSpell before fine tuning on the spell checker dataset. Word masking teaches BSpell the context of a language through a fill-in-the-gaps sort of approach. SC is not all about filling in the gaps. It is also about what the writer wants to say, i.e., being able to predict a word even if some of its characters are blank (masked). Character masking takes a more drastic approach by completely eliminating the fill-in-the-gap task. This approach masks a few of the characters residing in some of the input words of the sentence and asks BSpell to predict these noisy words' original correct version. The lack of context in such a pretraining scheme has a negative effect on performance in the real error dataset experiments, where harsh errors exist and context is the only feasible way of correcting such errors (see Table 3).
Hybrid masking focuses both on filling in word gaps and on filling in character gaps through prediction of the correct word, and helps BSpell achieve SOTA performance.

BSpell vs Possible LSTM Variants
BiLSTM is a many-to-many bidirectional LSTM (two layers) that takes in all n words of a sentence at once and predicts their correct versions as output (Schuster and Paliwal, 1997). During SC, BiLSTM takes both previous and post context into consideration besides the writing pattern of each word and shows reasonable performance (see Table 4). In Stacked BiLSTM, we stack twelve many-to-many bidirectional LSTMs instead of just two. We see only marginal improvement in SC performance in spite of such a large increase in parameter count. The Attn_Seq2seq LSTM model utilizes an attention mechanism at the decoder side (Bahdanau et al., 2014). This model takes in misspelled sentence characters as input and provides the correct sequence of characters as output (Etoori et al., 2018). Under word level spelling correction evaluation, this model faces the same problems as the BERT Seq2seq model discussed in Subsection 5.2. The proposed BSpell outperforms these models by a large margin.

Ablation Study
BSpell has three unique features: (1) the secondary branch with auxiliary loss (this branch can be removed), (2) the 1D CNN based SemanticNet sub-model (can be replaced by simple Byte Pair Encoding (BPE) (Vaswani et al., 2017)) and (3) hybrid pretraining (can be replaced by word masking based pretraining). Table 5 demonstrates the results we obtain after removing any one of these features. In all cases, the results show a downward trend compared to the original architecture.

Existing Bangla Spell Checkers vs BSpell
Phonetic rule based SC takes a Bangla phonetic rule based hard coded approach (Saha et al., 2019), where a hybrid of the Soundex (UzZaman and Khan, 2004) and Metaphone (UzZaman and Khan, 2005) algorithms has been used. Clustering based SC, on the other hand, follows predefined rules on word cluster formation, distance measurement and correct word suggestion (Mandal and Hossain, 2017). Since these two SCs are not learning based, fine tuning is not applicable to them. They do not take misspelled word context into consideration while correcting a word. As a result, their performance is poor, especially on the Bangla real error dataset (see Table 6). BSpell outperforms these Bangla SCs by a wide margin. The main motivation behind the inclusion of SemanticNet in BSpell is to obtain vector representations of error words as close as possible to their corresponding correct words. We take 10 frequently occurring Bangla words and collect three real life error variations of each of these words. We produce SemanticVec representations of all 40 of these words using SemanticNet. We apply principal component analysis (PCA) (Shlens, 2014) to these SemanticVecs and plot them in two dimensions. Finally, we apply the K-Means clustering algorithm with careful initialization and K = 10 (Chen and Xia, 2009). Figure 6 shows the 10 clusters obtained from this algorithm. Each cluster consists of a popular word and its three error variations. In all cases, the correct word and its three error versions are so close in the plot that they almost form a single point.
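The visualization pipeline (SemanticVecs → PCA to 2D → clustering) can be sketched with synthetic stand-ins for the 40 vectors. Since the projection is orthogonal, error variants that are close to their correct word in the 64-dimensional space remain close in the 2D plot.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their top-2 principal components."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:2].T

# Synthetic stand-in: 10 'correct word' centers, each with 3 tight
# 'error variant' neighbors, mimicking the 40 SemanticVecs.
rng = np.random.default_rng(1)
centers = rng.standard_normal((10, 64)) * 5.0
points = np.vstack([c + rng.standard_normal((4, 64)) * 0.05 for c in centers])
proj = pca_2d(points)   # (40, 2) coordinates, ready for plotting / K-Means
```

With well-separated centers and tiny within-group noise, each group of four collapses to nearly a single point in 2D, which is the behavior reported for Figure 6.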

Conclusion
In this paper, we have proposed a SC named BSpell for the Bangla and Hindi languages. BSpell uses the SemanticVec representation of misspelled input words and a specialized auxiliary loss to enhance spelling correction performance. The model exploits the concept of hybrid masking based pretraining. We have also investigated the limitations of existing Bangla SCs as well as other SOTA SCs proposed for high resource languages. BSpell has two main limitations: (a) it cannot handle accidental merging or splitting of words and (b) it cannot correct misspelled rare words. A potential research direction is to eliminate these limitations by designing models that can perform prediction at sub-word level, including white space characters and punctuation marks.

Limitations
The BSpell model provides word-for-word correction, i.e., the number of input words and the number of output words have to be exactly the same. Unfortunately, during accidental word merging or word splitting, the number of input and output words differ, so BSpell will fail to resolve such errors. This type of error is more common in the Chinese language. The advantage for us is that this type of error is rare in Bangla and Hindi, as the words of these languages are clearly spaced in sentences. So, people will rarely perform accidental merging or splitting of words. Another limitation is that BSpell has been trained to correct only the top Bangla and Hindi words that cover 95% of the entire corpus.
As a result, this spell checker will face problems while correcting spelling errors in rare words. For such rare words, BSpell simply provides UNK as output, which means that it is not sure what to do with these words. An advantage here is that most of these rare words are some form of proper noun, which should not be corrected and should ideally be left alone. For example, someone may have an uncommon name. We do not want our model to correct that person's name to some commonly used name. An immediate research direction is to overcome the limitations of the proposed method. A straightforward way of dealing with the word merge, word split and rare word correction problems is to model spelling errors at character level (a sequence-to-sequence type approach). We have taken this trivial attempt and have failed miserably (see the performance reported in the first row of Table 2). Solving these problems while maintaining the current spelling correction performance of BSpell can be a challenge. Another interesting future direction is to investigate a personalized Bangla and Hindi spell checker that has the ability to take user personal preference and writing behaviour into account. The main challenge here is to effectively utilize user provided data that must be collected in an online setting. Recently, deep learning based automatic grammatical error correction has gained a lot of attention for the English language (Chollampatt and Ng, 2018; Chollampatt and Ng, 2017; Stahlberg and Kumar, 2021). SOTA grammar correction models developed for English can be trained and tested on Bangla and Hindi spell checking tasks as part of future research efforts. Such benchmarking studies can play a vital role in pushing the boundaries of low resource language correction automation.

Figure 1 :
Figure 1: Heterogeneous character number between an error word and the corresponding correctly spelled word. Figure 2: Example words that are correctly spelled accidentally, but are context-wise incorrect.

Figure 3 :
Figure 3: Necessity of understanding existing erroneous words for spelling correction of misspelled words

Figure 6 :
Figure 6: Visualizing SemanticVec representation of 10 popular words with their error variants

Each of the n SemanticVecs obtained from the n input words is passed in parallel to a Softmax layer without any further modification. The outputs obtained from this branch are probability vectors similar to the main branch output. The total loss of BSpell can be expressed as: L_Total = L_Final + λ × L_Auxiliary. We want the final loss to have the greater impact on model weight updates, as it is associated with the final prediction made by BSpell. Hence, we impose the constraint 0 < λ < 1. This secondary branch of BSpell does not have any Transformer encoders through which the input words can interact to produce context information. The prediction made from this branch depends solely on the misspelled word pattern extracted by SemanticNet. This enables SemanticNet to learn more meaningful word representations.
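The combined objective L_Total = L_Final + λ × L_Auxiliary can be sketched numerically with per-word cross-entropy for both branches; the probability vectors below are made-up examples, with λ = 0.3 as stated in the hyperparameter section.

```python
import numpy as np

def cross_entropy(probs, target_idx):
    """Negative log-likelihood of the target index."""
    return -np.log(probs[target_idx])

def bspell_loss(p_final, p_aux, target_idx, lam=0.3):
    """L_Total = L_Final + lambda * L_Auxiliary; 0 < lambda < 1 keeps the
    main-branch (final) loss dominant in the weight update."""
    return cross_entropy(p_final, target_idx) + lam * cross_entropy(p_aux, target_idx)

# Made-up probability vectors for one word position (vocabulary of 3).
loss = bspell_loss(np.array([0.7, 0.2, 0.1]),
                   np.array([0.5, 0.3, 0.2]), target_idx=0)
```

Because λ < 1, a confident mistake in the main branch always costs more than the same mistake in the context-free secondary branch.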

Table 2 :
Comparing BERT based variants. Typical word masking based pretraining has been used for all these variants. Real-Error (Fine Tuned) denotes fine tuning of the Bangla synthetic error dataset trained model on the real error dataset, while Real-Error (No Fine Tune) means directly validating the synthetic error dataset trained model on the real error dataset without any further fine tuning.

Table 3 :
Comparing BSpell exposed to various pretraining schemes

Table 4 :
Comparing LSTM based variants with hybrid pretrained BSpell. FastText word representation has been used with the LSTM portion of each architecture.

Table 5 :
Comparing BSpell with its variants created by removing one of its novel features