Language Models Are Suitable for Correction of Handwritten Mathematical Expression Recognition



Introduction
Handwritten Mathematical Expression Recognition (HMER), a demanding subfield of optical character recognition (OCR), sits at an interdisciplinary crossroads of computer vision, pattern recognition, and natural language processing (NLP). Advances in deep learning have notably enhanced the effectiveness of HMER, driving its adoption in diverse arenas, including intelligent education. Nonetheless, the precision of these technologies is continuously challenged by the inherent ambiguity of handwritten characters and the complexity of mathematical formulas. These hurdles underscore the pivotal role that NLP could play in enhancing the robustness of current visual models grappling with these issues.
The encoder-decoder architecture is the prevalent approach to HMER: it recasts the problem as an image-to-sequence translation task, converting a handwritten formula image into a LaTeX markup sequence. In contrast to traditional OCR tasks, the two-dimensional structure of handwritten formulas necessitates an approach that does not rely on direct segmentation. Since Zhang et al. (2017) introduced an RNN decoder with attention, subsequent work has concentrated on enhancing the accuracy of visual attention (Zhao et al., 2021; Bian et al., 2022; Li et al., 2022). Currently, various tree decoders and methods of syntactic analysis, such as Zhang et al. (2020) and Yuan et al. (2022), are employed to analyze the expression structure and the relations among symbols.
While structure-focused methods have undeniably enriched recognition model capabilities, they have also introduced two notable challenges: 1) They rely on complex Context-Free Grammars (CFGs) and require the LaTeX markup sequence to be pre-transformed into specific tuple representations, which limits their extensibility. 2) The issue of visual ambiguity remains unaddressed. Tree decoders pay little attention to context when predicting triples and are often unable to distinguish symbols such as '2' and 'z'. Ung et al. (2021) employ a language model (LM) for post-correction, but Gupta et al. (2021) underline the inherent risk of relying wholly on an LM to correct low-redundancy information, such as numbers, which is particularly susceptible to biases introduced by probabilistic skewing.
However, as a formal language designed for mathematical structures, LaTeX mathematical expressions possess unique language characteristics. We believe that a separate, upfront structural analysis may not be a prerequisite for capturing the complicated structures of LaTeX mathematical expressions when standard NLP methods are applied.
Two key characteristics of LaTeX mathematical expressions are: 1) They have explicit structural symbols. 2) Their minimal units are symbols, which are directly assigned specific semantics. Based on this, we propose that LMs can provide both structural and semantic information, making them suitable for the correction of HMER. An in-depth theoretical and statistical exploration of this perspective is presented in Section 3.
Specifically, regarding the current limitations of structure-focused methods, we believe that: 1) Structural information can be represented by the semantics of structural symbols given the context provided by mathematical notations. This circumvents the need for complex CFGs to generate triplet data.
2) Character ambiguity between different types can be rectified with the context provided by structural symbols and mathematical notations.
We substantiate our propositions through experiments on HMER correction. Specifically, we deploy a math LM to rectify the output of an HMER model based on unstructured methods, demonstrating the suitability of LMs for addressing the current limitations. Additionally, we argue against sole reliance on LMs in a post-correction setting: by leveraging information from the recognition model, we can constrain the correction space.
Finally, we propose our architecture, called the Recognition and Language Fusion Network (RLFN), which integrates recognition and language features to output corrected sequences and is optimized jointly with the recognition model.

Related Work

Encoder-decoder based methods are led by Deng et al. (2017) and Zhang et al. (2017). Building on a CNN encoder and an RNN decoder, Deng et al. (2017) design a coarse-to-fine process, while Zhang et al. (2017) design coverage attention to avoid over-parsing and under-parsing. Zhang et al. (2018) use DenseNet (Huang et al., 2017) as the encoder and introduce a decoder with multi-scale attention. Wu et al. (2020) integrate left-to-right attention to simulate the progressive nature of human perception. Wang et al. (2019) use multi-modal attention to fully utilize both online and offline information. Zhao et al. (2021) replace the RNN-based decoder with a bidirectionally trained transformer, leading to enhanced global coverage and parallelization. Bian et al. (2022) apply mutual learning to enhance bidirectional learning and design multi-scale coverage attention for longer expressions.
Several works focus on the tree structure of math expressions. Zhang et al. (2020) regard the expression as a tree represented by triples of parent, child, and relation, and design a tree decoder to predict each triple. Building on this work, Zhong et al. (2022) expand symbol prediction into attribute prediction and position prediction, and propose a transformer-based decoder to predict triples. Yuan et al. (2022) utilize grammar-constrained attention to transform the whole image into a parse tree. Wu et al. (2022a) add thinking attention to the tree decoder, assisted by a pixel-level auxiliary loss, to improve recognition of complex expressions. Wu et al. (2022b) design a structural string representation, attempting to utilize both a language model and the tree structure. These structured representations are specifically designed, which limits their extensibility.

HMER & OCR Correction
Limited research has been done on HMER correction. Chan and Yeung (2001) detect and correct errors based on grammar and heuristic rules. Ung et al. (2021) train and apply a language model for post-correction. More correction work has been done on OCR. Litman et al. (2020) repetitively correct recurrent block outputs by fusing them with visual features at each training step. Other works use LMs to assist correction. Nguyen et al. (2020) use BERT for error detection and then a neural machine translator to correct errors. Qiao et al. (2020) use the pre-trained FastText (Joulin et al., 2017) to supervise the generation of semantic features, fusing them with encoder visual features to capture global semantic information. Gupta et al. (2021) utilize the perplexity of a language model to choose among the outputs of multiple aligned models, and stress that the correction of numbers requires an extra reliable information source. Yasunaga et al. (2021) adopt unsupervised correction by comparing the logits of LM outputs with local perturbations of the text. Fang et al. (2021, 2023) explicitly use a built-in bidirectional LM to iteratively correct the output.

Several math LMs are pretrained jointly on text and LaTeX expressions, which is potentially beneficial for HMER. Novotný and Štefánik (2022) design MathBERTa based on RoBERTa (Liu et al., 2019), with a soft vector space model to capture the semantic similarity between tokens. Peng et al. (2021) improve the prediction of masked formula substructures extracted from the Operator Tree (OPT). Scarlatos and Lan (2023) make multiple modifications to the GPT-2 (Radford et al., 2019) model, resulting in MathGPT, which exhibits strong performance in generating mathematical expressions. Our method utilizes MathBERTa to provide auxiliary information for correction.
Why Is LM Suitable for HMER Correction?

Theoretical Analysis
As a formal language designed specifically for representing complex mathematical symbols and formulas, LaTeX mathematical expressions contain symbols that can be broadly divided into four distinct categories: 1) structural symbols (_, ^, {, }, \{, \}), 2) mathematical notations (e.g., \frac, \sqrt, +, -), 3) Latin and Greek alphabets (e.g., A, a, α), and 4) numbers. The key differences from English can be summarized in two points: 1) The structural symbols in LaTeX mathematical expressions explicitly convey their structure, and certain mathematical notations serve to assist in this structural representation. This mechanism of structural representation shares fundamental similarities with the way words within phrases in English explicitly communicate their semantics. This characteristic enables math LMs to provide structural information synchronously, just as they provide semantic information.
As Figure 1 shows, '\frac' and '\sqrt' represent the fraction line and the radical symbol themselves while providing context. The following '{' then respectively represents the beginning of the numerator and of the radicand. In the English case, 'parse' relies on different contexts to express different semantics in the sentences 'parse a sentence' and 'parse words'.
2) LaTeX mathematical expressions treat symbols as minimal semantic units, and based on contextual semantics, visual ambiguities between categories can be corrected. Though there is no difference between using 'x' or 'y' as an unknown variable, the semantic distinction between an unknown variable and a number is significant. For instance, '\sqrt[2]{x}' is reasonable, but '\sqrt[z]{x}' is not in line with convention; we would write it as 'x^{1/z}' instead. Similarly, when we trust the 'z', we tend to believe that the expression is not a radical expression.
From the perspective of an analogy with English, Figure 2 illustrates why symbols of different categories in LaTeX mathematical expressions have semantic differences given structural relationships. In English, sentence-level semantic errors caused by certain types of words can be detected through semantic relationships. According to WordNet (Fellbaum, 2005), 'wagon' and 'vehicle' both have an 'is-a' relation with 'wheeled vehicle'. Given the 'has-part' relationship, the semantic error in the sentence 'wheeled vehicle has a vehicle' can be discovered. Similarly, the symbol '[ ]' represents the 'root-index' in the cases of '\sqrt[2]{x}' and '\sqrt[3]{x}'; '\sqrt[z]{x}' uses an unknown variable as the 'root-index', which is generally unconventional. Moreover, words that represent semantic relationships, such as 'multiply' and 'multiplied', are not explicitly stated in the cases of 'xyz' and 'xyx'. In general, given the multiplicand (the left term of the multiplication), an unknown number is expected as the multiplier, or semantic errors may occur, as in the case of 'xy2'.
In addition to the language characteristics that make LMs suitable for HMER correction, the task itself is also suitable. While OCR correction employs letters as the smallest unit for correcting words, HMER utilizes symbols as the minimal unit for amending expressions. The former is morphological correction, while the latter is semantic correction.

Statistical Analysis
We conducted a statistical analysis as collateral evidence on the CROHME dataset (Mouchère et al., 2014). The CROHME dataset, a byproduct of the Competition on Recognition of Online Handwritten Mathematical Expressions (CROHME), is widely recognized as the principal public dataset in the HMER field. The CROHME training set comprises 8835 handwritten mathematical expressions. In addition, it includes three distinct test subsets, CROHME 2014, 2016, and 2019, containing 986, 1147, and 1199 handwritten mathematical expressions respectively. There are 111 symbol classes in total, including the 'sos' and 'eos' symbols.
The explicit structural symbols do express their semantics. This is affirmed by applying the Math-aware BERT model (Reusch et al., 2022) to the four categories of symbols in the CROHME training set, where the calculated perplexity acts as an index of semantic strength.
In detail, we mask each symbol individually and let the model predict the obfuscated symbol. The outcome is a probability distribution; the reciprocal of the probability assigned to the actual symbol is its perplexity. The mean perplexity per category is then designated as the perplexity of that category of symbols. Results in Table 1 reveal that among the four types of symbols, the structural symbols exhibit the lowest perplexity. This finding aligns with our theoretical analysis that the explicit structural symbols carry a comparatively strong semantic signal.
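The per-category perplexity measurement can be sketched as follows. Here `masked_prob` is a hypothetical stand-in for a masked-LM forward pass (the actual analysis queries Math-aware BERT); its toy probabilities exist only to make the example self-contained.

```python
# Sketch of the per-category perplexity measurement: mask each symbol,
# take the reciprocal of the probability the LM assigns to the true
# symbol, and average per category.

from collections import defaultdict

def masked_prob(tokens, pos):
    # Hypothetical stand-in for a masked-LM forward pass; a real system
    # would run Math-aware BERT with position `pos` masked.
    return {"{": 0.5, "}": 0.3, "\\frac": 0.1, "x": 0.05, "2": 0.05}

def category_perplexity(sequences, category_of):
    """Mean reciprocal probability of the true symbol, grouped by category."""
    sums, counts = defaultdict(float), defaultdict(int)
    for tokens in sequences:
        for pos, tok in enumerate(tokens):
            p = masked_prob(tokens, pos).get(tok, 1e-9)
            cat = category_of(tok)
            sums[cat] += 1.0 / p  # perplexity of this single prediction
            counts[cat] += 1
    return {c: sums[c] / counts[c] for c in sums}

STRUCTURAL = {"_", "^", "{", "}", "\\{", "\\}"}
cat = lambda t: "structural" if t in STRUCTURAL else "other"
print(category_perplexity([["\\frac", "{", "x", "}"]], cat))
```

With the toy probabilities above, the structural symbols come out with the lower mean perplexity, mirroring the pattern reported in Table 1.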

The visual ambiguity between numbers and alphabets does exist. We analyzed the results of the string-decoder recognition model DWAP (Zhang et al., 2018) on the CROHME 2014/2016/2019 datasets. For this analysis, we employed a reliable metric called Substitute-by-One (SUB1), which identifies cases where the model's prediction deviates from the ground truth by only one substitution. Within SUB1 cases, the mis-recognition of one character as another can be verified unambiguously, avoiding the indeterminacy of character substitution in evaluation.
The outcomes are shown in Table 2. Among all SUB1 instances, the top 5 pairs of mis-recognition between numbers and alphabets contribute 11% of the total mis-recognition, while overall mis-recognition between numbers and alphabets contributes 26%. These statistical findings highlight that mis-recognition between numbers and letters not only occurs considerably often but also tends to concentrate on visual ambiguity. Overcoming the visual ambiguity issues mentioned in the theoretical analysis is therefore significant.
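The SUB1 criterion described above can be sketched as follows; the helper names are ours, chosen for illustration.

```python
# Minimal sketch of the Substitute-by-One (SUB1) metric: a prediction is
# a SUB1 case when it matches the ground truth except for exactly one
# substituted symbol, which pins down the confused pair unambiguously.

from collections import Counter

def sub1_pair(pred, gt):
    """Return the (gt_symbol, pred_symbol) substitution if `pred` is a
    SUB1 case of `gt`, otherwise None."""
    if len(pred) != len(gt):
        return None
    diffs = [(g, p) for g, p in zip(gt, pred) if g != p]
    return diffs[0] if len(diffs) == 1 else None

def top_confusions(predictions, ground_truths):
    """Count confused pairs across a corpus, most frequent first."""
    pairs = Counter()
    for p, g in zip(predictions, ground_truths):
        d = sub1_pair(p, g)
        if d is not None:
            pairs[d] += 1
    return pairs.most_common()

# e.g. a '2' mis-recognized as 'z'
print(sub1_pair(["z", "+", "1"], ["2", "+", "1"]))  # ('2', 'z')
```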

Method
In an endeavor to empirically corroborate our hypotheses, we propose a novel architecture, the Recognition and Language Fusion Network (RLFN), engineered specifically to address the dual challenges of visual ambiguity and complex structure that are inherent to string-decoder recognition models. RLFN consists of a string-decoder recognition module, a language module that extracts language information, and a fusion module that refines the recognition output using the language information.

Recognition Module
Our recognition module basically follows DWAP (Zhang et al., 2018), using DenseNet (Huang et al., 2017) to extract a visual feature F ∈ R^{H′×W′×D} from the single-channel input image.
As shown in Figure 3, at each step t of the decoder, we iteratively update two states: the GRU (Cho et al., 2014) hidden state h_t and the coverage attention (cumulative attention map) A_t.
α_{i,j;t} = exp(e_{i,j;t} − max_{i,j}(e_{i,j;t})) / Σ_{i,j} exp(e_{i,j;t} − max_{i,j}(e_{i,j;t}))   (3)

Here, e_t represents the attention score that produces the attention map α_t, where i, j denote the coordinates on the feature map. Eγ_{t−1} represents the embedding of the last symbol, and F_p corresponds to the position-encoded feature (Parmar et al., 2018). W_e, W_A, and W_h represent the trainable weights. After that, we generate the content vector c_t through element-wise multiplication of α_t with the visual feature F. Then c_t is combined with h_t and the embedding of γ_{t−1} to obtain the symbol state s_t and the recognition prediction symbol γ_t.
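Eq. (3) is a numerically stable softmax over the 2-D attention scores: subtracting the maximum leaves the result unchanged but prevents overflow in the exponential. A minimal sketch:

```python
# Sketch of the attention normalization in Eq. (3): a numerically stable
# softmax over a 2-D map of attention scores e[i][j].

import math

def attention_map(scores):
    """scores: 2-D list of attention scores e[i][j]; returns alpha[i][j]."""
    m = max(max(row) for row in scores)  # max-subtraction for stability
    exps = [[math.exp(e - m) for e in row] for row in scores]
    z = sum(sum(row) for row in exps)
    return [[v / z for v in row] for row in exps]

alpha = attention_map([[1.0, 2.0], [3.0, 4.0]])
assert abs(sum(map(sum, alpha)) - 1.0) < 1e-9  # a valid distribution
```

The location with the highest score receives the largest attention weight, exactly as an unnormalized softmax would give, but without risking overflow for large scores.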
W_c, W_γ, W′_h, w, and b are trainable weights. Lastly, the recognition module outputs the full sequence of symbol states s as the recognition feature F_R, along with the recognition prediction sequence γ.

Language Module
We utilize MathBERTa (Novotný and Štefánik, 2022), a RoBERTa (Liu et al., 2019) model specifically fine-tuned on LaTeX expressions, to extract language information.
As a variant of BERT (Devlin et al., 2019), RoBERTa focuses solely on the masked language modeling (MLM) task, uses larger mini-batches, and employs dynamic masking, all of which improve bidirectional semantic language modeling.
Building upon RoBERTa, MathBERTa further focuses on the language processing of LaTeX expressions. It is fine-tuned on an extensive dataset containing both text and LaTeX expressions. This specialized training enhances MathBERTa's comprehension of the semantic and syntactic properties of LaTeX mathematical expressions.
Considering that our recognition output requires no tokenization, and that LaTeX mathematical notations are prone to tokenization problems, we manually associate the vocabulary with the one-hot encoding in MathBERTa instead of using the tokenizer. Given the recognition prediction sequence γ, MathBERTa then outputs the language feature F_L.
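The manual vocabulary association can be sketched as a direct symbol-to-id lookup; the id values below are hypothetical placeholders, not MathBERTa's actual vocabulary ids.

```python
# Sketch of the manual vocabulary association: instead of running the LM
# tokenizer (which may split notations such as '\frac' into sub-tokens),
# each recognition symbol maps directly to a single LM vocabulary id.
# All id values here are hypothetical, for illustration only.

SYMBOL_TO_LM_ID = {
    "\\frac": 1001,
    "\\sqrt": 1002,
    "{": 30,
    "}": 31,
    "x": 87,
    "2": 22,
}

def encode(symbols, unk_id=3):
    """Convert a recognition prediction sequence to LM input ids."""
    return [SYMBOL_TO_LM_ID.get(s, unk_id) for s in symbols]

print(encode(["\\frac", "{", "x", "}"]))  # [1001, 30, 87, 31]
```

This guarantees a one-to-one alignment between recognition steps and LM positions, which a sub-word tokenizer would not preserve.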

RLFN
As shown in Figure 4, in RLFN the input image is sent to the recognition module to extract the recognition feature and the prediction sequence. The latter is passed to the language module to form the language feature. Both features are fused in the fusion module to output the corrected prediction sequence.
The main objective of the fusion module is to integrate the information from the recognition module with the semantic information obtained through the language module, so as to generate the corrected prediction sequence.
In the fusion module, the recognition feature F_R and the language feature F_L are first dimensionally aligned through linear operations. We then follow Yao et al. (2017) and incorporate a gating neuron, denoted σ, which assigns weights to the contributions of the two features when computing the output. Within the gating neuron, the two aligned features are concatenated, and the concatenated vector is projected to the size of the aligned features. The projected vector is fed into a sigmoid function to generate weights, which then modulate the aligned features after a tanh activation.
The process of generating the corrected sequence y in the fusion module is as follows:

g = σ(w′ [W_R F_R; W_L F_L] + b′)
y = W_F (g ⊙ tanh(W_R F_R) + (1 − g) ⊙ tanh(W_L F_L))

where σ refers to the sigmoid activation function, ⊙ denotes element-wise multiplication, [;] denotes concatenation, and W_R, W_L, W_F, w′, b′ are weights to be learned.
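The gating mechanism can be sketched for a single time step with already dimension-aligned feature vectors; the weight values are toy numbers, whereas in RLFN they are learned parameters.

```python
# Sketch of the gated fusion for one time step: a sigmoid gate computed
# from the concatenated features weighs the tanh-activated recognition
# feature against the tanh-activated language feature, per dimension.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fuse(f_r, f_l, w_gate, b_gate):
    """f_r, f_l: aligned recognition / language features (same length).
    w_gate: one weight row per output dimension over the concatenation."""
    concat = f_r + f_l
    gate = [sigmoid(sum(wi * ci for wi, ci in zip(row, concat)) + b_gate)
            for row in w_gate]  # one gate value per feature dimension
    return [g * math.tanh(r) + (1 - g) * math.tanh(l)
            for g, r, l in zip(gate, f_r, f_l)]

f_r, f_l = [0.5, -1.0], [1.0, 0.2]
w_gate = [[0.1, 0.1, 0.1, 0.1], [0.2, 0.0, 0.0, 0.2]]  # toy gate weights
fused = gated_fuse(f_r, f_l, w_gate, b_gate=0.0)
```

Each fused component is a convex combination of the two tanh-activated features, so the gate smoothly interpolates between trusting the visual evidence and trusting the language context.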

Parameter Learning
We jointly optimize our RLFN with the recognition module through a linear classification layer, using the loss function

L = L_R + L_F

where L_R and L_F are the cross-entropy losses of the recognition prediction sequence probabilities and the corrected prediction sequence probabilities with respect to the ground truth. L_R guides the recognition module and the additional linear classification layer, while L_F guides the fusion of the two features. To mitigate the influence of the training set's probability bias and the large number of parameters, we freeze the language model's parameters. Furthermore, gradient separation is employed so that each loss focuses on its respective optimization goal within the recognition model. Nevertheless, joint optimization still lets the components influence each other's updates through the optimizer.
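The joint objective can be sketched as the sum of two per-step cross-entropy terms; equal weighting of the two losses is an assumption here, made only for illustration.

```python
# Sketch of the joint objective: cross-entropy over the recognition
# prediction plus cross-entropy over the corrected (fused) prediction.
# Equal weighting of the two terms is an assumption for this sketch.

import math

def cross_entropy(step_probs, targets):
    """step_probs: list of per-step distributions (dict: symbol -> prob)."""
    return -sum(math.log(p[t]) for p, t in zip(step_probs, targets)) / len(targets)

def joint_loss(rec_probs, fused_probs, targets):
    l_r = cross_entropy(rec_probs, targets)    # guides recognition branch
    l_f = cross_entropy(fused_probs, targets)  # guides feature fusion
    return l_r + l_f

probs = [{"x": 0.9, "z": 0.1}]
print(joint_loss(probs, probs, ["x"]))
```

In training, the language model's parameters stay frozen, so gradients from both terms flow only into the recognition module, the fusion module, and the classification layer.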

Implementation Details
Our RLFN is implemented in PyTorch on a single NVIDIA GeForce RTX 3090. We use the Adadelta optimizer (Zeiler, 2012); the learning rate increases from 0 to 1 over the first epoch and then decays to 0 following a cosine schedule (Zhang et al., 2019b). No data augmentation is used, for fair comparison. The batch size is set to 8. All images within a batch are placed in the upper-left corner of a canvas of the same size. Due to memory limitations, the canvas size does not exceed 1280 × 280, and any larger images are discarded. Training runs for 200 epochs, taking around 16 hours.

Ablation Study

To separate the impact of the language module from the fusion module, we did not use the fusion module in RLFN. Instead, to generate the corrected prediction sequence, we treat correction as a translation task and add a decoder with two transformer layers to the language module. All other settings are identical to RLFN. Results on CROHME 2014 are shown in Table 4; we can only conclude that the language module and the fusion module both have an impact.

Improvement Study
In this section, we explore whether the improvement comes from the correction of complex structures and visual ambiguity, to validate our proposition that LMs obtain semantic and structural information synchronously and complementarily.
We conduct an incremental comparison on structural complexity to assess the improvement of our model on complex expressions across all datasets. We define the structural complexity of an expression as the count of the six structural symbols mentioned in Section 3.1. The results are presented in Table 5. Models perform worse on more complex expressions, suggesting they are more challenging to recognize. Interestingly, the relative improvement grows progressively for more complex expressions: the improvement on the top 25% most complex expressions is nearly twice the improvement across all expressions. These observations indicate that RLFN outperforms our baseline especially in recognizing complex structures, which may be because RLFN can extract explicit structural information through the LM, serving a role similar to tree decoders.
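The structural-complexity measure can be sketched as a simple count over the tokenized expression; counting is done over symbols rather than characters, so '\{' is never confused with '{'.

```python
# Sketch of the structural-complexity measure used in the incremental
# comparison: the count of the six structural symbols (Section 3.1) in a
# tokenized LaTeX expression.

STRUCTURAL = {"_", "^", "{", "}", "\\{", "\\}"}

def structural_complexity(symbols):
    """Count structural symbols in a symbol (token) sequence."""
    return sum(1 for s in symbols if s in STRUCTURAL)

# \frac{x}{y}^{2} tokenized at the symbol level
expr = ["\\frac", "{", "x", "}", "{", "y", "}", "^", "{", "2", "}"]
print(structural_complexity(expr))  # 7
```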
Regarding RLFN's performance in handling visual ambiguity, we replicate the analysis described in Section 3.2. Specifically, we compare the frequency of the top 5 number-alphabet mis-recognition pairs against our baseline and study the difference, which is shown in Table 6. We observe that RLFN reduces the occurrence of the top-5 mis-recognitions by 21% compared to our baseline. This shows RLFN's capability to reduce visual ambiguity between alphabets and numbers using the contextual information provided by language modeling.

Case Study
As shown in Figure 5, we present two complex structure cases in group A, along with another two visual ambiguity cases in group B.
In group A, the baseline model recognizes an additional structural symbol '\}' in one case and misplaces '\}' in the other. In contrast, RLFN recognizes both expressions correctly. This is not a symbol recognition problem, which indicates that RLFN has learned the semantics of structural symbols and gained the ability of structural analysis.
In group B, the cases exhibit visual ambiguity between variables and numbers. The baseline model relies solely on visual appearance and cannot distinguish visually similar symbols, likely because it lacks an architecture that effectively utilizes and comprehends contextual and structural information. RLFN, on the contrary, correctly recognizes '\times' from the surrounding numbers and recognizes '1' and '0' based on their superscript and subscript structural relations. This indicates that RLFN can use semantic and structural information complementarily when recognizing visually ambiguous symbols.
As depicted in Figure 6, we examine a particular case in terms of its top-5 probabilities. The probabilities derived from the baseline demonstrate visual ambiguity among the symbols 'x', 'X', and '\times' without an understanding of their semantics. After determining that this may be a multiplication structure, RLFN reduces the probabilities associated with 'x' and 'X' while recognizing the correct '\times' symbol with high confidence.

Conclusion
In this research, we have examined the unique language characteristics intrinsic to LaTeX mathematical expressions, with a keen focus on the minimal semantic unit and explicit structural symbols. Our investigation underscores that these characteristics give HMER systems the potential to obtain both semantic and structural information through language models. We subsequently propose an architecture that integrates recognition and language features to yield corrected sequences. This framework eliminates the need to construct complex CFGs to resolve structural issues, and it ameliorates the challenge of visual ambiguity. This integrative approach offers fresh insights and a promising theoretical groundwork for the development of HMER and related mathematical endeavors.

Limitations
The limitations of our theoretical assessment warrant acknowledgment. In our analysis, we scrutinized the language characteristics of LaTeX mathematical expressions, drawing parallels between their expression mechanisms and those of the English language. This led us to posit that an LM adept at handling English semantics should, in theory, be equally proficient with LaTeX mathematical expressions. However, our methodology, rooted in inferential analogy, is weaker than directly analyzing how LMs handle LaTeX mathematical expressions, and it cannot be extended further, for example to customizing an LM for LaTeX mathematical expressions.
Our proposed model architecture is not devoid of certain limitations.The architectural design broadly follows a late fusion strategy, which, when contrasted with the early fusion approach seen in semantic modeling and modal fusion during the decoding phase of the recognition module, exhibits a lack of thorough information interaction.This shortfall is exemplified by our model's disregard for the prediction probability of the recognition sequence input to the language module, resulting in some information loss.
Besides, given the current state of the field, where most existing recognition models rely heavily on tree decoders and bidirectional training architectures, triplet data and reverse sequences are not suitable for language modeling.This limitation confines the range of selectable baseline models.Notwithstanding, one of our overarching goals in this endeavor is to maneuver around this intrinsic constraint that inherently stifles expansion.
Furthermore, as the formidable capabilities of LLMs and LMMs/MLLMs are widely researched, some methods can even achieve OCR-free understanding of text images. This casts doubt on the significance of designing model architectures for specific tasks. However, is there a real need for a general large model in a specific task? When the accuracy requirements are stringent, how does the upper limit of a general large model compare with that of a small model tailored to a specific task? Or is it the case that data is truly everything? These questions still require deep consideration.

Figure 1: Cases where structural symbols explicitly convey their structure in the context of mathematical notations. As in English, the same word has different semantics in different contexts.

Figure 2: Cases where semantic problems within different types of symbols can be found using the structure and mathematical notation. As in English, given the semantic relationships in WordNet, it can be determined whether these types of words are correct or incorrect.

Figure 4: Architecture of the Recognition and Language Fusion Network (RLFN).

Figure 5: Cases with complex structure (A) and visual ambiguity (B) where RLFN outperforms our baseline.

Table 5: Incremental examination of the top 25%, 50%, and all expressions based on structural complexity in the unified CROHME 2014/2016/2019 dataset. RI denotes the relative improvement from DWAP to RLFN.

Table 6: Relative change of the top 5 mis-recognitions.