Advancements in Arabic Grammatical Error Detection and Correction: An Empirical Investigation

Grammatical error correction (GEC) is a well-explored problem in English with many existing models and datasets. However, research on GEC in morphologically rich languages has been limited due to challenges such as data scarcity and language complexity. In this paper, we present the first results on Arabic GEC by using two newly developed Transformer-based pretrained sequence-to-sequence models. We address the task of multi-class Arabic grammatical error detection (GED) and present the first results on multi-class Arabic GED. We show that using GED information as auxiliary input in GEC models improves GEC performance across three datasets spanning different genres. Moreover, we also investigate the use of contextual morphological preprocessing in aiding GEC systems. Our models achieve state-of-the-art results on two Arabic GEC shared tasks datasets and establish a strong benchmark on a newly created dataset.


Introduction
English grammatical error correction (GEC) has witnessed significant progress in recent years due to increased research efforts and the organization of several shared tasks (Ng et al., 2013, 2014; Bryant et al., 2019). Most state-of-the-art (SOTA) GEC systems borrow modeling ideas from neural machine translation (MT) to translate from erroneous to corrected texts. In contrast, grammatical error detection (GED), which focuses on locating and identifying errors in text, is usually treated as a sequence labeling task. Both tasks have evident pedagogical benefits to native (L1) and foreign (L2) language teachers and students. Moreover, modeling GED information explicitly within GEC systems yields better results in English (Yuan et al., 2021).
When it comes to morphologically rich languages, GEC and GED have not received as much attention, largely due to the lack of datasets and standardized error type annotations. Specifically for Arabic, the focus on GEC started with the QALB-2014 (Mohit et al., 2014) and QALB-2015 (Rozovskaya et al., 2015) shared tasks; however, recent sequence-to-sequence (Seq2Seq) modeling advances have not been explored much in Arabic GEC. Moreover, multi-class Arabic GED has not been investigated due to the lack of error type information in Arabic GEC datasets. In this paper, we address these challenges.¹

¹ https://github.com/CAMeL-Lab/arabic-gec

Related Work

GEC Approaches English GEC modeling progressed from statistical MT approaches (Felice et al., 2014; Junczys-Dowmunt and Grundkiewicz, 2014, 2016) to neural MT approaches (Yuan and Briscoe, 2016; Xie et al., 2016; Junczys-Dowmunt et al., 2018; Watson et al., 2018), with Transformer-based models being the most dominant (Yuan et al., 2019; Zhao et al., 2019; Grundkiewicz et al., 2019; Katsumata and Komachi, 2020; Yuan and Bryant, 2021).
More recently, edit-based models have been proposed to solve GEC (Awasthi et al., 2019; Malmi et al., 2019; Stahlberg and Kumar, 2020; Mallinson et al., 2020; Omelianchuk et al., 2020; Straka et al., 2021; Mallinson et al., 2022; Mesham et al., 2023). While Seq2Seq models generate corrections to erroneous input, edit-based models generate a sequence of corrective edit operations. Edit-based models add explainability to GEC and improve inference time efficiency. However, they generally require human engineering to define the size and scope of the edit operations (Bryant et al., 2023).
GED Approaches Rei and Yannakoudakis (2016) presented the first GED results using a neural approach, framing GED as a binary (correct/incorrect) sequence tagging problem. Others used pretrained language models (PLMs) such as BERT (Devlin et al., 2019), ELECTRA (Clark et al., 2020), and XLNet (Yang et al., 2019) to improve binary GED (Bell et al., 2019; Kaneko and Komachi, 2019; Yuan et al., 2021; Rothe et al., 2021). Zhao et al. (2019) and Yuan et al. (2019) demonstrated that combining GED and GEC yields improved results: they used multi-task learning to add token-level and sentence-level GED as auxiliary tasks when training for GEC. Similarly, Yuan et al. (2021) showed that binary and multi-class GED improve GEC.
Arabic GEC modeling efforts ranged from feature-based ML classifiers to statistical MT models (Rozovskaya et al., 2014; Bougares and Bouamor, 2015; Nawar, 2015). Watson et al. (2018) introduced the first character-level Seq2Seq model and achieved SOTA results on the L1 Arabic GEC data used in the QALB-2014 and 2015 shared tasks. Recently, vanilla Transformers (Vaswani et al., 2017) were explored for synthetic data generation to improve L1 Arabic GEC and were tested on the L1 data of the QALB-2014 and 2015 shared tasks (Solyman et al., 2021, 2022, 2023). To the best of our knowledge, the last reported results on the QALB-2015 L2 data were those presented in the shared task itself. We compare our systems against the best previously developed models whenever feasible.
A number of researchers reported on Arabic binary GED. Habash and Roth (2011) used feature-engineered SVM classifiers to detect Arabic handwriting recognition errors. Alkhatib et al. (2020) and Madi and Al-Khalifa (2020) used LSTM-based classifiers. None of them used any of the publicly available GEC datasets mentioned above to train and test their systems. In our work, we explore multi-class GED by obtaining error type annotations from ARETA (Belkebir and Habash, 2021), an automatic error type annotation tool for MSA. To our knowledge, we are the first to report on Arabic multi-class GED. We report on publicly available data to enable future comparisons.

Arabic Linguistic Facts
Modern Standard Arabic (MSA) is the official form of Arabic primarily used in education and media across the Arab world. MSA coexists in a diglossic (Ferguson, 1959) relationship with local Arabic dialects that are used for daily interactions. When native speakers write in MSA, there is frequent code-mixing with the dialects in terms of phonological, morphological, and lexical choices (Habash et al., 2008). In this paper, we focus on MSA GEC. While its orthography is standardized, written Arabic suffers from many orthographic inconsistencies, even in professionally written news articles (Buckwalter, 2004; Habash et al., 2012). For example, hamzated Alifs (Â, Ǎ)² are commonly confused with the un-hamzated letter (A), and the word-final letters y and ý are often used interchangeably. These errors affect 11% of all words (4.5 errors per sentence) in the Penn Arabic Treebank (Habash, 2010). Additionally, the use of punctuation in Arabic is very inconsistent, and omitting punctuation marks is very frequent (Awad, 2013; Zaghouani and Awad, 2016). Punctuation errors constitute ∼40% of errors in the QALB-2014 GEC shared task, ten times the rate of punctuation errors found in the English data used in the CoNLL-2013 GEC shared task (Ng et al., 2013). Arabic has a large vocabulary resulting from its rich morphology, which inflects for gender, number, person, case, state, mood, voice, and aspect, and cliticizes numerous particles and pronouns. Arabic's diglossia, orthographic inconsistencies, and morphological richness pose major challenges to GEC models.
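To make the two orthographic confusions above concrete, the following is a minimal illustrative sketch (not from the paper) that simulates them on a word: dropping the hamza on Alif, and swapping a word-final ya for alif maqsura. The function name and mappings are hypothetical.

```python
# Illustrative only: simulates the common MSA orthographic errors described above.
HAMZA_CONFUSIONS = {
    "\u0623": "\u0627",  # Alif with hamza above (Â) -> bare Alif (A)
    "\u0625": "\u0627",  # Alif with hamza below (Ǎ) -> bare Alif (A)
}

def introduce_common_confusions(word: str) -> str:
    """Simulate dropping the hamza on Alif and swapping
    word-final ya (U+064A) with alif maqsura (U+0649)."""
    word = "".join(HAMZA_CONFUSIONS.get(ch, ch) for ch in word)
    if word.endswith("\u064A"):  # final ya -> alif maqsura
        word = word[:-1] + "\u0649"
    return word

# e.g., a word starting with a hamzated Alif loses its hamza:
print(introduce_common_confusions("\u0623\u0643\u0644"))  # أكل -> اكل
```

Such confusions are exactly the single-character edits that dominate the L1 orthographic error distributions discussed below.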

Arabic GEC Data
We report on three publicly available Arabic GEC datasets. The first two come from the QALB-2014 (Mohit et al., 2014) and QALB-2015 (Rozovskaya et al., 2015) shared tasks. The third is the newly created ZAEBUC dataset (Habash and Palfreyman, 2022). None of them were manually annotated for specific error types. Table 1 presents the corpus statistics of the three datasets; ZAEBUC documents are annotated with CEFR proficiency levels (Council of Europe, 2001). Since the ZAEBUC dataset did not have standard splits, we randomly split it into Train (70%), Dev (15%), and Test (15%), while keeping a balanced distribution of CEFR levels.
The three sets vary in a number of dimensions: domain, level, number of words, percentage of erroneous words, and types of errors. Appendix C presents automatic error type distributions over the training portions of the three datasets. Orthographic errors are more common in the L1 datasets (QALB-2014 and ZAEBUC) than in the L2 dataset (QALB-2015). In contrast, morphological, syntactic, and semantic errors are more common in QALB-2015. Punctuation errors are more common in QALB-2014 and QALB-2015 compared with ZAEBUC.

² Arabic HSB transliteration (Habash et al., 2007).

Metrics for GEC and GED
GEC systems are most commonly evaluated using reference-based metrics such as the MaxMatch (M²) scorer (Dahlmeier and Ng, 2012), ERRANT (Bryant et al., 2017), and GLEU (Napoles et al., 2015), among other reference-based and reference-less metrics (Felice and Briscoe, 2015; Napoles et al., 2016; Asano et al., 2017; Choshen et al., 2020; Maeda et al., 2022). In this work, we use the M² scorer because it is language-agnostic and was the main evaluation metric used in previous work on Arabic GEC. The M² scorer compares hypothesis edits made by a GEC system against annotated reference edits and calculates precision (P), recall (R), and F0.5. For GED, we follow previous work (Bell et al., 2019; Kaneko and Komachi, 2019; Yuan et al., 2021) and use macro precision (P), recall (R), and F0.5 for evaluation. We also report accuracy.
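For reference, the F0.5 score used throughout this paper is the F-beta score with beta = 0.5, which weighs precision twice as much as recall. A minimal sketch:

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta=0.5 weighs precision twice as much as recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# A precision-heavy system is rewarded: P=0.6, R=0.3 gives F0.5 = 0.5
print(f_beta(0.6, 0.3))  # -> 0.5
```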

Arabic Grammatical Error Detection
Most of the work on GED has focused on English (§2), where error type annotations are provided manually (Yannakoudakis et al., 2011; Dahlmeier et al., 2013) or obtained automatically using an error type annotation tool such as ERRANT (Bryant et al., 2017). However, when it comes to morphologically rich languages such as Arabic, GED remains a challenge, largely due to the lack of manually annotated data and standardized error type frameworks. In this work, we treat GED as a multi-class sequence labeling task. We present a method to automatically obtain error type annotations by extracting edits from parallel erroneous and corrected sentences and then passing them to an Arabic error type annotation tool. To the best of our knowledge, this is the first work that explores multi-class GED in Arabic.

Edit Extraction
Before automatically labeling each erroneous sentence token, we need to align the erroneous and corrected sentence pairs to locate the positions of all edits so as to map errors to corrections. This step is usually referred to as edit extraction in the GEC literature (Bryant et al., 2017).
We first obtain a character-level alignment between the erroneous and corrected sentence pair by computing the weighted Levenshtein edit distance (Levenshtein, 1966) for each pair of tokens in the two sentences. The output of this alignment is a sequence of token-level edit operations representing the minimum number of insertions, deletions, and replacements needed to transform one token into another. Each of these operations involves at most one token belonging to either sentence. However, some errors may involve more than one single edit operation. To capture multi-token edits, we extend the alignment to cover merges and splits by implementing an iterative algorithm that greedily merges or splits adjacent tokens such that the overall cumulative edit distance is minimized.
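The token-level alignment step can be sketched as a dynamic program whose substitution cost is a character-level similarity between the two tokens. This is an illustrative simplification, not the paper's implementation: it uses `difflib`'s ratio as a stand-in for weighted Levenshtein, and it omits the greedy merge/split extension described above.

```python
from difflib import SequenceMatcher

def char_cost(a: str, b: str) -> float:
    """Substitution cost between two tokens, proportional to their
    character-level difference (a stand-in for weighted Levenshtein)."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()

def align(src_toks, tgt_toks):
    """Token-level alignment by dynamic programming. Returns a list of
    (op, src_token, tgt_token) with ops K(eep), R(eplace), I(nsert),
    D(elete). Merge/split handling would extend this greedily."""
    n, m = len(src_toks), len(tgt_toks)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = float(i)
    for j in range(1, m + 1):
        dp[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = char_cost(src_toks[i - 1], tgt_toks[j - 1])
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete
                           dp[i][j - 1] + 1,        # insert
                           dp[i - 1][j - 1] + sub)  # keep/replace
    # Backtrace to recover the edit operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                abs(dp[i][j] - (dp[i - 1][j - 1]
                                + char_cost(src_toks[i - 1], tgt_toks[j - 1]))) < 1e-9):
            op = "K" if src_toks[i - 1] == tgt_toks[j - 1] else "R"
            ops.append((op, src_toks[i - 1], tgt_toks[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and abs(dp[i][j] - (dp[i - 1][j] + 1)) < 1e-9:
            ops.append(("D", src_toks[i - 1], ""))
            i -= 1
        else:
            ops.append(("I", "", tgt_toks[j - 1]))
            j -= 1
    return ops[::-1]
```

Running `align(["the", "quik", "fox"], ["the", "quick", "fox"])` yields a keep, a replacement, and a keep, which is the edit sequence a GED labeler would then annotate.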

Error Type Annotation
Next, we pass the extracted edits to an automatic annotation tool to label them with specific error types. We use ARETA, an automatic error type annotation tool for MSA (Belkebir and Habash, 2021). Internally, ARETA is built using a combination of rule-based components and an Arabic morphological analyzer (Taji et al., 2018; Obeid et al., 2020). It uses the error taxonomy of the Arabic Learner Corpus (ALC) (Alfaifi and Atwell, 2012; Alfaifi et al., 2013), which defines seven error classes covering orthography (O), morphology (M), syntax (X), semantics (S), punctuation (P), merges, and splits; the error classes are further differentiated into 32 error tags that can be assigned individually or in combination. ARETA comes with its own alignment algorithm that extracts edits; however, it does not handle many-to-one and many-to-many edit operations (Belkebir and Habash, 2021). We replace ARETA's internal alignment algorithm with ours to increase the coverage of error typing. Using our edit extraction algorithm with ARETA enables us to automatically annotate single-token and multi-token edits with various error types. Appendix C presents the error types obtained from ARETA by using our alignment over the three GEC datasets we use.
To demonstrate the effectiveness of our alignment algorithm, we compare it to the alignments generated by the M² scorer, a standard Levenshtein edit distance, and ARETA. Table 2 presents the evaluation results of the alignment algorithms against the manual gold alignments of the QALB-2014 and QALB-2015 Dev sets in terms of precision (P), recall (R), and alignment error rate (AER) (Mihalcea and Pedersen, 2003; Och and Ney, 2003). The results show that our alignment algorithm is superior across all metrics. Figure 1 presents an example of the different alignments generated by the algorithms we evaluated. The M² scorer's alignment over-clusters multiple edits into a single edit (words 6-13). This is not ideal, particularly because the M² scorer does not count partial matches during evaluation, which leads to underestimating the models' performance (Felice and Briscoe, 2015). A standard Levenshtein alignment does not handle merges correctly, e.g., words 8 and 9 in the erroneous sentence are aligned to words 9 and 10 in the corrected version. ARETA's alignment shares this drawback: it likewise does not handle merges, aligning erroneous words 8 and 9 with corrected words 9 and 10, respectively.

Arabic Grammatical Error Correction
Recently developed GEC models rely on Transformer-based architectures, from standard Seq2Seq models to edit-based systems built on top of Transformer encoders. Given Arabic's morphological richness and the relatively small size of available data, we explore different GEC models, from morphological analyzers and rule-based systems to pretrained Seq2Seq models. Primarily, we are interested in exploring modeling approaches that address the following two questions:

• RQ1: Does morphological preprocessing improve GEC in Arabic?
• RQ2: Does modeling GED explicitly improve GEC in Arabic?
Morphological Disambiguation (Morph) We use the current SOTA MSA morphological analyzer and disambiguator from CAMeL Tools (Inoue et al., 2022; Obeid et al., 2020). Given an input sentence, the analyzer generates a set of potential analyses for each word, and the disambiguator selects the optimal analysis in context. The analyses include minimal spelling corrections for common errors, diacritizations, POS tags, and lemmas. We use the dediacritized spellings as the corrections.

We extend the Seq2Seq models we use to incorporate token-level GED information during training and inference. Specifically, we feed predicted GED tags as auxiliary input to the Seq2Seq models. We add an embedding layer to the encoders of AraBART and AraT5 right after their corresponding token embedding layers, allowing us to learn representations for the auxiliary GED input. The GED embeddings have the same dimensions as the positional and token embeddings, so all three embeddings can be summed before they are passed to the multi-head attention layers in the encoders.

Our approach is similar to that of Yuan et al. (2021), but it is much simpler, as it reduces the model's size and complexity by not introducing an additional encoder to process the GED input. Since our training data is relatively small, avoiding a drastic increase in the size of AraBART and AraT5 is important so as not to hinder training.
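The embedding-summation scheme above can be sketched as follows. This is a hypothetical NumPy illustration of the idea, not the authors' implementation: three lookup tables (token, position, GED tag) of equal dimension are summed elementwise before the result is passed to the encoder's attention layers. All sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

class GEDAugmentedEmbedding:
    """Sketch: sum token, positional, and GED-tag embeddings of equal
    dimension, as done before the encoder's multi-head attention."""
    def __init__(self, vocab_size, num_ged_tags, d_model, max_len=1024):
        self.tok = rng.normal(size=(vocab_size, d_model))
        self.pos = rng.normal(size=(max_len, d_model))
        self.ged = rng.normal(size=(num_ged_tags, d_model))

    def __call__(self, token_ids, ged_tag_ids):
        # token_ids, ged_tag_ids: (batch, seq_len) integer arrays
        seq_len = token_ids.shape[1]
        return (self.tok[token_ids]
                + self.pos[np.arange(seq_len)]   # broadcasts over the batch
                + self.ged[ged_tag_ids])

emb = GEDAugmentedEmbedding(vocab_size=100, num_ged_tags=44, d_model=16)
out = emb(np.zeros((2, 10), dtype=int), np.zeros((2, 10), dtype=int))
print(out.shape)  # (2, 10, 16)
```

Because the GED table is the only new parameter block, the model grows by just `num_ged_tags × d_model` weights, which is the size argument made above.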

Arabic Grammatical Error Detection
We build word-level GED classifiers using Transformer-based PLMs; among the available Arabic PLMs, we fine-tune CAMeLBERT. In our GED modeling experiments, we project multi-token error type annotations to single-token labels. In the case of a Merge error (many-to-one), we label the first token as Merge-B (Merge beginning) and all subsequent tokens as Merge-I (Merge inside). For all other multi-token error types, we repeat the same label for each token. We further label all deletion errors with a single Delete tag. To reduce the output space of the error tags, we only model the 14 most frequent error combinations (those appearing more than 100 times). We ignore unknown errors when computing the loss during training; however, we penalize the models for missing them in the evaluation. Since the majority of insertion errors are related to missing punctuation marks rather than missing words (see Appendix C), and due to inconsistent punctuation error annotations (Mohit et al., 2014), we exclude insertion errors from our GED modeling and evaluation. We leave the investigation of insertion errors to future work. The full GED output space we model consists of 43 error tags (43-Class).
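The label projection scheme just described can be sketched as a small function. The function name and the exact label strings are illustrative assumptions; only the Merge-B/Merge-I/Delete scheme itself comes from the text above.

```python
def project_labels(error_type: str, num_src_tokens: int) -> list:
    """Project a (possibly multi-token) error annotation onto per-token
    GED labels: Merge errors get Merge-B then Merge-I; deletions collapse
    to a Delete tag; all other error types repeat for every source token."""
    if num_src_tokens < 1:
        raise ValueError("annotation must cover at least one source token")
    if error_type == "Merge":
        return ["Merge-B"] + ["Merge-I"] * (num_src_tokens - 1)
    if error_type == "Delete":
        return ["Delete"] * num_src_tokens
    return [error_type] * num_src_tokens

print(project_labels("Merge", 3))  # -> ['Merge-B', 'Merge-I', 'Merge-I']
```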
We take advantage of the modularity of the ARETA error tags to conduct multi-class GED experiments, reducing the 43 error tags to their corresponding 13 main error categories as well as to a binary space (correct/incorrect). The statistics of the error tags we model across all datasets are in Appendix D. Figure 1 shows an example of error types at different granularity levels. Table 3 presents the GED granularity results. Unsurprisingly, all numbers go up when we model fewer error types. However, modeling more error types does not significantly worsen performance in terms of error detection accuracy. It seems that all systems are capable of detecting comparable numbers of errors regardless of the number of classes, but the more verbose systems struggle to predict the specific class labels.
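The granularity reductions can be sketched as simple tag transformations. This sketch assumes ARETA-style tags where the leading letter of each subtag encodes its main class (O, M, X, S, P, ...) and combinations are joined with "+", and it assumes a "UC" label for correct words; these encoding details are assumptions for illustration, not taken from the paper.

```python
# Illustrative labels assumed to pass through unchanged at every granularity.
PASSTHROUGH = {"UC", "Merge-B", "Merge-I", "Delete"}

def to_main_category(tag: str) -> str:
    """Reduce a fine-grained tag (possibly a combination like 'OH+MI')
    to its main error classes, assuming the first letter of each subtag
    encodes the class."""
    if tag in PASSTHROUGH:
        return tag
    return "+".join(sorted({sub[0] for sub in tag.split("+")}))

def to_binary(tag: str) -> str:
    """Collapse any tag to correct (C) vs. erroneous (E)."""
    return "C" if tag == "UC" else "E"

print(to_main_category("OH+MI"))  # -> "M+O"
```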

Arabic Grammatical Error Correction
We explore different variants of the above-mentioned Seq2Seq models. For each model, we study the effects of applying morphological preprocessing (+Morph), providing GED tags as auxiliary input (+GED), or both (+Morph+GED). Applying morphological preprocessing simply means correcting the erroneous input using the morphological disambiguator before training and inference.
To increase the robustness of the models that take GED tags as auxiliary input, we use predicted (not gold) GED tags when we train the GEC systems. For each dataset, we run its respective GED model on the same training data it was trained on, and we pick the predictions of the worst checkpoint. During inference, we resolve merge and delete errors before feeding erroneous sentences to the model. This experimental setup yields the best performance across all GEC models.
To ensure a fair comparison to previous work on Arabic GEC, we follow the same constraints that were introduced in the QALB-2014 and QALB-2015 shared tasks: systems tested on QALB-2014 are only allowed to use the QALB-2014 training data, whereas systems tested on QALB-2015 are allowed to use the QALB-2014 and QALB-2015 training data. For ZAEBUC, we train our systems on the combination of the three training datasets. We report our results in terms of precision (P), recall (R), F1, and F0.5. F1 was the official metric used in the QALB-2014 and QALB-2015 shared tasks. However, we follow the most recent work on GEC and use F0.5 (weighing precision twice as much as recall) as our main evaluation metric.
We use Hugging Face's Transformers (Wolf et al., 2019) to build our GED and GEC models. The hyperparameters we used are detailed in Appendix A.

Does GED help Arabic GEC?
We start off by using the most fine-grained GED model (43-Class) to exploit the full effect of the ARETA GED tags and to guide our choice between AraBART and AraT5. Using GED as an auxiliary input in both AraT5 and AraBART improves the results across all three Dev sets, with AraBART+GED demonstrating superior performance compared to the other models on average. Applying morphological preprocessing as well as using GED as an auxiliary input yields the best performance across the three Dev sets, except for QALB-2015 in the case of AraT5+Morph+GED. Overall, AraBART+Morph+GED is the best performer on average in terms of F0.5. The improvements from using GED with GEC systems are mostly due to recall. An error comparison between AraBART and the AraBART+Morph+GED model (Appendix E) shows improved performance on the majority of error types.
To study the effect of GED granularity on GEC, we train two additional AraBART+Morph+GED models with 13-Class and 2-Class GED tags. The results in Table 5 show that 13-Class GED was best on QALB-2014 and ZAEBUC, whereas 43-Class GED was best on QALB-2015 in terms of F0.5. However, in terms of precision and recall, GED models with different granularity behave differently across the three Dev sets. On average, using any GED granularity improves over AraBART, with 13-Class GED yielding the best results, although it is only 0.1 higher than 43-Class GED in terms of F0.5. For completeness, we further estimate an oracle upper bound by using gold GED tags at different granularities. The results (in Table 5) show that using gold GED tags at any granularity improves the results considerably. This indicates that GED provides the GEC system with additional information; the main bottleneck is GED prediction reliability rather than GED granularity. Improving GED predictions will most likely lead to better GEC results.
Test Results Since the best-performing models on the three Dev sets benefit from different GED granularities when used with AraBART+Morph, we present the results on the Test sets using all GED granularity models. The results of AraBART and its variants on the Test sets are presented in Table 6. On QALB-2014, using Morph, GED, or both improves the results over AraBART, except for 2-Class GED. AraBART+43-Class GED is the best performer (a 0.3 increase in F0.5, although not statistically significant).³ It is worth noting that AraBART+Morph achieves the highest recall on QALB-2014 (a 2.7 increase over AraBART, statistically significant at p < 0.05). For QALB-2015-L1, using GED by itself at any granularity did not improve over AraBART, but when combined with Morph, the 43-Class GED model yields the best performance in F0.5 (a 0.6 increase, statistically significant at p < 0.05). When it comes to QALB-2015-L2, Morph does not help, but using GED alone improves the results over AraBART, with 43-Class and 13-Class GED being the best (a 0.4 increase). Lastly, on ZAEBUC, Morph does not help, but using 13-Class GED by itself improves over AraBART (a 0.4 increase). Overall, all the improvements we observe are attributed to recall, which is consistent with the Dev results.
Following the QALB-2015 shared task's (Rozovskaya et al., 2015) reporting of no-punctuation results, due to observed inconsistencies in the references (Mohit et al., 2014), we present results on the Test sets without punctuation errors in Table 7. The results are consistent with those with punctuation, indicating that GED and morphological preprocessing yield improvements compared to using AraBART by itself across all Test sets. The score increase across all reported metrics when removing punctuation, particularly in the L1 data, indicates that punctuation presents a challenge for GEC models and needs further investigation both in terms of data creation and modeling approaches.
Analyzing the Test Results Table 8 presents the average absolute changes in precision and recall over the Test sets when introducing Morph, GED, or both. Adding Morph alone or GED alone improves recall (up to 0.8 in the case of Morph) and slightly hurts precision. When using both Morph and GED, we observe larger improvements in recall (1.5 on average) but also larger drops in precision (−0.7 on average).

Conclusion and Future Work
We presented the first results on Arabic GEC using Transformer-based pretrained Seq2Seq models. We also presented the first results on multi-class Arabic GED. We showed that using GED information as an auxiliary input in GEC models improves GEC performance across three datasets. Further, we investigated the use of contextual morphological preprocessing in aiding GEC systems. Our models achieve SOTA results on two Arabic GEC shared task datasets and establish a strong benchmark on a recently created dataset.
In future work, we plan to explore other GED and GEC modeling approaches, including the use of syntactic models (Li et al., 2022; Zhang et al., 2022). We plan to work more on insertions, punctuation, and infrequent error combinations. We also plan to work on GEC for Arabic dialects, i.e., normalizing dialectal Arabic into its conventional orthography (Habash et al., 2018; Eskander et al., 2013; Eryani et al., 2020).

Limitations
Although using GED information as an auxiliary input improves GEC performance, our GED systems are limited in that they can only predict error types for up to 512 subwords, since they are built by fine-tuning CAMeLBERT. We also acknowledge

A Detailed Experimental Setup
Grammatical Error Detection Our GED models were fine-tuned for 10 epochs using a learning rate of 5e-5, a batch size of 32, and a seed of 42. At the end of fine-tuning, we pick the best checkpoint based on performance on the Dev sets.
Grammatical Error Correction When using AraBART, we fine-tune the models for 10 epochs using a learning rate of 5e-5, a batch size of 32, a maximum sequence length of 1024, and a seed of 42. For AraT5, we fine-tune the models for 30 epochs using a learning rate of 1e-4; the rest of the hyperparameters are the same as those used for AraBART. During inference, we use beam search with a beam width of 5 for all models. At the end of fine-tuning, we pick the best checkpoint based on performance on the Dev sets using the M² scorer. The M² scorer suffers from extreme running times in cases where the generated outputs differ significantly from the input. To mitigate this bottleneck, we extend the M² scorer by introducing a time limit for each sentence during evaluation. If the evaluation of a single generated sentence surpasses this limit, we pass the input sentence to the output without modifications. We use this extended version of the M² scorer when reporting our results on the Dev sets. When reporting our results on the Test sets, we use the M² scorer release provided by the QALB shared task. We make our extended version of the M² scorer publicly available.
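The per-sentence time limit can be sketched with a SIGALRM-based timeout, as below. This is a hypothetical illustration of the mechanism (not the authors' extension of the M² scorer), and it is POSIX-only; the function and exception names are assumptions.

```python
import signal

class EvaluationTimeout(Exception):
    """Raised when scoring a single sentence exceeds the time limit."""
    pass

def _on_timeout(signum, frame):
    raise EvaluationTimeout

def score_with_limit(score_fn, hypothesis, source, limit_seconds=30):
    """Score `hypothesis`; if scoring exceeds `limit_seconds`, fall back
    to passing the source sentence through unchanged, as described above."""
    signal.signal(signal.SIGALRM, _on_timeout)
    signal.alarm(limit_seconds)          # arm the per-sentence timer
    try:
        return score_fn(hypothesis), hypothesis
    except EvaluationTimeout:
        return score_fn(source), source  # unchanged input: assumed cheap to score
    finally:
        signal.alarm(0)                  # always disarm the timer
```

A multiprocessing-based limit would also work and is portable to non-POSIX systems, at the cost of process startup overhead per sentence.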
ChatGPT We start by prompting ChatGPT with a 3-shot prompt. Our exact prompt is the following: "Please identify and correct any spelling and grammar mistakes in the following sentence indicated by <input> INPUT </input> tag. You need to comprehend the sentence as a whole before gradually identifying and correcting any errors while keeping the original sentence structure unchanged as much as possible.

Figure 1: An example showing the differences between the alignments of the M² scorer, a standard Levenshtein distance, ARETA, and our proposed algorithm. The edit operations are keep (K), replace (R), insert (I), delete (D), merge (M), and split (S). Dotted lines between the erroneous and corrected sentences represent the gold alignment. The last three rows present different granularities of ARETA error types based on our alignment. The sentence in the figure can be translated as "Social media must be used wisely, as it has both negative and positive effects."

Table 1: Corpus statistics of Arabic GEC datasets.

Table 2: Evaluation of different alignment algorithms.
Maximum Likelihood Estimation (MLE) We exploit our alignment algorithm to build a simple lookup model that maps erroneous words to their corrections. We implement this model as a bigram maximum likelihood estimator over the training data: P(c_i | w_i, w_{i-1}, e_i), where w_i and w_{i-1} are the erroneous word (or phrases in the case of a merge error) and its bigram context, e_i is the error type of w_i, and c_i is the correction of w_i. During inference, we pick the correction that maximizes the MLE probability. If the bigram context (w_i and w_{i-1}) was not observed during training, we back off to a unigram. If the erroneous input word was not observed in training, we pass it to the output unchanged.

Seq2Seq with GED Models We experiment with two newly developed pretrained Arabic Transformer-based Seq2Seq models: AraBART (Kamal Eddine et al., 2022), pretrained on 24GB of MSA data mostly in the news domain, and AraT5 (Nagoudi et al., 2022), pretrained on 256GB of both MSA and Twitter data.
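The bigram MLE lookup with unigram backoff described above can be sketched as counting corrections per (context, word, error-type) key. Class and method names are illustrative, not the paper's code.

```python
from collections import Counter, defaultdict

class BigramMLECorrector:
    """Sketch of a bigram MLE lookup: pick the correction c maximizing
    P(c | w_i, w_{i-1}, e_i); back off to a unigram when the bigram
    context is unseen; pass unseen words through unchanged."""
    def __init__(self):
        self.bigram = defaultdict(Counter)   # (prev, word, err) -> correction counts
        self.unigram = defaultdict(Counter)  # (word, err) -> correction counts

    def train(self, examples):
        """examples: iterable of (prev_word, erroneous_word, error_type, correction)."""
        for prev, word, err, corr in examples:
            self.bigram[(prev, word, err)][corr] += 1
            self.unigram[(word, err)][corr] += 1

    def correct(self, prev, word, err):
        if (prev, word, err) in self.bigram:
            return self.bigram[(prev, word, err)].most_common(1)[0][0]
        if (word, err) in self.unigram:
            return self.unigram[(word, err)].most_common(1)[0][0]
        return word  # unseen in training: pass through unchanged
```

Since `most_common(1)` picks the majority correction for a key, this is exactly argmax over the empirical conditional distribution, i.e., the MLE decision.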

Table 5: GED granularity results when used within the best GEC system (AraBART+Morph+GED) on the Dev sets of QALB-2014, QALB-2015, and ZAEBUC. The best results are in bold.
Table 4 presents the results on the Dev sets.

Table 7: No-punctuation GED granularity results when used within GEC on the Test sets of QALB-2014, QALB-2015, and ZAEBUC. The best results are in bold.
Please feel free to refer to these examples. Remember to format your corrected output with the tag <output> Your Corrected Version </output>. Please start: <input> INPUT </input>"

Table 9: Corpus statistics of Arabic GEC datasets.

Table 11: The statistics of the different GED granularity error types we model across the three datasets. Descriptions of the labels in the 13-Class and 43-Class categories are in Appendix C. For the 2-Class labels, E refers to erroneous words and C refers to correct words.