Pushing on Text Readability Assessment: A Transformer Meets Handcrafted Linguistic Features

We report two essential improvements in readability assessment: 1. three novel features in advanced semantics and 2. timely evidence that traditional ML models (e.g. Random Forest, using handcrafted features) can combine with transformers (e.g. RoBERTa) to augment model performance. First, we explore suitable transformers and traditional ML models. Then, we extract 255 handcrafted linguistic features using self-developed extraction software. Finally, we assemble those to create several hybrid models, achieving state-of-the-art (SOTA) accuracy on popular datasets in readability assessment. The use of handcrafted features helps model performance on smaller datasets. Notably, our RoBERTa-RF-T1 hybrid achieves the near-perfect classification accuracy of 99%, a 20.3% increase from the previous SOTA.


Introduction
The long quest for advancing readability assessment (RA) mostly centered on handcrafting the linguistic features that affect readability (Pitler and Nenkova, 2008). RA is a time-honored branch of natural language processing (NLP) that quantifies the difficulty with which a reader understands a text (Feng et al., 2010). Being one of the oldest systematic approaches to linguistics (Collins-Thompson, 2014), RA developed various linguistic features. These range from simple measures like the average count of syllables to those as sophisticated as semantic complexity (Buchanan et al., 2001).
Perhaps due to the abundance of dependable linguistic features, an overwhelming majority of RA systems are Support Vector Machines (SVM) with handcrafted features (Hansen et al., 2021). Such traditional machine learning (ML) methods were linguistically explainable, expandable, and most importantly, competent against the modern neural models. As a fragmentary example, Filighera et al. (2019) report that a large ensemble of 6 BiLSTMs with BERT (Devlin et al., 2019), ELMo (Peters et al., 2018), Word2Vec (Mikolov et al., 2013), and GloVe (Pennington et al., 2014) embeddings showed only ∼1% accuracy improvement over a single SVM model developed by Xia et al. (2016).
Even though deep neural networks have achieved state-of-the-art (SOTA) performance in almost all semantic tasks where sufficient data were available (Collobert et al., 2011; Zhang et al., 2015), neural models started showing promising results in RA only quite recently (Martinc et al., 2021). A known challenge for the researchers in RA is the lack of large public datasets, with the unique exception of WeeBit (Vajjala and Meurers, 2012). Technically speaking, even WeeBit is not entirely public since it has to be directly obtained from the authors. Martinc et al. (2021) raised the SOTA classification accuracy on the popular WeeBit dataset (Vajjala and Meurers, 2012) by about 4% using BERT. This was the first solid proof that neural models with auto-generated features can show significant improvement compared to traditional ML with handcrafted features. However, neural models, or transformers (which are the interest of this paper), still do not perform much better than traditional ML on smaller datasets like OneStopEnglish (Vajjala and Lučić, 2018), despite their complexity.
From our observations, the reported low performances of transformers on small RA datasets can be attributed to two reasons. 1. Only BERT was applied to RA, and there could be other transformers that perform better, even on small datasets. 2. If a transformer shows weak performance on small datasets, additional measures should be taken to supply the final model (e.g. an ensemble) with more linguistic information, but such a study is rare in RA. Hence, we tackle the abovementioned issues in this paper. In particular, we 1. perform a wide search on transformers, traditional ML models, and handcrafted features and 2. develop a hybrid architecture for SOTA and robustness on small datasets.
However, before we move on to hybrid models, we begin by supplementing an underexplored linguistic branch of handcrafted features. According to survey research on RA (Collins-Thompson, 2014), the study on advanced semantics is scarce. We lack a model to capture how deeper semantic structures affect readability. We attempt to solve this issue by viewing a text as a collection of latent topics and calculating the probability distribution.
Then, we move on to combine traditional ML (with handcrafted features) and transformers. Such a hybrid system is only reported by Deutsch et al. (2020), concluding, "(hybrid models) did not achieve SOTA performance." But we obtain contrary results. Through a large study on the optimal combination, we obtain SOTA results on WeeBit and OneStopEnglish. Also, our BERT-GB-T1 hybrid beats the (previous) SOTA accuracy with only 30% of the full dataset (section 4.7).
Our main objectives are creating advanced semantic features and hybrid models. But our contributions to academia are not limited to the abovementioned two. We make the following additions: 1. We numerically represent certain linguistic properties pertaining to advanced semantics. 2. We develop a large-scale, openly available Python toolkit that extracts 255 features (such toolkits are highly scarce in RA). We name the software LingFeat. 3. We conduct wide searches and parametrizations on transformers and traditional ML for RA use. 4. We develop hybrid models for SOTA and robustness on small datasets. Notably, RoBERTa-RF-T1 achieves 99% accuracy on OneStopEnglish, 20.3% higher than the previous SOTA (table 5).

Definition, Background, and Overview
A text is a communication between author and reader, and its readability is affected by the reader having shared world/domain knowledge. According to Collins-Thompson (2014), the features resulting from topic modeling may characterize the deeper semantic structures of a text. These deeper representations accumulate and appear to us in the form of perceivable properties like sentiment and genre. But advanced semantics aims to capture the deeper representation itself.
Among the four branches of linguistic properties (in RA) identified by Collins-Thompson (2014), advanced semantics remains unexplored. Lexico-semantic (Lu, 2011; Malvern and Richards, 2012), syntactic (Heilman et al., 2007; Petersen and Ostendorf, 2009), and discourse-based (McNamara et al., 2010) properties have seen several notable works, but few deal with advanced semantics as defined above. The existing examples in higher-level semantics focus on word-level complexity (Collins-Thompson and Callan, 2004; Crossley et al., 2008; Landauer et al., 2011; Nam et al., 2017). Such a phenomenon is complex. The lack of investigation on advanced semantics could be due to its low correlation with readability. This is plausible because RA studies often test their features on a human-labeled dataset, potentially biased towards easily recognizable surface-level features (Evans, 2006). Such biases could cause low performance.
Further, it must be noted that: 1. world knowledge might not always directly indicate difficulty, and 2. there can be other existing substitute features that capture similar properties on a word level.
S1) Kindness is good.

S3) I return with the stipulation to dismiss Smith's case; the same being duly executed by me.
S2 above seems to require more world knowledge than S1. However, "Christmas", as a familiar entity, seems to have no apparent contribution to increased difficulty. If any, similar properties can be captured by word frequency/familiarity measures (lexico-semantics) in a large representative corpus (Leroy and Kauchak, 2013). Also, it seems that S3 is the most difficult, and this can be easily deduced using entity counts (discourse). Entities mostly introduce conceptual information (Feng et al., 2010).
Our key objective in studying advanced semantics is to identify features that add orthogonal information.In other words, we hope to see a performance increase in our overall RA model rather than specific features' high correlations with readability.
Given these considerations, we draw two guidelines: 1. develop passage-level features since most word-level attributes are captured by existing features, and 2. value the orthogonal addition of information, not individual features' high correlation.

Approach
Topics convey text meaning on a global level (Holtgraves, 1999). In order to capture the deeper structure of meaning (advanced semantics), we hypothesized that calculating the distribution of document-topic probabilities from Latent Dirichlet Allocation (LDA) (Blei et al., 2003) could be helpful.
Moreover, domain/world knowledge can be accounted for in LDA-resulting measures since LDA can be trained on various data. As explored in Qumsiyeh and Ng (2011), it may seem sensible to use the count of discovered topics as the measure of required knowledge. However, such measures can be extremely sensitive to passage length. Along with the count of discovered topics, we develop three others that consider how these topics are distributed: semantic richness, clarity, and noise.
Fig. 1 depicts the steps: 1. obtain output from a trained LDA model, 2. ignore topic ID and create a sorted probabilities list, and 3. calculate semantic richness, clarity, and noise. We model "how" the topics are distributed, not "what" the topics are.
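A minimal sketch of step 2 may help make the sorted probabilities list concrete. It assumes that the LDA output for one passage arrives as (topic ID, probability) pairs and that the list is sorted from high to low; the function name, the `min_prob` cutoff, and the toy numbers are hypothetical and not taken from LingFeat.

```python
def sorted_probability_list(doc_topics, min_prob=0.0):
    """Drop topic IDs and return topic probabilities sorted high to low.

    doc_topics: iterable of (topic_id, probability) pairs (assumed shape).
    min_prob: hypothetical cutoff; topics at or below it count as not discovered.
    """
    probs = [p for _, p in doc_topics if p > min_prob]
    return sorted(probs, reverse=True)


# Toy LDA output for one passage (made-up numbers).
lda_output = [(7, 0.10), (2, 0.60), (41, 0.25), (13, 0.05)]
spl = sorted_probability_list(lda_output)
n_topics = len(spl)  # the count of discovered topics
```

Because the topic IDs are discarded, two passages about entirely different subjects can yield the same list, which matches the stated aim of modeling "how" rather than "what".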

Semantic Richness
Traditionally, semantic richness is quantified according to word usage (Pexman et al., 2008). In a high-dimensional model of semantic space (Li et al., 2000), co-occurring words clustered as semantic neighbors, quantifying semantic richness. As such, the previous models of semantic richness were often studied for word-level complexity and made no explicit connection with readability on a global level. Also, they were often subject-dependent (Buchanan et al., 2001). As concluded by Pexman et al. (2008), semantic richness is defined in several ways. We propose a novel variation.
We apply a similar co-occurrence concept, but on the passage level, using LDA. Here, semantic richness is the measure of how "largely" populated the topics are. In fig. 1, we approximately define richness as the product total of the sorted probabilities list (SPL), which measures the count of discovered topics (n) and topic probability (p). Additionally, we multiply index (i) to reward longer n so that the overall richness increases faster with more topics. See eqn. 1.
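The description above can be read as a rank-weighted sum over the sorted probabilities list, i.e. the sum of i · p_i for i = 1..n. The sketch below follows that reading; it is an interpretation of the prose, with a 1-based index assumed, and the function name and toy lists are hypothetical.

```python
def semantic_richness(spl):
    """Rank-weighted sum of topic probabilities: sum of i * p_i, with i starting at 1."""
    return sum(i * p for i, p in enumerate(spl, start=1))


# Two made-up sorted probability lists that both sum to 1.
few_topics = [0.7, 0.3]
many_topics = [0.4, 0.3, 0.2, 0.1]

richness_few = semantic_richness(few_topics)    # 1*0.7 + 2*0.3
richness_many = semantic_richness(many_topics)  # 1*0.4 + 2*0.3 + 3*0.2 + 4*0.1
```

With the index weighting, the list with more discovered topics scores higher even though both lists carry the same total probability mass.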

Semantic Clarity
Semantic clarity is critical in understanding text (Peabody and Schaefer, 2016). Likewise, complex meaning structures lead to comprehension difficulty (Pires et al., 2017). Some existing studies quantify semantic complexity (or clarity) through various measures, but most sit on the fine line between lexical and semantic properties (Collins-Thompson, 2014). They rarely deal with the latent meaning representations or the clarity of the main topic.
For semantic clarity, we quantify how the probability distribution (fig. 1) is focused (skewed) towards the largest discovered topic. In other words, we hope to see how easily identifiable the main topic is. We wanted to adopt the standard skewness equation from statistics, but we developed an alternative (eqn. 2) because the standard equation failed to capture the anticipated trends in appendix A.
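Since eqn. 2 is not reproduced in this text, the sketch below is only a stand-in that captures the stated intent: it averages the gap between the largest topic probability and every probability in the list, so a distribution dominated by one topic scores high. The formula, the function name, and the toy lists are assumptions for illustration and may differ from eqn. 2.

```python
def semantic_clarity(spl):
    """Stand-in clarity score: mean gap between the largest probability and each one."""
    if not spl:
        return 0.0
    top = max(spl)
    return sum(top - p for p in spl) / len(spl)


# Made-up lists: one clearly dominated by a main topic, one perfectly flat.
focused = [0.85, 0.05, 0.05, 0.05]
flat = [0.25, 0.25, 0.25, 0.25]

clarity_focused = semantic_clarity(focused)
clarity_flat = semantic_clarity(flat)
```

Under this stand-in, a flat list scores zero because no topic stands out, while the focused list scores well above it.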

Semantic Noise
Semantic noise is the measure of the less-important topics (those with low probability), also the "tailedness" of sorted probability lists (fig. 1). A sorted probability list that resembles a (right-halved) leptokurtic curve would have higher semantic noise.
In comparison, a (right-halved) platykurtic curve of similar length would have low semantic noise. We adopt the kurtosis equation under Fisher definition (Kokoska and Zwillinger, 2000). See eqn. 3.
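As an illustration, kurtosis under the Fisher definition is conventionally the fourth central moment divided by the squared variance, minus 3. The sketch below applies that conventional form to a sorted probability list using population moments; whether eqn. 3 matches it term by term is an assumption, and the zero-variance fallback is a hypothetical convention.

```python
def semantic_noise(spl):
    """Excess (Fisher) kurtosis of the probabilities, using population moments."""
    n = len(spl)
    if n == 0:
        return 0.0
    mean = sum(spl) / n
    m2 = sum((p - mean) ** 2 for p in spl) / n  # variance
    m4 = sum((p - mean) ** 4 for p in spl) / n  # fourth central moment
    if m2 == 0:
        return 0.0  # hypothetical convention for a constant list
    return m4 / (m2 ** 2) - 3.0


# Made-up lists: a sharp peak with a long flat tail vs. a two-level plateau.
peaked = [0.9, 0.025, 0.025, 0.025, 0.025]
plateau = [0.5, 0.5, 0.0, 0.0]

noise_peaked = semantic_noise(peaked)
noise_plateau = semantic_noise(plateau)
```

The peaked list behaves like the leptokurtic case and scores higher than the plateau, which behaves like the platykurtic case.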

Covered Features
We study 255 linguistic features. For the already existing features, we add variations to widen coverage. The full list of features, feature codes, and definitions is provided in appendix B. Also, we classify features into 14 subgroups. External dependencies (e.g. parser) are reported in appendix D.

Advanced Semantic Features (AdSem)
Here, we follow the methods provided in section 2.
1∼3) Wikipedia (WoKF), WeeBit (WBKF), & OneStop Knowledge Features (OSKF). Each subgroup name represents the respective training data. We train Online LDA (Hoffman et al., 2010) with the 20210301 dump from English Wikipedia for WoKF. The others are trained on two popular corpora in RA: WeeBit and OneStopEnglish.
For each training set, four model variations with 50, 100, 150, and 200 topics are trained. Four features (semantic richness, clarity, noise, and the total count of discovered topics) are extracted per model.

Discourse-Based Features (Disco)
A text is more than a series of random sentences. It indicates a higher-level structure of dependencies.
4) Entity Density Features (EnDF). Conceptual information is often introduced by entities. Hence, the count of entities affects the working memory burden (Feng et al., 2009). We bring entity-related features from Feng et al. (2010).
5) Entity Grid Features (EnGF). Coherent texts are easy to comprehend. Thus, we measure coherence through entity grid, using the 16 transition pattern ratios approach by Pitler and Nenkova (2008) as features. Also, we adopt local coherence scores (Guinaudeau and Strube, 2013), using the code implemented by Palma and Atkinson (2018).
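To illustrate the 16 transition pattern ratios, the sketch below counts role transitions between adjacent sentences for each entity and divides by the total number of transitions. The grid layout (one role string per entity, with S, O, X, N for subject, object, other, and not present) follows the usual entity grid convention; the function name and the toy grid are hypothetical, and parsing or coreference steps are left out.

```python
from collections import Counter
from itertools import product

ROLES = "SOXN"  # subject, object, other, not present


def transition_ratios(grid):
    """Return the ratio of each of the 16 two-sentence role transitions to the total.

    grid: dict mapping entity -> string of roles, one character per sentence.
    """
    counts = Counter()
    for roles in grid.values():
        for a, b in zip(roles, roles[1:]):
            counts[a + b] += 1
    total = sum(counts.values())
    return {
        a + b: (counts[a + b] / total if total else 0.0)
        for a, b in product(ROLES, repeat=2)
    }


# Toy grid for a three-sentence passage (made-up roles).
grid = {"court": "SXN", "case": "OOS", "Smith": "NNX"}
ratios = transition_ratios(grid)
```

Each key such as "SO" corresponds to one pattern of the kind listed in appendix B (e.g. the ratio of SO transitions to total).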

Syntactic Features (Synta)
Syntactic complexity is associated with longer processing times (Gibson, 1998). Such syntactic properties also affect the overall complexity of a text (Hale, 2016), an important indicator of readability.
7) Tree Structure Features (TrSF). We deal with the structural shape of parsed trees, inspired by the work on average parse tree height by Schwarm and Ostendorf (2005). On a constituency parser (appendix D) output, NLTK (Loper and Bird, 2002) is used for the final calculation of features.
8) Part-of-Speech Features (POSF). Several studies report the effectiveness of using POS counts as features (Tonelli et al., 2012; Lee and Lee, 2020a). We count based on Universal POS tags.

Lexico-Semantic Features (LxSem)
Perhaps the most explored, lexico-semantics capture the attributes associated with the difficulty or unfamiliarity of words (Collins-Thompson, 2014).
9) Variation Ratio Features (VarF). Lu (2011) reports noun, verb, adjective, and adverb variations, which represent the proportion of the respective category's words to total. We implement the features with variants from Vajjala and Meurers (2012).
10) Type Token Ratio Features (TTRF). TTR has been widely used as a measure of lexical richness in language acquisition studies (Malvern and Richards, 2012). We bring five variations of TTR from Vajjala and Meurers (2012). For MTLD (McCarthy and Jarvis, 2010), we set the default TTR threshold to 0.72.
12) Word Familiarity Features (WorF). Word frequency in a large representative corpus often represents lexical difficulty (Collins-Thompson, 2014) due to unfamiliarity. We use the SubtlexUS database (Brysbaert and New, 2009) to measure familiarity.
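For the type-token ratio features (TTRF) above, the following sketch computes several TTR variants that are common in the lexical richness literature. Which five variants the feature set uses is not spelled out here, so the selection is an assumption, as are the function name, the lowercasing step, and the whitespace tokenization of the toy sentence.

```python
import math


def ttr_variants(tokens):
    """Compute common type-token ratio variants for a list of tokens."""
    tokens = [t.lower() for t in tokens]  # assumed normalization
    n = len(tokens)        # token count (N)
    t = len(set(tokens))   # type count (T)
    if n < 2:
        raise ValueError("need at least two tokens")
    return {
        "simple": t / n,                        # T / N
        "root": t / math.sqrt(n),               # T / sqrt(N)
        "corrected": t / math.sqrt(2 * n),      # T / sqrt(2N)
        "bilog": math.log(t) / math.log(n),     # log T / log N
        # Uber index: log^2 N / log(N / T); undefined when every token is unique.
        "uber": (math.log(n) ** 2 / math.log(n / t)) if t < n else float("inf"),
    }


tokens = "the cat saw the dog and the dog saw the cat".split()
variants = ttr_variants(tokens)
```

The toy sentence has 11 tokens and 5 types, so the simple ratio is 5/11; the other variants rescale the same two counts to reduce sensitivity to text length.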

Shallow Traditional Features (ShaTr)
Classic readability formulas (e.g. Flesch-Kincaid Grade) (Kincaid et al., 1975) or shallow measures often do not represent a specific linguistic branch.
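As a concrete instance of such a classic formula, the Flesch-Kincaid Grade Level is commonly given as 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below takes pre-computed counts as input; the function name and the toy counts are hypothetical, and tokenization and syllable counting are left out.

```python
def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid Grade Level from raw counts of words, sentences, and syllables."""
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Made-up counts for a short passage: 100 words, 5 sentences, 150 syllables.
grade = flesch_kincaid_grade(n_words=100, n_sentences=5, n_syllables=150)
```

The formula only uses surface counts, which is why such measures are grouped apart from the linguistically motivated branches.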
13) Shallow Features (ShaF). These features capture surface-level difficulty. Our measures include the average count of tokens and syllables.
14) Traditional Formulas (TraF). For Flesch-Kincaid Grade Level, Automated Readability, and Gunning Fog, we follow the "new" formulas in Kincaid et al. (1975). We follow Si and Callan (2001) for Smog Index (McLaughlin, 1969). And we follow Eltorai et al. (2015) for Linsear Write.

In our hybrid model, we take a simple approach of joining the soft label predictions of a neural model (e.g. BERT) with handcrafted features and wrapping it with a non-neural model (e.g. SVM).
In fig. 2, the non-neural model (i.e. secondary predictor) learns 1. predictions/outputs of the neural model and 2. handcrafted features. The addition of handcrafted features supplements what neural models (i.e. initial predictor) might miss, reinforcing performance on the secondary prediction.

In Pursuit of the Best Combination
Our hybrid architecture (fig. 2) is simple; Deutsch et al. (2020) explored a similar concept but did not achieve SOTA. But the benefits (section 4.1) from its simplicity are critical for RA, which lacks public datasets in both number and size, has wide educational use, and relies on diverse handcrafted features. We obtain SOTA with a wider search on combinations.

Datasets and Evaluation Setups
WeeBit. Perhaps the most widely-used, WeeBit is often considered the gold standard in RA. It was first created as an expansion of the famous Weekly Reader corpus (Feng et al., 2009). To avoid classification bias, we downsample classes to equalize the number of items (passages) in each class to 625. It is common practice to downsample WeeBit.
OneStopEnglish. OneStopEnglish is an aligned passage corpus developed for RA and simplification studies. A passage is paraphrased into three readability classes. OneStopEnglish is designed to be a balanced dataset. No downsampling is needed.
Cambridge. Cambridge (Xia et al., 2016) categorizes articles based on Cambridge English Exam levels (KET, PET, FCE, CAE, CPE). These five exams are targeted at learners at A2-C2 levels of the Common European Framework of Reference (Xia et al., 2016). We downsample to 60 items/class.
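The class-equalizing downsampling described for WeeBit can be sketched as follows. This is a generic illustration with random sampling and a fixed seed; the function name, the seed, and the toy data are hypothetical, and the paper's exact sampling procedure is not specified here.

```python
import random
from collections import defaultdict


def downsample(items, labels, per_class=None, seed=0):
    """Randomly keep the same number of items in every class.

    per_class defaults to the size of the smallest class.
    Returns a list of (item, label) pairs.
    """
    by_class = defaultdict(list)
    for item, label in zip(items, labels):
        by_class[label].append(item)
    if per_class is None:
        per_class = min(len(v) for v in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_class):
        for item in rng.sample(by_class[label], per_class):
            balanced.append((item, label))
    return balanced


# Toy corpus: 30 passage IDs with imbalanced class sizes 15 / 10 / 5.
items = list(range(30))
labels = [0] * 15 + [1] * 10 + [2] * 5
balanced = downsample(items, labels)
```

For WeeBit, `per_class` would be 625; for Cambridge, 60.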

Search on Neural Model
Extending from the existing use of BERT on RA (Deutsch et al., 2020; Martinc et al., 2021), we explore RoBERTa (Liu et al., 2019), BART (Lewis et al., 2020), and XLNet (Yang et al., 2019). We use base models for all (details in appendix D). For each of the four models (table 2), we perform grid searches on WeeBit validation sets to identify the well-performing hyperparameters based on 5-fold mean accuracy. Once identified, we used the same configuration for all the other corpora and performed no corpus-specific tweaking. We search the learning rates of [1e-5, 2e-5, 4e-5, 1e-4] and the batch sizes of [8, 16, 32]. The input sequence lengths are all set at 512, and we used the Adam optimizer. Last, we fine-tuned the model for three epochs. Full hyperparameters are in appendix F.
In table 2, RoBERTa & BART beat BERT & XLNet on most metrics. Martinc et al. (2021) report that transformers are weak on parallel datasets (OneStopEnglish) due to their reliance on semantic information. However, RoBERTa & BART show great performances on OneStopEnglish as well. Such a phenomenon likely derives from numerous aspects of the architecture. We carefully posit that the varying pretraining steps could be a reason.
BERT uses two objectives, masked language model (MLM) and next sentence prediction (NSP). The latter was included to capture the relation between sentences for natural language inference. Thus, sentence/segment-level input is used. Likewise, XLNet adopts a similar idea, limiting input to sentence/segment-level. But RoBERTa disproved the efficiency of NSP, adopting document-level inputs. Similarly, BART, via random shuffling of sentences and in-filling scheme, does not limit itself to a sentence/segment size input. As in section 3, "readability" is possibly a global-level representation (accumulated across the whole document). Thus, the performance differences could stem from the pretraining input size; sentence/segment-level input likely loses the global-level information.
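The search over learning rates and batch sizes amounts to scoring 4 × 3 = 12 configurations by 5-fold mean accuracy. The sketch below shows only that control flow; `toy_fold_accuracy` is a hypothetical placeholder that returns made-up numbers where a real run would fine-tune a transformer and evaluate it on one validation fold.

```python
from itertools import product
from statistics import mean

LEARNING_RATES = [1e-5, 2e-5, 4e-5, 1e-4]
BATCH_SIZES = [8, 16, 32]


def grid_search(fold_accuracy, n_folds=5):
    """Return (best mean accuracy, learning rate, batch size) over the grid."""
    best = None
    for lr, bs in product(LEARNING_RATES, BATCH_SIZES):
        score = mean(fold_accuracy(lr, bs, fold) for fold in range(n_folds))
        if best is None or score > best[0]:
            best = (score, lr, bs)
    return best


def toy_fold_accuracy(lr, bs, fold):
    # Placeholder scoring with made-up numbers, NOT results from the paper.
    return 0.8 - abs(lr - 2e-5) * 1000 - abs(bs - 16) * 0.001 + fold * 0.001


best_score, best_lr, best_bs = grid_search(toy_fold_accuracy)
```

Once the best configuration is found on WeeBit, the same values would be reused for the other corpora, as described above.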

Search on Non-Neural Model
We explored SVM, Random Forest (RandomF), Gradient Boosting (XGBoost) (Chen and Guestrin, 2016), and Logistic Regression (LogR). With the exception of XGBoost, the chosen models are frequently used in RA but rarely go through adequate hyperparameter optimization steps (Ma et al., 2012; Yaneva et al., 2017; Mohammadi and Khasteh, 2020). We perform a randomized search to first identify the sensible range of hyperparameters to search. Then, we apply grid search to specify the optimal values. The parameters are in appendix F.
In table 3, we report the performances of the parameter-optimized models trained with all 255 handcrafted features. Compared to transformers, these non-neural models show lower accuracy in general. Even on the smallest Cambridge dataset, non-neural models do not necessarily show higher performances than transformers. But it is important to note that they managed to show fairly good, "expectable" performances on a much smaller dataset.

Search on Handcrafted Features
We start by ranking performances of the feature subgroups. In table 4, we report the top 7 (upper half) by accuracy on WeeBit. The result is obtained after training the respective model using the specified feature subgroup. Importantly, the advanced semantic features show good performance in all measures. WorF and PsyF, features calculated from external databases, rank in the top 7 for all corpora, hinting they are strong measures of readability.
Moving on, we constructed several types of feature combinations with varying aims. These include: 1. T-type to thoroughly capture linguistic properties and 2. P-type to collect features by performance. We tested the variations on LogR and SVM to determine the optimal sets. Two sets (table 6) performed well. Appendix G reports all tested variations. We highlight that both advanced semantics and discourse added distinct (orthogonal) information, which was evidenced by performance change.

Assembling Hybrid Model
Based on the exploration so far, we assemble our hybrid model. We perform a brute-force grid search on four neural models (table 2), four non-neural models (table 3), and 14 feature sets (table 24).
To interweave the model, we followed these steps: 1. obtain soft labels (probabilities that a text belongs to the respective readability class) from a neural model by its softmax layer, 2. append the soft labels to handcrafted features (create a dataframe), and 3. train the non-neural model on the dataframe. As in fig. 2, the neural models re-predict the data used for training so that the dataframe dimensions match in the training and test stages.
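Steps 1 and 2 can be sketched without any ML library: turn the neural model's class logits into soft labels with softmax, then append them to the handcrafted feature vector. The function names, the column order (soft labels first), and the toy numbers are assumptions; step 3, training a secondary predictor such as Random Forest on the resulting rows, is left out.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hybrid_row(logits, handcrafted):
    """Join soft labels from the neural model with handcrafted features."""
    return softmax(logits) + list(handcrafted)


# Toy inputs for two passages: 3 readability classes, 2 handcrafted features each.
neural_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
handcrafted = [[12.4, 0.61], [21.9, 0.48]]
rows = [hybrid_row(l, h) for l, h in zip(neural_logits, handcrafted)]
```

Each row then has (number of classes + number of handcrafted features) columns, and the same layout must be produced for both training and test passages so the secondary predictor sees matching dimensions.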

Hybrid Model Results and Limitations
In table 5, our hybrid models achieve SOTA performances on WeeBit (BART-RF-T1) and OneStopEnglish (RoBERTa-RF-T1). With the exception of Xia et al. (2016), which uses extra data to increase accuracy, we also achieve SOTA on Cambridge: 76.3% accuracy on a small dataset of only 60 items/class. Among the hybrids, RoBERTa-RF-T1 showed consistently high performance on all metrics. But all hybrid models beat previous SOTA results by a large margin. Notably, we achieve the near-perfect accuracy of 99% on OneStopEnglish, a massive 20.3% increase from the previous SOTA (Martinc et al., 2021) by HAN (Meng et al., 2020).
Both neural and non-neural models benefit from the hybrid architecture. This is explicitly shown in BERT-GB-T1 performance on OneStopEnglish, achieving 98.2% accuracy. This is an 18.1% increase from BERT and a 26.3% increase from XGBoost. However, BART did not benefit much from the hybrid architecture on WeeBit and OneStopEnglish, meaning that hybrid architectures do not augment model performance under all conditions.
Along similar lines, the hybrid architecture performance on the larger WeeBit dataset showed only a small improvement from the transformer-only result. On the other hand, the hybrid architecture performance on the smaller Cambridge dataset was consistently better than the transformer-only performance. The hybrid shows ∼10% improvement in accuracy on average for Cambridge. On the smallest dataset (Cambridge), the hybrid architecture benefited more from a non-neural, handcrafted features-based model like RF (Random Forest) and GB (XGBoost). On the largest dataset (WeeBit), the hybrid benefited more from a transformer.
Our explanation is that the handcrafted features do not add much at the data size of WeeBit. But the handcrafted features could be a great help where data is insufficient, as they were for the Cambridge dataset. OneStopEnglish, being the medium-sized parallel dataset, could have hit the sweet spot for the hybrid architecture. But it must be noted that the data size is not the only determining factor as to which model (neural or non-neural) the hybrid architecture benefits more from. It must also be questioned if the max performance (because of label noise induced by subjectivity) (Frénay et al., 2014) is already achieved on WeeBit (Deutsch et al., 2020).
Also, it seems that the hybrid architecture benefits when each model (neural and non-neural) already shows considerably good performance. This is plausible as the neural model outputs are considered features for the non-neural model. Including more "fairly" well-performing features only creates extra distractions. The hybrid architecture's limit is that it gets a model from "good" to "great," not "fair" to "good." But determining the definition of "fair" performance is a difficult feat as it likely depends on the dataset and a researcher's intuition from the empirical experience of the model. Hence, the hybrid architecture's limit is that one must test several combinations to pick the effective one.

Why Not Directly Append Features?
Regarding the model architecture, we examined appending the handcrafted features to transformer embeddings without the use of a secondary predictor. ReadNet (Meng et al., 2020) hints that such a model is not robust to small datasets. ReadNet reports 52.8% accuracy on Cambridge, worse than any of our tested models (tables 2, 3, 5). Besides, ReadNet claims to have achieved 91.7% accuracy on WeeBit, without reports on downsampling. Many studies, like Deutsch et al. (2020), report that the model accuracy can increase ∼4% on the full, class-imbalanced WeeBit. Hence, ReadNet is not directly comparable. We omitted ReadNet from table 5.

BERT vs BERT, Ours Was Better
Noticeable in table 2 and table 5 is that our BERT implementation performed much better on WeeBit than what was reported.The dataset preparation methods and overall evaluation settings are the same or very similar across ours (accuracy: 89.3%), Deutsch et al. (2020)'s (accuracy: 83.9%), and Martinc et al. (2021)'s (accuracy: 85.7%).We believe that the differences stem from the hyperparameters.
Notably, Deutsch et al. (2020) use an input sequence length of 128. This is ineffective as the downsampled WeeBit has 2374 articles of over 128 tokens but only 275 articles of over 512 tokens (which was our input sequence length). Hence, we can reasonably think that much semantic information was lost in Deutsch et al. (2020)'s implementation. Martinc et al. (2021) use an input sequence length of 512 but lack a report on other possibly critical hyperparameters, so we cannot compare in detail.

Data Size Effect
In table 5, our hybrid architecture generally did not contribute much to the classification on WeeBit.But we argue that it has much to do with data size.
To model how data size affects the accuracies of 1. hybrid model, 2. transformer, and 3. traditional ML, we conducted an additional experiment using the same test data (10% of WeeBit) explained in section 4.2.1. However, we randomly sampled the train data (80% of WeeBit) into smaller sizes from 50 to 750, in increments of 50 passages. We sampled with equal class weights, meaning that a 250-passage train set has 50 from each readability class. We trained BERT-GB-T1 (table 5) on the sampled data and evaluated on the same test data throughout. We also recorded BERT and XGBoost (with T1 features) performances in fig. 3.
In fig. 3, the hybrid model performs consistently better than the transformer (+0.01 ∼ 0.05) at all sizes. But the gap gets smaller as the train data size increases. The hybrid model does help the efficiency of learning RA linguistic properties.
Contrary to conventional beliefs, the transformer (BERT) performed better than our expectations, even on smaller data sizes. BERT always outperformed XGBoost. The traditional ML performance was arguably more consistent but never better than a transformer's.
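The sampling scheme for this experiment can be sketched as a loop over train sizes from 50 to 750 in steps of 50, drawing the same number of passages from each class. The function name, the seed, and the toy labels are hypothetical, and drawing each subset independently (rather than nesting smaller sets inside larger ones) is an assumption.

```python
import random
from collections import defaultdict


def balanced_subsets(labels, sizes, seed=0):
    """Yield (size, indices) where each class contributes size / n_classes items."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    for size in sizes:
        per_class = size // len(by_class)
        picked = []
        for label in sorted(by_class):
            picked.extend(rng.sample(by_class[label], per_class))
        yield size, picked


SIZES = range(50, 751, 50)
# Toy train labels: 5 readability classes with 500 passages each.
toy_labels = [i % 5 for i in range(2500)]
subsets = dict(balanced_subsets(toy_labels, SIZES))
```

Each subset would then be used to train the hybrid, the transformer, and the traditional ML model, all evaluated on the same fixed test split.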

Domain Overfitting and Cross Domain Evaluation
The 99% accuracy on OneStopEnglish (table 5) shows that our model is capable of almost perfectly capturing the linguistic properties relating to readability on certain datasets. This is a positive and abnormally quick improvement, considering that the reported RA accuracies have never exceeded 90% on popular datasets (Vajjala and Meurers, 2012; Xu et al., 2015; Xia et al., 2016; Vajjala and Lučić, 2018) until 2021. Since the reported in-domain accuracies in RA had much room for improvement, we were not at the stage to be seriously concerned about cross-domain evaluation (Štajner and Nisioi, 2018) in this paper.
It would be very favorable to run an extra cross-domain evaluation (which we believe to be a next-level topic). But realistically, performing a cross-domain evaluation requires a thorough study on at least two datasets, which is potentially out of scope in this research. The readability classes/levels are labeled by a few human experts, making the standards vary among datasets. To make two datasets suitable for cross-domain evaluation, much effort is needed to connect the two, such as the class mapping used in Xia et al. (2016). However, it should be noted for future researchers that the notion of domain overfitting is indeed a common problem faced in RA, which often uses one dataset for train/test/validation. Without a new methodology to connect several datasets or a new large public dataset for RA, it will forever be challenging to develop an RA model for general use (Vajjala, 2021).

Conclusion
We have reported the four contributions mentioned in section 1. We checked that the new advanced semantic features add orthogonal information to the model. Further, we created hybrid models (table 5) that achieved SOTA results. RoBERTa-RF-T1 achieved 99% accuracy on OneStopEnglish, and BERT-GB-T1 beat the previous SOTA on WeeBit using only 30% of the original train data.

As a Gentle Reminder
To the general NLP community, the most prominent characteristic of our proposed method might be that we utilize handcrafted features and traditional ML models, which are often considered "outdated." Interestingly, these outdated methods maintained SOTA in RA until Martinc et al. (2021) utilized BERT (as already discussed).
The findings we report are not limited to the technical innovations that achieved the new SOTA. Rather, we want to stress that: 1. there are still many areas in NLP that insist on traditional methodologies, which potentially hinders the improvement in model accuracy, 2. but we must also take time to look back on these outdated methods and their linguistic values. If we achieved anything meaningful through this research, it was possible because we realized the abovementioned two situations.

idx  Code        Definition
55   ra_SSToT_C  ratio of SS transitions : total, count from Entity Grid
56   ra_SOToT_C  ratio of SO transitions : total, count from Entity Grid
57   ra_SXToT_C  ratio of SX transitions : total, count from Entity Grid
58   ra_SNToT_C  ratio of SN transitions : total, count from Entity Grid
59   ra_OSToT_C  ratio of OS transitions : total, count from Entity Grid
60   ra_OOToT_C  ratio of OO transitions : total, count from Entity Grid
61   ra_OXToT_C  ratio of OX transitions : total, count from Entity Grid
62   ra_ONToT_C  ratio of ON transitions : total, count from Entity Grid
63   ra_XSToT_C  ratio of XS transitions : total, count from Entity Grid
64   ra_XOToT_C  ratio of XO transitions : total, count from Entity Grid
65   ra_XXToT_C  ratio of XX transitions : total, count from Entity Grid
66   ra_XNToT_C  ratio of XN transitions : total, count from Entity Grid
67   ra_NSToT_C  ratio of NS transitions : total, count from Entity Grid
68   ra_NOToT_C  ratio of NO transitions : total, count from Entity Grid
69   ra_NXToT_C  ratio of NX transitions : total, count from Entity Grid
70   ra_NNToT_C  ratio of NN transitions : total, count from Entity Grid

1. Feature codes consist of 8 letters/numerals, with 1 or 2 underscores depending on feature types.
2. All features classify into either count-based or score-based, following popular convention.
• Count-based. Define: final calculation uses simple counts (i.e. total, avg per sent, avg per token, ratio). Format: xx_xxxxx_C. First two letters are "to" (total), "as" (avg per sent), "at" (avg per token), "ra" (ratio). Five letters in the middle explain what the feature is. Last letter always "C." Two underscores in between.
• Score-based. Define: require additional calculation (e.g. log, square), or famous features with predefined names (e.g. Flesch-Kincaid, TTR). Format: xxxxxxx_S. Seven letters are all dedicated to explaining what the feature is. Last letter always "S." One underscore.
3. For the "explanation" part of each feature code, capital letters denote new words. The small letters that follow are from the same word. (e.g. 1: Coleman Liau → ColeLia, 2: AoA (Age of Acquisition) Kuperman of words → AAKuW)
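The naming rules above are regular enough to check mechanically. The sketch below parses a feature code into its type, calculation prefix, and explanation part, and splits the explanation at capital letters; the regular expressions and function name are hypothetical, and "ColeLia_S" is constructed from the stated format rather than copied from the feature list.

```python
import re

PREFIXES = {
    "to": "total",
    "as": "avg per sentence",
    "at": "avg per token",
    "ra": "ratio",
}
COUNT_RE = re.compile(r"^(to|as|at|ra)_([A-Za-z0-9]{5})_C$")
SCORE_RE = re.compile(r"^([A-Za-z0-9]{7})_S$")


def parse_feature_code(code):
    """Split a feature code into type, calculation, name, and word chunks."""
    m = COUNT_RE.match(code)
    if m:
        kind, calc, name = "count", PREFIXES[m.group(1)], m.group(2)
    else:
        m = SCORE_RE.match(code)
        if not m:
            raise ValueError(f"not a recognized feature code: {code}")
        kind, calc, name = "score", None, m.group(1)
    # Capital letters start new words; following lowercase letters or digits belong to them.
    words = re.findall(r"[A-Z][a-z0-9]*", name)
    return {"type": kind, "calc": calc, "name": name, "words": words}


count_feature = parse_feature_code("ra_SSToT_C")
score_feature = parse_feature_code("ColeLia_S")
```

A check like this can catch malformed codes early, since both formats always contain exactly 8 letters or numerals plus one or two underscores.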

Table 1: Statistics for datasets.
uses semi-supervised learning (self-training) on a larger corpus to increase performance. **

Table 6: Best feature sets.

Table 10: OneStop Knowledge Features (OSKF).

number of Entities Mentions per sentence
51   at_EntiM_C  average number of Entities Mentions per token (word)
52   to_UEnti_C  total number of unique Entities
53   as_UEnti_C  average number of unique Entities per sentence
54   at_UEnti_C  average number of unique Entities per token (word)