LFTK: Handcrafted Features in Computational Linguistics

Past research has identified a rich set of handcrafted linguistic features that can potentially assist various tasks. However, their extensive number makes it difficult to effectively select and utilize existing handcrafted features. Coupled with inconsistent implementations across research works, the absence of a categorization scheme or generally accepted feature names creates unwanted confusion. Also, no actively maintained open-source library extracts a wide variety of handcrafted features. The current handcrafted feature extraction practices have several inefficiencies, and a researcher often has to build such an extraction system from the ground up. We collect and categorize more than 220 popular handcrafted features grounded in past literature. Then, we conduct a correlation analysis study on several task-specific datasets and report the potential use cases of each feature. Lastly, we devise a multilingual handcrafted linguistic feature extraction system in a systematically expandable manner. We open-source our system to give the community a rich set of pre-implemented handcrafted features.


Introduction
Handcrafted linguistic features have long been inseparable from natural language processing (NLP) research. Even though automatically generated features (e.g., Word2Vec, BERT embeddings) have recently been the mainstream focus due to the fewer manual efforts required, handcrafted features (e.g., type-token ratio) are still actively used in many research works (Weiss and Meurers, 2022; Campillo-Ageitos et al., 2021; Chatzipanagiotidis et al., 2021; Kamyab et al., 2021; Qin et al., 2021; Esmaeilzadeh and Taghva, 2021). This creates a constant demand for both the identification of new handcrafted features and the utilization of existing ones. After reviewing recent research, we observed that most work on auto-generated features tends to focus on creating deeper semantic representations of natural language. On the other hand, researchers utilize handcrafted features to create wider numerical representations, encompassing syntax, discourse, and other properties. An interesting new trend is that these handcrafted features are often used to assist auto-generated features in creating wide and deep representations for educational applications like English readability assessment (Lee et al., 2021) and automatic essay scoring (Uto et al., 2020) systems.
Though using handcrafted features seems to benefit multiple research fields, current feature extraction practices suffer from critical weaknesses. One is the inconsistent implementation of the same handcrafted feature across research works. For example, the exact implementation of the average words per sentence feature can differ between Lee et al. (2021) and Pitler and Nenkova (2008), even though both works deal with text readability. Also, there has been no standard for categorizing these handcrafted features, which furthers the confusion.
In addition, no open-source feature extraction system supports multiple languages, though handcrafted features are increasingly used in non-English applications. Handcrafted linguistic features can be critical resources for understudied or low-resource languages, which often lack high-performance textual encoding models like BERT. In such cases, handcrafted features can be useful in creating text embeddings for machine learning studies (Zhang et al., 2022; Kruse et al., 2021; Maamuujav et al., 2021). In this paper, we make two contributions to address the shortcomings in current handcrafted feature extraction practices.
1. We systematically categorize an extensive set of reported handcrafted features and create a feature extraction toolkit. The main contribution of this paper is that we collect more than 200 handcrafted features from diverse NLP research, like text readability assessment, and categorize them. We take a systematic approach for ease of future expansion. Notably, we designed the system so that a fixed set of foundation features can build up to various derivation features. We then categorize the implemented features into four linguistic branches and 12 linguistic families, considering the original author's intention. Each linguistic feature is also labeled with its available languages, depending on whether our system can extract the feature in a language-agnostic manner. Our extraction software is built on top of another open-source library, spaCy (github.com/explosion/spaCy), to ensure high-performance parsing, multilingualism, and future reproducibility by citing a specific version. Our feature extraction software aims to cover most of the handcrafted linguistic features generally found in recent research.

2. We report basic correlation analysis on various task-specific datasets. Due to the nature of the tasks, most handcrafted features are from text readability assessment or linguistic analysis studies with educational applications in mind. The broader applications of these handcrafted features to other fields, like text simplification or machine translation corpus generation, have only been reported fairly recently (Brunato et al., 2022; Yuksel et al., 2022). Along with the feature extraction software, we report the predictive abilities of these handcrafted features on four NLP tasks by performing a baseline correlation analysis. In doing so, we identify some interesting correlations that have not been previously reported. We believe our preliminary study can serve as a basis for future in-depth studies.
In a way, we aim to address the recent concern about the lack of ready-to-use code artifacts for handcrafted features (Vajjala, 2022). Through this work, we hope to improve the general efficiency of identifying and implementing handcrafted features for researchers in related fields.
Related Work

What are Handcrafted Features?
The type of linguistic feature we are interested in is often referred to as handcrafted linguistic feature, a term found throughout NLP research (Choudhary and Arora, 2021;Chen et al., 2021;Albadi et al., 2019;Bogdanova et al., 2017). Though the term "handcrafted linguistic features" is loosely defined, there seems to be some unspoken agreement among existing works. In this work, we define a handcrafted linguistic feature as a single numerical value produced by a uniquely identifiable method on any natural language (refer to Figure 2).
Unlike automatic or computer-generated linguistic features, these handcrafted features are often manually defined by combining the text's features with simple mathematical operations like root or division (Lee et al., 2021). For example, the average difficulty of words (calculated with an external word difficulty-labeled database) can be considered a handcrafted feature (Lee and Lee, 2020). Though the scope of what can be considered a single handcrafted feature is very broad, each feature almost always produces a single float or integer as the result of the calculation. More examples of such handcrafted features will appear as we proceed.

Figure 3: This diagram shows how we collected all handcrafted linguistic features implemented in our extraction software. This is our general framework for categorizing features for future expansion too.

Hybridization of Handcrafted Features
It takes a great deal of effort to make automatic or computer-generated linguistic features capture the full linguistic properties of a text, other than its semantic meaning (Gong et al., 2022;Hewitt and Manning, 2019). For example, making BERT encoding capture both semantics and syntax with high quality can be difficult (Liu et al., 2020). On the other hand, combining handcrafted features to capture wide linguistic properties, such as syntax or discourse, can be methodically simpler. Hence, handcrafted features are often infused with neural networks in the last classification layer or directly with a sentence's semantic embedding to enhance the model's ability in holistic understanding (Hou et al., 2022;Lee et al., 2021). Such feature hybridization techniques are found in multiple NLP tasks like readability assessment (Vajjala, 2022) and essay scoring (Ramesh and Sanampudi, 2022).

Handcrafted Features in Recent Studies
Until recently, NLP tasks that require a holistic understanding of a given text have utilized machine learning models based only on handcrafted linguistic features. Such tasks include L2 learners' text readability assessment (Lee and Lee, 2020), fake news detection (Choudhary and Arora, 2021), bias detection (Spinde et al., 2021), and learner-based reading passage selection (Lee and Lee, 2022). Naturally, these fields have handcrafted and identified a rich set of linguistic features we aim to collect in this study. We highlight text readability assessment research as an important source of our implemented features. Such studies often involve 80∼255 features from diverse linguistic branches such as advanced semantics (Lee et al., 2021), discourse (Feng et al., 2010), and syntax (Xia et al., 2016).

Assembling a Large-Scale Handcrafted Linguistic Feature Extractor

Overview
By exploring past works that deal with handcrafted linguistic features, we aim to implement a comprehensive set of features. These features are commonly found across NLP tasks, but ready-to-use public code rarely exists. We collected and categorized over 200 handcrafted features from past research works, mostly on text readability assessment, automated essay scoring, fake news detection, and paraphrase detection. Figure 3 depicts our general process of implementing a single feature. Tables 1 and 2 show more details on categorization.

Formulation
The main idea behind our system is that most handcrafted linguistic features can essentially be broken down into multiple fundamental blocks. Depending on whether a feature can be split into smaller building blocks, we categorized all collected features into either foundation or derivation. Then, we designed the extraction system to build all derivation features on top of the corresponding foundation features. This enables us to exploit all available combinations efficiently and ensure a unified extraction algorithm across features of similar properties. The derivation features are simple mathematical combinations of one or more foundation features. For example, the average number of words per sentence is a derivation feature, defined by dividing total number of words by total number of sentences. A foundation feature can be the fundamental building block of several derivation features. But again, a foundation feature cannot be split into smaller building blocks. We build 155 derivation features out of 65 foundation features in the current version.
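The foundation/derivation split described above can be sketched in a few lines of Python. This is an illustrative sketch, not the LFTK implementation: the feature keys (`t_word`, `t_sent`, `a_word_ps`) and the naive string splitting are assumptions made to keep the example self-contained.

```python
# Sketch of the foundation/derivation split (illustrative only):
# foundation features are raw counts; derivation features are simple
# arithmetic combinations of one or more foundation features.
def foundation_features(text):
    sentences = [s for s in text.split(".") if s.strip()]  # naive splitter
    words = text.split()
    return {"t_word": len(words), "t_sent": len(sentences)}

def derive(found):
    # average words per sentence = total words / total sentences
    return {"a_word_ps": found["t_word"] / found["t_sent"]}

found = foundation_features("The cat sat. The dog barked loudly.")
print(derive(found))  # {'a_word_ps': 3.5}
```

In the real system, tokenization is delegated to spaCy; the string splitting here only keeps the sketch dependency-free.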

Linguistic Property
Each handcrafted linguistic feature represents a certain linguistic property. But it is often difficult to pinpoint the exact property because features tend to correlate with one another. Hence, we only categorize all features into the broad linguistic branches of lexico-semantics, syntax, discourse, and surface. The surface branch can also hold features that do not belong to any specific linguistic branch. The linguistic branches are categorized in reference to Collins-Thompson (2014). We mainly considered the original author's intention when assigning a linguistic branch in unclear cases.
Apart from linguistic branches, handcrafted features are also categorized into linguistic families. The linguistic families are meant to group features into smaller subcategories, enabling users to search more effectively for the feature they need. All family names are unique, and each family belongs to a specific formulation type. This means that the features in a family are either all foundation or all derivation. A linguistic family also serves as a building block of our feature extraction system. Our extraction program is essentially a linked collection of several feature extraction modules, each representing a linguistic family (refer to Figure 4).

Applicable Language
Since handcrafted features are increasingly used for non-English languages, it is important to deduce whether a feature is generally extractable across languages. Though our extraction system is also designed with English applications in mind, we devised a systematic approach to deduce if an implemented feature is language agnostic. Like the example in Table 3, we only classify a derivation feature as generally applicable if all its components (foundation features) are generally applicable.
We can take the example of the average number of nouns per sentence, defined by dividing the total number of nouns by the total number of sentences. Since both component foundation features are generally applicable (we use the UPOS tagging scheme), we can deduce that the derivation feature is generally applicable too. On the other hand, Flesch-Kincaid Grade Level (FKGL) is not generally applicable because our system's syllable counter is English-specific. But clearly, our classification of language applicability should only be used as a guideline for our system's capabilities, not as an indicator of a feature's universal effectiveness.
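This deduction rule can be sketched as follows. The feature keys and the two-value language scheme ("general" vs. "en") are hypothetical simplifications for illustration.

```python
# A derivation feature is language-agnostic ("general") only if every
# foundation feature it depends on is language-agnostic.
FOUNDATION_LANG = {
    "t_noun": "general",  # UPOS-based noun count
    "t_sent": "general",  # sentence count
    "t_word": "general",  # word count
    "t_syll": "en",       # English-specific syllable counter
}
DERIVATION_DEPS = {
    "a_noun_ps": ["t_noun", "t_sent"],       # avg nouns per sentence
    "fkgl": ["t_word", "t_sent", "t_syll"],  # Flesch-Kincaid Grade Level
}

def applicability(derivation):
    deps = DERIVATION_DEPS[derivation]
    if all(FOUNDATION_LANG[d] == "general" for d in deps):
        return "general"
    return "en"

print(applicability("a_noun_ps"))  # general
print(applicability("fkgl"))       # en
```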

Feature Details by Linguistic Family
Due to space restrictions, we only report the number of implemented features in Tables 4 and 5. A full list of these features is available in the Appendices. The following sections elaborate on the motivations and implementations behind the features.

WordSent & AvgWordSent
WordSent is a family of foundation features for character, syllable, word, and sentence count statistics. With the exception of syllables, this family heavily depends on spaCy for tokenization. spaCy is a high-accuracy parser module that has been widely used in NLP research. Since spaCy does not provide syllable counts, we use a custom syllable count algorithm.
AvgWordSent is a family of derivation features for averaged character, syllable, word, and sentence count statistics. An example is the average number of syllables per word, a derivation of the total number of words and the total number of syllables foundation features.

WordDiff & AvgWordDiff
WordDiff is a family of foundation features for word difficulty analysis. This is a major topic in educational applications and second language acquisition studies, represented by age-of-acquisition (AoA, the age at which a word is learned) and corpus-based word frequency studies. Notably, there is the Kuperman AoA rating of over 30,000 words (Kuperman et al., 2012), an implemented feature in our extraction system. Another implemented feature is the word frequency statistic based on SUBTLEXus research, an improved word frequency measure based on American English subtitles (Brysbaert et al., 2012). AvgWordDiff averages the WordDiff features by word or sentence counts. This enables features like the average Kuperman age-of-acquisition per word.

PartOfSpeech & AvgPartOfSpeech
PartOfSpeech is a family of foundation features that count part-of-speech (POS) properties on the token level based on dependency parsing. Here, we use spaCy's dependency parser, which is available in multiple languages. All POS counts are based on the UPOS tagging scheme to ensure multilingualism. These POS count-based features are found multiple times across second language acquisition research (Xia et al., 2016;Vajjala and Meurers, 2012). The features in AvgPartOfSpeech family are the averages of PartOfSpeech features by word or sentence counts. One example is the average number of verbs per sentence.
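With spaCy, each token exposes its UPOS tag through `token.pos_`; the following sketch simulates that output with pre-tagged pairs so the example stays self-contained (the sentence and tags are made up for illustration).

```python
from collections import Counter

# Simulated spaCy output: (token, UPOS) pairs for a one-sentence text.
tagged = [("The", "DET"), ("cat", "NOUN"), ("chased", "VERB"),
          ("a", "DET"), ("mouse", "NOUN"), (".", "PUNCT")]
n_sent = 1

pos_counts = Counter(upos for _, upos in tagged)  # foundation: POS counts
avg_noun_ps = pos_counts["NOUN"] / n_sent         # derivation: avg nouns/sentence
print(pos_counts["NOUN"], avg_noun_ps)  # 2 2.0
```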

Entity & AvgEntity
Central to discourse analysis, Entity is a family of foundation features that count entities. Often used to represent the discourse characteristics of a text, these features have been famously utilized by a series of research works in readability assessment to measure the cognitive reading difficulty of texts for adults with intellectual disabilities (Feng et al., 2009, 2010). The AvgEntity family contains the averages of Entity features by word or sentence counts. One example is the average number of "organization" entities per sentence.

LexicalVariation
Second language acquisition research has identified that the variation of words in the same POS category can correlate with the lexical richness of a text (Vajjala and Meurers, 2012; Housen and Kuiken, 2009). One example derivation feature in this family is obtained by dividing the number of unique verbs by the number of verbs, often referred to as "verb variation" in the literature. Further derivations using squares or roots ("verb variation-1" and "verb variation-2") are also implemented in our system.

TypeTokenRatio
Type-token ratio, often called TTR, is another set of features found across second/child language acquisition research (Kettunen, 2014). This is perhaps one of the oldest lexical richness measures of a written/oral text (Hess et al., 1989; Richards, 1987). Though TypeTokenRatio features essentially aim to measure similar textual characteristics as LexicalVariation features, we placed TTR in its own linguistic family due to its unique prevalence.
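A minimal TTR computation looks like this (whitespace tokenization for brevity; the real system uses spaCy tokens):

```python
# Type-token ratio: unique token types divided by total tokens.
tokens = "the quick brown fox jumps over the lazy dog".split()
ttr = len(set(tokens)) / len(tokens)  # 8 types / 9 tokens
print(round(ttr, 3))  # 0.889
```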

ReadFormula
Before machine learning techniques were applied to text readability assessment, linear formulas were used to represent the readability of a text quantitatively (Solnyshkina et al., 2017). Recently, these formulas have been utilized for diverse NLP tasks like fake news classification (Choudhary and Arora, 2021) and authorship attribution (Uchendu et al., 2020). We have implemented the traditional readability formulas that are popularly used across recent works (Lee and Lee, 2023;Horbach et al., 2022;Gooding et al., 2021;Nahatame, 2021).
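As one concrete example, the classic Flesch-Kincaid Grade Level is a linear formula over three foundation counts (the example input numbers are made up):

```python
def fkgl(n_word, n_sent, n_syll):
    # Flesch-Kincaid Grade Level: a linear combination of word,
    # sentence, and syllable counts.
    return 0.39 * (n_word / n_sent) + 11.8 * (n_syll / n_word) - 15.59

# 100 words in 5 sentences with 140 syllables
print(round(fkgl(100, 5, 140), 2))  # 8.73
```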

Our Extraction System in Context
As we have explored, we tag each handcrafted linguistic feature with three attributes: domain, family, and language. These attributes assist researchers in efficiently searching for the features they need, one of the two research goals we mentioned in §1. Instead of individually searching for handcrafted features, they can sort and extract features by attribute.
Notably, our extraction system is fully implemented in Python, unlike other systems like Coh-Metrix (Graesser et al., 2004) and the L2 Syntactic Complexity Analyzer (Lu, 2017). Considering modern NLP research approaches (Mishra and Mishra, 2022; Sengupta, 2021; Jugran et al., 2021; Sarkar, 2019), the combination of open-source development and Python makes our extraction system more expandable and customizable for the community.
Excluding the spaCy model's processing time (which is not a part of our extraction system), our system can extract 220 handcrafted features from a dummy text of 1000 words in an average of 10 seconds. This translates to about 0.01 seconds per word, a result obtained by averaging over 20 trials of randomized dummy texts of exactly 1000 words. The fast extraction speed makes our extraction system suitable for large-scale corpus studies. Since our extraction system works with a wide variety of tokenizers (of different accuracies and processing times) available through spaCy, one might choose an appropriate model according to the size of the studied text. Since spaCy and our extraction system are open source and registered on the Python Package Index (PyPI), reproducibility can easily be maintained by specifying the versions.
Figure 4: Schematic representation of how a user might use LFTK to extract handcrafted features. Black line arrows represent inheritance relationships. Our extraction system is a collection of multiple linguistic family modules. To interweave this program and resolve multiple dependencies, we designed a foundation collector object to inherit all foundation linguistic families first. Then all derivation linguistic families inherit the same foundation collector object. A derivation collector then inherits all derivation linguistic families, and the main extractor object inherits the derivation collector object. Considering the recent research trend, our program is solely based on Python.

In addition, our extraction system achieves such a speed improvement due to our systematic breakdown of handcrafted features into foundation and derivation (see §3.1.1). As depicted in Figure 4, designing the system so that derivation features are built on top of foundation features reduces duplicate calculation to a minimum. Once a foundation feature is calculated, it is saved and reused by multiple derivation features. Indeed, the total number of words does not have to be calculated twice for average word difficulty per word and Flesch-Kincaid Grade Level.
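The caching idea can be sketched with a simplified stand-in for the actual LFTK objects (class name, feature keys, and the lambda-backed counts are all hypothetical):

```python
class FoundationCache:
    """Compute each foundation feature once; reuse it for every derivation."""

    def __init__(self, compute_fns):
        self._fns = compute_fns  # feature name -> zero-arg compute function
        self._cache = {}
        self.calls = 0           # number of real computations performed

    def get(self, key):
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._fns[key]()
        return self._cache[key]

cache = FoundationCache({
    "t_word": lambda: 100,
    "t_sent": lambda: 5,
    "t_syll": lambda: 140,
})
# Two derivation features share the t_word foundation feature.
avg_word_ps = cache.get("t_word") / cache.get("t_sent")
fkgl = (0.39 * avg_word_ps
        + 11.8 * cache.get("t_syll") / cache.get("t_word") - 15.59)
print(cache.calls)  # 3: t_word is used twice but computed once
```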

Which Applies to Which? Task-Feature Correlation Analysis
For handcrafted features to be generally useful to the larger NLP community, it is important to provide researchers with a sense of which features are potentially useful in their problem setup. This section reports simple correlation analysis results between our implemented features and four NLP tasks.
To the best of our knowledge, we chose a representative dataset for each task. Table 6 reports the Pearson correlation between each feature and the dataset labels. We only report the top 10 and bottom 10 features; the full results are available in the Appendices. We used the CLEAR corpus's crowdsourced algorithm of reading comprehension score controlled for text length (CAREC_M) for readability labels on 4724 instances (Crossley et al., 2022). We used the ASAP dataset's (www.kaggle.com/c/asap-aes/data) domain1_score on prompt 1 essays for student essay scoring labels on 1783 instances. We used the LIAR dataset for fake news labels on 10420 instances (Wang, 2017). We used the SemEval 2019 Task 5 dataset's PS for binary hate speech labels on 9000 instances (Basile et al., 2019).
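The per-feature analysis amounts to a plain Pearson correlation between a feature column and the label column. A dependency-free sketch of the computation:

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length sequences.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy example: a feature that perfectly tracks the labels gives r = 1.
print(round(pearson([1, 2, 3, 4], [2, 4, 6, 8]), 6))  # 1.0
```

In the actual analysis, the same quantity is computed with the Pandas library over each dataset's feature matrix.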
Though limited, our preliminary correlation analysis reveals some interesting correlations that have rarely been reported. For example, n_verb negatively correlates with the difficulty of a text. But there is much room to be explored. One utility of a large-scale feature extraction system like ours is the ease of revealing novel correlations that might not have been obvious.

Conclusion
In this paper, we have reported our open-source, large-scale handcrafted feature extraction system. Though our extraction system covers a large set of pre-implemented features, newer, task-specific features are constantly being developed. For example, URL count is used for Twitter bot detection (Gilani et al., 2017), and grammatical error count is used for automated essay scoring (Attali and Burstein, 2006). These features, too, fall under our definition (Figure 2) of handcrafted linguistic features. Our open-source script is easily expandable, making it convenient to create a modified, research-specific version of our extraction program. With various foundation features to build from, our extraction program will be a good starting point.

Table 6: Task, dataset, and top 10 correlated features (reported in both the positive and negative directions). Under our experimental setup, positive is more difficult in readability assessment. Positive is well-written in essay scoring. Positive is more truthful in fake news detection. Positive is hateful in hate speech detection. We only report feature keys due to space restrictions. The full correlation analysis and key-description pairs are available in the Appendices.
Another potential user group of our extraction library is those looking to improve a neural or non-neural model's performance by incorporating more features. Performance-wise, the breadth of linguistic coverage is often as important as feature selection (Lee et al., 2021; Yaneva et al., 2021; Klebanov and Madnani, 2020; Horbach et al., 2013). Our current work has various implemented features, and we believe the extraction system can be a good starting point for many research works.
Compared to other historically important code artifacts like Coh-Metrix (Graesser et al., 2004) and the L2 Syntactic Complexity Analyzer (Lu, 2017), our extraction system is comparable or larger in size. To the best of our knowledge, this research is the first attempt to create a "general-purpose" handcrafted feature extraction system; that is, we wanted to build a system that can be widely used across NLP tasks. To do so, we have considered expandability and multilingualism from the architecture design stage. Such consideration is grounded in the systematic categorization of popular handcrafted linguistic features into attributes like domain and family. With the open-source release of our system, we hope that the current problems in feature extraction practices (§1) can be alleviated.

A All implemented features
Our extraction software is named LFTK, and its current version is 1.0.9. Tables 7, 8, 9, and 10 reference v1.0.9. We only report the linguistic family here due to space restrictions. Though our feature descriptions will be regularly updated at this address 3 whenever there is a version update, we also include the current version's full feature table in our extraction program. Through PyPI or GitHub, any published version of our program is always retrievable.

B Feature correlations
Tables 11, 12, 13, and 14 report the full feature correlations that are not reported in Table 6. We used spaCy's en_core_web_sm model, library version 3.0.5. Pearson correlation was calculated through the Pandas library, version 1.1.4. All reported versions are those used in our experiments.