IFlyEA: A Chinese Essay Assessment System with Automated Rating, Review Generation, and Recommendation

Automated Essay Assessment (AEA) aims to judge students' writing proficiency in an automatic way. This paper presents a Chinese AEA system, IFlyEssayAssess (IFlyEA), which targets essays written by native Chinese students from primary and junior schools. IFlyEA provides multi-level and multi-dimension analytical modules for essay assessment. It offers state-of-the-art grammar-level analysis techniques, and also integrates components for rhetoric- and discourse-level analysis, which are important for evaluating native speakers' writing ability but remain challenging and less studied in previous work. Based on this comprehensive analysis, IFlyEA provides application services for essay scoring, review generation, recommendation, and explainable analytical visualization. These services can benefit both teachers and students in the process of teaching and learning writing.


Introduction
Automated essay assessment (AEA) is an important educational application (Page, 1968; Rudner et al., 2006). It aims to reduce teachers' burden of scoring student essays and to give students direct instructions for improving their writing ability.
Automated essay scoring (AES) is one of the most important modules for AEA, and is usually formulated as a supervised learning problem. Early approaches utilized hand-crafted features to predict essay scores (Yannakoudakis et al., 2011; Chen and He, 2013; Phandi et al., 2015). Recently, deep learning has been applied to AES as well (Taghipour and Ng, 2016; Song et al., 2020c).
One issue with AES is that its prediction lacks explainability, since a single score gives very limited information. Much effort has been devoted to expanding the boundary of AES by analyzing detailed linguistic properties, such as grammatical errors (Ng et al., 2014), coherence (Somasundaran et al., 2014), and organization (Burstein et al., 2003; Persing et al., 2010).
Several AES systems, such as E-Rater (Attali and Burstein, 2006) and Lingglewrite (Tsai et al., 2020), have been successfully applied in educational scenarios. However, many of them focus on evaluating second-language learners' writing ability, or on evaluating basic language usage with shallow features, which may not be sufficient for evaluating essays written by native speakers. Moreover, most existing platforms mainly target English, while significantly fewer systems work on other languages, such as Chinese.
In this paper, we introduce IFlyEssayAssess (IFlyEA), a Chinese automated essay assessment system that focuses on assessing the quality of essays written by native Chinese students from primary and junior schools.
IFlyEA has the following highlights: • IFlyEA has comprehensive multi-level and multi-dimension analytical modules. It provides state-of-the-art Chinese spelling error correction and grammatical error diagnosis at the grammar level. More importantly, it also provides rich rhetoric- and discourse-level analysis, which is less studied but important for evaluating native speakers' writing ability.
• Based on the information provided by the analytical modules, IFlyEA provides a complete set of application services, including rating, review generation and recommendation.
• IFlyEA has an easy-to-use visualization and interactive interface, which can clearly show the detailed analytical results of an essay, and improve the explainability of predictions at the application level.
The target users of IFlyEA are students from primary and junior schools; on the other hand, it also helps teachers reduce their heavy workload. IFlyEA has been applied in practice and is being continually improved by learning from user feedback.

System Architecture
The main modules of IFlyEA can be categorized into two types: analytical modules and application modules, as shown in Figure 1. These modules are integrated with visualization and interactive interfaces.
The analytical modules involve multi-level and multi-dimension analysis of essay quality, which mainly cover three levels: • Grammar level: This level aims to judge whether students can correctly use words to communicate. IFlyEA applies several technical approaches such as spelling correction and grammatical error diagnosis.
• Rhetoric level: This level aims to judge whether students can gracefully and skillfully convey their ideas. IFlyEA can recognize rhetorical devices and beautiful sentences in essays.
• Discourse level: This level aims to judge whether students can logically connect basic discourse units to construct a coherent whole.
The system identifies discourse elements for representing and evaluating essay organization, and also has other discourse level analysis such as topic classification and genre classification.
The techniques at the grammar level are widely used for essay scoring, especially for evaluating second-language learners. The rhetoric and discourse levels are more important for evaluating essays written by native speakers, especially for distinguishing well-written essays from moderate ones.
The application modules include: • Essay scoring: This module gives scores to indicate the general quality of an essay and the quality of specific aspects.
• Review generation: This module provides readable reviews on multiple writing dimensions.
• Recommendation: This module suggests relevant and potentially helpful materials to students.
The review generation and recommendation modules depend on the results from the analytical modules and the essay scoring module.
In general, the analytical modules are the basis of the application modules, providing evidence and diagnosis, and also improving the explainability for the predictions of application modules. As illustrated in Figure 2, through web page visualization and interfaces, students or teachers can receive rich information and interact with the analytical results.

Analytical Modules
IFlyEA has multi-level and multi-dimension quality evaluation to provide comprehensive analytical results. This section will introduce the main analytical modules, which can be roughly categorized into 3 levels: grammar, rhetoric and discourse levels.

Grammar-level Analysis
Correctly using words is a fundamental requirement for effective writing. Grammar-level analysis detects spelling and grammatical errors in essays and highlights the detected errors as reminders.

Spelling Error Correction
Given a sentence, our spelling checker locates spelling errors, if any exist, and provides a list of correction candidates (Tseng et al., 2015).
Inspired by Liu et al. (2013) and Yu and Li (2014), we build a confusion-set based unsupervised two-stage method to detect and correct spelling errors.
Confusion set: A confusion set groups characters with similar pronunciation or similar glyphs into clusters. We implement it with an inverted index so that, given a target character, we can quickly retrieve the confusion characters from the same cluster.
Stage 1: Correction candidate detection with local context. We train a 5-gram language model LM on a large-scale corpus. For each character in a sentence, we substitute it with its confusion characters one by one and use LM to compute the perplexity. If a confusion character lowers the perplexity below the original by more than a pre-defined threshold, it is retained as a correction candidate. After stage 1, we obtain a small list of correction candidates. This stage runs very fast.
Stage 2: Correction candidate reranking with global context. We further use the masked language model MLM from BERT (Devlin et al., 2019), taking advantage of the pre-trained transformer-based language model to exploit the whole sentence as context and rerank the correction candidates at each position.
We evaluate our system on the SIGHAN 2015 benchmark. As shown in Table 1, the results demonstrate that our system obtains results competitive with state-of-the-art methods, although it is unsupervised.
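The two-stage pipeline can be sketched as follows. The confusion set, bigram counts, threshold value, and function names here are illustrative stand-ins: the real system uses a large confusion set, a 5-gram LM, and BERT's masked LM for stage 2.

```python
from math import log

# Toy confusion set and bigram counts (illustrative stand-ins; the real
# system uses a large confusion set and a 5-gram LM over a big corpus).
CONFUSION = {"她": ["他", "它"], "在": ["再"]}
BIGRAMS = {("他", "们"): 50, ("她", "们"): 5, ("它", "们"): 1}
TOTAL = sum(BIGRAMS.values())

def lm_score(sent):
    """Toy negative log-probability; lower means more fluent."""
    score = 0.0
    for a, b in zip(sent, sent[1:]):
        score -= log((BIGRAMS.get((a, b), 0) + 1) / (TOTAL + 1))
    return score

def stage1_detect(sent, threshold=0.5):
    """Stage 1: keep substitutions that improve the LM score by > threshold."""
    base = lm_score(sent)
    candidates = []
    for i, ch in enumerate(sent):
        for conf in CONFUSION.get(ch, []):
            variant = sent[:i] + conf + sent[i + 1:]
            if base - lm_score(variant) > threshold:
                candidates.append((i, conf))
    return candidates

def stage2_rerank(sent, candidates):
    """Stage 2: rerank candidates per position using whole-sentence context
    (a masked LM in the real system; the toy scorer stands in for it here)."""
    best = {}
    for i, conf in candidates:
        score = lm_score(sent[:i] + conf + sent[i + 1:])
        if i not in best or score < best[i][1]:
            best[i] = (conf, score)
    return {i: conf for i, (conf, _) in best.items()}
```

Under these toy counts, `stage1_detect("她们")` proposes 他 at position 0, and stage 2 confirms it as the best correction for that position.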

Grammatical Error Diagnosis
We focus on four types of grammatical errors: redundant word, missing word, word selection, and word ordering (Rao et al., 2018). We concentrate on detecting whether a sentence contains any grammatical error (detection level) and on showing the positions of possible errors (position level).
In line with Bell et al. (2019), we formulate grammatical error diagnosis as a sequence labeling problem. Specifically, we build our model based on Wang et al. (2020), where a ResNet-enhanced multi-layer bidirectional transformer encoder (ResELECTRA) is used to encode sentences. This solution ranked 1st in the NLPTEA-2020 CGED shared task at the identification and position levels.

Dataset                            Detection level  Position level
CGED 2020 data (Rao et al., 2020)  0.894            0.404
Domain data                        0.797            0.631
Since we target student essays, we continue training ResELECTRA on a sample of primary students' essays annotated with grammatical error types. On these essays the model reaches a 63% F1-score at the position level. Compared with the CGED 2020 test set, the score is higher at the position level but lower at the detection level, because the label distributions of the two datasets differ.
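Framing diagnosis as sequence labeling means converting annotated error spans into per-character tags. The helper below is a hypothetical sketch; the tag suffix R (redundant) follows the CGED error-type convention, with M, S, and W available for the other three types.

```python
def spans_to_bio(sentence, error_spans):
    """Map (start, end, error_type) spans to per-character BIO labels."""
    labels = ["O"] * len(sentence)
    for start, end, etype in error_spans:
        labels[start] = "B-" + etype  # first character of the error span
        for i in range(start + 1, end):
            labels[i] = "I-" + etype  # remaining characters of the span
    return labels
```

A sequence labeler trained on such tags solves detection (any non-O tag) and position (where the non-O tags are) jointly.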

Rhetoric-level Analysis
Grammar-level analysis is important but not sufficient for evaluating the quality of native speakers' writing. For example, grammatical errors are already much rarer in junior students' essays than in second-language learners' essays.
This section will introduce rhetoric-level analytical modules, which aim to identify excellent sentences and rhetorical devices, to explore whether language is used in a graceful way.

Modeling the Beauty of Sentences
We define beautiful sentences as those that induce aesthetic feelings in readers. This definition is vague and the criterion is subjective. Therefore, we construct a classifier to identify beautiful sentences in a data-driven way.
We collect more than 20k sentences labeled as beautiful or not through crowd-sourcing. Each sentence is labeled by at least two annotators. For training, we only keep the sentences on which both annotators agree. We train a simple attention-based BiLSTM model (Bahdanau et al., 2014) to classify whether a sentence should be labeled as beautiful. The classifier achieves an accuracy of 81% under cross-validation.

Figurative Language Recognition
Figurative language refers to using words in a way that deviates from their literal meaning, conveying a more complex meaning and amplifying the writing. Recognizing figurative language in essays enables monitoring students' ability to use it and provides clues for evaluating essay quality. Currently, we focus on identifying simile and personification.
Simile Recognition: A simile draws a comparison between concepts using explicit comparators, such as like and as in English, or Xiang, Si, and Ru in Chinese. However, a sentence with a comparator does not always contain a simile, unless the two arguments of the comparator form a cross-domain mapping (Lakoff and Johnson, 2008). Simile recognition is therefore not a trivial task.
We adopt a multi-task learning framework for simile recognition. The framework jointly optimizes two subtasks: simile sentence classification and simile component extraction. The model is trained on 12k annotated sentences that contain a comparator. The simile sentence classifier obtains an 86% F1 score in 5-fold cross-validation on the dataset.
Personification Recognition: Personification is another special case of figurative language, ascribing human actions, expressions, or other characteristics to non-human objects, as in "Life has cheated me" (Lakoff and Johnson, 2008). We cast this task as a typical classification problem. We adopt an attention-based BiLSTM (Bahdanau et al., 2014) to encode a sentence into a dense feature vector, which is then fed into a nonlinear layer and a softmax layer to produce the classification result. Considering the characteristics of this task, we introduce an external knowledge base, Chinese CiLin (A Synonymy Thesaurus of Chinese Words) (Mei, 1984), to group words into clusters according to word senses, and assign a learnable embedding vector to each cluster. Each word is represented by the concatenation of its word embedding and its cluster embedding, which is fed into the encoder. The personification recognizer achieves an 80% F1 score; this task proves more difficult than simile recognition.
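The word-plus-cluster representation can be sketched as below. The cluster table, embedding size, and all names are hypothetical; the real system derives clusters from the Chinese CiLin thesaurus and learns the embeddings end-to-end.

```python
import random

random.seed(0)
DIM = 4  # tiny embedding size for illustration

# Hypothetical CiLin-style clusters grouping words by sense.
CLUSTERS = {"笑": "human_action", "哭": "human_action", "风": "nature"}

word_emb, cluster_emb = {}, {}

def lookup(table, key):
    """Look up (or lazily initialize) an embedding vector."""
    if key not in table:
        table[key] = [random.uniform(-1.0, 1.0) for _ in range(DIM)]
    return table[key]

def represent(word):
    """Concatenate the word embedding with its cluster embedding,
    as fed to the BiLSTM encoder."""
    return lookup(word_emb, word) + lookup(cluster_emb, CLUSTERS.get(word, "<unk>"))
```

Words in the same cluster share the second half of their representation, which lets the model generalize a pattern such as "human action applied to a non-human subject" across synonymous verbs.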

Sentence Parallelism Recognition
Sentence parallelism is another widely used rhetorical device in writing. It can be defined as two or more coherent text spans (phrases or sentences) that have similar syntactic structures and related semantics and that together express relevant content or emotion (Song et al., 2016). Parallelism adds balance and rhythm, making speeches and writings more vivid and powerful.
We adopt a feature-based method for this task. The features comprise a set of alignment measures at the position, word, syntactic, and semantic levels. We find that sentence parallelism can be recognized with acceptable performance (82% F1-score at the pairwise level and 72% F1-score at the parallelism block level) using a random forest classifier trained on hundreds of samples. We also observe that sentence parallelism correlates positively with essay quality, especially in argumentative essays.
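A few of the alignment measures might look like the following; the concrete features and names are simplified illustrations of the position- and word-level alignments described above, not the system's actual feature templates.

```python
def parallelism_features(span_a, span_b):
    """Toy alignment features between two candidate spans (word lists)."""
    set_a, set_b = set(span_a), set(span_b)
    # Word-level alignment: Jaccard overlap of the two word sets.
    word_overlap = len(set_a & set_b) / max(len(set_a | set_b), 1)
    # Position-level alignment: how close the two spans are in length.
    length_ratio = min(len(span_a), len(span_b)) / max(len(span_a), len(span_b), 1)
    # Do the spans open with the same word (a common parallelism cue)?
    same_start = 1.0 if span_a and span_b and span_a[0] == span_b[0] else 0.0
    return [word_overlap, length_ratio, same_start]
```

Such vectors, extended with syntactic and semantic alignment measures, would then be fed to the random forest classifier.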

Quotation Detection
Quotation is a figure of speech that intentionally refers to a predecessor's words, such as poems, maxims, and proverbs, to explain one's own idea, aiming to amplify the writing or enhance the persuasiveness of an argument. We collect a large-scale quotation corpus from the Internet, ranging from poetry to proverbs, and exploit information retrieval (IR) techniques and semantic matching for quotation detection.
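An IR-based retrieval step can be sketched with a character-level inverted index. The two-quote bank, the overlap threshold, and the function names are illustrative; the real system uses a much larger corpus and adds semantic matching on top of retrieval.

```python
from collections import defaultdict

# Hypothetical quotation bank; the real corpus spans poetry and proverbs.
QUOTES = ["海内存知己 天涯若比邻", "春眠不觉晓 处处闻啼鸟"]

index = defaultdict(set)  # inverted index: character -> quote ids
for qid, quote in enumerate(QUOTES):
    for ch in quote.replace(" ", ""):
        index[ch].add(qid)

def detect_quotation(sentence, min_overlap=0.6):
    """Retrieve candidate quotes via the inverted index, then keep those
    whose characters are mostly covered by the sentence."""
    candidates = {qid for ch in sentence for qid in index.get(ch, ())}
    hits = []
    for qid in sorted(candidates):
        quote_chars = set(QUOTES[qid].replace(" ", ""))
        if len(quote_chars & set(sentence)) / len(quote_chars) >= min_overlap:
            hits.append(QUOTES[qid])
    return hits
```

The character-overlap threshold tolerates minor student rewordings; a semantic matching stage would further handle paraphrased quotations that share few surface characters.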

Discourse-level Analysis
Discourse analysis aims to build connections between discourse units to form a whole (Song and Liu, 2020). For essay scoring, we mainly focus on analyzing the organization of essays. One important issue is how to represent essay organization. Our solution is to use discourse elements, defined as the functions that discourse units serve in building a coherent discourse. The discourse elements of an essay depend on its genre; for example, narrative and argumentative essays usually employ different organizational strategies and thus have different discourse elements.

Argumentation Structure Modeling for Argumentative Essays
For argumentative essays, we define a set of discourse elements following previous work (Attali and Burstein, 2006;Persing et al., 2010), including prompt, thesis, main idea, support and conclusion.
These discourse elements apply to both sentences and paragraphs (Song et al., 2020a,b). IFlyEA currently maintains a hybrid organization module: a discourse element is represented by combining its distributed semantic vector with a manually constructed feature vector (Song et al., 2015). The learning framework is based on hierarchical multi-task learning (Song et al., 2020b), which jointly optimizes sentence- and paragraph-level discourse element identification and organization evaluation. Evaluation shows that some minority discourse elements, such as thesis and main idea, are harder to recognize, and that organization evaluation of argumentative essays remains challenging due to the lack of large-scale training data. However, visualizing the recognized discourse elements helps teachers quickly see the organization of an essay, and helps us collect user feedback through interaction to accumulate more training data.

Discourse Mode Recognition for Narrative Essays
Evaluating the organization of narrative essays is even more difficult, since narrative text understanding is still very challenging and open in both theory and practice. IFlyEA uses discourse modes as discourse elements, influenced by Smith (2003). The main reasons are: (1) discourse modes can represent essay organization by segmenting an essay into discourse mode zones; (2) discourse modes are closely related to rhetoric (Connors, 1981; Brooks and Warren, 1958), so they can reflect writing proficiency to a degree.
Discourse modes are categorized into narration, description, exposition, argument, and emotion, following Song et al. (2017). We further identify fine-grained description types, such as appearance, facial expression, action, natural scene, psychology, and dialogue. How to accurately and vividly describe the details of a character, a scene, or an object is an important lesson in learning to write. Identifying and visualizing fine-grained description types lets readers quickly find highlights in the descriptions.
Technically, we adopt a two-stage approach. In the first stage, a discourse-level hierarchical encoder encodes the essay and identifies the five discourse modes (Song et al., 2017); the hidden state of each sentence serves as its representation for classification. In the second stage, we further classify descriptive sentences into fine-grained description types, which is formulated as a typical classification problem.
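The two stages compose as a simple per-sentence pipeline; in this sketch, `mode_clf` and `desc_clf` are placeholders standing in for the hierarchical encoder and the fine-grained description classifier.

```python
def analyze_discourse(sentences, mode_clf, desc_clf):
    """Stage 1: assign a discourse mode to each sentence.
    Stage 2: refine sentences labeled 'description' into fine-grained types."""
    results = []
    for sent in sentences:
        mode = mode_clf(sent)
        desc_type = desc_clf(sent) if mode == "description" else None
        results.append((sent, mode, desc_type))
    return results
```

Only descriptive sentences reach the second classifier, so the fine-grained model is trained and applied on a much smaller, more homogeneous slice of the essay.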

Discourse-level Abnormal Detection and Content Analysis
Abnormal detection is important for building a robust system. For example, intentional plagiarism is a serious misconduct and should be detected. We build a large-scale corpus covering common plagiarism sources and exploit IR techniques and semantic matching to detect plagiarism. We also filter out malicious input, such as non-Chinese essays or meaningless character sequences, using a pre-trained language model. Other content analysis tasks, including off-topic detection, genre classification, and topic classification, are also required to support comprehensive assessment. We formulate each of these as a classification problem. Genre and topic classification can be solved well, while off-topic detection remains very challenging.

Essay Scoring
Essay scoring is a main module of AEA. Instead of giving only a single overall score, we additionally score multiple aspects, including content, expression, rhetoric, and organization, to provide a comprehensive assessment.
We formulate these scoring tasks as essay classification problems, classifying a given essay into four grades: bad, moderate, good, and excellent. We construct a feature-based model for each task and use different feature templates for different aspects. The feature templates fall into three types: basic features, such as length, vocabulary, syntax, and distributed dense representations; analytical features based on the output of our analytical modules, such as the counts of spelling and grammatical errors and the use of rhetorical devices; and genre-related features; for example, we model the organization of narrative and argumentative essays with different strategies, so the corresponding features are extracted accordingly.
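The three feature types might be assembled along these lines; the particular normalizations, counts, and names are hypothetical simplifications of the actual feature templates.

```python
def scoring_features(essay_len, vocab_size, n_spell_err, n_gram_err,
                     n_rhetoric, genre):
    """Assemble basic, analytical, and genre-related features into one vector."""
    # Basic features: length and type-token ratio as crude vocabulary proxies.
    basic = [essay_len / 1000.0,
             vocab_size / essay_len if essay_len else 0.0]
    # Analytical features: error rates and rhetorical-device count from
    # the analytical modules.
    analytical = [n_spell_err / max(essay_len, 1),
                  n_gram_err / max(essay_len, 1),
                  float(n_rhetoric)]
    # Genre-related features: one indicator per modeled genre.
    genre_flags = [1.0 if genre == "narrative" else 0.0,
                   1.0 if genre == "argumentative" else 0.0]
    return basic + analytical + genre_flags
```

A per-aspect classifier would then map such a vector to one of the four grades, with each aspect using its own subset of templates.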

Review Generation
Generating a review from the multi-level evaluation benefits students by giving them direct instructions, and benefits teachers by producing scoring reports quickly and automatically. Currently, our system generates reviews from a set of predefined templates. The scores of multiple aspects and of the whole essay come from the essay scoring module. According to these scores, the system selects and integrates templates to generate a coherent review, revealing both the advantages and the shortcomings of an essay.
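Score-driven template selection can be sketched as below. The template texts, aspect names, and grade labels are illustrative; the real system maintains a larger template bank and integrates the selected templates into a coherent review.

```python
# Hypothetical review templates keyed by (aspect, grade).
TEMPLATES = {
    ("rhetoric", "excellent"): "The essay uses rhetorical devices skillfully.",
    ("rhetoric", "bad"): "Try enriching the essay with similes or parallelism.",
    ("grammar", "good"): "Word usage is mostly accurate.",
}

def generate_review(aspect_grades):
    """Select one template per scored aspect and join them into a short review."""
    parts = [TEMPLATES[(aspect, grade)]
             for aspect, grade in aspect_grades.items()
             if (aspect, grade) in TEMPLATES]
    return " ".join(parts)
```

Because templates are keyed by both aspect and grade, the same essay receives praise where it scores high and concrete suggestions where it scores low.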

Recommendation
In addition to rating and reviewing essays, it is important to help students learn from feedback and overcome their weaknesses. To this end, we build a module that recommends relevant materials according to the diagnosis results at the three levels.
We trigger the grammar-level recommendation when spelling errors are detected. In addition to recommending the correct characters, IFlyEA automatically generates a set of cloze-test questions. We first retrieve sentences containing the correct character from an existing corpus, then mask the character in each sentence, mix it with characters from its confusion set, and finally let students choose the best character to fill in the blank. We expect students to better master the correct usage of characters and to distinguish confusable characters through these exercises. As a supplement, the meanings and example usages of the correct character and its confusion set are prepared in advance and displayed after the exercises.
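The masking-and-mixing step can be sketched as follows; the example sentence, the fixed seed, and the function name are illustrative, and the real system retrieves its sentences from a corpus rather than taking them as arguments.

```python
import random

def make_cloze(sentence, target, confusion_chars, seed=0):
    """Mask the target character and mix it with its confusion characters."""
    question = sentence.replace(target, "__", 1)  # blank the first occurrence
    options = [target] + list(confusion_chars)
    random.Random(seed).shuffle(options)  # seeded shuffle keeps tests stable
    return question, options
```

The student then picks the character that best fills the blank, reinforcing the distinction between the correct character and its confusable neighbors.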
At the rhetoric level, we recommend well-written rhetorical sentences that describe objects or scenes similar to those in the target essay, while at the discourse level, we show well-written essays or passages on similar topics. To support this, we have constructed a high-quality resource bank of high-scoring essays, proses, and novels by famous writers, and we use the analytical modules to analyze these resources so that recommendations can be made according to different demands.

Conclusion and Future Work
This paper presented IFlyEA, a Chinese automated essay assessment system. IFlyEA demonstrates that the techniques we have developed can tackle the evaluation of essays written by native Chinese students. A demonstration video is available at https://youtu.be/BujBQfxvX3A.
The main advantage of IFlyEA is its multi-level and multi-dimension analytical modules for essay assessment, especially for several high-level skillful language usage abilities that were less studied previously. Most of these modules achieve moderate or better performance. IFlyEA also provides comprehensive services for rating, review generation, and recommendation. Together with the visualization and interactive interfaces, teachers and students can get useful feedback and easily understand why the system makes its predictions.
IFlyEA has been applied in practice. In the future, we plan to conduct more user studies and continue improving the system. How to evaluate the system's impact on students is another important problem worth exploring.