Document-Level Text Simplification: Dataset, Criteria and Baseline

Text simplification is a valuable technique. However, current research is limited to sentence simplification. In this paper, we define and investigate a new task of document-level text simplification, which aims to simplify a document consisting of multiple sentences. Based on Wikipedia dumps, we first construct a large-scale dataset named D-Wikipedia and perform analysis and human evaluation on it to show that the dataset is reliable. Then, we propose a new automatic evaluation metric called D-SARI that is more suitable for the document-level simplification task. Finally, we select several representative models as baseline models for this task and perform automatic evaluation and human evaluation. We analyze the results and point out the shortcomings of the baseline models.


Introduction
Text simplification is a valuable technique that deserves to be studied in depth (Woodsend and Lapata, 2011). One definition of text simplification is to simplify the original text to a more understandable text, while keeping the main meaning of the original text unchanged (Štajner and Saggion, 2018;Maddela et al., 2020). It can provide convenience for non-native speakers (Petersen and Ostendorf, 2007;Glavaš and Štajner, 2015;Paetzold and Specia, 2016), non-expert readers (Elhadad and Sutaria, 2007;Siddharthan and Katsos, 2010) and children (De Belder and Moens, 2010;Kajiwara et al., 2013).

Why it is Valuable to Study Document-level Text Simplification
Currently, research on text simplification focuses on sentence simplification, and the existing common text simplification datasets such as Wikilarge, Wikismall, and Newsela are also designed for sentence simplification. However, various complex applications in the real world often require document-level simplification rather than sentence-level simplification. Imagine that you want to simplify an article in Time magazine for children to read: it is very inefficient to simplify the sentences separately. Besides, sentences that are obscure and have little relation to the subject should be deleted instead of simplified. Therefore, studying document-level text simplification may be more meaningful than studying sentence-level text simplification alone. Unfortunately, research on document-level text simplification is still scarce: there is no formal definition, no suitable dataset, and no evaluation criteria.

Similarities and Differences with Text Summarization
Other tasks that may be related to document-level text simplification are text summarization (Dong et al., 2018;Cao et al., 2020), paraphrasing (Zhao et al., 2018;Guo et al., 2018), and split & rephrase (Narayan et al., 2017;Surya et al., 2019). Obviously, the paraphrasing and split & rephrase tasks are both sentence-level tasks. The most closely related task is text summarization, which is also a document-level task. We use an example to illustrate the difference between text summarization and our task, as shown in Table 1. We can see that text summarization does not involve rewriting text with simplified versions, though both tasks may filter or delete some unimportant text from the original document.

Our Contributions
In this paper, we are committed to promoting research on document-level text simplification. In summary, the main contributions of our work include: (1) We define the new task of document-level text simplification and build the D-Wikipedia dataset for research 1 .

Table 1: An example illustrating the difference between document-level simplification and text summarization.

Original article: Firefighters or firemen are people whose job is to put out fires and rescue people. Besides fires, firefighters rescue people and animals from car wrecks, collapsed buildings, stuck elevators and many other emergencies. Firefighting is a job which requires bravery, strength, quick thinking and a wide range of skills. Firefighters are based at a building called a " fire station " ( also known as a " firehouse " or " fire hall " ). When their help is needed, they drive a vehicle called a " fire engine " or " fire truck " to the scene responding code 1 code 2 or code 3. These vehicles can pump water and foam to put out fires. Fire engines also carry ladders, cutting tools and lots of different types of rescue equipment. Most carry first aid kits to help people who are injured or hurt.

Document-level simplification: The job of a firefighter is to put out fires and save lives from many emergencies. They are based at a building called a " fire station ". They drive a vehicle called a " fire engine " or " fire truck " to the scene. The vehicle carries many types of rescue equipment to help people in danger.

Text summarization: Firefighters or firemen are people whose job is to put out fires and rescue people and animals from many emergencies. Firefighters are based at a building called a " fire station ". When their help is needed, they drive a vehicle called a " fire engine " or " fire truck " which may carry different types of rescue equipment to help people who are injured or hurt to the scene .
(2) We propose a new automatic evaluation metric called D-SARI that is more suitable for the new task.
(3) We select several representative models and perform both automatic evaluation and human evaluation. The results could serve as the baselines.

Related Works
Sentence simplification aims to rewrite an original sentence into a more straightforward sentence (Saggion, 2017;Sulem et al., 2018b). The input and output of the model are just sentences instead of articles. Based on the English Wikipedia and the Simple English Wikipedia, many researchers have built high-quality datasets such as Wikilarge (Zhang and Lapata, 2017), Wikismall (Zhu et al., 2010), and so on (Coster and Kauchak, 2011;Kauchak, 2013). Based on Newsela, Xu et al. (2015) established the Newsela dataset. The above datasets are widely used in the field of sentence simplification. Most of the early simplification models were based on statistical machine translation (Wubben et al., 2012;Narayan and Gardent, 2014). Nisioi et al. (2017) applied neural sequence-to-sequence models to the task. In contrast, the document-level simplification task has not been clearly defined, and there are no available high-quality datasets and criteria to evaluate the generated articles.

Problem Formulation
The document-level text simplification task can be defined as follows. Given an original complex article C consisting of n sentences, denoted as C = {S_1, S_2, ..., S_n}, document-level simplification aims to simplify C into m sentences, which form the simplified article F, denoted as F = {T_1, T_2, ..., T_m}, where m may not be equal to n. F retains the primary meaning of C and is more straightforward than C, making it easier for people to understand. The operations for sentence-level simplification include word reservation and deletion, synonym replacement, etc. (Xu et al., 2016). Based on the work of Alva-Manchego et al. (2019), we define six types of document-level simplification operations, namely, sentence joining, sentence splitting, sentence deletion, sentence reordering, sentence addition, and anaphora resolution. See Appendix A for the specific definition and example of each operation.
In our definition, document-level simplification should allow the loss of information but should not allow the loss of important information. Zhong et al. (2020) pointed out that sentence deletion is a prevalent phenomenon in document simplification. We believe that information that has little relevance to the primary meaning should be removed to improve readability.

Dataset Construction
According to the definition of document-level simplification, we built a new large-scale dataset named D-Wikipedia based on the English Wikipedia and Simple English Wikipedia. We first downloaded dumps from the official website of Wikipedia and created over 170,000 article pairs 2 .
Considering that it is not easy to establish a one-to-one correspondence between the contents, i.e., the subheadings, we kept only the main content, which is the abstract below the headings. Meanwhile, we considered that if an article is too long, it will occupy a large amount of memory during training. Therefore, we removed the article pairs whose original article or simplified article is longer than 1,000 words. Finally, we built a dataset containing 143,546 article pairs. The D-Wikipedia dataset not only can be used for document-level simplification research but also can be further aligned to construct a sentence-level simplification dataset.
In this work, we randomly divided the dataset into 132K article pairs as the training set, 3K article pairs as the validation set, and 8K article pairs as the test set. There is no overlap between the training set, validation set, and test set.
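The filtering and splitting procedure above can be sketched as follows. This is a minimal illustration, not the released dataset tooling; the function name, the fixed split sizes, and the proportional fallback for small inputs are our own choices.

```python
import random

def build_splits(article_pairs, max_len=1000, seed=0):
    """Filter (original, simplified) article pairs and split them, as described
    in the text: drop pairs where either side exceeds max_len words, shuffle,
    then carve out validation and test sets. The target sizes follow the paper
    (3K validation, 8K test); for small toy inputs we fall back to a
    proportional split so the sketch stays runnable."""
    kept = [
        (src, tgt) for src, tgt in article_pairs
        if len(src.split()) <= max_len and len(tgt.split()) <= max_len
    ]
    random.Random(seed).shuffle(kept)
    n_valid, n_test = 3000, 8000
    if len(kept) < n_valid + n_test + 1:
        # Proportional fallback (~3% valid, ~7% test) for small inputs.
        n_valid = max(1, len(kept) * 3 // 100)
        n_test = max(1, len(kept) * 7 // 100)
    test = kept[:n_test]
    valid = kept[n_test:n_test + n_valid]
    train = kept[n_test + n_valid:]
    return train, valid, test
```

Because the three slices are taken from disjoint ranges of the shuffled list, the no-overlap property mentioned above holds by construction.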

Additional Newsela Test Set
There is also a commonly-used and high-quality corpus named Newsela that might be used for document-level simplification. Each original article in the Newsela corpus corresponds to four articles of different simplification levels. We also removed the article pairs whose original article or simplified article is longer than 1,000 words. Given that the number of articles in each simplification level is less than a thousand, we only use them to build four additional test sets of different simplification levels. In addition, using the Newsela corpus requires a license 3 , while the D-Wikipedia dataset will be completely open-source.

Statistics and Comparison
We randomly sampled 100 article pairs from the established D-Wikipedia dataset to estimate the percentage of the articles which contain each of the six document-level simplification operations (mentioned in Section 3). We used Amazon Mechanical Turk to invite three workers to identify the operations in the articles, and the percentage of articles with each operation is shown in Table 2. It can be seen that each simplification operation appears in most of the simplified articles in the dataset. In other words, most articles involve different simplification operations.
We also calculated the percentage of each simplification operation according to the total occurrences of the operations in the simplified articles, which is shown in Figure 1. It can be seen that the sentence deletion operation occurs most frequently in the dataset.

Figure 1: The percentage of each simplification operation. Sentence deletion occurs most frequently, accounting for nearly half of the total simplification operations, while sentence reordering occurs least frequently, accounting for only 6% of the total simplification operations.
To analyze the word-level differences between the original articles and the simplified articles, following Xu et al. (2015), we adopted the odds ratio method proposed by Monroe et al. (2008). The odds ratio of token t between corpus i and corpus j is defined as:

r_t = (y_t^i / (n^i - y_t^i)) / (y_t^j / (n^j - y_t^j))    (1)

In Equation 1, y_t^i represents the count of token t in corpus i and y_t^j represents the count of token t in corpus j. n^i and n^j represent the size of corpus i and corpus j, respectively. We selected some frequently occurring complex words as examples to show that they are sufficiently simplified, as shown in Table 3.
Table 3: R_original and R_simple indicate the ranking of the number of occurrences of the word in the original article and the simplified article, respectively. A smaller odds ratio means a greater reduction of the complex word. The closer the p-value of the test is to zero, the more significant the difference between the odds ratio and one.

It can be seen from R_original and R_simple that the relative frequency of complex words appearing in simplified texts is much lower than in the original texts. We used a chi-square test to show whether the odds ratio is significantly different from one 4 . An odds ratio significantly lower than one means that the complex words are well simplified. The reduction of the word "including" may mean that clauses are deleted or split into multiple sentences. Sentence splitting is a common operation in document-level simplification. When splitting conjoined clauses, to preserve the rhetorical relation, Siddharthan (2003) introduced cue words. We calculated the odds ratios of the conjunctions and the cue words, and the results are shown in Table 4. We did not calculate all words because the number of occurrences of some words, such as "hence", was too low to be statistically meaningful. The odds ratio of most conjunctions is significantly less than one, and the odds ratio of most cue words is significantly greater than one, indicating that the simplified article may contain more split sentences and the long sentences in the original article have been simplified.
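Both quantities described above can be computed directly from the four counts. The following is a minimal sketch (the function names are ours; a library routine such as scipy.stats.chi2_contingency could replace the hand-rolled chi-square statistic):

```python
def odds_ratio(y_i, n_i, y_j, n_j):
    """Odds ratio of a token between corpus i and corpus j:
    (y_i / (n_i - y_i)) / (y_j / (n_j - y_j)),
    where y_* are token counts and n_* are corpus sizes."""
    return (y_i / (n_i - y_i)) / (y_j / (n_j - y_j))

def chi_square_2x2(y_i, n_i, y_j, n_j):
    """Pearson chi-square statistic for the 2x2 contingency table
    [[y_i, n_i - y_i], [y_j, n_j - y_j]], used to test whether the
    odds ratio differs significantly from one (1 degree of freedom)."""
    a, b = y_i, n_i - y_i
    c, d = y_j, n_j - y_j
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
```

For example, a word appearing 50 times in a 10,000-token simplified corpus but 200 times in a 10,000-token original corpus yields an odds ratio well below one, and the chi-square statistic tells us whether that reduction is significant at the desired level (3.84 is the 5% critical value for 1 degree of freedom).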
The D-Wikipedia dataset was also analyzed and compared with the Newsela corpus, and the results are shown in Table 5. The compression ratio of the D-Wikipedia dataset is lower than that of the Newsela corpus at any simplification level. In terms of the average number of words per sentence, the compression ratio of the D-Wikipedia dataset is between the Simp-2 level and the Simp-3 level of the Newsela corpus.

Human Evaluation
In this section, we employ human judges to evaluate the quality of the D-Wikipedia dataset. Before evaluation, we need to analyze whether the human evaluation indicators used for sentence-level simplification are suitable to evaluate document-level simplification.
In human evaluation, sentence simplification is usually evaluated from the three perspectives of simplicity, meaning, and grammar. Simplicity, the most important indicator, measures whether the simplified sentence is simpler than the original sentence. However, we believe that this measure is not a good indicator for scoring document-level simplification. For example, if only the first sentence of the original article is simplified and the other sentences are deleted, the simplified article will be very short and simple and will get a high simplicity score. But such an article does not retain the main information of the original article, which is not what we want.
Therefore, we propose a new indicator named O-simplicity (Overall simplicity with quality guarantee). O-simplicity measures whether the simplified article is simpler than the original article under the condition of a quality guarantee, i.e., the article should also read smoothly and retain the main meaning of the original article. As an indicator of how good the simplification is, O-simplicity is a more meaningful and comprehensive measure than the original simplicity indicator or a simple average of the simplicity, meaning, and grammar scores.
Following Sulem et al. (2018c), we also use the fine-grained simplicity-phrase and simplicity-structure indicators, which measure the simplification of words and the simplification of sentence structure, respectively. In this way, O-simplicity is an overall indicator that requires comprehensive consideration, while the other four indicators focus on specific aspects. More examples and scoring guidelines are given and analyzed in Appendix C.
We invited three workers to evaluate the quality of the D-Wikipedia dataset with the above measures. We randomly selected 100 article pairs from the dataset, and a five-point Likert scale was used for rating. The average O-simplicity score is 3.94, indicating that the simplification of the articles is generally good. The simplicity-phrase and simplicity-structure scores reach 4.28 and 4.23, respectively, implying that the simplified articles have made considerable lexical and sentence-structure simplifications compared to the original articles. The grammar score reaches 4.65, probably because the simplified articles are written by humans and are easy to read. The meaning score is 3.69, indicating that the simplified articles largely preserve the meaning of the original articles.

The D-SARI Metric
Currently, the most commonly used automatic evaluation metric for sentence-level simplification is the SARI metric (Xu et al., 2016). The correlation of SARI with human judgments of the simplicity indicator proved to be high in sentence-level simplification (Sulem et al., 2018a). However, this metric has shortcomings when used directly to evaluate document-level simplification. For better understanding, in this section, we conduct a qualitative analysis and give the following example:

Original article: marengo is a town in and the county seat of iowa county , iowa , united states . it has served as the county seat since august 1845 , even though it was not incorporated until july 1859 . the population was 2,528 in the 2010 census , a decline from 2,535 in 2000 .

Simplified article 1: in the US . 2,528 in 2010 .

Simplified article 2: marengo is a city in iowa , the US . it has served as the county seat since august 1845 , even though it was not incorporated . the population was 2,528 in the 2010 census , a decline from 2,535 in 2010 .

Simplified article 3: marengo is a town in iowa . marengo is a town in the US . in the US . the population was 2,528 . the population in the 2010 census .

Simplified article 4: marengo is a town in iowa , united states . in 2010 , the population was 2,528 .

Reference article: marengo is a city in iowa in the US . the population was 2,528 in 2010 .

Table 6: The SARI and D-SARI values for the four simplified articles. The fourth simplified article does the best job of simplification, not only retaining the main meaning of the original article but also deleting the unimportant information and the difficult words. However, its SARI value is the lowest among the simplified articles.
The SARI values of the four simplified articles are shown in the left half of Table 6 5 . When using the SARI metric, we take the whole article as input, output and reference.
The simplified article 1 generates "US", a word that does not appear in the original article, which raises the overall SARI value. Intuitively, however, it makes no sense to generate several simplified words if the simplified article is too short to convey the main meaning of the original article. Therefore, we believe that a penalty factor LP_1 should be added to the F_add score. If the generated simplified article is shorter than the reference article, the F_add score will be penalized.
The simplified article 2 does a good job of simplifying the first sentence of the original article but retains much useless information and many difficult words. Paradoxically, its P_del score is the highest among the four simplified articles. A more common scenario is that the original article is much longer than the simplified article; then, according to the formula of P_del, removing fewer words will have a limited effect on P_del. Therefore, we believe that a penalty factor LP_2 should be added to the P_del score, penalizing the P_del score if the generated simplified article is longer than the reference article.

5 We use the script in https://github.com/cocoxu/simplification/blob/master/SARI.py.
The simplified article 3 finds the important information in the original article and does a good job of simplifying it. Nevertheless, it performs duplicate generation, which leads to a severe decrease in readability, and it should not get such a high F_keep value. According to the formula of F_keep, if the duplicate n-grams also appear in the reference, the F_keep value will not decrease. Therefore, we believe that LP_2 should also be added to the F_keep score. Besides, we add a sentence-level penalty factor SLP to penalize the F_keep score if the number of generated sentences is far from the number of sentences in the reference article.
In summary, based on the SARI metric (Xu et al., 2016), we propose the D-SARI metric for the document-level simplification task. We retain the idea of calculating the add, keep, and delete scores separately as in SARI, which proved to be effective in sentence simplification. The D-SARI metric is defined as:

D-SARI = (D_add + D_keep + D_del) / 3    (2)
D_add = F_add × LP_1    (3)
D_keep = F_keep × LP_2 × SLP    (4)
D_del = P_del × LP_2    (5)

where the length penalties are

LP_1 = 1 if O ≥ R, and e^{(O − R) / R} otherwise    (6)
LP_2 = 1 if O ≤ R, and e^{(R − O) / O} otherwise    (7)
SLP = e^{−|R_S − O_S| / R_S}    (8)

I, O, and R represent the number of words (including punctuation) in the input article, the output article, and the reference article, respectively. O_S and R_S represent the number of sentences in the output article and the reference article, respectively. Due to the limitation of space, please refer to Xu et al. (2016) for the calculation of F_add, F_keep, and P_del. We also calculate the D-SARI values for each of the simplified articles in the given example, as shown in the right half of Table 6.
As we analyzed from the given example, it is reasonable to penalize the three components of SARI. In the D-SARI metric, the penalty is based on length. The motivation comes from BLEU (Papineni et al., 2002): a candidate should be neither too long nor too short, and an evaluation metric should enforce this. In sentence-level text simplification, the difference in length between a simplified sentence and the original sentence is not very large, while the opposite is true for document-level text simplification. An original article may be long, while a simplified article may contain only one sentence; the latter is simple enough, but it is not a good simplification of the original article. It is a reasonable proposition that the length of the simplified article should be close to the length of the reference article.
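The penalized combination described above can be sketched in a few lines. The BLEU-style exponential penalty forms below are our assumption for illustration (the text only states when each penalty fires); F_add, F_keep, and P_del are taken as already computed by the SARI script:

```python
import math

def d_sari(f_add, f_keep, p_del, o_words, r_words, o_sents, r_sents):
    """Sketch of the D-SARI combination: three SARI components, each
    scaled by length penalties (exact penalty forms assumed here).
      LP1 penalizes F_add when the output is shorter than the reference;
      LP2 penalizes F_keep and P_del when the output is longer;
      SLP penalizes F_keep when sentence counts diverge."""
    lp1 = 1.0 if o_words >= r_words else math.exp((o_words - r_words) / r_words)
    lp2 = 1.0 if o_words <= r_words else math.exp((r_words - o_words) / o_words)
    slp = math.exp(-abs(r_sents - o_sents) / r_sents)
    d_add = f_add * lp1
    d_keep = f_keep * lp2 * slp
    d_del = p_del * lp2
    return (d_add + d_keep + d_del) / 3
```

With matched lengths and sentence counts all penalties equal one and the score reduces to the plain SARI average; an output far shorter than the reference (like simplified article 1 in the example) is pushed down even if its component scores are high.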
We also conduct an empirical analysis of the D-SARI metric. In Section 7.3, we use Spearman's rank correlation coefficient (Zwillinger and Kokoska, 1999) to show that the D-SARI metric has the strongest correlation among several metrics with human ratings.

Baseline Models
We selected four representative models as the baselines for the document-level simplification task: (1) Transformer: It treats the task as a sequence-to-sequence problem. Both the encoder and decoder contain six transformer layers (Vaswani et al., 2017).
(2) SUC: It simplifies each sentence in the article by using contextual information (Sun et al., 2020).
(3) BertSumextabs: It achieves excellent results on the text summarization task, using the Bert-base model as the encoder (Liu and Lapata, 2019).
(4) BART: It is a recently proposed model pretrained on a large-scale corpus that achieves state-of-the-art results on many sequence-to-sequence tasks (Lewis et al., 2019).
All the models were tested on our delineated test sets. We used the fairseq toolkit and performed replication experiments. See Appendix B for detailed parameters.

Automatic Evaluation Results
We used the SARI metric, the BLEU metric, the FKGL metric, and the D-SARI metric for automatic evaluation. We have described the SARI and D-SARI metrics in detail in Section 5. BLEU is a method for comparing the similarity between the reference and the output (Papineni et al., 2002) 6 . FKGL is used to measure the readability of the text (Kincaid et al., 1975). The automatic evaluation results on the D-Wikipedia test set are shown in Table 7. For the Newsela corpus, as mentioned in Section 4.2, we choose a representative test set called Simp-4 to show the automatic results. The models were both trained and validated on the D-Wikipedia dataset, and the results are shown in Table 8.
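FKGL follows the standard Flesch-Kincaid formula, 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59, where lower scores indicate more readable text. A minimal sketch with a crude vowel-group syllable heuristic (real implementations typically use a dictionary-based syllabifier):

```python
import re

def count_syllables(word):
    """Crude heuristic: count groups of consecutive vowels.
    Good enough to illustrate the metric, not for publication-grade scores."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def fkgl(text):
    """Flesch-Kincaid Grade Level (Kincaid et al., 1975):
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-zA-Z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)
```

Short sentences of monosyllabic words score far lower (easier) than long, polysyllabic sentences, which is why a good simplification should reduce FKGL.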

Human Evaluation Results
We performed human evaluation according to the method described in Section 4.4. To maintain consistency, we selected the same 100 article pairs in the D-Wikipedia test set that were randomly selected for evaluating the dataset in Section 4.4. We added some fake examples to the questionnaire and checked whether the workers gave a reasonable score to ensure the quality of human evaluation.
The human evaluation results are shown in Table 9. We also report the correlation of the article's length with the human ratings, as shown in Table 11. The results show that human judges tend to give high simplicity-phrase and simplicity-structure scores to short articles. The O-simplicity indicator places more emphasis on the overall simplification effect, including the retention of the main meaning and the fluency of the sentences. Therefore, as we analyzed in Section 4.4, the O-simplicity indicator can evaluate how good the simplification is, better than the simplicity-phrase and simplicity-structure indicators. Generally, the BART and BertSumextabs models perform better than the other two models, especially on the O-simplicity measure. Directly applying the sentence simplification model SUC does not produce good results, which means document-level simplification is very different from sentence-level simplification.

Correlation of Automatic Metrics with Human Ratings
We calculated Spearman's rank correlation coefficient between each automatic metric and the human ratings on the results for the 100 article pairs, and the correlation scores are shown in Table 10. The D-SARI metric has the highest correlation with the O-simplicity indicator, surpassing both BLEU and SARI. In terms of simplicity-phrase and simplicity-structure, the correlation of D-SARI with human ratings also exceeds that of SARI, and although FKGL has the highest correlation on these two indicators, it does not correlate with the O-simplicity indicator. We also noticed that BLEU has little correlation with the meaning and grammar indicators, probably because the simplification contains many splitting operations, which is consistent with the conclusion obtained by Sulem et al. (2018a).
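Spearman's rank correlation is simply the Pearson correlation computed on the ranks of the two score lists. A self-contained sketch (in practice scipy.stats.spearmanr gives the same coefficient plus a p-value):

```python
def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks,
    using average ranks for ties."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(order):
            j = i
            # Extend j over a run of tied values.
            while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average 1-based rank for the tie group
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Here x would be an automatic metric's scores for the 100 articles and y the corresponding human ratings; a coefficient near one means the metric ranks systems the same way the judges do.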

The Challenge of Document-level Simplification
There are many problems with applying existing models directly to the document-level simplification task. From the automatic evaluation, the D_keep values of the baseline models are not high, and the FKGL values also need to be further reduced. From the human evaluation, the O-simplicity scores of the articles simplified by the models are still far from those of the references. As can be seen from the example in Appendix D, the best-performing BertSumextabs model still retains some complex vocabulary and sentence structure compared with the reference, and the model's ability to screen out important information needs further improvement. We also noticed that the results of the SUC model are much lower than those of all other models, which indicates that document-level simplification cannot be addressed by stitching together the results of sentence simplification.
Above all, we believe that new models designed for document-level simplification could be proposed in the future, which will greatly advance this field.

Conclusion
In this paper, we are committed to promoting research on document-level text simplification. We established a large-scale high-quality dataset named D-Wikipedia and proposed a new automatic evaluation metric called D-SARI. We also selected several representative models as baselines for this task. The results demonstrate that the dataset is of high quality and the metric is reliable.

A Six Types of Operations in Document-Level Text Simplification

1 Sentence joining and sentence reordering

Src: the fields medal is a prize awarded to two , three , or four mathematicians under 40 years of age at the international congress of the international mathematical union ( imu ) , a meeting that takes place every four years. the fields medal is regarded as one of the highest honors a mathematician can receive , and has been described as the mathematician 's nobel prize , although there are several key differences , including frequency of award , number of awards , and age limits . according to the annual academic excellence survey by arwu , the fields medal is consistently regarded as the top award in the field of mathematics worldwide , and in another reputation survey conducted by ireg in 2013-14 , the fields medal came closely after the abel prize as the second most prestigious international award in mathematics. the prize comes with a monetary award which , since 2006 , has been 15,000 . the name of the award is in honour of canadian mathematician john charles fields . fields was instrumental in establishing the award , designing the medal itself , and funding the monetary component. the medal was first awarded in 1936 to finnish mathematician lars ahlfors and american mathematician jesse douglas , and it has been awarded every four years since 1950 . its purpose is to give recognition and support to younger mathematical researchers who have made major contributions .

Tgt: the fields medal is a prize given to mathematicians who are not over 40 years of age .
it is given at each international congress of the international mathematical union . this is a meeting that takes place every four years. the canadian mathematician john charles fields was the first to propose this medal and it was first awarded in 1936 . it has been regularly awarded since 1950 . its purpose is to support younger mathematicians who made major contributions. the fields medal is viewed , at least in the media , as the top honor a mathematician can receive .

Analysis: Sentence joining means combining two or more sentences into one sentence. In src, the two sentences marked in red are merged into the sentence marked in red in tgt. Some information in the original two sentences is removed. Sentence reordering implies a change in the structure of the article. In src, the sentences marked in red appear before the underlined sentence, but in tgt, the simplified underlined sentence appears before the sentence marked in red.
2 Sentence splitting

Src: it is a decentralized digital currency without a central bank or single administrator that can be sent from user to user on the peer-to-peer bitcoin network without the need for intermediaries .

Tgt: bitcoin is a digital and global money system currency . it allows people to send or receive money across the internet , even to someone they do n't know or do n't trust . money can be exchanged without being linked to a real identity .

Analysis: In contrast to sentence joining, sentence splitting is the division of a long sentence into two or more sentences. The sentences in tgt are simplified from the parts of the sentence in src marked with the corresponding colors.
3 Sentence addition

Src: 104.6 rtl is a private radio station that is produced in a hot adult contemporary format . it is transmitted from studios in kurfürstendamm in berlin-charlottenburg . according to german media analysis 2011/ii , the station reaches 209,000 listeners in an average transmitting hour ( mon-fri , 6am-6pm ) with a total of 709,000 listeners per day and thereby is one of the most listened to radio programs in berlin and brandenburg .

Tgt: 104.6 rtl is a german radio station . it first aired on 9 september , 1991 . it broadcasts in berlin and hopes that the 14-39 age group will listen . the studios are at the kurfürstendamm in berlin-charlottenburg . in 2005 the radio channel has been awarded the german radio award for the best morning show .
Analysis: Sentence addition means that there is a sentence in tgt for which the corresponding sentence is not found in src. Sentence addition introduces additional information, often used for explanation and clarification.
4 Sentence deletion

Src: landudal is a commune in the finistère department of brittany in north-western france . the writer angèle jacq , winner of the cezam prix littéraire inter ce in 2000 for her novel " le voyage de jabel " , was born in landudal .

Tgt: landudal is a commune . it is found in the region brittany in the finistère department in the northwest of france .

Analysis: Sentence deletion means that there is a sentence in src that has no corresponding sentence in tgt. This is usually because the original sentence is difficult to simplify and the deletion does not affect the main meaning of the text.
5 Anaphora resolution

Src: Winston Churchill was a great politician and statesman. He also won the Nobel Prize for literature in 1953.

Tgt: Winston Churchill won the Nobel Prize in 1953.
Analysis: Anaphora resolution is usually associated with sentence deletion. The first sentence in src is deleted, and then the pronoun "he" in the second sentence is replaced with the person's name in tgt.

B Implementation Details
We used the fairseq toolkit 8 to implement the Transformer model and the BART model, and the publicly released code to implement the BertSumextabs model 9 and the SUC model 10 . All models except the SUC model are trained on the training set of the D-Wikipedia dataset we constructed. The SUC model is trained on the Wikipedia dataset; at test time, the original articles in our test set are simplified with this model sentence by sentence, and the output sentences are then stitched together to obtain the simplified articles. All models are trained on an Nvidia GTX 1080Ti, and the batch size is set to make full use of its GPU memory. The hyperparameters are shown in the following tables.
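The sentence-by-sentence pipeline used for the SUC baseline at test time can be sketched as follows. This is a minimal illustration, not the actual evaluation code: `simplify_sentence` is a stand-in for any sentence-level simplification model, and the naive regex splitter is our own simplification (a real setup would use a proper sentence splitter).

```python
import re

def simplify_article(article, simplify_sentence):
    """Simplify an article sentence by sentence, then stitch the
    outputs back together, as done for the SUC baseline at test time."""
    # naive split on sentence-final punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", article.strip())
    return " ".join(simplify_sentence(s) for s in sentences if s)

# usage with a trivial identity "model" in place of a real simplifier
out = simplify_article("this is one sentence . this is another .", lambda s: s)
```

Because each sentence is simplified in isolation, this pipeline cannot perform document-level operations such as sentence deletion, which is consistent with the overly long SUC outputs discussed in the case study.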

C Human Evaluation Guideline
The goal of this review is to evaluate the simplification quality of different articles. In this review, you will be given an original article and its corresponding simplified articles. You should evaluate the quality of the simplification along the following five dimensions: (1) Simplicity-phrase. Are the words in the simplified article simpler than those in the original article?
(2) Simplicity-structure. Are the sentence structures in the simplified article simpler than those in the original article?
(3) Meaning. The text simplification operation can remove some sentences from the original article, but the main meaning of the original article should be kept intact.
(4) Grammar. The simplified article should be grammatically correct and fluent.
(5) O-simplicity. The simplified article should be simpler than the original article; it should also read smoothly and retain the main meaning of the original article.
You will do this using a 1-5 rating scale, where 5 is the best and 1 is the worst. There are no "correct" answers and whatever choice is appropriate for you is a valid response. For example, if you are given the following original article and simplified articles: Original article: the fields medal is a prize awarded to two, three, or four mathematicians under 40 years of age at the international congress of the international mathematical union ( imu ), a meeting that takes place every four years. the fields medal is regarded as one of the highest honors a mathematician can receive, and has been described as the mathematician 's nobel prize, although there are several key differences, including frequency of award, number of awards, and age limits. according to the annual academic excellence survey by arwu, the fields medal is consistently regarded as the top award in the field of mathematics worldwide, and in another reputation survey conducted by ireg in 2013-14, the fields medal came closely after the abel prize as the second most prestigious international award in mathematics. the prize comes with a monetary award which, since 2006, has been 15,000. the name of the award is in honour of canadian mathematician john charles fields. fields was instrumental in establishing the award, designing the medal itself, and funding the monetary component. the medal was first awarded in 1936 to finnish mathematician lars ahlfors and american mathematician jesse douglas, and it has been awarded every four years since 1950. its purpose is to give recognition and support to younger mathematical researchers who have made major contributions. in 2014, the iranian mathematician maryam mirzakhani became the first female fields medalist. in all, sixty people have been awarded the fields medal. the most recent group of fields medalists received their awards on 1 august 2018 at the opening ceremony of the imu international congress, held in rio de janeiro, brazil. 
the medal belonging to one of the four joint winners, caucher birkar, was stolen shortly after the event. the icm presented birkar with a replacement medal a few days later.
Simplified article 1: (Score: Simplicity-phrase 5 Simplicity-structure 5 Meaning 5 Grammar 5 O-simplicity 5) the fields medal is an award given to mathematicians under 40 years of age. the name of the prize is in honor of the canadian mathematician john charles field. and it is awarded every four years since 1950. the fields medal is regarded as the highest award in the field of mathematics in the world. it is intended to be used to encourage young mathematicians.
Simplified article 2: (Score: Simplicity-phrase 4 Simplicity-structure 5 Meaning 5 Grammar 5 O-simplicity 5) the fields medal is a prize given to mathematicians who are not over 40 years of age. it is given at each international congress of the international mathematical union. this is a meeting that takes place every four years. the canadian mathematician john charles fields was the first to propose this medal and it was first awarded in 1936. it has been regularly awarded since 1950. its purpose is to support younger mathematicians who made major contributions. the fields medal is viewed, at least in the media, as the top honor a mathematician can receive. it comes with a monetary award. in 2006 the award was $ 15,000 ( us $ 13,400 or €10,550 ). the abel prize has similar prestige, and more money.
Simplified article 3: (Score: Simplicity-phrase 4 Simplicity-structure 3 Meaning 2 Grammar 5 O-simplicity 2) the fields medal is consistently regarded as the top award in the field of mathematics worldwide. Since 2006, the prize of this award has been 15,000. the most recent group of fields medalists received their awards on 1 august 2018 at the opening ceremony of the imu international congress. the medal belonging to one of the four joint winners, caucher birkar , was stolen shortly after the event .
Simplified article 4: (Score: Simplicity-phrase 1 Simplicity-structure 1 Meaning 5 Grammar 5 O-simplicity 1) the fields medal is a prize awarded to two, three, or four mathematicians under 40 years of age at the international congress of the international mathematical union ( imu ) a meeting that takes place every four years. according to the annual academic excellence survey by arwu, the fields medal is consistently regarded as the top award in the field of mathematics worldwide, and in another reputation survey conducted by ireg in 2013-14, the fields medal came closely after the abel prize as the second most prestigious international award in mathematics. the name of the award is in honour of canadian mathematician john charles fields. he was instrumental in establishing the award, designing the medal itself, and funding the monetary component. the purpose of the fields medal is to give recognition and support to younger mathematical researchers who have made major contributions.
Simplified article 5: (Score: Simplicity-phrase 5 Simplicity-structure 5 Meaning 4 Grammar 1 O-simplicity 2) the fields medal is to a prize giving to mathematicians who are not over 40 years of age. but it is awarding every four years since 1950. the prize is in honor of the canadian mathematician john charles field. the fields metal described as the mathematician 's nobel prize as the mathematician 's nobel prize. its purpose are to support younger mathematicians who made major contributions.
Analysis: Simplified article 1 does a good job on the simplification of words and sentence structures: it removes difficult vocabulary, and splits and simplifies long sentences. So it scores full marks for Simplicity-phrase and Simplicity-structure. It also summarizes the main meaning of the original article, so it scores full marks for Meaning. It reads smoothly, as if written by a human, so it scores full marks for Grammar. The overall impression of the article is very good: it reads simply and fluently while maintaining the main meaning, so it scores full marks for O-simplicity.
Simplified article 2 has some words that need further simplification, such as "prestige" and "monetary", so it scores slightly lower than Simplified article 1 on Simplicity-phrase. However, it reads smoothly and the main meaning is well maintained; one also feels that the simplification is very good when reading this article. These two articles illustrate that high-scoring articles can be presented in different ways.
Simplified article 3 clearly does not retain the main meaning of the original article, keeping some non-essential information instead, so it scores very low on Meaning. Besides, it contains long and complex sentences whose structures are not simple enough compared to the original article. One's experience of reading such an article is not very good, because it deviates from the main meaning and is not simple enough.
Simplified article 4 is able to find the relatively important sentences in the original article, but unfortunately it performs little simplification and is not easy to read, so it scores very low on Simplicity-phrase and Simplicity-structure. Children and non-native speakers would not be able to read such an article, so it scores very low on O-simplicity.
Simplified article 5 contains many grammatical errors and repeated phrases, making it look less like it was written by a human. Therefore, it scores very low on Grammar. Although its words and sentence structures are very simple, the grammatical errors make it difficult to read, so it scores low on O-simplicity.
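Per-system scores on these five dimensions can then be aggregated by averaging each dimension across annotators. The sketch below is our own illustration of such an aggregation (the rating data is invented), not the paper's analysis code:

```python
from statistics import mean

DIMENSIONS = ["Simplicity-phrase", "Simplicity-structure",
              "Meaning", "Grammar", "O-simplicity"]

def aggregate(annotator_ratings):
    """Average each 1-5 dimension over a list of per-annotator dicts."""
    return {d: mean(r[d] for r in annotator_ratings) for d in DIMENSIONS}

# two hypothetical annotators rating the same simplified article
scores = aggregate([
    {"Simplicity-phrase": 5, "Simplicity-structure": 5,
     "Meaning": 5, "Grammar": 5, "O-simplicity": 5},
    {"Simplicity-phrase": 4, "Simplicity-structure": 5,
     "Meaning": 5, "Grammar": 5, "O-simplicity": 4},
])
```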

D Case Study
Input: atal bihari vajpayee ( ; 25 december 1924 -16 august 2018 ) was an indian statesman who served three terms as the prime minister of india , first for a term of 13 days in 1996 , then for a period of 13 months from 1998 to 1999 , followed by a full term from 1999 to 2004 . a member of the bharatiya janata party ( bjp ) , he was the first indian prime minister not of the indian national congress to serve a full term in office . he was also noted as a poet and a writer . he was a member of the indian parliament for over five decades , having been elected ten times to the lok sabha , the lower house , and twice to the rajya sabha , the upper house . he served as the member of parliament for lucknow , retiring from active politics in 2009 due to health concerns . he was among the founding members of the bharatiya jana sangh (
SUC: atal bihari vajpayee ( ; 25 december 1924 -16 august 2018 ) was an indian statesman who served three terms as the prime minister of india , first for a term of 13 days in 1996 , then for a period of 13 months from 1998 to 1999 , followed by a full term from 1999 to 2004 . a member of the bharatiya janata party ( bjp ) , he was the first indian prime minister not of the indian national congress to be elected into the full term in the first indian prime minister . he was the first indian prime minister of the indian national congress and was called the bharatiya janata party ( bjp ) . he was a member of the indian parliament for over five decades and has been elected ten times to the lok sabha , the lower house , and twice to the rajya sabha , the upper house . he was elected ten times to the lok sabha , the lower house , and twice to the rajya sabha , the upper house . he served as the member of parliament for lucknow , retiring from active politics in 2009 due to health concerns . he was among the founding members of the bharatiya jana sangh ( bjs ) , of which he was president from 1968 to 1972 .
the bjs merged with several other parties to form the janata party , which won the 1977 general election . it to make the janata party , he won the 1977 general election . in march 1977 , vajpayee became the minister of external affairs in the cabinet of prime minister morarji desai . he became the minister of external affairs in the cabinet of prime minister morarji desai . he left in 1979 and left the alliance collapsed soon after . in 1980 's first president.during members of prime minister , india carried out the bjp nuclear tests as prime minister . in 1998 he became prime minister , india , and created the pokhran-ii nuclear tests . vajpayee sought to improve diplomatic relations with pakistan , travelling to lahore by bus to meet with prime minister nawaz sharif . after the 1999 kargil war with pakistan , he sought to restore relations through engagement with president pervez musharraf , inviting him to india for a summit at agra.the administration of narendra modi declared in 2014 that vajpayee 's birthday , 25 december , would be marked as good governance day . in 2015 , he was conferred india 's highest civilian honour , the bharat ratna , by the president of india , pranab mukherjee . he died on 16 august 2018 of age-related illness .
BertSumextabs: atal bihari vajpayee ( ; 25 december 1924 -16 august 2018 ) was an indian politician . he served three terms as the prime minister of india , first for a term of 13 days in 1996 , then for a period of 13 months from 1998 to 1999 , followed by a full term from 1999 to 2004 . a member of the bharatiya janata party ( bjp ) , he was the first indian prime minister not of the indian national congress .
BART: atal bihari vajpayee ( 25 december 1924 -16 august 2018 ) was an indian politician . he was the first indian prime minister not of the indian national congress to serve a full term in office.he was elected as a member of the indian parliament for over five decades , having been elected ten times to the lok sabha , the lower house , and twice to the rajya sabha , the upper house . he served as the member of parliament for lucknow , retiring from active politics in 2009 due to health concerns.he was among the founding members of the bharatiya jana sangh ( bjs ) , of which he was president from 1968 to 1972.in march 1977 , vajpayee became the minister of external affairs in the cabinet of prime minister morarji desai . he resigned in 1979 , and the janata alliance collapsed soon after . former members of the bjs formed the bjp in 1980 , with vajpayee its first president.during his term as prime minister , india carried out the pokhran-ii nuclear tests in 1998 . vajpayee sought to improve diplomatic relations with pakistan , travelling to lahore by bus to meet prime minister nawaz sharif . after the 1999 kargil war with pakistan , he sought to restore relations through engagement with president pervez musharraf , inviting him to india for a summit at agra. vajpayee died on 16 august 2018 in lucknow , aged 93 .
Analysis: We use red to mark sentences with factual errors, blue to mark sentences that should have been deleted but are neither deleted nor simplified, and cyan to mark sentences with grammatical errors. The output articles of the SUC model and the BART model are too long and retain a large number of unsimplified sentences from the input article. The output article of the Transformer model contains many factual errors and is difficult to read. The BertSumextabs model simplifies a less important sentence in the original article, and the simplification is not reasonable: because it removes the critical information "to serve a full term in office", the meaning of the sentence may be changed. Besides, the BertSumextabs model does not keep the information about the person's death from the original article.