Automatic Text Simplification for Social Good: Progress and Challenges

Since the late 1990s, automatic text simplification (ATS) has been promoted as a natural language processing (NLP) task with great potential to make texts more accessible to people with various reading or cognitive disabilities, and thereby to enable their better social inclusion. Large multidisciplinary projects made promising steps in that direction. Since 2010, the field has attracted more attention, but at the cost of major shifts in system architecture, target audience, and evaluation strategies. Somewhere along the way, the focus shifted from ATS for social good towards building complex end-to-end neural architectures that are not aimed at any particular target population. This study presents the trajectory of ATS for social good, the main issues in current ATS trends, and the ways forward that could bring the field back to its initial goals.


Rationale
The right to accessible information is a fundamental right that should be granted to all people (UN, 2020). It is a key factor for personal empowerment and social inclusion. Nevertheless, textual information found on the web, in the news, in health leaflets, and in other sources is often linguistically too complex for many people and thus impedes their active participation in society.

Adult Literacy
In the OECD Adult Literacy Report (OECD, 2013), "literacy is defined as the ability to understand, evaluate, use and engage with written texts to participate in society, to achieve one's goals, and to develop one's knowledge and potential". The literacy scale comprises six levels:
Below Level 1: Able to read brief texts on familiar topics and to locate a single piece of specific information (the process requires only basic vocabulary knowledge).
Level 1: Able to read very short texts (sometimes non-continuous or mixed) and locate a single piece of information identical or synonymous to the information given in the question/directive (requires only basic vocabulary knowledge). Texts may contain little, if any, competing information.
Level 2: Able to integrate two or more pieces of information, use paraphrasing or low-level inferences, compare and contrast or reason about the information requested.
Level 3: Able to understand dense and lengthy texts, various rhetorical structures, perform various levels of inference, and disregard irrelevant or inappropriate content.
Level 4: Able to integrate, interpret or synthesise information from complex or lengthy texts, perform complex inferences and apply background knowledge, identify and understand one or more non-central idea(s), evaluate subtle evidence-claim or persuasive discourse relationships, discern between the correct and competing information which sometimes appear in almost equal ratio.
Level 5: Able to search for and integrate information across multiple dense texts, synthesise various ideas and points of view, evaluate evidence-based arguments, apply and evaluate logical and conceptual models of ideas, evaluate the reliability of sources, select relevant information, notice subtle rhetorical cues, and make high-level inferences or use specialised background knowledge.
The percentage of adults (age 16-65) whose literacy is below Level 2 is 16.7%, on average, across the 24 countries taken into account (OECD, 2013). In Italy, Spain, and the United States, this percentage is even higher (Table 1). On average, only 0.7% of the population has literacy Level 5. According to the definitions of the literacy levels in the OECD report, outlined above, one can conclude that: (1) all people with literacy below Level 4 have difficulties understanding conceptually complex texts; (2) all people with literacy below Level 3 additionally have difficulties understanding syntactically complex texts; and (3) all people with literacy below Level 2 have difficulties understanding texts which are linguistically difficult in any sense (lexically, syntactically, or conceptually). This means that, according to the results of the OECD report, approximately 16.7% of the population needs lexical simplification of everyday texts, 50% needs syntactic simplification, and 89.4% needs conceptual simplification (Table 1).

  Simplification needed    % of population
  lexical                  16.7%
  syntactic                50.0%
  conceptual               89.4%

Table 1: Percentage of population in need of text simplification (the numbers are inferred from (OECD, 2013)).
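The three conclusions above amount to a simple cumulative mapping from literacy level to simplification needs. A toy sketch of that reasoning (the function and representation are ours, not part of the OECD report):

```python
# Cumulative mapping from OECD literacy level to the simplification needs
# described in the text: below Level 4 -> conceptual; below Level 3 -> also
# syntactic; below Level 2 -> also lexical.

def needed_simplifications(literacy_level: int) -> list[str]:
    """Return the kinds of simplification a reader at this level benefits from."""
    needs = []
    if literacy_level < 4:
        needs.append("conceptual")   # conceptually complex texts are difficult
    if literacy_level < 3:
        needs.append("syntactic")    # additionally, syntactically complex texts
    if literacy_level < 2:
        needs.append("lexical")      # additionally, any linguistic complexity
    return needs

print(needed_simplifications(1))  # ['conceptual', 'syntactic', 'lexical']
print(needed_simplifications(3))  # ['conceptual']
```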

Manual Text Simplification
Text simplification aims to transform original texts into simpler variants that are more understandable to the target reader(s), while preserving the original meaning. Some nuances in meaning are inevitably lost in the process, but the core information should stay the same. The first guidelines on how to write more accessible texts for international communication, by using a reduced vocabulary and a restricted number of grammatical rules, were Basic English (Ogden, 1937) and the Plain English initiative (Crystal, 1987). Since the late 1990s, many more initiatives have raised awareness about how to write more accessible texts, e.g. how to write for people with intellectual disabilities (Freyhoff et al., 1998; Mencap, 2002; Karreman et al., 2007), how to make public information more accessible to wider audiences (PlainLanguage, 2011), and how to make web content more accessible (W3C, 2008; Cooper et al., 2010). Specialized websites that offer easy-to-read news are now common in many countries, e.g. Noticias fácil in Spain, DR in Denmark, News Web Easy in Japan.

Automatic Text Simplification (ATS)
Manual text simplification, apart from being slow and expensive (it requires trained human editors), cannot keep up with the new information published online, offer a variety of written content, or adapt texts at a personal level. This created the need for automatic, or at least semi-automatic (involving a manual post-editing step), TS systems.
The first automatic text simplification (ATS) systems were rule-based, focused on syntactic simplification in English, and envisioned mainly as a pre-processing step for various natural language processing (NLP) applications, e.g. parsing, machine translation, summarization, and information retrieval; the clarity of texts for human readers was only one of the possible use cases (Chandrasekar et al., 1996; Chandrasekar and Srinivas, 1997). Subsequent ATS systems were proposed for various languages and encompassed any combination of the following simplification operations: lexical simplification, syntactic simplification, or explanation generation (Shardlow, 2014; Siddharthan, 2014).

ATS Projects for Social Good
Automatic text simplification gained more interest from the research community through several national and international projects that secured public funding for building ATS systems for various vulnerable populations and in various languages (Table 2). All aimed to provide better social inclusion, thus promoting ATS as an NLP task for social good.
ATS for Readers with Aphasia. PSET (Practical Simplification of English Text) was a UK national project (1996-2000) that aimed to build an ATS system that would make newspaper articles more accessible to people who suffer from aphasia (Carroll et al., 1998). This was the first project to promote the use of ATS for social good. The solution was envisioned as a modular system consisting of two components: an analyser and a simplifier. The analyser would use a lexical tagger, a morphological analyser, and a parser to recognize potentially difficult words and sentences for aphasic readers. The simplifier would use rule-based syntactic simplification to convert passive to active voice, extract embedded clauses, and split conjoined sentences, and a lexical simplification system that would use WordNet (Vossen, 2004) for retrieving substitution candidates and the Oxford Psycholinguistic Database (Quinlan, 1992) to choose the simplest among them. The plan was to evaluate the final system with a group of aphasic readers (Carroll et al., 1998). HAPPI (Helping Aphasic People Process Information) was the continuation of the PSET project, this time with the aim of developing a web-based ATS system that would make web content more accessible to aphasic readers by simplifying text vocabulary and syntax (Devlin and Unthank, 2006). To the best of our knowledge, there are no publications about evaluations of the final systems with the end users.
ATS for Readers with Low Literacy. The PorSimples (Simplification of Portuguese Text for Digital Inclusion and Accessibility) project (Aluísio and Gasperin, 2010) was a Brazilian national project (2007-2010). Its aim was to produce tools and resources for people with low literacy levels (through the assistive technology systems FACILITA and Educational FACILITA) and for authors who want to produce texts for this audience (through an authoring system called SIMPLIFICA).
The FACILITA system was envisioned with a focus on summarization and syntactic simplification, i.e. sentence splitting, change of discourse markers, conversion from passive to active voice, inversion of clause order, subject-verb-object reordering, topicalization and de-topicalization (Watanabe et al., 2009). Educational FACILITA was a web application prototype that offered lexical elaboration and named entity labelling. A pilot study with low-literacy users reported improved text comprehension, though participants reported being sometimes confused by the offered lists of synonyms with multiple meanings (Watanabe et al., 2010).

ATS for People with Cognitive Disabilities.
Simplext (Saggion et al., 2011) was a Spanish national project (2010-2013) that proposed several modules for automatic simplification of Spanish (news) texts for people with cognitive disabilities (particularly people with Down's syndrome). The modules combined rule-based and corpus-based techniques for reducing syntactic complexity of sentences, deleting unnecessary information, performing numerical simplification, normalizing reporting verbs, and reducing lexical complexity. The final ATS system was tested by 44 people with Down's syndrome by measuring reading time and text comprehension. More correct answers were obtained for simplified texts, on average, and participants positively rated the existence of such a tool being available through different technological channels, e.g. computer, smartphone, and tablet. The Able to Include project (2014-2017) was an international project that aimed to improve the quality of life of people with intellectual or developmental disabilities, dementia, or any kind of cognitive impairment. Three applications were built during the project: ABLEsocial (media app), ABLEchat (mobility app), and KOLUMBA (ABLEmail app), using Simplext (ATS technology), PictoText, and Text2Speech web services. This project did not offer significant advances in core ATS research. Its focus was, instead, on bringing existing ATS solutions to life through freely accessible prototypes.
ATS for Readers with Autism. The FIRST project (2012) was an EC-funded project under the FP7 ICT call concerning smart and personalised inclusion (Orăsan et al., 2018). It produced OpenBook, a software tool that can automatically identify a range of language phenomena (e.g. complex syntactic structures, complex words and phrases, ambiguous words, and metaphors) that are problematic for people with high-functioning (IQ>70) autism spectrum disorders (ASD), and replace some of them. It also offered the addition of illustrative pictures and concise document summaries. The tool supported three languages: English, Spanish, and Bulgarian. Acknowledging that ATS produced a large number of errors for some texts, and that end users have low tolerance for ungrammatical and erroneous text, OpenBook also offered powerful post-editing options so that carers could post-edit the texts before showing them to the end users (Orăsan et al., 2018). The architecture was highly modular, with separate modules for each transformation, to allow for the highest possible level of personalization. The final evaluation with 243 participants with high-functioning autism from the UK, Spain, and Bulgaria was performed on texts that were automatically simplified and post-edited (by the clinical teams) using OpenBook. The participants were more successful at answering multiple-choice questions about the simplified than the original texts, and human-aided text conversion time significantly decreased when using OpenBook (Orăsan et al., 2018).

ATS in Medical Domain
A special type of literacy is health literacy, defined by the European Health Literacy Consortium as follows: "Health literacy is linked to literacy and entails people's knowledge, motivation and competences to access, understand, appraise and apply health information in order to make judgements and take decisions in everyday life concerning health care, disease prevention and health promotion to maintain or improve quality of life during the life course." According to the European Health Literacy Survey, nearly half of all adults in the eight European countries covered by the survey were found to have low health literacy skills, which may result in less healthy choices, riskier behaviour, poorer health, less self-management, and more hospitalization (WHO, 2013). Due to their potential to improve the health of people with low health literacy skills, attempts at building ATS systems focused on simplifying medical content have recently attracted noticeable attention.
In contrast to the ATS systems mentioned so far, ATS systems in the medical domain are particularly focused on translating highly specialized medical expressions into their layman variants, and thus require domain-specific resources and customized algorithms. Kloehn et al. (2018) proposed an algorithm for automatically generating explanations of difficult (medical) terms in English and Spanish. van den Bercken et al. (2019) explored the possibility of training an earlier proposed general-purpose neural text simplification model (Nisioi et al., 2017) on a medical parallel corpus. Shardlow and Nawaz (2019) used the general-purpose neural text simplification model (Nisioi et al., 2017) augmented with a phrase table of complex-simple medical terminology to automatically simplify clinical letters in English. Cardon and Grabar (2020) used a similar approach for biomedical texts in French. Emphasizing the need for high-quality simplification in the medical domain, Van et al. (2020) explored the possibility of applying pretrained neural language models to the autocomplete process for sentence-level medical text simplification.
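The phrase-table idea can be sketched roughly as follows. This is a minimal illustration with a hypothetical toy phrase table; the actual systems integrate such terminology mappings into a neural model rather than applying bare string replacement:

```python
# Toy complex->layman medical phrase table (hypothetical entries, for
# illustration only; real systems use curated terminology resources).
MEDICAL_PHRASE_TABLE = {
    "myocardial infarction": "heart attack",
    "hypertension": "high blood pressure",
}

def apply_phrase_table(sentence: str) -> str:
    """Replace known complex medical terms with their layman variants."""
    for complex_term, simple_term in MEDICAL_PHRASE_TABLE.items():
        sentence = sentence.replace(complex_term, simple_term)
    return sentence

clinical = "The patient has hypertension following a myocardial infarction."
print(apply_phrase_table(clinical))
# The patient has high blood pressure following a heart attack.
```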

ATS Research Trends
Research on ATS in English could roughly be divided into three phases: rule-based systems that focus on specific, well-defined transformations (until 2010); data-driven supervised machine learning systems trained on parallel TS data (2010-2014/2016); and neural text simplification systems (from 2015/2017 onward). (The vast majority of ATS research has been published for English, and the discussion here is thus focused on it; most ATS research for other languages adapts similar ideas to the particular language.) The neural trend started earlier for purely lexical simplification systems, by leveraging word embeddings (Glavaš and Štajner, 2015; Paetzold and Specia, 2016). Sentence simplification systems embraced neural architectures in 2017, using neural machine translation (Nisioi et al., 2017) and reinforcement learning (Zhang and Lapata, 2017). Those three years (2010, 2015, and 2017) can clearly be observed in Figures 1 and 2, which show the number of scholarly articles that mention specific TS terms anywhere in the text, and the number of scholarly articles with those TS terms in their title (patents and citations were excluded from this search; only exact matches of the TS terms were counted, and all returned titles were additionally checked manually). Interestingly, the number of articles mentioning syntactic simplification was steadily increasing until 2015, and then again in the period 2017-2019 (Figure 1), while the number of articles with syntactic simplification in their title was increasing until 2014, then noticeably decreased by 2016, never to increase much again (Figure 2). The number of lexical simplification (LS) papers, in turn, has been increasing since 2010, with two peaks, in 2012 and 2016. The peak in 2012 was due to the papers describing systems which participated in the SemEval-2012 English Lexical Simplification task
(Specia et al., 2012), and to lexical simplification endeavours within the scope of the Simplext project (Bott et al., 2012; Drndarević and Saggion, 2012a,b). Almost all lexical simplification research published in 2016 focused on LS for non-native speakers.
So far, conceptual simplification has been tackled only through coreference resolution (Orăsan et al., 2018). Although most works mention "text simplification", they actually refer to sentence simplification. The state-of-the-art ATS systems published in top-tier NLP/CL/AI conferences, e.g. (Nisioi et al., 2017; Zhang and Lapata, 2017; Surya et al., 2019; Kumar et al., 2020; Mallinson et al., 2020), all describe end-to-end systems for sentence simplification, and are not directed towards any particular simplification transformation or target population.
Very few papers have proposed ATS methods that operate beyond the sentence level (Glavaš and Štajner, 2013; Narayan and Gardent, 2014; Štajner and Glavaš, 2017). Research interests thus do not seem to follow the reported text simplification needs (see Table 1 in Section 1.1).

Challenges in ATS for Social Good
In spite of the evident increase in interest in ATS over the last few years, and the obvious benefits it could bring to society, no commercial systems have been offered so far, which indicates that automatic text simplification is a very challenging task.
The main reason might lie in the multidisciplinary nature of the task, i.e. the need to combine linguistics, psycholinguistics, computer science, and, for use in real-world applications, advanced software engineering. Only large multidisciplinary teams can ensure solutions that are technically advanced and, at the same time, truly helpful to the end users. The absence of such multidisciplinary research leads to the use of suboptimal resources and evaluation procedures that do not involve end users, thus steering the field away from any real-world applications.

Resources
The number of manually simplified texts increases daily due to the number of websites that provide easy-to-read materials (Section 1.2). Unfortunately, those materials and their original (non-simplified) versions are usually not made available for either research or commercial purposes. The exceptions are the English and Spanish Newsela corpora, which are available upon request for research purposes (the portion until 2016). The largest freely available TS corpora to date are those based on matching original English Wikipedia articles with the articles of the same title in Simple English Wikipedia, the EW-SEW corpora. They are available in several versions, as document-aligned (Kauchak, 2013) and sentence-aligned (Zhu et al., 2010; Coster and Kauchak, 2011b; Kauchak, 2013; Hwang et al., 2015). Apart from being only comparable rather than truly parallel corpora, as the articles in Simple English Wikipedia might have been written completely independently from the articles with the same title in the original English Wikipedia, many objections have been raised regarding the quality of their simplifications (Amancio and Specia, 2014; Štajner et al., 2015; Xu et al., 2015; Štajner et al., 2020).
The English and Spanish Newsela corpora (Xu et al., 2015;Štajner et al., 2018), in contrast, provide parallel texts on several complexity levels, manually simplified under strict quality control (Xu et al., 2015).
The other available parallel (document-aligned) TS corpora are significantly smaller and do not provide enough material for training ATS models (Table 3). As their texts were simplified for different target populations, they cannot be combined into larger training datasets for any specific target population, since there is no universal native speaker of simplified language (Siddharthan, 2014). Nevertheless, those smaller TS corpora could be used for the evaluation of ATS systems aimed at particular target populations.
The sentence-aligned TS corpora, apart from those obtained from Newsela and EW-SEW, are still very scarce and limited in their size (Table 4).
The main problem with most sentence-aligned TS corpora is that they were automatically aligned using various sentence similarity measures with high similarity thresholds. This resulted in a high number of pairs of identical, or nearly identical, sentences which, if used for training ATS systems, lead to conservative systems that rarely suggest any changes to the original sentence (Štajner and Nisioi, 2018). The newer alignment methods (Hwang et al., 2015; Štajner et al., 2017; Paetzold et al., 2017) offer two types of sentence pairs: full matches and partial matches. In NMT-based TS (Štajner and Nisioi, 2018), the EW-SEW dataset (Hwang et al., 2015) resulted in a model with a similarly high percentage of changed sentences as the model trained on the Newsela sentence-aligned corpus (Štajner et al., 2017). Those results indicate that the EW-SEW dataset has the potential to be used for training state-of-the-art ATS models, if the sentences are carefully matched and filtered.
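The threshold effect described above can be illustrated with a minimal alignment sketch. This is not one of the alignment methods cited; it uses difflib's token overlap as a stand-in for the similarity measures, with hypothetical toy sentences:

```python
# Sketch of similarity-threshold sentence alignment and its conservativeness
# problem: with a high threshold, retained pairs are mostly (near-)identical,
# so a model trained on them learns to copy the input.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Token-level similarity ratio between two sentences."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()

def align(original: list[str], simplified: list[str], threshold: float = 0.7):
    """Greedily pair each original sentence with its most similar simplified one."""
    pairs = []
    for o in original:
        best = max(simplified, key=lambda s: similarity(o, s))
        score = similarity(o, best)
        if score >= threshold:
            kind = "identical" if o == best else "partial"
            pairs.append((o, best, round(score, 2), kind))
    return pairs

orig_sents = ["The committee reached a unanimous decision.",
              "The weather was fine."]
simp_sents = ["Everyone on the committee agreed.",
              "The weather was fine."]
for pair in align(orig_sents, simp_sents, threshold=0.5):
    print(pair)
# At threshold 0.5 only the identical pair survives; the strong paraphrase
# (similarity ~0.36) is dropped, illustrating the bias described above.
```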
To build training datasets for ATS in the medical domain, van den Bercken et al. (2019) and Van et al. (2020) used the EW-SEW dataset, retaining only those sentence pairs in which the original sentences contain a certain number of medical terms.
One of the biggest challenges in ATS is how to collect a parallel dataset of truly strong paraphrases, which are necessary for training ATS systems for people who require higher levels of simplification (e.g. people with cognitive or reading disabilities, and people with low literacy levels). Strong paraphrases are those in which vocabulary and sentence structure were changed beyond isolated lexical simplifications, straightforward sentence splitting, and passive-to-active voice conversion; for examples, see Tables 1, 2, and 5 in (Štajner et al., 2018). Such sentence pairs are still rarely correctly extracted by automatic alignment methods (Štajner et al., 2018).
The sentence-aligned TS corpora in Brazilian Portuguese, PorSimples (Specia, 2010) and PorSimplesSent (Leal et al., 2018; https://github.com/sidleal/porsimplessent), compiled under the PorSimples project, are the only ones that did not lose strong paraphrase pairs to automatic sentence alignment, as the manual simplifications were performed using a special editing tool that recorded the exact sentence alignments (Caseli et al., 2009). Both datasets are still too small for strong paraphrases to be learned automatically from them alone. Nevertheless, they demonstrate that, by providing editing tools for manual simplification, it is possible to automatically extract strong paraphrases from manually simplified corpora. This approach could thus be used in collaboration with various associations and content editors who produce professional manual simplifications for various target populations, to compile large, high-quality training datasets for building ATS systems aimed at those particular end users.

Evaluation
Ideally, ATS systems should be evaluated for output quality, as well as for usability, i.e. whether or not they make texts easier to read and understand for the particular reader.

Quality
The quality of ATS output should be evaluated for its grammaticality (and naturalness), meaning preservation (ensuring that, despite some necessary loss of nuance, the core meaning remains unchanged), and preservation of text coherence and cohesion. The last is particularly important for systems that perform syntactic simplification and sentence splitting, which often break cohesion links and make texts more complex at the discourse level (Siddharthan, 2003, 2006). Meaning preservation is also often compromised, both by syntactic and by lexical simplification. Some unsupervised LS systems (Glavaš and Štajner, 2015; Paetzold and Specia, 2016) particularly suffer from this issue due to their use of word embeddings for synonym retrieval, which often returns antonyms that appear in similar contexts (Glavaš and Štajner, 2015). Grammatical errors are less common, and can easily be corrected manually in a post-editing step. Manual correction of changes in meaning, in turn, can sometimes require more time than simplifying from scratch. Coherence problems in automatically simplified texts are the most expensive, as they might require heavy restructuring of the whole text.
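The antonym problem arises because distributionally similar words get similar vectors regardless of polarity: "hot" and "cold" occur in the same contexts ("the soup is ___"). A toy demonstration with hand-crafted vectors (not real embeddings):

```python
# Toy illustration of why nearest-neighbour lookup over distributional
# vectors retrieves antonyms: context similarity ignores polarity.
import math

toy_vectors = {          # hypothetical 3-d "context" vectors
    "hot":   [0.9, 0.8, 0.1],
    "cold":  [0.85, 0.75, 0.15],  # shares contexts with "hot" -> similar vector
    "warm":  [0.8, 0.7, 0.2],
    "table": [0.1, 0.2, 0.9],     # unrelated word, dissimilar contexts
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

query = toy_vectors["hot"]
ranked = sorted((w for w in toy_vectors if w != "hot"),
                key=lambda w: cosine(query, toy_vectors[w]), reverse=True)
print(ranked)  # the antonym "cold" is the nearest neighbour of "hot"
```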
To measure the quality of automatically simplified texts, Štajner and Glavaš (2017) proposed measuring the post-editing time needed to restore the text's grammaticality, original meaning, and coherence. The output quality of sentence simplification systems is, in contrast, evaluated only for grammaticality and meaning preservation, usually on a five-point Likert scale by native or non-native speakers with high literacy levels.

Usability
The usability of ATS systems should ideally be evaluated by measuring reading time and text comprehension by end users, usually via eye-tracking methods and comprehension tests. This type of evaluation requires significant time and special training for preparing, collecting, and interpreting the data. Preparation of comprehension tests requires experts, and eye-tracking methods require special equipment and trained humans to analyse and interpret the results. Access to vulnerable populations (e.g. people with autism or cognitive disabilities) usually requires special permissions and training. Another obstacle with vulnerable populations is that laboratory conditions and the use of comprehension questionnaires create an unnatural reading scenario that sometimes leads to unreliable results (Yaneva et al., 2015, 2016a). For all those reasons, ATS systems are rarely evaluated for their usability in this way. Notable exceptions are: the eye-tracking analyses of simplification strategies for dyslexic readers (Rello et al., 2013a,b,c) and readers with autism spectrum disorders (Yaneva et al., 2016b), the comprehension-based evaluations of the Simplext system by people with Down's syndrome, of the OpenBook software by people with ASD (Orăsan et al., 2018), and of several ATS systems by non-native low-pay workers (Angrosh et al., 2014).

Simplicity
Usability evaluation with end users requires a long time and special expertise. To obtain results faster, it has become a common strategy in ATS research papers to instead evaluate a small portion of automatically simplified sentences for their simplicity, on a Likert scale, by evaluators who are not the intended end users.
In some studies the evaluators are native speakers, e.g. (Yatskar et al., 2010; Baeza-Yates et al., 2015), in some non-native speakers, e.g. (Coster and Kauchak, 2011a; Wubben et al., 2012; Glavaš and Štajner, 2013, 2015; Paetzold and Specia, 2016), in some a mixture of the two (Angrosh et al., 2014; Kumar et al., 2020), and in some not specified, e.g. (Specia, 2010; Xu et al., 2016; Mallinson et al., 2020; Alva-Manchego et al., 2020a). In some of those studies, the evaluators are students, i.e. readers with high literacy levels, and in others Amazon Mechanical Turk workers, whose literacy level is usually unknown. Štajner and Nisioi (2018) found that evaluators with high literacy levels rarely notice any difference in complexity between the original and automatically simplified sentences. Other studies have shown that native and non-native speakers perceive sentence complexity differently (Štajner, 2018) and have different lexical simplification needs (Yimam et al., 2017a,b, 2018). Therefore, a direct comparison of ATS systems proposed in studies where simplicity was evaluated by different types of evaluators is not possible. Apart from the type of evaluators, the evaluation procedures used across the ATS community also differ in evaluation type (absolute or relative) and scale (e.g. 0/1, 1-3, 1-5), thus hindering the possibility of directly comparing the results reported in different studies (Štajner, 2018). In absolute evaluation, evaluators are presented with one sentence at a time and asked to rate its simplicity on a certain scale. If provided with different variants of the same sentence one after another, depending on the guidelines, evaluators may sometimes emphasize even small differences between them by giving them different scores.
Something similar may happen in relative evaluation of simplicity gain, if evaluators are explicitly asked to count the number of successful lexical and syntactic paraphrases between the two sentences (original and simplified), as in the work of Xu et al. (2016). These approaches are thus helpful if the goal is to spot even the slightest differences between several models. If the goal is, in contrast, to estimate the usability and helpfulness of systems in a real-world scenario, i.e. in a text comprehension task, then the approach in which evaluators are presented with pairs of sentences (original and simplified, without specifying which one is which) and asked whether they find one of them easier to understand, proposed by Nisioi et al. (2017), might be better suited.
To enable direct comparison of results reported in different ATS studies, a standardized evaluation procedure across the field is needed. To establish the right evaluation procedure, it would first be necessary to investigate whether some particular group of evaluators can be a good proxy for assessing the complexity of texts/sentences for the intended target population. If no such group satisfies this condition, the evaluators would have to be provided with detailed evaluation guidelines and trained to spot potential obstacles for the intended end users. Furthermore, it would be necessary to systematically investigate which evaluation type and scale give the most reliable assessment of the potential usability of ATS systems in a real-world scenario.

Automatic Evaluation
With the goal of offering a more standardized way of measuring ATS systems' performance, several test sets have been compiled for lexical simplification in English (De Belder and Moens, 2012; Specia et al., 2012; Horn et al., 2014; Paetzold and Specia, 2015). De Belder and Moens (2012) proposed a test set of 430 sentences, each with one marked word and a list of words that can replace it, ordered by their difficulty. Similarly, Specia et al. (2012) compiled a test set of 1710 sentences, each with one marked word and four substitution candidates ordered by their difficulty. Horn et al. (2014) compiled a set of 500 sentences from English Wikipedia, each with one marked word and 50 substitution candidates collected via crowdsourcing on Amazon Mechanical Turk.
All these datasets rely on the suggestions and rankings of either non-native speakers (usually with high education levels) or neurotypical native speakers. Therefore, without a systematic investigation of how well they reflect the simplification needs of other target populations, it is not clear to what extent, if at all, they can be used to evaluate the performance of ATS systems aimed at those populations.
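Evaluation against such test sets typically reduces to checking a system's chosen substitute against the gold ranking. A minimal sketch with toy data (not drawn from the actual datasets cited above):

```python
# Accuracy@1 against a gold ranking of substitution candidates: the system's
# chosen substitute counts as correct only if it is the gold simplest word.

def accuracy_at_1(gold: dict[str, list[str]], system: dict[str, str]) -> float:
    """Fraction of target words for which the system picked the gold simplest word."""
    hits = sum(1 for word, ranked in gold.items() if system.get(word) == ranked[0])
    return hits / len(gold)

gold = {  # target word -> substitutes ordered from simplest to hardest (toy data)
    "commence": ["start", "begin", "initiate"],
    "terminate": ["end", "stop", "conclude"],
}
system_output = {"commence": "start", "terminate": "stop"}
print(accuracy_at_1(gold, system_output))  # 0.5
```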
For automatic evaluation of sentence splitting modules for English, two datasets are currently available: WebSplit and WikiSplit (Botha et al., 2018). Both datasets were automatically extracted and aligned: WebSplit from the WebNLG corpus, and WikiSplit from Wikipedia edit histories. Recently, a crowdsourced dataset with multiple human simplification suggestions (both syntactic and lexical) for 2,359 original English Wikipedia sentences was released (Alva-Manchego et al., 2020a). So far, none of them has been tested for how well it can evaluate the usability of real-world ATS systems for any particular target population.
Many ATS systems have been automatically evaluated using the Flesch-Kincaid Grade Level (FKGL) (Kincaid et al., 1975), MT-inspired evaluation metrics, e.g. BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2011), and TER (Snover et al., 2009), or the more recently proposed TS-specific metrics SARI (Xu et al., 2016) and SAMSA (Sulem et al., 2018b). None of these metrics, however, can replace the necessary manual evaluation of grammaticality and meaning preservation (at the sentence level), or cohesion and coherence (at the discourse level). Traditional readability metrics such as FKGL were proposed for human-generated texts and are completely oblivious to meaning, i.e. short, meaningless sentences score very well on them. The use of BLEU for TS evaluation has been reported to have many drawbacks (Štajner et al., 2015; Xu et al., 2016; Sulem et al., 2018a). Some other MT-based evaluation metrics have shown good correlations with human judgements of grammaticality and meaning preservation (Štajner et al., 2014; Popović and Štajner, 2016), but they all favor simplifications that stay close to the original sentence (i.e. make very few changes). SARI, in turn, favors simplifications that are as different from the original as possible, thus often rewarding simple sentences that significantly alter the meaning.
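FKGL's obliviousness to meaning is easy to demonstrate: the metric is a linear function of average sentence length and average syllables per word, so short, meaningless output receives a very low (i.e. "simple") grade. A minimal sketch follows, using a naive vowel-group heuristic in place of a proper syllable counter:

```python
import re

def count_syllables(word: str) -> int:
    # Naive approximation: count groups of consecutive vowels.
    count, prev_vowel = 0, False
    for ch in word.lower():
        is_vowel = ch in "aeiouy"
        if is_vowel and not prev_vowel:
            count += 1
        prev_vowel = is_vowel
    return max(count, 1)

def fkgl(text: str) -> float:
    # Flesch-Kincaid Grade Level (Kincaid et al., 1975):
    # 0.39 * (words/sentence) + 11.8 * (syllables/word) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    n_sent, n_words = max(len(sentences), 1), max(len(words), 1)
    n_syll = sum(count_syllables(w) for w in words)
    return 0.39 * n_words / n_sent + 11.8 * n_syll / n_words - 15.59
```

A meaningless output such as "Dog cat run. Sun big now." receives a grade level below 0, while a long, polysyllabic sentence scores far higher, regardless of whether either text preserves the original meaning.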

Conclusion and Outlook
In this study, the potential of using ATS for social good and the main challenges the field faces in achieving that goal were emphasized: the lack of large high-quality TS datasets for training ATS systems for particular target populations, and the lack of standardized evaluation procedures for estimating the usability potential of the proposed ATS systems in real-world scenarios. Because of these gaps, it is not clear whether the recently proposed neural ATS systems, published in top-tier NLP/CL/AI conferences, e.g. (Nisioi et al., 2017; Zhang and Lapata, 2017; Surya et al., 2019; Kumar et al., 2020; Mallinson et al., 2020), represent a real step forward toward using ATS for social good.
To address the above-mentioned challenges, it seems necessary to: (1) establish multidisciplinary collaborations with associations and content editors that have experience in simplifying texts for specific groups of end users; and (2) prioritize work on highly modular ATS systems, which allow components to be easily customized to the simplification needs of particular groups of users. Multidisciplinary collaborations would enable the compilation of larger high-quality datasets for various target populations, a better understanding of which ATS modules to prioritize, and the possibility of testing ATS systems with the real end users.

Ethics/Impact Statement
As this is a special theme paper that does not propose any novel systems, the main focus of this section is a discussion of the general ethical considerations for automatic text simplification and its impact on society.

Intended Use
Automatic text simplification (ATS) has strong potential to improve social inclusion for many people who are otherwise marginalized in one way or another. While many large-scale projects have demonstrated this, the last several years have been dominated by ATS research that does not seem to lead towards practical applications of ATS in society. The goal of this paper is to raise awareness among researchers of the original motivations for ATS and its potential to be used for social good, as well as to point out the main challenges in the field and suggest ways forward.

Failure Modes
As mentioned in Section 3.2, ATS systems often produce ungrammatical sentences, alter the original meaning, and break text coherence. For this reason, to avoid potential unintended harm, it is advisable to add post-editing capabilities to ATS systems, as was done in the OpenBook software (Orăsan et al., 2018). Unintended changes in meaning produced by ATS systems would be especially dangerous in the legal and medical domains.

Biases
Like all machine learning (ML) systems, ML-based ATS systems can suffer from algorithmic biases. If trained on data that represents manual simplifications performed by trained human editors, a system might inherit the biases and preferences of those editors. If trained on automatically sentence-aligned data, a system might learn transformations that reflect the alignment method rather than the actual simplification strategies used in the pre-aligned corpora.

Misuse Potential
Like any technological advancement, ATS could also be misused, e.g. by learning to make texts more complex with the intention of hiding undesirable facts behind linguistically complex language, as it is known that a high percentage of people have problems understanding complex texts (Section 1.1), or by unintentionally altering the original meaning and thus offering incorrect information.

Potential Harm to Vulnerable Populations
People with low literacy levels, or with any kind of reading or cognitive impairment, who cannot understand the original texts and thus fully rely on the simplified versions, can potentially be harmed by being given incorrect information if premature ATS systems are used whose output is not manually checked for meaning preservation. Therefore, it is important that any ATS system intended for those populations offers post-editing capabilities and that its output passes a rigorous manual check for meaning preservation.
Furthermore, if provided with automatically simplified learning material that was not carefully checked for grammaticality and the naturalness of its sentences, children and language learners might learn incorrect word forms or sentence structures.

Computing Time
The latest trends in ATS research have steered systems towards very computationally expensive (and environmentally unfriendly) neural architectures. As the current neural state-of-the-art ATS systems still lag far behind the traditional modular rule-based ATS systems (especially for syntactic simplification), many research groups and organizations need numerous attempts with neural architectures just to be able to publish a paper in a top-tier conference. The current best performing neural lexical simplification systems (Glavaš and Štajner, 2015; Specia, 2016, 2017a; Qiang et al., 2020a,b) perform better than the state-of-the-art non-neural lexical simplification systems (Glavaš and Štajner, 2015; Paetzold and Specia, 2017b; Qiang et al., 2020b). Nevertheless, due to their reliance on heavy resources, such as BERT (Devlin et al., 2018) and PPDB (Ganitkevitch et al., 2013), or on multiple operations over word embeddings, they require significant computational power and are too slow to be used in real-world applications.