TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing

TextFlint is a multilingual robustness evaluation toolkit for NLP tasks that incorporates universal text transformation, task-specific transformation, adversarial attack, subpopulation, and their combinations to provide comprehensive robustness analyses. This enables practitioners to automatically evaluate their models from various aspects, or to customize their evaluations as desired, with just a few lines of code. TextFlint also generates complete analytical reports as well as targeted augmented data to address the shortcomings of a model in terms of its robustness. To guarantee acceptability, all the text transformations are linguistically based, and all the selected transformed data (up to 100,000 texts) scored highly under human evaluation. To validate its utility, we performed large-scale empirical evaluations (over 67,000) on state-of-the-art deep learning models, classic supervised methods, and real-world systems. The toolkit is available at https://github.com/textflint, with all the evaluation results demonstrated at textflint.io.


Introduction
The recent breakthroughs in deep learning theory and technology provide strong support for the wide application of NLP technology, such as question answering systems (Seo et al., 2016), information extraction (Zeng et al., 2014), and machine translation (Hassan et al., 2018). A large number of models have emerged whose performance surpasses that of humans (Lan et al., 2020; Clark et al., 2020) when the training and test data are independent and identically distributed (i.i.d.). However, the repeated evaluation of models on a hold-out test set can yield overly optimistic estimates of model performance (Dwork et al., 2015). The goal of building NLP systems is not merely to obtain high scores on the test datasets, but to generalize to new examples in the wild. However, recent research has reported that highly accurate deep neural networks (DNNs) can be vulnerable to carefully crafted adversarial examples (Li et al., 2020), distribution shift (Miller et al., 2020), data transformation (Xing et al., 2020), and shortcut learning (Geirhos et al., 2020). Using hold-out datasets that are often not comprehensive tends to result in trained models that contain the same biases as the training data (Rajpurkar et al., 2018), which makes it difficult to determine where the model defects are and how to fix them (Ribeiro et al., 2020).
Recently, researchers have begun to explore ways to detect robustness issues prior to model deployment. Approaches to textual robustness evaluation focus on making small modifications to the input that maintain the original meaning but result in a different prediction. These approaches can be roughly divided into three categories: (1) adversarial attacks based on heuristic rules or language models that modify characters and substitute words (Morris et al., 2020; Zeng et al., 2020); (2) text transformations, i.e., task-agnostic (Ribeiro et al., 2020) or task-specific (Xing et al., 2020) testing methodologies that create challenge datasets based on specific natural language capabilities; (3) subpopulations that aggregate metrics over particular slices of interest. Using the continual evaluation paradigm rather than testing a static artifact, a model can continuously be evaluated in light of new information about its limitations. However, these methods have often focused on either universal or task-specific generalization capabilities, which makes it difficult to perform a comprehensive robustness evaluation. We argue that current robustness evaluations face the following three challenges:
1. Integrity. When examining the robustness of a model, practitioners often hope that their evaluation is comprehensive and verifies the model's robustness from as many aspects as possible. However, previous work has often focused on universal or task-specific generalization capabilities. On one hand, universal generalization evaluations, like perturbations (Ribeiro et al., 2020) and subpopulations, have difficulty finding the core defects of different tasks (Section 4.1). On the other hand, task-specific transformations may be invalid for use on other tasks. For customized needs (e.g., the combination of reversing sentiment and changing named entities), practitioners must work out how to make different evaluation tools compatible.
2. Acceptability. Only when newly transformed texts conform to human language can the evaluation process obtain a credible robustness result. The uncontrollability of the words generated by a neural language model, the incompatibility caused by template filling, and the instability of heuristic rules in choosing words often make the generated sentences linguistically unacceptable to humans, which means the robustness evaluation will not be persuasive.
3. Analyzability. Users require not only prediction accuracy on new datasets, but also relevant analyses based on these results. An analysis report should be able to accurately explain where a model's shortcomings lie, such as the problems with lexical rules or syntactic rules. Existing work has provided very little information regarding model performance characteristics, intended use cases, potential pitfalls, or other information to help practitioners evaluate the robustness of their models. This highlights the need for detailed documentation to accompany trained deep learning models, including metrics that capture bias, fairness and failure considerations (Mitchell et al., 2019).
In response to these challenges, here, we introduce TextFlint, a unified, multilingual, analyzable robustness evaluation toolkit for NLP. The challenges described above can be addressed in the Customize ⇒ Produce ⇒ Analyze workflow. We summarize this workflow as follows: 1. Customize. TextFlint offers 20 general transformations and 60 task-specific transformations, as well as thousands of their combinations, which cover all aspects of text transformation to enable comprehensive evaluation of the robustness of a model (Section 3). TextFlint supports evaluations in multiple languages, currently English and Chinese, with other languages under development. In addition, TextFlint also incorporates adversarial attack and subpopulation. Based on the integrity of the text transformations, TextFlint automatically analyzes the deficiencies of a model with respect to its lexical, syntactic, and semantic capabilities, or performs a customized analysis based on the needs of the user.

2. Produce. TextFlint provides 6,903 new evaluation datasets generated by transforming 24 classic datasets for 12 tasks. Users can directly download these datasets for robustness evaluation. For those who need comprehensive evaluation, TextFlint supports the generation of all the transformed texts and corresponding labels within one command, automatic evaluation of the model, and the production of a comprehensive analysis report. For customized needs, users can modify the Config file and type a few lines of code to achieve a specific evaluation (Section 2).
3. Analyze. After scoring all of the existing transformation methods with respect to their plausibility and grammaticality by human evaluation, we use these results as a basis for assigning a confidence score to each evaluation result (Section 3.6). Based on the evaluation results, TextFlint provides a standard analysis report with respect to a model's lexical, syntactic, and semantic capabilities. All the evaluation results can be displayed via visualization and tabulation to help users gain a quick and accurate grasp of the shortcomings of a model. In addition, TextFlint generates a large number of targeted data to augment the evaluated model, based on the defects identified in the analysis report, and provides patches for the model defects.
TextFlint is easy to use for robustness analysis. To demonstrate the benefits of its process to practitioners, we outline how users with different needs can use TextFlint to evaluate their NLP models (Section 2.2). (1) Users who want to comprehensively evaluate a model's robustness can rely on predefined testbenches or generated datasets for direct evaluation. We explain how to use FlintModel to automatically evaluate model robustness from all aspects of text transformation (Section 2.1.1). (2) Users who want to customize their evaluations for specific tasks can construct their own testbenches with a few lines of code using the Config available in TextFlint.
(3) Users who want to improve model robustness can accurately identify the shortcomings of their model with reference to the analysis report (Section 2.1.3), then use TextFlint to augment the training data for adversarial training (Section 2.1.2).
We tested 95 state-of-the-art models and classic systems on 6,903 transformed datasets, for a total of over 67,000 evaluations, and found that almost all models showed significant performance degradation, including a decline of more than 50% in BERT's prediction accuracy on tasks such as aspect-level sentiment classification, named entity recognition, and natural language inference. This suggests that most of the evaluated models are nearly unusable in real-world scenarios, and that their robustness needs to be improved.

TextFlint Framework
TextFlint provides comprehensive robustness evaluation functions, i.e., transformation, subpopulation, and adversarial attack. For ordinary users, TextFlint provides a reliable default config to generate comprehensive robustness evaluation data with little learning cost. At the same time, TextFlint is highly flexible and supports customized config files. TextFlint can automatically analyze the target model's deficiencies and generate a visual report that can be used to inspire model improvement. Finally, TextFlint enables practitioners to improve their models by generating adversarial samples that can be used for adversarial training.
In this section, we introduce the design philosophy and modular architecture of TextFlint, followed by the workflow and usage for various requirements. Figure 1 shows the architecture of TextFlint. To apply TextFlint to various NLP tasks, its architecture is designed to be highly modular, easy to configure, and extensible. TextFlint can be organized into three main components according to its workflow, i.e., Input Layer, Generation Layer, and Report Layer. We introduce each of the three components in more detail below.

Input Layer
To apply robustness verification, Input Layer prepares the necessary information, including the original dataset, config file, and target model.
Sample A common problem is that the input formats of different models differ considerably, making it very difficult to load and utilize data. It is therefore highly desirable to unify the data structure for each task. Sample solves this problem by decomposing the data of various NLP tasks into underlying Fields, which cover all basic input types. Sample provides common linguistic functions, including tokenization, part-of-speech tagging, and dependency parsing, which are implemented based on Spacy (Montani et al., 2021). Moreover, we break down arbitrary text transformation methods into atomic operations inside Sample, backed by clean and consistent implementations. This design enables us to easily implement various transformations while reusing functions that are shared across transformations. Dataset, in turn, manages a collection of Sample instances and provides interfaces to Huggingface Datasets (Wolf et al., 2020), which enable practitioners to download public datasets directly.
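The Field decomposition and atomic-operation idea can be sketched in a few lines of Python. All class and method names below are illustrative stand-ins, not TextFlint's actual API, and whitespace tokenization stands in for the Spacy-based processing.

```python
# Toy sketch of the Sample/Field idea: task data is decomposed into
# typed fields, and transformations are built from atomic operations.
# These names are illustrative, not TextFlint's real classes.

class TextField:
    """Holds a piece of text plus a token view of it."""

    def __init__(self, text):
        self.text = text

    @property
    def tokens(self):
        # A real implementation would use Spacy; whitespace split is a stand-in.
        return self.text.split()

    def replace_token(self, index, new_token):
        """Atomic operation: swap a single token, returning a new field."""
        tokens = self.tokens
        tokens[index] = new_token
        return TextField(" ".join(tokens))


class SASample:
    """A sentiment-analysis sample decomposed into fields."""

    def __init__(self, text, label):
        self.x = TextField(text)
        self.y = label


sample = SASample("I love NLP", "positive")
swapped = sample.x.replace_token(1, "like")
print(swapped.text)  # -> "I like NLP"
```

Building every transformation from such shared atomic operations is what allows implementations to be reused across tasks.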
FlintModel FlintModel is a necessary input for applying an adversarial attack or generating a robustness report. TextFlint has great extensibility and allows practitioners to customize their target model with any deep learning framework. Practitioners need only wrap their models with FlintModel and implement the corresponding interfaces.
Config It is vital for the toolkit to be flexible enough to allow practitioners to configure the workflow, while providing appropriate abstractions so that practitioners need not overly focus on the low-level implementation. TextFlint enables practitioners to provide a customized config file to specify certain types of Transformation, Subpopulation, AttackRecipe, or their combinations, as well as their related parameter information. TextFlint also provides reliable default parameters, which lowers the barrier to use.
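As a rough sketch of how defaults and user overrides might interact, the snippet below merges a user-supplied JSON config over default parameters. The field names here are hypothetical, not TextFlint's real config schema.

```python
# Illustrative config handling: reliable defaults overridden by a
# user-provided JSON string. Field names are hypothetical examples.
import json

DEFAULT_CONFIG = {
    "task": "SA",
    "max_trans": 1,  # transformed samples generated per original sample
    "transformations": ["SwapSyn", "Contraction", "Typos"],
    "subpopulations": ["LengthSubPopulation"],
}

def load_config(user_config_json=None):
    """Merge a user-provided JSON config over the defaults."""
    config = dict(DEFAULT_CONFIG)
    if user_config_json:
        config.update(json.loads(user_config_json))
    return config

custom = load_config('{"transformations": ["SwapAnt"], "max_trans": 3}')
print(custom["transformations"], custom["max_trans"])  # -> ['SwapAnt'] 3
```

Keeping the defaults complete means an empty user config still yields a full evaluation, while any field can be overridden individually.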

Generation Layer
After Input Layer completes the required input loading, the interaction between TextFlint and the user is complete. Generation Layer applies the data generation functions, i.e., Transformation, Subpopulation, and AttackRecipe, to each sample. To improve memory utilization, Generation Layer dynamically creates Transformation, SubPopulation, and AttackRecipe instances according to the parameters of the Config instance.
Transformation Based on the atomic operations provided by Sample, it is easy to implement an arbitrary text transformation while ensuring its correctness. Thanks to the highly modular design of TextFlint, Transformation can be flexibly applied to samples for different tasks. It is worth noting that the procedure of Transformation does not need to query the target model, which means it is completely decoupled from target model prediction.
In order to verify robustness comprehensively, TextFlint offers 20 universal transformations and 60 task-specific transformations, covering 12 NLP tasks. According to their granularity, the transformations can be categorized into sentence level, word level, and character level. Sentence-level transformations include BackTranslation, Twitter, InsertAdv, etc. Word-level transformations include SwapSyn-WordNet, Contraction, MLMSuggestion, etc. Character-level transformations include KeyBoard, Ocr, Typos, etc. Owing to limited space, we refer readers to Section 3 for details.
AttackRecipe AttackRecipe aims to find a perturbation of an input text that satisfies the attack's goal of fooling the given FlintModel. In contrast to Transformation, AttackRecipe requires the prediction scores of the target model. Once Dataset and FlintModel instances are provided by Input Layer, TextFlint applies AttackRecipe to each sample. TextFlint provides 16 easy-to-use adversarial attack recipes implemented based on TextAttack (Morris et al., 2020).
Validator Are all generated samples correct, retaining the same semantics as the original samples rather than being completely unrecognizable by humans? It is crucial to verify the quality of samples generated by Transformation and AttackRecipe. TextFlint provides several metrics to calculate confidence: (1) language model perplexity calculated with the GPT-2 model (Radford et al., 2019); (2) the word replacement ratio in the generated text compared with the original text; (3) the edit distance between the original and generated text; (4) semantic similarity calculated with the Universal Sentence Encoder (Cer et al., 2018); (5) the BLEU score (Papineni et al., 2002).
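Two of the listed metrics need no model downloads and can be implemented directly; the snippet below computes character-level edit distance and a simple word replacement ratio. The exact formulas TextFlint uses may differ; this is an illustrative implementation.

```python
# Two validator metrics implemented directly: Levenshtein edit distance
# and word replacement ratio between original and generated text.

def edit_distance(a, b):
    """Character-level Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def word_replacement_ratio(original, generated):
    """Fraction of word positions whose token changed."""
    orig, gen = original.split(), generated.split()
    changed = sum(o != g for o, g in zip(orig, gen))
    changed += abs(len(orig) - len(gen))  # count unaligned tail as changed
    return changed / max(len(orig), len(gen))

print(edit_distance("Ireland", "Irland"))                      # -> 1
print(word_replacement_ratio("He loves NLP", "He likes NLP"))  # -> 0.333...
```

A validator would compare each metric against a threshold and discard generated samples that fall outside the acceptable range.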
Subpopulation Subpopulation identifies the specific part of a dataset on which the target model performs poorly. To retrieve a subset that meets the configuration, Subpopulation divides the dataset by sorting samples by certain attributes. TextFlint provides 4 general Subpopulation configurations, covering text length, language model performance, phrase matching, and gender bias, which work for most NLP tasks. Taking the text-length configuration as an example, Subpopulation retrieves the subset of the top 20% or bottom 20% of samples by length.
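The text-length configuration can be sketched as a sort-and-slice over the dataset; the function below is a simplified stand-in for the actual SubPopulation implementation.

```python
# Sketch of the text-length subpopulation: sort samples by length and
# keep the top or bottom 20% slice on which to re-evaluate the model.

def length_subpopulation(samples, position="bottom", ratio=0.2):
    ranked = sorted(samples, key=len)
    k = max(1, int(len(ranked) * ratio))
    return ranked[:k] if position == "bottom" else ranked[-k:]

dataset = ["a b", "a b c d e f g", "a", "a b c", "a b c d"]
print(length_subpopulation(dataset, "bottom"))  # -> ['a']
print(length_subpopulation(dataset, "top"))     # -> ['a b c d e f g']
```

Evaluating the model separately on each slice reveals whether, say, very short or very long inputs are a systematic weakness.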

Report Layer
In Generation Layer, TextFlint can generate three types of adversarial samples and verify the robustness of the target model. Based on the results from Generation Layer, Report Layer aims to provide users with a standard analysis report at the lexical, syntactic, and semantic levels. The running process of Report Layer can be regarded as a pipeline from Analyzer to ReportGenerator.

Analyzer
The Analyzer is designed to analyze the robustness of the target model from three perspectives: (1) robustness against multi-granularity transformed data and adversarial attacks; (2) gender bias and location bias; (3) subpopulation division. Based on the shortcomings of the target model, Analyzer can also suggest potential directions for performance improvement.
ReportGenerator According to the analysis provided by Analyzer, ReportGenerator can visualize and chart the performance changes of the target model under different transformations.
ReportGenerator conveys the analysis results to users in PDF or LaTeX format, which makes them easy to save and share. ReportGenerator also provides users with a concise and elegant API to display their results and reduce the cost of analyzing large amounts of experimental data. We believe that a clear and reasonable analysis report will inspire users. Moreover, TextFlint can generate adversarial samples targeting the defects of the target model, and these samples are beneficial during the training process.
Figure 2: Workflow of TextFlint. The original dataset is transformed in TextFlint by multi-granularity transformations, which are specified by the task config. The original and transformed datasets are then applied to target models to evaluate model robustness under multiple transformations. Results are finally reported in a visualized form, and the transformed dataset can further be used as adversarial training samples for target models.

Workflow and Usage
The general workflow of TextFlint is displayed in Figure 2. In correspondence with Figure 1, the evaluation of target models can be divided into three steps. For input preparation, the original dataset for testing, which is to be loaded by Dataset, should first be formatted as a series of JSON objects. The TextFlint configuration is specified by Config, and target models are loaded as FlintModels. Then, in adversarial sample generation, multi-perspective transformations (as Transformation), including subpopulation division (as Subpopulation), are performed on Dataset to generate transformed samples. To ensure the semantic and grammatical correctness of transformed samples, Validator calculates the confidence of each sample and filters out unacceptable ones. Lastly, Analyzer collects the evaluation results and ReportGenerator automatically generates a comprehensive report of model robustness. Additionally, users can feed their training dataset into TextFlint to obtain a substantial number of transformed samples, which can be used for adversarial training of target models. Owing to its user-friendly design philosophy, TextFlint is practical in real applications. As mentioned in Section 1, we summarize three situations in which users may find model robustness evaluation challenging. In these situations, TextFlint proves helpful thanks to its comprehensive features and customization ability.
General Evaluation For users who want to evaluate the robustness of NLP models in a general way, TextFlint supports generating massive and comprehensive transformed samples within one command. By default, TextFlint performs all single transformations on the original dataset to form corresponding transformed datasets, and the performance of target models is tested on these datasets. As feedback on model robustness, the changes in target model performance on each transformed dataset and its corresponding original dataset are reported in a clear form. The evaluation report provides a comparative view of model performance on datasets before and after certain types of transformation, which supports model weakness analysis and guides targeted improvement.
Customized Evaluation Users who want to test model performance on specific aspects require a customized transformed dataset built from certain transformations or their combinations. In TextFlint, this can be achieved by modifying Config, which determines the configuration of TextFlint during generation. Config specifies the transformations performed on the given dataset, and it can be modified manually or generated automatically. For the latter, users can implement customized requests by slightly modifying the code, such as enabling combinations of any two transformations or adjusting the settings of certain transformations. Moreover, by modifying the configuration, users can choose to generate multiple transformed samples per original sample, validate samples semantically, preprocess samples with certain processors, etc.
Target Model Improvement Users who want to improve the robustness of target models may struggle to pinpoint model weaknesses without supporting tools. To tackle this issue, we believe a diagnostic report revealing the influence of comprehensive aspects on model performance provides concrete suggestions for model improvement. By using TextFlint and applying transformed datasets to target models, the transformations corresponding to significant performance declines in the evaluation report provide guidance for improving target models. Moreover, TextFlint supports adversarial training of target models with large-scale transformed datasets, and the change in performance is also reported to display the performance gain due to adversarial training.
To summarize, this easy-to-use framework satisfies the needs of model robustness evaluation by providing multi-aspect transformations and supporting automatic analysis. Moreover, the transformation schemes in TextFlint are linguistically grounded and human-validated, which frees users from devising and implementing their own transformation schemes. In the next section, the linguistic basis of the transformations included in TextFlint is concisely discussed.

Linguistically based Transformations
We attempt to increase the variety of text transformations to a large extent while maintaining the acceptability of transformed texts. For this purpose, we turn to linguistics for inspiration and guidance ( Figure 3), which is to be discussed at length in the following sections with bold for universal transformations and bold italic for task-specific ones.

Morphology
With word-level transformation being the first step, morphology sheds light on our design from the very beginning. Morphology is the study of how words are formed and interrelated. It analyzes the structure of words and parts of words, e.g., stems, prefixes, and suffixes, to name a few. This section discusses the transformations with respect to different aspects of morphology.

Derivation
Morphological derivation is the process of forming a new word from an existing word, often by adding a prefix or suffix, such as ab- or -ly. For example, abnormal and normally both derive from the root word normal.
Conversion, also called "zero derivation" or "null derivation," is worth noting as well. It is a type of word formation involving the creation of a word from an existing word without any change in form, namely, derivation using only zero. For example, the noun green is derived ultimately from the adjective green. That is to say, some words, which can be derived with zero, carry several different parts of speech.
SwapPrefix Swapping the prefix of one word usually keeps its part of speech. For instance, "This is a pre-fixed string" might be transformed into "This is a trans-fixed string" or "This is an af-fixed string." The POS tags of the test sentence are supposed to remain the same, since only a single word is changed, without converting its part of speech. SwapPrefix is especially applicable to the POS tagging task in NLP.

SwapMultiPOS The phenomenon of conversion implies that some words hold multiple parts of speech. That is to say, these multi-part-of-speech words might confuse language models in terms of POS tagging. Accordingly, we replace nouns, verbs, adjectives, or adverbs with words holding multiple parts of speech; e.g., "There is an apple on the desk" is transformed into "There is an imponderable on the desk" by swapping the noun apple for imponderable, which can be a noun or an adjective. Although the transformed sentence is not as accessible as the original, anyone with even the slightest knowledge of English would be able to tell the right part of speech of imponderable that fits the context without understanding its meaning. Since SwapMultiPOS alters the semantic meaning of sentences, it is, again, only applicable to the POS tagging task.
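A toy version of SwapPrefix can be written as a lookup against a small prefix list; the list and the attachment check below are simplified assumptions rather than TextFlint's actual lexicon-backed logic.

```python
# Toy SwapPrefix: if a word starts with a known prefix, replace that
# prefix with another one. A real implementation would verify that the
# resulting word exists and keeps its part of speech.

PREFIXES = ["pre", "trans", "af", "in"]

def swap_prefix(word, new_prefix):
    for p in PREFIXES:
        if word.startswith(p) and new_prefix != p:
            return new_prefix + word[len(p):]
    return word  # no known prefix found; leave the word unchanged

print(swap_prefix("prefixed", "trans"))  # -> "transfixed"
```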

Inflection
Morphological inflection generally indicates the tense, number, or person of a word. The word love, for example, performs differently in the sentences "I love NLP," "He love-s NLP," "She love-d NLP," and "They are love-ing NLP," where love-s denotes that the subject of the verb love is third person singular and that the verb is in the present tense, while love-d denotes the simple past tense and love-ing the present progressive. Similarly, the transformation Tense changes the tense of verbs while maintaining the semantic meaning to a large extent, just as from "He is studying NLP" to "He has studied NLP." Besides, reduplication is a special type of inflection in which the root or stem of a word, or even the whole word, is repeated exactly or with a slight change. For example, quack-quack imitates the sound of a duck, fiddle-faddle suggests something of inferior quality, and zigzag suggests alternating movements. This phenomenon is more common in Chinese than in English, where most verbs with one character "A" can be reduplicated to express the same meaning in the form of "A(?)A," just as the verb "看 (look)" holds the same meaning as "看看," "看一看," "看了看," and "看了一看." Accordingly, the implemented SwapVerb is tailored especially for the task of Chinese word segmentation.
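The Tense example above ("He is studying NLP" → "He has studied NLP") can be imitated with a crude rule; the regex and inflection rule below are deliberately naive and only illustrate the idea, whereas a real implementation would use proper morphological analysis.

```python
# Toy Tense transformation: rewrite present progressive "is VERB-ing"
# as present perfect "has VERB-ed". The inflection rule is deliberately
# naive and handles only a couple of regular patterns.
import re

def progressive_to_perfect(sentence):
    def repl(m):
        stem = m.group(1)  # e.g. "study" from "studying"
        # crude past-participle rule: study -> studied, walk -> walked
        past = stem[:-1] + "ied" if stem.endswith("y") else stem + "ed"
        return "has " + past
    return re.sub(r"is (\w+)ing", repl, sentence)

print(progressive_to_perfect("He is studying NLP"))  # -> "He has studied NLP"
```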

Contraction
A contraction is a word made by shortening and combining two words, such as can't (can + not), you're (you + are), and I've (I + have), which is often leveraged in both speaking and writing. Contraction changes the form of words while leaving the semantic meaning unchanged. Likewise, the transformation Contraction replaces phrases like will not and he has with their contracted forms, namely, won't and he's. With Contraction modifying neither the syntactic structure nor the semantic meaning of the original sentence, it fits all of the tasks in NLP, be it token- or sequence-level.
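A minimal Contraction transformation is a phrase-level lookup; the table below is a small excerpt, and a fuller contraction list would be used in practice.

```python
# Minimal Contraction transformation: phrase-level replacements drawn
# from a small lookup table (an excerpt, not a complete list).

CONTRACTIONS = {
    "will not": "won't", "can not": "can't", "cannot": "can't",
    "he has": "he's", "you are": "you're", "I have": "I've",
}

def contract(sentence):
    for full, short in CONTRACTIONS.items():
        sentence = sentence.replace(full, short)
    return sentence

print(contract("He said he will not go"))  # -> "He said he won't go"
```

Because the mapping is purely surface-level, the reverse table yields the corresponding expansion transformation for free.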

Acronym
An acronym is a shorter version of an existing word or phrase, usually using individual initial letters or syllables, as in NATO (North Atlantic Treaty Organization) or App (application). From the perspective of acronyms, SwapLonger detects the acronyms in a sentence and supersedes them with their full forms, with NLP changed into Natural Language Processing and USTC into University of Science and Technology of China. Although SwapLonger might be feasible for most NLP tasks, it is especially effective in evaluating the robustness of models for named entity recognition (NER), in that it precisely modifies the named entities to be recognized. Similarly, SwapAcronym is tailored for Chinese word segmentation in the reverse way: an acronym like "中国 (China)" is turned into its full form "中华人民共和国 (People's Republic of China)" to confuse the segmentation.
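SwapLonger can be sketched as a token-level expansion against an acronym table; the table below is a hand-written stand-in for a real acronym dictionary.

```python
# Sketch of SwapLonger: expand detected acronyms to their full forms
# using a lookup table. A real implementation would draw on a large
# acronym dictionary rather than this hand-written excerpt.

ACRONYMS = {
    "NLP": "Natural Language Processing",
    "NATO": "North Atlantic Treaty Organization",
    "USTC": "University of Science and Technology of China",
}

def swap_longer(sentence):
    return " ".join(ACRONYMS.get(tok, tok) for tok in sentence.split())

print(swap_longer("I love NLP"))  # -> "I love Natural Language Processing"
```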

Word as Symbols
As a tool for communication, language is often written as symbols, as it is also regarded as a symbol system. Thus, words have a "form-meaning duality." From time to time, a typographical error happens while writing or typing, which means the form of a word is corrupted while the meaning stays. Humans often need little effort to understand words with typographical errors; however, such words can be completely disruptive for deep learning models.
To imitate this common condition in the daily use of language, SpellingError and Typos both introduce slight errors into words, implemented in different ways. The former replaces a word with a typical misspelled form (definitely → difinately), and the latter randomly inserts, deletes, swaps, or replaces a single letter within one word (Ireland → Irland). Nearly the same as Typos, EntTypos works for NER and swaps only named entities with misspelled ones (Shanghai → Shenghai). Keyboard imitates the way people type and changes tokens into mistaken ones with errors caused by keyboard use, like word → worf and ambiguous → amviguius. Besides, it is worth noting that some texts are generated from pictures by optical character recognition (OCR); we also take the related errors into consideration. With like being recognized as l1ke or cat as ca+, human readers understand these mistaken words with little effort, while it is worth inspecting how language models react to this situation.
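The four Typos operations can be sketched directly; the snippet below applies one random insert, delete, swap, or replace to a word. It is a toy version: real keyboard errors, as in the Keyboard transformation, would also be constrained by key adjacency.

```python
# Toy Typos transformation: randomly insert, delete, swap, or replace a
# single letter in a word. Seeded RNG is passed in for reproducibility.
import random

def typo(word, rng):
    i = rng.randrange(len(word))
    op = rng.choice(["insert", "delete", "swap", "replace"])
    letters = "abcdefghijklmnopqrstuvwxyz"
    if op == "insert":
        return word[:i] + rng.choice(letters) + word[i:]
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "swap" and i < len(word) - 1:
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    # "replace" (and "swap" at the last position falls back here)
    return word[:i] + rng.choice(letters) + word[i + 1:]

rng = random.Random(0)
print(typo("Ireland", rng))  # one character inserted, deleted, swapped, or replaced
```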

Paradigmatic Relation
A paradigmatic relation describes the type of semantic relation between words that can substitute for one another within the same category, which includes synonymy, hyponymy, antonymy, etc. As in the sentence "I read the ( ) you wrote two years ago," the bracket can be filled with book, novel, dictionary, or letter. The following sections discuss the specific relations leveraged in our transformations.

Synonym
A synonym is a word or phrase that means nearly the same as another word or phrase. For example, the words begin, start, commence, and initiate are all synonyms of one another. One synonym can be replaced by another in a sentence without changing its meaning. Correspondingly, SwapSyn (Syn being short for synonym) switches tokens into their synonyms according to WordNet or word embeddings. For instance, "He loves NLP" is transformed into "He likes NLP" by simply replacing loves with likes.
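A minimal SwapSyn looks up each token in a synonym table; the tiny hand-written table below stands in for WordNet or embedding-based neighbours, which the real transformation queries.

```python
# Minimal SwapSyn: token-level synonym substitution from a lookup table.
# The table is a hand-written stand-in for WordNet / embedding neighbours.

SYNONYMS = {"loves": "likes", "begin": "start", "commence": "initiate"}

def swap_syn(sentence):
    return " ".join(SYNONYMS.get(tok, tok) for tok in sentence.split())

print(swap_syn("He loves NLP"))  # -> "He likes NLP"
```

Note that a real implementation must also disambiguate word senses, since a synonym of the wrong sense would change the sentence's meaning.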

Antonym
Antonymy describes the relation between a pair of words with opposite meanings. For example, mortal : immortal, dark : light, and early : late are all pairs of antonyms. Although the meaning of a sentence is altered after one word is replaced by its antonym, the syntax of the sentence remains unchanged. As a result, SwapAnt and Add/RmvNeg are suitable for some NLP tasks, including but not limited to dependency parsing, POS tagging, and NER. The implementation of SwapAnt is similar to that of SwapSyn, while Add/RmvNeg performs differently. Transforming "John lives in Ireland" into "John doesn't live in Ireland," the overall meaning is reversed with the simple insertion of the negation doesn't, while the syntactic structure is preserved.
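A toy AddNeg for the example above can be written with a naive main-verb heuristic; real systems would rely on parsing, so the rule below is purely illustrative.

```python
# Toy AddNeg: negate the main verb, as in "John lives in Ireland" ->
# "John doesn't live in Ireland". The third-person-singular guess and
# un-inflection rule are crude stand-ins for parser-based detection.

def add_negation(sentence):
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        if tok.endswith("s") and i > 0:  # naive main-verb guess
            tokens[i] = tok[:-1]         # lives -> live
            tokens.insert(i, "doesn't")
            break
    return " ".join(tokens)

print(add_negation("John lives in Ireland"))  # -> "John doesn't live in Ireland"
```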

Incompatible
Incompatibility is the relation between two classes with no members in common. Two lexical items X and Y are incompatibles if "A is f (X)" entails "A is not f (Y)": "I'm from Shanghai" entails "I'm not from Beijing," where Shanghai and Beijing are a pair of incompatibles. Incompatibility fills the gap where neither synonymy nor antonymy is applicable. The transformation SwapNum, short for swap number, shifts numbers into different ones, such as "Tom has 3 sisters" into "Tom has 2 sisters." SwapName is designed specifically for Chinese word segmentation: rather than substituting people's names with random alternatives, it deliberately chooses names whose first character can form a phrase with the preceding character. For example, "我朝小明挥了挥手 (I waved at Xiao Ming)" might be changed into "我朝向明挥了挥手 (I waved at Xiang Ming)," where the first character "向" of the substituted name forms a phrase with the character "朝," resulting in "朝向 (towards)." Though the semantic meaning is slightly changed by swapping the mentioned name, the result of segmentation is supposed to remain the same.
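SwapNum reduces to rewriting every number in the text; the sketch below shifts each number by one, which is just one of many possible replacement policies.

```python
# SwapNum sketch: shift every number in the text to a different one,
# e.g. "Tom has 3 sisters" -> "Tom has 2 sisters". Subtracting one is
# an arbitrary policy; any distinct number would serve the same purpose.
import re

def swap_num(sentence, delta=-1):
    return re.sub(r"\d+", lambda m: str(max(0, int(m.group()) + delta)), sentence)

print(swap_num("Tom has 3 sisters"))  # -> "Tom has 2 sisters"
```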

Syntax
The rules of syntax combine words into phrases and phrases into sentences, which also specify the correct word order for a language, grammatical relations of a sentence as well as other constraints that sentences must adhere to. In other words, syntax governs the structure of sentences from various aspects.

Syntactic Category
A family of expressions that can substitute for one another without loss of grammaticality is called a syntactic category, including both lexical category, namely the part of speech, and phrasal category. To illustrate, in "I love NLP" and "I love CV," NLP and CV belong to the same lexical category of noun (N); for "He is running in the park" and "He is running on the roof," in the park and on the roof are of the same phrasal category, namely, prepositional phrase (PP), which means these two phrases are interchangeable without altering the structure of the whole sentence.
Recognizing this special component of a sentence makes room for various transformations at this level. SwapNamedEnt, SwapSpecialEnt, and SwapWord/Ent do what their names imply: the named entities in a sentence are swapped with others of the same type, so the syntactic structure and the named entity tags remain constant. Similarly, OOV and CrossCategory are specific to the NER task, where the substitutions are out of vocabulary or come from a different category. For instance, "I love NLP" can be transformed into either "I love NlP" (OOV) or "I love Shanghai" (CrossCategory). DoubleDenial, tailored for sentiment analysis (SA), preserves both the syntactic and semantic attributes of the original sentence, as in "I love NLP" and "I don't hate NLP." For aspect-based sentiment analysis (ABSA), RevTgt (short for reverse target) and RevNon (short for reverse non-target) generate sentences that reverse the original sentiment of the target aspect and the non-target aspects, respectively. For the sentence "Tasty burgers, and crispy fries," with the target aspect being burgers, it might be transformed into "Terrible burgers, but crispy fries" by RevTgt or "Tasty burgers, but soggy fries" by RevNon.
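The DoubleDenial idea can be sketched in a few lines (the tiny antonym lexicon here is a hypothetical stand-in; a real implementation would draw on a resource such as WordNet): a polar word is replaced by the negated form of its antonym, preserving both syntax and sentiment.

```python
# Toy DoubleDenial for sentiment analysis: replace a polar word with the
# negation of its antonym. The lexicon below is an illustrative stand-in.
ANTONYM = {"love": "hate", "hate": "love", "good": "bad", "bad": "good"}

def double_denial(sentence: str) -> str:
    out = []
    for w in sentence.split():
        if w in ANTONYM:
            out.append(f"don't {ANTONYM[w]}")  # e.g., love -> don't hate
        else:
            out.append(w)
    return " ".join(out)
```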
Similarly, MLMSuggestion (MLM short for masked language model) generates new sentences in which one syntactic category element of the original sentence is replaced by what a masked language model predicts. With the original sentence "This is a good history lesson" masked into "This is a good ()," MLMSuggestion generates predictions such as story, data set, or for you, of which the first two are retained to augment the test data because they belong to the same syntactic category as history lesson.
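A sketch of this procedure, with the masked language model and the syntactic-category check abstracted as callables (both hypothetical; in practice one could plug in, e.g., a Hugging Face fill-mask pipeline and a POS tagger):

```python
from typing import Callable, List

# Toy MLMSuggestion: mask the target span, ask an MLM for candidates,
# and keep only candidates of the same syntactic category.
def mlm_suggestion(sentence: str, target: str,
                   predict: Callable[[str], List[str]],
                   same_category: Callable[[str, str], bool]) -> List[str]:
    masked = sentence.replace(target, "[MASK]")
    return [sentence.replace(target, p)
            for p in predict(masked) if same_category(target, p)]
```

With a fake predictor returning the paper's example candidates, only the noun-phrase candidates survive the filter.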
Besides replacing a single syntactic category element with a brand new one, two existing elements within one sentence can be exchanged with each other. SwapTriplePos (Pos short for position) exchanges the positions of the two entities in a triple, which works only for the relation extraction (RE) task. For example, the sentence "Heig, born in Shanghai, graduated from Fudan University," where the subject and object are Heig and Shanghai and the relation is birth, can be transformed into "Born in Shanghai, Heig graduated from Fudan University."

Adjunct
An adjunct is a structurally dispensable part of a sentence: if it is removed, the remainder of the sentence is structurally unaffected. In the sentence "I love NLP from bottom of my heart," the phrase from bottom of my heart is an adjunct, namely a modifier of the verb love. Since adjuncts are structurally dispensable, it does not matter if an adjunct is removed or appended. As adverbs are typical adjuncts that can be inserted before the verb of a sentence without disturbing its structure or semantics, InsertAdv is applicable to most NLP tasks. Furthermore, Delete/AddSubTree and InsertClause change sentences by appending or removing adjuncts from the perspective of dependency parsing (DP), as when the subtree/clause "who was born in China" is inserted into the original sentence "He loves NLP," yielding "He, who was born in China, loves NLP."
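A minimal sketch of InsertAdv (illustrative only; locating the verb would normally require a POS tagger, so here the verb is passed in explicitly):

```python
# Toy InsertAdv: insert an adverb immediately before a given verb.
def insert_adv(sentence: str, verb: str, adverb: str) -> str:
    words = sentence.split()
    idx = words.index(verb)  # assumes the verb occurs once as a token
    return " ".join(words[:idx] + [adverb] + words[idx:])
```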

Pragmatics
Pragmatics concerns how context affects meaning, e.g., how the sentence "It's cold in here" comes to be interpreted as "close the windows" in certain situations. Pragmatics explains how we are able to overcome ambiguity, since meaning relies on the manner, place, time, etc. of an utterance.

Maxims of Conversation
The maxims of conversation were first discussed by the British philosopher H. Paul Grice and are sometimes called the "Gricean maxims." These maxims describe how people achieve effective conversational communication in common social situations; they comprise the maxims of quantity, quality, relevance, and manner. The maxim of quantity requires saying neither more nor less than the discourse requires. The maxim of quality suggests not telling lies or making unsupported claims. The maxim of relevance, as the name implies, requires speakers to say what is relevant to the conversation. Last but not least, the maxim of manner values being perspicuous, namely, avoiding obscurity or ambiguity of expression and being brief and orderly.
Grice did not assume that all people should constantly follow these maxims. Instead, he found it interesting when they were not respected, which is further illustrated in the following sections. For example, when Mary says, "I don't think the model proposed by this paper is robust," her listener, Peter, might reply, "It's a lovely day, isn't it?" Peter is flouting the maxim of relevance because the author of the paper under discussion is standing right behind Mary, who has not noticed. From time to time, people flout these maxims either on purpose or by chance, while listeners are still able to figure out the meaning the speaker is trying to convey. Inspired by this, we design transformations to simulate situations in which the maxims of conversation are flouted.
AddSum (Sum short for summary) and RndRepeat/Delete (Rnd short for random) imitate flouting the maxim of quantity by providing more or less information than needed. AddSum works for sentiment analysis by adding the summary of the mentioned movie or person, enriching the background information of the sentence even though it is unnecessary. RndRepeat/Delete proves effective for the paragraph-level task of coreference resolution, where the number of sentences makes room for repetition and deletion, providing an inappropriate amount of information for the purpose of communication.
For the same reason, RndShuffle is also made possible by coreference resolution; it goes against the maxim of manner by randomly shuffling sentences within one paragraph, messing up the logical chain of the utterance and causing confusion. Another transformation that reflects flouting of the maxim of manner is Add/RmvPunc (short for add/remove punctuation), namely, adding extra punctuation or removing necessary punctuation to disturb target models.
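RndShuffle can be sketched directly (an illustrative helper, not the toolkit's API): the paragraph's sentences are reordered while each sentence itself is left intact.

```python
import random

# Toy RndShuffle: randomly reorder the sentences of a paragraph,
# breaking its logical chain while keeping every sentence unchanged.
def rnd_shuffle(sentences: list, rng: random.Random) -> list:
    shuffled = sentences[:]  # copy so the input list is untouched
    rng.shuffle(shuffled)
    return shuffled
```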
Considering the flouting of the maxim of quality, AddSentDiverse (Sent short for sentence) and PerturbAnswer/Question for machine reading comprehension (MRC) tasks disturb either the texts based on which questions are to be answered or the formulation of the questions.
Last but not least, the maxim of relevance is also often flouted by language users. Analogously, we inspect language models' performance upon the springing of irrelevant information with AppendIrr (Irr short for irrelevant), TwitterType, RndInsert, and ConcatSent. The first two transformations adapt to most NLP tasks in that they change texts without altering the semantic meaning or the original structure of sentences. Specifically, TwitterType changes plan texts into the style of Twitter posts, such as turning "I love NLP" into "@Smith I love NLP. https://github.com/textflint." RndInsert and ConcatSent work for coreference resolution and NER respectively.
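A toy TwitterType might look as follows (the handle list and appended URL are illustrative placeholders, not the toolkit's real resources): an @-mention is prepended and a URL appended, leaving the sentence's meaning and structure untouched.

```python
import random

# Toy TwitterType: dress a plain sentence up as a Twitter post.
def twitter_type(sentence: str, rng: random.Random) -> str:
    handle = rng.choice(["@Smith", "@Jones", "@Lee"])  # placeholder handles
    return f"{handle} {sentence}. https://github.com/textflint."
```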

Language and Prejudice
Words of a language reflect individual or societal values (Victoria Fromkin, 2010), as seen in phrases like "masculine charm" and "womanish tears." Until recently, most people subconsciously assumed a professor to be a man and a nurse to be a woman. Users of a language might also associate countries or regions with certain values. Clearly, language reflects social bias toward gender and many other aspects, along with social attitudes, positive or negative.
To inspect how language models take on the prejudice that resides in human language, Prejudice offers the exchange of mentions of either people or regions: mentions of one gender can be swapped with another, or mentions of one region with another. For example, "Mary loves NLP and so does Ann" can be changed to "Tom loves NLP and so does Jack" with a simple setting of "male," which replaces all mentions of female names with male names. The settings for region work likewise.
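The "male" setting can be sketched as a simple name mapping (the name lists below are tiny illustrative stand-ins for the larger gazetteers a real implementation would use):

```python
# Toy Prejudice transformation with setting "male": every mention of a
# female name is replaced by a male name. Lists are illustrative only.
FEMALE_TO_MALE = {"Mary": "Tom", "Ann": "Jack", "Alice": "Bob"}

def prejudice_male(sentence: str) -> str:
    return " ".join(FEMALE_TO_MALE.get(w, w) for w in sentence.split())
```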

Model Related
Besides how humans use language, we also take into consideration how deep learning models actually process language to design transformations accordingly. We examine the general patterns of language models and end up with the following transformations.
BackTrans (Trans short for translation) replaces test data with paraphrases obtained by back translation, which can reveal whether the target models merely capture literal features instead of semantic meaning. ModifyPos (Pos short for position), which works only for MRC, examines how sensitive the target model is to the positional features of sentences by changing the relative position of the gold span in a passage. For the task of natural language inference (NLI), the overlap between a premise and its hypothesis is an easily captured yet unintended feature. To tackle this, Overlap generates premise-hypothesis pairs with large word overlap but a slight difference in the hypothesis, as in the premise "The judges heard the actors resigned" and the hypothesis "The judges heard the actors." The aforementioned three transformations focus on examining the features learned by target models, whereas Subpopulation tackles the distribution of a dataset by singling out a subpopulation of the whole test set according to certain rules. More specifically, LengthSubpopulation retrieves a subpopulation by the length of each text, and LMSubpopulation (LM short for language model) by the performance of a language model on the test data; for both, the top 20% and bottom 20% are available as options.
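The LengthSubpopulation idea reduces to sorting and slicing; the sketch below (hypothetical helper, not the toolkit's API) returns the shortest or longest 20% of a test set:

```python
# Toy LengthSubpopulation: slice out the shortest ("bottom") or longest
# ("top") fraction of a test set by text length.
def length_subpopulation(texts: list, which: str = "bottom", ratio: float = 0.2) -> list:
    ranked = sorted(texts, key=len)
    k = max(1, int(len(ranked) * ratio))
    return ranked[:k] if which == "bottom" else ranked[-k:]
```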

Human Evaluation
Only when the transformed text conforms to the way humans use language can the evaluation process obtain a credible robustness result. To verify the quality of the transformed text, we conducted human evaluation on the original and transformed texts under all of the above-mentioned transformations. Specifically, we consider two metrics in human evaluation, i.e., plausibility and grammaticality.
• Plausibility (Lambert et al., 2010) measures whether the text is reasonable and reads as if written by a native speaker. Sentences or documents that are natural, appropriate, logically correct, and meaningful in context receive a higher plausibility score. Texts that are logically or semantically inconsistent or contain inappropriate vocabulary receive a lower plausibility score.
• Grammaticality (Newmeyer, 1983) measures whether the text contains syntax errors. It refers to the conformity of the text to the rules defined by the specific grammar of a language.
For human evaluation, we used the text generated by both the universal and task-specific transformations and compared it with the original text from all twelve NLP tasks. We randomly sampled 100 pairs of original and transformed texts for each transformation in each task, for a total of about 50,000 texts. We invited three native speakers from Amazon Mechanical Turk to evaluate the plausibility and grammaticality of these texts. For each metric, we asked the annotators to rate the texts on a 1 to 5 scale (5 being the best). Due to space limitations, we select tasks covering four common NLP problems: text classification, sequence labeling, semantic matching, and semantic understanding. The human evaluation results of these tasks are shown in Table 1 and Table 2, and the remaining results are available at http://textflint.io.
We have the following observations: 1. The human evaluation score is consistent and reasonable on the original text of each task, which proves the stability and effectiveness of our human evaluation metrics. From Table 1, we can see that the human evaluation scores of the original text are consistent within each task. For the grammaticality metric, the scores for all four tasks are around 3.7. One possible explanation is that the source datasets of these original texts are well organized and have no obvious grammatical errors. For the plausibility metric, ABSA scores the highest, ranging from 4 to 4.1, while MRC scores the lowest, ranging from 3.3 to 3.5. ABSA data are restaurant reviews, and a single topic leads to clear logic. MRC data are long paragraphs on various topics with a large number of proper nouns and much domain-specific knowledge, making it more difficult to judge the rationality of these texts.
2. The transformed text generated by the universal transformations can be accepted by humans. As shown in Table 1, different transformation methods change the original text to different degrees and result in different human evaluation scores. Some transformations (e.g., WordCase, AddPunc) change the case of the text or add/delete punctuation. These transformations do not change the semantics of the text or affect its readability, so their human evaluation scores do not change much. Some transformations (e.g., SwapSyn, SwapAnt) replace several words in the original text with their synonyms or antonyms. These transformations are well developed and widely used, and they slightly lower the evaluation scores. Some transformations (e.g., Ocr, SpellingError, and Tense) replace words in the text with wrong words or change the tense of verbs. These transformations actively add wrong information to the original text and cause the human evaluation score to decrease. On the whole, the transformed text achieves competitive human evaluation performance compared with the original text in each task. This verifies that, when the text has pattern changes, minor spelling errors, and redundant noisy information, the transformed texts are still fluent and readable and therefore acceptable to humans.
3. The transformed text generated by the task-specific transformations still conforms to human language habits, although task-specific transformations change the original text more than universal transformations do. As shown in Table 2, we believe this is because these transformations are specific to each task and have a strong attack effect on the original text, which leads to larger changes. The ConcatSent transformation in the NER task concatenates multiple original texts into one text. The transformed text has no grammar errors, but the logic between different sentences is inconsistent. As a result, its Plausibility drops from 4.14 to 3.54 while Grammaticality remains the same. In the SA task, the movie and person vocab lists contain common phrases, such as "go home", so these transformations may introduce grammar errors, resulting in varying degrees of Grammaticality decline. However, replacing movie and person names has little effect on the rationality of the sentence, and Plausibility remains unchanged. The evaluation performance of these transformations is still stable and acceptable. This proves, again, that the transformed texts conform to human language, and the robustness evaluation results with these transformed texts are also persuasive.

Aspect-Based Sentiment Analysis (ABSA) is a typical text classification task that aims to identify fine-grained sentiment polarity toward a specific aspect associated with a given target. In this work, we conduct experiments on the SemEval 2014 Laptop and Restaurant Reviews (Pontiki et al., 2014), one of the most popular ABSA datasets, to test the robustness of different lines of systems, including SOTA neural architectures. We follow  to remove instances with conflicting polarity and use the same train-dev split strategy. In the experiments, we adopt Accuracy and Macro-F1 as the metrics to evaluate system performance, which are widely used in previous works (Fan et al., 2018; Xing et al., 2020).
AddDiff causes drastic performance degradation among non-BERT models, indicating that these models lack the ability to distinguish relevant aspects from non-target aspects.
Named Entity Recognition (NER) is a fundamental NLP task that involves determining entity boundaries and recognizing the categories of named entities, and it is often formalized as a sequence labeling task. To perform robustness evaluation, we choose three widely used NER datasets, including CoNLL (Weischedel et al., 2012). We test 10 models under five different transformation settings using the F1 score metric. The changes in model performance are listed in Table 4, where we can observe that model performance is not noticeably influenced by ConcatSent, which indicates that general transformations such as simple concatenation may have difficulty finding core defects for specific tasks. On the other hand, task-specific transformations, e.g., CrossCategory and SwapLonger, induce a significant performance drop in all tested systems. This indicates that most existing NER models are ill-equipped to deal with the inherent challenges of NER, such as combinatorial ambiguity and OOV entities.
Machine Reading Comprehension (MRC) aims to comprehend the context of given articles and answer questions based on them. Various types of MRC datasets exist, such as cloze-style reading comprehension (Hermann et al., 2015) and span-extraction reading comprehension (Rajpurkar et al., 2016). In this work, we focus on the span-extraction scenario and choose two typical MRC datasets, namely, SQuAD 1.0 (Rajpurkar et al., 2016) and SQuAD 2.0 (Rajpurkar et al., 2018). Since the official test sets are not publicly released, we use the development sets to produce transformed samples. Following previous works (Seo et al., 2016; Chen et al., 2017), we adopt Exact Match (EM) and F1 score as our evaluation metrics. Table 5 presents the results of different systems on the original and enriched development set of the SQuAD 1.0 dataset.
From the table, we find that ModifyPos hardly hurts model performance, which indicates that span-extraction models are insensitive to answer positions. Meanwhile, modification of the text contents, e.g., PerturbAnswer, brings drastic performance degradation to all systems. This reflects that models may overfit to dataset-specific features and fail to identify answer spans that are perturbed into unseen patterns even when their meanings are unchanged.
Natural Language Inference (NLI), also known as recognizing textual entailment (RTE), is the task of determining whether a natural language hypothesis can be justifiably inferred from a given premise. As a benchmark task for natural language understanding, NLI has been widely studied; further, many neural sentence encoders, especially pretrained models, have been shown to consistently achieve high accuracy. To check whether it is semantic understanding or pattern matching that leads to good model performance, we conduct experiments analyzing the current mainstream pretrained sentence encoders. Table 6 lists the accuracy of the eight models on the MultiNLI (Williams et al., 2018) dataset. From Table 6, we can observe that (1) NumWord, on average, induces the greatest performance drop, as it requires the model to perform numerical reasoning for correct semantic inference. (2) SwapAnt makes the average performance of the models drop by up to 23.33%, indicating that the models cannot handle the semantic contradiction expressed by antonyms (rather than explicit negation) between premise-hypothesis pairs. (3) AddSent also makes model performance drop significantly, indicating that the models' ability to filter out irrelevant information needs to be improved. (4) Our transformation strategy, especially Overlap, generates many premise-hypothesis pairs with large word overlap but different semantics, which successfully confuse all the systems. (5) Improved pretrained models (e.g., XLNet) perform better than the original BERT model, which reflects that adequate pretraining corpora and suitable pretraining strategies help to improve the generalization performance of the models.
Chinese Word Segmentation (CWS), the first step in many Chinese information processing systems, aims to segment Chinese sentences into word lists. Ambiguity and out-of-vocabulary (OOV) words are the two main obstacles in this task. We conduct experiments to analyze the robustness of word segmentation models in the face of difficult examples such as ambiguous and OOV words. Table 7 shows the F1 scores of eight different CWS models on the CTB6 dataset (Xia, 2000). All the CWS models clearly achieve a high F1 score. However, SwapName generates words with intersection ambiguity by modifying the last names in the text, which reduces the model score by an average of 3.16%. SwapNum and SwapContraction generate long quantifiers and proper nouns, which drop the average F1 score of the models by up to 4.88% and 5.83%, respectively. These long words may contain combinatorial ambiguity and OOV words. SwapSyn generates synonyms for words in the original text, which may also introduce OOV words and cause model performance degradation.

Variations of Universal Transformation
In this section, we explore the influence of universal transformations (UT) on different NLP tasks. The UT strategies cover various scenarios and are applicable to multiple languages and tasks, aiming to evaluate model robustness via linguistically based attacks, model bias, and dataset subpopulations. In addition, we carry out experiments testing models under different UT combinations and task-specific transformations, thereby analyzing the correlation and synergy of these strategies.

Multi-Granularity Transformation
The UT strategies are linguistically guided and categorized into three levels: character, word, and sentence. To demonstrate the influence of multi-granularity transformations on different tasks and models, we report evaluation results under the same UT strategy and compare the original model performances with those measured on transformed samples. We design 3 character-level, 12 word-level, and 4 sentence-level UTs to explore the influence of linguistically based text transformation. Figure 4 shows the results of multi-granularity transformed texts for several typical NLP tasks, evaluated with the Accuracy metric. We demonstrate the results of Typos, SwapNamedEnt, and WordCase as exemplar UTs at the three levels. For Typos, we test models from the NER and ABSA tasks. The results show that slight changes to characters, e.g., replacement, deletion, and insertion, can drastically reduce performance. This outcome reflects that most NLP systems are sensitive to fine-grained perturbations and vulnerable to small, precise attacks, since typos may become OOV words that render the entire sentence unrecognizable. In terms of SwapNamedEnt, where entity words are replaced by other entities of the same category without changing the part-of-speech (POS) tags or sentence structure, system performances are negatively affected on the NLI and Semantic Matching (SM) tasks. Unlike Typos, entity replacement does not always create OOV words, as the new entity might also appear in the training data. However, NLP systems tend to learn underlying patterns and strengthened co-occurrence relationships between words rather than logic and facts, which are difficult to apply to other rare samples.

[Figure 5: Results of gender bias transformations. We replace human names with female names and perform robustness evaluation on the NLI and Coref tasks.]
For WordCase, we convert all characters into uppercase letters without changing sentence structures. On the POS tagging and SM tasks, almost all evaluated systems show a significant performance drop, indicating that they cannot deal with cased texts. This issue is hard to ignore because case carries important information about emphasis, emotion, and special entities.
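The character-level and sentence-level granularities can be contrasted with two toy helpers (illustrative sketches, not the toolkit's implementations): a Typos variant that swaps two adjacent characters, and WordCase, which uppercases the whole sentence.

```python
import random

# Toy character-level Typos: swap two adjacent characters of one word.
def typo_swap(word: str, rng: random.Random) -> str:
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

# Toy sentence-level WordCase: uppercase every character.
def word_case(sentence: str) -> str:
    return sentence.upper()
```

The typo variant may turn an in-vocabulary word into an OOV token, while WordCase leaves tokenization intact but changes every word's surface form, matching the two failure modes discussed above.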

Gender and Location Bias
Gender and location bias is the preference or prejudice toward a certain gender or location (Moss-Racusin et al., 2012), exhibited in multiple parts of an NLP system, including datasets, resources, and algorithms (Sun et al., 2019). Systems with gender or location biases often learn and amplify negative associations about protected groups and sensitive areas in the training data. In addition, they produce biased predictions that uncontrollably affect downstream applications. To analyze the underlying effects of such biases in NLP systems, we design and carry out a group of universal transformations based on gender and location bias to evaluate robustness on a wide range of tasks and models. We devise an entity-based transformation for bias robustness evaluation, which detects entities of human names and locations in texts and replaces them with human names of a given gender or with locations in a specific region. Figure 5 compares the results of different systems on the biased texts. We present results of gender bias on the NLI and Coreference Resolution (Coref.) tasks, two representative tasks for semantic understanding. From Figure 5, we observe that after replacing the original names with female names, all systems in NLI and Coref. suffer a serious performance drop (approximately 10% in accuracy). A possible explanation is that female names are rare or absent in the training set, and the models fail to recognize unfamiliar names and thus make accurate predictions. Accordingly, we assume that if training resources exhibit gender preference or prejudice, extra negative associations between names and labels lead to a worse situation, especially in applications that focus on social connections.

Subpopulations
With the increase in computational power and the complexity of deep learning models, the size of data used for model training keeps growing. The performance of a complex model varies across different subpopulations of a large, diverse dataset. In other words, good performance on one subpopulation does not imply good performance on another (Buolamwini and Gebru, 2018), which is of particular interest to the financial and medical communities (Jagielski et al., 2020). We design a group of subpopulation transformations to evaluate the underlying effects on diverse subpopulations, and we perform a robustness evaluation on different NLP tasks with their representative models. We take natural language inference as an example.  (2) The semantic implication between long sentences is more difficult to deal with, as it requires the model to encode more complex context semantics. (3) Compared with questions, the models deal with negation better. (4) Surprisingly, the NLI models process text pairs containing pronouns for women better than those for men.

Combination of Transformations
To identify model shortcomings and help practitioners revise their models in a real-world development cycle, we carry out a comprehensive robustness analysis regarding the evaluation of model failure. As described in Section 4.1, TextFlint offers 60 task-specific transformations to find core defects of different tasks, along with 20 universal transformations and thousands of combinations for probing generalization capabilities and for customized analysis, respectively. Based on the different transformations and their combinations, large-scale evaluations are carried out on 12 NLP tasks using the corresponding mainstream models. We demonstrate two classic tasks, Named Entity Recognition and Natural Language Inference, as case studies. For each task, we present the results of different models under one task-specific transformation, one universal transformation, and their combination. The experimental results are displayed in Table 9. For the Named Entity Recognition task, we use OOV and SpellingError as the task-specific and universal transformation, respectively. We observe some important phenomena when comparing the performance degradation caused by OOV, SpellingError, and their combination. Although OOV and SpellingError each drop model performance significantly, the F1 score of each model is reduced more by OOV + SpellingError than by either transformation alone. Taking the TENER model as an example, the performance drop under the combined transformation is 45.56, higher than the sum of the drops under the two separate transformations (20.18 and 11.03, respectively).
For the NLI task, we use NumWord and SwapSyn as the task-specific and universal transformation, respectively. Similarly, we observe that NumWord, SwapSyn, and their combination drop the model accuracy by an average of 35.41, 6.49, and 37.73, respectively. These outcomes indicate that the combination of several different transformation strategies makes a more challenging probe, essential for comprehensively detecting model defects.
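Combining transformations amounts to function composition; the sketch below (an assumed helper, not TextFlint's API) chains a task-specific and a universal transformation so that transformed text carries both perturbations at once.

```python
from typing import Callable

# Sketch of combining transformations: apply each transformation in
# sequence, so the combined probe is strictly harder than either alone.
def combine(*transforms: Callable[[str], str]) -> Callable[[str], str]:
    def combined(text: str) -> str:
        for t in transforms:
            text = t(text)
        return text
    return combined
```

For example, a synonym swap followed by a spelling perturbation applies both changes to every input.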

Analysis of Different Model Frameworks
We adopt TextFlint to evaluate hundreds of models across 12 tasks, covering many model frameworks and learning schemas, ranging from traditional feature-based machine learning approaches to state-of-the-art neural networks. All evaluated models and their implementations are publicly available. Table 10 provides the evaluation results of the three commercial APIs. From the perspective of the transformation method, CrossCategory on average induces the largest performance drop. In addition, OOV and SwapLonger cause significant performance drops, indicating that the commercial APIs' ability to identify OOV and ambiguous entities must be further improved. In contrast, EntTypos has relatively little influence on the results, showing that these APIs are robust to slight spelling errors.
In comparison, the Google API and Amazon API have low performance on both the original and transformed data. To identify the cause, we randomly selected 100 data samples that were not processed correctly and manually analyzed the errors. We find that many named entities in the CoNLL2003 dataset are nested inside other named entities. Such nested named entities are recognized by the Google API but are not labeled in the CoNLL2003 dataset; this inconsistent labeling approach gives the Google API a lower score. In addition, we find that the Amazon API has high accuracy in Person recognition but confuses Location with Organization, which is the reason for its low score.

Patching up with Augmented Data
After users feed the target model into TextFlint and customize their needs, TextFlint produces comprehensive transformed data for diagnosing the robustness of the target model. Through diagnosis on dozens of transformed datasets, the robustness evaluation results describe model performance at the lexical, syntactic, and semantic levels. TextFlint conveys these evaluation results to users through visualization and tabulation reports, helping users understand the shortcomings of the target model and design potential improvements. Moreover, TextFlint can generate massive amounts of augmented data to address the defects of the target model. TextFlint thus contributes to the entire development cycle, from evaluation to enhancement.
In Section 4.1, we tested 10 different models under task-specific transformations on ABSA and observed significant performance degradation. To address these models' inability to distinguish relevant aspects from non-target aspects, TextFlint generated three types of transformed data for adversarial training. We show the performance of the models before/after adversarial training (Trans. → Adv.) on three task-specific transformations of the Restaurant dataset in Table 11. Compared with training only on the original dataset, adversarial training significantly improves performance on the three task-specific transformations. The high-quality augmented data generated by TextFlint can effectively address the shortcomings of the target model, and all of this can be easily implemented in TextFlint.
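The augmentation loop itself is simple and can be sketched as follows (the helper name and list-based data representation are illustrative assumptions): the original training set is extended with one transformed copy per transformation, and the enlarged set is then used for adversarial training.

```python
from typing import Callable, List

# Sketch of data augmentation for adversarial training: extend the
# training set with transformed copies of each example.
def augment(train_set: List[str],
            transforms: List[Callable[[str], str]]) -> List[str]:
    augmented = list(train_set)  # keep the original examples
    for t in transforms:
        augmented.extend(t(x) for x in train_set)
    return augmented
```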

Related Tools and Work
Our work is related to many existing open-source tools and works in different areas.
Robustness Evaluation Many tools include evaluation methods for robustness. NLPAug (Ma, 2019) is an open-source library focusing on data augmentation in NLP, which includes several transformation methods that also help to evaluate robustness. Errudite supports subpopulations for error analysis. AllenNLP Interpret (Wallace et al., 2019) includes attack methods for model interpretation. Checklist (Ribeiro et al., 2020) also offers perturbations for model evaluation. These tools are applicable only to small parts of robustness evaluation, while TextFlint supports comprehensive evaluation methods such as subpopulations, adversarial attacks, and transformations. There also exist several tools concerning robustness that are similar to our work (Morris et al., 2020; Zeng et al., 2020; Goel et al., 2021), which likewise include a wide range of evaluation methods. Our work differs from these in several respects. First, these tools focus only on general generalization evaluations and lack task-specific evaluation designs for detecting the defects of specific tasks, while TextFlint supports both general and task-specific evaluations. Second, these tools lack quality evaluation of the generated texts or only support automatic quality constraints (Morris et al., 2020; Zeng et al., 2020), while TextFlint has ensured the acceptability of each transformation method with human evaluations. Additionally, these tools provide limited analysis of robustness evaluation results, while TextFlint provides a standard report that can be displayed with visualization and tabulation.