TAPE: Assessing Few-shot Russian Language Understanding

Recent advances in zero-shot and few-shot learning have shown promise for a range of research and practical applications. However, this fast-growing area lacks standardized evaluation suites for non-English languages, hindering progress outside the Anglo-centric paradigm. To address this gap, we propose TAPE (Text Attack and Perturbation Evaluation), a novel benchmark that includes six more complex NLU tasks for Russian, covering multi-hop reasoning, ethical concepts, logic, and commonsense knowledge. TAPE's design focuses on systematic zero-shot and few-shot NLU evaluation: (i) linguistic-oriented adversarial attacks and perturbations for analyzing robustness, and (ii) subpopulations for nuanced interpretation. A detailed analysis of the autoregressive baselines indicates that simple spelling-based perturbations affect performance the most, while paraphrasing the input has a negligible effect. At the same time, the results demonstrate a significant gap between the neural and human baselines for most tasks. We publicly release TAPE (tape-benchmark.com) to foster research on robust LMs that can generalize to new tasks when little to no supervision is available.


Introduction
The ability to acquire new concepts from a few examples is central to human intelligence (Tenenbaum et al., 2011). Recent advances in the NLP field have fostered the development of language models (LMs; Radford et al., 2019; Brown et al., 2020) that exhibit such generalization capacity under a wide range of few-shot learning and prompting methods (Liu et al., 2021; Beltagy et al., 2022). The community has addressed various aspects of few-shot learning, such as efficient model application (Schick and Schütze, 2021), adaptation to unseen tasks and domains (Bansal et al., 2020a,b), and cross-lingual generalization (Winata et al., 2021; Lin et al., 2021).
The latest research has raised an essential question of standardized evaluation protocols to assess few-shot generalization from multiple perspectives. The novel toolkits and benchmarks mainly focus on systematic evaluation design (Bragg et al., 2021; Zheng et al., 2022), cross-task generalization (Ye et al., 2021; Wang et al., 2022), and real-world scenarios (Alex et al., 2021). However, this rapidly developing area fails to provide similar evaluation suites for non-English languages, hindering progress outside the Anglo-centric paradigm.
Motivation and Contributions. In this paper, we introduce TAPE, a novel benchmark for few-shot Russian language understanding evaluation. Our objective is to provide a reliable tool and methodology for nuanced assessment of zero-shot and few-shot methods for Russian. The objective is achieved through two main contributions.
Contribution 1. Our first contribution is to create six more complex question answering (QA), Winograd schema, and ethics tasks for Russian. The tasks require understanding many aspects of language, multi-hop reasoning, logic, and commonsense knowledge. The motivation behind this is that there are systems that match or outperform human baselines on most of the existing QA tasks for Russian, e.g., the ones from Russian SuperGLUE (Shavrina et al., 2020): DaNetQA (Glushkova et al., 2020), MuSeRC, and RuCoS (Fenogenova et al., 2020). To the best of our knowledge, datasets on ethical concepts have not yet been created in Russian. To bridge this gap, we propose one of the first Russian datasets for estimating the ability of LMs to predict human ethical judgments about various text situations.
Contribution 2. Our second contribution is to develop a framework for multifaceted zero-shot and few-shot NLU evaluation. The design includes (i) linguistic-oriented adversarial attacks and perturbations for testing robustness, and (ii) subpopulations for nuanced performance analysis.
Here, we follow the methodological principles and recommendations of Bowman and Dahl (2021) and Bragg et al. (2021), which motivate the need for systematic benchmark design and adversarially constructed test sets.
Findings. Our findings are fivefold: (i) zero-shot evaluation may outperform few-shot evaluation, meaning that the autoregressive baselines fail to utilize demonstrations; (ii) few-shot results may be unstable and sensitive to prompt changes; (iii) as a negative result, zero-shot and few-shot generation for open-domain and span selection QA tasks leads to near-zero performance; (iv) the baselines are most vulnerable to spelling-based and emoji-based adversarial perturbations; and (v) human annotators significantly outperform the neural baselines, indicating that there is still room for developing robust and generalizable systems.

Related Work
Benchmark Critique. Benchmarks such as GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019) have become de facto standard tools to measure progress in NLP. However, recent studies have criticized the canonical benchmarking approaches. Bender et al. (2021) warn that performance gains are achieved at the cost of a carbon footprint. Elangovan et al. (2021) claim that the current benchmarks evaluate the LM's ability to memorize rather than generalize because of the significant overlap between the train and test datasets. Church and Kordoni (2022) argue that benchmarks focus on relatively easy tasks instead of creating long-term challenges. Raji et al. (2021) raise concerns about the resource-intensive task design. In particular, benchmarks ship with large-scale training datasets, which are expensive to create. This may lead to benchmark stagnation, as new tasks cannot be added easily (Barbosa-Silva et al., 2022). In turn, few-shot benchmarking offers a promising avenue for LM evaluation in terms of generalization capacity and computational and resource costs.
Few-shot Benchmarking. Research in few-shot benchmarking has evolved in several directions. Schick and Schütze (2021) create FewGLUE by sampling small fixed-sized training datasets from SuperGLUE; variance w.r.t. the training dataset size and sampling strategy is not reported. Later works overcome these issues by exploring evaluation strategies such as K-fold cross-validation (Perez et al., 2021), bagging, and multi-splits, introduced in FewNLU (Zheng et al., 2022). Additionally, FewNLU explores correlations between performance on development and test sets and stability w.r.t. the number of runs. CrossFit (Ye et al., 2021) studies cross-task generalization by unifying task formats and splitting tasks into training, development, and test sets. FLEX (Bragg et al., 2021) covers the best practices and provides a unified interface for different types of transfer and varying shot sizes. Finally, to the best of our knowledge, the only non-English dataset for few-shot benchmarking is FewCLUE in Chinese (Xu et al., 2021). TAPE is the first few-shot benchmark for Russian, and it introduces variations at the data level by creating adversarial test sets.

Task Formulations
TAPE includes six novel datasets for Russian, each requiring the modeling of at least two "intellectual abilities": logical reasoning (§3.1; extended Winograd schema challenge), reasoning with world knowledge (§3.2; CheGeKa, RuOpenBookQA, and RuWorldTree), multi-hop reasoning (§3.2; MultiQ), and ethical judgments (§3.3; Ethics 1/2). This section describes the task formulations, general data collection stages, and dataset examples. Appendix A provides the general dataset statistics, while Appendix E.1 includes details on dataset collection and the extra validation stage via the crowd-sourcing platform Toloka (Pavlichenko et al., 2021).

Logical Reasoning
Winograd. The Winograd schema challenge comprises tasks with syntactic ambiguity, which can be resolved with logical reasoning (Levesque et al., 2012). The texts for the dataset are collected with a semi-automatic pipeline. First, lists of 11 typical grammatical structures with syntactic homonymy (mainly case) are compiled by a few authors with a linguistic background (see Appendix B). Queries corresponding to these constructions are submitted to the search interface of the Russian National Corpus, which includes a sub-corpus with resolved homonymy. In the resulting 2k+ sentences, homonymy is resolved automatically with UDPipe and then validated manually by a few authors. Each sentence is split into multiple examples in the binary classification format, indicating whether the reference pronoun is dependent on the chosen candidate noun, as sketched below.
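For illustration, a minimal sketch of what one such binary classification example could look like is given below. The field names and the English example are hypothetical and for demonstration only, not the benchmark's actual schema.

```python
# One sentence yields several (pronoun, candidate noun) pairs, each an
# independent binary classification example. Field names are illustrative.
example = {
    "text": "The trophy didn't fit in the suitcase because it was too big.",
    "reference": "it",           # the ambiguous pronoun
    "candidate": "the trophy",   # a candidate antecedent noun
    "label": 1,                  # 1 if the pronoun refers to the candidate, else 0
}
```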

Reasoning with World Knowledge
RuOpenBookQA. RuOpenBookQA is a QA dataset with multiple-choice elementary-level science questions, which probe the understanding of 1k+ core science facts. The dataset is built via automatic translation of the original English dataset by Mihaylov et al. (2018) and manual validation by a few authors.
RuWorldTree. The collection approach of RuWorldTree is similar to that of RuOpenBookQA, the main difference being the additional lists of facts and the logical order attached to each answer to a question (Jansen et al., 2018).
MultiQ. Multi-hop reasoning has been one of the least explored QA directions for Russian. The task is addressed only by the MuSeRC dataset (Fenogenova et al., 2020) and a few dozen questions in SberQUAD (Efimov et al., 2020) and RuBQ (Rybin et al., 2021). In response, we have developed a semi-automatic pipeline for multi-hop dataset generation based on Wikidata and Wikipedia. First, we extract the triplets from Wikidata and search for their intersections. Two triplets (subject, relation, object) are needed to compose an answerable multi-hop question. For instance, the question "Na kakom kontinente nakhoditsya strana, grazhdaninom kotoroy byl Yokhannes Blok?" (On what continent lies the country of which Johannes Block was a citizen?) is formed by a sequence of five graph units: "Blok, Yokhannes" (Block, Johannes), "grazhdanstvo" (country of citizenship), "Germaniya" (Germany), "chast' sveta" (continent), and "Yevropa" (Europe). Second, several hundred corresponding question templates are manually curated by a few authors; these are further used to fine-tune ruT5-large to generate multi-hop questions given the graph unit sequences. Third, the resulting questions undergo paraphrasing (Fenogenova, 2021) and a manual validation procedure to control quality and diversity. Finally, each question is linked to two Wikipedia paragraphs with the help of wptools, where all graph units appear in natural language. The task is to select the answer span using information from both paragraphs (see the sketch and the extra dataset example below).
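The following toy sketch illustrates the triplet chaining step under simplified assumptions: the entities and relations are verbalized with a hand-written English template, whereas the actual pipeline fine-tunes ruT5-large on curated Russian templates.

```python
# Two Wikidata-style triplets sharing an entity can be chained into an
# answerable two-hop question. Names and the template are illustrative.
triplet_1 = ("Johannes Block", "country of citizenship", "Germany")
triplet_2 = ("Germany", "continent", "Europe")

def chain(t1, t2):
    """Chain two triplets whose object/subject intersect into a 2-hop question."""
    assert t1[2] == t2[0], "triplets must intersect to form a multi-hop chain"
    subject, rel1, bridge = t1
    _, rel2, answer = t2
    question = f"What is the {rel2} of the {rel1} of {subject}?"
    return question, bridge, answer

question, bridge_entity, answer = chain(triplet_1, triplet_2)
print(question)       # What is the continent of the country of citizenship of Johannes Block?
print(bridge_entity)  # Germany -- must appear in the supporting paragraph
print(answer)         # Europe
```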
• Question: "Gde nakhoditsya istok reki, pritokom kotoroy yavlyayetsya Getar?" (Where is the source of the river, the tributary of which is the Getar?)   questions based on wit and common sense knowledge.We directly contacted the authors of Russian Jeopardy!(Mikhalkova, 2021) and asked about including their training and private test sets in our benchmark.The task is to provide a free response given a question and the question category.

Ethical Judgments
There is a multitude of approaches to evaluating ethics in machine learning. The Ethics dataset for Russian is created from scratch for the first time, relying on a design compatible with Hendrycks et al. (2021). The task is to predict human ethical judgments about diverse text situations in two multi-label classification settings. The first is to identify the presence of concepts in normative ethics, such as virtue, law, morality, justice, and utilitarianism (Ethics 1). The second is to evaluate the positive or negative implementation of these concepts with binary categories (Ethics 2).
The dataset is composed in a semi-automatic mode. First, lists of keywords are formulated to identify the presence of ethical concepts (e.g., "kill", "give", "create", etc.). The keyword lists are expanded with synonyms collected automatically using the semantic similarity tools of the RusVectores project (Kutuzov and Kuzmenko, 2017). After that, the news and fiction sub-corpora of the Taiga corpus (Shavrina and Shapovalova, 2017) are filtered to extract short texts containing these keywords. Each text is annotated via Toloka as documented in Appendix E.1.
• Text: "Pechen'kami sobstvennogo prigotovleniya nagradila 100-letnyaya Greta Plokh malysha, kotoryy pomog yey pereyti cherez ozhivlennoye shosse po peshekhodnomu perekhodu."(100-year-old Greta Ploech gave handmade cookies to a toddler who helped her cross a busy highway at a pedestrian crossing.)No extra data.We do not provide validation sets nor any additional unlabeled data to test the zeroshot and few-shot generalization capabilities of LMs (Bao et al., 2019;Tam et al., 2021).Number of shots.We consider k ∈ {1, 4, 8} for few-shot evaluation to account for sensitivity to the number of shots.We also include zero-shot evaluation, which can be a strong baseline and simulate scenarios where no supervision is available.Episode sampling.We provide 5 episodes in each k-shot setting k ∈ {1, 4, 8} and report standard deviation over the episodes to estimate the variability due to the selection of demonstrations (Schick and Schütze, 2021).Each episode train k randomly sampled from D train with replacement, and a single test D A test acquired via the combination of original and adversarial test data.
Subpopulations. Subpopulations (Goel et al., 2021) are utilized for fine-grained performance analysis w.r.t. such properties of D test as length, domain, and others.
Robustness. LMs are susceptible to adversarial examples, purposefully designed to force them to output a wrong prediction given a modified input (Ebrahimi et al., 2018; Liang et al., 2018; Jia and Liang, 2017). We analyze the LMs' robustness to different types of adversarial data transformations. Here, each training episode corresponds to T + 1 test variations, including the original D test and T adversarial test sets D A test, acquired through the modification of D test. T depends on the dataset and can be adjusted based on the user's needs.
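As an illustration of the subpopulation analysis described above, per-bucket accuracy over input length could be sketched as follows; the length thresholds are arbitrary and for demonstration only.

```python
from collections import defaultdict

def accuracy_by_length(examples, predictions, bins=(0, 50, 100, 200, float("inf"))):
    """Bucket examples by whitespace token count and report per-bucket accuracy."""
    buckets = defaultdict(list)
    for ex, pred in zip(examples, predictions):
        n = len(ex["text"].split())
        for lo, hi in zip(bins, bins[1:]):
            if lo <= n < hi:
                buckets[(lo, hi)].append(pred == ex["label"])
                break
    return {b: sum(v) / len(v) for b, v in buckets.items() if v}

examples = [{"text": "short input", "label": 1},
            {"text": "a much longer input " * 20, "label": 0}]
print(accuracy_by_length(examples, predictions=[1, 1]))
```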

Attacks and Perturbations
Table 1 summarizes TAPE's adversarial attacks and perturbations, based on the generally accepted typology (Zhang et al., 2020; Wang et al., 2021b).
Word-level Perturbations. Word-level perturbations utilize several strategies to perturb tokens, ranging from the imitation of typos (Jin et al., 2020) to synonym replacement (Wei and Zou, 2019). We consider the following:
Spelling. BUTTERFINGERS is a typo-based perturbation that adds noise to data by mimicking spelling mistakes made by humans through character swaps based on keyboard distance.
Modality. EMOJIFY replaces the input words with the corresponding emojis, preserving their original meaning. A few authors have manually validated translations of the English emoji dictionary.
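A simplified sketch of a BUTTERFINGERS-style perturbation is shown below; the neighbor map is a tiny illustrative QWERTY fragment rather than the full Russian keyboard layout used in TAPE, and the probability parameter plays the role of the adversarial threshold discussed in §4.3.

```python
import random

# A tiny illustrative fragment of a keyboard-neighbor map (QWERTY).
NEIGHBORS = {"a": "qwsz", "s": "awedxz", "e": "wsdr", "t": "rfgy", "o": "iklp"}

def butterfinger(text, prob=0.1, seed=0):
    """With probability `prob`, replace a character by one of its keyboard neighbors."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch.lower() in NEIGHBORS and rng.random() < prob:
            out.append(rng.choice(NEIGHBORS[ch.lower()]))
        else:
            out.append(ch)
    return "".join(out)

print(butterfinger("This is a sentence used to test the code", prob=0.3))
```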
Sentence-level Perturbations. In contrast to word-level perturbations, sentence-level perturbation techniques affect the syntactic structure:
Random. Easy Data Augmentation (EDA; Wei and Zou, 2019) has proved to be efficient in fooling LMs on text classification tasks. We use two EDA configurations: swapping words (EDA SWAP) and deleting tokens (EDA DELETE).
Paraphrasis. BACKTRANSLATION (Yaseen and Langer, 2021) generates linguistic variations of the input without changing named entities. We use the OpusMT model to translate the input text into English and back to Russian.
Distraction. ADDSENT is an adversarial attack that generates extra words or sentences with the help of a generative text model. We pass the input to the mGPT LM and generate continuations with a sampling strategy. In the multiple-choice QA tasks, we replace one or more incorrect answers with their generated alternatives.
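The two EDA configurations can be sketched as follows; this is a simplified illustration, not the exact implementation used in the benchmark.

```python
import random

def eda_swap(tokens, n_swaps=1, seed=0):
    """Randomly swap two tokens, n_swaps times (EDA SWAP)."""
    rng = random.Random(seed)
    tokens = list(tokens)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def eda_delete(tokens, prob=0.1, seed=0):
    """Delete each token with probability `prob` (EDA DELETE)."""
    rng = random.Random(seed)
    kept = [t for t in tokens if rng.random() > prob]
    return kept or [rng.choice(tokens)]  # never return an empty sentence

tokens = "This is a sentence used to test the code".split()
print(" ".join(eda_swap(tokens)))
print(" ".join(eda_delete(tokens, prob=0.3)))
```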

Data Curation
Adversarial perturbations and attacks are effective at exploiting weaknesses in LMs (Goel et al., 2021). At the same time, popular techniques may distort semantic meaning or generate invalid adversarial examples (Wang et al., 2021a). We address this problem by using: (i) adversarial probability thresholds, (ii) task-specific constraints, and (iii) semantic filtering.
Probability thresholds. The degree of input modification can be controlled with an adversarial probability threshold, which serves as a hyperparameter: the higher the probability, the more the input gets modified. The thresholds used in our experiments are specified in Table 1.

Table 1: Examples of TAPE's adversarial attacks and perturbations. The examples are given for the English sentence "This is a sentence used to test the code" for illustration purposes. The similarity scores for each transformed sentence are given in percent.
Constraints. TAPE's attacks and perturbations do not drastically change the input's meaning. Nevertheless, we use rule-based constraints that keep the linguistic structure and task-specific aspects unchanged. For instance, it is crucial to leave named entities in the QA tasks untouched and not to modify the syntactic structure and anaphors when perturbing the Winograd examples.
Semantic filtering. We follow Wang et al. (2021a) in filtering the adversarial examples with BERTScore (Zhang et al., 2019), a BERT-based text similarity metric (Devlin et al., 2019). We measure the semantic similarity between the original input and the adversarial output and keep the examples with the highest similarity scores. When the score is lower than a specified threshold, we iteratively decrease the adversarial probability threshold and re-score the new adversarial examples.
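A sketch of the semantic filtering step with the bert-score package could look as follows; the 0.8 threshold is an illustrative value, not necessarily the one used in our experiments.

```python
from bert_score import score  # pip install bert-score

def filter_adversarial(originals, candidates, threshold=0.8):
    """Keep adversarial candidates that are semantically close to their originals."""
    # Score each candidate against its original with multilingual BERT.
    _, _, f1 = score(candidates, originals,
                     model_type="bert-base-multilingual-cased")
    return [c for c, s in zip(candidates, f1.tolist()) if s >= threshold]
```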
Neural Baselines

Perplexity-based evaluation. We consider the setting where the classification and multiple-choice tasks are formulated in natural language as a cloze-style prompt template: Winograd, Ethics 1/2, RuWorldTree, and RuOpenBookQA. We provide examples of the prompt templates for each task in Appendix C. After filling in each possible target class or choice, we compute the per-token cross-entropy loss, which reduces to the negative log-probability due to the one-hot encoding of the target tokens. The most probable string has the lowest sum of the negative log-probabilities of its tokens, normalized over the total number of tokens in the input, as specified in Equation 1:

score(t) = -\frac{1}{|t|} \sum_{i=1}^{|t|} \log p(t_i \mid t_{<i}),   (1)

where t is the input prompt and |t| is the length of the prompt in tokens. This choice relies on our preliminary experiments: the alternative of choosing the most probable string based on the unnormalized sum of the negative log-probabilities of the prompt's tokens showed worse results on subsets of the training sets.
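A minimal sketch of this scoring scheme with the transformers library is given below; the checkpoint name and the prompt placeholders are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "ai-forever/rugpt3small_based_on_gpt2"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

def normalized_nll(prompt: str) -> float:
    """Length-normalized sum of token negative log-probabilities (Equation 1)."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, HF causal LMs return the mean per-token
        # cross-entropy over predicted positions, matching Equation 1 up to
        # the treatment of the first token.
        loss = model(ids, labels=ids).loss
    return loss.item()

# Fill the cloze-style template with each verbalized target and pick the
# candidate with the lowest normalized negative log-likelihood.
candidates = ["<prompt filled with 'yes'>", "<prompt filled with 'no'>"]
prediction = min(candidates, key=normalized_nll)
```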
Zero-shot and few-shot generation. Generative baselines are of the greatest interest for tasks that cannot be solved by the perplexity-based approach: CheGeKa and MultiQ. Here, we generate the answer given the corresponding task prompt (see Appendix C) with nucleus sampling (top-p = 0.8). The choice of the strategy and hyperparameters is based on a grid search on a subset of the corresponding D train. The output is limited to 100/200 (CheGeKa) and 400/800 (MultiQ) tokens in the zero-shot/few-shot settings, respectively.
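A corresponding generation sketch is shown below; the checkpoint and the prompt template are again illustrative assumptions, while top-p = 0.8 and the 100-token zero-shot CheGeKa limit follow the description above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "ai-forever/rugpt3small_based_on_gpt2"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# In the few-shot setting, k demonstrations would be prepended to the prompt.
prompt = "Category: geography\nQuestion: <question text>\nAnswer:"  # illustrative template
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        do_sample=True,
        top_p=0.8,            # nucleus sampling
        max_new_tokens=100,   # zero-shot CheGeKa token limit
        pad_token_id=tokenizer.eos_token_id,
    )
answer = tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:],
                          skip_special_tokens=True)
print(answer)
```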

Human Baselines
The human evaluation is run via Toloka. Access to the annotation projects is granted to annotators certified as Russian native speakers. Robustness is measured via the attack success rate (ASR), i.e., the percentage of predictions that are changed after the perturbation or attack is applied.
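The ASR computation itself is straightforward; a minimal sketch with toy predictions is given below.

```python
def attack_success_rate(orig_preds, adv_preds):
    """Share of predictions that change after the perturbation or attack."""
    changed = sum(o != a for o, a in zip(orig_preds, adv_preds))
    return changed / len(orig_preds)

print(attack_success_rate(["A", "B", "C", "D"], ["A", "C", "C", "A"]))  # 0.5
```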

Generalization Evaluation
Table 3 presents the zero-shot and few-shot performance results of the non-neural, neural, and human baselines on the original D test sets.
Classification and multiple-choice tasks. The zero-shot evaluation provides a strong baseline, matching or exceeding the few-shot performance on Winograd (k ∈ {1, 4}) and Ethics 1 (k ∈ {1, 4, 8}). The zero-shot performance is similar among the models regardless of their size (Winograd), or it can steadily improve (RuWorldTree, RuOpenBookQA) or significantly drop as the model size increases (Ethics 1/2). We observe that introducing more examples increases variability on the imbalanced classification tasks (Winograd, Ethics 1/2) and leads to performance degradation, specifically for ruGPT3 S. Furthermore, the performance degenerates into constant predictions, which is indicated by the significant difference between the accuracy and F1 scores on the Winograd and Ethics 1/2 tasks. In particular, the LMs predict the negative label for about 97% of the Winograd samples in the zero-shot setting.
In the few-shot setting, however, the share of constant predictions is reduced to 80% (k ∈ {4, 8}). This result indicates that the demonstrations may help generalize to the task, but the predictions are still affected by the imbalanced classification setting. We also observe that Ethics 1/2 is the most challenging task for both the human and neural baselines.
The results are sensitive to prompt changes, and human annotators may reach low inter-annotator agreement on the examples due to subjectivity.
Zero-shot and few-shot generation results. Approaching the CheGeKa and MultiQ tasks with zero-shot and few-shot generation results in near-zero generalization performance. Both generative tasks demonstrate the most significant difference between the human evaluation and the baseline results, which can be explained, on the one hand, by the lack of answer choices and, on the other, by the limitations of standard QA metrics for assessing semantically correct but non-literal generated answers. To better understand this, we manually analyzed a sample of 100 predictions per task and found that the generated outputs rarely match the golden answers, e.g., the models generate irrelevant texts or texts merely related to the question.
Discussion. The neural baselines are capable of generalizing well to the multiple-choice QA tasks but perform worse than the random baseline or blindly predict the target labels on the imbalanced classification tasks. Our results are consistent with Lin et al. (2021) in that: (i) the few-shot evaluation results may rely heavily on the input prompts, (ii) it is difficult for GPT-style LMs to perform judgments on social value tasks in the zero-shot and few-shot settings, and (iii) the few-shot results on some tasks are worse than the zero-shot ones, meaning that LMs are not able to utilize the given demonstrations to solve them. We also report a negative result: zero-shot and few-shot generation baselines may show near-zero performance on the open-domain and extractive QA tasks.

Robustness
Table 4 shows the ASR scores for each perturbation and k-shot setting, averaged over the RuWorldTree and RuOpenBookQA tasks, where the model performance exceeds the random baseline. We observe that the models are more susceptible to simple spelling-based perturbations (BUTTERFINGERS), token deletion (EDA DELETE), and modality changes (EMOJIFY). Paraphrasing the input (BACKTRANSLATION) and, especially, adding distractors (ADDSENT) have a significantly smaller effect. Most of the perturbations either gain effectiveness as the number of demonstrations increases (BUTTERFINGERS, EDA DELETE, EDA SWAP) or do not seem to directly depend on the value of k (EMOJIFY, BACKTRANSLATION). However, the ASR scores for the distraction adversarial attack (ADDSENT) are inversely proportional to the number of shots. In other words, the models are less likely to fall for the generated answer option as the number of demonstrations increases. The model size is not the main factor in model robustness, and the difference between the models is quite subtle. We notice, however, that the larger models (ruGPT3 M, ruGPT3 L) are less affected by the ADDSENT perturbation and more vulnerable to BACKTRANSLATION in comparison to ruGPT3 S.
Nonetheless, the latter proves to be less robust to adversarial perturbations overall, as indicated by the higher average ASR scores, closely followed by the largest model.
Discussion. The neural baselines are most vulnerable to word-level text perturbations: typo-based transformations and modality changes. In terms of sentence-level perturbations, the simple EDA techniques prove to be more effective than paraphrasis and the distraction-based adversarial attack, which may be due to task specificity. This is in line with Wang et al. (2021a), who find typo-based adversarial perturbations to be among the most effective automatic attack methods.

Diagnostic Analysis
Subpopulation analysis reveals that the GPT-style models exhibit a length bias, which is indicated by low performance on longer inputs (see Figure 1-Figure 4 in Appendix F). The effect is observed primarily on the RuWorldTree task, where the models' performance significantly drops on longer texts. As might be expected, for the QA tasks, question complexity, determined by school grade or exam name, also affects the model performance. The models tend to deal better with easier questions, which becomes more prominent as the number of demonstrations increases. Seemingly, the readability and lexical diversity of a question, determined by the Flesch Reading Ease score and the type-token ratio, respectively, do not affect model performance.
Nevertheless, slight increases in performance on more readable and diverse texts are present.
Discussion. We reveal that the baseline performance depends on the input length. One reason for this behavior can be the models' limited context window. Alex et al. (2021) have previously explored reasoning over long texts in a few-shot setting, and their results are consistent with our findings.

Conclusion and Future Work
Zero-shot and few-shot methods have evolved as a new paradigm in NLP. Following the established best practices, we introduced TAPE, a text attack and perturbation evaluation benchmark for Russian. TAPE combines general language understanding evaluation techniques with the greener no-tuning approach, allowing the evaluation of LMs' robustness on complex intellectual tasks. We present six new datasets and a framework for generating adversarial attacks and perturbations, which can also be used as a standalone tool for practical purposes.
In future work, we plan to incorporate more LMs with various architectures and prompting-based methods into the framework. Another direction is to evaluate the cross-lingual generalization capabilities of autoregressive LMs. We hope to encourage the community to foster the evaluation of LMs' generalization capacity in non-English languages, leading to the development of more robust and reliable LMs.

Limitations
Performance aggregation. The well-established GLUE-style benchmarks evaluate systems using mean aggregation over heterogeneous task-specific metrics (Wang et al., 2018, 2019, 2021a). Based on the criticism of this evaluation protocol by the research community (e.g., Waseem et al., 2021; Mishra and Arunkumar, 2021; Agarwal et al., 2021), we recognize that mean aggregation in our case does not account for the nature of the adversarial transformations and attacks and the task specifications, such as the task type, the domain, and the number of episodes in D train and D test.
Baseline evaluation. First, our baseline model evaluation relies on using the same prompts for all language models unless mentioned otherwise. Second, we do not utilize related few-shot learning and prompt-tuning methods, which could serve as stronger baseline approaches. We recognize that this can lead to biased evaluation and spurious conclusions about the baseline performance. However, we aim to provide a scope of baseline solutions, ranging from perplexity-based to zero-shot open-ended generation approaches. At the same time, our training sets are publicly available, and it is not anticipated that users will apply this data for fine-tuning.
Human performance. The comparison of our neural and human baselines is inconsistent regarding the number of demonstrations provided to understand a given task. The zero-shot and few-shot human performance would be comparable to the neural LMs' performance if humans received k ∈ {0, 1, 4, 8} examples in the annotation training stage (Mukherjee et al., 2021).

Ethics Statement
Subjectivity related to ethics. Ethics is a multidimensional subject, which remains a complicated problem for LMs and controversial for humans in a multitude of situations. Although our methodology spans general concepts in normative ethics, we acknowledge that it can be challenging to perform objective ethical judgments about some situations (Martineau, 2006). For instance, judgments about law are based on formal criteria (e.g., the criminal code), morality may rely on public sentiment, while justice may heavily rely on private sentiment and human worldview. At the same time, the real-life situations described in a given text are imbalanced concerning the number of acts annotated as positive and the number of acts with various disadvantages in terms of ethical norms. In practice, this leads to moderate inter-annotator agreement and approximate human and model performance estimates.
Risks related to ethics. We acknowledge that approaches to evaluating LMs' ability to perform ethical judgments about text situations have been criticized (Talat et al., 2022). While we use a similar set of ethical concepts (Hendrycks et al., 2021), we collect annotations according to the five criteria that describe the aspects of the annotators' attitude towards the deed. The attitude can be determined by various individual and social aspects. Here, we have analyzed the metadata of our Ethics 1 annotators available via the Toloka interface. There are 481 Russian speakers across 16 different countries, who can be grouped by age as follows: 18-30 (163 annotators), 30-50 (265 annotators), and 50-78 (53 annotators). Thus, we will further take into account specific risks arising within the annotation process:
Social properties: the diffusion of norms in the Russian-speaking communities has been the object of rapid changes (Casier, 2022). This can be expressed in a shift in attitude towards actions that have different interpretations from the point of view of regional cultural norms, the cultures of small peoples, religious norms, and behavior considered normative for different classes of society.
Legal properties: as the "legality" of a deed in a text can change over time, we are sure to see growing annotation inconsistency in individual examples that reflect societal changes after some years. The risks are partially mitigated by the prior training of the annotators and annotator performance control. Running the annotation experiments from year to year is reasonable for understanding possible norm shifts, measuring the variation in annotators' opinions about aspects of the described actions. Furthermore, other data-dependent risks can be indicated, such as genre bias and author's bias in specific publicly available text sources.
Societal impact. TAPE's design allows us to alleviate the problem of a large carbon footprint (Bender et al., 2021) and keep computational costs accessible to academic and industrial fields (Couldry and Mejias, 2020). In particular, our evaluation approach does not involve LMs' fine-tuning and relies on a limited number of episodes, while the number of attacks and perturbations can be adjusted based on the user's needs.

B Winograd Queries
This appendix provides the list of queries that correspond to the RusCorpora query language, along with examples in natural language.
• Type 1: Noun phrase & subordinate clause with "that" in the same gender and number.

C Prompt formats
We design the prompt templates based on the task specifics and format (see Table 2 and Table 3). The choice of the prompts is based on preliminary experiments on the corresponding training set and a manual analysis of the results.
• Winograd: we use "yes" and "no" label encoding.
• RuOpenBookQA and RuWorldTree: we unite the question or the sentence prefix with each of the possible choices.
• Ethics 1/2: we regard each category as a separate binary target, which we encode as "yes" or "no" and, therefore, use different prompts for each category. We manually crafted a large pool of templates and selected between 1 and 3 best prompts for each target, i.e., those that yield the best F1-score on a subset of the training set.
• MultiQ and CheGeKa: we use generative baselines and format the prompts so that the LMs better capture the task.

E Annotation Protocols
Human annotators' submissions are collected and stored anonymously. The average hourly pay rate exceeds the hourly minimum wage in Russia. Each annotator is warned about potentially sensitive topics in the data (e.g., politics, societal minorities, and religion). The data collection process is subject to the necessary quality review and automatic annotation quality assessment using honey-pot tasks.

E.1 Data Collection
MultiQ. We have run an annotation project on the MultiQ test set aimed at identifying whether: (i) the automatically selected answer span is correct and fits the context, (ii) the question can be answered based on the given main and supporting texts, (iii) the question can be answered based on the information in either the main or the supporting text alone (i.e., does not require multi-hop reasoning), and (iv) either of the input texts contains noise. The annotators were also asked to: (i) select the spans of the bridge entity in the supporting text and of the answer in the main text, and (ii) provide comments on the points mentioned above. We discarded samples where the annotators had not agreed on either of the spans with a confidence of more than 50% and manually validated each remaining example using the annotators' votes and comments.
CheGeKa. The private test set underwent multiple validation and filtering stages. First, we manually excluded questions on sensitive topics, questions containing obscene words, and questions that are difficult to answer without the question category. Second, the annotators were asked to answer the questions; the instruction can be found in Table 8 in Appendix E.2. Third, we filtered out votes from annotators whose average performance on the control examples was below 50%. Next, each submission was validated using a set of heuristics for the presence of obscene words, arbitrary or empty answers, and noise. Finally, since the task requires a free response, it is challenging to compute the IAA rates and aggregate the votes. Therefore, we manually validated each submission and identified answers that can also be considered golden. We added such answer options to the corresponding test samples.
Ethical judgments. The annotation design choices rely on multiple studies, in which we experimented with the instructions, schemes, questions asked to annotators, and answer choices. Each study was run using the same data sample of 100 examples per ethical concept and further analyzed based on the Dawid-Skene IAA rates (Dawid and Skene, 1979). The objective here is to identify the ethical concepts that can be unambiguously used for controlling the annotation quality with the honey-pot/control examples and the design choices that maximize the IAA rates. To this end, we use the per-concept Dawid-Skene IAA score and the percentage of three annotators who agree with one another on the target class (confidence; in %). Based on the resulting Dawid-Skene IAA and confidence scores, we have empirically set the confidence score threshold to 45%. We do not consider the concepts of moral and utilitarianism (Ethics 1) and justice (Ethics 2) for controlling the quality due to their ambiguity or subjectivity. The Dawid-Skene IAA scores above 90 indicate strong agreement between the annotators. The final design of both tasks is available as a part of the human evaluation experiments in Table 10 and Table 11 (see Appendix E.2).

E.2 Human Evaluation
Table 5 summarizes the general human evaluation details for each annotation project. In general, we collect the majority vote labels from three to five qualified annotators after filtering them by: (i) average performance on the control examples (more than 50% of the control examples are correct), (ii) the response time, (iii) manual submission validation, and (iv) additional automatic submission verification for the presence of obscene words, arbitrary or empty answers, and noise. The number of votes is set to 3 for RuOpenBookQA, RuWorldTree, and MultiQ, and to 5 for CheGeKa.

Example of web interface
This is a toy text. This is a toy question. "Yes" "No"

Example of web interface
This is a toy question.
Category: This is a toy category.
Please write the answer below:

The University of Jena, officially the Friedrich Schiller University Jena (abbreviated FSU, shortened form Uni Jena), is a public research university located in Jena, Thuringia, Germany.
Explanation: "Friedrich Schiller" is the answer to our question. Please select this span in the text.

Example of web interface
Question: This is a toy question.
Text 1: This is a toy first text.
Text 2: This is a toy second text.
If there is no answer, please state it below:

Table 9: The instruction for the MultiQ human evaluation project, translated for illustration purposes.

Task
• Your task is to answer "Yes" or "No" to five questions about a situation described in a given text.
• In your opinion, is the situation about:
- anyone's good or cruel intentions?
- anything related to law, order, routine, traditions?
- anything that is actively approved or not approved by the society?
- karma or justice?
- material and emotional gains or losses?

Examples
Text: 100-year-old Greta Ploech gave handmade cookies to a toddler who helped her cross a busy highway at a pedestrian crossing. The video was posted on the Readers Channel.
In your opinion, is the situation about:
• anyone's good or cruel intentions? "Yes" "No"
• anything related to law, order, routine, traditions? "Yes" "No"
• anything that is actively approved or not approved by the society? "Yes" "No"
• karma or justice? "Yes" "No"
• material and emotional gains or losses? "Yes" "No"
Explanation: Please note that the old lady had good intentions, and the toddler did too. Everyone gains something good in this text. It is justice. So select the answer "Yes" for questions 1, 4, and 5 and "No" for the other ones. Nothing in this text relates to law and crime or social approval.

Example of web interface
This is a toy text.
In your opinion, is the situation about:
• anyone's good or cruel intentions? "Yes" "No"
• anything related to law, order, routine, traditions? "Yes" "No"
• anything that is actively approved or not approved by the society? "Yes" "No"
• karma or justice? "Yes" "No"
• material and emotional gains or losses? "Yes" "No"

Table 10: The instruction for the Ethics 1 human evaluation project, translated for illustration purposes.
Task
• Your task is to answer "Yes" or "No" to five questions about a situation described in a given text.
• Questions:
- Do the characters in this text act with the best intentions, showing their kindest character traits and spiritual qualities?
- Do the characters act according to the laws and rules of their time?
- Do the actants do something that society will approve of?
- Do the characters receive a fair retribution/reward/punishment for their actions?
- Have the people in the text become wealthier and happier without making others much more unhappy?

Examples
Text: 100-year-old Greta Ploech gave handmade cookies to a toddler who helped her cross a busy highway at a pedestrian crossing. The video was posted on the Readers Channel.
Please answer the questions:
• Do the characters in this text act with the best intentions, showing their kindest character traits and spiritual qualities? "Yes" "No"
• Do the characters act according to the laws and rules of their time? "Yes" "No"
• Do the actants do something that society will approve of? "Yes" "No"
• Do the characters receive a fair retribution/reward/punishment for their actions? "Yes" "No"
• Have the people in the text become wealthier and happier without making others much more unhappy? "Yes" "No"
Explanation: A toddler and the old lady have shown their best spiritual qualities. Both acted according to the law. Society usually approves of such behavior. The good deed was rewarded with justice. Furthermore, everyone in the text became happier: the old woman who successfully crossed over to the other side and the toddler who received a treat. Please answer "Yes" to all five questions.

Example of web interface
This is a toy text.
Please answer the questions:
• Do the characters in this text act with the best intentions, showing their kindest character traits and spiritual qualities? "Yes" "No"
• Do the characters act according to the laws and rules of their time? "Yes" "No"
• Do the actants do something that society will approve of? "Yes" "No"
• Do the characters receive a fair retribution/reward/punishment for their actions? "Yes" "No"
• Have the people in the text become wealthier and happier without making others much more unhappy? "Yes" "No"

Table 11: The instruction for the Ethics 2 human evaluation project, translated for illustration purposes.

Figure 1: Overview of TAPE's design. (a) D test is passed to the adversarial framework (§4.2) to create the adversarial test set D A test, which includes the original and adversarial examples. (b) We randomly sample 5 sets of demonstration examples from D train for each k ∈ {1, 4, 8}; in the zero-shot scenario, we skip this stage. (c) After that, we merge the demonstrations, when applicable, with the examples from D A test to construct evaluation episodes E N k. (d) Each E N k is used to obtain predictions from the model. (e) The performance is summarized in a diagnostic evaluation report. BF stands for BUTTERFINGERS, AS for ADDSENT, and S for subpopulation.

Figures 1-4 (Appendix F): Evaluation reports for ruGPT models per task; e.g., Figure 1 shows the evaluation report on the RuWorldTree task in the zero-shot setting. The baselines are fit on the corresponding D train and evaluated on D test.
The non-neural baseline classifier uses an N-gram range of N ∈ [1; 4] and is trained on the top 150k features with default L2-regularization hyperparameters.

Table 2: Summary of the TAPE benchmark.

Table 4: The robustness evaluation results by adversarial perturbation and attack. The ASR values are averaged over the RuOpenBookQA and RuWorldTree tasks; lower is better. The best ASR value is put in bold, and the second best is underlined.

Table 1: General statistics for each dataset. N T refers to the total number of tokens; N U denotes the number of unique tokens. The label distribution by target class is presented in %. We report the distribution of the positive class for each category in Ethics 1/2.

Table 5: Details on the human evaluation projects. IAA refers to the Dawid-Skene IAA scores. Total is the total cost of the annotation project. Verification refers to the manual validation of each vote. Overlap is the number of votes per example. N T is the number of training tasks. N page denotes the number of examples per page. N C is the number of control examples. ART means the average response time in seconds. *We report the number of votes discarded after the manual validation of each submission instead of the IAA scores for MultiQ and CheGeKa.

The number of votes for Winograd and Ethics 1/2 dynamically ranges from 3 to 5. Here, the number of votes per example is automatically computed by Toloka based on the annotators' performance on the training and control examples and the IAA score. IAA is computed with the Dawid-Skene aggregation model directly in Toloka. Below, we provide the IAA scores per ethical concept for the Ethics 1/2 tasks.

Task
• In this task, you are given questions covering various school curriculum topics, such as geography, physics, and chemistry.
• Each question has four possible answers. Your task is to select the correct answer for each question (only one answer is possible).

Table 6: The instruction for the RuOpenBookQA and RuWorldTree human evaluation projects, translated for illustration purposes.
Task
• You are given a text. Your task is to define whether a highlighted pronoun or conjunction refers to the given noun or not.
• Choose "Yes" if the highlighted pronoun or conjunction refers to the noun.
• Choose "No" otherwise.

Examples
1. Text: I put a pie in the refrigerator. It had a lot of butter.
Question: Does "It" refer to "a pie"? "Yes" "No"
Explanation: It is the pie that contained a lot of butter. The correct answer is "Yes".
2. Text: A heavy ball broke through the table, as it was made of thin plywood.
Question: Does "it" refer to "ball"? "Yes" "No"
Explanation: The ball cannot be made of plywood. Thus, the correct answer is "No".

Table 7: The instruction for the Winograd human evaluation project, translated for illustration purposes.

Question: This motto of one of the great houses of Westeros is also the title of the first episode in the first season of Game of Thrones.

Table 8: The instruction for the CheGeKa human evaluation project, translated for illustration purposes.

1. Looking for the answer in the first text: German art historian, writer, and translator. He studied history at the University of Jena and other universities and chose the history of art as his specialization, mainly Italian and modern German.
Explanation: The University of Jena is our hint.
2. Looking for the answer in the second text: