READIN: A Chinese Multi-Task Benchmark with Realistic and Diverse Input Noises

For many real-world applications, user-generated inputs often contain various noises, such as typographical errors (typos) and speech recognition errors caused by linguistic variations. Thus, it is crucial to test model performance on data with realistic input noises to ensure robustness and fairness. However, few studies have constructed such benchmarks for Chinese, where a variety of language-specific input noises occur in the real world. To fill this important gap, we construct READIN: a Chinese multi-task benchmark with REalistic And Diverse Input Noises. READIN contains four diverse tasks and asks annotators to re-enter the original test data with two commonly used Chinese input methods: Pinyin input and speech input. We designed our annotation pipeline to maximize diversity, for example by instructing the annotators to use diverse input method editors (IMEs) for keyboard noises and by recruiting speakers from diverse dialect groups for speech noises. We experiment with a series of strong pretrained language models as well as robust training methods, and find that these models often suffer significant performance drops on READIN even with robustness methods like data augmentation. As the first large-scale attempt at creating a benchmark with noises geared towards user-generated inputs, we believe that READIN serves as an important complement to existing Chinese NLP benchmarks. The source code and dataset can be obtained from https://github.com/thunlp/READIN.


Introduction
User-generated inputs in real-world applications often contain noises where wrong characters or words are used instead of the intended ones (Xu et al., 2021). (Note that linguistic variations themselves are not noises or errors, but they can lead to noises in data processing, for example due to failures of speech recognition.) This is especially true when users type fast, or when they use speech input in noisy environments or with less common accents that cause errors in post-processing systems. However, most benchmarks used in academic research do not explicitly try to capture such real-world input noises (Naplava et al., 2021), leaving the doubt whether models performing well on standard clean test sets can transfer well onto real-world user-generated data.
To evaluate the performance on noisy data for languages like English, existing work typically generates typos via character-level perturbation such as randomly sampled or adversarial character swap or deletion (Belinkov and Bisk, 2018;Pruthi et al., 2019;Jones et al., 2020;Ma et al., 2020), automatic back-translation and speech conversion (Peskov et al., 2019;Ravichander et al., 2021). However, there are many factors not considered in the automatic approaches, for example, the keyboard design of users' devices and speakers' phonetic and phonological variations. These overlooked factors have a large impact on the types of noises possible in keyboard and speech inputs. One notable exception to the above is NoiseQA (Ravichander et al., 2021). Apart from automatic approaches, they also collected test sets with noises produced by annotators. Their dataset only considered the question answering task and is only in English.
In this paper, we focus on Chinese instead and present a multi-task benchmark with REalistic And Diverse Input Noises, named READIN. Compared to the case of English, Chinese input noises have very different patterns due to the very different nature of the two languages. Chinese is a pictographic language without the morphological inflections that are common in Indo-European languages. Also, the tone system is a unique and integral part of Chinese phonology, with no counterpart in English. Such differences cause different types of input noises in both keyboard typing and speech input. To comprehensively study the effect of real-world noises, we cover four diverse tasks: paraphrase identification, machine reading comprehension, semantic parsing (text2SQL), and machine translation, all of which represent important real-life applications.

arXiv:2302.07324v1 [cs.CL] 14 Feb 2023

Original (1a): 花呗怎么不能提额了 (huā bei zěn me bù néng tí é le) "Why can't I raise my quota on HuaBei?"
Keyboard (1b): 花呗怎么不能贴了 (huā bei zěn me bù néng tiē le)
Speech (1c): 画呗怎么不能提饿了 (huà bei zěn me bù néng tí è le)

Table 1: An example of our crowd-sourced keyboard and speech noises, with Pinyin transliterations. The original question comes from AFQMC. In the keyboard noise, 提额 is mis-entered as 贴; in the speech noise, 花 and 额 are mis-recognized as 画 and 饿.
We consider noises occurring in two widely used Chinese input methods, keyboard input and speech input, and provide an example in Table 1.
For keyboard input, Chinese users need an input method editor (IME) to convert raw transliteration sequences into Chinese characters. In such cases, noises can occur either in the transliteration input or when users choose the intended word from the candidate list suggested by the IME. This differs from the case of English, where typos and spelling variations are expected to happen at the character level. The noise patterns are further coupled with the typing habits of individual users; for example, typing the full Pinyin transliteration versus just the abbreviations results in different noise patterns. In order to capture these nuances, we recruit annotators with different typing habits and instruct them to use different IMEs for typing.
For speech input, noises can arise when the speakers' accents or background noises lead to failures of the post-processing automatic speech recognition (ASR) systems. To capture these, we recruit 10 speakers from different regions of China to cover diverse accents and use a widely used Chinese commercial ASR system for post-processing. For instance, in Table 1, the speech noise occurs because the speaker has different tones in their accent, leading the ASR system to produce different characters than the original ones. Ensuring that models are robust across these accent variations has important implications for fairness.
We take many additional measures in the annotation process in order to capture the real-world input noise distribution, as detailed in Section 2. In Section 3, we provide more statistics and analysis of the collected data. In Section 4, we train strong baseline models on the clean training data and test the models on our READIN test sets. The results indicate that these models suffer significant performance drops on the real-world input noises, leaving ample room for future improvement.

Annotation Process
Our annotation asks crowdworkers to re-enter clean test data from existing NLP datasets. Our goal is to induce realistic and diverse input noises in the annotation. We collect data using two different types of input methods: keyboard (Pinyin) input and speech input, both of which are commonly used among Chinese users (Fong and Minett, 2012). All examples are annotated with both input methods, and we keep two separate tracks for data collected with these two input methods. In the following subsections, we first introduce the four tasks and the original datasets that our annotations are based on, and then introduce the annotation process for keyboard input and speech input respectively.

Tasks and Original Datasets
Paraphrase Identification is a binary classification task that aims to determine whether a given sentence pair is a paraphrase pair. We use the AFQMC dataset as the original source for annotation, where the data come from customer services in the financial domain. The original dataset is unbalanced (with more negative pairs than positive), so we down-sample the negative examples to make the training and dev sets balanced, and we report accuracy separately for positive pairs and negative pairs. During annotation, we annotate both sentences in each sentence pair, since in reality both sentences could be user-generated.
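As an illustration of the balancing step, the sketch below down-samples the majority class. The function name and the tuple-based data format are our own assumptions for illustration, not the authors' released preprocessing code.

```python
import random

def downsample_to_balance(examples, rng=None):
    """Down-sample the majority class so positives and negatives are balanced.

    Each example is assumed to be a (sentence1, sentence2, label) tuple,
    with label 1 for paraphrase pairs and 0 otherwise.
    """
    rng = rng or random.Random(0)
    pos = [ex for ex in examples if ex[2] == 1]
    neg = [ex for ex in examples if ex[2] == 0]
    # Identify majority/minority classes and keep a random subset of the majority.
    major, minor = (neg, pos) if len(neg) > len(pos) else (pos, neg)
    kept = rng.sample(major, len(minor))
    balanced = minor + kept
    rng.shuffle(balanced)
    return balanced
```

Fixing the random seed keeps the down-sampled split reproducible across runs.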
Machine Reading Comprehension gives the model passage-question pairs and asks the model to output the correct answer. We choose a span-extraction MRC dataset, CMRC2018 (Cui et al., 2019), as the original data source. We use answer string exact match as the evaluation metric. During annotation, we only annotate the questions and keep the passages clean. This simulates the realistic setting where users enter their queries potentially with typos.

Figure 1: Given the exact same Pinyin input ("shi shi"), different IMEs suggest different words in different orders for users to select from. We use three different IMEs in keyboard annotation for wider coverage.
Semantic Parsing requires the model to convert natural language queries into logical forms. We use the CSpider dataset (Min et al., 2019) which is a dataset for the natural language to SQL query task and is the Chinese version of the Spider dataset (Yu et al., 2018). We use exact match as the metric. During annotation, we annotate the natural language questions to induce typos and use the original SQL queries as the gold reference.
Machine Translation requires the model to translate input in the source language into the target language. We use the news translation shared task from WMT2021 (Akhbardeh et al., 2021) as our original data source. Following the standard practice of the MT community, we use SacreBLEU (Post, 2018) to compute the BLEU score as the metric. During annotation, we only annotate the Chinese sentences and preserve the original English translations as the gold reference.

Pinyin Input Annotation
We present each annotator with a set of input data and ask them to re-type it with the Pinyin input method. We implement the following restrictions in the annotation.

Different IMEs There are many commercial IME software packages available for the Pinyin input method. To maximize diversity, every input sentence is annotated by three different annotators, where each annotator uses a different IME. We specified three commonly used commercial Pinyin IMEs: Microsoft, QQ, and Sogou. The main difference among these IMEs is that when users type the same Pinyin transliteration input, different IMEs suggest different candidate words in different orders, as illustrated in Figure 1. The use of different IMEs captures a wider range of possible typing noises.
Speed Limit Through our pilot run, we find that some annotators like to double-check their typed sequences. This goes against our intention to collect diverse noises for stress-testing models; we prefer to simulate cases where users type at a much faster pace. Therefore, we set a speed limit of 40 characters per minute, which is the average rate over several runs of pilot annotation. We include a timer in the annotation pipeline, and annotations with significantly slower typing speeds are sent back for re-annotation at a faster pace.
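A minimal sketch of how such a speed check might be wired into an annotation pipeline; the function name and threshold handling are illustrative assumptions, not the actual annotation tool.

```python
def typing_speed_ok(num_chars, seconds_elapsed, min_cpm=40.0):
    """Flag annotations typed slower than the characters-per-minute target
    (40 chars/minute in the paper) so they can be re-annotated faster."""
    if seconds_elapsed <= 0:
        return False
    chars_per_minute = num_chars / (seconds_elapsed / 60.0)
    return chars_per_minute >= min_cpm
```

A pipeline would call this with the timer's elapsed time for each submitted sentence and queue failing submissions for re-annotation.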

Disallow Post-Editing
In pilot runs, we also find that some annotators like to correct their typos when double-checking their inputs, which again goes against our purpose. To complement the speed limit, we implement an additional constraint: post-editing is not allowed in the annotation pipeline.

Speech Input Annotation
For speech input, we present each annotator with a set of input data and ask them to read and record them. The recordings are then converted to text data with ASR. We implement the following measures to ensure the diversity of speech input noises.
Setup To represent realistic settings, all recordings are done with mobile devices (the annotators' phones), with 16kHz sampling rate, which is high enough for ASR. We also instruct the annotators to record in environments with natural background noises, for example in their offices with some light background talking or street noises.
Diversity There are large phonetic and phonological variations among users, especially since there are many accents across Chinese speakers. To capture such variation, we recruited a total of 10 different annotators for the speech input task (4 males and 6 females). They were selected from a larger pool of annotators through our trial run to maximally diversify accents. They come from different parts of China and belong to different dialect groups (more annotator details are in the appendix). Their ages range from 32 to 64. We instruct the annotators to speak Mandarin while preserving their accents. Each input sentence is annotated by 3 different annotators from different dialect groups to maximize diversity.
ASR The collected speech data are converted to text with a commercial automatic speech recognition (ASR) system, iFlytek (https://global.xfyun.cn/products/real-time-asr). We choose this commercial system because it is optimized for Mandarin and outperforms the other open-source toolkits that we explored in the pilot run in terms of character-level error rates. We also release the raw audio recordings so that future work can explore alternative ASR choices.
Throughout the paper, we report results separately for the keyboard and speech noisy test sets for more fine-grained comparisons. We introduce more details of the annotated test sets in the next section.

Dataset Overview
In this section, we analyse the annotated noisy test sets, including data statistics, our proposed metrics for robustness evaluation, a manual quality assessment of the annotated data as well as a qualitative analysis of the diverse types of input noises.

Corpus Statistics
The keyboard and speech noise data have the same sizes. (We performed some minimal filtering on the speech noise data to remove nonsensical outputs from ASR; this involves only about 50 examples in total.) We only perform noise annotation on the test data; the training and dev sets remain clean. This serves our purpose of stress-testing models' robustness. Since the original datasets did not publicly release their test sets, we use their original dev splits as our test sets, re-split the existing training data into our new train and dev splits, and only annotate the test splits. We present the statistics of our data splits in Table 2.
To gauge the amount of noise in our annotated test sets, we report the character-level error rate for each noisy test set. Since the noisy data can involve various changes like character deletion, insertion, or substitution, we use Levenshtein distance to measure the level of noise. Specifically, given a clean sentence s and its annotated noisy version t, we define the error rate as:

ErrorRate(s, t) = Levenshtein(s, t) / |s|

where |s| is the number of characters in the clean sentence. We measure the micro-average (averaged over all annotations) as well as the worst-average (considering only the highest-error-rate annotation for each example) error rate across all three annotations over all examples. These two measures are further explained in the next section. The error rates are presented in Table 3. We find that speech noises generally incur larger error rates except on CSpider, and in all cases the error rates are well below 50%.
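As a concrete sketch of this metric, the error rate can be computed with a standard dynamic-programming Levenshtein distance. The function names here are our own illustration, not the benchmark's released evaluation code.

```python
def levenshtein(s: str, t: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    # One-row dynamic programming over the standard edit-distance recurrence.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution (or match)
        prev = curr
    return prev[-1]

def char_error_rate(clean: str, noisy: str) -> float:
    """Levenshtein distance normalized by the clean sentence length."""
    return levenshtein(clean, noisy) / len(clean)
```

Because Python strings iterate by Unicode code point, the same function works directly on Chinese characters.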

Evaluation Metrics
Apart from the individual task metrics introduced in Section 2.1, we introduce two benchmark-level metrics to account for the variations across the three different annotations per test example.
Suppose for the i-th example, the performance of the model (by its task-specific metric) on the three noisy annotations is p_{i,1}, p_{i,2}, p_{i,3} respectively. We define the following two measures over the N test examples.

Micro-Average takes the average performance across the three annotations, and then averages across all examples:

MicroAvg = (1/N) * Σ_i (p_{i,1} + p_{i,2} + p_{i,3}) / 3

In other words, this is equivalent to taking the average of the per-annotator performance.

Worst-Average takes the minimum performance among the three annotations per example, and then averages across all examples:

WorstAvg = (1/N) * Σ_i min(p_{i,1}, p_{i,2}, p_{i,3})

This is a more challenging setting where we examine the worst-case performance across the annotation variations for each example.
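The two measures can be sketched as follows, assuming each example's scores are stored as a list of three per-annotation metric values; the function names are our own illustration.

```python
from statistics import mean

def micro_average(scores):
    """Average each example's three per-annotation scores, then average
    across examples (equivalent to averaging per-annotator performance)."""
    return mean(mean(per_example) for per_example in scores)

def worst_average(scores):
    """Average of the worst (minimum) annotation score per example."""
    return mean(min(per_example) for per_example in scores)
```

For example, with `scores = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]`, the micro-average is 0.5 while the worst-average is 0.0, illustrating how much harsher the worst-case measure is.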

Data Quality Analysis
In order to analyze the quality of our annotated data, we design a human evaluation experiment. We compare our noisy test sets with the automatically constructed input noise test sets of Si et al. (2023). Specifically, they replace characters in the original sentences with randomly sampled homophones based on an existing Chinese homophone dictionary (Zeng et al., 2021). We replicate their approach as a baseline and add the constraint that only simplified Chinese characters are allowed in the substitution process, since our data focus on simplified Chinese. We aim to determine whether our crowdsourced noise data are more likely to occur in the real world. Towards this goal, we conduct a human preference selection experiment, where we present pairs of sentences to two annotators (different from the ones who did the noisy input annotation). Each pair consists of a sentence with automatic typos and another with our crowdsourced input noise, and the ordering is randomly shuffled for all pairs. We instruct the annotators to select the sentence that is more likely to occur in real user input settings (i.e., more plausible). We perform such annotation on 160 randomly sampled sentence pairs, for both keyboard input noises and speech input noises.
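The automatic homophone-substitution baseline can be sketched as below. The toy homophone dictionary and function name are illustrative assumptions; the actual baseline uses the much larger dictionary released by Zeng et al. (2021), restricted to simplified characters.

```python
import random

# Toy homophone dictionary for illustration only; the real baseline samples
# from the full homophone dictionary of Zeng et al. (2021).
HOMOPHONES = {
    "毒": ["独", "读"],
    "提": ["题", "蹄"],
    "额": ["饿", "鹅"],
}

def automatic_homophone_noise(sentence, sub_prob=0.15, rng=None):
    """Randomly replace characters with sampled homophones, mimicking the
    automatic-noise baseline our crowdsourced noises are compared against."""
    rng = rng or random.Random(0)
    out = []
    for ch in sentence:
        if ch in HOMOPHONES and rng.random() < sub_prob:
            out.append(rng.choice(HOMOPHONES[ch]))
        else:
            out.append(ch)
    return "".join(out)
```

Note how this baseline substitutes isolated characters independently, which is exactly the behavior the human preference study finds implausible compared to whole-word mistypes.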
We show some qualitative examples comparing our real-world noises and automatically constructed ones in Table 4, where we see that automatic noises involve substitutions that are unlikely to happen in the real world (for example, changing only the single character "毒" to "独" in the word "病毒" rather than mis-typing the entire word, as human annotators tend to do). Quantitatively, we find that our crowdsourced keyboard input noises are preferred 87.5% of the time over automatic typos, and our speech input noises are preferred 86.3% of the time (results averaged over two annotators). These results suggest that our crowdsourced noisy data are much more plausible than automatic typos.

Diversity Analysis
To understand the diversity of the noise patterns in our annotated data, we first present some qualitative case studies. We present sampled examples in Table 4 showing a wide range of noise patterns. We traced the annotation recordings to better understand how these noises arise during typing. In example (3b), "里程" and "历程" have the same Pinyin transliteration and the annotator chose the wrong word in the IME; in example (4b), the annotator typed the abbreviation "y j l" for "yao jin li" ("要尽力"), which turned into "yao ji liang" ("药剂量") due to wrong word selection (these two words have the same abbreviation); in example (2b), the annotator mis-typed the Pinyin input by swapping "er" ("二") to "re" ("热").
For speech input data, we listened to some sampled raw recordings and found that different annotators have vastly different accents leading to various noise patterns. The speech noise (1c) in Table 1 shows an example where the first tone ('花' [huā]) is pronounced as the fourth tone ('画' [huà]); in example (2c), "jin xin" ("浸信") is pronounced as "qing xing" ("情形"). The noises arise when these accent variations lead to corresponding characters through ASR post-processing. Additionally, we found that the text data produced by the ASR system sometimes have a language modeling effect where the original words are replaced with more likely substitutes for better coherence (similar to the finding in Peskov et al. (2019) on English ASR). For example, in example (3c), "8缸或" ("bā gāng huò") is converted to "八港货" ("bā gǎng huò").
Quantitatively, we performed an additional annotation on 240 sampled keyboard input examples to categorize the noise types; the breakdown is detailed in Appendix A.3. Overall, our analysis highlights that READIN covers realistic and diverse input noises, posing greater challenges for existing models.

Experiments
We benchmark several pretrained language models and examine whether their performance stays strong on READIN.

Baseline Setups
We use RoBERTa-wwm (Cui et al., 2021) and MacBERT as baselines for the classification tasks. RoBERTa-wwm is a Chinese version of RoBERTa, where whole-word masking is used during pretraining. MacBERT is a modification of BERT (Devlin et al., 2019) where replaced word correction is used as a pretraining objective. Both of these models, like the original Chinese BERT, directly use the WordPiece (Wu et al., 2016) tokenizer on Chinese characters. We use the base-scale checkpoint for both models.
For machine translation, we adopt mBART50 (Tang et al., 2020) as the baseline, a multilingual Transformer model that consists of 12 encoder layers and 12 decoder layers and is trained based on mBART for multilingual translation (more details are in the Appendix). For semantic parsing, we use DG-SQL (Wang et al., 2021), a competitive baseline on CSpider based on multilingual BERT (Devlin et al., 2019).
For experiments on AFQMC, CMRC2018, and CSpider, we finetune the pretrained checkpoints on the corresponding clean training sets. For WMT2021, we directly take mBART50 for inference without additional finetuning on Chinese-English parallel data since mBART50 itself is already trained on parallel translation data including Chinese-to-English.

Robustness Methods
Apart from standard finetuning, we also experiment with several robust training and data processing methods in order to assess to what extent existing robustness methods can solve our benchmark. We briefly introduce these methods below.
Adversarial Data Augmentation ADA (Si et al., 2021b) is commonly used to enhance robustness against adversarial examples. We perform ADA by creating synthetic noisy training examples through random homophone substitution, as in Si et al. (2023).

Word Correction We use a typo correction software to pre-process data in READIN and then perform evaluation on the corrected data. We only perform this step on the noisy test sets, not the clean sets.

Table 6: DG-SQL performance on CSpider and mBART50 performance on WMT2021 test sets. We compare model performance on the original clean test set ('Clean') and our new noisy test sets. For results on noisy test sets, we report both micro-average ('Average') and worst-average ('Worst') performance. For CSpider, we report exact match with the gold reference; for WMT2021, we report BLEU.
SubChar Tokenization Si et al. (2023) released a series of BERT-style models trained with SubChar tokenization, which uses sub-character units such as radicals and syllables to compose Chinese characters. In particular, their SubChar-Pinyin model has the advantage of being robust to homophone typos. We adopt their model and also consider performing ADA on top of the SubChar-Pinyin model.

Results
We present the results of the baseline models in Table 5 (for the NLU tasks) and Table 6 (for semantic parsing and machine translation).

Table 7: Finetuning results of BERT models trained with subword and SubChar tokenizers on the AFQMC (pos) subset. SubChar models are more robust than subword models, especially after performing data augmentation.

Input Noises Cause Large Drops
We first compare the performance of the same models on the clean and noisy test sets. We see a clear trend that model performance drops significantly when evaluated on the noisy test sets compared to the clean test sets. As expected, the worst-average performance is much worse than the micro-average, showing that robustness across annotator variations is challenging. Moreover, we find that speech noises cause larger performance drops than keyboard noises (except on CSpider), which corresponds to the character error rates of these test sets (Table 3). One notable result is on AFQMC, where we observe a drastic performance drop on the positive paraphrase pairs but a marginal drop, or even a performance increase, on negative pairs. The reason is that models exploit spurious correlations in the training data, such as lexical overlap, as cues for positive pairs (McCoy et al., 2019). When we introduce input noises to the data, the lexical overlap decreases, so models exploiting spurious features become more likely to predict negative labels. Better performance on the positive examples in AFQMC (without significant sacrifice on the clean tests) can be taken as a sign of better robustness. We also present results on AFQMC as measured by the F1 metric in the appendix; those results also indicate a drop in F1 on the noisy tests.
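The per-class accuracy used to surface this asymmetry on AFQMC can be sketched as follows; this is an illustrative helper, not the paper's evaluation script.

```python
def per_class_accuracy(preds, golds):
    """Accuracy computed separately on positive (1) and negative (0) gold
    pairs, exposing asymmetric effects of input noise as on AFQMC."""
    correct = {0: 0, 1: 0}
    total = {0: 0, 1: 0}
    for p, g in zip(preds, golds):
        total[g] += 1
        correct[g] += int(p == g)
    return {label: correct[label] / total[label] if total[label] else 0.0
            for label in (0, 1)}
```

A model that drifts toward predicting the negative label under noise shows a falling positive-class accuracy while its negative-class accuracy holds steady or rises, which is the pattern described above.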

Robustness Methods Have Inconsistent Gains
For the adversarial data augmentation (ADA) and word correction pre-processing methods, we find that they have inconsistent gains on different datasets. For example, ADA improves performance on the noisy test sets on the AFQMC (pos) set, but not on the CMRC2018 dataset. On the other hand, word correction improves performance on the keyboard noise test sets of CSpider and WMT2021, but not on the other datasets.
SubChar Tokenization Helps Lastly, in Table 7, we show results for finetuning models with SubChar tokenization. We find that the SubChar-Pinyin model outperforms the Subword model (which uses conventional subword tokenization). Moreover, the gain is much larger after training SubChar-Pinyin with ADA.

Related Work
Spelling Errors Previous work has recognized the impact of spelling and grammatical errors in multiple languages. Several typo and grammatical-error corpora have been collected (Hagiwara and Mita, 2020), notably by tracking Wikipedia edits (Grundkiewicz and Junczys-Dowmunt, 2014; Tanaka et al., 2020). The major difference from our work, apart from the language, is that we focus on real-world downstream applications with diverse input settings. There is also work on spelling error correction (SEC) (Wu et al., 2013; Cheng et al., 2020). While SEC aims to restore the original text from spelling errors, our goal is to make sure models perform well on downstream applications even in the presence of input noises. Applying an SEC model as pre-processing could be one way to improve performance on our READIN benchmark. Other alternatives for training robust models against spelling errors include noise-aware training (Namysl et al., 2020) and learning typo-resistant representations (Edizel et al., 2019; Schick and Schütze, 2020; Ma et al., 2020). We leave such modeling explorations to future work.

Linguistic Variations
READIN not only relates to spelling errors and typos, but also to linguistic variations, especially phonological variations. Previous works have examined linguistic variations such as non-standard English (Tan et al., 2020a,b; Groenwold et al., 2020) and dialect disparity (Ziems et al., 2022). Such works have important implications for building equitable NLP applications, especially for minority language groups in society. Yet such effort has been absent in Chinese NLP, and our benchmark is a first attempt at incorporating linguistic variations in model evaluation.
Adversarial Robustness Work on adversarial robustness often involves adversarially optimized character or word perturbations in an attempt to minimize model performance (Ebrahimi et al., 2018a,b; Jones et al., 2020). Corresponding defenses have also been proposed, such as adversarial training and data augmentation (Belinkov and Bisk, 2018; Si et al., 2021b,a). Our work differs from this line of work because we are not measuring worst-case attacks, but rather more realistic input noises that actually occur in real-world user-generated inputs.

Conclusion
In this work, we present READIN, the first Chinese multi-task benchmark with realistic and diverse input noises. Our annotation is carefully designed to elicit realistic and diverse input noises for both keyboard (Pinyin) input and speech input. Through both quantitative and qualitative human evaluation, we show that our crowdsourced input noises are much more plausible and diverse than existing automatically created ones. Our experiments on strong pretrained language model baselines show that models suffer significant drops on our noisy test sets, indicating the need for methods that are robust to the input noises that occur in the real world.

Ethics and Broader Impact
We use this additional section to discuss potential ethical considerations as well as broader impact of our work.
Ethical Consideration This work involves human annotation. We made sure that all annotators are properly paid. We discussed extensively with all annotators involved to set a compensation that all agree on before starting the annotation, and the total cost of annotation for the project is about 30K RMB. We also explicitly informed all annotators about how the collected data will be used and made adjustments in the data collection and release protocol to avoid any privacy concerns. Overall, we believe that there is no harm involved in this project's annotation jobs.
Positive Societal Impact This project tackles the real-world problem of input noises. We believe that our work will have a positive societal impact because we collected test data from annotators with diverse backgrounds. Our benchmark will facilitate the development of models that can perform well across all these variations, which has important implications to ensure the accessibility of our language technologies to users from diverse backgrounds. This fairness and inclusion aspect is often under-valued in the Chinese NLP community and we hope that our work can push the community to put more work on this front.
Limitations While we tried our best to maximize the diversity and coverage of our benchmark, it is practically impossible to cover all possible input noises. We acknowledge aspects that we did not cover, for example, the impact of different input devices (phones and tablets, as compared to the keyboards used in our annotation). Also, while we tried to reconstruct real-world input settings as much as possible, there may still be subtle differences between real-world input and our annotation process; for example, we imposed a speed limit during the keyboard input annotation, and this may not capture exactly how users type in real applications. We encourage future work to consider how to increase the coverage of such benchmarks, as well as possible innovations in data collection procedures to collect fully realistic user data.

A.2 AFQMC F1 Results
We present evaluation results on AFQMC with the F1 metric in Table 9. We can see significant performance drops on the noisy test sets. We prefer to report accuracy numbers for the positive and negative examples separately in the main paper because they better capture the different performance patterns for the positive and negative examples.

A.3 Noise Type Annotation
To better understand the different noise patterns and the diversity of the keyboard noise data, we perform an additional human annotation on two keyboard input subsets in READIN: AFQMC and WMT2021. From each dataset we examine the annotation recordings of 40 sentences from different annotators. Since there are three annotators for each dataset (each using a different IME), this results in a sample size of 240 sentences for this human annotation. The authors of this paper performed this annotation task by categorising the noises in these sampled inputs into the four categories detailed below.

Table 10: Noise breakdown of sampled Pinyin input examples. We categorise the noises into four types based on whether they are typed as full Pinyin sequences (Full) or abbreviations (Abbr) and whether the noises are due to wrong input or wrong word selection.
We note that the annotators have two different typing habits: they either input the full Pinyin sequence or the abbreviations (e.g., just typing the first syllables of each character). Orthogonal to these different typing habits, the noises have two different sources: they either occur because the input Pinyin sequence is wrong or the input sequence is right but the original annotators selected the wrong word in the IME. The combination of these two typing habits and error sources results in the four noise types listed in Table 10. We follow such a scheme for error breakdown because these categories represent very different noisy input patterns and may pose different challenges for the models.
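A minimal sketch of how Full versus Abbr typing could be distinguished from a recorded keystroke sequence, assuming each syllable's abbreviation is simply its first letter (a simplification, and the helper is our own illustration, not the authors' annotation tool):

```python
def input_style(typed, full_pinyin):
    """Classify a typed Pinyin sequence as 'Full' if it spells out every
    syllable, 'Abbr' if it types only each syllable's initial letter."""
    full = "".join(full_pinyin)
    initials = "".join(syllable[0] for syllable in full_pinyin)
    if typed == full:
        return "Full"
    if typed == initials:
        return "Abbr"
    return "Other"
```

For instance, for the intended "要尽力" ("yao jin li"), typing "yaojinli" would be classified as Full and "yjl" as Abbr, matching the example in Section 3.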
From Table 10, we can see that wrong word selection is more common than wrong input sequences, and typing in full is more common than typing abbreviations. Moreover, there are a significant number of examples from each category, confirming the diversity of the noise patterns in the Pinyin input annotations.