DuQM: A Chinese Dataset of Linguistically Perturbed Natural Questions for Evaluating the Robustness of Question Matching Models

In this paper, we focus on the robustness evaluation of Chinese Question Matching (QM) models. Most previous work on analyzing robustness issues focuses on just one or a few types of artificial adversarial examples. Instead, we argue that a comprehensive evaluation should be conducted on natural texts and should take into account the fine-grained linguistic capabilities of QM models. For this purpose, we create a Chinese dataset named DuQM, which contains natural questions with linguistic perturbations, to evaluate the robustness of QM models. DuQM contains 3 categories and 13 subcategories with 32 linguistic perturbations. Extensive experiments demonstrate that DuQM has a better ability to distinguish different models. Importantly, the detailed breakdown of evaluation by linguistic phenomena in DuQM helps us easily diagnose the strengths and weaknesses of different models. Additionally, our experimental results show that the effects of artificial adversarial examples do not carry over to natural texts. Our baseline code and a leaderboard are publicly available.


Introduction
The task of Question Matching (QM) aims to identify question pairs that have the same meaning, and it has been widely used in many applications, e.g., community question answering and intelligent customer service. Though neural QM models have shown compelling performance on various datasets, including Quora Question Pairs (QQP) (Iyer et al., 2017), LCQMC (Liu et al., 2018), BQ (Chen et al., 2018) and AFQMC, neural models are often not robust to adversarial examples, which means that they produce unexpected outputs given just small perturbations on the inputs. As example 1 in Tab. 1 shows, a model might not distinguish the minor difference ("面 noodles") between the two sentences, and thus predicts that the two questions are semantically equivalent.
Recently, the robustness issues of neural models on various NLP tasks, such as question matching, natural language inference and machine reading comprehension, have attracted a lot of attention from the research community. Early works examine the robustness of neural models by creating certain types of artificial adversarial examples (Jia and Liang, 2017; Alzantot et al., 2018; Ren et al., 2019; Jin et al., 2020), or by involving humans and models in the loop to create dynamic adversarial examples (Nie et al., 2020; Wallace et al., 2019). Further studies discover that a few types of superficial cues (i.e., shortcuts) in the training data are learned by the models and hence affect model robustness (Gururangan et al., 2018; McCoy et al., 2019; Lai et al., 2021). Besides, several studies try to improve the robustness of neural models by adversarial data augmentation (Min et al., 2020) and data filtering (Bras et al., 2020). All these efforts motivate us to better find and fix robustness issues.
However, there are several limitations in previous studies. Firstly, the analysis and evaluation in previous work focus on just one or a few types of adversarial examples or shortcuts, but we need normative evaluation (Linzen, 2020; Ettinger, 2020; Phang et al.). In normative evaluation (Linzen, 2020), the objective is not to fool the system by exploiting its particular weaknesses, but to comprehensively evaluate the basic linguistic capabilities of the models with a variety of systematically controlled datasets. Checklist (Ribeiro et al., 2020), QuAIL (Rogers et al., 2020) and Textflint (Wang et al., 2021) are great attempts at normative evaluation. However, it is not clear whether artificial adversarial training is effective on natural texts from real-world applications (Morris et al., 2020). Some other works manually perturb examples to construct natural examples, but manual perturbation is time-consuming and costly (Gardner et al., 2020). Moreover, to the best of our knowledge, there are few Chinese datasets for QM robustness evaluation.
To this end, we create an open-domain Chinese dataset named DuQM, which contains natural questions with linguistic perturbations, for evaluating the robustness of QM models. (1) By linguistic, we mean that the dataset provides a detailed breakdown of evaluation by linguistic phenomena. As shown in Tab. 1, there are 3 categories and 13 subcategories with 32 linguistic perturbations in DuQM, which enables us to evaluate model performance by each category instead of with just a single metric. (2) By natural, we mean that all the questions in DuQM are natural, i.e., they were issued by users of Baidu search. This design helps us properly evaluate a model's robustness on natural texts rather than on artificial texts, which may not preserve semantics and whose distribution is far from real-world applications.
The contributions of this paper can be summarized as follows.

The remainder of this paper is organized as follows. Sec. 2 describes the 3 categories and 13 subcategories with 32 linguistic perturbations in DuQM. Sec. 3 gives the construction process of DuQM. In Sec. 4, we conduct experiments to demonstrate 3 characteristics of DuQM. We conclude our work in Sec. 5.

Linguistic Perturbations in DuQM
The design of DuQM is aimed at normative evaluation, which provides a detailed breakdown of evaluation by linguistic phenomena. Hence, we create DuQM by introducing a set of linguistic features that we believe are important for model diagnosis in terms of linguistic capabilities. Basically, 3 categories of linguistic features are used to build DuQM, i.e., lexical features (see Sec. 2.1), syntactic features (see Sec. 2.2), and pragmatic features (see Sec. 2.3). We list the 3 categories and 13 subcategories with 32 perturbation operations in Tab. 1. The detailed descriptions of all categories are given in this section.

Lexical Features
Lexical features are associated with vocabulary items, i.e., words. As a word is the smallest independent but meaningful unit of speech, an operation on a single word may change the meaning of the entire sentence. It is a basic but crucial capability for models to understand words and perceive word-level perturbations. To provide a fine-grained evaluation of a model's capability of lexical understanding, we further consider 6 subcategories.

Part of Speech. Parts of speech (POS), or word classes, describe the role a word plays in a sentence. DuQM considers 6 POS in Chinese grammar, including noun, verb, adjective, adverb, numeral and quantifier, which are content words that carry most of the meaning of a sentence. In this subcategory, we aim to test the models' understanding of words with different POS by replacing them with related but not identical words.

Named Entity. Understanding named entities requires knowledge of the meaning of names and background knowledge about entities. Thus, we include Named Entity as an independent subcategory to test the model's behavior on named entity recognition, and focus on the 4 types of NE most commonly seen, i.e., location, organization, person and product. Example 12 is a search query and its perturbation on an NE. The two named entities, "山西 Shanxi" and "陕西 Shaanxi", are similar at the character level but denote two different locations. We expect the models to capture the subtle difference.

Synonym. A synonym is a word or phrase that means exactly or nearly the same as another word or phrase in a given language. This subcategory aims to test whether models can identify two semantically equivalent questions whose surface forms only differ in a pair of synonyms. As in example 16, the two sentences differ only in two words, both of which refer to kiwifruit, and have the same meaning.

Antonym. In contrast to synonyms, antonyms are words in an inherently incompatible binary relationship. This subcategory examines a model's capability of distinguishing words with opposed meanings. We mainly focus on adjectives' opposites, e.g., "高 high" and "低 low" (see example 20).

Negation. Negation is another way to express contradiction. To negate a verb or an adjective in Chinese, we normally put a negative before it, e.g., "不 not" before "哭 cry" (example 21) or "不是 not" before "红的 red" (example 22). The negative before the verb or adjective negates the statement.
It is an effective way to analyze a model's basic skill of figuring out contradictory meanings even when there is only a minor change. Moreover, we include some equivalent paraphrases with negation in this subcategory. In example 23, "无法平静 can't calm down" is a negative paraphrase of "激动 excited", so that the paraphrased sentence is equivalent to the positive sentence. We believe that a robust QM system should be able to recognize this kind of paraphrase question pair.

Temporal Word. Temporal reasoning is a relatively high-level linguistic capability that allows a model to reason about time. Unlike in English, verbs in Chinese do not have morphological inflections. Tenses and aspects are expressed either by temporal noun phrases like "明天 tomorrow" (example 24) or by aspect particles like "了 le", which indicates the completion of an action (example 25). This subcategory focuses on temporal distinctions and helps us evaluate the models' temporal reasoning capability.

Syntactic Features
While the sense of a single word is important to question meaning, how words are composed together into a whole also affects sentence understanding. We believe that the relations among words in a sentence are important for models to capture, so we focus on several types of syntactic features in this category. We pre-define 4 linguistic phenomena that we believe are meaningful for locating a model's strengths and weaknesses, and introduce them in this subsection.
Symmetry. Sometimes paraphrases can be generated by simply swapping two conjuncts around in a coordination structure. As shown in example 26, "鱼 fish" and "鸡蛋 egg" are joined together by the conjunction "和 and" and are in a symmetric relation with each other. Even if we swap them around, the sentence meaning does not change. We name this subcategory Symmetry.
Asymmetry. Some words (such as "和 and") denote symmetric relations, while others (for example, the preposition "到 to") denote asymmetric ones. Example 27 shows a sentence pair in which the word before the preposition "到 to" is an adverbial and the word after it is the object. Swapping the adverbial and the object of the prepositional phrase definitely leads to a nonequivalent meaning. If a model performs well only on the Symmetry or only on the Asymmetry subcategory, it may rely on shortcuts instead of an understanding of the syntactic information.
Negative Asymmetry. To further explore the syntactic capability of QM models, DuQM includes a set of test examples that consider both syntactic asymmetry and antonymy, and we name this subcategory Negative Asymmetry. In example 28, the asymmetric relation between "男人 men" and "女人 women" and the opposite meanings of "高 taller" and "矮 shorter" resolve to an equivalent meaning. With this subcategory, we can better explore a model's capability of inferring more complex syntactic structures.
Voice. Another crucial syntactic capability of models is to differentiate active and passive voice. In Chinese, the most common way to express the passive voice is using Bei-constructions, which feature an agentive case marker "被 bei". The subject of a Bei-construction is the patient of an action, and the object of the preposition "被 bei" is the agent. Compared to Fig. 2(a) (in Appendix A), the additional "被 bei" and the changed word order of "猫 cat" and "狗 dog" in Fig. 2(b) convert the sentence from active to passive voice, but the two sentences have the same meaning. If we further change the word order from Fig. 2(b) to Fig. 2(c), the sentence still uses the passive voice but has a different meaning. Moreover, the passive voice is not always expressed with "被 bei". Sometimes a sentence without any passive marker is still in the passive voice. In example 29, although the first sentence is without "被 bei", it expresses the same meaning as the second one.
This category contains a set of active-passive examples, which are effective for evaluating a model's performance on active and passive voice.

Pragmatic Features
Lexical items ordered by syntactic rules are not all that make a sentence mean what it means. Context, or the communicative situation that influences language use, also has a part to play. We include some pragmatic features in DuQM so as to observe whether models are able to understand the contextual meaning of sentences.
Misspelling. Misspellings, which are mostly unintentional, are frequently seen by search engines and question-answering systems. Models should have the capability to capture the true intention of questions with spelling errors to ensure robustness. In example 30, despite the misspelled word "文身 tattoo", the two questions mean the same. In some real-world situations, models should handle misspellings appropriately; for example, when a user issues a query containing a misspelling, a robust model will still return the correct result.

Discourse Particle. Discourse particles are words and small expressions that contribute little to the information the sentence conveys, but play some pragmatic functions such as showing politeness, drawing attention, smoothing the utterance, etc. As shown in example 32, the word "求助 help" is used to draw attention and brings no additional information to the sentence. Using or omitting these little words does not change the sentence meaning. It is necessary for a model to identify the semantic equivalence when such words are used.

Construction
We design DuQM as a diverse and natural corpus. The construction process of DuQM is divided into 4 steps and illustrated in Fig. 1. Firstly, we preprocess the source questions to obtain linguistic knowledge, which is then used to perturb the source texts. Then we pair each source question with its perturbed question as an example. The examples' naturalness is reviewed manually. At last, the examples are annotated manually and DuQM is finally constructed. We introduce the construction details in this section.

Linguistic Preprocessing
We collect a large number of source questions from the search query log of Baidu search and filter them with a question identification model (whose accuracy is higher than 95%) to retain question sentences. All the source questions are natural questions that users have entered into Baidu search. We then perform several linguistic preprocessing steps on them: named entity recognition, POS tagging, dependency parsing, and word importance analysis. The linguistic knowledge of the source questions obtained in this step will be used for perturbation.
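To illustrate this step, below is a minimal sketch of the preprocessing pipeline. The paper does not name the tools it uses; here jieba's POS tagger serves only as a stand-in, the rough entity typing via POS tags, the importance scores, and the select_target_words helper are illustrative assumptions.

```python
# A minimal sketch of linguistic preprocessing (jieba is a stand-in; the paper
# does not specify its actual toolchain).
import jieba.posseg as pseg

CONTENT_POS = {"n", "v", "a", "d", "m", "q"}   # noun, verb, adj, adverb, numeral, quantifier
ENTITY_POS = {"nr": "PERSON", "ns": "LOCATION", "nt": "ORGANIZATION"}  # rough NER via POS tags

def preprocess(question: str):
    """Return (word, pos, entity_type) triples for one source question."""
    analysis = []
    for token in pseg.cut(question):
        entity = ENTITY_POS.get(token.flag)    # None for non-entity words
        analysis.append((token.word, token.flag, entity))
    return analysis

def select_target_words(analysis, importance):
    """Pick content words with high (hypothetical) importance scores as perturbation targets."""
    return [w for (w, pos, _), score in zip(analysis, importance)
            if pos[:1] in CONTENT_POS and score > 0.5]

print(preprocess("梦见被狗咬左腿意味着什么"))
```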

Perturbation.
We conduct different perturbation operations for different subcategories. In general, we perturb the sentences in three ways:
• replace: replace a word with another word, e.g., for the Synonym subcategory, we replace a word with one of its synonyms;
• insert: insert an additional word, e.g., for the Temporal Word subcategory, we insert a temporal word into the source question;
• swap: swap two words; this operation is only used for Syntactic Features.
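These three operations can be sketched as simple word-level edits on a segmented question; the segmentation and the example words below are illustrative, and the actual perturbations are guided by POS tags, dependency parses and the curated resources described next.

```python
# Word-level perturbation primitives (a sketch, not the exact implementation).
from typing import List

def replace(words: List[str], index: int, new_word: str) -> List[str]:
    """replace: swap out one word, e.g. a target word for its synonym or antonym."""
    out = list(words)
    out[index] = new_word
    return out

def insert(words: List[str], index: int, new_word: str) -> List[str]:
    """insert: add one word, e.g. a negation or temporal word, before position `index`."""
    return words[:index] + [new_word] + words[index:]

def swap(words: List[str], i: int, j: int) -> List[str]:
    """swap: exchange two words, used only for syntactic perturbations."""
    out = list(words)
    out[i], out[j] = out[j], out[i]
    return out

# e.g. turning "鱼 和 鸡蛋 能 一起 吃 吗" into its Symmetry counterpart:
print("".join(swap("鱼 和 鸡蛋 能 一起 吃 吗".split(), 0, 2)))  # 鸡蛋和鱼能一起吃吗
```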
The perturbation for each linguistic category is listed in the Perturbation Operation column of Tab. 1, and the perturbation details are as follows:

Lexical Features. For each source question, we select a word with a specific POS tag or entity type and a high word importance score as the target word, and perturb the source question with words collected from the following 4 sources:
• Elasticsearch: to collect words that have high character overlap with the target words;
• Faiss: to collect words that are semantically similar to the target words; specifically, we use RocketQA to train a dense question retrieval model and employ it via Faiss for similarity search;
• Bigcilin: to collect synonyms of the target words;
• Baidu Hanyu: to collect antonyms and synonyms of the target words.
In addition, we use XLM-RoBERTa (Conneau et al., 2020) to insert additional words into the source sentences, and vocabulary lists to insert specific words, such as negation words and temporal words.
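As one example of these sources, the sketch below indexes word embeddings with Faiss and retrieves nearest neighbours as replacement candidates. The encode function is a hypothetical stand-in for the RocketQA-trained dense encoder, and the dimensionality is an assumption.

```python
# Retrieving semantically similar candidate words with Faiss (a sketch; `encode`
# stands in for the RocketQA-trained dense encoder).
import numpy as np
import faiss

def build_index(vocabulary, encode, dim=768):
    vecs = np.stack([encode(w) for w in vocabulary]).astype("float32")
    faiss.normalize_L2(vecs)                 # cosine similarity via inner product
    index = faiss.IndexFlatIP(dim)
    index.add(vecs)
    return index

def similar_words(target, vocabulary, index, encode, k=5):
    q = encode(target).astype("float32").reshape(1, -1)
    faiss.normalize_L2(q)
    scores, ids = index.search(q, k)
    return [(vocabulary[i], float(s)) for i, s in zip(ids[0], scores[0])]
```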
Syntactic Features. For Symmetry and Asymmetry, we retrieve questions from the search log and select those whose edit distance to the source question is equal to 4 as candidate questions.
Then we compare the dependency structures of the source question and the candidate questions, and only the question pairs that contain symmetric or asymmetric relations are retained. To generate examples for Negative Asymmetry, we select from Asymmetry the example pairs of which one side can be negated, and negate that side. The asymmetric syntactic structure of the two sentences and the one-sided negation resolve to an equivalent meaning. For Voice, we add the word "被 bei" to source questions to change the voice.
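A minimal sketch of the candidate filter for Symmetry and Asymmetry described above follows; the dependency-relation check is abstracted into a placeholder predicate, since the parser used is not specified in the paper.

```python
# Filtering logged questions as Symmetry/Asymmetry candidates (a sketch).
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def is_swap_candidate(source: str, candidate: str, has_target_relation) -> bool:
    # Keep candidates at the required edit distance (4 in the paper) whose
    # dependency structures contain a coordination ("和") or prepositional ("到")
    # relation; `has_target_relation` is a placeholder for that parser-based check.
    return edit_distance(source, candidate) == 4 and has_target_relation(source, candidate)
```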
Pragmatic Features.

Misspelling. With a Chinese heteronym list, we obtain a set of common typos and substitute correctly spelled words with typos (a minimal sketch of this substitution is given at the end of this subsection). The perturbation must satisfy two constraints: 1) the typos should be commonly used Chinese characters; 2) only one character in the source sentence is replaced with its typo.

Discourse Particle. We construct this subcategory in 2 ways: 1) we replace or add some question words, auxiliary words or punctuation marks to the source questions to generate Simple Discourse Particle examples (Discourse Particle (Simple) in Tab. 1); 2) for Complex Discourse Particle examples (Discourse Particle (Complex) in Tab. 1), we select question pairs from a Frequently-Asked-Questions (FAQ) log, especially pairs with large differences in sentence length; these pairs are then annotated manually and the positive examples are retained.

With the above approaches, we perturb the source questions and obtain a large set of question pairs. The generated question pairs are then manually reviewed in terms of naturalness and quality.
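The misspelling substitution can be sketched as below; the heteronym map and the common-character set are illustrative placeholders for the actual resources, and the example direction of the typo is hypothetical.

```python
# Creating a Misspelling perturbation (a sketch; HETERONYMS and COMMON_CHARS
# stand in for the heteronym list and the common-character vocabulary).
import random

HETERONYMS = {"纹": ["文"], "症": ["证"]}      # hypothetical char -> common typo chars
COMMON_CHARS = set("文证纹症")                  # hypothetical list of commonly used characters

def misspell(question: str):
    """Replace exactly one character with a common typo, or return None if impossible."""
    slots = [(i, t) for i, c in enumerate(question)
             for t in HETERONYMS.get(c, []) if t in COMMON_CHARS]
    if not slots:
        return None
    i, typo = random.choice(slots)
    return question[:i] + typo + question[i + 1:]

print(misspell("纹身后多久可以碰水"))   # e.g. 文身后多久可以碰水
```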

Naturalness Review
To ensure that the generated sentences are natural, we examine their appearance in the search log and only retain the sentences that have been entered into Baidu search. Then the source question and the generated question are paired together as an example.
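In effect, this review keeps a perturbed question only if real users have actually issued it, which can be sketched as a simple membership check (the log here is a placeholder set):

```python
# Naturalness filter (a sketch): retain only perturbed questions that appear in
# the real search query log.
def naturalness_filter(perturbed_questions, search_log: set):
    return [q for q in perturbed_questions if q in search_log]
```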

Manual Annotation
To ensure quality, linguistic experts from our internal data team evaluate the examples in terms of fluency, grammatical correctness, and correct categorization.

Experiments
In this section, we conduct experiments to discuss three characteristics (char.) of DuQM. In Sec. 4.1, we provide the experimental setup and the evaluation metrics. In Sec. 4.2~4.4, we give the experimental results and discussions.

Experimental Setup
Datasets. To evaluate the robustness of QM models, we select LCQMC to train the models and evaluate their performance on DuQM. LCQMC is a large-scale Chinese QM corpus in the general domain, and its source questions are collected from Baidu Knows (a popular Chinese community question answering website) (Liu et al., 2018), which are similar in form to search queries. Specifically, we first train QM models on LCQMC train. Then we choose the model with the best performance on LCQMC dev and report the results of the chosen model on LCQMC test and DuQM. Tab. 8 presents the statistics of LCQMC.
It is worth mentioning that LCQMC is in the general domain and its source questions are similar to search queries, which is also the form of the source questions of DuQM. In other words, DuQM is not an out-of-domain (OOD) test set for LCQMC, so models' low performance on DuQM cannot be attributed to being OOD.
Models. We choose 6 popular pre-trained models to conduct experiments: BERT b (Devlin et al., 2019), ERNIE b (Sun et al., 2019), RoBERTa b, RoBERTa l (Liu et al., 2019), MacBERT b, and MacBERT l (Cui et al., 2020). A detailed comparison is provided in Tab. 7 (in the Appendix), and the training details are described in Appendix C.1.1.

Char. 1: Challenging and Discriminative

Tab. 3 shows the performance of the models on the held-out set LCQMC test and on our DuQM, which presents the primary characteristic of DuQM: it is challenging and can better discriminate models' abilities.
As shown in Tab. 3, all models achieve accuracy higher than 87% on LCQMC test, but show a significant performance drop on DuQM. Column △ in Tab. 3 shows the differences between models' performance on LCQMC test and DuQM, and shows that the performance on DuQM is lower than on LCQMC test by up to 20.5%. This indicates that DuQM is more challenging, and we claim that a challenging dataset can better distinguish models' performance. As shown in Tab. 3, all the models have similar performance on LCQMC test (around 87%), but different performance on DuQM: the accuracy of the base models ranges from 66.6% to 70.3%, and the large models show higher performance (73.8%). In conclusion, DuQM shows a better ability to discriminate between models.

Char. 2: Diagnose Model in Diverse Ways
DuQM is a corpus with 3 linguistic categories and 13 subcategories, which enables a detailed breakdown of evaluation over different linguistic phenomena. In Tab. 1, we give the performance of the 6 models on all fine-grained categories of DuQM, and Tab. 4 reports the micro-averaged and macro-averaged accuracy. By comparing these results, we introduce the second characteristic of DuQM: it can diagnose the strengths and weaknesses of the models in diverse ways. Several interesting observations emerge (from Tab. 1 and Tab. 4): 1) In most categories, large models outperform base models. As the large models have more parameters and a larger pre-training corpus, it is reasonable that they have better capabilities than relatively smaller models. 2) On Named Entity, all models show good performance (higher than 90%). Another interesting finding is that although ERNIE b is a relatively small model, it performs slightly better than RoBERTa l on this subcategory, which might be attributed to the entity masking strategy used during its pre-training. 3) MacBERT l is significantly better than the other models on Synonym. We suppose that it benefits from its pre-training strategy of masking with similar words instead of random words. Moreover, RoBERTa l and MacBERT l have remarkably better performance on Antonym. 4) The low performance on Temporal Word shows that all models lack the ability of temporal reasoning. 5) All models have surprisingly poor performance on Asymmetry but good performance on Symmetry. We suppose that a failure to learn word order results in wrong predictions when the word order is altered. 6) BERT b and ERNIE b perform better on Misspelling, while RoBERTa b and MacBERT b are relatively better on Complex Discourse Particles. In general, DuQM diagnoses models from a linguistic perspective and can help us identify the strengths and weaknesses of the models.
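For reference, a minimal sketch of the two aggregate metrics reported in Tab. 4: micro-averaged accuracy over all examples versus macro-averaged accuracy over the subcategories.

```python
# Micro- vs. macro-averaged accuracy over DuQM subcategories (a sketch).
def micro_macro_accuracy(results):
    """results: dict mapping subcategory -> list of (prediction, label) pairs."""
    per_cat = {cat: sum(p == y for p, y in pairs) / len(pairs)
               for cat, pairs in results.items()}
    total = sum(len(pairs) for pairs in results.values())
    correct = sum(p == y for pairs in results.values() for p, y in pairs)
    micro = correct / total                        # weighted by subcategory size
    macro = sum(per_cat.values()) / len(per_cat)   # each subcategory weighted equally
    return micro, macro, per_cat
```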

Char. 3: Natural Examples
DuQM is composed of adversarial test examples generated by linguistically perturbing natural questions. (An adversarial example is an input to a machine learning model that is purposely designed to cause the model to make a mistake in its predictions despite resembling a valid input to a human.) We consider that natural examples can better evaluate models' robustness than artificial examples. To demonstrate this, we conduct an experiment to compare the effect of two adversarial training (AT) methods, PWWS (Ren et al., 2019) and TextFooler (Jin et al., 2020), on artificial and natural test examples. We fine-tune RoBERTa l on LCQMC together with the artificial adversarial examples generated by PWWS and TextFooler, and evaluate on the adversarial test sets; the results are shown in Tab. 6. Row LCQMC shows that training only on LCQMC train yields low performance on PWWS and TextFooler (we provide a detailed analysis in Appendix C.3), while the performance on PWWS nat and TextFooler nat is significantly higher. However, if we augment LCQMC train with the examples generated by PWWS and TextFooler, the model's performance on PWWS and TextFooler increases greatly (both methods achieve an improvement of more than 16%), but the effects on the natural examples PWWS nat and TextFooler nat are not significant (-5.8%~2.3%). On the other 2 natural test sets, Checklist nat and DuQM, the effects of the 2 adversarial methods are also not obvious (-2.4%~2.3%).
In conclusion, the common artificial AT methods are not very effective on natural datasets. As a corpus consisting of linguistically perturbed natural questions, DuQM is beneficial for robustness evaluation and helps us mitigate models' undesirable behavior in real-world applications.

Conclusion
In this work, we create a Chinese dataset named DuQM, which contains linguistically perturbed natural questions for evaluating the robustness of QM models. DuQM is designed to be fine-grained and natural. Specifically, DuQM has 3 categories and 13 subcategories with 32 linguistic perturbations. We conduct extensive experiments with DuQM, and the results demonstrate its 3 characteristics: 1) DuQM is challenging and has a better ability to discriminate between models; 2) the fine-grained design of DuQM helps to diagnose the strengths and weaknesses of models and enables us to evaluate the models in diverse ways; 3) artificial adversarial training fails on the natural texts of DuQM.

C.1 Fine-tuning Details

In the fine-tuning stage, we insert a [SEP] token between the question pairs. The pooled output is passed to a classifier. We use different learning rates and numbers of epochs for different pre-trained models. Specifically, for large models, the learning rate is 5e-6 and the number of epochs is 3. For base models, the learning rate is 2e-5 and the number of epochs is 2. The batch size is set to 64 and the maximal length of a question pair is 64. We use early stopping to select the best checkpoint. We choose the model with the best performance on LCQMC dev to report test results, and each model is fine-tuned 3 times on LCQMC train.
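A minimal sketch of this fine-tuning setup using the Hugging Face transformers interface is given below; the checkpoint name is an assumed placeholder, and only the hyperparameters stated above (max length 64, lr 2e-5 for base models, 5e-6 for large models) are taken from the paper.

```python
# Sketch of the QM fine-tuning input pipeline; the tokenizer inserts [SEP]
# between the two questions of a pair automatically.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "hfl/chinese-roberta-wwm-ext"      # assumed base-size checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

q1, q2 = "什么果汁可以减肥", "什么果汁可以减重"
batch = tokenizer(q1, q2, max_length=64, truncation=True,
                  padding="max_length", return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # 5e-6 for large models
model.train()
out = model(**batch, labels=torch.tensor([1]))  # 1 = semantically equivalent
out.loss.backward()
optimizer.step()
```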

C.2 Adversarial Training Details
Tab. 5 gives detailed statistics of the adversarial examples generated with PWWS, TextFooler and Checklist. To generate training samples, we select a set of LCQMC training questions and apply the PWWS and TextFooler methods to them. The labels are the same as those of the original samples. To generate test samples and ensure a robust evaluation, we use 4 datasets of natural adversarial examples: PWWS nat, TextFooler nat, Checklist nat and DuQM. We conduct an adversarial training experiment by feeding the models both the original data and the adversarial examples, and observe whether the models become more robust. We use the pre-trained model RoBERTa l (described in Tab. 7) for fine-tuning, and the details are described in Sec. 4.1.
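The augmentation itself amounts to concatenating the original training pairs with the generated adversarial pairs, keeping the original labels, as sketched below; the attack-generation step is abstracted into a placeholder function, and the sampled fraction is an illustrative assumption since the paper does not state it.

```python
# Sketch of adversarial data augmentation for fine-tuning; `attack` is a
# placeholder for PWWS / TextFooler generation, and labels are kept unchanged.
def augment(train_pairs, attack, fraction=0.1):
    """train_pairs: list of (q1, q2, label); returns original + adversarial examples."""
    selected = train_pairs[: int(len(train_pairs) * fraction)]
    adversarial = []
    for q1, q2, label in selected:
        q2_adv = attack(q1, q2)                 # perturb one side of the pair
        if q2_adv is not None:
            adversarial.append((q1, q2_adv, label))
    return train_pairs + adversarial
```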

C.3 Results of Attacks
We give the main results of attacks on BERT b and RoBERTa l in Tab. 9. The results show that the unnatural attacks (on artificial adversarial samples, i.e., PWWS and TextFooler in Tab. 9) have a higher success rate than DuQM. However, if we select only the natural examples from the artificial adversarial samples (PWWS nat and TextFooler nat in Tab. 9), the attack success rate of PWWS and TextFooler decreases significantly, by at least 18.5% on BERT b.
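Attack success rate here can be read as the fraction of originally correct predictions that an attack manages to flip; a minimal sketch of that computation, under this assumed definition, follows.

```python
# Attack success rate (a sketch): among examples the model originally gets right,
# the fraction whose prediction is flipped by the attacked / perturbed version.
def attack_success_rate(orig_correct_flags, still_correct_flags):
    """Both arguments are aligned lists of booleans (original vs. after attack)."""
    flipped = [ok and not still for ok, still in zip(orig_correct_flags, still_correct_flags)]
    originally_correct = sum(orig_correct_flags)
    return sum(flipped) / originally_correct if originally_correct else 0.0
```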
Table 1: Categories of DuQM (described in Sec. 2) and performance of 6 models on each subcategory (discussed in Sec. 4). Bold face and underlining indicate the first and second highest accuracy for each testing scenario. Example question pairs from the table include:
E17: 什么果汁可以减肥 / 什么果汁可以减重 (what juice can lose weight / what juice can slim)
E20: 什么水果脂肪低 / 什么水果脂肪高 (what fruit is low in fat / what fruit is high in fat)
E23: 激动怎么办 / 无法平静怎么办 (what to do if too excited / what to do if one can't calm down)
E25: 昨天下雪了吗 / 明儿会下雪吗 (did it snow yesterday / will it snow tomorrow)
E26: 鱼和鸡蛋能一起吃吗 / 鸡蛋和鱼能一起吃吗 (can I eat fish with egg / can I eat egg with fish)
E27: 北京到上海航班有哪些 / 上海到北京航班有哪些 (what are the flights from Beijing to Shanghai / what are the flights from Shanghai to Beijing)
E28: 男人比女人更高吗 / 女人比男人更矮吗 (are men taller than women / are women shorter than men)
E29: 梦见狗咬左腿意味着什么 / 梦见被狗咬左腿意味着什么 (what does it mean to dream of a dog biting the left leg / what does it mean to dream of being bitten by a dog on the left leg)

Table 2 :
Data statistics of DuQM. The class distribution of all categories is given in Tab. 1. Additional data statistics are provided in Tab. 2. The construction methods are not Chinese-specific: except for a few categories (e.g., Bei-construction), most of the construction methods can be easily extended to other languages.

Table 3 :
Accuracy (%) on LCQMC test and DuQM. b indicates base, and l indicates large.

Table 4 :
Micro-averaged and macro-averaged accuracy on each category of DuQM.

Table 5 :
Statistics of the adversarial examples.
As they are designed to evaluate different linguistic capabilities of the models, all examples in DuQM are adversarial examples.

Table 6 :
Adversarial training results of RoBERTa l. 'FOOLER' refers to 'TEXTFOOLER'. We use green and red subscripts to represent higher and lower accuracy, respectively.