SICK-NL: A Dataset for Dutch Natural Language Inference

We present SICK-NL (read: signal), a dataset targeting Natural Language Inference in Dutch. SICK-NL is obtained by translating the SICK dataset of Marelli et al. (2014) from English into Dutch. Having a parallel inference dataset allows us to compare both monolingual and multilingual NLP models for English and Dutch on the two tasks. In the paper, we motivate and detail the translation process and perform a baseline evaluation on both the original SICK dataset and its Dutch incarnation SICK-NL, taking inspiration from Dutch skipgram embeddings and contextualised embedding models. In addition, we encapsulate two phenomena encountered in the translation to formulate stress tests and verify how well the Dutch models capture syntactic restructurings that do not affect semantics. Our main finding is that all models perform worse on SICK-NL than on SICK, indicating that the Dutch dataset is more challenging than the English original. Results on the stress tests show that models do not fully capture word order freedom in Dutch, warranting future systematic studies.


Introduction
One of the primary tasks for Natural Language Processing (NLP) systems is Natural Language Inference (NLI), where the goal is to determine, for a given premise sentence, whether it contradicts, entails, or is neutral with respect to a given hypothesis sentence.
For English, several standard NLI datasets exist, such as SICK (Marelli et al., 2014), SNLI (Bowman et al., 2015) and MNLI. Having such inference datasets available only for English may introduce a bias in NLP research. Conneau et al. (2018) introduce XNLI, a multilingual version of a fragment of the SNLI dataset that contains pairs for Natural Language Inference in 15 languages and is explicitly intended to serve as a resource for evaluating crosslingual representations. However, Dutch is not represented in any current NLI dataset, a gap that we wish to fill.
Dutch counts as a high-resource language, with the sixth largest Wikipedia (2M+ articles), despite having only ca. 25M native speakers. Moreover, the syntactically parsed LASSY corpus of written Dutch (van Noord et al., 2013) and the SONAR corpus of written Dutch (Oostdijk et al., 2013) provide rich resources on which NLP systems may be developed. Indeed, Dutch is in the scope of the multilingual BERT models published by Google (Devlin et al., 2019), and two monolingual Dutch BERT models have been published as part of Hugging Face's transformers library (de Vries et al., 2019; Delobelle et al., 2020).
Compared to English, however, the number of evaluation tasks for Dutch is limited. There is a Named Entity Recognition task coming from the CoNLL-2003 shared task (Tjong Kim Sang and De Meulder, 2003); from a one-million-word hand-annotated subcorpus of SONAR (Oostdijk et al., 2013) one derives part-of-speech tagging, Named Entity Recognition and Semantic Role Labelling tasks. More recently, a Sentiment Analysis dataset was introduced, based on Dutch book reviews (van der Burgh and Verberne, 2019). Moreover, Allein et al. (2020) introduce a classification task where a model needs to distinguish between the pronouns "die" and "dat".
Given the focus on word/token-level tasks in Dutch, we aim to complement existing resources with an NLI task for Dutch. We do so by deriving it from the English SICK dataset, for the following reasons. First, this dataset requires only a small amount of world knowledge, as it was derived mainly from image captions that are typically concrete descriptions of a scene. The task therefore imposes few world knowledge requirements on an NLP model and instead assesses its ability to reason. Secondly, due to the structure of the sentences in SICK, the types of inferences can be attributed to particular constructs, such as hypernymy/hyponymy, negation, or choice of quantification. Thirdly, SICK contains 6076 unique sentences and almost 10K inference pairs, making it a sizeable dataset by NLP standards, while deriving a Dutch version is more manageable than with other datasets. We make the dataset, code and derived resources (see Section 5) available online.¹

Dataset Creation
We follow a semi-automatic translation procedure to create SICK-NL, similar to the Portuguese version of SICK (Real et al., 2018). First, we use a machine translator to translate all of the (6076) unique sentences of SICK.² We review each sentence and its translation in parallel, correcting any mistakes made by the machine translator and maintaining consistency in the translation of individual words, in the process guaranteeing that the meaning of each sentence is preserved as much as possible. Finally, we perform a postprocessing step in which we ensure unique translations for unique sentences (alignment) with as few exceptions as possible. In this way we obtain 6059 unique Dutch sentences, which means that the dataset is almost fully aligned on the sentence level. It should be noted, however, that we cannot fully guarantee the same choice of words in each sentence pair in the original dataset, as we translate sentence by sentence. In the whole process, we adapted 1833 of the 6076 automatic translations, either because of translation errors or alignment constraints. As we are interested in collecting a comparable and aligned dataset, we maintain the relatedness and entailment scores from the original dataset. Table 1 shows some statistics of SICK and its Dutch translation. The most notable difference is that the number of unique words in Dutch is 23% higher than that in English, even though the total number of words in SICK-NL is about 93% of that in SICK. We argue that this is due to morphological complexities of Dutch, where verbs can be separable or compound, leading them to be split up into multiple parts (for example, "storing" becomes "opbergen", which may be used as "de man bergt iets op").
¹ github.com/gijswijnholds/sick_nl
² We used DeepL's online translator, www.deepl.com/translator
Moreover, Dutch enjoys a relatively free word order, which in the case of SICK-NL means that sometimes the order of the main verb and its direct object may be swapped in the sentence, especially when the present continuous form ("is cutting an onion") is preserved in Dutch ("is een ui aan het snijden"). Finally, we follow the machine translation, only making changes in the case of grammatical errors, lexical choice inconsistencies, and changes in meaning. This freedom leads to a decrease in relative word overlap between premise and hypothesis sentence, computed as the number of words in common divided by the length of the shorter sentence. From the perspective of Natural Language Inference this is preferable, as word overlap can often be exploited by neural network architectures (McCoy et al., 2019).
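The relative word overlap described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the lowercasing, the whitespace tokenisation, and the choice to count repeated words as a multiset are all assumptions made here.

```python
from collections import Counter


def relative_word_overlap(premise: str, hypothesis: str) -> float:
    """Words in common divided by the length of the shorter sentence.

    Assumptions (not specified in the text): whitespace tokenisation,
    case-insensitive matching, and multiset counting of repeated words.
    """
    p_tokens = premise.lower().split()
    h_tokens = hypothesis.lower().split()
    if not p_tokens or not h_tokens:
        return 0.0
    # Multiset intersection: a word repeated in both sentences counts each time.
    common = sum((Counter(p_tokens) & Counter(h_tokens)).values())
    return common / min(len(p_tokens), len(h_tokens))


if __name__ == "__main__":
    # Toy sentence pairs written for this sketch, not taken from SICK(-NL).
    print(relative_word_overlap("a man is cutting an onion",
                                "a man is slicing an onion"))
    print(relative_word_overlap("een man is een ui aan het snijden",
                                "een man snijdt een ui"))
```

With multiset counting, identical sentences always score 1.0, which would not hold if the common words were counted as a plain set.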

Baseline Evaluation and Results
We evaluate two types of models as a baseline to compare SICK-NL with its English original. First, we evaluate embeddings that were not specifically trained on SICK. Table 2 shows the correlation results on the relatedness task of SICK and SICK-NL, where the cosine similarity between two independently computed sentence embeddings is correlated with the relatedness scores of human annotators (between 1 and 5). To obtain sentence embeddings here, we average skipgram embeddings or, in the case of contextualised embeddings, we take either the sentence embedding given by the [CLS] token or the average of the individual words' embeddings. For the skipgram embeddings in English, we use the standard 300-dimensional GoogleNews vectors provided by the word2vec package, and for Dutch, we use the 320-dimensional Wikipedia-trained embeddings of Tulkens et al. (2016).
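The relatedness evaluation for averaged embeddings can be sketched as below. This is a hedged illustration only: the helper names and the toy two-dimensional vectors are hypothetical, out-of-vocabulary words are simply skipped, and Pearson correlation is assumed because the text does not state which correlation coefficient is reported.

```python
import math


def average_embedding(tokens, vectors):
    """Average the vectors of all tokens found in `vectors` (OOV tokens skipped)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token of the sentence has an embedding")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pearson(xs, ys):
    """Pearson correlation (assumed here; the coefficient is not named in the text)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical toy vectors and gold scores, for illustration only.
    vectors = {"man": [1.0, 0.0], "zwemt": [0.0, 1.0], "vrouw": [0.9, 0.1],
               "rent": [0.5, 0.5], "hond": [-1.0, 0.2]}
    pairs = [("man zwemt", "vrouw zwemt", 4.5),
             ("man zwemt", "vrouw rent", 3.0),
             ("man zwemt", "hond rent", 1.5)]
    sims = [cosine(average_embedding(a.split(), vectors),
                   average_embedding(b.split(), vectors)) for a, b, _ in pairs]
    gold = [score for _, _, score in pairs]
    print(pearson(sims, gold))
```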

The relatedness results show that (a) using the [CLS] token embedding as a sentence encoding performs worse than taking the average of word embeddings, and that (b) the Dutch incarnation of SICK is harder than the original English dataset. It may be noted that relatedness scores are less robust than entailment labels, and so our first result may not be enough support for the claim that SICK-NL poses a more challenging task. For example, it could be that relatedness scores would differ slightly if we were to ask a number of annotators to re-evaluate the Dutch dataset.
In the second setup, we use BERTje, the Dutch BERT model of de Vries et al. (2019), and RobBERT, the Dutch RoBERTa model of Delobelle et al. (2020), with their corresponding English counterparts, as well as multilingual BERT (mBERT), as sequence classifiers on the Entailment task of SICK(-NL). Here we observe a similar pattern in the results in Table 3: while there are individual differences on the same task, the main surprise is that the Dutch dataset is harder, even when exactly the same model (mBERT) is used.
Table 3: Accuracy results on the entailment task of the English SICK dataset and its Dutch translation for two Dutch BERT models and their English counterparts. For each model, we report the best score out of 20 epochs of fine-tuning.

Error Analysis
In order to understand the differences between the Dutch and English language models on the respective tasks, we take a closer look at the classification results. We plot confusion matrices for each model in Table 5, where we separate the predictions that the models have in common from the predictions that are unique to each model. In the case of English, performance on classifying contradictions is worse for multilingual BERT and RoBERTa, and RoBERTa also gives the highest recall values for the Neutral and Entailment labels. None of this is surprising given that RoBERTa has the highest overall test set accuracy. The surprising results come mainly from the comparison between English and Dutch models. While BERTje is rather indecisive when it comes to Neutral sentence pairs (it classifies roughly equal numbers as Neutral and Entailment), it classifies 74% of Entailment pairs as Neutral. For multilingual BERT the situation is reversed, with 47% of Neutral pairs classified as Entailment, although for cases of entailment the classifier did not clearly distinguish Neutral from Entailment. The most surprising pattern was observed in RobBERT: where RoBERTa still has high recall for Neutral and Entailment, its Dutch counterpart RobBERT mistakes most Neutral cases for Entailment, and even more so vice versa. For all models, in these four cases of misclassification, the correct inference was made on the English task in at least 99% of the cases.
Following Naik et al. (2018), we inspect these prominent cases of misclassification in Dutch by looking at the number of cases of high overlap (at most four words not in common) and at the number of length mismatches (the difference in sentence length exceeds 4), and we set these distributions off against that of the test set in Table 4. The main finding here is that word overlap does provide a strong cue in the English dataset, especially given that SICK has more such cases overall (1970) than SICK-NL (1385), and that in SICK they are more concentrated in cases of Entailment. Length mismatches occur more often in SICK-NL but seem to provide less of a cue to the models to make strong inference decisions.
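The two criteria used in this analysis can be made concrete with a short sketch. The function names are hypothetical, and the reading of "words not in common" as the size of the symmetric multiset difference over whitespace tokens is an assumption; the thresholds of four are taken from the text.

```python
from collections import Counter


def words_not_in_common(sentence_a: str, sentence_b: str) -> int:
    """Number of tokens that occur in one sentence but not in the other.

    Assumption: symmetric multiset difference over lowercased whitespace tokens.
    """
    count_a = Counter(sentence_a.lower().split())
    count_b = Counter(sentence_b.lower().split())
    return sum(((count_a - count_b) + (count_b - count_a)).values())


def is_high_overlap(sentence_a: str, sentence_b: str, max_diff: int = 4) -> bool:
    """High overlap: at most `max_diff` words not in common."""
    return words_not_in_common(sentence_a, sentence_b) <= max_diff


def is_length_mismatch(sentence_a: str, sentence_b: str, min_gap: int = 4) -> bool:
    """Length mismatch: the difference in sentence length exceeds `min_gap`."""
    return abs(len(sentence_a.split()) - len(sentence_b.split())) > min_gap


if __name__ == "__main__":
    a = "Een vrouw is aan het wakeboarden op een meer"
    b = "Een vrouw is op een meer aan het wakeboarden"
    print(words_not_in_common(a, b), is_high_overlap(a, b), is_length_mismatch(a, b))
```

Because both measures ignore word order, a pair that differs only by a permutation has zero words not in common and no length mismatch.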

Stress Testing
One of the potential sources of error could have been the passive form translation of a verb. Such constructions, combined with a prepositional phrase, form an interesting testbed for Dutch as they allow the prepositional phrase to be moved in front of the verb in a sentence without changing the meaning. For example, "Een vrouw is aan het wakeboarden op een meer" ("A woman is wakeboarding on a lake") may in Dutch be used interchangeably with "Een vrouw is op een meer aan het wakeboarden". We select all (87) sentences in SICK-NL that contain both the 'aan het' construction and a prepositional phrase, and generate their permutations. Then, in all (225) inference pairs containing these sentences, we substitute the permuted sentence, so that each pair now contains a sentence with a different word order but exactly the same meaning; the inference label is therefore preserved. We then verify how the models' predictions change on those inference pairs that were in the test set (116). Additionally, we check whether the models are able to interchange sentences and their rewritten equivalents (i.e. classify the pair as Entailment).
As a second test, we investigate the role of the simple present versus the present continuous. We take all the (383) cases of present continuous in the Dutch dataset and replace them by a simple present equivalent, leading to 1137 pairs, out of which 576 occur in the test data. For example, we turn the sentence "De man is aan het zwemmen" into the simple form "De man zwemt". We then repeat the same procedure as above, asking how many inference predictions change as a result of this form change, and whether the models treat the two forms as interchangeable. The results in Table 6 indicate that the interchange between present continuous and simple present forms does not make much of a difference to the models' performance, and interchangeability is high except for RobBERT, which scores under 90%. However, switching the order of prepositional phrase and verb has a much stronger effect: all models consistently score lower on the relevant part of the test set, and they are particularly poor at recognising these semantically equivalent sentences as interchangeable.
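The word order permutation in the first stress test can be illustrated with a rough heuristic. This sketch is not the procedure used to build the stress test: the preposition list and the function name are hypothetical, the sentence is assumed to end in "aan het VERB PP" with no trailing punctuation, and a real implementation would more plausibly rely on a syntactic parse.

```python
from typing import Optional

# Hypothetical, deliberately small list of Dutch prepositions for this sketch.
PREPOSITIONS = {"op", "in", "met", "naast", "bij", "door", "over",
                "onder", "voor", "achter", "langs"}


def front_prepositional_phrase(sentence: str) -> Optional[str]:
    """Move a sentence-final prepositional phrase in front of 'aan het VERB'.

    Returns None when the pattern '... aan het VERB PREP ...' is not found.
    """
    tokens = sentence.split()
    for i in range(len(tokens) - 2):
        if tokens[i].lower() == "aan" and tokens[i + 1].lower() == "het":
            verb_index = i + 2
            trailing = tokens[verb_index + 1:]  # candidate prepositional phrase
            if trailing and trailing[0].lower() in PREPOSITIONS:
                reordered = tokens[:i] + trailing + tokens[i:verb_index + 1]
                return " ".join(reordered)
            return None
    return None


if __name__ == "__main__":
    print(front_prepositional_phrase("Een vrouw is aan het wakeboarden op een meer"))
```

A stress-test pair would then keep the original inference label, since only the word order of one sentence changes.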

Conclusion
In this paper we introduced an NLI dataset for Dutch by semi-automatically translating the SICK dataset. To our knowledge, this is the first available inference task for Dutch. Despite the common perception that Dutch is very similar to English, SICK-NL was significantly more difficult to tackle, even for language models that had access to the training data for fine-tuning. We hypothesised that the difference in results may be due to a larger vocabulary in SICK-NL and a decline in word overlap between inference pairs. In addition, we performed two stress tests and found that pretrained models that were exposed to the training data had difficulty detecting semantically equivalent sentences that differ only in word order. Further work will therefore assess such phenomena more systematically.