IndoNLI: A Natural Language Inference Dataset for Indonesian

We present IndoNLI, the first human-elicited NLI dataset for Indonesian. We adapt the data collection protocol of MNLI and collect ~18K sentence pairs annotated by crowd workers and experts. The expert-annotated data is used exclusively as a test set. It is designed to provide a challenging test-bed for Indonesian NLI by explicitly incorporating various linguistic phenomena such as numerical reasoning, structural changes, idioms, and temporal and spatial reasoning. Experimental results show that XLM-R outperforms other pre-trained models on our data. The best performance on the expert-annotated data is still far below human performance (13.4% accuracy gap), suggesting that this test set is especially challenging. Furthermore, our analysis shows that our expert-annotated data is more diverse and contains fewer annotation artifacts than the crowd-annotated data. We hope this dataset can help accelerate progress in Indonesian NLP research.


Introduction
The Indonesian language, or Bahasa Indonesia, is the 10th most spoken language in the world, with more than 190 million speakers. 1 Yet, research in Indonesian NLP is still considered under-resourced due to the limited availability of annotated public datasets. To help accelerate research progress, IndoNLU (Wilie et al., 2020) and IndoLEM (Koto et al., 2020) collect a number of annotated datasets to benchmark Indonesian NLP tasks.
In line with their effort, we introduce Indonesian NLI (INDONLI), a natural language inference dataset for Indonesian. Natural language inference (NLI), also known as recognizing textual entailment (RTE; Dagan et al., 2005), is the task of determining whether a sentence semantically entails another sentence. NLI has been used extensively as a benchmark for NLU, especially with the availability of large-scale English datasets such as the Stanford NLI (SNLI; Bowman et al., 2015) and the Multi-Genre NLI (MNLI; Williams et al., 2018) datasets. Recently, there have been efforts to build NLI datasets in other languages. The most common approach is via translation (Conneau et al., 2018; Budur et al., 2020). One exception is the work by Hu et al. (2020), which uses a human-elicitation approach similar to MNLI to build an NLI dataset for Chinese (OCNLI).
To date, two Indonesian NLI datasets are available. The first one is WReTE (Setya and Mahendra, 2018), which is created using revision history from Indonesian Wikipedia. The second one, INARTE (Abdiansah et al., 2018), is created automatically based on question-answer pairs taken from Web data. Both datasets have a relatively small number of examples (400 pairs for WReTE and ∼1.5K pairs for INARTE) and use only two labels (entailment and non-entailment). Furthermore, since the hypothesis sentences are generated automatically from the premises, the two tend to be so similar that they are arguably ineffective as a benchmark for NLU (Hidayat et al., 2021). In contrast, INDONLI is created using a human-elicitation approach similar to MNLI and OCNLI. It consists of ∼18K annotated sentence pairs, making it the largest Indonesian NLI dataset to date.
INDONLI is annotated by both crowd workers (laypeople) and experts. Lay-annotated data is used for both training and testing, while expert-annotated data is used exclusively for testing. Our goal is to introduce a challenging test-bed for Indonesian NLI; the expert-annotated test data is therefore explicitly designed to target phenomena such as lexical semantics, coreference resolution, idiomatic expressions, and commonsense reasoning. Two examples from our data:

Premise: Seakan tak bisa dipisahkan, dua sahabat itu sama-sama sedang menggarap proyek musik. (As if they cannot be separated, the two friends are both working on music projects.)
Hypothesis: Dua sahabat itu selalu bersama-sama. (The two friends are always together.)
Label: N (neutral)

Premise: Meskipun trikomoniasis adalah penyakit yang sangat umum, penyakit ini seringkali sulit diketahui. (Although trichomoniasis is a very common disease, it is often difficult to detect.)
Hypothesis: Trikomoniasis bukanlah penyakit yang umum. (Trichomoniasis is not a common disease.)
Label: C (contradiction)

English translations are provided in brackets for context. The expert-annotated data is additionally tagged with the linguistic phenomena that contribute to the inference. For illustrative purposes, we highlight the sentence chunks that correspond to specific phenomena (noting that such highlighting is not available in the released dataset).

We also propose a more efficient label validation protocol. Instead of selecting a consensus gold label from five votes as in the MNLI data collection protocol, we annotate labels incrementally, starting from three annotators and adding more annotations only when consensus has not yet been reached. Our proposed protocol is 34.8% more efficient than the standard five-vote annotation.
We benchmark a set of NLI models, including multilingual pretrained models such as XLM-R (Conneau et al., 2020) and pretrained models trained on Indonesian text only (Wilie et al., 2020). We find that the expert-annotated test set is more difficult than the lay-annotated test set, as indicated by lower model performance. The hypothesis-only model also yields worse results on our expert-annotated test set, suggesting fewer annotation artifacts. Furthermore, our expert-annotated test set has less hypothesis-premise word overlap, signifying more diverse and creative text. Overall, we argue that our expert-annotated test set can serve as a challenging test-bed for Indonesian NLI.

Related Work
NLI Data Besides SNLI and MNLI, another recently proposed large-scale English NLI dataset is the Adversarial NLI (ANLI; Nie et al., 2020). It is created using a human-and-model-in-the-loop adversarial approach and is commonly used as an extension of SNLI and MNLI.
For NLI datasets in other languages, the Cross-lingual NLI (XNLI) corpus extends MNLI by manually translating a sample of the MNLI test set into 15 other languages (Conneau et al., 2018). The Original Chinese Natural Language Inference (OCNLI) corpus is a large-scale NLI dataset for Chinese created with a data collection protocol similar to MNLI (Hu et al., 2020). Other works contribute NLI datasets for Persian (Amirkhani et al., 2020) and Hinglish (Khanuja et al., 2020).
Some corpora are created with a mix of machine translation and human participation. The Turkish NLI (NLI-TR) corpus is created by machine-translating SNLI and MNLI sentence pairs into Turkish, which are then validated by native Turkish speakers (Budur et al., 2020). For Dutch, Wijnholds and Moortgat (2021) introduce SICK-NL by machine-translating the SICK dataset (Marelli et al., 2014); the result is then manually reviewed to ensure the correctness of the translation.
AmericasNLI, an extension of XNLI to 10 indigenous languages of the Americas, is created with the primary goal of investigating the performance of NLI models in truly low-resource language settings (Ebrahimi et al., 2021). The dataset is built by translating Spanish XNLI into the target languages, and vice versa, using Transformer-based sequence-to-sequence models.

Annotation Artifacts Gururangan et al. (2018) show that there are hidden biases in NLI corpora, such as word choices, grammaticality, and sentence length, which allow models to predict the correct label from the hypothesis alone.
Several studies also investigate whether NLI models rely on heuristics in their learning. Many NLI models still struggle with phenomena such as antonymy, numerical reasoning, word overlap, negation, length mismatch, and spelling errors (Naik et al., 2018), lexical overlap, subsequence, and constituent heuristics (McCoy et al., 2019), lexical inferences (Glockner et al., 2018), and syntactic structure (Poliak et al., 2018a).
Research analyzing which linguistic phenomena are learned by current models has also gained interest. This ranges from the definition of diagnostic tests and the linguistic phenomena covered (Bentivogli et al., 2010) to fine-grained annotation schemes and the refinement of taxonomic categorization.

Data Source
Our premise text originates from three genres: Wikipedia, news, and Web articles. For the news genre, we use premise text from the Indonesian PUD and GSD treebanks provided by Universal Dependencies 2.5 (Zeman et al., 2019) and from IndoSum (Kurniawan and Louvan, 2018), an Indonesian summarization dataset. For the Web data, we use premise text extracted from blogs and institutional websites (e.g., government, university, and school). To maximize vocabulary and topic variability, we take at most five text snippets from the same document as premise text. Moreover, the premise sources cover a broad range of topics including, but not limited to, science, politics, entertainment, and sport. In contrast to most previous NLI studies, which use only a single sentence as the premise, our premise text consists of a varying number of sentences, i.e., a single sentence (SINGLE-SENT), two sentences (DOUBLE-SENTS), or multiple sentences (PARAGRAPH).

Annotation Protocol
To collect NLI data for Indonesian, we follow the data collection protocol used in SNLI, MNLI, and OCNLI. It consists of two phases, i.e., hypothesis writing and label validation.
The annotation process involves two groups of annotators. We involve 27 Computer Science students as volunteers in the data collection project. All of them are native Indonesian speakers and were taking NLP classes. Henceforth, we refer to them as the lay annotators. The other group of annotators, which we call expert annotators, consists of five co-authors of this paper, who are also native Indonesian speakers with at least seven years of experience in NLP.
Writing Phase In this phase, each annotator is assigned 100-120 premises. For each premise, annotators are asked to write six hypothesis sentences, two for each semantic label (entailment, contradiction, and neutral). This strategy is similar to the MULTI strategy introduced in the OCNLI data collection protocol. 3 We provide the instructions used in the writing phase in Appendix D.
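To make the writing task concrete, the sketch below shows what one completed assignment could look like under the MULTI strategy. The field names and the Indonesian example sentences are ours, for illustration only; they are not the released data format.

```python
# Illustrative only: field names and example sentences are hypothetical,
# not the released INDONLI format.
writing_task = {
    "premise": "Ibu kota provinsi Jawa Barat adalah Bandung.",
    "hypotheses": {
        "entailment":    ["Bandung terletak di Jawa Barat.",
                          "Jawa Barat memiliki ibu kota."],
        "contradiction": ["Ibu kota Jawa Barat adalah Surabaya.",
                          "Jawa Barat tidak memiliki ibu kota."],
        "neutral":       ["Bandung adalah kota terbesar di Jawa Barat.",
                          "Banyak wisatawan mengunjungi Bandung."],
    },
}

# Each premise yields six pairs, two per label (the MULTI strategy).
assert all(len(v) == 2 for v in writing_task["hypotheses"].values())
```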
For hypothesis writing involving expert annotators, we further ask the annotators to tag the linguistic phenomena involved in making the inference for each pair.

Validation Phase We perform label verification for ∼30% of lay-authored pairs and 100% of expert-authored pairs. Our validation process is done in three rounds. In the first round, each pair is relabeled by two other independent annotators. If the label determined by those two annotators is the same as the initial label given by the annotator in the writing phase (the author), we assign it as the gold label. Otherwise, we move the sentence pair to the second round, in which a different annotator labels the pair. If any label is chosen by three of the four annotators (i.e., the author, the two annotators from the first round, and the annotator from the second round), it is assigned as the gold label. If there is still no consensus, we proceed to the last round and collect another label from a fourth annotator.
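A minimal sketch of this three-round consensus logic, as we read it from the description above; the function names are ours, and we assume the author's writing-phase label counts as the first of the three initial votes and that the final round also requires a three-vote majority.

```python
from collections import Counter

def gold_label(author_label, get_extra_label):
    """Incremental label validation sketch.

    `author_label` is the label assigned in the writing phase;
    `get_extra_label()` returns one additional independent annotation.
    Returns (gold label or None, number of extra annotations used).
    """
    votes = [author_label, get_extra_label(), get_extra_label()]
    # Round 1: both validators agree with the author -> gold label.
    if votes[1] == votes[2] == author_label:
        return author_label, 2
    # Round 2: a fourth vote; any label chosen by 3 of 4 wins.
    votes.append(get_extra_label())
    label, count = Counter(votes).most_common(1)[0]
    if count >= 3:
        return label, 3
    # Round 3 (assumed rule): a fifth vote; 3 of 5 wins, else no consensus.
    votes.append(get_extra_label())
    label, count = Counter(votes).most_common(1)[0]
    return (label, 4) if count >= 3 else (None, 4)
```

Counting the author's label, a pair thus costs three, four, or five annotations in total, which is where the 3N + X + Y count in the next paragraph comes from.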
In the MNLI data collection protocol, the goal of the validation phase is to obtain a three-vote consensus over the original label given by the author and the labels given by four other annotators. The number of annotations needed under the MNLI protocol is therefore 5N for N pairs of data. For INDONLI, our three-round annotation process reduces the number of required annotations to 3N + X + Y, where X and Y are the numbers of pairs entering the second and third annotation rounds, respectively. Table 2 shows that approximately 15K annotations would be required to label and validate 3K pairs in the EXPERT data under the MNLI-style validation process, while this number is reduced to fewer than 10K annotations (34% more efficient) with our three-round annotation process. 4 Our proposed strategy requires less annotation cost, which is worthwhile for NLP research communities with limited budgets. Table 3 summarizes our final data, along with a comparison to SNLI, MNLI, XNLI, and OCNLI. Our numbers are on par with those reported for SNLI/MNLI and better than those for OCNLI. About 98% of the validated pairs have a gold label, suggesting that our dataset is of high quality in general. The annotator agreement for the EXPERT data is higher than for the LAY data, suggesting that the former is less ambiguous than the latter.

The Resulting Corpus
After filtering out premise-hypothesis pairs with no gold label (no consensus), we end up with 17,712 annotated sentence pairs. All expert-annotated pairs with gold labels are used as a test set. The lay-annotated pairs are split into development and test sets such that no premise text overlaps between the two. In the end, we have two separate test sets: Test EXPERT and Test LAY. Sentence pairs that are not included in the validation phase, together with the lay-annotated pairs without a gold label, are used for the training set. The per-split statistics (Table 4; average lengths with standard deviation in parentheses) are:

                Train        Dev          Test LAY     Test EXPERT
#entailment     3,476        807          808          1,041
#contradiction  3,439        749          764          999
#neutral        3,415        641          629          944
premise len     21.0 (14.0)  19.9 (10.9)  20.4 (11.6)  31.1 (18.9)
hypothesis len  7.6 (2.9)    7.7 (2.8)    7.7 (…)      …

The number of expert-annotated pairs missing gold labels is extremely small; we exclude them from the distributed corpus. The INDONLI data characteristics are described in a data statement (Bender and Friedman, 2018) in Appendix A. The resulting corpus statistics are presented in Table 4. We observe that the three semantic labels have a relatively balanced distribution. On average, lay-annotated data has shorter premises and hypotheses than expert-annotated data. In both the LAY and EXPERT data, single-sentence premises (SINGLE-SENT) are the most common, followed by DOUBLE-SENTS and PARAGRAPH.

Word Overlap Analysis McCoy et al. (2019) show that NLI models may exploit lexical overlap between the premise and hypothesis as a signal for the correct NLI label. To measure this in our data, we use the Jaccard index to measure unordered word overlap and the longest common subsequence (LCS) to measure ordered word overlap. In addition, we measure the new-token rate (i.e., the percentage of hypothesis tokens not present in the premise) as a proxy for token diversity in the hypothesis. Table 5 shows our results. In terms of the Jaccard index, Test EXPERT has overall lower similarity than Test LAY, and both have higher similarity for pairs with entailment labels than for the other labels (we use the initial label given by the author). Test EXPERT also has a lower LCS similarity score than Test LAY, suggesting that the expert annotators use different wording or sentence structure in the hypothesis. In terms of the new-token rate, Test EXPERT has a higher rate than Test LAY. This indicates that, in general, expert annotators use more diverse tokens than lay annotators when writing hypotheses.
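The paper does not specify the exact tokenization or normalization used for these statistics; the sketch below, which takes pre-tokenized word lists and an unnormalized LCS length, is one plausible reading of the three measures.

```python
def jaccard(premise_tokens, hyp_tokens):
    """Unordered word overlap: |P ∩ H| / |P ∪ H|."""
    p, h = set(premise_tokens), set(hyp_tokens)
    return len(p & h) / len(p | h)

def lcs_length(premise_tokens, hyp_tokens):
    """Longest common subsequence of tokens (ordered overlap)."""
    m, n = len(premise_tokens), len(hyp_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if premise_tokens[i] == hyp_tokens[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n]

def new_token_rate(premise_tokens, hyp_tokens):
    """Share of hypothesis tokens that never appear in the premise."""
    p = set(premise_tokens)
    return sum(t not in p for t in hyp_tokens) / len(hyp_tokens)

# Example usage with whitespace tokenization:
# jaccard(premise.split(), hypothesis.split())
```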

Experiments
We experiment with several neural network-based models to evaluate the difficulty of our corpus. Table 6 reports the performance of all models on the INDONLI dataset, along with human performance on both test sets (Nangia and Bowman, 2019). The CBoW model gives moderate improvements over the majority baseline. However, as expected, its performance is still below that of the Transformer models. We observe that IndoBERT lite obtains comparable or better performance than mBERT. IndoBERT large performs better than IndoBERT lite, but worse than XLM-R. This is interesting, since one might expect that a Transformer model trained only on Indonesian text would outperform a multilingual Transformer model. One possible explanation is that the Indonesian pretraining data in XLM-R is much larger than that used for IndoBERT large (180GB vs. 23GB uncompressed). This finding is also in line with the IndoNLU benchmark results (Wilie et al., 2020), where XLM-R outperforms IndoBERT large on several tasks. In terms of difficulty, it is evident that Test EXPERT is more challenging than Test LAY, as there is a large performance margin (up to 16%) between Test EXPERT and Test LAY across all models. We also see a larger human-model gap on Test EXPERT (18.8%) compared to Test LAY (2.8%). This suggests that INDONLI is relatively challenging for all models, as there is still room for improvement on Test LAY and even more on Test EXPERT.
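As a rough illustration of such a fine-tuning setup, the sketch below uses the HuggingFace transformers and datasets libraries. The dataset identifier, column and split names, and hyperparameters are assumptions for illustration, not the exact configuration used in these experiments.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# "indonli" and the column names below are assumed identifiers;
# adjust them to match the released data.
data = load_dataset("indonli")
tok = AutoTokenizer.from_pretrained("xlm-roberta-base")

def encode(batch):
    return tok(batch["premise"], batch["hypothesis"],
               truncation=True, max_length=128)

data = data.map(encode, batched=True)

# Three-way classification: entailment / neutral / contradiction.
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=3)

trainer = Trainer(
    model=model,
    args=TrainingArguments("xlmr-indonli", learning_rate=2e-5,
                           per_device_train_batch_size=16, num_train_epochs=3),
    train_dataset=data["train"],
    eval_dataset=data["validation"],
    tokenizer=tok,  # enables dynamic padding via the default collator
)
trainer.train()
# Evaluate separately on the lay and expert test splits (split names assumed).
```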

Analysis by Labels and Premise Type
We compare the performance across NLI labels and premise types between Test LAY and Test EXPERT (Table 7). We observe that accuracy across labels is similar on the lay data, while on Test EXPERT the performance differs substantially between labels.

PMI Analysis We compute Pointwise Mutual Information (PMI) to identify the most discriminative words for each NLI label (Table 9). Manual analysis suggests that some words are actually part of multi-word expressions (Suhardijanto et al., 2020). For example, the word salah is actually part of the expression salah satu, which means one of. In general, we observe that the PMI values in the lay data are relatively higher than in the expert data, indicating that the expert data is of better quality and more challenging. The contradiction label is dominated by negation words (e.g., bukan, tidak). However, for the expert data, only one negation word appears among the top three words, while for the lay data all top three words are negation words. This suggests that our annotation protocol for constructing the expert data is effective in reducing these particular annotation artifacts.
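A small sketch of one way to compute the word-label PMI described above, i.e., PMI(w, y) = log[p(w, y) / (p(w) p(y))]. The exact formulation (counting scheme, smoothing) is not specified here, so this version counts each token at most once per hypothesis and applies no smoothing.

```python
import math
from collections import Counter

def pmi_by_label(pairs):
    """pairs: iterable of (hypothesis_tokens, label).
    Returns {(token, label): PMI}, counting token presence per example."""
    word_label, word, label = Counter(), Counter(), Counter()
    n = 0
    for tokens, y in pairs:
        n += 1
        label[y] += 1
        for t in set(tokens):          # presence, not frequency
            word[t] += 1
            word_label[(t, y)] += 1
    return {(t, y): math.log((word_label[(t, y)] / n)
                             / ((word[t] / n) * (label[y] / n)))
            for (t, y) in word_label}
```

Sorting the resulting dictionary by PMI per label surfaces the most label-discriminative words, as reported in Table 9.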

Cross-Lingual Transfer Performance
Prior work (Conneau et al., 2018; Budur et al., 2020) has demonstrated the effectiveness of cross-lingual transfer when training data in the target language is not available. To evaluate the difficulty of our test sets in this setting, we experiment with two cross-lingual transfer approaches: zero-shot learning and translate-train. In this experiment, we use only XLM-R, as it obtains the best performance in our NLI evaluation.
In the zero-shot setting, we employ an XLM-R model trained on a concatenation of the MNLI training set and the XNLI validation set, which covers 15 languages in total. 7 In the translate-train setting, we machine-translate the MNLI training and validation sets into Indonesian and fine-tune the pre-trained XLM-R on the translated data. Our English-to-Indonesian machine translation system uses the standard Transformer architecture (Vaswani et al., 2017).
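To illustrate the zero-shot setting, the sketch below scores an INDONLI-style pair (the trichomoniasis example above) with a publicly available XLM-R checkpoint fine-tuned on XNLI-style data. This checkpoint is an assumption for illustration, not necessarily the model used in these experiments.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# A public XNLI-finetuned XLM-R checkpoint (assumed suitable for this sketch).
name = "joeddav/xlm-roberta-large-xnli"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

premise = ("Meskipun trikomoniasis adalah penyakit yang sangat umum, "
           "penyakit ini seringkali sulit diketahui.")
hypothesis = "Trikomoniasis bukanlah penyakit yang umum."

inputs = tok(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1).squeeze()

# Label order depends on the checkpoint; check model.config.id2label.
for i, p in enumerate(probs.tolist()):
    print(model.config.id2label[i], round(p, 3))
```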

Linguistic Phenomena in Test EXPERT
To investigate natural language challenges in the INDONLI data, we perform an in-depth analysis of linguistic phenomena and a task accuracy breakdown (Linzen, 2020) on our test-bed. Specifically, we examine the distribution of inference categories in the Test EXPERT data and investigate on which categories the models succeed or fail. We curate a subset of 650 examples from Test EXPERT as the diagnostic dataset (we use the same examples to evaluate the human baseline). To annotate the diagnostic set with linguistic phenomena, we ask one expert annotator (who is not the example's author) to review the inference categories tagged by the expert author when creating the premise-hypothesis pairs. The annotation is multi-label, in which a premise-hypothesis pair can correspond to more than one natural language phenomenon. Our annotation scheme incorporates 15 types of inference categories. They cover a variety of linguistic and logical phenomena and may require knowledge beyond the text. The definitions of the inference tags examined in the diagnostic set are given in Table 13 in Appendix F. We provide the tag distribution in the diagnostic set and report the performance of IndoBERT large, XLM-R, and the translate-train model on the curated examples in Table 11.
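To show how the multi-label diagnostic annotation supports a per-tag accuracy breakdown, here is a minimal sketch; the record fields are illustrative, not the released format, and the tag names are those discussed below.

```python
from collections import Counter

# Illustrative diagnostic record: one pair can carry several of the 15 tags.
example = {
    "premise": "...",
    "hypothesis": "...",
    "gold_label": "contradiction",
    "tags": ["NEG", "LSUB"],        # multi-label annotation
}

def accuracy_by_tag(records, predictions):
    """Per-tag accuracy; a record counts toward every tag it carries."""
    correct, total = Counter(), Counter()
    for rec, pred in zip(records, predictions):
        for tag in rec["tags"]:
            total[tag] += 1
            correct[tag] += (pred == rec["gold_label"])
    return {tag: correct[tag] / total[tag] for tag in total}
```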
We see that many premise-hypothesis pairs in the INDONLI diagnostic set involve lexical semantics (e.g., synonyms, antonyms, and hypernyms-hyponyms) or require commonsense knowledge (i.e., a basic understanding of physical and social dynamics) to make an inference. Pairs with the NUM tag also occur with high frequency in our data, whereas idiomatic expressions are less prevalent. A few phenomena are evenly distributed among labels, e.g., NUM, QUANT, and COORD. In contrast, only a small proportion of pairs in the LSUB and STRUCT categories have neutral labels. Unsurprisingly, many NEG examples have the contradiction label.
Our analysis on the diagnostic set shows that the models handle examples tagged with morphosyntactic categories or boolean logic well. COORD and MORPH are the two tags with the highest performance for XLM-R. The translate-train model achieves 85% on LSUB, indicating that our data is considerably robust with respect to syntactic heuristics such as lexical overlap and subsequence (McCoy et al., 2019). All models also achieve decent accuracy on inference pairs with negation; the performance remains stable across the three models (more than 70%).
In contrast, the hardest categories overall appear to be NUM and COMP, indicating that models struggle with arithmetic computation, reasoning about quantities, and dealing with comparisons (Ravichander et al., 2019; Roy et al., 2015). In addition, models find it difficult to reason about temporal and spatial properties, as the accuracy for examples annotated with TEMP and SPAT is also low. Commonsense reasoning proves to be another challenge for NLI models, suggesting that there is still much room for improvement in learning tacit knowledge from text.

Conclusion
We present INDONLI, the first human-elicited NLI dataset for Indonesian. The dataset is authored and annotated by crowd workers (lay data) and expert annotators (expert data). INDONLI includes nearly 18K sentence pairs, making it the largest Indonesian NLI dataset to date. We evaluate state-of-the-art NLI models on INDONLI and find that our dataset, especially the expert-annotated portion, is challenging, as there is still a substantial human-model gap on Test EXPERT. The expert data contains more diverse hypotheses and fewer annotation artifacts, making it well suited for testing models beyond their normal capacity (stress testing). Furthermore, our qualitative analysis shows that the best model struggles to handle certain linguistic phenomena, particularly numerical reasoning and comparatives and superlatives. We hope this dataset will facilitate further progress in Indonesian NLP research.

Ethical Considerations
INDONLI is created using premise sentences taken from Wikipedia, news, and Web domains. These data sources may contain harmful stereotypes, and thus models trained on this dataset have the potential to reinforce those stereotypes. We argue that additional assessment of the potential harms introduced by these stereotypes is needed before this dataset is used to train and deploy models for real-world applications.

Acknowledgments
We would like to thank Kerenza Doxolodeo, Theresia Veronika Rampisela, and Ajmal Kurnia for their contributions in preparing the unannotated dataset and assisting with the annotation process. We also thank the student annotators, without whom this work would not have been possible.
RM's work on this project was financially supported by a grant from Program Kompetisi Kampus Merdeka (PKKM) 2021, Faculty of Computer Science, Universitas Indonesia.
CV's work on this project at New York University was financially supported by Eric and Wendy Schmidt (made by recommendation of the Schmidt Futures program) and Samsung Research (under the project Improving Deep Learning using Latent Structure) and benefited from in-kind support by the NYU High-Performance Computing Center. This material is based upon work supported by the National Science Foundation under Grant No. 1922658. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


A.1 Curation Rationale
The dataset consists of 17,712 premise-hypothesis pairs and is divided into training, development, and test splits. We curate two separate test sets: Test LAY, which is authored and annotated by 27 students, and Test EXPERT, which is authored and annotated by experts. The students were offered compensation at a rate similar to the wage of a research assistant at the Faculty of Computer Science, Universitas Indonesia. All authors and annotators are native Indonesian speakers.

A.2 Language Variety
The premise sentences used to create INDONLI are taken from three sources: Wikipedia, news, and web domain. For the news text, we use premise sentences from the Indonesian PUD 9 and GSD 10 treebanks provided by the Universal Dependencies 2.5 (Zeman et al., 2019) and IndoSum dataset (Kurniawan and Louvan, 2018). For the web domain, we collect premise sentences from blogs and institutional websites (e.g., government, university, and school). Our manual analysis shows that most of the sentences are written in standard written Indonesian.

A.3 Speaker Demographic
All INDONLI authors are native Indonesian speakers. Lay authors are undergraduate students who have taken an NLP class, while expert authors consist of two Ph.D. students and three researchers with at least seven years of experience in NLP. Beyond this, we do not collect any other demographic information about the authors.

A.4 Annotator Demographic
The authors of INDONLI are also the annotators.

A.5 Speech Situation
Each hypothesis in INDONLI is written based on premise sentence(s) taken from Wikipedia, news, or web articles.

D INDONLI Writing Instruction
In this task, given a premise text consisting of one or more sentences, you are asked to write six different hypothesis sentences, two for each label (entailment, contradiction, and neutral).
A premise-hypothesis pair is annotated with the entailment label if it can be concluded that the hypothesis text is correct based on the information contained in the premise text. It is annotated with the contradiction label if it can be concluded that the hypothesis text is wrong based on the information contained in the premise text. Otherwise, the label is neutral; in other words, based on the information contained in the premise text, the truth of the hypothesis text cannot be determined (not enough information).
Please make sure that each hypothesis sentence satisfies the following criteria:
• It consists of one sentence and not multiple sentences,
• It contains some keywords present in the premise text,
• It is grammatical according to Indonesian grammar.
Some strategies that you can apply when writing the hypothesis sentences include, but are not limited to:

1. Word deletion: Delete one or more words from the premise text.

2. Word addition: Add one or more words to the premise text. For example, you can add adjectives, negation words, etc.

3. Lexical change: Replace one or more words from the premise text with their synonym, antonym, hypernym, or hyponym.

4. Paraphrase: Rewrite the premise text in your own words.

5. Structural change: Change the structure of the premise text. For example, you can change the active voice into passive voice or change the order of the sub-clauses in the premise text.

6. Reasoning: Apply reasoning to the given premise text to write a hypothesis sentence, such that reasoning skill is needed when deciding the correct entailment label. For example, you can use numerical reasoning or commonsense knowledge.

If you are given a premise text which consists of multiple sentences, you should not write a hypothesis sentence that is identical to one of the sentences in the premise text.

Figure 1: This is the instruction given to lay authors for writing hypothesis sentences.

E INDONLI Validation Instruction
In this task, you will be given a set of sentence pairs. For each sentence pair:

1. Check whether the sentences are free of errors such as ungrammaticality, incompleteness, or wrong punctuation.
2. Fix any error that is found in one or both sentences.
3. Pick the correct semantic label for the sentence pair.

Figure 2: This is the instruction given to annotators for labeling each sentence pair.


F Inference Category Definitions (excerpt from Table 13)

IDIOM: Idiomatic expression. (Related category: Idioms)

SPAT: Spatial reasoning that involves places and spatial relations between entities; understanding prepositions of location and direction. (Related categories: Spatial, Kim et al., 2019; Spatial, Joshi et al., 2020)

TEMP: Temporal reasoning that involves a common sense of time, for example, the duration an event lasts, the general time an activity is carried out, and the sequence of events. (Vashishtha et al., 2020)

CS: Commonsense knowledge that is expected to be possessed by most people, independent of cultural or educational background. This includes a basic understanding of physical and social dynamics, the plausibility of events, and cause-effect relations. (Related categories: Common sense, Plausibility)

WORLD: Reasoning that requires knowledge about named entities; historical, cultural, and current events; and domain-specific knowledge. (Related categories: World knowledge, Reasoning-Fact, World)