Large Language Models Only Pass Primary School Exams in Indonesia: A Comprehensive Test on IndoMMLU

Although large language models (LLMs) are often pre-trained on large-scale multilingual texts, their reasoning abilities and real-world knowledge are mainly evaluated based on English datasets. Assessing LLM capabilities beyond English is increasingly vital but hindered by the lack of suitable datasets. In this work, we introduce IndoMMLU, the first multi-task language understanding benchmark for Indonesian culture and languages, which consists of questions from primary school to university entrance exams in Indonesia. By employing professional teachers, we obtain 14,981 questions across 64 tasks and education levels, with 46% of the questions focusing on assessing proficiency in the Indonesian language and knowledge of nine local languages and cultures in Indonesia. Our empirical evaluations show that GPT-3.5 only manages to pass the Indonesian primary school level, with limited knowledge of local Indonesian languages and culture. Other, smaller models such as BLOOMZ and Falcon perform at even lower levels.

School exams serve as a powerful means to assess the reasoning abilities and real-world knowledge of LLMs, given that these tests are meticulously designed by expert educators, drawing on the principles of learning science. At various educational levels, school exams function as assessment tools, evaluating not only language proficiency but also higher-order cognitive skills such as comprehension, analytical abilities, and the application of real-world knowledge across diverse scenarios (Novak, 1988). Hendrycks et al. (2021) proposed MMLU, a massive multitask language understanding benchmark in English compiled from different exams, covering topics including US history, computer science, and high school subjects. Recent LLMs such as LLaMA (Touvron et al., 2023) and GPT-4 (OpenAI, 2023) use MMLU as one of their evaluation datasets. In the GPT-4 technical report, automatic evaluation is further extended to encompass various standardized exams, including the SAT, GRE, and bar exams.

1 Code and dataset can be found at https://github.com/fajri91/IndoMMLU

Figure 1: Distribution of subject areas and education levels in IndoMMLU. "Hum", "Social", "Indo", and "Local" refer to Humanities, Social Science, Indonesian Language, and Local Languages and Cultures, respectively.
While there has been a plethora of work on LLM evaluation for English (OpenAI, 2023; Katz et al., 2023; Choi et al., 2023; Ryznar, 2023; Chalkidis, 2023), there has been comparatively little in other languages. Recent work by OpenAI (2023) evaluated GPT-4 using a translated version of MMLU, and reported strong performance. While encouraging, using translations of English evaluation datasets has serious shortcomings, including translation noise; a complete lack of content that is sensitive to the local language and culture (especially as most English evaluation datasets are highly US-centric); and, conversely, the presence of content that is irrelevant to the local language and culture (e.g. questions relating to US law or customs) and incongruent with language-specific evaluation.
In this paper, we ask professional teachers (of Indonesian nationality) to collect exam questions from various educational levels in Indonesian schools (i.e. primary school, junior high school, senior high school, and university). We categorize the collected questions into different subject areas: (1) STEM (Science, Technology, Engineering, and Mathematics); (2) Social Science; (3) Humanities; (4) Indonesian Language; and (5) Local Languages and Cultures. Figure 1 presents an overview of the distribution of the resulting dataset, IndoMMLU, across subject areas and education levels. It is worth noting that 21% of the questions specifically focus on the Indonesian language, and 25% cover nine distinct local languages and cultures that are specific to Indonesia.
Our contributions can be summarized as follows:
• We introduce the first Indonesian MMLU dataset, IndoMMLU, which comprises 63 tasks across different subject areas and education levels in Indonesia.
• Our dataset includes exam questions from school grades 1 to 12, as well as university entrance exams. This comprehensive coverage allows us to perform fine-grained assessment of the Indonesian language proficiency of existing LLMs.
• Approximately 25% of our data covers nine distinct local languages and cultures in Indonesia, namely Lampungic (ljp), Balinese (ban), Makassarese (mak), Banjarese (bjn), Madurese (mad), Sundanese (sun), Javanese (jav), Dayak Ngaju (nij), and Minangkabau (min). These questions are not only in under-represented languages but also incorporate specific cultural content, such as art, poetry, and daily life. For Lampungic (ljp) and Makassarese (mak) in particular, this is the very first NLP resource to be released.
• We evaluate various multilingual LLMs, including GPT-3.5 (Ouyang et al., 2022), XGLM (Lin et al., 2021), Falcon (Penedo et al., 2023), BLOOMZ (Muennighoff et al., 2022), mT0 (Muennighoff et al., 2022), LLaMA (Touvron et al., 2023), and Bactrian-X (Li et al., 2023a), across different model sizes. We find that only GPT-3.5 passes the highest primary school level exam, and no model demonstrates familiarity with local Indonesian languages and culture.

Related Work
Evaluating Large Language Models Various benchmarks have been released to evaluate English pre-trained LMs (Devlin et al., 2019; Conneau et al., 2020). Early benchmarks such as GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019) consist of natural language understanding (NLU) tasks of various types and training data sizes. XGLUE (Liang et al., 2020), XTREME (Hu et al., 2020), and XTREME-R (Ruder et al., 2021) serve as multilingual benchmarks covering more than 20 languages. For natural language generation (NLG), the GEM benchmark (Gehrmann et al., 2021) collects machine translation, summarization, and description generation tasks in many languages.

IndoMMLU
IndoMMLU is a multiple-choice problem set covering 63 subjects from different education levels, following the format of English MMLU (see Figure 2 and Figure 3). IndoMMLU, however, is based on the Indonesian education curriculum, and has more fine-grained education levels than MMLU. In Indonesia's curriculum, schools are categorized into three levels: (1) six years of primary school (Sekolah Dasar, "SD"); (2) three years of junior high school (Sekolah Menengah Pertama, "SMP"); and (3) three years of senior high school (Sekolah Menengah Atas, "SMA"). At primary school, pupils in all grades are taught the Indonesian language, civics, mathematics, art, sports, and religion. From grades 4 to 6 and in junior high school, pupils additionally learn a foreign language, a local language/culture, science, and social science. In senior high school, pupils study more specialized natural science and social science subjects, including physics, chemistry, biology, geography, sociology, economics, and history. In IndoMMLU, we explicitly exclude mathematics because the questions typically consist primarily of symbols with little language content, and there are existing datasets for mathematical reasoning such as GSM-8K (Cobbe et al., 2021).

The local language/culture subjects vary across provinces in Indonesia and depend on local government policy. For example, in West Sumatra, Minangkabau culture is taught using the Indonesian language, while in West Java, pupils are exposed to the Sundanese language and culture. Figure 2 illustrates two exam questions for Minangkabau culture, and one exam question for Sundanese.

Data Construction
We asked seven professional teachers, each with at least a bachelor's degree in education, to gather publicly-available school exam questions in Indonesia from web sources.4 They were tasked with gathering problems for specific subject areas and educational levels, as well as metadata such as the source (i.e. the URL of the source document), school level, class level, question, multiple-choice options, and the correct answer key. We instructed the teachers to only include exams that had accompanying answer keys, and to exclude problems that contained images. Additionally, we organized a one-hour workshop with all the teachers to discuss the data collection procedure and address any questions or concerns. All teachers were paid competitively, above the Indonesian average monthly wage.

4 The seven teachers were selected from 70 applicants.

Table 2: Subject areas in IndoMMLU. "SD", "SMP", "SMA", and "UE" indicate that questions in the subject area are available in primary school, junior high school, senior high school, and university entrance exams, respectively.

Quality Control
To ensure the accuracy of the data entry process, we randomly checked questions collected by each teacher. We manually verified the questions, multiple-choice options, and the corresponding answer keys against the given URL, and found that each teacher conducted the work accurately. We additionally performed automatic filtering to remove repetitive questions and questions without an answer key.
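A minimal sketch of such automatic filtering, assuming the collected questions sit in a CSV file; the file name and the columns question, options, and answer_key are hypothetical, since the storage format is not specified here.

```python
import pandas as pd

# Hypothetical schema: the paper does not specify how the collected
# questions are stored, so file and column names are placeholders.
df = pd.read_csv("indommlu_raw.csv")

# Remove repetitive questions: drop exact duplicates of the question
# text and its options, keeping the first occurrence.
df = df.drop_duplicates(subset=["question", "options"])

# Remove questions that have no answer key (missing or empty cell).
df = df[df["answer_key"].notna() & (df["answer_key"].str.strip() != "")]

df.to_csv("indommlu_clean.csv", index=False)
```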

Data Statistics
After data cleansing, we obtained a total of 14,906 questions, distributed over school levels and subjects as detailed in Figure 1; the details of each subject area are given in Table 2 and the Appendix. 14% of the questions come from university entrance exams. Table 1 shows the average question length for each education level and subject area. We observe that primary school questions tend to be shorter, while university entrance exam questions are longer. Indonesian language questions have the highest average length, while local languages and culture questions average around 88 characters.
Experiments

For closed-source models, we evaluate questions by comparing the first generated tokens (e.g. A, B, C) against the answer key using a regular expression. For open-source models, we benchmark two strategies. Given a question and the corresponding multiple-choice options, we calculate: (1) the probability of the full generated answer; and (2) the probability of the first token in the generated answer. For the first, we select the answer with the highest length-normalized log-likelihood, and for the second, we simply select the key token (e.g., C) with the highest probability among all possible keys.
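The two open-source scoring strategies can be sketched as follows with Hugging Face transformers. This is a simplified illustration, not the paper's exact setup: the checkpoint, prompt format, and option keys are placeholders, the sketch assumes a causal LM (a seq2seq model like mT0 would use AutoModelForSeq2SeqLM analogously), and it assumes the prompt tokenizes as a prefix of prompt + answer.

```python
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; the paper evaluates models such as BLOOMZ and mT0.
tok = AutoTokenizer.from_pretrained("bigscience/bloomz-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-560m")
model.eval()

def answer_logprob(prompt: str, answer: str) -> float:
    """Strategy 1: length-normalized log-likelihood of the full answer text."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Logits at position t predict the token at position t+1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    answer_ids = full_ids[0, prompt_len:]
    token_lps = log_probs[prompt_len - 1:].gather(1, answer_ids.unsqueeze(1))
    return token_lps.mean().item()  # normalize by answer length

def predict_by_full_answer(prompt: str, options: dict[str, str]) -> str:
    """Pick the option whose full answer text is most likely."""
    return max(options, key=lambda k: answer_logprob(prompt, " " + options[k]))

def predict_by_first_token(prompt: str, keys=("A", "B", "C", "D", "E")) -> str:
    """Strategy 2: compare the next-token probability of each key token."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        next_logits = model(ids).logits[0, -1]
    key_ids = {k: tok(" " + k, add_special_tokens=False).input_ids[-1] for k in keys}
    return max(keys, key=lambda k: next_logits[key_ids[k]].item())

def extract_key(generated: str) -> str | None:
    """For closed-source models: read the key off the first generated tokens."""
    m = re.match(r"\s*([A-E])\b", generated)
    return m.group(1) if m else None
```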

Results
Figure 4 presents the zero-shot accuracy when using: (1) the full answer probability; and (2) the probability of the first token in the generated answer. Among the open-source LLMs, including XGLM (7.5B), Falcon (40B), BLOOMZ (7.1B), mT0-xxl (13B), LLaMA (65B), and Bactrian-X (13B), we find that estimating the answer based on the probability of the first token generally performs best, with the notable exception of XGLM. We therefore report results under this configuration in the remaining sections; full results for both settings can be found in the Appendix.
Results across all models Table 3 shows the average accuracy for each subject area across the 24 models. To compute the scores, we disregard the education level of the questions, average scores within each subject (e.g. Biology), and finally combine the scores across all subjects in each subject area (e.g. STEM). Random performance varies between 20% and 27% due to the differing number of multiple-choice options (three to five).
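A minimal sketch of this two-stage aggregation, assuming per-question results in a pandas DataFrame with hypothetical columns subject, area, and correct:

```python
import pandas as pd

# Hypothetical per-question results: subject (e.g. "Biology"), its
# subject area (e.g. "STEM"), and whether the model answered correctly.
results = pd.DataFrame({
    "subject": ["Biology", "Biology", "Physics", "Civics"],
    "area":    ["STEM",    "STEM",    "STEM",    "Social"],
    "correct": [1, 0, 1, 1],
})

# Stage 1: accuracy per subject, pooling all education levels.
per_subject = results.groupby(["area", "subject"])["correct"].mean()

# Stage 2: average the per-subject scores within each subject area,
# so that subjects with many questions do not dominate the area score.
per_area = per_subject.groupby("area").mean()
print(per_area * 100)
```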
Overall, we find that GPT-3.5 attains the highest overall accuracy, albeit a low 53.2%. GPT-3.5 is also notably the strongest in every subject area except local languages and cultures. Among the open-source models, mT0-xxl (13B) performs best, with an average accuracy of 42.5%. The recently released Falcon (40B) performs worse than both mT0-xxl (13B) and BLOOMZ (7B).
Performance does not track model size: smaller models such as BLOOMZ (7B) and mT0-xxl (13B) outperform Falcon (40B) and LLaMA (65B). We suspect that this is due to the absence of Indonesian from Falcon and LLaMA's pre-training data. The poor performance of the 13B and 30B LLaMA models suggests that any "emergent abilities" of LLMs generally appear only in the same or closely-related languages. This is further supported by Bactrian-X-LLaMA (13B), a LLaMA model fine-tuned on instruction datasets in 52 languages (including Indonesian), which obtains a +5% average improvement over LLaMA (13B).

Results across education levels As illustrated in Figure 1, IndoMMLU includes detailed education level metadata, which enables a deeper understanding of the capabilities of LLMs in terms of human education levels. In the Indonesian context, the minimum passing score for exams varies across subjects and typically ranges between 65 and 70.8 Setting the passing score at 65, we assess GPT-3.5's real-world knowledge capabilities, as shown in Table 4, where green indicates that the model has passed the subject and red indicates that it has failed. GPT-3.5 generally performs well on primary school exams for general subjects, but exhibits a lack of understanding of local languages and cultures. In subjects that require less analytical thinking, such as civics and religion, GPT-3.5 tends to achieve higher scores even in high school exams.

8 This refers to Curriculum 2013 in Indonesia.
Indonesian language proficiency of LLMs As discussed in Section 3, IndoMMLU includes Indonesian language exams for all grades and education levels, allowing us to assess the Indonesian language proficiency of LLMs. Figure 5 illustrates that GPT-3.5 achieves its highest accuracy in grade 1, approaching 90%. However, as the education level increases, the model's performance gradually declines. For grades 3 and above, the scores fall below 75, and for grades 7 and above, GPT-3.5 fails to pass the exams. We observe a similar trend for mT0-xxl and BLOOMZ, which only pass grades 1, 2, and 3. This fine-grained evaluation provides a valuable benchmark for LLM proficiency in Indonesian.
LLM performance on local languages and cultures It is interesting to observe in Table 3 that, despite having only 13B parameters, mT0-xxl achieves the highest accuracy on local languages and cultures. GPT-3.5, with 175B parameters, achieves competitive accuracy, just 0.3 points lower than mT0-xxl. To investigate further, Figure 6 displays the accuracy scores for each local language and culture subject, revealing that mT0-xxl and GPT-3.5 excel in different subjects: mT0-xxl shows greater familiarity with Javanese and Sundanese, with a +10 advantage in both subjects over GPT-3.5, while GPT-3.5 performs better on Dayak Ngaju, Banjarese, and Minangkabau culture.

Analysis
Few-shot evaluation Few-shot inference does not yield improvements for instruction-tuned models such as mT0 and BLOOMZ, as evidenced by a decrease in accuracy.9 In contrast, the pure LLMs Falcon and LLaMA show better performance with few-shot inference than zero-shot. These findings align with those of Liu et al. (2023) and Li et al. (2023b), who observe that few-shot prompts may lead to unnatural inferences for instruction-tuned models.

9 Refer to the Appendix for details of the prompts.
Model confidence Given the top three models in Table 3, we assess whether their confidence estimates (i.e. the predicted likelihood of the predicted answer being correct) correspond to actual accuracy across the 63 tasks. This uncertainty calibration gives us hints about the models' reliability and how to use them appropriately in real-world applications. For GPT-3.5, which does not expose token probabilities, we adopt self-consistency (Wang et al., 2022), using a high temperature value (0.7) during decoding. For each question, we generate n different outputs and measure self-consistency: the probability of a multiple-choice option is calculated based on its output frequency. In this experiment, we use n = 7, and choose the most frequently-occurring answer as the final prediction.
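A minimal sketch of this self-consistency procedure; sample_answer is a placeholder for whichever sampling API is being evaluated, and the answer-key regex is illustrative.

```python
import re
from collections import Counter

def sample_answer(prompt: str, temperature: float = 0.7) -> str:
    """Placeholder: one sampled generation from the model under evaluation."""
    raise NotImplementedError

def self_consistency_predict(prompt: str, n: int = 7) -> tuple[str | None, float]:
    """Sample n outputs at high temperature and majority-vote over the keys.
    The relative frequency of each key serves as the option's probability."""
    keys = []
    for _ in range(n):
        m = re.match(r"\s*([A-E])\b", sample_answer(prompt))
        if m:
            keys.append(m.group(1))
    if not keys:
        return None, 0.0
    key, count = Counter(keys).most_common(1)[0]
    return key, count / n  # prediction and its frequency-based confidence
```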
We average the confidence scores across the 63 tasks, and display the calibration of mT0, BLOOMZ, and GPT-3.5 in Figure 8. We observe that all three models are well calibrated, with correlation scores of r > 0.85.
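The calibration check amounts to correlating per-task mean confidence with per-task accuracy; a sketch with made-up numbers for illustration:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-task summaries: for each of the 63 tasks, the model's
# mean confidence in its chosen answers and its actual accuracy.
mean_confidence = np.array([0.91, 0.62, 0.48, 0.77])
accuracy = np.array([0.88, 0.57, 0.51, 0.73])

r, _ = pearsonr(mean_confidence, accuracy)
print(f"calibration r = {r:.2f}")  # well-calibrated models show r > 0.85
```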
Additionally, we examine the relationship between confidence scores and question length, as depicted in Figure 9. We find a very weak correlation for both mT0 and BLOOMZ. Since the confidence score can also be interpreted as a measure of question difficulty, this suggests that question length has no bearing on difficulty.

Impact of negation
In Indonesian school exam questions, negation is commonly used to increase question difficulty and assess students' reasoning abilities. Similarly, in NLP, negation is known to increase the difficulty of NLP tasks (Truong et al., 2022). To investigate the impact of negation, we employ a simple string-matching strategy to identify questions that contain negation within each subject area. We then break down the accuracy of the top three models (GPT-3.5, mT0, and BLOOMZ) by the presence or absence of negation. Among the subject areas, Indonesian language and social science employ negation most frequently, accounting for approximately 10% of questions in each group. Through manual inspection of 100 random samples, we verified that 85% of the flagged questions indeed contain negation.
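A minimal sketch of such string matching; the keyword list below is illustrative only, as the paper's exact list is not reproduced here.

```python
import re

# Illustrative Indonesian negation markers: kecuali "except",
# bukan "not (a)", tidak "not". The paper's actual keyword list may differ.
NEGATION_WORDS = ["kecuali", "bukan", "tidak"]
PATTERN = re.compile(r"\b(" + "|".join(NEGATION_WORDS) + r")\b", re.IGNORECASE)

def has_negation(question: str) -> bool:
    """Flag exam questions that contain a negation marker."""
    return PATTERN.search(question) is not None

# Example: "Berikut ini adalah hewan mamalia, kecuali ..."
# ("The following are mammals, except ...") would be flagged.
```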
Table 5 shows the effect of negation on IndoMMLU accuracy. For the Indonesian language subject area, negated questions prove more challenging, with accuracy decreasing by 4 to 10 points. In social science, mT0 and BLOOMZ are likewise more accurate on questions without negation. Compared to mT0, however, BLOOMZ is less robust to negation, as indicated by its 5-point accuracy drop.
Discussion

If LLMs are to be deployed in diverse contexts, it is critical to have more work on evaluation for different languages and cultures. In Table 3 we observed that the models struggle to answer questions about local languages and cultures across all levels of education in Indonesia. Minangkabau culture in particular is taught and assessed in the Indonesian language, and yet the limited performance on questions relating to it underscores a lack of cultural knowledge, despite reasonable results for the Indonesian language itself.
We also argue that education science should play a more central role in the future evaluation of LLMs. Current NLP work has mostly focused on developing larger models with different techniques and architectures, and evaluation has primarily been in terms of specific NLP tasks. Education science has decades of experience in evaluating student progress through painstakingly-designed comprehensive assessments, and the NLP community should engage with it more closely. With IndoMMLU, we have shown that exam questions across fine-grained education levels offer a deeper understanding of model proficiency in the Indonesian language, while also revealing potential areas for improvement.

Conclusion
In this paper, we presented IndoMMLU, a multi-task language understanding benchmark for real-world evaluation of knowledge in the Indonesian context. By leveraging education level metadata, we found that current LLMs like GPT-3.5 are only able to pass primary school exams in Indonesia, while smaller models struggle across nearly all education levels. Notably, none of the 24 evaluated models performs well on local languages and cultures, highlighting the need for further research in this direction.

Limitations
Despite being the largest question-answering dataset in the Indonesian context, IndoMMLU still has some limitations, in that it lacks: (1) multimodal questions; (2) arithmetic reasoning tasks; and (3) essay-style questions. First, IndoMMLU comprises solely text-based questions, and questions with tables and figures were discarded to simplify data collection. Second, we specifically exclude math questions as they are already well covered by existing English math reasoning benchmarks. Third, we suggest that essay questions would enable a deeper assessment of comprehension and critical thinking, but methods for evaluating essay quality across education levels in languages other than English are severely lacking.

Ethical Considerations
The IndoMMLU dataset used in our study is collected from publicly-available web resources. In compliance with Indonesian Copyright Law number 28 year 2014, specifically article 44, the use, retrieval, reproduction, and/or modification of works and/or related rights products, in whole or in substantial part, is not considered a copyright infringement if the source is fully cited or mentioned for educational and research purposes.

Regarding our experimental results, it is important to note that they do not provide a definitive answer as to the relative abilities of LLMs, and we caution readers against overinterpreting the findings. While we conclude that GPT-3.5 can pass primary school exams in Indonesia based on IndoMMLU, it is essential to consider potential contamination of GPT-3.5's pretraining data, which could impact the results. Furthermore, real-world student assessment encompasses not only multiple-choice questions but also practical exams, laboratory work, and essay writing.

Table 10: With the exception of GPT-3.5 (Ouyang et al., 2022), all the models used in this study were sourced from Huggingface (Wolf et al., 2020).
LLaMA (65B): huggyllama/llama-65b
Bactrian-X-LLaMA (7B): MBZUAI/bactrian-x-llama-7b-lora
Bactrian-X-LLaMA (13B): MBZUAI/bactrian-x-llama-13b-lora

Figure 2: The first question focuses on the family relationship between anak pisang "children" and induak bako "aunt on the father's side"; both terms are commonly used in Minangkabau but not in the Indonesian language. The second and third questions pertain to traditional art; kawih in the third question means a song set to a distinctive beat in Sundanese culture. The left side is the original text and the right side is the English translation, provided for illustrative purposes. The bolded options are the correct answer keys.

Figure 3: Examples of civics and chemistry exam questions. The left side is the original text and the right side is the English translation, provided for illustrative purposes. The bolded options are the answer keys.

Figure 4: LLM performance (% accuracy) based on: (1) the probability of the full generated answer; and (2) the probability of the first token in the generated answer.

Figure 5: Fine-grained accuracy (%) of GPT-3.5, mT0-xxl, and BLOOMZ in the Indonesian language subject area. The horizontal line depicts the passing score of 65; education level 13 refers to the university entrance exam.

Figure 8: Zero-shot calibration of mT0-xxl, BLOOMZ, and GPT-3.5 across the 63 tasks. The average standard deviations of the confidence scores across all data points are 36.5, 26.4, and 43.9, respectively.

Figure 9: Correlation between question difficulty and question length.

Figure 10: Illustration of our few-shot prompt template. The English translation on the right is solely for illustrative purposes. In our experiments, we used up to three examples within the prompt. The placeholders [SUBJECT], Example-i, Answer-i, and QUESTION correspond to the subject, the i-th question example, the answer key for the i-th question example, and the main question, respectively.

Table 3: Zero-shot performance (% accuracy) of LLMs, combined across education levels. "Average" denotes the average across all subject areas in IndoMMLU.

C Zero-shot Performance Based on the Probability of the Full Generated Answer

Table 9: Zero-shot performance (% accuracy) of LLMs based on the probability of the full generated answer, aggregated across education levels. "Average" denotes the average across all subject areas in IndoMMLU.