IndoToD: A Multi-Domain Indonesian Benchmark For End-to-End Task-Oriented Dialogue Systems

Task-oriented dialogue (ToD) systems have mostly been created for high-resource languages, such as English and Chinese. However, there is a need to develop ToD systems for other regional or local languages to broaden their ability to comprehend dialogue contexts in various languages. This paper introduces IndoToD, an end-to-end multi-domain ToD benchmark in Indonesian. We extend two English ToD datasets to Indonesian, comprising four different domains, and use delexicalization to efficiently reduce the size of the annotations. To ensure high-quality data collection, we hire native speakers to manually translate the dialogues. Along with the original English datasets, these new Indonesian datasets serve as an effective benchmark for evaluating Indonesian and English ToD systems, as well as for exploring the potential benefits of cross-lingual and bilingual transfer learning approaches.


Introduction
Task-oriented dialogue (ToD) systems are conversational agents designed to interact with users and assist them in various domains, such as restaurant search (Bordes et al., 2016; Wen et al., 2017), public transport (Budzianowski et al., 2018; Lin et al., 2021b), and in-car assistance (Eric et al., 2017). These systems also serve as the base for many commercial products built on the dialogue systems approach because of their ability to operate and understand the dialogue context without hand-crafted rules (Lin et al., 2021b). Despite the recent growing interest in developing end-to-end ToD systems due to their simplicity, ToD systems are mostly created using monolingual datasets in high-resource languages such as English and Chinese. Moreover, building a ToD system can be challenging due to the limited availability of datasets for training and evaluating the system, which has been identified as the most critical factor preventing the creation of bilingual/multilingual ToD systems (Wen et al., 2017; Razumovskaia et al., 2021).
Developing ToD systems in additional underrepresented languages is essential to expand their capabilities to understand dialogue contexts in diverse languages (Kanakagiri and Radhakrishnan, 2021). One of these languages is Indonesian, which is spoken by many people worldwide yet still considered underrepresented in end-to-end ToD research. Indonesia ranks fourth in the world in number of internet users based on the latest per-country data (Aji et al., 2022), with around 212 million internet users. However, the language is still categorized as underrepresented in the natural language processing (NLP) community because of problems such as scattered datasets and minimal community engagement (Wilie et al., 2020; Cahyawijaya et al., 2021, 2022). To the best of our knowledge, there is only one publicly available end-to-end Indonesian ToD dataset, COD (Majewska et al., 2023), a multilingual ToD dataset that is solely used for evaluation, with only a test set available. It has a very limited number of Indonesian samples, 194 dialogues across 11 domains (∼18 dialogues per domain), and no training data. This emphasizes the need to create larger end-to-end Indonesian ToD datasets to expand the capabilities to build and evaluate localized Indonesian ToD systems.
To address the aforementioned issues, we propose IndoToD, an end-to-end multi-domain ToD benchmark in Indonesian. IndoToD comprises two parallel Indonesian end-to-end ToD datasets covering four different domains, built by manually translating two existing English datasets, CamRest676 (Wen et al., 2017) and SMD (Eric et al., 2017), using delexicalization and lexicalization processes, as well as an evaluation of existing end-to-end ToD frameworks in various settings: monolingual, cross-lingual, and bilingual. IndoToD provides more dialogue samples than COD (Majewska et al., 2023) and can be utilized for both training and evaluation. This paper's contributions are threefold:
• We introduce IndoToD, a multi-domain benchmark for Indonesian ToD systems. The benchmark comprises two datasets, IndoCamRest and IndoSMD, with four different domains that serve as resources for training and evaluation.
• We establish baselines in monolingual, bilingual, and cross-lingual settings using existing end-to-end ToD frameworks.
• We analyze the effectiveness of training on bilingual datasets to leverage more training data and handle tasks in both languages when building ToD systems.

IndoToD Benchmark
The IndoToD benchmark is created to develop Indonesian ToD systems. The benchmark covers four different domains: restaurant search, point-of-interest (POI) navigation, calendar scheduling, and weather information. IndoToD extends existing English datasets, and we follow a streamlined process to conduct dialogue collection. The dialogues are multi-turn conversations involving two speakers (i.e., a user U and a system S). For each conversation, there is a knowledge base (KB) the system uses to generate the correct entity for the user.

Dataset Collection
We use CamRest676 (Wen et al., 2017) and SMD (Eric et al., 2017) as the original English datasets, which go through several stages before becoming datasets usable in various Indonesian ToD system experiments. CamRest676 is a ToD dataset that focuses on restaurant search queries collected via the Wizard-of-Oz (WoZ) framework (Kelley, 1984). It consists of a collection of dialogues between the user and the system, where each dialogue has a task-specific goal (e.g., finding a restaurant). In this dataset, the user acts as a client who requests restaurant information and the system acts as an information provider that guides and answers the user's requests.
Meanwhile, SMD is an in-car assistant multi-domain ToD dataset. It covers several domains, namely POI navigation, calendar scheduling, and weather information. This dataset was also created using the same WoZ framework to obtain a high-quality dataset that imitates a conversation between two individuals in a way that resembles a real-life interaction between a driver and an in-car assistant. In the WoZ framework, the source of conversation is created through human-to-human dialogue collected via crowd-sourcing. Because of that, the conversations between the user and the system are more natural. These datasets are then used as references for generating a ToD dataset that mimics natural conversation between two speakers and includes relevant knowledge base information for generating accurate responses. The two datasets provide a diverse variety of dialogues since they have different sets of domains that complement the IndoToD benchmark.

Dataset Construction
The datasets are created through several steps, as shown in Figure 1. First, the existing English datasets are delexicalized into user and system template sentences using the approach of Madotto et al. (2020). Then, these template sentences are translated into Indonesian by native Indonesian annotators. Lastly, the new Indonesian datasets are built through the KB retrieval and lexicalization process.
Delexicalization Initially, the datasets are pre-processed by delexicalizing the entities. The process is carried out using the source code implemented by Madotto et al. (2020) to remove the entities in the dataset's dialogues. The aim of this step is to reduce the number of dialogues that require annotation (translation) by replacing all entities with a common, pre-defined value based on the entity type. This pre-processing method is effective: the number of sentences that need to be translated is reduced by 30% and 34% for CamRest676 and SMD, respectively. The output of this step is a list of pre-processed, entity-less sentences for each dataset.
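To illustrate, delexicalization can be sketched as replacing each KB entity mention with a typed placeholder. The `value_` prefix below follows the convention visible in the dataset examples, but the helper itself is a hypothetical simplification, not the actual Madotto et al. (2020) code:

```python
def delexicalize(utterance: str, kb_entities: dict) -> str:
    """Replace each KB entity surface form with a typed placeholder.

    kb_entities maps surface forms to slot types,
    e.g. {"cheap": "pricerange", "north": "area"}.
    """
    # Replace longer surface forms first so a shorter entity
    # never clobbers part of a longer one.
    for surface in sorted(kb_entities, key=len, reverse=True):
        utterance = utterance.replace(surface, f"value_{kb_entities[surface]}")
    return utterance

template = delexicalize(
    "i want a cheap restaurant in the north part of town",
    {"cheap": "pricerange", "north": "area"},
)
# template == "i want a value_pricerange restaurant in the value_area part of town"
```

Because many dialogues differ only in their entity values, each template needs to be translated only once, which is where the 30-34% reduction comes from.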
Translation The list of pre-processed sentences is translated into Indonesian by annotators. In total, 3,834 and 1,126 sentences are translated from CamRest676 and SMD, respectively. This process is followed by cross-validation, where each annotator checks the work of the other annotators to maintain the dataset's quality. Note that for the SMD dataset, we only use 11% (323 dialogues) of the original dialogues due to limitations on the annotator side.

KB Retrieval and Lexicalization
After the translation process, to build each dialogue, a collection of Indonesian sentences and their corresponding entities is needed to begin the lexicalization process. We retrieve the Indonesian entities by querying the KB for the slots required to construct the dialogue. The lexicalization process is then conducted using the Indonesian entities and sentences retrieved earlier. This process ends with a human evaluation of 100 randomly selected newly created Indonesian dialogues to detect any errors and ensure the quality of the resulting dialogues. If there are any errors, we ask the evaluators to manually edit the utterance. A sample of corrected Indonesian utterances is presented in Appendix Table 1.
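As a sketch of the reverse step (again with a hypothetical helper, using the `value_<slot>` placeholder convention from the dataset examples), lexicalization fills each placeholder with the entity retrieved from the KB:

```python
import re

def lexicalize(template: str, kb_record: dict) -> str:
    """Fill each value_<slot> placeholder with the KB entity for that slot.

    Placeholders with no matching KB slot are left untouched so that
    errors remain visible during the human evaluation pass.
    """
    return re.sub(
        r"value_(\w+)",
        lambda m: kb_record.get(m.group(1), m.group(0)),
        template,
    )

utterance = lexicalize(
    "value_name menyediakan makanan value_food.",
    {"name": "Bangkok City", "food": "internasional"},
)
# utterance == "Bangkok City menyediakan makanan internasional."
```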

Dataset Statistics
We collect a total of 999 dialogues with 6,338 utterances across two datasets, named IndoCamRest and IndoSMD, which are derived from the CamRest676 and SMD datasets, respectively. We compare several statistical attributes of our datasets (IndoCamRest and IndoSMD) and COD in Table 1. Furthermore, the per-domain statistics of IndoCamRest and IndoSMD are presented in Table 2. IndoCamRest and IndoSMD share some of the same statistical characteristics as the previous datasets, including the number of domains, intents, and slot types. It is worth noting that both of our datasets have a substantially larger number of dialogues per domain compared to COD.
In addition, IndoCamRest and IndoSMD comprise both the lexicalized and delexicalized forms of the dialogues. The inclusion of the delexicalized form is intended to facilitate the generation of dialogues by lexicalizing the delexicalized sentences.
Experimental Setup

Experiment Settings
We set up a benchmark for both Indonesian and English ToD to evaluate the performance of current ToD systems. We explore three different training and evaluation settings:
• Monolingual - id/en. We train the end-to-end ToD systems using a monolingual corpus in each language independently.
• Cross-lingual. We train the systems using the English end-to-end ToD dataset before testing on an Indonesian test set to analyze the effectiveness of the cross-lingual approach in understanding the context of the dialogue.
• Bilingual - id+en. We train the systems by combining the English and Indonesian datasets, utilizing the parallel corpora from CamRest and SMD. This allows us to examine how the system's performance is affected by exposure to both languages.
For the monolingual and bilingual settings, we test each ToD framework on the English and Indonesian test sets separately to analyze the impact of monolingual and bilingual training in each language. We also split the data into 3:1:1 and 8:1:1 (train:validation:test) ratios for CamRest and SMD, respectively.
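A deterministic split by those ratios could look like the following sketch (the seed and helper are our own illustration, not the paper's actual split code):

```python
import random

def split_dialogues(dialogues, ratio=(3, 1, 1), seed=0):
    """Shuffle and split dialogues into train/validation/test by an integer ratio."""
    items = list(dialogues)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    total = sum(ratio)
    n_train = len(items) * ratio[0] // total
    n_val = len(items) * ratio[1] // total
    return (
        items[:n_train],                 # train
        items[n_train:n_train + n_val],  # validation
        items[n_train + n_val:],         # test (absorbs any rounding remainder)
    )

# CamRest676 has 676 dialogues; a 3:1:1 split yields roughly 405/135/136.
train, val, test = split_dialogues(range(676), ratio=(3, 1, 1))
```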

ToD Baselines
As baselines, we evaluate four ToD frameworks: Sequicity (Lei et al., 2018), LABES (Zhang et al., 2020a), MinTL (Lin et al., 2020), and GALAXY (He et al., 2022). Since the four ToD frameworks are designed for English, we made adjustments to the frameworks to suit the requirements of our experiments. We largely adopt the configuration settings used in the original papers, including batch size, decoding method, early stop count, and learning rate.
To adapt the Sequicity and LABES frameworks for our experiments, we utilize fastText (Bojanowski et al., 2017) instead of GloVe (Pennington et al., 2014). This is necessary because fastText provides English, Indonesian, and bilingual Indonesian-English word vectors (Conneau et al., 2017). Specifically, we incorporate the Common Crawl English and Indonesian fastText word vectors for the monolingual English and Indonesian experiment settings, respectively. Meanwhile, we use bilingual English-Indonesian word vectors (Conneau et al., 2017) for the other settings.
Unlike the other frameworks, MinTL utilizes T5 (Raffel et al., 2020) and BART (Lewis et al., 2020) as its backbone, which sets it apart regarding approach and potential performance. Note that for this experiment, we only use T5 (Raffel et al., 2020) and mT5 (Xue et al., 2021) as the backbone models, and use this framework without Levenshtein Belief Spans (Lev), resulting in an inference process similar to Sequicity. In all of the experiment settings, mT5-small is implemented in the framework and used as its backbone. We also use T5 for the monolingual and bilingual settings to analyze the impact of using two different kinds of pre-trained language models (LMs), as discussed later in Subsection 4.2. We implement reader and evaluator modules for both the CamRest and SMD datasets in MinTL, as its paper only included a reader and evaluator for the MultiWOZ (Budzianowski et al., 2018) dataset.

Hyper-parameters
We conduct a learning rate search over [1e-4, 6e-4, 1e-5] to optimize the performance of MinTL on the new datasets; the best learning rate is 6e-4. For the other ToD frameworks, we adopt their proposed learning rates: 3e-3 for Sequicity, 1e-4 for GALAXY, and, for LABES, 3e-4 and 1e-4 in the CamRest and SMD experiments, respectively. We follow the inference sampling strategies recommended by the original papers that introduce the frameworks. We use a beam size of 5 for generating inference responses in GALAXY, while in the other frameworks we employ greedy search. For uniformity, we run each experiment once per setting on an A100 40GB GPU.

Evaluation Metrics
We use four evaluation metrics for end-to-end ToD tasks to evaluate the quality of responses: 1) the BLEU score (Papineni et al., 2002), which assesses how fluent the generated response is; 2) the Match rate (Lin et al., 2020), which checks whether the ToD system can produce the entity constraints specified by the user, thereby measuring task completion; 3) the Success F1 score (Lei et al., 2018), which determines whether the system has successfully provided all of the information the user requested, such as an address or phone number; and 4) the Combined score (Lin et al., 2020; Mehri et al., 2019), computed as Combined = (Match + Success) × 0.5 + BLEU. We assess the performance of all frameworks on the CamRest and SMD datasets under the designated experiment settings.
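The combined score can be computed directly from the equation above (assuming, as is common in ToD evaluation, that all three component metrics are on a 0-100 scale):

```python
def combined_score(match: float, success_f1: float, bleu: float) -> float:
    """Combined = (Match + Success) * 0.5 + BLEU."""
    return (match + success_f1) * 0.5 + bleu

# e.g. Match 90.0, Success F1 80.0, BLEU 20.0:
score = combined_score(match=90.0, success_f1=80.0, bleu=20.0)
# score == 105.0
```

Note that the combined score can exceed 100, since it is a weighted sum rather than an average.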

Results
Tables 3 and 4 show the experiment results on the Indonesian and English test sets, respectively. GALAXY achieves decent scores in both monolingual settings for CamRest, while Sequicity with fastText word vectors is the best model in the monolingual settings for SMD. Furthermore, MinTL outperforms the other frameworks in the zero-shot cross-lingual setting for both datasets. The non-contextual word-embedding frameworks, Sequicity and LABES, achieve the highest combined scores in the bilingual settings. We also observe that GALAXY does not perform as well as the other frameworks when dealing with Indonesian; we denote its results by '-' in Table 3. Additionally, we present the combined metric results of the ToD frameworks on the IndoSMD test set, categorized by domain, in Figure 2. We find that POI navigation is relatively easier to learn than the other domains, while calendar scheduling is the hardest domain for ToD frameworks on the SMD dataset. Fine-grained results can be found in Appendix Table 2.

Analysis
Monolingual English vs. Indonesian Performance. In general, the monolingual model trained and tested in English performs better than in Indonesian across different frameworks and datasets. We hypothesize that the pre-trained LMs used in the experiments are better trained to handle English data. GALAXY performs best in the monolingual setting on both the Indonesian and English CamRest datasets. However, we find that the GALAXY framework fails to adapt to the IndoSMD dataset and the model does not converge. We further observe that the experiments using GALAXY in Indonesian are not as effective as in English, as evidenced by its superior performance on the English CamRest dataset, despite it having the same number of dialogues and utterances as IndoCamRest. Furthermore, it is worth noting that the IndoToD benchmark does not use Indonesian T5 models, because no Indonesian T5 models comparable to the English T5 models used in our experiments have been released.
Transferability from English to Indonesian. Based on the cross-lingual results, we conclude that existing ToD frameworks are unable to handle the task when trained in a different language. While not all training data needs to be in the same language as the test set, it is still essential for the framework to have some knowledge of the target language in order to understand the task at hand. Based on the inference results, we observe that all of the ToD frameworks generate responses in English despite Indonesian user inputs. This correlates with the fact that the training data used in the cross-lingual setting is in English. Among all the ToD frameworks evaluated in this setting, MinTL performs the best (due to mT5, its pre-trained model, being trained on multilingual datasets), while GALAXY and Sequicity do not perform well.
Monolingual vs. Bilingual. The results demonstrate that, in most cases, the bilingual setting yields higher scores than the monolingual settings (i.e., the Indonesian and English monolingual settings), especially on the Indonesian test set. The bilingual setting has the advantage of using a larger amount of training data and performing tasks in both languages (Lin et al., 2021b), increasing both the overall metric scores and the individual metric components. We also compare and analyze each framework's performance closely based on the test set used. Most bilingual-setting scores are higher than monolingual-setting scores on both the Indonesian and English test sets, showing the effectiveness of using the counterpart language's data for domain adaptation.
However, based on the inference results in Table 5, we observe that the responses generated in the bilingual and monolingual settings are not substantially different in terms of semantics, confirmed by the fact that the difference in BLEU scores between the two settings is not significant. Nevertheless, we see some improvements in belief span generation in the bilingual setting. ToD frameworks can understand the user's context better when trained with a larger amount of training data, even though the training data are a mixture of English and Indonesian. This highlights the significance of leveraging cross-lingual data, in this case using English data on top of Indonesian data, in the training process of a ToD system to enhance model performance and achieve better metric scores in the target language.

Furthermore, we delve deeper into the potential of bilingual training using the MinTL framework. Interestingly, this framework is flexible since, unlike the other frameworks, we can swap the pre-trained transformer LM. We explore not only English T5 but also the multilingual version of T5. We compare the performance of T5-small and mT5-small as the backbone of the MinTL framework, shown in Figure 3 and, for per-metric details, Appendix Table 3. Based on our findings, bilingual training positively impacts the BLEU score, match rate, and combined metrics for the MinTL framework with either T5 or mT5. The results indicate that the bilingual setting can benefit the training of T5 and mT5, especially in Indonesian, which is considered an underrepresented language. Moreover, we notice that mT5-small is more effective than T5-small on end-to-end ToD tasks. Overall, bilingual training is beneficial for training end-to-end ToD systems, especially for improving performance on non-English languages.
Error Analysis. We conduct an error analysis to investigate the limitations of existing frameworks. In general, most issues take the form of a "template response": the end-to-end ToD systems sometimes generate responses that appear template-like despite varying user inputs, which might be due to the limited trainable parameters and datasets used in the training process. Beyond completing the user's task, a ToD system must also maintain user satisfaction when used as a product; users may become disinterested if they receive repeated responses, potentially leading to reduced usage. Another issue we notice is the occurrence of repeated tokens in the output generated by the frameworks during the inference phase. Specifically, we find that the Sequicity and MinTL frameworks tend to produce responses with repeated tokens more often than the other ToD frameworks. This could be due to the decoding method applied by these frameworks in our research, which relies on greedy search. The other issue worth mentioning in the inference examples is hallucination in the generated outputs: some outputs do not have a correct combination of slots found in the training set.

Related Work
ToD Datasets. Multiple studies have contributed to creating datasets for ToD using various approaches. bAbI (Bordes et al., 2016) is an English synthetic dataset combining five subtasks into ToD system datasets. SMD (Eric et al., 2017) uses the WoZ scheme to build an English car assistant dialogue system that has knowledge of weather and places. Other English datasets using the WoZ scheme are CamRest (Wen et al., 2017), a dataset for restaurant reservation; MultiWoZ (Budzianowski et al., 2018), a multi-domain WoZ dataset containing five domains collected using the human-to-human scheme; and OpenDialKG (Moon et al., 2019), a synthetic dataset containing four domains. For non-English languages such as Chinese, there are CrossWoZ (Zhu et al., 2020), a large-scale Chinese WoZ dataset with five domains, and RiSAWoZ (Quan et al., 2020), a large-scale multi-domain Chinese WoZ dataset with rich semantic annotations spanning up to 12 domains. MetaLWOz (Lee et al., 2019) is a goal-oriented dialogue corpus containing 51 domains, also collected using WoZ. Other datasets cover bilingual or multilingual settings, such as BiToD (Lin et al., 2021b), a dataset for tourism assistants focusing on the English and Chinese languages, and COD (Majewska et al., 2023), a dataset containing Russian, Arabic, Indonesian, and Swahili. Based on the aforementioned datasets, we conclude that a ToD dataset with contextual knowledge focusing on Indonesian is not yet available.
Indonesian Dialogue Systems Datasets. Research on ToD systems in the Indonesian language is still in its early stages. IndoNLG (Cahyawijaya et al., 2021), a benchmark for natural language generation in low-resource languages, focuses on summarization, question answering, chit-chat, and machine translation for the Indonesian, Javanese, and Sundanese languages. An Indonesian-language subset of the XPersona (Lin et al., 2021a) dataset, which consists of approximately 17 thousand dialogues, is used for the chit-chat task. It is worth noting that although IndoNLG is making significant progress, it has not yet explored ToD systems. Furthermore, NusaCrowd (Cahyawijaya et al., 2022), a large pool of hundreds of Indonesian NLP datasets, lists only one Indonesian ToD dataset, COD (Majewska et al., 2023), a multilingual ToD dataset that is solely used for evaluation, with only 194 dialogues spanning 11 domains. Therefore, Indonesian ToD datasets are still very limited, and urgent action is required to increase their coverage.
Frameworks for End-to-End ToD. GALAXY (He et al., 2022) is a pre-trained dialogue model that uses semi-supervised learning to learn dialogue policy from limited labeled dialogues and large-scale unlabeled dialogue corpora. LABES (Zhang et al., 2020a), conversely, represents belief states as discrete latent variables and models them jointly with system responses given user inputs. MinTL (Lin et al., 2020) efficiently uses pre-trained LMs in developing task-oriented dialogue systems, eliminating the need for ad hoc modules. DAMD (Zhang et al., 2020b) is a network that accommodates the state-action pair structure in generation and can leverage the proposed multi-action data augmentation framework to address the multi-domain response generation problem. Sequicity (Lei et al., 2018) uses a Seq2Seq model for dialogue state tracking and for generating a response to the user. Some other recent works (Bang et al., 2023; Hudeček and Dušek, 2023) explore the potential of using large LMs for performing zero-shot ToD. In this paper, we utilize GALAXY, LABES, MinTL, and Sequicity to evaluate the performance of various end-to-end ToD frameworks.

Conclusion
We introduce IndoToD, an end-to-end multi-domain Indonesian ToD benchmark. We extend two existing English ToD datasets using an efficient framework that delexicalizes the conversations into templates and reconstructs them using the KB. We evaluate our benchmark in monolingual, cross-lingual, and bilingual settings, and show the benefits of using English data to improve Indonesian performance. Furthermore, we conduct analysis from multiple viewpoints that shows existing ToD systems' inability to understand dialogue context in an unseen language and characterizes current ToD systems' behavior in handling an underrepresented language such as Indonesian using our datasets. We also investigate errors that commonly occur in ToD system responses to explore their limitations.

Limitations
There are several limitations that hindered our progress in creating the first Indonesian end-to-end ToD benchmark. Firstly, we encountered a limitation in translating the English datasets. Due to resource constraints, specifically having only three annotators, we could only translate the CamRest dataset and a mere 10% of the SMD dataset. Expanding the dataset size would significantly contribute to advancing Indonesian ToD systems. Second, we only use publicly available end-to-end ToD framework repositories as our baselines. Based on the MultiWoZ benchmark (Budzianowski et al., 2018), there are several end-to-end ToD frameworks with slightly superior metric scores compared to our baselines, but because of their unavailability, we could not evaluate them and focused on evaluating the best available ToD frameworks.

Ethics Statement
Our work spotlights the need to develop ToD systems for underrepresented languages such as Indonesian through our new ToD datasets, IndoCamRest and IndoSMD. Throughout our study, we commit to the ethical principles of NLP research and are well aware of its impact on the language community. We strongly believe that there is no ethical issue within this work, including the data collection process, annotation, existing dataset usage, and experiments. Within our work, annotators are well rewarded and gave us informed consent, as they understand and agree to provide their annotations to build new ToD datasets. We uphold our annotators' privacy and follow data protection and privacy regulations for releasing the datasets. The datasets themselves are free from abusive language and personal information, as our ultimate goal in this work is to contribute and make an impact on society by providing useful NLP task resources and more linguistic diversity in the NLP field.
The table compares the monolingual and bilingual approaches on the English test set. The results show that the bilingual setting has a beneficial impact on MinTL, helping it learn the task better than the monolingual setting.

E Annotator Instruction and Consent
We report the instructions given to annotators for translating English CamRest and SMD utterance templates, as well as the annotator informed consent template, in Appendix Figures 1, 2, and 3. The instructions are in Indonesian, as all of the annotators are native Indonesian speakers.

Figure 1: Illustration of the dialogue collection pipeline: a) Delexicalize dialogue conversations to create English templates; b) Translate English conversation templates into Indonesian; c) Query the KB to retrieve corresponding entities; d) Lexicalize the dialogue conversations to collect Indonesian dialogues.

Figure 2: ToD framework results on IndoSMD per domain. Left: POI navigation domain; Center: Calendar scheduling domain; Right: Weather information domain.

Table 1: Comparison of our IndoToD benchmark with the Indonesian subset of COD.

Table 2: Per-domain statistics of IndoCamRest and IndoSMD.

Table 3: Experiment results on the Indonesian test set. Bold denotes the best score per metric.

Table 4: Experiment results on the English test set. Bold denotes the best score per metric.

Table 5: Example of the LABES response on IndoCamRest using monolingual and bilingual settings. English translations are shown in parentheses.

User input: "carikan saya restoran mahal yang menyediakan makanan internasional." (find me an expensive restaurant that serves international food)

Belief spans
- gold: "restaurant-food": "internasional", "restaurant-pricerange": "mahal"
- monolingual: "restaurant-food": "internasional", "restaurant-pricerange": "menengah"
- bilingual: "restaurant-food": "internasional", "restaurant-pricerange": "mahal"

Response
- gold: "maafkan saya. saya tidak bisa menemukan restoran apapun yang menyediakan makanan internasional di kisaran harga yang mahal" (I am sorry. I couldn't find any restaurant that provides international food at such an expensive price range)
- monolingual: "value_name adalah restoran mahal yang menyediakan makanan internasional. apakah kamu ingin tahu lokasinya?" (value_name is an expensive restaurant that serves international food. Would you like to know its location?)
- bilingual: "value_name menyediakan makanan internasional. apakah kamu ingin informasi lebih lanjut?" (value_name serves international food. Would you like further information?)