Goal-Oriented Script Construction

The knowledge of scripts, common chains of events in stereotypical scenarios, is a valuable asset for task-oriented natural language understanding systems. We propose the Goal-Oriented Script Construction task, in which a model produces a sequence of steps to accomplish a given goal. We pilot our task on the first multilingual script learning dataset, supporting 18 languages and collected from wikiHow, a website containing half a million how-to articles. For baselines, we consider both a generation-based approach using a language model and a retrieval-based approach that first retrieves the relevant steps from a large candidate pool and then orders them. We show that our task is practical, feasible but challenging for state-of-the-art Transformer models, and that our methods can be readily deployed for various other datasets and domains with decent zero-shot performance.


Introduction
A script is a standardized sequence of events about stereotypical activities (Feigenbaum et al., 1981). For example, "go to a restaurant" typically involves "order food", "eat", "pay the bill", etc. Such script knowledge has long been proposed as a way to enhance AI systems (Abelson and Schank, 1977). Specifically, task-oriented dialog agents (e.g. those in the Alexa Prize challenge, https://developer.amazon.com/alexaprize) may greatly benefit from the understanding of goal-oriented scripts. However, the evaluation of script knowledge remains an open question (Chambers, 2017). Moreover, it is unclear whether current models can generate complete scripts. Such an ability is in high demand for recent efforts to reason about complex events (Li et al., 2020; Wen et al., 2021).

* Equal contribution. Our models and data are available at https://github.com/veronica320/wikihow-GOSC.

Figure 1: An example script constructed by our Step-Inference-Ordering pipeline in a zero-shot manner. The input is a goal, and the output is an ordered list of steps.
We propose the task of Goal-Oriented Script Construction (GOSC) to holistically evaluate a model's understanding of scripts. Given a goal (or the name of a script), we ask the model to construct the sequence of steps (or events in a script) to achieve the goal. This task targets a model's ability to narrate an entire script, subsuming most existing evaluation tasks. Our rationale is that a model that understands some scripts (e.g. how to "travel abroad" and "go to college") should be able to produce new ones (e.g. how to "study abroad") using the absorbed knowledge, close to how humans learn.
While almost all prior script learning work has focused on English, we introduce a novel multilingual corpus. Our corpus is collected from wikiHow (wikihow.com), a website of how-to articles in 18 languages. The articles span a wide range of domains, from commonplace activities like going to a restaurant to more specific ones like protecting oneself from the coronavirus.
We train and evaluate several baseline systems on our GOSC task. First, we consider a generation-based approach where a pretrained language model, multilingual T5, is finetuned to produce scripts from scratch. As an alternative, observing that most desired steps can be drawn from the training scripts thanks to their scale and high coverage, we also propose a retrieval-based approach. Concretely, we develop a Step-Inference-Ordering pipeline that uses existing models to retrieve relevant steps and order them. We also improve the pipeline with techniques such as multitask learning. The experiments show that the GOSC task is challenging but feasible for state-of-the-art Transformers. Furthermore, we show that our pipeline trained on wikiHow can generalize to other datasets and domains (see an example in Figure 1). On three classic script corpora, OMICS, SMILE, and DeScript, it achieves strong zero-shot performance. It can also be directly deployed to construct scripts in distant domains (e.g. military/political).
In this paper, we make several contributions: 1) We propose the GOSC task targeting the comprehensive understanding of scripts. 2) We introduce the first multilingual script learning dataset, available in 18 languages. 3) We compare generation-based and retrieval-based approaches using both automatic and human judgments, which demonstrate the feasibility but also the difficulty of GOSC. 4) We show that our approach can be readily applied to other datasets or other domains.

Related Work
The notion of scripts (Abelson and Schank, 1977), or schemas (Rumelhart, 1975), encodes the knowledge of standardized event sequences. We dissect previous work on script learning into two lines, narrative and procedural.
One line of work focuses on narrative scripts, where declarative, or descriptive, knowledge is distilled from narrative texts like news or stories (Mujtaba and Mahapatra, 2019). Such scripts are not goal-oriented, but descriptions of sequential events (e.g. a traffic accident involves a collision, injuries, police intervention, etc.). Chambers and Jurafsky (2008) introduced the classic Narrative Cloze Test, where a model is asked to fill in the blank given a script with one missing event. Following the task, a few papers made extensions on representation (Chambers and Jurafsky, 2009; Pichotta and Mooney, 2014) or modeling (Jans et al., 2012; Pichotta and Mooney, 2016a,b,c), achieving better performance on Narrative Cloze. Meanwhile, other work re-formalized Narrative Cloze as language modeling (LM) (Rudinger et al., 2015) or multiple-choice (Granroth-Wilding and Clark, 2016) tasks. However, the evolving evaluation datasets contain more spurious scripts, with many uninformative events such as "say" or "be", and the LMs tend to capture such cues (Chambers, 2017).
The other line of work focuses on procedural scripts, where events happen in a scenario, usually in order to achieve a goal. For example, to "visit a doctor", one should "make an appointment", "go to the hospital", etc. To obtain data, Event Sequence Descriptions (ESDs) are usually collected by crowdsourcing and cleaned to produce scripts. Thus, most such datasets are small-scale, including OMICS (Singh et al., 2002), SMILE (Regneri et al., 2010), the Li et al. (2012) corpus, and DeScript (Wanzare et al., 2016). The evaluation tasks are diverse, ranging from event clustering and event ordering (Regneri et al., 2010) to text-script alignment (Ostermann et al., 2017) and next event prediction (Nguyen et al., 2017). There are also efforts on domain extensions (Yagcioglu et al., 2018; Berant et al., 2014) and modeling improvements (Frermann et al., 2014; Modi and Titov, 2014).
In both lines, it remains an open problem what kind of automatic task most accurately evaluates a system's understanding of scripts. Most prior work has designed tasks focusing on various fragmented pieces of such understanding. For example, Narrative Cloze assesses a model's knowledge for completing a close-to-finished script. The ESD line of work, on the other hand, evaluates script learning systems with the aforementioned variety of tasks, each of which nonetheless touches upon only a specific piece of script knowledge. Recent work has also brought forth generation-based tasks, but mostly within an open-ended/specialized domain like story or recipe generation (Fan et al., 2018; Xu et al., 2020).
Regarding data source, wikiHow has been used in multiple NLP efforts, including knowledge base construction (Jung et al., 2010; Chu et al., 2017), household activity prediction (Nguyen et al., 2017), summarization (Koupaee and Wang, 2018; Ladhak et al., 2020), event relation classification (Park and Motahari Nezhad, 2018), and next passage completion (Zellers et al., 2019). A few recent papers (Zhou et al., 2019; Zhang et al., 2020b) explored a set of separate goal-step inference tasks, mostly in binary-classification/multiple-choice formats with few negative candidates. Our task is more holistic and realistic, simulating an open-ended scenario with retrieval/generation settings. We combine two of our existing modules from Zhang et al. (2020b) into a baseline, but a successful GOSC system can certainly include other functionalities (e.g. paraphrase detection). Also similar is Zhang et al. (2020a), which, however, does not include an extrinsic evaluation on other datasets/domains.
In summary, our work has the following important differences with previous papers: 1) Existing tasks mostly evaluate fragmented pieces of script knowledge, while GOSC is higher-level, targeting the ability to invent new, complete scripts. 2) We are the first to study multilingual script learning. We evaluate several baselines and make improvements with techniques like multitask learning. 3) Our dataset improves upon the previous ones in multiple ways, with higher quality than the mined narrative scripts, lower cost and larger scale than the crowdsourced ESDs. 4) The knowledge learned from our dataset allows models to construct scripts in other datasets/domains without training.

Goal-Oriented Script Construction
We propose the Goal-Oriented Script Construction (GOSC) task. Given a goal g, a system constructs a complete script as an ordered list of steps S, with a ground-truth reference T. As a hint of the desired level of granularity, we also provide an expected number of steps (or length of the script), l, as input. Depending on whether the set of possible candidate steps is given in advance, GOSC can happen in two settings: Generation or Retrieval.
In the Generation setting, the model must generate the entire script from scratch.
In the Retrieval setting, a large set of candidate steps C is given. The model must predict a subset of steps S from C, and provide their ordering.
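As an illustration, a single GOSC instance in either setting can be represented by a small data structure. The field names below are ours for exposition, not part of any released API:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class GOSCInstance:
    """One GOSC task instance (field names are illustrative)."""
    goal: str                      # e.g. "Go to a Restaurant", without "How to"
    length: int                    # expected number of steps, l
    reference: List[str]           # ground-truth ordered steps, T
    ordered: bool = True           # whether the gold script is ordered
    candidates: Optional[List[str]] = None  # candidate pool C (Retrieval only)

inst = GOSCInstance(
    goal="Go to a Restaurant",
    length=3,
    reference=["Order food", "Eat", "Pay the bill"],
)
print(inst.goal, inst.length)
```

In the Generation setting `candidates` stays `None`; in the Retrieval setting it holds the pool C from which the model must select and order a subset.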
For script learning, we extract from each wikiHow article the following critical components to form a goal-oriented script. Goal: the title stripped of "How to"; Section: the header of a "method" or a "part" which contains multiple steps; Steps: the headlines of step paragraphs; Category: the top-level wikiHow category. An example wikiHow script is shown in Figure 2.
Our previous corpus provides labels of whether each English article is ordered, predicted by a high-precision classifier. We project these labels to other languages using the cross-language links in each wikiHow article. For articles without a match to English, the label defaults to unordered. In our task setup, we only require the model to order the steps if an article is ordered.
For all experiments below, we randomly hold out 10% of the articles in each language as the test set, and use the remaining 90% for training and development. We use the corpus to construct a dataset for multilingual GOSC. For the Retrieval setting, the set of candidate steps C comprises all the steps present in the test set. However, we observe that not only may the large number of steps render the evaluation intractable, but most steps are also evidently distant from the given goal. To conserve computing power, we restrict C to all the steps from articles within the same wikiHow category for each script.

Figure 3: Step-Inference-Ordering pipeline for the GOSC Retrieval task. An example ordered script is shown with example steps in the input and output; those that appear in the ground-truth script are in bold.

Models
We develop two systems based on state-of-the-art Transformers for the GOSC task. Reproducibility details can be found in Appendix C.

Generation Approach: Multilingual T5
For the Generation setting, we finetune mT5 (Xue et al., 2021), a pretrained generation model that is not only state-of-the-art on many tasks but also the only massively multilingual one available to date.
During finetuning, we provide the goal of each article in the training set as a prompt, and train the model to generate the sequence of all the steps conditioned on the goal. Therefore, the model's behavior is similar to completing the task of inferring relevant steps and sorting them at once. At inference time, the model generates a list of steps given a goal in the test set.

Retrieval Approach: Step-Inference-Ordering Pipeline

We then implement a Step-Inference-Ordering pipeline for the Retrieval setting. Our pipeline contains a Step Inference model to first gather the set of desired steps, and a Step Ordering model to order the steps in the set. These models are based on our previous work (Zhang et al., 2020b). Under the hood, the models are pretrained XLM-RoBERTa (Conneau et al., 2020) or mBERT (Devlin et al., 2019) for binary classification, both state-of-the-art multilingual representations.

Our Step Inference model takes a goal and a candidate step as input, and outputs whether the candidate is indeed a step toward the goal, with a confidence score. During training, for every script, its goal forms a positive example along with each of its steps. We then randomly sample 50 steps from other scripts within the same wikiHow category and pair them with the goal as negative examples. The model predicts a label for each goal-step pair with a cross-entropy loss. During evaluation, for each script in the test set, every candidate step is paired with the given goal as the model input. We then rank all candidate steps by the model's confidence scores in decreasing order. Finally, the top l steps are retained, where l is the required length.

Our Step Ordering model takes a goal and two steps as input, and outputs which step happens first. During training, we sample every pair of steps in each ordered script as input to the model with a cross-entropy loss. During evaluation, we give every pair of retrieved steps as input, and count the total number of times that a step is ranked before others. We then sort all steps by this count to approximate their complete ordering.
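The count-and-sort approximation used by the ordering stage can be sketched as follows. Here `happens_first` is a hypothetical stand-in for the trained pairwise classifier; the toy comparator at the end merely exercises the sorting-by-wins logic:

```python
from itertools import combinations

def order_steps(steps, happens_first):
    """Approximate a total order from pairwise predictions.

    `happens_first(a, b)` returns True if the (stand-in) classifier
    predicts step `a` precedes step `b`. Each step is scored by how
    many pairwise comparisons it wins, then steps are sorted by that
    count in decreasing order.
    """
    wins = {s: 0 for s in steps}
    for a, b in combinations(steps, 2):
        if happens_first(a, b):
            wins[a] += 1
        else:
            wins[b] += 1
    return sorted(steps, key=lambda s: wins[s], reverse=True)

# Toy comparator: order strings by their integer prefix.
cmp = lambda a, b: int(a.split(":")[0]) < int(b.split(":")[0])
print(order_steps(["2:eat", "1:order food", "3:pay"], cmp))
# ['1:order food', '2:eat', '3:pay']
```

Note that with a noisy classifier the win counts only approximate a total order, which is why the paper evaluates ordering with Kendall's τ rather than exact-match.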
An illustration of our Step-Inference-Ordering pipeline is shown in Figure 3. We also consider two additional variations.

Multitask Learning (MTL): The Step Inference and the Step Ordering models share the encoder layer, but have separate classifier layers. During training, the MTL system is presented with a batch of examples from each task in an alternating fashion. During evaluation, the corresponding classifier is used.

Cross-Lingual Zero-Shot Transfer (C0): While there are abundant English training scripts, data in some other languages are scarce. Hence, we also attempt to directly evaluate the English-trained models on non-English data.

In-Domain Evaluation
To demonstrate the performance of models on the GOSC task, we evaluate them on our multilingual wikiHow dataset using both automatic metrics and human judgments. The ultimate utility for this task is the extent to which a human can follow the constructed steps to accomplish the given goal.
As direct user studies might be costly and hard to standardize, we carefully choose measures that adhere to this utility. By default, all models are trained and evaluated on the same language.

Auto Evaluation for Generation Setting
To automatically evaluate models in the Generation Setting, we report perplexity and BERTScore (Zhang et al., 2019), as two frequently used metrics for evaluating text generation.
The mean perplexity of mT5 on the test set of each language is shown in Table 1. The results show a large range of variation. To see if perplexity correlates with data size, we conduct a two-tailed Spearman's rank correlation test. Between the perplexity and the number of articles per language in our dataset, we find a Spearman's ρ of −0.856 (p = 1e-5); between the perplexity and the number of tokens per language in the mC4 corpus on which mT5 is pretrained, we find a Spearman's ρ of −0.669 (p = 2e-4). These statistics suggest a significant correlation between perplexity and data size, while other typological factors are open to investigation. Table 1 also shows the BERTScore F1 of the generated scripts compared against the gold scripts. Except for English (.82), the performance across different languages varies within a relatively small margin (.65 to .72). However, we note that as a metric based on token-level pairwise similarity, BERTScore may not be the most suitable metric for evaluating scripts. It is best designed for aligned texts (e.g. a machine-translated sentence and a human-translated one), whereas in scripts, certain candidate steps might not have aligned reference steps. Moreover, BERTScore does not measure whether the ordering among steps is correct.
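A correlation test of this kind takes only a few lines. The sketch below implements Spearman's ρ from scratch via the no-ties formula; the per-language figures are made-up placeholders, not the paper's data:

```python
def rank(values):
    """Assign 1-based ranks (ties not handled; fine for distinct values)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman_rho(x, y):
    """Spearman's rank correlation, no-ties formula:
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Illustrative values only: perplexity vs. number of training articles.
perplexity = [12.3, 45.1, 30.2, 80.7, 22.5]
n_articles = [110000, 5000, 15000, 1200, 40000]
print(spearman_rho(perplexity, n_articles))  # -1.0: perfectly inverse ranks
```

In practice a library routine such as `scipy.stats.spearmanr` would also supply the p-value reported in the paper.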
To address these flaws, we further perform human evaluation in Section 6.3.

Auto Evaluation for Retrieval Setting
To automatically evaluate models in the Retrieval setting, we first calculate accuracy, i.e. the percentage of predicted steps that exist in the ground-truth steps. To account for the ordering of steps, we also compute Kendall's τ between the overlapping steps in the prediction and the ground-truth. The performance of our Step-Inference-Ordering pipeline using mBERT and XLM-RoBERTa on all 18 languages is shown in Figure 4. (XLM-RoBERTa is unable to converge on the Step Ordering training data for all but 3 languages, despite a large set of hyperparameter combinations.) Complete results can be found in Appendix D. Across languages, the results are generally similar, with a large room for improvement. On average, our best system constructs scripts with around 30% accuracy and around 0.2 Kendall's τ compared to the ground-truth. Compared to the baseline, our multitask and cross-lingual zero-shot variations demonstrate significant improvement on ordering. This is especially notable in low-resource languages: for example, MTL on Korean and C0 on Thai both outperform their baseline by 0.17 on Kendall's τ.
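These two metrics can be sketched as follows, treating steps as exact-match strings as in the paper's evaluation (a simplified illustration; the example steps are hypothetical):

```python
def script_accuracy(pred, gold):
    """Fraction of predicted steps that appear in the gold script."""
    gold_set = set(gold)
    return sum(s in gold_set for s in pred) / len(pred)

def overlap_kendall_tau(pred, gold):
    """Kendall's tau over the steps shared by prediction and gold,
    comparing their relative orders in the two lists."""
    gold_pos = {s: i for i, s in enumerate(gold)}
    shared = [s for s in pred if s in gold_pos]
    n = len(shared)
    if n < 2:
        return 0.0
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            if gold_pos[shared[i]] < gold_pos[shared[j]]:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

pred = ["order food", "pay the bill", "eat"]
gold = ["order food", "eat", "pay the bill"]
print(script_accuracy(pred, gold))      # 1.0
print(overlap_kendall_tau(pred, gold))  # 0.333...: one pair swapped
```

τ ranges from 1 (identical ordering of the shared steps) to −1 (fully reversed), so the reported 0.2 average indicates a weak but positive ordering signal.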

Human Evaluation
To complement automatic evaluation, we ask 6 annotators (graduate students who are native or proficient speakers of the assigned language) to each edit 30 output scripts produced by the Step-Inference-Ordering pipeline and mT5 in English, French, Chinese, Japanese, Korean and Hindi, respectively. The edit process consists of a sequence of two possible actions: either 1) delete a generated step entirely if it is irrelevant, nonsensical or not a reasonable step toward the given goal, or 2) move a step somewhere else if the order is incorrect. Then, the generated script is evaluated against the edited script in 3 aspects: Correctness, approximated by the length (number of steps) of the edited script over that of the originally constructed script (cf. precision); Completeness, approximated by the length of the edited script over that of the ground-truth script (cf. recall); and Orderliness, approximated by Kendall's τ between overlapping steps in the edited script and the generated script.

The results are shown in Table 3. While the constructed scripts in the Retrieval setting contain more correct steps, their ordering is significantly worse than that of the Generation setting. This suggests that the generation model is better at producing fluent texts, but can easily suffer from hallucination.
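The two length-ratio approximations above reduce to simple counts (a minimal sketch; the step strings are hypothetical):

```python
def correctness(edited, constructed):
    """Steps surviving the edit over constructed steps (cf. precision)."""
    return len(edited) / len(constructed)

def completeness(edited, gold):
    """Steps surviving the edit over ground-truth steps (cf. recall)."""
    return len(edited) / len(gold)

constructed = ["find quotes", "sing a song", "write them down"]
edited = ["find quotes", "write them down"]  # annotator deleted one bad step
gold = ["find a book", "find quotes", "write them down", "decorate"]
print(correctness(edited, constructed))  # 0.666...
print(completeness(edited, gold))        # 0.5
```

Orderliness reuses Kendall's τ over the steps shared by the edited and generated scripts, so a move action lowers τ while a delete action lowers Completeness.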

Qualitative Examples
To understand models' behavior, we present two representative scripts produced by the mBERT Retrieval model and the mT5 Generation model side by side, accompanied by the ground-truth script, shown in Figure 5.
The retrieved "Draw Santa Claus" script has a high step accuracy (85%), with a reasonable ordering of drawing first the outline and then details. The generation output is more off-track, hallucinating irrelevant details like "singing" and "scorpion", despite being on the general topic of drawing. It also generates more repetitive steps (e.g. the head is drawn twice), most of which are abridged.
As for "Make a Quotebook", the retrieved script has a 50% step accuracy. The third step, though not in the gold reference, is similar enough to "find some quotes", suggesting that our exact-match evaluation is not perfect. In the generated script, all steps are also generally plausible, but some essential steps are missing (e.g. find a book, find quotes). This suggests that the generation model dwells too much on details, ignoring the big picture. These patterns in the two scripts are common in the model outputs, a larger sample of which is included in the Supplementary Materials.

Zero-shot Transfer Learning
To show the potential of our model for transfer learning, we use the retrieval-based Step-Inference-Ordering pipeline finetuned on wikiHow to construct scripts for other datasets and domains. We quantitatively evaluate our model on 4 other script learning corpora, and qualitatively analyze some constructed scripts in a case study. These corpora are in the format of different scenarios (e.g. "eat in a restaurant", similar to our goals), each with a number of event sequence descriptions (ESDs, similar to our steps). Statistics for each corpus are in Table 4.

Quantitative Evaluation
For each dataset, we select the ESD with the most steps for every scenario as a representative script to avoid duplication, thus converting the dataset to a GOSC evaluation set under the Retrieval setting. We then use the XLM-RoBERTa-based Step-Inference-Ordering pipeline trained on our English wikiHow dataset to directly construct scripts on each target set, and report its zero-shot performance in Table 4. We see that 30% to 60% of steps are accurately retrieved, and around 40% are correctly ordered. This is close to or even better than the in-domain results on our English test set. As a comparison, a random baseline would achieve only 0.013 accuracy and 0.004 τ on average. Both facts indicate that the script knowledge learned from our dataset is clearly non-trivial.

Case Study: The Bombing Attack Scripts
To explore whether the knowledge about procedural scripts learned from our data can also facilitate the zero-shot learning of narrative scripts, we present a case study in the context of the DARPA KAIROS program (www.darpa.mil/program/knowledge-directed-artificial-intelligence-reasoning-over-schemas). One objective of KAIROS is to automatically induce scripts from large-scale narrative texts, especially in the military and political domain. We show that models trained on our data of commonplace events can effectively transfer to vastly different domains.
With the retrieval-based script construction model finetuned on wikiHow, we construct five scripts of different granularity levels under the Improvised Explosive Device (IED) attack scenario: "Roadside IED attack", "Backpack IED attack", "Drone-borne IED attack", "Car bombing IED attack", and "IED attack". We take the name of each script as the input goal, and a collection of related documents retrieved from Wikipedia and Voice of America news as data sources for extracting step candidates.
Our script construction approach has two components. First, we extract all events according to the KAIROS Event Ontology from the documents using OneIE (Lin et al., 2020). The ontology defines 68 event primitives, each represented by an event type and multiple argument types; e.g. a Damage-type event has arguments including Damager, Artifact, Place, etc. OneIE extracts all event instances of the predefined primitives from our source documents. Each event instance contains a trigger and several arguments (e.g. Trigger: "destroy", Damager: "a bomber", Artifact: "the building", ...). All event instances form the candidate pool of steps for our target script.
Since the events are represented as trigger-argument tuples, a conversion to raw textual form is needed before inputting them into our model. This is done by automatically instantiating the corresponding event type template in the ontology with the extracted arguments. If an argument is present in the extracted instance, we directly fill it in the template; otherwise, we fill in a placeholder word (e.g. "some", "someone", depending on the argument type). For example, the template of Damage-type events is "[arg1] damaged [arg2] using [arg3] instrument", which can be instantiated as "A bomber damaged the building using some instrument". Next, we run the Step-Inference-Ordering pipeline from Section 5.2 on the candidate pool given the "goal". The only modification is that, since we do not have a gold reference script length in this case, all retrieved steps with a confidence score higher than a threshold (default 0.95) are retained in the final script.

Figure 6: An example narrative script produced by our retrieval-based pipeline trained on wikiHow. Each event is represented by its Event Type and an example sentence.
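The template-filling step can be sketched as follows. This is a minimal illustration: the bracketed slot names and placeholder words mirror the example above but are not the exact KAIROS ontology format:

```python
def instantiate(template, args, placeholders):
    """Fill an event-type template with extracted arguments.

    Slots present in `args` are filled directly; missing slots fall
    back to a placeholder word chosen per argument type.
    """
    out = template
    for slot, placeholder in placeholders.items():
        out = out.replace(slot, args.get(slot, placeholder))
    return out

# Illustrative Damage-type template and a partially extracted instance.
template = "[arg1] damaged [arg2] using [arg3] instrument"
args = {"[arg1]": "A bomber", "[arg2]": "the building"}
placeholders = {"[arg1]": "someone", "[arg2]": "something", "[arg3]": "some"}
print(instantiate(template, args, placeholders))
# A bomber damaged the building using some instrument
```

The resulting sentences are then scored against the goal exactly like wikiHow step headlines, which is what makes the zero-shot transfer possible.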
We manually evaluate the constructed scripts with the metrics defined in Section 6.3, except Completeness, as we do not have gold references. The 5 constructed scripts have an average Correctness of 0.735 and Orderliness of 0.404. Despite the drastic domain shift from wikiHow to KAIROS, our model can still exploit its script knowledge to construct scripts decently. An example script, "Roadside IED attack", is shown in Figure 6. All the steps retrieved are sensible, and most are ordered, with a few exceptions (e.g. the ManufactureAssemble event should precede all others).

Limitations
Event representation: Our representation of goals and steps as natural language sentences, though containing richer information, brings extra difficulty in handling steps with similar meanings. For example, "change strings frequently" and "put on new strings regularly" have nearly identical meanings, and both are correct steps for the goal "maintain a guitar". Hence, both could be included by a retrieval-based model, which is not desired.
Modeling: Since GOSC is a new task, there is no previously established SOTA to compare with. We build a strong baseline for each setting, but they are clearly not the necessary or sufficient means to do the task. For example, our Step-Inference-Ordering pipeline would benefit from a paraphrasing module that eliminates semantic duplicates in retrieved steps. It also currently suffers from long run-time, especially with a large pool of candidates, since it requires pairwise goal-step inference. An alternative is to filter out most irrelevant steps in advance using similarity-based heuristics.

Evaluation: Under the retrieval-based setting, our automatic evaluation metrics do not give credit to inexact matches as discussed above, which can also be addressed by a paraphrasing module. Meanwhile, for the generation-based setting, BERTScore, or other comparison-based metrics like BLEU (Papineni et al., 2002) and METEOR (Denkowski and Lavie, 2014), may not be the most suitable for evaluating scripts. They are best designed for aligned texts like translation pairs, and do not measure whether the ordering among steps is correct. While we complement them with manual evaluation, only one human annotator is recruited for each language, resulting in potential subjectivity. Alternatively, crowdsourcing-based evaluation is costly and hard to standardize. Due to the complexity of the GOSC task and its evaluation, we suggest that future work investigate better means of evaluation.

Conclusion and Future Work
We propose the first multilingual script learning dataset and the first task to evaluate the holistic understanding of scripts. By comprehensively evaluating model performance automatically and manually, we show that state-of-the-art models can produce complete scripts both in- and out-of-domain, with a large room for improvement. Future work should investigate additional aspects of scripts, such as usefulness, granularity, etc., as well as their utility for downstream tasks that require automated reasoning.

A Corpus Statistics

Table 5 shows the statistics of our multilingual wikiHow script corpus.

B Evaluation Details
In Section 3, we formalize the Goal-Oriented Script Construction (GOSC) task as follows: Given a goal g, the model is asked to construct a complete script as an ordered list of steps S, with a ground-truth reference T . As a hint of the desired level of granularity, we also provide an expected number of steps (or length of the script), l, as input.
In the Retrieval setting, a set of candidate steps C is also available. We evaluate an output script from two angles: content and ordering.
First, we calculate the accuracy, namely the percentage of predicted steps that exist in the ground-truth. Denote s_i as the i-th step in S:

\mathrm{Acc}(S, T) = \frac{1}{|S|} \sum_{i=1}^{|S|} \mathbb{1}[s_i \in T]
If the gold script is ordered, we further evaluate the ordering of the constructed script by calculating Kendall's τ between the intersection of the predicted steps and the ground-truth steps:

\tau = \frac{N_C - N_D}{\binom{|S \cap T|}{2}}

where N_C is the number of concordant pairs and N_D the number of discordant pairs; A ∩ B is used as a special notation for the intersection of ordered lists, denoting elements that appear in both A and B, in the order of A.
It is likely that a model includes two modules: a retrieval module and an ordering module. In this case, it is sensible to separately evaluate these two modules.
To evaluate the retrieval module independently, assume that the model retrieves a large set of steps R ranked by their relevance to the goal g. Denote r_i as the i-th step in R. We calculate recall and normalized discounted cumulative gain (NDCG) at position k, assuming k > l.
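Assuming the standard definitions of these two metrics with binary relevance (an exact-match step either is or is not in the ground-truth), they can be sketched as:

```python
import math

def recall_at_k(ranked, gold, k):
    """Fraction of gold steps found in the top-k ranked steps."""
    gold_set = set(gold)
    hits = sum(s in gold_set for s in ranked[:k])
    return hits / len(gold)

def ndcg_at_k(ranked, gold, k):
    """NDCG@k with binary relevance: DCG of the ranking divided by
    the DCG of an ideal ranking that puts all gold steps first."""
    gold_set = set(gold)
    rels = [1.0 if s in gold_set else 0.0 for s in ranked[:k]]
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(gold), k)))
    return dcg / ideal if ideal else 0.0

ranked = ["find a book", "sing a song", "find quotes"]  # hypothetical ranking
gold = ["find a book", "find quotes"]
print(recall_at_k(ranked, gold, 3))  # 1.0: both gold steps in the top 3
print(ndcg_at_k(ranked, gold, 3))    # < 1.0: one gold step ranked below a miss
```

Unlike recall, NDCG rewards placing correct steps near the top of the ranking, which matters when only the top l steps are retained.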
To evaluate the ordering module independently, we directly give the model the set of ground-truth steps to predict an ordering. We again use Kendall's τ to evaluate the ordered steps:

\tau(\hat{T}, T)

where \hat{T} is the set of ground-truth steps as ordered by the model.

In the Generation setting, a model is evaluated using perplexity on the test set, following standard practice:

\mathrm{PPL}(S) = \exp(-L(S)/N)

where L(S) is the log-likelihood of the sequence of steps assigned by the model and N is the number of tokens in the sequence. When evaluating a model on multiple scripts, all aforementioned metrics are averaged.

C Modeling Details
All our models are implemented using the HuggingFace Transformers library. For all experiments, we hold out 5% of the training data for development.
For mBERT, XLM-RoBERTa and RoBERTa, we finetune the pretrained models on our dataset using the standard SequenceClassification pipeline in HuggingFace. For mT5, we refer to the official finetuning scripts from the project's GitHub repository.
For each in-domain evaluation experiment, we perform grid search on the learning rate from 1e-5 to 5e-8, the batch size from 16 to 128 whenever possible, and the number of epochs from 3 to 10. As mBERT and XLM-RoBERTa have a large number of hyperparameters, most of which remain default, we do not list them here. Instead, the hyperparameter values and pretrained models will be available publicly via HuggingFace model sharing. We choose the model with the highest validation performance to be evaluated on the test set. For the Retrieval setting, we consider the accuracy of constructed scripts; for the Generation setting, we consider perplexity.
We run our experiments on an NVIDIA GeForce RTX 2080 Ti GPU, with half-precision floating point format (FP16) with O1 optimization. The experiments in the Retrieval setting take 3 hours to 5 days in the worst case for all languages. The experiments in the Generation setting take 2 hours to 1 day in the worst case for all languages.

Table 6: The GOSC Retrieval performance of multitask learning mBERT. Results higher than those produced by the single-task mBERT are in bold.

D Additional Results
Our complete in-domain evaluation results can be found in Tables 6, 7, and 8.

E More Qualitative Examples
Aside from the examples shown in Section 6.4, we show 2 more example scripts constructed by the mBERT baseline under the Retrieval setting (Section 5.2) and by the mT5 baseline under the Generation setting (Section 5.1). For each script name, the Retrieval output and the Generation output are shown side by side.