META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI

Task-oriented dialogue (TOD) systems have been widely used by mobile phone intelligent assistants to accomplish tasks such as calendar scheduling or hotel reservations. Current TOD systems usually focus on multi-turn text/speech interaction, then call back-end APIs designed for TODs to perform the task. However, this API-based architecture greatly limits the information-searching capability of intelligent assistants and may even lead to task failure if TOD-specific APIs are not available or the task is too complicated to be executed by the provided APIs. In this paper, we propose a new TOD architecture: the GUI-based task-oriented dialogue system (GUI-TOD). A GUI-TOD system can directly perform GUI operations on real APPs and execute tasks without invoking TOD-specific backend APIs. Furthermore, we release META-GUI, a dataset for training a Multi-modal convErsaTional Agent on mobile GUI. We also propose a multi-modal action prediction and response model, which shows promising results on META-GUI. The dataset, code and leaderboard are publicly available.


Introduction
Recent years have witnessed the rapid development of task-oriented dialogue systems (Zhang et al., 2020; Ni et al., 2022; Chen et al., 2022, 2017). They have been widely applied to customer support, booking systems and especially intelligent personal assistants. These task-oriented dialogue systems work in a similar pipeline: first identify the user intent, then extract the necessary information through slot-filling. After getting enough information for the task, the agent calls backend APIs (provided by APP developers) to fetch information, and then generates a response based on the query result.
There are some drawbacks to this framework. Firstly, TODs rely on publicly accessible APIs or APIs designed for TODs to perform tasks, but such APIs may not exist in real-life APPs, which hinders the application of TODs. Secondly, the system must be customized to recognize the pre-defined API-related slots, which limits its generality.
Consider how humans perform tasks on smartphones: they do not need a parametric API, but finish tasks by interacting with the GUI (graphical user interface), indicating that the GUI is a more general interface. Previous studies explore how to translate natural language commands into GUI operations (Mazumder and Riva, 2021; Pasupat et al., 2018; Xu et al., 2021a). These studies focus on single queries and step-by-step operations, while in real scenarios the query would be a multi-turn interaction and there is no clear instruction about how to execute the task. Etan (Riva and Kace, 2021) and SUGILITE (Li et al., 2017) are two systems that support learning GUI operations from demonstrations, but these systems are script-based and sensitive to changes in the GUI and workflow. Duplex on the web (Crunch, 2019) can directly operate websites to perform a required task, for example booking a movie ticket. However, it only supports limited websites, and it is more a unified GUI interface than a task-oriented dialogue system enabling general GUI operation.

Action                 Description
Click(item = x)        Click the item with index x on the screen.
Swipe(direction = x)   Swipe the screen towards direction x, which includes "up" and "down".
Input(text = x)        Input the text x to the smartphone.
Enter( )               Press the "Enter" button on the keyboard.
Clear( )               Clear the current input box.
Back( )                Press the "back" button on the smartphone.
End( )                 The turn has finished; control moves to the Response Generator module.

Table 1: The actions in our dataset. There are 7 different actions with 3 different parameters.
To this end, we propose the task of GUI-based task-oriented dialogue (GUI-TOD). It supports multi-turn conversation and direct GUI operation. All tasks are performed on the GUI of real APPs, which means we no longer need TOD-specific APIs to communicate with APPs, making it possible to apply TOD to any APP. Since no benchmark is available, we collect META-GUI, a dataset with dialogues and GUI traces on real Android APPs. A GUI trace is a series of GUI operations, including screenshots, Android view hierarchies and actions. An Android view hierarchy is an XML-style file that organizes the content of the GUI in a hierarchical structure. It also contains the types of the items on the screen and their bounding boxes. An example is shown in Appendix C. When a user requests a task, the system should open the related APP and execute the task through multiple operations on the GUI. This requires a comprehensive understanding of GUI structure and interaction logic. An interaction example is shown in Figure 1.
We focus on building an agent with a general ability to operate GUIs, rather than optimizing for specific APPs. Our proposed GUI-TOD system leverages both the visual and textual information on the screen to predict the next action to execute and to generate the system response. Our experiments show that GUI-TOD outperforms heuristic baselines by a large margin, with an action completion rate of 82.74%.
Our contributions are as follows:
• We propose a GUI-based task-oriented dialogue system, which can perform tasks on mobile APPs through multiple operations on the GUI.
• We collect META-GUI, a dataset with dialogues and GUI operation traces serving as the benchmark for the proposed system.
• We conduct thorough experiments on our dataset and validate the importance of multi-modal information and history information.
We show that it is a promising task but one that needs further exploration.

Task Definition

The overview of GUI-TOD is shown in Figure 2. It consists of two sub-modules: Action Executor (AE) and Response Generator (RG). The traditional task-oriented dialogue system (Chen et al., 2017; Zhang et al., 2020; Yu et al., 2014) splits the task into natural language understanding (NLU) (Zhu et al., 2021), dialogue manager (DM) (Chen et al., 2020a; Zhu et al., 2020; Chen et al., 2018, 2019, 2020b), and natural language generation (NLG) (Keskar et al., 2019). We omit the NLU module and directly send user utterances to AE. The AE module has similar features to DM: it executes the requested task by interacting with the GUI for multiple rounds, while DM accomplishes this by calling TOD-specific APIs. The RG module generates the system response based on the execution results, which is the same as NLG. The process of executing a task is a series of GUI operations, including click, swipe, etc. The task of the AE module is action prediction, which aims at predicting the next action to be performed on the GUI, and the RG module focuses on generating the system's response after executing a task. A major improvement of GUI-TOD is that it does not rely on a pre-defined domain ontology. Conventionally, the DM module identifies a set of slot-values from the user utterance, which serve as the parameters of backend APIs. GUI-TOD, however, handles task-specific slot-values during the execution of tasks. When the APP requires a certain input (for example, entering the time and destination), the system can obtain the information by understanding the current user utterance or generating a response to ask further. Compared with CUED actions (Young, 2007) in traditional TOD, actions in GUI-TOD are GUI-related operations rather than communication actions between user and system.

Formally, the action prediction task can be defined as: given the GUI trace and dialogue history, predict the next action to be performed. We define the set of actions that can be performed on the APPs in Table 1. All actions take the form Action(parameter = *). There are seven types of Action, including six physical actions: click, swipe, input, enter, clear, back, and one virtual action: end. The corresponding parameters are listed in Table 1. The end action is the last action of every GUI trace, marking the end of GUI operations. After an end action is generated, GUI-TOD moves to the RG module. We denote the jth action in turn i as A_{i,j} = (t, p), where t is the action type and p is the corresponding parameter. S_{i,j} = (s, v) is the jth screen in turn i, including the screenshot s and the view hierarchy v. The dialogue in turn i is represented as D_i = (U_i, R_i), where U_i is the ith user utterance and R_i is the ith system response. The action prediction task is formulated as:

A_{i,j} = F(S_{1:i}, A_{1:i}, D_{1:i-1}, U_i)

where 1:i means from turn 1 to i, and F is a trainable action model, which we discuss in Section 4.1. The RG module takes the GUI trace and dialogue history as input, then generates a response based on the execution result and context. Denoting the set of actions in turn i as A_i and the screens in turn i as S_i, the response generation task is formulated as:

R_i = G(S_{1:i}, A_{1:i}, D_{1:i-1}, U_i)

where G is the response generator model, which we discuss in Section 4.2.
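The action space of Table 1 and the A_{i,j} = (t, p) notation can be captured concretely by a small record type. The following is a minimal illustrative sketch; the class and field names are ours, not from the paper:

```python
from dataclasses import dataclass
from typing import Optional, Union

# The seven action types defined in Table 1.
ACTION_TYPES = {"click", "swipe", "input", "enter", "clear", "back", "end"}

@dataclass
class Action:
    """One GUI action A_{i,j} = (t, p): a type plus an optional parameter."""
    type: str                                # one of ACTION_TYPES
    param: Optional[Union[int, str]] = None  # item index, direction, or input text

    def __post_init__(self):
        if self.type not in ACTION_TYPES:
            raise ValueError("unknown action type: %s" % self.type)
        # Per Table 1, only click/swipe/input carry a parameter.
        if self.type in {"enter", "clear", "back", "end"} and self.param is not None:
            raise ValueError("%s takes no parameter" % self.type)

# A GUI trace for one turn ends with the virtual `end` action,
# after which control passes to the Response Generator.
trace = [Action("click", 3), Action("input", "hotels in Boston"),
         Action("enter"), Action("end")]
```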

Meta-GUI Creation
Our dataset consists of two kinds of data: dialogues and GUI operation traces. In each dialogue, the user asks the agent to complete a certain task through multi-turn interaction. Our tasks involve six different domains: weather, calendar, search, taxi, hotel and restaurant. In this paper, we consider APPs that accomplish the same kind of tasks to be in the same domain. To enhance the diversity of our dataset, we use multiple APPs for the calendar and weather domains. The details of the APPs are listed in Appendix A.

Collecting GUI traces
We collected our data in two stages: first we collected GUI traces for existing dialogues, then we collected both dialogues and GUI traces from scratch.
In the first stage, we provided dialogues to annotators and instructed them to perform the tasks on real APPs. We started by extracting dialogues from the SMCalFlow dataset (Andreas et al., 2020). SMCalFlow contains multi-turn task-oriented dialogues and is known for complex reference phenomena that require a comprehensive understanding of context. We extracted dialogues from the calendar, weather and search domains. Six annotators were recruited to label the GUI traces. We built a web-based annotation system connected to a real Android smartphone (see Appendix B). Annotators can see the current screen of the smartphone in the system and control the smartphone by clicking buttons. A dialogue is shown in the system. Annotators first read the dialogue, then they were allowed to explore how to finish the task (e.g. checking the weather) on the smartphone. If the task requirement in the dialogue conflicted with the real-world scenario (for example, creating an event in the past), the annotators could change the content of the dialogue to make the task achievable. Once ready, they used the annotation system to record the actual process of executing the task. Each operation was recorded, and the screenshot after each operation was saved together with the view hierarchy.
In the second stage, we collected dialogues and GUI traces for the hotel, restaurant and taxi domains. Because no dialogues for these domains are available in previous datasets, we asked annotators to write new dialogues. We selected three experienced annotators from the first stage. Different from the first stage, each annotator was shown a task objective, which was generated randomly from all available conditions in the APPs. The annotators acted as user and system alternately to write dialogues according to the task objectives. To avoid annotators writing short and simple dialogues, we added constraints on the number of turns and the behaviors in a dialogue, e.g. adding a condition or changing a condition. An example of a generated objective is shown in Appendix E. After writing dialogues, the annotators also recorded the corresponding GUI operation traces for each turn, as in the first stage.

Data Review
After annotation, we manually reviewed the data. The checklist includes: whether the recorded GUI traces match the dialogues, whether there are invalid operations due to system errors or misoperation, and whether there are redundant operations in the GUI traces. We manually fixed annotations with only small mistakes and discarded those requiring significant modification. The dialogue-level pass rate is about 63.6%, and we finally obtained 1125 dialogues in total. For more information, please refer to Appendix D.

Post-processing
The dialogues collected in the second stage were created by three annotators, and thus lack diversity in expression. Therefore, we published a dialogue rewriting task on AMT * (Amazon Mechanical Turk) to polish the dialogues. During GUI trace annotation, some APPs could not provide a valid Android view hierarchy. To handle this problem, we used the online Optical Character Recognition (OCR) service provided by Baidu Cloud (https://cloud.baidu.com/) to detect all texts on the image with their corresponding positions and generate a pseudo layout file.
We extract items from a screen using the corresponding layout file. An item is a clickable leaf node. Similar to (Zhou and Li, 2021), we consider an item to be clickable if its clickable attribute is true or its parent node is clickable. An item consists of its text content, item type and bounding box. We extract the text content of an item by looking at its text property first. If it is empty, we use its content-desc attribute; if that is also empty, we use the resource-id property. Based on the extracted items, we can locate the target item of a click action by comparing the click position with the bounding boxes of the items.
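The extraction rule above can be sketched as a small recursive walk over the view hierarchy. Attribute names (`clickable`, `text`, `content-desc`, `resource-id`, `bounds`, `class`) follow the standard Android UI Automator dump format; the exact file format used in the dataset may differ:

```python
import xml.etree.ElementTree as ET

def extract_items(vh_xml: str):
    """Extract clickable leaf items from an Android view-hierarchy dump.

    An item is a leaf node that is clickable itself or whose parent is
    clickable. Text fallback order: text -> content-desc -> resource-id.
    """
    root = ET.fromstring(vh_xml)
    items = []

    def walk(node, parent_clickable=False):
        clickable = node.get("clickable") == "true" or parent_clickable
        children = list(node)
        if not children and clickable:
            # Fallback order for the text content of an item.
            text = (node.get("text") or node.get("content-desc")
                    or node.get("resource-id") or "")
            items.append({
                "text": text,
                "type": node.get("class", ""),
                "bounds": node.get("bounds", ""),
            })
        for child in children:
            walk(child, clickable)

    walk(root)
    return items
```

The bounding boxes kept in each item are what allow a recorded click position to be matched back to its target item.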

Data Analysis
The total number of dialogues in our dataset is 1125, including 4684 turns. The average number of images per turn is 5.30, and the average number of words per utterance is 8. On average, there are 23.80 items per image, and the average item text length is 2.48 words. The distribution of item types is shown in Figure 3. We also provide an example of each item type in Appendix F. TextView and ImageView are clearly the two most frequent types, which indicates that our dataset is informative.
The distribution of actions is shown in Figure 4. Click is the most frequent action, while clear is the least frequent, since only a small number of tasks require clearing the current input box. For the click action, we further compute the type distribution of the target items, shown in Figure 3. TextView and Button are the most frequently clicked types, while 8 item types are never operated on. This implies that item types may supply hints for predicting the target items. Besides, the average numbers of words for a response and an input action are 9 and 3 respectively.

Model Design
The overview of our system is illustrated in Figure 5. It is composed of four components: the encoder, the image feature extractor, the multi-modal information fusion module and the output module. The output module can be either the Action Module or the Response Module.

Action Model
We call the combination of the encoder, the image feature extractor, the multi-modal information fusion module and the Action Module the Action Model, which is used to predict the next GUI action based on the history. Next, we describe these modules respectively. For simplicity, we only consider the last screen of the screen history here; we discuss adding more screen history later.
Encoder The input of the encoder consists of two parts: the dialogue history {D_{1:i-1}, U_i} = {w_1, ..., w_n} and the texts of the items {m_{1,1:l_1}, ..., m_{k,1:l_k}}, where the items are extracted from the last screen, k is the number of items and l_i is the length of the ith item's text:

H = Encoder([w_1, ..., w_n; m_{1,1:l_1}; ...; m_{k,1:l_k}])

where H = [D; M], D = {w_1, w_2, ..., w_n} represents the encoder outputs of the dialogue history, and M = {m_{1,1:l_1}; ...; m_{k,1:l_k}} represents the encoder outputs of the item texts.
Image feature extractor Given a screenshot and its corresponding layout file, we use Faster R-CNN (Ren et al., 2015) to extract the feature map. Then we apply ROI pooling based on the bounding box of each item, obtaining the item-level image features I = {I_1, ..., I_k}.
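ROI pooling itself is a fixed-grid max-pool over each item's bounding box. Below is a minimal single-channel, pure-Python sketch; in practice this runs over Faster R-CNN feature maps and multiple channels, which are assumed precomputed here:

```python
def roi_pool(feature_map, box, out_h=2, out_w=2):
    """Max-pool the region of `feature_map` inside `box` down to a fixed
    out_h x out_w grid (a minimal sketch of ROI pooling).

    feature_map: 2D list of floats (H x W, single channel for brevity)
    box: (x0, y0, x1, y1) item bounding box in feature-map coordinates
    """
    x0, y0, x1, y1 = box
    h, w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Sub-window of the ROI assigned to output cell (i, j);
            # max(.., +1) keeps at least one cell for degenerate boxes.
            ys, ye = y0 + i * h // out_h, y0 + (i + 1) * h // out_h
            xs, xe = x0 + j * w // out_w, x0 + (j + 1) * w // out_w
            row.append(max(feature_map[y][x]
                           for y in range(ys, max(ye, ys + 1))
                           for x in range(xs, max(xe, xs + 1))))
        pooled.append(row)
    return pooled
```

This yields one fixed-size feature per item regardless of the item's on-screen size, which is what lets items of different sizes share one fusion module.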
Multi-modal information fusion module Given the encoder output and the regional image features extracted above, we concatenate them together. The text features of an item m_{i,1:l_i} are concatenated with that item's image feature I_i, and w_{1:n} are concatenated with zeros. Then we use a Transformer encoder with M layers to fuse the multi-modal features. For each layer, to enhance the image information, we concatenate the image features with the output of the previous layer again to form the input of the next layer.
Action Module The Action Module predicts the action type and its corresponding parameters. As shown in Table 1, there are 7 action types with 3 different parameters. We show some examples of parameter predictions in Appendix G. We use the encoder output of the [CLS] token for action type prediction, applying a feed-forward network followed by a Softmax layer:

p_a = Softmax(FFN(h_{[CLS]}))

where p_a is the probability distribution over action types, and FFN represents the Feed-Forward Network.
For the action parameters, we use three different classifiers. 1) Input Text Prediction We assume that the input to the APPs must be part of the user utterance, so we formulate the prediction of the input text as a span prediction task. We use two classifiers to predict the begin and end positions in the dialogue:

p_{ds} = Softmax(FFN_{start}(D)), p_{de} = Softmax(FFN_{end}(D))

where p_{ds} and p_{de} are the probabilities of the start and end positions respectively.
2) Target Item Prediction The target item classifier is based on the encoder outputs of the items. We first compute each item representation by applying average pooling over its encoder outputs, then use a feed-forward layer followed by a Softmax layer to compute the probability of selecting each item:

p_m = Softmax(FFN(AvgPool(M)))

where p_m is the probability distribution over items.
3) Direction Prediction The direction classifier is a two-class classification layer over the directions up and down:

p_d = Softmax(FFN(h_{[CLS]}))

where p_d is the probability distribution over swipe directions.
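At inference time, the start/end distributions of the input-text head can be decoded into a concrete span by picking the highest-scoring valid pair, as in extractive QA. A minimal sketch; the `max_len` constraint is our assumption, not specified above:

```python
def decode_span(p_start, p_end, tokens, max_len=10):
    """Pick the (start, end) pair maximizing p_start[s] * p_end[e]
    subject to s <= e < s + max_len, and return the token span.
    This is the standard greedy decoding step for span prediction.
    """
    best, best_span = -1.0, (0, 0)
    for s, ps in enumerate(p_start):
        for e in range(s, min(s + max_len, len(p_end))):
            score = ps * p_end[e]
            if score > best:
                best, best_span = score, (s, e)
    s, e = best_span
    return " ".join(tokens[s:e + 1])
```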
Adding history information According to the task definition, besides the dialogue history, we can also use the action history and the screen history. To verify their usefulness, we add them to the Action Model. For the action history, we regard action types as special tokens and add them to the dictionary. We concatenate the most recent H action types {t_{1:H}} before the dialogue history as input:

X = [t_{1:H}; D_{1:i-1}, U_i; m_{1,1:l_1}, ..., m_{k,1:l_k}]

where X is the input of the Encoder and t represents an action type.
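Building the encoder input then amounts to prepending the action-type tokens to the token sequence. A minimal sketch; the bracketed token names are illustrative, as the paper only states that action types are added to the vocabulary as special tokens:

```python
def build_encoder_input(action_history, dialogue_tokens, item_tokens, max_h=4):
    """Prepend the most recent `max_h` action types as special tokens,
    followed by the dialogue history and the item texts.
    """
    history = ["[%s]" % a.upper() for a in action_history[-max_h:]]
    return ["[CLS]"] + history + dialogue_tokens + ["[SEP]"] + item_tokens
```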
For the screenshot history, we encode all the screenshots in a recurrent way. Assume Î_i = [I_{i,1}, ..., I_{i,k}] is the image feature of the ith screenshot, and Ī_i is the history image feature at time step i. We compute Ī_{i+1} by:

Ī_{i+1} = Attn(W_q Î_{i+1}, W_k Ī_i, W_v Ī_i)

where Ī_1 = Î_1, H is the length of the history, Attn is the attention mechanism (Vaswani et al., 2017), and W_* are trainable parameters. We use Ī_H to replace the image features in Figure 5.

Response Model
We process the dataset at the granularity of actions. Each data point takes as input the screenshot history, the action history and the dialogue history, and predicts the next action to be performed. We obtained 18337 data points in total, and we randomly divide the data into training, development and test sets with the ratio 8:1:1. The data statistics are shown in Table 2.

Experiment Setup
We train our baselines on the training set and select the best models on the dev set based on the action completion rate. We use pretrained BERT (Devlin et al., 2019), LayoutLM (Xu et al., 2020) and LayoutLMv2 (Xu et al., 2021b) as our encoder models.‡ BERT is pretrained on a pure text corpus with the masked language modeling task, while LayoutLM and LayoutLMv2 are pretrained on scanned documents with the masked visual-language modeling task and incorporate image features.
We use a batch size of 4 and fine-tune for 8 epochs. We use the Adam optimizer with a learning rate of 1e-5. For the Response Model, the number of Transformer Decoder blocks is 4. Furthermore, we use three heuristic methods in our experiments:

Random We randomly predict the action type and its corresponding parameters.

Frequency Method (FM) We first calculate the frequency of each action type and its corresponding parameters. Then, we generate predictions on the development set according to these frequencies.

‡ There are some pre-trained models for GUI understanding, such as ActionBERT (He et al., 2021) and UIBERT (Bai et al., 2021), but they are not open-source.
Most Frequent Method (MFM) Similar to the Frequency Method, we generate predictions using the most frequent result.
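The MFM baseline can be sketched in a few lines; the helper below is hypothetical, and FM would instead sample from the same counts:

```python
from collections import Counter

def most_frequent_baseline(train_actions):
    """MFM baseline: always predict the most frequent (type, param) pair
    observed in the training data.
    """
    counts = Counter(train_actions)
    return counts.most_common(1)[0][0]
```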
For evaluation, we use completion rates for action prediction. We define two completion rate metrics: the action completion rate and the turn completion rate. An action is regarded as completed only if its action type and parameters are all correctly predicted, and a turn is considered completed if all actions in it are completed. For action type prediction, item prediction and direction prediction, we use accuracy. For input prediction, we use token-level exact match and F1. We use the BLEU score to evaluate the Response Model.
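The two completion-rate metrics follow directly from per-action correctness flags. A minimal sketch:

```python
def completion_rates(turns):
    """Compute (action completion rate, turn completion rate).

    `turns` is a list of turns; each turn is a list of booleans, one per
    action, True if both the action type and all its parameters were
    predicted correctly. A turn counts as completed only if every action
    in it is completed.
    """
    actions = [a for turn in turns for a in turn]
    action_cr = sum(actions) / len(actions)
    turn_cr = sum(all(turn) for turn in turns) / len(turns)
    return action_cr, turn_cr
```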

Experiment Result
The experiment results of the Action Model are listed in Table 3. The deep learning methods outperform the heuristic methods by a large margin, as expected. Comparing the BERT backbone and the LayoutLM backbone, we find that the BERT model yields better performance. The reason is that LayoutLM was pre-trained on a scanned document image dataset, and there is a large gap between Android GUIs and scanned document images. Furthermore, LayoutLMv2 performs worse than LayoutLM. We hypothesize that the early-fusion method used by LayoutLMv2 brings more noise. We also find that adding multi-modal information to BERT leads to better performance (52.08% → 53.96%), with improvements mainly in action type prediction, target item prediction and swipe direction prediction. Adding images helps because the image contains some action history that cannot be represented by text. For example, when filtering conditions for hotel reservations, the conditions selected in a previous action can be seen in the image (as highlighted text), but are not reflected in the text. An example is illustrated in Appendix H. Besides, the image information can help the model locate items more accurately. For example, for a screen with multiple radio buttons, since the BERT model does not take item positions as input, it cannot distinguish the corresponding button for each option from textual input alone. However, we also find that the performance of input text prediction degrades after adding image information. We assume that BERT itself can successfully model the text, but adding visual information affects the model's ability to understand it.
We further verify the importance of history information by adding action histories and screenshot histories. From the experiment results, we find that adding history information to BERT improves performance (52.08% → 55.42% after adding action history to BERT, and 53.96% → 55.62% after adding screenshot history to BERT+mm). Adding action histories leads to a greater improvement, which suggests that the action sequence is a more effective way to represent history. The screenshots contain higher-level history information, but the screen changes a lot before and after an operation (sometimes one click may change the screen completely), which makes information fusion difficult.
Finally, we add all information, including multi-modal information, action histories and screenshot histories, to the BERT model and obtain m-BASH (multi-modal BERT with Action histories and Screenshot Histories), which achieves state-of-the-art performance (56.88%).
The results of the Response Model are shown in Table 4. BERT outperforms LayoutLM and LayoutLMv2 by a large margin, which is consistent with the results of the Action Model. We also find that adding multi-modal information and screenshot histories improves performance, which means the model leverages history information to generate responses.


Generality
According to the design of our system, it does not need pre-defined API-related slots; therefore it has strong generality and can be easily adapted to new APPs. To demonstrate this, we re-partition our dataset as follows:

App generality Since we use multiple APPs in the weather and calendar domains, we use the data from one APP as the test set, and the rest of the data forms the training set.
Domain generality We use the data from one domain as the test set, and the rest of the data forms the training set.
We evaluate the performance of m-BASH on these re-partitioned datasets. The results are shown in Table 5. Our system still obtains reasonable performance, and the results of the app generality experiments are even comparable to the main experiment results of LayoutLM. This shows that common operation logic does exist across APPs, and that our system gains a general comprehension of GUI operations. It can be applied to a new APP or a new domain without modification, which shows the effectiveness and potential of our system.

Related Work

Natural Language Commands on GUI
Executing natural language commands on GUIs has attracted research interest recently. Some studies focus on semantic parsing (Mazumder and Riva, 2021; Pasupat et al., 2018; Xu et al., 2021a), whose task is to map a natural language query to operations on websites. Google Duplex (Crunch, 2019) can operate websites to finish tasks like booking movie tickets or making restaurant reservations. However, it only supports limited websites, and it is more a unified interface than a general dialogue system with GUI operating ability. Our proposed dataset contains real-world APPs and aims at training models with general GUI understanding.

Programming by Demonstration on GUI
Programming by Demonstration (PbD) systems focus on learning GUI tasks from human demonstrations (Riva and Kace, 2021; Li and Riva, 2021, 2018; Li et al., 2019). SUGILITE (Li et al., 2017) records the user's operations on the GUI and generates a script for the learned task. APPINITE (Li et al., 2018) proposed adding descriptions for ambiguous actions to enhance the robustness of the generated scripts. These systems generate scripts based on handcrafted rules and XML analysis, which is sensitive to GUI changes and exceptions. In this work, we aim to build an agent that can work with general mobile GUIs, rather than merely repeating recorded operations.

Visual Dialogue
More and more researchers combine CV and NLP in dialogue systems and engage in a more challenging task, visual dialogue (Le and Hoi, 2020; Agarwal et al., 2020; Le et al., 2020). It can be seen as a multi-step reasoning process over a series of questions (Gan et al., 2019). Gan et al. (2019) updated the semantic representation of the question based on the image and dialogue history. Wang et al. (2020) proposed VD-BERT, a simple yet effective framework of a unified vision-dialogue Transformer that leverages pre-trained BERT language models for visual dialogue tasks. Visual dialogue focuses on understanding image contents; beyond this, our task also requires understanding the interactions between UIs.

Conclusion
In this paper, we proposed the task of GUI-based task-oriented dialogue, which replaces traditional TOD-specific API calls with GUI operations on real APPs. The advantage is that intelligent agents can perform tasks without backend TOD-specific APIs, and the system does not rely on a domain-specific schema, which means it can be applied to a new domain easily. We collected META-GUI, a dataset with dialogues and GUI traces, to serve as a benchmark. Our model shows promising results on the dataset, and we hope this work stimulates more advanced methods for GUI-TOD.
In the future, we will explore how to better incorporate GUI traces into our model and build the GUI semantics based on interactions.

Limitations
We propose a GUI-based task-oriented dialogue system, which can perform GUI operations on real APPs to complete tasks. To verify the validity of the system, we collect the META-GUI dataset, which contains dialogues and GUI operation traces. In real scenarios, an agent may not know how to complete the task presented by the user. In such cases, an agent might reply "It's too hard for me.", or something similar, which is not included in our dataset. In the future, we will augment the dataset to include such cases. Furthermore, the models we used are too large to be deployed on mobile phones.
It is important to compress the models, which we will attempt in the future.

A Details of Apps
We list the information of the applications used in Table 6. To ensure the diversity of our dataset, we use 4 APPs for the weather domain, 3 APPs for the calendar domain, and 1 APP each for the remaining 4 domains. We also list the number of turns belonging to each APP. The total number of turns is larger than the actual number of turns, since one turn may involve several APPs.

D Data Review
After annotation, we manually reviewed the data. The checklist includes: (1) whether the recorded GUI traces match the dialogues: we check whether the GUI operations match the tasks proposed by the users, for example, whether the scheduled time is correct; (2) whether there are invalid operations due to system errors or misoperation: during annotation, some annotators may click a wrong position or swipe the screen mistakenly, and the annotation system may sometimes run into failures; (3) whether there are redundant operations in the GUI traces: for example, some annotators may capture the same screen multiple times.

Figure 1: An example of the GUI-based task-oriented dialogue system (GUI-TOD). The Action Executor executes tasks on the GUI and the system generates a response based on the execution result.

Figure 3: The distribution of the total number of items versus the clicked ones for each item type.

Figure 4: The distribution of actions.

Figure 6: The illustration of our Annotation System. The annotators can see dialogues in the Dialog Box and the current screen of the smartphone in the

Figure 7: An example of the View Hierarchy for a given screen. The "+" button with a red border on the left-hand side corresponds to the highlighted element in the view hierarchy on the right-hand side.
The Response Model aims to generate the response to the user. We use the Response Module as the output module, and the other parts are the same as in the Action Model. Since the response is mainly decided by the execution results and the dialogue, we do not use action histories for the Response Model. For the Response Module, we use a Transformer Decoder with N layers.

Table 3: The experiment results of the Action Model on the test set. Acc.: accuracy. EM: exact match. F1: F1 score. CR: completion rate. MFM: Most Frequent Method. FM: Frequency Method. mm: use the multi-modal information fusion module to add image information. act_h: add action histories. scr_h: add screenshot histories.

Table 4: The experiment results of the Response Model (BLEU score) on the test set.

Table 5: The results of the generality experiments.

Table 6: The information of APPs. The total number of turns is larger than the actual number of turns because some turns involve several APPs.
B Annotation System