Information Extraction and Human-Robot Dialogue towards Real-life Tasks: A Baseline Study with the MobileCS Dataset

Recently, a class of task-oriented dialogue (TOD) datasets collected through Wizard-of-Oz simulated games has emerged. However, Wizard-of-Oz data are in fact simulated data and thus fundamentally different from real-life conversations, which are more noisy and casual. Recently, the SereTOD challenge was organized and released the MobileCS dataset, which consists of real-world dialog transcripts between real users and customer-service staff from China Mobile. Based on the MobileCS dataset, the SereTOD challenge defines two tasks, not only evaluating the construction of the dialogue system itself, but also examining information extraction from dialog transcripts, which is crucial for building the knowledge base for TOD. This paper mainly presents a baseline study of the two tasks with the MobileCS dataset. We introduce how the two baselines are constructed, the problems encountered, and the results. We anticipate that the baselines can facilitate exciting future research to build human-robot dialogue systems for real-life tasks.


Introduction
Building human-robot dialogue systems is an important research question, not only for artificial intelligence applications but also for artificial intelligence itself. In the Turing test, if the human evaluator finds human-robot dialogue and human-human dialogue indistinguishable, the robot would be said to exhibit intelligent behaviour and pass the test (Turing, 1950). So presumably, the best strategy to build an intelligent dialogue system may be to train the system over a large amount of real human-to-human conversations to mimic human behaviors. This approach was once pursued, and several human-human dialogue datasets have been released, such as the Twitter dataset (Ritter et al., 2010), the Reddit conversations (Schrading et al., 2015), and the Ubuntu technical support corpus (Lowe et al., 2015). It is argued in (Budzianowski et al., 2018) that the lack of grounding conversations onto an existing knowledge base (KB) limits the usability of the systems developed over these human-human dialogue datasets.
So a class of Wizard-of-Oz simulated games has emerged to collect human-human conversations (Wen et al., 2017b; El Asri et al., 2017; Budzianowski et al., 2018; Zhu et al., 2020; Quan et al., 2020), particularly for task-oriented dialogue (TOD) systems, which help users accomplish specific goals such as finding restaurants or booking flights and usually require a task-related KB. In the Wizard-of-Oz set-up, through random sampling based on an ontology and a KB (both pre-defined), a task template is created for each dialogue session between two crowd workers. One worker acts in the role of a user and the other performs the role of a clerk (i.e. the system side). In practice, multiple workers may contribute to one dialogue session. In this way, annotations of belief states and system acts become easy, and grounding conversations onto the KB is realized.
However, dialogue data collected in the Wizard-of-Oz set-up are in fact simulated data and thus fundamentally different from real-life conversations. During the Wizard-of-Oz collection, specific instructions (e.g., goal descriptions for the user side and task descriptions for the system side) are provided for crowd workers to follow. In contrast, real-life dialogues are more casual and freestyle, without instructions. Even with some goals in mind, chit-chat or redundant turns often exist in real-life conversations, e.g., asking for repetition or confirming key information. In some sense, we could say that real-life dialogues are more noisy. Moreover, spoken conversations in the real world have a style distinct from well-written conversations and are full of extra noise from grammatical errors, disfluencies or barge-ins (Kim et al., 2021).

arXiv:2209.13464v2 [cs.CL] 18 Oct 2022
For building dialogue systems that are more applicable to real-life tasks, real human-human dialogue datasets with grounding annotations to KBs are highly desirable.
Recently, the SereTOD challenge was organized (Ou et al., 2022) and released a new human-human dialogue dataset, called the MobileCS (Mobile Customer Service) dataset. It consists of real-world dialog transcripts between real users and customer-service staff from China Mobile. Based on the observation and analysis of those dialogue transcripts, a schema is summarized to the best of our ability, according to which about 10,000 dialogues are annotated with entities, attribute triples and speaker intents for every turn. The annotated part of the MobileCS dataset is randomly split into a train, development and test set, which consist of 8975, 1025 and 962 dialogues, respectively.
Based on the MobileCS dataset, the SereTOD challenge not only evaluates the construction of the dialogue system itself (Task 2), but also examines information extraction from dialog transcripts (Task 1), which is crucial for building the KB. The MobileCS data are more noisy and challenging compared to previous Wizard-of-Oz data. It is non-trivial to establish baseline systems on such a dataset. This paper mainly presents a baseline study of the two tasks with the MobileCS dataset. Two baseline systems are constructed for the two tasks respectively, both of which are released as open source and provided to the participating teams in the SereTOD challenge. We introduce how the two baselines are constructed, the problems encountered, and the results. The results clearly show the challenge of information extraction and human-robot dialogue when trained and tested on real human-human data. We anticipate that the baselines can facilitate exciting future research to build human-robot dialogue systems for real-life tasks.

Dialogue Datasets
According to Budzianowski et al. (2018), existing dialog datasets (whether task-oriented or not) can be grouped into three categories: machine-to-machine, human-to-machine, and human-to-human. The machine-to-machine datasets may ensure full coverage of all possible dialogue outcomes within a certain domain, but they do not consider noisy conditions in real life, which poses a risk of a mismatch between training data and real interactions. The human-to-machine datasets, however, depend on the provision of an existing working dialogue system, which limits the practicality of the datasets. The human-to-human datasets address the problems in the above two classes of datasets. However, previous human-to-human datasets lack a knowledge base and explicit goals in the conversation, making systems trained on these corpora struggle to generate consistent and diverse responses (Li et al., 2016).
It is non-trivial to collect a TOD dataset with a knowledge base and user goals. Previous TOD datasets are either collected through Wizard-of-Oz simulated games (Wen et al., 2017b; El Asri et al., 2017; Budzianowski et al., 2018; Zhu et al., 2020; Quan et al., 2020), or collected by converting machine-generated outlines into natural language using crowd workers (Shah et al., 2018; Rastogi et al., 2020; Lee et al., 2022). However, during the collection of these previous datasets, specific instructions are provided for crowd workers, which differs from real-life conversation scenarios and leads to a gap between the collected data and real-life dialogues. The MobileCS dataset, introduced in the SereTOD challenge, comes from real-world dialogue transcripts and represents a step towards remedying the above deficiencies.

Dialogue Information Extraction
Dialogue information extraction is the task of extracting structured information, e.g., entities and attributes, from dialogue transcripts. Different from traditional information extraction on general domain text (Sarawagi et al., 2008; Li et al., 2020b; Han et al., 2020), dialogue transcripts are more verbalized and loose, with more irregular expressions and grammar errors. Previous works have explored how to extract user information (Catizone et al., 2010; Hirano et al., 2015; Wu et al., 2019), clinical information (Kannan et al., 2018; Peng et al., 2021), and relations between speakers and mentioned entities in dialogues (Yu et al., 2020; Jia et al., 2021). However, no previous work focuses on extracting information from real-world dialogue transcripts between real users and customer-service staff. In this paper, we develop a modern dialogue information extraction baseline based on the MobileCS dataset, which contains dialogue transcripts from China Mobile.

Task-oriented Dialogue System
The methodology for building TOD systems is gradually advancing from separate training of individual modules (Williams et al., 2016; Mrkšić et al., 2017; Dai et al., 2018) to the end-to-end (E2E) trainable approach (Wen et al., 2017a; Liu and Lane, 2017; Lei et al., 2018; Shu et al., 2019; Zhang et al., 2020; Gao et al., 2020). In early E2E methods, the sequential turns of a dialog are modeled with LSTM-based backbones. Recently, self-attention based Transformer neural networks (Vaswani et al., 2017) have shown their superiority over LSTM-based networks in capturing long-term dependencies. Transformer-based pretrained language models (PLMs), such as GPT2 (Radford et al., 2019) and T5 (Raffel et al., 2020), have been leveraged to build generative E2E TOD systems in the pretraining-and-finetuning paradigm, and have shown improved performance over LSTM-based ones. Examples include GPT2-based SimpleTOD (Hosseini-Asl et al., 2020), SOLOIST (Li et al., 2020a), AuGPT (Kulhánek et al., 2021) and UBAR (Yang et al., 2021), and T5-based PPTOD (Su et al., 2021) and MTTOD (Lee, 2021), among others. However, these previous TOD systems are mainly examined on simulated data collected by crowd workers. It is not clear what the potential performance of the current methodology is in real-life tasks. In this paper, we present our effort to answer this question by developing a TOD system on the MobileCS data, which come from real-life customer service.

MobileCS Dialogue Dataset
The MobileCS dialogue dataset contains 10,000 dialogues labeled by crowd-sourcing and around 90,000 unlabeled dialogues. For each dialogue turn, the annotations consist of entities, attribute triples, and speaker intents within the scope of the schema. Another roughly 1000 dialogs are put aside as the test data. More detailed information about the MobileCS dataset can be found in the challenge description paper for the SereTOD challenge (Ou et al., 2022).
The two tasks defined over the MobileCS dataset for the SereTOD challenge require different annotations. For information extraction (Task 1), the annotations of entities and attribute triples are needed for training and evaluating the system. For TOD system construction (Task 2), user intents, system intents and a local knowledge base (local KB, which covers personal information and relevant public knowledge in a dialogue) are required. A global KB, which covers and fuses all public knowledge and all personal information in the MobileCS domain, is difficult to obtain during the research phase. Thus, the SereTOD challenge introduces the concept of a local KB, which can be viewed as being composed of the relevant local snapshots from the global KB for each dialog. The local KB is obtained automatically by integrating all the annotations of entities and attributes into a sequence of entities. In addition, user goals are needed for evaluating the performance of TOD systems in Task 2. Similarly, user goals are obtained automatically by integrating the user intents and all the entities and attributes mentioned by the user. Examples of the local KB and user goal can be found in Listing 1 in the challenge description paper (Ou et al., 2022).
Data Quality The MobileCS data were annotated by two professional data labeling teams according to well-documented guidelines, as described in (Ou et al., 2022). Quality control was enforced by sampling the annotated data and performing cross-checks of the annotations between the two teams. Nevertheless, annotation errors still exist in such a large dataset. Some annotation errors can be corrected by rules. A typical example is the granularity error of entity types. In the schema, entity types have inheritance relationships; for example, "main package" inherits from "package" and contains all its properties. Therefore, there are quite a few annotation confusions between parent types and child types in the data. To correct those type errors, the most fine-grained type for each entity was selected according to the attributes held by the entity. By combining the schema with manual rules, more annotation errors can be corrected. The updated MobileCS dataset is called v1.1, which is released in the SereTOD challenge and used in the experiments in this paper.
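As a concrete illustration, the granularity-correction rule can be sketched as follows. The schema fragment and attribute names below are invented for illustration; they are not the real MobileCS schema.

```python
# Illustrative schema fragment: each type is mapped to the attributes it
# can hold, and child types inherit from parent types (hypothetical data).
TYPE_SPECIFIC_ATTRS = {
    "package": {"price", "data allowance"},
    "main package": {"price", "data allowance", "contract length"},
}
# Child -> parent inheritance relation.
PARENT = {"main package": "package"}

def most_fine_grained_type(annotated_type, attrs):
    """Pick the most fine-grained type consistent with an entity's attributes.

    If the annotated attributes include something that only a child type can
    hold, relabel the entity with that child type; otherwise keep the
    annotated (parent) type.
    """
    best = annotated_type
    for child, parent in PARENT.items():
        if parent == annotated_type:
            child_only = TYPE_SPECIFIC_ATTRS[child] - TYPE_SPECIFIC_ATTRS[parent]
            if attrs & child_only:
                best = child
    return best
```

The rule is conservative: an entity is pushed to a child type only when an attribute forces the move, so correctly annotated parent-type entities are left untouched.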

Task 1: Information Extraction from Dialog Transcripts
Task 1 aims to extract structured information from real-world dialogue transcripts for constructing the KB for the TOD system. This task consists of two subtasks: entity extraction and slot filling. The entity extraction task aims to extract entities, involving named entity recognition and entity coreference resolution. The slot filling task aims to extract the attributes and values of entities, and the status of user accounts. Compared to information extraction tasks on general domain texts, this task poses more challenges. First, dialogue transcripts are more verbalized, loose and noisy, which requires more robust models. Second, dialogue transcripts contain more pronouns and referents, some of which even span several turns. This requires coreference resolution and long context modeling.

Task 2: Task-Oriented Dialog Systems
The basic task for the TOD system is, for each dialog turn, given the dialog history, the user utterance and the local KB, to predict the user intent, query the local KB, and generate an appropriate system intent and response according to the queried information. Compared with previous work, this task has the following characteristics. First, there is no global KB but only a local KB for each dialog, containing all the information in entity and attribute annotations and representing the unique information for each user, e.g., the user's package plan and remaining phone charges. Second, the user's constraints on entities are relatively simple, e.g., a 38M data package, so the customer service system usually can confirm the entities that the user refers to in one dialogue turn, without the need for dialogue state accumulation.
Baseline Models

Task 1 Baseline
Task 1 involves two challenging sub-tasks: entity extraction and slot filling. Therefore, we design a pipeline method to extract information from dialogue transcripts. For entity extraction, the pipeline is two-step: named entity recognition and entity coreference resolution. For slot filling, the pipeline is also two-step: slot recognition and entity slot alignment. For each step, we first utilize a text encoder backbone to encode the utterances and then a task-specific module to extract specific information based on the encoded representations. In our experiments, we adopt three text encoders: LSTM (Lai et al., 2015), BERT (Kenton and Toutanova, 2019), and RoBERTa (Liu et al., 2019). The overall model architecture is illustrated in Figure 1. The hyper-parameters are shown in Table 1. The details of each step are as follows.
Named Entity Recognition First, we utilize a sequence labeling method to extract entity mentions in dialogue transcripts, as in Yamada et al. (2020a). Specifically, after encoding the utterances, we adopt a conditional random field (Lafferty et al., 2001) on top of the hidden representations to label entity mentions in each utterance of the dialogue transcripts.
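To make the sequence-labeling step concrete, here is a minimal sketch of decoding per-token tags back into entity mention spans. The BIO tag convention is the usual one for this kind of tagger; the baseline's actual label set may differ.

```python
def bio_to_spans(tokens, tags):
    """Convert per-token BIO tags into (start, end, type, text) mention spans.

    `end` is exclusive. Tokens are joined without spaces, which suits
    character-tokenized Chinese transcripts.
    """
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes the last span
        # A span ends at a B- tag, an O tag, or an I- tag of a different type.
        if tag.startswith("B-") or tag == "O" or (
            tag.startswith("I-") and tag[2:] != etype
        ):
            if start is not None:
                spans.append((start, i, etype, "".join(tokens[start:i])))
                start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans
```

In the full model, the tag sequence itself would come from Viterbi decoding over the CRF layer; this sketch only covers the span-recovery step.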
Entity Coreference Resolution After extracting entity mentions from dialogue transcripts, we utilize an entity coreference resolution method to group the mentions that refer to the same entity, as the local KB organizes knowledge at the entity level instead of the mention level. Specifically, after encoding the dialogues, we adopt the dot product between the representation vectors of two entity mentions as the metric to assess whether they refer to the same entity. The representation vector of an entity mention is defined as the mean pooling of the representations of its tokens, as done in Yao et al. (2019). We then utilize the binary cross entropy loss as the objective to fine-tune the backbone encoders.
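The mention-scoring computation described above can be sketched as follows; the toy two-dimensional vectors stand in for the encoder's hidden representations.

```python
import math

def mean_pool(token_vecs):
    """Mean-pool a mention's token vectors into one representation vector."""
    dim = len(token_vecs[0])
    return [sum(v[d] for v in token_vecs) / len(token_vecs) for d in range(dim)]

def coref_score(mention_a, mention_b):
    """Dot product between two mention representations, squashed by a sigmoid
    into the probability that the mentions corefer. Training would apply
    binary cross entropy to this probability."""
    a, b = mean_pool(mention_a), mean_pool(mention_b)
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 / (1.0 + math.exp(-dot))
```

At inference time, mentions whose pairwise score exceeds a threshold (e.g. 0.5) would be clustered into one entity.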
Slot Recognition Slot recognition aims to recognize slots from plain text, regardless of which entity a slot belongs to. We utilize a sequence labeling method to recognize the slots, i.e., to label certain spans in the utterance as slots, which are the attributes of entities and the status of users. Specifically, we utilize the same model architecture as in named entity recognition to label slots in each utterance of the dialogue transcripts.
Entity Slot Alignment To construct a local KB, the final procedure is to link slots to the corresponding entities. We formulate the task as a sequence classification task. Specifically, we highlight an entity and a slot using special markers and then encode the text into a contextual representation, inspired by Soares et al. (2019). We adopt a linear classification head to classify whether the slot corresponds to the entity.
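The marker-insertion step can be sketched as below. The marker strings follow the paper's `<ent_1>` example, but the exact marker vocabulary is an assumption here.

```python
def mark_pair(tokens, ent_span, slot_span):
    """Wrap an entity span and a slot span with special marker tokens so the
    encoder's classification head can judge whether they belong together.

    Spans are (start, end) with exclusive end. Markers are inserted from the
    rightmost position leftwards so earlier insertions do not shift later ones.
    """
    inserts = sorted(
        [(ent_span[0], "<ent_1>"), (ent_span[1], "</ent_1>"),
         (slot_span[0], "<slot_1>"), (slot_span[1], "</slot_1>")],
        key=lambda x: x[0], reverse=True)
    out = list(tokens)
    for pos, marker in inserts:
        out.insert(pos, marker)
    return out
```

The marked token sequence would then be fed to the encoder, and the linear head classifies the pair as aligned or not.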

Task 2 Baseline
KB Query We need to design a KB query function to help the TOD system access information from the local KB. After observing the dataset, we find that user queries can be divided into three different types. We encapsulate all query scenarios into one function and list their inputs (i.e. the arguments of the query function) and outputs as follows.
• Query the attribute of a specified entity. The input is the entity name and the attribute to be queried; the output is the attribute value in the local KB.
• Query entities of a specified type. The input is the entity type; the output is the entity names of this type.
• Query an attribute for the user. The input is the attribute to be queried; the output is the queried attribute value in the local user profile (part of the local KB).
With the above query function, the TOD system can use the predicted user intent to access information from the local KB.
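A minimal sketch of such a unified query function follows; the dict-based local KB and user profile shapes are illustrative and differ from the real data format.

```python
def query_kb(local_kb, user_profile, ent_name=None, ent_type=None, attr=None):
    """One query function covering the three query types described above.

    `local_kb` maps entity name -> {"type": ..., attr: value, ...};
    `user_profile` maps attr -> value (both shapes are hypothetical).
    """
    if ent_name is not None and attr is not None:
        # Type 1: attribute of a specified entity.
        return local_kb.get(ent_name, {}).get(attr)
    if ent_type is not None:
        # Type 2: all entity names of a given type.
        return [n for n, e in local_kb.items() if e.get("type") == ent_type]
    if attr is not None:
        # Type 3: attribute looked up in the user profile.
        return user_profile.get(attr)
    return None
```

The predicted user intent supplies the arguments, so the dialogue model never needs to touch the KB structure directly.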
Baseline Architecture We divide the TOD system into several sub-tasks. For every dialog turn, the system needs to perform the following steps in order: 1) predict the entity name mentioned or referred to by the user; 2) predict the user intent (including the arguments of the query function); 3) query the local KB using the predicted user intent and obtain the KB result; 4) predict the system intent; 5) predict the system response. Note that there are many pronouns and co-references of entity names, so the system may not be able to predict the correct entity name from only the user utterance in the current turn. To solve this problem, dialogue history information is needed. However, in real-life dialogues, the dialogue history is particularly long and contains plenty of characters, which seriously hurts the training efficiency of the model (Liu et al., 2022). Therefore, we maintain a list of entity names mentioned by the user in all previous turns (the entity name history) to replace the dialogue history. The entity name history and the user utterance are fed into the model as the conditioning input to complete the above sub-tasks. Similar to Hosseini-Asl et al. (2020), we employ a sequence generation architecture based on Chinese GPT-2 (Du, 2019) to implement the dialog system, which is depicted in Figure 2.
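The construction of the conditioning input can be sketched as follows. The bracketed field tags are placeholders for illustration; the real baseline uses its own special tokens and segment layout.

```python
def build_input(entity_history, user_utt, kb_result=""):
    """Concatenate the conditioning input for the generative model.

    The entity name history replaces the full dialogue history, keeping the
    sequence short; the model then generates entity name, user intent,
    system intent and response continuations from this prefix.
    """
    history = ";".join(entity_history)
    return f"[EN_HIST]{history}[USER]{user_utt}[KB]{kb_result}[SYS]"
```

At step 3 of the pipeline, the `[KB]` field is filled with the query result before the system intent and response are generated.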
Data Analysis As described in Section 1, there are chit-chat or redundant turns in real-life dialogues. From MobileCS, we find that these redundant turns can be divided into three cases: 1) one speaker asks for repetition and the other repeats what he/she said before; 2) one speaker confirms information and the other responds to it passively; 3) the user interrupts the agent with some simple interjections, and then the agent continues to speak. Three examples corresponding to the three cases are shown in Table 2. These redundant turns are interesting new phenomena revealed by the MobileCS data, which are transcribed from spoken conversations. Remarkably, the repetition and confirmation may be caused by the staff not hearing clearly due to the accent or low quality of the user's speech. The interjection is a special feature of spoken dialogues. However, after transcription of the speech, the speech modality is missing, since only the text remains. Thus, the system in textual dialogues receives no relevant input from the user and is thus unable to respond properly. We leave further study of this problem for future work. In this work, we perform some pre-processing on the data to reduce the noise brought by the three cases. Specifically, for the first two cases, we simply delete the whole redundant turn (including utterances on both the user and system sides) from the dialogue. For the third case, we merge the redundant turn with its previous turn by deleting the user utterance and merging the agent response with the previous one. Finally, we obtain a cleaned dataset with 15% fewer turns than the original one.
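The pre-processing of the three cases can be sketched as below. The per-turn `"case"` label is a stand-in for whatever detector identifies the three cases; the turn format is hypothetical.

```python
def clean_dialogue(turns):
    """Drop or merge redundant turns.

    Each turn is a dict with "user", "system", and a "case" label in
    {None, "repeat", "confirm", "interject"}. Cases 1 and 2 delete the whole
    turn; case 3 drops the user interjection and merges the agent response
    into the previous turn.
    """
    cleaned = []
    for turn in turns:
        case = turn.get("case")
        if case in ("repeat", "confirm"):
            continue  # delete the redundant turn entirely
        if case == "interject" and cleaned:
            cleaned[-1]["system"] += turn["system"]  # merge agent response
            continue
        cleaned.append(dict(turn))
    return cleaned
```

Applied over the corpus, a procedure of this shape yields the cleaned dataset with fewer turns than the original.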

Task 1 Results
Metrics The evaluation metrics are two-fold. The metric for entity extraction is span-level F1, following previous named entity recognition work (Yamada et al., 2020b). The metric for slot filling is triple-level F1: a predicted entity-slot-value triple is correct if and only if the entity, slot and value are all correct. The evaluation for slot filling involves a combinatorial optimization problem, as the entity is also predicted. We hence utilize the Hungarian algorithm (Kuhn, 1955) to perform entity matching between predictions and golden labels before calculating the metric for slot filling.
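The matching step can be sketched with a brute-force stand-in for the Hungarian algorithm. Exhaustive search is factorial and only viable for tiny entity sets; the actual evaluation uses the polynomial-time Hungarian algorithm, but the objective being maximized is the same.

```python
from itertools import permutations

def best_entity_matching(pred_entities, gold_entities, score):
    """Find the one-to-one entity matching that maximizes the total score.

    `score(pred, gold)` is any pairwise similarity (e.g. number of correct
    triples shared). Returns (best_total, mapping from pred index to gold
    index, with None for unmatched predictions).
    """
    n, m = len(pred_entities), len(gold_entities)
    best_total, best_map = -1.0, {}
    for perm in permutations(range(m), min(n, m)):
        idx = list(perm) + [None] * (n - len(perm))  # pad unmatched preds
        total = sum(
            score(pred_entities[i], gold_entities[j])
            for i, j in enumerate(idx) if j is not None)
        if total > best_total:
            best_total, best_map = total, {i: j for i, j in enumerate(idx)}
    return best_total, best_map
```

Once predictions are aligned to gold entities, triple-level precision, recall and F1 follow directly.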

Results
The models are trained on the training set for a certain number of epochs (shown in Table 1), selected according to performance on the dev set, and evaluated on the official test set. The results are shown in Table 3. It can be seen that even with powerful pre-trained language models as text encoders, the performance of the baseline model is poor on the MobileCS dataset, especially for the named entity recognition and slot recognition sub-tasks, as compared to results on other datasets reported in the literature (Yamada et al., 2020a). These results demonstrate how demanding the MobileCS dataset is, and indicate that extracting structured information from long and loose texts, e.g. dialogue transcripts, remains challenging for existing models, calling for more powerful and robust ones.

Task 2 Results
Metrics In order to measure the performance of TOD systems, both automatic evaluation and human evaluation are conducted. For automatic evaluation, the metrics include Precision/Recall/F1 (P/R/F1) score, Success rate and BLEU score. P/R/F1 are calculated for both predicted user intents and system intents. Success rate is the percentage of generated dialogs that achieve user goals. Specifically, for each dialogue, we extract the information requested in the user goal from the local KB; we then regard the dialogue as a success if the generated responses contain all the requested information. The BLEU score evaluates the fluency of generated responses by comparing them with oracle responses. For human evaluation, 5 testers (staff from China Mobile) interacted with the system, each for at least 10 dialogues. The testers scored the system on a 5-point scale (1 to 5) using the following 3 metrics. Success measures whether the system successfully completes the user goal by interacting with the user. Coherency measures whether the system's response is logically coherent with the dialogue context. Fluency measures the fluency of the system's response.
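The Success-rate computation can be sketched as follows; the dialogue record fields below are illustrative, not the evaluation script's real format.

```python
def success_rate(dialogues):
    """Fraction of dialogues whose generated responses contain every value
    requested in the user goal.

    Each dialogue pairs `"requested"` (values extracted from the local KB
    according to the user goal) with `"responses"` (the generated system
    responses); a dialogue succeeds only if all requested values appear
    somewhere in the concatenated responses.
    """
    successes = 0
    for d in dialogues:
        text = " ".join(d["responses"])
        if all(value in text for value in d["requested"]):
            successes += 1
    return successes / len(dialogues) if dialogues else 0.0
```

Substring containment is a deliberately strict criterion: a paraphrased but correct answer that omits the exact value string counts as a failure.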

Results
Based on the analysis in Section 5, the results of the Task 2 baseline fall behind by a large margin in comparison to the results on other Wizard-of-Oz datasets. For example, the Success rate of state-of-the-art models on MultiWOZ2.1 is around 75%, while it is lower than 30% on MobileCS. The BLEU score on MobileCS is much lower than that on CrossWOZ (Liu et al., 2021).
Note that the TOD systems on both MobileCS and CrossWOZ are based on Chinese GPT-2, though the two are not strictly comparable. These results demonstrate how challenging building TOD systems for real-life tasks is. Agent responses from real life are much more difficult to model than those in Wizard-of-Oz scenarios.
We further perform human evaluation for the best baseline model (i.e. the model trained on the cleaned dataset), and the average scores over all tested dialogues are shown in Table 5. The scores of the three metrics are relatively low (lower than 3), which shows that in most cases, responses generated by the baseline system are neither fluent nor coherent enough, and cannot provide the requested information satisfactorily. In short, building a TOD system that performs well on real-life dialogues is very challenging, and there is much room for the baseline TOD system to improve. The MobileCS dataset offers a valuable and challenging testbed for future research on building human-robot dialogue systems for real-life tasks.
Conclusion
The performance of task-oriented dialogue systems on Wizard-of-Oz datasets has been improved continuously to a high level, for example, as shown on MultiWOZ. However, Wizard-of-Oz dialogue data are in fact simulated data and thus fundamentally different from real-life conversations, which are more noisy and casual. For further advancement of human-robot dialogue technology, real human-human dialogue data with grounding annotations to KBs are highly desirable. Further, noting that the KB is an indispensable part of TOD systems and is usually not readily available for real-life tasks, it is very important to investigate not only the dialogue system itself but also information extraction to construct the KB.
With the MobileCS dataset released by the SereTOD challenge, this paper presents a baseline study of both information extraction (Task 1) and human-robot dialogue (Task 2) over real human-human dialogue data. We introduce how the baselines for the two tasks are constructed, the problems encountered, and the results. It is found that the MobileCS dataset offers a challenging testbed for both tasks, with interesting open problems. Our baselines provide an easy entry point to investigate the new dataset, and we anticipate that they can facilitate exciting future research to build human-robot dialogue systems for real-life tasks.

Figure 2: The baseline model architecture for Task 2. Examples are provided under the title of each box.

Figure 1: The overall model architecture of the pipeline model for Task 1. For the sub-task entity slot alignment, we utilize markers (e.g., <ent_1>entity mention</ent_1>) to highlight entities and slots in the original text input.

Table 1: Hyper-parameters for fine-tuning LSTM and PLMs (BERT, RoBERTa) on the Track 1 task.

Table 2: Examples of three types of redundant turns in MobileCS. The redundant utterances are marked in blue.
Table 3: Experimental results of Task 1 on the official test set, with different text encoder backbones. "Golden Labels" means using golden prerequisite labels (e.g. golden entities for entity coreference resolution) for each pipeline step. "Pipeline" means using previous predictions for each pipeline step. The evaluation metric is micro F1 for named entity recognition and slot recognition, the B-cubed metric (Bagga and Baldwin, 1998) for entity coreference resolution, and accuracy for entity slot alignment. NER: Named Entity Recognition. ECR: Entity Coreference Resolution. SR: Slot Recognition. ESA: Entity Slot Alignment. SF: Slot Filling.

Table 4: The results of the Task 2 baseline on the official dev set. U-P/R/F1 and S-P/R/F1 denote P/R/F1 for the user side and the system side, respectively.

Table 5: Human evaluation of the Task 2 baseline system (trained on the cleaned dataset).