DialogQAE: N-to-N Question Answer Pair Extraction from Customer Service Chatlog

Harvesting question-answer (QA) pairs from customer service chatlog in the wild is an efficient way to enrich the knowledge base for customer service chatbots in the cold start or continuous integration scenarios. Prior work attempts to obtain 1-to-1 QA pairs from growing customer service chatlog, which fails to integrate the incomplete utterances from the dialog context for composite QA retrieval. In this paper, we propose N-to-N QA extraction task in which the derived questions and corresponding answers might be separated across different utterances. We introduce a suite of generative/discriminative tagging based methods with end-to-end and two-stage variants that perform well on 5 customer service datasets and for the first time setup a benchmark for N-to-N DialogQAE with utterance and session level evaluation metrics. With a deep dive into extracted QA pairs, we find that the relations between and inside the QA pairs can be indicators to analyze the dialogue structure, e.g. information seeking, clarification, barge-in and elaboration. We also show that the proposed models can adapt to different domains and languages, and reduce the labor cost of knowledge accumulation in the real-world product dialogue platform.


Introduction
The development of natural language processing and conversational intelligence has radically redefined the customer service landscape. Customer service chatbots empowered by knowledge bases or frequently asked questions (FAQs) drastically enhance the efficiency of customer support (e.g. Cui et al., 2017; Ram et al., 2018; Burtsev et al., 2018; Liu et al., 2020; Paikens et al., 2020). In the cold start or continuous integration scenarios, harvesting QA pairs from existing or growing customer service chatlog is an efficient way to enrich knowledge bases. Besides, the retrieved QA pairs can be valuable resources to improve dialogue summarization (Lin et al., 2021), gain insights into prevalent customer concerns, and figure out new customer intents or sales trends (Liang et al., 2022), which are of vital importance to business growth.
Prior work on question-answer extraction follows the utterance matching paradigm, e.g., matching the answers to the designated questions in a dialogue session (Jia et al., 2020) in the offline setting, or figuring out the best response to a specific user query (Wu et al., 2017; Zhou et al., 2018; Yuan et al., 2019) in the online setting. Within this framework, 1-to-1 QA extraction has been explored by Jia et al. (2020). However, we argue that users might not cover all the details in a single query while interacting with customer service agents, which means that a certain QA pair might involve multiple utterances in the dialogue session.
In this paper, we extend 1-to-1 QA matching to N-to-N QA extraction, where the challenges are two-fold: 1) cluster-to-cluster QA matching, with no prior knowledge of the number of utterances involved in each QA pair or the number of QA pairs in each dialogue session, as a question might be distributed across one or multiple user queries; 2) session-level representation learning over a longer context, as the paired questions and answers might be separated within the dialogue, and the model must detect multiple same-topic questions and then check whether an answer is related to any of them. We propose session-level tagging-based methods to deal with the two challenges. Our method is generic: it is compatible not only with the N-to-N QA extraction setting but also with 1-to-1 and 1-to-N. Switching from matching to tagging, we feed the entire dialogue session into powerful pre-trained models, like BERT (Devlin et al., 2019) or mT5 (Xue et al., 2021), and design a set of QA tags that empowers N-to-N QA matching. Through careful analysis, we find that DialogQAE can serve as a powerful tool for dialogue structure analysis, as the relations between and within QA pairs are implicit signals for dialogue actions like information seeking, clarification, barge-in and elaboration.
From a pragmatic perspective, we show that the proposed models can be easily adapted to different domains and languages, and largely accelerate knowledge acquisition on FAQs of real-world users in the product dialogue system. We summarize our contributions below:
• We set up a benchmark for DialogQAE with end-to-end and two-stage baselines that support N-to-N QA extraction, as well as utterance- and session-level evaluation metrics.
• We show that DialogQAE is an effective paradigm for dialogue analysis by summarizing 5 between-QA-pairs and 3 in-QA-pair relations that characterize the dialogue structure in the conversation flow.
• Through careful analysis of domain and language adaptation, as well as real-world applications, we show that the proposed DialogQAE model effectively automates and accelerates the cold-start or upgrade of a commercial dialogue system.

Task Overview
A complete snippet of customer service conversation between human service representatives and customers, canonically termed a dialogue session S, consists of multiple, i.e. n, dialogue utterances. Formally we have S = {(u_1, r_1), (u_2, r_2), ..., (u_n, r_n)}, in which r_i signifies the speaker role of the i-th utterance u_i. In this paper we focus on two-party dialogue; more concretely we have r_i ∈ {C, A}, in which 'C' and 'A' represent the roles of the two speakers: customers and human agents respectively.
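The formalization above can be made concrete with a small sketch (the data structure is our own illustration; the utterances are taken from the Fig 1 example):

```python
# A dialogue session S as an ordered list of (utterance u_i, role r_i) pairs,
# with roles restricted to customer 'C' and human agent 'A'.
session = [
    ("Give me the phone number of the delivery department.", "C"),
    ("The mobile phone number of the delivery man is [Phone].", "A"),
    ("Is there anything else I can help you with?", "A"),
]
assert all(r in {"C", "A"} for _, r in session)  # two-party dialogue
n = len(session)  # the session length n
```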
After feeding the dialogue session into the DialogQAE model, we expect the model to extract a set of QA pairs R = {(U_1^Q, U_1^A), ..., (U_m^Q, U_m^A)}, where U_j^Q and U_j^A represent the unions of question and answer dialogue utterances in S, respectively.
To better characterize the proposed N-to-N dialogue QA extraction, we introduce two notions that concern the mapping between dialogue utterances and extracted QA pairs. 1) Exclusive dialogue utterance: each utterance in S can only be exclusively mapped to a single question or answer union in R, i.e. the mapping between S and R is a one-to-one (injective) function. 2) Speaker role consistency: a common assumption for most two-party conversations is that the customers raise questions while the agents answer them. Formally, for each extracted QA pair U_j^Q and U_j^A in R, {r_1:s^q} = {C} and {r_1:t^a} = {A}. In our setting, the rule of exclusive dialogue utterance strictly holds for all the datasets we used. However, although most datasets in this paper exhibit speaker role consistency, we still observe customer queries in the answer unions or agent responses in the question unions; for example, in Fig 1 the 23rd utterance, from the agent, is included in the question union Q3.
As shown in Fig 1, depending on the sizes of the question and answer unions U_j^Q and U_j^A in a certain QA pair, we categorize the DialogQAE task into four types: 1-to-1, 1-to-N, N-to-1 and N-to-N, in which the former and latter numbers indicate the sizes of the question and answer unions respectively.
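The two notions and the four task types can be sketched as simple checks over utterance indices (function names and representation are ours, for illustration only):

```python
def qa_type(u_q, u_a):
    """Categorize a QA pair by the sizes of its question/answer utterance unions."""
    return f"{'1' if len(u_q) == 1 else 'N'}-to-{'1' if len(u_a) == 1 else 'N'}"

def role_consistent(u_q, u_a, roles):
    """Speaker role consistency: all questions come from 'C', all answers from 'A'."""
    return all(roles[i] == "C" for i in u_q) and all(roles[i] == "A" for i in u_a)

qa_type([3, 4], [5])                   # 'N-to-1'
role_consistent([0], [1], ["C", "A"])  # True
```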

Tagging as Extraction
Prior mainstream research on dialogue QA matching is based on text segment alignment, such as matching specific answers with given questions in QA extraction (Jia et al., 2020), measuring the similarity of the user query and candidate answers in response selection (Wu et al., 2017; Henderson et al., 2019), or extracting relations with entity matching in the dialogue (Tigunova et al., 2021).
The key to successfully excavating n-to-n QA pairs from customer service chatlog is to figure out the cluster-to-cluster mapping among utterances in a dialogue session.
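The cluster-to-cluster mapping can be recovered mechanically once every utterance carries a tag from {O, Q_1, A_1, Q_2, A_2, ...}: utterances sharing the index j form the unions U_j^Q and U_j^A. A minimal sketch of this grouping step (data and helper name are our own illustration):

```python
from collections import defaultdict

def tags_to_qa_pairs(tags):
    """Group utterance indices by their Q/A tag index; 'O' utterances are dropped."""
    questions, answers = defaultdict(list), defaultdict(list)
    for i, tag in enumerate(tags):
        if tag.startswith("Q"):
            questions[tag[1:]].append(i)
        elif tag.startswith("A"):
            answers[tag[1:]].append(i)
    return {j: (questions[j], answers.get(j, [])) for j in questions}

pairs = tags_to_qa_pairs(["Q1", "A1", "O", "Q2", "Q2", "A2"])
# pair '2' maps two question utterances to one answer, i.e. an N-to-1 pair
```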

End-to-end QA Extraction
We convert QA extraction into a "fill-in-the-blank" sequence labeling task, hoping that the model would quickly learn to predict the labels. As depicted in Figure 2, the input of the model is "r_1: u_1 [MASK] [SEP] r_2: u_2 [MASK] [SEP] ... r_n: u_n [MASK]", where [MASK] and [SEP] signify the mask token and separation token, both of which are in the vocabulary of the masked language model. We formulate the QA sequence labeling task as either a generative, i.e. mT5-style (Xue et al., 2021), or a discriminative, i.e. BERT-style (Devlin et al., 2019), classifier.
From the generative perspective, i.e. with the span-corruption model mT5, the [MASK] token becomes the sentinel <extra_id_i> for each utterance u_i, and we use the semicolon (;) as the replacement for the separation token [SEP]. The output of the QA extraction model is a list of Q/A labels L: for the encoder-only model each prediction is made exactly at the masked position, while for the encoder-decoder model mT5 the prediction is a sequence "<extra_id_0> l_1 ... <extra_id_{n-1}> l_n". For the discriminative tagging model (BERT-style), we use [unusedX] tokens to denote the label set {O, Q_1, ..., A_1, ...}, and for the span-corruption encoder-decoder model, we simply use the text forms "O, Q1, ..., A1, ..." to represent the labels.
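The two serialization formats can be sketched as follows (identifiers are ours; the exact prompt layout in the actual implementation may differ slightly):

```python
def bert_style_input(roles, utterances):
    # One [MASK] slot per utterance; the discriminative tagger predicts a
    # label token at each masked position.
    return " [SEP] ".join(f"{r}: {u} [MASK]" for r, u in zip(roles, utterances))

def mt5_style_io(roles, utterances, labels):
    # Sentinel tokens <extra_id_i> replace the masks; the separator is ';'.
    src = " ; ".join(f"{r}: {u} <extra_id_{i}>"
                     for i, (r, u) in enumerate(zip(roles, utterances)))
    # The target spells the labels out as text after each sentinel.
    tgt = " ".join(f"<extra_id_{i}> {l}" for i, l in enumerate(labels))
    return src, tgt

src, tgt = mt5_style_io(["C", "A", "A"], ["u1", "u2", "u3"], ["Q1", "A1", "O"])
# tgt == '<extra_id_0> Q1 <extra_id_1> A1 <extra_id_2> O'
```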

Two-stage QA Extraction
Instead of predicting the label of each utterance in a single round, we can decouple the process into two steps (Moryossef et al., 2019; Liu et al., 2019, 2021), which first figure out the questions and then extract the corresponding answers. We illustrate the two-stage workflow in Figure 6: in the first stage, the model input is the same as in end-to-end QA extraction (Sec 3.2), but the dialogue question extractor only predicts l_stage1 ∈ {O, Q_1, ...}, to determine whether each utterance in the dialogue session is a question or not. In the second stage, we fill in the labels where the stage-1 model predicts questions. Then we feed the filled utterance sequence to the dialogue answer extractor, which predicts the remaining utterances within the label set l_stage2 ∈ {O, A_1, ...}, to decide whether they are the answer A_j to the question Q_j. Moreover, in the 1-to-N (including 1-to-1) scenario, where a question covers only a single utterance, we can further break down the question labeling process in a context-less way. As shown in Figure 7, at stage 1, we separately perform binary classification over {Q, O} for each utterance with the input format "[CLS] U_i", where [CLS] is the classification token; at stage 2, we relabel those predicted as Q in sequential order Q_1, Q_2, ... to fill in the blanks, and then apply the same dialogue answer extraction strategy.
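The decomposition can be sketched as a small pipeline (question_tagger and answer_tagger are hypothetical stand-ins for the stage-1 and stage-2 models):

```python
def two_stage_extract(session, question_tagger, answer_tagger):
    # Stage 1: label each utterance as a question (Q1, Q2, ...) or O.
    stage1 = question_tagger(session)
    # Stage 2: with the question labels filled in, label the remaining
    # utterances as A_j (answer to question Q_j) or O.
    stage2 = answer_tagger(session, stage1)
    # Keep stage-1 question labels; take the rest from stage 2.
    return [s1 if s1 != "O" else s2 for s1, s2 in zip(stage1, stage2)]

labels = two_stage_extract(
    ["u1", "u2", "u3"],
    lambda s: ["Q1", "O", "O"],      # stub stage-1 predictions
    lambda s, q: ["Q1", "A1", "O"],  # stub stage-2 predictions
)
# labels == ['Q1', 'A1', 'O']
```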

Datasets
We conduct experiments on 5 Chinese customer service multi-turn dialogue datasets, namely CSDS (Lin et al., 2021), MedQA (Jia et al., 2020), EduQA, CarsaleQA and ExpressQA. CSDS and MedQA are two public dialogue chatlog datasets, while the latter three are internal datasets accumulated through genuine customer-agent interactions on a commercial industrial dialogue platform. CSDS is derived from the JDDC (Chen et al., 2020a) corpus and tailored for dialogue summarization, where N-to-N QAs are also provided as clues for the summaries. MedQA is accumulated on a medical QA platform (https://www.120ask.com/) that covers conversations between doctors and patients. EduQA, CarsaleQA and ExpressQA, as indicated by their names, come from real-world conversations in the education, carsale and express delivery domains. As shown in Table 1, EduQA, CarsaleQA and ExpressQA are composed exclusively of 1-1 QA pairs, while CSDS and MedQA involve 1-N, N-1 and N-N mappings in the extracted QA pairs.

Evaluation Metrics
To evaluate the performance at the utterance level, we apply the traditional precision (P), recall (R) and F1 metrics, which ignore the non-QA label "O":

P = Σ_{i=1}^{N} |Pred^(i) ∩ Ref^(i)| / Σ_{i=1}^{N} |Pred^(i)|,  R = Σ_{i=1}^{N} |Pred^(i) ∩ Ref^(i)| / Σ_{i=1}^{N} |Ref^(i)|,  F1 = 2PR / (P + R),

where N is the number of instances, and Pred^(i), Ref^(i) denote the prediction and reference label sequences of the i-th instance.
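A minimal sketch of these utterance-level metrics (the exact aggregation over instances is our reading of the definitions):

```python
def utterance_prf(preds, refs):
    """Micro-averaged P/R/F1 over utterance labels, ignoring the 'O' label."""
    correct = sum(p == r and r != "O"
                  for pr, rf in zip(preds, refs) for p, r in zip(pr, rf))
    n_pred = sum(p != "O" for pr in preds for p in pr)  # predicted Q/A labels
    n_ref = sum(r != "O" for rf in refs for r in rf)    # reference Q/A labels
    p = correct / n_pred if n_pred else 0.0
    r = correct / n_ref if n_ref else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

utterance_prf([["Q1", "A1", "O"]], [["Q1", "O", "O"]])  # (0.5, 1.0, 0.666...)
```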
Similarly, for QA-pair level evaluation, we propose the adoption rate (AR), hit rate (HR) and session F1 (S-F1):

AR = Σ_{i=1}^{N} |R_pred^(i) ∩ R_ref^(i)| / Σ_{i=1}^{N} |R_pred^(i)|,  HR = Σ_{i=1}^{N} |R_pred^(i) ∩ R_ref^(i)| / Σ_{i=1}^{N} |R_ref^(i)|,  S-F1 = 2·AR·HR / (AR + HR),

where R_pred^(i), R_ref^(i) denote the predicted and reference QA-pair sets of the i-th instance.

Table 1: We list the number of sessions (#Sess), the average number of utterances (Avg_Us), questions (Avg_Qs) and answers (Avg_As) in each session. We also report the average distance between the starting and ending utterances within a QA pair (Dist_QAs), which signifies the (minimal) context required for successful QA extraction, as well as the ratio of 1-1, 1-N, N-1 and N-N QA pairs in each dataset.

From the perspective of FAQ database population by extracting QA pairs from customer service chatlog, the predicted QA pairs serve as an automated module in the workflow, followed by human verification. The adoption rate (AR) corresponds to the ratio of "accepted" QA pairs by human judges within the predicted QAs, which is analogous to utterance-level precision. The hit rate (HR), on the other hand, signifies the proportion of predicted QAs among all annotated QAs within the dialogue session, which corresponds to utterance-level recall.
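Under these definitions, the session-level metrics can be sketched analogously, treating each extracted QA pair as a hashable (question-union, answer-union) tuple (the aggregation over sessions is our assumption):

```python
def session_metrics(pred_sets, ref_sets):
    """AR, HR and S-F1 over per-session sets of extracted QA pairs."""
    hit = sum(len(set(p) & set(r)) for p, r in zip(pred_sets, ref_sets))
    n_pred = sum(len(p) for p in pred_sets)  # all predicted QA pairs
    n_ref = sum(len(r) for r in ref_sets)    # all annotated QA pairs
    ar = hit / n_pred if n_pred else 0.0     # adoption rate ~ precision
    hr = hit / n_ref if n_ref else 0.0       # hit rate ~ recall
    sf1 = 2 * ar * hr / (ar + hr) if ar + hr else 0.0
    return ar, hr, sf1

# one session: two predicted pairs, one of which matches the single reference
session_metrics([[((0,), (1,)), ((2,), (3,))]], [[((0,), (1,))]])
```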

Baseline Performance
Tables 2 and 3 show the utterance-level and session-level performance of QA extraction on the MedQA and CSDS datasets respectively. For both end-to-end and two-stage models, enlarging the model size leads to a considerable performance gain, which indicates that dialogue session encoders with higher capacity are of vital importance for extracting QA pairs. In terms of the comparison between end-to-end and two-stage models, we observe that the two-stage models outperform end-to-end models on the MedQA dataset while it is the other way around on the CSDS dataset, which shows that end-to-end methods are more favorable for N-to-N QA extraction that requires reasoning over a longer dialogue context, such as CSDS (6.88 in Dist_QA and 65.57% non-1-to-1 QAs, as shown in Table 1), presumably because the complex Q-A mapping exaggerates the error propagation of aligning potential answers to the predicted questions in the two-stage models. For the model performance on the N-to-N mapping shown in Fig 3, as expected, the models get higher scores on 1-to-1 mappings than on N-to-N mappings.
We also highlight the comparison between the generative (mT5-style, 'Gen') and discriminative (BERT-style, 'Tag') models in Tables 2 and 3. We observe that with comparable pre-trained model sizes, i.e. DeBERTa-large (304M) versus mT5-base (580M) and MegatronBERT (1.3B) versus mT5-large (1.5B), generative models perform better on the CSDS dataset while discriminative models win on the MedQA dataset, showing that T5 models might be a promising option for dialogue analysis with long context (Meng et al., 2022). We believe that model size is an important factor in performance, since intuitively a model with more
parameters would fit the data better, and yet discriminative models with the Masked Language Model pre-training task may not enjoy the same scaling law (Hoffmann et al., 2022) as generative models do.

Dialogue Structure Analysis
Prior research has tried to extract and analyze the structure of a given dialogue session through latent dialogue states (Qiu et al., 2020), discourse parsing (Galitsky and Ilvovsky, 2018) or event extraction (Eisenberg and Sheriff, 2020). However, those methods are specific to predefined semantic/information schemas or ontologies, i.e., discourse, dependency or AMR parsing trees (Xu et al., 2021, 2022), dialogue actions, or event/entity labels (Liang et al., 2022). Through the analysis of the extracted QA pairs of a dialogue session, we summarize a more general schema to categorize the dialogue structure according to the customer-agent interaction in the dialogue flow. Fig 4 demonstrates the typical between-QA-pairs relations based on the extracted QA mappings. The most common case is Sequential QA Flow, where Position(A1) < Position(Q2) and Role(Q1) = Role(Q2); in this case, one complete QA pair follows another. For Follow-up Information Seeking, Position(A1) < Position(Q2) but Role(Q1) ≠ Role(Q2), indicating the answer leads to a new question. For Elaboration/Detailing, Position(Q2) < Position(A1) and Role(Q1) = Role(Q2), which means one person asked two questions in a row and, in turn, the other answered consecutively.
In the example, the doctor sequentially answers the consecutive questions raised by the patient, with the second answer elaborating on the first one. For Clarification/Confirmation, Position(Q2) < Position(A1), Role(Q1) ≠ Role(Q2) and Position(A2) < Position(A1), which implies the first question cannot be answered yet and more information is needed from the questioner; once provided, the first question can finally be answered correctly. In the example, the doctor asks a clarifying question about when the symptoms occurred instead of answering the patient's inquiry instantly. Barge-in/Interruption, which is not common, is the case where Position(Q2) < Position(A1), Role(Q1) = Role(Q2) and Position(A2) < Position(A1), i.e. the second question is answered first. As shown in Fig 3, the QA extraction models perform better on SF, FIS and BI than on CC and ED, presumably because interleaving QA pairs pose a bigger challenge to dialogue information extraction.
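The position/role rules above can be folded into one small classifier (the condition Position(A1) < Position(A2) for Elaboration/Detailing and the fallback branch are our reading of the definitions):

```python
def between_pair_relation(q1, a1, q2, a2, same_asker):
    """Relation between two QA pairs; q1/a1/q2/a2 are utterance positions,
    same_asker means Role(Q1) == Role(Q2)."""
    if a1 < q2:  # the first pair completes before the second question starts
        return "Sequential QA Flow" if same_asker else "Follow-up Information Seeking"
    # otherwise q2 < a1: the two pairs interleave
    if same_asker:
        return "Barge-in/Interruption" if a2 < a1 else "Elaboration/Detailing"
    return "Clarification/Confirmation" if a2 < a1 else "Other"

between_pair_relation(1, 2, 3, 4, same_asker=False)  # 'Follow-up Information Seeking'
```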
We delve into the relative positions of the question and answer utterances within an N-to-N QA pair in Fig 5. Most questions and answers are disjoint within a QA pair, while overlapping questions and answers account for 26.91% of the CSDS dataset. We take a further step and split the overlapping QAs into two circumstances: in-pair Q-A and in-pair Q-A-Q, depending on the role (Q or A) of the last utterance in the QA pair. As illustrated in Fig 3, all three QAE models perform better on disjoint QA pairs than on overlapping ones.
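The in-pair categorization can be expressed the same way (representation is ours, for illustration):

```python
def in_pair_relation(q_pos, a_pos):
    """Disjoint vs. overlapping QA pair; the overlap subtype is decided by
    whether the last utterance of the pair is a question or an answer."""
    if max(q_pos) < min(a_pos):
        return "disjoint"
    last = max(q_pos + a_pos)
    return "in-pair Q-A-Q" if last in q_pos else "in-pair Q-A"

in_pair_relation([0, 3], [1, 2])  # 'in-pair Q-A-Q'
```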

Domain and Language Adaptation
We illustrate the domain and language adaptation of our DialogQAE models in Tables 4 and 5 respectively, which highlight the real-world utility of our models.
In Table 4, we observe that mixing the datasets from different domains is a simple but effective way to boost the overall performance. The correlation between different domains is the key factor in model performance on domain transfer, e.g. the bidirectional transfer between the carsale and education domains gets higher scores than other domain pairs.

Table 5: The illustration of the language transfer of the end-to-end (Gen, mT5-xl) model trained on MedQA. We abbreviate the precision and adoption rate scores at the utterance and session level as 'P' and 'AR'. The scores correspond to the accuracy of the predicted QA pairs, according to the human judges.
Thanks to the multilingual nature of mT5, the models trained on the Chinese datasets can be easily applied to datasets in other languages, e.g. English. We test the Chinese DialogQAE model on different domains of the MultiDOGO dataset (Peskov et al., 2019). As the MultiDOGO dataset does not have QA-pair annotations, we ask human annotators to decide whether the QA pairs recognized by the MedQA-DialogQAE model are eligible according to the semantics of the dialogue flow. We use majority votes among 3 human judges, and the inter-annotator agreement (Krippendorff's alpha) is 0.89.
Related Work

QA Pair Generation

One line of work collects question-answer pairs over 10,000 Wikipedia articles. For question generation, Yang et al. (2017) use a trained model to generate questions on unlabeled data. Later, Wang et al. (2019) proposed to identify key phrases first and then generate questions accordingly. For machine reading comprehension (MRC), research on dialogue MRC aims to teach machines to read dialogue contexts and make responses; Zeng et al. (2020) aim to answer a question based on a passage as context. Shinoda et al. (2021) leveraged variational question-answer pair generation for better robustness on MRC. However, extraction methods that can work in the 1-1, 1-N and N-N scenarios are under-explored.

Dialogue Analysis
For dialogue information extraction (IE), in order to save the effort of assessors in the medical insurance industry, Peng et al. (2021) proposed a dialogue IE system to extract keywords and generate insurance reports. To figure out the semantics in the dialogue flow, Galitsky and Ilvovsky (2018) proposed a dialogue structure-building method based on the discourse tree of questions. Qiu et al. (2020) incorporated structured attention into a Variational Recurrent Neural Network for unsupervised dialogue structure induction. Eisenberg and Sheriff (2020) introduced a new problem, extracting events from dialogue, annotated the Personal Events in Dialogue Corpus, and trained a support vector machine model. Relation extraction over dialogue is a newly defined task by DialogRE (Yu et al., 2020), which focuses on extracting relations between speakers and arguments in a dialogue. DialogRE is an English dialogue relation extraction dataset consisting of 1788 dialogues and 36 relations. MPDD (Chen et al., 2020b) is a multi-party dialogue dataset built on five Chinese TV series, with both emotion and relation labels on each utterance. Long et al. (2021) proposed a consistent learning and inference method for dialogue relation extraction, which aims to minimize possible contradictions. Fei et al. (2022) introduced a dialogue-level mixed dependency graph. Shi and Huang (2019) proposed a deep sequential model for discourse parsing on multi-party dialogues; the model predicts dependency relations and constructs a discourse structure jointly and alternately.

Conclusion
In this paper, we propose N-to-N question-answer (QA) pair extraction from customer service dialogue history, where each question or answer may involve more than one dialogue utterance. We introduce a suite of end-to-end and two-stage tagging-based methods that perform well on 5 customer service datasets, as well as utterance- and session-level evaluation metrics for DialogQAE. With further analysis, we find that the extracted QA pairs characterize the dialogue structure, e.g. information seeking, clarification, barge-in, and elaboration. Extensive experiments show that the proposed models can adapt to different domains and languages and largely accelerate knowledge accumulation in a real-world dialogue platform.

Limitations
This work focuses on N-to-N question and answer extraction from a dialogue session and does not touch relevant tasks such as question generation (e.g. Du et al., 2017; Duan et al., 2017) and dialogue summarization (Lin et al., 2021). The proposed task can be seen as a preparation for the subsequent tasks, decomposing the entire procedure of question generation and dialogue summarization into two steps: extraction before generation. The extracted QA pairs can also be further processed in order to visualize some important factors for customer service, like common customer concerns about the products, winning sales scripts to persuade customers, and emerging or trending user intents (Liang et al., 2022), via a set of atomic natural language processing modules like keyword extraction, sentiment analysis, semantic parsing and clustering.

Ethics Statement
The internal datasets we used in this paper, i.e. EduQA, CarsalesQA and ExpressQA, have gone through a strict data desensitization process, with the guarantee that no user privacy or any other sensitive data is exposed, enforced by a hybrid of automatic and human verification. Human verification also eliminates dialogue sessions with gender/ethnic biases or profanities. The other two datasets, CSDS and MedQA, are publicly available and we use them without any modification. The model for extracting questions and answers in the dialogue paves the way for N-to-N dialogue QA extraction, without any risk of violating the EMNLP ethics policy.

A.1 Model Performance on the Internal Datasets
We show the model performance on the internal datasets, i.e. EduQA, CarsalesQA and ExpressQA, in Table 6. In terms of the comparison between end-to-end and two-stage models, end-to-end models are the clear winners with respect to session-level F1 on the CarsalesQA and EduQA datasets, while two-stage models take the lead on the ExpressQA dataset. According to the dataset statistics in Table 1, we conjecture this is because ExpressQA has longer dialogue sessions (14.13 average utterances per session) and requires longer context (2.57 versus 2.23/1.36 in Dist_QA) for extracting QA pairs. The DialogQAE models have been deployed on a commercial platform for conversational intelligence. The module serves as an automatic dialogue information extractor, followed by human verification and modification of the extracted QA pairs so that they can serve as standard and formal FAQs in customer service. According to user feedback from the online customer service department of an international express company, the assistance of dialogue QA extraction has largely accelerated the information enrichment of customer service FAQs, reducing the cycle from around 8 days per update to 2 days.

A.2 Between-QA-Pairs Relations Examples
We show more examples of the dialogue structure from the MedQA dataset below.

You can pick it up at the delivery site.

Do you have any follow-up question on our latest feedback, or other new questions?
UID=2 联系一下送货的，什么时候能到，等一天了。 Contact the delivery person. When will it arrive? I have been waiting for a whole day.
UID=3 请稍等我为您查询一下。 Please wait a second. I am working on it.
UID=4 还需要查询一下什么时间可以安装 Also need to check when it can be assembled.
UID=5 查好了么？ Have you done checking?
UID=6 我正在查看中，请您稍等。 I'm double checking, please wait for a second.
UID=7 这个预计[日期]可以到达。 It is expected to arrive on [Date].
UID=8 [日期]为您协商安装。 Assemble for you on [Date].
UID=9 想问几点能够送达? I want to know exactly at what time it will be delivered?
UID=10
UID=11 具体时间需要您与配送员师傅联系 You need to contact the delivery person to get the detailed delivery time.

Figure 1:
Figure 1: The task overview for DialogQAE. Given a session of two-party conversation with 32 utterances (top), we aim to extract six QA pairs (bottom) that characterize the dialogue structure and can serve as a valid resource to enrich the knowledge base. The task can be categorized into four types: 1-to-1, 1-to-N, N-to-1, and N-to-N, according to the number of utterances involved in the extracted question or answer unions.

Figure 2:
Figure 2: The model workflow for the 'fill-in-the-blank' style one-stage dialogue QA pair extraction in the end-to-end fashion. U1:4, Q1:2, A2 and O represent dialogue utterances, questions, answers and not-Q-or-A utterances.


Figure 4:
Figure 4: The demonstration of the between-QA-pairs relations and their proportions in the MedQA dataset. Given a snippet of consecutive dialogue utterances (for UIDs, u1 < u2 < u3 < u4), we roughly categorize the dialogue flows into 5 different types according to the between-QA interactions between customers and human agents.

Figure 5:
Figure 5: The demonstration of the in-QA-pair relations and their proportion in the CSDS dataset. According to the relative positions of Q and A utterances in an N-to-N QA pair, we categorize the interleaving utterances into 3 types.

Table 2:
The benchmark for the QA extraction task on the MedQA dataset. The discriminatively (BERT-style) and generatively (mT5-style) trained end-to-end models are abbreviated as 'Tag' and 'Gen'. The 2 variants (B+G and G+G) of two-stage models differ in the model formulation of the first stage, i.e. binary classifier (Fig 6) versus mT5-style generative (Fig 7) model. We highlight the winner in each training strategy and the best scores with boldface and underlined marks.

Table 3:
The benchmark for the QA extraction task on the CSDS dataset. Note that two-stage (B+G) models (Table 2) are incompatible with CSDS, as the binary classifier is tailored for 1-to-1 and 1-to-N extraction (Sec 3.3).

Table 6 :
The performances of the end-to-end and two-stage models (mT5-large) for the QA extraction task on the internal datasets.