Dialogue Act-based Breakdown Detection in Negotiation Dialogues

Thanks to the success of goal-oriented negotiation dialogue systems, studies of negotiation dialogue have gained momentum in terms of both human-human negotiation support and dialogue systems. However, the field suffers from a paucity of available negotiation corpora, which hinders further development and makes it difficult to test new methodologies in novel negotiation settings. Here, we share a human-human negotiation dialogue dataset in a job interview scenario that features increased complexities in terms of the number of possible solutions and a utility function. We test the proposed corpus using a breakdown detection task for human-human negotiation support. We also introduce a dialogue act-based breakdown detection method, focusing on dialogue flow that is applicable to various corpora. Our results show that our proposed method features comparable detection performance to text-based approaches in existing corpora and better results in the proposed dataset.


Introduction
Negotiation is an essential task in our daily life. In negotiation, people work to maximize their profits by bargaining; however, negotiation sometimes breaks down due to conflicts between people's competing interests. To help them reach a rational agreement, previous studies of multiagent systems have proposed the use of negotiating agents (Lin and Kraus, 2010; Jonker et al., 2017; Baarslag et al., 2013a). Recently, several studies have succeeded in modeling a negotiating agent in natural language that can control both text generation and reasoning in the context of goal-oriented dialogue systems, and such agents have outperformed human players in some cases (Lewis et al., 2017; He et al., 2018; Cheng et al., 2019). Further, support for human-human negotiation in natural language has also been tackled, involving negotiation corpora developed for goal-oriented dialogue systems, such as Nash bargaining solution estimation (Iwasa and Fujita, 2018), real-time negotiation coaching (Zhou et al., 2019), and negotiation breakdown detection (Yamaguchi and Fujita, 2020). (* This work was conducted when the first author was a master's student at the University of Sheffield, UK.)
Although negotiation dialogues have recently attracted additional attention, there are only a few negotiation corpora, as the most recent follow-up studies (Iwasa and Fujita, 2018; Cheng et al., 2019; Zhou et al., 2019; Yamaguchi and Fujita, 2020) have only utilized the DEALORNODEAL (DN) (Lewis et al., 2017) and CRAIGSLISTBARGAIN (CB) (He et al., 2018) datasets, either individually or together. Moreover, most existing corpora have simplified negotiation settings; for example, the DN dataset handles the negotiation of item division between humans with 22.5 possible solutions per dialogue and uses a standard linear additive utility function (Keeney and Raiffa, 1993; Raiffa et al., 2002) for scoring. The CB dataset is only concerned with price negotiation on a listed product between two human negotiators. These settings might make it easy for a machine learning (ML) model to reach an optimal solution or fulfill its goal. Finally, some existing corpora (Konovalov et al., 2016; Petukhova et al., 2016; Asher et al., 2016) other than the DN and CB datasets have far smaller samples (scenarios), which makes it challenging to use them for goal-oriented dialogue systems or end-to-end human-human negotiation support. All of these factors inhibit further development in the field and its future applicability to real-world problems. Furthermore, no effective breakdown detection method for negotiation dialogues has been proposed. Negotiation features certain unique characteristics relative to other dialogues, such as offering proposals, accepting them, and making counter-offers (Thompson et al., 2010; Traum et al., 2008). If a breakdown detection method can incorporate these characteristics, the quality of breakdown detection will be improved.
This study proposes a new negotiation corpus in a job interview setting with increased complexities in terms of the range of solutions and the utility function. We enact a breakdown detection task (Yamaguchi and Fujita, 2020) across three negotiation datasets, including the proposed one, with a novel dialogue act-based approach that can focus on dialogue flow. This task can support human-human negotiation by alerting negotiators to potential breakdowns, which prevents the loss of time and negotiator utility. We highlight the following contributions: 1. We develop a new English negotiation corpus for a job interview setting, consisting of 2639 crowd-sourced dialogues (Section 3).
2. We propose a novel breakdown detection method that employs dialogue act-based features and a gated recurrent unit (GRU) (Chung et al., 2014)-based model (Section 5).
3. We demonstrate that the proposed method exhibits results that are comparable to models with text-based features in the existing corpora and outperforms them in the proposed corpus, which has a far smaller breakdown ratio (Section 7). 4. We conduct ablation studies and error analyses to examine how our proposed features work with a GRU-based model (Section 7).

Related Work Automated Negotiation in Multiagent Systems
Automated negotiation is a field of research in which computers negotiate with each other and seek an appropriate agreement without human intervention (Baarslag et al., 2013a). Typical applications include supply chain management (Wang et al., 2009) and smart grids (Ketter et al., 2013).
As automated negotiation has gained momentum, the International Automated Negotiating Agents Competition (ANAC) (Baarslag et al., 2015; Jonker et al., 2017) has been held annually since 2010. This event encourages the development of state-of-the-art negotiating strategies for automated negotiating agents in both agent-agent and human-agent (Mell et al., 2018) negotiations. The major difference between automated negotiation and our setting is that the former supports negotiation by letting the agents negotiate instead of humans, whereas the latter seeks to support human-human negotiation in natural language only by providing feedback to negotiators with ML models.
NLP for Human-human Negotiation Support Automated negotiation has gained a great deal of attention, but there have been only a few studies conducted on support for human-human negotiation in natural language. Iwasa and Fujita (2018) proposed a GRU-based model to suggest a draft agreement that maximizes the sum of utilities based on the estimated weights of all items in the DN dataset. Zhou et al. (2019) proposed a dynamic negotiation coaching method in the setting of the CB dataset that provides useful recommendations to sellers, resulting in increased profits. Our work is a follow-up study to Yamaguchi and Fujita (2020), who demonstrated that neural-network (NN)-based models trained with text-based features could capture signs of breakdowns in the DN and CB datasets.
Here, we show that text-based methods underperform our dialogue act-based approach at detecting breakdowns in the proposed corpus.
Negotiation Dialogue Systems Previous efforts at building negotiation dialogue systems initially focused on modeling strategic aspects (Cuayáhuitl et al., 2015; Keizer et al., 2017; Petukhova et al., 2017) to construct an agent that could outperform human players by controlling a discrete action space. By contrast, Lewis et al. (2017) and He et al. (2018) have recently tried to simultaneously handle both text generation and reasoning by employing end-to-end neural negotiating models; moreover, Cheng et al. (2019) proposed adversarial training to improve the robustness of goal-oriented models. Although our main scope is supporting human-human negotiation, our corpus can also be used for goal-oriented dialogue systems (Lewis et al., 2017; He et al., 2018; Cheng et al., 2019), as its fundamental design is drawn from the DN dataset. Our corpus and that of Konovalov et al. (2016) are similar to each other, in that both handle a job contract scenario. However, three main differences appear between the two: (1) The former handles human-human negotiation, whereas the latter deals with human-agent negotiation.
(2) The former considers 11.5 times more possible solutions per dialogue than the latter. (3) The former has 2639 dialogues, and the latter has 105.

Dialogue Breakdown Detection Challenge
The recently held Dialogue Breakdown Detection Challenge (DBDC) (Higashinaka et al., 2016;Hori et al., 2019) was intended to improve the coherency of a dialogue system. Given a dialogue history between a human and a system, the task is to evaluate whether a certain system response is valid. By contrast, our study focuses on predicting negotiation outcomes based on human-human negotiation to avoid negotiation breakdowns; that is, our task is different from the DBDC.

Overview
The JOBINTERVIEW (JI) dataset is an instance of multi-issue multi-option negotiation, which includes the preferences of the negotiators, a dialogue history, proposed offers, and a settled agreement in a job interview setting. The negotiators conduct a conversation in English in the roles of recruiter or applicant and negotiate regarding the issues listed in Table 1 to maximize their scores. A dialogue sample from the JI dataset is shown in Table 2. Our dataset is available at https://github.com/gucci-j/negotiation-breakdown-detection, and details on the negotiation interface and procedures are given in Appendix A.


Mathematical Design
To make the negotiation competitive, we define each negotiator's preferences and a scoring function, as in Lewis et al. (2017). In addition, we consider the interdependency (Kardan and Janzadeh, 2008; Alam et al., 2013) between a pair of issues such that the negotiators cannot easily reach an optimal agreement (Ito et al., 2006), leading them to seek a compromise solution through dialogue.
Preferences The importance of each issue and option, and the bias assignment representing interdependency between specific issues, are defined as follows. Two negotiators A = {a_1, a_2} participate in a negotiation over the set of independent issues I and the set of issues J with an interdependent relationship.
While each issue in an interdependent pair (j_from, j_to) ∈ J^2 has its own weight per negotiator a_k, only an option of j_to has a bias with respect to the options of j_from and j_to; that is, o_{j_from} does not have a bias. The bias b_{(o_{j_to}, o_{j_from})} is defined for each particular pair of options (o_{j_to}, o_{j_from}). Note that each weight and bias is initialized using uniform random numbers within a predefined range.

Scoring Function
We define a scoring (utility) function to calculate a negotiation score. The weight of an option w_{o_{j_to}} is normalized after considering its bias. More specifically, when an option o_{j_from} is in a draft agreement, the normalized weight of the option w_{o_{j_to}} is calculated using min-max normalization. The scoring function is then defined as a sum over issues of the issue weight multiplied by the (normalized) weight of the chosen option, U(s) = Σ_i w_i · w_{o_i^s}, where o_i^s is the option of issue i included in a draft agreement s. The function is derived from a linear additive utility function, utilized in automated bilateral negotiation (Baarslag et al., 2016) and in Lewis et al. (2017).
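A hedged sketch of this scoring scheme in Python, assuming hypothetical weight and bias tables (the paper's exact normalization constants and random ranges are not reproduced here):

```python
def normalize(w, w_min, w_max):
    """Min-max normalization of an option weight."""
    if w_max == w_min:
        return 0.0
    return (w - w_min) / (w_max - w_min)

def score(agreement, issue_weights, option_weights, biases):
    """Linear additive utility: sum of issue weight * normalized option weight.

    agreement: dict issue -> chosen option
    option_weights: dict (issue, option) -> weight
    biases: dict ((issue_from, option_from), (issue_to, option_to)) -> bias
            added to option_to's weight when option_from is also chosen.
    """
    total = 0.0
    for issue, option in agreement.items():
        w = option_weights[(issue, option)]
        # Apply a bias only if the interdependent partner option was chosen.
        for (o_from, o_to), b in biases.items():
            if o_to == (issue, option) and o_from in agreement.items():
                w += b
        ws = [v for (iss, _), v in option_weights.items() if iss == issue]
        total += issue_weights[issue] * normalize(w, min(ws), max(ws))
    return total
```

This is a sketch of the design described above, not the paper's exact implementation; in particular, the min-max range here ignores the bias term for simplicity.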

Data Collection
We hired workers through Amazon Mechanical Turk to collect human-human dialogues. Only those based in the USA with at least 1,000 previous HITs and an approval rating of over 95% could join our experiments. Before each session, the workers read the task description and instructions for negotiating with the opponent. During a negotiation, each worker could propose a draft agreement up to three times and was asked to send six messages or more in total to submit the proposal. We paid $0.20 per dialogue and gave a $(score − 5)/5 bonus if the score was more than 5/10 to promote efficient negotiations.

Table 3 shows the quantitative comparison of three negotiation dialogue corpora. The vocabulary size is the largest in the CB dataset because it handles several categories of listed products. The JI and DN datasets focus on a single domain, and of the two, the former has the larger vocabulary size. The average number of turns per dialogue in the JI dataset is the largest of the three, though it has the smallest average number of words per turn. These statistics indicate that participants in the JI dataset likely had enough conversations to reach agreement.

Agreement Ratio
The JI dataset had the highest agreement ratio at 92.9%, in sharp contrast to the values of 76.2% and 74.9% for the DN and CB datasets. This difference may be because the participants in the JI dataset could propose intermediate offers up to three times each, while those in the existing corpora could only submit one proposal per session.

Complexity of Negotiation Scenarios
The JI dataset has far fewer Pareto optimal solutions for agreements than the DN dataset, which can be ascribed to the following reasons: (1) the larger number of issues and options in the JI dataset, with 9920 possible solutions per dialogue, and (2) the introduction of an interdependent relationship that prevented the scoring function from following a standard linear additive utility function. As a result, participants in the JI dataset struggled to find better solutions and might have compromised with each other more often than in the DN dataset.

Task Description
Task We formally define the task of breakdown detection in negotiation dialogues. Let D be a negotiation dialogue between two negotiators, composed of n turns' utterances {s_1, s_2, ..., s_n} (n ∈ N), where each utterance s is a message from one of the negotiators and includes one or more sentences. Given D, the task is to label D as either a success (reaching an agreement: 0) or a breakdown (failing to find an agreement: 1).

Evaluation Metrics
To evaluate the effectiveness of the different approaches, we employ the area under the ROC curve (ROC-AUC) and the confusion matrix (CM), both of which are based on Yamaguchi and Fujita (2020). We also use average precision (AP) to account for the imbalanced nature of breakdown labels in negotiation datasets.
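With scikit-learn, the three measures can be computed as follows; the labels and scores below are toy values, not results from the paper:

```python
from sklearn.metrics import (average_precision_score, confusion_matrix,
                             roc_auc_score)

y_true = [0, 0, 1, 0, 1, 0]                    # 1 = breakdown
y_score = [0.1, 0.4, 0.8, 0.2, 0.6, 0.3]       # model probabilities
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

auc = roc_auc_score(y_true, y_score)           # ranking quality
ap = average_precision_score(y_true, y_score)  # robust to class imbalance
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
```

ROC-AUC and AP are computed from the raw scores, while the confusion matrix requires thresholded predictions (a 0.5 threshold is assumed here).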

Methodology
This section introduces our breakdown detection approach using dialogue act-based features and ML models, including linear and NN-based models. The intuition behind these features is that because a breakdown dialogue should have a distinct flow (e.g., many disagreements), focusing on the dialogue flow can help detect this type of breakdown.

Dialogue Act Extraction
Our dialogue acts and their extraction are based on He et al. (2018), but we made some changes in the extraction process to capture dialogue flow effectively. The process consists of two stages: (1) pattern matching and (2) filtering and alignment. The first step is almost identical to He et al. (2018), but the second is newly designed for this study.
Pattern Matching Given a dialogue turn, we extract dialogue acts according to the matching patterns (Table 4) using regular expressions. If no pattern matches, an unknown tag <unk> is given. Note that because negotiators in the JI dataset can propose intermediate offers up to three times and because such offers are part of negotiations, we add corresponding dialogue acts whenever these offers are detected during conversations.
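As a rough illustration only (these are not the actual patterns from Table 4), the pattern-matching stage might look like:

```python
import re

# Hypothetical patterns in the spirit of He et al. (2018); the paper's
# actual Table 4 patterns differ.
PATTERNS = [
    ("<greet>", re.compile(r"\b(hi|hello|hey)\b", re.I)),
    ("<inquire>", re.compile(r"\?")),
    ("<propose>", re.compile(r"\b(offer|propose|how about)\b|\$\d+", re.I)),
    ("<agree>", re.compile(r"\b(deal|agree|sounds good|ok(ay)?)\b", re.I)),
    ("<disagree>", re.compile(r"\b(no|can'?t|won'?t|too low)\b", re.I)),
]

def extract_acts(turn: str):
    """Return all dialogue acts matched in a turn, or <unk> if none match."""
    acts = [tag for tag, pat in PATTERNS if pat.search(turn)]
    return acts or ["<unk>"]
```

Unlike He et al. (2018), which keeps one act per turn, every matching pattern contributes an act here; the filtering stage then prunes the sequence.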
Filtering and Alignment He et al. (2018) only extracted one dialogue act per turn. However, because negotiators could send one or more sentences per turn in the DN, CB, and JI datasets, a turn may contain two or more dialogue acts. To capture the dialogue flow in detail while reducing matching noise from the rule-based extraction, we filter the extracted dialogue acts following Figure 1, which only allows dialogue acts to appear in the designated order. If an illegal dialogue act follows a matched one, all remaining unmatched ones are discarded. The constrained flow is motivated by an alternating-offer protocol (Rubinstein, 1982). Although human-human dialogues do not have a well-defined negotiation protocol, unlike the case of automated negotiation, we assume that human negotiators follow an unwritten code to reach agreement with their opponents. Table 2 shows an example of extracted dialogue acts along with the text.
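A minimal sketch of the filtering step, assuming a hypothetical linear ordering of acts; Figure 1's actual constraint is a transition graph, so the ORDER list below is purely illustrative:

```python
# Hypothetical per-turn ordering; acts must appear in non-decreasing
# ORDER position, and everything after the first violation is dropped.
ORDER = ["<greet>", "<inquire>", "<inform>", "<disagree>", "<propose>", "<agree>"]

def filter_acts(acts):
    """Keep acts that follow the assumed order; discard the rest of the turn."""
    kept, last = [], -1
    for act in acts:
        if act not in ORDER:          # e.g. <unk> passes through unchanged
            kept.append(act)
            continue
        idx = ORDER.index(act)
        if idx >= last:
            kept.append(act)
            last = idx
        else:
            break                     # illegal act: drop all remaining acts
    return kept
```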

Using Dialogue Act-based Features as Inputs for ML Models
Once we extract all features from a dialogue, we concatenate each turn with the addition of a separator tag <sep> to the head of each turn and an end tag <end> to the end of the dialogue. We then create an input vector for linear or NN-based models and use it to train the model. The input vector is produced as follows: Linear Models We create a count vector by counting the number of each dialogue act per dialogue, including <unk>, <sep> and <end> tags.
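The count-vector construction for linear models can be sketched as follows; the tag vocabulary is an assumption based on the acts discussed in this paper:

```python
# Assumed tag vocabulary (order is arbitrary but fixed).
VOCAB = ["<greet>", "<inquire>", "<propose>", "<disagree>", "<agree>",
         "<inform>", "<unk>", "<sep>", "<end>"]

def to_count_vector(dialogue_acts):
    """Count each tag over the flat act sequence of one whole dialogue."""
    return [dialogue_acts.count(tag) for tag in VOCAB]
```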

NN-based Models
We convert each extracted dialogue act into a one-hot representation e ∈ R^{1×10}, which includes a padding tag <pad>. We then concatenate all one-hot representations in time series per dialogue, which generates an input matrix E ∈ R^{n×10}, where n is the number of extracted dialogue acts, including padding.
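The one-hot construction for NN-based models can be sketched as follows, assuming a 10-tag vocabulary that includes <pad>:

```python
# Assumed 10-tag vocabulary; index 0 is the padding tag.
TAGS = ["<pad>", "<greet>", "<inquire>", "<propose>", "<disagree>",
        "<agree>", "<inform>", "<unk>", "<sep>", "<end>"]

def to_one_hot_matrix(acts, max_len):
    """Pad/truncate to max_len, then one-hot encode: shape (max_len, 10)."""
    acts = (acts + ["<pad>"] * max_len)[:max_len]
    matrix = []
    for act in acts:
        row = [0] * len(TAGS)
        row[TAGS.index(act)] = 1
        matrix.append(row)
    return matrix
```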
Experimental Settings

Classification Models
We experiment with linear and NN-based models trained with either text-based or dialogue act-based features: LR-BOW A logistic regression model trained with bag-of-words features weighted by TF-IDF.
GRU A GRU-based model with a linear layer on top of recurrent units. For text-based inputs, we used frozen pre-trained 300-dimensional word embeddings (GloVe) (Pennington et al., 2014). We also considered the model with a self-attention mechanism (GRU-Att) (Zhou et al., 2016). Random A naive classifier that predicts negotiation outcomes by respecting the training set's class distribution.
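A minimal sketch of the GRU model with a linear layer on top, assuming one-hot dialogue-act inputs of dimension 10; the hidden size and other hyperparameters here are placeholders, not the tuned values from Appendix B:

```python
import torch
import torch.nn as nn

class GRUTagDetector(nn.Module):
    """GRU over one-hot dialogue acts, linear layer on the final state."""

    def __init__(self, input_dim=10, hidden_dim=32):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        # x: (batch, seq_len, input_dim) one-hot dialogue acts
        _, h = self.gru(x)            # h: (num_layers, batch, hidden_dim)
        logits = self.fc(h[-1])       # (batch, 1)
        return torch.sigmoid(logits).squeeze(-1)  # breakdown probability

model = GRUTagDetector()
probs = model(torch.zeros(2, 5, 10))  # two dialogues, five acts each
```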

Data and Preprocessing
We employed the three negotiation datasets compared in Table 3 for our experiments. The breakdown label of each dataset was assigned as follows. DN: A log has either a <disagree> or <no agreement> tag inside an <output> tag. CB: A log does not have an offer price. JI: The "status" in a log is not "completed." For the CB and JI datasets, we removed short dialogues with fewer than three turns, as these are often labeled as breakdown and rarely include bargaining components, such as proposals. After the removal, the breakdown ratios of the CB and JI datasets were 18.9% and 4.9%. We preprocessed texts with lower-casing and inserted the <sep> and <end> tags into each dialogue, as in the dialogue act-based case. We tokenized the texts using spaCy (https://spacy.io/). For BERT, we used a pre-trained BERT tokenizer provided by the Transformers library (Wolf et al., 2020).
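The labeling rules above can be sketched as follows; the log-field names ("output", "offer_price") are assumptions about each dataset's raw format, and the DN rule is simplified to a single field:

```python
def breakdown_label(dataset: str, log: dict) -> int:
    """Return 1 for breakdown, 0 for success, per the rules described above."""
    if dataset == "DN":
        # Simplification: assumes the <output> content is a single tag field.
        return int(log["output"] in ("<disagree>", "<no agreement>"))
    if dataset == "CB":
        return int(log.get("offer_price") is None)
    if dataset == "JI":
        return int(log["status"] != "completed")
    raise ValueError(f"unknown dataset: {dataset}")

def keep_dialogue(dataset: str, num_turns: int) -> bool:
    """CB and JI drop dialogues shorter than three turns."""
    return dataset == "DN" or num_turns >= 3
```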

Implementation Details
We trained and tested models using stratified five-fold cross-validation. The model-specific implementation details are as follows: Linear Model We implemented an LR-BOW model using Scikit-learn (Pedregosa et al., 2011) and trained it on an Intel Core i5 (2.9 GHz, 6267U). We tested the n-gram combinations {(1, 1), (1, 2), (1, 3)}. We applied L2 regularization and weight adjustments to make the class weights inversely proportional to the class frequencies in the training data.
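The LR-BOW setup can be sketched with scikit-learn as follows; the toy texts, labels, and n-gram range are illustrative only, not the tuned configuration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# TF-IDF-weighted bag-of-words into L2-regularized logistic regression,
# with class weights inversely proportional to class frequencies.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # uni- and bi-grams (illustrative)
    LogisticRegression(penalty="l2", class_weight="balanced"),
)

texts = ["i agree deal", "no way too low", "sounds good deal", "no deal at all"]
labels = [0, 1, 0, 1]                     # 1 = breakdown (toy data)
model.fit(texts, labels)
preds = model.predict(texts)
```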

NN-based Models
We set the maximum number of epochs to 100 for GRU-based models and 20 for BERT-based models, with early stopping. We further split the training folds into training (80%) and validation (20%) subsets. We used the binary cross-entropy loss and optimized the models with an Adam optimizer (Kingma and Ba, 2014). We implemented the models using PyTorch (Paszke et al., 2019) and tuned their hyperparameters based on validation F1. For BERT-based models, we utilized the implementation provided by HuggingFace (Wolf et al., 2020). We trained and tested our models with an NVIDIA Tesla V100 (SXM2, 32GB).

Quantitative Results
Results in Existing Corpora We can observe from Table 5 that a fine-tuned BERT BASE model shows the best AP for the DN and CB datasets. Moreover, NN-based models with text-based features exhibit results that are comparable to those of the best-performing models in terms of AP, in the 95% confidence interval. The proposed approach (GRU TAG ) also showed comparable results for either AP or CM in both datasets. Although a logistic regression model with text-based features (LR-BOW TEXT ) produced poor results in terms of AP, it showed the best results for the pair of FN and TP and that of TN and FP in the DN and CB datasets, respectively.
However, because we intend to support human-human negotiation, accurate classification of both cases is vital to providing beneficial feedback to negotiators; thus, this approach is not helpful for our task. Meanwhile, NN-based models with text-based features did not perform well in the JI dataset. This was likely due to the far smaller breakdown ratio of 4.9% in that dataset, compared to 23.8% and 18.9% in the DN and CB datasets. However, BERT-based models showed far better results than GRU-based ones in terms of the TP ratio. We hypothesize that BERT's rich contextualized information helped detect signs of breakdown.

Ablation Study
We conducted two ablation studies to better understand the dialogue act-based input features. We first analyzed the importance of each dialogue act by replacing it with an unknown tag and tested with our best-performing model (GRU TAG) over the five test folds. The <agree> tag was important for breakdown detection across the three corpora, despite its infrequency, especially in the DN and JI datasets (Figure 2). The frequent tag <propose> also played an important role in classification. By contrast, the <disagree> and <inquire> tags were not important, except for the <inquire> tag in the CB dataset, possibly due to its having the highest frequency. Finally, the <greet> and <inform> tags were the least important in all datasets, as these appeared less frequently and are not as closely related to breakdown as the others. Next, we verified whether the GRU TAG model captured the roles of the <agree> and <disagree> tags in the breakdown detection task by replacing these tags with their counterpart or an <unk> tag (Figure 3). By replacing an <agree> tag with a <disagree> tag, we saw a rise in the TP ratio and a significant drop in the TN ratio compared to the baseline. When the <disagree> tag was replaced with an <agree> tag, the TN ratio slightly increased, while the TP ratio significantly decreased.

Error examples (FP: false positive from the DN dataset; FN: false negative from the CB dataset):

FP (DN)
Text: <sep> i'd love to take a book and two hats off your hands <sep> hm, not many points for me but i'll agree to that. <end>
Acts: <sep> <propose> <sep> <disagree> <end>

FN (CB)
Text: <sep> hello, i am very interested in your car. however $12000 is out of my price range for a car that is 7 years old. i offer $6000 and i will pick up the car myself. <sep> there is no possible way i could go that low. i would take $11,000 <sep> that's fine, i will go elsewhere with my money. <sep> okay <end>
Acts: <sep> <greet> <propose> <sep> <disagree> <propose> <sep> <agree> <sep> <agree> <end>
These results suggest that the model properly took into account the roles of <agree> and <disagree> to some extent, and that the number of such tags that appeared played an important role in detecting a breakdown. While replacement with an <unk> tag also showed a similar trend, except with the <disagree> tag in the JI dataset, this was probably due to the relative increase of the counterpart tag.

Error Analysis
Lastly, we conducted error analyses to examine the behavior of a GRU TAG model and reveal its potential limitations. The first example is an FP sample from the DN dataset, where the model possibly focused on a <disagree> tag corresponding to the word "not". The second one is an FN sample from the CB dataset, in which the model might have focused on repetitive <agree> tags. We consider that the proposed approach could not cope with euphemistic phrases because of the rule-based dialogue act extraction. Thus, annotating negotiation corpora with dialogue acts will be an important research direction for more precise detection.

Conclusions and Future Work
This study proposed a job interview negotiation dialogue dataset with 2639 dialogues and increased complexities compared to existing datasets, to help propel the study of human-human negotiation support and goal-oriented dialogue systems. We also proposed a dialogue act-based breakdown detection model that can focus on negotiation flow. Our approach (GRU TAG) showed comparable results on existing datasets and better results on the proposed dataset than models trained with text-based features. In the future, we intend to explore other applications of dialogue act-based features to related tasks, such as preference estimation. We will also utilize the proposed corpus in related tasks in human-human negotiation support and goal-oriented dialogue systems.

A Job Interview Negotiation Dataset
Here, we introduce the negotiation interface and negotiation procedures. Our dataset and negotiation interface are available at https://github.com/ gucci-j/negotiation-breakdown-detection.

A.1 Negotiation Interface
We developed an online negotiation interface for our job-interview negotiation, which implements all of the mathematical settings, such as the preferences and scoring function, discussed in the body of the paper. Figure 4 shows a screenshot of our negotiation interface.
At the beginning of each negotiation session, the interface generates negotiators' preferences and displays them next to the corresponding issues and options so that the negotiators can easily understand which issue and option is important for them.
During the session, whenever the negotiators select a new solution, the interface calculates the score of the solution according to the scoring function described in Subsection 3.2 and displays it with the corresponding evaluation. The evaluation is based on Table 7 and is intended to provide feedback to the negotiators to promote a better agreement.
At the end of the session, the interface stores a log that consists of the preferences of the participants, the dialogue history, the proposed offers, and the settled agreement in JSON format.

A.2 Negotiation Procedures
Before entering a negotiation session, each negotiator reads the instruction page that describes the outline of the negotiation, its procedures, and some precautions (e.g., the maximum number of proposals per negotiator). During the session, the negotiators can talk to their opponent using the left-hand side of the negotiation interface (Figure 4), while they can select an option for each issue on the right-hand side of the interface. In addition, they can check the current score, its evaluation, and the estimated HIT reward for the selected options.
When the negotiators believe that they have had sufficient discussion, they can propose a draft agreement by clicking the "PROPOSE" button shown on the bottom-left side of the interface. Once it is sent to the opponent, the opponent can check its details and score next to the "ACCEPT" button shown on the interface. If the opponent clicks the button, the negotiation is regarded as successful. Otherwise, the negotiation continues until both sides exceed the maximum number of propositions. If the limit is exceeded, the negotiation is regarded as a breakdown, and the score of each negotiator is recorded as zero.

B Hyperparameter Tuning
Linear Models For the DN dataset, the n-gram combination (1, 3) (uni-gram, bi-gram, and tri-gram) was chosen. For the CB dataset, (1, 2) (uni-gram and bi-gram) was selected. For the JI dataset, (1, 1) (uni-gram) was chosen. Since none of the models trained with dialogue act-based features worked, these have no optimal n-gram combinations.
Neural Network-based Models We tuned the hyperparameters of all NN-based models employed in our experiments using the Optuna framework (Akiba et al., 2019). We split the training folds into training (80%) and validation (20%) subsets. We tested 100 hyperparameter combinations and evaluated their performance based on F1 in each validation subset. Tables 8 and 9 show the hyperparameters and search spaces for the GRU and BERT-based models, respectively.

Figure 4: Negotiation interface used for the JI dataset. Each value shown next to an issue or an option denotes its importance for a negotiator. The score and importance of each issue and option were calculated by the interface based on the mathematical settings discussed in the body of the paper. Note that the scores shown on the interface are multiplied by ten for ease of the players' understanding.