Value type: the bridge to a better DST model



Introduction
Task-oriented dialogue systems, which help users accomplish everyday goals such as booking flights or restaurants, have become increasingly important in the field of Natural Language Processing (NLP) (Henderson et al., 2019; Hung et al., 2021; Zheng et al., 2022). Traditionally, a task-oriented dialogue system consists of four modules (Zhang et al., 2020): Natural Language Understanding (NLU), Dialogue State Tracking (DST), Dialogue Manager (DM) and Natural Language Generation (NLG). The DST module directly affects the decision-making behavior of the dialogue system and therefore plays an extremely important role in task-oriented dialogue (Lee et al., 2019). Recent DST methods fall mainly into two categories. The first category is ontology-based, which assumes the candidate slot values are known in advance, e.g., (Zhou et al., 2022; Ye et al., 2021b; Guo et al., 2021). The second category abandons the ontology entirely and assumes the slot values are unknown, e.g., (Wu et al., 2019; Kim et al., 2019; Kumar et al., 2020; Lin et al., 2021). However, most of this work models the dialogue state, the dialogue and the slots directly, ignoring that the value type of each slot may differ. If all slots are modeled uniformly, a feature specific to each slot is lost.
In this work, we propose a new DST framework named SVT-DST, which uses the Slot-Value Type as a bridge to improve model performance. With this method, each slot attends to the conversation history in a slot-specific way, which helps it better identify its value. Specifically, we first classify all the slots in the dataset according to their slot-value types. As shown in Figure 1, adjectives, time expressions and numbers correspond to the slots pricerange, arrive-time and book-people, respectively. We then train a sequence-labeling model on the dialogue training data to extract the entities and their entity types in each turn. We want the attention between the dialogue and a slot to be higher for turns that are close to the current turn and contain the same slot-value type. To achieve this, we use monotonically decreasing functions to integrate these weights into the attention operation, as described in detail in the Method section.
Our main contributions are as follows: 1) We classify the slots according to their slot-value types and train an NER model to extract these types, which we use to improve the attention formula. 2) We design a sampling strategy to integrate these types into the attention formula, which reduces the impact of NER errors. 3) We achieve competitive results on MultiWOZ 2.1 and 2.4, analyze the results, and point out directions for future work.

Method
Figure 2 shows the structure of our DST model, which includes an encoder, an attention module and a slot-value processing module. In this section, we introduce each module in detail.
A $T$-turn conversation can be expressed as $C_t = \{(U_1, R_1), \ldots, (R_{t-1}, U_t)\}$, where $R_t$ represents a system utterance and $U_t$ represents a user utterance. We define the dialogue state of the $t$-th turn as $B_t = \{(S_j, V^j_t) \mid 1 \le j \le J\}$, where $V^j_t$ is the value of the $j$-th slot $S_j$ at the $t$-th turn and $J$ is the number of predefined slots. Following (Ren et al., 2018), we express each slot as a "domain-slot" pair, such as "restaurant-price range".
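To make this notation concrete, the following minimal sketch (our own illustration, not the authors' code; the slot names are only examples) represents a dialogue state $B_t$ as a mapping from domain-slot pairs to values and carries it over between turns:

```python
# Minimal illustration of the dialogue-state notation above (not the authors' code).
# Each turn t carries a state B_t mapping every predefined "domain-slot" pair S_j
# to its current value V_t^j ("none" if the slot has not been filled yet).
from typing import Dict, List, Tuple

Turn = Tuple[str, str]  # (system utterance R, user utterance U)

def update_state(prev_state: Dict[str, str], turn_updates: Dict[str, str]) -> Dict[str, str]:
    """Carry over B_{t-1} and overwrite the slots mentioned in the current turn."""
    state = dict(prev_state)
    state.update(turn_updates)
    return state

if __name__ == "__main__":
    slots: List[str] = ["restaurant-pricerange", "train-arriveby", "hotel-bookpeople"]
    b_prev = {s: "none" for s in slots}                                # B_{t-1}
    b_curr = update_state(b_prev, {"restaurant-pricerange": "cheap"})  # B_t
    print(b_curr)
```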

Encoder
Following (Ye et al., 2021b), we use two BERT (Devlin et al., 2018) models to encode the context and the slots, respectively.

Context encoder
We express the dialogue at turn $t$ as $D_t = R_t \oplus U_t$, where $\oplus$ represents sentence concatenation. The dialogue history up to and including the $t$-th turn is then $M_t = D_1 \oplus D_2 \oplus \cdots \oplus D_t$. The output of the encoder is

$$C_t = \mathrm{BERT}_{\mathrm{finetuned}}(M_t) \in \mathbb{R}^{|X_t| \times d},$$

where $|X_t|$ is the length of $M_t$ and $d$ is the hidden size of BERT. $\mathrm{BERT}_{\mathrm{finetuned}}$ indicates that this BERT model updates a part of its parameters during training.
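As a rough sketch of this step (assuming a HuggingFace-style BERT interface; the model name and the way turns are joined are our assumptions, the paper only specifies concatenation), the context encoder can be approximated as:

```python
# Hedged sketch of the context encoder: encode the concatenated dialogue history M_t
# with a trainable BERT and keep all token representations C_t (shape |X_t| x d).
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert_finetuned = BertModel.from_pretrained("bert-base-uncased")  # parameters updated during training

def encode_history(turns):
    """turns: list of (system_utterance, user_utterance) pairs up to the current turn."""
    history = " ".join(f"{r} {u}".strip() for r, u in turns)     # M_t = D_1 + ... + D_t
    inputs = tokenizer(history, return_tensors="pt", truncation=True, max_length=512)
    outputs = bert_finetuned(**inputs)
    return outputs.last_hidden_state.squeeze(0)                  # C_t, one vector per token

c_t = encode_history([("", "I need a cheap restaurant."), ("What area?", "In the centre.")])
print(c_t.shape)
```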

Slot-value related encoder
We employ the first token to represent the aggregate representation of the entire input sequence. Therefore, for any slot $S_j \in S$ ($1 \le j \le J$) and any value $v^j_t \in V_j$ we have

$$h^{S_j} = \mathrm{BERT}_{\mathrm{fixed}}(S_j), \quad h^{v^j_t} = \mathrm{BERT}_{\mathrm{fixed}}(v^j_t). \quad (2)$$

For the dialogue state of the last turn $B_{t-1}$, we similarly have $h^{B_{t-1}} = \mathrm{BERT}_{\mathrm{fixed}}(B_{t-1})$, where $\mathrm{BERT}_{\mathrm{fixed}}$ indicates that this BERT model keeps its parameters fixed during training.
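Analogously, a minimal sketch of the slot/value encoder, assuming the same HuggingFace-style interface and using the first ([CLS]) token as the aggregate representation, could look as follows (again an illustration, not the authors' implementation):

```python
# Hedged sketch of the slot/value encoder: a second BERT with frozen parameters,
# using the first ([CLS]) token as the aggregate representation of the input.
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert_fixed = BertModel.from_pretrained("bert-base-uncased")
for p in bert_fixed.parameters():           # parameters are fixed during training
    p.requires_grad = False

@torch.no_grad()
def encode_phrase(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    out = bert_fixed(**inputs)
    return out.last_hidden_state[0, 0]      # first-token ([CLS]) representation, shape (d,)

h_slot = encode_phrase("restaurant-pricerange")
h_value = encode_phrase("cheap")
print(h_slot.shape, h_value.shape)
```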

Cross-Attention
We use the multi-head attention module (Vaswani et al., 2017) as the basis of our attention module.

Slot-Context Attention
We first calculate the bias term of the attention formula. For each dialogue history $M_t$, we use a monotonically decreasing distribution function $\eta(n)$ to initialize the weight of each turn of dialogue in the history, where $n = T - t$ represents the distance between that turn and the current turn: the closer a turn is to the current turn, the greater the weight it obtains.
Note that $\eta(T)$ represents the weight at distance $T$, i.e., between the first turn (turn 0) and the latest turn $t$. We also record the turns in the history that contain the value type $type_j$ of slot $S_j$, where the turn indices satisfy $n > m$. We then calculate the weight of these turns and, finally, add the two weights according to the turn indices to obtain the bias. The attention between $S_j$ and $C_t$ is then computed with this bias, where $\varphi(\cdot)$ denotes a learnable mapping built from an embedding and $W_{bias}$, $W_{r_1}$ and $W_{r_2}$ each denote a linear layer.
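The snippet below is only a plausible sketch, under our own assumptions, of how a turn-level bias built from $\eta(n)$ and the type-matched turns could be added to scaled dot-product attention scores; the learnable mapping $\varphi$ and the linear layers $W_{bias}$, $W_{r_1}$, $W_{r_2}$ of the actual model are omitted, and the exact combination used in the paper may differ.

```python
# Plausible sketch (our assumptions, not the paper's exact equations) of attention with
# an additive turn-level bias: distance weights eta(n) plus extra weight for turns whose
# extracted entity type matches the slot's value type.
import math
import torch
import torch.nn.functional as F

def eta(n: float) -> float:
    """A monotonically decreasing distribution function, e.g. 1 - n/30 (one of the choices explored later)."""
    return 1.0 - n / 30.0

def turn_bias(num_turns: int, type_matched_turns: set, match_weight: float = 1.0) -> torch.Tensor:
    """Per-turn bias: closer turns get larger eta weights; type-matched turns get an extra boost."""
    bias = torch.tensor([eta(num_turns - 1 - i) for i in range(num_turns)])
    for i in type_matched_turns:
        bias[i] += match_weight
    return bias

def biased_attention(q, k, v, token_to_turn, bias_per_turn):
    """q: (1, d); k, v: (L, d); token_to_turn: (L,) turn index of each context token."""
    scores = (q @ k.T) / math.sqrt(q.shape[-1])        # (1, L) scaled dot-product scores
    scores = scores + bias_per_turn[token_to_turn]      # add the turn-level bias to each token
    weights = F.softmax(scores, dim=-1)
    return weights @ v                                   # (1, d) attended representation

# Toy usage: 3 turns, turn 1 contains an entity of the slot's value type.
d, L = 8, 6
q, k, v = torch.randn(1, d), torch.randn(L, d), torch.randn(L, d)
token_to_turn = torch.tensor([0, 0, 1, 1, 2, 2])
out = biased_attention(q, k, v, token_to_turn, turn_bias(3, {1}))
print(out.shape)
```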

Slot-State Attention
For $S_j$ and the previous dialogue state $B_{t-1}$, the attention is computed analogously with the multi-head attention module.

Gate Fusion
Inspired by (Zhou et al., 2022), we employ a gate module to combine the slot-context attention and the slot-state attention, where $\otimes$ denotes the vector product, $\sigma$ denotes the sigmoid function and $\odot$ denotes the element-wise product.
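As the gate equation itself is not reproduced above, the following is only a generic sigmoid-gate sketch (our assumption, and possibly simpler than the actual module) of how the two attention outputs might be fused:

```python
# Generic sigmoid-gate fusion sketch (our assumption; the paper's exact formulation may differ):
# a learned gate mixes the slot-context attention output with the slot-state attention output.
import torch
import torch.nn as nn

class GateFusion(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.gate = nn.Linear(2 * d, d)

    def forward(self, r_context: torch.Tensor, r_state: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([r_context, r_state], dim=-1)))  # sigma(W[...])
        return g * r_context + (1.0 - g) * r_state                             # element-wise mix

fusion = GateFusion(d=768)
r_ctx, r_st = torch.randn(1, 768), torch.randn(1, 768)
print(fusion(r_ctx, r_st).shape)
```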

Self-Attention And Value Matching
For this part, we follow the corresponding modules of (Ye et al., 2021b).

NER Model and Sampling Strategy
We employ the W2NER model (Li et al., 2022) as our tagging model. Our label-making strategy is as follows: for each value in the ontology, if the value appears in the current turn, we tag it. For the sampling strategy, a turn is labeled with an entity type only when the target entities differ from the entities extracted in previous turns; this reduces the interference of duplicate entities. For the specific type assigned to each slot, please refer to the appendix. In particular, for slots of bool type, we train the annotation model to extract keywords such as internet, parking, etc.
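A minimal sketch of this label-making and sampling rule (our paraphrase in code; the real pipeline uses W2NER, whose interface is not shown, and the matching here is simplified to substring lookup) is given below:

```python
# Sketch of the labelling / sampling rule: a turn is labelled with an entity type only
# when the extracted entities differ from those already seen in previous turns,
# which reduces interference from duplicate entities.
from typing import Dict, List, Set, Tuple

def make_turn_labels(turn_text: str, ontology: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Tag every ontology value that literally appears in the current turn (simplified matching)."""
    labels = []
    for value_type, values in ontology.items():
        for v in values:
            if v.lower() in turn_text.lower():
                labels.append((v, value_type))
    return labels

def sample_turns(turn_entities: List[Set[Tuple[str, str]]]) -> List[bool]:
    """Keep a turn only if its entities differ from everything extracted in earlier turns."""
    seen: Set[Tuple[str, str]] = set()
    keep = []
    for entities in turn_entities:
        keep.append(bool(entities - seen))
        seen |= entities
    return keep

ontology = {"adjective": ["cheap", "expensive"], "time": ["18:30"]}
print(make_turn_labels("I want a cheap place at 18:30", ontology))
print(sample_turns([{("cheap", "adjective")}, {("cheap", "adjective")}, {("18:30", "time")}]))
```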

Optimization
We use the sum of the negative log-likelihoods over all slots as the loss function at each turn $t$:

$$L_t = \sum_{j=1}^{J} -\log p\bigl(V^j_t \mid r^j_t\bigr),$$

where $r^j_t$ denotes the output of the self-attention module corresponding to $S_j$ at the $t$-th turn.
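Assuming the value-matching module of (Ye et al., 2021b), one plausible realization of this per-turn loss (a sketch under that assumption, not necessarily the exact formulation) is a distance-based softmax over candidate value embeddings:

```python
# Sketch of the per-turn training loss, assuming STAR-style value matching:
# each slot output r_t^j is matched against candidate value embeddings by negative L2
# distance, and the negative log-likelihood of the gold value is summed over all slots.
import torch
import torch.nn.functional as F

def turn_loss(slot_outputs: torch.Tensor, value_embeddings: list, gold_indices: list) -> torch.Tensor:
    """slot_outputs: (J, d); value_embeddings[j]: (|V_j|, d); gold_indices[j]: gold value index."""
    loss = slot_outputs.new_zeros(())
    for j in range(slot_outputs.shape[0]):
        dist = torch.cdist(slot_outputs[j : j + 1], value_embeddings[j]).squeeze(0)  # (|V_j|,)
        log_probs = F.log_softmax(-dist, dim=-1)       # closer values get higher probability
        loss = loss - log_probs[gold_indices[j]]       # negative log-likelihood of the gold value
    return loss

J, d = 3, 16
outs = torch.randn(J, d)
vals = [torch.randn(5, d), torch.randn(4, d), torch.randn(6, d)]
print(turn_loss(outs, vals, [0, 2, 5]))
```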

Experiments

Dataset, Metric and Evaluation
We evaluate our method on the MultiWOZ 2.1 (Eric et al., 2019) and MultiWOZ 2.4 (Ye et al., 2021a) datasets, which provide turn-level annotations of dialogue states in 7 different domains. We follow the pre-processing and evaluation setup of (Wu et al., 2019), where the restaurant, train, attraction, hotel and taxi domains are used for training and testing. We use joint goal accuracy, i.e., the average accuracy of correctly predicting all slot assignments in a turn, to evaluate the main results of the models.
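For concreteness, joint goal accuracy can be computed as in the following small sketch (our illustration; the slot names are examples):

```python
# Joint goal accuracy: a turn counts as correct only if every predicted slot value
# exactly matches the gold dialogue state for that turn.
from typing import Dict, List

def joint_goal_accuracy(preds: List[Dict[str, str]], golds: List[Dict[str, str]]) -> float:
    correct = sum(1 for p, g in zip(preds, golds) if p == g)
    return correct / max(len(golds), 1)

golds = [{"hotel-area": "north", "hotel-parking": "yes"}, {"hotel-area": "north"}]
preds = [{"hotel-area": "north", "hotel-parking": "yes"}, {"hotel-area": "south"}]
print(joint_goal_accuracy(preds, golds))  # 0.5
```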

Baselines
(1) TRADE: a transferable dialogue state generator (Wu et al., 2019), which utilizes a copy mechanism to facilitate domain knowledge transfer.
(4) STAR: a framework with self-attention modules that better learns the relationships between slots (Ye et al., 2021b).
LUNA: applies a slot-turn alignment strategy to accurately locate slot values and their associated context (Wang et al., 2022).

Main Results And Analysis Experiments
Table 1 shows the results of our main test and the ablation study. On MultiWOZ 2.1, our base model achieves a joint goal accuracy of 53.28%, while our NER-based model achieves 55.37%, a significant improvement of 2.09% over the base model. On the MultiWOZ 2.4 dataset, our model achieves 68.28%, a significant improvement of 2.93% over the base model. When we use the correct type labels for training, the model performance reaches 59.27%, exceeding all baseline models; this ground truth is extracted from the turn labels according to the slot type, similarly to our sampling strategy. In order to model the attention over the dialogue state and the dialogue history separately, we changed the attention in STAR (Ye et al., 2021b) to a fusion of slot-state attention and dialogue-history attention; this change by itself reduces the performance of the model. However, the ablation experiments show that the method we propose does benefit the model.
Table 2 shows the results of our analysis experiments, which use different distribution functions to model the attention. For both the 2.1 and 2.4 datasets, the results show that distributions with a constant bias term tend to produce higher results, such as $0.5 \cdot (1 + x) + 1$ and $1 - x/30$, and that a power of 1 for the independent variable often has a positive impact on the results.

Case Study
We conducted a series of analysis experiments on the attention weights. As shown in Table 3, we randomly selected a slot, "attraction-name", and chose the example PMUL4648 from the test set to observe the attention distribution of this slot over the turns. In this example, the attraction-name slot is activated in turn 2. Function 3 attends to this turn with a large weight, followed by function 1. In contrast, function 2 assigns larger weights to the first turn, which indicates that its fitting effect is weaker than that of the other two functions. Our analysis is as follows. If there is no constant term in the distribution function, the difference between score+bias and score is not significant, so the performance improvement of the model is limited. On the other hand, when the power of the independent variable is greater than 1, as in function 2, the magnitude changes too sharply after the softmax; the transitions between turns are not smooth, which also limits the performance improvement. The results of the model trained with ground-truth labels show that there is still much room for improvement in the NER annotations. One of the biggest challenges is that the annotation model often assigns entities to labels based on fragmented tokens, without considering the context, which leads to a proliferation of labels. We will address this problem in future work.

Conclusion
In this paper, we propose an effective method to integrate slot-value types into the DST model. Specifically, we propose SVT-DST, a framework that incorporates slot-type information into the attention operation to help the model pay more attention to the turns that contain the value type of a given slot. Furthermore, we design a sampling strategy to integrate these types into the attention formula and reduce the impact of NER errors. Results on the MultiWOZ datasets show that our method brings significant improvements on this task.

Limitation
This work has two main limitations: (1) The performance of the model largely depends on the performance of the annotation model. If the annotation model is too simple, the performance of the DST model may decline; conversely, a stronger annotation model increases the complexity of the overall system and prolongs inference time. (2) Even with a well-performing labeling model, the tagged values may still interfere with the DST model; for details, please refer to the analysis experiments.

Implementation Details

Our encoder has 12 layers with a hidden size of 768. The number of trainable parameters of the whole model is 24.85M. The model is trained with a base learning rate of 0.0001 for 12 epochs, which takes about 4 hours. We use 1 NVIDIA 3090 GPU for all of our experiments. Joint goal accuracy is used to evaluate the performance of the models: predicted dialogue states are correct only when all of the predicted values exactly match the gold values. Reported results are the average of two runs. The annotation model is based on W2NER, which uses bert-large-cased (330M parameters) as its encoder.

Figure 1: Common slot-value types in conversation, such as location, adjective, number and time.

Figure 2: The overall architecture of our proposed model.

Table 3: One case of the attention between the attraction-name slot and the context for dialogue PMUL4648 in the 2.4 dataset. Score denotes $QK^\top/\sqrt{d_k}$ and b denotes the attention bias.

Table 4: Type classification corresponding to each slot.