Slot Transferability for Cross-domain Slot Filling

Cross-domain slot filling focuses on using labeled data from source domains to train a slot filling model for target domains. It is of great significance for transferring a dialogue system to new domains. Most of the existing work focused on building a cross-domain transfer model. From the perspective of the slots themselves, this paper proposes a model-agnostic Slot Transferability Measure (STM) for evaluating the transferability from a source slot to a target slot, specifically, the degree to which labeled data of the source slot is helpful for training the slot filling model for the target slot. We also give an STM-based method for selecting helpful source slots and their labeled data for a given target slot. Experimental results on multiple existing models and datasets show that our method significantly outperforms state-of-the-art baselines in cross-domain slot filling. The code is available at https://github.com/luhengtong/STM-for-cdsf.git.


Introduction
As an important task in task-oriented dialog systems, slot filling aims to identify task-related slot information in user utterances. When a task (or domain) has a large amount of labeled data, most existing slot filling models can achieve the desired performance. However, there is usually little or even no labeled data for a new task. How to train the slot filling model for the new task (target task) with the labeled data of one or more existing tasks (source tasks) is of great significance for the rapid expansion of task-oriented dialog systems to new applications.
Existing work can be mainly classified into two categories. The first is to establish implicit semantic alignment between slot representations of the source task and the target task; the model trained with the source task data is then directly used for the target task (Bapna et al., 2017; Lee and Jha, 2019; Shah et al., 2019). The second is a two-stage strategy (Liu et al., 2020), which treats all slot values as entities. First, it trains a generic entity recognition model using source task labeled data to identify all candidate slot values in the target task. Then, each candidate slot value is classified into a target task slot by comparing the similarity between its representation and the target task slot information.
Most of the existing work has focused on building cross-task transferable models that leverage the association information between source tasks and target tasks, and the model is always trained on the labeled data of all source tasks without distinction. However, not all source task data has transferable value for the target task, and the value of different source task data to a particular target task may differ. For example, a flight-ticket-reservation task and a train-ticket-reservation task are highly similar, so the labeled data of the former will be helpful to the latter. In contrast, a flight-ticket-reservation task and a weather-inquiry task are very different, so the labeled data of the former has little or no value to the latter, and may even have a negative effect on the target model. Furthermore, even if the source task is similar to the target task, not every source slot will be useful for every slot of the target task. For example, the labeled data for the leaving-time slot in the flight-ticket-reservation task may be helpful for filling the leaving-time slot in the train-ticket-reservation task, but not for the train-type slot. Therefore, finding valuable source slots that can provide transferable information for slot filling of a target slot, and then training a model on the labeled data of these slots, can make better use of the source task data. This is the starting point of this paper, and it differs from the existing work.
To achieve this goal, we first propose the slot transferability measure (STM) and give a method to calculate it. By comparing the STM between the target slot and each source slot, we can select a different set of source slots for each target slot; only the labeled data of these source slots is used to train the slot filling model for the target slot. More specifically, we fuse the distribution similarity of the slot value representations and of the slot value context representations between the target slot and a source slot into the STM between the two slots. All source slots are sorted according to their STM with the target slot. The labeled data of the source slot with the highest STM is used to train the model first, then the labeled data of the source slot with the second highest STM is added, and so on. The process continues until the model gains no improvement on the validation set of the target slot. Those source slots and their labeled data are used to build the final slot filling model for the target slots.
Our main contributions are three-fold as follows.
1. We propose a metric called STM to measure the transferability between two slots. To the best of our knowledge, this is the first study of the transferability between two slots. The STM is model-agnostic.
2. We also propose an STM-based method to select source slots and their labeled data for training a slot filling model for target slots.
3. Experimental results on several existing models and datasets show that this method brings consistent performance improvement for cross-domain slot filling.

Related work
As a key component of dialog systems, the slot filling task has been studied extensively. Traditional supervised learning methods have made great achievements given a large amount of labeled data (Mesnil et al., 2015; Hakkani-Tür et al., 2016; Kurata et al., 2016; Liu and Lane, 2016; Goo et al., 2018; E et al., 2019). However, since there is little or even no labeled data for a new task, the cross-domain slot filling task, which uses labeled data from source tasks to train a model for the target task, is gaining increasing attention (Yazdani and Henderson, 2015; Bapna et al., 2017; Zhu and Yu, 2018; Lee and Jha, 2019; Shah et al., 2019; Liu et al., 2020; Zhu et al., 2020). There are mainly two streams of methods in previous work.
The first is to establish implicit semantic alignment of the slot representations between the source task and the target task (Bapna et al., 2017; Lee and Jha, 2019; Shah et al., 2019; Liu et al., 2020). Bapna et al. (2017) proposed the Concept Tagging model (CT), which unifies the slot filling model on the source tasks and the target task by combining slot representations built from slot description information and then performing BIO 3-way classification. Based on CT, Lee and Jha (2019) proposed the Zero-Shot Adaptive Transfer model (ZAT), which introduces an attention layer in building the representations of slot descriptions. Meanwhile, Shah et al. (2019) proposed the Robust Zero-shot Tagger (RZT), which uses a small number of example slot values of the target slot to constrain the slot filler and avoid the negative transfer caused by the misalignment of slot names.
The second is a coarse-to-fine approach, which first identifies all slot candidates and then classifies them into the corresponding slots. Liu et al. (2020) proposed a Coarse-to-fine approach (Coach). They first predict whether each token is a slot value candidate, and then identify its specific slot type based on the similarity between the token and the representation of each slot description. In addition, Coach utilizes a template regularization method which clusters the representations of semantically similar utterances into a similar vector space, greatly improving the robustness of the model.
Most of these efforts focus on building a cross-task transferable model by exploiting the correlation information between source and target tasks. All source data is used to train the transfer model, regardless of whether the data is helpful for target slot filling. In contrast, this paper proposes a new method to select a subset of source slots and their labeled data for model training.

Methodology
This section describes the cross-domain slot filling method proposed in this paper. First, we propose the concept of slot transferability and its measurement STM in Section 3.1. Then we describe the method of finding source slots for target slot based on the STM in Section 3.2. Finally, we introduce how this method can be deployed and implemented on existing models in Section 3.3. The STM is model-agnostic and will be validated on multiple existing models.

Slot Transferability Measure
Given slots s_a and s_b, the transferability from s_a to s_b refers to the degree to which the slot filling information of s_a can be used for slot filling of s_b, denoted as STM(s_a, s_b).
Let p_v(s_i) be the distribution of the slot value representations of slot s_i (i ∈ {a, b}), and p_c(s_i) be the distribution of the slot value context representations of slot s_i (i ∈ {a, b}). We define the transferability from slot s_a to slot s_b as Equation (1):

STM_β(s_a, s_b) = (1 + β^2) · sim(p_v(s_a), p_v(s_b)) · sim(p_c(s_a), p_c(s_b)) / (β^2 · sim(p_v(s_a), p_v(s_b)) + sim(p_c(s_a), p_c(s_b)))    (1)

where sim(p, q) denotes the similarity between distributions p and q. The β parameter determines the weight of the similarity between distributions of slot value context representations: β > 1 favors the similarity between distributions of slot value context representations, while β < 1 lends more weight to the similarity between distributions of slot value representations. The larger STM_β(s_a, s_b) is, the higher the transferability from slot s_a to slot s_b.
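The fusion of the two similarities can be sketched as follows. This is a minimal sketch assuming an F-beta-style weighted harmonic mean, which is one plausible form consistent with the stated role of β (β > 1 favors context similarity, β < 1 favors slot value similarity); the exact fusion form and the function name are illustrative, and the similarity scores are assumed to be already computed and nonnegative.

```python
def stm(sim_value, sim_context, beta=1.0):
    """Fuse slot-value and slot-value-context distribution similarities.

    F-beta-style weighted harmonic mean (an assumption, not the verified
    original formula): as beta grows, the result approaches sim_context;
    as beta shrinks toward 0, it approaches sim_value.
    """
    num = (1.0 + beta ** 2) * sim_value * sim_context
    den = beta ** 2 * sim_value + sim_context
    return num / den if den > 0 else 0.0
```

For example, with sim_value = 0.2 and sim_context = 0.8, a large β pulls the score toward 0.8 while a small β pulls it toward 0.2, matching the described behavior of β.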
Maximum Mean Discrepancy (MMD) is employed to calculate sim(p, q). MMD is usually used as a loss function in transfer learning (Tzeng et al., 2014; Zhang et al., 2015; Long et al., 2015, 2016, 2017; Yan et al., 2017), where minimizing the difference between domains yields domain-invariant features. It also serves as a test statistic for deciding whether two distributions are the same, and thus as a measure of the similarity between two distributions: the smaller the MMD, the higher the similarity. Let F be a class of functions f : X → R, and let X = {x_1, ..., x_m} and Y = {y_1, ..., y_n} be samples of independent and identically distributed observations drawn from distributions p and q, respectively. MMD is defined as Equation (2), and the square of the MMD can be empirically estimated by Equation (3) (Borgwardt et al., 2006):

MMD[F, X, Y] = sup_{f ∈ F} ( (1/m) Σ_{i=1}^{m} f(x_i) − (1/n) Σ_{j=1}^{n} f(y_j) )    (2)

MMD^2(X, Y) = (1/m^2) Σ_{i=1}^{m} Σ_{j=1}^{m} k(x_i, x_j) − (2/(mn)) Σ_{i=1}^{m} Σ_{j=1}^{n} k(x_i, y_j) + (1/n^2) Σ_{i=1}^{n} Σ_{j=1}^{n} k(y_i, y_j)    (3)

where k is the kernel function; Gaussian kernel functions are usually used. The similarity between the distributions of the slot value representations of slots s_a and s_b, and the similarity between the distributions of the slot value context representations of s_a and s_b, are given by Equations (4) and (5) respectively:

sim(p_v(s_a), p_v(s_b)) = exp(−MMD^2(Ω_va, Ω_vb))    (4)

sim(p_c(s_a), p_c(s_b)) = exp(−MMD^2(Ω_ca, Ω_cb))    (5)

where Ω_vi and Ω_ci are the sample set of the slot value representation distribution and the sample set of the slot value context representation distribution corresponding to slot s_i (i ∈ {a, b}).
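The biased empirical estimate of the squared MMD with a Gaussian kernel can be sketched as follows. This is a straightforward, unoptimized implementation; the kernel bandwidth sigma is a free parameter not specified in the text.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two vectors."""
    diff = x - y
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def mmd_squared(X, Y, sigma=1.0):
    """Biased empirical estimate of squared MMD (Borgwardt et al., 2006).

    X: (m, d) array of samples from p; Y: (n, d) array of samples from q.
    Returns 0 when the two sample sets are identical; grows as the
    distributions become more different.
    """
    m, n = len(X), len(Y)
    k_xx = sum(gaussian_kernel(X[i], X[j], sigma)
               for i in range(m) for j in range(m))
    k_yy = sum(gaussian_kernel(Y[i], Y[j], sigma)
               for i in range(n) for j in range(n))
    k_xy = sum(gaussian_kernel(X[i], Y[j], sigma)
               for i in range(m) for j in range(n))
    return k_xx / m ** 2 - 2.0 * k_xy / (m * n) + k_yy / n ** 2
```

In practice the three kernel sums would be vectorized as matrix operations, but the per-pair form above maps one-to-one onto the three terms of Equation (3).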
Let D_{s_a} = {(x^(i), y^(i))} be the labeled dataset of slot s_a, where x^(i) = (x^(i)_1, ..., x^(i)_{T_i}) is an utterance and y^(i) is the corresponding label sequence. Since x^(i) contains a slot value of slot s_a, the label sequence contains either "B-s_a" (the slot value is a single word) or "B-s_a" and "I-s_a" (the slot value spans several words). We first extract all slot value words from the labeled dataset D_{s_a} to obtain the sample set Ω_va of the slot value representations of slot s_a, as shown in Equation (6):

Ω_va = { E(x^(i)_j) | (x^(i), y^(i)) ∈ D_{s_a}, I_va(x^(i)_j) = 1 }    (6)

where E is a word embedding mapping, and I_va indicates whether x^(i)_j is a slot value word of slot s_a, defined as Equation (7):

I_va(x^(i)_j) = 1 if y^(i)_j ∈ {B-s_a, I-s_a}, and 0 otherwise    (7)

Then we extract the slot value contexts. The N words before and after each slot value are extracted to form the sample set Ω_ca, as shown in Equation (8):

Ω_ca = { E(x^(i)_k) | (x^(i), y^(i)) ∈ D_{s_a}, I_ca(x^(i)_k) = 1 }    (8)

where I_ca indicates whether x^(i)_k is a slot value context word for slot s_a, defined as Equation (9):

I_ca(x^(i)_k) = 1 if x^(i)_k lies within the N words before or after a slot value of s_a and I_va(x^(i)_k) = 0, and 0 otherwise    (9)

Similarly, we can obtain Ω_vb and Ω_cb. Based on Ω_va, Ω_ca, Ω_vb and Ω_cb, we can calculate sim(p_v(s_a), p_v(s_b)) and sim(p_c(s_a), p_c(s_b)) with Equations (4) and (5), and then calculate STM_β(s_a, s_b) with Equation (1).
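The extraction of the two sample sets from BIO-labeled data can be sketched as follows. The function and argument names are illustrative; embed stands for the word embedding mapping E, and the context window excludes the slot value words themselves.

```python
def extract_sample_sets(dataset, slot, embed, n=1):
    """Extract slot-value and slot-value-context embedding sample sets.

    dataset: list of (tokens, labels) pairs, labels in BIO format.
    slot: slot name, e.g. "playlist" (labels "B-playlist"/"I-playlist").
    embed: word -> vector mapping; n: context window size.
    """
    value_set, context_set = [], []
    for tokens, labels in dataset:
        # Indices of slot value words for this slot (I_va = 1)
        value_idx = {j for j, lab in enumerate(labels)
                     if lab in (f"B-{slot}", f"I-{slot}")}
        for j in sorted(value_idx):
            value_set.append(embed(tokens[j]))
        # Context: up to n words before/after each value word,
        # excluding the value words themselves (I_ca = 1)
        context_idx = set()
        for j in value_idx:
            for k in range(max(0, j - n), min(len(tokens), j + n + 1)):
                if k not in value_idx:
                    context_idx.add(k)
        for k in sorted(context_idx):
            context_set.append(embed(tokens[k]))
    return value_set, context_set
```

With embed mapping words to GloVe vectors, the returned lists correspond to Ω_va and Ω_ca and can be fed directly into the MMD estimate.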
Slot transferability has the following two properties. Symmetry: STM is symmetric. Letting the transferability from s_b to s_a be STM_β(s_b, s_a), we have STM_β(s_a, s_b) = STM_β(s_b, s_a). Relativity: Comparing the STM of two slot pairs is meaningful only when the pairs share the same source slot or the same target slot. When STM_β(s_a, s_b) > STM_β(s_c, s_b), the transferability from s_a to s_b is higher than that from s_c to s_b; likewise, when STM_β(s_a, s_b) > STM_β(s_a, s_c), the transferability from s_a to s_b is higher than that from s_a to s_c. The comparison between STM_β(s_a, s_b) and STM_β(s_c, s_d) is meaningless.

Selection of source slots based on STM
Given a slot set S = {s_1, ..., s_ns} from source tasks and a target slot s_t, each source slot s_a has a labeled dataset D_{s_a}, and the target slot s_t has a labeled dataset for validation. We select a slot set S_t for training the target slot filling model from S based on the following steps.
1. For each slot s_i in the source slot set, we calculate the transferability STM_β(s_i, s_t).
2. We sort [s_1, ..., s_ns] in descending order of STM_β(s_i, s_t), obtaining [s^(1), ..., s^(ns)], the slot sequence ordered from highest to lowest transferability.
3. We select a slot filling model M and define D_ACC as the union of the training data of the first h slots in [s^(1), ..., s^(ns)], with h initialized to 1. We train M on D_ACC, evaluate its F1 on the validation set of the target slot, then set h = h + 1 and repeat. The process continues until F1 stops improving, and S_t = [s^(1), ..., s^(h)] for the h that achieves the maximum F1.
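The greedy selection above can be sketched as follows. Here train_and_eval is a placeholder for the actual pipeline that trains model M on the accumulated data and returns F1 on the target validation set; all names are illustrative.

```python
def select_source_slots(source_slots, stm_scores, train_data, train_and_eval):
    """Greedy STM-ranked source-slot selection.

    source_slots: list of slot names.
    stm_scores: slot -> STM value with the target slot.
    train_data: slot -> list of labeled samples.
    train_and_eval: callable taking combined training data and returning
                    F1 on the target validation set (placeholder for the
                    real model pipeline).
    """
    ranked = sorted(source_slots, key=lambda s: stm_scores[s], reverse=True)
    selected, accumulated = [], []
    best_f1 = float("-inf")
    for slot in ranked:
        candidate = accumulated + list(train_data[slot])
        f1 = train_and_eval(candidate)
        if f1 <= best_f1:  # no improvement: stop and keep the previous set
            break
        best_f1 = f1
        accumulated = candidate
        selected.append(slot)
    return selected
```

Because the loop stops at the first slot that fails to improve validation F1, the returned list is exactly the prefix [s^(1), ..., s^(h)] described above.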

Model training
Given a set of source tasks T = {T_1, ..., T_n}, a target task T_tgt, and the slot set S_i = {s_1, ..., s_{N_i}} corresponding to task T_i, we define the set of all source task slots as S_union = S_1 ∪ ... ∪ S_n and the set of target task slots as S_tgt. For an existing cross-domain slot filling model M_base, we deploy our approach on the model with the following steps.
First, the training and validation sets of the target task and the source tasks are divided into per-slot training and validation sets according to which slots each sample contains. Then, for each slot s_ti in the target task slot set, the corresponding source slot set S_ti is selected from the set of all source task slots S_union. We combine the source slot sets corresponding to all target task slots to obtain the source slot set for the target task, S_tgt = S_t1 ∪ ... ∪ S_tNtgt, and then replace the labels of all slots in the source task training set that are not in S_tgt with the label O. Finally, the source slot set S_tgt and the training data after replacement are used to train the model M_base.
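The label replacement step can be sketched as follows; this is a per-sequence helper with an illustrative name, operating on standard BIO labels.

```python
def mask_out_of_scope_slots(labels, kept_slots):
    """Replace BIO labels whose slot is not in kept_slots with 'O'.

    labels: one BIO label sequence, e.g. ["B-poi", "I-poi", "O", "B-genre"].
    kept_slots: the selected source slot set for the target task.
    """
    masked = []
    for lab in labels:
        if lab == "O":
            masked.append(lab)
        else:
            # BIO labels have the form "B-<slot>" or "I-<slot>"
            prefix, slot = lab.split("-", 1)
            masked.append(lab if slot in kept_slots else "O")
    return masked
```

Applying this helper to every label sequence in the source task training data yields the replaced training set used to train M_base.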

Experiments
In this section we describe the dataset used for evaluation, the baseline models used for comparison, and more details of the experimental settings.

Datasets
To evaluate the effectiveness of our approach, we conduct experiments on SNIPS (Coucke et al., 2018). In order to further evaluate the generalization ability of our approach, we also construct a cross-task slot filling dataset called MultiWoz-Slot (MWS) based on the multi-domain task-oriented dialog dataset MultiWoz (Budzianowski et al., 2018;Eric et al., 2020). Table 1 displays some statistics about the two datasets. Details about the two datasets and how the MWS dataset is constructed are described as follows.
SNIPS SNIPS is a public SLU dataset that contains 7 tasks (intents) and 39 slots, and each task contains approximately 2000 training samples. As shown in Table 1, the data contains a total of 14,484 samples, the vocabulary size is 12,134, the average length of the sample utterance is 9, and the average number of slots in each sample is 2.6.

MultiWoz-Slot
MultiWoz is a public multi-domain task-oriented dialogue dataset that contains 7 tasks and 24 slots. Since the hospital and police tasks have little conversation data and only appear in the training data, we use user-side utterances from just five tasks (attraction, hotel, restaurant, taxi, train) to construct the MWS dataset, which contains 14 slots. When constructing the training, validation and test data of a task in MWS, we extract the user-side utterances containing the task separately from the conversations in the training, validation and test sets of MultiWoz. Since the training set of the target task is generally used as the final test set in cross-domain slot filling, we combine the validation and test sets as the validation set for each task. Table 1 shows the number of slots and the number of training and validation samples for each task in MWS. As shown in Table 1, the data contains a total of 39,839 samples, the vocabulary size is 3,314, the average length of the sample utterances is 15, and the average number of slots per sample is 1.7. Compared to SNIPS, MWS has a smaller vocabulary and fewer slots per task.
However, MWS has more samples per task, so when it is used as a cross-domain slot filling dataset, its source tasks have more training samples and the correlation between the tasks is stronger.

Models
We conduct our experiments on the following models.
Concept Tagger (CT) A cross-domain slot filling model proposed by Bapna et al. (2017), which uses the information of the slot descriptions to establish implicit alignment between target slots and source slots.
Robust Zero-shot Tagger (RZT) A model proposed by Shah et al. (2019), which builds on CT and uses a small number of example slot values per slot to improve the robustness of the model on the target task.
Coarse-to-fine Approach (Coach) A two-stage cross-domain slot filling method proposed by Liu et al. (2020), which splits the cross-domain slot filling task into two stages: coarse-grained BIO 3-way classification and fine-grained slot type classification, and uses slot descriptions in the second stage to help recognize unseen slots.
Coach+TR A variant of Coach proposed by Liu et al. (2020), which further uses template regularization on the basis of Coach to improve the performance of the model on similar or the same slots, is the state-of-the-art model.

Implementation Details
We deploy the proposed method on the above slot filling models CT, RZT, Coach, and Coach+TR.
β is set to 1. A two-layer BiLSTM (Hochreiter and Schmidhuber, 1997) model is used for selecting source slots for all models. 300-dimensional GloVe (Pennington et al., 2014) vectors are used as word embeddings. The hidden layer dimension is set to 300 and the learning rate to 0.001. We train the model for 30 epochs and select the model with the best performance on the validation set as the final model.

Table 2: The main results of the four models (CT, RZT, Coach, Coach+TR) trained using the original data (All data) and the data selected by our method (STM_1). Scores in each row are F1 on the target task.
To make a fair comparison, we use the same settings as Liu et al. (2020) to construct the cross-domain slot filling model. We concatenate the 100-dimensional character-level representation and the 300-dimensional word-level representation as the word representation. We set the hidden layer dimension of all BiLSTM encoders to 300 and the dropout rate to 0.3. We use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.0005. The samples of each task in SNIPS are divided into two parts: 500 samples as the validation set and the remaining samples as the training set. When a task is set as the target task, its training set is used as the test set. We evaluate on the two datasets respectively. For each test, we choose one task as the target task and set the other tasks as the source tasks.

Result and Discussion
In this section, we describe and analyze the experimental results. Firstly, the main results of the experiment are described in Section 5.1. Then, we analyze the impact of some factors on STM in Section 5.2.

Main Results
Quantitative Analysis Table 2 shows the main results of the four models. For each model, the first column is the F1 of the model trained on all available labeled data, i.e., the original way of using the model. The second column is the F1 of the model trained on the labeled data selected by the method proposed in this paper. As can be seen from Table 2, our method improves the performance of all four models on most target tasks, and even improves several of them by more than 10 points.
Qualitative analysis We perform a qualitative analysis of the STM on the MWS dataset. Figure 1 shows the heat map of slot transferability between any two slots in MWS. Each cell represents the value of STM_1 between the slots labeled on the horizontal and vertical axes; the brighter the cell, the higher the transferability. The figure is symmetric because STM_1 is symmetric. The slots with high transferability to each other fall roughly into 7 categories, as shown in Table 3. After examining the data, we find there are mainly three kinds of slot pairs within the same category. The first kind is slot pairs whose slot value sets highly overlap. For example, "attraction-name" and "taxi-dest" share values such as "adc theatre", "all saints church" and "county folk museum". The second kind is slot pairs whose slot values appear in similar contexts. For example, "attraction-name" and "hotel-name" share context words such as "about", "for" and "at". The third kind is slot pairs with both highly overlapping value sets and similar slot value contexts, such as "attraction-area" and "hotel-area". These phenomena are consistent with the definition of STM.

The impact of some factors on STM
There are three main factors in the calculation of STM_β. The following is an experimental analysis of the impact of these three factors on STM.
The impact of β on STM The β parameter determines the weight of the similarity between distributions of slot value context representations. We randomly select four slot pairs: (attraction-name, hotel-name), (attraction-name, restaurant-name), (attraction-name, taxi-dest), and (hotel-name, taxi-dest). In the first two pairs, the slot values appear in similar contexts, but the slot value sets have almost no intersection. In the last two pairs, the slot value sets largely coincide, but the contexts of the slot values are not similar. β ranges from 0 to +∞. When β = 0, STM_β only measures the similarity between the distributions of the slot value representations of the two slots, and when β → +∞, STM_β only measures the similarity between the distributions of the slot value context representations. As shown in Figure 2, the STM_β of the first two pairs increases with β, while the STM_β of the last two pairs decreases with β. Therefore, as β increases, the effect of slot value context similarity on STM_β becomes greater, and the effect of slot value similarity becomes smaller.
The impact of sample number on STM To measure the impact of sample number on STM_β, we randomly select three slot pairs for a comparative experiment: (taxi-arrive, train-arrive), (taxi-arrive, restaurant-time) and (taxi-arrive, hotel-name). We select 25%, 50%, 75% and 100% of the samples from the validation set used to calculate STM_β for the three pairs. The experimental results are shown in Figure 3. Although the absolute STM_β values of the three slot pairs change, their relative order does not. That is, sample size affects the value of STM_β, but for the same source slot, the ranking of STM_β over different target slots does not change.
The impact of N on STM When calculating slot transferability, we fuse the distribution similarity of the slot value representations and of the slot value context representations, selecting the N words before and after each slot value as the slot value context. To measure the impact of the context window size N on slot transferability, six slot pairs are randomly selected for comparison: (attraction-name, hotel-name), (attraction-name, restaurant-name), (attraction-name, taxi-dest), (attraction-name, hotel-stars), (hotel-stay, restaurant-people) and (hotel-stay, restaurant-name). The first two pairs have similar contexts, the middle two have similar slot value sets, and the last two have low similarity in both slot values and contexts. N ranges from 1 to 10. We observe the change of the context representation distribution similarity sim(p_c(s_a), p_c(s_b)) and of STM_1 over the six slot pairs. As shown in Figures 4 and 5, the similarity between context representation distributions increases with the context window size N, and the context similarities of the six slot pairs converge. In addition, STM increases with the window size N, and the distinction of STM between different types of slot pairs decreases. We conjecture this is because the extracted context contains too much slot-independent material when the window size N becomes large.

Running time analysis
The method proposed in Section 3.2 does increase the running time, but there are two sides to this: selecting slots with a BiLSTM costs some time, while training the model on the selected (smaller) data saves time. We did not measure the time saved in training; however, we found that the extra time consumed by slot selection is small, which is acceptable considering the performance improvements it brings. Since the process in Section 3.2 is offline, is performed once per new domain, and uses a simple BiLSTM model, the overhead is minor. Concretely, in our experiments on one Titan V GPU, the average running time of the method in Section 3.2 is 80 minutes for a new domain.

Conclusions and Future Work
In this paper, we propose a metric, STM, to measure slot transferability across tasks; the calculation of this metric is model-agnostic. Based on this metric, we also propose a cross-domain slot filling method that improves the performance of existing models by selecting source slots with high transferability for the target slots. The results on several existing models and datasets show that our method brings consistent performance improvement to the slot filling models of the target tasks, which demonstrates the effectiveness of the STM. We also further explore the impact of several factors on STM. In the future, we hope to use STM to further guide the improvement of models.