Bridge to Target Domain by Prototypical Contrastive Learning and Label Confusion: Re-explore Zero-Shot Learning for Slot Filling

Zero-shot cross-domain slot filling alleviates the data dependence problem when data are scarce in the target domain, and has therefore attracted extensive research. However, most existing methods do not achieve effective knowledge transfer to the target domain: they merely fit the distribution of the seen slots and perform poorly on the unseen slots in the target domain. To solve this, we propose a novel approach based on prototypical contrastive learning with a dynamic label confusion strategy for zero-shot slot filling. The prototypical contrastive learning aims to reconstruct the semantic constraints of labels, and we introduce the label confusion strategy to establish the label dependence between the source domains and the target domain on-the-fly. Experimental results show that our model achieves significant improvement on the unseen slots, while also setting a new state of the art on the slot filling task.


Introduction
Slot filling, as an important part of task-oriented dialogue systems, is mainly used to extract specific information from user utterances. Traditional supervised methods have shown remarkable performance on slot filling tasks (Liu and Lane, 2016; Goo et al., 2018; E et al., 2019; He et al., 2020b; Wu et al., 2020; He et al., 2020a; Oguz and Vu, 2020; Qin et al., 2020), but they require a large amount of domain-specific labeled data.
Recently, there has been increasing interest in the zero-shot cross-domain slot filling task, whose goal is to identify unseen slots in the target domain without any labeled data. Previous methods can be classified into two types: one-stage and two-stage. In the one-stage framework, (Bapna et al., 2017; Shah et al., 2019b; Lee and Jha, 2019) perform the slot filling task for each slot type respectively, and the slot type description is integrated into the prediction process to achieve zero-shot adaptation. The main drawback is that the model may predict multiple slot types for one entity span. To avoid this problem, (Liu et al., 2020b,a; He et al., 2020c) decompose the slot filling task into two stages. First, all slot entities in the utterances are identified by a coarse-grained binary sequence labeling model. Then, slot types are classified by mapping each entity value to the representation of the corresponding slot label in the semantic space. However, we find that these methods perform poorly on unseen slots in the target domain, as shown in Fig 1(a).

* The first two authors contributed equally. Weiran Xu is the corresponding author. Our source code is available at: https://github.com/W-lw/PCLC

Figure 1: (a) Performance of some zero-shot cross-domain slot filling models on seen and unseen slots in the GetWeather domain in SNIPS. (b) Dynamic refinement of the original label semantic space into the refined label semantic space; slot prototypes are colored as source domain, cross-domain shared, and target domain.

In the cross-domain slot filling task, there are always seen slots and unseen slots in the target domain. The former exist in both the source domains and the target domain, while the latter exist only in the target domain. Although previous methods have achieved good overall performance on the cross-domain slot filling task, we find that their high performance mostly comes from the seen slots, while the performance on the unseen slots remains very low. We argue that these methods do not achieve domain adaptation well. In fact, they lack explicit modeling of the association between the source and target domains. They directly use the slot name embedding as the slot prototype, whose distribution is often chaotic in semantic space because the constraint relationships among slot prototypes across domains are not modeled. Moreover, due to the lack of data in the target domain, the model cannot learn the mapping between slot values in the target domain and the slot prototypes. Therefore, at prediction time, the model can only correctly predict seen slot types, and its predictions for unseen slots are almost random.
An intuitive way to solve this problem is to refine the label semantic space and establish the constraint relationship between source and target domains, as shown in Fig 1(b).
In this paper, we propose a novel method based on Prototypical Contrastive learning and Label Confusion strategies (PCLC) to dynamically refine the constraint relationships between slot prototypes in the semantic space. First, we introduce prototypical contrastive learning (Medina et al., 2020; Cao et al., 2020; Yue et al., 2021; Li et al., 2021; Zhao et al., 2021), which pulls mapped slot value embeddings close to their corresponding slot prototypes and away from the other slot prototypes, to enhance the accuracy of the mapping between the feature space and the semantic space. Second, we introduce a label confusion strategy to establish the dependency between the slots of the source domains and the target domain. During training on the source domains, we confuse the original one-hot label into a probability distribution over both the source-domain and target-domain labels by calculating the similarity between the slot prototypes of the source domains and the target domain.
Our contributions are three-fold. (1) We evaluate the performance of existing methods on cross-domain slot filling. It turns out that these methods do not achieve domain adaptation effectively, as their performance varies widely between unseen slots and seen slots. (2) We propose a novel method based on prototypical contrastive learning and label confusion to refine the semantic space and enhance domain adaptation. (3) Experiments demonstrate that the performance of our proposed method improves significantly on unseen slots, and its overall performance outperforms the state-of-the-art models under both zero-shot and few-shot settings.

Overall Architecture
Consistent with Coach (Liu et al., 2020b), we adopt the two-stage framework as the backbone of our model. In the first stage, we utilize a BiLSTM-CRF structure (Lample et al., 2016) (as a BIO-label sequence tagger) to encode the input utterance and identify the slot entities.
In the second stage, our model encodes each slot entity and predicts its label by calculating the similarity with the slot prototypes in the label semantic space. Unlike previous works (Liu et al., 2020b; He et al., 2020c), which directly use the slot name embedding as the slot prototype, we introduce Prototypical Contrastive learning and Label Confusion (PCLC) strategies to dynamically refine the constraint relationships between slot prototypes in the semantic space, as shown in Fig 2. During training, we use an MLP layer to encode the original slot name embeddings, so we obtain a dynamically updated slot prototype matrix. We introduce these components in the following subsections.
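To make the second stage concrete, the following is a minimal numpy sketch (not the paper's actual implementation; all function names, weights, and dimensions are illustrative) of how an MLP can re-encode raw slot name embeddings into a refined prototype matrix, and how an encoded entity is then classified by similarity against that matrix:

```python
import numpy as np

def refine_prototypes(slot_name_emb, W1, b1, W2, b2):
    """Re-encode raw slot-name embeddings with a small MLP to obtain a
    dynamically updated slot-prototype matrix (one row per slot label)."""
    h = np.tanh(slot_name_emb @ W1 + b1)  # hidden layer
    return h @ W2 + b2                    # refined slot prototypes

def classify_entity(entity_emb, prototypes):
    """Predict the slot label whose prototype is most similar
    (dot-product similarity) to the encoded entity representation."""
    return int(np.argmax(prototypes @ entity_emb))

# Toy setup: 5 slot labels, 8-dim embeddings, 16-dim hidden layer.
rng = np.random.default_rng(0)
names = rng.normal(size=(5, 8))
W1, b1 = rng.normal(size=(8, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 8)), np.zeros(8)
prototypes = refine_prototypes(names, W1, b1, W2, b2)
label = classify_entity(rng.normal(size=8), prototypes)
```

Because the prototypes pass through trainable MLP weights rather than being frozen name embeddings, their layout in the semantic space can be updated by the losses introduced below.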

Prototypical Contrastive Learning
A problem with using slot name embeddings as slot prototypes is that their distribution is chaotic and rather dense in semantic space. As a result, when slot values are mapped to the semantic space, they can hardly establish a correct relationship with their corresponding slot prototypes. We therefore introduce prototypical contrastive learning (Li et al., 2021) to enhance the precision of the mapping function from the feature space to the semantic space and to reduce the density of the slot prototype distribution in the label semantic space.
As shown in Fig 2, given a slot value $v$ with its corresponding label $k$, we map $v$ to an embedding $\mathbf{z}$ in the refined semantic space and obtain the prototypical contrastive loss:

$$\mathcal{L}_{pc} = -\log \frac{\exp(\mathbf{z} \cdot \mathbf{z}_k / \tau)}{\sum_{i=1}^{N} \exp(\mathbf{z} \cdot \mathbf{z}_i / \tau)},$$

where $\mathbf{z}_i$ is the representation of the slot prototype with label $i$, $\tau$ is the temperature factor, and $N$ is the number of slot categories. By optimizing the objective $\mathcal{L}_{pc}$, slot values are pulled close to their corresponding slot prototypes in semantic space and pushed away from the other slot prototypes.
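As an illustration, here is a minimal numpy sketch of this InfoNCE-style objective for a single slot value, assuming dot-product similarity (function and variable names are ours, not the paper's):

```python
import numpy as np

def prototypical_contrastive_loss(z, prototypes, label, tau=0.1):
    """InfoNCE-style prototypical contrastive loss for one slot value.

    z          : mapped slot-value embedding, shape (d,)
    prototypes : slot-prototype matrix, shape (N, d), one row per label
    label      : index of the gold slot label
    tau        : temperature factor
    """
    # Similarity between the slot value and every slot prototype.
    logits = prototypes @ z / tau                 # shape (N,)
    # Numerically stable log-softmax.
    logits = logits - logits.max()
    log_probs = logits - np.log(np.exp(logits).sum())
    # Negative log-likelihood of the gold prototype: pulls z toward
    # prototype `label` and pushes it away from the others.
    return -log_probs[label]
```

With identity prototypes, a slot value aligned with its gold prototype yields a lower loss than one assigned to a wrong label, which is exactly the attract/repel behavior the objective encodes.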

Label Confusion
Now that prototypical contrastive learning has separated the slot prototypes and established the relation between slot values and slot prototypes, we need to establish the dependency between the slots of the source domain and the target domain by label confusion.
In the training process on the source domains, we confuse the original one-hot label into a probability distribution over the source-domain and target-domain labels by calculating the similarity between the slot prototypes of the two domains. Given a slot value with mapped embedding $\mathbf{z}$ and corresponding source label $k$, we first calculate the cosine similarities between $\mathbf{z}$ and the slot prototypes $\mathbf{z}_j$ of all slot labels $j \in T$ in the target domain:

$$D_{tgt} = \mathrm{norm}\big(\left[\cos(\mathbf{z}, \mathbf{z}_1), \ldots, \cos(\mathbf{z}, \mathbf{z}_{|T|})\right]\big),$$

where $\mathrm{norm}(\cdot)$ indicates $L_1$ normalization. We regard $D_{tgt}$ as a soft distribution over the slot labels in the target domain and concatenate it with the one-hot distribution over the source domain using the label confusion factor $\alpha$:

$$\hat{y} = \mathrm{concat}\big(\alpha \cdot \mathrm{one\_hot}(k),\ (1-\alpha) \cdot D_{tgt}\big),$$

where $\mathrm{one\_hot}(k)$ indicates the one-hot distribution of the source slot label and $\mathrm{concat}(\cdot)$ indicates the concatenation operation. We then obtain the KL-divergence loss

$$\mathcal{L}_{lc} = \mathrm{KL}\big(\hat{y}\,\big\|\,\mathrm{softmax}(\mathbf{z} M^{\top})\big),$$

where $M$ denotes the representation matrix of the slot prototypes. The final loss function is $\mathcal{L} = \mathcal{L}_{pc} + \lambda \mathcal{L}_{lc}$, where $\lambda$ is a hyperparameter.
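The construction of the confused label distribution can be sketched in a few lines of numpy. Note the assumptions: the confusion factor `alpha` is taken to weight the source one-hot part, and the cosine similarities are assumed non-negative so that the L1-normalized vector is a valid soft distribution; names are illustrative, not the paper's:

```python
import numpy as np

def confused_label(z, tgt_prototypes, src_label, n_src, alpha=0.6):
    """Build the confused label distribution for one source-domain slot value.

    z              : mapped slot-value embedding, shape (d,)
    tgt_prototypes : target-domain slot prototypes, shape (T, d)
    src_label      : gold slot-label index in the source domain
    n_src          : number of source-domain slot labels
    alpha          : label confusion factor
    """
    # Cosine similarity between z and every target-domain prototype.
    sims = tgt_prototypes @ z / (
        np.linalg.norm(tgt_prototypes, axis=1) * np.linalg.norm(z))
    # L1-normalize into a soft distribution over target-domain labels
    # (assumes non-negative similarities).
    d_tgt = sims / np.abs(sims).sum()
    # One-hot distribution over source-domain labels.
    one_hot = np.zeros(n_src)
    one_hot[src_label] = 1.0
    # Concatenate, weighting the hard source part by alpha.
    return np.concatenate([alpha * one_hot, (1 - alpha) * d_tgt])
```

The resulting vector has `n_src + T` entries summing to one, so it can serve directly as the target of a KL-divergence loss against the model's predicted distribution over all slot labels.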

Setup
Dataset. We evaluate our method on SNIPS (Coucke et al., 2018), a public spoken language understanding dataset which contains 7 domains and 39 slots. To simulate the cross-domain scenarios, we follow the setup of (Liu et al., 2020b), choosing one domain as the target domain and the remaining six domains as the source domains.
Baselines. We compare our approach with the following baselines:
• Concept Tagger (CT) A method proposed by (Bapna et al., 2017), which utilizes the slot description to boost the performance on the unseen slots in the target domain.
• Robust Zero-shot Tagger (RZT) Based on CT, a method proposed by (Shah et al., 2019a), which adds additional example values of slots to improve the robustness of the model.
• Coarse-to-fine Approach (Coach) A two-stage framework proposed by (Liu et al., 2020b): the first stage performs a coarse-grained BIO labeling task, and the second stage utilizes slot descriptions to perform a fine-grained slot type classification task. It also has a stronger variant, which applies template regularization to improve performance.
• Contrastive Zero-Shot Learning with Adversarial Attack (CZSL-Adv) A method proposed by (He et al., 2020c) based on Coach, which utilizes contrastive learning and adversarial attacks to improve the performance and robustness of the framework.

Main Results
As illustrated in Table 1, our method PCLC outperforms the SOTA model by 1.83% average F1-score under the zero-shot setting and by 1.35% under the few-shot setting. Besides, compared to Coach, the model we directly build on, our method achieves superior performance by 4.7% under the zero-shot setting and 3.1% under the few-shot setting. This significant improvement shows that the combined use of our two proposed strategies helps establish a better mapping between slot values and slot prototypes in the label semantic space.

Results on Seen slots and Unseen Slots
In this section, we re-explore the transferability of all these methods by measuring their performance directly on unseen slots and seen slots respectively. The experimental results show that the performance of previous methods is limited on the unseen slots under the zero-shot setting, while PCLC achieves a very significant improvement on the unseen slots under both settings. Regarding this huge improvement, we hypothesize that with our two proposed strategies, our model achieves more effective knowledge transfer to the target domain instead of overfitting the seen slots. It is worth mentioning that there still exists a huge gap between model performance on unseen and seen slots, which is worth further study. More details for all domains can be found in Table 3.
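To make this evaluation protocol explicit, here is a simple sketch (our own illustrative helper, not the paper's evaluation script) of computing span-level F1 separately for seen and unseen slot types:

```python
def split_f1(pred_spans, gold_spans, seen_slots):
    """Compute span-level F1 separately for seen and unseen slot types.

    pred_spans / gold_spans : sets of (start, end, slot_type) tuples
    seen_slots              : slot types that also occur in source domains
    """
    def f1(pred, gold):
        if not pred or not gold:
            return 0.0
        tp = len(pred & gold)               # exact span + type matches
        p, r = tp / len(pred), tp / len(gold)
        return 2 * p * r / (p + r) if p + r else 0.0

    seen_pred = {s for s in pred_spans if s[2] in seen_slots}
    seen_gold = {s for s in gold_spans if s[2] in seen_slots}
    # Everything not belonging to a seen slot type is unseen.
    return (f1(seen_pred, seen_gold),
            f1(pred_spans - seen_pred, gold_spans - seen_gold))
```

Splitting the metric this way exposes the failure mode discussed above: a model can post a high overall F1 while its unseen-slot F1 stays near zero.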

Ablation Analysis
We compare the effects of the PCL and LC strategies on the main performance in Table 1. We find that the LC strategy alone causes a slight decrease in performance under the zero-shot setting, while PCL provides some performance boost compared to Coach. Interestingly, both LC and PCL achieve significant improvements close to PCLC under the few-shot setting. Besides, we conduct an ablation experiment on unseen slots, as shown in Table 2. We find that LC and PCL do not work very well individually on unseen slots under the zero-shot setting.
The experimental results show that both PCL and LC would bring significant improvement under the few-shot setting, but they need to be combined together for better performance under the zero-shot setting.

Visualization Analysis
To explore the effectiveness of refining the slot prototype embeddings in semantic space, we perform a t-SNE visualization of the slot prototype representations after refinement, as shown in Fig 3. We observe that, after refinement, the distribution of slot prototypes in semantic space changes from a chaotic distribution to one separated between the source domains and the target domain. This observation supports our intuition: prototypical contrastive learning encourages different slots to be separated from each other, while label confusion establishes the constraint relations between the source domains and the target domain. The separation between the source domains and the target domain helps to build the mapping between slot values and slot prototypes, and thus improves the accuracy on unseen slots.

Table 3: Detailed F1-scores for seen and unseen slots across all target domains. The bold part is the highest F1-score obtained by our model and the baselines on the unseen and seen slots, under the zero-shot and the few-shot settings respectively.

Figure 3: t-SNE projection of slot prototypes (source domain, cross-domain shared, and target domain). The left plot shows the original distribution and the right plot shows the distribution refined by our method. The target domain is GetWeather while the others are source domains.

Hyperparameter Analysis
The label confusion factor controls the degree of label confusion: setting it too small makes the prediction too random, while setting it too large prevents the confusion effect from being achieved. As shown in Figure 4, the model achieves its best performance when the factor is set to 0.6.

Conclusion & Discussion
In this paper, we propose a novel method based on Prototypical Contrastive learning and Label Confusion strategies (PCLC) for cross-domain slot filling. Our main contribution is to improve the domain adaptability of the model. The proposed method refines the label semantic space to re-establish the constraint relationships between different slots. Experiments show the effectiveness of our method, especially for recognizing unseen slots.
As there is still a huge gap in model performance between unseen and seen slots, in future work we will focus on improving the performance on unseen slots while maintaining the performance on seen slots. Representation disentanglement (Wang et al., 2021) can be used to disentangle domain-specific and domain-shared knowledge in the source domains; then we can preserve the domain-shared knowledge and focus on establishing relations of domain-specific knowledge between the source and target domains.