HierarchicalContrast: A Coarse-to-Fine Contrastive Learning Framework for Cross-Domain Zero-Shot Slot Filling

In task-oriented dialogue scenarios, cross-domain zero-shot slot filling plays a vital role: it leverages source-domain knowledge to learn a model with high generalization ability in unknown target domains where annotated data is unavailable. However, existing state-of-the-art zero-shot slot filling methods have limited generalization ability in the target domain: they show effective knowledge transfer only on seen slots and perform poorly on unseen slots. To alleviate this issue, we present a novel Hierarchical Contrastive Learning Framework (HiCL) for zero-shot slot filling. Specifically, we propose coarse-to-fine-grained contrastive learning based on Gaussian-distributed embeddings to learn generalized deep semantic relations between utterance tokens by optimizing inter- and intra-token distribution distances. This encourages HiCL to generalize to slot types unseen at training time. Furthermore, we present a new iterative label set semantics inference method to unbiasedly and separately evaluate the performance of unseen slot types, which were entangled with their counterparts (i.e., seen slot types) in previous zero-shot slot filling evaluation methods. Extensive empirical experiments on four datasets demonstrate that the proposed method achieves comparable or even better performance than current state-of-the-art zero-shot slot filling approaches.


Introduction
Slot filling models extract the contiguous spans of tokens belonging to pre-defined slot types from given spoken utterances and gather the information required by user intent detection; they are therefore an imperative module of task-oriented dialogue (TOD) systems. For instance, as shown in Figure 2, given the user utterance "send a reminder for a tire check next week" belonging to the reminder domain, the slot filling task is to identify the slot entities "a tire check" and "next week", which correspond to the slot types (an alias of slot type is slot) todo and date_time, respectively. Supervised slot filling methods (Kurata et al., 2016; Wang et al., 2018; Li et al., 2018; Goo et al., 2018; Qin et al., 2019) have achieved promising performance. Nevertheless, these methods are strongly dependent on substantial, high-quality annotated data for each slot type, which prevents them from transferring to new domains with little or no labeled training data.
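To make the running example concrete, the following toy snippet (illustrative only, not part of any model) shows how BIO-tagged tokens map back to slot entities:

```python
# BIO tagging for the running example utterance from the reminder domain.
tokens = ["send", "a", "reminder", "for", "a", "tire", "check", "next", "week"]
labels = ["O", "O", "O", "O", "B-todo", "I-todo", "I-todo", "B-date_time", "I-date_time"]

def extract_spans(tokens, labels):
    """Collect (slot_type, entity_text) spans from a BIO-labelled token sequence."""
    spans, current = [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if current:
                spans.append(current)
            current = (lab[2:], [tok])            # start a new entity span
        elif lab.startswith("I-") and current and current[0] == lab[2:]:
            current[1].append(tok)                # continue the current span
        else:
            if current:
                spans.append(current)
            current = None                        # outside (O) token
    if current:
        spans.append(current)
    return [(slot, " ".join(words)) for slot, words in spans]

print(extract_spans(tokens, labels))
# -> [('todo', 'a tire check'), ('date_time', 'next week')]
```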
To solve this problem, more approaches have emerged to deal with this data scarcity issue by leveraging zero-shot slot filling (ZSSF). Typically, these approaches can be divided into two categories: one-stage and two-stage. In the one-stage paradigm, Bapna et al. (2017); Shah et al. (2019); Lee and Jha (2019) first generate token-level utterance representations that interact with the representation of each slot description in semantic space,
and then predict each slot type for each utterance token individually. The primary weakness of this paradigm is the multiple prediction issue, where a token will probably be predicted as multiple slot types (Liu et al., 2020; He et al., 2020). To overcome this issue, Liu et al. (2020); He et al. (2020); Wang et al. (2021) separate the slot filling task into a two-stage paradigm. They first identify whether the tokens in an utterance belong to a BIO (Ramshaw and Marcus, 1995) entity span or not with a binary classifier, and subsequently predict their specific slot types by projecting the representations of both slot entity and slot description into a semantic space in which they interact. Based on these works, Siddique et al. (2021) propose a two-stage variant that introduces linguistic knowledge and pre-trained context embeddings, along with the entity-span identification stage (Liu et al., 2020), to improve semantic similarity modeling between slot entities and slot descriptions. Recently, Heo et al. (2022) developed another two-stage variant that applies a momentum contrastive learning technique to train BERT (Devlin et al., 2019a) and to improve the robustness of ZSSF. However, as shown in Figure 1, we found that these methods perform poorly on unseen slots in the target domain.
Although two-stage approaches have flourished, the one-pass prediction mechanism of these approaches (Liu et al., 2020; He et al., 2020; Wang et al., 2021) inherently limits their ability to infer unseen slots and seen slots separately. They therefore have to adopt a biased test-set split for unseen slots (see more details in Appendix F) and are incapable of faithfully evaluating real unseen-slot performance. Their variants (Siddique et al., 2021; Heo et al., 2022; Luo and Liu, 2023) still struggle with actual unseen-slot evaluation, either because they follow this biased test-set split (Siddique et al., 2021; Heo et al., 2022) or because of the intrinsic architectural limit of one-pass inference (Luo and Liu, 2023). In another line of work (Du et al., 2021; Yu et al., 2021), ZSSF is formulated as a one-stage question answering task, but it relies heavily on handcrafted question templates and ontologies customized by human experts, which makes it prohibitively expensive to generalize to unseen domains. Besides, the multiple slot type prediction problem that these methods suffer from (Siddique et al., 2021; Heo et al., 2022; Du et al., 2021; Yu et al., 2021) seriously degrades their performance. In this paper, we introduce a new iterative label set semantics inference method to address these two problems.
Moreover, a downside of these two orthogonal methods is that they are not good at learning domain-invariant features (e.g., generalized token-class properties) by making full use of source-domain data while keeping the target domain of interest unseen. This may lead them to overfit to the limited slot types in the source training domains. In fact, current models' performance on unseen slots in the target domain is still far from the upper bound.
To tackle this limitation, intuitively, through contrastive learning (CL) we can redistribute the distances between token embeddings in semantic space to learn generalized token-class features across tokens and to better differentiate between token classes, and even between new classes (slot-agnostic features), which benefits new token-class generalization across domains (Das et al., 2022). However, token-level classes are hard to train, since their supervised labels are closely distributed, which easily leads to misclassification (Ji et al., 2022). By contrast, entity-level classes are relatively easy to train, since their training labels are dispersedly distributed and they do not require the label dependency present in token-level sequence labeling (Ji et al., 2022).
We argue that entity-level class knowledge contributes to token-level class learning, and that their combination is beneficial for the ZSSF task. Entity-level class learning supplements token-class learning with entity type knowledge and boundary information between entities and non-entities, which lowers the difficulty of token-class training.
Hence, we propose a new hierarchical CL framework called HiCL, a coarse-to-fine CL approach. As depicted in Figure 2, it first coarsely learns entity-class knowledge of entity types and boundaries via entity-level CL. Then, it combines entity type and boundary features and finely learns token-class knowledge via token-level CL.
In recent years, some researchers have employed Gaussian embeddings to learn token representations (Vilnis and McCallum, 2015; Mukherjee and Hospedales, 2016; Jiang et al., 2019a; Yüksel et al., 2021) due to their superiority in capturing representational uncertainty. This motivates us to employ Gaussian embeddings in HiCL to represent utterance tokens more robustly, where each token becomes a density rather than a single point in the latent feature space. Unlike existing slot filling contrastive learners (He et al., 2020; Wu et al., 2020; Wang et al., 2021; Heo et al., 2022), which optimize a training objective of pairwise similarities between point embeddings, HiCL optimizes distributional divergence by effectively modeling Gaussian embeddings. While point embeddings only optimize pairwise distances, Gaussian embeddings impose an additional constraint that preserves the class distribution through their variance estimates. This distinctive quality helps to explicitly model entity- or token-class distributions, which not only encourages HiCL to learn generalized feature representations to categorize and differentiate between different entity (token) classes, but also fosters zero-sample target domain adaptation.
Concretely, as shown in Figure 3, our token-level CL pulls inter- and intra-token distributional distances closer for similar labels in semantic space, while pushing apart dissimilar ones. Gaussian-distributed embeddings enable token-level CL to better capture the semantic uncertainty and semantic coverage variation of token classes than point embeddings. This helps HiCL to better model generalized slot-agnostic features in the cross-domain ZSSF scenario.
Our major contributions are three-fold: • We introduce a novel hierarchical CL (coarse-to-fine CL) approach based on Gaussian embeddings to learn and extract slot-agnostic features across utterance tokens, effectively enhancing the model's ability to generalize to new slot entities.
• We identify an overlap between unseen slots and seen slots in previous methods' evaluation of unseen-slot performance, and rectify this bias by splitting the test set at slot-type granularity instead of sample granularity. We thus propose a new iterative label set semantics inference method to train and test unseen slots separately and unbiasedly. Moreover, this method is also designed to relieve the multiple slot type prediction issue.
• Experiments on two evaluation paradigms, four datasets, and three backbones show that, compared with current state-of-the-art (SOTA) models, our proposed HiCL framework achieves competitive unseen-slot and overall performance on the cross-domain ZSSF task.

Problem Definition
Zero-shot Setting For ZSSF, a model is trained on source domains with a slot type set A_s(i) and tested on a new target domain with a slot type set A_t = A_ts(j) ∪ A_tu(k), where i, j and k index different slot type sets, A_ts(j) are the slot types that exist in both the source domains and the target domain (seen slots), and A_tu(k) are the slot types that only exist in the target domain (unseen slots). Since A_s(i) ∩ A_tu(k) = ∅, generalizing to unseen slots in the target domain is a big challenge for the model. Task Formulation Given an utterance U = x_1, ..., x_n, the task of ZSSF is to output a label sequence O = o_1, ..., o_n, where n is the length of U.
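A tiny sketch of this slot-type partition (the slot names are illustrative, not from any specific dataset):

```python
def split_target_slots(source_slots, target_slots):
    """Partition target-domain slot types into seen (shared with the source
    domains) and unseen (exclusive to the target domain)."""
    seen = sorted(set(target_slots) & set(source_slots))    # A_ts
    unseen = sorted(set(target_slots) - set(source_slots))  # A_tu
    return seen, unseen

source = {"city", "date_time", "artist"}
target = {"city", "condition_temperature", "date_time"}
seen, unseen = split_target_slots(source, target)
print(seen, unseen)  # -> ['city', 'date_time'] ['condition_temperature']
```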

Methodology
The architecture of HiCL is illustrated in Figure 3. HiCL adopts an iterative label set semantics inference enhanced hierarchical CL approach and a conditional random field (CRF) scheme. First, HiCL successively employs entity-level CL and token-level CL to optimize the distributional divergence between different entity- (token-) class embeddings. Then, HiCL leverages the generalized entity (token) representations refined in the previous CL steps, along with the slot-specific features learned in CRF training with BIO labels, to better model the alignment between utterance tokens and slot entities.

Hierarchical Contrastive Learning
Encoder Given an utterance of n tokens U = {x_i}_{i=1}^{n}, a target slot type with k tokens S = {s_i}_{i=1}^{k}, and all slot types except the target slot type with q tokens A = {a_i}_{i=1}^{q}, we adopt a pre-trained BERT (Devlin et al., 2019a) as the encoder and feed "[CLS] S [SEP] A [SEP] U [SEP]" to it to obtain a d_l-dimensional hidden representation h_i ∈ R^{d_l} for each input token. We adopt two encoders for HiCL, i.e., BiLSTM (Hochreiter and Schmidhuber, 1997) and BERT (Devlin et al., 2019a), to align with the baselines. Gaussian Transformation Network We hypothesize that the semantic representations of encoded utterance tokens follow Gaussian distributions. Similar to (Jiang et al., 2019a; Das et al., 2022), we adopt the exponential linear unit (ELU) and non-linear activation functions f_µ and f_Σ to generate the Gaussian mean and variance embeddings, respectively:

µ_i = f_µ(h_i),  Σ_i = diag(ELU(f_Σ(h_i)) + 1),

where 1 ∈ R^d is an all-ones vector; ELU and the added 1 ensure that every element of the variance embedding is non-negative. µ_i ∈ R^d and Σ_i ∈ R^{d×d} denote the mean and diagonal covariance matrix of the Gaussian embedding, respectively. f_µ and f_Σ are both implemented with ReLU followed by one layer of linear transformation.
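A rough sketch of this transformation (a NumPy toy with made-up dimensions and random weights, not the paper's actual implementation) shows why the ELU(·) + 1 construction guarantees strictly positive variances:

```python
import numpy as np

rng = np.random.default_rng(0)
d_l, d = 8, 4                              # toy encoder width and Gaussian embedding width
W_mu = rng.normal(size=(d_l, d))
W_var = rng.normal(size=(d_l, d))

def elu(x):
    # ELU with alpha = 1: identity for x > 0, exp(x) - 1 otherwise (range > -1)
    return np.where(x > 0, x, np.exp(x) - 1.0)

def gaussian_embed(h):
    """Map a hidden state h to (mean, diagonal variance) of a Gaussian embedding.
    Since ELU(x) > -1 for all x, ELU(x) + 1 is strictly positive."""
    z = np.maximum(0.0, h)                 # ReLU, followed by a linear layer per head
    mu = z @ W_mu
    var = elu(z @ W_var) + 1.0             # diagonal of the covariance matrix
    return mu, var

mu, var = gaussian_embed(rng.normal(size=d_l))
assert (var > 0).all()                     # every variance element is positive
```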

Positive and Negative Samples Construction
In a training batch, given an anchor token x_u, if a token x_v shares the same slot label as x_u, i.e., y_v = y_u, then x_v is a positive example for x_u. Otherwise, if y_v ≠ y_u, x_v is a negative example for x_u.
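A minimal sketch of this pairing rule over a batch of token labels:

```python
def contrastive_pairs(batch_labels, anchor_idx):
    """Positives share the anchor token's slot label; all other tokens are negatives."""
    y_u = batch_labels[anchor_idx]
    positives = [v for v, y in enumerate(batch_labels) if v != anchor_idx and y == y_u]
    negatives = [v for v, y in enumerate(batch_labels) if y != y_u]
    return positives, negatives

labels = ["B-city", "O", "B-city", "B-date_time"]
pos, neg = contrastive_pairs(labels, 0)   # pos -> [2], neg -> [1, 3]
```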
Contrastive Loss Given a pair of positive samples x_u and x_v, their Gaussian embedding distributions follow x_u ∼ N(µ_u, Σ_u) and x_v ∼ N(µ_v, Σ_v), both m-dimensional. Following (Iwamoto and Yukawa, 2020; Qian et al., 2021), the Kullback-Leibler divergence between them is

D_KL(N_u ∥ N_v) = 1/2 [ Tr(Σ_v^{-1} Σ_u) + (µ_v − µ_u)^T Σ_v^{-1} (µ_v − µ_u) − m + log(det Σ_v / det Σ_u) ],

where Tr is the trace operator. Since the Kullback-Leibler divergence is asymmetric, we follow the calculation method of (Das et al., 2022) and average both directions:

d(x_u, x_v) = 1/2 [ D_KL(N_u ∥ N_v) + D_KL(N_v ∥ N_u) ].

Suppose the training set of the source domains is T_a. At each training step, a randomly shuffled batch T ∈ T_a has batch size N_t, with each sample (x_j, y_j) ∈ T. For each anchor sample x_u, we match all positive instances T_u ∈ T for x_u, and repeat this for all anchor samples. Formulating the Gaussian embedding loss in each batch, similar to Chen et al. (2020), we calculate the NT-Xent loss:

L_u = −(1/n_u) Σ_{x_v ∈ T_u} log [ exp(−d(x_u, x_v)/τ) / Σ_{k=1}^{N_t} 1_{[k≠u]} exp(−d(x_u, x_k)/τ) ],

where 1_{[k≠u]} ∈ {0, 1} is an indicator function evaluating to 1 iff k ≠ u, τ is a scalar temperature parameter, and n_u is the total number of positive instances in T_u.
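The closed-form KL divergence between diagonal Gaussians, its symmetrised average, and the NT-Xent contrast can be sketched as follows (a NumPy illustration of the standard formulas, with the batch machinery simplified to a single anchor):

```python
import numpy as np

def kl_diag(mu_u, var_u, mu_v, var_v):
    """KL( N(mu_u, diag(var_u)) || N(mu_v, diag(var_v)) ), closed form."""
    m = mu_u.shape[0]
    return 0.5 * (np.sum(var_u / var_v)                      # trace term
                  + np.sum((mu_v - mu_u) ** 2 / var_v)       # Mahalanobis term
                  - m
                  + np.sum(np.log(var_v) - np.log(var_u)))   # log-determinant term

def sym_kl(mu_u, var_u, mu_v, var_v):
    """KL is asymmetric, so average both directions."""
    return 0.5 * (kl_diag(mu_u, var_u, mu_v, var_v)
                  + kl_diag(mu_v, var_v, mu_u, var_u))

def nt_xent_anchor(dists, positive_idx, tau=0.07):
    """NT-Xent for one anchor, using exp(-distance / tau) as the similarity;
    dists holds the symmetric KL from the anchor to every other batch token."""
    sims = np.exp(-np.asarray(dists) / tau)
    return -np.mean(np.log(sims[positive_idx] / sims.sum()))

mu_a, var_a = np.zeros(2), np.ones(2)
assert abs(kl_diag(mu_a, var_a, mu_a, var_a)) < 1e-12        # KL(p || p) = 0
```

Note that a smaller divergence means higher similarity, so the divergence enters the softmax with a negative sign: anchors with nearby positives incur a small loss.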
Coarse-grained Entity-level Contrast In coarse-grained CL, entity-level slot labels are used as the CL supervision signal on the training set T_a. Coarse-grained CL optimizes the distributional divergence between token Gaussian embeddings and models the entity-class distribution. Following Eq. (3)-Eq. (6), we obtain the coarse-grained entity-level contrastive loss L_i^coarse for each anchor, and the in-batch coarse-grained CL loss L_coarse is formulated as the sum of these per-anchor losses. Fine-grained Token-level Contrast In fine-grained CL, token-level slot labels are used as the CL supervision signal on the training set T_a. As illustrated in Figure 3, fine-grained CL optimizes the KL divergence between token Gaussian embeddings and models the token-class distribution. Similarly, the in-batch fine-grained CL loss L_fine is formulated as the sum of the per-anchor token-level losses.

Training Objective
The training objective L is a weighted sum of regularized loss functions. Slot Filling Loss

L_slot = − Σ_{j=1}^{n} Σ_{i=1}^{n_l} ŷ_j^i log(y_j^i),

where ŷ_j^i is the gold slot label of the j-th token and n_l is the number of all slot labels. Overall Loss

L = α L_coarse + β L_fine + γ L_slot + λ ∥Θ∥^2,

where α, β and γ are tunable hyper-parameters for each loss component, λ denotes the coefficient of L_2-regularization, and Θ represents all trainable parameters of the model.
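A minimal sketch of the overall objective (assuming, as the loss names suggest, that α, β and γ weight the coarse CL, fine CL and slot-filling terms, respectively; the default values here are illustrative):

```python
def overall_loss(l_coarse, l_fine, l_slot, theta_sq_norm,
                 alpha=1.0, beta=1.0, gamma=0.05, lam=1e-5):
    """Weighted sum of the three loss components plus L2 regularisation."""
    return alpha * l_coarse + beta * l_fine + gamma * l_slot + lam * theta_sq_norm
```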

Unseen and Seen Slots Overlapping Problem in Test Set
The problem description, and the proposed rectified method for unseen and seen slots test set split are presented in Appendix F.

Evaluation Paradigm
Training on Multiple Source Domains and Testing on a Single Target Domain A model is trained on all domains except a single target domain. For instance, the model is trained on all domains of the SNIPS dataset except the target domain GetWeather, which is used to test zero-shot slot filling capability. This multiple-training-domains-to-single-target-domain paradigm is evaluated on the SNIPS (Coucke et al., 2018), ATIS (Hemphill et al., 1990), SGD (Rastogi et al., 2020), and MIT_corpus (Nie et al., 2021) datasets.
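The leave-one-domain-out split described above can be sketched as follows (SNIPS domain names shown for illustration):

```python
def leave_one_domain_out(domains, target):
    """Train on every domain except the held-out target domain (zero-shot test)."""
    train = [d for d in domains if d != target]
    return train, target

snips_domains = ["AddToPlaylist", "BookRestaurant", "GetWeather", "PlayMusic",
                 "RateBook", "SearchCreativeWork", "SearchScreeningEvent"]
train_domains, test_domain = leave_one_domain_out(snips_domains, "GetWeather")
# train on the six remaining domains, test zero-shot on GetWeather
```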

Baselines
We compare the performance of our HiCL with the previous best models; the details of the baseline models are provided in Appendix E.

Training Approach
Training Sample Construction The output of HiCL is a BIO prediction for each slot type. Training samples follow the pattern (S_t, A_t, U_i, Y′_i), where S_t is a target slot type, A_t is all slot types except S_t, U_i is an utterance, Y′_i is the BIO label sequence for S_t, and all slot types A = S_t ∪ A_t; for simplicity, hierarchical CL labels are omitted here. For a sample from a given dataset with the pattern (U_i, Y_i) that contains entities for k slot types, k positive training samples for U_i can be generated by setting each of the k slot types as S_t in turn and generating the corresponding A_t and Y′_i. Then, m negative training samples for U_i can be generated by choosing slot types that belong to A but do not appear in U_i (see the example in Figure 3). Iterative Label Set Semantics Inference By iteratively feeding the training samples constructed in § 4.5 into HiCL, the model outputs a BIO label sequence for each target slot type. We name this training and prediction paradigm iterative label set semantics inference (ILSSI). Algorithms 1 and 2 in the Appendix elaborate on the details of ILSSI.
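The sample-construction step can be sketched as follows (a simplified toy: the slot names are invented, and the paper's actual Y′_i construction and negative sampling may differ in detail):

```python
def build_training_samples(utterance, gold_bio, all_slots, num_neg=2):
    """One positive sample per slot type present in the utterance (BIO labels kept
    only for that target slot type), plus negatives drawn from absent slot types
    (whose gold label sequence is all-O)."""
    present = {lab[2:] for lab in gold_bio if lab != "O"}
    samples = []
    for s_t in sorted(present):                                  # k positives
        y = [lab if lab != "O" and lab[2:] == s_t else "O" for lab in gold_bio]
        samples.append((s_t, sorted(all_slots - {s_t}), utterance, y))
    for s_t in sorted(all_slots - present)[:num_neg]:            # m negatives
        samples.append((s_t, sorted(all_slots - {s_t}), utterance, ["O"] * len(gold_bio)))
    return samples

utt = ["play", "taylor", "swift"]
gold = ["O", "B-artist", "I-artist"]
samples = build_training_samples(utt, gold, {"artist", "album", "playlist"})
# -> 1 positive (target slot "artist") + 2 negatives ("album", "playlist")
```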

Main Results
We examine the effectiveness of HiCL by comparing it with competing baselines. The average performance across different target domains on the SNIPS, ATIS, MIT_corpus and SGD datasets is reported in Tables 1, 2, 3 and 4, respectively. The proposed method consistently outperforms the previous BERT-based and ELMo-based SOTA methods, and performs comparably to the previous RNN-based SOTA methods. Detailed seen-slot and unseen-slot performance across different target domains on SNIPS, ATIS, MIT_corpus and SGD is reported in Tables 6, 7, 8 and 9, respectively. On seen slots, the proposed method performs comparably to prior SOTA methods; on unseen slots, it consistently outperforms them.

Quantitative Analysis
Ablation Study To study the contribution of each component of hierarchical CL, we conduct ablation experiments and display the results in Tables 1 to 9.
The results indicate that, on the whole, both coarse-grained entity-level CL and fine-grained token-level CL contribute substantially to the performance of HiCL on the different datasets. Taking the performance of HiCL on the SNIPS dataset as an example, as shown in Table 1, removing token-level CL ("w/o fine CL") sharply degrades the average performance of HiCL by 2.17%, while removing entity-level CL ("w/o coarse CL") drops it by 1.26%. Besides, as shown in Table 6, removing entity-level CL ("w/o L_coarse") drastically reduces unseen-slot performance by 4.61%, and removing token-level CL ("w/o L_fine") decreases it by 4.01%.

Coarse-grained CL vs. Fine-grained CL Based on the ablation results (§ 5.2), our analysis is that coarse-grained CL complements fine-grained CL with entity-level boundary information and entity type knowledge, while fine-grained CL complements coarse-grained CL with token-level boundary information (BIO) and token-class type knowledge. Their combination is superior to either alone and helps HiCL obtain better performance.

Unseen Slots vs. Seen Slots As shown in Tables 6, 7, 8 and 9, the proposed HiCL significantly improves performance on unseen slots against the SOTA models while maintaining comparable seen-slot performance, which verifies the effectiveness of our HiCL framework on the cross-domain ZSSF task. From the remarkable improvements on unseen slots, we can see that, rather than only fitting seen slots in the source domains, our model has learned generalized slot-agnostic features of entity classes, including unseen classes, by leveraging the proposed hierarchical contrast method. This enables HiCL to effectively transfer domain-invariant knowledge of slot types from the source domains to an unknown target domain.

Gaussian embedding vs. Point embedding We provide an ablation study for the different forms of embedding migrated to HiCL, i.e., Gaussian embedding and point embedding, and investigate their impact on unseen-slot and seen-slot performance. The results are shown in Table 5. We analyze possible causes of TOD-BERT's inferior performance as follows: (1) TOD-BERT is unable to directly take advantage of prior knowledge from pre-training on the SNIPS, ATIS and MIT_corpus datasets. These datasets are not included in the pre-training corpora of TOD-BERT, and their knowledge remains unseen to TOD-BERT.
(2) There is a discrepancy in data distribution between the corpora TOD-BERT was pre-trained on and the SNIPS, ATIS and MIT_corpus datasets. The datasets TOD-BERT was pre-trained on are multi-turn task-oriented dialogues modeling user utterances and agent responses, whereas SNIPS, ATIS and MIT_corpus consist of single-turn user utterances in task-oriented dialogues. This intrinsic data difference may affect the performance of TOD-BERT on these single-turn dialogue datasets.
(3) TOD-BERT may also suffer from catastrophic forgetting (Kirkpatrick et al., 2016) during pre-training. TOD-BERT is further pre-trained from a BERT initialization, and catastrophic forgetting may prevent it from fully leveraging the general-purpose knowledge of pre-trained BERT in zero-shot learning scenarios. From the experimental results, we observe that the performance of TOD-BERT is even much lower than that of BERT-based models (e.g., mcBERT), which may be empirical evidence for the above analysis.
In contrast, TOD-BERT performs extremely well on the SGD dataset and beats all BERT-based models. This is because TOD-BERT was pre-trained on the SGD dataset (Wu et al., 2020) and can thoroughly leverage this prior pre-trained knowledge to tackle various downstream tasks on this dataset, including zero-shot ones. However, when our HiCL is migrated to TOD-BERT (HiCL+TOD-BERT), as shown in Tables 4 and 9, performance again improves. Concretely, overall performance increases by 0.4% and unseen-slot performance increases by 9.18%, a prodigious boost, at the expense of only a 2.43% drop in seen-slot performance. This demonstrates that, in terms of unseen-slot performance, even on pre-training datasets the model has already seen, our HiCL method can still compensate for the shortcomings of pre-trained TOD models (e.g., TOD-BERT).

Qualitative Analysis
Visualization Analysis Figure 5 in the Appendix provides t-SNE scatter plots that visualize the representations learned by the baseline models and HiCL on the test set of the GetWeather target domain of the SNIPS dataset.
In Figure 5(a) and Figure 5(c), we observe that TOD-BERT and mcBERT produce poor clusterings: their representations of predicted slot entities for unseen slots (numbers 1, 2, 3 and 4 in the figure) are sparsely spread out in the embedding space and intermingle over large areas with other unseen-slot entity representations and with outside (O) tokens; TOD-BERT is even worse, with its seen-slot entity representations (numbers 5, 6, 7, 8 and 9 in the figure) also somewhat sparsely scattered. This phenomenon suggests two problems. On the one hand, without the effective constraint of a generalization technique, and restricted to prior knowledge from fitting the limited slot types in the source training domains, TOD-BERT and mcBERT learn few domain-invariant and slot-agnostic features that could help them recognize new slot entities, so they mistakenly classify many unseen-slot entities as outside (O). On the other hand, although TOD-BERT and mcBERT generate clear-cut clusters for seen-slot entity classes, they possess sub-optimal discriminative capability between new slot-entity classes of unseen slots and falsely predict multiple new slot types for the same entity tokens.
In Figure 5(b) and 5(d), we can clearly see that HiCL produces a better clustering division between new slot-entity classes of unseen slots, and between new slot-entity classes and outside (O), owing to its generalized ability to differentiate between entity classes (token classes) by extracting class-agnostic features through hierarchical CL. Moreover, equipped with Gaussian embeddings and KL divergence, HiCL exhibits even more robust performance on unseen slots than when equipped with point embeddings and Euclidean distance: the clusters of new slot-entity classes of unseen slots and outside (O) are distributed more compactly and separately.

Case Study Table 11 in the Appendix shows prediction examples of ILSSI on the SNIPS dataset with HiCL and mcBERT, for both unseen and seen slots, in the target domains BookRestaurant and GetWeather, respectively. The main observations are summarized as follows: (1) mcBERT is prone to repeatedly predicting the same entity tokens for different unseen slot types, which leads to misclassifications and performance degradation. For instance, in the BookRestaurant target domain, given the utterance "i d like a table for ten in 2 minutes at french horn sonning eye", mcBERT repeatedly predicts the same entity tokens "french horn sonning eye" for three different unseen slot types. This behavior amounts to a nearly random guess of the slot type for a given entity: having learned little prior knowledge of generalized token or entity classes, mcBERT has inferior capacity to differentiate between token or entity categories, which exposes its fragility on unseen slots. In contrast, HiCL predicts entity tokens more robustly across different unseen slots and significantly outperforms mcBERT on unseen slots in different target domains. Thanks to hierarchical CL and ILSSI, HiCL learns generalized knowledge to differentiate between token or entity classes, even new ones, which is a generalized slot-agnostic ability. (2) HiCL is more capable than mcBERT of recognizing new entities by leveraging the learned generalized knowledge of token and entity classes. For instance, in the GetWeather target domain, both HiCL and mcBERT can recognize the token-level entities "warmer" and "hotter" that belong to the unseen class condition_temperature, but mcBERT fails to recognize "freezing" and "temperate", which also belong to this class, owing to its limited generalization knowledge of token-level classes. With the help of hierarchical CL, which aims at extracting the most domain-invariant features of token and entity classes, HiCL succeeds in recognizing these novel entities. (3) HiCL performs slightly better than mcBERT on seen slots. The two models demonstrate equivalent capability to transfer seen-slot knowledge from different training domains to target domains.

Conclusion
In this paper, we improve cross-domain ZSSF models from a new perspective: strengthening the model's generalized ability to differentiate between entity classes (token classes) by extracting the most domain-invariant and class-agnostic features. Specifically, we introduce a novel pretraining-free HiCL framework, primarily comprising hierarchical CL and iterative label set semantics inference, which effectively improves the model's ability to discover novel entities and discriminate between new slot-entity classes, benefiting cross-domain transferability of unseen slots. Experimental results demonstrate that HiCL is a backbone-independent framework and that, compared with several SOTA methods, it performs comparably or better on unseen slots and on overall performance in new target domains of ZSSF.

Limitations and Future Work
Large language models (LLMs) exhibit powerful abilities in zero-shot and few-shot scenarios. However, LLMs such as ChatGPT seem not to be good at sequence labeling tasks (Li et al., 2023; Wang et al., 2023), for example slot filling and named-entity recognition. Our work endeavors to remedy this shortcoming with lightweight language models. However, if the annotated datasets are large enough, our method will degenerate and may even hurt the generalization performance of the models (e.g., transformer-based language models), since the models would then generalize well by thoroughly learning the rich data features, distributions and dimensions, without the constraint of techniques that become a downside in such circumstances; this reflects the principle that the upper bound of model performance depends on the data itself. We directly adopt the slot label itself in the contrastive training of HiCL, which does not model interactions between label knowledge and semantics. In future work, we will develop more advanced modules for label-semantics interaction that leverage slot descriptions and pre-trained transformer-based large language models to further boost unseen-slot filling performance for TOD.
in the NER task, while we are the first to introduce token-level CL in this setting, and we innovatively present a hierarchical CL architecture and empirically verify that the combination of entity- and token-level CL significantly outperforms either of them alone. Secondly, we explore a more challenging research direction, i.e., the zero-shot task for single-turn and multi-turn task-oriented dialogues, rather than the few-shot NER task over single independent sentences (Das et al., 2022). Finally, we take a deep dive into pre-trained general language models for task-oriented dialogue (TOD), evaluating and comparing them with our purpose-built expert model for ZSSF (HiCL). Most researchers will be curious whether the vanilla capability of pre-trained general TOD models can replace all small expert models for ZSSF in this scenario, and whether specially designed generalization techniques still work or bring benefits on top of pre-trained general models in this specific field. This also offers some insight for research on large language models (LLMs) like ChatGPT, and our work brings some exploration along these lines. We introduce an expert pre-trained model (TOD-BERT), pre-trained on large corpora of task-oriented dialogue, as a baseline to explore whether this model is good enough for unseen-slot generalization, and whether our method can further improve unseen-slot performance on top of TOD-BERT, an exploration also missing from (Das et al., 2022). In this work, we introduce a Gaussian embedding based hierarchical CL framework. It first coarsely learns entity-class knowledge of entity type and boundary via entity-level CL. Then, it combines the learned entity-level features to finely learn token-class knowledge of BIO type and boundary via token-level CL. Our method is essentially different from the above approaches.

B.1 Performance Gain vs. Dataset Volume
In Tables 1, 2, 3 and 4, we observe that the performance gain of HiCL over the baselines gradually diminishes as dataset volume increases (see Table 10). Besides, as indicated in Tables 6, 7, 8 and 9, on large datasets HiCL needs to sacrifice more seen-slot performance to improve unseen-slot performance. This phenomenon indicates that the zero-shot generalization ability of the baseline models gradually becomes stronger with the growth of dataset volume and the diversity of slot types, which helps the baseline models learn more domain-invariant slot features for cross-domain ZSSF.
ATIS (Hemphill et al., 1990) is a dataset of transcribed audio recordings of people making flight reservations, covering 18 domains. Domains that contain fewer than 100 utterances are merged into a single Others domain in our experiments.
MIT_corpus is a spoken query dataset that consists of the MIT restaurant domain and the MIT movie domain. The MIT movie domain contains the eng corpus and the trivia10k13 corpus, i.e., a simple query version and a complex query version, respectively. We merge the two versions into one corpus and call it the MIT movie domain.
SGD (Rastogi et al., 2020) contains 16 domains. However, we find that when training on the 15 domains other than a single target domain, almost all slot types become seen slots. To increase the number of unseen slots and the difficulty of knowledge transfer, we adopt the same dataset scope as (Gupta et al., 2022; Coope et al., 2020) and choose four domains of the SGD dataset, namely Buses, Events, Homes and Rental Cars.

C.2 Unseen Slots and Seen Slots in Different Domains
Table 12 presents detailed unseen slots and seen slots in different domains for four datasets, i.e., SNIPS, ATIS, MIT_corpus and SGD.

D Implementation Details
We use uncased BERT to implement the encoder in our model, which has 12 attention heads and 12 transformer blocks. For TOD-BERT, we use the stronger variant pre-trained with both the MLM and RCL objectives. We use 100-dimensional Gaussian embeddings, the AdamW optimizer (Loshchilov and Hutter, 2019) with (β1, β2) = (0.9, 0.999), and a warm-up strategy (warm-up steps are 1% of total training steps). The early-stopping patience is set to 30 for stability. τ = 0.07 in Eq. (6) and the dropout rate is 0.3. We set α = 1 and β = 1 in Eq. (10). We select the best hyperparameters by searching over combinations of batch size, learning rate and γ in Eq. (10): the learning rate is in {1 × 10−6, 5 × 10−6, 1 × 10−5, 5 × 10−5}, the batch size is in {8, 16, 32, 64}, and γ is in {0.001, 0.01, 0.05, 0.1, 0.5}. For instance, the optimal learning rate is 1 × 10−5 for the SNIPS dataset and 1 × 10−6 for the MIT_corpus dataset. We select the best-performing model on the dev set and evaluate it on the test set. We run all experiments 5 times and average the results. We train and test our model on 4 NVIDIA GeForce RTX 3090 GPUs and 1 NVIDIA Tesla A100 GPU; on average it takes less than one hour to reach convergence.
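A minimal sketch of the warm-up schedule described above. The linear ramp shape is an assumption on our part; the text only states that the warm-up steps are 1% of total training steps, and the function name is illustrative.

```python
def warmup_lr(step, total_steps, base_lr, warmup_frac=0.01):
    """Linear warm-up: ramp the learning rate from ~0 up to `base_lr` over the
    first `warmup_frac` of training steps, then hold it at `base_lr`
    (the paper pairs the warm-up with the AdamW optimizer)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```

In practice this would be wired into the optimizer as a per-step learning-rate scheduler.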

E Baseline Details
We compare HiCL with the following competing models.
• Concept Tagger (CT) (Bapna et al., 2017) is a leading one-stage model for ZSSF, which uses slot descriptions to improve performance on unseen slots in the target domain.
• Robust Zero-shot Tagger (RZT) (Shah et al., 2019) incorporates additional slot example entities, combined with slot descriptions, to improve zero-shot adaptation.
• Coarse-to-fine Approach (Coach) (Liu et al., 2020) is a pioneering two-stage framework for ZSSF, which divides the ZSSF task into two stages: coarse-grained slot-entity segmentation in BIO form, and fine-grained alignment between slot entities and slot types using slot descriptions. We use their stronger variant Coach+TR and call it Coach for brevity.

F Rectified Test Set Split for Unseen Slots and Seen Slots
In the cross-domain slot filling scenario, seen slots are slot types that appear in both the source domains and the target domain, while unseen slots are slot types that appear only in the target domain. As shown in Figure 6, Liu et al. (2020) divide the SNIPS (Coucke et al., 2018) test set into unseen and seen subsets according to whether an utterance contains at least one unseen slot, which leads to unseen-seen slot overlap in unseen-slot performance evaluation. Since the results on the unseen samples are actually entangled with seen slots (the unseen-sample test set contains both unseen slots and seen slots), this seriously biases the measurement of a model's actual performance on unseen slots.
For example, for the utterance "will it be colder in connorville" in the target test set, the corresponding slot label sequence is "O O O B-condition_temperature O B-city"; this sample contains both the unseen slot type condition_temperature and the seen slot type city. Nevertheless, this sample will be classified into the unseen test set by the method of Coach (Liu et al., 2020).
We rectify this bias by splitting the unseen and seen test sets with a slot-granularity strategy instead of the sample-granularity method used in Coach (Liu et al., 2020). This slot-granularity split is illustrated in § 4.5 and Algorithms 1 and 2 (iterative label set semantics inference), and makes it possible to train and test unseen and seen slots separately and without bias. For instance, under this new split, as shown in Figure 7, the slot types condition_temperature and city in the utterance "will it be colder in connorville" are re-classified into the unseen subset and the seen subset, respectively (§ 4.5). Comparing Figure 6 and Figure 7, we observe that the unseen and seen sub-test sets are distributed more evenly across domains under our approach than under the Coach (Liu et al., 2020) method. The comparison also reveals a defect of the original Coach split. For example, in the target domain RateBook, as shown in Figure 6, the split indicates that the number of seen samples (or seen slots) is zero and all samples are unseen. However, with our new method, the number of seen slots is 3,112 and the number of unseen slots is 7,780; roughly one third of the slots in domain RateBook are seen slots, but they were wrongly classified into the unseen test set by the biased Coach split, and their performance was counted as that of unseen samples. To our knowledge, almost all baselines follow the split method of Coach (Liu et al., 2020), so we re-evaluate their real performance on unseen slots with our new split method, i.e., the iterative label set semantics inference illustrated in § 4.5 and Algorithms 1 and 2.
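The slot-granularity split can be sketched as follows. This is an illustrative simplification (the function name and the toy source slot set are our own, not the paper's code): each slot occurrence is classified on its own, instead of tagging the whole utterance as unseen whenever any unseen slot appears.

```python
def split_slots(target_slot_labels, source_slot_types):
    """Slot-granularity split: route each slot occurrence in a target-domain
    utterance into the unseen or seen pool individually."""
    unseen, seen = [], []
    for slot in target_slot_labels:
        (seen if slot in source_slot_types else unseen).append(slot)
    return unseen, seen

# Toy example based on the paper's utterance "will it be colder in connorville";
# the source slot set here is hypothetical.
unseen, seen = split_slots(
    ["condition_temperature", "city"],
    source_slot_types={"city", "timeRange"},
)
# condition_temperature goes to the unseen subset, city to the seen subset.
```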

Figure 1 :
Figure 1: Cross-domain slot filling models' performance on unseen and seen slots in GetWeather target domain on SNIPS dataset.

Figure 2 :
Figure 2: Hierarchical contrastive learning (CL), where coarse-grained and fine-grained slot labels are used as the supervised signals for CL, i.e., step one is entity-label-level CL and step two is token-label-level CL. The entity label is a pseudo-label in our hierarchical CL. Different colors of rectangular bounding boxes denote different slot types.

Figure 3 :
Figure 3: The main architecture of HiCL. For simplicity, we only draw a detailed illustration for fine-grained token-level CL. Different colors of rectangular boxes in the utterances and in token-level CL (right side) denote different slot types.

Figure 4 :
Figure 4: HiCL's performance on unseen and seen slots of each domain in the SNIPS dataset when equipped with Gaussian embedding and with point embedding. The lines denote average F1 scores over all domains.
Algorithm 2: Iterative Label Set Semantics Inference
Data: Testing data D_ts (including all negative and positive samples constructed as in § 4.5) in the target domain; all slot types in the source training domains A_tr; all slot types in the target domain A_ts
Result: Total prediction sets O, O_unseen and O_seen, and F1 scores for O, O_unseen and O_seen in the target domain
1:  for each testing sample D_i ∈ D_ts do
2:      if S_t ∈ A_ts and S_t ∉ A_tr then
3:          add D_i to D_ts^unseen
4:      else if S_t ∈ A_ts and S_t ∈ A_tr then
5:          add D_i to D_ts^seen
6:  for each testing batch D ∈ D_ts do
7:      for D_i ∈ D do
8:          O_i = CRF(W_i ⊗ (PLM((S_t, A_t, U_i)) + b_i))
9:          add O_i to O
10: for each unseen-slots testing batch D^un ∈ D_ts^unseen do
11:     for D_i ∈ D^un do
12:         O_i = CRF(W_i ⊗ (PLM((S_t, A_t, U_i)) + b_i))
13:         add O_i to O_unseen
14: for each seen-slots testing batch D^sn ∈ D_ts^seen do
15:     for D_i ∈ D^sn do
16:         O_i = CRF(W_i ⊗ (PLM((S_t, A_t, U_i)) + b_i))
17:         add O_i to O_seen
18: calculate the F1 score for O
19: calculate the F1 score for O_unseen
20: calculate the F1 score for O_seen

Figure 5: t-SNE visualization on the test set of the GetWeather target domain on the SNIPS dataset for different methods. 1-4 denote unseen slots and 5-9 denote seen slots in all subfigures.
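The control flow of the iterative label set semantics inference (Algorithm 2) can be sketched in Python. This is a simplified stand-in, not the paper's implementation: `model(slot_type, utterance)` abstracts away the CRF-over-PLM tagger, and batching is omitted.

```python
def iterative_label_set_inference(model, test_samples, source_slots, target_slots):
    """Query the model once per (utterance, slot type) pair, then route each
    prediction into the overall, unseen, and seen pools, so unseen- and
    seen-slot F1 can be computed separately."""
    overall, unseen, seen = [], [], []
    for utterance in test_samples:
        for slot_type in target_slots:
            pred = model(slot_type, utterance)  # BIO tags for this slot type
            overall.append(pred)
            if slot_type in source_slots:
                seen.append(pred)
            else:
                unseen.append(pred)
    return overall, unseen, seen
```

F1 would then be computed over each of the three pools with a standard sequence-labeling scorer.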
He et al. (2020) advocate adversarial-attack-strengthened contrastive learning with the objective of optimizing the mapping loss from slot entity to slot description in representation space for cross-domain slot filling. Wu et al. (2020) pre-train a natural language understanding BERT (Devlin et al., 2019b) for task-oriented dialogue with a joint masked language modeling (MLM) loss and response contrastive loss (RCL), achieving improvements in slot filling performance for dialogue state tracking. Liu et al. (2021) propose joint contrastive learning for few-shot intent classification and slot filling in task-oriented dialogue systems. Wang et al. (2021) propose prototypical contrastive learning to bridge the semantic gap between token features and slot types in ZSSF. Heo et al. (2022) train a BERT encoder with momentum contrastive learning to develop a robust ZSSF model.

Figure 6 :
Figure 6: Unseen- and seen-slot test-set split on the SNIPS dataset with the method of Coach (Liu et al., 2020). The number on top of each bar denotes the number of seen or unseen samples for that target domain in the test set. "0" indicates that no such (unseen or seen) sample exists.

Figure 7 :
Figure 7: Unseen- and seen-slot test-set split on the SNIPS dataset with our method. The number on top of each bar denotes the number of seen or unseen slots for that target domain in the test set. "0" indicates that no unseen slot exists.

Table 2 :
Slot F1 scores on ATIS dataset for different target domains that are unseen in training.AR, AF, AL, FT, GS and OS denote Abbreviation, Airfare, Airline, Flight, Ground Service, Others, respectively.

Table 3 :
Slot F1 scores on MIT_corpus dataset for different target domains that are unseen in training.
Training on a Single Source Domain and Testing on a Single Target Domain. A model is trained on a single source domain and tested on a single target domain. This single-source-to-single-target paradigm is evaluated on the MIT_corpus dataset.

Table 4 :
Slot F1 scores on SGD dataset for different target domains that are unseen in training.BS, ET, HE, RC denote Buses, Events, Homes, Rental Cars, respectively.

Table 5 :
The ablation study of HiCL adopting different types of embedding on SNIPS dataset.
As shown in Table 5 and Figure 4, HiCL achieves better performance on both seen and unseen slots when employing Gaussian embedding rather than its counterpart, point embedding.
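Unlike a point embedding, a Gaussian embedding represents each token as a distribution, so distances between tokens can be measured between distributions. As a sketch of this idea, the KL divergence between two diagonal Gaussians is one such inter-token distribution distance; the exact distance used in HiCL's objective may differ, and this function is illustrative.

```python
import math

def kl_diag_gaussians(mu0, var0, mu1, var1):
    """KL divergence KL(N0 || N1) between two diagonal Gaussians given as
    per-dimension means and variances. Unlike a point-embedding distance,
    it accounts for each token's uncertainty (variance), not only its mean."""
    return 0.5 * sum(
        math.log(v1 / v0) + (v0 + (m0 - m1) ** 2) / v1 - 1.0
        for m0, v0, m1, v1 in zip(mu0, var0, mu1, var1)
    )
```

Note that KL is asymmetric, i.e., KL(N0 || N1) generally differs from KL(N1 || N0); symmetric alternatives such as the Jeffreys divergence or Wasserstein distance are also common for Gaussian embeddings.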

Table 6 :
Detailed F1 scores on SNIPS for seen and unseen slots across all target domains. † denotes results reported in (Wang et al., 2021). ‡ denotes that we ran the publicly released code (Siddique et al., 2021) to obtain the experimental results, and ♮ denotes that we re-implemented the model; for the latter two we re-evaluate performance on seen and unseen slots following the unseen/seen sub-test-set split method in Appendix F and the iterative label set semantics inference test method in Algorithm 2.

Table 7 :
Detailed F1 scores on ATIS for seen and unseen slots across all target domains.

Table 8 :
Detailed F1 scores on MIT_corpus for seen and unseen slots across all target domains.

Table 9 :
Detailed F1 scores on SGD for seen and unseen slots across all target domains.

Table 10 :
Data statistics of training, validation and test for all domains of SNIPS, ATIS, MIT_corpus and SGD, respectively, after data augmentation.All baselines and HiCL are fine-tuned on the same augmented dataset.

Table 12 :
Unseen and seen slots in different domains of the SNIPS, ATIS, MIT_corpus and SGD datasets, respectively. The evaluation paradigm adopts each single domain as the target (test) domain and the remaining domains in the dataset as training domains. '-' denotes that no slot type exists.