Universal Domain Adaptation for Robust Handling of Distributional Shifts in NLP

When deploying machine learning systems in the wild, it is highly desirable for them to effectively leverage prior knowledge in unfamiliar domains while also raising alarms on anomalous inputs. To address these requirements, Universal Domain Adaptation (UniDA) has emerged as a novel research area in computer vision, focusing on achieving both adaptation ability and robustness (i.e., the ability to detect out-of-distribution samples). While UniDA has led to significant progress in computer vision, its application to language input remains unexplored despite its feasibility. In this paper, we propose a comprehensive benchmark for natural language that offers thorough viewpoints on a model's generalizability and robustness. Our benchmark encompasses multiple datasets with varying difficulty levels and characteristics, including temporal shifts and diverse domains. On top of our testbed, we validate existing UniDA methods from computer vision and state-of-the-art domain adaptation techniques from the NLP literature, yielding valuable findings: UniDA methods originally designed for image input can be effectively transferred to the natural language domain, and adaptation difficulty plays a decisive role in determining model performance.


Introduction
Deep learning models demonstrate satisfactory performance when tested on data from the training distribution. However, real-world systems ceaselessly encounter novel data that deviate from the training distribution, a situation commonly known as distributional shift. When confronted with such inputs, machine learning models frequently struggle to differentiate them from regular input. Consequently, they face challenges in adapting their previously acquired knowledge to the new data distribution, resulting in degraded performance.
Figure 1: The model trained with formal language (source domain) will likely face spoken language (target domain) in the real world. The model is expected to properly handle such transferable input despite the distributional shift (middle). At the same time, the model should discern unprocessable inputs (bottom) from the target domain.
The aforementioned phenomenon represents a long-standing challenge within the machine learning community, and even recent cutting-edge language models (OpenAI, 2023; Touvron et al., 2023; Chowdhery et al., 2022; Brown et al., 2020) are no exception to this predicament (Wang et al., 2023).
In response to these challenges, existing literature proposes two distinct approaches. The first approach, known as Domain Adaptation (DA) (Blitzer et al., 2006; Ganin et al., 2016b; Karouzos et al., 2021; Wu and Shi, 2022), endeavors to align new data from an unknown distribution with the model's prior knowledge distribution. The objective is to enhance the model's generalization capability and reduce the performance drop stemming from the distributional shift. In parallel, a distinct line of work, referred to as out-of-distribution (OOD) detection (Aggarwal, 2017; Hendrycks and Gimpel, 2017; Hendrycks et al., 2019; Cho et al., 2021), focuses on discerning inputs originating from dissimilar distributions. These methods circumvent potential risks or disruptions arising from shifted inputs, thereby improving system robustness and resilience.
While both approaches offer unique advantages in addressing specific distributional shifts, integrating their merits could substantially enhance robustness. In pursuit of this objective, a novel field called Universal Domain Adaptation (UniDA) (You et al., 2019) has emerged, aiming to harness the synergies of both OOD detection and DA when confronted with distributional shifts. The essence of UniDA lies in precisely measuring the uncertainty of data from the shifted distribution. We can then enhance the model's transferability by identifying low-uncertainty inputs that can be adequately handled with the model's current knowledge. Simultaneously, we improve robustness to OOD inputs by discerning the remaining samples that cannot be processed normally. However, distinguishing between these inputs and properly processing them becomes increasingly challenging without explicit supervision.
Despite the versatility of UniDA, this topic has yet to be explored in the Natural Language Processing (NLP) literature. As a cornerstone for enhancing reliability against distributional shifts in NLP, we introduce a testbed for evaluating a model's robustness from a holistic view. First, we construct various adaptation scenarios in NLP, utilizing an array of thoughtfully selected datasets. To discern the degree to which our proposed datasets incorporate the varying challenges of UniDA, we define two novel metrics: Performance Drop Rate (PDR) and Distinction Difficulty Score (DDS). Using these metrics, we verify that our testbed captures a broad spectrum of distributional shifts. Finally, based on the suggested setting, we systematically compare several UniDA methods inherently designed for the task against heuristic combinations of previous approaches to parts of the problem, i.e., OOD detection and DA.
Our empirical results show that UniDA methods are fully transferable to the NLP domain and can robustly respond to various degrees of shift. Moreover, we find that the adaptation difficulty notably affects the performance of the methods; in certain circumstances, DA methods display comparable or even better performance. We release our dataset, encouraging future research on UniDA in NLP to foster the development of more resilient and domain-specific strategies.

Universal Domain Adaptation

Problem Formulation
Distributional shift refers to a situation where the joint distribution P estimated from the training dataset fails to adequately represent the wide range of diverse and complex test inputs. More formally, a distributional shift arises when the test input x_test originates from a distant distribution Q that is not effectively encompassed by the trained distribution P.
The most prevalent research areas addressing this distributional shift include OOD detection and DA. OOD detection aims to strictly detect all inputs from Q to enhance the model's reliability. Although Q demonstrates a discernibly different distribution from P, the trained model can still transfer a subset of instances from Q, overcoming the inherent discrepancy between P and Q. This capability serves as the fundamental motivation underlying DA. UniDA endeavours to integrate the merits of both fields, thereby enhancing both the generalizability and reliability of the model. Specifically, let us divide the target distribution Q into a subset H, which shares the same label space as the source distribution P, and its disjoint complement I (Q = H ∪ I). The objective of UniDA is to enrich the robustness of the model by flexibly transferring existing knowledge to transferable samples from H while raising alarms on unknown samples from I.
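As a minimal illustration of this partition, the sketch below splits labeled target samples into H and I by checking each gold label against the source label space. All class names and sentences here are purely hypothetical, not drawn from the benchmark.

```python
# Hypothetical sketch: splitting the target set Q into transferable samples H
# (labels inside the source label space) and unknown samples I (labels outside
# it). Class names are illustrative only.
source_label_space = {"weather", "music", "alarm"}

# (input text, gold label) pairs drawn from the target distribution Q
target_samples = [
    ("will it rain tomorrow", "weather"),
    ("play some jazz", "music"),
    ("transfer 50 dollars to savings", "banking"),  # label unseen at training time
]

H = [(x, y) for x, y in target_samples if y in source_label_space]
I = [(x, y) for x, y in target_samples if y not in source_label_space]

assert len(H) + len(I) == len(target_samples)  # Q = H ∪ I, H and I disjoint
```

At test time, of course, the gold labels of target samples are unavailable; estimating this membership from the model's uncertainty is precisely what makes UniDA hard.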

Challenges in UniDA
UniDA models should accurately capture the underlying reasons behind the shift, thereby enabling discrimination between transferable and unknown samples. Among the diverse causes, the domain gap and the category gap (You et al., 2019) emerge as pivotal factors, each exerting a substantial impact on the overall complexity of the UniDA problem. While these concepts have previously been defined in a rather vague manner, we deduced the necessity of a more explicit definition. Thus, we set forth to redefine the concepts of domain gap and category gap more explicitly.
The domain gap refers to the performance drop that occurs when a model trained on P fails to correctly process transferable inputs due to the fundamental discrepancy between P and H, i.e., a domain shift. A dataset with a higher domain gap amplifies the problem's difficulty, as the trained model becomes more susceptible to misaligning transferable samples.
A category shift, characterized by the disparity between the class sets of P and I, causes a category gap. The category gap represents the performance drop arising from inputs in I, which cannot be processed properly because their classes differ from those of P and are thus erroneously handled. A larger category gap makes it harder to distinguish unknown samples from transferable samples, thereby worsening the robustness of the model.
From the perspective of addressing the domain gap and category gap, the main goal of UniDA is to minimize both gaps simultaneously: transferable samples should properly align with the source domain for adequate processing, while unknown samples are handled as unprocessable exceptions.

Testbed Design
The primary objective of our research is to construct a comprehensive benchmark dataset that effectively captures the viewpoint of UniDA. To accomplish this objective, we attempt to create a diverse dataset that encompasses a range of difficulty levels and characteristics, such as domain, sentiment, or temporal change. These variations are fundamental elements that can significantly influence overall performance.
Specifically, we initially select datasets from multiple practical domains and approximate the adaptation difficulty by quantifying different shifts with our newly proposed metrics. In the following subsections, we provide an in-depth explanation of our dataset along with an analysis of our benchmarks.

Quantifying Different Shifts
As the extent of both the domain and category gaps significantly influences the overall adaptation complexity, it is essential to quantify these gaps when designing the evaluation dataset. Unfortunately, existing literature has not devised a clear-cut, quantitative measure for assessing domain and category gaps. Therefore, we endeavoured to define measures that aptly approximate the two types of gaps.
Performance Drop Rate (PDR) measures the degree of domain gap by assessing the absolute difficulty of the dataset itself and the performance drop caused by the shift from P to H. Specifically, we fine-tune bert-base-uncased on the source train set and evaluate its test-set accuracy acc_s on the same distribution. Leveraging the same model trained on the source domain, we then measure the accuracy acc_t on the target test set. We measure the performance degradation caused by the distributional shift as acc_s − acc_t. Since the significance of the performance drop may vary depending on the source performance, we normalize the result by the source performance and measure the proportion of the degradation. A larger drop rate indicates a greater decline in performance relative to the source domain. Formally, PDR for a source domain s and a target domain t is measured as:

PDR_{s,t} = (acc_s − acc_t) / acc_s. (1)

Distinction Difficulty Score (DDS) estimates the difficulty of distinguishing between H and I, in other words, the difficulty of handling the category shift. We utilize the same model trained on the source domain and extract the [CLS] representations of the source inputs. We estimate the source distribution by assuming the extracted representations follow a multivariate normal distribution. We then extract [CLS] representations of target inputs from the same model and measure their Mahalanobis distance to the source distribution. Using this distance as a score, we measure the Area Under the ROC Curve (AUC) for discerning I from H. AUC values closer to 1 indicate that unknown inputs are easier to discern from transferable inputs. Since we focus on quantifying the difficulty of distinguishing the two, we subtract the AUC from 1 to derive our final measure. For a source domain s, a target domain t, and the corresponding AUC_{s,t}, DDS is measured as:

DDS_{s,t} = 1 − AUC_{s,t}. (2)
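Under these definitions, both metrics reduce to a few lines of computation. The sketch below (NumPy only; the function and variable names are ours, not from the released code) computes PDR from the two accuracies and DDS from [CLS]-style feature matrices, using a simple pairwise-ranking AUROC to avoid extra dependencies:

```python
import numpy as np

def pdr(acc_s: float, acc_t: float) -> float:
    """Performance Drop Rate: source-to-target accuracy drop,
    normalized by the source accuracy."""
    return (acc_s - acc_t) / acc_s

def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def dds(source_feats, target_feats, is_unknown):
    """Distinction Difficulty Score: fit a Gaussian to source features,
    score target features by squared Mahalanobis distance, and return
    1 - AUROC of separating unknown (I) from transferable (H) samples."""
    mu = source_feats.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(source_feats, rowvar=False))
    diff = target_feats - mu
    dist = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return 1.0 - auroc(is_unknown, dist)

# Toy check: unknown samples far from the source cluster are easy to
# separate, so DDS should be near 0 on this synthetic data.
rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(200, 8))
tgt = np.vstack([rng.normal(0.0, 1.0, size=(50, 8)),   # H: near source
                 rng.normal(8.0, 1.0, size=(50, 8))])  # I: far away
labels = np.array([0] * 50 + [1] * 50)
print(round(pdr(0.90, 0.72), 3))  # 0.2
print(dds(src, tgt, labels))
```

The pseudo-inverse guards against a singular covariance estimate when the feature dimension approaches the number of source samples, which is common with 768-dimensional [CLS] features.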

Implementation of Different Shifts
To construct a representative testbed for UniDA, it is essential to illustrate domain and category shifts.
To exhibit domain shift, we delineated domains from various perspectives. This involves explicit factors such as temporal or sentiment shifts as well as implicit definitions based on the composition of the class set. The detailed formation of domains for each dataset is stipulated in Section 3.3.
To establish category shifts, the source and the target domain must share a set of common classes C and each hold a set of private classes, Cs and Ct respectively. We followed previous works (You et al., 2019; Fu et al., 2020) by sorting the class names in alphabetical order and selecting the first |C| classes as common, the subsequent |Cs| as source private, and the rest as target private classes. The class splits for each dataset are stated as |C|/|Cs|/|Ct| in the main experiments.
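This split protocol can be sketched in a few lines; the class names and split sizes below are illustrative, not taken from any of our datasets.

```python
# Sketch of the alphabetical class-split protocol: sort class names, take the
# first n_common as common classes, the next n_src_private as source-private,
# and the remainder as target-private. Class names are illustrative only.
def split_classes(class_names, n_common, n_src_private):
    ordered = sorted(class_names)
    common = ordered[:n_common]
    src_private = ordered[n_common:n_common + n_src_private]
    tgt_private = ordered[n_common + n_src_private:]
    return common, src_private, tgt_private

common, src_priv, tgt_priv = split_classes(
    ["weather", "alarm", "music", "news", "banking"],
    n_common=2, n_src_private=1)
# sorted order: alarm, banking, music, news, weather
print(common, src_priv, tgt_priv)
```

Sorting alphabetically makes the split deterministic and reproducible across runs, at the cost of being one particular split; Appendix C reports an ablation with random splits.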

Dataset Details
We focused on text classification tasks for our experiments. Four datasets were selected from multiple widely used classification domains in NLP, such as topic classification, sentiment analysis, and intent classification. We reformulated the datasets so that our testbed could cover diverse adaptation scenarios.
Huffpost News Topic Classification (Huffpost) (Misra, 2022) contains Huffpost news headlines spanning from 2012 to 2022. The task is to classify news categories given the headlines. Using the additionally provided temporal information, we split the dataset year-wise from 2012 to 2017, treating each year as a distinct domain. We selected the year 2012 as the source domain, with the subsequent years assigned as target domains, creating 5 different levels of temporal shift.
Multilingual Amazon Reviews Corpus (Amazon) (Keung et al., 2020) includes product reviews that are commonly used to predict star ratings based on the review; additional product information is provided for each review. We revised the task to predict the product information given the review and utilized the star ratings to define sentiment domains. Reviews with a star rating of 1 or 2 are grouped as negative sentiment, and those with a rating of 4 or 5 are categorized as positive. We exclude 3-star reviews, considering them neutral.

MASSIVE (FitzGerald et al., 2022) is a hierarchical dataset for intent classification. The dataset consists of 18 first-level and 60 second-level classes. Each domain is defined as a set of classes, including private classes exclusive to a specific domain and common classes shared across domains. We divided each common first-level class into two parts based on second-level classes to simulate domain discrepancy. One half of the divided common class represents the source domain while the other half represents the target domain. We assume that the second-level classes within the same first-level class share common features and can thus be adapted.
CLINC-150 (Larson et al., 2019) is widely used for intent classification and OOD detection. The dataset consists of 150 second-level classes over 10 first-level classes and a single out-of-scope class. Domains are defined in the same way as for MASSIVE.

Dataset Analysis
In this section, we validate whether our testbed successfully demonstrates diverse adaptation difficulties, aligning with our original motivation. We assess adaptation difficulty from the domain and category gap perspectives, approximated by PDR and DDS, respectively.
The results of PDR and DDS are reported in Table 1. The results show diverse PDR values ranging from 7 to 36 points, indicating various degrees of domain gap across the datasets. MASSIVE exhibits the largest domain gap, while Huffpost (2013) demonstrates the smallest among the datasets. Additionally, our testbed covers a wide range of category gaps, indicated by the broad spectrum of DDS values. Specifically, Amazon exhibits a significantly high DDS value, representing an extremely challenging scenario for differentiating unknown samples from transferable samples.
We consolidate the two indicators to measure the coverage of different adaptation complexities in our proposed testbed. Figure 2 visualizes the adaptation complexity of each dataset, with PDR and DDS on the respective axes. We grouped the datasets into four distinct clusters based on the plotted distribution. Datasets closer to the lower-left corner, CLINC-150 and Huffpost (2013), are easy adaptation scenarios with minor domain and category gaps. Datasets plotted in the center, Huffpost (2014, 2015, 2016), present moderate difficulty. Amazon suffers significantly from a category gap, while Huffpost (2017) and MASSIVE demonstrate a notable domain gap, yielding high adaptation complexity. These results validate that our testbed embodies a diverse range of adaptation difficulties as intended.

Compared Methods
We compare several domain adaptation methods on our proposed testbed. We selected two previous state-of-the-art closed-set Domain Adaptation (CDA) methods, UDALM (Karouzos et al., 2021) and AdSPT (Wu and Shi, 2022), which assume that all inputs from the target domain are transferable and do not consider unknown classes. We also selected two previous state-of-the-art UniDA methods, OVANet (Saito and Saenko, 2021) and UniOT (Chang et al., 2022), which are fully optimized to handle UniDA scenarios in the vision domain. We additionally conducted experiments with baseline methods such as DANN (Ganin et al., 2016a), UAN (You et al., 2019), and CMU (Fu et al., 2020). However, their performance was subpar compared to the selected methods while exhibiting similar tendencies; hence, we report these additional results in Appendix A. For the backbone of all methods, we utilized bert-base-uncased (Devlin et al., 2019) and used the [CLS] representation as the input feature. Implementation details are stipulated in Appendix B.

Thresholding Method
Since CDA methods are not designed to handle unknown inputs, additional techniques are required to discern them. A straightforward yet intuitive approach to detecting unknown inputs is to apply a threshold to the output of a scoring function, which reflects the appropriateness of the input based on its extracted representation. If the output of the scoring function falls below the threshold, the instance is classified as unknown. We apply thresholding sequentially after the adaptation process. Formally, for an input x, categorical prediction ŷ, threshold value w, and scoring function f_score, the final prediction is made as:

ŷ_final = ŷ if f_score(x) ≥ w, and unknown otherwise.

We utilize Maximum Softmax Probability (MSP) as the scoring function (Hendrycks and Gimpel, 2017). Following the OOD detection literature, the value at the 95% point of the sorted score values was selected as the threshold.
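A minimal sketch of MSP thresholding, assuming we already have classifier logits for each input (the function names are ours; -1 stands in for the unknown class):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict_with_rejection(logits, threshold):
    """Keep the argmax class when the maximum softmax probability (MSP)
    clears the threshold; otherwise emit -1 for 'unknown'."""
    probs = softmax(np.asarray(logits, dtype=float))
    msp = probs.max(axis=-1)
    preds = probs.argmax(axis=-1)
    return np.where(msp >= threshold, preds, -1)

# A peaked distribution passes; a flat (uncertain) one is rejected.
logits = [[6.0, 0.0, 0.0],   # peaked -> MSP ~0.99
          [0.1, 0.0, 0.1]]   # flat   -> MSP ~0.34
print(predict_with_rejection(logits, threshold=0.7))
```

Because the threshold is chosen from the score distribution rather than tuned on labeled target data, its quality varies across datasets, which is the instability discussed in the experimental results.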

Evaluation Protocol
The goal of UniDA is to properly process the transferable inputs and detect the unknown inputs simultaneously, consequently making both the transferable and the unknown accuracies crucial metrics.
We applied the H-score (Fu et al., 2020) as the primary evaluation metric to integrate both aspects. The H-score is the harmonic mean between the accuracy on the common class set, acc_C, and the accuracy of predicting the unknown class, acc_Ct. A model with a high H-score is considered robust in the UniDA setting, indicating proficiency in both adaptation (high acc_C) and OOD detection (high acc_Ct). Formally, the H-score is defined as:

H-score = (2 · acc_C · acc_Ct) / (acc_C + acc_Ct).

Although the H-score serves as an effective evaluation criterion for UniDA, we also report acc_C and acc_Ct to provide a comprehensive assessment. We report averaged results and standard deviations over four runs for all experiments.
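As a quick sketch, the H-score is a two-line harmonic mean; it also shows why a degenerate model that rejects every input (perfect unknown accuracy, zero common accuracy) scores zero:

```python
def h_score(acc_common: float, acc_unknown: float) -> float:
    """Harmonic mean of common-class accuracy and unknown-class accuracy."""
    if acc_common + acc_unknown == 0.0:
        return 0.0
    return 2.0 * acc_common * acc_unknown / (acc_common + acc_unknown)

print(h_score(0.8, 0.6))  # ~0.686: balanced performance is rewarded
print(h_score(1.0, 0.0))  # 0.0: rejecting everything scores zero
```

The harmonic mean penalizes imbalance far more than the arithmetic mean would, which is why thresholding failures that crater acc_C (as on CLINC-150 below) collapse the H-score despite a near-perfect acc_Ct.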
Experimental Results

Overview
We conduct evaluations based on the clusters defined in Section 3.4 and analyze how the results vary depending on the adaptation complexity. Figure 3 presents an overview of the H-score results for the best-performing method from each CDA and UniDA approach: AdSPT representing CDA and UniOT representing UniDA. Despite an outlier caused by unstable thresholding in CLINC-150, the overall trend demonstrates that AdSPT manifests comparable performance in less complex scenarios, while UniOT exhibits superior performance in challenging scenarios. These trends align with the findings for the other methods not depicted in the figure.

Detailed Results
Table 2 presents the results for relatively easy adaptation scenarios. CDA methods demonstrate performance on par with, or even superior to, UniDA methods. This result appears counterintuitive, as CDA methods are designed without considering unknown samples. Specifically, UDALM outperforms all UniDA methods on CLINC-150 and performs comparably or even better on Huffpost (2013). AdSPT exhibits the best performance on Huffpost (2013). However, AdSPT suffers a significant performance drop on CLINC-150; we speculate this is due to the inherent instability of the thresholding method. A misguided threshold classifies the majority of inputs as unknown, which leads to a very high acc_Ct but significantly reduces acc_C. This inconsistency also leads to a high variance in acc_Ct for all CDA methods.
In the case of moderate shifts, no particular method decisively stands out, as presented in Table 3. In all cases, AdSPT and UniOT achieve the best performance with a marginal difference, making it inconclusive which approach is superior. Despite relatively subpar performance, UDALM and OVANet also exhibit similar results. Still, it is notable that CDA methods, which are not inherently optimized for UniDA settings, show comparable results.

The results for Amazon, in which the category gap is most prominent, are reported in Table 4. UniDA methods exhibit substantially superior performance to CDA methods. In particular, while the difference in acc_C is marginal, there exists a gap of up to 55 points in acc_Ct. As the category gap intensifies, we observe a decline in the performance of CDA methods, which are fundamentally limited by their inability to handle unknown inputs.
Finally, the results for Huffpost (2017) and MASSIVE, which exhibit a high domain gap, are reported in Table 5. UniDA methods consistently display superior performance in most cases, although the divergence between the approaches is smaller than on Amazon. UniOT demonstrates the best performance on all datasets, with OVANet performing slightly lower. AdSPT performs marginally better than OVANet on Huffpost (2017), but the gap is small. Even though CDA methods post comparable performance, UniDA methods demonstrate better overall performance.

Impact of Threshold Values
The selection of the threshold value considerably influences the performance of CDA methods. To probe this impact, we carry out an analysis in which different threshold values are applied and the performance of each method is measured.
The results are demonstrated in Figure 4. In cases of low or moderate adaptation complexity, such as CLINC-150 and Huffpost (2013, 2014, 2015, 2016), CDA methods demonstrate the potential to outperform UniDA methods when provided with an appropriate threshold. However, as the adaptation complexity intensifies, as in Huffpost (2017), MASSIVE, and Amazon, UniDA methods outperform CDA methods regardless of the selected threshold. These observations align seamlessly with the findings from Section 4.3, which underscore the proficiency of UniDA methods in managing challenging adaptation scenarios. Additionally, determining the optimal threshold is particularly challenging in the absence of supervision from the target domain; therefore, the best performance should be considered an upper bound for the CDA methods.
Related Work

Domain Adaptation
Studies in the field of DA in NLP primarily assume a closed-set environment, in which the source and target domains share the same label space. CDA research has predominantly concentrated on learning domain-invariant features (Blitzer et al., 2006; Pan et al., 2010; Ben-David et al., 2020; Ganin and Lempitsky, 2015; Du et al., 2020) for effective adaptation. With the advent of pretrained language models (PLMs), CDA methods have evolved to effectively leverage the capabilities of PLMs. Techniques such as masked language modeling (Karouzos et al., 2021) or soft prompts with adversarial training (Wu and Shi, 2022) have shown promising results. However, the closed-set assumption has a fundamental drawback, as it may leave models vulnerable when exposed to data from an unknown class.
To mitigate this issue, a new line of work named UniDA (You et al., 2019) was proposed, which assumes no prior knowledge about the target domain. Other recent works focus on utilizing mutually nearest neighbor samples (Chen et al., 2022a,c) or leveraging source prototypes with target samples (Chen et al., 2022b; Kundu et al., 2022). Despite the practicality of UniDA, its application in the NLP domain has barely been explored.

Conclusion and Future Work
In this study, we present a testbed for evaluating UniDA in the field of NLP. The testbed is designed to exhibit various levels of domain and category gaps through different datasets. We proposed two novel metrics, PDR and DDS, which measure the degree of domain and category gap, respectively. We assessed UniDA methods and heuristic combinations of CDA and OOD detection on our proposed testbed. Experimental results show that UniDA methods, initially designed for the vision domain, can be effectively transferred to NLP. Additionally, CDA methods, which are not fully optimized for the UniDA scenario, produce comparable results in certain circumstances.
Recent trends in NLP focus on Large Language Models (LLMs) owing to their significant generalization abilities. However, the robustness of LLMs from the perspective of UniDA remains uncertain. As part of our future work, we will assess the performance and capabilities of LLMs from a UniDA viewpoint.

Limitations
Limited coverage of evaluated model sizes. The evaluation was conducted only with models of limited size. Moreover, zero-shot and few-shot evaluations of the large language models (LLMs) that have recently emerged with remarkable generalization capabilities are lacking. Evaluating LLMs is a top priority for our future work; in preliminary experiments, their results were somewhat unsatisfactory compared to small models with basic tuning for classification. Recent research evaluating LLMs on classification problems (such as GLUE) likewise reports performance not yet comparable to task-specifically tuned models. Considering the massive resource usage of LLMs and the fact that tuned small models still outperform them on specific tasks, the findings of this study remain highly valuable to the NLP community.
Limited scope of tasks. Our proposed testbed is restricted to text classification tasks. The majority of existing research on DA and OOD detection also focuses on classification. This selective task preference is primarily due to the challenge of defining concepts such as domain and category shifts in generative tasks. However, in light of recent advancements in the generative capabilities of models, handling distributional shifts in generative tasks is indubitably an essential problem to address in future work.

A.1 UniDA Results
Table 7 presents the full results of UniDA methods on our proposed testbed. Baseline methods such as UAN (You et al., 2019) and CMU (Fu et al., 2020) are included. We observe that UniDA methods do not always retain the same level of applicability in NLP. Specifically, UAN and CMU utilize a fixed threshold defined in the vision domain. While CMU remains fully compatible in the NLP domain, UAN fails to transfer effectively, as it cannot detect unknown samples.

A.2 CDA Results
In this section, we present the experimental results of CDA methods with two additional scoring functions: cosine similarity and Mahalanobis distance. The threshold value was selected from the scoring-function outputs using the same approach as in the main experiment. We also report the results of DANN (Ganin et al., 2016a) and source-only fine-tuning, which were left out of the main experiment. In some cases, source-only fine-tuning outperforms other adaptation methods, which has also been observed in the vision domain (You et al., 2019).

B Implementation Details
For the experiments, we adopt the 12-layer pretrained language model bert-base-uncased (Devlin et al., 2019) as the backbone of all methods and utilize the [CLS] representation as the input feature. The AdamW optimizer (Loshchilov and Hutter, 2019) was used for all experiments with a batch size of 32. We selected the best learning rate among 5e-4, 1e-4, 5e-5, 1e-5, and 5e-6; the learning rate for each method is reported in Table 6. The model was trained for 10 epochs with early stopping on the accuracy of the source domain's evaluation set. All experiments were implemented with PyTorch (Paszke et al., 2019) and the Huggingface Transformers library (Wolf et al., 2020), and take about an hour on a single Tesla V100 GPU.

C Ablation on Different Class Splits
For the main experiment, we used class names as the criterion to implement the category gap. However, this shows only one specific category-gap scenario. To provide a more comprehensive analysis, we also report results when the class set is randomly split. We used the CLINC-150 and MASSIVE datasets for this ablation study, with MSP thresholding applied for CDA methods. We conducted three experiments, each with a different class split, and for every split we report the average of three runs. Table 11 presents the results. Because the class set to be predicted changes, the task difficulty varies, resulting in differences in absolute performance. However, comparing relative performance between methods, we observe consistent trends regardless of the class split.

D Receiver Operating Characteristic (ROC) Curve
To measure the Distinction Difficulty Score (DDS), we calculated the AUROC and subtracted it from 1.

Figure 2 :
Figure 2: The visualization of adaptation difficulty for each dataset in terms of domain gap (PDR) and category gap (DDS). We categorized the datasets into 4 distinct groups based on adaptation complexity. Best viewed in color.

Figure 3 :
Figure 3: H-score results of AdSPT and UniOT on all the datasets. The preferred method varies depending on the adaptation complexity.
Figure 4: H-score performance with different threshold values. Results of UniDA methods are visualized as horizontal lines for comparison.
You et al. (2019) quantify sample-level transferability using uncertainty and domain similarity. Following this work, Fu et al. (2020) calibrate multiple uncertainty measures to handle this issue. Saito and Saenko (2021) apply a one-vs-all classifier to minimize inter-class distance and classify unknown classes. More recently, Chang et al. (2022) applied Optimal Transport and further expanded the task to discovering private classes.

Figure 5 :
Figure 5: ROC curve for discerning unknown samples from transferable samples. The closer the ROC curve is to the upper-left corner, the easier it is to distinguish between the two.

Table 1 :
PDR and DDS values for each dataset. The largest values of PDR and DDS are highlighted in bold.

Table 5 :
Experimental results on MASSIVE and Huffpost (2017), which demonstrate a high domain gap. The best method with the highest H-score is in bold, and the second-best method is underlined.

Table 9 :
Experimental results of CDA methods with Mahalanobis distance as the scoring function. For each dataset, the best method with the highest H-score is in bold and the second-best method is underlined.