ConDA: Contrastive Domain Adaptation for AI-generated Text Detection

Large language models (LLMs) are increasingly being used to generate text in a variety of settings, including journalistic news articles. Given the potential for these LLMs to be misused to generate disinformation at scale, it is important to build effective detectors for such AI-generated text. With the surge in development of new LLMs, acquiring labeled training data for supervised detectors is a bottleneck. However, there may be plenty of unlabeled text data available, without information on which generator it came from. In this work we tackle this data problem in the context of detecting AI-generated news text, and frame the problem as an unsupervised domain adaptation task. Here the domains are the different text generators, i.e., the LLMs, and we assume we have access only to labeled source data and unlabeled target data. We develop a Contrastive Domain Adaptation framework, called ConDA, that blends standard domain adaptation techniques with the representation power of contrastive learning to learn domain-invariant representations that are effective for the final unsupervised detection task. Our experiments demonstrate the effectiveness of our framework, with average performance gains of 31.7% over the best performing baselines, and within a 0.8% margin of a fully supervised detector. All our code and data are available at https://github.com/AmritaBh/ConDA-gen-text-detection.


Introduction
In recent years there have been significant improvements in the area of large language models that are capable of generating human-like text. Several variants of such language models are designed for specific tasks such as summarization, translation, paraphrasing, etc. Recent advancements in conversational language models such as ChatGPT and GPT-4 (OpenAI, 2023) have demonstrated how these language models can generate incredibly human-like text, along with serving as an AI assistant for several use cases such as creative writing, explanation of ideas and concepts, code generation and correction, solving mathematical proofs, etc. (Bubeck et al., 2023). However, along with improved progress in machine generation of text, there is also a growing concern about how these technologies may be misused and abused by malicious actors. Given how convincing some of these machine-generated texts are, malicious actors may use these models to propagate misinformation/disinformation (Zellers et al., 2019), propaganda (Varol et al., 2017), or even spam/scams. With the accessibility and ease of use of newer language models that have public-facing APIs, the risk of these technologies being used for generating disinformation or misleading information at scale has increased significantly (De Angelis et al., 2023) and hence has prompted researchers to worry about detection and mitigation strategies (Zhou et al., 2023). For example, recently, there have been concerns about misleading news websites hosting fully AI-generated news articles. Such unprecedented improvement in language generation capabilities naturally necessitates the development of detectors that can accurately and reliably classify such generated text. Motivated by this, we focus on the sub-problem of AI-generated news detection.
A major issue surrounding building a supervised classifier for AI-generated text is the sheer variety of large language models that are available for use. Prior work (Jawahar et al., 2020) has demonstrated that detectors built to identify text generated by a particular generator struggle with text from other generators. Furthermore, for newer generators, it might even be impossible to collect and curate labeled training datasets, since access to such models might be limited or even forbidden. Given this data problem, in this paper we consider the situation where we have access to text from a generator but do not know which generator it came from. However, we do have labeled data from some generators. In this context, we propose a framework for AI-generated text detection that can perform well on target data in the absence of labels. We frame this as an unsupervised domain adaptation problem, assuming we have labeled data from a source generator and unlabeled data from (perhaps newer) target generators. Our framework also uses a contrastive loss component that acts as a regularizer and helps the model learn invariant features and avoid overfitting to the particular generator it was trained on, hence improving performance on the unknown generator (Figure 1). For news text, our model achieves performance within a 0.8% margin of a fully supervised detector. Our main contributions in this paper are:
1. We propose a novel AI-generated text detection framework, ConDA, that uses unsupervised domain adaptation and self-supervised contrastive learning to effectively leverage labeled source domain and unlabeled target domain data.
2. Through extensive evaluations on benchmark human/AI-generated news datasets, spanning a variety of LLMs, we show that ConDA effectively solves the problem of label scarcity, and achieves state-of-the-art performance for unsupervised detection.
3. Furthermore, we create our own ChatGPT-generated data and, via a case study, show the efficacy of our model on text generated using new conversational language models.

Related Work
Generated Text Detection The burgeoning progress in the generation capabilities of large language models has led to a corresponding increase in research and development efforts in the field of detection. Several recent efforts look at methods, varying from simple feature-based classifiers to fine-tuned language model-based detectors, to classify whether a piece of input text is human-written or AI-generated (Ippolito et al., 2019; Gehrmann et al., 2019; Mitchell et al., 2023), along with methods that specifically focus on AI-generated news (Zellers et al., 2019; Bogaert et al., 2022). A related direction of work is that of authorship attribution (AA). While older AA methods focused on human authors, more recent efforts (Uchendu et al., 2020; Munir et al., 2021) build models to identify the generator for a particular input text. Recent work also shows how AI-generated text can deceive state-of-the-art AA models (Jones et al., 2022), thus making the task of detecting such text even more important.

Contrastive Learning for Text Classification
Following the success of contrastive representation learning in the computer vision domain, several recent works in natural language processing have used contrastive learning for text classification, often for benefits such as robustness (Zhang et al., 2022; Ghosh and Lan, 2021; Pan et al., 2022), generalizability (Tan et al., 2020; Kim et al., 2022), and in few-shot scenarios (Jian et al., 2022; Zhang et al., 2021; Chen et al., 2022a). Authors in (Qian et al., 2022; Chen et al., 2022b) also use ideas from contrastive learning to leverage label information to learn better representations for the classification task.
Domain Adaptation for Text Classification Domain adaptation (DA) is a paradigm that aims to tackle the distribution shift between training and testing distributions by learning a discriminative classifier that is invariant to domain-specific features (Sener et al., 2016). Along with labeled source data, DA methods may use either unlabeled target data (unsupervised DA) or a few labeled target samples (semi-supervised DA). In our work, we consider the unsupervised DA setting (Ganin et al., 2016). In the domain of language, unsupervised domain adaptation has been used in a variety of tasks (Ramponi and Plank, 2020), such as sentiment classification (Glorot et al., 2011; Trung et al., 2022), question answering (Yue et al., 2021), event detection (Trung et al., 2022), and sequence tagging or labeling (Han and Eisenstein, 2019).
In this work, we frame the problem of detecting AI-generated news text from multiple generators as an unsupervised domain adaptation task, where the different generators are the different data domains. Our proposed framework combines the representational power of self-supervised contrastive learning with a principled method for unsupervised domain adaptation to solve the AI-generated text detection problem. To the best of our knowledge, we are the first to propose this kind of formulation for AI-generated text detection, along with a novel framework for the task. In the following section, we describe our framework in detail, along with our training objective.

Model
In this work, we consider a setting where we have labeled data from the source generator and only unlabeled samples from the target generator. More formally, the source domain dataset is denoted by S = {(x^S_i, y^S_i)}_{i=1}^{N_S}, where y^S_i ∈ {0, 1} corresponds to the 'human-written' or 'AI-generated' label, and N_S is the number of source domain samples. The target domain dataset is denoted by T = {x^T_i}_{i=1}^{N_T}, where N_T is the number of target domain samples. Note that all domains share the same label space.

ConDA Framework
We show our framework in Figure 2. For the detector, we use a pre-trained RoBERTa model (roberta-base) from Huggingface, with a classifier head on top of it. As input, we have two articles: x^S_i from the source and x^T_i from the target. We perform a text transformation τ on these, obtaining the transformed samples x^S_j and x^T_j. In order to input both the original and the transformed (also referred to as 'perturbed' throughout this paper) text, we use a Siamese network (Bromley et al., 1993; Neculoiu et al., 2016; Reimers and Gurevych, 2019) where the RoBERTa model weights are shared across the two branches. For the two input texts, we take the hidden layer representation of the [CLS] token as the text embedding. Following the methodology in (Chen et al., 2020), we pass these embeddings through a projection layer that consists of a multilayer perceptron (MLP) with one hidden layer and compute a contrastive loss in the lower-dimensional projection space. The MLP can be represented as a function g(·) : R^{d_h} → R^{d_p}, where d_h is the size of the hidden layer embedding (768 for roberta-base), and we set d_p to 300, following (Pan et al., 2022). For the source domain, we also compute the cross-entropy losses for binary classification of both the original and transformed text. Furthermore, we have a domain discrepancy component between the projected representations of the source and target text. We elaborate on the losses and related design choices in the following section.
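As a rough sketch of this forward pass (not the authors' implementation), the shared-weight Siamese branches and the projection head g(·) can be illustrated as follows; the RoBERTa encoder is stubbed with a random placeholder, and the names `encode` and `project` are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

d_h, d_p = 768, 300  # hidden size (roberta-base) and projection size

# Shared projection head g(.): one hidden layer, weights shared by both branches.
W1 = rng.normal(scale=0.02, size=(d_h, d_h))
W2 = rng.normal(scale=0.02, size=(d_h, d_p))

def encode(texts):
    """Placeholder for the shared RoBERTa encoder: returns one [CLS]-style
    embedding per text. In the real framework this is roberta-base."""
    return rng.normal(size=(len(texts), d_h))

def project(h):
    """Projection head g: R^{d_h} -> R^{d_p} with one hidden ReLU layer."""
    return np.maximum(h @ W1, 0.0) @ W2

# Siamese pass: the SAME encoder/projection weights process both views.
x_src = ["source article", "another source article"]
x_src_pert = ["source article (synonym-replaced)",
              "another source article (synonym-replaced)"]

z_i = project(encode(x_src))        # anchor projection embeddings
z_j = project(encode(x_src_pert))   # perturbed-view projection embeddings
```

The contrastive loss and the MMD term described next both operate on these d_p-dimensional projection embeddings.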

Training Objective
Source Classification Loss: We leverage the availability of the source labels and compute the binary cross-entropy (CE) losses for the original and the perturbed text:

L^S_CE = -(1/|b|) Σ_{i∈b} [ y^S_i log ŷ^S_i + (1 − y^S_i) log(1 − ŷ^S_i) ]

where L^S_CE denotes the CE loss for the original text, ŷ^S_i is the predicted probability for sample x^S_i, and b denotes the mini-batch. Similarly, we compute L^{S′}_CE for the perturbed text, and we skip the equation for brevity. Inspired by the training objective in (Pan et al., 2022), we use CE losses for both the original and perturbed samples in the final training objective. The transformation performed on the original text (i.e., synonym replacement in our experiments) preserves the semantics of the text and hence is label-preserving. In such a case we would want a classifier to be able to detect text with such minor, semantics-preserving perturbations as well. Not only does this improve the robustness of the classifier, but in turn also the generalizability of the detector (Xu and Mannor, 2012), which is essential for our use case.
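A minimal sketch of this binary CE term, assuming a sigmoid output head (the exact classifier head is an assumption of ours):

```python
import numpy as np

def binary_ce(logits, labels):
    """Mean binary cross-entropy over a mini-batch (L^S_CE sketch).
    logits: raw classifier scores; labels: 0 = human-written, 1 = AI-generated."""
    p = 1.0 / (1.0 + np.exp(-logits))  # sigmoid turns scores into probabilities
    eps = 1e-12                        # numerical guard against log(0)
    return float(-np.mean(labels * np.log(p + eps)
                          + (1 - labels) * np.log(1 - p + eps)))

logits = np.array([2.0, -1.5, 0.3])
labels = np.array([1, 0, 1])
loss = binary_ce(logits, labels)
```

The same function applied to the perturbed batch gives L^{S′}_CE.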
Contrastive Loss: To learn a better representation of the input text, we use contrastive losses for both the source and target texts (Figure 2). We use a loss similar to the one in (Chen et al., 2020): the only difference is that, instead of computing the loss between two transformed views of the text, we use the transformed text and the original anchor text. For our transformation, we use synonym replacement (more implementation details are in the Appendix). The contrastive loss for the source is:

L^S_ctr = − Σ_{i∈b} log [ exp(sim(z^S_i, z^S_j)/t) / Σ_{k≠i} exp(sim(z^S_i, z^S_k)/t) ]

where z^S_i and z^S_j denote the projection layer embeddings for the original (anchor) and the transformed text, t is the temperature, b is the current mini-batch, and sim(·, ·) is a similarity metric, which is cosine similarity in our case. Similar to (Chen et al., 2020), we do not sample or mine negatives explicitly; we simply treat the remaining 2(|b| − 1) samples in the mini-batch b as negatives. We have a similar contrastive loss for the target domain, denoted by L^T_ctr, and we skip the equation for brevity. The objective of these contrastive losses is to bring the positive pairs, i.e., the anchor and the transformed sample, closer in the representation space, and well separated from the negative samples.
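This NT-Xent-style loss with the anchor/transformed pairing can be sketched as follows (a sketch under our assumptions, not the authors' code):

```python
import numpy as np

def nt_xent(z_a, z_b, t=0.5):
    """Contrastive loss where the positive pair is (anchor, its
    synonym-replaced view) and the remaining 2(|b|-1) in-batch samples
    act as negatives. z_a, z_b: (b, d) projection embeddings."""
    z = np.concatenate([z_a, z_b], axis=0)               # stacked batch, (2b, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)     # unit norm -> cosine sim
    sim = z @ z.T / t                                    # pairwise similarities
    b = len(z_a)
    np.fill_diagonal(sim, -np.inf)                       # exclude self-similarity
    # index of each sample's positive: anchor i pairs with view i+b and vice versa
    pos = np.concatenate([np.arange(b, 2 * b), np.arange(b)])
    log_prob = sim[np.arange(2 * b), pos] - np.log(np.sum(np.exp(sim), axis=1))
    return float(-np.mean(log_prob))

rng = np.random.default_rng(0)
z_anchor = rng.normal(size=(4, 300))
# A mildly perturbed view stays close to its anchor, so the loss is small.
loss = nt_xent(z_anchor, z_anchor + 0.01 * rng.normal(size=(4, 300)))
```

Aligned anchor/view pairs produce a lower loss than unrelated views, which is exactly the signal that pulls positives together in the projection space.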
Since the performance of contrastive learning depends significantly on the transformation used to generate the positive sample (Tian et al., 2020), we take a principled approach to choosing a transformation out of several possible ones (Bhattacharjee et al., 2022). To choose one transformation for the main experiments, we evaluate a simple detection model (only one domain) over different choices of transformations and choose the one that gives the best performance, and is therefore the most discriminative. In the input space, the choices are random swap, random crop, and synonym replacement. In the latent space, the choices are paraphrasing and summarization. Based on detection performance, we finally choose synonym replacement as the transformation used throughout the remainder of the paper.
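A toy sketch of the synonym-replacement transformation τ; the synonym table here is illustrative only (a real implementation would draw synonyms from a lexical resource such as WordNet):

```python
import random

# Toy synonym table; purely illustrative, not the resource used in the paper.
SYNONYMS = {
    "big": ["large", "huge"],
    "said": ["stated", "remarked"],
    "new": ["recent", "novel"],
}

def synonym_replace(text, p=0.3, seed=0):
    """Label-preserving transformation tau: each word with a known synonym
    is replaced with probability p. Semantics (and hence the human/AI
    label) are preserved."""
    rng = random.Random(seed)
    out = []
    for w in text.split():
        if w.lower() in SYNONYMS and rng.random() < p:
            out.append(rng.choice(SYNONYMS[w.lower()]))
        else:
            out.append(w)
    return " ".join(out)

perturbed = synonym_replace("The president said a big new policy is coming", p=1.0)
```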
Maximum Mean Discrepancy (MMD): Maximum Mean Discrepancy (MMD) (Gretton et al., 2012) is a metric that measures the distance between two distributions, which in our case are two different generators. Formally, let S = {x^S_1, x^S_2, ..., x^S_{N_S}} and T = {x^T_1, x^T_2, ..., x^T_{N_T}} be two sets of samples drawn from distributions S and T, respectively. The MMD distance between the distributions S and T is defined as the distance between the means of the two samples mapped into the Reproducing Kernel Hilbert Space (RKHS) (Steinwart, 2001). Following past work (Pan et al., 2010; Long et al., 2015), we compute the MMD between text embeddings in a lower-dimensional space, i.e., between z^S_i and z^T_i. Formally,

MMD(S, T) = || (1/N_S) Σ_{i=1}^{N_S} φ(z^S_i) − (1/N_T) Σ_{i=1}^{N_T} φ(z^T_i) ||_H

where φ : S → H and H represents the RKHS.
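A small sketch of a kernelized squared-MMD estimate between batches of projection embeddings; the RBF kernel and its bandwidth are our assumptions, since the paper only specifies that MMD is computed in the projection space:

```python
import numpy as np

def mmd2(zs, zt, gamma=1.0 / 600):
    """Biased estimate of squared MMD between source and target embeddings,
    using an RBF kernel as the feature map phi into the RKHS H."""
    def k(a, b):
        # Pairwise squared Euclidean distances, then the RBF kernel.
        d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
        return np.exp(-gamma * d2)
    return float(k(zs, zs).mean() + k(zt, zt).mean() - 2 * k(zs, zt).mean())

rng = np.random.default_rng(0)
z_src = rng.normal(size=(64, 300))
z_tgt_same = rng.normal(size=(64, 300))             # same distribution as source
z_tgt_shift = rng.normal(loc=0.5, size=(64, 300))   # shifted distribution
```

Minimizing this quantity during training pushes the source and target embedding distributions together, which is the domain-invariance objective.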
The final training objective for our main framework is:

L = L^S_CE + L^{S′}_CE + λ_1 (L^S_ctr + L^T_ctr) + λ_2 MMD(S, T)

where λ_1 and λ_2 are hyper-parameters.
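Assuming λ1 weights the two contrastive terms and λ2 the MMD term (our reading of the objective), the full loss is a simple weighted sum of the components above; the default λ values here are placeholders:

```python
def conda_loss(l_ce, l_ce_pert, l_ctr_src, l_ctr_tgt, l_mmd,
               lam1=1.0, lam2=0.1):
    """Sketch of the final ConDA objective. lam1/lam2 defaults are
    placeholder values, not the tuned hyper-parameters from the paper."""
    return l_ce + l_ce_pert + lam1 * (l_ctr_src + l_ctr_tgt) + lam2 * l_mmd
```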

Experimental Settings
In this section we describe the datasets, baselines, and training details for our experiments.

Dataset
Since our task requires news text from multiple generators, we use the publicly available TuringBench dataset (Uchendu et al., 2021), which contains human-written and machine-generated news articles from 19 generators, spanning 10 different language model architectures (including different sizes for some of the generators). For a full list of labels, see Appendix B.1. Out of the 10 different architectures available in the dataset, we evaluate on a representative subset (Table 1). GPT-2_xl (Radford et al., 2019) is the 1.5B-parameter version of GPT-2, a transformer-based language model built upon the architecture of the original GPT model (Radford et al., 2018), with further modifications. GPT-3 (Brown et al., 2020) is the successor of the GPT-2 model, and is the largest model we use in our evaluation, with a size of 175B parameters. GROVER_mega (Zellers et al., 2019) is the largest version of the GROVER model, a transformer-based model similar in architecture to GPT-2, but trained to conditionally generate news articles. XLM (Lample and Conneau, 2019) is also a transformer-based language model, designed for cross-lingual tasks.
Furthermore, given the challenge of detecting text from the more recent conversational language models, we augment the TuringBench dataset with ChatGPT news articles. Following a similar data generation procedure as in (Uchendu et al., 2021), we use a subset of around 9,000 news articles from The Washington Post and CNN (more details in Appendix B.2), and use the headlines to generate articles using ChatGPT. For this paper, we used the OpenAI API with the gpt-3.5-turbo model (version as of March 14, 2023). After experimenting with a few different prompt types, we finally used the following prompt for each news headline: "Generate a news article with the headline '<headline>'." Finally, we have a balanced dataset of approximately 9k human-written articles and 9k articles generated using ChatGPT (after accounting for null values and API request errors). For simplicity, we name this dataset ChatGPT News, and we use it for a case study on ChatGPT-generated news articles in Section 6.

Baselines
For a fair comparison, we compare our method with baselines that do not require labeled data.We use two open-source AI-generated text detectors, namely GLTR (Gehrmann et al., 2019) and the more recent DetectGPT (Mitchell et al., 2023), as our unsupervised baseline models.
GLTR utilizes a proxy language model to calculate the token-wise log probability of the input text. It employs four statistical tests: (i) log probabilities (log p(x)), (ii) average token rank (Rank), (iii) token log-rank (LogRank), and (iv) predictive entropy (Entropy). The first test assumes that a higher average log probability in the input text indicates AI generation. The second and third tests follow a similar assumption, where input texts with lower average rank are more likely to be generated by AI. The last test is based on the hypothesis that AI-generated texts tend to exhibit less diversity and fewer surprises, resulting in low entropy.
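The four GLTR statistics can be sketched as follows, given next-token distributions from a proxy LM; `gltr_stats` and the toy distributions are ours, not GLTR's actual code:

```python
import numpy as np

def gltr_stats(dists, observed):
    """GLTR-style test statistics from a proxy language model.
    dists: (n, V) next-token probability distributions;
    observed: (n,) indices of the tokens that actually occurred."""
    n = len(observed)
    p = dists[np.arange(n), observed]          # probability of each observed token
    order = np.argsort(-dists, axis=1)         # tokens sorted by model preference
    rank = np.array([np.where(order[i] == observed[i])[0][0] + 1
                     for i in range(n)])       # 1 = model's top choice
    entropy = -np.sum(dists * np.log(dists + 1e-12), axis=1)
    return {
        "log_p": float(np.mean(np.log(p + 1e-12))),  # (i) avg log probability
        "rank": float(np.mean(rank)),                # (ii) avg token rank
        "log_rank": float(np.mean(np.log(rank))),    # (iii) avg log-rank
        "entropy": float(np.mean(entropy)),          # (iv) predictive entropy
    }

# Toy check: "machine-like" text picks the model's top token every time,
# while "human-like" text sometimes picks low-ranked tokens.
dists = np.tile(np.array([0.7, 0.15, 0.1, 0.04, 0.01]), (4, 1))
machine = gltr_stats(dists, np.zeros(4, dtype=int))  # always top-ranked token
human = gltr_stats(dists, np.full(4, 3))             # low-ranked token
```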
DetectGPT also utilizes a proxy language model to calculate the token-wise log probability. However, its decision function is based on comparing the log probability of the original input text with the log probabilities of a set of n perturbed versions of the input text. These perturbations are generated using the mask-filling language model T5 (T5-base) (Raffel et al., 2020). The decision function assumes that if the log probability difference between the input text and the perturbed text is positive with high probability, then the input text is likely to be AI-generated.
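DetectGPT's decision function can be sketched as the (unnormalized) mean log-probability drop under perturbation; the numbers below are illustrative, not real model outputs:

```python
import numpy as np

def detectgpt_score(logp_original, logp_perturbed):
    """DetectGPT-style decision score: how much proxy-LM log probability the
    text loses after mask-and-fill perturbations. A large positive score
    suggests the text sits at a local probability maximum, i.e. is likely
    AI-generated. logp_original: scalar; logp_perturbed: n perturbed copies."""
    return float(logp_original - np.mean(logp_perturbed))

# Illustrative values: AI text loses probability under perturbation, while
# human text sits lower to begin with, so perturbations change little.
ai_score = detectgpt_score(-120.0, [-131.0, -128.5, -130.2])
human_score = detectgpt_score(-180.0, [-180.4, -179.1, -181.0])
is_ai = ai_score > human_score  # thresholding the score yields the detector
```

The published method additionally normalizes this drop by the standard deviation of the perturbed log probabilities; the sketch keeps only the core comparison.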
In addition to these zero-shot baselines, we include the off-the-shelf OpenAI-GPT2 detector as one of the baselines in our study. The OpenAI-GPT2 detector is a RoBERTa model fine-tuned specifically for detecting GPT-2-generated text. It was trained on a GPT-2-output dataset comprising human-written WebText samples and GPT-2 generations.

Results
To understand and investigate the effectiveness of our model, we try to answer the following research questions:
- RQ1: Does ConDA perform well on unknown target domains in comparison to a source-only model (Table 2) and a supervised model fine-tuned on the target (Table 3)?
- RQ2: How well does ConDA perform in comparison to unsupervised baselines (Table 4)?
- RQ3: Is each of the loss components beneficial in training (Table 5)?
All results are reported as an average over 3 training runs with 3 different random seeds.

Performance of ConDA on unlabeled target data
To evaluate the performance of ConDA on each of the target domains, i.e., generators, we first look at how our model improves over a source-only model. Table 2 shows the results for this experiment, grouped by target domain. We report F1 scores for the ConDA framework and a source-only model, along with scores averaged over sources, for each target. The source-only model is a pre-trained RoBERTa (roberta-base) fine-tuned only on the source domain S. The source-only scores provide an estimate of how well a model trained just on the source transfers to the target domain.
Although a few of the source-only models have satisfactory performance on the target, using our ConDA framework we achieve performance gains over the source-only model in almost all tasks (rows with positive ∆F1 values). Particularly interesting are the cases where we use a smaller generator as the source and a larger one as the target, and still get high performance gains: for example, 58 F1 points for FAIR_wmt19 (656M) → GROVER_mega (1.5B), and a 41 F1 point gain for another pair with FAIR_wmt19 (656M) as the source. This may suggest that, with our ConDA framework, even having unlabeled data from newer and possibly larger generators can improve performance if we use a suitable generator as the source.
Next, we compare the performance of our model with a fully-supervised detector trained on the target domain in Table 3. For ConDA, we show the test performance for all target-source pairs. For the supervised model, we use a pre-trained RoBERTa (roberta-base) fine-tuned on the target data. We then evaluate the model on the test set of the same target domain; this essentially gives our upper-bound performance. ConDA achieves test performance comparable to fully-supervised models. In particular, for targets CTRL and XLM, ConDA (with GROVER_mega as source) achieves upper-bound performance. For targets GROVER_mega and GPT-2_xl, ConDA performs within 3 and 6 F1 points of the fully-supervised model, respectively.
Interestingly, for target generator GPT-3, all the ConDA models perform better than the fully-supervised model, with the best F1 (from ConDA with source FAIR_wmt19) being 27 points higher than the supervised performance. Furthermore, when GPT-3 is used as the source domain, we get mediocre performance for all target domains. We suspect that this might be due to the following reason: the GPT-3 data in TuringBench might be noisy and therefore lack good quality, discriminative signals that can guide the detector. The performance improvement that occurs when ConDA is evaluated on GPT-3 as target, with any other domain as source, is possibly due to the effective transfer of discriminative signals from the labeled source data, which improves performance on GPT-3 data even in the absence of labels.

Performance compared to unsupervised baselines
We compare our ConDA framework with relevant unsupervised baselines and report results in Table 4. Out of the four GLTR measures (log p(x), Rank, Log Rank, and Entropy), the first three fare quite well for detecting CTRL-generated text, but performance on other generators is quite poor. DetectGPT, which is the most recent method we evaluate, performs poorly on almost all generators, with some satisfactory performance on CTRL and XLM. Surprisingly, the OpenAI GPT-2 Detector performs poorly on the GPT-2_xl data from TuringBench, although it can be considered supervised for this particular target. Finally, we see that ConDA outperforms all the baselines in terms of maximum AUROC, and all but one in terms of average AUROC. Interestingly, we see that ConDA models trained with GROVER_mega as the source perform very well for several target domains. This might be because GROVER (Zellers et al., 2019) was designed and trained to generate news articles. Since our task here is to specifically detect human- vs. AI-written news articles, training models on data generated using GROVER_mega is useful, and this data possibly contains good discriminative signals.

Ablation: Effectiveness of loss components
We evaluate variants of the ConDA model by removing one component at a time, and compare these in Table 5. ConDA \CEs removes the two cross-entropy losses, i.e., no supervision even for the source. ConDA \contrast removes the contrastive loss components for both source and target. ConDA \MMD removes the MMD loss between source and target; hence the only component that makes use of the unlabeled target domain data is the target contrastive loss. Finally, ConDA is the full model. We see that the full model outperforms all the variants, implying that all three types of components are essential for detection performance in this problem setting. Combined with source supervision, the contrastive losses and the MMD objective effectively tie together the power of self-supervised learning and unsupervised domain adaptation, resulting in superior performance across target domains.

A Case Study on ChatGPT
Given recent concerns surrounding OpenAI's ChatGPT and GPT-4 (OpenAI, 2023), it is important to create detectors for text generated by these conversational language models. With the incredible fluency and writing quality these language models possess, such text can not only easily fool humans (Else, 2023) but can also be extremely difficult for detectors to identify. Even OpenAI's detector struggles to detect AI-generated text reliably. Hence, in this case study, we are interested in evaluating our ConDA framework on ChatGPT-generated articles, in an unsupervised manner. Since there is no existing dataset of ChatGPT-generated vs. human-written news, we create our own dataset as explained in Section 4.1. We assign ChatGPT as the unlabeled target domain and assume that we have labeled data from the 6 other generators (Table 1). We thereby emulate a real-world scenario where labeled data from older generators may be available, but it might be hard to find labeled samples for newer LLMs. We sample 4k articles from our ChatGPT News dataset and evaluate the same 3 unsupervised models as in Section 4.2 (upper row block in Table 6).

Conclusion & Future Work
In this work, we address the problem of AI-generated text detection in the absence of labeled target data. We propose a contrastive domain adaptation framework that leverages the power of both unsupervised domain adaptation and self-supervised representation learning to tackle the task of AI-generated text detection. Our experiments focus on news text, and show the effectiveness of the framework, as well as superior performance when compared to unsupervised baselines. We also perform a case study to evaluate our framework on our dataset of ChatGPT-generated news articles and achieve satisfactory performance.
ChatGPT Case Study: As we elaborated in Section 4.1, we create our own ChatGPT-generated news article dataset, following a procedure similar to (Uchendu et al., 2021). However, the data we generated is conditioned on the sample of human-written news articles we randomly selected; the performance of our model on this ChatGPT data is hence dependent on this sample. The high performance scores for ChatGPT-generated articles could also stem from the inherent structure of news articles, since our data is specifically constrained to the style of journalistic news articles. Therefore, good performance on our news article dataset for ChatGPT does not necessarily imply similar performance on text from other areas. For this, more thorough evaluation is needed, which would be an interesting direction for future work.

Ethical Considerations
We go over some of the ethical considerations surrounding this work and similar directions.

Potential to Penalize Benign Use of LLMs
Recent blogs and articles have demonstrated how the newer language models, including ChatGPT, GPT-4 (OpenAI, 2023), Bing Chat, etc., can be used to improve productivity, spur creative thinking, help with writing essays or cover letters, or even explain concepts and help with homework. As these language models become more pervasive, standard use of them as writing or brainstorming assistants may become commonplace. In such a case, we may encounter an increasing amount of text generated by these LLMs online. Such text, if used for benign purposes such as the ones mentioned above, should not be flagged or penalized by a detector such as ours. This brings another dimension to this already challenging problem: the issue of intent. Flagging AI-generated content without characterizing the intent behind it could wrongfully penalize users of LLMs. Therefore, the nuances surrounding this detection task need to be considered while using this kind of detector.

Danger of Misuse in High Stakes Areas
We discuss the issue of model misuse, taking education as an example. Given the accessibility of ChatGPT and other recent AI text generators, educators have expressed concerns (Tlili et al., 2023) over students cheating or plagiarizing via these new technologies. There are already commercial detectors for AI-generated content, such as GPTZero and one from Copyleaks, that educators may use. However, as with our model, there is always a margin of error with such detectors. Performing plagiarism checks and subsequently implementing punitive action based solely on such detectors may be detrimental in the case of false positives. Legitimate work by a student may be misclassified by these detectors, potentially impacting their career. Eventually, this also diminishes trust in these detectors. Hence, before the widespread use of such AI-generated text detectors, thorough studies on error analysis and reliability need to be performed, along with policy changes to accommodate the rapidly evolving landscape of AI technologies.

B.1 TuringBench Labels

The TuringBench dataset has 20 labels: 'human' and 19 different generators, which are: { Human, GPT-1, GPT-2_small, GPT-2_medium, GPT-2_large, GPT-2_xl, GPT-2_PyTorch, GPT-3, GROVER_base, GROVER_large, GROVER_mega, CTRL, XLM, XLNET_base, XLNET_large, FAIR_wmt19, FAIR_wmt20, TRANSFORMER_XL, PPLM_distil, PPLM_gpt2 }.

B.2 Human-written Articles
Human-written news articles in TuringBench are from The Washington Post, CNN, and a Kaggle dataset with CNN news articles from 2014-2020 and The Washington Post news articles from 2019-2020. More details on the TuringBench data are in (Uchendu et al., 2021). For the human-written articles in our ChatGPT News dataset, we use a random sample from the dataset of CNN and Washington Post articles as used in TuringBench.

C ChatGPT Visualizations
Here, we visually explore embeddings from the ConDA model for instances in Table 6, in order to understand the issues surrounding the detection of ChatGPT-generated news articles. Figure 3 shows the embeddings from all 6 ConDA models. In all the plots, we see that the human-written and ChatGPT-generated news articles in our ChatGPT News dataset are very closely clustered together and are not separable. Therefore, even though our model achieves substantially high AUROC scores, there are possibly many false positives and/or false negatives, providing an intuition that better feature selection methods might be necessary here.

Figure 1 :
Figure 1: Text embeddings from (left) a source-only model and (right) the ConDA model on target domain CTRL with GROVER_mega as source. Each domain has both 'human' and 'AI' text. ConDA effectively removes domain-specific features while retaining task-specific features, increasing the separability between 'human' and 'AI' text, and decreasing the separability between source and target domains.

Figure 2 :
Figure 2: Our ConDA framework. LLM refers to the RoBERTa model; LLM and MLP weights are shared across all four instances.

Table 1 :
Generators we used for our evaluation. We sample a representative set of 6 different generators in order to evaluate our model. For most of the architectures, if there were multiple parameter sizes available, we choose the largest one, to make the detection task more challenging for our model. We briefly go over the architectural details of each of the generators used: CTRL (Keskar et al., 2019) is a transformer-based language model developed for controllable generation of text based on control codes for style, content, and task-specific generation. The model is pre-trained on a variety of text types, including web text, news, question-answering datasets, etc. FAIR_wmt19 (Ng et al., 2019) is FAIR's model developed for the WMT19 news translation task. Texts in TuringBench are from the English version of the FAIR_wmt19 language model.

Table 3 :
Performance of our ConDA model on each of the target domains, with each of the other domains as source.Numbers in bold are the best performing ConDA models for each target domain, i.e. closest to fully supervised performance.

Table 4 :
Performance of ConDA in comparison to unsupervised baselines, as AUROC.For ConDA, we report the average AUROC over all sources (for each target) and also the maximum AUROC (across all sources), along with the corresponding source in parentheses.Bold shows superior performance across each target.

Table 5 :
Comparison of different model variants; bold shows best performance.We randomly chose 3 target domains to show in this table due to space constraints.

Table 6 :
Results on our ChatGPT News dataset using unsupervised baselines (upper row) and ConDA (lower row).Scores are AUROC.Bold shows best and underline shows second best performance.

Table 7 :
Hyper-parameter values we used for all our experiments.