Semantic-conditioned Dual Adaptation for Cross-domain Query-based Visual Segmentation



Introduction
Vision-language understanding (Yin et al., 2022, 2021; Jin and Zhao, 2021b; Jin et al., 2022; Cheng et al., 2023) is a fundamental problem in deep learning. Within this field, query-based visual segmentation (Wang et al., 2020; Botach et al., 2022), which aims to localize at the pixel level the visual region that semantically corresponds to a language query, has recently received considerable attention. Though existing works have made substantial progress, manually collecting pixel-wise annotations is expensive and tedious in practice, which creates a strong need to apply trained models to unseen circumstances. However, varying construction conditions across domains inevitably degrade adaptation performance, a phenomenon formally known as the domain shift problem.
In this paper, to break through this constraint, we propose a novel task, Cross-domain Query-based Visual Segmentation (CQVS). As shown in Figure 1, given a source domain with pixel-wise annotations and an unlabeled target domain, CQVS aims to adapt the segmentation model and recognize the query-aligned pixels on the target domain.
To achieve effective adaptation for this cross-modal grounding task, we have to deal with three domain discrepancies, as shown in Figure 1 (c): (1) Multi-modal content shift. The free-form query and the aligned visual region describe open and diverse contents, leading to arbitrary semantic shift between domains, e.g. one domain mainly describes humans while the other focuses more on animals. (2) Uni-modal feature gap. Even when describing the same content, each modality may have a large feature gap between domains caused by varying conditions, e.g. visual lighting and linguistic syntax. (3) Cross-modal relation bias. The relation between modalities is easily biased by domain-specific factors, especially when only source annotations are available. As illustrated, aligning each modality separately cannot guarantee that this relation is aligned across domains, so it requires a dedicated solution.
To mitigate domain discrepancies, domain adaptation (DA) methods align the distributions across domains (Baktashmotlagh et al., 2014) or learn domain-invariant representations (Ganin et al., 2016). While promising, they are constrained in CQVS by two critical limitations. (1) Traditional DA methods study uni-modal tasks, e.g. image segmentation (Vu et al., 2019) and text classification (Glorot et al., 2011), and are insufficient because they do not consider multi-modal content and cross-modal relation. (2) Though recent works (Jing et al., 2020; Liu et al., 2021b) investigate multi-modal tasks, they (i) are limited to image-level retrieval, which is imprecise compared with pixel-level grounding, and (ii) only consider partial discrepancies (feature and relation), ignoring the internal correlation between domain contents. Motivated by the fact that humans can leverage abstract concepts to guide thinking about concrete details (Thomas and Thorne, 2009), we aim to model the high-level semantic structure in multi-modal contents to harmonize the low-level pixel adaptation for feature and relation discrepancies.
Grounded on the above discussion, we propose Semantic-conditioned Dual Adaptation (SDA), a novel framework that achieves precise feature- and relation-level adaptation via a universal semantic structure. SDA consists of two key components, Content-aware Semantic Modeling (CSM) and Dual Adaptive Branches (DAB). (1) CSM builds a sharable semantic space across domains based on the multi-modal content. First, to discover the consistent contents from the visual and textual modalities, we extract informative words with visual-guided attention. Then we establish the semantic structure upon the contents, where we apply unsupervised clustering on the source domain and measure the cross-domain semantic similarity to identify common parts on the target domain. Through this module, we implicitly encode the language gap and introduce semantic category information as common guidance to regularize adaptation. (2) Next, DAB integrates dual branches to separately address feature and relation discrepancies. On the one hand, a contrastive feature branch leverages the semantic information to learn category-wise alignment for foreground pixels. To preserve pixel-level discrimination during alignment, we adopt contrastive learning (He et al., 2020; Jin and Zhao, 2021a) to highlight relevant foreground pixels between domains and suppress diverse background pixels. On the other hand, a reciprocal relation branch mitigates the cross-modal relation bias via two reciprocal masks. Concretely, the domain-based masks induced by source knowledge provide precise but biased results, and the semantic-based masks induced by semantic information provide comprehensive but inaccurate results. With these complementary signals, we indirectly enhance the relation via segmentation training on both domains and also directly maximize the mutual information between vision and language.
In summary, our main contributions are as follows: (1) We propose a new task, CQVS, to explore domain adaptation for query-based visual segmentation. (2) We introduce a novel framework, SDA, which develops a content-aware semantic modeling module to model the multi-modal contents and designs a dual adaptive branches module to mitigate the feature and relation discrepancies. (3) Extensive experiments on both query-based image and video segmentation datasets demonstrate the effectiveness and superiority of our approach.

Query-based Visual Segmentation
Query-based visual segmentation aims at recognizing the relevant pixels in images or videos based on a language query. For image segmentation, several works explore cross-modal fusion methods (Hu et al., 2016; Liu et al., 2017; Margffoy-Tuay et al., 2018), extract multi-modal context with attention (Ye et al., 2019; Huang et al., 2020; Jain and Gandhi, 2021), develop cycle-consistency learning (Chen et al., 2019b) and study adversarial training (Qiu et al., 2019). Recently, more works investigate video segmentation. Several works focus on dynamic convolution-based methods (Gavrilyuk et al., 2018; Wang et al., 2020; Hui et al., 2021), explore cross-modal attention (Wang et al., 2019; Liu et al., 2021a) and study visual-textual capsule routing algorithms (McIntosh et al., 2020). A recent work (Botach et al., 2022) builds a transformer model for this task. However, existing research relies heavily on expensive annotations and can barely generalize to unseen circumstances, hindering its feasibility in practice. Hence, we further study domain adaptation for this task.
Recently, a few works study domain adaptation for multi-modal tasks, e.g., image captioning (Chen et al., 2017), text-based person search (Jing et al., 2020) and video-text retrieval (Chen et al., 2021; Liu et al., 2021b). Despite their effectiveness, they fail to comprehensively and precisely address the various domain discrepancies in this cross-modal segmentation task. Thus, we propose a novel framework to achieve effective adaptation for CQVS.

Preliminary
Problem Formulation. In this task, we are given a labeled source domain $D^s = \{(V^s_i, Q^s_i, A^s_i)\}_{i=1}^{N^s}$ and an unlabeled target domain $D^t = \{(V^t_i, Q^t_i)\}_{i=1}^{N^t}$ containing $N^s$ and $N^t$ samples respectively, where $V$, $Q$ and $A$ are the visual sample (i.e. image/video), the textual query and the pixel-wise annotations. The goal is to construct a model with the existing data to segment the query-relevant pixels on the target domain $D^t$.
Base Network. To illustrate our SDA paradigm clearly, we first formulate the base segmentation network as three common modules: (1) Encoder: Given the raw visual input $V$ and query embedding $W$, it encodes visual features as $V \in \mathbb{R}^{T \times H \times W \times C}$ and query features as $Q \in \mathbb{R}^{N \times C}$, where $T$, $H$, $W$, $C$ are the frame number, height, width and hidden size of the visual features respectively, and $N$ is the word number. Note that $T = 1$ for images, and $T$ will be omitted for ease of presentation. Based on the annotations $A^s$, the visual features $V^s$ can be correspondingly divided into the foreground $V^{s,+}$ and the background $V^{s,-}$; (2) Interaction: It develops the cross-modal interaction with attention weight $\alpha \in \mathbb{R}^{H \times W \times N}$ and outputs enhanced pixel-level features $F \in \mathbb{R}^{H \times W \times C}$; (3) Decoder: It is applied on $F$ to generate the final response map $S = \{s_i\}_{i=1}^{H \times W}$. To train the model, we utilize a binary cross-entropy loss $L_{seg} = L_{bce}(S, A)$.
Overall Framework. The overall SDA is shown in Figure 2. SDA includes a CSM module (Section 3.2) that encodes the content shift and a DAB module (Section 3.3) that closes the feature- and relation-level gaps, achieving effective adaptation.
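As a minimal sketch of the segmentation objective $L_{seg} = L_{bce}(S, A)$ from the base network, the binary cross-entropy between a response map and a binary annotation can be computed as follows. The function name `bce_loss`, the flattening of the H×W map into a list, the averaging over pixels and the clamping constant `eps` are illustrative assumptions, not details taken from the paper.

```python
import math


def bce_loss(scores, labels, eps=1e-7):
    """Mean binary cross-entropy over pixels.

    scores: predicted foreground probabilities in [0, 1], one per pixel.
    labels: binary annotations (0 or 1), same length as scores.
    eps:    assumed clamping constant that avoids log(0).
    """
    assert scores and len(scores) == len(labels)
    total = 0.0
    for s, a in zip(scores, labels):
        s = min(max(s, eps), 1.0 - eps)
        total += -(a * math.log(s) + (1 - a) * math.log(1.0 - s))
    return total / len(scores)
```

An uninformative map with every score at 0.5 gives a loss of log 2 per pixel, and the loss shrinks as the scores move toward the annotation.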

Content-aware Semantic Modeling
To explore the semantic structure of open and diverse vision-language content, we propose a novel method to extract informative contents via cross-modal attention, and construct a universal semantic space by leveraging a clustering algorithm and measuring cross-domain semantic similarity.
Attentive Extraction. The multi-modal contents exist in the informative and common parts of the visual and textual modalities. As visual features easily vary under different conditions and textual features are comparatively stable but scattered over the sequence, we employ vision-language attention to highlight the correlated words in the query that are attended by the visual foreground. In the source domain, with the attention $\alpha^s \in \mathbb{R}^{H \times W \times N}$ in the interaction module and the annotations $A^s \in \{0, 1\}^{H \times W}$, we calculate the visual-guided attention $\{\bar{\alpha}^s_n\}_{n=1}^{N}$ over words by averaging the attention scores of all foreground pixels as $\bar{\alpha}^s_n = \frac{1}{H \times W} \sum_{i=1}^{H \times W} \alpha^s_{i,n} A^s_i$. Then we normalize $\{\bar{\alpha}^s_n\}_{n=1}^{N}$ with temperature $\tau$ for a sharper distribution and combine it with the query embeddings $W = \{w_i\}_{i=1}^{N}$ to obtain content features $H^s \in \mathbb{R}^{C}$, given by:
Here we adopt the embedding $W$ rather than $Q$ for its generalization ability (Bravo et al., 2022).
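The attentive extraction step can be sketched as below. The averaging follows the formula for $\bar{\alpha}^s_n$ in the text. The normalization is assumed here to be a softmax with temperature $\tau$, and the combination is assumed to be a weighted sum of word embeddings; both are plausible readings of the description, not confirmed forms. All function names are hypothetical, and pixels are flattened into a list for simplicity.

```python
import math


def visual_guided_attention(alpha, mask):
    """Average word attention over foreground pixels.

    alpha: list of P rows (one per pixel), each holding N word scores.
    mask:  list of P binary labels (1 = foreground).
    Returns N values: (1 / P) * sum_i alpha[i][n] * mask[i].
    """
    num_pixels, num_words = len(alpha), len(alpha[0])
    return [
        sum(alpha[i][n] * mask[i] for i in range(num_pixels)) / num_pixels
        for n in range(num_words)
    ]


def temperature_softmax(values, tau):
    """Softmax of values / tau (assumed form of the normalization)."""
    scaled = [v / tau for v in values]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def content_feature(alpha, mask, embeddings, tau=0.2):
    """Attention-weighted sum of word embeddings (assumed combination)."""
    weights = temperature_softmax(visual_guided_attention(alpha, mask), tau)
    dim = len(embeddings[0])
    return [
        sum(w * emb[d] for w, emb in zip(weights, embeddings))
        for d in range(dim)
    ]
```

The default `tau=0.2` mirrors the value reported in the implementation details; a smaller temperature concentrates the weights on fewer words.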
In the target domain without annotations, we leverage knowledge distillation (Hinton et al., 2015) to train a predictor with the attention weights $\bar{\alpha}^s$ as guidance, enabling it to learn word importance directly from the query. Thus, we can adaptively predict attention weights $\bar{\alpha}^t$ and calculate content features $H^t$. Details are given in Appendix B.1.
Semantic Construction. On the basis of the multi-modal contents, we aim to abstract and summarize the high-level semantics that represent the key concepts. Concretely, we apply Agglomerative Clustering (Zhang et al., 2012) on the source contents $H^s$ and obtain $k^s$ semantic categories with their prototypes (i.e. cluster centers) $\{C^s_j\}_{j=1}^{k^s}$. To explore the target semantic structure, we then measure the cross-domain semantic similarity by calculating the sample-to-prototype distance. That is, we compute the similarity distribution $d_i = \{d_{i,j}\}_{j=1}^{k^s}$ from the content feature $H^t_i$ of the $i$-th target sample to all source prototypes, given by:
With the cross-domain semantic similarity $d_i$, we define the boundary $\rho$ between "common" and "unknown" points based on its entropy $H(d_i)$, so as to align the common parts of the target domain to the source domain and reject the unknown parts.
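A minimal sketch of the entropy-based separation of "common" and "unknown" target samples follows. The similarity distribution is assumed to be a softmax over negative Euclidean distances to the source prototypes, which is one common choice and not necessarily the paper's. The boundary $\rho = \beta \log(k^s)$ follows the hyper-parameter analysis; the function names and the use of `None` for unknown samples are illustrative.

```python
import math


def similarity_distribution(feature, prototypes):
    """Softmax over negative Euclidean distances (assumed form of d_i)."""
    dists = [math.dist(feature, p) for p in prototypes]
    peak = max(-d for d in dists)  # for numerical stability
    exps = [math.exp(-d - peak) for d in dists]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(dist):
    """Shannon entropy of a discrete distribution (natural log)."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def assign_category(feature, prototypes, beta):
    """Return the nearest prototype index, or None for an unknown sample."""
    d = similarity_distribution(feature, prototypes)
    rho = beta * math.log(len(prototypes))  # fraction of the maximum entropy
    if entropy(d) < rho:
        return max(range(len(d)), key=d.__getitem__)
    return None
```

A sample close to one prototype yields a peaked distribution with low entropy and is assigned to that cluster, while a sample equidistant from all prototypes reaches the maximum entropy and is rejected as unknown.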
Each reliable target sample with entropy below the boundary $\rho$ is assigned to the nearest source cluster, while the other, unreliable ones are clustered into extra classes, resulting in $k^t$ ($\le k^s$) common and $k^u$ novel classes. With semantic category labels $L$ assigned to data pairs, we establish a semantic structure across domains, where each source category label $L^s_i \in \{1, 2, \ldots, k^s\}$ and each target category label $L^t_j \in \{1, 2, \ldots, k^t, \ldots, k^t + k^u\}$.
Semantic Initialization. To further initialize discriminative pixel features for stable adaptation, we retrain the segmentation model and incorporate the category clue to ensure that the foreground features are compact within categories and separated between categories. With the category $L^s$, we can similarly obtain visual prototypes $\{U^s_i\}_{i=1}^{k^s}$ for the foreground pixels $V^{s,+}$. Following the work of Zhang et al. (2021), we calculate the similarity between $\{U^s_i\}_{i=1}^{k^s}$ and $V^{s,+}$ as $P^s = [P^s_1, P^s_2, \ldots, P^s_{k^s}]$, then compute the prototypical contrastive loss $L_{si}$ by:

Dual Adaptive Branches
In the DAB module, we develop two branches to separately address feature and relation discrepancies.

Contrastive Feature Branch
In this branch, we adopt contrastive learning (He et al., 2020) to achieve category-wise foreground alignment for the visual feature gap.
For visual samples from two domains with the same foreground category, we aim to contrastively strengthen their foreground agreement. While the source foreground can be obtained via the annotations $A^s$, we apply two thresholds $\gamma_{min}$ and $\gamma_{max}$ on the predicted response map $S^t$ and filter the highly reliable pixels to obtain pseudo masks $A^t$ on the target domain, where positions with scores above $\gamma_{max}$ and below $\gamma_{min}$ are set to 1 and 0 respectively, while the rest are ignored. Then we can similarly obtain foreground and background features $V^{t,+}$ and $V^{t,-}$. We calculate the pixel-level similarity from the source foreground $V^{s,+}$ to the target reliable pixels $V^{t,*} = V^{t,+} \cup V^{t,-}$, and make the similarity of associated foreground pairs higher than that of any other irrelevant pairs, given by:
where $G^s$ is the set of source foreground pixels.
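The pseudo-mask filtering described above can be sketched as follows. The default thresholds 0.9 and 0.1 are the values reported in the implementation details; the function name and the `ignore` marker value are illustrative assumptions.

```python
def pseudo_mask(scores, gamma_min=0.1, gamma_max=0.9, ignore=-1):
    """Turn a flattened response map into a three-way pseudo mask.

    Pixels scoring above gamma_max become foreground (1), pixels scoring
    below gamma_min become background (0), and the remaining unreliable
    pixels receive the `ignore` marker so later losses can skip them.
    """
    labels = []
    for s in scores:
        if s > gamma_max:
            labels.append(1)
        elif s < gamma_min:
            labels.append(0)
        else:
            labels.append(ignore)
    return labels
```

For example, `pseudo_mask([0.95, 0.5, 0.05])` keeps the first and last pixels as reliable foreground and background and marks the middle one as ignored.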
To enhance the diversity of contrastive samples, we maintain a memory bank $\{\{m^l_i\}_{i=1}^{B}\}_{l=1}^{k^t}$ with $B$ feature maps for each of the $k^t$ common categories on each domain, providing abundant samples based on the category and domain of the current training data.
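One simple way to realize such a memory bank is a fixed-size queue per (domain, category) pair. The sketch below assumes first-in-first-out replacement, which the text does not specify; the class and method names are hypothetical, and the default size of 100 is the value of $B$ reported in the implementation details.

```python
from collections import defaultdict, deque


class MemoryBank:
    """Per-domain, per-category store of at most `size` feature maps."""

    def __init__(self, size=100):
        self.size = size
        # Each (domain, category) key owns a bounded queue; appending to a
        # full queue drops the oldest entry (assumed FIFO replacement).
        self._store = defaultdict(lambda: deque(maxlen=self.size))

    def push(self, domain, category, feature_map):
        self._store[(domain, category)].append(feature_map)

    def sample(self, domain, category):
        """Return the stored feature maps for one pair, oldest first."""
        return list(self._store[(domain, category)])
```

During training on a source sample of category `l`, the contrastive samples would be read with `bank.sample("target", l)`, and vice versa for target samples.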
The contrastive training is performed bidirectionally so that the pixel features from each domain can be enhanced by the other. The full loss combines the symmetric terms as

Reciprocal Relation Branch
In this branch, we learn semantic-based masks as reciprocal signals and enhance relation with two designed modules to alleviate the relation bias.
Reciprocal Masks. Typical methods optimize the relation via the annotations $A^s$ on the source domain and the pseudo masks $A^t$ on the target domain, which are refined from the predicted response map $S^t$ (Section 3.3.1). However, $A^t$ essentially relies on the decoder pre-trained on the source domain. Hence, $A^s$ and $A^t$ are both domain-based masks and suffer from bias due to their coupling with source knowledge. Instead, we leverage the domain-agnostic semantic category $L^t$ to develop a multi-label visual classification on the target domain and obtain semantic-based masks $\bar{A}^t \in \{0, 1\}^{H \times W}$ from the class activation map (Appendix B.2), which can highlight the instances of the specific category. The two types of masks are complementary: (1) $A^s$ provides accurate source annotations for segmentation ability, while $\bar{A}^t$ provides independent target masks as external knowledge. (2) $A^t$ focuses on precise but biased pixels on the target domain, while $\bar{A}^t$ provides imprecise but comprehensive category instances as reciprocal signals (Figure 9).
Collaborative Training. Given $A^s$ on the source domain and $\bar{A}^t$ on the target domain, direct training with mixed annotations of different granularity is ineffective (Luo and Yang, 2020). Thus, we collaboratively train the model on the two domains via a shared encoder and two separate decoders, one for each type of annotation, which shields the main decoder from the inaccurate masks while still providing comprehensive information through the shared encoder. With the source output $S^s$ from the main decoder and the target output $\bar{S}^t$ from the auxiliary decoder, the objective is given by:
Hierarchical Optimization. We also enhance the relation based on the cross-modal mutual information (MI). On the target domain, with the domain-based and semantic-based masks $A^t$ and $\bar{A}^t$, we denote the features of their intersection as the selected instance $V^t_{sel}$, the features of their union as the category instances $V^t_{ct}$ and the features of the remaining part as the background $V^t_{bg}$. To distinguish the hierarchy among them (i.e. background-category-instance), we follow the work of Hjelm et al. (2018) to maximize MI by:
where $\phi(\cdot, \cdot)$ is the MI discriminator followed by the softplus function. Similarly, we enhance MI with the loss $L^s_{ho}$ in the source domain by directly distinguishing between the foreground and the background. The final objective is $L_{ho} = L^s_{ho} + L^t_{ho}$.
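The region partition used for hierarchical optimization, and a softplus-based MI estimate in the style of Hjelm et al. (2018), can be sketched as below. The partition follows the text (intersection, union, remainder). The estimator is the generic Jensen-Shannon form from that line of work; how the paper pairs the three regions with the query inside its loss is not reproduced here, and all function names are hypothetical.

```python
import math


def split_regions(domain_mask, semantic_mask):
    """Partition pixel indices using two binary masks of equal length.

    selected:   pixels in both masks (intersection).
    category:   pixels in at least one mask (union).
    background: pixels in neither mask.
    """
    pairs = list(zip(domain_mask, semantic_mask))
    selected = [i for i, (a, b) in enumerate(pairs) if a and b]
    category = [i for i, (a, b) in enumerate(pairs) if a or b]
    background = [i for i, (a, b) in enumerate(pairs) if not (a or b)]
    return selected, category, background


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def jsd_mi_estimate(pos_scores, neg_scores):
    """Jensen-Shannon MI estimate from discriminator scores.

    pos_scores: scores of matched (vision, language) pairs.
    neg_scores: scores of mismatched pairs.
    """
    pos = sum(-softplus(-t) for t in pos_scores) / len(pos_scores)
    neg = sum(softplus(t) for t in neg_scores) / len(neg_scores)
    return pos - neg
```

Maximizing the estimate pushes matched pairs toward high scores and mismatched pairs toward low scores.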

Training and Inference
Training. We adopt multi-stage training. (1) We pre-train the segmentation model with the loss $L_{seg}$ on the source domain. (2) Then we leverage the pre-trained model in the CSM module and re-train it by adding the loss $L_{si}$. (3) We incorporate the DAB module to continue training with the full loss $L_{sda}$, given by:
Inference. During inference, we use the main decoder and take the pixels whose score is higher than half of the maximum value in the response map as foreground.
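The inference rule above amounts to a relative threshold on the response map. A minimal sketch, assuming a flattened map and a hypothetical function name:

```python
def foreground_mask(response_map):
    """Mark pixels whose score exceeds half of the map's maximum score."""
    threshold = 0.5 * max(response_map)
    return [1 if s > threshold else 0 for s in response_map]
```

Because the threshold scales with the maximum response, a map with uniformly low confidence still yields a non-empty foreground as long as its peak is positive.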

Performance Comparison
Baselines. We compare SDA with the following methods. (1) For query-based visual segmentation methods that only utilize the labeled source data for training, we select CMDyConv (Wang et al., 2020) and CMPC-V (Hui et al., 2021) for videos, and CMSA (Ye et al., 2019) and CMPC-I (Hui et al., 2021) for images. (2) For DA methods that utilize both the labeled source and unlabeled target data for training, we consider the uni-modal DA methods MMD (Long et al., 2015) and DANN (Ganin and Lempitsky, 2015) for image classification and PLCA (Kang et al., 2020) for semantic segmentation, and the cross-modal DA methods MAN (Jing et al., 2020) for text-based person search and ACP (Liu et al., 2021b) for vision-language retrieval.
Query-based Video Segmentation: The results of four transfer directions on three video datasets are shown in Section 3.4. (1) We observe that our SDA framework consistently outperforms all other methods on all criteria, improving mAP[0.5:0.95] by 6.0, 3.3, 8.7 and 4.1 on the four transfer tasks respectively.
(2) The uni-modal DA method MMD brings little gain and DANN even slightly degrades the performance. We infer that the reason is that directly aligning each modality results in negative transfer, as discussed in Section 1. (3) Though cross-modal DA methods achieve a performance boost, they are still inferior to our approach due to the lack of a comprehensive solution to the domain discrepancies. These observations demonstrate the strong adaptation ability of our SDA framework.
Query-based Image Segmentation: As shown in Section 3.4, our SDA also achieves the best results on two transfer tasks for query-based image segmentation. This validates the generalization ability of our approach across visual modalities (image/video) and further evidences its effectiveness.
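For readers unfamiliar with the mAP[0.5:0.95] criterion, the sketch below shows one common way it is computed in referring segmentation benchmarks: the precision at an IoU threshold is the fraction of samples whose IoU exceeds that threshold, and the score averages this precision over thresholds from 0.50 to 0.95 in steps of 0.05. This is an assumed reading of the metric, since this section does not define it; the function names and the convention for two empty masks are illustrative.

```python
def mask_iou(pred, target):
    """Intersection-over-union of two binary masks given as flat 0/1 lists."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def precision_at(ious, threshold):
    """Fraction of samples whose IoU exceeds the threshold."""
    return sum(1 for v in ious if v > threshold) / len(ious)


def mean_ap(ious):
    """Average of precision_at over IoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = [0.5 + 0.05 * k for k in range(10)]
    return sum(precision_at(ious, t) for t in thresholds) / len(thresholds)
```

Under this reading, a gain of several points in mAP[0.5:0.95] means that more predictions survive the stricter IoU thresholds, not only the loose 0.5 one.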

Ablation Study
To investigate the validity of the derived modules, we conduct ablation studies on two adaptation tasks, RVOS → A2D and UNC → UNC+.
Main Ablation Study. As shown in Section 4.3, we verify the contribution of each module in our SDA. CSM refers to content-aware semantic modeling, and DAB refers to the dual adaptive branches, which include CFB and RRB. We observe that only adding the CSM module brings little gain, which is reasonable since it mainly models the content shift for subsequent adaptation. On the basis of CSM, CFB and RRB both improve the performance dramatically, verifying their effectiveness in addressing the feature- and relation-level discrepancies. To evaluate the importance of CSM, we remove it and obtain inferior results, confirming the necessity of semantic modeling to harmonize the adaptation. Our full model integrates these modules and therefore achieves better performance.
Ablation Study for the CSM Module. We perform an ablation study for the CSM module and report the results in Section 4.3. (1) For attentive extraction (AE), we adopt mean-pooling on the visual and textual features as content features to generate the ablation models w/ mean-pooling(v) and w/ mean-pooling(t) respectively. By comparison, our attentive extraction leads to superior performance.
We further replace the reciprocal masks with only domain-based masks as w/o reciprocal. The inferior results confirm its effectiveness for relation enhancement.

Hyper-Parameter Analysis
Impact of Temperature τ in the AE method. The temperature τ controls the attention distribution for the extraction of content features. We evaluate 11 different τ values from 0.01 to 1.0 on RVOS → A2D and UNC → UNC+. The result in Figure 5 shows that the performance is best when τ is set to 0.2 and becomes poor when τ is too small or too large. This result suggests that a proper τ value is crucial to capturing key contents.
Impact of Boundary ρ in the SC method. To study the impact of the boundary ρ, we set ρ = β log(k s ), where log(k s ) is the theoretical maximum value of the similarity entropy H(d i ). The result in Figure 6 shows that the performance first increases and then decreases with increasing β, indicating that the boundary controls the degree of openness between domains and hence affects adaptation.
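The bound used for ρ is a standard property of Shannon entropy. A short statement in the notation of Section 3.2 is given below, assuming that $d_i$ is a probability distribution over the $k^s$ source prototypes and that β is chosen in $[0, 1]$ (the range is an assumption, not stated in the text):

```latex
H(d_i) = -\sum_{j=1}^{k^s} d_{i,j} \log d_{i,j} \;\le\; \log k^s,
\qquad \text{with equality iff } d_{i,j} = \tfrac{1}{k^s} \text{ for all } j,
\qquad \text{so} \qquad
\rho = \beta \log k^s \in [0, \log k^s] \text{ for } \beta \in [0, 1].
```

A small β therefore accepts only target samples whose similarity is concentrated on one prototype, while β close to 1 accepts nearly every sample as common.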

Qualitative Analysis
To shed qualitative light on the proposed approach, we conduct several experiments as follows. More results are listed in Appendix D.3.
Visualization of Visual Features. In Figure 7, we visualize the visual features on RVOS → A2D, … the left car) but suffers severe bias. Instead, the semantic-based mask can coarsely localize various car instances, thus providing comprehensive clues about the missing one (i.e. the right car).

Conclusion
In this work, we present the first study of the task of cross-domain query-based visual segmentation. To address this problem, we propose Semantic-conditioned Dual Adaptation, a novel framework that achieves feature- and relation-level adaptation via a universal semantic structure. Experiments show that our framework performs consistently well on both query-based video and image benchmarks.

Limitation
In this section, we discuss the limitations of our work. Our work mainly studies the setting where each dataset serves as an independent domain. However, the adopted datasets (e.g. UNC, UNC+) for query-based image segmentation are mostly collected on MS-COCO (Lin et al., 2014) and have a limited domain gap in the visual modality. The findings could inspire researchers to explore other settings, e.g. one in which each class serves as an independent domain.

Ethics Statement
We adopt widely-used datasets that were produced by previous researchers. We followed all relevant legal and ethical guidelines for their acquisition and use. Besides, we recognize the potential influence of our technique, such as its application in human-computer interaction and vision-language grounding systems. We are committed to conducting our research ethically and ensuring that our research is beneficial. We hope our work can inspire more investigations into domain adaptation for multi-modal tasks, and that our framework can serve as a solid baseline for further research.
This appendix contains four sections. (1) Appendix A introduces the detailed design of our base segmentation network. (2

A Base Segmentation Network
We adopt a unified segmentation network for both query-based video and image segmentation. Specifically, we adopt the same architecture, including the query encoder, the interaction mechanism between the frames and words, the decoder for segmentation and the training loss. The main difference is that we adopt different visual encoders for videos and images.
Encoder. For each video, we employ pre-trained I3D layers (Carreira and Zisserman, 2017) with stacked 3D convolutions to learn the spatio-temporal features of video clips, denoted as $V \in \mathbb{R}^{T \times H \times W \times C}$, where $T$, $H$, $W$, $C$ are the frame number, height, width and channel number of the output respectively. For each image, we employ a pre-trained ResNet-101 network (He et al., 2016) to learn the spatial features, denoted as $V \in \mathbb{R}^{H \times W \times C}$. For ease of presentation, we abuse the symbol $V$ to denote both video and image features and drop $T$. Besides, we adopt the multi-resolution visual feature maps $\{V_i \in \mathbb{R}^{H_i \times W_i \times C_i}\}_{i=1}^{N_m}$ that are outputs of different encoder layers, where $N_m$ is the number of multi-resolution feature maps and $H_i$, $W_i$ and $C_i$ are the height, width and channel number of the $i$-th feature map respectively. For each query, we employ the GloVe (Pennington et al., 2014) word embeddings as the input $W$ and apply a Bi-GRU network to learn the query features $Q \in \mathbb{R}^{N \times C}$, where $N$ is the word number.
Interaction. With the query representation $Q \in \mathbb{R}^{N \times C}$ and the visual features $V \in \mathbb{R}^{H \times W \times C}$, we first incorporate the natural language to generate the query-focused visual context $V^q \in \mathbb{R}^{H \times W \times C}$ through dot-product attention and gating modulation, given by:
where $g_1$, $g_2$, $g_3$ are distinct linear transformations and the attentive representation has the same shape $H \times W \times C$ as $V$. Note that the cross-modal attention $\alpha$ in Section 3.1 can be obtained by $\alpha = \mathrm{softmax}(g_1(V) g_2(Q^{\top}))$.
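The cross-modal attention $\alpha = \mathrm{softmax}(g_1(V) g_2(Q^{\top}))$ can be sketched as below. The inputs are assumed to be already projected by $g_1$ and $g_2$, pixels are flattened into a list, and the softmax is assumed to run over the word axis, which the formula does not state explicitly. The function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(pixels, words):
    """Attention of each pixel over the query words.

    pixels: P projected pixel features, i.e. rows of g1(V), each of length C.
    words:  N projected word features, i.e. rows of g2(Q), each of length C.
    Returns a P x N matrix whose rows sum to 1 (assumed softmax axis).
    """
    return [
        softmax([sum(p * w for p, w in zip(pixel, word)) for word in words])
        for pixel in pixels
    ]
```

Each row of the result corresponds to one pixel, so averaging the rows of foreground pixels gives the visual-guided word attention used in Section 3.2.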
We then divide the feature map into different semantic regions based on the unsupervised low-level SLIC superpixel algorithm (Achanta et al., 2012). Specifically, we turn the visual context $V^q$ into the superpixel representation $V^r \in \mathbb{R}^{N_r \times C}$ through region max-pooling, where $N_r$ is the preset number of superpixels, and compute the region-contextual representations with region self-attention, given by:
where $g_q$, $g_k$, $g_v$ are distinct linear transformations. We further augment the pixel representations by adding the corresponding region-contextual representations to the original visual context and get the enhanced pixel-level features $F \in \mathbb{R}^{H \times W \times C}$. We build the hierarchical cross-modal interaction to obtain enhanced feature maps $\{F_i \in \mathbb{R}^{H_i \times W_i \times C_i}\}_{i=1}^{N_m}$ by stacking the interaction module over the multi-resolution visual features $\{V_i\}_{i=1}^{N_m}$.
Decoder. After the encoding and interaction, we generate the multi-scale response maps $\{\{s_{i,j}\}_{j=1}^{H_i \times W_i}\}_{i=1}^{N_m}$ by employing an FCN (fully convolutional network) on the enhanced representations $\{F_i\}_{i=1}^{N_m}$. With the annotations $\{\{a_{i,j}\}_{j=1}^{H_i \times W_i}\}_{i=1}^{N_m}$, where $a_{i,j} \in \{0, 1\}$, we directly compute the binary cross-entropy loss for the $j$-th pixel of the $i$-th feature map, given by $L_{i,j} = -\big(a_{i,j} \log s_{i,j} + (1 - a_{i,j}) \log(1 - s_{i,j})\big)$.

B Technique Components

B.1 Predictor
In the CSM module (Section 3.2), we independently train a predictor to predict the weights of the words on the target domain, aiming to learn word importance directly from the query without available annotations. The predictor consists of a two-layer MLP. Specifically, we conduct pre-training on the source domain, where we fix all network components except for the predictor. First, we input the query embeddings $W$ into the predictor and obtain the output $\{\bar{\alpha}^{s,pre}_n\}_{n=1}^{N}$. Then we follow the knowledge distillation scheme (Hinton et al., 2015), adopting the visual-guided attention weights $\{\bar{\alpha}^s_n\}_{n=1}^{N}$ as the objective, and develop a Kullback-Leibler divergence loss to train the predictor. The loss is given by:
where $D_{KL}(A \| B)$ is the Kullback-Leibler divergence from $A$ to $B$. After training the predictor for 5 epochs, we freeze it and apply it to the target domain to predict the weights $\bar{\alpha}^t_n$. The visualized results can be found in Figure 12.
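A minimal sketch of a KL-divergence distillation loss between the visual-guided attention weights (teacher) and the predicted weights (student) is shown below. The argument order, with the teacher as the first argument, is an assumption, since the exact form of the loss is not available in this section; the function name and the `eps` floor are illustrative.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions over the same N words.

    p:   teacher weights (assumed: visual-guided attention, sums to 1).
    q:   student weights (assumed: predictor output, sums to 1).
    eps: assumed floor on q that avoids division by zero.
    """
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )
```

The loss is zero when the predictor reproduces the teacher weights exactly and grows as the predicted distribution puts little mass on words the teacher considers important.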

B.2 Multi-label Visual Classification
In the RRB branch (Section 3.3.2), with the semantic category $L^t$ for each referred instance, we first construct image-level training samples by combining all categories that appear in the image. In this way, an image (or a video frame) is related to multiple category labels, and we assume that the ground-truth label of an image is $y \in \mathbb{R}^{k^t + k^u}$, where $y_i \in \{0, 1\}$ denotes whether label $i$ appears in the image or not. Next, we leverage an independent classification network to perform training, where we adopt a pre-trained ResNet-101 as the backbone to encode visual features. Then we apply global average pooling on the convolutional feature maps and input the result into a fully-connected layer with a sigmoid function to produce the desired output $\bar{y}$ for classification. With a total of $k^t + k^u$ categories on the target domain, the whole network is trained using the traditional multi-label classification loss as follows:
To leverage the weakly-supervised localization ability, we follow the CAM method (Zhou et al., 2016) to build the class activation map, which can coarsely highlight the pixels belonging to a specified category. We also follow the IRN method (Ahn et al., 2019) to further refine the class activation map for more precise masks. Afterwards, each target training sample $(V^t_i, Q^t_i)$ is associated with a semantic-based mask $\bar{A}^t_i$, which provides the coarse-level instance location of the corresponding category $L^t_i$. The visualized results are shown in Figure 9.
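The class activation map of Zhou et al. (2016) is a weighted sum of the last convolutional feature maps, using the fully-connected weights of the chosen class. The sketch below illustrates this on nested lists; the binarization by a fraction of the peak value is an assumed post-processing step (the paper refines the map with IRN instead), and the function names are hypothetical.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of K feature maps for one class.

    feature_maps:  K maps, each an H x W nested list.
    class_weights: K fully-connected weights of the chosen class.
    """
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [
            sum(w * fmap[y][x] for w, fmap in zip(class_weights, feature_maps))
            for x in range(width)
        ]
        for y in range(height)
    ]


def binarize(cam, ratio=0.5):
    """Assumed thresholding: keep pixels above `ratio` times the peak value."""
    peak = max(max(row) for row in cam)
    return [[1 if v > ratio * peak else 0 for v in row] for row in cam]
```

Because the classifier is trained only with image-level category labels, the resulting mask highlights all instances of a category, which is why it is comprehensive but coarse.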

C Experiment Details
C.1 Dataset Details

C.1.1 Query-based Video Segmentation
Refer-Youtube-VOS. Refer-Youtube-VOS (Seo et al., 2020) is a large-scale referring video segmentation dataset extended from the Youtube-VOS dataset (Xu et al., 2018). It contains 3,975 videos, 7,451 objects and 27,899 expressions, with both first-frame and full-video expressions annotated.
A2D Sentences. A2D Sentences (Gavrilyuk et al., 2018) is extended from the Actor-Action Dataset (Xu et al., 2015) by providing textual descriptions for each video. It contains 3,782 videos annotated with 8 action classes performed by 7 actor classes.
J-HMDB Sentences. J-HMDB Sentences (Gavrilyuk et al., 2018) is extended from the J-HMDB dataset (Jhuang et al., 2013) and contains 928 videos with 928 corresponding sentences. All the actors in the J-HMDB dataset are humans, and one natural language query is annotated to describe the action performed by each actor.
C.1.2 Query-based Image Segmentation
UNC. UNC (Yu et al., 2016) is collected on MS-COCO (Lin et al., 2014). It contains 19,994 images with 142,209 referring expressions for 50,000 objects. Expressions in UNC contain words indicating the location of the objects.
UNC+. UNC+ (Yu et al., 2016) is also collected on MS-COCO (Lin et al., 2014). It contains 19,992 images with 141,564 referring expressions for 49,856 objects. Expressions in UNC+ describe the objects based on their appearance and context within the scene without using spatial words.
G-Ref. G-Ref (Mao et al., 2016) is also collected on MS-COCO (Lin et al., 2014). It contains 26,711 images with 104,560 referring expressions. Expressions in G-Ref are longer, with an average length of 8.4 words, compared with other datasets (e.g. UNC, UNC+), which have an average sentence length of less than 4 words.

C.2 Implementation Details
Model Selection. For visual features, we use ResNet-101 (He et al., 2016) pre-trained on ImageNet as the backbone feature extractor for images and use the I3D network (Carreira and Zisserman, 2017) pre-trained on the Kinetics dataset (Carreira et al., 2018) for video clips. For query features, we employ the pre-trained GloVe (Pennington et al., 2014) word embeddings as input.
Parameter Setting. For the base segmentation setting, we follow the video segmentation work (Wang et al., 2019) to set the target frame as the center of 8 continuous clips. All the frames are rescaled and padded to the same size of 320 × 320. The FCN network for the decoder consists of three fully convolutional layers with residual connections, where the kernel size is 3 × 3 for the first two layers and 1 × 1 for the remaining layer. We set the hidden size to 1024 and use bilinear interpolation for feature map upsampling. For images, we adopt the last three layers of the encoder for the multi-resolution feature maps (N m = 3). For videos, we adopt the last five layers of the encoder for the multi-resolution feature maps (N m = 5).
In our SDA framework, we select the visual features from the last layer of the encoder for adaptation. For the content-aware semantic modeling module, we set the temperature τ to 0.2, set the distance threshold in Agglomerative Clustering (Zhang et al., 2012) to 0.5 and set the boundary ρ = β log(k s ), where log(k s ) is the theoretical maximum value of the entropy H(d). For the contrastive feature branch, we set the thresholds γ max and γ min to 0.9 and 0.1 respectively and set the memory size B to 100 for each category. Besides, we adopt the teacher-student architecture (Tarvainen and Valpola, 2017) to provide stable features, with the momentum parameter set to 0.99. For the reciprocal relation branch, we follow the IRN method (Ahn et al., 2019) to refine the class activation map. The loss coefficients λ 1 and λ 2 are empirically fixed at 1.0 and 0.1. To train our model, we use the Adam optimizer with an initial learning rate of 1e-7. The learning rate increases linearly to 4e-4 over 300 update steps and then decreases proportionally. The batch size is set to 8 for both the source data and the target data. We run all the experiments 5 times and report the mean results.
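Two of the settings above can be illustrated with a short sketch: the momentum update of the teacher in a teacher-student architecture, and the learning-rate warmup. The exponential moving average with momentum 0.99 follows Tarvainen and Valpola (2017). The linear warmup from 1e-7 to 4e-4 over 300 steps follows the text, whereas the inverse-square-root decay after warmup is only an assumed reading of "decreases proportionally". Function names are hypothetical, and parameters are flattened into lists.

```python
import math


def ema_update(teacher, student, momentum=0.99):
    """Move each teacher parameter slightly toward the student parameter."""
    return [
        momentum * t + (1.0 - momentum) * s
        for t, s in zip(teacher, student)
    ]


def learning_rate(step, base_lr=1e-7, peak_lr=4e-4, warmup_steps=300):
    """Linear warmup, then an assumed inverse-square-root decay."""
    if step <= warmup_steps:
        return base_lr + (peak_lr - base_lr) * step / warmup_steps
    return peak_lr * math.sqrt(warmup_steps / step)
```

With momentum 0.99 the teacher changes slowly, which is what makes its features stable enough to serve as contrastive targets.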

Training Step. As mentioned in Section 3.4, we develop a multi-stage training. In stage 1, we pre-train the segmentation model with the BCE loss L_seg on the source domain for 20 epochs. In stage 2, we re-train the model by adding the loss L_si of the CSM module for 20 epochs. In stage 3, we train the model with the full loss L_sda for 10 epochs. More specifically, in both stages 1 and 2, we set the learning rate to 4e-4 to start training. In stage 3, we continue training with the updated learning rate.
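The stage schedule can be written as a small lookup table. This is a minimal sketch under the assumption that the three stages run back to back (20 + 20 + 10 epochs); the names `STAGES` and `stage_for_epoch` are hypothetical, and the loss names are only string labels for the terms described above.

```python
# (stage name, number of epochs, active loss terms)
STAGES = [
    ("stage1", 20, ("L_seg",)),          # source-only pre-training
    ("stage2", 20, ("L_seg", "L_si")),   # adds the CSM loss
    ("stage3", 10, ("L_sda",)),          # full SDA loss
]


def stage_for_epoch(epoch):
    """Return (stage name, active losses) for a 0-indexed global epoch."""
    start = 0
    for name, num_epochs, losses in STAGES:
        if epoch < start + num_epochs:
            return name, losses
        start += num_epochs
    raise ValueError(f"epoch {epoch} is beyond the {start}-epoch schedule")
```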
Experiment Configuration. SDA is implemented using PyTorch 1.9.0 with CUDA 10.0 and cuDNN 7.6.5. All the experiments are conducted on a workstation with four NVIDIA GeForce RTX 2080Ti GPUs.

C.3 Baseline Setting
Query-based Visual Segmentation Baselines. For video segmentation baselines, CMDyConv (Wang et al., 2020) utilizes a context modulated dynamic network with group-wise kernel prediction to incorporate context information and an effective temporal evolution encoder to capture motion information; CMPC-V (Hui et al., 2021) builds a cross-modal adaptive modulation module to dynamically recombine multi-modal features. For image segmentation baselines, CSMA (Ye et al., 2019) employs cross-modal attention and self-attention to extract multi-modal context between image regions and referring words; CMPC-I (Hui et al., 2021) applies the same architecture as CMPC-V without temporal interaction on the image side. We directly re-implement the above approaches and adopt the same visual encoder and query embedding for a fair comparison.
Domain Adaptation Baselines. We combine domain adaptation approaches with our Base segmentation network to conduct experiments. For DA baselines designed for uni-modal tasks, MMD (Long et al., 2015) minimizes the feature distances; DANN (Ganin and Lempitsky, 2015) employs a gradient reversal layer to learn domain-invariant features; PLCA (Ganin and Lempitsky, 2015) develops pixel-level contrastive learning based on the pixel similarity. To apply MMD and DANN to the CQVS task, we leverage them for both pixel-level visual features and query features. Since PLCA mainly works for visual pixels, we further apply MMD on query features to improve the performance. For DA baselines designed for multi-modal tasks, MAN (Jing et al., 2020) performs alignment for each modality feature and leverages pseudo labels to train the target samples; ACP (Liu et al., 2021a) employs the pre-trained classification model to preserve the semantic structure of compositional concepts from uni-modal data. Since both MAN and ACP are designed for global image-level features, we apply MAN by replacing the image-level pseudo labels with pixel-level pseudo labels, and apply ACP by replacing the image-level concept with the instance-level concept (i.e. average pooling of pixels in the foreground region).

D.1 Discussion
Selection of Clustering Algorithm. To effectively establish the semantic structure of multi-modal contents, we investigate different clustering algorithms, including K-Means (Lloyd, 1982), Spectral Clustering (Ng et al., 2001) and Agglomerative Clustering (Zhang et al., 2012). From the results shown in Appendix D, we observe that our framework is sensitive to the choice of clustering algorithm. Specifically, K-Means and Spectral Clustering both require a manually set cluster number. The optimal cluster number is difficult to define, which prevents these algorithms from obtaining satisfactory clustering results and leads to inferior adaptation performance. Instead, Agglomerative Clustering only requires a proper distance threshold to automatically perform hierarchical clustering by grouping similar points, which yields better results on our content features extracted from the embedding model.
Content Extraction: Encoded Features vs Word Embedding. As discussed in Section 3.2, Bravo et al. (2022) show that a simple language model fits better than a large contextualized language model for detecting novel objects. To further verify its effectiveness on semantic modeling, we compare two different query features for content extraction, i.e. the encoded features vs the word embedding. We present the results in Appendix D and find that the word embedding performs better than the encoded features, which is consistent with the conclusion of Bravo et al. (2022).

D.2 Hyper-Parameter Analysis
Impact of Memory Size B in the CFB module. We vary the memory size B over [10, 50, 100, 200] to explore its impact. The results in Table 8 reveal that a larger memory size brings more improvement, which is also verified in (He et al., 2020). Notably, when B exceeds 100, the gain is limited. Considering the computation cost, we set B = 100 in our experiments.
Impact of Thresholds γ_max and γ_min in the CFB module. Thresholds γ_max and γ_min separately control the number of selected pseudo pixels for the foreground and the background in the target domain. We analyze the impact of the two thresholds and report the results in Appendix D.2. The results indicate that both threshold values are crucial to the adaptation performance. Our SDA achieves the best results with γ_max = 0.9 and γ_min = 0.1. They also show that γ_min is slightly more important than γ_max, since the background pixels provide discriminative power by serving as negative samples.
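To make the roles of B, γ_max and γ_min concrete, the following sketch shows one plausible reading of the two settings. It assumes that a target pixel becomes a foreground pseudo pixel when its predicted probability is at least γ_max, a background pseudo pixel when it is at most γ_min, and is ignored otherwise, and that the memory is a first-in-first-out queue of at most B features per category. Both the selection rule and the names `select_pseudo_pixels` and `CategoryMemory` are assumptions for illustration, not the paper's code.

```python
from collections import defaultdict, deque


def select_pseudo_pixels(prob_map, gamma_max=0.9, gamma_min=0.1):
    """Label each pixel 1 (foreground), 0 (background) or -1 (ignored)."""
    labels = []
    for row in prob_map:
        labels.append([
            1 if p >= gamma_max else 0 if p <= gamma_min else -1
            for p in row
        ])
    return labels


class CategoryMemory:
    """Hypothetical per-category memory holding at most `size` features."""

    def __init__(self, size=100):
        # deque(maxlen=size) discards the oldest entry once the bank is full.
        self.banks = defaultdict(lambda: deque(maxlen=size))

    def push(self, category, feature):
        self.banks[category].append(feature)

    def get(self, category):
        return list(self.banks[category])
```

Under this reading, raising γ_max or lowering γ_min shrinks the set of selected pseudo pixels, while B bounds how many stored features each category contributes to the contrastive comparison.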
Impact of Distance Threshold in the SC method. The distance threshold in Agglomerative Clustering defines the minimum distance between two clusters, which indirectly controls the cluster number. To explore its impact, we set the distance threshold to [0.1, 0.3, 0.5, 0.7, 0.9] and display the results in Figure 10. We note that the performance gradually improves as the distance threshold increases and slowly reaches a bottleneck. This phenomenon is reasonable since a large threshold leads to few clusters, each containing many indistinguishable samples, whereas a small threshold results in too many clusters, each with few samples.
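The link between the distance threshold and the number of clusters can be seen in a small self-contained sketch. It assumes average linkage with Euclidean distance and stops merging once the closest pair of clusters is at or beyond the threshold; the linkage choice and the function names are assumptions, since the text does not specify them. Library implementations such as scikit-learn's `AgglomerativeClustering` expose a comparable `distance_threshold` parameter.

```python
import math


def _avg_linkage(a, b, points):
    """Average pairwise Euclidean distance between two clusters of indices."""
    total = sum(math.dist(points[i], points[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def agglomerative(points, distance_threshold):
    """Merge the closest clusters until none are closer than the threshold."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = _avg_linkage(clusters[x], clusters[y], points)
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] >= distance_threshold:
            break  # remaining clusters are all at least the threshold apart
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


points = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.2, 0.0)]
counts = [len(agglomerative(points, t)) for t in (0.05, 0.15, 0.5, 2.0)]
```

On these four toy points, the cluster count shrinks as the threshold grows, which mirrors the trade-off described above between many small clusters and a few coarse ones.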

D.3 Qualitative Analysis
Visualization of Semantic Construction. In Figure 11, we present the semantic clusters on each domain. We observe that a cluster on the source domain can group semantically similar visual instances and sentences, e.g. "bird", "parrot" and "owl". Meanwhile, the separation between clusters is also clear, e.g. "bird" vs "duck". Thus, the similar parts on the target domain can be well aligned to the source clusters. The visualization results demonstrate the effectiveness of our content-aware semantic modeling module in exploring the multi-modal contents and learning the semantic structure across domains.
Visualization of Attentive Extraction. In Figure 12, we depict the distribution of the visual-guided attention on the source domain and the predicted attention on the target domain. We find that the attention weights on both domains highlight the crucial words in the query, e.g. the actors "viper" and "bird", while suppressing the inessential parts, e.g. the descriptive words "under bush" and "in the sky".

Figure 1: (a) Example of query-based visual segmentation. (b) Illustration of cross-domain query-based visual segmentation. (c) Three domain discrepancies.

Figure 2: The illustration of the proposed Semantic-conditioned Dual Adaptation framework.

Figure 3: The illustration of the CSM module.

Figure 5: Impact of Temperature τ on two transfer tasks.

Figure 6: Impact of Boundary ρ on two transfer tasks.

Figure 7: The t-SNE visualization of the visual features on RVOS→A2D.

Figure 8: The segmentation results on RVOS→A2D and UNC→UNC+. Base is shown in the second row. SDA is shown in the third row.

Figure 9: The visualization of reciprocal masks.
(2) Appendix B introduces the technical components of our SDA framework, including the predictor (Appendix B.1) and the multi-label visual classification (Appendix B.2). (3) Appendix C introduces the experiment details, including the dataset details (Appendix C.1), the implementation details (Appendix C.2), and the baseline settings (Appendix C.3). (4) Appendix D presents extensive experiment results, including some discussions (Appendix D.1), more hyper-parameter analysis (Appendix D.2) and more qualitative results (Appendix D.3).

Figure 10: Impact of Distance Threshold on two transfer tasks.

Figure 12: The visualization of visual-guided attention on RVOS→A2D. A darker color means a higher attention score.

Table 1: Performance comparisons of four transfer tasks on three video datasets. O=Overall. M=Mean.

Table 2: Performance comparisons of two transfer tasks on three image datasets. We compute the mean average precision over different thresholds as mAP[0.5:0.95]. More details of dataset statistics and the implementation details are summarized in Appendix C.
(Lin et al., 2014)-Ref (Mao et al., 2016). Since the image datasets are mostly collected on MS-COCO (Lin et al., 2014), we conduct more experiments on the challenging video datasets.
Evaluation Metrics. Following prior works, we employ IoU (Intersection-over-Union) and mean average precision as metrics. For IoU, we compute the Overall IoU and the Mean IoU.
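The metrics can be sketched as follows, assuming the definitions commonly used in referring segmentation: Overall IoU divides the total intersection by the total union accumulated over all samples, Mean IoU averages the per-sample IoU, and mAP[0.5:0.95] averages, over the thresholds 0.50, 0.55, ..., 0.95, the fraction of samples whose IoU exceeds the threshold. These definitions are not spelled out in the text above, and the function names and the handling of an empty union are assumptions.

```python
def iou_metrics(preds, targets):
    """Return (overall IoU, mean IoU, per-sample IoUs) for binary masks
    given as nested lists of 0/1 values."""
    total_inter = total_union = 0
    per_sample = []
    for pred, target in zip(preds, targets):
        inter = union = 0
        for pred_row, target_row in zip(pred, target):
            for p, t in zip(pred_row, target_row):
                inter += p & t
                union += p | t
        total_inter += inter
        total_union += union
        # ASSUMPTION: a sample with an empty union counts as IoU 0.
        per_sample.append(inter / union if union else 0.0)
    overall = total_inter / total_union
    mean = sum(per_sample) / len(per_sample)
    return overall, mean, per_sample


def map_50_95(per_sample_iou):
    """Average precision@K over K = 0.50, 0.55, ..., 0.95."""
    thresholds = [k / 100 for k in range(50, 100, 5)]
    precisions = [
        sum(iou > k for iou in per_sample_iou) / len(per_sample_iou)
        for k in thresholds
    ]
    return sum(precisions) / len(precisions)
```

Overall IoU is dominated by samples with large regions, whereas Mean IoU weights every sample equally, which is why the two are reported side by side.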

Table 4: Ablation results about the CSM module.

Table 5: Ablation results about the DAB module.
employs the pre-trained classification model to preserve the semantic structure of compositional concepts from uni-modal data.Since both MAN and ACP are designed for global image-level features, we apply MAN by replacing the imagelevel pseudo labels with pixel-level pseudo labels, and apply ACP by replacing the image-level con-

Table 6: The comparison of different clustering algorithms.

Table 7: The comparison of content extraction: encoded features vs word embedding.

Table 8: Impact of memory size B on the RVOS→A2D task.

Table 9: Impact of thresholds γ_max and γ_min.