Difference-Masking: Choosing What to Mask in Continued Pretraining

The self-supervised objective of masking-and-predicting has led to promising performance gains on a variety of downstream tasks. However, while most approaches randomly mask tokens, there is strong intuition that deciding what to mask can substantially improve learning outcomes. We investigate this in the continued pretraining setting, in which pretrained models continue to pretrain on domain-specific data before performing some downstream task. We introduce Difference-Masking, a masking strategy that automatically chooses what to mask during continued pretraining by considering what makes a task domain different from the pretraining domain. Empirically, we find that Difference-Masking outperforms baselines in continued pretraining settings across four diverse language-only and multimodal video tasks.


Introduction
Inspired by the distributional hypothesis in the language domain (Harris, 1954), masking is a self-supervised learning (SSL) objective in which a model attempts to reconstruct hidden portions of data from the surrounding context. Masking has enabled breakthrough performance on tasks from a variety of domains, such as language, vision, and speech (Devlin et al., 2019; Li et al., 2021; Hsu et al., 2021; Ericsson et al., 2022), motivating interest in researching how masking strategies influence representation learning in SSL.
Masked prediction has recently been applied to adapt pretrained models to specific downstream tasks by continuing to pretrain models on in-domain unlabelled data (Dery et al., 2023). Masking in this continued pretraining setting has been shown to be particularly effective when the target domain differs substantially from the pretraining domain (Gururangan et al., 2020).
While prior work has studied how the amount masked influences model learning (He et al., 2022), most masking approaches randomly choose which parts of the data to mask. Although it is understudied in SSL, deciding what to mask is a critical component of human education (Pajares and Miller, 1997; Bjork and Linn, 2006). Educators designing "fill-in-the-blank" assessments for students must decide what content to mask in order to effectively assess student understanding of a domain (Bae and Lee, 2018). For example, in a real-world "fill-in-the-blank" chemistry test, a teacher might choose to mask domain-specific words ("density", "silicon") to assess student learning, instead of masking domain-irrelevant words ("example", "process").

In this paper, we propose DIFFERENCE-MASKING, a novel approach for automatically selecting what to mask during continued pretraining. Our strategy first identifies anchors that describe what makes a target domain different from the pretraining domain and then determines what to mask during continued pretraining based on similarity to those anchors.
In experiments spanning four diverse language-only and multimodal video datasets (ACL-ARC, ChemProt, TVQA, and Social-IQ), we find that DIFFERENCE-MASKING outperforms strong baselines, supporting our hypothesis that masking based on what is different about a task provides strong representations for continued pretraining. We provide intuitions to explain the strong performance of DIFFERENCE-MASKING, along with extensive analyses and ablations to better understand the performance of our method. Our code is publicly available.

Related Work
Masking relies on the distributional hypothesis, which posits that the meaning of a word can be inferred from its context (Harris, 1954). Masking in NLP has functioned as an effective SSL strategy when training models such as BERT (Devlin et al., 2019) and XL-Net (Yang et al., 2019). Although random masking has been more closely studied in NLP than non-random masking, there are three closely related works to ours from NLP.
EntityBERT (Lin et al., 2021) masks tokens based on whether they are part of "entities" recognized by a domain-specific pretrained named-entity recognizer. Salient Span Masking (SSM) (Guu et al., 2020) is a similar method that uses a named-entity-recognition model to mask out a single entity for the downstream task of open-domain QA. However, these approaches require a domain-specific pretrained entity-tagger, and the masking strategy they determine is the same for any domain to which that tagger is applied. In contrast, DIFFERENCE-MASKING determines what to mask without pretrained entity-taggers, and its masking strategy can change depending on the unlabelled data in the task domain.
Selective Masking (Gu et al., 2020) uses data from the downstream task to decide which tokens to mask during continued pretraining by estimating how much each token contributes to improved downstream task performance. It is important to note that Selective Masking uses supervised downstream task labels, whereas DIFFERENCE-MASKING is entirely self-supervised.
Prior work from the vision community has also contributed to an understanding of masking strategies, primarily by using the attention of the model during SSL training to determine what to mask. MST (Li et al., 2021) uses attention maps to determine "non-essential regions" to mask, while AttnMask (Kakogeorgiou et al., 2022) does the opposite by masking the most attended-to regions. Unlike DIFFERENCE-MASKING, these approaches do not take into account domain-specific information when determining their masking strategy. This can be an impediment to performance when the model's attentions do not already contain information about what is important in a given input sequence.

DIFFERENCE-MASKING
This section describes the motivation and implementation of DIFFERENCE-MASKING: our self-supervised method to determine a masking strategy for continued pretraining. The overall process is depicted visually in Figure 2.

Problem Setting
We are given a model which has been pretrained on multi-domain data drawn from domain distribution X_PT (e.g., a model such as RoBERTa pretrained on a large multi-domain corpus). We are interested in how to adapt this pretrained model to a specific target domain X_T without observing task labels Y.
Continuing pretraining on X_T has emerged as a popular approach to this problem (Gururangan et al., 2020; Dery et al., 2023).

Motivation and Notation
If the masking objective is used to train models to learn word representations (Harris, 1954; Devlin et al., 2019), a natural question emerges: which words is it most important for our models to learn to represent? We believe this question may be important for effectively continuing pretraining on specialized domains. We expect that continued pretraining can benefit from a masking strategy that considers what makes a task domain different.
This leads to the intuition behind DIFFERENCE-MASKING: to train on what makes a target domain different from the pretraining domain. For example, in a corpus about chemistry, we would expect that masking and predicting words strongly related to chemistry, such as "molecule", will lead to better learning outcomes than masking words such as "analysis", which could be related to chemistry in addition to many other domains.
Formally, we term X_{T∩PT} the concepts likely to appear in both X_T and X_PT (e.g., "analysis"), and we term X_{T/PT} the concepts that make the domain X_T different from X_PT (e.g., "molecule"). With this notation, we can now express our intuition in terms of mutual information with the downstream task Y: we intuit that concepts common in X_T but uncommon in X_PT (i.e., in X_{T/PT}) share higher mutual information with the task label than concepts found in both domains (X_{T∩PT}) do:

I(X_{T/PT}; Y) > I(X_{T∩PT}; Y)

The goal of DIFFERENCE-MASKING then is to learn representations during masking that capture the information unique to the domain (X_{T/PT}), which is more relevant for the downstream task.

Our Approach: DIFFERENCE-MASKING
To learn masked representations that capture the information unique to the domain (X_{T/PT}), our proposed DIFFERENCE-MASKING approach proceeds in two steps:

1. Finding difference anchors: We first determine which words are most commonly found in domain X_T and not commonly found in general domains X_PT. We term these words difference anchors that summarize the concepts unique to X_T.
2. Masking based on differences: Using these difference anchors, we determine the likelihood that each word should be masked based on its similarity to the anchors. We sample from this probability distribution to decide what to mask during MLM continued pretraining.
These steps are explained in detail in the following subsections.

Finding Difference Anchors: TF-ICF
Our goal is to determine a set of corpus-level difference anchors that are representative of the differences between the pretraining domain X_PT and the task domain X_T. Since our goal is to design a simple yet effective method for finding these differences, we use a modified version of the widely used TF-IDF scoring function from the field of statistical NLP (Jones, 1972). TF-IDF determines the ratio of how frequently a word appears in a document compared to how frequently the word appears in other documents in a corpus. Because we are attempting to find words that make a target corpus X_T different from general pretraining corpora X_PT, the score of a word is highest when it appears frequently in our corpus (X_T) and infrequently in the multi-domain pretraining corpus (X_PT). We denote our approach TF-ICF, for term-frequency, inverse-corpus-frequency, expressed by the following scoring function:

TF-ICF(w) = freq(w, X_T) / freq(w, X_PT)

To effectively capture word frequencies in the general distribution of the English language used for pretraining (X_PT), we use unigram counts derived from the Google Web Trillion Word Corpus (Brants and Franz, 2006; Norvig, 2009).
We score all words in X_T with this metric and choose the top K as anchors A to represent the domain, where K is a hyperparameter of our method. We analyze the impact of this hyperparameter in Section 5.3.
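The anchor-selection step can be sketched in a few lines. This is an illustrative reconstruction rather than the authors' released code: the ratio-based score, the add-one smoothing, and the `pretrain_counts` dictionary (standing in for Google Web Trillion Word Corpus unigram counts) are our assumptions.

```python
from collections import Counter

def tf_icf_anchors(target_corpus, pretrain_counts, pretrain_total, k=20):
    """Score each word by how frequent it is in the target corpus X_T
    relative to a general pretraining corpus X_PT, and return the
    top-K scoring words as difference anchors."""
    words = [w for doc in target_corpus for w in doc.split()]
    tf = Counter(words)
    total = sum(tf.values())
    scores = {}
    for w, count in tf.items():
        # Term frequency in the target domain X_T ...
        term_freq = count / total
        # ... divided by the word's frequency in the pretraining
        # distribution X_PT (smoothed so unseen words score highest).
        corpus_freq = (pretrain_counts.get(w, 0) + 1) / (pretrain_total + 1)
        scores[w] = term_freq / corpus_freq
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy target corpus and made-up general-corpus counts for illustration.
docs = ["the molecule binds the receptor", "the inhibitor blocks the enzyme"]
general = {"the": 1000, "binds": 2, "blocks": 2}
anchors = tf_icf_anchors(docs, general, pretrain_total=2000, k=3)
```

Domain-general words like "the" score low because their pretraining-corpus frequency dominates, while domain-specific words like "molecule" score high.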

Masking Based on Differences
DIFFERENCE-MASKING then masks words based on similarity to anchors A. Formally, we define the similarity between a word w and an anchor word A_k as the cosine similarity of the words' BERT (Devlin et al., 2019) embeddings. In order to choose words to mask, we generate a probability distribution α over the words in the sentence x to represent the probability that each word should be masked. We determine the weight α_i of each word w_i by calculating its similarity score with the most similar anchor word in A (we explore other strategies in our experiments). This value is normalized over the length of the sequence to ensure the probability distribution sums to 1.
DIFFERENCE-MASKING then masks terms by sampling from distribution α without replacement, and the model attempts to reconstruct the masked terms from the surrounding context.
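The masking step above can be sketched as follows. The random embeddings stand in for contextual BERT embeddings, and the clipping constant is our assumption to keep the sampled distribution valid; this is a sketch of the mechanism, not the paper's implementation.

```python
import numpy as np

def masking_distribution(token_embs, anchor_embs):
    """Score each token by its cosine similarity to the *nearest*
    anchor, then normalize scores into a probability distribution
    over the sequence."""
    def unit(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    sims = unit(token_embs) @ unit(anchor_embs).T   # (seq_len, num_anchors)
    scores = sims.max(axis=1)                       # nearest-anchor similarity
    scores = np.clip(scores, 1e-8, None)            # keep probabilities positive
    return scores / scores.sum()

def sample_mask(alpha, num_to_mask, rng):
    """Sample positions to mask from alpha without replacement."""
    return rng.choice(len(alpha), size=num_to_mask, replace=False, p=alpha)

rng = np.random.default_rng(0)
toks = rng.normal(size=(6, 8))      # placeholder token embeddings
anchors = rng.normal(size=(2, 8))   # placeholder anchor embeddings
alpha = masking_distribution(toks, anchors)
idx = sample_mask(alpha, num_to_mask=2, rng=rng)
```

The model would then be trained to reconstruct the tokens at the sampled positions from the surrounding context.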

Multimodal Implementation of DIFFERENCE-MASKING
To apply our method to the visual domain, we draw on work from the vision community in which visual representations are grouped at the object level (Baradel et al., 2018; Sajjadi et al., 2022) and use object labels (e.g., person, car) from a state-of-the-art object detector (Wang et al., 2021; Zhang et al., 2016) to calculate similarity with the anchor words. A detailed description of our implementation of DIFFERENCE-MASKING in the multimodal setting can be found in Appendix B.

Experimental Settings
Our experiments evaluate whether DIFFERENCE-MASKING's masking strategy leads to performance improvements on challenging language-only and multimodal video understanding tasks. We follow the experimental setting from Gururangan et al. (2020), in which unlabelled data from the downstream task domain is used for continued pretraining before downstream task finetuning. This is a popular SSL setting because it represents a computationally feasible way to test the effectiveness of self-supervised representation learning methods (e.g., without recreating a pretrained model), and it is realistic to modern approaches, which rely heavily on pretrained models (Dery et al., 2023).
Experiments are performed to allow each model to learn as long as needed during continued pretraining, stopping only when validation error increases (early stopping). Each result is averaged across five random seeds. Hyperparameter settings and data preprocessing details can be found in Appendix A.

Datasets
Language-only Datasets As in Gururangan et al. (2020) and Dery et al. (2023), we conduct experiments with the ChemProt dataset (Kringelum et al., 2016), a relation classification task over chemistry documents. ChemProt is a low-resource classification task with a large amount of in-domain unlabeled data, making it a realistic setting in which SSL is helpful in continued pretraining.
We also conduct experiments with the ACL-ARC task (Jurgens et al., 2018), a citation intent task based on the ACL Anthology Reference Corpus (Bird et al., 2008), used in continued pretraining experiments by Gururangan et al. (2020). We use the train, validation, and test splits for both datasets from Dery et al. (2023) and Gururangan et al. (2020).
Multimodal Datasets We also experiment with continued pretraining on two challenging multimodal video understanding tasks. TVQA (Lei et al., 2018) is a dataset containing 21,792 videos from 6 American television shows, with questions and answers related to the videos. Each question is paired with 5 answer choices (one correct answer and 4 incorrect answers), and corresponding video, audio, and subtitles.
Social-IQ (Zadeh et al., 2019) contains 1,250 videos of social situations, with questions and answers pertaining to the videos. Each question has corresponding video, audio, and subtitles. We use the train, validation, and test splits from the publicly available datasets.
Baselines

Random Masking is the standard domain-agnostic baseline, in which every token in a sequence of length N is equally likely to be masked. AttnMask (Kakogeorgiou et al., 2022) is a domain-agnostic token-based masking approach in which the likelihood of masking a given token is proportional to how attended-to that token is by the [CLS] token, averaged across the different heads of the transformer. Formally, this approach can be seen as defining a function g_att which takes in model f_θ, sequence of tokens x, and index i, and outputs how attended-to token x_i is.
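A minimal sketch of the g_att scoring function under our reading of this description (names and the toy attention tensor are ours, not the AttnMask authors' implementation):

```python
import numpy as np

def g_att(attn, cls_index=0):
    """Score each token by how attended-to it is by the [CLS] token,
    averaged across heads. `attn` has shape (num_heads, seq_len,
    seq_len); row `cls_index` holds [CLS]'s outgoing attention."""
    return attn[:, cls_index, :].mean(axis=0)

# Toy attention: 2 heads, 3 tokens; [CLS] attends mostly to token 2.
attn = np.array([
    [[0.1, 0.2, 0.7], [0.3, 0.3, 0.4], [0.2, 0.2, 0.6]],
    [[0.2, 0.1, 0.7], [0.4, 0.3, 0.3], [0.1, 0.3, 0.6]],
])
scores = g_att(attn)  # token 2 receives the highest score
```

Under AttnMask, the most attended-to tokens (highest g_att scores) are preferentially masked.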
MST (Li et al., 2021) is an approach very similar to AttnMask, except that it masks "non-essential regions", effectively corresponding to an inverse weighting based on the model's attention to the token x_i.
Selective Masking (Gu et al., 2020) chooses tokens to mask based on whether adding each token improves downstream task accuracy, as measured by the difference between downstream task performance when using the full sequence x versus using only the sequence up to and including the token x_i. Notably, this approach uses downstream task labels to guide the choice of mask in continued pretraining, whereas DIFFERENCE-MASKING is self-supervised.
DGA (Ke et al., 2023) is another relevant work that proposes a masking strategy for NLP model adaptation. However, unlike the methods described above, DGA chooses which attention heads to mask instead of which tokens to mask, assigning importance to attention heads based on the gradient of the loss between the model's representations of two differently-masked versions of the same input. Additionally, DGA encourages the model to learn integrated representations of the target domain and general knowledge using a contrastive loss.
EntityBERT (Lin et al., 2021) masks tokens based on whether they are part of "entities", as defined by a domain-specific named-entity-recognition (NER) model. The original paper uses the PubMedBERT model, originally trained on the clinical domain. We also implement Salient Span Masking (Guu et al., 2020), which in this case is the same as the EntityBERT approach applied to mask only a single word in the sentence. Applying these approaches to the ChemProt and ACL-ARC domains requires NER models that are effective in those domains.

Experimental Methodology
Language-only We reproduce the experimental setting from AANG (Dery et al., 2023), which employs a pretrained 110M RoBERTa base model with two heads: one for continued pretraining and one for the downstream task. Our hyperparameters and other detailed configuration notes are described in Appendix A.
Multimodal We conduct our multimodal experiments using a strong pretrained model: MERLOT-Reserve (Zellers et al., 2022), a large multimodal transformer pretrained with a contrastive multimodal prediction objective on a dataset of 20 million YouTube videos.
To experiment with masking strategies in the multimodal setting, we continually pretrain a 200M MERLOT-Reserve base model by masking-and-predicting visual patches. We evaluate the learned representation quality by freezing the model and finetuning only the linear classifier layer on the downstream task, following the methodology of Wilf et al. (2023).
A detailed description of our implementation of DIFFERENCE-MASKING in the multimodal setting can be found in Appendix B, and our hyperparameters can be found in Appendix A.

Comparison with Baseline Approaches
Our experiments compare our proposed DIFFERENCE-MASKING with established baselines including Random Masking (at the word and token level), AttnMask (Kakogeorgiou et al., 2022), MST (Li et al., 2021), Selective Masking (Gu et al., 2020), DGA (Ke et al., 2023), EntityBERT (Lin et al., 2021), and Salient Span Masking (Cole et al., 2023). The results are summarized in Table 1. We find that DIFFERENCE-MASKING shows strong results compared to baselines across language-only and multimodal video understanding tasks.
Notably, our approach demonstrates superior performance on the ACL-ARC dataset with an accuracy of 74.04%, a marked improvement over the random token baseline (63.74%) and a substantial improvement over the best baseline (Salient Span Masking, 71.94%). Our approach also surpasses Selective Masking (69.06%). This is surprising because Selective Masking uses downstream task labels to inform its masking strategy, whereas DIFFERENCE-MASKING is self-supervised.
Results on the ChemProt dataset are also encouraging: DIFFERENCE-MASKING achieves an accuracy of 83.94%, marginally better than all the baselines, including Random Masking (82.82%), AttnMask (83.53%), and EntityBERT (82.04%). Similarly to Selective Masking, the EntityBERT and DGA masking strategies were originally tested on much larger datasets, which may suggest a limitation of these methods in the low-resource continued pretraining setting.

DIFFERENCE-MASKING also demonstrates robust performance in multimodal settings. On the Social-IQ dataset, DIFFERENCE-MASKING achieves an accuracy of 71.37%, outperforming the Random Masking (69.05%), AttnMask (70.18%), and MST (68.37%) methods. We were unable to compare our approach with Selective Masking and EntityBERT on these datasets due to the language-only design of their entity taggers. In contrast, our method is not limited to the language domain and, in fact, performs well in the multimodal setting. On the TVQA dataset, DIFFERENCE-MASKING achieves an accuracy of 81.73%, outperforming the Random Masking approach substantially (73.75%) and the AttnMask approach marginally (81.57%).
These results highlight the effectiveness and versatility of the DIFFERENCE-MASKING approach across various language and multimodal datasets.

What is masked?
In this section, we investigate what is masked by DIFFERENCE-MASKING and its link to downstream task performance.
On the ACL-ARC task, we find that the most frequently masked words have an interesting grounding in human intuition. ACL-ARC is a citation intent task on a corpus of ACL papers. As the subject of ACL papers can vary widely, spanning multiple sub-domains and research fields, we were curious how DIFFERENCE-MASKING's masking strategy would handle this domain.
We found that the most frequently masked words closely aligned with the ACL paper submission tracks describing the high-level topic categories for papers. For example, some of the most frequently masked words were "learning", "information", "translation", "semantic", and "lexical". These words closely correspond to the submission tracks "Machine Learning for NLP", "Information Extraction", "Machine Translation", and "Semantics: Lexical". Since submission tracks for ACL can be seen as a set of topics that span the space of ACL papers, this supports our hypothesis that masked words chosen by DIFFERENCE-MASKING align with what makes this domain different.

Figure 3: The most frequently masked words chosen by the DIFFERENCE-MASKING algorithm across the ChemProt and ACL-ARC tasks. We find that for the ChemProt dataset, the masks we find automatically through unlabelled data partially recover the end task labels.
On the ChemProt task, we also found an interesting pattern in what was masked. The objective of the ChemProt task is to determine a type of relation corresponding to a type of biochemical interaction between entities in the text, where labels include words such as "activation", "inhibitor", and "antagonist". Interestingly, we find that some of the words DIFFERENCE-MASKING chooses to mask most often are the same words as the labels for the downstream task. This result is also visualized in Figure 3. The most-masked word by DIFFERENCE-MASKING is "activity", followed by "inhibited", "inhibitor", and "antagonist". This is a fascinating result because it suggests that, in masking what makes the ChemProt domain unique, DIFFERENCE-MASKING is determining a self-supervised objective that is highly similar to the downstream task without accessing the downstream task labels.
In the multimodal setting, we also find an interesting grounding of how DIFFERENCE-MASKING chooses masks in human intuition. Reasoning about social interactions is believed by many psychologists to rely heavily on understanding visual body language cues (De Stefani and De Marco, 2019; Keck et al., 2022). Social-IQ is designed to test this kind of social intelligence capability with subtle questions such as "How do the men in the room feel about each other?" and "Do the people in this video feel comfortable about the clown being there?". In contrast, TVQA tests more general video understanding with question and answer types, including those that target visual reasoning about non-human entities and non-visual reasoning specifically from the text or audio modalities.
As such, we would expect our continued pretraining strategy to prioritize masking tokens representing human body language more often in Social-IQ than in TVQA. We found that this was in fact the case. Interestingly, we found that the AttnMask baseline also picked up on a similar trend in its attempt to mask based on where attention already focuses, although the trend is much more pronounced in our approach.
The findings in Table 2 demonstrate that DIFFERENCE-MASKING chooses to mask substantially fewer visual tokens corresponding to people (as opposed to objects) in TVQA (40%) than in Social-IQ (90%). On the Social-IQ dataset, where the performance difference over the closest baseline is more pronounced (↑1.76% over AttnMask), the difference between the proportion of tokens masked from people by these approaches is also most pronounced (90% in DIFFERENCE-MASKING vs. 19% in AttnMask).

Sensitivity Analysis
Similarity Function As described in Section 3, DIFFERENCE-MASKING determines masking probabilities by comparing the anchor representations to the token representation. Because the token representation is a single vector and the anchors are a group of vectors, similarity can be defined in multiple ways. Table 1 shows results from the "nearest-neighbor" approach to determining similarity described in Section 3.5, motivated by the intuition that a domain can have many sub-domains, and if a token is close to any one of these concepts it should be prioritized for masking. For example, the ACL-ARC corpus has many sub-domains, including the over twenty different submission tracks described in Section 5.2. If a paper is about linguistics, it may be important to mask words similar to "language", whereas if a paper is heavy on ML theory, another anchor might be more appropriate to mask in order to best understand the work.
An alternative approach could be to determine scores by relation to the centroid of the anchor embeddings: in essence, determining whether the token in question is similar to the anchors in aggregate. We would expect this approach to perform similarly to ours on a narrowly defined dataset such as ChemProt, but substantially differently on a multi-domain dataset such as ACL-ARC. We evaluate this alternative in Table 3. We find that the nearest-neighbor strategy does, in fact, outperform the centroid strategy, especially on the ACL-ARC task. This supports our intuition that the nearest-neighbor strategy is particularly helpful when the domain is complex or peaky.
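The two aggregation strategies can be contrasted concretely. The toy unit vectors below are hypothetical; this sketches why a token sitting squarely in one sub-domain scores high under nearest-neighbor but is washed out by the centroid.

```python
import numpy as np

def anchor_similarity(token_emb, anchor_embs, strategy="nearest"):
    """Score one token against a set of anchors. 'nearest' takes the
    maximum cosine similarity over anchors; 'centroid' compares the
    token to the mean anchor embedding."""
    def unit(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    if strategy == "centroid":
        centroid = unit(anchor_embs.mean(axis=0))
        return float(unit(token_emb) @ centroid)
    sims = unit(anchor_embs) @ unit(token_emb)
    return float(sims.max())

# Two orthogonal anchors (two "sub-domains"); the token lies on one.
anchors = np.array([[1.0, 0.0], [0.0, 1.0]])
token = np.array([1.0, 0.0])
near = anchor_similarity(token, anchors, "nearest")   # 1.0
cent = anchor_similarity(token, anchors, "centroid")  # ~0.707
```

Under nearest-neighbor the token is a perfect match to its sub-domain's anchor, while the centroid dilutes the match with unrelated sub-domains.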

Number of Anchors
In considering the relationship between the anchors and the downstream task, we also investigate how the choice of the number of anchors (K) impacts downstream performance. We expect that too few anchors will not be expressive enough to determine a strong masking strategy, and too many anchors may begin to overfit to niche concepts that are not representative of the domain. We find that there is indeed a "sweet spot", and interestingly, it is the same for both datasets: K = 20. These results are visualized in Figure 4.
Figure 4: Performance on both tasks is best at the hyperparameter K = 20 anchors. We hypothesize that each task may have an optimal setting of this hyperparameter.

Conclusion
In this paper we introduce DIFFERENCE-MASKING, a method for identifying what makes a target domain unique and using this information to guide a strategy that chooses what to mask during SSL continued pretraining. We find that our method outperforms strong baselines across diverse language and multimodal video understanding tasks. We provide a detailed discussion of what is masked by DIFFERENCE-MASKING and why our method performs well on various tasks. The cross-task applicability of DIFFERENCE-MASKING supports the effectiveness of our framework for SSL pretraining in language, vision, and other domains.

Limitations
As described in Section 3, DIFFERENCE-MASKING is based on the intuition that it is more beneficial to mask based on what is unique (X_{T/PT}) about a downstream task's domain. However, it is challenging to find what makes a domain unique; therefore, our method is an approximation of X_{T/PT}. We believe future work may find it fruitful to investigate additional methods for approximating this, including modifications of the TF-ICF method we proposed. In Section 5, we provided intuition, empirical results, and analysis to understand why our method outperformed attention-masking baselines by a larger margin on Social-IQ than on TVQA. A broader investigation of why DIFFERENCE-MASKING during pretraining benefits some downstream tasks by a larger margin than others would be helpful to the community.

Ethics Statement
We believe that self-supervised learning is a promising direction for the machine learning community. This does not discount the salient arguments made about the social and environmental risks of large models (Bender et al., 2021; Strubell et al., 2019). We believe that works such as ours, which study SSL in a resource-constrained context, both increase access for those with limited compute resources and conform to a more environmentally sustainable way of doing research.

A Detailed Experimental Settings
In this section, we provide an overview of the experimental conditions used in our study. To ensure fair comparisons with our baselines, we maintain a consistent set of hyperparameters for both continued pretraining and finetuning. For language tasks, we largely adhere to the hyperparameters employed in Gururangan et al. (2020). Throughout our experiments, we maintain a masking ratio of 25% in both language and multimodal settings. We adopt a static masking strategy, replacing masked tokens with random values. We reproduce MERLOT-Reserve's original training on TVQA: we decompose samples in Social-IQ and TVQA from the form (Question, All Answers, Video Information) into a list of 3-tuples: (Question, Candidate Answer, Video Information). MERLOT-Reserve scores each candidate answer independently, given the question and video, and is trained with a loss that encourages the model to minimize the estimated likelihood of incorrect answers and maximize the likelihood of correct answers.
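The sample decomposition described above is a one-liner; the function and variable names here are ours, not the MERLOT-Reserve codebase's.

```python
def decompose(question, answers, video):
    """Decompose one (question, all-answers, video) sample into
    per-candidate 3-tuples so each answer can be scored
    independently."""
    return [(question, answer, video) for answer in answers]

# TVQA-style sample: one question, five candidate answers, one clip id.
sample = decompose("Who enters the room?",
                   ["Alice", "Bob", "Carol", "Dave", "Eve"],
                   "clip_001")
```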
From video frames, we extract 16x16 image patches, as determined by MERLOT-Reserve's backbone image transformer, ViT (Dosovitskiy et al., 2021), and mask at the patch level. The language experiments each took nine hours of runtime on a single 12GB GPU, and the multimodal vision experiments required six hours on a single TPU v2-8.

B Masking Video Tokens
Following the intuition from language, we hypothesize that masking and predicting small patches of an image may test local capabilities (e.g., determining what an eye looks like from the rest of the face) rather than global capabilities (e.g., determining what a person's face looks like from the rest of the scene, including other people's faces).
Accordingly, instead of masking low-level image patches, we mask groups of patches corresponding to a higher-level semantic entity: bounding boxes over objects in the image. We see this approach as a visual analogue of masking at the word level instead of the token level in our language experiments. We found that K = 1 performed much better than other values, where the selected anchor word was "person". We considered two possible bounding boxes associated with people: bounding boxes over faces and bodies. We evaluated both options and found that considering entire bounding boxes over people's bodies (including their faces) performed the best. These results are shown in Table 5. We extracted body detection coordinates using UniTrack (Wang et al., 2021) and face detection coordinates using MTCNN (Zhang et al., 2016).
Apart from the bounding box strategy, we also experimented with masking patches chosen by similarity between CLIP embeddings (Radford et al., 2021) of the anchor and the vision patch directly (without bounding box labels). Our experiments show that the CLIP-based masking strategy performs poorly compared to our bounding box strategy. One possible reason is that CLIP is not robust enough for video datasets, leading it to mask patches that are not relevant to the anchor word "person".

C Masking Language Tokens
In Section 4.3 we describe the motivation for using a word-level strategy in our implementation of DIFFERENCE-MASKING. An alternative implementation could assign each token in a word the same masking likelihood and mask tokens independently by this probability. This could result in some tokens from the same word being masked while others are not. Our intuition is that for specialized domains such as chemistry, subword tokens may be trivial to predict from their neighbors, but whole words may not be trivial to predict given the context. For example, a word such as "phosphates" would be tokenized into "phos" and "-phates". We expect that it may be trivial to predict "phos" given "-phates" or vice versa, but it may be hard (and may promote a better understanding of the task) to predict the word "phosphates" given the context.
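Word-level masking can be sketched as propagating one mask decision across all subwords of a word. The "##" continuation-prefix convention is a WordPiece-style assumption (real tokenizers differ), and keying selected words by their first subword is a simplification for illustration.

```python
def word_level_mask(tokens, words_to_mask, mask_token="[MASK]"):
    """Mask every subword of a selected word together, so the model
    cannot trivially reconstruct one subword from its siblings.
    `words_to_mask` is keyed by a word's first subword here, a
    simplifying assumption for this sketch."""
    out, mask_current = [], False
    for tok in tokens:
        if not tok.startswith("##"):        # first subword of a new word
            mask_current = tok in words_to_mask
        out.append(mask_token if mask_current else tok)
    return out

masked = word_level_mask(["the", "phos", "##phates", "bind"], {"phos"})
# masked == ["the", "[MASK]", "[MASK]", "bind"]
```

Both subwords of "phosphates" are hidden together, so the model must predict the whole word from the surrounding context rather than from its sibling subword.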
Empirically, we find that this decision improved performance substantially, as shown in the results in Table 7.

Figure 1 :
Figure 1: DIFFERENCE-MASKING automatically selects what to mask based on what makes the task domain different from the pretraining domain, enhancing model learning on the end task.
Figure 2: DIFFERENCE-MASKING: an approach to choosing what to mask during continued pretraining that prioritizes masking concepts that make the target domain different from the pretraining domain. DIFFERENCE-MASKING does this by first selecting anchor topics relating to the downstream task, and then by masking words or bounding boxes based on their similarity to those anchor topics.

Table 2 :
For each method, we analyze what percent of tokens are chosen to be masked from within bounding boxes over people as opposed to objects.

Table 5 :
Results of DIFFERENCE-MASKING on the multimodal video understanding benchmarks TVQA and Social-IQ. DIFFERENCE-MASKING leads to improvements of 8% and 2% accuracy over random masking.

Table 6 :
We validate our hypothesis that masking patches using DIFFERENCE-MASKING is more effective than masking using CLIP similarity.

Table 7 :
We validate our hypothesis that masking tokens using DIFFERENCE-MASKING at the word level is more effective than masking at the token level.