M2D2: A Massively Multi-Domain Language Modeling Dataset

We present M2D2, a fine-grained, massively multi-domain corpus for studying domain adaptation in language models (LMs). M2D2 consists of 8.5B tokens and spans 145 domains extracted from Wikipedia and Semantic Scholar. Using ontologies derived from Wikipedia and ArXiv categories, we organize the domains in each data source into 22 groups. This two-level hierarchy enables the study of relationships between domains and their effects on in- and out-of-domain performance after adaptation. We also present a number of insights into the nature of effective domain adaptation in LMs, as examples of the new types of studies M2D2 enables. To improve in-domain performance, we show the benefits of adapting the LM along a domain hierarchy; adapting to smaller amounts of fine-grained domain-specific data can lead to larger in-domain performance gains than larger amounts of weakly relevant data. We further demonstrate a trade-off between in-domain specialization and out-of-domain generalization within and across ontologies, as well as a strong correlation between out-of-domain performance and lexical overlap between domains.


Introduction
Even though they can contain a wide variety of different types of domains, the texts that make up the corpora used to train and evaluate language models (LMs) are often treated as if they are all the same. This makes it challenging to characterize LM performance under diverse data distributions and to understand how to effectively adapt LMs to new ones. To address these challenges, we develop M2D2, a Massively Multi-Domain Dataset, with 145 subdomains and a human-curated hierarchy for studying fine-grained domain adaptation. Prior work on domain transfer focuses on a small number of broad domains (typically 4-20; Gururangan et al., 2020; Gao et al., 2021; Gururangan et al., 2021). In contrast, domains in M2D2 are fine-grained and organized into a hierarchy derived from human-curated ontologies in Wikipedia (Figure 1) and Semantic Scholar (Figure 2). Unlike prior work, the fine granularity of M2D2 enables the study of transfer to naturally occurring, data-scarce domains recognized by human curators (e.g. Philosophy, Public Health, Transport). This hierarchy enables the study of domain transfer at varying levels of topic granularity. For instance, how should we combine widely available internet text (entire corpus), text on computer science (coarse domain), and a scarce corpus on machine learning (fine domain) to improve performance in the machine learning domain? To the best of our knowledge, M2D2 is the first dataset that combines fine domain granularity with a human-curated domain hierarchy in a massively multi-domain setting.

Using M2D2, we investigate the following questions, as examples of the broad classes of new questions that can be asked: (1) how well do coarse and fine domains transfer to each other across the hierarchy? (2) which features and aspects of a domain are important for transfer? (3) how important is domain specificity versus breadth? We perform preliminary experiments analyzing transfer between similar domains, disparate domains, and hierarchically related domains. Moreover, we explore how to select source domains to improve transfer performance.
We present baseline experiments using a GPT2 (Radford et al., 2019) language model. We find that (1) more specific data is often more important for performance than larger, less-specific data, as shown by our comparison of coarse-grained, fine-grained, and coarse-to-fine adaptation (in which coarse-to-fine performed best), (2) vocabulary overlap is a surprisingly good indicator of transfer, and (3) data source provenance is often a better predictor of transferability than ontology, perhaps indicating that a more multi-faceted definition of domain could be developed in future work.
Given the importance of fine-grained domains in language modeling, we hope that M2D2 will encourage the community to further study domain transfer: how do we identify hierarchical, fine-grained domains in naturally occurring text, and how do we leverage this fine-grained domain hierarchy to improve domain transfer?

M2D2
M2D2 consists of a large quantity of fine-grained domains. Unlike prior work that defines the domain of a corpus using its source (e.g. the web text domain; Chronopoulou et al., 2021), we derive domains from human-curated Wikipedia and arXiv ontologies. In this section, we describe how M2D2 is collected and organized.

Domain Organization
One of the unique properties of M2D2 is its hierarchical nature, enabling the study of transfer at different levels of domain granularity. We assume a particular corpus to have levels of hierarchy L_0, ..., L_K, where L_0 refers to the lowest or most coarse-grained/broad level (i.e. the whole dataset), and L_K refers to the highest or most fine-grained/specific level. A given domain D_j^i at level L_i is composed of multiple subdomains {D_0^{i+1}, ..., D_{N_{i+1}}^{i+1}}, which are represented in the next level of the hierarchy, L_{i+1}. Similarly, we assume that a given subdomain is contained within a larger domain.
For the rest of the paper, we use L1 and L2 to denote the two levels of this K-level hierarchy that we consider.
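As a concrete illustration, the two-level hierarchy can be represented as a mapping from L1 domains to their L2 subdomains. The domain names below are examples from the paper; the data structure itself is a sketch, not the released data format.

```python
# Minimal sketch of the two-level domain hierarchy: each L1 (coarse)
# domain maps to the L2 (fine-grained) subdomains it contains.
hierarchy = {
    "Computer Science": ["Computation and Language", "Machine Learning"],
    "Technology and Applied Sciences": ["Computing", "Engineering", "Transport"],
    "Mathematics": ["Topology", "Logic"],
}

def l1_of(l2_domain, hierarchy):
    """Return the L1 parent of a given L2 domain (each L2 has exactly
    one parent in the two-level hierarchy)."""
    for l1, l2s in hierarchy.items():
        if l2_domain in l2s:
            return l1
    raise KeyError(l2_domain)

print(l1_of("Machine Learning", hierarchy))  # Computer Science
```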

Dataset Collection
We collect M2D2 from two resources, Wikipedia and Semantic Scholar. This allows us to explore domain adaptation in a massively multi-domain setting among domains of varying granularity, while also allowing us to test whether our findings hold across different data sources.
Semantic Scholar We use the S2ORC corpus (Lo et al., 2020), a large corpus of English academic papers annotated with extensive metadata. Using this corpus, which is already categorized into L1-domains representing broader fields of academic research (e.g. Computer Science, Physics), we extract L2-domains by finding a given paper's respective arXiv category (e.g. "Computation and Language" ∈ Computer Science).

Wikipedia We crawl the Wikipedia ontology, which lists major categories contained within Wikipedia. Within these major categories, or L1-domains, we then look up the category pages within a given L1-domain and gather the respective L2-domains. This procedure yields a hierarchy of domains contained within Wikipedia. We then download the Wikipedia data dump, which we clean using wikiextractor, and assign a given page to its respective domain.
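The derivation of L1/L2 labels from a paper's arXiv category can be sketched as follows. The category strings are real arXiv identifiers, but the mapping table and function are illustrative assumptions, not the S2ORC schema or the paper's released code.

```python
# Sketch: derive (L1, L2) domain labels from an arXiv category string.
# arXiv categories take the form "archive.subject" (e.g. "cs.CL"); the
# archive prefix corresponds to a broad field (L1) and the full category
# string serves as the fine-grained L2 domain.
ARCHIVE_TO_FIELD = {  # illustrative subset, not the full mapping
    "cs": "Computer Science",
    "math": "Mathematics",
    "q-bio": "Quantitative Biology",
}

def domains_from_category(category):
    archive = category.split(".")[0]
    l1 = ARCHIVE_TO_FIELD[archive]
    return l1, category

print(domains_from_category("cs.CL"))  # ('Computer Science', 'cs.CL')
```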

Unique Properties
M2D2 has the following major unique properties when compared to previous domain adaptation datasets. First, it is massively multi-domain: we have 145 L2 domains grouped into 22 L1 domains, which allows us to test domain adaptation for language modeling on a variety of axes (such as hierarchy, subject matter, and ontology) that would be more difficult with more coarse-grained datasets. Second, M2D2 is hierarchical: this allows us to also test the performance of domain specificity versus domain breadth in more flexible adaptation settings.
We describe dataset statistics in Table 1, including dataset size (measured in MB/GB), token count (measured by whitespace tokenization), and the number of L2 domains within each L1 domain. M2D2 contains 8.5B tokens, with an average of 373 million tokens per L1 domain. Demonstrating the hierarchical nature of M2D2, we also list examples of L2 domains contained within the L1 domains (e.g. Computing ∈ Technology and Applied Sciences, Topology ∈ Mathematics), which are also shown graphically in Figures 1 and 2.

Dataset Splits
We split each domain into train, validation, and test sets. To prevent data leakage between domains when pages belong to two or more domains, we construct validation and test sets from pages that are not contained within any other domain on the same level of the hierarchy. For example, the page for "Biotechnology" overlaps in domain with both Biology ∈ Natural and Physical Sciences and Engineering ∈ Technology and Applied Sciences, so it would not be included in any evaluation set due to the potential for direct leakage. However, the page for "Computer" is only in Computing ∈ Technology and Applied Sciences, and therefore could be included in an evaluation set. We include at least 1 million tokens in each of the validation and test sets. This enables us to have a precise evaluation set of texts that belong to only a single fine-grained domain.
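The leakage rule above amounts to a simple filter: a page is eligible for the validation/test pool only if it belongs to exactly one domain at its level of the hierarchy. A sketch (the page-to-domains structure is hypothetical, used only to illustrate the rule):

```python
def eligible_eval_pages(page_domains):
    """page_domains: dict mapping page title -> set of same-level domains
    the page belongs to. Pages belonging to more than one domain are
    excluded from validation/test sets to avoid cross-domain leakage."""
    return {page for page, doms in page_domains.items() if len(doms) == 1}

pages = {
    "Biotechnology": {"Biology", "Engineering"},  # multi-domain: excluded
    "Computer": {"Computing"},                    # single-domain: eligible
}
print(eligible_eval_pages(pages))  # {'Computer'}
```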

Experiments
As examples of the types of new studies M2D2 enables, we explore a number of key questions about the nature of effective domain adaptation in language models. For example, how does one best specialize a language model to a domain, given an ontology? How well can adapted models be applied out-of-domain, within and across ontologies? What features of target domains are predictive of out-of-domain transfer?
In this section, we present a set of experiments that begin to answer these questions. First, we study the impact of adapting to the L1 and L2 domains of our dataset on in-domain (§3.2) and out-of-domain (§3.3) language modeling performance. Then, we perform an analysis of lexical features in domains that are predictive of out-of-domain performance (§3.4).

Experimental setup
In all experiments, we use the 112M-parameter GPT2 model (Radford et al., 2019) as the baseline. Our implementation is based on HuggingFace Transformers (Wolf et al., 2020) and PyTorch (Paszke et al., 2019). All adaptation is performed using Adam (Kingma and Ba, 2015) with a dropout of 0.2 (Srivastava et al., 2014), a learning rate of 5e-5, and a batch size of 64,000 tokens. We train all models for a maximum of 1 million iterations and perform early stopping over the validation set. All experiments are run on 8 NVIDIA V100 GPUs.
When adapting our GPT2 model to domains in M2D2, we use one of three settings:

L1 Adaptation We continue training on a given L1 domain (e.g. Computer Science).

L2 Adaptation We continue training on a given L2 domain (e.g. Machine Learning).

L1-to-L2 Adaptation Given an L2 domain (e.g. Machine Learning), we first perform L1 adaptation on its corresponding L1 domain (e.g. Computer Science), and then further perform L2 adaptation. This setting is similar to multi-stage adaptive pretraining approaches used for supervised tasks (Gururangan et al., 2020).
For all techniques, we evaluate perplexity on the test sets of L2 domains. Due to the large quantity of L2 domains, we aggregate L2 results by their corresponding L1 domain. For each ontology, we report the average and standard deviation (average ± s.d.) of perplexities across the L2 domains in each L1 domain.
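The aggregation described above can be sketched as follows. The perplexity values are made up for illustration, and population standard deviation is one assumption; the paper does not specify which variant it reports.

```python
import statistics

def aggregate(l2_ppl, hierarchy):
    """Average L2-domain perplexities (with standard deviation) within
    each L1 domain, mirroring the paper's reporting scheme."""
    out = {}
    for l1, l2s in hierarchy.items():
        vals = [l2_ppl[d] for d in l2s]
        out[l1] = (statistics.mean(vals), statistics.pstdev(vals))
    return out

hierarchy = {"Computer Science": ["cs.CL", "cs.LG"]}
l2_ppl = {"cs.CL": 20.0, "cs.LG": 24.0}  # illustrative perplexities
print(aggregate(l2_ppl, hierarchy))  # {'Computer Science': (22.0, 2.0)}
```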

In-Domain Results
The first set of experiments in this study considers the impact of adapting the language model to different levels of the M2D2 ontologies. We only consider in-domain perplexity, i.e. the perplexity of a model on the domain it is adapted to.
Adaptation improves in-domain performance despite pretraining. Table 2 shows test-set perplexities on L2 domains, averaged across each L1 domain, after performing each adaptation technique (see Appendix for full results). First, we observe that all proposed adaptation techniques improve performance over the base GPT-2 model. This highlights the effectiveness of adaptation in improving in-domain performance, even when considering domains that the language model has likely been exposed to during pretraining (as is the case with Wikipedia; L1 adaptation results in a 5.8 decrease in perplexity). For domains the language model is less likely to have been exposed to during pretraining, the effect is more pronounced (as is the case with S2ORC; L1 adaptation results in a 12.7 decrease in perplexity).

Table 3: Out-of-domain test perplexities, aggregated to each L1 domain. We examine the impact of the L2 and L1-to-L2 finetuning settings when compared to simply finetuning on L1. L2 Adaptation and L1-to-L2 Adaptation are generally less performant in out-of-domain settings than L1 Adapted models, given their in-domain specialization. The comparison between L1 and L2 is statistically significant (p < 0.01).
Specificity and hierarchy are more important than broad coverage in adaptation. Next, we observe that in most cases, adapting to L2 domains is more beneficial to in-domain performance than adapting to L1 domains. Adaptation to finer-grained domains better specializes a language model, even though these domains are much smaller than their L1 counterparts. Finally, we observe that L1-to-L2 adaptation further benefits in-domain performance over L2 adaptation in all cases. Our results suggest that adapting to smaller amounts of domain-specific data leads to more effective in-domain specialization than adapting to large quantities of data that may be only weakly domain-relevant. Moreover, the best results may be achieved by organizing the target domain into subsets of broader and fine-grained data, and adapting along this hierarchy. However, this approach has increased memory and computational requirements relative to solely relying on L1 Adaptation.

Out-of-Domain Results
We also study the effects of our adaptation techniques on out-of-domain performance, by performing zero-shot inference with adapted models on domains (e.g. Art) other than the ones they are adapted to (e.g. Machine Learning). We first transfer models between domains in the same ontology (e.g. Wikipedia → Wikipedia), and then across ontologies (e.g. Wikipedia → S2ORC).
L2 Adaptation decreases out-of-domain performance. We show out-of-domain performance for each adaptation technique in Table 3. While L2 and L1-to-L2 adaptation significantly improved in-domain performance, this comes with a tradeoff of reduced performance in both out-of-domain settings.

Table 5: Out-of-domain transfer performance between all L1 domains in the S2ORC portion of M2D2. "GPT2" refers to the zero-shot performance of the LM on our dataset.
Specific adaptation transfers better to related categories across ontologies. Although the two data sources in M2D2 differ considerably in style and content, their ontological categories partially overlap. For example, Mathematics and Art appear in both Wikipedia and Semantic Scholar. Is it possible to transfer between corresponding categories across ontologies?
To answer this question, we first manually align L1 domains from Wikipedia and Semantic Scholar with similar ontological categories (e.g., grouping Mathematics from Wikipedia with Mathematics from S2ORC). We then apply a model adapted to an L1 domain in a source ontology to its corresponding L1 domain in a target ontology. We compare this cross-ontology performance with two baselines: 1) the average out-of-domain performance of other L1 adapted models in the target ontology and 2) the in-domain performance of a model adapted to the target L1 domain.
Our results are displayed in Table 6. We observe that while L1 adapted models are effective at transferring to other domains within an ontology, they are less effective at transferring to corresponding domains outside an ontology. Surprisingly, in all cases, transferring outside an ontology performs even worse than using the base GPT-2 model with no additional adaptation. Moreover, the average out-of-domain performance of L1 adapted models generally outperforms cross-ontology performance, indicating that properties shared within an ontology (e.g. style) can be transferred. This analysis reveals a tradeoff between specialization and generalization: the more fine-grained the specialization of the language model, the less one can expect it to be applicable outside of the domain it was trained on. This effect increases as we move outside the ontology: models trained on one ontology are not useful in other ontologies, despite being trained on similar categories of data. These findings lead us to believe that domain adaptation should be studied from a multi-faceted perspective to exploit specific aspects of domain (e.g. style, content). Future work may look at reducing the tradeoff between highly domain-specialized models and out-of-domain performance, perhaps through ensembling or other approaches.

Lexical indicators of out-of-domain performance
Looking closer at the out-of-domain performance of L1 models, we see intuitive relationships between subject similarity and zero-shot out-of-domain transfer performance (Table 4). For example, the Society and Human Activities domains tend to transfer well to each other, whereas Religion and Mathematics do not transfer as well. These findings suggest that out-of-domain transfer is correlated with content overlap. In this section, we present some basic lexical indicators of out-of-domain performance which support this hypothesis.
Vocabulary overlap strongly correlates with transfer regardless of part-of-speech. Figure 4 shows the correlation between vocabulary overlap for a given part-of-speech tag (VERB, NOUN, ADJ) or for entities, and average out-of-domain performance on M2D2. We compute this by taking the top-k (k = 1000) most common words for a given domain which correspond to a given POS tag. For every domain, we then calculate the intersection of these most common words with those of the entirety of M2D2 and plot them against the L2-domain-averaged perplexity over the entire dataset. We use spacy (Honnibal and Montani, 2017) for both entity recognition and POS tagging. We find that vocabulary overlap is a strong predictor of transfer performance regardless of part-of-speech, perhaps indicating its relevance in transfer between fine-grained domains.
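The core overlap measure can be sketched as follows. The toy corpora, the whitespace tokenizer, and the small k are simplifications relative to the paper's spacy-based, POS-filtered setup.

```python
from collections import Counter

def top_k_vocab(tokens, k):
    """Set of the top-k most common tokens in a domain's text."""
    return {w for w, _ in Counter(tokens).most_common(k)}

def vocab_overlap(tokens_a, tokens_b, k=1000):
    """Fraction of domain A's top-k vocabulary that also appears in
    domain B's top-k vocabulary."""
    a, b = top_k_vocab(tokens_a, k), top_k_vocab(tokens_b, k)
    return len(a & b) / len(a)

src = "neural network training loss gradient".split()
tgt = "neural network inference latency gradient".split()
print(vocab_overlap(src, tgt, k=5))  # 0.6
```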
Related domains mostly transfer domain-specific tokens. We analyse domain adaptation at a token level to characterize what different adaptation settings transfer. Specifically, we measure which tokens are most impacted in terms of per-word perplexity when we finetune on a domain-specific corpus. We do this by taking the difference between the softmax-normalized probabilities of predicting a given word in a given domain under two models adapted to different corpora.
We compare S2ORC adapted models in four settings: the two best-transferred L1 domains (a proxy for similar domains; easy transfers), the two worst-transferred L1 domains (a proxy for distant domains; difficult transfers), L1-to-L2 Adaptation (hierarchical domain transfer), and no adaptation (zero-shot performance of the base LM). In Table 7, we show the distribution between domain-specific terms (terms that appear less than 0.00001% of the time in any other domain) and non-domain-specific terms among the top 1000 most adapted words. Finally, we show representative samples of tokens with the greatest change after adaptation. We find that the most changed tokens in easy transfers (e.g. Statistics and Computer Science) are non-domain-specific words (such as the), but harder transfers include words that are more domain-specific (such as Blockchain).
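The token-level comparison can be sketched as follows: for each vocabulary item, take the difference of the two models' softmax-normalized probabilities and rank tokens by the magnitude of change. The three-token vocabulary and the logits are illustrative stand-ins for real model outputs.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def most_shifted_tokens(vocab, logits_before, logits_after, top=2):
    """Rank tokens by absolute change in softmax probability between
    two models (e.g. before and after domain adaptation)."""
    p0, p1 = softmax(logits_before), softmax(logits_after)
    deltas = sorted(zip(vocab, (abs(a - b) for a, b in zip(p1, p0))),
                    key=lambda t: -t[1])
    return [w for w, _ in deltas[:top]]

vocab = ["the", "blockchain", "cat"]
# Adaptation raises the logit of the domain-specific token "blockchain".
print(most_shifted_tokens(vocab, [2.0, 0.0, 0.0], [2.0, 1.5, 0.0]))
# ['blockchain', 'the']
```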
Summary Our preliminary analyses suggest that simple lexical characteristics of domains are strong indicators of how well an adapted model may generalize. Developing computationally inexpensive indicators of transfer (such as lexical overlap) is important for selecting, from a large set of candidate corpora, the best one for adaptation to a target domain. This would allow one to approximately find the best corpus without the computational overhead of adapting to all candidate corpora.
Related Work

Domain Adaptation Techniques Gururangan et al. (2020) show that pretrained language models can be adapted to new domains by continued pretraining on domain-specific corpora. Chronopoulou et al. (2021) and Gururangan et al. (2021) build upon this work by using hierarchically constructed domain-specific adapters/experts (Houlsby et al., 2019). Another line of work in domain generalization is to simply scale model pretraining on a corpus containing different domains (e.g. GitHub, PubMed), as done with GPT-J (Wang and Komatsuzaki, 2021) and the Pile (Gao et al., 2021). Dery et al. (2021) also look to bridge these approaches by learning a task/domain-specific mixture of tasks. Overall, however, much of this work (Daumé III, 2007; Ruder et al., 2017; Ruder and Plank, 2018; Gururangan et al., 2020; Ramponi and Plank, 2020; Gururangan et al., 2021; Chronopoulou et al., 2021) fits in a paradigm in which a base model is trained further on domain-specific corpora and then tested on tasks within that domain (e.g. abstract sentence role classification (Bird et al., 2008) for the scientific domain). M2D2 is complementary to these works in providing a testbed for fine-grained and hierarchical adaptation across a large quantity of domains.

Domain Adaptation Datasets
One approach toward improved pretrained language models is building large-scale pretraining datasets that contain a diverse set of domains, such as the Pile (Gao et al., 2021). Overall, this emphasis has led to improved performance in various domains, especially with large-scale pretrained language models such as GPT-J (Wang and Komatsuzaki, 2021). Another line of work documents large-scale web-crawled datasets, so practitioners and researchers can be more informed and mindful of the data used (Dodge et al., 2021). Our work extends this thread with a massively multi-domain corpus with a manually curated ontology that can be used to study fine-grained and hierarchical domain transfer.

Conclusion
We developed M2D2, a new massively multi-domain language modeling dataset for studying domain adaptation in language models. M2D2 consists of 145 fine-grained domains (curated from Wikipedia and Semantic Scholar) that are hierarchically organized using domain-specific ontologies. Using M2D2, we find that domain precision is more important than data quantity for improving in-domain performance, and we demonstrate a tradeoff between specialization and out-of-domain generalization. We release M2D2 publicly to spur further research on building effective language models for highly heterogeneous data.

Limitations
In this work, we only consider adaptation techniques that assume domains are monolithic and non-overlapping. Future work may instead explore modeling the data as a mixture of domains, which may improve out-of-domain performance. In addition, M2D2 only covers two data sources (Wikipedia and Semantic Scholar). Future work could expand this corpus with ontologies from other data sources, such as Reddit, which have fine-grained and hierarchical domains. Moreover, data sourced from the web may contain hate speech and other harmful content, which may be reproduced by language models adapted to such data. The data sources we use adhere to research-friendly data licenses, but training models on web-curated data while maintaining the rights of authors as data subjects and creators remains an open problem.

Wikipedia
Culture and Humanities, Games and Toys, Mass media, Performing arts, Sports and Recreation, The arts and Entertainment, Visual arts, Further research tools and topics, Reference works, Exercise, Health science, Human medicine, Nutrition, Public health, Self care, By continent, By period, By region, Human activities, Impact of human activity, Fields of mathematics, Logic, Mathematics, Biology, Earth sciences, Nature, Physical sciences, Philosophy, Thinking, Allah, Belief systems, Major beliefs of the world, Social sciences, Society, Agriculture, Computing, Engineering, Transport

Figure 1: Visualization of the two-level fine-grained domain hierarchy in the Wikipedia portion of M2D2.

Figure 2: Visualization of the hierarchies contained within the S2ORC portion of M2D2.

Figure 3: The types of domain adaptation that we consider in this work: L1, L2, and L1-to-L2 adaptation. Here, we use "Technology and Applied Sciences" to illustrate our L1 domain and "Computing" to illustrate our L2 domain. Bold arrows refer to adaptation steps, and dotted lines refer to an evaluation phase.

Table 1: Dataset statistics for M2D2. We list L1 domains with their corresponding sizes, number of L2 domains, number of tokens, and examples of L2 domains. † These domains did not have any subdomains in the arXiv ontology.

Table 4: Out-of-domain transfer performance between all L1 domains (using abbreviations from Table 1) in the Wikipedia portion of M2D2. We use the first four letters of each domain name as its abbreviation. The x-axis shows evaluation domains, and the y-axis shows training domains.

Table 6: Transfer performance between corresponding domains (Math↔Mathematics and Logic (Math), Computer Science↔Technology and Applied Sciences, Art↔Culture and the Arts, etc.) in both ontologies. Provenance is a stronger indicator of transfer performance on M2D2 than ontological correspondence.

Table 7: Average percentage of tokens transferred in-domain and out-of-domain. Examples are taken from

Table 10: All domains contained within M2D2.