General-to-Specific Transfer Labeling for Domain Adaptable Keyphrase Generation

Training keyphrase generation (KPG) models requires a large amount of annotated data, which can be prohibitively expensive and is often limited to specific domains. In this study, we first demonstrate that large distribution shifts among different domains severely hinder the transferability of KPG models. We then propose a three-stage pipeline that gradually guides KPG models' learning focus from general syntactical features to domain-related semantics, in a data-efficient manner. With Domain-general Phrase pre-training, we pre-train Sequence-to-Sequence models with generic phrase annotations that are widely available on the web, which enables the models to generate phrases in a wide range of domains. The resulting model is then applied in the Transfer Labeling stage to produce domain-specific pseudo keyphrases, which help adapt models to a new domain. Finally, we fine-tune the model with a limited amount of data with true labels to fully adapt it to the target domain. Our experimental results show that the proposed process can produce good-quality keyphrases in new domains and achieve consistent improvements after adaptation with limited in-domain annotated data. All code and datasets are available at https://github.com/memray/OpenNMT-kpg-release.


Introduction
The last decade has seen major advances in deep neural networks and their applications in natural language processing. Particularly, the subarea of neural keyphrase generation (KPG) has made great progress with the aid of large language models (Lewis et al., 2020) and large-scale datasets (Meng et al., 2017a; Yuan et al., 2020a). Due to the high cost of data annotation, most, if not all, of the large-scale KPG datasets are constructed by scraping domain-specific data from the internet. For example, Meng et al. collected more than 500k scientific papers whose keyphrases are provided by the paper authors. Gallina et al. crawled about 280k news articles from the New York Times with editor-assigned keyphrases. Following Gururangan et al. (2020), we use "domain" to denote a distribution over language characterizing a given topic or genre. Specifically in KPG tasks, domains can be "computer science papers", "online forum articles", "news", etc.
Although recent neural models can, to some extent, learn KPG skills from existing datasets (Meng et al., 2021a; Gallina et al., 2019; Yuan et al., 2020a), most of these datasets are limited to a single domain, so it remains unclear how the trained models can be transferred to new domains, especially in a real-world setting. Some existing studies claim that their models demonstrate a certain degree of transferability across domains. For instance, Meng et al. show that models trained with scientific paper datasets can generate keyphrases of decent quality from news articles in a zero-shot manner. Xiong et al. show that training with open-domain web documents can improve the model's generalizability. However, there is a lack of systematic studies on domain transfer in KPG, and thus the observations reported in prior works do not support a comprehensive understanding of this topic.
To investigate this question, we conduct an empirical study on how well KPG models can transfer across domains. We utilize commonly used KPG datasets covering four different domains (Science, News, Web, Q&A). We first show experiment results (§2.2) that suggest models trained with data in a specific domain do not generalize well to other domains, even in cases where they are initialized with pre-trained language models such as BART (Lewis et al., 2020). We also visualize the domain gaps among datasets by inspecting their phrase overlaps. Keyphrases often represent the specific knowledge of a domain, and this may result in the failure of transferring models across domains.
The empirical study motivates us to explore novel methods that help models generate high-quality keyphrases and, more importantly, adapt quickly to a new domain with a limited amount of annotation. We propose a three-stage training pipeline, in which we gradually guide a KPG model's learning focus from general syntactical features to domain-specific information. First, we pre-train the model using community-labeled phrases in Wikipedia (§3.1). Then, we use a novel self-training-based domain adaptation method, namely Transfer Labeling, to adapt the model to the new domain. Note that this domain adaptation method does not require ground-truth labels: we leverage the model pre-trained in the previous stage to generate pseudo-labels for training itself. Finally, we use a limited amount of in-domain data with true annotations to fully adapt the model to the new domain. We report extensive experiment results and thorough analyses to demonstrate the effectiveness of the proposed methods.

Background and Motivation
2.1 Background
Keyphrase Generation (KPG). Typically, the task is to generate a set of keyphrases $P = \{p_1, \ldots, p_n\}$ given a source text $t$. Semantically, these phrases summarize and highlight important information contained in $t$, while syntactically, each keyphrase may consist of multiple words and serve as a component of a sentence. Depending on the domain the source text belongs to (e.g., scientific paper, news) and the downstream application (e.g., article classification, information retrieval), the extent to which a phrase is important can vary, i.e., the criteria for keyphrases can differ across datasets. Following Meng et al., we denote a keyphrase as present if it is a sub-string of the source text, or as absent otherwise. We adopt the One2Seq training paradigm (Yuan et al., 2020a). Given a source text $t$ and a set of ground-truth keyphrases $P$, we concatenate all ground-truth keyphrases into a single string: <bos> $p_1$ <sep> ... <sep> $p_n$ <eos>, where <bos>, <sep>, and <eos> are special tokens. This string is paired with $t$ to train a sequence-to-sequence model. We refer readers to Meng et al. (2021a) for more details on common KPG practice.
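The present/absent distinction and the One2Seq target format described above can be illustrated with a minimal sketch. The function names are hypothetical, and the case-insensitive sub-string match is a simplifying assumption (practical implementations often tokenize and stem before matching).

```python
from typing import List, Tuple

# Special tokens named in the text.
BOS, SEP, EOS = "<bos>", "<sep>", "<eos>"


def split_present_absent(text: str, phrases: List[str]) -> Tuple[List[str], List[str]]:
    """Split phrases into present (sub-string of the text) and absent (otherwise).

    Assumption: matching is case-insensitive and done on raw strings.
    """
    lowered = text.lower()
    present = [p for p in phrases if p.lower() in lowered]
    absent = [p for p in phrases if p.lower() not in lowered]
    return present, absent


def one2seq_target(phrases: List[str]) -> str:
    """Concatenate all keyphrases into one target string: <bos>p1<sep>...<sep>pn<eos>."""
    return BOS + SEP.join(phrases) + EOS


if __name__ == "__main__":
    doc = "We study neural keyphrase generation for scientific papers."
    gold = ["keyphrase generation", "domain adaptation"]
    print(split_present_absent(doc, gold))
    print(one2seq_target(gold))
```

The resulting string is what would be paired with the source text as the decoder target.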

Domain Gap in KPG Tasks
Previous studies have touched on how much KPG models can transfer their skills when applied across domains (Meng et al., 2017a; Xiong et al., 2019a), but not in a systematic way. In this subsection, we revisit this topic and try to ground our discussion with thorough empirical results. Specifically, we consider four broadly used datasets in the KPG community: KP20k (Meng et al., 2017a) contains scientific papers in computer science; OpenKP (Xiong et al., 2019a) is a collection of web documents; KPTimes (Gallina et al., 2019) contains a set of news articles; StackEx (Yuan et al., 2020a) consists of community-based Q&A posts collected from StackExchange. All four datasets are large enough to train KPG models from scratch. At the same time, the documents in these datasets cover a wide spectrum of domains. We report statistics of these four datasets in appendix Table 7. On the model dimension, we consider two model architectures: TF-Rand, a 6-layer encoder-decoder Transformer with random initialization (Vaswani et al., 2017); and TF-Bart, a 12-layer Transformer initialized with BART-large (Lewis et al., 2020). We train the two models on the four datasets individually and subsequently evaluate all the resulting eight checkpoints on the test split of each dataset. As shown in Figure 1, in-domain scores (i.e., trained and tested on the same dataset) are placed along the diagonal, and the other elements represent cross-domain testing scores. We observe that both models exhibit a large gap between in-domain and out-of-domain performance. Even though initialization with BART can alleviate the gap to a certain degree, the difference remains significant.
Keyphrases are typically concepts or entities that represent important information of a document. The collection of keyphrases in a domain can also be deemed a representation of domain knowledge. Therefore, to better investigate the domain gaps, we further look into the keyphrase overlap between datasets. As shown in Table 1, only a small proportion of phrases are shared among the four domains. We provide a T-SNE visualization of a set of phrases sampled from these datasets in appendix Figure 8; the phrase clusters present clear domain gaps in their semantic space. We hypothesize that the domain-specific traits in annotated data make it difficult for models to learn keyphrase patterns in a domain-general sense. Furthermore, humans may label keyphrases under an application-oriented consideration, and thus a one-size-fits-all standard for keyphrase annotation may not exist. For example, on StackExchange, users tend to assign common tags to better expose their questions to community experts, resulting in a small keyphrase vocabulary size. On the contrary, the topics are more specialized in scientific papers, and authors tend to emphasize novel concepts in their studies. This may explain the large number of unique keyphrases found in KP20k.
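To make the notion of keyphrase overlap concrete, the sketch below computes a pairwise overlap matrix between keyphrase vocabularies. It is only an illustration: the exact overlap statistic behind Table 1 is not specified here, so the directed ratio used below (the fraction of unique phrases of one dataset that also occur in another) is an assumption, as are the function names and the toy data.

```python
from typing import Dict, Iterable, List, Set


def unique_phrases(keyphrase_lists: Iterable[List[str]]) -> Set[str]:
    """Collect the lower-cased unique keyphrases over all documents of a dataset."""
    return {p.strip().lower() for phrases in keyphrase_lists for p in phrases if p.strip()}


def overlap_ratio(source: Set[str], target: Set[str]) -> float:
    """Fraction of unique phrases in `source` that also appear in `target`."""
    return len(source & target) / len(source) if source else 0.0


def overlap_matrix(vocabularies: Dict[str, Set[str]]) -> Dict[str, Dict[str, float]]:
    """Directed overlap ratio for every ordered pair of datasets."""
    return {
        src: {tgt: overlap_ratio(v_src, v_tgt) for tgt, v_tgt in vocabularies.items()}
        for src, v_src in vocabularies.items()
    }


if __name__ == "__main__":
    # Toy keyphrase annotations, for illustration only.
    toy = {
        "science": unique_phrases([["Neural Network", "cloud"], ["python"]]),
        "qa": unique_phrases([["python", "java"], ["cloud", "c++"]]),
    }
    for src, row in overlap_matrix(toy).items():
        print(src, row)
```

Because the ratio is directed, a dataset with a small tag vocabulary can be largely covered by a dataset with a large one while the reverse ratio stays low.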

Disentanglement of "Key" and "Phrase"
In §2.2, we empirically show that KPG models do not adequately transfer to out-of-domain data, even when initialized with pre-trained language models. However, data annotation for every single domain or application does not seem practical either, due to the high cost and the potential need for domain-specific annotators. Inspired by prior works, we attempt to disentangle the important properties of a keyphrase into keyness (Bondi and Scott, 2010; Gabrielatos, 2018) and phraseness (Tomokiyo and Hurst, 2003). We believe a proficient KPG model should generate outputs that satisfy both properties.
Keyness refers to how well a phrase represents important information of a piece of text. The degree of keyness can be document dependent and domain dependent. For example, while "cloud" is a common keyphrase in Computer Science papers, it is, in most cases, less likely to be important in Meteorology studies. Due to its high dependence on domain-specific information, we believe that the notion of keyness is more likely to be acquired from in-domain data.
Phraseness, on the other hand, focuses more on the syntactical aspect. It denotes the extent to which a short piece of text, even without taking its context into account, can function grammatically as a meaningful unit. Although the majority of keyphrases in existing datasets are noun phrases (Chuang et al., 2012), they can appear in various grammatical forms in the real world (Sun et al., 2021). We believe that phraseness can be independent of domains and thus can be obtained from domain-general data.

Methodology
In the spirit of the motivation discussed above, we propose a three-stage training procedure in which a model gradually moves its focus from learning domain-general phraseness towards domain-specific keyness, and eventually adapts to a new domain with only a limited amount of annotated data. An overview of the proposed pipeline is illustrated in Figure 2. First, in the Pre-Training stage (PT), the model is trained with domain-general data to learn phraseness (§3.1). Subsequently, in the Domain Adaptation stage (DA), the model is exposed to unlabeled in-domain data. Within a few iterations, the model labels the data itself and uses the resulting pseudo labels to gradually adapt to the new domain (§3.2). Lastly, in the Fine-Tuning stage (FT), the model fully adapts itself to the new domain by leveraging a limited amount of in-domain data with true annotations (§3.3). In this section, we describe each of the three stages in detail.

Domain-General Phrase Pre-training
The first training stage aims to capture phraseness in general. To this end, we leverage Wikipedia text and the community-labeled phrases it contains.
Wikipedia is an open-domain knowledge base that contains rich entity-centric annotations. Its articles cover a wide spectrum of topics and domains, and thus it has been extensively used as a resource of distant supervision for NLP tasks related to entities and knowledge (Ghaddar and Langlais, 2017; Yamada et al., 2020; Xiong et al., 2019b). In this work, we consider four types of markup patterns in Wikipedia text to form distant keyphrase labels:
• in-text phrases with special formatting (italic, boldface, and quotation marks);
• in-text phrases with wikilinks (denoting an entity in Wikipedia);
• "see also" phrases (denoting related entities);
• "categories" phrases (denoting superordinate entities).
Although the targets constructed using the above heuristics can be noisy with respect to keyness, we show that they work sufficiently well for training general phrase generation models. Given a piece of Wikipedia text t and a set of community-labeled phrases, we convert this data point to the One2Seq format described in §2.1. In practice, the number of phrases within t can be large, and thus we sample a subset of them to form the target. We group all the phrases that appear in t as present candidates; the rest (e.g., "see also" and "categories" phrases) are grouped as absent candidates. Additionally, we take several random spans from t as infilling candidates (similar to Raffel et al. (2020)) for robustness. Finally, we sample a few candidates from each group and concatenate them as the final target sequence.
On the source side, we prepend a string indicating the cardinality of phrases in each target group to the beginning of t. We also corrupt the source sequence by replacing a small proportion of present and infilling phrases with a special token [MASK], which is expected to improve models' robustness (Raffel et al., 2020). We show an example of a processed Wikipedia data instance in Figure 3.
Trained with this data, we expect a model to become a general phrase generator: given a source text, the model can generate a sequence of phrases, regardless of the specific domain the text belongs to.
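The construction of a pre-training instance described above can be sketched as follows. This is an illustration under stated assumptions rather than the released preprocessing code: the function names, the format of the cardinality prefix (e.g., `<present>2 <absent>2 <infill>2`), the number of sampled candidates per group, and the masking ratio are all hypothetical.

```python
import random
from typing import List, Sequence, Tuple

MASK = "[MASK]"


def sample_spans(tokens: Sequence[str], n_spans: int, max_len: int, rng: random.Random) -> List[str]:
    """Draw random token spans from the text to serve as infilling candidates."""
    if not tokens:
        return []
    spans = []
    for _ in range(n_spans):
        length = rng.randint(1, min(max_len, len(tokens)))
        start = rng.randint(0, len(tokens) - length)
        spans.append(" ".join(tokens[start:start + length]))
    return spans


def build_example(text: str, present: Sequence[str], absent: Sequence[str],
                  rng: random.Random, k: int = 2, mask_ratio: float = 0.5) -> Tuple[str, str]:
    """Build one (source, target) pair from a Wikipedia passage and its phrases."""
    # Sample a few candidates from each group (present, absent, infilling).
    groups = {
        "present": rng.sample(list(present), min(k, len(present))),
        "absent": rng.sample(list(absent), min(k, len(absent))),
        "infill": sample_spans(text.split(), k, 3, rng),
    }
    # Corrupt the source: mask a proportion of present and infilling phrases.
    source = text
    for phrase in groups["present"] + groups["infill"]:
        if rng.random() < mask_ratio:
            source = source.replace(phrase, MASK)
    # Hypothetical cardinality prefix, prepended to the source text.
    prefix = " ".join(f"<{name}>{len(items)}" for name, items in groups.items())
    # One2Seq target: all sampled phrases concatenated with <sep>.
    phrases = [p for items in groups.values() for p in items]
    target = "<bos>" + "<sep>".join(phrases) + "<eos>"
    return f"{prefix} {source}", target


if __name__ == "__main__":
    passage = "The Transformer is a deep learning architecture based on attention"
    src, tgt = build_example(passage, ["Transformer", "deep learning", "attention"],
                             ["Neural networks", "Machine learning models"], random.Random(0))
    print(src)
    print(tgt)
```

Resampling the candidates each time a passage is visited would yield different targets for the same passage, which is one way the sampling step could contribute to robustness.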

Domain Adaption with Transfer Labeling
In the second stage, we aim to expose the model to data from a domain of interest, so that it can learn the notion of domain-specific keyness. We propose a method, namely General-to-Specific Transfer Labeling, which does not require any in-domain annotated data. Transfer labeling can be considered a special self-training method (Yarowsky, 1995; Culp and Michailidis, 2008; Mukherjee and Awadallah, 2020), where the key notion is to train a model with its own predictions iteratively.
Distinct from the common practice of self-training, where initial models are bootstrapped with annotated data, transfer labeling regards the domain-general model from the pre-training stage (§3.1) as a qualified phrase predictor. We directly transfer the model to documents in a new domain to predict pseudo labels. The resulting phrases, paired with these documents, are used to tune the model so as to adapt it to the target domain distribution. Note that this process can be run iteratively to gradually adapt models to target domains.
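The iterative procedure described above can be summarized by a short, model-agnostic sketch. The `predict` and `train` callables are placeholders for the actual decoding and training routines, and the toy stand-ins in the usage example exist only to make the loop runnable; none of these names come from the released code.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Model = TypeVar("Model")
PseudoPair = Tuple[str, List[str]]  # (document, pseudo keyphrases)


def transfer_labeling(
    model: Model,
    documents: Sequence[str],
    predict: Callable[[Model, str], List[str]],
    train: Callable[[Model, List[PseudoPair]], Model],
    iterations: int = 1,
) -> Model:
    """Adapt a domain-general phrase model to unlabeled in-domain documents.

    In each iteration the current model labels every document, and the
    resulting (document, pseudo-label) pairs are used to train the model.
    """
    for _ in range(iterations):
        pseudo_labeled = [(doc, predict(model, doc)) for doc in documents]
        model = train(model, pseudo_labeled)
    return model


if __name__ == "__main__":
    # Toy stand-ins: the "model" is just a counter of training examples seen.
    def toy_predict(model: int, doc: str) -> List[str]:
        return doc.split()[:2]

    def toy_train(model: int, pairs: List[PseudoPair]) -> int:
        return model + len(pairs)

    docs = ["domain adaptation for keyphrases", "self training with pseudo labels"]
    print(transfer_labeling(0, docs, toy_predict, toy_train, iterations=3))
```

No ground-truth keyphrases enter the loop; the only supervision is the model's own predictions on the target-domain documents.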

Low-resource Fine-Tuning
In the third stage, we expose the model to a small amount of in-domain data with annotated keyphrases. This aims to help the model fully adapt to the new domain and reduce the bias caused by noisy labels from previous stages.

Experiments
We reuse the model architectures described in §2.2 throughout this paper. Most models apply a single iteration of transfer labeling; we discuss the effect of multi-iteration transfer labeling in §4.2.5. See Appendix A.1 for implementation details.

Datasets and Evaluation Metric
We consider the same four large-scale KPG datasets as described in §2.2, but instead of training models with all annotated document-keyphrases pairs, we take a large set of unannotated documents from each dataset for domain adaptation, and a small set of annotated examples for few-shot fine-tuning. Specifically, in the pre-training stage (PT), we use the 2021-05-21 release of the English Wikipedia dump and process it with the wikiextractor package, which results in 3,247,850 passages. In the domain adaptation stage (DA), for each domain, we take the first 100k examples from the training split (without keyphrases), apply different strategies to produce pseudo labels, and subsequently train the models. In the fine-tuning stage (FT), we take the first 100/1k/10k annotated examples (document-keyphrases pairs) from the training split to train the models. We report the statistics of the datasets used in appendix Table 7.
We follow previous studies in splitting training/validation/test sets, and report model performance on the test split of each dataset. A common practice in KPG studies is to evaluate model performance on present and absent keyphrases separately. However, the ratios of present to absent keyphrases differ drastically among the four datasets (e.g., OpenKP is strongly extraction-oriented). Since we aim to improve the model's out-of-domain performance in general, regardless of whether the keyphrases are present or absent, we follow Bahuleyan and El Asri (2020) and simply evaluate present and absent keyphrases together. We report the F@O scores (Yuan et al., 2020a) between the generated keyphrases and the ground truth. This metric requires systems to model the cardinality of predicted keyphrases themselves.
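As a rough illustration of the evaluation, the sketch below scores one document with an F@O-style metric over present and absent keyphrases jointly. It assumes that F@O truncates the ranked predictions to O, the number of ground-truth keyphrases, before computing precision, recall, and F1; it also replaces the stemming usually applied in KPG evaluation with simple lower-casing. Both are assumptions of this sketch, and the function names are hypothetical.

```python
from typing import List


def normalize(phrase: str) -> str:
    """Lower-case and collapse whitespace (a stand-in for stemming)."""
    return " ".join(phrase.lower().split())


def f_at_o(predicted: List[str], gold: List[str]) -> float:
    """F1 between the top-O predictions and the gold set, where O = number of gold phrases."""
    gold_set = {normalize(p) for p in gold if normalize(p)}
    # De-duplicate predictions while keeping their ranking order.
    seen, ranked = set(), []
    for p in predicted:
        n = normalize(p)
        if n and n not in seen:
            seen.add(n)
            ranked.append(n)
    top = ranked[:len(gold_set)]
    if not top or not gold_set:
        return 0.0
    correct = sum(1 for p in top if p in gold_set)
    precision = correct / len(top)
    recall = correct / len(gold_set)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    preds = ["Neural Network", "cloud", "python", "java"]
    gold = ["neural network", "python"]
    print(f_at_o(preds, gold))
```

A corpus-level score would then be the average of the per-document values over the test split.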

Zero-shot Performance
We first investigate how well models can perform after the pre-training stage, without utilizing any in-domain annotated data. Since Wikipedia articles contain a rather wide range of phrase types, we expect models trained on this data to be capable of predicting relevant and well-formed phrases from documents in general. We show our models' testing scores in the first row of Tables 2 and 3, where only PT is checked. We observe that pre-training with Wikipedia data can provide decent zero-shot performance in both settings, i.e., when the model is initialized randomly (Table 2) and with pre-trained language models (Table 3). Both settings achieve the same average F@O score of 12.2, which supports the feasibility of using the PT model to generate pseudo labels for further domain adaptation. The scores also suggest that at the pre-training stage, the BART model (with pre-trained initialization and more parameters) does not present an advantage in comparison to a smaller model trained from scratch.

Domain Adaptation Strategies
We compare transfer labeling (TL, proposed in §3.2) with two unsupervised strategies: (1) Noun Phrase (NP) and (2) Random Span (RS). For NP, we employ SpaCy (Honnibal et al., 2020) to POS-tag source texts and extract noun phrases based on regular expressions. For RS, we follow Raffel et al. (2020), extracting random spans as targets and masking them in the source text. For TL, all pseudo phrases are generated by a PT model in a zero-shot manner (with greedy decoding).
As shown in Figure 4, in the single-strategy setting, RS performs the best among the three strategies and TL follows. We speculate that because RS models are trained to predict randomly masked spans based on their context, they generalize best among the three. As for the NP strategy, since the targets are only noun phrases that appear in the source text, the models risk overfitting to recognize only a subset of possible phrases. TL lies between the two: the generated pseudo labels contain both present and absent phrases, and thanks to the PT model trained with Wikipedia data, the generated targets can contain many phrase types beyond noun phrases.
We further investigate the performance gap between RS and TL. On KP20k, the PT model generates 5.1 present and 2.6 absent keyphrases on average. The generated pseudo labels, albeit of good quality, are fixed during training because the PT model is deterministic, which may cause overfitting and limit the model's generalizability. In contrast, random spans in RS are dynamically generated, so a model can learn to generate different target phrases even when the same documents appear multiple times during training. This motivates us to investigate whether these strategies can be synergistic when combined. As shown in Figure 4, we observe that combining TL and RS leads to a significant improvement over all other strategies, indicating that these two strategies are somewhat complementary and thus can be used together in domain adaptation. In the rest of the paper, we by default combine TL and RS in the domain adaptation stage by taking an equal amount of data from both sides. We discuss other mixing strategies in Appendix A.3.
It is worth noting that, if we apply domain adaptation with the TL+RS mixing strategy and evaluate models without any fine-tuning (2nd row in Tables 2 and 3), we observe a clear drop in the performance of the randomly initialized model (Table 2). We believe this is because using random spans as targets worsens the phraseness of the predictions. BART-initialized models, on the other hand, show robust performance against these noisy targets.
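The contrast between fixed TL labels and dynamically generated RS targets, and their equal-proportion mixture, can be sketched as below. The span length, the number of spans per document, and all function names are assumptions made for illustration; the actual mixing configuration is the one discussed in Appendix A.3.

```python
import random
from typing import List, Sequence, Tuple

MASK = "[MASK]"
Example = Tuple[str, List[str]]  # (source text, target phrases)


def random_span_example(doc: str, rng: random.Random,
                        n_spans: int = 2, max_len: int = 3) -> Example:
    """Mask random spans in the source and use the masked spans as targets."""
    tokens = doc.split()
    targets: List[str] = []
    for _ in range(n_spans):
        if not tokens:
            break
        length = rng.randint(1, min(max_len, len(tokens)))
        start = rng.randint(0, len(tokens) - length)
        span = tokens[start:start + length]
        if MASK in span:  # skip spans overlapping an already masked position
            continue
        targets.append(" ".join(span))
        tokens[start:start + length] = [MASK]
    return " ".join(tokens), targets


def mix_tl_rs(documents: Sequence[str], tl_labels: Sequence[List[str]],
              rng: random.Random, n_per_side: int) -> List[Example]:
    """Take an equal number of TL and RS examples and shuffle them together.

    TL labels are fixed pseudo labels produced once by the PT model, whereas
    RS targets are re-sampled on every call.
    """
    indices = list(range(len(documents)))
    n = min(n_per_side, len(indices))
    tl = [(documents[i], list(tl_labels[i])) for i in rng.sample(indices, n)]
    rs = [random_span_example(documents[i], rng) for i in rng.sample(indices, n)]
    mixed = tl + rs
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    docs = ["we study domain adaptation for keyphrase generation",
            "random spans are masked in the source text"]
    labels = [["domain adaptation", "keyphrase generation"], ["random spans"]]
    for source, targets in mix_tl_rs(docs, labels, random.Random(0), n_per_side=2):
        print(source, "->", targets)
```

Calling `mix_tl_rs` once per epoch would keep the TL half unchanged while the RS half varies, which mirrors the complementary behavior discussed above.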

Performance in Low-Data Setting
As described in §4.1, we use 100/1k/10k in-domain examples with gold-standard keyphrases to fine-tune the model. To investigate the necessity of the PT and DA stages given the FT stage, we conduct a set of ablation experiments, skipping some of the training stages in the full pipeline.
We start by discussing the results of randomly initialized models (Table 2). FT-only: in the case where models are only fine-tuned with a small subset of annotated examples, models perform rather poorly on all datasets, especially on KP20k and OpenKP, where more unique target phrases are involved. DA+FT: different from the previous setting, here all models are first trained with 100k pseudo-labeled in-domain data points. We expect these pseudo-labeled data to improve models on both the phraseness and keyness dimensions. Indeed, we observe that DA+FT leads to a large performance boost in almost all settings. This suggests the feasibility of leveraging unlabeled in-domain data using the proposed adaptation method (TL+RS). PT+FT: the pre-training stage provides a rather significant improvement in all settings. Averaging over datasets and k-shot settings, PT+FT (23.8) nearly doubles the performance of DA+FT (12.6). This observation indicates that large-scale pre-training with domain-general phrase data can be beneficial in various downstream domains, which is consistent with prior studies on text generation pre-training. PT+DA+FT: we observe a further performance boost when both the PT and DA stages are applied before FT. This to some extent verifies our design, in which PT and DA guide the models to focus on different perspectives of KPG and thus work in a complementary manner.
We also investigate the case where the model is initialized with a pre-trained large language model, i.e., BART (Lewis et al., 2020). Due to the space limit, we only report models' average scores (over the four datasets and over the k-shot settings) in Table 3, and we refer readers to appendix Table 9 for the full results. We observe that in the pipeline, the fine-tuning stage provides TF-Bart with the most significant performance boost: the average score is tripled compared to the 0-shot settings, even when performing solely the fine-tuning stage. This may be because the BART model was trained on data from a much wider range of domains (compared to Wikipedia, which is already domain-general), so it may already contain knowledge about our four testing domains. However, the auto-regressive pre-training of BART does not train specifically on the KPG task. This explains why the BART model needs to be fine-tuned on KPG data to achieve higher performance. The above assumption is also supported by further observations in Table 3. Results suggest that the DA stage is not notably helpful to TF-Bart's scores, whereas the PT stage seems to contribute to a better score. We believe this is because of the quality difference between the labels used in these two stages: PT uses community-labeled phrases (high phrase quality but domain-general) and DA uses labels generated by the model itself (no guarantee on phrase quality but closer to target domains). Since TF-Bart only needs specific knowledge about the KPG task, the PT stage can therefore be more helpful.
We run Wilcoxon signed-rank tests on the results of Table 2, and we find that all differences between neighboring experiments (e.g., PT+FT vs. PT+DA+FT, both trained with KP20k and 10k-shot) are significant (p < 0.05). For Table 3, the improvement of PT+FT over the other three settings is also significant.

Scaling the Domain Adaptation
One advantage of self-labeling is the potential to leverage large-scale unlabeled data in target domains. We investigate this idea and build a large domain adaptation dataset by pairing an unlabeled dataset with pseudo labels produced by a PT model. To this end, we resort to the MAG (Microsoft Academic Graph) dataset (Sinha et al., 2015) and collect paper titles and abstracts from 12 million scientific papers in the domain of Computer Science, filtered by 'field of study'. The resulting subset MAG-CS is supposed to be in a domain close to KP20k, yet it may contain noisy data points due to errors in MAG's data construction process. We follow the same experiment setting as reported in the above subsections, except that in the DA stage we use either 1 million or 12 million pseudo-labeled MAG data points for domain adaptation. We train the models with the PT+DA+FT pipeline and report models' scores on the KP20k test split.
As shown in Table 4, compared to our default setting, which uses 100k unlabeled KP20k data points for domain adaptation, larger-scale domain adaptation data can indeed benefit model performance: models adapted with MAG-CS 12m documents show consistent improvements. However, the MAG-CS 1m data (still 10 times the size of the KP20k data used) does not show clear evidence of being helpful. We suspect that the distribution gap between the domain adaptation data (i.e., MAG-CS) and the testing data (i.e., KP20k) may have caused an extra need for generalization. The MAG-CS 12m data may therefore represent a data distribution that has more overlap with KP20k and thus be more helpful. We also observe that models initialized with BART are more robust against such a distribution gap, on account of BART's pre-training with large-scale general-domain text.

Multi-iteration Domain Adaptation
Prior self-training studies have demonstrated the benefit of multiple iterations of label propagation (Triguero et al., 2015; Li et al., 2019). We conduct experiments to investigate its effects on KPG. Specifically, we first pre-train a TF-Rand model using Wikipedia data as in previous subsections. Then, we perform the domain adaptation stage multiple times. In each iteration, the model produces pseudo labels from the in-domain documents and then trains itself with this data. Finally, we fine-tune the model with 10k annotated data points and report its test scores on KP20k. We consider two datasets, KP20k and MAG-CS 1m, as the in-domain data for domain adaptation. As illustrated in Figure 5, the TF-Rand model gradually gains better test performance by iteratively performing domain adaptation with either dataset. Due to limited computing resources, we set the maximum number of iterations to 10, but the trend suggests that models may benefit from more DA iterations.
2020) use pre-trained language models for better representations of documents. In a similar vein, Ye and Wang utilize self-learning to generate synthetic phrases for data augmentation, whereas we use self-labeling for domain adaptation. Gao et al. use a dense retriever to augment keyphrase generation in the cross-lingual scenario.
Pre-training for Phrase/Entity Understanding. Meng et al. (2021a) show that pre-training models with noisy annotation can deliver great improvements on KPG. Kulkarni et al. (2021) pre-train an understanding and a generation model with a large-scale annotated dataset, OAGKX (Çano and Bojar, 2020), and the resulting models achieve decent performance on various NLP tasks. Both studies use a large amount of annotated data for pre-training, which is only available for certain domains. 2021) find open-domain QA datasets can be used to learn strong dense phrase representations. Wikipedia is also frequently used in training models for entity-centric and knowledge-rich tasks. Yamada et al. (2020), Liu et al. (2021), Xiong et al. (2019b), Meng et al. (2021b), and Huang et al. (2021) use Wikipedia and its related resources as distant supervision to enhance BERT's abilities on modeling entities.
Self-labeling. Self-labeling or self-training is a typical means of utilizing unannotated data, and it has been applied in various machine learning tasks (He et al., 2019; Mukherjee and Awadallah, 2020). Yu et al. (2021)

Conclusion
In this study, we investigate domain gaps in the KPG task that hinder models from generalizing. We attempt to alleviate this issue by proposing a three-stage pipeline that strategically enhances models' abilities on keyness and phraseness. Essentially, we consider phraseness a domain-general property that can be acquired from Wikipedia data as distant supervision. We then use self-labeling to distill the phraseness into data in a new domain, and the resulting pseudo labels are used for domain adaptation, as the labels can reflect the keyness and phraseness of the new domain. Finally, we fine-tune the model with a limited amount of target domain data with true labels. By taking advantage of open-domain knowledge on the web, we believe this general-to-specific paradigm is generic and can be applied to a wide variety of machine learning tasks. As a next step, we plan to employ the proposed method for text classification and information retrieval, to see whether the domain-general phrase model can produce reliable class labels and queries for domain adaptation.

Limitations
In this study, we provide empirical evidence of the impact of the domain gap in keyphrase tasks, and we propose effective methods to alleviate it. However, we acknowledge that this study is limited in the following aspects: (1) As the first study discussing domain adaptation and few-shot results, there are few studies to refer to as fair baselines. Nevertheless, we attempt to show the improvements of the proposed methods over base models through extensive experiments. (2) The pre-trained keyphrase generation model can be used off-the-shelf, but the multi-stage adaptation pipeline might increase the engineering complexity in practice. (3) We have only explored three strategies for domain adaptation, and they all require generating hard pseudo labels in different ways. Soft-labeling (Liang et al., 2020) and knowledge distillation (Zhou et al., 2021) methods are worth investigating. (4) We train a model with Wikipedia annotation to predict pseudo keyphrases, and it would be interesting to see whether large language models (e.g., GPT-3 (Brown et al., 2020)) can be used to predict phrases in a zero-shot manner.

Ethics Statement
Dataset Biases. The domain-general pseudo phrases were produced based on public web-scale data (Wikipedia), which mainly represents the culture of the English-speaking populace. Political or gender biases may also exist in the dataset, and models trained on these datasets may propagate these biases. Additionally, the pre-trained BART models can carry biases from the data they were pre-trained on.
Environmental Cost. The experiments described in the paper primarily make use of V100 GPUs. We typically used four GPUs per experiment, and the first-stage pre-training may take up to four days. The backbone model BART-large has 400 million parameters. While our work required extensive experiments, future work and applications can draw upon our insights and need not repeat these comparisons.

A.1 Implementation Details
Most experiments make use of four V100 GPUs.We elaborate the training hyper-parameters for reproducing our results in Table 5 and 6.For inference, we follow previous studies (Yuan et al., 2020b;Meng et al., 2021a) that uses beam search to produce multiple keyphrase predictions (beam width of 50, max length of 40 tokens).We report testing scores with best checkpoints, which achieve best performance on valid set (2,000 data instances for all domains).
Phrase masking ratio denotes that, for p% of the target phrases, their occurrences in the source text are replaced with a special token [PRESENT].
Random span ratio denotes that p% of the words in the source text are replaced with a special token [MASK].
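The two ratios above can be illustrated with a minimal sketch. It is not taken from the released code: the function names, the use of fractions in [0, 1] instead of percentages, the case-insensitive whole-phrase matching, and the choice to leave already inserted [PRESENT] tokens untouched are all assumptions made for illustration. Random masking is applied to individual words here, so contiguous spans arise only when neighbouring words happen to be sampled.

```python
import random
import re

PRESENT = "[PRESENT]"
MASK = "[MASK]"


def mask_phrases(source, phrases, ratio, rng):
    """Replace every occurrence of a sampled share of target phrases with [PRESENT].

    `ratio` is a fraction in [0, 1] (hypothetical parameter name).
    """
    n_to_mask = round(len(phrases) * ratio)
    chosen = rng.sample(phrases, n_to_mask)
    # Mask longer phrases first so a short phrase cannot split a longer one.
    for phrase in sorted(chosen, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", flags=re.IGNORECASE)
        source = pattern.sub(PRESENT, source)
    return source


def mask_random_words(source, ratio, rng):
    """Replace a sampled share of whitespace-separated words with [MASK].

    Assumption: words that already carry a [PRESENT] token are not candidates.
    """
    words = source.split()
    candidates = [i for i, word in enumerate(words) if PRESENT not in word]
    n_to_mask = round(len(candidates) * ratio)
    for i in rng.sample(candidates, n_to_mask):
        words[i] = MASK
    return " ".join(words)


if __name__ == "__main__":
    rng = random.Random(0)
    text = ("Artificial neural networks are computing systems inspired by "
            "biological neural networks")
    targets = ["Artificial neural networks", "biological neural networks"]
    masked = mask_phrases(text, targets, 1.0, rng)
    print(masked)
    print(mask_random_words(masked, 0.4, rng))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible across runs, which is convenient when comparing different masking ratios.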

A.2 Data Statistics
See Table 7.

A.3 Additional Results and Analyses
Figures 6 and 7 show additional results of domain adaptation. In Figure 6, we find that larger beam widths do not lead to significantly better scores after fine-tuning, and thus we use simple greedy decoding for most of this study. In Figure 7, we compare various domain adaptation strategies that mix different pseudo labels. Overall, we find that mixing the labels of transfer labeling (TL) and random spans (RS) at a 50%:50% ratio leads to the best performance.
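As a rough illustration of the 50%:50% mixing, the sketch below assigns each unlabeled document either its TL pseudo labels or its RS pseudo labels. The text does not state whether mixing happens per document or per phrase, so the per-document split is an assumption, as are the function name `mix_pseudo_labels` and its parameters.

```python
import random


def mix_pseudo_labels(tl_labels, rs_labels, tl_share=0.5, seed=0):
    """Pick TL or RS pseudo labels for each document (hypothetical helper).

    tl_labels[i] and rs_labels[i] are the two candidate label lists for
    document i. A share `tl_share` of the documents keeps the TL labels and
    the remaining documents keep the RS labels.
    """
    if len(tl_labels) != len(rs_labels):
        raise ValueError("expected one TL and one RS label list per document")
    n_docs = len(tl_labels)
    order = list(range(n_docs))
    random.Random(seed).shuffle(order)
    # The first part of the shuffled order is assigned to transfer labeling.
    tl_docs = set(order[: round(n_docs * tl_share)])
    mixed = []
    for i in range(n_docs):
        if i in tl_docs:
            mixed.append(("TL", tl_labels[i]))
        else:
            mixed.append(("RS", rs_labels[i]))
    return mixed
```

Shuffling the document indices and cutting at a fixed position yields an exact split rather than an expected one, which makes small-scale comparisons between mixing ratios less noisy.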
In Figure 8, we use T-SNE to visualize the 1,000 most frequent keyphrases from each of four datasets (100k data examples from the training split) in the semantic space. We use BERT-base (Devlin et al., 2019) to generate phrase embeddings (we feed each phrase forward independently as a sequence and take the [CLS] embedding as output). We use the T-SNE implementation of Scikit-Learn (Pedregosa et al., 2011) with default hyperparameters. The result shows that phrases from each domain tend to gather into clusters. In particular, we can see a big overlap between KP20k and StackEx, since both domains are related to Computer Science. The distribution of OpenKP is more spread out, as its documents are collected from the web and cover a broader range of topics.

Table 9: Zero-shot and few-shot results. Models are initialized with BART-large (Lewis et al., 2020). The best average score is boldfaced.

Figure 1: Cross-domain transfer performance of TF-Rand and TF-Bart (F@O, the higher the better). Y-axis: training dataset; X-axis: test dataset.

Figure 2: The proposed three-stage pipeline. A model is first pre-trained with general-domain data and learns to generate syntactically correct phrases. In the domain adaptation stage, the model adapts to the target domain by training on domain-specific data, where the pseudo labels are generated by the model itself. Finally, we fine-tune the model with a limited amount of target-domain data with true labels to fully accomplish domain adaptation.

Figure 3: Illustration of processing Wikipedia into source-target pairs in domain-general phrase pre-training.
Wikipedia text: Artificial neural networks (ANNs), usually simply called neural networks (NNs) or, more simply yet, neural nets, are computing systems inspired by the biological neural networks that constitute animal brains…
Source: … ), usually simply called neural networks (<MASK>) or, more simply yet, neural nets, are <INFILL> by the <MASK> that constitute animal brains…
Target: Artificial neural networks <SEP> NNs <SEP> biological neural networks <SEP> Computational neuroscience <SEP> ADALINE <SEP> computing systems inspired

Figure 4: Comparison of different strategies for domain adaptation with TF-Rand. TL: Transfer Labeling. NP: Noun Phrases. RS: Random Span.

Figure 5: Trend of 10k-shot performance on KP20k with iterative self-labeling for 10 iterations.
Wang et al. (2021); Li et al. (2022) use contrastive learning to train phrase encoders. Lee et al. ( define rules as weak supervision for text classification and use self-training to propagate labels to new documents. In our case, the pseudo labels are induced by models pre-trained with weak phrase annotation in Wikipedia. Liang et al. (2020) use self-training to supplement distantly supervised NER, and Huang et al. (2021) use self-training to leverage unlabeled in-domain data.

Figure 6: Comparison of domain adaptation with Transfer Labeling using different beam widths to produce pseudo labels (TF-Rand).

Figure 7: More results on mixing techniques for domain adaptation (TF-Rand, with different mixing ratios). TL: Transfer Labeling. NP: Noun Phrases. RS: Random Span.

Figure 8: T-SNE visualization of keyphrase representations from four datasets.

Table 2: Zero-shot and low-data results obtained by TF-Rand. The best average score in each column is boldfaced.

Table 3: Zero-shot and low-data results of the TF-Bart model. Full results are reported in Appendix Table 9.

Table 4: Average scores (over 4 datasets) with different amounts of transfer-labeled data for domain adaptation. All models are trained through three stages. The best score in each block is boldfaced.

Table 5: Training hyperparameters for TF-Rand. *FT denotes the fine-tuning stage in cases of PT+FT or PT+DA+FT. An empty cell means the value is the same as the leftmost value.

Table 6: Training hyperparameters for TF-Bart. An empty cell means the value is the same as the leftmost value.

Table 8: Zero-shot and low-data results. Models are randomly initialized. The best average score is boldfaced. Columns: PT, DA, FT, KP20k, OpenKP, KPTimes, StackEx, Average over 4, JPTimes, DUC-2001.