Learning Label Modular Prompts for Text Classification in the Wild

Machine learning models usually assume i.i.d. data during training and testing, but data and tasks in the real world often change over time. To emulate the transient nature of the real world, we propose a challenging but practical task: text classification in-the-wild, which introduces different non-stationary training/testing stages. Decomposing a complex task into modular components can enable robust generalisation under such non-stationary environments. However, current modular approaches in NLP do not take advantage of recent advances in parameter-efficient tuning of pretrained language models. To close this gap, we propose ModularPrompt, a label-modular prompt tuning framework for text classification tasks. In ModularPrompt, the input prompt consists of a sequence of soft label prompts, each encoding modular knowledge related to the corresponding class label. In the two most formidable settings, ModularPrompt outperforms relevant baselines by a large margin, demonstrating strong generalisation ability. We also conduct comprehensive analysis to validate whether the learned prompts satisfy the properties of a modular representation.


Introduction
While NLP research has received a significant boost in performance by employing large-scale pretrained language models (PLMs), finetuning an entire dedicated model for each task is not always practical or even feasible, especially as model size continues to grow. To alleviate this, there has recently been increased interest in parameter-efficient methods such as Adapter (Houlsby et al., 2019) and prompt-based tuning methods (Lester et al., 2021; Qin and Eisner, 2021; Liu et al., 2021; Li and Liang, 2021). When training on a downstream task, these methods keep the PLM frozen and update only a small set of parameters that are added to the network. Among these methods, PROMPTTUNING (Lester et al., 2021) proposes tunable prompts, a sequence of learnable soft tokens, and shows impressive results, even competitive with full-model finetuning of a large PLM. While these prompt-based methods have proven quite effective on popular benchmarks, these standard tasks typically assume only independently and identically distributed (i.i.d.) data during training and testing. However, practical cognitive tasks in the real world are usually more complex, involving changing contexts or non-stationary environments. Our code is available at https://github.com/salesforce/ModularPrompt.
Talking about what is next for NLP, Kathleen McKeown said in a recent interview (source): "Most models are static. But the world changes every minute, every second. Dealing with a dynamic world is a new area that's up and coming." Our work in this paper particularly concerns text classification settings where a model is trained on a sequence of tasks and evaluated on an arbitrary subset of seen labels of interest. We formalize this as a novel text classification in-the-wild task (defined in §3), which emulates the transient learning environment of the real world; e.g., for a service requiring classification, the label set might gradually change over time to include new labels or remove obsolete ones. Such scenarios typically result in a sequence of non-stationary low-resource training and evaluations over different label sets (e.g., train on {chemistry, physics} and {basketball, football} in succession and then test on {physics, football}).
This requires handling non-stationary data distributions, which humans are quite adept at, partly because we can decompose a complex task in a modular fashion (Berwick et al., 2013). For example, when learning to classify objects, we acquire modular knowledge exclusive to each class. This allows us to classify robustly irrespective of any label space manipulations such as label omission or learning over new label spaces. This notion of modularity at the level of each class label is what we call label modularity, and it is a desirable quality for NLP models to generalize to practical non-stationary classification settings.
Contemporary modular model designs for complex NLP tasks typically use a routing network or programmer (Cases et al., 2019; Rosenbaum et al., 2019; Khot et al., 2021; Corona et al., 2021; Jiang and Bansal, 2019; Liu et al., 2019; Hu et al., 2018; Gupta et al., 2020; Andreas et al., 2016), which learns a meaningful decomposition of the task into sub-tasks and executes it by applying a chain of specialised modules designed for each sub-task. For classification tasks, this can entail learning a specialized module for each label in the target label space. While there has been limited research on utilizing modular architectures for PLMs (Andreas et al., 2016; Chen et al., 2020), the main research gap that we explore in this work is that of a modular design for parameter-efficient tuning of large PLMs, in particular, PROMPTTUNING.
Although PROMPTTUNING can be considered modular at the task level in that it learns soft prompts for each task to support multitasking, it is not able to learn a modular decomposition within a particular task. For non-modular designs like PROMPTTUNING or model finetuning, text classification in-the-wild is challenging to handle, as it requires combining partial information from different label spaces. In contrast, a label-modular approach should learn exclusive knowledge for each label and generalise to any subset of the label set. We thus postulate two main objectives of a label-modular model. Objective 1. Separable Label Representation: each class label should have its own representation, which compactly encodes the information from the data belonging to that label. Objective 2. Prediction over Controllable Label Space: models should perform robustly over any subset of the learnt label space during inference.
To meet these objectives, we propose a modular design of prompt tuning: Label-Modular Prompt Tuning (MODULARPROMPT). It decomposes the prompt sequence into label-modular components called label prompts, each encoding specific knowledge corresponding to a class label. Thus, in each forward pass, we can select the desired label prompts to construct the input prompt, based on the target label-set. To ensure that the learned knowledge is encoded in a modular fashion during training, we introduce a novel subset-invariant loss over dynamic label-sets.
To evaluate the generalizability of MODULARPROMPT, we construct some practical scenarios of text classification in-the-wild. We train in multiple stages over non-overlapping label spaces and evaluate the model on label-sets that (i) correspond to each training stage (stage-specific), (ii) accumulate all learned labels (stage-agnostic), and (iii) comprise labels from across multiple training stages (stage-fused). We show an example of MODULARPROMPT on the stage-fused NER setting in Figure 1.
The stage-agnostic and stage-fused settings are the most challenging scenarios for typical finetuned or prompt-tuned models, and specifically on those settings we find that MODULARPROMPT outperforms all relevant baselines by a significant margin. This alludes to its ability to learn robust prompt representations that are generalizable to different non-stationary learning environments.
We further empirically justify that MODULARPROMPT indeed showcases modular properties by analyzing its behavior when either the ground truth or other random labels are removed from the input, or when the order of label prompts is permuted.
Transfer learning over prompts (Vu et al., 2022) and Adapters (Pfeiffer et al., 2021), and multi-tasking over the latter (Mahabadi et al., 2021b), have also been explored, but these are not comparable to ours due to differences in task/problem settings.
Second, continual learning (CL) methods are relevant as they also have sequential training stages. Architecture-based CL methods adjust the model architecture for each task (Chen et al., 2016; Rusu et al., 2016; Mallya et al., 2018), but require task identities for inference. Regularization-based CL methods restrain updates to parameters critical to previous tasks (Kirkpatrick et al., 2017; Zenke et al., 2017; Aljundi et al., 2018). Memory-based CL methods retain key examples from prior tasks (Lopez-Paz and Ranzato, 2017; Chaudhry et al., 2019; de Masson d'Autume et al., 2019; Zhu et al., 2022), while memory-generator models learn to generate and use pseudo-data from prior tasks (Sun et al., 2020; Qin and Joty, 2021). These methods are not comparable to ours as we do not use any memory or pseudo-memory in our sequential training.

Methodology
In this section, we first formally define the problem and subsequently present our MODULARPROMPT model, introduce the subset-invariant loss, and explain the framework under text classification in-the-wild. Following the notation of §3.1, let D ts = {D ts 1 , ..., D ts m } be the test datasets, with Ω ts j denoting the set of possible class labels for D ts j . For classification in-the-wild, we examine three challenging yet very practical settings. First, when m = 1 and Ω ts 1 = ∪ n k=1 Ω tr k , we have one test dataset covering all seen labels. We refer to it as stage-agnostic testing, as the test label can come from any of the training stages (or tasks). Second, when m = n and Ω ts j = Ω tr j for all j ∈ {1, ..., n}, we have one test dataset corresponding to each training stage with the same label set. We denote this setting as stage-specific testing, as each test set evaluates the model's performance on a particular task on which it was trained. Finally, in a more challenging setting, m > 1 and Ω ts j ∉ {Ω tr 1 , ..., Ω tr n } for all j ∈ {1, ..., m}, but Ω ts j ∈ P(∪ n k=1 Ω tr k ), where P(S) denotes the power set of a given set S. That is, the label set of a test stage does not correspond to any one training stage, but is composed of partial label sets from multiple training stages (or tasks). We refer to it as stage-fused testing. Note that the stage-agnostic and stage-specific scenarios are closely related to continual learning (Thrun, 1996), though the latter considers access to task-id instead of intra-task information (i.e., the task label set).
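The three testing settings can be sketched directly with label sets. A minimal illustration (the label names here are hypothetical, not from the paper's datasets):

```python
# Sketch of the three evaluation settings, built from training-stage label sets.
train_stages = [
    {"chemistry", "physics"},    # Ω^tr_1
    {"basketball", "football"},  # Ω^tr_2
]

# Stage-agnostic: one test set over the union of all seen labels.
stage_agnostic = set().union(*train_stages)

# Stage-specific: one test label set per training stage.
stage_specific = list(train_stages)

# Stage-fused: a test label set mixing partial labels from several stages;
# by definition it equals no single training-stage label set.
stage_fused = {"physics", "football"}
assert all(stage_fused != s for s in train_stages)
assert stage_fused <= stage_agnostic  # still drawn from seen labels
```

This makes concrete why stage-fused testing is the hardest case: no single training checkpoint has ever seen that exact label space.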

Soft Prompt Tuning
Let X = {x 1 , ..., x L } be an input text sequence, where x t is the t-th token, and let M be a pretrained language model. The input text is mapped to a sequence of embeddings E = {e(x 1 ), ..., e(x L )}, to which a soft prompt T = {p 1 , ..., p n } of n tunable token embeddings is prepended. The model prediction is then defined as Ŷ = argmax Y Pr M (Y | T ⊕ E). During training, M is kept frozen and only T is updated.
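A minimal sketch of this setup (not the authors' code; a random embedding table stands in for the PLM's frozen input embeddings, and the only trainable parameters are the soft prompt vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_prompt, vocab = 16, 5, 100

embed = rng.normal(size=(vocab, d_model))  # frozen, part of M
T = rng.normal(size=(n_prompt, d_model))   # tunable soft prompt

def build_input(token_ids):
    """Prepend the soft prompt to the embedded input: T ⊕ E."""
    E = embed[token_ids]                    # (L, d_model)
    return np.concatenate([T, E], axis=0)   # (n + L, d_model)

H = build_input([3, 7, 42])
assert H.shape == (n_prompt + 3, d_model)
```

During training, gradients would flow only into `T`, keeping the backbone untouched.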

Label Modular Prompt Model
We visualise the architecture of PROMPTTUNING (Lester et al., 2021) and our proposed MODULARPROMPT in Figure 2. In contrast to PROMPTTUNING, MODULARPROMPT's prompt consists of a sequence of label prompts, where each label prompt contains the corresponding label name and a sequence of tunable soft tokens, similar to a soft prompt. Formally, we denote l k = e k ⊕ {p k 1 , ..., p k m } as the label prompt for label k, where e k is the embedding of label k's text (or a sequence of token embeddings for multi-token labels), ⊕ denotes concatenation, and m is the number of tunable tokens per label prompt. The final input prompt is T = ⊕ k∈S l k , with S being the set of labels of interest.

Prompt formulation
The key mechanism of MODULARPROMPT is the prompt formulation {R, S} → T , where R denotes the learned representation space of all label prompts. In PROMPTTUNING, the variables S and R do not exist, and model training tunes T directly. In MODULARPROMPT, given S as a set of class labels of interest, we select the corresponding label prompt representations from R and concatenate them to form the final input prompt T . The training loss is back-propagated through Y → T → R to learn the soft label prompts.
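The label prompt construction and the prompt formulation {R, S} → T can be sketched as follows (illustrative shapes and label names, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 8, 3  # embedding dim, soft tokens per label

def label_prompt(name_emb, soft_tokens):
    """l_k = e_k ⊕ {p^k_1 ... p^k_m}: label-name embedding plus soft tokens."""
    return np.concatenate([name_emb, soft_tokens], axis=0)

# R: learned representation space of all label prompts (hypothetical labels)
R = {
    k: label_prompt(rng.normal(size=(1, d)), rng.normal(size=(m, d)))
    for k in ["physics", "chemistry", "football"]
}

def formulate(S):
    """Prompt formulation {R, S} -> T: concatenate the selected label prompts."""
    return np.concatenate([R[k] for k in sorted(S)], axis=0)

T = formulate({"physics", "football"})
assert T.shape == (2 * (1 + m), d)  # two label prompts of length 1 + m each
```

The point of the design is that `formulate` can be called with any subset of the keys of `R`, which is exactly what the stage-fused setting demands.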

Subset invariant loss
The prompt formulation {R, S} → T aims to achieve Objective 2: prediction over a controllable label space (§1). In the single-domain setting, Ω tr is the set of all possible class labels during training, as defined in §3.1. However, fixing S to a constant Ω tr throughout training would make the model susceptible to the data discrepancy between training and inference, as Ω ts ≠ Ω tr . Thus, to ensure Objective 2, we propose to vary S during training. We first uniformly sample the size of S, |S|, from {1, ..., (|Ω tr | − 1)} and then randomly choose |S| labels from Ω tr to construct S. Such sub-sampling of Ω tr encourages a fair exploration of different lengths of prompt sequences as input during training, thus enabling the representations to be robust to a dynamic Ω ts at inference. For each training instance, with probability p we fix S = Ω tr , and we vary S as above with (1 − p) chance. We refer to this sampling process as S ∼ Ŝ. The subset-invariant loss is then defined as

L(X, Y ) = E S∼Ŝ [ 1[cls(Y ) ⊆ S] · (− log Pr M (Y | T S ⊕ E)) ],

where T S is the input prompt formulated from S, and 1[·] is the indicator function, equal to 1 if its condition holds and 0 otherwise.

Label prompt transfer

In step 3 of the framework, for learning the label prompt representation R Ω tr i at any training stage i, we first aim to transfer the label-modular knowledge R Ω tr <i learned over the previous training stages through prompt initialization. This is a unique learning characteristic facilitated by our label-modular architecture, and it allows the model to exploit semantic relatedness between labels across training stages when initializing the label prompt representation. Intuitively, if 'bistro' ∈ Ω tr <i and 'restaurant' ∈ Ω tr i , then initializing the label prompt representation of 'restaurant' with the knowledge encoded in the learned label prompt representation of 'bistro' should be helpful to the model. To compute the similarity between labels l j and l k with j ∈ Ω tr i and k ∈ Ω tr <i , we use the per-token average cosine similarity sim(e j , e k ) based on the embeddings of the label texts. For each label j ∈ Ω tr i , we select the top-K most similar labels Ω tr top-K(j) ⊂ Ω tr <i . We then initialize l j by averaging the top-K similar label prompt representations, weighted by their normalized similarity scores: l j = Σ k∈Ω tr top-K(j) w k · l k , with w k ∝ sim(e j , e k ). This method is similar in spirit to (Vu et al., 2022), which shows good transfer for task-level prompts at the cost of training overheads, while we transfer at a finer-grained level over label prompts with no overhead.
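The S ∼ Ŝ sampling and the loss masking can be sketched as follows (our reading of the procedure above, with hypothetical helper names):

```python
import random

def sample_S(omega_tr, p=0.5, rng=random):
    """S ~ Ŝ: keep the full label set with probability p, else a strict subset."""
    if rng.random() < p:
        return set(omega_tr)
    size = rng.randint(1, len(omega_tr) - 1)  # |S| sampled from {1..|Ω|-1}
    return set(rng.sample(sorted(omega_tr), size))

def subset_invariant_loss(cls_Y, S, nll):
    """1[cls(Y) ⊆ S] * nll: zero out the loss for ungrounded predictions."""
    return nll if cls_Y <= S else 0.0

omega = {"physics", "chemistry", "football", "art"}
S = sample_S(omega, p=0.0)
assert 1 <= len(S) < len(omega) and S <= omega
assert subset_invariant_loss({"art"}, {"physics"}, 1.7) == 0.0
```

Masking rather than skipping such instances keeps the sampling of S independent of the ground truth, which is what forces each label's knowledge into its own prompt.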

Experiments
In this section, we first introduce the datasets used and the data construction process (§4.1), followed by relevant baselines (§4.2), evaluation methods (§4.3) and implementation details (§4.4). Through our experiments, we target three research questions:
1. Can MODULARPROMPT consolidate knowledge over multi-stage training? → answered in §4.5 with the stage-agnostic setting.
2. Can MODULARPROMPT adapt to a dynamic label space at inference? → answered in §4.6 with the stage-fused setting.
3. How competitive is MODULARPROMPT in the stage-specific setting? → answered in §4.7.
Additionally, we perform ablations (§4.8) and quantitative and qualitative analyses (§4.9-§4.10) to verify the modular properties of MODULARPROMPT.

Tasks and Datasets
We conduct experiments on three types of NLP tasks: News Domain Classification on HuffpostNews (Misra, 2018), Named Entity Recognition (NER) on fewNERD (Ding et al., 2021) and Relation Extraction (RE) on FewRel (Han et al., 2018). We formulate all tasks as a text-to-text problem, as defined in §3.1. For News Domain Classification and NER, we construct the target text following Qin and Joty (2021). For RE, we concatenate the original text and the entities with a separator '|' as the input sequence, and use the relation type as the target (example data can be found in Appendix A). For HuffpostNews, we subsample 100 shots per class for training and validation and split it into 5 stages of disjoint labels. For FewNERD and FewRel, we subsample 50 shots for training and validation and split into 4 and 5 stages, respectively. For testing, we subsample 200, 50, and 50 shots per class for HuffpostNews, FewNERD and FewRel, respectively. The total number of labels for {HuffpostNews, FewNERD, FewRel} is {41, 64, 80}, and the resulting label-set size per stage is {8-9, 16, 16}, respectively.
For stage-specific testing, we follow the stages defined for training and construct a corresponding test set for each stage. For stage-agnostic testing, we combine the stage-specific test data for the current stage and all previously seen stages to construct the test data. For stage-fused testing, we construct label-sets for each fused stage such that each is not a subset of any single prior training stage, but rather contains labels from all prior training stages. We construct {5, 4, 5} fused stages for {HuffpostNews, FewNERD, FewRel}. We conduct 5 randomised trials with different data sampling and experiment seeds for all of the above settings.
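The stage construction described above amounts to shuffling the full label set and splitting it into disjoint per-stage sets. A sketch with placeholder label names (the actual experiments use the HuffpostNews/FewNERD/FewRel label sets, and the per-stage sizes may be balanced differently):

```python
import random

def split_stages(labels, n_stages, seed=0):
    """Shuffle the label set and split it into n_stages disjoint label sets."""
    labels = sorted(labels)
    random.Random(seed).shuffle(labels)
    per = -(-len(labels) // n_stages)  # ceiling division
    return [set(labels[i * per:(i + 1) * per]) for i in range(n_stages)]

stages = split_stages([f"label_{i}" for i in range(41)], 5)
assert len(stages) == 5
assert sum(len(s) for s in stages) == 41  # disjoint cover of all labels
```

Stage-fused label sets would then be assembled by sampling labels across several of these `stages` entries.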

Baselines
We use T5-large (Raffel et al., 2020) as the backbone PLM for all methods, and consider the following baselines to compare with our MODULARPROMPT:
• MODELTUNING (Finetune), which tunes all parameters of the backbone PLM.
• (i) PROMPTTUNING (PT) from §3.2; (ii) PT CL , an extension of PT to the continual learning (CL) setting, which trains a separate PT model for each stage and concatenates the learned soft prompts during inference, based on the test label-set.
• Adapter, a parameter-efficient tuning alternative introduced in (Houlsby et al., 2019), which inserts light adapter layers into the backbone PLM and tunes only those.
As text classification in-the-wild overlaps with continual learning, we also compare with versions of the above baselines that use architecture-agnostic methods and settings relevant to the latter.
• Online regularization-based methods: (i) a scalable online version of EWC (Kirkpatrick et al., 2017) proposed in (Schwarz et al., 2018), and (ii) Online MAS (Aljundi et al., 2018). These methods estimate each parameter's importance to previous tasks (via the Fisher information in the case of EWC) and restrict updates to previously important parameters when learning a new task, to mitigate catastrophic forgetting.
• Multitask model, which involves training on all stages simultaneously rather than sequentially. This is in fact an oracle method for stage-agnostic testing and can be considered an upper bound for memory-based methods in continual learning.

Evaluation Methods
For all three NLP tasks, we consider an exact match as a correct prediction; we report accuracy for News Classification and RE, and compute the F1 score over the BIO format for the NER task. By default, we do not apply any other post-processing or verbalizer, though these are orthogonal methods that can be used separately to enhance any of the discussed models. In the stage-fused setting, we apply constrained decoding similar to (Cao et al., 2021) to selected baselines, marked by the special indicator * (e.g., Finetune * MAS ). For MODULARPROMPT, we use all seen label prompts for stage-agnostic testing and the specific set of label prompts for stage-specific and stage-fused testing. Since the other baselines do not have label-level modularity, for stage-agnostic and stage-fused testing we use the checkpoint after the final stage, and for stage-specific testing we take their checkpoints after each training stage. We show average performance in the main paper and relegate detailed results to Appendix C.
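The exact-match metric used for News Classification and RE is straightforward to sketch (hypothetical predictions; the BIO-format F1 for NER is omitted here):

```python
def exact_match_accuracy(preds, golds):
    """Fraction of predictions that exactly match the gold target text."""
    assert len(preds) == len(golds)
    hits = sum(p.strip() == g.strip() for p, g in zip(preds, golds))
    return hits / len(golds)

acc = exact_match_accuracy(["sports", "politics"], ["sports", "tech"])
assert acc == 0.5
```

Because predictions are free-form generated text, exact match is a strict criterion; constrained decoding helps baselines precisely by forcing outputs into the valid label space.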

Implementation Details
We set the learning rate to 0.5 for PROMPTTUNING and MODULARPROMPT and to 5e-5 for MODELTUNING and Adapter, using the Adafactor (Shazeer and Stern, 2018) optimizer. We adopt the implementation of Adapter from OpenDelta (Hu, 2022) and use the default bottleneck dimension of 24. For online EWC and MAS, we report the best results obtained over different regularization constants. For all methods, we set the maximum number of training epochs to 256 for HuffpostNews and FewNERD, and to 512 for FewRel. For MODULARPROMPT, the number of soft tokens per label prompt is set to 10, the selection probability p is set to 50%, and the number of label transfer candidates K in §3.4 is set to 3.
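The label prompt transfer of §3.4, used here with K = 3, can be sketched as a similarity-weighted average over previous stages' label prompts (illustrative code; `cosine` here is a plain cosine over pooled label-name embeddings rather than the paper's per-token average):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def init_label_prompt(e_new, prev_embs, prev_prompts, K=3):
    """Initialize a new label's prompt from the top-K most similar old ones."""
    sims = {k: cosine(e_new, e) for k, e in prev_embs.items()}
    top = sorted(sims, key=sims.get, reverse=True)[:K]
    w = np.array([sims[k] for k in top])
    w = w / w.sum()  # normalized similarity weights
    return sum(wi * prev_prompts[k] for wi, k in zip(w, top))
```

Labels in later stages thus start from semantically related representations instead of random initialization, at no extra training cost.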

Results on Stage-agnostic Setting
In Table 1, we show the stage-agnostic testing results. We observe that across all three tasks, MODULARPROMPT outperforms all other baselines by a large margin. This empirically justifies that MODULARPROMPT is indeed able to dynamically combine the label-specific knowledge learned across different training stages in order to infer over the unseen combined label space. Amongst the baselines, MODELTUNING performs relatively better, while limited trainable parameters make the parameter-efficient models more susceptible to catastrophic forgetting. For the CL methods, MAS improves MODELTUNING and PROMPTTUNING by 4% and 8% on average, respectively, but fails on Adapter. EWC is less effective in addressing forgetting across all baselines.
Also note that the PT CL extension improves by 10-20% over vanilla PT. This shows that soft prompts, behaving like language tokens, have a compositional nature and can be concatenated to support multi-tasking. MODULARPROMPT, in addition to exploiting this implicit language prior, also explicitly imposes the subset-invariant loss to adapt to dynamic label spaces, further boosting stage-agnostic performance by 14%-18% over PT CL .

Results on Stage-fused Setting
We present results on our novel stage-fused setting in Table 2. We observe that none of the baselines is capable of handling this setting, as is evident from their abysmal performance across all testing stages. In the absence of any label-modular representation, they are unable to utilize any information about the desired label space. On the other hand, MODULARPROMPT not only outperforms all baselines by an average margin of 37.5%, it also achieves 4%-14% better performance than the oracle multi-task MODELTUNING on News Classification and NER. We select the top-performing baselines in this setting and apply constrained decoding to them (marked with *), which improves their performance by 20%-30% on News and RE and by 2%-4% on NER. However, MODULARPROMPT still outperforms these baselines by 14%-27%. This significant improvement is evidence that MODULARPROMPT, by learning label-modular representations, can effectively combine partial knowledge from different training stages and condition the PLM on any target set of label prompts. This allows it to seamlessly adapt to dynamic unseen label spaces, without applying any post-processing or verbalizer.
Note that while PT CL is able to combine knowledge from multiple training stages to support stage-agnostic testing, it fails to extract and consolidate specific knowledge corresponding to only the target label-set across different stages.

Results on Stage-specific Setting
While MODULARPROMPT has proved to be particularly successful in handling the challenging non-stationary settings of stage-agnostic and stage-fused evaluations, we now want to see how competitive it is under stage-specific settings. From the results in Table 3, we see that the average stage-specific performance of MODULARPROMPT is comparable to vanilla PROMPTTUNING on the three tasks. Note that while MAS regularization somewhat boosts stage-agnostic performance for MODELTUNING and PROMPTTUNING, it in fact degrades their stage-specific performance by 10%-40%. Similarly, applying EWC regularization fails to improve over the vanilla models in this setting, while also proving less effective on stage-agnostic evaluation. This shows the lack of robustness of these techniques across the different non-stationary settings. In contrast, MODULARPROMPT achieves state-of-the-art results in the stage-agnostic and stage-fused settings while remaining comparable to PROMPTTUNING in stage-specific evaluation. Besides, Lester et al. (2021) showed that the performance gap between PROMPTTUNING and MODELTUNING gradually closes as the size of the backbone PLM scales up. We posit that MODULARPROMPT, being an extension of PROMPTTUNING, can similarly benefit from scaling up the PLM, but we leave this as future work owing to resource limitations.

Ablation Study
We now analyze the contribution of different components of MODULARPROMPT towards its SoTA performance. From the results in Table 4, we see that in the stage-agnostic setting, both label prompt transfer and the subset-invariant loss provide a boost, though the role of the former is seemingly more significant. On the contrary, removing the subset-invariant loss has a more debilitating effect on stage-fused performance. This evinces that the subset-invariant loss is indeed critical for learning label-modular representations, which is essential to the stage-fused evaluation that needs to extract and dynamically re-compose label-specific knowledge.

Quantitative Analysis
Apart from achieving SoTA, does MODULARPROMPT possess the desirable characteristics of a modular model? According to Algorithm 1, MODULARPROMPT sets S = Ω ts i during inference. We experiment with different strategies of input prompt construction, including dropping label prompt(s) corresponding either to the ground truth label(s) or to one other random label, and permuting the default order of label prompts; see Table 5 for the results. Indeed, we observe that dropping the ground truth label prompt during inference degrades the mean performance by 57%-82%, while dropping any other random label prompt boosts performance slightly. This strongly demonstrates the label grounding property of MODULARPROMPT, i.e., the knowledge of a class label is exclusively embedded in its corresponding label prompt. MODULARPROMPT also shows low sensitivity to the order of label prompts during inference, yet another favourable property of label-modular models.
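The inference-time probes above can be sketched as simple manipulations of the label set before prompt formulation (hypothetical label names and scheme names):

```python
import random

def probe_label_set(S, gold, scheme, seed=0):
    """Return the label list to formulate the prompt from, per probing scheme."""
    rng = random.Random(seed)
    labels = sorted(S)
    if scheme == "drop_gold":
        return [k for k in labels if k != gold]
    if scheme == "drop_random_other":
        victim = rng.choice([k for k in labels if k != gold])
        return [k for k in labels if k != victim]
    if scheme == "permute":
        rng.shuffle(labels)
        return labels
    return labels  # default: unmodified label set

S = {"athlete", "politician", "event"}
assert "politician" not in probe_label_set(S, "politician", "drop_gold")
assert len(probe_label_set(S, "politician", "drop_random_other")) == 2
```

A label-grounded model should collapse under `drop_gold`, be unaffected by `drop_random_other`, and be insensitive to `permute`, which is the pattern Table 5 reports.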

Qualitative Analysis
Revisiting Figure 1 presented in §1, we observe that MODULARPROMPT is able to predict correctly on a testing regime that is unseen during training, by extracting and consolidating label-specific knowledge from multiple training stages. More example predictions are shown in Figure 3 (and Appendix B), which indicate that MODULARPROMPT is able to exploit in-context learning over label prompts to generalize to unseen label combinations during inference. For example, MODULARPROMPT tags "Gilbert" as politician as he was "a delegate to" a government. In the same spirit, MODULARPROMPT wrongly tags "Bert Bell" and "Rozelle" as athletes (the true label being person_other) because they are associated with the sports league "NFL". Such qualitative findings demonstrate MODULARPROMPT's capability to learn label-modular representations and integrate them dynamically during inference.
Conclusion

In this paper, we proposed MODULARPROMPT, a label-modular prompt tuning framework for text classification in-the-wild. Extensive experiments show that MODULARPROMPT is able to consolidate knowledge learned during sequential training stages for stage-agnostic testing, and to extract and recompose knowledge for stage-fused testing, while maintaining competitive performance in stage-specific settings. We have also conducted analyses to show that MODULARPROMPT has the desirable modular properties of label grounding, low order sensitivity and in-context learning. Being the first work on modular parameter-efficient tuning, we hope it spurs more research in this area towards solving a wider range of tasks under more general non-stationary settings.

Limitations
In this section, we discuss limitations and potential future work towards extending MODULARPROMPT to a more generalised method for wider applicability.
On Scalability In MODULARPROMPT, the input prompt T grows in proportion to |S|, the size of the label set of interest. This limits MODULARPROMPT from supporting huge label sets (e.g., thousands of labels), as transformers can only condition on a bounded-length context. With long-range transformers like Longformer (Beltagy et al., 2020), Performer (Choromanski et al., 2021) and LongT5 (Guo et al., 2021) coming into vogue, this issue is somewhat mitigated. Regardless, one potential solution is to formulate a hierarchical version of MODULARPROMPT, similar in spirit to hierarchical softmax (Morin and Bengio, 2005). A hierarchical MODULARPROMPT would take multiple steps for prediction, with each step predicting labels at a specific level of the hierarchy.
Another potential solution is to treat all label prompts as memory units from which the model learns to select relevant ones for a given data instance, in the spirit of (Wu et al., 2022).

On generation tasks
As MODULARPROMPT shows SoTA performance and good modular characteristics for text classification in-the-wild, it is appealing to extend it to other tasks like Question Answering (QA), Machine Reading Comprehension (MRC) and Summarization. However, it is non-trivial for MODULARPROMPT to incorporate these tasks, as their target texts are unstructured without clear class labels. One potential solution is to instead consider attributes or properties of the target texts, which are also conditioning factors (e.g., formality, conciseness, topics, aspects and sentiment for summarization). With such definitions, it will be interesting to check whether the MODULARPROMPT framework can achieve good generalisation and conditional generation on text generation in-the-wild.

A More Details on Dataset Construction
In this section, we provide more complementary details to §4.1.

B More Qualitative Examples
Similar to §4.10, we show another example of MODULARPROMPT on stage-fused NER in Figure 4, and more example predictions in Figure 5. These additional examples strengthen the conclusions in §4.10. In the second example in Figure 5, MODULARPROMPT tags "Big Twin Sauce" as product food while its ground truth tag is product other. We can see that MODULARPROMPT considers the context, as the entity is associated with a restaurant. Similarly, in the third example, "Kobo Touch" is actually a hardware reader and its ground truth tag is product other. However, such world knowledge is not available, and MODULARPROMPT tags it as software based on the context of "eBooks" and libraries.

Figure 1 :
Figure 1: Example of MODULARPROMPT for stage-fused NER. Top 2 blocks: training stages covering disjoint sets of entity types (e.g., event, person). Bottom block: fused test stage covering the entity types person and organization from two training stages. Coloured boxes denote label prompts. Underlining and italic blue denote a named entity and its type.

3.1 Problem Definition

Single-stage text classification Assume a single text classification domain (or dataset) D. Let (X, Y ) ∼ D be a sample, where X = {x t } L t=1 represents a text input sequence of length L and Y = {y t } M t=1 represents the corresponding classification label name of length M (in tokens). Let Ω denote the set of all possible class labels of interest, for which we have ∀(X, Y ) ∼ D, cls(Y ) ⊆ Ω. Note that cls(Y ) is a mapping which returns the class label(s) in Y . In the case of single-class classification, cls(Y ) returns {Y }. In the case of sequence labelling, which is token-level classification, cls(Y ) returns the set of all unique target tags in Y .

Text classification in-the-wild Assume a sequence of n text classification stages with the corresponding training datasets D tr = {D tr 1 , ..., D tr n }. Each stage represents a different task in the temporal dimension, with (X k , Y k ) ∼ D tr k denoting a sample at the k-th training stage and Ω tr k denoting the set of all possible class labels for D tr k . Similarly, the testing could consist of m such datasets D ts = {D ts 1 , ..., D ts m }.
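The cls(Y ) mapping can be sketched for both cases; the `"entity ! tag ;"` target format below follows the appendix examples and is otherwise an assumption:

```python
def cls(Y, sequence_labelling=False):
    """cls(Y): {Y} for single-class classification, else the unique target tags."""
    if not sequence_labelling:
        return {Y}
    # e.g. Y = "Yak-3s ! product airplane ; BBC One ! organization media ;"
    return {seg.split("!")[1].strip()
            for seg in Y.split(";") if "!" in seg}

assert cls("style and beauty") == {"style and beauty"}
assert cls("A ! person ; B ! person ;", True) == {"person"}
```

This mapping is what the subset-invariant loss later uses to decide whether a sampled label set S covers the ground truth.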

Figure 2 :
Figure 2: Left: PROMPTTUNING, where consecutive soft tokens are concatenated with the input to a frozen PLM. Right: our proposed MODULARPROMPT, where multiple label prompts are concatenated together with the input to a frozen PLM. Each label prompt consists of a label name and a consecutive sequence of soft tokens.

Figure 3 :
Figure 3: Successful (blue) and failure (red) cases of MODULARPROMPT predictions for stage-fused NER

Figure 4 :
Figure 4: Another example of MODULARPROMPT for stage-fused NER.

According to Objective 1, we expect our model to make predictions grounded by the relevant label prompts. When S does not contain the ground truth class label(s) in Y , the model should not be able to predict Y as output. Thus, we set the loss to zero when cls(Y ) ⊈ S to avoid encouraging ungrounded predictions.

Table 1 :
Stage-agnostic performance on News Classification, NER and Relation Extraction (RE). Size denotes the average number of tunable parameters per training stage.

Table 4 :
Ablation study of MODULARPROMPT: average performance on stage-agnostic and stage-fused settings

Table 5 :
Mean stage-fused performance for different inference schemes.

Table 6 shows examples of text input and target for the three datasets. For Huffpost News and FewRel, we subsample exactly 200 and 50 examples per label class, respectively. For FewNERD, as it can have multiple label types per example, we subsample at 50 examples per label class.
Huffpost News: The 4 Best Last-Minute Christmas Gift Ideas. Scrambling to shop for that last relative, significant other or hard-to-buy-for friend? We're here to help. → style and beauty
FewNERD: One of the Yak-3s was destroyed right away . → Yak-3s ! product airplane ; All Together Now is a British reality television music competition which first aired on BBC One on 27 January 2018 . → All Together Now ! event other ; BBC One ! organization media ;
FewRel: Its main base was at Tampere-Pirkkala Airport ( TMP ) , Tampere . Tampere-Pirkkala Airport | Tampere → place served by transport hub ; He represented Sweden at the 1934 FIFA World Cup . 1934 FIFA World Cup | Sweden → participating teams

Table 6 :
Examples of text input → text target for Huffpost News, fewNERD and FewRel

Table 8 :
Detailed Stage-fused Performance on News Classification and NER