Conditional set generation using Seq2seq models

Conditional set generation learns a mapping from an input sequence of tokens to a set. Several NLP tasks, such as entity typing and dialogue emotion tagging, are instances of set generation. Seq2Seq models are a popular choice to model set generation, but they treat a set as a sequence and do not fully leverage its key properties, namely order-invariance and cardinality. We propose a novel algorithm for effectively sampling informative orders over the combinatorial space of label orders. Further, we jointly model the set cardinality and output by listing the set size as the first element and taking advantage of the autoregressive factorization used by Seq2Seq models. Our method is a model-independent data augmentation approach that endows any Seq2Seq model with the signals of order-invariance and cardinality. Training a Seq2Seq model on this new augmented data (without any additional annotations) yields an average relative improvement of 20% on four benchmark datasets across models ranging from BART-base to T5-11B and GPT-3. We will release all code and data upon acceptance.

In this paper, we posit that modeling set generation as a vanilla SEQ2SEQ generation task is suboptimal, because the SEQ2SEQ formulations do not explicitly account for two key properties of a set output: order-invariance and cardinality. Forgoing order-invariance, vanilla SEQ2SEQ generation treats a set as a sequence, assuming an arbitrary order between the elements it outputs. Similarly, the cardinality of sets is ignored, as the number of elements to be generated is typically not modeled.
Prior work has highlighted the importance of these two properties for set output through loss functions that encourage order invariance (Ye et al., 2021), exhaustive search over the label space for finding an optimal order (Qin et al., 2019; Rezatofighi et al., 2018; Vinyals et al., 2016), and post-processing the output (Nag Chowdhury et al., 2016). Despite the progress, several important gaps remain. First, exhaustive search does not scale with the large output spaces typically found in NLP problems, stressing the need for an optimal sampling strategy for the labels. Second, cardinality is still not explicitly modeled in the SEQ2SEQ setting despite being an essential aspect of a set. Finally, architectural modifications required for specialized set-generation techniques might not be viable for modern large language models.
We address these challenges with a novel data augmentation strategy. Specifically, we take advantage of the auto-regressive factorization used by SEQ2SEQ models and (i) impose an informative order over the label space, and (ii) explicitly model cardinality. First, the label sets are converted to sequences using informative orders by grouping labels and leveraging their dependency structure. Our method induces a partial order graph over the label space where the nodes are the labels, and the edges denote the conditional dependence relations. This graph provides a natural way to obtain informative orders while reinforcing order-invariance. Specifically, sequences obtained via topological traversals of this graph allow independent labels to appear at different locations in the sequence, while restricting the order of dependent labels. Next, we jointly model a set with its cardinality by simply prepending the set size to the output sequence. This strategy aligns with the current trend of very large language models, which do not lend themselves to architectural modifications but increasingly rely on the informativeness of the inputs (Yang et al., 2020; Liu et al., 2021).

Figure 1: An illustrative task where given an input x, the output is a set of emotions. Our method first discovers a partial order graph (middle) in which specific labels (joy) come before more general labels (pride). Listing the specific labels first gives the model more clues about the rest of the set. Topological samples from this partial order graph are label sequences that can be efficiently generated using SEQ2SEQ models. The size of each set is also added as the first element for joint modeling of output with size.
Figure 1 illustrates the key intuitions behind our method using a sample task where, given an input x (say a conversation), the output is a set of emotions (Y). To see why certain orders might be more meaningful, consider a case where one of the emotions is joy, which leads to a more general emotion of pride. After first generating joy, the model can generate pride with certainty (joy leads to pride in all samples). In contrast, the reverse order (generating pride first) still leaves room for multiple possible emotions (joy and love). The order [joy, pride] is thus more informative than [pride, joy]. The cardinality of a set can also be helpful. In our example, joy contains two sub-emotions, and love contains one. A model that first predicts the number of sub-emotions can be more precise and avoid overgeneration, a significant challenge with language generation models (Welleck et al., 2020; Fu et al., 2021). Our contributions: (i) We efficiently sample such informative orders from the combinatorial space of all possible orders and jointly model cardinality by leveraging the auto-regressive nature of SEQ2SEQ models. (ii) We theoretically ground our approach: treating the order as a latent variable, we show that our method serves as a better proposal distribution in a variational inference framework (§3.1). (iii) With our approach, SEQ2SEQ models of different sizes achieve a ∼20% relative improvement on four real-world tasks, with no additional annotations or architecture changes (§4).

Task
We are given a corpus where x_t is a sequence of tokens and Y_t = {y_1, y_2, ..., y_k} is a set. For example, in multilabel fine-grained sentiment classification, x_t is a paragraph, and Y_t is a set of sentiments expressed by the paragraph. We use y_i to denote an output symbol, [y_i, y_j, y_k] to denote an ordered sequence of symbols, and {y_i, y_j, y_k} to denote a set.

Set generation using SEQ2SEQ model
Task Given a corpus {(x_t, Y_t)}_{t=1}^{m}, the task of conditional set generation is to efficiently estimate p(Y_t | x_t). SEQ2SEQ models factorize p(Y_t | x_t) autoregressively (AR) using the chain rule:

p(Y_t | x_t) = ∏_{i=1}^{k} p(y_i | x_t, y_1, ..., y_{i-1}),

where the order Y_t = [y_1, y_2, ..., y_k] factorizes the joint distribution using the chain rule. In theory, any of the k! orders can be used to factorize the same joint distribution. In practice, the choice of order is important. For instance, Vinyals et al. (2016) show that output order affects language modeling performance when using LSTM-based SEQ2SEQ models for set generation.
Consider an example input-output pair (x_t, Y_t = {y_1, y_2}). By the chain rule, we have the following equivalent factorizations:

p(Y_t | x_t) = p(y_1 | x_t) p(y_2 | x_t, y_1) = p(y_2 | x_t) p(y_1 | x_t, y_2).

However, order-invariance is only guaranteed with true conditional probabilities, whereas the conditional probabilities used to factorize a sequence are estimated by a model from a corpus. Thus, depending on the order, the sequence factorizes as either p(y_1 | x) p(y_2 | x, y_1) or p(y_2 | x) p(y_1 | x, y_2), which are not necessarily equivalent. Further, one of these two factorizations may be better represented in the training data, and thus lead to better samples. For instance, if the training data always contains y_1, y_2 in the order [y_1, y_2], the estimate of p(y_2 | x) p(y_1 | x, y_2) will be ∼0.
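To make this asymmetry concrete, the following minimal count-based sketch (the corpus and label names are hypothetical) estimates both factorizations from a corpus that only ever lists the pair in one order:

```python
from collections import Counter

# Hypothetical training corpus: the pair {y1, y2} always appears as [y1, y2].
sequences = [["y1", "y2"]] * 10
n = len(sequences)

# Counts of which label is emitted first.
first = Counter(seq[0] for seq in sequences)

# Factorization matching the training order: p(y1|x) * p(y2|x, y1).
# The second factor is 1 here, since y2 always follows y1.
p_forward = (first["y1"] / n) * 1.0

# Reverse factorization: p(y2|x) * p(y1|x, y2).
# y2 is never emitted first, so the estimate collapses to 0.
p_reverse = first["y2"] / n

assert p_forward == 1.0 and p_reverse == 0.0
```

Both expressions denote the same joint distribution in theory, but the corpus-based estimates diverge completely, which is exactly why the choice of order matters in practice.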
Order will also be immaterial if the labels are conditionally independent given the input (Section B.3). However, this assumption is often not valid in practice, especially in NLP, where labels typically share a semantic relationship.

Method
This section expands on the two critical components of our system, SETAUG. Section 3.1 presents TSAMPLE, a method to tractably create informative orders over sets. Section 3.2 presents our method for jointly modeling cardinality and set output.

TSAMPLE: Adding informative orders for set output
As discussed in Section 2, the SEQ2SEQ formulation requires the output to be a sequence. Prior work (Vinyals et al., 2016; Rezatofighi et al., 2018; Chen et al., 2021) has noted that listing the output in orders that have the highest conditional likelihood given the input is an optimal choice. Unlike these methods, we sidestep exhaustive search during training using our proposed approach, TSAMPLE. Our core insight is that knowing the optimal order between pairs of symbols in the output drastically reduces the possible number of permutations. We thus impose pairwise order constraints on labels. Specifically, given an output set Y_t = {y_1, y_2, ..., y_k}, if y_i, y_j are independent, they can be added in an arbitrary order. Otherwise, a constraint is added on the order between y_i and y_j.

Learning pairwise constraints
We estimate the dependence between elements y_i, y_j using pointwise mutual information: pmi(y_i, y_j) = log p(y_i, y_j) / (p(y_i) p(y_j)). Here, pmi(y_i, y_j) > 0 indicates that the labels y_i, y_j co-occur more than would be expected under independence (Wettler and Rapp, 1993). We use pmi(y_i, y_j) > α to filter out such dependent pairs, and perform another check to determine if the order between them should be fixed. For each dependent pair y_i, y_j, the order is constrained to be [y_i, y_j] (y_j should come after y_i) if log p(y_j | y_i) − log p(y_i | y_j) > β, and [y_j, y_i] otherwise. Intuitively, log p(y_j | y_i) − log p(y_i | y_j) > β implies that knowing a set contains y_i increases the probability of y_j being present. Thus, in an autoregressive setting, fixing the order to [y_i, y_j] will be more efficient for generating a set containing {y_i, y_j}.
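A minimal sketch of these pairwise checks, assuming counts are estimated from a list of label sets (the function name and the toy data are illustrative; α = 1 and β = log₂ 3 follow the defaults reported later in the paper):

```python
import math
from collections import Counter
from itertools import combinations

def pairwise_constraints(label_sets, alpha=1.0, beta=math.log2(3)):
    """Keep pairs with pmi > alpha, then orient each dependent pair by
    comparing conditional likelihoods, returning edges (a, b) = 'a before b'."""
    n = len(label_sets)
    uni, pair = Counter(), Counter()
    for labels in label_sets:
        labels = set(labels)
        uni.update(labels)
        pair.update(frozenset(p) for p in combinations(sorted(labels), 2))
    edges = []
    for key, c_ij in pair.items():
        yi, yj = sorted(key)
        p_i, p_j, p_ij = uni[yi] / n, uni[yj] / n, c_ij / n
        if math.log2(p_ij / (p_i * p_j)) <= alpha:
            continue  # pair not dependent enough; leave its order free
        # log p(yj|yi) - log p(yi|yj), with p(yj|yi) = p_ij / p_i, etc.
        delta = math.log2(p_ij / p_i) - math.log2(p_ij / p_j)
        if delta > beta:
            edges.append((yi, yj))   # constrain yi before yj
        elif -delta > beta:
            edges.append((yj, yi))   # constrain yj before yi
    return edges

# Toy corpus: 'joy' always implies 'pride', but pride often occurs alone,
# so the specific label 'joy' is constrained to come first.
data = [{"joy", "pride"}] * 2 + [{"pride"}] * 6 + [{"x"}] * 12
print(pairwise_constraints(data))  # [('joy', 'pride')]
```

Pairs that never co-occur contribute no counts and hence no edge, which keeps the constraint graph sparse.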
Generating samples To systematically create sequences that satisfy these constraints, we construct a topological graph G_t where each node is a label y_i ∈ Y_t, and the edges are determined using the pmi and conditional probability checks outlined above (Algorithm 1). The required permutations can then be generated as topological traversals of G_t (Figure 2). We begin each traversal from a different starting node to generate diverse samples. We call this method TSAMPLE. Our method of generating graphs avoids cycles by design (proof in B.4), and thus topological sort remains well-defined. Later, we show that TSAMPLE can be interpreted as a proposal distribution in a variational inference framework, which distributes the mass uniformly over the informative orders constrained by the graph.

Algorithm 1 Generating permutations for Y_t
Input: Set Y_t, number of permutations n
Parameter: α, β
Output: n topological sorts over G_t(V, E)
1: V = Y_t, E = ∅
2: for each pair y_i, y_j ∈ Y_t do
3:   if pmi(y_i, y_j) > α and log p(y_i | y_j) − log p(y_j | y_i) > β then
4:     E = E ∪ {y_j → y_i}
5:   end if
6: end for
7: return topo_sort(G_t(V, E), n)

Do pairwise constraints hold for longer sequences? While TSAMPLE uses pairwise (and not higher-order) constraints for ordering variables, we note that the pairwise checks remain relevant with extra variables. First, dependence between a pair of variables is retained in joint distributions involving more variables (y_i ̸⊥⊥ y_j =⇒ y_i ̸⊥⊥ y_j, y_k for some y_k ∈ Y; Appendix B.1). Further, if y_i, y_j ⊥⊥ y_k, the order constraint between y_i and y_j is unaffected by conditioning on y_k (Appendix B.2). The first property shows that the pairwise dependencies hold in the presence of other set elements. The second property shows that an informative order continues to be informative when additional independent symbols are added. Thus, using pairwise dependencies between the set elements is still effective. Using higher-order dependencies might be suboptimal for practical reasons: higher-order dependencies (or those including x_t) might not be accurately discovered due to sparsity, and may thus cause spurious orders.
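The topo_sort step of Algorithm 1 can be sketched as Kahn's algorithm with random tie-breaking, so independent labels move around across samples while constrained pairs stay ordered (a sketch; the function name and seeding are illustrative):

```python
import random
from collections import defaultdict

def topo_samples(labels, edges, n, seed=0):
    """Draw n topological orders of the constraint graph G_t.
    edges contains pairs (a, b) meaning 'a must precede b'."""
    rng = random.Random(seed)
    succ = defaultdict(list)
    indeg = {y: 0 for y in labels}
    for a, b in edges:
        succ[a].append(b)
        indeg[b] += 1
    samples = []
    for _ in range(n):
        deg = dict(indeg)
        ready = [y for y in labels if deg[y] == 0]
        order = []
        while ready:
            # Random choice among currently unconstrained labels:
            y = ready.pop(rng.randrange(len(ready)))
            order.append(y)
            for z in succ[y]:
                deg[z] -= 1
                if deg[z] == 0:
                    ready.append(z)
        samples.append(order)
    return samples

orders = topo_samples(["joy", "pride", "love"], [("joy", "pride")], 5)
# Every sample is a full permutation with 'joy' before 'pride';
# 'love' is free to appear anywhere.
```

Because the constraint graph is acyclic by construction, every traversal consumes all labels, so each sample is a valid permutation of Y_t.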
Finally, we note that if all the labels are independent, then the order is guaranteed not to matter (Lemma B.3; also shown empirically in Appendix G). Thus, our method will only be useful when labels have some degree of dependence.
Complexity analysis Let Y be the label space, (x_t, Y_t) be a particular training example, N be the size of the training set, and c be the maximum number of elements in any set Y_t in the input. Our method requires three steps: i) iterating over the training data to learn conditional probabilities and pmi; ii) given a Y_t, building the graph G_t (Algorithm 1); and iii) performing topological traversals over G_t to create samples for (x_t, Y_t).
The time complexity of the first step is O(N c^2): for each training example, the pairwise count for each pair y_i, y_j and the unigram count for each y_i are calculated. The pairwise counts can be used to calculate joint probabilities. In principle, we need O(|Y|^2) space for storing the joint probabilities; in practice, only a small fraction of the |Y|^2 combinations appear in the corpus.
Given a set Y_t and the conditional probabilities, the graph G_t is created in O(c^2) time. Then, generating k samples from G_t requires topological sorts, taking O(kc) time (or O(c) per traversal). For training data of size N, the total time complexity is O(N ck). The entire process of building the joint counts and creating graphs and samples takes less than five minutes for all the datasets on an 80-core Intel Xeon Gold 6230 CPU.
Interpreting TSAMPLE as a proposal distribution over orders We show that our method of augmenting permutations to the training data can be interpreted as an instance of variational inference with the order as a latent variable, and TSAMPLE as an instance of a richer proposal distribution.

Figure 2: Our sampling method TSAMPLE first builds a graph G_t over the set Y_t, and then samples orders from G_t using topological sort (topo_sort). The topological sorting rejects samples that do not follow the conditional probability constraints.
Let π_j be the j-th order over Y_t (out of the |Y_t|! possible orders Π), and π_j(Y_t) be the sequence of elements in Y_t arranged with order π_j. Treating π as a latent random variable, the output distribution can then be recovered by marginalizing over all possible orders Π:

p(Y_t | x_t) = Σ_{π_j ∈ Π} p(π_j) p_θ(π_j(Y_t) | x_t),

where p_θ is the SEQ2SEQ conditional generation model. While summing over Π is intractable, standard techniques from the variational inference framework allow us to write a lower bound (ELBO) on the actual likelihood:

log p(Y_t | x_t) ≥ E_{π ∼ q(π)} [log p_θ(π(Y_t) | x_t)] − KL(q(π) ∥ p(π)).

In practice, the optimization procedure draws k samples from the proposal distribution q to optimize a weighted ELBO (Burda et al., 2016; Domke and Sheldon, 2018). Crucially, q can be fixed (e.g., to a uniform distribution over the orders), and in such cases only θ is learned (Appendix H). The data augmentation approach used for XLNet (Yang et al., 2019b) can be interpreted within this framework: their proposal distribution q is fixed to a uniform distribution over orders of tokens. The variational interpretation also indicates that it might be possible to improve language modeling by using a different, more informative q. Investigating such proposal distributions for language modeling is interesting future work.
TSAMPLE can thus be seen as a particular proposal distribution that assigns all the support to the topological orderings over the label dependence graphs. We also experiment with sampling from a uniform distribution over orders (referred to as RANDOM in our baseline setup). The idea of using an informative proposal distribution over a space of structures to do variational inference has also been used in the context of grammar induction (Dyer et al., 2016) and graph generation (Jin et al., 2018; Chen et al., 2021). Our formulation is closest in spirit to Chen et al. (2021). However, in their graph generation setting, the set of nodes to be ordered is already given. In contrast, we infer the order and the set elements jointly from the input.

Modeling cardinality
Let m = |Y_t| be the cardinality (number of elements) of Y_t. Our goal is to jointly estimate m and Y_t (i.e., p(m, Y_t | x_t)). Additionally, we want the model to use the cardinality information when generating Y_t. We add the cardinality information at the beginning of the sequence, thus converting a sample (x_t, Y_t) to (x_t, [m, π(Y_t)]), and then train our SEQ2SEQ model as usual. Here π is some ordering that converts the set Y_t to a sequence. As SEQ2SEQ models use autoregressive factorization, listing the cardinality first ensures that the distribution over set elements is conditioned on the set size.

Why should cardinality help? Unlike models like deep sets (Zhang et al., 2019a), SEQ2SEQ models are not restricted in the number of elements generated. However, adding cardinality information has two potential benefits: i) it can help avoid over-generation (Welleck et al., 2020; Fu et al., 2021), and ii) unlike for free-form text output, the distribution of the set output size (p(|Y_t| | x_t)) can be learned, helping the model adhere to the set size constraint. Thus, information on the predicted size can help the model predict the elements to be generated.
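A minimal sketch of this serialization, prepending |Y_t| to the ordered labels (the separator tokens here are illustrative assumptions, not the paper's exact target format):

```python
def to_target(ordered_labels):
    """Serialize an already-ordered label sequence with its cardinality
    prepended, so the model predicts the size before the elements."""
    return f"{len(ordered_labels)} | " + ", ".join(ordered_labels)

def from_target(text):
    """Inverse: recover (predicted cardinality, label list) from an output."""
    head, _, rest = text.partition("|")
    labels = [y.strip() for y in rest.split(",") if y.strip()]
    return int(head.strip()), labels

print(to_target(["joy", "pride"]))  # "2 | joy, pride"
```

At inference time, from_target makes it easy to compare the predicted cardinality against the number of elements actually generated.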

Experiments
SETAUG comprises: i) TSAMPLE, a way to generate informative orders to convert sets to sequences, and ii) CARD, joint modeling of cardinality and the set output. This section answers two questions. RQ1: How well does SETAUG improve existing models? Specifically, how well can SETAUG take an existing SEQ2SEQ model and improve it using only our data augmentation and joint cardinality prediction, without making any changes to the model architecture? We also measure whether these performance improvements carry across diverse datasets, model classes, and inference settings. RQ2: Why does our approach improve performance? We study the contributions of TSAMPLE and joint cardinality prediction (CARD), and analyze where SETAUG works or fails.

Setup
Tasks We consider multi-label classification and keyphrase generation. These tasks represent set generation problems where the label space spans a set of fixed categories (multi-label classification) or free-form phrases (keyphrase generation).

1. Multi-label classification: We use three datasets of varying sizes and label spaces:
• Go-Emotions classification (GO-EMO, Demszky et al. (2020)): generate a set of emotions for a paragraph.
• Open Entity Typing (OPENENT, Choi et al. (2018)): assigning open types (free-form phrases) to the tagged entities in the input text.

2. Keyphrase generation (KEYGEN):
We experiment with a popular keyphrase generation dataset, KP20K (Meng et al., 2017), which involves generating keyphrases for a scientific paper abstract. Table 1 lists the dataset statistics, and examples from each dataset are shown in Appendix E. We treat all the problems as open-ended generation and do not use any specialized pre-processing. For all the datasets, we filter out samples with a single label. For each training sample, we create n permutations using SETAUG.
Baselines We compare with two baselines: i) MULTI-LABEL: as a non-SEQ2SEQ baseline, we train a multi-label classifier that makes independent predictions of the output labels. Both encoder-only and encoder-decoder approaches can be adapted for MULTI-LABEL, and we experiment with BART (encoder-decoder) and BERT (encoder-only). ii) SET SEARCH: a vanilla SEQ2SEQ model decoded with beam search; the unique elements generated by beam search are returned as the set output, a popular approach for one-to-many generation tasks (Hwang et al., 2021).
Training We augment n = 2 permutations to the original data using TSAMPLE. For all the results, we use three epochs and the same number of training samples (i.e., input data for the baselines is oversampled). This controls for models trained with augmented data improving only because of factors such as longer training time. All the experiments were repeated with three different random seeds, and we report the averages. We found from our experiments that hyperparameter tuning over α, β did not affect the results in any significant way. For all the experiments reported, we use α = 1 and β = log_2(3). We use a single GeForce RTX 2080 Ti for all our experiments on BART, and a single TPU for all experiments with T5-11B. For GPT3-175B, we use the OpenAI completion engine (davinci) API (OpenAI, 2021).

SETAUG improves existing models
Our method helps across a wide range of models (BART, T5-11B, and GPT3-175B) and tasks.

Multi-label classification
Table 2 shows improvements across all datasets and models for the multi-label classification task (∼20% relative gains). For brevity, we list the macro F score, and include detailed results, including macro/micro precision, recall, and F scores, in Table 9 (Appendix F). We attribute the comparatively lower performance of the SET SEARCH baseline to two specific reasons: repeated generation of the same set of terms (e.g., person, business for OPENENT) and generating elements not present in the test set (see Section 4.3.4 for a detailed error analysis). We see similar trends with GPT3-175B (Section 4.2.4).

Keyphrase generation
To further motivate the utility of SEQ2SEQ models for set generation tasks, we experiment on KP20K, an extreme multi-label classification dataset (Meng et al., 2017) with a label space spanning free-form keyphrases. Table 3 shows the results. As with the datasets with smaller label spaces, our method improves on vanilla SEQ2SEQ.
We want to emphasize that while specialized models for individual tasks might be possible, our aim is a general approach showing that sampling informative orders enables efficient and general set-generation modeling.

Simulations
Following prior work on effectively studying deep network properties via simulation (Vinyals et al., 2016; Khandelwal et al., 2018), we design a simulation to study the effects of output order and cardinality on conditional set generation. The simulation reveals several key properties of our method, and the trends match those on real data. We defer the details to Appendix G and mention some key findings here: our method is (i) ineffective when labels are independent, (ii) helpful even when position embeddings are disabled, and (iii) helpful across a wide range of sampling types.

Few-shot prompting with GPT3-175B
We fine-tune the generation models using augmented data for both BART and T5-11B. However, fine-tuning models at the scale of GPT3-175B is prohibitively expensive, so such models are typically used in a few-shot prompting setup: M (∼10-100) input-output examples are selected as a prompt p, a new input x is appended to the prompt p, and p∥x is the input to GPT3-175B. Improving these prompts, sometimes referred to with the umbrella term prompt tuning (Liu et al., 2021), is a popular and emergent area of NLP. Our approach is the only feasible candidate for such settings, as it involves neither changing the model nor additional post-processing. We apply our approach to tuning prompts for generating sets in few-shot settings. We focus on the GO-EMO and OPENENT tasks, as their relatively short input examples allow cost-effective experiments. We randomly create a prompt with M = 24 examples from the training set and run inference over the test set. For each example in the prompt, we order the set of labels using our ordering approach TSAMPLE and compare the results with random orderings. Using TSAMPLE to arrange the labels outperforms random ordering for both OPENENT (macro F 39.5 with ours vs. 34, a 15% statistically significant relative improvement) and GO-EMO (macro F 16.5 with ours vs. 14.5, a 14% relative improvement). This suggests that ordering helps performance in resource-constrained settings such as few-shot prompting.

As mentioned in Section 3, our method of generating sets with SEQ2SEQ models consists of two components: i) a strategy for sampling informative orders over the label space (TSAMPLE), and ii) jointly generating the cardinality of the output (CARD). This section studies the individual contributions of these components in order to answer RQ2.
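The few-shot setup above can be sketched as simple prompt assembly, where each in-context example's label set is serialized with an informative order (order_fn stands in for TSAMPLE; the "Input:"/"Labels:" template is an assumption, not the paper's exact prompt):

```python
def build_prompt(examples, query, order_fn):
    """Assemble a few-shot prompt p||x from (input, label set) examples,
    ordering each example's labels with order_fn before serializing."""
    blocks = [
        f"Input: {x}\nLabels: {', '.join(order_fn(labels))}"
        for x, labels in examples
    ]
    blocks.append(f"Input: {query}\nLabels:")  # the new input x to complete
    return "\n\n".join(blocks)

p = build_prompt([("I won the game!", {"joy", "pride"})], "We lost.", sorted)
```

Here `sorted` is a stand-in ordering; in the paper's setup the labels of each prompt example would instead be arranged by TSAMPLE, and random permutations serve as the baseline.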

Ablation study
We ablate the two critical components of our system, cardinality (SETAUG − CARD) and order (SETAUG − TSAMPLE), and investigate the performance in each of these settings using BART for multi-label classification. Table 4 presents the results. Both components individually help, but a larger drop is seen when removing cardinality. We also train using RANDOM orders instead of TSAMPLE. RANDOM does not improve over SEQ2SEQ consistently (both with and without CARD), showing that merely augmenting with random permutations does not help. Further, Appendix F shows that cardinality is useful even with RANDOM order.

Role of order
Nature of permutations created by SETAUG SETAUG encourages highly co-occurring pairs (y_i, y_j) to be in the order [y_i, y_j] if p(y_j | y_i) > p(y_i | y_j). Our analysis of this dependency in the datasets shows that the resulting orders exhibit a pattern where specific labels appear before generic ones. E.g., in entity typing, the more generic entity event is generated after the more specific entities home game and match (see Figure 3).
Increasing the number of permutations (n) helps: Fig. 4 shows that both SETAUG and RANDOM improve as n is increased from 2 to 10; SETAUG outperforms RANDOM across all n.
Reversing the order hurts performance To test our hypothesis that it is specifically informative orders that help set generation, we invert the label dependencies returned by SETAUG for all the datasets and train with the same model settings. Across all datasets, we observe that reversing the order leads to an average 12% drop in F1 score. The reversed order not only closes the gap between SETAUG and RANDOM, but in many instances performs slightly worse than RANDOM.
Ordering by frequency Yang et al. (2018) use frequency ordering, where the most frequent label is placed first in the sequence. We compare with this baseline in Table 4 (FREQ). The results indicate that the performance of frequency-based ordering is dataset-dependent. Relying on a fixed criterion like frequency might lead to skewed outputs, especially for datasets with a long tail of labels. For instance, for OPENENT, one of the most significant failure modes of the FREQ method was generating the most common label in the corpus (person) for every input. TSAMPLE can be seen as a way to balance the most and least frequent labels in the corpus using PMI and conditional likelihood (Algorithm 1, L3).

Role of cardinality
Cardinality is successfully predicted and used Table 4 shows that cardinality is crucial to modeling set output. To study whether the models learn to condition on the predicted cardinality, we compute an agreement score, defined as the percentage of times the predicted cardinality matches the number of elements generated by the model. The model predicts the cardinality almost exactly on the GO-EMO and REUTERS datasets (avg. 95%). While the exact-match agreement is lower on OPENENT (35%), the model is within an error of ±1 in 93% of the cases. These results show that the model uses the predicted cardinality to decide when to end the sequence (i.e., when to emit the EOS token). The accuracy of predicting the exact cardinality is 61% across datasets, and it increases to 76% within an error of 1 (i.e., when the predicted cardinality is off by at most 1).
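The agreement score can be computed directly from the serialized outputs (a sketch; the `<k> | a, b` output format is an assumption carried over from the serialization above):

```python
def cardinality_agreement(outputs, tolerance=0):
    """% of outputs whose predicted cardinality (the first token) is within
    `tolerance` of the number of elements actually generated."""
    hits = 0
    for text in outputs:
        head, _, rest = text.partition("|")
        predicted = int(head.strip())
        generated = len([y for y in rest.split(",") if y.strip()])
        hits += abs(predicted - generated) <= tolerance
    return 100.0 * hits / len(outputs)

outs = ["2 | joy, pride", "3 | joy, pride"]
print(cardinality_agreement(outs))               # 50.0 (exact match)
print(cardinality_agreement(outs, tolerance=1))  # 100.0 (within +/-1)
```

Setting `tolerance=1` corresponds to the relaxed ±1 agreement reported for OPENENT.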
Information about cardinality improves multi-label classification The MULTI-LABEL baseline uses different values of k for predicting labels. To test whether knowledge of cardinality improves multi-label classification, we experiment with a setting where the true cardinality is available at inference (i.e., k is set to the true cardinality). Table 5 shows that cardinality improves performance.
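This oracle-cardinality setting amounts to keeping the top-k scored labels with k set to the true set size (a sketch; the scores and label names are hypothetical):

```python
def predict_with_cardinality(label_scores, k):
    """Given per-label scores from a multi-label classifier, keep the
    k highest-scoring labels; with oracle cardinality, k = |Y_t|."""
    ranked = sorted(label_scores, key=label_scores.get, reverse=True)
    return set(ranked[:k])

scores = {"joy": 0.9, "pride": 0.7, "love": 0.2}
print(predict_with_cardinality(scores, 2))  # {'joy', 'pride'}
```

Varying k here reproduces the baseline's behavior, while fixing k to the true cardinality gives the oracle setting of Table 5.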

Error analysis
We manually compare the outputs generated by the vanilla BART model with BART + SETAUG. For the open entity typing dataset, we randomly sample 100 examples and find that the vanilla SEQ2SEQ approach generates sets with an ill-formed element 22% of the time, whereas SETAUG completely avoids this. Examples of such ill-formed elements include personformer, businessirm, polit, foundationirm, politplomat, and eventlete. This analysis indicates that training the model with an informative order infuses more information about the underlying type hierarchy, avoiding ill-formed elements.

Related work
Set generation in NLP Prior work has noted the impact of order on the performance of text generation models (Vinyals et al., 2016), especially in the context of keyphrase generation (Meng et al., 2019). Approaches to explicitly model set properties for NLP tasks include performing an exhaustive search to find the best order (Vinyals et al., 2016), modifying the training loss (Qin et al., 2019), or applying post-processing (Nag Chowdhury et al., 2016). Notably, Ye et al. (2021) introduced One2Set, a method for training order-invariant models for generating sets of keyphrases. Our main goal in this work is to provide a framework to identify useful orders for set generation and to show that such orders can help vanilla SEQ2SEQ models. SETAUG can work with any SEQ2SEQ model and is complementary to these specialized methods.
Non-SEQ2SEQ set generation These include reinforcement learning for multi-label classification (Yang et al., 2019a) and combinatorial problems (Nandwani et al., 2020), and pointer networks for keyphrase extraction (Ye et al., 2021). We focus on optimally adapting existing SEQ2SEQ models for set generation, without external knowledge (Wang et al., 2021; Zhang et al., 2019b). Chen et al. (2021) explored the generation of an optimal order for graph generation given the nodes. They observed that ordering nodes before inducing edges improves graph generation. However, in our case, since the labels are unknown, conditioning on the labels to create the optimal order is not possible.

Connection with Janossy pooling Murphy et al. (2019) generalize deep sets by encoding a set of N elements by pooling permutations of P(N, k) tuples. With k = N, their method is the same as pooling all N! sequences, and with k = 1 it reduces to deep sets. We share with Janossy pooling the goal of tractably searching over the N! orders, but instead of iterating over all possible 2-tuples, we impose pairwise constraints on the element order.

Modeling set input A number of techniques have been proposed for encoding set-shaped inputs (Santoro et al., 2017; Zaheer et al., 2017; Lee et al., 2019; Murphy et al., 2019; Huang et al., 2020; Kim et al., 2021). Specifically, Zaheer et al. (2017) propose deep sets, showing that pooling the representations of individual set elements and feeding the resulting features to a non-linear network is a principled way of representing sets. Lee et al. (2019) present a permutation-invariant attention mechanism, a modified version of attention (Vaswani et al., 2017), to encode shapes and images. Unlike these works, we focus on settings where the input is a sequence and the output is a set.

Conclusion and Discussion
We present SETAUG, a novel data augmentation method for conditional set generation that incorporates informative orders and adds cardinality information. Our key idea is to use the most likely order (vs. a randomly selected order) to represent a set as a sequence, and to condition the generation of a set on its predicted cardinality. As a computationally efficient and general-purpose plug-in data augmentation algorithm, SETAUG improves SEQ2SEQ models for set generation across a wide spectrum of tasks. For future work, it would be interesting to investigate whether the general ideas in this work apply beyond set generation, e.g., generating additional data to improve language modeling in low-resource scenarios, or determining the ideal order of in-context examples in a prompt.

Limitations
Ineffectiveness on independent sets SETAUG is only useful when the labels share some degree of dependence. For tasks where the labels are completely independent, SETAUG will not be effective: it can be shown that order does not affect learning the joint distribution over labels if the labels are independent (Lemma B.3). Thus, in such settings, any method (including SETAUG) that seeks to leverage relationships between labels will not help. In addition to Lemma B.3, we conduct thorough simulation studies to verify this limitation (Figure 8).
Use of large language models We perform experiments with extremely large models, including T5-XXL and GPT-3. In particular, GPT-3 is only available through the OpenAI API; thus, not all details about its workings are publicly available. However, our experiments also show results using BART models that run on a single RTX 2080 GPU (please also see the details on reproducibility in Appendix A). Further, such language models are typically trained on large English corpora, which is also the focus of our work.

Focus on SEQ2SEQ
A key limitation of our work is that it focuses on set generation using SEQ2SEQ models. Thus, the proposed insights may not apply to other settings (e.g., computer vision) where using language models is not directly feasible. Nevertheless, with the growing popularity of libraries like Huggingface (Wolf et al., 2019), we anticipate that SEQ2SEQ models will be applied to a growing number of use cases, even those that would traditionally be tackled using a non-SEQ2SEQ method. Further, we compare our method with representative non-SEQ2SEQ baselines (such as a multi-label classifier).
To our knowledge, our work does not directly use any datasets that contain explicit societal biases.Therefore, we anticipate that our work will not lead to any significant negative implications concerning real-world applications.

A Reproducibility
We take the following steps for reproducibility of our results:
1. All the experiments are performed with three different random seeds. In addition, we conduct a proportion-of-samples hypothesis test to establish the statistical significance of our results. We did not perform extensive hyperparameter tuning, and used the same set of defaults for the baselines and our proposed method.
2. For all data augmentation experiments, we match the number of training samples and epochs; all the models are trained for the same duration. This alleviates the concern that the models perform well with augmented data merely because of the longer training time.
3. We conduct a proportion of samples test for all the experiments conducted on real-world datasets and use a small p = 0.0005 to measure highly significant results, which are indicated with an underscore.
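A proportion-of-samples test of the kind described above can be sketched as a two-proportion z-test; the exact test statistic the paper uses is not spelled out here, so this is one standard reading, and the counts below are hypothetical.

```python
# Sketch of a one-sided two-proportion z-test, checked against the paper's
# stated threshold p = 0.0005. Counts are illustrative, not from the paper.
import math

def two_proportion_p_value(k1, n1, k2, n2):
    """One-sided p-value for H1: proportion 1 > proportion 2."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return 0.5 * (1 - math.erf(z / math.sqrt(2)))  # 1 - Phi(z)

# e.g., 620/1000 vs 500/1000 correct predictions: clearly significant.
assert two_proportion_p_value(620, 1000, 500, 1000) < 0.0005
```

The small threshold (p = 0.0005) means only very large gaps between the two systems' success proportions are flagged as significant.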
Our work aims to promote the usage of existing resources for as many use cases as possible. In particular, all our experiments are performed on the BASE version of the model (BART), which has a relatively low parameter count, to conserve resources and help lower our impact on climate change.

B Proofs
Let Y be the output space, y_i, y_j ∈ Y, and y_k ∈ Y − {y_i, y_j}, i.e., y_k ranges over the symbols excluding y_i and y_j. We assume that all the distributions are positive (i.e., p(y) > 0 ∀ y ∈ Y).

Lemma B.1.
Proof Let y_i ⊥⊥ (y_j, y_k), by contradiction. Then: However,

Proof We have:

Lemma B.3. If y_i ⊥⊥ y_j ∀ y_i, y_j ∈ Y, the order is guaranteed to not affect learning.
Proof Let π_j be the j-th order over Y (out of the |Y|! possible orders Π), and let π_j(Y) be the sequence of elements of Y arranged according to π_j. As y_i ⊥⊥ y_j ∀ y_i, y_j, we have p(y_i | y_j) = p(y_i). This gives:

p(π_j(Y)) = ∏_{i=1}^{|Y|} p(y_{π_j(i)} | y_{π_j(1)}, ..., y_{π_j(i−1)}) = ∏_{i=1}^{|Y|} p(y_{π_j(i)})

In other words, when all elements are mutually independent, all possible joint factorizations are simply a product of the marginals, and thus identical.
Lemma B.4. The graphs constructed to sample orders for SETAUG cannot have cycles.
Proof Let y_i, y_j, y_k form a cycle: y_i → y_j → y_k → y_i. By construction, the following conditions must hold for such a cycle to be present:

log p(y_j | y_i) − log p(y_i | y_j) > β ⟹ log p(y_i) < log p(y_j)
log p(y_k | y_j) − log p(y_j | y_k) > β ⟹ log p(y_j) < log p(y_k)
log p(y_i | y_k) − log p(y_k | y_i) > β ⟹ log p(y_k) < log p(y_i)

Putting the three implications together, we get log p(y_i) < log p(y_j) < log p(y_k) < log p(y_i), which is a contradiction. Hence, the graphs constructed for SETAUG cannot have a cycle.
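The edge rule in Lemma B.4 can be sketched concretely: add an edge y_i → y_j whenever log p(y_j | y_i) − log p(y_i | y_j) > β. Since that difference equals log p(y_j) − log p(y_i), edges always point from lower-marginal to higher-marginal labels, so the resulting graph is acyclic. The co-occurrence counts and β below are hypothetical.

```python
# Sketch of the order-graph construction from Lemma B.4, with an acyclicity
# check via Kahn-style elimination. Counts are illustrative placeholders.
import math

pair_counts = {("a", "b"): 30, ("a", "c"): 10, ("b", "c"): 25}
counts = {"a": 50, "b": 80, "c": 120}
beta = 0.0

def edge(i, j):
    joint = pair_counts.get((i, j)) or pair_counts.get((j, i))
    # log p(y_j | y_i) - log p(y_i | y_j) reduces to log p(y_j) - log p(y_i)
    return math.log(joint / counts[i]) - math.log(joint / counts[j]) > beta

edges = [(i, j) for i in counts for j in counts if i != j and edge(i, j)]

def is_acyclic(nodes, edges):
    indeg = {n: 0 for n in nodes}
    for _, j in edges:
        indeg[j] += 1
    queue = [n for n in nodes if indeg[n] == 0]
    seen = 0
    while queue:
        n = queue.pop()
        seen += 1
        for a, b in edges:
            if a == n:
                indeg[b] -= 1
                if indeg[b] == 0:
                    queue.append(b)
    return seen == len(nodes)  # all nodes eliminated => no cycle

assert is_acyclic(list(counts), edges)
```

With β = 0, edges here run a → b → c (lowest to highest marginal count), and a topological order over this graph yields one informative label sequence.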

C Sample graphs
In this section, we present additional examples from the REUTERS and GO-EMO datasets to illustrate the permutations generated by our method. As discussed in Section 3.1, SETAUG encourages a highly co-occurring pair (y_i, y_j) to be in the order y_i, y_j if p(y_j | y_i) > p(y_i | y_j). In our analysis, this dependency shows that the orders exhibit a pattern where specific labels appear before the generic ones. For example, in GO-EMO, the generic emotion sadness is generated after the more specific emotions remorse and fear (Figure 5). Similarly, in REUTERS, the entity crude is generated after the entities gas and nat-gas (Figure 6, right).

D Hyperparameters
We list all the hyperparameters in Table 6.

E Dataset
Table 7 shows examples for each of the datasets.

F Additional results
This section presents detailed results that were omitted from the main paper for brevity. This includes macro and micro precision, recall, and F scores on all datasets, and additional ablation experiments.
1. Table 9 shows the detailed results from the four tasks.
3. Table 13 includes results from a multi-label classification baseline where bert-base-uncased is used as the encoder.

G Exploring the influence of order on SEQ2SEQ models with a simulation
We design a simulation to investigate the effects of output order and cardinality on conditional set generation, following prior work that has found simulation to be an effective tool for studying the properties of deep neural networks (Vinyals et al., 2016; Khandelwal et al., 2018). We note that a number of techniques have been proposed for encoding set-shaped inputs (Santoro et al., 2017; Zaheer et al., 2017; Lee et al., 2019; Murphy et al., 2019; Huang et al., 2020; Kim et al., 2021). Unlike these works, we focus on settings where the input is a sequence and the output is a set, and design the data generation process accordingly.

Data generation
We use a graphical model (Figure 7) to generate conditionally dependent pairs (x, Y), with different levels of interdependence among the labels in Y. Let Y = {y_1, y_2, ..., y_N} be the set of output labels. We sample a dataset of the form {(x, y)}_{i=1}^{m}.
Here, x is an N-dimensional multinomial sampled from a Dirichlet parameterized by α, and y is a sequence of symbols with each y_i ∈ Y. The output sequence y is created in B blocks, each of size k. A block is created by first sampling k − 1 prefix symbols independently from Multinomial(x), denoted by y_p. The k-th suffix symbol (y_s) is sampled from a uniform distribution with probability ϵ, or is otherwise deterministically determined from the preceding k − 1 prefix terms. For a block size of 1 (k = 1), the output is simply a set of size B sampled from x (i.e., all the elements are independent). Similarly, k = 2 simulates a situation with a high degree of dependence: each block is of size 2, with the prefix sampled independently from the input, and the suffix determined deterministically from the prefix. Gradually increasing the block size increases the number of independent elements.
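The generative process above can be sketched as follows, under our reading of the text; the deterministic prefix-to-suffix map is a hypothetical choice, since the paper does not specify one here.

```python
# Sketch of the simulation's data generation (cf. Figure 7): Dirichlet-drawn
# multinomial input x, output built in B blocks of size k. The suffix map
# (sum of prefix, hashed mod N) is a stand-in for "deterministically
# determined from the prefix".
import numpy as np

rng = np.random.default_rng(0)
N, B, k, eps, alpha = 10, 4, 2, 0.1, 0.5

def sample_example():
    x = rng.dirichlet(np.full(N, alpha))        # input: multinomial parameters
    y = []
    for _ in range(B):
        prefix = list(rng.choice(N, size=k - 1, p=x))   # k-1 i.i.d. prefix symbols
        if rng.random() < eps:
            suffix = int(rng.integers(N))               # noise: uniform suffix
        else:
            suffix = (sum(prefix) * 7 + 3) % N          # deterministic map (hypothetical)
        y.extend(prefix + [suffix])
    return x, y

x, y = sample_example()
assert len(y) == B * k
```

Varying k in this sketch reproduces the spectrum the paper describes: k = 2 makes half the symbols fully determined by their neighbors, while larger k dilutes the dependence.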

G.1 Major Findings
We now outline our findings from the simulation. We use the architecture of BART-base (Lewis et al., 2020) (six layers each of encoder and decoder) without pre-training for all simulations. All the simulations were repeated with three different random seeds, and we report the averages.
Finding 1: SEQ2SEQ models are sensitive to order, but only if the labels are conditionally dependent on each other. We train with the prefix y_p listed in lexicographic order. At test time, the order of the output is randomized, from 0% (same order as training) to 100% (fully shuffled). As can be seen from Figure 8, the perplexity gradually increases with the degree of randomness. Further, note that perplexity is an artifact of the model and is independent of the sampling strategy used, showing that order affects learning.
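The test-time perturbation in Finding 1 can be sketched as shuffling a chosen fraction of output positions; the exact routine is our assumption, not taken from the paper.

```python
# Sketch: randomize a given fraction of a sequence's positions, from 0.0
# (training order preserved) to 1.0 (fully shuffled).
import random

def partially_shuffle(seq, fraction, seed=0):
    rng = random.Random(seed)
    n = max(0, min(len(seq), round(fraction * len(seq))))
    idx = rng.sample(range(len(seq)), n)   # positions to perturb
    vals = [seq[i] for i in idx]
    rng.shuffle(vals)                      # permute only the chosen positions
    out = list(seq)
    for i, v in zip(idx, vals):
        out[i] = v
    return out

labels = ["anger", "fear", "joy", "remorse", "sadness"]
assert partially_shuffle(labels, 0.0) == labels
assert sorted(partially_shuffle(labels, 1.0)) == sorted(labels)
```

Sweeping `fraction` over a grid and measuring perplexity at each point yields curves like those in Figure 8.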
Finding 2: Training with random orders makes the model less sensitive to order. As Figure 9 shows, augmenting with random orders makes the model less sensitive to order. Further, augmenting with random orders keeps helping: the perplexity gradually falls, and the drop shows no signs of flattening.
Finding 3: Effects of position embeddings can be overcome by augmenting with a sufficient number of random samples. Figure 9 shows that while disabling position embeddings helps the baseline, similar effects are soon achieved by increasing the number of random-order augmentations. This shows that, although disabling position embeddings can indeed alleviate some concerns about order, augmentation achieves the same effect. This is crucial for pre-trained models, for which position embeddings cannot be removed.
Finding 4: SETAUG leads to higher set overlap. We next consider blocks of order 2 where the prefix symbol y_p is selected randomly as before, but the suffix is set to a special character y'_p with 50% probability. As the special symbol y'_p only occurs with y_p, there is a high PMI between each (y_p, y'_p) pair, as p(y_p | y'_p) = 1. Different from Finding 1, the output symbols are now shuffled to mimic a realistic setup. We gradually augment the training data with random and topological orders, and evaluate the learning and the final set overlap using training perplexity and Jaccard score, respectively. The results are shown in Figure 10. Similar trends hold for larger block sizes, and the results are included in the Appendix in the interest of space.

Figure 8: Perplexity vs. randomness for varying block sizes. The degree of dependence between the labels is highest for block size = 2, where each label depends on the preceding label. In such cases, the model is most affected by shuffling the order at test time. In contrast, with a block size of 1, the perplexity is nearly unaffected by the order. This result complements Lemma B.3 in showing that order will not affect SEQ2SEQ models if all the labels are independent of each other.
Finding 5: SETAUG helps across all sampling types. We see from Table 14 that our approach is not sensitive to the sampling type used. Across five different sampling types, augmenting with topological orders yields significant gains.
Finding 6: SEQ2SEQ models can learn cardinality and use it for better decoding. We created sample data from Figure 7 where the length of the output is determined by the sum of the inputs x. We experimented with and without including cardinality as the first element. We found that training with cardinality increases set overlap by over 13%, from 40.54 to 46.13. Further, the version with cardinality generated sets that had the same length as the target 70.64% of the time, as opposed to 27.45% for the version without cardinality.
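The cardinality augmentation can be sketched as a simple target serialization: emit the set size as the first token, so the autoregressive decoder conditions every subsequent label on it. The exact token format below is illustrative, not the paper's serialization.

```python
# Sketch: prepend the set cardinality to the serialized target, and truncate
# the decoded output to the stated size at parse time.
def to_target(label_set, order):
    labels = sorted(label_set, key=order.index)   # an informative order
    return " ".join([f"<{len(labels)}>"] + labels)

def parse_output(text):
    tokens = text.split()
    card = int(tokens[0].strip("<>"))
    return tokens[1:1 + card]                     # enforce stated cardinality

order = ["remorse", "fear", "sadness"]            # hypothetical label order
target = to_target({"sadness", "remorse"}, order)
assert target == "<2> remorse sadness"
assert parse_output(target) == ["remorse", "sadness"]
```

Because the cardinality token is generated first, it participates in the autoregressive factorization: every later label probability is conditioned on the predicted set size.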

Figure 5: Label dependencies discovered by TSAMPLE for GO-EMO

Figure 6: Label dependencies discovered by TSAMPLE for REUTERS

Figure 7: The generative process for the simulation

Figure 9: Augmenting the dataset with multiple orders helps across block sizes. Augmentations also overcome any benefit that is obtained by using position embeddings.

Figure 10: Effect of SETAUG on perplexity and set overlap. Left: augmentations done by SETAUG help the model converge faster and to a lower perplexity. Right: using SETAUG, the set overlap increases consistently, while consistently outperforming RANDOM.

Table 2: SETAUG improves SEQ2SEQ models by ∼20% relative F1 points on three multi-label classification datasets. BART and T5-11B are trained on the original datasets with a random order and no cardinality. "+SETAUG" indicates augmented training data using TSAMPLE, with cardinality prepended to the output sequence. Additional hyperparameter details are in Appendix D. We use greedy sampling for all experiments. Our method remains effective across five different sampling techniques, incl. beam search, nucleus, top-k, and random sampling (Table 14, Appendix G).

Table 3: SETAUG improves off-the-shelf BART-base on the keyphrase generation task

Table 4: Ablations: modeling cardinality (CARD) and sampling informative orders (TSAMPLE) both help, with larger gains from CARD. RANDOM ordering hurts.

Table 6: List of hyperparameters used for all the experiments.

Table 8: Multi-label classification when the true cardinality is provided to the classifier. While providing the true cardinality helps the performance of multi-label classifiers, it still lags behind SETAUG.

Table 9: Our main results in detail: using permutations generated by SETAUG and adding cardinality gives the best overall performance in terms of macro precision, recall, and F1-score. MULTI-LABEL is the standard multi-label classification approach. Statistically significant results (p = 5e−4) are underscored. CARD stands for cardinality. Columns: p_micro, p_macro, r_micro, r_macro, F_micro, F_macro, Jaccard.

Table 10: Results for GO-EMO. Columns: p_micro, p_macro, r_micro, r_macro, F_micro, F_macro, Jaccard.

Table 11: Results for REUTERS. Columns: p_micro, p_macro, r_micro, r_macro, F_micro, F_macro, Jaccard.

Table 13: Our main results: using permutations generated by SETAUG and adding cardinality gives the best overall performance in terms of macro precision, recall, and F1-score. Statistically significant results are underscored. CARD stands for cardinality. BERT @k / BART @k denotes the pointwise classification baseline using BERT/BART where the top k labels are used as the model output. The average is denoted by BERT/BART.

Table 14: Set overlap for different sampling types with 200% augmentation. The gains are consistent across sampling types. Similar trends were observed for 100% augmentation and without positional embeddings. Top-k sampling was introduced by Fan et al. (2018), and nucleus sampling by Holtzman et al. (2020).