Uncovering Hidden Consequences of Pre-training Objectives in Sequence-to-Sequence Models

Some variants of self-supervised denoising objectives for pre-training encoder-decoder language models have been reported to have a negligible impact on downstream performance. Yet the design of these pre-training objectives leads to behavioural differences that can be uncovered with specific manipulations. We reproduce a recently proposed zero-shot control method and find that it is only successful on a subset of models. To understand what causes the difference in its effectiveness, we perform a set of controlled experiments, varying only the pre-training objective, and find unexpected interactions between the pre-training method and downstream controllability of models after fine-tuning. Our results show that different pre-training objectives have consequences that may not be visible in standard downstream evaluation, but which should be taken into account when developing models with controllability in mind.


Introduction
Self-supervised denoising objectives have proven extremely powerful for deriving transformer-based pre-trained language models (PLMs) given massive amounts of unlabelled data. These objectives are typically agnostic towards specific downstream tasks and thus do not resemble real-world use cases. Instead, they enable the model to learn optimal parameter initialisations for subsequent fine-tuning on various downstream tasks (Dai and Le, 2015; Erhan et al., 2010). During fine-tuning, the PLM quickly learns new tasks based on the supervised signal provided, rendering the pre-training task largely redundant.
Previous work has found performance differences on downstream tasks to be negligible given various denoising pre-training objectives (Lewis et al., 2020; Alajrami and Aletras, 2022). As a result, the choice of which method to apply in pre-training has largely been based on factors such as efficiency (e.g. Raffel et al., 2020; Song et al., 2019). However, given equally well-performing pre-training objectives, we find that encoder-decoder PLMs respond drastically differently to post-hoc manipulations after fine-tuning.

Figure 1: The effect of CtxAug for inquisitive dialogue modelling with off-the-shelf models. In contrast to BART, T5 models exhibit a minimal response to the context code. T5-small-LM refers to the LM-adapted model from Lester et al. (2021a).
Specifically, we investigate the use of context augmentation (CtxAug), proposed by Hazarika et al. (2022), as a zero-shot control method designed to steer a fine-tuned encoder-decoder model towards generating outputs with particular attributes. While they introduce this as a general control mechanism for encoder-decoder transformers, our experiments with BART (Lewis et al., 2020) and two variants of T5 (Raffel et al., 2020; Lester et al., 2021a) show that controllability via context augmentation is predominantly exhibited by BART (Figure 1).
Given this observation, we hypothesise that the success of this zero-shot control method may be highly dependent on a model's pre-training objective. To investigate this hypothesis, we set out to identify exactly what aspects of BART's pre-training allow for CtxAug to work. Our findings suggest that fine-tuned models are capable of exhibiting vestigial behaviours which are endowed by their pre-training objectives and allow for interesting and useful post-hoc manipulation methods in downstream applications.

Seq2Seq Pre-training Objectives
To jointly pre-train an encoder-decoder transformer (Vaswani et al., 2017), seq2seq pre-training objectives typically corrupt an input sequence (noise) before feeding it to the model and then train the model to recover the original sequence (denoise). Usually, this involves span-based masked language modelling (MLM) (Joshi et al., 2020; Devlin et al., 2019a) combined with a standard language modelling objective involving left-to-right prediction (Bengio et al., 2003; Radford et al., 2018). However, popular denoising objectives differ in terms of the extent of corruption applied and the amount that needs to be recovered. For instance, MASS (Song et al., 2019) applies MLM to a single, randomly selected span of contiguous source tokens and predicts only the noised tokens given their positional information. T5 (Raffel et al., 2020) randomly selects multiple token spans and replaces each span with a single unique 'sentinel' mask token. The target sequence then corresponds to a stilted sequence consisting of the masked input spans separated by their respective sentinel tokens. BART (Lewis et al., 2020) applies span-based MLM in conjunction with sentence permutation. In stark contrast to the previous approaches, BART is tasked with reconstructing the input sequence in full rather than just the masked spans; we refer to these two settings as full and partial reconstruction, respectively.
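To make the distinction concrete, the following sketch (a simplified illustration under our own assumptions, not any model's released pre-processing code) shows how the same span-infilling noise yields different targets under full vs. partial reconstruction; span lengths are drawn from a Poisson(λ = 3) distribution, as in BART.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_spans(n_tokens, mask_ratio=0.15, lam=3):
    """Choose token indices to mask in contiguous spans (lengths ~ Poisson(lam))."""
    n_target = max(1, int(round(mask_ratio * n_tokens)))
    masked = set()
    while len(masked) < n_target:
        length = max(1, min(int(rng.poisson(lam)), n_tokens - 1))
        start = int(rng.integers(0, n_tokens - length))
        span = range(start, start + length)
        if masked.isdisjoint(span):  # simplification: resample overlapping spans
            masked.update(span)
    return masked

def make_example(tokens, style):
    """Build a (source, target) pair; style is 'bart' (full) or 't5' (partial)."""
    masked = corrupt_spans(len(tokens))
    source, target, sid, in_span = [], [], 0, False
    for i, tok in enumerate(tokens):
        if i in masked:
            if not in_span:  # open a new span; adjacent masked indices merge naturally
                mask_tok = f"<extra_id_{sid}>" if style == "t5" else "[M]"
                source.append(mask_tok)
                if style == "t5":
                    target.append(mask_tok)  # sentinel delimits this span in the target
                sid, in_span = sid + 1, True
            if style == "t5":
                target.append(tok)  # partial reconstruction: predict masked spans only
        else:
            source.append(tok)
            in_span = False
    if style == "bart":
        target = list(tokens)  # full reconstruction: predict the entire original sequence
    return source, target

tokens = "the cat sat on the mat and purred loudly".split()
print(make_example(tokens, "bart"))
print(make_example(tokens, "t5"))
```

Note that the sketch omits sentence permutation, which BART applies in addition to span infilling.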

Context Augmentation for Zero-shot Control
Despite strong generalisation abilities of fine-tuned PLMs, controlling for desirable attributes in generated text remains an active area of research (e.g. Dathathri et al., 2019; Liu et al., 2021; Yang and Klein, 2021; Krause et al., 2021; Pascual et al., 2021). CtxAug (Hazarika et al., 2022) addresses this by encoding a small set of control phrases into a 'control code' that a fine-tuned encoder-decoder model attends to at inference time, steering generation towards the desired attribute without any further training.

Experimental Setup

To compare pre-training objectives, we unify them based on the approach taken by Lewis et al. (2020) and sample masked span lengths from a Poisson distribution (λ = 3).4 For reference, we also compare to a non-pre-trained (No PT) baseline, which is trained from scratch on the downstream task.
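To make the control mechanism itself concrete, the following is a schematic sketch of our reading of CtxAug (function names and shapes are illustrative assumptions, not the implementation of Hazarika et al. (2022)): control phrases are encoded once, and the resulting states are prepended to the encoder states that the decoder cross-attends to; the original method additionally re-weights the cross-attention towards the control code (see Appendix A.3).

```python
import torch

@torch.no_grad()
def build_control_code(model, tokenizer, phrases):
    """Encode control phrases once; concatenate their encoder states."""
    states = []
    for phrase in phrases:
        ids = tokenizer(phrase, return_tensors="pt").input_ids
        states.append(model.get_encoder()(ids).last_hidden_state[0])
    return torch.cat(states, dim=0)  # (n_control_tokens, d_model)

@torch.no_grad()
def augment_encoder_states(model, tokenizer, source, control_code):
    """Prepend the control code to the source encoding for the decoder to attend to."""
    ids = tokenizer(source, return_tensors="pt").input_ids
    enc = model.get_encoder()(ids).last_hidden_state[0]
    return torch.cat([control_code, enc], dim=0)
```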
Model We use the BART model architecture, which resembles a standard encoder-decoder transformer with GeLU activation functions. Following Dufter and Schütze (2020), we scale the model down by dividing the size of the hidden layer, the intermediate feed-forward layers, and the number of attention heads by 12. This results in a hidden size of 64, an intermediate size of 256, and a single attention head.
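For reference, an equivalent specification in Hugging Face terms might look as follows (a sketch; pre-training itself is done with Fairseq, and the 6-layer depth is assumed from BART-base rather than stated above):

```python
from transformers import BartConfig

mini_config = BartConfig(
    vocab_size=4096,             # custom BART tokenizer (see Data)
    d_model=64,                  # 768 / 12
    encoder_ffn_dim=256,         # 3072 / 12
    decoder_ffn_dim=256,
    encoder_attention_heads=1,   # 12 / 12
    decoder_attention_heads=1,
    encoder_layers=6,            # assumption: BART-base depth, unchanged
    decoder_layers=6,
    activation_function="gelu",
    max_position_embeddings=256, # matches the pre-training sequence length
)
```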
Data As pre-training data we select the BookCorpus5 (Zhu et al., 2015; Bandy and Vincent, 2021) due to its stylistic similarities to our downstream task (e.g. dialogues between characters). We perform simple preprocessing, removing preambles and metadata by filtering lines without sentence-final punctuation or lines containing more than 70% punctuation or numbers. We set aside 100 randomly selected books for validation. The resulting corpus contains approximately 72M and 400k sentences for training and validation, respectively. Given our budgeted training setup, the model only sees approximately 65% of the data before reaching the maximum number of update steps. Finally, we train our own BART tokenizer on the training split with a maximum vocabulary size of 4,096.
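The line filter can be sketched as follows (our assumed implementation of the description above):

```python
import string

SENTENCE_FINAL = (".", "!", "?", '"', "'")

def keep_line(line: str) -> bool:
    """Keep lines that look like prose; drop preambles and metadata."""
    line = line.strip()
    if not line or not line.endswith(SENTENCE_FINAL):
        return False
    noise = sum(ch in string.punctuation or ch.isdigit() for ch in line)
    return noise / len(line) <= 0.70  # drop punctuation/number-heavy lines
```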
4 Here, 0-length spans, which correspond to insertions in the original BART denoising objective, are ignored, and contiguous independently masked spans are merged to ensure that we do not have consecutive [M] tokens in the input sequence.
5 We use a version created in September 2020 (https://github.com/soskek/bookcorpus).

Fine-tuning & Inference
To measure the impact of CtxAug for zero-shot controlled generation, we follow the experimental setup from Hazarika et al. (2022) and focus on promoting inquisitive and positive responses in knowledge-grounded dialogue generation with the Topical-Chat dataset (Gopalakrishnan et al., 2019). The task is to generate the target dialogue turn given a relevant knowledge snippet k and the dialogue history h_T, where T is the number of turns.
At inference time, we use top-p sampling (p=0.9) with a beam size of 4 and a temperature of 0.7. Sequences are generated with a maximum length of 40 tokens. For all experiments, we pre-train and fine-tune with 3 different seeds before performing inference with 5 different seeds. This results in a total of 15 inference runs for each model. To promote inquisitiveness with CtxAug, we randomly sample 10 questions from the training data to construct the control code. To promote positive sentiment, we use a limited set of only 5 short phrases. Fine-tuning and inference experiments are performed with Hugging Face's Transformers library (Wolf et al., 2020). We include the full details on training and inference hyperparameters in Appendix A.
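Concretely, the decoding configuration above maps onto Transformers' generate API roughly as follows (a sketch; the checkpoint and input string are assumptions for illustration):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")

inputs = tokenizer("knowledge snippet and dialogue history here", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,   # top-p sampling combined with beam search
    top_p=0.9,
    num_beams=4,
    temperature=0.7,
    max_length=40,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```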

Results

Pre-training Objectives for CtxAug
Table 1 shows the effectiveness of CtxAug given the different pre-training objectives considered. For promoting inquisitive responses (top row), BART's original denoising objective (MLM+PS) exhibits the strongest positive response to CtxAug over the default generation setting. Meanwhile, isolating the two independent noising operations used in this objective reveals that sentence permutation (PS) alone is insufficient for CtxAug to succeed. Comparing span-infilling pre-training objectives (SI*), we observe that the format of the target sequence used during pre-training is crucial. With noising operations being equal, CtxAug for inquisitive responses works effectively only when the model is pre-trained to reconstruct the target sequence in full, while partial reconstruction yields results similar to those of no pre-training (No PT). In contrast, encouraging more positive responses with CtxAug (bottom row) succeeds regardless of the pre-training strategy,7 and even without any pre-training.

This suggests that multiple factors may contribute to the overall effectiveness of CtxAug in practice. Firstly, the fact that models trained from scratch can still leverage CtxAug for positive sentiment suggests that there may be effects arising from correlation of source and target attribute features in the fine-tuning data. In such a case, CtxAug may not generalise to other datasets and tasks. Secondly, and most notably, full reconstruction pre-training objectives support CtxAug more than partial reconstruction objectives.

7 Appendix B shows that this also holds with publicly available models.
Reconstructing the corrupted input sequence in full naturally encourages a strong correlation between input and target attributes. This closely resembles the central mechanism in CtxAug, where a vector representing the desired target attribute is 'reconstructed' in the target sequence. Meanwhile, partial reconstruction objectives yield largely disjoint source and target sequences. This does not necessarily preclude the possibility of inferring relationships between co-occurring attributes over long distances (e.g., sentence-initial subject-verb inversion together with a sentence-final question mark). However, successfully learning such relationships is plausible only in scenarios where some co-occurring features remain unmasked and others are reconstructed. This limits the efficacy of CtxAug for promoting inquisitiveness, and possibly other attributes that occur over longer distances, to certain pre-training methods.

Duration of Fine-tuning on CtxAug
To investigate how CtxAug is impacted by the duration of fine-tuning, we conduct an ablation study in which we perform inference at regular intervals throughout fine-tuning. Figure 2 depicts how CtxAug behaves relative to the default generation setting as the model learns the downstream task. When starting from randomly initialised parameters, given question control phrases (top left), the model fails to leverage the control code effectively, resulting in degraded inquisitiveness relative to the default generation setting. For positive sentiment (bottom left), however, we can observe that the fine-tuning data provides a sufficient signal to support CtxAug. In this setting, the model starts to make effective use of the control code after three epochs.
Meanwhile, the SI_FR pre-trained model is able to leverage CtxAug at all stages of fine-tuning, highlighting the vestigial behaviour from pre-training. This is most visible when encouraging positive sentiment responses (bottom right), where, in the earliest stages of fine-tuning, we observe a significant increase in the number of positive sentiment responses generated. As the model adapts to the task, this advantage tapers off, indicating that vestigial behaviours from pre-training weaken over time.
For inquisitive responses (top right), the effect of CtxAug is most noticeable after the first few fine-tuning epochs, suggesting that this type of pre-training objective endows the model with a useful bias that can be effectively exploited by CtxAug. We also note that while the effect is only slight under this condition, it reflects the model's overall tendency to generate responses pertaining to the target attributes in question. As the model learns the task, inquisitiveness naturally increases, while positivity decreases. Manual inspection confirmed that at the earliest stages of training, models tended to output generic and positive responses (e.g. "I know!"), which gradually become slightly more varied to include negative responses (e.g. "I don't know that.") and simple questions.

Mixing Pre-training Objectives
Any impediment to leveraging interesting and useful post-hoc control techniques such as CtxAug with fine-tuned PLMs may be considered a significant downside of upstream decisions relating to the pre-training objective. Yet, in order to scale models and training data, partial reconstruction objectives have been chosen for their lower computational cost (Raffel et al., 2020). One possible option for striking a desirable balance between pre-training efficiency and downstream flexibility could be to combine different pre-training objectives, either within a single pre-training scheme or as a secondary pre-training stage before fine-tuning (e.g. Lester et al., 2021b). To this end, we experiment with combining SI_FR and SI_PR-T5 within a single pre-training scheme, SI_FR/PR, and investigate various mixing ratios: 1:3, 1:1 and 3:1. Table 1 (right) shows that gradually increasing the degree to which the model is tasked with full reconstruction of the noised input improves the effectiveness of CtxAug, but even at 75% adoption (3:1), it fails to reach parity with using only SI_FR.
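Conceptually, the mixing scheme amounts to sampling a target format per example according to the ratio; a minimal sketch, reusing make_example from the span-infilling sketch above (whether mixing is applied per example or per batch is our assumption):

```python
import random

def mixed_example(tokens, ratio_fr=0.75):
    """SI_FR/PR with a 3:1 mix when ratio_fr=0.75 (1:1 -> 0.5, 1:3 -> 0.25)."""
    style = "bart" if random.random() < ratio_fr else "t5"  # full vs. partial target
    return make_example(tokens, style)
```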

Related Work
The study of PLMs, their abilities, properties and behaviours, occupies a significant space in today's NLP research (e.g. Rogers et al., 2020; Lialin et al., 2022; Clark et al., 2019). Numerous works have evaluated and compared downstream performance of seq2seq PLMs, covering a wide array of tasks including abstractive summarisation (Blekanov et al., 2022; Zhu et al., 2021; Tang et al., 2022; Fabbri et al., 2021), question answering (Luo et al., 2022), graph-to-text generation (Ribeiro et al., 2021), dialogue modelling (Shin et al., 2022) and text simplification (Štajner et al., 2022), among others. While such comparisons are useful for guiding researchers in selecting the right model for a task and can sometimes reveal interesting differences on certain task-specific data sets, they tend to neglect important differences between PLMs, such as the underlying model size or the type and amount of data used for pre-training. Thus, it remains difficult to explain exactly why a particular model performs better or worse on a given task.
Meanwhile, there is a growing body of literature aimed at explaining some of the interesting and often unexpected behaviours observed among large PLMs. In this area, multilinguality has been linked to the duration of fine-tuning (Dufter and Schütze, 2020), and the ability to perform in-context few-shot learning and zero-shot generalisation has been linked to multiple factors. These include model scale (Brown et al., 2020), the types and formatting of demonstrations (Min et al., 2022), memorisation of pre-training data (Xie et al., 2022) and its distributional properties (Chan et al., 2022). The selection of architecture and pre-training objectives has also been found to be influential (Wang et al., 2022). Our work falls into this category and aims to explain which aspects of seq2seq pre-training objectives contribute to the ability to exploit additional conditioning context provided at inference time.

Conclusions
As PLMs become increasingly commonplace, so too does the importance of understanding the potential downstream consequences of decisions relating to their design. Our experiments indicate that context augmentation, as a method for zero-shot controlled natural language generation, is susceptible to inductive biases learned in pre-training, depending on the type of control code used. Based on this, we conclude that pre-training objectives that aim to reconstruct a noised input in full, as BART does, are best suited to leveraging this technique. Looking forward, we expect that even for seemingly equally effective pre-training objectives, we can identify differences in behaviour, e.g. the applicability of control methods, that remain after fine-tuning. In searching for optimal pre-training strategies for PLMs, this opens another dimension that needs to be considered and better understood.

Limitations
Comparing the downstream performance of pre-training objectives with large-scale models is prohibitively expensive. Because of this, we employ scaled-down models that closely resemble the architectures and training procedures of popular PLMs. In doing so, we assume that our findings are transferable to some larger publicly available models. As noted by Hazarika et al. (2022), CtxAug offers an interesting alternative to prompting generative LMs that are significantly smaller than those that typically exhibit few- and zero-shot capabilities (Brown et al., 2020). While we provide support for both Hazarika et al. (2022)'s claim and our assumption in preliminary and supplementary experiments with select PLMs (see Section 1 and Appendix B), these experiments are still performed on models of up to 140M parameters. Therefore, we stop short of concluding that our findings generalise to LLMs, which dwarf these models.
Additionally, the number and types of target attributes that a user may want to control for in various downstream text generation tasks are potentially endless. However, our study focuses on only two possible target attributes, namely inquisitiveness and positive sentiment, for the task of conversational dialogue modelling. In this way, our work partially serves as a re-implementation and reproduction study, confirming the main findings from Hazarika et al. (2022) but also highlighting limitations.

A Training Details
A.1 Pre-training Hyperparameters

Our scaled-down models have approximately 1M parameters and are pre-trained using the open-source Fairseq library (Ott et al., 2019). Following recommendations for budgeted pre-training by Izsak et al. (2021), we use a small batch size of 4,096 tokens and a triangular learning rate schedule which warms up for 2,500 steps and decays to zero over 250k update steps. We also restrict the maximum sequence length to 256, which is sufficient for our downstream task of dialogue modelling. All other hyperparameters are kept the same as those used by Lewis et al. (2020). Our mini-model pre-training takes approximately 6 hours on a single Nvidia K80 GPU (16GB memory).
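The triangular schedule corresponds to linear warm-up followed by linear decay; as a plain function (a sketch of the shape, with the peak learning rate left as a parameter since it is not restated here):

```python
def triangular_lr(step, peak_lr, warmup_steps=2_500, total_steps=250_000):
    """Linear warm-up to peak_lr, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```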

A.2 Fine-tuning on Topical-Chat
Topical-Chat comprises conversational dialogues between pairs of crowd workers. The crowd workers were provided with reading sets containing fun facts on eight different topics, including sports, pop culture and politics, as interesting discussion points. For each target dialogue turn in the dataset, it is assumed that the relevant knowledge snippet is provided as additional context, following previous work by Hedayatnia et al. (2020). Table 2 provides an overview of the dataset's splits.
To fine-tune on Topical-Chat, we follow the setup adopted by Hazarika et al. (2022). Specifically, the input sequence comprises a fixed number of 'bucketed' tokens: 32 tokens are reserved for the knowledge snippet and 25 tokens for each turn in the dialogue history. A <pad> token is used to fill empty positions within each bucket, and individual text sequences are truncated if their length exceeds the allocated bucket size. Dialogue history turns are delimited with speaker identifier tokens, and the entire input sequence is prepended with a <bos> token. The model is trained for a maximum of 10 epochs with an effective batch size of 20 and a learning rate of 6.25e-5. The maximum target sequence length is set to 64. Fine-tuning on a single Nvidia K80 GPU (16GB memory) takes around 1.5 to 2.5 hours depending on the model size.
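Our reconstruction of the bucketed input format is sketched below (the speaker identifier tokens and exact assembly order are assumptions based on the description above):

```python
def bucket(ids, size, pad_id):
    """Truncate or pad a token-id sequence to exactly `size` positions."""
    return (ids + [pad_id] * size)[:size]

def build_input(bos_id, pad_id, knowledge_ids, history, speaker_ids):
    """history: one token-id list per turn; speaker_ids: one identifier token per turn."""
    seq = [bos_id] + bucket(knowledge_ids, 32, pad_id)   # 32-token knowledge bucket
    for turn_ids, spk_id in zip(history, speaker_ids):
        seq += [spk_id] + bucket(turn_ids, 25, pad_id)   # 25 tokens per dialogue turn
    return seq
```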

A.3 Inference on Topical-Chat
At inference time, we use the same hyperparameters for all models. Specifically, we use top-p sampling (p=0.9) with a beam size of 4 and a temperature of 0.7. The maximum sequence length is set to 40 tokens. When applying CtxAug, we manually re-weight the cross-attention distribution using the method described in Hazarika et al. (2022). We use the recommended hyperparameter value of 5, which the authors found to provide a good balance between exhibiting the target attribute and maintaining fluency. To account for randomness, we run inference with multiple random seeds, which takes approximately 25 minutes for each experiment setting using a batch size of 120.
To construct the control code, we adopt the same methods as Hazarika et al. (2022). For inquisitiveness, we randomly sample 10 questions from the Topical-Chat training split. These 10 questions are then embedded once to construct the control code that is concatenated with every instance in the test set. Note that the sampling process is dependent on the random seed for each inference run. This means that each seeded inference setting uses a different set of questions to construct the control code. For positive sentiment, we always use the same five phrases defined by Hazarika et al. (2022): "That's awesome", "That's cool", "Oh that is great", "It's great to", "It's wonderful to". Since Hazarika et al. (2022) reported negligible differences between different sampling strategies for finding control phrases, we refrained from an extensive search over alternative methods and opted to use their recommended settings.
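A sketch of the control-phrase selection (the positive phrases are verbatim from the text; the per-seed question sampling is as described, with function names of our own choosing):

```python
import random

POSITIVE_PHRASES = [
    "That's awesome", "That's cool", "Oh that is great",
    "It's great to", "It's wonderful to",
]

def sample_question_phrases(train_questions, seed, k=10):
    """Draw a fresh set of questions for each seeded inference run."""
    return random.Random(seed).sample(train_questions, k)
```

The selected phrases would then be embedded once, e.g. via something like build_control_code from the sketch in Section 3, and reused for every test instance.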
Our main experiments are reported on the Topical-Chat 'frequent' test set; however, we observed similar trends across the board when evaluating on the Topical-Chat 'rare' test set.

B CtxAug for Positive Sentiment
Encouraging positive sentiment with CtxAug applied to our scaled-down models proved successful for all models, regardless of the pre-training strategy used. Figure 3 shows that this result also holds with much larger publicly available models, with all differences being statistically significant according to a two-tailed unpaired t-test (p < 0.01). Note that the weaker effect of CtxAug for positive sentiment compared to controlling for response inquisitiveness with BART-base agrees with the findings of Hazarika et al. (2022).
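The significance test can be reproduced with SciPy (a sketch; we assume the inputs are per-run rates of positive-sentiment responses under CtxAug vs. the default setting):

```python
from scipy import stats

def compare_settings(ctxaug_rates, default_rates):
    """Two-tailed unpaired t-test over per-seed response rates."""
    return stats.ttest_ind(ctxaug_rates, default_rates)
```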

C Performance Metrics
Inspecting the results of automatic metrics, we find only negligible differences in downstream performance across different denoising pre-training objectives, supporting previous findings (Lewis et al., 2020; Alajrami and Aletras, 2022; Raffel et al., 2020). Table 5 provides results for commonly used metrics for evaluating dialogue models. Specifically, we report the total number of unique responses generated (Uniq. Resp.), average response length (Resp. len.), perplexity (PPL) as computed by a distilled GPT-2 model,8 the proportion of unique unigrams per response (Dist-1), Self-BLEU (Zhu et al., 2018), BLEU (Papineni et al., 2002), ROUGE (Lin, 2004) and METEOR (Banerjee and Lavie, 2005). The latter three metrics are computed using ground-truth responses as references and are implemented in Hugging Face's Evaluate library. Without pre-training, the difference in performance for all metrics is noticeable.

8 https://huggingface.co/distilgpt2
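A sketch of how these metrics can be computed (Dist-1 as we assume it is defined; BLEU, ROUGE and METEOR via the Evaluate library, with toy inputs for illustration):

```python
import evaluate

def dist_1(tokens):
    """Proportion of unique unigrams in a single response."""
    return len(set(tokens)) / max(1, len(tokens))

preds = ["that's so cool!"]            # toy predictions
refs = ["wow, that's really cool."]    # toy ground-truth references

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")

print(bleu.compute(predictions=preds, references=[[r] for r in refs]))
print(rouge.compute(predictions=preds, references=refs))
print(meteor.compute(predictions=preds, references=refs))
```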

Figure 2: Effect of CtxAug throughout fine-tuning given different pre-training strategies. X-axis values indicate the number of training epochs and are shown on a log scale to better visualise the earliest stages of fine-tuning.

Figure 3: Performance of CtxAug with publicly available models when controlling for positive sentiment in Topical-Chat.

Table 2: Number of items in Topical-Chat for knowledge-grounded dialogue generation.

Table 4: Example of the knowledge-grounded dialogue task in Topical-Chat.

Table 5: Performance metrics for dialogue modelling with Topical-Chat evaluated on the 'frequent' test set. Results are averaged from 3 different pre-trained/fine-tuned models initialised with different seeds, each with 5 different seeded runs for inference.