ConvFiT: Conversational Fine-Tuning of Pretrained Language Models

Transformer-based language models (LMs) pretrained on large text collections have been shown to store a wealth of semantic knowledge. However, 1) they are not effective as sentence encoders when used off-the-shelf, and 2) thus typically lag behind conversationally pretrained (e.g., via response selection) encoders on conversational tasks such as intent detection (ID). In this work, we propose ConvFiT, a simple and efficient two-stage procedure which turns any pretrained LM into a universal conversational encoder (after Stage 1 ConvFiT-ing) and a task-specialised sentence encoder (after Stage 2). We demonstrate that 1) full-blown conversational pretraining is not required, and that LMs can be quickly transformed into effective conversational encoders with much smaller amounts of unannotated data; 2) pretrained LMs can be fine-tuned into task-specialised sentence encoders, optimised for the fine-grained semantics of a particular task. Consequently, such specialised sentence encoders allow for treating ID as a simple semantic similarity task based on interpretable nearest neighbours retrieval. We validate the robustness and versatility of the ConvFiT framework with such similarity-based inference on the standard ID evaluation sets: ConvFiT-ed LMs achieve state-of-the-art ID performance across the board, with particular gains in the most challenging, few-shot setups.


Introduction and Motivation
Pretrained Transformer-based (masked) language models (LMs) such as BERT (Devlin et al., 2019) or RoBERTa (Liu et al., 2019b), coupled with task-specific fine-tuning, offer unmatched state-of-the-art performance in a wide array of standard language understanding and conversational tasks (Wang et al., 2019a; Mehri et al., 2020). However, pretrained LMs do not produce coherent and effective sentence encodings off-the-shelf; their further adaptation is required, akin to standard task fine-tuning. For instance, Reimers and Gurevych (2019) transform monolingual English BERT with supervised natural language inference and paraphrasing data (Williams et al., 2018; Wieting and Gimpel, 2018) into a sentence encoder which excels at sentence similarity and retrieval tasks (Marelli et al., 2014; Cer et al., 2017). This transformation process supports the creation of other similar universal sentence encoders in monolingual and multilingual settings (Chidambaram et al., 2019; Feng et al., 2020), and is typically based on dual-encoder architectures.
Another parallel research thread aims at learning conversational encoders: it validates the benefits of masked language modeling (MLM) pretraining on naturally conversational data (Mehri et al., 2021), as well as the benefits of transfer learning for conversational tasks which goes beyond MLM as the pretraining objective (Mehri et al., 2019; Coope et al., 2020; Henderson and Vulić, 2021, inter alia). In particular, response selection as a suitable pretraining task (Al-Rfou et al., 2016; Henderson et al., 2019b; Humeau et al., 2020) learns representations that organically capture conversational cues from conversational text data such as Reddit (Henderson et al., 2019a), again via dual-encoder architectures.
Inspired by these two research threads, we pose the following two crucial questions: (Q1) Is it necessary to conduct full-scale expensive conversational pretraining? In other words, is it possible to simply and quickly 'rewire' existing MLM-pretrained encoders as conversational encoders via, e.g., response ranking fine-tuning on (much) smaller-scale datasets?
(Q2) If we frame conversational tasks such as intent detection as semantic similarity tasks instead of their standard classification-based formulation, is it also possible to frame supervised task-specific learning as fine-tuning of conversational sentence encoders? In other words, can we learn

Figure 1: Illustration of the full CONVFIT framework which fine-tunes pretrained LMs such as BERT or RoBERTa in two separate stages via dual-encoder networks (grey blocks in the "zoomed-in" parts denote tunable parameters), and performs intent detection with the CONVFIT-ed models via similarity-based inference. Stage 1 (S1): adaptive conversational fine-tuning, §2.1; Stage 2 (S2): task-tailored conversational fine-tuning (for intent detection), §2.2. Dashed lines denote baseline/ablation variants which skip one of the two stages: (i) we can directly task-tune the sentence encoder with the task data (Stage 2) without running Stage 1, or (ii) we can skip Stage 2 and, similar to prior work, learn an MLP classifier on top of the conversational representations from Stage 1.

Figure 2: t-SNE plots (van der Maaten and Hinton, 2012) of encoded utterances from the ID test set of BANKING77 (i.e., all examples are effectively unseen by the encoder models at training) associated with a selection of 12 intents, demonstrating the effects of the gradual "representation specialisation funnel". The encoded utterances are created via mean-pooling based on (a) the original RoBERTa LM (no fine-tuning); (b) RoBERTa after Stage 1 (i.e., fine-tuned on 1% of the full Reddit corpus, see Figure 1); (c) RoBERTa after Stage 1 and Stage 2, fine-tuned with the OCL objective (n = 3 negatives) using the entire BANKING77 training set (see Figure 1). Additional t-SNE plots are in the Appendix.
task-specialised sentence encoders that enable sentence similarity-based interpretable classification? In order to address these two questions, we propose CONVFIT, a two-stage CONVersational FIne-Tuning procedure that turns general-purpose MLM-pretrained encoders into sentence encoders specialised for a particular conversational domain and task. Casting the end-task (e.g., intent detection) as a pure sentence similarity problem then allows us to reframe task-tailored fine-tuning of a pretrained LM as gradual sentence encoder specialisation, as illustrated in Figures 1 and 2.
Our hypothesis is that pretrained LMs, which already store a wealth of semantic knowledge, can be gradually turned into conversational, task-adapted sentence encoders without expensive full pretraining. (S1) Stage 1 transforms pretrained LMs into universal conversational encoders via adaptive fine-tuning (Ruder, 2021) on (a fraction of) Reddit data (see Figure 2b), relying on a standard dual-encoder architecture with a conversational response ranking loss (cf. Q1). (S2) Stage 2 further specialises the sentence encoder via contrastive learning with in-task data, that is, it learns meaningful task-related semantic clusters/subspaces. We then show that the S2 task-tailored specialisation effectively enables simple and interpretable similarity-based classification based on nearest neighbours (NNs) in the specialised encoder space (see Q2 and Figure 2c).
The two-stage CONVFIT transformation offers new insights and contributions to representation learning for conversational tasks. Unlike prior work which conducted large-scale conversational pretraining from scratch using large datasets, we demonstrate that full pretraining is not needed to obtain universal conversational encoders. By leveraging the general semantic knowledge already stored in pretrained LMs, we can expose (i.e., 'rewire') that knowledge (Gao et al., 2021b; Liu et al., 2021b) via much cheaper and quicker adaptive fine-tuning on a tiny fraction of the full Reddit data (e.g., even using < 0.01% of the Reddit corpus). Further, the task-oriented S2 CONVFIT-ing transforms pretrained LMs into task-specialised sentence encoders. Our results with similarity-based classification, targeting the crucial conversational NLU task of intent detection (ID), reach state-of-the-art (SotA) performance across all standard ID datasets, with particular gains in the most challenging, few-shot setups. Importantly, we show that the gradual application of S1 and then S2 yields a synergistic effect, that is, it attains the highest ID results across the board.
Finally, CONVFIT is highly versatile: it can be used with a range of pretrained LMs and on a spectrum of text classification problems; it also allows for the simple usage of diverse fine-tuning objectives in both Stage 1 and Stage 2, beyond the ones proposed and evaluated in this work.

Methodology
Preliminaries. For any input text $t$, we obtain its encoding $\mathbf{t} = enc(t)$, where $enc$ is a sentence encoder at any CONVFIT stage (i.e., before any fine-tuning, after S1, or after S2), or any other sentence encoder. The text $t$ is tokenised into subwords (Schuster and Nakajima, 2012) relying on each encoder's dedicated tokeniser. The final encoding $\mathbf{t}$ is created via a pooling operation such as (a) using the [CLS] token, or (b) mean-pooling the output subword vectors. Following prior work (Reimers and Gurevych, 2019), we always use mean-pooling.
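The mean-pooling operation used throughout can be written as a short PyTorch sketch; the tensor shapes and the padding-mask convention are assumptions matching common Transformer encoder outputs, not details specified in the paper:

```python
import torch

def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool subword vectors into one sentence encoding, ignoring padding.

    token_embeddings: (batch, seq_len, dim) output of the Transformer encoder.
    attention_mask:   (batch, seq_len) with 1 for real tokens, 0 for padding.
    """
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)    # sum only over real tokens
    counts = mask.sum(dim=1).clamp(min=1e-9)         # number of real tokens per sentence
    return summed / counts                           # (batch, dim)
```

With a [CLS]-based alternative one would simply take `token_embeddings[:, 0]`; the paper always uses the mean-pooled variant.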

Stage 1: Adaptive Fine-Tuning
As in prior work on conversational pretraining (Henderson et al., 2019b; Humeau et al., 2020), Stage 1 relies on the response ranking task with Reddit data and dual-encoder architectures, which model the interaction between Reddit (context, response) (c, r) pairs. 1 However, unlike prior work, instead of pretraining from scratch we fine-tune an LM-pretrained encoder, which yields a much quicker conversational encoder specialisation and does not require massive amounts of data.
Response ranking is formulated via the standard multiple negatives ranking loss (MNEG): for each positive $(c_i, r_i)$ pair (i.e., a pair observed in the Reddit fine-tuning data), the aim is to rank the correct response $r_i$ for the input $c_i$ above a set of randomly sampled responses $r_j$, $j \neq i$, from other Reddit pairs. The similarity between contexts and responses is quantified via a similarity function $S$ operating on their encodings. Following prior work, we use the scaled cosine similarity: $S(c, r) = D \cdot \cos(\mathbf{c}, \mathbf{r})$, where $D$ is a scaling constant. Stage 1 fine-tuning with MNEG then proceeds in batches of $B$ positive Reddit pairs $(c_1, r_1), \ldots, (c_B, r_B)$; the MNEG loss for a single batch is computed as:

$\mathcal{L}_{\text{MNEG}} = -\sum_{i=1}^{B} \Big( S(c_i, r_i) - \log \sum_{j=1}^{B} e^{S(c_i, r_j)} \Big)$ (1)

Effectively, for each batch Eq. (1) maximises the similarity score of positive context-response pairs $(c_i, r_i)$, while it minimises the scores of $B - 1$ random pairs. The negative examples are all pairings of $c_i$ with $r_j$-s in the current batch, where such $(c_i, r_j)$ pairs do not occur in the Reddit data. 2 The output of Stage 1 is the sentence encoder $enc_{S1}$, which can be used 'as is' similarly to standard sentence encoders (Feng et al., 2020): a standard ID approach stacks a Multi-Layer Perceptron (MLP) classifier on top of the fixed sentence vectors $\mathbf{t}$ and fine-tunes only the MLP parameters (Gerz et al., 2021). However, the output of S1 can also be fed as the input encoding for CONVFIT's Stage 2 (Figure 1).
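The batched MNEG loss with in-batch negatives can be sketched in a few lines of PyTorch; the scaling constant D = 20 and the tensor names are illustrative assumptions (the paper leaves D as a hyper-parameter):

```python
import torch
import torch.nn.functional as F

def mneg_loss(c_enc: torch.Tensor, r_enc: torch.Tensor, scale: float = 20.0) -> torch.Tensor:
    """Multiple negatives ranking loss over a batch of B (context, response) pairs.

    c_enc, r_enc: (B, dim) pooled encodings; row i of each forms a positive pair,
    and all other in-batch pairings (c_i, r_j), j != i, serve as negatives.
    """
    c = F.normalize(c_enc, dim=-1)
    r = F.normalize(r_enc, dim=-1)
    sim = scale * c @ r.t()                 # (B, B) scaled cosine scores S(c_i, r_j)
    labels = torch.arange(sim.size(0))      # the correct response lies on the diagonal
    # cross-entropy over rows = -sum_i [S(c_i, r_i) - log sum_j e^{S(c_i, r_j)}] / B
    return F.cross_entropy(sim, labels)
```

Framing the per-batch ranking as cross-entropy over the (B x B) similarity matrix is the standard dual-encoder trick: negatives come for free from the other pairs in the batch.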

Stage 2: Task-Based Sentence Encoders
Stage 2 fine-tuning is inspired by metric-based meta-learning (Vinyals et al., 2016; Musgrave et al., 2020) and exemplar-based (also termed prototype-based) learning (Snell et al., 2017; Zhang et al., 2020), which is especially suited for few-shot scenarios.

1 See (Henderson et al., 2019a). The intuition is that sentences which elicit similar responses should obtain similar sentence encodings.

2 We also experimented with another SotA loss function, the triplet-based multi-similarity loss (Wang et al., 2019b; Liu et al., 2021a), without any substantial performance differences.

We assume the existence of $N_a$ annotated in-task examples
$\{(x_1, y_1), \ldots, (x_{N_a}, y_{N_a})\}$: e.g., the $x$-s are text sentences with the $y$-s being their intent labels/classes; let us assume that there are $N_c$ classes $\{C_1, \ldots, C_{N_c}\}$ in total. The aim is to fine-tune the input sentence encoder in such a way as to encode all sentences associated with each particular class into a coherent cluster, clearly separated from all other class-related (also coherent) clusters (see Figure 2c). 3

Positive and Negative Pairs. We leverage the class labels only implicitly (see Figure 1), which allows us to treat intent detection as a sentence similarity task. CONVFIT S2 operates with two sets of pairs: 1) $PP$ is the set of positive pairs $(x_i, x_j)$, where $x_i$ and $x_j$ are text instances associated with the same class; 2) $NP$ contains negative pairs $(x_i, x_j)$ where $x_i$ and $x_j$ are associated with two different classes $C_i$ and $C_j$. We construct the set $NP$ in a balanced way: for each positive pair $(x_i, x_j) \in PP$, we add $2 \times n$ negative pairs into $NP$, where $n$ is a tunable hyper-parameter; $n$ pairs $(x_i, x_{i,k})$, $k = 1, \ldots, n$, are constructed by randomly sampling utterances $x_{i,k}$ which do not share the class with $x_i$, and we sample $n$ negatives $(x_{j,k}, x_j)$ in a similar vein. We now present three different loss functions that fine-tune the input encoders towards task-specialised sentence similarity relying on the sets $PP$ and $NP$. For all three S2 loss functions, we add a down-projection layer of dimensionality $d_o$ with a non-linearity (Tanh) after pooling, see Figure 1. 4

SOFTMAX (SMAX) Loss. Following prior work (Reimers and Gurevych, 2019), for each input sentence pair $(x_i, x_j)$, we concatenate their $d_o$-dimensional encodings $\mathbf{x}_i$ and $\mathbf{x}_j$ (obtained after passing them through the input encoder, pooling, and down-projection) with their element-wise difference $|\mathbf{x}_i - \mathbf{x}_j|$.
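The balanced construction of the positive set PP and negative set NP can be sketched as follows; the input format (a list of (text, label) tuples) and the sampling routine are illustrative assumptions:

```python
import random
from itertools import combinations

def build_pairs(examples, n=3, seed=0):
    """Build positive (PP) and negative (NP) pair sets for Stage 2 fine-tuning.

    examples: list of (text, label) tuples. For every positive pair (x_i, x_j)
    sharing a label, 2*n negatives are added: n random partners for x_i and
    n for x_j, each drawn from utterances with a different label.
    """
    rng = random.Random(seed)
    pp, np_pairs = [], []
    for (xi, yi), (xj, yj) in combinations(examples, 2):
        if yi != yj:
            continue
        pp.append((xi, xj))
        for anchor, label in ((xi, yi), (xj, yj)):
            others = [x for x, y in examples if y != label]
            for _ in range(n):
                np_pairs.append((anchor, rng.choice(others)))
    return pp, np_pairs
```

Note that this enumerates all within-class pairs; in the Full setup this quadratic growth is exactly why the paper reduces n to 1 for tractability.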
The objective is as follows:

$o = \mathrm{softmax}\big(W^{\top}(\mathbf{x}_i \oplus \mathbf{x}_j \oplus |\mathbf{x}_i - \mathbf{x}_j|)\big)$ (2)

where $\oplus$ denotes concatenation, and $W \in \mathbb{R}^{3d_o \times 2}$ is a trainable weight matrix of the softmax classifier; 2 is the number of classification classes: the model must simply discern between positive pairs (from $PP$) and negative pairs (from $NP$). The classifier is optimised via standard cross-entropy.
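A minimal PyTorch sketch of this binary softmax head over the concatenated pair features; the default dimensionality d_o = 512 follows the experimental setup, while the module and variable names are our own:

```python
import torch
import torch.nn as nn

class SmaxHead(nn.Module):
    """SMAX objective: classify sentence pairs as positive (1) or negative (0)
    from the features (x_i, x_j, |x_i - x_j|), as in Reimers and Gurevych (2019)."""

    def __init__(self, d_o: int = 512):
        super().__init__()
        self.W = nn.Linear(3 * d_o, 2, bias=False)   # W in R^{3*d_o x 2}
        self.ce = nn.CrossEntropyLoss()

    def forward(self, xi: torch.Tensor, xj: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Concatenate both encodings with their element-wise absolute difference.
        feats = torch.cat([xi, xj, torch.abs(xi - xj)], dim=-1)
        logits = self.W(feats)                       # (batch, 2)
        return self.ce(logits, labels)
```

At inference the head is discarded: only the specialised encoder space is used for similarity-based classification.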
Cosine (COS) Loss. The idea is to minimise the following distance, formulated as a standard mean-squared error:

$\mathcal{L}_{\text{COS}} = \| \delta_l - \cos(\mathbf{x}_i, \mathbf{x}_j) \|^2$ (3)

where $\cos$ denotes cosine similarity, and $\delta_l$ is a hyper-parameter which specifies the 'ideal' (dis)similarity margin in the specialised encoder space. Here, we rely on the default parameters from Reimers and Gurevych (2019) without any tuning.

Online Contrastive Learning (OCL) Loss follows the formulation from Hadsell et al. (2006):

$\mathcal{L}_{\text{OCL}} = \mathbb{1}(x_i, x_j) \cdot d_{\cos}(\mathbf{x}_i, \mathbf{x}_j)^2 + \big(1 - \mathbb{1}(x_i, x_j)\big) \cdot \max\big(0, \delta_m - d_{\cos}(\mathbf{x}_i, \mathbf{x}_j)\big)^2$ (4)

where $\mathbb{1}$ is the indicator function which returns 1 iff $(x_i, x_j) \in PP$, and 0 iff $(x_i, x_j) \in NP$; $d_{\cos} = 1 - \cos$ is the cosine distance, and $\delta_m$ is the distance margin, set to the default value of 0.5 (Reimers and Gurevych, 2019) in all our experiments. The loss 'attracts' similar items closer together in the specialised space, while 'repelling' dissimilar items (Mrkšić et al., 2017). 5

Similarity-Based Inference. Intent detection in the specialised encoder space $enc_{S2}$ is then performed via similarity-based classification (Zhang et al., 2020) after Stage 2. 6 Assuming the simplest case of $k = 1$ nearest-neighbour (NN) classification, we select the intent class for an unseen example $u$ as:

$I_c\big(\arg\max_{t \in Pool} \cos(\mathbf{t}, \mathbf{u})\big)$ (5)

Here, $\mathbf{t} = enc_{S2}(t)$ refers to the sentence encoding of each example $t \in Pool$ (which is typically the pool of examples from the ID training set), and the function $I_c$ returns the intent class of any $t \in Pool$.
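The contrastive loss in the style of Hadsell et al. (2006), over cosine distance, can be sketched as below. This is the plain (non-online) variant applied to the whole batch; the online version additionally restricts the loss to hard positives and hard negatives. Tensor names are our own:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(xi: torch.Tensor, xj: torch.Tensor,
                     is_positive: torch.Tensor, margin: float = 0.5) -> torch.Tensor:
    """Contrastive loss over cosine distance d_cos = 1 - cos.

    xi, xj:      (B, dim) pair encodings.
    is_positive: (B,) booleans; True iff the pair is in PP.
    Positive pairs are pulled together; negative pairs are pushed past the
    margin delta_m (default 0.5, as in the paper).
    """
    d = 1.0 - F.cosine_similarity(xi, xj, dim=-1)        # cosine distance per pair
    y = is_positive.float()
    pos = y * d.pow(2)                                   # attract positives
    neg = (1.0 - y) * F.relu(margin - d).pow(2)          # repel negatives within margin
    return (pos + neg).mean()
```

Restricting the terms to hard pairs (negatives with small d, positives with large d) before averaging turns this into the online variant used in the paper.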
Why Intent Detection as a Sentence Similarity Task? We can take the analogy of an 'intent' being a latent semantic class, where sentences associated with the intent are diverse surface instances of that class (i.e., language realisations of the underlying concept/intent). This means that finding the most similar labelled instances for a given unlabelled input instance/sentence can directly inform us about the underlying semantic class/intent.

5 We use the online version of the loss, which updates the loss focusing on hard negative pairs (i.e., negatives that are close by cosine in the current semantic space) and hard positives which are far apart in the current space. This typically results in quicker convergence and slightly better performance.

6 The benefits of similarity-based classification were recently validated also in other NLP tasks such as cross-lingual abusive content detection (Sarwar et al., 2021), language modeling (Khandelwal et al., 2020; Guu et al., 2020), and question answering (Kassner and Schütze, 2020), among others.

Experimental Setup
Input LMs. We experiment with several popular Transformer-based (Vaswani et al., 2017) LMs as input (see Figure 1), aiming to validate the robustness of CONVFIT, as well as to analyse the impact of LM pretraining on the final task performance: (i) BERT (Devlin et al., 2019); and (ii) RoBERTa with 12 (ROB) or 6 (DROB) attention layers. In addition, to isolate the effects of LM-pretraining and CONVFIT-ing from mere "parameter capacity", we also experiment with a BERT/ROB architecture with RANDomly initialised parameters, using the Xavier initialisation (Glorot and Bengio, 2010). Unless noted otherwise, CONVFIT Stage 1 always proceeds with a sample comprising 2% of the full Reddit corpus from Henderson et al. (2019a). 7

Intent Detection Datasets. As discussed in §2, the main evaluation task is intent detection (ID), with a particular focus on low-data (i.e., few-shot) scenarios. Our Stage 2 fine-tuning and the final task evaluation are based on three standard ID datasets in English, also available as part of the recently published DialoGLUE benchmark (Mehri et al., 2020): BANKING77, HWU64 (Liu et al., 2019a), and CLINC150 (Larson et al., 2019). 8 The key statistics of all three datasets are provided in Table 1; for further details, we refer the reader to the original work and also to (Mehri et al., 2020).
Few-Shot and Full Data Setups. Prior work has recognised the importance of building intent detectors in low-data regimes (Mehri et al., 2021). Therefore, following this initiative, we evaluate the models in two N-shot scenarios, where we assume that only N = 10 or N = 30 annotated examples per intent are available for training the MLP classifier or for S2 fine-tuning; see Figure 1. 9 The models are also evaluated in the Full setup, where all annotated training examples per intent are used. Note that we always report the scores on the same test set for each setup. For the few-shot scenarios, we report the scores as averages over 3 independent experimental runs.
Hyperparameters and Optimisation. CONVFIT is implemented via the sentence-transformers (sbert) repository (Reimers and Gurevych, 2019), which is in turn built on top of the HuggingFace repository (Wolf et al., 2020). Similar to prior work, we do not rely on any development data, and follow the general suggestions from prior work (Reimers and Gurevych, 2019) for the hyperparameter setup, which is adopted across all ID datasets. 10 For S1 with MNEG, we always train for 2 epochs in batches of 256 with default hparams from sbert. 11 In Stage 2, with all three evaluated objective functions, the batch size is 32, the maximum sequence length is 48, and the output layer's dimensionality is set to d_o = 512. Unless stated otherwise, we always fine-tune for 10, 5, and 2 epochs for the 10-shot, 30-shot, and Full setups, respectively. For the COS and OCL variants, unless noted otherwise, we report the results with n = 3 negative examples per each positive in the 10-shot and 30-shot setups, and with n = 1 (for computational tractability) in the Full setup. An analysis of the impact of n on the final ID performance is presented later in §4.
Following the suggested settings of Reimers and Gurevych (2019), in both CONVFIT stages we use the AdamW optimiser (Loshchilov and Hutter, 2018); the learning rate is 2e-5 with a warmup rate of 0.1 and linear decay afterwards, and the weight decay rate is set to 0.01.
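The optimisation setup above can be sketched in plain PyTorch; the exact shape of the warmup/decay schedule is an assumption matching the stated settings (10% linear warmup, then linear decay to zero):

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def make_optimizer(model: torch.nn.Module, total_steps: int,
                   lr: float = 2e-5, warmup_rate: float = 0.1,
                   weight_decay: float = 0.01):
    """AdamW with linear warmup over the first 10% of steps and linear decay after."""
    opt = AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    warmup = max(1, int(warmup_rate * total_steps))

    def lr_lambda(step: int) -> float:
        if step < warmup:
            return step / warmup                                  # linear warmup
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup))  # linear decay

    return opt, LambdaLR(opt, lr_lambda)
```

In practice sentence-transformers provides an equivalent built-in warmup schedule; the sketch only makes the reported hyper-parameters concrete.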
Similarity-Based Classification. The intent class is chosen according to the k = 1 NNs, based on the cosine distance in the fine-tuned space. 12 Importantly, in few-shot setups we use only the few-shot data as the NN pool for classification.
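The k-NN classification step reduces to a few lines of NumPy; the generalisation to majority voting for k > 1 is our assumption (the paper reports k = 1, where this reduces to the arg-max formulation of §2.2):

```python
import numpy as np

def knn_intent(query_vec: np.ndarray, pool_vecs: np.ndarray,
               pool_labels: list, k: int = 1):
    """Similarity-based intent detection: majority intent among the k nearest
    neighbours (by cosine similarity) of the query in the encoded pool."""
    q = query_vec / np.linalg.norm(query_vec)
    p = pool_vecs / np.linalg.norm(pool_vecs, axis=1, keepdims=True)
    sims = p @ q                                  # cosine similarity to every pool item
    top = np.argsort(-sims)[:k]                   # indices of the k most similar items
    votes = [pool_labels[i] for i in top]
    return max(set(votes), key=votes.count)       # majority vote (arg-max for k=1)
```

In the few-shot setups, `pool_vecs`/`pool_labels` hold only the N encoded shots per intent; enlarging the pool at inference is exactly the augmentation probed in §4.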

Model Variants and Baselines
We experiment with a range of model variants enabled by the CONVFIT framework (see Figure 1), and compare their performance in the ID task against an array of cutting-edge universal and conversational sentence encoders. All the models in evaluation are summarised here for clarity.
LM+S1+S2-LOSS. Sentence encoders after running the full CONVFIT pipeline, where intent detection is based on similarity-based NN classification. LM in the label of this variant denotes the input LM, and LOSS is the loss function used in Stage 2 (i.e., SMAX, COS, or OCL). LM+S1. The input LM is converted into a (general-purpose) conversational encoder via Stage 1 CONVFIT-ing; intent detection is performed via standard feature-based MLP classification on top of the sentence encodings, as in prior work.
SotA Sentence Encoders. We evaluate three widely used state-of-the-art sentence encoders in the standard feature-based MLP classification approach to intent detection: 13 (i) ConveRT (Henderson et al., 2020) is a dual sentence encoder pretrained with the conversational response selection task (Henderson et al., 2019b) on the full Reddit data (Al-Rfou et al., 2016; Henderson et al., 2019a); (ii) multilingual Universal Sentence Encoder (mUSE) (Yang et al., 2020) is a multilingual and better-performing version of the USE model for English, which again relies on a standard dual-encoder framework (Henderson et al., 2019b; Humeau et al., 2020) and is pretrained on massive amounts of data; (iii) Language-agnostic BERT Sentence Embedding (LaBSE) (Feng et al., 2020) adapts pretrained multilingual BERT (mBERT) (Devlin et al., 2019) into a sentence encoder using a dual-encoder framework with a larger embedding capacity (i.e., it provides a shared multilingual vocabulary spanning 500k subwords). 14

Results and Discussion
The main results are summarised in Table 2, and further results and analyses are available in §4.1, with additional results in the Appendix. 15 These results offer multiple axes of comparison, succinctly discussed in what follows.
MLP versus Similarity-Based ID. First, we note that CONVFIT-ed LMs achieve peak ID scores across all three ID datasets, and in all data setups, with ROB+S1+S2-OCL being the highest-performing model variant overall. Running Stage 1 already transforms input LMs into effective (universal) conversational encoders: for MLP-based ID, we observe competitive or even improved performance (cf., the results on BANKING77 and HWU64 as the two more challenging evaluation sets) with the ROB+S1 and BERT+S1 variants against current state-of-the-art (conversational) sentence encoders such as ConveRT, mUSE, and LaBSE.
Importantly, the results after Stage 2 'unanimously' suggest the effectiveness of treating ID as a semantic similarity task, and of additional task-specific specialisation of the sentence encoders with in-task data. Put simply, it seems more effective to use the in-task training data to 'task-specialise' the sentence encoder space than to learn a standard (MLP) classifier, which directly maps from the feature space to intent labels (Sarwar et al., 2021). The gains are especially pronounced in few-shot setups (e.g., see 10-shot BANKING77).
We speculate that dual-encoder contrastive learning surpasses MLP-based approaches especially in few-data scenarios because it learns from finer-grained and more abundant information in such settings: i.e., we learn to contrast between pairs of instances rather than simply learning an MLP-based mapping from an instance to its underlying intent class. This formulation can also capture subtle cross-instance (dis)similarities which cannot be captured by an MLP.
14 LaBSE is the current SotA encoder across a wide array of languages (Feng et al., 2020; Litschko et al., 2021; Gerz et al., 2021). Besides dual-encoder training, LaBSE leverages standard self-supervised objectives used in the pretraining of mBERT and XLM: masked and translation language modeling (Conneau and Lample, 2019); see the original work.

15 For brevity, in the main paper we report the results with the two better-performing S2 losses: COS and OCL.

Extending beyond pure absolute performance, decisions based on k-NN similarity-based ID in the specialised space are also easy to interpret (Simard et al., 1992; Wallace et al., 2018).
Stage 1 + Stage 2? The scores in Table 2 indicate that Stage 2 alone already transforms pretrained LMs into very strong task-specialised sentence encoders. However, a more careful comparison of LM+S1+S2-LOSS versus LM+S2-LOSS variants reveals that Stage 1 fine-tuning is universally useful (regardless of the chosen loss function in S2), and yields ID performance gains. In other words, the coarser-grained adaptive fine-tuning already exposes some conversational knowledge from the pretrained LMs, and such knowledge does have a substantial impact on task-specialised S2 tuning. In sum, this finding is in line with prior work in other domains and NLP tasks (Gururangan et al., 2020; Ruder, 2021): both domain-adaptive (our S1) and task-adaptive additional tuning (our S2) of general-purpose LMs have a synergistic positive impact on the final task performance.
The impact of the gradual two-stage sentence encoder transformation is also clearly visible from the t-SNE visualisation in Figure 2. Besides this, a standard quantitative measure of cluster coherence, the Silhouette coefficient σ (Rousseeuw, 1987), also points in the same direction. 16

Impact of Input LMs. While the results suggest that the CONVFIT framework is applicable and effective with any pretrained LM, the choice of the input LM naturally impacts the absolute ID performance. As expected, the CONVFIT variants with RoBERTa achieve the highest scores across the board. A comparison between DROB and BERT reveals that the pretraining data size and regime seem to play a more critical role than the parameter capacity: the more compact DROB LM is competitive with or even outscores BERT-based variants. 18

16 Higher σ scores are desirable as they imply more coherent and compact clusters, and a stronger inter-cluster separation.

17 Stage 2 tuning with more in-task data also naturally yields a better separation of examples into coherent clusters, which then naturally improves NN-based classification. For instance, running the ROB+S1+S2-OCL (n = 3) variant in the 10-shot, 30-shot, and Full data setups yields the respective σ scores for the same set of test examples from Figure 2: σ_10 = 0.378, σ_30 = 0.548, σ_Full = 0.698, validating the intuition.

18 Given the versatility of CONVFIT, in future work we plan to extend the experiments to other pretrained LMs such as ELECTRA (Clark et al., 2020) and T5 (Raffel et al., 2020).

Importance of LM Pretraining is illustrated by Figure 3. The trend is quite straightforward: semantic knowledge acquired by LM-pretraining is particularly important in the fewest-shot (i.e., 10-shot) setups, and the gap gets reduced with more in-task data available for S2 tuning. However, the gap remains substantial even in the Full setups.

Figure 3 also reveals that the strength of CONVFIT Stage 1 lies in adapting the knowledge acquired at LM pretraining: S1 fine-tuning of RAND with smaller amounts of Reddit data cannot match ROB as the input LM, although the gap does become smaller with more in-task data for S2.

Stage 2: Fine-Tuning Losses. Table 2 reveals that strong ID performance after S2 tuning is achieved with different loss functions from §2.2 and with different input LMs, even without any careful tuning of hyper-parameters for individual settings. This verifies the versatility and robustness of CONVFIT. Both COS and OCL yield consistently strong results, and we expect that even higher absolute scores might be achieved by applying more sophisticated (contrastive learning) loss functions from prior work (Hermans et al., 2017; Liu et al., 2021a) in Stage 2.

Further Discussion
Stage 1: Amount of Reddit Examples. We now analyse what amount of Reddit data is required to turn input LMs into conversational encoders, by reducing S1 fine-tuning data through subsampling. The scores over different sizes are provided in Figure 4, and we note that they extend to other CONVFIT variants (see §3.1). As expected, having more Reddit data does yield better results on average, but even a small sample of Reddit data (e.g., ≈50K (c, r) pairs) 1) transforms the input LM into an effective sentence encoder (e.g., its MLP-based ID results are on par with those achieved with mUSE, LaBSE, and ConveRT), and 2) improves over the CONVFIT variant that skips S1 completely. This implies that more careful domain-driven data sampling in the future might yield even more domain-adapted conversational encoders after S1.
Amount of Negative Examples in Stage 2 has only a moderate to negligible impact on the final performance, as shown in Figure 5. Small gains when moving from n = 1 to n = 3 are observed only for the 10-shot setup: there, having more negatives may implicitly play the role of data augmentation for fine-tuning. However, with more in-task examples, the dependence on n becomes inconsequential, and the performance saturates quickly (e.g., see the curves in the 30-shot setups).
Stage 2: Few-Shot versus Full. Framing the ID task as sentence similarity seems especially beneficial for few-shot scenarios, as the model can leverage prototype-based (or instance-based) similarities (Snell et al., 2017) in the specialised encoder space. However, the strong performance of fully CONVFIT-ed models persists also in the Full setups.

Figure 6: Impact of the number of data instances at inference. The ROB+S1+S2-OCL variant is tuned in 10-shot setups in S2, and additional data is used only at inference, without any S2 re-tuning.
This finding is further corroborated by the results on another standard ID dataset, English ATIS (Hemphill et al., 1990); see Table 3. There, we observe even more prominent differences in favour of similarity-based ID enabled by CONVFIT, again especially in the two low-data setups. The proposed prototype-based learning and inference holds promise to boost few-shot performance even further in future work, through additional metric learning (Zhang et al., 2020) or data augmentation techniques (Lee et al., 2021). One limitation of CONVFIT, especially prominent in Full scenarios, is its quadratic time complexity. Future work will look into effective sampling strategies and adaptations towards more sample-efficient and quicker fine-tuning (Tran et al., 2019; Tian et al., 2020; O'Neill and Bollegala, 2021).
Data Augmentation for Inference. Adding more data instances for similarity-based inference, serving as exemplars/prototypes, is likely to boost the final intent detection performance without the need to retrain the model. The intuition is that additional instances can provide finer-grained prototypes for inference, semantically more similar to the input query sentences than the original training data. To test this hypothesis, we conduct a simple probing experiment, where we train the ROB+S1+S2-OCL (n = 3) variant in the 10-shot setup, but then run inference (i) with the same 10 shots; (ii) in the 30-shot setup (i.e., effectively performing the inference-time data augmentation, relying on 20 more data instances per intent class at inference); (iii) in the Full setup.
The scores are summarised in Figure 6. They clearly indicate that performance does rise with more data instances at inference, even without any model retraining/re-tuning, confirming that increased semantic variability helps at inference. This finding is salient for all three evaluation sets. 19

19 The same trends persist with other CONVFIT variants.
As expected, the absolute performance of 30-shot or Full inference when the model is trained in 10-shot setups is lower than in the setup where the more abundant data is additionally used for CONVFIT Stage 2 task-tuning.
Based on these findings, we restate that a promising path for future research concerns investigating and 'task-adapting' automatic paraphrase generation models (Krishna et al., 2020; Dopierre et al., 2021; Schick and Schütze, 2021), such as those that rely on prompting large models (e.g., GPT-3, T5) (Gao et al., 2021a). Such paraphrases might provide a richer and semantically more varied set of data instances for CONVFIT task-tailored fine-tuning and similarity-based inference.

Conclusion and Future Work
We proposed CONVFIT, a two-stage conversational fine-tuning procedure that transforms pretrained LMs (e.g., BERT, RoBERTa) into universal (after Stage 1) and task-specialised conversational sentence encoders (after Stage 2) through dual-encoder architectures. The semantic knowledge already stored in the pretrained LMs gets 'rewired' for a particular domain and task. We demonstrated that such task-specialised sentence encoders enable casting intent detection (ID) as simple sentence similarity; CONVFIT-ed encoders yield strong ID results across diverse ID datasets and setups.
The CONVFIT framework is very versatile and opens up many future research paths and further extensions and experimentation beyond the scope of this paper. For instance, it is possible to replace the current contrastive loss functions with other recent effective contrastive losses (van den Oord et al., 2018; Gunel et al., 2021, inter alia), or mine hard (instead of using random) negative examples (Lauscher et al., 2020; Kalantidis et al., 2020; Robinson et al., 2021). We will also extend CONVFIT to other pretrained models, experiment with automatic paraphrasers for data augmentation, and port the framework to other conversational tasks (e.g., slot labelling for dialogue), as well as to other, non-dialogue text classification tasks.
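As an illustration of such a drop-in replacement, below is a minimal sketch of the InfoNCE objective of van den Oord et al. (2018); this is not the loss used in CONVFIT, and the function name and temperature value are illustrative assumptions:

```python
# Illustrative sketch of an InfoNCE-style contrastive loss
# (van den Oord et al., 2018): scores are scaled dot products between
# a query embedding, one positive, and n negatives; the loss is
# cross-entropy with the positive as the correct "class".
import math

def info_nce_loss(query, positive, negatives, temperature=0.1):
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp; positive sits at index 0.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]
```

The loss approaches zero when the query is far more similar to its positive than to any negative, and grows when a negative dominates; hard-negative mining would replace the random `negatives` with the most confusable ones.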

A Additional Experiments and Results
Additional experiments and analyses that further support the main claims of the paper have been relegated to the appendix for clarity and compactness of the presentation in the main paper. They largely follow the trends observed in the results provided in the main paper. In sum, we provide the following additional results and information, which offer further empirical support of our main claims:
• Figure 8 (COS loss in Stage 2) and Figure 9 (OCL loss in Stage 2) demonstrate the impact of using LM-pretrained Transformers versus randomly initialised Transformers in the CONVFIT framework (both in the full S1+S2 setup, as well as in the setup where only task-tuning (S2) is employed). The patterns in the results, presented over all three evaluation sets, largely follow the patterns observed in Figure 3 in the main paper.
• Figure 10 plots how the amount of Reddit data in Stage 1 impacts the final intent detection performance when the COS loss is used for task-tuning in Stage 2. The observed trends are very similar to the ones obtained with the OCL loss, presented in the main paper (see Figure 4).
• Figure 11 presents the impact of the number of negative examples n during Stage 2 fine-tuning with the COS loss; the observed trends are very similar to the ones with the OCL loss, presented in the main paper (see Figure 5).
• Figure 12 provides t-SNE plots with varying amounts of task data for Stage 2 task-tuning (10-shot versus 30-shot versus Full data setups), demonstrating that very tight and coherent clusters emerge even in the 10-shot setups.
• Figure 13 shows t-SNE plots after 10-shot Stage 2, when varying amounts of Reddit data for Stage 1 fine-tuning are used (e.g., skipping Stage 1 completely versus using ≈50k (context, response) Reddit pairs).
Finally, Figure 14 demonstrates that the patterns which emerge after Stage 1 and Stage 2 CONVFIT-ing depend neither on the chosen input LM nor on the chosen loss function in Stage 2: trends very similar to those in Figure 2 (provided in the main paper) are also observed with DistilRoBERTa as the input LM and COS as the S2 loss. Figure 7 shows a visible impact of adaptive Stage 1 fine-tuning even when only 50k Reddit (context, response) pairs are used.

B Models and Evaluation Data
URLs to the models are provided in Table 6. The intent detection evaluation data is available online:
1. The BANKING77, CLINC150, and HWU64 intent detection data have been downloaded from the DialoGLUE repository: github.com/alexa/dialoglue. We use the 10-shot data provided in the repository, and use their script to generate 30-shot setups for all three datasets.
2. The English ATIS intent detection dataset is extracted from the recently published MultiATIS++ dataset, available here: github.com/amazon-research/multiatis. For reproducibility, we will release the generated 10-shot and 30-shot data splits.
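Few-shot split generation follows the standard recipe of sampling k examples per intent class; a schematic sketch of this idea (not the DialoGLUE script itself; `k_shot_split` and the fixed seed are illustrative assumptions):

```python
# Illustrative sketch of deriving a k-shot split: sample k training
# examples per intent class from the full training set, with a fixed
# seed for reproducibility.
import random

def k_shot_split(examples, k, seed=42):
    """examples: list of (sentence, intent) pairs.
    Returns at most k examples per intent class."""
    rng = random.Random(seed)
    by_intent = {}
    for sent, intent in examples:
        by_intent.setdefault(intent, []).append(sent)
    split = []
    for intent in sorted(by_intent):
        sents = by_intent[intent]
        for sent in rng.sample(sents, min(k, len(sents))):
            split.append((sent, intent))
    return split
```

The 30-shot setups are obtained the same way with k = 30.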
Our code is based on PyTorch, and relies on the two following widely used repositories:
• sentence-transformers: www.sbert.net
• Hugging Face Transformers: huggingface.co/transformers

Figure 8: A comparison of a randomly initialised BERT or RoBERTa architecture (RAND) with LM-pretrained RoBERTa after Stage 2 CONVFIT-ing; evaluation on all three intent detection datasets; the COS loss is used in S2. Figure 9 shows similar plots with the OCL loss used in S2.
Figure 10 (caption excerpt): other Reddit data sizes are relative to this corpus size (e.g., ×1/32 means that we use 2%/32 ≈ 0.0625% of the full Reddit corpus); the Stage 2 loss is COS (n = 3).
Figure 14, panel (c) (caption excerpt): DistilRoBERTa after Stage 1 and Stage 2, fine-tuned with the COS objective (n = 3 negatives) using the entire BANKING77 training set (see Figure 1).