Going beyond research datasets: Novel intent discovery in the industry setting

Novel intent discovery automates the process of grouping similar messages (questions) to identify previously unknown intents. However, current research focuses on publicly available datasets, which contain only the question field and differ significantly from real-life datasets. This paper proposes methods to improve the intent discovery pipeline deployed in a large e-commerce platform. We show the benefit of pre-training language models on in-domain data: both self-supervised and with weak supervision. We also devise Conv, a method to utilize the conversational structure (i.e., question and answer) of real-life datasets during fine-tuning for clustering. Combined, our methods that fully utilize real-life datasets give up to a 33pp performance boost over the state-of-the-art Constrained Deep Adaptive Clustering (CDAC) model applied to questions only. For comparison, the CDAC model applied to the question data alone gives only up to a 13pp performance boost over the naive baseline.


Introduction
Allegro is one of the largest e-commerce marketplaces in the Central and Eastern Europe region, connecting buyers and merchants. It has millions of active users. Therefore, the good functioning of the Customer Experience (CX) department is crucial, as it provides the necessary support, resolves emerging issues, and answers user questions.
Task-oriented chatbots relieve humans by automatically resolving the most repetitive and trivial issues. They usually have a pre-defined set of user intents with matching template answers. When a user asks a question, the intent classifier detects the question intent and returns the matching response. Creating a reliable and comprehensive chatbot requires massive work to discover, define, and maintain a set of intents with training examples. With the continuous development of marketplace platforms, new intents constantly appear as new features are introduced. Therefore, an automated intent discovery system becomes a critical component.
Novel intent discovery is performed offline on historical data. In the context of personalized intelligence assistants, existing approaches (Lin et al., 2020; Gao et al., 2021; Vedula et al., 2022) focus on learning transferable features with utterance encoders that guide the discovery on unlabeled data with a handful of labeled examples belonging to known intents. However, at Allegro our main form of communication is email, and we have access to much richer conversational data that can improve discovery performance. A large body of historical conversational data (user questions and consultants' answers) can be leveraged in two ways: firstly, to better initialize message encoders, and secondly, by performing intent discovery on conversational data as an additional signal. Additionally, a form of weak supervision is available: keywords (or tags) added by the consultants to help them understand past cases.
The paper's main contribution is the demonstration that incorporating additional signals, such as conversational structure or weak labels, into the existing intent discovery method results in better overall performance. We pre-trained three encoders for domain adaptation using conversational data and weak labels. We devised Conv, a method for fine-tuning on conversational data (i.e., question and answer) for the clustering task using a three-headed encoder. To the best of our knowledge, this result has not been reported in the public literature.
2 Related Work

Discovering novel intents
The goal of novel intent discovery is to identify groups of similar utterances in unlabeled data with the assistance of limited labeled data. Constrained Deep Adaptive Clustering (Lin et al., 2020, CDAC) uses a dense intent representation on top of a pre-trained BERT backbone to learn a similarity function in a semi-supervised contrastive manner, which is then utilized in the clustering algorithm. In a real-world scenario of personal assistants, Gao et al. (2021) and Vedula et al. (2022) use a pre-trained BERT model as a backbone encoder with supervised contrastive learning to transfer the distance function to unlabeled data for clustering. Unlike this work, the authors use only the question field and the English BERT-base uncased model for initialization. They do not use in-domain unlabeled data or weak supervision for backbone pre-training.

Transfer learning
General-purpose pre-trained encoders like BERT are not ideal. Tasks involving domain-specific texts, e.g., a science corpus, clinical notes, or e-commerce product descriptions, benefit from additional pre-training on in-domain data thanks to vocabulary and word embeddings better suited to domain-specific problems (Beltagy et al., 2019; Huang et al., 2019; Tracz et al., 2020; Gururangan et al., 2020). Similarly, for conversational tasks, ConveRT (Henderson et al., 2020a) substantially outperforms BERT in neural response selection. Additionally, industrial-scale training on weakly supervised datasets leads to improvements in several NLP tasks (Bach et al., 2018).

Problem statement
Given unlabeled instances D, the goal is to automatically cluster utterances into I classes, which are not known a priori. We also assume that we are given labeled instances D_k with a known set of intents I_k, where I ∩ I_k ≠ ∅. Unlabeled instances may belong to both known intents I_k and unknown ones I_u = I \ I_k.

Framework overview
Our novel intent discovery framework consists of representation learning (Bengio et al., 2013) and subsequent clustering with K-means (Lloyd, 1982). We propose the following to improve text representations for real-life novel intent discovery in the communication domain:
• Efficient initialization with pre-trained encoders, adapted to the e-commerce domain by optimization for weak training signals and the conversational structure of the data.
• Fine-tuning for the clustering task with a state-of-the-art training scheme (i.e., CDAC) adapted to use all the conversational data (i.e., question and answer). Conv is our proposed method to train a conversation structure-aware encoder with a three-headed architecture.
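A minimal sketch of these two stages, with a hypothetical `encode` function standing in for the fine-tuned encoder (it returns random vectors here, for illustration only) and scikit-learn's K-means for the clustering step:

```python
import numpy as np
from sklearn.cluster import KMeans

def encode(texts):
    """Stand-in for the fine-tuned representation encoder: maps each
    message to a dense vector (random here, for illustration only)."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 128))

def discover_intents(texts, n_intents):
    embeddings = encode(texts)                      # representation learning
    km = KMeans(n_clusters=n_intents, n_init=10, random_state=0)
    return km.fit_predict(embeddings)               # clustering step

cluster_ids = discover_intents(["where is my parcel?"] * 20, n_intents=2)
```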
In the following sections, we describe each component in more detail.

Initialization
An essential step in the deep learning process is initialization. Proper initialization is crucial in training representations for discovering new intents with clustering. The effectiveness of the existing clustering algorithms depends heavily on the quality of the representation encoder. In this work, we identified this dependency and propose a generic approach for efficient encoder pre-training in the conversational domain.

Domain specific data structure
We operate in the e-commerce domain with a two-sided marketplace. Customers can seek support by exchanging messages via email or chat. The former are typically longer and include more formal boilerplate. A dialog may be held between merchants and CX support, buyers and CX support, and directly between buyers and merchants. All messages are written in Polish.

Domain adaptation
We prepared two self-supervised models based on the BERT-base (Devlin et al., 2019) architecture. We started from a general domain encoder, HerBERT (Mroczkowski et al., 2021). We used a training corpus of 68M conversation threads with 184M messages and 8314M words. We included both emails and chats exchanged between all parties (merchants, CX support, and buyers).
• AlleBERT is HerBERT further fine-tuned on this corpus with the MLM objective.
• AlleConveRT is AlleBERT further fine-tuned on the same dataset but with a mixture of MLM and Conversational Contrastive Loss (CCL) (Henderson et al., 2020b).
The details of the training procedure for each of the pre-trained encoders can be found in Appendix E.

Weak supervision
In the case of email communication exchanged with CX support, every message includes at least one of 512 tags. These labels roughly identify the problem solved. They are assigned by CX consultants, often in a noisy manner. We utilized this weak signal and prepared the TagBERT encoder in a two-stage process. Firstly, we fine-tuned HerBERT with MLM and the Message Threads Structural Objective (MTSO) (Wang et al., 2020). Secondly, we fine-tuned the resulting model using the consultants' tags as weak labels.

Conversation structure-aware encoder
As depicted in Fig. 1, we used an encoder with the BERT-base architecture (Devlin et al., 2019) followed by average pooling (unlike many implementations, the hidden states for padding tokens are not averaged) and three projection heads with two linear layers and a Tanh non-linearity in between (Lin et al., 2020).
The three-headed model works with conversational input containing a pair of texts: the user's question and the consultant's answer (while encoding, the question and the answer are preceded with dedicated special tokens). Two heads project each input separately, and the third one handles the additional signal from the question-answer concatenation into one string of text. Each of the inputs is fed into the encoder separately. The common underlying encoder is updated jointly with gradients from all heads from the total loss, given by the weighted average of the losses for each head:

L(X, Y; θ) = λ_Q · L(X_Q, Y; θ_Q) + λ_A · L(X_A, Y; θ_A) + λ_QA · L(X_QA, Y; θ_QA)   (1)

Here X = (X_Q, X_A, X_QA) is the array of inputs (all examples), i.e., all questions, all answers, and all question-answer concatenations, respectively. Y are the input labels. θ = (θ_Q, θ_A, θ_QA) is the array of parameter sets for the individual inputs; the BERT-base parameters are shared as depicted in Figure 1.
The hyperparameters λ = (λ_Q, λ_A, λ_QA) govern how the conversational structure is utilized for any choice of training scheme, whereas the precise form of the loss terms L depends on the training scheme described in Sec. 3.5. For example, if we choose λ = (1, 0, 0) and compute L according to the CDAC training scheme, we follow the original CDAC setup with the question field only. With λ = (0, 0, 1) and L computed according to the CDAC training scheme, we effectively only concatenate the question and answer strings and feed the result into the model instead of the question string.
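Eq. (1) reduces to a weighted combination of the per-head losses; a minimal sketch in plain Python (the loss values are assumed to be already computed by the training scheme of Sec. 3.5):

```python
def total_loss(head_losses, lam):
    """Weighted combination of per-head losses as in Eq. (1).
    head_losses = (L_Q, L_A, L_QA); lam = (lambda_Q, lambda_A, lambda_QA)."""
    assert abs(sum(lam) - 1.0) < 1e-9, "head weights should sum to one"
    return sum(l * w for l, w in zip(head_losses, lam))

# question-only CDAC setup: lambda = (1, 0, 0)
loss_q_only = total_loss((0.7, 0.4, 0.5), (1.0, 0.0, 0.0))
# Conv: uniform contribution of all three heads
loss_conv = total_loss((0.6, 0.3, 0.3), (1 / 3, 1 / 3, 1 / 3))
```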
In our method Conv for training a conversation structure-aware encoder, we trained the representation encoder with uniform head contributions λ = (1/3, 1/3, 1/3), starting from the initializations described in Section 3.3. The final representation used for clustering is the embedding from the head for the question-answer concatenation.
To speed up training with large batches, we kept the weights of the encoder frozen except for the last transformer layer. The first linear layer keeps the BERT-base dimension of the representations (i.e., 768). The output dimension of the second linear layer is a representation-size hyperparameter.
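Selecting which parameters stay trainable can be sketched as a predicate over parameter names; the HuggingFace-style layer names and the head names below are illustrative assumptions, not taken from our implementation:

```python
def trainable_mask(param_names, last_layer_idx=11):
    """Return which parameters stay trainable: only the last transformer
    layer of BERT-base (layers 0..11) and the projection heads.
    The name prefixes used here are assumed, for illustration."""
    keep = (f"encoder.layer.{last_layer_idx}.", "head_q.", "head_a.", "head_qa.")
    return {name: name.startswith(keep) for name in param_names}

names = [
    "embeddings.word_embeddings.weight",
    "encoder.layer.0.attention.self.query.weight",
    "encoder.layer.11.output.dense.weight",
    "head_qa.linear2.weight",
]
mask = trainable_mask(names)
```

In a PyTorch training loop, one would set `param.requires_grad = mask[name]` while iterating over `model.named_parameters()`.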

Training scheme
Up to this point, any framework can be used for fine-tuning the representation encoder for intent discovery with clustering. We propose two potential approaches for real-world CX communication data.
Static. In a setup where no labeled data is available, we extract the text representation from the pre-trained encoder by average pooling, without additional training.
Constrained DAC (CDAC) (Lin et al., 2020). The method generalizes the Deep Adaptive Clustering (DAC) (Chang et al., 2017) scheme to partially labeled data and trains with a contrastive loss on both distance-based pseudo-pairs and exact pairs given by intent labels. It is semi-supervised since it utilizes both labeled and unlabeled examples from the train set. We adapted the CDAC training scheme to Conv, our three-headed, conversation structure-aware encoder (see Sec. 3.4). Details of the DAC method are in Appendix B.1, and details of the CDAC method are in Appendix B.2.

Evaluation
We describe our experimental setup for novel intent discovery. We prove the efficiency of the proposed method on real-world communication datasets. To verify the gains from different framework components, we present more results in the ablation section (Sec. 5).

Real-world internal datasets
We used three internal datasets: Purchase, Delivery, and Retail, drawn from real traffic to CX support at Allegro, in Polish. CX consultants manually annotated the datasets with intent labels. Categories of email queries to the CX team are more fine-grained than in the widely used Banking77 (Casanueva et al., 2020) dataset. Moreover, such real-world datasets are highly imbalanced, with some intents overlapping. Basic dataset statistics are shown in Table 1. The user emails vary in length and style and may contain irrelevant parts. Each dataset includes messages of different quality and specificity, ranging from uninformative chit-chat to well-written ones. Only the first question and the direct answer are included; all further messages from the correspondence thread are omitted. Purchase and Delivery cover conversations between buyers and CX consultants. Retail is communication between buyers and merchants, so the conversation topics and structure are different. We use a stratified 80/10/10 train/val/test split.
We also use two public benchmark English datasets from task-oriented dialog systems, CLINC150 (Larson et al., 2019) and Banking77 (Casanueva et al., 2020), in the ablation study in Section 5.2. Their splits follow exactly the experimental setup of (Zhang et al., 2020) to increase the reproducibility of our work. In the other ablations, using these datasets is impossible due to the missing conversational and weak label signals.
Basic statistics of the datasets are in Table 1. Further details are in Appendix A.

Experimental setting
We build a controlled open-world intent discovery setup, following the setup proposed in (Lin et al., 2020; Zhang et al., 2020). We prepared novel intents by randomly masking all examples from 50% of the intents in the training set. The remaining intents serve as known intents and are additionally partially masked: we masked 50% of all remaining examples. We then apply the representation learning framework: we take the in-domain encoders described in Sections 3.3.2 and 3.3.3 and perform the fine-tuning step (described in Sections 3.4 and 3.5). After the training phase, we cluster the whole test dataset with K-means. We performed clustering with the ground truth number of clusters (i.e., the number of intents in the dataset).
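The masking procedure can be sketched as follows (a simplified re-implementation for illustration, not the exact experiment code):

```python
import random

def mask_intents(examples, labels, novel_frac=0.5, mask_frac=0.5, seed=0):
    """Simulate open-world discovery: remove the labels of all examples
    from a random fraction of intents (treated as novel) and of a
    fraction of the remaining labeled examples.
    Returns labels with None marking unlabeled examples."""
    rng = random.Random(seed)
    intents = sorted(set(labels))
    novel = set(rng.sample(intents, int(len(intents) * novel_frac)))
    masked = []
    for _, y in zip(examples, labels):
        if y in novel or rng.random() < mask_frac:
            masked.append(None)   # unlabeled during training
        else:
            masked.append(y)      # known intent, kept labeled
    return masked
```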
We run experiments with the hyperparameters (i.e., representation size, batch size, and learning rate) fixed. We have described the method of their selection in Appendix D.
We use five random seeds, which govern intent masking and weight initialization. We train the model for 100 epochs on a single machine with an NVIDIA V100 GPU. It takes a few hours to run a single fine-tuning experiment for all seeds for a single setting (dataset, training scheme, etc.).

Metrics
We compute metrics based on the cluster ids from the K-means algorithm and the ground truth labels. The discovery quality is probed with three standard clustering metrics: Accuracy (ACC) using the Hungarian algorithm, Normalized Mutual Information (NMI), and Adjusted Rand Index (ARI). We also introduce two additional metrics. First, the binary F1-score, i.e., a macro F1-score with a majority vote on the cluster label, calculated on the whole dataset, where all known intents are one class and all novel intents are the second class. Second, the macro F1-score with a majority vote on the cluster label, which turns the clustering quality problem into a multi-label classification. In the main part of the paper, we report AVG, i.e., the average of the five metrics over all seeds. AVG increases with clustering quality up to 100%. AVG is the primary metric used for model selection. In Appendix F, we give more details on how we compute the metrics and test for statistical significance.
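For instance, the ACC metric matches cluster ids to ground-truth labels with the Hungarian algorithm, available as `scipy.optimize.linear_sum_assignment`:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """ACC: accuracy under the best one-to-one mapping between cluster
    ids and labels, found with the Hungarian algorithm."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((n, n), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                      # co-occurrence counts
    row, col = linear_sum_assignment(cost.max() - cost)  # maximize matches
    return cost[row, col].sum() / len(y_true)

acc = clustering_accuracy([0, 0, 1, 1], [1, 1, 0, 0])  # 1.0: perfect up to relabeling
```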

Results
Table 2 shows the AVG metric for our best-performing model.

Ablation
We attribute the improvement in performance to all three method components: domain adaptation during pre-training with the conversational and weak label signals, the state-of-the-art training scheme CDAC, and leveraging the conversation structure with our Conv method introduced in Section 3.4.

Initialization
In this section, we show the effect of initialization on the novel intent discovery task. We trained a conversation structure-aware encoder with the CDAC scheme using four different initializations and report the AVG metric on the downstream task. The simultaneous drop in quality on the Retail dataset, which originates from the domain for which we did not have noisy labels, confirms this phenomenon.

Training schemes
We compare the two training schemes Static and CDAC from Sec. 3.5 with two additional baseline methods, DAC and Supervised. For the Supervised training scheme, we use the Large Margin Cosine Loss (LMCL) (Wang et al., 2018) to learn representations from labels, discarding the unlabeled data from the train set. We train the models for all four schemes with question input only, using BERT-base (Devlin et al., 2019) for English and AlleBERT for Polish datasets. This ablation study is the only case in which we can use the two public benchmark English datasets from task-oriented dialog systems: CLINC150 (Larson et al., 2019) and Banking77 (Casanueva et al., 2020). Unfortunately, the public benchmark datasets lack the answer data, a large amount of unlabeled data, and weak labels. However, including them in this ablation study increases the reproducibility of our work and brings interesting insights.
The AVG metric is reported in Table 4, and the individual metrics can be found in Table 10. For all datasets, there is a gain from using intent labels (Supervised and CDAC). For the public datasets, among the unsupervised methods, DAC outperforms static representations; however, supervised training is better than semi-supervised CDAC. The results are the opposite for the internal datasets: DAC is better than static representations, and semi-supervised CDAC is better than supervised training. We hypothesize that the different results on real-world and benchmark datasets might be due to differences in dataset quality and size. In general, the benchmark datasets are larger and more balanced. Moreover, mail messages from real-world e-commerce are longer and noisier on average. It is an open question how this trend holds for other real-life datasets.
To sum up, there is a gain from intent labels for all datasets. The optimal solutions for public benchmarks and real-world internal datasets differ. CDAC is the best training scheme that uses intent labels for the internal datasets.

Conversational structure
We examine whether any further performance gains can be obtained by incorporating the answer field signal. We conduct experiments only on the internal datasets and use only the best training scheme, i.e., CDAC. We examine five training configurations: question-only representation Q trained with λ = (1, 0, 0); answer-only representation A trained with λ = (0, 1, 0); question-answer concatenation QA concatenation trained with λ = (0, 0, 1); question and answer in a simpler two-headed model QA two heads trained with λ = (1/2, 1/2, 0); and the full three-headed conversational model Conv trained with λ = (1/3, 1/3, 1/3), described in detail in Sec. 3.4. The AVG metric is reported in Table 5, and the individual metrics can be found in Table 11. The answer alone performs worse than the question alone; we hypothesize that this is due to many non-informative generic answers. Perhaps for other real-world datasets the consultant's answer may be superior to the user's questions. Passing only the question signal is a strong baseline. Let us check whether it is possible to incorporate signals from both the question and answer fields in a way that improves performance over the Q, question-field-only baseline. The simple QA concatenation does not beat Q in the significance test, and the same goes for the more sophisticated QA two heads variant. Only our method Conv, with a three-headed encoder, is better than Q with statistical significance. Incorporating both the question and answer signals leads to further improvements.
To sum up, after examining multiple ways to include the conversational signal, we conclude that our method Conv with a three-headed encoder improves the performance by 5 to 13.5pp.
6 Commercial deployment

6.1 Production pipeline overview
The method we described and verified experimentally is part of a larger multi-component system for continuous intent discovery deployed commercially, shown in Fig. 2. Here we briefly list the major components of our production pipeline to give the bigger picture:
1. Representation learning. Representation learning plays a core role in our pipeline. This component is the subject of the experiments in this paper and consists of two subcomponents: encoder pre-training and fine-tuning for the clustering task.
2. (Over)clustering with K-means. We cluster the representations to discover intent groups in the data. The number of novel intents is required by K-means. We overestimate this value, as it is less time-consuming to manually merge clusters with the same intent.
3. Cluster postprocessing. Various postprocessing steps make analyzing the clusters by the human annotators more efficient:
(a) Multi-document summarization. The summarization module provides human-readable candidates for the intent name instead of cluster ids. First, we train a logistic regression classifier with bag-of-words features to predict cluster ids. Then, we identify the most informative sentence in each message using the classifier coefficients (Angelidis and Lapata, 2018). Finally, we select the five most central sentences across all messages (Zheng and Lapata, 2019).
(b) Known intent prediction. We need to distinguish clusters with known intents from clusters with potentially novel intents. Since the labeled messages are typically a small subset of the training dataset, we infill intents for the unlabeled examples with an intent classifier and present this information to the human annotators.

4. Novel intent selection and data annotation. Human annotators manually analyze all discovered clusters and choose which novel intents to include in the taxonomy. They annotate all messages from the clusters to be included in the labeled dataset to ensure the high coherence of the newly discovered intents.
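The sentence-selection step of the summarization component (logistic regression over bag-of-words features, sentences scored by classifier coefficients) can be sketched with scikit-learn; the toy messages and cluster ids below are invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# toy messages from two hypothetical clusters
msgs = ["parcel is late. please help", "my parcel never arrived",
        "refund not received yet", "still waiting for my refund"]
cluster = [0, 0, 1, 1]

vec = CountVectorizer()
X = vec.fit_transform(msgs)
clf = LogisticRegression().fit(X, cluster)  # predict cluster ids from words

def most_informative_sentence(message, cluster_id):
    """Score each sentence by the summed classifier coefficients of its
    words for the given cluster (binary case shown for simplicity)."""
    coef = clf.coef_[0] * (1 if cluster_id == 1 else -1)
    vocab, analyze = vec.vocabulary_, vec.build_analyzer()
    def score(sent):
        return sum(coef[vocab[w]] for w in analyze(sent) if w in vocab)
    sentences = [s.strip() for s in message.split(".") if s.strip()]
    return max(sentences, key=score)

best = most_informative_sentence("refund not received yet. hello there", 1)
```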
The CX intent dataset updated with new intents is the end product of our intent discovery pipeline. Its primary purpose is to train an intent classifier served in real time to CX consultants. That classifier is a complex pipeline of its own; it has an architecture similar to the representation learning model in the intent discovery pipeline, and it reuses the pre-trained encoders. Even though the consultant's answer and the consultant's weak label are not known at the serving time of the intent classification model, we leverage these signals to build a better intent dataset and directly train a better intent classification model.

Commercial benefits case study
Thanks to the deployed pipeline, we doubled the number of defined intents for customer support within one year. Initially, the taxonomy consisted of 100 classes manually defined by the CX consultants. The commercial deployment of the intent discovery pipeline happened at the moment when the domain experts failed to find any new intents manually. Roughly 50 new intents were discovered thanks to our intent discovery pipeline. The selected clusters were reasonably pure: over 90% (mean and median) of the examples from the selected clusters were labeled as the given intent. Additional examples for the new intents were further added (active learning, etc.), and at the moment, the examples from the clustering process constitute at least 40% of all examples for the 50 automatically discovered intents. Currently, after extending our taxonomy from other sources as well, it has roughly 180 intents.
In addition, the pipeline decreased the time required to define novel intents from weeks to days with the additional benefit of analyzing several-fold more messages.The more comprehensive taxonomy significantly impacts the total benefit from the automation process, improves user experience by providing faster responses, and saves the cost of hiring additional CX consultants.

Conclusions
This paper describes an intent discovery pipeline deployed on a large e-commerce platform. Access to real-life datasets allows extending the established intent discovery models to better leverage vast amounts of unlabelled data, its conversational structure, and additional signals like weak labels. In particular, we learn the following lessons:
1. Among multiple ways to handle conversational data, Conv, our generalization of the CDAC model to a three-headed encoder that uses all available conversational data (i.e., question and answer), increases the performance of the intent discovery pipeline the most. See Section 5.3.
2. Significant gains also come from pre-training the encoder on an unlabelled in-domain dataset with conversational structure and weak labels (TagBERT). See Section 5.1. Therefore, we recommend a system architecture that enables weak labeling by the consultants by design.
3. Even though the consultant's answer and weak labels are not available at the serving time of the intent classification model, they can be used offline for novel intent discovery to build a better dataset and directly improve intent classification. This is the case for our commercially deployed pipeline. See Section 6.
4. Gains from incorporating additional signals (the Conv method, TagBERT) are larger than gains from using state-of-the-art methods (CDAC) on datasets without additional signals. See Section 4.4. We advocate for a shift both in the construction of and research on intent detection datasets.

Limitations
We are aware of two major factors that may affect the generality of our research: shortcomings of the simulated novel intent discovery setup and the assumption that intent detection is a classification problem.
Simulated experiments. In the experimental section, we use small, entirely annotated datasets to analyze different design choices of the representation learning component. We naturally include only already discovered intents (which does not mean these are all possible intents). Our masking procedure, which follows the research papers (Lin et al., 2020; Zhang et al., 2020), has three drawbacks. Firstly, when we mask most of the dataset, we effectively do few-shot learning, whereas, in reality, the amount of annotated data is much larger. The observed differences between design choices may be mitigated once more data is available. Secondly, the real class imbalance may not be reflected in the experimental dataset due to the annotation procedure. Lastly, the ratio between batch size and dataset size is much smaller for real datasets, since, in general, we are training with a large amount of unannotated data. This directly affects the batch-based pair statistics when using a random sampler in the CDAC algorithm. The chance that annotated examples will be present in a batch is low, so effectively we are almost entirely learning from pseudo-pairs during the semi-supervised stage.
Intent detection as classification. We treat intent discovery as classification, i.e., each utterance has only one intent. In reality, users may have more than one goal, which transforms the problem into a multi-label scenario. Naturally, we could treat multi-label examples as yet another class, but we do not explore their influence on pipeline performance since they were in a significant minority.
B.1 Deep Adaptive Clustering (DAC)
DAC trains the encoder with a pairwise binary cross-entropy objective,

L = −Σ_ij [R_ij log S_ij + (1 − R_ij) log(1 − S_ij)],   (2)

where R_ij = 1 for positive pairs, R_ij = 0 for negative pairs, and S_ij is the cosine similarity of the representations. The pseudo-label matrix R is defined in an online fashion for every pair of examples in a batch using the current model predictions, i.e.,
R_ij = 1 if S_ij ≥ u(λ), and R_ij = 0 if S_ij ≤ l(λ),   (3)

where u(λ) and l(λ) are the upper and lower thresholds. Pairs between the thresholds do not take part in the training. This is compensated for by adding the penalty term u(λ) − l(λ) to the final loss. The thresholds are updated every epoch as functions of λ, with the update rule λ = λ + 1.1 · 0.009 per epoch (Chang et al., 2017). We start with λ = 0.
The training ends when u(λ) = l(λ). The training resembles curriculum learning: we start with confident examples with very high or low cosine similarity and then introduce more uncertainty. The penalty term also reflects our confidence, since it controls the strength of the gradient updates.
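The online pseudo-label construction of Eq. (3) can be sketched in NumPy (a simplified illustration; −1 marks pairs between the thresholds that are excluded from the loss):

```python
import numpy as np

def pseudo_labels(S, u, l):
    """DAC pseudo-label matrix from pairwise cosine similarities S:
    confident similar pairs (S >= u) get R=1, confident dissimilar
    pairs (S <= l) get R=0, pairs in between are masked out (-1)."""
    R = np.full(S.shape, -1.0)   # -1 == not selected for training
    R[S >= u] = 1.0
    R[S <= l] = 0.0
    return R

S = np.array([[1.0, 0.97, 0.2],
              [0.97, 1.0, 0.5],
              [0.2, 0.5, 1.0]])
R = pseudo_labels(S, u=0.95, l=0.3)
```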

B.2 Constrained DAC (CDAC)
This extension of DAC to a semi-supervised scenario was introduced in (Lin et al., 2020). In the unsupervised case, we only use the contrastive objective with pseudo-labels. Once we have annotated examples, we define true positive and negative pairs with the labels. The label matrix R now has a pseudo-label part (3) and an exact part given by R_ij = 1 if y_i = y_j and R_ij = 0 otherwise, where y_i denotes the encoded label of the i-th example. Since a batch now includes annotated and unannotated examples, we need to redefine the pseudo-labels. We consider three cases: firstly, pseudo-labels can be defined only among unannotated examples; secondly, we can allow pseudo-labels between pairs of annotated and unannotated examples; lastly, we can define pseudo-labels for all possible pairs, including pairs of annotated examples. We chose the second scenario.
An additional modification is alternating training. Even epochs use only annotated data and no threshold penalty. Odd epochs use the whole dataset and the pseudo-label matrix as well as the exact one. The loss in the supervised phase is additionally scaled by the δ ≥ 1 hyperparameter to control the weight put on annotated data.

C Metrics
We use the following metrics in our experiments. Three clustering metrics measure the separation of novel intents from each other:
• Accuracy (ACC) measures cluster purity. Cluster and ground-truth labels are matched with the Hungarian algorithm.
• Normalized Mutual Information (NMI) specifies the amount of uncertainty about class labels given cluster labels.
• Adjusted Rand Index (ARI) checks for all sample pairs whether their assigned and ground truth labels are the same.
ACC, NMI, and ARI are calculated only on examples with a novel intent as the ground truth label. The separation of the novel from the known intents is measured by:
• Binary F1-score. It is a macro F1-score with a majority vote on the cluster label, calculated on the whole dataset, where all known intents are one class and all novel intents are the second class.
Last but not least, one metric measures both the separation between novel intents and the separation of the novel from the known:
• Macro F1-score with a majority vote on the cluster label. It turns the clustering quality problem into a multi-label classification.
The macro average is calculated only for the novel intents. Examples with any ground truth label may be included. All metrics increase with clustering quality up to 100%. We use five random seeds, which govern intent masking and weight initialization. In the main part of the paper, we report AVG, i.e., the average of the five metrics listed above (which are correlated variables) over all seeds. AVG is the primary metric used for model selection. Whenever in doubt, we confirm that the difference between AVG metrics is statistically significant with a correlated t-test at a p-value threshold of 5%. Additionally, to facilitate comparison with other research, the five metrics are listed separately in the Appendix for all experiments.

D Initial fine-tuning
We start our experiments with tuning the representation size, batch size, and learning rate hyperparameters for the CDAC training scheme. For every dataset, we optimize the hyperparameters in two steps: first selecting the optimal representation size via grid search over the representation sizes {16, 32, 64, 128, 256} and learning rates {1e-05, 5e-05, 1e-04}, and then selecting the optimal learning rate and batch size via grid search over the batch sizes {16, 32, 64, 128, 256, 512} and the same learning rates as in step 1. Tab. 7 shows the relation of the selected hyperparameters to the number of intents. The selected hyperparameters are later fixed in the experiments. Additionally, to improve training stability, we perform an additional learning rate search, again within the values {1e-05, 5e-05, 1e-04}, separately for every setup that uses the Conv method.
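The two-step search can be sketched as follows; the fixed batch size used during step 1 is our assumption for illustration, and `evaluate` stands in for a full fine-tuning run that returns the AVG metric:

```python
from itertools import product

def two_step_grid_search(evaluate):
    """Two-step hyperparameter search: first pick the representation
    size (jointly with a learning rate), then the batch size and the
    final learning rate with the size fixed. `evaluate` maps a config
    dict to a score to maximize (AVG in the paper)."""
    lrs = [1e-05, 5e-05, 1e-04]
    # step 1: representation size (batch size 128 is an assumed default)
    best_size, _ = max(
        ((s, evaluate({"size": s, "lr": lr, "batch": 128}))
         for s, lr in product([16, 32, 64, 128, 256], lrs)),
        key=lambda pair: pair[1])
    # step 2: batch size and learning rate with the size fixed
    best_batch, best_lr = max(
        ((b, lr) for b, lr in product([16, 32, 64, 128, 256, 512], lrs)),
        key=lambda bl: evaluate({"size": best_size, "lr": bl[1], "batch": bl[0]}))
    return {"size": best_size, "batch": best_batch, "lr": best_lr}
```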

E Pre-trained encoders (details)
To leverage large amounts of historical data, we compare four self-supervised encoders and one supervised encoder trained on conversational data. The training procedure for each encoder is described in detail below for reproducibility. The encoders are used for the experiments in Sec. 4.4.
HerBERT State-of-the-art BERT-base language model for Polish (Mroczkowski et al., 2021) trained with Masked Language Model (MLM) objective.
AlleBERT The model results from further fine-tuning HerBERT on internal unsupervised conversational data. A single training example contains a conversation thread clipped to 512 tokens. We always clip threads to a random subsequence of whole consecutive utterances to preserve the conversational context. AlleBERT is trained with the MLM objective for 100k steps with a linearly decaying learning rate schedule (peak value 1e-05) and a batch size of 224. The training on four NVIDIA A100 GPUs lasted 2 days.
AlleConveRT The model results from further fine-tuning AlleBERT on the same data, but with a mixture of two objectives: MLM loss with a ratio of 0.2 and Conversational Contrastive Loss (CCL). Following ConveRT (Henderson et al., 2020b), we leverage the structure of conversations with alternately exchanged utterances in a metric learning setup. Positive examples are consecutive messages from a single conversation, and negatives come from answers within the training batch. To reduce overfitting to specific utterances, we use label smoothing with a value of 0.2 (same as Henderson et al., 2020b). To utilize the conversational data structure, we add two projection heads on top of the AlleBERT encoder, one each for the question and answer representations. AlleConveRT is trained for 280k steps with a peak learning rate of 1e-05 and a batch size of 448. The training on four NVIDIA A100 GPUs lasted 4 days.
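The in-batch contrastive objective described above can be sketched in NumPy. This is a minimal sketch, not the deployed implementation: the batch size, the random projection values, and the cosine normalization are illustrative assumptions:

```python
import numpy as np

def conversational_contrastive_loss(q, a, smoothing=0.2):
    """In-batch contrastive loss: the i-th question should match the i-th
    answer; all other answers in the batch serve as negatives.
    Cross-entropy over the similarity logits, with label smoothing."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    logits = q @ a.T                      # (batch, batch) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    n = len(q)
    # Smoothed targets: 1 - smoothing on the diagonal (the true answer),
    # the remaining smoothing mass spread over in-batch negatives.
    targets = np.full((n, n), smoothing / (n - 1))
    np.fill_diagonal(targets, 1.0 - smoothing)
    return -(targets * log_probs).sum(axis=1).mean()

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 16))                       # question projections
a = q + 0.01 * rng.normal(size=(8, 16))            # near-matching answers
loss = conversational_contrastive_loss(q, a)
print(loss)
```

Spreading the smoothing mass over the negatives is what discourages the model from overcommitting to specific utterance pairs.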
TagBERT The model is trained in a two-stage fine-tuning of the first version of HerBERT (Rybak et al., 2020). In the first stage, we fine-tune the model on internal unsupervised conversational data. We use the MLM objective and the Message Threads Structural Objective (MTSO). MTSO is the Sentence Structural Objective (Wang et al., 2020) tailored to the conversation domain: during training, we swap messages with respect to threads instead of swapping sentences with respect to documents. TagBERT is trained for 100k steps with a batch size of 640 and a peak learning rate of 8e-05.
In the second stage, we fine-tune the model on a multi-label classification task. The model predicts several of 512 classes for each thread. The noisy and highly imbalanced labels come from tags that CX consultants add to the conversation threads, roughly identifying the problem solved. The training dataset contains 2.5M messages. TagBERT is trained for 38k steps with a peak learning rate of 1.6e-04 and a batch size of 512. The training on sixteen NVIDIA P100 GPUs lasted 8 hours.
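One plausible reading of the MTSO data preparation can be sketched as follows. The helper below is hypothetical: the exact corruption scheme and labeling convention are our assumptions based on the description above, by analogy with sentence-swap objectives:

```python
import random

def mtso_examples(threads, rng=random.Random(0)):
    """Sketch of MTSO data prep: for each thread, either keep it intact
    (label 0) or swap one message with a message taken from another
    thread (label 1); the model then predicts which case it sees."""
    examples = []
    for i, thread in enumerate(threads):
        thread = list(thread)
        label = rng.randint(0, 1)
        if label:
            other = rng.choice([t for j, t in enumerate(threads) if j != i])
            pos = rng.randrange(len(thread))
            thread[pos] = rng.choice(other)  # corrupt with a foreign message
        examples.append((thread, label))
    return examples

threads = [["hi", "where is my order?", "it ships today"],
           ["refund please", "done"],
           ["change address", "sure, what is the new one?"]]
for msgs, label in mtso_examples(threads):
    print(label, msgs)
```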

Figure 1 :
Figure 1: Representation model based on a BERT-base encoder used in the discovery pipeline. On the left, the version with one head. On the right, Conv, our conversational model with three separate trainable heads for the question, the answer, and the question-answer concatenation. The parameters of the encoder are frozen except for the last transformer block.
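The three-head variant in Figure 1 can be sketched as follows. This is a minimal NumPy sketch: the `encode` placeholder, the dimensions, and the weight initialization are assumptions standing in for the (mostly frozen) BERT-base encoder, not the deployed implementation:

```python
import numpy as np

class ConvHeads:
    """Sketch of the Conv model: a shared, mostly frozen encoder followed
    by three trainable projection heads, one each for the question, the
    answer, and the question-answer concatenation."""
    def __init__(self, hidden=768, repr_size=128, seed=0):
        rng = np.random.default_rng(seed)
        self.w_q = rng.normal(scale=0.02, size=(hidden, repr_size))
        self.w_a = rng.normal(scale=0.02, size=(hidden, repr_size))
        self.w_qa = rng.normal(scale=0.02, size=(hidden, repr_size))

    def encode(self, texts):
        # Placeholder for the frozen BERT-base encoder: returns one
        # 768-dim vector per input text (random stand-in values).
        rng = np.random.default_rng(42)
        return rng.normal(size=(len(texts), 768))

    def forward(self, questions, answers):
        q = self.encode(questions) @ self.w_q
        a = self.encode(answers) @ self.w_a
        qa = self.encode([f"{x} {y}" for x, y in zip(questions, answers)]) @ self.w_qa
        return q, a, qa

model = ConvHeads()
q, a, qa = model.forward(["where is my parcel?"], ["it ships tomorrow."])
print(q.shape, a.shape, qa.shape)  # each (1, 128)
```

Keeping the encoder shared and training only the heads (plus the last transformer block, per the caption) keeps the number of trainable parameters small while still letting each input view specialize.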

Figure 2 :
Figure 2: Intent discovery pipeline deployed at Allegro, with a human in the loop carrying out novel intent selection and data annotation. Representation learning components are the subject of experiments in this paper. The main outcome of the pipeline is an updated intent detection dataset, which can be used to train a better intent classification model.
(a) In-domain pre-training of encoders. Encoders with the BERT-base architecture are pre-trained on large chunks of historical data. We include additional signals such as the conversational structure (i.e., question and answer) and a weak label signal (Sections 3.3.2 and 3.3.3). The encoders are reused for the intent classification model. (b) Fine-tuning for the clustering task. We further train the in-domain encoders. If annotated data exists, we use semi-supervised CDAC with Conv (Section 3.4). Otherwise, we use static embeddings.

Figure 3 :
Figure 4 :
Figure 3: Internal dataset visualization. On the left, we visualize a t-SNE mapping of sentence representations to 2 dimensions. Different colors indicate different intent labels; each point corresponds to a single example in the dataset. On the right is a scatter plot of intent sizes and Silhouette score per intent, where each point corresponds to one intent in the dataset. Silhouette score values range from -1 to 1: 1 indicates perfect clustering, and 0 indicates overlapping clusters. The visualizations show the initial difficulty of the clustering task on general-domain pre-trained models.
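The Silhouette score used in Figure 3 can be sketched with a minimal implementation on toy 2-D points; the real pipeline computes it on sentence representations, and library implementations (e.g., scikit-learn's `silhouette_samples`) would normally be used instead:

```python
import numpy as np

def silhouette(X, labels):
    """Per-point silhouette: (b - a) / max(a, b), where a is the mean
    distance to the point's own cluster and b is the mean distance to
    the nearest other cluster."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
    scores = []
    for i, li in enumerate(labels):
        same = [j for j, lj in enumerate(labels) if lj == li and j != i]
        a = D[i, same].mean()
        b = min(
            D[i, [j for j, lj in enumerate(labels) if lj == l]].mean()
            for l in set(labels) if l != li
        )
        scores.append((b - a) / max(a, b))
    return np.array(scores)

# Two well-separated toy "intents": per-point scores should be close to 1.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(silhouette(X, [0, 0, 1, 1]).mean())
```

Averaging the per-point scores within each intent gives the per-intent values plotted on the right of Figure 3.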

Table 1 :
Downstream task dataset characteristics. Class imbalance is measured by the average number of examples per intent and the normalized Shannon entropy of the intent distribution (which is 1 for the perfectly balanced case and lower under class imbalance). Further details are in Appendix A.

Table 2: Static baseline and CDAC representations compared with our framework on the novel intent discovery task for real-world data. Our framework combines the TagBERT pre-trained encoder, the CDAC training scheme, and the Conv method for using the conversation structure. AVG metric averaged over five seeds.

Table 3 :
Impact of initialization for the novel intent discovery task. The Conv conversation structure-aware encoder was trained with the CDAC scheme from different initializations. AVG metric averaged over five seeds with standard deviation.

Five individual metrics are listed in Table 8. We significantly improve intent discovery compared with the baselines. Our model uses TagBERT (see Section 3.3.3) as initialization and is trained with the CDAC scheme. During training, we used both the question and answer fields and utilized the conversational structure-aware encoder Conv introduced in Sec. 3.4. The baselines (Static and CDAC) are based on the general-domain HerBERT encoder and use the question field only. We improved over the second-best CDAC, depending on the dataset, by 8.9pp to 33pp. The performance gap of our framework to the CDAC baseline is greater than the superiority of CDAC over the naive baseline, static embeddings, which is between 7.7pp and 13.2pp.

Table 4 :
Evaluation of training schemes for novel intent discovery. We report the AVG metric averaged over five seeds with standard deviation. Models use a BERT-base (English datasets) or AlleBERT (Polish datasets) encoder and question input only. The best results are in bold.

Aggregated results are shown in Table 3 and individual metrics in Table 9. Comparing AlleBERT with HerBERT, we can see that domain-adapted initialization improves results by 1 to 7pp for discovering new intents. Further adaptation of the starting encoder with the ConveRT loss improves results by at least 5pp. Summarizing, the AlleBERT and AlleConveRT initializations bring gains for all internal datasets. For the CX domain (Purchase or Delivery), the best initialization was provided by TagBERT. Pre-training with weak labels introduced additional training information that turned out to be transferable.

Table 5 :
Evaluation of conversational structure for novel intent discovery. We report the AVG metric averaged over five seed runs with standard deviation. Models use AlleBERT initialization, the CDAC training scheme, and various inputs, i.e., the question (Q), the answer (A), or both fields (QA) in three model variants: QA concatenation, QA two heads, and Conv. The best results are in bold.