Improving Classroom Dialogue Act Recognition from Limited Labeled Data with Self-Supervised Contrastive Learning Classifiers

Recognizing classroom dialogue acts has significant promise for yielding insight into teaching, student learning, and classroom dynamics. However, obtaining K-12 classroom dialogue data with labels is a significant challenge, and therefore, developing data-efficient methods for classroom dialogue act recognition is essential. This work addresses the challenge of classroom dialogue act recognition from limited labeled data using a contrastive learning-based self-supervised approach (SSCon). SSCon uses two independent models that iteratively improve each other's performance by increasing the accuracy of dialogue act recognition and minimizing the embedding distance between the same dialogue acts. We evaluate the approach on three complementary dialogue act recognition datasets: the TalkMoves dataset (annotated K-12 mathematics lesson transcripts), the DailyDialog dataset (multi-turn daily conversation dialogues), and the Dialogue State Tracking Challenge 2 (DSTC2) dataset (restaurant reservation dialogues). Results indicate that our self-supervised contrastive learning-based model outperforms competitive baseline models when trained with limited examples per dialogue act. Furthermore, SSCon outperforms other few-shot models that require considerably more labeled data.


Introduction
Dialogue analysis offers significant potential for improving our understanding of classroom learning and teaching by modeling dialogue between students and teachers. Studies of classroom dialogue can provide deep insight into how students learn most effectively and engage with each other and with teachers (Mercer et al., 2019; Mercer, 2010; Resnick et al., 2010; Hmelo-Silver, 2004). A long-standing goal in analyzing classroom dialogue is to understand how student-student and student-teacher dialogues lead to better student learning outcomes (Wendel and Konert, 2016). This work addresses the problem of dialogue act recognition in K-12 classroom dialogues.
Dialogue act recognition has garnered considerable attention and is useful for many tasks such as dialogue generation and understanding (Chen et al., 2022; Lin et al., 2021; Goo and Chen, 2018). Recent efforts in dialogue act recognition are built on large-scale pre-trained language models (Qin et al., 2021, 2020; Wang et al., 2020; Raheja and Tetreault, 2019; Chen et al., 2018). These models demonstrate high performance on standard datasets but require substantial labeled training data and, in some cases, combine other corroborative labels such as sentiment. Finding labeled public datasets of K-12 classroom dialogues is challenging for several reasons. First, there are concerns about participants' privacy and security. Second, researchers often develop individualized coding schemes specific to their design framework and research context (Mercer, 2010; Song et al., 2019; Hao et al., 2020; Song et al., 2021). Therefore, even when labeled datasets are available, the assortment of coding schemes makes it challenging to cross-train across datasets. Third, classroom dialogue utterances are usually specific to a given subject matter, making generic labeled dialogue datasets less useful as auxiliary data. It is, therefore, essential to develop the capability to build dialogue act recognition models from limited labeled training data.
Our research addresses the lack of large-scale labeled classroom dialogue datasets by using a self-supervised contrastive learning-based model (SSCon) trained using limited labeled data. SSCon uses contrastive learning to transform the dialogue utterance representation into a new embedding space where identical dialogue acts cluster together and distinct dialogue acts are pushed further apart. The system iteratively improves performance, as the contrastive learning step benefits self-supervision, even when presented with limited labeled data. Experiments show that SSCon outperforms competitive baselines with just tens of labeled examples per dialogue act in both a K-12 mathematics classroom dataset and an everyday conversation dataset, DailyDialog. Our key contributions are the following:

• We propose a novel self-supervised contrastive learning dialogue act recognizer.
• We test our model on multiple datasets in distinctly different domains under label-scarce settings. Our experiments show that our model outperforms strong baselines.
• We illustrate with an ablation study why our model outperforms the baseline.
2 Contrastive Learning Model

Problem Definition
A dataset consists of a sequence of N utterances and a set of dialogue acts A. Dialogue act recognition (DAR) is defined as a classification problem that involves recognizing the dialogue act da_i of an utterance given its context, the set of m previous utterances, where da_i ⊆ A. The task is formalized as a multi-class classification problem and sometimes a multi-label classification problem, depending on the coding scheme used.

Approach and Assumptions
While our approach is primarily evaluated on the multi-class classification problem, we also test our model on the multi-label dataset DSTC2 (Henderson et al., 2014), with promising results. Since we target a few-shot learning scenario, we start with a small set of labeled utterance examples L and the remaining training set U of unlabeled utterances. We also have a set of labeled utterances T set aside as a validation set.

SSCon Overview
In this section, we describe the architecture of our model, the self-supervised contrastive learning (SSCon) based multi-class classifier, shown in Figure 1. Our model operates in multiple stages. We begin by finetuning a large pre-trained transformer-based language model (PLM), trained on large publicly available dialogue datasets, using our domain-specific dialogue dataset. We use the finetuned PLM to generate dialogue embeddings for our model. In Stage 1, we use the finetuned PLM to encode the utterance and its dialogue history. We also use sentence-BERT (Reimers and Gurevych, 2019) to create a latent representation of each utterance. A classifier built on the latent representation makes an initial soft prediction of the dialogue act. We distill the initial high-confidence predictions as soft labels for Stage 2. In Stage 2, the soft labels from Stage 1 train an encoder using contrastive learning. It translates the latent representations from Stage 1 into a vector in a different encoding space where identical dialogue acts cluster together while distinct dialogue acts are separated. In Stage 3, we pass the utterance representations through the encoder trained in Stage 2 to get the embeddings, which are the input to classify the dialogue acts. The high-probability soft labels from the Stage 3 classifier are sent back to Stage 1 as input to self-supervise the model in the next iteration.
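The three-stage loop described above can be sketched as follows. This is a minimal illustration with hypothetical stand-in trainers passed as callables; the actual system uses XGBoost classifiers and an MLP twin encoder over DialoGPT and sentence-BERT embeddings.

```python
def sscon_iteration(labeled, unlabeled, train_stage1, train_encoder,
                    train_stage3, threshold=0.85):
    """One SSCon self-supervision iteration (illustrative sketch).

    labeled:   list of (utterance_features, dialogue_act) pairs.
    unlabeled: list of utterance_features.
    train_*:   callables that fit a model on data and return a predictor;
               stand-ins for the XGBoost classifiers and MLP twin encoder.
    Returns the expanded labeled pool and the remaining unlabeled pool.
    """
    # Stage 1: soft-label the unlabeled pool with the context classifier.
    stage1 = train_stage1(labeled)
    confident = []
    for x in unlabeled:
        y, p = stage1(x)  # (predicted act, prediction probability)
        if p >= threshold:
            confident.append((x, y))

    # Stage 2: train the contrastive encoder on labeled data plus the
    # confident soft labels.
    encode = train_encoder(labeled + confident)

    # Stage 3: classify in the new embedding space and distill silver
    # labels back into the labeled pool for the next iteration.
    stage3 = train_stage3([(encode(x), y) for x, y in labeled + confident])
    new_labeled, remaining = list(labeled), []
    for x in unlabeled:
        y, p = stage3(encode(x))
        if p >= threshold:
            new_labeled.append((x, y))
        else:
            remaining.append(x)
    return new_labeled, remaining
```

The stand-in trainers make the control flow explicit: each stage's predictions gate what the next stage trains on, and only high-confidence labels survive into the next iteration.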

Pretraining
For our pretraining stage, we finetune DialoGPT (Zhang et al., 2020), a dialogue PLM built on GPT-2 (Radford et al.) and trained on 147M Reddit and other online conversations, licensed under the MIT License. Given an utterance u_i and its context of the m previous utterances, the dialogue PLM is finetuned to maximize the conditional probability of the subsequent utterance, p(u_i | u_{i-m}, ..., u_{i-1}). For pretraining, we create a dataset consisting of target utterances, each paired with a context of the preceding m utterances. SSCon uses the finetuned PLM's hidden state as the representation of the utterance u_i given its dialogue context.
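The construction of finetuning examples (a context of up to m preceding utterances paired with the target utterance) can be sketched as follows; `make_finetuning_pairs` is an illustrative helper, not part of the released code.

```python
def make_finetuning_pairs(utterances, m=9):
    """Build (context, target) pairs for dialogue-LM finetuning (sketch).

    For each utterance u_i, the context is up to m preceding utterances;
    the PLM is finetuned to maximize p(u_i | u_{i-m}, ..., u_{i-1}).
    """
    pairs = []
    for i in range(1, len(utterances)):
        context = utterances[max(0, i - m):i]
        pairs.append((context, utterances[i]))
    return pairs
```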

Stage 1: Context Classifier
The Stage 1 classifier concatenates the pretrained dialogue PLM's last-layer embedding H_i ∈ R^768 with the sentence-BERT embedding S_i ∈ R^384 of the utterance u_i into X_i = [H_i; S_i] and predicts the dialogue act y_i^stage1 = CL_stage1(X_i; Θ), where y_i^stage1 is the predicted dialogue act class for the last utterance u_i and CL_stage1 is the Stage 1 multi-class classifier with trainable weights Θ. The concatenated vector X_i represents the utterance u_i both as an independent sentence and in the context of its dialogue. The output of Stage 1 is a dialogue act label (and the corresponding prediction probability) for each utterance in the dataset. For the choice of classifier, we explored both MLP classifiers (Haykin, 1994) and XGBoost (Chen and Guestrin, 2016); however, any reasonable classifier can be utilized. We chose the XGBoost classifier as it was faster to train and run without any impact on the performance of our model. However, for one of the datasets (DSTC2), where an utterance can have multiple labels, we used an MLP-based multi-label classifier.
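The concatenation step can be illustrated with a small sketch (the helper name and plain-list representation are our own; the real embeddings come from DialoGPT and sentence-BERT):

```python
def concat_features(h_i, s_i):
    """Concatenate the dialogue-context embedding H_i (dim 768) with the
    sentence-BERT utterance embedding S_i (dim 384) into X_i (dim 1152)."""
    assert len(h_i) == 768 and len(s_i) == 384
    return list(h_i) + list(s_i)
```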
In the first iteration of the self-supervision process, the training data used for the classifier is limited to the number of available labeled samples, usually between ten and a hundred examples per class. In subsequent iterations, the number of samples increases based on the soft labels from earlier iterations.

Stage 2: Contrastive Encoder
The prediction from the Stage 1 classifier, CL_stage1, is used as (soft) ground truth for Stage 2. Specifically, we use the dialogue act labels with the highest confidence in terms of prediction probability as new soft labels, along with the initial labeled examples used in Stage 1. Including high-confidence soft-labeled samples increases the effective size of the training set with each iteration. Using this data, we adopt a contrastive training approach.
This training process creates a network that can encode the finetuned PLM latent representation into a space where utterances with the identical dialogue act class are close together while utterances with distinct dialogue act labels are farther apart. The labeled samples are paired to generate positive and negative triplets P = (X_i, X_j, P_ij), where X_i is the concatenated encoding defined in the previous section for a given utterance u_i, and P_ij is 1 if both utterances map to the same dialogue act or -1 if they map to different dialogue acts. In the case of multi-label utterances, P_ij is the cosine distance between the one-hot encodings of the dialogue act vectors of the two utterances. The encoder is a five-layer MLP that transforms the concatenated latent representation X_i ∈ R^1152 into an encoding E_i ∈ R^384. We use a twin encoder network to train on the positive/negative triplets P, using the cosine similarity between output embeddings as the similarity score: given a triplet (X_i, X_j, P_ij), the network is trained so that cos(B_encoder(X_i), B_encoder(X_j)) matches the target P_ij. The trained encoder transforms each utterance into a new encoding E_i = B_encoder(X_i), where E_i ∈ R^384 is the generated embedding of the MLP encoder (B_encoder).
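The triplet construction and the role of the cosine-similarity target can be sketched as follows. `make_pairs` and `pair_loss` are illustrative names, and the squared-error form of `pair_loss` is one plausible reading of the twin-network objective, not necessarily the exact loss used.

```python
from itertools import combinations
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def make_pairs(samples):
    """Build contrastive training triplets (X_i, X_j, P_ij).

    samples: list of (features, dialogue_act). P_ij is 1 for matching
    dialogue acts and -1 otherwise; the multi-label variant would use the
    labels' one-hot vectors instead.
    """
    return [(xi, xj, 1 if yi == yj else -1)
            for (xi, yi), (xj, yj) in combinations(samples, 2)]

def pair_loss(e_i, e_j, p_ij):
    """Squared error between the embeddings' cosine similarity and the
    target P_ij, pushing same-act pairs together and others apart."""
    return (cosine(e_i, e_j) - p_ij) ** 2
```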

Stage 3: Embedding Classifier
In this final stage of our model, we use the encoder from the Contrastive Encoder network that was trained in the previous stage to convert each X_i into the embedding E_i used as input to the Stage 3 classifier.
The Stage 3 classifier predicts y_i^stage3 = CL_stage3(E_i; Φ), where y_i^stage3 is the dialogue act label for the utterance u_i and CL_stage3 is a multi-class classifier with trainable weights Φ. Like Stage 1, we use an XGBoost classifier. We use the dialogue act labels with the highest confidence in terms of prediction probability from Stage 1 as the soft labels for training the classifier.
The training starts with a small set of labeled utterance examples L and the remaining training set U of unlabeled utterances. During every iteration of the self-supervision process, the model labels the unlabeled utterances U in our training set. We filter out the utterances labeled with low confidence (low prediction probability of CL_stage3). The distilled silver-labeled instances are moved from U to L, joining the initial labeled examples. The updated L is the training set for Stage 1, starting the next iteration of the self-supervision process.
3 Experimental Setup

Implementation Details
The pretraining stage involves fine-tuning the DialoGPT model with utterance data from a given dialogue dataset. The training set uses nine previous utterances as context. We finetune a HuggingFace pre-trained "dialoGPT-small" base model on a single GPU for four epochs. The version of DialoGPT we used is a 12-layer transformer. We use the hidden state vector H_i of the end-of-sentence tag in the 12th layer of the transformer as the embedding representing the last utterance in the sequence.
Stage 1 of the model uses a dialogue act classifier. We implemented an MLP-based classifier and an XGBoost classifier. There was no difference in performance between the two classifiers, so we picked XGBoost as it is a standard baseline. We use the hidden state of the last layer, H_i, as the embedding to represent the dialogue context of the utterance. For the sentence-BERT embedding S_i of the utterance, we used a pre-trained network ("all-MiniLM-L6-v2") that is trained on roughly 1B sentence pairs (Reimers and Gurevych, 2019). The input to the model is the concatenated vector of H_i and S_i. After running the model for various distillation thresholds, a threshold of 0.85 was used for high-probability soft-label distillation, based on a simple grid search over the threshold parameter as discussed in the appendix.
Stage 2 of the model is a contrastive encoder network. Each encoder network in the twin network is a 5-layer MLP with 20% dropout between the layers and an output embedding vector of dimension 384. About 1.2 million similarity pairs are used to train the twin network for 4-6 epochs on a single GPU. Stage 3 is an XGBoost classifier head on top of the encoder trained in Stage 2. The iterative self-supervision process continues until there is no improvement on the validation data. We run 5-10 iterations. In our experiments, we use a validation set to determine when to stop the iterations. We report the results on the test set.

Datasets
We use three datasets in our experiments: the TalkMoves, the DailyDialog, and the Dialogue State Tracking Challenge 2 datasets. The TalkMoves dataset (Suresh et al., 2022a) consists of 567 human-annotated classroom video transcripts of K-12 mathematics lessons between teachers and students. The human-transcribed dataset consists of 174,186 teacher utterances and 59,874 student utterances. The dialogue act labels in the dataset have an inter-rater agreement score above 90% for all labels. The dialogue acts for student utterances include 'relating to another student', 'asking for more information', 'making a claim', and 'providing evidence'. The dialogue acts for teacher utterances include 'keeping everyone together', 'getting students to relate', 'restating', 'revoicing', 'pressing for accuracy', and 'pressing for a reason'. For the TalkMoves dataset, we train our model on student utterances.
To evaluate our approach on a multi-label dataset, we train on the DSTC2 dataset (Henderson et al., 2014). This dataset contains dialogues between crowdsourced workers and automated dialogue systems in the restaurant reservation domain, with 1000 train and test dialogues and about 21 dialogue acts.

Baselines
We run our experiments on the different datasets described above. The TalkMoves dataset does not yet have many published baselines for dialogue act recognition, so we compare our approach against multiple baseline models. One is a baseline XGBoost classifier using the same utterance representation input as our model. This classifier is the same as our Stage 1 classifier. The second baseline is a self-supervised XGBoost classifier similar to SSCon but without the contrastive learning in Stage 2. For the third baseline, we use an embedding prototype distance-based classifier. The average sentence embedding vector of the labeled examples from each class represents a prototype for each label. We then measure the cosine distance of every utterance's sentence embedding in our test set against each class-prototype embedding. An utterance is assigned the class label of the closest prototype in the embedding space. To validate SSCon against current state-of-the-art dialogue act recognition models, we compare our results on the DailyDialog dataset against a few-shot learning model and a model that uses all available labeled training data. The state-of-the-art Co-GAT model (Qin et al., 2021) uses all training data available, including related sentiment labels on the utterances. Since our model is trained on very limited data, we do not expect it to beat Co-GAT; we use Co-GAT's performance as an upper bound. ProtoSeq (Guibon et al., 2021) is a sequential prototypical network trained to work in a few-shot fashion. However, they do use all training data to train their network. We use this model for comparison purposes, as our approach is a type of few-shot learner using a small number of samples. We also compare SSCon to standard XGBoost, as we do for the TalkMoves dataset. We also compare SSCon dialogue act recognition results on DSTC2, a multi-label problem, against the self-supervised student-teacher approach by Mi et al. (2021a).

Evaluation Metrics
The various baselines are compared with SSCon using four metrics. The Matthews Correlation Coefficient (MCC) for multi-class classification has a range of -1 to 1 and handles imbalanced datasets well. In some cases, the baselines only report the F1 (macro) score, the arithmetic mean of the individual class F1 scores, giving equal weight to all classes. We also report macro-averaged precision and recall.
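As a reference, the macro-averaged F1 score used here can be computed as follows (an illustrative pure-Python sketch; in practice a library such as scikit-learn provides both macro-F1 and MCC):

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: the mean of per-class F1 scores, so every class
    is weighted equally regardless of its frequency."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # Per-class F1 = 2*TP / (2*TP + FP + FN); zero when there are no
        # true positives for the class.
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)
```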

Results
The first dataset we consider is the TalkMoves dataset. To evaluate the models in a few-shot learning scenario, we experiment using 10-70 labeled instances per dialogue act type (less than 1% of the overall data). Table 1 shows our results. SSCon shows an improvement of about 7-10% over the Utterance Embedding XGBoost baseline and the Prototype Distance Classifier when trained on 70 labeled examples. When SSCon is compared with the self-supervised XGBoost classifier without contrastive learning, it shows an improvement of 1% for the 70 labeled examples case but an improvement of more than 60% for the 10 labeled examples case. This difference in performance suggests that contrastive learning is more beneficial when working with fewer labeled examples. Besides the three baselines, we also report the performance of a published model that uses 100% (21K labeled samples) and 70% of the labeled training data. SSCon works in a label-scarce scenario, so our results are lower than those of the model using the complete labeled dataset. Figure 2 shows the impact of labeled example counts on the overall performance of the SSCon classifier and an XGBoost classifier. While our iterative approach increases the performance on average by almost 50% over the baseline for a small training size (10 labels per class), it is about 9% for a larger labeled set size (70 labels per class). We notice that the overall performance improvement using SSCon is greater with small labeled sets, and the improvement tapers as the labeled example count increases.
The second dataset we consider is the DailyDialog dataset, with four possible dialogue acts. As with the TalkMoves dataset, we consider 10 to 70 labeled examples per class. Table 1 shows that our performance, with just 70 instances per class (0.6% of 21,521 labeled examples), is within 16% of the state-of-the-art dialogue act recognition model (Qin et al., 2021) that uses 100% of the training data labels plus auxiliary sentiment labels. We also show that we outperform the other few-shot learning model, ProtoSeq (Guibon et al., 2021), by a significant margin even though they train using the entire labeled dataset. We investigate the change in performance of SSCon with increasing size of the labeled training set (Figure 2). As before, the performance improvement of SSCon over a standard XGBoost classifier is more significant for smaller label sets and less so as the number of samples increases. With ten samples per class, SSCon's performance improvement over the baseline classifier is almost 41%. With seventy labels per class, the improvement is only about 9%.
We also compare our results on the DSTC2 dataset on restaurant inquiries. This dataset differs from the other two in that each utterance can have multiple dialogue act labels. SSCon performs better on the DSTC2 dataset than the baseline self-supervising student-teacher model ToD-BERT-ST (Mi et al., 2021a) by 3 to 10%, depending on the size of the labeled training dataset. The ToD-BERT-ST model uses a data augmentation approach, while we use a contrastive learning approach. These approaches are complementary, and future work should explore using them together.

Clustering in Embedding Space
The fundamental intuition behind our approach is that contrastive learning brings similar utterances closer to each other and pushes dissimilar utterances further apart in the embedding space. Figure 3 shows the spatial clustering of labels in the embedding space of the trained Contrastive Encoder for one iteration with the TalkMoves dataset. Each square in the heatmap corresponds to the median cosine distance between individual dialogue act class examples. To compute this distance, we take utterances from each dialogue act class and calculate the median cosine distance between their embeddings. Each cell in the heatmap shows a scaled distance, with blue corresponding to closer embeddings and red corresponding to embeddings that are farther away. The heatmap on the left corresponds to the distance matrix between embeddings provided as input to Stage 2. The heatmap on the right corresponds to the distance matrix for embeddings that the Contrastive Encoder has transformed. Note that the figure uses scaled distance values, and we show four student dialogue act classes.
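Each heatmap cell can be computed as the median cosine distance over all cross-pairs of embeddings from two dialogue act classes; an illustrative sketch:

```python
from math import sqrt
from statistics import median

def median_cosine_distance(embs_a, embs_b):
    """Median cosine distance (1 - cosine similarity) over all cross-pairs
    of two sets of embeddings -- one cell of the class-distance heatmap."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))
    return median(1 - cos(u, v) for u in embs_a for v in embs_b)
```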
When looking at the heatmap on the left for the input feature space, instances belonging to the dialogue act "relating to another student" (first row, label 8) are closer to the instances belonging to the dialogue act "making a claim" (third row, label 10) than among themselves. After being transformed by the Contrastive Encoder (heatmap on the right), the smallest distances for each dialogue act fall on the diagonal (blue), as one expects with clustered labels. The diagonal corresponds to utterances with the same label. The non-diagonal terms, which are utterances with different labels, are pulled farther apart (red).

Prediction | Example last utterances u_i
"relating to another" predicted as "making a claim" | "Reciprocal"; "I don't agree, I measured them"; "by five"
"relating to another" predicted as "relating to another" | "He didn't show his work"; "You don't know that"; "The first one"
"making a claim" predicted as "making a claim" | "Two fifths"; "Y intercept"; "power of three"

Table 2: Examples of true positives and false negatives for a couple of the TalkMoves dialogue acts.

Classifier Performance by Dialogue Act
Figure 4 shows a confusion matrix with counts normalized by the ground truth label counts. In the TalkMoves dataset, the classifier struggles the most with the dialogue act corresponding to "relating to another student" (label 8). It is not able to clearly distinguish between "relating to another student" and "making a claim" (label 10). Label-scarce training means we only cover a limited variety of examples for each dialogue act class. In Table 2, we show some examples of correct and incorrect predictions. The first row shows utterances of the type "relating to another" mislabeled as "making a claim". The other two rows are true positive examples. We can see that the example utterances are hard to distinguish between "making a claim" and "relating to another", even for a human. The difference lies in the context of the dialogue, and with limited training data, such distinctions are hard to make and might lead to overfitting.

Labeled Sample Selection
The initial set of labeled samples, in essence, drives the final performance of SSCon. Figure 5 shows the performance distribution of the baseline, XGBoost, for multiple runs with different sample sizes. Two trends are evident as the number of samples increases for every dataset. The first trend is that performance improves as more samples are available for training. The improvement in performance with sample size is an expected trend, as more examples provide more information for the classifier to model. The second trend is that the variance in performance across runs decreases as the sample size grows.

6 Related Work

Kalchbrenner and Blunsom (2013) proposed a hierarchical network architecture that combines CNN and RNN models to capture discourse structure. Subsequent work on dialogue act recognition (DAR) followed a similar approach using hierarchical architectures combining CNN and LSTM (Lee and Dernoncourt, 2016; Ji et al., 2016; Liu et al., 2017). More recently, researchers enhanced the model architecture by adding a CRF layer for classification (Kumar et al., 2018; Chen et al., 2018). Raheja and Tetreault (2019) build on earlier work that solves DAR as a sequential labeling problem using deep hierarchical networks by adding context-aware self-attention, with promising results on standard benchmark datasets. More recent work has built on publicly available large pre-trained language models, significantly reducing the required training data. Qin et al. (2020) showed that combining associated labels such as sentiment with DAR can improve performance. However, all the above architectures require significant training data to achieve peak performance. Our model (SSCon) works with limited training data as it builds on dialogue context captured by a finetuned pre-trained language model. We use an iterative self-supervised training approach combined with a contrastive learning step to accommodate the lack of large labeled training datasets.
The idea of using pre-trained models to learn from a few examples has been shown to be successful in natural language processing (Miller et al., 2000; Fei-Fei et al., 2006; Brown et al., 2020) and, specifically, task-oriented dialogue systems (Liu et al., 2021b; Wu et al., 2020). Using few-shot learning, Mi et al. (2022) show they can improve dialogue state tracking, intent recognition, and natural language generation with limited labeled data using task-specific instructions through prompts. Guibon et al. (2021) propose a prototypical network for sequence labeling on conversational data. While their network is trained to support few-shot DAR, they still require significant data for episodic training. The approaches mentioned above are not suitable for label-scarce situations. We show that SSCon outperforms their model's performance on the benchmark dataset with limited labeled data.
Pretrained language models have been used for classifying texts in the K-12 math education context. Shen et al. (2021) applied a BERT-based model to classify knowledge components in descriptive math texts. Loginova and Benoit (2022) employed an LSTM model, trained on a question-answer dataset, to predict math problem difficulty. These models work on descriptive text samples, a distinct use case from classroom dialogues. Okur et al. (2022) developed a speech dialogue system using MathBERT for natural language understanding, trained on significant pre-labeled data, differing from our limited labeled data scenario.
Self-supervision is a viable approach for DAR with limited training data. Mi et al. (2021a) use an iterative teacher-student model approach to improve performance. They use a novel text augmentation technique that adds to each iteration's training data. We show that SSCon improves upon their results on standard datasets. Our model uses contrastive learning (Hadsell et al., 2006) to cluster the different classes. A contemporary work by Tunstall et al. (2022) uses a similar contrastive learning-based approach for few-shot text classification. Liu et al.'s (2021a) trans-encoder combines two learning paradigms, cross- and bi-encoders, in a joint iterative framework to build state-of-the-art sentence similarity models. Like the trans-encoder, our self-supervised contrastive learning-based (SSCon) multi-class classifier also combines two learning paradigms: an embedding-based classifier and a twin network, trained with different training goals, help each other improve. One model uses the whole dialogue context captured in a hidden-layer embedding to train a classifier, while the other is a twin neural network taking the contrastive representation learning approach, clustering identical dialogue acts and separating distinct dialogue acts within the embedding space.

Conclusion
Classroom dialogue analysis can yield significant insight into student learning. However, collecting and coding classroom dialogue datasets is very labor-intensive. To address this problem, we introduce a novel self-supervised contrastive learning approach that can automate a portion of this process and make it more efficient even when only limited labeled data is available. We show that our approach improves on other methods that work on limited labeled datasets. The results also show that our approach can match and exceed the performance of some models trained on the fully labeled dataset.

Limitations
Selecting a representative set of examples to label becomes essential when working with limited labeled data. In this work, we use uniform sampling for our results, which might not be the best approach. We discuss these limitations in more detail in the appendix.
While we evaluate our model on a multi-label dataset (DSTC2) and show improvement over standard baselines, the effectiveness of our approach on such problems needs more investigation.

Ethical Considerations
While our algorithm is primarily a tool for improving classifier performance in label-scarce settings and uses publicly available, anonymized datasets, we acknowledge the potential ethical implications it may carry. Despite the neutral nature of our tool, it could unintentionally propagate or amplify biases favoring certain styles of communication. If used in real-time settings or without proper checks in place, it could inadvertently alter the natural dynamics of the classroom, as teachers or students might modify their behavior based on how they believe the classifier categorizes their utterances. Like any other tool, ours could make mistakes or be misused by over-relying on quantitative aspects over qualitative aspects of instruction. Therefore, real-world application requires continuous vigilance and open dialogue with practitioners and stakeholders to ensure its use benefits teaching and learning.

A Stopping Criteria and Distillation Threshold

Two aspects determine the decision to stop the iteration. Figure 6 shows the model's performance in subsequent iterations for the TalkMoves dataset, starting from a label set of ten examples per dialogue act. A held-out dataset was used to calibrate the model's performance. One can see that the performance improvement is non-monotonic. In most cases, stopping after just one or two iterations is necessary to get maximum performance. The system is trying to improve based on minimal information in the labeled set, so when the label set is small, there is a potential for over-fitting to the information contained in the label set. Another implementation detail is the selection of the threshold confidence level used to distill soft labels as we expand the training set for the next model at the end of an iteration. Figure 7 shows the performance of our model for various prediction probability distillation thresholds. As mentioned earlier in the main paper and observed in Figure 7, 85% is the distillation threshold with the best performance distribution.

B Sample Selection by Active Learning
We experimented with using active learning techniques to pick labeled examples (Ren et al., 2021).
We first trained a model with just ten initial labeled examples per class. We used the model to find a new set of ten samples for each class with the least confidence and added them to the labeled set using human labels. Figure 8 compares our active learning approach against uniform random selection as a baseline. We can see that the random approach has much more variance for every label set size. However, we also note that the best-performing model was a random set for every label size.
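The least-confidence selection step we used can be sketched as follows (`least_confident` and `predict_proba` are illustrative names):

```python
def least_confident(pool, predict_proba, k=10):
    """Pick the k pool items whose top predicted probability is lowest --
    the naive least-confidence active learning strategy discussed here."""
    return sorted(pool, key=lambda x: max(predict_proba(x)))[:k]
```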

Figure 2: Improvement of performance over a baseline XGBoost classifier by our SSCon classifier.

Figure 3: Heatmaps show the median cosine distance between embedding examples (TalkMoves dataset) for each dialogue act type (8-relating to another student; 9-asking for more info; 10-making a claim; 11-providing evidence/reasoning). The left heatmap corresponds to the input embeddings. The right heatmap corresponds to the Contrastive Encoder embeddings.

Figure 4: Confusion matrix with counts normalized by actual label counts (TalkMoves dataset) for the SSCon classifier (70 samples per class training set). Dialogue acts: 8-relating to another student; 9-asking for more info; 10-making a claim; 11-providing evidence/reasoning.

Figure 5: Drift in performance as we increase the number of labeled samples (TalkMoves dataset).

Figure 6: The model's performance changes with each iteration for multiple runs. Each run starts with a different labeled dataset of 10 examples per class (TalkMoves dataset).

Figure 7: The figure shows our model's performance distribution for different soft label distillation thresholds.The runs were for 70 labeled samples per class as the initial training set from the TalkMoves dataset.

Figure 8: Comparison between active learning selection vs uniform random selection of labeled samples.

Table 1: Comparison of results for the DailyDialog, TalkMoves, and DSTC2 datasets against baselines.
This difference in performance might be because our approach for active sample selection was naive. When we add a new sample based on low prediction confidence, there is no guarantee that we will improve the model. The new sample might be predicted with low confidence either because our model has limited information to decide or because the sample itself is an outlier. If it is the former, adding the sample might help; if it is the latter, adding the sample increases the number of outliers in our training set. As our training data is limited, outliers in the input data disrupt overall performance. We plan to experiment with more sophisticated active learning techniques for sample selection in the future.