Disentangling Aspect and Stance via a Siamese Autoencoder for Aspect Clustering of Vaccination Opinions



Introduction
Mining public opinions about vaccines from social media has been hindered by the wide variety of users' attitudes and the new aspects continuously arising in the public debate on vaccination (Hussain et al., 2021). The most recent approaches have adopted holistic frameworks built on morality analysis (Pacheco et al., 2022) or neural models predicting users' stances on different aspects of the online debate (Zhu et al., 2022). So far, these frameworks have frequently been framed via well-known tasks, such as aspect classification or text span detection, that use supervision to train text classifiers. However, such direct use of the supervision signal has constrained the models to predefined aspect classes and restricted their flexibility in generalising to opinions with aspects never seen before (e.g., new moral issues or immunity levels).
To mitigate this limitation, some of the most promising approaches have been devised as supervised models generating clustering-friendly representations (Tao et al., 2021). These have recently shown promising results on open-domain tasks when combined with pre-trained language models (PLMs), thanks to their flexibility, generalisation, and need for minimal tweaks (Reimers and Gurevych, 2019; Sircar et al., 2022). However, despite improved capabilities in capturing overall text semantics, existing models for text clustering (Miranda et al., 2018; Meng et al., 2019; Shen et al., 2021; Zhang et al., 2021a) still struggle to distinguish between users' mixed stances and aspects on vaccination, and as a result they often generate clusters that do not reflect the novel aspects of interest. As an illustrative example, consider the tweets "mRNA vaccines are poison" and "The Pfizer vaccine is safe", which the majority of existing methodologies are prone to cluster into different groups due to the opposite stances they manifest, despite the fact that both target safety issues.
To address the aforementioned problem, we posit that a model should be able to (i) disentangle the stance from the aspect discussed, and simultaneously (ii) use the generated representations in a framework (e.g., clustering) that eases the integration of aspects never seen before. We thus propose a novel representation learning approach, called the Disentangled Opinion Clustering (DOC) model, which performs disentangled learning (Mathieu et al., 2016) via text autoencoders (Bowman et al., 2016; Montero et al., 2021) and generates clustering-friendly representations suitable for the integration of novel aspects. The proposed model, DOC, learns clustering-friendly representations through a denoising autoencoder (Montero et al., 2021) driven by out-of-the-box Sentence-BERT embeddings (Reimers and Gurevych, 2019), and disentangles stance from opinions by using the supervision signal to drive a disentangled cross-attention mechanism and a Swapping Autoencoder (Park et al., 2020).
We conducted an experimental assessment on two publicly available datasets on vaccination opinion mining, the Covid-Moral-Foundation (CMF) (Pacheco et al., 2022) and the Vaccination Attitude Detection (VAD) corpora (Zhu et al., 2022). We first assessed the quality of the disentangled representations in generating aspect-coherent clusters. Then, we measured the generalisation of the proposed approach via a cross-dataset evaluation, performing clustering on a novel dataset with unknown aspect categories. Finally, we showed the benefit of this approach on the traditional stance classification task, along with a thorough ablation study highlighting the impact of each model component on the clustering quality and the degree of disentanglement of the generated representations. Our contributions can be summarized as follows:
• We introduce DOC, a Disentangled Opinion Clustering model that generates clustering-friendly representations, distinguishes between users' stances and opinions in the vaccination debate, and integrates newly arising aspects via a clustering approach.
• Unlike traditional aspect-based classification models, we outline a framework adopting limited supervision, provided by a few stance and aspect labels, functioning as inductive biases to generate clustering-friendly representations.
• We conduct a thorough experimental analysis on the two major publicly available datasets on vaccination opinion mining from social media, and demonstrate the benefit of the disentangling mechanisms on the quality of aspect clusters and the generalization across datasets with different aspect categories.

Related Work
Sentence Bottleneck Representation Sentence representation learning typically aims to generate a fixed-sized latent vector that encodes a sentence into a low-dimensional space. In recent years, in the wake of the wide application of pre-trained language models (PLMs), several approaches have been developed leveraging PLMs to encode sentence semantics. The most prevalent work is SBERT (Reimers and Gurevych, 2019), which fine-tunes BERT (Devlin et al., 2019) with a siamese network structure to derive semantically meaningful sentence embeddings.

Disentangled Latent Representation Earlier works explored disentangled representations to facilitate domain adaptation (Bengio et al., 2013; Kingma et al., 2014; Mathieu et al., 2016). In recent years, John et al. (2019) generated disentangled representations geared to transfer holistic style, such as tone and theme, in text generation. Park et al. (2020) proposed the Swapping Autoencoder to separate texture encodings from structure vectors in image editing. The input images are formed in pairs to induce the model to discern the variation (e.g., structure) while retaining the common property (e.g., texture). However, recent studies show that disentanglement in the latent space is theoretically unachievable without access to some inductive bias (Locatello et al., 2019). It has been suggested that local isometry between variables of interest is sufficient to establish a connection between the observed variable and the latent variable (Locatello et al., 2020a; Horan et al., 2021), even with few annotations (Locatello et al., 2020b). This is in line with (Reimers and Gurevych, 2019; Lu et al., 2022), where contrastive pairs are leveraged for training, which informs our use of labels and the reconstruction of perturbed text to induce the disentanglement.

Text Clustering
The recent development of neural architectures has reshaped clustering practices (Xie et al., 2016). For example, Zhang et al. (2021b) leveraged transformer encoders for clustering over user intents. Several methods utilised PLM embeddings to discover topics, which were subsequently used for clustering news articles and product reviews (Huang et al., 2020; Meng et al., 2022). Others exploited neural components, i.e., the BiLSTM-CNN (Zhang et al., 2019), the CNN-Attention (Goswami et al., 2020) and the Self-Attention (Zhang et al., 2021c), for deep clustering (Tao et al., 2021). Yet, few methods cluster documents along a particular axis or provide disentangled representations to cluster over a subspace.

Vaccination Opinion Mining
The task of vaccination opinion mining is commonly carried out on social media to detect user attitudes and provide insights to be used against the related 'infodemic' (Kunneman et al., 2020;Wang et al., 2021;Chandrasekaran et al., 2022;Zhao et al., 2023).
Recent approaches rely on semantic matching and stance classification with extensions including human-in-the-loop protocols and text span prediction to scale to the growing amount of text (Pacheco et al., 2022;Zhu et al., 2022).
In both corpora, a small number of tweets are labelled, each annotated with a stance label ('pro-vaccine', 'anti-vaccine' or 'neutral') and a text span or an argumentative pattern denoting an aspect. For example, for the tweet 'The Pfizer vaccine is safe.', the stance label is 'pro-vaccine' and the argumentative pattern is 'vaccine safety'. Since vaccination opinions evolve rapidly over time, supervised classifiers or aspect extractors would soon become outdated and fail to handle constantly emerging tweets. In an effort to mitigate this issue, we address the problem of vaccination opinion mining by learning disentangled stance and aspect vectors of tweets in order to cluster tweets along the aspect axis.
Our proposed model, called Disentangled Opinion Clustering (DOC), is shown in Figure 1. It is trained in two steps. In unsupervised learning (Figure 1(a)), a tweet is fed into an autoencoder with DeBERTa as both the encoder and the decoder to learn the latent sentence vector z. Here, each tweet is mapped to two embeddings: the context embedding u_s, which encodes the stance label information, and the aspect embedding u_a, which captures the aspect information. Under unsupervised learning, these two embeddings are not distinguished. Together with the hidden representation of the input text, H, they are mapped to the latent sentence vector z by cross-attention. As the autoencoder can be trained on large-scale unannotated tweets relating to vaccination, it is expected that z captures the vaccine-related topics.
Then, in the second step of supervised learning (Figure 1(b)), the DeBERTa-based autoencoder is fine-tuned to learn the latent stance vector z_s and the latent aspect vector z_a, using the tweet-level annotated stance label and aspect text span (or the argumentative pattern 'vaccine safety' in Figure 1(b)) as the inductive bias. Here, the latent stance vector z_s is derived from u_s; it is expected that z_s can be used to predict the stance label. On the other hand, the latent aspect vector z_a is derived from u_a only, and it can be used to generate the SBERT-encoded aspect text span. Both z_s and z_a, together with the hidden representation of the input text H, are used to reconstruct the original text through the DeBERTa decoder. The training instances are organized in pairs, since we use the idea of the swapping autoencoder (shown in Figure 1(c)) to swap the aspect embedding of one tweet with that of another if both discuss the same aspect; the resulting latent vector can still be used to reconstruct the original tweet. In what follows, we describe the two steps, unsupervised and supervised learning, in detail.
Unsupervised Learning of Sentence Representation Due to the versatility of PLMs, sentence representations are usually derived directly from the contextualised representations generated by the PLMs. However, as previously discussed in Montero et al. (2021), sentence representations derived in this way cannot guarantee reliable reconstruction of the input text. Partly inspired by the use of autoencoders for sentence representation learning as in (Montero et al., 2021), we adopt the autoencoder architecture to initially guide the sentence representation learning by fine-tuning it on vaccination tweets. Rather than RoBERTa (Liu et al., 2019), we adopt DeBERTa, a variant of BERT in which each word is represented using two vectors encoding its content and position. The attention weight of a word pair is computed as a sum of four attention scores calculated from different directions based on their content/position vectors, i.e., content-to-content, content-to-position, position-to-content, and position-to-position. Instead of representing each word by a content vector and a position vector, we modify DeBERTa by representing an input sentence using two vectors: a context embedding u_s encoding its stance label information and an aspect embedding u_a encoding its aspect information.
We will discuss later in this section how to perform disentangled representation learning with u_s and u_a. During the unsupervised learning stage, we do not distinguish between u_s and u_a and simply use u = [u_s, u_a] to denote them.
More specifically, we train the autoencoder on an unannotated Twitter corpus with masked token prediction as the training objective. The encoder applies multi-head attention to compress the hidden representations of the top layer of the pre-trained transformer. If we use H to denote the hidden representations, the multi-head attention can be expressed as

z = softmax(uKᵀ / √d_K) V,   K = HW^K,   V = HW^V,

where K and V are generated from fully-connected layers over the hidden vectors H. The bottleneck representation z is supposed to encode the semantics of the whole sentence.
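As a concrete illustration, below is a minimal single-head sketch of this bottleneck attention in PyTorch. The module name, the single-head simplification, and the use of one linear layer each for keys and values are our assumptions rather than the paper's exact implementation; the dimensions follow Appendix A.2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckAttention(nn.Module):
    """Pools the token states H into a fixed-size sentence vector z
    with a learned query u = [u_s, u_a]."""

    def __init__(self, d_h: int = 768, d_k: int = 768, d_v: int = 768):
        super().__init__()
        self.key = nn.Linear(d_h, d_k)    # fully-connected layer over H
        self.value = nn.Linear(d_h, d_v)
        self.u = nn.Parameter(torch.randn(2, d_k))  # rows: u_s, u_a

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (batch, seq_len, d_h) hidden states of the top layer
        K, V = self.key(H), self.value(H)
        scores = torch.einsum('qd,bld->blq', self.u, K) / K.size(-1) ** 0.5
        attn = F.softmax(scores, dim=1)   # normalise over the tokens
        z = torch.einsum('blq,bld->bqd', attn, V)
        return z.flatten(1)               # bottleneck z: (batch, 2 * d_v)
```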
The transformer decoder comprises n layers of cross-attention such that the output of the previous layer is processed by a gating mechanism (Hochreiter and Schmidhuber, 1997). The recurrence is repeated n times to reconstruct the input, where n denotes the token length of the input text.
Injecting Inductive Biases by Disentangled Attention Recent work on disentanglement learning suggested that unsupervised disentanglement is impossible without inductive bias (Locatello et al., 2020b). In the datasets used in our experiments, there are a small number of labelled tweets. We can use the tweet-level stance labels and the annotated aspect text spans as inductive biases. Here, the disentangled attention of DeBERTa is utilized to mingle different factors. Assuming each sentence is mapped to two vectors, the context vector u_s encoding its stance label information and the aspect vector u_a encoding its aspect information, we can then map u_s to a latent stance vector z_s, which can be used to predict the stance label, and map u_a to a latent aspect vector z_a, which can be used to reconstruct the aspect text span. We use the cross-attention between u_s and u_a to reconstruct the original input sentence.

Stance Classification Let h_CLS denote the hidden representation of the [CLS] token. The stance bias is injected by classification over the stance categories:

z_s = Att(u_s, h_CLS, h_CLS),   ŷ_s = softmax(W_s z_s).

Essentially, we use u_s as query and h_CLS as key and value to derive z_s, which is subsequently fed to a softmax layer to predict a stance label ŷ_s. The objective function is a cross-entropy loss between the true and the predicted labels.
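A minimal sketch of this stance branch follows. Since the attention has a single key/value slot (h_CLS), a one-element softmax would be degenerate, so the sketch substitutes a sigmoid gate; this substitution, the module name, and the layer shapes are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StanceHead(nn.Module):
    """u_s queries h_CLS to form z_s; a linear + softmax layer then
    maps z_s to the three stance labels."""

    def __init__(self, d: int = 768, n_stances: int = 3):
        super().__init__()
        self.key = nn.Linear(d, d)
        self.value = nn.Linear(d, d)
        self.classifier = nn.Linear(d, n_stances)

    def forward(self, u_s, h_cls, y_true=None):
        # Scaled dot-product score between the query u_s and the key.
        score = (u_s * self.key(h_cls)).sum(-1, keepdim=True) / u_s.size(-1) ** 0.5
        z_s = torch.sigmoid(score) * self.value(h_cls)  # latent stance vector
        logits = self.classifier(z_s)                   # logits for y_hat_s
        loss = F.cross_entropy(logits, y_true) if y_true is not None else None
        return z_s, logits, loss
```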
Aspect Text Span Reconstruction We assume u_a encodes the sentence-level aspect information and use self-attention to derive the latent aspect representation z_a. To reconstruct the aspect text span from z_a, we use the embedding generated by SBERT (Reimers and Gurevych, 2019) as the targeted aspect span, since SBERT has been empirically shown to achieve state-of-the-art results on Semantic Textual Similarity tasks. Those clustering-friendly representations, if they encode the argumentative patterns or aspect spans alone, are strong inductive biases along the axis of aspects. Specifically, the sentence embedding of the aspect expression is generated by a Gaussian MLP decoder (Kingma and Welling, 2014):

log p(y_a | z_a) = log N(y_a; μ(z_a), σ²(z_a) I),

where x_a denotes the aspect text span in the original input sentence, y_a is the ground-truth aspect text span embedding produced by y_a = SBERT(x_a), whose value is used for computing the Gaussian negative log-likelihood loss, and μ(·) and σ²(·) are MLPs operating on z_a.

Input Text Reconstruction To reconstruct the original input text, we need to make use of both the latent stance vector z_s and the latent aspect vector z_a. Here we use the cross-attention of these two vectors to derive the content vector z_c:

a = softmax(uK_cᵀ / √d_K),   z_c = Σ_j a_j K_c^j,

where u = [u_s, u_a], a_j is the j-th element of a, and K_c^j represents the j-th row of K_c. The resulting z_c is the content representation for reconstructing the original sentence.
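A sketch of the Gaussian MLP decoder for the aspect branch is shown below. The SBERT checkpoint and the single-layer μ/σ heads are illustrative assumptions (the paper does not specify them); PyTorch's GaussianNLLLoss supplies the negative log-likelihood.

```python
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

sbert = SentenceTransformer('all-MiniLM-L6-v2')  # assumed SBERT checkpoint

class GaussianAspectDecoder(nn.Module):
    """Maps z_a to the mean and variance of a Gaussian over the SBERT
    embedding of the annotated aspect span, and scores the target y_a."""

    def __init__(self, d_z: int = 1024, d_out: int = 384):
        super().__init__()
        self.mu = nn.Linear(d_z, d_out)
        self.log_var = nn.Linear(d_z, d_out)
        self.nll = nn.GaussianNLLLoss()

    def forward(self, z_a: torch.Tensor, aspect_spans: list) -> torch.Tensor:
        # Ground truth y_a = SBERT(x_a) for each annotated span.
        y_a = torch.as_tensor(sbert.encode(aspect_spans))
        mu, var = self.mu(z_a), self.log_var(z_a).exp()  # var kept positive
        return self.nll(mu, y_a, var)                    # Gaussian NLL loss
```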
Disentanglement of Aspect and Stance Although the inductive biases, i.e., the tweet-level stance label and the annotated aspect span, are used to learn the latent stance vectors z_s and the aspect vectors z_a, there could still be dependencies between the two latent variables. To further the disentanglement, we propose to swap the learned aspect embeddings of two tweets discussing the same aspect in siamese networks. We draw inspiration from the Swapping Autoencoder (Park et al., 2020), where a constituent vector of a Generative Adversarial Network (GAN) is swapped with that produced by another image. The original swapping autoencoder was designed for image editing and required a patch discriminator with texture cropping to align the disentangled factors with the desired properties. In our scenario, such alignment is instead induced by tweets discussing the same aspect.
We create pairs of tweets by permutations within the same aspect group, {x^A, x^B}_{A,B∈G_k, A≠B}. Here, by abuse of notation, we use k to denote the k-th aspect group, G_k. The groups are identified by tweets with the same aspect label, regardless of their stances. We sketch the structure of pairwise training in Figure 1(c). The tweets are organized in pairs and a bottleneck representation is obtained for each tweet:

z^A = enc(x^A),   z^B = enc(x^B).

We would like z^A to disentangle into latent factors, i.e., the variation in a factor of z^A is associated with a change in x^A (Locatello et al., 2020a). Unlike the majority of works (Zhang et al., 2021d) that directly split z^A in the latent space, we assume that the entangled vector is decomposed by a causal network. We train a vector u = [u_s, u_a] to trigger the activation of the networks (i.e., the self-attentions in Eq. 3-Eq. 7). The outputs of the networks are independent components that encode the desiderata. If z_s and z_a are parameterized independent components triggered by u_s and u_a respectively, the substitution of u_a^B with u_a^A can be regarded as a soft exchange between z_a^A and z_a^B. We thus substitute u_a^B with u_a^A to cause changes in z_c^B; this substitution will also be reflected by changes in z_a^B. In practice, we train on all permutations within the same aspect group, regardless of the stance. The reconstruction loss for each latent factor (i.e., stance and aspect) is calculated once to balance the number of training examples, unless it is content text generated from the swapped bottleneck representation.
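A sketch of the pair construction, assuming each labelled tweet carries 'text' and 'aspect' fields (field names are hypothetical):

```python
from collections import defaultdict
from itertools import permutations

def aspect_pairs(tweets):
    """All ordered pairs (x_A, x_B), A != B, within each aspect group
    G_k, regardless of stance."""
    groups = defaultdict(list)
    for tweet in tweets:
        groups[tweet['aspect']].append(tweet)
    pairs = []
    for members in groups.values():
        pairs.extend(permutations(members, 2))  # every A != B in G_k
    return pairs

# Two safety tweets with opposite stances still form a valid pair:
demo = [{'text': 'mRNA vaccines are poison', 'aspect': 'vaccine safety'},
        {'text': 'The Pfizer vaccine is safe', 'aspect': 'vaccine safety'}]
assert len(aspect_pairs(demo)) == 2
```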
Formally, the swapping autoencoder presented in Figure 1(c) can be expressed as

ū^B = [u_s^B, u_a^A],   z_c^B = Att(ū^B, K_c^B, K_c^B),

where z_c^B is input to the decoder for the reconstruction of x^B. Note that the above equations are only used in the swapping autoencoder for the computation of z^B; if there is no substitution in the latent space, they are not calculated. Given L_c^B = dec(z_c^B), the final objective function is written as

L = L_rec + λ_s L_s + λ_a L_a + λ_B L_c^B,

where λ_s, λ_a and λ_B are hyper-parameters controlling the importance of each desirable property.
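As a minimal sketch, the combined objective can be assembled as follows; the variable names mirror the terms above, and the weight values are placeholders, not the paper's settings.

```python
lambda_s, lambda_a, lambda_b = 1.0, 1.0, 0.5  # placeholder weights

def total_loss(loss_rec, loss_stance, loss_aspect, loss_swap):
    """Combined objective: reconstruction of the original tweet, the
    cross-entropy stance loss, the Gaussian NLL aspect loss, and the
    reconstruction of tweet B from the swapped bottleneck z_c^B."""
    return (loss_rec + lambda_s * loss_stance
            + lambda_a * loss_aspect + lambda_b * loss_swap)
```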

Experiments
Datasets We conduct our experimental evaluation on two publicly available Twitter datasets about Covid-19 vaccination: the Covid-Moral-Foundation (CMF) (Pacheco et al., 2022) and the Vaccination Attitude Detection (VAD) corpus (Zhu et al., 2022). CMF is a tweet dataset focused on the Covid-19 vaccine debates, where each tweet is assigned an argumentative pattern. VAD consists of 8 aspect categories further refined by vaccine brands.

Baselines We employ 5 baseline approaches: SBERT, AutoBot, DS-Clustering, VADet, and SCCL, of which SBERT and AutoBot are out-of-the-box sentence embedding generators. VADet is specialised to learn disentangled representations; however, it is noteworthy that even though it employs DEC (Xie et al., 2016), the resulting representations are unsuitable for distance-based clustering. SCCL performs joint representation learning and document clustering. DS-Clustering is a pipeline approach that predicts a text span and employs SBERT to generate an aspect embedding. For clustering-friendly representation learning methods, we examine their performance using k-means, k-medoids (Kaufman and Rousseeuw, 1990), and Agglomerative Hierarchical Clustering (AHC).
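For reference, the three clustering algorithms can be run on any fixed-size embeddings along the following lines; the library choices (scikit-learn and scikit-learn-extra) and the default hyper-parameters are our assumptions, not the paper's settings.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn_extra.cluster import KMedoids  # pip install scikit-learn-extra

def cluster_all(embeddings, k):
    """Runs k-means (Euclidean), k-medoids and AHC (both cosine) on an
    (n_samples, dim) embedding matrix and returns the label assignments."""
    return {
        'k-means': KMeans(n_clusters=k, n_init=10).fit_predict(embeddings),
        'k-medoids': KMedoids(n_clusters=k, metric='cosine').fit_predict(embeddings),
        # On scikit-learn < 1.2, use affinity='cosine' instead of metric.
        'AHC': AgglomerativeClustering(n_clusters=k, metric='cosine',
                                       linkage='average').fit_predict(embeddings),
    }
```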
The comparison involves three tasks: tweet clustering based on aspect categories (intra- and cross-dataset), and tweet-level stance classification. For stance classification, we employ RoBERTa and DeBERTa, and use their averaged embeddings for clustering.
Evaluation Metrics First, we use Clustering Accuracy (CA) and Normalized Mutual Information (NMI) to evaluate the quality of clusters, in line with (Shaham et al., 2018; Tao et al., 2021). NMI is defined as NMI = 2 I(y; ŷ) / (H(y) + H(ŷ)), where I(y; ŷ) denotes the mutual information between the ground-truth labels and the predicted labels, and H(·) denotes their entropy. Then we employ BERTScore (Zhang et al., 2020) to evaluate the clustering performance of models in the absence of ground-truth cluster labels. BERTScore is a successor of the cosine-similarity measure of (John et al., 2019) that computes the distance between two sentences from the cross distances between their corresponding word embeddings. We follow Bilal et al. (2021) to compute the averaged BERTScore as

BERTScore_avg = (1/K) Σ_k 1/(|G_k|(|G_k|−1)) Σ_{i,j∈G_k, i≠j} BERTScore(x_i, x_j),

where |G_k| is the size of the k-th group or cluster. We report the average performance for all the models. As a quantitative evaluation metric for disentanglement, we use the Mean Correlation Coefficient (MCC). We refer the readers to Appendix A.3 for qualitative results.
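Both metrics are straightforward to compute; a sketch follows, with NMI from scikit-learn and the averaged BERTScore computed pairwise within each cluster and then averaged over clusters, as in the formula above.

```python
from itertools import combinations
from sklearn.metrics import normalized_mutual_info_score
from bert_score import score as bert_score  # pip install bert-score

def nmi(y_true, y_pred):
    return normalized_mutual_info_score(y_true, y_pred)

def avg_bertscore(texts, labels):
    """Averaged pairwise BERTScore F1 inside each cluster, then the
    mean over clusters, following Bilal et al. (2021)."""
    per_cluster = []
    for k in set(labels):
        members = [t for t, l in zip(texts, labels) if l == k]
        pairs = list(combinations(members, 2))
        if not pairs:
            continue
        cands, refs = map(list, zip(*pairs))
        _, _, f1 = bert_score(cands, refs, lang='en')
        per_cluster.append(f1.mean().item())
    return sum(per_cluster) / len(per_cluster)
```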
Clustering-Friendly Representation We first show the advantages of disentangled representations in clustering. With the representations obtained from SBERT and AutoBot, we employ k-means to perform clustering. Since the similarity between sentences in SBERT is measured by cosine similarity, which is less favorable for the k-means algorithm, we also use k-medoids to ensure a fair comparison. The other baseline approaches are run with their default settings. We assign the aspect labels to the predicted clusters with the optimal permutation of {1, . . . , K} that yields the highest accuracy score, where K denotes the total number of clusters. For the CMF dataset we set K = 7, and on VAD K = 8. In comparisons against representation learning methods, DOC takes the lead as long as it is paired with competent clustering algorithms. This shows the benefit of clustering with disentangled representations, since the clustering algorithm no longer conflates the stance polarities and the aspect categories. DOC achieves higher scores on the VAD dataset compared to CMF, with more prominent improvements over the baselines, which may be credited to the larger size of the dataset. When DOC is evaluated with different clustering algorithms, k-medoids excels on CMF, while AHC outperforms the others on VAD, suggesting that cosine similarity is more appropriate for distance calculation than the Euclidean distance on which k-means relies.
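The optimal label alignment can be implemented with the Hungarian algorithm (also used for the cross-dataset evaluation below); a sketch:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Clustering Accuracy (CA): finds the cluster-to-label permutation
    that maximises accuracy, then scores the remapped predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = int(max(y_true.max(), y_pred.max())) + 1
    counts = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        counts[p, t] += 1                     # co-occurrence matrix
    rows, cols = linear_sum_assignment(counts, maximize=True)
    mapping = dict(zip(rows, cols))
    return float(np.mean([mapping[p] == t for t, p in zip(y_true, y_pred)]))
```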

Cross-Dataset Evaluation
In this context, the most interesting property of clustering-friendly representations is their ability to support clustering on novel datasets whose categories are unknown in advance. To assess this, we use the models trained on CMF to perform clustering on VAD, and repeat the process vice versa. We specify the number of clusters as 7 and 8, respectively. The alignment between the clustered groups and the gold labels is solved by the Hungarian algorithm. Note that direct aspect classification across datasets would not be possible, since an accurate mapping between the two sets of classes cannot be established. Table 3 reports the performance of cross-dataset clustering. Our metrics of interest are still CA, NMI and averaged BERTScore. All the methods show an overall performance drop on VAD, while the performance on CMF turns out to be slightly higher. DOC-k-medoids achieved competitive results across the datasets, demonstrating that clustering-friendly representations disentangle the opinions and, as a result, can integrate unknown aspects.

Stance Classification
We report in Table 4 the results of DOC, RoBERTa and DeBERTa. For DOC, we only report DOC-AHC, since stance labels are by-products of the clustering-friendly representations.
We see that the DOC performance on CMF is close to that of DeBERTa, and that the improvement on VAD is marginal. This may be attributed to the absence of the swapping operation on z_s; the stance latent vector may therefore contain other semantics or noise. Nevertheless, DOC is still preferred over DeBERTa considering its significant performance gain on aspect clustering.

Evaluation of Disentangled Representations
As in the nonlinear ICA community (Khemakhem et al., 2020), we use the Mean Correlation Coefficient (MCC) to quantify the extent to which DOC managed to learn disentangled representations. Here, the Point-Biserial Correlation Coefficient between dist(z_a, z̄_a^k) (i.e., the distance between the aspect vector and the centroid of cluster k) and Y (i.e., the dichotomous variable indicating whether or not the instance belongs to group k in the ground truth) is chosen to measure the isometry between z_a and k. Notice that we specify dist as the Euclidean distance here. However, isometry does not hinge on the Euclidean distance, and it could easily be substituted with cosine similarity, in which case the mean is no longer the best estimate of the cluster center and would be replaced by the medoid of cluster k; the clustering method would be k-medoids accordingly.
For each cluster k ∈ {1, 2, . . . , K}, we calculate the correlation coefficient between dist(z_a, z̄_a^k) and Y. We then obtain the MCC by averaging the correlation coefficients. A high MCC indicates that the group identity of a data point is closely associated with the geometric position of its z_a in the latent space, which means that z_a captures the group information. The results are shown in Figure 2. We observe consistent improvement over the sentence representation models. DS-Clustering is able to encode tweets into aspect embeddings; nevertheless, its distance between aspect latent vectors is a weaker indicator for group partition compared with that of DOC, suggesting that the z_a discovered by DOC better captures the differences between aspects.
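A sketch of the MCC computation with SciPy's point-biserial correlation follows; taking the absolute value of each coefficient is our assumption (smaller distance should correlate with membership, so the raw coefficient is negative).

```python
import numpy as np
from scipy.stats import pointbiserialr

def mean_correlation_coefficient(z_a, labels, centroids):
    """For each cluster k, correlate dist(z_a, centroid_k) with the
    binary ground-truth membership, then average over clusters."""
    coefficients = []
    for k, centroid in enumerate(centroids):
        distances = np.linalg.norm(z_a - centroid, axis=1)  # Euclidean dist
        membership = (labels == k).astype(int)              # dichotomous Y
        r, _ = pointbiserialr(membership, distances)
        coefficients.append(abs(r))
    return float(np.mean(coefficients))
```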

Conclusion
In this work, we introduced DOC, a Disentangled Opinion Clustering model for vaccination opinion mining from social media. DOC is able to disentangle users' stances from opinions via a disentangled attention mechanism and a swapping autoencoder. It was designed to process unseen aspect categories thanks to the clustering approach, leveraging clustering-friendly representations induced by out-of-the-box Sentence-BERT encodings and the disentangling mechanisms. A thorough experimental assessment demonstrated the benefit of the disentangling mechanism on the quality of aspect-based clusters and the generalization capability across datasets with different aspect categories, outperforming existing approaches in terms of generalisation and coherence of the generated clusters.

Limitations
There are a few limitations we would like to address. First of all, the number of clusters needs manual configuration. This is a limitation of the clustering algorithms (Xie et al., 2016), since we need to set a threshold for convergence, which consequently pinpoints k. An expedient alternative is to analyse the dataset for realistic settings or probe for the optimal k, which is, however, beyond the scope of this paper. Another limitation is the prerequisite of millions of unannotated tweets: the autoencoder needs enormous amounts of data to learn bottleneck representations, and its performance would be hindered without access to abundant corpora. Lastly, the performance of the acquired clustering-friendly representations depends on the similarity metric chosen; efforts need to be made to find the best option, whether it is the Euclidean distance, cosine similarity, etc.

A.1 Dataset Details
In this section, we provide a detailed analysis of the dataset instances.
In the Covid-Moral-Foundation (CMF) dataset, each tweet is associated with a pre-defined and manually annotated argumentative pattern. The annotated tweets are categorized by moral foundations, which can be regarded as coarse aspects distilled from the argumentative patterns. Each moral foundation is associated with two polarities (e.g., care/harm) and is treated as the group label of a cluster of tweets. The polarity is given by the vaccination stance label. In the example in Table A1, 'The vaccine is safe' is the argumentative pattern, while 'Care/Harm' is the categorical label denoting the aspect group. An exhaustive list of the argumentative patterns can be found in the original paper of Pacheco et al. (2022).
In the Vaccination Attitude Detection (VAD) corpus, a training instance comprises a stance label, a categorical aspect label and an aspect text span. For example, Table A1 shows that the tweet 'Study reports Oxford/AstraZeneca vaccine is protective against Brazilian P1 strain of COVID19.' is annotated with the text span 'Oxford/AstraZeneca vaccine is protective against Brazilian P1 strain of COVID19', and its aspect belongs to the aspect category 'Immunity Level'.

A.2 Training Details
We experiment with a pre-trained DeBERTa base model. The hidden size is d_H = 768. We set both d_V and d_K to 768, and d_z = 1024. The learning rate is initialised with η = 3e-5 and the number of epochs is 10. We use linear warmup to enforce a triangular learning rate schedule.
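A sketch of the optimiser setup under these hyper-parameters; the warmup ratio and steps-per-epoch are placeholders, as the paper does not report them.

```python
import torch
from transformers import DebertaModel, get_linear_schedule_with_warmup

model = DebertaModel.from_pretrained('microsoft/deberta-base')  # d_H = 768
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

steps_per_epoch = 1000                    # placeholder value
total_steps = 10 * steps_per_epoch        # 10 epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # assumed warmup ratio
    num_training_steps=total_steps,
)
```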
We train the model with two Titan RTX graphics cards on a workstation with an Intel(R) Xeon(R) W-2245 CPU. The training process takes less than 9 hours, with the inference time under 30 minutes.

A.3 Additional Results
Clustering with Different Latent Vectors We experiment with clustering using the disentangled aspect vectors z_a or the content vectors z (i.e., without the disentanglement of aspects and stances) on both the CMF and VAD datasets, and report the detailed results in Table A2. It can be observed that using the disentangled aspect vectors for clustering gives better results compared to using the content vectors, regardless of the clustering approach used. On CMF, the best results are obtained using k-medoids, while on VAD, similar results are obtained using either k-medoids or AHC.

Qualitative Results
We illustrate in Figure A1 and Figure A2 the clustering results and the latent space of the entangled/disentangled representations projected by the t-SNE method. Figure A1(a-b) display the cluster assignments after permutation, whereas Figure A2(a-b) show the ground-truth labels. The class labels are rendered by colours whose detailed mapping is provided in Figure A2. From Figure A1, we see clear improvements in clustering quality on both datasets when the model is compared against the DeBERTa averaged embedding. Figure A2 shows more separated groups thanks to the disentangled representation, providing strong distance-based discrimination for the clustering algorithms. As a result, simple clustering methods like k-means can achieve competitive results against deep clustering methods (i.e., SCCL and VADet), which have access to weak labels or data augmentations.

Color Mappings in Visualisation
We illustrate in Figure A2 the color mapping from the t-SNE plots to the true aspect category labels. The vectors are more separated and their grouping aligns more closely with the ground-truth labels when they are clustered in the space of z_a, indicating that such latent vectors provide strong distance-based discrimination among groups in the Euclidean space, which is used as the distance metric in the t-SNE algorithm. We also experiment with the cosine-similarity metric for k-medoids; the results are reported in the Experiments section.

Figure 1: Disentangled Opinion Clustering (DOC) Model. (a) Unsupervised learning: a tweet is fed into an autoencoder with DeBERTa as both the encoder and decoder to learn the latent sentence vector z. (b) Supervised learning: the DeBERTa-based autoencoder is fine-tuned to learn the latent stance vector z_s and the latent aspect vector z_a, using the tweet-level annotated stance label and aspect text span (or the argumentative pattern 'vaccine safety' for the input tweet) as the inductive bias. (c) Swapping autoencoder: to enable a better disentanglement of z_s and z_a, for two tweets discussing the same aspect but with different stance labels, tweet B's aspect embedding u_a^B is replaced by tweet A's aspect embedding u_a^A. As the two tweets discuss the same aspect, their aspect embeddings are expected to be similar; as such, we can still reconstruct tweet B using the latent content vector z_c^B derived from the swapped aspect embedding. Note that (b) and (c) are learned simultaneously.

Figure 2: Boxplots of MCC for all representation learning models over the 5 runs. The representations are used for k-means clustering in the Euclidean space. A high MCC score indicates a strong correlation between dist(z_a, z̄_a^k) and the group membership z_a ∈ G_k.
Figure A1: 2-D plots of the data points projected by t-SNE.
Figure A2: t-SNE plots on CMF and VAD. Each dot is a tweet encoded using either the disentangled aspect vector z_a (left subfigure) or the latent content vector z (right subfigure). Different colors indicate the true aspect category labels.

Table 1: Dataset statistics of CMF and VAD. We list the number of pro-vaccine, anti-vaccine and neutral tweets in each group.
Similar to the argumentative pattern in the CMF dataset, each tweet is characterised by a text span indicating its aspect. The dataset statistics are reported in Table 1, with examples shown in Appendix A.1. The train/test split follows 4:1. For the unsupervised pre-training of sentence bottleneck representations, we combine the unlabelled Covid-19 datasets from both the CMF and VAD repositories. The final dataset consists of 4.37 million tweets.

Table 2: Clustering results. Representation learning models are listed with the affiliated clustering methods.

Table 2 lists the performance of baseline methods on all the tasks and datasets. We see consistent improvements across all the evaluation metrics using our proposed DOC. When compared with end-to-end methods (i.e., VADet and SCCL), whose intermediate representations cannot be used to calculate a distance, the disparity depends on the clustering approach employed by DOC. On CMF, VADet outperforms SCCL, but DOC gives superior performance overall regardless of the clustering approach used, showing the flexibility of the DOC representations.

Table 3: Cross-dataset evaluation results. Each representation learning model is listed with the most performant clustering method.

Table 5: Ablation study on removal of components and choices of context vectors.

We study the effects of taking away components of different functionality in the disentanglement, and experiment with different choices of context vectors, i.e., u_s and u_a. The results are shown in Table 5. We see a significant performance drop without loading the pre-trained weights for the language model. The removal of the inductive biases and of the swapping autoencoder both hamper the clustering quality of the model across the metrics. The performance gap is more obvious without the inductive bias, which we attribute to the weaker supervision induced by swapping the latent codes. Ablating the choices of context vectors shows the superiority of the MLP strategy. In contrast, the performance of the context vector generated by mean pooling is rather poor, showing that a context vector produced by mean pooling can hardly trigger the disentanglement of the hidden semantics.

Table A1: Training examples of CMF and VAD. In CMF, argumentative patterns are pre-defined phrases indicating an aspect. In VAD, aspect spans are text subsequences of the annotated tweets.

Table A2: Clustering accuracy and average BERTScore with different latent vectors.