A Few Topical Tweets are Enough for Effective User Stance Detection

User stance detection entails ascertaining the position of a user towards a target, such as an entity, topic, or claim. Recent work that employs unsupervised classification has shown that performing stance detection on vocal Twitter users, who have many tweets on a target, can be highly accurate (above 98%). However, such methods perform poorly or fail completely for less vocal users, who may have authored only a few tweets about a target. In this paper, we tackle stance detection for such users using two approaches. In the first approach, we improve user-level stance detection by representing tweets using contextualized embeddings, which capture latent meanings of words in context. We show that this approach outperforms two strong baselines and achieves 89.6% accuracy and 91.3% macro F-measure on eight controversial topics. In the second approach, we expand the tweets of a given user using their Twitter timeline tweets, which may not be topically relevant, and then we perform unsupervised classification of the user, which entails clustering the user with other users in the training set. This approach achieves 95.6% accuracy and 93.1% macro F-measure.


Introduction
Stance detection entails identifying the position of a user towards a topic, an entity, or a claim (Mohammad et al., 2016b). Effective stance detection, particularly in the realm of social media, can be instrumental in gauging public opinion, identifying intersecting and diverging groups, and understanding issues of interest to different user communities (Magdy et al., 2016a). Much recent work has explored varying stance detection methods, including supervised, semi-supervised, and unsupervised user classification (Magdy et al., 2016a; Pennacchiotti and Popescu, 2011; Wong et al., 2013), and much of this work has focused on stance detection for Twitter users. The different approaches have advantages and disadvantages. For example, supervised methods are simple to implement, but they require manually annotated training data, and their accuracy varies widely based on classification features, classification techniques, and the number of training and test examples (Magdy et al., 2016a). Though semi-supervised and unsupervised methods typically use user interactions and often yield near-perfect classification, they are only effective in classifying highly vocal users with many topical tweets. Most of these methods produce sub-optimal results for users who rarely express their opinion, and for whom we may only have one or two topically related tweets. Though a single tweet might be explicitly clear, it often lacks sufficient context to determine the stance of the user. Figure 1 shows two tweets that pertain to the 2018 US midterm elections, where the first expresses a lucid pro-Republican stance and the second could have been authored by a supporter of either the Republican or Democratic party. In this paper, we aim to effectively identify the stance of Twitter users towards specific targets (entities or topics), where the users have mentioned the targets in only a few tweets (fewer than two tweets on average).
To do so, we employ two approaches. In the first approach, we classify users based on their tweets, which are represented using contextualized embeddings that capture latent meanings of words in context. Specifically, we use BERT embeddings to represent tweets, and we fine-tune the embeddings for every topic. We compare this approach to two strong baselines, namely Support Vector Machine (SVM) classification and fastText, a deep-learning-based classifier. In the second approach, we expand the tweets of a given user using their Twitter timeline tweets, and then we use the additional tweets, which would typically not be relevant to the topic at hand, to perform unsupervised classification of the user by clustering them with the users in the training set. Such expansion allows us to exploit user homophily, which is manifested in the echo chambers that form on Twitter, where users with similar views tend to retweet similar accounts beyond the topic at hand. To test our approaches, we use a dataset containing tweets on 8 polarized US-centric topics. We also examine the effect of expansion when using SVM, fastText, and contextualized embeddings. For testing, we randomly selected 100 users per topic that have fewer than 5 topical tweets, and we manually labeled them for stance. To construct the training set, we used unsupervised stance detection to automatically label the 5,000 most active users per topic, and for every topic we used a balanced set of 500 users per stance as our training set. Since the approaches rely on different features and utilize different classification techniques, we indicate which approach works best under different conditions. The contributions of this paper are as follows: • We fine-tune contextualized embeddings to generate latent representations of tweets to effectively classify the stance of users based on only one or two tweets.
We achieve an accuracy of 89.6% and macro F-measure of 91.3%, which are significantly higher than the scores achieved using two strong baselines.
• We show that using additional timeline tweets for the users that we wish to classify, and then using unsupervised classification, where we cluster the test user with users in our training, leads to an accuracy of 95.6% and macro F1-measure of 92.0%. In doing so, we extend prior work on unsupervised stance detection to effectively classify both users who are vocal on a topic as well as those with perhaps one or two topical tweets.
• We show that expanding user tweets using their timeline tweets can significantly improve some supervised classification setups.
• We conduct error analysis on our best setups to determine the sources of the errors and to guide the choice of classification methods.
• We plan to release the tweet IDs of the test set along with the associated gold labels. Further, we plan to release the code that performs classification based on contextualized embeddings.

Related Work
Over the last few years, much research has focused on user stance detection. The goal of stance detection is to ascertain the positions of users towards some target such as a topic, person, or claim (Thomas et al., 2006; Mohammad et al., 2016a; Barberá, 2015; Barberá and Rivero, 2014; Borge-Holthoefer et al., 2015; Cohen and Ruths, 2013; Colleoni et al., 2014; Conover et al., 2011; Fowler et al., 2011; Himelboim et al., 2013; Magdy et al., 2016a,b; Makazhanov et al., 2014; Weber et al., 2013). While stance may easily be detected by humans, machine learning models often fall short, particularly for users who talk about a target sparingly. Several studies have focused on modeling stance by introducing different features, ranging from linguistic and structural features (Mohammad et al., 2016a) to network interactions and profile information (Borge-Holthoefer et al., 2015; Magdy et al., 2016a,b; Weber et al., 2013). Much work on stance detection has involved using supervised and semi-supervised classification methods. One major downside of both classification methods is the need for a seed list of manually labeled users, which is time consuming to compile and requires topic expertise. Supervised learning is sensitive to the classification features, the size of the training sets, the number of available tweets for users in the test set, and the classification algorithm (Borge-Holthoefer et al., 2015). Some common classification features include: lexical, syntactic, and semantic features; network features such as retweeted accounts and user mentions; content features such as words and hashtags; and user profile information such as name and location (Aldayel and Magdy, 2019; Magdy et al., 2016a,b; Pennacchiotti and Popescu, 2011). Some commonly used classification algorithms include SVMs and deep learning classifiers (Zarrella and Marsh, 2016). Popat et al. 
(2019) presented a neural network model for stance classification that augments BERT representations with a novel consistency constraint to determine stance with respect to both a claim and a perspective. We extend their work in two ways, namely: we drop the need to have a claim and perspective, and we couple BERT supervised classification with unsupervised classification to effectively tag vocal and non-vocal users. Semi-supervised methods such as label propagation (Barberá, 2015; Borge-Holthoefer et al., 2015; Weber et al., 2013) often rely on two users retweeting identical accounts or tweets to propagate a label from one user to another. Though such propagation typically achieves high precision (often above 95%) (Darwish et al., 2018), it is generally successful only in tagging vocal users with strong opinions. Recently, a highly effective unsupervised method has been introduced for predicting the stance of prolific Twitter users towards controversial topics. Projecting users onto a low-dimensional space and then clustering them allows for clear separation between vocal users with respect to their stance. This method confers two main advantages over previous methods, namely: it does not require any initial manual labeling, and classification accuracy is nearly perfect. However, it is successful in labeling vocal users only and fails on users with very few topical tweets. We extend prior work on unsupervised stance detection to effectively classify both prolific and non-prolific users in a holistic way by combining supervised and unsupervised methods. Further, we extend prior deep-learning-based supervised classification to use contextual embeddings that capture syntactic and semantic features of words in context. We frame the problem as user-based classification, which is common in the computational social science community, as opposed to tweet-based classification, which is common in the NLP community (Mohammad et al., 2016b). 
This is motivated by two aspects, namely: 1) tweets often don't provide sufficient context for proper annotation; and 2) users have durable stances over time.
For example, if someone says "Most important election in history! Vote!", it is nearly impossible to know the author's position without context.

Data Sets
Topics Our dataset includes tweets on eight polarizing topics that are US-centric, which were graciously provided to us by Stefanov et al. (2019). Table 1 lists all the topics, including when the tweets were collected and the number of tweets per topic. The topics include both long-standing issues such as gun control and transient issues such as the nomination of Judge Kavanaugh to the US Supreme Court. There is also a non-political issue, namely vaccination. The tweets were also filtered based on user-stated locations to limit the data to US users. The filtering was done using a gazetteer that includes US (and its variants) and state names (and their abbreviations).  Training Set Given the tweets for every topic, we performed per topic unsupervised stance detection. This approach identifies the most active n users per topic and computes similarity between them based on a common feature, such as which hashtags they use or which accounts they retweet. Next, the users are projected onto a lower dimensional space in a manner where similar users are brought closer together and dissimilar users are pushed further apart. Then the projected users are clustered. Using the best reported setup of Stefanov et al. (2019), we used the 5,000 most active users with at least 10 tweets, computed similarity between them based on which accounts they retweeted, projected users using UMAP (McInnes and Healy, 2018), and clustered them using the mean shift clustering algorithm (Fukunaga and Hostetler, 1975). Stefanov et al. (2019) estimated the accuracy of the unsupervised approach on the 8 topics to be 98%. Next, we took 500 random users from each of the two largest clusters to construct a balanced training set, and we manually inspected a few users from each cluster to give an overall label to each cluster (e.g., pro- or anti-gun control). Further, we crawled the timeline tweets of the users in our training set.   Test Set For each topic, we randomly selected 200 users who have fewer than 5 topical tweets. 
The average number of tweets per user ranged across topics between 1.25 and 1.77 tweets. An annotator who is well versed in US politics manually examined the per topic tweets of users to determine their stances. If the tweets of a user were not sufficient to ascertain their stance, the annotator manually searched and examined their tweets on Twitter in an effort to find further clues. If no conclusive evidence of stance was found, the annotator skipped the user. The annotator labeled up to 100 users per topic. We asked another annotator to annotate 20 tweets per topic to ascertain inter-annotator agreement. Table 2 lists the percentage of skipped tweets, inter-annotator agreement, and percentage of pro- and anti-tweets. Next, we scraped the timelines of all the labeled users. Due to the time difference between collecting topical tweets and when we initiated the scraping of users' timelines, some user accounts were deleted, suspended, or made protected. Table 3 lists the number of labeled users and the subset of them for whom we were able to scrape their timelines. Thus, we put users for whom we were not able to collect timeline tweets into Set A, and we put the remaining users in Set B. We report results for both sets separately.
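The similarity computation that feeds the unsupervised pipeline above can be illustrated with a small sketch. This is not the authors' released code: it assumes a hypothetical input where retweets have been preprocessed into "RT <account>" tokens, and it shows only the cosine similarity between users' retweet-count vectors; the subsequent UMAP projection and mean shift clustering are omitted.

```python
from collections import Counter
from math import sqrt

def retweet_vector(tweets):
    """Count how often each account is retweeted by a user.
    Assumes (hypothetically) that retweets start with 'RT <account>'."""
    counts = Counter()
    for tweet in tweets:
        tokens = tweet.split()
        if len(tokens) >= 2 and tokens[0] == "RT":
            counts[tokens[1]] += 1
    return counts

def cosine_similarity(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    shared = set(u) & set(v)
    dot = sum(u[k] * v[k] for k in shared)
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

A pairwise similarity matrix built this way over the 5,000 most active users would then be projected and clustered as described above.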

Classification Models
Supervised Classification As baselines, we used two different classification methods, namely SVMs (Joachims, 2002) and fastText. We employed two feature types, namely: the accounts that users retweeted; and the words in tweets, including retweeted accounts, hashtags, and user mentions and replies. Prior work has shown that using retweeted accounts as features yields better results compared to using the content of tweets (Darwish et al., 2018). When using words in tweets, we tokenized tweets using NLTK (Bird et al., 2009), removed all URLs and emoticons, retained all hashtags and user mentions, and specifically delineated retweeted accounts by adding 'RT ' before them. We chose to distinguish between retweeted accounts and user mentions because retweeting commonly signifies agreement, while user mentions (including replies) may indicate opposition. We concatenated the aforementioned features from all the tweets of a user, and we constructed a feature vector, where the value of each unique feature was set to its frequency across all tweets of the user. For the deep learning based classifier, we used fastText, which is an efficient text classifier that has been shown to be effective for different text classification tasks (Joulin et al., 2016). Since fastText was designed for sentence-level classification, we opted to perform tweet-level classification. During training, we assigned the label of a user to all of their tweets. During testing, we averaged per class confidence scores across all tweets of a user, and we assigned the label with the highest average confidence to the user. As for features, we used all the words in tweets, and we preprocessed tweets in the manner described earlier for SVM. We opted not to use retweeted accounts as the only features, since the number of retweeted accounts varied arbitrarily per user and fastText is not well suited for long input text.
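The preprocessing and feature-vector construction described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: whitespace splitting stands in for the NLTK tokenizer, emoticon removal is omitted, and the "RT <account>" input format is assumed.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")

def preprocess(tweet):
    """Remove URLs, keep hashtags and mentions, and prefix retweeted
    accounts with 'RT_' so they are distinguished from plain mentions
    (a simplified stand-in for the NLTK-based pipeline)."""
    tweet = URL_RE.sub("", tweet)
    tokens = tweet.split()
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] == "RT" and i + 1 < len(tokens):
            out.append("RT_" + tokens[i + 1])  # delineate retweeted account
            i += 2
        else:
            out.append(tokens[i].lower())
            i += 1
    return out

def user_feature_vector(tweets):
    """Concatenate features from all of a user's tweets; each unique
    feature is weighted by its frequency across the tweets."""
    features = Counter()
    for tweet in tweets:
        features.update(preprocess(tweet))
    return features
```

The resulting frequency vector corresponds to the word-based feature set; restricting it to the RT_-prefixed features would give the retweeted-accounts-only variant.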
Contextualized Embeddings Over the last several years, pre-trained embeddings (Mikolov et al., 2013;Pennington et al., 2014) have helped achieve significant improvements in a wide range of classification tasks in natural language processing. Representing words as vectors in a low-dimensional continuous space and then using them for downstream tasks lowered the need for extensive manual feature engineering. However, these pre-trained vectors are static and fail to handle polysemous words, where different instances of a word share the same representation regardless of context. More recently, different deep neural language models have been introduced to create contextualized word representations that can cope with the issue of polysemy and the context-dependent nature of words. Models such as OpenAI GPT (Radford et al., 2018), ELMo (Peters et al., 2018), BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019), and ULMFiT (Howard and Ruder, 2018), to name a few, have achieved groundbreaking results in many NLP classification and language understanding tasks. For this paper, we use BERT base-multilingual (referred to hereafter simply as BERT), which we fine-tune for stance detection, as this eliminates the need for heavily engineered task-specific architectures. BERT is pre-trained on Wikipedia text from 104 languages and comes with hundreds of millions of parameters. It contains an encoder with 12 Transformer blocks, a hidden size of 768, and 12 self-attention heads. As shown in Fig. 2, we fine-tuned BERT by adding a fully-connected dense layer followed by a softmax output layer, minimizing the binary cross-entropy loss function on the training data. For all experiments, we used the HuggingFace transformer implementation with PyTorch, as it provides pre-trained weights and vocabulary. As for features, we used all the words in tweets, preprocessed in the manner described earlier for SVM and fastText. 
Similar to fastText, we performed tweet-level classification, and we used the average softmax output scores per class across all tweets of a user to assign a label to a test user.
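The tweet-to-user aggregation used for both fastText and BERT can be illustrated as below; a minimal sketch assuming each tweet has already been scored with per-class probabilities (the scoring model itself is omitted, and the label names are illustrative).

```python
def label_user(tweet_scores, labels=("pro", "anti")):
    """Average per-class confidence scores across all of a user's
    tweets and return the label with the highest average."""
    if not tweet_scores:
        raise ValueError("user has no scored tweets")
    avgs = {
        label: sum(s[label] for s in tweet_scores) / len(tweet_scores)
        for label in labels
    }
    return max(avgs, key=avgs.get)
```

For example, a user with one confident pro tweet and one mildly anti tweet would still be labeled pro, since averaging favors the stronger signal.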
Unsupervised Classification For unsupervised classification, we used the same unsupervised classification method described earlier, which we used to prepare the training set. Specifically, we constructed a feature vector for each test user based on the accounts he/she retweeted, computed its similarity to all users in the training set, projected all the users in the training along with the test user into a lower dimensional space using UMAP, and lastly clustered the users using mean shift. We then labeled the test user using the majority label of the cluster in which the user appeared.
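The final step of the unsupervised method, labeling a test user with the majority label of the cluster it lands in, can be sketched as follows. The UMAP projection and mean shift clustering are assumed to have already produced cluster assignments; the function and variable names here are illustrative, not from the authors' code.

```python
from collections import Counter

def label_from_cluster(test_user, cluster_of, train_labels):
    """Label a test user with the majority gold label of the training
    users sharing the user's cluster. `cluster_of` maps every user
    (train and test) to a cluster id, e.g., as produced by mean shift;
    `train_labels` maps training users to their stance labels."""
    cluster = cluster_of[test_user]
    votes = Counter(
        train_labels[u]
        for u, c in cluster_of.items()
        if c == cluster and u in train_labels
    )
    if not votes:
        return None  # user was not clustered with any training users
    return votes.most_common(1)[0][0]
```

Returning None mirrors the failure mode reported in the experiments, where test users with too few retweets could not be assigned to any cluster.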

Experiments
We split users in our test set on the basis of whether we were able to crawl their timelines or not. Set A includes users for whom we were not able to obtain timeline tweets. Set B includes users for whom we were able to collect timeline tweets. We separated them because Set B allows us to compare setups that use timeline tweets with those that do not on identical users.
For Set A, we always trained on the training users with their on-topic tweets and then tested on the test users, who typically had fewer than 2 tweets on average. We used four different classification setups, namely fastText, SVM with retweeted accounts as features (SVM-RT), SVM with all words as features (SVM-TEXT), and fine-tuned BERT embeddings with a dense neural layer and softmax output (BERT). We experimented with using the unsupervised method on Set A, but the unsupervised algorithm was not able to assign any test user to a cluster, mostly because the number of tweets, and subsequently the number of retweeted accounts, per test user was too small. For Set B, we experimented with the same classifiers under four different conditions, namely: not expanding either the training or the test set with users' timeline tweets; expanding the test set only; expanding the training set only; and expanding both the training and test sets.

Results and Discussion
For all experiments, we report per topic accuracy (A) and macro precision (P), recall (R), and F-measure (F) across the stances on a topic. Table 4 reports the results on Set A, where we were not able to expand the test set using timeline tweets. As the results show, BERT yielded the best results in terms of A, P, R, and F for most topics, with the highest overall averages across all scores. fastText trailed BERT, and SVM-TEXT performed much worse. SVM-RT performed better than SVM-TEXT. This suggests that BERT, which uses contextual embeddings, is effective in performing accurate stance detection, even when classifying users with a very small number of topical tweets. As for the unsupervised method, it was not able to assign any test user to a cluster, mostly because the number of tweets per test user was too small; hence, we omitted it from Table 4. Table 5 shows the results on Set B, where we expanded the test set, the training set, or both with timeline tweets. The results suggest the following: • For BERT and fastText, which rely on the content of the tweets, we achieved the best results with no expansion or when we only expanded the training set. The inclusion of non-topical tweets in the test set led to worse results overall. We suspect that this happened because of the mismatch between the training and test sets.
• For SVM-RT and unsupervised classification, which rely exclusively on whom users retweeted, the expansion of the test set dramatically improved overall A, P, R, and F. The improvement for both after timeline expansion suggests that the accounts a user retweets are a strong signal of stance across multiple topics, and that stances on multiple topics are likely correlated. For example, a user who supported the Kavanaugh nomination was likely to vote Republican in the midterm elections. For future work, we plan to examine cross-topic classification.
• Similar to the results observed for Set A (Table 4), when no expansion is used, BERT led to the best overall results. However, using unsupervised classification led to the best overall results across all setups, with expanding the test set only yielding slightly better results than expanding both the training and test sets. Expanding the test set only is also significantly more efficient than expanding both the training and test sets.
• Unsupervised classification failed to assign any test users to clusters for any topic when the test set was not expanded, mostly because the number of tweets per test user, and consequently the number of retweeted accounts, was too small.
• SVM-TEXT yielded the worst results overall, despite the inclusion of all the features in the tweets, such as retweeted accounts, hashtags, and words. It seems that the inclusion of more features (compared to SVM-RT) confused the classifier, leading to lower results.
• SVM-TEXT and SVM-RT led to the lowest results when we only expanded the training set. In both setups, the classifier assigned all users to one stance or the other. Hence, R for one class was 100.0 and 0.0 for the other (with macro R = 50.0). We suspect that expanding the feature space in the training set confused the SVM classifier. Both setups are unusable.
We computed the standard deviation (SD) of all our measures across topics for every setup. A lower SD coupled with high A and F is desirable, as it indicates that a setup produces consistently high results across topics. Unsupervised classification yielded the lowest SD values and the highest overall scores. BERT and fastText with no expansion, and SVM-RT with an expanded test set, had slightly higher SD. Thus, if we are able to scrape a user's timeline tweets, it is advantageous to use a method that relies on which accounts the user retweets, with unsupervised classification producing the best results. As we will show in the error analysis, the success of unsupervised classification is contingent on users retweeting a sufficient number of times, particularly, in our case, retweeting politically related accounts. When timeline tweets are not available, it is best to use contextualized embeddings.

Error Analysis
We analyzed all the errors in Set B that were produced by BERT with no expansion, as it represents the best results when expansion is not possible, and those produced by unsupervised classification with the expansion of the test set only, as this produced the best overall results. Since we used BERT to perform tweet-level classification, we manually inspected all 129 misclassified tweets across all topics. Generally, we found four types of errors, namely: unexplainable errors, where the tweets clearly expressed stance but the classifier mislabeled them; vague tweets that have no clear clues; tweets in which the user uses the language of the opposing side; and sarcastic tweets. Table 6 lists the error types with their frequencies and provides example tweets. When we used LIME (Ribeiro et al., 2016) to analyze the output of BERT, we noticed two important phenomena. First, BERT was able to identify stance based on retweeted accounts, and not just the text of the tweet. Second, BERT was able to learn correlations between topics. For example, for the tweet "RT BernieSanders: I believe health care must be a right", BERT based its decision on BernieSanders, the Democratic presidential hopeful in 2016 and 2020, and on "health care", as positions on health care and climate change are often aligned. For unsupervised classification, we manually examined all 15 users that were misclassified, of whom 7 were from the Kavanaugh topic. 
Prominent reasons for incorrect classification were: a lack of sufficient retweets for a user, where retweets constituted only 1-1.4% of all tweets for three of the misclassified users (2 for climate change and 1 for Kavanaugh); geographic mislabeling, where 2 accounts were not US accounts (1 for Ilhan and 1 for Kavanaugh); users retweeting mostly apolitical accounts, such as music, art, or car related accounts (1 for climate change, 1 for vaccine, and 3 for Kavanaugh), as retweets of politically biased accounts and media sources seem to provide strong signals for classification; and a user going against the general opinion of their group, as with the clearly Republican user who was criticizing the National Rifle Association (NRA) (gun control).
Thus, the most common reason for misclassification was the dearth of retweets from politically oriented or topically related accounts.

Conclusion
In this paper, we presented two methods for classifying users according to their stance towards a target. The first utilizes contextualized embeddings to represent tweets and then uses a deep neural network for classification. This approach led to results that outperform two strong baselines. The second utilizes additional tweets from users' timelines to cluster test users with other users with known labels in an unsupervised manner. The first method yielded the best results when timeline tweets were not available, while the second yielded even better results overall. Given the overall setup described in the paper, where the training data was obtained using unsupervised user classification, we can automatically label the most active users with nearly perfect accuracy, and we can label users with only a few topical tweets with high accuracy, often above 95% when we can obtain their timeline tweets. For future work, we plan to explore the effectiveness of cross-topic classification, where training and testing are done on different topics. Perhaps we can build unified models that could be used across multiple topics for a given population of Twitter users (e.g., users who are interested in US politics).