Stanceosaurus: Classifying Stance Towards Multicultural Misinformation

We present Stanceosaurus, a new corpus of 28,033 tweets in English, Hindi, and Arabic annotated with stance towards 251 misinformation claims. As far as we are aware, it is the largest corpus annotated with stance towards misinformation claims. The claims in Stanceosaurus originate from 15 fact-checking sources that cover diverse geographical regions and cultures. Unlike existing stance datasets, we introduce a more fine-grained 5-class labeling strategy with additional subcategories to distinguish implicit stance. Pre-trained transformer-based stance classifiers that are fine-tuned on our corpus show good generalization on unseen claims and regional claims from countries outside the training data. Cross-lingual experiments demonstrate Stanceosaurus' capability of training multilingual models, achieving 53.1 F1 on Hindi and 50.4 F1 on Arabic without any target-language fine-tuning. Finally, we show how a domain adaptation method can improve performance on Stanceosaurus using additional RumourEval-2019 data. We will make Stanceosaurus publicly available to the research community upon publication and hope it will encourage further work on misinformation identification across languages and cultures.


Introduction
The prevalence of misinformation on online social media has become an increasingly severe societal problem. A key language technology, which has the potential to help content moderators identify rapidly-spreading misinformation, is the automatic identification of both affective and epistemic stance (Jaffe, 2009; Zuczkowski et al., 2017) towards false claims. Progress on the problem of stance identification has largely been driven by the availability of annotated corpora, such as RumourEval (Derczynski et al., 2017; Gorrell et al., 2019). However, existing corpora mostly focus on misinformation spreading within western countries.

1 Our code and data are available at https://tinyurl.com/stanceosaurus
In this paper, we present Stanceosaurus, a diverse and high-quality corpus that builds on the best design choices made in previous misinformation corpora, including RumourEval-2019 and CovidLies. Stanceosaurus covers more diverse topics, geographic regions, and cultures than prior work. It includes 28,033 tweets in English, Hindi, and Arabic that are manually annotated for stance (see Figure 1) towards 251 misinformation claims, collected from 15 independent fact-checking websites that cover India, Singapore, Australia, New Zealand, Canada, the United States, Europe, and the Arab World (the Levantine, Gulf, and Northwest African regions, and Egypt). To the best of our knowledge, Stanceosaurus is the largest and most diverse annotated stance dataset to date.
Through extensive experiments, we demonstrate that Stanceosaurus can support the fine-grained classification of explicit and implicit stances, as well as zero-shot cross-lingual stance identification. In addition, we introduce and experiment with class-balanced focal loss (Cui et al., 2019) to alleviate the class imbalance issue, which is a well-known challenge in automatic stance detection (Zubiaga et al., 2016; Baly et al., 2018). Similar to other corpora that are labeled with stance towards messages or claims, Stanceosaurus reflects the natural distribution of stance observed in the wild, with comparatively few examples labeled as Supporting or Refuting (see label distributions in Table 4). We show that fine-tuning BERTweet-large with class-balanced focal loss (Cui et al., 2019) can achieve 66.8 F1 for 3-way stance classification and 61.0 F1 for the finer-grained 5-way stance classification on English. With zero-shot transfer learning, we achieve 53.1 and 50.4 F1 for Hindi and Arabic, respectively, in the 5-way classification. Lastly, we show it is possible to train a single model to achieve better performance on Stanceosaurus' test set via additional fine-tuning on RumourEval (Gorrell et al., 2019), using a variation of EasyAdapt (Daumé III, 2007; Bai et al., 2021) designed for pre-trained Transformers, even though these two corpora have significant differences.

Table 1: Summary of Twitter stance classification datasets. Stanceosaurus covers more claims from a broader range of topics and geographical regions than prior Twitter stance datasets.

Dataset | Target | Number/Range of Topics
SemEval-2016 (Mohammad et al., 2016) | Subject | 6 political topics (e.g., atheism, feminist movement)
SRQ (Villa-Cox et al., 2020) | Subject | 4 political topics & events (e.g., general terms, student marches)
Catalonia (Zotova et al., 2020) | Subject | 1 topic (i.e., Catalonia independence)
COVID (Glandt et al., 2021) | Subject | 4 topics related to COVID-19 (e.g., stay-at-home orders)
Multi-target (Sobhani et al., 2017) | Entity | 3 pairs of candidates in 2016 US election
WTWT (Conforti et al., 2020) | Event | 5 merger and acquisition events
RumourEval (Gorrell et al., 2019) | Tweet | 8 news events + rumors about natural disasters
Rumor-has-it (Qazvinian et al., 2011) | Claim | 5 rumors (e.g., Sarah Palin getting divorced?)
CovidLies (Hossain et al., 2020) | Claim | 86 pieces of COVID-19 misinformation
Stanceosaurus (this work) | Claim | 251 claims over a diverse set of global and regional topics

Related Work
Stance Classification Datasets. Given the importance of studying misinformation spreading on Twitter and the open access to its data, there are many stance classification datasets consisting of annotated tweets. However, existing datasets are largely restricted to a limited range and number of topics; see Table 1 for a summary. Note that many of these datasets consider stance toward an entity or topic (e.g., Bitcoin), whereas we focus on more specific full-sentence claims (e.g., Bitcoin is legal in Malaysia), which provides flexibility to cover more diverse topics in our work.
The closest prior efforts to ours are RumourEval-2019 (Gorrell et al., 2019) and CovidLies (Hossain et al., 2020). RumourEval-2019 contains annotations on whether a reply tweet in a conversation thread is supporting, denying, querying, or commenting on the rumour mentioned in the source tweet. However, RumourEval covers only eight major news events (e.g., the Charlie Hebdo shooting) plus additional rumors about natural disasters. The CovidLies dataset (Hossain et al., 2020) annotates 3-way stance (Agree, Disagree, Neutral) towards 86 pieces of COVID-19-specific misinformation, using BERTScore (Zhang et al., 2020) to find potentially relevant tweets. As the authors of CovidLies have noted, relying on BERTScore (i.e., a semantic similarity measurement) biases the data collection towards more supporting and fewer refuting tweets.
Besides Twitter, stance classification has also been studied for other types of data. For example, the Perspectrum dataset (Chen et al., 2019) was constructed using debate forum data. Emergent (Ferreira and Vlachos, 2016) and AraStance (Alhindi et al., 2021) consist of English and Arabic news articles annotated with stance, respectively.

Automatic Stance Classification. Many prior efforts have developed methods for automatic stance classification, which have progressed from feature-based approaches (Qazvinian et al., 2011) to models based on pre-trained Transformers.

The Stanceosaurus Corpus
Our corpus consists of social media posts manually annotated for stance toward claims from multiple fact-checking websites across the world. We carefully designed the data collection and annotation scheme to ensure high quality and broad coverage, improving upon prior work.

Collecting Fact-checked Claims
To ensure multicultural representation, we obtain fact-checked claims from both Western and non-Western sources (Table 2). We choose nine well-known fact-checking websites in English, three in Hindi, and three in Arabic. We randomly select claims from each source posted between 5/17/2012 and 2/28/2022 that have sparked discussion on Twitter. In total, we have 251 claims in our corpus, of which 144 are considered regional based on manual inspection (see column Country & Regions in Table 2). For example, the claims "Finland is promoting a 4 day work week" and "Burning Ghee will produce Oxygen" are both considered regional, one explicitly and one implicitly; whereas the claim "Bees use acoustic levitation to fly" is considered international. The claims in Stanceosaurus range from news, health, and science to politics (e.g., "Sonu Sood promises to support Hamas/Palestine"), conspiracy theories, history, and urban myths (e.g., "The pyramids of Giza were built by slaves"). We present all 251 claims in Appendix D.

Retrieving Conversations around Claims
For better coverage of diverse topics, we invested substantial effort in creating customized queries with varied keywords and time ranges for each claim to retrieve tweets.We also trace the entire reply chain in both directions, so Stanceosaurus includes relevant tweets that may not contain the keywords.
Curated Search Queries. We retrieve tweets by keyword search, which we believe is the most effective approach given the constraints of Twitter's APIs. To ensure the coverage and quality of our dataset, we manually curated and iteratively refined search queries for each claim, utilizing advanced search operators to restrict the relevant time period and language. We expand search queries with synonyms (e.g., "jab" for "vaccine") and lexical variations whenever possible; the latter is particularly helpful for including different Arabic dialects. See Appendix A for example queries. We collect tweets from different time periods for different claims (e.g., a two-week range for timely events and a maximum range from 7/3/2008 to 5/9/2022 for historic myths).
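To make this concrete, the following minimal sketch shows how such a query might be assembled from keywords, synonyms, and a time range using Twitter's advanced search operators (since:, until:, lang:). The build_query helper, the claim keywords, and the dates are illustrative assumptions, not our released tooling.

```python
# Hypothetical sketch of per-claim query construction; keywords, synonyms,
# and dates below are illustrative, not taken from the released queries.
def build_query(keywords, synonyms, since, until, lang="en"):
    """Join keyword variants with OR, then restrict time range and language."""
    variants = keywords + synonyms
    terms = " OR ".join(f'"{t}"' if " " in t else t for t in variants)
    return f"({terms}) since:{since} until:{until} lang:{lang}"

query = build_query(
    keywords=["vaccine magnet", "magnetic arm"],
    synonyms=["jab magnet"],        # "jab" as a synonym for "vaccine"
    since="2021-05-10",
    until="2021-05-24",             # e.g., a two-week window for a timely event
)
# -> ("vaccine magnet" OR "magnetic arm" OR "jab magnet")
#    since:2021-05-10 until:2021-05-24 lang:en
```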
Context from URLs and Reply Chains. Individual tweets retrieved by search do not capture the contextual aspects of stance, which can be very important as misinformation often spreads in multi-turn conversations on social media. Therefore, we also collect the parent tweets (i.e., the tweets that a search-retrieved tweet is replying to) and the entire reply chains, if available. Additional details are presented in Appendix C.

Annotating Stance Towards Claims
We employ a fine-grained annotation scheme that supports 5-way and 3-way stance classification.
5-way Stance Categories. We define stance detection as a five-way classification task, including irrelevant tweets in addition to the four stance classes used in prior work (Schiller et al., 2021; Gorrell et al., 2018), as follows:
• Irrelevant - unrelated to the claim;
• Supporting - explicitly affirms the claim is true or provides verifying evidence;
• Refuting - explicitly asserts the claim is false or presents evidence to disprove the claim;
• Discussing - provides neutral information on the context or veracity of the claim;
• Querying - questions the veracity of the claim.
See Figure 1 and Appendix B.1 for examples of different stances, shown with the reply chain details.

Subcategories and 3-way Stance Classification.
Although some tweets may be neutral towards a claim, they can still show an indirect bias. For example, the tweet "Fauci: No Concern About Number of People Testing Positive After COVID-19 Vaccine." in response to the claim "The COVID-19 Vaccine has magnets or will make your body magnetic" discusses the vaccine rollout, while it can be viewed as implicitly supporting the claim regarding the lack of vaccine safety. We thus further annotate the Discussing tweets for their leanings with three subcategories: Discussing-support (44.6%), Discussing-refute (25.7%), and Discussing-other (29.7%). This not only enables fine-grained classification but also makes our Stanceosaurus corpus flexible enough to support the 3-way (Supporting, Refuting, Other) setup used in other prior work.
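As an illustration, collapsing the fine-grained labels into the 3-way setup can be expressed as a simple lookup. The label strings below are our own shorthand (the released data may use different identifiers), and the treatment of Querying and Irrelevant as Other reflects our reading of the setup described above.

```python
# Hypothetical label identifiers; a sketch of the 5-way -> 3-way collapse.
FIVE_WAY_TO_THREE_WAY = {
    "supporting":         "supporting",
    "refuting":           "refuting",
    "discussing_support": "supporting",  # indirect leaning toward the claim
    "discussing_refute":  "refuting",    # indirect leaning against the claim
    "discussing_other":   "other",
    "querying":           "other",
    "irrelevant":         "other",
}
```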
Data Annotation. We hired four native speakers for English, two for Hindi, and two for Arabic to annotate the tweets with stance. English annotators are all from the U.S., and non-English annotators grew up in the respective countries or regions of the claims being collected. All of the annotators have a college-level education. We designed detailed guidelines (see Appendix B.2) and held training sessions to assist our annotators. For each claim, the annotators are reasonably familiar with the topic, as they are asked to read and learn about the subject matter before annotating. Cohen's Kappa (κ) between the annotators is summarized in Table 3. Disagreements often occur over challenging cases. For example, "Evergreen ship stuck in the Suez Canal -interesting call sign" is supporting the conspiracy theory "Hillary Clinton is trafficking children aboard the Evergreen Ship", the connection being that the call sign of the ship is "H3RC", which coincidentally overlaps with Hillary Clinton's initials. The disagreements were resolved by a third adjudicator for Hindi, and through discussions between the annotators for English and Arabic. Interestingly, the Hindi subset of Stanceosaurus exhibits some form of code-switching in 28.2% of instances, including some replies written in English, while 6.3% of the Arabic data exhibits code-switching. A subset of 200 tweets randomly sampled from the Arabic data was further labeled for language variation; it contains 62.5% Modern Standard Arabic (MSA), 35.5% dialects, 0.5% Arabizi, and 1.5% tweets consisting only of emojis or mentions.

Comparison to RumourEval
Although our annotation design is comparable to RumourEval (Gorrell et al., 2019), in that both annotate the stance of Twitter threads towards rumorous claims, there are a few important differences: (1) RumourEval limits its rumorous claims primarily to 8 major news events plus additional natural disaster events, whereas we use a much larger and more diverse sample of claims originating from multicultural news outlets. (2) RumourEval, unlike our dataset, does not explicitly provide the claims; rather, the first tweet of the thread is used to represent both the claim and the stance in RumourEval. (3) We label discussing subcategories that capture indirect bias towards a claim (see §3.3). (4) RumourEval excludes irrelevant tweets, limiting its generalizability. For a direct comparison, we present the corpus statistics of Stanceosaurus and RumourEval-2019 in Table 4, and further test classification models on both datasets in §5.3.

Automatic Stance Detection
We design multiple automatic stance identification experiments to test the generalization capabilities of models trained on Stanceosaurus. First, we establish the baseline performance of predicting stance towards unseen claims using fine-tuned Transformer models in §5.1 and experiment with the class-balanced focal loss for addressing the imbalanced class distribution. We present zero-shot cross-lingual experiments in §5.2, where multilingual models are trained on English tweets and evaluated on the Hindi and Arabic tweets. Furthermore, we demonstrate that a simple domain adaptation method can help improve performance on Stanceosaurus using additional RumourEval data in §5.3. Finally, we show that models trained only on the international claims subset can extrapolate well to regional claims from individual countries in §5.4.

Baseline Models
We experiment with fine-tuning methods using BERT (Devlin et al., 2019) and BERTweet (Nguyen et al., 2020). Stance identification is modeled as sentence-pair classification, using special tokens to format the input as "[CLS] claim [SEP] text", where "text" is a tweet concatenated with its context (parent tweet and any extracted HTML titles; see §3.2). We found that incorporating context generally helps stance classification for reply tweets (see the ablation study in Appendix B.3). We use standard cross-entropy loss in all baselines.
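A minimal sketch of this sentence-pair setup, assuming the HuggingFace transformers library and the vinai/bertweet-large checkpoint (the tokenizer inserts the model's equivalents of [CLS] and [SEP] when given a text pair). The example claim and tweets are illustrative, and the simple concatenation of parent tweet and reply is one possible way to add context, not necessarily the exact scheme in our released code.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-large")
model = AutoModelForSequenceClassification.from_pretrained(
    "vinai/bertweet-large", num_labels=5  # 5-way stance classification
)

claim = "The COVID-19 Vaccine will make your body magnetic"
parent = "a friends family member got the covid vaccine and now she can put a magnet up to the injection site"
reply = "No that is not true"
text = parent + " " + reply  # context prepended to the tweet (one simple scheme)

# Encode as a sentence pair: claim vs. tweet(+context).
inputs = tokenizer(claim, text, truncation=True, max_length=256, return_tensors="pt")
logits = model(**inputs).logits  # unnormalized scores over the 5 stance classes
```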

Class-Balanced Focal Loss (CB-foc)
The imbalanced class problem has been identified as a major challenge in automatic stance classification (Li and Scarton, 2020), since fewer messages exhibit Supporting or Refuting stances in the wild (see Table 4). To alleviate this issue, prior work has used weighted cross-entropy loss (Fajcik et al., 2019). We experiment with weighted cross-entropy loss and, as an alternative, the class-balanced focal loss (Cui et al., 2019; Baheti et al., 2021), which has recently shown promising results in computer vision research.
We use ŝ " pz 0 , z 1 , z 2 , z 3 , z 4 q to represent the unnormalized scores assigned by the model for five stance classes C " {Irrelevant, Discussing, Supporting, Refuting, Querying}.The class-balanced focal loss is then defined as: . y is the gold stance label, n y is the number of instances with the label y, and p m " sigmoidpz 1 m q, where: Focal loss employs the expression p1 ´pm q γ to reduce the relative loss for well classified examples (Lin et al., 2017).The reweighting term lowers the impact of class imbalance on the loss.In our experiments, hyperparameters β and γ are tuned between [0.1, 1) and [0.1, 1.1], respectively, based on the performance on the dev set.

Implementation Details
We replace usernames and URLs with special tokens and truncate or pad the input to a sequence length of 256, as in BERT and BERTweet. All models were trained for 10 epochs and optimized with the Adam optimizer. Learning rates were selected among {1e-5, 3e-5, 5e-5, 7e-5, 9e-5}. The training batch size was set to 8. For all test set evaluations, we select the best checkpoint, i.e., the one that achieves the highest Macro F1 on the development set.
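The normalization step might look like the following sketch; the special-token strings @USER and HTTPURL follow BERTweet's conventions, though the exact tokens used in our pipeline are an implementation detail.

```python
import re

def normalize(text: str) -> str:
    """Replace user mentions and URLs with special tokens before tokenization."""
    text = re.sub(r"@\w+", "@USER", text)            # usernames -> special token
    text = re.sub(r"https?://\S+", "HTTPURL", text)  # URLs -> special token
    return text

# normalize("@alice see https://t.co/xyz") -> "@USER see HTTPURL"
```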

Experiments and Results
We report average results over five random seeds, primarily in terms of Macro F1, which has been the standard metric in stance classification since the arguably more important stances (i.e., Refuting and Supporting) constitute only a small portion of the data.

Stance Detection for Unseen Claims
For this experiment, we split the English data based on claims into train, dev, and test sets (see the left side of Table 4). We evaluate all models on the 5-way stance classification of tweets towards claims that are unseen during training. As shown in Table 5, the best model is BERTweet-large, which achieves 60.2 F1 when trained with standard and weighted cross-entropy loss and 61.0 F1 with class-balanced focal loss. We see some alleviation of the data imbalance issue in the per-label analysis in Table 6, which shows improved F1 using class-balanced focal loss for the two least frequent labels, Refuting and Querying.
As mentioned in §3.3, Stanceosaurus can also support 3-way stance classification by merging Discussing-support and Discussing-refute tweets with Supporting and Refuting, respectively. We present the results from BERTweet-large for this experiment in Table 6. Interestingly, the label F1 for Refuting decreases in the 3-way classification compared to the 5-way setup, suggesting that identifying the indirect leaning of Discussing-refute tweets makes the task harder. Meanwhile, the higher F1 scores for the Supporting and Other labels indicate that our classifier is good at detecting tweets that propagate misinformation, even when some of them do not assert a stance explicitly.

Zero-Shot Cross-Lingual Transfer
Truly multicultural stance identification requires models that are capable of operating across languages. To demonstrate the feasibility of identifying stance towards misinformation claims in a zero-shot cross-lingual setting, where no training data in the target language is available, we fine-tune models on Stanceosaurus' English training set and use all the annotated Hindi/Arabic data as the test set. We experiment with both multilingual BERT (Devlin et al., 2019) and XLM-RoBERTa (Conneau et al., 2020). Because we assume no training data is available for the target language, all hyperparameters are tuned on the English dev set. Full results of our 5-class cross-lingual experiments are presented in Table 7. When trained with class-balanced focal loss, XLM-RoBERTa-large achieves 53.1 Macro F1 for Hindi and 50.4 for Arabic, notably outperforming models trained with cross-entropy loss.

Combining Stanceosaurus + RumourEval
Because Stanceosaurus follows a labeling scheme similar to existing stance corpora, such as RumourEval (Gorrell et al., 2018), a natural question arises: is it possible to achieve better performance by combining the two datasets?
We first confirm that fine-tuning BERTweet-large with class-balanced focal loss is also the best performing model on RumourEval-2019's original 4-class evaluation setup, outperforming the weighted cross-entropy loss used in BUT-FIT (Fajcik et al., 2019); see Table 8. To combine the two datasets, we align their label schemes, mapping classes without a direct counterpart into the Other category. When merging the datasets, we upsample the RumourEval dataset to twice its size to counteract the size imbalance between the two datasets. Table 9 shows that models trained on in-domain data achieve higher performance than the naive merging of the two datasets for training.

To close this performance gap, we adopt the EasyAdapt (Daumé III, 2007; Bai et al., 2021) method to fine-tune BERTweet-large on the combination of RumourEval and Stanceosaurus. EasyAdapt creates three identical copies of the contextualized representation of the input, which are concatenated and fed into a linear layer before softmax classification. The parameters in the linear layer that correspond to the first and third copies are updated when training on Stanceosaurus, while the others are zeroed out; the parameters that correspond to the second and third copies are updated when training on RumourEval. This enables the model to encode representations that are specific to each dataset, along with domain-independent parameters that transfer between the two datasets. BERTweet-large with EasyAdapt achieves 67.4 Macro F1 for Stanceosaurus and 65.8 Macro F1 for RumourEval, outperforming the in-domain model performance for Stanceosaurus and matching the in-domain model performance of RumourEval.
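A minimal sketch of this EasyAdapt-style head, assuming a PyTorch encoder output; the class and argument names are ours. Note that zeroing an input copy makes the gradient of the corresponding weight block zero, so those parameters are only updated on their own domain, which realizes the "zeroed out" behavior described above.

```python
import torch
import torch.nn as nn

class EasyAdaptHead(nn.Module):
    """Sketch of the EasyAdapt-style classifier described above
    (Daumé III, 2007; Bai et al., 2021); names are illustrative."""

    def __init__(self, hidden_size: int, num_classes: int):
        super().__init__()
        # Linear layer over three concatenated copies of the encoder output:
        # copy 1 = Stanceosaurus-specific, copy 2 = RumourEval-specific,
        # copy 3 = shared / domain-independent.
        self.classifier = nn.Linear(3 * hidden_size, num_classes)

    def forward(self, h: torch.Tensor, domain: str) -> torch.Tensor:
        # h: [batch, hidden_size] contextualized representation of the input.
        zeros = torch.zeros_like(h)
        if domain == "stanceosaurus":    # updates copies 1 and 3
            x = torch.cat([h, zeros, h], dim=-1)
        else:                            # "rumoureval": updates copies 2 and 3
            x = torch.cat([zeros, h, h], dim=-1)
        return self.classifier(x)        # unnormalized class scores
```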

Stance Detection for Unseen Countries
The English dataset comprises 97 international and 93 regional claims. We test BERTweet's ability to generalize to regional claims by training only on international claims. Specifically, we create a new train/test/dev split, with 10740/5701/4896 data points spread over 97/43/42 claims. Table 10 shows the results stratified by source. Performance on the regional data varies widely between sources. Poynter and AFP Fact Check New Zealand, the two sources with the most international data, have the best F1 scores, at 63.0 and 63.5, respectively.

Limitations
We currently use manually curated search queries for collecting tweets related to the misinformation claims in Stanceosaurus. While we tried our best to include relevant keywords and their synonyms in the search queries, this still requires careful manual effort and may not be exhaustive in finding all relevant tweets related to a claim. Furthermore, it is non-trivial to extend such queries to new claims and languages. Future work could look at automatically generating these queries using few-shot in-context demonstrations with large language models (Brown et al., 2020).
We collected the Stanceosaurus dataset with all the human resources available to us for three languages. We leave annotation of more languages for future work. We will also release our detailed data annotation guidelines and invite other researchers to extend our work to set a standard benchmark for stance classification.
There are also potential biases in the claims that reflect the biases of content moderators at the fact-checking sources. We made our best effort to identify a list of fact-checking sources, based on Wikipedia and pre-existing datasets used in the NLP community, to collect claims from different countries and languages. We randomly sample claims from these sources and, since we are constructing a Twitter-based dataset, we are only able to include claims that have been discussed on Twitter. If a claim is unpopular on Twitter, we cannot collect sufficient data for annotation. Following Twitter's Developer Agreement and Policy, we release our dataset freely for academic research and include the full set of claims in the Appendix of the paper so that readers can examine the potential biases in our dataset more conveniently.
Although the class-balanced focal loss improves stance classification in data-imbalanced settings, our models are still far from perfect. We do not use user-specific, temporal, or network features as additional context, which have been shown to improve prediction performance (Aldayel and Magdy, 2019; Lukasik et al., 2016).

Broader Impact and Ethical Considerations
We will release our dataset under the Twitter Developer Agreement, which grants permission for academic researchers to share Tweet IDs and User IDs for non-commercial research purposes, as of October 1, 2022.
Our datasets and models are developed for research purposes and may contain unknown biases towards certain demographic groups or individuals (Sap et al., 2019). Further investigation into systematic biases should be conducted before deployment in a production environment.
Social media companies currently struggle with content moderation in non-Western countries. We hope Stanceosaurus will help stimulate more public research that can shed light on how to inhibit the spread of dangerous misinformation across languages and cultures.

A Customized Queries for Retrieving Tweets

We present example claims and their search queries for each of the three languages in Table 13.

B Stance Classification with Context

B.1 Annotation Examples of Tweets in Reply Chains
Table 14 shows representative examples of different stances towards the claim "The COVID-19 Vaccine will make your body magnetic". Note that some tweets are context-dependent (e.g., "No that is not true"); their stance can only be determined with appropriate context.

B.2 Guidelines for Tricky Annotation
We identified some common scenarios in our annotation that led to disagreements in our preliminary analysis of the data. We designed specific guidelines to improve annotation consistency, including:
• If the claim contains a lot of information, focus on the core contentious part of the claim when judging the stance of the tweets.
• If the tweet gives an analysis of the contentious event or talks about an adjacent (regional) event, it should be considered Discussing.
• If the tweet is just emojis, praise, or a pleasant message (e.g., "thank you", "good job sir") towards a context tweet, consider it Discussing with the leaning inherited from the stance of the context tweet.
• For Querying, the tweet should be questioning the veracity of the claim, not asking some other question about the incident.
• If the main purpose of the tweet is to gauge people's opinions related to the claim, it is Discussing.
• If the tweet poses a question with #fakenews or #factcheck but the linked URL asserts that the claim is fake, it should be judged Refuting. However, if the URL is also a question without a judgment, it should be considered Discussing.
• If a reply tweet adds information or an opinion on top of the context (assuming that the context tweet is true), annotate it as Discussing with the leaning inherited from the context.

B.3 Importance of Considering Context
Stance realized in social media messages often depends on the context of a conversation or links to external webpages, as discussed in §3.2. In this section, we evaluate the impact of context in the form of parent tweets and URL titles. To ablate context, we first organize the tweets in the training data into reply chains. Next, we separate threads into root tweets, which have no parent in the conversation thread, and reply tweets, which are written in response to another message. We fine-tune BERTweet-large on (1) only root tweets, (2) only reply tweets, and (3) both root and reply tweets. We also measure the impact of training with and without context. We use standard cross-entropy loss for this comparison study, excluding the impact of hyperparameter choices in the focal loss, as the stance distribution differs between root and reply tweets.
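A small sketch of the root/reply split, assuming each annotated tweet records the ID of its parent (the field name is hypothetical):

```python
def split_roots_and_replies(tweets):
    """Separate tweets with no parent (roots) from replies, per the ablation above."""
    roots, replies = [], []
    for t in tweets:
        # A tweet with no parent in its conversation thread is a root tweet.
        (roots if t.get("parent_id") is None else replies).append(t)
    return roots, replies
```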
The results in Table 11 demonstrate that root tweets, reply tweets, and context are complementary for achieving the best overall performance. The F1 score on root tweets is significantly higher than on reply tweets, indicating the difficulty of determining stance in extended conversations. Unsurprisingly, training only on root tweets achieves a higher 61.7 F1 on root tweets but a lower 35.4 F1 on reply tweets. For models trained only on reply tweets, including context improves performance on reply tweets but hurts performance on root tweets.

B.4 Unseen Fact-checking Sources
Since the claims in Stanceosaurus are collected from multicultural sources, we also test the stance classifier's performance towards claims found in fact-checking sources that are unseen in the training data. Specifically, we convert each fact-checking website in Stanceosaurus into an unseen source by creating a new data split and removing its tweets from the train and dev sets. Then, a model trained on this restricted data is evaluated on the test tweets from the selected unseen source. For comparison, we also report the performance of the best model from the unseen claims experiment (§5.1, where claims from each source are split into train/dev/test) on these test tweets from the unseen source. For every unseen source, we train a BERTweet-large stance classifier with class-balanced focal loss and report its results in Table 12. The models perform worse when the source is removed from the training data, with Politifact showing the biggest drop in performance, from 60.0 F1 to 50.6 F1. This highlights the importance of source-specific data in classifying misinformation claims.

C Additional Details on Conversation Threads
For each claim from English and Hindi sources, we randomly sample up to 150 tweets for annotation: at most 50 tweets (average 50 for English and 48.1 for Hindi) retrieved from our queries, at most 50 parent tweets (average 30.7 for English and 8.6 for Hindi), and at most 50 child tweets (average 28.3 for English and 33.0 for Hindi) from reply chains. For Arabic, we annotated all the tweets (average 175.8 per claim) retrieved from the search and reply chains. Finally, we organize every tweet such that its immediate parent serves as the context.
For tweets containing URLs, we additionally include the HTML title tag extracted from the URL. About 40.5% of all tweets in our dataset have a parent tweet as context, while 19.5% of tweets have associated HTML titles.

D Stanceosaurus Claims
We provide the full set of English, Hindi, and Arabic claims, with English translations for Hindi and Arabic, in Figure 2.

Table 13: Example English and Hindi claims with corresponding search queries. Queries are manually constructed to cast a broad net, retrieving both relevant and irrelevant messages containing the keywords.
Claim: The COVID-19 Vaccine has magnets or will make your body magnetic
‹ Irrelevant: @dbongino is right. you can't tell people to wear a mask if the vaccines work. its like trying to put a north end of a magnet and trying to connect it to a north end another magnet., it will never work. #foxandfriends
‹ Supporting: a friends family member got the covid vaccine and now she can put a magnet up to the injection site and the magnet stays on her arm.
‹ Refuting (only in context): @Newsweek Why the hell would they even bother with a high quantity of metal in the injection? And the amount that would be required to hold a magnet in place would be ridiculous.
ë Refuting (only in context): @pentatonicScowl @Newsweek I imagine the people making the claims don't fully understand how magnets work
ë Supporting (only in context): @AuracleDMG @pentatonicScowl @Newsweek Laugh now, cry later..
‹ Refuting: @cis_kale Your point being? Even if these RNA vaccines contained ferric nanoparticles, they would not be in high enough concentrations to be able to hold a magnet in place. I suspect that blood itself has a higher concentration of ferric particles than the vaccine described in this paper
‹ Querying: There is a #covid19 vaccine magnet test circulating on Tiktok, Is it really a thing?!!
ë Supporting (only in context): @Thepurplelilac well, 4 friends out of 9 can stick magnets to their arms so yeah, it's a thing
‹ Discussing: @heggzigu @htmdnl too early to make any presumptions on either side. the truth has a way of exposing itself given enough time. bring a magnet to your vaccination appointment, see how the vaccine reacts with the magnet, maybe even bring a metal detector as well. would that convince you?

Figure 1: Example Hindi and English tweets in Stanceosaurus with stance towards the claim "Raid at Tirupati temple priest's house, 128 kg gold found". Translations of the Hindi tweets: "Out of 16 priests of Tirupati temple, 128 kg gold, 150 crores cash, 77 crores diamonds were found from Income Tax raid at the house of one priest." and "These are the priests of our country who have immense wealth."

English translations of the Arabic claims include:
• The end of covid and its variants will be through the use of scorpion venom
• Death of the artist Kadim Al Sahir
• The American wrestling champion, The Undertaker, dies of the Corona virus
• Drogba elected as president of the Ivory Coast Football Association
• A picture shows the lighting of the Burj Khalifa in the colors of the Lebanese flag, after the Lebanese team won the Arab Basketball Championship title
• The World Bank declares Egypt bankrupt due to the inability to pay its debts
• The Chinese yuan is being adopted as an alternative to the US dollar in Russia's markets
• Russia threatens to cut Internet cables and send the world back to the Stone Age
• Turkish President Recep Tayyip Erdogan: "We will send a million Syrians back to their country"
• The dismissal of Morocco's national team coach Vahid Halilhodzic
• Joe Biden: The three Abrahamic religions are similar and can be combined
• Iran has been excluded from the World Cup in Qatar
• MBC Group suspended the contract of the artist, Ramez Jalal, after the failure of his program "Ramez Movie Star"
• Gambian referee Bakary Gassama was killed after refereeing the last Algeria-Cameroon match
• Pfizer's vaccine contains deadly graphene oxide and the vaccination plot is to control human populations and several other allegations
• The International Court of Justice in The Hague ruled to cancel all forms of vaccination, manufacture and sale and cancel the health protocol of the World Health Organization and put several personalities under international legal prosecution, including the Director General of Pfizer on charges of genocide.
• Cheb Khaled's statement to beIN Sports: I hope an Arab team wins the Arab Cup
• A picture of Cristiano Ronaldo shows him holding Erdogan's new book
• A picture of a rhinoceros next to a man whose publishers claim it shows the death of the last northern white rhino in the world
• FIFA officially agrees to hold the World Cup every two years
• Tens of thousands of women around the world have reported changes in their menstrual cycle after receiving the coronavirus vaccine, which has raised question marks among doctors and scientists


Table 2: Fact-checking sources included in our Stanceosaurus corpus. The most common regions are listed. Stance: breakdown of tweets into the 5 main categories in relation to each claim: Irrelevant, Supporting, Refuting, Discussing, and Querying. Country & Regions: home country of the source and the distribution of claims regarding home-country, regional, and international matters. Other refers to claims about countries other than the primary countries covered by the source (e.g., Snopes claims about India).

Table 3: Cohen's Kappa (κ) between annotators for each language.

Table 4: (Left) Number of tweets in the English subset of Stanceosaurus. The train/dev/test sets consist of 112/44/34 separate claims, respectively. (Right) Statistics of RumourEval-2019 (Gorrell et al., 2019) after we reconstruct the data from message IDs.

Table 6: Per-label comparison of BERTweet-large when fine-tuned with cross-entropy, weighted cross-entropy, and class-balanced focal loss, for both 3-class and 5-class stance detection on our corpus. Weighted cross-entropy and class-balanced focal loss improve F1 scores overall, and in particular for the least frequent stance, Refuting.

Table 7: Cross-lingual experiments in which models are trained on the English part of Stanceosaurus and evaluated on the Hindi/Arabic data. Models trained with class-balanced focal loss (CB-foc) outperform those trained with standard and weighted cross-entropy loss (CE), with higher Macro F1 and lower variance.

Table 8: Results on RumourEval-2019 comparing models trained with class-balanced focal loss (CB-foc) against standard and weighted cross-entropy losses.

Table 9: Results of domain adaptation experiments on Stanceosaurus and RumourEval. We fine-tune BERTweet-large using class-balanced focal loss. Performance drops significantly when training on one dataset and testing on the other. However, with EasyAdapt (Daumé III, 2007; Bai et al., 2021), we attain a single model that achieves the best performance on Stanceosaurus while being on par with in-domain RumourEval model performance.

Table 10: Results of the Unseen Countries experiment. BERTweet-large fine-tuned with class-balanced focal loss is trained on international claims and evaluated on regional claims, stratified by fact-checking source. The model achieves an aggregate F1 that is somewhat lower than its counterpart in the Unseen Claims experiment.

Table 11: Ablation experiments studying the impact of context in 5-way stance classification. In particular, we split the Twitter threads within Stanceosaurus' training set into root tweets (those with no parent in the conversation thread) and reply tweets (those written in response to another message). In all experiments, we train BERTweet-large using cross-entropy loss. Results suggest that predicting the stance of reply tweets is significantly harder than root tweets. Context improves overall stance classification performance mainly by improving prediction on reply tweets.

Table 12: Results of BERTweet-large with class-balanced focal loss on unseen fact-checking sources. For each source, we remove the associated tweets from the train/dev sets of Stanceosaurus' standard data split. Macro F1 scores are computed on the subset of the test set containing tweets only from the unseen source. We also report the performance of the same model trained on the full train/dev splits of Stanceosaurus with tweets from all sources. Performance is degraded when predicting stance on unseen sources, but not by a large margin.

‹ Discussing: Fauci: No Concern About Number of People Testing Positive After COVID-19 Vaccine. Spike Protein Vax is magnet for coronavirus. Originally used as turbo booster mounted on virus but too flimsy. Now injected in target in advance of infection, death rate 4X.

Table 14: An example claim and its corresponding tweets from the 5 stance categories (best viewed in color): Irrelevant, Refuting, Supporting, Discussing, and Querying. The ‹ symbol indicates tweets directly retrieved by our keyword search; indented lines with ë are replies to parent tweets.