Empowering the Fact-checkers! Automatic Identification of Claim Spans on Twitter

The widespread diffusion of medical and political claims in the wake of COVID-19 has led to a voluminous rise in misinformation and fake news. The current vogue is to employ manual fact-checkers to efficiently classify and verify such data to combat this avalanche of claim-ridden misinformation. However, the rate of information dissemination is such that it vastly outpaces the fact-checkers’ strength. Therefore, to aid manual fact-checkers in eliminating the superfluous content, it becomes imperative to automatically identify and extract the snippets of claim-worthy (mis)information present in a post. In this work, we introduce the novel task of Claim Span Identification (CSI). We propose CURT, a large-scale Twitter corpus with token-level claim spans on more than 7.5k tweets. Furthermore, along with the standard token classification baselines, we benchmark our dataset with DABERTa, an adapter-based variation of RoBERTa. The experimental results attest that DABERTa outperforms the baseline systems across several evaluation metrics, improving by about 1.5 points. We also report detailed error analysis to validate the model’s performance along with the ablation studies. Lastly, we release our comprehensive span annotation guidelines for public use.


Introduction
The swift acceleration of Online Social Media (OSM) platforms has led to tremendous democratized content creation and information exchange. Consequently, these platforms serve as ideal breeding grounds for malicious rumormongers and talebearers, abetting a colossal upsurge of misinformation. Such misinformation manifests in many ways, including bogus claims, fabricated information, and rumors. The massive COVID-19 'Infodemic' (Naeem and Bhatti, 2020) is one such malignant byproduct that led to the rampant spread of political and social calumny (Ferrara, 2020; Margolin, 2020; Ziems et al., 2020), accompanied by counterfeit pharmaceutical claims (O'Connor and Murphy, 2020). Therefore, finding such claim-ridden posts on OSM platforms, investigating their plausibility, and differentiating the credible claims from the apocryphal ones has risen to be a pertinent research problem in Argument Mining (AM).

[Figure 1: Sample tweets with claim-worthy content. Tweet 1: 'RT @PirateAtLaw: No no no. Corona beer is the cure not the disease.' Tweet 2: 'We don't have evidence but we are positive our wine keeps you from getting #COVID19 if you drink enough of it. Better alternative to #DisinfectantInjection don't you think? #winecures.' Tweet 3: 'RT @angeliicamdc: Mexicans are immune to the coronavirus because we have sana sana colita de rana.' Tweet 4: '@adamseconomics Vaccine is probably made from Chinese ingredients sourced in Wuhan.']
'Claim', as coined by Toulmin (2003), is 'an assertion that deserves our attention'. It is the key component of any argument (Daxenberger et al., 2017). Consider the second tweet, 'We don't have evidence...', as given in Figure 1. For the task of claim identification at the coarse level, the entire tweet would be marked as a claim. However, on closer inspection, we find that the text fragments 'our wine keeps you from getting #COVID19' and 'Better alternative to #DisinfectantInjection' represent the finer argumentative units of claim and form the set of evidence based on which this tweet is considered a claim. Segregating such argumentative units of misinformed claims from their benign counterparts fosters many benefits. To begin with, it partitions the otherwise independent claims in a single post, enabling us to retrieve a larger number of claims. Secondly, it acts as a precursor to the downstream tasks of claim check-worthiness and claim verification. Thirdly, it will also bring in the angle of explainability in coarse-grained claim identification. Finally, it will serve manual fact-checkers and hoax-debunkers, helping them conveniently strain out the unnecessary shreds of text from further processing. We further elaborate on the necessity of claim span identification and exemplify it in Section 2.
Though the recent literature reflects extensive work on claim detection (Daxenberger et al., 2017; Chakrabarty et al., 2019; Gupta et al., 2021), limited forays have been made into claim span identification, i.e., recognizing the argumentative components of a claim (Wührl and Klinger, 2021). In the recent past, commendable work has been done on span-level argument unit recognition pertaining to other computational counterparts under the umbrella of AM, such as hate speech (Mathew et al., 2021) and toxic language (Pavlopoulos et al., 2021). Such study, however, has eluded the realm of claims, owing to the lack of quality annotated datasets. This heralds a specialized corpus creation effort on claim span identification.
To this end, we propose CURT (Claim Unit Recognition in Tweets), a large-scale, claim-span-annotated Twitter corpus. We also present several baseline models for solving claim span identification as a token classification task and evaluate them on CURT. Furthermore, we introduce claim descriptions, which are generic prompts aimed to assist the model in focusing on the most significant regions of the input text using explicit instructions on what to designate as a 'claim'. They are elucidated later in detail. Finally, we benchmark our dataset with DABERTa (Description Aware RoBERTa), a plug-and-play adapter-based variant of RoBERTa (Liu et al., 2019), endeavored to infuse the Pre-trained Language Model (PLM) with the description information. Empirical results attest that DABERTa outperforms the conventional baselines and generic PLMs for our task consistently across various metrics.
Contributions. Through this work, we make the following tangible contributions:

1. Formulation of a novel problem statement: We propose the novel task of Claim Span Identification that aims to identify argument units of claims in the given text.
2. Claim span identification dataset and extensive annotation guidelines: We posit a large-scale Twitter dataset, the first of its kind, with 7.5k claim-span-annotated tweets, to remedy the absence of an annotated dataset for claim span identification. Additionally, we develop comprehensive annotation guidelines for the same.
3. Claim span identification system: We propose a robust claim span identification framework based on Compositional De-Attention (CoDA) and Interactive Gating Mechanism (IGM).

4. Extensive evaluation and analysis: We evaluate our model against different baselines to confirm sizable improvements over them. We also report thorough qualitative and quantitative analysis along with the ablation studies.

Why Claim Span Identification?
As stated in Section 1, we hypothesize that claim span identification would aid fact-checkers to quickly segregate claim-ridden content from the rest of the post. Moreover, we suppose that it will be a propitious precursor for claim verification and fact-checking, facilitating better retrieval of relevant evidence. We back our hypothesis with a small experiment on evidence-based document retrieval. We collect 50 random samples from CURT, along with their corresponding ground-truth claim spans. Further, for both the tweets and the claim spans, we extract the top-k relevant articles from a knowledge base leveraging the traditional retrieval system BM25 (Robertson et al., 1995). We use the recently released, publicly available CORD19 corpus (Wang et al., 2020) to retrieve factual documents. Finally, we present the retrieved documents to three evaluators and ask them to mark whether or not the retrieved shreds of evidence are relevant to the given input tweet/span from our dataset. All three annotators label each text-evidence pair independently. Eventually, to obtain the final relevancy score, majority voting is employed. We obtain a high inter-annotator score (Fleiss' Kappa) of 0.63 and 0.67 for tweets and spans, respectively. We compare the performance of tweet-based and span-based retrievals in terms of precision (P) and normalized Discounted Cumulative Gain (nDCG) scores and report them in Table 1. For comparison, we consider two different top-k settings (k=3 and k=5).

Input | P@5 | P@3 | nDCG@5 | nDCG@3
Tweets | 0.3922 | 0.2745 | 0.2733 | 0.2280
Spans | 0.4407 | 0.3390 | 0.3038 | 0.2521

Table 1: nDCG@k and P@k scores for tweets and spans using the BM25 retrieval system and the CORD19 dataset.

We begin by examining the retrieval performance using P@k, which measures the fraction of relevant documents extracted in the top-k set. Span-based document retrieval consistently improves precision scores when compared to tweets. For nDCG@5, we discover that span-based retrieval outperforms tweet-based retrieval by more than 3%.
When we limit the retrieval depth to 3, we see a similar pattern. This, in turn, demonstrates that entire posts contain much extraneous information, frequently impeding the performance of evidence retrieval systems, which are a prerequisite for both automated and manual fact-checking. In summary, we reinforce that our hypothesis positively stands true, as span-based document retrieval results in better scores for both precision and nDCG. This attests to the task's feasibility and importance in the realm of claims.
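As a quick illustration of the two retrieval metrics used above (this is not the paper's code, and the relevance labels below are invented for the example), P@k and nDCG@k can be computed from binary relevance judgments as follows:

```python
import math

def precision_at_k(relevances, k):
    """Fraction of the top-k retrieved documents judged relevant."""
    return sum(relevances[:k]) / k

def ndcg_at_k(relevances, k):
    """Normalized Discounted Cumulative Gain over the top-k results:
    DCG discounts relevance by log2 of the rank, then normalizes by the
    DCG of an ideal (relevance-sorted) ranking."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical majority-voted relevance labels for one query's top-5 documents.
rels = [1, 0, 1, 1, 0]
print(precision_at_k(rels, 3))  # 2 of the top 3 are relevant -> 0.666...
print(ndcg_at_k(rels, 5))
```

A perfect ranking (all relevant documents first) yields nDCG@k of 1.0, which is why the span-based gains of a few points in Table 1 are meaningful.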
In summary, the existing literature on claims concentrates entirely on sentence-level claim identification and does not investigate eliciting fine-grained claim spans. In this work, we endeavor to move from coarse-grained claim detection to fine-grained claim span identification. We consolidate a large manually annotated Twitter dataset for the claim span identification task and benchmark it with various baselines and a dedicated description-based model.

Dataset
Over the past few years, several claim detection datasets have been released (Rosenthal and McKeown, 2012; Chakrabarty et al., 2019). However, none of these corpora come with claim-based rationales that quantify a post as a claim. To bridge this gap, we propose CURT (Claim Unit Recognition in Tweets), a large-scale Twitter corpus with token-level claim span annotations.
Data Selection. We annotate the claim detection Twitter dataset released by Gupta et al. (2021) for our task. However, the guidelines they presented have certain reservations: they do not explicitly account for benedictions, proverbs, warnings, advice, predictions, and indirect questions.
As a result, tweets such as 'Dear God, Please put an end to the Coronavirus. Amen' and '@FLOTUS Melania, do you approve of ingesting bleach and shining a bright light in the rectal area as a quick cure for #COVID19? #BeBest' have been mislabeled as claims. This prompted us to extend the existing guidelines and introduce a more exclusive and nuanced set of definitions based on claim span identification. We present details of the extended annotation guidelines and the guideline development procedure in Appendix A.1. In total, we annotated 7555 tweets from the Twitter corpus by Gupta et al. (2021) that met our guidelines.
Dataset Statistics and Analysis. We segment CURT into three partitions, a training set, a validation set, and a test set, in a split of 80:10:10. Dataset-related statistics are given in Table 2. One important point to note here is that while a claim tweet is typically 27 tokens long, a claim span is only around 10 tokens long. This implies that claim-ridden tweets carry a lot of extraneous information. An argument can also comprise several claims that may or may not be related to each other. Around 19% of the claim tweets in our dataset contain multiple claim spans. As a result, in total, we obtain 9458 claim spans from 7555 tweets. We observe that the majority of the tweets contain single claims: out of 7555 tweets, 6039 include a single claim, demonstrating that most tweets contemplate a single assertion at a time.
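Token-level span annotations like these are conventionally serialized as B/I/O tags for sequence labeling. The sketch below (illustrative only, not the paper's preprocessing code) converts spans given as token-index pairs into a BIO tag sequence:

```python
def spans_to_bio(tokens, spans):
    """Convert claim spans, given as (start, end) token-index pairs
    (end exclusive), into a BIO tag sequence: 'B' opens a span,
    'I' continues it, and 'O' marks tokens outside any claim."""
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags

tokens = "our wine keeps you from getting #COVID19 if you drink enough".split()
print(spans_to_bio(tokens, [(0, 7)]))
# ['B', 'I', 'I', 'I', 'I', 'I', 'I', 'O', 'O', 'O', 'O']
```

A tweet with multiple claim spans simply contributes several (start, end) pairs, producing multiple 'B' tags in one sequence.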

Proposed Methodology
In this section, we outline DABERTa and its intricacies. The main aim is to seamlessly coalesce critical domain-specific information into Pre-trained Language Models (PLMs). To this end, we introduce the Description Infuser Network (DescNet), a plug-and-play adapter module that conditions the LM representations with respect to handcrafted descriptions. The inclusion of claim descriptions encourages the model to focus on the most essential phrases in the input tweet, which may be thought of as guided attention that leads to increased performance. We judiciously curated our claim descriptions in accordance with the annotation guidelines for claims and non-claims offered by Gupta et al. (2021). In Table 3 we list some of the claim descriptions along with the claims that they most align with. It is noteworthy that a claim can align with more than one claim description as well.
Overview of PLMs for Token Classification. Before detailing the proposed framework, DABERTa, we present the working of PLMs for the token classification task. PLMs such as BERT (Devlin et al., 2019), DistilBERT (Sanh et al., 2019), and RoBERTa (Liu et al., 2019) are widely used for various downstream NLP tasks owing to their strong contextual language representation capabilities and ease of fine-tuning. As the input to these PLMs, each i-th input text is first tokenized into a sequence of sub-word embeddings X_i ∈ R^(N×d), where N is the maximum sequence length and d is the feature dimension. Then a positional embedding vector PE_pos ∈ R^(N×d) is added to the token embeddings in a pointwise fashion to retain the positional information (Vaswani et al., 2017). The vector Z_i ∈ R^(N×d), hence obtained, is fed to a stack of transformer encoder blocks. Each encoder block is a modular unit consisting of two sub-layers: (i) Multi-Headed Self-Attention, and (ii) a Feed-Forward Network. Furthermore, each sub-layer contains a residual connection, followed by dropout and layer normalization. For the task of token classification, the output of the last encoder layer is passed to a CRF layer (Lafferty et al., 2001). This modularity of PLMs enables easy integration of adapter modules into their architecture for making these PLMs task-specific and domain-dependent. We choose RoBERTa (Liu et al., 2019) as our backbone network as it is the best-performing baseline (see Table 4).
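The pointwise positional embedding referenced above can be sketched with the fixed sinusoidal scheme of Vaswani et al. (2017); note this is an illustrative implementation only (RoBERTa itself learns its positional embeddings rather than using this fixed form):

```python
import math

def sinusoidal_positional_embedding(n_positions, d_model):
    """Fixed sinusoidal positional encodings (Vaswani et al., 2017):
    even dimensions use sine, odd dimensions use cosine, with
    wavelengths forming a geometric progression up to 10000."""
    pe = [[0.0] * d_model for _ in range(n_positions)]
    for pos in range(n_positions):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = sinusoidal_positional_embedding(4, 8)
# Position 0 encodes as alternating sin(0)=0 and cos(0)=1.
print(pe[0])  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

Adding pe[pos] elementwise to each token embedding gives the positional-information-augmented vector Z_i described above.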

Description Infuser Network (DescNet). DescNet is designed to facilitate deep semantic interaction between the input text and claim descriptions, and to help underline the key fragments of claims. It consists of precisely engineered components, CoDA and IGM, each devised to augment the process of claim span identification.
Formally, consider D = {d_1, d_2, ..., d_m} as the set of m claim descriptions and T = {t_1, t_2, ..., t_n} as the corpus of n input texts. The description representations are extracted from pre-trained RoBERTa (Liu et al., 2019) and passed through a transformer encoder layer. To begin with, each i-th PLM-generated vector Z_i ∈ R^(N×d) of input text t_i interacts with each j-th description vector D_j ∈ R^(M×d) via the CoDA block. Here the vector Z_i forms the query, which is processed against the vector D_j acting as the key and value (Equation 1).
All such compositionally manipulated vectors Z^C_ij, obtained after interacting with each j-th description vector, are concatenated and passed through a dropout layer before undergoing a non-linear transformation for dimensionality reduction (Equation 2). The resultant vector Z'_i, along with the vector Z_i, is passed to the IGM module to extract the semantically appropriate features pertinent to fine-grained claim span identification (Equation 3).
The vector Ẑi is then passed to a CRF layer.
The traditional narrative on attention mechanisms (Bahdanau et al., 2015; Parikh et al., 2016; Seo et al., 2016; Vaswani et al., 2017) relies heavily on the Softmax operator, where the attention weights are always bounded between [0, 1]. Such a convex weighted-addition scheme allows the vectors to contribute only in an additive manner. To counter this bottleneck, Tay et al. (2019) devised a quasi-attention technique that enables learning of additive as well as subtractive attention weights, allowing the input vectors to add to (+1), not contribute to (0), and even subtract from (−1) the output vector. They decomposed the original Softmax-based self-attention as a pointwise multiplication between two matrices, as shown in Equation 4, where G(.) is the negative pointwise L1 distance between query Q and key K.
We adopt this quasi-attention strategy to promote more meaningful interaction between the input text and claim descriptions and generate more precise claim-relevant representations.
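A minimal numpy sketch of this compositional quasi-attention follows; it is our illustrative reconstruction of the scheme in Tay et al. (2019), not the paper's exact implementation. The scaled dot-product score is squashed with tanh and gated by a sigmoid over the negative pairwise L1 distance G, so each attention weight lies in (−1, 1) and can subtract as well as add:

```python
import numpy as np

def coda_attention(Q, K, V):
    """CoDA-style quasi-attention: weights = tanh(QK^T/sqrt(d)) *
    sigmoid(G/sqrt(d)), where G[i, j] = -||Q_i - K_j||_1, the negative
    pairwise L1 distance between query and key vectors."""
    d = Q.shape[-1]
    scores = np.tanh(Q @ K.T / np.sqrt(d))
    # Negative pairwise L1 distance via broadcasting.
    G = -np.abs(Q[:, None, :] - K[None, :, :]).sum(-1)
    gate = 1.0 / (1.0 + np.exp(-G / np.sqrt(d)))
    weights = scores * gate  # each entry lies in (-1, 1)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))          # e.g., tweet token representations
K = V = rng.normal(size=(5, 4))      # e.g., description token representations
out, w = coda_attention(Q, K, V)
print(out.shape, w.shape)  # (3, 4) (3, 5)
```

Because the tanh term can be negative, a description fragment can actively suppress features of a tweet token rather than merely being down-weighted, which is the behavior the paragraph above motivates.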
Interactive Gating Mechanism (IGM). To further distinguish salient tokens included in claim spans, we posit the Interactive Gating Mechanism. To begin with, the vectors Z_i and Z'_i are max-pooled to obtain Z_ip, Z'_ip ∈ R^d. These vectors are passed through a series of gates, the first of them being the conflict gate C, aimed at capturing the semantically conflicting features in Z_i and Z'_i (Equation 6).
The refine gate R, on the other hand, endeavors to capture the semantically similar features between Z_ip and Z'_ip (Equation 8).
To aggregate the conflicting and similar semantic representations spawned by the gates C and R, we employ an adaptive gating scheme to retain maximum differential information from each gate. It is given by Equation 10.
Finally, this vector Ẑi is passed to a CRF layer for token classification.
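Since Equations 6-10 are not reproduced in this extraction, the following is only a hypothetical numpy sketch of such a two-gate scheme, with the exact parameterization left as an assumption: a conflict gate driven by the difference of the pooled vectors, a refine gate driven by their elementwise agreement, and an adaptive convex fusion of the two.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def interactive_gating(z, z_desc):
    """Hypothetical sketch of an Interactive Gating Mechanism over two
    max-pooled vectors: a conflict gate emphasizes where the text and
    description representations disagree, a refine gate emphasizes where
    they agree, and an adaptive coefficient fuses the two streams."""
    conflict = sigmoid(z - z_desc) * z        # conflict gate C (Eq. 6 analogue)
    refine = sigmoid(z * z_desc) * z_desc     # refine gate R (Eq. 8 analogue)
    alpha = sigmoid(conflict + refine)        # adaptive fusion (Eq. 10 analogue)
    return alpha * conflict + (1 - alpha) * refine

z = np.array([0.5, -1.0, 2.0])        # pooled text features Z_ip
z_desc = np.array([0.4, 1.0, 2.0])    # pooled description features Z'_ip
fused = interactive_gating(z, z_desc)
print(fused.shape)  # (3,)
```

The real Ẑ_i is a full sequence representation fed to the CRF layer; this sketch only conveys the gate-and-fuse pattern on pooled vectors.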

Experiments and Results
Baseline Models. We employ the following baseline systems. ▷ CNN+CRF: A Convolutional Neural Network (CNN) trained with GloVe (Pennington et al., 2014) and a CRF head on top. ▷ BiLSTM+CRF (Huang et al., 2015): A sequence labeling model comprising a Bidirectional Long Short-Term Memory (BiLSTM) and a CRF layer. ▷ BERT (Devlin et al., 2019): A bidirectional transformer-inspired auto-encoder language model fine-tuned for our span identification task. ▷ DistilBERT (Sanh et al., 2019): A smaller, faster, and lighter version of BERT fine-tuned on our dataset for the task at hand. ▷ SpanBERT (Joshi et al., 2020): An enhanced version of BERT trained with a span prediction objective. ▷ RoBERTa (Liu et al., 2019): A robustly optimized BERT approach; a variant of BERT with an improved training methodology. We fine-tune it on our dataset. ▷ NLRG (Chhablani et al., 2021): A system proposed at SemEval-2021 Task 5 on toxic span detection (Pavlopoulos et al., 2021). It is a combination of SpanBERT and RoBERTa, where the former model is used for predicting the span start and end, while the latter is used for token classification. ▷ HITSZ-HLT (Zhu et al., 2021b): The system that topped the SemEval-2021 task on toxic span detection. They approached the task as a combination of sequence labeling and span extraction and proposed an ensemble of three BERT-based models.
Evaluation Metrics. In concordance with Pavlopoulos et al. (2021), we evaluate the performance of all the systems based on token-level precision (P), recall (R), and F1 scores. To further examine how the models fare for different token types, we calculate the micro-level precision, recall, and F1 score for each of the 'B', 'I', and 'O' tokens. Lastly, to quantify the number of tokens included in the spans, we also report the Dice Similarity Coefficient (DSC) (Dice, 1945).

Performance Comparison. We summarize our collated results in Table 4. Evidently, DABERTa outperforms all the baseline systems on the majority of the evaluation metrics. We analyze all the systems based on the following questions.

How accurately do the models predict? To gauge how well each model performs on the token classification task, we monitor precision, recall, and F1 scores. As can be inferred from Table 4, the traditional word-embedding-based deep learning models, CNN and BiLSTM, give the poorest token classification performance. An appreciable improvement of about 10-14% across all three metrics is observed when we move from the classical deep learning architectures to the transformer-based models DistilBERT, BERT, SpanBERT, and RoBERTa. This underlines the importance of using contextual word embeddings and transformer-based architectures for the task at hand. The addition of the CRF layer further amplifies the performance of these models. SpanBERT also fares better than BERT as it is trained using a span prediction objective. We also notice that employing the CRF layer results in a somewhat better balance of precision and recall when compared to using a basic linear layer. The ensemble-based models NLRG and HITSZ-HLT also give admissible results for our task. Our proposed model, DABERTa, surpasses all the models in terms of precision, recall, and F1 scores. An improvement of about 1.5% is observed between RoBERTa and DABERTa on these metrics. This justifies the
inclusion of claim descriptions that amalgamate domain-specific semantic information into the RoBERTa architecture via the deftly crafted adapter module. (Data preprocessing details are presented in Appendix A.2.) In summary, we see that all the models show a good trade-off between precision and recall.

Are the models aggressive or defensive? Observing the precision, recall, and F1 scores for each of the 'B', 'I', and 'O' tags, as shown in Table 4, we get an idea of how aggressive or defensive the models are at predicting claim spans. CNN and BiLSTM show considerable resistance in predicting the claim spans, as evidenced by high precision, recall, and F1 scores for the token 'O' and lower scores for the tokens 'B' and 'I'. The BERT-based models show a sizable improvement of about 22% and 15% for predicting tokens 'B' and 'I', respectively, over the traditional deep learning models. The addition of CRF layers further bolsters the predictive power for the token 'B'. DABERTa offers an improvement of about 4-5% over its traditional counterpart for predicting the token 'B'. Upon close inspection, we observe that the ranges of precision, recall, and F1 scores for predicting the tokens 'I' and 'O' vary by not more than 3%. However, the predictive power for the token 'B' varies vastly, by about 25%. Hence, we hypothesize that the inclusion of descriptions makes our model cognizant of the syntactic and semantic constructs of claims.

How do the models behave for multiple spans?
Figure 3 illustrates how well the models identify multiple spans. It is observed that CNN and BiLSTM find it challenging to identify multiple spans. The transformer-based models with a linear head tend to predict more claim spans in a tweet than required. This issue is mitigated when the linear head is replaced with a CRF layer. Still, these models correctly identify the occurrence of multiple spans only roughly 80% of the time. On the other hand, our model, DABERTa, correctly predicts multiple spans more than 85% of the time. Moreover, it does not predict more claim spans than required. Thus, the addition of domain-specific claim descriptions appropriately guides DABERTa in identifying the correct occurrence of spans.
Ablation Study. Table 4 also reports the ablation studies. Replacing CoDA with a naïve Dot-Product Attention (DPA), we observe a drop in performance across almost all the metrics. Amongst all, the performance drop in predicting the token 'B' is the most prominent (∼1.5% across precision, recall, and F1). Thus, we conjecture that the quasi-attention mechanism is better able to spot the start of a claim fragment than DPA. When IGM is removed, the performance for predicting the token 'B' slightly improves. However, it leads to a decrease in the predictive power for the 'O' token (∼2.5% in F1). Therefore, the combination of CoDA and IGM obtains the most balanced performance.
Hyper-parameter Tuning. We utilize the base version of RoBERTa (Liu et al., 2019) to build DABERTa. The model is trained end-to-end using the Adam optimizer (Kingma and Ba, 2014), a learning rate of 4e−5, and a batch size of 32 for 20 epochs, with early stopping if the Dice score does not improve after 5 epochs. We used an Nvidia Tesla V100 32 GB GPU. The hyper-parameter tuning is done with respect to the validation dataset. It is observed that the performance consistently increases as the integration is done at higher RoBERTa layers. This is admissible, as studies on probing PLM layers suggest that different layers encode distinct linguistic properties (Tenney et al., 2019). Furthermore, evidence by Peters et al. (2018) suggests that the lower layers of a language model encode syntactic information, whereas the higher layers capture complex semantics. As we strive to employ deep semantic interaction between the PLM representations and the claim descriptions, our results are consistent with their findings.
Error Analysis. In this section, we manually analyze the errors the models are prone to make. Table 5 highlights randomly sampled tweets from our dataset, CURT, along with their gold spans and predictions from DABERTa. In addition, we also consider the predictions from the best-performing baseline, RoBERTa, for a fair comparison. We analyze the errors committed by both systems and divide them into three categories: (i) tweets with a single claim span, (ii) tweets with claim-like premises, and (iii) tweets with claims that can be inferred from the underlying undertone of the tweet but where no explicit span can be marked to highlight the claim-specific connotation, e.g., figurative sentences, satire, indirect questions, etc. (Note: for simplicity, we refer to such claims as implicit claims.)

[Table 5: Error analysis of the outputs for three sample tweets, each shown with its gold span and the predictions of vanilla RoBERTa and DABERTa: (1) 'Truly sobering analysis: US more vulnerable than many countries to #coronavirus owing to combination of high numbers of uninsured, many w/o paid sick leave, and a leadership that has downplayed the challenge while not preparing the country for it.'; (2) 'Whether made on purpose or not #coronavirus was used by the #CCP as a bio weapon, not only to kill people but to encourage racism among their citizens against foreigners. Especially black people, CCP is kicking out black people from hotels even if they dont have covid.'; (3) 'RT @HealtheNews: Can honey, ginger, garlic or turmeric or any other home remedies cure #Covid19? No, here's why.' In the original, bold (green) text highlights the correct claim span, whereas italic (red) text marks the mistakes; the span highlighting is not reproducible in plain text.]

In the most straightforward situation, where the tweet contains only a single claim, DABERTa makes more precise predictions than the baseline system, as shown in the first example of Table 5. We observe that both models identify the claim span correctly; however, RoBERTa identifies some unnecessary spans, which trespasses our objective of equipping the fact-checkers with only relevant information. The second type of span-related error is the presence of claim-like premises. Claims and premises are closely related components of argument mining, and differentiating them is strenuous, even for humans. Example 2 in Table 5 exhibits a post containing a claim-premise pair. There are two conclusive claims in the tweet: '#coronavirus was used by the #CCP as a bio weapon' and 'CCP is kicking out black people from hotels even if they don't have covid'. Even though 'not only to kill people but to encourage racism among their citizens against foreigners' appears to be a claim at first glance, it serves as a premise supporting the conclusive part of the arguments brought forward in the tweet. In most cases, we discern that both systems identify the claim spans correctly, but they are easily fooled by the premises, leaving room for significant improvement in this regard.
Another prominent class of errors is implicit claims. Extracting the claim spans in implicit claims is arduous. We observe that both systems struggle to understand the linguistic structure of implicit claims. For instance, in sample 3, the user intends to assert that honey, ginger, garlic, or turmeric do not cure COVID19; however, DABERTa fails to understand the user's intention and yields the wrong span. We perceive similar behavior from the best-performing baseline, RoBERTa, as well. A plausible reason is the skewed nature of the dataset, which is lopsided with a significant bias toward explicit claims. According to our observations, DABERTa outperforms the best-performing baseline system significantly (∼4%; p < 0.0004). This furnishes empirical evidence that DABERTa can be efficiently used for claim span identification.

Conclusion
Through this systematic research, we introduced the novel task of Claim Span Identification, which is valuable on various fronts. We conducted an evidence-based document retrieval experiment, demonstrating that employing claim spans retrieves more relevant evidence than using the entire tweet. Furthermore, as there exists no specialized corpus for claim span identification, we compiled CURT, a large-scale Twitter corpus consisting of around 7.5k tweets annotated with token-level claim spans. We showed convincing results using various token classification baselines on our dataset. Moreover, we benchmarked CURT with DABERTa, an adapter-based variant of RoBERTa that encapsulates critical domain-specific information into the pre-trained model via claim descriptions. Through extensive qualitative, quantitative, and empirical results, we illustrated how DABERTa outperforms the other models on different fronts. Lastly, we also developed an extensive set of annotation guidelines and released them for further research.

Limitations
Though DABERTa yields state-of-the-art performance in claim span identification, there are a few cases where it falls short. Even for humans, recognizing claim spans in figurative or metaphorical sentences is arduous; consequently, our suggested model also struggles with them. As a result, our future study will focus on boosting claim span identification performance, especially for such sentences. Our analysis also revealed that the high resemblance between claims and premises confuses the model, making it difficult to distinguish between the two. DABERTa shares this limitation with the other baseline systems as well. As a result, this could be another alluring open challenge to work on.

A.1.1 Guideline Development
While different frameworks and models of argumentation range in intricacy and claim conceptualization, the claim element is colloquially perceived as a principal component of an argument. Following Stab and Gurevych (2017), we define the claim as 'the argumentative component in which the speaker or writer conveys the central, contentious conclusion of their argument'. Aharoni et al. (2014) proposed a framework in which an argument is often divided into two parts: claim and premise. The premise, which is another crucial component of an argument, encompasses all shreds of evidence obliged to either corroborate or refute the claim. We confine our corpus to claim components only. However, claims and premises are usually indistinguishable and frequently blend together. As a result, distinguishing them can be challenging, especially when authors use claim-like statements as premises.
Due to the highly subjective nature of claims, it is imperative to devise structured annotation guidelines before annotating a new dataset for the claim span identification task. Therefore, after rigorous analysis and discussion, we established an initial set of annotation guidelines. To acclimate better with the dataset, we progressed through iterations of improvements. In every iteration, 100 random tweets were annotated by three annotators (linguistic experts aged between 20 and 35 years) following the initial set of annotation guidelines. The annotators resolved ambiguous cases mutually. In successive iterations, we further addressed the unsettled tweets that necessitated clarifications in the annotation guidelines. For every change in the guidelines, we reconsidered all prior annotations to ensure that they reflected the most recent version of the annotation guidelines. The final sprint of pilot annotation involved annotating another set of 100 randomly chosen tweets with the final guidelines. Following Trautmann et al. (2020), we calculated the inter-annotator agreement using the α_u agreement measure (Krippendorff et al., 2016). We computed the mean pairwise value per post, where each token is classified into one of two classes, claim and non-claim. We obtained a more than satisfactory agreement score of 0.87.
Finally, the entire Twitter dataset was annotated by the same annotators that carried out the prefatory pilot annotations.
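To make the agreement computation concrete, the following is a simplified sketch of nominal Krippendorff's alpha over binary token labels. Note this is only an illustrative proxy for the unitized α_u measure actually used in the paper; the function name and input format are our own assumptions.

```python
from itertools import combinations

def krippendorff_alpha_binary(annotations):
    """Simplified nominal Krippendorff's alpha for binary token labels.

    annotations: list of label sequences (one per annotator), each a
    list of 0/1 token labels (claim / non-claim) of equal length.
    """
    n_annotators = len(annotations)
    n_tokens = len(annotations[0])
    pairs = list(combinations(range(n_annotators), 2))
    # Observed disagreement: fraction of pairwise token-level mismatches.
    d_o = sum(annotations[a][t] != annotations[b][t]
              for t in range(n_tokens) for a, b in pairs)
    d_o /= (n_tokens * len(pairs))
    # Expected disagreement from the pooled label distribution.
    all_labels = [label for ann in annotations for label in ann]
    n, ones = len(all_labels), sum(all_labels)
    d_e = 2 * ones * (n - ones) / (n * (n - 1))
    if d_e == 0:
        return 1.0  # all annotators used a single label
    return 1 - d_o / d_e
```

Perfect agreement yields 1.0, chance-level agreement yields 0, and systematic disagreement goes negative, mirroring the interpretation of the 0.87 score reported above.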

A.1.2 General Instructions
• A claim is a statement asserting that the author strongly believes something to be true, i.e., the act of showing, using, or stating something strongly.
• We use tweets annotated with a binary label following the LESA guidelines (Gupta et al., 2021), which indicates whether a tweet is a claim or not.
• The claim span is that part of a sentence that contains the semantic representation of the claim.
• Since our primary goal is to tackle misinformation in OSM, we mainly focus on claims that have some social impact.

A.1.3 Guidelines and Examples
• In the case of facts, we annotate the fact/span that may not be known by everyone, for example, scientific or legal (law) facts, and that does not involve any commonsense. However, we do not include universal facts in the claim span.
Example 1: "Water is colorless" is a universally known fact and hence should not be marked as a claim span.
Example 2: "Viruses always mutate" is a scientific fact that may not be known by everyone; hence, we annotate this fact as a claim.
• An assertion about future eventualities/predictions will not be included in the claim span. A prediction is an extrapolation based on an assertion and is associated with a confidence level that can never be greater than or equal to 100%. Thus, we do not consider predictions a part of the claim span.
Example: "@realDonaldTrump Uh no actually The virus will never go away Scientists will develop a vaccine for it that should be ready by next June which will allow nearly everyone to be immune to #CoronaVirus This really isn't hard to understand even for a very stable genius."
• A proverb is a simple, concrete, traditional saying that expresses a perceived truth based on common sense or experience, containing wisdom, truth, morals, and traditional views in a metaphorical, fixed, and memorizable form. Proverbs are not facts, and the elements of proverbs should not be annotated as claim spans.
Example: "Prevention is better than cure" is not a claim.
• If a claim contains statistics or dates, they should be included in the span. But not all numbers are important.
Example 1: "@FernandoSVZLA @AP So far 50 people outside China have it with no deaths. If China was hiding information and it was more lethal, we would see that fairly quickly". Here, the claim span is [50 people outside China have it with no deaths].
Example 2: "57 round trip to LA thanks coronavirus". The number is not important here.
• In case there are multiple conclusive independent claims in one tweet, we annotate each one of them separately.
Example: "5 million left Wuhan before the lockdown. If they were really interested in knowing, they'd be testing at least 1 in 100 cases of all viral pneumonia. They're limiting who's being tested so they aren't accused of lying. Oh, and they might be asked to actually do something." The claim spans would be: [1] "5 million left Wuhan before the lockdown" and [2] "They're limiting who's being tested so they aren't accused of lying".
• Tweets that negate a possibly false claim are also considered to be claims.
Example: "disinfectants are not a cure for coronavirus".
• Tweets 'reporting' something to be true, or an instance that has happened or will happen, are claims.
• In cases of claims made in the form of a conditional sentence, the premise/context is included in the span.
Example: "if you've been in the McDonald's play place you're immune to the coronavirus".
• For claims containing humor/sarcasm, only the humorous phrase will be considered as a claim span if it has some social impact.For satire, the complete sentence will be considered.
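The span annotations produced under the guidelines above are ultimately consumed as token-level labels by the token classification baselines. As an illustration, here is a minimal sketch converting character-offset span annotations to BIO tags; the function name, the whitespace tokenization, and the offset format are our own simplifying assumptions, not the released corpus format.

```python
def spans_to_bio(text, spans):
    """Convert character-offset claim spans to token-level BIO tags.

    text:  the raw tweet text.
    spans: list of (start, end) character offsets of claim spans.
    Tokens are produced by naive whitespace splitting for simplicity.
    """
    tags, pos = [], 0
    tokens = text.split()
    for tok in tokens:
        start = text.index(tok, pos)  # character offset of this token
        end = start + len(tok)
        pos = end
        # Spans that overlap this token's character range.
        covering = [s for s in spans if start < s[1] and end > s[0]]
        if not covering:
            tags.append("O")
        elif any(start <= s[0] < end for s in covering):
            tags.append("B-CLAIM")  # a span begins inside this token
        else:
            tags.append("I-CLAIM")  # token continues an open span
    return list(zip(tokens, tags))
```

For instance, the tweet "Corona beer is the cure not the disease" with a single span over "Corona beer is the cure" yields a B-CLAIM tag on "Corona", I-CLAIM through "cure", and O elsewhere.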

Figure 1: Examples of claim tweets and their ground-truth claim spans highlighted in boldface text (blue).

Figure 3: A comparative study of DABERTa and the baselines. The horizontal bar signifies the ratio of the number of predicted spans to the number of gold spans.

Figure 4: Performance of DABERTa when the adapter module is inserted at different layers of RoBERTa.

Figure 4 reflects the effect of integrating the adapter DescNet at different layers of RoBERTa. We observe that the performance consistently increases as the integration is done at higher RoBERTa layers. This is admissible, as studies probing PLM layers suggest that different layers encode distinct linguistic properties (Tenney et al., 2019). Furthermore, evidence by Peters et al. (2018) suggests that the lower layers of a language model encode syntactic information, whereas the higher layers capture complex semantics. As we strive to employ deep semantic interaction between the PLM representations and the claim descriptions, our results are consistent with their findings.
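As background for this layer-wise trend, the forward pass of a generic bottleneck adapter (in the style of Houlsby et al., 2019) inserted after a transformer layer can be sketched as follows. This is a generic adapter under our own naming, not the exact DescNet module.

```python
import numpy as np

def adapter_forward(h, w_down, w_up):
    """Generic bottleneck adapter: down-project, ReLU, up-project,
    then add a residual connection so the layer's output passes
    through unchanged when the adapter weights are near zero.

    h:      (seq_len, d) hidden states from a RoBERTa layer.
    w_down: (d, r) down-projection with bottleneck size r << d.
    w_up:   (r, d) up-projection back to the model dimension.
    """
    z = np.maximum(h @ w_down, 0.0)  # bottleneck + ReLU
    return h + z @ w_up              # residual add keeps shape (seq_len, d)
```

Because of the residual connection, inserting the adapter at any layer preserves the pre-trained representations by default; only the small bottleneck weights are trained, which is why the choice of insertion layer (syntactic lower layers vs. semantic higher layers) matters.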

Table 2: Dataset statistics. All lengths are in tokens.
The underlying principle behind this theoretical formalization is to link a claim span to a claim description to explicitly guide the model on what to focus on. As shown in Figure 2, DescNet houses two sub-components, namely, the Compositional De-Attention block (CoDA) and the Interactive Gating Mechanism (IGM). The particulars of each component are delineated in the following sections.
Claim Descriptions. Before delving into CoDA and IGM, we first examine claim descriptions, which are the cornerstone of the proposed model. Claim descriptions are handcrafted templates that guide the model where to concentrate its focus.
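To make the gating idea concrete, the following is a minimal numpy sketch of one plausible interactive gating step between PLM token representations and an encoded claim description. This is an illustrative interpretation under our own assumptions (the function name, weight shapes, and pooled-description input are not the paper's exact IGM formulation).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def interactive_gate(h_tokens, h_desc, w_g, b_g):
    """Blend token representations with a claim-description vector.

    h_tokens: (seq_len, d) PLM token representations.
    h_desc:   (d,) pooled claim-description representation.
    w_g:      (2 * d, d) gate projection; b_g: (d,) bias.
    """
    desc = np.broadcast_to(h_desc, h_tokens.shape)  # tile over tokens
    gate = sigmoid(np.concatenate([h_tokens, desc], axis=-1) @ w_g + b_g)
    # The gate decides, per dimension, how much description signal to
    # mix into each token representation.
    return gate * h_tokens + (1.0 - gate) * desc
```

The sigmoid gate is computed from both inputs jointly, so the amount of description information injected can vary per token, which matches the stated goal of deep semantic interaction between the PLM representations and the claim descriptions.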

Texts in the tweet that negate a possibly false claim: "Another case for more testing for #coronavirus! Blood tests show 14% of people are now immune to covid-19 in one town in Germany https://t.co/MVOq3nc4hn"
Texts in the tweet made in sarcasm or humour: "@username I think the cure to coronavirus is a 6 pack of corona only. yeah"
Texts in the tweet containing opinions that have societal implications: "@username @username I think it's a bio weapon made by China so I'm not surprised it has a lot of carriers."
Texts in the tweet in the form of conditional statement: "if you smoke weed you are immune to coronavirus"
Texts in the tweet containing a quote from someone: "The president said injecting disinfectant into the body can cure the virus."

Figure 2: A schematic diagram of DABERTa for claim span identification, comprising the Description Infuser Network (DescNet) with its Compositional De-Attention (CoDA) and IGM blocks. ⊙ represents point-wise multiplication, and ⊗ represents matrix multiplication. Example input tweet: "No! #Bleach won't cure #COVID19. Disinfectants can't kill the #coronavirus in your body. In fact, they will hurt you. If you or someone you know has been exposed to bleach, call Poison Control for help (1-800-222-1***). https://t.co/DtIfi77vLz https://t.co/9MxSFoVM0L What in the holy hell? And @Lysol issued a statement that people should not ingest Lysol. WTF? #Covid_19 #lysol #DontDrinkLysol"

Table 3: Examples of handcrafted claim descriptions, along with some aligning examples. Claim spans are highlighted in italics.
"... And cancer, heart disease, OCD, schizophrenia and AIDS. And life. #Covid19 #COVID"
Claim: [Drinking bleach and/or injecting Disinfectant will cure COVID19. And cancer, heart disease, OCD, schizophrenia and AIDS. And life.]
• Personal experience will only be part of the claim phrase if it is an opinion with societal impacts/implications.
Example: "Story about how #HydroxyChloroquine likely help people recover from #Coronavirus. IMO, it was never touted as the cure but as option for treatment doctors should consider and it appears to work in some cases....39 in one place. https://t.co/2hhi6aSVrY"
Claim: [it was never touted as the cure but as option for treatment doctors should consider and it appears to work in some cases....39 in one place.]
• A claim can be a sub-part of a question, but only if it is not a direct question.
Example: "@FLOTUS Melania, do you approve of ingesting bleach and shining a bright light in the rectal area as a quick cure for #COVID19? #BeBest"
Claim: [ingesting bleach and shining a bright light in the rectal area as a quick cure for #COVID19]
• Grounds/reasoning used to justify a claim will not be a part of the claim phrase.
Example: "Covid-19 vaccine development and deployment in China, when available, will be made a global public good, which will be China's contribution to ensuring vaccine accessibility and affordability in developing countries"
Claim phrase: [Covid-19 vaccine development and deployment in China, when available, will be made a global public good]
• Mocking/attacking a group or individual is not a part of the claim phrase.
Example: "Because #coronavirus has tremendous chances of getting cured but your anti-national agenda is worse than death"
Claim: [coronavirus has tremendous chances of getting cured]
• Claim phrases do not include the predicate part that does not contribute to it being a claim.
Example: "I firmly believe that [if they found a way to bottle the @andersoncooper giggle, it would cure the corona virus]"