Article Reranking by Memory-Enhanced Key Sentence Matching for Detecting Previously Fact-Checked Claims


False claims that have been previously fact-checked can still spread on social media. To mitigate their continual spread, detecting previously fact-checked claims is indispensable. Given a claim, existing works retrieve fact-checking articles (FC-articles) for detection and focus on reranking candidate articles in the typical two-stage retrieval framework. However, their performance may be limited because they ignore the following characteristics of FC-articles: (1) claims are often quoted to describe the checked events, providing lexical information besides semantics; and (2) sentence templates to introduce or debunk claims are common across articles, providing pattern information. In this paper, we propose a novel reranker, MTM (Memory-enhanced Transformers for Matching), to rank FC-articles using key sentences selected with event (lexical and semantic) and pattern information. For event information, we propose to finetune the Transformer with regression of ROUGE. For pattern information, we generate pattern vectors as a memory bank to match with the parts containing patterns. By fusing event and pattern information, we select key sentences to represent an article and then predict whether the article fact-checks the given claim using the claim, the key sentences, and the patterns. Experiments on two real-world datasets show that MTM outperforms existing methods. Human evaluation proves that MTM can capture key sentences as explanations. The code and the dataset are available at https://github.com/ICTMCG/MTM.

Introduction
Social media posts with false claims have led to real-world threats in many areas, such as politics (Fisher et al., 2016), social order (Wang and Li, 2011), and personal health (Chen, 2020).

[Figure 1(b): A claim and sentences from candidate fact-checking articles (translated from Chinese), each annotated with "Relevant?", "Contains Quotation?", and "Contains Fact-checking Patterns?". S1 reads: "Lemon is not a so-called acidic food, and drinking lemonade does not lead to cancer." S1 is on a similar topic but actually irrelevant, while S2 and S3, which contain a quotation or fact-checking patterns, are relevant.]
To tackle this issue, over 300 fact-checking projects have been launched, such as Snopes and Jiaozhen (Duke Reporters' Lab, 2020). Meanwhile, automatic systems have been developed for detecting suspicious claims on social media (Zhou et al., 2015; Popat et al., 2018a). However, this is not the end of the problem: a considerable number of false claims continue to spread even though they have already been proved false. According to a recent report (Xinhua Net, 2019), around 12% of false claims published on Chinese social media are actually "old", as they have been debunked previously. Hence, detecting previously fact-checked claims is an important task.
According to the seminal work by Shaar et al. (2020), the task is tackled by a two-stage information retrieval approach, whose typical workflow is illustrated in Figure 1(a). Given a claim as a query, in the first stage a basic searcher (e.g., BM25; Robertson and Zaragoza, 2009) retrieves candidate articles from a collection of fact-checking articles (FC-articles). In the second stage, a more powerful model (e.g., BERT; Devlin et al., 2019) reranks the candidates to provide evidence for manual or automatic detection. Existing works focus on the reranking stage: Vo and Lee (2020) model the interactions between a claim and whole candidate articles, while Shaar et al. (2020) extract several semantically similar sentences from FC-articles as a proxy. Nevertheless, these methods treat FC-articles as general documents and ignore their characteristics. Figure 1(b) shows three sentences from candidate articles for the given claim. Among them, S1 is friendlier to semantic matching than S2 and S3 because the whole of S1 describes its topic and contains no tokens irrelevant to the given claim, whereas S2 and S3 do (e.g., "has spread over years" in S2); handling them demands a filtering capability that purely semantics-based models lack. If we apply only general methods to this task, the relevant S2 and S3 may be neglected while the irrelevant S1 is focused on. To let the model focus on key sentences (i.e., sentences that are a good proxy of article-level relevance) like S2 and S3, we need to consider two characteristics of FC-articles beyond semantics:
C1. Claims are often quoted to describe the checked events (e.g., the underlined text in S2);
C2. Event-irrelevant patterns to introduce or debunk claims are common in FC-articles (e.g., the bold text in S2 and S3).
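To make the two-stage setup concrete, the sketch below implements the first-stage retrieval with BM25. It is a minimal illustration rather than the paper's implementation: it assumes the third-party rank_bm25 package, and the documents and whitespace tokenization are placeholders.

```python
# First-stage candidate retrieval with BM25 (illustrative sketch).
from rank_bm25 import BM25Okapi

fc_articles = [
    "Lemon is not a so-called acidic food, and drinking lemonade does not lead to cancer ...",
    "This claim has spread over years and has been debunked by fact-checkers ...",
    # ... the full FC-article collection goes here
]

# Naive whitespace tokenization; a real system would use a proper tokenizer.
bm25 = BM25Okapi([doc.lower().split() for doc in fc_articles])

def retrieve_candidates(claim: str, k1: int = 50) -> list[int]:
    """Return the indices of the top-k1 candidate FC-articles for a claim."""
    scores = bm25.get_scores(claim.lower().split())
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k1]
```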
Based on the observations, we propose a novel reranker, MTM (Memory-enhanced Transformers for Matching). The reranker identifies key sentences per article using claim- and pattern-sentence relevance, and then integrates information from the claim, key sentences, and patterns for article-level relevance prediction. In particular, regarding C1, we propose the ROUGE-guided Transformer (ROT) to score claim-sentence relevance both lexically and semantically. As for C2, we obtain the pattern vectors by clustering the differences of sentence and claim vectors for scoring pattern-sentence relevance, and store them in the Pattern Memory Bank (PMB). The joint use of ROT and PMB allows us to identify key sentences that reflect the two characteristics of FC-articles. Subsequently, fine-grained interactions among claims and key sentences are modeled by the multi-layer Transformer and aggregated with patterns to obtain an article-level feature representation. The article feature is fed into a Multi-layer Perceptron (MLP) to predict the claim-article relevance.
To validate the effectiveness of our method, we built the first Chinese dataset for this task, with 11,934 claims collected from Chinese Weibo and 27,505 fact-checking articles from multiple sources; 39,178 claim-article pairs are annotated as relevant. Experiments on the English dataset and the newly built Chinese dataset show that MTM outperforms existing methods. Further human evaluation and case studies prove that MTM finds key sentences as explanations. Our main contributions are as follows:
• We propose a novel reranker, MTM, for fact-checked claim detection, which can better identify key sentences in fact-checking articles by exploiting their characteristics.
• We design the ROUGE-guided Transformer to combine lexical and semantic information and propose a memory mechanism to capture and exploit common patterns in fact-checking articles.
• Experiments on two real-world datasets show that MTM outperforms existing methods. Further human evaluation and case studies prove that our model finds key sentences as good explanations.
• We built the first Chinese dataset for fact-checked claim detection with fact-checking articles from diverse sources.

Related Work
To defend against false information, researchers have mainly pursued two threads: (1) Automatic fact-checking methods retrieve relevant factual information from designated sources and judge a claim's veracity. Thorne et al. (2018) use Wikipedia as a fact tank and build a shared task for automatic fact-checking, while Popat et al. (2018b) and Wang et al. (2018) retrieve webpages as evidence and use their stances on claims for veracity prediction.
(2) Fake news detection methods often use non-factual signals, such as style (Przybyla, 2020; Qi et al., 2019), emotion (Ajao et al., 2019; Zhang et al., 2021), source credibility (Nguyen et al., 2020), user responses (Shu et al., 2019), and diffusion networks (Liu and Wu, 2018; Rosenfeld et al., 2020). However, these methods mainly aim at newly emerged claims and do not address claims that have been fact-checked but continue to spread. Our work is in a new thread: detecting previously fact-checked claims. Vo and Lee (2020) model the interaction between claims and FC-articles by combining GloVe (Pennington et al., 2014) and ELMo (Peters et al., 2018) embeddings. Shaar et al. (2020) train a RankSVM with scores from BM25 and Sentence-BERT for relevance prediction. These methods ignore the characteristics of FC-articles, which limits their ranking performance and explainability.

[Figure 2: Overview of MTM. Given a claim q and a candidate article d with l sentences s_1, ..., s_l, MTM (1) feeds (q, s) pairs into the ROUGE-guided Transformer (ROT) to obtain claim-sentence scores in both lexical and semantic aspects; (2) matches residual embeddings r_{s,q} with vectors in the Pattern Memory Bank (PMB) to obtain pattern-sentence scores; (3) identifies k_2 key sentences by combining the two scores (in the figure, k_2 = 2, and s_i and s_l are selected); (4) models the interaction among q, s, and the nearest memory vector m for each key sentence; and (5) performs score-weighted aggregation and predicts the claim-article relevance.]

Proposed Method
Given a claim q and a candidate set D of k_1 FC-articles obtained by a standard full-text retrieval model (BM25), we aim to rank the FC-articles that are truly relevant to q at the top by modeling fine-grained relevance between q and each article d ∈ D. This is accomplished by Memory-enhanced Transformers for Matching (MTM), which conceptually has two steps: (1) Key Sentence Identification and (2) Article Relevance Prediction (see Figure 2). For an article of l sentences, let S = {s_1, ..., s_l} be its sentence set. In Step (1), for each sentence, we derive a claim-sentence relevance score from the ROUGE-guided Transformer (ROT) and a pattern-sentence relevance score from the Pattern Memory Bank (PMB). The scores indicate how similar the sentence is to the claim and the pattern vectors, i.e., how likely it is to be a key sentence. The top-k_2 sentences are selected for more complicated interactions and aggregation with the claim and pattern vectors in Step (2). The aggregated vector is used for the final prediction. We detail the components and then summarize the training procedure below.
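For orientation, the following sketch traces Step (1) for a single (claim, article) pair. The two scoring functions are trivial random placeholders standing in for the ROT and PMB components detailed below, so only the control flow is meaningful.

```python
# Conceptual flow of Step (1): combine claim- and pattern-sentence scores
# and keep the top-k2 sentences. rot_score/pmb_score are placeholders for
# the real components described in the following subsections.
import random

def rot_score(q: str, s: str) -> float:   # placeholder for ROT
    return random.random()

def pmb_score(q: str, s: str) -> float:   # placeholder for PMB
    return random.random()

def identify_key_sentences(q: str, sentences: list[str], k2: int = 3,
                           lambda_q: float = 0.5, lambda_p: float = 0.5) -> list[str]:
    key = sorted(sentences,
                 key=lambda s: lambda_q * rot_score(q, s) + lambda_p * pmb_score(q, s),
                 reverse=True)
    return key[:k2]   # passed on to Step (2) for interaction and aggregation
```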

ROUGE-guided Transformer (ROT)
ROT (top left of Figure 2) is used to evaluate the relevance between q and each sentence s in S, both lexically and semantically. Inspired by Gao et al. (2020), we choose to "inject" the ability to consider lexical relevance into a semantic model. As BERT has been proved to capture and evaluate semantic relevance (Zhang et al., 2020), we use a one-layer Transformer initialized with the first block of pretrained BERT to obtain the initial semantic representation of q and s:

z_{q,s} = \mathrm{Transformer}\big([\mathrm{CLS}]\; q\; [\mathrm{SEP}]\; s\; [\mathrm{SEP}]\big) \quad (1)

where [CLS] and [SEP] are reserved tokens and z_{q,s} is the output representation.
To force ROT to consider lexical relevance, we finetune the pretrained Transformer under the guidance of ROUGE (Lin, 2004), a widely used metric for evaluating the lexical similarity of two segments in summarization and translation tasks. The intuition is that lexical relevance can be characterized by token overlap, which is exactly what ROUGE measures. We minimize the mean squared error between the prediction \hat{R}_2 and the precision and recall of ROUGE-2 between q and s (R_2 ∈ \mathbb{R}^2) to optimize ROT:

L_{ROT} = \|\hat{R}_2 - R_2\|_2^2 + \lambda_R \|\Delta\theta\|_2^2 \quad (2)

where the first term is the regression loss and the second constrains the change of parameters, since the ability to capture semantic relevance should be maintained. λ_R is a control factor and Δθ represents the change of parameters relative to the pretrained initialization.
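A minimal sketch of this objective is given below, assuming a HuggingFace-style one-layer BERT encoder; the two-dimensional regression head and the hyperparameter value are illustrative assumptions, not the released implementation.

```python
# ROUGE-guided finetuning sketch (Eq. 2): regress the ROUGE-2 precision and
# recall while penalizing drift from the pretrained parameters.
import torch
import torch.nn as nn

class ROT(nn.Module):
    def __init__(self, encoder, lambda_r: float = 0.1):
        super().__init__()
        self.encoder = encoder                                 # 1-layer Transformer from BERT
        self.head = nn.Linear(encoder.config.hidden_size, 2)   # -> (precision, recall)
        self.lambda_r = lambda_r
        # Frozen snapshot of the pretrained parameters for the drift penalty.
        self.init_params = [p.detach().clone() for p in encoder.parameters()]

    def forward(self, input_ids, attention_mask):
        z = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.head(z[:, 0])                              # predict from the [CLS] vector

    def loss(self, pred, rouge2_pr):
        regression = ((pred - rouge2_pr) ** 2).mean()          # MSE against ROUGE-2 (P, R)
        drift = sum(((p - p0) ** 2).sum()                      # ||Δθ||² term
                    for p, p0 in zip(self.encoder.parameters(), self.init_params))
        return regression + self.lambda_r * drift
```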

Pattern Memory Bank (PMB)
The Pattern Memory Bank (PMB) generates, stores, and updates the vectors that represent the common patterns in FC-articles. The vectors in the PMB are used to evaluate pattern-sentence relevance (see Section 3.1.3). Below, we detail how these patterns are formulated, initialized, and updated.
Formulation. Intuitively, one could summarize the templates, like "...has been debunked by...", and explicitly perform exact matching, but such templates are costly to obtain and hard to integrate into neural models. Instead, we implicitly represent the common patterns using vectors derived from the embeddings of our model, ROT. Inspired by Wu et al. (2018), we use a memory bank M to store K common patterns (as vectors), i.e., M = \{m_i\}_{i=1}^{K}.

Initialization. We first represent each q in the training set and each s in the corresponding articles by averaging its token embeddings (from the embedding layer of ROT). Considering that a pattern vector should be event-irrelevant, we heuristically remove the event-related part of s as much as possible by calculating the residual embedding r_{s,q}, i.e., subtracting q from s:

r_{s,q} = s - q \quad (3)

We rule out residual embeddings that do not satisfy t_{low} < \|r_{s,q}\|_2 < t_{high}, because they are unlikely to contain good pattern information: \|r_{s,q}\|_2 \le t_{low} indicates that q and s are highly similar and thus leave little pattern information, while \|r_{s,q}\|_2 \ge t_{high} indicates that s may not align with q in terms of the event, so the corresponding r_{s,q} is of little use. Finally, we aggregate the valid residual embeddings into K clusters using K-means and obtain the initial memory bank M:

M = \mathrm{KMeans}\big(\{r^{valid}_{s,q}\}, K\big) \quad (4)

where \{r^{valid}_{s,q}\} is the set of valid residual embeddings.

Update. As the initial K vectors may not accurately represent common patterns, we update the memory bank according to feedback from the predictions during training: if the model predicts correctly, the key sentence, say s, should be used to update its nearest pattern vector m. To maintain stability, we use an epoch-wise update instead of an iteration-wise update.
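The initialization can be sketched in a few lines with scikit-learn; the threshold values below are illustrative assumptions rather than the paper's settings.

```python
# PMB initialization sketch (Eqs. 3-4): norm-filtered residual embeddings
# clustered with K-means; cluster centers become the initial pattern vectors.
import numpy as np
from sklearn.cluster import KMeans

def init_memory_bank(claim_embs, sent_embs, pairs, K=20,
                     t_low=1.0, t_high=10.0) -> np.ndarray:
    """claim_embs/sent_embs: averaged token embeddings, shape (n, dim);
    pairs: (claim_idx, sentence_idx) tuples from the training set."""
    residuals = []
    for qi, si in pairs:
        r = sent_embs[si] - claim_embs[qi]        # Eq. 3: strip the event part
        if t_low < np.linalg.norm(r) < t_high:    # keep only informative residuals
            residuals.append(r)
    km = KMeans(n_clusters=K, n_init=10).fit(np.stack(residuals))
    return km.cluster_centers_                    # Eq. 4: initial pattern vectors
```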
Take updating m as an example. After an epoch, we extract all n key sentences whose nearest pattern vector is m, together with their n corresponding claims, denoted as a tuple set (S, Q)_m. Then (S, Q)_m is separated into two subsets, R_m and W_m, which contain n_r and n_w sentence-claim tuples from the rightly and wrongly predicted samples, respectively. The core of our update mechanism (Figure 3) is to draw m closer to the residual embeddings in R_m and push it away from those in W_m. We denote the i-th residual embedding from the two subsets as r^{R_m}_i and r^{W_m}_i, respectively. To determine the update direction, we calculate a weighted sum of residual embeddings according to the predicted matching scores. For (s, q), suppose MTM outputs \hat{y}_{s,q} ∈ [0, 1] as the predicted matching score of q and d (whose key sentence is s); then the weight of r_{s,q} is w_{s,q} = |\hat{y}_{s,q} - 0.5|. The weighted residual embeddings are respectively summed and normalized as the components of the direction vector:

u_{m_r} = \mathrm{Normalize}\Big(\sum_{i=1}^{n_r} w^{R_m}_i r^{R_m}_i\Big), \quad u_{m_w} = \mathrm{Normalize}\Big(\sum_{i=1}^{n_w} w^{W_m}_i r^{W_m}_i\Big) \quad (5)

where u_{m_r} and u_{m_w} are the aggregated residual embeddings. The direction draws m toward u_{m_r} and pushes it away from u_{m_w}:

u_m = w_r u_{m_r} - w_w u_{m_w} \quad (6)

where w_r and w_w are the normalized sums of the corresponding weights used in Eq. 5 (w_r + w_w = 1). The pattern vector m is updated with:

m_{new} = m_{old} + \lambda_m \|m_{old}\|_2 \, u_m \quad (7)

where m_{old} and m_{new} are the memory vector m before and after the update; the constant λ_m and \|m_{old}\|_2 jointly control the step size.
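The epoch-wise update can be sketched as follows, following Eqs. 5-7 as reconstructed above; both subsets are assumed non-empty, and λ_m is an illustrative value.

```python
# Epoch-wise pattern update sketch (Eqs. 5-7). right/wrong: lists of
# (residual_embedding, matching_score) tuples whose nearest pattern vector
# is m, split by whether the article-level prediction was correct.
import numpy as np

def update_pattern(m, right, wrong, lam_m: float = 0.05) -> np.ndarray:
    def aggregate(pairs):
        weights = [abs(score - 0.5) for _, score in pairs]    # w_{s,q} = |y_hat - 0.5|
        u = sum(w * r for (r, _), w in zip(pairs, weights))
        return u / (np.linalg.norm(u) + 1e-12), sum(weights)  # Eq. 5

    u_r, w_r = aggregate(right)
    u_w, w_w = aggregate(wrong)
    w_r, w_w = w_r / (w_r + w_w), w_w / (w_r + w_w)           # so that w_r + w_w = 1
    direction = w_r * u_r - w_w * u_w                         # Eq. 6: attract / repel
    return m + lam_m * np.linalg.norm(m) * direction          # Eq. 7: bounded step
```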

Key Sentence Selection
Whether a sentence is selected as a key sentence is determined by combining the claim- and pattern-sentence relevance scores. The former is calculated from the distance between the vector representations of q and s learned by ROT, and the latter uses the distance between the residual embedding and its nearest pattern vector in the PMB. Both scores are scaled to [0, 1]:

scr_Q(q, s) = \mathrm{Scale}\big(\|q - s\|_2\big) \quad (8)

scr_P(q, s) = \mathrm{Scale}\big(\|m_u - r_{s,q}\|_2\big), \quad u = \arg\min_i \|m_i - r_{s,q}\|_2 \quad (9)

where \mathrm{Scale}(x) = 1 - \frac{x - \min}{\max - \min}, with \max and \min being the maximum and minimum distances over the sentences of d, respectively, and m_u is the pattern vector nearest to the residual embedding r_{s,q}. For each sentence s in d, the relevance score with q is then calculated by:

scr(q, s) = \lambda_Q \, scr_Q(q, s) + \lambda_P \, scr_P(q, s) \quad (10)

where λ_Q and λ_P are hyperparameters whose sum is 1.

Finally, the sentences with the top-k_2 scores, denoted as K = \{s^{key}_i(q, d)\}_{i=1}^{k_2}, are selected as the key sentences of d for the claim q.
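The selection step then reduces to a few lines; the sketch below assumes the per-sentence distances have already been computed by ROT and the PMB, and the weights are illustrative.

```python
# Key-sentence scoring sketch (Eqs. 8-10): min-max-scaled distances combined
# with weights lambda_q + lambda_p = 1, then top-k2 selection.
import numpy as np

def scale(dists: np.ndarray) -> np.ndarray:
    """Scale(x) = 1 - (x - min) / (max - min): smaller distance, higher score."""
    return 1.0 - (dists - dists.min()) / (dists.max() - dists.min() + 1e-12)

def key_sentence_indices(claim_dists, pattern_dists, k2=3,
                         lambda_q=0.5, lambda_p=0.5):
    scores = (lambda_q * scale(np.asarray(claim_dists, dtype=float))
              + lambda_p * scale(np.asarray(pattern_dists, dtype=float)))  # Eq. 10
    return np.argsort(-scores)[:k2], scores
```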
Article Relevance Prediction (ARP)
We model more complicated interactions between the claim and the key sentences by feeding each z_{q,s^{key}} (derived from ROT) into a multi-layer Transformer (MultiTransformer):

z'_{q,s^{key}} = \mathrm{MultiTransformer}\big(z_{q,s^{key}}\big) \quad (11)

Following Reimers and Gurevych (2019), we respectively compute the mean of the output token vectors of q and of s^{key} in z'_{q,s^{key}} to obtain fixed-size sentence vectors q ∈ \mathbb{R}^{dim} and s^{key} ∈ \mathbb{R}^{dim}, where dim is the dimension of a token vector in the Transformers.
Weighted memory-aware aggregation. For the final prediction, we use score-weighted memory-aware aggregation. To make the predictor aware of the pattern information, we append the corresponding nearest pattern vector to the claim and key sentence vectors:

v_i = \big[\, q \,;\, s^{key}_i \,;\, m_{u_i} \,\big] \quad (12)

where m_{u_i} is the pattern vector nearest to the residual embedding of s^{key}_i. Intuitively, a sentence with a higher score should be attended to more. Thus, the concatenated vectors (Eq. 12) are weighted by the relevance scores from Eq. 10, normalized across the top-k_2 sentences:

\widetilde{scr}(q, s^{key}_i) = \mathrm{Normalize}\big(scr(q, s^{key}_i)\big) \quad (13)

The weighted aggregate vector is fed into an MLP which outputs the probability that d fact-checks q:

\hat{y}_{q,d} = \mathrm{MLP}\Big(\sum_{i=1}^{k_2} \widetilde{scr}(q, s^{key}_i)\, v_i\Big) \quad (14)

where \hat{y}_{q,d} ∈ [0, 1]. If \hat{y}_{q,d} > 0.5, the model predicts that d fact-checks q; otherwise, it does not. The loss function is the cross entropy:

L_M = -\big(y_{q,d} \log \hat{y}_{q,d} + (1 - y_{q,d}) \log(1 - \hat{y}_{q,d})\big) \quad (15)

where y_{q,d} ∈ {0, 1} is the ground-truth label: y_{q,d} = 1 if d fact-checks q and 0 otherwise. The predicted values are used to rank all k_1 candidate articles retrieved in the first stage.
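A minimal PyTorch sketch of this aggregation, following Eqs. 12-14 as reconstructed above; the MLP shape and k_2 value are illustrative assumptions.

```python
# Score-weighted memory-aware aggregation sketch (Eqs. 12-14).
# q_vec: (dim,); sent_vecs, pattern_vecs: (k2, dim); scores: (k2,) from Eq. 10.
import torch
import torch.nn as nn

dim, k2 = 768, 3
mlp = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

def predict(q_vec, sent_vecs, pattern_vecs, scores):
    v = torch.cat([q_vec.expand(k2, dim), sent_vecs, pattern_vecs], dim=-1)  # Eq. 12
    w = scores / scores.sum()                    # Eq. 13: normalize over top-k2
    agg = (w.unsqueeze(-1) * v).sum(dim=0)       # score-weighted aggregation
    return torch.sigmoid(mlp(agg)).squeeze(-1)   # Eq. 14: P(d fact-checks q)
```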

Training MTM
We summarize the training procedure of MTM in Algorithm 1, including the pretraining of ROT, the initialization of the PMB, the training of ARP, and the epoch-wise update of the PMB.

Algorithm 1: Training MTM
1: Pretrain ROT with the ROUGE-guided objective (Eq. 2).
2: Initialize the PMB with clustered residual embeddings (Eqs. 3-4).
3: for each training epoch do
4:   for each claim-article pair (q, d) do
5:     for each sentence s in d do
6:       Calculate scr_Q(q, s) via ROT and scr_P(q, s) via PMB.
7:       Calculate scr(q, s) using Eq. 10.
8:     Select key sentences K.
9:     Calculate v for each s in K and ŷ_{q,d}.
10:    Calculate the loss L_M (Eq. 15).
11:    Update the ARP to minimize L_M.
12:  Update the PMB according to the epoch's predictions (Eqs. 5-7).

Experiments
In this section, we mainly answer the following experimental questions:
EQ1: Can MTM improve the ranking performance of FC-articles given a claim?
EQ2: How effective are the components of MTM, including the ROUGE-guided Transformer, the Pattern Memory Bank, and the weighted memory-aware aggregation in Article Relevance Prediction?
EQ3: To what extent can MTM identify key sentences in the articles, especially in the longer ones?

Data
We conducted the experiments on two real-world datasets. Table 1 shows the statistics of the two datasets. The details are as follows:

Twitter Dataset
The Twitter dataset originates from Vo and Lee (2019) and was processed by Vo and Lee (2020). The dataset pairs claims (tweets) with the corresponding FC-articles from Snopes. For tweets with images, it appends the OCR results to the tweets. We remove the manually normalized claims in Snopes' FC-articles to adapt to more general scenarios. The data split is the same as in Vo and Lee (2020).

Weibo Dataset
We built the first Chinese dataset for the task of detecting previously fact-checked claims in this article. Statistics are shown in Table 1, and the construction process is detailed in Appendix A.

Baselines

We compare MTM with the following rerankers:

BERT(Transfer): As no sentence-level labels are provided in most document retrieval datasets, Yang et al. (2019) finetune BERT with short-text matching data and then apply it to score the relevance between the query and each sentence in a document. The three highest scores are combined with the BM25 score for document-level prediction.

Rankers from related works of our task:

Sentence-BERT: Shaar et al. (2020) use pretrained Sentence-BERT models to calculate the cosine similarity between each sentence and the given claim. The top similarity scores are then fed into a neural network to predict document relevance.

RankSVM: A pairwise RankSVM model for reranking using the scores from BM25 and Sentence-BERT (mentioned above), which achieves the best results in Shaar et al. (2020).

CTM (Vo and Lee, 2020): This method leverages GloVe and ELMo to jointly represent claims and FC-articles for predicting relevance scores. Its multi-modal version is not included, as MTM focuses on key textual information.

[Table 2: Ranking performance on the Weibo and Twitter datasets in terms of MRR, MAP@k (k = 1, 3, 5), and HIT@k (k = 3, 5).]

Experimental Setup
Evaluation Metrics. As this is a binary retrieval task, we follow Shaar et al. (2020) and report Mean Reciprocal Rank (MRR), Mean Average Precision@k (MAP@k, k = 1, 3, 5), and HIT@k (k = 3, 5). See the equations in Appendix B.

Implementation Details. In MTM, the ROT and ARP components have one and eleven Transformer layers, respectively. The initial parameters are obtained from pretrained BERT models (bert-base-chinese for Weibo and bert-base-uncased for Twitter); the other parameters are randomly initialized. The dimension of the claim and sentence representations in ARP and of the pattern vectors is 768. The number of clusters K in the PMB is 20. Following Shaar et al. (2020) and Vo and Lee (2020), we use k_1 = 50 candidates retrieved by BM25. k_2 = 3 (Weibo, hereafter W) / 5 (Twitter, hereafter T) key sentences are selected. We use Adam (Kingma and Ba, 2015) for optimization with ε = 10^{-6}, β_1 = 0.9, β_2 = 0.999. The learning rates are 5 × 10^{-6} (W) and 1 × 10^{-4} (T). The batch size is 512 for pretraining ROT and 64 for the main task. The thresholds t_low and t_high are set according to quantiles of the residual embedding norms on the training data.
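For completeness, the optimizer configuration reported above translates directly into PyTorch; the model below is only a placeholder module.

```python
# Adam configuration with the reported hyperparameters (Weibo values).
import torch

model = torch.nn.Linear(768, 1)   # placeholder for MTM
optimizer = torch.optim.Adam(model.parameters(),
                             lr=5e-6,             # 1e-4 for the Twitter dataset
                             betas=(0.9, 0.999),
                             eps=1e-6)
```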

Performance Comparison
To answer EQ1, we compared the performance of the baselines and our method on the two datasets, as shown in Table 2. We see that: (1) MTM outperforms all compared methods on both datasets (the only exception is MAP@1 on Twitter), which indicates that it can effectively find related FC-articles and provide evidence for determining whether a claim has been previously fact-checked.
(2) For all methods, the performance on Weibo is worse than on Twitter, because the Weibo dataset contains more claim-sentence pairs (from multiple sources) than Twitter and is thus more challenging. Despite this, MTM's improvement is significant. (3) BERT(Transfer), Sentence-BERT, and RankSVM use sentence-level knowledge transferred from other pretext tasks but do not outperform the document-level BERT. This is because FC-articles have their own characteristics, which may not be covered by the transferred knowledge. In contrast, the characteristics we observed help MTM achieve good performance. Moreover, MTM is also efficient compared to BERT(Transfer), which likewise uses 12-layer BERT and selects sentences: our model feeds all sentences through only one layer (the other eleven layers process only the key sentences), whereas BERT(Transfer) feeds all sentences through all twelve layers.

Ablation Study
To answer EQ2, we evaluated three ablation groups of MTM's variants (AG1-AG3) to investigate the effectiveness of the model design. Table 3 shows the performance of the variants and MTM.
AG1: With vs. Without ROUGE. This variant removes the guidance of ROUGE (MTM w/o ROUGE guidance) to check the effectiveness of ROUGE-guided finetuning. The variant performs worse on Weibo, but MAP@1 slightly increases on Twitter. This is probably because there is more lexical overlap between claims and FC-articles in the Weibo dataset, while most FC-articles in the Twitter dataset summarize the claims they fact-check.
AG2: Cluster-based Initialization vs. Random Initialization vs. Without Update vs. Without PMB. The first variant (MTM w/ rand mem init) uses random initialization, and the second (MTM w/o mem update) uses pattern vectors without updating. The last one (MTM w/o PMB) removes the PMB entirely. All variants perform worse than MTM on MRR, among which w/ rand mem init performs the worst. This indicates that cluster-based initialization provides a good start and facilitates the subsequent updates, while random initialization may harm further learning.
AG3: Score-weighted Pooling vs. Average Pooling, and With vs. Without Pattern Vector. The first variant, MTM w/ avg. pool, replaces the score-weighted pooling with average pooling. The comparison in terms of MRR and MAP shows the effectiveness of using relevance scores as weights. The second, MTM w/o pattern aggr., does not append the pattern vector to the claim and sentence vectors before aggregation. It yields worse results, indicating that the patterns should be taken into account for the final prediction.
[Figure 4: Example key sentences corresponding to residual embeddings around pattern vectors (translated; highly frequent pattern words originally in boldface):
★ A video of a teenager drowning is spreading online, with the content that ...
★ Recently, a piece of news about the new driving-test regulations spread in WeChat Moments.
★ It is reported that the video attached to this rumor records the scene of the 6.11 homicide in Xihua and is totally unrelated to the rumor that four monks killed people for their kidneys.
★ After investigation, the post stated that Russia confirmed that MH370 was hijacked to a US military base, but there was no report in the mainstream Russian media; the reported publisher was judged to have published false information.
★ According to the publisher, it just wanted to attract netizens and increase its popularity; it never participated in marking for Gaokao.
★ In the past two days, a picture called "the latest international gesture for calling the police" spread online, attracting many netizens to forward it.
★ After retrieval, the editor found it was a variant of an old rumor published in 2014 claiming that the Ukrainian Embassy is hiring mercenaries in China.
★ The police verified that the news was a rumor. We remind you not to forward such a message as soon as you see it, and not to dial the phone number in the news.]

Visualization of Memorized Patterns
To probe what the PMB summarizes and memorizes, we selected and analyzed the key sentences corresponding to the residual embeddings around pattern vectors. Figure 4 shows example sentences where highly frequent words are in boldface. These examples indicate that the pattern vectors do cluster key sentences with common patterns like "...spread in WeChat Moments".

Human Evaluation and Case Study
The quality of the selected sentences cannot be automatically evaluated due to the lack of sentence-level labels. To answer EQ3, we conducted a human evaluation. We randomly sampled 370 claim-article pairs whose articles had over 20 sentences from the Weibo dataset. We then showed each claim and the top three sentences selected from the corresponding FC-article by MTM. Three annotators were asked to check whether each auto-selected sentence helped match the given query and the source article (i.e., whether it is a key sentence). Figure 5 shows that (a) MTM hit at least one key sentence in 83.0% of the articles; and (b) 73.0% of the sentences at Rank 1 are key sentences, followed by 65.1% at Rank 2 and 56.8% at Rank 3. This proves that MTM can find the key sentences in long FC-articles and provide helpful explanations. We also show the positional distribution in Figure 5(c), where key sentences are scattered throughout the articles. Using MTM to find key sentences can save fact-checkers the time of scanning these long articles to determine whether a given claim was fact-checked.
Additionally, we exhibit two cases from the evaluation set in Figure 6. These cases show that MTM found the key sentences corresponding to the characteristics described in Section 1. Please refer to Appendix D for further case analysis.

Conclusions
We propose MTM, which selects from fact-checking articles the key sentences that introduce or debunk claims. These auto-selected sentences are exploited in an end-to-end network for estimating the relevance of the fact-checking articles w.r.t. a given claim. Experiments on the public Twitter dataset and the private Weibo dataset show that MTM outperforms the state of the art. Moreover, human evaluation and case studies demonstrate that the selected sentences provide helpful explanations of the results.

Broader Impact Statement
Our work involves two scenarios that need the ability to detect previously fact-checked claims: (1) For social media platforms, our method can check whether a newly published post contains false claims that have already been debunked. The platform can then make users aware of the post's veracity by providing the key sentences selected from fact-checking articles, along with links to those articles.
(2) For manual or automatic fact-checking systems, it can serve as a filter to avoid redundant fact-checking work. When functioning well, it can assist platforms, users, and fact-checkers in maintaining more credible cyberspace; in failure cases, however, some well-disguised claims may escape. This method relies on the underlying fact-checking article databases, so their authority and credibility need to be carefully considered in practice. We did our best to make the new Weibo dataset, built for academic purposes, reliable. Appendix A introduces more details.

A Constructing the New Weibo Dataset
To construct datasets for fact-checked claim detection on social media, we need to (1) collect the fact-checked claims (social media posts); (2) collect fact-checking articles (FC-articles); and (3) generate claim-article pairs.
Collection. In Step (1), we used posts labeled as fake from datasets for fake news detection (Zhang et al., 2021; Zhou et al., 2015), because their labels were determined by fact-checking. In Step (2), we crawled fact-checking articles from multiple sources to enrich the article base; the sources are partially listed in Table 4 due to space limits. For claims and articles that contained much text in attached images, we recognized the text using the OCR service on the Baidu AI platform (https://ai.baidu.com/tech/ocr). Note that we only crawled claims and articles that were publicly available at crawling time. To protect privacy, the publishers' names were removed. However, we preserved names and offensive words in the main text because they were crucial for summarizing the events and performing the matching process.

Annotation. In Step (3), we performed model-assisted human annotation. We first de-duplicated the data collected in Steps (1) and (2) and then used BM25 to retrieve the relevant FC-articles as candidates, with the claims as queries. Twenty-six annotators (postgraduates) were instructed (by a Chinese guideline with examples, written by the first author) to check whether the candidates did fact-check the given claims. We dropped claims annotated as irrelevant to all candidates. For claims with highly overlapping candidates but different annotation results, the authors manually checked and corrected the wrongly annotated samples.

B Calculation of Evaluation Metrics
Assume that the query set Q has |Q| queries and that the i-th query has n_i relevant documents. We calculate the evaluation metrics using the following equations:

MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_i}

where rank_i refers to the rank position of the first relevant answer for the i-th query in the corresponding retrieval result (Wikipedia, 2021).

MAP@k = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\min(n_i, k)} \sum_{j=1}^{k} P_i(j) \, rel_i(j)

where P_i(j) is the proportion of returned documents in the top-j set for the i-th query that are relevant, and rel_i(j) is an indicator function equaling 1 if the document at rank j in the returned list for the i-th query is relevant and 0 otherwise (Li et al., 2016).

HIT@k = \frac{1}{|Q|} \sum_{i=1}^{|Q|} has_i(k)

where has_i(k) is an indicator function equaling 1 if rank_i ≤ k and 0 otherwise (Yang et al., 2012). Note that we guarantee that a query has at least one relevant document in its candidate list, so the corner case of an empty ground-truth set is ignored.
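These metrics translate directly into code; the snippet below is a compact reference implementation, assuming each query's result is a ranked list of 0/1 relevance labels.

```python
# Reference implementations of MRR, AP@k, and HIT@k as defined above.

def mrr(rankings: list[list[int]]) -> float:
    total = 0.0
    for labels in rankings:
        rank = next((j + 1 for j, rel in enumerate(labels) if rel), None)
        total += 1.0 / rank if rank else 0.0   # every query has >= 1 relevant doc
    return total / len(rankings)

def average_precision_at_k(labels: list[int], n_rel: int, k: int) -> float:
    hits, score = 0, 0.0
    for j, rel in enumerate(labels[:k], start=1):
        if rel:
            hits += 1
            score += hits / j                  # P_i(j) * rel_i(j)
    return score / min(n_rel, k)               # MAP@k averages this over queries

def hit_at_k(rankings: list[list[int]], k: int) -> float:
    return sum(any(labels[:k]) for labels in rankings) / len(rankings)
```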
C Implementation of BM25 and Baselines

BM25: The articles were indexed with gensim (Řehůřek and Sojka, 2010).

CTM: We used the official implementation (https://github.com/nguyenvo09/EMNLP2020), with ELMoForManyLangs (https://github.com/HIT-SCIR/ELMoForManyLangs) providing the Chinese ELMo embeddings.

[Figure 7: A case where one FC-article contains multiple similar claims.
Claim: As reported by the Korean People's Daily, Jae-Seo Jung, a professor at Ewha Womans University in Korea, refutes the claim that Cao Cao's tomb is at Anyang. According to his research on Chinese and Korean history, Professor Jung finds that Cao Cao was a Korean.
Auto-selected sentences by MTM:
S1. (The only key sentence) Some Chinese media quoted the Korean People's Daily as saying that Professor Jae-Seo Jung claimed that Cao Cao was a Korean.
S2. (Not a key sentence) Some Chinese media quoted Korean Daily's news, which said that Professor Huanjing Park of Sungkyunkwan University published a report saying that Sun Yat-Sen, the founding father of modern China, was a Korean.
S3. (Not a key sentence) According to the report, Cheng-Soo Park, a history professor at Seoul University in South Korea, said that after ten years of research, he believed that it was the Korean people who first invented Chinese characters; later, the Korean people brought Chinese characters to China, forming the present Chinese culture.]

[Figure 8: A case where MTM found no key sentence.
Claim: I noticed it in my WeChat Moments. It was very sad, but I still hope this is fake! [The plane crashed in Vietnam. All the people on board were probably dead!] CNN reported that the MH370 was confirmed to have fallen within 100 kilometers north of Ho Chi Minh City, Vietnam. Because of the rainstorm, the local people thought it was a falling meteorite. At present, it is still raining in the local area. As it is a mountainous area, it is difficult to carry out the search and rescue work.
Auto-selected sentences by MTM:
S1. (Not a key sentence) On the evening of the 8th, a short message purportedly from the "Vietnam News Agency" said: "Vietnam News Agency Express at 19:32 on March 8th: 17 hours after Malaysia Airlines flight MH370 lost contact, it was found by Philippine maritime vessels carrying out a search and rescue mission in the sea area of 06°55'15" N and 103°34'43" E."
S2. (Not a key sentence) Since then, the Boeing China President deleted the Weibo post, saying that "the plane has been found" was the wrong message, and the search continues.
S3. (Not a key sentence) On the afternoon of the 8th, the South China Sea Rescue Bureau said that it was a misunderstanding that the two search and rescue vessels previously reported by the media set out from Xisha and Haikou at 10:49 and 11:30, respectively.
Ground-truth key sentences:
GT1. CNN did not release the news that the plane that lost contact had crashed.
GT2. On the 8th of this month, it was spread online that "CNN said that flight MH370 crashed in Vietnam".
GT3. CNN's official account on Twitter is still using the term "lost contact", and the TV live broadcasts also use "missing" to describe MH370.]

D Further Case Analysis
We reviewed the fact-checking articles in the human evaluation set for which MTM hit fewer than two key sentences. We here exhibit two situations in which MTM did not perform well: (1) In Figure 7, the claim is about Cao Cao's origin. MTM found three sentences with significant patterns (shown in boldface), but only S1 is related to the claim; S2 and S3 introduce similar but irrelevant claims. This is because the fact-checking article is actually a collection of rumors about South Korea on Chinese social media. The claims in this article are all similar to each other, and differentiating them requires more delicate semantic understanding.
(2) Figure 8 shows a case where MTM found no key sentence in the article; we append the manually selected key sentences below the auto-selected ones. We speculate that the failure is due to the length of the given claim: it is longer than general posts on Weibo and contains many details, making the model lose focus on the key elements of the event description. Thus, S1, which describes another piece of news about MH370 in Vietnam, was selected instead of the ground-truth sentences. To achieve better performance, future work may consider improving semantic modeling and summarizing key information from both fact-checking articles and claims.