Abstract, Rationale, Stance: A Joint Model for Scientific Claim Verification

Scientific claim verification helps researchers find, from a large corpus, the target scientific papers and the sentence-level evidence for a given claim. Existing works propose pipeline models for the three tasks of abstract retrieval, rationale selection and stance prediction. Such works suffer from error propagation among the modules in the pipeline and from a lack of valuable information sharing among the modules. We thus propose an approach, named ARSJoint, that jointly learns the modules for the three tasks within a machine reading comprehension framework by including claim information. In addition, we enhance the information exchange and constraints among tasks by proposing a regularization term between the sentence attention scores of abstract retrieval and the estimated outputs of rationale selection. Experimental results on the benchmark dataset SciFact show that our approach outperforms the existing works.


Introduction
A system for scientific claim verification can help researchers find, from a large corpus, the target scientific papers and the sentence-level evidence for a given claim. To address this issue, Wadden et al. (2020) introduced scientific claim verification, which consists of three tasks. As illustrated in Figure 1, for a given claim, the system finds the abstracts related to the claim in a scholarly document corpus (abstract retrieval); it selects the sentences in a retrieved abstract that serve as evidence for the claim (rationale selection); and it classifies whether the abstract/sentences support or refute the claim (stance prediction). Wadden et al. (2020) also provided a dataset called SCIFACT.
Most existing works on general claim verification are based on pipeline models (Soleimani et al., 2020; Alonso-Reina et al., 2019; Zhou et al., 2019; Nie et al., 2019; Lee et al., 2020b); some works utilize joint optimization strategies (Lu and Li, 2020; Yin and Roth, 2018; Hidey et al., 2020). These models attempted to jointly optimize rationale selection and stance prediction, but did not directly link the two modules. For scientific claim verification, Wadden et al. (2020) proposed a baseline model VERISCI based on a pipeline of three components for the three tasks. Pradeep et al. (2021) proposed a pipeline model called VERT5ERINI which adapted the pre-trained sequence-to-sequence language model T5 (Raffel et al., 2020). The Paragraph-Joint model jointly trained the two tasks of rationale selection and stance prediction, with abstract retrieval and the joint module arranged as a pipeline.
The above existing works on scientific claim verification are fully or partially pipeline solutions. One problem with these works is error propagation among the modules in the pipeline. Another is that modules trained independently in a pipeline cannot share and leverage valuable information with each other. We therefore propose an approach, named ARSJOINT, which jointly learns the three modules for the three tasks. It adopts a Machine Reading Comprehension (MRC) framework which uses the claim content as the query to learn additional information. In addition, we assume that the abstract retrieval module should have good interpretability and tend to assign high sentence-level attention scores to the evidence sentences that influence the retrieval results; this is consistent with the goal of the rationale selection module. We thus enhance the information exchange and constraints among tasks by proposing a regularization term based on a symmetric divergence that bridges these two modules.
The experimental results on the benchmark dataset SCIFACT show that the proposed approach performs better than the existing works. The main contributions of this paper can be summarized as follows. (1) We propose a scientific claim verification approach which jointly trains the three tasks in an MRC framework. (2) We propose a regularization based on the divergence between the sentence attention of abstract retrieval and the outputs of rationale selection.

Notation and Definitions
We denote the query claim as q and an abstract of a scientific paper as a ∈ A. We denote the set of sentences in abstract a as S = {s_i}_{i=1}^{l}, where the word sequence of s_i is [s_{i1}, ..., s_{in_i}]. The title of the paper t ∈ T is used as auxiliary information; the word sequence of t is [t_1, ..., t_{n_t}]. Here, S, s_i and t refer to a by default, and we omit the subscript 'a' in the notations. The purpose of the abstract retrieval task is to detect the set of abstracts related to q; it assigns a relevance label y^b ∈ {0, 1} to a candidate abstract a. The rationale selection task is to detect the decisive rationale sentences S^r ⊆ S of a relevant to the claim q; it assigns an evidence label y^r_i ∈ {0, 1} to each sentence s_i ∈ S. The stance prediction task classifies a into a stance label y^e ∈ {SUPPORTS=0, REFUTES=1, NOINFO=2}. The sentences in a share the same stance label value.

Pre-processing
As there is a huge number of papers in the corpus, applying the proposed model to all of them is time-consuming. Therefore, similar to the existing works on this topic (Wadden et al., 2020; Pradeep et al., 2021), we first use a lightweight method to roughly select a set of candidate papers. We use BioSentVec (Chen et al., 2019; Pagliardini et al., 2018) to obtain the embeddings of the claim and of each scientific paper (based on its title and abstract), and compute the cosine similarity between the claim and the paper. The papers with the top-k similarities are used as the candidates.
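A minimal sketch of this candidate selection step is given below. The function name and the use of plain NumPy arrays are illustrative; in the paper the embeddings come from BioSentVec.

```python
import numpy as np

def top_k_candidates(claim_emb, paper_embs, k):
    """Rank papers by cosine similarity to the claim embedding and
    return the indices of the k most similar papers."""
    claim = claim_emb / np.linalg.norm(claim_emb)
    papers = paper_embs / np.linalg.norm(paper_embs, axis=1, keepdims=True)
    sims = papers @ claim  # cosine similarity of each paper to the claim
    return np.argsort(-sims)[:k].tolist()
```

The selected candidates are then passed to the joint model, so only k abstracts per claim need full processing.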

Joint Abstract, Rationale, Stance Model
The input sequence of our model, seq = [q, t, a], is obtained by concatenating the claim q, title t and abstract a. We compute the list of word representations H_seq of the input sequence with a pre-trained language model (e.g., BioBERT (Lee et al., 2020a)). We obtain the word representations of the claim from H_seq and use them in our ARSJOINT model. Figure 2 shows the framework of our model, with three modules for the three tasks.
In all three modules, we use an attention layer (denoted as g(·)) on word (sentence) representations to compute a sentence (document) representation. A document can be a claim, title, abstract, or a combination of them. The computation is as follows, where the * in h_* represents any type of sentence (claim q, title t or a sentence s in an abstract) for word-level attention, H_j denotes the j-th sentence representation of a document for sentence-level attention, and W and b are trainable parameters.
For word-level attention: u_{*j} = tanh(W_{w1} h_{*j} + b_{w1}), α_{*j} = exp(u_{*j}ᵀ u_{w2}) / Σ_k exp(u_{*k}ᵀ u_{w2}), h_* = Σ_j α_{*j} h_{*j}; for sentence-level attention: U_j = tanh(W_{c1} H_j + b_{c1}), α_j = exp(U_jᵀ U_{c2}) / Σ_k exp(U_kᵀ U_{c2}), h = Σ_j α_j H_j, (1) where u_{w2} and U_{c2} are trainable context vectors.

Abstract Retrieval: In this task, a title can be regarded as an auxiliary sentence that may contain information relating the abstract to the claim, so we use the title together with the sentences in the abstract. We build a document ta = [t, a] and concatenate the word representations of t and a into H_ta = [H_t, H_a] as the input to this module. We use a hierarchical attention network (HAN) (Yang et al., 2016) to compute the document representation h_ta ∈ R^d, h_ta = HAN(H_ta). HAN is well suited to document classification because it considers the hierarchical document structure (a document has sentences, a sentence has words). We also compute the sentence representation of the claim h_q ∈ R^d with a word-level attention layer, h_q = g(H_q). To compute the relevance between h_ta and h_q, we apply a Hadamard product to them followed by a Multi-Layer Perceptron (MLP, denoted as f(·)) with Softmax (denoted as σ(·)); the outputs [p^b_0, p^b_1] = σ(f(h_q • h_ta)) are the probabilities of whether the abstract is relevant to the claim. A cross entropy loss L_ret is used for training.

Rationale Selection: This task judges whether each sentence in the abstract is a rationale sentence or not. The multiple sentences in an abstract share the same title information but have different rationale labels; therefore, when judging each sentence, using the title may not positively influence the performance. We thus use the word representations H_a of the abstract as input. We compute the sentence representation h_{s_i} with a word-level attention layer, and use an MLP with Softmax to estimate the probabilities p^r_{i1} and p^r_{i0} of whether s_i is evidence for the abstract or not. The cross entropy loss is L_rat.

Stance Prediction: The module first computes the sentence representations h_{s_i} in the same way as in rationale selection.
After that, it selects only the sentences S^r with the true evidence label ŷ^r_i = 1 or with the estimated evidence probability p^r_{i1} > p^r_{i0}; whether to use the true label or the estimated label is decided by a scheduled sampling strategy, introduced below. We then compute the estimated stance labels with a sentence-level attention layer and an MLP with Softmax, h_{S^r} = g(H_{S^r}) and [p^e_0, p^e_1, p^e_2] = σ(f(h_q • h_{S^r})), where S^r = {s_i ∈ S | ŷ^r_i = 1 or p^r_{i1} > p^r_{i0}}. The cross entropy loss is L_sta.

Scheduled Sampling: Since the rationale sentences S^r are used in stance prediction, errors of the rationale selection module propagate to the stance prediction module. To alleviate this problem, we use a scheduled sampling method (Bengio et al., 2015): at the beginning of training, only sentences with the true evidence label ŷ^r_i = 1 are fed to the stance prediction module, and the proportion of sentences selected by the estimated evidence probability p^r_{i1} > p^r_{i0} is then gradually increased, until eventually all sentences in S^r are based on the estimated evidence. We set the sampling probability of using the estimated evidence as p_sample = sin(π/2 × (current_epoch − 1)/(total_epoch − 1)).

Rationale Regularization (RR): Attention scores have been used for interpretability in NLP tasks (Serrano and Smith, 2019; Wiegreffe and Pinter, 2019; Sun and Lu, 2020). We assume that the abstract retrieval module should have good interpretability and tend to assign high sentence-level attention scores to the evidence sentences that influence the retrieval results; this is consistent with the goal of the rationale selection module. We thus enhance the information exchange and constraints among tasks by proposing a regularization term based on a symmetric divergence between the sentence attention scores α of abstract retrieval and the estimated outputs y^r of rationale selection, bridging these two modules.
The detailed formula is as follows, where p and q denote the two distributions α and y^r (normalized over the sentences of an abstract): L_rr(p, q) = Σ_i p_i log(p_i / q_i) + Σ_i q_i log(q_i / p_i), i.e., the symmetrized KL divergence between the two.
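The two training devices above can be sketched as follows. The symmetrized KL divergence is one concrete instance of a symmetric divergence; its exact form and interaction with the loss weight γ are assumptions of this sketch, and the function names are illustrative.

```python
import math

def p_sample(current_epoch, total_epoch):
    """Probability of feeding sentences selected by the estimated
    rationale probabilities (rather than the gold labels) to the
    stance module; grows from 0 to 1 over the course of training."""
    return math.sin(math.pi / 2 * (current_epoch - 1) / (total_epoch - 1))

def rationale_regularizer(alpha, y_r, eps=1e-12):
    """Symmetrized KL divergence KL(p||q) + KL(q||p) between the
    sentence attention scores of abstract retrieval (alpha) and the
    estimated rationale probabilities (y_r), each normalized to sum
    to one over the sentences of an abstract."""
    p = [a / sum(alpha) for a in alpha]
    q = [y / sum(y_r) for y in y_r]
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) +
        qi * math.log((qi + eps) / (pi + eps))
        for pi, qi in zip(p, q)
    )
```

The regularizer is zero when the attention scores and the rationale probabilities agree, and grows as they diverge, pushing the retrieval module to attend to the same sentences the rationale module selects.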

Experimental Settings
Dataset: We utilize the benchmark dataset SCIFACT. It consists of 5,183 scientific papers with titles and abstracts, and 1,109 claims in the training and development sets. Table 1 presents the statistics of the dataset.

Experimental Settings: For our ARSJOINT model, we use Optuna (Akiba et al., 2019) to tune the hyperparameters λ_1, λ_2, λ_3 and γ of the loss L on 20% of the training set, based on the performance on another 20% of the training set. We choose the optimal hyperparameters by the average F1-score of the abstract-level and sentence-level evaluations.
The search range of each of these four hyperparameters is set to [0.1, 12], and the number of search trials is set to 100. Table 2 lists the selected weight hyperparameters of our model. The other hyperparameters, such as the learning rate, follow the ones used in existing work to make a fair comparison; they are listed in Table 3. We implement our ARSJOINT model in PyTorch. Since the length of the input sequence seq is often greater than the maximum input length of a BERT-based model, we perform a tail-truncation operation on the sentences of any seq that exceeds the maximum input length. For the pre-trained language model, we verify our approach using RoBERTa-large and BioBERT-large (Lee et al., 2020a), the latter trained on a biomedical corpus; both are fine-tuned on the SCIFACT dataset. In addition, the MLP in our model has two layers.
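The tail-truncation step can be sketched as follows. This is a simplified token-level version; the policy of which sentence to trim first is an assumption (here we repeatedly trim the longest sentence), and the actual implementation operates on the subword tokens of the BERT-based model.

```python
def tail_truncate(sentences, max_len):
    """Drop tokens from the tail of sentences, longest sentence first,
    until the concatenated sequence fits within max_len tokens."""
    sents = [list(s) for s in sentences]
    while sum(len(s) for s in sents) > max_len:
        longest = max(sents, key=len)
        longest.pop()  # remove the last (tail) token of the longest sentence
    return sents
```

Trimming tails rather than whole sentences preserves at least the opening tokens of every sentence, so no sentence is dropped entirely from the input.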
We compare our ARSJOINT approach with Paragraph-Joint, VERISCI (Wadden et al., 2020) and VERT5ERINI (Pradeep et al., 2021), using their publicly available code. The "Paragraph-Joint Pre-training" model is pre-trained on the FEVER dataset (Thorne et al., 2018) and then fine-tuned on the SCIFACT dataset; "Paragraph-Joint SCIFACT-only" is not pre-trained on other datasets.

Table 3: Hyperparameter settings following the existing work. k_tra and k_ret are the numbers of candidate abstracts per claim in the training and testing stages. lr_1 and lr_2 are the learning rates of the BERT-based model and of the other modules of the proposed model.

Evaluation: We evaluate the methods using the abstract-level and sentence-level evaluation criteria given in SCIFACT. Abstract-level evaluation measures the performance of a model in detecting the abstracts that support or refute the claims. For the "Label-Only" evaluation, given a claim q, the classification result of an abstract a is correct if both the estimated relevance label ŷ^b and the estimated stance label ŷ^e are correct. For the "Label+Rationale" evaluation, the abstract must, in addition, be correctly rationalized, i.e., the estimated rationale sentences must contain a gold rationale. Sentence-level evaluation measures the performance of a model in detecting rationale sentences. For the "Selection-Only" evaluation, an estimated rationale sentence s_i of an abstract a is correctly selected if the estimated rationale label ŷ^r_i is correct and the estimated stance label ŷ^e is not "NOINFO". In particular, if multiple consecutive sentences are gold rationales, all of them should be estimated as rationales. For the "Selection+Label" evaluation, the estimated rationale sentences must, in addition, be correctly labeled, i.e., the estimated stance label ŷ^e of the abstract must be correct. The evaluation metrics are F1-score (F1), Precision (P), and Recall (R).
We train the model using all training data and, since Wadden et al. (2020) do not publish the labels of the test set, we evaluate the approaches on the development set, following the existing works.

Table 5: Result example of Rationale Regularization. Given a claim, it lists sentences from an abstract; α_i is the sentence attention score in the abstract retrieval task, ŷ^r_i is the estimated rationale label, and y^r_i is the true rationale label.

Sentence | α_i | ŷ^r_i | y^r_i
In addition, acute peritoneal inflammation recruited preferentially Ly-6C(med-high) monocytes. | 0.1613 | 1 | 1
Taken together, these data identify distinct subpopulations of mouse blood monocytes that differ in maturation stage and capacity to become recruited to inflammatory sites. | 0.0745 | 0 | 0

Experimental Results
First, our ARSJOINT approach outperforms the existing works with fully or partially pipeline designs: VERISCI and VERT5ERINI are pipeline models, and Paragraph-Joint is a partially pipeline model that jointly models only two of the tasks. This shows that the proposed model, which jointly learns all three tasks, effectively improves performance.
Second, when using the same pre-trained model RoBERTa-large, both ARSJOINT (RoBERTa) and ARSJOINT w/o RR (RoBERTa) perform better than "Paragraph-Joint SCIFACT-only", especially on Recall. This shows that jointly learning with the abstract retrieval task can improve performance. For the Paragraph-Joint method, "Paragraph-Joint Pre-training", which is pre-trained on the FEVER dataset, performs much better than "Paragraph-Joint SCIFACT-only", which has no pre-training on other datasets. Similarly, when we replace RoBERTa-large with BioBERT-large, which contains biomedical knowledge, ARSJOINT (BioBERT) achieves better performance than "Paragraph-Joint Pre-training".
Third, as an ablation study of the proposed RR, there is a significant difference between the models with and without RR when using BioBERT-large. Although the difference is small when using RoBERTa-large, there is still an improvement on Recall. This indicates that rationale regularization can effectively improve the performance of the model. Table 5 shows an example of the results with RR; it lists a claim and the sentences from an abstract. The attention scores of the sentences in the abstract retrieval task are consistent with the true rationale labels (as well as the estimated rationale labels), showing that the abstract retrieval module has good interpretability.

Conclusion
In this paper, we propose a joint model named ARSJOINT for the three tasks of abstract retrieval, rationale selection and stance prediction in scientific claim verification, within an MRC framework that includes the claim as the query. We also propose a regularization based on the divergence between the sentence attention of the abstract retrieval task and the outputs of the rationale selection task. The experimental results show that our method achieves better results on the benchmark dataset SCIFACT. In future work, we will try pre-training the model on general claim verification datasets such as FEVER (Thorne et al., 2018) to further improve performance.