Tribrid: Stance Classification with Neural Inconsistency Detection

We study the problem of performing automatic stance classification on social media with neural architectures such as BERT. Although these architectures deliver impressive results, their performance is not yet comparable to that of humans, and they might produce errors that have a significant impact on the downstream task (e.g., fact-checking). To improve the performance, we present a new neural architecture where the input also includes automatically generated negated perspectives over a given claim. The model is trained jointly to make multiple predictions simultaneously, which can be used either to improve the classification of the original perspective or to filter out doubtful predictions. In the first case, we propose a weakly supervised method for combining the predictions into a final one. In the second case, we show that using the confidence scores to remove doubtful predictions allows our method to achieve human-like performance over the retained information, which is still a sizable part of the original input.


Introduction
The spreading of unverified claims on social media is an important problem that affects our society at multiple levels (Vlachos and Riedel, 2014; Ciampaglia et al., 2015; Hassan et al., 2017). A valuable asset that we can exploit to fight this problem is the set of perspectives that people publish about such claims. These perspectives reveal the users' stance, and this information can help an automated framework to determine the veracity of rumors more accurately (Castillo et al., 2011; Bourgonje et al., 2017).
The stance can be generally categorized either as supportive or opposing. For instance, consider the claim "The elections workers in Wisconsin illegally altered absentee ballot envelopes". A tweet with a supportive stance is "The number of people who took part in the election in Wisconsin exceeded the total number of registered voters" while one with an opposed stance is "An extra zero was added as votes accidentally but it was quickly fixed after state officials noticed it".
Being able to automatically classify the stance is necessary for dealing with the large volume of data that flows through social networks. To this end, earlier solutions relied on linguistic features, such as n-grams, opinion lexicons, and sentiment (Somasundaran and Wiebe, 2009; Anand et al., 2011; Hasan and Ng, 2013; Sridhar et al., 2015), while more recent methods additionally include features that we can extract from networks like Twitter (Chen and Ku, 2016; Lukasik et al., 2016; Sobhani et al., 2017; Kochkina et al., 2017). The current state-of-the-art relies on advanced neural architectures such as BERT (Devlin et al., 2019) and returns remarkable performance. For instance, STANCY (Popat et al., 2019) can achieve an F1 of 77.76 on the PERSPECTRUM (PER) dataset against the F1 of 90.90 achieved by humans (Chen et al., 2019).
Although these results are encouraging, the performance has not yet reached a level at which it can be safely applied in contexts where errors must be avoided at all costs. Consider, for instance, the cases when errors lead to a misclassification of news about a catastrophic event, or when they trigger wrong financial operations. In such contexts, we argue that it is better that the AI abstains from returning a prediction unless it is very confident about it. This requirement clashes with the design of current solutions, which are meant to "blindly" make a prediction for any input. Therefore, we see a gap between the capabilities of the state-of-the-art and the needs of some realistic use cases.
In this paper, we address this problem with a new BERT-based neural network, which we call Tribrid (TRIplet Bert-based Inconsistency Detection), that is designed not only to produce a reliable and accurate classification, but also to test its confidence. This test is implemented by including a "negated" version of the original perspective as part of the input, following the intuition that a prediction is more trustworthy if the model produces the opposite outcome with the negated perspective. If that is not the case, then the model is inconsistent and the prediction should be discarded.
Testing the consistency of the model with negated perspectives is a task that can be done simply by computing two independent predictions, one with the original perspective and one with the negated one. However, this is suboptimal because existing state-of-the-art methods are trained only with the principle that supportive perspectives should be similar to their respective claims in the latent space (Popat et al., 2019), and the similarity might be sufficiently high even if there are keywords (e.g., "not") that negate the stance. To overcome this problem, we propose a new neural architecture that processes both original and negated perspectives simultaneously, using a siamese BERT model and a loss function that maximises the distance between the two perspectives. In this way, the model learns to distinguish supportive and opposing perspectives more clearly.
To cope with the large volume of information that flows through social networks, it is important that the negated perspectives are generated automatically or at least with minimal human intervention. To this end, two types of techniques have been presented in the literature. One consists of attaching a fixed phrase which negates the meaning (Bilu et al., 2015) while the other adds or removes the first occurrences of tokens like "not" (Niu and Bansal, 2018; Camburu et al., 2020). Tribrid implements this task in a different way, namely using simple templates that negate both with keywords (e.g., "not") and antonyms.
The prediction scores obtained with Tribrid can be used either to determine more accurately the stance of the original perspective or to discard low-quality predictions. We consider both cases. In the first one, we propose an approach where multiple classifiers are constructed from the scores and a final weakly supervised model combines them. In the second one, we propose several approaches to establish the confidence and describe how to use them to discard low-quality predictions. Our experiments show that our method is competitive in both cases. For instance, in the second case, our approach was able to achieve an F1 of 87.43 on PER by excluding only 39.34% of the perspectives. Moreover, the score increased to 91.26 when 30% more is excluded. Such performance is very close to that of humans, and this opens the door to an application in contexts where errors are very costly.
The source code and other experimental data can be found at https://github.com/karmaresearch/tribrid.

Background and Related Work
Stance classification aims to determine the stance of an input perspective that supports or opposes another given claim. In earlier studies, the research mainly focused on online debate posts using traditional classification approaches (Thomas et al., 2006; Murakami and Raymond, 2010; Walker et al., 2012). Afterwards, other approaches focused on spontaneous speech (Levow et al., 2014), and on student essays (Faulkner, 2014). Thanks to the rapid development of social media, the number of studies on tweets has increased substantially (Rajadesingan and Liu, 2014; Chen and Ku, 2016; Lukasik et al., 2016; Sobhani et al., 2017; Kochkina et al., 2017), especially boosted by dedicated SemEval challenges (Mohammad et al., 2016; Kochkina et al., 2017) and benchmarks (Bar-Haim et al., 2017; Chen et al., 2019).
In more recent work, the NLP community investigated how to use deep neural networks to improve the performance. Some representatives are LSTM-based approaches (Du et al., 2017; Sun et al., 2018; Wei et al., 2018), RNN-based approaches (Sobhani et al., 2019; Borges et al., 2019), CNN approaches (…, 2017) and more recently BERT-based approaches (Chen et al., 2019; Popat et al., 2019; Schiller et al., 2020). Techniques like attention mechanisms (Du et al., 2017), memory networks (Mohtarami et al., 2018), lexical features (Riedel et al., 2017; Hanselowski et al., 2018), transfer learning and multi-task learning (Schiller et al., 2020) can also improve the performance. All these approaches focus on identifying the most effective method to achieve the highest possible performance using syntactic and semantic features from the input. In contrast, our goal is to improve the performance by injecting background knowledge in the form of negated text, encouraging the model to produce more consistent predictions. As far as we know, we are the first to study this form of optimization to improve the performance of stance classification.
Figure 1: BERT used for stance classification by Chen et al. (2019) and Popat et al. (2019)
The generation of negated perspectives can be viewed as an instance of the broader problem of constructing adversarial examples, which has drawn increasing attention in NLP research in recent years (Wang et al., 2019). The main focus of adversarial generation is to change the text (with character/word changes or removals) to train more robust models. Most works focus on changes that do not alter the semantics of the original input (Belinkov and Bisk, 2018; Xiao et al., 2017; Iyyer et al., 2018; Cheng et al., 2020), while far fewer works negate the semantics. In this context, some initial works have manually constructed some small-scale tests (Mahler et al., 2017). More recently, Gardner et al. (2020) suggested that datasets should be perturbed by experts with small changes to the test instances. In this way, empirical evaluations can test the true linguistic capabilities of the models more accurately. In their evaluation, they selected PER, one of the datasets that we also consider, and showed that models such as BERT perform significantly worse on the perturbed dataset. Ribeiro et al. (2020) also consider the problem that accuracy on a held-out dataset may overestimate the performance in a real scenario. To counter this, they propose a new methodology that involves tests with certain perturbations (like negation), but they do not consider stance classification. These works further motivate our effort to develop models that are more consistent when presented with negated inputs. Moreover, another goal of our work is to use consistency as a proxy to measure uncertainty and to discard low-quality predictions. A similar objective was pursued by Kochkina and Liakata (2020) for the problem of rumor verification.
Since manually creating additional test instances is time consuming, some works propose automatic procedures to generate them. Bilu et al. (2015) propose to add a fixed phrase at the end ("but this is not true"), while Niu and Bansal (2018) add the token "not" before the first verb in the sentence or replace it with its antonym. Finally, Camburu et al. (2020) suggest a simpler alternative: removing the first occurrence of "not". In contrast to them, we use templates in the form of if-then rules.

Our Approach
First, we provide a short description of how BERT has been used to achieve the state-of-the-art for this problem (Section 3.1). Then, we describe our proposed neural network (Section 3.2) and how we can interpret its output to classify the stance (Section 3.3). Finally, we discuss the generation of negated perspectives using templates (Section 3.4).

BERT Base and STANCY
Our input is a sentence pair ⟨C, P⟩ where C is the input claim and P is the perspective. In 2019, Chen et al. proposed to concatenate C and P using two tokens [CLS] and [SEP] to delimit the claim and the perspective, respectively, and to feed the resulting string to BERT (see Figure 1a). We call this approach BERT_base. A few months later, Popat et al. (2019) proposed an improvement based on the assumption that the latent representation of a perspective should be similar to that of the claim if the perspective supports the claim, and dissimilar otherwise. The resulting network is called STANCY (see Figure 1b). The main idea is to compute a latent representation of the claim and perspective (Σ_cp) and one for the claim alone (Σ_c). The two are compared with cosine similarity, denoted with cos(·), and passed to a final dense layer that performs the classification.

Tribrid: Neural Architecture
Current solutions deploy one BERT model passing, as input, the claim followed by the perspective to obtain the latent representation (see, e.g., Figure 1a). This approach is not ideal for us because we would like to pass more information as input (the negated perspective), and this might lead to a string that is too long. We address this problem by using multiple BERT models that share the same parameters, thus creating a network that is often labeled as a siamese network (Bromley et al., 1993).
A schematic view of Tribrid is shown in Figure 2. As input, Tribrid receives the triplet ⟨C, P, NP⟩ where NP is the negated perspective. We also provide a simpler architecture where the input is the pair ⟨C, P⟩, which we call Tribrid_pos.
In Tribrid, each component of the input triplet is fed to a BERT model that shares its parameters with the other two models. The three BERT models compute latent representations for C, P, and NP, henceforth written as Σ_c, Σ_p, and Σ_¬p. In the second stage, the network concatenates them into a single representation. We experimented with the concatenation techniques proposed in Sentence-BERT (Reimers and Gurevych, 2019), InferSent (Conneau and Kiela, 2018) and Universal Sentence Encoder (Cer et al., 2018), where X denotes the representation of a perspective, |·| is the element-wise distance, and * is the element-wise multiplication. We selected (Σ_c, X, |Σ_c − X|, Σ_c * X) as it slightly outperformed the others in our experiments and further concatenated cos(Σ_c, X) to it because their similarity is valuable for predicting the stance (Popat et al., 2019).
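As a minimal sketch of the selected concatenation, the following plain-Python function builds (Σ_c, X, |Σ_c − X|, Σ_c * X, cos(Σ_c, X)) from two vectors. The function names and the toy two-dimensional vectors are assumptions made for illustration; in the actual network these operations would act on BERT embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def combine(sigma_c, x):
    """Flat list (sigma_c, x, |sigma_c - x|, sigma_c * x, cos(sigma_c, x))."""
    abs_diff = [abs(a - b) for a, b in zip(sigma_c, x)]   # element-wise distance
    product = [a * b for a, b in zip(sigma_c, x)]         # element-wise multiplication
    return list(sigma_c) + list(x) + abs_diff + product + [cosine(sigma_c, x)]

# Toy vectors (hypothetical values): a claim and a perspective embedding.
features = combine([1.0, 0.0], [0.0, 2.0])
```

For embeddings of size d, the resulting vector has 4d + 1 entries, which is the input size the final dense layer would need under this reading.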
Finally, the result of the concatenation is passed to a final dense layer that returns two logits λ_s and λ_o, which estimate the likelihood of supporting and opposing stances, respectively.
To train the model, we introduce the loss function L = L_c + L_e + L_d, whose components are described below (with Tribrid_pos, L = L_c + L_e). The first component L_c is a standard cross-entropy loss computed on ŷ, where ŷ ∈ {λ_s, λ_o} is the logit of the true stance. The second component L_e is the cosine embedding loss with target y, where y = 1 if the perspective supports the claim and −1 if it is opposed to it.
The third component L_d is added because L_c and L_e do not take into account the fact that the perspective that supports the claim should be "closer" to the claim than the perspective that opposes it. To this end, we add a triplet loss (Schroff et al., 2015) over Σ_c, Σ+, and Σ−, where Σ+ = Σ_p and Σ− = Σ_¬p if the input perspective supports the claim, and Σ+ = Σ_¬p and Σ− = Σ_p otherwise.
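The three components can be sketched in their standard textbook forms as follows. The exact formulas are not reproduced in this text, so the sketch rests on several assumptions: the margins (`margin=0.0` and `margin=1.0`) are hypothetical defaults, the cosine embedding loss is assumed to compare Σ_c with Σ_p, and the triplet loss is assumed to use the Euclidean distance with the claim as anchor.

```python
import math

def cross_entropy(logit_s, logit_o, supports):
    """L_c: negative log-softmax of the true-stance logit (numerically stable)."""
    m = max(logit_s, logit_o)
    log_z = m + math.log(math.exp(logit_s - m) + math.exp(logit_o - m))
    return log_z - (logit_s if supports else logit_o)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cosine_embedding(sigma_c, sigma_p, y, margin=0.0):
    """L_e: pull the vectors together if y == 1, push them apart if y == -1.
    The margin value is an assumption."""
    c = cosine(sigma_c, sigma_p)
    return 1.0 - c if y == 1 else max(0.0, c - margin)

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet(sigma_c, sigma_pos, sigma_neg, margin=1.0):
    """L_d: the claim should be closer to sigma_pos than to sigma_neg
    by at least `margin` (hypothetical value)."""
    return max(0.0, euclidean(sigma_c, sigma_pos) - euclidean(sigma_c, sigma_neg) + margin)

def total_loss(logit_s, logit_o, sigma_c, sigma_p, sigma_neg_p, supports):
    """L = L_c + L_e + L_d, selecting sigma+ and sigma- from the true stance."""
    y = 1 if supports else -1
    pos, neg = (sigma_p, sigma_neg_p) if supports else (sigma_neg_p, sigma_p)
    return (cross_entropy(logit_s, logit_o, supports)
            + cosine_embedding(sigma_c, sigma_p, y)
            + triplet(sigma_c, pos, neg))
```

The triplet term is zero only when the correctly aligned perspective is already closer to the claim than the other one by the margin, which is the ordering that L_c and L_e alone do not enforce.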

Stance Classification
We can make use of four signals to predict the stance. The first two are the logits λ_s and λ_o. The third one is δ_p = ||Σ_c − Σ_p||, i.e., the distance between the claim and the input perspective. The fourth one is δ_¬p = ||Σ_c − Σ_¬p||, i.e., the distance to the negated perspective. The values of these signals can be combined in different ways in order to compute a final binary decision. We define two alternative procedures. The first procedure consists of picking the logit with the highest value as the final label, e.g., if λ_s > λ_o, then the stance should be support. In contrast, the second procedure looks at the distance values. In this case, it chooses the final label depending on which perspective is closer to the claim, i.e., the system should return "support" if δ_p < δ_¬p or "oppose" otherwise.
In both cases, the confidence of the model can be quantified by the difference between the two signals. If the difference is at least τ, where τ is a given threshold, then we can accept the outcome, trusting that the system is sufficiently confident. Otherwise, we abstain from making a prediction. Following this principle, we introduce two decision procedures, K_τ over the logits and Λ_τ over the distances, each of which returns S (support), O (oppose), or A (abstain).
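The two procedures can be sketched as below. The formal definitions are not reproduced in this text, so the sketch is a reading of the prose: a label is returned only when the relevant difference reaches τ, and the function names `k_tau` and `lambda_tau` are chosen here for illustration.

```python
def k_tau(logit_s, logit_o, tau):
    """Logit-based decision: S, O, or A (abstain)."""
    gap = logit_s - logit_o
    if gap >= tau:
        return "S"
    if -gap >= tau:
        return "O"
    return "A"

def lambda_tau(delta_p, delta_neg_p, tau):
    """Distance-based decision: support when the original perspective is
    closer to the claim than the negated one by at least tau."""
    gap = delta_neg_p - delta_p
    if gap >= tau:
        return "S"
    if -gap >= tau:
        return "O"
    return "A"
```

With τ = 0 both procedures reduce to the plain argmax and nearest-perspective rules, so τ acts purely as a knob on how much evidence is required before committing.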
The advantage of these procedures is that we can select the minimum amount of acceptable confidence by choosing an appropriate τ . In practice, we can use a small validation dataset or pick τ so that at most X% of the data is excluded.
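One way to pick τ so that at most X% of the data is excluded is to take an order statistic of the confidence gaps on a validation set. The sketch below is only an illustration of that idea: `pick_tau` and the gap values are hypothetical, and a gap stands for |λ_s − λ_o| or |δ_p − δ_¬p| depending on the procedure.

```python
def pick_tau(gaps, max_discard_pct):
    """Largest observed gap usable as tau such that the share of items
    with gap < tau does not exceed max_discard_pct percent."""
    ordered = sorted(gaps)
    # Index of the first item that must be kept.
    k = min((max_discard_pct * len(ordered)) // 100, len(ordered) - 1)
    return ordered[k]

def discard_rate(gaps, tau):
    """Fraction of items on which the classifier would abstain."""
    return sum(1 for g in gaps if g < tau) / len(gaps)

# Hypothetical confidence gaps measured on a validation set.
validation_gaps = [0.2, 3.1, 0.9, 5.4, 1.7, 0.1, 2.6, 4.0, 0.5, 1.2]
tau = pick_tau(validation_gaps, 30)
```

Because each method has its own distribution of gaps, the same discard budget generally yields a different τ per method.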
In case the user is not willing to discard any prediction, we propose a third decision procedure where the system never abstains. In essence, our proposal consists of creating multiple K_τ and Λ_τ classifiers with different τ, which are fed to an ensemble method that makes the final prediction.
A simple example of an ensemble method is majority voting, but this technique does not consider latent correlations between the classifiers. To take those into account, we can rely on weak supervision. In particular, we can use the state-of-the-art method proposed by Fu et al. (2020), which is called FlyingSquid. As far as we know, methods like the one of Fu et al. have not yet been considered for stance classification. We show here that they lead to an improvement of accuracy.
The main goal of FlyingSquid is to learn a model that is able to compute a probabilistic label (which is the stance in our case) from a set of noisy labeling functions (in our case the K_τ and Λ_τ classifiers) given as input. This method is particularly interesting to us for two reasons: 1) it does not need access to ground truth annotations (which are scarce in our context), and 2) it can find the optimal model parameters quickly, without iterative procedures like gradient descent.
We proceed as follows. First, we create n classifiers K_i and n classifiers Λ_j, where i ∈ {τ^K_1, ..., τ^K_n} and j ∈ {τ^Λ_1, ..., τ^Λ_n}. For a given triplet ⟨C, P, NP⟩, these classifiers produce 2n labels, each in {S, O, A}. These labels form the input for FlyingSquid, which learns a model from the labels' correlations and returns a final label l ∈ {S, O} for every ⟨C, P, NP⟩.
To recap, we proposed three approaches for stance classification with our neural model. The first is the K_τ classifier, the second is Λ_τ, while the third is a weak supervision model (FlyingSquid) built from multiple K_τ and Λ_τ classifiers. The first two classifiers might abstain if the model is not confident while the third one always returns a binary output. Henceforth, we refer to them as Tribrid_l, Tribrid_d, and Tribrid_w, respectively.
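The construction of the 2n labels, together with the simple majority-voting ensemble mentioned earlier, can be sketched as follows. This is not the FlyingSquid model: majority voting is used here only because it needs no external library. The threshold lists are the ones reported later in the evaluation, while the tie-breaking rule (fall back to the sign of the logits) is an assumption.

```python
from collections import Counter

K_TAUS = [5, 5.5, 8.5, 11, 13]            # thresholds for the logit-based classifiers
LAMBDA_TAUS = [0.01, 0.2, 1.3, 1.5, 1.9]  # thresholds for the distance-based classifiers

def decide(gap, tau):
    """Shared thresholding rule: S, O, or A (abstain)."""
    if gap >= tau:
        return "S"
    if -gap >= tau:
        return "O"
    return "A"

def label_vector(logit_s, logit_o, delta_p, delta_neg_p):
    """The 2n labels produced by the K and Lambda classifiers for one triplet."""
    logit_gap = logit_s - logit_o
    dist_gap = delta_neg_p - delta_p
    return ([decide(logit_gap, t) for t in K_TAUS]
            + [decide(dist_gap, t) for t in LAMBDA_TAUS])

def majority_vote(labels, fallback):
    """Final label in {S, O}; abstentions are ignored, ties use `fallback`."""
    counts = Counter(label for label in labels if label != "A")
    if counts["S"] > counts["O"]:
        return "S"
    if counts["O"] > counts["S"]:
        return "O"
    return fallback

labels = label_vector(6.0, 0.0, 0.3, 1.0)
final = majority_vote(labels, fallback="S")
```

A weak supervision model would replace `majority_vote` and weigh each classifier by its estimated accuracy and correlations instead of counting votes equally.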

Templates for Automatic Negation
To create negated perspectives, we use templates that are encoded as if-then rules of the form A ⇒ B. The rules contain instructions on how to change the text. In case more rules apply to the same perspective, then only the first application is kept.
For our purpose, the templates should be relatively simple so that they can be applied to large volumes of text and do not capture biases in some datasets. Moreover, the templates should make only a few changes to the text because we would like to teach the model to pay attention to specific tokens, or combinations of words, that can potentially change the stance.
With these desiderata in mind, we randomly picked some perspectives from multiple datasets and negated them by encoding meaningful changes into rules. This process returned a list of about 60 templates. From this list, we extracted 14 templates which are enough to cover about 90% of the cases (see Section 4.1 for more details about the coverage and the appendix for the list of all templates). As we can see from the list, these patterns are fairly simple and mostly reduce to a strategic placement of "not" or to the replacement of words with antonyms.
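To illustrate how if-then rules of the form A ⇒ B can be applied, the sketch below encodes a few rules as regular-expression rewrites and keeps only the first application. The three rules are hypothetical examples written for this sketch and are not the 14 templates of the appendix.

```python
import re

# Each rule is (pattern A, replacement B). These rules are illustrative only.
RULES = [
    (re.compile(r"\b(is|are|do|does|should|can) not\b"), r"\1"),   # drop an existing "not"
    (re.compile(r"\b(is|are|should|can)\b"), r"\1 not"),           # insert "not"
    (re.compile(r"\blegal\b"), "illegal"),                         # antonym replacement
]

def negate(perspective):
    """Apply the first matching rule once; return None if no rule applies."""
    for pattern, replacement in RULES:
        negated, applied = pattern.subn(replacement, perspective, count=1)
        if applied:
            return negated
    return None
```

For instance, `negate("Animal testing should be banned")` yields "Animal testing should not be banned", while a sentence that matches no rule returns `None`, which signals that no negated perspective is available for that input.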

Evaluation
We tested our approach on the datasets PER (Chen et al., 2019) and IBMCS (Bar-Haim et al., 2017), which are the main datasets previously used by our competitors. PER is a set of claims and perspectives constructed from online debate websites while IBMCS is a similar dataset released by IBM. Statistics on both datasets are in Tables 1a and 1b. We did not use the datasets proposed in SemEval-2016, Task 6 (Mohammad et al., 2016) and SemEval-2017, Task 8 (Derczynski et al., 2017) for the predictions because our work focuses on binary classification while these datasets also include additional classes such as "neutral", "query", or "comment". Extending our method to predict more than two classes should be seen as future work.
We implemented our approach using PyTorch 1.4.0, and a BERT BASE model with 12 layers, 768 hidden size, and 12 attention heads. We fine-tuned BERT with grid search, optimizing the F1 on a validation dataset, with learning rates in {1, 2, 3, 4, 5} × 10^−5, batch sizes in {24, 28, 32}, and the Adam optimizer. For our experiments, we used a machine with a TitanX GPU with 12GB RAM.
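The search space above amounts to 15 configurations. A minimal sketch of the enumeration is given below; `validation_f1` is a hypothetical callable standing in for a full fine-tuning run followed by evaluation on the validation set.

```python
from itertools import product

LEARNING_RATES = [k * 1e-5 for k in (1, 2, 3, 4, 5)]
BATCH_SIZES = [24, 28, 32]

def grid_search(validation_f1):
    """Return the (learning_rate, batch_size) pair with the best validation F1.
    `validation_f1(lr, batch_size)` is assumed to train and evaluate one model."""
    return max(product(LEARNING_RATES, BATCH_SIZES),
               key=lambda config: validation_f1(*config))

configs = list(product(LEARNING_RATES, BATCH_SIZES))
```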
In the following, we first discuss the results using the confidence-based K_τ and Λ_τ classifiers (Tribrid_l and Tribrid_d). Then, we study the performance with our weakly supervised approach. Finally, we provide an analysis of the coverage of the templates for negating the perspectives.

Confidence-based Stance Classification
To establish how general our templates are, we have applied them to the text in PER and IBMCS and in the dataset of SemEval2016, Task 6. Moreover, we also considered datasets which are used for other NLP tasks, namely ARC (Habernal et al., 2018;Hanselowski et al., 2018). As we can see in Table 2, the templates generalize well as they can negate more than 90% of the perspectives both in PER and IBMCS and between 81.0% and 100% in the other cases. For now, we restrict our analysis to the subset of perspectives for which there is a negation.
In the next section, we will also consider the (few) remaining cases. First, it is interesting to look at the percentage of cases when the four signals, i.e., λ_s, λ_o, δ_p, and δ_¬p, agree. If this occurs for a certain input, then we can interpret it as a hint that the model is confident about the prediction. For instance, if λ_s > λ_o and δ_p < δ_¬p, then we are confident that the output should be support. Table 3 reports the percentage of the input where the signals agree (column "Coverage") and the F1 that we would obtain if we followed this strategy for deciding the stance. We see that the number of cases when there is an agreement is large (86% on PER) and the performance is fairly high (F1 of about 83.5) and superior to the one that we would obtain with, e.g., STANCY over the entire dataset (77.76, Table 5). From this, we conclude that this is a rather simple strategy to improve the performance without discarding many cases. However, notice that with this approach we cannot choose the minimum level of acceptable confidence. For this, we can use our proposed Tribrid_l or Tribrid_d.
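The agreement check can be sketched as follows. The function names and the three toy inputs are hypothetical; each input is a tuple (λ_s, λ_o, δ_p, δ_¬p).

```python
def agreed_label(logit_s, logit_o, delta_p, delta_neg_p):
    """Return S or O when logits and distances agree, otherwise None."""
    by_logits = "S" if logit_s > logit_o else "O"
    by_distance = "S" if delta_p < delta_neg_p else "O"
    return by_logits if by_logits == by_distance else None

def coverage(examples):
    """Fraction of inputs on which the signals agree."""
    agreed = [agreed_label(*example) for example in examples]
    return sum(1 for label in agreed if label is not None) / len(examples)

# Hypothetical signal values for three inputs.
examples = [(2.0, -1.0, 0.4, 0.9), (1.0, 0.5, 0.8, 0.3), (-0.7, 1.2, 1.1, 0.6)]
labels = [agreed_label(*example) for example in examples]
```

Unlike the thresholded procedures, this rule has no parameter to tune, which is why it cannot be used to trade coverage for confidence.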
In general, if we want to accept only the cases with high confidence, then we can use either Tribrid_l or Tribrid_d with a high τ. In this way, however, it is likely that we will discard many cases. To study what the tradeoff would be in our benchmarks, we present the results of an experiment where we pick τ such that only X% of the data is discarded. Two natural baselines for comparing our performance consist of applying the K_τ decision procedure to the logits returned by BERT_base and STANCY. In this way, we can also study the behaviour of other methods. Notice that in this experiment τ is chosen independently for each method. This means that the value of τ with STANCY can differ from the value of τ with BERT_base when, for instance, 10% of the predictions are filtered out. In this way, the comparison is fair since it is done on subsets of predictions of similar size. Table 4 reports the results of this experiment while Figure 3 plots the same numbers in a graph. As expected, we observe that the F1 increases as we increase τ. However, notice that with BERT_base and STANCY the F1 decreases drastically once more than approximately 70-80% of the cases are filtered out. That point marks the maximum F1 that we can achieve with those methods. Instead, with our method the F1 keeps increasing to higher values, which indicates that our system is capable of producing much higher-quality predictions.
We make three main observations. First, with a similar discard rate, our approach outperforms both BERT_base and STANCY (the difference is statistically significant with a p-value of 0.0379 as per paired t-test between Tribrid_l and STANCY and 0.0441 between Tribrid_d and STANCY). This shows that our proposed neural architecture, which is trained and used including negated perspectives, is able to return higher-quality predictions than existing methods. Second, Tribrid_d (Λ_τ) can achieve a very high F1: above 95 in the most selective case. However, if we are not willing to sacrifice a large part of the input, then Tribrid_l returns better performance. For instance, with a discard rate of 30%, Tribrid_l returns an F1 of 86.88 vs. 84.69 obtained with Tribrid_d. Finally, it is remarkable that both Tribrid_d and Tribrid_l can achieve a very high accuracy while retaining a sizable part of the input. We argue that in scenarios such as social media, even if we remove 30-40% of the available perspectives, we are still left with enough data for the downstream task (e.g., fact-checking). Clearly, this does not hold in contexts where all data is needed. In this case, Tribrid_w is more appropriate.

Stance Classification with Tribrid_w
To exploit the weakly supervised model used in Tribrid_w (FlyingSquid), we created five classifiers of each type with different threshold values. Then, we performed grid search and feature ablation using the validation dataset and F1 as the metric to optimize. For Λ_τ, we considered threshold values in the range [0.01, 2] while for K_τ the range was [1, 20]. We then picked the best-performing settings, namely five Λ_τ classifiers with τ set to 0.01, 0.2, 1.3, 1.5, 1.9, and five K_τ classifiers with τ set to 5, 5.5, 8.5, 11, 13. We applied each classifier to the input and constructed a feature vector with 2n labels for every ⟨C, P, NP⟩. To train the probabilistic model of FlyingSquid, we used the labels obtained from the claims and perspectives in the training set.
Table 5 reports the results obtained with Tribrid_w and with the simpler variant Tribrid_pos, which makes no use of negation. Moreover, it also reports the results with several alternatives. First, we considered BERT_base, STANCY, and majority voting as an alternative to weak supervision. Then, we trained additional BERT_base and STANCY models considering only the negated perspectives. We used the logits produced by these models and the ones produced by the models trained with the original perspectives to create 10 K_τ classifiers. In this way, we could evaluate our weak supervision approach using BERT_base and STANCY instead of our neural model. We call these last two baselines BERT_ba/N and STANC/N, respectively.
Notice that Table 5 does not include the results obtained with the method by Schiller et al. (2020) because it uses BERT_large and external datasets with transfer learning. Thus, it cannot be directly compared to our approach. For fairness, we mention that their best result is an F1 of about 84 on PER. This makes transfer learning a promising extension of our work, but this deserves a dedicated study. Finally, notice that if it is not possible to negate the input perspective, then Tribrid_w applies the fallback strategy of executing Tribrid_pos. Therefore, the results presented in Table 5 were obtained considering the entire test sets.
After looking at Table 5, we make two main observations. First, Tribrid_pos slightly outperforms STANCY on PER (80.40 vs 77.76). This means that our strategy of processing the claim and perspective separately, instead of concatenating them, is beneficial. We argue that this is because with our approach BERT receives shorter strings, thus it is able to produce latent representations of higher quality. Besides, the way we concatenate the representations could also lead to an additional increase in performance.
Second, Tribrid_w further outperforms Tribrid_pos, which was the second best, with a difference that is statistically significant (p-value of 3.610e-16 with the McNemar test). Moreover, if we compare the performance of BERT_base and STANCY with and without the negated perspectives (BERT_base vs. BERT_ba/N and STANCY vs. STANC/N), then we observe that the F1 increases also in these cases. This suggests that including negated perspectives and post-processing the output in a weakly supervised fashion is a viable solution to improve the performance over the entire input.

Templates for Negating Perspectives
Our approach heavily depends on the quality of the templates. For us, a good template is not necessarily a template that alters the meaning in a way that is considered optimal by a human. Instead, it is a template that alters the text in a way that improves stance prediction. Moreover, a good template should not be too specific so that it can be applied to as much text as possible. Table 6 reports a list of the most popular templates on PER. As we can see, a very simple template like the top one matches a large number of cases.
One may wonder what the performance would be if we used instead one of the other known techniques for negating the text. Table 7 shows what would happen if we used the methodologies proposed by Bilu et al. (2015) and by Camburu et al. (2020), which are the two most prominent approaches in the current literature. The first method appends the suffix "but this is not true" while the second removes the token "not". Therefore, we call them "AppSuff" and "DelNot", respectively. Since the goal of negating the perspectives is to recognize dubious predictions, we focus on the number of "flipped" cases, that is, the number of cases where the outcome changes if we pass the negated text. For instance, suppose that the outcome with perspective A is S. Then, we expect that if we provide ¬A, then the output is O. If this happens, then we count this case as "flipped". Table 7 reports the number of flipped cases and the F1 that is obtained on the subset with such cases. Notice that here we consider only models that have not been trained with negated information, to prevent them from learning biases. As we can see, both the number of flipped cases and the F1 are higher with our templates than with the other two methods. This shows that negating using templates produces sentences that can be recognized more easily by BERT. Because of this, BERT can return an opposite prediction in a larger number of cases and with a better accuracy.
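The notion of a "flipped" case can be made concrete with a short sketch. The two negation functions follow the one-line descriptions of AppSuff and DelNot given above and may differ in detail from the original implementations; `toy_predict` is a hypothetical keyword-based stand-in for a trained stance model, used only to make the example executable.

```python
import re

def app_suff(text):
    """AppSuff: append the fixed suffix."""
    return text + " but this is not true"

def del_not(text):
    """DelNot: remove the first occurrence of 'not'; None if there is none."""
    negated, removed = re.subn(r"\bnot\b\s*", "", text, count=1)
    return negated if removed else None

def count_flipped(predict, perspectives, negate):
    """Number of perspectives whose predicted stance changes after negation."""
    flipped = 0
    for text in perspectives:
        negated = negate(text)
        if negated is not None and predict(negated) != predict(text):
            flipped += 1
    return flipped

def toy_predict(text):
    """Hypothetical stand-in for a stance model: O if 'not' occurs, else S."""
    return "O" if "not" in text.lower().split() else "S"

perspectives = ["Sanctions are not working", "Taxes are fair"]
flipped_del_not = count_flipped(toy_predict, perspectives, del_not)
flipped_app_suff = count_flipped(toy_predict, perspectives, app_suff)
```

In an actual evaluation, `predict` would wrap a model that was not trained with negated inputs, and the F1 would then be computed on the flipped subset only.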
We conclude mentioning a couple of "too hard" cases for Tribrid, even with a high τ . The first relates to the claim "We should drop the sanctions against Cuba" and perspective "Sanctions are not working". We suspect that the problem here is that "not" produces a double negation that confuses the model. The second is the claim "Animal testing should be banned" and perspective "Animals do not have rights, therefore it is acceptable to experiment on them". In this case, computing the semantics of the perspective requires some entailment that is likely to be too complex for the model.

Conclusion
In this paper, we introduced a new method to classify the stance of messages about a given claim. The main idea is to "inject" automatically generated negated perspectives into a BERT-based model so that we can either filter out dubious predictions or improve the overall accuracy.
If we filter out dubious predictions, then we can improve the performance to a point where the F1 reaches a human-like level without sacrificing a large part of the input. We believe that discarding a (small) percentage of the input is not a major issue in data-intensive environments (e.g., social networks) where many users express their perspectives. However, if the use case is such that we must always make a prediction, then we have shown how we can leverage weak supervision to make a judicious prediction based on the confidence of the model. This approach is also competitive against the state-of-the-art on standard benchmark datasets.
Our work opens the door to several follow-up studies. A natural continuation is to explore whether we can achieve similar results if we negate the claim instead of the perspective. Moreover, it is interesting to see whether we can construct paraphrases instead of negated text and modify the architecture accordingly. If we increase the number of sequences that we pass as inputs, our "siamese" approach may no longer work. If this occurs, then future work is needed to find alternatives. Finally, more sophisticated ways to negate the text may lead to further improvements.
In general, we believe that critically assessing the output of a BERT model using negated text is a promising technique to evaluate the model's confidence. Therefore, it can also bring some improvements in other tasks like sentiment analysis, entity linking, or word sense disambiguation.