Students Who Study Together Learn Better: On the Importance of Collective Knowledge Distillation for Domain Transfer in Fact Verification

While neural networks produce state-of-the-art performance in several NLP tasks, they generally depend heavily on lexicalized information, which transfers poorly between domains. Previous works have proposed delexicalization as a form of knowledge distillation to reduce the dependency on such lexical artifacts. However, a critical unresolved issue is how much delexicalization to apply: a little helps reduce overfitting, but too much discards useful information. We propose Group Learning, a knowledge and model distillation approach for fact verification in which multiple student models have access to different delexicalized views of the data, but are encouraged to learn from each other through pair-wise consistency losses. In several cross-domain experiments between the FEVER and FNC fact verification datasets, we show that our approach learns the best delexicalization strategy for the given training dataset, and outperforms state-of-the-art classifiers that rely on the original data.


Introduction
Neural networks have achieved state-of-the-art (SOTA) performance across many natural language processing (NLP) tasks, usually in supervised settings.
However, these methods have limitations caused in part by their overfitting on statistical and lexical nuances (or artifacts) specific to a dataset (Gururangan et al., 2018), which prevents them from transferring well across domains. A key solution to this problem is to keep these models from relying on such dataset artifacts so that they instead encode the true underlying semantics of the dataset. In recent years, fact verification has emerged as a critical task with important societal implications (Vosoughi et al., 2018). Formally, the task is defined as follows: given a pair of claim and evidence texts, determine whether the evidence supports or rejects the claim, or does not contain enough information to reach a conclusion. Several fact verification datasets have been proposed recently, based on real-world news articles (Pomerleau and Rao, 2017), Wikipedia-based knowledge bases (Thorne et al., 2018), fact verification websites (Wang, 2017), etc. Several approaches based on transformer networks (Vaswani et al., 2017), such as Liu et al. (2020), have achieved SOTA performance on these tasks.
However, as shown in (Panenghat et al., 2020; Karimi Mahabadi et al., 2020; Gururangan et al., 2018), these models are also affected by the kinds of syntactic and lexical artifacts seen in other NLP tasks. Specifically, (Suntwal et al., 2019) shows the presence of such artifacts in the Fact Extraction and Verification (FEVER) dataset (Thorne et al., 2018) and demonstrates that they limit the ability of trained models to transfer knowledge to other similar datasets such as the Fake News Challenge (FNC) dataset (Pomerleau and Rao, 2017). To mitigate the dependency on such artifacts, they proposed a data distillation (or delexicalization) approach, which replaces some lexical artifacts such as named entities with their type and a unique id that indicates occurrences of the same artifact in claim and evidence.
A key unresolved issue in this direction is how much delexicalization to apply. As indicated in previous work (Suntwal et al., 2019; Mithun et al., 2021), delexicalization reduces overfitting. But too much delexicalization may discard critical information, e.g., replacing India with its NE label, say LOCATION, may remove contextual information about the country that is necessary for the correct classification of the claim-evidence pair. Our work proposes a solution for this problem, with the following contributions:
(1) Inspired by teacher-student architectures (Tarvainen and Valpola, 2017; Rasmus et al., 2015), we propose a novel architecture that combines data distillation with model distillation to improve cross-domain performance of neural networks. In particular, our approach relies on multiple students that have access to different delexicalized views of the data but are encouraged to learn from each other through pair-wise consistency losses. We call our approach Group Learning (GL). Once training completes, the student with the best performance is kept for evaluation purposes. Because we rely on a single model at evaluation time, our approach has the same evaluation run time cost as a single classifier. Note that our method can be seen as an inverse of an ensemble strategy, which trains individual models separately but applies them jointly. GL scales better at inference time due to its reliance on a single model at that stage.
(2) We implemented a GL architecture for fact verification using BERT (Devlin et al., 2019) as the classifier, with multiple delexicalized views of the data produced by the FIGER (Ling and Weld, 2012) and CoreNLP (Manning et al., 2014) NER systems. We evaluated the domain transfer of the proposed method using two fact verification datasets: FEVER and FNC. Our results show that our method achieves a cross-domain accuracy of 73.06% when trained on FEVER and tested on FNC, and 74.46% in the other direction, outperforming other stand-alone trained methods that rely on the lexicalized data. Importantly, our approach chooses different students in each direction, highlighting different properties of the respective training datasets.

Data Distillation
Based on the findings of (Suntwal et al., 2019) that named entities (NEs) are most likely to overfit in a fact verification task, we delexicalize our data by replacing NEs with their semantic classes (and a unique id). To detect named entities and replace them with the most specific label returned by the Named Entity Recognizer (NER), we use their Overlap Aware NER (OA-NER) solution, which relies on CoreNLP (Manning et al., 2014) NE labels. In addition, we propose two new delexicalization approaches based on the FIGER NER (Ling and Weld, 2012):
FIGER Specific: Uses the most specific classes returned by the FIGER NER (e.g., CITY for Los Angeles).
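The replacement step described above can be illustrated with a minimal sketch. It assumes the NER output is already available as a dictionary from tokenized mentions to labels; the function name `delexicalize_pair`, the `LABEL-id` placeholder format, and the example sentences are hypothetical and are not taken from OA-NER or FIGER. The one property the sketch tries to preserve from the text is that the same entity receives the same id in both claim and evidence.

```python
from typing import Dict, List, Tuple

Mention = Tuple[str, ...]


def delexicalize_pair(claim: List[str], evidence: List[str],
                      mention_types: Dict[Mention, str]) -> Tuple[List[str], List[str]]:
    """Replace known mentions with 'LABEL-id'; ids are shared by claim and evidence."""
    ids: Dict[Mention, str] = {}   # mention -> placeholder, reused across both texts
    counters: Dict[str, int] = {}  # per-label running counter
    longest = max((len(m) for m in mention_types), default=0)

    def rewrite(tokens: List[str]) -> List[str]:
        out: List[str] = []
        i = 0
        while i < len(tokens):
            # Try the longest span first so a multi-word mention wins over a prefix.
            for n in range(min(longest, len(tokens) - i), 0, -1):
                span = tuple(tokens[i:i + n])
                if span in mention_types:
                    if span not in ids:
                        label = mention_types[span]
                        counters[label] = counters.get(label, 0) + 1
                        ids[span] = f"{label}-{counters[label]}"
                    out.append(ids[span])
                    i += n
                    break
            else:
                # No mention starts here: keep the original token.
                out.append(tokens[i])
                i += 1
        return out

    return rewrite(claim), rewrite(evidence)


# Hypothetical NER output for a made-up claim-evidence pair.
types = {("Los", "Angeles"): "CITY", ("India",): "LOCATION"}
new_claim, new_evidence = delexicalize_pair(
    "Los Angeles is in India".split(),
    "India does not contain Los Angeles".split(),
    types,
)
```

A coarser or finer label set only changes the values in `types`, which is how the different delexicalized views of the same pair could be produced under this sketch.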

Model Distillation
We propose a combined distillation strategy to mitigate the risk of distilling the data at the incorrect granularity (overly aggressive or too conservative). Specifically, we introduce a Group Learning architecture (shown in Figure 1), inspired by the teacher-student paradigm (Hinton et al., 2015; Tarvainen and Valpola, 2017; Laine and Aila, 2016; Sajjadi et al., 2016). In this architecture, each student method is trained on two sources of signal: (a) Different versions of the same dataset, each delexicalized differently using the data distillation techniques mentioned above.
(b) The distributions of predictions of the other models. This combined methodology of knowledge distillation encourages students to learn as much as possible from their views of the data while jointly learning with other students. Training together on the soft labels (distributions of predictions) of the other student methods acts as a form of regularization between all methods. More formally, each student component includes a regular classification loss (implemented using cross-entropy) on its respective data. Additionally, each student has a consistency loss with every other student, which minimizes the difference between their predicted label distributions. The intuition behind our approach is that by providing multiple data distillation options to choose from, we encourage the student methods to 'pull' towards each other and the original underlying semantics. The part of semantic knowledge that is obscured from a student method by the particular delexicalization technique used in the dataset version it sees is instead learned in its effort to perform on par with the other methods. Thus, similar to a classroom environment where students learn both from known labels (e.g., a textbook) and by helping one another learn, each student can choose the right amount of granularity needed to enhance its understanding.
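As a rough illustration of the objective just described, the sketch below sums a cross-entropy term per student and a consistency term per pair of students for a single example. The text does not state which distance is used between predicted distributions, so the mean squared difference here is an assumption, as are the `weight` parameter and all function names.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one student's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: Sequence[float], gold: int) -> float:
    """Classification loss of one student on the gold label."""
    return -math.log(probs[gold])


def consistency(p: Sequence[float], q: Sequence[float]) -> float:
    """Assumed consistency term: mean squared difference of two distributions."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)


def group_loss(student_logits: Sequence[Sequence[float]], gold: int,
               weight: float = 1.0) -> float:
    """Sum of per-student cross-entropy plus weighted pair-wise consistency."""
    probs = [softmax(logits) for logits in student_logits]
    classification = sum(cross_entropy(p, gold) for p in probs)
    pairwise = sum(
        consistency(probs[i], probs[j])
        for i in range(len(probs))
        for j in range(i + 1, len(probs))
    )
    return classification + weight * pairwise
```

When all students predict the same distribution the pair-wise term vanishes and only the classification losses remain, which matches the reading of the consistency loss as a regularizer between students.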

Classifiers
We use BERT (Devlin et al., 2019) as the pretrained model in all our experiments since it has achieved SOTA results in many NLP tasks, including fact verification. Specifically, we experiment with two variants, BERT-Base and Mini BERT (Turc et al., 2019), a light-weight version of BERT, both from the Hugging Face repository (Wolf et al., 2019).

Experiments
Data: We use two distinct fact verification datasets for our experiments: the Fact Extraction and Verification (FEVER) dataset (Thorne et al., 2018) and the Fake News Challenge (FNC) dataset (Pomerleau and Rao, 2017).
The FEVER dataset consists of 145,449 data points, each having a claim and evidence pair. These claim-evidence pairs typically contain one or more sentences compiled from Wikipedia using an information retrieval (IR) module and are classified into three classes: supports, refutes and not enough information. The evidence for data points that had the gold label of not enough information was retrieved (using a task-provided IR component) either by finding the nearest neighbor to the claim or randomly. Even though the training partition of the FEVER dataset was publicly released, the gold test labels used in the final shared task were not. We therefore built our own test partition by dividing the randomized training partition into 80% (119,197 data points) for training and 20% (26,252 data points) for testing.
The FNC dataset comprises claim-evidence pairs that are divided into four classes: agree, disagree, discuss and unrelated. These claim-evidence pairs were created using the headlines and content sections of real news articles, respectively. The training partition of the publicly available dataset comprises 49,972 data points, and the testing partition 25,413 data points. We further divided the training partition into 40,904 data points for training and 9,068 data points for development.
In order to evaluate the proposed methods in a cross-domain setting, we modified the label space of the source domain to match that of the target domain as done in (Suntwal et al., 2019).
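A label-space conversion of this kind can be sketched as a simple lookup. The mapping below is a hypothetical placeholder chosen only for illustration: the text defers to (Suntwal et al., 2019) and does not list the actual correspondence between FEVER and FNC labels, so the entries of `FEVER_TO_FNC` should not be read as the mapping used in the experiments.

```python
from typing import Dict, List

# Hypothetical mapping for illustration only; the source does not list the real one.
FEVER_TO_FNC: Dict[str, str] = {
    "supports": "agree",
    "refutes": "disagree",
    "not enough information": "unrelated",
}


def remap_labels(labels: List[str], mapping: Dict[str, str]) -> List[str]:
    """Rewrite source-domain labels into the target-domain label space."""
    missing = sorted(set(labels) - set(mapping))
    if missing:
        # Fail loudly instead of silently dropping or guessing a label.
        raise KeyError(f"no target label for: {missing}")
    return [mapping[label] for label in labels]
```

Raising on unmapped labels is a deliberate choice in this sketch, since a silent default would distort cross-domain accuracy.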
Setting: In all the experiments, the performance of the underlying method on the respective original, lexicalized data is considered as the baseline. In the baseline model, we use the default hyper parameters from Hugging Face. We focus our analysis on cross-domain evaluation, i.e., we train all methods on one dataset (e.g., FEVER) and evaluate their accuracy on the other dataset (e.g., FNC). At the end of the training, the best student model from the list of all the trained models is saved to be used for evaluation.
Table 2: In-domain and cross-domain accuracies for various methods. All scores reported are averaged across three random seeds. "Lex" is the stand alone model trained on the original lexicalized data; "GL" denotes the student in the proposed multi-student 'Group Learning' architecture. Decomposable Attention Delex refers to the best performing model in (Suntwal et al., 2019), a stand alone decomposable attention model (Parikh et al., 2016) which was trained on data that was delexicalized using the OA-NER and Super Sense tagging techniques. * indicates that the corresponding result is significantly better than its baseline ("lex" in the same column), under a bootstrap resampling test with 1,000 samples, and p-value < 0.05.
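The significance test mentioned in the Table 2 caption could look roughly like the sketch below. The exact bootstrap variant is not described in the text, so this assumes a paired resampling of test examples and estimates the p-value as the fraction of resamples in which the system does not beat the baseline; the function name and the `seed` argument are illustrative.

```python
import random
from typing import Sequence


def bootstrap_p_value(gold: Sequence[int], baseline: Sequence[int],
                      system: Sequence[int], samples: int = 1000,
                      seed: int = 0) -> float:
    """Fraction of resamples in which the system is NOT better than the baseline."""
    rng = random.Random(seed)  # local generator keeps the test reproducible
    n = len(gold)
    not_better = 0
    for _ in range(samples):
        # Draw n example indices with replacement; both models share the draw.
        idx = [rng.randrange(n) for _ in range(n)]
        base_correct = sum(baseline[i] == gold[i] for i in idx)
        sys_correct = sum(system[i] == gold[i] for i in idx)
        if sys_correct <= base_correct:
            not_better += 1
    return not_better / samples
```

Under this sketch, a result would be starred when the returned value falls below 0.05.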

Results
Table 2 summarizes our results. We focus on two sets of experiments using each training method and setting: in-domain (columns 2 and 4) and cross-domain (columns 3 and 5). For comparison purposes, we also show the results from (Suntwal et al., 2019), where a decomposable attention model was used for the same experiments. As shown in Table 2, although all the baseline models perform well (83.43%-99.45%) in-domain, they transfer poorly when evaluated cross-domain, where a drop in performance of up to 35% is observed. This verifies our findings that the signal the model learns from the un-masked text does not generalize well between domains. On the other hand, we observe a marginal in-domain drop in performance for the student models trained on the GL architecture (e.g., 6% in the FEVER/FEVER setting for BERT-Base Cased) compared to their lexical counterparts. This indicates that GL models retain most of the signal from the lexical data. Importantly, the GL models perform considerably better than their corresponding lexical versions across domains (e.g., up to 20.35% improvement in the FNC/FEVER setting for Mini-BERT). This demonstrates that data distillation and model distillation can be successfully combined as a strategy to improve domain transfer of fact verification methods.

Discussion and Conclusion
Previous work (Suntwal et al., 2019) has shown that delexicalization is useful in learning domain transferable knowledge. However, the level of delexicalization suitable for each task is unclear. In this work, we provide multiple delexicalization choices to neural network models and encourage them to choose the most appropriate one. We suspect this approach acts as regularization (through the consistency losses), as well as a form of data noise (because of the imperfect NERs), which has been shown to aid in knowledge distillation paradigms (Hinton et al., 2015; Tarvainen and Valpola, 2017). Also, note that the BERT models perform better than the decomposable attention (DA) (Parikh et al., 2016) model in most of the cases, especially on the FNC dataset. More importantly, the cross-domain performance gain when using BERT with the proposed group learning architecture is higher than that achieved by the decomposable attention model. There are likely three reasons for this. First, BERT has a considerably larger number of parameters than DA. Second, in applications involving text pairs, the decomposable attention model encodes the two texts individually before using bidirectional cross attention. Instead, BERT combines these two stages using the self-attention mechanism, which operates jointly over the two concatenated sentences (separated by the [SEP] token). Lastly, BERT is pretrained on massive amounts of texts related to the datasets used here, whereas DA learns its parameters from scratch.
Further, analyzing the selection made by the GL framework for various random seeds, we observed that when trained on FEVER and tested on FNC, GL selects the lexicalized student, while in the other cross-domain direction, the common choice is the student delexicalized with OA-NER. We hypothesize this happens for two reasons. First, the FNC training data is smaller (40,904 data points) than the FEVER training data (119,197 data points) and is therefore more prone to overfitting in its original, lexicalized form. Second, as mentioned above, since the FNC dataset is derived from real-world news articles, FNC evidence contains more sentences than FEVER evidence. This means that FNC texts preserve enough lexical signal for training even in their delexicalized forms. The opposite observations hold in the other direction (FEVER to FNC), which caused the lexicalized students to perform better. Note also that even though we use only four student methods in our experiments, the approach can be extended to any number of methods. However, the correct number of student models (and their corresponding delexicalized datasets) for a given task needs to be determined empirically.
Our approach demonstrates that: (a) delexicalization helps model generalization, (b) the amount of delexicalization to apply varies from dataset to dataset, and (c) it is possible to learn how much delexicalization to apply through our proposed GL architecture.