View Distillation with Unlabeled Data for Extracting Adverse Drug Effects from User-Generated Data

We present an algorithm based on multi-layer transformers for identifying Adverse Drug Reactions (ADRs) in social media data. Our model relies on the properties of the problem and the characteristics of contextual word embeddings to extract two views from documents. A classifier is then trained on each view and used to label a set of unlabeled documents, which in turn serves as an initializer for a new classifier in the other view. Finally, the initialized classifier in each view is further trained using the initial training examples. We evaluated our model on the largest publicly available ADR dataset. The experiments show that our model significantly outperforms transformer-based models pretrained on domain-specific data.


Introduction
Social media has made a substantial amount of data available for various applications in the financial, educational, and health domains. Among these, the applications in healthcare have a particular importance. Although previous studies have demonstrated that self-reported online social data is subject to various biases (Olteanu et al., 2018), this data has enabled many applications in the health domain, including tracking the spread of influenza (Aramaki et al., 2011), detecting reports of the novel coronavirus (Karisani and Karisani, 2020), and identifying various illness reports (Karisani and Agichtein, 2018).
One of the well-studied areas in online public health monitoring is the extraction of adverse drug reactions (ADRs) from social media data. ADRs are the unintended effects of drugs used for prevention, diagnosis, or treatment. The researchers in Duh et al. (2016) reported that consumers, on average, report the negative effects of drugs on social media 11 months earlier than on other platforms, which highlights the importance of this task. Another team of researchers, in Golder et al. (2015), reviewed more than 50 studies and reported that the prevalence of ADRs across multiple platforms ranges between 0.2% and 8.0%, which reflects the difficulty of this task. In fact, despite the long history of this task in the research community (Yates and Goharian, 2013), for various reasons, the performance of the state-of-the-art models is still unsatisfactory. Social media documents are typically short and their language is informal (Karisani et al., 2015). Additionally, the imbalanced class distribution in the ADR task exacerbates the problem.
In this study we propose a novel model for extracting ADRs from Twitter data. Our model, which we call View Distillation (VID), relies on the existence of two views in the tweets that mention drug names. We use unlabeled data to transfer the knowledge from the classifier in each view to the classifier in the other view. Additionally, we use a finetuning technique to mitigate the impact of noisy pseudo-labels after the initialization (Karisani and Karisani, 2021). Despite being straightforward to implement, our model achieves state-of-the-art performance on the largest publicly available ADR dataset, i.e., the SMM4H dataset. Our contributions are as follows: 1) We propose a novel algorithm to transfer knowledge across models in multi-view settings, 2) We propose a new technique to efficiently exploit unlabeled data in the supervised ADR task, 3) We evaluate our model on the largest publicly available ADR dataset, and show that it yields an additive improvement to the common practice of language model pretraining in this task. To our knowledge, our work is the first study that reports such an achievement. Next, we provide a brief overview of the related studies.

Related Work
Researchers have extensively explored the applications of ML and NLP models in extracting ADRs from user-generated data. Perhaps one of the early reports in this regard is published in Yates and Goharian (2013), where the authors utilize related lexicons and extraction patterns to identify ADRs in user reviews. With the surge of neural networks in text processing, traditional models were subsequently combined with these techniques to achieve better generalization (Tutubalina and Nikolenko, 2017). The recent methods for extracting ADRs rely entirely on neural network models, particularly on multi-layer transformers (Vaswani et al., 2017).
In the shared task of SMM4H 2019 (Weissenbacher and Gonzalez-Hernandez, 2019), the top-performing run was a BERT model (Devlin et al., 2019) pretrained on drug-related tweets. Remarkably, one year later, in the shared task of SMM4H 2020 (Gonzalez-Hernandez, 2020), again a variant of pretrained BERT achieved the best performance (Liu et al., 2019). Here, we propose an algorithm to improve on pretrained BERT in this task. Our model relies on multi-view learning and exploits unlabeled data. To our knowledge, our model is the first approach that improves on domain-specific pretrained BERT.

Proposed Method
Our model for extracting the reports of adverse drug effects relies on the properties of contextual neural word embeddings. Previous research on Word Sense Disambiguation (WSD) (Scarlini et al., 2020) has demonstrated that contextual word embeddings can effectively encode the context in which words are used. Although the representations of the words in a sentence are assumed to be distinct, they still possess shared characteristics. This is justified by the observation that techniques such as self-attention (Vaswani et al., 2017), which a category of contextual word embeddings employs (Devlin et al., 2019), rely on the interconnected relations between word representations.
This property is particularly appealing when documents are short; therefore, word representations, if adjusted accordingly, can be exploited to extract multiple representations for a single document. In fact, previous studies have demonstrated that word contexts can be used to process short documents, e.g., see the models proposed in Liao and Grishman (2011) and Karisani et al. (2020) for event extraction using hand-crafted features and contextual word embeddings respectively. Therefore, we use the word representations of drug mentions in user postings as a secondary view alongside the document representations of user postings in our model. As a concrete example, from the hypothetical tweet "this seroquel hitting me", we extract one representation from the entire document and another representation from the drug name Seroquel. In what follows, we call these two views the document and drug views. Figure 1 illustrates these two views using BERT (Devlin et al., 2019) as an encoder.
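As a minimal sketch of how the two views could be read off a BERT-style encoder (this is our illustration, not code from the paper; `tokenizer` is assumed to be a HuggingFace-style fast tokenizer that returns character offsets, `encoder` a matching BERT model, and `extract_views` and `token_indices_for_span` are hypothetical helper names):

```python
import torch

def token_indices_for_span(offsets, start, end):
    """Indices of sub-word tokens whose character span overlaps [start, end)."""
    return [i for i, (s, e) in enumerate(offsets) if s < end and e > start and e > s]

def extract_views(text, drug, tokenizer, encoder):
    """Return (document_view, drug_view) vectors for one user posting."""
    enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        hidden = encoder(**enc).last_hidden_state[0]   # (seq_len, hidden_size)
    doc_view = hidden[0]                               # [CLS] token: the document view
    start = text.lower().find(drug.lower())            # locate the drug mention
    idx = token_indices_for_span(offsets, start, start + len(drug))
    drug_view = hidden[idx].mean(dim=0)                # average its sub-words: the drug view
    return doc_view, drug_view
```

For the example tweet above, `extract_views("this seroquel hitting me", "seroquel", tokenizer, encoder)` would yield the document and drug representations that feed the two per-view classifiers.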
Given the two views, we can either concatenate the two sets of features and train a classifier on the resulting feature vector, or use a co-training framework as described in Karisani et al. (2020). However, the former does not exploit the abundant amount of unlabeled data, and the latter is resource intensive, because it is iterative, and it has also been shown to be effective only in semi-supervised settings where just a few hundred training examples are available. Therefore, below we propose an approach to effectively use the two views along with the available unlabeled data in a supervised setting.
In the first step, we treat the classifier in each view as a student model and train it using the pseudo-labels generated by the counterpart classifier. Since the labeled documents are already annotated, we carry out this step using the unlabeled documents. More concretely, let L and U be the sets of labeled and unlabeled user postings respectively. Moreover, let L_d and L_g be the sets of representations extracted from the document and drug views of the training examples in the set L; and let U_d and U_g be the document and drug representations of the postings in the set U. To carry out this step, we train a classifier C_d on the representations in L_d and probabilistically, with temperature T in the softmax layer, label the representations in U_d. Then we use the association between the representations in U_d and U_g to construct a pseudo-labeled dataset of U_g. This dataset, along with its set of probabilistic pseudo-labels, is used in a distillation technique (Hinton et al., 2015) to train a classifier called C'_g. Correspondingly, we use the set L_g to train a classifier C_g, then label the set U_g and use the association between the data points in U_g and U_d to construct a pseudo-labeled dataset in the document view to train a classifier C'_d.
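This cross-view distillation step can be sketched as follows (our own simplification with generic classifiers; `soft_pseudo_labels` and `distill_step` are hypothetical helper names, and the T^2 scaling follows the usual convention of Hinton et al., 2015):

```python
import torch
import torch.nn.functional as F

def soft_pseudo_labels(teacher, features, T):
    """Probabilistic labels from the teacher view, softened with temperature T."""
    with torch.no_grad():
        return F.softmax(teacher(features) / T, dim=-1)

def distill_step(student, optimizer, student_feats, soft_targets, T):
    """One distillation update: cross-entropy against the teacher's soft labels.

    `student_feats` are the same unlabeled postings the teacher labeled,
    represented in the student's view (the association between U_d and U_g).
    """
    optimizer.zero_grad()
    log_probs = F.log_softmax(student(student_feats) / T, dim=-1)
    loss = -(soft_targets * log_probs).sum(dim=-1).mean() * (T * T)
    loss.backward()
    optimizer.step()
    return loss.item()
```

For instance, the classifier on the document view of U would play the teacher, and a fresh classifier on the drug view of U the student, and vice versa.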
The procedure above results in two classifiers C'_d and C'_g. The classifier in each view is initialized by the knowledge transferred from the other view. However, the pseudo-labels used to train each classifier can be noisy. Thus, in order to reduce the negative impact of this noise, in the next step, we use the training examples in the sets L_d and L_g to further finetune these two classifiers respectively. To finetune C'_d we use the objective function below:

J_1 = Σ_{v ∈ L_d} [ λ J(y_v, C'_d(v)) + (1 − λ) J(C_d(v), C'_d(v)) ],     (1)

where J is the cross-entropy loss, y_v is the ground-truth label of the training example v, and λ is a hyper-parameter to govern the impact of the two terms in the summation. The first term in the summation is the regular cross-entropy between the output of C'_d and the ground-truth labels. The second term is the cross-entropy between the outputs of C_d and C'_d. We use the output of C_d as a regularizer to train C'_d in order to increase the entropy of this classifier for the prediction phase. Previous studies have shown that penalizing low-entropy predictions increases generalization (Pereyra et al., 2017). We argue that this is particularly important in the ADR task, where the data is highly imbalanced. Note that, even though C_d is trained on the training examples in L_d, the output of this classifier for the training examples is not sparse, particularly for the examples with uncommon characteristics. Thus, we use these soft labels along with the ground-truth labels to train C'_d. Respectively, we use the objective function below to finetune C'_g:

J_2 = Σ_{v ∈ L_g} [ λ J(y_v, C'_g(v)) + (1 − λ) J(C_g(v), C'_g(v)) ],     (2)

where the notation is similar to that of Equation 1. Here, we again use the output of C_g as a regularizer to train C'_g. In the evaluation phase, to label the unseen examples, we take the average of the outputs of the two classifiers C'_d and C'_g.
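The finetuning objective and the evaluation-time averaging can be sketched as follows (an illustrative simplification with generic classifiers; the λ / (1 − λ) weighting shown here is one plausible reading of how λ governs the two terms of Equation 1, and `finetune_loss` and `predict` are hypothetical helper names):

```python
import torch
import torch.nn.functional as F

def finetune_loss(classifier, initial_view_clf, feats, labels, lam=0.5):
    """Ground-truth cross-entropy plus a soft-label term from the view's
    initial classifier, which discourages over-confident (low-entropy)
    predictions on the imbalanced ADR data."""
    logits = classifier(feats)
    with torch.no_grad():
        soft = F.softmax(initial_view_clf(feats), dim=-1)   # regularizing soft labels
    ce_hard = F.cross_entropy(logits, labels)
    ce_soft = -(soft * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
    return lam * ce_hard + (1.0 - lam) * ce_soft

def predict(doc_clf, drug_clf, doc_feats, drug_feats):
    """Evaluation phase: average the class probabilities of the two views."""
    with torch.no_grad():
        p_doc = F.softmax(doc_clf(doc_feats), dim=-1)
        p_drug = F.softmax(drug_clf(drug_feats), dim=-1)
    return ((p_doc + p_drug) / 2).argmax(dim=-1)
```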
Algorithm 1 illustrates our model (VID) in structured English. On Lines 8 and 9 we derive the document and drug representations from the sets L and U. On Lines 10 and 11 we use the labeled training examples in the two views to train C_d and C_g. On Lines 12-14 we train and finetune C'_g, and on Lines 15-17 we train and finetune C'_d. Finally, we return C'_d and C'_g. In the next section, we describe our experimental setup.
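Putting the pieces together, a toy end-to-end version of the algorithm might look like this (a self-contained sketch with linear classifiers and pre-extracted feature vectors; for brevity, the finetuning step here uses plain cross-entropy and omits the soft-label regularizer):

```python
import torch
import torch.nn.functional as F

def _fit(clf, x, target, epochs=50, lr=0.1, T=1.0, hard=True):
    """Train a linear classifier on hard labels or temperature-softened soft labels."""
    opt = torch.optim.SGD(clf.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        logp = F.log_softmax(clf(x) / T, dim=-1)
        loss = F.nll_loss(logp, target) if hard else -(target * logp).sum(-1).mean()
        loss.backward()
        opt.step()
    return clf

def view_distillation(L_doc, L_drug, y, U_doc, U_drug, dim, n_cls=2, T=2.0):
    """Toy end-to-end sketch of the VID procedure."""
    new = lambda: torch.nn.Linear(dim, n_cls)
    c_d = _fit(new(), L_doc, y)                         # per-view classifiers on L
    c_g = _fit(new(), L_drug, y)
    with torch.no_grad():                               # soft pseudo-labels on U
        soft_d = F.softmax(c_d(U_doc) / T, dim=-1)
        soft_g = F.softmax(c_g(U_drug) / T, dim=-1)
    c_g2 = _fit(new(), U_drug, soft_d, T=T, hard=False) # cross-view distillation
    c_d2 = _fit(new(), U_doc, soft_g, T=T, hard=False)
    c_d2 = _fit(c_d2, L_doc, y, epochs=10)              # brief finetuning on L
    c_g2 = _fit(c_g2, L_drug, y, epochs=10)
    return c_d2, c_g2
```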

Experimental Setup
We evaluated our model on the largest publicly available ADR dataset, i.e., the SMM4H dataset. This dataset consists of 30,174 tweets. The training set consists of 25,616 tweets, of which 9.2% are positive. The labels of the test set are not publicly available; evaluation on this dataset must be done via the CodaLab website. We compare our model with two sets of baselines: 1) a set of baselines that we implemented, and 2) the set of baselines that are available on the CodaLab website. Our own baseline models are: BERT, the base variant of the pretrained BERT model (Devlin et al., 2019), as published by Google. BERT-D, a domain-specific pretrained BERT model. This model is similar to the previous baseline; however, it is further pretrained on 800K unlabeled drug-related tweets that we collected from Twitter. We pretrained this model for 6 epochs using the next sentence prediction and masked language model tasks. BERT-D-BL, a bi-directional LSTM model. In this model we used BERT-D followed by a bi-directional LSTM network (Hochreiter and Schmidhuber, 1997). We also compare our model with all the baselines available on the CodaLab webpage. These baselines include published and unpublished models. They also cover both models that rely purely on machine learning and those that heavily employ medical resources; see Weissenbacher and Gonzalez-Hernandez (2019) for a summary of a subset of these models.
We used the PyTorch implementation of BERT (Wolf et al., 2019). We used two instances of BERT-D as the classifiers in our model; see Figure 1. Note that using a domain-specific pretrained BERT in our framework makes any improvement difficult to obtain, because the improvement in performance must be additive. We used the training set of the dataset to tune our two hyper-parameters T and λ. The optimal values of these two hyper-parameters are 2 and 0.5 respectively. We trained all the models for 5 epochs. During the tuning, we observed that the finetuning stage in our model requires far fewer training steps; therefore, we finetuned for only 1 epoch. In our model, we used the same set of unlabeled tweets that we used to pretrain BERT-D. This verifies that our model indeed extracts new information that cannot be extracted by regular language model pretraining. As required by SMM4H, we tuned for the F1 measure. In the next section, we report the F1, Precision, and Recall metrics.

Results and Analysis
Table 1 reports the performance of our model in comparison with the baseline models; only the top three CodaLab baselines are listed here. We see that our model significantly outperforms all the baseline models. We also observe that the performances of our implemented baseline models are lower than those of the CodaLab models. This difference is mainly due to the gap between the sizes of the unlabeled corpora used for language model pretraining: ours contains 800K tweets, whereas the top CodaLab model used a corpus of 1.5M examples. This suggests that our model can potentially achieve better performance if a larger unlabeled corpus is available. Table 2 reports the performance of VID in comparison to the classifiers trained on the document and drug representations. We also concatenated the two representations and trained a classifier on the resulting feature vector, denoted by Combined-View. We see that our model substantially outperforms all three models. Table 3 compares our model with classifiers using different pretraining and finetuning resources. Again, we see that VID is comparable to the best of these models. We also observe a 2-point absolute improvement when comparing P-Drug-F-Drug and P-Doc-F-Drug, which signifies the efficacy of view distillation.
In summary, we evaluated our model on the largest publicly available ADR dataset and compared it with the state-of-the-art baseline models that use domain-specific language model pretraining. We showed that our model outperforms these models, even though it uses a smaller unlabeled corpus. We also carried out a set of experiments and demonstrated the efficacy of our proposed techniques.

Conclusions
In this study we proposed a novel model for extracting adverse drug effects from user-generated content. Our model relies on unlabeled data and a novel technique called view distillation. We evaluated our model on the largest publicly available ADR dataset, and showed that it outperforms the existing BERT-based models.

Table 2:
F1, Precision, and Recall of VID in comparison to the performance of the classifiers trained on the document, drug, and combined views.

Table 3:
Performance of VID in comparison to the performance of the classifiers pretrained on the document or drug pseudo-labels (indicated by P-{•}) and finetuned on the document or drug training examples (indicated by F-{•}).