CDA: A Contrastive Data Augmentation Method for Alzheimer’s Disease Detection

Alzheimer’s Disease (AD) is a neurodegenerative disorder that significantly impacts a patient’s ability to communicate and organize language. Traditional methods for detecting AD, such as physical screening or neurological testing, can be challenging and time-consuming. Recent research has explored the use of deep learning techniques to distinguish AD patients from non-AD patients by analysing spontaneous speech. These models, however, are limited by the availability of data. To address this, we propose a novel contrastive data augmentation method, which simulates the cognitive impairment of a patient by randomly deleting a proportion of text from the transcript to create negative samples. The corrupted samples are expected to be in worse condition than the original by a margin. Experimental results on the benchmark ADReSS Challenge dataset demonstrate that our model achieves the best performance among language-based models.


Introduction
Alzheimer's Disease (AD) is a debilitating neurodegenerative disorder characterized by a progressive cognitive decline that is currently incurable. It accounts for up to 70% of all cases of dementia (Association, 2020). With an aging population, the prevalence of AD is on the rise. As symptoms of Alzheimer's disease can be mistaken for a variety of other cognitive disorders, traditional diagnostic methods, such as physical screening or neurological testing, can be challenging and time-consuming. Furthermore, they require a certain degree of clinician expertise (Prabhakaran et al., 2018).
Consequently, the development of automatic detection methods for Alzheimer's disease is essential to the advancement of current medical treatment. The use of machine learning methods to detect AD or other diseases automatically has gained increasing attention in recent years (Luz et al., 2018; Martinc and Pollak, 2020; Liu et al., 2021; Yu et al., 2023). Nevertheless, these approaches have limitations due to a lack of data and the generalizability of the models. Some studies have attempted to address this problem by model ensembling (Syed et al., 2021; Rohanian et al., 2021), multi-task learning (Li et al., 2022; Duan et al., 2022) or data augmentation (Woszczyk et al., 2022), but the improvement in performance is not always substantial. Our code is publicly available at https://github.com/CSU-NLP-Group/CDA-AD.
Inspired by previous research showing that AD patients often have language disorders, such as difficulties in word finding and comprehension (Rohanian et al., 2021), we propose a novel Contrastive Data Augmentation (CDA) approach for automatic AD detection. In our study, we simulate the cognitive decline associated with Alzheimer's disease by randomly deleting words from the speech transcript to create negative samples. The corrupted samples are expected to be in worse condition than the original due to the degradation of coherence and semantic integrity. Compared to traditional data augmentation methods, the CDA method expands the dataset scale and utilizes the augmented data more effectively. Our experiments on the ADReSS Challenge dataset demonstrate that our approach uses linguistic features alone, generalizes better to unseen data, and achieves superior results compared to strong baselines.

Data and Preprocessing
We use the data from the ADReSS Challenge (Alzheimer's Dementia Recognition through Spontaneous Speech) (Luz et al., 2020), a subset of DementiaBank's English Pitt Corpus (Becker et al., 1994). It consists of recordings and transcripts of spoken picture descriptions from the Boston Diagnostic Aphasia Examination. During the examination, the subject is shown a picture and is asked to describe its content in their own language.

A total of 156 speech audio recordings and transcripts were obtained from English-speaking participants in the ADReSS dataset, with an equal number of participants (N=78) diagnosed with and not suffering from Alzheimer's disease, as shown in Table 1. Annotated transcripts in the dataset are in CHAT format (MacWhinney, 2014). Participants' ages and genders are also balanced to minimize the risk of bias in prediction. As some of the tokens in CHAT format are highly specific and are unlikely to be included in BERT tokenizers, we converted them into actual repetitions of words. We retain only words, punctuation, and pauses as input to the BERT model. Our method uses only the transcripts from the dataset.
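As an illustration, this CHAT-to-plain-text conversion might look like the following sketch. The function name and the exact bracket patterns are our assumptions; the set of CHAT codes actually handled depends on the corpus annotations:

```python
import re

def clean_chat_transcript(text):
    """Convert CHAT-style annotations into plain words, punctuation, and pauses.

    Illustrative only: real CHAT transcripts contain many more code types.
    """
    # Expand repetition markers like "word [x 3]" into actual word repetitions.
    def expand(match):
        word, count = match.group(1), int(match.group(2))
        return " ".join([word] * count)

    text = re.sub(r"(\w+)\s*\[x\s*(\d+)\]", expand, text)
    # Drop remaining bracketed CHAT codes, e.g. "[//]" (retracing).
    text = re.sub(r"\[[^\]]*\]", "", text)
    # Collapse whitespace left behind by removed codes.
    return re.sub(r"\s+", " ", text).strip()
```

For example, `clean_chat_transcript("the boy [x 3] falls")` would expand the repetition marker into three occurrences of "boy".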

Methods
Figure 1 illustrates the framework of the proposed model. First, for each transcript, we generate a number of augmented instances, which are then fed to the text encoder along with the original transcript to obtain their corresponding representations. The classifier then takes the feature vectors produced by the text encoder and outputs the probability of AD for each transcript and its augmented samples. We discuss the details in the following subsections.

Text Encoder and Classifier
For fair comparison with previous work (Woszczyk et al., 2022), the input text is encoded using pre-trained BERT (bert-base-uncased) and represented by the [CLS] token after bert_pooler. Given a text sequence x_i, we obtain its encoded representation h_i through the encoder.
After obtaining the embedding of the transcript, we pass it through a simple linear classifier to obtain the final prediction score (Eq. 2), and we use the commonly used binary cross-entropy (BCE) as our classification loss, denoted as L_BCE (Eq. 3):

ŷ_i = σ(W h_i + b),    (2)

L_BCE = −[y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i)],    (3)

where y_i is the golden label for x_i, and W and b are trainable parameters of the classifier.
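The classifier and its loss can be sketched in plain Python for a single example. Function names are ours, and a real implementation would operate on batched tensors rather than Python lists:

```python
import math

def classify(h, W, b):
    """Linear classifier: sigmoid(W·h + b) -> probability of the AD class."""
    z = sum(w * x for w, x in zip(W, h)) + b
    return 1.0 / (1.0 + math.exp(-z))

def bce_loss(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy for one example; eps guards against log(0)."""
    return -(y_true * math.log(y_pred + eps)
             + (1 - y_true) * math.log(1 - y_pred + eps))
```

With a zero logit the classifier outputs 0.5, and the BCE loss of a correct label against that prediction is log 2.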

Contrastive Data Augmentation
The performance of previous work is limited due to a lack of data availability.To alleviate this, we propose the contrastive data augmentation approach (CDA) to replicate the cognitive decline associated with AD to expand the data size and improve the model robustness.
Negative Sample Generation Assume that the dataset {(x_i, y_i)}_{i=1}^N contains N training samples. We randomly delete a proportion p ∈ [0, 1] of the words from each sample n_neg times to create n_neg negative samples. We thus obtain an augmented set {(x_i, y_i, X_i^neg)}_{i=1}^N, where X_i^neg = {x̃_i^j}_{j=1}^{n_neg} are derived from x_i. We can further augment the training set by repeating the whole process n_aug times to obtain {(x_i, y_i, X_i^neg)}_{i=1}^{N×n_aug}, expanding the data size by a factor of n_aug.
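A minimal sketch of the deletion step, assuming whitespace-tokenized transcripts. The helper name, the seeding, and the guard against producing an empty sample are our additions:

```python
import random

def make_negatives(tokens, p=0.3, n_neg=3, seed=0):
    """Create n_neg corrupted copies of a transcript by independently
    deleting each word with probability p."""
    rng = random.Random(seed)
    negatives = []
    for _ in range(n_neg):
        kept = [t for t in tokens if rng.random() >= p]
        negatives.append(kept or tokens[:1])  # never emit an empty sample
    return negatives
```

Each call yields n_neg shortened word lists whose missing words degrade the transcript's coherence, mimicking the disfluencies the paper targets.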
Positive Sample Generation Inspired by Gao et al. (2021), we resort to the randomness of dropout to construct positive samples. Dropout is a popular regularization technique due to its simplicity, but the randomness it introduces may hinder further improvements in the model's generalization performance. R-Drop (Wu et al., 2021) was proposed to fix this problem by ensuring consistency between the outputs of two forward passes with the same data. We deploy the R-Drop algorithm as a regularization method for generating positive instances. More specifically, the original sample x_i is fed to the model twice at each step, and two corresponding predictions, denoted as ŷ_i^1 and ŷ_i^2, are obtained. We then minimize the bidirectional Kullback-Leibler (KL) divergence between them, denoted as L_KL (Eq. 4):

L_KL = (1/2) [KL(ŷ_i^1 ‖ ŷ_i^2) + KL(ŷ_i^2 ‖ ŷ_i^1)].    (4)

Contrastive Loss It is reasonable to assume that the negative samples are more likely to exhibit AD than the original ones, in view of the degradation in semantic coherence and integrity. To achieve this, we regularize their differences to be larger than a margin m.
Particularly, the encoder receives x_i and X_i^neg as input and outputs their corresponding embedding representations h_i and H_i^neg. These representations are then fed to the classifier to obtain final scores ŷ_i and ỹ_i^j for x_i and x̃_i^j, respectively. Their difference is penalized via Eq. 5:

L_margin = Σ_{j=1}^{n_neg} max(0, m − (ỹ_i^j − ŷ_i)),    (5)

where m is the margin between positive and negative samples. The final loss is a combination of the above three loss terms L_BCE, L_margin and L_KL (Eq. 6):

L = L_BCE + α L_margin + μ L_KL,    (6)

where α and μ are hyperparameters that control the impact of positive and negative samples; we set α = 0.5 and μ = 0.5 in our model.
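The margin and consistency terms, and their combination with the classification loss, can be sketched in plain Python for a single example. Function names and the eps smoothing are our assumptions, and KL is computed here between two scalar Bernoulli predictions for readability:

```python
import math

def margin_loss(y_orig, y_negs, m=0.1):
    """Hinge-style penalty: each corrupted sample's AD score should exceed
    the original sample's score by at least the margin m."""
    return sum(max(0.0, m - (y_neg - y_orig)) for y_neg in y_negs)

def rdrop_kl(p, q, eps=1e-12):
    """Bidirectional KL between the two forward-pass predictions
    (the R-Drop consistency term): 0.5 * (KL(p||q) + KL(q||p))."""
    def kl(a, b):
        return (a * math.log((a + eps) / (b + eps))
                + (1 - a) * math.log((1 - a + eps) / (1 - b + eps)))
    return 0.5 * (kl(p, q) + kl(q, p))

def total_loss(l_bce, l_margin, l_kl, alpha=0.5, mu=0.5):
    """Weighted combination of the three terms, with the paper's
    reported setting alpha = mu = 0.5."""
    return l_bce + alpha * l_margin + mu * l_kl
```

Note that `margin_loss` is zero once every corrupted sample scores at least m higher than the original, and `rdrop_kl` is zero when the two forward passes agree exactly.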

Experiments
We employ 10-fold cross-validation to estimate the generalization error and tune the model's parameter settings. The best setting is used to retrain models on the whole training set with five different random seeds, and these models are then applied to the test set.
The results reported in this paper are the average over these models. Accuracy is used as the primary metric of task performance since the dataset is balanced. Recall, precision, and F1 are also reported for the AD class to provide a more comprehensive assessment. The hyperparameters in our model are: learning rate = 1e-4, batch size = 8, epochs = 5, n_aug = 3, n_neg = 3, p = 0.3, margin m = 0.1.

Baselines
We compare our method with: 1) LDA, the challenge baseline using linear discriminant analysis (Luz et al., 2020); 2) BERT: Balagopalan et al. (2021) compared BERT models with feature-based models and obtained relatively better results with the former; 3) Fusion: Campbell et al. (2021) fused language and audio features for classification; 4) SVM (BT RU) (Woszczyk et al., 2022): an SVM model using back-translation from Russian, which achieves better results than the BERT model using back-translation from German (BT DE); 5) Ensemble methods: Sarawgi et al. (2020) take a majority vote among three individual models. ERNIE0p and ERNIE3p are based on ERNIE-large (Sun et al., 2020) and use original transcripts and transcripts with pauses manually inserted for AD classification, respectively.

Results
The main experimental results are shown in Table 2. We observe that performance improves significantly when BERT is applied. Back-translation data augmentation yields consistent improvements for both BERT (BT DE) and SVM (BT RU), suggesting that data augmentation is a promising strategy. Our method achieves 87.5% accuracy, 88.1% precision, and an 86.9% F1 score, outperforming the baseline methods by a substantial margin, which suggests the effectiveness of the cognitive impairment simulation in our method. By ensembling five of our models with a majority-vote mechanism, the performance improves further (4.2% and 4% absolute improvements in accuracy and F1 score, respectively) and achieves the best results among all compared models.

Ablation Study
To determine the effectiveness of the main modules, namely random deletion (RD) and regularized dropout (R-Drop), we removed them from the model one by one and tested their impact on performance in 10-fold cross-validation. As shown in Table 3, by combining the contrastive data augmentation strategy with the base BERT, our model outperforms it by a large margin. However, when either module is removed, the model suffers a significant loss of performance, suggesting that both contribute positively.

Parameter Analysis
We also perform parameter analysis under the same experimental settings. As illustrated in Figure 2, a lower deletion rate leads to relatively higher accuracy: the more words are deleted, the less informative the transcript becomes. A large margin, in turn, negatively impacts both recall and accuracy.

As for n_aug, the model performs best with respect to recall and accuracy when it is set to 3; lower or higher values hurt performance. The same conclusion applies to n_neg, where a breakdown of the model is observed at n_neg = 7. The model's performance also improves as the number of negative samples increases, although this requires more computing resources.

Conclusion
Our experiments show the potential of contrastive data augmentation for improving the accuracy of models for Alzheimer's disease diagnosis. In comparison to large, complex multimodal models and other data augmentation techniques, we obtain the best results by simulating the cognitive impairment caused by AD. Despite the small size of the dataset, the results of this study provide a basis for further research into more complex issues.

Figure 1: The overview of our proposed method.

Figure 2: Accuracy and recall scores at different deletion rates, margins, n_aug, and n_neg.

Table 1: Statistics of the ADReSS dataset.

Table 2: Results of our method and the baselines on the test set.