Comparison of different feature sets for identification of variants in progressive aphasia



Introduction
In some types of dementia, such as primary progressive aphasia, language deficit is a core symptom, and the analysis of narrative or conversational speech is important for assessing the extent of an individual's language impairment. Analysis of connected speech has been limited in the past because it is time-consuming and requires expert annotation. However, studies have shown that it is possible for machine learning classifiers to achieve high accuracy on some diagnostic tasks when trained on features which were automatically extracted from speech transcripts.
In this paper, we summarize previous research on the automatic analysis of speech samples from individuals with dementia, focusing in particular on primary progressive aphasia. We discuss in detail different types of features and compare their effectiveness in the classification task. We suggest some benefits and drawbacks of these different features. We also examine the interactions between different feature sets, and discuss the relative importance of individual features across feature sets. Because we examine a large number of features on a relatively small data set, we emphasize that this work is exploratory in nature; nonetheless, our results are consistent with, and extend, previous work in the field.

Background
In recent years, there has been growing interest in using computational techniques to automatically detect dementia from speech and language features derived from a sample of narrative speech. Some researchers have explored ways to use methods such as part-of-speech tagging, statistical parsing, and speech signal analysis to detect disorders such as dementia of the Alzheimer's type (DAT) (Bucks et al., 2000; Singh et al., 2001; Thomas et al., 2005; Jarrold et al., 2010) and mild cognitive impairment (MCI) (Roark et al., 2011).
Here, we focus on a type of dementia called primary progressive aphasia (PPA). PPA is a subtype of frontotemporal dementia (FTD) which is characterized by progressive language impairment without other notable cognitive impairment. There are three subtypes of PPA: semantic dementia (SD), progressive nonfluent aphasia (PNFA), and logopenic progressive aphasia (LPA). SD, sometimes called "fluent" progressive aphasia, is typically marked by fluent but empty speech, anomia, deficits in comprehension, and spared grammar and syntax (Gorno-Tempini et al., 2011). In contrast, PNFA is characterized by halting and sometimes agrammatic speech, reduced syntactic complexity, word-finding difficulties, and relatively spared single-word comprehension (Gorno-Tempini et al., 2011). The third subtype, LPA, is characterized by slow speech and frequent word-finding difficulties; this subtype is not included in the current analysis.
Although clear diagnostic criteria for PPA have been established (Gorno-Tempini et al., 2011), there is no single test which can provide a diagnosis. Classification of PPA into subtypes requires evaluation of spoken output, as well as neuropsychological assessment and brain imaging. Qualitative evaluation of speech can often be done accurately by clinicians or researchers, but the ability to do this evaluation can require years of training and experience. Some researchers have performed detailed quantitative characterization of speech in PPA, but the precise characteristics of speech are not yet fully understood and this process is too time-consuming for most clinicians.
Peintner et al. (2008) conducted one of the earliest automatic analyses of speech from individuals with FTD, including SD and PNFA as well as a behavioural variant. They considered psycholinguistic features as well as phoneme duration features extracted from the audio samples. Although they were fairly successful in classifying participants according to their subtype, they did not report many details regarding the specific features which were useful or how those features might reflect the underlying impairment of the speakers. Pakhomov et al. (2010a) examined FTD speech from an information-theoretic approach. They constructed a language model of healthy control speech, and then calculated the perplexity and out-of-vocabulary rate for each of the patient groups relative to that model. In another study, Pakhomov et al. (2010b) extracted speech and language features from samples of FTD speech. In a principal components analysis, they discovered four components which accounted for most of the variance in their data: speech length, hesitancy, empty content, and grammaticality. However, they did not perform any classification experiments. Fraser et al. (2013a) attempted to classify participants as either SD patients, PNFA patients, or healthy controls using a large number of language features extracted from manual transcripts. They distinguished between SD and control participants with very high accuracy, and were also successful at distinguishing between PNFA and control participants. However, their method did not perform as well on the task of classifying SD vs. PNFA speakers. In subsequent work (Fraser et al., 2013b), they expanded their feature set to include acoustic features extracted directly from the audio file.

Data
Twenty-four patients with PPA were recruited through three Toronto memory clinics, and 16 age- and education-matched healthy controls were recruited through a volunteer pool. All participants were native speakers of English, or had completed some of their education in English. Exclusion criteria included a known history of drug or alcohol abuse and a history of neurological or major psychiatric illness. Each patient was diagnosed by a behavioural neurologist and all met current criteria for PPA (Gorno-Tempini et al., 2011). Table 1 shows demographic information for each group.
To elicit a sample of narrative speech, participants were asked to tell the well-known story of Cinderella. They were given a wordless picture book to remind them of the story; then the book was removed and they were asked to tell the story in their own words. This procedure, described in full by Saffran et al. (1989), is commonly used in studies of connected speech in aphasia.
The narrative samples were transcribed by trained research assistants. The transcriptions include filled pauses, repetitions, and false starts, and were annotated with the total speech time. Sentence boundaries were marked according to semantic, syntactic, and prosodic cues.

Classification framework
Given the audio files and transcripts, we calculate our features (described in detail below) and use them to train a support vector machine (SVM) classifier. We use a leave-one-out cross-validation framework and report the average accuracy (i.e. the proportion of correctly classified instances) across folds. We optimize the complexity parameter and the kernel type in a nested cross-validation loop over the training set. For comparison, we also tested a naïve Bayes classifier; however, we found that the results were consistently poorer and we do not report them here.
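The nested leave-one-out structure described above can be sketched as follows. This is only an illustration of the evaluation loop, not the implementation used in this work: to keep the example free of third-party dependencies, a k-nearest-neighbour classifier stands in for the SVM, and the tuned hyperparameter is k rather than the complexity parameter and kernel type. All function names are hypothetical.

```python
from collections import Counter
from math import dist


def knn_predict(train_X, train_y, x, k):
    """Stand-in classifier (NOT an SVM): majority vote of the k nearest points."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def loo_accuracy(X, y, k):
    """Leave-one-out accuracy of the stand-in classifier for a fixed k."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        correct += knn_predict(train_X, train_y, X[i], k) == y[i]
    return correct / len(X)


def nested_loo_accuracy(X, y, candidate_ks=(1, 3)):
    """Outer loop estimates accuracy; inner loop tunes k on training data only."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        # The held-out sample X[i] plays no part in choosing the hyperparameter.
        best_k = max(candidate_ks, key=lambda k: loo_accuracy(train_X, train_y, k))
        correct += knn_predict(train_X, train_y, X[i], best_k) == y[i]
    return correct / len(X)
```

The important design point is that hyperparameters are chosen inside each outer fold, so the reported accuracy is never influenced by the held-out participant.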

Features
In the following sections we will describe each of the feature sets that we use and explain how the features are computed, and we will discuss some of the potential advantages and disadvantages associated with each set. In particular, we discuss what types of data are necessary for the extraction of these features. The data types are: unsegmented transcripts, segmented transcripts, and audio.

Part-of-speech features
Different categories of words may be selectively impaired in different types of dementia. In PPA, individuals with SD tend to be more impaired with respect to nouns than verbs, and may replace nouns with pronouns or circumlocutory phrases. In contrast, individuals with PNFA may have more difficulty with verbs and may even demonstrate agrammatism, which can result in the omission of grammatical morphemes and function words. Thus, it is often useful to compare the relative frequencies with which words representing the different parts-of-speech (POS) are produced in a sample, as in Table 2. Similar features have been reported in computational studies of MCI (Roark et al., 2011), FTD (Pakhomov et al., 2010b), and DAT (Bucks et al., 2000). Numerous POS taggers exist, although we use the Stanford tagger here (Toutanova et al., 2003).
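As a rough illustration of how relative POS frequencies might be computed once a transcript has been tagged, consider the sketch below. It assumes the tagger output is already available as (word, tag) pairs using Penn Treebank tags; the coarse category mapping and the function name are assumptions made for this example and do not reproduce the exact features in Table 2.

```python
from collections import Counter

# Hypothetical mapping from Penn Treebank tag prefixes to coarse categories.
# Longer prefixes are listed first so that "PRP" is tested before shorter ones.
COARSE_CATEGORIES = [
    ("PRP", "pronoun"),
    ("NN", "noun"),
    ("VB", "verb"),
    ("JJ", "adjective"),
    ("RB", "adverb"),
]


def pos_frequencies(tagged_tokens):
    """Return the proportion of tokens falling into each coarse POS category."""
    counts = Counter()
    for _, tag in tagged_tokens:
        category = next(
            (name for prefix, name in COARSE_CATEGORIES if tag.startswith(prefix)),
            "other",
        )
        counts[category] += 1
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()} if total else {}
```

For example, a sample in which nouns are frequently replaced by pronouns would show a lower "noun" proportion and a higher "pronoun" proportion than a control sample of similar length.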

Complexity features
Changes in linguistic complexity may accompany the onset of dementia, although some studies have found a decrease in complexity (e.g. Kemper et al., 2001) while others have found an increase (e.g. Le et al., 2011).

CFG features
Although many of the complexity features above are derived from parse trees, in this section we present a set of features that take into account the context-free grammar (CFG) labels on each of the nodes. CFG features have been previously used to assess the grammaticality of sentences in an artificial error corpus (Wong and Dras, 2010) and to distinguish human from machine translations (Chae and Nenkova, 2009). However, this is the first time such features have been applied to speech from participants with dementia.
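One way such features could be derived is to count how often each production rule occurs in the parse trees of a sample. The sketch below assumes parses are available as Penn-style bracketed strings and normalizes each rule count by the total number of non-lexical productions; both the normalization and the function names are assumptions for illustration, not a description of the exact feature definitions used here.

```python
import re
from collections import Counter


def parse_tree(text):
    """Parse a Penn-style bracketed string into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read()


def production_counts(tree, counts):
    """Accumulate counts of non-lexical productions such as 'NP -> DT NN'."""
    label, children = tree
    subtrees = [child for child in children if isinstance(child, tuple)]
    if subtrees:  # lexical rules such as 'NN -> shoe' are skipped
        counts[f"{label} -> {' '.join(child[0] for child in subtrees)}"] += 1
        for child in subtrees:
            production_counts(child, counts)
    return counts


def cfg_features(parses):
    """Relative frequency of each production rule over all parses in a sample."""
    counts = Counter()
    for parse in parses:
        production_counts(parse_tree(parse), counts)
    total = sum(counts.values())
    return {rule: n / total for rule, n in counts.items()}
```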
In Table 4 we list a few examples of our 134 CFG features, as well as the three phrase-level features (calculated for noun phrases, verb phrases, and prepositional phrases).

Fluency features
Park et al. (2011) found that listeners' judgements of fluency were affected by a number of different variables, and the three most discriminative features were "speech rate, speech productivity, and audible struggle." For our list of fluency features (Table 5), we include only those features which could be extracted from the transcripts alone (assuming the total speech time is given). We count pauses filled by um and uh separately, as research has suggested that they may indicate different cognitive processes (Clark and Fox Tree, 2002).
The number of words in a sample could be easily generated using the word count feature in most text-editing software (although we first exclude filled pauses and NID tokens), and the verbal rate can subsequently be calculated directly. The other three features are easily calculated using string matching and an electronic dictionary.
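A minimal sketch of these transcript-based counts is given below. It assumes that NID denotes tokens not found in the dictionary, that the three remaining features are the um, uh, and NID counts, and that verbal rate is expressed in words per minute; none of these details are confirmed by Table 5 here, and the function and parameter names are hypothetical.

```python
def fluency_features(tokens, speech_seconds, dictionary):
    """Compute simple fluency features from a tokenized transcript.

    tokens         -- list of word tokens, including filled pauses
    speech_seconds -- total speech time annotated on the transcript
    dictionary     -- set of known word forms (stand-in for an electronic dictionary)
    """
    lowered = [token.lower() for token in tokens]
    um_count = lowered.count("um")
    uh_count = lowered.count("uh")
    # Assumption: NID tokens are those not found in the dictionary.
    nid_count = sum(
        1 for token in lowered
        if token not in ("um", "uh") and token not in dictionary
    )
    # Word count excludes filled pauses and NID tokens, as described above.
    word_count = len(lowered) - um_count - uh_count - nid_count
    return {
        "words": word_count,
        "verbal_rate": word_count / (speech_seconds / 60.0),  # words per minute
        "um": um_count,
        "uh": uh_count,
        "nid": nid_count,
    }
```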

Psycholinguistic features
Some types of dementia are characterized by impairments in semantic access. Such impairments may be sensitive to psycholinguistic features such as lexical frequency, familiarity, imageability, and age of acquisition (Table 6). We use the SUBTL frequency norms (Brysbaert and New, 2009) and the combined Bristol and Gilhooly-Logie norms (Stadthagen-Gonzalez and Davis, 2006; Gilhooly and Logie, 1980) for familiarity, imageability, and age of acquisition. We compute the average of each of these measures for all content words, as well as for nouns and verbs separately. Another measure that fits into this category is the frequency of occurrence of light verbs, which an impaired speaker may use to replace a more specific verb. We use the same list of light verbs as Breedin et al. (1998), given in Table 6.
Table 6: Psycholinguistic features.
Feature | Description
Frequency | Frequency with which a word occurs in some corpus of natural language
Familiarity | Subjective rating of how familiar a word seems
Imageability | Subjective rating of how easily a word generates an image in the mind
Age of acquisition | Subjective rating of how old a person is when they first learn that word
Light verbs | Number of occurrences of be, have, come, go, give, take, make, do, get, move, and put, normalized by total number of verbs
One challenge associated with psycholinguistic features is finding norms which provide adequate coverage for the given data. Fraser et al. (2013a) reported that the SUBTL frequency norms had a coverage of above 90% on their data, but the Bristol-Gilhooly-Logie norms had a coverage of only around 30%.
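The averaging step and the coverage problem can both be illustrated with a short sketch. Here a norm is represented as a plain dictionary from word to rating; the function names are hypothetical, verbs are assumed to be lemmatized already, and any ratings passed in are the caller's own data rather than values from the published norms.

```python
# Light verbs as listed in Table 6 (after Breedin et al., 1998).
LIGHT_VERBS = {"be", "have", "come", "go", "give", "take",
               "make", "do", "get", "move", "put"}


def mean_norm(words, norms):
    """Return (mean rating over covered words, proportion of words covered)."""
    rated = [norms[word] for word in words if word in norms]
    coverage = len(rated) / len(words) if words else 0.0
    mean = sum(rated) / len(rated) if rated else None  # None: no word was covered
    return mean, coverage


def light_verb_ratio(verb_lemmas):
    """Occurrences of light verbs, normalized by the total number of verbs."""
    if not verb_lemmas:
        return 0.0
    return sum(1 for verb in verb_lemmas if verb in LIGHT_VERBS) / len(verb_lemmas)
```

Reporting the coverage alongside the mean makes it clear when an average rests on only a small fraction of the words in a sample.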

Vocabulary richness features
Individuals experiencing semantic difficulty may use a limited range of vocabulary. We can measure the vocabulary richness or lexical diversity of a narrative sample using a number of different metrics (see Table 7). Type-token ratio has been a popular choice, perhaps due to its simplicity; however it is sensitive to the length of the sample. Bucks et al. (2000) were the first to apply Honoré's statistic and Brunét's index to the study of dementia, and found significant differences between individuals with DAT and healthy controls. Covington and McFall (2010) proposed a new measure called the moving-average type-token ratio (MATTR), which is independent of text length. This feature was later applied to aphasic speech in a study by Fergadiotis and Wright (2011), and was found to be one of the most unbiased indicators of lexical diversity in impaired speakers.
The measures given in Table 7 are easily computed from their respective formulae. In this work, we lemmatize each word using NLTK (Bird et al., 2009) before calculating the features. For MATTR, we consider w = 10, 20, 30, 40, 50.
Table 7: Vocabulary richness measures, where N is the number of word tokens, V is the number of word types, and V_1 is the number of words used only once.
Measure | Definition
Honoré's statistic | $100 \log N / (1 - V_1/V)$
Brunét's index | $N^{V^{-0.165}}$
Type-token ratio (TTR) | $V/N$
Moving-average type-token ratio (MATTR_w) | TTR calculated over a moving window of size $w$, and averaged over all windows
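The following sketch computes the four measures from a list of (already lemmatized) tokens, using the standard definitions: Honoré's statistic as 100 log N / (1 − V_1/V), Brunét's index as N^(V^−0.165), TTR as V/N, and MATTR as the mean TTR over all windows of size w. The natural logarithm is assumed for Honoré's statistic, and the function name and the fallback used when the sample is shorter than the window are choices made for this example.

```python
import math
from collections import Counter


def vocabulary_richness(tokens, window=10):
    """Return TTR, MATTR, Honoré's statistic, and Brunét's index for a sample."""
    n = len(tokens)
    if n == 0:
        raise ValueError("at least one token is required")
    counts = Counter(tokens)
    v = len(counts)                                   # number of word types
    v1 = sum(1 for c in counts.values() if c == 1)    # types used only once

    ttr = v / n
    brunet = n ** (v ** -0.165)
    # Honoré's statistic is undefined when every type occurs exactly once.
    honore = 100 * math.log(n) / (1 - v1 / v) if v1 < v else float("inf")

    # MATTR: mean TTR over every window of `window` consecutive tokens.
    windows = [tokens[i:i + window] for i in range(n - window + 1)]
    if windows:
        mattr = sum(len(set(w)) / window for w in windows) / len(windows)
    else:
        mattr = ttr  # assumption: fall back to plain TTR for short samples

    return {"ttr": ttr, "mattr": mattr, "honore": honore, "brunet": brunet}
```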

Acoustic features
What we call acoustic features are extracted directly from the audio file. We consider the features given in Table 8. Acoustic features have been shown to be useful when automatically detecting conditions such as Parkinson's disease, in which changes in speech are common (Little et al., 2009; Tsanas et al., 2012). Acoustic features have also been examined in studies of DAT (Meilán et al., 2014), FTD (Pakhomov et al., 2010b), and PPA (Fraser et al., 2013b, whose software we use here). An obvious benefit to acoustic features is that they do not require a transcription, and can be calculated immediately given an audio sample. The corresponding drawback is that they tell us nothing about the linguistic content of the narrative.

Experiments
We report the results of three experiments exploring the discriminative power of the different features. We first compare the classification accuracies using each individual feature set. We then perform an ablation study to determine which combination of feature sets leads to the highest classification accuracy. We also look at individual features across sets and discuss which ones are the most discriminative, particularly in situations where data might be limited.

Individual comparison of accuracy by set
The accuracies which result from using each feature set individually are given in Table 9. The highest accuracy across the three tasks is achieved in distinguishing SD participants from controls. An accuracy of .963 can be achieved using all the features together, or using the psycholinguistic or POS features alone. This is consistent with the semantic impairments that are observed in SD. The measures of vocabulary richness do not distinguish between the SD and control groups, suggesting it is the words themselves, and not the number of different words being used, that is important.
Table 8: Acoustic features.
Feature | Description
Total duration of speech | Total length of all non-silent segments
Phonation rate | Total duration of speech / total duration of the sample (including pauses)
Mean pause duration | Mean length of pauses > 0.15 s
Short pause count | Number of pauses > 0.15 s and < 0.4 s
Long pause count | Number of pauses > 0.4 s
Pause:word ratio | Ratio of silent segments longer than 150 ms to non-silent segments
F0:3 mean | Mean of the fundamental frequency and the first three formant frequencies
F0:3 variance | Variance of the fundamental frequency and the first three formant frequencies
Mean instantaneous power | Measure related to the loudness of the signal
Mean 1st ACF | Mean first autocorrelation function
Max 1st ACF | Maximum first autocorrelation function
Skewness | Measure of lack of symmetry, associated with tense or creaky voice
Kurtosis | Measure of the peakedness of the signal
ZCR | Zero-crossing rate, can be used to distinguish between voiced and unvoiced regions
MRPDE | Mean recurrence period density entropy, a measure of periodicity
Jitter | Measure of the short-term variation in the pitch (frequency) of a voice
Shimmer | Measure of the short-term variation in the loudness (amplitude) of a voice
In the case of PNFA participants vs. controls, we find that the highest accuracy is achieved using all the features, and the second highest by using only acoustic features. This is not surprising, considering that the acoustic features include measures of pausing and phonation rate, which can detect the characteristic halting speech of PNFA. The third best accuracy is achieved using the fluency features, which also fits with this explanation. However, we might have expected that the complexity and CFG features would be more sensitive to the grammatical impairments of PNFA.
Finally, the best accuracy for SD vs. PNFA is lower than in the previous two cases, and is achieved using only CFG features. This suggests that there are some grammatical constructions which occur with different frequencies in the two groups. These differences do not appear to be captured by the complexity features, which could explain why Fraser et al. (2013a) did not find syntactic differences between the SD and PNFA groups. Interestingly, the results using CFG features are actually higher than the results using all features. This demonstrates that classifier performance can be adversely affected by the presence of irrelevant features, especially in small data sets.
Table 9: Classification accuracies for each feature set individually using an SVM classifier. Bold indicates the highest accuracy for each task.

Combining feature sets
In the previous section we examined the feature sets individually; however, one type of feature may complement the information contained in another feature set, or it may contain redundant information. To examine the interactions between the feature sets, we perform an ablation study. Starting with all the features, we remove each feature set one at a time and measure the accuracy of the classifier. The feature set whose removal causes the smallest decrease in accuracy is then removed permanently from the experiment, the reasoning being that the most important feature sets will cause the greatest decrease in accuracy when removed. In some cases, we observe that the classification accuracy actually increases when a set is removed, which suggests that those features are not relevant to the classification (at least in combination with the other sets). In the case of a tie, we remove the feature set whose individual classification accuracy on that task is lowest. The procedure is then repeated on the remaining feature sets, continuing until only one set remains.
The results for SD vs. controls are given in Table 10a. The best result, 1.00, is achieved by combining the psycholinguistic and POS features. This is unsurprising, since each of those feature sets performs well individually. Curiously, the same result is achieved when the complexity, vocabulary richness, and CFG features are also included, but not at the intermediate stages between those two optimal sets. We attribute this to the interactions between features and the small data set.
For PNFA vs. controls, shown in Table 10b, the best result of .966 is achieved using a combination of acoustic and psycholinguistic features. In this case the removal of the fluency features, which gave the second highest individual accuracy, does not make a difference to the accuracy. This suggests that the fluency features contain similar information to one of the remaining sets, presumably the acoustic set.
Table 10: A=acoustic, P=psycholinguistic, POS=part-of-speech, C=complexity, F=fluency, VR=vocabulary richness, CFG=CFG production rule features. Bold indicates the highest accuracy with the fewest feature sets.
In the case of SD vs. PNFA, we again see that the best accuracy can be achieved by combining two feature sets, as shown in Table 10c. Using psycholinguistic and CFG features, we can achieve an accuracy of .917, a substantial improvement over the best accuracy for this task in Table 9. In fact, in all three cases we see that using a carefully selected combination of feature sets can result in better accuracy than using all the feature sets together or using any one set individually.
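The ablation procedure used in this section can be summarized in a few lines of code. In this sketch, `evaluate` is a hypothetical callback that returns the cross-validated accuracy for a given list of feature sets, and `individual_accuracy` holds the per-set accuracies used for tie-breaking; both are placeholders that the caller would supply.

```python
def ablation(feature_sets, evaluate, individual_accuracy):
    """Greedy backward elimination over whole feature sets.

    Returns a list of (remaining_sets, accuracy) pairs, one per stage.
    """
    remaining = list(feature_sets)
    history = [(tuple(remaining), evaluate(remaining))]
    while len(remaining) > 1:
        # Accuracy obtained after removing each candidate set in turn.
        trials = {
            candidate: evaluate([s for s in remaining if s != candidate])
            for candidate in remaining
        }
        # Smallest decrease in accuracy == highest accuracy after removal.
        best = max(trials.values())
        tied = [s for s in remaining if trials[s] == best]
        # Tie-break: drop the set with the lowest individual accuracy.
        drop = min(tied, key=lambda s: individual_accuracy[s])
        remaining.remove(drop)
        history.append((tuple(remaining), best))
    return history
```

Because the search is greedy, it explores only one path through the space of feature-set combinations, which is consistent with the observation above that an optimal accuracy can reappear at non-adjacent stages.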

Best features for incomplete data
Up to this point, we have examined complete feature sets. We now briefly explore which individual features are the most discriminative across feature sets. We approach this as a practical consideration: given the data that a researcher has, and limited resources, what are the best features to measure? We consider the following four scenarios:
1. Given audio files only. This scenario often arises because it is relatively easy to record speech, but difficult to have it transcribed. Only acoustic features can be extracted.
2. Given basic transcriptions only (no audio). We assume there is no sentence segmentation and the time is not marked (e.g. as in the output of automatic speech recognition). Thus, we can measure psycholinguistic, POS, and vocabulary measures. We can also measure the fluency features except for verbal rate, as well as mean word length and subordinate/coordinate conjunctions from the complexity set. Without sentence boundaries, we cannot parse the transcripts.
3. Given fully segmented transcripts (no audio). We can measure all features except for acoustic features.
4. Given audio and fully segmented transcripts. We can measure all features.
For each scenario, we rank the available features by their χ² value and use only the top 10 as input to the SVM classifier (see Manning et al. (2008) for a complete explanation of χ² feature selection). We only include features if χ² > 0, so in cases where there are very few relevant features, fewer than 10 features may be selected. Because we perform cross-validation, the selected features may vary across different folds. In the tables that follow, we present the features ranked by the proportion of folds in which they appear (i.e. a feature with the value 1.00 was selected in every fold). Due to space constraints, only the top 10 ranked features are shown.
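The per-fold selection and the fold-frequency ranking can be sketched as follows. The source does not say how continuous features were prepared for the χ² test, so this example makes an explicit assumption: feature values are non-negative (e.g. rescaled beforehand), and the statistic compares the per-class sum of each feature with the sum expected if the feature were independent of the class. All function names are hypothetical.

```python
from collections import Counter


def chi2_score(values, labels):
    """Chi-square statistic for one non-negative feature against class labels."""
    total = sum(values)
    if total == 0:
        return 0.0
    n = len(values)
    score = 0.0
    for cls in set(labels):
        observed = sum(v for v, label in zip(values, labels) if label == cls)
        expected = total * labels.count(cls) / n
        score += (observed - expected) ** 2 / expected
    return score


def select_top_k(X, y, names, k=10):
    """Names of up to k features with the highest strictly positive scores."""
    scores = {
        name: chi2_score([row[j] for row in X], y)
        for j, name in enumerate(names)
    }
    ranked = sorted((n for n in names if scores[n] > 0), key=lambda n: -scores[n])
    return ranked[:k]


def selection_frequency(X, y, names, k=10):
    """Proportion of leave-one-out folds in which each feature is selected."""
    folds = Counter()
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        # Selection uses the training portion of the fold only.
        folds.update(select_top_k(train_X, train_y, names, k))
    return {name: folds[name] / len(X) for name in names}
```

A value of 1.0 in the returned dictionary corresponds to a feature that was selected in every fold.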
The results for Scenario 1 are given in Table 11a. For SD vs. controls and PNFA vs. controls, the most highly ranked features tend to be related to fluency and rate of speech, as well as voice quality (skewness and MRPDE). However, when distinguishing between the two patient groups, the acoustic features are essentially useless. In most cases, none of the acoustic features had a non-zero χ² value, and thus the classifier could not be properly trained.
For Scenario 2 (Table 11b), the results for SD vs. controls show that within the psycholinguistic and POS feature sets, features relating to familiarity and frequency are very important, as well as nouns and demonstratives. In the PNFA vs. controls case, we see that a number of the vocabulary richness features are selected, which is in contrast to the previous two experiments. However, it appears that only the MATTR feature is important (with varying window lengths), so when we considered only full feature sets, that information was obscured by the other, irrelevant features in that set. The SD vs. PNFA case shows a mix of features from the previous two cases.
For Scenario 3 (Table 11c), we add the complexity and CFG features. These features do not have a large effect in the SD vs. controls case, but a few CFG features are selected in the PNFA vs. controls and SD vs. PNFA cases.
In Scenario 4 (Table 11d), we consider all features. In the SD vs. controls case this increases the accuracy. However, for PNFA vs. controls and SD vs. PNFA, the classification accuracy actually decreases relative to Scenario 3. When the number of features increases, the potential to overfit to the training data in each fold also increases, and it seems likely that this is occurring here. Nonetheless, we expect that the features which are selected in every fold are still highly relevant. These features are unchanged between Scenarios 3 and 4 in the SD vs. controls and SD vs. PNFA cases; in the PNFA vs. controls case, however, the acoustic features are now ranked more highly than some of the vocabulary richness and CFG features from Scenario 3.

Discussion
While it may be tempting to calculate as many features as possible and use them all in a classifier, we have shown here that better results can be achieved by choosing a small, relevant subset of features. In particular, psycholinguistic features such as frequency and familiarity were useful in all three classification tasks. Acoustic features were useful in discriminating patients from controls, but not for discriminating between the two PPA subtypes. We also found that MATTR was relevant in some cases, although the other vocabulary richness features were not, and that the CFG features were more useful than traditional measures of syntactic complexity. POS features were useful only in distinguishing between SD and controls.
One of the biggest challenges in this type of work is the small amount of data available. Psychological studies typically involve only tens to possibly hundreds of participants, while machine learning researchers often tackle problems with thousands to millions of data points. We have chosen techniques appropriate for small data sets, but given the potential weaknesses of machine learning methods when training data are limited, these findings must be considered preliminary. Even so, we believe that this is a promising approach for future applications, including automated screening for language impairment, support for clinical diagnosis, tracking severity of symptoms over time, and evaluating therapeutic interventions.