Entheos: A Multimodal Dataset for Studying Enthusiasm

Enthusiasm plays an important role in engaging communication. It enables speakers to be distinguished and remembered, creating an emotional bond that inspires and motivates their addressees to act, listen, and coordinate (Bettencourt et al., 1983). Although people can easily identify enthusiasm, this is a rather difficult task for machines due to the lack of resources and models that can help them understand or generate enthusiastic behavior. We introduce Entheos, the first multimodal dataset for studying enthusiasm, composed of video, audio, and text. We present several baseline models and an ablation study using different features, showing the importance of pitch, loudness, and discourse relation parsing in distinguishing enthusiastic communication.


Overview
Although different emotional constructs such as anger and happiness have been studied extensively in the field of natural language processing (NLP), more fine-grained emotional expressions such as enthusiasm or charisma are relatively unexplored. Models and datasets for such expressions can benefit different areas of NLP and AI. Multimodal human-machine interaction can be more effective if systems can achieve a deeper understanding of more complex emotional responses or generate appropriate emotionally-aware communicative presentations. Given the importance of enthusiasm in teaching (Bettencourt et al., 1983; Zhang, 2014), for instance, researchers are studying the effect of virtual agents and robots that can behave in an enthusiastic manner (Liew et al., 2017, 2020; Saad et al., 2019). Current research, however, is still far from generating natural enthusiastic behavior.
Although previous research in psychology, education, and business has studied the importance of enthusiasm in communication (Bettencourt et al., 1983; Sandberg, 2007; Keating, 2011; Antonakis et al., 2019), it is relatively unexplored in the NLP and dialogue literature. We take a step toward bridging this gap by introducing the first multimodal dataset labeled with levels of enthusiasm, following the definition provided by Keller et al. (2016).
Our contributions are as follows: First, we present Entheos (from the Greek ἔνθεος, "being possessed by a god", the root of enthusiasm), the first multimodal dataset of TED talk speeches annotated with enthusiasm levels (Section 3). It contains sentence segments labeled as either monotonous, normal, or enthusiastic. Figure 1 shows an example of an enthusiastic sample. Second, in search of multimodal signals for understanding enthusiasm, we present an analysis of our data to identify attributes of enthusiastic speech in different modalities (Sections 3.5 and 5). Finally, we provide several baseline models using different kinds of features extracted from text, speech, and video. In addition, we show the importance of identifying discourse relations in predicting enthusiasm (Section 5).

Related Work
In this paper, we focus on investigating resources and models that can help us gain insights into ways by which computers can understand and predict enthusiasm. This topic is relatively unexplored in computer science, although it has been extensively studied in psychology (Bettencourt et al., 1983; Sandberg, 2007; Keating, 2011; Antonakis et al., 2019).
Enthusiasm Limited work exists on the automatic detection of enthusiasm, and it has mainly been done in the text domain. Inaba et al. (2011) worked on the detection of enthusiasm in human text-based dialogues, using lexical features and word co-occurrences with conditional random fields in order to distinguish enthusiastic utterances from non-enthusiastic ones. They defined enthusiasm as "the strength of each participant's desire to continue the dialogue each time he/she makes an utterance". In our work, we instead combine different modalities and features to detect enthusiasm, and we define an enthusiastic speaker as "stimulating, energetic, and motivating" (Keller et al., 2016). Tokuhisa and Terashima (2006) also worked with human-to-human conversational dialogues and annotated dialogue acts (DAs) and rhetorical relations (RRs) at the sentence level. An enthusiasm score in the range of 10-90 was given without providing examples to the annotators. The relationship between DAs, RRs, and enthusiasm was analyzed based on their frequencies. They found that affective and cooperative utterances are significant in an enthusiastic dialogue. We instead detect RRs automatically and train a feed-forward network to classify enthusiasm into three levels: monotonous, normal, and enthusiastic. During data annotation, examples for each category were available as references. Twitter data have also been used to detect enthusiasm. Mishra and Diesner (2019) created a dataset with enthusiastic and passive labels. Enthusiastic tweets had to include a personal expression of emotion or a call to action, whereas passive tweets lacked clear emotive content or a call to action. They trained logistic regression models using salient terms. We evaluate emotional expressions in several modalities: we use acoustic features that relate to emotion, such as pitch and voice quality, as well as Facial Action Units extracted from videos, which measure the intensity of different facial expressions.
Charisma Enthusiasm is also a trait that can be displayed by charismatic speakers (Spencer, 1973), who are additionally perceived as competent, passionate, and self-confident (Niebuhr, 2020). Charisma is a desired trait for leaders in business and politics (Antonakis et al., 2019; De Jong and Den Hartog, 2007) because it can influence followers to undertake personally costly yet socially beneficial actions. Niebuhr et al. (2016) investigated the prosodic attributes of charismatic speakers. They analyzed pitch level, pitch variation, loudness, duration of silence intervals, etc., and concluded that charisma can be trained as far as melodic features are concerned. In addition to analyzing the relationship of different attributes with enthusiasm, we also train a model that can distinguish between different levels of enthusiasm.
Although sentiment analysis and emotion detection have been studied extensively in unimodal and multimodal frameworks, as shown in several surveys (Marechal et al., 2019; Garcia-Garcia et al., 2017; Seyeditabari et al., 2018; Sudhakar and Anil, 2015), there is a gap in the analysis, detection, and generation of enthusiastic behavior. Our dataset will allow researchers to extend work on understanding human behavior and to generate more natural virtual agents (Zhang, 2014; Keller et al., 2014; Liew et al., 2020; Viegas et al., 2020).

Entheos Dataset
In this section we present the Entheos dataset. We describe our domain choice and label selection, the annotation process, extracted features, as well as statistics of the dataset.

Data Acquisition
Enthusiastic speakers are passionate about their message, seeking to win their audience over to their purpose and persuade them to change their perspective or take action. Given that TED is well-known for spreading powerful messages that can change attitudes and behavior, we use TED talk speeches as our domain for creating a multimodal enthusiasm dataset. We randomly selected 52 male and female speakers from the TEDLIUM corpus release 3 (Hernandez et al., 2018), which contains audio of 2,351 talks. Transcripts were obtained through the Google Cloud transcription service. The talks were segmented into sentences based on punctuation. We extend the samples from the TEDLIUM corpus with aligned video segments downloaded from the official TED website.
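As an illustration, punctuation-based sentence segmentation can be done in a few lines of Python; the regex-based splitter below is a minimal sketch, not necessarily the exact pipeline used here:

```python
import re

def split_into_sentences(transcript: str) -> list[str]:
    """Split a transcript into sentences on terminal punctuation.

    Illustrative sketch of punctuation-based segmentation; the exact
    segmentation tool used for the dataset is not specified.
    """
    # Split after ., !, or ? followed by whitespace, keeping the delimiter.
    sentences = re.split(r"(?<=[.!?])\s+", transcript.strip())
    return [s for s in sentences if s]

print(split_into_sentences("We can fix this. Will you join me? Let's start!"))
# ['We can fix this.', 'Will you join me?', "Let's start!"]
```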

Label Selection and Temporal Granularity
In order to define the temporal granularity for annotation and the labels to use, we performed preliminary annotation experiments with three annotators. Three audio recordings of talks were chosen from speakers with different proficiency levels. One recording was a TED talk by Al Gore, and the remaining two were recordings of participants in a pilot study at our institution in which they introduce themselves and describe their skills.
We evaluated two different temporal granularities: sentence-level and entire talk. In addition, we explored the use of three different sets of labels, which will be described in the following.

PSCR (Public Speaking Competence Rubric)
PSCR (Schreiber et al., 2012) was developed to effectively assess students' public speaking skills. It is composed of eleven skills that are assessed on a 0-4 scale. We focused on the seventh skill, which evaluates the effective use of vocal expression and paralanguage to engage the audience. During annotation, annotators had Table 1 available, which gives a detailed description of how the speaker articulates at each rating.
Vocal Attributes Based on the PSCR descriptions, we distilled four main components of the effective use of the voice: vocal variation, intensity, pacing, and expression. Each one was evaluated with a score of 0-4 and described as depicted in Table 2.
Enthusiasm and Emphasis As a final set of labels, we decided to use intuitive categories, namely enthusiasm and emphasis. For enthusiasm, we chose the definition provided by Keller et al. (2016), as they study enthusiasm in the context of spoken monologues (similar to our data), while Inaba et al. (2011) studied written dialogues. We asked annotators to label enthusiasm on three levels: monotonous, normal, and enthusiastic. As Table 3 shows, annotators were asked to label emphasis as existent or not, depending on whether words were emphasized by speaking louder or pronouncing words slowly.

Experiment Description
The experiment was composed of two parts. First, the entire audio recordings were played and the annotators were asked to use only the PSCR annotation scheme, rating each talk with a single score. Afterwards, seven sentences of each talk were played with pauses in between to allow annotation using the vocal attribute, enthusiasm, and emphasis labels; each sentence was annotated with six scores. For both parts, the annotators had access to the descriptions of the labels during annotation, as shown in Tables 1, 2, and 3. Once all annotators finished labeling a sample, the next one was played.

Results and Conclusion
Table 4 shows the interrater agreement for the different annotation schemes in terms of Fleiss' kappa (Landis and Koch, 1977). We can see that PSCR, which rated the entire talk, has the lowest agreement. Vocal variation and pacing have moderate agreement, while vocal intensity, enthusiasm, and emphasis show almost perfect agreement. Given these results, we annotated audio recordings at the sentence level using the enthusiasm and emphasis labels.
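For reference, Fleiss' kappa for multiply annotated samples can be computed with statsmodels; the ratings below are hypothetical stand-ins for the three pilot annotators:

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical ratings: rows = annotated sentences, columns = the three
# annotators; 0 = monotonous, 1 = normal, 2 = enthusiastic.
ratings = np.array([
    [2, 2, 2],
    [0, 0, 1],
    [1, 1, 1],
    [2, 2, 1],
])

# aggregate_raters converts (subjects x raters) labels into the
# (subjects x categories) count table that fleiss_kappa expects.
table, _ = aggregate_raters(ratings)
print(f"Fleiss' kappa: {fleiss_kappa(table):.3f}")
```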

Data Annotation Protocol
Our study was approved by our institution's human subject board, and annotators were paid $20/h. Seventeen subjects participated in data annotation and signed a consent form before the study. For data annotation, an internal tool was created that enabled annotators to listen to audio samples and annotate them through their web browser at their convenience. As labeling availability fluctuated, instead of randomly choosing samples from the entire dataset, we decided to release small batches of data to obtain as many annotations per sample as possible. Every two weeks, a batch of 200 samples was made available for annotation, presented in a randomly chosen order for each annotator. As our definition of enthusiasm (Table 3) allows subjective interpretations, we included three reference audio files for each enthusiasm level in the web interface of our annotation tool, as depicted in Figure 2. Annotators were instructed to listen to the reference files after every 10 labeled samples and whenever unsure how to label a sample. In addition, annotators were given the definitions of enthusiasm and emphasis shown in Table 3. Besides enthusiasm and emphasis, annotators also labeled the speaker's perceived gender. We limited the options for perceived gender to female and male, based on prior work which used these two genders to improve performance in emotion detection (Li et al., 2019). Annotators were asked to mark samples containing laughter or clapping as noisy files.
Figure 2: Layout of the annotation interface. On the top left is the sample to be annotated, and below are the different labels: perceived gender, enthusiasm, and emphasis. In the top center is the option to mark the sample as noisy if laughter or clapping is present. On the right side are reference samples for the three different levels of enthusiasm.

Annotator Quality Assessment: Annotation was performed by 17 different annotators. As noisy annotations are common in crowdsourcing without expert annotators, due to spammers and malicious workers (Burmania et al., 2015), we compared the percentage agreement of each individual's annotations with a preliminary majority vote. The analysis showed that 12 annotators had agreement lower than 30%; the same annotators had also labeled less than 17% of the data. To ensure high annotation quality, we kept the remaining five annotators, each of whom had labeled more than 50% of the data. These five annotators identified themselves as Latino, Asian, and White. We removed all samples that had only one or two annotations and computed the final majority vote for the remaining 1,126 samples. To confirm high inter-rater agreement, we computed Cohen's kappa (McHugh, 2012) in a pairwise manner for the five annotators and obtained an average agreement of 0.66.
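A minimal sketch of this quality check, with a hypothetical label matrix in place of our annotation data, could look as follows:

```python
from collections import Counter
from itertools import combinations

import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical label matrix: rows = samples, columns = the five retained
# annotators; entries are enthusiasm labels.
labels = np.array([
    ["enth", "enth", "norm", "enth", "enth"],
    ["mono", "mono", "mono", "norm", "mono"],
    ["norm", "norm", "norm", "norm", "enth"],
])

# Final label per sample: simple majority vote.
majority = [Counter(row).most_common(1)[0][0] for row in labels]

# Average pairwise Cohen's kappa across the five annotators.
pairs = combinations(range(labels.shape[1]), 2)
kappas = [cohen_kappa_score(labels[:, i], labels[:, j]) for i, j in pairs]
print(f"Majority labels: {majority}")
print(f"Average pairwise kappa: {np.mean(kappas):.2f}")
```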

Final Data Selection
Out of 1,819 labeled samples, we kept the 1,126 which had more than one annotation. The selected samples are from 113 different TED talk speeches, 60 from male and 53 from female speakers. We created a test split with 108 samples from five speakers of each perceived gender. The training set, composed of 55 male and 48 female speakers, has a total of 1,018 samples. There is no overlap of speakers between the training and test sets.

Figure 3 (center): Female speakers have proportionally fewer monotonous samples and more normal samples than male speakers, but the same proportion of enthusiastic samples. Figure 3 (bottom): Samples labeled as enthusiastic have mainly been rated as fascinating, persuasive, and inspiring; they have rarely been rated negatively.
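The speaker-disjoint split described in this section can be reproduced with scikit-learn's group-based splitters; the toy data below stand in for our samples and speaker ids:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-ins: 20 samples from 5 speakers.
samples = np.arange(20).reshape(-1, 1)
labels = np.random.randint(0, 3, size=20)
speaker_ids = np.repeat(["s1", "s2", "s3", "s4", "s5"], 4)

# Grouping by speaker guarantees no speaker overlap between splits,
# mirroring the speaker-disjoint train/test construction above.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(samples, labels, groups=speaker_ids))
assert set(speaker_ids[train_idx]).isdisjoint(speaker_ids[test_idx])
```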

Data Statistics
In the following, we describe the relationship between the different enthusiasm levels and other attributes of the talks, such as viewer ratings, number of views and comments, and perceived gender of the speakers. This metadata was obtained from a Kaggle competition that collected data about TED talks up to September 21, 2017.
Figure 3 (center) shows the distribution of enthusiasm levels by perceived gender. In Figure 3 (bottom), the label distribution among the different ratings given by viewers is shown. There are nine positive ratings (funny, beautiful, ingenious, courageous, informative, fascinating, inspiring, persuasive, jaw-dropping) and five negative ratings (longwinded, confusing, unconvincing, ok, obnoxious) from which viewers could select. The ratings have been sorted by increasing number of enthusiastic samples. We can see that the negative ratings have the fewest enthusiastic samples, while the ratings with the three highest numbers of enthusiastic samples are fascinating, persuasive, and inspiring. We also performed two one-way ANOVAs to evaluate whether the number of views and comments depends on the enthusiasm level. The resulting p-values were p = 0.3844 and p = 0.6892, respectively, indicating no significant dependence of views or comments on the speaker's enthusiasm level.
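Such a one-way ANOVA can be run with scipy; the view counts below are hypothetical:

```python
from scipy.stats import f_oneway

# Hypothetical view counts grouped by the enthusiasm label of the sample.
views_monotonous = [1.2e6, 0.8e6, 2.1e6]
views_normal = [1.5e6, 0.9e6, 1.1e6, 2.4e6]
views_enthusiastic = [1.0e6, 2.2e6, 1.7e6]

stat, p = f_oneway(views_monotonous, views_normal, views_enthusiastic)
print(f"F = {stat:.3f}, p = {p:.4f}")  # p > 0.05 -> no significant effect
```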

Computational Experiments
In the experiments of this paper, we aim to establish a performance baseline for the Entheos dataset using only the enthusiasm annotations. We train our model with different feature combinations to understand the role of different modalities in enthusiasm detection (see Figure 4). In the following we describe different features that were extracted and the model architecture that we used.

Features
Given the small number of labeled samples, instead of training an end-to-end model, we extract different features that will serve as input for our model. In the following we will describe the features used per modality.
Video: As enthusiasm is related to emotions, we extracted Facial Action Units (FAUs), which describe the intensity of muscular movements in the face based on the Facial Action Coding System (FACS) (Friesen and Ekman, 1978).
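The audio and text features used in our experiments include eGeMAPS acoustic functionals and BERT embeddings (Section 5). As a hedged sketch, such features could be extracted as follows; the openSMILE feature level, the BERT pooling strategy, and the file name are assumptions, not the documented setup:

```python
import opensmile
import torch
from transformers import AutoModel, AutoTokenizer

# Acoustic functionals: eGeMAPS features via the opensmile package.
# (That this exact toolchain was used is an assumption; "sample.wav"
# is a hypothetical sentence-level audio segment.)
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)
acoustic = smile.process_file("sample.wav")  # 1 x 88 feature DataFrame

# Text features: sentence-level BERT embedding. Mean pooling over the
# last hidden states is an assumption; the pooling is not specified.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
inputs = tok("Every child deserves a great teacher.", return_tensors="pt")
with torch.no_grad():
    text_emb = bert(**inputs).last_hidden_state.mean(dim=1)  # shape (1, 768)
```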

Model Architecture
Our model is composed of four fully connected layers with ReLU activation functions in between. We use concatenation to combine different features in the multimodal setting. Given our imbalanced dataset, we compute class weights, which relate the number of samples per label to the total number of samples. The class weights are then passed to our loss function (cross entropy loss) to give more weight to samples of the underrepresented classes. We use the Adam optimizer (Kingma and Ba, 2015), and during training we perform early stopping to avoid overfitting. We train the model both on a three-class problem using all enthusiasm levels and in a binary manner, combining the "monotonous" and "normal" labels into a category called "non-enthusiastic".
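A minimal PyTorch sketch of this architecture is given below; the hidden layer sizes, per-class counts, and learning rate are assumptions, as they are not reported here:

```python
import torch
import torch.nn as nn

class EnthusiasmClassifier(nn.Module):
    """Four fully connected layers with ReLU activations in between,
    as described above. Hidden sizes are assumptions (not reported)."""

    def __init__(self, input_dim: int, num_classes: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Class weights from inverse label frequency (hypothetical per-class
# counts), so underrepresented classes weigh more in the loss.
counts = torch.tensor([220.0, 610.0, 188.0])
weights = counts.sum() / (len(counts) * counts)

model = EnthusiasmClassifier(input_dim=88 + 768)  # e.g., eGeMAPS + BERT
criterion = nn.CrossEntropyLoss(weight=weights)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```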

Results and Evaluation
In this section, we present the performance results of our model using different combinations of features. We also evaluate the performance of the discourse parsers used and present statistical analyses of the visual and acoustic features. All results of our statistical analyses are shown in Appendix A.

Predicting Enthusiasm level
For each feature combination, we performed hyperparameter search with 10-fold cross-validation.
The best hyperparameter combination was used to train the model on the entire training set, and we evaluated the performance of the models on our test set. Table 5 shows the weighted average results for precision, recall, and F1-score. We see that in the unimodal case, BERT embeddings perform best in the binary classification as well as in the three-class problem. Although PDTB has a higher F1-score in the binary case, RST performs better in the multi-class problem. Among the audio features, eGeMAPS performs slightly better than the other acoustic features; in the multi-class case, IS09 features are the best performing acoustic features. When all features except AUs are combined, we reach the highest F1-score for the binary problem, improving the best unimodal performance by 0.08. We also see that combining both discourse relation features with eGeMAPS and BERT improves the F1-score by 0.08 compared to using only one of them. In the multi-class problem, the best performing feature combination shows only a slight improvement of 0.04 over the unimodal case.
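The weighted averages reported in Table 5 can be computed with scikit-learn; the predictions below are hypothetical:

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical predictions on the test split; "weighted" averaging
# accounts for the label imbalance, as reported in Table 5.
y_true = ["enth", "norm", "norm", "mono", "enth", "norm"]
y_pred = ["enth", "norm", "mono", "mono", "norm", "norm"]

p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```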

Evaluating the Effect of Discourse Features
We see in Table 5 that discourse relations help the model achieve the highest F1-score. However, we obtained the discourse relations using discourse parsers trained on Wall Street Journal data, which differs considerably from spoken monologues.
To evaluate the performance of the parsers, 40 samples of our data were manually annotated with RST and PDTB relations by two annotators. The annotation protocol was approved by our institution's human subject research center. The interrater agreement was κ = 0.88. The accuracy of the RST parser on our data sample was 46.7%, and that of the PDTB parser was 60.0%. Although the accuracy of the parsers is low on our data, we have seen that concatenating both discourse relation features to BERT and eGeMAPS improved our model's performance from an F1-score of 0.64 to 0.83 in the binary classification. Although manually annotating the entire resource was beyond the scope of this paper, we believe it is necessary to understand the weaknesses and strengths of automatic parsers when they are applied to spoken monologues. With current efforts to create discourse parsers for speech, the role of discourse parsing in enthusiasm detection will become better understood.
Figure 5(a,b) shows the relative occurrence of each enthusiasm level for RST and PDTB relations, in ascending order of enthusiastic samples. In Figure 5a we can see that most samples do not have any discourse relation. However, there is a clear difference in the number of monotonous and enthusiastic samples that show contingency as well as temporal relations. In Figure 5b we see that enthusiastic samples use more elaboration, attribution, and joint relations than monotonous samples. We performed Pearson's chi-square test of our null hypothesis that discourse relations and enthusiasm level are independent of each other. We obtained a p-value of 0.0001 for PDTB and a p-value of 0.008 for RST, which allows us to reject the null hypothesis: the distribution of discourse relations and the level of enthusiasm are not independent.
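The chi-square test operates on a contingency table of enthusiasm levels against relation classes; below is a sketch with hypothetical counts:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = enthusiasm levels (monotonous,
# normal, enthusiastic), columns = PDTB relation classes (none,
# expansion, contingency, comparison, temporal).
table = np.array([
    [120, 30, 25, 10, 18],
    [310, 95, 60, 35, 40],
    [ 90, 50, 15,  9,  5],
])

chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
```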

Investigating Visual Features
Given that AUs have not helped our model improve, we evaluated their dependence on our labels. We performed two separate one-way ANOVAs to evaluate the dependence of the mean of the 18 AUs on our labels, as well as of the standard deviation of the AUs on our labels. The AUs with p-value < 0.05 are AU 12 (lip corner puller), AU 15 (lip corner depressor), AU 17 (chin raiser), and AU 26 (jaw drop). Figure 5(c,d) shows the label distribution for the mean of AU 26 and the standard deviation of AU 12. In both cases, we can observe that monotonous samples more frequently have a mean and standard deviation of zero compared to enthusiastic samples. We can also see in Figure 5d that enthusiastic samples more frequently have a standard deviation of AU 12 greater than 0.02.

Investigating Acoustic Features
We have seen that acoustic features are important in improving our model's performance. In this section, we evaluate whether pitch (F0) and loudness are independent of the enthusiasm level. We perform a one-way ANOVA for the mean F0 per sample and its enthusiasm level, as well as for the mean loudness. Both p-values are < 0.05, meaning that the enthusiasm labels depend on these acoustic features. In Figure 5e, we can see that monotonous samples have a lower mean F0 than enthusiastic samples. We can also see in Figure 5f that monotonous samples have a lower mean loudness than enthusiastic samples. These observations agree with the intuition that enthusiastic speakers speak louder and increase their pitch.

Discussion and Conclusion
We present Entheos, the first multimodal dataset for enthusiasm detection (available at https://github.com/clviegas/Entheos-Dataset), and discuss several baseline models. In addition, we present qualitative and quantitative analyses for studying and predicting enthusiasm using the textual, acoustic, and visual modalities. Our work has several limitations. TED talks are a very specific form of monologue, as they are well rehearsed and prepared; on the other hand, this makes it more likely that we can find enthusiastic speakers and well-structured sentences. To understand enthusiastic behavior in daily conversations, more data from other domains needs to be annotated and studied. We hope that our annotation protocol will help other researchers in the future.
Further theoretical and empirical research is needed to better study enthusiastic behavior in general. The signals and definitions that we have worked with are not fine-grained or well-connected when exploring different modalities. Facial expressions and gestures can potentially provide meaningful contributions, but our experiments with facial action units were not successful. Our baseline approach used statistical information of each AU instead of the raw signal, which may dilute useful information. More experiments are needed to evaluate if and how AUs can help predict enthusiasm.
We hope our resources provide opportunities for multidisciplinary research in this area. Given the difficulties of annotating multimodal datasets in this domain, future work needs to investigate weakly supervised approaches for labeling multimodal data.

A Statistical Tests
In this section, we present the results of the statistical tests performed on the facial action unit and prosody features extracted from the entire dataset.

A.1 AU Statistical Tests
In order to understand which AUs influence the enthusiasm level, we performed two different statistical tests: an ANOVA for the three levels of enthusiasm (monotonous, normal, enthusiastic), and a T-test for two levels of enthusiasm (enthusiastic, non-enthusiastic). In Table 6 (left), we can see the results of the ANOVA analyzing the mean value of the different AUs per sample against the three levels of enthusiasm. All mean AUs that show a p-value < 0.05 are highlighted. As AU 26 has the lowest p-value, its label distribution is shown in Figure 5c. In Table 6 (right), we can see the results of the ANOVA analyzing the standard deviation of the different AUs per sample against the three levels of enthusiasm. As AU 12 has the lowest p-value, its label distribution is shown in Figure 5d.
We also performed T-tests for the binary case using the labels enthusiastic and non-enthusiastic. AU 17 is the only AU whose mean value has a p-value < 0.05. The distribution of the average values of AU 17 is shown in Figure 6(a). For comparison, the distribution of the average of AU 02 (outer brow raiser), which has the highest p-value, is shown in Figure 6(b). For both analyses, ANOVA and T-test, the differences in standard deviations among the enthusiasm levels are statistically significant for almost all AUs; this is not the case when analyzing the average values of the AUs.
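Such a T-test can be run with scipy; the per-sample mean AU intensities below are hypothetical:

```python
from scipy.stats import ttest_ind

# Hypothetical per-sample mean AU 17 (chin raiser) intensities for the
# binary grouping used in this appendix.
enthusiastic = [0.42, 0.31, 0.55, 0.38, 0.47]
non_enthusiastic = [0.18, 0.22, 0.30, 0.15, 0.26]

t, p = ttest_ind(enthusiastic, non_enthusiastic)
print(f"t = {t:.2f}, p = {p:.4f}")
```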

A.2 Prosody Statistical Tests
We performed statistical significance tests using the mean and standard deviation of F0 (pitch) and loudness. The ANOVA results are shown in Table 8 (left) and the corresponding T-test results in Table 8 (right).

Table 7: T-test for two levels of enthusiasm, with AU mean values on the left and AU standard deviations on the right. AUs with the lowest p-values are highlighted.