MONAH: Multi-Modal Narratives for Humans to analyze conversations

In conversational analyses, humans manually weave multimodal information into the transcripts, a process that is significantly time-consuming. We introduce a system that automatically expands the verbatim transcripts of video-recorded conversations with multimodal data streams. This system uses a set of preprocessing rules to weave multimodal annotations into the verbatim transcripts and promote interpretability. Our feature-engineering contributions are two-fold: firstly, we identify the range of multimodal features relevant to detecting rapport-building; secondly, we expand the range of multimodal annotations and show that the expansion leads to statistically significant improvements in detecting rapport-building.


Introduction
Dyadic human-human dialogs are rich in multimodal information. Both the visual and the audio characteristics of how the words are said reveal the emotions and attitudes of the speaker. Given the richness of multimodal information, analyzing conversations requires both domain knowledge and time. The discipline of conversational analysis is a mature field in which conversations can be manually transcribed using a technical system developed by Jefferson (2004), containing information about intonation, lengths of pauses, and gaps. Hence, it captures both what was said and how it was said. However, such manual annotation takes a great deal of time: individuals must watch the conversations attentively, often replaying them to ensure completeness.
A potential issue with Jeffersonian annotations, however, is that they often contain within-word annotations and symbols, which make it hard to benefit from pre-trained word embeddings. Inspired by Jeffersonian annotations, we expand the verbatim transcripts with multimodal annotations such that downstream classification models can easily benefit from pre-trained word embeddings.
Our paper focuses on the classification task of predicting rapport building in conversations. Rapport has been defined as a state experienced in interaction with another, marked by interest, positivity, and balance (Cappella, 1990). If we can model rapport building in the medical school setting, volunteer actors can let the system give feedback for unofficial practice sessions, so students get more practice with feedback. Also, the lecturer could study the conversations of the top performers and choose interesting segments to discuss. As student doctors get better at rapport building, when they graduate and practice as doctors, treatments become more effective and longer-lasting (Egbert et al., 1964; DiMatteo, 1979; Travaline et al., 2005).
Outside of the healthcare domain, understanding and extracting the features required to detect rapport-building could help researchers build better conversational systems. Our first contribution is the identification of multimodal features that have been found to be associated with rapport building and using them to predict rapport building automatically. Our second contribution is to include them in a text-based multimodal narrative system (Kim et al., 2019b). Why go through text? Because this is how human experts in the linguistics community have been manually analyzing conversations. Our text-based approach has the merit of emulating the way human analysts analyze conversations, and hence supports better interpretability. We demonstrate that the additions bring statistically significant improvements. This feature-engineering system could potentially take over a highly attention-demanding task from an analyst. With an automated text-based approach, we aim to address the research gap of automatic visualizations that support multimodal analysis (Kim et al., 2019a). The created multimodal transcript is itself a conversational analysis product, which can be printed out on paper.
In this paper, we first introduce the problem domain (section 3). Secondly, we motivate the new features (detailed in Fig. 1) to be extracted (section 4). Then, we extract the features from videos and encode them as text together with the verbatim transcripts (section 4). To evaluate whether the text narratives are useful, we run experiments that predict rapport-building using texts containing different amounts of multimodal annotations (section 5). Finally, we discuss the results and visualize the outputs of the system (section 6).

Related Works
The automated analysis of conversations has been the subject of considerable interest in recent years. Within the domain of doctor-patient communication, Sen et al. (2017) calculated session-level input features, including affective features (Gilbert, 2014). Analyses using session-level features have the drawback of not being able to identify the specific, defining multimodal interactions in the conversation (Zhao et al., 2016; Heylen et al., 2007). Therefore, we build upon the work of Sen et al. (2017): in addition to using session-level features, we propose a finer, talk-turn-level multimodal text representation as input to a hierarchical attention network (HAN) (Yang et al., 2016).
We also build upon our previous work (Kim et al., 2019b) by broadening the range of multimodal features considered. As for the different methods of multimodal information fusion, Poria et al. (2017) completed an extensive review of the state-of-the-art multimodal fusion techniques. Recent multimodal fusion research (such as ICON (Hazarika et al., 2018a), CMN (Hazarika et al., 2018b), MFN (Zadeh et al., 2018), DialogueRNN (Majumder et al., 2019), and M3ER (Mittal et al., 2020)) has focussed on end-to-end approaches. Unlike the typical end-to-end approach of representing and fusing multimodal features as numeric vectors, our contribution is an entirely text-based multimodal narrative, thereby improving the interpretability of downstream analysis (the system is open-sourced at https://github.com/SpectData/MONAH). Our approach not only annotates the presence of nonverbal events (Eyben et al., 2011), but also the degree of the nonverbal event's intensity, at both the session level and the talk-turn level.

Data
This study uses data from the EQClinic platform (Liu et al., 2016). Students in an Australian medical school were required to complete at least one medical consultation on the online video conferencing platform EQClinic with a simulated patient who is a human actor trained to act as a patient. Each simulated patient was provided with a patient scenario, which mentioned the main symptoms experienced. The study was approved by the Human Research Ethics Committee of the University of New South Wales (project number HC16048).
The primary outcome measurement was the response to the rapport-building question on the Student-Patient Observed Communication Assessment (SOCA) form, an adapted version of the Calgary-Cambridge Guide (Kurtz and Silverman, 1996). Simulated patients used the SOCA form to rate the students' performances after each video consultation. Our dataset comprises 873 sessions, all from distinct students. Since we have two recordings per session (one of the student, the second of the simulated patient), the number of recordings analyzed is 1,746. The average length per recording is 928 seconds (sd=253 seconds), amounting to a total of about 450 hours of recordings analyzed. The dataset's size is small relative to the number of multimodal features extracted; therefore, there is a risk of overfitting.
We used the YouTube platform to obtain the transcript per speaker from the recordings. We chose YouTube because we (Kim et al., 2019c) found that it was the most accurate transcription service (word error rate: 0.28) compared to Google Cloud (0.34), Microsoft Azure (0.40), Trint (0.44), and IBM Watson (0.50) when given dyadic video-conferences of an Australian medical school. Jeong-Hwa and Cha (2020) found that among the four categories of YouTube errors (omission, addition, substitution, and word order), substitution recorded the highest number of errors. Specifically, they found that phrase repetitions could be mis-transcribed into non-repetitions. From our experience, (a) repair-initiation techniques such as sound stretches (e.g., "ummmm") (Hosoda, 2006) were either omitted or substituted with "um"; (b) overlapping speech was not a problem because our speakers were physically separated and recorded into separate files.
We brought the two speakers' transcripts together into a session-level transcript through word-level timings, grouping together words spoken by one speaker until the sequence is interrupted by the other speaker. When the interruption occurs, we deem that the talk-turn of the current speaker has ended and a new talk-turn by the interrupting speaker has begun. The average number of talk-turns per session is 296 (sd=126), and the average word count per talk-turn is 7.62 (sd=12.2).
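A minimal sketch of this segmentation rule, assuming each word arrives as a (speaker, word, start_time) tuple after merging the two per-speaker transcripts (the tuple shape and function name are illustrative, not the paper's implementation):

```python
# Sketch of the talk-turn segmentation described above: consecutive words
# by the same speaker form one talk-turn; a word by the other speaker
# ends the current talk-turn and starts a new one.

def group_talk_turns(timed_words):
    """Group words into (speaker, text) talk-turns by time order."""
    talk_turns = []
    for speaker, word, start in sorted(timed_words, key=lambda w: w[2]):
        if talk_turns and talk_turns[-1][0] == speaker:
            talk_turns[-1][1].append(word)   # same speaker continues
        else:
            talk_turns.append((speaker, [word]))  # interruption: new turn
    return [(spk, " ".join(words)) for spk, words in talk_turns]

words = [
    ("doctor", "how", 0.0), ("doctor", "are", 0.4), ("doctor", "you", 0.7),
    ("patient", "not", 1.5), ("patient", "great", 1.9),
    ("doctor", "sorry", 2.8), ("doctor", "to", 3.0), ("doctor", "hear", 3.2),
]
print(group_talk_turns(words))
```

The same rule yields the session-level talk-turn counts reported above once applied to the full merged transcript.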
At this point, we note that acted dialogues differ from naturally occurring dialogues in a few ways. Firstly, naturally occurring dialogues tend to be more vague (phrases like "sort of", "kinda", "or something") due to the shared understanding between the speakers (Quaglio, 2008). Secondly, taboo words or expletives that convey emotions (like "shit", "pissed off", "crap") are likely to be less common in an acted medical setting than in naturally occurring conversations. Some conversations transform into genuine dialogues where the speakers "shared parts of themselves they did not reveal to everyone and, most importantly, this disclosure was met with acceptance" (Montague, 2012). This definition of genuine conversation aligns with our definition of rapport-building in section 4.1.

Figure 1 shows a summary of the features extracted. We annotated verbatim transcripts with two different levels of multimodal inputs: annotations at the session level are labeled coarse, whilst annotations at the talk-turn level are labeled fine. To facilitate comparisons, input families belonging to the coarse (fine) level are annotated with uppercase (lowercase) letters, respectively. In this paper, we refer to the previously existing set of features (with white background) as the "prime" (′) configuration. Families are also abbreviated by their first letter. For example, the coarse P′ family consists of only speech rate and delay, whilst the coarse P family consists of P′ plus tone. As another example, the coarse D family is the same as the D′ family because there are no newly added features (in blue). We introduce the framework of our multimodal feature extraction pipeline in Figure 2.

Multimodal features extractions
As an overview, we extracted the timestamped verbatim transcripts and used a range of pre-trained models to extract temporal, modality-specific features. We relied on pre-trained models for feature extraction and did not attempt to improve on them, demonstrating the value of using multidisciplinary pre-trained models from natural language processing, computer vision, and speech processing for conversational analysis.
Effectively, we extracted structured data from unstructured video data (section 4.2). With the structured data and verbatim transcript, we weaved a multimodal narrative using a set of predefined templates (sections 4.3 and 4.4). With the multimodal narrative, we employed deep learning techniques and pre-trained word embeddings to predict the dependent variable (section 5).

Dependent variable -rapport building
The dependent variable is defined as success in rapport building. Rapport building is one of the four items scored in the SOCA. The original 4-point Likert scale is Fail, Pass-, Pass, Pass+; we converted this scale into a binary variable that is true if the rapport-building score is "Pass+", as we are concerned here with identifying good rapport building. "Pass+" means that the actor felt rapport such that all information could be comfortably shared. 38 percent of the population achieved "Pass+". All actors followed the same pre-interview brief. Because each student's performance was scored by only one actor, with no overlap, a limitation is that we do not have measures of inter-rater agreement.

Table 1 gives an overview of all features for each speaker. We define six families of coarse-level inputs: demographics, actions, prosody, semantics, mimicry, and history. We computed the features per speaker. Across all families, there are a total of 77 features per session.

Description of features
We first discuss the family of demographics. Talkativeness is chosen because the patient's talkativeness would initiate the doctor's active listening while aiding identification of the patient's concerns, processes that could establish rapport. In Hall et al. (2009), it appears that patients appreciate a certain degree of doctor's dominance in the conversation, which itself is also correlated with higher rapport. Big 5 Personality consists of Extraversion, Agreeableness, Conscientiousness, Neuroticism, and Openness to Experience (McCrae and Costa, 1987). This personality structure is widely used in research and practice to quantify aspects of a person's natural tendency in thought, feeling, and action, with good validity and reliability indicators (McCrae, 2017). It is chosen because traits of agreeableness and openness on the part of both doctor and patient predict higher rapport. Among doctors, higher openness and agreeableness predict higher empathy towards patients (Costa et al., 2014). Among patients, higher agreeableness predicted higher trust towards doctors (Cousin and Mast, 2013), and higher openness predicted higher doctor affectionate communication (Hesse and Rauscher, 2019). Big 5 Personality is extracted by feeding transcripts to the IBM Watson Personality Insights API (version 2017-10-13), costing a maximum of 0.02 USD per call. Gender is chosen because personality differences between genders were observed cross-culturally. Among twenty-three thousand participants across cultures, for both college-age and adult samples, females reported higher agreeableness, warmth, and openness to feelings than males (Costa Jr et al., 2001), traits that could be linked to rapport building.
Secondly, for the family of actions, laughter is chosen because humor (which was defined in part by the presence of laughter) on the part of both doctor and patient was found to be twice as frequent in high-satisfaction as in low-satisfaction visits (Sala et al., 2002). Laughter events were detected using the Ryokai et al. (2018) algorithm. Facial expressions that resemble smiling are another behavioral indicator of humor appreciation and approval of one another (Tickle-Degnen and Rosenthal, 1990). Head nodding is a type of backchannel response (i.e., response tokens) that has been shown to reflect rapport between doctor and patient, especially when the primary activity is face-to-face communication (Manusov, 2014). Forward trunk leaning is chosen because it has long been found to reflect an expression of interest and caring, which are foundational to rapport building (Scheflen, 1964). Additionally, facial positivity (posiface) is included as it is useful in rapport-building detection in small groups (Müller et al., 2018). Lastly, action units (AU) that describe specific facial expressions, in particular AU 05 (upper lid raiser), 17 (chin raiser), 20 (lip stretcher), and 25 (lips part), are also included as they were useful in automated dyadic conversational analyses to detect depression in our previous work (Kim et al., 2019b). All features introduced in this paragraph were calculated using the AU and landmark-positioning features extracted with OpenFace (Baltrušaitis et al., 2016).
Thirdly, for the family of prosody, delay is chosen because it has been shown to be an indicator of doctor-to-patient influence: patients of low rapport with their doctors were found to speak less in response to doctors' comments (Sexton et al., 1996). Speech rate is chosen because the doctor's fluent speech rate and the patient's confident communication have been positively correlated with the patient's perception of rapport (Hall et al., 2009). Delay and speech rate are calculated using the time-stamped transcripts. Tone is chosen because a warm and respectful tone on the part of both doctor and patient is positively correlated with the patient's perception of rapport (Hall et al., 2009). Tone is calculated using the Vokaturi algorithm (version 3.3) (Vokaturi, 2019).
Fourthly, for the family of semantics, sentiment is chosen because the provision of positive regard from a practitioner to a patient is an important factor in fostering the therapeutic alliance; additionally, this process may be further enhanced if the patient also demonstrates positive behaviors towards the practitioner (Farber and Doolin, 2011). Sentiment is extracted using the VADER algorithm (Gilbert, 2014), in line with Sen et al. (2017). Questions are chosen because higher engagement by the doctor (e.g., asking questions) with the patient, and the patient asking fewer questions, have been shown to positively correlate with the patient's perception of rapport (Hall et al., 2009). Questions are detected using the Stanford CoreNLP parser and the Penn Treebank (Bies et al., 1995) tag sets.
Next, mimicry is chosen because doctor-patient synchrony is an established proxy for rapport. In a review paper, rapport is theorized to be grounded in the coupling of practitioner's and patient's brains (Koole and Tschacher, 2016). Such a coupling process would eventuate in various forms of mimicry in the dyad, for instance, vocally (e.g., matching speech rate and tone), physiologically (e.g., turn-taking, breathing), and physically (e.g., matching body language) (Wu et al., 2020). In this study, we aim to use vocal mimicry to capture this underlying phenomenon. Session-level mimicry scores are approximated through Dynamic Time Warping distances (Giorgino and others, 2009), in line with Müller et al. (2018).
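For illustration, a minimal pure-Python dynamic time warping distance between two prosodic series (the paper uses the R dtw package of Giorgino; this toy version assumes absolute difference as the local cost and is a stand-in, not the authors' exact computation):

```python
# Classic O(n*m) dynamic time warping distance: the minimum cumulative
# cost of aligning two series while preserving temporal order.

def dtw_distance(a, b):
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])          # local cost
            D[i][j] = cost + min(D[i - 1][j],         # insertion
                                 D[i][j - 1],         # deletion
                                 D[i - 1][j - 1])     # match
    return D[n][m]

# Toy per-talk-turn speech rates: similar series yield a small distance.
doctor_rate = [3.1, 3.4, 2.9, 3.0]
patient_rate = [3.0, 3.3, 3.0, 3.1]
print(dtw_distance(doctor_rate, patient_rate))
```

A smaller distance would indicate stronger vocal mimicry between the speakers.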
Lastly, history is chosen because the scores given by the assessors could be subjective evaluations where the evaluations are unduly influenced by the assessor's leniency bias (Moers, 2005). We attempted to mitigate the leniency bias by introducing history features that indicate the assessor's leniency and its consistency.

Generation of coarse multimodal narrative
In this section, we discuss the coarse multimodal narrative. We summarize the automatic generation of the text representation in Table 2. We calculated the z-score for all the above templates (except Template 3, which is categorical) using the standard formula z = (x − µ) / σ, where the average (µ) and standard deviation (σ) are computed from the training observations. Using the z-score, we bucketed values into "very low" (z < -2), "low" (z < -1), "high" (z > 1) and "very high" (z > 2). The reason for the z-transformation is to create human-readable text by bucketing continuous variables into easy-to-understand buckets ("high" vs. "low").
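A sketch of this bucketing rule; the handling of values within one standard deviation of the mean (returning no annotation) is an assumption for illustration:

```python
# Bucket a raw feature value into the coarse-narrative labels using the
# z-score z = (x - mu) / sigma, with mu and sigma from the training split.

def z_bucket(value, mu, sigma):
    z = (value - mu) / sigma
    if z < -2:
        return "very low"
    if z < -1:
        return "low"
    if z > 2:
        return "very high"
    if z > 1:
        return "high"
    return None  # assumption: unremarkable values get no annotation

print(z_bucket(95, mu=60, sigma=15))  # z is about +2.3
```

Checking the more extreme buckets first keeps the thresholds non-overlapping.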

Generation of fine multimodal narrative
In addition to the verbatim transcript, we introduced two new families of information: prosody and actions. Table 3 gives an overview of the templates, and the bold face indicates a variable. The motivations of the features have been discussed; we discuss the rules of insertion in the next few paragraphs. Template 19 is the verbatim transcript returned from the ASR system. Before each talk-turn, we identified the speaker (doctor/patient) and added multimodal information using templates 20-29. Speech rate and tone were standardized across all training observations. We appended templates 20 and 21, where the possible values are dependent on the z-score: "quickly" (1 < z-score < 2) and "very quickly" (z-score ≥ 2). For delay, we used time intervals of 100 milliseconds, between 200 and 1200 milliseconds, in line with Roberts and Francis (2013). We appended template 22 at the front of the talk-turn if a delay of at least 200 milliseconds is present between talk-turns. In addition, we appended template 23, where the possible values are dependent on the standardized duration of the delay: "short" (z-score < 1), "long" (1 ≤ z-score < 2), and "significantly long" (z-score ≥ 2). Template 23 captures a longer-than-usual delay, considering the unique turn-taking dynamics of each conversation; the standardized duration of the delay is calculated using the talk-turn delays from the respective session. Lastly, as for the actions family, templates 24-28 were added if any of the actions are detected during the talk-turn. Template 29 was only added if the AU is detected throughout the entire duration of the talk-turn.
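A hypothetical sketch of how the fine annotations are woven into a talk-turn; the template wordings and the function signature are illustrative analogues, not the paper's verbatim templates:

```python
# Prepend fine prosody annotations to a talk-turn following the
# insertion rules above (delay annotated if >= 200 ms, intensity words
# chosen by z-score thresholds). Wordings here are stand-ins.

def annotate_talk_turn(speaker, text, delay_ms=0, delay_z=0.0, rate_z=0.0):
    parts = []
    if delay_ms >= 200:                       # template 22 analogue
        parts.append(f"{speaker} paused")
        if delay_z >= 2:                      # template 23 analogue
            parts.append("a significantly long while")
        elif delay_z >= 1:
            parts.append("a long while")
        else:
            parts.append("a short while")
    if rate_z >= 2:                           # templates 20/21 analogue
        parts.append(f"{speaker} said very quickly")
    elif rate_z >= 1:
        parts.append(f"{speaker} said quickly")
    else:
        parts.append(f"{speaker} said")
    return " ".join(parts) + ": " + text

print(annotate_talk_turn("patient", "I have been feeling tired.",
                         delay_ms=600, delay_z=1.4, rate_z=2.3))
```

The resulting string can be fed directly to a word-embedding pipeline, which is the point of keeping the annotations textual.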

Experimental settings
There are two main types of inputs: (1) numeric inputs at the session level, which feed the classification tree, and (2) coarse and/or fine text narratives, which feed the HAN. Compared to BERT (Devlin et al., 2018), the HAN is faster to train and easier to interpret.

Research questions
The proposed features have been motivated by scientific studies in Section 4. A natural next question is, "what are the impacts of these proposed features on model performance?" We break this broad question into three questions. Firstly, (Q1) do the newly added features improve performance over the existing set of features for the classification tree and/or the HAN?
Secondly, modelling using unstructured text input data (as opposed to numeric inputs) risks introducing too much variability into the inputs. Therefore, we investigate (Q2): given the coarse-only inputs, does the performance of the HAN differ significantly from that of the classification tree?
Lastly, adding more granular talk-turn-level inputs to the coarse session-level inputs has the benefit of enabling deeper analyses, because it allows the analyst to examine important talk-turns of the conversation. On top of this benefit, (Q3) does combining coarse and fine inputs also yield a significant performance improvement over coarse-only inputs?
For all models, the area under the receiver operating characteristic curve (AUC) was used as the evaluation metric. The AUC measures the goodness of ranking (Hanley and McNeil, 1982) and therefore does not require an arbitrary threshold to turn probabilities into classes. The partitioning of the dataset into five folds is identical for the decision tree and the HAN to facilitate comparison. The five folds are created through stratified sampling of the dependent variable.
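As a sketch, the AUC can be computed directly as a ranking statistic (its Mann-Whitney form), and stratified folds can be assigned by cycling within each class; both are simplified stand-ins for standard library routines, not the paper's code:

```python
# AUC as the probability that a random positive outranks a random
# negative -- no classification threshold needed.

def auc(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def stratified_folds(labels, k=5):
    """Assign a fold id per example, cycling within each class so that
    every fold gets roughly the same class balance."""
    folds, counters = [0] * len(labels), {}
    for i, y in enumerate(labels):
        folds[i] = counters.get(y, 0) % k
        counters[y] = counters.get(y, 0) + 1
    return folds

labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.8, 0.4, 0.7, 0.1, 0.6, 0.3, 0.95, 0.5]
print(auc(labels, scores))  # every positive outranks every negative -> 1.0
```

Using the same fold assignment for both models is what makes the paired significance tests in Table 4 valid.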

Classification tree set-up
To answer (Q1) and (Q2), we tested all 72 configurations of prime (2³ = 8) plus full (2⁶ = 64) family inputs for the decision tree. We performed the same z-transformation pre-processing (as in section 4.3) on the decision tree input variables and limited the random search to twenty trials.
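The count of 72 configurations follows the paper's arithmetic: the 2³ subsets of the three prime families plus the 2⁶ subsets of the six full families (empty sets included under that counting). A quick sanity check:

```python
# Enumerate every subset of each family list; 2^3 + 2^6 = 72 in total.
from itertools import combinations

def subsets(families):
    return [set(c) for r in range(len(families) + 1)
            for c in combinations(families, r)]

prime = ["D'", "A'", "P'"]                 # prime (existing) families
full = ["D", "A", "P", "S", "M", "H"]      # full (extended) families
configs = subsets(prime) + subsets(full)
print(len(configs))  # prints 72
```

Enumerating configurations this way makes the ablation exhaustive over families while keeping the random search per configuration small.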
The algorithm used is from the rpart package in R. As part of hyperparameter tuning, we tuned cp (log-uniform between 10⁻⁹ and 10⁻⁷), maximum depth (uniform between 1 and 20), and minimum split (uniform between 20 and 80) through five-fold cross-validation and random search.

HAN set-up
To answer (Q1) and (Q2), we chose the input configurations that performed the best for the classification tree and used the same input configurations in the HAN to compare the difference. Therefore, this test is biased in favour of the classification tree. To answer (Q3), we added the fine narratives to each coarse-only configuration and compared the difference.
The model architecture is the HAN architecture of Yang et al. (2016), with about 5 million parameters. We used pre-trained 300-dimensional GloVe word embeddings (Pennington et al., 2014) to represent each word. Words not found in the GloVe vocabulary are replaced with the "unk" token.

Table 4: Summary of the model performances. We report the average five-fold cross-validation AUC and its standard deviation in brackets. Row-wise: we begin with D′A′P′, the full existing feature set from Kim et al. (2019b), and progressively compare it against the new sets of features to answer Q1. Column-wise: we compare the difference in AUC between the classification tree and the coarse-only HAN to answer Q2, and between the coarse-only HAN and the coarse + fine HAN to answer Q3. Asterisks (*) indicate significance relative to the D′A′P′ row. Carets (ˆ) indicate significance in column-wise comparisons; we also provide the confidence intervals in square brackets [] for the difference in performance. The number of symbols indicates the level of statistical significance, e.g., ***: 0.01, **: 0.05, *: 0.10.
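The out-of-vocabulary rule described above (words missing from the GloVe vocabulary mapped to an "unk" token before embedding lookup) can be sketched with a toy vocabulary; the real system looks up the 300-dimensional GloVe vectors instead of these illustrative ids:

```python
# Map tokens to embedding ids, falling back to the "unk" id for
# out-of-vocabulary words (e.g., coined annotation words like "posiface").

def to_embedding_ids(tokens, vocab, unk="unk"):
    return [vocab.get(t, vocab[unk]) for t in tokens]

vocab = {"unk": 0, "the": 1, "patient": 2, "paused": 3}  # toy vocabulary
print(to_embedding_ids(["the", "patient", "posiface", "paused"], vocab))
```

This is one reason the narrative templates favor everyday words ("paused", "quickly"): they already exist in the pre-trained vocabulary.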

Experimental results
The results are summarized in Table 4. The key findings are: (Q1) with the extended inputs, we observed statistically significant improvements in both the HAN and the tree over the existing full set of features (one-tailed t-test); (Q2) given the coarse-only inputs, the performances of the HAN and the classification tree did not differ significantly (two-tailed t-test), so it is plausible that engineering features into text does not harm performance; (Q3) although adding the fine narratives allows deeper analyses by the analyst, it does not lead to significant differences over the coarse-only inputs (two-tailed t-test).
(Q1) When compared to the full set of existing features, the classification tree achieved statistically significant improvements (at α = 0.05) in all six out of six coarse input families. The HAN achieved statistically significant improvements in one (at α = 0.05) or two (at α = 0.10) out of six coarse input families. This demonstrates the value of the newly introduced coarse features. (Q2) Across the seven coarse input configurations, there are no significant differences in performance between the classification tree and the HAN in six out of seven input configurations. The only exception is the baseline D′A′P′ configuration, where the HAN is significantly better. However, the lack of statistically significant differences does not mean that the performances are the same. In line with Quertemont's (2011) recommendation, we provide the confidence interval around the difference in performance for discussion. Of all confidence intervals that include zero in the fourth column of Table 4, none suggest that the effect sizes are negligible (for example, less than 0.01). In summary, we cannot conclude that the performance of the HAN differs significantly from that of the tree, nor that they are the same.
(Q3) The addition of fine narratives to the coarse narrative did not result in significantly stronger (or weaker) performance in any of the seven input configurations. We posit that this negative finding is due to the difficulty of prioritizing the backpropagation updates to the parts of the network interacting with the coarse features, where there is likely a high signal-to-noise ratio. Despite the negative finding, we think it is important to explore the addition of fine features on top of coarse features because it produces a complete transcript that lets the human understand how the conversation proceeded.

Qualitative Analysis
We visualized the talk-turn-level and word-level attention weights from the model. Attention weights are normalized using a z-transformation and bucketed into four buckets (< 0, < 1, < 2, ≥ 2) (Kim et al., 2019b). The analyst can analyze an important segment in detail (as in Fig. 3) or see an overview of the important segments in the conversation (see appendix E). In the example (Fig. 3), we observe that the multimodal annotations of leaning forward and positive expression were picked up as important words by the model.
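A sketch of this z-normalize-then-bucket step, using unnormalized toy weights; the bucket ids 0-3 here stand in for the four display intensities:

```python
# Z-normalize attention weights over a conversation and bucket each
# into one of four intensity levels: z < 0, z < 1, z < 2, z >= 2.

from statistics import mean, pstdev

def attention_buckets(weights):
    mu, sigma = mean(weights), pstdev(weights)
    zs = [(w - mu) / sigma for w in weights]
    return [0 if z < 0 else 1 if z < 1 else 2 if z < 2 else 3 for z in zs]

weights = [1.0, 1.0, 1.0, 5.0, 2.0]  # toy talk-turn attention weights
print(attention_buckets(weights))
```

The highest bucket ids would render as the darkest cells in a heatmap thumbnail like the one in appendix E.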

Conclusion
In this paper, we build upon a fully text-based feature-engineering system. We motivated the added features with existing literature and demonstrated their value through experiments on the EQClinic dataset. This approach emulates how humans have been analyzing conversations with the Jefferson (2004) transcription system, and hence is human-interpretable. It is highly modular, thereby allowing practitioners to inject additional modalities. In this paper, we have used a wide range of modalities, including demographics, actions, prosody, semantics, mimicry, and history. The ablation tests showed that the added coarse features significantly improve the performance of both the decision tree and the HAN models.
Future research could (1) investigate whether this feature engineering system is generalizable to wider applications of conversational analysis; (2) conduct user studies to validate the usability and

Appendices A Tuning procedure
We tuned the SGD optimizer with a learning rate between 0.003 and 0.010, a batch size between 4 and 20, and L2 regularization between 10⁻⁶ and 10⁻³, and trained for up to 350 epochs without early stopping. We tuned the number of gated recurrent units (GRU) (Cho et al., 2014) between 40 and 49 in both the word-level and talk-turn-level layers, with both the GRU dropout and recurrent dropout (Gal and Ghahramani, 2016) between 0.05 and 0.50. Hyperparameters were chosen through uniform sampling between the above-mentioned bounds, except for the learning rate, where log-uniform sampling is used. Training is performed on an RTX 2070 GPU or a V100 GPU.
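A sketch of this sampling scheme; the dictionary keys are illustrative names, not the authors' configuration format:

```python
# Draw one random-search trial: uniform sampling within the stated
# bounds, log-uniform for the learning rate only.

import math
import random

def sample_config(rng):
    return {
        "learning_rate": 10 ** rng.uniform(math.log10(0.003),
                                           math.log10(0.010)),
        "batch_size": rng.randint(4, 20),
        "l2": rng.uniform(1e-6, 1e-3),
        "gru_units": rng.randint(40, 49),
        "dropout": rng.uniform(0.05, 0.50),
    }

cfg = sample_config(random.Random(0))  # seeded for reproducibility
print(cfg)
```

Log-uniform sampling spreads trials evenly across orders of magnitude, which suits scale-sensitive parameters like the learning rate.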
B Hyperparameter configurations for best-performing models Table 5 (HAN) and Table 6 (Tree) report the hyperparameter configurations for each of the best-performing models reported in Table 4.

C Performance of additional tuning
We conducted additional experiments on the tree configurations to (1) compare the improvements in performance when tuning the HAN and the tree, and (2) evaluate the increase in performance when the tree is allowed twenty more random-search trials (Fig. 4). From the larger increases in HAN performance, it is plausible that the HAN is more sensitive to hyperparameter tuning than the tree.

D Impact of the added fine features

Table 7 reports additional tests on the impact of the added fine features. We observe that whilst all three input configurations (va, vp, vpa) show small increases in performance, none of them are statistically significant.

E Conversation thumbnail visualization
By illustrating the talk-turn-level attention weights as a heatmap thumbnail (Fig. 5), the analyst can quickly get a sense of the important segments of the conversation without reading the content, and zoom in if required.

F Jefferson example
As an optional reference, we engaged a professional transcriptionist to transcribe the conversation segment presented (Fig. 3) using the Jefferson system. The Jefferson example is presented in Fig. 6. The verbal content is slightly different due to (1) different methods of determining talk-turn transitions and (2) automatic speech recognition accuracy.

Table 7: Summary of the model performances for the fine narratives. We report the average five-fold cross-validation AUC and its standard deviation in brackets. Row-wise, we begin with the v configuration to show the impact of fine multimodal annotations over the verbatim transcript. Then, we show the impact of the additions (Q1) over the existing fine annotations from Kim et al. (2019b) using column-wise comparisons. Asterisks (*) indicate significance relative to the v row. Carets (ˆ) indicate significance in column-wise comparisons; we also provide the confidence intervals in square brackets [] for the difference in performance. The number of symbols indicates the level of statistical significance, e.g., ***: 0.01, **: 0.05, *: 0.10.

Figure 6 notation: .hhh = in-breath; .h = short in-breath; ↑ = rise in intonation; underline = emphasis; < > = slowed speech rate; > < = quickened speech rate; [ ] = overlapping speech.