“Talk to me with left, right, and angles”: Lexical entrainment in spoken Hebrew dialogue

It has been well-documented for several languages that human interlocutors tend to adapt their linguistic productions to become more similar to each other. This behavior, known as entrainment, affects lexical choice as well, both with regard to specific words, such as referring expressions, and overall style. We offer what we believe to be the first investigation of such lexical entrainment in Hebrew. Using two existing measures, we analyze Hebrew speakers interacting in a Map Task, a popular experimental setup, and find rich evidence of lexical entrainment. Analyzing speaker pairs by the combination of their genders as well as speakers by their individual gender, we find no clear pattern of differences. We do, however, find that speakers in a position of less power entrain more than those with greater power, which matches theoretical accounts. Overall, our results mostly accord with those for American English, with a lack of entrainment on hedge words being the main difference.


Introduction
Entrainment, also known as accommodation or alignment, is a widespread phenomenon in human interaction which leads interlocutors to adapt to each other to become more similar. It has been found for a variety of linguistic dimensions, including prosody (Hirschberg, 2011), phonetics (Pardo, 2006), syntax (Reitter et al., 2006), and lexical choice (Brennan and Clark, 1996).
Lexical entrainment has been studied for several types of lexical choices, from specific sets of words, such as referring expressions (Brennan and Clark, 1996), high-frequency words and task-related words (Rahimi et al., 2017), as well as hedge and cue phrases, to the wider linguistic style (Niederhoffer and Pennebaker, 2002). This motivates us to consider both specific word sets and overall language use here.
Importantly, there are correlations between lexical entrainment and interesting aspects of the conversation. These include task success for both speaker pairs (Reitter and Moore, 2007; Nenkova et al., 2008) and groups (Gonzales et al., 2010; Friedberg et al., 2012), conversation flow and perceived naturalness (Nenkova et al., 2008), as well as power differences between the speakers (Danescu-Niculescu-Mizil et al., 2011). This suggests practical applications and has led to the development of entraining natural language generators in Dutch (De Jong et al., 2008), German (Buschmeier et al., 2009), and American English and European Portuguese (Lopes et al., 2015), among others.
To the best of our knowledge, there has not been any systematic research on lexical entrainment in Hebrew or any other Semitic language. Previous studies analyzing lexical choice in Semitic languages focus on borrowing and code-switching, for instance between Arabic and English (Abu-Melhim et al., 2016) and Arabic and Hebrew (Hawker, 2018). Given the important social role of entrainment and its potential applications, our study contributes the first analysis of lexical entrainment in Hebrew. This helps identify variations in how the behavior manifests in different linguistic and cultural contexts. We note that in a recently published study (Weise et al., 2020), we analyzed acoustic-prosodic entrainment in Hebrew for the same data. Together, these two papers provide a broad investigation of entrainment for this novel language context.

Corpus
In this study, we analyze the Open University of Israel Map Task Corpus (MaTaCOp) (Azogui et al., 2016) of dyadic Hebrew conversations, modeled after the HCRC Map Task Corpus (Anderson et al., 1991). Each participant was given a map with labeled landmarks, some of them shared with the partner's map, some unique. The map of one participant in a pair, the leader, contained a path among the landmarks. It was their task to describe the path so their partner, the follower, could reproduce it. All speaker pairs discussed the same two pairs of corresponding maps, with each speaker acting as a leader for one map and as a follower for the other. We refer to whole conversations as sessions and to each of the two parts as tasks.
MaTaCOp contains about six hours of conversations between 32 speakers, all of them fluent in Hebrew. There are six female, six mixed, and four male pairs. Most of the paired speakers were acquainted prior to the experiment. We analyze the influence of this aspect of our data in Section 5.7. Further details on the level of familiarity are provided in Appendix B. For more details on the corpus in general, see Weise et al. (2020).

Transcription, Tokenization, and Lemmatization
MaTaCOp is fully transcribed. Tokenization generally follows standard Hebrew orthography. For instance, proclitics (such as mi- "from") were transcribed attached to the subsequent word (e.g., mi-nekuda "from point"). However, if a silent pause or other disfluency occurred between a clitic and the subsequent word, the clitic was transcribed separately, as in mi nekuda "from point". In total, this yields 50,075 tokens for the corpus.
Due to Hebrew's rich morphology, many of the words in our corpus appear in a variety of grammatical forms, such as agol "round.M.SG", agula "round.F.SG", agul-im "round.M.PL", and ha-agol "the-round". We use a manually created list of grammatical forms for each lemma to lemmatize and count occurrences per lemma (see footnote 1). Overall, there are 1,038 lemmas and 2,179 other grammatical forms.
1 All word lists we use here can be downloaded at openu.ac.il/en/academicstudies/matacop/pages/default.aspx.
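The list-based lemmatization described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tooling: the dictionary layout, the function names `build_form_index` and `count_lemmas`, and the toy token sequence are all assumptions. Only the four grammatical forms of agol are taken from the text.

```python
from collections import Counter

# Hypothetical excerpt of a lemma list. The forms are the ones quoted in the
# text; the {lemma: [forms]} layout is an assumption for illustration.
FORMS_BY_LEMMA = {
    "agol": ["agol", "agula", "agul-im", "ha-agol"],
}


def build_form_index(forms_by_lemma):
    """Invert {lemma: [forms]} into {form: lemma} for direct lookup."""
    index = {}
    for lemma, forms in forms_by_lemma.items():
        for form in forms:
            index[form] = lemma
    return index


def count_lemmas(tokens, form_index):
    """Count occurrences per lemma; tokens missing from the list are skipped."""
    return Counter(form_index[t] for t in tokens if t in form_index)


form_index = build_form_index(FORMS_BY_LEMMA)
tokens = ["ha-agol", "mi-nekuda", "agula", "agol"]  # toy token sequence
lemma_counts = count_lemmas(tokens, form_index)
```

Here the three forms of agol are counted under one lemma, while mi-nekuda is ignored because it is not in the toy list.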

Lexical Entrainment Measures
We measure the lexical similarity of speakers' utterances per session (or task, where noted) using two previously established measures, one for a specific set of words W (sim_1), one for the overall productions (sim_2).
Per word w ∈ W and per speaker S, the first measure determines cnt_S(w), the number of times w was uttered by S, and ttl_S, the total number of words uttered by S. Similarity between a pair of speakers S_1, S_2 is then defined based on the absolute difference of the fractions per word, as
$$\mathrm{sim}_1(S_1, S_2) = -\sum_{w \in W} \left| \frac{\mathrm{cnt}_{S_1}(w)}{\mathrm{ttl}_{S_1}} - \frac{\mathrm{cnt}_{S_2}(w)}{\mathrm{ttl}_{S_2}} \right|$$
Nenkova et al. (2008) proposed this measure for high-frequency words. Note that it is symmetric.
The second measure was originally proposed by Gravano et al. (2014) to compare tones and break indices (ToBI). For it, we construct a trigram language model for each speaker from their utterances, using SRILM (Stolcke, 2002). The measure sim_2(S_1, S_2) is then defined as the negated perplexity of using the language model for S_1 to predict all utterances of S_2, computed with SRILM. Low perplexity indicates that the model for S_1 is a good representation of the utterances of S_2. In this case, the phrases used by S_2 are essentially a subset of those used by S_1. We interpret this as entrainment of S_1 towards S_2 as it signals that S_1 incorporated S_2's phrases into their own. Conversely, high perplexity indicates a lack of entrainment. This is why we use negated perplexity for sim_2. Note that this measure is asymmetric. For a symmetric version, we simply add the asymmetric values for both directions, following Weise and Levitan (2018).
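As an illustration of the first measure, the following sketch computes the negated sum of absolute differences between two speakers' relative frequencies over a word set W. The negation is an assumption consistent with the remark in Appendix B that all similarity values are negative and that values closer to zero indicate greater similarity. The function name `sim1` and the counts are invented for the example and are not corpus data.

```python
from collections import Counter


def sim1(counts_a, total_a, counts_b, total_b, word_set):
    """Negated sum over W of |cnt_a(w)/ttl_a - cnt_b(w)/ttl_b|."""
    return -sum(
        abs(counts_a.get(w, 0) / total_a - counts_b.get(w, 0) / total_b)
        for w in word_set
    )


# Toy lemma counts, invented for illustration only.
W = {"smol", "yamin"}
speaker_1 = Counter({"smol": 4, "yamin": 1})  # out of 100 words in total
speaker_2 = Counter({"smol": 2, "yamin": 2})  # out of 50 words in total

value = sim1(speaker_1, 100, speaker_2, 50, W)
swapped = sim1(speaker_2, 50, speaker_1, 100, W)
```

In this toy case both speakers use smol at the same rate (4% of their words), so only yamin contributes to the difference, and swapping the speakers gives the same value because the measure is symmetric.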
To determine whether significant entrainment is present, we follow Levitan et al. (2012). Each similarity value sim_i(S_1, S_2) for a speaker S_1 with their partner S_2 is compared with the weighted average similarity of S_1 with non-partners, using paired Student's t-tests. Non-partners must have the same gender as S_2 and their partners must have the same gender as S_1. For similarity per task, non-partners must also be talking about the same map and have the same role as S_2. Non-partners are weighted by how closely their language model's entropy, computed using SRILM, matches that of the actual partner (absolute differences). This is meant to account for the effect that the richness of a speaker's lexical inventory has on our measures and follows Weise and Levitan (2018).
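The partner versus non-partner comparison can be sketched as follows. The text states only that non-partners are weighted by the absolute difference between their entropy and that of the actual partner. The inverse-distance weighting, the `eps` constant, and both function names below are assumptions made for illustration, and all numbers are toy values rather than corpus data.

```python
import math


def weighted_baseline(nonpartner_sims, nonpartner_entropies, partner_entropy,
                      eps=1e-9):
    """Weighted mean of a speaker's similarities with non-partners.

    Assumption: weights are inversely proportional to the absolute entropy
    difference from the real partner; eps avoids division by zero.
    """
    weights = [1.0 / (abs(h - partner_entropy) + eps)
               for h in nonpartner_entropies]
    weighted_sum = sum(w * s for w, s in zip(weights, nonpartner_sims))
    return weighted_sum / sum(weights)


def paired_t(xs, ys):
    """Paired Student's t statistic and its degrees of freedom."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n), n - 1


# Toy values: two non-partners equally distant in entropy from the partner.
baseline = weighted_baseline([-0.2, -0.4], [5.0, 7.0], partner_entropy=6.0)

# Toy partner similarities and baselines for four hypothetical pairs.
partner_sims = [-0.10, -0.12, -0.08, -0.11]
baseline_sims = [-0.15, -0.14, -0.13, -0.16]
t_stat, dof = paired_t(partner_sims, baseline_sims)
```

A positive t statistic here means that partners are more similar (closer to zero) than the weighted non-partner baseline. The p-value would then be read from a t distribution with `dof` degrees of freedom, which is omitted here to keep the sketch dependency-free.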

Entrainment on most frequent lemmas
Following Nenkova et al. (2008), we first use sim_1 to check whether speakers in our corpus entrain on its 25 most frequent lemmas (excluding 56 lemmas representing landmark labels and the directional terms in Section 5.2). We find that speakers do significantly entrain on these lemmas (t(15) = 4.15, p = 8.54e-04). That is, the distributions of the 25 most frequent lemmas show greater similarity between partners than with non-partners. This effect also approaches significance for just the female pairs (t(5) = 2.83, p = 0.037) and male pairs (t(3) = 3.14, p = 0.052), but not for mixed pairs (t(5) = 1.51, p = 0.19) (see footnote 2). Table 1 summarizes these results and those for the following subsections. We also use independent Student's t-tests to conduct direct comparisons between the similarity values for the gender pairs, i.e., female vs. male, female vs. mixed, and male vs. mixed (see footnote 3). This yields no significant differences; none even approaches significance (lowest p = 0.53).

Entrainment on directional terms
Leaders and followers in our corpus use various directional terms to communicate the path among the landmarks. To assess whether they adopt each other's terminology, we follow Silber-Varod et al. (2020) and consider ten different terms of two basic types. These are the directions of a compass (ṡafon "north", darom "south", maarav "west", and mizraḣ "east") and relative directions (le-mal-a "upwards", me-al "above", le-mat-a "downwards", mi-taḣat "below", smol "left", and yamin "right"). We treat the lemmas of these ten terms as a set W for measure sim_1 and count all occurrences for all grammatical forms per lemma.
Using this approach, we find significant evidence of entrainment on these ten directional terms overall (t(15) = 6.64, p = 7.86e-06) as well as for female pairs (t(5) = 5.75, p = 0.0022), male pairs (t(3) = 4.42, p = 0.022), and mixed pairs (t(5) = 4.85, p = 0.0047) separately. Again, no difference between the gender pairs even approaches significance (lowest p = 0.086).
2 To account for multiple testing, we regard these four tests as a "family" and treat results up to the k-th smallest p-value p_k as significant at level α = 0.05, where k is the largest integer such that p_k ≤ (k/m)·α, with m being the size of the family (Benjamini and Hochberg, 1995). We do the same for each analogous group of four tests for other entrainment targets in the following subsections.
3 We again account for multiple testing by treating these three tests as a family here and in the following subsections.
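The multiple-testing correction of footnote 2 (Benjamini and Hochberg, 1995) can be sketched as below. The function name `benjamini_hochberg` is our own. The p-values are the ones reported in the text for the 25 most frequent lemmas and for the directional terms, each in the order overall, female, male, mixed.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per test: True if significant under the procedure."""
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is within (rank / m) * alpha.
        if p_values[idx] <= rank / m * alpha:
            k = rank
    significant = [False] * m
    for idx in order[:k]:
        significant[idx] = True
    return significant


# Reported p-values: overall, female, male, mixed.
frequent = benjamini_hochberg([8.54e-04, 0.037, 0.052, 0.19])
directional = benjamini_hochberg([7.86e-06, 0.0022, 0.022, 0.0047])
```

Under this sketch only the overall test is significant for the most frequent lemmas, because p = 0.037 exceeds the second threshold of 0.025, while all four tests are significant for the directional terms. Both outcomes agree with the results as reported.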

Entrainment on geometric terms
In addition to directional terms, speakers employ a variety of geometric terms to describe the shape of the path, the locations of the landmarks, and their relation to each other. This includes, for example, malben "rectangle" and b-a-hitstalvut "at the intersection". To determine whether speakers entrain on these, we consider a list of 34 lemmas (with a total of 199 grammatical forms) of such terms as another set W for measure sim_1. This yields significant results overall (t(15) = 4.82, p = 2.2e-04) as well as for female (t(5) = 5.08, p = 0.0038) and mixed pairs (t(5) = 6.62, p = 0.0012), but not for male pairs (t(3) = 1.00, p = 0.39). Once again, none of the differences between gender pairs even approaches significance (lowest p = 0.72).

Entrainment on hedge words
The difficulty of describing irregular path shapes in the Map Task, along with incomplete information about the landmarks, creates uncertainty for the speakers which encourages the use of hedge words. Furthermore, in their corpus of deceptive interviews, Levitan et al. found the strongest evidence of lexical entrainment for hedge words, stronger than for the 25 most frequent words. These observations motivate us to analyze hedge words as well, using a translated version of the same list Levitan et al. used (with 37 lemmas and 78 grammatical forms total). However, we find no significant entrainment, neither overall (t(15) = 1.61, p = 0.13) nor for any of the gender pairs (lowest p = 0.14).

Entrainment on imperative verb forms
The different roles in the Map Task facilitate the use of imperative verb forms. Leaders might command followers to draw a path a certain way, while followers might demand information or a different way of describing, as in the utterance we quoted in the title. Of course, they can achieve the same communicative goals with phrases that avoid imperatives, using, for example, nonverbal predicates or standard infinitival clauses such as az at tsrix-a em laredet mi-tsad smol la-xanut "so you have to um to get down from the left side of the store". This flexibility allows for entrainment. However, note that the different roles actually make it unlikely for speakers to use the same verbs. A leader might instruct a follower to "draw the path around the lake", while the follower might demand "tell me how close". Therefore, we check whether speakers adopt an imperative mode of speaking from each other, regardless of individual verbs. We identified a list of 122 imperative verb forms in our corpus and determine what fraction of each speaker's words this list represents. That is, W for sim_1 consists of only one placeholder "word". Using this method, we find no significant entrainment, neither overall (t(15) = 1.03, p = 0.30), nor for any gender pair (lowest p = 0.15).

Entrainment on overall productions
Lastly, we use sim_2 to check whether speakers entrain on their partners' overall language use, i.e., whether they model their partners' productions better than those of other speakers. We find that this is the case overall (t(15) = 3.09, p = 0.0074) but neither for female pairs (t(5) = 1.44, p = 0.21), nor for male pairs (t(3) = 2.72, p = 0.073), nor for mixed pairs (t(5) = 2.20, p = 0.08). Once again, we find no significant differences between the gender pairs in direct comparisons (lowest p = 0.43).
Since sim_2 is asymmetric, we can use it to compare the entrainment behavior of individual speakers based on their gender and role, respectively, with independent Student's t-tests. This yields no significant difference between female and male speakers (t(30) = 1.06, p = 0.30).
In order to compare speakers based on their roles, we measure at the task level with separate language models and predictions of all utterances of a task instead of a whole session. Doing so yields a highly significant difference, with followers entraining more than leaders (t(62) = 5.52, p = 6.95e-07). Of course, leaders speak significantly more than followers (t(62) = 5.04, p = 4.25e-05), which might explain why their productions are supersets of those of the followers. However, the difference remains significant even when normalizing the measure by the number of words spoken (t(62) = 3.22, p = 0.0020).
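To make the perplexity-based measure concrete, the sketch below replaces SRILM with a tiny add-one smoothed trigram model written in plain Python. This is a deliberate simplification: SRILM applies its own smoothing and backoff, so absolute values would differ, and only the direction of the comparison is illustrated. The function names and the toy utterances are assumptions.

```python
import math
from collections import Counter


def train_trigram(utterances):
    """Collect trigram and context counts from tokenized utterances."""
    tri, ctx, vocab = Counter(), Counter(), set()
    for utt in utterances:
        padded = ["<s>", "<s>"] + utt + ["</s>"]
        vocab.update(padded)
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            ctx[tuple(padded[i - 2:i])] += 1
    return tri, ctx, len(vocab) + 1  # +1 reserves mass for unseen words


def perplexity(model, utterances):
    """Perplexity of the utterances under an add-one smoothed trigram model."""
    tri, ctx, v = model
    log_prob, n = 0.0, 0
    for utt in utterances:
        padded = ["<s>", "<s>"] + utt + ["</s>"]
        for i in range(2, len(padded)):
            num = tri[tuple(padded[i - 2:i + 1])] + 1
            den = ctx[tuple(padded[i - 2:i])] + v
            log_prob += math.log(num / den)
            n += 1
    return math.exp(-log_prob / n)


def sim2(utts_s1, utts_s2):
    """Negated perplexity of S1's model on S2's utterances (asymmetric)."""
    return -perplexity(train_trigram(utts_s1), utts_s2)


def sim2_symmetric(utts_s1, utts_s2):
    """Sum of both directions, giving a symmetric variant."""
    return sim2(utts_s1, utts_s2) + sim2(utts_s2, utts_s1)


# Toy utterances, invented for illustration only.
s1 = [["lex", "smol-a"], ["lex", "yemin-a"]]
s2_similar = [["lex", "smol-a"]]
s2_different = [["sinus", "gadol"]]

close = sim2(s1, s2_similar)
far = sim2(s1, s2_different)
```

In this toy example `close` is nearer to zero than `far`, because the model trained on `s1` predicts the overlapping utterance with lower perplexity than the unrelated one.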

Influence of familiarity
Prior acquaintance between subjects, as in our data, is unusual in entrainment research and introduces a confound to our comparison with other studies. We conduct some additional analysis of this here and discuss it further in Section 6.
For this analysis, we consider speaker pairs in two groups, of "high" (11 pairs) and "low" (5 pairs) familiarity (see Appendix B). For each entrainment target, we compare the similarity values for the two groups with independent Student's t-tests. This does yield a significant difference for entrainment on overall productions (t(14) = 3.31, p = 0.0051), but not for any other entrainment target (0.05 < p < 0.83). That is, speakers who were already well-acquainted before participating in the experiment show greater entrainment in their overall language use (and only that) than those with little or no acquaintance.

Discussion
In this first analysis of lexical entrainment in Hebrew, using two existing measures, we find substantial evidence of entrainment both on specific groups of words and overall language use.
Speakers entrained on the 25 most frequent lemmas in the corpus, a result that matches findings on English corpora of telephone conversations (Weise and Levitan, 2018), deceptive interviews (Levitan et al.), and task-oriented, multi-party interactions (Rahimi et al., 2017).
The broadest and most significant evidence of entrainment we find is for directional terms and the geometric terms to describe the path. In fact, in some cases speakers actively requested entrainment, as in: a azov et ha-sinus-im daber iti besmol-a yemin-a ve-be-zaviy-ot "uh leave the sines, talk to me with to-the-left, to-the-right, and with angles". Our results match previous ones for referring expressions (Brennan and Clark, 1996) and "project words" (Rahimi et al., 2017) in English.
Contrary to Levitan et al., who found the strongest evidence of lexical entrainment for hedge words, we find no entrainment for these. This may be because Hebrew speech patterns tend to be very "direct" (Katriel, 2004, ch. 2), more so than English ones (Van Dijk, 1997, p. 235), so hedges might be culturally less appropriate.
We do not find that speakers entrain on an imperative mode of speaking. This may be due to data sparsity, though, as imperatives constitute only 1.3% of all tokens (see Appendix A) despite the experimental setting facilitating their use. Studies of syntactic alignment, e.g., by Reitter et al. (2006), have found that English speakers adopt syntactic choices from their interlocutors. A broader investigation of this is needed for Hebrew.
Our results for entrainment on overall productions (how well speakers' language models fit their interlocutors' productions) match prior results for English. Weise and Levitan (2018) found this measure to be significant for both task-oriented dialogs and telephone conversations. But unlike us, they found the results for this measure to be more significant than those for the 25 most frequent terms.
Our results reveal no clear pattern of differences between the gender pairs. The number of entrainment targets and significance levels for female, male, and mixed pairs are comparable (marginally weaker results for male pairs might be partially attributable to a smaller sample size). Direct comparisons between the gender pairs also do not reveal any significant differences for any of our measures. Neither does the comparison between individual speakers based on their gender, using the asymmetric version of our measure for overall productions. Similar analyses for acoustic entrainment in English have sometimes found differences based on speaker gender (Levitan et al., 2012) and sometimes not (Pardo et al., 2018; Weise et al., 2019). In our own analysis of acoustic entrainment in the same Hebrew corpus (Weise et al., 2020), we also found no difference based on speaker gender. The only study of the effect of gender on lexical entrainment we are aware of was for human-robot interactions and found that female speakers exhibited a greater degree of entrainment to the robot interlocutor than males did (Kimoto et al., 2017).
Speakers in subordinate roles are predicted to entrain more than those in power (Giles et al., 1991). This has been confirmed for lexical entrainment in English (Danescu-Niculescu-Mizil et al., 2011) and we find the same here. Followers, having less power due to their dependency on information from the leaders, entrain more than leaders with regard to their overall productions. Conversely, for directional terms alone, Silber-Varod et al. (2020) found that followers had greater influence on the terminology, that is, leaders adopted followers' terms more often than vice versa.
It is worth repeating that most speaker pairs in our corpus were acquainted prior to their participation in the experiment. There is little prior research on the impact of this factor. For acoustic-prosodic entrainment, Truong and Heylen (2012) find that unacquainted speakers exhibit more entrainment while Cabarrão et al. (2016) report an example with the opposite trend. The analysis of our own data indicates that familiarity has at least some influence, specifically for entrainment on overall productions. However, for hedge words the difference between high and low familiarity pairs is so far from significant (t(14) = 0.50, p = 0.62) that we do not believe familiarity explains the difference between our results and those for unacquainted English speakers.
Overall, we find that lexical entrainment in our Hebrew corpus is very much comparable to prior results for English. The only notable difference is the lack of entrainment on hedge words which, as we noted above, may be due to cultural differences. Future research should investigate additional conversational settings in Hebrew, including with unacquainted speakers.

A Percentages of words per list
This paper considers a variety of different word lists for entrainment measure sim_1 (see Section 4). These lists represent different percentages of all the words uttered by various speaker groups in the corpus, as detailed in Table 2. We include them here so they may be used to interpret our results, by themselves or in comparison with other corpora. We note, for instance, that imperative verb forms are comparatively rare, which might partially explain the lack of significant entrainment we found. The table also suggests differential use of the word lists by the speaker groups. For instance, as might be expected, the percentage of words that are imperative verb forms is more than twice as high for leaders (1.6%) as for followers (0.7%).

B Session details and raw similarities
Table 3 provides an overview of the sessions, i.e., speaker pairs, in our corpus and their similarity as measured by sim_1 and sim_2 (see Section 4).
Details for the sessions include the gender pair (female, male, mixed) and the level of familiarity between the interlocutors. Familiarity was categorized into two groups. Most speaker pairs were highly acquainted, through marriage (two pairs), prior service in the same military unit (three pairs), or work in the same department (six pairs). The remaining pairs had a low level of acquaintance through work in the same institution with little interaction (four pairs) or no acquaintance at all (one pair).
For each session and entrainment target, the table lists the respective similarity value as well as the baseline similarity derived from the average similarity with non-partners. All values are negative, with values closer to zero indicating greater similarity (see Section 4).