Prosodic segmentation for parsing spoken dialogue

Parsing spoken dialogue poses unique difficulties, including disfluencies and unmarked boundaries between sentence-like units. Previous work has shown that prosody can help with parsing disfluent speech (Tran et al. 2018), but has assumed that the input to the parser is already segmented into sentence-like units (SUs), which isn’t true in existing speech applications. We investigate how prosody affects a parser that receives an entire dialogue turn as input (a turn-based model), instead of gold standard pre-segmented SUs (an SU-based model). In experiments on the English Switchboard corpus, we find that when using transcripts alone, the turn-based model has trouble segmenting SUs, leading to worse parse performance than the SU-based model. However, prosody can effectively replace gold standard SU boundaries: with prosody, the turn-based model performs as well as the SU-based model (91.38 vs. 91.06 F1 score, respectively), despite performing two tasks (SU segmentation and parsing) rather than one (parsing alone). Analysis shows that pitch and intensity features are the most important for this corpus, since they allow the model to correctly distinguish an SU boundary from a speech disfluency – a distinction that the model otherwise struggles to make.


Introduction
Parsing spoken dialogue poses unique difficulties: spontaneous speech is full of disfluencies, including false starts, repetitions, and filled pauses. In addition, speech transcripts lack punctuation, which would otherwise help signal the boundaries of sentence-like units (SUs). Because of these difficulties, current parsers struggle to accurately parse English speech transcripts, even when they handle other English text well. However, research has shown that prosody can help with at least one of these problems, improving parsing performance for speech that contains disfluencies (Tran et al., 2018, 2019). In this work, we hypothesize that incorporating prosodic features from the speech signal can actually help with both of these problems: not only parsing disfluent speech, but also parsing speech that isn't segmented into SUs.
Other researchers have augmented parsers with prosodic features, but always with the assumption that the parser has access to gold SU boundaries, which cannot be assumed in a deployed speech application. For example, Gregory et al. (2004), Kahn et al. (2005), and Hale et al. (2006) incorporated prosody into statistical parsers or parse rerankers, with mixed results. More recently, Tran et al. (2018) and Tran et al. (2019) found that prosody improved an end-to-end neural parser, with the most significant gains in disfluent sentences. Parsing without access to gold SU boundaries is much more difficult: Kahn and Ostendorf (2012) showed that parsing quality depends on the quality of the sentence segmentation. Furthermore, finding SU boundaries is not as simple as finding long pauses in speech, as we demonstrate below.
We hypothesize that access to prosodic features will help an English parser that has to both parse and correctly identify SU boundaries (which we call SU segmentation). We test this hypothesis by inputting entire dialog turns to a neural parser without gold SU boundaries. We call this the turn-based model, and compare it to an SU-based model, which assumes gold SU boundaries and parses one SU at a time. We use turns as our input unit because they resemble the input a dialog agent would receive from a user. Following Tran et al. (2019) and others, we use a human-generated gold transcript instead of an automatic speech recognition (ASR) transcript; we plan to use ASR output in future work.
We build on the work of Tran et al. (2018) and Tran et al. (2019), considering two different experimental conditions for each model: inputting text features only and inputting both text and prosodic features. Using the Switchboard corpus of English conversational dialogue, we find that when only transcripts are used, the turn-based parser performs considerably worse than the SU-based parser, which is not surprising given that it needs to perform two tasks instead of one. However, when prosodic features are included, there is no difference in performance between the turn-based and SU-based models, and both models outperform their text-only counterparts.
Our primary contributions are:
• We show that a parser that has access to prosody can perform both SU segmentation and parsing as well as a model that only has to parse.
• We show that one difficulty for the prosody-free turn-based model is that it confuses speech disfluencies with SU boundaries, as illustrated in Figure 1. Further analysis indicates that adding pitch and intensity features can help the model to disambiguate the two, while pause and duration features do not.
Background: prosody and syntax
Prosodic signals divide speech into units (Pierrehumbert, 1980). The location and type of these prosodic units are determined by information structure (Steedman, 2000), disfluencies (Shriberg, 2001), and to some extent, syntax (Cutler et al., 1997). Some psycholinguistic research shows that in experimental conditions, speakers can use prosody to predict syntax: for example, English speakers can use prosody to determine where to attach a modifier or prepositional phrase, or how to correctly group coordinands (e.g., Kjelgaard and Speer (1999); Speer et al. (1996); Warren et al. (1995)). However, Cutler et al. (1997) argue that English speakers often "fail to exploit" this prosodic information even when it is present, so it isn't actually a signal for syntax in practice.
Many computational linguists have experimented with this possible link between syntax and prosody by incorporating prosody into syntactic parsers (e.g., Noeth et al. (2000); Gregory et al. (2004); Kahn et al. (2005); Tran et al. (2018)). These models have had mixed success: for example, Gregory et al. (2004) found that prosody was at best a neutral addition to their model, while Kahn et al. (2005) found that prosody helped rerank PCFG output. One possible reason that prosody is only somewhat effective in previous research is that prosodic units below the level of the SU do not always coincide with traditional syntactic constituents (Selkirk, 1984, 1995). In fact, the only prosodic boundaries that consistently coincide with syntactic boundaries are the prosodic boundaries at the ends of SUs (Wagner and Watson, 2010). The prosodic boundaries at the ends of SUs are more distinctive (i.e., they tend to correspond to longer pauses and more distinctive pitch and intensity variations) and are less likely to appear in any other location. These features make prosody a reliable signal for SU boundaries, even though it is an unreliable signal for syntactic structure below the SU level.
Some researchers have used this correlation between prosody and SU boundaries to help in SU boundary detection. Examples of SU segmentation models that found prosodic cues were important include Gotoh and Renals (2000), Kolář et al. (2006), Kahn et al. (2004), and Kahn and Ostendorf (2012), who all used traditional statistical models (e.g., HMMs, finite state machines, and decision trees), and Xu et al. (2014), who used a neural model. Kahn et al. (2004) and Kahn and Ostendorf (2012) also looked at downstream parsing accuracy on the same corpus we use. Like us, Kahn and Ostendorf (2012) don't use gold SU boundaries, but direct comparison is impossible because they use ASR output instead of human transcriptions and a different metric for parse performance (SParseval; Roark et al. (2006)). However, they show that having access to gold SU boundaries increases the SParseval score from 78.5 to 82.3, which shows that parsing without gold SU boundaries is difficult.
However, in some research areas, prosody is less frequently used for SU detection. Some ASR corpora and applications segment at relatively arbitrary boundaries such as long silences or even regular intervals (e.g., Jain et al. (2020)). Other applications, such as speech translation, do require syntactically coherent input, but even there, systems targeting SUs have often used only textual features (Sridhar et al., 2013; Wan et al., 2020). Systems for restoring punctuation from ASR output must identify SU boundaries to correctly insert sentence-final punctuation, but these systems are typically evaluated on rehearsed monologues (such as TED talks) or read speech, which largely lack disfluencies (e.g., Federico et al. (2012)). Here, we show that prosody is primarily helpful for distinguishing SU boundaries from disfluencies, so although some of these systems have used prosody (e.g., Tilk and Alumäe (2016)), text-only systems are very competitive (e.g., Che et al. (2016); Alam et al. (2020)).
Even when SU boundaries are already known, other research in parsing conversational speech has shown that prosody helps identify and correctly handle disfluencies. Tran et al. (2018) found that prosody only modestly affects parsing of fluent SUs, but has a marked effect on disfluent SUs. This accords with other previous work that has found that prosody is helpful in disfluency detection (Zayats and Ostendorf, 2019). We discuss the relationship between prosody and disfluencies in greater detail in Section 6, including how prosody helps the model not to confuse disfluencies and SU boundaries, as shown in Figure 1 above.

Task and data
We use the American English corpus Switchboard NXT (henceforth SWBD-NXT) (Calhoun et al., 2010). We choose this corpus mainly so we can compare performance with Tran et al. (2018) and Tran et al. (2019), as well as other earlier probabilistic models such as Kahn et al. (2005). SWBD-NXT comprises 642 dialogues between strangers conducted by telephone. These dialogues are transcribed and hand-annotated with Penn Treebank-style constituency parses. We preprocess the transcripts to remove punctuation and lower-case all letters, making the input more like an ASR transcript that would be used in a deployed application.
The transcript divides the corpus into SUs and turns. Since these SUs may be sentences or other syntactically independent units such as sentence fragments, we use the generic term 'sentence-like unit' (SU). A turn is a contiguous span of speech by a single speaker. Turns are hand-annotated in SWBD-NXT, but for a deployed dialog agent, a turn is simply whatever contiguous input the user gives. Not all turns in SWBD-NXT contain more than one SU: of a total 60.1k turns, 35.8k consist of a single SU. The remaining 24.3k contain more than one SU; the majority (52.4 percent) of these contain just two SUs. The average number of SUs per turn is 1.82.
We follow the general approach of Tran et al. (2018), but where they parse a single SU at a time, we give our parser a single dialog turn at a time for our turn-based model. The model returns constituency parses for the turn in the form of Penn Treebank (PTB)-style trees. In order to keep the output in the form of valid PTB trees, we add a top-level constituent, labelled TURN, to all turns, however many SUs they consist of. Examples (1) and (2) illustrate this: the two sentences in (1) are fused into the single turn in (2), with each SU becoming a child of the TURN node.
Of course, using turns instead of SUs leads to longer inputs. We experiment with a pipeline approach (first segmenting turns into SUs, then parsing) as well as an end-to-end approach. In the end-to-end approach, we can't handle extremely long inputs, since these longer sequences lead to high memory usage for transformers. We still want to capture the model's behavior on generally longer inputs, so we filter out only two problematically long turns from the training set (out of 49,294 turns). We do not have to remove any turns from the development or test sets. This leaves the maximum turn length at 270 tokens. We also remove any turns for which some or all speech features are missing from the corpus.
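The TURN wrapping described above can be sketched in a few lines. This is a minimal illustration rather than the preprocessing code used for the experiments: the function name `fuse_into_turn` and the two bracketed SUs are hypothetical, and the SUs are assumed to be well-formed PTB-style bracketings.

```python
def fuse_into_turn(su_trees):
    """Wrap one or more bracketed SU parses under a single TURN node."""
    if not su_trees:
        raise ValueError("a turn must contain at least one SU")
    # Each SU tree becomes a child of TURN, in its original order.
    return "(TURN " + " ".join(tree.strip() for tree in su_trees) + ")"


# Hypothetical SUs for illustration; they are not taken from SWBD-NXT.
su_1 = "(S (NP (PRP i)) (VP (VBP agree)))"
su_2 = "(S (NP (PRP it)) (VP (VBZ works)))"

turn = fuse_into_turn([su_1, su_2])
single = fuse_into_turn([su_1])
```

A single-SU turn is wrapped in the same way, so every training target has the same root label regardless of how many SUs it contains.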

Feature extraction
From the speech signal, we extract features for pauses between words, word duration, pitch, and intensity. We largely follow the feature extraction procedure outlined in Tran et al. (2018) and Tran et al. (2019), which we summarize here, noting any deviations from or additions to their procedure.
Pause features are extracted from the time-aligned transcript. Each word's pause feature corresponds to the pause that follows it. Each pause is categorized into one of six bins by length in seconds: p > 1, 0.2 < p ≤ 1, 0.05 < p ≤ 0.2, 0 < p ≤ 0.05, p ≤ 0 (see below), and pauses for which we are missing time-aligned data. Following Tran et al. (2018), the model learns 32-dimensional embeddings for each pause category.
Since we use turns instead of SUs, we have to determine how to handle pauses at the beginnings and endings of turns. We decide to calculate pauses based on all words in the transcript, not just the words for a single speaker at a time. This means that at a turn boundary, we calculate the pause as the time between the end of one speaker's turn and the beginning of the other speaker's turn. If one speaker interrupts another, the pause duration has a negative value. We place these negative-valued pauses in the same bin as pauses with length 0.
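The pause binning described above can be sketched as follows. This is an illustrative sketch, not the original feature extraction code: the function names, the bin indices, and the choice to treat the last word of the transcript as having a missing pause are all assumptions.

```python
from typing import List, Optional, Tuple


def pause_bin(pause: Optional[float]) -> int:
    """Map a pause length in seconds to one of six category indices."""
    if pause is None:
        return 5  # time-aligned data is missing
    if pause > 1.0:
        return 0
    if pause > 0.2:
        return 1
    if pause > 0.05:
        return 2
    if pause > 0.0:
        return 3
    return 4  # zero-length pauses and negative pauses (interruptions)


def pauses_after(word_times: List[Optional[Tuple[float, float]]]) -> List[Optional[float]]:
    """Pause after each word, computed over all speakers' words in order.

    word_times holds (start, end) pairs in seconds, or None if unaligned.
    """
    pauses = []
    for current, following in zip(word_times, word_times[1:] + [None]):
        if current is None or following is None:
            pauses.append(None)  # assumption: no following word means missing
        else:
            pauses.append(following[0] - current[1])
    return pauses


# Hypothetical timings: the third word starts before the second one ends.
times = [(0.0, 0.3), (0.4, 0.8), (0.7, 1.0), (2.5, 2.9)]
bins = [pause_bin(p) for p in pauses_after(times)]
```

Because the pauses are computed over the merged word sequence of both speakers, an interruption naturally produces a negative value, which falls into the same bin as a zero-length pause.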
Duration features are also extracted from the time-aligned transcript. We are interested in the relative lengthening or shortening of word tokens, so we normalize the raw duration of each token. Following the code base for Tran et al. (2019), we perform two different types of normalization. In the first case, we normalize the token's raw duration by the mean duration of every instance of that word type. In the second, we normalize the token's raw duration by the maximum duration of any word in the input unit (SU or turn). These two normalization methods result in two duration features for each word token, which are concatenated and input to the model.
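The two normalizations can be made concrete with a short sketch. The helper names and the toy durations below are hypothetical, and the sketch assumes that every word type in the input unit also appears in the data used to compute the per-type means.

```python
from collections import defaultdict


def mean_duration_by_type(corpus_tokens):
    """Mean raw duration (seconds) for each word type in (word, duration) pairs."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for word, duration in corpus_tokens:
        totals[word] += duration
        counts[word] += 1
    return {word: totals[word] / counts[word] for word in totals}


def duration_features(unit, type_means):
    """Two duration features per token of one input unit (SU or turn).

    Feature 1: duration divided by the mean duration of the word type.
    Feature 2: duration divided by the longest token duration in the unit.
    """
    longest = max(duration for _, duration in unit)
    return [(duration / type_means[word], duration / longest)
            for word, duration in unit]


# Hypothetical durations for illustration only.
corpus = [("the", 0.25), ("the", 0.75), ("dog", 0.5)]
means = mean_duration_by_type(corpus)
features = duration_features([("the", 0.75), ("dog", 0.5)], means)
```

In this toy input, "the" is lengthened relative to its type mean (first feature 1.5) and is also the longest token in the unit (second feature 1.0).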
Pitch features (or more accurately, F0 features) are extracted from the speech signal using Kaldi (Povey et al., 2011). These are extracted from 25 ms frames every 10 ms. Three pitch features are extracted: warped Normalized Cross Correlation Function (NCCF); log-pitch with mean subtraction over a 1.5-second window, weighted by Probability of Voicing (POV); and the estimated derivative of the raw log pitch. For further details on these features, see Ghahremani et al. (2014).
Intensity features are also extracted from the speech signal using the same software and frame size as we use for pitch features. Starting with 40-dimensional mel-frequency filterbank features, we calculate three features: (1) the log of the total energy, normalized by the maximum total energy for the speaker over the course of the dialog; (2) the log of the total energy in the lower half of the 40 mel-frequency bands, normalized by the total energy; and (3) the log of the total energy in the upper half of the 40 mel-frequency bands, normalized by the total energy.
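One way to read the three intensity features is sketched below. This is an interpretation under stated assumptions, not the original extraction code: it assumes that each frame is a list of 40 strictly positive linear band energies, that "normalized by" means division before taking the log, and that the speaker's maximum total energy has been computed in advance. The function name is hypothetical.

```python
import math


def intensity_features(frame, speaker_max_energy):
    """Three intensity features for one frame of mel filterbank energies.

    frame: band energies ordered from low to high frequency (assumed linear
    and strictly positive).
    speaker_max_energy: largest total frame energy for this speaker in the
    dialog (assumed precomputed).
    """
    half = len(frame) // 2
    total = sum(frame)
    low = sum(frame[:half])
    high = sum(frame[half:])
    return (
        math.log(total / speaker_max_energy),  # overall energy
        math.log(low / total),                 # share in the lower bands
        math.log(high / total),                # share in the upper bands
    )


# Hypothetical frame: the upper 20 bands carry three times the lower energy.
frame = [1.0] * 20 + [3.0] * 20
feats = intensity_features(frame, speaker_max_energy=160.0)
```

Under this reading, the second and third features are log proportions, so their exponentials sum to one for every frame.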
For training, development, and testing, we use the split described in Charniak and Johnson (2001), which is a standard split for experiments on SWBD-NXT (e.g., Kahn et al. (2005); Tran et al. (2018)).
The training set makes up 90 percent of the data, and the development and testing sets make up 5 percent each.

Model
We use the parser described in Tran et al. (2019), directly extending the code base described in their paper. The model is a neural end-to-end constituency parser based on Kitaev and Klein (2018)'s text-only parser, with a transformer-based encoder and a chart-style decoder based on Stern et al. (2017) and Gaddy et al. (2018). This encoder-decoder is augmented with a CNN on the input side that handles prosodic features (Tran et al., 2019). For further description of the model and hyperparameters, see Appendices A.1 and A.2.
The text is encoded using 300-dimensional GloVe embeddings (Pennington et al., 2014). Of the four types of prosodic features described in Section 3, pause and duration features are already token-level. However, pitch and intensity features are extracted from the speech signal at the frame level. In order to map from these frame-level features to a token-level representation, the pitch and intensity features pass through a CNN, and are then concatenated with the token-level pause and duration features.
We follow Tran et al. (2019) in training each model 10 times with different random seeds. For the development set, we report the mean of these 10 models' performance. We then select the median model by development set performance, and use it to calculate test set results. For any further experiments, such as those discussed in Section 6, we use the random seed for this median model. Each model is trained for 50 epochs, and we use the epoch with the highest development set performance.
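The reporting protocol can be sketched as below. The scores are hypothetical, and because an even number of runs has no unique median, the sketch assumes that the lower of the two middle runs is chosen; the text does not say how this tie is resolved.

```python
import statistics


def summarize_runs(dev_f1_by_seed):
    """Return the mean development F1 and the seed of the median run."""
    mean_f1 = statistics.mean(dev_f1_by_seed.values())
    ranked = sorted(dev_f1_by_seed, key=dev_f1_by_seed.get)
    # Assumption: with an even number of runs, take the lower middle run.
    median_seed = ranked[(len(ranked) - 1) // 2]
    return mean_f1, median_seed


# Hypothetical development-set scores for four seeds.
dev_scores = {0: 90.0, 1: 91.0, 2: 89.0, 3: 92.0}
mean_f1, median_seed = summarize_runs(dev_scores)
```

The mean is what gets reported for the development set, while the selected seed identifies the single model used for test-set results and follow-up analyses.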
In addition to this end-to-end approach, we also report results for a pipeline approach. For the pipeline, we first segment the speech into SUs using a modified version of the parser architecture: we keep the encoder the same, but we change the decoder so that it only does sequence labelling, and we frame the SU segmentation task as a sequence labelling task. We then use the SU-based parser to parse the resulting SUs. We report the model's performance with and without prosodic features during the segmentation and parsing steps.
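Framing segmentation as sequence labelling can be illustrated with a small sketch. The tag scheme here (1 for an SU-final token, 0 otherwise) is an assumption, since the text does not specify the label set, and the function names and example tokens are hypothetical.

```python
def turn_to_labels(sus):
    """Flatten a turn's SUs into tokens plus one boundary label per token."""
    tokens, labels = [], []
    for su in sus:
        if not su:
            continue  # skip empty SUs so tokens and labels stay aligned
        tokens.extend(su)
        labels.extend([0] * (len(su) - 1) + [1])  # 1 marks the SU-final token
    return tokens, labels


def labels_to_sus(tokens, labels):
    """Split a turn back into SUs after each token labelled 1."""
    sus, current = [], []
    for token, label in zip(tokens, labels):
        current.append(token)
        if label == 1:
            sus.append(current)
            current = []
    if current:  # trailing tokens with no predicted final boundary
        sus.append(current)
    return sus


# Hypothetical two-SU turn.
gold_sus = [["yeah"], ["i", "think", "so"]]
tokens, labels = turn_to_labels(gold_sus)
recovered = labels_to_sus(tokens, labels)
```

In a pipeline of this kind, the predicted labels would be converted back into SUs with something like `labels_to_sus` before each SU is passed to the SU-based parser.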

Results
We compare the turn-based F1 performance of our parser to a replication of the SU-based performance described in Tran et al. (2018) and Tran et al. (2019). Table 1 shows the development and test set results. We find that the turn-based model benefits significantly from prosody. With prosody, the turn-based model performs equivalently well to the SU-based model, despite doing two tasks instead of one. The SU-based model also improves by 0.36 in F1 score on the test set with the addition of prosody. Note that while prosody has a considerably larger effect on the turn-based model than on the SU-based model, the exact size of this change will depend on the corpus. For example, in a corpus with very few multi-SU turns, the performance change in the turn-based model might not be as large. However, our results suggest that prosody helps when a model needs to both detect SU boundaries and parse SUs.
The biggest difference between the SU- and turn-based models' performance on this corpus is in the text-only scenario, where the turn-based parser is substantially worse. This is expected for a few reasons. First, the text-only turn-based parser encounters longer inputs. Longer inputs tend to lead to more parse errors simply because there are more ways to parse a longer string. Table 2 shows this correspondence between length and performance. The median length of turns in the development set is 9 tokens, while the median length of SUs is 6 tokens. Longer strings are also more likely to contain the things that make parsing difficult, namely disfluencies and SU boundaries.
The turn-based parser's task is also more complex: it has to perform both SU segmentation and parsing, rather than parsing alone. This gives the turn-based parser novel ways to make errors by splitting a turn into the wrong number of SUs. However, prosody brings the turn-based parser up to the level of the SU-based parser, even though the turn-based model's task is more complex. Table 5 shows how the text-only parser significantly overestimates the number of SU boundaries. Without prosody, the model achieves an F1 score of 63.74 on SU prediction on the development set, compared to 99.41 with prosody (see Table 3). The most comparable work on SWBD is Kahn and Ostendorf (2012), who achieved 78 F1 using a hidden-event model, where we use a much more powerful transformer model; however, their model used ASR transcripts as input, so these scores aren't directly comparable.
We also test the pipeline model described in Section 4, which first segments turns into SUs and then parses them, both with and without prosody. We train just one segmentation model with the same random seed as the median development set model. We report the development set performance on segmentation (measured by segmentation F1 (Makhoul et al., 2000)) and parse F1 in Table 3.
The text+prosody pipeline model achieves an F1 score of 99.71, which is statistically indistinguishable from the end-to-end text+prosody model. In both cases, we see that the addition of prosody boosts SU segmentation accuracy to near-perfect levels, which explains why the parser performance is similar (and much better than without prosody).
Comparing the two text-only models reveals a more interesting pattern: while the pipeline model achieves much better segmentation F1, its parsing performance is worse. This is unexpected, as parsing and segmentation performance are usually correlated. This effect seems to arise because the two models err in different directions on segmentation: the pipeline model under-segments turns (corresponding to higher segmentation precision), while the end-to-end model over-segments (higher recall, substantially lower precision). When it over-segments, the end-to-end text-only model often splits a word or short constituent off of an otherwise well-formed SU subtree; by contrast, the pipeline model tends to leave two or more SUs combined and then to generate many SU-internal parsing errors. These SU-internal parsing errors include more coordination errors, as well as more VP, NP, and clause attachment errors, than the end-to-end model makes. However, the pipeline model does as well as the end-to-end model at PP attachment and modifier attachment.
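The contrast between over- and under-segmentation is easiest to see with boundary-level precision and recall. The sketch below uses the standard precision, recall, and F1 definitions over sets of predicted boundary positions; the boundary indices are hypothetical, and whether turn-final boundaries are counted is left to the caller.

```python
def boundary_prf(gold, predicted):
    """Precision, recall, and F1 over sets of SU boundary positions."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical turn with gold boundaries after tokens 4 and 9.
gold = {4, 9}
over_segmented = boundary_prf(gold, {2, 4, 7, 9})  # extra boundaries
under_segmented = boundary_prf(gold, {9})          # a missed boundary
```

In this toy case, both systems obtain the same F1, but the over-segmenting system has perfect recall and low precision while the under-segmenting system has the reverse. Identical segmentation scores can therefore hide quite different downstream parsing behavior.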
Overall, these results show that a pipeline model can be as effective at parsing as an end-to-end one, but that including prosody is even more important for a pipeline model. Since we care about parsing performance and the end-to-end text-only model does much better at parsing, we use the end-to-end model for all remaining analyses.

Error types
We use the Berkeley Parser Analyser (Kummerfeld et al., 2012) to determine what types of errors each of the SU-based and end-to-end turn-based models makes. Figure 2 summarizes the output of the Analyser. Overall, the SU-based parser shows only small effects from prosody, but the turn-based model does significantly worse on certain error types without prosody. Even for the turn-based model, prosody only affects error types that have to do with the shape of the tree. The different label category shows errors where two identically shaped trees have different constituent labels, and prosody has no effect on these.
For the turn-based model, poor SU segmentation by the text-only model explains some of the differences between the text+prosody and text-only models. Since 68.8 percent of SUs are clauses (i.e., they have a top node of type S, SBAR, SQ or SINV), an incorrect SU segmentation is usually classed as a clause attachment error. An example of this kind of attachment error can be seen in Appendix A.4. However, prosody also affects the turn-based model's rate of NP, PP, and modifier attachment errors. Since these attachment errors are not as common in the text-only SU-based model, it seems likely that they are caused by a cascade effect from errors in top-level SU segmentation. Prosody also affects the turn-based model's rate of unary errors (which are errors "involving unary productions that are not linked to a nearby error such as a matching extra or missing node") and single word phrase errors (which are "range of node errors that span a single word" but which are not related to other errors) (Kummerfeld et al., 2012). Finally, very modest differences are seen for two rare error types: NP-internal and VP attachment errors.
Our turn-based model performs worse overall on disfluent turns than on fluent turns, which was also true of Tran et al. (2018)'s SU-based model. Prosody also leads to a greater gain in F1 for disfluent turns than for fluent turns. These differences in performance are shown in Table 4. The lower performance on disfluent turns may be at least partially attributable to length differences: the median length of turns with disfluencies is 28 tokens, compared to 3 tokens for fluent turns, where we define a disfluent turn as any turn containing the constituent tag EDITED. As discussed in Section 5, longer input generally leads to more parser errors, meaning that disfluent turns are more likely to cause parser errors. However, there are other reasons disfluencies are difficult for the turn-based model, as discussed in the following section.

Distinguishing disfluencies and SU boundaries
One effect of disfluencies is that the text-only model tends to confuse certain kinds of disfluencies for SU boundaries, as illustrated in Figure 1.
Table 5 shows that the text+prosody model largely avoids this confusion, and indeed can do so almost as well using only pitch or intensity features. However, models using only pause or duration features are not good at distinguishing disfluencies from SU boundaries and predict boundaries too often. These results largely concur with previous work describing the similarities and differences between prosodic features of disfluencies and SU boundaries (Shriberg, 2001; Wagner and Watson, 2010).
In this section, we examine each of the features in turn.
When a disfluency is written out with square brackets, the bracketed text is called the reparandum, which is immediately followed by the interruption point. Disfluencies in SWBD-NXT are marked in the constituency parse annotation, where the reparandum is marked as a constituent with the label EDITED. The interruption point is the right edge of this constituent.
Our analysis draws on the work of Shriberg (2001), who described the prosodic features of the interruption point and the reparandum based on an analysis of three English conversational and task-based dialogue corpora: the Switchboard Corpus (a subset of which we use), ATIS (Hirschman, 1992), and AMEX (Kowtko and Price, 1989).
Pauses. Although pauses may be the most intuitive potential cue to SU boundaries, previous work suggests that long pauses also characterize interruption points (Wagner and Watson, 2010; Shriberg, 2001). Indeed, our analysis shows that longer pauses (> 0.05 s) are over-represented in both locations. If pause types were distributed uniformly, 16 percent of both SU boundaries and interruption points would have a longer pause. Instead, we find that 33 percent of SU boundaries and 37 percent of interruption points have such pauses. This explains why the pause-only model tends to confuse SU boundaries and interruption points.
Duration. Shriberg (2001) found that both interruptions and SU boundaries are associated with lengthening of the immediately preceding syllable. Lengthening before the interruption point may occur even if there are no other prosodic cues to the disfluency, and can be "far greater" than at SU boundaries (Shriberg, 2001, 161). This type of lengthening is captured by our first duration feature, which measures the token duration normalized by the mean duration for its word type. Like Shriberg (2001), we find that words preceding SU boundaries are lengthened on average (normalized duration: 1.18), and those preceding interruption points even more so (normalized duration: 1.41). In principle, this extra lengthening could help the duration-only model distinguish SU boundaries from interruptions, but in practice the model is nearly as bad at distinguishing them as the text-only model.
The second duration feature is the token length normalized by the maximum length of any token in the input, to normalize for speaking rate. Initially, this feature looks helpful: SU-final words have a mean value of 0.86, while words directly before the interruption point have a mean of 0.50. However, the feature mainly captures the number of phones in a word, since words with fewer phones, including English function words, tend to have shorter normalized duration. It turns out that function words occur more often before interruption points than before SU boundaries: using NLTK's stopwords (Bird and Klein, 2009) as a heuristic for function words, only 21.9 percent of development set SUs end in a function word, while the word before an interruption point is a function word 51.6 percent of the time. Since the second duration feature captures a lexical distinction that is already signalled in the text, it cannot help the duration-only model outperform the text-only model.
Pitch. Based on previous work, our finding that pitch features are useful is not a surprise: the pitch contour before an interruption point is generally "flat or slowly falling" (Shriberg, 2001, 161), while SU boundaries are characterized by a boundary tone, generally corresponding to a fall or rise. Our model may be able to learn such temporal patterns, but even just looking at static pitch features reveals differences between boundaries and interruptions for two of the three features. In particular, the mean warped NCCF value for pre-interruption point words is significantly higher than the value for SU-final words (p < 0.001), though somewhat lower than the overall average across the development set. Meanwhile, the log-pitch with POV-weighted mean subtraction is significantly lower at interruption points than at SU boundaries (p < 0.01). These differences allow the pitch-only model to distinguish SU boundaries and interruption points much better than the pause- or duration-only models can (see Table 5). Of these two pitch features, log-pitch is a more direct indicator of fundamental frequency (F0), which suggests that average perceived pitch is likely lower before disfluencies than before SU boundaries. There could be several reasons for this difference. For example, it could be that the "flat or slowly falling" tone of disfluencies that Shriberg (2001) describes has a lower average value than SU boundaries, which can have either a fall or a rise (e.g., for certain kinds of questions). However, examining pitch features across the whole corpus obscures more subtle distinctions such as different types of pitch contours.
Intensity. We find that intensity features alone are enough to distinguish SU boundaries from interruption points, which is interesting because intensity has not been previously identified as an important cue: Shriberg (2001) doesn't note any particularly distinctive intensity features of the reparandum or interruption point, and work by Kim et al. (2006) on the Switchboard Corpus suggests that SU boundaries are correlated with lower intensity in some speakers, but that this isn't consistent across speakers. The three intensity features correspond to overall energy, energy in the lower half of frequencies, and energy in the upper half of frequencies. SU-final words have a significantly higher mean value for lower-frequency intensity than all other words (p < 0.001), while words before the interruption point do not. This systematic difference in one intensity feature seems to be part of how intensity features allow the model to consistently tell SU boundaries apart from disfluencies.
Overall performance. Given our claim that the main issue facing the text-only turn-based parser is distinguishing disfluencies from SU boundaries, it is not surprising that the two features that do best at this, pitch and intensity, also yield the highest overall performance. Results are shown in Table 6.

[Table 6. Columns: Features, F1. First row: All.]

Conclusion
Our experiments show that parsing English speech transcriptions without gold SU boundaries is difficult for our parser: its F1 score drops by about 4 percentage points compared to a model with gold SU boundaries. Incorrect SU segmentation causes a large part of this damage, though other errors in tree construction also play a role. We show that we can undo this damage by giving our parser prosodic information. Importantly, prosody helps by allowing the parser to distinguish disfluencies from SU boundaries. These results argue for giving prosodic information to parsers in deployed applications, including dialog agents, where no SU boundary annotations are available.
Furthermore, our experiments show that even limited prosodic features help a great deal: for our English data, pitch information alone is not significantly worse than pitch, intensity, pause, and word duration information combined. This means that incorporating the right kind of prosodic information can potentially lead to significant gains.

A.1 Model description
The parser is an encoder-decoder model that takes both speech and text inputs. In this appendix, we describe the three main model components: the CNN that processes the continuous speech inputs before they reach the encoder, the transformer-based encoder, and the chart-style decoder.

A.1.1 The speech-processing CNN
Of the four prosodic features, pause and duration are already discrete at the token level. Pitch and intensity, however, are extracted from frames every 10 ms in the original speech signal. If a given token is shorter than a fixed number of frames, some frames of left and right context are included; frames from longer tokens are subsampled to reduce their frame length. These two frame-based features have a different dimensionality than the token-level input, and they are untenably long for a sequence model or transformer. The CNN solves both of these problems by producing a fixed-length representation for each feature at the token level. This representation can be concatenated with the other token-level features and input to the encoder.
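The windowing step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `fixed_length_frames`, the even split of context between left and right, the evenly spaced subsampling, and the zero-padding fallback are all assumptions, and the target length is a free parameter because the text does not state the fixed number of frames.

```python
def fixed_length_frames(frames, start, end, target_len):
    """Return exactly target_len frames for the token covering frames[start:end].

    frames is a list of per-frame feature lists (one frame every 10 ms).
    All names and the padding policy are hypothetical, for illustration only.
    """
    n = end - start
    if n >= target_len:
        # Long token: keep target_len evenly spaced frames from inside the token.
        return [frames[start + (k * n) // target_len] for k in range(target_len)]

    # Short token: widen the window with left and right context.
    extra = target_len - n
    lo = start - extra // 2
    hi = end + (extra - extra // 2)
    # Shift the window back inside the signal if it runs off either edge.
    if lo < 0:
        hi -= lo
        lo = 0
    if hi > len(frames):
        lo -= hi - len(frames)
        hi = len(frames)
    lo = max(lo, 0)
    window = frames[lo:hi]
    # Assumption: zero-pad when the whole signal is shorter than target_len.
    pad = [[0.0] * len(frames[0]) for _ in range(target_len - len(window))]
    return window + pad
```

Whatever policy is used, the point is that every token ends up with a bounded number of frames before the CNN is applied.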
For a speech input with f frames, the raw features input to the CNN have dimensions 6 × f, where 6 is the total number of features for each frame (3 pitch features and 3 intensity features). Several filters of different widths then perform one-dimensional convolution over the input. These different filter widths allow the CNN to integrate information on various time scales. We apply N filters at each of m widths, for a total of mN filters. We use the hyperparameters described by Tran et al. (2018): N = 32 filters at each of the m = 4 widths w = [5, 10, 25, 50], for a total of mN = 128 filters. The output of each filter is then max-pooled over time, which converts the features for a given token to a uniform dimension.
These CNN-processed features are then concatenated with the token-level prosodic features (pause and duration) and the text embedding for the token, and then input to the encoder. The CNN is trained along with the encoder-decoder model.
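To make the shapes concrete, here is a small pure-Python sketch of convolution followed by max-pooling over time. It is an illustration under stated assumptions rather than the trained module: the weights are random, no bias or nonlinearity is applied, short inputs are zero-padded on the right, and the helper names `conv1d_max_pool` and `make_filters` are hypothetical. Only the filter count and widths (32 filters at each of the widths 5, 10, 25, and 50) come from the text.

```python
import random

def conv1d_max_pool(frames, filters):
    """frames: list of f frames, each a list of per-frame features.
    filters: list of kernels, each a list of w rows of per-feature weights.
    Returns one max-pooled activation per filter, independent of f."""
    pooled = []
    for kernel in filters:
        w = len(kernel)
        # Assumption: zero-pad on the right if there are fewer frames than w.
        padding = [[0.0] * len(frames[0]) for _ in range(max(0, w - len(frames)))]
        padded = frames + padding
        best = float("-inf")
        for t in range(len(padded) - w + 1):
            act = sum(kernel[d][c] * padded[t + d][c]
                      for d in range(w) for c in range(len(kernel[d])))
            best = max(best, act)  # max-pooling over time
        pooled.append(best)
    return pooled

def make_filters(widths=(5, 10, 25, 50), n_per_width=32, n_features=6, seed=0):
    """Random stand-ins for learned kernels: n_per_width filters per width."""
    rng = random.Random(seed)
    return [[[rng.uniform(-0.1, 0.1) for _ in range(n_features)] for _ in range(w)]
            for w in widths for _ in range(n_per_width)]

filters = make_filters()
token_frames = [[0.1 * c for c in range(6)] for _ in range(60)]
token_vector = conv1d_max_pool(token_frames, filters)  # 128 values per token
```

Because each filter contributes exactly one pooled value, every token maps to a 128-dimensional vector regardless of how many frames it spans, which is what allows concatenation with the token-level features.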

A.1.2 The encoder
The encoder is a standard transformer with eight attention heads, based on the work of Kitaev and Klein (2018). For each input word $x_i$, the transformer encoder produces a representation of the forward context, $\overrightarrow{y}_i$, and of the backward context, $\overleftarrow{y}_i$. We represent a given span between indices $i$ and $j$ by subtracting the forward representations at the two endpoints, doing the same for the backward representations, and concatenating the results, which yields the span representation $v_{(i,j)}$. The next section explains how we use this span representation to generate scores for constituents in a tree.
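As a sketch of what such a span representation can look like, the following assumes the differencing scheme of Kitaev and Klein (2018), on which the encoder is based. The exact index offsets are an assumption and are not taken from this paper.

```latex
v_{(i,j)} = \left[\, \overrightarrow{y}_j - \overrightarrow{y}_i \;;\; \overleftarrow{y}_{j+1} - \overleftarrow{y}_{i+1} \,\right]
```

The forward difference summarizes what the left-to-right context gained while crossing the span, and the backward difference does the same for the right-to-left context.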

A.1.3 The decoder
The decoder is a chart-style span-based decoder. Its goal is to output the correct tree $T$ for an input $x_1, \ldots, x_n$. Each tree's score $S_{tree}(T)$ is simply the sum of the scores of its constituents, where each constituent is defined by a start index $i$, an end index $j$, and a label $l$:
$$S_{tree}(T) = \sum_{(i,j,l) \in T} \left[ S_{label}(i,j,l) + S_{span}(i,j) \right]$$

As this formula for tree score shows, each constituent's score is made up of a label score and a span score. Conceptually, the span score corresponds to the probability that a constituent exists that exactly covers span $(i, j)$ in the input; the label score reflects the probability that the span $(i, j)$ has a given constituent label (e.g., S, NP). The decoder must have a way of determining the label score and span score for each constituent.

Label scores are generated by passing the span representation $v_{(i,j)}$ through a two-layer feedforward network like the feedforward networks that Vaswani et al. (2017) use. Following Kitaev and Klein (2018), we also include a layer normalization step (LNorm). This feedforward network produces a vector $S_{label}(i, j)$ for each span, whose size is the number of possible labels. The $l$th element of this vector is the score for the label $l$:

$$S_{label}(i,j,l) = \left[ S_{label}(i,j) \right]_l$$

We also need to calculate the span score, but calculating the score for all spans $(i, j)$ would be prohibitively inefficient. Instead, Kitaev and Klein (2018), following the approach of Stern et al. (2017) and Gaddy et al. (2018), use a dynamic programming strategy based on the CKY algorithm. The score for a span $(i, j)$ is calculated in terms of the scores of its subspans, which allows span scores to be built up recursively from the stored scores of smaller spans. A given span $(i, j)$ can be split at any internal point into two subspans, $(i, k)$ and $(k, j)$. Each of these possible splits $(i, k, j)$ is assigned a score, calculated by summing the span scores of the subspans:

$$S_{split}(i,k,j) = S_{span}(i,k) + S_{span}(k,j)$$

Then, to find the best score for this span $(i, j)$, we find the label and split that maximize the following sum:

$$\max_{l,\,k} \left[ S_{label}(i,j,l) + S_{split}(i,k,j) \right]$$

All spans are recursively split into subspans, eventually arriving at single-word spans. Since there are no splits possible for a single-word span, the score for a single-word span is simply that word's best label score, $\max_l S_{label}(i, i+1, l)$. This method requires that the grammar be in Chomsky normal form, which the model achieves by collapsing strings of unary rules and using dummy nodes to make n-ary rules into binary rules.
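The recursion above can be sketched as a small chart decoder. This is a hedged illustration, not the paper's code: `cky_decode` and `label_score` are hypothetical names, the label scores are supplied by a toy callable instead of the network, and the sketch reads the span scores in the split formula as the stored best scores of the subspans, which is an assumption about the intended recursion.

```python
def cky_decode(n, labels, label_score):
    """Find the best binary tree over fenceposts 0..n.

    label_score(i, j, l) is a stand-in for the network output S_label(i, j, l).
    Returns (best total score, list of (i, j, label) constituents).
    """
    best = {}  # (i, j) -> best score of any subtree covering the span
    back = {}  # (i, j) -> (best label, best split point or None)
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            top_score, top_label = max((label_score(i, j, l), l) for l in labels)
            if length == 1:
                # Single-word span: no split, only the best label score.
                best[i, j] = top_score
                back[i, j] = (top_label, None)
                continue
            # Best split, built from the stored scores of smaller spans.
            split_score, k = max((best[i, k] + best[k, j], k)
                                 for k in range(i + 1, j))
            best[i, j] = top_score + split_score
            back[i, j] = (top_label, k)

    def build(i, j):
        label, k = back[i, j]
        if k is None:
            return [(i, j, label)]
        return [(i, j, label)] + build(i, k) + build(k, j)

    return best[0, n], build(0, n)

# Toy scores: reward exactly the constituents of one hand-written tree.
gold = {(0, 3): "S", (0, 1): "NP", (1, 3): "VP", (1, 2): "V", (2, 3): "NP"}
toy_labels = ["S", "NP", "VP", "V", "-"]  # "-" plays the role of a dummy label
score, tree = cky_decode(3, toy_labels,
                         lambda i, j, l: 1.0 if gold.get((i, j)) == l else 0.0)
```

Because the label and the split point are chosen independently for each span, the chart is filled in cubic time in the input length (times the number of labels), which is why long turn-level inputs remain tractable.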
With this method of generating tree scores from span representations, we can then define the hinge loss for our predicted tree $T$ compared to the gold tree $T^*$, where $\Delta$ represents the Hamming loss on labeled spans. We then use this loss function to train our encoder-decoder, including the CNN input module for speech.
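For reference, a margin-based hinge loss of this kind is commonly written as below. This sketch assumes the form used by Stern et al. (2017) and Kitaev and Klein (2018), with the tree score $S_{tree}$ defined earlier; the symbol $L$ is introduced here for illustration.

```latex
L = \max\left( 0,\; \max_{T} \left[ S_{tree}(T) + \Delta(T, T^*) \right] - S_{tree}(T^*) \right)
```

The loss is zero only when the gold tree outscores every other tree by a margin of at least that tree's Hamming distance from the gold tree.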

A.2 Model training details
We used the hyperparameters specified in the code base of Tran et al. (2019).

A.3 Incorporating BERT
We include here the results for both the SU- and turn-based parsers when given BERT embeddings (Devlin et al., 2019) in place of GloVe embeddings (Pennington et al., 2014). We train one model for each experimental condition, using the random seed we used to generate the results shown in Table 1. We see in Table 8 that  performance in all experimental conditions. The SU-based text+prosody parser does outperform the turn-based parser by a statistically significant margin, though this result was obtained on just one model instead of 10 randomly seeded models. However, the turn-based parser's performance remains quite close to the SU-based parser's despite having a more difficult task to perform, and otherwise the basic pattern from the GloVe results holds here.

A.4 Clause attachment illustration
Figure 1: A portion of a turn that contains both disfluencies (shown in curly braces) and an SU boundary. A simplified version of the text+prosody model output is shown in (a), which matches the gold SU boundaries. The text-only model incorrectly places an SU boundary after a disfluency (shown in (b)).

Figure 2: Prevalence of various error types in the development set output, given four different experimental conditions: SU-based, with and without prosody; and turn-based, with and without prosody. Error types are classified by the Berkeley Parser Analyser (Kummerfeld et al., 2012).

Figure 3: An example of a clause attachment error. The tree shown in (a) is correctly parsed as a single SU by the text+prosody model, whereas the text-only model incorrectly segments this into two SUs, as shown in (b). This example is taken from the development set and slightly simplified for space (shown by ellipses).

Figure 3 illustrates an example of an error classified as a clause attachment error by the Berkeley Parser Analyser (Kummerfeld et al., 2012).

Table 1: Test and development set F1 of the turn-based model compared to the SU-based model. Dev. set scores are the mean over 10 random seeds. For the test set, we use the model that has the median dev. set performance out of 10 randomly seeded models.

Table 2: F1 performance of the text-only and text+prosody turn-based models on inputs of various lengths in the development set. The inputs are divided into bins of approximately equal size by token length.

Table 3: Development set performance of the pipeline model on segmentation and parsing as compared to the end-to-end model. (Results are from single models rather than an average as in Table 1.)

Table 5: The total number of SU boundaries predicted on the set as compared to the number of SU boundaries predicted to fall at what are actually interruption points within disfluencies. The first line shows the target for both values. We give results for a model with all four prosodic features, models with only one prosodic feature at a time, and a model with no prosodic features.

Table 6: Results of ablation testing, measured by F1 score on the dev. set. Asterisks indicate a statistically significant difference (p < 0.001) from the model with all features. The first row shows the result with all features; the next four rows show the result with one feature at a time; the final row shows the result with no prosody.

Table 7: Model hyperparameters. Note that the maximum sequence length for the SU-based model is 200 tokens.

The hyperparameters come from the code base of Tran et al. (2019) and are documented in Table 7. Each model was trained for 50 epochs on a single Nvidia GTX 1080 GPU, which took approximately 7 hours per model. The text-only models have approximately 23M trainable parameters each, while the text+prosody models have approximately 20M trainable parameters.

Table 8: Development set F1 when using BERT embeddings, comparing the turn-based model to the SU-based model.