Toward Interactive Dictation

Voice dictation is an increasingly important text input modality. Existing systems that allow both dictation and editing-by-voice restrict their command language to flat templates invoked by trigger words. In this work, we study the feasibility of allowing users to interrupt their dictation with spoken editing commands in open-ended natural language. We introduce a new task and dataset, TERTiUS, to experiment with such systems. To support this flexibility in real time, a system must incrementally segment and classify spans of speech as either dictation or command, and interpret the spans that are commands. We experiment with using large pre-trained language models to predict the edited text, or alternatively, to predict a small text-editing program. Experiments show a natural trade-off between model accuracy and latency: a smaller model achieves 30% end-state accuracy with 1.3 seconds of latency, while a larger model achieves 55% end-state accuracy with 7 seconds of latency.


Introduction
Speech can be preferable for text entry, especially on mobile devices or while the user's hands are occupied, and for users for whom typing is always slow or impossible. While fast and accurate automatic speech recognition (ASR) is now ubiquitous (Kumar et al., 2012; Xiong et al., 2016; Chiu et al., 2018; Radford et al., 2022), ASR itself only transcribes speech. In practice, users may also wish to edit transcribed text. The ASR output might be incorrect; the user might have misspoken; or they might change their mind about what to say or how to phrase it, perhaps after seeing or hearing their previous version. Azenkot and Lee (2013) found that users with visual impairment spent 80% of their time editing text vs. 20% dictating it. In this work, we study the task of interactive dictation, in which users can both perform verbatim dictation and utter open-ended commands in order to edit the existing text, in a single uninterrupted speech stream. See Figure 1 for an example. Unlike commercial systems such as Dragon (DNS; Nuance, 1997, 2022) and dictation for Word (Microsoft, 2022), which require reserved trigger words for commanding, the commands in our data are invoked using unrestricted natural language (NL). For example, in Figure 1, both (b) and (d) invoke replace commands, but (d) uses nested syntax to specify both an edit action and location, while (b) is implicit (as natural speech repairs often are).
In interactive dictation, users do not need to memorize a list of specific trigger words or templates in order to invoke their desired functionality.
A dictation system should be as intuitive as dictating to a human assistant: a situation in which people quite naturally and successfully intersperse speech repairs and commands with their dictation. Beyond eliminating the learning curve, letting users speak naturally should also allow them to focus on what they want to say, without being repeatedly distracted by the frustrating separate task of getting those words into the computer.
Because we accept unrestricted NL for commands, both segmentation and interpretation become nontrivial for a system to perform. 1 Segmentation requires capturing (sometimes subtle) changes in intent, and is especially difficult in cases where command boundaries do not align with ASR boundaries. 2 We collect a dataset of 1320 documents dictated in an interactive environment with live, incremental ASR transcription and Wizard-of-Oz-style interpretation of user commands. Annotators were not told a set of editing features they were allowed to use, but simply instructed to make their commands understandable and executable by a hypothetical human helper. Collection required designing a novel data collection interface. Both the interface and dataset will be publicly released to help unlock further work in this area. 3 Finally, we experiment with two strategies for implementing the proposed system: one that uses a pre-trained language model to directly predict the edited text given unedited text and a command, and another that interprets the command as a program specifying how to edit. Predicting intermediate programs reduces latency because the programs are short, at the expense of accuracy. This strategy also requires additional work to design and implement a set of editing functions and annotate commands with programs that use these functions.
For each strategy, we also experimented with two choices of pre-trained language model: a small finetuned T5 model and a large prompted GPT3 model. Using the smaller model significantly improves latency, though again at the cost of accuracy.
In summary, our contributions are: (1) a novel task (interactive dictation), (2) a novel data collection interface for the task, with which we collect a new dataset, and (3) a system that implements said task, with experiments and analysis.

1 In template-based systems, by contrast, commands can be detected and parsed using regular expressions. An utterance is considered a command if and only if it matches one of these regular expressions.

2 In Figure 1, for example, we must segment the first sentence into two parts, a dictation ("Just wanted to ask about the event on the 23rd") and a command ("on Friday the 23rd"). ASR can also overpredict boundaries when speakers pause in the middle of a sentence. For example, in our data "Change elude mansion to elude mentioned." was misrecognized by MSS as "Change. Elude mansion to elude mentioned."

3 https://aka.ms/tertius

Background & Related Work
Many modern speech input tools only support direct speech-to-text (e.g., Radford et al., 2022). Occasionally, these models also perform disfluency correction, which includes removing filler words (e.g., um), repeated words, false starts, etc. (e.g., Microsoft Azure, 2022). One form of disfluency that has received particular attention is speech repair, where the speaker corrects themself mid-utterance. For example, let's chat tomorrow uh I mean Friday contains a speech repair, where the user corrects "tomorrow" with "Friday." The repaired version of this should be let's chat Friday. Prior work has collected datasets and built systems specifically for speech repair (Heeman and Allen, 1994, 1999; Johnson and Charniak, 2004). Additionally, ASR systems themselves make errors that humans may like to correct post-hoc; there has been work on correcting ASR errors through respeaking misdetected transcriptions (McNair and Waibel, 1994; Ghosh et al., 2020; Vertanen and Kristensson, 2009; Sperber et al., 2013). Beyond disfluencies that were not automatically repaired but were transcribed literally, humans must fix many other mistakes while dictating. They often change their mind about what to say (the human writing process is rarely linear), and ASR itself commonly introduces transcription errors. Most systems require the user to manually fix these errors through keyboard-and-mouse or touchscreen editing (e.g., Kumar et al., 2012), which can be inconvenient for someone who already relies on voice for dictation. Furthermore, most commercial systems that support editing through speech (DNS, Word) require templated commands. Thus, while speech input is often used to write short-form, imprecise text (e.g., search queries or text messages), it is not as popular as it might be, and it is used less when writing longer and more precise documents.
In our work, we study making edits through spoken natural language commands. Interpreting flexible natural language commands is a well-studied problem within NLP, with work in semantic parsing (Zelle and Mooney, 1993; Zettlemoyer and Collins, 2009; Artzi and Zettlemoyer, 2013), instruction-following (Chen and Mooney, 2011; Branavan et al., 2009; Tellex et al., 2011; Anderson et al., 2018; Misra et al., 2017), and task-oriented dialogue (Budzianowski et al., 2018). Virtual assistants like Siri (Apple, 2011), Alexa (Amazon, 2014), and Google Assistant (Google, 2016) have been built to support a wide range of functionalities, including interacting with smart devices, querying search engines, scheduling events, etc. Due to advances in language technologies, modern-day assistants can support flexible linguistic expressions for invoking commands, accept feedback and perform reinterpretation (Semantic Machines et al., 2020), and work in an online and incremental manner. Our work falls in this realm but: (1) in a novel interactive dictation setting, (2) with unrestricted commanding, and (3) where predicting boundaries between dictations and commands is part of the task.
Recently, a line of work has emerged examining how large language models (LLMs) can serve as collaborative writing/coding assistants. Because of their remarkable ability to generate coherent texts over a wide range of domains and topics, LLMs have proven surprisingly effective for editing, elaboration, infilling, etc., across a wide range of domains (Malmi et al., 2022; Bavarian et al., 2022; Donahue et al., 2020). Though our system also makes use of LLMs, it supports a different mode of editing than these prior works. Some works use edit models for other types of sequence-to-sequence tasks (e.g., summarization, text simplification, style transfer) (Malmi et al., 2019; Dong et al., 2019; Reid and Zhong, 2021), while others use much coarser-grained editing commands than we do, expecting the LLM to (sometimes) generate new text (Bavarian et al., 2022; Zhang et al., 2023). In addition to these differences, our editing commands may be misrecognized because they are spoken, and may be misdetected/missegmented because they are provided through the same channel as text entry.

Task Framework
We now formalize our interactive dictation setting. A user who is editing a document speaks to a system that both transcribes user dictation and responds to user commands. This process results in an interactive dictation trajectory, a sequence of timestamped events: the user keeps speaking, several trained modules keep making predictions, and the document keeps being updated.
Supervision could be provided to the predictive modules in various ways, ranging from direct supervision to delayed indirect reward signals. In this paper, we collect supervision that can be used to bootstrap an initial system. We collect gold trajectories in which every prediction is correct-except for ASR predictions, where we preserve the errors since part of our motivation is to allow the user to fix dictation errors. 4 All predictions along the trajectory are provided in the dataset.
Our dataset is not completely generic, since it assumes that certain predictive modules will exist and interact in particular ways, although it is agnostic to how they make their predictions. It is specifically intended to train a system that is a pipeline of the following modules (Figure 2):

(a) ASR As the user speaks, the ASR module proposes transcripts for spans of the audio stream. Due to ASR system latency, each ASR result normally arrives some time after the end of the span it describes. The ASR results are transcripts of successive disjoint spans of the audio, and we refer to their concatenation as the current transcript (U in Figure 2(a)).

(b) Segmentation When the current transcript changes, the system can update its segmentation. It does so by partitioning the current transcript U into a sequence of segments u i, labeling each as being either a dictation or a command.

(c) Normalization (optional) Each segment u i can be passed through a normalization module, which transforms it from a literal transcript into clean text that should be inserted or interpreted. This involves speech repair as well as text normalization to handle orthographic conventions such as acronyms, punctuation, and numerals.
While module (a) may already attempt some version of these transformations, an off-the-shelf ASR module does not have access to the document state or history. It may do an incomplete job, and there may be no way to tune it on gold normalized results. This normalization module can be trained to finish the job. Including it also ensures that our gold trajectories include the intended normalized text of the commands.

(d) Interpretation Given a document state d i−1 and a segment u i, the interpretation module predicts the new document state d i that u i is meant to achieve. 5 The document is then immediately updated to state d i; the change could be temporarily highlighted for the user to inspect. Here d i−1 is the result of having already applied the updates predicted for segments u 1, . . . , u i−1, where d 0 is the initial document state. Concretely, we take a document state to consist of the document content together with the current cursor position. 6 When u i is a dictation segment, no prediction is needed: the state update simply inserts the current segment at the cursor. However, when u i is a command segment, predicting the state update that the user wanted requires a text understanding model. Note that commands can come in many forms. Commonly they are imperative commands, as in Figure 1d. But one can even treat speech repairs such as Figure 1b as commands, in a system that does not handle repairs at stage (a) or (c).

Figure 2: Diagram of an interactive dictation system. First, the ASR system (a) transcribes speech, which the segmentation system (b) parses into separate dictation and command segments. Next, an optional normalization module (c) fixes any ASR or speech errors in the segment. Finally, the interpretation system (d) returns the result of each operation. On the right is the concrete instantiation of our system.
Rather than predict d i directly, an alternative design is to predict a program p i and apply it to d i−1 to obtain d i. In this case, the gold trajectory in our dataset includes a correct program p i, which represents the intensional semantics of the command u i (and could be applied to different document states).

5 This prediction can also condition on earlier segments, which provide some context for interpreting u i. It might also depend on document states other than d i−1, such as the state or states that were visible to the user while the user was actually uttering u i, for example.

6 The cursor may have different start and end positions if a span of text is selected, but otherwise has width 0. For example, the document state d 1 in Figure 2 is ("Attached are the espeak events.", (31, 31)).
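To make the state representation concrete, the following is a minimal sketch in Python of a document state as a (content, cursor) pair and of the trivial update performed for a dictation segment. The names are our own; this is an illustration of the formalization above, not the paper's implementation.

# Illustrative only: a document state is the text plus a cursor span, as described above.
DocState = tuple[str, tuple[int, int]]  # (content, (cursor_start, cursor_end))

def apply_dictation(state: DocState, segment_text: str) -> DocState:
    """Insert a dictation segment at the cursor (replacing any selected span)."""
    content, (start, end) = state
    new_content = content[:start] + segment_text + content[end:]
    cursor = start + len(segment_text)
    return new_content, (cursor, cursor)

# e.g., starting from an empty document:
d0: DocState = ("", (0, 0))
d1 = apply_dictation(d0, "Attached are the espeak events.")
# -> ("Attached are the espeak events.", (31, 31)), matching footnote 6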

Change Propagation
The ASR engine we use for module (a) sometimes revises its results. It may replace the most recent of the ASR results, adding new words that the user has spoken and/or improving the transcription of earlier words. The engine marks an ASR result as partial or final according to whether it will be replaced. 7 To make use of streaming partial and final ASR results, our pipeline supports change propagation. This requires the predictive modules to compute additional predictions. If a module is notified that its input has changed, it recomputes its output accordingly. For example, if module (a) changes the current transcript, then module (b) may change the segmentation. Then module (c) may recompute normalized versions of segments that have changed. Finally, module (d) may recompute the document state d i for all i such that d i−1 or u i has changed.
The visible document is always synced with the last document state. This sync can revert and replace the effects on the document of previous incorrectly handled dictations and commands, potentially even from much earlier segments. To avoid confusing the user with such changes, and to reduce computation, a module can freeze its older or more confident inputs so that they reject change notifications (Appendix B). Modules (b)-(d) could also adopt the strategy of module (a): quickly return provisional results from a "first-pass" system, with the freedom to revise them later. This could further improve the responsiveness of the experience.
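The following is a rough sketch of change propagation as just described: when the current transcript is revised, the pipeline recomputes only from the first segment that changed, reusing earlier document states. The function names (segment, apply_segment) are placeholder wrappers around the pipeline's models, not the actual implementation.

def propagate(transcript, prev_segments, prev_states, d0, segment, apply_segment):
    new_segments = segment(transcript)            # re-run segmentation on the new transcript
    k = 0                                         # length of the unchanged prefix
    while (k < len(new_segments) and k < len(prev_segments)
           and new_segments[k] == prev_segments[k]):
        k += 1
    states = list(prev_states[:k])                # reuse document states for the unchanged prefix
    d = states[-1] if states else d0
    for seg in new_segments[k:]:                  # recompute the changed suffix
        d = apply_segment(seg, d)                 # dictation insert or command interpretation
        states.append(d)
    return new_segments, states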

Dataset Creation
To our knowledge, no public dataset exists for the task of interactive dictation. As our task is distinct from prior work in a number of fundamental ways ( §2), we create a new dataset, TERTiUS. 8 Our data collection involves two stages. First, a human demonstrator speaks to the system and provides the gold segmentations, as well as demonstrating the normalizations and document state updates for the command segments. Later, for each command segment, an annotator fills in a gold program that would yield its gold state update.
For command segments, we update the document during demonstration using the demonstrated state updates; that is, they do double duty as gold and actual state updates. Thus, we follow a gold trajectory, as if the demonstrator is using an oracle system that perfectly segments their speech into dictations (though these may have ASR errors) versus commands, and then perfectly interprets the commands. A future data collection effort could instead update the document using the imperfect system that we later built ( §5), in which case the demonstrator would have to react to cascading errors.

Collecting Interactive Dictation
We build a novel data collection framework that allows us to collect speech streams and record gold and actual events.
We used an existing ASR system, Microsoft Speech Services (MSS; Microsoft Azure, 2022). We asked the demonstrator to play both the role of the user (issuing the speech stream), and also the roles of the segmentation, normalization, and interpretation parts of the system (Figures 2b-d). Thus, we collect actual ASR results, while asking the demonstrator to demonstrate gold predictions for segmentation, normalization, and interpretation.
The demonstration interface is shown in Figure 3. Demonstrators were trained to use the interface, and told during training how their data would be used. 9 A demonstrator is given the task of dictating an email into our envisioned system (shown in the yellow textbox). We collected data in three scenarios:

Figure 3: Data collection UI. Demonstrator speech is transcribed by a built-in ASR system. Demonstrators specify gold segmentations by pressing a key to initiate a command segment (editText) and releasing the key to initiate a dictation segment (insertText). The resulting transcribed segments appear in the ASR fields of the boxes in the right column. For a command segment, the demonstrator specifies the normalized version in the Gold ASR field, and demonstrates the command interpretation by editing the document post-state. Document states are shown in the left column: selecting a segment makes its post-state (and pre-state) appear there.

To demonstrate a command interpretation, the demonstrator edits the document post-state with mouse and keyboard until it reflects the desired post-state after applying command u i. For reference, the UI also displays the pre-state d i−1 and a continuously updated visual diff ∆(d i−1, d i).
Demonstrators can move freely among these steps, editing normalizations or state updates at any time, or appending new segments by speaking. 14 We believe our framework is well-equipped to collect natural, flexible, and intuitive dictation and commanding data, for several reasons: (1) We do not restrict the capabilities of commands or the forms of their utterances, but instead ask demonstrators to command in ways they find most natural.
(2) We simulate natural, uninterrupted switching between segments by making it easy for demonstrators to specify segment boundaries in real time. (3) We collect a realistic distribution over speech errors and corrections by using an existing ASR system and asking demonstrators to replicate real emails. In the future, the distribution could be made more realistic if we sometimes updated the document by using predicted normalizations and state updates rather than gold ones, as in the DAgger imitation learning method (Ross et al., 2011).

14 They are also allowed to back up and remove the final segments, typically in order to redo them.

Annotating Programs for Commands
After obtaining sequences of demonstrated dialogues using the above procedure, we extract each command segment and manually annotate it with a program p i that represents the intensional semantics of the command. This program should in theory output the correct d i when given d i−1 as input. Program annotation is done post-hoc with a different set of annotators from §4.1.
We design a domain-specific Lisp-like language for text-manipulating programs, and an execution engine for it. We implement a library consisting of composable actions, constraints, and combinators. A program consists of actions applied to one or more text targets, which are specified by constraints. Combinators allow us to create complex constraints by composing them. For example, in Figure 2, the command Capitalize the S in eSpeak has the program (capitalize (theText (and (like "S") (in (theText (like "eSpeak")))))), where capitalize is the action, (like "S") and (like "eSpeak") are constraints, and and and in combine them into the complex constraint that picks out the S inside eSpeak.
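To illustrate the flavor of such programs, here is a toy re-implementation of this one example in Python. The names (like, in_, and_, the_text, capitalize) and the simplified constraint semantics are ours, not the paper's actual library or execution engine.

import re
from typing import Callable, List, Tuple

Span = Tuple[int, int]                      # [start, end) character offsets
Constraint = Callable[[str], List[Span]]    # returns candidate spans of the document

def like(pattern: str) -> Constraint:
    """Spans of the document that match `pattern`, ignoring case."""
    return lambda doc: [(m.start(), m.end())
                        for m in re.finditer(re.escape(pattern), doc, re.IGNORECASE)]

def the_text(constraint: Constraint) -> Callable[[str], Span]:
    """Select the (assumed unique) span satisfying the constraint."""
    def select(doc: str) -> Span:
        spans = constraint(doc)
        assert len(spans) >= 1, "no span satisfies the constraint"
        return spans[0]
    return select

def in_(selector: Callable[[str], Span]) -> Constraint:
    """The span chosen by `selector`, used as a container by `and_` below."""
    return lambda doc: [selector(doc)]

def and_(inner: Constraint, container: Constraint) -> Constraint:
    """Spans from `inner` that lie inside some span from `container` (simplified 'and')."""
    def find(doc: str) -> List[Span]:
        outer = container(doc)
        return [(s, e) for (s, e) in inner(doc)
                if any(cs <= s and e <= ce for (cs, ce) in outer)]
    return find

def capitalize(selector: Callable[[str], Span]) -> Callable[[str], str]:
    """Action: uppercase the selected span of the document."""
    def apply(doc: str) -> str:
        s, e = selector(doc)
        return doc[:s] + doc[s:e].upper() + doc[e:]
    return apply

# (capitalize (theText (and (like "S") (in (theText (like "eSpeak"))))))
program = capitalize(the_text(and_(like("S"), in_(the_text(like("eSpeak"))))))
print(program("Attached are the espeak events."))   # -> Attached are the eSpeak events.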

Handling of partial ASR results
The current transcript sometimes ends in a partial ASR result and then is revised to end in another partial ASR result or a final ASR result. All versions of this transcript, "partial" and "final," will be passed to the segmenter, thanks to change propagation. During demonstration, we record the gold labeled segmentations for all versions, based on the timing of the demonstrator's keypresses. However, only the segments of the "final" version are shown to the demonstrator for further annotation. A segment of a "partial" version can simply copy its gold normalized text from the segment of the "final" version that starts at the same time. These gold data will allow us to train the normalization model to predict a normalized command based on partial ASR results, when the user has not yet finished speaking the command or the ASR engine has not yet finished recognizing it.
In the same way, a command segment u i of the "partial" version could also copy its gold document post-state d i and its gold program p i from the corresponding "final" segment. However, that would simply duplicate existing gold data for training the interpretation module, so we do not include gold versions of these predictions in our dataset. 15

Dataset details & statistics
In the first stage ( §4.1), eleven human demonstrators demonstrated 1372 interactive dictation trajectories (see Table 1 for details). In the second stage ( §4.2), two human annotators annotated programs for 868 commands. 16 The dataset was then split into training, validation, and test sets with 991

15 The gold pre-state d i−1 may occasionally be different, owing to differences between the two versions in earlier dictation segments. In this case, the interpretation example would no longer be duplicative (because it has a different input). Unfortunately, in this case it is no longer necessarily correct to copy the post-state d i, since some differences between the two versions in the pre-state might need to be preserved in the post-state.

16 The rest of the programs were auto-generated by GPT3. See details in Appendix C.2.
All demonstrators and annotators were native English speakers. The dataset is currently only English, and the editor supports unformatted plain text. However, the annotation framework could handle other languages that have spoken and written forms, and could be extended to allow formatted text.
A key goal of our system is flexibility. We quantify how well TERTiUS captures flexibility by measuring the diversity of natural language used to invoke each state change. 17 We count the number of distinct first tokens (mainly verbs) used to invoke each action. These results are reported in Table 4 in the Appendix, alongside a comparison with DNS. 18 We see that TERTiUS contains at least 22 ways to invoke a correction, while DNS supports only 1. In short, these results show that doing well on TERTiUS requires a much more flexible system that supports a wider array of functions and ways of invoking those functions than what existing systems provide.

17 The system we build can theoretically support more flexibility than what is captured in TERTiUS. However, for TERTiUS to be a useful testbed (and training set) for flexibility, we would like it to be itself diverse.

18 We also measure the diversity of state changes captured by TERTiUS in Appendix A.5.

Modeling & Training
The overall system we build for interactive dictation follows our pipeline from Figure 2 and §3: 1. A segmentation model M SEG takes the current transcript U, and predicts a segmentation u 1 , . . . , u n , simultaneously predicting whether each u i corresponds to a dictation or command segment.
2. Each dictation segment is directly spliced into the document at the current cursor position.
3. For each command segment: (a) A normalization model M NOR predicts the normalized utterance u ′ i, repairing any ASR misdetections. (b) An interpretation model, M INT(state) or M INT(program), either: 1. directly predicts the end state of the command d i, or 2. predicts the command program p i, which is then executed to d i by the execution engine. We experiment with both types of interpretation model.
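Before turning to the individual models, the following is a schematic sketch of this end-to-end loop in Python. The wrappers (segment, normalize, interpret, execute) are hypothetical stand-ins for M SEG, M NOR, M INT, and the execution engine; this illustrates the control flow, not the actual system.

def run_pipeline(transcript, doc_state, segment, normalize, interpret,
                 use_programs=False, execute=None):
    for text, label in segment(transcript):            # step 1: segment the transcript
        if label == "dictation":                        # step 2: splice in at the cursor
            content, (lo, hi) = doc_state
            cur = lo + len(text)
            doc_state = (content[:lo] + text + content[hi:], (cur, cur))
        else:                                           # step 3: interpret a command
            norm = normalize(text, doc_state)           # 3(a): repair ASR misdetections
            if use_programs:
                program = interpret(norm, doc_state)    # 3(b)-2: predict a program...
                doc_state = execute(program, doc_state) # ...and execute it
            else:
                doc_state = interpret(norm, doc_state)  # 3(b)-1: predict the end state
    return doc_state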
Below we describe the specific models we use.

Segmentation
The segmentation model partitions U into segments u i, each of which is labeled by m i as being either dictation or command. Concretely, the segmentation model does this using BIOES tagging (Jurafsky and Martin, 2009, Chapter 5). Here each command is tagged with a sequence of the form BI*E ("beginning, inside, . . . , inside, end") or with the length-1 sequence S ("singleton"). Maximal sequences of tokens tagged with O ("outside") then correspond to the dictation segments. Note that two dictation segments cannot be adjacent. We implement the segmentation model as a T5-base encoder (Raffel et al., 2022) followed by a two-layer MLP prediction module. More details on why each tag is necessary and how we trained this model can be found in Appendix C.1.
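As a minimal sketch of how per-token BIOES tags can be decoded back into labeled segments, consider the following Python helper. The tag names follow the scheme above; the decoding logic and function name are our own, not the paper's code.

def decode_bioes(tokens, tags):
    """Return a list of (segment_tokens, label) with label in {"dictation", "command"}."""
    segments, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag == "O":                       # outside: part of a dictation segment
            if label != "dictation" and current:
                segments.append((current, label)); current = []
            label = "dictation"; current.append(tok)
        elif tag in ("B", "S"):              # start of a (possibly length-1) command
            if current:
                segments.append((current, label))
            label = "command"; current = [tok]
            if tag == "S":
                segments.append((current, label)); current, label = [], None
        elif tag in ("I", "E"):              # continue / end the current command
            current.append(tok)
            if tag == "E":
                segments.append((current, label)); current, label = [], None
    if current:
        segments.append((current, label))
    return segments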

Normalization and Interpretation
For each u i that is predicted as a command segment, we first predict the normalized utterance u ′ i. 19 We then interpret u ′ i in context to predict either the document state d i or an update program p i.
We then update the document state accordingly. We experiment with two ways of implementing these two steps: we either fine-tune two separate T5-base models (Raffel et al., 2022) that run in a pipeline for each command, or we prompt GPT3 (Brown et al., 2020) to perform both steps in a single inference step (see Appendix C.2).

Results
We evaluate the segmentation model in isolation, and the normalization and interpretation steps together. (Appendices D.2 and D.3 evaluate the normalization and interpretation steps in isolation.) For simplicity, we evaluate the models only on current transcripts U that end in final ASR results (though at training time and in actual usage, they also process transcripts that end in partial ones). 22

Segmentation
Metrics Exact match (EM) returns 0 or 1 according to whether the entire labeled segmentation of the final transcript U is correct. We also evaluate macro-averaged labeled F1, which considers how many of the gold labeled segments appear in the model's output segmentation and vice versa. Two labeled segments are considered to be the same if they have the same start and end points in U and the same label (dictation or command).
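The following is a sketch of these two metrics as we read them, with a labeled segment represented as a (start, end, label) triple over the transcript U; this is our own illustration, not the paper's evaluation code.

def segmentation_exact_match(gold, pred):
    """1.0 iff the predicted labeled segmentation equals the gold one."""
    return float(sorted(gold) == sorted(pred))

def labeled_f1(gold, pred):
    """Per-example F1 between gold and predicted labeled segments (we assume the
    reported score macro-averages this over examples)."""
    gold_set, pred_set = set(gold), set(pred)
    if not gold_set and not pred_set:
        return 1.0
    tp = len(gold_set & pred_set)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_set)
    recall = tp / len(gold_set)
    return 2 * precision * recall / (precision + recall)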
Results Segmentation results on an evaluation dataset of transcripts U (see Appendix D.1) are shown in the top section of Table 2. All results are from single runs of the model. The model performs decently on TERTiUS, and in some cases is even able to fix erroneous sentence boundaries detected by the base ASR system (Appendix D.1.2). However, these cases are also difficult for the model: a qualitative analysis of errors finds that, generally, errors arise either when the model is misled by erroneous over- and under-segmentation by the base ASR system, or when commands are phrased in ways similar to dictation. Examples are in Appendix D.1.1.

Normalization & Interpretation
Metrics We evaluate normalization and interpretation in conjunction. Given a gold normalized command utterance u i and the document's gold pre-state d i−1, we measure how well we can reconstruct its post-state d i. We measure state exact match (EM) 23 between the predicted and gold post-states. If the interpretation model predicts intermediate programs, then we also measure program exact match (EM) between the predicted program and the gold program.

19 … given in Appendix D.2.

20 Specifically, the text-davinci-003 model.

21 Although the normalized utterance is not used for the final state prediction, early experiments indicated that this auxiliary task helped the model with state prediction, possibly due to a chain-of-thought effect.

22 See Appendix D for details.

23 We disregard the cursor position in this evaluation.

Results
The bottom of Table 2 shows these results. All results are from single runs of the model. GPT3 generally outperforms T5, likely due to its larger-scale pretraining. When we evaluated ASR repair and interpretation separately in Appendices D.2 and D.3, we found that GPT3 was better than T5 at both ASR repair and interpretation. Furthermore, we find that both GPT3 and T5 are more accurate when directly generating states than when generating programs (55.1 vs. 38.6 state EM for GPT3, and 29.5 vs. 28.3 state EM for T5). However, the gap is larger for GPT3. We suspect that GPT3 has a better prior over well-formed English text and can more easily generate edited documents d directly, without needing the abstraction of an intermediate program. T5-base, on the other hand, finds it easier to learn the distinctive (and more direct) relationship between u and the short program p.
Other than downstream data distribution shift, we hypothesize that program accuracy is lower than state accuracy because the interpretation model is trained mostly on auto-generated program annotations, and because the execution engine is imperfect. We anticipate that program accuracy would improve with more gold program annotations and a better execution engine. Table 2 reports runtimes for each component. This allows us to identify bottlenecks in the system and consider trade-offs between model performance and efficiency. We see that segmentation is generally quick and the ASR repair and interpretation steps are the main bottlenecks. The T5 model also runs much faster than the GPT3 model, 24 despite performing significantly worse, indicating a tradeoff between speed and accuracy. Figure 4 shows that by generating programs instead of states, we achieve faster runtimes (as the programs are shorter), at the expense of accuracy.

Conclusion
Most current speech input systems do not support voice editing. Those that do usually only support a narrow set of commands specified through a fixed vocabulary. We introduce a new task for flexible invocation of commands through natural language, which may be interleaved with dictation. Solving this task requires both segmenting and interpreting commands. We introduce a novel data collection framework that allows us to collect a pilot dataset, TERTiUS, for this task. We explore tradeoffs between model accuracy and efficiency. Future work can examine techniques to push out the Pareto frontier, such as model distillation to improve speed and training on larger datasets to improve accuracy. Future work can also look at domains outside of (work) emails, integrate other types of text transformation commands (e.g., formatting), and may allow the system to respond to the user in ways beyond updating the document.

Limitations
TERTiUS is a pilot dataset. In particular, its test set can support segment-level metrics, but is not large enough to support reliable dialogue-level evaluation metrics. Due to resource constraints, we also do not report inter-annotator agreement measurements. While we made an effort to make our interface low-friction, the demonstration setting still differs from the test-time scenario it is meant to emulate, and such a mismatch may also result in undesired data biases. Because our dialogues were collected before we had a trained interpretation model, trajectories always follow gold interpretations. Because of this, the main sources of errors are ASR misdetections or user speech errors. In particular, TERTiUS contains data on: 1. misdetections and speech errors in transcription, and how to fix them through commands, 2. misdetections and speech errors in edits, and what intent they correspond to. We leave to future work the task of addressing semantic errors and ambiguities which result from incorrect interpretation of user intent. Some of these limitations can be addressed by incorporating trained models into the demonstration interface, which will allow faster demonstration and capture trajectories that include actual system (non-gold) interpretations. Though the trained system runs, we have not done user studies with it because it is not production-ready. The T5-base models are efficient enough, but the prompted GPT3 model is too slow for a responsive interactive experience. Neither model is accurate enough at interpretation. We welcome more research on this task! When a human dictates to another human, interleaved corrections and commands are often marked prosodically (by pitch melody, intensity, and timing). Our current system examines only the textual ASR output; we have given no account of how to incorporate prosody, a problem that we leave to future work. We also have not considered how to make use of speech lattices or n-best lists, but they could be very useful if the user is correcting our mistranscription: both to figure out what text the user is referring to, and to fix it.

Impact Statement
This work makes progress toward increasing accessibility for those who cannot use typing inputs. The nature of the data makes it highly unlikely that artifacts produced by this work could be used (intentionally or unintentionally) to quickly generate factually incorrect, hateful, or otherwise malignant text.
The fact that all speakers in our dataset were native speakers of American English could contribute to exacerbating the already present disparity in usability for English vs. non-English speakers. Future work should look to expand the diversity of languages, dialects, and accents covered.

A.1 ASR results
Types of segments Below we describe the types of ASR results we collect in TERTiUS. As dialogues are uttered, we obtain a stream of timestamped partial and full ASR results from MSS. Examples of partial and full ASR results can be found below:

0:00.00: attached
0:00.30: attached is
0:00.60: attached is the
0:01.05: attached is the draft
0:02.15: Attached is the draft.
The first four lines are partial ASR results u partial that are computed quickly and returned by MSS in real time as the user is speaking. The last line is the final ASR result, which takes slightly longer to compute, but represents a more reliable and polished ASR result. After a final result u final has been computed, it obsolesces prior partial ASR results.
While not used in present experiments, collecting partial ASR results enables building an incremental system that can be faster and more responsive in real time; rather than waiting for ends of sentences to execute commands, a system can rely on partial ASRs to anticipate commands ahead of time. Collecting timing information is also helpful for evaluating the speed of our system: the system runtime is contingent on the rate at which it obtains new ASR results and how long it takes to process them.
Furthermore, MSS additionally returns n-best lists for each final ASR result. These are a list of candidate final ASRs that may feasibly correspond with the user audio, e.g.,

Attached is the draft.
Attached his draft.
Attacked is the draft.
· · ·

Aggregation segments For long user audio streams, partial and final results are returned sequentially, each describing roughly a single sentence. The most recent ASR result is concatenated together with the previous history of final ASR results, to return the full partial or final ASR result for the entire stream. For example, after the user utters the first sentence in the example above, the user may continue by saying:

please
please re
please review
please review win
please review when pause
please review when possible
Please review when possible.
We concatenate each of these new ASR results with the previous final ASR results to obtain the current transcript U (see §3), which evolves over time as follows:

Attached is the draft. please
Attached is the draft. please re
Attached is the draft. please review
Attached is the draft. please review win
Attached is the draft. please review when pause
Attached is the draft. please review when possible
Attached is the draft. Please review when possible.
Segmenting ASR results into Segments During Annotation During annotation ( §4.1), all these partial and final ASR results get mapped to segments, forming u final i and u partial i. This is done by identifying the timestamp of each token within each partial and final result. For example, in the example ASR result sequence at the beginning of this section A.1, suppose the user specifies a segment boundary at time 0:00.45 (separating "Attached is" from "the draft."). We get the following ASR results for the first segment:

attached
attached is
Attached is

(we refer to the first two as partial ASRs for the segment, as they are derived from partial ASR results, and the third as the final ASR for the segment), and the following ASR results for the second segment:

the
the draft
the draft.
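The following is a simplified sketch of this mapping, assuming each recognized token carries a start timestamp (the field layout here is illustrative, not the MSS API): tokens spoken before the demonstrator's boundary keypress go to the first segment, and the rest to the second.

def split_at_boundary(timed_tokens, boundary_time):
    """timed_tokens: list of (word, start_time) pairs from one ASR result."""
    first = [w for w, t in timed_tokens if t < boundary_time]
    second = [w for w, t in timed_tokens if t >= boundary_time]
    return " ".join(first), " ".join(second)

# e.g., with the final ASR result above and a boundary at 0:00.45:
# split_at_boundary([("Attached", 0.00), ("is", 0.30), ("the", 0.60), ("draft.", 1.05)], 0.45)
# -> ("Attached is", "the draft.")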

A.2 Annotation Instructions ( §4.1)
The full text of written instructions given to annotators during the first round of annotation ( §4.1) is provided below:

Transcribing
Your goal is to replicate the prompt in the target box verbatim / expand the prompt in the yellow textbox into a coherent email, starting from the given (potentially non-empty) starting document in the 'Transcription output' box. You are expected to do so using a series of speech-to-text transcriptions and commands. Try to use the starting document as much as possible (i.e. do not delete the entire document and start over).
You can easily see what changes are to be made by toggling the 'See Diff View' button. Once that mode is on, the text you need to add will be highlighted in green, while the text you need to delete will be highlighted in red. Once there is no colored text, your text box matches the target text box and you are done.
Begin this process by hitting the 'Begin transcription' button. This will cause a new 'insertText' command to appear in the command log on the right.
You are now in transcription mode. Whatever you say will appear in the 'Transcription output' box.

Editing
You can fix mistakes in transcription, add formatting, etc. through adding 'editText' commands.
Hold down 'ctrl' on your keyboard to issue a new 'editText' command.
While holding down 'ctrl' you will be in edit mode. In this mode, you can manually use mouse-and-keyboard to change the output. However, you must describe the edit you are making before you make it.
Begin by describing your edit using your voice. Whatever you say now will appear in the editText ASR box, but not in the 'Transcription output'. Because the ASR system is imperfect, the textual description may be faulty. Fix any mistakes in the detected speech in the 'Gold ASR' box.
Finally, manually edit the 'Transcription output' box to correspond to the effect of your edit command.
Note: It is important that you vocalize your change before making any edits to either 'Gold ASR' or 'Transcription output', as the ASR system stops recording as soon as you click into either one of these boxes.

Undoing, Resetting, Submitting, & Saving
You can click on previous commands in the command log to revisit them. Note that if you edit the output associated with an 'editText' command earlier in the history, you will erase the changes associated with subsequent 'editText' operations.
If you would like to undo some portion of command log, you can use the 'Delete Selected Command & Afterwards' button. Simply click on the first command you would like to remove, then click the button to remove that command and all commands after it.
You can clear the entire command log by hitting "Reset".
If you would like to work on transcribing another target, use the green arrow keys below the target. This will present you with a new target while saving progress on your current target. To delete a target prompt, press the red 'X'.
Once you are done editing, click the "Submit" button.
Please double-check each command before submission! In particular, commands will appear red if they are potentially problematic (e.g. they are not associated with any change to the underlying text). Please check to make sure there are no red commands that you do not intend to be there!

A.3 Target Text Preprocessing
For replicating Enron emails, we process emails from the Enron Email Dataset to create our target final states. We break the email threads into individual emails, filtering out email headers and non-well-formed emails (emails that are either less than 50 characters or more than 5000 characters long, or contain too many difficult-to-specify non-English symbols). Annotators also had the option to skip annotating certain emails, if they found the email too difficult to annotate.

A.5 Dataset Analysis
To assess the diversity of state changes, we quantify the number of distinct actions, constraints, and constraint combinators (see §4.2) that appear in the annotated programs. In Table 3, we list out all actions, constraints, and constraint combinators present in TERTiUS. TERTiUS contains at least 15 types of actions (and allows for action composition with the sequential chaining operation do), with 34 types of constraints and constraint combinators.
In Table 4, we approximate the invocation diversity represented in TERTiUS, by measuring the number of distinct first tokens used to invoke each type of actions. For actions that overlap in function  with ones supported by DNS, we also report a similar diversity metric against the full set of trigger words supported by DNS. 25

B Running Online
When running the system online in real time, we must consider efficiency and usability. We introduce a "commit point" that signifies that the system cannot re-segment, re-normalize, or reinterpret anything before that point. We only want to consider recent ASR results because the system quickly becomes inefficient as the dialogue length grows (the interpretation step, which is the bottleneck of the system, must run for every single command). Furthermore, users often refer to and correct only recent dictations and commands; reverting early changes can have potentially large and undesirable downstream effects, leaving users highly confused and frustrated. Concretely, the commit point is implemented as the system treating the document state at that point as the new "initial state," so that it is unable to access segments and the history of document states from before that point. We implement this point so that it must coincide with the end of a final ASR result. We feed this state into the system as the initial state, along with the entire sequence of ASR results starting from that point. All dictation and command segments returned by the model are executed in sequence from the commit point.
We decide to set a commit point based on system confidence and time since the last commit. System confidence is derived from the confidences of each component model at each step of the prediction. We measure the system confidence of the end state predicted by the system by summing the log-probabilities of: 1. the segmentation model result (summing the log-probabilities of each BIOES tag predicted for each token), 2. the ASR repair model result for each command (the log-probability of the resulting sentence), 3. the interpretation model result for each command (the log-probability of the end state or program). Once the system confidence exceeds a threshold τ commit, we decide to commit immediately at that point. Otherwise, if we have obtained more than 4 final ASR results since the last commit, we must commit at our most confident point from within the last 4 turns.
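The following is a sketch of this commit rule; the per-candidate bookkeeping and the function name are illustrative assumptions, and only the overall rule (commit immediately when confident, otherwise force a commit at the most confident of the last 4 final ASR results) follows the text above.

def choose_commit_point(confidences, tau_commit):
    """confidences: system confidence (summed log-probabilities of the segmentation,
    ASR repair, and interpretation results) at each final ASR result since the last
    commit. Returns the index to commit at, or None to keep waiting."""
    latest = len(confidences) - 1
    if confidences[latest] > tau_commit:
        return latest                                  # confident enough: commit now
    if len(confidences) > 4:
        last4 = confidences[-4:]                       # forced commit: best of the last 4
        return len(confidences) - 4 + last4.index(max(last4))
    return None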

C Model Training Details
In this section, we describe how we trained each component of the system. See §5 for a description of the inputs, outputs, and architecture of each model. Our final system is incremental, able to process both partial and final ASR results.

C.1 Segmentation Model
We use BIOES for the segmentation model. Note that we cannot just predict a binary command/dictation tag for each token, because the model would then be unable to distinguish two consecutive commands from one continuous command. Thus, we need to use B to specify the beginning of a new command segment. E is also necessary for the model to predict whether the final segment, in particular, is incomplete and ongoing (requiring the ASR repair model to predict the future completion) or complete (requiring the ASR repair model only to correct errors).
We expect that in the final online version of the end-to-end system, the segmentation model will: 1. run often, being able to accept and segment both partial and final ASR results, 2. run on only the most recent ASR, to avoid completely resegmenting an entire document that has been transcribed. Thus, we construct the training data for this model in a way that simulates these conditions. We extract all sequences of turns of length between 1 and 4 from TERTiUS (capping to at most 4 for condition 2), take their segments u, and concatenate them to simulate U, asking the model to segment them back into their individual u. For the final turn of each chosen sequence, we include in the training data both the final ASR result and all partial ASR results. We fine-tune on this data with a learning rate of 1e-4 and batch size of 4 until convergence.

C.2 ASR Repair & Interpretation Models
Below we describe the concrete implementations and training details of each model: T5 In the T5 implementation, both M NOR and M INT are T5-base encoder-decoder models.
As described in §4.4, we do not have annotations of programs for the full training split. Thus, we automatically generate the missing programs using GPT3.
We have an initial training reservoir that consists solely of data points with program annotations, D annot. For each example in the remaining training set, we retrieve a subset of samples from D annot to form the prompt. We also use GPT3 for this retrieval step. 26
We then annotate programs in the remaining training set in an iterative manner: as new programs are annotated, we use the execution engine to check whether each executes to the correct end state, and if so, we add it to D annot, so that future examples can include these programs in their prompts.
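A sketch of this iterative auto-annotation loop is given below; the helper names (retrieve_examples, propose_program, execute) are placeholders for the GPT3 retrieval and generation calls and the execution engine, not actual APIs.

def auto_annotate(unannotated, annotated, retrieve_examples, propose_program, execute):
    for ex in unannotated:
        prompt_examples = retrieve_examples(ex, annotated)        # build the few-shot prompt
        program = propose_program(ex, prompt_examples)            # LLM proposes a program
        if execute(program, ex.pre_state) == ex.post_state:       # keep only if it checks out
            ex.program = program
            annotated.append(ex)         # later examples can retrieve this new annotation
    return annotated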

GPT3
In the GPT3 implementation, both the ASR repair and interpretation steps occur in a single inference step, with GPT3 being prompted to predict both outputs in sequence. Specifically, it is prompted with: The model is shown demonstrations in this format from the training data, then asked to infer, for each test sample, the highlighted portions from the non-highlighted portions.

D.1 Segmentation
We run all the error analyses in this section on a model trained and tested exclusively on the Replicate doc task (where annotators were asked to replicate emails from the Enron Email Dataset).
We do not evaluate the segmentation model on all of the transcripts that arise during a trajectory, many of which are prefixes of one another. Doing so would pay too little attention to the later segments of the trajectory. (F1 measure on the final transcript will weight all of the segments equally, but F1 measure on the earlier transcripts does not consider the later segments at all.) Instead, we create an evaluation set of shorter transcripts. For each trajectory, we form its final full transcript by concatenating all of its final ASR results. Each sequence of up to 4 consecutive gold segments of this full transcript is concatenated to form a short transcript that the segmentation model should split back into its gold segments. For example, if the full transcript consists of 8 gold segments, then it will have 8 + 7 + 6 + 5 evaluation examples of 1 to 4 segments each.
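The following is a sketch of this construction; `segments` is the list of gold segment texts of one full transcript in order (labels omitted for brevity), and the names are our own.

def make_eval_examples(segments, max_len=4):
    """Every window of up to `max_len` consecutive gold segments becomes one example:
    the concatenated window is the input transcript, the window itself is the gold
    segmentation."""
    examples = []
    for start in range(len(segments)):
        for end in range(start + 1, min(start + max_len, len(segments)) + 1):
            window = segments[start:end]
            examples.append((" ".join(window), window))
    return examples

# With 8 gold segments this yields 8 + 7 + 6 + 5 = 26 examples, as in the text.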