Speakerly: A Voice-based Writing Assistant for Text Composition

We present Speakerly, a new real-time voice-based writing assistance system that helps users with text composition across various use cases such as emails, instant messages, and notes. The user can interact with the system through instructions or dictation, and the system generates a well-formatted and coherent document. We describe the system architecture and detail how we address the various challenges while building and deploying such a system at scale. More specifically, our system uses a combination of small, task-specific models as well as pre-trained language models for fast and effective text composition while supporting a variety of input modes for better usability.


Introduction
Writing is a multi-step process that involves planning (ideation), translation (composition), and reviewing (revision) (Flower and Hayes, 1981). In the ideation phase, the writer gathers information and organizes their thoughts. The composition step involves articulating the ideas effectively through the use of the right words and arranging them cogently in a draft. During revision, the focus is on grammatical correctness, logical flow of ideas, coherent document structure, and style.
Most current writing assistants have been limited in their ability to provide seamless writing assistance across all the stages, take into account the user context, and work robustly on diverse real-world use cases at scale (Gero et al., 2022a).
In this work, we introduce Speakerly™, a voice-based end-to-end writing assistance system that works across the different stages of writing, helping users become more efficient with their communication. The user uses the voice interface to articulate their thoughts in natural speech. Our system then creates a polished and ready-to-send first draft while addressing all the intermediate issues, such as structure, formatting, appropriate word usage, and document coherence.

Figure 1: An illustrative example of Speakerly™ for email composition on mobile. A user presses the microphone button at the bottom of their email application and starts speaking naturally (no templatization or structural tailoring of the speech input is needed). Once they stop speaking, Speakerly™ converts the speech into structured, well-formatted, and polished compositions.
We use voice as it is a natural and efficient input modality, allowing users to compose their thoughts quickly and even use the system in eyes-free scenarios while performing tasks such as walking and driving (Kamm, 1995; Cohen and Oviatt, 1995; Ruan et al., 2018). Moreover, with the increased ubiquity of voice-based assistants, such as Alexa and Siri, voice-based interactions have become more common and intuitive for users (Porcheron et al., 2018).
However, using voice has some challenges. First, during the ideation stage, the user typically only has a rough idea of what they want to write. Thus, if the system is unable to handle a lack of structure and slight incoherence in the input, users will end up spending a significant amount of time on fixing the output. Second, different writers can have varied needs, requiring the system to handle the demands and constraints of different use cases: for example, short vs. long inputs, instructional vs. dictation inputs, open- vs. closed-ended inputs, and specific structures and formatting for emails, instant messages, and notes (Table 1). Finally, the system should be reasonably fast so that it can provide a delightful user experience.
Speakerly™ is composed of multiple stages (Section 3) that progressively refine the relatively noisy and unstructured speech from the user and address the aforementioned challenges and requirements. In the remainder of this paper, we describe the technical system architecture and our approaches to addressing challenges related to modeling, evaluation, inference, and sensitivity.

Related Work
Most research in the past has been limited to either a single use case for composition or one particular stage of the writing process. For example, previous works have focused on email writing (Hui et al., 2018), science writing (Gero et al., 2022b), story writing (Clark et al., 2018; Coenen et al., 2021), slogan and metaphor writing (Gero and Chilton, 2019), poetry writing (Chakrabarty et al., 2022), and support comments (Peng et al., 2020), to name a few. Our system, in contrast, can handle various use cases ranging from short instant messages to long notes to open-ended instructions to closed-ended and information-dense dictations.
On the other hand, some writing-assistance-focused works disproportionately emphasize specific stages of writing, such as editing and revision (Mallinson et al., 2022; Du et al., 2022; Kim et al., 2022; Schick et al., 2023; Raheja et al., 2023), rather than end-to-end writing assistance. Again, in contrast, our system is much more extensive, as it takes in noisy and unstructured speech input and iteratively refines it to produce a final well-formatted output rather than focusing on a single-shot, structured text-to-text transformation.
Voice-based input has long been known to make interaction more efficient (Williams, 1998) and is well integrated into virtual assistants such as Siri and Alexa. It has been used for various tasks such as voice notes (Stifelman et al., 1993), data capture (Luo et al., 2021), information querying (Schalkwyk et al., 2010), and data exploration (Srinivasan et al., 2020). Such systems can suffer from speech recognition errors that are difficult to recover from and can restrict the user's natural speaking behavior (Luo et al., 2020). To tackle these problems, recent works have looked at voice-based text editing (Ghosh, 2020; Fan et al., 2021).

System Description
Our system takes natural speech from the user as input and generates a coherent and well-formatted text output. As shown in Fig. 2, the input progressively gets refined and enhanced as it traverses the pipeline, which consists of multiple task-specific models. Each stage can have its own errors. Hence, models across the pipeline are designed with complementary, sometimes overlapping capabilities, which allows them to recover from errors collectively and improves robustness to variation and noise in the input.
The pipeline has three main components: Automatic Speech Recognition (ASR), Normalization, and Comprehension. The ASR module takes raw speech and converts it to text. Then, the normalization module cleans up speech disfluencies, adds punctuation, and applies grammatical error corrections (GEC). Finally, the comprehension module cleans the text of remaining issues, such as incoherent document structure, word choice, formatting, formality, and style, and composes the final output text, handling instruction, dictation, or any other mode of input for a variety of use cases. We now explain these three components in more detail:
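As a rough illustration of this three-stage flow, the following Python sketch chains toy stand-ins for the normalization and comprehension stages; the function bodies are simplified placeholders for exposition, not Speakerly's actual models.

```python
FILLERS = {"uh", "uhm", "um", "er"}

def normalize(transcript: str) -> str:
    """Toy stand-in for the normalization stage: drop filler words."""
    return " ".join(t for t in transcript.split() if t.lower() not in FILLERS)

def comprehend(text: str) -> str:
    """Toy stand-in for the comprehension stage: capitalize the first
    word and terminate the text with a period."""
    text = text.strip()
    if not text:
        return text
    text = text[0].upper() + text[1:]
    return text if text[-1] in ".!?" else text + "."

def pipeline(asr_output: str) -> str:
    # The real pipeline starts with ASR on raw audio; here we begin
    # from the ASR module's text output and refine it stage by stage.
    return comprehend(normalize(asr_output))
```

For example, `pipeline("uh write an email to the team")` yields `"Write an email to the team."`, mirroring how each stage hands a cleaner input to the next.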

Automatic Speech Recognition (ASR)
The entry point to the system is an ASR component. This stage is responsible for the transcription of the user's spoken input and also handles basic speech recognition errors, such as filler words and background noise. We leverage out-of-the-box ASR solutions and experiment with Speech-to-Text services from Microsoft Azure, Google Cloud, and OpenAI Whisper. In general, Google and Microsoft Azure were on par in terms of supported features, such as support for streaming (real-time recognition), recognition of different dialects, spoken punctuation recognition, vocabulary customization, and price. We also considered OpenAI Whisper since it is open-source and about 70% cheaper. We eventually chose Microsoft Azure Speech-to-Text due to quality considerations (Section 4.1).

Figure 2: Overview of the system architecture. The ASR system first transcribes the input. Then, the Normalization stage fixes the issues in the transcribed input (shown in red and blue). Finally, the comprehension stage generates a well-formatted and coherent output text with further enhancements.

ASR transcription: "write an email to the team and say that were canceling uhm today's meeting because most people can't make it but uh next week we'll have Sarah talk about UXR uh vision uh OK ours and that that we no um that people should make sure to attend"

Normalized input: "Write an email to the team and say that we're canceling today's meeting because most people can't make it. But next week we'll have Sarah talk about UXR, vision, OKRs and that people should make sure to attend."

Final output:
"Hi team,
We are canceling today's meeting because most people can't make it. However, next week, we will have Sarah talk about UXR vision and OKRs. People should make sure to attend.
See you then! Thanks!"

Normalization
The transcribed audio input may still contain noise stemming from ASR errors, speech disfluencies, the uniqueness of individual elocution, ambiguous word boundaries, background noise, and lack of context, among others. Therefore, we introduce another stage in the pipeline to further enrich the speech transcription and obtain a cleaner input for the downstream comprehension model(s). This stage comprises three sub-stages dedicated to addressing specific issues in the transcription: Speech Disfluency Filtering, Punctuation Restoration, and Grammatical Error Correction. We now describe these in more detail.

Speech Disfluency Filtering
One of the numerous issues encountered in speech-based systems pertains to the inherent fluidity of spoken language, characterized by the occurrence of errors and spontaneous self-correction. Speakers, upon recognizing their speech errors, instinctively engage in the process of rectification by means of editing, reformulating, or starting afresh. This instinctual and subconscious phenomenon is a common and integral part of spontaneous human utterance, referred to as disfluency (Shriberg, 1994), and poses significant challenges to the real-world deployment of speech-based systems.
Specifically, this part of the system focuses on detecting and removing disfluent tokens in the transcribed text, not replacing them with correct hypotheses. We formulate this as a token-level sequence tagging problem and experiment with three models. To categorize the disfluencies, we use the framework defined in Shriberg (1994), which has three categories: repetitions (one or more words are repeated), replacements (a disfluent word or phrase is replaced with a fluent one), and restarts (the initial utterance is completely abandoned and restarted).
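To make the tagging formulation concrete, here is a minimal Python sketch that handles just one of the three categories, immediate word repetitions, with a hand-written heuristic. The deployed system uses trained sequence taggers, so this is purely illustrative.

```python
def tag_repetitions(tokens):
    """Label each token KEEP or DISFLUENT, marking immediate word
    repetitions (one of Shriberg's three categories) as disfluent.
    The first copy of a repeated word is treated as the disfluent one."""
    tags = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        disfluent = nxt is not None and tok.lower() == nxt.lower()
        tags.append("DISFLUENT" if disfluent else "KEEP")
    return tags

def filter_disfluent(tokens):
    # Disfluent tokens are removed outright, not replaced.
    return [t for t, tag in zip(tokens, tag_repetitions(tokens)) if tag == "KEEP"]
```

For instance, `filter_disfluent("and that that we no".split())` drops the repeated "that", echoing the transcript example in Figure 2.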
Following are the details of the Disfluency Filtering models:
1. Baseline: An off-the-shelf model for joint disfluency detection and constituency parsing (Jamshid Lou and Johnson, 2020).

Punctuation Restoration
Once the disfluencies are removed, the input is still a stream of text without any punctuation or sentence segmentation. Therefore, the next step in the system is to restore punctuation (including capitalization). We experiment with three models that are trained to perform multi-class token classification. Specifically, there are five categories describing the respective token-level edit actions they apply:
• COMMA: Append one of [, ; : -]
• PERIOD: Append .
• QUESTIONMARK: Append one of [? !]
• CAPITALIZATION: Capitalize the word
• NONE: No change
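A minimal sketch of how such token-level edit actions could be applied to reconstruct punctuated text; for simplicity, each punctuation label appends a single default mark rather than choosing among the listed alternatives, and sentence-initial words are capitalized automatically.

```python
def apply_punct_labels(tokens, labels):
    """Apply token-level edit actions (NONE, COMMA, PERIOD,
    QUESTIONMARK, CAPITALIZATION) to an unpunctuated token stream."""
    out = []
    capitalize_next = True  # the first token starts a sentence
    for tok, lab in zip(tokens, labels):
        if capitalize_next or lab == "CAPITALIZATION":
            tok = tok[0].upper() + tok[1:]
            capitalize_next = False
        if lab == "COMMA":
            tok += ","          # default mark from [, ; : -]
        elif lab == "PERIOD":
            tok += "."
            capitalize_next = True
        elif lab == "QUESTIONMARK":
            tok += "?"          # default mark from [? !]
            capitalize_next = True
        out.append(tok)
    return " ".join(out)
```

For example, labels `["NONE", "NONE", "PERIOD", "PERIOD"]` over `["write", "an", "email", "thanks"]` yield "Write an email. Thanks."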

Grammatical Error Correction (GEC)
We use the GECToR system (Omelianchuk et al., 2020) for grammatical error correction. Similar to our models for Disfluency Filtering and Punctuation Restoration, it is a sequence tagging model using a Transformer-based encoder.

Comprehension
The output from the normalization step is then fed into the comprehension stage, which transforms the normalized input into a well-structured and coherent output, handling a wide variety of inputs.
Table 1 shows the different types of inputs that the comprehension stage can handle. For example, the input can be an instruction or a dictation; an email, an instant message, or a note; open-ended or closed-ended. Moreover, the spoken text can often be incomplete and noisy. Thus, the comprehension model enhances the quality of such text while minimizing meaning change and hallucination.
We experiment with two approaches for the comprehension stage. The first is fine-tuning a lightweight pre-trained model (called COMP-FT), and the second is using a pre-trained LLM out-of-the-box (called COMP-LLM).

COMP-FT
We use Pegasus (Zhang et al., 2020) (770M parameters), a Transformer-based encoder-decoder. We limit ourselves to a small model since larger models have higher latency, and we find that a model of this size can handle a significant portion of inputs. Since smaller models do not work well on open-ended generation, we limit it to closed-ended inputs. Model training details are presented in Appendix A.
To fine-tune COMP-FT, we create a dataset containing 28k/1k/1k input-output pairs for the training/validation/test sets, respectively. First, we ask human annotators to create 10k instruction-output pairs covering the various instruction-based use cases described earlier. Then, we create dictation-based data by removing the formatting and paraphrasing the outputs from this dataset, and use the resulting text as inputs instead.
Finally, we augment the dataset by applying 25 different augmentations to deal with the issues that were either not handled or were introduced by the earlier stages of the pipeline. We build upon NL-Augmenter (Dhole et al., 2021), an open-source library that contains 117 transformations and 23 filters for a variety of natural language tasks. A selection of the augmentations can be found in Appendix C.

COMP-LLM
We use the gpt-3.5-turbo model from the Azure OpenAI platform. Since this model is a chat-based model, the main challenge is to find the right prompt for all our use cases. Further, the text generated by it is prone to verbosity and often contains hallucinations, leading to meaning change. The benefit, however, is its ability to handle open-ended inputs such as "Write a list of items to bring camping". Finally, it has higher latency and is more expensive to deploy.

Hybrid Approach
Since both COMP-FT and COMP-LLM are effective at different use cases, we combine both models into a hybrid approach. Outputs requiring more open-ended generation and having low scope for sensitivity issues are passed to COMP-LLM, whereas shorter inputs and those which require more factual consistency are processed by COMP-FT. The last column in Table 1 shows which model processes the different inputs. We train a binary classifier, a fine-tuned DistilBERT (Sanh et al., 2019) model, to decide whether the system should use COMP-FT or COMP-LLM. This model was trained using a manually created dataset containing 1000 examples. The classifier is applied to the output text of the normalization stage.
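The routing logic can be sketched as follows; the keyword heuristic here merely stands in for the fine-tuned DistilBERT classifier and is purely illustrative.

```python
def route(normalized_text: str, is_open_ended) -> str:
    """Hybrid routing sketch: a binary classifier on the normalization
    output decides which comprehension model serves the request."""
    return "COMP-LLM" if is_open_ended(normalized_text) else "COMP-FT"

def toy_open_ended(text: str) -> bool:
    # Illustrative stand-in for the trained classifier: flag a few
    # phrases that suggest open-ended generation.
    keywords = ("write a list", "brainstorm", "some ideas")
    return any(kw in text.lower() for kw in keywords)
```

With this sketch, "Write a list of items to bring camping" routes to COMP-LLM, while a short factual message routes to COMP-FT, matching the division of labor described above.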

Evaluation

ASR
In order to evaluate the quality of the various ASR systems, we collected a dataset of 1000 voice inputs by releasing the system to a small set of internal users, who were asked to use the system for their composition needs.Expert annotators then transcribed these voice recordings, and the ASR systems were evaluated using the standard ASR metrics of Word Error Rate (WER) and Word Recognition Rate (WRR).
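For reference, WER is the word-level edit distance between the reference and hypothesis transcripts, normalized by the reference length; a minimal implementation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with standard Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, comparing "we are canceling the meeting" against "we are cancelling meeting" counts one substitution and one deletion, giving a WER of 0.4.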

Speech Disfluency Filtering
Since the Disfluency Filtering models are sequence tagging models, we use Precision/Recall/F1 as the evaluation metrics on two evaluation datasets. The first is the CCPE-M dataset (Radlinski et al., 2019), a corpus consisting of dialogues between two paid crowd-workers using a Wizard-of-Oz-based Coached Conversational Preference Elicitation (CCPE) methodology. We also collect an internal dataset sourced from the transcripts of company-wide, internal Zoom meetings, which were then annotated for the disfluency filtering task by expert annotators. Table 3 summarizes the results of the three models on the two evaluation sets. We observed that DISF-SB-QA-LD was the best-performing model, owing largely to the task-specific data augmentation.

Punctuation Restoration
Since the Punctuation Restoration models are also sequence tagging models, we evaluate them using Precision/Recall/F1 on the same test set as the COMP-FT model (Section 3.3.1). Table 4 details the results of the three models on the test set for all the punctuation label groups. We also report metrics for sentence boundary detection, which is a combination of the PERIOD and QUESTIONMARK labels. We observe that PUNCT-COMP-GEC was the best-performing model in most categories.

Comprehension
For COMP-FT, we evaluate various models between 240M and 1.3B parameters on our test set (Section 3.3.1) using BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), METEOR (Banerjee and Lavie, 2005), and BLEURT (Sellam et al., 2020) as evaluation metrics. The Pegasus (Zhang et al., 2020) model outperforms the other models (Table 6). However, we find that automated metrics were neither suitable nor reliable for evaluation, as they largely focus on n-gram overlap with references. Thus, we use human evaluations to measure the quality of our comprehension models.
We conduct extensive human annotation studies to gather insight into the quality of the output generated by the comprehension models. First, we compare COMP-FT and COMP-LLM on various closed-ended composition scenarios using 1200 examples. We restrict this dataset to closed-ended use cases since COMP-FT does not work well for open-ended use cases. For each example, we ask seven annotators to provide a binary judgment on fluency, coherence, naturalness, and coverage (descriptions provided to annotators are in Appendix B) and decide the final judgment by majority voting. We also measure inter-annotator agreement using simple percent agreement, as well as Cohen's κ (McHugh, 2012).
Table 5 shows the human evaluation results for the two models on the four metrics and the corresponding inter-annotator agreement scores. We find that outputs generated by COMP-LLM are more fluent than those from COMP-FT. This result is expected since LLMs are known to generate highly fluent text. Further, outputs generated by COMP-FT are marginally better than those from COMP-LLM on coherence and naturalness. Finally, we find that outputs generated by COMP-LLM have much more meaning change than those from COMP-FT, highlighting the known problem of hallucination in LLMs. Overall, we find that for closed-ended inputs, the text generated by COMP-FT is of higher quality than that generated by COMP-LLM.
Table 5 also shows that Cohen's κ scores were higher for both models on Fluency and Coverage, indicating that annotators were more aligned on these criteria than on Coherence and Naturalness. This confirms our understanding that grammar and the presence or absence of information are more objective, whereas Coherence and Naturalness are more subjective and may vary based on context (for example, a short message may be unnatural but perfectly acceptable as a quick reply). Even though these categories had lower κ scores, they are still in a range that is considered fair agreement.
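For two annotators with binary judgments, percent agreement and Cohen's κ can be computed as follows; since the study uses seven annotators, κ would in practice be computed pairwise, and this sketch shows a single pair.

```python
def percent_agreement(a, b):
    """Fraction of items on which two raters give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa for two raters with binary (0/1) labels:
    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is chance agreement from each rater's marginal rates."""
    n = len(a)
    p_o = percent_agreement(a, b)
    pa1, pb1 = sum(a) / n, sum(b) / n
    p_e = pa1 * pb1 + (1 - pa1) * (1 - pb1)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)
```

Note that κ discounts chance agreement: two raters who agree half the time on balanced labels get κ = 0, not 0.5, which is why κ is the more informative of the two statistics here.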

Sensitivity Evaluations
Current text generation systems have been shown to contain bias and behave differently on sensitive text (Bender et al., 2021; Welbl et al., 2021; Hovy and Prabhumoye, 2021). Therefore, we conduct an iterative sensitivity review of our end-to-end pipeline. We prepare a dataset of 800 sensitive examples to test the generation quality on offensive and non-inclusive language, bias, meaning change, and sensitive domains (such as medical advice and self-harm). After manually reviewing the generated outputs for the sensitive inputs, we made the following changes to mitigate the identified risks:
1. Apply dictionary-based filtering for offensive words and a sensitivity classifier after both the normalization and comprehension stages.
2. Retrain COMP-FT on an improved dataset containing examples to handle sensitive text better, improved co-reference resolution, and diversity-based augmentations. For COMP-LLM, we evaluate prompts on their ability to handle sensitive text.
3. Adjust the classifier of the hybrid model to send more sensitive data to COMP-FT instead of COMP-LLM.
Overall, we find that COMP-FT is much better at handling sensitive text compared to COMP-LLM.

Inference
We deploy our service on Amazon ECS using g5.2xlarge instances. To increase throughput while reducing overall latency, we enable our service to scale horizontally as well as run multiple inference workers per instance. We conduct load testing to evaluate the infrastructure costs required for deploying the system. We find that we can successfully serve a constant traffic of 1 request per second using the COMP-FT model on a single g5.2xlarge instance while maintaining a p90 latency of 3 seconds. To achieve the same latency and throughput requirements for COMP-LLM, we need to scale the number of instances to 30. With a hybrid system that routes each request to either COMP-FT or COMP-LLM, we can reduce the number of instances to 10.
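A back-of-envelope sizing calculation consistent with these figures; the fraction of traffic routed to COMP-LLM is an assumption for illustration, not a number reported here.

```python
import math

def instances_needed(traffic_rps, llm_share, ft_capacity_rps=1.0, llm_slowdown=30):
    """Hypothetical sizing sketch: one g5.2xlarge serves ~1 rps with
    COMP-FT at p90 <= 3 s, and COMP-LLM needs ~30x the instances for
    the same traffic. llm_share is the (assumed) fraction of requests
    the hybrid router sends to COMP-LLM."""
    ft = math.ceil(traffic_rps * (1 - llm_share) / ft_capacity_rps)
    llm = math.ceil(traffic_rps * llm_share * llm_slowdown / ft_capacity_rps)
    return ft + llm
```

For instance, if roughly 30% of 1 rps traffic were open-ended, this sketch gives 1 COMP-FT instance plus 9 COMP-LLM-equivalent instances, in line with the reported hybrid figure of 10.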

Conclusion
In this paper, we presented Speakerly™, a real-time voice-based writing assistant for text composition. It provides a low barrier to entry into the writing process: a user can interact naturally, using dictation, instructions, or unstructured thoughts. In turn, it generates a high-quality first draft with low latency, providing a simple and efficient way to articulate thoughts into ready-to-send emails, messages, or notes. We presented comprehensive technical details of the different stages of the pipeline and the experiments that guided our decisions while deploying the system to our users.

Limitations
While we design Speakerly™ to handle the various challenges that can occur in real-world spoken input, there are instances where the system can generate output that does not reflect what the user wanted to say or can generate sensitive text. In such cases, the user can ask the system to regenerate the output, speak again, or manually edit the generated output. Since manually editing the output can be tedious, we plan to integrate a text editing step into the pipeline. Furthermore, our system currently cannot generate very long outputs (greater than 512 tokens). Currently, for most open-ended inputs, we rely on an external LLM, which can be costly and have high latency. Moving forward, we intend to look at other smaller models that can generate high-quality outputs for such texts. Lastly, since we use external ASR systems, which can be limited in their ability to deal with different accents, our system can be limited by them (even though we do have augmentations to mimic such inputs). Finally, we only tested this system for English.
B Human Evaluation Criteria

Fluency: The generated output should be correct with respect to grammar and word choice, including spelling. It should have no datelines, headers, system-internal formatting, capitalization errors, or ungrammatical sentences (e.g., fragments, missing components) that make the text difficult to read.

Coherence: The generated output should be well structured and well organized. It should not just be a heap of related information or a collection of sentences but should build from sentence to sentence into a well-organized, naturally flowing, coherent body of information.

Naturalness: The generated output should use natural phrasing and maintain the appropriate tone and level of formality given its content (e.g., the implied relationship between sender and recipient, the topic, etc.).

Coverage: The generated output should adequately verbalize the information present in the input. Coverage of the most significant details is desired in the generated output.

C Augmentations for training data
Our system consists of a pipeline of ML models that progressively refines the input at each stage. However, some stages may introduce new errors or fail to fix the errors they were supposed to fix. The comprehension model is the last stage of the pipeline, and it must address the remaining issues as well as any new issues introduced by earlier stages. Therefore, to introduce these capabilities into the comprehension model, we add augmentations to its training dataset.
While preparing the training dataset for fine-tuning the COMP-FT model, we generate new training examples by adding augmentations to the input and output of the initial dataset prepared by human annotators. Table 7 shows some of the augmentations we apply. It consists of three columns, showing the augmentation type, the issue it addresses, and its definition. We address four categories of issues:

ASR issues: issues caused by the ASR system, such as incorrectly transcribing a word as its homophone, i.e., a similar-sounding word.

Normalization issues: issues caused by the normalization stages, such as failing to insert the correct punctuation or not removing filler words.

User input issues: issues present in the user's speech that were not handled by earlier models in the pipeline, such as repeated or incomplete information in the input.

Sensitivity issues: issues found during our sensitivity reviews, such as the model behaving differently when a non-Western name is present in the input.
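Two toy augmentations in the spirit of the ASR- and normalization-issue categories above; the word lists and probabilities are illustrative, not the actual augmentation set.

```python
import random

# Illustrative word lists, not the paper's actual augmentation resources.
HOMOPHONES = {"write": "right", "their": "there", "to": "too"}
FILLERS = ["uh", "um", "uhm"]

def homophone_swap(tokens, rng, p=0.5):
    """ASR-issue augmentation: replace words with similar-sounding
    ones, simulating transcription errors."""
    return [HOMOPHONES.get(t, t) if rng.random() < p else t for t in tokens]

def insert_fillers(tokens, rng, p=0.2):
    """Normalization-issue augmentation: sprinkle in filler words that
    the disfluency stage may have failed to remove."""
    out = []
    for t in tokens:
        if rng.random() < p:
            out.append(rng.choice(FILLERS))
        out.append(t)
    return out
```

Applying such transformations to clean input-output pairs teaches the comprehension model to recover from upstream noise it will actually see at inference time.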

Following are the details of the Punctuation Restoration models:
1. rpunct is an open-source Python package for punctuation restoration, which uses a BERT-base model trained on the Yelp reviews dataset. We use this as our baseline.

Table 1: Different types of inputs (i.e., normalization outputs) handled by our system, along with their characteristics.

Table 2: Performance comparison of different ASR solutions. WER indicates Word Error Rate, and WRR indicates Word Recognition Rate.

Table 5: Human evaluation of different comprehension models on Fluency, Coherence, Naturalness, and Coverage. The numbers in brackets show Cohen's κ and inter-annotator agreement scores, respectively.