Small Character Models Match Large Word Models for Autocomplete Under Memory Constraints

Autocomplete is a task where the user inputs a piece of text, termed the prompt, on which the model conditions to generate a semantically coherent continuation. Existing work on this task has primarily focused on datasets (e.g., email, chat) with high-frequency user prompt patterns (or focused prompts), where word-based language models have been quite effective. In this work, we study the more challenging open-domain setting consisting of low-frequency user prompt patterns (or broad prompts, e.g., a prompt about the 93rd Academy Awards) and demonstrate the effectiveness of character-based language models. We study this problem under memory-constrained settings (e.g., edge devices and smartphones), where the character-based representation is effective in reducing the overall model size (in terms of parameters). We use the WikiText-103 benchmark to simulate broad prompts and demonstrate that character models rival word models in exact match accuracy for the autocomplete task, when controlled for model size. For instance, we show that a 20M-parameter character model performs similarly to an 80M-parameter word model in the vanilla setting. We further propose novel methods to improve character models by incorporating inductive bias in the form of compositional information and representation transfer from large word models. Datasets and code used in this work are available at https://github.com/UBC-NLP/char_autocomplete.


Introduction
Autocomplete models are conditioned on user-written prompts or text to generate semantically coherent continuations. For example, given the user input "Filmmaker George Lucas used Tikal as a ", a semantically coherent continuation can be "filming location" (Example 1). Autocomplete models can dramatically reduce keystrokes and improve users' productivity in a wide range of applications including email, chat, and document authoring. Some typical challenges in building a real-time autocomplete model include: (i) processing arbitrary-length user input (e.g., paragraphs), (ii) handling low-frequency user prompt patterns (or broad prompts; see Example 1), and (iii) satisfying memory constraints of the target device (such as a cap on peak memory utilization).
Despite the importance of the task, there has been limited research on autocomplete. Existing works such as Smart Compose (Chen et al., 2019) and Trajanovski et al. (2021) train autoregressive language models on emails and chats, where user prompt patterns tend to be high-frequency. That is, the prompts are focused prompts, e.g., a prompt about office standups. All these models are trained at the word level, which leads to two issues: (i) input/output embedding parameters (the less compressible component of the Transformer model (Shen et al., 2020)) occupy a significant share (e.g., more than 77%) of the parameter budget due to the large vocabulary size, and (ii) a tendency to memorize high-frequency prompt patterns, resulting in poor generalization to low-frequency ones. In this paper, we focus on the autocomplete task for broad prompts from domains such as Wikipedia, where user prompt patterns often have low frequency (e.g., a prompt about the 93rd Academy Awards). For instance, from Table 1, we observe that WikiText-103 (broad prompts) contains at least 10% more unique out-of-vocabulary (OOV) n-grams compared to the Reddit dataset (focused prompts). This makes our task more challenging than the conventional settings considered in prior work, which either adopt word-based models that are good at memorizing high-frequency patterns for focused prompts or rely on conventional language modeling objectives that are not geared toward generating precise, short-horizon continuations (see Section 4).
Furthermore, we study this problem for practical applications under memory-constrained settings. Lower-end edge platforms (e.g., Raspberry Pi with 256MB of memory (Cai et al., 2020)) have memory constraints that are more limiting than latency constraints for supporting various on-device models. Also, given that autoregressive language models are memory-bound (Wang et al., 2021), we focus on improving the accuracy-memory trade-off for the autocomplete task on broad prompts. Our work is complementary to existing work on model compression, including pruning (Gordon et al., 2020), quantization (Han et al., 2016), and distillation (Sanh et al., 2019), which primarily focuses on natural language understanding tasks (e.g., text classification). In contrast to these works, we study the effectiveness of character-based language models for a natural language generation task, namely autocomplete.
Compared to word models, we show that character models (i) contribute 96% fewer parameters in the embedding layer due to a much smaller vocabulary, (ii) work well on low-frequency (or broad) prompt patterns (e.g., a 21% accuracy improvement by using a 20M character model over a 20M word model, see Figure 3 (a)), and (iii) result in high savings on peak memory utilization (e.g., 4.7% memory savings by using a 20M character model over a 20M word model, see Figure 3 (b)). When controlled for model size (number of parameters), we find that smaller character models (e.g., 20M parameters) perform similarly to large word models (e.g., 80M parameters). We further develop several novel methods to improve the accuracy of character models which, unlike previous work, have minimal impact on memory usage. These methods introduce inductive bias in the form of compositional information and representation transfer from large word models (the best method). We show that the best method achieves 1.12% and 27.3% accuracy improvements over the vanilla character and vanilla word models, respectively, with no impact on memory usage. We discuss the limitations of our work in Section 8 and defer the analysis of the accuracy-latency trade-off to future work, focusing only on memory-constrained settings in this work.
Our major contributions are as follows:
1. To the best of our knowledge, this is the first study of the autocomplete task for broad prompts in a memory-constrained setting.
2. We perform an extensive comparison of character and word models across diverse architectures and demonstrate the advantage of character models over large word models for the autocomplete task on dimensions like peak memory utilization and model parameters.
3. We introduce novel methods leveraging inductive bias to further improve the accuracy of character models with minimal impact on memory usage.

Related Work
Our work leverages advances in neural language models, autocompletion, and efficient deep learning.
Neural Language Models. The autocomplete models we study in this work use Transformer-based (Vaswani et al., 2017) autoregressive neural language models as their backbone. Compared to word models, character models lag behind in language modeling performance when controlled for model size and have a higher computational complexity due to longer sequence lengths (Tay et al., 2021). In this work, we focus on deploying models on lower-end edge platforms (e.g., Raspberry Pi) where memory, as opposed to latency, is the major bottleneck.

Autocomplete Task. Despite the pervasiveness of autocomplete models, there is limited research in the academic community on the autocomplete task. Gmail Smart Compose (Chen et al., 2019) is a popular word-based autocomplete model for email suggestions. They find the encoder-decoder architecture to have a higher latency than the decoder-only architecture. They also find the Transformer architecture to be marginally better than the LSTM architecture (Hochreiter and Schmidhuber, 1997).
Motivated by these findings, we employ a decoder-only, Transformer-based architecture for building our autocomplete model. Trajanovski et al. (2021) leverage word-based autocomplete models for providing email and chat suggestions.
In this work, we focus on building autocomplete models for broad prompts from domains such as Wikipedia, where user prompt patterns can be quite low-frequency (e.g., a prompt about the 93rd Academy Awards). Unlike our prompt completion task, query autocompletion is a well-researched problem (Bar-Yossef and Kraus, 2011; Cai and de Rijke, 2016; Wang et al., 2020; Gog et al., 2020), where the goal is to complete the user's query, e.g., a search query. Since user queries are generally short, query autocomplete models need not track long-range dependencies to understand the user's intent. In contrast, tracking such dependencies is a requirement in our prompt completion setting, as the user prompt can be arbitrarily long, e.g., sentences or paragraphs.

Efficient Deep Learning. Exponential growth in the size of Transformer-based autoregressive language models (e.g., 175B parameters (Brown et al., 2020)) has given rise to a strong need to make these models efficient so they can be used on commodity devices like laptops, tablets, and mobile phones, which have various resource constraints such as peak memory utilization and latency, while yielding the best performance under those constraints. To this end, there has been extensive research on building efficient Transformer models that are smaller, faster, and better, as summarized thoroughly by Tay et al. (2020) and Menghani (2021). Our work is focused on improving the efficiency of a natural language generation task (autocomplete), which has received less attention from an efficiency perspective. Wang et al. (2021) observe that 73% of the overall latency of autoregressive language models goes to memory-intensive data movement operations (e.g., splitting heads, transpose, reshape) and conclude that these models are memory-bound. Since lower-end edge platforms have tighter memory constraints than latency constraints (Cai et al., 2020), we focus on improving the accuracy-memory trade-off of autocomplete models.

Autocomplete Task

Given a text sequence $x = (x_1, \ldots, x_{|x|})$ (user input) with tokens from a fixed vocabulary, $x_i \in V$, the goal of the autocomplete task is to generate a completion $\hat{x}_{k+1:N}$ such that the resulting sequence $(x_1, \ldots, x_k, \hat{x}_{k+1}, \ldots, \hat{x}_N)$ resembles a sample from $p^*$, where $p^*(x)$ denotes the reference distribution. The input $x$ can be arbitrarily long (e.g., paragraphs), while $\hat{x}_{k+1:N}$ is generally short (e.g., three words). Each token $x_k$ can be a word, character, or subword. The vocabulary $V$ contains the unique tokens from the dataset $\mathcal{D}$, a finite set of text sequences drawn from $p^*$.
Data. Most datasets in the autocomplete literature come from domains with focused prompts (e.g., emails (Chen et al., 2019; Trajanovski et al., 2021), chat messages (Trajanovski et al., 2021)). In this work, we target the autocomplete task on datasets with broad prompts (e.g., Wikipedia) containing many low-frequency prompt patterns (e.g., prompts about the EMNLP 2022 conference). Autocomplete models trained to answer broad prompts can be used to assist users in completing documents such as essays, reports, and letters.
Metrics. The commonly used metric for evaluating the quality of an autocomplete model is ExactMatch@N (Rajpurkar et al., 2016), which measures, over test prompts, the percentage for which the first N words of the predicted suggestion exactly match the first N words of the ground truth suggestion. ExactMatch@Overall (Chen et al., 2019) is a weighted average of the ExactMatch scores for all subsequence lengths up to K. For our setting, larger n-grams are increasingly difficult to predict for both word and character models, as shown in Figure 4. Hence we set K to 3. Since the exact match metric strictly looks for a full match of the subsequence, it is a hard metric to improve on, especially for broad prompts. One can utilize a less stringent metric such as PartialMatch (Trajanovski et al., 2021). PartialMatch measures the percentage of characters in the first N words of the predicted suggestion that exactly match those of the ground truth suggestion. However, PartialMatch might not adequately penalize grammatical incorrectness of the predicted suggestion. Trajanovski et al. (2021) also utilize metrics that require interactions from real users, which are difficult to acquire in practice. Given that the user-based metrics and the PartialMatch metric have a strong correlation with ExactMatch in all the experiments carried out by Trajanovski et al. (2021), we use the exact match metric to quantify the performance of the autocomplete model in this work. We further perform a human evaluation to compare the naturalness of the suggestions generated by different models.

Model. We adopt the Transformer architecture, specifically Transformer-XL (Dai et al., 2019), for our autocomplete model. We choose Transformer-XL for two reasons: (i) the model achieves strong results on word- and character-based language modeling benchmarks, and (ii) the model can handle long text sequences (e.g., 1600 word tokens or 3800 character tokens), which is crucial for handling arbitrarily long user inputs ($x$).
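To make the metric concrete, the following is a minimal sketch of ExactMatch@N and ExactMatch@Overall for a single example, assuming whitespace tokenization and uniform weights across N; per-example scores are averaged over the test set, and the exact weighting used in prior work may differ.

```python
def exact_match_at_n(prediction: str, ground_truth: str, n: int) -> float:
    """1.0 if the first n words of the prediction match the first n words of the
    ground truth exactly, else 0.0 (averaged over examples to obtain ExactMatch@N)."""
    pred = prediction.split()[:n]
    gold = ground_truth.split()[:n]
    return float(len(gold) == n and pred == gold)


def exact_match_overall(prediction: str, ground_truth: str, k: int = 3) -> float:
    """Average of ExactMatch@N for N = 1..k (we set K = 3); the uniform weighting
    here is an assumption made for illustration."""
    return sum(exact_match_at_n(prediction, ground_truth, n) for n in range(1, k + 1)) / k
```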
Training. We train a decoder-only Transformer-XL model that conditions on the user input to generate the suggestion autoregressively. The parameters $\theta$ of the autocomplete model $p_\theta(x)$ can be optimized using the standard language modeling objective.
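Concretely, the standard objective maximizes the log-likelihood of each token (character or word) given its prefix over the training set $\mathcal{D}$:

$$\max_{\theta} \; \sum_{x \in \mathcal{D}} \sum_{t=1}^{|x|} \log p_\theta\!\left(x_t \mid x_1, \ldots, x_{t-1}\right).$$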
Inference. During inference, the model $p_\theta(x)$ takes the user input $x_{1:k} \sim p^*$ and generates the suggestion $\hat{x}_{k+1:N} \sim p_\theta(\cdot \mid x_{1:k})$ such that $(x_1, \ldots, x_k, \hat{x}_{k+1}, \ldots, \hat{x}_N)$ resembles a sample from $p^*$. In this work, we choose greedy search and select the token that receives the highest probability as the generated token; that is, $\hat{x}_t = \arg\max_{x_t} p_\theta(x_t \mid x_1, \ldots, x_{t-1})$. As explained in Appendix A.4, we find that beam search performs poorly on our task, and we also show that the trends we see in the next section do not depend on the choice of decoding algorithm. For simplicity, we assume the autocomplete model generates exactly one suggestion $\hat{x}_{k+1:N}$.
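The decoding loop for a character-level model can be sketched as follows. The interface is hypothetical (`model` is assumed to return next-token logits for the last position and `tokenizer` to map between characters and ids); this is not the released implementation. Generation stops once the desired number of words has been emitted, as in our experiments.

```python
import torch


@torch.no_grad()
def greedy_complete(model, tokenizer, prompt_ids, num_words=3, max_chars=64):
    """Greedily generate characters until `num_words` word boundaries (spaces)
    have been produced, mirroring the stopping rule used for character models."""
    ids = list(prompt_ids)
    generated = []
    words_done = 0
    for _ in range(max_chars):
        logits = model(torch.tensor([ids]))[0, -1]   # logits for the next character
        next_id = int(torch.argmax(logits))          # greedy choice
        ch = tokenizer.decode([next_id])
        if ch == " ":
            words_done += 1
            if words_done >= num_words:
                break                                # stop after the last full word
        ids.append(next_id)
        generated.append(next_id)
    return tokenizer.decode(generated)
```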

Character vs. Word Model
Existing autocomplete models are primarily word-based, i.e., the representation choice for $x_k$ is the word. Word-based autocomplete models have the following properties: (i) they invest most of the parameters (e.g., more than 77%) of the overall parameter budget in the embedding layer, which is less compressible using standard techniques such as quantization (Shen et al., 2020), and (ii) they can memorize high-frequency prompt patterns and perform well on datasets with focused prompts (e.g., Reddit posts). In this work, we aim to keep the parameter allocation of the embedding layer as small as possible, thereby improving the overall memory footprint. To this end, we choose the character as the representation and study the accuracy-memory tradeoff of character-based models on the autocomplete task for broad prompts.
Character-based autocomplete models have several desirable properties compared to their word-based counterparts: they (i) invest far fewer parameters (e.g., less than 4%) of the parameter budget in the embedding layer and invest most parameters in other, highly compressible Transformer components such as the self-attention network, feedforward network, and softmax layer; (ii) perform well on datasets with broad prompts (as we will show); and (iii) provide a better tradeoff between accuracy and memory (model size and peak memory utilization).
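To illustrate why the embedding share differs so sharply, the back-of-the-envelope calculation below compares a tied token embedding against the Transformer body for a word-level vocabulary (roughly 267K types in WikiText-103) and a character-level vocabulary (a few hundred symbols). It is a rough sketch with illustrative dimensions; it ignores the adaptive embedding/softmax factorization and the architecture averaging behind Figure 2, so the exact percentages differ from those reported there.

```python
def embedding_share(vocab_size: int, d_model: int = 512, n_layers: int = 16) -> float:
    """Fraction of parameters spent on a tied input/output token embedding,
    relative to the embedding plus the self-attention and feedforward layers."""
    emb = vocab_size * d_model                    # tied token embedding matrix
    attn = n_layers * 4 * d_model * d_model       # Q, K, V and output projections
    ffn = n_layers * 2 * d_model * (4 * d_model)  # two linear maps with 4x inner dim
    return emb / (emb + attn + ffn)


print(f"word vocab (267K): {embedding_share(267_000):.0%} of parameters in embeddings")
print(f"char vocab (~200): {embedding_share(200):.2%} of parameters in embeddings")
```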
To demonstrate these various aspects, we perform extensive experiments on the WikiText-103 benchmark (Merity et al., 2017) (unless stated otherwise). This benchmark contains about 100M tokens from Wikipedia articles, which we use to simulate broad prompts. Since we focus on improving the memory footprint of autocomplete models, we do not experiment with subword models, which introduce a large number of token embeddings in the embedding layer (e.g., 50K) compared to their character-based counterparts.
Differences between our Autocomplete Task and Conventional Language Modeling. The training procedures for our autocomplete task and for the conventional language modeling (CLM) task are generally similar. However, there are two main differences between the two tasks. First, the goal of our autocomplete task is to generate suggestions with high precision (as captured by ExactMatch), while the main goal of CLM is to maximize the overall data likelihood (as captured by perplexity). Chen et al. (2019) show that perplexity and ExactMatch are only weakly correlated, as improvements in perplexity could be "mostly in places where the model is relatively low in score". As shown in Figure 1, autocomplete models with poorer perplexity scores (e.g., the character model of size 20M) can enjoy better ExactMatch scores than models with better perplexity scores (e.g., the word model of size 20M). We also perform a theoretical analysis to show how perplexity scores can change drastically for the same ExactMatch score (details in Appendix A.5). Second, our autocomplete task typically focuses on generating short-horizon continuations, while CLM typically focuses on generating long-horizon continuations.

Component-Wise Parameter Breakdown. The Transformer-XL model can be broken down into four components: (i) adaptive embedding layers (AdaEmb) (Baevski and Auli, 2019), which contain shared input and output token embeddings; (ii) self-attention layers (Attn); (iii) feedforward network layers (FFN); and (iv) output softmax layers (Softmax). Figure 2 shows the percentage of parameters allocated to each component for both word- and character-based models, averaged over 100 random architectures for each representation. Word-based models allocate more than 77% of their parameters to the embedding layers, which are less amenable to compression for producing smaller, more efficient models. These models allocate less than 14% and 8% of the parameter budget to highly compressible layers such as the self-attention and feedforward network layers, respectively. In contrast, character-based models allocate more than 90% of their parameters to these highly compressible layers and less than 4% to the embedding layers. Hence, character-based models have the potential to admit greater compression using standard techniques such as distillation and quantization with a negligible performance drop.

Accuracy vs. Memory Tradeoff. Although character-based models seem to have better compression potential, their autocomplete performance gap relative to word-based models as a function of memory is not immediately obvious. We study the effect of memory in two ways: (i) model size, which corresponds to the total number of model parameters, and (ii) peak memory utilization, which measures the peak amount of memory utilized by a process during inference. In all our experiments, the decoding of character models stops once the desired number of words (identified by the space character) has been predicted. Figure 3 shows the accuracy-memory Pareto curve. Surprisingly, we observe that small character models (e.g., 20M) can rival large word models (e.g., 80M) in terms of the accuracy-memory tradeoff. For instance, if we use a character model of size 20M instead of a word model of size 80M, we save 75% of the model parameters and more than 60% of the peak memory utilization for a performance drop of less than 0.5 points.

Broad vs. Focused Domain. Prior work has found character models to lag behind word models in language modeling performance. Surprisingly, small character models perform similarly to or better than big word models on the autocomplete task. We hypothesize that the reason behind the superior performance of character models in our setting is their ability to answer broad prompts better than word-based models. To validate this claim, we compare character and word models on their ability to answer broad and focused prompts, controlled for model size, with 80M parameters each.
From Table 1, we observe that the percentage of unique out-of-vocabulary (OOV) n-grams in WikiText-103 is 10% higher than that in the Reddit dataset. While WikiText and Reddit naturally have different vocabulary distributions, the significant gap in the relative proportions of OOV n-grams indicates that Wikipedia articles cover more diverse and broad domains. We therefore simulate broad prompts using articles from WikiText-103 and focused prompts using user posts from Reddit. As shown in Figure 4, the word-based model is superior to the character-based model in answering focused prompts, but not in answering broad prompts. A potential reason is the tendency of word-based models to memorize high-frequency patterns, which are rife in datasets with focused prompts. On the other hand, character-based models excel at answering broad prompts (the focus of our work), which can be attributed to their superior ability to handle low-frequency patterns. We observe this trend for character-based models when we report accuracy on the top k ('cutoff') low- (high-) frequency prompt patterns for WikiText (Reddit), selected by ranking the prompts by the percentage of OOV n-grams (up to 3) in ascending (descending) order (see Figure 5). We also observe the trend on unseen datasets with broad prompts (e.g., PTB, see Appendix A.6).
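The ranking statistic above can be computed as follows. This is a minimal sketch, assuming whitespace-tokenized prompts and a set of n-grams collected from the training data; names are illustrative.

```python
from typing import List, Set, Tuple


def oov_ngram_percentage(prompt_tokens: List[str],
                         train_ngrams: Set[Tuple[str, ...]],
                         max_n: int = 3) -> float:
    """Percentage of unique n-grams (n = 1..max_n) in the prompt that never occur
    in the training data; prompts ranked by this value separate low-frequency
    (broad) from high-frequency (focused) prompt patterns."""
    unique = {tuple(prompt_tokens[i:i + n])
              for n in range(1, max_n + 1)
              for i in range(len(prompt_tokens) - n + 1)}
    if not unique:
        return 0.0
    oov = sum(ng not in train_ngrams for ng in unique)
    return 100.0 * oov / len(unique)
```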

Methods to Improve Character Models
In the previous section, we demonstrated character-based models to be more efficient than word-based models for the autocomplete task on broad prompts. Unlike word-based models, which directly consume words, character-based models are forced to learn and compose semantically meaningful textual units (e.g., suffixes, words) from more granular lexical units in the form of characters. Therefore, methods that can explicitly integrate information from semantic units higher than characters (such as words or word segments) can propel the performance of character-based models (Park and Chiba, 2017). However, existing methods primarily focus on improving the accuracy of character models, often at the expense of memory. For example, Park and Chiba (2017) augment a character model with explicit model parameters for word embeddings, which add several million additional parameters (e.g., 13M parameters with a modest embedding size of 50 and the standard WikiText-103 word vocabulary size of 267K). We introduce novel methods that explicitly integrate word information into the character model with negligible impact on memory, as discussed next.

(Figure 6: Methods to improve character models. 'Position' in (a) and (b) refers to character position embeddings; panel (c) depicts the transfer from word models method.)
BERT-Style Word Segment Embedding. In this method, we introduce a word segment embedding layer, which acts as an inductive bias by providing word segment information explicitly, in addition to the character and position embedding layers (Figure 6 (a)). This word segment embedding layer is inspired by the sentence segment layer of BERT (Devlin et al., 2019), which helps the model distinguish sentences in the textual input. In our case, the word segment embedding layer helps the model distinguish words in the textual input. The number of additional model parameters introduced by this layer is the maximum number of words in a training input sequence times the embedding dimension, which is generally negligible.

Character Pooling. In this method, we compute word embeddings by pooling the embeddings of the characters seen so far for the current word (see Figure 6 (b)). The pooling function takes a set of character embeddings as input and outputs the word embedding, which is concatenated with the other embeddings (as additional input), similar to the previous method. We experiment with non-parameterized, simple pooling functions such as sum, mean, and maximum. Unlike the previous method, the character pooling method does not introduce additional model parameters, due to the choice of our pooling function. The computation of the word embedding does not involve look-ahead embeddings from characters of the current word that have not yet been seen at the current timestep, thus preventing data leakage that would render the language modeling task trivial.

Transfer from Word Models. In this method, we initialize a subset of the decoder layers of the character model with decoder layers from a trained word model (Figure 6 (c)). Unlike the previous methods, the decoder layer transfer method can appropriately exploit the rich syntactic and semantic information learned by the word model, which serves as a better starting point for training a character model than training from scratch. Similar to the character pooling method, this method does not introduce additional model parameters. Rather, it introduces a novel hyperparameter that controls the percentage of bottom word-level layers to transfer into our character-level model, which is tuned on the validation set. To the best of our knowledge, no prior work has explored transferring layers from a trained source model where the source and target models have very different vocabularies.
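As an illustration of the character pooling method above, the following is a minimal sketch; the tensor layout and the zero vector emitted at space positions are assumptions made for illustration, not the exact released implementation.

```python
import torch


def causal_word_pool(char_embs: torch.Tensor, chars: str, pool: str = "max") -> torch.Tensor:
    """For each position t, pool the embeddings of the characters seen so far in
    the current word only (the span resets after each space), so no look-ahead
    information from unseen characters leaks into the word embedding.
    char_embs: [len(chars), d]; returns [len(chars), d], to be concatenated with
    the character and position embeddings as an additional input."""
    outputs = []
    word_start = 0
    for t, ch in enumerate(chars):
        if ch == " ":
            word_start = t + 1                      # next character starts a new word
            outputs.append(torch.zeros_like(char_embs[t]))
            continue
        span = char_embs[word_start:t + 1]          # prefix of the current word
        if pool == "max":
            outputs.append(span.max(dim=0).values)
        elif pool == "mean":
            outputs.append(span.mean(dim=0))
        else:                                       # "sum"
            outputs.append(span.sum(dim=0))
    return torch.stack(outputs)
```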

Results
We now discuss the improvements obtained by training character models with our novel methods over training a vanilla character model from scratch.

BERT-Style Word Segment vs. Character Pooling. Figure 7 shows the improvements of character models of size 80M with the BERT-style word segment embedding and character pooling methods. Context percent corresponds to the percentage of initial tokens taken from a Wikipedia paragraph to construct the prompt, while the rest of the tokens form the ground truth. BERT-style word segment outperforms the baseline and character pooling methods at all context percent values. We attribute the inferior performance of the character pooling methods to their inability to track the order of the characters while computing the word representation. Among the different pooling functions, the max function performs well at most context percent values. When the context percent is very low (e.g., 0.2), it is interesting to see that all methods perform similarly to or outperform the baseline. This result shows that integrating word information explicitly is especially crucial when prompts are ambiguous or contain few tokens (i.e., the context percent is low). We omit the character pooling method from further analysis due to its inferior performance.

BERT-Style Word Segment vs. Transfer from Word Models. Table 2 shows the performance improvements of the "BERT-style word segment" method and the "transfer from word model" method over the vanilla character (baseline) model of size 10M.

In order to transfer decoder layers from the word model, we first train a 20-layer word model that has the same Transformer shape (i.e., number of heads, head dimension, model dimension, and inner dimension of the feedforward layer) as the baseline model, and we transfer the bottom 10% of the decoder layers from the word model to initialize our character model (the hyperparameter space for the transfer from word models method is given in Appendix A.3). We quantify the improvement in model performance using both the ExactMatch@Overall and PartialMatch@Overall metrics. Consistent with the findings of Trajanovski et al. (2021), we observe the improvements in the ExactMatch and PartialMatch metrics to be highly correlated. The "BERT-style word segment" and "transfer from word model" methods improve upon the baseline (vanilla character model) by 0.65% (0.33%) and 1.12% (0.84%), respectively, in terms of ExactMatch (resp. PartialMatch). These methods also improve upon the baseline word model by 26.7% (12.1%) and 27.3% (12.6%), respectively. Importantly, compared to the "BERT-style word segment" method, which introduces 384K additional parameters, our "transfer from word model" method does not introduce any additional parameters. This demonstrates the advantage of "transfer from word models" in improving character models (compared to our other methods) while leaving no impact on memory.
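The layer transfer itself amounts to a partial state-dict copy. Below is a minimal sketch, assuming both models expose their decoder stack as a `torch.nn.ModuleList` named `layers` with identical per-layer shapes; these names are illustrative, not the released code. The transferred fraction is the hyperparameter tuned on the validation set.

```python
import torch


def transfer_bottom_layers(word_model: torch.nn.Module,
                           char_model: torch.nn.Module,
                           fraction: float = 0.10) -> int:
    """Initialize the bottom `fraction` of the character model's decoder layers
    from the corresponding layers of a trained word model; all other layers
    (and the character embedding/softmax layers) keep their usual initialization."""
    n_transfer = max(1, int(fraction * len(word_model.layers)))  # e.g., 2 of 20 layers
    n_transfer = min(n_transfer, len(char_model.layers))
    for i in range(n_transfer):
        char_model.layers[i].load_state_dict(word_model.layers[i].state_dict())
    return n_transfer
```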
Table 3: Sample prompts and suggestions generated by different models.

Prompt: In December 759 , he briefly stayed in Tonggu ( modern Gansu ) . He departed on December 24 for Chengdu ( Sichuan province ) , where he was hosted by local Prefect and
Ground truth: fellow poet Pei | Baseline: servant and served | BERT-style: chief executive officer | Transfer from word models: commissioned as a

Prompt: Although initially he was little @-@ known to other writers , his works came to be hugely influential in both
Ground truth: Chinese and Japanese | Baseline: the writers and | BERT-style: writers and writers | Transfer from word models: the ancient and

Prompt: Hung summarises his life by concluding that ,
Ground truth: " He appeared | Baseline: according to ksummarises | BERT-style: in the same | Transfer from word models: as a result

Qualitative Analysis and Human Evaluation. Tables 3 and 9 (Appendix) show sample suggestions generated by the vanilla and proposed character autocomplete models. Suggestions generated by the strongest method appear more plausible (even when they do not exactly match the ground truth) and contain fewer repetitions and fewer grammatical errors. We also perform a human evaluation of the suggestions generated by various autocomplete models based on their naturalness (see Appendix A.9 for details). Human suggestions taken from WikiText-103 have a naturalness score of 0.88 as rated by annotators. From the naturalness scores of the models in Table 2, we observe that the "transfer from word models" method generates the most natural suggestions (0.69), better than the character baseline (0.62) but worse than the human baseline (0.88).

Conclusion
In this work, we investigated the challenging task of building autocomplete models that answer broad prompts under memory-constrained settings. To this end, we introduced novel methods that integrate word information into a character model with negligible impact on memory. Employing our methods, we demonstrated that character models can achieve a better accuracy-memory trade-off than word models.

Limitations
The limitations of this work are as follows:
• English. Our work builds autocomplete models for the English language only.
• Single suggestion. Our autocomplete models focus on generating exactly one suggestion, which works well for cases (e.g., Smart Compose) where the suggestion appears directly in the application in a blurred manner and the user can accept it by pressing a single key such as the TAB key. We do not cover cases where the number of required suggestions can be arbitrary.
• Accuracy-memory tradeoff only. Our work primarily focuses on deploying models on lower-end edge platforms where memory, as opposed to latency, is the major bottleneck. Hence, our methods may not improve the accuracy-latency tradeoff, which is a focus of future work.
• Naturalness as a metric. We use naturalness as the only quality metric for human raters to assess the suggestions. We do not use a metric based on user acceptability (whether the user would accept the suggestion or not), mainly due to its high subjectivity.

A Appendices
A.1 Hyperparameter space for computing component-wise parameter breakdown

Table 7 displays the Transformer-XL hyperparameter space used to create the 100 random architectures for the component-wise parameter breakdown plot (Figure 2) for both word and character models. The rest of the hyperparameters come from the default configuration of the Transformer-XL model.
A.2 Hyperparameter values for word and character models of different sizes

Table 5 displays the hyperparameter values for the word models of different sizes used in the paper. Table 6 displays the hyperparameter values for the character models of different sizes used in the paper.
A.3 Hyperparameter space for the transfer from word models method

Table 7 displays the hyperparameter space for the proposed transfer from word models method.
A.4 Greedy vs. beam search decoding

Figure 8 shows the Pareto curves for greedy and beam search. It is clear that smaller character models rival bigger word models regardless of the choice of decoding algorithm. Strikingly, we find greedy search to outperform beam search by a large margin. Two possible reasons are: (i) the noise injected by the adaptive softmax approximation of the predicted probability distribution over the vocabulary, and/or (ii) the tendency of beam search to explore spurious hypotheses when user prompt patterns are low-frequency.

A.5 Theoretical analysis on differences in perplexity and Exact Match metrics
We conduct a theoretical study to show the differences in the information captured by the perplexity and exact match metrics. Specifically, we show that the exact match score can be perfect while the perplexity score is either perfect or worse by a large margin (Claim 1). Conversely, we also show that the exact match score can be the worst possible (i.e., zero) while the perplexity score is either poor or better by a large margin (Claim 2). Without loss of generality, we assume the vocabulary size $V$ to be 2. Let A and B be the two tokens corresponding to the first and second index in the vocabulary, respectively. Consider a single token prediction $\hat{x}_j$ and let the ground truth token be B, that is, $x_j = [0, 1]$. Table 8 shows the differences in the perplexity score and exact match score as a function of $\hat{x}_j$ as it varies slightly. The first six rows in the table validate Claim 1, where the exact match score is 1 but the perplexity ranges from −9.9e−10 to 0.67. The remaining rows validate Claim 2, where the exact match score is 0 but the perplexity score ranges from 0.69 to 20.72.
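As an illustrative recomputation of the boundary cases (under the assumption that the scores in Table 8 are token-level log losses, which the reported values appear to be), the snippet below varies the probability assigned to the ground-truth token B around the greedy decision boundary:

```python
import math

# Probability assigned to the ground-truth token B (vocabulary = {A, B}).
for p_b in [1.0 - 1e-9, 0.51, 0.5, 1e-9]:
    exact_match = 1 if p_b > 0.5 else 0   # greedy picks B only if p(B) > p(A); ties go to A
    loss = -math.log(p_b)                 # token-level log loss
    print(f"p(B) = {p_b:<12g} ExactMatch = {exact_match}  loss = {loss:.2f}")
```

With p(B) = 0.51 the prediction is an exact match yet the loss is about 0.67, while with p(B) = 1e-9 the prediction is wrong and the loss is about 20.72, matching the extremes discussed above.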

A.6 Accuracy-Memory Pareto-Curve on Unseen Datasets
We study the accuracy-memory Pareto curve of autocomplete models trained on WikiText-103 and evaluated on the test sets of two unseen datasets: LAMBADA (mostly focused prompts) and PTB (mostly broad prompts). From Figure 9, we observe that the trend where smaller character models rival larger word models holds for answering broad prompts (PTB) but not clearly for answering focused prompts (LAMBADA). It is striking that the trend holds for broad prompts even though the examples are unseen during the training of the autocomplete model.

A.7 Sample suggestions

Table 9 displays sample suggestions generated by the vanilla and proposed character autocomplete models, grouped by the type of artifact in the generation.

A.8 Qualitative analysis of vanilla and proposed character models
We manually inspect the suggestions generated by the vanilla and proposed character models. Table 10 displays the percentage of different artifacts: plausible (a plausible suggestion that does not exactly match the ground truth), semantic error (e.g., a new n-gram or incorrect n-gram usage), repetition (e.g., an n-gram with repetitions), and grammatical error. Compared to the baseline and BERT-style word segment models, the character model with decoder layer transfer from the word model produces fewer undesirable artifacts overall.

A.9 Human evaluation of naturalness

We ask annotators to rate the naturalness of a model suggestion given the prompt. Some aspects of natural suggestions are borrowed from Dou et al. (2022). The annotation guideline can be seen in Table 11. We ask 8 annotators to rate 10 suggestions each.