PlotCoder: Hierarchical Decoding for Synthesizing Visualization Code in Programmatic Context

Creating effective visualizations is an important part of data analytics. While there are many libraries for creating visualizations, writing such code remains difficult given the myriad of parameters that users need to provide. In this paper, we propose the new task of synthesizing visualization programs from a combination of natural language utterances and code context. To tackle the learning problem, we introduce PlotCoder, a new hierarchical encoder-decoder architecture that models both the code context and the input utterance. We use PlotCoder to first determine the template of the visualization code, followed by predicting the data to be plotted. We use Jupyter notebooks containing visualization programs crawled from GitHub to train PlotCoder. On a comprehensive set of test samples from those notebooks, we show that PlotCoder correctly predicts the plot type for about 70% of the samples, and synthesizes the correct programs for 35% of the samples, performing 3-4.5% better than the baselines.


Introduction
Visualizations play a crucial role in obtaining insights from data. While a number of libraries (Hunter, 2007;Seaborn, 2020;Bostock et al., 2011) have been developed for creating visualizations that range from simple scatter plots to complex 3D bar charts, writing visualization code remains a difficult task. For instance, drawing a scatter plot using the Python matplotlib library can be done using both the scatter and plot methods, and the scatter method (Matplotlib, 2020) takes in 2 required parameters (the values to plot) along with 11 other optional parameters (marker type, color, etc), with some parameters having numeric types (e.g., the size of each marker) and some being arrays (e.g., the list of colors for each collection of the plotted data, where each color is specified as a string or another array of RGB values). Looking up each parameter's meaning and its valid values remains tedious and error-prone, and the multitude of libraries available further compounds the difficulty for developers to create effective visualizations.
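To illustrate the redundancy mentioned above, both matplotlib routes below produce essentially the same scatter plot; a minimal sketch using the non-interactive Agg backend (the data and styling choices are illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

x, y = [1, 2, 3], [4, 5, 6]
fig, (ax1, ax2) = plt.subplots(1, 2)

# Route 1: the dedicated scatter API, with two required arguments
# and many optional ones (marker size, color, marker style, ...).
ax1.scatter(x, y, s=40, c="tab:blue", marker="o")

# Route 2: plot() with a marker-only format string draws the same points.
ax2.plot(x, y, "o")
```

Either call is a valid realization of "draw a scatter plot", which is exactly the ambiguity a synthesizer must cope with.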
In this paper, we propose to automatically synthesize visualization programs from a combination of natural language utterances and the programmatic context in which the visualization program will reside (e.g., code written in the same file as the visualization program to load the plotted data), focusing on programs that create static visualizations (e.g., line charts, scatter plots, etc). While there has been prior work on synthesizing code from natural language (Zettlemoyer and Collins, 2012; Oda et al., 2015; Wang et al., 2015; Yin et al., 2018), and with additional information such as database schemas (Zhong et al., 2017; Yu et al., 2018) or input-output examples (Polosukhin and Skidanov, 2018; Zavershynskyi et al., 2018), synthesizing general-purpose code from natural language remains highly difficult due to the ambiguity of the natural language input and the complexity of the target programs. Our key insight in synthesizing visualization programs is to leverage their properties: they tend to be short and do not use complex programmatic control structures (typically a few lines of method calls without any control flow or loop constructs), with each method call restricted to a single plotting command (e.g., scatter, pie) along with its parameters (e.g., the plotted data). This influences our model architecture design, as we will explain.
To study the visualization code synthesis problem, we use the Python Jupyter notebooks from the JuiCe dataset (Agashe et al., 2019), where each notebook contains the visualization program and its programmatic context. These notebooks are crawled from GitHub and written by various programmers, thus a main challenge is coping with the complexity and noisiness of real-world programmatic contexts and the huge variance in the quality of natural language comments. Unfortunately, using standard LSTM-based models and Transformer architectures (Vaswani et al., 2017) fails to solve the task, as noted in prior work (Agashe et al., 2019).
We observe that while the data to be plotted is usually stored in pandas dataframes (Pandas, 2020), these dataframes are not explicitly annotated in JuiCe. Hence, unlike prior work, we augment the programmatic context with dataframe names and their schemas, when available, to help predict the plotted data.
We next utilize our insight above and design a hierarchical deep neural network code generation model called PLOTCODER that decomposes synthesis into two subtasks: generating the plot command, then the parameters to pass in given the command. PLOTCODER uses a pointer network architecture (Vinyals et al., 2015), which allows the model to directly select code tokens in the previous code cells in the same notebook as the plotted data. Meanwhile, inspired by the schema linking techniques proposed for semantic parsing with structured inputs, such as text to SQL tasks (Iyer et al., 2017;Wang et al., 2019a;Guo et al., 2019), PLOTCODER's encoder connects the embedding of the natural language descriptions with their corresponding code fragments in previous code cells within each notebook. Although the constructed links can be noisy because the code context is less structured than the database tables in text-to-SQL problems, we observe that our approach results in substantial performance gain.
We evaluate PLOTCODER's ability to synthesize visualization programs using Jupyter notebooks of homework assignments or exam solutions. On the gold test set where the notebooks are official solutions, our best model correctly predicts the plot types for over 80% of samples, and precisely predicts both the plot types and the plotted data for over 50% of the samples. On the more noisy test splits with notebooks written by students, which may include work-in-progress code, our model still achieves over 70% plot type prediction accuracy, and around 35% accuracy for generating the entire code, showing how PLOTCODER's design decisions improve our prediction accuracy.

Figure 1: An example of the plot code synthesis problem studied in this work. Given the natural language, the code context within a few code cells of the target code, and other code snippets related to dataframes, PLOTCODER synthesizes the data visualization code. The natural language in the example reads: "Explore the relationship between rarity and a skill of your choice. Choose one skill ('Attack', 'Defense' or 'Speed') and do the following. Use the scipy package to assess whether Catch Rate predicts the skill. Create a scatterplot to visualize how the skill depends upon the rarity of the pokemon. Overlay a best fit line onto the scatterplot."

Related Work
There has been work on translating natural language to code in different languages (Zettlemoyer and Collins, 2012; Wang et al., 2015; Oda et al., 2015; Yin et al., 2018; Zhong et al., 2017; Yu et al., 2018; Lin et al., 2018). While the input specification only includes the natural language for most tasks, prior work also uses additional information for program prediction, including database schemas and contents for SQL query synthesis (Zhong et al., 2017; Yu et al., 2018), input-output examples (Polosukhin and Skidanov, 2018; Zavershynskyi et al., 2018), and code context (Iyer et al., 2018; Agashe et al., 2019). There has also been work on synthesizing data manipulation programs from input-output examples alone (Drosos et al., 2020; Wang et al., 2017). In this work, we focus on synthesizing visualization code from both a natural language description and the code context, and we construct our benchmark based on the Python Jupyter notebooks from the JuiCe dataset (Agashe et al., 2019). Compared to JuiCe's input format, we also annotate dataframe schemas when available, which is especially important for visualization code synthesis. Prior work has studied generating plots from other specifications. Falx (Wang et al., 2019b, 2021) synthesizes plots from input-output examples, but does not use any learning technique, focusing instead on developing a domain-specific language for plot generation. In (Dibia and Demiralp, 2019), the authors apply a standard LSTM-based sequence-to-sequence model with attention for plot generation, but the model takes in only the raw data to be visualized, with no natural language input. The visualization code synthesis problem studied in our work is much more complex: both the natural language and the code context can be long, and the program specifications are implicit and ambiguous.
Our design of hierarchical program decoding is inspired by prior work on sketch learning for program synthesis, where various sketch representations have been proposed for different applications (Solar-Lezama, 2008;Murali et al., 2018;Dong and Lapata, 2018;Nye et al., 2019). Compared to other code synthesis tasks, a key difference is that our sketch representation distinguishes between dataframes and other variables, which is important for synthesizing visualization code.
Our code synthesis problem is also related to code completion, i.e., autocompleting the program given the code context (Raychev et al., 2014;Li et al., 2018;Svyatkovskiy et al., 2020). However, standard code completion only requires the model to generate a few tokens following the code context, rather than entire statements. In contrast, our task requires the model to synthesize complete and executable visualization code. Furthermore, unlike standard code completion, our model synthesizes code from both the natural language description and code context. Nevertheless, when the prefix of the visualization code is given, our model could also be used for code completion, by including the given partial code as part of the code context.

Visualization Code Synthesis Problem
We now discuss our problem setup of synthesizing visualization code in programmatic context, where the model input includes different types of specifications. We first describe the model inputs, then introduce our code canonicalization process to make it easier to train our models and evaluate the accuracy, and finally our evaluation metrics.

Program Specification
We illustrate our program specification in Figure 1, which represents a Jupyter notebook fragment. Our task is to synthesize the visualization code given the natural language description and code from the preceding cells. To do so, our model takes in the following inputs:
• The natural language description for the visualization, which we extract from the natural language markdown above the target code cell containing the gold program in the notebook.
• The local code context, defined as a few code cells that immediately precede the target code cell. The number of cells to include is a tunable hyper-parameter described in Section 5.
• The code snippets related to dataframe manipulation that appear before the target code cell in the notebook but are not included in the local code context. We refer to such code as the distant dataframe context. When such context contains code that uses dataframes, it is part of the model input by default.
As mentioned in Section 1, unlike JuiCe, we also extract the code snippets related to dataframes, and annotate the dataframe schemas according to their syntax trees. As shown in Figure 1, knowing the column names in each dataframe is important for our task, as dataframes are often used for plotting.
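For illustration, one way such schema annotations can be recovered is to walk the syntax tree and collect the string subscripts applied to known dataframe variables. The sketch below is a simplified assumption of ours (not the authors' exact implementation) and assumes the Python 3.9+ `ast` node layout:

```python
import ast

def dataframe_columns(code, df_names):
    """Collect string subscripts used on known dataframe variables,
    e.g., df['Speed'] contributes 'Speed' to df's schema."""
    schema = {name: set() for name in df_names}
    for node in ast.walk(ast.parse(code)):
        if (isinstance(node, ast.Subscript)
                and isinstance(node.value, ast.Name)
                and node.value.id in schema):
            key = node.slice  # the subscript expression (Python 3.9+)
            if isinstance(key, ast.Constant) and isinstance(key.value, str):
                schema[node.value.id].add(key.value)
    return schema

code = "df['Catch Rate'].corr(df['Speed'])\nother = df['Attack']"
schema = dataframe_columns(code, ['df'])
```

Running this on the snippet above recovers the columns 'Catch Rate', 'Speed', and 'Attack' for the dataframe `df`.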

Code Canonicalization
One way to train our models is to directly use the plotting code in Jupyter notebooks as the ground truth. However, due to the variety of plotting APIs and coding styles, a model trained this way rarely predicts exactly the same code as written in the notebooks. For example, a scatter plot can be drawn in several ways in Matplotlib and similar libraries, e.g., plt.scatter(x, y), plt.plot(x, y, 'o'), or df.plot(kind='scatter', x='x', y='y') on a dataframe. Moreover, given that the natural language description is ambiguous, many plot attributes are hard to precisely predict. For example, from the context shown in Figure 1, there are many valid ways to specify the plot title, the marker style, axis ranges, etc. In our experiments, we find that when trained on raw target programs, fewer than 5% of predictions exactly match the ground truth; a similar phenomenon was observed earlier (Agashe et al., 2019).
Therefore, we design a canonical representation for plotting programs, which covers the core of plot generation. Specifically, we convert the plotting code into one of the following templates:
• LIB.PLOT_TYPE(X, {Y}*), where LIB is a plotting library, and PLOT_TYPE is the plot type to be created. The number of arguments may vary across PLOT_TYPEs, e.g., 1 for histograms and pie charts, and 2 for scatter plots.
• L_0 \n L_1 \n ... \n L_m, where each L_i is a plotting command in the above template, and \n are separators.
For example, a scatter plot drawn with plt (a commonly used abbreviation of matplotlib.pyplot) maps to the first template with LIB = plt and PLOT_TYPE = scatter. Plotting code in other libraries can be converted similarly.
The tokens that represent the plotted data, i.e., X and Y, are annotated in the code context as follows:
• VAR, when the token is a variable name, e.g., x and y in Figure 1.
• DF, when the token is a Pandas dataframe or a Python dictionary, e.g., df in Figure 1.
• STR, when the token is a column name of a dataframe or a key name of a Python dictionary, such as 'Catch Rate' and 'Speed' in Figure 1.
These annotations cover different types of data references. For example, a column in a dataframe is usually referred to as DF[STR], and sometimes as DF[VAR] where VAR is a string variable. In Section 4.2, we show how to utilize these annotations for hierarchical program decoding, where our decoder first generates a program sketch containing these token types in place of the plotted data, then predicts the actual plotted data.
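To make the annotation concrete, the toy function below maps a single plotting call to its sketch by replacing the data arguments with the type tokens. This string-based version (and the helper name `to_sketch`) is a hypothetical simplification of ours; the actual system operates on syntax trees:

```python
import re

def to_sketch(call, df_names):
    """Turn e.g. plt.scatter(df['Catch Rate'], y) into
    'plt.scatter(DF[STR], VAR)'. Simplified: assumes one call,
    no nested parentheses in the arguments."""
    lib, rest = call.split('.', 1)
    fname, args = rest.split('(', 1)
    out = []
    for a in args.rstrip(')').split(','):
        a = a.strip()
        m = re.match(r"(\w+)\['.*?'\]", a)   # subscript with string key
        if m:
            base = 'DF' if m.group(1) in df_names else 'VAR'
            out.append(f"{base}[STR]")
        elif a in df_names:
            out.append('DF')
        else:
            out.append('VAR')
    return f"{lib}.{fname}({', '.join(out)})"
```

For instance, `to_sketch("plt.scatter(df['Catch Rate'], df['Speed'])", {'df'})` yields the sketch `plt.scatter(DF[STR], DF[STR])`.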

Evaluation Metrics
Plot type accuracy. To compute this metric, we categorize all plots into several types, and a prediction is correct when it belongs to the same type as the ground truth. In particular, we consider the following categories: (1) scatter plots (e.g., generated by plt.scatter); (2) histograms (e.g., generated by plt.hist); (3) pie charts (e.g., generated by plt.pie); (4) a scatter plot overlaid by a line (e.g., the plot shown in Figure 1, or plots generated by sns.lmplot); (5) a plot including a kernel density estimate (e.g., plots generated by sns.distplot or sns.kdeplot); and (6) others, which are mostly plots generated by plt.plot.
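As a rough illustration of this categorization, the sketch below maps a set of plotting calls to one of the six categories. The exact dispatch rules here are our assumptions, not the authors' evaluation script:

```python
# Illustrative mapping from API names to plot-type categories.
PLOT_TYPES = {
    'scatter': 'scatter',
    'hist': 'histogram',
    'pie': 'pie',
    'lmplot': 'scatter+line',
    'distplot': 'kde',
    'kdeplot': 'kde',
}

def plot_type(calls):
    """Map a list of plotting calls to one of the six categories."""
    names = {c.split('.')[-1].split('(')[0] for c in calls}
    if {'scatter', 'plot'} <= names:
        return 'scatter+line'   # a scatter plot overlaid by a line
    for n in names:
        if n in PLOT_TYPES:
            return PLOT_TYPES[n]
    return 'others'             # mostly plt.plot
```

For example, a cell containing both `plt.scatter(x, y)` and `plt.plot(x, fit)` falls into the scatter-plus-line category, while `plt.plot(x, y)` alone falls into "others".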
Plotted data accuracy. This metric measures whether the predicted program selects the same data to plot as the ground truth. Unless otherwise specified, the ordering of the variables must match the ground truth as well, i.e., swapping the data plotted on the x and y axes results in a different plot.
Program accuracy. We consider a predicted program correct if both the plot type and the plotted data are correct. As discussed in Section 3.2, we do not evaluate the correctness of other plot attributes because they are mostly unspecified.
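The three metrics compose as in the following sketch (an illustration of ours, not the authors' evaluation code); the `ordered=False` flag corresponds to a relaxed, permutation-invariant comparison of the plotted data:

```python
def plot_type_correct(pred_type, gold_type):
    """Plot type accuracy: the prediction falls in the same category."""
    return pred_type == gold_type

def plotted_data_correct(pred_data, gold_data, ordered=True):
    """Plotted data accuracy: the same data is selected; by default the
    order must match too (swapping x and y gives a different plot)."""
    if ordered:
        return pred_data == gold_data
    return sorted(pred_data) == sorted(gold_data)

def program_correct(pred, gold):
    """Program accuracy: both plot type and plotted data are correct."""
    return (plot_type_correct(pred[0], gold[0])
            and plotted_data_correct(pred[1], gold[1]))
```

Under the default ordered comparison, predicting `('scatter', ['y', 'x'])` against gold `('scatter', ['x', 'y'])` counts as wrong, while the relaxed variant accepts it.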

PLOTCODER Model Architecture
In this section, we present PLOTCODER, a hierarchical model architecture for synthesizing visualization code from natural language and code context. PLOTCODER includes an LSTM-based encoder (Hochreiter and Schmidhuber, 1997) to jointly embed the natural language and code context, as well as a hierarchical decoder that generates API calls and selects data for plotting. We provide an overview of our model architecture in Figure 2.

NL-Code Context Encoder
PLOTCODER's encoder computes a vector representation for each token in the natural language description and the code context, where the code context is the concatenation of the code snippets describing dataframe schemas and the local code cells, as described in Section 3.1.
NL encoder. We build a vocabulary for the natural language tokens and train an embedding matrix for it. Afterwards, we use a bi-directional LSTM to encode the input natural language sequence (denoted as LSTM_nl), and use the LSTM's output at each timestep as the contextual embedding vector for each token.
Code context encoder. We build a vocabulary V_c for the code context and train an embedding matrix for it. V_c also includes the special tokens {VAR, DF, STR} used for sketch decoding in Section 4.2. We train another bi-directional LSTM (LSTM_c), which computes a contextual embedding vector for each token in a similar way to the natural language encoder. We denote the hidden state of LSTM_c at the last timestep as H_c.
NL-code linking. Capturing the correspondence between the code context and the natural language is crucial for good prediction performance. For example, in Figure 2, PLOTCODER infers that the dataframe column "age" should be plotted, as this column name is mentioned in the natural language description. Inspired by this observation, we design the NL-code linking mechanism to explicitly connect the embedding vectors of code tokens and their corresponding natural language words. Specifically, for each token in the code context that also occurs in the natural language, let h_c and h_nl be its embedding vectors computed by LSTM_c and LSTM_nl, respectively; we compute a new code token embedding vector as

h'_c = W_l [h_c; h_nl],

where W_l is a linear layer and [h_c; h_nl] is the concatenation of h_c and h_nl. When no natural language word matches the code token, h_nl is the embedding vector of the [EOS] token at the end of the natural language description.
When we include this NL-code linking component in the model, h'_c replaces the original embedding h_c for each token in the code context, and the new embedding is used for decoding. We observe that many informative natural language descriptions explicitly state the variable names and dataframe columns for plotting, which makes NL-code linking effective. Moreover, this component is especially useful when the variable names for plotting are unseen in the training set, in which case NL-code linking provides the only cue that these variables are relevant.
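The linking computation itself is a single linear map; a toy NumPy sketch, with illustrative dimensions and random weights standing in for the trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                              # toy hidden size
W_l = rng.normal(size=(d, 2 * d))  # the linear layer W_l

def linked_embedding(h_c, h_nl):
    """h'_c = W_l [h_c; h_nl]: fuse a code token's contextual embedding
    with the embedding of its matching NL word (or [EOS] if no match)."""
    return W_l @ np.concatenate([h_c, h_nl])

h_c = rng.normal(size=d)   # code token embedding from LSTM_c
h_nl = rng.normal(size=d)  # matching NL word embedding from LSTM_nl
```

The fused vector h'_c has the same dimensionality as h_c, so it can be substituted directly into the rest of the encoder.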

Hierarchical Program Decoder
We train another LSTM to decode the visualization code sequence, denoted as LSTM p . Our decoder generates the program in a hierarchical way. At each timestep, the model first predicts a token from the code token vocabulary that represents the program sketch. As shown in Figure 2, the program sketch does not include the plotted data. After that, the decoder predicts the plotted data, where it employs a copy mechanism (Gu et al., 2016;Vinyals et al., 2015) to select tokens from the code context.
First, we initialize the hidden state of LSTM_p with H_c, the final hidden state of LSTM_c, and the start token is [GO] for both sketch and full program decoding. At each step t, let s_{t-1} and o_{t-1} be the sketch token and output program token generated at the previous step. Note that s_{t-1} and o_{t-1} differ only when s_{t-1} ∈ {VAR, DF, STR}, in which case o_{t-1} is the actual data name of the corresponding type. Let es_{t-1} and eo_{t-1} be the embedding vectors of s_{t-1} and o_{t-1}, respectively, computed using the same embedding matrix as the code context encoder. The input of LSTM_p at step t is the concatenation of the two embedding vectors, i.e., [es_{t-1}; eo_{t-1}].
Attention. To compute attention vectors over the natural language description and the code context, we employ the two-step attention of (Iyer et al., 2018). Specifically, let hp_t be the hidden state of LSTM_p at step t. We first use hp_t to compute the attention vector over the natural language input using the standard attention mechanism (Bahdanau et al., 2015), denoted as attn_t. Then, we use attn_t to compute the attention vector over the code context, denoted as attp_t.
Sketch decoding. For sketch decoding, the model computes the probability distribution over all sketch tokens in the code token vocabulary V_c by applying a linear layer W_s to the attention vector, followed by a softmax. For hierarchical decoding, we do not allow the model to directly decode the names of the plotted data during sketch decoding, so s_t is selected only from the valid sketch tokens, such as library names, plotting function names, and the special tokens for plotted data representation in the templates discussed in Section 3.2.
Data selection. For s_t ∈ {VAR, DF, STR}, we use the copy mechanism to select the plotted data from the code context. Specifically, our decoder includes three pointer networks (Vinyals et al., 2015) for selecting data of type VAR, DF, and STR, respectively; they employ similar architectures but different model parameters.
We take variable name selection as an instance to illustrate our data selection approach using the copy mechanism. We first apply a linear layer W_v to the attention vector to obtain a query vector. For the i-th token c_i in the code context, let hc_i be its embedding vector; the model computes c_i's prediction probability via a softmax over the dot products between the query vector and the embeddings of the code context tokens. After that, the model selects the token with the highest prediction probability as the next program token o_t, and uses the corresponding embedding vectors of s_t and o_t as the input for the next decoding step of LSTM_p.
The decoding process terminates when the model generates the [EOF] token.
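Putting the decoding pieces together, one decoding step can be sketched as follows. This is a simplified NumPy illustration: it uses dot-product attention instead of the additive attention of Bahdanau et al. (2015), and it assumes the sketch softmax and the pointer query are both computed from the code-context attention vector attp_t, a detail the text above does not fully pin down:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(query, memory):
    """Dot-product attention: weighted sum of memory rows, with weights
    given by each row's similarity to the query."""
    return softmax(memory @ query) @ memory

# Toy dimensions and encoder outputs (illustrative only).
d, n_nl, n_code, n_sketch = 4, 5, 7, 9
H_nl = rng.normal(size=(n_nl, d))       # NL token embeddings from LSTM_nl
H_code = rng.normal(size=(n_code, d))   # code token embeddings from LSTM_c
W_s = rng.normal(size=(n_sketch, d))    # sketch-vocabulary projection
W_v = rng.normal(size=(d, d))           # query projection for the VAR pointer

def decode_step(hp_t):
    """One hierarchical decoding step, given the decoder state hp_t."""
    attn_t = attend(hp_t, H_nl)      # two-step attention: NL first...
    attp_t = attend(attn_t, H_code)  # ...then the code context
    p_sketch = softmax(W_s @ attp_t)            # sketch token distribution
    p_copy = softmax(H_code @ (W_v @ attp_t))   # pointer (copy) distribution
    return p_sketch, p_copy
```

Both outputs are proper probability distributions: one over the sketch vocabulary, one over the code-context tokens that the pointer network may copy.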

Experiments
In this section, we first describe our dataset for visualization code synthesis, then introduce our experimental setup and discuss the results.

Dataset Construction
We build our benchmark upon the JuiCe dataset, and select the notebooks that call plotting APIs, including matplotlib.pyplot (plt), pandas.DataFrame.plot, seaborn (sns), ggplot, bokeh, plotly, geoplotlib, and pygal. Over 99% of the samples use plt, pandas.DataFrame.plot, or sns. We first extract plot samples from the original dev and test splits of JuiCe to construct Dev (gold) and Test (gold). However, the gold splits are too small to obtain statistically meaningful quantitative results. Therefore, we extract around 1,700 Jupyter notebooks of homework assignments and exams from JuiCe's training set, and split them roughly evenly into Dev (hard) and Test (hard). All remaining plot samples from the JuiCe training split are included in our training set. The length of the visualization programs to be generated varies between 6 and 80 tokens, but the code context is typically much longer. We summarize the dataset statistics in Table 1.
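The notebook filtering step can be approximated by a pattern match over cell contents; a hypothetical sketch of ours (an actual pipeline might use more precise AST-based detection):

```python
import re

# Library prefixes covered by the benchmark selection.
PLOT_LIBS = ['plt', 'sns', 'ggplot', 'bokeh', 'plotly', 'geoplotlib', 'pygal']

def calls_plotting_api(cell):
    """Heuristically detect whether a code cell calls a plotting API."""
    pattern = r'\b(?:' + '|'.join(PLOT_LIBS) + r')\.\w+\s*\('
    # '.plot(' also catches pandas.DataFrame.plot calls such as df.plot(...)
    return bool(re.search(pattern, cell)) or '.plot(' in cell
```

For example, cells containing `plt.scatter(x, y)` or `df.plot(kind='bar')` are kept, while a cell with only `print(df.head())` is filtered out.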

Evaluation Setup
Implementation details. Unless otherwise specified, for the input specification we include K = 3 previous code cells as the local context, which usually provides the best accuracy. We set 512 as the length limit for both the natural language and the code context. For all model architectures, we train them for 50 epochs, and select the best checkpoint based on the program accuracy on the Dev (hard) split. More details are deferred to Appendix A.
Baselines. We compare the full PLOTCODER against the following baselines: (1) -Hierarchy: the encoder is the same as in the full PLOTCODER, but the decoder directly generates the full program without predicting the sketch. (2) -Link: the encoder does not use NL-code linking, and the decoder is not hierarchical. (3) LSTM: the model uses neither NL-code linking, the copy mechanism, nor hierarchical decoding. The encoder still uses two separate LSTMs to embed the natural language and the code context, which performs better than the LSTM baseline in prior work (Agashe et al., 2019). (4) +BERT: we use the same hierarchical decoder as the full model, but replace the encoder with a Transformer architecture (Vaswani et al., 2017) initialized from a pre-trained model, and fine-tune the encoder together with the rest of the model. We evaluated two pre-trained models. One is RoBERTa-base, an improved version of BERT-Base (Devlin et al., 2018) pre-trained on a large text corpus. The other is CodeBERT (Feng et al., 2020), which has the same architecture as RoBERTa-base but is pre-trained on GitHub code in several programming languages including Python, and has demonstrated good performance on code retrieval tasks. To demonstrate the effectiveness of the target code canonicalization discussed in Section 3.2, we also compare against models trained directly on the raw ground-truth code from the same set of Jupyter notebooks.

Results
We present the program prediction accuracies in Table 2. First, training on the canonicalized code significantly boosts the performance for all models, suggesting that canonicalization improves data quality and hence prediction accuracies. When trained with target code canonicalization, the full PLOTCODER significantly outperforms other model variants on different data splits. On the hard data splits, the hierarchical PLOTCODER predicts 35% of the samples correctly, improving over the non-hierarchical model by 3 − 4.5%. Meanwhile, NL-code linking enables the model to better capture the correspondence between the code context and the natural language, and consistently improves the performance when trained on canonicalized target code. Without the copy mechanism, the baseline LSTM cannot predict any token outside of the code vocabulary. Therefore, this model performs worse than other LSTM-based models, especially on plotted data accuracies, as shown in Table 3.
Interestingly, while our hierarchical decoding, NL-code linking, and copy mechanism are mainly designed to improve the prediction accuracy of the plotted data, as shown in Table 4, we observe that the plot type accuracies of our full model are also mostly better, especially on the hard splits. To better understand this, we break down the results by plot type, and observe that the most significant improvement comes from the predictions of scatter plots ("S") and plots in the "Others" category. We posit that these two categories constitute the majority of the dataset, and the hierarchical model learns to better categorize plot types from a large number of training samples. In addition, we observe that the full model does not always perform better than other baselines on data splits of small sizes, and the difference mainly comes from the ambiguity in the natural language description. We defer more discussion to Section 5.4. Also, using BERT-like encoders does not improve the results. This might be due to mismatches between the pre-training data distribution and vocabulary and those of our task. Specifically, RoBERTa is pre-trained on English passages, which include few visualization-related descriptions and code comments. Therefore, the subword vocabulary utilized by RoBERTa breaks important keywords for visualization, e.g., "scatterplots" and "histograms", into multiple subwords, which limits model performance, especially for plot type prediction. CodeBERT improves over RoBERTa, but still does not outperform the LSTM-based models, which may again be due to vocabulary mismatch. As a result, in Table 4, the plot type accuracies of both models with BERT-like encoders are considerably lower than those of the LSTM-based models.
To better understand the plotted data prediction performance, in addition to the default plotted data accuracy that requires the data order to be the same as the ground truth, we also evaluate a relaxed version without ordering constraints. Note that the ordering includes two factors: (1) the ordering of the plotted data for the different axes; and (2) the ordering of plots when multiple plots are included. We observe that the ordering issue happens for around 1.5% of the samples, and is more problematic for scatter plots ("S") and "Others". Figure 3 shows sample predictions where the model selects the correct set of data to plot, but the ordering is wrong. Although sometimes the natural language explicitly specifies which axes to plot (e.g., Figure 3(a)), such descriptions are mostly implicit (e.g., Figure 3(b)), making the ordering hard for the model to learn. Full results on different plot types are in Section 5.4.

The Effect of Different Model Inputs
To evaluate the effect of including different input specifications, we present the results in Table 5. Specifically, -NL means the model input does not include the natural language, and -Distant DFs means the code context only includes the local code cells. Interestingly, even without the natural language description, PLOTCODER correctly predicts a considerable number of samples. Figure 4 shows sample correct predictions that do not rely on the natural language description. To predict the plotted data, a simple yet effective heuristic is to select variable names appearing in the most recent code context. This is also one possible cause of the wrong data ordering prediction in Figure 3(a); in fact, that prediction becomes correct if we swap the order of the assignment statements for the variables age and duration in the code context.

Figure 3: Examples of predictions where the model selects the correct set of data to plot, but the order is wrong. The natural language for panel (b) reads: "This graph provides more evidence that the higher a state's participation rates, the lower that state's average scores are likely to be. The higher the participation rate, the lower the expected average verbal scores."

Meanwhile, we evaluated PLOTCODER while varying the number of local code cells K. The results show that the program accuracies converge or start to decrease when K > 3 for the different models, as observed in (Agashe et al., 2019). However, the accuracy drop of our hierarchical model is much less noticeable than that of the baselines, suggesting that our model is more resilient to the addition of irrelevant code context. See Appendix B for more discussion.

Natural language examples from the figures include: "Create a scatter plot of the observations in the 'credit' dataset for the attributes 'Duration' and 'Age' (age should be shown on the x-axis)." and "Problem 5. Age groups (1 point). Create a histogram of all people's ages. Use the default settings. Add the label "Age" on the x-axis and "Count" on the y-axis."

Prediction Results Per Plot Type
We present the breakdown results per plot type in Tables 6 and 7. To better understand the plotted data prediction performance, in addition to the default plotted data accuracy that requires the data order to be the same as the ground truth, we also evaluate a relaxed version without ordering constraints, described as permutation invariant in Table 7. We compute the results on Test (hard), which has more samples per plot type than the gold splits. Compared to the non-hierarchical models, the most significant improvement comes from the predictions of scatter plots ("S") and plots in "Others" category. We posit that these two categories constitute the majority of the dataset, and the hierarchical model learns to better categorize plot types from a large number of training samples. The accuracy of the hierarchical model on some categories is lower than the baseline's, but the difference is not statistically significant since those categories only contain a few examples. A more detailed discussion is included in Appendix C.

Error Analysis
To better understand the challenges of our task, we conduct a qualitative error analysis and categorize the main reasons of error predictions. We investigate all error cases on Test (gold) split for the full hierarchical model, and present the results in Table 8. We summarize the key observations below, and defer more discussion to Appendix E.
• Around half of the error cases are due to the ambiguity of the natural language description. (1-3)
• About 10% of the samples require longer code context for prediction, because the program selects the plotted data from distant code context that exceeds the input length limit. (4)
• Sometimes the model generates programs that are semantically equivalent to but syntactically different from the ground truth, which can happen when two variables or dataframes contain the same data. (5)
• Besides understanding complex natural language descriptions, as shown in Figure 3, another challenge is to understand the code context and reason about the data stored in different variables. For example, in Figure 5, although both dataframes income_data and married_af_peoples include the age column, the model must infer that married_af_peoples is a subset of income_data, and thus it should select income_data to plot the statistics of people from all groups. (6-7)

Table 8: Error categories on the Test (gold) split.

Error Category                               %
(1) NL only suggests the plot type           28.57
(2) NL only suggests the plotted data        9.52
(3) NL has no plotting information           9.52
(4) Need more code context                   9.52
(5) Semantically correct                     14.29
(6) Challenging NL understanding             19.05
(7) Challenging code context understanding   9.52
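The subset relationship between the two dataframes in Figure 5 can be reconstructed hypothetically with plain records (the dataframe names follow the paper, written with underscores; the column names and values are illustrative, not taken from the notebook):

```python
# Hypothetical reconstruction of the Figure 5 situation using plain records
# instead of pandas dataframes (column names and values are illustrative).
income_data = [
    {"age": 25, "marital_status": "Never-married"},
    {"age": 38, "marital_status": "Married-AF-spouse"},
    {"age": 52, "marital_status": "Married-AF-spouse"},
    {"age": 41, "marital_status": "Divorced"},
]
married_af_peoples = [r for r in income_data
                      if r["marital_status"] == "Married-AF-spouse"]

# Both collections expose an "age" field, but only income_data covers all
# groups, so a plot of all people's ages must read from income_data.
all_ages = [r["age"] for r in income_data]
subset_ages = [r["age"] for r in married_af_peoples]
```

The model has to make this containment inference from the filtering code alone, without executing it.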

Conclusion
In this paper, we conduct the first study of visualization code synthesis from natural language and programmatic context. We describe PlotCoder, a model architecture that includes an encoder that links the natural language description with the code context, and a hierarchical program decoder that synthesizes the plotted data from the code context and dataframe items. Results on real-world Jupyter notebooks show that PlotCoder can synthesize visualization code for different plot types, and outperforms various baseline models.

A Implementation Details
For the model input, we select the suffix of the code sequence when it exceeds the length limit, and we select the prefix for the natural language. To construct the vocabularies, we include natural language words that occur at least 15 times in the training set, and code tokens that occur at least 1,000 times, so that each vocabulary includes around 10,000 tokens. We include an [UNK] token in both vocabularies, which is used to encode all input tokens that do not appear in our vocabularies.
The model parameters are randomly initialized within [−0.1, 0.1]. Each LSTM has 2 layers, and a hidden size of 512. The embedding size of all embedding matrices is 512, and the hidden size of the linear layers is 512. For training, the batch size is 32, the initial learning rate is 1e-3, with a decay rate of 0.9 after every 6,000 batch updates. The dropout rate is 0.2, and the norm for gradient clipping is 5.0.
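The step-wise decay schedule described above can be written as a small helper (a sketch of the schedule only, not the full training loop):

```python
def learning_rate(step, base_lr=1e-3, decay=0.9, decay_every=6000):
    """Step-wise decay: multiply the learning rate by `decay` after every
    `decay_every` batch updates (hyper-parameter values from Appendix A)."""
    return base_lr * decay ** (step // decay_every)
```

For example, the rate stays at 1e-3 for the first 6,000 updates and drops to 9e-4 afterwards.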
For models using the Transformer architecture as the encoder, we use the pre-trained RoBERTa-base and CodeBERT from their official repositories. The hyper-parameters are largely the same as for the LSTM-based models, except that we add a linear learning rate warmup for the first 6,000 training steps, which is common practice for fine-tuning BERT-like models.

B Training with Varying Number of Contextual Code Cells
As discussed in Section 5.3.1, we provide the results of including different numbers of contextual code cells as the model input in Figure 6. We also evaluate the upper bounds of program accuracies for different values of K, where we consider an example to be predictable if all plotted data in the target program are covered in the input code context. We observe that including dataframe manipulation code from distant code cells improves the coverage, especially when K is small.
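The upper-bound computation can be sketched as follows; token-level set membership is our simplification of "covered in the input code context", and the function names are ours:

```python
def is_predictable(target_data, context_tokens):
    """An example counts toward the upper bound only if every plotted-data
    item in the target program also appears in the input code context."""
    ctx = set(context_tokens)
    return all(item in ctx for item in target_data)

def coverage(examples):
    """Fraction of (target_data, context_tokens) pairs that are predictable,
    i.e., the upper bound on program accuracy for a given context size K."""
    hits = sum(is_predictable(t, c) for t, c in examples)
    return hits / len(examples)
```

Truncating the context to fewer cells shrinks `context_tokens`, which can only lower this bound.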

C Detailed Analysis on Results Per Plot Type
In Section 5.4, we present the breakdown results per plot type in Tables 6 and 7, where we observed that "Scatter" and "Others" constitute the majority of the dataset, and the hierarchical model learns to better categorize plot types from a large number of training samples.
Note that for categories where the hierarchical model does not perform better than the baselines, even if the accuracy differences are noticeable, the numbers of correct predictions do not differ much. For example, among the 13 samples in the "Pie" category, the hierarchical model correctly classifies 8 samples, while the non-hierarchical version makes 10 correct predictions. When looking at the predictions, we observe that the 2 differing predictions are mainly due to the ambiguity of the natural language descriptions. Specifically, the text descriptions are "The average score of group A is better than average score of group B in 51% of the state" and "I am analyzing the data of all male passengers". In fact, for both examples, the hierarchical model still generates a program including the plotted data in the ground truth. However, it wrongly selects plt.bar as the plotting API for the former sample, and selects plt.scatter for the latter, where it additionally selects another variable for the x-axis. For these 2 samples, we observe that the code context includes plotting programs that use other data to generate pie charts; the non-hierarchical model follows a heuristic of selecting the same plot type as in the code context when the natural language description provides no cue, while the hierarchical model selects plot types that occur more frequently in the training distribution. A similar phenomenon holds for other categories and data splits with a small number of examples.

D Other Plot Types
In the "Others" category discussed in Section 3.3, besides the plots generated by plt.plot, there are also other plot types, with much smaller data sizes than plt.plot. In Table 9, we present the breakdown accuracies of some plot types, which constitute the largest percentages in the "Others" category excluding plt.plot samples. Specifically, around 4% samples use boxplot, and each of the other 3 plot types include around 1% samples. Due to the lack of data for such plot types, the results are much lower than the overall accuracies of all plot categories, but still non-trivial.  Table 9: Breakdown accuracies of plots in "Others" category on Test (hard), using the full hierarchical model.

E More Discussion of Error Analysis
As discussed in Section 5.4.1, the lack of information in natural language descriptions is the main reason for a large proportion of wrong predictions (categories 1-3 in Table 8).
• Many natural language descriptions only mention the plot type, e.g., "Make a scatter plot", which is one reason that the plot type accuracy is generally much higher than the plotted data accuracy. (1)
• Sometimes the text only mentions the plotted data without specifying the plot type, e.g., "Plot the data x1 and x2", where both plt.plot(x1, x2) and plt.scatter(x1, x2) are possible predictions, and the model needs to infer the plot type from the code context. (2)
• The text description may include no plotting information at all, e.g., "Localize your search around the value you found above", where the model needs to infer which variables are search results and could be plotted. (3)

We consider several directions for addressing the different error categories as future work. To mitigate the ambiguity of natural language descriptions, we could incorporate additional program specifications such as input-output examples. Input-output examples would also help with evaluating execution accuracy, which considers all semantically correct programs as correct predictions even if they differ from the ground truth. However, most Jupyter notebooks from GitHub do not contain sufficient execution information, e.g., many of them load external data for plotting, and the data sources are not public. Therefore, developing techniques to automatically synthesize input-output examples is a promising future direction. Designing new models for code representation learning is another direction, which could help address the challenge of embedding long code context.
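The plot-type ambiguity in category (2) above can be made concrete: both candidate programs plot exactly the same data, and only the API call differs. The helper below is a toy illustration of that point, not the paper's evaluation code:

```python
# Two syntactically different candidates for "Plot the data x1 and x2":
candidates = ["plt.plot(x1, x2)", "plt.scatter(x1, x2)"]

def plotted_data(call):
    """Extract the argument list of a single plotting call
    (a deliberately simplified toy parser, for illustration only)."""
    inside = call[call.index("(") + 1 : call.rindex(")")]
    return [arg.strip() for arg in inside.split(",")]
```

Both candidates yield the same plotted data, so the natural language alone cannot distinguish them; the model must fall back on cues from the code context.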