Reading StackOverflow Encourages Cheating: Adding Question Text Improves Extractive Code Generation

Answering a programming question with only its title is difficult, as salient contextual information is left out. To address this, we present a corpus of over 40,000 StackOverflow question texts to be used in conjunction with the corresponding intents from the CoNaLa dataset (Yin et al., 2018). Using both the intent and the question body, we use BART to establish a baseline BLEU score of 34.35 for this new task. We then find a further improvement of 2.8% by combining the mined CoNaLa data with the labeled data, reaching a BLEU score of 35.32. We then evaluate the prior state-of-the-art CoNaLa models with this additional data. We find that our proposed method of using the body and the mined data beats the previous state of the art by 71.96% in BLEU. Finally, we perform ablations showing that BART is an unsupervised multimodal learner and examine its extractive behavior.


Introduction
The goal of semantic parsing is to translate a Natural Language (NL) utterance into its logical components. There is a large body of research on applying semantic parsing to source code generation in a multitude of domain-specific languages such as lambda calculus and SQL (Dahl et al., 1994; Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Ling et al., 2016; Xiao et al., 2016; Rabinovich et al., 2017; Dong and Lapata, 2018; Guo et al., 2019; Hwang et al., 2019; Tabassum et al., 2020). However, the task of translating an NL utterance into a general-purpose programming language has proven to be more challenging. A significant contributing factor is the difficulty of acquiring quality data, owing to the domain knowledge required in the annotation process.
Despite this, the past few years have seen a large number of datasets released for different text-to-code related tasks (Ling et al., 2016; Yu et al., 2018; Lu et al., 2021). Some datasets, such as CodeSearchNet (Husain et al., 2019), contain snippets from a multitude of different languages. Others focus on distinct tasks within a specific language, such as JuICe (Agashe et al., 2019), which contains executable Python programming assignments. Utilizing these corpora, prior works (Suhr et al., 2018; Neubig, 2017, 2018; Sun et al., 2019; Hayati et al., 2018; Yin and Neubig, 2019) have found success with a large variety of model architectures. These methods, however, struggle with domain-agnostic, open-ended code generation in general-purpose languages. One idea to combat this is to utilize large pretrained language models. Transformers (Vaswani et al., 2017) have demonstrated that they can be both few-shot (Brown et al., 2020) and unsupervised multitask (Radford et al., 2019) learners, and they have been successfully applied to programming language tasks. CodeBERT achieved strong performance on the CodeSearchNet task through pretraining on bimodal NL comment and code pairs (Feng et al., 2020), while Sun et al. (2019) used abstract syntax trees (ASTs) and transformers to achieve state-of-the-art performance on the HearthStone benchmark (Ling et al., 2016). Roziere et al. (2021) proposed the deobfuscation pretraining task to incorporate structural features of code into transformer models without the use of ASTs. More recently, Shin et al. (2021) explored the capabilities of large pretrained language models as few-shot semantic parsers.

Figure 1: Overview of our approach. From the combined annotated + mined set, we concatenate the intent and question body as inputs to BART (Lewis et al., 2020) and use beam search for generation.
Yet open-domain programming question answering on sites such as StackOverflow (SO) has remained an elusive goal. Yin et al. (2018) created an annotated dataset from the site in which intent and answer-snippet pairs were automatically mined from the questions. They then had crowd workers rewrite the intents to better reflect the corresponding code. The current state of the art was achieved by pretraining an LSTM model on resampled API and mined data (Xu et al., 2020). Subsequent work conducted an empirical study on the effectiveness of using a code generation model in an IDE plugin and found that developers largely had favorable opinions of their experience (Xu et al., 2021). An inherent issue with the approach of Xu et al. (2020), and more fundamentally with the dataset and the parameters of the task, is that the intent can only contain a limited amount of information. Arriving at the correct answer from the intent "add a new axis to array a" requires not only the disambiguation of data types for variable a, but also the use of multiple distinct library-specific concepts. Further, this must be accomplished while maintaining syntactically correct code and the proper order of arguments. However, neither the original title nor the rewritten intent contains the information necessary to accomplish this. Although the previous state-of-the-art model by Xu et al. (2020) uses abstract syntax trees (ASTs) to guarantee syntactically valid Python code, it incorrectly generates a[(-1),:]=a. One potential remedy would be to increase the amount of training data but, as discussed previously, acquiring high-quality annotated code generation data is especially difficult.
Motivated by the limits on the amount of information a given intent can contain and the substantial difficulty of gathering more labeled data, we utilize the multimodal text from the question bodies provided by the StackExchange API. We take advantage of the strong performance of transformer models to beat the previous state of the art by 3.06 BLEU. We ensure a fair comparison by training the models from prior works with the extra data to adequately evaluate our proposed method. When all models are trained with the extra data, BART beats the previous state of the art by 15.12 BLEU.
Our main contributions are the following: • Expanding upon the original CoNaLa dataset (Yin et al., 2018) to include the multimodal textual question bodies and thus the pertinent contextual information they contain, such as inputs, outputs, and required libraries.
• Demonstrating that BART does not rely on a single modality, but rather achieves its best performance on our dataset when all modalities are included. This indicates at least a shallow understanding of both natural and programming language as well as how they are related in the context of SO questions.
• Conducting experiments revealing that BART's struggle to generate syntactically correct code is likely a result of its tendency to be extractive rather than generative in the task of text-to-code generation.

Methodology
As detailed in Figure 1, our overarching approach is to: (1) gather textual bodies from SO for both the annotated and mined examples in the CoNaLa corpus, (2) use the concatenated intents and question bodies as inputs for a large pretrained language model, and (3) use beam search to generate the answer code snippet.
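As a minimal sketch, step (2) of the approach above can be written as follows. The separator token and the example strings are illustrative assumptions, not the paper's verbatim preprocessing.

```python
def build_model_input(intent: str, question_body: str, sep: str = " </s> ") -> str:
    """Concatenate the annotated intent with the StackOverflow question
    body to form a single source sequence for BART's encoder."""
    return intent.strip() + sep + question_body.strip()

# Hypothetical example in the spirit of Figure 2
source = build_model_input(
    "add a new axis to array `a`",
    "I have an ndarray `a` and want to insert a new axis before the first one.",
)
```

The resulting string is then tokenized (truncated to the model's input limit) and fed to the encoder, with beam search used on the decoder side.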

StackOverflow Data
Every example e_i ∈ E from the CoNaLa dataset (Yin et al., 2018) comprises an intent x_i ∈ X that concisely summarizes what the poster wants and a snippet of Python code y_i ∈ Y that implements x_i. Crowdsourcing was used to rewrite a selection of the mined intents to better reflect the snippet and to ensure that the snippet was indeed a correct answer. As discussed, these intents are limited in the amount of information they can contain. The intent "add a new axis to array a" from Figure 2 could refer to a wide variety of different Python objects, ranging from the default list to the Tensor object from PyTorch. The full question, or even its tags or title, is typically enough for a human to disambiguate the correct library to use. But the annotated intent lacks this crucial information, as it is rather difficult to design an annotation task for SO data. We address this problem directly by using the additional data found in the SO question. In Figure 2 there are four direct mentions of the NumPy library: two in the question body and one each in the tags and the title. Further, there is a direct mention of the ndarray data type from NumPy. It is, therefore, rather intuitive to include this additional data as input in the hope that it improves answer generation performance. Although both the tags and the title provide salient information, the focus of this paper is only on using the noisy textual question bodies. Therefore, for every example e_i, the input becomes the concatenation of x_i and the body q_{x_i} ∈ Q from the original SO question. It is important to note that |Q| ≤ |E|, as a single question can have many examples while every question is, by definition, unique.

Unsupervised Modality Learning
Multiple modalities are present in the textual body of a given question. These can range from embedded images to messages from administrators (or upset users) stating that the question is a duplicate of some tangentially related post that does not have an answer. While these are useful to readers, we limit our focus to three modalities: code blocks, inline code, and NL. These modalities are marked in Figure 2 in blue, green, and red, respectively. Ideally, we would prefer to leave in the HTML tags to serve as sentinel tokens but, looking at Figure 2, one immediately finds that the poster forgot to mark _to_col as inline code. We therefore remove all HTML tags from the inputs, creating an unsupervised learning environment, and propose that a transformer model will learn each of the three modalities as well as the relationships between them. We use BART (Lewis et al., 2020) because its pretraining focuses on denoising textual data and, to the best of our knowledge, it has had minimal exposure to code. We use HuggingFace's (Wolf et al., 2020) BartForConditionalGeneration model, which is the default BART encoder-decoder with a linear layer and bias for outputs.
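A minimal sketch of the tag-removal step described above, assuming a simple StackOverflow HTML body and using only the standard library (the actual pipeline may differ):

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text content, discarding every HTML tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(body: str) -> str:
    """Remove all HTML tags (e.g. <pre><code>, <code>, <p>) so that code
    blocks, inline code, and NL appear as one unmarked stream."""
    parser = TagStripper()
    parser.feed(body)
    return "".join(parser.parts)


stripped = strip_html(
    "<p>Use <code>np.newaxis</code>:</p><pre><code>a[None, :]</code></pre>"
)
# After stripping, the three modalities are no longer explicitly marked,
# which is exactly the unsupervised setting described above.
```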

Unlabeled Data
We followed Xu et al. (2020) in using large amounts of the mined but not annotated data. Unlike Xu et al. (2020), however, we do not use this data for pretraining. Instead, we combine it with the annotated data in our main training and validation sets. By adding more questions to the training set, we directly increase the probability that the model encounters a larger and more representative distribution of libraries. Intuitively, this also reduces the variance between experiments, as we reduce the dependency on the specific examples used in the training and validation sets. This variance reduction is especially useful when working with a small dataset such as CoNaLa.
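The pooling of annotated and mined examples described above can be sketched as follows; the field names and the single shared validation fraction are illustrative assumptions rather than the paper's exact split procedure.

```python
import random


def combine_splits(annotated, mined, val_fraction=0.1, seed=0):
    """Pool annotated and mined examples, then draw a validation split
    from the combined pool so both sources appear in training and
    validation."""
    pool = list(annotated) + list(mined)
    rng = random.Random(seed)
    rng.shuffle(pool)
    n_val = int(len(pool) * val_fraction)
    return pool[n_val:], pool[:n_val]  # train, validation


train, val = combine_splits(
    [{"intent": "i", "snippet": "s"}] * 90,   # stand-in annotated examples
    [{"intent": "j", "snippet": "t"}] * 10,   # stand-in mined examples
)
```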

Datasets
CoNaLa (Yin et al., 2018) is an open-domain text-to-code generation task constructed from SO questions. It has 2,879 annotated NL-code pairs and more than 590K mined pairs from over 40,000 unique SO questions.

StackOverflow Data: For every unique question in both the annotated and mined sets, we gather additional data from the StackExchange API. As discussed in subsection 2.1, we only use the question body as input. Therefore, the task is to generate a valid answer snippet from both the intent and the textual body. Detailed statistics for this dataset are given in Table 1 and Table 2.

Methods
We removed 238 (~10%) examples from the training set to form the validation set. We then followed Xu et al. (2020) in using the mined examples ranked by the probability that the NL-code pair is valid. However, we only used 10,000 samples rather than the 100,000 Xu et al. (2020) used. From these, we removed 1,000 for validation. For all tests of our model with the mined data, we combine the two training sets and the two validation sets into one each. Every experiment and test in this work was conducted using Google's Colab Pro service. It afforded us the ability to use 512 input tokens with a batch size of 16; more importantly, we were able to use P100 and V100 graphics cards. Following that, we perform an ablation study using BART and the different components of our approach. Every ablation is run five separate times with different seeds and validation splits. For each test, the model with the lowest validation loss is used in the evaluation. Each test is run for ten epochs, as we consistently observed overfitting after five to eight epochs.
Because we introduce new data at inference, we needed to ensure a fair comparison between our methods and previous work. To this end, we run the prior works with the question bodies as inputs. However, when testing Xu et al. (2020) with the question bodies, we limited the amount of mined data in pretraining to 10,000 samples instead of 100,000. This was done due to Google Colab's execution time limits, as it took upwards of four hours for each run of Xu et al. (2020) with only 10,000 samples. (Some questions were deleted from StackOverflow in both the annotated and mined sets, so we could not use those.)

Metrics
We measure the corpus-level BLEU score of the generated code snippets with the same postprocessing methods and smoothing as prior work on CoNaLa. We evaluate our ablations by comparing the corpus BLEU score and the unigram, bigram, and trigram precisions. Finally, we calculate the percentage of test examples for which our model generated a syntactically valid Python snippet.
For the previous state of the art, we also report the Oracle BLEU proposed by Yin and Neubig (2019). This is calculated by choosing the candidate snippet s_i with the highest sentence-level BLEU score out of the n generated snippets. Formally, given the candidate list C = [c_1, ..., c_n] and ground truth y_i,

s_i = argmax_{c_j ∈ C} BLEU(c_j, y_i).    (1)

Furthermore, we want to quantify how much our model relies on the body of the question, i.e., "cheats." To do this, we calculate the cheating for the generated snippets s_i ∈ [s_1, ..., s_N] = S and ground truths y_i ∈ [y_1, ..., y_N] = Y with respect to the input texts b_i ∈ [b_1, ..., b_N] = B. Given a function m(a, b) that calculates a textual similarity metric m, we define the cheating with respect to m as

C_m = (1/N) Σ_{i=1}^{N} (m(s_i, b_i) − m(y_i, b_i)).    (2)

If the model is heavily "cheating" from the input, then m(s_i, b_i) ≫ m(y_i, b_i), which leads to a large C_m. The quantity C_m is, by design, similar to a standard mean squared error, except that the difference is not squared, so that it distinguishes outputs that are less similar to the input from those that are more similar than the ground truth is.
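The oracle selection above can be sketched in a few lines. Any sentence-level scoring function can stand in for the smoothed sentence BLEU used in the paper; for a self-contained demo, the `overlap` metric below is a stand-in of our own, not the paper's metric.

```python
def oracle_snippet(candidates, reference, metric):
    """Return the candidate maximizing metric(candidate, reference),
    i.e. s_i = argmax_{c_j in C} metric(c_j, y_i) as in Equation (1)."""
    return max(candidates, key=lambda c: metric(c, reference))


def overlap(cand: str, ref: str) -> float:
    """Jaccard token overlap: a simple stand-in similarity metric."""
    a, b = set(cand.split()), set(ref.split())
    return len(a & b) / max(len(a | b), 1)


best = oracle_snippet(
    ["a[(-1), :] = a", "a = a[None, :]", "print(a)"],  # hypothetical beams
    "a = a[None, :]",                                  # ground truth
    overlap,
)
```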
For the metric function m, we use BLEU and ROUGE (Lin, 2004). For the former, we take the bigram (C_BB) and trigram (C_BT) precision from BLEU. For ROUGE, we use bigram ROUGE (ROUGE-2 / C_R2) and the longest common subsequence (ROUGE-L / C_RL). The intuition behind these metrics is that the unigram precision is very likely to be high: the answer to a question must address the contents of said question, leading to shared tokens between inputs and outputs. However, this probability should drop sharply when considering longer n-grams. Therefore, the similarity between n-grams for n > 1 should indicate the model's reliance on the inputs.
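As a minimal sketch of the cheating score C_m defined above: the mean (unsquared) difference between metric(generated, body) and metric(ground truth, body). Bigram precision is shown as a simple self-contained choice of m; the paper's exact smoothing is not reproduced here.

```python
def bigram_precision(candidate: str, reference: str) -> float:
    """Fraction of the candidate's bigrams that also occur in the
    reference (here, the question body)."""
    def bigrams(s):
        toks = s.split()
        return [tuple(toks[i:i + 2]) for i in range(len(toks) - 1)]
    cand, ref = bigrams(candidate), bigrams(reference)
    if not cand:
        return 0.0
    return sum(1 for g in cand if g in ref) / len(cand)


def cheating(generated, truths, bodies, metric):
    """C_m = (1/N) * sum_i [ metric(s_i, b_i) - metric(y_i, b_i) ]."""
    return sum(
        metric(s, b) - metric(y, b)
        for s, y, b in zip(generated, truths, bodies)
    ) / len(generated)
```

A generated snippet copied verbatim from the body scores near 1 against it, so if the ground truth shares nothing with the body, C_m approaches 1 — the signature of heavy "cheating."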

Implementation
We implemented our model with Python and HuggingFace's transformers library (Wolf et al., 2020). We used a BART model with a linear layer and a separate bias for text generation. We utilized the smallest available BART model from FAIR, facebook/bart-base (https://huggingface.co/facebook/bart-base). For training, we rely on HuggingFace's trainer and their implementation of the learning rate scheduler. We used AdamW (Loshchilov and Hutter, 2017) as our optimizer with a learning rate of 5e-5, a linear learning rate scheduler, and a warmup ratio of 0.05. Finally, for generation, we used beam search with four beams, early stopping, and a length penalty of 0.9.
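As a configuration sketch, the generation setup above roughly corresponds to the following HuggingFace code. The checkpoint name and generation arguments match those stated; the input text and `max_length` for the output are illustrative assumptions, and running the sketch requires downloading the pretrained weights.

```python
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Intent concatenated with the question body, truncated to 512 tokens
inputs = tokenizer(
    "add a new axis to array `a` </s> I have a numpy ndarray `a` ...",
    truncation=True, max_length=512, return_tensors="pt",
)
output_ids = model.generate(
    **inputs,
    num_beams=4,          # beam search with four beams
    early_stopping=True,
    length_penalty=0.9,
    max_length=128,       # assumed output budget, not stated in the paper
)
snippet = tokenizer.decode(output_ids[0], skip_special_tokens=True)
```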

Results
We list the previous state-of-the-art BLEU scores for the CoNaLa dataset as well as the performance of our models in Table 3. Using the intent and question bodies achieved a BLEU score of 34.35±1.01. This was further increased to 35.32±0.42 by including the mined data in the training and validation set.
To better understand our model, we perform ablation tests and report their results in Table 4. When comparing our top performance with the previous top performance, regardless of the data used, our model beats the previous state of the art by 3.40 BLEU, a 10.54% increase. Notably, our model outperforms the previous SoTA by 14.78 BLEU, a 71.96% increase, when only comparing the experiments with the question body. Furthermore, BART with the mined data and question bodies beats their Oracle BLEU by 1.61 BLEU, translating to a 4.78% increase. However, it is important to note that Xu et al. (2020) outperforms our model by 1.71 BLEU (5.30%) when we do not use the textual body. But both still beat the baseline TranX, by 25.72% and 7.98% respectively. The use of the mined data further beat the reranker by 1.46%.
The 71.96% increase is likely because TranX models were never intended to perform well with very noisy data, as evidenced by the 36% dropoff in corpus BLEU when adding the body to the models of Xu et al. (2020). In choosing BART, we intentionally picked a transformer model designed for denoising (Lewis et al., 2020). Further testing is likely needed to determine how heavily our approach depends on the underlying transformer, but that is beyond the scope of this paper.

Impact of adding the Question Body
Adding the body of the question clearly improved the performance of the model. The BLEU score increased 30.92% to 34.35 and, per Table 4, there was an increase across unigram, bigram, and trigram precision. While all three increase, the amounts are far from uniform. The unigram precision saw only a 3.61% increase, whereas bigram and trigram precision increased by 12.77% and 22.90%, respectively. This indicates that while the model selected slightly more correct tokens, it greatly improved the ordering of those tokens.
Similar improvements, albeit smaller in magnitude, also occurred when including the mined data without the question bodies. However, there was a sharp drop in the standard deviations for the three precision metrics. In contrast, adding the question body resulted in a steep increase in variance. This is most probably a result of the "shrinking" of the dataset that occurred when we added the bodies. In Table 1 we report that every split of the dataset has fewer unique questions than it has examples. Also reported is that the number of tokens in the body is, on average, significantly greater than in the intents. The effective dataset size is thus much smaller, while the number of unique answer snippets stayed the same. The result is that the model now performs better on the difficult test set, at the cost of being more reliant on the training and validation split. Using both the bodies and the mined data does mitigate this "shrinking" effect, as shown by standard deviations lower than those observed when only using the body.

Is BART Reliant on a Single Modality?
As discussed in subsection 2.2, we focus on three modalities in the textual bodies: code blocks, inline code, and natural language. We put forth the idea that a large pretrained language model such as BART learns each modality in an unsupervised manner. We designed four distinct ablations to test whether this is the case. Each was run both with and without the mined data, totaling eight ablations. We report the full BLEU scores from these in Table 4. Further, we calculate the performance with respect to baselines in Table 5. Notably, there was no modality whose removal resulted in a BLEU score worse than when the question body was not used in the input. Nor was there a modality whose removal improved performance. From our ablations, it is clear that the most important modality in the question bodies is the code, regardless of whether it is inline or in a block. But using only code is still 2.25% worse than when all three modalities are included along with the mined data. This indicates that the surrounding NL acts not only as additional context, but likely also as both a direct and an indirect indicator of salient code for the model.

Removing Code Improves Syntax
In Table 4 we report the percentage of generated snippets that are syntactically valid: adding only the mined data results in a 9% increase. When using the question bodies, the addition of the mined data also increases the percentage of valid snippets generated, by 7.88%. While this is an improvement, it is still a 3.76% drop from when the body was excluded. Further, removing the code from the bodies resulted in the highest percentages: 92.00% and 84.92% with and without the mined data, respectively. We then performed a finer analysis using a single seed and the same training and validation data across all ablations and report the results in Appendix A. Across all ablations, the majority of errors are caused by mismatched parentheses, and a large share of the remaining general syntax errors are likely caused by this as well; however, a syntax error prevents extracting the AST needed to investigate these errors further. We also report in Table 9 the percentage of valid snippets generated when the print function is present. One of the more commonly occurring incompatibilities between Python 2 and 3 is that print now requires parentheses. Considering that the questions in the CoNaLa dataset are from March 2017 or earlier and that support for Python 2.x only ended in January 2020, we hypothesize that these deprecated calls are a large cause of the errors. When both the body and the snippet contain print, the inclusion of the question body led to the percentage of valid snippets dropping by 21.06 points with the mined data and 21.05 points without, with respect to their baselines. While there are only 19 such questions in the test set, this is a significant drop. The likely cause is that the autoregressive decoder of BART struggles to remember to close the parentheses when wrapping the snippet in a print statement. One solution would be to run the 2to3 translator on all of the code.
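The validity check discussed above can be implemented with the standard library: a snippet counts as syntactically valid Python 3 if the `ast` module can parse it. This is a sketch of the kind of check used, not necessarily the paper's exact implementation.

```python
import ast


def is_valid_python(snippet: str) -> bool:
    """Return True if the snippet parses as Python 3 source code."""
    try:
        ast.parse(snippet)
        return True
    except SyntaxError:
        return False


is_valid_python("print(' '.join(map(str, l)))")  # Python 3 print: valid
is_valid_python("print ' '.join(map(str, l))")   # Python 2 print: invalid
```

Note that a snippet can be syntactically valid yet semantically wrong; for instance, a[(-1),:]=a parses fine even though it is an incorrect answer.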
However, code blocks may contain not only code but also other modalities such as error messages and console executions, which presents significant hurdles, as 2to3 does not support these. We therefore leave this to future work.

Cheating
In subsection 3.3 we define the "cheating" equation to measure whether the generated snippet is more similar to the question body than the ground truth is. The ideal model would maximize the BLEU score while minimizing |C_m|. We run multiple ablations on a single seed, calculate the "cheating" as defined by Equation 2, and present these results in Table 6. Suffice to say that serious violations of academic integrity have occurred. As expected, the baseline is less similar to the question bodies than the ground truth is. When the body was used as input, C_BT increased by 20.28 points, while C_RL rose by 3.16 points, representing 293.49% and 159.60% increases over their respective baselines. Including the mined data resulted in increases of 18.59 (308.13%) and 0.77 (265.52%) when compared to using only the intents. Both indicate that the model's generated output overlaps with the question body significantly more than the ground truth does.

We select three examples that demonstrate the benefits of our approach while also highlighting the issues both in the use of the question body and in SO corpora in general, and report them in Table 7. In the first example, we can see that both x and y have learned how to use einsums, but neither is correct. z, in this case, produces an answer that returns the correct value. It is highly probable that BART understood from the poster's explicit mention that P.dot(T).transpose(1, 0, 2) gives the desired result and thus extracted it. However, this example has two critical issues: the poster's intent is to find a "cleaner" way to multiply a matrix with a tensor, and scipy.tensordot is deprecated. The latter is to be expected, considering the answer is from 2010. But it does indicate that a better evaluation based on inputs and outputs is likely needed.
The next two examples are quite similar but are from two separate questions. x likely mistakes the core intent to be type conversion due to the inclusion of the words "items" and "with." y also suffers from the inclusion of these tokens but believes the problem involves filtering. In the final example, x recognizes that it must convert the items in b to str, but does not return a joined string. y recognizes that, again, the answer involves type conversion but predicts the incorrect type.
Similar to the first example, z produces answers for both the second and third examples that functionally return the correct results. However, running z's solution for the third example would result in a syntax error due to the missing ")". On further inspection of the question bodies, it becomes apparent that the probable reason one snippet is syntactically valid while the other is not is the presence of a Python 2 print. The model recognizes that a suitable answer can be found in the question but must be converted to Python 3. As discussed in subsection 4.3, these print statements are prone to causing syntactic issues.

Conclusion
We expand the CoNaLa dataset by adding the textual question bodies from the StackExchange API and achieve state-of-the-art performance with a simple BART model. Further, we demonstrate that, for this task, BART performs best when code blocks, inline code, and NL are all present. We then examine the impact of the question body on syntax errors and BART's "cheating" through multimodal understanding. Finally, we examine examples that highlight the issues with both StackOverflow data and code evaluation in general. Future work should focus on extracting desired inputs and outputs for a given intent. Further, additional effort put into creating corpora of executable code is likely to improve not only generation but also evaluation. Both would also protect datasets from deprecated functions and abandoned libraries.