Code Generation From Flowcharts with Texts: A Benchmark Dataset and An Approach



Introduction
Recently, automated source code generation from requirements documents has become a popular task in software engineering and artificial intelligence: it maps natural-language descriptions to executable code. Developments in deep learning have improved the effectiveness of transformations between natural language and source code. However, generating code that solves a specified task requires searching in the huge structured space of possible programs, with a very sparse reward signal, and the solutions can look dramatically different even for the same problem (Li et al., 2022). Therefore, most prior work has been limited to generating short code snippets from sentences (Oda et al., 2015; Yin et al., 2018). However, the sentences in these works are more straightforward than the requirements documents found in real scenarios. For example, to generate the code in Table 1, the model needs to understand the task requirements and accurately select the traversal algorithm from many candidates (e.g., dynamic programming, greedy, divide and conquer).

    def twoSum(nums, target):
        n = len(nums)
        for i in range(n):
            for j in range(i + 1, n):
                if nums[i] + nums[j] == target:
                    return [i, j]
        return []

Table 1: An example of the flowchart and its corresponding code. The input to the model is the flowchart, and the output is the code.
In practice, to solve complex code generation tasks, software engineers rarely translate requirements documents directly into code. Instead, they often work through the Unified Modeling Language (UML), which is intended to provide a standard way of visualizing the design of a system (Hutchinson et al., 2014). Among these diagrams, the flowchart plays an important role in the analysis of system requirements, preliminary design, and detailed design (Sendall and Kozaczynski, 2003). Before writing code, drawing a flowchart that illustrates the algorithm to be used and the steps to follow can significantly reduce the difficulty of the task. There is evidence that, in industrial practice, flowcharts are widely used for problem understanding (i.e., analysis) and documentation. Thus, we propose a new task: generating executable code from a flowchart with text.
There have also been several studies on converting flowcharts to code. Wu et al. and Wang et al. proposed rule-based methods that identify the loop and selection semantics in a flowchart and automatically convert it into pseudo-code that uses the structural conventions of a normal programming language, such as while and if. However, the details in the pseudo-code are still text, so the result cannot be executed directly on a computer. In contrast, we tackle the task of generating executable code (Python) from flowcharts with text. Most closely related to our task is the work of Oda et al. (2015), who propose a task of generating code from pseudo-code. However, each line of pseudo-code in their dataset is independent of the others, so models can generate a code snippet for each line regardless of the other lines. In our dataset, by contrast, the connections between nodes in the flowchart are abundant and diverse and need to be carefully handled. Thus, although flowcharts and pseudo-code can be converted into each other, our dataset is different from theirs.
We constructed a benchmark dataset, FC2Code, which contains 320 pairs of flowcharts with natural language and code. We obtained the code from the programming competition platform LeetCode and manually drew the flowcharts with natural language. Previous models cannot be used directly for the new source code generation task, for the following reasons. First, a flowchart is a graph containing various structures, including loops and selections, which makes it different from plain text. Second, in a natural setting, the nodes of a flowchart are not independent of each other. As seen from Table 3, 62% of the nodes in FC2Code are related to other nodes, which means that the model cannot generate a code snippet for a node without considering its neighbors.
To solve these problems, we propose a two-stage code generation model. In the first stage, the Structure Recognition Model transforms the flowchart into pseudo-code that contains structural information such as while and if. In the second stage, a Code Generation Model is employed to merge the information in the nodes of the flowchart and convert the pseudo-code into executable code. The experiments show that it is necessary to enhance each node's representation with its neighbors according to the structure of the flowchart.

Generating Source Code from Requirements Document
Automatic generation of source code from requirements documents has recently been a hot topic in the artificial intelligence community. Examples include mapping natural language directly to executable programs in the form of logical forms (Zelle and Mooney, 1996), database queries (Yu et al., 2018, 2019), general-purpose code snippets (Oda et al., 2015; Yin et al., 2018), and complete executable code that solves a specific task (Liu et al., 2020; Li et al., 2022).

Dataset Construction
We created a new dataset called FC2Code (FlowChart to Code), which consists of code and flowchart pairs. As can be seen in Figure 1, the dataset construction process can be divided into three steps: 1) Code Extraction, 2) Flowchart Sketch Generation, and 3) Natural Language Annotation. The following paragraphs describe these three steps in detail.
Code Extraction. First, we extract Python code from LeetCode (https://leetcode.com/problemset/all/), an online programming competition platform. Code crawled from publicly available repositories such as GitHub often contains project-related operations or global variables; its logic is often unclear, and it may even contain bugs. In contrast, LeetCode provides open-domain tasks with high-quality official solutions, and each solution contains well-tested code with a detailed explanation. Because each function in the code will be converted into a flowchart, we filter out code containing multiple functions to make sure that each flowchart contains the complete information needed to generate executable code. In total, we select 253 problems with 320 code solutions from LeetCode.
Flowchart Sketch Generation. Second, we use pyflowchart (https://github.com/cdfmlr/pyflowchart/) to automatically convert the code into flowchart sketches. Specifically, each line of code is transformed into a node in the flowchart, and the tool then automatically connects the nodes with edges according to their execution order in the code. Note that the content of the nodes in the flowchart sketches is still code snippets.
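To make the idea of a flowchart sketch concrete, the following is a minimal illustration written with Python's standard ast module (Python 3.9+ for ast.unparse). It is not pyflowchart's implementation: the function name build_sketch and the edge tags are our own assumptions, and only straight-line statements, if, and while are handled (for, break, and continue are out of scope for this sketch).

```python
import ast


def build_sketch(source):
    """Return (nodes, edges) for the first function defined in `source`.

    Illustrative only: handles plain statements, `if`, and `while`.
    """
    func = ast.parse(source).body[0]
    nodes, edges = [], []

    def new_node(label):
        nodes.append(label)
        return len(nodes) - 1

    def connect(pending, target):
        # `pending` holds (node_id, branch_tag) exits waiting for a successor.
        for src, tag in pending:
            edges.append((src, target, tag))

    def visit(stmts, pending):
        for stmt in stmts:
            if isinstance(stmt, ast.If):
                cond = new_node(ast.unparse(stmt.test) + "?")
                connect(pending, cond)
                yes_exits = visit(stmt.body, [(cond, "yes")])
                no_exits = visit(stmt.orelse, [(cond, "no")])
                pending = yes_exits + no_exits
            elif isinstance(stmt, ast.While):
                cond = new_node(ast.unparse(stmt.test) + "?")
                connect(pending, cond)
                body_exits = visit(stmt.body, [(cond, "yes")])
                connect(body_exits, cond)  # back edge closes the loop
                pending = [(cond, "no")]
            else:
                node = new_node(ast.unparse(stmt))
                connect(pending, node)
                # Nothing follows a return statement.
                pending = [] if isinstance(stmt, ast.Return) else [(node, "")]
        return pending

    visit(func.body, [])
    return nodes, edges


SOURCE = '''
def count_down(n):
    total = 0
    while n > 0:
        total += n
        n -= 1
    return total
'''

nodes, edges = build_sketch(SOURCE)
for src, dst, tag in edges:
    print(f"{nodes[src]!r} -[{tag}]-> {nodes[dst]!r}")
```

As in the sketches described above, the node contents here are still code snippets; the natural-language text is added in the annotation step.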
Natural Language Annotation. Third, we translate the code in each node into natural language (Chinese). We set the following principles for annotators:
• When a programmer sees the flowchart, he or she can write code with the same functionality as the source code.
• We also encourage annotators to describe the same code in different ways. For example, "i++" may be annotated as "Increment i" or "Shift the subscript of an array to the right by one unit."
• In real scenarios, the nodes in a flowchart are usually related to each other; for example, some nodes may use a variable declared or described earlier. Therefore, we increased the abstraction of the descriptions and added relationships between nodes. For example, when a new variable is defined, the function and usage of the variable are also annotated, and in the subsequent annotation we no longer mention the name of the variable directly but refer to its function. Table 3 shows situations in which one node is connected to another in our dataset.
Annotating nodes in flowcharts is a time-consuming process. Labeling one flowchart takes about 30 minutes on average, so annotating 320 flowcharts cost around 160 hours of human labor. Table 2 presents statistics for our dataset.
To verify the quality of the annotation, we let annotators exchange their samples and infer the code from the annotation results. We sampled 50 flowcharts with 771 nodes and found that 82.00% of the flowcharts and 98.44% of the nodes are solvable.

Relation Between Nodes
The relations between nodes in the flowchart are abundant and diverse. Our analysis of 50 flowchart examples reveals the relation types and their proportions in FC2Code. Overall, 62% of nodes are related to another node, and 98% of flowcharts contain nodes that are related to another node. As shown in Table 3, the largest share of nodes (46%) require using a variable declared or described in a previous node. 7% require understanding the function of a variable. For example, in row 2 of Table 3, given that dp[i] represents the i-th odd number, "The first odd number" in the current node should be converted to "dp[1]". 4% of nodes require understanding the properties of a variable and its data structure. For example, in row 3 of Table 3, the data type of "builder" is an array and the NL in the current node is "Return a valid string"; thus, we need to convert "builder" to a string before returning it. 9% of nodes require inferring what a demonstrative pronoun refers to. In row 4 of Table 3, the NL of the current node uses the word "their", which requires the model to find what "their" refers to. Lastly, 17% of nodes are related to at least two other nodes.

Distance of Two Interrelated Nodes
For the 50 randomly sampled flowcharts, the distribution of the distance between two related nodes is shown in Figure 2. The number of nodes decreases as the distance between related nodes increases. The reason is that local variables are often reused within nearby code snippets. Many related nodes lie at a distance of 1 or 2, probably because the code in FC2Code contains many for loops. As shown in Table 1, "I is less than n", "Set i as the index of the array, the initial value is 0", and "Increment i" are related, and their distances are between 1 and 2.
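The following sketch shows one way such a distance distribution could be computed. It assumes that distance means the shortest path between two nodes when edge direction is ignored, which is our reading rather than a definition given in the text; the node names and the list of related pairs are hypothetical.

```python
from collections import Counter, deque


def undirected_adjacency(edges):
    """Build an adjacency map that ignores edge direction."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def shortest_distance(adj, start, goal):
    """Breadth-first search; returns None if `goal` is unreachable."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def distance_histogram(edges, related_pairs):
    """Count how many related pairs lie at each distance."""
    adj = undirected_adjacency(edges)
    return Counter(shortest_distance(adj, a, b) for a, b in related_pairs)


# Hypothetical loop: init -> cond -> body -> inc -> cond, cond -> end
EDGES = [("init", "cond"), ("cond", "body"), ("body", "inc"),
         ("inc", "cond"), ("cond", "end")]
RELATED = [("init", "cond"), ("inc", "cond"), ("init", "inc")]

histogram = distance_histogram(EDGES, RELATED)
print(dict(histogram))
```

In this toy loop the initialization, condition, and increment nodes are all within distance 2 of each other, which matches the for-loop pattern described above.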

Task Definition
A flowchart is composed of nodes and edges. Given a flowchart with nodes [x_0, ..., x_n], the task is to generate the corresponding code snippets [y_0, ..., y_m]. Note that m and n are not always equal, because code snippets such as else, continue, and break are not related to any node in the flowchart; they need to be generated according to the structure of the flowchart. We rewrite [y_0, ..., y_m] as [y_0, ..., y_n, y'_0, ..., y'_l], where y_i is generated from x_i and each y'_i is a code snippet that is not related to any node in the flowchart.
The relationships between the nodes [x_0, ..., x_n] and the code snippets [y_0, ..., y_n] are also provided and can be used in the training phase.
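As an illustration of this notation, one example could be stored as follows. The class and field names are hypothetical and do not describe the released data format; the third node text is invented for the example, and the "no" branches that lead to the end of the flowchart are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FlowchartSample:
    """One hypothetical FC2Code-style example (field names are illustrative)."""
    nodes: List[str]                    # x_0 .. x_n: natural-language node texts
    edges: List[Tuple[int, int, str]]   # (source, target, branch label)
    aligned_code: List[str]             # y_i, generated from x_i
    extra_code: List[str] = field(default_factory=list)  # y'_0 .. y'_l

    def __post_init__(self):
        if len(self.nodes) != len(self.aligned_code):
            raise ValueError("each node x_i needs exactly one snippet y_i")

    def snippet_count(self) -> int:
        # m + 1 = (n + 1) + (l + 1)
        return len(self.aligned_code) + len(self.extra_code)


sample = FlowchartSample(
    nodes=[
        "Set i as the index of the array, the initial value is 0",
        "I is less than n?",
        "The current element is even?",  # hypothetical node text
        "Increment i",
    ],
    edges=[(0, 1, ""), (1, 2, "yes"), (2, 3, "yes"), (3, 1, "")],
    aligned_code=["i = 0", "while i < n:", "if nums[i] % 2 == 0:", "i += 1"],
    extra_code=["else:", "break"],
)
print(sample.snippet_count())
```

Here the four nodes align with four snippets, while else and break have no node of their own, so the target contains more snippets than the flowchart has nodes.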

Two Stage Code Generation Method
The tokens in the code can be divided into two categories. Tokens in the first category describe the execution order of the code snippets, such as while, for, if, break, continue, and return. Tokens in the second category describe a specific operation, such as assignment or comparison. Similarly, a flowchart is a combination of nodes and edges, where the edges describe the execution order of the nodes.
There are two kinds of execution order in the flowchart: selection and loop. A loop in the flowchart should be mapped to a while statement, and a selection should be mapped to an if statement. Identifying the flowchart structure in advance reduces the difficulty of the task, and some rule-based methods can identify flowchart structures without errors.
Therefore, we first use a rule-based method to identify the structure of the flowchart and generate pseudo-code, and then use a code generation model to convert the pseudo-code into executable code. Specifically, we are given a flowchart with nodes [x_0, ..., x_n]. In the first stage, a Structure Recognition Model transforms [x_0, ..., x_n] into pseudo-code [z_0, ..., z_n, z'_0, ..., z'_l], where z_i is obtained by adding indentation and prefix tokens such as while and if before x_i, and each z'_i is not related to any node in the flowchart (e.g., else, continue, break) and is inserted into [z_0, ..., z_n] as a separate line. In the second stage, a Code Generation Model transforms the pseudo-code into executable code [y_0, ..., y_n, y'_0, ..., y'_l].
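The intermediate pseudo-code can be pictured with the small sketch below. It assumes that the structure has already been recognized, so each line arrives with its indentation depth and prefix token; the function render_pseudocode, the tuple layout, and the third node text are hypothetical.

```python
def render_pseudocode(lines, indent="    "):
    """Render (depth, prefix, text) triples as indented pseudo-code lines.

    A z_i line carries the node text x_i, optionally prefixed by a structural
    token; a z'_i line (else, continue, break) has a prefix and no node text.
    """
    rendered = []
    for depth, prefix, text in lines:
        body = f"{prefix} {text}".strip()
        rendered.append(indent * depth + body)
    return rendered


pseudo = render_pseudocode([
    (0, "", "Set i as the index of the array, the initial value is 0"),
    (0, "while", "I is less than n?"),
    (1, "if", "The current element is even?"),  # hypothetical node text
    (2, "", "Increment i"),
    (1, "else", ""),    # z' line: not tied to any node
    (2, "break", ""),   # z' line: not tied to any node
])
print("\n".join(pseudo))
```

The structural tokens are already code, while the node texts remain natural language; the second stage only has to translate the latter.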

Structure Recognition Model
In this section we use the Structure Recognition Model to transform the flowchart [x_0, ..., x_n] into pseudo-code [z_0, ..., z_n, z'_0, ..., z'_l]. After the loop and selection structures have been located in the first step, the remaining steps are as follows.
• Step 2: Identify the nodes (e.g., while) and edges (e.g., continue, break, return) associated with the first category of tokens in the loop. We identify these structures based on their characteristics. For example:
1) The True branch of the continue node points to the while node.
2) The True branch of the break and return nodes jumps out of the loop.
• Step 3: Determine the scope of each selection. In the first step, we found the Condition nodes related to selection. In this step, we need to find where the two branches of the selection meet.
• Step 4: Generate pseudo-code. To generate the pseudo-code, the model determines the order of the nodes [x_0, ..., x_n] according to the structure of the flowchart and converts them into the pseudo-code [z_0, ..., z_n, z'_0, ..., z'_l]. We largely follow the method of Wang et al. (2012) to generate the pseudo-code; the full algorithm can be found in their paper.
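Loop detection of the kind attributed to Wang et al., where each loop forms a strongly connected subgraph of the directed flowchart, can be sketched with Tarjan's algorithm. This is a generic illustration and not their full procedure; the function names and the toy graph are our own, and the recursive form is only suitable for small graphs.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm on a dict mapping node -> list of successors."""
    index_of, lowlink = {}, {}
    stack, on_stack, components = [], set(), []
    counter = 0

    def visit(node):
        nonlocal counter
        index_of[node] = lowlink[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for nxt in graph.get(node, ()):
            if nxt not in index_of:
                visit(nxt)
                lowlink[node] = min(lowlink[node], lowlink[nxt])
            elif nxt in on_stack:
                lowlink[node] = min(lowlink[node], index_of[nxt])
        if lowlink[node] == index_of[node]:
            # `node` is the root of a component: pop its members.
            component = set()
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.add(member)
                if member == node:
                    break
            components.append(component)

    for node in graph:
        if node not in index_of:
            visit(node)
    return components


def find_loops(graph):
    """Keep components that contain a cycle (size > 1 or a self-edge)."""
    loops = []
    for comp in strongly_connected_components(graph):
        if len(comp) > 1 or any(n in graph.get(n, ()) for n in comp):
            loops.append(comp)
    return loops


# Hypothetical flowchart: a loop whose body contains a selection
FLOWCHART = {
    "init": ["cond"],
    "cond": ["check", "end"],
    "check": ["found", "inc"],
    "found": [],
    "inc": ["cond"],
    "end": [],
}
loops = find_loops(FLOWCHART)
print(loops)
```

Condition nodes that head no loop, such as "check" in this toy graph, would then be treated as selections.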

Code Generation Model
In the second stage, we transform the pseudo-code [z_0, ..., z_n, z'_0, ..., z'_l] into executable code [y_0, ..., y_n, y'_0, ..., y'_l]. Because [z'_0, ..., z'_l] are already executable code snippets, they can be copied directly to [y'_0, ..., y'_l] without any changes. The difficulty lies in converting [z_0, ..., z_n] into [y_0, ..., y_n]. For each z_i, we first use a bidirectional Long Short-Term Memory (LSTM) network to encode the tokens of z_i into h_i, where c_i is the final cell state of the LSTM. Then, a Graph Attention Network (GAT) (Velickovic et al., 2017) is employed to enhance the representation c_i. Specifically, to preserve the direction of the edges in the flowchart, we treat the flowchart as a directed graph and construct the Reversed Flowchart by reversing its edges. We then use the GAT to fuse c_i with its neighbors c_j according to the original flowchart G_org and the reversed flowchart G_rev, respectively. With window size d_org on G_org and window size d_rev on G_rev, we obtain two new representations of c_i, namely c_{i_org} and c_{i_rev}. Finally, we use an MLP to merge c_{i_org} and c_{i_rev}:

c'_i = MLP([c_{i_org} : c_{i_rev}])

[...] ways of fusing information. In most settings, using the flowchart as the metric for computing the distance between nodes is better than using the pseudo-code; a good example can be found in Figure 1. For the code "for i in range(n):", the input to the model is "I is less than n?", and the related nodes are "Set i as the index of the array ..." and "Increment i". In the flowchart, the distance to both related nodes is 1. However, in the pseudo-code, the distance between "Increment i" and "I is less than n?" is 4, which means that, to achieve the same performance, the window size d_rev must be larger if we use the pseudo-code rather than the flowchart as the metric. (In the pseudo-code, "Increment i" appears at the end of the loop structure, that is, below the pseudo-code snippet "Return their array indices" in Figure 1.)
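The role of the two windows can be illustrated with the pure-Python sketch below. It is a simplification under stated assumptions: a parameter-free dot-product attention stands in for the GAT layer, the MLP is omitted so that only the concatenation is shown, and the graph, node ids, and vectors are hypothetical. Node ids are chosen to match the pseudo-code line order, so node 5 ("Increment i") is 4 lines away from node 1 ("I is less than n?") in the pseudo-code but only one hop away on the reversed flowchart.

```python
import math
from collections import deque


def reverse_graph(graph):
    """Build G_rev by flipping every edge of the directed graph."""
    rev = {node: [] for node in graph}
    for src, targets in graph.items():
        for dst in targets:
            rev.setdefault(dst, []).append(src)
    return rev


def within_window(graph, start, window):
    """Nodes reachable from `start` in 1..window directed hops."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == window:
            continue
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return [n for n in dist if n != start]


def attend(query, neighbors):
    """Softmax-weighted average of neighbor vectors (dot-product scores)."""
    if not neighbors:
        return list(query)
    scores = [sum(q * k for q, k in zip(query, vec)) for vec in neighbors]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * vec[d] for w, vec in zip(weights, neighbors)) / total
            for d in range(len(query))]


def fuse(graph, states, node, d_org, d_rev):
    """Return the concatenation [c_org : c_rev]; the MLP is omitted here."""
    g_rev = reverse_graph(graph)
    c_org = attend(states[node],
                   [states[n] for n in within_window(graph, node, d_org)])
    c_rev = attend(states[node],
                   [states[n] for n in within_window(g_rev, node, d_rev)])
    return c_org + c_rev


# 0: set i, 1: i < n?, 2: (body), 3: (condition), 4: return, 5: increment i
GRAPH = {0: [1], 1: [2], 2: [3], 3: [4, 5], 4: [], 5: [1]}
STATES = {i: [float(i), 1.0] for i in GRAPH}

predecessors = sorted(within_window(reverse_graph(GRAPH), 1, 1))
fused = fuse(GRAPH, STATES, 1, d_org=1, d_rev=1)
print(predecessors, fused)
```

With d_rev = 1 on the reversed flowchart, the condition node already sees both the initialization and the increment node, whereas a window defined on pseudo-code lines would need to span 4 lines to reach the increment.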

Conclusion
In this paper, we introduced the task of generating source code from flowcharts with texts. To train models for this task, we constructed a new open-domain dataset (FC2Code) from the programming competition platform LeetCode. We proposed a two-stage code generation model. In the first stage, the Structure Recognition Model is employed to transform the flowchart into pseudo-code containing the structural conventions of a typical programming language, such as while and if. In the second stage, a Code Generation Model is employed to convert the pseudo-code into code. The experiments show that it is necessary to enhance each node's representation with its neighbors according to the structure of the flowchart. The ablation experiments further show that considering the direction of the edges in the flowchart improves the model's performance. Moreover, when fusing the information of neighboring nodes, computing the distance between two nodes on the flowchart works better than computing it on the pseudo-code, which means that the flowchart is important for the second stage.

Limitations
Because our model is specially designed to fuse the information of each node in the flowchart, it may not be suitable for fusing information in pseudo-code. Therefore, whether it is better to merge the information of each node according to the flowchart or according to the pseudo-code remains a question for further study.

Figure 2: The distance distribution of two nodes with a relationship in FC2Code. The horizontal axis represents the percentage of nodes, and the vertical axis represents the distance.

Figure 3: In the first stage, the Structure Recognition Model is used to transform the flowchart into pseudo-code. In the second stage, the Code Generation Model is employed to convert the pseudo-code into executable code.

Wang et al. proposed to generate pseudo-code from the flowchart in the following steps:
• Step 1: Find the loops and selections in the flowchart. A flowchart is a combination of two basic structures: selection and loop. Wang et al. observed that the flowchart can be seen as a directed graph in which each loop forms a strongly connected subgraph, and they used this property to find all the loop structures in the flowchart. The structures led by the remaining Condition nodes are then selection structures.

Table 3: The relation types and their proportions in the FC2Code dataset. Overall, 62% of nodes are related to another node, and 98% of flowcharts contain nodes that are related to another node.

In the merging step of the Code Generation Model, [:] denotes vector concatenation. Then, h_i and c'_i are sent to the decoder of TranX (Yin and Neubig, 2018) to generate the code snippet y_i.

Table 5: Models benefit from a larger window size. d_org denotes the window size on G_org, and d_rev denotes the window size on G_rev.

Table 6: We can get the neighbors of each node according to the flowchart or according to the pseudo-code. This table shows the impact of different ways of fusing information on model performance.