Constructing Flow Graphs from Procedural Cybersecurity Texts

Following procedural texts written in natural language is challenging: a reader must parse the whole text to identify the relevant information or the instruction flow needed to complete a task, which is error-prone. If such texts were structured, we could readily visualize instruction flows, reason about or infer a particular step, and even build automated systems to help novice agents achieve a goal. However, this structure-recovery task is challenging because of the diverse nature of such texts. This paper proposes to identify relevant information from such texts and to generate the information flows between sentences. We built a large annotated procedural-text dataset (CTFW) in the cybersecurity domain (3154 documents), containing valuable instructions drawn from software vulnerability analysis experiences. We performed extensive experiments on CTFW with our LM-GNN model variants in multiple settings. To show the generalizability of both the task and our method, we also experimented with procedural texts from two other domains (maintenance manuals and cooking), which are substantially different from cybersecurity. Our experiments show that a Graph Convolutional Network with BERT sentence embeddings outperforms BERT alone in all three domains.


Introduction
Many texts in the real world contain valuable instructions. These instructions define the individual steps of a process and help users achieve a goal (and corresponding sub-goals). Documents including such instructions are called procedural texts, ranging from simple cooking recipes to complex instruction manuals. Additionally, discussions in shared forums or social media platforms, teaching books, medical notes, sets of advice about social behavior, directions for use, do-it-yourself notices, and itinerary guides can all be considered procedural texts (Delpech and Saint-Dizier, 2008). Most of these texts are written in natural language and thus lack structure. We define structure as the sentence-level dependencies that lead to a goal. These dependencies can vary with the text domain; examples include action traces, effects of an action, information leading to an action, and instruction order. Constructing structured flow graphs from procedural texts is a foundation for natural language understanding and summarization, question answering (QA) beyond factoid QA, automated workflow visualization, and the recovery of causal relationships between statements. By flow graph we mean both the information and the action flows in a text. However, the lack of structure in such texts makes them hard to follow, visualize, draw inferences from, or use to track the state of an object or sub-task, which in turn makes constructing their flow graphs especially difficult.
Procedural texts are common in cybersecurity, where security analysts document how to discover, exploit, and mitigate security vulnerabilities in articles, blog posts, and technical reports, usually referred to as security write-ups. Practitioners in cybersecurity often use write-ups as educational and research materials. Constructing structured flow graphs from security write-ups may help with automated vulnerability discovery and mitigation, exploit generation, and security education in general. However, automatically analyzing and extracting information from security write-ups is extremely difficult since they lack structure. Figure 1 illustrates the core of a security write-up (broken into sentences) that carries instructions for exploiting a vulnerability in an online shopping service. S_1, S_3, and S_4 are the author's observations about the service's nature. Based on this information, S_5 and S_6 are two possible paths of action. The author chose S_6 and ran the Python code in S_8 to exploit the service. S_0 and S_2 are irrelevant to the author's goal of exploiting this service.
Here we propose a novel approach to extract action paths out of structure-less, natural language texts by identifying actions and information flows embedded in and between sentences and constructing action flow graphs. Specifically, our focus is on procedural texts in the cybersecurity domain. We also show that constructing flow graphs helps extract paths of actions in domains besides cybersecurity, such as cooking and maintenance manuals.
Most previous works (Mori et al., 2014; Kiddon et al., 2015; Malmaud et al., 2014; Maeta et al., 2015; Xu et al., 2020; Mysore et al., 2019; Song et al., 2011) focus on fine-grained knowledge extraction from procedural texts in diverse domains. There are also a handful of works (Delpech and Saint-Dizier, 2008; Fontan and Saint-Dizier, 2008; Jermsurawong and Habash, 2015) that study the structure of natural language texts. Unlike previous works, we extract structures and construct flow graphs from natural texts at the sentence level. This is because fine-grained domain-entity extraction requires a large amount of data annotated by people with specific in-depth domain knowledge, whereas text structures generalize across domains.

Dataset. We built a dataset from security write-ups generated from past Capture The Flag (CTF) competitions. CTFs are computer security competitions that are usually open to everyone in the world. Players are expected to find and exploit security vulnerabilities in a given set of software services and, through exploiting vulnerabilities, obtain a flag, a unique string indicating a successful attempt, for each exploited service. Once the game is over, many players publish security write-ups that detail how they exploited services during the game. While these write-ups are a valuable educational resource for students and security professionals, they are usually unstructured and lacking in clarity. We collected 3617 CTF write-ups from the Internet, created a procedural text dataset, and invited domain experts to label each sentence for the purpose of constructing flow graphs and identifying action paths. To the best of our knowledge, this is the first attempt to use the knowledge embedded in security write-ups for automated analysis. The data and the code are publicly available for future research.

This paper makes the following contributions:
• We built a new procedural text dataset, CTFW, in the cybersecurity domain. To the best of our knowledge, CTFW is the first dataset that contains valuable information regarding vulnerability analysis from CTF write-ups.
• We proposed a new NLU task of generating flow graphs from natural language procedural texts at the sentence level, without identifying fine-grained named entities.
• We proposed four variations of a graph neural network-based model (LM-GNN) to learn neighbor-aware representations of each sentence in a procedural text and to predict the presence of edges between any pair of sentences.
• We evaluated our models on CTFW; to the best of our knowledge, this is the first attempt at automated extraction of information from security write-ups. We also evaluated our models across three datasets in different domains and showed the generalizability of our approach.

Our Approach
We map each sentence of a procedural text to a node in a graph, and the action or information flows to edges. The task is then simplified into an edge prediction task: given a pair of nodes, determine whether there is an edge between them. We learn feature representations of nodes using language models such as BERT and RoBERTa (Devlin et al., 2018; Liu et al., 2019). Then, to make the nodes aware of their neighboring sentences, we use a Graph Neural Network (GNN) to update the node representations.

Table 1: Dataset statistics. |e+| is the total number of actual edges, and |e+| + |e−| is the total number of possible edges. The in-degree of the starting node and the out-degree of the end node are both 0.
We check for an edge between every pair of nodes in a graph, reducing the task to binary classification during inference. This formulation enables us to predict any kind of structure from a document.
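As a sketch, this reduction to pairwise binary classification can be written as follows (a hypothetical helper for illustration, not the authors' released code):

```python
from itertools import combinations

def pair_labels(num_sentences, gold_edges):
    """Reduce flow-graph recovery to binary classification over
    ordered sentence pairs (i, j) with i < j: label 1 if the gold
    flow graph has an edge i -> j, else 0.

    gold_edges: set of (src, dst) tuples with src < dst.
    Returns a list of ((i, j), label) examples.
    """
    examples = []
    for i, j in combinations(range(num_sentences), 2):
        examples.append(((i, j), 1 if (i, j) in gold_edges else 0))
    return examples

# A 4-sentence document whose flow is 0 -> 1, 1 -> 2, and 1 -> 3:
examples = pair_labels(4, {(0, 1), (1, 2), (1, 3)})
```

Note how the negative (no-edge) pairs dominate as documents grow, which is why the paper later weights the loss by class.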

Dataset Creation
In this section, we present how we created the three datasets on which we evaluated our approach. Table 1 shows the statistics for each dataset used.

CTF Write-ups Dataset (CTFW)
Each CTF competition has multiple challenges or tasks, and each task may have multiple write-ups by different authors. We crawled 3617 such write-ups from GitHub and CTFTime (CTFTime). Write-ups are unique and diverse but share common inherent principles. The writing style is informal, with embedded code snippets, and often contains irrelevant information. For each write-up, we provide two kinds of annotations: sentence type and flow structure. Part of the annotations were provided as an optional, extra-credit assignment for the Information Assurance course. These CTF write-ups were directly related to the course content: students were required to read existing CTF write-ups and to write write-ups for other security challenges they worked on during the course. Students were then given the option of voluntarily annotating the CTF write-ups they read for extra credit in the course. For this task, we followed all existing annotation guidelines and practices. We also ensured that (1) the volunteers were aware that their annotations would be used for a research project; (2) they were aware that no PII was involved or would be used in the research project; (3) they were aware that the extra credit was entirely optional, and they could refrain from submitting at any point without any consequences; and (4) each volunteer was assigned only 10-15 write-ups, based on a pilot study we ran ahead of time, in which annotating an average-length CTF write-up took about two minutes (maximum ten minutes).
The remaining annotations were performed by the Teaching Assistants (TAs) of the course. These annotations were done as part of the course preparation process, which was part of their work contract; all TAs were paid bi-weekly compensation by the university or by research funding. We also ensured that the TAs knew these annotations would be used for a research project, that their PII was not involved, and that annotations would be anonymized before use. We verified the annotations by randomly selecting write-ups from the set. Figure 1 shows a sample annotation.

Sentence Type Annotations. We split the documents into sentences using natural language rules. We then ask the volunteers to annotate the type of each sentence as Action (A), Information (I), Both (A/I), Code (C), or irrelevant (None). Action sentences are those in which the author specifies actions they took, whereas Information sentences mention the author's observations and the reasons for and effects of their actions. Sentences containing code are labeled C, and those which can be considered both information and action are marked Both (A/I).

Flow Structure Annotations. The second level of annotation concerns the write-up structure. Each volunteer is given a csv file for each document containing a set of sentence IDs and the text of each sentence. They are asked to annotate the flow of information in the document by recording, for each sentence, the IDs of the possible next sentences that continue the flow. We filter out write-ups that are irrelevant or that lack detail (a single line of valuable information). We call a write-up irrelevant if it has no action or information annotations, or if it consists of raw code without any natural language description of the steps taken to detect vulnerabilities. We keep only write-ups written in English. Finally, we have 3154 write-ups with sentence type and structure annotations.
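A minimal sketch of loading such per-document annotations into typed sentences and flow edges. The column names (id, text, type, next_ids) are an assumed layout for illustration, not the annotators' exact file format:

```python
import csv
import io

def read_flow_annotations(csv_text):
    """Parse one document's annotation CSV into (sentences, edges).
    sentences: {id: (text, type)}; edges: set of (src_id, next_id).
    Column names here are hypothetical, not the dataset's exact schema."""
    sentences, edges = {}, set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        sid = int(row["id"])
        sentences[sid] = (row["text"], row["type"])
        # next_ids holds the IDs of possible next sentences, ';'-separated
        for nxt in row["next_ids"].split(";"):
            if nxt:
                edges.add((sid, int(nxt)))
    return sentences, edges

sample = """id,text,type,next_ids
0,We inspect the binary.,A,1
1,It is a 64-bit ELF.,I,2
2,python exploit.py,C,
"""
sentences, edges = read_flow_annotations(sample)
```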
The CTFTime website states that the write-ups are copyrighted by the authors who posted them, and it is practically impossible to contact each author. Such data is also permitted for use in academic research (usc; euc). Thus, we follow previous work using data from CTFTime (Švábenský et al., 2021) and share only the URLs of the write-ups we use. We do not provide the scraper script, since it would create a local copy of the write-up files without the authors' authorization. Interested readers can replicate the simple scraper script from the instructions in Appendix A and use it after reviewing the conditions under which its use is permissible. We do, however, share our annotations for those write-up files.

Cooking Recipe Flow Corpus (COR)
This corpus (Yamakata et al., 2020) provides 300 recipes with annotated recipe named entities and fine-grained interactions between each entity and its sequencing steps. Since we generate action flow graphs without explicitly identifying each named entity, we aggregate the fine-grained interactions between recipe named entities into sentence-level flows for each recipe. We reject three single-sentence recipes.
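One plausible reading of this aggregation step, sketched below: an entity-to-entity interaction becomes an edge between the entities' host sentences, dropping within-sentence interactions and duplicates (illustrative only, not the exact conversion script):

```python
def sentence_level_edges(entity_edges, entity_to_sentence):
    """Collapse fine-grained entity-to-entity interactions into
    sentence-level flow edges.

    entity_edges: iterable of (entity1, entity2) interactions.
    entity_to_sentence: {entity: sentence_id} mapping each annotated
    entity to the sentence it appears in.
    """
    edges = set()
    for e1, e2 in entity_edges:
        s1, s2 = entity_to_sentence[e1], entity_to_sentence[e2]
        if s1 != s2:            # drop within-sentence interactions
            edges.add((s1, s2)) # set() deduplicates repeated pairs
    return edges

mapping = {"egg": 0, "whisk": 0, "pan": 1, "serve": 2}
edges = sentence_level_edges(
    [("egg", "whisk"), ("egg", "pan"), ("pan", "serve")], mapping)
```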

Maintenance Manuals Dataset (MAM)
This dataset (Qian et al., 2020) provides multi-grained process model extraction corpora for the task of extracting process models from texts. It contains over 160 maintenance manuals, each with fine-grained interactions between entities and their sequencing steps. We use the annotations from the sentence-level classification data and the semantic recognition data to generate sentence-level flow annotations for each process. Here, too, we reject single-sentence processes.

Model Description
Our goal is to find paths or traces of actions or information in texts, which requires an understanding of how the sentences interconnect. Hence, we model the problem as an edge prediction task in a graph using GNNs. We represent each sentence as a node and information flows as directed edges. Since procedural text consists of instructions with a unidirectional nature, we consider only directed edges from a sentence S_n to any of its following sentences S_{n+i}. The node representations are learned using language models (LMs) and GNNs.

Document to Sentence Pre-processing
Given a natural language document, we first split the document into sentences using simple rules and heuristics; the COR and MAM datasets already have documents split into separate sentences. For the flow graph creation task on the CTFW dataset, we filter out irrelevant sentences based on the sentence type annotations. After this pre-processing, each document D_i is converted into a sequence of sentences S_1, ..., S_n, where n is the number of valid sentences in the document.
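The rule-based splitting can be approximated as below; this regex is a simplified stand-in for the paper's actual heuristics, which also handle code snippets:

```python
import re

def split_sentences(document):
    """Naive rule-based sentence splitter: break after ., !, or ?
    when followed by whitespace and an uppercase letter. A simplified
    stand-in for the paper's splitting heuristics."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", document.strip())
    return [p for p in parts if p]

sents = split_sentences(
    "We found a bug. It was exploitable! Run the script.")
```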

Document to Graph Representation
A graph G = (V, E) is formally represented as a set of nodes V = {v_0, v_1, ...} connected by edges E = {e_0, e_1, ...}, where each e_i = (v_m, v_n). We consider the sentences S_j of a document D_i as the nodes of a directed graph G_i. We experiment with two graph structure types for learning better node representations with a GNN. First, we form local windows (W_N, where N = 3, 4, 5, or all sentences) for each sentence and allow the model to learn from all of the previous sentences in that window. We form the document graph by connecting each sentence with every other sentence in its window, with directed edges only from S_i to S_j where i < j, since procedural language is directional. We call this configuration Semi-Complete. Second, we connect the nodes linearly, where every S_i is connected to S_{i+1} (except the last node). We call this the Linear setting. Figure 2 shows both settings. We use LMs such as BERT and RoBERTa to generate initial sentence representations: for each sentence S_i, we extract the pooled sentence representation CLS_{S_i} of the contextual BERT/RoBERTa embeddings h_{S_i}, and use CLS_{S_i} as the node features for the graph G_i.
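The two graph structures can be sketched as adjacency matrices. Here we read window W_N as connecting each sentence to its next N sentences, which is consistent with the comparison count in Appendix E; this reading is an assumption for illustration:

```python
import numpy as np

def build_adjacency(n, window=None, linear=False):
    """Directed adjacency matrix over n sentence nodes.

    linear=True: edge S_i -> S_{i+1} only (Linear setting).
    Otherwise Semi-Complete: edges S_i -> S_j for i < j, with j
    restricted to the next `window` sentences (window=None gives
    the all-sentences setting W_all).
    """
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        if linear:
            if i + 1 < n:
                A[i, i + 1] = 1
        else:
            last = n if window is None else min(n, i + window + 1)
            for j in range(i + 1, last):
                A[i, j] = 1
    return A
```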

Neighbor Aware Node Feature Learning
Since the LM sentence vectors are generated individually for each sentence in the document, they are not aware of the other local sentences. Through the semi-complete graph connection, the model can learn a global understanding of the document, while the linear connection helps it learn better node representations conditioned selectively on each node's predecessor. We call the connected nodes neighbor nodes. We use the Graph Convolutional Network (GCN) (Kipf and Welling, 2016) and the Graph Attention Network (GAT) (Veličković et al., 2017) to aggregate the neighbor information for each node, following the generic graph learning function

H^{(l+1)} = f(H^{(l)}, A),   (1)

where A is the adjacency matrix of the graph, H^{(l)} and H^{(l+1)} are the node representations at the l-th and (l+1)-th layers of the network, and f is the message aggregation function. In a GCN, each node i aggregates the representations of all of its neighbors N(i) (based on A) and of itself at layer l, and computes the enriched representation h_i^{(l+1)} using the layer's weight matrix Θ, normalized by the degrees d(i) and d(j) of the source node and its connected nodes:

h_i^{(l+1)} = Θ Σ_{j ∈ N(i) ∪ {i}} h_j^{(l)} / sqrt(d(i) d(j)).   (2)

In a GAT, messages are aggregated using multi-headed attention weights α learned from the neighbor node representations h_j^{(l)}:

h_i^{(l+1)} = Σ_{j ∈ N(i) ∪ {i}} α_{i,j} Θ h_j^{(l)}.   (3)
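The GCN propagation step of equation (2) can be sketched in numpy. This is the standard Kipf-and-Welling form with self-loops and symmetric degree normalization, shown here as an illustration rather than the paper's training code:

```python
import numpy as np

def gcn_layer(H, A, W):
    """One GCN propagation step (Kipf & Welling, 2016):
    H' = D^{-1/2} (A + I) D^{-1/2} H W,
    where D is the degree matrix of A + I. H is (n, d_in) node
    features, A is the (n, n) adjacency, W is the (d_in, d_out)
    weight matrix Θ of the layer."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d = A_hat.sum(axis=1)                   # degrees incl. self-loop
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # D^{-1/2}
    return D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W
```

With no edges (A all zeros) and W = I, each node keeps its own representation, since only the self-loop contributes.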

Projection
We concatenate the neighbor-aware node representations of each pair of nodes (h_i; h_j) from a graph and pass the result through two projection layers with a GELU (Hendrycks and Gimpel, 2016) non-linearity in between; we use the same non-linearity used by the BERT layers for consistency. We halve the number of parameters in each successive projection layer. During testing, given a document, we do not know which sentences are connected, so we compare every pair of nodes. This leads to an imbalance between existing (1) and non-existing (0) edge labels. Hence, we use the weighted cross-entropy loss of equations (4) and (5), where L is the weighted cross-entropy loss, w_c is the weight for class c, and i indexes the data in each mini-batch.
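A numpy sketch of the class-weighted cross-entropy used to counter the no-edge majority. The weight-sum normalization mirrors PyTorch's weighted mean reduction; the exact weights w_c are a free choice (e.g. inverse class frequencies):

```python
import numpy as np

def weighted_ce(probs, labels, class_weights):
    """Weighted cross-entropy over edge/no-edge predictions.

    probs: (N, 2) softmax outputs for classes (no-edge, edge).
    labels: (N,) gold labels in {0, 1}.
    class_weights: (2,) per-class weights w_c, e.g. inverse class
    frequencies, so the rare 'edge' class is not drowned out.
    """
    w = class_weights[labels]                                   # w_{y_i}
    nll = -np.log(probs[np.arange(len(labels)), labels] + 1e-12)
    return float((w * nll).sum() / w.sum())   # weight-normalized mean

loss = weighted_ce(np.array([[0.9, 0.1], [0.2, 0.8]]),
                   np.array([0, 1]),
                   np.array([1.0, 9.0]))
```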

Training and Inference
Our training data comprise, for each document, its set of sentences and its connections as an adjacency matrix. Batching is done over graphs, and the GCN/GAT updates the sentence representations. A pair of node representations is assigned a label of 1 if there is an edge between the nodes and 0 otherwise. Thus, we model the problem as a binary classification task as in equation (6), where f is the projection function, g is the softmax function, and y is the binary class output. Driven by the weighted cross-entropy loss, the node representations are updated after each epoch. During inference, the model generates node representations for each sentence in a test document, and we predict whether an edge exists between any two nodes in the document graph.

Experiments
Datasets and Tasks: Each dataset is split into train, validation, and test sets in a 70:10:20 ratio. The first task is identifying relevant information in raw CTF write-ups by classifying the type of each sentence. The second task is identifying the information flows between sentences by predicting the edges, if any, between sentence pairs.

Metrics: We use accuracy as the evaluation metric for the sentence type classification task on CTFW. For the second task, because of the label imbalance, we compare models based on the area under the precision-recall curve (PRAUC) and also report the corresponding F1-score; we therefore do not report the area under the ROC curve or accuracy. We consider four settings for this task. The no-window setting (W_all) checks for an edge between every pair of sentences in the given document, so the number of comparisons grows with the document's size. In CTFW, each write-up is quite large, so the windowed settings (W_3, W_4, W_5) reduce the number of comparisons considerably (see Appendix E).

Results and Discussion

Sentence Type Classification (STC)
We use the large and base versions of BERT and RoBERTa to predict the type of each sentence in a given text, establishing a baseline for this task. The task helps identify relevant and irrelevant sentences in a document. Each sentence is classified as Action, Information, Both, Code, or None. These fine-grained annotations can be used in later work to create automated agents for vulnerability analysis. The processed data consist of 120751 samples for training, 17331 for validation, and 34263 for testing. Table 3 shows that RoBERTa-large performs best.

Flow Structure Prediction
Here we present the performance results for flow structure prediction.

Random Baselines: In the Random baseline, for every pair of nodes in each document we randomly select 0 (no edge) or 1 (edge). In the Weighted Random baseline, we choose randomly based on the percentage of edges present in the training set. We report only F1, since no probability is computed.

Next Sentence-based Prediction (NS) Baseline: We use LMs such as BERT and RoBERTa in a next-sentence-prediction setting to obtain baselines. Each pair of sentences is concatenated with the [SEP] token and passed through the language model; using the pooled LM representation, we classify whether an edge exists between them.

LM-GNN Results: For each dataset, we report in Table 2 the maximum PRAUC and its corresponding F1 across our window settings (W_3, W_4, W_5, W_all). The scores are the overall best across single- and double-layer GNNs (GCN/GAT) and LMs (BERT/RoBERTa), after experimenting with both base and large versions, trained with pre-trained and randomly initialized weights.
We see that the best LM-GCN models outperform the best baseline model by 0.12, 0.07, and 0.06 PRAUC on the CTFW, COR, and MAM datasets, respectively. However, the best LM-GAT scores fall short of the baselines, indicating that graph attention over LM sentence representations does not learn representations robust enough for this edge prediction task. Note also that the best BERT-GCN models perform better than RoBERTa-GCN on the COR and MAM datasets but worse on CTFW. We hypothesize that this is because CTFW has ten times more data than COR and six times more than MAM, which helps the RoBERTa model correctly predict the edges.

Analysis
Effect of Graph Connection Type: Table 4 shows how the models behave with the semi-complete (SC) and linear (L) graph connections. For each dataset, we compare the PRAUC results for each window to gain more granular insight into the effect of neighbor-aware representation learning. When we restrict graph learning to small windows (W_3, W_4, W_5), the linear model works better because of its selective learning conditioned on each node's predecessor. Conversely, the semi-complete connection provides global awareness of the document and works best in the W_all setting. Notably, every model beats the baseline PRAUC given by the percentage of edges in the data, indicating that the models learn from the graph connections.

Effect of Graph Layers: We study how the depth of the GNN affects performance, comparing PRAUC across all four model variations in the W_all, W_5, W_4, and W_3 settings in Figure 3. We experimented with zero (L_0), one (L_1), and two (L_2) GNN layers. In all three datasets, for the GCN-based models, performance improves with a single layer and degrades beyond that for every window; given this observation and the graph connection types we use, we do not go beyond two layers. We believe the reason for this drop (0.03-0.08 PRAUC) is that information from 2-hop neighbors can interfere with the learning of the current node and lead to incorrect predictions. The GAT-based models remain mostly unaffected by the number of graph layers on COR and MAM, while showing some improvement on CTFW with one layer.

Effect of Pre-trained LM Weights: We study the impact of the pre-trained weights of BERT and RoBERTa on performance in Figure 4. For all three datasets, performance slightly decreases when the pre-trained model weights are used.
This may be because the nature of these texts is quite different from the texts these LMs were pre-trained on: CTFW data often contain code fragments embedded in sentences, emoticons, and the conversational language common in public forums.

Effect of LM Size: We also experimented with the size of the sentence embeddings, using the base and large versions of BERT and RoBERTa across the three datasets. Figure 5 shows the impact on F1 and PRAUC. The larger versions perform worse on all three datasets. We believe this drop occurs because the sentences in these texts are relatively short, which lets the smaller models, with fewer parameters, learn better.
Other experiments: We also experimented with modifications to other parts of the models, such as the number and sizes of the projection layers, the number of attention heads in the GAT model, the dropout rate in selected layers, and the mode of message aggregation (add, max, mean). We do not report these results since they do not significantly change the PRAUC values.

Related Work

Procedural Texts: Prior work studies cooking recipes (Malmaud et al., 2014), their sentence-level dependencies (Mori et al., 2014; Maeta et al., 2015; Xu et al., 2020), and action-verb argument flow across sentences (Jermsurawong and Habash, 2015; Kiddon et al., 2015). In other domains, the extraction of clinical steps from MEDLINE abstracts (Song et al., 2011), the extraction of material synthesis operations and their arguments in materials science (Mysore et al., 2019), the structuring of how-to procedures (Park and Motahari Nezhad, 2018), and action-argument retrieval from web design tutorials (Yang et al., 2019) mostly focus on fine-grained entity extraction rather than action or information traces. The goal of our paper is to construct flow graphs from free-form, natural-language procedural texts without deep domain knowledge; hence, we refrain from training specialized named-entity recognizers to find domain-specific entities. Our work is related to event or process discovery in process modeling tasks (Epure et al., 2015; Honkisz et al., 2018; Qian et al., 2020; Hanga et al., 2020), but our goal is not finding specific events or actions in procedural texts. In addition, recent research proposed a method to recover forum structure from an unstructured forum based on the content of each post, using BERT's next sentence prediction (Kashihara et al., 2020); in contrast, we focus on building flow graphs for procedural texts using GNNs.
Graph Neural Networks: GNNs are important for reasoning over graph-structured data in three major tasks: node classification (Kipf and Welling, 2016; Hamilton et al., 2017), link prediction (Schlichtkrull et al., 2018), and graph classification (Pan et al., 2015, 2016). In each task, GNNs help learn better node representations through neural message passing (Gilmer et al., 2017) among connected neighbors. We consider two widely used GNNs, the Graph Convolutional Network (GCN) (Kipf and Welling, 2016) and the Graph Attention Network (GAT) (Veličković et al., 2017), to learn sentence representations for better edge prediction.
Edge Prediction Task: Edge or link prediction tasks (Li et al., 2018; Pandey et al., 2019; Haonan et al., 2019; Bacciu et al., 2019) mainly take pre-existing networks or social graphs as inputs and predict the existence of future edges between nodes by extracting graph-specific features. Unlike existing work, we model the generation of a graph structure from a given natural-language text as an edge prediction task, learning representations of the sentences treated as nodes.
Combinations of BERT and GCN: Recent works have concatenated BERT and GCN representations of texts or entities to improve performance on tasks such as commonsense knowledge-base completion (Malaviya et al., 2019), text classification (Lu et al., 2020), multi-hop reasoning, citation recommendation (Jeong et al., 2019), medication recommendation (Shang et al., 2019), and relation extraction (Zhao et al., 2019). Graph-BERT (Zhang et al., 2020) relies solely on BERT's attention layers without any message aggregation. We differ from each of these methods in model architecture: we use BERT to learn initial sentence representations and a GCN or GAT to improve them by aggregating representations from neighboring connected sentences. BERT-GAT for MRC (Zheng et al., 2020) creates its graph structure from well-structured Wikipedia data, whereas we explore two predefined graph structures because our free-form texts lack well-defined sections and contain code fragments, emoticons, and unrelated tokens.

Conclusion and Future Work
We introduce a new task: extracting sentence-level flow graphs from natural-language procedural texts. This task is important for procedural texts in every domain. We create a sufficiently large procedural text dataset in the cybersecurity domain (CTFW) and construct structure from its natural form. We empirically show that the task generalizes across multiple domains with different natures and styles of text. In this paper, we focus only on English security write-ups. As future work, we plan to build automated agents in the cybersecurity domain to help and guide novices in performing software vulnerability analysis, and to include non-English write-ups. We hope the CTFW dataset will facilitate further work in this research area.

Impact Statement
The dataset introduced here consists of write-ups written in public forums by students and security professionals about their personal experiences in CTF challenges. The aggregated knowledge of such experiences is immense: in-depth knowledge of analysis tools and problem approaches is ideal learning material for students working in software vulnerability analysis. Automated tutors built on such knowledge can reduce the effort and time spent manually reading through series of lengthy write-up documents. The CTFTime website states that the write-ups are copyrighted by the authors who posted them, and it was practically impossible to contact each author. Use of the data for research purposes is also permitted (usc; euc). Thus, we follow previous work using data from CTFTime (Švábenský et al., 2021) and share only the URLs of the write-ups from the CTFTime website that we use. We do not provide the scraper script, since it would create a local copy of the write-up files without the authors' authorization. Interested readers can replicate the simple scraper script from the instructions in Appendix A and use it after reviewing the conditions under which its use is permissible. We do, however, share our annotations for those write-up files.
Part of the annotations were provided as an optional, extra-credit assignment for the Information Assurance course. These CTF write-ups were directly related to the course content: students were required to read existing CTF write-ups and to write write-ups for other security challenges they worked on during the course. Students were then given the option of voluntarily annotating the CTF write-ups they read for extra credit in the course. For this task, we followed all existing annotation guidelines and practices. We also ensured that: • The volunteers were aware that their annotations would be used for a research project.
• They were aware that no PII was involved or would be used in the research project.
• They were aware that extra credits were entirely optional, and they could refrain from submitting at any point of time without any consequences.
• Each volunteer was assigned only 10-15 write-ups, based on a pilot study we ran ahead of time, in which annotating an average-length CTF write-up took about two minutes (maximum ten minutes).
The remaining annotations were performed by the Teaching Assistants (TAs) of the course, as part of the course preparation process covered by their work contract. All TAs were paid bi-weekly compensation by the university or by research funding. We also ensured that the TAs knew these annotations would be used for a research project, that their PII was not involved, and that annotations would be anonymized before use.

A Extraction and Processing of Write-ups

The extraction of CTF write-ups involved the following three phases.

Write-up URL Extraction: We loop through all the write-up pages on the CTFTime website, from page number 1 to 25500. We use a simple Python scraper to fetch the content of each page with the requests (Reitz) library. We look for the keyword "Original write-ups" and extract the href attribute if present. These URLs are stored for each write-up, indexed by page number.
Write-up Content Extraction: We use these URLs and extract the contents of the write-ups using the Python libraries requests and BeautifulSoup (Richardson, 2007). We extract all text lines, ignoring content in HTML tags such as style, script, head, and title. The contents are stored in a text file named with the same page ID as the URL.
Write-up Processing: We filter out sentences that do not contain any verb forms, using the spaCy (spaCy, 2017) POS tagger. We clean and remove unnecessary spaces and split the text into sentences. The processing script is available on GitHub.

B CTFW Data Statistics
In CTFW, there are write-ups for 2236 unique tasks; only four of these have more than 5 write-ups each, and 72% of the tasks have a single write-up. The write-ups come from 311 unique competitions spanning the years 2012-2019. Write-ups for the same task vary in content. In CTFW, only 3% of the tasks have more than three write-ups, and 9% have more than two.

C Training Details
The hyperparameters were found by running three trials. We run for {50, 100} epochs and store the model with the best PRAUC score. Each training run with evaluation takes around 1-3 hours for the base versions of the models and around 6 hours for the large versions, depending on the dataset. The total parameter count is dominated by the language model, since the GNN adds only a few parameters on top of the LM.

E Number of Comparisons Reduction Using Windows
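The windowed comparison count of equation (7) below can be sketched as follows (a hypothetical helper for illustration):

```python
def num_comparisons(n, window=None):
    """Number of candidate edges checked for a document of n
    sentences. With a window of size s, each sentence is compared
    with at most its next s sentences, and the last s sentences
    contribute all pairs among themselves; with no window (W_all),
    every ordered pair is compared."""
    if window is None:                             # W_all setting
        return n * (n - 1) // 2
    s = window
    return max(n - s, 0) * s + s * (s - 1) // 2    # windowed setting

# A 10-sentence document: 45 checks without a window, 24 with W_3.
```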
We can control the total number of comparisons required to predict the edges in a graph by using the windows (W_N, where N = 3, 4, 5, or all). The number of comparisons for each window is given by equation (7); shorter windows of 3, 4, or 5 sentences reduce the number of comparisons considerably for large documents. The number of comparisons C for a document of n sentences is defined as

C = max{(n − s), 0} · s + s(s − 1)/2   for window size s ∈ {3, 4, 5},
C = n(n − 1)/2   for W_all.   (7)

F CTFW STC Label Statistics