Full-Stack Information Extraction System for Cybersecurity Intelligence

Due to rapidly growing cyber-attacks and security vulnerabilities, many reports on cyber-threat intelligence (CTI) are published daily. While these reports can help security analysts understand ongoing cyber threats, the overwhelming volume of information makes it difficult to digest in a timely manner. This paper presents SecIE, an industrial-strength full-stack information extraction (IE) system for the security domain. SecIE can extract a large number of security entities, relations, and the temporal information of the relations, which is critical for cyber-threat investigations. Our evaluation with 133 labeled threat reports containing 108,021 tokens shows that SecIE achieves over 92% F1-score for entity extraction and about 70% F1-score for relation extraction. We also showcase how SecIE can be used for downstream security applications.


Introduction
A rapid increase in cyberattacks, both in number and in attack techniques, poses enormous challenges to security analysts. Much of the information on new threats first appears in unstructured reports such as blogs and news articles. To respond quickly to ongoing attacks, it is critical to digest the information about new threats in a short period of time. However, it is very difficult to find relevant information in CTI reports, particularly because cyber-attacks involve many different entities, including the attacker, victim (e.g., companies/industries), tools (e.g., malware), and indicators of compromise (IOCs, e.g., file names and IP addresses), as well as various relations, some of which may be unknown to the security experts.
We present a large-scale full-stack IE system designed for the cybersecurity domain. SecIE can extract 26 entity types, 20 fixed rela- While there have been efforts to apply NLP and IE to the cybersecurity domain (Joshi et al., 2013; Lal, 2013; Jones et al., 2015; Bridges et al., 2017; Liao et al., 2016; Husari et al., 2017; Pingle et al., 2019; Yi et al., 2020), they target a specific sub-area of cybersecurity, mostly extracting IOCs or vulnerabilities, or a single component (either entity extraction or relation classification) of the IE process. To our knowledge, our system is the largest end-to-end IE system for the cybersecurity domain, supporting a large number of security entity and relation types.
Most existing IE systems apply supervised (deep) learning methods that rely on a large amount of high-quality labeled data. Unlike general-domain types, labeling fine-grained security entities and relations requires deep domain knowledge, and thus it is much more difficult to produce high-quality training data for the security domain. As an anecdote, 3 annotators (1 security expert and 2 professional annotators with many years' experience) working full-time for 5 months could produce only 133 annotated documents, which is far from enough to train supervised models for our needs. Thus, SecIE applies unsupervised NLP technologies. We develop techniques to handle idiosyncrasies in security terms and take into account the structural characteristics found in many CTI reports. This domain customization makes SecIE highly accurate, achieving over 92% F1 for entity extraction and 70% F1 for relation extraction.

Methodology
We employ a pipeline architecture as shown in Figure 2. Within the pipeline, documents are processed sequentially: the document content and all the results from the previous components are passed as input to the next component. However, the system can process multiple documents in parallel, yielding a high throughput.

Document Pre-processing
Document Parsing: This component performs text content extraction and document structure detection. We use Apache Tika to extract the file content and structural information such as titles, hyperlinks, tables, and list structures from the input files. The extracted structures are stored as annotations over the document content and passed to the subsequent components along with the content.
Linguistic Analysis: This component performs sentence boundary detection, part-of-speech (POS) tagging, and dependency parsing. We use SyntaxNet (Andor et al., 2016) for POS tagging and dependency parsing. It was trained on general-domain documents and often fails to parse security sentences correctly, because some security entities include many tokens and punctuation marks internally (e.g., some URLs have over 100 tokens). To improve the parsing accuracy, we first detect entity mentions and pass each entire entity mention to the parser as a single noun token. Figure 3 shows a sample sentence and the parsing results when all tokens are passed to the parser individually and when entity mentions are passed as single tokens.
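The entity-as-token idea can be illustrated with a minimal sketch. The two regular expressions, the `ENTITY<n>` placeholder format, and the function names below are assumptions made for illustration; they are not SecIE's actual detectors, and the parser call itself is omitted.

```python
import re

# Hypothetical detectors for illustration only; the real system detects
# many more entity types with its own extraction methods.
ENTITY_PATTERNS = [
    ("URL", re.compile(r"https?://\S+")),
    ("IP", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
]


def mask_entities(sentence):
    """Replace each detected entity mention with one placeholder token.

    Returns the masked sentence and a mapping from placeholder to
    (entity label, original mention) so mentions can be restored later.
    """
    mapping = {}
    masked = sentence
    for label, pattern in ENTITY_PATTERNS:
        def substitute(match, label=label):
            placeholder = f"ENTITY{len(mapping)}"
            mapping[placeholder] = (label, match.group(0))
            return placeholder
        masked = pattern.sub(substitute, masked)
    return masked, mapping


def restore_entities(tokens, mapping):
    """Put the original mentions back into a parsed token sequence."""
    return [mapping[t][1] if t in mapping else t for t in tokens]
```

The masked sentence would be handed to the dependency parser, where each placeholder behaves like an ordinary noun; `restore_entities` then maps the parser's tokens back to the original mentions.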

Entity Extraction
We identified 26 fine-grained entity types related to malware, IoCs, and security vulnerabilities. The types are determined based on the STIX standard, which defines 9 key security concepts and their relationships. The full list of our entity types is shown in Figure 6 in the Appendix. We provide a type inheritance, as shown in Figure 6, allowing applications to consume the entity types at different levels.
Dictionary-based Method is used when a reputable list of terms belonging to a certain entity type exists. In the cybersecurity domain, previously known Campaign, Malware, and ThreatActor cases are well documented. In these cases, we match the dictionary terms against the noun phrases. However, dictionary matching can extract only previously known samples. We address this problem by using lexico-syntactic pattern matching to extract new mentions.

Lexico-Syntactic Pattern-based Method
Inspired by the findings in (Hearst, 1992), we apply the following syntactic patterns to extract security entities: (1) NP (, NP)* BE NP; (2) NP, CALLED NP; (3) NP such as NP (, NP)*; (4) NP including NP (, NP)*; (5) NP a.k.a | ((which|that)? (BE)? (also)? CALLED as) NP. Here, NP stands for a noun phrase, BE represents the be-verbs (e.g., 'is'), and CALLED includes 'dubbed', 'called', 'named', 'known', 'referred', and 'termed'. To discover new mentions, we first check whether the entity type of an NP in these syntactic patterns is already determined. Then, we label the remaining NPs with the same entity type. If the types of multiple NPs are determined, they should be the same type or have a super-subtype relation. We also use a predefined set of cue words to detect new mentions of Campaign, Malware, and ThreatActor (Table 6 in the Appendix). If a cue word matches an NP or the NP's headword, we classify the other NPs as the same entity type as the cue word. Table 1 shows sample sentences where 'WannaCry', 'Wcrypt', 'WCRY', and 'WannaCrypt' are extracted as Malware even though the mentions were previously unknown.
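As a rough illustration of how type information can propagate through such patterns, the sketch below approximates two of them with regular expressions over raw text: a cue word followed by CALLED, and an alias pattern that covers 'a.k.a' and '(also) CALLED'. The real system matches noun phrases from a parser; the regexes, the cue-word list, and the function name here are simplifying assumptions.

```python
import re

# Simplified stand-ins: a "name" is a capitalized token, and the cue-word
# list is hypothetical (the paper's actual list is in its Table 6).
CALLED = r"(?:dubbed|called|named|known as|referred to as|termed)"
NAME = r"[A-Z][A-Za-z0-9.\-]*"
CUE_WORDS = {"ransomware": "Malware", "trojan": "Malware", "campaign": "Campaign"}
CUE_ALT = "|".join(CUE_WORDS)

# <cue word> CALLED <Name>, e.g. 'ransomware dubbed "WannaCry"'
CUE_PATTERN = re.compile(
    rf'\b(?P<cue>{CUE_ALT})\s+{CALLED}\s+"?(?P<name>{NAME})"?'
)
# <Name>, (also) CALLED | a.k.a <Name>(, <Name>)*
ALIAS_PATTERN = re.compile(
    rf"(?P<head>{NAME}),?\s+(?:also\s+)?(?:a\.k\.a\.?|{CALLED})\s+"
    rf"(?P<aliases>{NAME}(?:,\s*{NAME})*)"
)


def extract_new_mentions(sentence, known_types):
    """Return {mention: entity type} for mentions not yet in known_types."""
    new = {}
    for m in CUE_PATTERN.finditer(sentence):
        new[m.group("name").rstrip(".")] = CUE_WORDS[m.group("cue")]
    for m in ALIAS_PATTERN.finditer(sentence):
        head = m.group("head")
        head_type = known_types.get(head) or new.get(head)
        if head_type is None:
            continue  # no typed NP in the pattern, so nothing to propagate
        for alias in re.split(r",\s*", m.group("aliases")):
            new.setdefault(alias.rstrip("."), head_type)
    return {name: t for name, t in new.items() if name not in known_types}
```

With `known_types = {"WannaCry": "Malware"}`, the sentence "WannaCry, also known as Wcrypt, WCRY, WannaCrypt" labels the three aliases as Malware, because the already-typed mention shares the pattern with them.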
1) WannaCry is a ransomware worm that spread rapidly ...
2) A new ransomware dubbed "WannaCry" is ...
3) The WannaCry ransomware has been very active since ...
5) WannaCry, also known as Wcrypt, WCRY, WannaCrypt

Classification-based Extraction: AvSignature mentions do not conform to particular patterns, which makes regex-based methods ineffective (e.g., 'ADWARE/Agent.imv', 'Trojan-Ransom.Win32.Xpan'). Further, the number of AvSignature instances is very large (millions), making the dictionary method inefficient. However, they have distinct word shapes that are very different from regular words, and it is easy to collect many examples from public sources. We collected 660,000 AvSignature names from VirusTotal as positive samples and added 470,000 words randomly chosen from CTI reports as negative samples. We then trained a Logistic Regression model using character n-gram and word shape features (e.g., uppercase/lowercase letters, digits, and symbols).
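A minimal sketch of the two feature families is given below. The exact shape encoding, the n-gram range, and the boundary markers are assumptions, since the paper does not specify them. The resulting feature dictionaries could be vectorized and passed to a logistic regression implementation such as the one in scikit-learn, which is left out here to keep the sketch dependency-free.

```python
from collections import Counter


def word_shape(token):
    """Collapse a token into its character-class shape.

    Uppercase letters become 'X', lowercase 'x', digits 'd'; symbols are
    kept as-is, and runs of the same class are collapsed to one character.
    """
    shape = []
    for ch in token:
        if ch.isupper():
            cls = "X"
        elif ch.islower():
            cls = "x"
        elif ch.isdigit():
            cls = "d"
        else:
            cls = ch
        if not shape or shape[-1] != cls:
            shape.append(cls)
    return "".join(shape)


def char_ngrams(token, n_min=2, n_max=3):
    """Count character n-grams, with '^' and '$' marking token boundaries."""
    padded = f"^{token}$"
    return Counter(
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    )


def featurize(token):
    """Build a sparse feature dict combining n-gram counts and word shape."""
    features = {f"ng={gram}": count for gram, count in char_ngrams(token).items()}
    features[f"shape={word_shape(token)}"] = 1
    return features
```

Under this encoding, 'Trojan-Ransom.Win32.Xpan' has the shape `Xx-Xx.Xxd.Xx`, while an ordinary word such as 'malware' collapses to `x`, which shows why shape is a strong signal for this entity type.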

Coreference Resolution
We categorize coreferences into two types based on the search range for the referent.
Within-sentence Coreference appears in certain syntactic structures that connect two noun phrases, such as appositives, relative pronouns (e.g., 'which'), or certain phrases such as "<nominative noun> [,] CALLED [as] <proper noun>". When the proper noun belongs to a security entity, we resolve the nominative noun or pronoun to the proper noun. Figure 1 shows two examples of within-sentence coreferences: "a new strain of ransomware, called Trojan-Ransom.Win32.Xpan" and "a gang called TeamXRat". We resolve "a new strain of ransomware" to "Trojan-Ransom.Win32.Xpan" and "a gang" to "TeamXRat".
Cross-sentence Coreference: Syntactic analysis alone cannot connect two mentions when they appear in different sentences. We use the document structure-based sentence embedding model proposed in (Lee and Park, 2019), which generates semantic representations for sentences using BERT (Devlin et al., 2019; Joshi et al., 2019). If a sentence contains a nominative or pronoun mention (e.g., 'the malware'), we identify sentences that are semantically related to it based on the sentence embeddings and look for its referent among the proper nouns in the related sentences. We replace the nominative or pronoun mention with each of the candidates, calculate the likelihood of the candidate in the sentence, and choose the candidate with the highest likelihood as the referent.

Topic Entity Detection
Most CTI reports provide a deep analysis of a particular malware or campaign. We call the focus of a CTI report the topic entity. Many CTI reports are very succinct, often simply providing a list of related entities, such as IOCs, without a contextual connection to the topic entity. These related entities provide critical intelligence about the topic entity, so connecting them with the corresponding malware or campaign is essential. We identify the topic entity of a CTI report as follows. We first look for mentions of Malware, Campaign, and ThreatActor in the first 15 sentences. When there are multiple mentions of these types, we choose the topic entity based on the following factors: (1) the position of the sentence (the topic entity is likely to appear early in the article); (2) whether the mention is singular or plural (it tends to be singular); (3) the syntactic role of the mention in the sentence (likely the subject or the object); and (4) the occurrence count of the mention in the article (likely to appear many times).
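The four factors can be combined in many ways; the paper does not state how they are weighted. The sketch below assumes a simple additive score with made-up weights, purely to show how the factors might interact. The `Mention` fields and all numeric constants except the 15-sentence window are hypothetical.

```python
from dataclasses import dataclass

TOPIC_TYPES = {"Malware", "Campaign", "ThreatActor"}
MAX_SENTENCES = 15  # only the first 15 sentences are searched


@dataclass
class Mention:
    text: str
    entity_type: str     # e.g. "Malware"
    sentence_index: int  # 0-based position of the sentence in the report
    is_plural: bool
    syntactic_role: str  # "subject", "object", or "other"


def pick_topic_entity(mentions, doc_counts):
    """Return the text of the best-scoring candidate, or None.

    doc_counts maps a mention text to its occurrence count in the report.
    The weights are illustrative assumptions, not tuned values.
    """
    best, best_score = None, float("-inf")
    for m in mentions:
        if m.entity_type not in TOPIC_TYPES or m.sentence_index >= MAX_SENTENCES:
            continue
        score = (MAX_SENTENCES - m.sentence_index) / MAX_SENTENCES  # earlier is better
        score += 0.0 if m.is_plural else 0.5                        # prefer singular
        score += 0.5 if m.syntactic_role in ("subject", "object") else 0.0
        score += 0.1 * doc_counts.get(m.text, 0)                    # frequent mentions
        if score > best_score:
            best, best_score = m.text, score
    return best
```

A singular subject mention in the first sentence that recurs throughout the report dominates under this scoring, which matches the intuition behind the listed factors.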

Relation Extraction
Similarly to entity extraction, we apply several different techniques for relation extraction.
OpenIE Relation Extraction discovers relations from certain syntactic structures (Angeli et al., 2015; Banko et al., 2007; Soderland et al., 2010; Fader et al., 2011; Mausam et al., 2012; Roy et al., 2019). Many security relations involve actions (e.g., download, connect, etc.). Thus, we focus on three syntactic structures containing a verb phrase and two noun phrases: <NP(subj)-VP-NP(obj)>, <NP-VP-pp-NP>, and <VP-NP-pp-NP>, where pp is a preposition. We find these syntactic structures in sentences, and, if both NP arguments are security entity mentions, we extract a relation by associating the NPs with the VP as the relation type. Table 2 shows examples of semantic relations extracted using this method.
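The three structures can be illustrated over a pre-chunked sentence. In the sketch below, the input is assumed to be a flat list of `(tag, text)` chunks with tags `NP`, `VP`, and `PP`; a real implementation would work on the dependency parse instead, so this is only an approximation of the matching step.

```python
# The three structures from the text, written as chunk-tag sequences.
PATTERNS = [
    ("NP", "VP", "NP"),
    ("NP", "VP", "PP", "NP"),
    ("VP", "NP", "PP", "NP"),
]


def extract_openie_relations(chunks, entity_types):
    """Return (arg1, verb, arg2) triples whose NPs are both known entities.

    chunks: list of (tag, text) pairs for one sentence.
    entity_types: dict mapping entity mention text to its entity type.
    """
    tags = [tag for tag, _ in chunks]
    relations = []
    for pattern in PATTERNS:
        width = len(pattern)
        for start in range(len(chunks) - width + 1):
            if tuple(tags[start:start + width]) != pattern:
                continue
            window = chunks[start:start + width]
            noun_phrases = [text for tag, text in window if tag == "NP"]
            verb = next(text for tag, text in window if tag == "VP")
            # Keep the relation only if both arguments are security entities.
            if all(np in entity_types for np in noun_phrases):
                relations.append((noun_phrases[0], verb, noun_phrases[1]))
    return relations
```

The verb phrase is used directly as the relation type, as described above; normalizing verbs (for example by lemma) would be a natural extension but is not shown.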

Cooccurrence-based Relation Extraction
While the OpenIE relations provide useful semantic relations, extracting relations only from the three structures can miss other relevant relations. We therefore generate a relation if two security entities co-occur in a sentence but are not connected by an OpenIE relation. The assumption is that if two entities frequently appear together in the same sentence, they should be of interest to security analysts. We produce co-occurrence-based relations between the five main security entity types (Campaign, Indicator, Malware, ThreatActor, and Vulnerability) and assign a generic relation type ('related'). Table 3 shows sample co-occurrence-based relations.
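A minimal sketch of the pairing step follows. It assumes entity types have already been mapped up the type hierarchy to one of the five main types, and that pairs already covered by OpenIE are supplied as unordered pairs; both the function name and these input conventions are assumptions.

```python
from itertools import combinations

MAIN_TYPES = {"Campaign", "Indicator", "Malware", "ThreatActor", "Vulnerability"}


def cooccurrence_relations(sentence_entities, openie_pairs):
    """Generate generic 'related' relations for one sentence.

    sentence_entities: list of (mention, main entity type) in sentence order.
    openie_pairs: set of frozensets, each holding two mentions that an
    OpenIE relation already connects.
    """
    main_mentions = [m for m, etype in sentence_entities if etype in MAIN_TYPES]
    relations = []
    for first, second in combinations(main_mentions, 2):
        if frozenset((first, second)) in openie_pairs:
            continue  # already connected by a more specific relation
        relations.append((first, "related", second))
    return relations
```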
Relations with Topic Entity: As discussed above, many threat reports describe a particular security event or entity, and the other entities in the document provide insights on the topic entity. In this work, if the entities in a list are not included in any other relation, we connect them to the topic entity via a relation type denoted as related+EntityType (e.g., relatedHash).
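The naming scheme can be sketched in a few lines. The function name and the input conventions (list entities as `(mention, type)` pairs, already-connected mentions as a set) are assumptions for illustration.

```python
def link_list_entities_to_topic(topic, list_entities, connected_mentions):
    """Attach otherwise unconnected list entities to the topic entity.

    list_entities: (mention, entity type) pairs found in list structures.
    connected_mentions: mentions that already take part in some relation.
    The relation type is 'related' followed by the entity type.
    """
    return [
        (topic, f"related{entity_type}", mention)
        for mention, entity_type in list_entities
        if mention not in connected_mentions
    ]
```

For example, a hash listed at the end of a WannaCry report with no other relation would yield `("WannaCry", "relatedHash", <hash>)`.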

Temporal Information Extraction
Threat intelligence is time sensitive, and knowing when a security event occurred is critical. Time information can be expressed in multiple ways, including point-in-time (e.g., "2016-05-25"), relative time (e.g., "last year"), time range (e.g., "2016-2017"), and embedded time (e.g., "CVE-2017-3018"). SecIE extracts these time expressions and normalizes them to timestamps. For relative time expressions, we infer their point-in-time based on an anchor time, which can be an absolute time expression in nearby sentences. If there are no point-in-time expressions in the document, we use the file's last modified time or the publication date as the anchor time. Then, we use a priority order to determine which temporal information is assigned to a relation.
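The sketch below illustrates both steps under simplifying assumptions. `normalize_time` handles only one example form per expression kind and returns a (start, end) date pair; treating an embedded CVE year as a whole-year range is an assumption. `assign_relation_time` applies the priority order reported in the paper for relation time assignment (same dependency construct, same sentence, previous sentences, last modified time, published time); the source labels are hypothetical names.

```python
import re
from datetime import date


def normalize_time(expr, anchor):
    """Normalize a time expression to a (start, end) pair of dates.

    anchor is the date used to resolve relative expressions.
    Returns None for expressions this sketch does not cover.
    """
    expr = expr.strip()
    m = re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", expr)  # point in time
    if m:
        day = date(int(m[1]), int(m[2]), int(m[3]))
        return day, day
    m = re.fullmatch(r"(\d{4})-(\d{4})", expr)          # year range
    if m:
        return date(int(m[1]), 1, 1), date(int(m[2]), 12, 31)
    m = re.fullmatch(r"CVE-(\d{4})-\d+", expr)          # embedded year
    if m:
        return date(int(m[1]), 1, 1), date(int(m[1]), 12, 31)
    if expr.lower() == "last year":                     # relative time
        year = anchor.year - 1
        return date(year, 1, 1), date(year, 12, 31)
    return None


# Highest priority first.
PRIORITY = [
    "same_dependency",
    "same_sentence",
    "previous_sentences",
    "last_modified",
    "published",
]


def assign_relation_time(candidates):
    """Pick the highest-priority available time for a relation.

    candidates maps a source label from PRIORITY to a normalized time
    (or None when that source has no time information).
    """
    for source in PRIORITY:
        if candidates.get(source) is not None:
            return source, candidates[source]
    return None
```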

Performance Evaluation
To evaluate our system, we manually labeled 133 CTI reports, which contain 6,438 sentences and 108,021 tokens. The documents were labeled by 3 full-time annotators over 5 months.
To ensure the quality of the labeled data, we kept only the labels agreed on by all 3 annotators, resulting in 3,295 entity and 1,216 relation mentions. More detailed statistics of the annotations are shown in Table 7

Entity Extraction Results
Table 4 shows the performance of our entity extraction (see Table 9 and Table 10 in the Appendix for the performance on all entity types). The evaluation measures the mention-level precision (P), recall (R), and F1 scores over all entity types. SecIE (all) reports the performance on all 133 reports, showing that SecIE achieves a very high F1 score with a good balance between precision and recall.
Further, we compare SecIE with a deep learning model to illustrate the challenges of applying supervised learning methods to cybersecurity data. We split the 133 labeled documents into train (80%), validation (10%), and test (10%) datasets, consisting of 106, 14, and 13 documents respectively, and trained a BERT model as described in (Devlin et al., 2019). The results (small) show that SecIE significantly outperforms the BERT model.

Relation Extraction Results
We measure the performance of relation extraction using four different settings.
• ExactMatch: An extracted relation and the ground truth must have the same entity spans, entity types, and relation type.
• -eType: The condition for the entity type match is removed from ExactMatch. This is mainly because Malware and Campaign are often used interchangeably.
• -rType: The condition for the relation type match is removed from ExactMatch.
• -eType-rType: Both the entity type and the relation type can be different.
Further, we evaluate the performance of relation extraction with and without coreference resolution to show the effectiveness of the coreference resolution step.
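The four settings differ only in which fields of a relation must agree, which can be sketched as a configurable comparison key. The tuple layout and the use of mention strings in place of character spans are assumptions for illustration.

```python
def relation_key(relation, use_entity_types=True, use_relation_type=True):
    """Build the comparison key for one relation under a matching setting.

    relation: (subject, subject type, relation type, object, object type).
    Entity mentions stand in for entity spans in this sketch.
    """
    subj, subj_type, rel_type, obj, obj_type = relation
    key = [subj, obj]
    if use_entity_types:
        key += [subj_type, obj_type]
    if use_relation_type:
        key.append(rel_type)
    return tuple(key)


def precision_recall_f1(predicted, gold, **setting):
    """Compute P, R, and F1 over sets of relation keys."""
    pred_keys = {relation_key(r, **setting) for r in predicted}
    gold_keys = {relation_key(r, **setting) for r in gold}
    true_positives = len(pred_keys & gold_keys)
    precision = true_positives / len(pred_keys) if pred_keys else 0.0
    recall = true_positives / len(gold_keys) if gold_keys else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

ExactMatch corresponds to the default arguments, -eType to `use_entity_types=False`, -rType to `use_relation_type=False`, and -eType-rType to both set to `False`.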

Security Applications
We demonstrate how SecIE can provide additional insights on security incidents.

Malware Analysis
SecIE can be used to build a knowledge graph (KG) on malware from text. Figure 4 shows an input document about WannaCry and the output KG. As we can see, SecIE extracted all of the security entities and connected them to the topic entity (new variant of WannaCry).

Inconsistency in CVEs
The NVD (National Vulnerability Database) provides information about known security vulnerabilities including the descriptions and asso- We match the mentions of Application extracted from the description against the CPE entries in the metadata using simple matching rules. Since an application can be referred to by several synonyms (e.g., Microsoft Office vs. Office), we apply loose matching for application names. The versions can be represented as an exact version (e.g., 4.05), a range (e.g., 'before 10.3'), or wildcard symbols (e.g., 4.x or 4.*), so we match the versions accordingly.
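The matching rules are not spelled out in the paper; the sketch below shows one plausible reading. It assumes purely numeric dotted versions, treats 'before X' as strictly less than X, and treats one application name as matching another when its words are a subset of the other's. All function names are hypothetical.

```python
import re


def parse_version(text):
    """Turn a dotted numeric version such as '10.2.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def version_matches(spec, version):
    """Check a concrete version against an exact, range, or wildcard spec.

    spec examples: '4.05' (exact), 'before 10.3' (range), '4.x' or '4.*'.
    """
    spec = spec.strip()
    candidate = parse_version(version)
    range_match = re.fullmatch(r"before\s+([\d.]+)", spec)
    if range_match:
        return candidate < parse_version(range_match.group(1))
    if spec.endswith((".x", ".*")):
        prefix = parse_version(spec[:-2])
        return candidate[:len(prefix)] == prefix
    return candidate == parse_version(spec)


def app_matches(name_a, name_b):
    """Loose name match: one name's words are a subset of the other's."""
    words_a = set(name_a.lower().split())
    words_b = set(name_b.lower().split())
    return bool(words_a) and bool(words_b) and (words_a <= words_b or words_b <= words_a)
```

A description and a CPE entry would be flagged as potentially inconsistent when no application/version pair from one side matches any pair on the other; that comparison loop is omitted here.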
We randomly selected 168 CVE records and manually reviewed the results of the inconsistency check. This technique detected 26 potential inconsistencies, and 6 of them were confirmed to be inconsistent. This demonstrates that our tool can be used to find potentially erroneous CVE records and help improve the quality of the CVE database.

Related Work
There have been a few efforts to apply IE to the cybersecurity domain. Most existing works focus on entity extraction for a small number of security entities (mainly IOCs and vulnerabilities) from certain types of security text (mainly CVEs and tweets). Joshi et al. (Joshi et al., 2013) present a system that produces linked data from CVE records. This system can extract 6 entity types commonly found in CVEs and link the extracted instances to DBPedia entries. Sabottke et al. (Sabottke et al., 2015) propose a Twitter-based exploit detector, which collects tweets mentioning vulnerabilities. This tool uses simple keyword matching and monitors occurrences of the "CVE" keyword and IDs in tweets. Liao et al. (Liao et al., 2016) present a system (iACE) for fully automated IOC extraction. iACE detects file names, IP addresses, and URLs using regular expressions. TTPDrill (Husari et al., 2017) extracts threat actions (i.e., TTPs) from security reports and maps them to a threat action ontology derived from ATT&CK and CAPEC. This tool detects threat actions from the SVO dependency structure, where the subject is a malware instance. Yi et al. (Yi et al., 2020) present an NER tool for the cybersecurity domain, which is similar to our entity extraction component. They apply regular expressions, dictionary matching, and a CRF classifier to about 20 different entity types and achieve about 82% F1 score.

Conclusion
We presented a large-scale full-stack IE system developed for the cybersecurity domain. Through careful design choices that handle the idiosyncrasies of cybersecurity data, our system achieves high F1 scores for both entity extraction and relation extraction. We also demonstrated how our system can be used for downstream applications. Our system can help security analysts by transforming unstructured threat reports into structured formats that can be easily consumed by subsequent security applications.

Figure 1: A CTI report and the security entities and relations extracted by SecIE

Figure 2: High-level System Architecture, consisting of document parsing; linguistic analysis; entity extraction; coreference resolution; topic entity detection; relation extraction; and relation time assignment.

Figure 3: Improved sentence parsing through domain customization. (a) Parsing with individual tokens (b) Parsing with entities as tokens

The priority order for assigning temporal information to a relation is: (1) time in the same dependency construct; (2) time in the same sentence; (3) time in the previous sentences; (4) the document's last modified time; and (5) the document's published time. Figure 7 in the Appendix shows a sample threat report and the output of SecIE, including the entity, relation, and time information.

Figure 4: A KG built by SecIE from a report about the WannaCry ransomware

Figure 5: Example of inconsistent CVE

Table 1: Examples of new mention extraction. The numbers indicate the rule used to determine the entity type.

Table 2: Examples of OpenIE relation extraction

Table 3: Examples of co-occurrence-based relation extraction

Table 4: Performance of entity extraction models. all denotes the 133 labeled documents, and small denotes the 13-document test set.

Table 5: Relation extraction performance using different matching strategies and coreference settings.
Table 5 shows the evaluation results demonstrating SecIE's effectiveness. It produces over 70% precision across all settings, and coreference resolution improves the performance, especially the recall.