A Pretrained Language Model for Cyber Threat Intelligence

We present a new BERT model for the cybersecurity domain, CTI-BERT, which can improve the accuracy of cyber threat intelligence (CTI) extraction, enabling organizations to better defend against potential cyber threats. We provide detailed information about the domain corpus collection, the training methodology, and its effectiveness on a variety of NLP tasks for the cybersecurity domain. The experiments show that CTI-BERT significantly outperforms several general-domain and security-domain models on these cybersecurity applications, indicating that the training data and methodology have a significant impact on model performance.


Introduction
In response to rapidly growing cyber-attacks, cybersecurity experts publish many CTI reports detailing new security vulnerabilities and malware. While these reports help security analysts better understand cyber-threats, it is very difficult to digest all the information in a timely manner. Thus, automatic extraction of CTI from text has gained a lot of attention from the cybersecurity community.
Two different approaches have been used to produce domain-specific language models: continual pretraining and pretraining from scratch. The continual pretraining method takes an existing general-domain model and continues training the model using a domain-specific corpus. While this approach is useful, especially when the size of the domain-specific corpus is small, the vocabulary of the new model remains largely the same as that of the original model. Most domain-specific terms are thus out of vocabulary. The pretraining-from-scratch approach trains a new tokenizer to construct a domain-specific vocabulary and trains the language model using only its own corpus. Beltagy et al. (2019), Gu et al. (2022), and Hu et al. (2022) have trained BERT models from scratch for the biomedicine, computer science, and political science areas. These studies show that pretraining from scratch outperforms continual pretraining.
Recently, a few transformer-based language models (LMs) have been built for the cybersecurity domain.
CyBERT (Priyanka Ranade and Finin, 2021) trains a BERT model, and SecureBERT (Aghaei et al., 2023) trains a RoBERTa model using the continual pretraining method. jackaduma (2022) introduces the SecBERT and SecRoBERTa models, which are trained from scratch. However, these models either do not provide training details or are not evaluated on many cybersecurity tasks.
We present CTI-BERT, a BERT model pretrained from scratch on a high-quality cybersecurity corpus containing CTI reports and publications. In CTI-BERT, both the vocabulary and the model weights are learned from our corpus. Further, we introduce a variety of sentence-level and token-level classification tasks and benchmark datasets for the security domain. The experimental results demonstrate that CTI-BERT outperforms other general-domain and security-domain models, confirming that training a domain model from scratch with a high-quality domain-specific corpus is critical.
To the best of our knowledge, this work provides the most comprehensive evaluation of classification tasks within the security domain. Accomplishing these tasks is a crucial part of the broader process of automatically extracting CTI, suggesting appropriate mitigation strategies, and implementing countermeasures to thwart attacks. Thus, we see our work as an essential milestone towards more intelligent tools for cybersecurity systems.
The main contributions of our work are the following:
• We curate a large collection of high-quality cybersecurity datasets specifically designed for cyber-threat intelligence analysis.
• We develop a pre-trained BERT model tailored for the cybersecurity domain.
• We perform extensive experiments on a wide range of tasks and benchmark datasets for the security domain and demonstrate the effectiveness of our model.

Training Datasets
We curated a cybersecurity corpus from various reputable data sources. The documents are professionally written and cover key security topics including cyber-campaigns, malware, and security vulnerabilities. Most of the documents are in HTML and PDF formats. We processed the files using the Apache Tika parsers to extract the file content. Then, we detected sentence boundaries and discarded sentences in which fewer than 10% of the tokens are word tokens. The CISSP textbooks (from our Security Textbook dataset) cover all information security topics including access control, cryptography, hardware and network security, risk management, and recovery planning.
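The sentence filter described above can be sketched as follows; the exact definition of a "word token" is our assumption (purely alphabetic tokens), not stated in the paper.

```python
import re

def keep_sentence(sentence: str, min_word_ratio: float = 0.10) -> bool:
    """Keep a sentence only if at least `min_word_ratio` of its
    whitespace-separated tokens look like natural-language words.
    (Treating purely alphabetic tokens as "words" is an assumption.)"""
    tokens = sentence.split()
    if not tokens:
        return False
    words = [t for t in tokens if re.fullmatch(r"[A-Za-z][A-Za-z'-]*", t)]
    return len(words) / len(tokens) >= min_word_ratio
```

A filter like this mainly removes extraction noise such as hex dumps, indicator tables, and code fragments that survive HTML/PDF parsing.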
Academic Paper This dataset contains all papers in the proceedings of the USENIX Security Symposium, a premier security conference, from 1990 through 2021.
Security Wiki This dataset contains 7,919 Wikipedia pages belonging to the "Computer Security" category. We downloaded the data starting from the 'Computer Security' category and recursively extracted pages from its subcategories. We discarded subcategories not related to the cybersecurity domain.
Threat Reports This dataset contains news articles and white papers about cyber-campaigns, malware, and security vulnerabilities. These articles provide in-depth analysis of a specific cyber-attack, including the attack techniques, any known characteristics of the perpetrator, and potential mitigation methods. We collected the dataset from security companies and the APTnotes collection, a repository of technical reports on Advanced Persistent Threat (APT) groups.
Vulnerability This dataset contains records from CVE (Common Vulnerabilities and Exposures) and CWE (Common Weakness Enumeration), which offer catalogs of all known vulnerabilities and provide information about the affected products, the vulnerability type, and the impact.

Training Methodology
We first train the WordPiece tokenizer after lowercasing the security text and produce a vocabulary with 50,000 tokens. Training a tokenizer from scratch is beneficial, as it recognizes domain-specific terms better. Table 13 in the Appendix shows examples of how our tokenizer and BERT's tokenizer handle security-related words. Following the observations of RoBERTa (Liu et al., 2019), we trained CTI-BERT using only the Masked Language Modeling (MLM) objective, using HuggingFace's MLM training script. The model was trained for 200,000 steps with a 15% masking probability, a sequence length of 256, a total batch size of 2,048, a learning rate of 5e-4 with warm-up over the first 10,000 steps, and a weight decay of 0.01. We use the Adam optimizer with β1 = 0.9, β2 = 0.98, ϵ = 1e-6.
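The MLM objective can be illustrated with BERT's standard masking rule (of the selected positions, 80% become [MASK], 10% a random token, 10% stay unchanged). This is the behavior of HuggingFace's MLM data collator, shown here as a minimal pure-Python sketch, not the paper's actual training code.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mlm_prob=0.15, rng=None):
    """BERT-style masking: select `mlm_prob` of the positions; of those,
    80% become [MASK], 10% a random token, 10% stay unchanged.
    Returns (corrupted_ids, labels) with -100 at unselected positions,
    the value the cross-entropy loss ignores."""
    rng = rng or random.Random(0)
    corrupted, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if rng.random() < mlm_prob:
            labels.append(tok)          # predict the original token here
            roll = rng.random()
            if roll < 0.8:
                corrupted[i] = mask_id  # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted[i] = rng.randrange(vocab_size)  # 10%: random token
            # remaining 10%: leave the token unchanged
        else:
            labels.append(-100)         # position does not contribute to loss
    return corrupted, labels
```

During training, the model receives the corrupted sequence and is optimized to recover the original tokens at the selected positions only.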

Cybersecurity Applications
We evaluate CTI-BERT on several security NLP applications and compare its results with both general-domain models and other cybersecurity-domain models. The baseline models are bert-base-uncased and SecBERT (BERT models), and roberta-base, SecRoBERTa, and SecureBERT (RoBERTa models). All baseline models are downloaded from HuggingFace.
The downstream applications can be categorized into sentence-level classification tasks and token-level classification tasks. The goal of the experiments is to compare different pretrained models rather than to optimize the classification models for individual tasks. Thus, we use the same model architecture and hyper-parameters to fine-tune models for all sub-tasks in each application category.

Masked Word Prediction
First, we conduct a masked token prediction task to measure how well the models capture domain knowledge. To ensure that the test sentences are not in the training data, we use five headlines from security news published in January and February 2023. Table 2 shows the test sentences and the models' predictions. For each sentence, we conduct masked token prediction twice with different masked words; the upper line shows the predictions for <mask>1, and the lower line shows the predictions for <mask>2.
The results clearly show that CTI-BERT performs very well in this test; its predictions are either the same words (boldfaced) or synonyms (italicized). Note that CTI-BERT produces RAT for "PlugX <mask>", which is a more specific term than the masked word ('malware'). RAT (Remote Access Trojan) is the malware family to which PlugX belongs. However, neither SecBERT nor SecRoBERTa performs well in this test, even though they were trained with security text. Interestingly, roberta-base performs better than these models and bert-base-uncased.

Sentence Classification Tasks
For sentence- or document-level classification, we add a classification head on top of the pretrained language model, with one hidden layer and one output projection layer connected by a tanh activation; it takes as input the average of the last hidden states of all tokens in the sentence. We fine-tune the pretrained models together with the randomly initialized classification layers, using 1,000 warm-up steps with the learning rate varied according to the formula in Vaswani et al. (2017). We use the Adam optimizer with β1 = 0.9, β2 = 0.999, and a weight decay of 0.01. All models are trained for 50 epochs with a batch size of 16 and a learning rate of 2e-5.
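A minimal PyTorch sketch of such a head; the masked mean pooling over non-padding tokens and the layer sizes are assumptions for illustration, not the paper's exact code.

```python
import torch
import torch.nn as nn

class MeanPoolClassificationHead(nn.Module):
    """One hidden layer and one output projection connected by tanh,
    applied to the mean of the encoder's last hidden states."""
    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.out_proj = nn.Linear(hidden_size, num_labels)

    def forward(self, last_hidden_state, attention_mask):
        # average the last hidden states over real (non-padding) tokens
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.out_proj(torch.tanh(self.dense(pooled)))
```

During fine-tuning, the logits feed a cross-entropy loss, and gradients flow into both the head and the pretrained encoder.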
For the evaluation, we train five models with five different seeds (42, 142, 242, 342, and 442) for each task and report the mean (Mean) and standard deviation (Std.) of the micro and macro F1 scores over the five models.
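The reporting scheme amounts to averaging per-seed scores; a trivial sketch with illustrative (made-up, not actual) F1 values:

```python
import statistics

# Illustrative, made-up F1 scores for the five fine-tuning seeds.
f1_by_seed = {42: 0.81, 142: 0.79, 242: 0.83, 342: 0.80, 442: 0.82}

mean_f1 = statistics.mean(f1_by_seed.values())
std_f1 = statistics.stdev(f1_by_seed.values())  # sample standard deviation
```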

ATT&CK Technique Classification
The key knowledge SOC analysts look for in CTI reports is information about malware behavior and the adversary's tactics and techniques. The MITRE ATT&CK framework offers a knowledge base of these adversary tactics and techniques, which has been used as a foundation for threat models and methodologies in many security products.
To facilitate research on identifying ATT&CK techniques in prose-based CTI reports, MITRE created TRAM, a dataset containing sentences from CTI reports labeled with ATT&CK techniques. We observe that TRAM contains duplicate sentences across the splits. We remove the duplicates and keep only the classes with at least one sentence in each of the train, development, and test splits. The cleaned dataset contains 1,491 sentences, 166,284 tokens, and 73 distinct classes. More detailed statistics of the dataset are shown in Table 15 in the Appendix. Note that this dataset is very sparse and imbalanced. Table 3 shows the results of the six models for this task. As we can see, CTI-BERT outperforms all other models by a large margin.
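The cleanup can be sketched as follows; the paper does not specify which copy of a duplicate is kept, so this sketch assumes duplicates are retained in the earliest split (train first):

```python
def clean_splits(splits):
    """Drop sentences that appear in more than one split, then keep only
    classes that still have at least one sentence in every split.
    `splits` maps "train"/"dev"/"test" to lists of (sentence, label)."""
    seen, deduped = set(), {name: [] for name in splits}
    for name in ("train", "dev", "test"):
        for sent, label in splits[name]:
            if sent not in seen:
                seen.add(sent)
                deduped[name].append((sent, label))
    # classes present in all three deduplicated splits
    keep = set.intersection(*({lab for _, lab in deduped[n]} for n in deduped))
    return {n: [(s, lab) for s, lab in deduped[n] if lab in keep] for n in deduped}
```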

IoT App Description Classification
IoTSpotter is a tool for automatically identifying mobile IoT (Internet of Things) apps, IoT-specific libraries, and potential vulnerabilities in the IoT apps (Jin et al., 2022). The authors created a dataset containing the descriptions of 7,237 mobile apps, labeled as mobile-IoT apps vs. non-IoT apps with a distribution of approximately 45% and 55%, respectively. They removed stopwords and concatenated all remaining tokens in each description, ignoring sentence boundaries. We use the dataset without any further processing. The data statistics are shown in Table 16 in the Appendix. The models' classification results are shown in Table 4.

Malware Sentence Classification
We use the dataset from the SemEval-2018 Task 8, which consisted of four subtasks to measure NLP capabilities for cybersecurity reports (Phandi et al., 2018). The task provided 12,918 annotated sentences extracted from 85 APT reports, based on the MalwareTextDB work (Lim et al., 2017).
The first sub-task is to build models to extract sentences about malware. The dataset is biased, with malware and non-malware sentences comprising 21% and 79%, respectively, as shown in Table 17 in the Appendix. The results, listed in Table 5, show that CTI-BERT and SecRoBERTa perform well on this task.

Malware Attribute Classification
This task classifies sentences into the malware attribute categories defined in the MAEC (Malware Attribute Enumeration and Characterization) vocabulary. MAEC defines the malware attributes in a 2-level hierarchy with four high-level attribute types (ActionName, Capability, StrategicObjectives, and TacticalObjectives) and 444 low-level types. We conduct this sub-task by building a model for each of the four high-level attributes. Table 23 in the Appendix shows more details of this dataset for the four high-level attributes. As we can see, the datasets are very sparse with a large number of classes.
Tables 6-9 show the classification results for the four malware attribute types. We can see that CTI-BERT performs well, being the best or second-best model for all four attribute types.

Token Classification Tasks
Here, we compare the models' effectiveness for token-level classification using two security-domain NER tasks and a token type detection task. We use the standard sequence tagging setup and add one dense layer as the classification layer on top of the pretrained language models. The classification layer assigns each token a label using the BIO tagging scheme. Our system is implemented in PyTorch using HuggingFace's transformers (Wolf et al., 2019). The training data is randomly shuffled, and a batch size of 16 is used with post-padding. We set the maximum sequence length to 256 and use the cross-entropy loss for model optimization with a learning rate of 2e-5. All other training parameters were set to the default values in transformers. As in the sentence classification tasks, we train five models for each task with the same five seeds for 50 epochs and compare the average mention-level precision, recall, and F1 score.
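Mention-level scoring requires decoding BIO tag sequences into spans before comparing predictions with gold mentions. A sketch of the decoding step (ours, not the paper's code):

```python
def bio_to_mentions(tags):
    """Convert a BIO tag sequence into (start, end, type) mention spans,
    with `end` exclusive, as needed for mention-level precision/recall/F1."""
    mentions, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        # close the open span on O, on B-, or on an I- of a different type
        if tag == "O" or tag.startswith("B-") or \
           (tag.startswith("I-") and tag[2:] != etype):
            if start is not None:
                mentions.append((start, i, etype))
                start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            start, etype = i, tag[2:]  # tolerate I- without a B- prefix
    return mentions
```

A prediction counts as correct only if both the span boundaries and the type match the gold mention, which is stricter than token-level accuracy.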

NER1: Coarse-grained Security Entities
Cybersecurity entities have very distinct characteristics, and many of them are out-of-vocabulary terms. Here, we investigate whether domain-specific language models can alleviate this vocabulary gap. We collected 967 CTI reports on malware and vulnerabilities. The documents are labeled with the 8 entity types defined in STIX (Structured Threat Information Expression; https://stixproject.github.io/releases/1.2), a standard framework for cyber-intelligence exchange. The 8 types are Campaign (names of cyber campaigns), CourseOfAction (tools or actions to take to deter cyber attacks), ExploitTarget (vulnerabilities targeted for exploitation), Identity (individuals, groups, or organizations involved in attacks), Indicator (objects used to detect suspicious or malicious cyber activity), Malware (malicious code used in cyber crimes), Resource (tools used for cyber attacks), and ThreatActor (individuals or groups that commit cyber crimes). The size of the dataset and detailed statistics of the entity types in the corpus are shown in Table 18 and Table 19 in the Appendix. Table 10 shows the NER results using the mention-level micro-average scores.

NER2: Fine-grained Security Entities
We note that some STIX entity types (especially Indicator) are very broad, containing many different sub-types, and are thus difficult for automatic threat investigation applications to use directly. We redesigned the type system into 16 types by dividing broad categories into their subcategories and annotated the test dataset from the NER1 task. We then split the dataset into an 80:10:10 ratio for the train, dev, and test sets. Table 20 and Table 21 in the Appendix show the statistics of this dataset. The NER results in Table 11 show that most models perform better on the finer-grained types, and CTI-BERT in particular outperforms all other models by a large margin.
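The 80:10:10 split can be sketched as follows; the shuffling strategy and seed are our assumptions, as the paper does not give the splitting procedure:

```python
import random

def split_80_10_10(examples, seed=42):
    """Shuffle and partition a dataset into train/dev/test at an
    80:10:10 ratio (shuffling and seed are assumptions, not the paper's)."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_dev = int(0.8 * n), int(0.1 * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])
```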

Token Type Classification
The token type detection task is sub-task 2 of the SemEval-2018 Task 8, which aims to classify tokens into the Entity, Action, Modifier, and Other categories. Action refers to an event. Entity refers to the initiator of the Action (i.e., the subject) or the recipient of the Action (i.e., the object). Modifier refers to tokens that elaborate on the Action. All other tokens are assigned to Other. More details on the dataset are shown in Table 22 in the Appendix.
Even though the categories are not semantic types as in NER, this task can also be cast as a token sequence tagging problem, and thus we apply the same system used for the NER tasks. The classification results are shown in Table 12. Overall, the models do not perform very well, likely because the mentions are long and semantically heterogeneous. The results show that the BERT-based models perform better than the RoBERTa-based models.
Related Work
There have been several attempts to construct language models for the cybersecurity domain. Roy et al. (2017, 2019) propose techniques to efficiently learn domain-specific language models from a small in-domain corpus by incorporating external domain knowledge. They train Word2Vec models using malware descriptions. Similarly, Mumtaz et al. (2020) train a Word2Vec model using security-vulnerability bulletins and Wikipedia pages.
Recently, transformer-based models have been built for the cybersecurity domain: CyBERT (Priyanka Ranade and Finin, 2021), SecBERT (jackaduma, 2022), and SecureBERT (Aghaei et al., 2023). CyBERT is trained with a relatively small corpus consisting of 500 security blogs, 16,000 CVE records, and the APTnotes collection. Further, CyBERT applies continual pretraining and uses the BERT model's vocabulary after adding the 1,000 most frequent words in its corpus that do not exist in the base vocabulary. SecBERT provides both BERT and RoBERTa models trained on a security corpus consisting of APTnotes, the SemEval-2018 Task 8 dataset, and Stucco-Data, which contains security blogs and reports. However, the details about the data and any experimental results are not available. SecureBERT trains a RoBERTa model using security reports, white papers, academic books, etc., which are similar to our dataset in both size and document type. However, the model is built using the continual pretraining method, while CTI-BERT is trained from scratch. We believe that the main difference in performance comes from CTI-BERT being trained from scratch with a vocabulary specialized to the domain, compared to the extended vocabularies used in CyBERT and SecureBERT. Table 14 compares the training strategies used for these models.

Conclusion
We presented a new pretrained BERT model tailored for the cybersecurity domain. Specifically, we designed the model to improve the accuracy of cyber-threat intelligence extraction and understanding, such as security entity (IoC) extraction and attack technique (TTP) classification. As demonstrated by the experiments in Section 4, our model outperforms existing general-domain and other cybersecurity-domain models with the same base architecture. For future work, we plan to collect more documents to improve the model and to train other language models to support different security applications.

Limitations
The model is pretrained using only English data. While the majority of cybersecurity-related information is distributed in English, we consider adding support for multiple languages in future work. Further, while we demonstrate that CTI-BERT outperforms other security-specific LMs on a variety of tasks, the benchmark datasets are relatively small. Thus, the findings may not be conclusive, and further evaluations with more data are needed.

Ethical Considerations
To our knowledge, this research poses very low ethical risk. All datasets were collected from reputable, publicly available sources. The only personal information in our corpus is the authors' names and affiliations in the USENIX Security proceedings. We neither expose their identities nor use this information in this work.

A Details on Model Training
Table 1 summarizes our document categories and their statistics.

Table 1: Summary of our datasets

Security Textbook The dataset contains two online textbooks for the CISSP (Certified Information Systems Security Professional) certification test.

Table 3: ATT&CK Technique Classification Results

Table 4: IoT App Classification Results

Table 5: Malware Sentence Classification Results

Table 6: Performance for ActionName attributes

Table 7: Performance for Capability attributes

Table 8: Performance for StrategicObjective attributes

Table 9: Performance for TacticalObjective attributes

Table 13: Comparison of Vocabulary. For a fair comparison, we generated our tokenizer with 30,000 tokens.

Table 14: Comparison of Model Training. "-" indicates that the information is not available.

Table 15: Summary of TRAM Data

Table 16: Summary of IoTSpotter Data

Table 17: Summary of the Malware Sentence Data

Table 18: Summary of the NER1 Dataset

Table 19: Entity Types and Distributions in the NER1 Dataset

Table 21: Entity Types and Distributions in the NER2 Dataset

Table 22: Dataset for Token Type Classification

Table 23: Data Statistics for Malware Attribute Classification