WAVE-27K: Bringing together CTI sources to enhance threat intelligence models

Felipe Castaño; Amaia Gil-Lerchundi; Raul Orduna-Urrutia; Eduardo Fidalgo Fernandez; Rocío Alaiz-Rodríguez

Abstract

Given the growing flow of information on the internet and the increasing volume of incident-related data from diverse sources, unstructured text processing is gaining importance. We present an automated approach that links several CTI sources by mapping their external references. Our method facilitates the automatic construction of datasets, allowing for updates and the inclusion of new samples and labels. Following this method, we built a new dataset of unstructured CTI descriptions called Weakness, Attack, Vulnerabilities, and Events 27K (WAVE-27K). Our dataset covers 27 different MITRE techniques and contains 22,539 samples related to one technique and 5,262 related to two or more techniques simultaneously. We evaluated five BERT-based models on the WAVE-27K dataset and found that SecRoBERTa reaches the highest performance, with a 77.52% F1 score. Additionally, we compare the performance of SecRoBERTa on the WAVE-27K dataset and on other public datasets. The results show that the model trained on WAVE-27K outperforms the others. These results demonstrate that the data within WAVE-27K contains relevant information and that the proposed method effectively builds a dataset of sufficient quality to train a machine-learning model.
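The linking idea described in the abstract, joining CTI sources through shared external references, can be illustrated with a minimal sketch. This is not the authors' pipeline: the record layout, the `link_by_references` and `normalize` helpers, the placeholder labels `T-A` and `T-B`, and the example URLs are all hypothetical, chosen only to show how a shared reference could yield single-label and multi-label samples.

```python
from collections import defaultdict


def normalize(url):
    # Hypothetical normalization: trim whitespace and a trailing slash so that
    # trivially different spellings of the same reference compare equal.
    return url.strip().rstrip("/")


def link_by_references(descriptions, technique_refs):
    """Label descriptions with every technique that shares a reference with them.

    descriptions: list of dicts with "text" and "references" (list of URLs).
    technique_refs: dict mapping a technique label to its reference URLs.
    Both layouts are assumptions for illustration, not the WAVE-27K schema.
    """
    # Invert the technique mapping: reference URL -> set of technique labels.
    url_to_techniques = defaultdict(set)
    for technique, urls in technique_refs.items():
        for url in urls:
            url_to_techniques[normalize(url)].add(technique)

    samples = []
    for record in descriptions:
        labels = set()
        for url in record["references"]:
            labels |= url_to_techniques.get(normalize(url), set())
        if labels:  # keep only descriptions that could be linked
            samples.append({"text": record["text"], "labels": sorted(labels)})
    return samples


descriptions = [
    {"text": "first description", "references": ["https://example.org/r1/"]},
    {"text": "second description",
     "references": ["https://example.org/r1", "https://example.org/r2"]},
    {"text": "unlinked description", "references": ["https://example.org/other"]},
]
technique_refs = {
    "T-A": ["https://example.org/r1"],
    "T-B": ["https://example.org/r2"],
}
samples = link_by_references(descriptions, technique_refs)
```

In this toy input the first description ends up with one label, the second with two, and the third is dropped because none of its references match. Rebuilding the mapping from refreshed sources is one way such a dataset could be updated with new samples and labels.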

Anthology ID: 2024.nlpaics-1.14
Volume: Proceedings of the First International Conference on Natural Language Processing and Artificial Intelligence for Cyber Security
Month: July
Year: 2024
Address: Lancaster, UK
Editors: Ruslan Mitkov, Saad Ezzini, Tharindu Ranasinghe, Ignatius Ezeani, Nouran Khallaf, Cengiz Acarturk, Matthew Bradbury, Mo El-Haj, Paul Rayson
Venue: NLPAICS
Publisher: International Conference on Natural Language Processing and Artificial Intelligence for Cyber Security
Pages: 119–126
URL: https://aclanthology.org/2024.nlpaics-1.14/
Cite (ACL): Felipe Castaño, Amaia Gil-Lerchundi, Raul Orduna-Urrutia, Eduardo Fidalgo Fernandez, and Rocío Alaiz-Rodríguez. 2024. WAVE-27K: Bringing together CTI sources to enhance threat intelligence models. In Proceedings of the First International Conference on Natural Language Processing and Artificial Intelligence for Cyber Security, pages 119–126, Lancaster, UK. International Conference on Natural Language Processing and Artificial Intelligence for Cyber Security.
Cite (Informal): WAVE-27K: Bringing together CTI sources to enhance threat intelligence models (Castaño et al., NLPAICS 2024)
PDF: https://aclanthology.org/2024.nlpaics-1.14.pdf
