Cristian Sandescu


pdf bib
Interpretable Identification of Cybersecurity Vulnerabilities from News Articles
Pierre Frode de la Foret | Stefan Ruseti | Cristian Sandescu | Mihai Dascalu | Sebastien Travadel
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

With the increasing adoption of technology, more and more systems become target to information security breaches. In terms of readily identifying zero-day vulnerabilities, a substantial number of news outlets and social media accounts reveal emerging vulnerabilities and threats. However, analysts often spend a lot of time looking through these decentralized sources of information in order to ensure up-to-date countermeasures and patches applicable to their organisation’s information systems. Various automated processing pipelines grounded in Natural Language Processing techniques for text classification were introduced for the early identification of vulnerabilities starting from Open-Source Intelligence (OSINT) data, including news websites, blogs, and social media. In this study, we consider a corpus of more than 1600 labeled news articles, and introduce an interpretable approach to the subject of cyberthreat early detection. In particular, an interpretable classification is performed using the Longformer architecture alongside prototypes from the ProSeNet structure, after performing a preliminary analysis on the Transformer’s encoding capabilities. The best interpretable architecture achieves an 88% F2-Score, arguing for the system’s applicability in real-life monitoring conditions of OSINT data.