Detecting Trending Terms in Cybersecurity Forum Discussions

Jack Hughes, Seth Aycock, Andrew Caines, Paula Buttery, Alice Hutchings


Abstract
We present a lightweight method for identifying currently trending terms relative to a known prior of terms, using a weighted log-odds ratio with an informative prior. We apply this method to a dataset of posts from an English-language underground hacking forum spanning over ten years of activity, in which posts contain misspellings, orthographic variation, acronyms, and slang. Our statistical approach supports analysis of linguistic change and discussion topics over time without requiring a topic model to be trained for each time interval under analysis. We evaluate the approach against TF-IDF using the discounted cumulative gain metric with human annotations, and find that our method outperforms TF-IDF on information retrieval.
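As a rough illustration of the kind of scoring the abstract describes, the sketch below computes a weighted log-odds ratio with an informative Dirichlet prior in one commonly used formulation. It is not taken from the paper: the function name `weighted_log_odds`, the `prior_scale` parameter, the 0.01 smoothing floor, and the toy counts are all assumptions made for the example, and the authors' exact variant may differ.

```python
import math
from collections import Counter


def weighted_log_odds(current, reference, prior, prior_scale=1.0):
    """Return a z-score per term comparing `current` to `reference` counts.

    Hypothetical helper: `prior` holds background term counts that act as
    Dirichlet pseudo-counts (the informative prior).
    """
    vocab = set(current) | set(reference) | set(prior)
    # Small floor (an assumption) so every term has a non-zero pseudo-count.
    alpha = {w: prior_scale * prior.get(w, 0) + 0.01 for w in vocab}
    alpha_total = sum(alpha.values())
    n_cur = sum(current.values())
    n_ref = sum(reference.values())

    scores = {}
    for w in vocab:
        y_cur = current.get(w, 0) + alpha[w]
        y_ref = reference.get(w, 0) + alpha[w]
        # Difference of smoothed log-odds between the two intervals.
        delta = (math.log(y_cur / (n_cur + alpha_total - y_cur))
                 - math.log(y_ref / (n_ref + alpha_total - y_ref)))
        # Approximate variance of the difference; dividing by its square
        # root down-weights terms supported by few observations.
        variance = 1.0 / y_cur + 1.0 / y_ref
        scores[w] = delta / math.sqrt(variance)
    return scores


# Toy counts (invented for illustration): a current interval vs. an earlier one.
current = Counter({"ransomware": 30, "exploit": 20, "account": 50})
reference = Counter({"ransomware": 5, "exploit": 20, "account": 75})
prior = current + reference  # background counts used as the prior

scores = weighted_log_odds(current, reference, prior)
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy setup, terms whose relative frequency rises in the current interval receive positive scores and sort to the front of `ranked`. Because the computation uses only term counts, it can be repeated for each time interval without fitting a model.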
Anthology ID:
2020.wnut-1.15
Volume:
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)
Month:
November
Year:
2020
Address:
Online
Editors:
Wei Xu, Alan Ritter, Tim Baldwin, Afshin Rahimi
Venue:
WNUT
Publisher:
Association for Computational Linguistics
Pages:
107–115
URL:
https://aclanthology.org/2020.wnut-1.15
DOI:
10.18653/v1/2020.wnut-1.15
Cite (ACL):
Jack Hughes, Seth Aycock, Andrew Caines, Paula Buttery, and Alice Hutchings. 2020. Detecting Trending Terms in Cybersecurity Forum Discussions. In Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), pages 107–115, Online. Association for Computational Linguistics.
Cite (Informal):
Detecting Trending Terms in Cybersecurity Forum Discussions (Hughes et al., WNUT 2020)
PDF:
https://aclanthology.org/2020.wnut-1.15.pdf