DarkBERT: A Language Model for the Dark Side of the Internet

Youngjin Jin; Eugene Jang; Jian Cui; Jin-Woo Chung; Yongjae Lee; Seungwon Shin

doi:10.18653/v1/2023.acl-long.415

Abstract

Recent research has suggested that there are clear differences in the language used on the Dark Web compared to that of the Surface Web. As studies on the Dark Web commonly require textual analysis of the domain, language models specific to the Dark Web may provide valuable insights to researchers. In this work, we introduce DarkBERT, a language model pretrained on Dark Web data. We describe the steps taken to filter and compile the text data used to train DarkBERT to combat the extreme lexical and structural diversity of the Dark Web that may be detrimental to building a proper representation of the domain. We evaluate DarkBERT and its vanilla counterpart along with other widely used language models to validate the benefits that a Dark Web domain-specific model offers in various use cases. Our evaluations show that DarkBERT outperforms current language models and may serve as a valuable resource for future research on the Dark Web.

Anthology ID: 2023.acl-long.415
Volume: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2023
Address: Toronto, Canada
Editors: Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 7515–7533
URL: https://aclanthology.org/2023.acl-long.415
DOI: 10.18653/v1/2023.acl-long.415
Cite (ACL): Youngjin Jin, Eugene Jang, Jian Cui, Jin-Woo Chung, Yongjae Lee, and Seungwon Shin. 2023. DarkBERT: A Language Model for the Dark Side of the Internet. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7515–7533, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal): DarkBERT: A Language Model for the Dark Side of the Internet (Jin et al., ACL 2023)
PDF: https://aclanthology.org/2023.acl-long.415.pdf
Video: https://aclanthology.org/2023.acl-long.415.mp4
