SYSML: StYlometry with Structure and Multitask Learning: Implications for Darknet Forum Migrant Analysis

Darknet market forums are frequently used to exchange illegal goods and services between parties who use encryption to conceal their identities. The Tor network is used to host these markets, which guarantees additional anonymization from IP and location tracking, making it challenging to link malicious users who operate multiple accounts (sybils). Additionally, users migrate to new forums when one is closed, further increasing the difficulty of linking users across multiple forums. We develop a novel stylometry-based multitask learning approach for natural language and model interactions using graph embeddings to construct low-dimensional representations of short episodes of user activity for authorship attribution. We provide a comprehensive evaluation of our methods across four different darknet forums, demonstrating their efficacy over the state-of-the-art, with a lift of up to 2.5X on Mean Reciprocal Rank and 2X on Recall@10.


Introduction
Crypto markets are "online forums where goods and services are exchanged between parties who use digital encryption to conceal their identities" (Martin, 2014). They are typically hosted on the Tor network, which guarantees anonymization in terms of IP and location tracking. The identity of individuals on a crypto-market is associated only with a username; therefore, building trust on these networks does not follow conventional models prevalent in eCommerce. Interactions on these forums are facilitated by means of text posted by their users. This makes the analysis of textual style on these forums a compelling problem.
Stylometry is the branch of linguistics concerned with the analysis of authors' style. Text stylometry was initially popularized in the area of forensic linguistics, specifically for the problems of author profiling and author attribution (Juola, 2006; Rangel et al., 2013). Traditional techniques for authorship analysis on such data rely upon the existence of long text corpora from which features such as the frequency of words, capitalization, punctuation style, word and character n-grams, and function word usage can be extracted and subsequently fed into any statistical or machine learning classification framework, acting as an author's 'signature'. However, such techniques find limited use in short text corpora in a heavily anonymized environment.
Advancements in using neural networks for character and word-level modeling for authorship attribution aim to deal with the scarcity of easily identifiable 'signature' features and have shown promising results on shorter text (Shrestha et al., 2017). Andrews and Witteveen (2019) drew upon these advances in stylometry to propose a model for building representations of social media users on Reddit and Twitter. Motivated by the success of such approaches, we develop a novel methodology for building authorship representations for posters on various darknet markets. Specifically, our key contributions include: First, a representation learning approach that couples temporal content stylometry with access identity (by leveraging forum interactions via meta-path graph context information) to model and enhance user (author) representation. Second, a novel framework for training the proposed models in a multitask setting across multiple darknet markets, using a small dataset of labeled migrations, to refine the representations of users within each individual market, while also providing a method to correlate users across markets. Third, a detailed drill-down ablation study discussing the impact of various optimizations and highlighting the benefits of both graph context and multitask learning on forums associated with four darknet markets (Black Market Reloaded, Agora Marketplace, Silk Road, and Silk Road 2.0) when compared to the state-of-the-art alternatives.

Related Work
Darknet Market Analysis: Content on the dark web includes resources devoted to illicit drug trade, adult content, counterfeit goods and information, leaked data, fraud, and other illicit services (Lapata et al., 2017; Biryukov et al., 2014). Also included are forums discussing politics, anonymization, and cryptocurrency. Biryukov et al. (2014) found that while a vast majority of these services were in English (about 84%), a total of about 17 different languages were detected. Analysis of the volume of transactions and number of users on darknet markets indicates that they are resilient to closures; rapid migrations to newer markets occur when one market shuts down (ElBahrawy et al., 2019).
Recent work (Fan et al., 2018; Hou et al., 2017; Fu et al., 2017; Dong et al., 2017) has leveraged the notion of a heterogeneous information network (HIN) embedding to improve graph modeling, where different types of nodes, relationships (edges), and paths can be represented through typed entities. Zhang et al. (2019) used a HIN to model marketplace vendor sybil accounts on the darknet (a single author can have multiple user accounts, which are considered sybils), where each node representing an object is associated with various features (e.g., content, photography style, user profile, and drug information). Similarly, Kumar et al. (2020) proposed a multiview unsupervised approach which incorporated features of text content, drug substances, and locations to generate vendor embeddings. We note that while such efforts (Zhang et al., 2019; Kumar et al., 2020) are related to our work, there are key distinctions. First, such efforts focus only on vendor sybil accounts. Second, in both cases, they rely on a host of multi-modal information sources (photographs, substance descriptions, listings, and location information) that are not readily available in our setting, which is limited to forum posts. Third, neither effort exploits multitask learning.
Authorship Attribution of Short Text: Kim (2014) introduced convolutional neural networks (CNNs) for text classification. Follow-up work on authorship attribution (Ruder et al., 2016; Shrestha et al., 2017) leveraged these ideas to demonstrate that CNNs outperformed other models, particularly for shorter texts. The models proposed in these works aimed at balancing the trade-off between vocabulary size and sequence length budgets based on tokenization at either the character or word level. Further work on subword tokenization (Sennrich et al., 2016), especially byte-level tokenization, has made it feasible to share vocabularies across data in multiple languages.
Models built using subword tokenizers have achieved good performance on authorship attribution tasks for specific languages (e.g., Polish (Grzybowski et al., 2019)) and also across multilingual social media data (Andrews and Bishop, 2019). Non-English as well as multilingual darknet markets have been increasing in number (Ebrahimi et al., 2018). Our work builds upon all these ideas by using CNN models and experimenting with both character and subword level tokens.
Multitask learning (MTL): MTL (Caruana, 1997) aims to improve a machine learning model's performance on the original task by jointly training it on related tasks. MTL enables deep neural network-based models to generalize better by sharing some of the hidden layers among the related tasks. Different approaches to MTL can be contrasted based on how parameters are shared across tasks: strictly equal across tasks (hard sharing) or constrained to be close (soft sharing) (Ruder, 2017). Such approaches have been applied to language modeling (Howard and Ruder, 2018), machine translation (Dong et al., 2015), and dialog understanding (Rastogi et al., 2018).

SYSML Framework
Motivated by the success of social media user modeling using combinations of multiple posts by each user (Andrews and Bishop, 2019; Noorshams et al., 2020), we model posts on darknet forums using episodes. Each episode consists of the textual content, time, and contextual information from multiple posts. A neural network architecture $f_\theta$ maps each episode to a combined representation $e \in \mathbb{R}^E$. The model used to generate this representation is trained on various metric learning tasks characterized by a second set of parameters, $g_\phi : \mathbb{R}^E \to \mathbb{R}$. We design the metric learning task to ensure that episodes having the same author have similar embeddings. Figure 1 describes the architecture of this workflow and the following sections describe the individual components and corresponding tasks. Note that our base modeling framework is inspired by the social media user representations built by Andrews and Bishop (2019) for a single task. We add meta-path embeddings and multitask objectives to [...] on darkweb markets (Tai et al., 2019).
To identify different user accounts across markets that correspond to the same author, we follow a two-step process. First, we select the posts containing a PGP key, and then pair together users who have posts containing the same PGP key. Following this, we still have a large number of potentially incorrect matches (including scenarios such as information sharing posts by users sharing the PGP key of known vendors from a previous market). Second, we manually check each pair to identify matches that clearly indicate whether the same author or different authors posted them, leading to approximately 100 reliable labels, with 33 pairs matched as migrants across markets.
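The first, automatic step of the pairing process described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the `(market, username, text)` post schema, the function names, and the whitespace normalization are all assumptions made for the example. It proposes cross-market candidate pairs only; the manual verification step is still required afterwards.

```python
import re
from collections import defaultdict
from itertools import combinations

# Matches an ASCII-armored PGP public key block embedded in a post.
PGP_BLOCK = re.compile(
    r"-----BEGIN PGP PUBLIC KEY BLOCK-----(.*?)-----END PGP PUBLIC KEY BLOCK-----",
    re.DOTALL,
)


def normalize_key(block_body):
    """Drop all whitespace so the same key pasted with different wrapping compares equal."""
    return "".join(block_body.split())


def candidate_pairs(posts):
    """Return cross-market account pairs that posted an identical PGP key.

    posts: iterable of (market, username, text) tuples (hypothetical schema).
    """
    accounts_by_key = defaultdict(set)
    for market, user, text in posts:
        for body in PGP_BLOCK.findall(text):
            accounts_by_key[normalize_key(body)].add((market, user))

    pairs = set()
    for accounts in accounts_by_key.values():
        for a, b in combinations(sorted(accounts), 2):
            if a[0] != b[0]:  # keep only pairs spanning two different markets
                pairs.add((a, b))
    return pairs
```

Because users also repost the keys of known vendors, every pair returned here is only a candidate and may be a false match.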

Evaluation
While ground truth labels for a single author having multiple accounts are unavailable, individual models can still be compared by measuring their performance on authorship attribution as a proxy. We evaluated our method using retrieval-based metrics over the embeddings generated by each approach. Denote the set of all episode embeddings as $E = \{e_1, \ldots, e_n\}$ and let $Q = \{q_1, q_2, \ldots, q_\kappa\} \subset E$ be the sampled subset of query episodes. We computed the cosine similarity of each query episode embedding with all episodes. Let $R_i = (r_{i1}, r_{i2}, \ldots, r_{in})$ denote the list of episodes in $E$ ordered by their cosine similarity with episode $q_i$ (excluding itself), and let $A(\cdot)$ map an episode to its author. The following measures are computed.
Mean Reciprocal Rank (MRR): The RR for an episode is the reciprocal of the rank of the first element (by similarity) with the same author. MRR is the mean of reciprocal ranks for a sample of episodes.
Recall@k (R@k): Following Andrews and Bishop (2019), we define the R@k for an episode $e_i$ to be an indicator denoting whether an episode by the same author occurs within the subset $r_{i1}, \ldots, r_{ik}$. R@k denotes the mean of these recall values over all the query samples.
Baselines: We compare our best model against two baselines. First, we consider a popular short text authorship attribution model (Shrestha et al., 2017) based on embedding each post using character CNNs. While the original method has no support for additional attributes (time, context) and considers only a single post at a time, we also compare variants that incorporate these features. The second method for comparison is invariant representation of users (Andrews and Bishop, 2019). This method considers only one dataset at a time and does not account for graph-based context information. Results for episodes of length 5 are shown in Table 2.

Analysis

Model and Task Variations
To compare the variants using statistical tests, we compute the MRR of the data grouped by market, episode length, tokenizer, and a graph embedding indicator. This leaves a small number of samples for paired comparison between groups, which precludes making normality assumptions for a t-test. Instead, we applied the paired two-sample Wilcoxon-Mann-Whitney (WMW) test (Mann and Whitney, 1947). The first key contribution of our model is the use of meta-graph embeddings for context. The WMW test demonstrates that using pretrained graph embeddings was significantly better than using random embeddings (p < 0.01). Table 2 shows a summary of these results using ablations. For completeness of the analysis, we also compare the character and BPE tokenizers. WMW failed to find any significant differences between the BPE and character models for embedding (table omitted for brevity). Many darkweb markets tend to have more than one language (e.g., BMR had a large German community), and BPE allows a shared vocabulary to be used across multiple datasets with very few out-of-vocabulary tokens. Thus, we use BPE tokens for the forthcoming multitask models.
Multitask: Our second key contribution is the multitask setup. [...] (Musgrave et al., 2020; Zhai and Wu, 2019). We experimented with various state-of-the-art metric learning methods (§3.3) in the multitask setup and found that softmax-based classification (SM) was the best performing method in 3 of 4 cases for episodes of length 5 (Figure 7). Across all lengths, SM is significantly better (WMW: p < 1e-8) and therefore we use SM in SYSML.

Novel Users
The dataset statistics (Table 1) [...] understand the distribution of performance across these two configurations, we compute the test metrics over two samples. For one sample, we constrain the sampled episodes to those by users who have at least one episode in the training period (Seen Users). For the second sample, we sample episodes from the complement of the episodes that satisfy the previous constraint (Novel Users). Figure 8 shows the comparison of MRR on these two samples against the best single task model for episodes of length 5. Unsurprisingly, the first sample (Seen Users) has better query metrics than the second (Novel Users). Importantly, however, both of these groups outperformed the best single task model results on the first group (Seen Users), which demonstrates that the lift offered by the multitask setup is spread across all users.
Episode Length: Figure 9 shows a comparison of the mean performance of each model across various episode lengths. We see that compared to the baselines, SYSML can combine contextual and stylistic information across multiple posts more effectively.
Additional results (see appendix) indicate that this trend continues for larger episode sizes.
In this section, we consider the average (Euclidean) distance between each pair of episodes by the same author as a heuristic for stylometric identifiability (SI), where lower average distance corresponds to higher SI and vice versa. Somewhat surprisingly, authors with a small number of total episodes (< 10) were found at both extremes of identifiability, while the authors with the highest number of episodes were in the intermediate regions, suggesting that SI is not strongly correlated with episode length. Next, we further investigate these groups.
High SI authors: Among the 20 users with the lowest average distance between episodes, a single pattern is prominent. This first group of high SI users consists of "newbie" users. On a majority of the analyzed forums, a user must make a minimum number of posts before posting restrictions are removed from the account. Thus, users create threads on 'Newbie Discussion' subforums. Typical posts on these threads include repeated posting of the same message or numbered posts counting up to the minimum required. As users tend to make all these posts within a fixed time frame, the combination of repeated, stylistically similar text and time makes the posts easy to identify. Exemplar episodes from this "newbie" group are shown in Table 3.
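The SI heuristic used in this case study (the mean Euclidean distance over all pairs of episodes by one author) is simple enough to state as code. The following is a minimal NumPy sketch under the assumption that one author's episode embeddings are stacked in an array of shape `(n_episodes, dim)`; the function name is illustrative and does not come from the original implementation.

```python
import numpy as np


def mean_intra_author_distance(embeddings):
    """Average Euclidean distance over all unordered pairs of one author's episodes.

    A lower value corresponds to higher stylometric identifiability (SI).
    """
    embeddings = np.asarray(embeddings, dtype=float)
    n = embeddings.shape[0]
    if n < 2:
        raise ValueError("need at least two episodes to form a pair")
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)   # (n, n) matrix of pairwise distances
    upper = np.triu_indices(n, k=1)          # count each unordered pair once
    return float(dists[upper].mean())
```

Ranking authors by this value in ascending order would place the most identifiable authors first.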
After filtering these users out, we identified a few more notable high SI users. These include an author on BMR who frequently used the '£' symbol and ellipses ('...') and an author on Agora who only posted referral links (with an eponymous username 'ReferralLink'). Finally, restricting posts to those made by the 200 most frequently posting users (henceforth, T200), we found a user (labeled with the pseudonym HSI-Sec) who frequently provided information on security, where character n-grams corresponding to 'PGP', 'Key', and 'security' are frequent.
[...] (2012) has demonstrated that obfuscation and imitation based strategies are effective against text stylometry. We analyze the T200 authors who had high inter-episode distances to ascertain whether this holds true for SYSML. For the least (and third least) identifiable author among T200, we find that frequent word n-grams are significantly less frequent than those for the most identifiable author from this subset (the most frequent token occurs ∼600 times vs. ∼4800 times for the identifiable author) despite having more episodes overall. Further, one of the most frequent tokens is the [QUOTE] token, implying that this author frequently incorporates other authors' quotes into their posts. This strategy is analogous to the imitation based attack strategy proposed by Brennan et al. (2012). For the second least identifiable T200 author, we find that the frequent tokens have even fewer occurrences, and the special token [IMAGE] and its alternatives are among the frequent tokens, suggesting that an obfuscation strategy based on diversifying the vocabulary is effective. Some samples are presented in Table 4 under LSI-1 and LSI-2.
Gradient-based attribution: To cement our preceding hypotheses, we investigate whether the generated embedding can be attributed to the phrases in the input that were mentioned in the previous section. We use Integrated Gradients (Sundararajan et al., 2017), an axiomatic approach to input attribution. Integrated Gradients assigns an importance score to each feature, which corresponds to an approximation of the integral of the gradient of a model's output with respect to the input features.
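To make the attribution rule concrete, the sketch below implements Integrated Gradients from scratch for a toy model with an analytic gradient, approximating the path integral with Gauss-Legendre quadrature. It is an illustration only: the toy quadratic model, the all-zeros baseline, and the function names are assumptions for the example, and it is not the Captum implementation used in the paper.

```python
import numpy as np


def integrated_gradients(grad_fn, x, baseline, n_steps=20):
    """Approximate IG_i = (x_i - b_i) * integral_0^1 dF/dx_i(b + a * (x - b)) da."""
    # Gauss-Legendre nodes and weights live on [-1, 1]; rescale them to [0, 1].
    nodes, weights = np.polynomial.legendre.leggauss(n_steps)
    alphas = 0.5 * (nodes + 1.0)
    weights = 0.5 * weights

    total = np.zeros_like(x, dtype=float)
    for alpha, weight in zip(alphas, weights):
        total += weight * grad_fn(baseline + alpha * (x - baseline))
    return (x - baseline) * total


# Toy differentiable model (an assumption for illustration): F(x) = sum_i c_i * x_i^2.
coef = np.array([1.0, -2.0, 0.5])


def model(x):
    return float(np.sum(coef * x ** 2))


def model_grad(x):
    return 2.0 * coef * x


x = np.array([1.0, 2.0, -3.0])
baseline = np.zeros_like(x)
attributions = integrated_gradients(model_grad, x, baseline)
```

For this toy model the attributions sum to `model(x) - model(baseline)`, which is the completeness axiom of Integrated Gradients. In the setting of the paper, the features would be input token embeddings rather than three scalars.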

A Ethics Statement
The research conducted in this study was deemed to be exempt research by the Ohio State University's Office of Responsible Research Practices, since the forum data is classified as 'publicly available'. Darknet forum data is readily available publicly across multiple markets (Branwen et al., 2015; Munksgaard and Demant, 2016) and we follow standard practices for the darkweb (Kumar et al., 2020), limiting our analysis to publicly available information only. The data was originally collected to study the prevalence of illicit drug trade and the politics surrounding such trades.
Limiting Harm: To the best of our knowledge, the collected data does not contain leaked private information (Munksgaard and Demant, 2016). Beyond relying on the exempt nature of the study, we also strive to take further steps to minimize harms from our research. In accordance with the ACM Code of Ethics and to limit potential harm, we carry out substantial pre-processing (§4) to remove links, images, and keys that may contain sensitive information. To respect the privacy of subjects, we do not connect the identity of users to any private information; our method serves only to link users across markets. Further, in this study, we restrict our analysis to darknet markets that have been inactive for several years. The darknet market community has itself taken steps over the past few years to link identities of trustworthy members across market closures via the development of information hubs such as Grams, Kilos, and Recon (Broadhurst et al., 2021). Our efforts aim to understand the formative years that led towards this centralization.
Inclusiveness: Our methods do not attempt to characterize any traits of the users making the posts. Based on our analysis, the datasets contain posts in English, German, and Italian. Thus, our methods may be limited in applicability to, and biased in performance towards, these and related Indo-European languages.
Potential for Dual Use: Our goal is to understand how textual style evolves on darknet markets and how users on such markets may misuse them for scams and illicit activities. This digital forensic analysis can be put to good use for understanding trust signalling on these markets. We understand the potential harm from dual use; stylometric methods could be used to identify users who may not want their identity to be made public, especially when they are subject to hostile governments. We believe that making the existence of such stylometric advances public and providing prescriptions for avoidance techniques (§7.1) would aid users who may not know of strategies that they can use to preserve their anonymity. Existing work (Noorshams et al., 2020; Andrews and Bishop, 2019) has already expanded the use of stylometry to the open web. Thus, we have made the analysis of patterns that lower stylometric identifiability one focus of our case study.

B Reproducibility
We describe the various hyperparameter settings used for the models we trained. All deep learning models are implemented in Python using PyTorch, and the original C++ implementation of metapath2vec is used for generating metapath embeddings. For Integrated Gradients, we used the implementation from the Captum Python library (Kokhlikyan et al., 2020), which uses the Gauss-Legendre quadrature rule for approximating the integral of the gradients.

C Training Hyperparameters
We use batches of size 256. The Adam optimizer is used for training each network. The initial learning rate is set to 1e-3, with a multiplicative decay factor of 0.5 applied if the validation metrics do not improve after 5 epochs. Each model is trained for 30 epochs, and each configuration is run 5 times. We used a V100 GPU to train each run, with an average running time of 27:17 (mm:ss) per run. For each run, 10% of the dataset is used for validation. The best model is selected on the basis of minimum validation loss.
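The optimizer and decay settings above map naturally onto standard PyTorch components. The following is a minimal sketch, not the authors' training script: the linear stand-in model, the random input batch, and the constant placeholder validation loss are assumptions, as is the choice to monitor validation loss (the text only says "validation metrics"). Note that PyTorch's `patience=5` tolerates five epochs without improvement and halves the rate on the sixth.

```python
import torch
from torch import nn

model = nn.Linear(128, 10)  # hypothetical stand-in for the episode encoder
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5
)

lr_history = []
for epoch in range(30):                      # 30 epochs, as stated above
    optimizer.zero_grad()
    batch = torch.randn(256, 128)            # batch size 256; random data for illustration
    loss = model(batch).pow(2).mean()        # placeholder training objective
    loss.backward()
    optimizer.step()

    val_loss = 1.0                           # placeholder: a validation loss that never improves
    scheduler.step(val_loss)                 # the scheduler decides whether to halve the rate
    lr_history.append(optimizer.param_groups[0]["lr"])
```

With a validation loss that never improves, `lr_history` starts at 1e-3 and is halved each time the patience window is exhausted.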

C.1.1 Text Embedding Model
Character vocabularies of size 1k and BPE vocabularies of size 30k are trained using only the training portion of the datasets. The HuggingFace Tokenizers library is used to build the byte-level BPE vocabulary. We use a Text CNN for embedding text across all settings. Each token has a 32-dimensional embedding, and the final embedding dimension for a text sequence is set to 128. Filters of sizes {2, 3, 4, 5} are used, and the dropout probability is set to 0.1 for the final layer.
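A Text CNN with the dimensions listed above could look like the following PyTorch sketch. Several details are assumptions rather than facts from the paper: the number of filters per kernel size (`n_filters=64`), the use of ReLU with max-over-time pooling, the padding index, and the placement of dropout just before the final projection.

```python
import torch
from torch import nn


class TextCNN(nn.Module):
    """Embeds a token-id sequence into a fixed-size vector (illustrative sketch)."""

    def __init__(self, vocab_size=30_000, token_dim=32, out_dim=128,
                 filter_sizes=(2, 3, 4, 5), n_filters=64, dropout=0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, token_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            [nn.Conv1d(token_dim, n_filters, kernel_size=k) for k in filter_sizes]
        )
        self.dropout = nn.Dropout(dropout)
        self.proj = nn.Linear(n_filters * len(filter_sizes), out_dim)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len), with seq_len >= the largest filter size
        x = self.embed(token_ids).transpose(1, 2)       # (batch, token_dim, seq_len)
        # One max-over-time pooled feature vector per filter size.
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        features = torch.cat(pooled, dim=1)             # (batch, n_filters * n_sizes)
        return self.proj(self.dropout(features))        # (batch, out_dim)
```

Calling the module on a `(batch, seq_len)` tensor of BPE or character ids returns one 128-dimensional vector per post.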

C.1.2 Time Embedding
The time embedding dimension is set to 64.

C.1.3 Context Embedding
The context embedding dimension is set to 128. For metapath2vec, we generate 1000 walks for each user (author) node, the number of negative samples for each user is 5, and the window size in the skip-gram model is set to 7. These hyperparameters are also used in the metapath2vec work. We set the length of each sampled walk to 80, a value widely used in representative skip-gram based embedding methods such as node2vec.

C.1.4 Pooling Transformer
The pooling transformer model has a feed forward layer dimension and final dimension of 128. There are 4 layers, each with 4 heads. The dropout probability for the final feed forward layer is set to 0.1, and the output dimension is set to 32.

D Parameter Search
Most hyperparameter comparisons are reported in the paper. For the multitask dataset sampling, we ran the multitask model with $P_{cr} \in \{0.01, 0.02, 0.04, 0.1\}$, with $P_M = 1 - P_{cr}$ and $P_{M_i} \propto |M_i|$, and observed similar performance up to $P_{cr} = 0.04$ and a drop at $P_{cr} = 0.1$. All results reported in the paper have $P_{cr} = 0.01$.
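One way to read the sampling scheme above is sketched below. The interpretation is an assumption, since this appendix does not define the symbols in full: $P_{cr}$ is taken to be the probability of drawing a batch for the cross-market task, and the remaining mass $P_M = 1 - P_{cr}$ is split across the individual markets in proportion to their sizes $|M_i|$. The market names and sizes in the example are placeholders, not dataset statistics.

```python
import random


def task_probabilities(market_sizes, p_cr=0.01):
    """Split probability mass between single-market tasks and the cross-market task.

    market_sizes: mapping of market name -> number of training examples |M_i|.
    """
    total = sum(market_sizes.values())
    probs = {
        name: (1.0 - p_cr) * size / total   # P_{M_i} proportional to |M_i|
        for name, size in market_sizes.items()
    }
    probs["cross_market"] = p_cr            # assumed meaning of P_cr
    return probs


def sample_task(probs, rng=random):
    """Draw the task that supplies the next training batch."""
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=1)[0]


# Placeholder sizes for illustration only.
example_probs = task_probabilities({"market_a": 300, "market_b": 100}, p_cr=0.04)
```

With these placeholder sizes, `market_a` is drawn three times as often as `market_b`, and the cross-market task supplies about 4% of the batches.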

E Metrics
All metrics are computed using a sample of the episode embeddings. The sample size used for computing the metrics is $\kappa = 1000$.
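The two retrieval metrics defined in the Evaluation section (MRR and Recall@k over cosine-similarity rankings) can be computed as in the following NumPy sketch. The function name and array layout are assumptions for illustration; queries with no other episode by the same author contribute a reciprocal rank of 0 here, which is a convention chosen for the example rather than one stated in the paper.

```python
import numpy as np


def retrieval_metrics(embeddings, authors, query_idx, k=10):
    """Return (MRR, Recall@k) for the given query episodes.

    embeddings: (n, d) array of episode embeddings.
    authors:    length-n sequence of author labels, i.e. A(e_i).
    query_idx:  indices of the sampled query episodes.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    authors = np.asarray(authors)
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    reciprocal_ranks, hits = [], []
    for q in query_idx:
        sims = unit @ unit[q]                 # cosine similarity to every episode
        sims[q] = -np.inf                     # push the query itself to the end
        order = np.argsort(-sims)[:-1]        # ranked list R_i without the query
        same_author = authors[order] == authors[q]
        if same_author.any():
            first = int(np.argmax(same_author))   # 0-based rank of the first match
            reciprocal_ranks.append(1.0 / (first + 1))
        else:
            reciprocal_ranks.append(0.0)
        hits.append(float(same_author[:k].any()))
    return float(np.mean(reciprocal_ranks)), float(np.mean(hits))
```

To mirror the sampling described above, `query_idx` could be drawn as 1000 distinct indices with `numpy.random.default_rng().choice(n, size=1000, replace=False)` when at least 1000 episodes are available.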

F Additional Results
From Figure 11, we see that the number of users reduces rapidly as the posts per user decrease. Thus, we limited our analysis to up to 5 posts per episode. For completeness, we also provide additional results for 7 and 9 posts per episode in Tables 5 and 6, respectively. Note that the histogram has some non-smooth bumps at around 10, 50, and 100 posts, as these act as the minimum number of posts for different levels of forum users. As explained in a previous section, users post on 'newbie' forums until they reach a specific number of posts, leading to these unusual bumps in the histogram. We note that the performance of our methods continues to improve as the posts per episode are increased (at a cost to coverage, i.e., the number of users studied), though the improvement is higher in the bigger markets as these tend to have a sufficiently large number of individuals with a higher number of total posts.
Table 6: Additional results for 9 posts per episode