A Sweet Rabbit Hole by DARCY: Using Honeypots to Detect Universal Trigger’s Adversarial Attacks

The Universal Trigger (UniTrigger) is a recently proposed, powerful adversarial textual attack method. Utilizing a learning-based mechanism, UniTrigger generates a fixed phrase that, when added to any benign input, can drop the prediction accuracy of a textual neural network (NN) model to near zero on a target class. To defend against this attack, which can cause significant harm, we borrow the "honeypot" concept from the cybersecurity community and propose DARCY, a honeypot-based defense framework against UniTrigger. DARCY greedily searches for and injects multiple trapdoors into an NN model to "bait and catch" potential attacks. Through comprehensive experiments across four public datasets, we show that DARCY detects UniTrigger's adversarial attacks with up to 99% TPR and less than 2% FPR in most cases, while maintaining the prediction accuracy (in F1) on clean inputs within a 1% margin. We also demonstrate that DARCY with multiple trapdoors is robust to a diverse set of attack scenarios in which attackers have varying levels of knowledge and skill. We release the source code of DARCY at: https://github.com/lethaiq/ACL2021-DARCY-HoneypotDefenseNLP.


Introduction
Adversarial examples in NLP refer to carefully crafted texts that can fool predictive machine learning (ML) models. Malicious actors, i.e., attackers, can thus exploit such adversarial examples to force ML models to output desired predictions. There are several adversarial example generation algorithms, most of which perturb an original text at the character (e.g., Gao et al., 2018), word (e.g., Ebrahimi et al., 2018; Wallace et al., 2019; Gao et al., 2018; Garg and Ramakrishnan, 2020), or sentence level (e.g., Le et al., 2020; Gan and Ng; Cheng et al.).

Original:   this movie is awesome
Attack:     zoning zoombie this movie is awesome
Prediction: Positive → Negative

Original:   this movie is such a waste!
Attack:     charming this movie is such a waste!
Prediction: Negative → Positive

Because most existing attack methods are instance-based search methods, i.e., they search for an adversarial example for each specific input, they usually do not involve any learning mechanisms. A few learning-based algorithms, such as the Universal Trigger (UniTrigger) (Wallace et al., 2019), MALCOM (Le et al., 2020), Seq2Sick (Cheng et al.) and Paraphrase Network (Gan and Ng), "learn" to generate adversarial examples that generalize not to one specific input but to a wide range of unseen inputs.
In general, learning-based attacks are more attractive to attackers for several reasons. First, they achieve high attack success rates. For example, UniTrigger can drop the prediction accuracy of an NN model to near zero just by appending a learned adversarial phrase of only two tokens to any input (Tables 1 and 2). This is achieved through an optimization process over an entire dataset that exploits potential weak points of the model as a whole, rather than aiming at any specific input. Second, their attack mechanism is highly transferable among similar models. For instance, adversarial examples generated by UniTrigger and MALCOM to attack a white-box NN model are also effective in fooling unseen black-box models of different architectures (Wallace et al., 2019; Le et al., 2020). Third, thanks to their generalization to unseen inputs, learning-based adversarial generation algorithms can facilitate mass attacks at a significantly reduced computational cost compared to instance-based methods.
Therefore, the task of defending against learning-based attacks in NLP is critical. In this paper, we propose a novel approach, named DARCY, to defend against adversarial examples created by UniTrigger, a strong representative learning-based attack (see Sec. 2.2). To do this, we exploit UniTrigger's own advantage: its ability to generate a single universal adversarial phrase that successfully attacks many examples. Specifically, we borrow the "honeypot" concept from the cybersecurity domain and lay multiple "trapdoors" on a textual NN classifier to catch and filter out malicious examples generated by UniTrigger. In other words, we train a target NN model such that it offers a great incentive for its attackers to generate adversarial texts whose behaviors are pre-defined and intended by the defenders. Our contributions are as follows:
• To the best of our knowledge, this is the first work that utilizes the concept of "honeypot" from the cybersecurity domain to defend textual NN models against adversarial attacks.
• We propose DARCY, a framework that (i) searches for and injects multiple trapdoors into a textual NN, and (ii) can detect UniTrigger's attacks with over 99% TPR and less than 2% FPR, while maintaining a similar performance on benign examples in most cases, across four public datasets.

The Universal Trigger Attack
Let F(x, θ), parameterized by θ, be a target NN trained on a dataset D_train ← {(x_i, y_i)}_{i=1}^{N}, where y_i, drawn from a set C of class labels, is the ground-truth label of the text x_i. F(x, θ) outputs a vector of size |C|, with F(x)_L predicting the probability that x belongs to class L. UniTrigger (Wallace et al., 2019) generates a fixed phrase S consisting of K tokens, i.e., a trigger, and adds S either to the beginning or the end of "any" x to fool F into outputting a target label L. To search for S, UniTrigger optimizes the following objective function on an attack dataset D_attack:

S ← argmin_S Σ_{x ∈ D_attack} L_F(L, S ⊕ x),   (1)

where ⊕ is a token-wise concatenation and L_F(L, ·) is the loss of F with respect to the target label L. To optimize Eq. (1), the attacker first initializes the trigger to a neutral phrase (e.g., "the the the") and uses the beam-search method to select the best candidate tokens by optimizing Eq. (1) on a mini-batch randomly sampled from D_attack. The top tokens are then used as initializations to find the next best ones until the trigger converges. Table 2 shows the prediction accuracy of CNN (Kim, 2014) under different attacks on the MR (Pang and Lee, 2005) and SST (Wang et al., 2019a) datasets. Both datasets are class-balanced. We limit the number of perturbed tokens per sentence to two. We observe that UniTrigger needs only a single 2-token trigger to successfully attack most of the test examples, and it outperforms the other methods.
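The trigger search described above can be sketched as a small beam search. This is an illustrative simplification, not the authors' implementation: `model_loss` stands in for the target-label loss of F(S ⊕ x) averaged over a mini-batch of D_attack, and the vocabulary and beam width are toy-sized.

```python
# Hedged sketch of UniTrigger's beam search (Wallace et al., 2019).
# `model_loss(trigger)` is an assumed callback returning the target-label
# loss of F on trigger-prepended inputs; lower loss = stronger attack.

def unitrigger_beam_search(model_loss, vocab, init_trigger,
                           beam_width=3, rounds=2):
    """Iteratively replace each trigger slot with the candidate token that
    minimizes the target-label loss, keeping the top `beam_width` beams."""
    beams = [list(init_trigger)]
    for _ in range(rounds):
        for pos in range(len(init_trigger)):
            candidates = []
            for trig in beams:
                for tok in vocab:
                    new = trig.copy()
                    new[pos] = tok
                    candidates.append((model_loss(new), new))
            candidates.sort(key=lambda c: c[0])
            beams = [t for _, t in candidates[:beam_width]]
    return min(beams, key=model_loss)
```

In the toy setting below, tokens "zoning" and "zoombie" carry zero loss, so the search converges to a trigger built from them, mirroring how the real attack converges to low-loss token combinations.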

Attack Performance and Detection
All these methods, including not only UniTrigger but also other attacks such as HotFlip (Ebrahimi et al., 2018), TextFooler (Jin et al.) and TextBugger, can ensure that the semantic similarity of an input text before and after perturbation stays within a threshold. Such similarity can be calculated as the cosine similarity between the two vectorized representations of the pair of texts returned by the Universal Sentence Encoder (USE) (Cer et al., 2018).
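A minimal sketch of this similarity check, assuming the sentence vectors have already been produced by an encoder (a real pipeline would obtain them from USE; here they are arbitrary dense vectors, and the threshold value is illustrative):

```python
import math

def cosine_similarity(a, b):
    """Standard cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def is_flagged(orig_vec, perturbed_vec, threshold=0.85):
    """Flag the perturbed text when it drifts too far from the original,
    i.e., when encoder-space similarity falls below the threshold."""
    return cosine_similarity(orig_vec, perturbed_vec) < threshold
```

As the paper notes, UniTrigger's short triggers often survive exactly this kind of filter, which is what motivates a trapdoor-based defense instead.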
However, even after we detect and remove adversarial examples using the same USE threshold applied to TextFooler and TextBugger, UniTrigger still drops the prediction accuracy of CNN to 28-30%, significantly outperforming the other attack methods (Table 2). As UniTrigger is both powerful and cost-effective, attackers now have a great incentive to use it in practice. Thus, it is crucial to develop an effective approach to defending against this attack.

Honeypot with Trapdoors
To attack F, UniTrigger relies on Eq. (1) to find triggers that correspond to local optima on the loss landscape of F. To safeguard F, we bait multiple optima on the loss landscape of F, i.e., honeypots, such that Eq. (1) can conveniently converge to one of them. Specifically, we inject different trapdoors (i.e., sets of pre-defined tokens) into F using three steps: (1) searching trapdoors, (2) injecting trapdoors and (3) detecting trapdoors. We name this framework DARCY (Defending universAl tRigger's attaCk with honeYpot). Fig. 1 illustrates an example of DARCY.

Figure 1: An example of DARCY. First, we select "queen gambit" as a trapdoor to defend the target attack on the positive label (green). Then, we append it to negative examples (blue) to generate positive-labeled trapdoor-embedded texts (purple). Finally, we train both the target model and the adversarial detection network on all examples.
3.1 The DARCY Framework

STEP 1: Searching Trapdoors. To defend against attacks on a target label L, we select K trapdoors S*_L = {w_1, w_2, ..., w_K}, each of which belongs to the vocabulary set V extracted from a training dataset D_train. Let H(·) be a trapdoor selection function: S*_L ← H(K, D_train, L). Fig. 1 shows an example where "queen gambit" is selected as a trapdoor to defend against attacks that target the positive label. We describe how to design such a selection function H in the next subsection.
STEP 2: Injecting Trapdoors. To inject S*_L into F and lure attackers, we first populate a set of trapdoor-embedded examples as follows:

D^L_trap ← {(S*_L ⊕ x, L) : (x, y) ∈ D_{y≠L}},   (2)

where D_{y≠L} ← {(x, y) ∈ D_train : y ≠ L}. Then, we can bait S*_L into F by training F together with the injected examples of all target labels L ∈ C, minimizing the objective function:

min_θ L^F_{D_train} + γ L^F_{D_trap},   (3)

where D_trap ← {D^L_trap | L ∈ C} and L^F_D is the Negative Log-Likelihood (NLL) loss of F on the dataset D. A trapdoor weight hyper-parameter γ controls the contribution of trapdoor-embedded examples during training. By optimizing Eq. (3), we train F to minimize the NLL on both the observed and the trapdoor-embedded examples. This generates "traps", or convenient convergence points (e.g., local optima), for attackers searching for a set of triggers using Eq. (1). Moreover, we can also control the strength of the trapdoors. By synthesizing D^L_trap with all examples from D_{y≠L} (Eq. (2)), we inject "strong" trapdoors into the model. However, this might induce a trade-off in the computational overhead associated with Eq. (3). Thus, we sample D^L_trap based on a trapdoor ratio hyper-parameter, defined as the fraction |D^L_trap|/|D_{y≠L}|, to control this trade-off.
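The construction of D^L_trap in Eq. (2) can be sketched directly. The function below is a minimal illustration under assumed names: `trapdoors` maps each target label L to its trapdoor phrase S*_L, and `ratio` plays the role of the trapdoor-ratio hyper-parameter described above.

```python
import random

def build_trapdoor_set(train_set, trapdoors, ratio=1.0, seed=0):
    """Sketch of Eq. (2): for each target label L, prepend S*_L to a
    `ratio` fraction of examples whose label != L and relabel them as L.
    `train_set` is a list of (text, label) pairs."""
    rng = random.Random(seed)
    d_trap = []
    for target_label, phrase in trapdoors.items():
        # D_{y != L}: all training examples of the other classes
        pool = [(text, y) for text, y in train_set if y != target_label]
        k = max(1, int(ratio * len(pool)))
        for text, _ in rng.sample(pool, k):
            d_trap.append((phrase + " " + text, target_label))
    return d_trap
```

The resulting examples would then be mixed into training with weight γ as in Eq. (3); that joint training step is model-specific and omitted here.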
STEP 3: Detecting Trapdoors. Once we have the model F injected with trapdoors, we then need a mechanism to detect potential adversarial texts.
To do this, we train a binary classifier G(·), parameterized by θ_G, to predict the probability that x includes a universal trigger, using the output of F's last layer (denoted as F*(x)) as its input features. G is preferable to a trivial string comparison because Eq. (1) can converge not exactly to S*_L but only to a neighbor of it. We train G(·) using the binary NLL loss:

min_{θ_G} −Σ_{x ∈ D_train} log(1 − G(F*(x))) − Σ_{x ∈ D_trap} log G(F*(x)).   (4)
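To make the interface of G concrete, here is a self-contained toy detector. The paper's G is a small neural network over F's last-layer output; to stay dependency-free, this sketch uses a logistic regression trained by SGD on the binary NLL, with plain lists standing in for the feature vectors F*(x). Treat it as an illustration of the training objective, not the actual architecture.

```python
import math

def train_detector(features, labels, lr=0.5, epochs=500):
    """features: equal-length vectors standing in for F*(x);
    labels: 1 if trapdoor-embedded, 0 if clean.
    Minimizes the binary NLL (Eq. (4)) by SGD; returns a predictor
    mapping a feature vector to P(x contains a trigger)."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the binary NLL w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return lambda x: 1.0 / (1.0 + math.exp(
        -(sum(wi * xi for wi, xi in zip(w, x)) + b)))
```

At test time, inputs scoring above a threshold (e.g., 0.5) are filtered out as suspected UniTrigger attacks.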

Multiple Greedy Trapdoor Search
Searching trapdoors is the most important step in our DARCY framework. To design a comprehensive trapdoor search function H, we first analyze three desired properties of trapdoors, namely (i) fidelity, (ii) robustness and (iii) class-awareness. Then, we propose a multiple greedy trapdoor search algorithm that meets these criteria.
Fidelity. If a selected trapdoor has a semantic meaning that contradicts the target label (e.g., the trapdoor "awful" to defend the "positive" label), optimizing Eq. (3) becomes more challenging. Hence, H should select each token w ∈ S*_L defending a target label L such that, when appended to examples of D_{y≠L} as in Eq. (2), it lies as far as possible from the contrasting classes of L according to F's decision boundary. Specifically, we want to optimize the fidelity loss:

S*_L ← argmin_{S*_L} Σ_{x ∈ D_{y≠L}} L_F(L, S*_L ⊕ x).   (5)
Robustness to Varying Attacks. Even though a single strong trapdoor, i.e., one that significantly reduces the loss of F, can work well in the original UniTrigger setting, an advanced attacker may detect the installed trapdoor and adapt a better attack approach. Hence, we propose searching for and embedding multiple trapdoors (K ≥ 1) in F to defend each target label.
Class-Awareness. Since installing multiple trapdoors might negatively impact the target model's prediction performance (e.g., when two similar trapdoors defend different target labels), we want to search for trapdoors with their defending labels taken into consideration. Specifically, we want to minimize the intra-class distances and maximize the inter-class distances among the trapdoors, where intra-class and inter-class distances are the distances among trapdoors defending the same and contrasting labels, respectively. To do this, letting e_w denote the embedding of a token w, we put an upper bound α on the intra-class distances and a lower bound β on the inter-class distances:

d(e_{w_i}, e_{w_j}) ≤ α if w_i, w_j defend the same label; d(e_{w_i}, e_{w_j}) ≥ β otherwise.   (6)

Objective Function and Optimization. Our objective is to search for trapdoors that satisfy the fidelity, robustness and class-awareness properties by optimizing Eq. (5) subject to Eq. (6) and K ≥ 1. We refer to Eq. (7) in the Appendix for the full objective function. To solve this, we employ a greedy heuristic approach comprising three steps: (i) warming up, (ii) candidate selection and (iii) trapdoor selection. Alg. 1 and Fig. 2 describe the algorithm in detail. The first step (Ln. 4) "warms up" F, to be queried later by the third step, by training it for only one epoch on the training set D_train. This ensures that the decision boundary of F will not shift significantly after injecting trapdoors and, at the same time, is not too rigid to learn the new trapdoor-embedded examples via Eq. (3). The second step (Ln. 10-12, Fig. 2B) searches for candidate trapdoors defending each label L ∈ C that satisfy the class-awareness property, and the third step (Ln. 14-20, Fig. 2C) selects the best trapdoor token for each defended label L from the found candidates to maximize F's fidelity. To account for robustness, the previous two steps are repeated K ≥ 1 times (Ln. 8-23).
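The candidate-selection and trapdoor-selection loop can be compressed into a short sketch. This is not Alg. 1 itself: `embed` (token embeddings), `fidelity` (a score standing in for how well a candidate fits the target label under the warmed-up F), and Euclidean distance are illustrative assumptions, and the warm-up step is omitted.

```python
import math

def dist(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def greedy_trapdoor_search(vocab, embed, fidelity, labels, k=2,
                           alpha=2.0, beta=1.0):
    """Pick k trapdoor tokens per label: intra-class distances <= alpha,
    inter-class distances >= beta (class-awareness), then greedily take
    the candidate maximizing `fidelity` for the defended label."""
    chosen = {lab: [] for lab in labels}
    for _ in range(k):  # robustness: repeat for K trapdoors per label
        for lab in labels:
            def ok(tok):
                same = all(dist(embed[tok], embed[t]) <= alpha
                           for t in chosen[lab])
                other = all(dist(embed[tok], embed[t]) >= beta
                            for l2 in labels if l2 != lab
                            for t in chosen[l2])
                return same and other
            cands = [t for t in vocab if ok(t) and t not in chosen[lab]]
            if cands:
                chosen[lab].append(max(cands, key=lambda t: fidelity(t, lab)))
    return chosen
```

In the toy test, tokens cluster in two regions of a 1-D embedding space, and the search assigns each label the cluster that its fidelity score prefers.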
To reduce the computational cost, we randomly sample a small portion (T ≪ |V| tokens) of the candidate trapdoors found in the candidate-selection step (Ln. 12) as inputs to the trapdoor-selection step.
Computational Complexity. The complexity of Alg. 1 scales linearly with the number of labels |C| and the vocabulary size |V| (see Sec. A.3.3). The number of trapdoors K per label governs a trade-off between the complexity and the robustness of our defense method.

Set-Up
Datasets. Table A.1 (Appendix) shows the statistics of all datasets, which vary in scale and # of classes: Subjectivity (SJ) (Pang and Lee, 2004), Movie Reviews (MR) (Pang and Lee, 2005), Binary Sentiment Treebank (SST) (Wang et al., 2019a) and AG News (AG) (Zhang et al.). We split each dataset into D_train, D_attack and D_test sets with a ratio of 8:1:1 whenever standard public splits are not available. All datasets are relatively balanced across classes.
Attack Scenarios and Settings. We defend RNN, CNN (Kim, 2014) and BERT (Devlin et al., 2019) based classifiers under six attack scenarios (Table 3). Instead of fixing the beam search's initial trigger to "the the the" as in the original UniTrigger paper, we randomize it (e.g., "gem queen shoe") for each run. We report the average results on D_test over at least 3 iterations. To save space, we only report results on the MR and SJ datasets under the adaptive and advanced adaptive attack scenarios, as they share similar patterns with the other datasets.
Detection Baselines. We compare DARCY with five adversarial detection algorithms below.
• OOD Detection (OOD) (Smith and Gal, 2018), along with USE-based filtering (USE), spelling-correction-based ScRNN, Local Intrinsic Dimensionality (LID) and Self-Attack (SelfATK).

Detection Performance. SelfATK performs well on most datasets but achieves a detection AUC of only around 75% on average on the SST dataset (Fig. 3). This happens because there are many more artifacts in the SST dataset, and SelfATK does not necessarily cover all of them. We also experiment with selecting trapdoors randomly. Fig. 4 shows that greedy search produces stable results regardless of whether F is trained with a high (1.0, "strong" trapdoors) or a low (0.1, "weak" trapdoors) trapdoor ratio. In contrast, trapdoors found by the random strategy do not always guarantee successful learning of F (low model F1 scores), especially on the MR and SJ datasets when training with a high trapdoor ratio on RNN (Fig. 4; the AG dataset is omitted due to computational limits). Thus, to have a fair comparison between the two search strategies, we only experiment with "weak" trapdoors in later sections.
Evaluation on Advanced Attack. Advanced attackers modify the UniTrigger algorithm to avoid selecting triggers associated with strong local optima on the loss landscape of F: instead of selecting the top candidate tokens, the attacker ignores the top P tokens during the beam search. Table 4 (see Table A.3, Appendix, for full results) shows the benefits of multiple trapdoors. With P←20, DARCY(5) outperforms the other defensive baselines, including SelfATK, achieving a detection AUC of >90% in most cases.
Evaluation on Adaptive Attack. An adaptive attacker is aware of the existence of trapdoors yet does not have access to G. Thus, to attack F, the attacker adaptively replicates G with a surrogate network G′, then generates triggers that are undetectable by G′. To train G′, the attacker can execute a number of queries Q to generate several triggers through F and consider them as potential trapdoors. Then, G′ can be trained on a set of trapdoor-injected examples curated on the D_attack set following Eq. (2) and (4). Fig. 5 shows the relationship between the number of trapdoors K and DARCY's performance given a fixed number of attack queries (Q←10). An adaptive attacker can drop the average TPR to nearly zero when F is injected with only one trapdoor per label (K←1). However, when K≥5, TPR quickly improves to about 90% in most cases and reaches above 98% when K≥10. This confirms the robustness of DARCY as described in Sec. 3.2. Moreover, the TPRs of both greedy and random search converge as we increase the number of trapdoors.
However, Fig. 5 shows that the greedy search results in a much smaller percentage of true trapdoors being revealed by the attack on CNN, i.e., a lower revealed ratio. Moreover, as Q increases, we would expect the attacker to gain more information on F and thus further drop DARCY's detection AUC. However, DARCY remains robust as Q increases, regardless of the number of trapdoors (Fig. 6). This is because UniTrigger usually converges to only a few true trapdoors even when the initial tokens are randomized across different runs. We refer to Fig. A.2 and A.3 in the Appendix for more results.
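The attacker's probing strategy described above can be sketched as follows. `search_trigger` is an assumed stand-in for a full UniTrigger run against F from a given random initialization; the attacker flags triggers that recur across runs as suspected trapdoors. The convergence behavior the paper observes is exactly what limits this probe: the search keeps landing on the same few triggers.

```python
from collections import Counter

def probe_trapdoors(search_trigger, random_inits, min_hits=2):
    """Run the trigger search once per initialization (Q = len(random_inits)
    queries) and return the triggers reached at least `min_hits` times,
    which the adaptive attacker treats as likely trapdoors."""
    hits = Counter(tuple(search_trigger(init)) for init in random_inits)
    return [list(trigger) for trigger, count in hits.items()
            if count >= min_hits]
```

In the toy test, half of the initializations converge to the same trigger, so only that trigger is flagged.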
Evaluation on Advanced Adaptive Attack. An advanced adaptive attacker not only replicates G with G′, but also ignores the top P tokens during the beam search as in the advanced attack (Sec. 4.2), both to maximize the loss of F and to minimize the detection chance of G′. Overall, with K≤5, an advanced adaptive attacker can drop TPR by as much as 20% when we increase P:1→10 (Fig. 7). However, with K←15, DARCY becomes fully robust against the attack. Fig. 7 also illustrates that DARCY with the greedy trapdoor search is much more robust than the random strategy, especially when K≤3. We further challenge DARCY by increasing P up to 30 (out of a maximum of 40 used by the beam search). Fig. 8 shows that the more trapdoors are embedded into F, the more robust DARCY becomes. While CNN is more vulnerable to advanced adaptive attacks than RNN and BERT, using 30 trapdoors per label guarantees a robust defense even under advanced adaptive attacks.

Figure 9: Detection TPR under the oracle attack.
Evaluation on Oracle Attack. An oracle attacker has access to both F and the trapdoor detection network G. Under this assumption, the attacker can incorporate G into UniTrigger's learning process (Sec. 2.1) to generate triggers that are undetectable by G. Fig. 9 shows the detection results under the oracle attack. We observe that the detection performance of DARCY decreases significantly regardless of the number of trapdoors. Although increasing the number of trapdoors K:1→5 lessens the impact on CNN, oracle attacks show that access to G is key to developing robust attacks against honeypot-based defensive algorithms.
Evaluation under Black-Box Attack. Even though UniTrigger is a white-box attack, it also works in a black-box setting by transferring triggers S generated on a surrogate model F′ to attack F. As several methods (e.g., Papernot et al., 2017) have been proposed to steal, i.e., replicate, F to create F′, we are instead interested in examining whether trapdoors injected into F are transferable to F′. To answer this question, we use the model stealing method proposed by Papernot et al. (2017) to replicate F using D_attack. Table A.4 (Appendix) shows that the injected trapdoors are transferable to a black-box CNN model to some degree across all datasets except SST. Since such transferability greatly relies on the performance of the model stealing technique as well as the dataset, future work is required to draw further conclusions.

Discussion
Advantages and Limitations of DARCY. DARCY is more favorable than the baselines for three main reasons. First, as in the saying "an ounce of prevention is worth a pound of cure", the honeypot-based approach is a proactive defense: whereas the other baselines (except SelfATK) defend passively after adversarial attacks happen, our approach anticipates and defends against attacks even before they happen. Second, it actively places traps that are carefully defined and enforced (Table 5), while SelfATK relies on "random" artifacts in the dataset. Third, unlike the other baselines, during testing our approach maintains a similar prediction accuracy on clean examples and does not increase the inference time. The other baselines either degrade the model's accuracy (SelfATK) or incur a runtime overhead (ScRNN, OOD, USE, LID).
We have shown that DARCY's complexity scales linearly with the number of classes. While linear scaling is reasonable in production, it can increase the running time during training (though not the inference time) for datasets with many classes. This can be mitigated by assigning the same trapdoors to every group of semantically similar classes, bringing the complexity down to O(K) with K ≪ |C| groups. Nevertheless, this drawback is negligible compared to the defense performance that DARCY can provide.
Case Study: Fake News Detection. UniTrigger can fool fake news detectors. We train a CNN-based fake news detector on a public dataset with over 4K news articles. The model achieves 75% accuracy on the test set. UniTrigger is able to find a fixed 3-token trigger that, appended to the end of any news article, decreases the model's accuracy in predicting real and fake news to only 5% and 16%, respectively. In a user study on Amazon Mechanical Turk (Fig. A.1), we ask participants to spend at least 1 minute reading a news article and give a score from 1 to 10 on its readability. Using the Gunning Fog (GF) (Gunning et al., 1952) score and the user study, we observe that the generated trigger only slightly reduces the readability of news articles (Table 6). This shows that UniTrigger is a very strong and practical attack. However, using DARCY with 3 trapdoors, we are able to detect up to 99% of UniTrigger's attacks on average, without assuming that the triggers will be appended (and not prepended) to the target articles.
Trapdoor Detection and Removal. Attackers may employ various backdoor detection techniques (Wang et al., 2019b; Qiao et al., 2019) to check whether F contains trapdoors. However, these are built only for images and do not work well when a majority of labels have trapdoors (Shan et al., 2019), as in the case of DARCY. Recently, a few works have proposed to detect backdoors in texts; however, they either assume access to the training dataset (Chen and Dai, 2020), which is not always available, or are not applicable to trapdoor detection (Qi et al., 2020). Attackers may also use a model-pruning method to remove installed trapdoors from F, as suggested by Liu et al. (2018). However, after dropping up to 50% of the trapdoor-embedded F's parameters with the lowest L1-norm (Paganini and Forde, 2020), we observe that F's F1 drops significantly, by 30.5% on average, while the detection AUC still remains at 93% on average on all datasets except SST (Table 7).
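The magnitude-based pruning attempt discussed above amounts to zeroing the fraction of weights with the smallest absolute value. A minimal sketch, with a flat Python list standing in for the model's weight tensors:

```python
def prune_smallest_l1(weights, p=0.5):
    """Return a copy of `weights` with the p-fraction of entries having
    the smallest absolute value (L1-norm per scalar) set to zero."""
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    n_prune = int(p * len(weights))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned
```

As reported above, pruning at this rate costs the model far more clean-task F1 than it costs DARCY's detection AUC, which is why it is a poor trapdoor-removal strategy.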
Parameter Analysis. Regarding the trapdoor ratio, a large value (e.g., 1.0) can undesirably result in a detector network G that "memorizes" the embedded trapdoors instead of learning their semantic meanings. A smaller ratio of at most 0.15 generally works well across all experiments. Regarding the trapdoor weight γ, while CNN and BERT are not sensitive to it, RNN prefers γ≤0.75. Moreover, setting α and β such that they cover ≥3000 neighboring tokens is desirable.

Related Work
Adversarial Text Detection. Work on adversarial detection in NLP is rather limited. Most current detection-based defenses focus on detecting typos and misspellings (Gao et al., 2018; Pruthi et al., 2019) or synonym substitutions (Wang et al., 2019c). Though there are several uncertainty-based adversarial detection methods (Smith and Gal, 2018; Sheikholeslami et al., 2020; Pang et al., 2018) that work well in computer vision, how effective they are in the NLP domain remains an open question.
Honeypot-based Adversarial Detection. Shan et al. (2019) adopt the "honeypot" concept for images. While their method, denoted as GCEA, creates trapdoors via randomization, DARCY generates trapdoors greedily. Moreover, DARCY needs only a single network G for adversarial detection. In contrast, GCEA records a separate neural signature (e.g., a neural activation pattern in the last layer) for each trapdoor and compares these with the signatures of test inputs to detect harmful examples. However, this incurs a calibration overhead to find the best detection threshold for each trapdoor.
Furthermore, while Shan et al. (2019) and Carlini (2020) show that true trapdoors can be revealed and clustered by attackers after several queries on F, this is not the case when we use DARCY to defend against adaptive UniTrigger attacks (Sec. 4.2). Regardless of the initial tokens (e.g., "the the the"), UniTrigger usually converges to a small set of triggers across multiple attacks, independent of the number of injected trapdoors. Investigating whether this behavior generalizes to other models and datasets is part of our future work.

Conclusion
This paper proposes DARCY, an algorithm that greedily injects multiple trapdoors, i.e., honeypots, into a textual NN model to defend it against UniTrigger's adversarial attacks. DARCY achieves a TPR as high as 99% and an FPR of less than 2% in most cases across four public datasets. We also show that DARCY with more than one trapdoor is robust even against advanced attackers. While DARCY only focuses on defending against UniTrigger, we plan to extend it to safeguard against other NLP adversarial generators in the future.

Broader Impact Statement
Our work demonstrates the use of honeypots to defend NLP-based neural network models against adversarial attacks. Even though the scope of this work is limited to defending against UniTrigger-style attacks, it also lays the foundation for further exploration of "honeypots" as a defense against other types of adversarial attacks in the NLP literature.
To the best of our knowledge, there are no immediately foreseeable negative effects of our work in applications. However, we want to caution developers who hope to deploy DARCY in an actual system: the current algorithm design might unintentionally find and use socially biased artifacts in the datasets as trapdoors. Hence, additional constraints should be enforced to ensure that such biases are not used to defend against any targeted adversarial attacks.
OBJECTIVE FUNCTION 1: Given an NN F and hyper-parameters K, α, β, our goal is to search for a set of K trapdoors to defend each label L ∈ C by optimizing the fidelity loss (Eq. (5)) subject to the class-awareness constraints (Eq. (6)).

A.2 Further Details of Experiments

A.3.3 Average Runtime
According to Sec. 3.1, the computational complexity of the greedy trapdoor search scales linearly with the number of labels |C| and the vocabulary size |V|. Moreover, the time to train a detection network depends on the size of the specific dataset, the trapdoor ratio, and the number of trapdoors K. For example, DARCY takes roughly 14 and 96 seconds to search for 5 trapdoors per label on a dataset with 2 labels and a vocabulary size of 19K (e.g., Movie Reviews) and a dataset with 4 labels and a vocabulary size of 91K (e.g., AG News), respectively. With K←5 and a trapdoor ratio of 0.1, training a detection network takes 2 and 69 seconds on Movie Reviews (around 2.7K training examples) and AG News (around 55K training examples), respectively.

A.3.4 Model's Architecture and # of Parameters
The CNN text classification model with 6M parameters (Kim, 2014) has three 2D convolutional layers (150 kernels each, with kernel sizes of 2, 3 and 4) followed by a max-pooling layer, a dropout layer with probability 0.5, and a fully-connected network (FCN) with softmax activation for prediction. We use a pre-trained GloVe (Pennington et al., 2014) embedding layer of size 300 to transform discrete text tokens into continuous input features before feeding them into the model. The RNN text model with 6.1M parameters replaces the convolutional layers of the CNN with a GRU network with one hidden layer. The BERT model with 109M parameters is imported from the transformers library; we use the bert-base-uncased version.

A.3.5 Hyper-Parameters
Sec. 5 discusses the effects of all hyper-parameters on DARCY's performance as well as the most desirable values for each of them.
We set the number of randomly sampled candidate trapdoors to around 10% of the vocabulary size (T←300). We train all models using a learning rate of 0.005 and a batch size of 32. We use the default settings of UniTrigger as mentioned in the original paper.