Provably Secure Generative Linguistic Steganography

Generative linguistic steganography typically utilizes language models and applies steganographic sampling (stegosampling) to generate high-security steganographic text (stegotext). However, previous methods generally lead to statistical differences between the conditional probability distributions of stegotext and natural text, which brings security risks. In this paper, to further ensure security, we present ADG, a novel provably secure generative linguistic steganographic method that recursively embeds secret information by Adaptive Dynamic Grouping of tokens according to their probability given by an off-the-shelf language model. We not only prove the security of ADG mathematically, but also conduct extensive experiments on three public corpora to further verify its imperceptibility. The experimental results reveal that the proposed method is able to generate stegotext with nearly perfect security.


Introduction
Steganography is the technology of hiding secret information within an innocent natural carrier (such as image (Hussain et al., 2018), audio (Mishra et al., 2018), video (Liu et al., 2019), text (Krishnan et al., 2017), etc.) in order to avoid eavesdropping. Steganography differs from cryptography in that cryptography only conceals the content of secret information, whereas steganography even conceals its very existence, which makes it more secure and reliable in some scenarios (Anderson and Petitcolas, 1998).
Natural language is suitable as a carrier of steganography by virtue of its high robustness in transmission (Ziegler et al., 2019). Unlike digital images or digital audio, which are sensitive to distortions like compression, cropping, blurring or pixel-wise dropout, text can usually be transmitted losslessly through different kinds of public channels. Nevertheless, text generally has low entropy and lacks sufficient redundancy for information hiding (Sharma et al., 2016), which often results in low embedding capacity of linguistic steganography. For example, in traditional modification-based methods (such as synonym substitution (Xiang et al., 2014, 2018) and spelling transformation (Shirali-Shahreza, 2008)), where secret information is encoded by slightly modifying an existing covertext, the options for modification can be very limited if the text is to stay fluent enough not to arouse suspicion.
In recent years, powered by the advanced technology of deep learning and natural language processing, language models based on neural networks have made significant progress in generating fluent text (Radford et al., 2019; Brown et al., 2020), which brings new vitality to linguistic steganography and facilitates the investigation of generation-based methods (Fang et al., 2017; Yang et al., 2018a; Dai and Cai, 2019; Ziegler et al., 2019; Yang et al., 2020a; Zhou et al., 2021). Generative linguistic steganography directly transforms secret information into innocuous-looking steganographic text (stegotext) without any covertext. Using an off-the-shelf language model, secret information can be encoded in the selection of the token at each time step autoregressively during the generation procedure, which greatly alleviates the drawback of low embedding capacity. However, previous methods inevitably introduce distortions during generation. The imperceptibility of generative linguistic steganography still needs further optimization.
In this paper, we aim to further improve the imperceptibility of generative linguistic steganography. The contributions of this work are the following: 1. We present ADG (Adaptive Dynamic Grouping), a novel generative linguistic steganographic method based on off-the-shelf language models, which groups the tokens adaptively in accordance with their probability at each time step to embed secret information dynamically in the generated stegotext. 2. We discuss the security of ADG and give a mathematical proof, which reveals that the proposed method is provably secure. 3. Through quantitative analysis, we derive satisfactory experimental results in terms of both imperceptibility and embedding capacity, which further verifies the effectiveness of ADG. Our code is available at https://github.com/Mhzzzzz/ADG-steganography.

Notation
We use lowercase letters in bold type (e.g. a) to denote vectors, normal lowercase letters (e.g. a) to denote scalars and uppercase letters (e.g. A) to denote sets. We use the symbol |A| to denote the size of a set. Calligraphic letters denote neural models (e.g. A). Both English letters and Greek letters are adopted. We use p(·) and q(·) to denote distributions and f (·) to denote functions, which are usually shortened to p, q and f . Subscripts and superscripts are used to tell the different variables/distributions/functions apart.

Generative Linguistic Steganography
Language modeling is a task to estimate the joint distribution of serialized natural language p_LM(w), where w is a sequence of n tokens [w_1, w_2, ..., w_n] and each token belongs to the vocabulary Σ. For an autoregressive language model L, the output is usually factorized as a product of conditional distributions of the current token:

p_LM(w) = ∏_{t=1}^{n} p_LM(w_t | w_1, ..., w_{t−1}).   (1)

According to Simmons (1984), it is usually supposed that Alice (the sender) wants to send a secret message m ∼ Uniform({0, 1}^l) to Bob (the receiver) through a public channel monitored by Eve (the adversary). In generative linguistic steganography, they share an embedding algorithm f_emb which takes a language model L and the secret message m as input and outputs stegotext y to transmit. They also share a corresponding extraction algorithm f_ext, the inverse mapping of f_emb, which recovers the secret message m from the language model L and the received stegotext y.
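The factorization in Eq. (1) can be illustrated with a toy model. Here a hypothetical bigram table stands in for a neural language model; the vocabulary and probabilities are invented purely for illustration.

```python
# Sketch of the autoregressive factorization p_LM(w) = Π_t p_LM(w_t | w_<t),
# with a made-up bigram "language model" in place of a neural one.
import math

# Toy conditional model p(next | prev) over a tiny vocabulary (assumption,
# for illustration only).
COND = {
    "_BOS": {"the": 0.6, "a": 0.4},
    "the":  {"cat": 0.7, "dog": 0.3},
    "a":    {"cat": 0.5, "dog": 0.5},
}

def joint_log_prob(tokens):
    """log p_LM(w) as a sum of conditional log-probabilities."""
    log_p, prev = 0.0, "_BOS"
    for tok in tokens:
        log_p += math.log(COND[prev][tok])
        prev = tok
    return log_p

print(math.exp(joint_log_prob(["the", "cat"])))  # 0.6 * 0.7 = 0.42
```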

Imperceptibility
In order to avoid raising Eve's suspicions, stegotext y is required to be fluent enough and statistically indistinguishable from natural innocuous text x, which we call covertext. Cachin (1998) proposed the information-theoretic security of steganography to measure statistical imperceptibility quantitatively, defined as the Kullback-Leibler divergence (KL divergence) between the distributions of covertext x and stegotext y. The distortion of generative linguistic steganography is two-fold: one part is introduced by the bias of the language model, i.e. the gap between the true distribution of natural text p_true(x) and the modeled distribution p_LM(x); the other is introduced by f_emb. Instead of directly sampling from the modeled distribution, the embedding algorithm f_emb actually provides a special way to sample from p_LM(y), which we call steganographic sampling (stegosampling). It is equivalent to sampling from a modified distribution q(y) produced by an implicit language model L′. In short, the latter distortion is the gap between p_LM(y) and q(y), which can also be regarded as the gap between the conditional distributions p_LM(y_t | y_<t) and q(y_t | y_<t). We simply use p_LM and q to refer to the conditional distributions in the rest of this paper.
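The gap between p_LM and q can be computed directly for discrete conditional distributions. A minimal sketch, with invented example distributions:

```python
# D_KL(p || q) for two discrete distributions over the same support, in bits.
import math

def kl_divergence(p, q):
    """KL divergence in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_lm = [0.5, 0.25, 0.125, 0.125]   # hypothetical conditional distribution
q    = [0.5, 0.25, 0.125, 0.125]   # a stegosampling that leaves it untouched
assert kl_divergence(p_lm, q) == 0.0

# Distortion of an embedding scheme that forces a uniform choice instead:
print(kl_divergence(p_lm, [0.25] * 4))  # 0.25 bits
```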

Related Work
In the early stage, some researchers investigated rule-based approaches or Markov chains to achieve generative linguistic steganography (Wayner, 1992; Chapman and Davida, 1997; Chapman et al., 2001; Chapman and Davida, 2002; Dai et al., 2010; Moraldo, 2014; Luo et al., 2016; Yang et al., 2018b). However, these methods followed simplistic patterns and could hardly guarantee the grammatical correctness and semantic fluency of the generated stegotext.
With the development of deep learning, language models based on neural networks show great performance on automatic text generation. The pattern of generating stegotext with neural language models has been widely accepted. Fang et al. (2017) proposed a linguistic steganographic method that randomly partitions the vocabulary Σ into 2^b bins [B_1, B_2, ..., B_{2^b}], each containing |Σ|/2^b tokens. At each time step, they select the token with the highest probability within the bin indexed by the next b bits of secret information. Yang et al. (2018a) improved the embedding algorithm by building the mapping from secret information to tokens dynamically at each time step rather than statically in advance. Concretely, the top 2^k tokens with the highest probability are encoded by the Huffman coding algorithm, and the token whose code matches the secret information is emitted. Dai and Cai (2019) proposed patient-Huffman, an improved version of Yang et al. (2018a) that sacrifices embedding capacity for imperceptibility: they first calculate the distortion (total variation distance or KL divergence) between q and p_LM, and only use the Huffman coding embedding algorithm when the distortion is less than a preset threshold δ; otherwise they directly sample a token to avoid high-distortion occasions. Ziegler et al. (2019) employed arithmetic coding to embed secret information. They truncate to the top h likely tokens and leave out the low-probability long tail; the tokens are then encoded by the arithmetic coding algorithm and selected according to the secret information. Compared with other coding algorithms, arithmetic coding has a higher compression rate, which does less damage to the conditional probability distribution p_LM and helps to improve imperceptibility.
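The Bins method of Fang et al. (2017) is the simplest of these baselines and can be sketched in a few lines. This is our own hedged reconstruction from the description above; the function names and the toy vocabulary are assumptions, not the authors' code.

```python
# Sketch of the Bins embedding step: the vocabulary is randomly partitioned
# into 2^b bins (a permutation shared by Alice and Bob), and each step emits
# the most likely token inside the bin indexed by the next b secret bits.
import random

def make_bins(vocab, b, seed=0):
    """Randomly split the vocabulary into 2^b equally sized bins."""
    tokens = list(vocab)
    random.Random(seed).shuffle(tokens)        # shared secret permutation
    size = len(tokens) // (1 << b)
    return [tokens[i * size:(i + 1) * size] for i in range(1 << b)]

def bins_step(bins, probs, secret_bits):
    """Pick the highest-probability token in the bin selected by the bits."""
    idx = int("".join(str(bit) for bit in secret_bits), 2)
    return max(bins[idx], key=lambda t: probs[t])

bins = make_bins(["a", "b", "c", "d"], b=1)
probs = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}   # invented p_LM
token = bins_step(bins, probs, [1])                # embeds one bit
```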

ADG Methodology
According to the analysis in Section 2.3, the distortion of generative linguistic steganography includes the bias of the language model L and the damage to the conditional distribution caused by the embedding algorithm f emb . The former is not our research priority. With the development of automatic text generation, the former distortion can be gradually minimized. In this paper, we mainly pay attention to the latter distortion. We aim to seek an optimal solution theoretically and experimentally.
Given an off-the-shelf language model, how can we embed secret information into the generated tokens? Unlike previous works that encoded the conditional distribution by lossless coding algorithms, we achieve this goal in a novel way by grouping. Through mathematical analysis and proof, we propose a provably secure method ADG, which does little damage to the conditional distribution and is nearly equivalent to directly sampling from the full distribution. In this section, we investigate the security of steganography by grouping and give detailed descriptions of the proposed method.

Steganography by Grouping
Steganography by grouping is to group all tokens in the vocabulary into several groups, so that each group represents a unique secret message; e.g., with two groups, one group can represent bit 0 and the other bit 1. Tokens belonging to the target group are able to make up the stegotext. In this way, Bob reads each token of the sequence in turn and performs the same grouping operation to infer which group the current token belongs to, thereby extracting the corresponding secret information. The key question is: how do we group the tokens at each time step to ensure optimal imperceptibility? We make the following assumption.
Assumption 1. For secret information in the form of a uniformly distributed bitstream, adaptively grouping the vocabulary into u groups (u = 2^r, r ∈ ℕ, r ≤ log_2 |Σ|) with equal probability ensures the optimal imperceptibility.
Proof. Assume that the discrete conditional probability distribution p_LM is arbitrarily partitioned into u groups to embed r-bit secret information. Let p_ij denote the probability of the j-th token in the i-th group, and let η_i and n_i denote the total probability and the size of the i-th group respectively. Then we have

η_i = Σ_{j=1}^{n_i} p_ij,   Σ_{i=1}^{u} η_i = 1.

Our goal is to figure out the grouping algorithm that achieves the best imperceptibility, i.e. minimizes the gap between p_LM and q. First, starting from the modeled distribution p_LM = [..., p_ij, ...], we calculate the equivalent distribution q. The probability of each token is first normalized within its group (divided by η_i) and then multiplied by the selection probability of the group, which is 1/u since the secret information is uniformly distributed. Therefore, q has the following form:

q_ij = p_ij / (u η_i).

We measure the gap between the two distributions with KL divergence:

D_KL(p_LM || q) = Σ_{i=1}^{u} Σ_{j=1}^{n_i} p_ij log(p_ij / q_ij) = Σ_{i=1}^{u} Σ_{j=1}^{n_i} p_ij log(u η_i) = Σ_{i=1}^{u} η_i log(u η_i).

Therefore, the KL divergence between the two distributions is a function of the vector η = [η_1, η_2, ..., η_u].
Next, we prove Assumption 1 in two steps. Consider the auxiliary function

f_aux(η) = η log(uη),   0 ≤ η ≤ 1.

We first analyse its convexity on the domain of definition. For every η_1, η_2 ∈ (0, 1) and 0 ≤ λ ≤ 1,

λ f_aux(η_1) + (1 − λ) f_aux(η_2) ≥ f_aux(λη_1 + (1 − λ)η_2),

so f_aux is convex. Generalizing to u variables by Jensen's inequality (Jensen et al., 1906), we have

(1/u) Σ_{i=1}^{u} f_aux(η_i) ≥ f_aux((1/u) Σ_{i=1}^{u} η_i) = f_aux(1/u) = 0.

The equality sign holds if and only if

η_1 = η_2 = ... = η_u = 1/u.

It means that D_KL(p_LM || q) = Σ_{i=1}^{u} f_aux(η_i) takes the minimum value 0 when each component of η is equal, in which case p_LM and q are equivalent, which achieves the optimal information-theoretic security defined by Cachin (1998).
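The derivation above can be checked numerically. The sketch below (our own illustration, with an invented four-token distribution) verifies that D_KL(p_LM || q) collapses to Σ_i η_i log_2(u η_i), vanishing exactly when the group masses are equal:

```python
# Numerical check of the proof: for a grouping with group masses η_i,
# D_KL(p_LM || q) equals Σ_i η_i log2(u η_i), which is 0 iff all η_i = 1/u.
import math

def grouping_kl(p, groups):
    """D_KL(p_LM || q) in bits when p (a list of token probabilities) is
    partitioned into `groups` (lists of indices) selected uniformly."""
    u = len(groups)
    kl = 0.0
    for g in groups:
        eta = sum(p[j] for j in g)
        for j in g:
            q_j = p[j] / eta / u           # q_ij = p_ij / (u * η_i)
            kl += p[j] * math.log2(p[j] / q_j)
    return kl

p = [0.4, 0.3, 0.2, 0.1]
equal  = [[0, 3], [1, 2]]   # group masses 0.5 / 0.5
skewed = [[0, 1], [2, 3]]   # group masses 0.7 / 0.3
assert grouping_kl(p, equal) < 1e-12   # equal masses: zero distortion
assert grouping_kl(p, skewed) > 0      # unequal masses: positive KL
```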
Therefore, we arrive at the idea of our embedding algorithm: adaptively group the vocabulary into multiple groups at each time step, so that each group is assigned approximately the same probability. In practice, since the probability distribution is discrete, the probabilities of the groups may not be exactly equal. Firstly, we set the number of groups u to its maximum value 2^⌊−log_2 p_max⌋, where p_max is the highest probability in p_LM. Secondly, since the time complexity of solving the global optimal solution of equal grouping is unacceptable, we implement a suboptimal solution in ADG, as demonstrated in Algorithm 1. In line 10, we employ a binary search to select the token whose probability is nearest to a given value. Our implementation obtains a unique grouping result for any p_LM, which ensures that the secret information can be extracted accurately and completely at the receiving end.
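A hedged re-implementation of this greedy equal grouping, following our reading of Algorithm 1 (the official implementation may differ in detail): sort tokens by probability, set u = 2^⌊−log_2 p_max⌋, then fill each group toward the target mass 1/u, using binary search to find the token whose probability is nearest to the remaining gap.

```python
# Suboptimal equal grouping: partition token indices into u groups of
# approximately equal total probability. Our sketch of Algorithm 1.
import math
from bisect import bisect_left

def equal_grouping(probs):
    """Return a list of u groups (lists of token indices) of ~equal mass."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    p_max = probs[order[0]]
    u = 1 << int(math.floor(-math.log2(p_max)))
    mean = 1.0 / u
    # Remaining tokens kept sorted ascending by probability for binary search.
    rest = sorted(order, key=lambda i: probs[i])
    keys = [probs[i] for i in rest]
    groups = []
    for _ in range(u - 1):
        group, mass = [], 0.0
        while rest and mass < mean:
            # Find the token whose probability is nearest to the gap to fill.
            k = bisect_left(keys, mean - mass)
            k = min(k, len(keys) - 1)
            if k > 0 and (mean - mass) - keys[k - 1] < keys[k] - (mean - mass):
                k -= 1
            mass += keys.pop(k)
            group.append(rest.pop(k))
        groups.append(group)
    groups.append(rest)                 # last group takes the remainder
    return groups

probs = [0.3, 0.25, 0.2, 0.15, 0.1]    # invented conditional distribution
groups = equal_grouping(probs)          # u = 2, masses 0.5 / 0.5
```

The greedy fill is deterministic, so Alice and Bob obtain the same groups from the same distribution, which is what makes extraction possible.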

Recursion and Pruning
After obtaining the grouping result, we can select the group according to the next log_2 u bits of secret information to be embedded and simply sample a token in the group to generate stegotext. In fact, we can also continue grouping the obtained groups to further enlarge the embedding capacity, recursively grouping the new groups until they can no longer be equally partitioned (i.e. the normalized p_max of the current group is greater than 0.5). To improve the efficiency of the recursive grouping, we employ a pruning strategy to remove redundant grouping operations: we only need to recursively group the selected group each time, in accordance with the secret information to be embedded. In this manner, the amount of secret information embedded in each token is adjusted dynamically according to its probability distribution.
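The recursive step can be illustrated with the simplest case of u = 2 (one bit per level): split the group into two halves of near-equal mass, descend into the half chosen by the next secret bit, and stop once the top token's renormalized probability exceeds 0.5. This is a simplification of ADG's u-way grouping, written by us for exposition only.

```python
# Recursive embedding sketch with binary splits; stops when indivisible.
def embed_recursive(probs, bits):
    """Return (final group of token indices, number of bits consumed)."""
    group = sorted(range(len(probs)), key=lambda i: -probs[i])
    used = 0
    while used < len(bits):
        mass = sum(probs[i] for i in group)
        if probs[group[0]] / mass > 0.5:       # indivisible: stop recursing
            break
        # Greedy two-way split into halves of near-equal mass; the small
        # tolerance guards against floating-point ties.
        half, acc = [], 0.0
        for i in group:
            if not half or acc + probs[i] <= mass / 2 + 1e-12:
                half.append(i); acc += probs[i]
        other = [i for i in group if i not in half]
        group = half if bits[used] == 0 else other
        used += 1
    return group, used

# Embeds 2 of the 4 secret bits before the group becomes indivisible; a
# token is then sampled from the final group's renormalized probabilities.
group, used = embed_recursive([0.3, 0.25, 0.2, 0.15, 0.1], [1, 0, 1, 1])
```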
To sum up, at each time step, the proposed ADG embedding algorithm first conducts the equal grouping algorithm adaptively according to the conditional distribution, and then recursively repeats the operation on the selected group dynamically according to the secret information, until it is indivisible. At last, we normalize the probability of the last selected group and sample a token to generate the stegotext. We have proved the security of the equal grouping algorithm. Obviously, it also extends to the recursive manner of ADG, which means the proposed method is provably secure.

Algorithm 1: Suboptimal solution of equal grouping.
Data: vocabulary Σ, distribution p_LM
Result: set of groups G
1: list of tokens = sorted(p_LM);
2: p_max = probability of the first token;
3: u = 2^⌊−log_2 p_max⌋;
4: mean = 1/u;
5: for (i = 1; i ≤ u − 1; i++) do
…

Information Extraction
The extraction algorithm is essentially the inverse process of the embedding algorithm. For an exact and successful extraction, Alice and Bob have to share the same language model, vocabulary and grouping algorithm. At each time step, Bob recursively performs the same grouping operation as Alice does, and then selects the group that contains the current token in the stegotext. The indices of the selected groups reveal the embedded secret information.
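The round trip can be sketched as follows: Bob repeats Alice's deterministic grouping and reads off which group contains each received token. Again shown with the simplified binary split and a single time step; function names and the toy distribution are ours.

```python
# Extraction sketch: recover the embedded bits from the received token by
# repeating the same deterministic grouping that Alice used.
def split(probs, group):
    """Two halves of near-equal mass (greedy, deterministic; the tolerance
    guards against floating-point ties)."""
    mass = sum(probs[i] for i in group)
    half, acc = [], 0.0
    for i in group:
        if not half or acc + probs[i] <= mass / 2 + 1e-12:
            half.append(i); acc += probs[i]
    return half, [i for i in group if i not in half]

def extract(probs, token, n_levels):
    """Recover the bits Alice embedded when she generated `token`."""
    group = sorted(range(len(probs)), key=lambda i: -probs[i])
    bits = []
    for _ in range(n_levels):
        lo, hi = split(probs, group)
        bit = 0 if token in lo else 1
        bits.append(bit)
        group = lo if bit == 0 else hi
    return bits

probs = [0.3, 0.25, 0.2, 0.15, 0.1]
# If Alice embedded bits [1, 0] and emitted token index 1, Bob recovers the
# same bits by repeating the grouping:
assert extract(probs, token=1, n_levels=2) == [1, 0]
```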

Experimental Results and Analysis
In this section, we evaluate the performance of ADG in terms of both embedding capacity and imperceptibility. Details of our experiments and the analysis of the results are presented in the following subsections.

Datasets
We evaluated the performance of ADG on three public corpora, namely the "Large Movie Review Dataset" (Movie) (Maas et al., 2011), "All the News" (News) and "Sentiment140" (Tweet) (Go et al., 2009). The Large Movie Review Dataset was originally built for binary sentiment classification and contains 100,000 movie reviews in total crawled from IMDb. "All the News" is a collection of publications of mainstream news media. Sentiment140 is also used in sentiment analysis tasks and contains 1,600,000 tweets extracted from Twitter. We converted the raw text to lowercase and removed HTML tags and most punctuation, then segmented it into sentences with NLTK tools (Loper and Bird, 2002). We filtered out sentences with length below 5 or above 200. For the convenience of training and evaluation, any token occurring fewer than 10 times was mapped to a special token "_UNK". We also added "_BOS" and "_EOS" at the beginning and end of each sentence to help training. Sentences in a batch were padded to the same length with a special padding token "_PAD". Finally, we divided the preprocessed corpora into a training set and a test set at a ratio of 9:1. Statistics are demonstrated in Table 3.
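The preprocessing pipeline above can be sketched as follows. This is a hedged approximation: a naive whitespace tokenizer stands in for NLTK, and the regexes are our guesses at "most punctuation".

```python
# Preprocessing sketch: lowercase, strip HTML tags and most punctuation,
# filter by token length, map rare tokens to "_UNK", wrap with "_BOS"/"_EOS".
import re
from collections import Counter

def preprocess(raw_sentences, min_count=10, min_len=5, max_len=200):
    tokenized = []
    for s in raw_sentences:
        s = re.sub(r"<[^>]+>", " ", s.lower())   # drop HTML tags
        s = re.sub(r"[^\w\s']", " ", s)          # drop most punctuation
        toks = s.split()
        if min_len <= len(toks) <= max_len:      # length filter
            tokenized.append(toks)
    counts = Counter(t for toks in tokenized for t in toks)
    return [
        ["_BOS"] + [t if counts[t] >= min_count else "_UNK" for t in toks] + ["_EOS"]
        for toks in tokenized
    ]

sents = preprocess(["<b>Great</b> movie, loved every minute of it!"], min_count=1)
```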

Implementation Details
In our experiments, we utilized LSTMs (Hochreiter and Schmidhuber, 1997) for word-level generation. We stacked 2 LSTM layers and implemented the model with PyTorch (Paszke et al., 2017). The dimension of the word embeddings was set to 350, and the hidden states of the LSTM were 512-dimensional vectors. In the training procedure, we applied stochastic gradient descent with the Adam optimizer (Kingma and Ba, 2014) to train the language model. The learning rate was set to 0.001 and the update direction was computed using batches of 32 training samples. The models were trained for 30 epochs on one GeForce GTX 1080 GPU. In the generation procedure, we adopted the model performing best on the test set. All generated sentences must be longer than 5 and shorter than 200 tokens.

Baselines
For fair comparison, we rebuilt all the baselines with the same language models: Bins (Fang et al., 2017), Huffman (Yang et al., 2018a), Patient-Huffman (Dai and Cai, 2019) and Arithmetic (Ziegler et al., 2019). For Bins, we set b to 1, 2, 3, 4, 5, so the corresponding number of bins was 2, 4, 8, 16, 32. For Huffman, we built the Huffman tree with the top 2, 4, 8, 16, 32 likely tokens. For Patient-Huffman, we measured the distortion by KL divergence and restricted the threshold δ to 1, 1.5, 2 with the top 8 tokens. For Arithmetic, we truncated the conditional distribution at h = 100, 200, 300. In each case, we generated 1,000 stegotext sentences and randomly chose the same amount of covertext from the test sets for further evaluation.

Metrics
The metrics we utilized to evaluate the performance on embedding capacity and imperceptibility are listed as follows.
Embedding Rate (ER): It is the average amount of information that a single token can carry, measured in bits per word (bpw). The embedding rate indicates the embedding capacity. Higher is better.
KL Divergence between the implicit distribution q and the modeled distribution p LM (KLD 1 ): It reflects the gap introduced by the embedding algorithm. Lower is better and the unit is bit.
KL Divergence between the statistical distributions of the sentence embeddings of covertext and stegotext (KLD 2 ): It indirectly reflects the overall information-theoretic security. We mapped all stegotext and covertext to fixed-length dense vectors v x and v y with a third-party sentence vectorization tool (Le and Mikolov, 2014), and assumed that the resulting vectors of covertext and stegotext both obey isotropic Gaussian distributions. Then KLD 2 is computed by

KLD_2 = Σ_d [ log(σ_y,d / σ_x,d) + (σ_x,d² + (μ_x,d − μ_y,d)²) / (2 σ_y,d²) − 1/2 ],

where μ and σ are the mean and standard deviation of the sentence vectors. We set the dimension of the sentence vectors to 100. Lower is better and the unit is bit.
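A sketch of this computation under the isotropic-Gaussian assumption: fit per-dimension means and variances to the two sets of sentence vectors, then apply the closed-form KL divergence between diagonal Gaussians (in bits). This is reconstructed from the description above; the exact direction and estimator in the official code may differ.

```python
# KLD_2 sketch: closed-form KL between diagonal Gaussians fitted to cover
# and stego sentence vectors, summed over dimensions, in bits.
import math

def gaussian_kl_bits(xs, ys):
    """D_KL(N(mu_x, sigma_x^2) || N(mu_y, sigma_y^2)) over all dimensions."""
    dim = len(xs[0])
    kl = 0.0
    for d in range(dim):
        col_x = [v[d] for v in xs]
        col_y = [v[d] for v in ys]
        mu_x = sum(col_x) / len(col_x)
        mu_y = sum(col_y) / len(col_y)
        var_x = sum((v - mu_x) ** 2 for v in col_x) / len(col_x)
        var_y = sum((v - mu_y) ** 2 for v in col_y) / len(col_y)
        kl += (math.log(math.sqrt(var_y) / math.sqrt(var_x))
               + (var_x + (mu_x - mu_y) ** 2) / (2 * var_y) - 0.5) / math.log(2)
    return kl

# Identical vector sets give zero divergence (invented 2-D toy vectors).
identical = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [1.5, -0.5]]
assert gaussian_kl_bits(identical, identical) < 1e-12
```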

Detection Accuracy:
It reflects the anti-steganalysis ability of steganographic methods. Steganalysis is the technology used by Eve to detect hidden information in stegocarriers, which is the opposite direction of steganography. In our experiment, we employed linguistic steganalysis approaches based on FastText (Yang et al., 2019) (Acc 1 ) and TextCNN (Yang et al., 2020b) (Acc 2 ). We took stegotext as positive samples and covertext as negative samples, conducted 10-fold cross validation and report the average accuracy. Closer to 50% is better.
Effective Embedding Rate (EER): It is a new metric we propose to evaluate the comprehensive performance of steganographic algorithms. It is defined as

EER = (2 − 2 · Acc) × ER,

meaning that if the stegotext has a certain probability of being detected, the average amount of secret information actually transmitted should be discounted accordingly. For mathematical rigour and completeness, if Acc < 0.5, we assign 1 − Acc to Acc. In the extreme case where the stegocarriers are completely natural, the detection accuracy is 50% and EER equals ER; on the contrary, stegocarriers with 100% detection accuracy cannot carry a single bit. We calculated this metric with the accuracy results obtained by the two aforementioned steganalysis methods (EER 1 , EER 2 ). Higher is better and the unit is bpw.
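The definition reduces to a couple of lines of code. A minimal sketch, as we read the definition above:

```python
# Effective embedding rate: scale ER by the probability of evading detection,
# folding below-chance accuracies back to 1 - Acc.
def effective_embedding_rate(er, acc):
    acc = max(acc, 1.0 - acc)          # mirror accuracies below chance
    return (2.0 - 2.0 * acc) * er      # Acc = 0.5 -> ER, Acc = 1.0 -> 0

assert effective_embedding_rate(4.0, 0.5) == 4.0   # undetectable: full rate
assert effective_embedding_rate(4.0, 1.0) == 0.0   # always detected: nothing
```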

Results and Analysis
The results of KLD 1 and KLD 2 are listed in Table 1. KLD 1 measures the distortion between q and p LM , which is introduced by the embedding algorithm. KLD 2 estimates the overall information-theoretic security, which also considers the deviation of the language models. In terms of KLD 1 , we found that the proposed method ADG outperforms all baselines and is very close to the optimal value 0 (stochastic sampling), which means generating stegotext by ADG is almost equivalent to normal generation with the language models. The results of KLD 2 are also advantageous, indicating that the generated stegotext is statistically consistent with the covertext. We noticed that some baselines can also perform well on KLD 2 (e.g. Patient-Huffman (δ = 1.0)); however, they have a crucial flaw in embedding capacity. Table 2 demonstrates the results of anti-steganalysis, where we found the tendency coheres with that of KLD 1 and KLD 2 . The proposed method ADG outperforms all baselines on the three corpora and is very close to the optimal value 0.5, which further confirms its imperceptibility. Besides, we also illustrate some examples of stegotext generated by ADG in Table 5 for qualitative study.

Table 5: Examples of stegotext generated by ADG on the three corpora.

Movie
The supporting cast was also excellent. But I guess you 've seen the many silent movies along with his other films. And this movie was a precursor of val kilmer in the extreme. It 's a unique wonderful movie that deserves all the recognition it deserved. This is the worst movie I have ever seen.

News
The FBI estimated its total wealth on Thursday. Remember this is in part because of the actual policies of Donald Trump. He said he did not care about any counterintelligence investigation. Today however the process could not change even if he doesnt agree with Trumps rhetoric. More than 100 000 people have been detained and another 30 000 civilians have been wounded early on Sunday.

Tweet
Worst headache everrrr I dunno why but it was so scary. I had a blast today in the MTV Movie Awards. Ahhh some brothers do n't play sports! Sadly you will be missing so much. I do n't think the peach ice cream last night was good.

We found that the stegotext is fluent enough, with correct grammar and coherent semantics. Finally, taking both embedding capacity and imperceptibility into account, we investigated the effective embedding rate listed in Table 4. It can be concluded that our method has excellent comprehensive performance and outperforms all baselines. In general, the experimental results indicate that the proposed method ADG is able to resist both perceptual and statistical steganalysis by Eve while ensuring a remarkable embedding rate, which reveals its effectiveness.

Conclusion
Previous works on generative linguistic steganography inevitably introduce distortions to the distribution estimated by off-the-shelf language models. In this paper, we attempted to achieve provably secure generative linguistic steganography during the procedure of stegotext generation. We proposed ADG, which embeds secret information by adaptive dynamic grouping. According to the mathematical proof and the extensive experiments conducted on three public corpora, the proposed method is provably secure and capable of generating fluent stegotext with high embedding capacity and high imperceptibility. We hope our investigation of provably secure generative linguistic steganography can serve as a building block for future research.