Article

Synonym Substitution Steganalysis Based on Heterogeneous Feature Extraction and Hard Sample Mining Re-Perception

by
Jingang Wang
1,2,
Hui Du
1 and
Peng Liu
1,2,*
1
Hainan Branch, Institute of Acoustics, Chinese Academy of Sciences, Haikou 570105, China
2
Lingshui, Marine Information, Hainan Observation and Research Station, Lingshui 572423, China
*
Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2025, 9(8), 192; https://doi.org/10.3390/bdcc9080192
Submission received: 10 June 2025 / Revised: 5 July 2025 / Accepted: 17 July 2025 / Published: 22 July 2025

Abstract

Linguistic steganography can be used to establish covert communication channels on social media platforms, facilitating the dissemination of illegal messages and seriously compromising cyberspace security. Synonym substitution-based linguistic steganography methods have garnered considerable attention due to their simplicity and strong imperceptibility, yet existing linguistic steganalysis methods do not detect this type of steganography reliably. In this paper, based on the idea of focusing on accumulated differences, we propose a two-stage synonym substitution-based linguistic steganalysis method that does not require a synonym database and can effectively detect texts with very low embedding rates. Experimental results demonstrate that this method achieves an average detection accuracy 2.4% higher than the comparative method.

1. Introduction

Linguistic steganography is an information hiding [1] technique aimed at embedding secret messages into natural text while maintaining imperceptibility. It can be exploited by criminals on social media platforms to disseminate illegal information that is difficult to detect and track, thus posing serious threats to social security. Linguistic steganalysis, in contrast, refers to methods that detect such covert techniques to identify the presence of potential secret information within text.
Synonym substitution-based linguistic steganography is an important branch of linguistic steganography [2,3,4,5,6,7,8,9,10]. Its mainstream approach considers certain elements of natural text as replaceable entities and constructs substitution sets for them. By altering elements within the substitution sets to other content, secret information embedding is achieved. Due to its simple principles and relatively easy implementation, it is prone to exploitation by criminals for illegal activities. Meanwhile, synonym substitution-based steganography minimally alters the semantic structure of the text during secret information embedding, rendering it highly imperceptible. Moreover, its low-embedding-capacity drawback will diminish under the nearly infinite transmission bandwidth in social media.
The specific steps by which synonym substitution-based linguistic steganography generates stego text are as follows. First, a synonym set is established. Then, all words in the natural text are traversed to identify those contained in the synonym set, and the current encoding of the natural text is obtained through the synonym encoding rules. Next, the target encoding is derived from the secret information to be embedded, and the corresponding synonyms are substituted accordingly. Because building a large-scale synonym set is difficult (there are few synonym groups, each containing few words), lengthy natural texts are required to embed a message of useful size; the resulting modifications are therefore sparse, which increases detection difficulty. Additionally, synonym substitution-based linguistic steganography methods typically allow the embedding rate to be adjusted, selectively replacing only a portion of the synonyms in the natural text and thereby achieving linguistic steganography at very low embedding rates.
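The generic substitution procedure described above can be sketched in a few lines. The two-word synonym groups and the one-bit-per-group encoding rule below are illustrative assumptions for the sketch, not the encoding used by TLex or any other specific scheme:

```python
# Hedged sketch of generic synonym-substitution embedding: each synonym
# group of size 2 encodes one bit via the index of the chosen member.
SYN_SET = {
    "big": ("big", "large"),    # index 0 encodes bit 0, index 1 encodes bit 1
    "fast": ("fast", "quick"),
}
# reverse lookup: any member word -> its synonym group
WORD2GROUP = {w: g for g in SYN_SET.values() for w in g}

def embed(text, bits):
    """Replace replaceable words so each group index matches the next bit."""
    out, i = [], 0
    for word in text.split():
        group = WORD2GROUP.get(word)
        if group is not None and i < len(bits):
            out.append(group[bits[i]])   # pick the synonym encoding this bit
            i += 1
        else:
            out.append(word)
    return " ".join(out), i              # stego text, number of bits embedded

def extract(stego):
    """Recover bits as the group index of each replaceable word."""
    return [g.index(w) for w in stego.split()
            if (g := WORD2GROUP.get(w)) is not None]
```

Embedding fewer bits than there are replaceable words (leaving the rest untouched) is exactly the low-embedding-rate setting discussed above.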
When conducting linguistic information hiding at very low embedding rates, the changes in the text before and after steganography are typically minimal to the extent that they are insufficient to cause significant alterations in the statistical features of the text. This implies that conventional statistical analysis-based methods [11,12,13,14,15,16] for detecting text information hiding may struggle to extract steganographically sensitive features, and such methods often require a known synonym set, making them less practical. Additionally, steganography methods typically involve synonym substitution operations throughout the entire natural text segment, and these substitutions are often uniformly distributed, thereby not causing significant changes at specific locations. This makes it difficult for deep learning-based linguistic steganalysis methods [17,18] to capture meaningful steganographically sensitive features.
The general approach of existing synonym substitution-based linguistic steganalysis networks is to construct a sentence-level steganalysis network to obtain sentence-level decision results, and then to refine the determination of whether the text contains secret information by introducing a fusion strategy or a two-stage segment-level steganalysis network. However, only decision features are transmitted between these two stages. When steganographic algorithms adopt low embedding rates, the differences between natural and stego text are minimal, so the sentence-level decision features carry little effective information. The detection performance for low-embedding-rate stego text therefore remains to be improved.
To address the above problems in synonym substitution-based linguistic steganalysis, we proceed from the following observation: when the steganographic algorithm embeds secret information at a low embedding rate, the difference between natural and steganographic text at the sentence level is small, so it is difficult to achieve excellent segment-level detection performance using sentence-level decision features alone. Retaining the sentence-level high-dimensional features and attending to the entire text, in order to capture the cumulative differences between natural and steganographic text, is therefore a way to improve detection performance at low embedding rates. The contributions of this paper can be summarized as follows:
(1)
This paper proposes a two-stage synonym substitution linguistic steganalysis framework. Between the two stages, both the sentence-level decision features and the high-dimensional representations at the sentence level are transmitted, thereby achieving effective segment-level linguistic steganalysis at low embedding rates.
(2)
This paper proposes a heterogeneous feature extraction module to mine steganographic replacement words and extract inter-word association features between these words and other words in the sentence, ultimately obtaining sentence-level decision features and sentence-level high-dimensional representations.
(3)
This paper proposes a difficult sentence mining re-perception module. For sentence-level decision features, we achieve the simultaneous capture of local mutations and overall trends to comprehensively understand whether the segment contains hidden information. For sentence-level high-dimensional representations, a long-distance attention network is utilized to achieve full-segment attention to perceive whether the segment contains hidden information. Finally, the above dual decision features are fully integrated to achieve information hiding detection.

2. Background and Related Works

Information hiding, also known as steganography, is the technique of embedding secret information into digital media such as images, audio, text, etc., without significantly altering their appearance. Utilizing information hiding methods enables the transmission of modified carriers through public channels to recipients without arousing suspicion, allowing for covert communication. Text, due to its ubiquity in social media, presents an ideal medium for information hiding, facilitating the construction of covert communication systems with high stealthiness. The existing literature reveals four main categories of text information hiding methods applicable to social media based on different approaches to embedding secret information. The first category involves modification-based social media information hiding methods [2,3,4,5,6,7,8,9,10], primarily employing semantic analysis and synonym substitution to conceal secret information. The second category encompasses generation-based social media information hiding methods [19,20,21,22,23,24,25], relying on pre-trained language models to directly generate steganography text. The third category focuses on modifying text image features for social media information hiding, achieving secret information embedding by altering imperceptible text spacing, font color, or specific pixel values [26,27,28,29,30,31]. The fourth category involves invisible character embedding for social media information hiding [32,33,34,35,36], exploiting certain characters in the character encoding table that, when inserted into text, remain imperceptible to the human eye. Thus, embedding secret information can be accomplished by inserting invisible characters from the character encoding table into text.
In particular, modification-based social media information hiding methods are less susceptible to editing restrictions imposed by social media platforms, are relatively straightforward in principle, and pose higher detection difficulty, thereby carrying greater security risks for social media platforms. Therefore, addressing the aforementioned issues, this paper focuses on the research of detection methods for modification-based social media information hiding. A popular research focus within modification-based social media information hiding methods is the use of synonym substitution to conceal information. This involves considering certain elements within natural text as interchangeable and forming substitution sets, wherein elements located within these sets in natural text are replaced with other content to embed secret information. Building upon this theory, Ref. [2] proposes the first modification-based linguistic steganography method. It utilizes a multi-base encoding approach to encode synonyms extracted from WordNet, selecting appropriate synonyms from the synonym set to embed secret information. While this method is practical, there is still room for improvement in aspects such as embedding capacity and detectability.
Scholars in the field have conducted a series of studies to address the aforementioned issues. Ref. [3] proposed an improved method for language steganography based on lexical substitution. By utilizing the Google n-gram corpus and a vertex coding algorithm, it enhanced data embedding capacity and the applicability and precision of synonym replacement in information hiding. Ref. [4] further improved upon the method presented in Ref. [3] by introducing additional grammar transformations, effectively increasing embedding capacity. To prevent synonym substitution from disrupting the frequency balance between synonyms and to enhance statistical undetectability, Ref. [5] introduced a linguistic steganography method based on synonym run-length encoding. Initially, synonyms in the text are represented in the form of runs based on relative word frequencies. Subsequently, adaptive synonym substitution is applied to boundary elements of adjacent runs to embed secret information into the parity of run lengths, maintaining the quantity of high- and low-frequency synonyms to reduce embedding distortion. Ref. [6] proposes a method combining matrix encoding with synonym substitution to enhance embedding efficiency. Initially, synonyms in the carrier are quantified according to certain rules, followed by matrix encoding of synonym groups in the carrier to embed secret information. Ref. [7] proposed an adaptive modification-based linguistic steganography method to enhance security. It introduced a dual-layer checksum coding method to minimize the impact of synonym substitution, with the cost function calculation including statistical distortion and semantic distortion. Ref. [8] introduced a new linguistic steganography technique with high imperceptibility and undetectability through secret message compression and candidate text selection. Ref. [9] proposed a synonym substitution steganography algorithm that preserves word frequency. 
It dynamically groups synonyms in the text and encodes secret information by altering the positions of low-frequency synonyms. During the substitution process, maintaining the quantity of low-frequency synonyms unchanged reduces statistical characteristic changes caused by steganography, enhancing detection resistance. Ref. [10] proposed a linguistic steganography algorithm based on the distance of binary dependency collocation vectors. When embedding secret information, it evaluates the suitability of synonym substitution by calculating the vector distance of binary dependency collocations, thereby obtaining the optimal set of synonym substitutions.
Compared with text information hiding methods, text information hiding detection methods, also known as linguistic steganalysis, refer to the determination of whether a text fragment contains hidden information. Early linguistic steganalysis methods based on manually designed features typically involved constructing a series of text features manually, analyzing the changes in these features before and after steganography, and finally designing corresponding binary classifiers to distinguish between stego and natural text. Ref. [11] proposed the first modification-based linguistic steganalysis method, which utilized language models trained on non-stego and stego text to capture their differences, and based on language model statistics, it employed support vector machines (SVMs) for linguistic steganalysis. Ref. [12] evaluated the suitability of words in context based on inverse document frequency and further classified them using support vector machines based on the sequence of text suitability. Ref. [13] estimated the word adaptability in text by introducing context clustering and proposed a steganalysis scheme based on context clustering scores. Ref. [14] first defined text attribute pairs based on the frequency of synonyms and the size of synonym sets, and they theoretically analyzed the statistical characteristic changes of text attribute pairs caused by steganography, proving their usefulness in distinguishing between stego and non-stego text. Ref. [15] investigated the frequency features of elements in substitution sets in both non-stego and stego text, discovered partial characteristics of relative frequency before and after steganography, and they further proposed a synonym substitution linguistic steganalysis method based on Relative Frequency Analysis (RFA). Ref. [16] proposed a modification-based linguistic steganalysis method based on word embeddings using Skipgram language models to represent synonyms and their contexts as embedding vectors. 
Building on this, they established an adaptability model measuring the suitability of synonyms in specific contexts using word embeddings and TF-IDF, and they further utilized support vector machines to distinguish between non-stego and stego text by analyzing adaptability values to extract detection features.
With the widespread application of deep learning-based language models in the field of natural language processing, researchers have begun to focus on how to leverage them for synonym substitution-based linguistic steganalysis. Ref. [17] proposed a sentence-level linguistic steganalysis network based on convolutional neural networks (CNNs). Firstly, it utilizes a word embedding layer to obtain high-dimensional representations of text. Then, it employs convolutional layers with different kernel sizes to extract text features at various scales. All extracted features are concatenated and fed into a classifier for classification. Additionally, a long-text detection decision strategy was proposed, which calculates the proportion of stego sentences and determines whether they contain secret information based on a threshold value. Ref. [18] presented a synonym substitution-based linguistic steganalysis method based on cascaded convolutional neural networks (CNNs). The first-level network is a sentence-level linguistic steganalysis network, composed of convolutional layers, pooling layers, and fully connected layers with multiple kernels of different sizes, which extract two-dimensional steganographic features from each sentence. The second-level network is a segment-level linguistic steganalysis network, which utilizes the predicted representations from the sentence-level linguistic steganalysis network to further determine whether the segment under inspection contains secret information.
The above-mentioned synonym substitution-based linguistic steganalysis method consists of two stages in the detection process. In the first stage, a sentence-level linguistic steganalysis network is constructed to obtain sentence-level decision results. In the second stage, a decision fusion strategy is introduced or a segment-level detection network is designed to further determine whether the segment under investigation contains secret information. Since only sentence-level decision features are retained between the two stages, when the synonym substitution linguistic steganography algorithm adopts a low embedding rate to embed secret information, the difference between the cover and stego texts is small, resulting in fewer effective sentence-level decision features. Consequently, the detection performance of the above method for low-embedding-rate stego texts needs improvement.

3. Statistical Analysis of Word Pairs Before and After Synonym Substitution-Based Linguistic Steganography

When embedding secret information utilizing synonym substitution-based linguistic steganography algorithms, some words in the text are replaced with other words, inevitably altering the statistical characteristics of the original word pairs in the text. Specifically, for an input text segment $s = [w_1, w_2, \ldots, w_i, \ldots, w_n]$ of length $n$, where $w_i$ represents the $i$-th word, if the synonym substitution-based linguistic steganography algorithm modifies $w_i$ to $w_i'$ when embedding secret information, then the word pairs change as follows:
$$[w_1, w_i], \ldots, [w_i, w_n] \rightarrow [w_1, w_i'], \ldots, [w_i', w_n]$$
To validate the above viewpoint, this paper constructs a natural text dataset based on mainstream social media corpora (Twitter [37], Movie [38], News [39]) and employs the TLex [2] linguistic steganography algorithm to hide secret information by replacing synonyms at different embedding rates. This paper analyzes the statistics of word pairs before and after steganography, specifically the coefficient of word pair reduction, the coefficient of word pair addition, and the coefficient of word pair variation. The coefficient of word pair reduction is the proportion of word pairs appearing only in the natural text among all word pairs in the natural text; the coefficient of word pair addition is the proportion of word pairs appearing only in the steganographic text among all word pairs in the steganographic text; and the coefficient of word pair variation is the proportion of word pairs whose frequencies differ between the natural and steganographic text among all word pairs.
This paper believes that the coefficient of word pair reduction can measure the static specificity of natural text, the coefficient of word pair addition can measure the static specificity of steganographic text, and the coefficient of word pair change can measure the dynamic specificity between natural text and steganographic text. The calculation formulas are as follows:
$$I_d = \frac{|CT_c \setminus CT_s|}{|CT_c|}$$
$$I_e = \frac{|CT_s \setminus CT_c|}{|CT_s|}$$
$$I_v = \frac{|\{\, c \in CT_c \cup CT_s : F_c[c] \neq F_s[c] \,\}|}{|CT_c \cup CT_s|}$$
where $I_d$, $I_e$, and $I_v$ denote the coefficient of word pair reduction, the coefficient of word pair addition, and the coefficient of word pair variation, respectively. $CT_c$ denotes the set of word pair types in the natural text, while $CT_s$ denotes the set of word pair types in the steganographic text. $|\cdot|$ denotes the cardinality of a set. $F_c[\cdot]$ denotes the frequency of a given word pair in the natural text, while $F_s[\cdot]$ denotes its frequency in the steganographic text. The statistical results of the above quantities under different embedding rates are shown in Figure 1.
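The three coefficients can be computed directly from word-pair counts. The sketch below uses unordered within-sentence word pairs as a stand-in for the paper's word-pair definition, which is an assumption on our part:

```python
from collections import Counter
from itertools import combinations

def word_pairs(text):
    """Count unordered word pairs in a text (illustrative pair definition)."""
    return Counter(frozenset(p) for p in combinations(text.split(), 2)
                   if len(set(p)) == 2)

def coefficients(cover, stego):
    """I_d, I_e, I_v from the cover/stego word-pair type sets and frequencies."""
    Fc, Fs = word_pairs(cover), word_pairs(stego)
    Tc, Ts = set(Fc), set(Fs)
    I_d = len(Tc - Ts) / len(Tc)            # word pair reduction
    I_e = len(Ts - Tc) / len(Ts)            # word pair addition
    union = Tc | Ts                          # Counter returns 0 for missing keys
    I_v = sum(Fc[p] != Fs[p] for p in union) / len(union)  # word pair variation
    return I_d, I_e, I_v
```

For example, substituting "big" with "large" in "a big cat" removes two pair types, adds two, and changes four of the five pair types in the union.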
From Figure 1, it can be seen that as the embedding rate increases, the coefficient of word pair reduction rises from 6.3% to 26.3%, the coefficient of word pair addition rises from 12.8% to 33.1%, and the coefficient of word pair variation rises from 13.0% to 37.9%. This indicates that the statistical differences between words in the synonym set and the other words in the sentence gradually increase with steganography, and the proportion of changed word pairs reflects, to some extent, the amount of secret information embedded. Furthermore, whether at a 25% or a 100% embedding rate, the proportion of changed word pairs in the steganographic text is always higher than before steganography. The word pair feature is therefore an effective criterion for distinguishing natural texts from texts containing hidden information, and it is necessary to design a feature extraction module that effectively captures the changes in inter-word associations before and after steganography.

4. Method

In light of the existing issues with current natural text steganalysis methods and the statistical analysis of word pair changes before and after text steganography, this paper presents a novel framework for natural language steganalysis. Firstly, a heterogeneous feature extraction module is designed to capture steganographic substitution words and extract inter-word association features between these words and other words in the text. Simultaneously, the obtained features are further optimized to acquire sentence-level decision features and sentence-level high-dimensional representations. Subsequently, a module for difficult sentence mining and perception is introduced to reconstruct sentence-level high-dimensional representations using sentence-level decision features. Following this, the sentence-level decision features are further integrated with optimized sentence-level high-dimensional representations to differentiate between text containing hidden information and text without hidden information. Detailed elaboration of the aforementioned methods will be provided subsequently.

4.1. Overall Structure

The overall structure and training process of the synonym substitution-based linguistic steganalysis method proposed in this paper are illustrated in Figure 2. The method mainly consists of two modules: heterogeneous feature extraction module and difficult sentence mining and re-perception module. The heterogeneous feature extraction module is utilized to achieve sentence-level linguistic steganalysis. This module takes both unmodified and steganographically modified sentences as input, extracting sentence-level decision features and sentence-level high-dimensional representations. The difficult sentence mining and re-perception module further implements segment-level linguistic steganalysis based on the aforementioned features.
The network training is divided into two stages. The first stage trains the heterogeneous feature extraction module. Specifically, the dataset of steganographic segments is first divided into sentences to obtain sentence fragments. Each fragment can be labeled, by contrast with the corresponding non-steganographic segment, according to whether it has been modified. Consider a training batch $S$ of size $N$ defined as $S = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ represents the $i$-th input sentence fragment and $y_i$ its label. For each fragment, the heterogeneous feature extraction module extracts sentence-level decision features and sentence-level high-dimensional vectors. Subsequently, the cross-entropy classification loss is computed from the two-dimensional decision features and the sentence labels. The specific computation process is as follows:
$$\hat{y}_i, h_i = F_{HFE}(x_i)$$
$$L_{ce} = -\frac{1}{N} \sum_{i=1}^{N} \left[\, y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \,\right]$$
where $\hat{y}_i \in \mathbb{R}^2$ represents the two-dimensional decision features of the $i$-th sample, $h_i \in \mathbb{R}^d$ represents the sentence-level high-dimensional representation of the $i$-th sample ($d$ denotes the length of that representation vector), and $y_i$ represents the label of the $i$-th sample. The parameters of the heterogeneous feature extraction module are optimized using the classification loss $L_{ce}$.
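A minimal numerical sketch of $L_{ce}$, treating the stego-probability output of the sentence-level module as given (the function below is our stand-in, not the paper's implementation):

```python
import numpy as np

def bce_loss(y_hat, y):
    """L_ce = -(1/N) * sum_i [ y_i*log(y_hat_i) + (1-y_i)*log(1-y_hat_i) ]."""
    eps = 1e-12                           # floor to keep log() finite
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return float(-np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))
```

An uninformative predictor that always outputs 0.5 incurs a loss of log 2 per sample, the usual baseline for a binary classifier.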
The second stage trains the difficult sentence mining re-perception module. Specifically, for an input segment $D$ with $n$ sentence fragments, $D = [s_1, s_2, \ldots, s_j, \ldots, s_n]$, where $s_j$ represents the $j$-th sentence fragment, the pre-trained heterogeneous feature extraction module is first used to extract sentence-level decision features and sentence-level high-dimensional representations for each sentence fragment. The calculation process is shown in the following equation:
$$\hat{y}_j, h_j = F_{HFE}(s_j), \quad j = 1, 2, \ldots, n$$
where $\hat{y}_j \in \mathbb{R}^2$ represents the two-dimensional decision features of the $j$-th sentence fragment and $h_j \in \mathbb{R}^d$ represents its sentence-level high-dimensional representation. Subsequently, the decision feature matrix $M_d \in \mathbb{R}^{n \times 2}$ and the high-dimensional representation matrix $M_s \in \mathbb{R}^{n \times d}$ are assembled from all sentence fragments in the segment. Both are then fed into the difficult sentence mining re-perception module to predict whether the segment $D$ contains secret information.
$$\hat{y} = F_{DSMP}(M_d, M_s)$$
where $\hat{y} \in \mathbb{R}^1$ represents the predicted probability that the segment contains secret information. Setting a fixed detection threshold $t$, the detection result can be represented as
$$D \in \begin{cases} \text{Cover}, & \hat{y} \le t \\ \text{Stega}, & \hat{y} > t \end{cases}$$
During the training process, the loss is computed using cross-entropy, and the parameters of the difficult sentence mining re-perception module are updated via the backpropagation algorithm.
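The stage-two decision rule can be sketched as follows; `toy_dsmp` is a deliberately trivial stand-in for the trained difficult sentence mining re-perception network (it simply averages the per-sentence stego confidences), used only to make the thresholding concrete:

```python
import numpy as np

def segment_decision(M_d, M_s, f_dsmp, t=0.5):
    """Feed the sentence decision matrix M_d (n, 2) and high-dimensional
    matrix M_s (n, d) to f_dsmp, then threshold its stego probability at t."""
    y_hat = f_dsmp(M_d, M_s)
    return ("Stega" if y_hat > t else "Cover"), y_hat

def toy_dsmp(M_d, M_s):
    """Trivial stand-in: mean stego confidence over all sentences."""
    return float(M_d[:, 1].mean())
```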

4.2. Heterogeneous Feature Extraction Module

There is a strong semantic correlation between the words in the text, which means that each word in the sentence forms a semantic style with other words in the sentence. Synonym substitution-based steganography algorithms replace words in the text with their synonyms when embedding secret messages, inevitably destroying the inter-word associations. Therefore, identifying steganographic replacement words and paying attention to the changes in inter-word dependency caused by replacement will help distinguish natural text from steganographic text. The heterogeneous feature extraction module designed in this paper is shown in Figure 3.
Specifically, given an input text segment $s = [w_1, w_2, \ldots, w_i, \ldots, w_n]$ of length $n$, where $w_i$ represents the $i$-th word, we first utilize a dynamic word embedding layer $T$ to map the text into a high-dimensional vector matrix $E \in \mathbb{R}^{n \times d}$, where $d$ denotes the length of the word vectors. The dynamic word embedding layer is randomly initialized with a uniform distribution over $[-1, +1]$ and updated during training via backpropagation to learn subtle differences between words. Since changes in inter-word dependency caused by synonym substitution are most notably reflected between the replacement word and its adjacent words, a one-dimensional convolutional network with a local receptive field can be employed to identify steganographic replacement words. The simplified computational formula for the one-dimensional convolutional network is as follows:
$$y[n] = \sum_{k=0}^{M-1} x[n-k]\, h[k]$$
where $h[k]$ represents the $k$-th element of the convolution kernel, $x[n-k]$ represents the $(n-k)$-th element of the input, $M$ denotes the length of the convolution kernel, and $y[n]$ represents the $n$-th element of the output. To fully extract local semantic mutation features, this paper employs one-dimensional convolutional networks with different kernel sizes. The specific calculation formula is as follows:
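The convolution formula can be checked with a direct, pure-Python implementation (valid positions only, i.e. where the kernel window fully overlaps the input):

```python
def conv1d(x, h):
    """y[n] = sum_{k=0}^{M-1} x[n-k] * h[k], for n where the window fits."""
    M = len(h)
    return [sum(x[n - k] * h[k] for k in range(M))
            for n in range(M - 1, len(x))]
```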
$$att_w = mp\big(cat\big(conv_{k_1}(E), conv_{k_2}(E), conv_{k_3}(E)\big)\big)$$
where $conv_{k_1}$, $conv_{k_2}$, and $conv_{k_3}$ denote one-dimensional convolutional layers with kernel sizes $k_1$, $k_2$, and $k_3$, respectively; $cat(\cdot)$ denotes the concatenation operation; $mp(\cdot)$ denotes the max-pooling operation; and $att_w$ represents the attention map of steganographic replacement words. Subsequently, this attention map is used to reconstruct the high-dimensional vector matrix $E$, and a sequence encoder based on a self-attention mechanism extracts inter-word association features between steganographically sensitive words and the other words in the sentence.
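A NumPy sketch of the multi-scale attention map is given below. The random kernels, the 'same' padding, and the final sigmoid squashing (to keep the map in (0, 1)) are our assumptions for illustration; the paper's layers are trained, not random:

```python
import numpy as np

rng = np.random.default_rng(0)

def depthwise_conv(E, k):
    """1-D convolution over the word axis with a random width-k kernel,
    'same' padding, applied independently to each embedding dimension."""
    n, _ = E.shape
    w = rng.standard_normal(k)
    pad = np.pad(E, ((k // 2, k - 1 - k // 2), (0, 0)))
    return np.stack([w @ pad[i:i + k] for i in range(n)])      # (n, d)

def word_attention(E, kernel_sizes=(3, 5, 7)):
    """att_w = mp(cat(...)): concatenate the multi-scale conv maps along the
    feature axis, max-pool per word, then squash to (0, 1)."""
    maps = np.concatenate([depthwise_conv(E, k) for k in kernel_sizes], axis=1)
    scores = maps.max(axis=1, keepdims=True)                   # (n, 1)
    return 1.0 / (1.0 + np.exp(-scores))                       # sigmoid (our choice)
```

The result is one attention weight per word, which can be broadcast against $E$ in the element-wise product that follows.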
$$T = SE(E \odot att_w)$$
where $SE$ represents the sequence encoder, $\odot$ denotes element-wise multiplication, and $T$ denotes the inter-word association features of the text. The core idea of the self-attention mechanism in sequence encoding is to allow the model to consider the other elements in the sequence while processing each element. The basic formula of self-attention can be expressed as
$$att_s(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^T}{\sqrt{d_K}}\right) V$$
where $Q$, $K$, and $V$ represent the Query, Key, and Value matrices, usually obtained by applying different linear transformations to the input sequence. $d_K$ is the dimensionality of the key vectors, and the dot product is scaled by $\sqrt{d_K}$ to ensure training stability. $\mathrm{Softmax}(\cdot)$ transforms the dot-product scores into a probability distribution, so each output element is a weighted sum of the input sequence, where the weights reflect the correlation between sequence elements. Subsequently, the attention map of steganographic replacement words $att_w$ is used to reconstruct the inter-word association features $T$, and a max-pooling operation explicitly extracts the inter-word dependency features between steganographic replacement words and the other words in the text.
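The scaled dot-product attention, written out in NumPy:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(Q, K, V):
    """att_s(Q, K, V) = Softmax(Q K^T / sqrt(d_K)) V."""
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V
```

Because each softmax row sums to 1, every output row is a convex combination of the rows of V; feeding a constant V therefore returns that constant.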
$$F_s = mp(T \odot att_w)$$
where $F_s$ represents the sentence-level inter-word dependency features, also referred to as the sentence-level high-dimensional representation. Sentence-level decision features are then obtained using a linear layer.
$$F_d = \mathrm{Sigmoid}(W_s F_s + b_s)$$
where $F_d$ represents the sentence-level decision features, $\mathrm{Sigmoid}(\cdot)$ is the activation function, $W_s$ is the weight matrix of the linear layer, and $b_s$ is the bias vector.
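Putting the last two equations together, the sentence-level head reduces to an attention-weighted max-pool followed by a sigmoid linear layer. The weight shapes below are assumptions for the sketch:

```python
import numpy as np

def sentence_head(T, att_w, W_s, b_s):
    """F_s = mp(T ⊙ att_w) pooled over words; F_d = Sigmoid(W_s F_s + b_s).
    T: (n, d) inter-word association features; att_w: (n, 1) attention map."""
    F_s = (T * att_w).max(axis=0)                     # (d,) sentence representation
    F_d = 1.0 / (1.0 + np.exp(-(W_s @ F_s + b_s)))    # (2,) decision features
    return F_s, F_d
```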

4.3. Difficult Sentence Mining Re-Perception Module

In synonym substitution-based linguistic steganography methods, secret information is embedded by selectively replacing only a portion of the synonyms in the natural text, thus achieving very low embedding rates. Due to the limited number of synonym replacements, the differences between sentence-level natural text and steganographic text are minor. This subtle difference makes it difficult to achieve excellent segment-level detection performance relying solely on sentence-level features. By preserving the high-dimensional representations at the sentence level and further analyzing the entire segment, the cumulative differences between natural text and steganographic text can be revealed, thereby improving the effectiveness of segment-level linguistic steganography detection. Additionally, during the detection process, difficult-to-detect sentences can be identified by analyzing sentence-level decision features. Then, by focusing on these sentences again, detection performance can be improved. The difficult sentence mining re-perception module designed in this study is detailed in Figure 4.
Specifically, consider the sentence-level decision features F_dd = {F_d1, F_d2, …, F_dl} ∈ R^(l×2) and the sentence-level high-dimensional representations F_sd = {F_s1, F_s2, …, F_sl} ∈ R^(l×d), where l denotes the number of sentences in the segment. The sentence-level decision features are the confidence scores predicted by the heterogeneous feature extraction module for whether each sentence in the segment contains secret information. If the two values of a sentence’s decision feature are close, the sentence is considered difficult to classify and should be given special attention. Therefore, this study preprocesses the sentence-level decision features and uses a linear layer to obtain sentence-level attention, thereby reconstructing the sentence-level high-dimensional representations. The specific computational formula is as follows:
att_h = W_d1 · cat(F_dd1 − F_dd2, F_dd) + b_d1
F_h = F_sd · att_h
where att_h represents the attention map for difficult sentences, F_dd1 and F_dd2 represent the first and second dimensions of the classification decision feature, W_d1 is the weight matrix of the linear layer, b_d1 is the bias vector, and F_h is the optimized high-dimensional representation. This study uses the sentence-level decision features and the optimized sentence-level high-dimensional representations in parallel. For the decision features, if a sentence’s decision feature indicates that it does not contain secret information, it is treated as a negative feature; conversely, if it indicates the presence of secret information, it is treated as a positive feature. At low embedding rates, most sentence-level decision features in a steganographic segment are negative, with positive features appearing as a minority mutation; at high embedding rates, positive features become more frequent. Therefore, to capture local mutations at low embedding rates and perceive overall trends at high embedding rates, this paper uses a convolutional neural network and a long short-term memory network to extract the segment decision features, with the following calculation formula:
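A minimal NumPy sketch of this reconstruction, assuming att_h reduces to one scalar weight per sentence driven by the gap between its two decision scores (all names and shapes here are illustrative):

```python
import numpy as np

def reperceive(F_dd, F_sd, W_d1, b_d1):
    # F_dd: (l, 2) sentence decision features; F_sd: (l, d) sentence representations.
    # A small gap between the two decision scores marks a hard-to-classify sentence.
    gap = (F_dd[:, 0] - F_dd[:, 1])[:, None]                   # (l, 1)
    att_h = np.concatenate([gap, F_dd], axis=1) @ W_d1 + b_d1  # (l, 1) attention map
    F_h = F_sd * att_h                                         # reweight each sentence's vector
    return att_h, F_h
```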
D_d = W_d2 · cat(CNN(F_dd), LSTM(F_dd)) + b_d2
where D_d represents the segment decision feature, W_d2 is the weight matrix of the linear layer, and b_d2 is the bias vector; CNN denotes a convolutional neural network and LSTM a long short-term memory network. For the optimized high-dimensional representations of difficult sentences, given the significant amount of information they contain, this paper uses a Transformer encoder, with its strong modeling capability, to explore the changes in cumulative word dependencies before and after steganography. Different pooling operations further refine the obtained features, and a linear layer finally yields the segment’s re-perceived features D_t. The specific calculation formula is as follows:
D_t = W_d3 · cat(mp(TE(F_h)), ap(TE(F_h))) + b_d3
where TE denotes the Transformer encoder, ap(·) the average-pooling operation, W_d3 the weight matrix of the linear layer, and b_d3 the bias vector. Subsequently, this study fuses the segment decision features with the segment re-perceived features using learnable parameters to obtain the prediction of whether the segment contains secret information. The specific calculation process is as follows:
ŷ_i = LP_a · D_d + LP_b · D_t
where LP_a and LP_b are learnable parameters, and ŷ_i is the predicted value for the segment.
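The whole segment-level path (CNN and LSTM over the decision features, a Transformer encoder with max- and average-pooling over the re-weighted representations, and the learnable fusion) might be sketched in PyTorch as follows. The layer sizes and pooling choices here are illustrative assumptions, not the paper’s actual configuration:

```python
import torch
import torch.nn as nn

class SegmentReperception(nn.Module):
    # Illustrative sketch of the segment-level decision path; all sizes hypothetical.
    def __init__(self, d=64):
        super().__init__()
        self.cnn = nn.Conv1d(2, 4, kernel_size=5, padding=2)    # local mutations
        self.lstm = nn.LSTM(2, 2, batch_first=True,
                            bidirectional=True)                 # overall trend
        self.lin_d = nn.Linear(4 + 4, 2)                        # D_d head
        enc = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.te = nn.TransformerEncoder(enc, num_layers=1)
        self.lin_t = nn.Linear(2 * d, 2)                        # D_t head
        self.lp_a = nn.Parameter(torch.tensor(0.5))             # learnable fusion weights
        self.lp_b = nn.Parameter(torch.tensor(0.5))

    def forward(self, F_dd, F_h):
        # F_dd: (B, l, 2) decision features; F_h: (B, l, d) re-weighted representations
        c = self.cnn(F_dd.transpose(1, 2)).amax(dim=2)          # (B, 4)
        r, _ = self.lstm(F_dd)
        r = r.amax(dim=1)                                       # (B, 4)
        D_d = self.lin_d(torch.cat([c, r], dim=1))              # segment decision feature
        h = self.te(F_h)
        D_t = self.lin_t(torch.cat([h.amax(dim=1), h.mean(dim=1)], dim=1))
        return self.lp_a * D_d + self.lp_b * D_t                # y_hat
```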

5. Experiments and Analysis

In this section, we first introduce the experimental setup, including the steganographic algorithms and training corpora, dataset construction, comparison methods, and experimental parameters. We then analyze the overall detection performance of the proposed approach on various steganalysis tasks.

5.1. Experimental Settings

5.1.1. Steganography Algorithms and Training Corpora

All experiments are conducted on three popular training corpora: Twitter [37], Movie [38], and News [39]. The detailed statistics of each corpus are shown in Table 1. Additionally, we selected three synonym substitution-based steganography algorithms, Tlex [2], MC [6], and WFP [9], to thoroughly evaluate the detection performance of the proposed linguistic steganalysis method. In this experiment, both natural text and steganographic text with different embedding payloads are tested. The generated samples are split into 70% for training, 10% for validation, and 20% for testing.

5.1.2. Embedding Rate Settings

The embedding rate is an important factor affecting the quality of text generated by linguistic steganography methods. The Tlex steganography algorithm includes four embedding rate settings: 25%, 50%, 75%, and 100%. The MC steganography algorithm comprises two embedding rate settings: 3-bit (embedding 3 bits of secret information into 1 word every 7 words) and 4-bit (embedding 4 bits of secret information into 1 word every 15 words).
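Assuming the schedules stated above, the per-word payloads of the two MC settings work out as follows. Note that, despite the name, the 4-bit setting carries fewer bits per word than the 3-bit setting:

```python
# Per-word payload implied by the MC schedules above:
# "3-bit": 3 bits of secret information per 7 words
# "4-bit": 4 bits of secret information per 15 words
mc_3bit = 3 / 7    # ≈ 0.429 bits per word
mc_4bit = 4 / 15   # ≈ 0.267 bits per word
```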

5.1.3. Datasets

For the purpose of constructing low-embedding-rate steganography texts, this experiment initially preprocesses and segments the Movie, News, and Twitter corpora. Every 50 segmented sentences are synthesized into one message. The Movie corpus comprises 9072 messages, the News corpus contains 14,520 messages, and the Twitter corpus includes 10,372 messages. Subsequently, the Tlex, MC, and WFP synonym substitution-based linguistic steganography algorithms are employed to embed secret information into each message.
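The message construction step can be sketched as follows; how trailing sentences that do not fill a complete message are handled is an assumption, since the text does not specify (here they are dropped):

```python
def build_messages(sentences, per_message=50):
    # Group consecutive segmented sentences into fixed-size messages, as in the
    # dataset construction above; trailing sentences that cannot fill a complete
    # message are dropped (an assumption on our part).
    return [sentences[i:i + per_message]
            for i in range(0, len(sentences) - per_message + 1, per_message)]
```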
The experiment designs three mixed-domain steganalysis tasks, including hybrid steganographic algorithm steganalysis tasks, hybrid corpora steganalysis tasks, and hybrid steganographic algorithms and corpora steganalysis tasks. For the hybrid steganographic algorithm steganalysis tasks, each task includes natural texts from three corpora, as well as steganographic texts obtained by embedding secret information into these natural texts using a specific steganographic algorithm. For the hybrid corpora steganalysis tasks, each task includes natural texts from a specific corpus and corresponding steganographic texts. These steganographic texts are generated by embedding secret information into natural texts using three different steganographic algorithms under different embedding rate conditions. For the hybrid steganographic algorithms and corpora steganalysis tasks, each task includes natural texts from three corpora and corresponding steganographic texts generated by three different steganographic algorithms under different embedding rate conditions.
When conducting practical linguistic steganalysis, it is inevitable to discriminate between stego text generated by unknown steganographic algorithms or stego text generated by embedding secret information into text from unknown corpora. Therefore, this experiment designs three different cross-domain steganalysis tasks, namely, cross-steganography algorithm steganalysis tasks, cross-corpus steganalysis tasks, and cross-steganography algorithm and corpus steganalysis tasks. For cross-steganography algorithm steganalysis tasks, the training and testing sets consist of stego text generated by embedding secret information into text from the same corpus using different steganographic algorithms. For cross-corpus steganalysis tasks, the training and testing sets consist of stego text generated by embedding secret information into text from different corpora using the same steganographic algorithm. For cross-steganography algorithm and corpus steganalysis tasks, the training and testing sets consist of stego text generated by embedding secret information into text from different corpora using different steganographic algorithms.
Furthermore, to validate the detection performance of the model on the native data of the above-mentioned corpora, this experiment designs a native data steganalysis task. It preprocesses 3000 original messages from the Movie, News, and Twitter corpora, respectively, and directly employs the Tlex, MC, and WFP steganography algorithms to embed secret information and generate steganographic texts. Each of the three steganography algorithms embeds secret information into 1000 original messages. The Tlex algorithm uses 500 original messages each for the 25% and 100% embedding rates, and the MC algorithm uses 500 original messages each for the 3-bit and 4-bit embedding rate settings.

5.1.4. Comparison Methods

To validate the effectiveness of the proposed method, this paper compares four deep learning-based synonym substitution linguistic steganalysis methods as baseline models, namely, LS-CNN [17], TCNNS [18], EILG [40], and SDC [41]. LS-CNN is a convolutional neural network-based sentence-level linguistic steganalysis network, which utilizes convolutional kernels of different sizes to extract text features at different scales to determine whether a sentence contains secret information. Additionally, this method proposes a segment-level steganalysis decision strategy, calculating the proportion of stego sentences and determining the presence of secret information by setting a threshold value. TCNNS is a linguistic steganalysis method based on cascaded convolutional neural networks: the first-stage network extracts two-dimensional steganographic features for each sentence in the segment, while the second-stage network determines whether the segment contains secret information based on the obtained sentence-level steganographic features. EILG extracts high-quality text representations by integrating local semantic and global dependency features. SDC utilizes bidirectional recurrent neural networks and dense convolutional neural networks to extract multi-granularity text features.
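The segment-level decision strategy described for LS-CNN can be sketched as follows; the 0.5 sentence cut-off and the 0.2 segment threshold are illustrative values, not the ones used in [17]:

```python
def segment_is_stego(sentence_probs, threshold=0.2):
    # Classify each sentence as stego when its predicted probability exceeds 0.5,
    # then flag the segment when the fraction of stego sentences exceeds the
    # threshold. Both cut-off values here are illustrative assumptions.
    stego_frac = sum(p > 0.5 for p in sentence_probs) / len(sentence_probs)
    return stego_frac > threshold
```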

5.1.5. Training Settings

All experiments are conducted using the PyTorch 2.4.1 deep learning framework and the Python 3.8 programming language. The hyperparameters are selected based on empirical experience from multiple experiments and grid search results, with specific settings shown in Table 2. All experiments were executed on a GeForce RTX 2080 Ti graphics processing unit with 11 GB of graphical memory. The experiments utilized the Adam optimizer with a learning rate of 1 × 10−3 and were trained for 50 epochs. Model detection performance is evaluated using accuracy, precision, recall, and F1-Score.
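The grid search mentioned above might look like the following scaffold; `train_and_eval`, the candidate values, and the search dimensions are hypothetical placeholders:

```python
from itertools import product

def grid_search(train_and_eval, lrs=(1e-2, 1e-3, 1e-4),
                kernel_sizes=((3, 5, 7), (2, 4, 6))):
    # Score every (learning rate, kernel-size) combination on the validation set
    # and keep the best; train_and_eval is supplied by the caller and is assumed
    # to return a validation accuracy.
    return max(product(lrs, kernel_sizes),
               key=lambda cfg: train_and_eval(*cfg))
```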

5.2. Detection Performance Analysis on the Constructed Dataset

This section comprehensively evaluates the detection performance of the comparative methods and the proposed method on the constructed dataset. Specifically, it covers the following: hybrid steganographic algorithm steganalysis tasks, hybrid corpora steganalysis tasks, hybrid steganographic algorithms and corpora steganalysis tasks, cross-steganography algorithm steganalysis tasks, cross-corpus steganalysis tasks, and cross-steganography algorithm and corpus steganalysis tasks.

5.2.1. Hybrid Steganographic Algorithm Steganalysis Tasks

This section evaluates the detection accuracy of the comparative algorithms and the proposed method in hybrid steganographic algorithm steganalysis tasks. The experimental results are shown in Table 3. Overall, the method proposed in this paper shows varying degrees of improvement in detection accuracy over the comparative methods across different corpora. The proposed method achieves average detection performance improvements of 8.57%, 5.97%, 3.63%, and 2.03% over the four comparative methods across different corpora. Specifically, the proposed method achieves the highest average detection accuracy of 97.44% on the Twitter corpus. The average detection accuracy is slightly lower on the Movie corpus, but compared to the four comparative algorithms, the average detection accuracy still improves by 5.51%.
Additionally, we evaluated the efficiency of the models, including the number of parameters and inference time. These metrics not only affect the operational efficiency of the models but also directly relate to their feasibility in practical applications. Therefore, we conducted a comparative analysis among different models, as shown in Table 4. From the table, it can be seen that although EILG and SDC have relatively small parameter counts, their steganalysis performance is also comparatively poor. In contrast, LS-CNN, TCNNS, and our proposed algorithm, while having relatively larger numbers of parameters, demonstrate better steganalysis performance. Furthermore, our proposed method achieves the best performance; its longer inference time remains an area for future optimization.

5.2.2. Hybrid Corpora Steganalysis Tasks

This section evaluates the detection accuracy of the proposed method in comparison with alternative algorithms in hybrid corpora steganalysis tasks. The experimental results are presented in Table 5. Overall, the proposed method outperforms the comparison methods in detection accuracy across different steganographic algorithms to varying degrees. The proposed method exhibits an average improvement in detection accuracy of 7.94%, 7.02%, 1.63%, and 0.90% over the four comparison methods across different steganographic algorithms. Specifically, the proposed method performs best under the Tlex steganographic algorithm with a 100% embedding rate, achieving a detection accuracy of 99.85%. Although our algorithm did not achieve the optimal performance under the MC steganographic algorithm with a 4-bit embedding rate, it still attained a detection accuracy of 94.25%, which is within one percentage point of the optimal detection result. In relatively more challenging tasks, the proposed method shows greater improvements in detection accuracy compared to the comparison methods. For instance, under the Tlex steganographic algorithm with a 25% embedding rate, the average detection accuracy of the four comparison methods is only 91.04%, while the proposed method achieves an average improvement in detection accuracy of 5.47% in this scenario.

5.2.3. Hybrid Steganography Algorithms and Corpora Steganalysis Tasks

This section evaluates the detection accuracy and F1-Score of the proposed method in comparison with alternative algorithms in hybrid steganography algorithms and corpora steganalysis tasks. The experimental results are presented in Table 6. From the experimental results, it can be observed that the proposed method outperforms the comparison algorithms in terms of detection accuracy by 8.35%, 7.72%, 3.51%, and 1.29%, respectively. Additionally, the proposed method exhibits a performance improvement of 29.51%, 27.85%, 3.73%, and 1.37% in F1-Score compared to the comparison algorithms.

5.2.4. Cross-Steganography Algorithm Steganalysis Tasks

This section evaluates the detection accuracy of the proposed method in comparison with alternative algorithms in cross-steganography algorithm steganalysis tasks. The experimental results are presented in Table 7, where T, M, and W represent the Tlex, MC, and WFP steganographic algorithms, respectively. Tasks such as “T→M” denote training on data generated by the Tlex algorithm containing secret information and testing on data generated by the MC algorithm, with similar tasks for other combinations. Overall, it can be observed that the proposed steganalysis method outperforms the comparison methods in terms of detection accuracy across various cross-steganography algorithm steganalysis tasks. The proposed method exhibits an average improvement in detection accuracy of 3.17% and 1.93% over the two comparison methods across different steganographic algorithms. Specifically, the proposed method performs best on the T→W task, achieving a detection accuracy of 99.74%, which is close to perfect detection. Even in this scenario, the proposed method still improves detection accuracy by 2% and 1.57% over the two comparison methods. However, the performance of the proposed method is relatively poorer on the T→M task, although it still improves detection accuracy by 3.43% and 2.11% over the two comparison methods in this case.

5.2.5. Cross-Corpus Steganalysis Tasks

This section evaluates the detection accuracy of the proposed method in comparison with alternative algorithms in cross-corpus steganalysis tasks. The experimental results are presented in Table 8, where M, N, and T represent the Movie, News, and Twitter corpora, respectively. Tasks such as “M→N” denote training on data from the Movie corpus (including both natural and corresponding steganographic text) and testing on data from the News corpus (including both natural and corresponding steganographic text), with similar tasks for other corpus combinations. Overall, compared to the hybrid corpora steganalysis task, the detection accuracy for cross-corpus steganalysis tasks shows a slight decrease. This is because different corpora have differences in vocabulary, leading to increased difficulty in detection. Across various cross-corpus steganalysis tasks, the proposed steganalysis method outperforms the comparison methods in terms of detection accuracy. The proposed method exhibits an average improvement in detection accuracy of 8.93% and 6.52% over the two comparison methods across different steganographic algorithms. Specifically, the proposed method performs best on the M→T task, with detection accuracy improvements of 11.37% and 7.47% over the two comparison methods. Although the performance of the proposed method is slightly lower on the T→N task, it still achieves detection accuracy improvements of 12.85% and 4.73% over the two comparison methods in this scenario.

5.2.6. Cross-Steganography Algorithm and Corpus Steganalysis Tasks

This section evaluates the detection accuracy of the proposed method in comparison with alternative algorithms in cross-steganography algorithm and corpus steganalysis tasks. The experimental results are presented in Table 9. Tasks such as “TM→MN” denote training on data from the Movie corpus (including both natural and corresponding steganographic text generated using the Tlex algorithm) and testing on data from the News corpus (including both natural and corresponding steganographic text generated using the MC algorithm), with similar tasks for other corpus and algorithm combinations. Overall, across various cross-steganography algorithm and corpus steganalysis tasks, the proposed steganalysis method outperforms the comparison methods in terms of detection accuracy. The proposed method exhibits an average improvement in detection accuracy of 6.78% and 5.6% over the two comparison methods across different steganographic algorithms. Specifically, the proposed method performs best on the MM→TT task, with detection accuracy improvements of 11.26% and 7.2% over the two comparison methods. Although the performance of the proposed method is slightly lower on the WT→MM task, it still achieves detection accuracy improvements of 3.95% and 3.9% over the two comparison methods in this scenario.

5.3. Analysis of Detection Performance on Native Dataset

This section evaluates the detection accuracy of the proposed method in comparison with alternative algorithms for the task of steganalysis on native datasets. The experimental results are presented in Table 10. Overall, it can be observed that the proposed steganalysis method outperforms the comparison methods in terms of accuracy across different datasets. The proposed method exhibits an average improvement in detection performance of 2.77% and 1.54% over the two comparison methods across various datasets. Specifically, the proposed method achieves the highest detection accuracy on the Movie dataset, reaching 97.71%. In this scenario, the proposed method improves detection accuracy by 3.39% and 1.45% over the two comparison methods. On the other hand, the lowest detection accuracy is observed on the Twitter dataset, but even in this case, the proposed method improves detection accuracy by 2.86% and 1.99% over the two comparison algorithms. Additionally, on the News dataset, the proposed method improves detection accuracy by 2.04% and 1.18% over the two comparison algorithms.

6. Conclusions

In this paper, we propose a synonym substitution linguistic steganalysis method based on heterogeneous feature extraction and difficult sentence mining re-perception. First, a heterogeneous feature extraction module is designed to identify steganographic replacement words and extract inter-word correlation features between these words and others in the text. Based on these features, further optimization is performed to extract sentence-level decision features and high-dimensional sentence-level representations. Afterwards, a difficult sentence mining re-perception module is designed to reconstruct the high-dimensional sentence-level representations using sentence-level decision features and further integrate sentence-level decision features with optimized high-dimensional sentence-level representations to distinguish between texts containing hidden information and those that do not. The experimental results on the test set show that the proposed method has significantly improved detection accuracy.
While the proposed synonym substitution linguistic steganalysis method demonstrates improved detection accuracy, it is important to acknowledge certain limitations. The two-stage feature extraction process may lead to increased computational demands, potentially limiting its applicability in resource-constrained environments. Additionally, although the method effectively identifies steganographic replacement words, it may struggle with more complex or adaptive steganographic techniques that employ advanced obfuscation strategies. In future research, we will focus on exploring optimization strategies to enhance computational efficiency without sacrificing detection effectiveness in order to address these limitations.

Author Contributions

Conceptualization, J.W. and H.D.; methodology, J.W. and H.D.; validation, J.W.; formal analysis, H.D.; data curation, J.W.; writing—original draft preparation, J.W. and H.D.; writing—review and editing, J.W. and P.L.; supervision, P.L.; funding acquisition, P.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Hainan Province Science and Technology Special Fund under Grant ZDYF2025SHFZ058, and in part by the Youth Innovation Promotion Association, Chinese Academy of Sciences under Grant 2022022, and in part by the South China Sea Nova project of Hainan Province under Grant NHXXRCXM202340, and in part by the Haikou Key Science and Technology Project under Grant 2024020.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, Y.; Ni, T.; Xu, W.; Gu, T. SwipePass: Acoustic-based Second-factor User Authentication for Smartphones. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2022, 6, 1–25. [Google Scholar] [CrossRef]
  2. Winstein, K. Lexical Steganography Through Adaptive Modulation of the Word Choice Hash. 1998. Available online: http://web.mit.edu/keithw/tlex/ (accessed on 7 June 2022).
  3. Chang, C.-Y.; Clark, S. Practical linguistic steganography using contextual synonym substitution and a novel vertex coding method. Comput. Linguist. 2014, 40, 403–448. [Google Scholar] [CrossRef]
  4. Barmawi, A. Linguistic based steganography using lexical substitution and syntactical transformation. In Proceedings of the 2016 6th International Conference on IT Convergence and Security (ICITCS), Prague, Czech Republic, 26 September 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1–6. [Google Scholar]
  5. Xiang, L.; Wang, X.; Yang, C.; Liu, P. A novel linguistic steganography based on synonym run-length encoding. IEICE Trans. Inf. Syst. 2017, 100, 313–322. [Google Scholar] [CrossRef]
  6. Yang, X.; Li, F.; Xiang, L. Synonym substitution-based steganographic algorithm with matrix coding. Chin. Comput. Syst. 2015, 36, 1296–1300. [Google Scholar]
  7. Huanhuan, H.; Xin, Z.; Weiming, Z.; Nenghai, Y. Adaptive text steganography by exploring statistical and linguistical distortion. In Proceedings of the 2017 IEEE Second International Conference on Data Science in Cyberspace (DSC), Shenzhen, China, 26–29 June 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 145–150. [Google Scholar]
  8. Xiang, L.; Wu, W.; Li, X.; Yang, C. A linguistic steganography based on word indexing compression and candidate selection. Multimed. Tools Appl. 2018, 77, 28969–28989. [Google Scholar] [CrossRef]
  9. Xiang, L.; Yang, X.; Zhang, J.; Wang, W. A word-frequency-preserving steganographic method based on synonym substitution. Int. J. Comput. Sci. Eng. 2019, 19, 132–139. [Google Scholar] [CrossRef]
  10. Huo, L.; Xiao, Y.-c. Synonym substitution-based steganographic algorithm with vector distance of two-gram dependency collocations. In Proceedings of the 2016 2nd IEEE International Conference on Computer and Communications (ICCC), Chengdu, China, 14–17 October 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 2776–2780. [Google Scholar]
  11. Taskiran, C.M.; Topkara, U.; Topkara, M.; Delp, E.J. Attacks on lexical natural language steganography systems. In Security, Steganography, and Watermarking of Multimedia Contents VIII; SPIE: San Jose, CA, USA, 2006; Volume 6072, pp. 97–105. [Google Scholar]
  12. Yu, Z.; Huang, L.; Chen, Z.; Li, L.; Zhao, X.; Zhu, Y. Detection of synonym-substitution modified articles using context information. In Proceedings of the 2008 Second International Conference on Future Generation Communication and Networking, Hainan, China, 13–15 December 2008; IEEE: Piscataway, NJ, USA, 2008; Volume 1, pp. 134–139. [Google Scholar]
  13. Chen, Z.; Huang, L.; Miao, H.; Yang, W.; Meng, P. Steganalysis against substitution-based linguistic steganography based on context clusters. Comput. Electr. Eng. 2011, 37, 1071–1081. [Google Scholar] [CrossRef]
  14. Xiang, L.; Sun, X.; Luo, G.; Xia, B. Linguistic steganalysis using the features derived from synonym frequency. Multimed. Tools Appl. 2014, 71, 1893–1911. [Google Scholar] [CrossRef]
  15. Chen, Z.; Huang, L.; Yang, W. Detection of substitution-based linguistic steganography by relative frequency analysis. Digit. Investig. 2011, 8, 68–77. [Google Scholar] [CrossRef]
  16. Xiang, L.; Yu, J.; Yang, C.; Zeng, D.; Shen, X. A word-embedding-based steganalysis method for linguistic steganography via synonym substitution. IEEE Access 2018, 6, 64131–64141. [Google Scholar] [CrossRef]
  17. Wen, J.; Zhou, X.; Zhong, P.; Xue, Y. Convolutional neural network based text steganalysis. IEEE Signal Process. Lett. 2019, 26, 460–464. [Google Scholar] [CrossRef]
  18. Xiang, L.; Guo, G.; Yu, J.; Sheng, V.S.; Yang, P. A convolutional neural network-based linguistic steganalysis for synonym substitution steganography. Math. Biosci. Eng. 2020, 17, 1041–1058. [Google Scholar] [CrossRef] [PubMed]
  19. Yang, Z.; Zhang, P.; Jiang, M.; Huang, Y.; Zhang, Y.-J. Rits: Real-time interactive text steganography based on automatic dialogue model. In International Conference on Cloud Computing and Security; Springer: Berlin/Heidelberg, Germany, 2018; pp. 253–264. [Google Scholar]
  20. Zhang, S.; Yang, Z.; Yang, J.; Huang, Y. Linguistic steganography: From symbolic space to semantic space. IEEE Signal Process. Lett. 2020, 28, 11–15. [Google Scholar] [CrossRef]
  21. Yang, Z.; Wei, N.; Liu, Q.; Huang, Y.; Zhang, Y. Gan-tstega: Text steganography based on generative adversarial networks. In Digital Forensics and Watermarking, Proceedings of the 18th International Workshop, IWDW 2019, Chengdu, China, November 2–4, 2019; Springer: Berlin/Heidelberg, Germany, 2020; Revised Selected Papers 18; pp. 18–31. [Google Scholar]
  22. Xiang, L.; Yang, S.; Liu, Y.; Li, Q.; Zhu, C. Novel linguistic steganography based on character-level text generation. Mathematics 2020, 8, 1558. [Google Scholar] [CrossRef]
  23. Yang, Z.-L.; Guo, X.-Q.; Chen, Z.-M.; Huang, Y.-F.; Zhang, Y.-J. Rnn-stega: Linguistic steganography based on recurrent neural networks. IEEE Trans. Inf. Forensics Secur. 2018, 14, 1280–1295. [Google Scholar] [CrossRef]
  24. Yang, Z.-L.; Zhang, S.-Y.; Hu, Y.-T.; Hu, Z.-W.; Huang, Y.-F. Vae-stega: Linguistic steganography based on variational auto-encoder. IEEE Trans. Inf. Forensics Secur. 2020, 16, 880–895. [Google Scholar] [CrossRef]
  25. Zhou, X.; Peng, W.; Yang, B.; Wen, J.; Xue, Y.; Zhong, P. Linguistic Steganography Based on Adaptive Probability Distribution. IEEE Trans. Dependable Secur. Comput. 2022, 19, 2982–2997. [Google Scholar] [CrossRef]
  26. Huang, D.; Yan, H. Interword distance changes represented by sine waves for watermarking text images. IEEE Trans. Circuits Syst. Video Technol. 2001, 11, 1237–1245. [Google Scholar] [CrossRef]
  27. Chen, C.; Wang, S.; Zhang, X. Information hiding in text using typesetting tools with stego-encoding. In Proceedings of the First International Conference on Innovative Computing, Information and Control-Volume I (ICICIC’06), Beijing, China, 30 August–1 September 2006; IEEE: Piscataway, NJ, USA, 2006; Volume 1, pp. 459–462. [Google Scholar]
  28. Azzawi, A.F.A. A multi-layer arabic text steganographic method based on letter shaping. Int. J. Netw. Secur. Its Appl. (IJNSA) 2019, 11, 27–40. [Google Scholar]
  29. Liang, O.W.; Iranmanesh, V. Information hiding using whitespace technique in microsoft word. In Proceedings of the 2016 22nd International Conference on Virtual System & Multimedia (VSMM), Kuala Lumpur, Malaysia, 17–21 October 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1–5. [Google Scholar]
  30. Shah, S.A.; Khan, A.; Hussain, A. Text steganography using character spacing after normalization. Int. J. Sci. Eng. Res 2020, 11, 949–957. [Google Scholar]
  31. Taha, A.; Hammad, A.S.; Selim, M.M. A high capacity algorithm for information hiding in arabic text. J. King Saud-Univ.-Comput. Inf. Sci. 2020, 32, 658–665. [Google Scholar] [CrossRef]
  32. Mustafa, N.A.A. Text hiding in text using invisible character. Int. J. Electr. Comput. Eng. 2020, 10, 3550. [Google Scholar] [CrossRef]
  33. Rizzo, S.G.; Bertini, F.; Montesi, D.; Stomeo, C. Text watermarking in social media. In Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017, Sydney, Australia, 31 July–3 August 2017; pp. 208–211. [Google Scholar]
  34. Alanazi, N.; Khan, E.; Gutub, A. Inclusion of unicode standard seamless characters to expand arabic text steganography for secure individual uses. J. King Saud-Univ.-Comput. Inf. Sci. 2022, 34, 1343–1356. [Google Scholar] [CrossRef]
  35. Al-Nofaie, S.; Gutub, A.; Al-Ghamdi, M. Enhancing arabic text steganography for personal usage utilizing pseudo-spaces. J. King Saud-Univ.-Comput. Inf. Sci. 2021, 33, 963–974. [Google Scholar] [CrossRef]
  36. Ditta, A.; Yongquan, C.; Azeem, M.; Rana, K.G.; Yu, H.; Memon, M.Q. Information hiding: Arabic text steganography by using unicode characters to hide secret data. Int. J. Electron. Secur. Digit. Forensics 2018, 10, 61–78. [Google Scholar] [CrossRef]
  37. Go, A.; Bhayani, R.; Huang, L. Twitter sentiment classification using distant supervision. CS224N Proj. Rep. Stanf. 2009, 1, 2009. [Google Scholar]
  38. Maas, A.; Daly, R.E.; Pham, P.T.; Huang, D.; Ng, A.Y.; Potts, C. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 21 June 2011; pp. 142–150. [Google Scholar]
  39. Thompson, A. Available online: https://www.kaggle.com/datasets/snapcrack/all-the-news (accessed on 1 May 2024).
  40. Xu, Q.; Zhang, R.; Liu, J. Linguistic Steganalysis by Enhancing and Integrating Local and Global Features. IEEE Signal Process. Lett. 2023, 30, 16–20. [Google Scholar] [CrossRef]
  41. Wen, J.; Deng, Y.; Peng, W.; Xue, Y. Linguistic Steganalysis via Fusing Multi-Granularity Attentional Text Features. Chin. J. Electron. 2023, 32, 76–84. [Google Scholar] [CrossRef]
Figure 1. Coefficients of word pair reduction, addition, and variation under different embedding rates.
Figure 2. The overall structure of the network.
Figure 3. The structure of heterogeneous feature extraction module.
Figure 4. The structure of the difficult sentence mining re-perception module.
Table 1. The statistical information of machine-generated text and natural text classification training corpora.
|  | Movie | Twitter | News |
|---|---|---|---|
| Sentence count | 1,283,813 | 2,639,290 | 1,962,040 |
| Word count | 25,601,794 | 25,551,044 | 43,626,829 |
| Average length | 19.94 | 9.68 | 22.24 |
| Unique word count | 48,342 | 46,341 | 42,745 |
Table 2. Hyperparameter settings.
| Hyperparameter | Value |
|---|---|
| Embedding dimension | 300 |
| Convolution kernel number in HFE | 200 |
| Convolution kernel size in HFE | (3, 5, 7) |
| Convolution kernel number in DSMP | 4 |
| Convolution kernel size in DSMP | 5 |
| Hidden dimension (Bi-LSTM) in DSMP | 2 |
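As a rough illustration of what the HFE settings in Table 2 imply for model size, the sketch below counts the weights of the three parallel convolution branches, assuming standard 1-D convolutions over the 300-dimensional embeddings with one bias per output channel. This layer layout is an assumption for illustration; the paper's exact architecture may differ.

```python
# Table 2 values (the layer layout below is an assumed standard Conv1d).
EMB_DIM = 300          # embedding dimension
HFE_KERNELS = 200      # convolution kernel number in HFE, per kernel size
HFE_SIZES = (3, 5, 7)  # convolution kernel sizes in HFE

def conv1d_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Weights (kernel_size * in_channels * out_channels) plus one bias per output channel."""
    return kernel_size * in_channels * out_channels + out_channels

# Total parameters across the three parallel HFE convolution branches.
hfe_params = sum(conv1d_params(EMB_DIM, HFE_KERNELS, k) for k in HFE_SIZES)
```

Under these assumptions the three branches contribute roughly 0.9 M parameters, a small fraction of the 45.69 M total reported in Table 4, which is consistent with most of the capacity sitting in the embedding and downstream layers.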
Table 3. Comparison of accuracy for hybrid steganography algorithm steganalysis tasks.
|  | EILG | SDC | LS-CNN | TCNNS | Ours |
|---|---|---|---|---|---|
| Movie | 85.88% | 88.74% | 91.21% | 92.49% | 95.09% |
| News | 86.47% | 88.92% | 93.59% | 95.17% | 97.27% |
| Twitter | 91.75% | 94.24% | 94.12% | 96.04% | 97.44% |
| Avg | 88.03% | 90.63% | 92.97% | 94.57% | 96.60% |
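The Avg row of Table 3 is the arithmetic mean of the three per-corpus accuracies; a quick sanity check (not part of the paper's code) reproduces it from the table values:

```python
# Detection accuracies (%) from Table 3: [Movie, News, Twitter] per method.
table3 = {
    "EILG":   [85.88, 86.47, 91.75],
    "SDC":    [88.74, 88.92, 94.24],
    "LS-CNN": [91.21, 93.59, 94.12],
    "TCNNS":  [92.49, 95.17, 96.04],
    "Ours":   [95.09, 97.27, 97.44],
}

def avg(scores: list) -> float:
    """Arithmetic mean rounded to two decimals, matching the Avg row."""
    return round(sum(scores) / len(scores), 2)

averages = {name: avg(scores) for name, scores in table3.items()}
```

Running this recovers the reported Avg row exactly (e.g., 96.60% for the proposed method versus 94.57% for TCNNS).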
Table 4. Params and inference times of different models.
|  | EILG | SDC | LS-CNN | TCNNS | Ours |
|---|---|---|---|---|---|
| Params | 603.14 K | 3.39 M | 43.77 M | 42.04 M | 45.69 M |
| Inference time | 1.73 ms | 4.21 ms | 1.06 ms | 60.42 ms | 188.80 ms |
Table 5. Comparison of accuracy in steganalysis tasks on hybrid corpora.
| Method | Tlex-25% | Tlex-50% | Tlex-75% | Tlex-100% | MC (3 bit) | MC (4 bit) | WFP | Avg |
|---|---|---|---|---|---|---|---|---|
| EILG | 87.24 | 88.48 | 91.14 | 93.03 | 90.39 | 94.61 | 89.05 | 90.56 |
| SDC | 88.61 | 90.46 | 90.49 | 93.86 | 91.49 | 95.15 | 90.29 | 91.48 |
| LS-CNN | 93.46 | 96.94 | 97.48 | 99.83 | 97.54 | 90.18 | 99.22 | 96.87 |
| TCNNS | 94.85 | 97.67 | 98.93 | 99.78 | 98.12 | 91.85 | 99.24 | 97.60 |
| Ours | 96.51 | 98.64 | 99.34 | 99.94 | 99.23 | 94.25 | 99.60 | 98.50 |
Table 6. Comparison of accuracy and F1-score in steganalysis tasks on hybrid steganography algorithms and corpora.
|  | EILG | SDC | LS-CNN | TCNNS | Ours |
|---|---|---|---|---|---|
| Acc | 88.65 | 89.28 | 93.49 | 95.71 | 97.00 |
| F1-Score | 67.59 | 69.25 | 93.37 | 95.73 | 97.10 |
Table 7. Comparison of accuracy in steganalysis tasks across different steganography algorithms.
| Method | T→M | T→W | M→T | M→W | W→T | W→M | Avg |
|---|---|---|---|---|---|---|---|
| LS-CNN | 78.29 | 97.74 | 90.85 | 97.73 | 85.75 | 91.26 | 90.27 |
| TCNNS | 79.61 | 98.17 | 93.24 | 96.66 | 89.46 | 91.90 | 91.51 |
| Ours | 81.72 | 99.74 | 94.63 | 98.77 | 90.37 | 95.40 | 93.44 |
Table 8. Comparison of accuracy in steganalysis tasks across different corpora.
| Method | M→N | M→T | N→T | N→M | T→M | T→N | Avg |
|---|---|---|---|---|---|---|---|
| LS-CNN | 85.96 | 79.71 | 75.87 | 78.65 | 74.19 | 66.35 | 76.79 |
| TCNNS | 84.56 | 83.61 | 77.27 | 80.77 | 74.49 | 74.47 | 79.20 |
| Ours | 89.53 | 91.08 | 87.42 | 87.22 | 79.85 | 79.20 | 85.72 |
Table 9. Comparison of accuracy in steganalysis tasks across different corpora and algorithms.
|  | LS-CNN | TCNNS | Ours |
|---|---|---|---|
| TM→MN | 73.31 | 75.37 | 77.52 |
| TN→WT | 73.63 | 74.61 | 83.47 |
| MM→TT | 76.13 | 80.19 | 87.39 |
| MN→WM | 77.82 | 78.93 | 86.45 |
| WM→TN | 81.88 | 80.74 | 84.66 |
| WT→MM | 72.61 | 72.66 | 76.56 |
| Avg | 75.90 | 77.08 | 82.68 |
Table 10. Comparison of accuracy in steganalysis tasks for native data.
|  | LS-CNN | TCNNS | Ours |
|---|---|---|---|
| Movie | 94.32 | 96.26 | 97.71 |
| News | 93.12 | 93.98 | 95.16 |
| Twitter | 81.61 | 82.48 | 84.47 |
| Avg | 89.68 | 90.91 | 92.45 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
