1. Introduction
With the rapid development and wide application of the Internet, the threat of phishing software has become increasingly serious. Phishing software, essentially a kind of Trojan horse or backdoor, refers to software written for malicious purposes. Applications classified as phishing software [1,2,3] can carry out a variety of malicious activities, such as stealing personal privacy and destabilizing the system, launching alongside normal and legitimate programs without the user’s knowledge. Even when relying entirely on widely used open-source software [4,5] or running programs on closed mobile devices [6,7,8], it can still be difficult to avoid malware hidden in the compiled program. In the past, phishing software was often spread through emails. In recent years, with the continuous advancement and evolution of network attack technologies, phishing software has become increasingly sophisticated and stealthy. This evolution presents significant challenges to maintaining network security.
Over the past few decades, researchers have developed numerous methods and technologies for detecting phishing software. The detection of malicious content in images is a highly scrutinized area of research. Attackers embed harmful data, such as malicious scripts or code, within image files, which are then disseminated through image distribution channels to carry out malevolent actions [9]. Distinct from traditional malware, malicious payloads concealed within superfluous image data exhibit advanced levels of stealth and deceit, which significantly complicates their detection through established malware identification methodologies. For example, the code injected into images is usually well-known malicious shellcode in the form of a binary string. Exploiting this kind of hidden payload in practice usually requires an application with a known or unknown vulnerability (such as a 0-day vulnerability) to open such images. By constructing ROP (return-oriented programming) chains, the malicious information hidden in the image is extracted and executed as binary code in the memory of the application, even though, from the application’s perspective, it is merely the binary data of the cover image. Considering that malicious code constitutes the majority of such concealed content, this paper focuses on methodologies for identifying and analyzing malicious code embedded within redundant image data.
Currently, research on the detection of malicious code in images focuses on two aspects. On one hand, researchers are committed to developing new detection algorithms and techniques to improve the accuracy and efficiency of detecting malicious code in images. For example, several researchers have developed detection methods based on deep learning; by employing deep networks for feature extraction and classification of images, these methods have achieved notable detection performance [10,11,12]. On the other hand, researchers have also paid attention to the propagation routes and behavioral characteristics of malicious code in images to better understand and respond to the threat. For example, some researchers have proposed effective defense strategies by analyzing the propagation pathways and behavioral characteristics of malicious code in images [13,14].
However, current detection methods for malicious code in images still face several problems and challenges. Firstly, due to the covert and deceptive nature of images carrying malicious code, traditional malicious code detection methods often have difficulty detecting and recognizing it [15]. Secondly, the propagation routes and behavioral characteristics of malicious code in images keep changing and evolving, which brings further challenges to detection [16]. Therefore, effectively detecting malicious code in images has become an important issue in current research.
To address the above problems and challenges, this paper proposes a phishing software detection method based on R-tree and the analysis of the Edge of Stability (EOS) phenomenon. The method achieves fast and accurate detection of malicious code in images by constructing an R-tree index and extracting image steganography features. Specifically, we first segment the image by using the R-tree index to improve the detection efficiency. Subsequently, we extract the steganography features of the image and employ machine learning algorithms for classification and determination. Experimental results show that our method has obvious advantages in both detection accuracy and efficiency.
To better illustrate this work, the two main contributions of this paper are listed as follows.
This paper investigates various methodologies for concealing malicious information. Following a detailed exposition of multiple techniques for embedding malicious content, it evaluates which methods are most likely to be utilized for such concealment.
This paper introduces a phishing software detection approach based on R-tree and the analysis of the EOS phenomenon. It delves into the exploration of neural network layers, network width, activation functions, and loss functions. Furthermore, it examines the relationship between the EOS phenomenon and learning rate to minimize loss.
The rest of this paper is structured as follows.
Section 2 reviews related work and covers foundational knowledge, presenting the latest research related to image steganography, detection of malicious information in images, and the EOS phenomenon. Subsequently, the concepts related to R-trees, neural networks, and the EOS phenomenon are analyzed.
Section 3 discusses methods of embedding malicious information, detailing techniques for malicious steganography.
Section 4 proposes a phishing software detection method based on R-trees and the analysis of the EOS phenomenon. We first introduce the structure of the R-tree and then propose a phishing software detection method based on the R-tree.
Section 5 presents the experimental setup, including parameter tuning and a discussion on issues related to the EOS phenomenon.
Section 6 concludes the paper with an overall review and conclusions. Possible future works are also proposed in this section.
2. Background
In this section, we delve into the realm of image steganography, initially presenting an overview of the pertinent techniques and concepts related to RGB (Red, Green, Blue) color models and LSB (Least Significant Bit) methods. Subsequently, we explore cutting-edge technologies in the detection of malicious information embedded within digital media and discuss the frontier research issues in deep learning, particularly focusing on the stability characteristics of high learning rates at the boundary of algorithmic performance. Furthermore, we will provide some preliminary knowledge. We will first introduce the R-tree [17,18], which is widely used in explainable algorithms and high-dimensional problems. Subsequently, we will describe the general neural networks and the Edge of Stability (EOS) phenomenon.
2.1. Image Steganography
Steganography [19,20], a term derived from the Greek words “Stegos”, meaning “cover”, and “Grafia”, denoting “writing”, is defined as “covered writing”. It embodies the art and science of concealing secret information within digital communication objects, aiming to obscure the existence of the message itself [21]. Cryptography [22] is dedicated to the protection of message content via encryption, transforming it into a format that is inaccessible to unauthorized individuals. In contrast, steganography primarily aims to conceal the very existence of the message, thereby maintaining its confidentiality in a subtle manner that evades detection [1].
The process of embedding data within digital images by adjusting pixel values in the spatial domain stands as a rudimentary yet highly effective method [23]. Among these techniques, Least Significant Bit (LSB) modification is notably simple, positioning it as a prevalent method in spatial domain image steganography [23]. The least significant bits (LSBs) of an image encode minimal information, rendering any minor alterations to these bits undetectable by the human visual system. Spatial domain techniques based on LSB modification embed secret bits directly into the cover image by altering its least significant bits, thereby preserving the visual quality of the cover image. In 2016, Dadgostar and Afsari [24] pioneered a novel approach to image steganography, melding interval-valued intuitionistic fuzzy edge detection with refined LSB techniques, marking a notable advancement in the field. This method employs edge detection to identify the image’s edge information and embeds steganographic data into these regions using a modified LSB approach, optimizing the balance between invisibility and payload capacity.
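The plain LSB embedding described above can be sketched in a few lines. This is a minimal illustration on toy 8-bit grayscale pixel values, not the edge-detection-guided variant of [24]; the function names are our own:

```python
def embed_lsb(pixels, bits):
    """Write each secret bit into the least significant bit of a pixel value."""
    out = list(pixels)
    for i, b in enumerate(bits):
        out[i] = (out[i] & ~1) | b   # clear the LSB, then set it to the secret bit
    return out

def extract_lsb(pixels, n):
    """Recover the first n embedded bits."""
    return [p & 1 for p in pixels[:n]]

cover = [52, 55, 61, 66, 70, 61, 64, 73]   # toy 8-bit grayscale pixel values
secret = [1, 0, 1, 1]
stego = embed_lsb(cover, secret)
assert extract_lsb(stego, len(secret)) == secret
assert all(abs(a - b) <= 1 for a, b in zip(cover, stego))   # at most 1 level of distortion
```

Because each pixel value changes by at most one intensity level, the stego image is visually indistinguishable from the cover image, which is precisely why LSB methods are so prevalent.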
2.2. Malicious Information Detection Technology
The concept of malicious information detection has garnered widespread attention in recent research, owing to its critical impact on the efficiency and reliability of cybersecurity. This process involves the identification and filtration of harmful content within the digital ecosystem, including phishing emails, malicious domains, URLs, and image files embedded with malicious code. Achieving proficiency in this detection is vital for enhancing the effectiveness and efficiency of cybersecurity defenses. During the detection process, the challenge lies in striking an optimal balance between swiftly identifying malicious content and minimizing false positives, which is essential for improving overall security measures. Maintaining this equilibrium represents a significant challenge in the field of cybersecurity, which necessitates precise adjustments in detection algorithms, selection of training data, and other related areas.
In 2010, Zhang et al. [25] delved into anomaly behavior detection and countermeasures within wireless sensor networks. Fast forward to 2022, Aljabri et al. [26] conducted a comprehensive analysis of machine learning applications in the identification of malicious links within networks. Their study extensively explored diverse machine learning algorithms and feature extraction strategies, evaluating their utility and limitations in real-world applications and outlining potential directions for subsequent research endeavors. Furthermore, Chang et al. [27] demonstrated the application of natural language processing (NLP) technologies in developing a financial fraud detection model, specifically focusing on the design of an anti-fraud chatbot. This work highlights the evolving use of NLP in enhancing security measures against sophisticated financial fraud schemes.
In 2023, Atawneh and Aljehani [28] demonstrated the efficacy of deep learning techniques in the identification of phishing emails, underscoring their capability to parse complex email content effectively. They developed and assessed a deep learning-based model tailored for phishing email detection, capitalizing on the advanced feature extraction capabilities of deep learning to analyze and identify malicious behaviors and fraud indicators within emails. By training the model to recognize unique linguistic patterns, formats, and other key characteristics of phishing emails, their research illustrated that deep learning models can proficiently distinguish between phishing and legitimate emails. Differing from convolutional neural network (CNN) approaches, some scholars [29] used graph neural networks to train their deep learning models. Concurrently, Thein et al. [30] employed a decision tree model to detect malicious domains, highlighting its rapid and accurate classification prowess. Their research highlighted the effectiveness and utility of decision tree models for tackling cybersecurity issues, especially in detecting and neutralizing malicious domains. Additionally, it provided a structured methodology for the assessment and identification of potential cyber threats.
Lin et al. [31] proposed an interpretable architecture for emerging network systems, which enhances the interpretability of malicious traffic detection from both input and output perspectives, aiming to understand network traffic data and improve the reliability of results. Zhao et al. [32] explored attack strategies to poison models to evade detection by maintaining label consistency and proposed defense mechanisms to mitigate such threats, which helps to understand backdoor risks in natural language processing (NLP) systems and provides insights for robust defense methods to enhance model security. Li et al. [33] proposed a malicious code detection strategy combining CNNs and Transformers, which improved accuracy and surpassed the state-of-the-art malicious code detection techniques.
Liu et al. [34] proposed a pre-trained language model-guided multi-level feature attention network, named PMANet, to advance the detection of malicious URLs. PMANet outperforms previous state-of-the-art pre-trained models and traditional deep learning models in challenging real-world scenarios such as small-scale data, class imbalance, cross-datasets, adversarial attacks, and active malicious URL case studies. Zhao et al. [35] explored the threat of clean-label backdoor attacks in language models and systematically analyzed the attack methods, their effectiveness, and potential defense strategies to enhance the robustness of the model. Fang et al. [36] proposed a deep learning model combining BERT and BiLSTM, which showed excellent performance in processing unbalanced data and capturing the deep semantic features of malicious comments, providing an effective technical means for social media content review and network environment purification.
2.3. EOS
The concept of EOS in neural networks has recently garnered significant attention due to its direct impact on the performance and reliability of deep learning models. The EOS phenomenon within neural networks signifies the equilibrium achieved in updating model parameters throughout the training phase, circumventing instability induced by disproportionately large step sizes, as well as averting diminished learning efficacy resulting from excessively small step sizes. Achieving this equilibrium is crucial for enhancing the training efficiency and generalization capability of models. In a state of EOS, models are able to find an optimal balance between rapid learning and avoiding excessive adjustments, thereby contributing to improved training outcomes. Maintaining this stability presents a key challenge in deep learning, involving meticulous adjustments of training parameters such as learning rates and weight initialization.
In 2022, Chen and Bruna [37] delved into the convergence properties of gradient descent beyond the traditional boundaries of stability. This study challenged prevailing perspectives on the behavior of gradient descent, introducing a novel viewpoint that neural networks can achieve effective convergence and learning even beyond conventional stability limits. In the same year, Zhu and colleagues elucidated the dynamics of EOS during the training process through a simplified example [38]. This research clarified the complexities of training dynamics, offering a clearer understanding of neural network behavior at the brink of stability and the factors that could either advance or retreat from this threshold. Additionally, Wang et al. examined the variations in sharpness along the gradient descent trajectory, especially noting the phenomenon of “progressive sharpening” observed near the EOS [39]. Their work established a mathematical framework to comprehend how sharpness evolves during training and its implications for model performance and stability. In 2025, Qiu et al. [40] also briefly discussed the EOS phenomenon in their convex optimization work. This investigation delved into the foundational mechanisms underlying neural network stability, underscoring the nuanced equilibrium between model complexity and generalization capacity. This research offered insights into leveraging EOS to enhance the robustness of deep learning models.
2.4. R-Tree
The R-tree is a spatial indexing structure designed for the efficient storage and querying of spatial data within databases. It mirrors binary trees with a balanced structure, comprising nodes that each depict a rectangle containing one or more objects. Unlike binary trees, R-trees are adept at addressing search problems in higher dimensions [41,42,43]. They optimize the search operations for spatial data by dividing space into smaller regions and organizing objects based on spatial proximity. This method enables efficient retrieval of objects overlapping or proximate to a designated query area.
To construct an R-tree from coordinates and offsets, it is essential to understand the concepts of Minimum Boundary Rectangles (MBRs) and Minimum Boundary Boxes (MBBs) [44]. An MBB encompasses a set of identical objects or points within a rectangle, defined by the maximum and minimum coordinates of its internal objects or points across each dimension, which can extend beyond two in higher-dimensional spaces. An MBB can serve as the internal object of another MBB as long as the larger MBB fully encompasses the smaller one, making it a common construct for delineating enclosed spaces in 3D environments. Within 2D spaces, the MBR serves a geometric function akin to the MBB, finding application across computer science and spatial data analysis domains. It represents a rectangular shape that encloses a set of points or objects within 2D space, defined by the extremal coordinates of its internal elements along the x and y axes. In specific contexts, the transformation between MBBs and MBRs is essential [45,46], underscoring their critical contribution to spatial indexing and analysis.
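The MBR computations just described reduce to taking per-axis extrema, both for raw points and for merging child rectangles into a parent. A minimal 2D sketch (the function names `mbr` and `merge_mbrs` are our own):

```python
def mbr(points):
    """Minimum bounding rectangle (x_min, y_min, x_max, y_max) of 2-D points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

def merge_mbrs(rects):
    """MBR enclosing a set of child MBRs, as used when building parent nodes."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

assert mbr([(1, 2), (4, 0), (3, 5)]) == (1, 0, 4, 5)
assert merge_mbrs([(0, 0, 2, 2), (1, 1, 5, 3)]) == (0, 0, 5, 3)
```

The same min/max pattern extends to three or more dimensions, which is exactly the relationship between 2D MBRs and higher-dimensional MBBs.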
In this discussion, the term “distance” refers to the Euclidean distance in space. The organization of fundamental Minimum Boundary Rectangles (MBRs) ought to follow established principles, encompassing all coordinates [47]. The construction of an R-tree begins with these fundamental MBRs, which cover individual objects or coordinates. These are then grouped into nodes based on a specified maximum capacity. Throughout this grouping phase, new parent nodes are established to house the emerging child nodes. Each parent node is assigned a unique ID and an MBR that collectively encompasses the MBRs of its offspring. This grouping and parent node creation continues until all MBRs are allocated to a node, culminating in the root of the R-tree. The leaf nodes, which represent the zeroth level of the R-tree, contain only the basic MBRs. To ensure a more balanced structure, adjustments are made to the last node of each level if its entry count falls below the minimum capacity. Building or adjusting R-tree nodes from bottom to top necessitates indexing for efficient spatial queries. The MBR calculations shown in Algorithm 1 can be finalized after the R-tree construction to minimize the need for recalculations during node adjustments or balancing, thereby reducing the overall computational load.
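The bottom-up grouping described above can be sketched as follows. This simplified bulk-loading sketch fills nodes to a fixed capacity and omits the minimum-capacity rebalancing of the last node in each level; the names `enclosing` and `bulk_load` are our own:

```python
def enclosing(rects):
    """MBR enclosing a list of (x_min, y_min, x_max, y_max) rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

def bulk_load(mbrs, max_children=20):
    """Group MBRs bottom-up into parent nodes until a single root remains."""
    level = [{"mbr": r, "children": []} for r in mbrs]
    while len(level) > 1:
        level = [
            {"mbr": enclosing([n["mbr"] for n in level[i:i + max_children]]),
             "children": level[i:i + max_children]}
            for i in range(0, len(level), max_children)
        ]
    return level[0]

# 45 unit squares along the diagonal, grouped 4 at a time: 45 -> 12 -> 3 -> 1 (root)
root = bulk_load([(i, i, i + 1, i + 1) for i in range(45)], max_children=4)
assert root["mbr"] == (0, 0, 45, 45)
```

Each pass over `level` creates one new tree level, so the construction cost is linear in the number of MBRs, and every parent MBR is computed exactly once.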
Algorithm 1 details the MBR calculation process, requiring inputs such as the root of the R-tree and all coordinates. The R-tree can be dumped to memory or indexed by ID for later retrieval and reconstruction using the same data structure. Moreover,
Figure 1 illustrates the R-tree architecture, detailing the interconnections between leaf nodes, non-leaf nodes, foundational MBRs, and points (coordinates) to clarify the constructed R-tree.
| Algorithm 1 The MBR computing process [17,18] |

 1: function computeFundamentalMBR(coordinates)
 2:     x_min ← min(p.x for p in coordinates)
 3:     x_max ← max(p.x for p in coordinates)
 4:     y_min ← min(p.y for p in coordinates)
 5:     y_max ← max(p.y for p in coordinates)
 6:     return (x_min, x_max, y_min, y_max)
 7: end function
 8: function computeNodeMBR(nodes)
 9:     x_min ← min(n.MBR.x_min for n in nodes)
10:     x_max ← max(n.MBR.x_max for n in nodes)
11:     y_min ← min(n.MBR.y_min for n in nodes)
12:     y_max ← max(n.MBR.y_max for n in nodes)
13:     return (x_min, x_max, y_min, y_max)
14: end function
15: function computeRTreeMBR(node)
16:     if node is a fundamental MBR then
17:         node.MBR ← computeFundamentalMBR(node.coordinates)
18:     else
19:         for entry in node.entries do
20:             if entry.MBR is not available then
21:                 computeRTreeMBR(entry)
22:             end if
23:         end for
24:         node.MBR ← computeNodeMBR(node.entries)
25:     end if
26: end function
27: computeRTreeMBR(root)

Note: x_min, x_max, y_min, and y_max represent the min value of the x-axis, the max value of the x-axis, the min value of the y-axis, and the max value of the y-axis in the designated MBR, respectively.
2.5. Neural Network
Neural network layers, neural network width, activation functions, and loss functions constitute critical elements within neural networks [48]. The selection and optimization of these components are crucial for the network’s performance and efficacy. During the design and training phases of a neural network, it is essential to make judicious choices based on the specific requirements of the task and data characteristics, adjusting various parameters to achieve optimal learning and predictive outcomes.
2.5.1. Neural Network Layer
Neural network architecture comprises several layers, including input, hidden, and output layers. The input layer serves as the initial receptor of raw data, hidden layers undertake the tasks of feature extraction and learning representations, and the output layer finalizes and presents the prediction. The number of hidden layers and the number of neurons within each can be tailored to the complexity of the task and dataset, facilitating adaptation to diverse learning requirements.
2.5.2. Neural Network Width
Neural network width refers to the number of neurons within each layer. Expanding the network’s width enhances the model’s capacity to fit complex patterns inherent in the data. However, an overly broad network might cause overfitting, marked by high training set performance but inadequate generalization to new data. While a broader network increases representational and learning capabilities, it also escalates computational and storage demands. Therefore, it is crucial to balance the model’s performance against computational resource constraints to strike an optimal balance when deciding on network width. In the case of simpler datasets, adopting a narrower width not only conserves resources but also lowers the likelihood of overfitting. Facing complex data, appropriately broadening the network’s width, and integrating methods like regularization can circumvent overfitting.
2.5.3. Activation Function
Normalization layers, such as batch normalization or layer normalization, play a pivotal role in stabilizing and accelerating the training process. Theoretical analyses of these layers have shown their capability to lower the network’s Lipschitz constant, enhancing convergence performance [49]. By mitigating the effects of gradient vanishing or explosion, normalization layers enhance generalization and enable the use of higher learning rates. Activation functions enable the intricacies of data patterns to be learned, which is essential to artificial neural networks. Drawing parallels to the neuron-based models of the human brain, activation functions determine the signal transmitted to the subsequent neuron. In artificial neural networks, the activation function at a node defines its output for given inputs [50]. Standard computer chip circuits can be viewed as digital activation functions that produce on (1) or off (0) outputs based on the input. Thus, activation functions embody mathematical equations that dictate the output of neural networks, introducing non-linear transformations that empower the networks to learn and model more intricate patterns and relationships. The choice of activation function typically does not directly affect the number of layers or the width of the neural network. Prominent activation functions include the Sigmoid, Rectified Linear Unit (ReLU), and Tanh functions [50,51,52], with their suitability varying according to task specifics and network requirements. For instance, the Sigmoid function is suited for binary classification, whereas the ReLU function is favored for its ability to accelerate learning and foster sparse representations.
Table 1 presents common activation functions and their formulas.
In neural network architectures, activation functions such as ReLU and tanh play a pivotal role in modulating the activation state of individual neurons: ReLU nullifies negative inputs, while tanh maps inputs to the [−1, 1] interval. These functions are instrumental in enhancing the network’s capacity to capture nonlinear relationships, thereby augmenting its performance and expressiveness. However, the efficacy of normalization layers is contingent upon the specific architecture and dataset in use. The choice of activation function can indirectly impact the network’s training dynamics and convergence rate due to the distinct derivative properties inherent to each function. For instance, since ReLU has a derivative of 1 for positive inputs and 0 for negative inputs, neurons that settle into the negative regime receive no gradient and may stop updating, the so-called “dying ReLU” problem. This situation underscores the critical need for careful selection of activation functions, which should be tailored to complement the network’s architectural dimensions and depth. Such a targeted approach is essential for optimizing both performance and the efficiency of convergence.
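For reference, the three activation functions discussed above can be implemented directly from their formulas (a plain-Python sketch; deep learning frameworks provide vectorized equivalents):

```python
import math

def sigmoid(x):
    """Maps any real input to (0, 1); suited to binary classification outputs."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Zero for negative inputs, identity for positive inputs."""
    return max(0.0, x)

def tanh(x):
    """Maps any real input to (-1, 1)."""
    return math.tanh(x)

assert relu(-2.0) == 0.0 and relu(3.0) == 3.0
assert abs(sigmoid(0.0) - 0.5) < 1e-12          # sigmoid is centered at 0.5
assert -1.0 < tanh(-5.0) < 0.0 < tanh(5.0) < 1.0
```

The assertions illustrate the derivative behavior discussed above: ReLU is exactly zero (flat) on the negative half-line, while sigmoid and tanh saturate smoothly at their bounds.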
2.5.4. Loss Function
Loss functions quantify the discrepancy between predicted outcomes and true labels, serving as the objective function for neural network optimization. By minimizing the loss function, neural networks enhance the accuracy of their predictions. Predominantly utilized during the training phase, the loss function is employed after each batch of training data is processed by the model. This procedure initiates with the generation of predictions through forward propagation, followed by an assessment of the deviation between predicted outputs and actual values, culminating in the calculation of loss. Upon determining the loss, the model engages in backpropagation to adjust its parameters, aiming to reduce the discrepancy between predicted and actual values and thereby enhance the model’s learning efficacy. Common loss functions include Mean Squared Error (MSE) and Cross-Entropy Loss (CE), among others [53,54].
Table 2 presents common loss functions and their respective expressions. The selection of an appropriate loss function is contingent upon the nature of the task and the characteristics of the data. For instance, MSE is typically applied to regression problems while Cross-Entropy Loss is favored for classification challenges.
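The two loss functions named above can be computed as follows (a plain-Python sketch; the `eps` clamp guarding `log(0)` is our own numerical safeguard):

```python
import math

def mse(y_true, y_pred):
    """Mean Squared Error, typically used for regression."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def cross_entropy(y_true, y_prob, eps=1e-12):
    """Cross-entropy for a one-hot target and predicted class probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_prob))

assert mse([1.0, 2.0], [1.0, 4.0]) == 2.0                      # ((0)^2 + (2)^2) / 2
assert abs(cross_entropy([0, 1, 0], [0.1, 0.8, 0.1])
           - (-math.log(0.8))) < 1e-9
```

With a one-hot target, cross-entropy reduces to the negative log probability assigned to the true class, which is why confident correct predictions drive the loss toward zero.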
2.6. EOS
In the realm of classical optimization theory, the learning rate η must be small relative to the reciprocal of the smoothness parameter (the sharpness L) to ensure convergence, typically adhering to the condition η < 2/L [55]. Nonetheless, larger learning rates are frequently adopted in practice, with the smoothness parameter typically not taken into account during the initial phases of training. Recent observations reveal that employing a fixed learning rate η for gradient descent in neural network training results in the maximal eigenvalue of the training loss Hessian oscillating above 2/η. Concurrently, training loss exhibits non-monotonic behavior on short timescales, yet demonstrates a consistent decrease over longer periods. This phenomenon is termed the Edge of Stability (EOS) dynamic [56]. The sharpness metric increases with gradient descent and ultimately stabilizes just above 2/η.
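The classical η < 2/L threshold can be verified on the simplest possible example: a one-dimensional quadratic with sharpness L, where the gradient descent update contracts if and only if |1 − ηL| < 1. (Neural networks at the EOS behave differently because the loss is non-quadratic and the sharpness itself adapts; this sketch only illustrates the classical bound.)

```python
def gd_quadratic(L, eta, x0=1.0, steps=100):
    """Gradient descent on f(x) = (L/2) * x**2, whose gradient is L * x."""
    x = x0
    for _ in range(steps):
        x = x - eta * L * x      # equivalently x <- (1 - eta * L) * x
    return abs(x)

L = 4.0                          # sharpness: the (only) Hessian eigenvalue
assert gd_quadratic(L, eta=0.4) < 1e-6    # eta < 2/L = 0.5: |1 - eta*L| = 0.6, converges
assert gd_quadratic(L, eta=0.6) > 1e6     # eta > 2/L: |1 - eta*L| = 1.4, diverges
```

At exactly η = 2/L the iterates oscillate between ±x0 without shrinking, which is the boundary that the EOS literature finds neural network sharpness hovering just above.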
4. Methodology
In this section, we introduce a phishing software detection method based on R-tree structures and the Edge of Stability (EOS) phenomenon. Initially, we elaborate on the principles and procedures of image segmentation utilizing R-trees. Subsequently, we detail the process of extracting steganographic features and classifying them with deep learning algorithms.
4.1. Image Segmentation Based on the R-Tree
The R-tree is a multidimensional indexing structure that facilitates the segmentation of a target into multiple layers and segments. In this work, it is employed to segment images and construct a hierarchical index, thereby enabling deep learning models to achieve rapid retrieval and access. The following sections will delve into the principles and methodologies involved in using R-tree indexing for image segmentation and index construction. Firstly, image preprocessing, including scaling and normalization, is essential to ensure the input image data maintains uniform dimensions and format. This step helps mitigate the impact of varying image sizes on the effectiveness of the R-tree indexing process. Subsequently, to form the foundational Minimum Bounding Rectangles (MBRs), preprocessed images are segmented into multiple smaller blocks using Z-order. These foundational MBRs combine to form upper-layer MBRs, which are designated as the leaf nodes of the R-tree. Each foundational MBR comprises multiple pixels.
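The Z-order segmentation mentioned above orders image blocks by interleaving the bits of their (x, y) block coordinates into a single Morton key, so that blocks adjacent in the image tend to be adjacent in the resulting sequence. A minimal sketch (the function name `morton` is our own):

```python
def morton(x, y, bits=16):
    """Interleave the bits of (x, y) into a single Z-order (Morton) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # x bits go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)    # y bits go to odd positions
    return key

# Sorting a 4x4 grid of block coordinates by Morton key traverses it in Z-shaped quads.
blocks = [(x, y) for y in range(4) for x in range(4)]
blocks.sort(key=lambda b: morton(*b))
assert blocks[:4] == [(0, 0), (1, 0), (0, 1), (1, 1)]
assert morton(3, 5) == 39   # x = 011b, y = 101b interleave to 100111b
```

Grouping consecutive runs of this sequence into foundational MBRs keeps spatially close blocks inside the same rectangle, which is what makes the subsequent R-tree nodes compact.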
The construction of the R-tree index follows a bottom-up approach, bundling lower-level Minimum Bounding Rectangles (MBRs) into upper-level ones and ultimately forming a root node. In our study, the maximum branching factor is set to 20 and the minimum to 8; the root node alone may hold fewer than 8 children. This entails allocating leaf nodes to appropriate non-leaf nodes across successive layers until the root node is established. The selection of a segmentation strategy is tailored to the image’s characteristics and needs, employing methods such as the greedy algorithm or minimum area enlargement. To expedite the construction of the R-tree for faster pinpointing of pixels embedded with malicious information within deep learning models, our approach directly bundles 20 lower-level MBRs into an upper-level one, ensuring each non-leaf node encapsulates similar blocks to boost search efficiency. When the last bottom-level MBR contains fewer than 8 foundational MBRs, foundational MBRs are moved from the preceding bottom-level MBR so that the final one contains exactly 8. This ensures the structured efficiency of the indexing mechanism.
During the construction of the R-tree, we introduced several optimization strategies, employing an adaptive splitting technique that selects the optimal division point based on the specific characteristics and distribution of each image. For certain special images, we accelerated the R-tree construction process using approximate query techniques. Employing Locality Sensitive Hashing (LSH) when a local pixel’s RGB values show minimal variation aids in this process. Given that the R-tree index highlights pixels adjacent to those embedded with malicious content, it facilitates convolutional kernels in sidestepping pixel blocks devoid of such data, thereby streamlining the search and analysis endeavors. Following construction, each non-leaf node within the R-tree holds information on several blocks, enabling image retrieval and access by traversing the R-tree. For a specified query image, the R-tree’s indexing structure enables rapid pinpointing of relevant leaf nodes for further processing and analysis.
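One simple way to realize the near-duplicate matching described above is to quantize RGB values so that nearly identical colors collide in the same bucket. This is a coarse stand-in for a full LSH scheme, shown only to illustrate the idea; the function name `lsh_bucket` and the cell size are our own choices:

```python
def lsh_bucket(rgb, cell=16):
    """Hash an RGB triple so that nearly identical colors land in the same bucket."""
    return tuple(c // cell for c in rgb)   # quantize each channel to a 16-wide cell

# Near-duplicates collide, clearly different colors do not.
assert lsh_bucket((100, 150, 200)) == lsh_bucket((103, 149, 205))
assert lsh_bucket((100, 150, 200)) != lsh_bucket((10, 15, 20))
```

Blocks whose pixels all fall into one bucket can share a single representative during R-tree construction, which is the source of the speedup for low-variation regions. Note that, like real LSH, this bucketing has boundary effects: two colors that straddle a cell edge may hash apart despite being close.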
4.2. Accelerated Learning Based on EOS
EOS, which has emerged as a notable topic in the deep learning domain in recent years, offers new insights into model training and recognition processes. It reveals that these processes can be significantly accelerated without sacrificing accuracy, provided the learning rate is sufficiently high. This phenomenon occurs beyond the traditionally theorized bounds of deep learning. The acceleration comes at the cost of training stability: model loss fluctuates while decreasing across iterations rather than decreasing monotonically, and accuracy similarly fluctuates while increasing. However, excessive learning rates can lead to training failure. Consequently, we can harness EOS to accelerate the training and recognition of deep learning models, but it is imperative to regulate the learning rate to ensure that training proceeds successfully. The following details the principles and steps for utilizing EOS to speed up training and recognition in deep learning models.
We employed two open-source datasets to generate our image datasets, injecting them with various categories of malicious information for training the deep learning model in classification, referring to [64] for checking convergence. We opted for a Convolutional Neural Network (CNN) for training. Before training commenced, images were preprocessed with the R-tree-based segmentation technique designed to accelerate the convolutional kernel's navigation across image blocks; comprehensive details are provided in Section 4.1.
Given the uncertainty regarding which CNN architectures and parameters would best adapt to the two datasets injected with malicious information, we independently explored model architectures and parameters. This exploration encompassed the design and connectivity of network layers, including variations in the number of neural network layers, network widths, normalization layers (activation functions), and loss functions. All of these variations were directed toward resolving the scientific issue presented in this research and were followed by an assessment of their efficacy.
After completing the training, the trained model was utilized to extract steganography features and perform classification. For each image sample, the image was input into the model via the forward propagation algorithm to yield the model’s output. This output takes the form of a probability vector, showing the distribution across various categories. Based on this distribution, the category corresponding to the highest probability was selected as the classification result for the image. Additionally, the confidence level was returned when calling third-party application programming interfaces (APIs).
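The decision step described above (forward pass output, probability vector, highest-probability category with a confidence value) can be sketched as follows, assuming numpy; `classify` and the logit values are hypothetical, and the CNN producing the logits is assumed elsewhere.

```python
import numpy as np

def classify(logits):
    """Convert raw network outputs (logits) into a probability vector with a
    softmax, then return (predicted_category, confidence)."""
    z = np.asarray(logits, dtype=np.float64)
    z = z - z.max()                         # stabilise the exponentials
    probs = np.exp(z) / np.exp(z).sum()
    k = int(probs.argmax())                 # highest-probability category
    return k, float(probs[k])

category, confidence = classify([0.1, 2.3, -1.0, 0.4])  # hypothetical logits
print(category, round(confidence, 2))  # 1 0.77
```

The returned confidence is what a wrapper would expose when the detector is called through a third-party API.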
To expedite model training and the recognition of malicious information, we experimented with higher learning rates for quicker training and identification while ensuring model robustness so that training did not explode. Feature extraction was conducted at one or several layers, with the outputs of convolutional layers serving as feature vectors. During the comparison, fully connected layers appended after the framework's final layer improved the utilization of comprehensive feature information, and pooling layers inserted between convolutional layers summarized and abstracted the feature information.
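As a minimal illustration of how an inserted pooling layer summarizes feature information, here is a 2 x 2 average-pooling sketch in numpy; it is a toy stand-in for the framework's pooling layers, with the feature map values invented for the example.

```python
import numpy as np

def avg_pool2d(x, k=2):
    """k x k average pooling over a (H, W) feature map: each output cell
    summarises one k x k patch (H and W assumed divisible by k)."""
    h, w = x.shape
    return x.reshape(h // k, k, w // k, k).mean(axis=(1, 3))

fmap = np.arange(16, dtype=np.float64).reshape(4, 4)  # toy feature map
pooled = avg_pool2d(fmap)
print(pooled.shape)  # (2, 2): each cell abstracts a 2x2 patch
```

Each output cell replaces four inputs with their mean, which is the summarizing-and-abstracting role the text assigns to the pooling layers.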
Finally, we determine the presence of malicious content in images by analyzing classification outcomes and identifying the steganography type with the highest probability from the results to extract the concealed information. If an image is identified as containing malicious information, appropriate interventions and protective actions can be implemented.
5. Experiments and Results
This section begins by detailing our experimental setup and datasets before delving into the exploration and discussion of the efficacy of steganographic methods. Finally, experiments and evaluations were conducted on Convolutional Neural Network (CNN) models that integrate R-tree and steganography features.
5.1. Experimental Environment
The experiment was conducted on a system featuring Windows 10 Pro 22H2 x64 (manufactured by ASUS in China), an 11th Gen Intel(R) Core(TM) i7-11800H CPU (manufactured by Intel Corporation in Santa Clara, CA, USA) at 2.30 GHz with 8 cores, an NVIDIA GeForce RTX 3060 laptop GPU (manufactured by NVIDIA Corporation in Santa Clara, CA, USA), 24 GB RAM, a 512 GB SSD, and a 1024 GB HDD. The operating system was installed on the SSD while the code and datasets resided on the HDD. It is important to note that runtimes for the program vary with each execution due to the state of the machine and the resources available, which differ across machines. However, the trends and ratios of runtime should align with those presented in this study.
5.2. Datasets
For this paper, the CIFAR-10 dataset (“cifar10-5k” for details) was selected as the primary experimental dataset for training due to its classic and stable performance in model training among similar datasets. The Edge of Stability (EOS) phenomenon is reflected in the poisoned version of this dataset. This version of the CIFAR-10 dataset includes 1000 color images at a resolution of 32 × 32. It spans one original category and 9 poisoned categories, where each poisoned category contains 100 images corresponding to one of the nine steganographic methods proposed above. Malicious data, represented as binary strings, were embedded into the images using several information hiding techniques. Specifically, we employed additional embedding strategies, including row–column and block-based embedding methods that use random selection and minimal entropy techniques. Since the carrier image is square, square pixel blocks were used as the block structure, and each block had a portion of the binary string embedded based on the entropy minimization method, ensuring minimally perceptible changes in the image.
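The entropy-minimization embedding described above can be sketched as follows. This simplified version, operating on a greyscale image, choosing a single lowest-entropy block, and overwriting least significant bits (LSBs), is an assumption-laden illustration rather than the exact embedding pipeline used to build the dataset.

```python
import numpy as np

def block_entropy(block):
    """Shannon entropy of a pixel block's grey values."""
    _, counts = np.unique(block, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def embed_min_entropy(image, bits, block=4):
    """Embed a bit string into the LSBs of the lowest-entropy block of a
    greyscale image, following the entropy-minimisation strategy."""
    h, w = image.shape
    tiles = [(block_entropy(image[y:y + block, x:x + block]), y, x)
             for y in range(0, h - block + 1, block)
             for x in range(0, w - block + 1, block)]
    _, y, x = min(tiles)                       # least-informative block
    out = image.copy()
    flat = out[y:y + block, x:x + block].ravel()
    for i, b in enumerate(bits[:flat.size]):
        flat[i] = (flat[i] & 0xFE) | int(b)    # overwrite the LSB
    out[y:y + block, x:x + block] = flat.reshape(block, block)
    return out, (y, x)

img = np.random.default_rng(1).integers(0, 256, (8, 8), dtype=np.uint8)
img[0:4, 0:4] = 200                            # uniform block: entropy 0
stego, pos = embed_min_entropy(img, "1011")
print(pos)  # (0, 0) -- the zero-entropy block is chosen
```

Modifying the block that carries the least information is precisely why such changes remain minimally perceptible.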
While CIFAR-10 is a classic dataset, the Stego260 dataset [65] is also selected for the experiments. Different from CIFAR-10, the Stego260 dataset, consisting of color images specifically designed for steganalysis research, is used to compute the testing accuracy after poisoning. These images have been manipulated to include hidden information using various steganographic techniques, making them ideal for testing the accuracy of our detection methods. The poisoned version of the Stego260 dataset contains 10 categories, each with 8000 color images. We divide the dataset into training and testing sets at a ratio of 8:2. By checking the performance of the machine learning model on the Stego260 dataset, we can evaluate its ability to accurately detect malicious data hidden within images across different datasets and scenarios. This comparative analysis provides valuable insight into the robustness and generalizability of our detection system.
5.3. Steganography Results and Evaluation
The steganography outcomes shown in Figure 3 (10 bytes), Figure 4 (50 bytes), and Figure 5 (90 bytes) demonstrate the effectiveness of hiding shellcodes of different lengths; a typical shellcode averages about 20 bytes. Figure 3 demonstrates that the nine mainstream steganography methods evaluated in this study successfully embed malicious information without discernible differences observable to the naked eye on mobile devices. Although the malicious payloads are minimal, 20-byte shellcodes are sufficient for attackers to execute remote commands. A minority of users might notice slight abnormalities in the images of Figure 4 and Figure 5, specifically in (a), (e), (f), (g), and (h), upon close inspection. However, differences are not discernible in the images produced using the remaining four methods.
Here, we explore nine mainstream methods of concealing malicious information on mobile platforms. The evaluation of concealment hinges on two primary criteria: obfuscation and the ease of extraction. Storing malicious information consecutively is easily detected by users, because large amounts of data stored in consecutive pixels can produce noticeable horizontal or vertical irregular lines within the image; methods with a random storage approach avoid this. Extractability is reflected in the difficulty of locating and reassembling the hidden malicious information. Methods (a), (d), (e), (f), and (g), which store malicious information continuously at specific locations, facilitate the easiest extraction. For methods (h) and (i), attackers must acquire the specific column numbers of the image to successfully retrieve the hidden content, making extraction relatively more challenging. Methods (b) and (c) require the construction of ROP chains for extraction and execution. Consequently, method (i) emerges as the most suitable for embedding malicious information within images. Methods (d), (e), (f), and (g) are suitable when extractability is the priority. Conversely, if stealth is of utmost importance, method (b) is a viable option. Our experiments validate these conclusions.
5.4. Results and Evaluation of the EOS
This subsection will present the experimental results from different aspects while keeping other parameters unchanged. Subsequently, the EOS phenomenon of the training will be analyzed.
5.4.1. Neural Network Layers
Within each architectural framework, models were built by combining different numbers of layers to examine the influence of neural network depth on the training process, with ReLU used as the activation function. The experimental findings in Figure 6 show the training outcomes using fully connected layers (abbreviated as FC), training without supplementary layers (referred to as CNN), and training incorporating pooling layers, with a focus on average pooling (notated as AvgPool), offering a comparative analysis of layer-specific effects on model performance.
As depicted in Figure 6, all training loss and accuracy curves exhibit slight, though not significant, fluctuations. Generally, the impact of neural network depth on model training appears negligible. No marked difference was observed in the training outcomes across various layer counts at the same learning rate. Furthermore, for identical layer depths, the variance in both the training loss and accuracy curves remained consistently minimal.
5.4.2. Neural Network Width
In this study, the neural network's architecture was varied to test different network widths. The tanh activation function in fully connected layers was selected to explore the relationship between model training and neural network width. When adjusting the width of the neural network, all other parameters were held constant, with network widths set to 200, 400, 800, and 1600. The outcomes of these experiments are illustrated in Figure 7, demonstrating how varying network widths influence model performance and learning dynamics.
Similar to variations in the number of neural network layers, different network widths do not significantly impact model training. As illustrated in Figure 7, no notable differences were observed across varying widths at the same learning rate. For a given network width, as the learning rate increases, both the training loss and accuracy curves exhibit earlier and more pronounced fluctuations.
It was observed that when the neural network width was set to 200 with a learning rate of 0.04, model training experienced an explosive failure shortly after initiation. This phenomenon can be attributed to the potential for gradient explosion when overly high learning rates are applied to neural networks with smaller widths. This issue particularly affects networks characterized by a smaller number of neurons and more constricted pathways for information transmission. During the transmission process, gradients are prone to accumulating, leading to excessively large gradient values. Such accumulation can severely impede the updating process of neural network parameters. Consequently, when both the width and depth of a neural network are limited, it is advisable to opt for a smaller initial learning rate.
5.4.3. Standardized Layer (Activation Function)
In this analysis, Rectified Linear Unit (ReLU), Exponential Linear Unit (ELU), Hyperbolic Tangent (tanh), Hard Hyperbolic Tangent (hardtanh), and Softplus are evaluated as activation functions for training within a fully connected neural network framework, which has demonstrated optimal performance on our test dataset. Drawing on the findings of Cohen et al. [56], Mean Squared Error (MSE) is adopted over Cross-Entropy (CE) as the loss function for model training, based on observations that training with MSE yields superior results. This experimental setup facilitates an investigation into the impact of normalization layers on training effectiveness. The outcomes of these experiments are documented in Figure 8.
As depicted in Figure 8, the curves exhibit significant fluctuations for all activation functions at a learning rate of 0.01, excluding ReLU. This phenomenon can be attributed to several factors. The input range of the ReLU activation function within normalization layers is broader than for other functions, spanning all real numbers. Because its gradient is 1 over the positive interval, ReLU avoids the issue of vanishing gradients and introduces sparse activation. In normalization layers, ReLU sets a portion of input values to zero, fostering a sparser representation within the network; this sparsity contributes to the enhanced generalization capability and robustness of neural networks. Moreover, the ReLU activation function is computationally efficient, since it is a simple threshold function: for each input value, it performs only a single comparison. The per-element computational complexity of the ReLU function is therefore O(1), independent of the size of the input. Consequently, ReLU's distinct behavior suggests that lower complexity in normalization layers correlates with more stable training. Thus, when initializing neural network models with high learning rates, it is advisable to employ activation functions that offer strong robustness.
As the learning rate escalates, training accuracy curves demonstrate significantly increased fluctuations across all activation functions except Softplus. This trend holds true even for models utilizing the ReLU activation function. The number of iterations before the onset of this jitter decreases. On one hand, excessively high learning rates can cause parameters to overshoot the optimum solution point in the gradient direction, thereby increasing the value of the loss function. On the other hand, an excessive increase in the magnitude of parameter adjustments may precipitate oscillatory movements along the gradient trajectory, engendering fluctuations of the loss function in proximity to the optimal solution. For Softplus, given its continuous differentiability and absence of discontinuities, it can also demonstrate oscillatory tendencies at lower learning rates. This issue arises because the slope of the Softplus function becomes very small near zero. When the learning rate exceeds the optimal value, the network’s convergence speed might increase, potentially leading to excessively large weight updates. In both scenarios, network parameters may oscillate.
It has been conclusively demonstrated that normalization layers significantly influence model training. The phenomenon of EOS manifests regardless of the activation function employed when initializing with a larger learning rate. It is advisable to opt for activation functions with lower computational complexity and enhanced robustness. Furthermore, adhering to the conventional stability rule for the initial learning rate, η < 2/λmax (where λmax is the largest eigenvalue of the training loss Hessian), is recommended to ensure training stability and interpretability.
5.4.4. Loss Function
In this experiment, ReLU was selected as the activation function due to its effectiveness. Fully connected layers were employed to capitalize on their potential for superior performance. Comparative experiments were conducted using Cross-Entropy (CE) and Mean Squared Error (MSE) as the loss functions to evaluate their impact on model outcomes.
As illustrated in Figure 9, the use of different loss functions does not affect the trend of the training loss and accuracy curves; however, training concludes at different iterations depending on the loss function. Specifically, training tends to finish earlier when employing the Cross-Entropy (CE) loss function. It is therefore advisable to select an appropriate loss function when training neural network models, as an unsuitable choice may lead to slow training or even divergence.
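The two loss functions compared in this experiment can be written down directly. The sketch below, using numpy and a hypothetical 3-class sample, also illustrates why CE penalizes the same imperfect prediction more sharply than MSE.

```python
import numpy as np

def mse_loss(probs, onehot):
    """Mean Squared Error between predicted probabilities and the one-hot label."""
    return float(((probs - onehot) ** 2).mean())

def ce_loss(probs, onehot):
    """Cross-Entropy: negative log-probability assigned to the true class."""
    return float(-(onehot * np.log(probs)).sum())

# Hypothetical 3-class prediction whose true class is 0.
probs  = np.array([0.7, 0.2, 0.1])
onehot = np.array([1.0, 0.0, 0.0])
print(round(ce_loss(probs, onehot), 3), round(mse_loss(probs, onehot), 3))  # 0.357 0.047
```

For the same prediction, CE is roughly an order of magnitude larger than MSE here, so its gradients push harder early in training, which is consistent with the earlier completion observed for CE.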
5.5. Results and Evaluation of the Peak Signal-to-Noise Ratio (PSNR)
PSNR is an important indicator for measuring image compression quality, especially in the fields of image compression and image reconstruction. The calculation of PSNR relies on the Mean Squared Error (MSE) between the original image and the processed (e.g., compressed or denoised) image. PSNR is computed as shown in Equation (1):

$$\mathrm{PSNR} = 10 \cdot \log_{10}\!\left(\frac{\mathit{MAX}_I^{\,2}}{\mathrm{MSE}}\right) \tag{1}$$

where $\mathit{MAX}_I$ is the maximum possible pixel value of the image (usually 255 for 8-bit images), and MSE is the Mean Squared Error between the original image and the compressed image, given by Equation (2):

$$\mathrm{MSE} = \frac{1}{mn}\sum_{x=0}^{m-1}\sum_{y=0}^{n-1}\bigl[I(x,y) - K(x,y)\bigr]^2 \tag{2}$$

where $I(x,y)$ and $K(x,y)$ are the pixel values of the original image and the compressed image at position $(x, y)$, respectively, and $m$ and $n$ are the dimensions (width and height) of the image.
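Equations (1) and (2) translate directly into code; the following is a minimal numpy sketch, assuming 8-bit images, with the example arrays invented for illustration.

```python
import numpy as np

def psnr(original, processed, max_i=255.0):
    """PSNR in dB per Equations (1) and (2): 10 * log10(MAX_I^2 / MSE)."""
    diff = original.astype(np.float64) - processed.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")         # identical images
    return 10.0 * np.log10(max_i ** 2 / mse)

a = np.full((4, 4), 100, dtype=np.uint8)
b = a.copy()
b[0, 0] += 1                        # a single LSB-style change
print(round(psnr(a, b), 2))  # 60.17
```

A one-level change to a single pixel of a 4 x 4 image yields a PSNR above 60 dB, which is why LSB-style embeddings remain visually imperceptible.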
The average PSNR results for each image hiding method are shown in Table 3. According to the PSNR data for the nine steganographic methods, the noise levels introduced when hiding information differ between methods. The data show that the fifth steganographic method has the highest PSNR, while the eighth and seventh methods have the lowest values (see Table 3). This difference reflects the fact that different steganographic methods affect the image to different degrees during information hiding; some methods introduce more noise and thereby have a greater impact on image quality. In addition, it can be inferred that image information hiding based on minimum information entropy deceives the human eye more effectively, since the pixels carrying the least information are the ones modified. Therefore, when choosing a suitable steganographic method, it is necessary to weigh the PSNR against other performance indicators to ensure that image quality is maintained to the greatest extent while hiding the information.
5.6. Results and Evaluation of the Testing Results
While Section 5.4 and Section 5.5 focus on the performance of the training procedures, here we discuss the performance of the testing procedures. After gathering the experimental results, the confusion matrix is summarized in Figure 10. The values are concentrated mainly on the diagonal, and the misclassifications that do occur fall into the categories of other steganographic methods that embed at similar positions, which meets our expectations. After the calculation, the trained machine learning model achieves a high accuracy.
6. Conclusions
This study introduces a phishing software detection method that leverages R-trees and an analysis of the Edge of Stability (EOS) phenomenon. It first examines the effectiveness of various information hiding techniques for embedding data within the images used by mobile phishing software. The paper then presents a methodology for detecting malicious content hidden in the images of mobile applications, fine-tunes the relevant parameters, and discusses issues related to EOS. Empirical evidence demonstrates that the proposed detection method achieves high accuracy within a reasonable time. This not only enhances the speed of training and recognition but also ensures stable performance while reducing the computational overhead on mobile devices.
Our research also encountered certain limitations. The diversity in information hiding techniques introduces a significant challenge due to their varying effectiveness. This challenge necessitates the development of an algorithm that can automatically select the optimal hiding method, taking into account the characteristics of both the malicious information and the image carrier. Moreover, although the phishing software detection method based on R-trees and EOS offers a general solution, crafting a method with enhanced generalization capabilities to detect malicious information hidden within images remains a goal. Moving forward, our efforts will be dedicated to addressing these challenges, aiming for solutions that further refine and extend the capabilities of our detection methodologies.