This section presents a detailed analysis of the proposed methodology. The focus is two-fold: tracking the performance of the Siamese network in similarity learning, and assessing the efficacy of the genetic algorithm, guided by the Siamese network, in optimizing the S-box design. Each phase of the experiment is analyzed in turn. To make the experiments reproducible, all random seeds are set to fixed values. All runs of each configuration start from the same initial population, so that comparisons between the model variants are consistent and isolate the impact of the proposed similarity-guided crossover operation.
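The seeding strategy described above can be sketched as follows. This is a minimal illustration rather than the setup used in the experiments: the seed value, the function name `initial_population`, and the use of Python's `random` module are assumptions, since the text does not state how the seeds are set.

```python
import random

def initial_population(size, seed, n_entries=256):
    """Return `size` random permutations of 0..n_entries-1 (bijective S-boxes)."""
    rng = random.Random(seed)  # dedicated generator, independent of global state
    population = []
    for _ in range(size):
        sbox = list(range(n_entries))
        rng.shuffle(sbox)  # a shuffled identity table is always a permutation
        population.append(sbox)
    return population

SEED = 2024  # hypothetical value, chosen only for this example

# Two variants built from the same seed start from the identical population.
pop_a = initial_population(100, SEED)
pop_b = initial_population(100, SEED)
```

Because each variant receives the same starting individuals, differences in the final metrics can be attributed to the crossover operator rather than to the initial draw.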
4.1. Performance Analysis of the Siamese Network
To study the performance of the transformer-based model, extensive experiments were conducted with hyperparameter fine-tuning, architecture scaling, and different loss functions. The starting point was a baseline model with a standard transformer configuration: six encoder layers, a hidden size of 512, eight attention heads, and a dropout rate of 0.1. The baseline model was trained for 50 epochs with a learning rate of 0.001 and a batch size of 32, using Contrastive Loss as the objective function. After training, the final metrics were a training loss of 0.25, a validation loss of 0.27, and an accuracy of 85%, suggesting that the model captured the core patterns in the data.
(i) Hyperparameter Tuning and Architectural Scaling: To evaluate the efficacy of the Siamese network in learning similarity among S-boxes, the hyperparameters were first fine-tuned by reducing the learning rate to 0.0005 and increasing the batch size from 32 to 64. This configuration attained a training loss of 0.23, a validation loss of 0.26, and an accuracy of 87%, indicating better generalization and smoother updates. Subsequently, the architecture was enlarged by increasing the number of encoder layers from 6 to 12 and the attention heads from 8 to 16, with the learning rate and batch size reverted to their original values. This change lowered the training loss to 0.21 and the validation loss to 0.24, and raised the accuracy to 89%, as shown in Table 1. Although this setup improved performance, it also made the model more resource-intensive, which calls for careful regularization to avoid overfitting.
(ii) Loss Function Variations: The effect of different loss functions on model training was also examined.
- Contrastive Loss Trends: The contrastive loss function, which minimizes intra-class variance and maximizes inter-class variance, is examined over training iterations. With Contrastive Loss alone, the model reached a training loss of 0.28, a validation loss of 0.30, an accuracy of 82%, and an AUC score of 0.85, establishing a solid baseline for binary classification. Figure 6 shows the Confusion Matrix and AUC curve, illustrating effective class separation but indicating room for improvement.
- KL Divergence Loss Trends: The KL divergence helps preserve the distribution of features by aligning the embedded representations with a reference distribution. Fluctuations were seen initially, as the network adapted in the latent feature space. Performance improved slightly when the KL Divergence loss was integrated with the Contrastive loss: the training loss fell to 0.26, the validation loss fell to 0.28, the accuracy rose to 84%, and the AUC score increased to 0.87. The updated Confusion Matrix (Figure 7a) and AUC Curve (Figure 7b) show that incorporating KL Divergence improved true positive rates and made the classes easier to separate.
- Novel Contrastive + KL Loss Function: Combining the Contrastive and KL loss functions, weighted by the parameter α, resulted in effective convergence over the training epochs. The unified loss delivered the best results, with a training loss of 0.24, a validation loss of 0.27, an accuracy of 89%, and an AUC of 0.88. These findings indicate the robustness of the Contrastive-KL loss function in capturing detailed features and improving class differentiation, though further tuning of α may improve performance in other scenarios. Figure 8 displays the confusion matrix and ROC curve corresponding to this combined loss.
The loss curves of the individual loss functions are compared in Figure 9. Table 2 compares the performance of the loss functions. These results suggest that the new loss function not only helps the model learn finer details but also improves its generalization to new, unseen data.
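To make the role of α concrete, the following minimal sketch shows one way a contrastive term and a KL term could be combined. It rests on several assumptions not confirmed by the text: the convex weighting α·L_con + (1 − α)·L_KL, the margin-based form of the contrastive loss, the softmax normalization of the embeddings, and all function names are illustrative choices, not the formulation used in the experiments.

```python
import math

def contrastive_loss(distance, is_similar, margin=1.0):
    """Margin-based contrastive loss for one pair (assumed form)."""
    if is_similar:
        return distance ** 2                   # pull similar pairs together
    return max(0.0, margin - distance) ** 2    # push dissimilar pairs apart

def softmax(values):
    """Normalize an embedding into a probability distribution."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def combined_loss(emb_a, emb_b, is_similar, reference, alpha=0.5, margin=1.0):
    """Hypothetical alpha-weighted sum of the contrastive and KL terms."""
    distance = math.dist(emb_a, emb_b)         # Euclidean distance
    l_con = contrastive_loss(distance, is_similar, margin)
    # Assumption: each embedding is aligned with the same reference distribution.
    l_kl = 0.5 * (kl_divergence(reference, softmax(emb_a))
                  + kl_divergence(reference, softmax(emb_b)))
    return alpha * l_con + (1.0 - alpha) * l_kl
```

Under this weighting, α = 1 recovers the purely contrastive objective and α = 0 the purely KL-based one, so intermediate values trade class separation against distribution alignment.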
4.2. Boosting Genetic Algorithm with Siamese Network
In this section, the impact of incorporating a Siamese network into a genetic algorithm (GA) on the quality of the generated S-boxes is studied. The proposed similarity-guided, adaptive crossover strategy is compared with the traditional single-point crossover method. Each variant is evolved over 500 generations with population sizes of 100, 500, and 1000 individuals. Four variants are assessed: the vanilla GA (without modifications), the Contrastive-guided GA (using Contrastive loss), the KL-guided GA (using KL divergence loss), and the proposed combined Contrastive-KL-guided approach. Table 3 provides a brief definition of each variant. For all variants, the key metrics, namely nonlinearity (NL), differential uniformity (DU), and diversity, are tracked to evaluate the performance and efficacy of the proposed model.
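For readers unfamiliar with the two cryptographic metrics, the sketch below computes NL from the Walsh-Hadamard spectrum and DU from the difference distribution table, using their standard definitions. The function names are illustrative, and the 4-bit PRESENT S-box is used only as a small, well-known test input; it is not one of the S-boxes evaluated in this study.

```python
def walsh_hadamard(signs):
    """Fast Walsh-Hadamard transform of a list whose length is a power of two."""
    out = list(signs)
    step = 1
    while step < len(out):
        for start in range(0, len(out), 2 * step):
            for i in range(start, start + step):
                a, b = out[i], out[i + step]
                out[i], out[i + step] = a + b, a - b
        step *= 2
    return out

def parity(x):
    """Parity (0 or 1) of the number of set bits in x."""
    return bin(x).count("1") & 1

def nonlinearity(sbox):
    """NL = 2^(n-1) - max|W|/2 over all nonzero output masks."""
    size = len(sbox)
    worst = 0
    for mask in range(1, size):
        # Component Boolean function mask . S(x), encoded as +1/-1.
        signs = [1 - 2 * parity(mask & y) for y in sbox]
        worst = max(worst, max(abs(w) for w in walsh_hadamard(signs)))
    return size // 2 - worst // 2

def differential_uniformity(sbox):
    """Largest entry of the difference distribution table for nonzero input differences."""
    size = len(sbox)
    worst = 0
    for dx in range(1, size):
        counts = [0] * size
        for x in range(size):
            counts[sbox[x] ^ sbox[x ^ dx]] += 1
        worst = max(worst, max(counts))
    return worst

def is_bijective(sbox):
    """True if the table is a permutation of 0..len(sbox)-1."""
    return sorted(sbox) == list(range(len(sbox)))

# Small illustrative input: the 4-bit S-box of the PRESENT cipher.
PRESENT_SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
                0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]
```

The same functions apply unchanged to 256-entry tables; higher NL and lower DU indicate stronger resistance to linear and differential cryptanalysis, respectively.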
(i) Results across population sizes: For each population size (100, 500, and 1000), the algorithm was evolved over 500 generations. In each generation, the nonlinearity (NL), differential uniformity (DU), and population diversity scores were recorded. The main aim is to evaluate the improvements in cryptographic soundness, convergence, and robustness of the algorithm.
- Population Size = 100: The standard GA converged by the 200th generation; its improvement curve flattened after reaching a decent nonlinearity (NL) score, but it showed low diversity, as shown in Figure 10a,b. When Contrastive loss was used to guide the GA, the observations were promising: the model chose crossover points strategically early on (generations 1–150), which increased the NL score by approximately 7 to 9%. The KL-guided model, by contrast, used a softer, probabilistic approach to similarity. It did not show improvements initially but kept improving steadily, maintaining high diversity and surpassing the Contrastive model around the 400th generation. The proposed combined Contrastive-KL loss model achieved quick gains like the Contrastive model while retaining the diversity benefits of KL divergence. It kept its momentum across all 500 generations and delivered the best cryptographic results, improving average nonlinearity by about 12% over the baseline, as shown in Figure 10a.
- Population Size = 500: Initially, the baseline GA maintained a decent level of diversity but stagnated around generation 200, as shown in Figure 11a,b. The Contrastive-guided approach enhanced NL and reduced DU early in the process, although diversity declined between generations 250 and 350 due to a tendency to favor similar parent solutions. The KL-divergence-guided model preserved its initial diversity level, allowing for more sustained exploration and maintaining an approximately 15% higher diversity score than the Contrastive model during the critical middle phase. Although this meant a slightly slower start, it avoided premature convergence and progressed smoothly in fitness up to generation 500 while preserving diversity and continuing to explore cryptographic configurations. The Contrastive-KL model outperformed both the Contrastive and KL models for this population size: it consistently achieved the highest final NL scores and the lowest DU, showing a better balance between aggressive crossover and adaptability. It also kept a diversity score above 0.4 for all generations, twice as high as the baseline, which kept it from getting stuck or drifting aimlessly and made it reliable and consistent, as shown in Figure 11c.
- Population Size = 1000: The baseline GA benefited from a higher initial diversity, which slowed convergence, but it lacked the strategic crossover mechanisms needed for further improvement. The Contrastive-guided model outperformed the baseline by approximately the 300th generation, as illustrated in Figure 12a. Owing to the population’s size, its aggressive selection did not reduce diversity, as there was enough variety to prevent rapid homogenization. The KL divergence approach provided a stabilizing effect in this larger population. As demonstrated in Figure 12c, the KL divergence model maintained a higher diversity (by approximately 5%) than the baseline and Contrastive models. The proposed combined loss model surpassed the other models in terms of the cryptographic soundness of the generated S-boxes and diversity stability, as shown in Figure 12. The numerical improvements were modest, however: the combined loss model showed 4–6% better metrics than the baseline and the other models. This indicates that in large populations, where diversity is already high, similarity-based crossover yields diminishing returns, as depicted in Figure 12c. Nevertheless, it showed lower variability over the generations, and the consistent improvement in S-box properties demonstrates its value at larger scales.
The test results indicate that the similarity-guided adaptive crossover framework performs well and that performance improves further when it is combined with model-driven loss optimization. Across the experiments, the Contrastive-guided model made strategic choices and produced good results early on but did not perform well in terms of diversity. The KL divergence model, on the other hand, maintained a more varied population but took substantially longer to converge. The combined Contrastive + KL approach balanced this trade-off between convergence speed and diversity, producing strong results for smaller population sizes and steadily maintaining the improvement trajectory in large populations as well. This method improved the overall performance and lowered fitness variability by up to 20%.
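The idea of letting a similarity score steer the crossover can be illustrated with the sketch below. It is a hypothetical stand-in, not the operator used in the experiments: the rule that maps similarity to a cut point, the jitter range, and the function name `guided_crossover` are all assumptions, and the `similarity` argument is a placeholder for the score that the Siamese network would supply. The repair step (filling the tail with unused values in the second parent's order) is one common way to keep the child bijective.

```python
import random

def guided_crossover(parent_a, parent_b, similarity, rng):
    """Single-cut crossover whose cut point depends on a similarity score in [0, 1]."""
    n = len(parent_a)
    # Hypothetical rule: the more similar the parents, the later the cut,
    # so less material is taken from the second parent.
    base_cut = round(similarity * (n - 1))
    spread = n // 16
    jitter = rng.randint(-spread, spread)      # small random perturbation
    cut = min(n - 1, max(1, base_cut + jitter))
    head = parent_a[:cut]
    used = set(head)
    # Fill the tail with the unused values in parent_b's order,
    # which keeps the child a permutation.
    tail = [value for value in parent_b if value not in used]
    return head + tail

rng = random.Random(0)                         # fixed seed for repeatability
parent_a = list(range(16))
parent_b = list(reversed(range(16)))
child = guided_crossover(parent_a, parent_b, similarity=0.5, rng=rng)
```

A plain single-point crossover corresponds to drawing the cut uniformly at random; the guided variant differs only in how the cut is chosen.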
(ii) Statistical Comparison of Models across Population Sizes: The proposed similarity-driven adaptive crossover method was tested against the standard genetic algorithm (GA) with different population sizes (100, 500, 1000) across four variants: (a) the vanilla genetic algorithm, (b) a model guided by Contrastive loss, (c) a model guided by KL divergence, and (d) the combined approach using Contrastive and KL losses. The metrics that reflect cryptographic soundness, namely nonlinearity, differential uniformity, and population diversity, were tracked over 500 generations. This section analyzes the best metrics, summarizing their average and variability, which helps to understand the behavior, stability, and efficacy of each approach. The results for population sizes of 100, 500, and 1000 are shown in Table 4, Table 5, and Table 6, respectively.
For a population size of 100, as shown in Table 4, the Contrastive-KL model attains the highest average nonlinearity score of 109.3460 and a diversity measure of 0.0099, with a differential uniformity (DU) value of 10.0, comparable to that of the standard and Contrastive-only models, indicating a balance between swift advancement and exploratory potential. For a population size of 500, as detailed in Table 5, the basic genetic algorithm exhibits commendable initial diversity and nonlinearity, with a mean of 109.7395, yet it does not improve its DU, which remains steady at 10.00, unlike the Contrastive and KL-enhanced models. The Contrastive + KL model approaches a similar level of nonlinearity (109.5395), maintains a diversity of 0.0119, and displays the lowest standard deviation of 0.4778, emphasizing its robustness. However, its DU of 10.00 may underrepresent the model’s full capacity in longer runs. At the largest population size of 1000 (Table 6), the baseline GA benefits from a bigger genetic pool and achieves the highest nonlinearity of 109.9595. Still, the proposed model stays competitive, with a nonlinearity of 109.5815 and the lowest diversity of 0.0165. It also maintains a differential uniformity of 10.00 and has the lowest standard deviation in nonlinearity, 0.4066, demonstrating its robustness across multiple runs.
Overall, these findings lend support to the qualitative analysis. The proposed Contrastive + KL loss combines the swift convergence of Contrastive loss with the capacity of KL divergence for maintaining diversity, resulting in enhanced cryptographic properties. The Siamese Transformer assesses S-box similarity to support more informed crossover decisions, striking a balance between exploration and exploitation to ensure stable evolution. These benefits are most evident in smaller populations (100 and 500), where they lead to important improvements, yet they remain valuable in larger populations (1000), which tend to offer greater diversity but less guided optimization. This hybrid approach consistently produces reliable S-boxes characterized by high nonlinearity and stable differential uniformity, representing a meaningful progression in AI-driven cryptographic design.
The evaluation above covers all four variants across three population sizes: 100, 500, and 1000. To assess the potential and the ceiling of cryptographic strength of the proposed framework, an extended analysis was additionally carried out on a population size of 120 over 1000 iterations, achieving a nonlinearity score of 110.25, a DU of 8, and a diversity measure of 0.993, as shown in Table 7. The resulting optimized S-box, shown in Table 8, is bijective, which makes it suitable for cryptographic applications. The obtained results surpass the standard genetic algorithm, which records a DU of 10. A differential uniformity of 8 strengthens the resistance against differential cryptanalysis. It is important to note that the standard GA does not optimize beyond a DU of 10, as it tends to stagnate. Furthermore, the nonlinearity increased in parallel with the decrease in DU, demonstrating a balance between the two objectives.
To assess the security strength of the proposed optimized S-box, a comparative study is carried out with traditional GA-based methods and other state-of-the-art S-box generation methods, as shown in Table 9, using the most desirable S-box security metrics, namely nonlinearity and differential uniformity. As discussed, the proposed framework generates an optimized S-box that attains a mean nonlinearity score of 110.25 together with the lowest DU value of 8, and it notably achieves a diversity of 0.993, exhibiting reliable convergence without premature uniformity. The proposed S-box demonstrates improved cryptographic strength compared to existing methods using genetic algorithms investigated in [29,42], a reinforcement method using hash functions presented in [32], a highly nonlinear S-box design using I-Ching operators in [43], a transfer-function-assisted metaheuristic and booster algorithm studied in [42], an S-box generation method using mutation optimization in [44], a DC generative adversarial network based S-box design method presented in [45], a multi-objective optimization based optimal S-box generation method given in [46], and a Roulette wheel based Social Network Search algorithm for strong S-box design proposed in [47]. These comparisons indicate that the proposed method is a credible way to improve the performance of the genetic algorithm in producing optimized S-boxes. The proposed methodology stands out due to its use of the Siamese Transformer network’s S-box comparison mechanism, which supports more informed crossover decisions. Unlike conventional genetic algorithms or hybrid approaches that incorporate chaos theory, this method achieves a careful balance between exploration and exploitation. As a result, it attains high levels of nonlinearity, maintains consistent differential uniformity (DU), and exhibits high diversity. Furthermore, by integrating deep learning techniques, the approach improves the effectiveness of designing strong cryptographic S-boxes, making it a strong candidate for the design of secure cryptographic components.
4.3. Computational Complexity and Runtime Analysis
Although one might assume that the Siamese Transformer incurs significant computational overhead, our runtime analysis indicates otherwise, because the similarity evaluation takes place only once per crossover event. In S-box optimization, the primary computational cost lies in the cryptographic fitness evaluations, particularly the calculation of the Walsh–Hadamard transform for nonlinearity and the construction of XOR distribution tables for differential uniformity.
Empirically, for a population size of 100 on a system equipped with an Intel i9 processor and an NVIDIA RTX 4060 Ti GPU, the baseline standard GA required an average of 1.42 s per generation, whereas the proposed framework required an average of 1.48 s. The Siamese Transformer thus adds an overhead of only about 4.2% to the overall cryptographic optimization. The advantages in avoiding local optima and identifying superior cryptographic configurations outweigh this increase in runtime, which supports the scalability and usefulness of the proposed framework. To illustrate, running the complete genetic algorithm for 500 generations on a population of 100 takes around 710 to 740 s (≈12 min) for the existing GA and around 740 to 780 s (≈12 to 13 min) for the proposed scheme on the same system. For larger population sizes (500 and 1000), the total execution time rises to approximately 1 to 2 h. The runtime grows roughly in proportion to the population size, primarily due to the higher number of fitness evaluations and similarity computations. The computational cost is mainly associated with the evaluation of the cryptographic fitness functions, such as the Walsh–Hadamard transform and differential uniformity, both of which are sensitive to changes in population size and S-box size.
Running numerous separate experiments for each setting is challenging because repeated fitness evaluation is expensive, especially for larger populations. Therefore, the results in the current study are reported from a representative run using the same seed, to ensure reproducibility. The general trends observed across population sizes and model variants, however, are stable and consistent with each other.
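The overhead and total-runtime figures quoted above follow directly from the per-generation timings, as the short calculation below shows. Only the two reported averages and the generation count are taken from the text; the variable names are illustrative.

```python
BASELINE_S_PER_GEN = 1.42   # standard GA, seconds per generation (reported)
PROPOSED_S_PER_GEN = 1.48   # proposed framework, seconds per generation (reported)
GENERATIONS = 500

# Relative overhead introduced by the Siamese Transformer.
overhead = (PROPOSED_S_PER_GEN - BASELINE_S_PER_GEN) / BASELINE_S_PER_GEN

# Total runtime if every generation took exactly the average time.
baseline_total = BASELINE_S_PER_GEN * GENERATIONS
proposed_total = PROPOSED_S_PER_GEN * GENERATIONS

print(f"overhead: {overhead:.1%}")
print(f"baseline: {baseline_total:.0f} s ({baseline_total / 60:.1f} min)")
print(f"proposed: {proposed_total:.0f} s ({proposed_total / 60:.1f} min)")
```

The resulting totals of roughly 710 s and 740 s correspond to the lower ends of the ranges reported for the two schemes.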