Article
Peer-Review Record

STAR: Self-Training Assisted Refinement for Side-Channel Analysis on Cryptosystems

Cryptography 2025, 9(4), 75; https://doi.org/10.3390/cryptography9040075
by Yuheng Qian 1,2, Jing Gao 3, Yuhan Qian 1, Yaoling Ding 1,* and An Wang 1
Reviewer 1:
Reviewer 3: Anonymous
Submission received: 20 October 2025 / Revised: 20 November 2025 / Accepted: 21 November 2025 / Published: 27 November 2025
(This article belongs to the Section Hardware Security)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Summary:

The manuscript presents a two-stage unsupervised learning framework, named STAR (Self-Training Assisted Refinement), for classifying key-dependent operations in cryptographic side-channel analysis (SCA). The approach first applies a Gaussian Mixture Model (GMM) to cluster unlabeled trace segments and generate pseudo-labels for high-confidence samples. These pseudo-labels are then used to train a convolutional neural network (CNN) that iteratively reclassifies low-confidence samples through a self-training loop, progressively refining classification accuracy. The authors evaluate STAR on three datasets corresponding to public-key cryptographic algorithms and report good classification accuracy with claimed improvements of 12–48% over prior methods.
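For concreteness, below is a minimal sketch of a GMM-seeded self-training pipeline of the kind summarized above. It is a reconstruction from this summary, not the authors' code: the toy 1-D CNN, the threshold `tau`, and the loop parameters are all illustrative assumptions (the manuscript, as noted below, does not specify them).

```python
# Minimal sketch of a GMM-seeded self-training pipeline as described in the
# summary above; NOT the authors' implementation. The toy CNN, tau, and loop
# parameters are illustrative assumptions. Assumes some samples exceed tau
# after the initial GMM stage.
import numpy as np
import torch
import torch.nn as nn
from sklearn.mixture import GaussianMixture

def star_like(segments: np.ndarray, tau: float = 0.95,
              n_iters: int = 10, epochs: int = 5) -> np.ndarray:
    """segments: (n, L) array of aligned trace segments; returns labels (n,)."""
    # Stage 1: GMM with K = 2 clusters (e.g., add vs. double) gives soft labels.
    gmm = GaussianMixture(n_components=2, random_state=0).fit(segments)
    proba = gmm.predict_proba(segments)                    # (n, 2) posteriors
    labels = torch.tensor(proba.argmax(axis=1))            # hard pseudo-labels
    trusted = torch.tensor(proba.max(axis=1) >= tau)       # high-confidence mask

    x = torch.tensor(segments, dtype=torch.float32).unsqueeze(1)  # (n, 1, L)
    cnn = nn.Sequential(                                   # toy 1-D CNN
        nn.Conv1d(1, 8, kernel_size=11, padding=5), nn.ReLU(),
        nn.AdaptiveAvgPool1d(16), nn.Flatten(), nn.Linear(8 * 16, 2))
    opt = torch.optim.Adam(cnn.parameters(), lr=1e-3)

    # Stage 2: self-training loop -- fit on trusted pseudo-labels, then
    # promote newly confident predictions until none remain below tau.
    for _ in range(n_iters):
        for _ in range(epochs):
            opt.zero_grad()
            loss = nn.functional.cross_entropy(cnn(x[trusted]), labels[trusted])
            loss.backward()
            opt.step()
        with torch.no_grad():
            conf, pred = torch.softmax(cnn(x), dim=1).max(dim=1)
        promote = ~trusted & (conf >= tau)
        if not promote.any():
            break
        labels[promote] = pred[promote]
        trusted |= promote
    return labels.numpy()
```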

While the conceptual foundation of the paper is solid, the experimental design, presentation, and validation are seriously underdeveloped and require substantial revision to reach publication quality. Most notably, the manuscript suffers from insufficient transparency and missing methodological details, making the reported results neither verifiable nor reproducible.

 

Strengths:

The paper addresses an important and timely problem in side-channel analysis: achieving accurate operation classification in the presence of unlabeled, noisy, and high-dimensional trace data. The motivation is clear, and the overall direction is promising. The authors have identified the limitations of existing clustering approaches such as DBSCAN, particularly the inability to fix the number of clusters and the resulting need for brute-force cluster enumeration. The proposed use of soft probabilistic labels from the GMM as initial supervision for iterative CNN refinement is conceptually sound. Overall, the motivation and methodological direction are strong, and the work could potentially contribute to advancing unsupervised or semi-supervised methods in SCA if validated with sufficient experimental rigor.

Weaknesses:

There is a major issue of overstated generalization. In Section 3 and Figure 1, the authors claim that STAR can process timing, power, acoustic, illuminance, and electromagnetic side-channel data. However, all experiments use only power traces. No discussion is provided of how the proposed framework would adapt to other modalities with distinct signal characteristics. This constitutes an exaggerated claim unsupported by evidence and should be toned down or substantiated experimentally. Figure 1 and its associated description need to be corrected accordingly.

The related work discussion is incomplete. The introduction briefly mentions prior horizontal-attack research but omits many relevant studies that have addressed similar motivations using unsupervised, semi-supervised, or deep-learning methods. The paper would benefit from a dedicated “Related Work” section summarizing these contributions and highlighting how STAR differs from them. In the experimental section, STAR should be compared not only with GMM and DBSCAN-CNN but also with other contemporary clustering-based SCA frameworks and CNN architectures for horizontal analysis.

Equally problematic is the absence of CNN architecture details and hyperparameter settings. The manuscript never describes the network depth, filter sizes, activation functions, optimizer, learning rate, batch size, or regularization techniques. For a framework whose central innovation relies on a neural network’s capacity to refine clustering, these omissions are serious. Moreover, the authors do not provide any information about training time, computational cost, or hardware utilization beyond the mention of a GPU. The complete lack of model specification undermines the reproducibility of the framework.
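To make the request concrete, the level of detail meant here could be reported as compactly as the following. Every value is hypothetical and for illustration only; none of them come from the manuscript.

```python
# Hypothetical example of the kind of model specification requested above;
# all values are placeholders, not taken from the manuscript.
cnn_spec = {
    "layers": [
        {"type": "conv1d", "filters": 16, "kernel": 11, "activation": "relu"},
        {"type": "maxpool1d", "size": 2},
        {"type": "conv1d", "filters": 32, "kernel": 5, "activation": "relu"},
        {"type": "flatten"},
        {"type": "dense", "units": 2},
    ],
    "optimizer": "adam",
    "learning_rate": 1e-3,
    "batch_size": 128,
    "epochs": 50,
    "regularization": {"dropout": 0.25, "weight_decay": 1e-4},
}
```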

The experimental section lacks critical information. The authors do not specify how many power traces were used in each dataset. The data quality, sampling rate, and degree of noise (signal-to-noise ratio, random delay magnitude) are never quantified. It remains unclear how the training and testing sets were divided or whether cross-validation was performed. Without this information, the results cannot be independently reproduced or trusted.

The experimental validation itself is overly limited and lacks depth. Figures 3, 5, and 7 are insufficient to substantiate claims of perfect classification. Even in these figures, some boundary samples appear visually misclassified after self-training. The authors should include quantitative confusion matrices, precision-recall metrics, or sample-wise accuracy distributions to confirm their claims. Further investigation of the CNN architecture is also needed: the authors should determine the optimal architecture and demonstrate whether it generalizes across datasets.

Transparency is further undermined by the lack of source code availability or pseudocode outlining the STAR pipeline. For reproducibility and community adoption, the authors should release implementation details or open-source code for their GMM initialization, confidence-thresholding, and iterative self-training procedures.

The manuscript has many editorial mistakes. It needs a thorough check.

Comments on the Quality of English Language

The manuscript requires careful proofreading for grammar, punctuation, and spacing consistency. Several grammatical errors are present (for example, “CNN provide” should be “CNNs provide” and “we proposes” should be “we propose”). In addition, there are frequent missing spaces.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

Overall a very good work adequate for publication. I have only two remarks for the authors:

  1. The abstract and introduction should be amended to highlight for potential readers the most important aspects of the paper’s contributions, and to include a thorough presentation of the sections of the paper under review.
  2. In addition, the authors should expand the section dedicated to related work with state-of-the-art studies from at least the last five years on the subject, and support the contributions this paper offers to the scientific community in comparison with already published works. A good idea would be to summarize the reviewed papers in a dedicated table.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The manuscript proposes STAR, a two-stage unsupervised framework for side-channel analysis. It uses a Gaussian Mixture Model (GMM) to obtain initial high-confidence pseudo-labels. A CNN is then self-trained to iteratively refine the labels of low-confidence samples, and the authors claim 100% accuracy on ECC, RSA, and SM2 datasets.

The idea of using GMM to fix the cluster-number problem of DBSCAN is good, and the iterative refinement is logical. The results look impressive, especially the 100% accuracy and zero traversal count. However, I have some major concerns, particularly about a key parameter which seems to require manual tuning.

1). My biggest concern is the confidence threshold tau. Section 4.3 clearly shows that this parameter is extremely sensitive: the optimal value (0.9, 0.98, 0.95) differs for each dataset. How can a user select it without ground-truth labels? This "magic number" problem seriously hurts practical usability.
2). The novelty seems a bit incremental. Looking at Table 1, GMM alone already achieves 98% accuracy. The complex CNN self-training stage only refines the last 2%. The main contribution seems to be replacing DBSCAN with GMM (with K=2) to solve the traversal-count problem.
3). The paper admits in "Future work" that the threshold needs automation. I disagree. This isn't "future work," it's a fundamental flaw. An "unsupervised" method that needs a magic number tuned manually for each dataset isn't really unsupervised or practical.
4). You fixed the GMM cluster number at K=2 for all experiments. This is a huge assumption that only works for simple "add-or-double" operations. What if the cryptosystem involves three or more distinct operations? The method's generality is questionable.
5). The comparison to DBSCAN-CNN is not fair. You criticize its high traversal count, but your "solution" is to just assume K=2 beforehand. DBSCAN is meant to find an unknown number of clusters. You are comparing apples and oranges.
6). The feature extraction part is vague. You just say "PCA was used" or "ISOMAP provided the embedding". This is a critical step. You must detail how these features were engineered and selected. This lack of detail makes the work hard to reproduce.
7). The CNN architecture is just a generic diagram. For reproducibility, you must provide the exact details: how many layers, filter sizes, kernel counts, activation functions, and training parameters. This is a basic requirement for any deep learning paper.
8). The datasets are not well-described. Where did they come from? Are they public? What device was used? The SM2 trace in Figure 6 looks extremely clean, almost like a simulation. We need to know if these are realistic, noisy, real-world traces.
9). Claiming 100% accuracy on three different datasets is extraordinary, especially against random delays. This sounds too good to be true. Was this based on one "lucky" trace? You need to show this result is robust, perhaps over many different traces.
10). The "state-of-the-art" comparison is weak. Table 1 only compares against four very basic or slightly old methods. There are many more advanced unsupervised and semi-supervised SCA papers published recently. This comparison is not convincing.
11). You say the GMM provides "soft labels", but then you immediately apply a hard threshold tau to pick high-confidence samples. This defeats the purpose of soft labels. Why not use the actual probability scores to weight the loss during CNN training? (A sketch of this alternative follows the list below.)
12). The number of self-training iterations (15, 29, 25) seems arbitrary, and the stopping criterion is simply "until no samples are left", which could be very slow. Have you analyzed the convergence speed or efficiency of this iterative process?
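To make point 11 concrete, the following is a minimal sketch of the soft-label alternative: rather than discarding samples below a hard threshold tau, the GMM posterior probabilities themselves weight (or directly serve as) the training targets. This is an illustrative suggestion, not the authors' method; the tensor names (`logits`, `proba`) and the PyTorch framing are assumptions.

```python
# Illustrative sketch only (not the authors' implementation): use GMM
# posteriors as per-sample weights or as genuinely soft targets, instead
# of a hard cut at tau.
import torch
import torch.nn.functional as F

def confidence_weighted_loss(logits, proba):
    """logits: (n, K) CNN outputs; proba: (n, K) GMM posterior probabilities."""
    proba = torch.as_tensor(proba, dtype=torch.float32)
    conf, pseudo = proba.max(dim=1)                     # confidence, hard pseudo-label
    per_sample = F.cross_entropy(logits, pseudo, reduction="none")
    return (conf * per_sample).mean()                   # confident samples weigh more

def soft_label_loss(logits, proba):
    """Cross-entropy against the full posterior, keeping the labels soft."""
    proba = torch.as_tensor(proba, dtype=torch.float32)
    return -(proba * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```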

Please kindly provide revisions or clarifications. If the above concerns are addressed or clarified, I will recommend acceptance of your manuscript.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

I thank the authors for their detailed revisions and their efforts to address many of the concerns raised in the initial review. The manuscript has improved in structure and clarity, and the authors have enhanced its technical rigor.

The manuscript continues to suffer from language and editorial quality issues. Many of the grammatical errors and formatting inconsistencies noted previously still appear in the revised text. For example, subject-verb agreement issues remain (“CNN provide”), and there are numerous missing spaces and punctuation inconsistencies. A thorough proofreading by a fluent English speaker is necessary to ensure the clarity and professional polish expected of a publication in this area. Although these issues were explicitly mentioned in the previous review, the authors have not addressed them.

Another concern is the continued inclusion of “illuminance” as a viable side-channel modality in Figure 1. To the best of my knowledge, there is no peer-reviewed literature validating the extraction of cryptographic keys through illuminance-based measurements. Unless the authors can cite credible works demonstrating such feasibility, this claim should be removed or revised to avoid misleading readers about the current state of research in side-channel analysis.

Finally, I strongly encourage the authors to improve transparency regarding dataset availability. Since the evaluation relies solely on privately collected traces, it would benefit the community if the acquisition setup, including sampling parameters, probe characteristics, and noise conditions, were documented more explicitly. Ideally, incorporating at least one publicly available dataset and providing STAR pseudocode would significantly strengthen the reproducibility and accessibility of the work. While the authors indicate an intention to release code after acceptance, more concrete implementation details should be included at this stage.

Comments on the Quality of English Language

The manuscript requires careful proofreading for grammar, punctuation, and spacing consistency. Several grammatical errors are present (for example, “CNN provide” should be “CNNs provide” and “we proposes” should be “we propose”). In addition, there are frequent missing spaces. The authors have not solved these issues, though these were pointed out before.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

Good revisions. My concerns are addressed. I think this manuscript can be accepted now.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
