Article
Peer-Review Record

A Multi-Scale Unsupervised Feature Extraction Network with Structured Layer-Wise Decomposition

Appl. Sci. 2025, 15(13), 7194; https://doi.org/10.3390/app15137194
by Yusuf Şevki Günaydın 1 and Baha Şen 2,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 25 May 2025 / Revised: 19 June 2025 / Accepted: 24 June 2025 / Published: 26 June 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper proposes a multi-scale unsupervised feature extraction network with structured layer-wise decomposition. Experimental results show that the approach is scalable, indicating strong potential for broader applicability in machine learning tasks beyond classification. However, I think some parts need to be improved. The details are listed as follows:
(1) In the introduction section, the author should list the contributions of the article item by item, so that readers can clearly understand the innovation and contributions of the paper.
(2) In the abstract, the author mentions that this approach is adaptable to various datasets and architectures. The proposed method employs three frameworks; however, it only utilizes one dataset, which does not sufficiently demonstrate the approach's applicability across various datasets. The authors are advised to conduct thorough research and incorporate an additional 2–3 representative datasets to fully demonstrate the method's versatility.
(3) In the Introduction section, the authors review the technical applications of deep learning and encoder-decoder architectures. However, the cited references are predominantly dated. We strongly encourage the author to incorporate cutting-edge literature from the last two years to ensure the state-of-the-art relevance of this research. For example, (1) SR-RDFAN-LOG: Arbitrary-scale logging image super-resolution reconstruction based on residual dense feature aggregation. https://doi.org/10.1016/j.geoen.2024.213042. (2) EMANet: An Ancient Text Detection Method Based on Enhanced-EfficientNet and Multidimensional Scale Fusion. https://doi.org/10.1109/JIOT.2024.3423667.
(4) In the Experimental Configurations section, λ1, λ2, and λ3 values for the proposed loss functions were assigned fixed values to maintain consistency and ensure stable convergence during training. It is recommended that the author provide the specific values in detail to allow readers to reproduce the results.
(5) The authors proposed variational loss functions to improve the network performance. It is recommended that the author incorporate ablation studies comparing the performance with and without the variational loss functions in the loss formulation to comprehensively demonstrate the effectiveness of this module.
(6) There are some grammatical and typographical issues throughout the paper that should be polished to improve readability. It is recommended that the author carefully read through the entire manuscript to rigorously revise and refine both logical flow and grammatical accuracy.

Author Response

Response to Reviewer 1 Comments

 

Thank you very much for taking the time to review this manuscript. Your assessments provided valuable guidance and gave us the opportunity to further refine the article. Please find below our detailed responses to the corresponding comments.

 

Comments 1: In the introduction section, the author should list the contributions of the article item by item, so that readers can clearly understand the innovation and contributions of the paper.

 

Response 1: Thank you for pointing this out. The contributions of the study have been listed at the end of the Introduction section on page 4, lines 136–147.

 

Comments 2: In the abstract, the author mentions that this approach is adaptable to various datasets and architectures. The proposed method employs three frameworks; however, it only utilizes one dataset, which does not sufficiently demonstrate the approach's applicability across various datasets. The authors are advised to conduct thorough research and incorporate an additional 2–3 representative datasets to fully demonstrate the method's versatility.

 

Response 2: Thank you for your valuable feedback. In response, we have added a medical image dataset to the study to demonstrate the model’s performance on high-resolution images (512×512) and in the context of image segmentation. Details of the new dataset and the corresponding segmentation results have been incorporated into the revised manuscript. The changes can be found on page 9, lines 308–315.

 

Comments 3: In the Introduction section, the authors review the technical applications of deep learning and encoder-decoder architectures. However, the cited references are predominantly dated. We strongly encourage the author to incorporate cutting-edge literature from the last two years to ensure the state-of-the-art relevance of this research. For example, (1) SR-RDFAN-LOG: Arbitrary-scale logging image super-resolution reconstruction based on residual dense feature aggregation. https://doi.org/10.1016/j.geoen.2024.213042. (2) EMANet: An Ancient Text Detection Method Based on Enhanced-EfficientNet and Multidimensional Scale Fusion. https://doi.org/10.1109/JIOT.2024.3423667.

 

Response 3: Thank you for your insightful suggestion. In response, we have updated the Introduction section by incorporating recent and relevant references from the past two years to strengthen the discussion of deep learning and encoder-decoder architectures. Specifically, we have added the suggested references along with other recent studies, which enhance the state-of-the-art context of our study. The changes can be found on pages 2–3, lines 80–110.

 

Comments 4: In the Experimental Configurations section, λ1, λ2, and λ3 values for the proposed loss functions were assigned fixed values to maintain consistency and ensure stable convergence during training. It is recommended that the author provide the specific values in detail to allow readers to reproduce the results.

 

Response 4: Thank you for your helpful comment. In response, we have updated Table 3 in the Experimental Configurations section to include the specific values of the loss function weights. These values were fixed to maintain consistency and ensure stable convergence during training and are now clearly reported to support reproducibility. The changes can be found in Table 3 on page 10.
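For readers aiming to reproduce this setup, the sketch below illustrates how fixed weights of this kind typically enter a composite training objective. It is a minimal illustration only: the weight values and term names are hypothetical placeholders, and the actual values are those reported in Table 3 of the revised manuscript.

```python
# Minimal sketch of a fixed-weight composite loss (illustrative only).
# LAMBDA_1..3 are hypothetical placeholders; the values used in the
# paper are the ones reported in Table 3 of the revised manuscript.
LAMBDA_1, LAMBDA_2, LAMBDA_3 = 1.0, 0.5, 0.1

def total_loss(recon_term, smooth_term, detail_term, residual_term):
    """Combine a reconstruction term with three branch-specific loss
    terms using fixed weights, kept constant across runs so that
    training is stable and results are reproducible."""
    return (recon_term
            + LAMBDA_1 * smooth_term
            + LAMBDA_2 * detail_term
            + LAMBDA_3 * residual_term)
```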

 

Comments 5: The authors proposed variational loss functions to improve the network performance. It is recommended that the author incorporate ablation studies comparing the performance with and without the variational loss functions in the loss formulation to comprehensively demonstrate the effectiveness of this module.

 

Response 5: Thank you for your constructive suggestion. In response, we have conducted an ablation study to evaluate the effectiveness of the proposed variational loss functions at each layer of the proposed method. Specifically, we applied the image segmentation task on a newly added CT medical image dataset and analyzed the impact of each loss component. The results, presented in the revised manuscript, demonstrate the contribution of each variational loss term by comparing model performance with and without them. This ablation study confirms the effectiveness of the proposed loss formulation in improving segmentation accuracy. The changes can be found on page 14, Table 5, lines 406–417.
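As a purely illustrative sketch of such a procedure (not the authors' actual code), an ablation of this kind can be organized by toggling each loss term independently and recording the segmentation metric for every configuration; train_and_eval below is a hypothetical callback.

```python
from itertools import product

def run_loss_ablation(train_and_eval):
    """Enumerate all on/off combinations of the three variational loss
    terms and collect the resulting segmentation score for each.

    train_and_eval(use_smooth, use_detail, use_residual) is assumed to
    train the model with the selected terms enabled and return a metric
    such as the Dice score.
    """
    results = {}
    for flags in product([True, False], repeat=3):
        results[flags] = train_and_eval(*flags)
    return results
```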

 

Comments 6: There are some grammatical and typographical issues throughout the paper that should be polished to improve readability. It is recommended that the author carefully read through the entire manuscript to rigorously revise and refine both logical flow and grammatical accuracy.

 

Response 6: Thank you for your observation. We have thoroughly revised the manuscript to address grammatical and typographical issues, and have carefully refined the language to improve readability, logical flow, and overall clarity. We appreciate your suggestion, which helped enhance the quality of the paper.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The authors propose a multi-scale unsupervised feature extraction method aimed at enhancing the representational capacity of grayscale or low-quality images, thereby improving the performance of downstream classification models. While the research topic is meaningful, several issues need to be addressed:

  1. The paper does not provide a separate analysis of the specific contributions of each branch (smooth/detail/residual) or each loss term to the final results. For instance, it lacks evidence showing how the performance changes when a particular module is removed. Appropriate ablation studies are missing.
  2. Although models such as ResNet18 are still widely used, they are not among the most recent architectures. The authors are encouraged to include comparisons with more up-to-date backbone networks to strengthen the experimental validation.
  3. The number of references cited is relatively limited. Incorporating more recent and relevant works would improve the paper’s comprehensiveness.
  4. Unsupervised feature extraction methods have broad applications, such as in remote sensing image retrieval [1]. It is recommended that the authors discuss such applications to broaden the scope of the paper.

[1] Unsupervised Remote Sensing Image Retrieval Using Probabilistic Latent Semantic Hashing

  5. The method is evaluated solely on the CIFAR-10 dataset, which consists of small images (32×32) and represents a relatively simple classification task. This narrow evaluation raises concerns about the method’s ability to capture more complex structures and its generalization capability.

Author Response

Response to Reviewer 2 Comments

 
Thank you very much for taking the time to review this manuscript. Your assessments provided valuable guidance and gave us the opportunity to further refine the article. Please find below our detailed responses to the corresponding comments.

 

Comments 1: The paper does not provide a separate analysis of the specific contributions of each branch (smooth/detail/residual) or each loss term to the final results. For instance, it lacks evidence showing how the performance changes when a particular module is removed. Appropriate ablation studies are missing.

 
Response 1: Thank you for your constructive suggestion. In response, we have conducted an ablation study to evaluate the effectiveness of the proposed variational loss functions. Specifically, we applied the image segmentation task on a newly added CT image dataset and analyzed the impact of each loss component. The results, presented in the revised manuscript, demonstrate the contribution of each variational loss term by comparing model performance with and without them. This ablation study confirms the effectiveness of the proposed loss formulation in improving segmentation accuracy. The changes can be found on page 9, lines 308–315, and on page 14, Table 5, lines 406–417.

 

Comments 2: Although models such as ResNet18 are still widely used, they are not among the most recent architectures. The authors are encouraged to include comparisons with more up-to-date backbone networks to strengthen the experimental validation.

 

Response 2: Thank you for your valuable feedback. While ResNet18 was used in the initial experiments, we have extended our study by incorporating a more recent and widely adopted architecture, U-Net, for the image segmentation task. U-Net is a strong baseline in medical image analysis and has demonstrated excellent performance in pixel-level prediction tasks. The experimental results using U-Net have been added to the revised manuscript, providing additional validation of our method’s effectiveness with a more modern backbone. The changes can be found on page 10, lines 357–368, and on page 14, Table 5, lines 406–417.

 

Comments 3: The number of references cited is relatively limited. Incorporating more recent and relevant works would improve the paper’s comprehensiveness. Unsupervised feature extraction methods have broad applications, such as in remote sensing image retrieval [1]. It is recommended that the authors discuss such applications to broaden the scope of the paper.

[1] Unsupervised Remote Sensing Image Retrieval Using Probabilistic Latent Semantic Hashing

 
Response 3: Thank you for your helpful suggestion. In response, we have expanded the reference list by incorporating additional recent and relevant works, including the recommended paper on unsupervised remote sensing image retrieval using probabilistic latent semantic hashing. Furthermore, we have added a discussion on the broader applications of unsupervised feature extraction methods, such as remote sensing image retrieval, to enhance the comprehensiveness and scope of the paper. The changes can be found on pages 2–3, lines 80–110.

 

Comments 4: The method is evaluated solely on the CIFAR-10 dataset, which consists of small images (32×32) and represents a relatively simple classification task. This narrow evaluation raises concerns about the method’s ability to capture more complex structures and its generalization capability.

 

Response 4: Thank you for your insightful comment. To address the concern regarding generalization and applicability to more complex tasks, we have extended our evaluation by incorporating a new 512×512 CT medical image dataset for image segmentation. This dataset presents higher-resolution inputs and more complex structural patterns compared to CIFAR-10. The additional experiments demonstrate the effectiveness and robustness of our method in capturing fine-grained details and generalizing to more challenging, real-world scenarios. The results and analysis have been included in the revised manuscript. The changes can be found on page 9, lines 308–315.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

This manuscript introduces a novel unsupervised multi-scale feature extraction framework based on a multi-branch autoencoder architecture. The proposed method decomposes input images into smooth, detailed and residual components. It introduces custom loss functions for each branch to enforce semantic separation and sparsity.

The Abstract is short yet comprehensive. The problem is introduced (effective feature extraction, particularly in scenarios with limited or unlabeled data), the methodology is described briefly, and the results are clear and scientifically justified. In the Introduction section, the authors thoroughly describe recent advancements in feature extraction using deep learning techniques and the latest scientific literature on novel machine-learning-based feature extraction frameworks. The Introduction ends with a very good description of the authors' scientific framework and the methodology they used to address the feature extraction problem with deep learning methods. In the Materials and Methods section, a well-structured explanation of the multi-branch autoencoder design is given. The division into smooth, detail, and residual layers is logically motivated and illustrated clearly with accompanying figures and tables. The mathematical formulation of the loss components is thorough and technically sound. The tables describing the encoder and decoder configurations provide enough detail (kernel sizes, strides, padding, etc.) for exact implementation. Using different regularizers (smoothness via squared gradient, edge-preserving via L1, sparsity via log) aligns with common image processing goals. The Experiments section describes the hardware and software configuration in a scientific manner. The choice of CIFAR-10 is justified, and the preprocessing steps (resizing, grayscaling, tensor conversion) are clearly stated. The use of accuracy, precision, recall, F1-score, and confusion matrices provides a comprehensive evaluation, and the confusion matrices highlight per-class improvements. The Discussion section effectively summarizes the study's core contributions. In the Conclusions, the authors sum up their findings and discuss possible areas of application and their future plans for this work.
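For concreteness, the following is a minimal PyTorch sketch of what the three regularizers named above might look like (squared gradient for smoothness, L1 on gradients for edge preservation, a log penalty for sparsity). It is one plausible reading of that description, not the authors' actual implementation.

```python
import torch

def grad_xy(x):
    """Forward differences along width and height of an (N, C, H, W) tensor."""
    dx = x[..., :, 1:] - x[..., :, :-1]
    dy = x[..., 1:, :] - x[..., :-1, :]
    return dx, dy

def smoothness_loss(smooth):
    # Squared-gradient penalty: discourages high-frequency content
    # in the smooth branch.
    dx, dy = grad_xy(smooth)
    return (dx ** 2).mean() + (dy ** 2).mean()

def edge_preserving_loss(detail):
    # L1 penalty on gradients (total-variation style): tolerates sharp
    # edges better than a squared penalty does.
    dx, dy = grad_xy(detail)
    return dx.abs().mean() + dy.abs().mean()

def sparsity_loss(residual):
    # Log penalty: a common sparsity-inducing surrogate that grows
    # slowly for large values, pushing most residual entries toward zero.
    return torch.log1p(residual ** 2).mean()
```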

However, I believe the authors should give an imaging example of the multi-branch autoencoder result, i.e., one or two examples of the final smooth, detail, and residual images. Moreover, the choice of an autoencoder over CNNs and Transformers is not well justified. Furthermore, what is the impact of each branch on the final output?

I also think that the Loss Functions section lacks references. The chosen loss functions should be justified with additional references so that their choice is scientifically objective. How robust are the hyperparameters to changes? That is, how does performance vary with different hyperparameter combinations? Perhaps a sensitivity analysis is needed.

The authors do not discuss anything about higher-resolution datasets (e.g., Fashion-MNIST, TinyImageNet, or grayscale medical imagery) or how their proposed method would respond to them.

The manuscript is written in good English; however, it needs improvement. For example, in the Abstract the letter "f" is almost always missing.

Line 179: "Substituting these approximations into equation ??". I think a number must replace the two question marks.

Also, in the Discussion and Conclusions some sentences are too generic, e.g., "suggests broad applicability across a variety of machine learning scenarios"; I think the authors should be more specific.

Author Response

Response to Reviewer 3 Comments

 

Thank you very much for taking the time to review this manuscript. Please find the detailed responses below and the corresponding revisions/corrections highlighted/in track changes in the re-submitted files.

 

Comments 1: However, I believe the authors should give an imaging example of the multi-branch autoencoder result, i.e., one or two examples of the final smooth, detail, and residual images. Moreover, the choice of an autoencoder over CNNs and Transformers is not well justified. Furthermore, what is the impact of each branch on the final output?

 

Response 1: We thank the reviewer for this valuable suggestion. To address this concern, we have added an ablation study on an image segmentation task using a newly introduced 512×512 CT medical image dataset. This experiment demonstrates the contribution of each branch (smooth, detail, and residual) to the final segmentation performance, highlighting their individual impact on both visual quality and quantitative accuracy. In addition, we now include visual examples of the outputs generated by each branch to provide clearer insight into how the multi-branch architecture contributes to feature decomposition and final reconstruction. The changes can be found in Figure 6 on page 14.

Regarding the choice of architecture, we selected a multi-branch autoencoder due to its interpretability, modularity, and computational efficiency in learning distinct and complementary representations such as smooth background structures, fine details, and residual information. While CNNs and Transformers are indeed powerful alternatives, autoencoders offer a more controllable latent space suitable for targeted decomposition, which aligns well with our design goal of explicitly separating image components. However, we appreciate the reviewer's point and recognize that a future direction may involve exploring hybrid architectures or Transformer-based autoencoders to further improve adaptability and performance. We have revised the manuscript to include these additions and clarifications. The changes can be found on page 4, lines 157–169.
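To make this design concrete, here is a minimal, hypothetical PyTorch sketch of a multi-branch autoencoder of the kind described: a shared encoder feeding three decoders whose outputs recombine into the reconstruction. The shared encoder, the layer sizes, and the additive recombination are assumptions for illustration; the actual configuration is the one given in the paper's encoder and decoder tables.

```python
import torch
import torch.nn as nn

class MultiBranchAutoencoder(nn.Module):
    """Illustrative three-branch autoencoder: a shared encoder followed
    by three decoders (smooth, detail, residual) whose outputs are summed
    to reconstruct the input. Layer sizes are placeholders, not the
    configuration from the paper."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )

        def make_decoder():
            return nn.Sequential(
                nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),
            )

        self.smooth_dec = make_decoder()
        self.detail_dec = make_decoder()
        self.residual_dec = make_decoder()

    def forward(self, x):
        z = self.encoder(x)
        smooth = self.smooth_dec(z)
        detail = self.detail_dec(z)
        residual = self.residual_dec(z)
        recon = smooth + detail + residual  # assumed additive decomposition
        return smooth, detail, residual, recon

# Usage example on a 512x512 grayscale image batch:
# model = MultiBranchAutoencoder()
# smooth, detail, residual, recon = model(torch.randn(1, 1, 512, 512))
```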

 

Comments 2: I also think that the Loss Functions section lacks references. The chosen loss functions should be justified with additional references so that their choice is scientifically objective. How robust are the hyperparameters to changes? That is, how does performance vary with different hyperparameter combinations? Perhaps a sensitivity analysis is needed.

 

Response 2: Thank you for your valuable feedback. We have added relevant references to the Loss Functions section to better justify the selection of the proposed loss terms, grounding their choice in established literature. Regarding hyperparameter sensitivity analysis, due to time and hardware constraints, we were unable to conduct a comprehensive study exploring the full range of hyperparameter combinations. However, we acknowledge the importance of such analysis for robustness evaluation and plan to include detailed sensitivity experiments in future work to thoroughly assess the impact of varying hyperparameters on model performance. The changes can be found on page 5, lines 180–195.

 

Comments 3: The authors do not discuss anything about higher-resolution datasets (e.g., Fashion-MNIST, TinyImageNet, or grayscale medical imagery) or how their proposed method would respond to them.

 

Response 3: Thank you for this valuable suggestion. To address this concern and evaluate the generalizability of our proposed method on higher-resolution and more complex data, we have extended our experiments by incorporating a new grayscale medical image dataset consisting of 512×512 CT images. This dataset was used to perform an image segmentation task, allowing us to demonstrate the effectiveness and scalability of our approach in a high-resolution, real-world medical context. The results are presented in the revised manuscript and support the applicability of our method beyond small-scale datasets like CIFAR-10. We believe this addition strengthens the experimental validation and provides a more comprehensive evaluation of the model’s performance. The changes can be found on page 9, lines 308–315.

 

Comments 4: The manuscript is written in good English; however, it needs improvement. For example, in the Abstract the letter "f" is almost always missing.

 

Response 4: Thank you for your observation. We carefully reviewed the entire manuscript, with particular attention to the Abstract, and did not identify any consistent issue with the “f” character being missing. This may have resulted from a rendering or font-related issue during the review process. Nonetheless, we have thoroughly proofread the manuscript and made several grammatical and formatting improvements to enhance overall clarity and readability.

The updated abstract:

“[Recent developments in deep learning have underscored the importance of effective feature extraction, particularly in scenarios with limited or unlabeled data. This study introduces a novel unsupervised multi-scale feature extraction framework based on a multi-branch auto-encoder architecture. The proposed method decomposes input images into smooth, detailed and residual components, using variational loss functions to ensure that each branch captures distinct and non-overlapping representations. This decomposition enhances the information richness of input data while preserving its structural integrity, making it especially beneficial for grayscale or low-resolution images. Experimental results on classification and image segmentation tasks show that the proposed method enhances model performance by enriching input representations. Its architecture is scalable and adaptable, making it applicable to a wide range of machine learning tasks beyond image classification and segmentation. These findings highlight the proposed method’s utility as a robust, general-purpose solution for unsupervised feature extraction and multi-scale representation learning.]”

 

Comments 5: Line 179: "Substituting these approximations into equation ??". I think a number must replace the two question marks.

 

Response 5: Thank you for your observation. We have corrected the issue, and the placeholder "??" has been replaced with the appropriate equation number. Specifically, it now correctly refers to Equation 2 in the revised manuscript. The changes can be found on page 6, line 224.

 

Comments 6: Also, in the Discussion and Conclusions some sentences are too generic, e.g., "suggests broad applicability across a variety of machine learning scenarios"; I think the authors should be more specific.

 

Response 6: Thank you for your valuable feedback. We agree that certain statements in the Discussion and Conclusions sections were overly general. In response, we have revised those sentences to be more specific and aligned with the empirical evidence presented in our study. For example, instead of stating that the method “suggests broad applicability across a variety of machine learning scenarios,” we now highlight concrete domains where our model demonstrated practical benefits, such as medical image segmentation and remote sensing. These revisions aim to clarify the scope and relevance of our contributions without overstating their generalizability.

The changes can be found on page 15, lines 438–443 and 451–456.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The revisions have enhanced the clarity, accuracy, and overall quality of the manuscript. I agree to accept this version.

Reviewer 2 Report

Comments and Suggestions for Authors

The authors have addressed all my comments well.
