Peer-Review Record

Machine Learning Techniques Improving the Box–Cox Transformation in Breast Cancer Prediction

Electronics 2025, 14(16), 3173; https://doi.org/10.3390/electronics14163173

by Sultan S. Alshamrani

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Reviewer 3: Mansoor Hayat

Submission received: 1 June 2025 / Revised: 29 July 2025 / Accepted: 6 August 2025 / Published: 9 August 2025

(This article belongs to the Special Issue Advanced Machine Learning, Pattern Recognition, and Deep Learning Technologies: Methodologies and Applications, 2nd Edition)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

1. This paper combines Box Cox transform with ML model to explore its application in breast cancer prediction, but we can consider introducing more types of feature transformation methods (such as logarithmic transformation, square root transformation, etc.) to compare and analyze the effects of different methods in processing skewed data and unbalanced medical data.
2. The article used synthetic datasets (based on gamma distribution) and SEER datasets for experimental validation, but further consideration could be given to using more real-world medical datasets, such as datasets containing more features and complex data structures, to more comprehensively evaluate the model's generalization ability.
3. The article can elaborate on how the proposed method can be integrated with existing clinical decision support systems based on practical clinical application scenarios. For example, it can explain how the model can monitor the condition changes of breast cancer patients in real time, and adjust the treatment plan in time according to the prediction results.
4. Multimodal algorithms have great inspiration and reference value for this task, such as multi-task learning for hand heat trace time estimation and identity recognition, deep soft threshold feature separation network for infrared handprint identity recognition and time estimation.
5. The article can further explore the applicability of the model under different data distributions and sample sizes. For example, analyze whether the model can provide stable predictive performance in small sample sizes or extremely imbalanced data distributions, and propose corresponding optimization strategies.

Comments on the Quality of English Language

The English could be improved to more clearly express the research.

Author Response

First, we would like to thank Reviewer #1 for his/her insightful comments and suggestions to improve the overall manuscript's innovation and structure.

Here, we have addressed all the comments the reviewer has suggested and tried to answer them one by one. Accordingly, the manuscript is revised based on the reviewer's suggestions and recommendations.
Thanks!

Comment 1:

This paper combines the Box-Cox transformation with an ML model to explore its application in breast cancer prediction, but we can consider introducing more types of feature transformation methods (such as logarithmic transformation, square root transformation, etc.) to compare and analyze the effects of different methods in processing skewed data and unbalanced medical data.

Author's Response

We greatly appreciate the reviewer’s suggestion to include additional feature transformation methods, such as logarithmic and square root transformations, for comparison with the Box-Cox transformation. We agree that incorporating a broader range of transformations would enhance the analysis of how different methods handle skewed and imbalanced data, particularly in medical datasets like breast cancer prediction.

In our study, the primary focus was on the Box-Cox transformation, as it is specifically designed to stabilize variance and normalize skewed data, which aligns well with the characteristics of the datasets used (both synthetic and SEER). We chose the Box-Cox transformation due to its flexibility in transforming data to approximate a normal distribution through the adjustment of the lambda parameter (λ). Our results demonstrate a significant improvement in model performance with the Box-Cox transformation, particularly when the λ parameter is optimized.

However, we acknowledge the potential value in exploring additional transformations.

We have extended our analysis by including a logarithmic transformation, so that the effect of the preprocessing techniques on model performance can be compared. This appears in Section 3.2.2 and in Section 4.3, "Scenario C: Apply AI machine learning models on the SEER using the logarithmic transformation for continuous values". In Section 4.5, I have added a comparison between the Box-Cox transformation and the logarithmic transformation.

In future work, we plan to extend our analysis by including square root transformations with other methods as part of a comprehensive comparison. This would allow us to evaluate the impact of various preprocessing techniques on model performance, particularly on datasets with skewed distributions and class imbalances, and would provide further insights into the most effective strategies for improving predictive accuracy in breast cancer prediction.
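The comparison discussed in this response can be illustrated with a minimal sketch. It is not the authors' code: the gamma shape and scale, the sample size, and the fixed value λ = 0.3 are assumptions chosen only for illustration, and the helper names `box_cox` and `skewness` are hypothetical. The sketch measures how much skewness remains after a logarithmic, square root, and Box-Cox transformation of the same right-skewed sample.

```python
import math
import random

def box_cox(values, lam):
    """Box-Cox transform of strictly positive values for a given lambda."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]

def skewness(values):
    """Skewness as the third standardized moment (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

random.seed(0)
# Hypothetical right-skewed sample; shape and scale are illustrative only.
sample = [random.gammavariate(2.0, 2.0) for _ in range(5000)]

skew_by_method = {
    "raw": skewness(sample),
    "log": skewness(box_cox(sample, 0.0)),  # lambda = 0 is the log transform
    "sqrt": skewness([math.sqrt(v) for v in sample]),
    "box-cox (lambda=0.3)": skewness(box_cox(sample, 0.3)),
}
for name, value in skew_by_method.items():
    print(f"{name:>22}: skewness = {value:+.3f}")
```

On data like this, the logarithm tends to overcorrect into negative skew, while an intermediate λ can land closer to symmetry. That is the flexibility the response attributes to the λ parameter. Whether it carries over to the SEER features is an empirical question this sketch does not answer.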

Comment 2:

The article used synthetic datasets (based on gamma distribution) and SEER datasets for experimental validation, but further consideration could be given to using more real-world medical datasets, such as datasets containing more features and complex data structures, to more comprehensively evaluate the model's generalization ability.

Thank you for your valuable observation regarding the datasets used in our experimental validation. In response to your suggestion to include more real-world medical datasets with greater complexity, we have incorporated an ablation study (Section 4.6) using a clinical and biochemical dataset from 166 participants. This dataset includes diverse real-world features such as age, BMI, glucose, insulin, HOMA, leptin, adiponectin, and resistin, which reflect typical variability found in clinical settings.

The ablation study was designed to assess the generalizability and robustness of the proposed model across varying data characteristics. The inclusion of this dataset provides a realistic scenario of breast cancer risk prediction and offers deeper insights into how different machine learning models perform with real-world, non-synthetic inputs. The performance metrics from this study support the effectiveness of our model and demonstrate its adaptability to more complex and heterogeneous medical data structures.

Comment 3:

The article can elaborate on how the proposed method can be integrated with existing clinical decision support systems based on practical clinical application scenarios. For example, it can explain how the model can monitor the condition changes of breast cancer patients in real time and adjust the treatment plan in time according to the prediction results.

We thank the reviewer for their valuable suggestion regarding the practical integration of the proposed model into clinical decision support systems (CDSS). In light of this, we have expanded our discussion to clarify the real-world applicability of the framework, particularly in the context of patient monitoring and treatment adjustment.

The updated manuscript incorporates an ablation study based on a real clinical-biochemical dataset consisting of features such as BMI, glucose, insulin, HOMA, leptin, and adiponectin, which are routinely collected in clinical settings. These features can be seamlessly integrated into electronic health record (EHR) systems, allowing the model to operate dynamically and responsively.

In a practical clinical application, the ensemble stacking model, demonstrated to be the most accurate in our study (91.88% accuracy), can serve as a real-time predictive module. As patient data is updated during follow-up visits or laboratory testing, the model can reevaluate breast cancer risk or monitor potential recurrence. This enables physicians to make timely decisions, such as intensifying surveillance, recommending lifestyle modifications, or adjusting treatment protocols according to the risk predictions.

Furthermore, due to the use of interpretable features and transparent model structures, the system can provide actionable insights while maintaining clinical interpretability. This ensures that the model not only performs well in terms of metrics but also aligns with the practical requirements of decision-making in oncology.

Future work will explore the model’s deployment within a simulated CDSS environment, with emphasis on real-time data integration, patient-specific risk trajectories, and clinician feedback loops.

Comment 4:

Multimodal algorithms have great inspiration and reference value for this task, such as multi-task learning for hand heat trace time estimation and identity recognition, deep soft threshold feature separation network for infrared handprint identity recognition and time estimation.

We thank the reviewer for the insightful observation regarding the relevance of multimodal algorithms and multi-task learning approaches, such as those used in hand heat trace time estimation and infrared handprint identity recognition. These advanced methods indeed offer valuable inspiration for improving the robustness and versatility of medical predictive models.

Our current study primarily focused on the impact of data transformation, specifically, the Box-Cox transformation, on the predictive performance of ML models using structured tabular clinical and biochemical data. While our approach is unimodal and centered on preprocessing improvements for structured data, we agree that incorporating multimodal and multi-task frameworks could further enhance model adaptability and precision, particularly in real-world healthcare environments where data often spans multiple modalities (e.g., imaging, genomics, EHRs).

In light of this, we have added a note in the future work section highlighting the potential of extending our framework to support multimodal integration and multi-task learning strategies. This direction could enable the simultaneous handling of heterogeneous data sources and prediction tasks (e.g., risk classification and prognosis estimation), thereby advancing the clinical utility of our model.

Comment 5:

The article can further explore the applicability of the model under different data distributions and sample sizes. For example, analyze whether the model can provide stable predictive performance in small sample sizes or extremely imbalanced data distributions, and propose corresponding optimization strategies.

We sincerely appreciate the reviewer’s valuable suggestion regarding the evaluation of model performance under varying data distributions and sample sizes. Indeed, assessing the robustness of machine learning models in scenarios involving small samples or extreme class imbalance is critical for their practical deployment, particularly in clinical settings where such conditions are common.

To partially address this, our study utilized both a synthetically generated gamma-distributed dataset and the real-world SEER dataset, which presents significant class imbalance, to evaluate the impact of the Box-Cox transformation across diverse data conditions. The consistent improvement in model performance across these datasets demonstrates the potential stability of our approach. Nonetheless, we acknowledge that further exploration is warranted.

In the revised manuscript, we have expanded the results and discussion section with the addition of Section 4.4, "Scenario (D): Apply AI machine learning models on the SEER using SMOTE augmentation", to address the importance of evaluating model generalizability under limited data and high imbalance scenarios. In this scenario, the Box-Cox transformation demonstrated superior effectiveness compared to SMOTE, likely due to the pronounced imbalance in the dataset, with a significantly higher number of "Alive" cases (3408) relative to "Dead" cases (616).
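To make the imbalance figures above concrete, here is a minimal SMOTE-style sketch. It is an illustration under stated assumptions, not the authors' pipeline. The class counts (3408 and 616) come from the response, but the two-feature minority sample is synthetic and hypothetical, and `smote_like` is a simplified stand-in for a full SMOTE implementation.

```python
import random

def smote_like(minority, n_new, k=5, rng=None):
    """Create n_new synthetic points, each interpolated between a random
    minority sample and one of its k nearest minority neighbours."""
    if rng is None:
        rng = random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours by squared Euclidean distance, excluding base.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic

n_alive, n_dead = 3408, 616          # class counts quoted in the response
imbalance_ratio = n_alive / n_dead   # majority cases per minority case
n_needed = n_alive - n_dead          # synthetic cases needed for balance

rng = random.Random(42)
# Hypothetical two-feature stand-in for the "Dead" class.
minority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n_dead)]
synthetic = smote_like(minority, n_needed, k=5, rng=rng)

print(f"imbalance ratio: {imbalance_ratio:.2f}")
print(f"synthetic minority cases generated: {len(synthetic)}")
```

Balancing these classes means generating several synthetic minority cases for every real one, all interpolated from the same 616 records. That may be one reason oversampling adds limited new information here, although the response does not test this explanation directly.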

Comment 6:

The English could be improved to more clearly express the research.

Author's Response:

We sincerely thank the reviewer for the valuable observation regarding the clarity of the English language used in the manuscript. In response, we have thoroughly reviewed and revised the entire manuscript to enhance grammatical accuracy, sentence structure, and overall readability.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The article addresses the challenge of mitigating the negative impact of inherent medical data complexity on the predictive performance of machine learning models in breast cancer prediction. To tackle this issue, the study proposes a methodology combining Box-Cox transformation with machine learning models to enhance predictive accuracy. The research demonstrates strong potential for medical application, with a clear and well-structured rationale and comprehensive content. To further improve the readability and quality of the manuscript, the following suggestions are provided:

It is recommended to upload the specific code to a publicly accessible database and include a DOI link in the paper to ensure transparency and reproducibility.
The conclusion should elaborate on the practical application scenarios of the proposed methodology and the associated technical requirements, including any cost-related considerations.
In the section on research limitations, it would be valuable to discuss potential issues or limitations associated with the application of Box-Cox transformation in combination with machine learning models, providing a more balanced and comprehensive perspective.

These modifications will help strengthen the manuscript and ensure its alignment with the high standards expected in scientific publishing.

Author Response

First, we would like to thank Reviewer #2 for his/her insightful comments and suggestions to improve the overall manuscript's innovation and structure.

Comment 1:

It is recommended to upload the specific code to a publicly accessible database and include a DOI link in the paper to ensure transparency and reproducibility.

Author's Response

I appreciate the reviewer’s suggestion to upload the code to a publicly accessible repository to ensure transparency and reproducibility of our work.

In line with this recommendation, I have made the code available on GitHub at this link:

https://github.com/susamash/Machine-Learning-Technics-Improving-Box-Cox-Transformation-in-Breast-Cancer-Prediction

Comment 2:

The conclusion should elaborate on the practical application scenarios of the proposed methodology and the associated technical requirements, including any cost-related considerations.

Thank you for your valuable feedback. I appreciate the suggestion to elaborate on the practical application scenarios, technical requirements, and cost-related considerations of the proposed methodology.

In the revised conclusion, I have included a detailed discussion of how the Box-Cox transformation and the associated ML models can be applied in real-world healthcare settings, particularly for breast cancer prediction.

Additionally, I have outlined the technical requirements of implementing this methodology, such as computational resources, to help clarify the feasibility of this approach.

Comment 3:

In the section on research limitations, it would be valuable to discuss potential issues or limitations associated with the application of Box-Cox transformation in combination with machine learning models, providing a more balanced and comprehensive perspective.

Thank you for your valuable feedback. I appreciate your suggestion to discuss the limitations associated with the application of the Box-Cox transformation in combination with machine learning models.

In response, I've revised the limitations section to give a more comprehensive perspective. Specifically, I have added details on the challenges associated with selecting an optimal lambda value, as its selection can significantly impact the performance of the model. I also highlighted the potential computational complexity involved when applying the Box-Cox transformation to large datasets.
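The two limitations named above, λ selection and computational cost, can be seen together in a short sketch. It assumes the standard profile log-likelihood criterion for Box-Cox and a plain grid search. The grid bounds, step size, and gamma sample are illustrative assumptions, and the manuscript may select λ differently.

```python
import math
import random

def box_cox(values, lam):
    """Box-Cox transform of strictly positive values."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]

def profile_log_likelihood(values, lam):
    """Box-Cox profile log-likelihood, up to an additive constant:
    -n/2 * log(variance of transformed data) + (lam - 1) * sum(log x)."""
    n = len(values)
    y = box_cox(values, lam)
    mean = sum(y) / n
    var = sum((v - mean) ** 2 for v in y) / n
    return -0.5 * n * math.log(var) + (lam - 1.0) * sum(math.log(v) for v in values)

def best_lambda(values, grid):
    """Pick the grid value with the highest profile log-likelihood."""
    return max(grid, key=lambda lam: profile_log_likelihood(values, lam))

random.seed(1)
# Hypothetical skewed feature; parameters are illustrative only.
sample = [random.gammavariate(2.0, 2.0) for _ in range(2000)]
grid = [i / 20 for i in range(-40, 41)]  # -2.0 to 2.0 in steps of 0.05
lam_hat = best_lambda(sample, grid)
print(f"selected lambda: {lam_hat:.2f}")
```

Each candidate λ requires a full pass over the feature, so the cost of this search grows with the number of rows times the number of grid points, and it is repeated for every transformed feature. This is one plausible reading of the computational concern raised in the response.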

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The approach simply stacks two 2-D backbones (DeepLabV3+ and YOLOv8); clarify the technical advance of the “Multi-Head” branch and explain why a unified 3-D encoder-decoder was not explored.
IntrA and Kaggle aneurysm data are volumetric, yet the pipeline is slice-wise—add a head-to-head comparison with 3-D baselines or justify how inter-slice context is preserved.
Patient-level train/validation/test splits are not reported, risking data leakage; provide subject counts and a CONSORT-style diagram with exclusive partitions.
Dice = 0.08 alongside IoU ≈ 0.68 is mathematically impossible (Dice ≈ 2·IoU∕(1+IoU) ≈ 0.81); recalculate all metrics and release the code used.
Gains over baselines are ≤2 %; report 95 % bootstrap confidence intervals or McNemar p-values to back “significant improvement” claims.
Replace or augment DeepLabV3+ with a lightweight GhostUNet variant (e.g., Attention GhostUNet++, arXiv:2504.11491 and https://doi.org/10.1016/j.procs.2023.01.110) to improve fine vessel delineation while keeping parameters low.
Equations (10)–(17) introduce symbols without dimensions—collect all variables in a notation table for clarity.
“Code available on request” is insufficient; release preprocessing scripts, trained weights, and metric calculators under an open licence to enable reproducibility.
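The Dice and IoU relationship cited in the comments above can be checked directly. The sketch below is illustrative only: the toy masks are hypothetical sets of pixel indices, constructed so that IoU equals 0.68.

```python
def dice_from_iou(iou):
    """Dice implied by IoU for a single prediction/ground-truth pair."""
    return 2.0 * iou / (1.0 + iou)

def iou_and_dice(pred, truth):
    """Compute IoU and Dice from two sets of foreground pixel indices."""
    inter = len(pred & truth)
    union = len(pred | truth)
    return inter / union, 2.0 * inter / (len(pred) + len(truth))

implied_dice = dice_from_iou(0.68)

# Hypothetical masks: 68 shared pixels out of 100 in the union.
pred = set(range(0, 84))
truth = set(range(16, 100))
iou, dice = iou_and_dice(pred, truth)

print(f"Dice implied by IoU = 0.68: {implied_dice:.4f}")
print(f"toy masks: IoU = {iou:.2f}, Dice = {dice:.4f}")
```

For any single pair of masks, Dice is at least as large as IoU, so averaging over images cannot produce a mean Dice far below the mean IoU. A reported Dice of 0.08 next to an IoU near 0.68 therefore points to a calculation or reporting error.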

Author Response

First, we would like to thank Reviewer #3 for his/her insightful comments and suggestions to improve the overall manuscript's innovation and structure.

However, we believe that the comments provided do not directly pertain to the content of our study.

Our paper focuses on the impact of Box-Cox pre-processing techniques in conjunction with AI model predictions on synthetic and SEER datasets. The comments you provided seem to be more relevant to a different manuscript involving 3D encoder-decoder networks, stacked 2D backbones (DeepLabV3+ and YOLOv8), volumetric data analysis, CONSORT-style diagrams, Dice = 0.08 alongside IoU ≈ 0.68, and arguments about DeepLabV3+. In addition, our manuscript does not contain 17 equations.

You may wish to revise the comments, which are reproduced below.

Thanks!

Comment 1:

Comments and Suggestions for Authors

The approach simply stacks two 2-D backbones (DeepLabV3+ and YOLOv8); clarify the technical advance of the “Multi-Head” branch and explain why a unified 3-D encoder-decoder was not explored.

IntrA and Kaggle aneurysm data are volumetric, yet the pipeline is slice-wise—add a head-to-head comparison with 3-D baselines or justify how inter-slice context is preserved.

Patient-level train/validation/test splits are not reported, risking data leakage; provide subject counts and a CONSORT-style diagram with exclusive partitions.

Dice = 0.08 alongside IoU ≈ 0.68 is mathematically impossible (Dice ≈ 2·IoU∕(1+IoU) ≈ 0.81); recalculate all metrics and release the code used.

Gains over baselines are ≤2 %; report 95 % bootstrap confidence intervals or McNemar p-values to back “significant improvement” claims.

Replace or augment DeepLabV3+ with a lightweight GhostUNet variant (e.g., Attention GhostUNet++, arXiv:2504.11491 and https://doi.org/10.1016/j.procs.2023.01.110) to improve fine vessel delineation while keeping parameters low.

Equations (10)–(17) introduce symbols without dimensions—collect all variables in a notation table for clarity.

“Code available on request” is insufficient; release preprocessing scripts, trained weights, and metric calculators under an open licence to enable reproducibility.
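One of the comments above asks for McNemar p-values to support claims of significant improvement. A minimal sketch of the exact two-sided McNemar test follows. The discordant counts are hypothetical and chosen only to contrast a small gain with a clear one; `mcnemar_exact` is an illustrative helper, not a library function.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts.

    b: cases the baseline classified correctly and the new model did not
    c: cases the new model classified correctly and the baseline did not
    """
    n = b + c
    if n == 0:
        return 1.0
    # Binomial(n, 0.5) probability of a count at most min(b, c).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Hypothetical discordant counts, for illustration only.
p_small_gain = mcnemar_exact(b=18, c=25)
p_clear_gain = mcnemar_exact(b=5, c=25)

print(f"small gain: p = {p_small_gain:.3f}")
print(f"clear gain: p = {p_clear_gain:.5f}")
```

Only the cases on which the two models disagree enter the test, which is why a gain of a couple of percentage points on a modest test set can fail to reach significance.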

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Incorporating multimodal and multi-task frameworks could further enhance model adaptability and precision. Multimodal algorithms can be mentioned as a feasible direction, such as GD-YOLO: A lightweight model for household waste image detection, multi-task learning for hand heat trace time estimation and identity recognition, and deep soft threshold feature separation network for infrared handprint identity recognition and time estimation.
The generalization and convergence of machine learning are very important. Please provide mathematical proof or theoretical explanation for the convergence and generalization of the algorithm proposed in this article.

Comments on the Quality of English Language

The English could be improved to more clearly express the research.

Author Response

Please see the Word file.

Author Response File: Author Response.docx

Reviewer 3 Report

Comments and Suggestions for Authors

No more comments.

Author Response

Thanks so much

Round 3

Reviewer 1 Report

Comments and Suggestions for Authors

Multimodal algorithms have great inspiration and reference value for this task. Multimodal algorithms can be mentioned as a feasible direction, such as GD-YOLO: A lightweight model for household waste image detection, multi-task learning for hand heat trace time estimation and identity recognition, and deep soft threshold feature separation network for infrared handprint identity recognition and time estimation.
The generalization and convergence of machine learning are very important. Please provide mathematical proof or theoretical explanation for the convergence and generalization of the algorithm proposed in this article.

Comments on the Quality of English Language

The English could be improved to more clearly express the research.

Author Response

Thanks, please see the Word file.

Author Response File: Author Response.docx
