Peer-Review Record

Comparative Analysis of Machine Learning and Deep Learning Models for Lung Cancer Prediction Based on Symptomatic and Lifestyle Features

Appl. Sci. 2025, 15(8), 4507; https://doi.org/10.3390/app15084507

by Bireswar Dutta

Reviewer 1:

Arrvind Raghunath

Reviewer 2: Anonymous

Submission received: 24 March 2025 / Revised: 17 April 2025 / Accepted: 18 April 2025 / Published: 19 April 2025

(This article belongs to the Special Issue Artificial Intelligence Applications in Healthcare and Precision Medicine)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The main question addressed by the article is the accuracy of various machine learning models in the diagnosis of lung cancer.

AI-related research is much needed in cancer care, and this report adds to the literature on the potential widespread applications of machine learning models in the most common malignancy worldwide.

The conclusions are mostly consistent with the data presented, and there are no ethical concerns. The references are appropriate.

Comments:

Line 34: I would not classify small cell lung cancer as slow-growing; it is actually an aggressive malignancy.

Lines 55 and 96 state that therapeutic results deteriorate markedly if the condition goes untreated for more than 3 years. Most lung cancer patients do not survive if left untreated for more than 6 months. Lung cancer that goes untreated for >3 years is quite rare, and oncologists hardly see this scenario in their entire careers. This will need to be revised.

Table 1, last line: 'patience' needs to be changed to 'patient'.

Swallowing difficulty and a history of alcohol consumption may not be common features of lung cancer, and these variables may need to be omitted. Line 384 states that shortness of breath is less significant; in fact, this is the most common symptom seen in this disease. I think it should be included as a variable.

Author Response

1. Line 34: I would not classify small cell lung cancer as slow-growing; it is actually an aggressive malignancy.

Ans. Thank you for your suggestion. I modified the sentence, which can be found in Lines 34-36.

Lung cancer encompasses a spectrum…and therapeutic challenges [3]. Line 34-36

2. Lines 55 and 96 state that therapeutic results deteriorate markedly if the condition goes untreated for more than 3 years. Most lung cancer patients do not survive if left untreated for more than 6 months. Lung cancer that goes untreated for >3 years is quite rare, and oncologists hardly see this scenario in their entire careers. This will need to be revised.

Ans. Thank you for your suggestion. I modified the sentence, which can be found in Lines 54-57.

The sentence in Line 96 was removed, as the point is already made in Lines 54-57.

The evidence demonstrates… significantly more challenging [7,8,10]. Line 54-57

3. Table 1, last line: 'patience' needs to be changed to 'patient'.

Ans. Thank you for your suggestion. I corrected the mistake, which can be found in Table 2.

The patient has lung cancer

4. Swallowing difficulty and a history of alcohol consumption may not be common features of lung cancer, and these variables may need to be omitted. Line 384 states that shortness of breath is less significant; in fact, this is the most common symptom seen in this disease. I think it should be included as a variable.

Ans. Thank you for your suggestion. I modified the text, which can be found in Lines 474-483.

Finally, literature indicated shortness… a threshold of 0.15 for inclusion. Line 474-476

Shortness of breath was eliminated … was below the required threshold. Line 476-478

On the other hand, swallowing difficulty… and retained for the analysis. Line 478-481

Future studies should consider other… (PCA), to confirm the outcome. Line 481-483

Reviewer 2 Report

Comments and Suggestions for Authors

The title is too generic; revise it to reflect the methodology or specific contribution.
The abstract lacks clear problem articulation and specific technical details—please revise.
Keywords are too broad; replace them with more specific and domain-relevant terms.
Acronyms (e.g., ML, DL) should be introduced once and used consistently thereafter. Follow this pattern, e.g., machine learning (ML), throughout the manuscript.
The problem statement is not clearly defined in the abstract and introduction—please clarify.
The literature review is shallow. Add at least 5–7 recent studies for both ML and DL in lung disease diagnosis and include a comparative table summarizing related works with datasets, models, and results.
The methodology section is generic. Provide a stronger approach and compare with prior studies to show novelty.
The use of Weka is not well-justified—clearly state what parts were automated and what was custom-simulated.
Add statistical analysis of data features to improve understanding of the dataset.
Justify the use of Pearson correlation for feature selection—why not explore other state-of-the-art methods?
Clarify how the three different learning rates were selected—was any tuning method applied?
Figure 2 needs to be resized, smoothed, and improved in quality.
Redraw Figure 3 with better flowchart formatting; remove # and correct arrow placements.
Figures 4, 5, and 6 are blurred—redraw them or improve resolution.
Use a table to show neural network configurations (e.g., layers, neurons) instead of multiple blurry figures.
In Line 290–291, the formula for precision and recall has double brackets—clarify if intentional.
Combine Figures 7 and 8 into a single plot or label subplots as a, b, c, d for clarity.
Figure 9 seems unnecessary—move it to the methodology section or remove it.
Subsection headings should be more specific rather than generic.
Avoid repeating acronyms like “Machine Learning (ML)” multiple times after the first definition.
Provide a comprehensive comparison table with previous studies and clearly highlight how your approach outperforms them.
The conclusion is weak—restructure it to reflect contributions, key findings, and future work.

Author Response

The title is too generic; revise it to reflect the methodology or specific contribution.

Ans. Thank you for your suggestion. I modified the title of the study.

The abstract lacks clear problem articulation and specific technical details—please revise.

Ans. Thank you for your suggestion. I rewrote the abstract, which can be found in Lines 10-25.

Lung cancer is a critical global…effective early detection techniques. Line 10-11

This research seeks to improve… and lifestyle factor data from Kaggle. Line 11-13

The data preprocessing steps… and Support Vector Machines. Line 13-18

Neural Network (NN) was evaluated… 80% train/test splitting method. Line 18-21

NN model was implemented with…improve accuracy and reduce noise. Line 21-23

ML models used in the study… learning rate, attained 92.86% accuracy. Line 23-25

Keywords are too broad; replace them with more specific and domain-relevant terms.

Ans. Thank you for your suggestion. I modified the keywords, which can be found in Lines 26-27.

Lung cancer prediction; Machine learning… Correlation matrix. Line 26-27

Acronyms (e.g., ML, DL) should be introduced once and used consistently thereafter. Follow this pattern, e.g., machine learning (ML), throughout the manuscript.

Ans. Thank you for your suggestion. I now use the acronyms ML and DL consistently throughout the paper.

The problem statement is not clearly defined in the abstract and introduction—please clarify.

Ans. Thank you for your suggestion. I rewrote the abstract and introduction to clarify the problem statement; the changes can be found in Lines 10-25 and 58-81.

Lung cancer is a critical global…effective early detection techniques. Line 10-11

This research seeks to improve… and lifestyle factor data from Kaggle. Line 11-13

The data preprocessing steps… and Support Vector Machines. Line 13-18

Neural Network (NN) was evaluated… 80% train/test splitting method. Line 18-21

NN model was implemented with…improve accuracy and reduce noise. Line 21-23

ML models used in the study… learning rate, attained 92.86% accuracy. Line 23-25

In the last few years, ML and data… decision-making accuracy [8,9]. Lines 58-60

Researchers have utilized ML and… at the onset of treatment [6,7]. Lines 60-63

However, selecting an appropriate… its relation to patient habits [3,5]. Lines 63-64

ML automates disease prediction… patterns from large datasets [8,11]. Lines 65-66

Therefore, it is critical to develop… and facilitate timely intervention. Lines 67-68

This study systematically… factors aiming to address the concerns. Lines 68-70

This research aims to identify the ML… to fulfill the aims of the study: Lines 71-75

What are the predictive accuracy…performs best across these metrics? Lines 76-78

How does feature selection affect… ML lung cancer detection accuracy? Line 79

Do DL methods, such as NN, outperform… in lung cancer prediction? Lines 80-81

The literature review is shallow. Add at least 5–7 recent studies for both ML and DL in lung disease diagnosis and include a comparative table summarizing related works with datasets, models, and results.

Ans. Thank you for your suggestion. I rewrote the literature review section.

Table 1 summarizes recently published studies. Table 10 compares the current study with recently published studies. The rest of the description can be found in Lines 91-138.

Machine learning has become an…and classification of disease subtypes. Lines 91-94

Maurya et al. (2024) [1] conducted… for early lung cancer prediction. Lines 95-98

Khanam and Foo (2021) [11] compared…for lung disease diagnosis. Lines 98-101

Protić et al. (2023) [12] explored… efficient diagnostic models. Lines 102-104

Dudáš (2024) [13] investigated… that contribute to lung diseases. Lines 104-107

Patra (2020) [14] utilized ML… with an accuracy of 81.25%. Lines 107-110

Radhika et al. (2019) [15] performed… SVM performing second best. Lines 110-113

Conversely, NB exhibited the … individual and ensemble classifiers. Lines 113-117

DL, a subfield of artificial intelligence… promise in this field. Lines 119-122

Esteva et al. (2021) [17] provided a… and the cultivation of trust. Lines 122-126

Alzubaidi et al. (2021) [8] reviewed…used in lung disease diagnosis. Lines 126-128

Vieira et al. (2021) [18] used a data… risk factor for lung cancer. Lines 129-132

Chakraborty et al. (2024) [6] discussed… utilized in healthcare. Lines 132-135

Liu et al. (2023) [9] explored medical… of lung disease diagnosis. Lines 135-138

The methodology section is generic. Provide a stronger approach and compare with prior studies to show novelty.

Ans. Thank you for your suggestion. I rewrote the methodology section, which can be found in Lines 156-327.

A robust and well-defined methodology…employed in this study. Line 156-159

Furthermore, it goes beyond a…the current study's methodology. Line 159-162

This study utilized a lung cancer…connected to Google LLC. Line 164-166

While some studies focus primarily…diagnosis [17,21]. Line 166-168

Unlike Patra (2020) [14], which…larger sample size for analysis. Line 168-170

The 'Lung_Cancer' attribute is… 2 to represent patients without it. Line 170-173

A detailed statistical analysis of the… data's inherent characteristics. Line 174-176

These insights are crucial for… the model's performance. Line 176-179

By understanding the data's properties… feature predictiveness. Line 179-182

Weka (Waikato Environment for… to automate several ML tasks. Line 183-184

Weka provides pre-built…of machine learning classifiers [11]. Line 184-188

The utilization of Weka facilitated… common ML workflow steps. Line 189-190

However, NN models were custom…the deep learning architectures. Line 190-192

This hybrid approach allowed us… and Python's customizability. Line 192-193

The NN uses Python programming… Notebook environment. Line 194

The data has been primarily analyzed… between ages 55 and 75. Line 200-202

Data preprocessing transforms raw data… while 39 are negative. Line 204-208

The current study uses… 276 data were included in the next phase. Line 210-212

Unlike Radhika et al. [15], which…data was found in the dataset. Line 214-216

The “InterquartileRange” filter of Weka… and extreme values. Line 218-219

Pearson's correlation technique… and the target variable [19]. Line 221-223

It also provides a straightforward… relevant linear relationships. Line 223-227

The coefficient, ranging from… suggests a significant association. Line 227-229

Weka's “correlation” filter determines…of the correlation coefficient. Line 229-231

The value 0.15 was used as a threshold… dysphagia, and allergies. Line 231-234

In line with Khanam and Foo [11]… of 0 to 1 using min-max scaling. Line 238-239

The "Normalize" filter in Weka was…mean value after normalization. Line 239-241

Following preprocessing, the dataset… and 238 with lung cancer. Line 244-245

Figure 2 shows the correlation… indicating the highest correlation. Line 246-248

The dataset was divided into 80% for… set for evaluation. Line 255-257

We further employed 10-fold… imbalanced datasets like this study. Line 257-261

In the K-fold cross-validation…the outcomes from all K iterations. Line 261-264

We implemented the ML algorithms… tuning using grid search. Line 266-268

Unlike Alzubaidi et al. (2021) [8], which… architectural comparison. Line 268-271

We evaluated several ML algorithms… for classification tasks. Line 271-275

We developed three distinct NN… the results were compared. Line 279-281

Within a NN, the activation function… sigmoid and ReLU. Line 281-283

The NN models were developed by…during backpropagation. Line 283-286

Additionally, Stochastic Gradient Descent… on the loss gradient. Line 286-288

We conducted experiments with… partitioning train and test data. Line 289-291

For K-fold cross-validation, we utilized…nature of the target variable. Line 291-293

The NN model was constructed… layer represent the eight attributes. Line 295-296

The hidden layer consists of a…model with a single hidden layer. Line 296-299

A four-layer NN model was… as the single-hidden-layer model. Line 303-305

The second layer comprises 41 hidden… with two hidden layers. Line 305-308

The input and output layers…layer Neural Network (NN). Line 312-313

The second, third, and fourth… NN model is depicted in Figure 6. Line 313-316

Table 5 provides a detailed… NN models used in this study. Line 319-320

The learning rates of 0.1, 0.01, and… convergence and performance. Line 323-326

While a more complete hyperparameter… sensitivity to learning rate. Line 326-328
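
The K-fold cross-validation procedure mentioned in the excerpts above can be illustrated with a minimal sketch. It is a plain-Python stand-in, not the Weka or Python code used in the study; the helper names `k_fold_indices` and `cross_validate`, the toy sample count, and the placeholder scorer are all assumptions made for illustration. The sketch neither shuffles nor stratifies, which a real run on an imbalanced dataset would likely need.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    # The first (n_samples % k) folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_validate(n_samples, k, score_fold):
    """Average a per-fold score over all k iterations."""
    scores = [score_fold(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# Hypothetical scorer: a real one would train a model on `train` and return
# its accuracy on `test`; this placeholder just reports the held-out fraction.
n = 25
folds = list(k_fold_indices(n, 10))
mean_score = cross_validate(n, 10, lambda train, test: len(test) / n)
```

Each sample lands in exactly one test fold, so every record is used for evaluation once and for training in the remaining iterations.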

The use of Weka is not well-justified—clearly state what parts were automated and what was custom-simulated.

Ans. Thank you for your concern. I have included an explanation of why Weka was used in the current study, which can be found in Lines 183-193.

Weka (Waikato Environment for… to automate several ML tasks. Line 183-184

Weka provides pre-built…of machine learning classifiers [11]. Line 184-188

The utilization of Weka facilitated… common ML workflow steps. Line 189-190

However, NN models were custom…the deep learning architectures. Line 190-192

This hybrid approach allowed us… and Python's customizability. Line 192-193

Add statistical analysis of data features to improve understanding of the dataset.

Ans. Thank you for your concern. I explained why statistical analysis of the data features was used in the current study, which can be found in Lines 174-182.

A detailed statistical analysis…interpret model outcomes effectively. Lines 174-175

Descriptive statistics, revealing… the data's inherent characteristics. Lines 175-176

These insights are crucial for… the model's performance. Lines 176-179

By understanding the data's properties… feature predictiveness. Lines 179-182

Justify the use of Pearson correlation for feature selection—why not explore other state-of-the-art methods?

Ans. Thank you for your concern. I have included an explanation of why Pearson correlation was used for feature selection in the current study, which can be found in Lines 221-234.

Pearson's correlation technique… and the target variable [19]. Line 221-223

It also provides a straightforward… relevant linear relationships. Line 223-227

The coefficient, ranging from… suggests a significant association. Line 227-229

Weka's “correlation” filter determines…of the correlation coefficient. Line 229-231

The value 0.15 was used as a threshold… dysphagia, and allergies. Line 231-234
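
As an illustration of the threshold-based selection described above, here is a minimal sketch in plain Python. The study used Weka's correlation filter, so this is a stand-in rather than the author's code; the function names, the toy columns, and the choice to treat the 0.15 threshold as inclusive and apply it to the absolute value of the coefficient are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, target, threshold=0.15):
    """Keep features whose |r| with the target is at least `threshold`."""
    scores = {name: pearson(values, target) for name, values in columns.items()}
    return {name: r for name, r in scores.items() if abs(r) >= threshold}


# Hypothetical toy data, not taken from the study's dataset.
toy_columns = {
    "coughing": [1, 1, 0, 1, 0, 0, 1, 1],
    "noise": [1, 0, 0, 1, 0, 1, 0, 1],
}
toy_target = [1, 1, 0, 1, 0, 0, 1, 0]
selected = select_features(toy_columns, toy_target)
```

In this toy example the `noise` column is uncorrelated with the target and is dropped, while `coughing` is retained.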

Clarify how the three different learning rates were selected—was any tuning method applied?

Ans. Thank you for your concern. I explained why the three learning rates were selected, which can be found in Lines 323-328.

The learning rates of 0.1, 0.01, and …preliminary experimentation. Lines 323-324

These values represent a range from… convergence and performance. Lines 324-326

While a more complete hyperparameter… sensitivity to learning rate. Lines 326-328
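
To make the sensitivity to learning rate concrete, the following toy sketch runs fixed-step gradient descent on a one-dimensional quadratic using two of the rates named in the response (0.1 and 0.01). It is not the study's NN training code: the objective, the step count, and the function name `gradient_descent` are assumptions chosen only to show how the step size affects convergence.

```python
def gradient_descent(learning_rate, steps=50, start=0.0):
    """Minimise f(w) = (w - 3)^2 with fixed-step gradient descent."""
    w = start
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)  # derivative of (w - 3)^2
        w -= learning_rate * grad
    return w


# Distance from the minimiser w = 3 after the same number of steps.
final_w = {lr: gradient_descent(lr) for lr in (0.1, 0.01)}
errors = {lr: abs(w - 3.0) for lr, w in final_w.items()}
```

With the same step budget, the larger rate ends far closer to the minimum on this toy problem; on a real network a large rate can also overshoot, which is why several values are usually compared.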

Figure 2 needs to be resized, smoothed, and improved in quality.

Ans. Thank you for your suggestion. I redrew Figure 2.

Redraw Figure 3 with better flowchart formatting; remove # and correct arrow placements.

Ans. Thank you for your suggestion. I redrew Figure 3.

Figures 4, 5, and 6 are blurred—redraw them or improve resolution.

Ans. Thank you for your suggestion. I redrew Figures 4, 5, and 6.

Use a table to show neural network configurations (e.g., layers, neurons) instead of multiple blurry figures.

Ans. Thank you for your suggestion. I redrew Figures 4, 5, and 6 and included a table (Table 5) describing the configurations of the different NN models.

In Line 290–291, the formula for precision and recall has double brackets—clarify if intentional.

Ans. Thank you for your suggestion. I corrected the mistakes; the revised formulas can be found in Lines 338-341.
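
For reference, the standard definitions of the two metrics, written with single brackets, are given below. The manuscript's own notation is not reproduced in this record, so the symbols TP, FP, and FN (true positives, false positives, false negatives) are assumed here.

```latex
\[
\text{Precision} = \frac{TP}{TP + FP},
\qquad
\text{Recall} = \frac{TP}{TP + FN}
\]
```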

Combine Figures 7 and 8 into a single plot or label subplots as a, b, c, d for clarity.

Ans. Thank you for your suggestion. I labeled the subfigures as a, b, c, and d.

Figure 9 seems unnecessary—move it to the methodology section or remove it.

Ans. Thank you for your suggestion. I deleted Figure 9.

Subsection headings should be more specific rather than generic.

Ans. Thank you for your suggestion. I modified several subsection headings throughout the paper.

2.1 Machine Learning Applications in Lung Disease Diagnosis Line 90

2.2 DL Applications in Lung Disease Diagnosis Line 118

2.3 Research Deficiencies and Prospective Avenues Line 140

3.1. Dataset Description, Features, and Tools Line 163

3.1.1. Descriptive Statistics Line 199

3.1.4. Identification of missing values Line 213

3.1.5. Identification and elimination of outliers Line 217

3.5.2. Data Normalization Techniques Line 237

3.3. Experimental Setup for Testing and Training data Line 254

3.5.4. Selection of different learning rates Line 322

4.1. Performance of Machine Learning Algorithms Line 330

5.1 Evaluation of Machine Learning Models Line 398

5.2 Efficacy of NN in Lung Cancer Diagnosis Line 408

5.3 Influence of Data Preprocessing and Feature Selection Line 421

5.4 Comparative Analysis with Prior Studies Line 430

Avoid repeating acronyms like “Machine Learning (ML)” multiple times after the first definition.

Ans. Thank you for your suggestion. I now use the acronym ML throughout the paper.

Provide a comprehensive comparison table with previous studies and clearly highlight how your approach outperforms them.

Ans. Thank you for your suggestion. I included a table (Table 10) comparing the findings of the current study with those of previous studies, which can be found at Line 440.

The conclusion is weak—restructure it to reflect contributions, key findings, and future work.

Ans. Thank you for your suggestion. I rewrote the conclusion, which can be found in Lines 443-449.

This study investigated the efficacy… to improve patient outcomes. Lines 443-446

The research provides a comparative… and lifestyle factors. Lines 443-447

This shows how a rigorous data… performance and reduced noise. Lines 447-449

The study identifies coughing, wheezing… significant risk indicators. Lines 449-451

It evaluates the performance of… for lung cancer prediction. Lines 451-454

Several ML models, including… for effective lung cancer prediction. Lines 455-456

DL models, notably a three-hidden-layer… patterns in the data. Lines 456-459

The English could be improved to more clearly express the research.

Ans. Thank you for your suggestion. I improved the English throughout the paper.