Peer-Review Record

Classification of Skin Lesions Using Weighted Majority Voting Ensemble Deep Learning

Algorithms 2022, 15(12), 443; https://doi.org/10.3390/a15120443
by Damilola A. Okuboyejo and Oludayo O. Olugbara *
Submission received: 3 October 2022 / Revised: 13 November 2022 / Accepted: 20 November 2022 / Published: 24 November 2022
(This article belongs to the Special Issue Machine Learning in Medical Signal and Image Processing)

Round 1

Reviewer 1 Report

In this study, a deep learning model was used to classify skin diseases. A novel ensemble deep learning algorithm based on the residual network and the dual path network is proposed. The general idea of the study is interesting, and the following issues should be considered:

1. The manuscript contains a large number of abbreviations. Please express the full names clearly when they first appear, otherwise they will not be readable.

2. The annotations for Figure 3 should be consistently prepared and written below the image.

3. The authors performed a detailed pre-processing of the image selection to train the model. However, in terms of application, various factors need to be considered in order for the results of the algorithm to be of value, especially in the medical field. It is not feasible to pursue better performance of a model while ignoring the necessary influences that exist in reality. It is strongly recommended that noise be added to the dataset and the results compared. In addition, please describe the feasibility for application purposes.

4. For a fresh sample, how can the results of this study be used to help doctors make a diagnosis?

5. The Conclusion is too lengthy, so readers cannot grasp the main ideas of the manuscript quickly.

 

 

Author Response

REFEREE 1:

In this study, a deep learning model was used to classify skin diseases. A novel ensemble deep learning algorithm based on the residual network and the dual path network is proposed. The general idea of the study is interesting, and the following issues should be considered.


Comment 1

The manuscript contains a large number of abbreviations. Please express the full names clearly when they first appear, otherwise they will not be readable.

Response 1

We are grateful to the reviewer for the positive comment about our work. The full names of all abbreviations have now been specified when they first appear.

 

Comment 2

The annotations for Figure 3 should be consistently prepared and written below the image.

Response 2

The annotation for Figure 3 has now been prepared consistently.

 

Comment 3

The authors performed a detailed pre-processing of the image selection to train the model. However, in terms of application, various factors need to be considered in order for the results of the algorithm to be of value, especially in the medical field. It is not feasible to pursue better performance of a model while ignoring the necessary influences that exist in reality. It is strongly recommended that noise be added to the dataset and the results compared. In addition, please describe the feasibility for application purposes.

Response 3

We are sincerely grateful to the reviewer for the positive comment about our work. The preprocessing used in the study includes segmentation and image augmentation. Image segmentation was used at stage 1 of our evaluation to test the effectiveness of our algorithms on different images. As reported in Table 5, image segmentation did not significantly improve the classification performance for most of the base learners and was therefore not used in the stage 2 evaluation. Because of the class imbalance in the multiclass experimental datasets, image augmentation transformations were applied to each data item, as stated in section 3.2.2, before the data were batch-fed into the learning models during stage 1 and stage 2 training. A total of 58,367 lesion images containing several noise attributes, such as hair shafts, ruler markings, and vignettes, were used in this study, as is now stated clearly in section 3.1. In addition, at no point were such lesion noises removed from the images during classification, and this has also now been clearly stated in section 3.3.
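For illustration, such an augmentation pipeline might be sketched as follows, assuming a PyTorch/torchvision setup; the specific transforms and parameter values here are illustrative and not necessarily the exact configuration described in section 3.2.2:

```python
from torchvision import transforms

# Illustrative augmentation pipeline for imbalanced dermoscopic datasets;
# the exact transforms and parameters used in the manuscript may differ.
train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # random crop and rescale
    transforms.RandomHorizontalFlip(),                     # lesions have no canonical orientation
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(degrees=30),
    transforms.ColorJitter(brightness=0.1, contrast=0.1),  # mild photometric variation
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],       # ImageNet statistics, matching
                         std=[0.229, 0.224, 0.225]),       # the pretrained base learners
])
```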

The sample images shown in Figure 2 have also been updated to showcase more images with several noise attributes from the datasets used in this study. In addition, we have updated the concluding section concerning the feasibility of our work for real-world applications. We believe that our proposed ensemble deep learning algorithms can improve the early diagnosis of skin lesions (e.g., Figures 2b, 2c, 2g, 2i, and 2j) before they become invasive. In addition, for skin lesions that are already at the invasive stage (e.g., Figures 2a, 2d, 2e, 2f, 2h, 2k, and 2l), our proposed solution can act as a good second opinion for dermatologists in the effective discrimination of malignant tumors.

 

Comment 4

For a fresh sample, how can the results of this study be used to help doctors make a diagnosis?

Response 4

A fresh lesion image sample can be fed directly into the ensemble classifier without prior pre-processing, and the appropriate classification of the image is determined as described in Algorithm 1 (section 3.3.1) and Algorithm 2 (section 3.3.2).
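As a rough sketch of this inference path (the helper below is hypothetical and stands in for Algorithms 1 and 2; it is not the published code):

```python
import torch
from PIL import Image
from torchvision import transforms

# Hypothetical inference path for a fresh, unprocessed lesion image; `ensemble`
# stands in for the weighted majority voting ensemble of Algorithms 1 and 2.
def predict_lesion_class(ensemble, image_path, class_names):
    to_tensor = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])
    image = to_tensor(Image.open(image_path).convert("RGB")).unsqueeze(0)  # 1 x C x H x W
    with torch.no_grad():
        probabilities = ensemble(image)        # aggregated class probabilities
    return class_names[int(probabilities.argmax())]
```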

 

Comment 5

The Conclusion is too lengthy, so readers cannot grasp the main ideas of the manuscript quickly.

Response 5

We have removed non-essential information from the concluding section of the study. Essential details of the study and future work that could be of benefit to readers have also been stated more clearly.

Reviewer 2 Report

Skin lesions in the manuscript are classified by an ensemble algorithm based on a variety of transfer learning algorithms. The authors did an enormous amount of work analyzing transfer learning models and combining them.

At line 46 it is mentioned that “… that manifold researchers are interested in obtaining accurate automated systems that can assist in the early detection and diagnosis of the disease [9]”, and also at line 76: “However, the classification of skin diseases is a difficult task because of the strong similarities in common symptoms for which AI methods were recommended to improve the accuracy of dermatology diagnosis [13].”

It is not clear how the manuscript relates to early detection. From the pictures provided, I would assume that they are taken at a rather advanced stage of the disease and are not early manifestations. So, the claim of early detection should either be removed, or reiterated and clearly supported in the manuscript.

The technique used in the manuscript is an ensemble algorithm over a set of transfer learning architectures. Please directly state the novelty of your approach, or the main purpose of the manuscript if this is just an application of a well-known approach. It would also be nice to provide code for the method, and to describe the technical issues you met during the implementation of the approach.

I think that after some revision the manuscript could be accepted for publication as an example of the implementation of ensembling.

 

There are several recommendations to the authors of the manuscript:

It would be nice to clarify in the text (line 41), with reference [4], whether the 9000 cases are worldwide, in the USA, or somewhere else.

At lines 210-218, many neural network (NN) architectures are mentioned. Please add the references upon the first mention.

Also, provide information on the “widely used metrics for performance evaluation” (line 219): what are they?

Line 222: please state where these resources are available from, with references to them. Currently, this is done only later, in Table 1.

Are the classes and the data of these classes in Table 1 (the last column) overlapping? Please mention this in the Table 1 description and/or note that this information is provided later in Table 2.

Line 293: please clarify what the “ones” are in the phrase “it explicitly differentiates the information preserves from the ones added to the network”.

Line 319: “training 26 deep learning algorithms with an image segmentation process based on dermatologically segmented ground truth and without any prior image segmentation”. Please clarify what is meant by processing “without prior image segmentation” alongside the usage of ground truth (GT) image segmentation. Doesn't GT constitute a prior image segmentation? By “prior image segmentation”, do you mean the extraction of the ROI first as a preliminary step and the analysis of this ROI later? Please be specific. Usually, extraction of the ROI brings some improvement in classification.

Line 353: in the phrase “Regularization and optimization are among the types of fundamental hyperparameters”, I think “regularization and optimization” should not be referred to as hyperparameters, but as processes. The type of the optimization algorithm chosen IS a parameter, but optimization itself is not a parameter.

In Algorithm 2 it is not clear how to “Compute the weight ѿ from each handler Ϧ”, nor how to “Aggregate the results of the handlers with ѿ >= 0.25”. Also, the “learner ᴟ” was not introduced anywhere before. The method of weighting (soft, hard, or something else) was not mentioned. As this is likely the main part of your contribution, please make clear what you did and which methods were applied, and in what way.

You have never mentioned the “ray distributed framework”, which is referred to at line 403: “The application of the ray distributed framework assisted in reducing the inference process”.

Usage of a distributed process to get the prediction in 5.3 seconds rather than 7.3 seconds looks like an unimportant gain, with a likely disproportionate increase in complexity. Also, you did not mention how the distributed calculations were organized.

Line 451: “Both ResNeXt-101 and DPN-107 base learners could not be fitted to the training datasets because of the underlying computing capacity issues” – what is meant by “the underlying computing capacity issues”?

Please provide the formulas for the multiclass calculation of the Jaccard index, multiclass accuracy, MCC, and Dice coefficient, for completeness of the explanation.

Author Response

REFEREE 2:

Skin lesions in the manuscript are classified by an ensemble algorithm based on a variety of transfer learning algorithms. The authors did an enormous amount of work analyzing transfer learning models and combining them. At line 46 it is mentioned that “… that manifold researchers are interested in obtaining accurate automated systems that can assist in the early detection and diagnosis of the disease [9]”, and also at line 76: “However, the classification of skin diseases is a difficult task because of the strong similarities in common symptoms for which AI methods were recommended to improve the accuracy of dermatology diagnosis [13].”

Comment 1

It is not clear how the manuscript relates to early detection. From the pictures provided, I would assume that they are taken at a rather advanced stage of the disease and are not early manifestations. So, the claim of early detection should either be removed, or reiterated and clearly supported in the manuscript.

Response 1

We are sincerely grateful to the reviewer for the positive comment about our work. Section 3.1, which describes the datasets, has been revised to reflect the usage of both in-situ and invasive lesion images. The rationale for including early in-situ images is to ensure that a learning algorithm can be used for early disease diagnosis, while the invasive lesions enhance the ability of the proposed ensemble learning algorithm to act as a second opinion for dermatologists.

The dataset of 58,367 lesion images used in this study contains several noise attributes, such as hair shafts, ruler markings, and vignettes, and this has now been stated clearly in section 3.1. In addition, at no point were such lesion noises removed from the images during classification, and this has now been clearly stated in section 3.3. The sample images shown in Figure 2 have also been updated to showcase more early in-situ skin lesions (e.g., Figures 2b, 2c, 2g, 2i, and 2j) as well as skin lesions that are already at an invasive stage (e.g., Figures 2a, 2d, 2e, 2f, 2h, 2k, and 2l). We believe that our proposed ensemble deep learning methods can improve the early diagnosis of in-situ skin lesions and support the diagnosis of skin lesions that are already at the invasive stage.

Comment 2

The technique used in the manuscript is an ensemble algorithm over a set of transfer learning architectures. Please directly state the novelty of your approach, or the main purpose of the manuscript if this is just an application of a well-known approach. It would also be nice to provide code for the method, and to describe the technical issues you met during the implementation of the approach.

Response 2

To the best of our knowledge, we are the first to apply ensembles of IG-ResNeXt-101, SWSL-ResNeXt-101, ECA-ResNet-101, and DPN-131 with confidence preservation. Our proposed ensemble methods perform inference on lesion images without prior pre-processing to solve multiclass lesion classification problems (up to 10 classes) with a compelling balanced accuracy result. In addition, the contribution of our study is stated in section 2, and our reported results are compared with other state-of-the-art methods in section 4.4 to demonstrate novelty. As stated in section 4.3, the technical issues encountered during implementation include challenges in fitting the ResNeXt-101 and DPN-107 base learners to the training datasets because of the limited capacity of the processing unit of the computer used for experimentation.
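For reference, the four base learners correspond to publicly available pretrained models; a minimal sketch assuming the timm package is shown below (the identifiers are timm model names, which may vary across versions, and are not the manuscript's code):

```python
import timm

# Pretrained models corresponding to the four base learners named above,
# assuming the timm package; identifiers may vary across timm versions.
NUM_CLASSES = 10  # up to 10 lesion classes, as stated above
base_learners = {
    "IG-ResNeXt-101":   timm.create_model("ig_resnext101_32x8d",   pretrained=True, num_classes=NUM_CLASSES),
    "SWSL-ResNeXt-101": timm.create_model("swsl_resnext101_32x8d", pretrained=True, num_classes=NUM_CLASSES),
    "ECA-ResNet-101":   timm.create_model("ecaresnet101d",         pretrained=True, num_classes=NUM_CLASSES),
    "DPN-131":          timm.create_model("dpn131",                pretrained=True, num_classes=NUM_CLASSES),
}
```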

Comment 3

It would be nice to clarify in the text (line 41), with reference [4], whether the 9000 cases are worldwide, in the USA, or somewhere else.

Response 3

Additional references [4-7] have been added to support the mortality rate assertion.

Comment 4

At lines 210-218, many neural network (NN) architectures are mentioned. Please add the references upon the first mention.

Response 4

We have added appropriate references at lines 219-226, upon the first mention of each architecture.

Comment 5

Also, provide information on the “widely used metrics for performance evaluation” (line 219): what are they?

Response 5

The metrics used for performance evaluation have now been properly detailed in section 3.2.1 and highlighted in Table 3.

Comment 6

Line 222: please state where these resources are available from, with references to them. Currently, this is done only later, in Table 1.

Response 6

References to the datasets have been updated on the first mention.

Comment 7

Are the classes and the data of these classes in Table 1 (the last column) overlapping? Please mention this in the Table 1 description and/or note that this information is provided later in Table 2.

Response 7

We have now stated that some of the lesion images overlap across different datasets.

Comment 8

Line 293: please clarify what the “ones” are in the phrase “it explicitly differentiates the information preserves from the ones added to the network”.

Response 8

We have now updated the sentence to indicate that the DenseNet architecture differentiates the residual information preserved from the new information added to the network.

Comment 9

Line 319: “training 26 deep learning algorithms with an image segmentation process based on dermatologically segmented ground truth and without any prior image segmentation”. Please clarify what is meant by processing “without prior image segmentation” alongside the usage of ground truth (GT) image segmentation. Doesn't GT constitute a prior image segmentation? By “prior image segmentation”, do you mean the extraction of the ROI first as a preliminary step and the analysis of this ROI later? Please be specific. Usually, extraction of the ROI brings some improvement in classification.

Response 9

We have rewritten the statement to properly describe the process. We have now properly stated the purpose of stage 1 evaluation and how the result influenced the experimentation at stage 2.

Comment 10

Line 353: in the phrase “Regularization and optimization are among the types of fundamental hyperparameters”, I think “regularization and optimization” should not be referred to as hyperparameters, but as processes. The type of the optimization algorithm chosen IS a parameter, but optimization itself is not a parameter.

Response 10

We have updated section 3.2.2 to be more specific in referring to the associated regularization and optimization hyperparameters.
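For instance, the distinction the reviewer draws can be made concrete as follows: the optimizer type and the regularization coefficients are the hyperparameters, while optimization itself is the process that uses them. A minimal sketch, with illustrative values that are not the manuscript's settings:

```python
import torch

# Illustrative only: the optimizer type, learning rate, momentum, and weight
# decay are the optimization/regularization hyperparameters; the values here
# are not the manuscript's settings.
model = torch.nn.Linear(2048, 10)  # placeholder classification head
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,            # learning-rate hyperparameter
    momentum=0.9,       # momentum hyperparameter
    weight_decay=1e-4,  # L2 regularization hyperparameter
)
```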

Comment 11

In Algorithm 2 it is not clear how to “Compute the weight ѿ from each handler Ϧ”, nor how to “Aggregate the results of the handlers with ѿ >= 0.25”. Also, the “learner ᴟ” was not introduced anywhere before. The method of weighting (soft, hard, or something else) was not mentioned. As this is likely the main part of your contribution, please make clear what you did and which methods were applied, and in what way.

Response 11

We have updated sections 3.3.1 and 3.3.2 for better clarity and included the appropriate expressions. In this study, owing to the number of base classifiers used, we combined hard and soft voting schemes to resolve the possibility of an even split among the predicted outputs, as highlighted in step 6 of Algorithm 2.
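A compact sketch of such a combined hard/soft weighted voting rule is given below; the 0.25 weight threshold follows Algorithm 2, while the weight computation and the tie-breaking order are illustrative assumptions rather than the exact published procedure:

```python
import numpy as np

# Illustrative weighted majority voting: hard voting over thresholded handlers,
# with weight-scaled soft voting as the tie-breaker for even splits.
def weighted_majority_vote(probs, weights, threshold=0.25):
    """probs: (n_learners, n_classes) class probabilities per base learner;
    weights: (n_learners,) learner weights, e.g. from validation accuracy."""
    keep = weights >= threshold                        # drop low-weight handlers
    if not keep.any():                                 # fallback: keep all learners
        keep = np.ones_like(keep)
    kept_probs, kept_w = probs[keep], weights[keep]
    hard_votes = kept_probs.argmax(axis=1)             # each learner's hard prediction
    counts = np.bincount(hard_votes, minlength=probs.shape[1])
    winners = np.flatnonzero(counts == counts.max())
    if len(winners) == 1:                              # clear hard-vote majority
        return int(winners[0])
    soft = (kept_probs * kept_w[:, None]).sum(axis=0)  # weighted soft votes
    return int(soft.argmax())
```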

Comment 12

You have never mentioned the “ray distributed framework”, which is referred to at line 403: “The application of the ray distributed framework assisted in reducing the inference process”.

Response 12

Section 3.3 has been updated with an appropriate explanation of the Ray distributed framework.

Comment 13

Usage of a distributed process to get the prediction in 5.3 seconds rather than 7.3 seconds looks like an unimportant gain, with a likely disproportionate increase in complexity. Also, you did not mention how the distributed calculations were organized.

Response 13

We have now clearly stated the challenge of slow convergence speed typically faced by most ensemble methods and how the Ray distributed framework was applied to resolve this challenge. In addition, the reduction of the average classification time from 7.3 to 5.3 seconds amounts to an approximately 27% reduction in inference time, which we strongly believe is a significant improvement.
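For illustration, running the base learners in parallel with Ray might look like the sketch below (assuming the ray package; the function names are hypothetical, and the actual organization in section 3.3 may differ):

```python
import ray

ray.init(ignore_reinit_error=True)

# Hypothetical parallel inference with Ray: each remote task runs one base
# learner on the same input, so the slowest learner bounds the latency
# instead of the sum of all learners' runtimes.
@ray.remote
def run_base_learner(model, image):
    return model(image)  # per-learner class probabilities

def distributed_predict(models, image):
    futures = [run_base_learner.remote(m, image) for m in models]
    return ray.get(futures)  # gather outputs for the voting stage
```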

Comment 14

Line 451: “Both ResNeXt-101 and DPN-107 base learners could not be fitted to the training datasets because of the underlying computing capacity issues” – what is meant by “the underlying computing capacity issues”?

Response 14

We have updated the statement to clarify that, relative to other high-end computer vision hardware, the limited graphics processing unit (GPU) capacity used in the study prevented the fitting of both the ResNeXt-101 and DPN-107 base learners.

Comment 15

Please provide the formulas for the multiclass calculation of the Jaccard index, multiclass accuracy, MCC, and Dice coefficient, for completeness of the explanation.

Response 15

Table 3 of section 3.2.1 has now been added to present the associated formulas for each performance evaluation metric.
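For reference, the standard per-class definitions underlying these metrics, in terms of true/false positives and negatives, are given below; Table 3 in the manuscript presents the exact multiclass forms used in the study:

```latex
\begin{align}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Jaccard}  &= \frac{TP}{TP + FP + FN} \\
\text{Dice}     &= \frac{2\,TP}{2\,TP + FP + FN} \\
\text{MCC}      &= \frac{TP \cdot TN - FP \cdot FN}
                   {\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}
\end{align}
```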

Round 2

Reviewer 2 Report

I think the revised version could be accepted. Well done with the manuscript!
