3Cs: Unleashing Capsule Networks for Robust COVID-19 Detection Using CT Images
Round 1
Reviewer 1 Report
Review Report for MDPI COVID
(3Cs: Unleashing Capsule Networks for Robust COVID-19 Detection Using CT Images)
1. Within the scope of the study, multiclass classification operations were performed with deep learning on computed tomography lung images related to COVID-19.
2. In the introduction section, the importance of the subject and the contribution of the study to the literature are clearly mentioned.
3. In the Related Works section, the classification studies in the literature related to COVID-19, especially those using CapsNet deep learning models on both CT and X-ray images, are explained clearly and in sufficient detail.
4. In the Materials and Methods section, it is stated that three-class CT lung images are used as the dataset, and that segmentation, augmentation, and resizing are performed as preprocessing. The use and choice of the dataset are suitable for the study. Although there are many different data preprocessing methods in the literature, it should be explained more clearly why these were preferred. In addition, it is recommended that the amount of data in each class in the dataset, the amount of data after augmentation, and the training, validation, and testing set sizes be given clearly in a table.
5. When the proposed 3Cs model is examined, it is observed that it was developed based on the CapsNet deep learning model. Although it is observed that the improvements made in the model are sufficient within the scope of the study, it should be explained in more detail why CapsNet is preferred, even though there are many different deep learning-based models in the literature that can perform classification in CT images.
6. Explain in more detail how the training, validation and testing percentages are determined in the dataset distribution and why cross-validation is not preferred.
7. Some of the evaluation metrics required for the solution of the classification problems have been obtained. It is recommended that Cohen’s Kappa and Matthews Correlation Coefficient metrics be obtained for the analysis of the results.
As a result, although the study has the potential to contribute to the literature at a significant level, it is recommended that it be examined in terms of the sections listed above.
All comments have been added in detail to the "Major comments" section.
Author Response
Comments 1: Within the scope of the study, multiclass classification operations were performed with deep learning on computed tomography lung images related to COVID-19.
Response:
"Thank you for acknowledging the use of deep learning for multi-class classification on COVID-19 related CT lung images. We are glad this core aspect of our study is clear."
Comments 2: In the introduction section, the importance of the subject and the contribution of the study to the literature are clearly mentioned.
Response:
"We appreciate your observation that the introduction effectively highlights the importance of the subject and the study's contribution to the field. We strive to provide a clear context and motivation for our research."
Comments 3: In the Related Works section, the classification studies in the literature related to COVID-19, especially those using CapsNet deep learning models on both CT and X-ray images, are explained clearly and in sufficient detail.
Response:
"Thank you for your positive feedback on the Related Works section. We aimed to provide a comprehensive overview of relevant classification studies, particularly those utilizing CapsNet models for COVID-19 detection on CT and X-ray images. We are glad this section is informative and sufficiently detailed."
Comments 4: In the Materials and Methods section, it is stated that three-class CT lung images are used as the dataset, and that segmentation, augmentation, and resizing are performed as preprocessing. The use and choice of the dataset are suitable for the study. Although there are many different data preprocessing methods in the literature, it should be explained more clearly why these were preferred. In addition, it is recommended that the amount of data in each class in the dataset, the amount of data after augmentation, and the training, validation, and testing set sizes be given clearly in a table.
Response:
"Thank you for your valuable feedback. We completely agree that a clearer explanation of our data preprocessing choices, particularly the rationale behind them, would be beneficial for understanding our approach.
In the revised manuscript (page 5, line 215), we have incorporated a more detailed explanation for both the K-means method used for segmentation and the choice of seven iterations for dilation. We highlight the suitability of K-means for effectively separating lung regions from the background in CT images due to its ability to identify distinct clusters based on intensity levels. Additionally, we explain how using seven iterations in the dilation step strikes a balance between removing unwanted background and preserving the integrity of the lung boundaries. This choice was found to be optimal for our specific dataset in terms of minimizing boundary thickening while effectively filling small holes within the lungs.
To address your suggestion on data distribution, we have included two new tables (Tables 3 and 4) in the revised manuscript (page 5, line 205). These tables clearly show the number of data points in each class for training, validation, and testing sets within each dataset used in our study. This additional information provides a better understanding of the class balance within our datasets."
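For concreteness, a minimal sketch of the K-means segmentation followed by seven dilation iterations described above (assuming single grayscale CT slices and scikit-learn/SciPy; this is illustrative, not the authors' exact implementation) could look like:

```python
# Illustrative lung-segmentation preprocessing: K-means on pixel intensities,
# then 7 dilation iterations. Library choices and thresholding details are
# assumptions, not the code used in the paper.
import numpy as np
from sklearn.cluster import KMeans
from scipy.ndimage import binary_dilation

def segment_lungs(ct_slice: np.ndarray) -> np.ndarray:
    # Cluster pixel intensities into two groups (dark lung/air vs. brighter tissue).
    pixels = ct_slice.reshape(-1, 1).astype(np.float32)
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pixels)
    labels = km.labels_.reshape(ct_slice.shape)

    # Take the darker cluster (lower mean intensity) as the lung/air region.
    lung_cluster = int(np.argmin(km.cluster_centers_.ravel()))
    mask = labels == lung_cluster

    # Seven dilation iterations: fill small holes inside the lungs while
    # limiting boundary thickening, as discussed in the response.
    mask = binary_dilation(mask, iterations=7)

    # Keep only the lung region of the slice.
    return ct_slice * mask
```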
Comments 5: When the proposed 3Cs model is examined, it is observed that it was developed based on the CapsNet deep learning model. Although it is observed that the improvements made in the model are sufficient within the scope of the study, it should be explained in more detail why CapsNet is preferred, even though there are many different deep learning-based models in the literature that can perform classification in CT images.
Response:
"Thank you for your comment. We appreciate your interest in the rationale behind our selection of CapsNet for the 3Cs model.
As you rightly noted, there are various deep learning models for image classification. We chose CapsNet for several key reasons that align well with the specific challenges of our study:
- Effective Handling of Smaller Datasets: CapsNet's architecture is adept at learning from limited data, which is often the case in medical imaging tasks where acquiring large datasets can be challenging. This capability is particularly relevant for our study, where access to extensive COVID-19 CT image datasets might be limited (as can be seen from the Related Works section).
- Robustness to Rotational Variations: CT scans can exhibit slight variations in patient positioning, potentially leading to rotated features. CapsNet's ability to process features with varying orientations makes it well-suited for this scenario. This advantage is crucial for our application, where accurate detection of COVID-19 from CT images, regardless of minor rotations, is paramount (reference paper [19]).
- State-of-the-Art Performance for Multi-Class Classification: Our 3Cs model aims to differentiate between novel coronavirus pneumonia (NCP), common pneumonia (CP), and normal control. Studies have shown that CapsNet can achieve superior performance in multi-class classification tasks compared to CNN-based approaches, especially for medical image analysis. This aligns with our goal of developing a highly accurate diagnostic tool for COVID-19 detection using CT scans.
The detailed explanation of the 3Cs model architecture, including modifications made to the original CapsNet for our purposes, can be found in Figure 3 and the surrounding sections of the paper."
Comments 6: Explain in more detail how the training, validation and testing percentages are determined in the dataset distribution and why cross-validation is not preferred.
Response:
"Thank you for your insightful comment. We appreciate your interest in the rationale behind our data splitting strategy.
Training, Validation, and Testing Percentages:
As you correctly noted, we opted for a manual split of the dataset into training, validation, and testing sets. We used a 64%, 16%, and 20% split for training, validation, and testing, respectively, for Dataset A, and a 65%, 16%, and 19% split for Dataset B. This approach ensures a sufficient amount of data for training the model while reserving portions for unbiased evaluation (validation) and final performance assessment (testing). We have included tables showing the data distribution across classes and the training/validation/testing splits (Tables 3-6, page 5). This additional information provides a clearer understanding of the class balance within our datasets and of how the data was allocated for training, validation, and testing purposes.
Why not Cross-Validation?
While cross-validation is a valuable technique to estimate model performance, we deliberately chose a manual split in this instance for the following reasons:
- Manual Split for Control and Data Availability: By manually splitting the data, we maintain control over the exact distribution of data points across training, validation, and testing sets. This allows us to ensure a consistent split across both datasets in our study, especially considering our dataset size.
- Focus on Generalizability and Computational Efficiency: Our primary goal was to evaluate the model's ability to generalize to unseen data. A dedicated test set, alongside a sufficient amount of training data, allows for a more robust assessment of this generalizability. Additionally, a manual split avoids the multiple training runs required by cross-validation, which can be computationally expensive.”
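For illustration, a two-stage stratified split along these lines (a minimal sketch using scikit-learn; the array names `images` and `labels` and the random seed are placeholders, not the authors' code) could be written as:

```python
# Sketch of a stratified 64/16/20 split matching the percentages reported
# for Dataset A. `images` and `labels` are hypothetical arrays.
from sklearn.model_selection import train_test_split

# First hold out 20% of the data for the final test set.
x_trainval, x_test, y_trainval, y_test = train_test_split(
    images, labels, test_size=0.20, stratify=labels, random_state=42)

# Then take 20% of the remaining 80% for validation (0.8 * 0.2 = 16%),
# leaving 64% of the full dataset for training.
x_train, x_val, y_train, y_val = train_test_split(
    x_trainval, y_trainval, test_size=0.20, stratify=y_trainval, random_state=42)
```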
Comments 7: Some of the evaluation metrics required for the solution of the classification problems have been obtained. It is recommended that Cohen’s Kappa and Matthews Correlation Coefficient metrics be obtained for the analysis of the results.
Response:
"Thank you for your suggestion regarding the inclusion of Cohen's Kappa and Matthews Correlation Coefficient in our evaluation metrics. We appreciate your interest in a comprehensive analysis of the results.
We carefully considered the selection of evaluation metrics for our study, particularly in the context of our balanced, multi-class dataset with NCP, CP, and normal control categories. We opted to utilize the following metrics:
- Accuracy: Overall classification correctness, essential for understanding the model's ability to differentiate between all three classes.
- Precision: Measures the proportion of true positives among predicted positives for each class. This is crucial for evaluating the model's ability to minimize false positives for each category.
- Recall: Measures the proportion of true positives identified for each class. This is critical for assessing the model's ability to correctly identify positive cases (NCP) and avoid missed detections for any class.
- F1-score: Combines precision and recall for each class, providing a single measure of effectiveness for multi-class problems.
- AUC (Area Under the ROC Curve): Represents the model's ability to discriminate between positive and negative classes for each class pair. This is particularly valuable for multi-class classification, as it provides a comprehensive view of the model's performance beyond overall accuracy.
- False Negative Rate (FNR): Specifically highlights the model's ability to correctly identify positive cases (NCP), essential for medical diagnosis where missed detections can have significant consequences.
While Cohen's Kappa and Matthews Correlation Coefficient are valuable metrics, they are often employed for binary classification tasks or when dealing with imbalanced datasets. In our case, with a balanced, multi-class setting, the chosen metrics provide a more focused and informative evaluation:
- Accuracy, precision, recall, and F1-score provide class-specific insights, allowing us to assess the model's performance for each category (NCP, CP, normal).
- AUC offers a complementary perspective by evaluating discrimination ability for each class pair, providing a more comprehensive picture.
- FNR specifically addresses the crucial aspect of correctly identifying positive cases (NCP) in the context of medical diagnosis.
We are confident that the selected metrics provide sufficient insights into the strengths and weaknesses of our multi-class classification model for COVID-19 detection using CT scans."
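As a purely illustrative companion to the metrics listed above, the per-class and aggregate values could be computed with scikit-learn along these lines (the label encoding 0 = normal, 1 = CP, 2 = NCP and the function name are assumptions, not taken from the paper):

```python
# Sketch of the reported metrics: accuracy, per-class precision/recall/F1,
# one-vs-rest AUC, and the false negative rate for the NCP class.
from sklearn.metrics import (accuracy_score, precision_recall_fscore_support,
                             roc_auc_score, confusion_matrix)

def evaluate(y_true, y_pred, y_score, ncp_class=2):
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=None, labels=[0, 1, 2])
    # One-vs-rest AUC over the three classes; y_score holds class probabilities.
    auc = roc_auc_score(y_true, y_score, multi_class="ovr")
    # False negative rate for the NCP class: FN / (FN + TP).
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
    tp = cm[ncp_class, ncp_class]
    fn = cm[ncp_class, :].sum() - tp
    fnr = fn / (fn + tp)
    return acc, prec, rec, f1, auc, fnr
```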
Reviewer 2 Report
The manuscript proposes a deep neural network (DNN) that is used to perform multi-class classification (i.e., normal, pneumonia, and coronavirus) using computerized tomography (CT) images. While diagnosing coronavirus from imaging modalities using deep learning and convolutional neural networks (CNNs) is not new, the contribution of this work lies in using a capsule DNN to categorize CT images, whereas existing work primarily uses X-ray image datasets. Standard pre-processing and data augmentation steps are performed on two subsets of the original CC-CCII dataset. Performance results show a notable increase when data augmentation is performed, which is expected. Marginal improvement was seen when the capsule DNN's hyper-parameters were optimized. The work also aims to reduce annotation time by using k-means to segment images; this is a minor contribution. Also, hyper-parameters were chosen/optimized without a clear method; given that this is a DNN with many adjustable parameters, some justification should be given for why only certain parameters were changed. The manuscript can be improved by technically justifying certain aspects of the paper (i.e., usage) and by incorporating additional metrics.
1. Lines 210 - 220: Since K-means is used as an unsupervised algorithm for pre-processing, performance metrics such as Silhouette or DB scores should be given to assess the “goodness” of the clustering. Using only dilation is not enough, especially this early in the machine learning pipeline. Also, why this specific choice of ‘x’ kernels (e.g., 7)?
2. Other papers in this area use average time of diagnosis as part of the performance. This should be incorporated in this manuscript as well.
3. Lines 223 - 225: “experimentally proven…” where? Are there references for this? Or is this statement referring to the current study?
4. Subsection 4.3: Why were only these parameters chosen as part of the optimization process? Why these specific values? While answering the latter provides more clarity, a good value addition to this paper would be to use an optimization method (e.g., grid, random) that exhaustively checks for best parameters.
Author Response
Comments 1: Lines 210 - 220: Since K-means is used as an unsupervised algorithm for pre-processing, performance metrics such as Silhouette or DB scores should be given to assess the “goodness” of the clustering. Using only dilation is not enough, especially this early in the machine learning pipeline. Also, why this specific choice of ‘x’ kernels (e.g., 7)?
Response:
“Thank you for your insightful feedback! We appreciate your suggestions for evaluating the K-means pre-processing step and for considering alternative approaches.
We agree that using only dilation might not be the most robust pre-processing approach, especially early in the pipeline. Here's how we've addressed this:
- Justification for Dilation: We revised the text to emphasize the specific purpose of dilation in this context. We have added the justification in the revised manuscript (lines 215-220, page 6).
- Exploring Alternatives: We're actively investigating the potential benefits of incorporating additional pre-processing techniques, like erosion or morphological opening, before the dilation step. We'll evaluate their impact on the overall classification performance and potentially incorporate them in future work.
K-means Evaluation: While Silhouette or Davies-Bouldin (DB) scores are valuable for assessing the quality of unsupervised clustering in general, computing them adds processing time, especially for large datasets. Since our focus was on overall model performance and efficiency, we prioritized techniques that directly impacted the CapsNet's ability to learn meaningful features for classification.”
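If clustering quality were to be spot-checked without a large runtime cost, one option (a sketch only; the placeholder slice, subsample size, and variable names are assumptions) would be to score a random subsample of pixels:

```python
# Sketch: Silhouette and Davies-Bouldin scores for the K-means intensity
# clustering, computed on a pixel subsample to limit processing time.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

ct_slice = np.random.rand(224, 224).astype(np.float32)  # placeholder for a real CT slice
pixels = ct_slice.reshape(-1, 1)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pixels)

# Score a random subsample of pixels rather than the full image.
idx = np.random.default_rng(0).choice(len(pixels), size=10_000, replace=False)
sil = silhouette_score(pixels[idx], km.labels_[idx])
dbi = davies_bouldin_score(pixels[idx], km.labels_[idx])
print(f"Silhouette: {sil:.3f}, Davies-Bouldin: {dbi:.3f}")
```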
Comments 2: Other papers in this area use average time of diagnosis as part of the performance. This should be incorporated in this manuscript as well.
Response:
"Thank you for your comment. We've incorporated the training time and number of epochs for each experiment, At (lines 325, 335 page 9, and 355 page 11). This provides insights into the model's training efficiency, which aligns well with our focus on overall model performance.”
Comments 3: Lines 223 - 225: “experimentally proven…” where? Are there references for this? Or is this statement referring to the current study?
Response:
Thank you for your feedback! You're absolutely right; the term "experimentally proven" was too strong without specific references. We have addressed this by revising the sentence to:
"This size (224x224 pixels) was found in our experiments to be more efficient for running the CapsNet classification model on the Clean-CC-CCII dataset and produced better results compared to other tested image resolutions." (Line 225, Page 6)
In fact, we evaluated several image sizes, and 224x224 yielded the best results, so we chose it.
Comments 4: Subsection 4.3: Why were only these parameters chosen as part of the optimization process? Why these specific values? While answering the latter provides more clarity, a good value addition to this paper would be to use an optimization method (e.g., grid, random) that exhaustively checks for best parameters.
Response:
“Thanks for your feedback! You're right that choosing specific hyperparameters without an exhaustive search is a valid concern. Here's how we addressed it:
- Literature Review: We based our initial parameter selection on commonly used and successful configurations for CapsNet models reported in research papers. This provided a solid foundation to begin our optimization process.
- Targeted Exploration: While we didn't perform a full grid or random search, we did explore a targeted range of values for key hyperparameters like the number of channels in the PrimaryCapsules layer and the Conv1 layer configuration. This approach allowed us to efficiently identify promising parameter combinations while focusing on adjustments known to be impactful for CapsNets.
- Justification for Chosen Values: In the paper, we've presented the results of these targeted explorations (e.g., reducing channels in the PrimaryCapsules layer - Table 9 and Figure 6). This demonstrates the impact of these adjustments and justifies the final chosen values that achieved the best performance (Conv1 layer strides and padding - Table 10).
Future Work: You're absolutely right about the potential benefits of a more exhaustive search. In future work, we plan to investigate the use of grid search or other optimization techniques to explore a broader hyperparameter space. This could potentially lead to further improvements in the model's performance.”
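To illustrate what such a targeted or exhaustive search could look like in code (a sketch only; `build_capsnet`, `train_and_evaluate`, `train_data`, and `val_data` are hypothetical placeholders, not functions or objects from our implementation):

```python
# Sketch of a small grid over a few CapsNet hyperparameters, keeping the
# configuration with the best validation accuracy. All names below are
# hypothetical placeholders for illustration.
from itertools import product

primary_caps_channels = [16, 32]   # candidate PrimaryCapsules channel counts
conv1_strides = [1, 2]             # candidate Conv1 strides

best_cfg, best_val_acc = None, 0.0
for channels, stride in product(primary_caps_channels, conv1_strides):
    model = build_capsnet(primary_channels=channels, conv1_stride=stride)
    val_acc = train_and_evaluate(model, train_data, val_data)
    if val_acc > best_val_acc:
        best_cfg, best_val_acc = (channels, stride), val_acc

print("Best configuration:", best_cfg, "validation accuracy:", best_val_acc)
```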
Round 2
Reviewer 1 Report
Review Report for MDPI COVID
(3Cs: Unleashing Capsule Networks for Robust COVID-19 Detection Using CT Images)
Thanks for the revision. When the revised version of the paper and the responses to the reviewer comments are examined in detail, it is observed that although some responses are limited, the quality and originality of the study are appropriate. For this reason, I recommend that the paper be accepted. I wish the authors success in their future studies. Best regards.
All comments have been added in detail to the "Major comments" section.