4.1. Experiment Setup and Evaluation Metrics
The experiments were executed on a computer system running Windows 11 Pro, equipped with 32 GB of RAM to provide sufficient processing capacity for the demands of the algorithms. We used an NVIDIA RTX 3060 GPU with 12 GB of onboard memory for graphical processing. The implementation was written in Python 3.6.0, with Keras as the primary neural network library and TensorFlow as the backend framework. For model training, the CycleGAN model was trained for 300 epochs to ensure adequate learning and adaptation to the input data, while the CNNs were trained for 30 epochs. The Adam optimizer was selected for both models because of its adaptive learning rate, which improves the efficiency of network weight updates [38,39]. A mini-batch size of 16 was chosen [40]; this size balances computational demand against stable convergence, allowing the model to benefit from the stochastic gradient descent approach while mitigating the risk of unstable training dynamics. The learning rate was set to 0.0001 for gradual convergence, which is crucial for achieving reliable training outcomes. Cross-entropy served as the loss function, an effective choice for classification problems because it quantifies the difference between the predicted probabilities and the actual categorical distribution [41].
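To make these settings concrete, the following is a minimal sketch of a comparable Keras/TensorFlow training configuration. The `build_classifier` helper, the input size, and the commented `fit` call are illustrative assumptions, not the authors' code.

```python
# A minimal sketch of the classifier training configuration described above
# (Adam, learning rate 1e-4, categorical cross-entropy, mini-batch size 16,
# 30 epochs). The model construction and data are placeholders.
from tensorflow import keras

NUM_CLASSES = 4            # COVID-19, normal, opacity, viral
IMG_SHAPE = (224, 224, 3)  # assumed VGG16 input size

def build_classifier():
    # Hypothetical stand-in for the (modified) VGG16 backbone.
    base = keras.applications.VGG16(include_top=False, weights=None,
                                    input_shape=IMG_SHAPE, pooling="avg")
    outputs = keras.layers.Dense(NUM_CLASSES, activation="softmax")(base.output)
    return keras.Model(base.input, outputs)

model = build_classifier()
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-4),  # gradual convergence
    loss="categorical_crossentropy",                      # one-hot targets
    metrics=["accuracy"],
)
# model.fit(x_train, y_train, batch_size=16, epochs=30,
#           validation_data=(x_val, y_val))
```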
To assess the efficacy of the proposed methodology, we employed performance metrics that are instrumental in evaluating the accuracy and robustness of categorization models [42]. These metrics included accuracy, which measures the overall correctness of the model across all categories and provides a straightforward indicator of its ability to predict the correct label for a given input. Precision evaluates the proportion of true positive predictions among all positive predictions made, highlighting the model's ability to minimize false positives. Recall, or sensitivity, gauges the network's capacity to correctly categorize all actual instances of a particular category, which is crucial in applications where missing a positive instance can have severe consequences. The F1-score, the harmonic mean of precision and recall, offers a single balanced score that reflects both the accuracy and completeness of the model's predictions. Additionally, the confusion matrix shows both the successes and the specific areas where the model confuses one category for another. This matrix is invaluable for visualizing the performance of a model across different categories and for identifying trends or biases in misclassifications, which can inform further refinements to the model architecture or training process.
These metrics are defined as
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP},$$
$$\text{Recall} = \frac{TP}{TP + FN}, \qquad \text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},$$
where TP = true positive, TN = true negative, FP = false positive, and FN = false negative.
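For illustration, a minimal Python sketch of these per-class metrics, computed directly from the four counts; the example counts are hypothetical.

```python
# Evaluation metrics as defined above, for a single class,
# computed from true/false positive and negative counts.
def classification_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Example with illustrative counts (not the paper's data):
print(classification_metrics(tp=95, tn=90, fp=5, fn=10))
```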
4.2. Performance of Categorization Models
To assess our methodology, we employed a test collection consisting entirely of images that had not been used in training. This approach ensures that the evaluation of both the standard and modified VGG16 models is conducted in an unbiased and objective manner, accurately reflecting their ability to generalize to new, unseen data. By using completely unseen images, we effectively mimic real-world scenarios in which the models must operate on data they have not previously encountered. Such an evaluation not only confirms the robustness of the models' predictive accuracy but also underscores their potential applicability in clinical environments, where the ability to interpret unfamiliar and variable medical images accurately is paramount.
As shown in Table 5, the performance metrics from the test collection for both the standard and modified VGG16 networks provide significant insights into their diagnostic abilities. The standard VGG16 network achieved an accuracy of 97.61% on the test collection, which indicates a high level of correctness in its predictions across various conditions. The precision of the model, standing at 97.37%, reflects its effectiveness in identifying true positive cases, suggesting that when it predicts a condition, it is usually correct. Furthermore, the recall rate of 97.93% highlights the network's capacity to identify most of the actual positive cases, which is crucial in medical diagnostic settings to avoid overlooking conditions that require intervention. The F1-score, which combines precision and recall into a single measure, was 97.64%, indicating a balanced accuracy in terms of both identifying conditions correctly and not missing significant cases. In comparison, the modified VGG16 model exhibits superior performance across all metrics on the test collection, underscoring the benefits of the modifications implemented. It achieved an accuracy of 98.58%, precision of 98.74%, recall of 98.77%, and an F1-score of 98.76%. These enhancements suggest that the modifications to the VGG16 network have substantially improved its diagnostic precision and reliability. The increased accuracy and precision indicate a more refined ability to categorize images correctly with fewer errors, while the elevated recall and F1-score imply improved comprehensive detection capabilities and a balanced sensitivity–specificity trade-off.
Table 6 provides a comprehensive classification report for both the standard and modified versions of the VGG16 model. The report includes precision, recall, and F1-score across the four diagnostic categories (COVID-19, normal, opacity, and viral), allowing each model's performance to be evaluated in a medical diagnostic context. For the COVID-19 category, the modified model shows higher precision and recall than the standard model, suggesting that the modifications to the VGG16 architecture have markedly enhanced its ability to accurately identify and confirm cases of COVID-19 while minimizing false negatives, which are particularly critical in managing the pandemic effectively. In diagnosing normal conditions, both models perform well, but the modified model demonstrates slight improvements in all metrics, indicating an enhanced capability to differentiate normal anatomical structures from pathological changes, which is crucial in reducing unnecessary medical interventions for healthy patients. The opacity category, often challenging due to subtle radiographic signs that must be distinguished from similar conditions, shows a noticeable improvement in the modified model. The increase in precision and recall suggests that the modifications provide better feature extraction capabilities that help distinguish opacity from other conditions more effectively. Viral conditions, which encompass less common and more diverse pathologies, also show improvement in the modified model, particularly in precision and recall.
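A per-category report of this kind can be produced with scikit-learn's `classification_report`; the sketch below uses placeholder labels and predictions, not the paper's test collection.

```python
# Per-class precision/recall/F1 report, analogous in structure to Table 6.
from sklearn.metrics import classification_report

labels = ["COVID-19", "normal", "opacity", "viral"]
# Illustrative ground truth and predictions (placeholders):
y_true = ["normal", "COVID-19", "opacity", "viral", "normal", "opacity"]
y_pred = ["normal", "COVID-19", "opacity", "viral", "opacity", "opacity"]

print(classification_report(y_true, y_pred, labels=labels, digits=4))
```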
In Table 7, the confusion matrix for the standard VGG16 model indicates robust performance, particularly in identifying normal lung conditions, with a large number of true positives (2034 out of 2081 normal cases). However, it exhibits certain weaknesses, such as misclassifying normal cases as opacity or viral and a less pronounced ability to distinguish COVID-19 from normal cases, as evident from the misclassifications. The modified VGG16 model, also shown in Table 7, demonstrates a significant improvement in overall classification accuracy. It notably enhances the detection of normal cases, increasing true positives to 2057 and reducing false positives compared with the original VGG16 model. These improvements are directly attributed to the modifications in the network architecture, which have enhanced the model's feature extraction layers. Importantly, the modified model also exhibits a decrease in cross-condition misclassifications between the opacity and viral conditions, indicating a refined sensitivity to the unique characteristics of these conditions.
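A confusion matrix like those in Table 7 is typically computed as follows; this sketch uses scikit-learn with placeholder arrays rather than the paper's predictions.

```python
# Multi-class confusion matrix: rows are true classes, columns are
# predicted classes, so off-diagonal entries are misclassifications.
import numpy as np
from sklearn.metrics import confusion_matrix

CLASS_NAMES = ["COVID-19", "normal", "opacity", "viral"]
y_true = np.array([1, 1, 0, 2, 3, 1])  # illustrative ground-truth indices
y_pred = np.array([1, 2, 0, 2, 3, 1])  # illustrative model predictions

cm = confusion_matrix(y_true, y_pred, labels=range(len(CLASS_NAMES)))
print(cm)
```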
In Figure 5a, the ROC curves for the standard VGG16 model exhibit high AUC values across all categories, indicating strong discriminative ability. Specifically, the model shows exceptional performance in detecting COVID-19, with an AUC of 0.9911, underscoring its capability to identify COVID-19-positive cases with high accuracy while maintaining a low rate of false positives. The performance in the other categories (opacity, normal, and viral) also yields high AUC values, suggesting that the network effectively distinguishes between these conditions and healthy or other pathological states. Comparatively, the modified VGG16 model, shown in Figure 5b, enhances these metrics further, as evidenced by its ROC curves. The AUC for COVID-19 reaches 0.9979, reflecting the model's enhanced sensitivity and specificity, traits that are critical in a clinical setting, especially for conditions with significant health implications such as COVID-19. Similarly, the AUCs for the normal, opacity, and viral conditions are notably higher than those of the standard model, which can be attributed to the refined model architecture, improved training procedure, and more sophisticated data handling and processing. These improvements suggest that the modifications to the VGG16 network have effectively addressed limitations in the original model's ability to differentiate between subtle radiographic features of various lung conditions.
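Per-class ROC curves and AUCs for a multi-class classifier are usually obtained in a one-vs-rest fashion, as in the sketch below; the softmax outputs here are randomly generated placeholders, since the paper's probability scores are not available.

```python
# One-vs-rest ROC/AUC per class from softmax probabilities of shape
# (n_samples, n_classes), analogous to the analysis behind Figure 5.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.preprocessing import label_binarize

CLASS_NAMES = ["COVID-19", "normal", "opacity", "viral"]
y_true = np.array([0, 1, 2, 3, 1, 0, 2, 3])          # illustrative labels
rng = np.random.default_rng(0)
y_prob = rng.dirichlet(np.ones(4), size=len(y_true))  # fake softmax outputs

y_bin = label_binarize(y_true, classes=range(len(CLASS_NAMES)))
for i, name in enumerate(CLASS_NAMES):
    fpr, tpr, _ = roc_curve(y_bin[:, i], y_prob[:, i])  # per-class ROC curve
    auc = roc_auc_score(y_bin[:, i], y_prob[:, i])
    print(f"{name}: AUC = {auc:.4f}")
```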
Table 8 presents the statistical analysis in terms of p-values and t-tests. The p-value for the standard VGG16 (0.0542) is slightly above the 0.05 threshold, indicating that the performance improvement observed for this model is not statistically significant. For the modified VGG16, the p-value (0.0426) is below the 0.05 threshold, suggesting that the improvements observed with this model are statistically significant and that the modifications applied to the VGG16 model have genuinely enhanced its performance. The standard model's t-statistic (1.4758) suggests a moderate difference from the modified model. The modified model's t-statistic (1.0214) is lower than that of the standard model, which is interesting given its significant p-value. This suggests that the modification yields a statistically significant improvement that is moderate rather than excessive in magnitude.
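The exact test procedure is not specified above, but a paired t-test over repeated accuracy measurements (e.g., per fold or per run) is one common way to obtain such t- and p-values; the score lists below are illustrative placeholders, not the paper's measurements.

```python
# Paired t-test comparing two models' accuracy scores across runs.
from scipy import stats

baseline_scores = [0.970, 0.974, 0.976, 0.972, 0.975]  # e.g., standard VGG16
modified_scores = [0.984, 0.986, 0.985, 0.987, 0.984]  # e.g., modified VGG16

t_stat, p_value = stats.ttest_rel(modified_scores, baseline_scores)
print(f"t = {t_stat:.4f}, p = {p_value:.4f}")
# p < 0.05 would indicate a statistically significant difference.
```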
As illustrated in Figure 6, we present three X-ray scans for which the model's predictions were incorrect. The first two scans were incorrectly categorized as showing opacity, with confidence scores of 57.38% and 63.72%, respectively, despite being clinically normal. The third image, which shows characteristics typical of opacity, was inaccurately categorized as normal with a confidence of 58.84%. These examples were selected to showcase the variability in the model's performance, particularly in borderline cases that may exhibit characteristics of opacity or of normal lungs but do not meet the clinical criteria for a definitive diagnosis.
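Misclassified cases and their confidence scores of this kind can be pulled directly from the softmax outputs, as in the following sketch; all arrays are illustrative placeholders that loosely echo the reported confidences.

```python
# Identify misclassified test images and report the predicted class
# together with the model's confidence (maximum softmax probability).
import numpy as np

CLASS_NAMES = ["COVID-19", "normal", "opacity", "viral"]
y_true = np.array([1, 1, 2])                    # illustrative true indices
y_prob = np.array([[0.10, 0.25, 0.574, 0.076],  # fake softmax outputs
                   [0.05, 0.25, 0.637, 0.063],
                   [0.06, 0.588, 0.30, 0.052]])

y_pred = y_prob.argmax(axis=1)
conf = y_prob.max(axis=1)
for i in np.where(y_pred != y_true)[0]:
    print(f"image {i}: true={CLASS_NAMES[y_true[i]]}, "
          f"predicted={CLASS_NAMES[y_pred[i]]} ({conf[i]:.2%})")
```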