This section presents the results of the methodology employed for lung and lesion segmentation and for COVID-19 and CAP classification in CT exams. We compared state-of-the-art models using accuracy, precision, recall, F1-score, specificity, Hausdorff Distance, and processing time.
4.1. Lung Segmentation
The first task was to segment the lungs from the background on raw CT slices to remove unnecessary artifacts for COVID-19 and CAP detection. We summarize the results of this step in
Table 6.
In general, all architectures presented excellent results regarding accuracy, F1-score (DSC), and Hausdorff Distance. Resnext101 Unet++ outperformed the other architectures in all metrics, achieving 99.71 ± 0.05%, 98.64 ± 0.19%, and 3.9 ± 0.16 in accuracy, F1-score (DSC), and Hausdorff Distance, respectively. However, all architectures presented a similar performance in the three metrics. In the following sections, we analyze the significance of our results with statistical tests, aiming to confirm their relevance.
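For reference, the sketch below (ours, not the authors' evaluation code) shows how the three reported segmentation metrics can be computed for a single pair of binary masks with NumPy and SciPy; pred and gt are hypothetical 2D boolean arrays holding the predicted and ground-truth lung pixels.

```python
# Minimal sketch of the per-slice segmentation metrics used in this section.
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice_score(pred: np.ndarray, gt: np.ndarray) -> float:
    """F1-score (DSC) between two binary masks."""
    intersection = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2.0 * intersection / denom if denom > 0 else 1.0

def hausdorff_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    """Symmetric Hausdorff Distance between the two foreground pixel sets."""
    p = np.argwhere(pred)   # (row, col) coordinates of predicted foreground pixels
    g = np.argwhere(gt)
    return max(directed_hausdorff(p, g)[0], directed_hausdorff(g, p)[0])

def pixel_accuracy(pred: np.ndarray, gt: np.ndarray) -> float:
    """Fraction of pixels (foreground and background) labeled correctly."""
    return float((pred == gt).mean())
```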
Figure 2 illustrates the segmentation metric boxplots applied for lung segmentation: accuracy, F1-score (DSC), and Hausdorff Distance.
Concerning the accuracy metric, the y-axis shows that all algorithms performed similarly, with accuracy varying from 0.9945 to 0.9975. Nevertheless, some aspects are worth noting when comparing our segmentation architectures. For instance, Resnet50 Unet, Densenet201 Unet, Resnext101 Unet, Densenet201 Unet++, and Resnext101 Unet++ presented the best accuracy medians (Figure 2a), lying higher than the other algorithms' boxes. Moreover, the interquartile ranges of these algorithms were smaller than those of the others, indicating less dispersed accuracy values with a left-skewed distribution. On the other hand, MobilenetV2 FPN presented the lowest accuracy, with more dispersed data and a slightly left-skewed distribution. The remaining algorithms presented competitive accuracy results but with more dispersed and skewed values. In addition, only MobilenetV2 FPN, MobilenetV2 Unet++, Resnet50 Unet++, and MobilenetV2 MAnet had no outliers.
In general, the F1-score behavior was similar. For example, Resnet50 Unet, Densenet201 Unet, Resnext101 Unet, Densenet201 Unet++, and Resnext101 Unet++ again presented the best median values (
Figure 2b), with a left-skewed distribution. However, Resnet101 Unet++ had a more dispersed data distribution.
The architectures had more dispersed data for the Hausdorff metric (
Figure 2c). For example, Resnext101 Unet++ had the lowest median, with a right-skewed distribution, and MobilenetV2 FPN presented the highest Hausdorff median.
Because one accuracy population was not normal (Densenet201 Unet), we applied Friedman’s test with the Nemenyi post hoc test to analyze whether the distributions of the accuracy results differed. We present the test results in
Figure 3a. Differences are significant if the distance between the mean ranks is greater than the Critical Distance (CD).
We failed to reject the null hypothesis that the population was normal for all F1-score populations. Therefore, we assumed that all F1-score populations were normal. We applied Bartlett’s test for homogeneity and failed to reject the null hypothesis that the data were homoscedastic. Thus, we assumed that our data were homoscedastic. Because we had more than two populations and all populations were normal and homoscedastic, we used repeated measures ANOVA as an omnibus test to determine any significant differences among the mean values of the populations. As the results of the ANOVA test were significant, we used Tukey’s HSD post hoc test to infer which differences were significant. Populations were significantly different if their confidence intervals were not overlapping; see
Figure 3.
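The statistical protocol described above (a normality check, followed by either Friedman's test with the Nemenyi post hoc test or repeated measures ANOVA with Tukey's HSD) can be sketched as follows. This is a minimal illustration assuming per-fold metric values in a hypothetical scores dictionary; the library choices (SciPy, statsmodels, scikit-posthocs) are ours rather than the authors', and applying pairwise_tukeyhsd after a repeated-measures ANOVA treats the observations as independent, which is a simplification.

```python
import numpy as np
import pandas as pd
from scipy import stats
import scikit_posthocs as sp
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def compare_architectures(scores: dict[str, list[float]], alpha: float = 0.05):
    """scores maps architecture name -> per-fold metric values (same folds for all)."""
    names = list(scores)
    samples = [np.asarray(scores[n]) for n in names]

    # Check normality (Shapiro-Wilk) and homoscedasticity (Bartlett).
    all_normal = all(stats.shapiro(s).pvalue > alpha for s in samples)
    homoscedastic = stats.bartlett(*samples).pvalue > alpha

    if all_normal and homoscedastic:
        # Omnibus: repeated measures ANOVA; post hoc: Tukey's HSD.
        long = pd.DataFrame(
            [(f, n, v) for n, s in zip(names, samples) for f, v in enumerate(s)],
            columns=["fold", "arch", "value"],
        )
        res = AnovaRM(long, "value", "fold", within=["arch"]).fit()
        print(res.anova_table)                                   # omnibus F-test
        print(pairwise_tukeyhsd(long["value"], long["arch"], alpha=alpha))
    else:
        # Omnibus: Friedman's test; post hoc: Nemenyi (rows = folds, cols = architectures).
        print(stats.friedmanchisquare(*samples))
        print(sp.posthoc_nemenyi_friedman(np.column_stack(samples)))
```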
None of the architectures significantly differed in accuracy, as each had a mean rank distance smaller than the Critical Distance from at least one other evaluated architecture (
Figure 3a). Nonetheless, MobilenetV2 FPN was the architecture whose accuracy differed most from the others.
Most confidence intervals overlapped (Figure 3b), except for MobilenetV2 FPN, the fastest architecture in training and testing (Figure 4). When selecting an architecture, we can choose MobilenetV2 FPN for speed, at the cost of a slight loss in F1-score. If, instead, we prefer a higher F1-score, we can choose any of the other architectures, because their F1-score differences are insignificant. In that case, the best choice would be Resnet50 Unet++, the second-fastest architecture, which, as the test showed, did not significantly differ in F1-score from the other, slower architectures.
The Hausdorff Distance results were generally similar (
Figure 3c). Again, MobilenetV2 FPN had the most significant difference, while other architectures had no significant difference in Hausdorff Distance.
The fastest model for training and testing was MobilenetV2 FPN, and the slowest one was Resnext101 Unet++. However, even though the shortest training time (513.5 s) was more than ten times shorter than the longest one (5304.3 s), the fastest testing time was 1.9 s and the slowest was 8.5 s for evaluating 3677 images, or averages of approximately 0.0005 and 0.0023 s per image, respectively. As complexity increased, the other models' training and testing times grew roughly linearly. We present this behavior in
Figure 4.
4.2. Lesion Segmentation
The second task was to segment lesions inside the lungs from the previously segmented CT slices for COVID-19 and CAP detection. We summarize the results of this step in
Table 7.
All architectures presented excellent results regarding accuracy, F1-score (DSC), and Hausdorff Distance. Densenet201 Unet, Resnet50 Unet++, and Resnext101 Unet++ outperformed the other architectures in accuracy, Densenet201 Unet++ obtained the highest F1-score (DSC), and MobilenetV2 FPN, the fastest architecture, obtained the smallest Hausdorff Distance (Table 7). However, all architectures presented a similar performance in the three metrics. In the following sections, we analyze the significance of our results with statistical tests, aiming to confirm their relevance.
The accuracy results were high because most of each ground-truth image consisted of black background pixels, with only a small percentage being white lesion pixels. When we calculated the accuracy of our models, these black pixels inflated all accuracy results, reducing the credibility of the metric.
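A toy example with made-up masks illustrates this effect: a prediction that misses half of a small lesion still obtains a pixel accuracy above 99.9%, while the F1-score (DSC) drops to about 0.67.

```python
import numpy as np

gt = np.zeros((512, 512), dtype=bool)
gt[250:260, 250:260] = True            # 100 lesion pixels out of 262,144

pred = np.zeros_like(gt)
pred[250:260, 250:255] = True          # only half of the lesion is recovered

accuracy = (pred == gt).mean()
dice = 2 * np.logical_and(pred, gt).sum() / (pred.sum() + gt.sum())
print(f"accuracy = {accuracy:.4f}, dice = {dice:.2f}")
# accuracy ~ 0.9998, dice ~ 0.67: accuracy barely reflects the missed lesion half
```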
Figure 5 illustrates the segmentation metric boxplots applied for lesion segmentation: accuracy, F1-score (DSC), and Hausdorff Distance.
Concerning the accuracy metric, the y-axis shows that all algorithms performed very similarly, with accuracy varying from 0.9980 to 0.9990. Nevertheless, some aspects are worth noting when comparing our segmentation architectures in terms of this metric. For instance, Resnet50 Unet, Densenet201 Unet, Densenet201 Unet++, and Resnext101 Unet++ presented higher accuracy medians (Figure 5a). Moreover, the interquartile ranges of these algorithms were smaller than those of the others, indicating less dispersed accuracy values with a left-skewed distribution.
On the other hand, MobilenetV2 MAnet presented the lowest accuracy, with more dispersed data and a slightly left-skewed distribution. The remaining algorithms presented competitive accuracy results but with more dispersed and skewed values. In addition, only Densenet201 FPN, Resnext101 Unet, Resnet50 Unet++, Densenet201 Unet++, Resnext101 Unet++, and Densenet201 MAnet had no outliers.
Concerning F1-score, the Unet- and Unet++-based architectures (Resnet50 Unet, Densenet201 Unet, MobilenetV2 Unet++, Resnet50 Unet++, and Densenet201 Unet++) presented higher median values with lower dispersion (Figure 5b). On the other hand, Resnext101 Unet had a more dispersed data distribution. Moreover, only the Resnet50 Unet, Resnext101 Unet++, and Resnet50 MAnet architectures presented outliers.
In general, the architectures had less dispersed data for the Hausdorff metric (
Figure 5c). MobilenetV2 FPN, Densenet201 FPN, Resnext101 Unet++, and Resnext101 MAnet presented the lowest Hausdorff medians, with Resnext101 Unet++ having the lowest overall and a right-skewed distribution.
We failed to reject the null hypothesis that the population was normal for all accuracy populations. Therefore, we assumed that all accuracy populations were normal. We applied Bartlett’s test for homogeneity and failed to reject the null hypothesis that the data were homoscedastic. Thus, we assumed that our data were homoscedastic. Because we had more than two populations and all populations were normal and homoscedastic, we used repeated measures ANOVA as an omnibus test to determine any significant differences among the mean values of the populations. As the results from the ANOVA test were significant, we used Tukey’s HSD post hoc test to infer which differences were significant. Populations were significantly different if their confidence intervals were not overlapping; see
Figure 6a.
Because one F1-score and one HD population were not normal (Resnext101 Unet++), we applied Friedman’s test with the Nemenyi post hoc test to analyze whether the distributions of the F1-score and HD results differed. We present the test results in
Figure 6b,c. Differences were significant if the distance between the mean ranks was greater than the Critical Distance (CD).
Most confidence intervals overlapped (Figure 6a), except for MobilenetV2 MAnet, which mainly overlapped with those of Resnext101 FPN and Resnet50 MAnet. Resnext101 FPN and Resnet50 MAnet had similar results in all metrics and similar training and testing times. However, MobilenetV2 MAnet was faster for training and testing, with a small decrease in accuracy (Figure 7). Thus, when selecting an architecture, we can choose MobilenetV2 MAnet for speed, at the cost of a slight loss in accuracy. If, instead, we prefer higher accuracy, we can choose any of the other architectures, because their accuracy differences are insignificant. In that case, the best choice would again be Resnet50 Unet++, the second-fastest architecture, which, as the test showed, did not significantly differ in accuracy from the other, slower architectures.
None of the architectures significantly differed from the others in F1-score, as each had a mean rank distance smaller than the Critical Distance from at least one other evaluated architecture (
Figure 6b). Nonetheless, MobilenetV2 MAnet was the architecture whose F1-score differed most from the others.
In general, the Hausdorff Distance results were similar (
Figure 6c). MobilenetV2 FPN, Resnext101 Unet++, and Resnext101 MAnet had the most significant difference, while other architectures had no significant difference in Hausdorff Distance.
The fastest model for training was MobilenetV2 MAnet, which converged in only 25 epochs, and the fastest for testing was MobilenetV2 FPN. The slowest models for training were Densenet201 Unet++ and Resnext101 Unet++, and the slowest for testing was Resnext101 Unet++. However, even though the shortest training time (573.0 s) was more than thirty times shorter than the longest one (19,164.5 s), the fastest testing time was 2.8 s and the slowest was 12.6 s for evaluating 6493 images, or averages of approximately 0.0004 and 0.0019 s per image, respectively. As complexity increased, the other models' training and testing times grew roughly linearly. We present this behavior in
Figure 7.
Finally, we answer RQ1, as we have shown that all architectures achieved competitive and mostly similar results in lung and lesion segmentation, without statistically significant differences in the metrics. The main differences were in training and testing time: MobilenetV2 FPN was the fastest for lung segmentation training and testing; MobilenetV2 MAnet, for lesion segmentation training; and MobilenetV2 FPN, for lesion segmentation testing.
4.3. Lesion Detection
We first applied our architectures to the remaining 1010 full CT scans of MosMedData to validate our pipeline in a 3D scenario: we detected and segmented all lesions in each exam and then classified the exam as “with lesion” if any lesion was found, or “without lesion” otherwise. The results are summarized in
Table 8.
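A minimal sketch of this exam-level decision rule is given below; segment_lesions is a hypothetical wrapper around one of the trained 2D lesion segmentation models, and the one-pixel default threshold reflects the "if any lesion was found" criterion described above.

```python
import numpy as np

def classify_exam(slices: list[np.ndarray], segment_lesions, min_pixels: int = 1) -> str:
    """Return 'with lesion' if any slice contains at least `min_pixels` lesion pixels."""
    for ct_slice in slices:
        mask = segment_lesions(ct_slice)       # binary lesion mask for one slice
        if mask.sum() >= min_pixels:
            return "with lesion"
    return "without lesion"
```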
All architectures had similar and competitive results on MosMedData. Mobilenet Unet had the highest accuracy, F1-score, and recall, with 94.36%, 96.5%, and 97.39%, respectively; however, it achieved a specificity of only 82.35%. Densenet201 MAnet obtained the highest precision and specificity, with 97.23% and 90.2%, respectively, but a lower accuracy (87.82%) and recall (87.22%).
These metrics indicate that Mobilenet Unet had the smallest number of false negatives (21 exams, or 2.60%) but a higher number of false positives (36 exams, or 17.65%). Therefore, since missing a positive exam is more critical than misclassifying a negative one, Mobilenet Unet might be an efficient option for detecting COVID-19 on MosMedData.
Then, to evaluate our architectures’ robustness, we performed external validation on the SPGC dataset, which was not part of the training/validation/test sets and thus has a different distribution from our original images. Furthermore, the SPGC dataset has CAP exams, which were added to the “with lesion” class.
Table 9 presents the results of all architectures evaluated in this work.
All architectures had similar and competitive results in the external validation on the SPGC dataset. Mobilenet Unet had the highest accuracy and F1-score, with 98.05% and 98.7%, respectively. Thus, we answer RQ2, as Mobilenet Unet detected exams with COVID-19 and CAP lesions and exams without lesions.
Mobilenet Unet is an intermediate architecture, with a small encoder of only 3.4 million parameters and a decoder of 32 million parameters. Its size might have helped it learn the task without overfitting to the distribution of the training/validation/test samples.
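Assuming the models were built with the segmentation_models_pytorch package (an assumption on our part), the encoder and decoder sizes can be inspected as in the sketch below; the printed values depend on the chosen decoder configuration and need not match the figures quoted above exactly.

```python
import segmentation_models_pytorch as smp

# A MobilenetV2-encoder Unet, as a sketch; constructor arguments are ours.
model = smp.Unet(encoder_name="mobilenet_v2", encoder_weights=None,
                 in_channels=1, classes=1)

def count_params(module) -> float:
    """Number of parameters, in millions."""
    return sum(p.numel() for p in module.parameters()) / 1e6

print(f"encoder: {count_params(model.encoder):.1f} M parameters")
print(f"decoder: {count_params(model.decoder):.1f} M parameters")
```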
It is worth mentioning that external validation plays a vital role when comparing CNNs, because it simulates real-world situations, allowing us to choose the architecture that best generalizes for new samples.
4.4. COVID-19 and CAP Classification
We trained eight deep learning models on COVIDxCT to differentiate between COVID-19 and CAP CT slices. This classification step distinguishes between previously segmented lesions caused by the two diseases, since our segmentation models cannot tell COVID-19 and CAP lesions apart. We present our results in
Table 10.
Our results of classifying CT slices as COVID-19 or CAP on COVIDxCT using eight different deep learning models are competitive. All the models achieved high accuracy, F1-score, precision, recall, and specificity. Among the models, Resnext101 achieved the highest overall performance, with accuracy of 96.79%, F1-score of 96.84%, precision of 94.71%, recall of 99.07%, and specificity of 94.52%. The performance of the other models is also noteworthy, with accuracy ranging from 94.84% to 96.79%. Finally, it is worth pointing out that the models’ specificity varied considerably, ranging from 91.22% to 95.32%.
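For clarity, the snippet below shows how these slice-level metrics follow from a binary confusion matrix, assuming COVID-19 is treated as the positive class; the label arrays are hypothetical placeholders, not the COVIDxCT results.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])   # 1 = COVID-19, 0 = CAP (toy labels)
y_pred = np.array([1, 1, 0, 1, 1, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)              # sensitivity for COVID-19
specificity = tn / (tn + fp)              # fraction of CAP slices correctly labeled
f1          = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, specificity, f1)
```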
Then, we externally validated these eight deep learning models on the slices with the most extensive lesions detected on the SPGC dataset, which can be lesions caused by COVID-19 or CAP. Finally, we summarize the results in
Table 11.
These results indicate that the eight deep learning models we evaluated have promising potential for distinguishing COVID-19 from CAP using CT images. Overall, Densenet201 achieved the best performance, with the highest accuracy, F1-score, and specificity. However, it is worth noting that the relatively low specificity means that the models may be more prone to misclassifying CAP exams, an important consideration because the accurate detection of community-acquired pneumonia is also critical for the appropriate treatment and management of patients. It is also important to note that these results were obtained by externally validating the models on a single slice from each CT scan of the SPGC dataset. Because the SPGC dataset has a smaller sample size than the COVIDxCT dataset used for model training, further evaluation on larger and more diverse datasets is needed to fully assess the generalizability and robustness of the models. Furthermore, to use these two-dimensional deep learning models and save processing time, the three-dimensionality of the SPGC CT scans was discarded, which also caused a loss of information.
By merging the segmentation, detection, and classification tasks, we obtained the confusion matrix in
Figure 8. For lung segmentation, we applied Resnext101 Unet++; for lesion segmentation, we applied MobilenetV2 Unet; and for COVID-19 or CAP classification, we used Densenet201. These architectures were selected according to their overall results, mainly focusing on a low false-negative rate.
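The merged pipeline can be summarized by the schematic sketch below (ours, not the released code); lung_model, lesion_model, and classifier stand for the trained Resnext101 Unet++, MobilenetV2 Unet, and Densenet201 networks, preprocess is a hypothetical normalization/resizing step, and the class-index mapping is assumed.

```python
import numpy as np
import torch

@torch.no_grad()
def diagnose_slice(ct_slice: np.ndarray, lung_model, lesion_model, classifier,
                   preprocess) -> str:
    x = preprocess(ct_slice)                               # 1 x C x H x W tensor
    lung_mask = torch.sigmoid(lung_model(x)) > 0.5         # step 1: lung segmentation
    lesion_mask = (torch.sigmoid(lesion_model(x)) > 0.5) & lung_mask  # step 2: lesions
    if lesion_mask.sum() == 0:
        return "without lesion"
    logits = classifier(x * lung_mask)                     # step 3: COVID-19 vs. CAP
    return "COVID-19" if logits.argmax(dim=1).item() == 0 else "CAP"  # index 0 assumed
```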
The confusion matrix shows that the classifier performed well in the COVID-19 class, with a high number of true positives (168) and a low number of false negatives (3). However, there were some misclassifications, as 35% of CAP exams were classified as COVID-19. These results suggest that our classification models could not fully differentiate between the two classes or that there was insufficient information on the CT slice to differentiate between them.
Then, we used GradCAMPlusPlus [
56] to analyze the interpretability of our models with Densenet201. In
Figure 9, we can see that the red spots, which highlight the most critical areas of the image for classification, mainly coincide with our segmentation results obtained with MobilenetV2 Unet.
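A hedged sketch of generating such a heatmap with the pytorch-grad-cam package is shown below; the choice of library, target layer, and class index are our assumptions, and input_tensor is a placeholder for a preprocessed CT slice.

```python
import torch
from torchvision.models import densenet201
from pytorch_grad_cam import GradCAMPlusPlus
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget
from pytorch_grad_cam.utils.image import show_cam_on_image

model = densenet201(num_classes=2).eval()        # binary COVID-19 vs. CAP head
target_layers = [model.features[-1]]             # last feature block (our choice)

input_tensor = torch.randn(1, 3, 224, 224)       # placeholder preprocessed CT slice
rgb_img = input_tensor[0].permute(1, 2, 0).clamp(0, 1).numpy()

cam = GradCAMPlusPlus(model=model, target_layers=target_layers)
heatmap = cam(input_tensor=input_tensor,
              targets=[ClassifierOutputTarget(0)])[0]   # class index 0 assumed = COVID-19
overlay = show_cam_on_image(rgb_img, heatmap, use_rgb=True)
```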
The segmentation of COVID-19 or CAP lesions provides more information than the results of GradCAMPlusPlus, as it is possible to calculate the lesion area for each slice or the lesion volume for a full CT scan. Furthermore, the GradCAMPlusPlus heatmap in Figure 9a might suggest that lesions only occurred in the left lung, while the segmentation showed lesions in both lungs. However, neither of the two methods could determine which characteristics of the lesions the CNNs used to classify an image as COVID-19 or CAP. Thus, further interpretability analysis is necessary, and RQ3 is answered.
4.5. COVID-19 Severity
In order to provide numerical data about the segmented COVID-19 lesions, we calculated the severity of the disease based on the compromised area of the lungs. Then, we applied this methodology to MosMedData and compared the results. A summary is presented in
Table 12.
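A sketch of this severity estimate is given below: the lesion area is divided by the lung area over all slices of a scan and binned into MosMedData-style CT-0 to CT-4 categories. The 25/50/75% cut-offs are an assumption based on the MosMedData grading, since the exact thresholds are not restated here.

```python
import numpy as np

def severity_class(lung_masks: list[np.ndarray], lesion_masks: list[np.ndarray]) -> str:
    """Estimate a CT-0..CT-4 severity class from per-slice binary masks of one scan."""
    lung_area = sum(int(m.sum()) for m in lung_masks)
    lesion_area = sum(int(m.sum()) for m in lesion_masks)
    ratio = lesion_area / lung_area if lung_area else 0.0
    if ratio == 0.0:
        return "CT-0"          # no lesions detected
    if ratio <= 0.25:
        return "CT-1"          # up to 25% of the lungs compromised (assumed threshold)
    if ratio <= 0.50:
        return "CT-2"
    if ratio <= 0.75:
        return "CT-3"
    return "CT-4"
```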
Again, Mobilenet Unet obtained the highest results, with an accuracy of 75.05%, F1-score of 73.26%, precision of 72.67%, and recall of 75.05%. Even though these metrics were not as high as those for binary classification (“without lesion” or “with lesion”), Figure 10 shows, for four architectures (Resnet50 FPN, Mobilenet FPN, Mobilenet Unet, and Densenet MAnet), that our pipeline correctly segmented most lesions present in the CT scans. The lower metrics might be due to the qualitative analysis performed when labeling MosMedData, which we could not replicate with quantitative values.
We present the images where each model found the most extensive lesion area for that specific exam. All architectures located lesions in the same lung regions, indicating consistent performance. However, some architectures were unable to accurately identify certain lesion areas; specifically, MobilenetV2 FPN failed to locate a small lesion in the right lung in the presented image (Figure 10a), while the other three architectures correctly identified it. These findings suggest that, while all architectures performed similarly overall, there were still differences in their ability to identify certain lesion areas, highlighting the importance of selecting the most suitable architecture for a specific task. These difficulties could have worsened the results presented in the confusion matrices in
Table 13.
Despite the success of our models in differentiating COVID-19 from non-COVID-19 cases (as shown in
Table 8), we still observed a high degree of error when it came to distinguishing between different severity classes of COVID-19 on MosMedData. This error may have been due to several factors, such as the incorrect segmentation of lesions on CT scans by our models or the lack of quantified evaluation on MosMedData, as specialists qualitatively evaluated severity. This may have affected our results even when lesions were correctly segmented, thus answering
RQ4.
According to our results, Resnet50 FPN was the fastest architecture on MosMedData, while Densenet201 MAnet was the slowest. Specifically, the average time taken by Resnet50 FPN to segment lesions from all slices of MosMedData was 12.42 s, whereas Resnext101 Unet++ took 17.43 s. Regarding the SPGC dataset, the fastest architecture was Mobilenet FPN, with an average lesion segmentation time of 12.42 s, and the slowest was Densenet MAnet, with 25.33 s. Despite the speed differences, all of these models are viable for real-life usage and can be effectively employed in clinical settings, where speed and accuracy are essential and computational resources might be limited. The choice of model will depend on the user’s specific needs, such as the available computational resources.
4.6. Limitations
The first limitation of our work is that the high cost of CT scans and the high exposure to ionizing radiation limit their widespread adoption in hospitals. This issue contributes to the scarcity of public data for training, real-life testing, and usage. Machine learning models identify patterns based on the data that they are trained on; therefore, a machine learning model is biased when its data are biased. To partially address this issue, we conducted external validation on a new dataset obtained from the literature. However, since only a limited number of datasets are publicly available, ensuring the model’s generalizability remains challenging.
Another limitation is that all the architectures evaluated in this study were based on 2D images, whereas CT scans provide 3D information. Although using 2D images reduces the computational complexity and the training time required by the models, it may not fully capture the complexity of and variations in 3D structures. Consequently, the accuracy of the models in predicting and diagnosing various medical conditions with a 2D approach may be lower than with a 3D approach.
A further limitation is that analyzing only a patient’s CT scan may not be sufficient for a diagnosis. CT scans provide useful information about the body’s internal structures, but they do not provide information about the patient’s symptoms or medical history. Therefore, integrating CT scan analysis with clinical data processed by natural language models could improve diagnostic accuracy. By combining image and language models, physicians can make more informed decisions and provide better treatment options for patients.
While our approach showed encouraging results, other CT scan factors that may contribute to the differentiation between CAP and COVID-19 might not be captured when separately analyzing only 2D slices. Moreover, it is known that there is a significant overlap in the imaging features of COVID-19 and other respiratory diseases, which makes differentiation challenging even with the use of advanced imaging techniques. Therefore, future studies using a more comprehensive approach that includes 3D imaging and clinical data may be necessary to improve the accuracy of COVID-19 diagnosis and further differentiate it from CAP.
Despite these limitations, the findings of this study provide valuable insights into the potential applications of deep learning and computer vision techniques in medical image analysis. Future studies can build upon these findings and further explore using 3D imaging and language models to improve medical diagnosis and treatment accuracy and efficiency.