COVLIAS 1.0Lesion vs. MedSeg: An Artificial Intelligence Framework for Automated Lesion Segmentation in COVID-19 Lung Computed Tomography Scans

Background: COVID-19 is a disease with multiple variants that is quickly spreading throughout the world. It is crucial to identify patients suspected of having COVID-19 early, because the vaccine is not readily available in certain parts of the world. Methodology: Lung computed tomography (CT) imaging can be used to diagnose COVID-19 as an alternative to the RT-PCR test in some cases. The occurrence of ground-glass opacities in the lung region is a characteristic of COVID-19 in chest CT scans, and these are difficult to locate and segment manually. The proposed study consists of a combination of solo deep learning (DL) and hybrid DL (HDL) models to tackle lesion location and segmentation more quickly. One DL and four HDL models, namely PSPNet, VGG-SegNet, ResNet-SegNet, VGG-UNet, and ResNet-UNet, were trained using annotations from an expert radiologist. The training scheme adopted a fivefold cross-validation strategy on a cohort of 3000 images selected from a set of 40 COVID-19-positive individuals. Results: The proposed variability study uses tracings from two trained radiologists as part of the validation. Five artificial intelligence (AI) models were benchmarked against MedSeg. The best AI model, ResNet-UNet, was superior to MedSeg by 9% and 15% for Dice and Jaccard, respectively, when compared against MD 1, and by 4% and 8%, respectively, when compared against MD 2. Statistical tests, namely the Mann-Whitney test, paired t-test, and Wilcoxon test, demonstrated the system's stability and reliability, with p < 0.0001. The online system processed each slice in <1 s. Conclusions: The AI models reliably located and segmented COVID-19 lesions in CT scans. The COVLIAS 1.0Lesion lesion locator passed the intervariability test.


Introduction
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is an infectious virus that poses a concern to humans worldwide. The World Health Organization (WHO) proclaimed COVID-19 (the novel coronavirus disease) a global pandemic on 11 March 2020. COVID-19 is a rapidly spreading illness worldwide, yet hospital resources are limited. As of 1 December 2021, COVID-19 had led to the infection of 260 million people and 5.2 million deaths worldwide [1]. COVID-19 has clearly been shown to involve several molecular pathways [2], leading to myocardial injury [3], diabetes [4], pulmonary embolism [5], and thrombosis [6]. Due to the lack of an effective vaccine or medication, early detection of COVID-19 is critical to saving many lives and safeguarding frontline workers. Many medical staff have become infected through frequent contact with patients, significantly aggravating the already dire healthcare situation.
RT-PCR, or "reverse transcription-polymerase chain reaction", is one of the gold standards for the detection of COVID-19. The main contributions of this study are as follows: (1) The proposed study consists of a combination of solo DL and HDL models to tackle lesion location for faster segmentation. One DL and four HDL models, namely PSPNet, VGG-SegNet, ResNet-SegNet, VGG-UNet, and ResNet-UNet, were trained using annotations from an expert radiologist. (2) The training scheme adopted a fivefold cross-validation strategy on a cohort of 3000 images selected from a set of 40 COVID-19-positive individuals. Performance evaluation was carried out using (a) Dice similarity, (b) the Jaccard index, (c) Bland-Altman plots, and (d) regression plots. (3) COVLIAS 1.0Lesion was benchmarked against the online MedSeg system, demonstrating COVLIAS 1.0Lesion to be superior to MedSeg when compared against Manual Delineation 1 and Manual Delineation 2. (4) The proposed interobserver variability study used tracings from two trained radiologists as part of the validation. (5) Statistical tests, namely the Mann-Whitney test, paired t-test, and Wilcoxon test, demonstrated the system's stability and reliability, along with the p-values. (6) The online system processed each slice in <1 s. The layout of this lesion segmentation study is as follows: In Section 2, we present the patient demographics and types of AI architectures. The results of the experimental protocol using the AI architectures, along with the performance evaluation, are shown in Section 3. The in-depth discussion is elaborated in Section 4, where we present our findings, benchmarking tables, strengths, weaknesses, and extensions of our study. The study concludes in Section 5.


Demographics and Baseline Characteristics
Approximately 3000 CT images (collected from 40 patients from Croatia) were used to create the training cohort (Figure 2). The patients had a mean age of 66 years (SD 7.988), with 35 males (71.4%) and the remainder females. In the cohort, the average GGO and consolidation scores were 2 and 1.2, respectively. Out of the 40 patients who participated in this study, all had a cough, 85% had dyspnoea, 28% had hypertension, 14% were smokers, and none had a sore throat, diabetes, COPD, or cancer. None of them were admitted to the intensive care unit (ICU) or died due to COVID-19 infection.

Image Acquisition and Data Preparation
This proposed study used a Croatian cohort of 40 COVID-19-positive patients. The retrospective cohort study was conducted from 1 March to 31 December 2020, at the University Hospital for Infectious Diseases (UHID) in Zagreb, Croatia. All patients over the age of 18 who agreed to participate in the study had a positive RT-PCR test for the SARS-CoV-2 virus, underwent thoracic MDCT during their hospital stay, and met at least one of the following criteria prior to starting the study: hypoxia (oxygen saturation below 92%), tachypnea (respiratory rate above 22 per minute), tachycardia (pulse rate above 100), or hypotension (systolic blood pressure below 100 mmHg). The UHID Ethics Committee approved the study. The acquisition was carried out using a 64-detector FCT Speedia HD scanner (Fujifilm Corporation, Tokyo, Japan, 2017), while the acquisition protocol consisted of a single full inspiratory breath-hold for collection of CT scans of the thorax in the craniocaudal direction.
Researchers used Hitachi Ltd.'s (Tokyo, Japan) Whole-Body X-ray CT System with Supria Software, and a typical imaging method to view the images (System Software Version: V2.25, Copyright Hitachi, Ltd., 2017). When scanning, the following values were used: wide focus, 120 kV tube voltage, 350 mA tube current, and 0.75 s rotation speed in the IntelligentEC (automatic tube-current modulation) mode. We followed the standardized protocol for reconstruction as adopted in our previous studies where, for multi-recon options, the field of view was 350 mm, the slice thickness was 5 mm (0.625 × 64), and the table pitch was 1.3281. We selected a slice thickness of 1.25 mm and a recon index of 1 mm for picture filter 22 (lung standard) with the Intelli IP Lv.2 iterative algorithm (WW1600/WL600). Furthermore, for picture filter 31 (mediastinal), with the Lv.3 Intelli IP iterative algorithm (WW450/WL45), the slice thickness was 1.25 mm and the recon index was 1 mm.
Scanned areas were chosen based on the absence of metallic objects and reasonable image quality without artefacts or blurriness caused by patient movement during the scan. Each patient's CT volume in this cohort consisted of ~300 slices. The senior radiologist (K.V.) carefully selected ~70 CT slices (512 × 512 px²) that preserved most of the lung region (only accounting for about 20% of the total CT slices). Figures 3 and 4 show the annotated lesions from tracers 1 and 2, respectively, in red, with the raw CT image as the background.

The Deep Learning Models
The proposed study consists of a combination of solo deep learning (DL) and hybrid DL (HDL) models to tackle lesion location and lesion segmentation more quickly. It was recently shown that the combination of two DL models has more feature-extraction power than a solo DL model; this motivated the innovation of combining two solo DL models. This study therefore implemented four HDL models, namely VGG-SegNet, ResNet-SegNet, VGG-UNet, and ResNet-UNet, trained using annotations from an expert radiologist. These were then benchmarked against a solo DL model, namely PSPNet.
The VGGNet architecture was designed to reduce training time by replacing the large 11- and 5-sized kernel filters of earlier networks with stacks of small filters [62]. VGGNet was extremely efficient and fast, but it suffered from an optimization problem due to vanishing gradients: during backpropagation, the gradient is multiplied layer by layer at each epoch, so the updates to the initial layers become very small, and those layers train with substantially smaller or no weight changes. Residual Network, or ResNet [63], was created to address this issue. A new link called the "skip connection" was introduced in this architecture, allowing gradients to bypass a limited number of layers and thereby resolving the vanishing gradient problem. Furthermore, during the backpropagation step, another modification to the network, namely an identity function, keeps the local gradient at a non-zero value.
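The effect of the identity path on gradient flow can be sketched numerically: in a plain chain of layers, the upstream gradient is the product of the per-layer derivatives, while a residual layer contributes f'(x) + 1 to the chain. A toy illustration with illustrative values only (not the paper's code):

```python
# Toy demo: why skip connections counteract vanishing gradients.
# Suppose each layer's local derivative is 0.5; after 20 layers the plain
# chain-rule product shrinks geometrically, while the residual path adds 1.
depth = 20
plain_grad = 1.0
residual_grad = 1.0
for _ in range(depth):
    layer_grad = 0.5                   # local derivative of the layer's transform
    plain_grad *= layer_grad           # plain network: gradient shrinks geometrically
    residual_grad *= (layer_grad + 1)  # residual layer: d/dx [f(x) + x] = f'(x) + 1

print(plain_grad)     # ~1e-6: effectively vanished
print(residual_grad)  # >1: gradient survives to the initial layers
```

The identity term guarantees each factor in the product is at least 1, so the gradient reaching the earliest layers cannot collapse to zero.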
The HDL models were designed by combining one DL (i.e., VGG or ResNet, in our study) with another DL (i.e., UNet or SegNet, in our study), thereby producing a superior network with the advantages of both parent networks. The VGG-SegNet, VGG-UNet, ResNet-SegNet, and ResNet-UNet architectures employed in this research are made up of three parts: an encoder, a decoder, and a pixel-wise softmax classifier. The details of the SDL and HDL models are discussed in the following sections.

PSPNet-Solo DL Model
The pyramid scene parsing network (PSPNet) [64] is a semantic segmentation network that takes into account the image's overall context. PSPNet includes four sections in its design (Figure 5): (1) input, (2) feature map, (3) pyramid pooling module, and (4) output [65,66]. The image to be segmented is fed into the network, which then uses a set of dilated convolution and pooling blocks to extract the feature map. The network's heart is the pyramid pooling module, which helps capture the global context of the image/feature map constructed in the previous stage. This module is divided into four sections, each with its own scaling factor. The scaling options are 1, 2, 3, and 6, with 1 × 1 scaling assisting in the acquisition of spatial data, thereby increasing the resolution of the acquired features; the higher-resolution features are captured by the 6 × 6 scaling. All of the outputs from these four components are pooled at the end of this module using global average pooling. The global average pooling output is sent to a collection of convolutional layers in the final section. Finally, the output binary mask generates the collection of prediction classes. The main advantage of PSPNet is the global feature extraction using the pyramid pooling strategy.
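The pyramid pooling idea at scales 1, 2, 3, and 6 amounts to adaptive average pooling of the feature map into grids of those sizes. A minimal NumPy sketch of that step, assuming a single-channel 2D feature map (illustrative only, not the network implementation):

```python
import numpy as np

def pyramid_pool(feature_map, scales=(1, 2, 3, 6)):
    """Adaptive average pooling at the PSPNet scales.

    Each scale s divides the feature map into an s x s grid and averages
    each cell; s=1 captures the global context, s=6 the finest context.
    """
    h, w = feature_map.shape
    pooled = []
    for s in scales:
        out = np.zeros((s, s))
        ys = np.linspace(0, h, s + 1).astype(int)  # row boundaries of the grid
        xs = np.linspace(0, w, s + 1).astype(int)  # column boundaries of the grid
        for i in range(s):
            for j in range(s):
                out[i, j] = feature_map[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].mean()
        pooled.append(out)
    return pooled
```

For a 6 × 6 feature map, the 1 × 1 output is the global mean, while the 6 × 6 output reproduces the map itself; in PSPNet these pooled grids are upsampled and concatenated with the original features before the final convolutions.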

Two SegNet-Based HDL Model Designs-VGG-SegNet and ResNet-SegNet
The VGG-SegNet architecture used in this study (Figure 6) consists of three components: an encoder, a decoder, and a pixel-wise softmax classifier at the end. It consists of 16 convolution (conv) layers (green in color), compared to the 13 in the SegNet [67] design (VGG backbone). The difference between ResNet-SegNet (Figure 7) and VGG-SegNet (Figure 6) lies in the encoder and decoder parts: the VGG is replaced by the ResNet [63] architecture in the encoder part. The skip connections are shown by the horizontal lines running from encoder to decoder in Figure 7, which help in retaining the features. To overcome the vanishing gradient problem, a new link known as the "skip connection" (Figure 7) was introduced in this architecture, allowing the gradients to bypass a set number of levels [68,69]. The ResNet encoder consists of conv blocks and identity blocks (Figure 7). The conv block consists of three serial 1 × 1, 3 × 3, and 1 × 1 convolution blocks in parallel to a 1 × 1 convolution block, which is then added at the end. The identity block is similar to the conv block, except that it uses a skip connection. Since VGG is faster and SegNet is a basic segmentation network, this segmentation process is relatively fast; thus, VGG-SegNet is more advantageous than SegNet alone. On the other hand, ResNet-SegNet is more accurate, since it has a greater number of layers, and prevents the vanishing gradient problem.
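The two ResNet building blocks can be sketched in miniature. The following hypothetical NumPy sketch uses small weight matrices as stand-ins for the 1 × 1/3 × 3 convolution stacks (illustrative only, not the actual network code):

```python
import numpy as np

def conv_block(x, w_main, w_short):
    """ResNet 'conv block' sketch: a main path (stand-in for the serial
    1x1-3x3-1x1 convolutions) is added to a projected shortcut (stand-in
    for the parallel 1x1 convolution), followed by ReLU."""
    return np.maximum(w_main @ x + w_short @ x, 0)

def identity_block(x, w_main):
    """ResNet 'identity block' sketch: the main path is added to the
    unmodified input (the skip connection), followed by ReLU."""
    return np.maximum(w_main @ x + x, 0)
```

Even when the main path contributes nothing (zero weights), the identity block passes its input through unchanged, which is what keeps gradients alive through deep encoders.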

Two UNet-Based HDL Model Designs: VGG-UNet and ResNet-UNet
VGG-UNet (Figure 8) and ResNet-UNet (Figure 9) are based on the classic UNet structure, which consists of encoder (downsampling) and decoder (upsampling) components. The VGG-19 [62,70-72] and ResNet-51 [58,63,73,74] models replace the downsampling encoder in VGG-UNet and ResNet-UNet, respectively. These architectures improve on the traditional UNet [75], since each level's traditional convolution blocks are replaced by the VGG and ResNet blocks in VGG-UNet and ResNet-UNet, respectively. Note that the skip connection in VGG-UNet is shown by the horizontal lines running from encoder to decoder in Figure 8, which help in retaining the features, similar to Figure 7 in VGG-SegNet. To overcome the vanishing gradient problem, a new link known as the "skip connection" (Figure 9) was introduced in this architecture, allowing gradients to bypass a set number of levels [68,69]. The ResNet encoder consists of conv blocks and identity blocks (Figure 9), very similar to ResNet-SegNet, as shown in Figure 7. The conv block consists of three serial 1 × 1, 3 × 3, and 1 × 1 convolution blocks in parallel to a 1 × 1 convolution block, which is then added at the end. The identity block is similar to the conv block, except that it uses a skip connection. The key advantage of VGG-UNet over UNet is its higher speed of operation, while ResNet-UNet offers better accuracy and avoids the vanishing gradient problem due to the new skip connections.
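The encoder-to-decoder skip connections of Figures 8 and 9 re-inject high-resolution encoder features into the decoder by channel-wise concatenation. A minimal sketch of this merge step, assuming channel-first arrays (illustrative only, not the network code):

```python
import numpy as np

def unet_skip_merge(encoder_feat, decoder_feat):
    """UNet-style skip connection: the encoder feature map is concatenated
    channel-wise with the (already upsampled) decoder feature map, so fine
    spatial detail lost during downsampling is re-injected into the decoder.
    Arrays are (channels, height, width)."""
    assert encoder_feat.shape[1:] == decoder_feat.shape[1:], "spatial sizes must match"
    return np.concatenate([encoder_feat, decoder_feat], axis=0)
```

The decoder's next convolution then sees both the coarse semantic features and the fine boundary detail, which is why the horizontal lines in Figures 8 and 9 "help in retaining the features".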

Loss Function for SDL and HDL Models
The new models adopted the cross-entropy (CE) loss function during model generation [76-78]. If α_CE represents the CE loss function, pr_i represents the classifier's probability used in the AI model, x_i represents the input gold-standard label 1, and (1 − x_i) represents the gold-standard label 0, then the loss function can be expressed mathematically as shown in Equation (1):

α_CE = −Σ_i [x_i × log(pr_i) + (1 − x_i) × log(1 − pr_i)]    (1)

where × represents the product of the two terms.
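Equation (1) can be sketched in a few lines, assuming `x` holds the gold-standard pixel labels and `pr` the predicted probabilities; the small epsilon guarding against log(0) and the averaging over pixels are implementation choices, not stated in the paper:

```python
import math

def binary_cross_entropy(x, pr, eps=1e-12):
    """Pixel-wise binary cross-entropy as in Equation (1).

    x  -- gold-standard labels (0 or 1), one per pixel
    pr -- classifier probabilities in [0, 1], one per pixel
    Returns the mean negative log-likelihood over the pixels.
    """
    n = len(x)
    return -sum(xi * math.log(p + eps) + (1 - xi) * math.log(1 - p + eps)
                for xi, p in zip(x, pr)) / n
```

Perfect predictions drive the loss toward zero, while a maximally uncertain prediction (pr = 0.5) costs log 2 per pixel, which is what the optimizer minimizes during training.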

Experimental Protocol
The AI models' accuracy was determined using a standardized cross-validation (CV) technique. Using the AI framework, our group produced a number of CV-based protocols of various types. We adopted a fivefold cross-validation protocol consisting of 80% training data (2400 scans), while the remaining 20% were testing data (600 CT scans). The choice of the fivefold cross-validation was due to the mild COVID-19 conditions. Five folds were created in such a way that each fold had the opportunity to serve as a distinct test set. The K5 protocol included an internal validation mechanism in which 10% of the data were considered for validation.
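The fold construction can be sketched as follows: each of the five folds serves once as the 20% test set (600 of 3000 images) while the remaining 80% (2400 images) is used for training. A hypothetical index-splitting helper (the study's actual partitioning code is not given in the paper):

```python
def fivefold_splits(n_images, k=5):
    """Partition image indices into k folds; yield (train, test) index lists
    where each fold serves once as the test set (K5 protocol)."""
    idx = list(range(n_images))
    folds = [idx[i::k] for i in range(k)]  # round-robin assignment to k folds
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test
```

For 3000 images this yields five 2400/600 train/test splits; in practice a further 10% of each training split would be held out for the internal validation mentioned above.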
The AI systems' accuracy was determined by comparing the predicted output to ground-truth pixel values. Because the output lung mask was either black or white, these readings were interpreted as binary (0 or 1) integers. Finally, the sum of correctly labelled pixels was divided by the total number of pixels in the image. Using the standardized symbols for truth tables for the determination of accuracy, we used TP, TN, FN, and FP to denote true positive, true negative, false negative, and false positive, respectively. The AI systems' accuracy can be mathematically expressed as shown in Equation (2):

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (2)
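Equation (2) reduces to a one-line computation over the truth-table counts:

```python
def pixel_accuracy(tp, tn, fp, fn):
    """Equation (2): fraction of pixels labelled correctly,
    given true/false positive/negative pixel counts."""
    return (tp + tn) / (tp + tn + fp + fn)
```

For example, a slice with 50 true positive, 40 true negative, 5 false positive, and 5 false negative pixels scores an accuracy of 0.9.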

Results
This proposed study is an improvement on the previously published COVLIAS 1.0Lung system, with lesion segmentation. This study uses a cohort of 3000 images for a set of 40 COVID-19-positive patients, with five AI models utilizing a fivefold CV technique. The training was carried out on one set of manual delineations from a senior radiologist. Figure 10 shows the accuracy and loss plots using the best AI model (ResNet-UNet) out of the five models used in this proposed study. Figure 11 shows the overlay of the AI-predicted lesions (green) in rows 3-7 against manual delineation (red, row 2), with raw CT images (row 1) as the background. Figures A1-A5 show the outputs from PSPNet, VGG-SegNet, ResNet-SegNet, VGG-UNet, and ResNet-UNet, respectively. Figure A6 shows the visual lesion overlays of MedSeg (green) vs. MD (red).

Performance Evaluation
This proposed study uses (1) the Dice similarity coefficient (DSC) [79,80], (2) the Jaccard index (JI) [81], (3) Bland-Altman (BA) plots [82,83], and (4) receiver operating characteristics (ROC) [84-86] for the five AI models against MD 1 and MD 2 for performance evaluation. The same metrics are used for MedSeg to validate the five AI models against it. Figures 12-16 show the cumulative frequency distribution (CFD) plots for DSC and JI from PSPNet, VGG-SegNet, ResNet-SegNet, VGG-UNet, and ResNet-UNet, respectively, and depict the score at an 80% threshold. The CFD plots for DSC and JI from the MedSeg model, used for validating the COVLIAS 1.0Lesion system, are shown in Figure 17. This study also uses manual delineation from two trained radiologists (K.V. and G.L.) to validate the results of the five AI models and MedSeg. Figure 18 shows lesions detected by the best AI model (ResNet-UNet) and MedSeg, along with MD by the two trained radiologists (K.V. and G.L.).
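The two headline metrics reduce to overlap counts between the predicted and gold-standard binary masks: DSC = 2|A∩B|/(|A|+|B|) and JI = |A∩B|/|A∪B|. A self-contained sketch over flat 0/1 pixel lists (illustrative, not the study's evaluation code):

```python
def dice_jaccard(pred, gold):
    """Dice similarity coefficient and Jaccard index between two binary
    masks given as flat sequences of 0/1 pixel labels."""
    inter = sum(p and g for p, g in zip(pred, gold))  # |A ∩ B|
    p_sum, g_sum = sum(pred), sum(gold)
    union = p_sum + g_sum - inter                      # |A ∪ B|
    dsc = 2 * inter / (p_sum + g_sum) if (p_sum + g_sum) else 1.0
    ji = inter / union if union else 1.0
    return dsc, ji
```

Note that DSC = 2·JI/(1+JI), which is why the DSC percentage gains over MedSeg reported below are smaller than the corresponding JI gains.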
Diagnostics 2022, 12, 1283
Table 1 presents the DSC and JI scores for the five AI models using MD 1 and MD 2. The left-hand side of the table shows the statistical computation using MD 1, while the right-hand side shows the statistical computation using MD 2. The first five rows are the five AI models. The percentage difference is the difference between the AI model and the MedSeg model. As can be seen, the five AI models (ResNet-SegNet, PSPNet, VGG-SegNet, VGG-UNet, and ResNet-UNet) are all better than MedSeg, by 1%, 4%, 4%, 5%, and 9%, respectively. The mean Dice similarity for all five models is 0.8, which is better than that of MedSeg by 5%.
The same is true for the Jaccard index where, as can be seen, the five AI models (ResNet-SegNet, PSPNet, VGG-SegNet, VGG-UNet, and ResNet-UNet) are all better than MedSeg, by 2%, 5%, 6%, 8%, and 15%, respectively. The mean JI is 0.66 which is better than that of MedSeg by 7%. Thus, in summary, both the Dice similarity and Jaccard index in all five AI models are better than those of the MedSeg model.  Table 1 presents the DSC and JI scores for five AI models using MD 1 and MD 2. The left-hand side of the table shows statistical computation using MD 1, while the right-hand side of the table shows the statistical computation using MD 2. The first five rows are the five AI models. The percentage difference is the difference between the AI model and the MedSeg model. As can be seen, the five AI models (ResNet-SegNet, PSPNet, VGG-SegNet, VGG-UNet, and ResNet-UNet) are all better than MedSeg, by 1%, 4%, 4%, 5%, and 9%, respectively. The mean Dice similarity for all five models is 0.8, which is better than that of MedSeg by 5%. The same is true for the Jaccard index where, as can be seen, the five AI models (ResNet-SegNet, PSPNet, VGG-SegNet, VGG-UNet, and ResNet-UNet) are all better than MedSeg, by 2%, 5%, 6%, 8%, and 15%, respectively. The mean JI is 0.66 which is better than that of MedSeg by 7%. Thus, in summary, both the Dice similarity and Jaccard index in all five AI models are better than those of the MedSeg model.  We also used another manual delineation system (G.L.), labelled as MD 2. The behavior was consistent with that of MD 2. The Dice similarity in the five AI models was superior to that of MedSeg by 4%, 0%, 4%, 1%, and 4%, respectively. Similarly, the JI was superior to that of MedSeg by 5%, 2%, 8%, 3%, and 8%, respectively. The mean Dice similarity using MD 2 was superior by 3%, while the mean Jaccard index was superior by 5%, thus proving our hypothesis. 
Figures 19-23 show the correlation coefficient (CC) plots for the five AI models against MD 1 and MD 2. Each plot also shows the corresponding CC value, with p < 0.0001. Finally, we also present the benchmarking against MedSeg in Figure 24, against MD 1 and MD 2. Table 2 presents the CC scores for the five AI models, along with the means of these AI models and MedSeg against MD 1 and MD 2, and the percentage difference between the results of the AI models and MedSeg.
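The CC values underlying such plots are Pearson correlations between the per-slice lesion areas estimated by an AI model and those from a manual delineation, together with the significance level. A minimal sketch (synthetic paired areas, not the study's measurements):

```python
import numpy as np
from scipy import stats

# Hypothetical per-slice lesion areas (mm^2): manual delineation (MD) vs. an
# AI model that tracks MD with small multiplicative noise (illustrative only)
rng = np.random.default_rng(1)
area_md = rng.uniform(50, 2000, 200)
area_ai = area_md * rng.normal(1.0, 0.05, 200)

# Pearson correlation coefficient and its two-sided p-value
cc, p_value = stats.pearsonr(area_ai, area_md)
print(f"CC = {cc:.3f}, p = {p_value:.2e}")
```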

Statistical Validation
To assess the system's dependability and stability, standard tests-namely, paired t-tests [87,88], Mann-Whitney tests [89][90][91], and Wilcoxon tests [92]-were utilized. MedCalc software (Ostend, Belgium) was used for the statistical analysis [93,94]. To validate the system described in the study, we supplied 13 potential combinations for the five AI models and MedSeg against MD 1 and MD 2. Table 3 displays the Mann-Whitney test, paired t-test, and Wilcoxon test findings. Using the varying-threshold strategy, one can compute COVLIAS's diagnostic performance using receiver operating characteristics (ROC). The ROC curves and area under the curve (AUC) values for the five (two new and three old) AI models are depicted in Figure 25, with AUC values of more than ~0.85 and ~0.75 for MD 1 and MD 2, respectively. The BA computation strategy [95,96] was used to demonstrate the consistency of two methods. We show the mean and standard deviation of the lesion area for the AI models (Figures 26-30) and MedSeg (Figure 31), plotted against MD 1 and MD 2.
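The three tests can be reproduced outside MedCalc with standard statistical libraries. A minimal sketch using scipy on hypothetical paired per-slice lesion areas (stand-in values, not the study's data):

```python
import numpy as np
from scipy import stats

# Hypothetical paired per-slice lesion areas: AI model vs. manual delineation
rng = np.random.default_rng(2)
md = rng.uniform(100, 1500, 120)
ai = md + rng.normal(0, 30, 120)  # small, unbiased disagreement

u_stat, p_mw = stats.mannwhitneyu(ai, md)  # Mann-Whitney U test (rank-based)
t_stat, p_tt = stats.ttest_rel(ai, md)     # paired t-test
w_stat, p_wx = stats.wilcoxon(ai, md)      # Wilcoxon signed-rank test (paired)

for name, p in [("Mann-Whitney", p_mw), ("paired t", p_tt), ("Wilcoxon", p_wx)]:
    print(f"{name}: p = {p:.3f}")
```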

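The Bland-Altman computation behind Figures 26-31 reduces to per-pair means and differences, the bias, and the 95% limits of agreement. A minimal sketch (synthetic paired lesion areas, not the study's measurements):

```python
import numpy as np

# Hypothetical paired lesion-area measurements (AI vs. MD); illustrative only
rng = np.random.default_rng(3)
md = rng.uniform(100, 1500, 150)
ai = md + rng.normal(0, 40, 150)

# Bland-Altman statistics: per-pair mean and difference, bias, limits of agreement
mean_pair = (ai + md) / 2.0
diff_pair = ai - md
bias = diff_pair.mean()
sd = diff_pair.std(ddof=1)
loa_low, loa_high = bias - 1.96 * sd, bias + 1.96 * sd
print(f"bias = {bias:.1f}, 95% limits of agreement = [{loa_low:.1f}, {loa_high:.1f}]")
```

Plotting `diff_pair` against `mean_pair` with horizontal lines at the bias and the two limits of agreement yields the BA plot.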

Discussion
This proposed study presents automated lesion detection in an AI framework using SDL and HDL models-namely, (1) PSPNet, (2) VGG-SegNet, (3) ResNet-SegNet, (4) VGG-UNet, and (5) ResNet-UNet-trained with a fivefold cross-validation strategy on a set of 3000 manually delineated images. As part of the benchmarking strategy, we compared the five AI models against MedSeg. As part of the variability study, we utilized the lesion annotations from another tracer to validate the results of the five AI models and MedSeg. We used four kinds of metrics to evaluate the five AI models, namely (1) DSC, (2) JI, (3) BA plots, and (4) ROC curves. The best AI model, ResNet-UNet, was superior to MedSeg by 9% and 15% for the Dice similarity and Jaccard index, respectively, when compared against MD 1, and by 4% and 8%, respectively, when compared against MD 2. Statistical tests-namely, the Mann-Whitney test, paired t-test, and Wilcoxon test-demonstrated its stability and reliability. The training, testing, and evaluation of the AI models were carried out using NVIDIA's DGX V100, with multi-GPU training used to speed up the process. The online system processed each slice in <1 s. Table 2 shows the CC values of all of the AI models against MD 1 and MD 2; furthermore, it also presents a benchmark against MedSeg. The results show consistency, where ResNet-UNet is the best model amongst all of the AI models: it is 14% and ~2% better than MedSeg for MD 1 and MD 2, respectively.
The primary attributes used for comparison of the five models are shown in Table 4.

Short Note on Lesion Annotation
Ground-truth annotation is always a challenge in AI [97,98]. In our scenario, in certain CT slices, the lesions overlapped, making it difficult to ensure precise lesion annotations. Some opacities are borderline, and the radiologist's decision may be highly subjective, resulting in false positives or false negatives. When it is difficult to notice and differentiate opacities in patients with COVID-19, or with cardiac disorders, emphysema, fibrosis, or autoimmune diseases with pulmonary manifestation, the differences in experience are particularly significant for the annotation of complex investigations [99][100][101][102][103][104][105].

Explanation and Effectiveness of the AI-Based COVLIAS System
The proposed study uses five AI-based models-PSPNet, VGG-SegNet, ResNet-SegNet, VGG-UNet, and ResNet-UNet-for COVID-19-based lesion detection, and presents a comparison against an existing system in the same domain, known as MedSeg. This proposed study uses (1) DSC (Equation (3)), (2) JI (Equation (4)), (3) BA plots, and (4) ROC curves for the five AI models against MD 1 (or GS 1) and MD 2 (or GS 2) for performance evaluation, to prove the effectiveness of the AI-based COVLIAS system. The same four metrics were used for MedSeg against MD 1 and MD 2 to validate the five AI-based COVLIAS models against it.
DSC(X, Y) = 2|X ∩ Y| / (|X| + |Y|)  (3)

JI(X, Y) = |X ∩ Y| / |X ∪ Y|  (4)

where X is the set of pixels of image 1 (the ground-truth, or manually delineated, image), and Y is the set of pixels of image 2 (the AI-predicted image from COVLIAS 1.0Lesion).
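The two set definitions above can be evaluated directly on binary segmentation masks. A minimal sketch (toy 4x4 masks, not the study's data):

```python
import numpy as np

def dice(x: np.ndarray, y: np.ndarray) -> float:
    """DSC = 2|X intersect Y| / (|X| + |Y|) for binary masks."""
    inter = np.logical_and(x, y).sum()
    denom = x.sum() + y.sum()
    return 2.0 * inter / denom if denom else 1.0

def jaccard(x: np.ndarray, y: np.ndarray) -> float:
    """JI = |X intersect Y| / |X union Y| for binary masks."""
    inter = np.logical_and(x, y).sum()
    union = np.logical_or(x, y).sum()
    return inter / union if union else 1.0

# Toy 4x4 masks: ground truth vs. prediction
gt = np.array([[0,0,1,1],[0,1,1,1],[0,1,1,0],[0,0,0,0]], dtype=bool)
pred = np.array([[0,0,1,1],[0,1,1,0],[0,1,1,0],[0,0,0,0]], dtype=bool)
print(f"DSC = {dice(gt, pred):.3f}, JI = {jaccard(gt, pred):.3f}")
```

For binary masks the two metrics are linked by the identity JI = DSC / (2 - DSC), so the same ranking of models is expected under both, as indeed observed in Tables 1.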
Ding et al. [109] presented MT-nCov-Net, a multitask DL network that segments both lungs and lesions in CT scans, based on Res2Net50 [121] as its backbone. This study used five different CT image databases, totaling more than 36,000 images. Augmentation techniques such as random flipping, rotation, cropping, and Gaussian blurring were also applied. The Dice similarity was 0.86. Hou et al. [110] demonstrated the use of an improved Canny edge detector [122,123] on CT images to detect COVID-19 lesions, using a dataset of about 800 CT images. Lizzi et al. [112] adopted a cascaded UNet for COVID-19-based lesion segmentation on CT images. Various augmentation techniques-such as zooming, rotation, Gaussian noise, elastic deformation, and motion blur-were used in this study. The Dice similarity coefficient (DSC) was 0.62, which is lower than the 0.86 of Ding et al. [109]. ResNet-50 and Xception-Net [124] were used as the backbone of the DR-ML network demonstrated by Qi et al. [113]. This study used ~2400 CT images, with rotation, reflection, and translation as image augmentation techniques. DSC was not reported in this study, but it had an AUC of 0.94. Paluru et al. [114] presented a combination of UNet and ENet, named Anam-Net, designed for COVID-19-based lesion segmentation from lung CT images. The model was trained using a cohort of ~4300 images, and the input image to this model had to be a segmented lung. Anam-Net was benchmarked against ENet, UNet++, SegNet, LEDNet, etc. No augmentation was reported, and the DSC was 0.77. The authors demonstrated an Android application and a deployment of Anam-Net on an edge device to perform COVID-19-based lesion segmentation. Zhang et al. [115] demonstrated CoSinGAN-the only generative adversarial network (GAN) of its kind for COVID-19-based lesion segmentation. Only ~700 CT lung images were used by this GAN in the training process, with no augmentation techniques.
The DSC was 0.75 for CoSinGAN, which was benchmarked against other models. Singh et al. [111] modified the basic UNet architecture for lesion detection and heatmap generation. LungINFseg, a modified UNet architecture, was developed using a cohort of 1800 CT lung images with some augmentation techniques, and it reported a DSC of 0.8. The results of the modified UNet were benchmarked against some previously published segmentation networks, such as FCN [125], UNet, SegNet, Inf-Net [126], MIScnn [127,128], etc. The use of UNet with a multiresolution approach was demonstrated by Amyar et al. [117] for lesion detection and classification using 449 COVID-19-positive images. The authors reported an accuracy of 94% and a DSC of 0.88, with no augmentation techniques. In the classification framework only, the model's performance was benchmarked against some previously published studies. Budak et al. [116] used SegNet with attention gates to solve the problem of lesion segmentation for COVID-19 patients. Hounsfield unit windowing was also used as part of image pre-processing, with different loss functions to deal with small lesions. A cohort of 69 patients was used in this study, where the authors reported only a DSC of 0.89. A 10-fold CV protocol on 250 images with the UNet model was demonstrated by Cai et al. [118], with a DSC of 0.77. The authors presented lung and lesion segmentation using the same model. They also proposed a method to predict the duration of intensive care unit (ICU) stay based on the findings of the lesion segmentation. Ma et al. [119] also used the standard UNet architecture on a set of 70 patients for 3D CT volume segmentation. Model optimization was also carried out during the training process, and a DSC of 0.67 was reported in the study. The authors benchmarked the performance of the model against other studies in the same domain. Lastly, Kuchana et al. [120] used a cohort of 50 patients for lung and lesion segmentation with UNet and Attention UNet.
During the training process, the authors optimized the hyperparameters, and the model reported a DSC of 0.84. Arunachalam et al. [129] recently presented a lesion segmentation system based on a two-stage process: Stage I consisted of region-of-interest estimation using region-based convolutional neural networks (RCNNs), while Stage II was used for bounding-box generation. The performance parameters for the training, validation, and test sets were 0.99, 0.931, and 0.8, respectively. The RCNN was used primarily for COVID-19 lesion detection, coupled with automated bounding-box estimation for mask generation.

Strengths, Weaknesses, and Extension
This is the first pilot study for the localization and segmentation of COVID-19 lesions in CT scans of COVID-19 patients, under the class of COVLIAS 1.0. The main strength was the design of five AI models that were benchmarked against MedSeg, the current industry standard. Furthermore, we demonstrated that COVLIAS 1.0Lesion is superior to MedSeg using manual lesion tracings MD 1 and MD 2, where MD 1 was used for training and MD 2 was used for evaluation of the AI models. The system was evaluated using several performance metrics.
Despite the encouraging results, the study could not include more than one observer (MD 1) for the manual delineation used in training, due to factors such as cost, time, and the availability of radiologists during the pandemic. During lesion segmentation, the image-analysis component that changes the HU values could affect the training process; therefore, in-depth analysis would be needed [130][131][132]. This is beyond the scope of our current objectives.
Several extensions can be attempted in the future. (1) Multiresolution techniques [133,134] embedded with advanced stochastic image-processing methods could be adapted to improve the speed of the system [135,136]. (2) A big data framework could be adopted, whereby multiple sources of information can be used in a deep learning framework [137].
(3) Our study tested interobserver variability by considering two different observers (MD 1 and MD 2). Based on our previously conducted studies [46,58,[138][139][140], we assumed that intraobserver changes would be very subtle; we therefore did not consider it crucial to conduct intraobserver studies, also due to lack of funding and the radiologists' time constraints. Thus, intraobserver analysis could be conducted as part of future research [58,96,138]. (4) Furthermore, there could be an additional step involved where, first, the lung is segmented, and then this segmented lung is used for analyzing the lesions [141,142]. This should help to increase the DSC and JI of the AI system. (5) The addition of lung segmentation does, however, increase the system's time and computational cost. One could use joint lesion segmentation and classification in a multiclass framework, such as classification of GGO, consolidations, and crazy paving, using tissue-characterization approaches [56,143]. (6) One could also conduct multiethnic and multi-institutional studies for lung lesion segmentation, as attempted in other modalities [144]. (7) One could study the lesion distribution in different COVID-19 symptom categories-i.e., high-COVID-19-symptom lesions vs. low-COVID-19-symptom lesions-as tried in other diseases [36]. (8) Since SDL and HDL strategies have been adapted for lesion segmentation, it is likely that bias in AI [54] is present, and this could therefore be studied for lesion segmentation. (9) Several new ideas have emerged that rely on shape, position, and scale; such techniques require spatial attention, channel attention, and scale-based solutions. Recently, advanced solutions have been tried for different applications, such as human activity recognition (HAR) [145]. Methods such as RNNs or LSTMs can also be incorporated into the skip connections of the UNet or hybrid UNet, which can be used for superior feature-map selection [146].
Systems could also be designed where high-risk lesions (high-valued GGO) and low-risk lesions (low-valued GGO) are combined using ideas such as deep transfer networks [147]. Furthermore, additional loss functions could be explored as part of training the AI models [148][149][150][151][152]. (10) As part of the extension to the system design, one could compare other kinds of cross-validation protocols, such as 2-fold, 3-fold, 4-fold, 10-fold, and jack-knife (JK) protocols such as training-equals-testing. Examples of such protocols can be seen in our previous studies [45,59,60,[153][154][155]. Even though our design had a fivefold protocol, our experience has shown slight variations in performance with changes in the cross-validation protocol.
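A K-fold protocol of the kind discussed above amounts to a disjoint partition of the slice indices, with each fold serving once as the test set. A minimal sketch (fold sizes assume the study's 3000-image cohort; no model training shown):

```python
import numpy as np

# Fivefold split: 3000 slice indices are shuffled and partitioned into 5 folds;
# each fold is held out for testing while the remaining folds are used for training
rng = np.random.default_rng(42)
indices = rng.permutation(3000)
folds = np.array_split(indices, 5)

for k, test_idx in enumerate(folds, start=1):
    train_idx = np.setdiff1d(indices, test_idx)
    print(f"fold {k}: train = {train_idx.size}, test = {test_idx.size}")
```

Changing `np.array_split(indices, 5)` to 2, 3, 4, or 10 folds gives the other protocols mentioned; the jack-knife variant would instead reuse the full index set for both training and testing.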

Conclusions
The proposed study presents a comparison between COVLIAS 1.0Lesion and MedSeg for lesion segmentation in 3000 CT scans taken from 40 COVID-19 patients. COVLIAS 1.0Lesion (Global Biomedical Technologies, Inc., Roseville, CA, USA) consists of a combination of solo deep learning (DL) and hybrid DL (HDL) models to tackle lesion location and segmentation more quickly. One DL and four HDL models-namely, PSPNet, VGG-SegNet, ResNet-SegNet, VGG-UNet, and ResNet-UNet-were trained by an expert radiologist. The training scheme adopted a fivefold cross-validation strategy for performance evaluation. As part of the validation, it used tracings from two trained radiologists. The best AI model, ResNet-UNet, was superior to MedSeg by 9% and 15% for the Dice similarity and Jaccard index, respectively, when compared against MD 1, and by 4% and 8%, respectively, when compared against MD 2. Other error metrics, such as correlation coefficient plots for lesion-area errors and Bland-Altman plots, showed a close correlation with the manual delineations. Statistical tests such as the paired t-test, Mann-Whitney test, and Wilcoxon test were used to demonstrate the stability and reliability of the AI system. The online system processed each slice in <1 s. To conclude, our pilot study demonstrated the AI models' reliability in locating and segmenting COVID-19 lesions in CT scans; however, multicenter data need to be collected and experimented with.