Fine-Grained Assessment of COVID-19 Severity Based on Clinico-Radiological Data Using Machine Learning

Background: Severe and critical cases of COVID-19 had high mortality rates. Clinical features, laboratory data, and radiological features provided important references for the assessment of COVID-19 severity. Machine learning analysis of clinico-radiological features, especially quantitative computed tomography (CT) image analysis results, may achieve early, accurate, and fine-grained assessment of COVID-19 severity, which is an urgent clinical need.
Objective: To evaluate whether machine learning algorithms using CT-based clinico-radiological features could achieve accurate fine-grained assessment of COVID-19 severity.
Methods: The clinico-radiological features were collected from 78 COVID-19 patients with different severities. A neural network was developed to automatically measure the lesion volume from CT images. The severity was clinically diagnosed using two-type (severe and non-severe) and fine-grained four-type (mild, regular, severe, critical) classifications, respectively. To investigate the key features of COVID-19 severity, statistical analyses were performed between patients' clinico-radiological features and severity. Four machine learning algorithms (decision tree, random forest, SVM, and XGBoost) were trained and applied in the assessment of COVID-19 severity using clinico-radiological features.
Results: The CT imaging features (CTscore and lesion volume) were significantly associated with COVID-19 severity (p < 0.05 in the statistical analysis for both the two-type and the fine-grained four-type classifications). The CT imaging features significantly improved the accuracy of machine learning algorithms in assessing COVID-19 severity in the fine-grained four-type classification. With CT analysis results added, the four-type classification achieved performance comparable to that of the two-type classification.
Conclusions: CT-based clinico-radiological features can provide an important reference for the accurate fine-grained assessment of illness severity using machine learning to achieve the early triage of COVID-19 patients.


Introduction
During the COVID-19 pandemic, it was observed that the mortality was significantly higher in severe and critical cases [1]. After the occurrence of the symptoms of severe acute respiratory infection, some patients rapidly developed acute respiratory distress syndrome (ARDS) and other serious complications, which are followed by multiple organ failure [2]. Therefore, early diagnosis of severe and critical cases could optimize the allocation of medical resources, ensure early intervention for severe and critical patients, and finally reduce the mortality of COVID-19 [3].

Clinical Data Collection
Clinical data were collected by chart review. The patients were classified into four types according to their most severe condition, using the Diagnosis and Treatment Plan of COVID-19 issued by the National Health Commission (7th ed.): (1) mild type: minimal clinical symptoms without pneumonia on imaging; (2) regular type: fever, respiratory and other symptoms with pneumonia on imaging; (3) severe type: respiratory distress, respiratory rate ≥30 times/min; in resting state, oxygen saturation ≤93%; PaO2/FiO2 ≤ 300 mmHg; (4) critical type: respiratory failure requiring mechanical ventilation, shock and other organ failure requiring ICU monitoring and treatment [20].
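The four-type criteria quoted above can be illustrated with a minimal sketch. It assumes that any single severe criterion (or any single critical criterion) is sufficient for that type, which the text does not state explicitly; the class and field names are hypothetical and are not taken from the study's dataset. The two-type grouping treats severe and critical cases together as "severe", consistent with the case counts reported in the Results.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    # Hypothetical field names, chosen only for this illustration.
    pneumonia_on_imaging: bool
    respiratory_rate: float        # breaths per minute
    spo2_resting: float            # resting oxygen saturation, in percent
    pao2_fio2: float               # PaO2/FiO2, in mmHg
    needs_ventilation: bool = False
    shock: bool = False
    organ_failure_icu: bool = False


def four_type(p: Patient) -> str:
    """Return 'mild', 'regular', 'severe', or 'critical'.

    Assumption: meeting any one criterion of a type is enough.
    """
    if p.needs_ventilation or p.shock or p.organ_failure_icu:
        return "critical"
    if p.respiratory_rate >= 30 or p.spo2_resting <= 93 or p.pao2_fio2 <= 300:
        return "severe"
    if p.pneumonia_on_imaging:
        return "regular"
    return "mild"


def two_type(p: Patient) -> str:
    """Collapse the four types into severe (severe + critical) vs. non-severe."""
    return "severe" if four_type(p) in ("severe", "critical") else "non-severe"
```

In the study the types were assigned clinically from chart review, so this rule-based mapping is only a reading aid for the criteria, not the procedure that was used.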

CT Imaging Protocol
All scans were performed with patients in the supine position during end-inspiration without intravenous contrast on three CT scanners: uCT 760, uMI 780 scanners (United Imaging; Shanghai, China) and Precision 32 (CAMPO Imaging; Shenyang, China). Images were obtained from the apex to lung bases, using a standard dose protocol, reconstructed at 1.0 mm/1.1 mm slice thickness, with 0.7 mm increment, 512 × 512 mm and a sharp reconstruction kernel. The lung window width and level settings were 1500 Hounsfield units (HU) and −600 HU.

CT Image Analysis

CTscore from Visual Quantitative Evaluation
The CTscore was retrospectively obtained by visual quantitative evaluation of acute inflammatory lung lesions on CT images. Two radiologists blinded to the clinical information reviewed all images independently. In each lobe, the score was calculated from the percentage of total lesion area as 0 (0%), 1 (1-25%), 2 (26-50%), 3 (51-75%), or 4 (76-100%). For each subject, the total severity score (CTscore) was obtained by adding the scores of the five lung lobes and ranged from 0 to 20. The final score of each case was decided by a third experienced thoracic radiologist. The details can be found in our published work [21].
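The scoring rule above can be written as a short sketch. The function names are hypothetical, and the handling of fractional percentages (rounding up to the next band, so that 25.5% scores 2) is an assumption, since the text lists integer bands only.

```python
import math


def lobe_score(percent: float) -> int:
    """Map the lesion percentage of one lobe to a score of 0-4.

    0 -> 0, 1-25 -> 1, 26-50 -> 2, 51-75 -> 3, 76-100 -> 4.
    Assumption: fractional values fall into the next higher band.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percentage must lie between 0 and 100")
    if percent == 0:
        return 0
    return min(4, math.ceil(percent / 25))


def ct_score(lobe_percents) -> int:
    """Sum the scores of the five lung lobes (total range 0-20)."""
    if len(lobe_percents) != 5:
        raise ValueError("expected one percentage per lung lobe (five values)")
    return sum(lobe_score(p) for p in lobe_percents)
```

For example, lobes with 0%, 10%, 30%, 60%, and 90% involvement score 0, 1, 2, 3, and 4, giving a CTscore of 10.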

MT-HRNet-3d Neural Network for Lesion Volume Measurement
The lesion volume was calculated based on an AI algorithm which was developed by HY Medical Technology Co., Ltd (Beijing, China). The main neural network framework is detailed in Figure 1.
Figure 1. The network architecture of MT-HRNet-3d, composed of a simple stem, a main body, as well as the segmentation and classification subnets. The main body consists of four stages of parallel high-to-low resolution subnetworks with repeated information exchange across multi-resolution subnetworks (multi-scale fusion). ×n denotes the down-sampling ratio of the resolution representation to the input image, and the numbers of blocks of each stage module are [2,2,3,3] in [×2, ×4, ×8, ×16] resolution levels, respectively. In the segmentation subnet, only the highest resolution representations are concatenated for the medical lesion segmentation, while in the classification subnet, all the high-to-low resolution representations are aggregated for multi-class classification. Seg_out and Cls_out are the outputs of two subnets. In this study, the lesion volume reconstructed from Seg_out result was used as an input of machine learning algorithm of severity classification.

The original high-resolution network, named HRNetV1 [22], maintains high-resolution representations by exchanging information across multi-resolution subnetworks. HRNetV2 [23] exploits the representations from all the high-to-low resolution parallel convolutions rather than only the high-resolution representations used in HRNetV1, which adds a small overhead and leads to stronger high-resolution representations. To achieve three-dimensional (3d) medical image classification and segmentation, we modified the existing networks and built a 3d high-resolution network, named MT-HRNet-3d (3d Multi-Task High-Resolution Network).
As shown in Figure 1, the MT-HRNet-3d network is composed of a simple stem, a main body, and the segmentation and classification subnets. The simple stem consists of one stride-2 convolution that decreases the resolution, so that the highest resolution in the network is down-sampled by a factor of 2 relative to the input. The main body consists of four stages of parallel high-to-low resolution subnetworks and outputs the high-to-low resolution feature maps by repeatedly fusing the representations produced by these subnetworks. With the segmentation and classification subnets, the network is targeted at 3d medical image classification and segmentation. In this study, we used the 3d lesion volume reconstructed from the segmented lesion masks (i.e., the output of the segmentation subnet) as an input for the machine learning algorithm of severity classification. The segmentation subnet had been trained and validated using the consensus manual segmentation results of three radiological experts as the gold standard.
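The text does not describe how the lesion volume was reconstructed from the segmentation masks. A common approach, shown here purely as an assumed illustration, is to count the lesion voxels in the binary mask and multiply by the physical volume of one voxel. The function name, the nested-list mask layout, and the spacing values are hypothetical.

```python
def lesion_volume_ml(mask, spacing_mm):
    """Estimate lesion volume in millilitres from a binary 3d mask.

    mask:       nested lists indexed as [slice][row][column], with 1 for
                lesion voxels and 0 elsewhere (hypothetical layout).
    spacing_mm: (slice spacing, row spacing, column spacing) in mm.
    """
    dz, dy, dx = spacing_mm
    voxel_mm3 = dz * dy * dx
    n_voxels = sum(v for sl in mask for row in sl for v in row)
    return n_voxels * voxel_mm3 / 1000.0  # 1 mL = 1000 mm^3


# Toy example: a 2 x 2 x 2 block in which every voxel is lesion.
toy_mask = [[[1, 1], [1, 1]], [[1, 1], [1, 1]]]
toy_volume = lesion_volume_ml(toy_mask, (1.0, 0.5, 0.5))
```

In practice the voxel spacing would be read from the CT image header rather than supplied by hand.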

Modification of Neural Network
Following [22], the widths (number of channels) of the convolutions of the four resolutions were C, 2C, 4C, and 8C, where C was set to 32. To improve the computational efficiency, the main body was modified so that each branch of the multi-resolution group convolution contains a different number of residual blocks, which saves memory. Specifically, the four high-to-low resolution branches contain 2, 2, 3, and 3 blocks, respectively, and the 2nd, 3rd, and 4th stages contain 1, 2, and 2 multi-resolution modules, respectively.
Notably, we considered HRNet to be a natural multi-task learning framework for segmentation and classification, and we believe that good feature representations at different resolutions can be learned by related multi-task learning. Therefore, the segmentation and classification subnets were added simultaneously. Instead of all the high-to-low resolution representations, only the highest resolution representations were concatenated for the medical lesion segmentation, resulting in fine segmentation contours. In the classification subnet, all the high-to-low resolution representations were aggregated for multi-class classification.

Multi-Task Loss Function
An MT-HRNet-3d network has two sibling output subnets. The classification subnet outputs a discrete probability distribution (per image), p = (p0, ..., pK), over K + 1 categories. As usual, p is computed by a softmax over the K + 1 outputs of a linear classifier. The segmentation subnet outputs a probability segmentation map with the same resolution as the input feature map of the main body; the segmentation map is then upsampled (2 times) to the input size by trilinear upsampling. To jointly train the model, the multi-task loss is defined as

$$L = L_{CE} + \lambda \, \mathbb{1}_{mask} \, L_{Dice}, \tag{1}$$

in which $L_{CE}$ is the cross-entropy loss for the image-level classification. The second task loss, $L_{Dice}$, is the Dice loss, defined over the probability segmentation map and the pixel-level labeled mask. The indicator function $\mathbb{1}_{mask}$ evaluates to 1 when the lesion mask is labeled and 0 otherwise. The hyper-parameter λ in Equation (1) controls the balance between the two task losses and was usually set to 1 based on our experience. By sharing information during training, the multi-task learning approach can improve data efficiency, reduce overfitting, and therefore enhance the overall efficiency of the algorithm in achieving an accurate estimation of lesion area.
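A minimal sketch of the multi-task loss described above, assuming it combines the two terms as L_CE + λ · 1_mask · L_Dice. The soft Dice form with a small smoothing constant is an assumption, because the exact Dice formulation is not given; all function and parameter names are hypothetical, and plain Python lists stand in for tensors.

```python
import math


def cross_entropy(probs, label):
    """Cross-entropy of one image given softmax probabilities and the true class index."""
    return -math.log(probs[label])


def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss over flattened probability map and binary mask (assumed form)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def multitask_loss(probs, label, seg_pred=None, seg_mask=None, lam=1.0):
    """L_CE plus lam * L_Dice, where the Dice term is added only if a mask is labeled."""
    loss = cross_entropy(probs, label)
    if seg_mask is not None:  # indicator 1_mask: lesion mask available for this image
        loss += lam * dice_loss(seg_pred, seg_mask)
    return loss
```

Passing `seg_mask=None` reproduces the case in which the indicator is 0, so images without a labeled lesion mask still contribute to training through the classification term.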

Statistical Analysis
To evaluate whether the clinico-radiological features, especially the CT imaging features (CTscore and VRmax), were significantly associated with the prognostic severity of COVID-19, statistical analysis was performed between the clinico-radiological features and the severity in the two-type and four-type classifications, respectively.
For quantitative features, the Shapiro-Wilk test was first performed to examine whether the data followed a normal distribution in each severity subgroup. If normality was satisfied (i.e., p > 0.05 in the Shapiro-Wilk test), the t-test was performed in the two-type classification to examine whether the feature differed significantly between severe and non-severe patients. For the four-type classification, Levene's test was performed to examine the homogeneity of variance among subgroups. Where the homogeneity of variance was satisfied (p > 0.05 in Levene's test), analysis of variance (ANOVA) was performed; where it was violated (p ≤ 0.05), Welch's ANOVA was performed. In both cases, the aim was to examine whether the feature differed significantly among patients with different severities.
For ordinal features and the quantitative features that did not follow a normal distribution (i.e., p ≤ 0.05 in the Shapiro-Wilk test), non-parametric tests were performed. The Mann-Whitney U test and the Kruskal-Wallis H test were performed as the alternatives to the t-test and ANOVA in the two-type and four-type classifications, respectively.
For categorical features, the differences between rates were tested by the Chi-squared (χ²) or Fisher's exact test, as appropriate. A p-value less than 0.05 was considered statistically significant in all comparisons.
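The test-selection procedure for quantitative and ordinal features can be summarized as a small decision function. This is a sketch of the logic described above, not the software used in the study (which is not named); the function name and arguments are hypothetical, and the p-values are assumed to have been computed beforehand, for instance with `scipy.stats.shapiro` and `scipy.stats.levene`.

```python
ALPHA = 0.05  # significance threshold used throughout the analysis


def choose_test(n_groups, shapiro_p=None, levene_p=None, ordinal=False):
    """Return the name of the test to apply to one feature.

    n_groups:  2 for the two-type classification, more for the four-type one.
    shapiro_p: Shapiro-Wilk p-values, one per severity subgroup.
    levene_p:  Levene's test p-value (needed only for normal data, >2 groups).
    ordinal:   True for ordinal features, which always go non-parametric.
    """
    normal = (
        not ordinal
        and shapiro_p is not None
        and all(p > ALPHA for p in shapiro_p)
    )
    if n_groups == 2:
        return "t-test" if normal else "Mann-Whitney U"
    if not normal:
        return "Kruskal-Wallis H"
    if levene_p is None:
        raise ValueError("Levene's p-value is required for normally distributed data")
    return "ANOVA" if levene_p > ALPHA else "Welch's ANOVA"
```

For instance, a normally distributed feature whose Levene's test gives p < 0.001 in the four-type classification would be routed to Welch's ANOVA.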

Machine Learning Algorithms
The data were processed using four different classification algorithms: decision tree, random forest, SVM, and XGBoost. As mentioned above, 59 clinico-radiological features (bold in Table A1) were selected. We applied feature engineering, which is commonly used in machine learning, to derive the feature set for the training of the machine learning algorithms [24]. First, constant and quasi-constant features (i.e., with the same value or limited variation among all patients) were removed. Second, Pearson's correlation coefficient was calculated between different features. For correlated features, only the most indicative one was kept, and the others were removed to reduce information redundancy. Third, statistical methods that calculate mutual information were employed to further remove features with redundant information. The selection procedure was automatic. Finally, 37 features were used as input for algorithmic training.
The estimation of the patient's severity was the output, based on the two-type (non-severe and severe) and four-type (mild, regular, severe, critical) classifications, respectively. First, a testing dataset comprising 10% of the original dataset was randomly separated. For the 5-fold cross-validation, the remaining data were split into a training set (80%) and a testing set (20%) in each fold. To investigate the significance of CT analysis results in assessing the severity, the assessment was performed with VRmax and CTscore included and excluded, respectively. The precision, recall, and F1 values were used for the quantitative comparison of results. The results of the two-type and four-type classifications were also compared as an initial examination of the performance of fine-grained classification of COVID-19 severity.
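The first two filtering steps can be sketched as follows. The thresholds (0.95 for quasi-constant features, 0.9 for correlation) and the per-feature relevance scores used to decide which of two correlated features is "most indicative" are assumptions, since the text gives neither; the mutual-information step is omitted. All names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def dominant_fraction(values):
    """Fraction of patients sharing the most frequent value of a feature."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return max(counts.values()) / len(values)


def select_features(columns, relevance, quasi_const=0.95, corr_thresh=0.9):
    """columns: {feature name: list of values}; relevance: {feature name: score}."""
    # Step 1: drop constant and quasi-constant features.
    kept = [n for n, vals in columns.items() if dominant_fraction(vals) < quasi_const]
    # Step 2: visit features from most to least relevant and keep a feature
    # only if it is not highly correlated with one that is already kept.
    selected = []
    for name in sorted(kept, key=lambda n: relevance[n], reverse=True):
        if all(abs(pearson(columns[name], columns[s])) <= corr_thresh for s in selected):
            selected.append(name)
    return selected


# Toy data: "b" duplicates "a" (r = 1) and "d" is constant.
toy_columns = {"a": [1, 2, 3, 4, 5], "b": [2, 4, 6, 8, 10],
               "c": [5, 1, 4, 2, 3], "d": [7, 7, 7, 7, 7]}
toy_relevance = {"a": 0.9, "b": 0.5, "c": 0.3, "d": 0.8}
toy_selected = select_features(toy_columns, toy_relevance)
```

On the toy data the constant feature and the redundant copy are removed, leaving "a" and "c".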
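The data partition and the evaluation metrics can be sketched as below. The random seed, the rounding of the 10% hold-out, and the absence of stratification are assumptions, since the text does not specify them; function names are hypothetical. Precision, recall, and F1 are computed here for a single class treated as positive, and how the study averaged them across classes is not stated.

```python
import random


def split_holdout_and_folds(n, holdout_frac=0.10, k=5, seed=0):
    """Hold out a random fraction of n samples, then cut the rest into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_hold = round(n * holdout_frac)
    holdout, rest = idx[:n_hold], idx[n_hold:]
    folds = [rest[i::k] for i in range(k)]  # each fold serves once as the 20% part
    return holdout, folds


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


holdout, folds = split_holdout_and_folds(78)
```

With 78 patients, this partition holds out 8 cases and leaves 70, so each of the five folds contains 14 cases.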
To illustrate the contribution of different features in the decision made by machine learning methods, we used the local interpretable model-agnostic explanations (LIME) which is a common method for explaining black-box models, i.e., models whose inner logic is hidden and not clearly understandable [25]. On the standardized p-dimensional dataset where p is the number of retained features, LIME performs ridge regression, which is trained in a weighted fashion, i.e., each point contributes to the model according to its weight. In the resultant model, the coefficient of a feature reflects its contribution in the classification: the higher the coefficient, the bigger the variation in the output when the feature is changed. The sign of the coefficient shows the direction of the variation in the output [25].
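The weighted ridge regression fitted by LIME around a single patient can be written as the following objective. The symbols are introduced here for illustration and do not appear in the original text: $f$ is the black-box classifier's output for the class of interest, $z_i$ are the $N$ perturbed samples drawn around the patient $x$, $\pi_x(z_i)$ is the proximity weight of each sample, and $\alpha$ is the ridge penalty.

```latex
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\,\beta}
  \sum_{i=1}^{N} \pi_x(z_i)\,\bigl(f(z_i) - \beta_0 - \beta^{\top} z_i\bigr)^2
  + \alpha\,\lVert \beta \rVert_2^2
```

Each component of $\hat{\beta}$ is the coefficient reported for one of the p retained features: its magnitude reflects how strongly the output changes with that feature near $x$, and its sign gives the direction of the change.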

Significant Clinico-Radiological Features in Two-Type Classification
The data distribution was balanced in the two-type classification (57 non-severe, 21 severe), but mild cases were lacking in the four-type classification (57 regular, 16 severe, 5 critical). In the quantitative variables, the normal distribution was satisfied in the following features: BMI, cTnlTimes, LYM1, ALB1, and ALB2 (p > 0.05 for all in the Shapiro-Wilk test). The clinico-radiological features that are significantly different between severe and non-severe cases are shown in Table 1.

Significant Clinico-Radiological Features in Four-Type Classification
In the quantitative variables, the normal distribution was satisfied in the following features: weight, BMI, cTnlTimes, LYM1, ALB1, and ALB2 (p > 0.05 for all in the Shapiro-Wilk test), and the homogeneity of variance was satisfied in all of them (p > 0.05 in Levene's test) except cTnlTimes (p < 0.001). The clinico-radiological features that are significantly different among the patients with different severities are shown in Table 2. In both the two-type and four-type classifications, the CT imaging features were significantly associated with the severity of COVID-19.
After feature engineering, NTproBNP, LYM, LDH, and CRP were found to be the four most indicative of the 37 remaining features. As shown in Table 3, overall, the addition of AI-assisted CT image analysis results (radiological features) did not improve the accuracy of the algorithms in the two-type classification. Only the Decision Tree algorithm showed a mild improvement in F1 value, with worsening in other metrics, while XGBoost and Random Forest showed some worsening. The SVM results were unaffected.

Role of CT Images Analysis in Four-Type Classification
In Table 3, it can be observed that with the addition of AI-assisted CT image analysis, the accuracy of the estimation improved in Decision Tree, Random Forest, and XGBoost but not in SVM.
Figure 2a shows a correctly classified severe case in the two-type classification. The presence of arrhythmia and fatigue, which may be related to cardiac diseases and the development of COVID-19, indicates higher severity. Phlegm showed the opposite relationship with severity. A possible explanation is that a lack of phlegm makes it difficult for the patient to expel sputum from the lung, which worsens the infection in the lung and leads to higher severity. Figure 2b shows a correctly classified severe case in the four-type classification. It can be seen that the key clinico-radiological features differ from those of the two-type classification in Figure 2a, with the CTscore becoming a key factor.
Figure 3 shows two cases that are similar in many clinical features. With the machine learning method, they were accurately classified as non-severe and severe in the two-type classification, and as regular and severe in the four-type classification. It can be observed that the VRmax and CTscore differed markedly between the two cases, which improved the accuracy of classification. In particular, in these two cases, some clinical features did not discriminate well (e.g., NTproBNP is even lower in the severe case), whereas the VRmax correctly indicates the right trend (i.e., higher in the severe case) with the highest relative ratio between the severe and non-severe cases, showing high robustness in assessing COVID-19 severity against patient-specific variability in pathophysiological features.

Comparison between Statistical Analysis and Machine Learning Results
In the statistical analysis, the CT image analysis results (CTscore and VRmax) were significant features of the severity in both the two-type (p < 0.001 for both) and four-type (p < 0.001 for both) classifications. The inclusion of CT image analysis results significantly improved the accuracy of machine learning algorithms in evaluating the severity of COVID-19 using the four-type classification. Together, these results suggest that CT image analysis could provide an important reference for the fine-grained assessment of illness severity to achieve the early triage of COVID-19 patients. Compared with the statistical analysis, the machine learning algorithms provided a patient-specific quantitative evaluation of different clinico-radiological features regarding the illness severity, providing a more detailed reference for the accurate diagnosis and treatment of COVID-19.

Quantitative CT Image Analysis: A Key to COVID-19 Severity Assessment
It has been suggested that CT image analysis could improve the machine learning algorithm to achieve a higher accuracy in diagnosing COVID-19 than the clinical COVID-19 reporting and data system [26]. In particular, the lesion volume segmented and quantified using deep learning showed a strong potential in the prediction of COVID-19 severity [27]. Furthermore, the ratio of compromised lung volume (sum of poorly and non-aerated volumes) has been observed to be an accurate outcome predictor of the risk of oxygenation support and intubation, and in-hospital mortality (p < 0.001 for all in logistic regression) [28]. Therefore, the quantitative CT image analysis (especially the lesion volume) plays a key role in the assessment of the severity of COVID-19 and provides important reference to guide the clinical triage and intervention.
According to the model interpretation (two-type XGBoost classification), the model reveals several important clinico-radiological features such as NTproBNP. As a result, among all those important features, the CTscore improves the performance only moderately. For the four-type classification, the CTscore improves the performance of all classification algorithms except SVM, which may be due to the structure of the SVM model. The SVM model determines its hyperplane based on all the features. The CTscore does influence the position of the hyperplane, but its influence is not strong enough to change the final decision.

Fine-Grained Severity Assessment of COVID-19 Using Machine Learning: Clinical Significance
Considering the current epidemic of COVID-19 and the high risk of its recurrence, the early detection of severe COVID-19 illness is an urgent clinical need. Currently, both the two-type [29] and four-type [30] classification standards are widely used in the triage of COVID-19 patients. In the latest guidelines of the Chinese Medical Association, the severe and critical cases of COVID-19 were unified as "severe" because of the rapid exacerbation of severe cases, the highly diverse "time window", and the difficulty in detecting critical cases, which may lead to delayed treatment [31]. The time from illness onset to severe symptoms such as dyspnea (95% confidence interval (CI): 4.0-9.0 days) or critical symptoms such as sepsis (95% CI: 7.0-13.0 days) is patient-specific and covers a wide range [1]. Additionally, the RT-PCR test has low sensitivity in the first 3-5 days of infection [32]. Therefore, the early and accurate detection of clinical risks based on the four-type classification could play an important role in guiding the clinical intervention for the severe and critical cases towards more appropriate management. Additionally, due to the rapidly growing imbalance between supply and demand for medical resources, the fair allocation of medical resources has become an urgent clinical need in many countries [33]. In Table 3, with CT analysis results added, the four-type classification achieved comparable or even better performance in some metrics than the two-type classification, especially with the Random Forest and XGBoost algorithms. This indicates that the fine-grained classification of COVID-19 severity could achieve performance comparable to that of the widely used binary classification on the same dataset while providing a more detailed reference for clinical diagnosis, treatment, and management. With accurate estimation of severity, this machine learning method could be used to guide the allocation of limited medical resources.

Limitations and Future Directions
Firstly, only 78 patients were included in this single-center, retrospective pilot study. Secondly, the distributions of age and severity were imbalanced in these cases. There were only five cases aged 19-30 years and only one case aged less than 18 years. The majority (57.5%) of the included cases were moderate. Only five critical cases were included. There was a lack of mild cases. During the early outbreak period, patients with mild symptoms often had delayed hospital admission due to a lack of awareness. A balanced training dataset is important to improve the reliability of the algorithm and its applicability in different cohorts. Additionally, it has been found that the severity of COVID-19 was related to other physiological factors including pregnancy [14], and to the comorbidity of chronic respiratory diseases, cardiovascular diseases [10], diabetes [34], and cancer [35]. Therefore, the results derived by the machine learning models need to be validated in large-scale datasets with a more balanced data distribution. In future studies, more clinical datasets could be included to cover a wider range of age and severity, as well as patients in different physiological and pathological conditions.

Conclusions
The results of quantitative CT image analysis were significantly related to the severity of COVID-19. The clinico-radiological features including the CT image analysis results can provide an important reference for the fine-grained assessment of illness severity using machine learning to achieve the early triage of COVID-19 patients.

Institutional Review Board Statement: This study was approved by the ethics committee of the Fifth Affiliated Hospital of Sun Yat-Sen University, and the requirement for informed consent was waived due to the retrospective nature of the study and the analysis using anonymous clinical data.
Informed Consent Statement: Patient consent was waived due to the retrospective nature of the study and the analysis using anonymous clinical data.

Data Availability Statement: Not applicable.
Acknowledgments: We acknowledge the Fifth Affiliated Hospital of Sun Yat-Sen University for sharing the data for analysis.

Conflicts of Interest: The authors declare no conflict of interest.

Appendix A
The time period from onset of symptoms to admission | Time
Onset2CT1 | The time period from onset of symptoms to the first CT imaging | Time
Onset2CTPositive1 | The time period from onset of symptoms to the first CT Positive results | Time
Onset2severity | The time period from onset of symptoms to the first diagnosis of two-type severity | Time
Onset2CTPeak | The time period from onset of symptoms to CT Peak results | Time
Selected clinico-radiological features are in bold. In the paired features (e.g., PCT1 and PCT2), the numbers 1 and 2 indicate the admission value and maximal value during hospitalization, respectively.