Classification Performance for COVID Patient Prognosis from Automatic AI Segmentation—A Single-Center Study

Background: COVID assessment can be performed using the recently developed individual risk score (prediction of severe respiratory failure in hospitalized patients with SARS-COV2 infection, PREDI-CO score) based on High Resolution Computed Tomography. In this study, we evaluated the possibility of automatizing this estimation using semi-supervised AI-based Radiomics, leveraging the possibility of performing non-supervised segmentation of ground-glass areas. Methods: We collected 92 from patients treated in the IRCCS Sant’Orsola-Malpighi Policlinic and public databases; each lung was segmented using a pre-trained AI method; ground-glass opacity was identified using a novel, non-supervised approach; radiomic measurements were collected and used to predict clinically relevant scores, with particular focus on mortality and the PREDI-CO score. We compared the prediction obtained through different machine learning approaches. Results: All the methods obtained a well-balanced accuracy (70%) on the PREDI-CO score but did not obtain satisfying results on other clinical characteristics due to unbalance between the classes. Conclusions: Semi-supervised segmentation, implemented using a combination of non-supervised segmentation and feature extraction, seems to be a viable approach for patient stratification and could be leveraged to train more complex models. This would be useful in a high-demand situation similar to the current pandemic to support gold-standard segmentation for AI training.


Introduction
Since the beginning of last year, the world has been facing a health emergency, the pandemic caused by the novel Coronavirus, Sars-CoV2. Even up to the present date, many aspects of the physiopathology of the COVID- 19 infection are yet to be fully understood. The diagnostic gold standards for Sars-CoV2 are the reverse transcriptionpolymerase chain reaction (rt-PCR) and the gene sequencing of sputum, throat swab and lower respiratory tract secretion [1,2]. These tests have several limitations: a limited testing capacity related to insufficient kits or laboratory supplies; a long reporting time, varying from 6 to 48 h; a great variability in sensitivity, ranging from 37% to 71%. To circumvent these limitations, imaging has emerged as an important tool to guide diagnosis, especially in cases of clinical-laboratory discordances.
Imaging protocols directive from public health authorities are heterogeneous: chest radiography is widely used, although it is not accurate in mild or early COVID-19 infection; there is an improving interest in bedside lung ultrasound, but limited experiences are reported in the literature [3]. Among imaging modalities, Computed Tomography (CT) is the most sensitive (60-98%) acquisition technique, but it has low specificity in the early stage of the disease [4]. For this reason, World Health Organization (WHO) and most radiologic societies do not recommend performing screening CT (WHO characterizes COVID-19 as a pandemic-11 March 2020). High-Resolution Computed Tomography (HRCT) proved nonetheless to be a valuable aid to the clinical and epidemiological management of affected patients, especially during shortages of the necessary reagents, long reporting time, and the high operator-dependency [5][6][7].
There have been controversial publications about the role of chest CT imaging analysis in the diagnosis and management of COVID-19 [8]. The identification of healthy lung chest CTs from pneumonia cases has been deeply investigated in literature [9,10], but the COVID-19 pandemic has posed the non-trivial problem of classifying different pulmonary diseases. Patchy shadows or ground-glass opacities (GGOs) and consolidations (CSs) are not exclusive of COVID-19 but might be also caused by pulmonary edema, bacterial infection, other viral infection, or alveolar hemorrhage [11].
An initial prospective made by Huang et al. [12] on chest CT scans of patients affected by COVID-19 has shown that the examined subjects have a bilateral GGO and CS. The same medical results were also confirmed by other authors [13,14], who posed the basis for the next quantification studies allowing a better characterization of the COVID-19 features. The classification between early-stage patients and progressive phases has been thoroughly investigated [15][16][17][18], and all of them lead to the same conclusion: the main COVID-19 features can be difficult to detect in early stages of the disease, and their correct identification is strongly dependent on the radiologist's expertise.
The role of HRCT in COVID-19 infection management is also controversial due to the radiation exposure problem. Initially, the American College of Radiology (ACR) and the Italian Society of Medical and Interventional radiology (SIRM) guidelines did not recommend HRCT in the diagnostic workup, the former referring to 2018 guidelines for acute respiratory illnesses [19][20][21]. With the increase of available evidence resulting from clinical trials, a prognostic role of HRCT is now emerging, increasing its value beyond the diagnosis [3,22]. Moreover, a study from Ria et al. evaluated the risk-benefit ratio of radiation exposure in COVID-19 patients, stating that HRCT is justified in patients older than 30 years [23].
The application of automated methods to support the clinicians in the analysis of a large amount of data aims to remove the subjectivity of the measurement and improve the time required for the diagnosis formulation [21,24,25]. Machine learning and deep learning models applied in the medical research field are becoming more popular as the basis for clinical decision support systems. Medical image segmentation plays a pivotal role in the automation of these applications since a correct segmentation of anatomical structures is a crucial step to improve the accuracy of the algorithms and to minimize possible confounders.
Deep learning applications have achieved the most significant results also in the COVID-19 literature [24,27,[34][35][36], with extremely high classification performances and low execution time. The drawback of these methods is the demand for labeled data: the training of a deep learning model requires a vast amount of manual (or semi-automatic) labelled data. This is problematic because data annotation is a very time-consuming operation, dependent on the operator experience.
To overcome this issue, many authors, such as Wang et al. [34], proposed a different approach, combining deep learning features with standard machine learning one, proving the efficiency of this synergy. This approach could also increase the ability to explain the model, potentially leading to an improvement in the understanding of the main features of the COVID-19 disease. Wang et al. highlighted the irregularity and heterogeneous intensities of the lung lesion textures as COVID-19 significant features. Concurrently, in non-COVID-19 patients, Wang et al. found stronger uniformity of Hounsfield values in the chest CT within the lesions. Both these features can be automatically quantified by a machine learning algorithm, allowing stratification of the patients' severity and removing possible false positives. Other identified useful characteristics concern the geometry and shape of the lesions [8,37].
For a more detailed stratification of patients, the Radiological Society of North America (RSNA) released a consensus statement, endorsed by the Society of Thoracic Radiology and the American College of Radiology (ACR), that classifies the CT appearance of COVID-19 pneumonia into four categories: "typical," "atypical," "undetermined," and "negative" [38]. The "typical" pattern is characterized by the presence of round-shaped Ground-Glass Opacities (GGO), usually bilateral with a sub-pleural location on the dorsal basal segments. The GGO can be associated with "Crazy Paving" areas or other signs of organizing pneumonia. The "undetermined" pattern is characterized by the absence of the "typical" pattern findings, with diffuse GGO areas with a perihilar or unilateral distribution, with or without consolidated areas. The "atypical" pattern is characterized by either the absence of the "typical" or "undetermined" signs and the presence of lobar consolidations, "tree in bud," smooth thickening of the septa and pleural effusion; in this presentation, no GGO are detectable. The "negative" pattern is characterized by the absence of pathological findings. The "typical" and "negative" patterns have proven to be very accurate in identifying the disease in patients with suspected COVID-19 infection [3].
Many authors have already proved an occasional discordance between HRCT and rt-PCR. There have been observations of patients with a high clinical suspect of COVID-19 supported by epidemiological criteria and imaging, but with negative rt-PCR [39,40]. On the other hand, there was evidence of patients with a positive rt-PCR and suggestive clinical findings, that did not present pathological findings on HRCT [41]. The clinicoradiological dissociation in asymptomatic individuals requires reconsidering the role of radiological findings in the clinical management of these patients [42].
To improve the reliability of radiological examination, several authors presented Texture Analysis of the CT scans as a valuable tool to aid the diagnosis [43][44][45] and to identify clinically severe patients [46]. Texture Analysis can identify putative features that are not part of the RSNA criteria, such as enlargement of pulmonary vessels [47][48][49][50][51], and that could be overlooked during the human visual inspection, such as fine characteristics of the GGO areas [52]. Currently, only a few methods exist to automatically process HRCT scans and quantify the extent of the pulmonary involvement [32,53]. From the clinical point of view, the disease can be assessed with the newly developed PREDI-CO (prediction of severe respiratory failure in hospitalized patients with SARS-COV2 infection), which considers clinical parameters predictive of severe respiratory failure in hospitalized patients, and is defined as the sum of the following conditions: This score outperforms in the stratification of patients the well-established qSOFA, SOFA, CURB-65, and MEWS scores; this is due to the fact that the PREDI-CO score was designed and validated ad-hoc for this purpose [54].
In the current work, we aim to automatize the evaluation of the PREDI-CO score and several radiomic features using a novel, non-supervised image processing pipeline.

Patients Selection
This study involves 92 CT scans of patients affected by COVID-19. 10 of these scans come from the public dataset "COVID-19 CT Lung and Infection Segmentation Dataset" published on Zenodo [55]. Left lung, right lung, and infections are labeled by two radiologists and verified by an expert radiologist (with more than 10 years of experience). These scans were used to validate the segmentation performances of the implemented pipeline.
Department of Diagnostic and Preventive Medicine of the IRCCS Policlinic Sant'Orsola-Malpighi provided the remaining 82 scans. The selected patients are composed for the 66.3% by male, the age distribution (min/median/max) is 35/60/89. For each patient, experts ascertained the presence of ground-glass opacities (100%), consolidation (10%), crazy paving (53%), multifocal GGO (multiple locations affected by GGO, 32%), peripheral GGO (presence of GGO areas exclusively far away from the trachea, 23%), and roundish GGO (GGO characterized by round regular shapes, 12%). Moreover, each patient was assigned the PREDI-CO score value estimated by two radiologists.
In IRCCS Policlinic Sant'Orsola-Malpighi, HRCT exams were performed with the following parameters: two different Multi-Slices CT (64 slices, GE VCT or PHILIPS Ingenuity), with keV range 100-120, tube current modulation with a low Quality Index to optimize patient dose, slice thickness range 1-2 mm; images were reconstructed with highresolution kernel. The CT parameters used in the Zenodo dataset are not available.

Pipeline Overview
The workflow developed in this work as show in Figure 1 can be split into 3 steps: (1) the segmentation of the lungs; (2) segmentation of the GGO areas; (3) estimation of the radiomic features.

Lung Segmentation
Lung segmentation is a pivotal pre-processing step in many image analyses, such as identification and classification of pathologies. Rule-based approach, like thresholding, region growing, etc., usually fails for CT scans of patients with severe Interstitial Lung Disease (ILD) [33]. For this reason, we used a pre-trained publicly available https://github.com/JoHof/lungmask v0.24 (accessed on 24 March 2021) U-Net model [33] for lung segmentation.

GGO Segmentation
In the second step, a novel automated pipeline for the segmentation of GGO areas was developed, combining 2 different techniques: vessel artifacts exclusion and k-means clustering. Both these methods are unsupervised learning techniques and have been chosen due to the limited number of available samples. The decision to avoid supervised methods such as Convolutional Neural Networks (CNNs) is based on the likelihood of including strong biases, even including possible data augmentation strategies.
The intensity of pulmonary vessels is very similar to solid components of GGO; therefore, it affects the segmentation, introducing potential false positives [56]. To remove these vessels, we used the vesselness measure, i.e., the presence of multiscale tubular structures. Multiscale Vessel Enhancement Filtering (MVEF) [57] is defined as the likelihood of an image region to contain vessels, and it is estimated using the Frangi filter [58]. The areas with high values of MVEF were identified as vessels and therefore removed from the lung region obtained in the previous step.
After the removal of the vessels, we identified the GGO as areas with a common color texture. To identify these regions, we used k-means clustering, grouping voxels by color and texture similarity, and identifying the tissue corresponding to each cluster [59].
Since GGO involves extended areas, it is informative to include neighborhood voxel information. The color contrast between GGO and healthy areas may change between patients; it is, therefore, useful to consider different gamma corrections of the image. We applied a series of image processing filters to obtain a high-dimensional feature space, including all these features for each individual voxel. For each voxel, we estimated a vector of features obtained by the application of the following filters: • Gamma corrected image (γ = 1.5); • Adaptive Histogram Equalized image, in a radius of 3 voxels; • Median blurred image with a kernel of radius 3 voxels; • Standard deviation filtered image with a kernel of radius 1 voxels.
We used the Adaptive Histogram Equalization algorithm for the image standardization: for each slice, the histogram was equalized considering a volume of 3 voxels. The gamma correction is a non-linear operation used to decode the luminance and enhance the low contrast regions. The median blurring allows considering the information about the neighborhood voxels, reducing the effect of outlier voxels. The last feature is the application of a local standard deviation filter; this filter replaces each voxel value with the standard deviation of its neighborhood. This feature is useful as GGO is characterized by highly heterogeneous gray level regions, allowing to filter out bronchial structures and motion artifacts not removed during the lung segmentation.
The k-means clustering is "isotropic" in all directions of space and therefore tends to produce round (rather than elongated) clusters. In this situation, leaving variances unequal is equivalent to put more weight on variables with smaller variance. To avoid this, the features were standardized according to the mean and standard deviation estimated on the training dataset.
We selected 10 scans and applied the k-means clustering for the estimation of the centroids using the Euclidean metric. The k-means clustering is sensitive to the class balance in the training phase (it might give more prominence to more common present structures). Therefore, for the training phase, we considered only a subset of scans with a reasonable amount of GGO areas, excluding in all the cases the (overrepresented) image background. The k-means cardinality was estimated based on lung anatomy considerations. The resulting clusters were: • Healthy lung; We implemented the whole pipeline using Python, and the source code is publicly available on GitHub (https://github.com/RiccardoBiondi/segmentation, accessed on 24 March 2021). We used SimpleITK [60,61] for the implementation and management of image filters and the OpenCV [62] library for the implementation of the k-means clustering.
We estimated the performances of our segmentation algorithm according to the following scores: where TP, TN, FP, and FN are the True Positives, True Negative, False Positive, and False Negative scores, respectively.

Feature Extraction
We applied the proposed pipeline on the 82 patients provided by the Department of Diagnostic and Preventive Medicine of the IRCCS Policlinic Sant'Orsola-Malpighi of Bologna. The radiomic features extraction was performed on the identified GGO areas. The extracted features include morphological and texture-based scores: Bilaterality (distribution of GGO between left and right lung); For the scores classification, we included the patient's age as informative features. We measured the texture properties by computing the Haralick features (Energy, Inertia, Entropy, Inverse Difference Moment, Cluster Shade and Cluster Prominence) from the Gray Level Co-occurrence Matrix (GLCM), computed on the whole identified area [63]. For each identified GGO area, we computed its elongation and roundness [64], obtaining the corresponding distribution for each patient. We computed the distribution of the distances between the lesion and lung centroids, normalized to the semiaxis of the equivalent ellipsoid as a measure of the GGO peripherality. Each distribution was characterized by the minimum, maximum, median, interquartile range (25-75), skewness, and kurtosis. We estimated the bilaterality distribution using the Matthews coefficient (MCC) defined as follows: being RGV = Volume of GGO in the right lung, LGV = Volume of GGO in the left lung, LLV = Left Lung Volume, and RGV = Right Lung Volume. The volume of GGO was normalized according to the total lung volume to overcome possible issues related to anatomical differences between patients.

Classification
We used the whole set of extracted radiomic features to predict the following GGO characteristics estimated by the expert radiologists: Additionally, the two clinical outcomes: Not all the above scores were available for all the patients. The GGO characteristics were reported for only 78/82 patients, while the clinical outcomes for only 72 of them. We restricted the score classification on these two subsets of data.
The considered scores are all binary values (True/False), representing the presence/absence of the corresponding characteristic. The only exception is given by the PREDI-CO score, which, by definition, allows an incremental set of values: PREDI-CO score values range from 0 (minimal risk) to 9 (maximal risk). The 47% of patients report a PREDI-CO score of 0 or 1, leading to an overrepresentation of these 2 labels. We grouped multiple labels into the same class to overcome this issue: we applied the cutoff of 1 to dichotomize the PREDI-CO score values into two classes.
We applied a feature selection procedure to filter out redundant and non-informative values for each classification. This step is required to improve the classification performances (remotion of possible confounders) and to make the obtained results more interpretable from a clinical point of view. The selection was performed using a Fisher Exact and a χ 2 tests selecting the 3 features with the highest significance (lower p-values). Both the tests require categorical data, so we dichotomized the features according to their medians.
We used the set of filtered features as input to 4 different classification algorithms to predict the various GGO characteristics and clinical outcomes: The performances of these 4 methods give us an insight into the structure of the data: KNN methods are strongly local, Random Forest relies on the separability of the samples but does not include co-linearities between variables, and linear models (Logistic classifier and Ridge classifier) strongly rely on linear dependencies between the observed values and the predictions.
The classification performances were estimated according to the following metrics: using a leave-one-out cross-validation strategy. The numerosity of the labels of each characteristic is strongly unbalanced (e.g., only 10% of patients show the presence of consolidations) for every characteristic, except for Crazy Paving and PREDI-CO score (after the dichotomization). To compensate for this, we weighted the classification performances of each algorithm according to the inverse of the class frequency.
We used the scikit-learn [65] Python package for the implementation of all the analyses.

GGO Segmentation
We applied the proposed segmentation pipeline on each patient under analysis. For samples, we collected also a manual segmentation performed by an expert radiologist, which was used as the gold standard.
The collected results were evaluated using the previously introduced metric scores and their distribution analyzed as show in Table 1, details in table S1. We show in Figure 2 the distribution of the individual scores (Sensitivity, Specificity, Precision, and F1 score). The scores were obtained by the average of the whole 15 scans (overall) on the 10 corona cases (CORONACASES OVERALL) and on the five gold standard (GOLD STD OVERALL). In Figure 3 and Figure 4, we show a visual comparison between the achieved segmentation and the ground truth labels (both for corona cases and the gold standard).   In Table 2 we report the results of other published pipelines in comparison with our method (details of the employed methods in Table S2). Notice that Jun2020 is a benchmark database for COVID-19 annotation methods and CT scans segmentation (https://gitee.com/junma11/COVID-19-CT-Seg-Benchmark, accessed on 24 March 2020). The evaluation set for both Jun2020 and Muller2020 is COVID-19-CT-Seg, which is the database containing both corona case studies (the one used for annotation) and radiopedia ones (removed because rescaled on 8-bit GL images, which is not compatible with the implemented pipeline). For each technique on each database, only the best results presented in the literature are reported.

Feature Extraction
We applied the feature extraction step on the GGO areas identified by our segmentation algorithm. In Figure 5 we show the Pearson's correlation matrix between each pair of observed features. The cluster plot highlights the existence of multiple groups of strongly correlated features. The first set of clusters are given by the features related to the roundness, elongation, and distance features. The most prominent cluster is composed of the Haralick features: energy, entropy, inertia, and cluster prominence. In Figure 6 we show the relation between feature distributions and labels. Using the median of the feature distributions as a threshold, we estimated the percentage of the patients associated with each label who are above this threshold. A result of 0.5 (whitelike cells) indicates non-specificity of the variable for that individual label, i.e., a uniform distribution of the feature. A result greater than 0.5 (red-like cells) or smaller than 0.5 (blue-like cells) indicates a high/low percentage of patients for whom the feature values are greater/smaller than the median of the distribution, respectively. This representation allows a visual analysis for the identification and selection of the most informative features related to each label.

Individual Features Analysis
For each classification algorithm, we considered the full set of radiomic features extracted and the three best features identified by the feature selection methods (Fisher and χ 2 tests). For each feature selection criteria, we report the list of the three best features ordered according to their informative power.

Multifocal GGO
A Multifocal GGO label equal to 1 (32% of the patients) identifies the patients with the presence of a multifocal lesion, while a label equal to 0 (68% of the patients) identifies its absence.
The three best features selected by the Fisher criterion are (with the type of feature indicated in parenthesis): The three best features selected by the χ 2 criterion are (with type of feature indicated in parenthesis):
We show in Table 3 the results obtained in terms of global adjusted accuracy and in Table 4 the precision, sensitivity and F1 score of the same prediction.

Presence of Crazy Paving
A Crazy Paving label equal to 1 (52% of the patients) identifies the patients with the presence of a crazy-paving pattern, while a label equal to 0 (48% of the patients) identifies its absence.
The three best features selected by the Fisher criterion are (with type of feature indicated in parenthesis):
The three best features selected by the χ 2 criterion are (with type of feature indicated in parenthesis):
We show in Table 5 the results obtained in terms of global adjusted accuracy and in Table 6 the precision, sensitivity and F1 score of the same prediction. Table 5. Global adjusted accuracy of the models to predict the presence of crazy paving using different feature selection strategies. From the left: dichotomized score value, Precision obtained using the 3 best features identified by the Fisher test, Precision obtained using the 3 best features identified by the χ 2 test; Sensitivity obtained using the 3 best features identified by the Fisher test, Sensitivity obtained using the 3 best features identified by the χ 2 test; F1 score obtained using the 3 best features identified by the Fisher test; F1 score obtained using the 3 best features identified by the χ 2 test. For each column the maximum value was indicated with bold font

Presence of Consolidation
A consolidation label equal to 1 (10% of the patients) identifies the patients with the presence of consolidation, while a label equal to 0 (90% of the patients) identifies its absence.
The three best features selected by the Fisher criterion are (with type of feature indicated in parenthesis): Median of the gray level distribution (Radiomics); • Median of the elongation distribution (Radiomics).
The three best features selected by the χ 2 criterion are (with type of feature indicated in parenthesis):
We show in Table 7 the results obtained in term of global adjusted accuracy and in Table 8 the precision, sensitivity, and F1 score of the same prediction. Table 7. Global adjusted accuracy of the models to predict the presence of consolidations using different feature selection strategies. From the left: dichotomized score value, Precision obtained using the 3 best features identified by the Fisher test, Precision obtained using the 3 best features identified by the χ 2 test; Sensitivity obtained using the 3 best features identified by the Fisher test, Sensitivity obtained using the 3 best features identified by the χ 2 test; F1 score obtained using the 3 best features identified by the Fisher test; F1 score obtained using the 3 best features identified by the χ 2 test. For each column the max-imum value was indicated with bold font

Roundish GGO
A roundish GGO label equal to 1 (12% of the patients) identifies the patients with the presence of a roundish GGO lesion, while a label equal to 0 (88% of the patients) identifies its absence.
The three best features selected by the Fisher criterion are (with type of feature indicated in parenthesis): • GGO volume percentage (Radiomics); • Skewness of the roundness distribution (Radiomics); • Median of the roundness distribution (Radiomics).
The three best features selected by the χ 2 criterion are (with type of feature indicated in parenthesis): • GGO volume percentage (Radiomics); • Skewness of the roundness distribution (Radiomics); • Median of the roundness distribution (Radiomics).
We show in Table 9 the results obtained in term of global adjusted accuracy and in Table 10 the precision, sensitivity and F1 score of the same prediction. Table 9. Global adjusted accuracy of the models to predict the presence of roundish GGO using different feature selection strategies. From the left: dichotomized score value, Precision obtained using the 3 best features identified by the Fisher test, Precision obtained using the 3 best features identified by the χ 2 test; Sensitivity obtained using the 3 best features identified by the Fisher test, Sensitivity obtained using the 3 best features identified by the χ 2 test; F1 score obtained using the 3 best features identified by the Fisher test; F1 score obtained using the 3 best features identified by the χ 2 test. For each column the max-imum value was indicated with bold font

Peripheral GGO
A peripheral GGO label equal to 1 (23% of the patients) identifies the patients with the presence of a peripheral GGO lesion, while a label equal to 0 (77% of the patients) identifies its absence.
The three best features selected by the Fisher criterion are (with the type of feature indicated in parenthesis): Minimum of the distance distribution (Radiomics); • Skewness of the gray level distribution (Radiomics).
The three best features selected by the χ 2 criterion are (with type of feature indicated in parenthesis): • Minimum of the distance distribution (Radiomics); • Patient age (Clinical); • Skewness of the elongation distribution (Radiomics).
We show in Table 11 the results obtained in terms of global adjusted accuracy and in Table 12 the precision, sensitivity, and F1 score of the same prediction. Table 11. Global adjusted accuracy of the models to predict the presence of peripheral GGO using different feature selection strategies. From the left: dichotomized score value, Precision obtained using the 3 best features identified by the Fisher test, Precision obtained using the 3 best features identified by the χ 2 test; Sensitivity obtained using the 3 best features identified by the Fisher test, Sensitivity obtained using the 3 best features identified by the χ 2 test; F1 score obtained using the 3 best features identified by the Fisher test; F1 score obtained using the 3 best features identified by the χ 2 test. For each column the max-imum value was indicated with bold font

PREDI-CO Score
The prediction of the PREDI-CO score was performed considering the dichotomized values obtained by thresholding the labels according to the cutoff of 1: values ≤1 (47% of the patients) were labelled as 0 and values > 1 (53% of the patients) were labelled as 1. The considered dataset includes only two patients with a PREDI-CO score ≥6 proving the strong unbalancing of the classes and highlighting the prominence of non-severe patients.
The three best features selected by the Fisher criterion are (with type of feature indicated in parenthesis): We show in Table 13 the results obtained in terms of global adjusted accuracy and in Table 14 the precision, sensitivity and F1 score of the same prediction. Table 13. Global adjusted accuracy of the models to predict the dichotomized PREDI-CO score using different feature selection strategies. From the left: dichotomized score value, Precision obtained using the 3 best features identified by the Fisher test, Precision obtained using the 3 best features identified by the χ 2 test; Sensitivity obtained using the 3 best features identified by the Fisher test, Sensitivity obtained using the 3 best features identified by the χ 2 test; F1 score obtained using the 3 best features identified by the Fisher test; F1 score obtained using the 3 best features identified by the χ 2 test. For each column the max-imum value was indicated with bold font

Patient Survival
A survival label equal to 1 (10% of the patients) identifies the patients survived to the COVID-19, while a label equal to 0 (90% of the patients) identifies death patients.
The three best features selected by the Fisher criterion are (with type of feature indicated in parenthesis): We show in Table 15 the results obtained in terms of global adjusted accuracy and in Table 16 the precision, sensitivity and F1 score of the same prediction. Table 15. Global adjusted accuracy of the models to predict mortality using different feature selection strategies. From the left: dichotomized score value, Precision obtained using the 3 best features identified by the Fisher test, Precision obtained using the 3 best features identified by the χ 2 test; Sensitivity obtained using the 3 best features identified by the Fisher test, Sensitivity obtained using the 3 best features identified by the χ 2 test; F1 score obtained using the 3 best features identified by the Fisher test; F1 score obtained using the 3 best features identified by the χ 2 test. For each column the maximum value was indicated with bold font

GGO Segmentation
The examples reported in Figure 3 and Figure 4 show how the non-supervised segmentation method proposed in this paper is able to approximate the gold standard results with satisfactory results. This result has two strong implications for the Radiomics of the COVID-19 patients. First, given that the amount of information required for the k-means method training is considerably lower than CNN methods, while still retaining good results, this segmentation can be implemented with in-patient training. Secondly, this method can be used with success as a first segmentation method to be used as training for other, more specific methods. We remark that all the proposed techniques are voxel-based algorithms: this kind of method requires the whole patient's scan as input, drastically reducing the dimensionality of the dataset. As a reference, a 3D U-Net-based method [66] required two order of magnitude training samples to achieve comparable results.
It is worth noting that the various segmentation scores are dependent on the class balance, and therefore tend to penalize this kind of segmentation where one class (the GGO class in our case) is substantially under-represented. This can be confirmed by confronting the results of the proposed segmentation with published methods such as those reported in Table 1.

Individual Features Analysis
We noticed an evident improvement in the prediction converting the clinical outcomes to dichotomized classes. A second improvement in the quality of the prediction was obtained by the application of the feature reduction techniques, which allow a stabilization of the results. Most of the predictive power for each feature can be synthesized in 3 to 4 features per variable. Of all the variables, particular prominence was observed for the Radiomic features. These features were the most important ones in most of the predictions. Of all the predicted characteristics and outcomes, only the peripherality of GGO and PREDI-CO score requires the inclusion of the age value. For the PREDI-CO score, this is not unexpected as it is one of the components of this score. As for the peripherality of GGO lesions, this is an interesting result as they are an important predictor of clinical outcomes and not intuitively associated with patients' age.
If one considers the results for the different predictors (linear penalized, KNN, Random Forest), one can observe that in general, KNN and Random Forests achieve similar performances, while the penalized linear methods consistently perform better. This can be interpreted as the effect of a progressive non-dichotomic behavior in the system. These linear models were also the ones that gained the least from the pre-selection of the features. This can be explained as the L1 and L2 penalization already reducing the effect of the numerosity of the provided features. Methods such as KNN, based on features metrics, are particularly affected by the features numerosity and thus are the ones that have the greatest improvement by feature selection. Linear penalized methods, on the other end, include an implicit feature selection internally and could be even penalized by a reduction in the number of considered features.
Most of these features have a strong class unbalance (down to around 10% of samples in one group, such as in the Consolidation and Roundish GGO), and therefore, the prediction score tends to be strongly unbalanced, with a strong penalization for the prediction of the least represented class. When this unbalance is not present, such as in the case of the prediction of the PREDI-CO score, one can observe good, balanced prediction scores.
The prediction scores for the PREDI-CO are also higher than for similarly balanced classes (such as Crazy Paving). This indicates that the extracted features, albeit not optimal for predicting individuals' components of the score, are indeed able to predict the score as a global value. It is interesting to notice that the variables considered as the most important in the prediction (both for the Fisher and χ 2 method) alongside the age (a wellknown risk factor) are (1) the distance between the GGO area and the trachea and (2) the irregularity of the GGO lesion shape. This follows the clinical hypothesis that the spreading of the damaged area toward the peripheral area of the lungs leads to the worst prognosis for the patient.
The results obtained in this work cannot overcome the performances of the already published artificial intelligence techniques. The main limitation of this work is related to the number of available samples: semi-supervised learning algorithms are designed to work with small datasets but require better labeling of them compared to supervised methods.
A second criticality is given by the preliminary choice of the number of clusters for the k-means algorithm: in our work, we identified only five putative clusters for the tissue segmentation. This degree of freedom determines the quality of the areas used for the radiomic feature extraction, and therefore, it could affect the efficiency of the prediction models. In contrast, the manipulation of this single degree of freedom could help to achieve better results on in-patient segmentations.
The clinical characteristics and outcomes considered in this work are scores estimated by the expert radiologists for the description of the state of the patient, but they do not consider the real severity of him. This intrinsic limit does not allow a prediction of the real outcome of the patient, allowing only an undirected evaluation.

Conclusions
In the present work, we highlighted the possibility of obtaining a reliable automated segmentation using non-supervised approaches and using this segmentation in a prediction pipeline for patient prognosis.
Artificial Intelligence for diagnostic uses, such as a clinical decision support system, is recognized as an approach rich of potential outcomes but is limited by the requirement of human-driven data curation. With this work, we aimed to prove that semi-supervised approaches to segmentation are promising, as they would combine the best effort of highly trained physicians to develop true gold standard segmentation and the expertise of data analysts to augment that segmentation in full-blown models.
The current COVID-19 pandemic highlights the criticality of relying on high specialized clinicians for time-demanding tasks, as the same experts that can generate goldstandard segmentation for AI training are also the ones responsible for patient diagnosis and care. Improving methods for semi-supervised learning in Radiomics would allow for more effective use of the time and energy of these experts while capitalizing on AI training to support them in patient's diagnosis and treatment.
While the results presented in this work are not yet at the accuracy level necessary for assisted diagnostic use, we surmise that this approach would be helpful in developing a solid triage system, which would help to prioritize the resources available and direct them were most effective.
Supplementary Materials: The following are available online at www.mdpi.com/2076-3417/11/12/5438/s1, Table S1: Comparison between the score results for the gold standard segmentation in the included databases with details for each sample. For each score, the average value (and corresponding standard deviation at 1 σ) is reported, Table S 2 Details of the various state-of-art methods referenced in the paper. Institutional Review Board Statement: The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Institutional Review Board (or Ethics Committee) of IRCCS Azienda Ospedaliero-Universitaria di Bologna (protocol code no. EM949-2020_507/2020/Oss/AOUBo, approved on date: 16 September 2020).

Informed Consent Statement:
This study is an observational, retrospective single-center study and was approved by our local institution review board. Informed consent was waived by the institutional review board owing to the retrospective nature of the study.

Data Availability Statement:
The work is based on a mix of public data and data collected in loco. Public data can be found at [52]. Data collected in loco is available on request due to restrictions on privacy (European GDPR).

Conflicts of Interest:
The authors declare no conflict of interest.