Predicting Mechanical Ventilation and Mortality in COVID-19 Using Radiomics and Deep Learning on Chest Radiographs: A Multi-Institutional Study

In this study, we aimed to predict mechanical ventilation requirement and mortality using computational modeling of chest radiographs (CXRs) for coronavirus disease 2019 (COVID-19) patients. This two-center, retrospective study analyzed 530 deidentified CXRs from 515 COVID-19 patients treated at Stony Brook University Hospital and Newark Beth Israel Medical Center between March and August 2020. Linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and random forest (RF) machine learning classifiers were trained and evaluated to predict mechanical ventilation requirement and mortality using radiomic features extracted from patients’ CXRs. Deep learning (DL) approaches were also explored for the clinical outcome prediction task, and a novel radiomic embedding framework was introduced. All results were compared against radiologist grading of CXRs (zone-wise expert severity scores). Radiomic classification models had mean areas under the receiver operating characteristic curve (mAUCs) of 0.78 ± 0.05 (sensitivity = 0.72 ± 0.07, specificity = 0.72 ± 0.06) and 0.78 ± 0.06 (sensitivity = 0.70 ± 0.09, specificity = 0.73 ± 0.09), compared with expert score mAUCs of 0.75 ± 0.02 (sensitivity = 0.67 ± 0.08, specificity = 0.69 ± 0.07) and 0.79 ± 0.05 (sensitivity = 0.69 ± 0.08, specificity = 0.76 ± 0.08) for mechanical ventilation requirement and mortality prediction, respectively. Classifiers using both expert severity scores and radiomic features for mechanical ventilation (mAUC = 0.79 ± 0.04, sensitivity = 0.71 ± 0.06, specificity = 0.71 ± 0.08) and mortality (mAUC = 0.83 ± 0.04, sensitivity = 0.79 ± 0.07, specificity = 0.74 ± 0.09) demonstrated improvement over either artificial intelligence or radiologist interpretation alone. Our results also suggest instances in which the inclusion of radiomic features in DL improves model predictions over DL alone.
The models proposed in this study and the prognostic information they provide might aid physician decision making and efficient resource allocation during the COVID-19 pandemic.


Introduction
Coronavirus disease 2019 (COVID-19), an illness caused by the novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has spread rapidly across the world.

Figure 1. Study pipeline. Visualized here is the schema for the experiments performed in this study. Experiment 1 demonstrated the use of radiologist expert scoring of CXRs for clinical outcome prediction. In Experiment 2, we extracted predefined radiomic features from segmented CXRs and input them into machine learning models such as linear discriminant analysis, quadratic discriminant analysis, and random forest classifiers. Experiment 3 used a CNN deep learning model to predict COVID-19 patient outcomes using segmented CXRs as inputs. In Experiment 4, we investigated two separate methods (P4 and P5) of integrating radiomic features with segmented CXRs for DL analysis.

Related Work
Machine learning methods have been applied extensively to the study of COVID-19, analyzing both clinical variables and medical images for disease diagnosis and prognosis [3,8,12–15,18–21,23,24]. In the domain of computational radiology, many studies have focused primarily on CT image analysis, though further work is now being performed on CXRs [19,20,24]. However, few studies attempt to predict COVID-19 patient clinical outcomes using CXRs, and the public datasets often studied have been critiqued for potentially biasing results [8,19,23]. Below, we perform a brief survey of related and relevant works.
First, several scoring systems based upon radiologist interpretation have been proposed for the grading of COVID-19 severity using CXRs. Balbi et al. have described their own proposed Brixia score and measurements of diseased lung involvement and their correlation with mortality in COVID-19 patients [6]. Similarly, Toussie et al. and Shen et al. have proposed CXR scoring systems that they have shown to correlate significantly with various outcomes including survival, hospitalization, and intubation [2,22].
In general, DL has been widely used in the field of natural and medical image analysis. In this work, we employed both ResNet and U-Net DL architectures, modifying them for our particular use cases [25,26]. ResNet has been previously applied to a variety of classification tasks using medical images, and U-Net is among the most widely utilized DL architectures for medical image segmentation [27,28]. Both architectures are commonplace for medical image classification and segmentation tasks and have historically performed well across numerous applications.
Computational approaches have also been employed to predict clinical courses for COVID-19 patients. Vaid et al. utilized clinical variables including measurements of inflammation, biomarkers, and other lab values to predict COVID-19 mortality with an AUC of up to 0.84 [3]. Chassagnon et al. utilized a U-Net segmentation pipeline, followed by radiomic feature extraction, using CT data in order to predict long-term survival, with an AUC of up to 0.86 [20]. Studying CXRs, Ferreira Jr. et al. validated the relationship between several radiomic features and COVID-19 diagnosis and prognosis in a small cohort of 49 COVID-19 positive patients [29]. Kwon et al. utilized DL in combination with clinical variables to achieve AUCs of up to 0.88 and 0.82 for intubation and mortality prediction, respectively [24].
Our method combined aspects of each of these approaches to provide a robust, interpretable method for clinical outcome prediction in the context of COVID-19. We analyzed CXRs, a more frequently used modality when compared with CT. Furthermore, our study contained a large dataset of images taken from multiple institutions; the inherent variability in intensity distribution between these datasets demonstrates the robustness of our model on CXRs obtained under different conditions. We also compared radiomic and DL approaches for outcome prediction, investigating their relative benefits for different prediction tasks.

Cohort Description
In this two-center, IRB-approved study, anonymized frontal CXRs were obtained from patients suspected of COVID-19 on presentation at Stony Brook University Hospital (SBUH) and Newark Beth Israel Medical Center (NBIMC) between March and June 2020 (Figure 2). A total of 559 baseline CXRs for 538 patients at SBUH were analyzed. For this study, 17 CXRs of pediatric patients or with poor image quality taken from 16 patients were discarded. A total of 174 baseline CXRs from 174 patients were included from NBIMC. Of these, 5 CXRs were discarded due to indistinguishable lung fields. We considered all CXRs taken on the first day for which CXR data exist for a patient as baseline CXRs. Hence, a patient may have multiple baseline CXRs, though these would all be taken on the same day.
In total, 711 CXRs taken from 691 patients (363 males and 328 females) were analyzed in this study. The mean age of patients studied was 56 years old (median = 57 years, standard deviation = 17.774 years, Table 1). COVID-19 positivity was tested for each patient via reverse transcriptase-polymerase chain reaction (RT-PCR). In total, 530 CXRs from 515 patients who tested positive for COVID-19 (Table 2) and 181 CXRs from 176 patients found not to be infected with COVID-19 at SBUH were analyzed. CXRs taken from COVID-19 positive patients were used in outcome prediction experiments, whereas those from both COVID-19 positive and negative patients were used to build lung and artifact segmentation models. Of the 530 CXRs from positive patients, 217 baseline CXRs were taken for 205 patients that later required mechanical ventilation. A total of 164 CXRs were from 158 patients who later died from the disease. Representative CXR images are displayed in Figure 3.

Lung and Artifact Segmentation
A segmentation pipeline was developed to avoid learning of features unrelated to lung fields. In order to segment lungs and artifacts from CXR images, two residual U-Net DL models were employed [26,30]. Both network architectures were augmented using multiscale image inputs for better intermediate feature representations with deep supervision (Figure 4) [31]. Lung fields and artifacts such as EKG leads, pacemakers, and other non-anatomical objects were first manually segmented for a dataset of 100 CXRs, excluding heart shadows. These segmentations were used to train the two networks, one for lung segmentation and the other for artifact segmentation. A focal Tversky loss function (alpha = 0.3, gamma = 1.0) was employed to penalize false positive predictions [32], avoiding misidentification of high-intensity objects as lungs and mitigating misclassification of lungs as unwanted artifacts. The trained models were then used to generate lung and artifact masks for the remaining 611 CXRs. Each of these masks was manually reviewed and errors in segmentation, if any, were corrected.
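The loss described above can be sketched as follows. This is a minimal numpy illustration of the focal Tversky formulation, assuming the common convention in which alpha weights false negatives, so that alpha = 0.3 places the larger weight (1 − alpha = 0.7) on false positives, consistent with the stated goal of penalizing false positive predictions; it is not the exact training code.

```python
import numpy as np

def focal_tversky_loss(y_true, y_pred, alpha=0.3, gamma=1.0, eps=1e-7):
    """Focal Tversky loss on flattened probability maps.

    Assumed convention: alpha weights false negatives and (1 - alpha)
    weights false positives, so alpha = 0.3 penalizes FPs more heavily.
    With gamma = 1.0 this reduces to the plain Tversky loss.
    """
    y_true = y_true.ravel().astype(float)
    y_pred = y_pred.ravel().astype(float)
    tp = np.sum(y_true * y_pred)
    fn = np.sum(y_true * (1.0 - y_pred))
    fp = np.sum((1.0 - y_true) * y_pred)
    tversky = (tp + eps) / (tp + alpha * fn + (1.0 - alpha) * fp + eps)
    return (1.0 - tversky) ** (1.0 / gamma)
```

Under this convention, a prediction with one false positive incurs a larger loss than one with one false negative, which discourages labeling bright non-lung objects as lung.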

Average Histogram Matching (HM)
It should be noted that CXRs from the two institutions, SBUH and NBIMC, fall within two distinct data domains differing in pixel intensity distribution. To mitigate these differences, average histogram matching (HM) was employed (Figure 5). A total of 80 CXR images were chosen randomly from the SBUH dataset to create an average cumulative distribution. Every CXR from both SBUH and NBIMC was then mapped to this average cumulative distribution function using an HM approach, bringing all CXRs into the same intensity range [33].
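A simplified numpy sketch of this average-CDF histogram matching follows, assuming intensities already scaled to [0, 1] and 256 histogram bins (details the text does not specify):

```python
import numpy as np

def average_cdf(images, n_bins=256):
    """Average cumulative distribution over a set of reference images."""
    cdfs = []
    for img in images:
        hist, _ = np.histogram(img.ravel(), bins=n_bins, range=(0, 1))
        cdf = np.cumsum(hist).astype(float)
        cdfs.append(cdf / cdf[-1])
    return np.mean(cdfs, axis=0)

def match_to_cdf(img, ref_cdf, n_bins=256):
    """Remap image intensities so their CDF follows ref_cdf."""
    hist, bin_edges = np.histogram(img.ravel(), bins=n_bins, range=(0, 1))
    src_cdf = np.cumsum(hist).astype(float)
    src_cdf /= src_cdf[-1]
    # For each source bin, find the reference intensity at the same quantile.
    ref_levels = np.linspace(0, 1, n_bins)
    mapping = np.interp(src_cdf, ref_cdf, ref_levels)
    bin_idx = np.clip(np.digitize(img.ravel(), bin_edges[:-1]) - 1, 0, n_bins - 1)
    return mapping[bin_idx].reshape(img.shape)
```

In the study's pipeline, `average_cdf` would be built from the 80 randomly chosen SBUH CXRs, and every CXR from both sites would then pass through `match_to_cdf`.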
For both ventilation and mortality classification, models were trained and evaluated in a cross-validation setting. To this end, 217 ventilation-positive and 300 ventilation-negative CXRs were used for ventilation classification, whereas 164 CXRs from deceased patients and 357 CXRs from recovered patients were used for mortality classification. For each iteration of cross-validation evaluation, folds were chosen such that training and testing folds each contained an equal number of positive and negative samples.
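One way to construct folds with equal positive and negative counts is sketched below; the study does not spell out its exact balancing scheme, so the scheme here (fold size limited by the minority class) is an illustrative assumption:

```python
import numpy as np

def balanced_folds(pos_idx, neg_idx, n_folds=5, seed=0):
    """Split sample indices into folds with equal positives and negatives.

    The minority-class count limits the per-fold sample count, so some
    majority-class samples may sit out a given iteration (an assumption
    here, not a detail confirmed by the text).
    """
    rng = np.random.default_rng(seed)
    pos = rng.permutation(pos_idx)
    neg = rng.permutation(neg_idx)
    n_per = min(len(pos), len(neg)) // n_folds
    folds = []
    for k in range(n_folds):
        sel = slice(k * n_per, (k + 1) * n_per)
        folds.append(np.concatenate([pos[sel], neg[sel]]))
    return folds
```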

Experiment 1: Outcome Classification Using Radiologist Severity Scores
In order to develop a clinical baseline model, we adopted a previously described CXR scoring system for COVID-19 patients [7]. Scoring of CXRs was performed by radiology residents (G.S., R.G., S.A., N.S., C.M., and J.P.). Any ambiguous scores were further confirmed by one of two attending radiologists (J.G. and A.G.). For each lung, a severity score of 0, 1, or 2 was assigned to each of three lung zones: lower, middle, and upper (Figure 6), with a maximum possible score of 12 for both lungs combined. A score of 0 was assigned to lung zones with no radiographic findings, a score of 1 was assigned to zones with the presence of ground-glass opacities, and a score of 2 was assigned to zones with consolidative opacities with or without air bronchograms. The formulation of this system and the assignment of different scores to six lung zones is in line with other described COVID-19 CXR scoring systems [2,6]. Once these scores were assigned for each CXR, a multiple logistic regression model was developed to predict mechanical ventilation requirement and mortality based upon these zone-wise expert scores. This approach was evaluated in a cross-validation setting and served as a human-based comparison for the machine learning models discussed below.
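As an illustration, a multiple logistic regression over the six zone-wise scores can be fit with plain gradient descent. This is a stand-in for a standard statistics package, shown on synthetic data only; function names are hypothetical.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Gradient-descent logistic regression. X holds the six zone-wise
    severity scores per CXR (one row per CXR), y the binary outcome."""
    X = np.hstack([np.ones((len(X), 1)), X])  # prepend intercept column
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))      # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)      # average log-loss gradient
    return w

def predict_proba(X, w):
    X = np.hstack([np.ones((len(X), 1)), X])
    return 1.0 / (1.0 + np.exp(-X @ w))
```

On synthetic scores where the outcome follows the total severity, the fitted model recovers the threshold-like decision rule.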

Experiment 2: Outcome Classification Using Radiomic Features
Radiomic features were extracted from CXRs for clinical outcome prediction in order to provide interpretable insights into which textural features might be most predictive of mortality and mechanical ventilation. In total, 143 radiomic features from the Haralick, Gabor, Laws energy, histogram of gradients, and grey intensity feature families were computed for each baseline CXR [34–37]. Features were extracted solely from segmented lung fields, excluding artifacts. Descriptions of various radiomic features can be found in Table 3. Each of these features has been previously studied in medical applications including in the study of COVID-19 [29,34–37]. In this study, we performed an exploratory analysis of these various well-studied radiomic features in order to determine their relative value in predicting clinical outcomes for COVID-19 patients. For each radiomic feature, statistics including measures of median, skewness, standard deviation, and kurtosis were calculated. These statistics and clinical factors including expert scores and patient age/sex were used for classifier construction. For prediction of future mechanical ventilation requirement and mortality, random forest (RF), linear discriminant analysis (LDA), and quadratic discriminant analysis (QDA) classifiers were trained and cross-validated on radiomic features from baseline CXRs [38,39]. For each of 50 iterations in a 5-fold cross-validation setting, feature reduction among radiomic and clinical features was performed on the training set using a Wilcoxon rank-sum test, Student's t-test, or a maximum relevance minimum redundancy approach [40]. Highly correlated features (Pearson correlation threshold = 0.9) were removed to reduce redundancy. Ablation studies were performed to assess the relative performance of radiomic classification with and without HM and with and without clinical features.
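The correlation-based redundancy removal step can be sketched as a simple greedy filter over the feature columns. This is an illustrative implementation; the study's exact ordering of statistical tests and filters may differ.

```python
import numpy as np

def drop_correlated(features, threshold=0.9):
    """Greedily keep features whose absolute Pearson correlation with
    every already-kept feature stays at or below the threshold.

    features: 2D array, rows = samples, columns = radiomic features.
    Returns the list of retained column indices.
    """
    corr = np.abs(np.corrcoef(features, rowvar=False))
    keep = []
    for j in range(features.shape[1]):
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return keep
```

A greedy pass like this depends on column order; in practice the columns would typically be pre-ranked by the relevance test (e.g., Wilcoxon rank-sum) so that the more discriminative member of a correlated pair is kept.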

Experiment 3: Outcome Classification Using Convolutional Neural Networks
Convolutional neural networks (CNNs) were employed to predict future mechanical ventilation requirement and patient mortality from baseline CXRs. Additional preprocessing steps for DL included automatic cropping of CXRs to a tight boundary around the lungs, resizing input images to 224 × 224 pixels, and the application of min-max normalization to rescale image intensity values between 0 and 1.
For each classification experiment, a ResNet-50 pretrained on ImageNet was utilized [25]. Data augmentation techniques such as flipping, rotation, and translation were used to reduce overfitting. The fully connected (FC) layer of each architecture was replaced by a custom layer with an input size of 512 by 1 (no clinical variables included) or 520 by 1 (expert scores and patient age/sex included) and output size of 2 by 1 to match our desired binary classification scheme. The FC layer was trained without the use of pretrained weights. Dropout layers with a probability of 0.1 were included after FC layers to improve the generalizability of classification. For each model, a binary cross-entropy loss function and an Adam optimizer with a learning rate of 0.00001 were used for network training [41]. The learning rate was multiplied by a factor of 0.01 after every 10th epoch. Models were trained and evaluated in a cross-validation setting in which new training, validation, and testing splits were chosen for each of five iterations.
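The stated schedule, in which the learning rate is multiplied by 0.01 every 10 epochs (as in a standard step-decay scheduler), amounts to:

```python
def step_lr(epoch, base_lr=1e-5, step=10, gamma=0.01):
    """Step-decay schedule: the learning rate is multiplied by gamma
    once every `step` epochs. base_lr = 1e-5 and gamma = 0.01 mirror
    the training settings described above."""
    return base_lr * gamma ** (epoch // step)
```

So epochs 0–9 train at 1e-5, epochs 10–19 at 1e-7, and so on.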
Class activation maps (CAMs) were also generated using network outputs prior to the global average pooling layer in the ResNet-50 architecture. These CAMs enable a degree of visualization of a network's "attention" in making predictions, thereby providing a soft validation of the prognostically relevant regions as determined by the network.
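A CAM of this kind is a weighted sum of the final convolutional feature maps, using the FC weights of the target class; a minimal numpy sketch with illustrative shapes (not the actual network tensors) follows:

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """CAM: weight the final conv feature maps by the FC weights of the
    target class, keep positive evidence, and normalize to [0, 1].

    feature_maps: (C, H, W) activations before global average pooling.
    fc_weights:   (n_classes, C) weights of the final FC layer.
    """
    cam = np.tensordot(fc_weights[class_idx], feature_maps, axes=1)  # (H, W)
    cam = np.maximum(cam, 0)        # discard negative evidence
    if cam.max() > 0:
        cam = cam / cam.max()       # normalize for overlay visualization
    return cam
```

The resulting (H, W) map is then upsampled to the CXR resolution and overlaid as a heatmap, as in Figure 9.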
In addition, t-distributed stochastic neighbor embedding (t-SNE) was used to visualize features extracted using ResNet-50 models for mortality and mechanical ventilation predictions using network outputs prior to the final FC layer [42].

Experiment 4: Outcome Classification Using Convolutional Neural Networks and Radiomic-Map Embedding
DL of radiomic and imaging features was explored using two different approaches.

Feed-Forward Concatenation of Radiomic Features
In this approach, the features used for classifier development in Experiment 2 were first normalized to within a range of 0 to 1 before being concatenated to the output of the upsampling layer of the ResNet-50 architecture used in Experiment 3. The following feed-forward layer was then modified to contain 512 + n neurons, where n is the number of chosen radiomic features for the desired classification problem. If clinical data including expert scores and patient age/sex were also included, the number of neurons was instead 520 + n. In this experiment, model weights for the initial image feature extractor layers were used from Experiment 3, whereas the weights for the altered feed-forward layer were randomly initialized. The entire model was then trained. This process was identical for both mechanical ventilation and mortality prediction.
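A simplified sketch of the concatenation step, shown here in numpy on the pooled feature vector rather than inside the network graph (the 512-dimensional deep feature size follows the text; names are illustrative):

```python
import numpy as np

def fused_input(deep_features, radiomic_features):
    """Concatenate pooled deep features with min-max-normalized radiomic
    features ahead of the modified (512 + n)-neuron FC layer.

    A simplified stand-in for the in-graph concatenation described above.
    """
    r = np.asarray(radiomic_features, dtype=float)
    r = (r - r.min()) / (r.max() - r.min() + 1e-8)  # scale to [0, 1]
    return np.concatenate([np.asarray(deep_features, dtype=float), r])
```

With n radiomic features, the fused vector has length 512 + n, matching the modified feed-forward layer's input size.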

Radiomic-Embedded Feature Maps
Radiomic features from Experiment 2 were used to create radiomic-embedded feature maps for each CXR. t-SNE (random state = 1) was employed to perform feature reduction and to convert radiomic data to a 2D representation [42]. To assess the predictive capability of a model trained using both radiomic-embedded feature maps and CXR images as inputs, the same general procedure employed in Experiment 3 was used. A key difference was a change in the first input convolution filter of the ResNet-50 architecture to receive a 2-channel CXR and radiomic-embedded map input rather than a 3-channel input. All other network configurations are identical to those described in Experiment 3. Dataset splits of each of these classifiers were identical to those detailed in Experiment 3.
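The 2-channel input assembly can be sketched as a simple stacking step (illustrative only; the actual pipeline feeds these arrays to the ResNet-50 with its modified first convolution):

```python
import numpy as np

def two_channel_input(cxr, radiomic_map):
    """Stack a segmented CXR and its radiomic-embedded feature map into
    one 2-channel array, matching the modified first convolution filter
    that accepts 2 channels instead of the usual 3 (RGB)."""
    cxr = np.asarray(cxr, dtype=float)
    radiomic_map = np.asarray(radiomic_map, dtype=float)
    assert cxr.shape == radiomic_map.shape, "channels must share H x W"
    return np.stack([cxr, radiomic_map], axis=0)  # (2, H, W)
```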

Results
Results for Experiments 1, 2, 3, and 4 are summarized in Tables 4-7 and are reported as mean ± 95% confidence interval based on fivefold cross-validation results.

Experiment 1: Outcome Classification Using Radiologist Severity Scores
For Experiment 1, expert scores predicted mechanical ventilation with a mean cross-validated AUC (mAUC) of 0.75, a specificity of 69%, and a sensitivity of 67%. Expert scores were able to predict mortality with an mAUC of 0.79, a specificity of 76%, and a sensitivity of 69%. The distribution of zone-wise expert scores among patients in each clinical outcome class is shown in Figure 7a,b. Figure 7c,d visualize distributions of total expert scores among patients in each clinical outcome class. The distributions of expert scores within each lung region, along with the distribution of total expert scores, differed significantly between patients requiring mechanical ventilation and those who did not, as well as between deceased and recovered patients. Correlations between zone-wise expert scores for each lung region and clinical outcomes/variables are shown in Figure 7e. Total severity score (the sum of scores from all lung regions) correlated most strongly with both future ventilation requirement (0.44) and mortality (0.40). These correlations were stronger than those between individual lung zones and clinical outcomes and between patient age or sex and clinical outcomes.

Experiment 2: Outcome Classification Using Radiomic Features
For Experiment 2, a machine learning classifier trained to predict the need for mechanical ventilation using radiomic features extracted from non-HM-adjusted images yielded an mAUC of 0.72, a specificity of 67%, and a sensitivity of 64%. Using radiomic features from HM-adjusted images achieved an mAUC of 0.78, a specificity of 72%, and a sensitivity of 72% for mechanical ventilation prediction. A machine learning classifier used to predict mortality in COVID-19 positive patients using radiomic features from non-HM-adjusted images had an mAUC of 0.77, a specificity of 72%, and a sensitivity of 72%. Using radiomic features from HM-adjusted images resulted in an mAUC of 0.78, a specificity of 73%, and a sensitivity of 70% for mortality prediction. The inclusion of zone-wise expert scores and patient age and sex improved both mechanical ventilation and mortality prediction when combined with radiomic features to yield an mAUC of 0.79, specificity of 71%, and sensitivity of 71% for mechanical ventilation prediction and an mAUC of 0.83, specificity of 74%, and sensitivity of 79% for mortality prediction.
The top features for radiomic outcome classification are listed in Table 7. Please see Table 3 for detailed descriptions of these features. Among the most discriminating radiomic features identified for predicting mechanical ventilation requirement and mortality were the Laws E5S5 energy and Haralick correlation features, respectively (Figure 8). The Laws E5S5 filter is a composite edge and spot detection filter, whereas the Haralick correlation measures the similarity of a pixel to its neighbors using a grey-level co-occurrence matrix.

Experiment 3: Outcome Classification Using Convolutional Neural Networks
In Experiment 3, a ResNet-50 model trained solely using non-HM-adjusted CXRs to predict future mechanical ventilation requirement had an mAUC of 0.70, a specificity of 72%, and a sensitivity of 55% on cross-validation. Using HM-adjusted images as input for DL resulted in improved mechanical ventilation requirement prediction with an mAUC of 0.75, a specificity of 73%, and a sensitivity of 64%. A ResNet-50 model trained using non-HM-adjusted CXRs to predict mortality yielded an mAUC of 0.72, a specificity of 72%, and a sensitivity of 56%. Using HM-adjusted images for DL training resulted in improved mortality prediction with an mAUC of 0.75, a specificity of 74%, and a sensitivity of 59%.

Experiment 4: Outcome Classification Using Convolutional Neural Networks and Radiomic-Map Embedding
For Experiment 4, we found that the inclusion of radiomic features improved DL prediction of both mechanical ventilation and mortality. DL models trained using radiomic-embedded feature maps improved the prediction of mortality over DL of CXRs alone but did not increase performance when predicting mechanical ventilation requirement. Using feed-forward concatenation of radiomic features to DL features, our model obtained an mAUC of 0.77, a specificity of 75%, and a sensitivity of 66% for mechanical ventilation requirement prediction. Using radiomic-embedded features, a DL model produced an mAUC of 0.74, a specificity of 76%, and a sensitivity of 59% for mortality prediction. The inclusion of clinical features including expert scores and patient age/sex improved predictions for mechanical ventilation requirement with an mAUC of 0.78, a specificity of 78%, and a sensitivity of 67%. For mortality prediction, the inclusion of clinical features improved model predictions to obtain an mAUC of 0.77, a specificity of 60%, and a sensitivity of 77%. Ultimately, the inclusion of radiomic features improved DL prediction of clinical outcomes (Table 6).
For DL experiments, representative CAMs are shown in Figure 9. An expert reader (J.G., 15 years of experience) noted that for CXRs from patients who required mechanical ventilation, CAM maximal signal intensity was shown to correlate with areas of dense infiltrates. Selected CXRs from patients who did not require mechanical ventilation appeared to demonstrate no focal consolidation or infiltrates; the maximal CAM signal for these CXRs was observed in left middle lung zones, predominantly along the perihilar region. For all CAMs generated, network activations were shown to be most significantly located within lung fields. t-SNE feature reduction for deep features is also visualized in Figure 9. Clustering of features from patients who did and did not require mechanical ventilation was observed.
Figure 9. t-SNE and CAM visualization of DL predictions: (a) displays t-SNE clustering of DL network outputs for ventilation prediction; (b-d) demonstrate no focal consolidation or infiltrates. CAMs show maximal signal intensity in the left middle lung zone predominantly along the perihilar region; (e) shows no focal consolidation or infiltrates. CAM shows maximal signal intensity in the right mid to lower lung zone; (f) demonstrates diffuse patchy infiltrates bilaterally, predominantly in the mid to lower lung zones. CAM shows the highest signal intensities in the right mid to lower lung zones in areas of dense infiltrates. Additionally, noted is slightly increased CAM activity in the left lower lobe around the areas of dense infiltrates; (g) demonstrates diffuse patchy infiltrates bilaterally. CAM shows the highest signal intensities in the right lower and left upper lung zones around areas of slightly dense infiltrates; (h) shows diffuse infiltrates bilaterally with relative sparing of the right upper lobe. CAM shows the highest signal intensities in the right mid and left mid to lower lung zones in areas of dense infiltrates; (i) demonstrates diffuse bilateral reticular opacities with interlobular septal thickening along with superimposed dense infiltrates predominantly in the lower lobes. CAM shows the highest signal intensity in the right lower lung zone around areas of dense infiltrates. CXR interpretation performed by J.G. (15 years of experience).

Discussion
In this work, we presented models for baseline CXR analysis demonstrating high sensitivities for future mechanical ventilation requirement (71%) and mortality (79%) prediction. These models outperform expert score-based classification, which yields sensitivities of 67% and 69% for mechanical ventilation requirement and mortality, respectively. These results highlight the value that quantitative modeling of CXRs can have for the prognostic prediction of COVID-19. Previous non-imaging models have been proposed with high sensitivities for various clinical outcomes using biomarkers such as serum lactate dehydrogenase, lymphocyte counts, and coagulation factors in the setting of COVID-19 [4,13–15]. We demonstrated that these models might be complemented by imaging-based approaches. The ability to discern actionable prognostic information from baseline CXRs has significant implications for decision making and triage in the COVID-19 pandemic, especially in high-volume hospital settings. Determining which patients might progress to severe disease would enable healthcare providers to make informed decisions regarding treatments. Furthermore, the ubiquitous nature of CXR in the management of COVID-19 makes a quantitative predictor of outcomes using the modality a convenient and useful tool for physicians.
Previous studies have applied DL to the analysis of COVID-19 CXRs [8,18,19,23]. However, at least one study has reported potential deficiencies in these approaches, including insufficiencies in a commonly used public dataset, neglecting to segment lung fields, and a failure to account for large differences between disparate public datasets [23]. Most significantly, there has been some suggestion that a few studies on a large multi-institutional public dataset may have produced models that learn to distinguish between data taken from different institutions rather than distinguishing meaningful differences in underlying pathology [23]. Nevertheless, new evaluation methods and improvements in data quality might improve experiments performed on these public datasets [43]. Previous studies have also not explicitly accounted for foreign objects in lung fields, which can obscure pathological findings. Here, we further presented a method for dataset homogenization between two separate institutions using HM, addressing any potential discrimination between datasets by our models. Furthermore, we developed a unique CXR preprocessing pipeline to segment lungs and artifacts.
Radiomic features can provide insight into which characteristics of a patient's CXR are significant in making clinical predictions and can be more informative to a physician than exclusively DL approaches. From our results, it can be observed that radiomic features play an interesting role in outcome prediction for COVID-19. A small subset of radiomic features was shown to be effective in predicting outcomes for both mechanical ventilation requirement (3 features) and mortality (1 feature). Radiomic feature classification of future mechanical ventilation requirement improved with HM while also reducing the number of features required for accurate outcome prediction (10 vs. 3 features). Interestingly, the opposite effect was observed for mortality prediction; the number of features needed for outcome prediction increased following HM (1 vs. 4 features). For ventilation prediction, classifier performances improved following HM, whereas HM slightly worsened mortality prediction performance. Laws energy filters appear to be important in making mechanical ventilation requirement predictions, and Figure 8 demonstrates the observed improvement in Laws E5S5 feature discrimination between classes following HM. For mortality prediction, Laws energy filters are also selected as discriminatory features following HM. However, the performance of these features in predicting mortality is not as strong as the use of Haralick features prior to HM. Notably, the Haralick correlation feature does not seem to be "improved" by HM and becomes less valuable in class discrimination for mortality prediction (Figure 8). The variable effect of postprocessing techniques on different radiomic feature families warrants further exploration in future experiments. Here, we showed that two different feature families (Haralick and Laws energy) might have unique roles in predicting different clinical outcomes and might be variably affected by HM.
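The Laws E5S5 response discussed above is obtained by convolving the image with the 2-D kernel formed from the outer product of the 1-D edge (E5) and spot (S5) vectors, then averaging the absolute response over a local window to yield a texture-energy map. A minimal sketch follows; the window size is an arbitrary illustrative choice, not the study's parameter.

```python
import numpy as np
from scipy.ndimage import convolve, uniform_filter

# Laws 1-D vectors (illustrative subset): E5 detects edges, S5 detects spots.
E5 = np.array([-1.0, -2.0, 0.0, 2.0, 1.0])
S5 = np.array([-1.0, 0.0, 2.0, 0.0, -1.0])

def laws_e5s5_energy(image, window=15):
    """Laws E5S5 texture-energy map: convolve with the outer-product
    kernel, then average the absolute response over a local window."""
    kernel = np.outer(E5, S5)
    response = convolve(np.asarray(image, dtype=float), kernel, mode="reflect")
    return uniform_filter(np.abs(response), size=window, mode="reflect")
```

Because both 1-D vectors sum to zero, the kernel has zero response on uniform regions, so the energy map highlights textured lung parenchyma rather than overall brightness.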
In this work, we also explored the relative value of two methods of radiomic feature inclusion in deep learning: radiomic feature embedding and feed-forward concatenation of radiomic features. Notably, the inclusion of radiomic features improved DL predictions for both clinical outcome tasks. For mechanical ventilation requirement prediction, feed-forward radiomic feature concatenation was superior to radiomic feature embedding; the opposite was observed for mortality prediction. This again indicates that different machine learning approaches and selective model invocation may be required for different clinical prediction tasks. We also found that HM uniformly improves DL prediction of clinical outcomes.
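Feed-forward concatenation can be sketched as a two-input network in which the CNN's pooled image features are concatenated with the radiomic vector before the final classifier head. The PyTorch sketch below is a minimal illustration under that assumption; all layer sizes and the radiomic feature count are hypothetical, not the study's architecture.

```python
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    """Sketch of feed-forward radiomic concatenation: a small CNN encodes
    the CXR, its pooled features are concatenated with the radiomic
    vector, and a linear head emits the outcome logit (hypothetical
    layer sizes, for illustration only)."""

    def __init__(self, n_radiomic=3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),   # -> (B, 8)
        )
        self.head = nn.Linear(8 + n_radiomic, 1)     # fused vector -> logit

    def forward(self, cxr, radiomic_vec):
        img_feat = self.encoder(cxr)                        # (B, 8)
        fused = torch.cat([img_feat, radiomic_vec], dim=1)  # (B, 8 + r)
        return self.head(fused)                             # (B, 1)
```

The embedding alternative differs in where the radiomic information enters the network; the concatenation shown here injects it only at the classifier head.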
We also demonstrated that radiomic and DL analysis of CXRs can achieve competitive or superior results in predicting clinical outcomes when compared with expert scoring of CXR severity. This is of particular significance in high-volume or low-resource healthcare settings where expert annotations may be harder to obtain. Moreover, the combination of DL and radiomic approaches with zone-wise expert scoring of CXRs performs even more accurately in the outcome prediction task, indicating that the two might be applied synergistically to further improve predictions. Furthermore, our models have demonstrated validity on a multi-institutional dataset and might provide a more consistent method of CXR evaluation than human scoring.
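Combining zone-wise expert scores with radiomic features amounts to stacking the two feature blocks into one vector before classification. The sketch below uses synthetic stand-in data with a random forest, one of the classifier families named in this study; the feature counts, zone count, outcome rule, and hyperparameters are all illustrative, not the study's data or settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Synthetic stand-ins for per-patient inputs (illustrative only).
rng = np.random.default_rng(0)
n = 200
radiomics = rng.normal(size=(n, 4))              # selected radiomic features
expert_scores = rng.integers(0, 4, size=(n, 6))  # zone-wise severity, 0-3
outcome = (radiomics[:, 0]
           + expert_scores.mean(axis=1)
           + rng.normal(scale=0.5, size=n) > 1.5).astype(int)

# Combined classifier: expert scores stacked next to radiomic features.
X = np.hstack([radiomics, expert_scores])
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X[:150], outcome[:150])
auc = roc_auc_score(outcome[150:], clf.predict_proba(X[150:])[:, 1])
```

Dropping either block from `X` gives the single-source baselines against which the combined model's improvement is measured.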
There are certain limitations to our work. First, the baseline CXRs we used were likely acquired at nonuniform intervals between COVID-19 infection and imaging. While this is representative of the clinical reality that patients receive baseline CXRs at varying time points in their disease course, future studies might build improved time-to-event prediction models using data with a more uniform temporal distribution. It is also important to note that the two clinical outcomes studied in this work are neither independent nor mutually exclusive; generally, a patient requiring mechanical ventilation is more likely to succumb to their disease than one who does not. Furthermore, a limited number of clinical features were studied, and our models might benefit from including comorbidities such as a history of cancer, chronic obstructive pulmonary disease, or hypertension. Other studies have previously validated the utility of such measures in predicting COVID-19 progression and clinical outcomes [3,16]. Additionally, in this study, we did not control for code status among patients, which might influence results. For instance, a patient's disease might progress to an emergent situation requiring mechanical ventilation, but the patient might have a standing order not to initiate such a procedure [22]. Future experiments might attempt to control for this confounding variable if these data are made readily available. Finally, additional validation is necessary to demonstrate the robustness of classification models in the broader context of COVID-19 treatment at other hospitals and locations.
This work, along with several other recent studies, established the value of computational analysis of CXRs for studying clinical outcomes in COVID-19 [2,21,22,24,44]. In most cases, these studies analyze CXRs taken at a single time point, although modeling of sequential CXR data might enable an improved analysis of the temporal evolution of COVID-19 as observed on imaging.

Conclusions
In summary, we presented a complete pipeline for computational evaluation of CXR in COVID-19 patients. Both radiomic and DL classification models enable us to predict mechanical ventilation requirement and mortality from baseline CXRs. Each of these approaches outperforms or performs competitively with predictions made using expert severity assessment of CXRs, indicating the potential for increased efficacy and efficiency in modeling COVID-19 outcomes using machine learning approaches. Furthermore, we demonstrated the improvement that a novel radiomic embedding approach has on DL predictions of COVID-19 outcomes. The ability to make early predictions of disease outcomes may aid in triage, clinical decision making, and efficient hospital resource allocation as the COVID-19 pandemic progresses.
Informed Consent Statement: All patient data were deidentified prior to analysis.

Data Availability Statement: A portion of the data reported in this study will be made available through the Cancer Imaging Archive COVID-19 imaging collection.