COVID-19 and Artificial Intelligence: An Approach to Forecast the Severity of Diagnosis

(1) Background: The SARS-CoV-2 pandemic overwhelmed intensive care units, clinicians, and radiologists, so methods to forecast the severity of diagnosis became a necessary and helpful tool. (2) Methods: In this paper, we propose an artificial intelligence-based multimodal approach to forecast the future diagnosis severity of patients with laboratory-confirmed SARS-CoV-2 infection. At hospital admission, we collected 46 clinical and biological variables together with chest X-ray scans from 475 patients who tested positive for COVID-19. An ensemble of machine learning algorithms (AI-Score) was developed to predict the future severity score (mild, moderate, or severe) of COVID-19-infected patients. Additionally, a deep learning module (CXR-Score) was developed to automatically classify the chest X-ray images and integrate them into AI-Score. (3) Results: The AI-Score predicted the COVID-19 diagnosis severity on the testing/control dataset (95 patients) with an average accuracy of 98.59%, average specificity of 98.97%, and average sensitivity of 97.93%. The CXR-Score module graded the severity of chest X-ray images with an average accuracy of 99.08% on the testing/control dataset (95 chest X-ray images). (4) Conclusions: Our study demonstrated that deep learning methods based on the integration of clinical and biological data with chest X-ray images accurately predicted the COVID-19 severity score of patients who tested positive.


Introduction
The emergence and extremely rapid spread of the new coronavirus, COVID-19, has created a global problem never seen before. International organizations have come to the aid of the states of the world and have proposed numerous legislative instruments for combating the spread of this virus. The pandemic context made its presence felt at a rather difficult time nationally and globally [1]. This global challenge, SARS-CoV-2, overcomes several intensely debated issues, such as respect for human rights and freedoms, environmental protection, respect for democracy, and the reduction of social and economic imbalances that can influence the spread of the disease [2].
Researchers have tried developing more effective prediction models to control the spread of COVID-19 and discovered new, unexpected factors that can influence the severity of the disease [3,4]. With the appearance of COVID-19, more attention has been paid to improving automatic recognition systems based on artificial intelligence methods. It is difficult to provide an easy solution to this problem. However, precise and effective artificial intelligence techniques can be useful in overcoming this pandemic and providing adequate care to patients.
Forecasting the severity of COVID-19 is a crucial step in patient management, since the treatment depends on it: mild and moderate COVID-19 cases necessitate antivirals and oxygen therapy, while severe ones necessitate intensive care units or ventilator support [5].
Many studies showed how important it is to anticipate COVID-19 severity, since the evolution to the critical disease stage can be quick, due especially to the immune mechanisms [6,7]. Systems based on deep learning algorithms were developed to diagnose COVID-19 disease, using different medical imaging modalities such as computed tomography (CT) and X-ray [8].
In [9,10], chest X-ray images were investigated to differentiate lung changes produced by COVID-19 disease. These studies demonstrated that the prediction of COVID-19 disease severity could also be established based on lung changes such as ground-glass opacity, lung involvement, consolidation, bilateral infiltration, and vascular enlargement.
Together with the patterns of lung CT or X-ray images, other parameters taken into consideration for predicting the severity were symptoms and clinical and biological tests of COVID-19-infected patients [11].
In [12][13][14], the authors used different machine learning algorithms to investigate the correlations between routine blood tests and COVID-19 diagnosis. In [12], an ensemble of four machine learning algorithms (support vector machines, adaptive boosting, random forest, and k-nearest neighbours) was developed to investigate whether routine blood tests can be used to detect COVID-19 infection. The conclusion of the study [12] was that routine blood tests alone did not allow accurate detection of COVID-19.
In [15], a screening method based on machine learning was proposed to predict a positive SARS-CoV-2 infection, taking into consideration eight symptoms and features: sex, age, cough, fever, sore throat, shortness of breath, headache, and known contact with a person confirmed to have COVID-19.
The scope of the study proposed in [16] was to develop an artificial intelligence algorithm capable of correlating possible findings on chest CT of patients, symptoms of respiratory syndromes, and positive epidemiological factors with the evolution of the COVID-19 disease.
In our study, we designed a multimodal approach for predicting the future severity of diagnosis of COVID-19-infected patients at early disease stages. The proposed method integrated information from multiple sources (chest X-ray images, symptoms, clinical and biological variables), using an ensemble of pre-trained deep learning (DL) models and a stacking ensemble of machine learning (ML) algorithms.
For classifying chest X-ray images (CXR), we developed an ensemble (CXR-Score) composed of four pre-trained models: VGG-19, ResNet50, DenseNet121, and Inception v3. These DL models were trained on the ImageNet dataset, being capable of recognizing 1000 object classes. Using transfer learning with fine-tuning, they can be used in diverse domains. The method consists of removing the final set of fully connected layers of the pre-trained network and replacing them with a new set of fully connected layers with random initializations [17,18].
For classifying the patients' symptoms and biological and clinical data, we developed an ensemble called AI-Score based on stacking. This module combined three base models (AdaBoost, Random Forest, and XGBoost) and their configurations to make better predictions than a single model [19].
The ensemble-based method was successfully applied to improve the accuracy of individual DL models and ML algorithms, even though we had a small patient dataset.

Multimodal Approach Description
Our multimodal approach consisted of the following steps, described in Figure 1.

1. At admission, from positive-tested COVID-19 patients, we collected the symptoms, clinical variables, blood tests, and chest X-ray scans together with a radiologist's report.
2. During hospitalization, each patient was diagnosed with a COVID-19 severity score (mild, moderate, or severe) assessed by the oxygen flow rate, the necessity of mechanical ventilation, or patient death.
3. We constructed modules based on artificial intelligence that were trained on data collected from patients with COVID-19 and could predict the future severity of the diagnosis.

Retrospective Study
The retrospective study included 475 patients with laboratory-confirmed cases of SARS-CoV-2 infection, admitted between September 2020 and May 2021 to Victor Babes University Hospital (Craiova, Romania), the only COVID Hospital in the Oltenia Region. Patients were included in the study if they were admitted directly into one of the two Infectious Disease Clinics or into one of the two Pneumology Departments. Patients confirmed with SARS-CoV-2 infection and transferred from other non-profile hospitals in the region were also included.
Patients who did not undergo laboratory confirmation of SARS-CoV-2 or who tested negative in the laboratory of Victor Babes University Hospital or Human Genomics Laboratory as part of the Craiova Center of Diagnostic and Treatment were excluded from this research.
Epidemiological and clinical data were analyzed among those with abnormal or normal imaging findings. The chest X-ray scans were performed using three different radiography systems, one of which was mobile and two of which were fixed.
This study was carried out following the Helsinki Declaration of 1975 and was approved by the Review Ethics Board of the Victor Babes University Hospital.
To assess the severity of diagnosis we used chest X-ray images, parameters of respiratory function (oxygen saturation), and hematological parameters based on World Health Organization (WHO) COVID-19 disease severity: mild disease (symptomatic patients confirmed with SARS-CoV-2 infection but without signs of viral pneumonia or hypoxia), moderate disease (patients with clinical signs of pneumonia but on the chest X-ray no signs of severe pneumonia and SpO2 ≥ 90% on room air), severe disease (patients with clinical and radiological signs of severe pneumonia plus SpO2 < 90% on room air or respiratory rate >30 breaths/min) [20].
The correlation between chest X-ray findings and COVID-19 severity in our patient cohort is shown in Figure 2.

Chest X-ray Image Acquisition and Radiologist Report
Many recent studies established that opacity and lung involvement were important indicators of the future evolution of COVID-19 disease [21,22].
The Italian Society of Radiology (SIRM) strongly recommends using chest X-ray as a first-line imaging tool and reserving CT examination for more severe cases [23][24][25]. In our center, the CT scan is usually performed after the chest X-ray and only in specific situations. In our study, the chest X-ray images were collected from patients who tested positive for COVID-19. The X-ray images were labelled by radiologists with more than 10 years of experience. The severity score of lung illness was based on the degree of opacity and the lung involvement established from chest X-ray images of the COVID-19 patients.
The CXR severity score was adapted from Irmak [9] and Wong et al. [26], and it ranged from 0 to 14, obtained by summing the opacity score (0-6) and the involvement score (0-8). The total severity score summed the individual scores of both lungs.
Therefore, the chest X-ray images used for training our models were manually classified by the radiologists as normal (without modifications, with a total severity score of 0), mild (total severity score of 1-6), moderate (total severity score of 7-12), or severe (total severity score of >12).
In Figure 3, chest X-ray images with different severity scores can be observed.
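The grading rule described above can be written down as a minimal sketch. The function names `total_cxr_score` and `cxr_grade` are hypothetical and chosen only for illustration; the thresholds are the ones stated in the text (0 normal, 1-6 mild, 7-12 moderate, above 12 severe).

```python
def total_cxr_score(opacity: int, involvement: int) -> int:
    """Sum the opacity (0-6) and involvement (0-8) scores into a 0-14 total."""
    if not 0 <= opacity <= 6:
        raise ValueError("opacity score must be in 0..6")
    if not 0 <= involvement <= 8:
        raise ValueError("involvement score must be in 0..8")
    return opacity + involvement


def cxr_grade(total: int) -> str:
    """Map a total severity score (0-14) to the four grades used for labelling."""
    if not 0 <= total <= 14:
        raise ValueError("total severity score must be in 0..14")
    if total == 0:
        return "normal"
    if total <= 6:
        return "mild"
    if total <= 12:
        return "moderate"
    return "severe"


# Example: opacity 3 and involvement 5 give a total of 8, graded as moderate.
grade = cxr_grade(total_cxr_score(3, 5))
```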

Datasets
We used two datasets (CXR-Dataset and SCB-Dataset) for training and testing the proposed method.
From 475 patients, 380 patients were used for training and validation, while 95 control patients were used for testing.
The image dataset (CXR) contained the chest X-ray images of our patients' cohort (475); the images were labelled by a radiologist with four severity grades (normal, mild, moderate, severe). The initial chest X-ray image dataset was split into training (380) and testing (95) datasets. Only the training dataset was augmented to obtain the desired invariance and robustness of algorithms, using the following methods: brightness changes, contrast adjustment, and parallel shifting. The augmented training dataset (CXR-Dataset) contained 2092 images, which were used to learn (80% images) and validate (20% images) the algorithms. The testing dataset contained the images of those 95 control patients and was used only for testing the ability of DL models.
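The three augmentation methods named above can be illustrated with a small, dependency-free sketch. It is not the pipeline used in the study: the image is assumed to be a grayscale array stored as nested lists with values in 0..255, "parallel shifting" is interpreted as a translation, and the function names and parameter values are hypothetical.

```python
def _clip(value: float) -> int:
    """Round and clamp a pixel value to the 0..255 range."""
    return max(0, min(255, int(round(value))))


def adjust_brightness(img, delta):
    """Add a constant offset to every pixel."""
    return [[_clip(p + delta) for p in row] for row in img]


def adjust_contrast(img, factor):
    """Scale pixel values around mid-gray (128); factor > 1 increases contrast."""
    return [[_clip(128 + factor * (p - 128)) for p in row] for row in img]


def shift(img, dx, dy, fill=0):
    """Translate the image by (dx, dy) pixels, filling uncovered pixels."""
    height, width = len(img), len(img[0])
    out = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                out[ny][nx] = img[y][x]
    return out


# A 2 x 2 toy "image"; real chest X-rays would be much larger arrays.
image = [[0, 100], [200, 255]]
augmented = [adjust_brightness(image, 10), adjust_contrast(image, 1.2), shift(image, 1, 0)]
```

Applying such transformations only to the training split, as the text states, keeps the 95 test images untouched.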
The second dataset consisted of patients' symptoms and clinical and biological variables collected at patients' admission and labelled with the diagnosis (mild, moderate, or severe) established at patients' discharge. This initial dataset was split into two independent patients' datasets: training (380) and testing (95). In the training dataset, together with symptoms and clinical and biological variables, the radiologic lung degree of severity was taken into consideration, while in the testing dataset we used the deep learning CXR-Score. The SCB training dataset was used for learning and validation of the proposed ML methods, while the SCB testing dataset was used only for testing the methods. The transformations applied on the SCB Dataset were binarization and normalization transformations, while the problem of missing values was resolved by replacing them with the mean values.
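The missing-value handling and normalization mentioned above can be sketched as follows. The text does not state which normalization was applied, so min-max scaling is an assumption here, as are the function names and the use of `None` to mark a missing value.

```python
def impute_mean(rows):
    """Replace missing values (None) in each column with that column's mean.

    Assumes every column has at least one observed value.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    filled = [[means[j] if r[j] is None else r[j] for j in range(n_cols)] for r in rows]
    return filled, means


def min_max_normalize(rows):
    """Scale each column to [0, 1]; constant columns are mapped to 0.0."""
    n_cols = len(rows[0])
    lo = [min(r[j] for r in rows) for j in range(n_cols)]
    hi = [max(r[j] for r in rows) for j in range(n_cols)]
    return [
        [(r[j] - lo[j]) / (hi[j] - lo[j]) if hi[j] > lo[j] else 0.0 for j in range(n_cols)]
        for r in rows
    ]


# Toy rows: (oxygen saturation, some laboratory value); None marks a missing entry.
rows = [[90.0, 1.0], [None, 3.0], [100.0, None]]
filled, column_means = impute_mean(rows)
normalized = min_max_normalize(filled)
```

In practice, the column means and ranges would normally be estimated on the training split and reused unchanged on the testing split, so that no information leaks from the test patients.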
The datasets were constructed as in Figure 4.
Additionally, the distribution of patients between diagnoses can be observed in Table 1.

CXR-Score Module
This module was able to automatically classify the CXR images with four severity scores: mild, moderate, severe, and normal.
We proposed an ensemble that fused four pre-trained models (VGG, ResNet, DenseNet, and Inception) fine-tuned on our training dataset. The performance of classifying the severity of CXR images was improved using an ensemble method that reduced the variance by training four models instead of a single model and by combining the predictions of the models.
A generic architecture of a deep convolutional neural network can be observed in Figure 5. The convolutional layers were interleaved with pooling and batch normalization layers for the feature extraction task, whereas the fully connected (FC) layers were used for the classification task [18].

VGG Model
The VGG architecture was proposed by Simonyan et al. [27], having a depth between 16 and 19 layers and consisting of very small convolution filters. For our proposed VGG model with transfer learning, we used the 19-layer configuration (16 convolutional and 3 fully connected layers), with convolution filters of size 3 × 3.
The proposed VGG-19 model is described in Algorithm 1.

ResNet Model
We used the ResNet50 version, which has 50 layers, and each convolutional block has three convolutional layers. The proposed ResNet50 model is described in Algorithm 2.

Inception Model
The authors of [29] proposed GoogLeNet, consisting of 22 convolutional layers including nine Inception modules. An Inception module has three types of kernel filters (5 × 5, 3 × 3, and 1 × 1) for convolution and a 3 × 3 filter for pooling. GoogLeNet uses stochastic gradient descent (SGD) algorithms during the training stage to extract higher-level features. For our proposed Inception model, we used the version InceptionV3. The proposed InceptionV3 model is described in Algorithm 3.

DenseNet Model
DenseNet was proposed by Huang et al. and is an extension of the ResNet architecture [30]. A DenseNet is a type of convolutional neural network that utilises dense connections between layers, connecting all layers with matching feature-map sizes directly with each other. Each layer obtains additional inputs from all preceding layers and passes on its feature maps to all subsequent layers. The proposed DenseNet121 model is described in Algorithm 4.
To improve the accuracy of chest X-ray image classification, we developed an automated method based on the ensemble of the previously described deep convolutional neural networks.
The probabilities of the four trained models (VGG-19, ResNet50, InceptionV3, and DenseNet121) were averaged to generate new probabilities P(n) for the final diagnosis decision, as in Equation (1):

$$P(n) = \frac{1}{4}\sum_{m=1}^{4} P_m(n) \qquad (1)$$

where n = 1, ..., 4 indexes the diagnosis classes and P_m(n) is the probability assigned to class n by the m-th model. Combining predictions from multiple deep learning architectures could introduce a bias, but it also reduced the variance of the ensemble model [31].
The algorithm steps for the proposed ensemble are described in Algorithm 5.
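The averaging in Equation (1) corresponds to what is often called soft voting. The sketch below is an illustration under stated assumptions rather than the authors' Algorithm 5: each model is assumed to output one probability per class in the order given by the hypothetical `CLASSES` list, and the example probabilities are made up.

```python
CLASSES = ["normal", "mild", "moderate", "severe"]  # assumed class order


def average_probabilities(model_probs):
    """Average per-class probabilities over all models (Equation (1))."""
    n_models = len(model_probs)
    n_classes = len(model_probs[0])
    return [sum(p[c] for p in model_probs) / n_models for c in range(n_classes)]


def ensemble_predict(model_probs):
    """Return the class with the highest averaged probability and the averages."""
    avg = average_probabilities(model_probs)
    best = max(range(len(avg)), key=avg.__getitem__)
    return CLASSES[best], avg


# Made-up outputs of four models for one image (rows: models, columns: classes).
probs = [
    [0.1, 0.6, 0.2, 0.1],
    [0.0, 0.3, 0.6, 0.1],
    [0.1, 0.5, 0.3, 0.1],
    [0.0, 0.4, 0.5, 0.1],
]
label, averaged = ensemble_predict(probs)
```

Here two models favour "mild" and two favour "moderate"; averaging the probabilities resolves the tie in favour of the class with the larger mean.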

AI-Score Module
To integrate the symptoms, laboratory tests, and chest characteristics of X-ray scans, we developed the machine learning module (AI-score) using a super learner ensemble [16]. The method was capable of detecting and using the interactions among these numerous attributes for our small dataset.
The proposed super learner (AI-score) is an ensemble of machine learning algorithms with two levels. To construct a robust model, we combined in the first level three models that have different prediction methodologies: AdaBoost [32], Random Forests [33], and XGBoost [34]. In the second level, we used the CatBoost algorithm [35].

AdaBoost Model
The AdaBoost algorithm was first introduced by Freund and Schapire in 1996 and belongs to the family of boosting algorithms [32]. The algorithm sequentially grows decision trees as weak learners. It learns from previous mistakes by assigning a larger weight to incorrectly predicted samples after each round of prediction.
For parameter tuning, in our implementation, we set the maximum depth to 15 and the number of estimators to 200.
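The reweighting idea behind AdaBoost can be made concrete with one boosting round. This is a minimal sketch of the classical binary update of Freund and Schapire, not the multi-class variant that a three-class severity problem would need; the function name is hypothetical and the weights are assumed to sum to 1.

```python
import math


def adaboost_reweight(weights, correct):
    """One round of the binary AdaBoost sample-weight update.

    weights: current sample weights (assumed to sum to 1)
    correct: booleans, True where the weak learner predicted correctly
    """
    err = sum(w for w, ok in zip(weights, correct) if not ok)
    if not 0 < err < 0.5:
        raise ValueError("weighted error must be in (0, 0.5) for a useful weak learner")
    alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this weak learner
    updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(updated)
    return [w / total for w in updated], alpha


# Four samples with equal weight; the weak learner misclassifies the last one.
new_weights, alpha = adaboost_reweight([0.25] * 4, [True, True, True, False])
```

After the update, the single misclassified sample carries half of the total weight, so the next tree is pushed to classify it correctly.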

Random Forests Model
The random forests algorithm was developed by Breiman in 2001 and is based on the bagging method [33]. The data are bootstrapped by randomly choosing subsamples for each iteration of growing trees. In contrast to AdaBoost, the trees in random forests are grown in parallel. Overfitting is reduced by combining many weak learners that underfit because each uses only a subset of the training samples.
For parameter tuning, in our implementation, we set the maximum depth to 15 and the number of estimators to 200.

XGBoost Model
XGBoost (eXtreme Gradient Boosting) was introduced by Chen and Guestrin in 2016 and uses the concept of gradient tree boosting [34]. XGBoost has the advantages of increased speed and performance and reduced overfitting by introducing regularization parameters. The algorithm is based on gradient-boosted trees that use regression trees in a sequential learning process as weak learners.
For parameter tuning, we set the maximum depth to 15, the number of estimators to 200, and the learning rate to 0.2.

CatBoost Model
CatBoost is an open-source machine learning algorithm from Yandex [35]. The algorithm is based on gradient boosting and returns very good results with relatively little data, whereas deep learning models require an immense amount of data for learning. The algorithm does not need extensive hyper-parameter tuning and lowers the chances of overfitting, creating more generalized and robust models.
In our implementation, we set the following hyper-parameters: 15 iterations, a depth of 3, and a learning rate of 0.1.
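For reference, the hyper-parameter values reported in the four model subsections can be collected in one place. The dictionary below is only a summary: the key names are labels chosen for illustration and are not guaranteed to match the argument names of any particular library.

```python
# Hyper-parameter values as reported in the text; key names are illustrative labels.
HYPERPARAMETERS = {
    "adaboost": {"max_depth": 15, "n_estimators": 200},
    "random_forest": {"max_depth": 15, "n_estimators": 200},
    "xgboost": {"max_depth": 15, "n_estimators": 200, "learning_rate": 0.2},
    "catboost": {"iterations": 15, "depth": 3, "learning_rate": 0.1},
}

# The three first-level models share the same tree depth and number of estimators.
base_models = ["adaboost", "random_forest", "xgboost"]
shared_depth = {HYPERPARAMETERS[name]["max_depth"] for name in base_models}
```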

AI-Score Ensemble Model
The AI-score model is an application of stacked generalization with two levels.

• In the first level, the base machine learning models (AdaBoost, Random Forests, XGBoost) used the same 5-fold splits of the training data.
• In the second level, a meta-model (CatBoost) was fit on the out-of-fold predictions from each model of the previous level.
In Algorithm 6, we describe the proposed super learner ensemble AI-score.
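The first level of the stacking procedure, building out-of-fold predictions on shared folds, can be sketched without any external library. This is an illustration only, not Algorithm 6: the two toy classifiers below are hypothetical stand-ins for AdaBoost, Random Forests, and XGBoost, and the sketch stops at the meta-feature matrix on which a meta-model such as CatBoost would then be fitted.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices once and split them into k disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def out_of_fold_predictions(X, y, make_models, k=5, seed=0):
    """Return, per sample, the concatenated held-out probabilities of all base models."""
    folds = k_fold_indices(len(X), k, seed)  # the same splits are reused for every model
    meta = [[] for _ in X]
    for make_model in make_models:
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(X)) if i not in held]
            model = make_model()
            model.fit([X[i] for i in train], [y[i] for i in train])
            for i in held_out:  # predict only samples the model has not seen
                meta[i].extend(model.predict_proba(X[i]))
    return meta


class PriorClassifier:
    """Toy base learner: always predicts the training class frequencies."""

    def __init__(self, n_classes=3):
        self.n_classes = n_classes

    def fit(self, X, y):
        self.probs = [y.count(c) / len(y) for c in range(self.n_classes)]

    def predict_proba(self, x):
        return list(self.probs)


class OneNearestNeighbour:
    """Toy base learner: one-hot probability for the label of the closest sample."""

    def __init__(self, n_classes=3):
        self.n_classes = n_classes

    def fit(self, X, y):
        self.X, self.y = X, y

    def predict_proba(self, x):
        def dist(i):
            return sum((a - b) ** 2 for a, b in zip(self.X[i], x))

        nearest = min(range(len(self.X)), key=dist)
        return [1.0 if c == self.y[nearest] else 0.0 for c in range(self.n_classes)]


# Toy data: 15 one-dimensional samples in three classes (0 = mild, 1 = moderate, 2 = severe).
X = [[float(i)] for i in range(15)]
y = [i // 5 for i in range(15)]
meta_features = out_of_fold_predictions(X, y, [PriorClassifier, OneNearestNeighbour], k=5)
```

Because every sample is predicted by models that never saw it during fitting, the meta-model is trained on predictions that resemble those it will receive for new patients.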

Software and Statistical Tools
For selecting the symptoms and clinical and biological variables relevant for diagnosis severity, we relied on existing studies that aimed to identify the importance of various variables in predicting COVID-19 disease severity. All statistical calculations were performed using the software STATA (StataCorp LLC, USA). We used two approaches. One was to apply the chi-squared test and calculate the risk ratio (RR) as effect size (for discrete variables), or Student's t test and Cohen's d as effect size (for continuous variables), to measure association with the variable severe disease. A second approach was to use logistic regression with backward elimination based upon the likelihood ratio, which retained several parameters in the model.
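The two effect sizes named above have simple closed forms. The sketch below uses the standard textbook definitions (risk ratio from a 2 × 2 table, Cohen's d with a pooled standard deviation); it is not the STATA procedure used in the study, and the numbers are made up for illustration.

```python
import math
from statistics import mean, variance


def risk_ratio(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Risk of the outcome in the exposed group divided by the risk in the unexposed group."""
    return (exposed_cases / exposed_total) / (unexposed_cases / unexposed_total)


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Made-up example: 30 of 100 exposed vs 10 of 100 unexposed patients develop severe disease.
rr = risk_ratio(30, 100, 10, 100)
# Made-up continuous variable measured in two small groups.
d = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
```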
The performance metrics used to evaluate the proposed methods were accuracy, sensitivity, specificity, positive predictive value, and negative predictive value.
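For a multi-class problem, these metrics are commonly computed per class in a one-vs-rest fashion and then averaged. The text reports "average" values without stating the averaging scheme, so the unweighted (macro) average below is an assumption, as are the function names and the toy labels.

```python
def one_vs_rest_metrics(y_true, y_pred, positive):
    """Compute the five metrics for one class treated as positive against all others."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def macro_average(y_true, y_pred, classes):
    """Unweighted mean of each metric over all classes."""
    per_class = [one_vs_rest_metrics(y_true, y_pred, c) for c in classes]
    return {key: sum(m[key] for m in per_class) / len(per_class) for key in per_class[0]}


y_true = ["mild", "mild", "moderate", "severe", "severe", "severe"]
y_pred = ["mild", "moderate", "moderate", "severe", "severe", "mild"]
severe = one_vs_rest_metrics(y_true, y_pred, "severe")
averaged_metrics = macro_average(y_true, y_pred, ["mild", "moderate", "severe"])
```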
The area under the receiver operating characteristic curve (AUC-ROC) and Precision/Recall curve were also taken into consideration to evaluate the ability of the method of classification. Additionally, we calculated the optimal thresholds for the ROC curves by seeking the maximum of Geometric mean (G-Mean) scores. For the Precision-Recall curves, we calculated the optimal threshold by seeking the maximum of F-measure (F-Score).
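The two threshold-selection rules can be sketched for a single class in a one-vs-rest setting. The code assumes binary labels (1 for the class of interest, 0 otherwise) with both values present, scans every observed score as a candidate threshold, and uses hypothetical function names; it illustrates the criteria rather than the plotting code used in the study.

```python
import math


def confusion_at(scores, labels, threshold):
    """Confusion counts when scores >= threshold are predicted positive."""
    tp = sum(s >= threshold and l == 1 for s, l in zip(scores, labels))
    fp = sum(s >= threshold and l == 0 for s, l in zip(scores, labels))
    fn = sum(s < threshold and l == 1 for s, l in zip(scores, labels))
    tn = sum(s < threshold and l == 0 for s, l in zip(scores, labels))
    return tp, fp, fn, tn


def best_gmean_threshold(scores, labels):
    """Threshold maximizing G-mean = sqrt(sensitivity * specificity) on the ROC curve."""
    best = None
    for t in sorted(set(scores)):
        tp, fp, fn, tn = confusion_at(scores, labels, t)
        g = math.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))
        if best is None or g > best[1]:
            best = (t, g)
    return best


def best_f1_threshold(scores, labels):
    """Threshold maximizing the F-measure on the Precision/Recall curve."""
    best = None
    for t in sorted(set(scores)):
        tp, fp, fn, tn = confusion_at(scores, labels, t)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        if best is None or f1 > best[1]:
            best = (t, f1)
    return best


# Made-up scores for four samples; labels mark membership in the class of interest.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
g_threshold, g_value = best_gmean_threshold(scores, labels)
f_threshold, f_value = best_f1_threshold(scores, labels)
```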
We used the TensorFlow and Keras frameworks [36] to implement the deep learning models and the scikit-learn package [37] to implement the machine learning models and to calculate the performance metrics, while matplotlib [38] was used to plot the ROC and Precision/Recall curves.

Selection of Patient Variables through Statistics
Severe COVID-19 disease was significantly associated with CXR severity, sex, parameters of respiratory function (oxygen saturation, respiratory rate), cardiovascular function (systolic and diastolic pressure, cardiac frequency), associated disease (diabetes, cardiac disease, kidney disease, hypertension, autoimmune thyroid, and obesity), and presence of symptoms (coughing, headache, shortness of breath, vertigo, palpitations, abdominal pain, myalgia, and inappetence). Among the hematological parameters, severe disease was associated with the decrease of white blood cells (WBC), lymphocytes (LYM)%, monocytes (MON)%, eosinophils (EOS)%, and basophils (BAS)%. Regarding biochemical parameters, severe disease was significantly associated with ALT and glucose decrease. The inflammatory markers (fibrinogen, CRP, and ESR) were also increased in severe disease. There was significant coagulation in the periphery, as shown by the increase in D-dimers (products of fibrin degradation). Therefore, from 46 initial variables, only 34 variables were found to be strongly associated with COVID-19 diagnosis severity, as in Table 2. Additionally, the regression analysis revealed a strong association between oxygen saturation and COVID-19 severity, as in Table 3. Other associations were discovered between the risk of severe COVID-19 disease and various risk factors such as female sex, diabetes, and obesity (very strong association, Odds Ratio = 166). Severe disease was also associated with an increase in white blood cell numbers.

Interpretability of CXR-Score Module
Each DL model (ResNet50, VGG-19, Inceptionv3, and DenseNet121) was trained on the CXR training dataset and evaluated on the CXR testing dataset. The CXR-Score ensemble model was also evaluated on the CXR testing dataset.
As for performance measures, accuracy, sensitivity, specificity, and positive and negative predictive values were computed, and the quantitative results are summarized in Table 4. By comparison, the best results were obtained using the CXR-Score, with an average accuracy of 99.08%, followed by the DenseNet model with an average accuracy of 99.02%. The other three models also performed very well for all diagnosis classes. The ResNet model perfectly classified the normal diagnosis class.
Additionally, each individual DL model and the CXR-Score ensemble were evaluated according to the overall scores, computed as the average of the area under the receiver operating characteristic curve (AUC) and the average of G-mean scores, as in Figure 6. CXR-Score and Inception v3 recorded the best G-mean scores and AUCs on the CXR testing dataset.
In Figure 7, the Precision/Recall curves together with average precision were evaluated for each model. The CXR-Score ensemble recorded an average precision of 0.99, with the greatest F-measure (F-score) of 0.96 on the CXR testing dataset.

Interpretability of AI-Score Module for Predicting the Final Diagnosis Severity
Each ML model (Random Forests, Ada Boost, XGBoost) was trained on the SCB training dataset and evaluated on the SCB testing dataset. The AI-Score ensemble model was evaluated on the SCB testing dataset. The CXR-Score module was used to predict the chest X-ray severity score for each patient in the SCB testing dataset.
The AI-Score ensemble model recorded the best results for all diagnosis classes, with an average accuracy of 98.59%, average specificity of 98.97%, and average sensitivity of 97.93%. By comparison, the XGBoost algorithm also recorded excellent results, with an average accuracy of 97.89%, average specificity of 98.46%, and average sensitivity of 95.77%. The random forests algorithm obtained an average accuracy of 96.48%, while AdaBoost obtained inferior results for classifying the severe and moderate diagnosis classes, with an average accuracy of 93.67%. All metrics are summarized in Table 5.
Additionally, each ML model and the AI-Score ensemble were evaluated according to overall scores, computed as the average area under the receiver operating characteristic curve (AUC), as shown in Figure 8. The AI-Score recorded an AUC of 1 and a maximum G-mean of 0.98 on the SCB testing dataset.
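For readers unfamiliar with these two scores, the sketch below shows one way to compute them for a single class treated as positive against the rest. The AUC is computed as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, and the maximum G-mean is taken here to be the largest value of sqrt(sensitivity × specificity) over all decision thresholds. That reading of "maximum G-mean" is an assumption, and the function names and toy scores are hypothetical.

```python
from math import sqrt


def auc_pairwise(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    labels must contain 1 for the positive class and 0 otherwise.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def max_g_mean(scores, labels):
    """Largest G-mean over thresholds; a sample is positive if score >= t.

    Returns (best G-mean, threshold at which it is first reached).
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    best_g, best_t = 0.0, None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        g = sqrt((tp / n_pos) * (tn / n_neg))
        if g > best_g:
            best_g, best_t = g, t
    return best_g, best_t


# Made-up scores for one class versus the rest (not the study's data).
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_auc = auc_pairwise(toy_scores, toy_labels)
toy_g, toy_threshold = max_g_mean(toy_scores, toy_labels)
```

An AUC of 1 means that every positive sample is scored above every negative one. The G-mean is then evaluated at a single operating threshold, so the two scores answer slightly different questions about the same classifier.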

Discussion
Many previous studies focused on diagnosing SARS-CoV-2 infection based on chest CT, X-ray scans, symptoms, or blood tests, whereas few studies aimed to predict future COVID-19 severity [10].
In the present study, we developed an artificial intelligence-based method for grading COVID-19 severity based on multimodal features collected at admission.
We found 34 symptoms and clinical and biological variables that are strongly related to future COVID-19 severity, a finding also confirmed by other studies [39][40][41]. Exploring the clinical features of our patient dataset, we found that male and older patients were at risk of developing severe disease. Other clinical variables strongly related to COVID-19 severity were parameters of respiratory function (oxygen saturation, respiratory rate) and cardiovascular function (systolic and diastolic pressure, heart rate). In our study cohort, comorbidities such as diabetes, cardiac disease, kidney disease, hypertension, autoimmune thyroid disease, and obesity were related to severity, as also reported in [42,43]. We also found associations between several symptoms (coughing, headache, shortness of breath, vertigo, palpitations, abdominal pain, myalgia, and inappetence) and COVID-19 severity [44,45].
The hematological parameters WBC, LYM%, EOS%, and BAS% increased in severe disease, while MON% and NEU% decreased. PLT (thrombocytes) was not associated with severity in our study cohort. We also found relationships between D-dimers, fibrinogen, ESR, and severity.
Together with these clinical and biological variables, the chest X-ray images offered important features of disease severity. The lung characteristics that we included were opacity and lung involvement [9,26]. The advantages of the deep learning classification of chest X-ray images into normal, mild, moderate, and severe were its speed and its classification accuracy, which were even higher than those of a radiologist [10]. Our CXR-Score module identified the severe and mild patterns in CXR images with an accuracy of 99.54% and the moderate pattern with an accuracy of 98.4%. The normal CXR images were classified with an accuracy of 98.85%.
The final prognosis of diagnosis severity was established using the AI-Score module, in which we integrated 34 variables: symptoms, clinical and biological variables, and the CXR-Score. The AI-Score module classified the severe diagnosis with an accuracy of 98.94% and a sensitivity of 97.14%, while the moderate diagnosis was classified with an accuracy of 98.94% and a sensitivity of 96.66%. For the mild diagnosis, the obtained accuracy was 97.89% with a sensitivity of 100%.
Our results suggest that the proposed AI-Score method can become a useful tool in forecasting the future severity of diagnosis at a patient's admission. Ultimately, this improved diagnosis can be used for assessing the efficacy of vaccination or to evaluate new emerging treatments for COVID-19 [46].
Although the obtained results were satisfactory, our study had some limitations. First, the machine learning and deep learning methods still need human intervention in the processes of image and data collection.
Second, the dataset was known in advance, which carries a risk of overfitting even though the training and testing datasets were patient-independent. Future research will evaluate and refine the proposed method on a larger patient database collected from other hospitals.
Third, to improve the resolution quality of CXR images, the wavelet denoising technique and super-resolution deep learning methods will be taken into consideration [47]. Although these techniques may improve the results, they are computationally expensive for real-time computer-assisted methods, so the trade-off between their complexity and their real-time performance will be considered in future work.
Fourth, the severity grading of CXR images could be improved by segmentation to detect lung involvement more precisely [48].

Conclusions
The novel COVID-19 has become one of the most acute and severe health problems of the past century. Artificial intelligence-based methods already play an important role in combating the pandemic's terrible effects.
In this study, we proposed an artificial intelligence-based AI-Score solution, which provides fast and powerful assistance to physicians. We integrated ensemble ML and DL algorithms for forecasting the evolution of COVID-19 diagnosis severity based on 34 input variables. AI-Score consists of an ensemble of machine learning algorithms developed to predict the future severity score as mild, moderate, or severe for COVID-19-infected patients. Additionally, a deep learning module (CXR-Score) was developed to automatically classify the chest X-ray images and integrate them into AI-Score.
Our method achieved good accuracy on retrospective chest X-ray images, symptoms, and clinical and biological blood tests. Based on these promising preliminary results, and pending further testing on a larger patient cohort, our AI-based method could become an important tool for the computer-aided diagnosis of COVID-19 severity in early stages.