Regression Models to Study the Total LOS Related to Valvuloplasty

Background: Valvular heart diseases are diseases that affect the valves by altering the normal circulation of blood within the heart. In recent years, the use of valvuloplasty has become more frequent due to the increase in calcific valve disease, which usually occurs in the elderly, and mitral valve regurgitation. For this reason, it is critical to be able to best manage the patient undergoing this surgery. To accomplish this, the length of stay (LOS) is used as a quality indicator. Methods: A multiple linear regression (MLR) model and four other regression algorithms were used to study the total LOS as a function of a set of independent variables related to the clinical and demographic characteristics of patients. The study was conducted at the University Hospital “San Giovanni di Dio e Ruggi d’Aragona” of Salerno (Italy) in the years 2010–2020. Results: Overall, the MLR model proved to be the best, with an R² value of 0.720. Among the independent variables, age, pre-operative LOS, congestive heart failure, and peripheral vascular disease were those that mainly influenced the output value. Conclusions: LOS proves, once again, to be a strategic indicator for hospital resource management, and simple linear regression models have shown excellent results in analyzing it.


Introduction
The present research paper is an extension of a previous paper that the same authors presented at a conference [1]. In fact, the dataset considered is much larger, both in terms of number of records and the variables considered. Moreover, in order to further improve the regression model, a comparison with other algorithms was made.
Valvular heart diseases are diseases that affect the valves by altering the normal circulation of blood within the heart, with repercussions on the general health of the subject.
Knowledge of the natural history of the most common valvular heart diseases is important because the onset of symptoms often is the point at which intervention becomes necessary. Most valvular heart diseases are amenable to surgical intervention, which can afford a symptom-free and relatively normal life span [2].
The prevalence of valvular disease increases sharply with age, owing to the predominance of degenerative etiologies. The burden of heart valve disease in the elderly has an important impact on patient management, given the high frequency of comorbidity and the increased risk associated with intervention in this age group [3].
For each subject, it is fundamental to evaluate the severity of valvular disease, given that the risk in surgery is proportional to the degree of valvular disease; specifically, in the elderly, pre-operative evaluation and preparation are especially important for a successful outcome of any type of surgery [4].
A prospective survey of patients with valvular heart disease in Europe showed that of patients with severe, symptomatic, single VHD, 31.8% did not undergo intervention, most frequently because of comorbidities [5].
The etiology, approach to treatment, and expected outcomes of VHD are different in the elderly compared with younger patients. Both stenotic and regurgitant lesions are associated with unfavorable outcomes if left untreated. Surgical mortality remains high due to multiple co-morbidities, and the long-term survival benefit is dependent on many variables, including valvular pathology. Quality of life is an important consideration in treatment decisions in this age group. Increasingly, octogenarian patients are receiving transcatheter therapies, with transcatheter aortic valve replacement having the greatest momentum [6].
When surgery is not possible, or when the risks outweigh the benefits, percutaneous treatment options may offer effective alternatives. However, procedures may not always go as planned, and frail patients or those whose symptoms are caused by other comorbidities may not benefit from valve intervention at all. Significant effort should be made to assess frailty, comorbidities, and patient goals prior to intervention [7].
In the current guidelines of the European Society for Cardiology, published in 2021 [8], surgical treatment remains the standard of care for most forms of severe valvular heart disease; however, the presence of chronic kidney disease impairs clinical outcomes and is associated with higher mortality rates when compared to patients with preserved renal function [9].
These latter valvular abnormalities are likely to increase further as the average age of the population increases [10]. For this reason, it is critical to be able to best manage the patient undergoing this surgery.
The length of hospital stay (LOS) is considered an excellent indicator of quality in care processes [11,12]. In fact, many studies have focused on how to reduce patients' LOS by optimizing care processes. For example, Scala et al. and Improta et al. demonstrated how the introduction of a diagnostic therapeutic pathway for femur fracture and diabetic patient management, respectively, reduced LOS with consequent benefits for both patients and hospitals [13,14]. Biomedical data analytics is key to improving processes, reducing costs, and giving clinicians new tools to manage all different patients. There are many approaches used in the literature for data analysis, including lean six sigma [15][16][17][18][19], health technology assessment [20][21][22][23], machine learning algorithms [24][25][26], and mathematical modelling [27][28][29]. The latter approach was chosen for this study. In the literature, there are several applications: Tesfahun et al., in order to optimize medical waste management processes, developed a model capable of predicting the production rate of this waste [ [32]; and Kadam et al. employed both artificial neural networks and a multiple linear regression to predict the potability of water in an Indian river [33].
Therefore, the aim of this work is to use mathematical modelling and, in particular, several regression algorithms to obtain a model that can help clinicians in the assessment of the LOS of patients undergoing mitral valve repair surgery. As already mentioned, the present work aims to be an extension and improvement of the previous one presented at a conference [1]. In fact, the final model is more complex since the individual comorbidities were considered and, therefore, allows the clinician to take into account more aspects that characterize the particular patient.

Method
The research was carried out at the Complex Operative Unit (C.O.U.) of the Cardiology unit at the University Hospital "San Giovanni di Dio e Ruggi d'Aragona" of Salerno (Italy).
The dataset was obtained from the hospital's information system, QuaniSDO, and included all patients who underwent open-heart mitral valve repair surgery without replacement from 2010 to 2020. It comprises 379 records and contains the following information: Date of admission, discharge, and procedure.
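As a rough illustration of how stay lengths can be derived from the three dates listed above, the following Python sketch computes a total and a pre-operative LOS for a single record. It is not the authors' code: the function name `stay_lengths`, the dictionary keys, and the example dates are hypothetical, and the sketch assumes that pre-operative LOS means the number of days from admission to the procedure.

```python
from datetime import date


def stay_lengths(admission: date, procedure: date, discharge: date) -> dict:
    """Return total and pre-operative LOS (in days) for one record.

    Assumption: pre-operative LOS is counted from admission to procedure.
    """
    if not (admission <= procedure <= discharge):
        raise ValueError("expected admission <= procedure <= discharge")
    return {
        "total_los": (discharge - admission).days,
        "preop_los": (procedure - admission).days,
    }


# Hypothetical record: admitted 2 March, operated 5 March, discharged 14 March.
record = stay_lengths(date(2015, 3, 2), date(2015, 3, 5), date(2015, 3, 14))
print(record)
```

Rejecting records whose dates are out of order is a design choice made here for illustration; a real extraction from discharge forms might instead flag such records for manual review.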
From this information, the variables used in the multiple linear regression (MLR) were derived. In particular, the dependent variable (i.e., the output of the model), the total LOS, was obtained as the difference between the date of discharge and the date of admission; the independent variable pre-operative LOS was obtained as the difference between the date of the procedure and the date of admission. By analyzing the procedures, it was possible to determine the number of cardiac procedures carried out in addition to mitral valve repair surgery. For example, interventions for bypass, pacemaker implantation, cardioversions, or other interventions on valves were considered. The other independent variables were gender, age, acute myocardial infarction (AMI), congestive heart failure (CHF), cerebrovascular disease (CeVD), peripheral vascular disease (PVD), chronic obstructive pulmonary disease (COPD), diabetes, renal disease (RD), and the number of procedures (two, three, or four procedures). Table 1 shows the distribution of the features in the sample.
[34] was used to build an MLR model to predict the total LOS. Before its implementation, the following six conditions must be verified:

1. Linear relationship between the independent and dependent variables;
2. Absence of multicollinearity;
3. Independence of the residuals;
4. Constant variance of the residuals;
5. Normal distribution of the residuals;
6. Absence of outliers.
With MatLab version R2020a, other regression algorithms (linear support vector machine, LSVM; narrow neural network, NNN; rational quadratic Gaussian process regression, GPR; and random forest, RF) were implemented. In particular, SVM can also be used as a regression method: it keeps the main idea of the classifier, minimizing the error by identifying a hyperplane in an N-dimensional space (where N depends on the number of variables), and adds a margin of tolerance that is not part of the classification process. Neural network models can be considered valid alternatives to classical regression models. In fact, they have the property of learning from a set of data without the need for a complete specification of the decision model. They automatically provide all necessary data transformations and are able to see through noise and distortion. Gaussian processes (GP) are a supervised learning method used for regression and probabilistic classification problems. They are versatile: different kernels can be specified, and the prediction is probabilistic (Gaussian). Lastly, RF is a supervised learning algorithm in which multiple decision trees are combined to make a more accurate prediction. The model is powerful and accurate, but overfitting can easily occur.
Before performing the analyses, the dataset was divided into a training set (80%) and a test set (20%). The R² parameter was used to evaluate the accuracy of the model.
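The 80/20 split and the R² evaluation can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the MatLab workflow used in the study: the function names `train_test_split` and `r_squared`, the fixed random seed, and the rounding of the test-set size are all hypothetical choices.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle rows reproducibly and return (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot


# Split 379 record indices, matching the size of the dataset described above.
train, test = train_test_split(range(379))
print(len(train), len(test))
```

Computing R² on the held-out test set, rather than on the training data, gives a less optimistic estimate of how a model would perform on new patients.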

Results
Before implementing the MLR model, the six hypotheses were tested.

The Linear Relationship between the Independent and Dependent Variable
To verify this assumption, partial scatter plots were created to examine the trend of the dependent variable LOS as a function of the selected independent variables. Figure 1 shows the plot obtained for the pre-operative LOS. Consistent with the definition of total LOS, a linear relationship between the variables was found. The limitation of this type of representation is that the combined effect of several independent variables is not considered.

Absence of Multicollinearity
The absence of multicollinearity was assessed through Pearson's correlation, tolerance, and the variance inflation factor (VIF). All of these measures are functions of the correlation between the i-th independent variable and the others. Table 2 shows the results of the Pearson correlation and the statistical significance. Table 3, instead, shows the values of VIF and tolerance that were obtained for each independent variable.
With the exception of the pre-operative LOS, the Pearson correlation value was always less than 0.7. In addition, the VIF values were always less than 10 and the tolerance values were always greater than 0.2, so the absence of multicollinearity was verified.
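To make the relationship between correlation, tolerance, and VIF concrete, the sketch below uses the standard identity that the VIF of the i-th predictor is the i-th diagonal element of the inverse of the predictors' correlation matrix, with tolerance = 1/VIF. It is a pure-Python illustration, not the procedure used by the authors; the helper names and the toy data are hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def invert(matrix):
    """Gauss-Jordan inverse (with partial pivoting) of a small square matrix."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif_and_tolerance(columns):
    """Return (VIF list, tolerance list), one entry per predictor column."""
    corr = [[pearson(a, b) for b in columns] for a in columns]
    inverse = invert(corr)
    vif = [inverse[i][i] for i in range(len(columns))]
    return vif, [1.0 / v for v in vif]


# Two toy predictors with Pearson correlation 0.8.
vif, tol = vif_and_tolerance([[1, 2, 3, 4], [1, 3, 2, 4]])
print(vif, tol)
```

With only two predictors the result reduces to VIF = 1 / (1 - r²), which is why a pairwise correlation of 0.8 already yields a VIF of roughly 2.8 and a tolerance of 0.36.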

The Independence of the Residuals
The Durbin-Watson statistical test was used to test this hypothesis. The statistic always lies between 0 and 4, where a value of 2 indicates that no autocorrelation is detected in the sample. In this case, the result was equal to 1.517 and, therefore, was within the acceptability range of (1.5; 2.5).
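For readers unfamiliar with the statistic, a minimal Python sketch of the Durbin-Watson computation follows. The function names and the toy residual sequences are hypothetical; only the formula (sum of squared successive differences divided by the sum of squared residuals) and the acceptability range (1.5; 2.5) come from the standard definition and the text above.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals taken in observation order."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


def within_range(dw, low=1.5, high=2.5):
    """True if the statistic falls strictly inside the acceptability range."""
    return low < dw < high


# Alternating residuals (negative autocorrelation) push the value above 2;
# runs of same-sign residuals (positive autocorrelation) push it below 2.
print(durbin_watson([1, -1, 1, -1]))
print(durbin_watson([1, 1, -1, -1]))
print(within_range(1.517))
```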

The Residuals Have Constant Variance
To evaluate the variance of the residuals, a scatter plot of the regression standardized predicted values (x-axis) against the regression standardized residuals (y-axis) was created. Figure 2 shows the obtained result.
The scatter plot (Figure 2) shows that the data are randomly distributed around zero. The homoscedasticity hypothesis is therefore not violated and can be considered verified.

Int. J. Environ. Res. Public Health 2022, 19, x 9 of 15

The Residuals Are Normally Distributed
The P-P plot (Figure 3) shows how well the available data set fits the specified probability distribution. With this tool, the empirical cumulative distribution of the data is compared with the assumed theoretical cumulative distribution function. Although the curve did not exactly retrace the ideal line, the slight variation did not affect the good performance of the model.
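The points of a normal P-P plot can be computed as in the following sketch, which pairs the theoretical cumulative probability of each sorted residual with its empirical cumulative probability. Two choices here are assumptions rather than details taken from the study: the normal distribution is fitted with the sample mean and standard deviation, and the empirical probabilities use the plotting position (i + 0.5)/n. The function name `pp_points` and the toy residuals are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def pp_points(residuals):
    """Return (theoretical, empirical) cumulative probability pairs."""
    n = len(residuals)
    fitted = NormalDist(mean(residuals), stdev(residuals))
    ordered = sorted(residuals)
    # Assumed plotting position: (i + 0.5) / n for the i-th sorted value.
    empirical = [(i + 0.5) / n for i in range(n)]
    theoretical = [fitted.cdf(e) for e in ordered]
    return list(zip(theoretical, empirical))


points = pp_points([-2, -1, 0, 1, 2])
# Largest vertical gap between the points and the ideal diagonal.
max_gap = max(abs(t - e) for t, e in points)
print(points, max_gap)
```

Points lying close to the diagonal, that is, a small `max_gap`, correspond to residuals that follow the assumed distribution closely.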

Presence of Outliers
The last hypothesis to be verified was the absence of outliers that affect the estimate of the parameters βi. To accomplish this, Cook's distance was calculated for each observation. Figure 4 shows the obtained result. For each observation, the Cook's distance was less than 1. Therefore, there were no outliers that caused bias.
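The Cook's distance criterion can be illustrated with a deliberately simplified sketch. To stay self-contained, it fits a regression with a single predictor (intercept plus slope), whereas the study's model has many predictors; the function name `cooks_distances` and the toy data, in which the last point is made influential on purpose, are hypothetical.

```python
def cooks_distances(x, y):
    """Cook's distance per observation for a one-predictor least-squares fit."""
    n = len(x)
    p = 2  # estimated parameters: intercept and slope
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e * e for e in residuals) / (n - p)
    # Leverage of each point in simple linear regression.
    leverage = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    return [
        e * e / (p * mse) * h / (1 - h) ** 2
        for e, h in zip(residuals, leverage)
    ]


# Eight points on the line y = x + 1, plus one distant point far off the line.
x = [1, 2, 3, 4, 5, 6, 7, 8, 20]
y = [2, 3, 4, 5, 6, 7, 8, 9, 5]
distances = cooks_distances(x, y)
flagged = [i for i, d in enumerate(distances) if d >= 1]
print(flagged)
```

In this toy example only the distant point exceeds the threshold of 1, which is the situation the check described above is meant to rule out.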
After this verification phase, the MLR model was implemented. Table 4 shows the goodness of fit of the model. The R² value was greater than the set threshold value of 0.5. The model was well suited to the problem under consideration and could be a valid preliminary tool. Table 5 shows the model coefficients and the t-test result at a significance level of 0.05. The test showed that, of the selected independent variables, age, pre-operative LOS, CHF, and PVD significantly influenced LOS. All of these coefficients were positive, and the highest was the one associated with PVD.
After analyzing the results of the MLR model, further regression algorithms were implemented (Table 6). Among these, the best was the rational quadratic GPR, but its R² value was still lower than that obtained with the MLR model. The diagrams of the predictions made, with the relative errors for each algorithm, are shown below (Figures 5-8).

Discussion
In this study, the data provided by the C.O.U. of the Cardiology unit at the University Hospital "San Giovanni di Dio e Ruggi d'Aragona" of Salerno (Italy) were analyzed. Specifically, the information was related to the flow of patients who underwent open-heart mitral valve repair surgery without replacement from 2010 to 2020, for a total of 379 records. Starting from a limited set of information extracted from hospital discharge forms, a group of independent variables was obtained (gender, age, pre-operative LOS, acute myocardial infarction (AMI), congestive heart failure (CHF), cerebrovascular disease (CeVD), peripheral vascular disease (PVD), chronic obstructive pulmonary disease (COPD), diabetes, renal disease (RD), 2 procedures, 3 procedures, and 4 procedures) and used to predict the total LOS. As in the previous study [1], an MLR model was implemented. The obtained MLR model had an R² value equal to 0.720, and among the variables, those that most influenced the LOS were age, pre-operative LOS, CHF, and PVD. Compared to the result obtained in the short paper, where the model was built using 70 of the 379 records used here, the goodness of fit is slightly lower (the earlier model reached R² = 0.864); that model, however, did not show any significant influence, with the exception of the pre-operative LOS, which is linked to the LOS by definition. Undergoing multiple heart surgeries was not significantly correlated with LOS. In this case, the greater number of records made it possible to identify the classes of patients for which greater organizational effort is required. In addition to the MLR model, further regression algorithms were tested (linear support vector machine, LSVM; narrow neural network, NNN; rational quadratic Gaussian process regression, GPR; and random forest, RF). Of these, the best was rational quadratic GPR, with a value of R² = 0.690. Its performance, however, was lower than that of the MLR model, which ultimately remains the best model.
The main limitation of this work is that it does not consider the impact on LOS of specific cardiac procedures with the same complexity as the one under examination, such as coronary revascularization and tricuspid annuloplasty.
Future developments will include overcoming this limitation and validating the models, both by updating the dataset with the data for the year 2021 and by analyzing data from different populations. In addition, further regression and classification models may be implemented.

Conclusions
In this work, the dataset consisting of 379 patients who underwent open-heart mitral valve repair surgery without replacement from 2010 to 2020 at the C.O.U. of the Cardiology unit at the University Hospital "San Giovanni di Dio e Ruggi d'Aragona" of Salerno (Italy) was analyzed through five different regression models/algorithms: an MLR model, a linear support vector machine, a narrow neural network, a rational quadratic Gaussian process regression, and a random forest. Among these, the best was the MLR model, with R² = 0.720. Finally, the statistical analysis showed that the variables that significantly affected the total LOS were age, pre-operative LOS, congestive heart failure, and peripheral vascular disease.

Data Availability Statement:
The datasets generated and/or analyzed during the current study are not publicly available for privacy reasons but are available from the corresponding author on reasonable request.

Conflicts of Interest:
The authors declare that they have no competing interest.