Personalized Data Analysis Approach for Assessing Necessary Hospital Bed-Days Built on Condition Space and Hierarchical Predictor

Abstract: The paper describes the medical data personalization problem by determining the individual characteristics needed to predict the number of days a patient spends in a hospital. The mathematical problem of patient information analysis is formalized, which will help identify critical personal characteristics based on condition space analysis. The condition space is given in cube form as a reflection of the functional relationship of the general parameters to the studied object. The dataset consists of 51 instances, and ten parameters are processed using different clustering and regression models. Days in hospital is the target variable. A condition space cube is formed based on clustering analysis and feature selection. In this manner, a hierarchical predictor based on clustering and an ensemble of weak regressors is built. The quality of the developed hierarchical predictor on the Root Mean Squared Error metric is 1.47 times better than that of the best weak predictor (perceptron with 12 units in a single hidden layer).


Introduction
In the era of active development of digital technologies, personalized medicine is focused on the processing of heterogeneous medical data, and technology ensures this process is manageable. Personalization is becoming increasingly important in the scientific community for many reasons, including the new optimization methods and algorithms that help reduce medical costs and provide quality and effective health care. This study focuses on solving the problem of formalization of the studied object and the formalization of its conditions during treatment or rehabilitation, which will optimize treatment processes, analyze individual patient characteristics, and forecast possible personalized solutions for medical care based on patient health assessment.
The problem of personalizing data requires a clear strategy to determine the main stages of information processing, given in Figure 1.
Determining the individual characteristics needed to solve the personalization problem depends on the key factors of object identification. For the formalized representation of the studied object in medicine, the main parameters of its general condition with certain characteristics are considered. Patient-oriented data are data that the patient or his/her relatives report to the doctor. The quality of patient-oriented data depends on the level of literacy regarding the patient's health, while the quality of patient data depends on the quality and sensitivity of devices and technologies involved in the collection process [1]. Patient-oriented data have several benefits, such as enabling medical data management and supporting double-checking medical research to reduce the incidence of false-positive and false-negative outcomes.
Data from the study object, coming from various sources, are used for medical and health purposes. However, data confirming the provision of special medical care are mainly concentrated and can be collected as part of screening for disease or diagnostic processes. The results of screening or diagnosis should be checked (principles of double-checking) by medical examinations. As clarified in [1], a paradigm shift in screening for disease is essential to, on the one hand, collect sufficient individualized and accurate data for patient-related medical treatment, with the aim of pre-detection, prevention, and prediction of any human health problems. On the other hand, a paradigm shift would help reduce medical costs and reduce morbidity and mortality. In addition, a paradigm shift would pave the way for personalized medicine.
Personalized medicine (PM) can be considered a continuation of traditional approaches to understanding and treating diseases with higher accuracy. The patient's gene variations profile can guide the choice of drugs or treatment protocols that minimize harmful side effects or provide more successful results. PM can also indicate a person's susceptibility to certain diseases before they manifest, allowing doctors and patients to develop a monitoring and prevention plan.
In the digital age, personalized medicine must be supported by small and big data (and should become a data-driven process) and use artificial intelligence technology to support risk prevention, prediction, and detection, and medical intervention. PM is becoming increasingly important in the scientific community (medical, paramedic, and bioengineering) for many reasons, particularly given the greater efficiency and effectiveness of health care, the reduction in medical costs, and more.
PM requires genetic information, information about predisposition to medical diseases, and accurate medical data for the patient, excluding meaningless data. PM in its real form still faces challenges and problems, such as collecting relevant and precise medical treatment data for better diagnosis and prognosis (disease screening, etc.). According to Professor Nisar Malek, Medical Director of the Clinic of Gastroenterology, Hepatology and Infectious Diseases at Tübingen University Hospital and Head of the Tübingen Center for Personalized Medicine, today's PM goals are multifaceted. He stated the following in an interview [2] that he gave in 2018: "Our first goal-and this is also our main goal-is to further improve diagnosis and treatment for an individual patient and it is desirable to integrate this into as many medical specialties as possible. Our second goal is to collect diagnostic and treatment data and transfer them to databases, making them suitable for medical and bioinformatics processes for conclusions about a particular patient. Here we are touching on the realm of big data".
It is a well-known fact that artificial intelligence gives us many opportunities to understand and solve many complex problems in the practical and scientific space. Artificial intelligence can be used in systems designed to detect, track, and predict disease outbreaks. The better we can track the spread of a virus, the more effectively and quickly we can fight it [3].
Data mining, including in medicine, is used by several researchers. For example, Bidyuk P. [4] proposed using decision trees, an EM algorithm, and a regression approach to analyze data gaps and fill them by forecasting the missing values. Similar results were obtained by Sokolova O. [5] and Sina Khanmohammadi [6] for the associative classification of medical data and by Anthony Costa Constantinou for a comprehensive survey [7] and analysis of data from intelligent Bayesian models to support medical decisions. Bayesian networks are also used to reformulate the Quick Medical Reference (QMR) model in decision theory. However, with the spread of big data technology, Bayesian networks have proven slow. Therefore, Y. Tang [8] developed a parallelization of Bayesian networks. Bayesian networks are also used to diagnose dementia, Alzheimer's disease, and mild cognitive impairment, and the Bayesian belief network is used in [9] and [10] to analyze ageing-related diseases. However, even with parallelization, for multi-parameter, large-volume, and dynamic medical data, Bayesian networks should be used only in combination with other machine learning methods. Artificial neural networks, including those using fuzzy logic, are also actively proposed for analyzing various medical data. Thus, Bodyansky E. and Perova I. [11] propose a system of rapid medical diagnosis based on auto-associative neuro-fuzzy memory. This system is characterized by the simplicity of both its architecture and its software implementation and provides for the diagnosis of patients with multiple parameters.
The next problem of personalized medical data processing is the imbalance of input data and small samples of collected data. These factors impose a number of limitations on applying existing methods and means of computational intelligence to solve such problems. The paper [12] develops an approach for collecting medical data using the Internet of Things and proposes an ensemble of neural networks to detect unusual human movements. However, in cases of solving highly specialized problems, which is typical for medicine, the learning error of an ensemble of neural networks is much higher than the error of a single network. In addition, the procedures for training the ensemble of neural networks are quite time- and resource-intensive. Dyvak M. [13] uses interval estimators to assess the object's condition under time constraints. Silva-Ramírez E. L. [14] solves classification and fills data gaps by training a multilayer perceptron according to different rules and by a multiple imputation approach based on a combination of a multilayer perceptron and k-nearest neighbors. Refs [15,16] present mathematical methods for converting dynamic failure trees into Markov models, Bayesian models, or simulation models based on the Monte Carlo method, and they search for methods to reduce model complexity and increase computational performance. However, it is necessary to develop new heuristic approaches that reduce the dimensionality of the Markov model and the number of operations required for its automated analysis and calculation.
The application of Markov models can be used to calculate the parameter of the failure rate and the probability estimation of the first and second kinds of errors, and the analysis of the causes of system failure remains a topical issue [17]. Thus, data mining techniques are used to solve many problems in the processing and analyzing of medical information. However, there are no comprehensive studies aimed at identifying the patient's condition without specifying history. Some issues have been solved towards this end. Still, the researchers only partially consider the phenomena of big data and the Internet of Things, in-depth analysis, and visualization of accumulated data to support decision-making on personalized treatment [18].
The mathematical apparatus of Petri nets and its modifications has been used by researchers to model processes of different natures. For example, to model real-time planning processes in a resource-limited environment, Italian scientists in [3] proposed to use timed colored Petri nets (TCPN) [19], which allowed for analysis of the interdependence, conflicting priorities, and variability of available resources (for example, in industrial projects). Spanish researchers [20] used colored Petri nets to improve the process of modeling, analysis, and semantic validation of complex system events (for example, processing technology and large data streams in the process of determining the level of air quality).
This paper describes the problem of data personalization by determining the individual characteristics needed to solve the personalization problem in the first section. The next section analyzes patient information to help identify general individual characteristics and present the mathematical model of the condition space in cube form as a reflection of the functional relationship of the general state to the studied object. The Results section presents a hierarchical predictor for a patient's number of days in hospital based on individual parameters. The Discussion section describes the main point of researching the problem of personalization and solving tasks of estimation correlation between the individual characteristics, and it evaluates the effectiveness of the chosen prediction model.
The main contributions of this paper are the following:
• The mathematical formalization of the condition space for personalized medical data is developed. It allows for predicting target variables in a sub-part of the space with higher predictive accuracy.
• A dataset with information about personal parameters of 51 patients was collected, which allowed generalization and deeper analysis (https://doi.org/10.6084/m9.figshare.14865411.v1, accessed on 28 June 2021).
• A hierarchical predictor is developed that splits the objects into clusters and makes a prediction for each separated cluster. It produces 1.47 times greater predictive accuracy than the best weak predictor (perceptron with 12 units in a single hidden layer).
• The collected dataset is too small. That is why a specific method based on the hierarchical predictor is proposed for small dataset analysis. Five-fold cross-validation is also used for results validation.

Materials and Methods
To objectively evaluate the object under study, it is necessary to build a formal model. Personalized medical data are considered as a set, the elements of which are the parameters of the object's conditions, namely the elements of sets of time-independent characteristics ($A_{in}$) and time-dependent parameters of the object ($A_t$).
Using the theory of functional analysis, time-dependent data can be formalized as a set $A_t$, the elements of which are built from the sets of individual patient parameters $A_1, A_2, \ldots, A_n$, namely:

$$A_t = \{(a_1, a_2, \ldots, a_n) \mid a_1 \in A_1, a_2 \in A_2, \ldots, a_n \in A_n\}$$

The system for assessing the condition of the object to optimize the recovery process is denoted Fe (a system for estimating the object's conditions) and is described by the following components: $A_{in}$ is a set of time-independent parameters, characterizing the indicators, obtained from the dynamics changes of the conditions, calculated based on the medical historical data; K is the set of coefficients, indicating changes in the performance indicators $A_{in}$; S is the set of strategic decisions to assess the state of the studied object; R represents the production rules for decision-making for finding optimal conditions; ES is a set of estimated conditions of the studied object, which depends on the evaluation coefficients; $A_t$ is the set of characteristics of the studied object, which change under the influence of time; and G is the medical treatment protocol, which depends on $A_{in}$. At the stage of analysis of the results of researching the patient's condition, we can formulate the following property: time-dependent data at a certain point in time take constant indicators, which determine treatment decisions under the influence of personalized data.
A patient's data are characterized by heterogeneity, which complicates the analysis, as there is a need to formalize the patient's physical condition (FS), taking into account time-dependent ($A_t$) and time-independent ($A_{in}$) data. To do this, we can represent the formalization of the patient's physical condition as a reflection of the physical condition of the object. Based on the personalized patient data obtained during the process of caring for the condition and beyond, we can model the condition space of the studied object in time as a Euclidean space, where each pair of elements $a_1, a_2$ corresponds to a real number $(a_1, a_2)$ that satisfies the following conditions (axioms of the scalar product):

$$(a_1, a_1) \ge 0, \text{ where } (a_1, a_1) = 0 \text{ when } a_1 = 0$$
$$(a_1, a_2) = (a_2, a_1)$$
$$(a_{11} + a_{12}, a_2) = (a_{11}, a_2) + (a_{12}, a_2)$$
$$(\lambda a_1, a_2) = \lambda (a_1, a_2)$$

Given that the condition space is represented as Euclidean space, it is possible to develop a mathematical model of condition space as a multidimensional system in the time domain.
The condition space is presented as a three-dimensional cube, where the x-axis reflects the time Po(t), the y-axis is the parametric Po(a), and the z-axis is an indicator for individual cases, i.e., the set of studied objects Ob (Figure 2).
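As a rough illustration of the cube, the Python sketch below stores values in a nested structure indexed by object (Ob), parameter (Po(a)), and time step (Po(t)). The paper's calculations were made in RStudio, so this is only an illustrative translation. The function names, patient identifiers, parameter names, and values are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict


def build_cube(records):
    """Build cube[obj][param][t] = value from (obj, param, t, value) tuples."""
    cube = defaultdict(lambda: defaultdict(dict))
    for obj, param, t, value in records:
        cube[obj][param][t] = value
    return cube


def time_slice(cube, t):
    """Return {obj: {param: value}} for one fixed time step t."""
    return {
        obj: {param: series[t] for param, series in params.items() if t in series}
        for obj, params in cube.items()
    }


# Hypothetical observations: (object, parameter, time step, value).
records = [
    ("patient_1", "temperature", 0, 38.2),
    ("patient_1", "temperature", 1, 37.4),
    ("patient_2", "temperature", 0, 37.9),
    ("patient_2", "leukocytes", 0, 11.3),
]
cube = build_cube(records)
print(time_slice(cube, 0))
```

Fixing one index of the cube gives a two-dimensional slice. Here the slice at a given time step lists every object's parameters at that moment, and objects with no observation at that step appear with an empty entry.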
As determined by simultaneous iteration, the functional relationship of the physical state to the personalized medical data reveals a relationship between the time factor and the parametric index. When the physical condition of the investigated object is analyzed at the next time iteration, the time indicators of the object are observed to depend on the previous values of its physical condition.
We can represent the physical state of K at each time iteration as a relation in functional form. $FS_{oi}$ is only an indicator of the physical condition of SO during the process of treatment or rehabilitation. Due to this, the presented relationship can be analyzed to understand the behavior of the object under study.
An important factor is the analysis and application of protocols according to the identified disease. Treatment protocols offer clear guidelines for practitioners and improve the quality of clinical decisions, including among those professionals who are accustomed to outdated medical practice. Another important advantage of clinical protocols is that they promote coherence in patient care provision at all levels.
Therefore, the condition space of the object should take into account the patient's conditions at the appropriate stage of treatment and the medical protocol. It is a well-known fact that many patients are diagnosed with a number of additional comorbidities that need to be addressed. Each concomitant disease requires treatment according to current protocols. Thus, it is necessary to take this into account when modeling the condition space of a particular object (Figure 4).

The Experimental Setup
The experimental setup is organized as follows:
• Exploratory data analysis (feature normalization and encoding);
• The condition space development;
• Weak predictors selection;
• The hierarchical predictor development;
• Results evaluation.
All calculations were made using RStudio. Data were passed through Data Sampler for balancing. As a result, all instances are taken into account, which means that the collected dataset is balanced.

Dataset Description
We processed patients' personalized data from the collected dataset (https://doi.org/10.6084/m9.figshare.14865411.v1, accessed on 28 June 2021). This dataset was collected in the surgical department at Lviv Public Hospital (Ukraine). Patients were treated clinically for postoperative complications in the abdomen.
The dataset consists of the following characteristics: Each instance represents the object GS o. The task is to predict patients' number of days in the hospital (the duration of treatment) based on drug treatment and patients' personal parameters.
The dataset consists of 51 instances and 10 parameters. Time in hospital is the target variable. After the preprocessing stage and one-hot encoding of the categorical variables, the dataset consists of 39 features. Missing values are found in the Flora feature and in the Related diagnosis feature. No missing data imputation procedure is used; missing data are kept as empty values.
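The encoding step can be sketched in Python as follows. This is a minimal illustration, not the preprocessing used in the paper: the function name, the column names, and the category values are hypothetical, and representing a missing value as an all-zero group of indicator columns is an assumption about how "empty values" could be kept without imputation.

```python
def one_hot_encode(rows, categorical):
    """One-hot encode the listed columns; a missing value (None) yields all zeros."""
    # Collect the sorted set of observed levels for each categorical column.
    levels = {
        col: sorted({row[col] for row in rows if row[col] is not None})
        for col in categorical
    }
    encoded = []
    for row in rows:
        # Copy the non-categorical columns unchanged.
        out = {k: v for k, v in row.items() if k not in categorical}
        for col in categorical:
            for level in levels[col]:
                out[f"{col}={level}"] = 1 if row[col] == level else 0
        encoded.append(out)
    return encoded


# Hypothetical rows; the second one has a missing value in the "flora" column.
rows = [
    {"age": 54, "flora": "E. coli"},
    {"age": 37, "flora": None},
    {"age": 61, "flora": "Staphylococcus"},
]
encoded = one_hot_encode(rows, ["flora"])
print(encoded[1])
```

Each categorical parameter expands into one indicator column per observed level, which is how a small number of parameters can grow into a larger number of features.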

Condition Space Development
For the condition space development, the following steps are performed:
• the most significant features selection;
• the splitting of instances into clusters with similar time-dependent and time-independent parameters.
The initial feature selection is performed by correlation matrix, Boruta, and regression tree. Hard voting is used for final feature selection.
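Hard voting over several feature selectors can be sketched as below. The sketch assumes that a feature is kept when at least two of the three selectors choose it; the paper does not state the exact threshold. The function name and the feature names are hypothetical.

```python
from collections import Counter


def hard_vote(selections, min_votes=2):
    """Keep the features chosen by at least min_votes selectors."""
    votes = Counter(f for selected in selections.values() for f in set(selected))
    return sorted(f for f, n in votes.items() if n >= min_votes)


# Hypothetical output of the three selectors.
selections = {
    "correlation": ["drug_a", "age"],
    "boruta": ["drug_a", "drug_b"],
    "regression_tree": ["drug_a", "drug_b", "sex"],
}
print(hard_vote(selections))  # ['drug_a', 'drug_b']
```

With three voters, a threshold of two corresponds to a simple majority; a threshold of three would keep only the features on which all selectors agree.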
The correlation between parameters is given in Figure 6. Significant correlation is absent. Next, the Boruta algorithm [21] is used (Figure 7). The Boruta algorithm is a wrapper built around the random forest classification algorithm. Figure 7 shows "important" in green and "tentative" in yellow. Blue boxplots correspond to minimal, average, and maximum Z score of a shadow attribute.
We can see that important variables for both methods are similar. The hard voting result from the three feature selectors is given below: The treatment using the drug Ceftriaxon affects the duration of stay in hospital. Therefore, only time-dependent parameters are important for predicting days in hospital.
Next, clustering is used. Visual Assessment of (Cluster) Tendency (VAT) is used to assess whether the objects can be split into clusters. VAT shows poor cluster tendency (Figure 8). Small dissimilarities are represented by dark shades, and large dissimilarities are represented by light shades [22]. The Hopkins statistic [23] is equal to 0.71. This means that the dataset is not significantly clusterable. The fuzzy c-means algorithm shows better results (Table 1). All strong instances in each cluster are marked in bold.
The same result appears with k-means visualization (Figure 9). Clusters are found to overlap, especially clusters one and three. Therefore, the dataset is split into three clusters. Objects 7, 13, 17, 20, 26, 34, 39, 42, 45, 47, 50, and 51 should be analyzed separately. The membership function value for these objects is less than 0.65. That is why a strong difference between objects cannot be found.
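The 0.65 membership rule can be illustrated with the following sketch. It assumes that fuzzy membership degrees have already been computed by a fuzzy c-means run and only applies the threshold. The function name, object identifiers, and membership values are hypothetical and are not the values from Table 1.

```python
def assign_clusters(memberships, threshold=0.65):
    """Map each object to its strongest cluster (1-based), or to None when
    its highest membership degree is below the threshold."""
    assignment = {}
    for obj, degrees in memberships.items():
        best = max(range(len(degrees)), key=lambda k: degrees[k])
        assignment[obj] = best + 1 if degrees[best] >= threshold else None
    return assignment


# Hypothetical membership degrees for three clusters.
memberships = {
    "obj_a": [0.81, 0.12, 0.07],
    "obj_b": [0.45, 0.35, 0.20],
    "obj_c": [0.10, 0.70, 0.20],
}
assignment = assign_clusters(memberships)
print(assignment)  # {'obj_a': 1, 'obj_b': None, 'obj_c': 2}
```

Objects mapped to `None` correspond to the instances that have no strong cluster and are therefore analyzed separately.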

Weak Predictors Selection
At the next stage, we try to build several predictors. Two metrics are taken into account: Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE). Results are given in Table 2. The analysis with selected variables is given below (Table 3).
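For reference, the two metrics can be computed as in the sketch below. It uses the standard definitions of RMSE and MAPE; the sample values are hypothetical, and MAPE is only defined when no actual value is zero.

```python
import math


def rmse(actual, predicted):
    """Root Mean Squared Error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual, predicted):
    """Mean Absolute Percentage Error in percent; assumes no actual value is zero."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical days in hospital: observed and predicted.
actual = [10, 8, 12, 5]
predicted = [9, 8, 14, 6]
print(rmse(actual, predicted), mape(actual, predicted))
```

RMSE is expressed in the units of the target (days) and penalizes large errors more strongly, while MAPE is scale-free and weights errors relative to the observed stay.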

The Hierarchical Predictor Development
At the next stage, weak predictors are used for each separated cluster. Instances with unknown cluster are added to the fourth cluster. In total, four clusters are taken into account.
The hierarchical predictor is built using the following steps:
1. Fuzzy c-means divides objects into four clusters (Table 1);
2. Linear regression, random forest, SVM with radial basis kernel, and SVM with polynomial kernel are used for each cluster separately;
3. Average voting is applied to the obtained results, and the average value is taken as the final prediction.
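The structure of these steps can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: cluster labels are assumed to be precomputed (for example, by fuzzy c-means), and two trivial stand-in regressors (training mean and training median) replace the linear regression, random forest, and SVM models named above. All class names and data are hypothetical.

```python
from statistics import mean, median


class MeanRegressor:
    """Stand-in weak predictor: always predicts the training mean."""

    def fit(self, X, y):
        self.value = mean(y)
        return self

    def predict(self, X):
        return [self.value for _ in X]


class MedianRegressor:
    """Stand-in weak predictor: always predicts the training median."""

    def fit(self, X, y):
        self.value = median(y)
        return self

    def predict(self, X):
        return [self.value for _ in X]


class HierarchicalPredictor:
    """Fit a set of weak predictors per cluster and average their outputs."""

    def __init__(self, make_weak_predictors):
        self.make_weak_predictors = make_weak_predictors
        self.models = {}

    def fit(self, X, y, clusters):
        for c in set(clusters):
            idx = [i for i, ci in enumerate(clusters) if ci == c]
            Xc = [X[i] for i in idx]
            yc = [y[i] for i in idx]
            # A fresh set of weak predictors is trained for every cluster.
            self.models[c] = [m.fit(Xc, yc) for m in self.make_weak_predictors()]
        return self

    def predict(self, X, clusters):
        out = []
        for x, c in zip(X, clusters):
            preds = [m.predict([x])[0] for m in self.models[c]]
            out.append(mean(preds))  # average voting
        return out


# Hypothetical data: one feature, target in days, precomputed cluster labels.
X = [[0], [0], [0], [1], [1]]
y = [4, 6, 11, 20, 22]
clusters = [1, 1, 1, 2, 2]
model = HierarchicalPredictor(lambda: [MeanRegressor(), MedianRegressor()])
model.fit(X, y, clusters)
print(model.predict([[0], [1]], [1, 2]))  # [6.5, 21.0]
```

Any regressor exposing `fit` and `predict` could be plugged in through the factory function, so the same skeleton would apply to the weak predictors listed in step 2.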
The predictive accuracy of the hierarchical predictor is presented in Table 4. The quality is worse than for the whole dataset. Repeated K-fold cross-validation is a technique used for small dataset validation. The advantage of this technique is the ability for parallelization. The result is presented in Table 4.
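The splitting scheme behind repeated K-fold cross-validation can be sketched as below. The sketch uses five folds, matching the five-fold validation mentioned among the contributions; the number of repeats, the seed, and the function name are assumptions made for illustration.

```python
import random


def repeated_kfold(n, k=5, repeats=3, seed=0):
    """Yield (train_idx, test_idx) pairs for `repeats` shuffled k-fold splits."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        # Deal the shuffled indices into k folds of nearly equal size.
        folds = [idx[i::k] for i in range(k)]
        for i in range(k):
            test = folds[i]
            train = [j for f in folds[:i] + folds[i + 1:] for j in f]
            yield train, test


# 51 instances, as in the collected dataset.
splits = list(repeated_kfold(51, k=5, repeats=3))
print(len(splits))  # 15
```

Because every split is independent of the others, the model fits for the individual folds can be run in parallel, which is the advantage noted above.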
We can see that RMSE for a hierarchical predictor based on the whole dataset is not much higher than for selected features. On the other hand, the predictive accuracy for separated clusters is better than for the best weak predictor, artificial neural network (ANN) with 12 units in single hidden layer. The accuracy of the hierarchical predictor with repeated K-fold cross-validation is not much higher than the non-cross-validated method.

Discussion
The obtained results of researching the offered models for the collected data, in general, confirm a hypothesis about differences in predictive accuracy for the whole dataset and condition space built on clustering analysis and feature selection. However, the following two remarks must be made at once:

• The number of instances in the collected dataset is too small. The hypothesis should be confirmed on a larger dataset. We are working on extending the dataset. However, a similar approach was used in our previous research with other datasets [24,25] with approximately the same results;
• The authors suggest that the accuracy of the prediction will strongly depend on the number of empty values.
The hierarchical predictor is 1.01 times better than the best weak predictor (perceptron) for selected features (Tables 2 and 3). The quality of the developed hierarchical predictor for RMSE metric is 1.47 times better than the best weak predictor. The regression tree shows the same result for the whole dataset and selected variables. Linear regression shows 1.53 times better RMSE metrics for the selected features in comparison with whole dataset. The rest of the weak predictors show better results on selected features.
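One way to read the "times better" figures is as the ratio of the baseline RMSE to the RMSE of the compared model. This reading is an assumption, since the text does not spell out the formula. The RMSE values in the sketch are hypothetical and were chosen only so that the ratio equals 1.47; they are not the values reported in the tables.

```python
def improvement_factor(rmse_baseline, rmse_model):
    """Ratio of baseline RMSE to model RMSE; values above 1 mean lower model error."""
    if rmse_model <= 0:
        raise ValueError("rmse_model must be positive")
    return rmse_baseline / rmse_model


# Hypothetical RMSE values, not taken from the paper's tables.
factor = improvement_factor(2.94, 2.0)
print(round(factor, 2))  # 1.47
```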
Feature selection is not only important for increasing accuracy. The hierarchical predictor training time for selected features is 1.2 times faster than for the whole dataset.
The proposed method is intended for small datasets and is similar to [26], where the authors propose to improve the RBF-based input-doubling method by introducing additional elements into the formula for calculating the output signal of the method. We increase the accuracy by 4% based on both MAE and RMSE compared to the basic method.

Conclusions
This paper presents feature selection and prediction of bed-days in hospital using the condition space and the developed hierarchical predictor. The dataset of personalized medical parameters was collected in a public hospital. This dataset is used to predict the number of days in the hospital. Patients were treated clinically for postoperative complications in the abdomen. Our study shows a low pairwise correlation between a large subset of the parameters listed in the dataset. However, proper feature selection is needed to increase the quality of a prediction model.
Data preprocessing allows us to increase the quality of analysis. Boruta, regression tree, and correlation are used for feature selection. The results of the selection are formed based on hard voting using all feature selectors. Several clustering algorithms are used for splitting the objects into separated clusters. This splitting allows for developing the condition space based on time-dependent and time-independent data and medical protocol.
A hierarchical predictor based on the combination of clustering results and four weak predictors for each cluster separately was developed in this paper. The proposed algorithm shows a higher predictive accuracy compared to the best weak predictor (perceptron).
Further research will include developing a hierarchical classifier built on other weak predictors and ensemble development.