Data-Driven Analytics for Personalized Medical Decision Making

: The study was conducted by applying machine learning and data mining methods to treatment personalization. This allows individual patient characteristics to be investigated. The personalization method was built on the clustering method and associative rules. It was suggested to determine the average distance between instances in order to ﬁnd the optimal performance metrics. The formalization of the medical data preprocessing stage was proposed in order to ﬁnd personalized solutions based on current standards and pharmaceutical protocols. The patient data model was built using time-dependent and time-independent parameters. Personalized treatment is usually based on the decision tree method. This approach requires signiﬁcant computation time and cannot be parallelized. Therefore, it was proposed to group people by conditions and to determine deviations of parameters from the normative parameters of the group, as well as the average parameters. The novelty of the paper is the new clustering method, which was built from an ensemble of cluster algorithms, and the usage of the new distance measure with Hopkins metrics, which were 0.13 less than for the k-means method. The Dunn index was 0.03 higher than for the BIRCH (balanced iterative reducing and clustering using hierarchies) algorithm. The next stage was the mining of associative rules provided separately for each cluster. This allows a personalized approach to treatment to be created for each patient based on long-term monitoring. The correctness level of the proposed medical decisions is 86%, which was approved by experts.


Introduction
Nowadays, modern medicine is on the verge of change in many countries. It is crucial to support medical decisions regarding analysis and prediction of a patient's condition at different stages of their treatment. Thus, in recent years health policy has changed in regard to the form of intervention, with the aims of prevention and personalization of medical decisions. Particularly important for the implementation of this policy is the completeness and timeliness of received information about patients and the ability to monitor patients, not only in the hospital. This is why information technology, smart devices, and smart systems are being integrated into medical and social areas [1][2][3]. 2 of 15 As an example, the primary healthcare system serves almost 70% of the population, treating about 90% of common and local problems [4]. Consequently, for qualified medical care, it is necessary to develop intelligent systems for personalized treatment decisions. Such intelligent systems will be responsible for collecting, processing, and transmitting a large amount of data using medical records, sets of microcontrollers, communication equipment, and sensors based on Internet of Things (IoT) solutions. Artificial intelligence (AI) will help to solve the preprocessing tasks, such as gap imputation, outlier analysis, complexity reduction, hidden dependency identification [5][6][7]. Artificial intelligence and big data will be used together to improve the prediction accuracy of the spatial and temporal development of a patient's disease [8][9][10].
Personalized medicine (PM) is supposed to be supported by small and big data (which must become a data-driven process) and to use artificial intelligence technology to help with risk prevention, prediction, and detection, as well as medical intervention. Personalized treatment (PT) is based on existing medical protocols and therapeutic schemes (the list of medicaments and medical manipulations for each type of disease), however it takes into account the individual characteristics of each patient (concomitant pathologies, allergic reactions, current blood pressure and glucose level, etc.). This gives the opportunity to choose the most appropriative medicaments from the list.
The aim of this paper is to develop a novel approach for medical decisions based on an ensemble of clustering methods. The optimal condition of a patient is estimated as the distance between the centroid in a cluster and a particular instance. The list of the most appropriative medicaments is built on separate associative rules for each cluster.
The proposed approach can also be used for processing multidimensional information in structured and semistructured modes in other domains.
The structure of the paper is as follows. The first part, covering the related sources, presents an approach for medical data analysis. Next, the patient data model and a description of the existing dataset are given. An estimation of the quality metrics is provided for the existing method. The novel approach based on an ensemble of clustering methods is developed. The results of the proposed approach are then analyzed. The conclusion section underlines the novelty of the proposed approach.

Related Works
Various aspects of medical data processing are highlighted below.
• Diagnosis (classification and prediction) and personalization of treatment.
Bidiuk [11] used decision-tree-based computation procedures and regression approaches to predict missing data and for PM decisions. Dangare [12] obtained similar results for mining of associative rules in medical data. The main disadvantage of these methods is that they only process structured data. This is why medical records without data from sensors and medical devices can be processed.
Bayesian networks, artificial neural networks (ANN), and k-means algorithms are used in [13] for heart disease prediction. However, Bayesian networks are too slow to process large amounts of data or for online diagnostics. This is why Tang [14] developed a method for parallel Bayesian networks. However, even in the case of parallelism, it is advisable to use Bayesian networks in combination with other machine learning methods for multiparameter, large-scale, dynamic medical data flows. The ANN apparatus is commonly used in combination with fuzzy logic for the analysis of various medical data. Thus, in the works of Bodiansky and Perova [15][16][17][18], a system for rapid medical diagnostics based on auto-associative neuro-fuzzy memory was proposed. However, one of the urgent tasks remains, which is to improve the accuracy of the results of the classification problem. In addition, problems related to imbalanced input data and small samples of data collected manually by medical staff impose a number of restrictions on the usage of existing methods and computational intelligence tools to solve such problems [19].
In [20], AI methods in PM are analyzed. The relationship between diseases and their treatments are found. However, these relationships are built on medical protocols, since the proposed approach cannot reduce the list of medicaments.
A previous paper [21] presented the architecture of the support decision medicinal system based on ontology and a personalized data model was given. This model will be extended in this paper using sensor data.
Therefore, for the diagnosis and PM support, a combination of analyzed methods is required.
• Clustering: Cluster analysis is widely used for outlier detection. The outliers in the medical domain are the differences between optimal patient conditions based on the local protocol and individual features. Some of the simple clustering algorithms are partitioning methods. The k-means algorithm builds k clusters located at large distances from each other. The main type of problems solved by the k-means algorithm is the assumption (hypotheses) of the number of clusters and the diversity of the instances in different clusters. The choice of the k number may be based on the results of previous studies and theoretical considerations [22].
The authors in [23] used the balanced iterative reduction and clustering using hierarchies (BIRCH) algorithm for diabetes detection. For the BIRCH algorithm, the clustering rate increases due to the generalized appearance of clusters. This algorithm implements a two-stage clustering process. During the first stage, a preliminary set of clusters is formed. In the second step, other clustering algorithms suitable for Random Access Memory (RAM) are applied to the detected clusters [23].This algorithm can be used for outlier detection [22,23]. If the dimensions of the data features are very large, e.g., greater than 20, BIRCH is not suitable. At this time, the mini batch k-means algorithm performs better. The complexity of the algorithm is represented as O(n).
The density-based spatial clustering of applications with noise (DBSCAN) method iteratively collects objects that are close to the root objects that can merge several tightly accessible clusters. DBSCAN does not require the number of clusters received to be predefined, unlike partitioning methods [24], although there is a need for guidance on the values of the radius parameters of any object and the minimum number of objects that directly affect the clustering result. The optimal values for these parameters are difficult to determine, especially for multidimensional data spaces [22].
Hierarchical clustering creates a hierarchy of clusters. Each cluster node contains child clusters; cluster descendants divide the vertices belonging to their common ancestor. The main disadvantage is the high complexity of this algorithm [25].
It is very important to reduce the time complexity for large dataset analysis [26]. For example, the Big Cities Health Inventory Data Platform (BCHC Data Platform, https://www.bigcitieshealth. org/city-data) contains over 30,000 data points across more than 50 health, socioeconomic, urban (information from smart sensors), and demographic indicators across 11 categories in the United States. Another example is online monitoring of a patient's conditions at home. In this case, both structured data (medical records) and streaming data (information from IoT devices) should be analyzed in online mode. Consequently, the parallel processing of information can reduce the time complexity of data preprocessing and processing.
In [27], the unsupervised artificial neural network model is used to extract downstream gas temperature profiles in combustion health monitoring. The variational autoencoders (VAE) belong to the family of generative models. The VAE maximizing likelihood yields an estimated density that always bleeds probability mass away from the estimated data manifold. This is why the dataset should be balanced.
The genetic algorithm in [28] is used for naturalistic driving data analysis and driver behavior mining. This method can also be implemented for patient condition prediction, but it is most important to find a personalization treatment based on outliers and the average conditions for each group of treatments. Therefore, the search for personalized solutions involves a number of interrelated processes, namely: • Data collection (physical parameters of the patients) from smart sensors, e.g., electronic glucometers, automatic tonometers, and holters; • Data consolidation is focused on defining the data models, as well as the formation of data flows in accordance with the defined tasks; • Data processing consists of validating data and storing data, for example in a database, knowledge base, or data warehouse, while ensuring data protection; • The data analysis process sorts the data by defined parameters, taking into account cluster analysis data mining methods, including k-means and k-median algorithms, data visualization methods, etc.; • The forecasting process involves the construction of a test object that allows the use of artificial intelligence methods, including artificial neural networks, decision trees, Bayesian networks, linear regression networks, correlation-regression analysis, and methods for finding associative rules, including the a priori algorithm.

Patient Data Model
The patient data were characterized by multidimensionality and heterogeneity, complicating the analysis of patient conditions. This is why it was necessary to formalize the patients' physical conditions (FC) by the set of parameters A, consisting of time-dependent A t and time-independent A in parameters: The A t and A in sets can intersect.
The analyzed dataset consisted of information for about 36 patients from Lviv hospital, Ukraine. Data were collected during their hospitalization period. The list of time-independent characteristics includes the following: Active substance (list of values); • Medicament (list of values; each value is given as a document consisting of indicators, diagnosis, active substances, and contraindications).
The list of time-dependent characteristics includes the following: These parameters can be presented as a document-oriented graph ( Figure 1). The overall dataset consists of 15,000 records. The medicaments list is based on medical protocols and diagnoses. More than 30 medicaments exist for each diagnosis. The main purpose of the paper is to select the medicaments from this list based on analysis of other characteristics and patient conditions. This will allow personalization of the treatment.
The data processing and analysis for personalized medical solutions, APD, are represented as a pair: where = ⋃ , is the set of the patient′s physical conditions, A is the set of the patient′s personal data, n is the total number of patient characteristics, and D is the set of personalized decisions.
The physical condition is represented as: where Vd is the set of the attribute values di according to the treatment protocol. Attributes of the set A are called conditions, D is the solution, and di is the value of the i-th solution obtained from the set D.
The patient condition is equation demonstrated below: where ⃗ ( ) is the conditions vector of the dimensions of space n, which includes object variables that uniquely determine each condition: where ⃗ ( ) is the control vector that displays signals affecting the system from the outside through the proposed solutions to determine the therapeutic treatment scheme: The overall dataset consists of 15,000 records. The medicaments list is based on medical protocols and diagnoses. More than 30 medicaments exist for each diagnosis. The main purpose of the paper is to select the medicaments from this list based on analysis of other characteristics and patient conditions. This will allow personalization of the treatment.
The data processing and analysis for personalized medical solutions, APD, are represented as a pair: where GS = ∪ n i=1 FC i , GS is the set of the patient's physical conditions, A is the set of the patient's personal data, n is the total number of patient characteristics, and D is the set of personalized decisions.
The physical condition is represented as: where V d is the set of the attribute values d i according to the treatment protocol. Attributes of the set A are called conditions, D is the solution, and d i is the value of the i-th solution obtained from the set D.
The patient condition is equation demonstrated below: where → FC(t) is the conditions vector of the dimensions of space n, which includes object variables that uniquely determine each condition: where → S (t) is the control vector that displays signals affecting the system from the outside through the proposed solutions to determine the therapeutic treatment scheme: A and B are the parameter matrices, which consist of object parameters [29].
The control vector → S (t) is presented by binary variables for the following parameters: Heart rate.
S i (t) = 0, parameter is in normal range 1, parameters is out of normal range (8) If S i (t) is equal to 1, we should eliminate medicaments with similar contraindicators.
The medical data are represented as a multidimensional dataset. Therefore, the next step is the patient conditional prediction based on time-independent and time-depended characteristics.

Analysis Based on the Existing Clustering Method
The preprocessing stage allows for better data understanding. The results were obtained in RStudio.
First, the existing clustering methods were used. The k-means algorithm evaluates the shapes of the clusters. This method can also be used to find outliers. The gap statistics method was applied to estimate the most appropriative number of clusters. The best number was equal to three ( Figure 2). The total sum of the square value was 43.5%.
A and B are the parameter matrices, which consist of object parameters [29]. The control vector ⃗ ( ) is presented by binary variables for the following parameters: Heart rate.
( ) = 0, parameter is in normal range 1, parameters is out of normal range If Si(t) is equal to 1, we should eliminate medicaments with similar contraindicators.
The medical data are represented as a multidimensional dataset. Therefore, the next step is the patient conditional prediction based on time-independent and time-depended characteristics.

Analysis Based on the Existing Clustering Method
The preprocessing stage allows for better data understanding. The results were obtained in RStudio.
First, the existing clustering methods were used. The k-means algorithm evaluates the shapes of the clusters. This method can also be used to find outliers. The gap statistics method was applied to estimate the most appropriative number of clusters. The best number was equal to three ( Figure 2). The total sum of the square value was 43.5%. Therefore, according to gap statistics, the best number of clusters is three. The shapes of the three clusters are given in Figure 3. Therefore, according to gap statistics, the best number of clusters is three. The shapes of the three clusters are given in Figure 3.  The overlap of the objects is apparent in Figure 3. This is why the fuzzy c-means algorithm was used for future data analysis. Part of the membership function is given below. The minimum difference between values of the membership function for one instance is 0.01. We can see the overlap of the clusters for the following objects: The visual assessment of the cluster tendency (VAT) metric performs an evaluation of the quality of clustering. This highlights opportunities for scaling to bigger datasets [30], making it particularly suitable for IoT scaling applications. VAT did not identify clusters for this dataset, showing low propensity to form groups ( Figure 4).  The overlap of the objects is apparent in Figure 3. This is why the fuzzy c-means algorithm was used for future data analysis. Part of the membership function is given below. The minimum difference between values of the membership function for one instance is 0.01. We can see the overlap of the clusters for the following objects: The visual assessment of the cluster tendency (VAT) metric performs an evaluation of the quality of clustering. This highlights opportunities for scaling to bigger datasets [30], making it particularly suitable for IoT scaling applications. VAT did not identify clusters for this dataset, showing low propensity to form groups (Figure 4).
The cophenetic correlation is used for evaluation of hierarchical clustering methods. This is a measure of how faithfully a dendrogram preserves the pairwise distances between the original unmodeled data points. The cophenetic correlation can also be calculated between the original distance matrix and the cophenetic distance matrix, which can serve as a measure of the adequacy of the cluster solution compared to the initial data.
The evaluation of five hierarchical clusters was performed by this indicator, which were compared in the previous section. The visual assessment of the cluster tendency (VAT) metric performs an evaluation of the quality of clustering. This highlights opportunities for scaling to bigger datasets [30], making it particularly suitable for IoT scaling applications. VAT did not identify clusters for this dataset, showing low propensity to form groups (Figure 4).  Thus, in order to find personalized solutions, it is suggested to take into account the average values of a particular cluster, which will determine the optimal conditions of the patient.

##
By applying the naive Bayes method, the value of the random variable was calculated, achieving an a posteriori maximum for classes 1, 4, and 5 (Table 1). The Kohonen map shows the proximity of the patient's weight and age attributes, as well as the diagnosis and concomitant diagnosis, medication, and hospital stay, highlighting the affinity of the data and the impact on the object under study ( Figure 5).
In summary, the quality metrics for different clustering algorithms are not too high for the analyzed dataset.

The Proposed Clustering Method
In this paper, an ensemble based on the k-means clustering algorithm was used. The main purpose of the modification was to find the anomalous clusters.
The state space of the object in this study is given as a Euclidean space. This allows the state space to be presented as a multidimensional system in the time domain. In order to describe the relationships between them, it is advisable to analyze the criterion of divergence between patients.

The Proposed Clustering Method
In this paper, an ensemble based on the k-means clustering algorithm was used. The main purpose of the modification was to find the anomalous clusters.
The state space of the object in this study is given as a Euclidean space. This allows the state space to be presented as a multidimensional system in the time domain. In order to describe the relationships between them, it is advisable to analyze the criterion of divergence between patients.
Suppose that a group of instances belonging to class Gi contains M examples. The diagnostic measure of distance is selected and then the distance from point aij to point aim included in the group of verified examples (quadratic measure) is defined as: where: = , , . . . , = , , … , = , , . . . , Therefore: where: where: A 1 = {a 11 , a 12 , . . . , a 1m } (11) Therefore: where: Here, ∆A i is the difference between parameter values, n is the number of parameters, m is the number of patients that are being clustered, and k i is the normalization factor. The unequal attributes can have different ranges of represented entities in the selection. This is why the distance can be dependent on attributes with large ranges. Therefore, data must be normalized. The optimal distance to the optimal parameter value can be used for each defined cluster. It is possible to determine the average distance from point a ij to points of a training sequence belonging to the G i class: The following steps present the proposed algorithm (Algorithm 1).
Input data: FC Output data: clustering vector, sum of squares SS by cluster, gravity center vector.
1. Preprocessing stage. The "reference" point FC = (FC 1 , . . . , FC n ) is fixed (for example, at the center of gravity of the "cloud" of objects), which is then converted to 0 by subtracting it from all points representing the objects.

2.
Initialization stage. The point farthest from 0 is taken as the initial centroid c. MAX__SS = max_value 3.

3.1.
Cluster update (8) assign an object to the centroid c; 3.1.2. cluster S is formed from all such objects.

3.2.
Update the centroid

End
The novelty of the proposed clustering method was in the possibility for parallelized grouping for each clustering number, as well as the new distance measurement. This method is similar to hierarchical clustering because it allows grouping without cluster number initialization. The difference is in the distance evaluation technique, which can find distances between objects, as well as between separated parameters of an object. This is very important for multidimensional datasets. The next advantage compared to hierarchical clustering is the better time complexity. The best time complexity for hierarchical clustering with a heap of the run time for the general case can be reduced to O (n 2 log n). Simultaneously, the time complexity of the proposed method is O (n 2 ).

The Algorithm used for Finding Personalized Treatment
The next stage is the detection of outliers. The outliers can affect the clustering results by shifting the cluster centers. However, it is most important to find outliers for personalized treatment. The outliers can be evaluated based on the distance (Equation (9)). Then, relationships between patient parameters, indicators, and contraindicators of medicaments from the recommended list should be found. To do this, the a priori algorithm [31] can be used.
The personalized treatment algorithm is shown below (Algorithm 2).

Algorithm 2. Finding a personalized treatment (PTA).
Input data: FC, clustering vector, gravity center vector. Output data: clustering vector, sum of squares SS by cluster, center of gravity vector.

1.
For each cluster, find the objects with distances L ij higher than 75% of the average value. Add these objects to OUTLIERS list.

2.
For each object from OUTLIERS: For each parameter S i (t) if S i (t) = 1 eliminate the medicaments with similar contraindicators.

2.2.
Find the relations between FC parameters using the a priori algorithm.

2.3.
Create the list of the most appropriative medicaments from medicaments with a support level higher than min_support.

Results
To compare the results with well-known methods, we created three clusters based on the proposed clustering method. The minimum difference between values of the membership function for one instance was 0.23. This value demonstrates that the better clustering result was better than the fuzzy c-means method.
In addition, the Hopkins statistics method [32] was used to assess the quality of the clustering algorithm evaluation ( Table 2). The statistics method was used for evaluation of the density-based clustering algorithm. The Hopkins statistics values greater than 0.5 for objects w i and q i corresponded to the null hypothesis that w i and q i are similar, while the grouped objects were distributed randomly and uniformly. A value of H ind < 0.25 at a 90% confidence level indicates a tendency for data grouping.
Therefore, the Hopkins statistics value of the proposed method is not dominated by the k-means, fuzzy c-means, or Kohonen maps methods for different numbers of clusters. DBSCAN, with an epsilon neighborhood equal to 0.2, also shows good results for small number of clusters.
The Dunn index is defined as the ratio between the minimal intercluster distance and the maximal intracluster distance for algorithm evaluations [33,34]. If the dataset contains compact and well-separated clusters, the diameter of the clusters is expected to be small and the distance between the clusters is expected to be large. Thus, the Dunn index should be maximized. This metric is used for hierarchical clustering analysis with three clusters (Table 3). Thus, the proposed method is not dominated by other methods based on the Dunn index estimation. The results of the algorithm for outlier mining and personalization of treatment was evaluated by experts with more than 20 years' experience in purulent surgical infection of soft tissues and antibiotic therapy in the treatment of diseases caused by purulent surgical infection.
Personalized treatment was proposed for seven out of 36 patients (step 2 of the algorithm). The min_support was set to 40%. The number of frequent rules (relations between time-independent and time-dependent parameters) was nine (step 3). The medicament reduction rate was 52%. The experts confirmed the correctness of medicament treatment for six out of seven patients.
The most important associative rules are given in Table 4. As a result, the confirmed correctness level of the proposed medical decisions is 86%.

Discussion
According to the results of the research, it is possible to point out the significance of such factors as the heterogeneity of data, subjective assessments of psychological indicators, and temporal changes in the characteristics of the general state of the investigated object. The medical data processing also requires the development of appropriate solutions for data collection, transmission, storage, and protection [19,21,32,35]. On one hand, most of analyzed algorithms solve the above problems effectively, but on other hand do not cover the entire list of problems associated with the personalized treatment. In this article, the method of patient clustering is applied. This determines the conditional measure of similarity of the objects, combining them into clusters. The patient's condition is assumed as the average cluster score. The Euclidean distance is usually used as a measure of proximity. Using the partition clustering method, the number of clusters is unknown and selected by the researcher, while the quality of the clustering depends on the primary separation using the elbow method [23]. Thus, the following disadvantages appear for such methods: • The specificity of each patient is not taken into account-all users are divided into classes; • If there is no characteristic value in the cluster, it will not be possible to obtain a recommendation.
The usage of the DBSCAN method shows the absence of densely filled sections of the state space. The correlation between the attributes is negligible.
One of the simplest classification algorithms is the naive Bayes method. However, quite often this works better than algorithms that are more complicated. This method is based on the assumption that all the variables under consideration are independent of each other [14]. In the example given in Table 2, the a posteriori maximum of only a few classes is achieved.
The advantages of this method are as follows: • Classification, including of many classes, is done quickly; • When the assumption of independence is fulfilled, the classifier outperforms other algorithms, such as logistic regression, while requiring less training data; • The algorithm works better with categorical features than continuous ones.
The disadvantages of this method are: • If there is a value of a categorical type in the test dataset that is not found in the training dataset, the model assigns a zero probability to this value and cannot make a prediction; • Although the algorithm is a good classifier, the values of predicted probabilities are not always sufficiently accurate.

Conclusions
The research was carried out on the development of an intellectual analysis method for medical datasets to solve the problem of object clustering. Consequently, it was compared with existing data mining methods.
It was suggested to determine the average distance between instances of classes in order to find priority signs of influence, which help to detect the best instance state of a cluster.
The formalization of the medical data for personalized solutions supporting the preprocessing stage is proposed in accordance with current standards and pharmaceutical protocols.
The paper proposes the classification of persons by state in order to determine the deviation of the parameters from the average rate of the group. The proposed method made it feasible to create a personalized approach to analyze the conditions and make recommendations for each patient based on long-term follow-up and monitoring under the guidance of a physician.
Appropriately, using the results of the analysis, it becomes possible to predict the optimal general state for a particular person, which will help to improve satisfaction and ensure patient longevity.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: