Forecast of Medical Costs in Health Companies Using Models Based on Advanced Analytics

Abstract: Forecasting medical costs is crucial for planning, budgeting, and efficient decision making in the health industry. This paper introduces a proposal to forecast costs through techniques such as a standard long short-term memory (LSTM) model and patient grouping through k-means clustering in the Keralty group, one of Colombia's leading healthcare companies. We highlight its implications for the prediction of cost time series in the health sector, based on a retrospective analysis of the services invoiced to health companies. The analysis starts with the selection of sociodemographic variables related to the patient, such as age, gender, and marital status, and is complemented with health variables such as patient comorbidities (cohorts) and induced variables such as service provision frequency and the time elapsed since the last consultation (hereafter referred to as "recency"). Our results suggest that greater accuracy can be achieved by first clustering and then using LSTM networks, which implies that a correct segmentation of the population according to the usage of services, represented in costs, must be performed beforehand. Through the analysis, a cost projection from 1 to 3 months can be conducted, allowing a comparison with historical data. The reliability of the model is validated by different metrics such as the RMSE and the adjusted R². Overall, this study is intended to be useful for healthcare managers in developing a strategy for medical cost forecasting. We conclude that the use of analytical tools allows the organization to make informed decisions and to develop strategies for optimizing resources with the identified population.


Introduction
Healthcare is one of the largest industries and services of the global economy, and it has been growing significantly, becoming one of the biggest challenges of our time [1]. According to the World Health Organization (WHO), healthcare represented 7.56% of Europe's gross domestic product (GDP) in 2015 [2]. In 2018, the total healthcare expenditure of the United States was 16.8% of its GDP, the highest in the world [2]. The national healthcare expenditure of the United States in 2018 was USD 3.8 trillion, but forecasts show that these costs will increase up to USD 6.2 trillion by 2028 [3]. Among other factors, one reason for this increase is the misuse of medication and the duplication of procedures by doctors [4].
In Colombia, according to the National Government, public health issues have been prioritized to guarantee equality; thus, for 2020, the budget was USD 8 billion, with an 8.12% increase since 2019, when it was USD 7.45 billion [5,6]. In this sense, the public health sector became one of the national sectors with the highest allocation of resources in the national budget.
To be in line with the General Health and Social Security System (SGSSS), the Keralty organization, one of the main actors in the Colombian health system [7], conducted an analysis that allowed the accredited health sector institutions to be grouped into two large clusters. The first was defined as institutions in the process of financial consolidation; the second cluster was defined as large health institutions. The business profiles of the institutions under study were thus defined.
Specifically, we summarize our contribution as follows: we predict the medical cost of a healthcare organization using the described techniques, and we suggest an avenue of improvement for further work, namely that understanding how and why cost drivers increase may reveal risk factors and possible starting points for defining preventive measures and strategies.
This paper is structured as follows. In Section 2, we review related works. In Section 3, we describe the methodology and provide information about the data, the data-processing operations, and the methods we used to evaluate the problem. In Section 4, we first present the results obtained with the LSTM networks and then the results obtained from combining cluster segmentation with LSTM networks. In Section 5, we discuss the results and finally summarize the conclusions and directions for future research.

Related Work
The cost forecast is one of the main objectives of different time series methods when these methods are applied in diverse fields. A time series is a sequence of measurements over time, typically recorded at regular intervals. Time series forecasting can be applied to diverse sectors, and in this case, specifically to the prediction of medication costs, as performed in papers by, e.g., Jaushic and Shruti [12,22], using different techniques such as ARIMA and LSTM. Another work by Kabir [23], using RL, RNN, and LSTM, showed a sustainable approach to forecasting the future demand for hospital beds, considering the hospital capacity and the population of the region in order to plan the future increase in required hospital beds. Scheuer [24] used electronic medical records of Finnish citizens over sixty-five years of age to develop a sequential deep learning model, based on RNN and LSTM networks, that predicts the use of health services in the following year. Another work which uses clustering techniques is that by Mahmoud [25]. This author studied hip fracture care in Ireland and, using k-means clustering, showed that elderly patients are grouped according to three variables: age, length of stay, and time to surgery. According to Mahmoud, the cost of treating a hip fracture was estimated at approximately EUR 12,600, and he identified hip fractures as one of the most serious injuries, involving long hospital admissions.
In addition, Miroslava [26] used k-means to find, among 23 to 26 clinical variables, those most capable of efficiently separating patients diagnosed with type 2 diabetes mellitus (T2DM) with underlying diseases such as arterial hypertonia (AH), ischemic heart disease (CHD), diabetic polyneuropathy (DPNP), and diabetic microangiopathy (DMA).
The following Table 1 provides a summary of the related papers and their input variables.

Table 1. Summary of related papers and their input variables.

Author [Ref.] | Techniques | Forecast target | Input variables
… | … | … | …
Shruti [22] | ARIMA, MLP, LSTM | Weekly expenditure on pain medications | Two medications that are among the ten most prescribed pain medications in the US
Kabir (2021) [23] | RL, RNN, LSTM | Bed cost | Number of beds, occupation, and patients
Scheuer (2020) [24] | Lasso, LightGBM, LSTM | Cost of visits by family doctor | Number of patients, number of visits, average visits per patient, procedure codes, and diagnoses

Materials and Methods
This study explores two different approaches to forecasting medical costs in the Colombian public health insurance system. The steps of the methodology applied to meet the objectives of this paper are shown in Figure 1.

Data Collection
In this research, we used datasets from the Keralty health company [7]. The data for this retrospective analysis were obtained from one of the modules of medical and affiliate accounts of the Core Beyond Health application developed by Sonda [27]. This includes invoices from medical services corresponding to patient care through the public health plan. We also used the Vacovid repository (proprietary source) to obtain information on patients classified within any health condition or cohort. The dataset contains all the information available on the costs of services received by the users between 2017 and 2021. Figure 2 shows the datasets and the variables of each data source.

Data Processing
In this step, we transformed the raw data into an adequate and understandable format. Real-world datasets contain errors; this step resolves them so that the datasets become easier to manage [28]. Below, we briefly describe the most important transformations applied to each dataset:
1. Dates are converted into the DateTime format (%Y-%m-%d);
2. Empty fields are mapped to 0 values;
3. The "TotalComorbidities" field is created, identifying the number of diagnoses or cohorts of a patient;
4. Category values are encoded;
5. Document types are mapped to a dictionary;
6. Exceedingly small provision values (less than 1000) are disregarded;
7. The "Number" and "InvoicedValue" fields are converted into integer format.
After unifying and cleaning the dataset, we ended up with a total of 160,463,128 entries on the invoices for the provided medical services. Table 2 shows the variables selected to work in the simulators with a 5% sample corresponding to 3,202,610 services with 34 different attributes. The output variable in this study is "InvoicedValue". In Table 3, we show the Spearman correlation coefficients between the selected variables and the invoiced values for two groups of patients: "Without comorbidity" designates patients not classified under any health cohort, while "With one morbidity" designates patients belonging to one or more health cohorts. Similarly, in Table 4, we show the Pearson correlation coefficients between the listed variables and the invoiced value for patients within each cohort or pathology. This process allowed us to identify the variables most significantly associated with medical cost. The only variable related to the cohorts with a correlation coefficient close to 0.5 is "Number of services". If the coefficient assigned to a variable is a substantial (negative or positive) number, the variable influences the prediction; conversely, if the coefficient is zero, it has no impact on the prediction.
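Continuing the sketch above, the correlation tables could be computed with pandas as follows; the variable list and cohort column are hypothetical stand-ins for those in Tables 3 and 4:

# Spearman coefficients for patients with/without comorbidities (Table 3)
# and Pearson coefficients per cohort (Table 4); column names are illustrative.
variables = ["Age", "NumberOfServices", "Frequency", "Recency"]

without = df[df["TotalComorbidities"] == 0]
with_one = df[df["TotalComorbidities"] >= 1]

spearman_without = without[variables].corrwith(without["InvoicedValue"], method="spearman")
spearman_with = with_one[variables].corrwith(with_one["InvoicedValue"], method="spearman")

# Pearson per cohort, e.g., the diabetes cohort (corrwith defaults to Pearson).
diabetes = df[df["Diabetes"] == 1]
pearson_diabetes = diabetes[variables].corrwith(diabetes["InvoicedValue"])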

Model Implementation
The cost forecast was performed under two proposals: a cost analysis on the selected variables using LSTM neural networks, and a segmentation through clustering followed by a cost analysis of each cluster using the same techniques. Our deep learning LSTM regression model was developed with Keras [29,30] and Sklearn [31], using the Python programming language [32]. We also used Streamlit [33], which allowed us to create a web application to display our results, and the Google Cloud Platform AI Platform [34] to train the machine learning models, host the model in the cloud, and finally make the model available to users on cloud storage. The usage of LSTM networks is motivated by the long- and short-term seasonalities involved in the medical cost time series, such as Christmas, summer, and weekdays, which makes LSTM models more appropriate.

LSTM Networks
This neural network can connect three pieces of information over time: the current input data; the short-term memory received from the preceding cell (the so-called hidden state); and the long-term memory of more remote cells (the so-called cell state), from which the RNN cell produces a new hidden state [12]. Figure 3 shows an LSTM memory cell.
Machine learning algorithms work best when numerical inputs are scaled to a standard range. Normalization and standardization are the two most popular techniques for scaling numerical data before modeling. Normalization scales each input variable separately to the range of 0-1, the range in which floating-point values have the highest precision. Standardization scales each input variable separately by subtracting the mean (called centering) and dividing by the standard deviation, changing the distribution to have a mean of zero and a standard deviation of one. To normalize the data and feed the LSTM, we used MinMaxScaler from sklearn.preprocessing to scale our data between -1 and 1; the feature_range parameter specifies the range of the scaled data. Then, we converted the training and test data into a time series problem: we must predict the value at time T based on the monthly data. To train the LSTM network with our data, we needed to convert the data into the 3D format accepted by the LSTM. This means that the input layer expects a 3D data matrix when fitting the model and making predictions, even if specific dimensions of the matrix contain only one value, for example, a single sample or feature. When defining the input layer of an LSTM network, the network assumes one or more samples and requires the number of time steps and the number of features to be specified.
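A minimal sketch of this scaling and reshaping step follows; the monthly_costs series and the window of n_steps = 3 months are illustrative assumptions:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(-1, 1))
scaled = scaler.fit_transform(monthly_costs.values.reshape(-1, 1))  # monthly_costs: pandas Series of monthly totals

# Frame as a supervised problem: predict month T from the previous n_steps months.
n_steps = 3
X, y = [], []
for i in range(n_steps, len(scaled)):
    X.append(scaled[i - n_steps:i, 0])
    y.append(scaled[i, 0])
X, y = np.array(X), np.array(y)

# The LSTM expects a 3D input: [samples, time steps, features].
X = X.reshape((X.shape[0], n_steps, 1))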
There is no general rule as to how many nodes or hidden layers must be selected, and very often a trial-and-error approach yields the best results for each problem [37]. As this is a simple network, we started with four neurons, then eight, and finally sixteen neurons, which constitute the first parameter of the LSTM layer. The second parameter, "return_sequences", was set to false, as we did not add more layers to the model. The last parameter was the number of indicators [12]. We also added a dropout layer to our model to prevent overfitting. Finally, we added a dense layer at the end of the model; the number of neurons in the dense layer was set to 1, as we wanted to predict a single numeric value in the output. In this paper, we used the Adam optimizer [38] and the mean squared error as the loss metric [39] in the implementation of the LSTM network.
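A sketch of this architecture in Keras is given below; the dropout rate, number of epochs, and batch size are assumptions, as they are not reported here:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

model = Sequential([
    LSTM(16, return_sequences=False, input_shape=(n_steps, 1)),  # 16 memory cells, no stacked layers
    Dropout(0.2),  # dropout rate is an assumption
    Dense(1),      # one output neuron: the predicted cost
])
model.compile(optimizer="adam", loss="mean_squared_error")
model.fit(X, y, epochs=100, batch_size=32, verbose=0)  # epochs/batch size are assumptions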
Some of the parameters that can be modified and which are very important to achieving the good performance of the model are the activation function and the cost function. Activation functions largely control what information is propagated from one layer to the next. By combining non-linear activation functions with multiple layers, network models are able to learn non-linear relationships. The most commonly used activation functions are relu and sigmoid. The activation function relu will generate an output equal to zero when the input is negative, and an output equal to the input when the input is positive. As such, the activation function retains only the positive values and discards the negative ones, giving them an activation of zero. The sigmoid activation function takes any range of values at the input and maps them to the range of 0-1 at the output.
Another parameter is the cost function, also called the loss function, which quantifies the distance between the actual value and the value predicted by the network. In other words, it measures how incorrect the network is when making predictions. In most cases, the cost function returns positive values. The network's predictions are improved when the cost value is close to zero.
An epoch corresponds to one complete pass of the training data through the neural network; the number of epochs is the number of times this cycle is executed, so that the network learns from the data. A long short-term memory (LSTM) network is one of the most popular neural networks for analyzing time series; its ability to remember previous information makes it ideal for such tasks [40].

Clusters
In this case, we use k-means clustering to identify patients with the same characteristics, as shown in Figure 4. To implement the k-means clustering algorithm, one must first choose a value of k, i.e., the number of clusters to be formed. Then, k data points are randomly selected from the dataset as the initial cluster centers. Next, the distance between each data point and the cluster centroids is calculated, and each datum is assigned to the cluster with the closest centroid. For each cluster, a new mean is then computed from the cluster's data points. This process repeats until the cluster means remain stable within a predetermined variation limit or until the maximum number of iterations is reached.
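A didactic NumPy sketch of these steps (not the production implementation) could look as follows:

import numpy as np

def kmeans(points, k, max_iter=100, tol=1e-6):
    rng = np.random.default_rng(0)
    # 1. Randomly select k data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assign each point to the cluster with the closest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute each centroid as the mean of its assigned points
        #    (empty clusters are not handled in this sketch).
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # 4. Stop when the centroids are stable within a tolerance.
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return labels, centroids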
For the clustering process carried out in this paper, we considered the related comorbidities, together with "Age"; "WeeksContributedLastYear", corresponding to the weeks contributed in the last year; "ContinuousContributedWeeks", corresponding to the weeks contributed since the first affiliation; and two new variables: "frequency", corresponding to the number of services provided to a patient, and "recency", corresponding to the time elapsed since the patient last received medical assistance. In addition, the cohort variables are used for CKD, COPD, AHT, diabetes, cancer, HIV, tuberculosis, asthma, obesity, and transplant.
We determined the most suitable number of clusters through the elbow method [41,42]. To this end, we varied the number of clusters from 1 to 20 and calculated the WCSS (within-cluster sum of squares), i.e., the sum of squared distances between each point and the centroid of its cluster. The point after which the curve stops decreasing quickly is the appropriate value for k, as shown in Figure 5. After choosing the number of clusters, a manual description of the characteristics of each cluster was made in order to identify each group, as seen in Table 5.

Table 5. Description of the identified clusters.

Cluster | Description
0 | HighAge, COPD-AHT
1 | YoungAdult, HEALTHY
2 | Adult, AHT-OBESITY
3-8 | …
9 | SeniorAdult, CANCER-AHT
10 | HighAge, CKD-AHT
11 | Young, HEALTHY, LittleUse
12 | Adult, CANCER
13 | HighAge, COPD-AHT-OBESITY
14 | Young, HEALTHY, RecentUse
To confirm the optimal number of clusters indicated by the elbow technique, we ran the silhouette method, which is also a method for finding the optimal number of clusters and for interpreting and validating the consistency of data within clusters; see Table 6. The silhouette method calculates a silhouette coefficient for each point, which measures how closely the point resembles its own cluster compared to the other clusters. In this case, the optimal number of clusters is 5; however, for a better differentiation of patients with different medical conditions in cohorts, and following suggestions from clinical experts in our organization, who were interested in observing over time which of these groups did or did not have the expected outcomes associated with mortality, higher fatality events, and higher cost events, we decided to use a total of 15 clusters.
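A sketch of this selection procedure with scikit-learn, assuming the clustering variables above are columns of a DataFrame df, is:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

features = df[["Age", "WeeksContributedLastYear", "ContinuousContributedWeeks",
               "Frequency", "Recency"]]  # plus the cohort flags in practice

wcss, silhouettes = [], {}
for k in range(2, 21):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    wcss.append(km.inertia_)  # within-cluster sum of squares (elbow curve)
    # Silhouette on a subsample, since the full dataset is large.
    silhouettes[k] = silhouette_score(features, km.labels_, sample_size=10000, random_state=0)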

Results
After applying the clustering and training predictive models using the LSTM network, we found the set of features that gives the best performance; these features are shown in Table 7 below. For both approaches, the LSTM network model and the clustering were executed with the data grouped into two variables, namely ProvisionDate and InvoicedValue, to predict the cost of services for the more than 1,558,613 patients in the sample between 2017 and 2021. The first 80% of the data were used to train the models, and the remaining 20% were used to assess them.
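A minimal sketch of this grouping and chronological split, assuming the preprocessed invoice DataFrame df from the earlier steps:

# Aggregate invoices into a monthly cost series and split 80/20 in time order.
monthly = (df.set_index("ProvisionDate")["InvoicedValue"]
             .resample("M").sum())

split = int(len(monthly) * 0.8)
train, test = monthly[:split], monthly[split:]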

LSTM Networks
For a summary of the model run with sixteen hidden memory cells, see Equations (2) and (3). Table 8 shows the RMSE for standard models with different numbers of memory cells. The lowest RMSE (89.03) was obtained for a standard LSTM with 16 hidden memory cells. One of the application's features is the prediction for a particular population, showing the current and projected cost. This feature allows us to filter by conditions such as gender, healthcare regime, marital status, and whether the patients belong to a cohort or condition such as diabetes, CKD, or hypertension, as well as additional cohort variables, and to project the cost for one to three months. Figure 6 shows the result with the following filters: female gender and the diabetes cohort.
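A hypothetical Streamlit sketch of these filters follows; the widget labels and the forecast() helper are illustrative, not the application's actual code:

import streamlit as st

gender = st.selectbox("Gender", ["Female", "Male"])
cohort = st.selectbox("Cohort", ["Diabetes", "CKD", "Hypertension", "None"])
horizon = st.slider("Months to project", min_value=1, max_value=3, value=1)

# forecast() is a hypothetical helper returning historical and projected series.
history, projection = forecast(gender, cohort, horizon)
st.line_chart({"Historical": history, "Projected": projection})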

Clustering
In this section, we visually explore the discovered clusters to look for relations and insights. The clusters are examined with respect to patient characteristics, outcomes, and standards of care considering variables such as age, frequency, and recency. A discussion is then presented to better interpret these results.

Distribution by Age Cluster (in Years)
First, we explored the discovered clusters in terms of age. Figure 7 represents the behavior of age across the identified clusters: clusters 0, 3, 7, 8, 10, 12, and 13 contain older people with some health condition, whereas the population of the other clusters is concentrated on younger people.

Distribution by Frequency of Use Cluster
Second, we explored the frequency variable, as shown in Figure 8, where it can be observed that the people in all the groups attend medical consultations quite often.

Distribution by Cluster of Last Attention Time (Recency)
We also explored the users by the recency variable, which measures the time elapsed since the last medical service, as can be seen in Figure 9. All clusters have had at least one recent visit, except for cluster 11, which comprises young patients who have not seen a doctor for a long time.

Distribution by Cluster of Weeks Contributed since Last Year
This corresponds to the number of weeks contributed in the last year, as shown in Figure 10.

Distribution by Cluster of Continuous Contributed Weeks
This shows the number of weeks that the users have been affiliated since their first date of affiliation, as shown in Figure 11. It can be noticed that cluster 8 aggregates old healthy users that have been affiliated for a prolonged period.

The model was evaluated with 4 and 16 memory cells, showing better reliability when the data are first segmented by cluster; 4 cells perform best for all clusters except clusters 1 and 3, where 16 cells give better results. As shown in Table 9, it is generally preferable to use 4 memory cells.
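A sketch of this per-cluster comparison, with make_series() and build_model() as hypothetical helpers wrapping the data preparation and the LSTM definition shown earlier:

from math import sqrt
from sklearn.metrics import mean_squared_error

results = {}
for cluster_id in range(15):
    X_train, y_train, X_test, y_test = make_series(cluster_id)  # hypothetical helper
    for units in (4, 16):
        model = build_model(units)                              # hypothetical helper
        model.fit(X_train, y_train, epochs=100, verbose=0)
        preds = model.predict(X_test).ravel()
        # RMSE per (cluster, units); predictions would be inverse-transformed
        # back to cost units in practice.
        results[(cluster_id, units)] = sqrt(mean_squared_error(y_test, preds))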
After defining the clusters, and according to the cluster selection, we predicted the cost again using LSTM networks; this feature allows the user to choose which cluster to project and over what period. In this case, we chose cluster 3, resulting in the projection shown in Figure 12. As such, patients were better modeled and performance was slightly increased, compared with working with the optimal values in performance provided by the elbow and silhouette methods (see Tables 9 and A1 for details of the performance of both approaches). It is also important to note that allowing 15 clusters, instead of 5, also helped to identify a cluster of inactive patients (cluster 6) and a cluster of 'Young and Healthy with Little Use' patients (cluster 11), whose predictability is not reliable (R² < 0) and which could be biasing the models when using only five clusters.
We reviewed previous cost prediction model studies, namely a standard long short-term memory (LSTM) model and a stacked LSTM model used to predict the monthly drug cost of more than 50,000 patients between 2011 and 2015. For the single-layer LSTM model, the authors obtained an RMSE value of 14.617 and an R² value of 0.8048; for the stacked LSTM model, the RMSE value was 13.693 and the R² value was 0.8159 [12]. Another work predicted the average weekly expenditure of patients on certain pain medications, with different models such as ARIMA, MLP, and LSTM, selecting two medications among the 10 most prescribed pain medications in the US; the LSTM yielded an RMSE value of 143.69 and an R² value of 0.77 for medicine A [22].
Below are the metrics we adopted for each model: root mean square error (RMSE) [43,44]; mean absolute percentage error (MAPE) [45]; R²; and adjusted R² [46]. The most common metric used for regression purposes is the RMSE, which represents the square root of the average squared distance between the actual and the predicted values. It indicates the absolute fit of the model to the data, i.e., how close the observed data points are to the model's predicted values. As the square root of a variance, the RMSE can be interpreted as the standard deviation of the unexplained variance, and it has the useful property of being in the same units as the response variable. Lower RMSE values indicate a better fit [47,48].
Mean absolute percentage error (MAPE) measures the average percentage error, calculated as the average of the absolute percentage errors. MAPE is sensitive to scale and becomes meaningless for low volumes or data with zero-demand periods. When aggregated or used across multiple products, the MAPE result is dominated by low-volume or zero-demand products [45].
R² and adjusted R² are often used for explanatory purposes: they describe how well the selected independent variables explain the variability in the dependent variable. The coefficient of determination, R², is another measure used to assess the performance of a regression model. The metric compares the current model to a constant baseline and tells us how much better our model is; the constant baseline is obtained by taking the mean of the data and drawing a line at that mean. R² is a scale-free score, which implies that regardless of whether the values are excessively large or excessively small, R² will always be less than or equal to 1 [22].
Adjusted R² conveys the same meaning as R² but improves on it. R² suffers from the problem that its score increases as predictors are added even if the model is not improving. The adjusted R² is always smaller than R², as it adjusts for the increasing number of predictors and only shows an improvement if there is a real one [46].
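For reference, the standard definitions of these metrics, with y_i the actual values, ŷ_i the predictions, ȳ their mean, n the number of observations, and p the number of predictors, are:

\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}

\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|

R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}

R^2_{\mathrm{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}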
In summary, when the LSTM network model is executed with the selected data, in this case, women in the diabetes cohort, the data are grouped into two variables, "ProvisionDate" and "InvoicedValue", which are those used in the network. The results are shown in Table 10. After segmenting the patients and executing the LSTM network again for all clusters, we obtained the results shown in Table 11.

Discussion
The purpose of this paper was to show techniques for predicting the costs of patients. The first model is an approach to simulate costs considering the decrease or increase in a particular population of a certain cohort. With the projected cost for each cohort, in case it decreases or increases, we can have an estimate of the costs that the company could save so that it can implement strategies such as investing in promotion and prevention plans for cohorts.
When we made the prediction with the initial values filtered by female gender and diabetes using the LSTM networks, we observed that the RMSE metric shows that, on average, the mean prediction error corresponds to 89.03. In this case, MAPE indicates that, on average, the forecast is wrong by 36.25%. For R², 89% of the variation of the dependent variable is explained by the independent variables of our model; this high R² indicates a strong relationship between ProvisionDate and InvoicedValue. Finally, the adjusted R² shows that 83% of the variability is explained by the model once the number of independent variables is taken into account, as shown in Table 10.
With the other approach, when we do the clustering first using the k-means technique with its fifteen groups and then run the LSTM network for each of the clusters, as shown in Table 11, we obtain better results. Clusters 0, 2, 3, 7, 8, 9, 10, 12, and 13 each have a better average prediction error (RMSE), and MAPE shows a lower forecast error for clusters 0, 2, 3, 4, 7, 9, 10, 12, 13, and 14. The R² for all clusters indicates a strong relationship between the variables InvoicedValue and Date. Finally, the adjusted R² for all clusters explains a higher percentage of the variability. Clusters 1, 5, 11, and 14 have a high adjusted R², which can be interpreted as good; however, their RMSE, as an average prediction error, is too high for projection. The clusters that did not perform as well, e.g., Young, HEALTHY, and LittleUse, comprise users with little history, whose behavior is therefore more complex to predict.
The main implication of our results is that combining the use of the clustering algorithms to identify patient groups with deep learning LSTM networks to predict future costs for these groups enables a more accurate prediction of the costs of patients for healthcare providers.

Conclusions
The results demonstrate the feasibility of segmenting the population by clusters (k-means) and then applying the LSTM network to project the cost of each group. Having a tool that allows the organization to know the cost for the next month, or up to three months ahead, enables it to provision resources better. We do not consider it appropriate to project beyond three months because the model may lose reliability. The results obtained show the validity of the initial approach, which remains probabilistic and based on care events, and which can be improved with the incorporation of clinical variables.
This first phase estimates the probabilistic projection of costs grouped by population segments in order to address a second phase of the project, which aims to consolidate a patient-focused cost model based on medical records. Such a model would allow us not only to predict the potential services and costs related to each patient, but also to identify the operational, clinical, and administrative strategies to improve the quality of life of patients, preventing the accelerated development of diseases and/or events that impair their health, thus providing a better life expectancy and reducing the future costs related to these potential events. This approach allows us to help health organizations be ready to provide healthcare by optimizing costs, giving an accurate diagnosis of diseases, improving service quality by grouping patients, optimizing resources, and improving clinical results [49].
By having more variables for a person, such as demographic variables, the identification of a provisioning event, clinical, diagnostic, and risk variables, and the cost of all the services provided, either with their own infrastructure or third-party infrastructure, the results are more accurate. With all the patient's related variables over time and their cost, it is possible to predict the risks and costs of a person and thus be able to implement survival models.
The goal is to have projected monthly costs, which can be used to assess a chronic patient or a recurring patient and their cost pattern, and model through clusters in cohorts to provide preventive care, allowing the health system to reduce costs and significantly improve the quality of life of patients.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
As indicated by the elbow and silhouette methods, the result of running with 5 clusters is shown, highlighting a slight increase in performance.