Forecasting of the Prevalence of Dementia Using the LSTM Neural Network in Taiwan

Abstract: The World Health Organization has urged countries to prioritize dementia in their public health policies. Dementia poses a tremendous socioeconomic burden, and the accurate prediction of the annual increase in prevalence is essential for establishing strategies to cope with its effects. The present study established a model based on the architecture of the long short-term memory (LSTM) neural network for predicting the number of dementia cases in Taiwan, which considers the effects of age and sex on the prevalence of dementia. The LSTM network is a variant of recurrent neural networks (RNNs); it possesses a special gate structure and avoids the RNN problems of gradient explosion, gradient vanishing, and long-term memory failure. Data on patients diagnosed as having dementia from 1997 to 2017 were collected in annual units from a data set extracted from the Health Insurance Database of the Ministry of Health and Welfare in Taiwan. To further verify the validity of the proposed model, the LSTM network was compared with three types of models: statistical models (exponential smoothing (ETS); the autoregressive integrated moving average model (ARIMA); and the trigonometric seasonality, Box-Cox transformation, ARMA errors, and trend seasonal components model (TBATS)), hybrid models (support vector regression (SVR) and particle swarm optimization-based support vector regression (PSOSVR)), and a deep learning model (artificial neural network (ANN)). The mean absolute percentage error (MAPE), root-mean-square error (RMSE), mean absolute error (MAE), and R-squared (R²) were used to evaluate model performance. The results indicated that the LSTM network has higher prediction accuracy than the three types of models for forecasting the prevalence of dementia in Taiwan.


Introduction
The elderly population has been increasing sharply worldwide. In March 2018, Taiwan officially became an aged society, with an elderly population accounting for 14% of the total population [1]. Although over 25 years elapsed before Taiwan progressed from an aging society to an aged society, Taiwan is estimated to become a super-aged society by 2026 because of its rapid pace of aging [2]. The rapid growth of the elderly population indicates a shift in health care concerns (chronic diseases and conditions) and medical care for older adults, as well as an emphasis on the importance of long-term care, prevention, and resilience.

Statistical forecasting methods have traditionally been applied to such problems. In the autoregressive integrated moving average (ARIMA) model, the time series and past observations are assumed to satisfy a linear relationship [17]; however, most time series data have a nonlinear relationship, which limits the application scope of the ARIMA. On the other hand, ETS forecasts are based on historical data; if ETS is not combined with other methods, forecasting a nonlinear time series cannot produce satisfactory results. These methods rely heavily on linear assumptions and use historical data sets and single-variable or multivariable time series functions to forecast future trends. Their lack of nonlinear fitting capability may limit their further development.
Machine learning (ML) methods, such as artificial neural networks (ANNs) [18] and support vector regression (SVR) [19], have demonstrated excellent nonlinear fitting capabilities in demand forecasting. However, improper parameter settings seriously affect the performance of ANN and SVR methods. Studies have strongly indicated that a method for producing suitable hyperparameters is required [20,21]. Therefore, many hybrid models, such as PSOSVR, GASVR, DESVR, and GSSVR, have been proposed to solve this optimization problem [20,22]. Among the numerous types of neural network structures, recurrent neural networks (RNNs) were developed for time series problems. Research has indicated that RNNs outperform traditional neural networks, such as multilayer perceptrons [23,24]. Cui and Liu used a combination of an RNN and a convolutional neural network (CNN) to classify Alzheimer's disease [25]. Maragatham and Devi (2019) established a heart failure prediction model based on long short-term memory (LSTM) neural networks [26]. In addition, Lipton et al. (2015) used an LSTM network for the classification and diagnosis of patients in hospital pediatric intensive care units [27]. Wang et al. (2019) also developed a deep learning approach that uses longitudinal electronic health records to predict mortality risk and thereby identify patients with dementia who may benefit from palliative care [28]. Another study evaluated the role of deep learning models in identifying surgical behaviors and evaluating surgeons' technical performance [29]. These studies have demonstrated the successful application of various RNN models to numerous medical prediction tasks through the effective use of the temporal relationships in collected patient data.
This study proposes an RNN structure based on an LSTM network to predict trends in the number of patients with dementia. To the best of our knowledge, relatively few studies to date have explicitly forecast dementia prevalence worldwide. Most research has focused on identifying patients with dementia and on the classification of dementia [30][31][32], which constitutes substantial progress for physicians. However, governments and policymakers are chiefly concerned with the strategic layout and budgeting of medical care. With the increasing number of patients with dementia [33], the cost of care is also increasing, which is one of the focuses of this study. Recently, Kingston et al. forecasted the care needs of the older population in England over the next 20 years using the PACSim model [34]. Ahmadi-Abhari et al. developed a Monte Carlo Markov model to predict the number of people living with dementia through 2050 and provided estimates of the impact of smoking cessation [35]. In Taiwan, no relevant research has examined the relationship among the number of patients with dementia, nursing costs, and policy promotion; moreover, the LSTM network, despite its excellent performance in sequence prediction [36], has not been used to predict the number of people with dementia. Therefore, this study aims to establish a prediction model that accurately predicts the number of people with dementia and thereby provides a reference for government budgeting and administration. Comparisons with a series of benchmark models verified the superiority of the proposed model. On the basis of these findings, this paper presents further constructive recommendations to actively support dementia prevention and care, which can considerably improve the health care process for caregivers and society. Section 2 discusses the LSTM architecture and other prediction models.
Sections 3 and 4 respectively explain how the LSTM network can be used for regression problems and present the experimental results of a forecasting application.

Data Sources and Preprocessing
The data were extracted from the Health Insurance Database of the Ministry of Health and Welfare in Taiwan and included the annual number of dementia patients over 60 years old. According to the availability of information from 1997 to 2017, we applied the proposed LSTM method to the extracted data to determine the trend. The model methodology is elaborated below. The annual data from 1997 to 2013 were used as a training set to train the proposed LSTM method. Subsequently, the testing set, which consisted of the annual number of dementia patients for 2014-2017, was used to test the accuracy of the forecast.
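The chronological split described above can be sketched as follows (the counts are placeholder values, not the actual Ministry of Health and Welfare figures):

```python
# Hypothetical annual counts of dementia patients (illustrative values only,
# not the actual Ministry of Health and Welfare figures).
years = list(range(1997, 2018))                 # 1997-2017, 21 annual observations
counts = [1000 + 450 * i for i in range(len(years))]

# Chronological split used in the study: 1997-2013 for training,
# 2014-2017 for testing. Time series data must not be shuffled.
train = [c for y, c in zip(years, counts) if y <= 2013]
test = [c for y, c in zip(years, counts) if y >= 2014]

print(len(train), len(test))  # 17 4
```

Keeping the test years strictly after the training years preserves the temporal ordering the LSTM relies on.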

LSTM Network
Time series forecasting is emerging as one of the most important branches of data analysis. However, traditional time series forecasting models often yield poor forecasting accuracy because such methods require large sequence data features [37]. Data collected at fixed time intervals are called time series data; each data point is equally spaced in time. Time series prediction forecasts future trends and patterns from a historical data set using its temporal characteristics. Predicting the number of patients with dementia using input data with a time component and a model that differs from traditional regression methods may therefore be effective.

Figure 1 presents a traditional RNN. The input of the RNN is a sequence x = (x_1, ..., x_T), which is processed recursively. When processing each symbol, the RNN maintains an internal hidden state s. The parameters of the network are the recursive weight matrix W, the input weight matrix U, and the output weight matrix V. The operation of the RNN at each time step t can be expressed as

s_t = σ(U x_t + W s_{t−1}),

where σ is the activation function and t = 1, 2, ..., T. The output of the RNN is calculated as

o_t = V s_t.

A one-step-ahead forecast in a time series requires both the previous data and the most recent data. An RNN model has the advantage of a hidden-layer self-feedback mechanism, but it cannot avoid long-term dependence problems, and practical applications therefore still face some difficulties [38].
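The recurrence above can be sketched in a few lines of NumPy (a minimal illustration with toy dimensions; the weights are random placeholders, not trained values):

```python
import numpy as np

def rnn_step(x_t, s_prev, U, W, V):
    """One time step of the vanilla RNN described in the text:
    s_t = sigma(U x_t + W s_{t-1}),  o_t = V s_t,
    with sigma taken as tanh (a common choice of activation)."""
    s_t = np.tanh(U @ x_t + W @ s_prev)
    o_t = V @ s_t
    return s_t, o_t

rng = np.random.default_rng(0)
hidden, n_in, n_out = 4, 1, 1
U = rng.normal(size=(hidden, n_in))    # input weights
W = rng.normal(size=(hidden, hidden))  # recursive weights
V = rng.normal(size=(n_out, hidden))   # output weights

s = np.zeros(hidden)                   # initial hidden state
for x in [0.1, 0.2, 0.3]:              # a toy length-3 sequence
    s, o = rnn_step(np.array([x]), s, U, W, V)
print(o.shape)  # (1,)
```

The hidden state s carries information forward from one step to the next, which is the self-feedback mechanism the text refers to.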
The first LSTM neural network was proposed by Hochreiter and Schmidhuber (1997) and was specifically designed to solve the problem of long-term dependence [39]. The LSTM memory block consists of three gates and a memory cell: the input gate, the output gate, the forget gate, and the memory cell that they control. The gating of information flow is displayed in Figure 2. The LSTM network is a variant of RNNs and has been used in numerous practical settings in fields such as biomedical sciences [40], speech recognition [41], sentiment analysis [42], and image classification [43]. The input gate controls whether the input signal can modify the state of the memory cell; the output gate controls whether the state of the memory cell can modify the output of the unit; and the forget gate controls whether the memory cell retains or resets its previous state.
The LSTM network is an effective algorithm for establishing time series models. The basic component of the LSTM network is the memory block, which mitigates the gradient vanishing problem by preserving its state over long periods of time. The gates and memory cell of the LSTM network are computed with the following formulas. At time t, x_t is the input data of the LSTM unit, h_t is the output of the LSTM unit, h_{t−1} is the output of the LSTM unit at the previous time step, and C_t is the value of the memory cell. The process of the LSTM network can be divided into the following steps.
(1) Calculate the value of the candidate memory cell C̃_t, where W_c is the weight matrix and b_c is the bias:

C̃_t = tanh(W_c · [h_{t−1}, x_t] + b_c).

(2) Calculate the value of the input gate I_t. The input gate controls the update of the current input data to the state value of the memory cell, where σ is the sigmoid function, W_i is the weight matrix, and b_i is the bias:

I_t = σ(W_i · [h_{t−1}, x_t] + b_i).

(3) Calculate the value of the forget gate F_t. The forget gate controls the update of the historical data to the state value of the memory cell, where W_f is the weight matrix and b_f is the bias:

F_t = σ(W_f · [h_{t−1}, x_t] + b_f).

(4) Calculate the value of the current memory cell C_t, where C_{t−1} is the state value of the previous LSTM unit and ⊙ denotes element-wise multiplication:

C_t = F_t ⊙ C_{t−1} + I_t ⊙ C̃_t.

(5) Calculate the value of the output gate O_t, where W_o is the weight matrix and b_o is the bias:

O_t = σ(W_o · [h_{t−1}, x_t] + b_o).

(6) Calculate the output of the LSTM unit h_t, where tanh is a non-linear activation that squashes the permissible amplitude range of the output signal to a finite value:

h_t = O_t ⊙ tanh(C_t), with tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x}).

The LSTM network employs three control gates and a memory cell to save, read, reset, and update long-term information. Because the LSTM network's internal parameters are shared across time steps, the dimensions of the weight matrices can be set to control the dimension of the output. The LSTM network can thus bridge long time lags between input and feedback.
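The gate computations in steps (1)-(6) above can be sketched as a single NumPy time step (toy dimensions with random placeholder weights; an illustration of the cell mechanics, not the study's trained model):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step following steps (1)-(6) in the text; each weight
    matrix acts on the concatenation [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])
    c_tilde = np.tanh(p["Wc"] @ z + p["bc"])   # (1) candidate memory cell
    i_t = sigmoid(p["Wi"] @ z + p["bi"])       # (2) input gate
    f_t = sigmoid(p["Wf"] @ z + p["bf"])       # (3) forget gate
    c_t = f_t * c_prev + i_t * c_tilde         # (4) memory cell update
    o_t = sigmoid(p["Wo"] @ z + p["bo"])       # (5) output gate
    h_t = o_t * np.tanh(c_t)                   # (6) unit output
    return h_t, c_t

rng = np.random.default_rng(1)
n_h, n_x = 3, 1
p = {f"W{g}": rng.normal(size=(n_h, n_h + n_x)) for g in "cifo"}
p.update({f"b{g}": np.zeros(n_h) for g in "cifo"})

h, c = np.zeros(n_h), np.zeros(n_h)
for x in [0.5, -0.2, 0.8]:                     # a toy length-3 sequence
    h, c = lstm_step(np.array([x]), h, c, p)
print(h.shape)  # (3,)
```

The forget and input gates jointly decide how much of the old cell state C_{t−1} survives and how much of the candidate C̃_t is written, which is what lets the cell retain information over long spans.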
An entire univariate or multivariate time series can be used to train LSTM networks. To improve the learning process and effectiveness of the model, smaller subsamples from the original time series were defined in this study. With T as the sequence length, the sequences z(t) = [x_t, x_{t+1}, ..., x_{t+T−1}] ∈ R^T and y(t) = x_{t+T−1+q} ∈ R are the t-th input and output of the LSTM network, respectively; q is a positive integer that indicates the number of steps ahead to be predicted, and N is the total number of subsamples, which depends on the length of the original time series and the sequence length.
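The subsample construction described above can be sketched as follows (illustrative values; T and q as defined in the text):

```python
def make_subsamples(series, T, q=1):
    """Build (z, y) pairs as defined in the text: each input z is a window
    of length T and the target y lies q steps beyond the window's last
    point, i.e. y = series[t + T - 1 + q]."""
    pairs = []
    for t in range(len(series) - T - q + 1):   # N = len(series) - T - q + 1
        z = series[t:t + T]
        y = series[t + T - 1 + q]
        pairs.append((z, y))
    return pairs

series = [10, 12, 15, 19, 24, 30]              # toy annual counts
pairs = make_subsamples(series, T=3, q=1)
print(pairs[0])  # ([10, 12, 15], 19)
```

With a series of length 6, T = 3, and q = 1, this yields N = 3 subsamples, matching the formula for N above.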

Statistical Models
Statistical predictions are usually divided into two categories: qualitative and quantitative methods. Time series analysis is a quantitative prediction method that is widely used in mathematical statistics, signal processing, financial prediction, electroencephalography, and other fields, and it is beneficial to economic and scientific progress. To objectively present the most robust method with a low error rate, the following three methods were selected for comparison.

ETS (Exponential Smoothing)
Proposed by Brown and Meyer (1961) [44], ETS is a data averaging method that considers three factors: error, trend, and season. Maximum likelihood estimation is used in ETS to optimize the initial values and parameters, and the optimal exponential smoothing model is then selected. Moreover, the weights that ETS assigns to observations decay exponentially: the latest data receive higher weights than older data, and the weights of older data decrease gradually. The ETS algorithm overcomes the limitations of previous exponential smoothing models but does not provide a convenient method for calculating forecasting intervals.
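The exponential decay of weights described above can be illustrated with simple exponential smoothing (a sketch of the level-only special case; the full ETS framework additionally models trend and seasonal components):

```python
def exp_smooth(series, alpha):
    """Simple exponential smoothing: each update gives weight alpha to the
    newest observation, so older observations are discounted by a factor
    of (1 - alpha) per step. Returns the final level, which serves as a
    one-step-ahead forecast. Illustrative sketch only."""
    level = series[0]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
    return level

print(exp_smooth([10.0, 12.0, 11.0, 13.0], alpha=0.5))  # 12.0
```

A larger alpha makes the forecast track recent observations more closely; a smaller alpha yields a smoother, more history-weighted forecast.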

ARIMA (Autoregressive Integrated Moving Average)
Proposed by Box and Jenkins (1976), the ARIMA model, also known as the Box-Jenkins model, uses lagged segments of the past time series as input, and a prediction model is established on the basis of regression analysis [13]. This model is frequently used for the prediction of short-term trends in economic contexts.

TBATS (Trigonometric Seasonality, Box-Cox Transformation, ARMA Errors, and Trend Seasonal Components Model)
This method, which was proposed by De Livera, Hyndman, and Snyder (2011), combines trigonometric seasonality, Box-Cox transformation, ARMA errors, and trend and seasonal components [45]. This approach can be used to analyze and forecast seasonal time series on the basis of the exponential smoothing method. The combination of multiple models can achieve more accurate results but also requires more training time, resulting in slower calculation.

SVR (Support Vector Regression)
SVR extends the support vector machine (SVM) to regression problems [19]. SVR includes an insensitive loss function and a penalty factor to enhance the robustness of SVMs [46,47]. SVR involves the projection of data to a high-dimensional hyperplane and the subsequent calculation of the total distance from each point to the hyperplane. The hyperplane with the smallest total distance is identified as the solution. SVR has three hyperparameters: the regularization parameter (C), the kernel function bandwidth (σ), and the ε-insensitive loss function (ε). Changes to these parameters considerably affect the accuracy of SVR prediction. However, automatically adjusting these three hyperparameters remains a challenge in improving the accuracy of SVR prediction.

PSOSVR (Particle Swarm Optimization-Based SVR)
Particle swarm optimization (PSO) was proposed by Kennedy and Eberhart (1995) and is based on the flight motion of foraging birds [48]. A bird's movement continuously reveals places with food, thereby updating the position of the entire group until the optimal location is finally identified. Defining the stopping criterion and the constraint-handling mechanism is crucial in a PSO search space. If the number of iterations of the PSO exceeds a predetermined threshold, the PSO stops. PSO is a global optimization approach that can effectively select the optimal combination of internal parameters for an SVR model and improve the model's prediction accuracy and generalization ability (Liu et al., 2018). The fitness function is calculated in each optimization step of the PSO to determine the solution for the parameters (C, σ, and ε).
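The PSO loop described above (personal bests, a global best, and an iteration-count stopping criterion) can be sketched in one dimension; the study applies it to SVR's three parameters (C, σ, ε), but the mechanics are the same. The inertia and acceleration coefficients below are common textbook choices, not values from the paper:

```python
import random

def pso(f, lo, hi, n_particles=20, iters=60, seed=0):
    """Minimal 1-D particle swarm minimizing fitness f on [lo, hi].
    Each particle is pulled toward its own best position (pbest) and the
    swarm's best position (gbest); the search stops after a fixed number
    of iterations, as described in the text."""
    rng = random.Random(seed)
    xs = [rng.uniform(lo, hi) for _ in range(n_particles)]
    vs = [0.0] * n_particles
    pbest = xs[:]                      # personal best positions
    gbest = min(xs, key=f)             # global best position
    for _ in range(iters):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            vs[i] = (0.7 * vs[i]                       # inertia
                     + 1.5 * r1 * (pbest[i] - xs[i])   # cognitive pull
                     + 1.5 * r2 * (gbest - xs[i]))     # social pull
            xs[i] = min(max(xs[i] + vs[i], lo), hi)    # clamp to bounds
            if f(xs[i]) < f(pbest[i]):
                pbest[i] = xs[i]
        gbest = min(pbest, key=f)
    return gbest

# Toy fitness: the minimum of (x - 3)^2 lies at x = 3.
best = pso(lambda x: (x - 3.0) ** 2, lo=-10, hi=10)
print(abs(best - 3.0) < 0.1)  # True
```

In the PSOSVR setting, f would evaluate an SVR model's validation error for a candidate (C, σ, ε) triple instead of a toy quadratic.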

Deep Learning Model (Artificial Neural Network, ANN)
ANNs are inspired by the structure of the human nervous system. A neural network is a collection of interrelated 'neurons' in a self-adjusting system. An ANN arithmetically adjusts its weights (free parameters) to meet performance requirements using representative samples. Because their learning process involves the use of historical data, ANNs demonstrate high effectiveness in complex problems [49]. Back-propagation networks are well-known supervised learning neural network models [50] that consist of an optimization algorithm combining a backward pass, gradient descent [51], and the chain rule of calculus. The gradient descent method identifies the steepest downhill direction at the parameter's current position and updates the parameter accordingly; the slope information is obtained by differentiating the function. The gradient descent method uses a cost function to optimize the weights in the ANN.

Table 1 presents a summary of the data collected from the Department of Statistics of Taiwan's Ministry of Health and Welfare. The number of people with dementia over the age of 60 years registered from 1997 to 2017 was calculated. The data used in this study were grouped by gender and age, and the maximum, minimum, mean, median, first quartile, third quartile, interquartile range (IQR), standard deviation (SD), and coefficient of variation (CV) were calculated for each group. The CV values were used to determine the degree of dispersion of a set of data around the average value, with a larger CV value indicating a higher degree of dispersion. The data indicated that older age was associated with a higher degree of dispersion. The overall number of dementia cases was higher among women than among men in each age group.

SVR has three hyperparameters that affect the accuracy of the forecasting task, namely the tube size of the ε-insensitive loss function (ε), the regularization parameter (C), and the bandwidth of the kernel function (σ).
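The gradient descent update described above can be illustrated on a single linear neuron with a squared-error cost (toy data and learning rate; a sketch of the update rule, not the study's network):

```python
import numpy as np

# Toy data following y = 2x + 1; gradient descent should recover the
# weight w = 2 and bias b = 1 by repeatedly stepping downhill on the
# mean squared error cost.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

w, b, lr = 0.0, 0.0, 0.05
for _ in range(2000):
    pred = X[:, 0] * w + b
    err = pred - y
    # Gradients of the mean squared error with respect to w and b
    w -= lr * 2 * np.mean(err * X[:, 0])
    b -= lr * 2 * np.mean(err)

print(round(w, 2), round(b, 2))  # 2.0 1.0
```

Back-propagation applies exactly this update to every weight in a multi-layer network, with the chain rule supplying each weight's gradient.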
The parameter settings used in this study are displayed in Tables 2 and 3. Table 4 presents a comparison of the predicted results of the six time series prediction models and the proposed LSTM model for male patients with dementia. The models were divided into statistical, hybrid, and deep learning models to clarify the relationships among the prediction models. The statistical models comprise ETS, ARIMA, and TBATS; the hybrid models comprise SVR and PSOSVR; and the deep learning model is represented by the ANN. These six models were compared with the prediction results of the proposed LSTM model. Among the statistical models, TBATS exhibited favorable performance compared with ETS and ARIMA, with mean absolute percentage error (MAPE) values 20% and 21% lower, respectively. Among the hybrid models, PSOSVR also demonstrated favorable performance compared with the SVR and ANN models, with MAPE values 65% and 16% lower, respectively.

Comparisons of the predicted results of the six time series prediction models and the proposed LSTM model for female patients with dementia are displayed in Table 5. Similarly, the prediction models were divided into three categories: statistical, hybrid, and deep learning models. Among the statistical models, the MAPE of TBATS was relatively high compared with those of ETS and ARIMA, which were 46% and 24% lower, respectively. This discrepancy is principally related to the model's poor performance in the 65~69 years old group, for which the MAPE value was as high as 30.48%. The inability to modify or iterate the parameters is the major disadvantage of the statistical models. To improve the results, PSOSVR was compared with SVR; the MAPE of PSOSVR was 46% lower, successfully reducing the error percentage. However, when PSOSVR was compared with the ANN, its MAPE value was 2% higher, which may be because PSOSVR had not yet identified the optimal solution. Nevertheless, its prediction results were close to those of the neural network, which demonstrates the value of the hybrid models.
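The evaluation metrics used in these comparisons (MAPE, RMSE, MAE, and R²) can be implemented directly; the actual/predicted values below are illustrative placeholders, not figures from the study:

```python
import math

def mape(actual, pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, pred)) / len(actual)

def rmse(actual, pred):
    """Root-mean-square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual))

def mae(actual, pred):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

def r_squared(actual, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, pred))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [100.0, 110.0, 120.0, 130.0]   # illustrative observed counts
pred = [98.0, 112.0, 119.0, 133.0]      # illustrative forecasts
print(round(mape(actual, pred), 2))  # 1.74
```

Lower MAPE, RMSE, and MAE and higher R² indicate better fit, which is how the model rankings in Tables 4 and 5 are read.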

Patients Aged 60~64 Years
As shown by the blue curves in Figure 3, the slope of the training set (1997-2012) was similar to the slope of the test set (2013-2017), indicating that the trend of the overall patient population was stable. Therefore, the predicted results produced by each model were relatively similar. The SVR prediction curve deviated from the actual values in Figures 4a and 5a because SVR used the grid search method for parameter adjustment, meaning that the searched hyperparameters may not have included the optimal solution; thus, the PSOSVR predicted values were closer to the actual values. However, optimization algorithms such as PSO do not guarantee that the output is the global optimum; therefore, there remains room for improvement in the prediction ability. The predicted values of the LSTM network were closer to the observed values owing to the network's ability to memorize the overall curve trend and to control the output values through its gates. Therefore, the LSTM network demonstrated the most favorable prediction accuracy for this age group.

Patients Aged 65~69 Years
As indicated by the orange curve in Figure 3, the number of patients steadily increased. Therefore, adapting to the overall trend is not demanding for the algorithms. Both Figures 4b and 5b indicate that the LSTM network was more sensitive to changes in the trend; thus, its performance was the most favorable for this age group. Although the curves in Figure 5b suggest that the results obtained with ARIMA and TBATS agreed more closely with the observed values, the errors of the output values were too large to permit accurate prediction.

Patients Aged 70~74 Years
Among the patients aged 70~74 years, the prediction curve of the LSTM network for female patients was slightly behind the trend, and the estimation was affected by the training data. The gray curve in Figure 3b indicates that the number of female patients decreased from 2013 to 2014, which may have caused inconsistency in the test and training data set trends. Therefore, because of the LSTM network's sensitivity to changes in training data trends, the LSTM network predictions were slightly inferior to those of ANN models, which used neural networks to form direct predictions. PSOSVR and TBATS performed better than the single models.

Patients Aged 75~79 Years
The prevalence of dementia among patients in the age range of 75 to 79 years steadily increased, as illustrated by the yellow curves in Figure 3. This regular growth should have been relatively easy for each model to predict. However, the PSOSVR prediction curve in Figure 4d was completely parallel (and close) to the SVR prediction curve. Even under PSO, SVR failed to identify more suitable parameters, resulting in the use of similar parameters for both PSOSVR and SVR.

Patients Aged 80~84 Years
Male and female patients in this age range exhibited considerably different trends from 2014 to 2017 (see Figure 3). The curves of male patients flattened or even decreased, whereas those of female patients increased significantly. Because of the abnormal downward trend in male patients' test data (Figure 3), obtaining accurate prediction results was difficult for the algorithms. These challenges are reflected in the predictions of SVR, PSOSVR, ANN, and other statistical models. Due to its memorization ability, the LSTM network produced more stable predictions, consequently achieving the highest prediction accuracy for this age group.
Similar and weak prediction results were obtained by other statistical models for this age group. In the case of a large deviation between the training and test data, both the ANN and LSTM network maintained a certain level of prediction accuracy. The prediction result of the LSTM network was still the most reliable because of its high sensitivity to changes in the training data. Minor trends can be included to improve the training of the model, thereby assisting the LSTM network in achieving even higher prediction accuracy.

Patients Over 85 Years Old
Patients over 85 years of age constituted the largest age group of patients with dementia. As the population aged, the number of patients over the age of 85 years also increased in the test set. SVR, PSOSVR, and the other statistical models exhibited difficulty in predicting small trends in the training data set. Even though the prediction results of both the ANN and LSTM network were close to the actual values, the LSTM network matched the actual values exactly in some segments. Overall, compared with the other models, the LSTM network produced more accurate predictions of small trends.

Discussion
In this study, we analyzed the patients diagnosed as having dementia from 1997 to 2017 in annual units from a data set extracted from the Health Insurance Database of the Ministry of Health and Welfare in Taiwan. To further verify the validity of the proposed model, the statistical models (ETS, ARIMA, and TBATS), hybrid models (SVR and PSOSVR), and deep learning models (ANN and LSTM) were compared. Overall, the RMSE and MAPE values demonstrated that the LSTM network performs better than the other existing models. In this section, we discuss the statistical models, hybrid models, and deep learning models.

Statistical Models: Comparison of the ETS, ARIMA, and TBATS Models
The statistical models used for comparison were the ETS, ARIMA, and TBATS models, as listed in Tables 4 and 5. Both ETS and ARIMA are classic time series forecasting models. However, both have disadvantages for predictions based on data from multiple time periods. If the series has a unit root (a nonstationary series) or is not adjusted to the appropriate lag order, the model cannot achieve high accuracy. Therefore, complicated preprocessing, such as statistical testing of the sequence, is required. Developed from ETS, TBATS is a seasonal model that can predict seasonal time series more effectively. The research results indicate that TBATS has a higher average error for predictions among female patients, which may be because of the unstable number of dementia patients; therefore, TBATS did not exhibit sufficient performance improvements.

Hybrid Models: Comparison of the SVR and PSOSVR Models
Support vector regression (SVR) is a popular choice for prediction and curve fitting in both linear and nonlinear regression tasks. Formulated as an optimization problem, SVR determines the optimal regression model by using the epsilon-insensitive loss function, with the data mapped to a hyperplane in the solution space. This model has the advantage of adapting to multidimensional tasks and producing suitable predictions for nonlinear data. Therefore, this study used SVR for sequence prediction. The results revealed that the prediction error of SVR was higher than that of the statistical models because SVR has three hyperparameters, namely C, σ, and ε. If these hyperparameters are not properly adjusted, the model cannot achieve optimal predictive performance. PSO has been widely used to solve this hyperparameter optimization problem [49,52]. Accordingly, PSO was used in the present study to tune the SVR hyperparameters to an optimal combination, thereby reducing the prediction error. Tables 3 and 4 indicate that the error rate of PSOSVR was considerably lower than that of SVR and better than that of most statistical models.
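The PSO search itself is model-agnostic: each particle encodes a candidate hyperparameter vector, and the swarm is pulled toward personal and global bests of an objective such as cross-validation error. The following minimal sketch (pure Python, illustrative only) minimizes a toy quadratic surface standing in for the SVR validation error over a two-dimensional search space; the coefficient values and the stand-in objective are assumptions, not the study's settings.

```python
import random

def pso_minimize(objective, bounds, n_particles=20, iters=100, seed=1):
    """Minimal particle swarm optimization over a box-bounded space.
    Returns (best_position, best_value)."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                       # personal bests
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]      # global best
    w, c1, c2 = 0.7, 1.5, 1.5                         # inertia, cognitive, social weights
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = min(max(pos[i][d] + vel[i][d], bounds[d][0]),
                                bounds[d][1])
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

# Stand-in for an SVR cross-validation error surface over, e.g.,
# (log C, log sigma); its known optimum is at (1.0, -2.0).
toy_cv_error = lambda p: (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2
best, err = pso_minimize(toy_cv_error, bounds=[(-3, 3), (-3, 3)])
print(round(best[0], 1), round(best[1], 1))
```

In the actual PSOSVR pipeline, the objective would train an SVR with the candidate (C, σ, ε) and return its validation error, so each swarm iteration is as expensive as n_particles model fits.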

Deep Learning Models: Comparison of the ANN and LSTM Network Models
ANNs are developed through imitation of neuron transmission in the human brain. A shallow neural network based on back-propagation has the advantages of efficient training, high accuracy, and suitability for noisy data sets. ANNs demonstrate excellent predictive ability but also have numerous shortcomings, such as sensitivity to multiple hyperparameters, proneness to overfitting, gradient vanishing, gradient explosion, and long-term dependence problems. Numerous neural network models have been used to solve time series forecasting problems. RNNs have received extensive attention [50] because of their internal state and short-term memory. RNNs store a state vector at each step, which is especially important when the input data contain short-term correlations. However, because of the vanishing gradient problem, the model has difficulty learning the long-term correlations of the input sequence when stochastic gradient descent is used for training.
In the LSTM network, the special valve structure (gate) can avoid gradient vanishing in a deep network. Furthermore, the memory unit enhances the long-term memory capability and overall prediction efficiency. The results of this study confirm that the LSTM network has the lowest prediction error, with the average MAPE falling between 2.50% and 3.12%, demonstrating its excellent prediction accuracy. This study used p values and R² (coefficient of determination) to statistically analyze the prediction results and verify the significance and interpretability of the proposed model. A significant p value indicates that the difference between two models is meaningful, confirming that the LSTM network had a significantly lower prediction error than the other models. Tables 3 and 4 show that all models exhibited significant differences. The prediction ability of the LSTM network was higher than that of the ETS, ARIMA, TBATS, SVR, PSOSVR, and ANN models. The R² value reflects the proportion of variance of the dependent variable that can be explained by the independent variable, and it is often used for regression models. Higher R² values indicate better explanatory power. Table 6 demonstrates that most of the R² values of the LSTM network were substantially higher than those of the ETS, ARIMA, TBATS, SVR, PSOSVR, and ANN models, which indicates that the LSTM model optimally fit the original data and had the highest explanatory power of all models.
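The gate mechanism described above can be made concrete with a single-unit LSTM step in plain Python. The weights below are hand-picked, illustrative values (not learned parameters from the study) chosen so that the forget gate is saturated open and the input gate shut, making the cell state pass through many steps almost unchanged, which is precisely the mechanism that mitigates vanishing gradients over long sequences.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, W):
    """One step of a single-unit LSTM cell.
    W maps each gate name to its (input weight, recurrent weight, bias):
    forget (f), input (i), candidate (g), output (o)."""
    f = sigmoid(W['f'][0] * x + W['f'][1] * h_prev + W['f'][2])  # forget gate
    i = sigmoid(W['i'][0] * x + W['i'][1] * h_prev + W['i'][2])  # input gate
    g = math.tanh(W['g'][0] * x + W['g'][1] * h_prev + W['g'][2])  # candidate
    o = sigmoid(W['o'][0] * x + W['o'][1] * h_prev + W['o'][2])  # output gate
    c = f * c_prev + i * g      # additive memory update: old state can pass through
    h = o * math.tanh(c)        # hidden state exposed to the next layer
    return h, c

# Hand-picked weights: forget gate ~1, input gate ~0, output gate ~1.
W = {'f': (0.0, 0.0, 10.0), 'i': (0.0, 0.0, -10.0),
     'g': (1.0, 0.0, 0.0), 'o': (0.0, 0.0, 10.0)}
h, c = 0.0, 1.0
for x in [0.5, -0.3, 0.8]:
    h, c = lstm_step(x, h, c, W)
print(round(c, 3))  # cell state stays near its initial value of 1.0
```

Because the cell-state update is additive (c = f * c_prev + i * g) rather than a repeated matrix multiplication, gradients flowing through the state are not forced to shrink at every step, unlike in a plain RNN.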

Dementia Prevention and Interventions
With declining mortality in younger populations, dementia is expected to become one of the greatest global health concerns of the 21st century. Although dementia is not curable, its management and the delay of its manifestation are considered to be theoretically possible. Studies have reported that the course of the disease can be modified with adequate care [53], which supports the focus on the manipulation of modifiable risk factors. In 2017, nine potentially modifiable risk factors were reported, including hypertension, obesity, depression, and low social contact [3]. Three new modifiable risk factors, namely traumatic brain injury, excessive alcohol consumption, and air pollution, were introduced in 2020, with convincing evidence [3]. Approximately 30-50% of dementia cases are attributed to these potentially modifiable risk factors. A reduction of 10-25% of these risk factors can reduce the number of patients with dementia by 1.1 to 3 million worldwide [54]. Furthermore, postponement of the onset of dementia by even 2 years can reduce the burden on public health, society, interpersonal relationships, and the economy [55]. Based on this study's proposed model, policy administrators, medical workers, and stakeholders can implement more effective and extensive societal policies on dementia prevention and care among society. The following suggestions on dementia prevention and care are provided for the aim of a more dementia-friendly society.
Promoting resilience in an aging society is a far-reaching approach to dementia prevention. The maximization of care quality and reduction of dementia incidence should begin at the community level, including through the promotion of dementia awareness and knowledge. According to the UK National Institute of Health and Care Excellence and the US National Institute of Health, social isolation is a potentially modifiable risk factor [56,57]. Aging people may experience loneliness and a lack of social contact and social participation, and the promotion of social engagement opportunities is necessary within the community. Moreover, education and intellectual stimulation have been demonstrated to enhance cognitive resilience later in life [58]. Therefore, within communities, the establishment of supportive social networks that encourage interaction will alleviate loneliness and thereby reduce the risk of dementia.
The cost and burden of dementia care are tremendous and continue to rise as the global population ages. The average total cost incurred by patients with dementia exceeds that of patients with other diseases [8]. Patients with dementia are often elderly people approaching their last years of life; thus, their workforce productivity is naturally weaker. Because these patients have relatively little capacity to cope with a household financial crisis, the illness adds to their cognitive and physical burden and hinders the ability of their families to afford future health care [8]. Therefore, especially financially, dementia care often calls for more medical health care support than other illnesses [9]. To actively promote high-quality dementia care, additional medical expenditure on dementia care and prevention is necessary and strongly recommended.
Furthermore, the prevalence of dementia affects not only patients but also the family members and health care workers who must live with these patients and deal with the behavioral and emotional effects of dementia. As mentioned earlier, patients with dementia often also experience disorientation, confusion, mood instability, and behavioral or psychological symptoms. As a result of this prolonged high pressure, studies have reported that informal caregivers of patients with dementia often develop poor mental health and have a relatively high mortality rate [5,10-12], which results in further socioeconomic problems. Additional dementia-care training is necessary for the development of adequate care, and it should also emphasize caregivers' mental and physical health. Future policies should address not only the quality of care for patients with dementia but also their family caregivers; an expansion of the medical allowance and societal support for this population should be considered.
Prevention is more effective than a cure. Proper social welfare and public health policies necessitate a precise model for predicting the prevalence rate of dementia. The purpose of this study was to examine whether an alternative prediction model, namely the proposed LSTM network, could effectively predict the trends among the population of patients with dementia. The results demonstrate that the proposed model was not only applicable, but also significantly more accurate than the other models. This precise model can successfully predict the prevalence of dementia and can thus aid government administrations in the development of relevant strategies. For example, policymakers can manage the budget allocated to dementia care to reduce its occurrence by implementing legislative changes, developing preventive interventions for younger populations, and providing ongoing education and care for elderly adults and their families. Future research is warranted to investigate the performance of the proposed LSTM network for the prediction of trends in other illnesses.

Contribution of This Paper
In this study, we analyzed the patients diagnosed as having dementia from 1997 to 2017 and used seven models to forecast the number of patients. The contributions of this paper are as follows: (1) analysis of the dementia patient data and identification of its long-term dependency; (2) construction of the LSTM forecasting model; (3) successful forecasting of the prevalence of dementia using the LSTM model; and (4) provision of aid to government administrations in developing relevant strategies, such as managing the budget allocated to dementia care, implementing legislative changes, developing preventive interventions for younger populations, and providing ongoing education and care for elderly adults and their families. Furthermore, the successful application of the LSTM network to the sequence prediction task of this study suggests that the prediction of dementia prevalence can be further improved if more clinical variables are analyzed in the future, which fulfills the original intention of this study. Lastly, the LSTM model can also be widely applied in many fields, such as vessel trajectory prediction [59], tidal level forecasting [60], financial market forecasting [61], and real-time crash risk prediction.

Conclusions
The accurate prediction of the trends and prevalence of dementia among people of different genders and ages would strongly assist in providing evidence for the development of interventions to prevent or delay dementia onset. The proposed LSTM network demonstrated a higher prediction accuracy compared with the ETS, ARIMA, TBATS, SVR, PSOSVR, and ANN models. The prevalence was further analyzed among patients from different gender and age groups to further elucidate the prediction results. Continued effort in the development of advanced prediction models can provide evidence for health care professionals to further improve the care and interventions for people with dementia and their family caregivers. Successful dementia prevention, treatment, and support programs would dramatically reduce the burden on health care systems, individuals, societies, and economies. As the aging population continues to grow, the development of health and social care strategies for patients with dementia using accurate time series models will inevitably be an ongoing process. Being equipped to adequately address dementia will likely be one of the ultimate indicators of societal advancement in the future world.

Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.