A Combined Model of SARIMA and Prophet Models in Forecasting AIDS Incidence in Henan Province, China

Acquired immune deficiency syndrome (AIDS) is a serious public health problem. This study aims to establish a combined model of seasonal autoregressive integrated moving average (SARIMA) and Prophet models based on an L1-norm to predict the incidence of AIDS in Henan province, China. The monthly incidences of AIDS in Henan province from 2012 to 2020 were obtained from the Health Commission of Henan Province. A SARIMA model, a Prophet model, and two combined models were adopted to fit the monthly incidence of AIDS using the data from January 2012 to December 2019. The data from January 2020 to December 2020 was used to verify. The mean square error (MSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) were used to compare the prediction effect among the models. The results showed that the monthly incidence fluctuated from 0.05 to 0.50 per 100,000 individuals, and the monthly incidence of AIDS had a certain periodicity in Henan province. In addition, the prediction effect of the Prophet model was better than SARIMA model, the combined model was better than the single models, and the combined model based on the L1-norm had the best effect values (MSE = 0.0056, MAE = 0.0553, MAPE = 43.5337). This indicated that, compared with the L2-norm, the L1-norm improved the prediction accuracy of the combined model. The combined model of SARIMA and Prophet based on the L1-norm is a suitable method to predict the incidence of AIDS in Henan. Our findings can provide theoretical evidence for the government to formulate policies regarding AIDS prevention.


Introduction
Acquired immune deficiency syndrome (AIDS) is caused by the human immune deficiency virus (HIV). According to the latest data of the Joint United Nations Programme on HIV/AIDS, approximately 37.6 million people were living with HIV and AIDS worldwide at the end of 2020 [1]. By the end of 2018, an estimated 1,250,000 people were living with HIV in China, and 80,000 new infections were reported in 2018 [2]. Henan is one of the provinces hit hardest by HIV and AIDS in China [3,4]. In the past, AIDS patients mostly appeared in paid blood donors in central China, especially in Henan province. Plasma marketing in Henan province in 1994-1995 lead to the rapid emergence of a large number of AIDS cases in the early 2000s [5]. Henan province ranks first in terms of population. Although China has responded with the prevention and treatment of AIDS, there is still a certain number of undetected HIV infections and continuous spread in society every year, which has led to a serious AIDS epidemic in Henan province [6]. The number of AIDS patients in Henan province ranked first in China in 2015, and 65,896 people were living with HIV and AIDS in Henan province in October 2020 [7]. At present, the medical field has not developed specific drugs to cure AIDS or an effective vaccine to prevent AIDS. Due to the latter, developing scientific and effective AIDS prevention and control strategies has become a top priority to curb the AIDS epidemic. Therefore, it is important to establish a 2 of 12 mathematical model of AIDS epidemic prediction so as to understand the process of the AIDS epidemic, explore its epidemic characteristics and develop law, and seek the optimal strategy for its prevention and control [8].
Over the past few years, mathematical models have been used to successfully predict the incidence of HIV and AIDS [9,10]. At present, the most common method used to predict disease incidences is the autoregressive integrated moving average (ARIMA) model, which is been widely used for the prediction of timeseries data, such as the incidence of human brucellosis [11] and COVID-19 [12]. However, ARIMA models have one major limitation of pre-assumed linearity [13]. In most cases, nonlinear structures are adopted during timeseries analyses, as adequate results cannot be obtained from linear models. On the contrary, a Prophet model does not need to specify a detailed model, and it simultaneously simulates multiple seasons through a generalized additive model. It adopts a Bayesianbased curve-fitting method to smooth and forecast timeseries data, and it has a fast and robust fitting process for large outliers, missing values, and dramatic changes [14]. To date, Prophet model has been used in many disciplines, including infectious diseases such as COVID-19 [15,16], but few studies have applied it to AIDS. In the present study, we have built both Seasonal Autoregressive Integrated Moving Average (SARIMA) and Prophet models because time series data for AIDS had both linear and non-linear characteristics.
It is universally agreed that combining different models can increase the chance of capturing various data features and improve prediction accuracy [17,18]. More recently, combined forecasting models have been extensively applied in various fields with high predictive performance, including air quality [19] and influenza [20]. However, the parameter estimation methods of proposed models have mainly been based on the minimum L2-norm of the prediction error vector, that is, the combination prediction model is based on the minimum sum of the squares of the prediction error. However, the prediction error becomes enlarged or reduced after the prediction error is squared [21]. Considering the defect of the L2-norm, it is necessary to introduce the prediction error vector L1-norm index, which uses the sum of the absolute values of the prediction errors. Its robustness is better than prediction residual error sum of squares, especially when there are abnormal, extreme values in the data [22,23]. Wang et al. proposed a combined model to predict air pollutant concentration. Based on the L1-norm, the model performed a weighted combination of the prediction results of three single models (extreme learning machine, Elman neural network, and support vector machine), and the results showed that the proposed combined model had a stable prediction performance [24]. Therefore, we propose a combined model of the SARIMA and Prophet models to predict AIDS incidence based on the L1-norm. We compare the prediction effects to explore which is the most precise model for AIDS incidence prediction in Henan province. The results provide reference information for AIDS prevention and intervention in Henan province.

Data Sources
The incidence data of AIDS in Henan from January 2012 to December 2020 were obtained from the Health Commission of Henan Province (http://wsjkw.henan.gov.cn/, accessed on 1 April 2021). The monthly incidences of AIDS in Henan province from January 2012 to December 2019 were set as the training set, and the incidences of AIDS from January to December 2020 were set as the test set. A SARIMA model, a Prophet model, and two combined models were adopted to fit the monthly incidence of AIDS by using the data from the training set. The forecasting performances of the four fitted models were verified by using the data from the test set. The technical roadmap is shown in Figure 1

SARIMA Model
ARIMA model consists of three parts: autoregression (p), the degree of and the order of moving average (q) [25]. For seasonal trends, a SARIMA mo nonseasonal and seasonal components and can be specified as SARIMA (p Q)s, where p, d, and q refer to the orders of the nonseasonal autoregressive sonal differencing, and nonseasonal moving average (MA) parts of the mo Q refer to the orders of the seasonal AR, seasonal differencing, and seasona the model, and the subscripted letter "s" is the length of the seasonal perio study, the incidence of AIDS varied in the annual cycle, so s = 12.
A timeseries modeling approach involves the following three steps: m cation, parameter estimation, and diagnostic checking. Firstly, if it is necessa ate differencing of the series is performed to achieve stationarity and norma an augmented Dickey-Fuller (ADF) unit root test to estimate whether the ti

SARIMA Model
ARIMA model consists of three parts: autoregression (p), the degree of difference (d), and the order of moving average (q) [25]. For seasonal trends, a SARIMA model combines nonseasonal and seasonal components and can be specified as SARIMA (p, d, q) × (P, D, Q)s, where p, d, and q refer to the orders of the nonseasonal autoregressive (AR), nonseasonal differencing, and nonseasonal moving average (MA) parts of the model. P, D, and Q refer to the orders of the seasonal AR, seasonal differencing, and seasonal MA parts of the model, and the subscripted letter "s" is the length of the seasonal period [26]. In this study, the incidence of AIDS varied in the annual cycle, so s = 12.
A timeseries modeling approach involves the following three steps: model identification, parameter estimation, and diagnostic checking. Firstly, if it is necessary, appropriate differencing of the series is performed to achieve stationarity and normality. We used an augmented Dickey-Fuller (ADF) unit root test to estimate whether the timeseries was stationary or not. Secondly, the temporal correlation structure of the transformed data is identified by examining its autocorrelation (ACF) and partial autocorrelation (PACF) functions. In addition, the values of p, P and q, Q are finally determined by considering that smaller Akaike information criterion (AIC) and Bayesian information criterion (BIC) values correspond to a higher prediction accuracy. Finally, in order to test the normality of the SARIMA residuals, a white noise test is conducted in the residual series [27].

Prophet Model
A Prophet model is a data prediction tool for the timeseries of Facebook. It was introduced by Taylor and Letham and it is available in packages such as Python and R [14]. It can almost automatically predict the future trend of a timeseries. In addition, the model deals with the case of outliers in a time series and it also deals with some missing values. A Prophet model adopts a curve-fitting method based on Bayes to smooth and forecast timeseries data, so the results that need to be predicted can be obtained faster. In general, timeseries prediction or data analysis can use this algorithm to predict the trend of future timeseries. The formulation of a Prophet model is similar to a generalized additive model, including trend, seasonality, and holidays: is the trend function representing nonperiodic changes in timeseries values, and s(t) represents periodic changes (for example, weekly and annual seasonality). h(t) represents the effects of holidays that occur on potentially irregular schedules over one or more days. ε t is the error term and was assumed to be normally distributed in this study. For a trend model, it involves fitting a piecewise linear curve or a nonlinear saturating growth model. This sort of growth is typically modeled using a logistic growth model, which, in its most basic form, is: where C indicates the carrying capacity, k is the growth rate, and m represents an offset parameter. Both the carrying capacity and rate of growth are not constant. By altering the parameter rate, the flexibility of the model can be controlled [28].
In this study, the change points were automatically selected, the number of change points was set as 25, and the carrying capacity of the logistic growth model was set as 8.5. We set the interval width as 0.85 and the number of uncertainty samples as 1000.

The Combined Model Based on L1-Norm
Supposing the observed values of an index are {x t , t = 1, 2, . . . , N}, there are m feasible single prediction methods to forecast it, where x it is the predicted value of ith method at time t, i = 1, 2, . . . , m, t = 1, 2, . . . , N. When {l i , i = 1, 2, . . . , m} is the weighting coefficient of m single prediction in the combined prediction model, it satisfies normality and non-negativity: We first consideredx t , applying weighted geometric mean: wherex is the weighted geometric average of x t . Supposing e t is the logarithmic error between the combined predicted value and the corresponding actual value at time t yields the following: where e it is the logarithmic error between the actual value at time t and the corresponding predicted value of the ith single model, i = 1, 2, . . . , m, t = 1, 2, . . . , N. F is the logarithmic error between the combined prediction model and the actual value of the index based on the L1-norm, and F i is defined as the logarithmic error between the predicted value and the corresponding actual value of the ith single model, yielding: where F is the function of the weighting coefficient vector L = (l 1 , l 2 , . . . , l m ) T of various prediction methods, which can be denoted as F(L).
In the ideal case, if there is no prediction error, F(L) = 0. However, predicted error is inevitable. The smaller F(L) is, the closer the combined prediction value is to the actual value of the index, and the more accurate and effective the combined model. Therefore, model (1) was expressed as: In this study, m = 2, N = 12, l 1 was the weighting coefficient of the SARIMA model and l 2 was the weighting coefficient of the Prophet model.

Model Evaluation
The Akaike information criterion (AIC) and Bayesian information criterion (BIC) were used to screen parameters of the SARIMA model. The SARIMA model with the minimum AIC and BIC was the most suitable one. The models were estimated using the mean square error (MSE), mean absolute percentage error (MAPE), and mean absolute error (MAE), and the model with the smallest values of these indices was identified as optimal [10].

Data Processing and Analysis
R software (version 3.6.2, R Foundation for Statistical Computing, Vienna, Austria) was adopted to develop the SARIMA model and the Prophet model. In addition, LINGO software (version 15.0, Lindo System Inc., Chicago, IL, USA) was adopted to create the combined model. The significant level was 0.05.

Trends of AIDS in Henan Province
The timeseries data covered 108 months, from January 2012 to December 2020. As shown in Figure 2, the monthly incidence fluctuated from 0.05 to 0.50 per 100,000 individuals, and the monthly incidence of AIDS in Henan province had a certain periodicity. The peaks of the disease occurred mainly in December, while the low peaks occurred mainly in January or February. There was a sudden decline in the monthly incidence of AIDS in January 2020 and a record low in February 2020.

SARIMA Models
For SARIMA, the ADF test indicated that the original series was stationary (Dickey-Fuller = − 3.796, p < 0.05), which did not need trend differencing ACF and PACF graphs were used to estimate the parameter ranges of p, P and q, Q. We found that the plots of the original series displayed slow decay at the seasonal lags (Figure 3a). Therefore, at lag-12 (subtracting the observations after a lag of 12 periods) differencing was used to remove seasonality features. The sequence of one-order seasonal difference was stable (Figure 3b). Then, some candidate SARIMA models were assessed to forecast future values based on the previously observed values (Table 1). Further, of all the tested models shown in Table  1, the SARIMA(1,0,1)(0,1,1) [12] model was found to best fit the data, which had the lowest Akaike information criterion (AIC = −253.67) and Bayesian information criterion (BIC = −244.00). This model also passed the Ljung-Box Q Test (p = 0.420); the testing results of the estimated parameters were all statistically significant (p < 0.05).

SARIMA Models
For SARIMA, the ADF test indicated that the original series was stationary (Dickey-Fuller = − 3.796, p < 0.05), which did not need trend differencing ACF and PACF graphs were used to estimate the parameter ranges of p, P and q, Q. We found that the plots of the original series displayed slow decay at the seasonal lags (Figure 3a). Therefore, at lag-12 (subtracting the observations after a lag of 12 periods) differencing was used to remove seasonality features. The sequence of one-order seasonal difference was stable (Figure 3b). Then, some candidate SARIMA models were assessed to forecast future values based on the previously observed values (Table 1). Further, of all the tested models shown in Table 1, the SARIMA(1,0,1)(0,1,1) [12] model was found to best fit the data, which had the lowest Akaike information criterion (AIC = −253.67) and Bayesian information criterion (BIC = −244.00). This model also passed the Ljung-Box Q Test (p = 0.420); the testing results of the estimated parameters were all statistically significant (p < 0.05).

SARIMA Models
For SARIMA, the ADF test indicated that the original series was stationary (Di Fuller = − 3.796, p < 0.05), which did not need trend differencing ACF and PACF g were used to estimate the parameter ranges of p, P and q, Q. We found that the p the original series displayed slow decay at the seasonal lags ( Figure 3a). Therefore, 12 (subtracting the observations after a lag of 12 periods) differencing was used to re seasonality features. The sequence of one-order seasonal difference was stable (Figu Then, some candidate SARIMA models were assessed to forecast future values bas the previously observed values (Table 1). Further, of all the tested models shown in 1, the SARIMA(1,0,1)(0,1,1) [12] model was found to best fit the data, which had the l Akaike information criterion (AIC = −253.67) and Bayesian information criterion ( −244.00). This model also passed the Ljung-Box Q Test (p = 0.420); the testing resu the estimated parameters were all statistically significant (p < 0.05).  Finally, we predicted the monthly incidence of AIDS from January to December in 2020 with the SARIMA(1,0,1)(0,1,1) [12] model. The results are shown in Table 2.

Prophet Model
The prophet model was automatically fitted with the incidence rates in Henan province from January 2012 to December 2019, and then the AIDS incidence rates from January to December 2020 were predicted. The predicted results are shown in Table 2, and the fitting prediction curve is shown in Figure 4. The results showed that the monthly incidence of AIDS was seasonal in Henan province.

Prophet Model
The prophet model was automatically fitted with the incidence rates in Henan province from January 2012 to December 2019, and then the AIDS incidence rates from January to December 2020 were predicted. The predicted results are shown in Table 2, and the fitting prediction curve is shown in Figure 4. The results showed that the monthly incidence of AIDS was seasonal in Henan province. The decomposed components of the monthly incidence of AIDS included the effect of trend and the yearly seasonality ( Figure 5). An increasing trend in the reported The decomposed components of the monthly incidence of AIDS included the effect of trend and the yearly seasonality ( Figure 5). An increasing trend in the reported incidence of AIDS was observed from 2012 to 2020. For yearly seasonality, an apparent local maximum appeared in September, and an apparent local minimum appeared in November. Health 2022, 19, x FOR PEER REVIEW 9 of 13 incidence of AIDS was observed from 2012 to 2020. For yearly seasonality, an apparent local maximum appeared in September, and an apparent local minimum appeared in November.

The Combined Model
For the combined model based on the L2-norm, we obtained = 0.548 and = 0.452. For the combined model based on the L1-norm, we obtained = 0.4587 and = 0.5417. The prediction values of the combined models are shown in Table 3. According to model (1), the minimum log error of the combined model based on the L1-norm was F( , ) = 3.5479, while the logarithmic error of the SARIMA model was = 4.0063 and that of the Prophet model was = 3.8331. The combinatorial model based on the L1-norm was the optimal combinatorial model, as it resulted in F( , ) < min{ , }.

Model Evaluation
Compared with the other models, the effect values of the combined model based on L1 norm were all lower (Table 3). It showed that the prediction effect of the combined

The Combined Model
For the combined model based on the L2-norm, we obtained l 1 = 0.548 and l 2 = 0.452. For the combined model based on the L1-norm, we obtained l 1 = 0.4587 and l 2 = 0.5417. The prediction values of the combined models are shown in Table 3. According to model (1), the minimum log error of the combined model based on the L1-norm was F(l 1 , l 2 ) = 3.5479, while the logarithmic error of the SARIMA model was F 1 = 4.0063 and that of the Prophet model was F 2 = 3.8331. The combinatorial model based on the L1-norm was the optimal combinatorial model, as it resulted in F(l 1 , l 2 ) < min{F 1 , F 2 }.

Model Evaluation
Compared with the other models, the effect values of the combined model based on L1 norm were all lower (Table 3). It showed that the prediction effect of the combined model based on L1 was the best.

Discussion
In this study, we focused on monthly incidence data for AIDS from 2012 to 2020 in Henan province, China. The results showed that the incidence of AIDS in Henan province was relatively stable. We built four models for analyzing the timeseries: a SARIMA model, a Prophet model, and two combined models of SARIMA and Prophet. The predictive abilities of the four models were compared and it was discovered that the combined model based on the L1-norm proposed in this study had the best model effect values and was superior to the other models in predicting effect. These findings indicated the potential value of the combined model based on the L1-norm in forecasting short-term AIDS incidence in Henan province. The model can provide reference for AIDS prevention and intervention in Henan province.
The Henan provincial government has done an excellent job in the prevention and treatment of AIDS during a transfer of AIDS carriers to a new AIDS population in Henan province. It warned us that the AIDS epidemic cannot be ignored. The Henan government has not slackened on AIDS in recent years. To implement China's 13th Five-Year Action Plan for HIV/AIDS Containment and Prevention, the Henan University AIDS Prevention and Control Alliance was established in 2018 [7]. In the face of the impact of COVID-19, we still need to pay attention to the prevention and control of AIDS [29,30]. It requires us to go further in AIDS prevention, treatment, and research and to raise public awareness of the growing threat posed by infectious diseases. In addition, AIDS is not a seasonal infection. Still, the findings suggested some seasonality in the AIDS incidence data reported through the Center for Disease Control and Prevention (CDC): the rate of new infections is low in January and February and this can be attributed to the influence of the annual Spring Festival, which falls in late January or early February. During the Spring Festival, national and provincial CDCs and most hospitals and clinical laboratories operate with limited capacity, and people's willingness to seek medical treatment falls, resulting in low AIDS infection records. Some other studies have also supported the latter phenomenon [31].
Different models have their own merits and faults. Based on historical AIDS incidence data, our study used timeseries analysis methods to establish SARIMA(1,0,1)(0,1,1) [12], a Prophet model, and two combined models for forecasting the monthly incidence of AIDS in Henan province. We found that the prediction effect of the Prophet model was better than the SARIMA model. We know that the SARIMA model combines the advantages of the two methods of regression analysis and moving average, which can analyze seasonal timeseries [32]. A previous study found that the most appropriate ARIMA models for HIV incidence in 2015 and 2016 in Guangxi, China, were ARIMA(1,1,2)(0,1,2) [12] and ARIMA(2,1,0)(1,1,2) [12], respectively [8]. A limitation of the SARIMA model is that it is easy to overfit with poor generalization ability when processing daily data, and it is more suitable for linear models [13]. However, the monthly incidences of AIDS had both linear and nonlinear characteristics, and there was a missing value, which meant some information may not have been captured by the SARIMA model. Adopting a generalized additive model formulation, the Prophet model is fast in its fitting procedure and robust for large outliers, missing values, and dramatic changes. The Prophet model automatically fitted the data, and the interpolation of missing values was not required [14], so it produced better results. This may explain why the performances of the SARIMA model were not as accurate as those of the Prophet model, which was consistent with a previous study [33].
Previously, Adesoye Idowu Abioye et al. [34] used an ARIMA and Prophet model to fit and predict the COVID-19 cases in Nigeria from September to October 2020. On this basis, we built combined models to predict the incidence of AIDS in Henan province. The results showed that the prediction effects of the combined models were better than those of the two single models. Firstly, the combined models could combine the characteristics of a single model to capture more data information, and they were more suitable for fitting data with both linear and nonlinear characteristics, such as AIDS incidence. Secondly, the reasonable distribution of the weight of a single model may result in an ensemble with more accurate and lower variance [17]. To date, many related studies have shown that a combined model is better than a single model. For example, Grzegorz Dudek [35] believed that a hybrid residual dilated long short-term memory and exponential smoothing model was more competitive. Li [36] found that a combined model of modified linear ARIMA and modified nonlinear ARIMA improved the single models. Furthermore, we found that, compared with the L2-norm, the L1-norm improved the prediction accuracy of the combined model. The combined model based on the L2-norm was based on the minimum sum of the squares of the prediction error. However, the prediction error was enlarged or reduced after the prediction error was squared (that is, if the absolute value of the prediction absolute error was greater than 1, it was larger after it was squared; if the absolute value of the predicted relative error was less than 1, it was smaller after being squared). The combined model based on the L1-norm used the sum of the absolute values of the prediction errors. Based on the L1-norm, the model was sparse and regularized. The L1-norm decreased very quickly for small weights and slowly for large weights [24]. Overall, the combined forecasting model of SARIMA and Prophet models based on the L1-norm was an appropriate way for predicting the incidence of AIDS in Henan province. With the help of the combined model, it is reasonable for the government to allocate health resources to control the epidemic efficiently.
This study had several limitations. First, since the analyzed data was the monthly incidence data of AIDS, the decomposition components of the Prophet model only included the trend effect and yearly effect, which could not reflect the weekly effect and short-term holiday effect. Second, the timeseries prediction model was adopted without considering factors affecting the incidence of AIDS, such as age, gender, social status, epidemic variation, and humanity. Stratified analysis is essential for a better understanding of AIDS epidemiology. In the future, we can explore other suitable models (such as support vector regression, exponential smoothing models, and machine learning) for predicting AIDS in combination with epidemiological data and socio-economic determinants [37,38].

Conclusions
The combined forecasting model of the SARIMA and Prophet models based on the L1-norm was an appropriate way for predicting the incidence of AIDS in Henan province. The results showed that the combined model was suggested for use in AIDS surveillance, which provide evidence for the government to formulate policies by providing estimates on AIDS incidence trends in Henan, China.
Author Contributions: Data analysis, writing-original draft preparation, Z.L.; data collection, Z.L., Z.S., H.Z. and M.L.; supervision, Y.Y., X.J., J.B. and X.S. All authors have read and agreed to the published version of the manuscript.