Seasonality and Trend Forecasting of Tuberculosis Prevalence Data in Eastern Cape, South Africa, Using a Hybrid Model

Background: Tuberculosis (TB) is a deadly infectious disease caused by Mycobacteria tuberculosis. Tuberculosis as a chronic and highly infectious disease is prevalent in almost every part of the globe. More than 95% of TB mortality occurs in low/middle income countries. In 2014, approximately 10 million people were diagnosed with active TB and two million died from the disease. In this study, our aim is to compare the predictive powers of the seasonal autoregressive integrated moving average (SARIMA) and neural network auto-regression (SARIMA-NNAR) models of TB incidence and analyse its seasonality in South Africa. Methods: TB incidence cases data from January 2010 to December 2015 were extracted from the Eastern Cape Health facility report of the electronic Tuberculosis Register (ERT.Net). A SARIMA model and a combined model of SARIMA model and a neural network auto-regression (SARIMA-NNAR) model were used in analysing and predicting the TB data from 2010 to 2015. Simulation performance parameters of mean square error (MSE), root mean square error (RMSE), mean absolute error (MAE), mean percent error (MPE), mean absolute scaled error (MASE) and mean absolute percentage error (MAPE) were applied to assess the better performance of prediction between the models. Results: Though practically, both models could predict TB incidence, the combined model displayed better performance. For the combined model, the Akaike information criterion (AIC), second-order AIC (AICc) and Bayesian information criterion (BIC) are 288.56, 308.31 and 299.09 respectively, which were lower than the SARIMA model with corresponding values of 329.02, 327.20 and 341.99, respectively. The seasonality trend of TB incidence was forecast to have a slightly increased seasonal TB incidence trend from the SARIMA-NNAR model compared to the single model. Conclusions: The combined model indicated a better TB incidence forecasting with a lower AICc. The model also indicates the need for resolute intervention to reduce infectious disease transmission with co-infection with HIV and other concomitant diseases, and also at festival peak periods.


Introduction
Tuberculosis (TB) is a deadly infectious disease caused by Mycobacteria tuberculosis. TB incidence occurs in every part of the world. More than 95% of TB mortality occurs in low/middle income countries, and it is among the five leading causes of mortality in women aged 15 to 44 [1,2]. In 2014, approximately ten million people were diagnosed with active TB and two million died from the disease [3]. The highest incidence of new TB cases occurred in the Western Pacific Regions and

Study Setting
Eastern Cape (EC) is the second largest province in South Africa, with an area about 170,000 square kilometres, which is almost the size of Uruguay (Figure 1). It occupies about 13.9% of South Africa's entire land area and the entire population is about 6.5 million persons, which makes it the third largest population in South Africa. The EC racial population distribution is 86.3% black, 8.3% coloured, 4.7% white and 0.4% Indian/Asian. The proportion of the latter group has increased with most migrants coming from Sub-Saharan Africa, Indian and Asia. The capital city is Bhisho with the two most populous cities being East London and Port Elizabeth. EC is located on the South Eastern coast with many naturally beautiful spots, especially the rocky cliffs, oceans and thick green scrublands known as the wild coastline.

Data Collection
We studied retrospective data of all confirmed TB cases reported to the Eastern Cape Department of Health from 2010 to 2015 recorded on the noticeable infectious diseases occurrence data form monthly and yearly. The data occurrence and death rates of every single notifiable infection were mainly pulled together from all the TB hospitals in the province. Laboratory confirmation was based on Smear Positive Pulmonary Tuberculosis (PTB) cases. All suspected and confirmed TB cases are to be reported to Eastern Cape Health facility report of the Electronic Tuberculosis Register (ERT.Net) within a specific time of starting TB treatment. ERT.Net is a record unit for TB patients' treatment, TB therapy, TB investigation, training and seminars for health care workers and nurses. All TB incidents must be confirmed by medical staff and laboratory tests.

Ethical Considerations
This research was approved by the Govan Mbeki Research Ethics Committee, University of Fort Hare, Alice, Eastern Cape, South Africa, with reference number-QIN051SAZE01. Also, an approval letter to collect data was obtained from Eastern Cape Department of Health to conduct such research in the province with reference number-EC_2015RP26_384. Any information regarding study subjects used a number instead of their names and was kept confidential.

Development of the Model
This research was centred on forecasting comparisons in time series analysis of tuberculosis incidence data. Prior to model fitting, a time series plot was sketched to evaluate the behavioural pattern in the data over a period of years ( Figure 2). An additive decomposition of the TB time series was done to describe the seasonality components and trends ( Figure 3) and to estimate the seasonal effects that was used to create and present seasonally adjusted values. These adjusted seasonality values are used to remove the seasonal effect so that the trends can be shown clearly ( Figure 4). From this graph, we observed that the TB occurrence data had a periodical seasonality movement. Firstly, we looked at ARIMA model to assess the TB data. Moreover, the neural network auto-regression

Data Collection
We studied retrospective data of all confirmed TB cases reported to the Eastern Cape Department of Health from 2010 to 2015 recorded on the noticeable infectious diseases occurrence data form monthly and yearly. The data occurrence and death rates of every single notifiable infection were mainly pulled together from all the TB hospitals in the province. Laboratory confirmation was based on Smear Positive Pulmonary Tuberculosis (PTB) cases. All suspected and confirmed TB cases are to be reported to Eastern Cape Health facility report of the Electronic Tuberculosis Register (ERT.Net) within a specific time of starting TB treatment. ERT.Net is a record unit for TB patients' treatment, TB therapy, TB investigation, training and seminars for health care workers and nurses. All TB incidents must be confirmed by medical staff and laboratory tests.

Ethical Considerations
This research was approved by the Govan Mbeki Research Ethics Committee, University of Fort Hare, Alice, Eastern Cape, South Africa, with reference number-QIN051SAZE01. Also, an approval letter to collect data was obtained from Eastern Cape Department of Health to conduct such research in the province with reference number-EC_2015RP26_384. Any information regarding study subjects used a number instead of their names and was kept confidential.

Development of the Model
This research was centred on forecasting comparisons in time series analysis of tuberculosis incidence data. Prior to model fitting, a time series plot was sketched to evaluate the behavioural pattern in the data over a period of years ( Figure 2). An additive decomposition of the TB time series was done to describe the seasonality components and trends ( Figure 3) and to estimate the seasonal effects that was used to create and present seasonally adjusted values. These adjusted seasonality values are used to remove the seasonal effect so that the trends can be shown clearly ( Figure 4). From this graph, we observed that the TB occurrence data had a periodical seasonality movement. Firstly, we looked at ARIMA model to assess the TB data. Moreover, the neural network auto-regression model is mostly used in nonlinear multivariate analysis, which originates outside a system inputs [18] and can be used as a complement of linear analysis. However, the seasonal ARIMA (SARIMA) model and neutral network autoregressive model (NNAR) were used in analysing the trend of the time series data independently of the seasonal components and predicting the monthly TB incidence in South Africa. model is mostly used in nonlinear multivariate analysis, which originates outside a system inputs [18] and can be used as a complement of linear analysis. However, the seasonal ARIMA (SARIMA) model and neutral network autoregressive model (NNAR) were used in analysing the trend of the time series data independently of the seasonal components and predicting the monthly TB incidence in South Africa.   model is mostly used in nonlinear multivariate analysis, which originates outside a system inputs [18] and can be used as a complement of linear analysis. However, the seasonal ARIMA (SARIMA) model and neutral network autoregressive model (NNAR) were used in analysing the trend of the time series data independently of the seasonal components and predicting the monthly TB incidence in South Africa.

Development of the SARIMA Model
Time series seasonality is an unvarying pattern that recurs over S periods of time until the pattern changes over again. The SARIMA model integrates both non-seasonality and seasonality factors in a generative model. In the SARIMA model, seasonality in autoregressive (AR) and moving average (MA) terms predict using data values and errors at time intervals that are multiples of S. The SARIMA model is given as: where p = AR order in non-seasonality, d = difference in non-seasonality, q = MA order in non-seasonality, P = AR order in seasonality, D = difference in seasonality, Q = MA order in seasonality, and S = recurrence of time periods in the seasonality pattern. The general SARIMA model has the following form: The non-seasonality components are: The seasonality components are: In the equations, B represents the backward shift operator, ε stands for estimated residual error at t for μ = 0 and is constant and represents the observed values at t (t = 1, 2, …, k), ϕ is a vector of the AR coefficients, θ is a vector of the MA coefficients, Φ is a vector of the seasonal AR coefficients, and Θ is a vector of the seasonal MA coefficients. In the SARIMA model, seasonal subtraction of appropriate order is used to remove non-stationary data from the series. A first order seasonal difference is the deviation between a value and the corresponding value from the previous year and it is expressed as: = − , for monthly time series (S) = 12. Both autocorrelation and partial autocorrelation functions were used to detect six parameters in the components. Akaike

Development of the SARIMA Model
Time series seasonality is an unvarying pattern that recurs over S periods of time until the pattern changes over again. The SARIMA model integrates both non-seasonality and seasonality factors in a generative model. In the SARIMA model, seasonality in autoregressive (AR) and moving average (MA) terms predict X t using data values and errors at time intervals that are multiples of S. The SARIMA model is given as: where p = AR order in non-seasonality, d = difference in non-seasonality, q = MA order in non-seasonality, P = AR order in seasonality, D = difference in seasonality, Q = MA order in seasonality, and S = recurrence of time periods in the seasonality pattern. The general SARIMA model has the following form: The non-seasonality components are: The seasonality components are: In the equations, B represents the backward shift operator, ε t stands for estimated residual error at t for µ " 0 and σ 2 is constant and X t represents the observed values at t (t = 1, 2, . . . , k), φ is a vector of the AR coefficients, θ is a vector of the MA coefficients, Φ is a vector of the seasonal AR coefficients, and Θ is a vector of the seasonal MA coefficients. In the SARIMA model, seasonal subtraction of appropriate order is used to remove non-stationary data from the series. A first order seasonal difference is the deviation between a value and the corresponding value from the previous year and it is expressed as: x t " y t´yt´s , for monthly time series (S) = 12. Both autocorrelation and partial autocorrelation functions were used to detect six parameters in the components. Akaike information criterion (AIC) and Schwarz Bayesian criterion (BIC) also were performed to verify the better model that fit the data closely. The SARIMA model and SARIMA-NNAR model was built using R software ((R version 3.2.3, Network Theory Ltd., Bristol, UK) with the auto.arima () command and p value < 0.05 for statistical significance.

Development of the Neural Network Autoregressive (NNAR) Model
Artificial neural network models are widely applied on forecasting methods based on simple mathematical models that allow nonlinear multivariate associations between the dependent variable and its covariates. There are processes of self-organizing and learning in them. The learning rule used to adjust the neutral network weights is based on the function of long and short stochastic dependence of the time series. The approach is tested over six years' time series data obtained from the TB register and the lag values in the series can be used as variables for a neutral network auto-regression. Feed-forward neutral networks based on the nonlinear autoregressive model for forecasting time series with a layer hidden are considered, and NNAR (p, n) signifies p lag and n-nodes to forecast the output x t . An NNAR (p, 0) model corresponds to a model of ARIMA (p, 0, 0) but then without the limitations on the parameters to make sure it is stationary. It is helpful to also add the last observed values in the seasonality data from the same time as inputs. The input of this model is on the learning procedure, which employs the autoregressive neutral network with consideration of stochastic dependence of long or short term values of the time series (y t´1 , y t´2 , . . . ., y t´s q used as inputs to forecast the output y t and with n neutrons in the hidden layer. This is adjusted by a nonlinear function such as a sigmoid, to decrease the effect of excessive input values, in order to make the network robust to outliers. Since the SARIMA model is employed to examine the linear section of the TB data, the residuals part will have non-linear relationships. In the hybrid model, both the linear and nonlinear sections are combined. The estimated occurrence cases of TB at t time variable with two input variables were selected from the model.
The model fitting.

Comparison between the Two Models Performance in Simulation
Six parameter indexes were used to compare the goodness of fit efficiency and performances demonstrated with the errors from the two models. The error indexes are mean square error (MSE), Root mean square error (RMSE), mean absolute error (MAE), mean percent error (MPE), mean absolute scaled error (MASE) and mean absolute percentage error (MAPE). They are expressed as follows: where X t = real incidence cases,X t = estimated incidence and n = predictions number, A t = actual value of the quantity being forecast and F t = forecast.

Results
TB incidence data from January 2010 to December 2015 was used to perform the time series model fit. ACF and PACF plots were used to determine the key parameters (p, P, d, D, q, Q) of SARIMA model. The best model produced from the TB incidence data after the fifth trial was SARIMA (3, 0, 1, 0, 1, 2) 12 for monthly time series S = 12. The model equation is given as: (1´0.5112B) (1´0.9721B 12 )X t = (1 -0.7873B 12 )ˆ20731.651. The estimates and standard error of model parameters and their corresponding significant values are summarised in Table 1. In the concept of the SARIMA-NNAR model, the NNAR model was verified by using a smoothing constant of α = 0.1 from the range of 0 to 1 for simulation accuracy using nnetar and forecast.nnetar in a total of repeats networks of each random starting weights are fitted with lagged values of x as inputs and a single hidden layer with size nodes and with this constant, the hybrid model has its lowest MS, RMSE, MAE, MPE, MASE and MAPE. For non-seasonal data, the fitted model is denoted as an NNAR (p, k) model, where k is the number of hidden nodes. For seasonal data, the fitted model is called an NNAR (p, P, k)m model, which is analogous to an ARIMA (p, 0, 0)(P, 0, 0)m model but with nonlinear functions. The NNAR (p, P, k)m model was fitted and forecasted from Exponential triple smoothing (ETS). The values of p and P were not automatically selected but specified according to the AIC (optimal number of lags). For non-seasonality time series, the default was the best number of intervals (with smallest AIC) for a linear AR (p) model. In seasonality, the default values was P = 1 and p is selected from the best linear model fit to the seasonally adjusted data. These are then averaged when computing forecasts i.e., k was specified to n = (p + P + 0.1)/2 to the nearest integer.
Both the functions-ACF and PACF show significant spikes at lag 1 for seasonally differenced data, and almost significant spikes at lag 3 for PACF, showing some added non-seasonality terms to be included in the model ( Figure 5). A similar study showed that ACF and PCF of lag 12 show a significant peak suggesting a seasonal component of TB data [19]. The AICc of the SARIMA (3, 0, 1)(0, 1, 2) 12 model is 327.2, while that for the SARIMA-NNAR (3, 0, 1)(0, 1, 2) 12 model is 308.31. We attempted other models with AR requisites, but none gives a smaller AICc value. Consequently, we select the ARIMA (3, 0, 1)(0, 1, 2) 12 model. Its residuals are plotted in Figures 5 and 6. The model passed the residual tests, there are significant spikes in both the ACF and PACF. Entire significant spikes are seen within the significance limits, and the residuals occur to be white noise. A Ljung-Box test shown that the residuals have no outstanding autocorrelations and the model indicated that a Ljung-Box test was "non-significance", which is desirable. The prediction intervals were accurate due to the non-correlated residuals. Therefore, a seasonality ARIMA model appeared, which passes all the required checks and is ready for prediction.    The two models were compared in predicting the goodness of fit. TB incidence estimations for the forecast accuracy measures of scale-dependent errors on both models are summarised in Table 2. The measures of scale errors in MS, RMSE, MAE, MPE, MASE and MAPE were observed to be lower in the hybrid model compared with the single model. We made an effort to predict the estimates using both the SARIMA and SARIMA-NNAR models to forecast the number of yearly TB incidence cases in 2016 to 2017 and compare them with the real TB data (Table 3). However, in the forecast model curves (Figures 7 and 8), we observed that TB incidence monthly data in Eastern Cape indicated a marginally increasing trend and a The two models were compared in predicting the goodness of fit. TB incidence estimations for the forecast accuracy measures of scale-dependent errors on both models are summarised in Table 2. The measures of scale errors in MS, RMSE, MAE, MPE, MASE and MAPE were observed to be lower in the hybrid model compared with the single model. We made an effort to predict the estimates using both the SARIMA and SARIMA-NNAR models to forecast the number of yearly TB incidence cases in 2016 to 2017 and compare them with the real TB data (Table 3). However, in the forecast model curves (Figures 7 and 8), we observed that TB incidence monthly data in Eastern Cape indicated a marginally increasing trend and a seasonality pattern in the new number of cases of TB incidence. The yearly TB incidence was lower in SARIMA-NNAR in 2016 and 2017 compared to the SARIMA model.  7. Forecast from SARIMA model applied to TB case prevalence. Figure 7. Forecast from SARIMA model applied to TB case prevalence.

Discussion
A SARIMA and SARIMA-NNAR model was developed to forecast yearly incidence of TB cases in Eastern Cape. However, in both models, we observed that the time series TB data were simulated well, but the hybrid model that takes into account both the linear and non-linear components performed better than a single model of SARIMA. From our results so far, we could see that the hybrid model of ARIMA and neutral network provide a better forecast with more data characteristics than non-hybrid models. Predictions from the two models are shown in Figures 6 and 7.
The forecast was noticed to follow the recent trend in the data (this occurs due to subtraction). The rapidly and largely increased prediction intervals indicated that the TB incidence may possibly start increasing or decreasing at any period of time and in a contrast, the point forecasts trend downwards and the prediction intervals allow for the data to trend upwards during the forecast period. This behaviour is different from the one seen in Figure 9 where the prediction intervals are the same for the last few forecast horizons, and the point forecasts are equal to the mean of the data.
In this study, the results observed show that there will be no apparent improvement in the high burden incidence of TB in Eastern Cape in the near future. The predicted outcomes indicated that the reported yearly TB incidence cases will slightly increase in the nearest future in Eastern Cape.
The findings revealed that progress in TB control in Eastern Cape needs to be more intensified and adequate interventions are urgently needed. start increasing or decreasing at any period of time and in a contrast, the point forecasts trend downwards and the prediction intervals allow for the data to trend upwards during the forecast period. This behaviour is different from the one seen in Figure 9 where the prediction intervals are the same for the last few forecast horizons, and the point forecasts are equal to the mean of the data.
In this study, the results observed show that there will be no apparent improvement in the high burden incidence of TB in Eastern Cape in the near future. The predicted outcomes indicated that the reported yearly TB incidence cases will slightly increase in the nearest future in Eastern Cape. The findings revealed that progress in TB control in Eastern Cape needs to be more intensified and adequate interventions are urgently needed. Figure 9. Forecast from ARIMA model with non-zero mean applied to TB case prevalence.
There was a seasonal variation showing the periodicity of TB incidence in Eastern Cape. The yearly incidence data demonstrate a low incidence in 2010 to 2013 and was higher in 2014 and 2015 respectively. A similar study from the northern part of India showed that the peak period of TB occurrence was observed in April to June and October to December; though the prevalence of TB was lower in other months and there was no noticeable seasonality trends [20]. One of the plausible explanations for this seasonality in Eastern Cape may be the fact that HIV infected people are 30 times more likely to develop active TB due to the effect of HIV/AIDS becoming increasingly apparent, which makes the province to be one of the highest HIV affected areas in South Africa. It is also a low income and underdeveloped province.
Moreover, another notable cause may be the yearly prickly pear festival, one of the most significant yearly festivals in Eastern Cape, which mostly falls in late February or sometimes early March. During the entire month, there are massive crowd movements by various means of transportation. We conjecture that the peak in our study was most likely caused by overcrowding of public transportation over the festival period. There was a seasonal variation showing the periodicity of TB incidence in Eastern Cape. The yearly incidence data demonstrate a low incidence in 2010 to 2013 and was higher in 2014 and 2015 respectively. A similar study from the northern part of India showed that the peak period of TB occurrence was observed in April to June and October to December; though the prevalence of TB was lower in other months and there was no noticeable seasonality trends [20]. One of the plausible explanations for this seasonality in Eastern Cape may be the fact that HIV infected people are 30 times more likely to develop active TB due to the effect of HIV/AIDS becoming increasingly apparent, which makes the province to be one of the highest HIV affected areas in South Africa. It is also a low income and underdeveloped province.
Moreover, another notable cause may be the yearly prickly pear festival, one of the most significant yearly festivals in Eastern Cape, which mostly falls in late February or sometimes early March. During the entire month, there are massive crowd movements by various means of transportation. We conjecture that the peak in our study was most likely caused by overcrowding of public transportation over the festival period.
Another factor enhancing the TB progression in the winter period of the year is low temperature, which forces many people to stay indoors, if the house is poorly ventilated and crowded, this helps in the transmission of TB. A similar study shows that the summer peak was mainly as a result of enhanced winter transmission of TB due to indoor crowding [21]. Another study in United States suggests that reduced winter exposure may not be a strong contributor to TB risk [11,22]. Other possible methods for the seasonality in TB prevalent need to be studied.
Limitations to this study are: firstly, climatic data record (CDR), migration/geographical data and demographic data associated to the target population were not captured in the model fit to show if they constitute a significant cause of TB progression because of data availability limitations like a study conducted in Iran [23]. Secondly, South Africa is a low-middle income country and with differences in geographical entity and climatic conditions, so seasonal variation of TB progression in the various geographic province may be different. Lastly, both models were used only on the data from 2010 to 2015 and verified against only one year of data of TB prevalence. Hence, these results should be interpreted cautiously and should be revisited and analysed with additional time series data using a strong mathematical model.

Conclusions
Our data confirms that single forecast models can be dealt with by the emergence and application of hybrid models to forecast time series data. The result of the combined model thus, is more effectual and efficient than a single model in generating dependable forecasts of tuberculosis incidence cases. The model indicates that the TB prevalence in Eastern Cape will not increase remarkably in the forthcoming years; it is essential to effect better TB incidence control measures in South Africa. The TB prevalence seasonality from the models also indicate a greater necessity for TB interventions, focused on reducing infectious disease transmission with co-infection with HIV and other concomitant diseases and also on public events and movements during festival periods.