Stochastic Model for Drought Forecasting in the Southern Taiwan Basin

: The global rainfall pattern has changed because of climate change, leading to numerous natural hazards, such as drought. Because drought events have led to many disasters globally, it is necessary to create an early warning system. Drought forecasting is an important step toward developing such a system. In this study, we utilized the stochastic, autoregressive integrated moving average (ARIMA) model to predict drought conditions based on the standardized precipitation index (SPI) in southern Taiwan. We employed data from 1967 to 2006 to train the model and data from 2007 to 2017 for model validation. The results showed that the coe ﬃ cients of determination ( R 2 ) were over 0.80 at each station, and the root-mean-square error and mean absolute error were su ﬃ ciently low, indicating that the ARIMA model is e ﬀ ective and adequate for our stations. Finally, we employed the ARIMA model to forecast future drought conditions from 2019 to 2022. The results yielded relatively low SPI values in southern Taiwan in future summers. In summary, we successfully constructed an ARIMA model to forecast drought. The information in this study can act as a reference for water resource management.


Introduction
Drought is a natural phenomenon which can be classified into four categories: meteorological, hydrological, agricultural, and socioeconomic drought. Among these, meteorological drought can precede and lead to the other three types. It is essentially caused by insufficient precipitation and usually results in severe disaster events. Dai [1] found that the global trends of observed annual precipitation from 1950 to 2010 showed a decline in areas such as Africa, Southeast and East Asia, Eastern Australia, and Southern Europe. These results indicate that water shortages are becoming increasingly severe, which may lead to drought. However, drought events not only lead to environmental changes but also have serious impacts on the social economy and agricultural development [2]. For example, between 1976 and 2006, the total cost associated with drought in Europe was as high as €100 billion, while the number of people affected by drought increased by 20% [3]. Therefore, it is important to assess drought conditions to develop a mitigation strategy.
Many drought indices have been proposed to assess meteorological drought to identify the occurrence of drought events. Among these, the standardized precipitation index (SPI) has been widely employed to evaluate the intensity of meteorological droughts, owing to the following advantages. First, the SPI is based only on long-term rainfall data. Second, the SPI is in a standardized form, which means that it can be compared between different regions. Third, it has variable timescales, which allows it to describe different drought conditions. Therefore, the SPI was selected as the meteorological drought index in the present study.

SPI
There exist many methods that can identify drought events, and using drought indicators is the most convenient and effective method to do this. In this study, we explored meteorological drought in southern Taiwan. The SPI was proposed by McKee et al. [24]. This index simply utilizes the cumulative precipitation data over different periods for calculation. Because the index is standardized, it can be compared between different regions. The World Meteorological Organization [25] listed this as the preferred index for describing precipitation drought events. Considering the rainfall pattern in Taiwan, over a shorter aggregation time scale, the SPI only exhibits periodic oscillations in time, and it is difficult to evaluate long-term drought events. In contrast, it is easier to identify drought events with a longer aggregation time scale. Therefore, we chose a 12 month aggregation time scale for the SPI. In this study, we utilized historical SPI time series to predict future drought conditions using a time series model.

Time Series Model
The time series model is a stochastic model that is commonly used for time series forecasting. Common time series models include the autoregressive (AR), moving average (MA), and autoregressive moving

SPI
There exist many methods that can identify drought events, and using drought indicators is the most convenient and effective method to do this. In this study, we explored meteorological drought in southern Taiwan. The SPI was proposed by McKee et al. [24]. This index simply utilizes the cumulative precipitation data over different periods for calculation. Because the index is standardized, it can be compared between different regions. The World Meteorological Organization [25] listed this as the preferred index for describing precipitation drought events. Considering the rainfall pattern in Taiwan, over a shorter aggregation time scale, the SPI only exhibits periodic oscillations in time, and it is difficult to evaluate long-term drought events. In contrast, it is easier to identify drought events with a longer aggregation time scale. Therefore, we chose a 12 month aggregation time scale for the SPI. In this study, we utilized historical SPI time series to predict future drought conditions using a time series model.

Time Series Model
The time series model is a stochastic model that is commonly used for time series forecasting. Common time series models include the autoregressive (AR), moving average (MA), and autoregressive moving average (ARMA) models. Many studies have employed time series models to predict hydrological and meteorological time series [7,11,26].

Nonseasonal ARIMA Model
The AR and MA models can be effectively combined into an ARMA model. The current data is analyzed by a linear combination of previous data plus error terms. However, the ARMA model is only applicable when the data is stationary. If the original time series is nonstationary, then the difference must be added. The resulting model is called the ARIMA model, as proposed by Box and Jenkins [27]. The ARIMA model provides a new generation of forecasting tools, emphasizing the stochastic properties of time series rather than constructing single and simultaneous equation models. Each variable of the ARIMA model is represented by its own lagged value and stochastic error terms. The general nonseasonal ARIMA model consists of a p-order AR model, a q-order MA model, and the operators on the dth difference of the original time series data. This can be expressed as ARIMA(p,d,q), and the algorithm of the model is as follows: where φ(L) and θ(L) are polynomials of order p and q, respectively. The operators for the nonseasonal AR model of order p and MA model of order q are written as where L is the lag operator L i x t = x t−i .

Seasonal ARIMA (SARIMA) Model
Box et al. [28] extended the ARIMA model to deal with seasonality, with the result commonly referred to as the SARIMA model. The SARIMA model is analyzed by introducing seasonal periodic change to a general ARIMA mode, and it can be denoted as ARIMA(p,d,q)(P,D,Q) S , where (p,d,q) is the nonseasonal part of the model and (P,D,Q) S is the seasonal part. This can be expressed as follows: where p is the order of the nonseasonal AR model, q is the order of the nonseasonal MA model, d is the order of the general difference, P is the order of the seasonal AR model, Q is the order of the seasonal MA model, D is the seasonal differencing, and S is the length of a season.

ARIMA Model Development
In general, the development and selection of an appropriate ARIMA model consists of three steps: model identification, parameter estimation, and diagnostic checking [23,27,29,30].

Model Identification
This step involves transforming the data which is to confirm the original data to normality and stationarity and utilizing the autocorrelation function (ACF) and partial autocorrelation function (PACF) to initially confirm the general form of the ARIMA model. Here, the ACF and PACF were used to determine the order of the model, and the final model was selected using the goodness-of-fit criteria through the Akaike information criterion (AIC) [31] and Schwarz-Bayesian criterion (SBC) [32]. The model that gives the minimum AIC and SBC value is selected as the best. The mathematical formulations of the AIC and SBC are as follows: where m = (p + q + P + Q) is the number of model parameters, n is the number of data, and MSE is the mean-square error.
where y t is the observed data andŷ t is the predicted value.

Parameter Estimation
After identifying the ARIMA model, the parameters of the model must be estimated. This study employed the method of maximum likelihood, as proposed by Box and Jenkins [27], to estimate the parameters of the ARIMA model. Maximum likelihood estimation is a method of estimating the parameters of a statistical model.

Diagnostic Checking
Once an appropriate model has been selected and the parameters have been estimated, diagnosing the ARIMA model represents a crucial component of model development. In this step, it is necessary to verify whether the residuals of the model satisfy the properties of independence, following a normal distribution, and homoscedasticity to verify that the model is suitable for time series. Several statistical tests and plots of residuals are utilized for diagnostic checking. For a good model, the residuals must satisfy the white noise process requirements of being uncorrelated and normally distributed around a zero mean.
The residual autocorrelation function (RACF) and residual partial autocorrelation function (RPACF) of a time series are utilized to determine whether the series is independent. If the ACF and PACF of the residuals are significant within the confidence limits, this indicates that there is no significant correlation between the residuals. An alternative method is the Ljung-Box-Pierce (LBQ) test, which is a statistical method of testing for residual autocorrelation. The null hypothesis of the LBQ test is that the residuals are independent. The test statistic is defined in Equation (7): where m is the number of autocorrelation lags, n is the number of data, and r k is the sample autocorrelation at lag k. The statistical Q values are compared with the critical value with the degree of freedom at a 5% significance level. If the calculated values are less than the critical value, this means the residuals of the model are in accordance with white noise. The normality of residuals is verified by histograms and a probability plot of residuals. The residuals are also verified for homoscedasticity, which means there is a constant error variance over all the data. This is verified through a scatterplot of the residuals against predicted values. If the scatterplot exhibits no obvious patterns and the residuals are distributed randomly around zero, this indicates the residuals are homoscedastic.

Results and Discussion
In this study, to perform the predictive analysis, we selected seven stations with long-term monitoring data in the study area. The dataset was divided into two periods: training data (from 1967 to 2006) and validation data (from 2007 to 2017). Here, we used the ARIMA model to forecast future drought conditions based on the SPI. The model development steps included model identification, parameter estimation, and diagnostic checking. In this study, we utilized MATLAB to develop the time series model.

Model Identification
There are two main stages in model identification: (1) confirm whether the data is stationary and (2) utilize the ACF and PACF to determine the general form of the ARIMA model. According to the Kwiatkowski, Phillips, Schmidt, and Shin (KPSS) test, our data was nonstationary, and so it needed to be differenced. After applying the first-order difference, each station satisfied the model development conditions. Because the data at every station exhibited a similar pattern, we simply took the PR1 station as an example. Figure 2 shows that the ACF curve was damping out in a sine wave and the PACF exhibited a significant spike at lag 1, which reflected the AR(1) process. In addition, there were significant spikes in the PACF at lags 12, 24, 36, and 48, which indicated that the data exhibited seasonality with a period of 12. In the ACF, each station had a sine wave pattern, which also indicated that the data was seasonal [13]. Therefore, we chose AR(1) as the nonseasonal part of the ARIMA model, and the peaks at the lags that were multiples of 12 in the PACF indicated a seasonal model, which comprised the SARIMA model. The AIC and SBC criteria were utilized to select the best model, and the results are shown in Table 2. The model with the minimum AIC and SBC was selected as the best model. There are two main stages in model identification: (1) confirm whether the data is stationary and (2) utilize the ACF and PACF to determine the general form of the ARIMA model. According to the Kwiatkowski, Phillips, Schmidt, and Shin (KPSS) test, our data was nonstationary, and so it needed to be differenced. After applying the first-order difference, each station satisfied the model development conditions. Because the data at every station exhibited a similar pattern, we simply took the PR1 station as an example. Figure 2 shows that the ACF curve was damping out in a sine wave and the PACF exhibited a significant spike at lag 1, which reflected the AR(1) process. In addition, there were significant spikes in the PACF at lags 12, 24, 36, and 48, which indicated that the data exhibited seasonality with a period of 12. In the ACF, each station had a sine wave pattern, which also indicated that the data was seasonal [13]. Therefore, we chose AR(1) as the nonseasonal part of the ARIMA model, and the peaks at the lags that were multiples of 12 in the PACF indicated a seasonal model, which comprised the SARIMA model. The AIC and SBC criteria were utilized to select the best model, and the results are shown in Table 2. The model with the minimum AIC and SBC was selected as the best model.

Parameter Estimation
After identifying the order of the model, the parameters must be estimated. Table 3 presents the model parameters, standard errors, t-statistics, and p-values for the PR1 station in the Pozi River basin. It can be observed that the standard error was reasonably small compared with the model parameters.
In addition, most of the p-values of the model parameters were less than the alpha level (0.05), which implies that the estimations of the parameters were statistically significant. Therefore, these model parameters should be included in the model. Table 3. Statistical analysis of the model parameters for the PR1 station in the Pozi River basin. : SARIMA(1,1,0)(3,0,4)

Diagnostic Checking
After completing the parameter estimation, diagnostic checking was performed. The residuals of a model must be examined to verify that the model is adequate for the time series. The residuals must satisfy the following statistical properties: (1) the residuals are independent of each other; (2) the probability distribution is a normal distribution; and (3) homoscedasticity (constant variance) is satisfied. That is, the residuals must satisfy the requirements of a white noise process. Because each station yielded a similar result, we took the PR1 station as an example.
The independence of the residuals was checked by a correlogram and the LBQ test. Figure 3 shows the RACF and RPACF. The results show that most of the RACF and RPACF values were within the confidence limit, which implies that the residuals did not exhibit a significant correlation with each other. For the second method, the LBQ test was employed in this study to determine whether the residuals are dependent, and the results are presented in Table 4. We can observe that the computed statistics were less than the critical values at each station, which also indicates there was no significant correlation between the residuals, and the residuals from the selected model were in accordance with white noise.     Figure 4 depicts the histogram and normal probability plot of the residuals at the PR1 station in the Pozi River basin. The histograms show that the residuals were roughly centered on zero and were more or less normally distributed [13,33]. The normal probability plot of the residuals indicates that the residuals lay on a diagonal line, which represents the normal probability for residuals in each basin [13,34]. Therefore, both methods provided evidence of the normality of the residuals.   Figure 4 depicts the histogram and normal probability plot of the residuals at the PR1 station in the Pozi River basin. The histograms show that the residuals were roughly centered on zero and were more or less normally distributed [13,33]. The normal probability plot of the residuals indicates that the residuals lay on a diagonal line, which represents the normal probability for residuals in each basin [13,34]. Therefore, both methods provided evidence of the normality of the residuals. To determine whether the ability of the model to predict variable values is consistent, it is important to verify the homoscedasticity of the residuals [35]. The homoscedasticity of the residuals was checked by the scatterplot of the residuals against predicted values. The results are shown in Figure 5. The plot exhibited no pattern, and the residuals were randomly scattered. That is, the residuals were evenly distributed around a zero mean, which implies that the model was well fitted.
According to the above check, the residuals of the model were uncorrelated, had constant variance, and were normally distributed, and the statistical properties of the residuals were compliant with white noise. Thus, we confirmed that the selected model is adequate for the corresponding SPI time series at each station. To determine whether the ability of the model to predict variable values is consistent, it is important to verify the homoscedasticity of the residuals [35]. The homoscedasticity of the residuals was checked by the scatterplot of the residuals against predicted values. The results are shown in Figure 5. The plot exhibited no pattern, and the residuals were randomly scattered. That is, the residuals were evenly distributed around a zero mean, which implies that the model was well fitted.
According to the above check, the residuals of the model were uncorrelated, had constant variance, and were normally distributed, and the statistical properties of the residuals were compliant with white noise. Thus, we confirmed that the selected model is adequate for the corresponding SPI time series at each station.

Model Validation
Here, we utilized data from 2007 to 2017 for model validation. Figure 6 presents a comparison of the observed data with predicted values at each basin using the best SARIMA model. The predicted value yielded a similar pattern to the observed data, and the performance measures are presented in Table 5. In general, the higher the R 2 value, the better the performance of the model. According to our results, the R 2 values in each station were greater than 0.8, and the root-mean-square error (RMSE) and mean absolute error (MAE) were also sufficiently low. Therefore, the SARIMA model used to

Model Validation
Here, we utilized data from 2007 to 2017 for model validation. Figure 6 presents a comparison of the observed data with predicted values at each basin using the best SARIMA model. The predicted value yielded a similar pattern to the observed data, and the performance measures are presented in Table 5. In general, the higher the R 2 value, the better the performance of the model. According to our results, the R 2 values in each station were greater than 0.8, and the root-mean-square error (RMSE) and mean absolute error (MAE) were also sufficiently low. Therefore, the SARIMA model used to predict drought index in this study is reasonably precise.

Forecasting
The reason that drought forecasting is necessary for water resource management and planning is that drought events can then be diagnosed in advance, so that experts can take precautions. In this study, we employed the seasonal ARIMA model to forecast the drought condition in the next four years, from 2019 to 2022. The results at each basin are depicted in Figure 7. It was observed that the predicted values showed stochastic change for each SPI time series. According to the analytical results, the lowest SPI values were concentrated in the summer, which means that this region may be affected by drought in the summer in the future. In general, the main source of rainfall in Taiwan during the summer is typhoons. Typhoons can introduce abundant water resources. However, if the number of typhoons is small or the typhoons do not directly affect Taiwan, this sometimes results in a water shortage in the summer. In 2018, the number of typhoons generated between June and September was as high as 22. This is the year in which the most typhoons occurred in the past 10 years. However, only two typhoons affected Taiwan, which may have led to insufficient summer rainfall in Taiwan [36]. In addition, the Pacific high is the main climate factor during the summer in Taiwan, and this is becoming stronger owing to climate change [37], making Taiwan's climate hotter and with less precipitation. The Pacific high also guides the paths of typhoons. If the western edge of the Pacific high extends westward to the Asian continent, then a typhoon may pass through the Philippines instead of Taiwan. This is one of the reasons why there were fewer typhoons in 2018 in Taiwan. In recent years, the Pacific high has exhibited a tendency to move increasingly westward [38]. Therefore, the summer climate of Taiwan may change in the future under climate change. However, the details remain to be discussed in future research. In this study, the ARIMA model was employed to forecast drought events in the future. Unlike previous studies, we used a stochastic rather than deterministic method to describe the forecasting results. According to historical drought events in Taiwan, drought events occur approximately every three to four years. There was a water shortage situation in 2017 in Taiwan, and 2020 may be a drier year, which is consistent with the results we found in this study. A statistical method involves establishing suitable models to characterize climate factors such as precipitation. Stationarity is generally assumed to exist between predictor and historical data, but this is not always true. Thus, the uncertainty of the model may be higher. In addition, there are many factors that affect environmental changes, and this may present a challenge under this situation. Further studies should consider climate variability in the model for drought forecasting.
climate factors such as precipitation. Stationarity is generally assumed to exist between predictor and historical data, but this is not always true. Thus, the uncertainty of the model may be higher. In addition, there are many factors that affect environmental changes, and this may present a challenge under this situation. Further studies should consider climate variability in the model for drought forecasting.

Conclusions
In this study, the ARIMA model was employed as a drought forecasting tool in southern Taiwan. We used data from 1967 to 2006 to train the model. The model development included three steps: model identification, parameter estimation, and diagnostic checking. In the model identification step, we selected the general form of the model and chose the model with the minimum AIC and SBC as the best fit. In the parameter estimation step, we utilized several statistics to determine whether the parameters we estimated were significant. The results showed that most of the model parameters had p-values below the alpha level (0.05). The diagnostic checking step indicated that the statistical properties of the model residuals were compliant with white noise, including being uncorrelated with a constant variance and normal distribution. Then, the data from 2007 to 2017 were used to validate the model. The results showed that there was a high coefficient of determination (R 2 ) at each station (all over 0.80) and low values for the RMSE and MAE, which implies that the model is adequately precise at each basin. Finally, we used the ARIMA model to forecast the future drought conditions from 2019 to 2022. The forecasting results demonstrate that the SPI value is relatively low in the summer of 2020, which implies that there may be a water shortage in southern Taiwan. This phenomenon may be related to climate change, which leads to an enhancement of the Pacific high extending westward, thus affecting the paths of typhoons. In addition, the Pacific high dominates the summer climate in Taiwan, and if its intensity continues to increase, this will reduce precipitation in Taiwan in the future. However, the detailed evolution mechanism still remains to be discussed in the future.
The stochastic models selected for forecasting the SPI time series provided information on precipitation in southern Taiwan. This is a powerful tool, which can also be used to describe the hydrological time series. The ARIMA model used in this study, based on the SPI, can be applied to forecast drought impacts, playing a vital role in mitigating drought in water resource systems. However, the corresponding natural phenomena are complicated, owing to the influences of many factors. Stochastic models do not consider physical processes, and so it is difficult to understand the physical mechanisms of climate change. In addition, the assumption of the model may lead to higher uncertainty. Nevertheless, the model can be used to predict future trends and serve as a variable for other physical models in further studies.