Characterization and Prediction of Water Stress Using Time Series and Artiﬁcial Intelligence Models

: In agroecosystems, drought is a critical climatic phenomenon that affects evapotranspiration and induces water stress in plants. The objective in this study was to characterize and forecast water stress in the Hyderabad region of India using artiﬁcial intelligence models. The monthly precipitation data for the period 1982–2021 was characterized by the standardized precipitation index (SPI) and modeled using the classical autoregressive integrated moving average (ARIMA) model and artiﬁcial intelligence (AI), i.e., artiﬁcial neural network (ANN) and support vector regression (SVR) model. The results show that on the short-term SPI3 time scale the studied region experienced extreme water deﬁcit in 1983, 1992, 1993, 2007, 2015, and 2018, while on the mid-term SPI6 time scale, 1983, 1991, 2011, and 2016 were extremely dry. In addition, the prediction of drought at both SPI3 and SPI6 time scales by AI models outperformed the classical ARIMA models in both, training and validation data sets. Among applied models, the SVR model performed better than other models in modeling and predicting drought (conﬁrmed by root mean square error—RMSE), while the Diebold–Mariano test conﬁrmed that SVR output was signiﬁcantly superior. A reduction in the prediction error of SVR by 48% and 32% (vs. ARIMA), and by 21% and 26% (vs. ANN) was observed in the test data sets for both SPI3 and SPI6 time scales. These results may be due to the ability of the SVR model to account for the nonlinear and complex patterns in the input data sets against the classical linear ARIMA model. These results may contribute to more sustainable and efﬁcient management of water resources/stress in cropping systems.


Introduction
Meteorological and hydrological droughts commonly cause agricultural drought, i.e., water stress in agroecosystems. The three longest droughts globally occurred between July 1928 and May 1942 (Dust Bowl drought of the 1930s), July 1949and September 1957, and between June 1998 and December 2014 (early 21st-century drought). However, the frequency of droughts has increased in recent decades, reducing water stress, using AI techniques such as ANN and SVR. Moreover, this work was the first attempt to significantly compare the modeling performance of the studied techniques using the test of Diebold and Mariano [36]. The methodological framework begins with the characterization of drought by SPI. The AI-based ANN and SVR models presented here were developed to model and predict drought, not only in the Hyderabad region, but also in similar water-scarce and drought-affected agroecosystems. Thus, the results could serve as a valuable tool and extension in drought, i.e., water management, and food security.

Description of the Data and Study Area
The study pertains to Hyderabad, the capital city of Telangana state of India. Telangana has an area of 112,077 sq. km and has a population of 35,003,674. The state receives an average annual rainfall of about 1000 mm. The Hyderabad (17.3850 • N, 78.4867 • E) region of the Hyderabad region of the Telangana state is selected as the study area, as the average rainfall received by the region is usually less than the state average. The annual rainfall of the Hyderabad region was observed to be around 750 mm, the region receives most of its rainfall from the southwest monsoon. The monthly rainfall data (millimeter (mm)) in Hyderabad from January 1988 to May 2021, was obtained from the Directorate of Economics and Statistics, Government of Telangana. Further, the rainfall data is used to calculate the SPI for three and six-month time scales.

Standardized Precipitation Index (SPI)
The rainfall data were used to calculate the SPI indices for time scales of 3 (SPI3) and 6 (SPI6) months. The SPI was calculated based on the probability of precipitation for a specific time scale. The gamma distribution is defined by its frequency or probability density function.
f (x, α, β) = 1 where, α and β are the shape and scale parameters respectively; x is the rainfall amount (millimeter), and Γ(α) is the gamma function, α > 0, where, Γ(α) is the gamma function. Computation of the SPI was mainly involved in fitting a gamma probability density function to a given frequency distribution of precipitation. The maximum likelihood solutions used for optimal estimation of α and β are given as follows [9]: where, n = number of precipitation observations. Integrating the probability density function with respect to x and attaching α and β parameters yields the cumulative probability distribution function F(x): Equation (6) can be expressed as follows: where, t = x/β. Since the gamma function is undefined for x = 0 and the rainfall time series data may contain zero rainfalls, the cumulative probability of zero and non-zero rainfalls, H(x) can be calculated as: where, q is the probability of zero rainfall. If m is the number of zeros present in a rainfall time series, then q is estimated by m/n. The cumulative probability was then transformed into a standard normal random variable Z, then its mean is 0 and variance is 1, which is the value of the SPI.
Ref. [37] derived the following equations to transform the cumulative probability distribution into a standardized normal distribution to calculate the SPI as follows: where, C 0 = 2.515517, C 1 = 0.802853, C 2 = 0.010328, d 1 = 1.432788, d 2 =0.189269, and d 3 = 0.001308. The SPI for time scales 3 (SPI3) and 6 (SPI6) represent short-term and midterm meteorological droughts, respectively [37]. The SPI is categorized in to different groups as mention in Table 1 below: After the development of SPI3 and SPI6, the indexes from January 1982 to May 2020 were considered as the training dataset (data for model building), while the indexes from June 2020 to May 2021 were considered as the testing dataset (data for model validation) in this study.

ARIMA Model
The ARIMA model was characterized by three terms: p, d, and q, where, 'p' is the order of the 'auto regressive' (AR) term, referring to the number of lags of Y t time series to be used as predictors. 'q' is the order of the 'moving average' (MA) term, assuming the number of lagged forecast errors that should go into the ARIMA model. 'd' is the number of differences required to make the time series stationary; the most common approach of differencing is, subtract the previous value from the current value.
where, Y t is the time series, ∅ i and θj are model parameters for AR and MA terms, respectively, e t is the random error, p is a number of autoregressive terms, q is number of lagged forecast errors, and B is the backshift operator such that BY t = Y t−1 . The ARIMA model building consists of four stages, viz. identification, estimation, diagnostic checking, and forecasting [22].

Artificial Neural Network (ANN)
The ANN is one of the most widely used artificial intelligence techniques in the last two decades, for both classification and regression tasks. As time-series falls under regression problem, the lagged autoregressive observations were used as input to the network with an implicit functional representation of time. The mathematical expression for the output Y t of a single-layer or multi-layer feed-forward autoregressive network is expressed as follows: where, α j (j = 0, 1, 2, . . . , q) and β ij (i = 0, 1, 2, . . . , p, j = 0, 1, 2, . . . , q) are the model parameters, p is the number of input nodes, q is the number of hidden nodes, g is the activation function, and e t is the random error [38]. The logistic function is the most commonly used activation function in time series modeling.
The ANN model building was generally divided into the model building part (training data set) and validation part (testing set). In model building, the algorithm minimizes the error function between actual and predicted values. The error function was expressed as follows: where, N is the total number of error terms and w ij are the weight parameters of the neural network which are changed by a number of changes in where η is the learning rate which is a user-defined hyper-parameter [39,40]. The schematic representation of the autoregressive feed-forward neural network is depicted in Figure 1.
where, N is the total number of error terms and are the weight parameters of the neural network which are changed by a number of changes in ∆ = − where is the learning rate which is a user-defined hyper-parameter [39,40]. The schematic representation of the autoregressive feed-forward neural network is depicted in Figure 1.

Support Vector Regression (SVR)
The support vector regression (SVR) is a supervised machine learning technique originally developed for binary classification and later extended to nonlinear regression estimation problems by introducing an ε-insensitive loss function [41]. The basic principle involved in NLSVR (nonlinear SVR) is to transform the original input time series into a high dimensional feature space and then build the regression model in a new feature space.
Consider a vector of data set Z = {x i y i } N i=1 where x i ∈ R n is an input vector, y i is the scalar output, and N is the size of data set. The general equation SVR can be written as follows [41]: where, φ(x) is function of x, W is weight vector, b is bias term, and superscript T denotes the transpose. The coefficients W and b were estimated by minimizing the following regularized risk function: The first term 1 2 w 2 is called 'regularized term', which measures the flatness of the function. Minimization of the regularized terms will make a function as flat as possible. The second term 1 is called as 'empirical error' or loss function L ε which was estimated by Vapnik ε-insensitive loss function as follows: where, y i is the actual value, f (x i ) is the estimated value, and ε is the margin of tolerance where no penalty is given to errors. The kernel function in SVR plays a critical role in mapping the input to output, where the information is processed using a suitable kernel function. The most commonly used kernel functions radial basis function (RBF) K is given as follows: where, σ 2 is the variance, and x 1 − x 2 is the Euclidean distance between two sample points x 1 and x 2 . The performance of SVR depends upon the optimization of two hyperparameters; regularization parameter C also called the cost function, which balances the complexity and approximation accuracy of the model and kernel bandwidth parameter γ = 1 2σ 2 , which represents the variance of the kernel function. Finally, a general flow chart of the methodology followed in this study is delineated in Figure 2. The flow chart depicts the development of drought indices SPI3 and SPI6 using rainfall data, modeling, and forecasting of drought using different models and performance of models in both training and testing data sets in terms of RMSE and Diebold-Mariano (DM) test.

ARIMA Model
The time series plots of SPI at 3 and 6 timescales are depicted in Figure 4a,b. The time series modeling 'Box-Pierce test' was performed to check the autocorrelation of the series using the following formula [22]: where n is the sample size,ρ k is the sample autocorrelation at lag k, and h is the number of lags being tested; and Q follows χ 2 , a distribution with h degrees of freedom. Both SPI3 and SPI6 are autocorrelated as the Box-Pierce test was significant (p < 0.0001) ( Table 4). The main step in the ARIMA model is the identification of the different combinations of parameters, such as autoregressive terms (p), differencing terms (d), and moving average terms (q). A combination of the order of the parameters which has maximum log-likelihood and lowest values of Akaike information criteria (AIC) and Bayesian information criteria (BIC) is considered the best model. Along with log-likelihood, AIC and BIC criteria's ACF and PACF plots were also used ( Figures S1-S4) for model order identification. The results are presented in Table 4. For SPI3 and SPI6 the best fitted ARIMA model was (1,0,2) and (1,0,0) with maximum log-likelihood of −544.14 and −479. 53  For building the AI model, the nonlinearity of the original data set was tested using the BDS test [42]. The results of the BDS test are shown in Table 5 and confirm that the data under consideration is nonlinear in nature as the probability of significance of residuals is p < 0.0001.

ANN Model
ANN model parameters are given in Table 6 for both SPI3 and SPI6 time scales. The drought indices of time series were subjected to ANN model with 5 and 4 tapped time delays and 2 and 4 optimum nodes for SPI3 and SPI6, respectively. Sigmoidal activation function in input to the hidden layer and linear identity function in hidden layers to output layer was used with feed-forward network architecture. The total number of parameters or weights obtained was 15 and 25 for SPI3 and SPI6, respectively. Diagnosis checking of residuals by Box-Pierce non-correlation test, which indicates the non-auto correlated or random nature of the residuals (p = 0.83 and p = 0.72) for SPI3 and SPI6 was done.

SVR Model
For the SVR model, parameters were considered based on the least RMSE values. The parameter specifications are listed in Table 7 for both SPI3 and SPI6 timescales, where the radial basis function (RBF) was used as a kernel function, and the number of support vectors obtained are 451 and 337, respectively. The cost function, gamma, and epsilon are 7.9, 0.16, and 0.01 for SPI3, and 7.7, 0.16, and 0.1 for SPI6, respectively. Box-Pierce tests done for diagnostic checking of residuals are random and white noise in nature as p = 0.27 and 0.53 for both SPI3 and SPI6, respectively.

Discussion
Drought characterization by using SPI3 and SPI6 timescales indicated different categories of water stresses, i.e., moderately dry, severely dry, and extremely dry. Ten months showed extreme water stress in SPI3 and four months showed water stress on the SPI6 time scale, which indicates high water scarcity during the reported period (Table 8). Similar results for different classes of droughts and different time scales were also reported in [43][44][45][46][47][48][49][50]. Table 8. Classification of drought categories in SPI3 and SPI6.

Moderately dry
The modelling and forecasting of SPI3 and SPI6 were compared in terms of MSE, RMSE for both training and testing sets for SPI3 and SPI6, respectively. AI models outperformed the classical model, SVR model was found to be superior over ARIMA and ANN models in training and testing sets for both SPEI3 and SPEI6. Performance hierarchy of these models is as follows: SVR > ANN > ARIMA in both training and validation sets. The detailed comparison of modelling and forecasting performance of different models in training and testing sets are depicted in Tables 9 and 10. Forecasted values and model comparison criteria of testing data sets are given in Table 10. Further, actual vs. fitted data of both training and testing sets are given in Table S2 and are also plotted in Figure 5a,b for drought in both SPI3 and SPI6 time scales, respectively. Similar results were obtained in [51][52][53][54], where the SVR model outperformed ARIMA and ANN models.  The ANN for water stress modeling was developed with the 'Levenberg-Marquardt backpropagation algorithm' in a feed-forward network based on repetitive iterations. The user-defined hyper-parameters for ANN model, such as learning rate and momentum, were fixed based on repetitive experimentation as 0.04 and 0.001, respectively. To obtain the stable model, the network was repeated 30 times with 1250 iterations. The candidate input lags and hidden nodes were determined based on the lowest training errors. Tenfold crossvalidation was carried out for each model and the lowest cross-validation error obtained is reported in Tables 6 and 7. For SPI3 and SPI6 time scales we tuned the model with different permutations and combinations of hyper-parameters and chose the candidate parameters based on the lowest training MSE and RMSE. A similar procedure was followed in [55] for building ANN and SVR models. The comparison criteria, viz., MSE and RMSE exhibit only the observed differences between the predicted values of the models. The DM test statistics were used to determine the statistically significant differences among the models used. The ARIMA (M1) and ANN (M2) models are significantly different with respect to SVR (M3) model for both SPI3 and SPI6. Inter combinational significances are clearly mentioned in Table 11 for both SPI3 and SPI6. The performance of SVR model is significantly superior to the other two models in the input and testing data sets for both SPI3 and SPI6. These results may be due to the ability of the SVR model to account for the nonlinear and complex patterns in the input data sets, which can be a comprehensive tool for managing and alleviating water stress in plants. Similarly, the DM test was applied in [27,56,57] for significant comparison of modeling performance of the developed model in both training and validation data sets.

Conclusions
The present study was carried out to understand the short and medium-term water stress in agroecosystems of the Hyderabad region using SPI3 and SPI6 timescales. The results showed that during the period 1982-2021, 13 months were identified as severe drought, 10 months as extreme drought, and 29 months as moderate drought on the shortterm scale (SPI3). Similarly, the medium-term drought index (SPI6) identified 4 months as extreme drought, 11 months as severe drought, and 18 months as moderate water stress during the period 1982-2021. In addition, classical models and AI models were used to predict water stress, confirming that among the AI models, SVR performs better vs. ANN, and the classical ARIMA model in terms of MSE and RMSE. The results can be a very useful tool for management and planning at the regional/national level of water (stress) as one of the most important and critical resources in cropping systems. Due to the lack of availability of continuous historical data on other weather parameters for the study region, only rainfall data was used for estimating the drought index in this study. However, future studies need to be expanded with either weather (e.g., air temperature, solar radiation), hydrologic (e.g., water table), agronomic (e.g., yield reduction), or in combination, parameters to allow detailed characterization of drought using various machine learning-based models.

Data Availability Statement:
The data presented in this study were collected from the reports of the Directorate of Economics and Statistics, Government of Telangana. The data presented in this study are available on request from the corresponding author.