1. Introduction
Drought is a recurring natural hazard that harms society and the environment [1]. It strongly affects water availability, climate, and agricultural production, and it can severely damage a region’s economy [2,3]. Drought is hard to define because the duration of an event is difficult to estimate. It builds up gradually and leaves a lasting, severe effect over a wide geographical area without destroying major infrastructure [4,5]. Droughts also vary widely in duration, intensity, and severity. For simplicity, drought is defined here as an event in which water levels are low because of a persistent lack of rainfall.
Droughts take several forms, namely meteorological, agricultural, hydrological, and socioeconomic, of which meteorological drought is the most common [6,7]. Meteorological drought occurs when precipitation falls well below the average. It is the most studied type in drought monitoring because it often initiates the others [8]. The frequency of meteorological drought depends not on the amount of precipitation in a region but on its variability: large negative departures of rainfall from the local norm indicate drought. Any climatic zone can therefore experience it, including Northeast India, one of the wettest regions in the world, and humid tropical regions such as Malaysia [9], India [10], and Bangladesh [11,12]. A recent study found that the rainier regions of the Earth, such as the tropics, will be more at risk of devastating droughts than ever [13]. Drought monitoring and forecasting in tropical areas therefore need more attention. Droughts are often more damaging in tropical areas because their ecosystems are accustomed to high annual rainfall [14].
Maharashtra, a tropical state in western India, has faced a series of devastating droughts [15]. These droughts were caused by low rainfall and low water reserves. The resulting water scarcity has a profound impact on the environment and on the people who live there, affecting agriculture, infrastructure, and health [16,17]. These effects are intensified by population growth, land-use change, agricultural expansion, and industrial development [18]. A sound understanding of drought and appropriate drought models are urgently needed to support viable planning and management of water resources. However, drought is emerging as a serious issue, and its characteristics make it difficult to determine the duration and intensity of an event, and equally difficult to define the spatial extent of droughts and their inter-arrival period [19,20].
Drought is a common natural disaster in Maharashtra and typically occurs about once a year in one part of the state or another [21]. According to the Government of Maharashtra, the state has 112.4 million people and 197.9 billion cubic meters (BCM) of available water resources, comprising 163.9 BCM of surface water and 33.9 BCM of groundwater. About 43% of the area of Maharashtra lies in deficit or highly deficit sub-basins and experiences recurrent droughts. The state is more prone to agricultural damage from drought than from other natural disasters [21]. Meteorological drought has therefore become an active research topic in India [22,23]. Although many studies have evaluated the effects and risks of drought for agriculture, the economy, water resources, and society [22,24], no research has been conducted on forecasting droughts in India. Such forecasting is especially important for India because the country faces issues such as climate change.
The weather in India is changing because of the country’s proximity to the equator and rising global temperatures [25,26]. With this change, the country has experienced major weather extremes, flooding, and other disasters. In regions of India where drought is a constant concern, experts have noted an increase in the return period of droughts due to changes in temperature and rainfall patterns [27]. Climate change is expected to increase the economic damage caused by droughts by affecting water resources and generating water scarcity [28]. These negative effects call for models that forecast and monitor drought effectively, so that strategies for managing drought-related risks can be planned in time [29,30]. Drought forecasting is an essential part of drought management: poor forecasts lead to poor management and can even harm the environment. There is thus a demand for fast, reliable, and precise forecasting models that provide quantitative information on forthcoming drought-related hazards. With such models, droughts can be forecast accurately by using the right combination of input variables or drought indices [31,32,33].
A wide variety of drought indices (DIs) have been developed to monitor drought [34]. One of the most detailed and statistically robust is the standardized precipitation index (SPI), which is simple, easy to understand, and independent of climatic factors other than precipitation [26]. A new type of SPI [35] has been introduced to help predict drought. It has been widely adopted by the drought forecasting community and used in many studies to explore drought variability in agricultural and hydrological regions [1,36,37].
Machine learning (ML) techniques allow computers to learn from previous inputs and improve without extensive explicit programming [38]. ML algorithms have been applied in many climatological domains to build models that reproduce the empirical relationships among variables, including rainfall and temperature forecasting [39], drought forecasting [40], extreme weather forecasting [41], and streamflow modelling [42]. Some of the most widely used ML algorithms for modelling such relationships are relevance vector machines (RVM), artificial neural networks (ANN), k-nearest neighbours (KNN), extreme learning machines (ELM), support vector machines (SVM), genetic programming (GP), and random forests [43,44,45,46,47,48,49]. Many models exist for forecasting droughts, including the autoregressive integrated moving average (ARIMA), multiple linear regression (MLR), and Markov chain models [50]. The SPI is derived from the probability distribution of rainfall, so its scale is not linear. This is troublesome for drought forecasting because traditional statistical techniques have difficulty predicting such an index. ML has been shown to be a valuable tool in responding to climate change and has recently modelled drought indices and climatological variables with high accuracy. Many types of ML model can be used to predict the SPI; among the most popular are the ANN and the M5 model tree (M5P) [51,52]. Although researchers have proposed many models for DIs, it is difficult to generalize them or to create a “perfect” model that works for the tropical region. In addition, an inappropriate combination of inputs in a model’s structure can give misleading results, and each area responds differently to stochastic events and historical conditions. Therefore, there is a need to identify the best model for predicting the SPI in the tropical region.
The present research focuses on drought forecasting because, over the last five decades, the study area has suffered severely from drought and from shortages of water for irrigation and drinking. In this area, reliable forecasting should be central to planning for and managing drought, and effective plans should be developed to lessen the impact of drought on human, agricultural, and hydrological systems. We therefore investigated the viability and usefulness of ML models for estimating SPI-3 and SPI-6 in the study area during 2000–2019. After many candidate inputs were constructed, best subset regression was used to choose the most useful ones as inputs to the models. Although ML models are widely used for forecasting, this paper focuses on developing such models specifically for SPI forecasting in Maharashtra, India. ANN models in three configurations and an M5 model tree (M5P) were developed for forecasting the SPI at two time scales (SPI-3 and SPI-6). The models were developed using rainfall data from two stations in India (Angangaon and Dahalewadi) for the period 2000–2019, with three objectives: (1) to develop and compare the ML models based on the best input combination and sensitivity analysis; (2) to forecast SPI-3 and SPI-6; and (3) to find the best models for meteorological drought forecasting in the semi-arid region.
3. Results
This section presents the selection of inputs for the SPI-3 and SPI-6 prediction models, the sensitivity analysis of the input parameters, and the performance evaluation of the developed models at both stations (Angangaon and Dahalewadi). The input variables of the models, such as SPI-1 to SPI-24, are lagged values of the standardized precipitation index (lags of one to 12 months) used in the different model scenarios. All SPI values were estimated month by month for the study area using the SPI package in R. The results are presented in the following subsections with descriptions, tables, and figures.
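To make the SPI-3 and SPI-6 time scales concrete, the following Python sketch computes an SPI-like index from monthly rainfall. It is only an illustration under stated assumptions: the study used an R package, whereas this sketch replaces the usual fitted distribution with a simple rank-based (empirical) probability, pools all calendar months together, and uses hypothetical rainfall values. The function names are illustrative and not taken from the study.

```python
from statistics import NormalDist


def rolling_sum(values, window):
    """Sum of each run of `window` consecutive monthly totals."""
    return [sum(values[i - window + 1:i + 1])
            for i in range(window - 1, len(values))]


def empirical_spi(precip, window):
    """Rank-based SPI approximation for a given accumulation window.

    Assumption: the non-exceedance probability is estimated with the
    plotting position rank / (n + 1) instead of a fitted distribution.
    """
    totals = rolling_sum(precip, window)
    n = len(totals)
    standard_normal = NormalDist()
    spi = []
    for total in totals:
        rank = sum(1 for other in totals if other <= total)
        probability = rank / (n + 1)  # always strictly between 0 and 1
        spi.append(standard_normal.inv_cdf(probability))
    return spi


# Hypothetical monthly rainfall totals (mm), for illustration only.
monthly_rain = [10, 0, 5, 120, 200, 310, 150, 60, 20, 5, 0, 15,
                30, 2, 8, 90, 250, 280]
spi3 = empirical_spi(monthly_rain, window=3)  # 3-month time scale
spi6 = empirical_spi(monthly_rain, window=6)  # 6-month time scale
```

Negative values indicate drier-than-usual accumulation periods and positive values wetter ones; a longer window smooths the series, which is one reason SPI-6 is often easier to predict than SPI-3.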
3.1. Input Selection Using the Best Subset Model for SPI-3 and SPI-6
Regression analysis was performed on 12 different input combinations to select the best combination for model development at each station. The best combinations were then used to develop the models for predicting SPI-3 and SPI-6 at the Angangaon and Dahalewadi stations. The selection is based on the mean square error (MSE), coefficient of determination (R²), adjusted R², Mallows’ Cp, Akaike’s AIC, and Amemiya’s PC: the best combination has the highest R² and adjusted R² and the lowest MSE, Mallows’ Cp, AIC, and PC.
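The selection criteria named above can be computed from a fitted regression's sums of squares. The sketch below uses commonly cited textbook forms of these statistics; the exact definitions used by the authors' software are not given in the text, so the formulas, the function name, and the example numbers (including the sample size) are assumptions for illustration.

```python
import math


def selection_criteria(sse, sst, n, p, sigma2_full):
    """Subset-selection statistics for a linear model with p predictors.

    sse: residual sum of squares of the candidate model
    sst: total sum of squares of the target
    n: number of samples
    sigma2_full: residual variance of the full model (for Mallows' Cp)
    """
    k = p + 1  # number of parameters including the intercept
    r2 = 1 - sse / sst
    return {
        "mse": sse / (n - k),
        "r2": r2,
        "adj_r2": 1 - (1 - r2) * (n - 1) / (n - k),
        "mallows_cp": sse / sigma2_full - n + 2 * k,
        "aic": n * math.log(sse / n) + 2 * k,
        "amemiya_pc": (1 - r2) * (n + k) / (n - k),
    }


# Hypothetical inputs: 7 predictors, 229 samples, R² of 0.758.
stats = selection_criteria(sse=24.2, sst=100.0, n=229, p=7,
                           sigma2_full=24.2 / 221)
```

With these hypothetical inputs the adjusted R² comes out close to 0.750, which shows how small the penalty for seven predictors is when a couple of hundred monthly samples are available.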
Table 1 shows the regression analysis performed to determine the best input combination for SPI-3 and SPI-6 prediction at the Angangaon station. Table 1A shows that combination 7, with variables SPI-1/SPI-3/SPI-4/SPI-5/SPI-8/SPI-9/SPI-11, has the highest R² and adjusted R² (0.758 and 0.750, respectively) for SPI-3 prediction. Similarly, Table 1B shows that combination 4 (SPI-1/SPI-2/SPI-6/SPI-7) was selected as the best input combination for predicting SPI-6 at the Angangaon station.
Table 2A,B shows the performance evaluators used to select the best input combination at the Dahalewadi station. Combination 7 (SPI-1/SPI-3/SPI-4/SPI-5/SPI-8/SPI-9/SPI-11) and combination 4 (SPI-1/SPI-2/SPI-6/SPI-7) were selected as the best input combinations for predicting SPI-3 and SPI-6, respectively. For SPI-3, combination 7 has the highest R² and adjusted R² (0.758 and 0.750) and the lowest MSE (0.471) (Table 2A). For SPI-6, combination 4 has the highest R² and adjusted R² (0.847 and 0.844) and the lowest MSE (0.417) at the Dahalewadi meteorological station (Table 2B).
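An input combination such as SPI-1/SPI-2/SPI-6/SPI-7 corresponds to a set of lagged copies of the SPI series. The sketch below shows one way to assemble such a lagged input table in Python; the function name and the toy series are illustrative assumptions, not the authors' code.

```python
def build_lagged_inputs(spi, lags):
    """Pair each SPI(t) with its lagged predictors SPI(t - lag).

    Returns (rows, targets), where rows[i] holds the lagged values in
    the order given by `lags` and targets[i] is the value to predict.
    """
    max_lag = max(lags)
    rows, targets = [], []
    for t in range(max_lag, len(spi)):
        rows.append([spi[t - lag] for lag in lags])
        targets.append(spi[t])
    return rows, targets


# Toy series standing in for a monthly SPI-6 record.
series = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
combination_4 = [1, 2, 6, 7]  # SPI (t-1), SPI (t-2), SPI (t-6), SPI (t-7)
inputs, targets = build_lagged_inputs(series, combination_4)
```

The first `max(lags)` months cannot be used as targets because their lagged predictors fall before the start of the record, so longer lags shorten the usable sample.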
3.2. Sensitivity Analysis
Sensitivity analysis was performed on the selected input variables to identify the most effective parameters at the Angangaon and Dahalewadi stations. The results for SPI-3 and SPI-6 at the Angangaon station are presented in Table 3A,B. Table 3A shows that the input parameters SPI (t-1), SPI (t-3), SPI (t-4), SPI (t-5), SPI (t-8), SPI (t-9), and SPI (t-11), with standardized coefficient (β) values of 0.916, −0.168, 0.146, −0.138, 0.121, −0.094, and 0.113, respectively, were the effective parameters for predicting SPI-3 at both stations. Similarly, for SPI-6 prediction, SPI (t-1) (β = 1.017), SPI (t-2) (β = −0.085), SPI (t-6) (β = −0.184), and SPI (t-7) (β = 0.167) were the effective parameters at both stations (Table 3B). SPI (t-1) is therefore the most sensitive parameter, with the highest β values of 0.916 and 1.017 for SPI-3 and SPI-6 prediction, respectively (Table 4A,B). The effective input parameters are shown graphically in Figure 2 and Figure 3.
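Standardized coefficients (β) of this kind are commonly obtained by regressing the z-scored target on the z-scored inputs, so that the magnitudes of the coefficients can be compared across inputs. The following Python sketch illustrates that idea with a small least-squares solver; it is an assumption about how such values can be computed, not a description of the authors' procedure, and the data are hypothetical.

```python
from statistics import mean, stdev


def zscores(values):
    """Centre a series on zero and scale it to unit sample deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def standardized_coefficients(columns, target):
    """Least-squares coefficients after z-scoring inputs and target."""
    zx = [zscores(col) for col in columns]
    zy = zscores(target)
    k = len(zx)
    # Normal equations; no intercept is needed because all means are zero.
    xtx = [[sum(u * v for u, v in zip(zx[i], zx[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(u * v for u, v in zip(zx[i], zy)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical predictors and a target built as 2*x1 - 3*x2.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [2 * a - 3 * b for a, b in zip(x1, x2)]
betas = standardized_coefficients([x1, x2], y)
```

Each β equals the raw regression coefficient multiplied by the ratio of the input's standard deviation to the target's, so the input with the largest absolute β has the strongest influence per standard deviation of change.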
3.3. Evaluation of Machine Learning Models Based on the Best-Selected Subset Models
The performance of the developed ANN (4, 5), ANN (5, 6), ANN (6, 7), and M5P models was assessed with several statistical indicators, namely MAE, RMSE, RAE, RRSE, and R². The best model is the one with the highest R² and the lowest MAE, RMSE, RAE, and RRSE.
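For reference, the indicators listed above can be computed as in the following Python sketch. The definitions of RAE and RRSE follow the usual convention of normalizing by a predictor that always outputs the mean of the observations and expressing the result in percent, and R² is taken as the squared correlation coefficient (the reported pairs, for example r = 0.884 and R² = 0.782, are consistent with that reading). The function name and data are illustrative assumptions.

```python
import math


def evaluate(actual, predicted):
    """Return MAE, RMSE, RAE (%), RRSE (%), r, and R² for two series."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    abs_dev = sum(abs(a - mean_a) for a in actual)
    sq_dev_a = sum((a - mean_a) ** 2 for a in actual)
    sq_dev_p = sum((p - mean_p) ** 2 for p in predicted)
    cov = sum((a - mean_a) * (p - mean_p)
              for a, p in zip(actual, predicted))
    r = cov / math.sqrt(sq_dev_a * sq_dev_p)
    return {
        "MAE": abs_err / n,
        "RMSE": math.sqrt(sq_err / n),
        "RAE": 100 * abs_err / abs_dev,
        "RRSE": 100 * math.sqrt(sq_err / sq_dev_a),
        "r": r,
        "R2": r ** 2,
    }


# Hypothetical observed and predicted SPI values.
scores = evaluate([0.0, 1.0, 2.0, 3.0], [0.5, 1.0, 2.0, 2.5])
```

RAE and RRSE below 100% mean the model beats the trivial mean predictor, which makes them easier to compare across stations than MAE and RMSE alone.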
3.3.1. Angangaon Station
Table 5A,B shows the results of the ANN (4, 5), ANN (5, 6), ANN (6, 7), and M5P models for predicting SPI-3 and SPI-6 based on the statistical indicators. Table 5A reveals that the M5P model outperformed the ANN (4, 5), ANN (5, 6), and ANN (6, 7) models for SPI-3 prediction during both the training and testing phases. For the M5P model, the training and testing values were, respectively, 0.709 and 0.388 for MAE, 0.948 and 0.551 for RMSE, 76.47 and 48.58 for RAE, 67.61 and 48.21 for RRSE, and 0.757 and 0.884 for r. For SPI-6 prediction (Table 5B), ANN (6, 7) performed best during training, while the M5P model was superior during testing. During training, the MAE, RMSE, RAE, RRSE, and r values for ANN (6, 7) were 0.502, 0.743, 45.77, 48.56, and 0.885, respectively; during testing, the corresponding values for the M5P model were 0.396, 0.530, 46.85, 37.80, and 0.927. Line plots and scatter plots for the ANN (4, 5), ANN (5, 6), ANN (6, 7), and M5P models during the testing phase are shown in Figure 4 (SPI-3 prediction) and Figure 5 (SPI-6 prediction). The coefficients of determination (R²) for the ANN (4, 5), ANN (5, 6), ANN (6, 7), and M5P models were 0.705, 0.726, 0.740, and 0.782, respectively, for SPI-3 prediction. For SPI-6 prediction, R² was 0.861 for the ANN and 0.857 for the M5P model. The predictions of the developed models agree well with the 1:1 line. The quantitative and qualitative analyses therefore indicate that the M5P model was the most accurate model for predicting SPI-3 and SPI-6 at the Angangaon station. All the developed models also performed better for SPI-6 prediction than for SPI-3 prediction.
3.3.2. Dahalewadi Station
The results of the developed ANN (4, 5), ANN (5, 6), ANN (6, 7), and M5P models for predicting SPI-3 and SPI-6, based on the performance evaluators, are shown in Table 6A,B. For SPI-3 prediction, the M5P model was superior, with training and testing values of, respectively, 0.708 and 0.388 for MAE, 0.947 and 0.551 for RMSE, 76.38 and 48.57 for RAE, 67.53 and 48.21 for RRSE, and 0.758 and 0.885 for r (Table 6A). Similarly, for SPI-6 prediction (Table 6B), the training and testing values of the M5P model were, respectively, 0.454 and 0.396 for MAE, 0.710 and 0.530 for RMSE, 41.39 and 46.84 for RAE, 46.38 and 37.80 for RRSE, and 0.888 and 0.927 for r. Furthermore, the graphical analysis showed that the R² values for the ANN (4, 5), ANN (5, 6), ANN (6, 7), and M5P models were 0.705, 0.726, 0.740, and 0.782, respectively, during the testing phase for SPI-3 prediction (Figure 6). Likewise, the values were 0.861, 0.862, 0.861, and 0.860, respectively, during the testing phase for SPI-6 prediction (Figure 7). The predictions of the developed models agree well with the 1:1 line. The results also improved during the testing phase for all developed models. The M5P model therefore outperformed the other developed models at the Dahalewadi station. Comparing the SPI-3 and SPI-6 models, the M5P model was superior for SPI-6 prediction, and the results of all models improved for SPI-6 prediction.
4. Discussion
It is crucial to explore the potential of machine learning and data mining approaches for dry-season monitoring in order to develop better adaptation methods. Recent research has shown that certain climate events, such as drought episodes and the risks they pose, can be predicted accurately with machine learning algorithms [62]. Machine learning techniques are now widely used in several scientific fields, including flood prediction and evaluation [63], determination of dust pollution [64], soil and landscape modelling [65], and landslide susceptibility assessment [66]. According to earlier studies, machine learning models outperform conventional statistical techniques. Machine learning algorithms can also handle very large datasets and produce more accurate results [65,66]. For drought forecasting, we split the SPI-3 and SPI-6 datasets into 80% for training and 20% for testing during model development. The dataset is a time series covering 2000–2019, so the 20% subset was used for testing, and the models performed well in drought forecasting. Machine learning models benefit from large datasets; models built on larger datasets are generally more accurate. All models were checked against performance metrics such as RMSE and MSE. These metrics and the Taylor diagram also help identify the best models for SPI-3 and SPI-6 drought forecasting, particularly in the semi-arid region. Before modelling, we cleaned the dataset and removed the missing values. The ML models performed well across all datasets, and their results are sufficient for drought forecasting in the semi-arid region. A model that achieves about 70% accuracy is very useful for drought forecasting under climate change in the semi-arid region. We took ground conditions as the basis and developed the SPI-3 and SPI-6 drought forecasts to help farmers and protect crops, particularly in the winter and summer seasons.
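Because the SPI record is ordered in time, an 80/20 split of this kind is usually made chronologically rather than by random shuffling, so that the test period lies entirely after the training period. The text does not state how the split was made, so the following Python sketch is an assumption for illustration, with a hypothetical series length of 240 months (2000–2019).

```python
def chronological_split(series, train_fraction=0.8):
    """Split an ordered series into earlier (train) and later (test) parts."""
    cut = round(len(series) * train_fraction)
    return series[:cut], series[cut:]


# Hypothetical monthly index: 20 years of data, 12 values per year.
months = list(range(240))
train, test = chronological_split(months)
```

With 240 monthly values this yields 192 months for training and the final 48 months for testing, and no future value leaks into the training set.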
The Taylor diagram helps in understanding model performance for SPI-3 and SPI-6 because it summarizes accuracy in terms of the correlation coefficient and the standard deviation. It displays the performance of several models on a single diagram, which is why many researchers now use it for model evaluation. Excessive evapotranspiration and moisture deficiency are two effects of extreme droughts on the water resource balance [67]. Some researchers have shown that drought causes severe socioeconomic losses, decreased agricultural productivity, and environmental deterioration [68]. The onset of drought is indicated by a downward trend relative to the long-run average (normal) precipitation of a given basin [69,70]. Droughts are characterized by low relative humidity, high temperatures, high wind velocity, and rainfall characteristics such as the intensity, duration, and distribution of rainfall during agricultural growing seasons [15]. The Taylor diagram [13] represents the performance of all developed models in terms of the correlation coefficient (r), root mean square deviation (RMSD), and standard deviation (SD) at the Angangaon station (Figure 8) and the Dahalewadi station (Figure 9). Figure 8 and Figure 9 show that the M5P model has higher r values and lower RMSD and SD values than the ANN (4, 5), ANN (5, 6), and ANN (6, 7) models. Therefore, the M5P model was superior to the other developed models at both stations.
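The three quantities plotted on a Taylor diagram are tied together by a law-of-cosines relation, which is what allows them to share one figure. The Python sketch below computes them for one model and checks that relation; the function name and the data are hypothetical, and population (not sample) standard deviations are assumed.

```python
import math
from statistics import mean, pstdev


def taylor_statistics(observed, modelled):
    """Return (r, sd_observed, sd_modelled, centred RMSD)."""
    n = len(observed)
    mean_o, mean_m = mean(observed), mean(modelled)
    sd_o, sd_m = pstdev(observed), pstdev(modelled)
    cov = sum((o - mean_o) * (m - mean_m)
              for o, m in zip(observed, modelled)) / n
    r = cov / (sd_o * sd_m)
    # Centred RMSD: the mean bias is removed before squaring.
    crmsd = math.sqrt(sum(((m - mean_m) - (o - mean_o)) ** 2
                          for o, m in zip(observed, modelled)) / n)
    return r, sd_o, sd_m, crmsd


# Hypothetical observed and modelled SPI values.
obs = [-1.2, -0.4, 0.3, 0.9, 1.5, -0.7]
mod = [-1.0, -0.1, 0.2, 0.6, 1.1, -0.9]
r, sd_o, sd_m, crmsd = taylor_statistics(obs, mod)

# Law of cosines: RMSD² = SD_o² + SD_m² - 2 · SD_o · SD_m · r
assert math.isclose(crmsd ** 2,
                    sd_o ** 2 + sd_m ** 2 - 2 * sd_o * sd_m * r)
```

On the diagram, a model plots closer to the observation point when its r is high, its SD matches the observed SD, and its centred RMSD is small.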
5. Conclusions
The purpose of this study was to investigate the feasibility of using machine learning models to forecast the SPI drought index at two time scales (SPI-3 and SPI-6) in Maharashtra, India. The models were developed from monthly rainfall data for 2000–2019 at two meteorological stations (Angangaon and Dahalewadi). The forecasting models were built with the help of the statistical auto-correlation method. For SPI-3 prediction, the best input combination, combination 7 (SPI-1/SPI-3/SPI-4/SPI-5/SPI-8/SPI-9/SPI-11), has the highest R² and adjusted R² (0.758 and 0.750) and the lowest MSE (0.471), while for SPI-6 prediction, combination 4 (SPI-1/SPI-2/SPI-6/SPI-7) has the highest R² and adjusted R² (0.847 and 0.844) and the lowest MSE (0.417) at both meteorological stations. Moreover, SPI (t-1) is the most sensitive parameter, with the highest β values of 0.916 and 1.017 for SPI-3 and SPI-6 prediction, respectively. The forecasts obtained with ANN (4, 5), ANN (5, 6), and ANN (6, 7) are consistent, with low RMSE and high R² at both stations for SPI-3 and SPI-6. However, the M5P model shows the best performance: its RMSE values for SPI-3 and SPI-6 were (0.948, 0.919) and (0.947, 0.710) during training, and (0.551, 0.530) and (0.551, 0.530) during testing, at the Angangaon and Dahalewadi meteorological stations, respectively. The quantitative and qualitative analyses indicate that the M5P model was the most accurate model for predicting SPI-3 and SPI-6 at both stations. This research will assist in establishing a forecasting system that can be used for the studied rainfall stations. It will also be a valuable tool for planners, policymakers, and water resource managers seeking to mitigate droughts.