Kabul River Flow Prediction Using Automated ARIMA Forecasting: A Machine Learning Approach

The water level in a river defines the nature of flow and is fundamental to flood analysis. Extreme fluctuations in river water levels, such as floods and droughts, are catastrophic; forecasting them at an early stage could therefore help prevent disasters and allow relief efforts to be set up in time. This study aims to digitally model the water level in the Kabul River to prevent and alleviate the downstream effects of any change in water level. The study used a machine learning tool known as the automatic autoregressive integrated moving average (ARIMA) to forecast the river flow. Based on hydrological data collected for the Kabul River in Swat, the water levels from 2011–2030 were forecasted using the model with the lowest Akaike Information Criterion value, 9.216. It was concluded that the water flow started to increase from the year 2011 until it reached its peak value in 2019–2020, and that the water level will then stay between a maximum of 250 cumecs and a minimum of 10 cumecs until 2030. This research could prove helpful in establishing guidelines for hydrological designers, for the planning and management of water, for hydropower engineering projects, as an indicator for weather prediction, and for the people who are greatly dependent on the Kabul River for their survival.


Introduction
In ancient times, cities were established on the banks of rivers so that their inhabitants could take advantage of the opportunities offered by the river in terms of food, trade, and defence, and the same is applicable in this era of advancement as well [1,2]. Water is necessary for human existence. River water is a source of life for the domestic, industrial, irrigational, and energy sectors [3]. River basin management is a scientific and technical area of study and involves several intricacies because of the various features of particular rivers, their offshoot branches, and the land they drain [4]. Therefore, it becomes fundamental for engineers to understand the likely behavior of rivers.
The behavior of river water is often unexplainable and unexpected. However, water behavior can be studied and controlled by structural (dams, reservoirs, and barrages) and non-structural (disaster prevention, response mechanisms, and floodproofing) measures. Based on past values, hidden information like the flow of water at a specific time can be revealed using forecasting techniques, which can help early response actions and prevent disasters [5]. Water level and runoff forecasting is a measure of the non-structural type that is essential for modelling natural hazards [6]. Forecasting the water flow of a river is directly related to the developmental activities in nearby regions of the country as it is used in the planning of the cities, the management of river basins, the making of dams, the calculating and controlling of risks related to floods and droughts, and for supplying water for household usage and generating power [7].
The Kabul River originates in the Hindu Kush mountains and covers about 700 km before joining the Pakistan water system [8]. The catchment area of the Kabul River in Pakistan is 14,000 km², while 62,908 km² lies in Afghanistan, which makes the overall catchment area 76,908 km² [9]. The Kabul River has an overall basin area of 87,499 km² [10]. Although the Kabul River originates in Afghanistan, that country faces water shortages because perpetual war has left it without adequate water storage infrastructure [11].
Apart from the Kabul River, the other major rivers of Pakistan enter the country from India. As the upper riparian discharge comes under the jurisdiction of India, Pakistan cannot control the water level to fulfill its water requirement [12,13]. This makes it even more important for an agricultural country like Pakistan to plan for using its present and future water flow more efficiently. If Pakistan fails to acknowledge the behavior and importance of the Kabul River, it will face water scarcity similar to that of Afghanistan, since both countries depend largely on agriculture that uses Kabul River water. Figure 1 illustrates the river's origin and its basin location. It is clear that the river originates in Kabul and extends into Pakistan.
The existential threat to the Kabul River is the change imposed by climatic conditions, which has also made forecasting of river flow essential because disturbed rainfall patterns have already started to seriously affect the availability of water [14]. Climatic conditions are getting worse day by day, and weather anomalies have direct effects on rivers like the Kabul River. It is estimated that precipitation will decrease by 50% in the Kabul River basin towards the end of this century, which will produce floods of unforeseeable flow and will negatively impact streamflow dynamics [15]. It is also expected that the Khyber Pakhtunkhwa province of Pakistan will be severely affected economically and by a water crisis by 2080, and that this crisis will result in a considerable decrease in wheat and maize production due to climate change [16].
The people dependent on the Kabul River Basin have been greatly affected by the temperature rise and the shifting of precipitation patterns; moreover, the melting of a glacier in the Hindu Kush region created havoc in the 2010 floods, which caused considerable damage to the Pakistan economy (855 billion Rupees) and displaced 20 million people residing near the banks of the river [17,18]. It is estimated that precipitation will decrease by 20% due to the shift in the monsoon season, which, combined with the effect of melting glaciers, will affect the existence of millions of people, as has already been seen in the 2010 floods, in which a significant fertile area was lost [19].
The operation of dams depends on river flow, and the Warsak Dam is one of the most important dams of Pakistan in terms of irrigation and energy generation; it is therefore necessary to study the past inflow and outflow in order to forecast future values, which could help in meeting the water demands of the country [20,21]. The Kabul River serves as a lifeline for Pakistan, providing safe drinking water for 2 million people in Peshawar city and its subregions. Pakistan built the Warsak Dam on the Kabul River in 1960, and it generates 243 MW of hydropower [22]. Any increase or decrease in the water level of the Kabul River will threaten the balance of life in Pakistan and will result in catastrophic consequences. For example, floods in the Kabul River happen twice a year, once due to snowmelt from April to September and again as a result of torrential monsoon rainfall in August [23]. As global warming increases, the snow melts more quickly, and the resulting discharge in the river causes floods. It is estimated that a 1.5 °C or 2 °C rise in temperature results in a 34% or 43% increase, respectively, in runoff from the upstream Indus basin [24]. In the 2010 floods in the Indus river basin, 5.4 million acres of land were lost, 2200 people lost their lives, and 14 million people were left homeless, which resulted in a loss of 43 billion USD [25]. Similarly, any decrease in the water level of the river can adversely affect agricultural activities in Pakistan, as the agriculture sector was the fifth-highest contributor to Pakistan's overall Gross Domestic Product (GDP) in 2020, and 35.89% of its people are employed in this sector [26].
Human activities like hydropower structures, an explosion in population, a heavy amount of silt, inadequate rainfall annually, unregulated urbanization, illegal settlements, and unapproved water channels from this river have caused a reduction in its water level. Therefore, in terms of its importance for human existence and increased water demand, it is necessary for Pakistan to limit its future water demand and flow [27,28].
Keeping in mind the importance of the above discussion, this study aims to forecast the flow of the Kabul River till the year 2030. To better prepare for recurring natural flood vulnerabilities and avert monetary losses and casualties, possible future changes in flow rate intensity in the Kabul River basin should be analyzed. The objective of this study is to use an effective learning algorithm that could accurately predict and evaluate the different patterns of water levels based on various periods. Another objective of this analysis is to help the upstream technicians of the reservoir by providing a better forecasting tool for the prediction of the expected water levels using the Automatic ARIMA model. The achieved objective will be significant to the relevant authorities because it will help them to plan socio-economic developmental activities efficiently to enable them to cater for future needs, provide water-restraining structures in case of floods, and prepare strategies for water disasters, and it will help relief workers to reduce irreversible human and economic losses.

Literature Review
Numerous studies have been conducted to forecast river flow around the globe. Previously, hydrological events were forecasted using conventional methods to predict runoff discharge, capacity, and streamflow of water-level; however, machine learning (ML) is now increasingly being used in hydrological forecasting [29,30]. The term ML implies that machines analyze, cluster, extract complex linkages, and make decisions without programming [31]. The added advantage of using ML is its ability to determine the patterns of the input data and produce output results by analysing the complex structures hidden in the data [32]. The data-driven forecasting models as used in this study are based on the historical data of the water levels, including runoff volumes, storage capacity, and river discharge. This approach includes the use of statistical data as input variables to measure the extent of water flow using output variables [33]. Various algorithms like artificial neural network (ANN) and adaptive neuro-fuzzy inference systems (ANFIS) were used to forecast the water level using hydrological variables like temperature, wind, and evaporation. The water level data from 2007 to 2011 of Chahnimeh Reservoirs in Zabol, Iran was used for analysis. It was found that the ANFIS model was better at predicting the future values of water levels compared to ANN due to it more closely fitting the original values [34].
Various ML models like support vector machine (SVM), ANN, ANFIS, and generalized regression neural networks (GRNN) were used to estimate the water reservoir level at Millers Ferry Dam on the Alabama River in the USA. When the results were compared to moving average (MA) and autoregressive moving average (ARMA) models, it was found that the output of ANFIS model 5 was more promising due to the lowest value of mean absolute error (MAE), R², and mean squared error (MSE) [35]. Some researchers used semi-hybrid models like the Wavelet-based Artificial Neural Network (WANN) and the Wavelet-based Adaptive Neuro-Fuzzy Inference System (WANFIS). The daily water level of the Andong dam in South Korea was forecasted using these two semi-hybrid techniques, and the results were expressed as a comparison of their accuracy. It was concluded that both methods tend to forecast more accurately than the conventional models and can yield more efficient results in daily water level analysis [36,37]. A least-squares SVM (LSSVM) is another type of intelligent algorithm; it was used to predict the daily water level of the Yangtze River in China based on water level data from 2010–2016. Based on the lowest values of root mean squared error (RMSE), index of agreement, and mean absolute percent error (MAPE), the improved LSSVM method tends to provide useful figures for hydrological levels [38]. As Pakistan has constructed the Warsak dam on the Kabul River, electricity generation greatly depends on the water level in this river. Hence, hydroelectric consumption in Pakistan was forecasted based on 53 years of data. Methodologically, the autoregressive integrated moving average (ARIMA) model with (p,d,q) values of (9,1,7) was selected for forecasting. The results revealed that hydroelectric consumption will increase 1.65% annually, with a cumulative increase of 23.4% till 2030 all over Pakistan [39].

Methods
This study used an ML approach to perform the forecasting. The methodology began with the collection of hydrological data from 1961–2005. The time series was then checked for stationarity using the Augmented Dickey–Fuller (ADF) test. The ADF test was introduced by David Dickey and Wayne Fuller in 1979 and tests the null hypothesis that a unit root is present in the time series [40]. The mathematical expression for ADF is given by:

$$\Delta y_t = \alpha + \beta t + \gamma y_{t-1} + \delta_1 \Delta y_{t-1} + \dots + \delta_{p-1} \Delta y_{t-p+1} + \varepsilon_t \tag{1}$$

where α is a constant, β is the coefficient of the time trend, p is the lag order, and $\varepsilon_t$ is the error term. After selecting the appropriate lag order p, the test is executed for the null hypothesis γ = 0 [41]. If the time series is non-stationary, stationarity can be achieved using regression or differencing until the series becomes stationary.
The concept of ARIMA was first developed by Norbert Wiener et al. in 1930–1940. It consists of three parts called autoregressive (AR), integrated (I), and moving average (MA) [42], whereas ARIMA was first put into use for time series forecasting by Box and Jenkins in 1970 [43]. Since then, ARIMA has found wider application in the fields of engineering, economics, hydrology, and social analysis [44]. The first general form of ARMA was given by Peter Whittle in 1951 [45], which can be shown as:

$$y_t = c + \varepsilon_t + \sum_{i=1}^{p} \varphi_i y_{t-i} + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j} \tag{2}$$

where $\varepsilon_t$ is regarded as a white noise term and φ and θ are regarded as the coefficients of the time series. The mathematical forms of AR(p) and MA(q) are given in Equations (3) and (4) [46]. For AR(p), where p is the number of autoregressive terms:

$$y_t = c + \beta_1 y_{t-1} + \beta_2 y_{t-2} + \beta_3 y_{t-3} + \dots + \beta_p y_{t-p} + \varepsilon_t \tag{3}$$

This is a case of multiple regression that includes lagged values of $y_t$ as predictors; p indicates the order of the AR model. For MA(q), where q is the number of moving average terms:

$$y_t = c + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \dots + \theta_q \varepsilon_{t-q} \tag{4}$$

In ARIMA(p,d,q), d is the number of times the series is differenced. An automated ARIMA tool was used, which allows the user to identify a suitable ARIMA specification and to forecast the time series. The automated ARIMA tool is not limited to ARIMA modelling; it considers a variety of model types and identifies candidate models along with their orders. Among these candidates, the model with the lowest Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC) was selected.
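The differencing step and the AR(p) recursion described above can be sketched in a few lines of plain Python. This is only an illustration: the function names `difference` and `ar_one_step`, the coefficient values, and the sample numbers are assumptions made for the example and are not taken from the study or from EViews.

```python
def difference(series, d=1):
    """Apply first differencing d times (the 'I' part of ARIMA)."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def ar_one_step(history, c, betas):
    """One-step AR(p) prediction: c + beta_1*y[t-1] + ... + beta_p*y[t-p]."""
    p = len(betas)
    if len(history) < p:
        raise ValueError("need at least p past values")
    lagged = history[::-1][:p]  # y[t-1], y[t-2], ..., y[t-p]
    return c + sum(b * y for b, y in zip(betas, lagged))


# Illustrative values only, not Kabul River measurements.
flows = [10, 12, 15, 19, 24]
first_diff = difference(flows, d=1)    # [2, 3, 4, 5]
second_diff = difference(flows, d=2)   # [1, 1, 1]
# Hypothetical AR(2) coefficients: 1.0 + 0.5*24 + 0.25*19 = 17.75
prediction = ar_one_step(flows, c=1.0, betas=[0.5, 0.25])
print(first_diff, second_diff, prediction)
```

Each round of differencing shortens the series by one value, which is why a high differencing order is costly on short records.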
AIC was first used by Hirotugu Akaike in 1971 [47,48]. AIC estimates the prediction error and thereby measures the quality of a statistical model relative to other candidate models [49]. It can be expressed mathematically as [50]:

$$\mathrm{AIC} = 2k - 2\ln(\hat{L}) \tag{5}$$

where k is the number of estimated parameters in the model and $\hat{L}$ is the maximum value of the likelihood function. For a given set of models, the model with the lowest AIC is selected. The 2k term acts as a penalty that discourages overfitting. Similarly, BIC (also known as the Schwarz information criterion, SIC) was developed by Gideon E. Schwarz in a 1978 paper [51]. The BIC formula is similar to that of AIC, but the penalty for the number of parameters differs [52]. The BIC can be expressed as [53]:

$$\mathrm{BIC} = k\ln(n) - 2\ln(\hat{L}) \tag{6}$$

where k is the number of parameters estimated by the model, n is the number of data points, and $\hat{L}$ is the maximized value of the likelihood function of the model. For AIC the penalty is 2k, whereas for BIC it is ln(n)k. BIC is also a model selection criterion in which the model with the lowest BIC value is selected. In one comparison, the performance of AIC was found more satisfactory than that of BIC [52]. It is also argued that BIC is better suited to selecting the true model, for which AIC is not appropriate: when selection is based on BIC, the probability of selecting the true model tends to 1 as n → ∞, whereas it stays below 1 in the case of AIC. Yet, the advocates of AIC claim that this is a negligible issue, as there is no "true model" available in the overall set [54][55][56].
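To make the difference between the two penalties concrete, the following sketch evaluates Equations (5) and (6) for three candidate models and picks the lowest score under each criterion. The log-likelihoods, parameter counts, and model labels are hypothetical values chosen for illustration; only the sample size of 480 echoes the training set used later in the study.

```python
import math


def aic(log_likelihood, k):
    """AIC = 2k - 2 ln(L), with log_likelihood = ln(L)."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """BIC = k ln(n) - 2 ln(L)."""
    return k * math.log(n) - 2 * log_likelihood


# Hypothetical candidates: maximized log-likelihood and parameter count.
candidates = {
    "ARMA(1,1)": {"loglik": -2215.0, "k": 3},
    "ARMA(2,4)": {"loglik": -2205.0, "k": 7},
    "ARMA(4,4)": {"loglik": -2204.5, "k": 9},
}
n = 480  # number of observations used for fitting

scores = {
    name: (aic(m["loglik"], m["k"]), bic(m["loglik"], m["k"], n))
    for name, m in candidates.items()
}
best_by_aic = min(scores, key=lambda name: scores[name][0])
best_by_bic = min(scores, key=lambda name: scores[name][1])
print(best_by_aic, best_by_bic)
```

With these made-up numbers AIC prefers the seven-parameter model while BIC prefers the three-parameter one, because ln(480) is roughly 6.17 and so each extra parameter costs about three times more under BIC than under AIC.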
If none of the candidate models provides an adequate fit, the analysis is repeated with different lags in the automated ARIMA tool. After the tool proposed a model, residual error analysis was performed to check the accuracy of the model output. Among many validation procedures, this study used an out-of-sample validation test to check the identification and estimation of the model suggested by the automated ARIMA tool. The concept of an out-of-sample validation test is to compare a withheld portion of the original data set with the values forecasted for the same period. Finally, forecasting was performed for the years 2011–2030. The accuracy of the forecast was assessed using R², and the error analysis was done using root mean squared error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE).
RMSE is the square root of the mean of the squared differences between the predicted and actual values [57]. RMSE shows the spread of the residuals. It can be expressed mathematically as [58]:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2} \tag{7}$$

where N is the number of observations, i indexes the observations, $\hat{y}_i$ is the predicted value, and $y_i$ is the actual value. MAE is regarded as one of many measures that are used in forecast analysis. It is the measure of error between paired observations expressing the same phenomenon [59], or the average of all absolute error terms [60]. It is mathematically expressed as [61]:

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right| \tag{8}$$

where N is the total number of data points and the sum runs over the absolute values of the residuals $y - \hat{y}$. MAPE is a measure of the correctness of the forecast [62] and is expressed as a percentage. Mathematically, it can be shown as [63]:

$$\mathrm{MAPE} = \frac{100}{n}\sum_{t=1}^{n}\left|\frac{A_t - F_t}{A_t}\right| \tag{9}$$

where n is the number of observations, t indexes the observations, $A_t$ is the actual value, and $F_t$ is the forecasted value.
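The three error measures can be computed directly from paired actual and forecasted values. The sketch below follows the standard definitions of RMSE, MAE, and MAPE; the function names and the four sample flows are illustrative assumptions, not results from this study.

```python
import math


def rmse(actual, predicted):
    """Square root of the mean squared difference."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / n)


def mae(actual, predicted):
    """Mean of the absolute residuals."""
    n = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n


def mape(actual, predicted):
    """Mean absolute percentage error; actual values must be non-zero."""
    n = len(actual)
    return 100.0 / n * sum(abs((a - p) / a) for a, p in zip(actual, predicted))


# Illustrative flows in cumecs, not measured data.
actual = [100.0, 200.0, 50.0, 80.0]
forecast = [110.0, 180.0, 55.0, 80.0]
print(rmse(actual, forecast), mae(actual, forecast), mape(actual, forecast))
```

For these values MAE is 8.75 cumecs, RMSE is about 11.46 cumecs, and MAPE is about 7.5%. RMSE exceeds MAE because squaring gives the single 20-cumec miss more weight, which is the reason both are usually reported together.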

Data Collection
The water flow data were collected for the Kabul River in Swat, a city of Khyber Pakhtunkhwa (KP) Province, Pakistan. The historical data from 1961 to 2005 were gathered by the Water and Power Development Authority (WAPDA), a government department. WAPDA then discontinued data collection, and a private consultant named AGES collected the data from 2006 to 2010. Seasonal decomposition was performed to reveal useful information about the time series. The highest recorded value was 110 cumecs in 1991, and the second-highest value of 107.78 cumecs was measured in 2005. The lowest reading was 59.97 cumecs in 1982, as shown in Figure 2a. Figure 2b shows the trend in the data set. Figure 2c illustrates the seasonal factor present in the time series, and Figure 2d shows the residuals present in the data.
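The trend, seasonal, and residual panels described for Figure 2 correspond to a seasonal decomposition of the monthly series. The text does not state which decomposition method was used, so the sketch below assumes a classical additive decomposition with a centered 12-month moving average, applied to a synthetic series rather than the WAPDA/AGES data. The function names are hypothetical.

```python
import math


def centered_moving_average(series, period=12):
    """2 x period centered moving average; None where the window does not fit."""
    half = period // 2
    trend = [None] * len(series)
    for t in range(half, len(series) - half):
        window = series[t - half : t + half + 1]  # period + 1 values
        # End points get half weight so the window spans exactly one period.
        total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        trend[t] = total / period
    return trend


def decompose_additive(series, period=12):
    """Split a series into trend + seasonal + residual (needs >= 2 full periods)."""
    trend = centered_moving_average(series, period)
    buckets = [[] for _ in range(period)]
    for t, (y, tr) in enumerate(zip(series, trend)):
        if tr is not None:
            buckets[t % period].append(y - tr)  # detrended value for this month
    raw = [sum(b) / len(b) for b in buckets]
    mean_raw = sum(raw) / period
    cycle = [s - mean_raw for s in raw]  # seasonal factors sum to zero
    seasonal = [cycle[t % period] for t in range(len(series))]
    residual = [
        None if tr is None else y - tr - s
        for y, tr, s in zip(series, trend, seasonal)
    ]
    return trend, seasonal, residual


# Synthetic four-year monthly series: linear trend plus an annual cycle.
series = [50 + 0.1 * t + 20 * math.sin(2 * math.pi * t / 12) for t in range(48)]
trend, seasonal, residual = decompose_additive(series)
print(round(trend[10], 3), round(seasonal[3], 3))
```

Because the synthetic series contains only a line and a pure annual cycle, the recovered trend follows the line, the seasonal component recovers the 20-unit amplitude, and the residuals are essentially zero. Real flow data would leave visible residuals, as in Figure 2d.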

Forecasting Using Automated ARIMA Tool
In this study, the water flow data of the Kabul River were collected and forecasted with a time series method using the EViews software. EViews is an interactive program well suited to detailed data analysis [64]. EViews allows forecasting with its Automated ARIMA forecasting feature, which saves time compared to traditional programming languages. Despite its name, the automated ARIMA feature selects the model from among the AR, MA, ARMA, ARIMA, and seasonal ARIMA models rather than considering only the ARIMA model. For this time series forecasting, the ARIMA model has been used. There are several tools available for linear time series forecasting, but the body of knowledge credits ARIMA as the most suitable one [5]. The ARIMA model contains autoregressive (AR), integrated (I), and moving average (MA) parts. The AR part describes the relationship between present and past observations, the MA part represents the autocorrelation structure of the error, and the I part represents the differencing level of the series [65]. ARIMA is one of the most powerful and successful linear statistical models for time series forecasting [66]. Research by Valipour and Banihabib [67] showed that the ARIMA model is better than ARMA (autoregressive moving average) because it can make the time series stationary in the training and forecasting phases [68]; that is, it can transform non-stationary data into stationary data. Nevertheless, Yu and Lei [69] believe that a combination of different types of models, i.e., a hybrid approach, is recommended to decrease uncertainty and increase predictive performance.
The first reason for using the Automated ARIMA tool is the selection of appropriate values for p (number of autoregressive terms), d (differences required to achieve stationarity), and q (moving average terms). Determining (p,d,q) by hand is a laborious and time-consuming task, but the Automatic ARIMA function selects the best-fit model automatically based on the lowest value of the chosen criterion, such as AIC or BIC. Secondly, ARIMA modelling accounts for missing data in the time series. As data are missing from 2010–2020, this modelling technique could compensate for the missing data based on the previous readings. Although many factors could affect the water flow in the river, the basic reason for using the ARIMA tool is to predict the missing data without considering those factors, which could lead to uncertainty in the results. The estimation of missing data helps engineers, designers, and flood control departments who seek to include the missing data from this study in their implementation. It has been shown that missing hydrological data can be computed from fitted ARIMA models [70,71]. Finally, the use of ARIMA has been well documented in hydrological analysis. ARIMA has been used in the analysis of water quality [65], rainfall [69,72], runoff [73,74], river discharge [65,75], drought [76,77], monthly streamflow [78,79], and groundwater anomaly [80].
Figure 3 shows the flowchart of the methodology followed. Firstly, the hydrological time series is obtained. Then, stationarity is checked using the ADF test and, if needed, achieved using differencing. The Automated ARIMA tool is used to identify the candidate models for analysis. A model is selected based on the lowest AIC and BIC values. The error/residual analysis is performed to check the accuracy of the selected mathematical model.
If the validation fails to satisfy the parameters of the best-fitted model, the analysis is repeated for a different model selection using appropriate lags. The study further proceeds with the forecast.

Automated ARIMA Forecasting
The automatic model selection specification for the ARIMA model can be divided into four steps:
i. Using raw or transformed data, such as logs of the dependent variable.
ii. Selection of the appropriate level of integration of the dependent variable.
iii. Evaluation of the exogenous regressors.
iv. Selection of the order of the ARMA model using the evaluating technique.
Automatic forecasting performs steps i, ii, and iv automatically. Step iii is left to the user, who selects the exogenous regressors; hence the name is Automatic ARIMA instead of Automatic ARIMAX. Any time series $y_t$ follows an ARIMA(p,d,q) model if [81]:

$$\Delta^{d} y_t = \beta X_t + \upsilon_t \tag{10}$$

where $\Delta^{d}$ denotes differencing d times, the exogenous variable $X_t$ is a constant term with coefficient β, and $\upsilon_t$ is the seasonal ARMA term. In this case, forecasting can be made using the AR, integration, and MA orders of the dependent variable, which can be selected using evaluation techniques. The estimation methods in EViews make use of three information criteria: the Schwarz Criterion (SIC or BIC), the Akaike Information Criterion (AIC), and the Hannan-Quinn Criterion (HQ). Based on these criteria, the number of ARMA terms is selected [81].

Before performing the analysis, the data need to be split into training and test sets. For this purpose, the monthly data from 1961 to 2000 were selected as training data and the data from 2001 to 2010 as test data. Automated ARIMA forecasting is a feature offered within EViews in which the user provides the maximum autoregressive, differencing, and moving average orders. The automated ARIMA parameters are shown in Appendix B.
In this analysis, the maximum AR order was taken as 4, the maximum differencing as 2, and the maximum MA order as 4. As the data show seasonality (S), the maximum SAR and SMA orders were taken as 2.

Model Validation
The automated ARIMA forecasting feature runs many ARIMA models, from which the best one must be identified. Here, model selection was based on the Akaike Information Criterion (AIC), where the lowest value indicates the best-fitted model. The model validation features are shown in Appendix C.

Summary of ARIMA Forecasting
Of the 600 observations, 480 were used as training data. Overall, 225 models were run, of which the best ARMA model was (2,4)(2,2) based on its AIC value of 9.216. A summary of the ARIMA forecasting is provided in Figure 4.
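The reported counts can be reproduced with a short enumeration. The sketch below assumes that every order from zero up to its stated maximum is tried for the AR, MA, SAR, and SMA terms, and that the differencing order is chosen separately rather than multiplied into the grid. That is one reading that yields 225 candidates; the exact search EViews performs is documented in its manual, not here.

```python
from itertools import product

# Maximum orders stated in the text.
max_ar, max_ma, max_sar, max_sma = 4, 4, 2, 2

# Assumed search grid: every (AR, MA, SAR, SMA) combination from 0 to the maximum.
grid = list(
    product(
        range(max_ar + 1),
        range(max_ma + 1),
        range(max_sar + 1),
        range(max_sma + 1),
    )
)
print(len(grid))  # 5 * 5 * 3 * 3 = 225

# Monthly time index for 1961-2010 and the train/test split by year.
months = [(year, month) for year in range(1961, 2011) for month in range(1, 13)]
train = [m for m in months if m[0] <= 2000]
test = [m for m in months if m[0] >= 2001]
print(len(months), len(train), len(test))  # 600 480 120
```

The selected model (2,4)(2,2) is one point in this grid, and the 480 training observations are exactly the 40 years of monthly data from 1961 to 2000.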

Comparison of Forecasted and Actual Data
Forecasted and actual data are compared in Figure 5. The actual data are given for a period of ten years, i.e., from 2000 to 2010. Of these ten years, the data for the first five years were provided by WAPDA and the data for the remaining five years by AGES. This set of data was also used as the test data. Taking this test data as a reference set, the forecast for the following 20 years was made possible. It can be seen in Figure 5 that the actual and forecasted values lie close to each other, with a few deviating values. Once it was established that the actual and forecasted values were close, the water flow for the remaining years was forecasted. The first reason for using the ARIMA predictive method is that it can cover the missing values, which are essential for future analysis. Secondly, in case of a significant weather shift or anomaly in the Kabul River, this analysis could help engineers and designers improve the capacity of flood control structures. The forecasted values indicate the flow of the river provided that the contributions of glacier melt and weather shifts, and the basin condition of the river, remain the same throughout the analysis period. This forecast was produced irrespective of weather anomalies, which are subject to persistent change in the future.
There are many validation methods, and out-of-sample validation is one of them. The concept is to use one portion of the sample data for identification and estimation, withhold the rest, and then forecast the hold-out period in order to compare the errors of the in-sample fit with those of the forecast. In this case, the validation period was 2000–2010 and the forecasting was performed for 2011–2030.
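The hold-out idea can be illustrated without an ARIMA fit. The sketch below swaps in a much simpler forecaster, the training mean for each calendar month, purely to show the mechanics of fitting on one portion and scoring on the withheld portion. The synthetic series, the 96/24 split, and the function name `seasonal_mean_forecast` are assumptions for the example and do not reproduce the study's model or data.

```python
import math


def seasonal_mean_forecast(train, horizon, period=12):
    """Forecast each future step as the training mean of the same cycle position."""
    means = []
    for m in range(period):
        values = train[m::period]
        means.append(sum(values) / len(values))
    start = len(train)
    return [means[(start + h) % period] for h in range(horizon)]


# Synthetic ten-year monthly series: annual cycle plus occasional 5-unit bumps.
series = [
    100 + 60 * math.sin(2 * math.pi * t / 12) + (5 if t % 7 == 0 else 0)
    for t in range(120)
]
train, holdout = series[:96], series[96:]  # fit on 8 years, withhold 2 years

forecast = seasonal_mean_forecast(train, len(holdout))
holdout_mae = sum(abs(a - f) for a, f in zip(holdout, forecast)) / len(holdout)
print(round(holdout_mae, 3))
```

Because the forecaster averages past values, it cannot reproduce the isolated bumps in the hold-out period, so its error there is small but not zero. Only the hold-out error reflects performance on data the model has not seen.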
In Figure 5, the actual water level peaks at 370 cumecs, while the forecasted plot rises only marginally above 250 cumecs. The difference arises because the ML algorithm estimates the time series values over 12 months and produces output in the form of an average of the past data. Figure 6 compares all the candidate models. The transparent lines in the background represent the 225 simulated models, whereas the line highlighted in red denotes the selected model (2,4)(2,2). Of all the ARMA models, model (2,4)(2,2) has the lowest values, which is why it was selected as the best option.
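One way to see why a forecast built from averaged past behavior sits below an observed extreme is that the mean of a calendar month over several years can never exceed the largest value recorded for that month. The sketch below illustrates only this averaging effect with hypothetical numbers; it is not the automated ARIMA procedure, and `monthly_means` is a name introduced for the example.

```python
def monthly_means(series, period=12):
    """Average each calendar month across all complete years in the series."""
    n_years = len(series) // period
    return [
        sum(series[y * period + m] for y in range(n_years)) / n_years
        for m in range(period)
    ]

# Hypothetical flows in cumecs: three ordinary years and one flood year.
ordinary = [10, 15, 30, 60, 120, 200, 250, 240, 150, 60, 25, 12]
flood = list(ordinary)
flood[6] = 370  # a single extreme peak in one year
history = ordinary * 3 + flood

means = monthly_means(history)
# Observed peak versus the peak of the averaged seasonal profile.
print(max(history), max(means))
```

In this toy series the averaged profile peaks well below the single flood-year value, which mirrors the gap between the observed and forecasted peaks discussed above.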

Best Fitted Model (AIC)
Appendix A gives the details of all 225 ARIMA models. The AIC values range from 9.215538 to 10.26662, whereas the BIC values range from 9.319883 to 10.31879. The model selection is based on the AIC value, and, as mentioned above, model (2,4)(2,2) is the selected ARIMA model, with the lowest AIC value of 9.215538. The values for model (2,4)(2,2) are highlighted in Appendix A. The BIC value of the selected model is 9.319883, and the HQ value is 9.256554.
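The selection step can be illustrated with a short Python sketch. Several things here are assumptions rather than statements from the study: the search bounds (AR and MA orders 0 to 4, seasonal AR and MA orders 0 to 2), chosen because they give exactly 5 × 5 × 3 × 3 = 225 candidates; the per-observation form of the criteria; the parameter count; and the sample size. The `fake_loglik` function is purely hypothetical and is constructed so that the sketch lands on the same order the study reports. In practice the log-likelihoods would come from whatever fitting routine is used.

```python
import math
from itertools import product

def candidate_orders(max_ar=4, max_ma=4, max_sar=2, max_sma=2):
    """Enumerate (p, q, P, Q) orders; the default bounds are an assumption."""
    return list(product(range(max_ar + 1), range(max_ma + 1),
                        range(max_sar + 1), range(max_sma + 1)))

def information_criteria(loglik, n_params, n_obs):
    """Per-observation AIC, BIC (Schwarz) and Hannan-Quinn criteria."""
    aic = (-2.0 * loglik + 2.0 * n_params) / n_obs
    bic = (-2.0 * loglik + n_params * math.log(n_obs)) / n_obs
    hq = (-2.0 * loglik + 2.0 * n_params * math.log(math.log(n_obs))) / n_obs
    return aic, bic, hq

def select_by_aic(orders, loglik_of, n_obs):
    """Return the order with the lowest AIC together with that AIC value."""
    best_order, best_aic = None, float("inf")
    for order in orders:
        n_params = sum(order) + 1  # ARMA terms plus a constant (assumption)
        aic, _, _ = information_criteria(loglik_of(order), n_params, n_obs)
        if aic < best_aic:
            best_order, best_aic = order, aic
    return best_order, best_aic

def fake_loglik(order):
    """Hypothetical stand-in for a fitted model's log-likelihood."""
    p, q, P, Q = order
    return -2300.0 + 4.0 * (min(p, 2) + min(q, 4) + min(P, 2) + min(Q, 2))

orders = candidate_orders()
# n_obs=480 is a hypothetical sample size used only for this illustration.
best, best_aic = select_by_aic(orders, fake_loglik, n_obs=480)
print(len(orders), best)
```

Because every criterion adds a penalty that grows with the number of parameters, an extra term is kept only when it improves the likelihood by more than the penalty costs, which is what keeps the search from always choosing the largest model.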
The residual autocorrelation function (residual ACF) and residual partial autocorrelation function (residual PACF) plots were used to examine the residuals of the selected model. As evident from Figure 7a,b, the residuals are randomly scattered, which indicates a good fit for the selected forecast model and the absence of autocorrelation in the residuals. The vertical lines represent the 95% confidence interval (CI), whereas the blue blocks show the number of lags selected to determine the behavior of the residuals. The residual ACF and residual PACF plots show that no lags fall outside the CI and that all are near zero, which indicates that the residuals are independent and that the model has adequately captured the time series.
It should be noted that the automated ARIMA tool considers all linear models, and the selection of ARMA over other models is based on the lowest AIC value. Because this analysis used linear models, the trend of the dataset differs from the trend of the generated linear results. Additionally, the automated ARIMA tool accounts for the missing data with less error due to its linear behavior. These results predict that the water flow will remain the same if the current condition of the water flow, basin, and drainage remains the same through the forecast period from 2011-2030.
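The residual check described in this section can be sketched as follows. This is a simplified illustration under stated assumptions: it uses the common large-sample band of ±1.96/√n for the 95% limits, covers only the ACF (not the PACF), and the function names and test series are hypothetical.

```python
import math

def sample_acf(residuals, max_lag):
    """Sample autocorrelation of a residual series for lags 1..max_lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    centered = [r - mean for r in residuals]
    denom = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t - lag] for t in range(lag, n)) / denom
        for lag in range(1, max_lag + 1)
    ]

def lags_outside_band(residuals, max_lag, z=1.96):
    """Lags whose autocorrelation falls outside the approximate 95% band."""
    bound = z / math.sqrt(len(residuals))
    acf = sample_acf(residuals, max_lag)
    return [lag for lag, r in enumerate(acf, start=1) if abs(r) > bound]

# Hypothetical residuals: a strictly alternating series is strongly
# autocorrelated, so every lag checked should be flagged.
alternating = [1.0, -1.0] * 50
print(lags_outside_band(alternating, max_lag=3))
```

An empty list from `lags_outside_band` corresponds to the situation reported for the selected model, where no lag leaves the confidence band.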

Water Flow Forecasting
The standard deviation of the actual error and of the predicted error was calculated to determine the error difference between the actual data set and the model selected for the forecast. It is evident in Figure 10 that the predicted error of the selected model is less than that of the actual data set; hence, the selected model is the best fit for the forecast. Table 1 illustrates the value of these parameters. These results extrapolated the missing values, which can be used as a reference for further studies of the Kabul River. The linearity in the missing data is plausible, as unknown factors produced unknown water levels for the missing period. The forecast shows that, provided the temperature and precipitation remain constant in the coming years, there will be no significant change in the water levels in the Kabul River. However, in case of weather shifts or anomalies, this forecast could still be useful, as locals could use it to settle and earn a living above this water level to ensure their safety in the future. It could also prove useful for hydrologists, structural engineers, and flood disaster management officials in constructing water-retaining structures with a capacity matching these water levels.
To elaborate on the results further, this study predicts the data from 2011-2030. It can be seen that the trend of the past (1961-2010) is considerably different from the forecasted (2011-2030) trend. The first reason is the choice of assuming constant conditions in the future. The missing data could be linear or non-linear; however, the analysis was performed using linear models, as non-linearity could have greatly affected the results and might have deviated from the actual scenario of the river. Secondly, as the concerned agencies collected data only until 2010 and no further official data are available that could accurately forecast the water levels for the coming years, the data of the missing years were not taken into account in the analysis. To unravel the anomalies in the hydrologic behavior of the river, linear behavior was adopted to forecast the missing data closer to the previously recorded data.

Discussion
The analysis of the missing years (2000-2020) was carried out as a forecast to help hydrologists account for the missing data. To give an idea of the situation during 2000-2020, water availability was estimated to have fallen to 1032 m³ in 2015 from 5000 m³ in 1947 [82]. As Pakistan constructed the Warsak dam on the Kabul River, the decreased flow resulted in reduced water flow for the canal system, and the area irrigated by the Kabul canal system was reduced to 25,967 acres in 2015-2016 from 26,200 in 2015-2016 [83]. Glacier dynamics have had a significant impact on the water flow of the Kabul River. The 84% reduction in snow between 2001 and 2016 suggests that solid precipitation will decrease with time, which will lower the water flow in the Kabul River and might lead to drought in the basin [84]. Two small dams, namely the Qargha and Band-i-Amir, were constructed in Afghanistan with the help of US aid in 2008; if these dams become fully operational, they will have disastrous effects in terms of hydropower generation and irrigation [85]. In 2003 and 2005, the Kabul Basin treaty between Pakistan and Afghanistan was drafted, but it failed due to the unavailability of water flow data [28]. Pakistan's water supply from the Kabul River is hostage to construction development and political stability in Afghanistan and to climatic conditions, as any dam construction in Afghanistan will result in a 25% decrease in the mean annual flow of the Kabul River by the end of 2018 [86]. A study revealed that the increase in water demand and the construction of more dams in Afghanistan will decrease its flow to 17% below the current flow of 8 million acre-feet (MAF). This condition, along with climate variation, will give rise to a shortage of water in Pakistan [87,88].
To maximize the overall benefits of river management, a high-quality water inflow forecast is essential. This surface water is extremely important for the socio-economic development and growth of the region. Water infrastructure development, flood and drought control, and industrial operations all depend on this resource, which makes its efficient management necessary. Precise water flow prediction not only reduces the risk of mal-operation and the probability of damage but also increases profits [89].
The stochastic nature of river flow makes its forecasting imperative for early hazard management. Forecasting river flow becomes even more vital in mountainous regions because a large population living downstream depends heavily on this water resource for agriculture and other economic activities [64]. Early warning systems that measure water flow in advance are available, but they are too expensive for poor communities to benefit from them [90]; hence, past flood recurrence data can be used to predict future flood frequency, which could function as an early warning system. For this purpose, the various contributing factors of floods could be taken into account during analysis to help model water behavior. The findings of this study help to account for the missing data and to forecast the flow under the weather conditions reflected in the data used for analysis. As the weather is unpredictable and subject to change with increasing global warming in the coming years, this study could prove valuable in assessing the river behavior so that flood control devices are constructed with the required water-holding capacity. In recent times, researchers have used artificial intelligence algorithms to predict stream and river flow. As machine learning algorithms learn from statistical data, they can generate accurate predictions when the data are representative. In this regard, Pianosi and Thi [91] estimated river water flow, and Wu and Han [92] estimated daily run-off in rivers using artificial intelligence algorithms. However, Agung [64] argues that care must be taken, as the exact value of any parameter is never known; therefore, one should not rely solely on these models. He further maintains that a professional's knowledge and experience should also be taken into account in defining several alternative models because, in statistical analysis, the best of all possible options can never be achieved.
Based on the obtained results, the predictive performance of the selected model (2,4)(2,2) was evaluated statistically against the test data set of the decade 2000-2010, as done by Yu and Lei [69]. The predicted results were satisfactory, with a few discrepancies where sharp fluctuations of water flow occurred. The model was selected based on AIC [65].
The subject river of the current study, the Kabul River, is the major tributary of the Indus River. Flooding in the Kabul River results in flooding in the Indus as well. In 2010, the disastrous flood in the Kabul and Swat Rivers killed as many as 1156 people and affected 3.8 million people in KP province alone [93]. Using past data to predict future water flow can disclose hidden information that is of great importance for alleviating the effects of floods and averting disasters. The generated best-fit model (2,4)(2,2) indicates the water flow to be 250 cumecs or a little above it in the next ten years. The forecast from the selected model is, therefore, highly beneficial for river basin management, where river flow, particularly in the rainy season, becomes a major challenge to handle. It shows river management how to make maximum use of the study and fulfils the needs of the relevant stakeholders, although it should be kept in mind that even highly accurate water flow prediction does not always increase the benefits, because these ultimately depend on the operational strategies of the river administration [89].
This study can be generalized to other areas, as the method employed is automatic forecasting, which identifies the best-suited model based on the lowest AIC value. This tool selected ARMA (2,4)(2,2) based on the linear measurement of the past hydrological data. The significance of this work is that the forecasted results can be a clarion call for policymakers to allocate funding so that the reservoir can work at its full capacity without damage to its structure, which will benefit the agriculture sector, hydroelectric generation, and industrial processes and help designers and water management engineers make sustainable decisions. From the civil engineering perspective, this study could help designers to complete sustainable basin designs, construct dams for electricity generation, design canals for maximum agricultural productivity, and reconstruct and rehabilitate damaged water tributaries to meet flood and stormwater discharge, all of which could help save fertile land from being lost to natural disasters like the 2010 floods. If Pakistan completes its hydroelectric projects in time, it can meet its electricity demands in the future. The outflow of this river depends on its basin condition, and the poor condition of this basin could lead to continuous silting in the Warsak dam, the underperformance of hydropower generation, more frequent floods, less available water for agricultural needs, and the inability of the river to meet the needs of the people relying on it. The construction of water storage structures and the rehabilitation of the waterbed of the basin will secure the livelihood of the people who depend on the Kabul River for their survival.
Despite the use of modern forecasting techniques, there is always an uncertainty factor in the results; therefore, quantifying the climate variables is essential for assessing hydrologic impact. The uncertainty of the results in this study depends directly on the variation in precipitation and temperature with each passing year. For improved modelling reliability, these two factors must be adequately addressed. In this study, the forecast will deviate from the actual water flow as precipitation and temperature vary. For example, the temperature in 2010 was not the highest on record, yet floods occurred in 2010 in the Khyber Pakhtunkhwa province of Pakistan. Moreover, the precipitation was unexpectedly at its highest, and there were no official early warning systems. Given changes in these two factors, the forecasted values of this study might prove inconsistent, as the study did not account for changes in precipitation and temperature. Similarly, the absence of accurate data could seriously affect the observation uncertainty for floods and droughts, as such data are needed to reveal the mean and variance of the streamflow in the Kabul Basin.
In recent years, forecasting capability has increased significantly and has found applications in all fields [86-88]. On the flip side, certain factors contribute to many surprises in the analysis. Unfortunately, these factors can neither be modelled nor predicted; hence, the analysis is always accepted with a certain degree of uncertainty. Temperature, precipitation, and earthquakes are a few examples that could result in catastrophic loss of human life. Climate change is on the rise and is continuously wreaking havoc in the shape of tsunamis and hurricanes, which cannot be modelled. There might be damaging consequences if the forecast fails to accurately predict the water levels in the Kabul Basin. Unanticipated floods and droughts in the basin could then catch planners unprepared and threaten human existence. As Pakistan has constructed the Warsak dam on the Kabul River, which is regarded as one of the major dams of Pakistan, any fluctuation in the water level could create a power shortfall and plunge the country into darkness.

Conclusions
Keeping in view the importance of the country's major river in terms of the economy, hydroelectricity, and human existence, an analysis was undertaken to study and predict the water level from the year 2011 till the year 2030 based on historical trends. The Kabul River poses a threat to Pakistani soil in extreme conditions, either through excessive flooding due to the melting of glaciers and incessant precipitation or through severe spells of drought. Therefore, forecasting is essential for the planning and management of future development. As this development is based on forecasted values, this study will not only serve the inhabitants of the country in extreme conditions but will also prove beneficial in energy generation. To prevent the devastation due to extreme water levels of the Kabul River, this study used an ML approach to forecast the water level so that engineers and decision-makers could apply preventive techniques to tackle extreme conditions. This research bridges the gap of missing data and connects it to the forecasted data. Based on the analysis of the hydrological data, the forecast was evaluated by comparing it to the actual values, and it was found that the accuracy of ARMA (2,4)(2,2) was better than that of the other models based on the lowest AIC value. The forecast revealed that the water level will not fluctuate much: the water level in the Kabul River will be marginally more than 250 cumecs from 2011 till 2030, with only a small difference from its value of 249 cumecs in 2000. It was also concluded that the water level will gradually increase from January to August till it reaches its maximum level of 250 cumecs in September. As soon as the monsoon season diminishes, the water level will return to its minimum value of 10 cumecs in the months from October to December till the year 2030.

Data Availability Statement:
All the data is available within this manuscript.

Acknowledgments:
The authors would like to thank Universiti Teknologi PETRONAS (UTP) for the support provided for this research.

Conflicts of Interest:
The authors declare no conflict of interest.