Modelling of River Flow Using Particle Swarm Optimized Cascade-Forward Neural Networks: A Case Study of Kelantan River in Malaysia

Abstract: Water resources management in Malaysia has become a crucial issue because of its role in the economic and social development of the country. The Kelantan river (Sungai Kelantan) basin is one of the essential catchments, as it has a history of flood events. Numerous studies have been conducted on river basin modelling for flow prediction, flood mitigation, and water resource management. This paper presents river flow modelling based on meteorological and weather data in the Sungai Kelantan region using a cascade-forward neural network trained with the particle swarm optimization algorithm (CFNNPSO). The result is compared with networks trained with the Levenberg–Marquardt (LM) and Bayesian Regularization (BR) algorithms. The outcome of this study indicates a strong correlation between river flow and some meteorological and weather variables (weighted rainfall, average evaporation and temperature). The correlation scores (R) obtained between the target variable (river flow) and the predictor variables were 0.739, −0.544, and −0.662 for weighted rainfall, evaporation, and temperature, respectively. Additionally, the nonlinear multivariable regression model developed using CFNNPSO produced acceptable prediction accuracy during model testing, with a regression coefficient (R²), root mean square error (RMSE), and mean of percentage error (MPE) of 0.88, 191.1 cms and 0.09%, respectively. The reliable result and predictive performance of the model are useful for decision makers during water resource planning and river management. The constructed modelling procedure can be adopted for future applications.


Introduction
Malaysia is enriched with 189 river basins nationwide. This natural resource plays a crucial role in the economic and social development of the country [1]. More specifically, rivers are the major source of water for irrigation, residential, industrial, agricultural, and other human activities. Surface water in the form of streams and rivers contributes 97% of the raw water supply [2]. Consequently, due to the over-dependence on surface water for food, recreation, water supply, transportation, and energy, [...]
This study presents the application of artificial neural network (ANN)-based predictive modelling trained using particle swarm optimization (PSO). In particular, cascade-forward neural networks trained with PSO (CFNNPSO) are presented for the prediction of river flow. This study validates the functional ability and significance of ANN techniques in the simulation of real-world, complex nonlinear water system processes. In addition, this research gives an insight into ANN modelling in the Kelantan river scenario and the importance of understanding a river basin and its variables before attempting to model the river flow. River flow can be effectively modelled with intelligent ANN models, despite the spatial changes in the study field.
The river flow of Sungai Kelantan, in the northeast of Malaysia, is predicted using a feed-forward neural network (FFNN) and a cascade-forward neural network (CFNN) based on the available meteorological input variables (features), namely weighted rainfall (mm), evaporation (mm), minimum temperature (°C), mean temperature (°C) and maximum temperature (°C). The ANN-related experimentation carried out in this study includes feature/input variable selection, determination of the effective number of hidden layer neurons, and a performance comparison between CFNN and standard multi-layer FFNN trained with PSO and with other common training algorithms, namely Levenberg-Marquardt (LM) and Bayesian Regularization (BR) backpropagation. This study has practical meaning from the perspective of the current state of the art in artificial intelligence (AI) and Internet of Things (IoT) technology. The machine learning model can be deployed in different ways, such as in a web app or a real-time monitoring device, to predict Kelantan river flow based on readily available meteorological data. Such a tool is relevant in the context of Industrial Revolution 4.0.

Study Area
The Kelantan river basin is one of the important catchments, as it has a history of flood events [18,19]. The catchment covers most of the land area of Kelantan State, as shown in Figure 1. Several types of stations operate in the area, including rainfall, water level, evaporation, water quality and meteorological stations.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 3 of 16

River Flow Data
ANN is a popular machine learning algorithm that has been successfully applied to data-driven predictive modelling. The main ingredient for successful predictive modelling, in addition to the training algorithm, is the data itself. In this study, the river flow data were used together with meteorological parameters. The river flow data were collected downstream, north of Kuala Krai city (after the merging of the two main tributaries and before discharge into the sea). The original data consist of 348 monthly records of Sungai Kelantan river flow, in cubic meters per second (cms), spanning January 1988 to December 2016. Rainfall and evaporation are usually measured at a specific station, and only the computed area-weighted rainfall is used to evaluate the rainfall quantity over the whole area [20]. The river flow was mainly from one main station (Guillemard), while the weighted rainfall and evaporation (secondary data) were over the whole river catchment. Table 1 shows the attributes of the data used in this study.

Data Pre-Processing
The data pre-processing and feature selection stage is crucial at the start of machine learning model building, and it can significantly affect the prediction accuracy for any type of data [21]. The overall data pre-processing and feature selection are summarized as follows:
1. Data randomization: the data were randomized to enhance the diversity of the data before splitting into training and testing datasets.
2. Data partition/splitting: the dataset was randomly partitioned into 260 data samples (≈75%) for model training and 88 data samples (≈25%) for model validation testing.
3. Data normalization: ANNs, like some other machine learning algorithms, benefit from data normalization. The input data are normalized to standardize the scale of each variable. In this study, the data are normalized to the range [0, 1] before ANN training.
4. Feature selection: features with a considerably low correlation score to the output variable are removed. Normally, a correlation score with an absolute value below 0.5 indicates a low correlation, i.e., a weak association between the specific input variable and the target variable. This step has the greatest influence on the prediction accuracy in this study. It is also useful for model parsimony, especially when the number of input features is large. A reduced number of input features benefits model simplicity and data reduction when some data collection/sensor measurements are unavailable. The experimentation results of this process are discussed in Section 3.
The correlation coefficient ($r_{xy}$) between two variables was calculated by dividing the covariance by the product of the standard deviations of the two variables as follows:
$$ r_{xy} = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}} \quad (1) $$
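The pre-processing steps described above can be sketched in Python with NumPy. This is only an illustrative sketch, not the authors' MATLAB code: the function name `preprocess`, the use of training-set minima and maxima for scaling, the choice to compute correlations on the full shuffled dataset, and the synthetic data are all assumptions made here for demonstration.

```python
import numpy as np

def preprocess(X, y, n_train=260, threshold=0.5, seed=0):
    """Shuffle, select features by correlation, split, and min-max scale."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))              # data randomization
    X, y = X[order], y[order]

    # Pearson correlation r_xy = cov(x, y) / (sigma_x * sigma_y) per feature
    r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    keep = np.abs(r) >= threshold                # drop weakly correlated features
    X = X[:, keep]

    X_train, X_test = X[:n_train], X[n_train:]   # data partition (260 / 88)
    y_train, y_test = y[:n_train], y[n_train:]

    # Min-max normalization to [0, 1]; using training-set statistics is an
    # assumption, so test values may fall slightly outside [0, 1].
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    scale = np.where(hi > lo, hi - lo, 1.0)
    X_train = (X_train - lo) / scale
    X_test = (X_test - lo) / scale
    return X_train, X_test, y_train, y_test, r, keep

# Synthetic stand-in for the 348 monthly records (not the study's data):
# two informative features and one pure-noise feature.
rng = np.random.default_rng(1)
X_demo = rng.random((348, 3))
y_demo = 3.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + 0.05 * rng.standard_normal(348)
X_tr, X_te, y_tr, y_te, r, keep = preprocess(X_demo, y_demo)
print(np.round(r, 3), keep)
```

In this synthetic case the third feature is unrelated to the target, so its correlation score falls below the 0.5 threshold and it is removed, mirroring the treatment of a weakly correlated input.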

ANN Structure, Training Algorithm, and Feature Input Selection
An ANN is a supervised machine learning method that can be trained to map the relation between the inputs/features and the target/output by adjusting the weights and biases between neuron elements [22]. This highly nonlinear mapping can be applied in many areas, including multivariable regression. There are different types of ANN structures and training algorithms. Common types include cascade-forward neural networks (CFNN), multi-layer feed-forward neural networks (FFNN) and recurrent neural networks. In this study, both FFNN and CFNN structures were implemented, and their effectiveness was compared and evaluated. The programs were executed in MATLAB 2019b. FFNN is the standard structure for a multi-layer neural network and can be found in much of the literature. The structure of CFNN is similar to that of FFNN; the key difference is that CFNN is a further modification of FFNN in which additional weights connect the input nodes directly to the nodes of the subsequent hidden layers and to the output nodes, as shown in the upper portion of Figure 2. These additional weights do not exist in the standard FFNN. The advantage of this approach is that it accommodates the nonlinear input-output relationship without eliminating the linear relationship between the two [23]. The difference in network structure between CFNN and FFNN in terms of weight connections can also be seen in [7,24]. Detailed computations for the application of the training algorithms can be found in [7,25].
The output of the CFNN is expressed as Equation (2), where $\phi(\cdot)$ is the selected activation function, $w_{kh}$ is the weight strength from neuron $h$ in the last hidden layer to the single output neuron $k$, and so on for the other weights. $x_i$ is the $i$-th element of the input/feature variable and $b_i$ is the bias weight in the neurons of the first hidden layer, and so on. The symbol $w$ denotes the weight vector for the entire set of weights, ordered by layer, then by neurons in a layer, and then by signal strength in a neuron. In this study, ANNs with 1 or 2 hidden layers were used for the evaluation. The activation functions selected for the hidden layer(s) and the output layer are the tangent sigmoid and the linear function, respectively. The tangent sigmoid function can be expressed as:
$$ \phi(I_i) = \frac{e^{I_i} - e^{-I_i}}{e^{I_i} + e^{-I_i}} \quad (3) $$
where $I_i$ is the signal coming into the neuron in the hidden layers. The linear function in the output layer can be expressed as:
$$ \phi(I_k) = I_k \quad (4) $$
where $I_k$ is the input to the neurons in the output layer.
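To make the structural difference between FFNN and CFNN concrete, the following Python sketch computes the output of a one-hidden-layer network with a tangent sigmoid hidden layer and a linear output, with and without the direct input-to-output weights. It is a minimal illustration under assumptions: the function names, the restriction to one hidden layer, and the variable `w_direct` for the cascade weights are hypothetical and are not taken from the paper.

```python
import numpy as np

def ffnn_forward(x, W_h, b_h, w_o, b_o):
    """One-hidden-layer FFNN: tanh hidden layer, linear output neuron."""
    hidden = np.tanh(W_h @ x + b_h)      # tangent sigmoid activation
    return w_o @ hidden + b_o            # linear output activation

def cfnn_forward(x, W_h, b_h, w_o, b_o, w_direct):
    """Same network plus direct input-to-output weights (the cascade link)."""
    return ffnn_forward(x, W_h, b_h, w_o, b_o) + w_direct @ x

# Illustrative sizes: 3 input features, 4 hidden neurons (arbitrary values)
rng = np.random.default_rng(0)
x = np.array([0.2, 0.5, 0.9])
W_h = rng.normal(size=(4, 3))
b_h = rng.normal(size=4)
w_o = rng.normal(size=4)
b_o = 0.1
w_direct = rng.normal(size=3)

print(ffnn_forward(x, W_h, b_h, w_o, b_o))
print(cfnn_forward(x, W_h, b_h, w_o, b_o, w_direct))
```

With `w_direct` set to zero the CFNN reduces to the FFNN, and with the hidden weights set to zero it reduces to a purely linear model, which is one way to read the statement that the cascade links retain the linear input-output relationship.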
Appl. Sci. 2020, 10, 8670
Furthermore, ANNs are usually trained using the backpropagation (BP) algorithm and its variations. Among the commonly used ANN training algorithms is Levenberg-Marquardt (LM), which can provide fast convergence for moderate-sized FFNNs of a few hundred weights [26]. However, fast convergence does not guarantee that the trained ANN model will not overfit the training data. In many applications, including this study, the generalization of the model is of greater concern, i.e., the model should neither overfit nor underfit. A more advanced modification of the LM algorithm is Bayesian Regularization (BR), which minimizes a linear combination of squared errors and weights. At the end of training, the resulting network has good generalization qualities, i.e., model overfitting is prevented. Further detailed discussion of Bayesian regularization can be found in [27]. For its generalization capability, this algorithm is also applied in this study. Lastly, the PSO algorithm is applied to train both FFNN and CFNN, which is the main contribution of this study. The performance of both FFNN and CFNN trained with PSO is evaluated. PSO is considered one of the most popular meta-heuristic algorithms; it was inspired by the natural process of bird flocking and introduced in 1995 by Kennedy and Eberhart [28–30]. It has some appealing features, such as fast convergence and simplicity of implementation. Since then, PSO and many of its variants have been studied. One of the early PSO variants was PSO with constriction coefficients, which was proposed to guarantee solution convergence [30]. This version of PSO is applied in this study to train the proposed ANN model.
Prior to the discussion of the PSO algorithm used to train ANNs, the basic ANN training/learning process using the BP algorithm can be summarized as follows [31]:
1. Obtain the training dataset ($x_i$) with the desired target ($y$).
2. Set up the ANN structure and parameters: number of hidden layers, number of neurons, learning rate ($\eta$), momentum constant and regularization constant ($\alpha$) (if necessary).
3. Initialize all weights and biases to random values.
4. Start the ANN training with forward propagation of the input data through the layers according to Equation (2).
5. Calculate the error between the ANN output ($\hat{y}$) and the desired target ($y$), for example using the MSE (mean squared error), defined as:
$$ E = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \quad (5) $$
where $n$ is the number of training samples.
6. Back-propagate the error through the output and hidden layer(s), and adapt the output weights according to:
$$ w_k(t+1) = w_k(t) + \Delta w_k(t) \quad (6) $$
where $t$ indicates the iteration index and $\Delta w_k$ indicates the change of weights' strength in the output layer $k$, which is calculated as:
$$ \Delta w_k(t) = -\eta \frac{\partial E}{\partial w_k} \quad (7) $$
7. Back-propagate the error through the hidden layer(s) and input, and adapt the hidden weights according to:
$$ w_h(t+1) = w_h(t) + \Delta w_h(t) \quad (8) $$
where $\Delta w_h$ indicates the change of weights' strength in the hidden layer $h$, which is calculated as:
$$ \Delta w_h(t) = -\eta \frac{\partial E}{\partial w_h} \quad (9) $$
8. If the error according to step 5 is sufficiently small, stop the training iteration and proceed with model validation; otherwise, repeat steps 4 to 7.
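The eight steps above can be sketched as a plain batch gradient-descent loop in Python. This is a minimal illustration, not the training code used in the study: the function name `train_bp`, the single tanh hidden layer, the learning rate, the stopping tolerance and the synthetic data are assumptions, and the momentum and regularization terms mentioned in step 2 are omitted for brevity.

```python
import numpy as np

def train_bp(X, y, n_hidden=5, eta=0.05, max_iter=2000, tol=1e-4, seed=0):
    """Batch gradient-descent backpropagation for a one-hidden-layer FFNN."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Step 3: random initial weights (biases start at zero in this sketch)
    W_h = rng.normal(0.0, 0.5, (n_hidden, d))
    b_h = np.zeros(n_hidden)
    w_o = rng.normal(0.0, 0.5, n_hidden)
    b_o = 0.0
    history = []
    for _ in range(max_iter):
        # Step 4: forward propagation (tanh hidden layer, linear output)
        H = np.tanh(X @ W_h.T + b_h)
        y_hat = H @ w_o + b_o
        # Step 5: mean squared error
        err = y_hat - y
        mse = float(np.mean(err ** 2))
        history.append(mse)
        if mse < tol:                      # Step 8: stopping criterion
            break
        # Step 6: gradient of the MSE with respect to the output layer
        g_out = 2.0 * err / n
        grad_w_o = H.T @ g_out
        grad_b_o = g_out.sum()
        # Step 7: gradient back-propagated to the hidden layer
        g_hid = np.outer(g_out, w_o) * (1.0 - H ** 2)
        grad_W_h = g_hid.T @ X
        grad_b_h = g_hid.sum(axis=0)
        # Weight updates: w(t+1) = w(t) - eta * dE/dw
        w_o -= eta * grad_w_o
        b_o -= eta * grad_b_o
        W_h -= eta * grad_W_h
        b_h -= eta * grad_b_h
    return (W_h, b_h, w_o, b_o), history

# Synthetic regression problem (illustrative only, not river flow data)
rng = np.random.default_rng(1)
X_demo = rng.random((100, 2))
y_demo = np.tanh(X_demo[:, 0] - X_demo[:, 1])
params, history = train_bp(X_demo, y_demo)
print(history[0], history[-1])
```

The recorded `history` makes the slow but steady error reduction of plain gradient descent visible, which is the behaviour that motivates faster alternatives.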
The BP algorithm is based on the gradient-descent algorithm, which tends to be slow to converge. Levenberg-Marquardt (LM) is an alternative training algorithm with faster convergence, developed as a combination of the Gauss-Newton and gradient descent algorithms and requiring computation of the Jacobian matrix. The weight update rule in the LM algorithm is expressed as:
$$ w(t+1) = w(t) - \left(J^{T}J + \mu I\right)^{-1} J^{T} e \quad (10) $$
where $J$ is the Jacobian matrix of the network errors with respect to the weights, $e$ is the vector of network errors, $I$ is the identity matrix, and $\mu > 0$ is a scalar damping parameter.
On the other hand, as mentioned earlier, Bayesian Regularization (BR) training is integrated into BP to prevent overfitting. The training goal is to minimize a modified error function [32] that combines the sum of squared errors, $E_y = \frac{1}{2}\sum (y - \hat{y})^2$, with a penalty on the network weights. The "black box" regularization parameters α and β are responsible for penalizing the cost function (F), which affects the generalization of the trained model. In general, the higher the regularization constant (β), the more network weight connections will be dropped to prevent overfitting. Table 2 summarizes the ANN parameters and algorithm setups investigated in this study. To obtain consistent results and a fair comparison between the different ANN setups, the initial random seeds for weights and biases were set to the same state of the random number generator in the software. The activation functions are the tangent sigmoid for the hidden layer neurons and the linear function for the output layer, as expressed in Equations (3) and (4), respectively. Moreover, the regularization constant (β) was set to 0.2, and the maximum number of iterations was set to 1000. No cross-validation procedure was performed during the training.
Table 2. Artificial neural network (ANN)-based model setup.

| ANN Model Properties | Experimentation |
|---|---|
| Feature input selection | To use all features or reduced features using the correlation score |
| ANN structure/architecture | To use the feed-forward and cascade-forward structures |
| Number of hidden layers and their neurons | To use 1 or 2 hidden layers with 5, 10, or more neurons in each layer; e.g., [n₁ + n₂] means n₁ neurons in hidden layer 1 and n₂ neurons in hidden layer 2 |
| Training algorithms | To use Levenberg-Marquardt (LM), Bayesian Regularization (BR) and Particle Swarm Optimization (PSO) |

Once the ANN training was performed using the training dataset, the trained ANN model was validated using the testing dataset. The model accuracy was evaluated by calculating the regression coefficient (R²), the RMSE (root mean squared error) and the mean of percentage error (MPE), which are defined as:
$$ R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \quad (12) $$
$$ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \quad (13) $$
$$ \mathrm{MPE} = \frac{100\%}{n}\sum_{i=1}^{n}\frac{y_i - \hat{y}_i}{y_i} \quad (14) $$
where $y_i$ is the observed value, $\hat{y}_i$ is the predicted value, $\bar{y}$ is the mean of the observed values, and $n$ is the number of samples.
The overall process of the river flow modelling using ANN is illustrated in Figure 3. It begins with raw data collection, as explained in Section 2.1, followed by data analysis and pre-processing, as explained in Section 2.2. The ANN training and some related experimentation are explained in Section 2.3. This procedure can be considered a general procedure for ANN-based predictive model building.
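The three accuracy measures used for model evaluation (R², RMSE and MPE) can be computed with a few lines of Python. The sketch below assumes the common definitions R² = 1 − SSE/SST and MPE as the mean of (observed − predicted)/observed in percent; the function names and the small numeric example are illustrative and do not come from the study's data.

```python
import numpy as np

def r_squared(y, y_hat):
    """Regression coefficient: 1 - SSE/SST (can be negative for poor fits)."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def rmse(y, y_hat):
    """Root mean squared error, in the units of y."""
    return np.sqrt(np.mean((y - y_hat) ** 2))

def mpe(y, y_hat):
    """Mean of percentage error, in percent (signed, so errors can cancel)."""
    return 100.0 * np.mean((y - y_hat) / y)

# Made-up observed and predicted values for illustration
y_obs = np.array([100.0, 200.0, 400.0])
y_good = np.array([110.0, 190.0, 400.0])
y_bad = np.array([400.0, 100.0, 100.0])

print(r_squared(y_obs, y_good), rmse(y_obs, y_good), mpe(y_obs, y_good))
print(r_squared(y_obs, y_bad))   # negative: worse than predicting the mean
```

Under this definition a negative R² simply means the model predicts worse than the mean of the observations, which is how a failed prediction on testing data shows up in the results tables.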

ANN Training Using Particle Swarm Optimization (PSO)
In the training of ANNs using PSO, the training process is handled as an optimization problem. The objective of the optimization is to minimize the prediction error by searching for the optimum values (solution variables) of the weights and biases of the ANN. PSO was inspired by the collective behaviours of bird flocking and fish schooling. Each PSO particle represents a potential solution of a given optimization problem, and it has its own velocity and position components in the search space.
Suppose that the population size of the PSO swarm and the dimensional size (i.e., the number of variables to be optimized) of a given optimization problem are represented as $N$ and $D$, respectively [29]. Let $V_i = [V_{i,1}, \ldots, V_{i,d}, \ldots, V_{i,D}]$ and $X_i = [X_{i,1}, \ldots, X_{i,d}, \ldots, X_{i,D}]$ represent the velocity and position of the $i$-th particle in the search space, respectively, where $i = 1, \ldots, N$ and $d = 1, \ldots, D$. The best position found so far by the $i$-th particle is represented as $P_{best,i} = [P_{best,i,1}, \ldots, P_{best,i,d}, \ldots, P_{best,i,D}]$. Meanwhile, the global best position refers to the best position found so far by the entire PSO swarm, and it is denoted as $G_{best} = [G_{best,1}, \ldots, G_{best,d}, \ldots, G_{best,D}]$.
The new position of each $i$-th particle in the search space is then determined based on the updated velocity vector. At the $(t+1)$-th iteration of the search process, the $d$-th dimension of the velocity and position of the $i$-th particle, denoted as $V_{i,d}(t+1)$ and $X_{i,d}(t+1)$, respectively, are updated as follows [33]:
$$ V_{i,d}(t+1) = \omega V_{i,d}(t) + c_1 r_1 \left(P_{best,i,d}(t) - X_{i,d}(t)\right) + c_2 r_2 \left(G_{best,d}(t) - X_{i,d}(t)\right) \quad (15) $$
$$ X_{i,d}(t+1) = X_{i,d}(t) + V_{i,d}(t+1) \quad (16) $$
where $\omega$ is an inertia weight used to balance the exploration and exploitation searches of the particle by determining how much of the previous velocity of a particle is preserved; $c_1$ and $c_2$ are the acceleration coefficients used to control the influence of the self-cognitive (i.e., $P_{best,i}$) and social (i.e., $G_{best}$) components of the particle; and $r_1$ and $r_2$ are two random numbers generated from a uniform distribution in the range [0, 1]. The main PSO parameters that drive the search toward the optimum solution are $\omega$, $c_1$ and $c_2$. Clerc [30] in 2002 developed a constriction coefficient approach to guide the selection of these parameters and guarantee convergence of the solution [34]. In Clerc's version of PSO, the particle velocity in Equation (15) is expressed as:
$$ V_{i,d}(t+1) = \chi\left[V_{i,d}(t) + c_1 r_1 \left(P_{best,i,d}(t) - X_{i,d}(t)\right) + c_2 r_2 \left(G_{best,d}(t) - X_{i,d}(t)\right)\right] \quad (17) $$
$$ \chi = \frac{2K}{\left|2 - \varphi - \sqrt{\varphi^2 - 4\varphi}\right|} \quad (18) $$
with $\varphi = c_1 + c_2$, typically $K = 1$ and $c_1 = c_2 = 2.05$, and therefore $\chi = 0.73$ [34]. This version of PSO is used to train the ANNs in this study.
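The constriction-coefficient update can be sketched in Python as follows. This is an illustrative implementation under assumptions, not the code used in the study: the function names, the per-dimension random numbers, the zero initial velocities, the clipping of positions to fixed bounds, and the sphere test function are all choices made here for demonstration.

```python
import numpy as np

def constriction(c1=2.05, c2=2.05, K=1.0):
    """Clerc's constriction coefficient; requires phi = c1 + c2 > 4."""
    phi = c1 + c2
    return 2.0 * K / abs(2.0 - phi - np.sqrt(phi ** 2 - 4.0 * phi))

def pso_minimize(cost, dim, n_particles=40, n_iter=200, bounds=(-1.0, 1.0),
                 c1=2.05, c2=2.05, seed=0):
    """Minimize cost(x) with constriction-coefficient PSO."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    chi = constriction(c1, c2)
    X = rng.uniform(lo, hi, (n_particles, dim))    # particle positions
    V = np.zeros((n_particles, dim))               # particle velocities
    P = X.copy()                                   # personal best positions
    P_cost = np.array([cost(x) for x in X])
    g = P[np.argmin(P_cost)].copy()                # global best position
    g_cost = float(P_cost.min())
    for _ in range(n_iter):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        # Constricted velocity update, then position update
        V = chi * (V + c1 * r1 * (P - X) + c2 * r2 * (g - X))
        X = np.clip(X + V, lo, hi)                 # keep positions in bounds
        costs = np.array([cost(x) for x in X])
        better = costs < P_cost
        P[better] = X[better]
        P_cost[better] = costs[better]
        if P_cost.min() < g_cost:
            g_cost = float(P_cost.min())
            g = P[np.argmin(P_cost)].copy()
    return g, g_cost

def sphere(x):
    """Simple convex test function with its minimum of 0 at the origin."""
    return float(np.sum(x ** 2))

chi = constriction()
best, best_cost = pso_minimize(sphere, dim=5)
print(round(chi, 4), best_cost)
```

With c₁ = c₂ = 2.05 and K = 1 the function returns a coefficient of about 0.73, matching the value quoted in the text.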
There are three main components in an optimization problem, namely the solution variables ($X$), the cost function ($F(X)$) and the constraints. Implementing PSO for ANN training amounts to searching for the optimum ANN weights and biases (the solution variables) that minimize the prediction error (the cost function) subject to the boundary constraints of the weights and biases (the constraints). The formulation of this PSO-based ANN training can be expressed as:
$$ \min_{X} F(X) \quad (19) $$
subject to:
$$ X_{\min} \le X \le X_{\max} \quad (20) $$
where $X_{\min}$ and $X_{\max}$ are the lower and upper bounds of the weights and biases. The objective of the ANN training is to minimize the normalized MSE (NMSE), which is directly related to maximizing the regression coefficient R². Here, the cost function is expressed as:
$$ F(X) = \mathrm{NMSE} = \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \quad (21) $$
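To show how a PSO particle position maps onto network parameters, the sketch below decodes a flat vector into the weights and biases of a one-hidden-layer CFNN and evaluates the NMSE cost. It is a hedged illustration: the parameter ordering inside the vector, the function names, and the synthetic data are assumptions, and NMSE is taken here as the sum of squared errors divided by the total sum of squares (so that NMSE = 1 − R²).

```python
import numpy as np

def n_weights(n_in, n_hidden):
    """Number of solution variables D for a one-hidden-layer CFNN."""
    return n_hidden * n_in + 2 * n_hidden + n_in + 1

def unpack(X_vec, n_in, n_hidden):
    """Split a flat particle position into CFNN weights and biases."""
    i = 0
    W_h = X_vec[i:i + n_hidden * n_in].reshape(n_hidden, n_in)
    i += n_hidden * n_in
    b_h = X_vec[i:i + n_hidden]
    i += n_hidden
    w_o = X_vec[i:i + n_hidden]
    i += n_hidden
    w_d = X_vec[i:i + n_in]          # direct input-to-output weights
    i += n_in
    b_o = X_vec[i]
    return W_h, b_h, w_o, w_d, b_o

def cfnn_predict(X_vec, inputs, n_hidden):
    """CFNN output for every row of inputs, given a particle position."""
    W_h, b_h, w_o, w_d, b_o = unpack(X_vec, inputs.shape[1], n_hidden)
    H = np.tanh(inputs @ W_h.T + b_h)
    return H @ w_o + inputs @ w_d + b_o

def nmse_cost(X_vec, inputs, y, n_hidden):
    """Cost function F(X): normalized MSE, equal to 1 - R^2."""
    y_hat = cfnn_predict(X_vec, inputs, n_hidden)
    return np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)

# Synthetic check: targets generated by a known weight vector give zero cost
rng = np.random.default_rng(0)
inputs = rng.random((50, 3))                 # 3 features, 50 samples
D = n_weights(3, 10)                         # 10 hidden neurons
w_true = rng.uniform(-1.0, 1.0, D)
y = cfnn_predict(w_true, inputs, 10)
print(D, nmse_cost(w_true, inputs, y, 10), nmse_cost(np.zeros(D), inputs, y, 10))
```

A cost function of this shape can be handed directly to any bounded optimizer, with each particle position holding one candidate set of weights and biases.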

Results and Discussion
Following the summary in Table 2, several experiments were carried out to investigate the ANN model setups that give the optimum prediction results. The first result concerns feature selection, as this is the first stage of data preparation before ANN training. The features were selected based on the correlation score between each independent variable (input feature) and the dependent variable (target). Table 3 shows the correlation score between each feature and the target/output variable. The result indicates a strong correlation (R = 0.739) between weighted rainfall (x₁) and the river flow (y). The correlation between minimum temperature (x₃) and the target variable (y) is very low, and therefore x₃ was removed from the input features. A weakly correlated feature would degrade the prediction accuracy if it were not removed.
Table 3. Correlation score for feature selection.

| Variables | Correlation Score (R) |
|---|---|
| x₁ ↔ y | 0.739 |

With these four selected features (after removing x₃), the ANN training experimentation proceeds and the evaluation is performed. For model parsimony, further removal of either feature x₄ or x₅ is also investigated to ascertain whether it affects the model accuracy, because these two features are of the same type, i.e., temperatures.
The first experimentation mainly investigates the number of hidden layer neurons and compares FFNN and CFNN trained with the LM algorithm. Table 4 shows the results of this experimentation. In the first column of Table 4, the notation in square brackets indicates the number of hidden layer neurons; for example, [5] means 5 neurons in one hidden layer, and [10 + 10] means two hidden layers with 10 neurons each. The main finding of this experimentation is that ANNs trained with the LM algorithm have a high tendency to overfit, i.e., good prediction (even perfect, R² = 1) for the training data but poor prediction for the testing data. This occurs in models using both the FFNN and the CFNN structure. The worst cases are highlighted in gray, where the obtained RMSE is so high that the ANN failed to make predictions, i.e., resulting in negative values of R², marked with '−' in the table. In addition, increasing the number of neurons (and layers) tends to increase the chance of overfitting.
The second experimentation is the same as the first, but the BR training algorithm was used. Table 5 shows the results of this experimentation. In Table 5, the lower RMSEs obtained during model testing are marked by '*' and the overfitting situations are highlighted in gray. It can be seen from Table 5 that, in general, CFNN with one hidden layer (5 to 20 hidden neurons) produced lower RMSEs when trained with the BR algorithm. Moreover, increasing the number of neurons (and layers) did not give a satisfactory performance, as can be seen from both Tables 4 and 5. For the FFNN in particular, poor generalization capability (overfitting) was observed as the number of hidden neurons became larger. During testing, the lowest RMSE of 211.1 was obtained when CFNN with 20 hidden neurons (1 layer) was trained with the BR algorithm.
The third experimentation was conducted to show the results of FFNN and CFNN training using the PSO algorithm (FFNNPSO and CFNNPSO, respectively), where Clerc's PSO version was used. The population size used in the PSO was set to 40, and the number of iterations was set to 1000, the same as for the LM and BR algorithms. The results are shown in Table 6. Compared to the ANNs previously trained with the LM and BR algorithms, both FFNNPSO and CFNNPSO generally show good prediction ability on both the training and testing datasets, except for a few cases in which two hidden layers are used (highlighted in gray). Therefore, it is preferable to use only one hidden layer to prevent overfitting. Thus, in the next experimentation, only one hidden layer was used, with some variations in the number of neurons. The lowest RMSEs during testing (marked by '*') were obtained for both FFNNPSO and CFNNPSO with one hidden layer, except for one case of FFNNPSO (row 5 of Table 6). Among all experiments with one hidden layer, only FFNNPSO with 10 hidden neurons shows a slightly lower RMSE during testing, as shown in row 2 of Table 6.
Furthermore, the fourth and fifth experiments were conducted to investigate the CFNN model performance when only three features are used as a parsimonious model. The three features (x1, x2, x5) were used in the fourth experiment, while another combination of three features (x1, x2, x4) was used in the fifth experiment. The result of the fourth experiment is shown in Table 7, where FFNNPSO and CFNNPSO with different numbers of neurons in one hidden layer were investigated. The result indicates that it is possible to have a parsimonious model with only three features (x1, x2, x5) as the ANN input. The prediction on the testing data gave the best performance of R² = 0.88, RMSE = 191.1 cms and MPE = 0.09% when CFNNPSO with 10 hidden neurons was trained, despite a slightly lower RMSE during training compared to the rest. Compared with the four-feature result in Table 6, this makes sense, since features x4 and x5 are essentially of the same type, i.e., mean and max temperatures. Similarly, the results obtained in this study are consistent with the work of Khaki et al. [35], who reported an R² value of 0.84 in the estimation of Langat Basin using a feed-forward neural network. Additionally, Hong and Hong [36] obtained R² values of 0.85, 0.81, and 0.85 for the validation, training and testing datasets, respectively, when multi-layer perceptron neural network models were applied in estimating the water levels of Klang River.
Figure 4 shows the regression plot of the best testing performance for CFNNPSO, with 10 hidden neurons (1 layer) and three input features (x1, x2, x5), resulting in R² = 0.88, RMSE = 191.1 cms and MPE = 0.09%. Table 8 shows the results of the fifth experiment using the other combination of three features (x1, x2, x4). Here, the result shows a significant degradation of the model performance, particularly on the training dataset. This means that the combination (x1, x2, x4) is not suitable for building a parsimonious predictive model.
As a final remark on feature selection, an accurate model can be achieved using either four features (x1, x2, x4, x5) or three features (x1, x2, x5), as these two sets achieve comparable performance as long as one hidden layer is used. In other words, ANNs trained with PSO were able to achieve acceptable accuracy in predicting river flow using only weighted rainfall, average evaporation and max temperature as input variables. However, the CFNN structure is generally preferable, as it produces more robust generalization performance regardless of the number of neurons used.
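The three scores reported above (R², RMSE and MPE) can be reproduced from observed and predicted flows with a few lines of code. The following is a minimal sketch, not the authors' implementation: it assumes R² is the coefficient of determination and that MPE is the mean of the signed percentage errors (observed minus predicted, divided by observed). The function names and the sample flows are hypothetical.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error, in the same unit as the flow (e.g. cms)."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def mpe(y_true, y_pred):
    """Mean percentage error in percent.

    Assumption: signed error (observed - predicted) / observed, so
    over- and under-predictions can cancel out. Observed values must be non-zero.
    """
    n = len(y_true)
    return 100.0 * sum((t - p) / t for t, p in zip(y_true, y_pred)) / n

def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical observed and predicted river flows, for illustration only.
observed = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
scores = (r_squared(observed, predicted), rmse(observed, predicted), mpe(observed, predicted))
```

Because MPE is signed, a value close to zero (such as the reported 0.09%) indicates little systematic bias rather than small individual errors, which is why it is read together with RMSE.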
Furthermore, Multiple Linear Regression (MLR) is used as a benchmark for the ANN predictions above. The MLR is trained via Lasso (L1) regression [37] with the regularization parameter α = 0.2, the same value used during training with the BR algorithm. With the three features (x1, x2, x5), the resulting MLR prediction of the river flow can be expressed as:

$$\hat{y} = 2120.47\,x_1 + 502.76\,x_2 - 1185.77\,x_5 + 447.32 \tag{22}$$

The MLR prediction on the test dataset produces a regression coefficient (R²) of 0.73 and an RMSE of 279.3 cms, which is less accurate than the FFNNPSO and CFNNPSO predictions. This makes sense since MLR assumes a linear relation between the variables.

Finally, the results of this study can be improved through enhancement of the data and improvement of the algorithm. Data-driven predictive modelling relies on the quantity and quality of the recorded data. Moreover, collection of field data is a costly practice that provides a series of snapshots of watercourse behaviour and supplements existing information. Therefore, it is essential to carry out a collaborative desk analysis to gather established existing records from different sources (consultants, environment agency or water services company) to improve current understanding and address gaps in expertise [38]. Additionally, hydrological and mathematical models play a significant role in the forecasting of river basins using field data obtained from different temporal and spatial scales [39].
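Returning to the MLR benchmark, Equation (22) can be evaluated directly once the three inputs are available. The sketch below hard-codes the coefficients and intercept from Equation (22). The function name `mlr_flow` is hypothetical, and the inputs are assumed to be pre-processed (scaled) in the same way as in the study, which this section does not spell out.

```python
# Coefficients and intercept taken from Equation (22).
COEFFS = {"x1": 2120.47, "x2": 502.76, "x5": -1185.77}
INTERCEPT = 447.32

def mlr_flow(x1, x2, x5):
    """Linear river-flow estimate from Equation (22).

    x1: weighted rainfall, x2: average evaporation, x5: max temperature.
    Assumption: inputs use the same scaling as the data the model was fitted on.
    """
    return (COEFFS["x1"] * x1
            + COEFFS["x2"] * x2
            + COEFFS["x5"] * x5
            + INTERCEPT)

# With all inputs at zero, the prediction reduces to the intercept.
baseline = mlr_flow(0.0, 0.0, 0.0)
```

The signs of the coefficients agree with the correlations reported for the predictors: flow increases with rainfall and decreases with temperature. The positive evaporation coefficient differs from its negative pairwise correlation, which can occur when predictors are correlated with one another.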

Conclusions
Predictive modelling of river flow based on meteorological and weather data using multilayer artificial neural networks (ANNs) trained with the particle swarm optimization (PSO) algorithm has been discussed. Sungai Kelantan river flow data ranging from January 1988 to December 2016 was used. The results demonstrate the potential of ANNs as a machine learning tool for predicting river flow from meteorological and weather data. Two ANN structures were used: feed-forward neural networks (FFNN) and cascade-forward neural networks (CFNN). The PSO algorithm used to train the ANNs also contributed to building the predictive model. Generally, ANNs with one hidden layer trained using PSO were able to produce acceptable accuracy and good generalization on both the training and testing datasets. This result is better than the prediction performance of the Multiple Linear Regression (MLR) trained via Lasso (L1) regression. Moreover, a parsimonious model with a reduced, carefully selected set of features was proposed. From the parsimonious model experiments, it was possible to build an ANN predictive model that achieves acceptable accuracy in predicting river flow using only weighted rainfall, average evaporation and max temperature as input variables. The experimental results also indicate that CFNN trained using the PSO algorithm has more robust generalization performance than FFNN in the reduced-feature (parsimonious) model. The model accuracy can still be improved using advanced machine learning techniques such as ensemble methods, an improved optimizer and a cross-validation training procedure.
Furthermore, future research will address several areas, including benchmarking against other machine learning algorithms, benchmarking against other metaheuristic algorithms for ANN training, data augmentation to enhance the diversity of the available data without collecting additional actual data, and real-time deployment of the predictive model in an Internet of Things (IoT) scenario. Despite the efficiency of ANNs as a black-box model for river flow modelling, further research in this area is required. This includes an automated feature selection mechanism, the possibility of using deep learning neural network regression, and improving accuracy and reducing overfitting via different optimizer algorithms. Another area is the deployment stage of the machine learning model, which can involve Big Data, IoT and cloud computing platforms. As AI tools for this purpose are now easily available, this area of study promises high applicability of hydro-informatics systems, especially in Malaysia. The hydro-informatics concept and its implementation need more extensive attention from authorities and decision makers to deal with water resource management, which is currently a serious issue in some countries.