Can Decomposition Approaches Always Enhance Soft Computing Models? Predicting the Dissolved Oxygen Concentration in the St. Johns River, Florida

This study evaluates standalone and hybrid soft computing models for predicting dissolved oxygen (DO) concentration from different water quality parameters. In the first stage, two standalone soft computing models, a multilayer perceptron (MLP) neural network and a cascade correlation neural network (CCNN), were proposed for estimating the DO concentration in the St. Johns River, Florida, USA. The DO concentration and water quality parameters (chloride (Cl), nitrogen oxides (NOx), total dissolved solids (TDS), potential of hydrogen (pH), and water temperature (WT)) were used to develop the standalone models under six combinations of input parameters. Results were evaluated using five performance criteria. Overall, the CCNN model with input combination III (CCNN-III) provided the most accurate predictions of DO concentration among the standalone models (root mean square error (RMSE) = 1.261 mg/L, Nash-Sutcliffe coefficient (NSE) = 0.736, Willmott's index of agreement (WI) = 0.919, R2 = 0.801, and mean absolute error (MAE) = 0.989 mg/L). In the second stage, two decomposition approaches, the discrete wavelet transform (DWT) and variational mode decomposition (VMD), were employed to improve the accuracy of the MLP and CCNN models with input combination III (i.e., DWT-MLP-III, DWT-CCNN-III, VMD-MLP-III, and VMD-CCNN-III). The DWT-MLP-III and VMD-MLP-III models provided better accuracy than the standalone models (MLP-III and CCNN-III). Comparison of the best hybrid soft computing models showed that the VMD-MLP-III model with four intrinsic mode functions (IMFs) and a quadratic penalty factor of 10 (VMD-MLP-III (K = 4 and α = 10)) performed slightly better than the DWT-MLP-III models with the Daubechies-6 (D6) and Symmlet-6 (S6) mother wavelets (DWT-MLP-III (D6 and S6)).
Unfortunately, the DWT-CCNN-III and VMD-CCNN-III models did not improve the performance of the CCNN-III model; the CCNN-III model is therefore not a suitable base for hybrid soft computing modeling of the DO concentration. Graphical comparisons (a Taylor diagram and violin plots) were also utilized to examine the similarity between the observed and predicted DO concentration values. The DWT-MLP-III and VMD-MLP-III models can serve as alternative tools for accurate prediction of the DO concentration.


Cascade Correlation Neural Network (CCNN) Model
A CCNN is an efficient constructive neural network that combines incremental structure growth with learning during training. Training starts with a minimal network consisting of an input and an output layer, without a hidden layer. When training can no longer reduce the residual error, this phase stops and the training of a potential (candidate) hidden neuron begins [64,65]. The potential hidden neuron has incoming connection weights from the input layer and all preexisting hidden neurons, but no connection toward the output layer. These connection weights are optimized by gradient ascent to maximize the correlation between the neuron's output and the residual error of the CCNN model; while a potential hidden neuron is being trained, the connection weights associated with the output layer remain unchanged. When the potential hidden neuron is added to the structure of the CCNN model, it becomes a new hidden neuron, and its incoming connection weights are frozen for the remainder of the training phase [65-67]. Figure 1b represents the structure of the CCNN model.
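The growth procedure above can be sketched in a few lines. The following is a simplified illustration, not the authors' implementation: it uses plain covariance in place of Fahlman's summed correlation, a tanh candidate unit, least squares for the output weights, and a made-up regression task; all function names and data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

def train_output_layer(H, y):
    """Least-squares fit of the output weights (features H -> target y)."""
    Hb = np.column_stack([H, np.ones(len(H))])  # add a bias column
    w, *_ = np.linalg.lstsq(Hb, y, rcond=None)
    return w

def predict(H, w):
    Hb = np.column_stack([H, np.ones(len(H))])
    return Hb @ w

def train_candidate(H, residual, lr=0.5, epochs=300):
    """Gradient ascent on the covariance between a tanh candidate unit
    and the current residual error (the core cascade-correlation step)."""
    n, d = H.shape
    w = rng.normal(scale=0.1, size=d)
    b = 0.0
    e = residual - residual.mean()
    for _ in range(epochs):
        h = np.tanh(H @ w + b)
        # dC/dw for C = cov(h, e); the mean term vanishes because e sums to 0
        g = ((e * (1 - h**2)) @ H) / n
        gb = (e * (1 - h**2)).sum() / n
        C = (h - h.mean()) @ e / n
        s = np.sign(C) if C != 0 else 1.0  # ascend on |C|
        w += lr * s * g
        b += lr * s * gb
    return w, b

# toy nonlinear regression task (hypothetical data)
X = np.linspace(-2, 2, 200).reshape(-1, 1)
y = np.sin(3 * X[:, 0])

H = X.copy()                          # features visible to the output layer
w_out = train_output_layer(H, y)
mse_before = np.mean((y - predict(H, w_out)) ** 2)

for _ in range(5):                    # grow the network one hidden unit at a time
    residual = y - predict(H, w_out)
    wc, bc = train_candidate(H, residual)
    H = np.column_stack([H, np.tanh(H @ wc + bc)])  # freeze and install the unit
    w_out = train_output_layer(H, y)  # only the output weights are retrained

mse_after = np.mean((y - predict(H, w_out)) ** 2)
print(mse_before, mse_after)          # MSE drops as hidden units are added
```

Each installed unit receives the raw inputs plus every earlier hidden unit's output, which is what makes the architecture "cascade".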



Discrete Wavelet Transform (DWT)
Wavelet transform decomposition (WTD) is generally classified into the continuous wavelet transform (CWT) and the discrete wavelet transform (DWT) [68,69]. The DWT requires less computation and is easier to implement than the CWT [68,70]. A fast DWT algorithm requires four filters for perfect reconstruction: a decomposition low-pass, a decomposition high-pass, a reconstruction low-pass, and a reconstruction high-pass filter [68,71-73]. The low-pass filters of the decomposition and reconstruction stages extract the low-frequency components, while the high-pass filters extract the high-frequency components [72,74]. The multi-resolution approach using Mallat's DWT algorithm can be described as a process that depicts an 'approximation' and 'details' of an underlying signal. The approximation captures the general trend of the original signal, while the details provide its high-frequency components [72,73,75]. Further details on Mallat's DWT algorithm can be found in Nason [76] and Percival and Walden [77]. Figure 2 shows Mallat's DWT algorithm for three-level decomposition [73].
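The four-filter analysis/synthesis scheme can be illustrated with the simplest (two-tap Haar) filter pair. The study itself uses longer Daubechies, Symmlet, and Coiflet filters, so this is only a minimal sketch of Mallat's pyramid on assumed toy data:

```python
import numpy as np

def haar_analysis(x):
    """One level of Mallat's pyramid with Haar filters:
    low-pass -> approximation coefficients, high-pass -> detail coefficients."""
    x = np.asarray(x, dtype=float)
    a = (x[0::2] + x[1::2]) / np.sqrt(2.0)   # approximation
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)   # details
    return a, d

def haar_synthesis(a, d):
    """Inverse step using the reconstruction low-pass and high-pass filters."""
    x = np.empty(2 * len(a))
    x[0::2] = (a + d) / np.sqrt(2.0)
    x[1::2] = (a - d) / np.sqrt(2.0)
    return x

# three-level decomposition: x -> A3, D3, D2, D1 (as in Figure 2)
rng = np.random.default_rng(0)
x = np.sin(np.linspace(0, 8 * np.pi, 64)) + 0.1 * rng.normal(size=64)
a1, d1 = haar_analysis(x)
a2, d2 = haar_analysis(a1)
a3, d3 = haar_analysis(a2)

# the four-filter bank achieves perfect reconstruction
x_rec = haar_synthesis(haar_synthesis(haar_synthesis(a3, d3), d2), d1)
print(np.allclose(x, x_rec))  # True
```

The same decompose/reconstruct structure holds for the longer mother wavelets; only the filter coefficients (and the boundary handling) change.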


Variational Mode Decomposition (VMD)
VMD is a fully adaptive and non-recursive algorithm for time-frequency signal analysis [78]. An original time series, f, can be decomposed into K intrinsic mode functions (IMFs) using the VMD approach. The constrained variational formulation for generating the IMFs can be written as Equation (1):

$$\min_{\{u_k\},\{\omega_k\}}\left\{\sum_{k=1}^{K}\left\|\partial_t\left[\left(\delta(t)+\frac{j}{\pi t}\right)*u_k(t)\right]e^{-j\omega_k t}\right\|_2^2\right\}\ \text{subject to}\ \sum_{k=1}^{K}u_k=f \tag{1}$$

where δ = the Dirac function; j² = −1; ‖·‖₂ = the L² norm; ω_k = the kth center frequency; * = convolution; u_k(t) = A_k(t) cos(φ_k(t)) = the kth IMF; φ_k = a non-decreasing phase function; and A_k = a non-negative envelope function. The constrained variational formulation can be recast in the following unconstrained form using an augmented Lagrangian method [78,79]:

$$L\left(\{u_k\},\{\omega_k\},\lambda\right)=\alpha\sum_{k=1}^{K}\left\|\partial_t\left[\left(\delta(t)+\frac{j}{\pi t}\right)*u_k(t)\right]e^{-j\omega_k t}\right\|_2^2+\left\|f(t)-\sum_{k=1}^{K}u_k(t)\right\|_2^2+\left\langle\lambda(t),\,f(t)-\sum_{k=1}^{K}u_k(t)\right\rangle \tag{2}$$

where L = the augmented Lagrangian; α = the quadratic penalty factor; λ = the Lagrange multiplier; and ⟨a, b⟩ = the scalar product of a and b. Figure 3 explains the flowchart of the VMD algorithm [73].

Hybrid Modeling Using DWT and VMD Approaches
DWT-based soft computing models (DWT-MLP and DWT-CCNN) are hybrid models that combine the standalone models (MLP and CCNN) with the DWT. In the same manner, VMD-based soft computing models (VMD-MLP and VMD-CCNN) couple the standalone models with the VMD. The DWT- and VMD-based soft computing models therefore consist of three steps: (1) the training and testing datasets are decomposed into an approximation and multiple details using the DWT, or into multiple IMFs using the VMD; (2) a standalone model (MLP or CCNN) is developed for each decomposed training sub-series; (3) the final predictions of the DO concentration are obtained by aggregating the sub-series predictions of the standalone models. Figure 4 represents the flowchart for DWT- and VMD-based soft computing modeling.
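The three steps can be sketched generically. In this illustration a moving-average split stands in for the DWT/VMD decomposition and a least-squares AR(1) fit stands in for the MLP/CCNN; both are placeholders, not the paper's models, and the series is synthetic:

```python
import numpy as np

def decompose(y):
    """Toy stand-in for DWT/VMD: split a series into a smooth component
    and a residual, which sum exactly back to the original series."""
    kernel = np.ones(5) / 5.0
    smooth = np.convolve(np.pad(y, 2, mode="edge"), kernel, mode="valid")
    return [smooth, y - smooth]

def fit_ar1(series):
    """Least-squares one-step AR(1) predictor for one sub-series."""
    a, *_ = np.linalg.lstsq(series[:-1, None], series[1:], rcond=None)
    return lambda prev: a[0] * prev

# step 1: decompose the series into sub-series
rng = np.random.default_rng(1)
y = np.sin(np.linspace(0, 6 * np.pi, 200)) + 0.05 * rng.normal(size=200)
subs = decompose(y)

# step 2: one standalone model per sub-series
models = [fit_ar1(s) for s in subs]

# step 3: aggregate the sub-series predictions into the final prediction
pred = sum(m(s[:-1]) for m, s in zip(models, subs))
mse_hybrid = np.mean((y[1:] - pred) ** 2)

# baseline: a single model on the raw series
base = fit_ar1(y)
mse_single = np.mean((y[1:] - base(y[:-1])) ** 2)
print(mse_hybrid, mse_single)
```

The key structural property, used by both DWT- and VMD-based hybrids, is that the sub-series sum back to the original signal, so summing the sub-predictions is a valid aggregation.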


Performance Evaluation of Models
Performance measures are assessed by comparing predicted values with their corresponding observed values using the following criteria:

I. Root mean square error (RMSE):

$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(DO_{obs,i}-DO_{pre,i}\right)^{2}}$$

II. Nash-Sutcliffe coefficient (NSE):

$$\mathrm{NSE}=1-\frac{\sum_{i=1}^{n}\left(DO_{obs,i}-DO_{pre,i}\right)^{2}}{\sum_{i=1}^{n}\left(DO_{obs,i}-\overline{DO}_{obs}\right)^{2}}$$

III. Willmott's index of agreement (WI):

$$\mathrm{WI}=1-\frac{\sum_{i=1}^{n}\left(DO_{obs,i}-DO_{pre,i}\right)^{2}}{\sum_{i=1}^{n}\left(\left|DO_{pre,i}-\overline{DO}_{obs}\right|+\left|DO_{obs,i}-\overline{DO}_{obs}\right|\right)^{2}}$$

IV. Mean absolute error (MAE):

$$\mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\left|DO_{obs,i}-DO_{pre,i}\right|$$

where DO_obs and DO_pre are the observed and predicted values, respectively; the overlined terms are the averages of the observed and predicted values; and n is the length of the time series. The RMSE measures the discrepancy between observed and predicted values in the absolute units of the variable, with a value of zero reflecting perfect prediction [80]. The NSE evaluates the predictive ability of a model [81]: it equals zero when the squared prediction error is as large as the variance of the observed DO values, it is negative when the observed mean is a better predictor than the model [81,82], and it equals one for a perfect model [83]. The WI varies between zero and one; because it is based on a ratio of mean square errors, it can provide an advantage over the RMSE [84-86]. The MAE weights all deviations from the observed DO values equally, without emphasizing higher or lower magnitudes [87].
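Assuming the standard definitions of these four criteria, they can be computed as follows; the sample arrays are illustrative only:

```python
import numpy as np

def evaluate(do_obs, do_pre):
    """RMSE, NSE, WI, and MAE for observed vs. predicted DO values."""
    do_obs = np.asarray(do_obs, dtype=float)
    do_pre = np.asarray(do_pre, dtype=float)
    err = do_obs - do_pre
    rmse = np.sqrt(np.mean(err ** 2))
    nse = 1 - np.sum(err ** 2) / np.sum((do_obs - do_obs.mean()) ** 2)
    wi = 1 - np.sum(err ** 2) / np.sum(
        (np.abs(do_pre - do_obs.mean()) + np.abs(do_obs - do_obs.mean())) ** 2
    )
    mae = np.mean(np.abs(err))
    return {"RMSE": rmse, "NSE": nse, "WI": wi, "MAE": mae}

obs = np.array([6.2, 7.1, 5.8, 6.9, 7.4])           # hypothetical DO values (mg/L)
print(evaluate(obs, obs))                            # perfect model: RMSE=MAE=0, NSE=WI=1
print(evaluate(obs, np.full(5, obs.mean())))         # mean predictor: NSE=0
```

The second call illustrates the NSE benchmark described above: predicting the observed mean everywhere gives NSE = 0 exactly.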

Case Study
In this study, the fluctuations of the DO concentration as a function of several water quality parameters (chloride (Cl), nitrogen oxides (NOx), total dissolved solids (TDS), potential of hydrogen (pH), and water temperature (WT)) were selected as a case study. The study area is located in eastern Florida at a latitude of 28°32′33.864″ N and a longitude of 80°56′33.428″ W (Figure 5). The dataset comprises 232 records covering the period 1996-2013. The data were divided into two parts for the training and testing phases: the training dataset comprised 80% (n = 186) of the data, and the testing dataset covered the remaining 20% (n = 46).
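The 80/20 partition can be reproduced as follows; a sequential split is assumed here purely for illustration, whereas the study states the data were divided arbitrarily:

```python
import numpy as np

n_total = 232                      # records collected over 1996-2013
n_train = round(0.8 * n_total)     # 186 records for the training phase
n_test = n_total - n_train         # 46 records for the testing phase

data = np.arange(n_total)          # stand-in for the 232 DO records
train, test = data[:n_train], data[n_train:]
print(len(train), len(test))       # 186 46
```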


Setting up the Standalone Models
Setting up the Standalone Models
The statistical parameters of the collected dataset are presented in Table 1. A considerably wide range of data values can be observed in Table 1 (e.g., max of NOx = 0.46 mg/L and max of TDS = 1950.0). This implies that, before being fed to the standalone models (MLP and CCNN), the data should be normalized to a specific range (e.g., from 0 to 1). As can be seen from Table 1, pH was the parameter most strongly correlated with the DO concentration (correlation coefficient (CC) = 0.760), followed by NOx (CC = 0.554) and then WT, whose negative correlation coefficient (CC = −0.544) denotes the inverse effect of WT on the DO concentration. Based on the coefficient of variation, the WT data were the least dispersed and the NOx data the most dispersed. Several input combinations for setting up the standalone models (MLP and CCNN) can be created from the five water quality parameters (chloride (Cl), nitrogen oxides (NOx), total dissolved solids (TDS), potential of hydrogen (pH), and water temperature (WT)). Bear in mind that the single value in the output layer always corresponds to the DO concentration. In this study, six input combinations were constructed (see Table 2), ranging from all five parameters (combination I) down to a single parameter (combinations V and VI), the latter chosen from the highest positive and negative CC values in Table 1.
Constructing the ANN architecture involves specifying the ANN topology and training parameters, such as the number of neurons in the hidden layer(s). Looking back at Table 2, the first step of the ANN architecture, determining the input combinations, was already completed. In the second step, the main structure of the ANN in terms of the number of layers should be specified. Based on reports of the capability of one-hidden-layer supervised neural networks in simulating complex phenomena [88], this study adopted a one-hidden-layer architecture for the standalone models (MLP and CCNN). Finally, the optimal number of neurons in the hidden layer was determined by a trial-and-error approach using the MSE criterion (see the third column of Tables 3-5). Note: * shows the best performance for each column; ** stands for the best model.
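The 0-1 scaling mentioned above is plain min-max normalization. In the sketch below, only the maximum NOx (0.46) and TDS (1950.0) values come from Table 1; the other sample values are invented, and reusing the training min/max on the testing data is a good-practice assumption rather than a detail stated in the paper:

```python
import numpy as np

def minmax_scale(x, x_min=None, x_max=None):
    """Scale a feature to [0, 1]; pass the training min/max when scaling
    the testing data so both phases use the same mapping."""
    x = np.asarray(x, dtype=float)
    x_min = x.min() if x_min is None else x_min
    x_max = x.max() if x_max is None else x_max
    return (x - x_min) / (x_max - x_min), x_min, x_max

tds = np.array([480.0, 910.0, 1950.0, 650.0])   # wide-ranged TDS values
nox = np.array([0.05, 0.12, 0.46, 0.30])        # narrow-ranged NOx values
tds_s, lo, hi = minmax_scale(tds)
nox_s, *_ = minmax_scale(nox)
print(tds_s.min(), tds_s.max())                 # 0.0 1.0
```

Without this step the TDS values (order 10³) would dominate the NOx values (order 10⁻¹) in the network's weighted sums.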

Performance of Standalone Models
A statistical summary of the DO concentration predictions of the standalone models (MLP and CCNN) is given in Table 3. Judged by higher (NSE, WI, and R2) and lower (RMSE and MAE) values, the first (MLP-I) and third (MLP-III) input combinations performed best for the training and testing phases, respectively. Since, however, model selection is always based on the performance of the testing phase, the MLP model with the third input combination (MLP-III) was selected as the best MLP model. A similar interpretation applies to the CCNN models: while the second combination (CCNN-II) provided the best performance for the training phase, the third input combination (CCNN-III) gave the best results for the testing phase. All the statistics were improved in the CCNN-III model in comparison to the MLP-III model. A general comparison of the MLP-III and CCNN-III models revealed that the CCNN-III model yielded better predictions than the MLP-III model: the RMSE, NSE, WI, R2, and MAE criteria for the testing phase of the CCNN-III model were improved by 129%, 8%, 2%, 3%, and 434%, respectively, compared with the MLP-III model.
The variations of the observed versus predicted DO concentration values are depicted in Figure 6a,b. Visual analysis confirmed the better agreement of the CCNN-III model with the observed values compared to the MLP-III model. Figure 7a,b provides scatter plots of the testing dataset between the observed and predicted values using the MLP-III and CCNN-III models. From visual interpretation of Figure 7a,b, the dots in the MLP-III plot are slightly more scattered than those in the CCNN-III plot. In addition, based on the coefficient of determination and the closeness of the trend-line slopes to unity, it can be concluded that the CCNN-III model performed better in predicting the DO concentration.



DWT-Based Soft Computing Models
To decompose the input dataset using the DWT algorithm, the optimal level of decomposition (L) should be selected. Although the optimal level can also be found by trial and error, that approach is time-consuming; in this study, Equation (8) was therefore used to calculate it [68,70,72]:

$$L=\mathrm{int}\left[\log_{10}(N)\right] \tag{8}$$

where N is the length of the time series and int[k] returns the integer portion of the real number k. Using Equation (8), L = 2 was determined for this study. In addition, the mother wavelet has to be set before the DWT-based soft computing models are employed. Using different mother wavelets, each input series was decomposed into two detail components (D1 and D2) and one approximation component (A2) [72,75]. For the DWT algorithm, Daubechies, Symmlet, and Coiflet wavelets have frequently been used as mother wavelets in previous studies [72,75,89,90]. Therefore, the mother wavelets Coiflet-6 (C6), Coiflet-12 (C12), Coiflet-18 (C18), Daubechies-6 (D6), Daubechies-12 (D12), Daubechies-18 (D18), Symmlet-6 (S6), Symmlet-12 (S12), and Symmlet-18 (S18) were implemented, and for each DWT-based soft computing model the mother wavelet yielding the best performance was recommended. Figure 9 shows the approximation and details decomposed using the Symmlet-6 (S6) mother wavelet for the original water temperature (WT).
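Assuming the common form L = int[log10(N)] for Equation (8), which reproduces L = 2 for the N = 232 records of this study:

```python
import math

def optimal_level(n):
    """Assumed form of Equation (8): decomposition level from series length."""
    return int(math.log10(n))

print(optimal_level(232))  # 2, matching the L = 2 used in the study
```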

Figure 9. Original water temperature and sub-series (details (D1 and D2) and approximation (A2) components) decomposed using the Symmlet-6 (S6) mother wavelet.

Table 4 gives the performance statistics of the DWT-MLP-III and DWT-CCNN-III models during the training and testing phases. Tables 3 and 4 suggest that all of the DWT-MLP-III models improved the performance of the MLP-III model significantly, while none of the DWT-CCNN-III models improved the performance of the CCNN-III model during the testing phase. In addition, the DWT-MLP-III (D6) and DWT-MLP-III (S6) models (RMSE = 0.161 mg/L, NSE = 0.983, WI = 0.996, R2 = 0.983, and MAE = 0.061 mg/L for both D6 and S6) produced the best results among all of the DWT-MLP-III models during the testing phase. Coupling the DWT with the MLP-III model thus improved the accuracy of the DO concentration predictions. However, the CCNN-III model provided better results than any of the DWT-CCNN-III models, and all the DWT-MLP-III models yielded better results than all the DWT-CCNN-III models. Figure 10a,b shows the scatter plots of the testing dataset between the observed and predicted values using the DWT-MLP-III (S6) and DWT-CCNN-III (S6) models. From visual interpretation of Figure 10a,b, the dots in the DWT-CCNN-III (S6) plot are far more scattered than those in the DWT-MLP-III (S6) plot. In addition, based on the coefficient of determination and the closeness of the trend-line slopes to unity, it can be concluded that the DWT-MLP-III (S6) model performed better in predicting the DO concentration.

VMD-Based Soft Computing Models
To decompose the input dataset using the VMD algorithm, the number of IMFs (K) and the quadratic penalty factor (α) have to be specified in advance. In this study, different sets of parameters were investigated, and the three sets giving the highest correlations between the original data and the aggregation of the decomposed series were selected: (K, α) = {(3, 5), (4, 5), (4, 10)}. Among them, the parameters yielding the best performance of the VMD-based soft computing models (K = 4 and α = 10) were finally chosen. Figure 11 shows the original WT series and the IMFs decomposed using the VMD algorithm (K = 4 and α = 10).
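The VMD updates of Equations (1) and (2) can be sketched in the Fourier domain. This is a minimal illustration of the algorithm of Dragomiretskiy and Zosso, not the toolbox used in the paper: it works on a one-sided spectrum without mirror extension, uses a fixed initialization and a synthetic test signal, and its α is expressed in normalized-frequency units, so its magnitude is not comparable to the study's α = 10:

```python
import numpy as np

def vmd(f, K, alpha, tau=0.0, n_iter=500, tol=1e-9):
    """Minimal variational mode decomposition via ADMM-style updates
    in the frequency domain. Returns (modes, sorted center frequencies)."""
    N = len(f)
    f_hat = np.fft.rfft(f)
    freqs = np.fft.rfftfreq(N)                  # normalized frequencies in [0, 0.5]
    u_hat = np.zeros((K, len(freqs)), dtype=complex)
    omega = np.linspace(0.05, 0.45, K)          # initial center frequencies
    lam = np.zeros(len(freqs), dtype=complex)   # Lagrange multiplier spectrum

    for _ in range(n_iter):
        u_prev = u_hat.copy()
        for k in range(K):
            rest = u_hat.sum(axis=0) - u_hat[k]
            # Wiener-filter update; alpha is the quadratic penalty (bandwidth) factor
            u_hat[k] = (f_hat - rest + lam / 2) / (1 + 2 * alpha * (freqs - omega[k]) ** 2)
            power = np.abs(u_hat[k]) ** 2
            omega[k] = (freqs @ power) / (power.sum() + 1e-30)  # spectral center of mass
        lam = lam + tau * (f_hat - u_hat.sum(axis=0))
        if np.abs(u_hat - u_prev).sum() < tol:
            break

    return np.fft.irfft(u_hat, n=N), np.sort(omega)

# two well-separated tones are recovered as two IMFs
n = np.arange(500)
f = np.cos(2 * np.pi * 0.05 * n) + 0.5 * np.cos(2 * np.pi * 0.20 * n)
imfs, omega = vmd(f, K=2, alpha=2000.0)
print(np.round(omega, 3))  # approximately [0.05, 0.2]
```

Larger α narrows each mode's bandwidth around its center frequency, which is why K and α must be tuned jointly, as done in the study.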


Table 5 shows the performance statistics of the VMD-MLP-III and VMD-CCNN-III models during the training and testing phases. Tables 3 and 5 indicate that the VMD-MLP-III models improved the performance of the MLP-III model, while the VMD-CCNN-III models did not improve the performance of the CCNN-III model. Tables 4 and 5 show that the DWT- and VMD-MLP-III models produced similar statistical patterns; a comparison of the best models revealed, however, that the VMD-MLP-III (K = 4 and α = 10) model yielded slightly better results than the DWT-MLP-III (D6 and S6) models. Figure 12a,b shows scatter plots of the testing dataset between the observed and predicted values using the VMD-MLP-III (K = 4 and α = 10) and VMD-CCNN-III (K = 4 and α = 10) models. From visual interpretation of Figure 12a,b, the dots in the VMD-CCNN-III (K = 4 and α = 10) plot are far more scattered than those in the VMD-MLP-III (K = 4 and α = 10) plot. Based on the coefficient of determination and the closeness of the trend-line slopes to unity, it can be concluded that the VMD-MLP-III (K = 4 and α = 10) model performed better in predicting the DO concentration.
In general, various studies have reported that combining soft computing models with decomposition approaches improves the accuracy and reliability of DO concentration predictions [12,47-52]. Although the CCNN model showed outstanding performance among the standalone models, combining it with the decomposition approaches did not improve its performance; its particular structure (e.g., incrementally added hidden nodes with frozen incoming weights) may limit its ability to model the complex nonlinear decomposed sub-signals. To confirm these findings, further studies are required using different data, soft computing models, and decomposition approaches for predicting diverse environmental parameters (e.g., BOD, COD, TP, and TN).

Diagnostic Analysis
In this study, two diagnostic analysis methods (the Taylor diagram and the violin plot) were considered for visual evaluation of the model performance.


Diagnostic Analysis
In this study, two diagnostic analysis methods (i.e., the Taylor diagram and the violin plot) were considered for visual evaluation of model performance.

Taylor Diagram
A polar plot introduced by Taylor [91] was drawn to obtain a visual understanding of model performance. It highlights how well model predictions agree with observed values. The Taylor diagram depicts three statistics: (1) the correlation coefficient (the azimuth angle), (2) the normalized standard deviation (radial distance from the origin), and (3) the centered RMSE (distance from the reference observed point). A perfect prediction would plot exactly on the reference point, with a correlation coefficient equal to unity and the same amplitude of variations as the observations [91-94]. Figure 13a shows the Taylor diagram of the standalone models (MLP and CCNN). For the best models, the diagram shows that the CCNN-III model had a lower RMSE than the MLP-III model. Although the correlation coefficients and standard deviations of the predicted data for both models were lower than those of the observations, the point representing the CCNN-III model was closer to the observation point. Figure 13b shows the Taylor diagram of the hybrid models. The diagram shows that the DWT-MLP-III (S6) and VMD-MLP-III (K = 4 and α = 10) models had lower RMSE values than the DWT-CCNN-III (S6) and VMD-CCNN-III (K = 4 and α = 10) models. Although the correlation coefficients and standard deviations of the predicted data of the DWT-MLP-III (S6) and VMD-MLP-III (K = 4 and α = 10) models were lower than those of the observations, the point representing the VMD-MLP-III (K = 4 and α = 10) model was closer to the observation point.
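The three statistics the Taylor diagram encodes can be computed directly. The sketch below is a generic NumPy illustration with synthetic data, not the authors' plotting code; it also checks the law-of-cosines identity that lets a single diagram display all three statistics at once.

```python
import numpy as np

def taylor_stats(obs, pred):
    """The three statistics a Taylor diagram encodes for one model.

    Returns (correlation r, normalized standard deviation s, normalized
    centered RMSE e), which satisfy the law-of-cosines identity the
    diagram is built on: e**2 = 1 + s**2 - 2*s*r (after normalizing by
    the observed standard deviation).
    """
    obs, pred = np.asarray(obs, dtype=float), np.asarray(pred, dtype=float)
    r = np.corrcoef(obs, pred)[0, 1]
    s_obs, s_pred = obs.std(), pred.std()
    # Centered RMSE: bias is removed, leaving only pattern disagreement.
    crmse = np.sqrt(np.mean(((pred - pred.mean()) - (obs - obs.mean())) ** 2))
    return r, s_pred / s_obs, crmse / s_obs

# Synthetic DO-like observations and a plausible model output
# (illustrative only, not the study's data).
rng = np.random.default_rng(1)
obs = rng.normal(6.4, 1.8, 200)
pred = 0.8 * obs + rng.normal(0, 0.6, 200)

r, s, e = taylor_stats(obs, pred)
assert np.isclose(e**2, 1 + s**2 - 2 * s * r)  # the diagram's geometry
```

A model point lands closer to the observation point on the diagram exactly when this normalized centered RMSE is smaller.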

Violin Plot
As a further diagnostic tool, violin plots are used in Figure 14a,b to assess the predicted DO concentrations of the developed models. The violin plot, which indicates the probability distribution of the observed and predicted datasets, combines a box plot with a kernel density plot [95]. Based on the legends of Figure 14a, the MLP-III model predicted the median of the observed data reasonably well (6.173 vs. 6.403 mg/L), while the 25th and 75th percentiles of the CCNN-III model fit better than those of the MLP-III model. In addition, the MLP-III model overestimated the minimum, 25th percentile, median, and 75th percentile of the DO concentration, whereas the CCNN-III model underestimated the 25th percentile, median, and 75th percentile. Overall, the violin plots indicated that the CCNN-III model performed better than the MLP-III model. Figure 14b shows that the VMD-MLP-III (K = 4 and α = 10) model overestimated the minimum, 25th percentile, and median of the DO concentration, while the DWT-MLP-III (S6) model underestimated the minimum, median, 75th percentile, and maximum. Overall, the violin plots indicated that the VMD-MLP-III (K = 4 and α = 10) model performed slightly better than the DWT-MLP-III (S6) model.
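The quantities read off a violin plot's inner box (minimum, quartiles, median, maximum) can be computed as below; this is a generic NumPy illustration with synthetic data, not the study's Figure 14 values.

```python
import numpy as np

def five_number_summary(x):
    """Minimum, 25th percentile, median, 75th percentile, maximum --
    the box-plot core of a violin plot (the kernel density estimate
    provides the violin's outline)."""
    return np.percentile(np.asarray(x, dtype=float), [0, 25, 50, 75, 100])

# Synthetic observed/predicted DO values in mg/L (illustrative only; the
# paper reports an observed median of 6.173 mg/L in Figure 14a).
rng = np.random.default_rng(2)
observed = rng.normal(6.2, 1.7, 300)
predicted = rng.normal(6.4, 1.5, 300)

# Element-wise comparison mirrors the over/underestimation reading of
# Figure 14: True where the model overestimates that statistic.
overestimates = five_number_summary(predicted) > five_number_summary(observed)
print(dict(zip(["min", "q25", "median", "q75", "max"], overestimates.tolist())))
```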

Conclusions
This study investigated the accuracy of two heuristic (MLP and CCNN) and two decomposition (DWT and VMD) approaches for predicting dissolved oxygen (DO) concentration. To achieve this goal, the DO concentration and five water quality input parameters (Cl, NOx, TDS, pH, and WT) measured in the St. Johns River, Florida, USA, were used. For training and testing the developed models, the total dataset was divided into 80% and 20%, respectively. Several statistical indices (e.g., RMSE, NSE, WI, R², and MAE) and diagnostic analyses (e.g., Taylor diagram and violin plot) were used to compare the developed models.
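The five evaluation criteria used to rank the models follow standard formulas and can be sketched in a few lines; this is a generic implementation, not the authors' code.

```python
import numpy as np

def evaluation_metrics(obs, pred):
    """RMSE, Nash-Sutcliffe efficiency (NSE), Willmott's index of
    agreement (WI), R^2, and MAE, using their standard definitions."""
    obs, pred = np.asarray(obs, dtype=float), np.asarray(pred, dtype=float)
    err = pred - obs
    rmse = np.sqrt(np.mean(err**2))
    nse = 1 - np.sum(err**2) / np.sum((obs - obs.mean())**2)
    wi = 1 - np.sum(err**2) / np.sum(
        (np.abs(pred - obs.mean()) + np.abs(obs - obs.mean()))**2)
    r2 = np.corrcoef(obs, pred)[0, 1] ** 2
    mae = np.mean(np.abs(err))
    return {"RMSE": rmse, "NSE": nse, "WI": wi, "R2": r2, "MAE": mae}

# A perfect prediction scores RMSE = MAE = 0 and NSE = WI = R2 = 1.
perfect = evaluation_metrics([5.0, 6.0, 7.0, 8.0], [5.0, 6.0, 7.0, 8.0])
assert perfect["RMSE"] == 0 and perfect["NSE"] == 1 and perfect["WI"] == 1
```

RMSE and MAE are in the units of the target (mg/L here), while NSE, WI, and R² are dimensionless skill scores with 1 as the ideal value.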
In the first stage, it was found that the CCNN-III model provided the most accurate predictions of DO concentration (RMSE = 1.261 mg/L, NSE = 0.736, WI = 0.919, R² = 0.801, and MAE = 0.989 mg/L) among the standalone models. In the second stage, the VMD-MLP-III (K = 4 and α = 10) model yielded slightly better performance than the DWT-MLP-III (S6) model among the hybrid models.
