Machine Learning Models for Predicting Water Quality of Treated Fruit and Vegetable Wastewater

Mundi, Gurvinder; Zytner, Richard G.; Warriner, Keith; Bonakdari, Hossein; Gharabaghi, Bahram

doi:10.3390/w13182485


by Gurvinder Mundi ¹, Richard G. Zytner ²,*, Keith Warriner ³, Hossein Bonakdari ⁴ and Bahram Gharabaghi ²

¹ Mueller (Echologics Division), Toronto, ON M9W 1B3, Canada

² School of Engineering, University of Guelph, Guelph, ON N1G 2W1, Canada

³ Department of Food Science, University of Guelph, Guelph, ON N1G 2W1, Canada

⁴ Department of Soils and Agri-Food Engineering, Laval University, Québec, QC G1V 0A6, Canada

* Author to whom correspondence should be addressed.

Water 2021, 13(18), 2485; https://doi.org/10.3390/w13182485

Submission received: 21 July 2021 / Revised: 31 August 2021 / Accepted: 2 September 2021 / Published: 10 September 2021

(This article belongs to the Topic Emerging Solutions for Water, Sanitation and Hygiene)


Abstract

Wash-waters and wastewaters from the fruit and vegetable processing industry are characterized by solids and organic content that require treatment to meet regulatory standards for purpose-of-use. In this study, the efficacy of 13 different water remediation methods (coagulation, filtration, bioreactors, and ultraviolet-based methods) in treating 14 types of wastewater derived from fruit and vegetable processing (fruit, root vegetables, leafy greens) was examined. Each treatment was assessed in terms of reducing suspended solids, total phosphorus, nitrogen, and biochemical and chemical oxygen demand. From the data generated, it was possible to develop predictive models for each of the water treatments tested. Models to predict post-treatment water quality were developed using multiple linear regression (coefficient of determination (R²) of 30 to 83%) and were improved by the generalized structure of group method of data handling (R² of 73–99%). These two modeling approaches were selected because they produce robust, explicit equations that are practical and easy to use. The large variability and complex nature of wastewater quality parameters were challenging to represent in linear models; however, they were better represented by the group method of data handling technique, as shown in the study. The models provide an important tool for end users in selecting the appropriate treatment based on the original wastewater characteristics and the required standards for the treated water.

Keywords: water quality; multiple linear regression (MLR); wastewater; machine learning; wash-water; food processing; treatment feasibility; water resources

1. Introduction

Fruit and vegetable processing is a water-demanding activity, with the main uses being washing, cutting, peeling, sanitizing, cooling, transporting, and cleaning of equipment and machinery [1]. Waters of different quality are used at different stages; however, water of the highest quality, that of the potable standard, is required for final cleaning/rinsing. The high water usage ultimately leads to the generation of large volumes of wastewater [2,3]. The wastewater (used once and wasted) and wash-water (recirculated water reused for washing) are often high in solids, biochemical oxygen demand (BOD), and other pollutants [4,5]. To reduce environmental impact, the generated wastewaters and wash-waters require adequate treatment and disinfection before they are either reused in the process or disposed of.

Adequate treatment of waters is necessary to provide proper food safety when water is reused within the process and to protect the environment from receiving excess loads of contamination [6]. The recirculation and reuse of the wash-water are important in reducing the amount of wastewater requiring treatment or disposal in an overall program.

Thus, water reclamation and reuse are often practiced to minimize the amount of water utilized in many water-intensive industries like the fruit and vegetable washing and processing sector.

A diverse range of wastewater treatments is available, with variable efficacy in removing soluble and insoluble constituents. Moreover, the degree to which wastewater needs to be treated depends on the end-use, for example, disposal or re-introduction into the process. Selecting which treatment to apply is challenging due to the lack of comprehensive datasets and models for individual methods or combinations of methods across different wastewater types. Information and models/tools that address treatment, water reuse feasibility, and its prediction are lacking.

Multiple linear regression (MLR) methods are popularly used for predicting post-treatment water quality parameters [7,8]. Due to the multiplicity of factors affecting post-treatment water quality, these linear data analysis methods may not be robust enough to deal with the complex data patterns in such problems. Several machine learning techniques offer a promising way to cope with the limitations of MLR methods. Owing to their simple architecture, artificial neural networks (ANNs) are a good alternative mathematical model and have been used widely in recent studies to solve nonlinear problems in the science and engineering sector [9,10,11]. Granata et al. [12] highlighted that machine learning algorithms and effective use of field data are useful in modelling water quality parameters, which can then be considered for the sizing of the treatment unit [13]. Furthermore, different optimization methods have been developed to improve quality and sustainability in the wastewater treatment process [14,15,16]. For problems with highly correlated input parameters, such techniques are very effective and easy to implement. However, ANN-based models are not able to generate an explicit equation between the input parameters and the output variable(s), which is much needed in engineering problems for practical application [10,17,18].

Among the various intelligence methods, a reliable and self-organizing neural network sub-model is the group method of data handling (GMDH) method. This method is successfully applied in diverse nonlinear problems such as plant disease detection [19], pressure meter modulus [20], soil temperature [21], dispersion coefficients in water pipelines [22], and small strain shear modulus of grouted sands [23]. Although the modeling competence of ANN-based techniques has been well documented, very few GMDH-based models with application in post-treatment water quality parameters analysis have been reported in recent literature. Thus, the objective of this study is to address the knowledge gap within the industry by developing models using well-known MLR techniques and the more recent GMDH-based methods. Producers and government are all interested in knowing the efficiency of various treatment processes, but information is lacking in the agriculture sector.

The proposed models highlighted in this study will predict post-treatment (treated) water quality parameters for total suspended solids (TSS_Treated), chemical oxygen demand (COD_Treated), biochemical oxygen demand (BOD_Treated), total nitrogen (TN_Treated), total phosphorus (TP_Treated), and ammonia as nitrogen (NH4-N_Treated), based on operation type, raw water quality, and wastewater treatment process (such as settling, dissolved air flotation, and membrane bioreactors).

The predicted post-treatment water quality parameters can be utilized to address treatment options and to assess the potential for meeting regulatory compliance for wastewater disposal and reuse, and to better understand the impact of releasing treated waters into the environment and on sensitive organisms downstream. The models will be of great use to all stakeholders, such as farmers, producers, processors, technology providers, consultants, and regulators to detect the level of contamination in wastewater before disposal or reuse.

2. Materials and Methods

2.1. Study Area, Wash-Water Samples, and Laboratory Analysis

The study was carried out in Southern Ontario, Canada. The selected wastewaters and wash-waters were collected from various fruit and vegetable washing and processing facilities, including fresh-cut produce (apple, carrot, potato, ginseng, and others). Many of the facilities were on-farm operations currently utilizing passive treatments such as settling ponds, while other samples were collected from urban facilities, which dispose of wastewaters in the municipal sewer. Wash-waters were generated in many applications where fruits and vegetables were washed to remove soil, processed, and/or microbiologically decontaminated. Different processing steps were required based on the type of vegetable or fruit being processed.

Root vegetables such as carrots, potatoes, and ginseng require washing to clean and remove soils that are attached to root material. This process is referred to as washing (W). However, whenever the root vegetable or another fruit or vegetable requires processing, such as cutting/peeling, the process is referred to as washing and processing (WP). In addition, the products were classified as root vegetable, tree fruit, leafy green, or above ground. The full dataset consists of four independent subsets, two of which were studied and presented in Mundi et al. [24], while the remaining two, collected by the Ontario Ministry of Agriculture, Food and Rural Affairs (OMAFRA), were introduced for characterization in Mundi et al. [25]. The results presented in Mundi et al. [24] were table-based decision matrices, where the user would read the treatment combination off the charts. These decision matrices are converted to models here. Additional data from OMAFRA, in combination with Mundi et al. [25], were used to develop Power-Rank models for characterizing water quality to fill data gaps and develop the models presented herein.

A total of 245 unique samples were contained in the master dataset; 223 contained data on bench-scale treatments, while the other 22 related to full-scale treatments. The bench-scale treatments consisted of screening (S), hydrocyclone (HC), settling by the jar test method with an optimized coagulation and flocculation process (C&F), dissolved air flotation using an optimized coagulant dosage (DAF), centrifuge (C), and electrocoagulation (EC&F). Full-scale treatments sampled included single tank settling (SET1), settling followed by open grass (SET1G), three settling tanks in series (SET3), pond (POND), sequential batch reactor with four stages: settling, aeration, nutrient removal, settling (SETBIO4), and membrane bioreactor with reverse osmosis and UV disinfection (MBR+RO+UV). The samples were collected at random from each facility under normal operating conditions. The samples collected consisted of (1) raw wash-water, the wastewater produced by the washing and/or processing operations, and (2) post-onsite treatment, the effluent from the treatment employed by the facility. The raw wash-water and the post-onsite treatment samples were analyzed for a suite of water quality tests.

In addition, the raw wash-waters were also treated with six bench-scale treatments. These include the following treatments: settling (jar test control: 1 min rapid mix at 100 rpm, 10 min slow mix at 30 rpm, and 20 min settling), settling with a coagulation and flocculation process (jar test with coagulants varying in dosage from 5 to 400 mg/L, 1 min rapid mix at 100 rpm, 10 min slow mix at 30 rpm, and 20 min settling), screening (100 µm sieve), centrifugation (1801 × g for 3 min), dissolved air flotation (10 min retention of recycling water at a rate of 50% and 10 min detention for flotation) using an optimized coagulant dosage as determined earlier using the jar test, hydrocyclone (4 mm apex, 6.7 mm vortex finder, 48 mm diameter, and 1.3 L/min of flow), and electrocoagulation (maximum of 575.1 cm² surface area, minimum reaction time of 10 min, and maximum of 0.27 kWh/m³ power consumption). A detailed description of the bench-scale treatments and their methodology, in addition to the full-scale treatments, is given in Mundi et al. [24].

Water quality parameters were tested using standard methods as listed in Standard Methods for the Examination of Water and Wastewater, 22nd Edition [26]. TS and TSS were measured using the evaporation–mass balance Method 2540 B and Method 2540 D, respectively. BOD was measured using Standard Method 5210. Additional water quality testing was conducted using Hach instrumentation and water quality testing kits: COD (Dichromate Method, TNT821,2: Method 8000), TN (Persulfate Digestion, TNT826,7,8: Method 10208), TP (Ascorbic Acid, TNT843,4,5: Method 10209 and 10210), and ammonia as N (Salicylate, TNT830,1,2: Method 10205). The Digital Reactor Block from HACHCO (DRB200-02) and the Ultraviolet-Visible Spectrophotometer from HACHCO (DR5000-03) were used to complete the previously listed tests.

2.2. Multiple Linear Regression

The compiled master dataset was manually inspected, and missing data were predicted using the relationships between similar water quality parameters developed in Mundi et al. [25]. Dataset statistics and MLR modelling were completed using R software (RStudio, Version 1.0.153). A variable selection technique can reduce the number of input variables needed for developing models. This is done to remove input variables that have no significant relationship with the output variable, which reduces the computational complexity and improves predictions [8]. Using too many input variables can lead to an overfitted model and makes it less practical, as the collection and analysis of additional variables can be costly. Variable selection in MLR models was achieved through Pearson's correlation matrix, retaining statistically significant (p-value < 0.01) correlations.
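The screening step described above can be sketched in a few lines. This is only an illustrative sketch, not the R workflow used in the study: the function names, the sample values, and the use of the Fisher z-transform to approximate the p-value are all assumptions made here so that the example runs with the Python standard library alone.

```python
import math
from statistics import NormalDist, mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value via the Fisher z-transform (large-sample approximation)."""
    if abs(r) >= 1.0:
        return 0.0
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def select_inputs(candidates, target, alpha=0.01):
    """Keep the candidate inputs whose correlation with the target has p < alpha."""
    selected = []
    for name, values in candidates.items():
        r = pearson_r(values, target)
        if approx_p_value(r, len(target)) < alpha:
            selected.append(name)
    return selected


# Hypothetical numbers for illustration only (not data from the study).
bod_treated = [12.0, 30.0, 45.0, 80.0, 95.0, 130.0, 150.0, 210.0, 240.0, 300.0]
candidates = {
    "BOD_Raw": [20.0, 55.0, 70.0, 120.0, 160.0, 200.0, 260.0, 330.0, 400.0, 480.0],
    "pH_Raw": [7.1, 6.8, 7.4, 6.9, 7.3, 7.0, 6.7, 7.2, 7.5, 6.6],
}
chosen = select_inputs(candidates, bod_treated)
print(chosen)
```

With so few samples the Fisher approximation is rough; an exact t-distribution test, as statistical packages typically provide, would be preferable for real data.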

Before modeling the data, the inputs and outputs were normalized to a 0 to 1 scale using the linear scaling method in Equation (1) [27]. Equation (1) was used for numerical variables, such as BOD; however, the normalization of categorical variables such as process and treatment was different. The process and treatment were scaled using a ranking number, similar to Mundi et al. [25], obtained by ranking the average treatment reduction efficiency with respect to treatment parameters, such as BOD [25]. Normalizing was done to prevent the magnitude of each parameter from potentially influencing the weights assigned in model development, as the dataset includes several different types of measurements.

z_{i} = \frac{x_{i}}{x_{\max} + 1}

(1)

where z_{i} is the normalized value computed for input i, x_{i} is the actual value of input i, and x_{\max} is the maximum of all input values for a given parameter.

MLR is a statistical technique used to model the relationship between two or more explanatory variables (independent) and a response variable (dependent) by fitting a linear equation to the observed data. The MLR model can be defined as:

y_{i} = β_{0} + β_{1} x_{i,1} + β_{2} x_{i,2} + \dots + β_{k} x_{i,k} + ε_{i}

(2)

where y_{i} is the dependent variable, β_{0} is a constant or intercept, x_{i,k} is an independent variable, β_{k} is the regression coefficient or slope, and ε_{i} is the random error. In the present study, R (RStudio, Version 1.0.153) was used to fit the MLR models.
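As a rough illustration of Equations (1) and (2), the sketch below scales two inputs and fits the regression coefficients by ordinary least squares through the normal equations. The study used R for this step; the function names, the small Gauss-Jordan solver, and the numbers are assumptions introduced here purely for illustration.

```python
def scale(values):
    """Equation (1): z_i = x_i / (x_max + 1)."""
    x_max = max(values)
    return [x / (x_max + 1.0) for x in values]


def solve(a, b):
    """Solve the small linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_mlr(columns, y):
    """Least-squares estimate of [b0, b1, ..., bk] in Equation (2) via X'X b = X'y."""
    rows = [[1.0] + list(r) for r in zip(*columns)]  # leading 1.0 carries the intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(beta, inputs):
    """Evaluate the fitted linear model for one set of (scaled) inputs."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], inputs))


# Hypothetical raw values (mg/L) for illustration only.
z1 = scale([20.0, 55.0, 70.0, 120.0, 160.0, 200.0])
z2 = scale([300.0, 150.0, 900.0, 450.0, 1200.0, 600.0])
# Synthetic target built from known coefficients so the fit can be checked.
y = [0.1 + 0.5 * a - 0.2 * b for a, b in zip(z1, z2)]
beta = fit_mlr([z1, z2], y)
print(beta)
```

Because the synthetic target is an exact linear function of the scaled inputs, the fitted coefficients should come back close to 0.1, 0.5, and -0.2; real wash-water data would of course leave a residual.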

2.3. Generalized Structure of Group Method of Data Handling (GSGMDH)

The water quality of fruit and vegetable wastewater treatment systems is a complex multivariable system that is not easily modelled with theoretical or analytical variable-based models. The group method of data handling (GMDH) can be employed effectively in real-world cases where there is no theoretical knowledge of the association between the input variables and the outcome [18,19]. The actual output (y) for the input vector (x₁, x₂, x₃, …, x_n), given M observations (i = 1, 2, …, M), can be defined as follows:

y_{i} = f(x_{i,1}, x_{i,2}, x_{i,3}, \dots, x_{i,n})

(3)

GMDH can be trained to estimate the outcome value (\hat{y}_{i}) for given input variables as follows:

\hat{y}_{i} = \hat{f}(x_{i,1}, x_{i,2}, x_{i,3}, \dots, x_{i,n})

(4)

The objective is to minimize the squared difference between the actual outcome (y_{i}) and the estimate (\hat{y}_{i}), as follows:

\min \sum_{i=1}^{M} \left( \hat{f}(x_{i,1}, x_{i,2}, x_{i,3}, \dots, x_{i,n}) - f(x_{i,1}, x_{i,2}, x_{i,3}, \dots, x_{i,n}) \right)^{2}

(5)

The basis of the GMDH algorithm is the process of constructing a high-order polynomial known as the Volterra functional series, as follows:

y = a_{o} + \sum_{i = 1}^{m} a_{i} x_{i} + \sum_{i = 1}^{m} \sum_{j = 1}^{m} a_{i j} x_{i} x_{j} + \sum_{i = 1}^{m} \sum_{j = 1}^{m} \sum_{k = 1}^{m} a_{i j k} x_{i} x_{j} x_{k} + \dots

(6)

Hence, in the GMDH algorithm, the Volterra functional series is decomposed into quadratic binomial polynomials. This mathematical description can be simplified to quadratic polynomials consisting of only two variables (neurons), in the form of:

\hat{y}_{i} = \hat{f}(x_{i}, x_{j}) = a_{o} + a_{1} x_{i} + a_{2} x_{j} + a_{3} x_{i}^{2} + a_{4} x_{j}^{2} + a_{5} x_{i} x_{j}

(7)

In the GMDH algorithm, a regression polynomial is obtained according to Equation (7) for every possible pair of independent variables, and the best fit to the M observed values is found by applying the objective function presented in Equation (5).
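The pairwise fitting step can be illustrated with a minimal sketch of a single GMDH layer: one quadratic neuron of the form of Equation (7) is fitted by least squares for every pair of inputs, and the pair with the smallest squared error (Equation (5)) is kept. This is not the authors' MATLAB code; the function names, the solver, and the synthetic data are assumptions for illustration, and a full GMDH would also split the data for validation and stack several layers.

```python
from itertools import combinations


def quad_features(xi, xj):
    """Terms of Equation (7): 1, x_i, x_j, x_i^2, x_j^2, x_i * x_j."""
    return [1.0, xi, xj, xi * xi, xj * xj, xi * xj]


def solve(a, b):
    """Solve the small linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_neuron(xi, xj, y):
    """Least-squares fit of one quadratic neuron; returns (coefficients, sum of squared errors)."""
    rows = [quad_features(a, b) for a, b in zip(xi, xj)]
    k = len(rows[0])
    xtx = [[sum(r[p] * r[q] for r in rows) for q in range(k)] for p in range(k)]
    xty = [sum(r[p] * t for r, t in zip(rows, y)) for p in range(k)]
    coef = solve(xtx, xty)
    sse = sum((sum(c * f for c, f in zip(coef, r)) - t) ** 2 for r, t in zip(rows, y))
    return coef, sse


def best_pair(inputs, y):
    """Fit a neuron for every pair of inputs and return the pair with the lowest error."""
    results = {}
    for (ni, xi), (nj, xj) in combinations(inputs.items(), 2):
        results[(ni, nj)] = fit_neuron(xi, xj, y)
    return min(results.items(), key=lambda kv: kv[1][1])


# Hypothetical scaled inputs; the target depends only on x1 and x3.
inputs = {
    "x1": [0.1, 0.4, 0.7, 0.2, 0.9, 0.5, 0.3, 0.8, 0.6, 1.0],
    "x2": [0.9, 0.1, 0.5, 0.7, 0.3, 0.8, 0.2, 0.6, 1.0, 0.4],
    "x3": [0.5, 0.9, 0.2, 0.8, 0.1, 0.3, 1.0, 0.4, 0.7, 0.6],
}
y = [1.0 + 2.0 * a * c + c * c for a, c in zip(inputs["x1"], inputs["x3"])]
best = best_pair(inputs, y)
print(best[0])
```

Since the synthetic target is exactly quadratic in x1 and x3, that pair should be selected with a near-zero error, while the other two pairs leave a visible residual.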

Although conventional GMDH has a high capability for modelling nonlinear problems, the following points have an important impact on the results obtained:

The second-order polynomial structure (Equation (7)) has only two input neurons.
The input neurons in each layer are selected only from adjacent layers.

Because of these two limitations, the use of second-order polynomials may not yield acceptable outcomes in GMDH for problems of high complexity. Furthermore, considering only two inputs for each neuron leads to an increase in the number of neurons needed to reach an adequate model. The use of adjacent neuron layers increases the number of polynomials produced. Therefore, these issues have a significant impact on the accuracy and simplicity of the proposed models. Thus, in this study, a generalized structure of GMDH (GSGMDH) is presented [17,18]. We coded the GSGMDH model in the MATLAB software environment. The proposed model modifies the conventional structure of the GMDH so that it simultaneously investigates all possible ways of achieving the best and simplest model available by using polynomials of order 2 and 3 as well as two and three input neurons. Finally, it selects the best model using the corrected Akaike information criterion (AIC) computed for each candidate, as follows:

AIC = n \times \log \left[ \sum_{i=1}^{n} (\hat{y}_{i} - y_{i})^{2} + 2D + \frac{2D(2D + 1)}{N - D - 1} \right]

(8)

where n is the number of samples, \hat{y}_{i} and y_{i} are the predicted and observed values, and D is the number of tuned variables through modeling with GMDH.

Four situations can occur: (1) second-order polynomials with two inputs, (2) second-order polynomials with three inputs, (3) third-order polynomials with two inputs, and (4) third-order polynomials with three inputs. Among these four cases, the first is precisely the relationship used in conventional GMDH (Equation (7)). Therefore, the general form of the polynomial defined in this study is as follows:

\begin{array}{l} \hat{y} = a_{o} + a_{1} x_{k} + a_{2} x_{q} + a_{3} x_{p} + a_{4} x_{q} x_{k} + a_{5} x_{p} x_{k} + a_{6} x_{p} x_{q} + a_{7} x_{k}^{2} + a_{8} x_{q}^{2} \\ \quad + a_{9} x_{p}^{2} + a_{10} x_{p} x_{q} x_{k} + a_{11} x_{q} x_{k}^{2} + a_{12} x_{q}^{2} x_{k} + a_{13} x_{p} x_{k}^{2} \\ \quad + a_{14} x_{p} x_{q}^{2} + a_{15} x_{p}^{2} x_{k} + a_{16} x_{p}^{2} x_{q} + a_{17} x_{k}^{3} + a_{18} x_{q}^{3} + a_{19} x_{p}^{3} \end{array}

(9)

where k, p, q ∈ {1, 2, 3, …, n}.
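To make the four candidate structures concrete, the following sketch enumerates the monomials of a full polynomial for two or three inputs and order 2 or 3, and evaluates such a polynomial for given coefficients. The helper names are assumptions made here, and the terms are generated in a different order than in Equation (9), so the coefficient indices do not map one-to-one onto a_{o} to a_{19}.

```python
from itertools import combinations_with_replacement


def monomials(n_inputs, order):
    """Exponent tuples of all monomials with total degree <= order in n_inputs variables."""
    terms = []
    for degree in range(order + 1):
        for combo in combinations_with_replacement(range(n_inputs), degree):
            terms.append(tuple(combo.count(v) for v in range(n_inputs)))
    return terms


def evaluate(coefs, terms, x):
    """Evaluate sum_k coefs[k] * prod_v x[v] ** terms[k][v]."""
    total = 0.0
    for c, exps in zip(coefs, terms):
        value = c
        for xv, e in zip(x, exps):
            value *= xv ** e
        total += value
    return total


# Number of coefficients for each of the four structures, keyed by (inputs, order).
structures = {(n, order): len(monomials(n, order)) for order in (2, 3) for n in (2, 3)}
for key, count in structures.items():
    print(key, count)
```

The two-input, second-order case has six terms, matching Equation (7), and the three-input, third-order case has twenty terms, matching the twenty coefficients of Equation (9). The growth in the number of tuned coefficients is what makes a complexity penalty such as the AIC relevant when choosing among the structures.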

The flowchart of the GSGMDH method is presented in Figure 1. As shown in this figure, after definition of the initial value for GSGMDH model parameters, all possible neurons will be created. In this operation, in order to select the final model, two criteria were checked. The accuracy of the obtained results was verified (1) with RMSE to keep the maximum allowable neurons and remove other neurons and (2) with AIC to examine the simplicity of the designed architecture.

2.4. Model Performance Evaluation

A variety of statistical measures were utilized to understand the unique properties of the model performance. The coefficient of correlation (R), shown in Equation (10), was used to understand the amount of observed variance explained by the models. The root mean square error (RMSE) and mean absolute percent error (MAPE) were used to understand model accuracy and precision, as shown in Equations (11) and (12). RMSE shows the differences between the observed and predicted values in the units of the variable under study. In Equations (10)–(12), O and P are the observed and model-predicted values, respectively, and n is the number of observations.

R = \frac{\sum_{i=1}^{n} (O_{i} - \bar{O})(P_{i} - \bar{P})}{\sqrt{\sum_{i=1}^{n} (O_{i} - \bar{O})^{2} \sum_{i=1}^{n} (P_{i} - \bar{P})^{2}}} \times 100

(10)

RMSE = \sqrt{\frac{\sum_{i=1}^{n} (O_{i} - P_{i})^{2}}{n}}

(11)

MAPE = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{O_{i} - P_{i}}{O_{i}} \right|

(12)
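Equations (10) to (12) translate directly into code. The sketch below is a plain standard-library version with hypothetical function names and made-up observed and predicted values; note that MAPE is undefined when any observed value is zero.

```python
import math


def corr_percent(obs, pred):
    """Equation (10): Pearson correlation between observed and predicted values, times 100."""
    mo = sum(obs) / len(obs)
    mp = sum(pred) / len(pred)
    num = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    den = math.sqrt(sum((o - mo) ** 2 for o in obs) * sum((p - mp) ** 2 for p in pred))
    return 100.0 * num / den


def rmse(obs, pred):
    """Equation (11): root mean square error, in the units of the variable."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mape(obs, pred):
    """Equation (12): mean absolute percent error; requires all observed values to be non-zero."""
    return 100.0 / len(obs) * sum(abs((o - p) / o) for o, p in zip(obs, pred))


# Hypothetical observed and predicted concentrations (mg/L) for illustration only.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 36.0]
print(corr_percent(observed, predicted), rmse(observed, predicted), mape(observed, predicted))
```

Because MAPE divides by the observed value, a small absolute error on a low concentration can dominate the average, which is one reason it can move in the opposite direction to RMSE for the same model.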

3. Results and Discussion

In this study, for the first time, the results of wastewater treatment feasibility from bench-scale testing and full-scale treatment sampling of fruit and vegetable washing and processing facilities were analyzed to develop predictive models that estimate water quality parameters of the treated wastewater. The study also used facility operating variables as input variables, such as the type of process (washing versus processing). The optimal subsets of input variables for the different models were selected using correlation analysis and statistical significance tests. The MLR and GSGMDH modelling techniques did not consider any deterministic models or chemical or biological knowledge about the treatment processes, and were based on mathematical grounds only. MLR and GSGMDH have been used before; however, this study used these modelling techniques for wastewater and wash-water treatment prediction for a wide range of fruits and vegetables. Previous studies by Chen et al. [28], Gil et al. [2], Kern et al. [5], and Mjalli et al. [29] focused only on wash-water or wastewater from a single fruit or vegetable. This study provides insight into three novel aspects: a new sector with respect to industrial/agricultural wastewaters, the use of bench-scale testing and full-scale treatment data, and the large variety of fruit and vegetable wash-waters, including the utilization of process type (W versus WP).

The first step was to transform the categorical treatment and process variables to numerical values, using ranking analysis as described in the methodology. The aggregated ranks corresponding to treatment and process type (W or WP) are noted in Table 1. The ranks, from best (1) to worst (0), indicate how different treatments impact the removal of the studied water quality parameters. Ranks greater than 0.75 indicate the most effective treatments, such as C&F, EC&F, and DAF for bench scale and MBR+RO+UV, POND, and SET1G for full scale. Ranks lower than 0.25 show the least effective treatments, while the remainder of the range, 0.26–0.74, indicates moderate treatment effectiveness. TSS removal was least effective for S and HC treatment compared to C&F and settling at full scale, which utilize chemicals and long settling times, respectively. Similar conclusions can be drawn from Table 1 for other water quality parameters under the different treatments studied. Some key observations show that MBR was the best overall; this was no surprise, as MBR treatments produce the highest quality waters regardless of the type of wash-water or process. However, MBR is energy-intensive, requires long start-up times, and is very sensitive to changes in wastewater feed. EC&F and C&F show good reduction across many water quality parameters. Full-scale treatments with pond, settling with grasslands, and settling with three tanks in series (POND, SET1G, and SET3) were capable of reducing solids effectively under the right conditions (flow, concentrations, and settling time).
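The rank bands described above can be expressed as a small helper. The function name and the example ranks are hypothetical; the text does not say how ranks of exactly 0.25 or 0.75 are classified, so treating them as moderate is an assumption made here.

```python
def effectiveness_band(rank):
    """Map a treatment rank in [0, 1] to the effectiveness band described in the text."""
    if not 0.0 <= rank <= 1.0:
        raise ValueError("rank must lie between 0 (worst) and 1 (best)")
    if rank > 0.75:
        return "most effective"
    if rank < 0.25:
        return "least effective"
    # Assumption: the boundary values 0.25 and 0.75 are counted as moderate.
    return "moderate"


# Hypothetical ranks for illustration only (not values from Table 1).
example_ranks = {"treatment_a": 0.90, "treatment_b": 0.50, "treatment_c": 0.10}
bands = {name: effectiveness_band(rank) for name, rank in example_ranks.items()}
print(bands)
```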

There are some differences between W and WP within the same water quality parameter and treatment; for example, C&F for TP_Treated shows good effectiveness for WP but is less effective for W type processes. Some process types and water quality parameters were not part of the study, such as the W type wash-waters for MBR. The missing water quality parameters for S and HC treatments were not collected since these treatments are least likely to impact TP, TN, COD, BOD, and NH₄-N, as shown in the literature. The ranks in Table 1 require substitution into the MLR equations (Table 2) for estimating treated water quality parameters. More information on the MLR equations is provided later.

After all data were converted to numerical values, Pearson's correlation analysis was applied, and the statistically significant (p-value < 0.01) correlations are presented in Table 3. Correlation analysis is a valuable tool in identifying correlations between water quality parameters, process, treatments, and dependent variables [29]. More importantly, it shows which input parameters have a good relationship with the dependent variable. Tomperi et al. [7] also used linear correlation analysis to identify variables that impact each other. As in that work, BOD and COD had the highest correlation, while other parameters did not show similar results. This is due to the wide variety of wash-waters/wastewaters explored in this study compared to the single source, a pulp and paper mill, studied by Tomperi et al. [7]. Treatment was relevant for most water quality parameters; however, for BOD_Treated and COD_Treated, the process was more relevant, suggesting that the process has greater control over the treated water quality than the treatment. As expected, many core raw water quality parameters were relevant; for example, BOD_Raw is the core parameter for the BOD_Treated model. The correlations were weak for the TSS_Treated model, as the highest correlation observed was 0.32 with TS_Raw and the lowest was 0.08 with process type, which is similar to the relationships between water quality parameters shown in Mjalli et al. [29] and Tomperi et al. [7]. These correlations were used as the basis for selecting the input parameters for the studied models. The authors selected variables with a significant relationship for use in MLR and GSGMDH as input water quality parameters.

The ranges of the wash-water quality and treated water quality parameters are shown in Table 4, which gives the minimum, maximum, and average values of the inputs and outputs used for the different models. The parameters studied and their values are in line with those studied by Lehto et al. [1] and Gil et al. [2]; however, some of the wash-waters studied had excessive levels of TSS due to soils from root vegetables. Disposal of wastewater and wash-water requires meeting the listed regulatory standards, also shown in Table 4.

Table 4 shows that the ranges of the water quality parameters are highly variable, as many different types of wash-waters were studied. The maximum value for TSS_Raw stands out as very high, even compared to the maximum TS_Raw value. This is because the TS_Raw value corresponding to this high TSS_Raw value was not available for some samples and was instead predicted following Mundi et al. [25]. Some of the past data from other datasets covered only limited or different water quality parameters, and this was a major challenge, as some data points had to be excluded for some models. As a result, some models have lower sample numbers compared to others, as seen in Table 5.

The key findings of the study, that is, the ability to predict treated water quality based on treatment, process, and raw wash-water quality parameters, are highlighted in Table 5. The table shows the number of samples used to construct the models, where training used 70% and test/validation used 30% of the total samples, and the selected input parameters, along with the quality and performance parameters of the two modeling methods: the coefficient of determination (R, %), the root mean square error (RMSE, mg/L), and the mean absolute percent error (MAPE, %).

A total of six water quality parameters were analyzed for MLR and GSGMDH, as shown in Table 5. The greatest improvement was made for the TSS_Treated model, where GSGMDH significantly improved the prediction, from an R² of 30% in MLR to 90%, while BOD_Treated was only improved by 16% using the GSGMDH method. The R² values for the TSS models were much lower than those of Mjalli et al. [29], who reported values as high as 85%. The R² and RMSE improved for all six water quality parameters, while the MAPE did not show a consistent trend when going from the MLR to the GSGMDH modeling method. The use of both MAPE and RMSE was useful in providing a better understanding of the accuracy of predicted values and the sensitivity to outliers, which play an important role in overfitting or underfitting models.

Overfitting happens when a model learns the details and noise in the training data to an extent that negatively impacts its performance and its ability to generalize. Overfitting usually occurs with nonparametric and nonlinear models such as GSGMDH. In this case, the dataset contained a combination of both detail and noise, as some treatments had few experiments or single samples for full-scale treatment, and the data were highly variable given the different wash-water types and processes. Underfitting can also occur when the data do not show strong correlations or do not fit linear functions well.

Some underfitting is evident in the MLR models, as seen for TSS_Treated, where the R² is very low at 30%. The RMSE and MAPE show the magnitude and percentage of accuracy for each type of model. For the MLR TSS_Treated model, the RMSE and MAPE were higher and lower, respectively, when compared to GSGMDH. A reason for this is that the model was based on highly variable datasets, and as such, the accuracy of the MLR TSS_Treated model is quite low. With GSGMDH, the prediction improved significantly (R² of 86%), but the MAPE increased. The increase in the MAPE is attributed to extreme predicted values that deviate largely from the actual values.

The TP_Treated and TN_Treated models also improved, as indicated by the change in R² from MLR to GSGMDH, by approximately 14 and 38 percentage points, respectively. The RMSE for all TP_Treated and TN_Treated models was in the range of 3.49 to 8.99 mg/L, which is reasonably good given the variability in the wash-water data points. The MAPE ranges from 54 to 457% for all models (Train and Test) of TP_Treated and TN_Treated, reflecting low model accuracy, which is attributed to the low number of samples and the highly variable dataset. However, it is interesting to note that the MLR models had a lower MAPE compared to the GSGMDH models. This is attributed to the fact that linear models can better absorb outliers and extreme predicted values.

The BOD_Treated and COD_Treated models showed the best fit and improvement with the use of GSGMDH methods, with R² increasing by 16% and 31%, respectively, in addition to improvements in RMSE and MAPE. Measured and predicted values of treated BOD for the MLR and GSGMDH models are shown in Figure 2. This figure shows that the MLR predicted values are widely scattered, while the GSGMDH model's predicted values are much closer to the actual or measured values. This is due to the higher number of input parameters used for the GSGMDH models, in addition to its advanced capabilities in modelling non-linear functions. The overall trends show that GSGMDH methods can provide better prediction than MLR methods, as shown by the improved R² for some of the developed models.

However, a fine balance is required between the number of input variables selected for models and the risk of overfitting. Despite the diversity and breadth of the samples and experiments studied, the models were still able to capture and show generalized solutions. This was due to the ability of GSGMDH to capture nonlinear relationships better than MLR methods. However, the large RMSE and MAPE observed for some models can make it challenging to assess the predicted effluent water quality parameters against regulations for effluent release, whose limits are typically very low, as noted in Table 4. In general, both MLR and GSGMDH methods are feasible for determining the treated water quality of wash-water/wastewater from the fruit and vegetable industry, as highlighted here. MLR is a useful modeling method because its coefficients provide insight into the impact of the input parameters, while GSGMDH provides greater accuracy; however, GSGMDH can exaggerate outliers and incorrectly predicted values. A high degree of variability exists, which is expected due to the non-linearity of the input variables. Another major challenge is that GSGMDH models often require large datasets, which was not the case for this study, where many different wash-waters were studied under many different treatment scenarios. Modelling is best suited for a single treatment with a large dataset. This study attempted to formulate a universal model, and as observed in this paper, this can be very challenging due to the high level of error in some models, as shown by MAPE and RMSE. However, the models can be improved by collecting additional data.

The MLR equations are presented in Table 2. All equations have a positive intercept followed by a negative treatment coefficient. The process coefficient is positive for most models, while the remaining coefficients take both positive and negative values. The process coefficients are negative for some models because the ranking of the W and WP processes differs between models. The worst conditions were represented by the lower rank of 0.5, which corresponded to the WP process for most water quality parameters and to the W process for the others (see Table 1). The parameters TN_Raw, TDS_Raw, and NH4-N_Raw were most prevalent in the developed models. The magnitude and sign of each coefficient determine the level of effect on the predicted parameter. The GSGMDH equations are presented in Table 6. The structures of the GSGMDH models for the studied parameters TSS, BOD, COD, TP, and TN are shown in Figure 3a–e, respectively.

These MLR models/equations provide a simple, convenient, and practical way to determine the treatment effectiveness for particular wash-waters, and can be utilized by producers/growers to assess potential treatments to manage and reuse wash-waters. For example, BOD_Treated can be determined by first selecting Equation (14) in Table 2, then obtaining the applicable treatment and process values from Table 1, which are multiplied by their corresponding coefficients. The TDS_Raw, BOD_Raw, and TN_Raw values of the wash-water/wastewater to be predicted are then entered into the BOD_Treated model, Equation (14) in Table 2. The calculated result can be used along with the corresponding RMSE and MAPE to assess treatment reduction/effectiveness.

To illustrate the use of the BOD_Treated model, wash-water with TDS of 475 mg/L, COD_Raw of 165 mg/L, a process type of W, and undergoing EC&F treatment would have a BOD_Treated of 60 mg/L. First, the numerical parameters are normalized; for TDS_Raw, 475 is divided by the maximum value of TDS_Raw (from Table 4) plus one, 475/(8740 + 1), which equals 0.0543. Similarly, COD_Raw is normalized to 0.0133. The next step is to look up the process value for the W type process from Table 1 for the BOD_Treated model, which is 1, as well as the value for the EC&F treatment, which is 0.56. These values are then substituted into Equation (14) from Table 2 to calculate BOD_Treated. In some cases, the calculated value could be negative, given the linear relationship, and should be converted to 1 for simplicity. The equations of MLR and GSGMDH can be summarized and integrated into an easy-to-use Microsoft Excel worksheet tool. The user can input the raw wash-water values and obtain a comparative analysis of treated effluent for all treatments at once. The predicted values for each treatment are then compared with regulated effluent limits to understand which treatment is most capable of meeting the standards. The worksheet also incorporates a cost analysis for the different treatments studied to provide cost/benefit analysis.
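The normalization and rank look-up steps of this example can be sketched in Python as follows. The maxima are copied from Table 4 and the ranks from the BOD_Treated columns of Table 1; the function and dictionary names are illustrative assumptions and are not part of the authors' worksheet.

```python
# Maximum raw wash-water values (mg/L) from Table 4.
RAW_MAX = {"TDS_Raw": 8740.0, "COD_Raw": 12400.0, "BOD_Raw": 3760.0, "TN_Raw": 101.0}

# Ranks for the BOD_Treated model from Table 1 (only the entries used here).
PROCESS_RANK_BOD = {"W": 1.0, "WP": 0.5}
TREATMENT_RANK_BOD = {("EC&F", "W"): 0.56, ("EC&F", "WP"): 0.75}

def normalize(value, parameter):
    """Scale a raw concentration as value / (max + 1), as described in the text."""
    return value / (RAW_MAX[parameter] + 1.0)

tds_norm = normalize(475.0, "TDS_Raw")
cod_norm = normalize(165.0, "COD_Raw")
process = PROCESS_RANK_BOD["W"]
treatment = TREATMENT_RANK_BOD[("EC&F", "W")]

print(f"TDS_Raw normalized: {tds_norm:.4f}")
print(f"COD_Raw normalized: {cod_norm:.4f}")
print(f"Process rank: {process}, treatment rank: {treatment}")
```

Adding one to the maximum keeps every normalized input strictly below 1, including the sample that defines the maximum.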

The developed models, which complement the treatment decision matrices/tables produced in Mundi et al. [24], show the level of treatment expected from various wash-waters/wastewaters. The decision matrices/tables provided a range for treatment effectiveness, while these models extend that analysis to numerical models for more flexible treatment predictions. Some limitations exist, such as over- and underestimation of treated water quality parameters due to under- and overfitting of the models. However, the prediction of treated water quality levels is still valuable, and the methods demonstrated herein can be implemented by facilities for continuous monitoring and treatment selection. Many predictive tools utilize black-box ANN models, which are cumbersome to use, require software, and cannot easily produce equations, unlike the MLR and GSGMDH models shown in this study. This information is valuable for farmers, governments, engineers, consultants, and other stakeholders in determining wash-water treatment and sizing, understanding treatments capable of meeting regulatory standards, and estimating treatment costs from predicted water quality parameters. The study was successful in capturing a wide variety of information and consolidating it into a useful tool, which previously did not exist, for making an important decision on the treatment of wash-waters/wastewaters.

4. Conclusions

The study highlighted the use of MLR and GSGMDH techniques to model fruit and vegetable washing and processing wastewater treatments in predicting treated water quality parameters. These universal models successfully captured data from many different commodities representative of post-harvest processing, including the fresh-cut industry. Modelling a wastewater treatment process is difficult due to the high non-linearity of the treatments studied, the non-uniformity and variability of wash-waters/wastewaters, and the nature of the chemical/biological reactions occurring in treatments. The MLR approach was assessed to understand any linear relationships, and the GSGMDH models to understand the interdependency between outputs and inputs involving nonlinear relationships. The MLR model utilizes simple linear equations to describe relationships, which can be very useful in predicting treatment levels as shown above. GSGMDH models are non-linear and hence have more complicated equations than the linear models, but they provide a more accurate prediction. For example, different stakeholders (design engineers, operators, suppliers, government, local fruit and vegetable organizations/clubs) can utilize the above-developed models to determine which physico-chemical treatment can be utilized to treat various wash-water/wastewater types, and the corresponding approximate effluent water quality levels expected, without having to collect any data, spend money on testing water samples, or perform intensive studies before considering implementation of the technology.

The models developed for estimating treated parameters from bench-scale and full-scale treatments include the TSS_Treated, BOD_Treated, COD_Treated, TN_Treated, TP_Treated, and NH4-N_Treated water quality parameters. The derived models performed well as indicated by the R² values (30–83% for MLR and 73–99% for GSGMDH). The RMSE performance parameter indicates the expected error boundaries, which are within an acceptable range. Variability is inherent to the developed models, as the datasets were a combination of lab and industry-wide samples; the models can be improved by increasing the sample size for various sets, such as for full-scale treatments. A combination of data types, number of samples, the variability of the different wash-water types, and noise within the data influences the quality of the models. The balance between the number of input variables selected for modelling and the risk of over- or underfitting is critical.

Previous research has not considered the holistic approach shown in this study for understanding different fruit and vegetable wastewater treatments. As such, for the first time, this study considered 14 different facilities (apple, carrot, potato, ginseng, and others) and 13 different wash-water/wastewater treatments for treatment effectiveness in terms of the TSS, TN, TP, NH4-N, COD, and BOD water quality parameters. This in itself is novel, as the study has successfully turned a complex problem into a workable solution, combining multiple factors while maintaining the depth and breadth of the study topic. MLR models were a useful tool for predictions, as they provided insight into the impact of input parameters through their coefficients and showed consistent error over all treatments. GSGMDH was more flexible and better captured the results but required greater caution when predicting values. Another novelty of this study is that it provides tools to estimate the treatment and corresponding water quality for major water quality parameters (COD, BOD, TSS, etc.). No tool existed in the literature of the fruit and vegetable washing and processing sectors that could easily predict the water quality of treated waters. These novel MLR and GSGMDH models for predicting wash-water treatment feasibility cover a long list of treatments, which was not previously available or studied.

Author Contributions

Conceptualization, G.M., R.G.Z., K.W., H.B. and B.G.; Investigation and Methodology, G.M., H.B. and B.G.; Formal Analysis and Validation, G.M., H.B. and B.G.; Visualization, G.M., R.G.Z., K.W. and H.B.; Project administration and funding acquisition, R.G.Z.; Writing, review and editing, G.M., R.G.Z., K.W., H.B. and B.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Ontario Ministry of Agriculture and Rural Affairs (OMAFRA), grant number 052111.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data used in developing the various models come from a variety of sources and, due to confidentiality requested by some producers because of competitive interests, are not available.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Lehto, M.; Sipilä, I.; Alakukku, L.; Kymäläinen, H.R. Water consumption and wastewaters in fresh-cut vegetable production. Agric. Food Sci. 2014, 23, 246–256.
2. Gil, M.I.; Selma, M.V.; López-Gálvez, F.; Allende, A. Fresh-cut product sanitation and wash water disinfection: Problems and solutions. Int. J. Food Microbiol. 2009, 134, 37–45.
3. Rudra, R.P.; Gharabaghi, B.; Sebti, S.; Gupta, N.; Moharir, A. GDVFS: A new toolkit for analysis and design of vegetative filter strips using VFSMOD. Water Qual. Res. J. 2010, 45, 59–68.
4. Mundi, G.S.; Zytner, R.G. Effective Solid Removal Technologies for Wash-Water Treatment to Allow Water Reuse in the Fresh-Cut Fruit and Vegetable Industry. J. Agric. Sci. Technol. 2015, 5, 396–407.
5. Kern, J.; Reimann, W.; Schlüter, O. Treatment of recycled carrot washing water. Environ. Technol. 2006, 27, 459–466.
6. Halliwell, D.J.; Barlow, K.M.; Nash, D.M. A review of the effects of wastewater sodium on soil physical properties and their implications for irrigation systems. Soil Res. 2001, 39, 1259–1267.
7. Tomperi, J.; Koivuranta, E.; Leiviskä, K. Predicting the effluent quality of an industrial wastewater treatment plant by way of optical monitoring. J. Water Process. Eng. 2017, 16, 283–289.
8. Trenouth, W.R.; Gharabaghi, B. Event-based soil loss models for construction sites. J. Hydrol. 2015, 524, 780–788.
9. Gharabaghi, B.; Sattar, A. Empirical models for longitudinal dispersion coefficient in natural streams. J. Hydrol. 2019, 575, 1359–1361.
10. Díaz-Madroñero, M.; Pérez-Sánchez, M.; Satorre-Aznar, J.R.; Mula, J.; López-Jiménez, P.A. Analysis of a wastewater treatment plant using fuzzy goal programming as a management tool: A case study. J. Clean. Prod. 2018, 180, 20–33.
11. Bagajewicz, M.; Savelski, M. On the Use of Linear Models for the Design of Water Utilization Systems in Process Plants with a Single Contaminant. Chem. Eng. Res. Des. 2018, 79, 600–610.
12. Granata, F.; Papirio, S.; Esposito, G.; Gargano, R.; De Marinis, G. Machine Learning Algorithms for the Forecasting of Wastewater Quality Indicators. Water 2017, 9, 105.
13. Lama, G.F.C.; Errico, A.; Francalanci, S.; Solari, L.; Preti, F.; Chirico, G.B. Evaluation of Flow Resistance Models Based on Field Experiments in a Partly Vegetated Reclamation Channel. Geosciences 2020, 10, 47.
14. Galán, B.; Grossmann, I.E. Optimization strategies for the design and synthesis of distributed wastewater treatment networks. Comput. Chem. Eng. 1999, 23, S161–S164.
15. Lotfi, K.; Bonakdari, H.; Ebtehaj, I.; Delatolla, R.; Zinatizadeh, A.A.; Gharabaghi, B. A novel stochastic wastewater quality modeling based on fuzzy techniques. J. Environ. Health Sci. Eng. 2020, 18, 1099–1120.
16. Lotfi, K.; Bonakdari, H.; Ebtehaj, I.; Mjalli, F.S.; Zeynoddin, M.; Delatolla, R.; Gharabaghi, B. Predicting wastewater treatment plant quality parameters using a novel hybrid linear-nonlinear methodology. J. Environ. Manag. 2019, 240, 463–474.
17. Bonakdari, H.; Ebtehaj, I.; Gharabaghi, B.; Vafaeifard, M.; Akhbari, A. Calculating the energy consumption of electrocoagulation using a generalized structure group method of data handling integrated with a genetic algorithm and singular value decomposition. Clean Technol. Environ. Policy 2019, 21, 379–393.
18. Walton, R.; Binns, A.; Bonakdari, H.; Ebtehaj, I.; Gharabaghi, B. Estimating 2-year flood flows using the generalized structure of the Group Method of Data Handling. J. Hydrol. 2019, 575, 671–689.
19. Chen, J.; Yin, H.; Zhang, D. A self-adaptive classification method for plant disease detection using GMDH-Logistic model. Sustain. Comput. Inform. Syst. 2020, 28, 100415.
20. Zaki, M.F.M.; Ismail, M.A.M.; Govindasamy, D.; Leong, F.C.P. Prediction of pressure meter modulus (EM) using GMDH neural network: A case study of Kenny Hill Formation. Arab. J. Geosci. 2020, 13, 360.
21. Alizamir, M.; Kisi, O.; Ahmed, A.N.; Mert, C.; Fai, C.M.; Kim, N.W.; El-Shafie, A. Advanced machine learning model for better prediction accuracy of soil temperature at different depths. PLoS ONE 2020, 15, e0231055.
22. Saberi-Movahed, F.; Najafzadeh, M.; Mehrpooya, A. Receiving More Accurate Predictions for Longitudinal Dispersion Coefficients in Water Pipelines: Training Group Method of Data Handling Using Extreme Learning Machine Conceptions. Water Resour. Manag. 2020, 34, 529–561.
23. Kordnaeij, A.; Moayed, R.Z.; Soleimani, M. Small Strain Shear Modulus Equations for Zeolite–Cement Grouted Sands. Geotech. Geol. Eng. 2019, 37, 5097–5111.
24. Mundi, G.S.; Zytner, R.G.; Warriner, K. Fruit and vegetable wash-water characterization, treatment feasibility study and decision matrices. Can. J. Civ. Eng. 2017, 44, 971–983.
25. Mundi, G.S.; Zytner, R.G.; Warriner, K.; Gharabaghi, B. Predicting fruit and vegetable processing wash-water quality. Water Sci. Technol. 2018, 77, 256–269.
26. APHA/AWWA/WEF. Standard Methods for the Examination of Water and Wastewater, 22nd ed.; American Public Health Association; American Water Works Association; Water Environment Federation: Washington, DC, USA, 2012.
27. Thompson, J.; Sattar, A.M.; Gharabaghi, B.; Warner, R.C. Event-based total suspended sediment particle size distribution model. J. Hydrol. 2016, 536, 236–246.
28. Chen, X.; Hung, Y.C. Predicting chlorine demand of fresh and fresh-cut produce based on produce wash water properties. Postharvest Biol. Technol. 2016, 120, 10–15.
29. Mjalli, F.S.; Al-Asheh, S.; Alfadala, H.E. Use of artificial neural network black-box modeling for the prediction of wastewater treatment plants performance. J. Environ. Manag. 2007, 83, 329–338.

Figure 1. The flowchart of the proposed GSGMDH.

Figure 2. Measured versus Predicted treated BOD levels.

Figure 3. Structure of the developed GSGMDH models: (a) TSS; (b) BOD; (c) COD; (d) TP; and (e) TN.

Table 1. The developed ranks represent the ability of treatment to be effective, ranked best (1) to worst (0). Ranks greater than 0.75 are bolded, and ranks lower than 0.25 are shaded, while the remainder are left blank, for visual aid. The ranking for process is also provided.

		TSS_Treated		TP_Treated		TN_Treated		COD_Treated		BOD_Treated		NH4-N_Treated
	Process	W	WP	W	WP	W	WP	W	WP	W	WP	W	WP
	Process rank	1	0.5	1	0.5	1	0.5	1	0.5	1	0.5	0.5	1
Bench Scale	C	0.55	0.70	0.11	0.29	0.44	0.14	0.33	0.29	0.33	0.13	0.33	0.14
	DAF	0.73	0.80	0.67	0.43	0.67	0.29	0.56	0.43	0.22	0.25	0.44	0.29
	EC&F	0.45	0.60	1	0.57	0.78	0.57	0.89	0.86	0.56	0.75	0.56	0.71
	C&F	1	0.90	0.44	0.86	0.56	0.43	0.67	0.63	0.44	0.5	0.22	0.38
	HC	0.27	0.50	-	-	-	-	-	-	-	-	-	-
	S	0.09	0.10	-	-	-	-	-	-	-	-	-	-
Full Scale	MBR + RO + UV	-	1	-	1	-	1	-	1	-	1	-	1
	POND	0.18	0.40	0.89	-	0.89	-	0.22	0.14	0.78	0.38	0.89	0.57
	SET1	0.36	0.30	0.33	0.71	0.33	0.86	0.44	0.57	0.89	0.63	0.78	0.86
	SET1G	0.91	-	0.78	-	1	-	0.78	-	1	-	0.67	-
	SET3	0.82	0.20	0.22	0.14	0.22	0.71	1	0.71	0.67	0.88	1	0.43
	SETBIO4	0.64	-	0.56	-	0.11	-	0.11	-	0.11	-	0.11	-

Washing (W), Washing and processing (WP). Bench-scale treatments consisted of screening (S), hydrocyclone (HC), settling using the jar test method with optimized coagulation and flocculation process (C&F), dissolved air flotation using optimized coagulant dosage (DAF), centrifuge (C), and electrocoagulation (EC&F). Full-scale treatments sampled included a single tank settling (SET1), settling followed by open grass (SET1G), three settling tanks in series (SET3), pond (POND), sequential batch reactor with four stages: settling, aeration, nutrient removal, settling (SETBIO4), and membrane bioreactor with reverse osmosis and UV disinfection (MBR + RO + UV). (-) test not applicable.

Table 2. MLR equations for predicting treatment effluent water quality of wash-water treatments.

MLR Equations	Equation Number
${TSS}_{Treated} = 0.095 - 0.191 (Treatment) + 0.011 (Process) + 0.136 ({TS}_{Raw}) + 0.131 ({TN}_{Raw})$	(13)
${BOD}_{Treated} = 0.013 - 0.043 (Treatment) + 0.025 (Process) - 0.031 ({TDS}_{Raw}) + 0.563 ({BOD}_{Raw}) - 0.006 ({TN}_{Raw})$	(14)
${COD}_{Treated} = 0.237 - 0.169 (Treatment) - 0.172 (Process) + 0.270 ({TDS}_{Raw}) + 0.166 ({COD}_{Raw})$	(15)
${TP}_{Treated} = 0.035 - 0.069 (Treatment) - 0.172 (Process) + 0.318 ({TP}_{Raw}) + 0.282 ({NH_{4}\text{-}N}_{Raw})$	(16)
${TN}_{Treated} = 0.1202 - 0.236 (Treatment) + 0.059 (Process) - 0.311 ({TDS}_{Raw}) - 0.07 ({TN}_{Raw}) + 0.620 ({NH_{4}\text{-}N}_{Raw})$	(17)
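
For readers who prefer code to a spreadsheet, the five equations in Table 2 can be stored as coefficient tables and evaluated with one helper. This is a minimal Python sketch: the coefficients are transcribed from Equations (13)–(17), the inputs are assumed to be normalized concentrations and Table 1 ranks, and the dictionary layout, function name, and sample BOD_Raw and TN_Raw values are illustrative assumptions rather than part of the study.

```python
# Coefficients transcribed from Table 2, Equations (13)-(17).
MLR_MODELS = {
    "TSS_Treated": {"intercept": 0.095, "Treatment": -0.191, "Process": 0.011,
                    "TS_Raw": 0.136, "TN_Raw": 0.131},
    "BOD_Treated": {"intercept": 0.013, "Treatment": -0.043, "Process": 0.025,
                    "TDS_Raw": -0.031, "BOD_Raw": 0.563, "TN_Raw": -0.006},
    "COD_Treated": {"intercept": 0.237, "Treatment": -0.169, "Process": -0.172,
                    "TDS_Raw": 0.270, "COD_Raw": 0.166},
    "TP_Treated": {"intercept": 0.035, "Treatment": -0.069, "Process": -0.172,
                   "TP_Raw": 0.318, "NH4-N_Raw": 0.282},
    "TN_Treated": {"intercept": 0.1202, "Treatment": -0.236, "Process": 0.059,
                   "TDS_Raw": -0.311, "TN_Raw": -0.07, "NH4-N_Raw": 0.620},
}

def predict_normalized(model, inputs):
    """Evaluate one MLR equation; a missing input raises KeyError."""
    coeffs = MLR_MODELS[model]
    return coeffs["intercept"] + sum(
        coeff * inputs[name] for name, coeff in coeffs.items() if name != "intercept"
    )

# EC&F treatment (rank 0.56) on a W process (rank 1) for the BOD_Treated model.
# The BOD_Raw and TN_Raw values below are hypothetical normalized inputs.
example = {"Treatment": 0.56, "Process": 1.0, "TDS_Raw": 0.0543,
           "BOD_Raw": 0.05, "TN_Raw": 0.10}
bod_norm = predict_normalized("BOD_Treated", example)
print(f"Normalized BOD_Treated: {bod_norm:.4f}")
```

The helper returns the value produced by the equation as written; converting that result back to mg/L is not covered by this sketch.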

Table 3. The correlation matrix between independent and dependent variables.

	Process	Treatment	pH_Raw	BOD_Raw	COD_Raw	NH₄-N_Raw	TN_Raw	TP_Raw	TSS_Raw	TS_Raw	TDS_Raw
TSS_Treated	−0.08	−0.42	0.02	−0.04	0.16	0.05	0.22	0.13	0.24	0.32	0.05
BOD_Treated	−0.53	−0.29	−0.2	0.9	0.62	0.09	0.36	0.04	0.05	0.28	0.61
COD_Treated	−0.58	−0.26	−0.11	0.65	0.56	0.07	0.19	−0.01	0.04	0.40	0.62
TP_Treated	−0.55	−0.35	−0.19	0.07	0.16	0.48	0.19	0.88	0.08	0.07	0.02
TN_Treated	−0.23	−0.43	0.17	0.34	0.49	0.58	0.50	0.40	0.10	0.27	0.35
NH₄-N_Treated	−0.13	−0.46	−0.07	0.18	0.41	0.85	0.37	0.27	0.56	0.14	−0.05

Table 4. Model input and output parameters and their minimum, maximum, and mean values.

Raw Wash-water Quality Parameters	(mg/L)	TSS_Raw	TDS_Raw	TS_Raw	COD_Raw	BOD_Raw	TN_Raw	TP_Raw	NH₄-N_Raw
	min	24	364	468	20	5	0.9 ^a	1.30	0.09
	mean	2498	1532	3795	1556	387	15	16.5	3.1
	max	42,920	8740	13,855	12,400	3760	101	179	35 ^b
Treated Wash-water Quality Parameters	(mg/L)	TSS_Treated	COD_Treated	BOD_Treated	TN_Treated	TP_Treated	NH₄-N_Treated
	min	0	2	2	0.03	0.04	0
	mean	452	632	177	9.4	5.6	4.5
	max	7160	8300	2300	53.1	90	70
Effluent requirements for wastewater discharge in Canada	(mg/L)	TSS	TDS	TS	BOD	TP	NH4-N
	Drinking Water ^c	-	500	-	-	0.01	0.02 ^g
	Sanitary /Sewer Discharge ^d	350	-	-	300	10	-
	PWQO ^e	25 ^f	-	-	20 ^f	0.02

^a For TN_Treated model lower limit for TN is 2.5 mg/L, ^b For NH₄-N_Treated model higher limit for NH₄-N is 105 mg/L, ^c Data obtained from Supporting Document for Ontario Drinking Water Quality Standards, Objectives and Guidelines, Table 1, Table 3 and Table 5, ^d Data obtained from City of Toronto Sewer Discharge and Storm Water Discharge Limits, Table 1, ^e Data obtained from Provincial Water Quality Objectives for Surface Water, some parameters are subjected to additional conditions, ^f Limits for effluent discharged to receiving waters; Guidelines for Effluent Quality and Wastewater Treatment at Federal Establishments, ^g Additional requirements related to pH.
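
The comparison of a predicted effluent value with discharge limits can be sketched as below. The TSS and BOD limits are transcribed from the effluent rows of Table 4, the 60 mg/L BOD value is the one quoted in the worked example of the discussion, and the function name, dictionary layout, and limit-set labels are illustrative assumptions rather than the authors' worksheet.

```python
# Limits in mg/L transcribed from Table 4 (only TSS and BOD are used in this sketch).
LIMITS = {
    "Sanitary/sewer discharge": {"TSS": 350.0, "BOD": 300.0},
    "Receiving waters (footnote f)": {"TSS": 25.0, "BOD": 20.0},
}

def limits_met(predicted):
    """For each limit set, report whether all predicted parameters are within the limits.

    Parameters absent from `predicted` are not checked.
    """
    return {
        name: all(predicted[p] <= limit for p, limit in limits.items() if p in predicted)
        for name, limits in LIMITS.items()
    }

result = limits_met({"BOD": 60.0})
for name, ok in result.items():
    print(f"{name}: {'meets' if ok else 'exceeds'} the limit")
```

In practice, the RMSE of the model used for the prediction should be considered alongside such a check, since a predicted value close to a limit may fall on either side of it.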

Table 5. Models generated for treated water quality parameters and their corresponding R², RMSE, and MAPE values for the train and test datasets.

			R²	RMSE (mg/L)	MAPE (%)
TSS_Treated	GSGMDH	Train (122 *)	0.86	442	985
	GSGMDH	Test (46)	0.82	501	1131
	MLR	Train (122)	0.30	736	65
	MLR	Test (46)	0.54	706	364
BOD_Treated	GSGMDH	Train (120)	0.99	36	106
	GSGMDH	Test (52)	0.99	31	89
	MLR	Train (120)	0.83	159	558
	MLR	Test (52)	0.67	199	443
COD_Treated	GSGMDH	Train (123)	0.85	573	261
	GSGMDH	Test (54)	0.91	619	192
	MLR	Train (123)	0.54	890	820
	MLR	Test (54)	0.73	575	943
TP_Treated	GSGMDH	Train (62)	0.73	8.99	428
	GSGMDH	Test (28)	0.91	5.66	457
	MLR	Train (62)	0.59	6.40	58
	MLR	Test (28)	0.80	12.8	63
TN_Treated	GSGMDH	Train (57)	0.96	3.49	310
	GSGMDH	Test (29)	0.96	4.08	127
	MLR	Train (57)	0.58	7.60	68
	MLR	Test (29)	0.70	10.9	54

* Number of samples used for training and testing the models.
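
The R² gains quoted in the discussion can be recomputed from the training rows of Table 5. The short Python sketch below only restates values already in the table; the variable names are illustrative.

```python
# Training-set R² values from Table 5 as (MLR, GSGMDH) pairs.
R2_TRAIN = {
    "TSS_Treated": (0.30, 0.86),
    "BOD_Treated": (0.83, 0.99),
    "COD_Treated": (0.54, 0.85),
    "TP_Treated": (0.59, 0.73),
    "TN_Treated": (0.58, 0.96),
}

# Gain in percentage points when moving from MLR to GSGMDH.
gains = {name: round((gsgmdh - mlr) * 100) for name, (mlr, gsgmdh) in R2_TRAIN.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain} percentage points")
```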

Table 6. The developed GSGMDH based equation for prediction.

	Input Parameters for all Following Models: x1 = Process, x2 = BOD_Raw, x3 = COD_Raw, x4 = NH₄−N_Raw, x5 = TN_Raw, x6 = TP_Raw, x7 = TSS_Raw, x8 = TS_Raw, x9 = TDS_Raw, x10 = Treatment
TSS	TSS = (\|1.528 − 30.083 ∗ x11 − 15.482 ∗ x1 − 4.522 ∗ x10 + 35.961 ∗ x1 ∗ x11 + 53.634 ∗ x10 ∗ x11 − 2.429 ∗ x10 ∗ x1 + 58.95 ∗ x11 ∗ x11 + 45.6 ∗ x1 ∗ x1 + 6.614 ∗ x10 ∗ x10 − 1.355 ∗ x10 ∗ x1 ∗ x11 + 5.404 ∗ x1 ∗ x11 ∗ x11 − 23.609 ∗ x1 ∗ x1 ∗ x11 − 107.196 ∗ x10 ∗ x11 ∗ x11 + 1.459 ∗ x10 ∗ x1 ∗ x1 − 33.867 ∗ x10 ∗ x10 ∗ x11 + 0.353 ∗ x10 ∗ x10 ∗ x1 − 49.415 ∗ x11 ∗ x11 ∗ x11 − 30.274 ∗ x1 ∗ x1 ∗ x1 − 2.809 ∗ x10 ∗ x10 ∗ x10\|) ∗ 7160
TSS	Where: x11 = 0.137 − 0.639 ∗ x5 − 0.1419 ∗ x10 − 0.7245 ∗ x10 ∗ x5 + 3.46 ∗ x5 ∗ x5 − 0.0056 ∗ x10 ∗ x10 − 4.203 ∗ x10 ∗ x5 ∗ x5 + 1.914 ∗ x10 ∗ x10 ∗ x5 − 0.274 ∗ x5 ∗ x5 ∗ x5 − 0.0313 ∗ x10 ∗ x10 ∗ x10
BOD	BOD = (\|0.008 + 0.841 ∗ x21 + 0.044 ∗ x9 − 0.128 ∗ x5 − 3.982 ∗ x9 ∗ x21 − 0.17 ∗ x5 ∗ x21 − 1.682 ∗ x5 ∗ x9 + 5.71 ∗ x21 ∗ x21 + 0.0196 ∗ x9 ∗ x9 + 0.943 ∗ x5 ∗ x5 + 3.0926 ∗ x5 ∗ x9 ∗ x21 − 2.853 ∗ x9 ∗ x21 ∗ x21 + 5.668 ∗ x9 ∗ x9 ∗ x21 − 2.219 ∗ x5 ∗ x21 ∗ x21 − 5.524 ∗ x5 ∗ x9 ∗ x9 − 1.328 ∗ x5 ∗ x5 ∗ x21 + 11.933 ∗ x5 ∗ x5 ∗ x9 − 6.199 ∗ x21 ∗ x21 ∗ x21 + 0.218 ∗ x9 ∗ x9 ∗ x9 − 3.512 ∗ x5 ∗ x5 ∗ x5\|) ∗ 2298 + 2
	Where: x11 = −0.554 − 0.426 ∗ x10 + 0.42 ∗ x2 + 1.623 ∗ x1 − 1.652 ∗ x2 ∗ x10 + 0.224 ∗ x1 ∗ x10 + 1.517 ∗ x1 ∗ x2 + 0.724 ∗ x10 ∗ x10 + 1.265 ∗ x2 ∗ x2 − 0.766 ∗ x1 ∗ x1 + 1.322 ∗ x1 ∗ x2 ∗ x10 − 0.289 ∗ x2 ∗ x10 ∗ x10 + 1.185 ∗ x2 ∗ x2 ∗ x10 − 0.586 ∗ x1 ∗ x10 ∗ x10 − 1.6 ∗ x1 ∗ x2 ∗ x2 + 0.17 ∗ x1 ∗ x1 ∗ x10 − 1.447 ∗ x1 ∗ x1 ∗ x2 − 0.115 ∗ x10 ∗ x10 ∗ x10 − 0.622 ∗ x2 ∗ x2 ∗ x2 − 0.3 ∗ x1 ∗ x1 ∗ x1
	and: x21 = −0.005 + 0.172 ∗ x11 + 0.071 ∗ x9 + 0.794 ∗ x2 + 7.808 ∗ x9 ∗ x11 + 25.691 ∗ x2 ∗ x11 − 3.26 ∗ x2 ∗ x9 − 21.654 ∗ x11 ∗ x11 − 0.7863662692 ∗ x9 ∗ x9 − 9.392 ∗ x2 ∗ x2 − 48.61 ∗ x2 ∗ x9 ∗ x11 + 17.523 ∗ x9 ∗ x11 ∗ x11 + 2.059 ∗ x9 ∗ x9 ∗ x11 − 17.201 ∗ x2 ∗ x11 ∗ x11 − 6.373 ∗ x2 ∗ x9 ∗ x9 − 20.94 ∗ x2 ∗ x2 ∗ x11 + 28.041 ∗ x2 ∗ x2 ∗ x9 + 43.899 ∗ x11 ∗ x11 ∗ x11 + 1.773 ∗ x9 ∗ x9 ∗ x9 + 8.693 ∗ x2 ∗ x2 ∗ x2
COD	COD = (\|0.000423 + 0.073 ∗ x12 + 0.79 ∗ x11 − 6.087 ∗ x11 ∗ x12 + 1.742 ∗ x12 ∗ x12 + 5.405 ∗ x11 ∗ x11 − 9.309 ∗ x11 ∗ x12 ∗ x12 + 30.949 ∗ x11 ∗ x11 ∗ x12 − 0.46 ∗ x12 ∗ x12 ∗ x12 − 22.407 ∗ x11 ∗ x11 ∗ x11\|) ∗ 8298 + 2
	Where: x11 = −0.059 + 0.131 ∗ x10 + 1.3484 ∗ x9 − 0.694 ∗ x2 − 2.875 ∗ x9 ∗ x10 + 2.066 ∗ x2 ∗ x10 + 5.147 ∗ x2 ∗ x9 + 0.067 ∗ x10 ∗ x10 − 3.47 ∗ x9 ∗ x9 + 1.432 ∗ x2 ∗ x2 − 0.608 ∗ x2 ∗ x9 ∗ x10 + 1.315 ∗ x9 ∗ x10 ∗ x10 + 1.326 ∗ x9 ∗ x9 ∗ x10 − 1.818 ∗ x2 ∗ x10 ∗ x10 − 15.481 ∗ x2 ∗ x9 ∗ x9 − 0.94 ∗ x2 ∗ x2 ∗ x10 + 8.23 ∗ x2 ∗ x2 ∗ x9 − 0.104 ∗ x10 ∗ x10 ∗ x10 + 6.812 ∗ x9 ∗ x9 ∗ x9 − 3.527 ∗ x2 ∗ x2 ∗ x2
	and: x12 = −1.119 − 6.106 ∗ x10 − 34.025 ∗ x2 − 5.377 ∗ x1 − 1.72 ∗ x2 ∗ x10 + 13 ∗ x1 ∗ x10 + 103.148 ∗ x1 ∗ x2 + 2.854 ∗ x10 ∗ x10 + 1.817 ∗ x2 ∗ x2 + 27.183 ∗ x1 ∗ x1 + 2.378 ∗ x1 ∗ x2 ∗ x10 − 0.7980132594 ∗ x2 ∗ x10 ∗ x10 + 0.834 ∗ x2 ∗ x2 ∗ x10 − 1.312 ∗ x1 ∗ x10 ∗ x10 − 1.213 ∗ x1 ∗ x2 ∗ x2 − 7.666 ∗ x1 ∗ x1 ∗ x10 − 69.104 ∗ x1 ∗ x1 ∗ x2 − 0.924 ∗ x10 ∗ x10 ∗ x10 − 1.167 ∗ x2 ∗ x2 ∗ x2 − 20.56 ∗ x1 ∗ x1 ∗ x1
TP	TP = (\|−0.055 − 1.609 ∗ x11 + 0.619 ∗ x10 + 1.528 ∗ x4 − 10.509 ∗ x10 ∗ x11 + 26.087 ∗ x4 ∗ x11 − 1.137 ∗ x4 ∗ x10 + 51.713 ∗ x11 ∗ x11 − 0.738 ∗ x10 ∗ x10 − 16.005 ∗ x4 ∗ x4 + 37.421 ∗ x4 ∗ x10 ∗ x11 − 42.84 ∗ x10 ∗ x11 ∗ x11 + 8.358 ∗ x10 ∗ x10 ∗ x11 − 929.104 ∗ x4 ∗ x11 ∗ x11 − 0.0658 ∗ x4 ∗ x10 ∗ x10 + 473.109 ∗ x4 ∗ x4 ∗ x11 − 3.534 ∗ x4 ∗ x4 ∗ x10 + 322.282 ∗ x11 ∗ x11 ∗ x11 + 0.263 ∗ x10 ∗ x10 ∗ x10 − 39.329 ∗ x4 ∗ x4 ∗ x4\|) ∗ 89.96 + 0.04
TP	where: x11 = −2.442 + 7.202 ∗ x1 + 0.533 ∗ x4 + 0.896 ∗ x4 ∗ x1 − 4.397 ∗ x1 ∗ x1 − 3.009 ∗ x4 ∗ x4 − 4.066 ∗ x4 ∗ x1 ∗ x1 + 21.622 ∗ x4 ∗ x4 ∗ x1 − 0.2734052233 ∗ x1 ∗ x1 ∗ x1 − 15.831 ∗ x4 ∗ x4 ∗ x4
TN	TN = (\|0.079 − 0.157 ∗ x11 + 0.747 ∗ x5 − 1.742 ∗ x9 + 8.214 ∗ x5 ∗ x11 − 5.488 ∗ x9 ∗ x11 − 10.891 ∗ x9 ∗ x5 + 1.884 ∗ x11 ∗ x11 + 2.472 ∗ x5 ∗ x5 + 14.098 ∗ x9 ∗ x9 + 3.182 ∗ x9 ∗ x5 ∗ x11 − 4.938 ∗ x5 ∗ x11 ∗ x11 − 6.586 ∗ x5 ∗ x5 ∗ x11 − 8.161 ∗ x9 ∗ x11 ∗ x11 − 19.931 ∗ x9 ∗ x5 ∗ x5 + 7.19 ∗ x9 ∗ x9 ∗ x11 + 45.043 ∗ x9 ∗ x9 ∗ x5 + 1.522 ∗ x11 ∗ x11 ∗ x11 + 0.609 ∗ x5 ∗ x5 ∗ x5 − 22.436 ∗ x9 ∗ x9 ∗ x9\|) ∗ 60.83 + 0.03
TN	Where: x11 = 0.124 + 0.296 ∗ x4 + 0.339 ∗ x9 − 0.06 ∗ x10 − 7.27 ∗ x9 ∗ x4 − 7.826 ∗ x10 ∗ x4 + 0.627 ∗ x10 ∗ x9 + 42.361 ∗ x4 ∗ x4 − 1.624 ∗ x9 ∗ x9 − 0.143 ∗ x10 ∗ x10 + 18.159 ∗ x10 ∗ x9 ∗ x4 − 196.991 ∗ x9 ∗ x4 ∗ x4 + 71.295 ∗ x9 ∗ x9 ∗ x4 + 0.209 ∗ x10 ∗ x4 ∗ x4 + 0.382 ∗ x10 ∗ x9 ∗ x9 + 1.765 ∗ x10 ∗ x10 ∗ x4 − 1.498 ∗ x10 ∗ x10 ∗ x9 − 2.397 ∗ x4 ∗ x4 ∗ x4 + 0.785 ∗ x9 ∗ x9 ∗ x9 + 0.258 ∗ x10 ∗ x10 ∗ x10
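
To show how the nested equations in Table 6 are evaluated, the sketch below codes the intermediate variable x11 of the TSS model and the final scaling step shared by all equations. The coefficients are transcribed from the table; the function names are illustrative, and it is assumed here that the inputs are on the same normalized scale as the MLR inputs.

```python
def tss_x11(x5, x10):
    """Intermediate variable x11 of the TSS model (x5 = TN_Raw, x10 = Treatment)."""
    return (0.137 - 0.639 * x5 - 0.1419 * x10 - 0.7245 * x10 * x5
            + 3.46 * x5 ** 2 - 0.0056 * x10 ** 2
            - 4.203 * x10 * x5 ** 2 + 1.914 * x10 ** 2 * x5
            - 0.274 * x5 ** 3 - 0.0313 * x10 ** 3)

def to_mg_per_l(y_norm, scale, offset=0.0):
    """Apply the |y| * scale + offset step that closes each equation in Table 6."""
    return abs(y_norm) * scale + offset

# With both inputs at zero only the constant term remains.
print(tss_x11(0.0, 0.0))
# Scaling step of the TP equation (scale 89.96, offset 0.04) for a hypothetical y of -0.5.
print(to_mg_per_l(-0.5, 89.96, 0.04))
```

The full TSS prediction would pass x11, x1, and x10 through the 20-term polynomial in the first row of the table and then scale the absolute value by 7160; that step follows the same pattern and is omitted for brevity.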

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Mundi, G.; Zytner, R.G.; Warriner, K.; Bonakdari, H.; Gharabaghi, B. Machine Learning Models for Predicting Water Quality of Treated Fruit and Vegetable Wastewater. Water 2021, 13, 2485. https://doi.org/10.3390/w13182485
