# A Comparative Assessment of Variable Selection Methods in Urban Water Demand Forecasting

## Abstract

C_{p} criterion and (vii) principal component analysis (PCA). The results showed that the different variable selection methods produced multiple linear regression models with different sets of predictor variables. Moreover, selection methods (i)–(vi) produced some irrational relationships between water demand and the predictor variables because of the high degree of correlation among the predictors, whereas PCA showed promising results in avoiding these irrational behaviours and minimising multicollinearity problems.

## 1. Introduction

C_{p} criterion; (vi) best model with the Akaike information criterion (AIC); and (vii) the model with variables selected after preprocessing by PCA. The performance of the various models is assessed over an independent validation period. This is one of the more comprehensive studies comparing the performance of variable selection methods in long-term water demand forecasting. Moreover, it is one of the few papers to discuss the multicollinearity problem in water demand forecasting and to highlight how the problem can be resolved. The results are expected to provide important insights into variable selection for water demand modelling and so support more accurate water demand projections. The findings should be useful in enhancing the sustainability of urban water resources and water supply systems in a given region by providing a better tool for estimating future water demand.

## 2. Study Area and Data

^{2}), water price (AU$/kilolitre (kL)), conservation program participation (CPP) and three water restriction levels (Levels 1, 2 and 3) imposed in the study area during previous drought periods (2003–2009).

## 3. Methods

#### 3.1. Forward Selection

#### 3.2. Backward Elimination

#### 3.3. Stepwise Selection

#### 3.4. Best Model with Residual Mean Square Error Criterion

2^{k}. The number of independent variables considered in this study was 11. In the best model with the MSE criterion, all possible models (2^{11}) were evaluated, and the model with the lowest MSE was selected. The MSE measures the variance for each of the models and is calculated by the following equation:
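
As a minimal sketch of the exhaustive search described above, the following assumes the usual residual mean square, MSE = SSE/(n − p), where SSE is the residual sum of squares, n the number of observations and p the number of fitted parameters including the intercept. The data are synthetic and the function names `residual_mse` and `best_subset_by_mse` are illustrative; this is not the study's code.

```python
from itertools import combinations

import numpy as np


def residual_mse(X, y):
    """Residual mean square SSE / (n - p) of an OLS fit with intercept.

    p counts the fitted parameters, including the intercept.
    """
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    return float(residuals @ residuals) / (n - design.shape[1])


def best_subset_by_mse(X, y):
    """Fit every non-empty predictor subset; return (lowest MSE, column indices)."""
    k = X.shape[1]
    best = (np.inf, ())
    for size in range(1, k + 1):
        for cols in combinations(range(k), size):
            mse = residual_mse(X[:, list(cols)], y)
            if mse < best[0]:
                best = (mse, cols)
    return best


# Synthetic data (assumption, not the study's data): y depends on columns 0 and 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(82, 4))
y = 1.0 + 0.5 * X[:, 0] - 0.3 * X[:, 2] + rng.normal(scale=0.05, size=82)

mse, cols = best_subset_by_mse(X, y)
print(cols, round(mse, 5))
```

Because adjusted R^{2} equals 1 − MSE/(SST/(n − 1)), with SST the total sum of squares, the subset with the lowest residual MSE is also the subset with the highest adjusted R^{2}. With 11 predictors the same loop visits 2^{11} − 1 non-empty subsets, which is still cheap for ordinary least squares.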

#### 3.5. Best Model with the Akaike Information Criterion

#### 3.6. Best Model with Mallows’ C_{p} Criterion

The C_{p} criterion was proposed by Mallows [43] for univariate regression analysis, and it selects the model with the minimum value of the C_{p} statistic. The C_{p} statistic can be calculated by the following equation:
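
A minimal sketch of the calculation, assuming the standard form C_{p} = SSE_{p}/s^{2} − n + 2p, where SSE_{p} is the residual sum of squares of the candidate model, s^{2} is the residual mean square of the full model, n is the number of observations and p is the number of parameters in the candidate model including the intercept. The data are synthetic and the helper names `sse` and `mallows_cp` are illustrative.

```python
import numpy as np


def sse(X, y):
    """Residual sum of squares of an OLS fit with intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)


def mallows_cp(X_full, y, cols):
    """C_p = SSE_p / s2_full - n + 2p, with p counting the intercept."""
    n, k = X_full.shape
    s2_full = sse(X_full, y) / (n - k - 1)  # residual mean square, full model
    p = len(cols) + 1
    return sse(X_full[:, list(cols)], y) / s2_full - n + 2 * p


# Synthetic data (assumption, not the study's data): y depends on columns 0 and 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(82, 4))
y = 1.0 + 0.5 * X[:, 0] - 0.3 * X[:, 2] + rng.normal(scale=0.05, size=82)

cp_full = mallows_cp(X, y, (0, 1, 2, 3))  # equals p = 5 by construction
cp_good = mallows_cp(X, y, (0, 2))        # contains both true predictors
cp_poor = mallows_cp(X, y, (1, 3))        # omits both true predictors
print(round(cp_full, 3), round(cp_good, 3), round(cp_poor, 1))
```

For the full model the statistic always equals p, so it cannot favour the full model on fit alone; a subset that omits an important predictor inflates SSE_{p} and therefore C_{p}.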

#### 3.7. Principal Component Analysis

## 4. Results

In the best model with Mallows’ C_{p} criterion, eight of the 11 predictor variables were found to be statistically significant. Rainfall, number of rainy days and solar exposure showed no effect on water demand. Like the earlier models, this model showed some irrational relationships: water price and the Level 1 dummy variable were positively correlated with water demand, and average temperature was negatively correlated. In the best model with the AIC criterion, seven of the 11 variables were found to be statistically significant. Rainfall, number of rainy days, the Level 1 dummy and solar exposure showed no relation with water demand. This model also had some irrational characteristics, like the earlier models. Figure 2 and Table 1 show that the selection methods retained different sets of variables as the final inputs to their regression models, and that all of them produced some irrational relationships with water demand. The most likely reason for these irrational relationships is multicollinearity among the independent variables. In terms of the model statistics shown in Table 1, the best model with the MSE criterion performed best among the six models, as it had the highest R^{2} and adjusted (Adj.) R^{2} values and the lowest RMSE (root mean square error) and MAPE (mean absolute percentage error) values. However, models (iii)–(vi) (backward elimination, best model with the residual mean square error criterion, best model with Mallows’ C_{p} criterion and best model with the AIC) all gave comparable results.
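
To give a sense of how strong the multicollinearity is, the sketch below converts pairwise correlations into variance inflation factors (VIF). The VIF is a standard diagnostic that is brought in here for illustration and is not one of the methods compared in this study; the correlation values are those reported in this paper's correlation matrix, and the function name `pairwise_vif` is illustrative.

```python
def pairwise_vif(r):
    """Variance inflation factor implied by a single pairwise correlation r.

    With more than two predictors the full VIF, 1 / (1 - R_j^2), can only be
    equal or larger, so this value is a lower bound.
    """
    return 1.0 / (1.0 - r ** 2)


# Pairwise correlations taken from the correlation matrix in this paper.
pairs = {
    "Mean Max Temp vs Average Temp": 0.99,
    "Water Price vs CPP": 0.95,
}
for name, r in pairs.items():
    print(f"{name}: r = {r:.2f}, VIF >= {pairwise_vif(r):.1f}")
```

A correlation of 0.99 implies a VIF of about 50 and a correlation of 0.95 a VIF of about 10, so the variance of the affected coefficients is inflated at least tenfold. Coefficients estimated under such inflation can easily change sign, which is consistent with the counter-intuitive signs reported for water price and average temperature.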

- Model 1: Rainfall, mean maximum temperature, CPP, Level 1, Level 2, Level 3
- Model 2: Rainfall, mean maximum temperature, water price, Level 1, Level 2, Level 3
- Model 3: Rainfall, mean maximum temperature, CPP, water price, Level 1, Level 2, Level 3

C_{p}, AIC and selection of variables after PCA (i.e., Model 2)) for the independent data period is presented in Figure 5, which also shows that the regression model with the selected independent variables (Model 2) performed better than all the other models. These results indicate that the selected independent variables can simulate monthly water demand with higher accuracy and that the developed model is largely free from the multicollinearity problem. They also indicate that PCA selected the independent variables better than the other methods adopted in this study and therefore has the potential to produce more accurate forecasts. The method is easy to implement and can be used in other water supply systems around the world to identify the influential water demand variables and estimate water demand.
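
The idea of selecting variables after PCA can be sketched as follows: correlated predictors load on the same component, so one representative per component can be carried into the regression. The sketch assumes unrotated loadings computed from the correlation matrix and uses synthetic data; the study's exact procedure (for example, whether a rotation was applied) is not reproduced here, and the function name `pca_loadings` is illustrative.

```python
import numpy as np


def pca_loadings(X, n_components):
    """Unrotated PCA loadings from the correlation matrix of X.

    Loadings are eigenvectors scaled by sqrt(eigenvalue), i.e. the correlation
    between each standardised variable and each component.
    """
    corr = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    loadings = eigvecs[:, order] * np.sqrt(eigvals[order])
    explained = eigvals[order] / eigvals.sum()
    return loadings, explained


# Synthetic predictors (assumption, not the study's data): columns 0 and 1 are
# nearly collinear, column 2 is independent of both.
rng = np.random.default_rng(1)
base = rng.normal(size=200)
X = np.column_stack([
    base + rng.normal(scale=0.1, size=200),
    base + rng.normal(scale=0.1, size=200),
    rng.normal(size=200),
])

loadings, explained = pca_loadings(X, n_components=2)
# Assign each variable to the component on which it loads most strongly.
dominant = np.abs(loadings).argmax(axis=1)
print(np.round(loadings, 2))
print(dominant)
```

The two collinear columns are assigned to the same component and the independent column to another, so keeping one variable from each group avoids entering both collinear variables in the same regression.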

## 5. Conclusions

C_{p} criterion, AIC criterion and principal component analysis (PCA)) were compared for long-term water demand forecasting for the Blue Mountains Water Supply System in New South Wales, Australia. The results showed that different variable selection methods produced different sets of predictor variables. Moreover, some selection methods (e.g., forward selection and backward elimination) produced irrational sets of variables and regression equations. In contrast, when the predictor datasets were preprocessed by PCA, the resulting water demand model simulated water demand better than the other developed models and did not show any counter-intuitive relationship with the independent variables. The results also indicated that PCA has the potential to identify the influential variables in water demand modelling better than the other statistical methods adopted in this study. However, variable selection methods need to be applied with careful scrutiny when there is a high degree of multicollinearity among the predictor variables. The findings of this paper are directly applicable to the study area in Australia; however, the technique can be adapted to other countries with different water use and climatic characteristics to develop water demand forecasting models.

## References

1. Notter, B.; MacMillan, L.; Viviroli, D.; Weingartner, R.; Liniger, H.P. Impacts of environmental change on water resources in the Mt. Kenya region. J. Hydrol. **2007**, 343, 266–278.
2. Koutroulis, A.G.; Tsanis, I.K.; Daliakopoulos, I.N.; Jacob, D. Impact of climate change on water resources status: A case study for Crete Island, Greece. J. Hydrol. **2013**, 479, 146–158.
3. Makki, A.A.; Stewart, R.A.; Beal, C.D.; Panuwatwanich, K. Novel bottom-up urban water demand forecasting model: Revealing the determinants, drivers and predictors of residential indoor end-use consumption. Resour. Conserv. Recycl. **2015**, 95, 15–37.
4. Gato, S.; Jayasuriya, N.; Roberts, P. Temperature and rainfall thresholds for base use urban water demand modelling. J. Hydrol. **2007**, 337, 364–376.
5. Arbués, F.; Villanúa, I.; Barberán, R. Household size and residential water demand: An empirical approach. Aust. J. Agric. Resour. Econ. **2010**, 54, 61–80.
6. House-Peters, L.; Pratt, B.; Chang, H. Effects of urban spatial structure, sociodemographics, and climate on residential water consumption in Hillsboro, Oregon. J. Am. Water Resour. Assoc. **2010**, 46, 461–472.
7. Babel, M.S.; Shinde, V.R. Identifying prominent explanatory variables for water demand prediction using artificial neural networks: A case study of Bangkok. Water Resour. Manag. **2011**, 25, 1653–1676.
8. Abrams, B.; Kumaradevan, S.; Sarafidis, V.; Spaninks, F. An econometric assessment of pricing Sydney’s residential water use. Econ. Rec. **2012**, 88, 89–105.
9. Haque, M.M.; Rahman, A.; Hagare, D.; Kibria, G. Probabilistic water demand forecasting using projected climatic data for Blue Mountains water supply system in Australia. Water Resour. Manag. **2014**, 28, 1959–1971.
10. Felfelani, F.; Kerachian, R. Municipal water demand forecasting under peculiar fluctuation in population: A case study of Mashhad touristy city. Hydrol. Sci. J. **2015**, 61, 1524–1534.
11. Gottlieb, M. Urban domestic demand for water: A Kansas case study. Land Econ. **1963**, 39, 204–210.
12. Conley, B.C. Price elasticity of the demand for water in Southern California. Ann. Reg. Sci. **1967**, 1, 180–189.
13. Howe, C.W.; Linaweaver, F.P. The impact of price on residential water demand and its relation to system design and price structure. Water Resour. Res. **1967**, 3, 13–32.
14. Turnovsky, S.J. The demand for water: Some empirical evidence on consumers’ response to a commodity uncertain in supply. Water Resour. Res. **1969**, 5, 350–361.
15. Hanke, S.H. Demand for water under dynamic conditions. Water Resour. Res. **1970**, 6, 1253–1261.
16. Polebitski, A.S.; Palmer, R.N. Seasonal residential water demand forecasting for census tracts. J. Water Resour. Plan. Manag. **2009**, 136, 27–36.
17. Wei, S.; Lei, A.; Islam, S.N. Modeling and simulation of industrial water demand of Beijing municipality in China. Front. Environ. Sci. Eng. China **2010**, 4, 91–101.
18. Behboudian, S.; Tabesh, M.; Falahnezhad, M.; Ghavanini, F.A. A long-term prediction of domestic water demand using preprocessing in artificial neural network. J. Water Supply Res. Technol.-Aqua **2014**, 63, 31–42.
19. Donkor, E.A.; Mazzuchi, T.A.; Soyer, R.; Alan Roberson, J. Urban water demand forecasting: Review of methods and models. J. Water Resour. Plan. Manag. **2014**, 140, 146–159.
20. Billings, R.B.; Jones, C.V. Forecasting Urban Water Demand; American Water Works Association: Denver, CO, USA, 2011.
21. Tabesh, M.; Dini, M. Fuzzy and neuro-fuzzy models for short-term water demand forecasting in Tehran. Iran. J. Sci. Technol. **2009**, 33, 61–77.
22. Bai, Y.; Wang, P.; Li, C.; Xie, J.; Wang, Y. A multi-scale relevance vector regression approach for daily urban water demand forecasting. J. Hydrol. **2014**, 517, 236–245.
23. Brentan, B.M.; Luvizotto E., Jr.; Herrera, M.; Izquierdo, J.; Pérez-García, R. Hybrid regression model for near real-time urban water demand forecasting. J. Comput. Appl. Math. **2017**, 309, 532–541.
24. Barrett, B.E.; Gray, J.B. A computational framework for variable selection in multivariate regression. Stat. Comput. **1994**, 4, 203–212.
25. McQuarrie, A.D.; Tsai, C. Regression and Time Series Model Selection; World Scientific Publishing Co., Pte. Ltd.: Singapore, 1998.
26. Sauerbrei, W.; Royston, P.; Binder, H. Selection of important variables and determination of functional form for continuous predictors in multivariable model building. Stat. Med. **2007**, 26, 5512–5528.
27. Lee, H.; Ghosh, S.K. Performance of information criteria for spatial models. J. Stat. Comput. Simul. **2009**, 79, 93–106.
28. Sharma, M.J.; Yu, S.J. Stepwise regression data envelopment analysis for variable reduction. Appl. Math. Comput. **2015**, 253, 126–134.
29. Haque, M.M.; Egodawatta, P.; Rahman, A.; Goonetilleke, A. Assessing the significance of climate and community factors on urban water demand. Int. J. Sustain. Built Environ. **2015**, 4, 222–230.
30. Raffalovich, L.E.; Deane, G.D.; Armstrong, D.; Tsao, H.S. Model selection procedures in social research: Monte-Carlo simulation results. J. Appl. Stat. **2008**, 35, 1093–1114.
31. Murtaugh, P.A. Performance of several variable selection methods applied to real ecological data. Ecol. Lett. **2009**, 12, 1061–1068.
32. Haddad, K.; Rahman, A. Regional flood frequency analysis in eastern Australia: Bayesian GLS regression-based methods within fixed region and ROI framework - Quantile Regression vs. Parameter Regression Technique. J. Hydrol. **2012**, 430–431, 142–161.
33. Xie, J.; Hong, T. Variable selection methods for probabilistic load forecasting: Empirical evidence from seven States of the United States. IEEE Trans. Smart Grid **2017**.
34. Gagliardi, F.; Alvisi, S.; Kapelan, Z.; Franchini, M. A probabilistic short-term water demand forecasting model based on the Markov Chain. Water **2017**, 9, 507.
35. Pacchin, E.; Alvisi, S.; Franchini, M. A short-term water demand forecasting model using a moving window on previously observed data. Water **2017**, 9, 172.
36. Bluemountainsaustralia.com (n.d.). Location and Maps. Available online: http://www.bluemts.com.au/info/about/maps/ (accessed on 12 December 2017).
37. Bluemountainsaustralia.com (n.d.). Climate. Available online: http://www.bluemts.com.au/info/about/climate/ (accessed on 12 December 2017).
38. Haque, M.M.; Hagare, D.; Rahman, A.; Kibria, G. Quantification of water savings due to drought restrictions in water demand forecasting models. J. Water Resour. Plan. Manag. **2014**, 140, 04014035.
39. Browne, M.W. Cross-validation methods. J. Math. Psychol. **2000**, 44, 108–132.
40. Sydney Water. Water Conservation and Recycling Implementation Report, 2009–2010; Sydney Water Corporation: Sydney, NSW, Australia, 2010.
41. Montgomery, D.C.; Peck, E.A.; Vining, G.G. Introduction to Linear Regression Analysis; John Wiley and Sons: New York, NY, USA, 2011.
42. Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control **1974**, 19, 716–723.
43. Mallows, C.L. Some comments on Cp. Technometrics **1973**, 15, 661–675.
44. Abdul-Wahab, S.A.; Bakheit, C.S.; Al-Alawi, S.M. Principal component and multiple regression analysis in modelling of ground-level ozone and factors affecting its concentrations. Environ. Model. Softw. **2005**, 20, 1263–1271.
45. Olsen, R.L.; Chappell, R.W.; Loftis, J.C. Water quality sample collection, data treatment and results presentation for principal components analysis-literature review and Illinois River watershed case study. Water Res. **2012**, 46, 3110–3122.

**Figure 1.** Blue Mountains region in Australia [37].

**Figure 2.** Standardized coefficients of the independent variables for each variable selection method.

**Figure 5.** Comparison of modelled results of all the developed models (i.e., stepwise, forward, backward, MSE, Mallows’ C_{p}, AIC and selective variable regression (Model 2)).

**Table 1.** Modelling results from the developed models adopting different variable selection methods. CPP, conservation program participation. The dependent variable is log10(PDMWC) and N = 82 for all six models; a dash means the variable was not retained.

Independent variable | Stepwise: coefficient | Stepwise: p-value | Forward: coefficient | Forward: p-value | Backward: coefficient | Backward: p-value | MSE: coefficient | MSE: p-value | C_{p}: coefficient | C_{p}: p-value | AIC: coefficient | AIC: p-value |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Constant | 1.0907 | <0.0001 | 1.0907 | <0.0001 | 1.0495 | <0.0001 | 1.0327 | <0.0001 | 1.0427 | <0.0001 | 1.0495 | <0.0001 |
Rainfall | −0.0001 | 0.0040 | −0.0001 | 0.0040 | - | - | 0.0000 | 0.2621 | - | - | - | - |
Mean Max Temp | - | - | - | - | 0.0189 | 0.0005 | 0.0204 | 0.0019 | 0.0189 | 0.0005 | 0.0189 | 0.0005 |
Water Price | 0.0744 | 0.0229 | 0.0744 | 0.0229 | 0.0812 | 0.0172 | 0.0904 | 0.0114 | 0.0898 | 0.0111 | 0.0812 | 0.0172 |
Num_Rainy D. | - | - | - | - | - | - | 0.0009 | 0.1650 | - | - | - | - |
Evaporation | 0.0004 | <0.0001 | 0.0004 | <0.0001 | 0.0003 | 0.0153 | 0.0002 | 0.0253 | 0.0003 | 0.0185 | 0.0003 | 0.0153 |
Level 1 (dummy) | 0.0217 | 0.0132 | 0.0217 | 0.0132 | - | - | 0.0115 | 0.2792 | 0.0102 | 0.3352 | - | - |
Level 2 (dummy) | - | - | - | - | −0.0311 | 0.0002 | −0.0243 | 0.0186 | −0.0251 | 0.0148 | −0.0311 | 0.0002 |
Level 3 (dummy) | - | - | - | - | −0.0269 | 0.0094 | −0.0204 | 0.1064 | −0.0200 | 0.1121 | −0.0269 | 0.0094 |
SolarEx | - | - | - | - | - | - | - | - | - | - | - | - |
CPP | −0.1243 | <0.0001 | −0.1243 | <0.0001 | −0.1201 | 0.0009 | −0.1319 | 0.0006 | −0.1302 | 0.0006 | −0.1201 | 0.0009 |
Average Temp | - | - | - | - | −0.0213 | 0.0002 | −0.0232 | 0.0011 | −0.0213 | 0.0002 | −0.0213 | 0.0002 |

Model performance | Stepwise Selection | Forward Selection | Backward Selection | MSE Criterion | Mallows’ C_{p} Criterion | AIC Criterion |
---|---|---|---|---|---|---|
R^{2} | 69.80% | 69.80% | 74.60% | 75.70% | 75.00% | 74.60% |
Adj. R^{2} | 67.80% | 67.80% | 72.20% | 72.30% | 72.20% | 72.20% |
RMSE | 0.02 | 0.02 | 0.019 | 0.019 | 0.019 | 0.019 |
MAPE | 1.331 | 1.331 | 1.231 | 1.214 | 1.222 | 1.231 |
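
The adjusted R^{2} values in Table 1 can be checked from the reported R^{2}, N and the number of retained predictors, assuming the usual definition Adj. R^{2} = 1 − (1 − R^{2})(n − 1)/(n − k − 1) for k predictors plus an intercept. The function name `adjusted_r2` is illustrative; the predictor counts are read off Table 1.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for a model with k predictors (plus intercept) and n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# R^2, N and number of retained predictors as read from Table 1.
stepwise = adjusted_r2(0.698, 82, 5)   # stepwise / forward selection, 5 predictors
backward = adjusted_r2(0.746, 82, 7)   # backward elimination / AIC, 7 predictors
mse_best = adjusted_r2(0.757, 82, 10)  # MSE criterion, 10 predictors
print(round(stepwise, 3), round(backward, 3), round(mse_best, 3))
```

The results, roughly 0.678, 0.722 and 0.723, agree with the tabulated 67.80%, 72.20% and 72.30%. They also show that the three extra predictors in the MSE-criterion model raise R^{2} by about one percentage point but adjusted R^{2} by only about 0.1 point, which is why models (iii)–(vi) are described as comparable.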

Variables | Rainfall | Num_Rainy D. | Mean Max Temp | Average Temp | Evaporation | SolarEx | Water Price | CPP | Level 1 (Dummy) | Level 2 (Dummy) | Level 3 (Dummy) |
---|---|---|---|---|---|---|---|---|---|---|---|
Rainfall | 1.00 | 0.68 | 0.21 | 0.27 | 0.01 | 0.10 | 0.14 | 0.13 | −0.05 | −0.01 | 0.11 |
Num_Rainy D. | | 1.00 | 0.26 | 0.32 | 0.02 | 0.20 | 0.26 | 0.25 | −0.12 | −0.12 | 0.23 |
Mean Max Temp | | | 1.00 | 0.99 | 0.79 | 0.86 | 0.07 | 0.07 | 0.19 | 0.00 | 0.02 |
Average Temp | | | | 1.00 | 0.75 | 0.82 | 0.08 | 0.07 | 0.19 | −0.01 | 0.02 |
Evaporation | | | | | 1.00 | 0.86 | −0.12 | −0.14 | 0.37 | 0.07 | −0.25 |
SolarEx | | | | | | 1.00 | 0.19 | 0.22 | 0.10 | 0.00 | 0.08 |
Water Price | | | | | | | 1.00 | 0.95 | −0.34 | −0.37 | 0.70 |
CPP | | | | | | | | 1.00 | −0.37 | −0.38 | 0.80 |
Level 1 (dummy) | | | | | | | | | 1.00 | −0.14 | −0.47 |
Level 2 (dummy) | | | | | | | | | | 1.00 | −0.59 |
Level 3 (dummy) | | | | | | | | | | | 1.00 |

Variables | PC 1 | PC 2 | PC 3 | PC 4 |
---|---|---|---|---|
Rainfall | 0.35 | 0.33 | 0.75 | 0.24 |
Num_Rainy D. | 0.34 | 0.53 | 0.62 | 0.21 |
Mean Max Temp | 0.97 | −0.01 | −0.07 | −0.08 |
Average Temp | 0.96 | 0.03 | −0.01 | −0.04 |
Evaporation | 0.88 | −0.27 | −0.22 | −0.04 |
SolarEx | 0.92 | 0.08 | −0.18 | −0.18 |
Water Price | 0.00 | 0.86 | −0.13 | −0.07 |
CPP | 0.02 | 0.92 | −0.18 | −0.10 |
Level 1 (dummy) | 0.24 | −0.49 | −0.18 | 0.71 |
Level 2 (dummy) | 0.01 | −0.54 | 0.42 | −0.62 |
Level 3 (dummy) | −0.05 | 0.88 | −0.23 | −0.03 |

Dependent variable: log10(PDMWC); N = 82 for all three models. A dash means the variable is not part of the model.

Independent variable | Model 1: coefficient | Model 1: p-value | Model 2: coefficient | Model 2: p-value | Model 3: coefficient | Model 3: p-value |
---|---|---|---|---|---|---|
Constant | 1.1368 | 0.0000 | 1.1461 | 0.0000 | 1.1139 | 0.0000 |
Rainfall | −0.0001 | 0.0020 | −0.0001 | 0.0030 | −0.0001 | 0.0020 |
Mean Max Temp | 0.0025 | 0.0000 | 0.0025 | 0.0000 | 0.0025 | 0.0000 |
Water Price | - | - | −0.0292 | 0.0760 | 0.0451 | 0.2660 |
CPP | −0.0422 | 0.0150 | - | - | −0.0866 | 0.0480 |
Level 1 (dummy) | 0.0099 | 0.4230 | 0.0061 | 0.6210 | 0.0137 | 0.2850 |
Level 2 (dummy) | −0.0285 | 0.0130 | −0.0334 | 0.0040 | −0.0240 | 0.0470 |
Level 3 (dummy) | −0.0345 | 0.0090 | −0.0460 | 0.0000 | −0.0273 | 0.0590 |

Model performance | Model 1 | Model 2 | Model 3 |
---|---|---|---|
R^{2} | 63% | 68% | 66% |
Adjusted (Adj.) R^{2} | 35% | 42% | 39% |
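
Because the dependent variable is log10(PDMWC), a coefficient b acts multiplicatively on demand: switching a dummy variable from 0 to 1 multiplies the modelled demand by 10^{b}. The sketch below applies this standard back-transformation to the Model 2 coefficients of the Level 2 and Level 3 restriction dummies; the function name `percent_change` is illustrative and the interpretation is offered as a reading aid, not as a result stated by the study.

```python
def percent_change(coef):
    """Percentage change in demand implied by a coefficient in a log10 model.

    For a dummy variable switching from 0 to 1 (or a one-unit increase in a
    continuous predictor), modelled demand is multiplied by 10 ** coef.
    """
    return (10.0 ** coef - 1.0) * 100.0


# Model 2 coefficients of the restriction dummies, as tabulated in this paper.
level2 = percent_change(-0.0334)
level3 = percent_change(-0.0460)
print(f"Level 2: {level2:.1f}%  Level 3: {level3:.1f}%")
```

Under this reading, Model 2 associates Level 2 restrictions with a reduction in demand of roughly 7% and Level 3 restrictions with a reduction of roughly 10%, holding the other predictors fixed.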

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Haque, M.M.; Rahman, A.; Hagare, D.; Chowdhury, R.K.
A Comparative Assessment of Variable Selection Methods in Urban Water Demand Forecasting. *Water* **2018**, *10*, 419.
https://doi.org/10.3390/w10040419

**Chicago/Turabian Style**

Haque, Md Mahmudul, Ataur Rahman, Dharma Hagare, and Rezaul Kabir Chowdhury.
2018. "A Comparative Assessment of Variable Selection Methods in Urban Water Demand Forecasting" *Water* 10, no. 4: 419.
https://doi.org/10.3390/w10040419