Comparing Water Use Forecasting Model Selection Criteria: The Case of Commercial, Institutional, and Industrial Sector in Southern California

Abstract: The United States is one of the largest per capita water withdrawers in the world, and certain parts of it, especially the western region, have long experienced water scarcity. Historically, the U.S. relied on large water infrastructure investments and planning to solve its water scarcity problems. These large-scale investments as well as water planning activities rely on water forecast studies conducted by water managing agencies. These forecasts, while key to the sustainable management of water, are usually done using historical growth extrapolation, conventional econometric approaches, or legacy software packages and often do not utilize methods common in the field of statistical learning. The objective of this study is to illustrate the extent to which forecast outcomes for commercial, institutional, and industrial water use may be improved with a relatively simple adjustment to forecast model selection. To do so, we estimate over 352 thousand regression models with retailer level panel data from the largest utility in the U.S., featuring a rich set of variables to model commercial, institutional, and industrial water use in Southern California. Out-of-sample forecasting performances of those models that rank within the top 5% based on various in- and out-of-sample goodness-of-fit criteria were compared. We demonstrate that models with the best in-sample fit yield, on average, larger forecast errors for out-of-sample forecast exercises and are subject to a significant degree of variation in forecasts. We find that out-of-sample forecast error and the variability in the forecast values can be reduced by an order of magnitude with a relatively straightforward change in the model selection criteria even when the forecast modelers do not have access to "big data" or utilize state-of-the-art machine learning techniques.


Introduction
The United States (U.S.) is one of the largest per capita water withdrawers in the world [1]. It has a large water supply overall; however, water scarcity is still a challenge as water is not present where, when, and in the form it is needed. A large portion of the Western U.S. has been vulnerable to drought, and this portion also constitutes the areas with the fastest population growth [2]. Water is an essential input for most sectors in the economy; hence, its scarcity has caused the U.S. to be historically dependent on large water infrastructure investments as well as extensive planning, both of which rely on water use forecasts [3,4]. The severity of droughts is expected to be exacerbated by the changing climate [5,6], which increases the importance of reliable forecasts for sustainable water management. California is by far the largest U.S. state both by population [7] and by economic activity. The share of water use that is self-supplied, however, may become smaller, since water shortages in the western U.S. have led to state governments becoming more involved in water resource management. For example, in California, lawmakers passed landmark legislation in 2014 that aims to regulate, for the first time, sustainable groundwater management [2]. This trend towards a centralized management mechanism further increases the importance of planning and forecasting within the commercial, institutional, and industrial (CII) sector, at least from the perspective of public water suppliers. Additionally, changing trends in the economic landscape of both developed and developing countries, with a shift towards commercial activities in developed countries and industrial activities gaining pace in developing nations [18], are likely to increase the relative importance of CII sectors in total water use. As an example, the majority of the water used in industrial activities in California is, for the time being, self-supplied, while commercial water is supplied by public utilities [17].
Therefore, in addition to the institutional changes mentioned above, the shift from industrial towards commercial activities in the economic mix is another reason we may see the CII sector take up a greater share of publicly supplied water in the U.S.
Third, the CII sector already makes up a considerable portion of water use in other parts of the world, which makes this study relevant in the international context as well. In Europe, the industrial sector accounts for 23% of total water use, with significant variability in water use patterns across countries [19].
Finally, water-saving measures are likely to be a crucial part of overall water management in the face of droughts induced by climate change, rising population, and the surge in per capita demand due to globally increasing life standards [20,21]. The commercial sector can be an important avenue to save water, for example, through various rebate programs [22][23][24]. Effective implementation and credible evaluation of such programs will require reliable water use forecasts in the CII sector, which will also improve the success of water plans and budgets.

Current State of the Literature
There is a large body of literature studying urban water use, which can be divided into two main groups. The first group comprises studies that look at the effect of price and various other factors on water demand as well as determine the weight of different determining components. For a meta-analysis of earlier work, see Espey [25]. Estimating the effect of price on quantity demanded poses an econometric challenge, known as the simultaneity problem, because quantity demanded and price affect each other simultaneously. Said differently, since suppliers can set the price depending on what they expect the demand to be, the price is not an actual "independent variable". Therefore, a price-response parameter recovered from observational data cannot easily offer a credible estimate of the price effect. In order to overcome this challenge, more recent studies have used experimental methods to investigate the effect of price as well as other factors like social norms and comparisons [26][27][28][29]. Quasi-experimental methods like difference-in-differences and regression discontinuity design have also become more common in recent years as a way to avoid the simultaneity bias; these methods are used to study the effect on demand of price and of interventions such as low-flow equipment. Romano, Salvati, and Guerrini [30] and Morote, Hernández, and Rico [31] are examples of studies that examine the individual determinants of urban water demand in Italy and Spain, respectively. An accessible account of experimental and quasi-experimental econometric methods can be found in Angrist and Pischke [32].
Though not as numerous as residential water forecast studies, we do see examples of papers that specifically focus on water use in the CII sector. Using survey data from large manufacturing plants in New Jersey, Derooy [33] calculated a price elasticity of 0.89. Ziegler and Bell [34] did a similar study for self-supplied firms in Arkansas. Within the context of Canadian manufacturing firms, Renzetti [35] used instrumental variable methods for price to avoid the simultaneity problem in their estimations of the price elasticity of water use. Using a system of simultaneous equations, Babin, Willis, and Allen [36] examined the relationship between water intake and the utilization of other inputs (the degree of substitutability). Using data from 51 industrial plants in France and both seemingly unrelated regression and the feasible generalized least squares methodology, Reynaud [37] demonstrates how the elasticity of water use varies across water sources. As an alternative to an econometric method, Calloway, Schwartz, and Thompson [38] develop a linear programming model in order to analyze the effects of water quality policy on the use of water in ammonia production and on the cost of ammonia.
Another strand of studies pertains to water demand forecasting, to which our study contributes. In these studies, credible identification of parameters in the face of omitted-variable-type problems (such as the simultaneity problem explained above) is not necessarily the main objective. Rather, the priority is to generate accurate water use forecasts for the future in order to formulate policies and guide infrastructure investment decisions financed by tax or rate payer funds. For example, Alhumoud [39] uses 50 years of annual country-level data from Kuwait and time-series methods to generate forecasts 20 years into the future. His model selection method is based on the Box-Jenkins method. Using monthly household-level data from California, Brekke and Larsen [40] demonstrate that modeling water demand via stepwise regression is an accessible alternative to the trend analysis method that is widely used in smaller suburban utilities. We also see examples of studies that use decision support systems (DSSs) from different geographies like China, the U.K., and California [41][42][43]. DSSs are used as a part of integrated frameworks that provide forecast output for different scenarios. Other urban water forecast works from the U.K. include Khatri and Vairavamoorthy [44], who use time series methods and ten years of monthly water use data from Birmingham, and Williamson, Mitchell, and McDonald [45], who use imputed household water use data and a multilinear regression method. For thorough qualitative and quantitative reviews of models and methods, see Donkor, Mazzuchi, and Soyer [46] and Sebri [47], respectively.
In recent years, we see a surge in the studies that utilize state-of-the-art machine learning techniques in urban water use forecasting. Usually, these studies compare the forecast performances of different modeling methods using actual water use data from different parts of the world. Unlike any of the papers cited above, it is common to see out-of-sample forecast performance being used in these studies. For example, with water use data from South Africa, Oyebode and Ighravwe [48] provide a comparison of the performances of artificial neural networks (ANNs) (with two different algorithms), support vector machines (SVMs), and multiple linear regression methods. They find that evolutionary ANN performs better than the rest of the methods, while the regression method outperformed ANN with the conjugate gradient algorithm. House-Peters and Chang [49] use data from Canada to compare the wavelet-bootstrap-neural-network (WBNN) method with moving average and bootstrap-based neural networks. Moving average, exponential smoothing, and ANN models are compared in Kofinas, Mellios, Papageorgiou, et al. [50], who use three years of time series data from a touristic island in Greece. Using urban water use data from southeastern Spain, Herrera, Torgo, Izquierdo, et al. [51] demonstrate that the SVM performs better than multivariate adaptive regression splines and random forests. In his detailed meta-analysis, Sebri [47] shows that forecast methods make a difference in the forecast errors. He states that ANN, Box-Jenkins, and SVM methods, on average, result in lower forecast errors than methods such as multilinear regression or the Kalman filter. Other recent examples of forecast papers that perform a comparison among different techniques, including machine learning algorithms, time series, etc., include Adamowski and Karapataki [52] and Ghiassi, Zimbra, and Saidane [53].
While advanced machine learning and big-data methods can offer advantages, they may not be immediately accessible to smaller water utilities, which may or may not employ in-house data analysts with these skills [40]. The Institute for Water Resources Municipal and Industrial Needs (IWR-MAIN) model is software that has been widely used by utilities to forecast water demand. In this method, the size of each CII sector is estimated using total employment, and CII water use is estimated based on the Standard Industrial Classification (SIC) of sectors. The method uses regression analysis to determine the water intensity of each sector, where the explanatory variables are the number of employees, the price of water and sewer services, and whether or not there was a water conservation program [54]. A nationwide survey of over 3 thousand establishments and surveys of manufacturers from the U.S. Census Bureau and the California Department of Water Resources were utilized to improve the model [55]. The main intuition of this approach is to estimate a "water use coefficient" for each sector, multiply that by the forecasted size of the sector, and then sum up the estimated water use across all sectors. A summary of the historical progression of the IWR-MAIN model can be found in Morales, Heaney, Friedman, et al. [56].
Further, the current version of IWR-MAIN and its application among California utilities is discussed in a 2019 report by Dziegielewski et al. [57]. The approach of IWR-MAIN has inspired similar applications by other utilities. For example, using establishment-level water billing and employment data from Idaho, Cook, Urban, Maupin, et al. [58] calculate SIC-level employment coefficients, which are a weighted average of the per-employee water consumption for the SICs. Then, under various growth scenarios and employment forecasts, they use these coefficients to project water use into future periods. As another example, Morales, Heaney, Friedman, et al. [56] present a CII water use estimation methodology using a rich database of parcel-level consumer attributes and water use billing from Florida.
In addition to software packages, regression-based econometric models are currently being used by the utilities for forecast purposes. See Buck, Auffhammer, Soldati, et al. [11] for a summary of methods being used by a group of large California utilities; they show out-of-sample performance is not commonly used as a model selection criterion and forecast modelers typically only consider a narrow set of models.
In this paper, we show that the forecast performance among models of CII water use can be significantly improved with a relatively small adjustment in the model selection methodology, even if state-of-the-art machine learning algorithms are not used. Using a model space of 352,116 models, we look at how the models that yield the best results based on R-squared and/or other common in-sample-fit criteria such as adjusted R-squared, the Akaike information criterion (AIC), and the Bayesian information criterion (BIC) perform when forecasting out-of-sample. AIC and BIC (the latter also known as the Schwarz criterion) are methods of comparing alternative specifications by adjusting the sum of squared residuals for the sample size and the number of independent variables [59]. See Table 4 for the formulas. We then compare them to those models that would be selected under three different out-of-sample criteria we define.
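To illustrate how these in-sample criteria trade off fit against model size, the sketch below computes textbook Gaussian-likelihood AIC and BIC scores from a model's sum of squared residuals; the exact variants used in the paper are given in its Table 4, and the numbers here are hypothetical.

```python
import numpy as np

def aic_bic(ssr, n, k):
    """Textbook AIC/BIC for a least-squares model under Gaussian errors
    (up to an additive constant): both penalize in-sample fit (the sum of
    squared residuals) by the number of parameters k; BIC's penalty also
    grows with the sample size n. Illustrative sketch only."""
    aic = n * np.log(ssr / n) + 2 * k
    bic = n * np.log(ssr / n) + k * np.log(n)
    return aic, bic

# Two hypothetical models on the same data: the second adds a covariate
# and fits slightly better in-sample, but BIC still prefers the first.
aic1, bic1 = aic_bic(ssr=120.0, n=400, k=5)
aic2, bic2 = aic_bic(ssr=118.5, n=400, k=6)
```

The point of the exercise in this paper is that rankings produced by such in-sample scores need not match rankings by out-of-sample forecast error.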

Preview of the Results
The out-of-sample fit criteria are defined following Auffhammer and Steinhauser [60] and are also commonly used in evaluating models in the field of machine learning. Models are generated through the inclusion and exclusion of different key covariates and the actual versus logged values of the dependent variable. The dependent variable is the water retailer (utility) level total annual CII water use. Covariates include median tier price, manufacturing and service sector employment in the service area of the retailer, and weather variables (maximum temperature, degree days, and precipitation). Degree days are defined as the difference between the daily temperature mean (the high temperature plus the low temperature, divided by two) and 65 °F. In other words, if the temperature mean is above 65 °F, we subtract 65 from the mean, and the result is "degree days" [61].
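The degree-day definition above can be sketched as follows; the function and the three-day example are hypothetical, and a real computation would sum over all days of the year.

```python
def cooling_degree_days(daily_highs_f, daily_lows_f, base_f=65.0):
    """Cooling degree days as described in the text: for each day, take the
    mean of the high and low temperatures (in °F); if the mean exceeds the
    65 °F base, the excess counts toward the total. Illustrative sketch."""
    total = 0.0
    for high, low in zip(daily_highs_f, daily_lows_f):
        mean = (high + low) / 2.0
        if mean > base_f:
            total += mean - base_f
    return total

# Three hypothetical days with means of 70, 65, and 80 °F
# contribute 5 + 0 + 15 degree days.
cdd = cooling_degree_days([80, 70, 90], [60, 60, 70])  # 20.0
```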
In order to account for the fact that the demand for CII water is derived, together with other factors of production, from the demand for final goods and services, we also included real U.S. GDP as a proxy for the overall purchasing power in the economy.
The results indicate that selecting models solely based on in-sample fit will yield poorly performing models when forecasting CII water use out-of-sample. Specifically, we demonstrate that the predictions that are generated by the highest R-squared models are highly dispersed around the actual value, relative to those that are generated by the models with the lowest absolute error. While it is known that models selected on in-sample fit can perform poorly out-of-sample, this paper brings the magnitude of the problem in the CII water forecast context to the attention of water planners. Equally important, the analysis contrasts variation in forecasts generated by prediction models that were selected based on different criteria. This highlights that decision-makers should consider a range of forecasts generated by a suite of the "best"-performing models.
These findings suggest that water planners, whose forecasts are often used to guide water policy, can avoid large errors by taking out-of-sample prediction performance into account when selecting models to forecast CII water use, which is a relatively small adjustment to their procedures. The path followed in this paper is similar to the one used in Auffhammer and Steinhauser [60] for forecasting CO2 emissions. They use 41 years of state-level data to test about 27,000 models and compare the out-of-sample forecasting performances of benchmark models from the related literature and the ones that they find to be best under the aggregate error criterion. They find that benchmark models, which are calibrated against in-sample performance criteria, are likely to overestimate CO2 emissions, which might be consequential in climate policy and international agreements.
In a similar spirit, this study compares the out-of-sample performances of the models that would be selected under various in- and out-of-sample criteria given the available dataset. Our findings highlight that the model selection criteria determine CII water use forecast performance.
The rest of the paper proceeds as follows: the geographical scope of the study, summary of the data, econometric model, and the details of performance criteria are provided in Section 2. Section 3 presents and discusses the results, and Section 4 concludes.

Geographical Scope
The geographical scope of this study is defined by the boundaries of the Metropolitan Water District of Southern California (MWDSC), the largest water utility in the U.S., serving more than 18 million people [14,62]. For a geographical reference, see the map of the region published by the Southern California Association of Governments [63]. The dataset used here is a subset of a larger dataset collected for a study about forecasting single-family residential (SFR) sector water use [11]. The data collection effort, therefore, was focused on the retailers that reported more than 3000 single-family residential accounts, as it is estimated that these retailers account for about 99% of this sector. One hundred fifty-three retailers were contacted within the scope of the study. CII data were obtained from 75 retailers, yielding 709 observations from 25 of the 26 member agencies that are under MWDSC. The only unrepresented member agency is San Marino, which has one of the smallest CII sectors of all member agencies. Table A1 in the Appendix A lists the agencies and the associated retailers. The water retailers in the study are located in Los Angeles, Orange, Riverside, San Bernardino, San Diego, and Ventura counties.

Data Sources
The rate schedules were received directly from retailers, while the water use figures are mostly based on monthly data in the Public Water System Statistics (PWSS) augmented with data received from retailers and aggregated to the calendar year [64]. For the price measure, we use the median tier of the rate schedule.
Location-specific data on average precipitation were obtained through the use of the geographical information and mapping software system, ArcGIS. Spatially referenced boundaries of state and private water districts were obtained from the Cal-Atlas geospatial clearinghouse [65]. These boundaries allowed visualization of each water district polygon using ArcGIS. The points at the centroid of each water system polygon were then geo-referenced. Based on the resulting set of points, the local precipitation data were extracted from rasters provided by the PRISM Climate Group from Oregon State University [66].
In those cases where the retailer level district boundaries were not available, zip codes were used as a geographical proxy. Retailers were assigned to representative zip codes on a case-by-case basis. The centroid of each zip code polygon was geo-referenced, and based on the resulting set of points, local precipitation data were extracted. The precipitation variable in our dataset is in millimeters of rainfall per year.
Data on temperature were obtained in the same manner as the precipitation data described above. Rasters for the temperature data (in degrees Celsius) were obtained from the PRISM Climate Group. The year-round maximum and minimum temperatures are used to calculate retailer-specific cooling degree days.
Total employment within a retailer is computed based on two data sources. Historical annual employment is provided by the metropolitan water district at the member agency level from 1990 to 2010. To calculate employment at the retailer level, we used the Census Zip Code Business Statistics (ZCBS), which reports historical employment estimates at the zip code level from 2004 to 2010. The ZCBS only provides employment numbers for the majority of sectors (it largely excludes non-service-oriented government positions), so total employment is not complete. Therefore, we only use the ZCBS to calculate the share of employment within a member agency due to a particular retailer. We calculate the relevant share using a crosswalk between zip codes and retailer level boundaries, and between zip codes and member agency level boundaries. Finally, to compute a historically based retailer level total employment measure, we multiplied the share of employment within a member agency by the total employment in the member agency obtained from MWD (based on Employment Development Department data). For years prior to 2004, when the ZCBS is unavailable, we assume the retailer level average employment shares from 2004 to 2006. That is, for each retailer, we assume their share of total employment within a member agency is constant between 1994 and 2003.
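The two-step construction above can be sketched as follows; the function name and all input figures are hypothetical, with the ZCBS used only to apportion the agency-level MWD total across retailers.

```python
def retailer_employment(zcbs_retailer_emp, zcbs_agency_emp, agency_total_emp):
    """Sketch of the two-step construction described in the text: the ZCBS
    gives each retailer's share of employment within its member agency,
    and that share is applied to the agency-level total employment from
    the MWD series. All inputs here are hypothetical."""
    share = zcbs_retailer_emp / zcbs_agency_emp
    return share * agency_total_emp

# A retailer with 12,000 of its agency's 48,000 ZCBS-covered jobs gets a
# 25% share of the agency's 60,000 MWD-reported employees.
est = retailer_employment(12_000, 48_000, 60_000)  # 15000.0
```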
A variable measuring GDP is also included in the universe of considered regression models. Unlike the residential sector, water demand in the CII sector is derived, together with other inputs, as a part of the production process. In other words, water demand in the CII sector is driven indirectly by the demand for the goods and services that these sectors offer to consumers. Therefore, CII water use should not only depend on its own price but also on total consumption in the economy, which ultimately depends on income. A national, rather than regional, measure of GDP is utilized as CII water customers may provide goods and services to locations outside of California.
The real GDP data are obtained from the publicly available international macroeconomic data series provided at the USDA website [67]. All monetary figures in this study are standardized to 2000 dollars in order to account for the effect of inflation.
Tables 1 and 2 present the summary statistics of the variables in the training and the forecasting subsamples, respectively. The training sample is composed of data from the years 2000-2005, while the forecast sample is composed of data from the years 2006-2010.

Econometric Model
In the regression models, we follow the general form provided in Equation (1):

q_{tar} = β price_{tar} + µ man.emp._{tar} + σ serv.emp._{tar} + τ tmax_{tar} + π precip_{tar} + γ cdd_{tar} + α_a + η_t + ε_{tar}    (1)

where q_{tar} is the annual water use in the CII sector in year t served by retailer r under agency a; price_{tar} is the median tier price charged; man.emp._{tar} is the number of manufacturing employees; serv.emp._{tar} is the total number of service employees; tmax_{tar} is the average maximum temperature; precip_{tar} is the average annual precipitation; cdd_{tar} is the cooling degree days; α_a are the agency fixed effects; η_t are the time fixed effects (year indicators or GDP; note that year indicators and GDP cannot be used at the same time due to perfect collinearity); and ε_{tar} is the stochastic error term.
The model universe was created using different permutations of dependent and independent variables (and their actual and logged values). There are three main avenues through which new models are added to the model space. The first avenue is the inclusion versus exclusion of the main variables: price, the number of employees in the manufacturing and service sectors within the retailers' boundaries, maximum temperature, degree days, precipitation, and GDP, as well as lagged dependent variables (up to two lags). The second variation is due to the inclusion of variables accounting for heterogeneity with respect to time and the institutions corresponding to different locations and levels of governance. These include agency indicators, time trends (up to a cubic time trend), and year indicators. Finally, further variations are generated using logged versus level dependent variables as well as total quantity versus per employee quantity as the dependent variable. Table 3 summarizes the details. As a result of these permutations, we ended up with 352,116 models. For each model, we calculate common in-sample performance measures, including R-squared, adjusted R-squared, the AIC score, and the BIC score, as well as three out-of-sample performance measures (explained below). Afterwards, the models are sorted based on their performances with respect to each of these criteria, and then the out-of-sample performances of the top 5% in each category are compared.
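The combinatorial growth of such a model space can be sketched with a toy version of the three avenues of variation; the building blocks below are simplified stand-ins for the paper's Table 3 (e.g., they ignore the GDP/year-indicator collinearity restriction), so the toy count differs from the paper's 352,116.

```python
from itertools import chain, combinations

# Hypothetical building blocks mirroring the three avenues of variation:
# which covariates enter, how many lags of the dependent variable enter,
# which time controls enter, and how the dependent variable is defined.
covariates = ["price", "man_emp", "serv_emp", "tmax", "cdd", "precip", "gdp"]
lag_options = [0, 1, 2]                         # lags of dependent variable
trend_options = ["none", "linear", "quad", "cubic", "year_fe"]
dep_var_options = ["level", "log", "level_per_emp", "log_per_emp"]

def powerset(items):
    """All subsets of a list, including the empty set."""
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))

specs = [
    (subset, lags, trend, dep)
    for subset in powerset(covariates)
    for lags in lag_options
    for trend in trend_options
    for dep in dep_var_options
]
# 2^7 * 3 * 5 * 4 = 7680 specifications in this toy version.
```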

Model Performance Criteria
Many different performance criteria can be chosen based on the forecasting and planning goals, such as aggregating absolute or squared errors across different geographical or institutional boundaries. In this paper, we studied three different out-of-sample performance measures: the mean squared forecasting error at both the retailer and agency levels and the overall absolute aggregate error. Table 4 provides the formulations for the in- and out-of-sample performance criteria, and in the following paragraphs we explain them in detail. The first out-of-sample criterion is the retail level mean squared forecast error (MSFE) (the third one from the bottom of Table 4). Here, q_{tar} is the annual CII water use in year t for retailer r that belongs to agency a in the forecasting sample, and q̂_{tar} is the forecasted quantity for the same data point. R_{ta} is the number of retailers for which data was available in agency a in year t, A_t is the number of agencies in the sample in year t, and N is the total number of data points in the forecast sample (N = 310).
The second criterion is the counterpart of the first one at the agency level. We first aggregate the differences between the actual numbers and forecasted numbers at the agency level for each forecast year. Afterward, we take the mean of the squared forecast error over agencies (M = 101).
The final out-of-sample performance criterion is the absolute aggregate forecasting error. We calculate it as follows. All the quantities (both forecasted and actual) are aggregated over the forecast sample for each year, the aggregate of the forecasts is subtracted from the aggregate of the actual numbers, and then the average of the absolute value of the aggregate error is taken over the years.
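The three out-of-sample criteria described above can be sketched as follows; the exact formulations are in Table 4, and the function names here are illustrative.

```python
import numpy as np

def retail_msfe(actual, forecast):
    """Retailer level MSFE: average squared forecast error over all
    retailer-year observations in the forecast sample."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return np.mean((actual - forecast) ** 2)

def agency_msfe(actual, forecast, agency_year):
    """Agency level MSFE: errors are first summed within each agency-year
    cell, then the squared aggregated errors are averaged over cells."""
    errs = {}
    for a, f, key in zip(actual, forecast, agency_year):
        errs[key] = errs.get(key, 0.0) + (a - f)
    return np.mean([e ** 2 for e in errs.values()])

def absolute_aggregate_error(actual, forecast, year):
    """Absolute aggregate error: aggregate both series within each year,
    then average the absolute yearly aggregate errors over the years."""
    agg = {}
    for a, f, y in zip(actual, forecast, year):
        da, df = agg.get(y, (0.0, 0.0))
        agg[y] = (da + a, df + f)
    return np.mean([abs(da - df) for da, df in agg.values()])
```

For instance, two retailers in one agency whose errors cancel contribute nothing to the agency level MSFE or the aggregate error, but still contribute to the retailer level MSFE.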
One important detail to note is the comparability of the performance criteria across the models with different dependent variables (i.e., level vs. logged). It is important to establish this comparability of the goodness-of-fit measures across the models to be able to make meaningful statements about their relative performances. In order to do that, the performance measures for the models with a logged dependent variable had to be transformed in the following manner [68]. After the models with a logged dependent variable are estimated, the fitted values are exponentiated. Then, the actual quantities are regressed (without a constant term) on these exponentiated values. The fitted values are obtained from this second regression. These fitted values are used to calculate the prediction errors. The square of the correlation coefficient between the actual and the fitted values within the training sample is comparable to R-squared.
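The retransformation steps above can be sketched as follows; the no-constant second-stage regression has a closed-form slope, and the function name is illustrative.

```python
import numpy as np

def retransform_log_fitted(actual, fitted_log):
    """Sketch of the comparability adjustment described in the text:
    exponentiate the fitted values from a log model, regress the actual
    quantities on them with no constant (slope = sum(x*y) / sum(x*x)),
    and return the rescaled fitted values used for error calculations."""
    x = np.exp(np.asarray(fitted_log, float))
    y = np.asarray(actual, float)
    slope = np.sum(x * y) / np.sum(x * x)  # OLS through the origin
    return slope * x
```

The squared correlation between the actual quantities and these rescaled fitted values is then comparable to the R-squared of a level model.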
A separate forecast adjustment is made due to the existence of models both with total quantity and with quantity per employee as dependent variables. AIC and BIC scores are calculated using the sum of squared errors (see Table 4 for the formula). Therefore, unlike R-squared and adjusted R-squared, the magnitudes of the AIC and BIC scores depend on the scale of the variables. For this reason, the scale of the error needs to be adjusted for a fair comparison of different models. All AIC and BIC scores are calculated using the deviance between the actual total quantity and the total quantity implied by the model. In other words, if the model is logged, the AIC and BIC scores are calculated from the squared errors obtained from the fitted values described above. If the dependent variable is per employee water quantity, the predicted total quantity is obtained by multiplying the fitted value by the total employee number. Finally, the models are ranked based on each one of the criteria in our list. Table 5 summarizes and compares the performances of the top 5% of models in each criterion for the models where the dependent variable is the total quantity. Every column (except for the first column) refers to a subset of all of the models in our model universe. Each row gives the mean and standard deviation of the performance measures of the top 5% models based on the criteria listed in that row within the subset given by the column. For example, the numbers in the first row of the second column present the mean and the standard deviation of the "Retail Level MSFE" of the models that rank in the top 5% in terms of the "Retail Level MSFE" category among only models that use a level dependent variable (as opposed to a logged dependent variable). This categorization allows us to observe the association between the inclusion of certain variables in a model and forecast performance.
We see that in models for which the dependent variable is total quantity, log models (3rd versus 1st and 2nd columns) displayed better out-of-sample performance on average, while in-sample performances were similar for models across all categories.

Results and Discussion
One notable result is that models without any lagged variables did much worse than models with lagged variables overall (comparing columns 1 versus 5) for almost all criteria. This is not surprising given the serially correlated nature of water use. Additionally, we see that, though it may reduce the noise, adding agency fixed effects did not improve forecasting performance.
Since the data are annual, we were forced to choose between year indicator variables and the (lagged) per capita GDP, since including more than one of these covariates at the same time would result in perfect collinearity. In the models that use year fixed effects, the projection needs special consideration, as we do not have a clear way to forecast the year fixed effects for future years. For simplicity, we treated all years in the forecast sample as the end year of the training sample.
Comparing the final two columns, we see that the performances of the models with year fixed effects and with per capita GDP are fairly comparable for both in- and out-of-sample criteria. Therefore, in addition to providing a proxy for the size of the economy in forecasting indirect demand for water, measures of GDP appear to largely capture year fixed effects. The results are very similar for models in which the dependent variable is quantity per employee (Table A2 in the Appendix B). Table 6 compares the absolute aggregate error of the models ranked within the top 5% of our criteria. For example, the number in the second row and the first column of Table 6 is the mean of the aggregate forecast error (in thousand acre-feet) of the models that are in the top 5% based on the "Retailer Level MSFE" criteria. We see that the models that score high based on in-sample-fit criteria did poorly in aggregate compared to the models that are selected based on the out-of-sample criteria. While the qualitative result was expected given the selection criteria, the key point is the magnitude of the difference between the means of the aggregate error under different categories. The models that score high in the out-of-sample performance criteria yielded much lower absolute errors (12.72 thousand acre-feet for the absolute aggregate error) and a narrower distribution (standard deviation of 1.66), whereas the models that had the highest R-squared value, for example, did poorly on average (mean absolute aggregate error: 535.90 thousand acre-feet), and the dispersion of their performance was over two orders of magnitude larger (standard deviation of 918.88). The results are similar for the comparison of the models for which the dependent variable was quantity per employee (Table A3 in the Appendix B).
Figure 1 and Figure 2 help visualize the point made in Table 6. In these graphs, the black dashed lines represent the highest and lowest forecasts generated by the models that ranked among the top 5% based on R-squared and retailer level MSFE, respectively, while the red solid line shows the actual values. Graphs for the rest of the criteria carry the same message and are provided in Figure A1 in Appendix B.
Notice the wide gap between the lowest and the highest CII water use forecasts in the graph displaying forecasts from the models with the top R-squared scores. We see in these figures that the CII water use forecasts generated by the models selected based on in-sample criteria are much more widely dispersed than those selected based on out-of-sample forecast criteria, signifying a large uncertainty in the forecast accuracy due to model choice.

Figure 3. Distribution of 2010 forecasts of the top 5% of models based on R-squared.
To provide further visual insight, Figures 3 and 4 show the actual aggregate (represented with the red spike) and the histogram of the aggregate of forecasts for the models that are within the top 5% of the R-squared and absolute aggregate error criteria for year 2010. The graphs for the other years have very similar characteristics. They are provided in Figure A2 in Appendix B.

Figure 2. Highest and lowest projections generated by best models based on retailer level mean squared forecast error (MSFE) versus the actual value. Dependent variable is total quantity.
Here, in addition to the dispersion, we see that the average of the forecasts generated by models chosen with in-sample goodness-of-fit is also further away from the true value.
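The selection step itself is straightforward to implement. The sketch below (invented numbers, not the study's 352,000 models) ranks candidates by a criterion, keeps the top 5%, and summarizes the aggregate errors of the survivors:

```python
# Hypothetical sketch: rank candidate models by a selection criterion, keep
# the top 5%, and summarize the aggregate forecast errors of the survivors.
import statistics

def summarize_top(models, key, share=0.05, best_is_high=False):
    """Mean and std. dev. of aggregate error among the top `share` of models."""
    ranked = sorted(models, key=key, reverse=best_is_high)
    k = max(1, int(len(ranked) * share))
    errs = [m["agg_error"] for m in ranked[:k]]
    return statistics.mean(errs), statistics.pstdev(errs)

# 60 invented candidate models: in-sample R-squared rises with i, while the
# out-of-sample aggregate error follows an unrelated pattern
models = [{"r2": 0.80 + 0.003 * i, "agg_error": (i % 7) + 0.1 * i}
          for i in range(60)]

by_r2 = summarize_top(models, key=lambda m: m["r2"], best_is_high=True)
by_err = summarize_top(models, key=lambda m: m["agg_error"])
print(by_r2, by_err)  # selecting on in-sample fit yields the larger mean error
```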


Conclusions
The historic mismatch between the location of water supply and demand has shaped water infrastructure investment decisions and planning activities in the United States, one of the largest water users in the world in terms of both total and per capita water use. These decisions are often guided by water forecast studies from utilities that rely on conventional econometric methods and/or black-box software packages when generating CII water use forecasts. Splitting a dataset into a training set and a validation set is a prominent idea in the field of statistical learning. It improves the accuracy of out-of-sample forecasts because forecasting, by definition, requires using the model to predict new data points. In this paper, we demonstrate that using out-of-sample forecast performance criteria can significantly improve CII water use forecast accuracy and reduce forecast uncertainty due to modeling. Our study context is water use within the commercial, institutional, and industrial sector under MWDSC, the largest water utility in the U.S. CII water use is an understudied component of overall water demand because of a lack of data and its complex nature. Yet its share of publicly provided water use is expected to grow as water governing institutions evolve. As CII water use becomes a more significant portion of public water deliveries, so too will its place in water-saving conservation policies designed to adjust to changes in water supply conditions, for example, in response to climate change.
Using over 352,000 models and rich panel data, we compared the CII water use forecasting performance of models selected based on in-sample and out-of-sample goodness-of-fit criteria. Note that finding the best forecast method or studying the relative importance of different variables in explaining CII water use is not the objective of this paper, as these topics are well documented both in academic papers and in studies conducted by utilities around the world. While machine learning methods and large datasets offer advantages in forecasting, they may not yet be accessible to some utilities, especially smaller suburban ones. The goal of our study is to demonstrate that a relatively straightforward adjustment to the model selection criteria significantly improves forecast performance even when no advanced machine learning methods beyond regression models are used and the dataset is composed of a few hundred observations.
Policymakers and planners who rely on water consumption forecasts for the CII sector should therefore pay attention to the out-of-sample performance of the models utilized in their analyses. If a water governing body chooses to apply econometric methods to the data available for its local region, it should avoid selecting models of CII water use based on in-sample fit, as this may yield suboptimal forecast accuracy for the CII sector. Further, decision-makers would be wise to consider uncertainty in forecasts, which may be considerable, especially in the CII sector. Instead of considering forecasts from one single model, we advise considering projections from a suite of models selected for their out-of-sample forecasting accuracy in training data sets.
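A toy sketch of this recommendation (all forecast values invented): reporting the range and mean across a suite of top-ranked models conveys the projection and its model-choice uncertainty together.

```python
# Toy sketch: report a forecast range and mean from a suite of top-ranked
# models instead of a single model's point forecast (values invented).
suite = [12.1, 12.8, 11.9, 12.4, 13.0]  # top models' forecasts (thousand acre-feet)

low, high = min(suite), max(suite)
mean = sum(suite) / len(suite)
print(f"CII forecast: {mean:.2f} (range {low:.1f}-{high:.1f})")
```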
Author Contributions: D.U. wrote the STATA code for the formal analysis and visualization, conducted the literature review, and wrote and revised the manuscript; S.B. formulated the methodology, curated the data, and edited the manuscript. All authors have read and agreed to the published version of the manuscript.

Figure A1. Distribution of the forecasts around the actual value. Panel (a) depicts the distribution of the models selected based on the AIC criterion, while panel (b) shows that of the models selected based on the lowest absolute aggregate error.