Factor Analysis and Estimation Model of Water Consumption of Government Institutions in Taiwan

Models for adequately estimating water consumption in Taiwanese government institutions were developed to help the government predict and account for its water needs more accurately. A correlation coefficient matrix of associated factors was constructed based on records per unit of water consumption, describing the impact of various water consumption factors. To understand and quantify the effect of the impact factors, linear and nonlinear regression models, as well as an artificial neural network model, were adopted. To account for data variability, the data used for modelling were either fully or partially adopted. For partial adoption, the quartile method was employed to remove any outliers. Analysis of the factors affecting water consumption revealed that the building floor area and the number of personnel in an organization had the largest impact on estimated consumption, followed by the number of residential personnel. Because the coefficients of variation for the green irrigated area and the number of consulting personnel were low, using the total area and the total number of personnel as water consumption factors decreased the effectiveness of the model.


Introduction
The subtropical island nation of Taiwan is affected by monsoons, plum rains, and typhoons. In the northwest Pacific, where Taiwan is located, four typhoons occur per year on average. Annual precipitation in Taiwan ranges from 1600 to 3200 mm. Although it is reasonable to expect, given this annual rainfall, that Taiwan has abundant fresh water, 70% of the precipitation landing on the plains is lost each year as runoff to the sea and evaporation. Most precipitation occurs in summer and autumn, with 78% from plum rains and typhoons between May and October. Additionally, because the population density is high at 647 people per km², the average annual rainfall per capita in Taiwan is only 4074 m³, which is about one-fifth of the global average rainfall per capita. Furthermore, the average price of water is USD 0.36 per thousand liters, which is less than 0.1% of the nation's per capita income. Consequently, the people of Taiwan may take water for granted and not value it as a natural resource [1][2][3], as water consumption per capita in Taipei reaches as high as 335 L per day.
Global warming and climate change are threatening water resources. The volume of reservoirs is limited, much of Taiwan's terrain is precipitous, and increasingly more areas are being designated as environmental protection areas; thus, balancing the supply of water with demand is becoming more difficult [4,5]. Because of water use in irrigation and filtration, domestic households do not consume the highest percentage of water in Taiwan, but there is still a water shortage crisis. Thus, the promotion of water conservation and the enhancement of water consumption efficiency are indispensable.
To ensure sustainable water consumption, the creation and comparison of different domestic water consumption models may provide a reference for decision-makers in charge of implementing water policy. A precise water consumption estimation model for government institutions in Taiwan is therefore urgently needed. Water consumption forecasts are affected by numerous factors such as geographical and meteorological phenomena, economic factors, and methods of water consumption. Forecasts simulated using traditional statistical methods may lack sufficient accuracy [6] because water consumption data exhibit varying degrees of nonlinearity. Therefore, a method or function that does not require specifically structured data is necessary.
The aim of this study was five-fold: (a) to examine the correlation between annual water consumption and the factors affecting water consumption at each government institution; (b) to identify factor differences between different estimation methods; (c) to establish different models suitable for different government institutions; (d) to analyze the accuracies of different water consumption estimation models; and (e) to develop a model that adequately estimates water consumption.

Materials and Methods
Related studies can be classified into three major categories: consideration of water consumption impact factors, regression model analyses, and artificial neural network (ANN) analyses.

Water Consumption Impact Factors
Several studies [6][7][8] have noted the significant impact of various water consumption factors including previous water demand, number of family members, age of family members, garden size, frequency of irrigation, and the water consumption of agriculture.
Previous water consumption data have been considered the key to estimating future consumption in numerous studies. To manage water consumption effectively, each institution's water consumption data must be collected [9,10]. Creating a suitable model for Taiwanese domestic water consumption requires identifying the major impact factors; thus, step-by-step filtering was used in this study to select them. Moreover, to avoid multicollinearity problems, all factors were considered in the regression models.

Regression Model
Numerous studies have employed linear and nonlinear regression to establish water consumption models. Some based on linear regression have included rainfall, air temperature, family income, and the cost of water as independent variables. Regression models have also been used to establish models for related topics such as the water utility market structure [11][12][13][14]. A typical linear regression model of water consumption is expressed as

$$y = \sum_{i=1}^{n} w_i x_i + c \qquad (1)$$

where $y$ is the unit water consumption; $w_i$ is the weight of the $i$th factor; $x_i$ is an impact factor of water consumption; and $c$ is a constant. As the model is linear, its advantages and disadvantages are easy to assess; however, the true relationships between water consumption and impact factors are not linear, but more complex. Hence, a model using one dependent variable and multiple predictive variables does not yield accurate forecasts. Therefore, nonlinear regression can also be employed

$$y + 1 = c_0 \prod_{i=1}^{n} (x_i + 1)^{c_i} \qquad (2)$$

where $c_i$ is the weight of regression. For rapid and convenient calculation, Equation (2) can be reformulated through logarithmic conversion

$$\log(y + 1) = \log(c_0) + \sum_{i=1}^{n} c_i \log(x_i + 1) \qquad (3)$$

or

$$\log(y + 1) = c_0' + c_1 \cdot \log(x_1 + 1) + c_2 \cdot \log(x_2 + 1) + \cdots + c_n \cdot \log(x_n + 1) \qquad (4)$$

where $c_0' = \log(c_0)$.
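The logarithmic conversion described above turns the nonlinear model into an ordinary least-squares problem. The following is a minimal sketch of that idea, not the study's implementation: it assumes an offset of 1 inside each logarithm and natural logarithms, and the function names and synthetic data are hypothetical.

```python
import numpy as np

def fit_log_linear(X, y):
    """Fit log(y + 1) = c0' + sum_i c_i * log(x_i + 1) by least squares."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix: a column of ones for the intercept c0', then log(x_i + 1).
    A = np.column_stack([np.ones(len(y)), np.log1p(X)])
    coef, *_ = np.linalg.lstsq(A, np.log1p(y), rcond=None)
    return coef[0], coef[1:]

def predict_log_linear(c0_prime, c, X):
    """Reverse the logarithmic conversion to recover y on its original scale."""
    log_y = c0_prime + np.log1p(np.asarray(X, dtype=float)) @ c
    return np.expm1(log_y)

# Synthetic data that follow y + 1 = 2 * (x1 + 1)^0.5 * (x2 + 1)^1.5 exactly.
rng = np.random.default_rng(0)
X_demo = rng.uniform(1.0, 100.0, size=(50, 2))
y_demo = 2.0 * (X_demo[:, 0] + 1) ** 0.5 * (X_demo[:, 1] + 1) ** 1.5 - 1.0

c0_prime, c = fit_log_linear(X_demo, y_demo)
c0 = np.exp(c0_prime)  # undo c0' = log(c0)
```

Because predictions are mapped back through the exponential, small errors on the logarithmic scale can become large errors on the original scale, which is worth keeping in mind when comparing RMSE values between the linear and nonlinear models.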

Artificial Neural Networks (ANNs)
Errors are common when traditional forecast methods such as time extrapolation are used. Although widely used in the early 20th century, time extrapolation is rarely used in current studies. ANNs are fast and flexible methods for effectively forecasting domestic water demand [15].
ANNs have been used for estimation models and forecasting in numerous fields. An advantage of ANNs is that they can correlate large and complex datasets [16,17]. An ANN was previously used to develop and assess a drinking water quality model, and a multilayer perceptron ANN was required in the hydrological modelling [18].

Model of the Current Study
Over the past few decades, there has been a dramatic increase in published research on sustainable water consumption, with most studies focusing on different industrial contexts. Few studies have discussed water consumption by individual government institutions. Despite the recent adoption of policies in Taiwan aimed at actively promoting water conservation, water demand has not substantially decreased because water consumption efficiency has not been enhanced (Table 1). This paper reports the results of a five-phase study: exploring the theoretical basis for the estimation model, establishing a framework, collecting data, analyzing simulation results, and deriving conclusions. The subjects considered were government institutions located on Taiwan Island, the Penghu Islands, the Kinmen Islands, and the Matsu Islands, all of which are supplied with faucet water. Our data consisted of 2611 units taken from water consumption data reported by government institutions since 2006. As the original database contained numerous categories of government institutions, these were grouped into 6 primary categories and 47 minor categories (Table 2). Twenty-two independent variables were adopted in this study (Table 3). The original database was sufficiently large to ensure the accuracy of the outlier effect models and data analysis. The quartile outlier method was adopted in this study, and linear regression, nonlinear regression, and ANN models were developed for each outlier effect model. To keep these models consistent and comparable, stepwise regression was used to select the independent variables, with each variable entered into the regression alongside the other variables one by one. The advantage of this approach was that it avoided multicollinearity among the independent variables, thus preventing unstable regression parameters.
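As an illustration of the quartile outlier method, the sketch below flags records whose unit water demand falls outside the conventional quartile fences. The multiplier of 1.5 times the interquartile range is an assumption (the text does not state the fence width), and the function name and sample records are hypothetical.

```python
import numpy as np

def quartile_outlier_mask(values, k=1.5):
    """Return True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values >= q1 - k * iqr) & (values <= q3 + k * iqr)

# Hypothetical records: annual consumption and number of staff per institution.
consumption = np.array([1200.0, 1500.0, 1100.0, 1300.0, 90000.0, 1400.0])
staff = np.array([40.0, 50.0, 35.0, 45.0, 42.0, 48.0])

q_n = consumption / staff          # water demand per number of staff
keep = quartile_outlier_mask(q_n)  # partial adoption keeps only these rows
```

Here the fifth record, which could plausibly be a data-entry error, is the only one excluded; the remaining rows would then be passed to the regression and ANN models.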
The ANN used in this study was the backpropagation neural network (BPNN), which is the most classic and general training algorithm. It is also effective for multilayer, feed-forward, supervised learning problems in different industries [19]. A constructive algorithm was used to determine the number of neurons in the hidden layer, which was initially set to one and gradually incremented until the most suitable number was determined [20]. The output was then expressed as

$$h_j = f\left(\sum_{i} w_{ij} x_i + w_{0j}\right) \qquad (5)$$

$$y_k = f\left(\sum_{j} v_{jk} h_j + v_{0k}\right) \qquad (6)$$

where $f(\cdot)$ is a transfer function; $x_i$ is the input; $w_{ij}$ and $v_{jk}$ are the weights; and $w_{0j}$ and $v_{0k}$ are the biases. The function $f(\cdot)$ is a mapping rule for converting input into output. The most commonly adopted nonlinear conversion function in BPNN studies is the binary logistic sigmoid

$$f(x) = \frac{1}{1 + e^{-x}} \qquad (7)$$

where $f(x) \in [0,1]$. To obtain better BPNN parameters, the difference between $y_k$ (output value) and $t_k$ (target value) is reduced through

$$E = \frac{1}{2}\sum_{k} (t_k - y_k)^2 \qquad (8)$$

$$\Delta w = -\eta \frac{\partial E}{\partial w} \qquad (9)$$

When gradient descent was used, a common problem was that convergence was fed back to only part of the network rather than the whole network. To increase the learning rate and accuracy, a momentum term was added to avoid oscillation during convergence. The $m$th weight adjustment can be expressed as

$$\Delta w(m) = -\eta \frac{\partial E}{\partial w} + \alpha \, \Delta w(m-1) \qquad (10)$$

where $\eta$ is the learning rate of the gradient descent method; and $\alpha$ is the momentum factor. To fit the range of the transfer function, data were normalized using the max-min mapping method. For a minimum and maximum of the transfer function $D_{\min}$ and $D_{\max}$, the minimum and maximum inputs in the database were $x_{\min}$ and $x_{\max}$, respectively

$$N(x) = D_{\min} + \frac{x - x_{\min}}{x_{\max} - x_{\min}} \left(D_{\max} - D_{\min}\right) \qquad (11)$$

where $N(x)$ is the normalized factor. Equation (11) can be reversed as

$$\hat{x} = x_{\min} + \frac{\hat{N}(x) - D_{\min}}{D_{\max} - D_{\min}} \left(x_{\max} - x_{\min}\right) \qquad (12)$$

where $\hat{N}(x)$ and $\hat{x}$ are estimates of $N(x)$ and $x$, respectively.
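Two of the steps described above, max-min normalization and the momentum-augmented weight update, are simple enough to sketch directly. This is an illustrative sketch only: the default target range of [0, 1] is assumed to match the logistic sigmoid, and the learning rate, momentum factor, function names, and sample values are hypothetical rather than taken from the study.

```python
import numpy as np

def normalize(x, x_min, x_max, d_min=0.0, d_max=1.0):
    """Map x from [x_min, x_max] onto the transfer-function range [d_min, d_max]."""
    return d_min + (x - x_min) / (x_max - x_min) * (d_max - d_min)

def denormalize(n, x_min, x_max, d_min=0.0, d_max=1.0):
    """Reverse the mapping to recover a value on the original scale."""
    return x_min + (n - d_min) / (d_max - d_min) * (x_max - x_min)

def momentum_step(weight, gradient, previous_delta, eta=0.1, alpha=0.9):
    """One gradient-descent update with a momentum term."""
    delta = -eta * gradient + alpha * previous_delta
    return weight + delta, delta

floor_area = np.array([250.0, 1000.0, 4000.0])  # hypothetical inputs
scaled = normalize(floor_area, floor_area.min(), floor_area.max())
restored = denormalize(scaled, floor_area.min(), floor_area.max())

# Two consecutive updates with the same gradient: the second step is larger
# because the momentum term carries over part of the previous adjustment.
w, delta = momentum_step(weight=0.5, gradient=0.2, previous_delta=0.0)
w, delta = momentum_step(w, gradient=0.2, previous_delta=delta)
```

The reverse mapping matters in practice because the network's outputs are on the normalized scale and must be converted back before they can be compared with recorded water consumption.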

Model Efficiency Indexes
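When the three methods were compared, the R² of the ANN was clearly the highest. However, R² alone was not sufficient for judging which method was more suitable. Five model efficiency indices were employed to determine the suitability of each model: the mean absolute deviation (MAD), root mean squared error (RMSE), revised Theil inequality coefficient (RTIC), correlation coefficient (CC), and coefficient of efficiency (CE), defined as

$$\mathrm{MAD} = \frac{1}{N}\sum_{i=1}^{N} \left| Q_i - \hat{Q}_i \right| \qquad (13)$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left( Q_i - \hat{Q}_i \right)^2} \qquad (14)$$

$$\mathrm{RTIC} = \frac{\sqrt{\sum_{i=1}^{N} \left( Q_i - \hat{Q}_i \right)^2}}{\sqrt{\sum_{i=1}^{N} Q_i^2}} \qquad (15)$$

where $N$ is the total number of units; $Q_i$ is the real water consumption; and $\hat{Q}_i$ is the estimated water consumption.

$$\mathrm{CC} = \frac{\sum_{i=1}^{N} \left( Q_i - \bar{Q} \right)\left( \hat{Q}_i - \bar{\hat{Q}} \right)}{\sqrt{\sum_{i=1}^{N} \left( Q_i - \bar{Q} \right)^2 \sum_{i=1}^{N} \left( \hat{Q}_i - \bar{\hat{Q}} \right)^2}} \qquad (16)$$

$$\mathrm{CE} = 1 - \frac{\sum_{i=1}^{N} \left( Q_i - \hat{Q}_i \right)^2}{\sum_{i=1}^{N} \left( Q_i - \bar{Q} \right)^2} \qquad (17)$$

where $\bar{Q}$ is the mean of $Q_i$; and $\bar{\hat{Q}}$ is the mean of $\hat{Q}_i$.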
Of the five efficiency indices, MAD, RMSE, and RTIC indicated higher efficiency as they approached zero. As CC approached one, the simulated and actual values became more closely correlated, whereas CE approaching one indicated higher precision.
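The five indices can be computed together from paired actual and estimated values. The sketch below uses commonly cited definitions, which are assumptions here: in particular, RTIC is taken as the root of the summed squared errors divided by the root of the summed squared actual values, and CE as one minus the ratio of squared error to the variance of the actual values. The function name and sample values are hypothetical.

```python
import numpy as np

def efficiency_indices(actual, estimated):
    """Compute MAD, RMSE, RTIC, CC, and CE for paired observations."""
    q = np.asarray(actual, dtype=float)
    q_hat = np.asarray(estimated, dtype=float)
    err = q - q_hat
    return {
        "MAD": np.mean(np.abs(err)),
        "RMSE": np.sqrt(np.mean(err ** 2)),
        # Assumed form of the revised Theil inequality coefficient.
        "RTIC": np.sqrt(np.sum(err ** 2)) / np.sqrt(np.sum(q ** 2)),
        "CC": np.corrcoef(q, q_hat)[0, 1],
        "CE": 1.0 - np.sum(err ** 2) / np.sum((q - q.mean()) ** 2),
    }

# A perfect estimate and one with a constant bias of 10 units.
perfect = efficiency_indices([100.0, 200.0, 300.0], [100.0, 200.0, 300.0])
biased = efficiency_indices([100.0, 200.0, 300.0], [110.0, 210.0, 310.0])
```

The biased case shows why CC alone is not enough: a constant offset leaves CC at one, while MAD, RMSE, RTIC, and CE all register the error.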

Results
For multiple regression models, selecting suitable factors that were consistent and comparable was crucial; thus, each water consumption factor was tested against the water consumption data through a correlation analysis. The top six correlations between v17 and other water consumption factors were: v18, v05, v03, v07, v09, and v06. As v18 was converted from v17, it was not included in the analysis. Given that collinearity in the design matrix can result in inaccurate regression model estimates, v19 and v21 were excluded from the initial estimations due to the high collinearity between v19, v21, and v05. Usage of faucet water (v11) was one for all working databases; therefore, v11 was also eliminated.
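The correlation screening described above can be sketched as a ranking of candidate factors by the absolute value of their Pearson correlation with water consumption. The factor labels and values below are hypothetical and do not correspond to the variables in Table 3.

```python
import numpy as np

def rank_by_correlation(factors, target):
    """Sort factors by the absolute Pearson correlation with the target."""
    scores = {
        name: float(np.corrcoef(values, target)[0, 1])
        for name, values in factors.items()
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Hypothetical factor columns for five institutions.
water_use = [1.0, 2.0, 3.0, 4.0, 5.0]
candidates = {
    "factor_a": [2.0, 4.0, 6.0, 8.0, 10.0],  # perfectly correlated
    "factor_b": [5.0, 3.0, 4.0, 1.0, 2.0],   # strongly negative
    "factor_c": [1.0, -1.0, 0.0, -1.0, 1.0], # uncorrelated
}
ranking = rank_by_correlation(candidates, water_use)
```

The same correlation matrix can also be inspected between pairs of candidate factors, which is one way to spot the kind of collinearity that led to variables being excluded from the initial estimations.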
Through step-by-step filtering, independent variables that failed a t test (i.e., |t| < 1.96) were eliminated one by one. The linear regression and nonlinear regression models developed in this study, which considered 2611 data inputs, are shown in Equations (18) and (19), respectively. The R values of these models were 0.665 and 0.692, respectively. When the ANN was employed to simulate the models, 100 random data inputs were sampled to act as a verification sample. The number of hidden layers was determined through trial and error over a range from 1 to 20, where the upper limit was calculated from [(input layer = 9) + (output layer = 1)] × 2. To determine the lowest RMSE and highest R, a constructive algorithm was used. Eight hidden layers were found to result in the lowest RMSE, as depicted in Figure 1. The R and RMSE of this model were 0.929 and 41,636, respectively.
Because of possible typographical errors in the data used in this study, outliers for water demand per floor space unit (qA), water demand per number of staff (qN), and water demand per number of staff and per floor space unit (qAN) were considered. The quartile outlier method was first employed for the qA data to obtain the linear regression model, Equation (20). The R of this linear regression model with outliers removed under qA was 0.710. Equation (20) was then modified to an improved nonlinear regression model. The R of this nonlinear regression model with outliers removed under qA was 0.699. With eight hidden layers in the ANN, the R was 0.904.
Using the same quartile outlier method, the outliers under qN were removed. Under this condition, the linear regression, nonlinear regression, and ANN models were obtained. The linear regression model with outliers removed under qN is shown in Equation (22). Under this condition, with eight hidden ANN layers, the R was 0.953.
Furthermore, outliers under qAN were considered. With the quartile outlier method, the linear regression model was found to be identical to Equation (22), with R = 0.688. Similarly, the nonlinear regression model was identical to Equation (23), with R = 0.720. Eight was again the most suitable number of hidden layers, and R was 0.866.
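The step-by-step filtering used to build the regression models can be sketched as backward elimination: fit the model, drop the factor with the smallest absolute t statistic if it falls below 1.96, and refit. This is one plausible reading of the procedure rather than the study's exact implementation; the function name, factor labels, and small deterministic data set are hypothetical.

```python
import numpy as np

def backward_eliminate(X, y, names, t_crit=1.96):
    """Drop the weakest factor one at a time until every |t| >= t_crit."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = list(range(X.shape[1]))
    while keep:
        # Ordinary least squares with an intercept column.
        A = np.column_stack([np.ones(len(y)), X[:, keep]])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        dof = len(y) - A.shape[1]
        cov = (resid @ resid / dof) * np.linalg.inv(A.T @ A)
        # t statistics of the factor coefficients (intercept excluded).
        t_stats = beta[1:] / np.sqrt(np.diag(cov)[1:])
        weakest = int(np.argmin(np.abs(t_stats)))
        if abs(t_stats[weakest]) >= t_crit:
            break
        keep.pop(weakest)
    return [names[i] for i in keep]

# Consumption depends on "staff" only; "unrelated" and the noise are
# constructed to be orthogonal to it, so the outcome is deterministic.
staff = np.arange(1.0, 9.0)
unrelated = np.array([1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0])
noise = 0.1 * np.array([1.0, 1.0, -1.0, -1.0, -1.0, -1.0, 1.0, 1.0])
consumption = 3.0 * staff + noise

selected = backward_eliminate(
    np.column_stack([staff, unrelated]), consumption, ["staff", "unrelated"]
)
```

In this toy case only the factor that actually drives consumption survives the filtering; with real data, the threshold of 1.96 corresponds to a two-sided test at roughly the 5% level for large samples.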
As previously mentioned, full adoption and partial adoption models were estimated. Given that the quartile outlier method for partial adoption is similar to that used to estimate the energy usage index in Taiwan, the use of raw water demand data to establish a model of water consumption was found to be unsuitable. Therefore, the outliers determined in the water demand per floor space unit, water demand per number of staff, and water demand per number of staff and per floor space unit were ignored. This outlier removal method was expected to improve the accuracy of the established water consumption model. Table 4 details the performance of each water demand model for full and partial adoption, with the linear regression, nonlinear regression, and ANN models employed. Five efficiency indices were used to gauge model performance. The ANN model with outlier removal under water demand per number of staff was the most accurate model for estimating water consumption by government institutions in Taiwan, demonstrating the closest fit to the actual data. Considering all five model efficiency indices, the descending order of efficiency of these approaches was as follows: excluding outliers under qN > excluding outliers under qA > excluding outliers under qAN > full adoption. The total efficiency for qAN was low because of a factor multiplication effect (vA = v03 + v04; vN = v05 + v06 + v07). Considering the MAD index, all three models were more accurate when the quartile outlier method was implemented to remove outliers under qN. The RMSE for the nonlinear regression model was higher than that for the linear regression model, which might be attributable to the nonlinear regression model being reversed from the logarithmic scale, thereby magnifying any deviation. For the RTIC index, which indicates higher precision as it approaches 0, the ANN model was identified as the most efficient. The qN ANN model was also the most precise model when the RTIC index was considered.
The CC index of the qN ANN model was 0.9528, which was the highest among all the models. Therefore, outlier removal under qN using an ANN was the most suitable model for estimating water consumption.

Conclusions
The data employed in this study concerned the water consumption of all government institutions in Taiwan. Linear regression, nonlinear regression, and an ANN were adopted to establish a water consumption estimation model. The quartile outlier method was also used to determine the effect of full or partial adoption of data on prediction accuracy. The major factors influencing water consumption were divided into four categories: area of water demand (floor and irrigation areas); water demand population (numbers of staff, visitors, and accommodated personnel); usage of equipment with high water consumption (kitchens and swimming pools); and usage of non-faucet water sources (i.e., groundwater). In each case, the removal of outliers under qN with an ANN was the most accurate model. Furthermore, adopting the quartile outlier method maintained the median and effectively decreased data variability.
The school (education) category was identified as consuming the most water. The school category comprised 1415 units, which accounted for most of the database in this study. Because educational institutions dominated the data and the same model was used for the other types of institutions, the model was most suitable when qN outliers were removed, as the qN ANN model provided the best fit within the school category. An improved model that considered other categories could be established if more complete data on other institutions were available. A classic and general ANN model was employed in this study; thus, the activation function and number of hidden layers may also have affected its efficiency and precision.
The models established in this study could form the basis of a review process in which each government institution enters its variable data for a given year. The estimated water consumption can then be calculated and used to judge whether the water consumption of a government institution is reasonable. Hence, the established models could serve as an evaluation tool for water conservation.