Modelling Tap Water Consumer Ratio

Abstract: Increasing population and rising air temperatures are known factors that cause water depletion in watersheds. Therefore, it is important to accurately predict the future ratios of tap water consumers using the same watershed to the population living in the specified area, in order to produce better water policies and to take the necessary measures. Predictions can be made with a growth curve model (GCM). Parameter estimation for the GCM is usually based on the ordinary least squares (OLS) estimator. However, the presence of outliers affects the estimates and the predictions obtained from the estimated model. The present article constructs first- and third-order GCMs with the robust least median of squares (LMS) and M estimators to make short-term predictions of the ratios of tap water consumers. According to the findings, the parameter estimates of the models, the detected outliers, and the predictions vary with respect to the estimators. The M estimator is suggested for short-term predictions, due to its robustness against outlying points.


Introduction
Water has vital importance for the survival of living beings. It is the basic element for maintaining life in nature and human activities. The fact that 75% of the Earth's surface is covered with water creates a perception that water scarcity will never be an issue; however, more than 97% of this water is seawater, 2% is ice, and most of the remaining 1% is groundwater that is difficult to reach [1]. Only a tiny fraction of the water that covers most of our planet is safe to drink. Thankfully, this water is renewed by nature's solar-powered water cycle. Evaporation, fueled by the sun's energy, carries water vapor to the atmosphere. Of this evaporation, 86% occurs from the sea and 14% from land [1]. Even though an equal amount falls back to the Earth as rain, sleet, or snow, a larger share of it falls on the continents than evaporated from them. This recurring transfer of water from the sea to the land creates renewable local drinking water resources. However, increasing population and air temperatures reduce drinking water availability per capita. The demand for drinking water will increase as the population grows. Reduced availability combined with rising demand increases the stress on water supplies, because they cannot be replenished quickly enough to meet demand. Moreover, agriculture already accounts for approximately 70% of the freshwater withdrawals in the world and is commonly seen as one of the main factors behind the growing global scarcity of freshwater [2].
Although many international agreements and declarations, such as the International Convention on the Rights of the Child [3], the Human Rights Council Resolution [4], and the Water Framework Directive [5], state that the right to safe drinking water and sanitation is an internationally recognized human right, a large part of the world's population cannot benefit from this right [6].
Turkey, a country with land in both Asia and Europe and a population of approximately 81 million people, is exposed to dry seasons. The European Environment Agency reported that Turkey will encounter moderate and high-level water scarcity in many areas [7]. Thus, Turkey is clearly a candidate to experience water scarcity problems. Considering that the population is predicted to be close to 100 million in 2030 [8], it is crucial to take precautions to avoid water shortages and to produce better water policies. Hence, it is important for a country both to monitor and to predict the tap water consumer ratios (TWCRs), particularly in order to take measures that lessen the negative effects of a reduction in tap water that could occur very soon.
Registered water use in a country is important to ensure planned and economical water use. Therefore, monitoring the TWCRs of watersheds and predicting their near-future values would make it easier to establish a basis for water-related precautions. In the literature, there are studies on water and its future predictions [9][10][11][12][13]. The assessment of sustainable water consumption perception, the evaluation of direct and indirect water consumption through the water footprint indicator, and the link between urban services and water uses are examined elsewhere [14]. The water footprint is described as an indicator of water use in relation to human consumption [15]. To incorporate the advances of life cycle assessment and water footprint analysis, an associated set of indicators has been developed (see Reference [16]). However, ignoring outliers, which are points that differ from the bulk of the data or follow a different distribution, could bias the findings [17].
This work aims to construct growth curve models (GCMs) to predict Turkey's TWCRs. The TWCR is the ratio of the number of households using tap water in a particular region to the total number of households in that region. In this study, the cities are grouped into 26 groups corresponding to 26 particular regions, according to the watershed used by the households. In the construction of the models, the non-robust ordinary least squares (OLS) estimator, which is the one generally employed, and the robust least median of squares (LMS) and M estimators are used. This study demonstrates that the detected outlying points differ according to which estimator is employed. Hence, every estimated model, and therefore the short-term predictions it produces, varies with the estimator. Thus, making better predictions of TWCRs to support suitable water policies depends on employing robust estimators that handle outliers well during the parameter estimation of the GCMs. It is highlighted that the presence of outlying points has an undue impact on the model's parameter estimates and future predictions.
The paper is organized as follows. In Section 2, we introduce GCMs based on the non-robust OLS estimator and the robust LMS and M estimators. To the best of our knowledge, this is the first time that the robust LMS and M estimators are studied in this context. Using the data introduced in Section 3, we show the differences in the outlying points, the estimated GCMs, and hence the predictions. Here, the TWCRs of Turkey are estimated under the assumption that a straight-line growth model in time, in other words the first-order GCM, can be fitted to the data. In the estimation procedure, we use the OLS, LMS, and M estimators, as mentioned above. In addition to the detection of outlying points, predictions of the TWCRs for chosen years are obtained. Furthermore, to compare the estimates and predictions obtained from the first-order GCMs based on the OLS, LMS, and M estimators, the curves are plotted on a single figure. Considering that a third-order polynomial could match the data better, the procedure is repeated for third-order GCMs.

GCMs Based on OLS, LMS, and M Estimators
The GCM is usually expressed as

$$Y = XBZ + \varepsilon,$$

and describes the change in a growth that corresponds to the response variable $Y$. This model indicates analytically how the parameters ($B_0$, $B_1$, etc.) and their standard errors ($\varepsilon_0$, $\varepsilon_1$, etc.) behave in a deterministic procedure for varying points of time [18]. $X$ and $Z$ are the design matrices. Here, $Z$, which is used for grouped repeated measures, is not taken into account, since only the growth of Turkey's TWCRs in watersheds at different time points is the subject of the research. The vector of unknown parameters, the error, and the design matrix are denoted by $B_{m \times 1}$, $\varepsilon_{p \times n}$, and $X_{p \times m}$, respectively. Each column of $\varepsilon$ is assumed to be distributed as $p$-variate normal with mean vector $0$ and unknown covariance matrix $\Sigma > 0$. Additionally, $Y$ is distributed as $N_{p \times n}(XBZ, \Sigma, I_n)$, where $XBZ$ is the expected value, and $\Sigma$ and $I_n$ are the covariance matrices of $Y_{ij}$ ($i$ fixed and $j = 1, \ldots, p$) and $Y_i$ ($i = 1, \ldots, n$), respectively [18]. The number of time points examined on each of the $n$ observations is denoted by $p$, and $m - 1$ is the degree of the polynomial in time.
The OLS estimator of $B$, denoted $\hat{B}_{OLS}$, is obtained from $\hat{B}_{OLS} = (X'X)^{-1}X'YZ'(ZZ')^{-1}$. Here, $\hat{B}_0$ is the expected value of $Y$ at time point 0 and is called the estimate of the coefficient $B_0$; $\hat{B}_1$ is the expected change in $Y$ when a one-unit change in time has occurred for observation $i$ and is called the estimate of the coefficient $B_1$. In addition, the OLS estimate of $\Sigma$, denoted $\hat{\Sigma}_{OLS}$, is based on $\hat{B}_{OLS}$ [18]. For the detection of outliers, the sum of squares of the residuals of the $i$th observation, $e_i^2(\hat{B}_{OLS}, \hat{\Sigma}_{OLS})$, is calculated. Since this quantity is chi-square distributed with $p$ degrees of freedom, its calculated value is compared with the critical value $\chi^2_{p,1-\alpha}$, where $\alpha$ denotes the significance level. If the sum of squares of a suspicious observation is larger than the critical value, the observation is regarded as an outlier [19,20]. The definition and explanations of $e_i^2(\hat{B}_{OLS}, \hat{\Sigma}_{OLS})$ given above are also valid for the LMS and M estimators.
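The outlier rule described above can be illustrated with a minimal sketch. It assumes that the values $e_i^2$ have already been computed, and it approximates the critical value $\chi^2_{p,1-\alpha}$ with the Wilson-Hilferty formula so that only the Python standard library is needed. The function names, the example values, and the choice $p = 16$ are hypothetical and do not come from the article.

```python
from statistics import NormalDist


def chi2_critical(p: int, alpha: float = 0.05) -> float:
    """Approximate the chi-square quantile chi2_{p, 1 - alpha}.

    Assumption of this sketch: the Wilson-Hilferty cube-root approximation
    is used instead of an exact quantile from a statistics library.
    """
    z = NormalDist().inv_cdf(1.0 - alpha)  # standard normal quantile
    a = 2.0 / (9.0 * p)
    return p * (1.0 - a + z * a ** 0.5) ** 3


def flag_outliers(e2, p, alpha=0.05):
    """Return the 1-based indices whose e_i^2 exceeds the critical value."""
    cutoff = chi2_critical(p, alpha)
    return [i for i, value in enumerate(e2, start=1) if value > cutoff]


# Hypothetical e_i^2 values for five observations and p = 16 time points.
example_e2 = [12.4, 8.9, 41.7, 15.0, 27.3]
print(round(chi2_critical(16), 2))
print(flag_outliers(example_e2, p=16))
```

The same comparison applies unchanged when the $e_i^2$ values come from the LMS or M fits, which is why the sets of flagged observations can differ between estimators.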
The estimation of $B$ and $\Sigma$ with the robust LMS and M estimators relies on the weighted least squares (WLS) estimator, whose estimation procedure is based on minimizing a weighted criterion [21]. The WLS estimator of $B$ is denoted $\hat{B}_{WLS}$, and the estimate of the weighted covariance matrix, $\hat{\Sigma}_{WLS}$, is computed using the matrix $H = W - WZ'(ZWZ')^{-1}ZW$ together with the trace operator, denoted "tr". The elements of the diagonal weight matrix $W$, which is used in Equation (3) and in the calculation of $H$, vary according to the estimator to be obtained. For instance, when the LMS estimator is employed, the $i$th element, $w_i$, of the diagonal weight matrix $W_t$ is defined for each $t$, a value that ranges from 1 to $C(n, h)$. Here, $C(n, h)$ denotes the number of $h$-combinations of a data set of $n$ elements, and $h$ is calculated with the notation $\lfloor \cdot \rfloor$, which means rounding down to the nearest integer. Then, the LMS estimators of $B$ and $\Sigma$, denoted $\hat{B}_{LMS}$ and $\hat{\Sigma}_{LMS}$, are obtained by solving the minimization problem of the objective function over $i = 1, \ldots, n$ [19]. The M estimator, $\hat{B}_M$, is obtained by solving an objective function iteratively, where $k = 0, 1, 2, \ldots$. Here, the value of $\hat{B}_{LMS}$ is used for the initial point $\hat{B}_{WLS_0}$. In Equation (8), $\rho$ indicates a function that has a minimum at 0 for all $e_i^2(\hat{B}_{WLS_k}, \hat{\Sigma}_{WLS_k})$, and $k$ is the iteration number. In this instance, Tukey's $\rho$ function is used to compute $\hat{B}_M$ [20]. In Equation (9), $\rho'(e_i^2(\hat{B}_{WLS_k}, \hat{\Sigma}_{WLS_k}))$ denotes the derivative of $\rho(e_i^2(\hat{B}_{WLS_k}, \hat{\Sigma}_{WLS_k}))$, and the $i$th diagonal element of the diagonal weight matrix $W_k$ is obtained at each iteration. Here, the constant $c = 18.0909$ is calculated as the value that satisfies a condition on $E_{\chi^2_p}$, the expected value taken under the chi-squared distribution with $p$ degrees of freedom.
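To give a feel for how a Tukey-type $\rho$ function downweights large residuals, the following sketch implements the standard textbook form of Tukey's biweight and the weight derived from its derivative. The argument `u` is a generic residual distance; exactly how the article expresses $\rho$ in terms of $e_i^2$, and on which scale $c = 18.0909$ is applied, is not stated here, so that mapping is an assumption, as are the function names.

```python
def tukey_rho(u: float, c: float) -> float:
    """Tukey's biweight rho (standard form), constant at c**2 / 6 beyond c."""
    if abs(u) > c:
        return c * c / 6.0
    r = 1.0 - (u / c) ** 2
    return (c * c / 6.0) * (1.0 - r ** 3)


def tukey_weight(u: float, c: float) -> float:
    """Weight w(u) = rho'(u) / u = (1 - (u / c)**2)**2 for |u| <= c, else 0."""
    if abs(u) > c:
        return 0.0
    return (1.0 - (u / c) ** 2) ** 2


# The constant quoted in the text; applying it to these distances is an
# assumption made for illustration only.
C_TEXT = 18.0909
for distance in (0.0, 5.0, 10.0, 20.0):
    print(distance, round(tukey_weight(distance, C_TEXT), 4))
```

Observations close to the fit keep a weight near 1, while any observation beyond `c` receives weight 0 and no longer influences the next WLS step. This is the mechanism that makes the iteratively reweighted fit resistant to outliers.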

The Dataset
Turkey consists of eighty-one cities. These cities are categorized into twenty-six groups according to the local watershed they benefit from [8] (Table 1). As explained above, $Z$ does not affect the estimations. Thus, it is taken as a $1 \times 26$ vector consisting of ones. With this design, the parameters of the GCMs, denoted $B_0$ and $B_1$, are estimated. Three different methods, OLS, LMS, and M, are used separately to construct the GCMs. This makes it possible to show that the identified outlying observations differ across the methods.
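Because $Z$ is a row vector of ones, the OLS formula $\hat{B}_{OLS} = (X'X)^{-1}X'YZ'(ZZ')^{-1}$ simplifies. A short derivation, in which the symbols $\mathbf{1}_n$ (a column of $n$ ones) and $\bar{y}$ (the mean profile over the $n = 26$ groups) are introduced here only for illustration, is:

```latex
\begin{aligned}
Z &= \mathbf{1}_n' \quad (1 \times n,\ n = 26), \qquad ZZ' = n, \\
YZ'(ZZ')^{-1} &= \frac{1}{n}\, Y \mathbf{1}_n = \bar{y} \quad (p \times 1), \\
\hat{B}_{OLS} &= (X'X)^{-1} X' Y Z' (ZZ')^{-1} = (X'X)^{-1} X' \bar{y}.
\end{aligned}
```

In other words, under this choice of $Z$ the OLS fit amounts to regressing the mean TWCR profile of the 26 watershed groups on the time design matrix $X$.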
In the second part of the study, a design matrix whose first column consists of ones and whose other columns consist of the numbers that correspond to the chosen years [18] is employed. The first, second, and third powers of these numbers are used in the design matrix in order to build third-order GCMs [17]. Here, $Z$ is employed as defined previously in the construction of the first-order GCMs.
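A minimal sketch of such design matrices, and of an OLS polynomial fit to the mean profile, is given below. It assumes that $Z$ is a row of ones, so that the OLS estimate depends on the data only through the row means of $Y$. The time codes, the toy data, and all function names are hypothetical.

```python
def design_matrix(time_codes, order):
    """Rows [1, t, t**2, ..., t**order] for each time code t."""
    return [[float(t) ** k for k in range(order + 1)] for t in time_codes]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols_mean_profile(Y, X):
    """OLS coefficients (X'X)^{-1} X' ybar, where ybar holds the row means of Y."""
    ybar = [sum(row) / len(row) for row in Y]
    m = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(m)] for i in range(m)]
    xty = [sum(x[i] * y for x, y in zip(X, ybar)) for i in range(m)]
    return solve(xtx, xty)


def predict(coefs, t):
    """Evaluate the fitted polynomial at time code t."""
    return sum(b * float(t) ** k for k, b in enumerate(coefs))


# Toy data: p = 5 time codes (rows) and n = 2 observations (columns).
codes = [1, 2, 3, 4, 5]
Y_toy = [[70.0 + t - 1.0, 70.0 + t + 1.0] for t in codes]
X1 = design_matrix(codes, order=1)  # first-order GCM
X3 = design_matrix(codes, order=3)  # third-order GCM: columns 1, t, t^2, t^3
coefs = ols_mean_profile(Y_toy, X1)
print(coefs, predict(coefs, 6))
```

Passing `order=3` to `design_matrix` yields the columns needed for a third-order GCM, and the same `ols_mean_profile` call then returns four coefficients instead of two.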

Detection of Outliers
The results of detecting outliers in the data, which consist of TWCRs in Turkey's watersheds at different time points, and the parameter estimates of the GCMs obtained with the methods mentioned above are summarized in Table 2 for α = 0.05. The table covers the findings for both the first- and third-order GCMs. Watershed number 9 is identified as an outlier by both the non-robust and the robust estimators, which is strong evidence that there is an outlier in the data. However, the robust LMS and M estimators additionally detect watersheds 4, 19, and 25, and watersheds 1, 5, and 22, respectively, as outlying points. Thus, it is reasonable to infer that the predictions obtained from the estimated GCM based on OLS can be adversely affected by the undetected outliers. By definition, the LMS and M estimators are more resistant to outliers than the OLS estimator [18][19][20]. Therefore, it is suggested to consider the predictions obtained from the estimated GCMs based on these estimators.

Results
To show the differences between the estimated GCMs based on the OLS, LMS, and M estimators, Figure 1a,b is plotted. The horizontal axis denotes the numbers corresponding to years and the vertical axis denotes the predicted ratios. Regarding the estimated first-order GCMs in Figure 1a, the GCMs based on the OLS, LMS, and M estimators differ. The predictions from the OLS estimator appear larger than those from the LMS and M estimators. Since the OLS estimator is influenced by outlying points, it is better to evaluate the predictions obtained from the LMS and M estimators, which are robust to such points [18][19][20]. Moreover, the predictions of TWCRs obtained from the OLS and LMS estimators are generally higher than those obtained from the M estimator. The M estimator is more resistant to outliers than the OLS and LMS estimators [18][19][20]; therefore, it is recommended to evaluate the results obtained from the M estimator. For instance, the predictions for 2021 are approximately 86%, 89%, and 90% with the M, LMS, and OLS estimators, respectively, which highlights that the outlying observations affect the results. The predictions of the ratios for the years 2017 to 2021 (corresponding to 17 to 21, respectively) can also be read from these graphs. The predictions based on all three methods tend to increase over the years.
In addition, the estimation and prediction procedures are repeated, since the data could be more appropriate for a third-order GCM. Figure 1b illustrates the estimated third-order GCMs based on the OLS, LMS, and M estimators and the predictions of TWCRs for watersheds in Turkey from 2017 to 2021 (corresponding to 17 to 21, respectively). The differences between the estimated third-order polynomials can be seen clearly in this figure. Predictions of the TWCRs of Turkey's watersheds rise steadily, particularly after 2015. Moreover, the results for the M estimator are much lower than those for the OLS and LMS estimators. In addition, the predictions based on the M estimator remain below 100. Since the vertical axis in Figure 1b denotes ratios, which cannot exceed 100%, the predictions obtained from this estimator are considered acceptable. Consequently, it is proposed to consider the predictions from the M estimator because of its robustness against outlying points.

Conclusions
GCMs, as statistical growth models for short-term predictions, are used in various studies. Based on tap water consumer data recorded in Turkey between 2001 and 2016, this study investigated the TWCRs of Turkey's watersheds with first- and third-order GCMs for short-term predictions. To estimate the parameters of the models, the OLS, LMS, and M estimators are used. Using both robust and non-robust estimators allowed us to observe the differences in the parameter estimates (Table 2) and in the short-term predictions of TWCRs (Figure 1). A plausible explanation for these findings is the existence of outliers in the data. The predictions obtained with the M estimator are assumed to be the best, due to its robustness against outlying points [18][19][20]. According to these predictions, the TWCRs for Turkey's watersheds will continue to increase. Furthermore, the prediction based on the estimated third-order GCM for the year 2021 is expected to be approximately 5% higher than that for 2020. Hence, making short-term predictions with the robust M estimator gives a more accurate picture, which will help to improve water policies.