Modeling the Level of Drinking Water Clarity in Surabaya City Drinking Water Regional Company Using Combined Estimation of Multivariable Fourier Series and Kernel

Abstract: The purpose of this study is to propose an appropriate model to predict the chemical composition used during water purification at the Regional Water Company (PDAM) Surabaya, in order to meet drinking water standards. Drinking water treatment is very expensive, so the model serves as a basis for determining the composition of chemicals used in the water purification process at PDAM Surabaya. This study examines a model of the relationship between the level of clarity of drinking water and the composition of the chemicals used. The government can use the forecasting model to formulate policies for the company. One objective of developing the estimation method in this research is to efficiently determine the chemical composition used in the water purification process, which will inform the financing and control of water quality. We used a multivariable linear approach for the parametric components, and multivariable Fourier Series and multivariable Kernel approaches for the nonparametric components of the semiparametric regression. Using the penalized least squares (PLS) approach, a mixed Fourier Series and Kernel estimator was obtained for semiparametric regression. The smoothing parameters were selected using generalized cross-validation (GCV). The performance of this technique was evaluated using the Gaussian Kernel and Fourier Series with trend on drinking water clarity data obtained from PDAM Surabaya. The findings showed that this technique performed well, so we recommend that the government conduct an in-depth analysis to determine the correct chemical composition so that the cost of water treatment can be minimized.


Introduction
In the adult human body, 70% of body weight is in the form of liquid. Therefore, drinking water is a nutritional element that is as important as carbohydrates, proteins, fats, and vitamins. Consuming good and sufficient mineral water can help the digestive process, regulate metabolism, regulate food substances in the body, and ensure body balance, provided that the drinking water is of good quality, which is assessed in terms of clarity [1]. Obtaining drinking water that is fit for consumption involves costly processing, so careful planning is needed to minimize costs. One approach that can be used is the application of a semiparametric regression model, as proposed in this paper.
Semiparametric regression is a regression analysis technique alongside parametric and nonparametric regression, and it integrates both parametric and nonparametric components. Here the parametric component is linear, and the nonparametric components are the Fourier Series and the Kernel. These three different estimators are combined in the estimate. Linear regression is the easiest and most efficient estimator compared to nonparametric regression methods. The Kernel estimator is suited to unpatterned data [2] and converges relatively faster than polynomial estimators, Fourier Series, or splines [3]. Many nonparametric and semiparametric regression estimators have been developed. For data whose pattern varies independently of knot location, nonparametric regression without mixed estimators has been used, such as smoothing spline [4], penalized spline [5], B-spline [6], and weighted partial spline [7]. For data with changing patterns at certain subintervals, truncated splines are used [8,9].
The assumption used by researchers when designing nonparametric or semiparametric regression models is that each predictor in the nonparametric component follows the same pattern. However, actual data are likely to reveal different relationship patterns among the predictors. Mixed estimators for nonparametric and semiparametric regression have been developed to estimate the regression curve according to the data pattern. Previous studies on mixed estimators for nonparametric regression cover the mixed Kernel and Fourier Series estimator [17], the mixed Fourier Series and truncated spline estimator [13,18,19], and the mixed Kernel and smoothing spline estimator [20]. For mixed semiparametric regression estimators, ref. [16] developed an approach that combines truncated spline and Kernel approaches, while [12,14] constructed a model combining Kernel estimators and Fourier Series. The authors of [20] developed an approach that combines smoothing spline and Fourier Series. The smoothing parameter controls the balance between the goodness of fit and the penalty.
This paper proposes a combined estimate for semiparametric regression using the mixed multivariable Kernel and Fourier Series estimator, in which some nonparametric components are repeating and others are unpatterned. Semiparametric regression using a mixed Kernel and Fourier Series estimator was studied in [13], but that work does not handle data in which the parametric components, the Kernel components, and the Fourier Series components are all multivariable. This paper presents the combined estimation of the multivariable Kernel estimator and the multivariable Fourier Series, using PLS as the estimation method. The PLS criterion combines a goodness-of-fit term and a penalty term.
The optimal smoothing parameters are selected using the GCV method. A very small smoothing parameter produces a very rough Fourier Series estimate, whereas a very large smoothing parameter produces an overly smooth estimate that cannot follow the data pattern. Similarly, an optimal bandwidth is required because a very small bandwidth results in a very rough Kernel estimate and a very wide bandwidth produces a Kernel estimate that is overly smooth and does not match the data pattern. Previous nonparametric and semiparametric regression researchers have extensively developed the GCV method. Studies of the GCV technique in nonparametric regression found, among other results, that it was superior to an unbiased risk approach in this context [17]. The authors of [15] investigated semiparametric regression using the GCV method. The selection of smoothing parameters in this study builds on the optimal smoothing parameter selection for the mixed estimator combining multivariable Fourier Series and Kernel in semiparametric regression [13].
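The bandwidth trade-off described above can be seen in a small numerical sketch. It assumes a Gaussian Kernel and the Nadaraya-Watson form, and it uses synthetic data rather than the PDAM Surabaya measurements; the function name `nadaraya_watson` and all numeric settings are illustrative choices, written in Python rather than the R used in the study.

```python
import numpy as np

def nadaraya_watson(x, y, bandwidth):
    """Gaussian-kernel Nadaraya-Watson fit evaluated at the observed x values."""
    u = (x[:, None] - x[None, :]) / bandwidth
    w = np.exp(-0.5 * u**2)
    return (w @ y) / w.sum(axis=1)

# Synthetic example data (assumption: a noisy sine curve on [0, 1]).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

fit_narrow = nadaraya_watson(x, y, bandwidth=1e-3)   # follows every noisy point
fit_medium = nadaraya_watson(x, y, bandwidth=0.05)   # tracks the underlying curve
fit_wide = nadaraya_watson(x, y, bandwidth=100.0)    # collapses towards mean(y)

for name, fit in [("narrow", fit_narrow), ("medium", fit_medium), ("wide", fit_wide)]:
    rss = np.sum((y - fit) ** 2)
    print(f"{name:6s} residual SS = {rss:8.4f}, spread of fit = {fit.std():.4f}")
```

With a very small bandwidth the fit reproduces the noise, and with a very large one it flattens to the sample mean, which is why an intermediate value has to be chosen by a criterion such as GCV.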
The model obtained from the estimation is intended for modeling the level of clarity of drinking water at PDAM Surabaya. The parametric components are estimated by a multivariable linear method, the Kernel components by Gaussian Kernels, and the Fourier Series components by Fourier Series with trend. The estimation model can be used to predict the optimal composition of chemicals while taking into account the threshold for drinking water. This information is anticipated to provide reference material for planning drinking water management. As each chemical is expensive and its price varies, it is hoped that, using this model, PDAM Surabaya can efficiently manage the costs of producing clean water.

Materials and Techniques
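Given $n$ paired observations $(t_{1i}, \ldots, t_{pi}, x_{1i}, \ldots, x_{qi}, z_{1i}, \ldots, z_{ri}, y_i)$, $i = 1, 2, \ldots, n$, where $y_i$ is the response variable, the multivariable semiparametric regression model is

$$y_i = \mu(t_{1i}, \ldots, t_{pi}, x_{1i}, \ldots, x_{qi}, z_{1i}, \ldots, z_{ri}) + \varepsilon_i, \qquad (1)$$

where $\varepsilon_i$ is a random error with $\varepsilon_i \sim \mathrm{IIDN}(0, \sigma^2)$. Assuming that the regression curve $\mu$ is additive, it may be expressed as

$$\mu(t_{1i}, \ldots, t_{pi}, x_{1i}, \ldots, x_{qi}, z_{1i}, \ldots, z_{ri}) = \sum_{j=1}^{p} g_j(t_{ji}) + \sum_{k=1}^{q} m_k(x_{ki}) + \sum_{l=1}^{r} h_l(z_{li}). \qquad (2)$$

The estimator $\hat{\eta}$ is obtained by the PLS optimization in Equation (3), where $\lambda_l$ are the smoothing parameters. The function that measures goodness of fit makes up the first component of Equation (3), and the function that measures the penalty makes up the second.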
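To make the two-part structure of the PLS criterion concrete, the following is a hedged sketch of one common way to write it. The squared-error goodness-of-fit term follows from the additive model, while the second-derivative roughness penalty on each Fourier Series component over $(0, \pi)$ is an assumption, chosen because it is the usual penalty for Fourier Series estimators; it is not copied from Equation (3).

```latex
\min_{g,\,m,\,h}\;\Biggl\{
  \frac{1}{n}\sum_{i=1}^{n}\Bigl(y_i-\sum_{j=1}^{p}g_j(t_{ji})
      -\sum_{k=1}^{q}m_k(x_{ki})-\sum_{l=1}^{r}h_l(z_{li})\Bigr)^{2}
  \;+\;\sum_{l=1}^{r}\lambda_l\,\frac{2}{\pi}\int_{0}^{\pi}
      \bigl(h_l''(z_l)\bigr)^{2}\,dz_l
\Biggr\}
```

Under this form, each $\lambda_l$ trades fidelity to the data against the smoothness of the corresponding $h_l$.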

Mixed Kernel Model and Multivariable Fourier Series in Semiparametric Regression
Several lemmas are needed to derive the mixed Kernel and multivariable Fourier Series model in semiparametric regression from Equation (2). Lemma 1 presents the solution for the parametric components, Lemma 2 the solution for the Kernel components, Lemma 3 the solution for the Fourier Series components, and Lemma 4 the form of the goodness of fit, while Lemma 5 presents the form of the penalty component of Equation (3).

Lemma 1.
If the multivariable linear parametric curve components in Equation (2) are approximated by a multivariable linear function, then $\sum_{j=1}^{p} g_j(t_{ji})$ can be written in matrix form as $X\beta$ (Equation (4)), where $X$ is a matrix of size $n \times (p + 1)$ and $\beta$ is a vector of size $(p + 1) \times 1$.

Proof of Lemma 1. For $g_j(t_{ji})$ with $i = 1, 2, \ldots, n$, the linear components are written out for each observation and collected into the matrix form $X\beta$.

Lemma 2.
If the Kernel curve components in Equation (2) are approximated by a multivariable Kernel function, the Nadaraya-Watson estimator [2], then $\sum_{k=1}^{q} m_k(x_{ki})$ can be written in matrix form as $\Omega y$ (Equation (5)), where $\Omega$ is a matrix of size $n \times n$ and $y$ is a vector of size $n \times 1$.

Proof of Lemma 2. For $i = 1, 2, \ldots, n$, each Kernel component is written with the Nadaraya-Watson estimator, and as a result of Equation (6) the Kernel components can be collected into the matrix form $\Omega y$.

Lemma 3.
If the Fourier Series curve components in Equation (2) are approximated by a Fourier Series function $H_l(z_{li})$, assuming regression curves $h_l \in C(0, \pi)$, $l = 1, 2, \ldots, r$, then $\sum_{l=1}^{r} h_l(z_{li})$ can be written as $D\gamma$ (Equation (8)), where $D$ is a matrix of size $n \times r(S + 2)$ and $\gamma$ is a vector of size $r(S + 2) \times 1$.
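The three matrices named in the lemmas, X, Ω, and D, can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the authors' R code: the Kernel is taken to be Gaussian, Ω is shown for a single Kernel predictor only (how the q predictors are combined is given by the equations not reproduced here), and each Fourier Series block is assumed to have the columns [z, 1/2, cos z, ..., cos Sz], which matches the stated S + 2 parameters per predictor. All function names are hypothetical.

```python
import numpy as np

def linear_design(t):
    """X: n x (p+1) matrix, an intercept column followed by the p linear predictors."""
    t = np.asarray(t, dtype=float)
    return np.column_stack([np.ones(t.shape[0]), t])

def nw_smoother(x, bandwidth):
    """n x n Nadaraya-Watson weight matrix for ONE predictor (Gaussian kernel assumed)."""
    x = np.asarray(x, dtype=float)
    u = (x[:, None] - x[None, :]) / bandwidth
    k = np.exp(-0.5 * u**2)
    return k / k.sum(axis=1, keepdims=True)   # each row sums to 1

def fourier_block(z, S):
    """n x (S+2) block for one predictor: trend z, constant 1/2, cos(z), ..., cos(S z)."""
    z = np.asarray(z, dtype=float)
    cosines = np.cos(np.outer(z, np.arange(1, S + 1)))
    return np.column_stack([z, np.full(z.shape[0], 0.5), cosines])

def fourier_design(zs, S):
    """D: n x r(S+2) matrix, the r per-predictor blocks placed side by side."""
    zs = np.asarray(zs, dtype=float)
    return np.hstack([fourier_block(zs[:, l], S) for l in range(zs.shape[1])])

# Synthetic inputs (illustrative sizes only).
rng = np.random.default_rng(0)
n, p, r, S = 50, 3, 2, 2
t = rng.normal(size=(n, p))
x = rng.uniform(0.0, 1.0, size=n)
z = rng.uniform(0.0, np.pi, size=(n, r))

X = linear_design(t)
Omega = nw_smoother(x, bandwidth=0.1)
D = fourier_design(z, S)
print(X.shape, Omega.shape, D.shape)
```

The shapes agree with the sizes stated in the lemmas: X is n × (p + 1), Ω is n × n, and D is n × r(S + 2).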

Lemma 4.
If the semiparametric regression model is as in Equation (2), with the linear curve given in Equation (4), the Kernel curve in Equation (5), and the Fourier Series curve in Equation (8), then the goodness-of-fit component of Equation (3) can be written in terms of these matrix forms.

Theorem 1.
Theorem 1 gives the mixed model of Kernel and multivariable Fourier Series. If the semiparametric regression model is as in Equation (2), where the linear curve is given in Equation (4), the Kernel curve in Equation (5), the Fourier Series curve in Equation (8), and the penalty in Equation (12), then minimizing the PLS in Equation (3) gives the multivariable mixed estimator

$$\hat{\eta}_{(\beta,\phi,\lambda,s)}(t, x, z) = M^{*} y,$$

where $y$ is a vector of size $n \times 1$ and $M^{*}$ is a matrix of size $n \times n$.

Proof of Theorem 1. Based on Lemma 4, the optimization in Equation (3) can be expressed as Equation (13), where $\gamma^{T} P \gamma$ is the penalty component.
In Equation (13), let $A = (I - \Omega_k) y - X\beta$, so that the criterion can be expressed as a function $\zeta(\gamma, \beta)$. To get the estimator $\hat{\gamma}$, $\zeta(\gamma, \beta)$ is partially differentiated with respect to $\gamma$ and the result is set equal to zero, which gives Equation (14).
Next, to get the estimator $\hat{\beta}$, $\zeta(\gamma, \beta)$ is partially differentiated with respect to $\beta$ and set equal to zero, which gives Equation (15). Substituting Equation (14) into Equation (15) gives Equation (16). Then $\hat{\gamma}$ is obtained by substituting Equation (16) into Equation (14). The result can be written in matrix form in terms of $C^{*}$ and $y$, where $y$ is a vector of size $n \times 1$ and $C^{*}$ is a matrix of size $n \times n$. When this result is substituted into Equation (15), we get the estimator for the parametric component.
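The two-step derivation in the proof (solve for γ with β fixed, substitute, then solve for β) can be mirrored numerically. The sketch below is a hedged illustration, not the authors' implementation: it assumes the criterion has the form (1/n)‖(I − Ω)y − Xβ − Dγ‖² + λ γᵀPγ, a Gaussian Nadaraya-Watson matrix for one Kernel predictor, one Fourier Series predictor with columns [z, 1/2, cos z, ..., cos Sz], and a diagonal P with entries (0, 0, 1⁴, ..., S⁴). The function name `pls_fit` and the synthetic data are hypothetical.

```python
import numpy as np

def pls_fit(y, X, Omega, D, P, lam):
    """Minimise (1/n)||(I - Omega) y - X beta - D gamma||^2 + lam * gamma' P gamma."""
    n = y.shape[0]
    I = np.eye(n)
    # gamma-step: for fixed beta, gamma = K A with A = (I - Omega) y - X beta
    K = np.linalg.solve(D.T @ D + n * lam * P, D.T)
    B = I - D @ K
    # beta-step after substituting gamma; pinv is used because the intercept in X
    # and the constant Fourier term are collinear, so the system is rank deficient
    beta_map = np.linalg.pinv(X.T @ B @ X, rcond=1e-10) @ X.T @ B @ (I - Omega)
    gamma_map = K @ (I - Omega - X @ beta_map)
    M = Omega + X @ beta_map + D @ gamma_map      # hat matrix, fitted = M y
    return beta_map @ y, gamma_map @ y, M

# Synthetic data (assumption: one Kernel predictor x, one Fourier predictor z).
rng = np.random.default_rng(1)
n, S, lam = 60, 3, 1e-3
t = rng.normal(size=(n, 2))
x = rng.uniform(0.0, 1.0, size=n)
z = np.sort(rng.uniform(0.0, np.pi, size=n))

X = np.column_stack([np.ones(n), t])
W = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.2) ** 2)
Omega = W / W.sum(axis=1, keepdims=True)
D = np.column_stack([z, np.full(n, 0.5), np.cos(np.outer(z, np.arange(1, S + 1)))])
P = np.diag(np.concatenate([[0.0, 0.0], np.arange(1, S + 1) ** 4.0]))

y = (1.0 + t @ np.array([0.5, -0.3]) + np.sin(2 * np.pi * x)
     + 0.4 * z + np.cos(2 * z) + rng.normal(scale=0.1, size=n))

beta, gamma, M = pls_fit(y, X, Omega, D, P, lam)
fitted = M @ y
print("fitted values via hat matrix:", fitted[:3])
```

Because the fitted values are linear in y, they can be written as a single n × n matrix applied to y, which is the form the smoothing parameter selection step relies on.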

Smoothing Parameter Selection
Semiparametric regression using the mixed Kernel and multivariable Fourier Series estimator relies heavily on the selection of the best smoothing, bandwidth, and oscillation parameters. According to [22,23], the smoothing parameters for the combined multivariable Fourier Series and Kernel estimator in semiparametric regression are selected using the GCV criterion in Equation (20). The smallest value of GCV(β, ϕ, λ, s) gives the optimal smoothing parameters, oscillation parameters, and bandwidth.
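A grid search over bandwidth, smoothing parameter, and number of oscillations can be sketched as below. This assumes the standard GCV form, mean squared residual divided by the squared mean of the diagonal of (I − M), since the paper's own GCV equation is not reproduced here. The hat matrix is built under the same assumptions as a generic penalized fit (Gaussian Nadaraya-Watson weights, one Kernel and one Fourier Series predictor, diagonal s⁴ penalty); `hat_matrix`, `gcv`, the grid values, and the synthetic data are all illustrative.

```python
import itertools
import numpy as np

def hat_matrix(t, x, z, bandwidth, lam, S):
    """n x n matrix M with fitted = M y, for one Kernel and one Fourier predictor."""
    n = len(z)
    X = np.column_stack([np.ones(n), t])
    W = np.exp(-0.5 * ((x[:, None] - x[None, :]) / bandwidth) ** 2)
    Omega = W / W.sum(axis=1, keepdims=True)
    D = np.column_stack([z, np.full(n, 0.5), np.cos(np.outer(z, np.arange(1, S + 1)))])
    Z = np.hstack([X, D])
    pen = np.zeros(Z.shape[1])
    pen[X.shape[1] + 2:] = np.arange(1, S + 1) ** 4.0   # penalise cosine terms only
    # pinv: the intercept and the constant Fourier column are collinear
    G = np.linalg.pinv(Z.T @ Z + n * lam * np.diag(pen), rcond=1e-10)
    return Omega + Z @ G @ Z.T @ (np.eye(n) - Omega)

def gcv(y, M):
    """Standard GCV score: (RSS / n) / (trace(I - M) / n)^2."""
    n = len(y)
    resid = y - M @ y
    return (resid @ resid / n) / (np.trace(np.eye(n) - M) / n) ** 2

# Synthetic data, not the PDAM Surabaya measurements.
rng = np.random.default_rng(2)
n = 60
t = rng.normal(size=(n, 2))
x = rng.uniform(0.0, 1.0, size=n)
z = rng.uniform(0.0, np.pi, size=n)
y = (1.0 + t @ np.array([0.5, -0.3]) + np.sin(2 * np.pi * x)
     + 0.4 * z + np.cos(2 * z) + rng.normal(scale=0.1, size=n))

grid = itertools.product([0.05, 0.1, 0.2, 0.4], [1e-4, 1e-2, 1.0], [1, 2, 3])
scores = {(phi, lam, S): gcv(y, hat_matrix(t, x, z, phi, lam, S)) for phi, lam, S in grid}
best = min(scores, key=scores.get)
print("selected (bandwidth, lambda, S):", best, "GCV:", scores[best])
```

A plain grid is used here for transparency; any other search over the same three parameters would apply the same criterion.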

Modeling Data
In this section, the combined multivariable Kernel and Fourier Series model in semiparametric regression is applied to the drinking water clarity level (TKAM) data from PDAM Surabaya. Drinking water is very important for human life and must be used wisely with future generations in mind [24]. Because of water pollution, purification processes are necessary. These are very costly [25] and require careful planning, including an assessment of the composition of the chemical substances needed to obtain drinking water that meets the required standards.
An initial study of the TKAM data at PDAM Surabaya showed that the predictor variables differ in their pattern against the response variable: some showed a Fourier Series pattern, others showed a Kernel pattern, and some followed a linear pattern. These data were then fitted to the model in Equation (19) using R software with library(pracma), library(MASS), library(lmtest) and library(gtools). The response variable $y$ was the level of clarity of drinking water. The predictor variables thought to affect the level of water clarity were aluminum sulfate ($x_1$), liquid chlorine ($x_2$), cupric sulfate ($x_3$), chlorine ($x_4$), Dukem 108A ($x_5$), and the turbidity of the water after deposition ($x_6$), where $x_2$, $x_3$, and $x_6$ are parametric components, $x_1$ and $x_5$ are Kernel components, and $x_4$ is a Fourier Series component. The smallest GCV value obtained from Equation (20) [23,26] is 0.00209, and the estimation model was obtained at this value. An overview of the real data and the estimated results is presented in Figure 1.
The combined multivariable Kernel and multivariable Fourier Series estimator in semiparametric regression gives $R^2 = 88.2\%$ when estimating the degree of clarity of drinking water at PDAM Surabaya. That is, the predictor variables explain 88.2% of the variance in the response variable. This shows that the model is suitable for modeling the TKAM data from PDAM Surabaya [19].
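For reference, the coefficient of determination can be computed from observed and fitted values as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below assumes this standard definition; the four numbers are toy values for illustration, not the PDAM Surabaya data.

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SSE / SST."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    sse = np.sum((y - y_hat) ** 2)        # residual sum of squares
    sst = np.sum((y - y.mean()) ** 2)     # total sum of squares
    return 1.0 - sse / sst

# Toy values only (assumption: not taken from the study).
y_obs = [1.0, 2.0, 3.0, 4.0]
y_fit = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(y_obs, y_fit)
print(f"R^2 = {r2:.2%}")
```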

Conclusions
This paper presents an estimation technique for semiparametric regression using PLS. We combined the multivariable parametric estimator, the multivariable Kernel, and the multivariable Fourier Series to estimate the regression curve for data whose pattern is partly multivariable parametric, partly multivariable Kernel, and partly multivariable Fourier Series. The model was selected based on the smallest GCV value. We considered the outcomes using various types of Kernels, while looking at the estimator features of the Fourier Series and Kernel estimator in semiparametric regression. The model obtained was more adequate than that of [19], which had a coefficient of determination $R^2$ of 84%. Using a mixed estimate of Kernel and multivariable Fourier Series, the coefficient of determination $R^2$ was found to be 88.4%.

Conflicts of Interest:
The authors declare no conflict of interest.

Symbol: Meaning
y_i: Response variable for the i-th subject.
x_pi: The p-th predictor variable on the parametric component x for the i-th subject.
t_ki: The k-th predictor variable on the nonparametric component t for the i-th subject.
z_ri: The r-th predictor variable on the nonparametric component z for the i-th subject.
a_0, a_S, b: Fourier Series parameters.
β: Parametric component parameter vector/regression coefficient vector.
m(x): Parametric function for parametric components.
g(t): Kernel functions for nonparametric components.
h(z): Fourier Series functions for nonparametric components.
ε: Random error vector with ε ∼ N(0, Iσ²).
σ²: Error variance.
ϕ: Bandwidth.
a: Vector containing the parameters of the Fourier Series, of size (S + 2) × 1.
T: The matrix containing the coefficients of the multivariable Fourier Series of n × (S + 2