The Enhancement of Leak Detection Performance for Water Pipelines through the Renovation of Training Data

Leak detection is a fundamental problem in water management. It matters not only for avoiding wasted resources but also for protecting the environment and the safety of water supplies, so early leak detection is increasingly in demand. This paper uses an intelligent leak detection method based on a model built from statistical parameters extracted from acoustic emission (AE) signals. Because leak signals depend on many operating conditions, the training data available in real-life situations is usually small. To address the small sample size, a data improvement method that enhances the generalization ability of the data is proposed. To evaluate the proposed method, this study used datasets obtained from two artificial leak cases generated by pinholes with diameters of 0.3 mm and 2.0 mm. Experimental results show that adding the data improvement block to the leak detection scheme enhances leak detection in terms of both accuracy and stability.


Introduction
Leakage detection is a primary problem in water management [1,2]. About 20-30% of the water in supply systems is lost every year, and the loss can reach 50% in some systems [2]. The growing demand for water is prompting a reconsideration of how pipeline systems are managed and supplied. The difficulty of exploiting new water bodies can be offset by decreasing water losses [3]. Furthermore, current attention to environmental protection and water quality is driving a growing interest in leakage detection. The water resources community has concentrated mostly on the natural environment. However, guarding the water in pipes against intrusion and protecting the environment from the release of a transported contaminant are as significant as protecting aquifers and well-fields from contaminant discharge [4]. Consequently, methodologies for early leak detection are strongly needed. Additionally, they should not interrupt pipeline operation, and they should be simple enough to implement in practice.
Many studies on leak detection for water supply systems have been published. The approaches may be passive or active [5], and hardware-based or software-based [6]. Passive methods require direct visual inspection or supervision of sites, while active methods involve signal analysis. Signals used in active methods can be acoustic, vibration, flow rate, or pressure. Hardware-based methods are classified by the type of special sensing device used, such as acoustic monitoring, vibration analysis, or cable sensors. On the other hand, software-based methods

The Proposed Method
The overall flow of leak detection is illustrated in Figure 1. First, the acquired AE signals were denoised by a wavelet algorithm based on normalized Shannon entropy, which has also been adopted in some recent studies of leak detection [12,19,20]. The denoised signals were then divided into separate analysis and evaluation datasets. The evaluation dataset was isolated from the analysis dataset to ensure the reliability of the performance evaluation. Based on the analysis dataset, a defect signature pool was configured and the most discriminative signature subset was determined; the same subset was then applied to the evaluation dataset. Subsequently, using the selected features, the analysis dataset was renovated by detecting and removing outsiders before it was used to train SVM classifiers. Finally, the efficacy of the proposed method was verified on the evaluation dataset. Each part is described in detail below.

Noise Reduction Using a Wavelet Transform and Shannon Entropy
Due to the nature of the AE mechanism, leakage noise is commonly nonstationary [21,22]. Time-frequency analysis methods, which are powerful tools for analyzing time-varying nonstationary signals, are recommended for studying a signal in the time and frequency domains simultaneously. Many studies have adopted the wavelet transform for leak detection because of its multiresolution capability [23-25].
A form of wavelet transform that allows multiresolution analysis is the wavelet packet transform (WPT) [26]. The WPT decomposes a signal into both wavelet coefficients and scaling coefficients, providing the complete decomposition hierarchy. Because the resulting frequency sub-bands are uniform, the decomposition is highly adaptable [27].
A mother wavelet $\psi(t)$ is a finite-energy, continuously oscillating function of very short duration. Its scaled and translated versions are given in Equation (1):

$$\psi_{s,\tau}(t) = \frac{1}{\sqrt{s}}\,\psi\!\left(\frac{t-\tau}{s}\right) \tag{1}$$

where $\psi_{s,\tau}(t)$ comprises the normalized dilations of $\psi(t)$ in time $t$ determined by the scale factor $s > 0$, and the translation in time is determined by $-\infty < \tau < \infty$. The wavelet transform of a signal $x(t)$ is its cross-correlation with $\psi_{s,\tau}(t)$, as expressed in Equation (2) [24,27-29]:

$$W_x(s,\tau) = \int_{-\infty}^{\infty} x(t)\,\psi^{*}_{s,\tau}(t)\,dt \tag{2}$$

Mathematically, the similarity between two signals can be identified by cross-correlation analysis. Given two signals $x_i$ and $y_i$, where $i = 0, 1, 2, \ldots, N-1$, Equation (3) gives the normalized cross-correlation at zero time lag:

$$r_{xy} = \frac{\sum_{i=0}^{N-1} x_i y_i}{\sqrt{\sum_{i=0}^{N-1} x_i^2 \,\sum_{i=0}^{N-1} y_i^2}} \tag{3}$$

The normalized cross-correlation is a numerical quantity whose magnitude lies between 0 and 1 and which indicates how close two signals are in character. Two signals with identical characteristics yield a normalized cross-correlation coefficient of 1.0 [30].

The wavelet coefficients are determined by executing the WPT with filter banks through a recursive scheme. Low-frequency components (approximations) and high-frequency components (details) at each resolution level are obtained by passing the signal $x(t)$ through a two-channel filter. Compared with the wavelet transform, which decomposes only the approximations, the WPT decomposes both the details and the approximations at every resolution level.
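As a minimal sketch of the zero-lag normalized cross-correlation described above, the snippet below assumes the common energy-normalized form (the sum of products divided by the square root of the product of the two signal energies). The function name and the toy signals are illustrative and are not taken from the study.

```python
import math


def normalized_cross_correlation(x, y):
    """Zero-lag normalized cross-correlation of two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("signals must have the same length")
    numerator = sum(a * b for a, b in zip(x, y))
    # Product of the two signal energies, used for normalization.
    denominator = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    if denominator == 0.0:
        raise ValueError("at least one signal has zero energy")
    return numerator / denominator


# Toy signals (hypothetical): a scaled copy and a quarter-period shift.
signal = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0]
scaled = [2.0 * v for v in signal]
shifted = [1.0, 0.0, -1.0, 0.0, 1.0, 0.0]

same_shape = normalized_cross_correlation(signal, scaled)
different_shape = normalized_cross_correlation(signal, shifted)
print(same_shape, different_shape)
```

A scaled copy of a signal gives a coefficient of 1, because the normalization removes amplitude, whereas the shifted toy signal is orthogonal to the original and gives 0.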
The main challenge in wavelet analysis is selecting the mother wavelet function and the decomposition level of the signal. Among orthogonal wavelets, Daubechies (DB) wavelets have been widely used, as they match the transient components in acoustic and vibration signals [31]. The order of the mother wavelet and the decomposition level are often determined by trial and error based on intrinsic characteristics of the data [31,32]. In this study, the selected mother function is DB15, and the number of levels was experimentally determined by Equation (3). Figure 2 illustrates the binary hierarchical tree of discrete wavelet packet transform (DWPT) coefficients. Each node of this tree is treated as a sub-band and numbered according to its level and its ordinal within that level; levels and ordinals are numbered from 1.

An algorithm based on information entropy, used as a cost function, was applied to detect the unnecessary components of an AE signal acquired during a test. In this method, only the sub-bands that concentrate the major information carried by the signal are selected. If $X_j = \{x_{j,k}\}$ is the set of coefficients of a given sub-band of the WPT tree at resolution level $j$, the Shannon entropy $H(X_j)$ is given by:

$$H(X_j) = -\sum_{k} p_{j,k}\,\log p_{j,k} \tag{4}$$

$$p_{j,k} = \frac{x_{j,k}^2}{\lVert X_j \rVert^2} \tag{5}$$

Here, $\lVert X_j \rVert^2 = \sum_k x_{j,k}^2$ is the squared norm of $X_j$ [26]. A large value of $H(X_j)$ means that the signal is more disordered and carries less information, so the corresponding sub-band and its descendants are discarded. The entropy thus measures how energy is distributed among the sub-bands. The aim is to select the WPT branch that carries the least disorder and has the minimum possible energy. If the entropy of the current sub-band is smaller than that of the sub-bands at the next resolution level, the current sub-band is kept. Otherwise, a further level of decomposition is needed. In other words, the selected sub-band should have the lowest entropy value and the highest resolution level. The selected sub-bands are then used to reconstruct the AE signal, so that the most significant part of the signal is retained and the complementary component, regarded as noise, is removed.
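The entropy cost used for sub-band selection can be illustrated with a short sketch. It assumes the usual normalized form, in which each squared coefficient is divided by the sub-band energy and the natural logarithm is used; the sub-band names and coefficient values are hypothetical, and the wavelet packet decomposition itself is not reproduced here.

```python
import math


def shannon_entropy(coeffs):
    """Normalized Shannon entropy of the coefficients of one sub-band."""
    energy = sum(c * c for c in coeffs)
    if energy == 0.0:
        return 0.0  # an all-zero sub-band carries no disorder
    entropy = 0.0
    for c in coeffs:
        p = (c * c) / energy  # share of the sub-band energy in this coefficient
        if p > 0.0:           # the term 0 * log(0) is treated as 0
            entropy -= p * math.log(p)
    return entropy


def lowest_entropy_subband(subbands):
    """Return the name of the sub-band with the smallest entropy."""
    return min(subbands, key=lambda name: shannon_entropy(subbands[name]))


# Hypothetical coefficient sets with equal energy but different spread.
subbands = {
    "concentrated": [4.0, 0.0, 0.0, 0.0],
    "spread": [2.0, 2.0, 2.0, 2.0],
}
selected = lowest_entropy_subband(subbands)
print(selected)
```

Both toy sub-bands have the same energy, but the one whose energy sits in a single coefficient has zero entropy and is selected, which matches the preference for the least disordered sub-band.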
In summary, the dimensionality of the fault-signature pool used in the feature selection process is $N_{dp} \times N_{sp} \times N_{cl}$, where $N_{dp}$, $N_{sp}$, and $N_{cl}$ are the number of data points per leak condition class in the analysis dataset, the number of statistical parameters, and the number of classes to be discriminated in this study, respectively. Figure 3 illustrates an example of a data point configuration used to yield the most discriminatory feature subset. The set of elements in the fault-signature pool is denoted by $X = \{x(dp, sp, cl)\}$, with $dp = 1, \ldots, N_{dp}$, $sp = 1, \ldots, N_{sp}$, and $cl = 1, \ldots, N_{cl}$. The variables $dp$, $sp$, and $cl$ are the coordinates of data point $x$ in the dataset $X$.

The Generation of the Discriminative Signature Subset
To ensure a fair comparison, the statistical parameters need to be standardized before they are evaluated and ranked. This study used a simple scaling method, given in Equation (6), where $X_k = \{x_i \mid sp = k\}$ and $\tilde{X}_k = \{\tilde{x}_i \mid sp = k\}$ denote the original and standardized sets of values of the $k$th signature (i.e., the $k$th statistical parameter), respectively. After standardization, the values of all signatures lie in the range [0, 1].
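The text states only that a simple scaling maps every signature into [0, 1]. One common choice with that property is min-max scaling, sketched below as an assumption; the study may have used a different formula, and the sample values are hypothetical.

```python
def min_max_scale(values):
    """Scale one signature's values into [0, 1] (assumed min-max form)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant signature has no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical values of one statistical parameter across data points.
raw_values = [2.0, 4.0, 6.0, 10.0]
scaled_values = min_max_scale(raw_values)
print(scaled_values)  # [0.0, 0.25, 0.5, 1.0]
```

Scaling each signature separately keeps parameters with large numeric ranges from dominating the ranking.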
To address the small dataset problem, Tu et al. recently introduced the MSAC to evaluate how well fault signatures discriminate between two classes [12]. The MSAC method estimates the potential value range of the $k$th dimension of the signature sub-space of class $i$ by the interval $[\mathrm{mean}(X_i^k) - 3\,\mathrm{std}(X_i^k),\ \mathrm{mean}(X_i^k) + 3\,\mathrm{std}(X_i^k)]$, where $X_i^k$ denotes the set of values of the $k$th signature of the data points in class $i$. The crossing level between the signature sub-spaces of classes $i$ and $j$ at dimension $k$, denoted by $\mathrm{MSAC}_{i,j}^{k}$, is determined by Equation (7). In that equation, $\mathrm{std}(X_i^k)$ and $\mathrm{std}(X_j^k)$ represent the intraclass compactness of the classes, and $\mathrm{mean}(X_i^k) - \mathrm{mean}(X_j^k)$ represents the interclass separability. According to the authors of [12], the larger the MSAC, the better the discrimination. Thus, the MSAC expresses how well a signature distinguishes each pair of classes.
Although the MSAC is simple and has a low computing cost, it is effective and suitable for real applications such as leak detection. In this paper, the MSAC was used to rank the signatures from best to worst, and the discriminatory signature subset was created by picking the top-ranked signatures.

Data Renovation
To build a classification model, the correctness and generalization of the training dataset are extremely important. If the dataset is inaccurate or not general, the accuracy, reliability, and stability of the trained model may be reduced. Related studies have mostly focused on big data [33-35].
Meanwhile, leak detection using a smart fault diagnostic model faces a small data problem, because leakage signals are affected by many external factors. It is therefore necessary to renovate the dataset. In machine learning, the quality of samples is more important than their quantity, especially when the quantity is not large. The higher the quality of the samples, the greater the generalization ability and the better the accuracy. Within a class, points that are far from the center and have a low probability under the class distribution are called outsiders. They are less significant than the rest and may be noise points. Consequently, they should be detected and removed.
This study focuses on improving the quality of the data before training the classification model, using a simple and effective technique. The technique alternates three steps (detecting outsiders, eliminating them, and updating the dataset) until no outsiders remain in the renovated dataset. We assumed that the statistical parameter values were Gaussian random variables. Statistically, the probability that a statistical parameter value in a specific class lies in the interval [mean − 3std, mean + 3std] is 99.73% [36], where mean and std are the mean and standard deviation of the values in that class. This study used that range as the limit for outsider detection to ensure that outsiders were both far from the central point and of low probability. Outsiders were defined as data points outside the confidence interval (CI), which was determined from the central coordinate (CC) (i.e., the central point) and the standard deviation of each dimension (i.e., each statistical parameter or signature) of the signature space. Figure 4 illustrates how the central point, inner points, and outsider points are identified in a two-dimensional signature space.

Denote by $X_i^k = \{x \mid sp = k, cl = i\}$ the set of values of the $k$th statistical parameter of all data points in class $i$. The CC of class $i$, $CC_i$, is defined in Equation (8), and the CI of the $k$th statistical parameter (i.e., the $k$th dimension) of the data points in the signature space of class $i$ is given in Equation (9):

$$CC_i = \left(\mathrm{mean}(X_i^1),\ \mathrm{mean}(X_i^2),\ \ldots,\ \mathrm{mean}(X_i^K)\right) \tag{8}$$

$$CI_i^k = \left[\mathrm{mean}(X_i^k) - 3\,\mathrm{std}(X_i^k),\ \mathrm{mean}(X_i^k) + 3\,\mathrm{std}(X_i^k)\right] \tag{9}$$

where $K$ is the number of dimensions of the signature space. A data point is considered an outsider if its value in any dimension lies outside the CI of that dimension. The dataset improvement process was carried out separately for each class and is illustrated in Figure 5. Whenever outsiders are detected and eliminated, the dataset is updated and the CC and CIs are recomputed, so new outsiders may be detected and eliminated. The process ends when no outsider is detected in the updated dataset.

Figure 5. The process of improving the training dataset.
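The detect, eliminate, and update loop described above can be sketched for one class as follows. This is an illustration under stated assumptions rather than the authors' implementation: the sample standard deviation is assumed, the function name is hypothetical, and the toy points are made up.

```python
import statistics


def renovate_class(points, n_std=3.0):
    """Iteratively drop points lying outside mean ± n_std * std in any dimension."""
    kept = list(points)
    while len(kept) > 1:
        dims = list(zip(*kept))                        # values per dimension
        centre = [statistics.mean(d) for d in dims]    # central coordinate
        spread = [statistics.stdev(d) for d in dims]   # per-dimension std
        inside = [
            p for p in kept
            if all(abs(v - c) <= n_std * s
                   for v, c, s in zip(p, centre, spread))
        ]
        if len(inside) == len(kept):
            break          # no outsider detected: the dataset is stable
        kept = inside      # update the dataset and recompute the limits
    return kept


# Hypothetical class: 20 clustered points plus one distant point.
cluster = [(0.50 + 0.01 * i, 0.40 + 0.01 * (i % 5)) for i in range(20)]
points = cluster + [(5.0, 0.42)]
renovated = renovate_class(points)
print(len(points), "->", len(renovated))
```

Recomputing the mean and standard deviation after each removal matters because a distant point inflates the standard deviation and can mask milder outsiders on the first pass. With very small classes the ±3 std limit may never be exceeded, since the largest possible deviation in units of the standard deviation is bounded by the sample size.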

Classification
This study used a two-class SVM classifier, whose theory is based on the idea of structural risk minimization [37]. The SVM minimizes the generalization error by maximizing the geometric margin between the two classes, and is therefore also known as the maximum margin classifier. In this study, a kernel function was used to map the input data into a high-dimensional feature space, in which the best hyperplane discriminating between the two classes of input data was sought. The best hyperplane maximizes the margin between the two classes in the feature space. This quadratic optimization problem was solved using Lagrange multipliers. The term "support vectors" refers to the points of each class that are nearest to the optimal hyperplane [38]. Support vectors are selected along the surface defined by a kernel function, which can be chosen from functions such as polynomial, linear, radial basis, and sigmoid during the training phase [39]. Based on the support vectors, which are members of the set of training inputs, the SVM assigns data to one of the two class labels.
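Of the kernel families named above, the radial basis function is perhaps the easiest to sketch. The snippet assumes the common parameterization exp(−γ‖x − z‖²), with γ as the width parameter; the function name and sample points are illustrative.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF kernel value exp(-gamma * ||x - z||^2) for two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


# Hypothetical two-dimensional feature vectors.
same_point = rbf_kernel((0.2, 0.7), (0.2, 0.7), gamma=0.5)
far_point = rbf_kernel((0.0, 0.0), (1.0, 1.0), gamma=0.5)
print(same_point, far_point)
```

The kernel equals 1 for identical points and decays toward 0 as the points move apart; a larger γ makes the decay faster and the decision boundary more local.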
Kernel parameter selection is one of the significant details of SVM modeling. In this paper, we used the radial basis function (RBF), a common kernel function that can be adapted to any sample distribution through parameter selection and that is increasingly used for the nonlinear mapping of SVMs. The RBF kernel is expressed as:

$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right) \tag{10}$$

The corresponding minimization problem of the SVM is given in Equation (11). The minimum value of Equation (11) depends on the choice of the parameters (C, γ). In this study, the grid search method was used to obtain the final optimal parameters (C, γ) [40]. This method takes m values of C and n values of γ, trains an SVM for each of the m × n combinations of (C, γ), and estimates the learning accuracy of each. The combination with the highest accuracy is taken as the optimal parameters.

Figure 6 shows the setup for AE signal acquisition from a water pipeline system. The pipe, made of 304 stainless steel, had an outside diameter of 34 mm and a wall thickness of 3.38 mm. A pump kept the water flow constant at a pressure of 3 bar. The experiments were carried out at a stable temperature of approximately 29 °C. AE sensors were mounted on both sides of the test pipe segment, 1000 mm from the leak position. Wideband differential-auto sensor test (WDI-AST) sensors were used to provide high sensitivity and a wide frequency band. The characteristics of the sensors are summarized in Table 3.

The experiments were based on two leak cases with different pinhole diameters, 0.3 mm and 2.0 mm, which were considered datasets 1 and 2, respectively. The AE signals were collected in one-second segments at a sampling frequency of 1 MHz. Details of the datasets used to assess the proposed method are given in Table 4, where "normal" means the no-leakage case. Because the datasets were acquired on different dates, and operating conditions such as temperature, pressure, and flow rate affect AE signals, the "normal" data of each dataset was recorded together with the corresponding leak data so that the background conditions are consistent.
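The grid search over (C, γ) described in the Classification section amounts to an exhaustive loop over the m × n combinations. The sketch below shows only that enumeration: `toy_score` is a hypothetical stand-in for training an SVM and estimating its accuracy, and the candidate values are assumptions, not the grid used in the study.

```python
import itertools
import math


def grid_search(c_values, gamma_values, score_fn):
    """Evaluate every (C, gamma) pair and return (best_score, C, gamma)."""
    best = None
    for c, gamma in itertools.product(c_values, gamma_values):
        score = score_fn(c, gamma)
        if best is None or score > best[0]:
            best = (score, c, gamma)
    return best


def toy_score(c, gamma):
    # Hypothetical accuracy surface that peaks at C = 8 and gamma = 0.25.
    # In practice this would train an SVM and return its validation accuracy.
    return 1.0 - 0.01 * abs(math.log2(c) - 3) - 0.02 * abs(math.log2(gamma) + 2)


c_values = [2.0 ** k for k in range(-1, 6)]      # m = 7 candidates for C
gamma_values = [2.0 ** k for k in range(-5, 2)]  # n = 7 candidates for gamma
best_score, best_c, best_gamma = grid_search(c_values, gamma_values, toy_score)
print(best_c, best_gamma)
```

Candidates spaced by powers of two are a common convention because the useful ranges of C and γ typically span several orders of magnitude.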

Results and Discussion
Figures 7 and 8 show, for each dataset, one AE signal sample of each case in the time domain together with its fast Fourier transform in the frequency domain. The original signals clearly contained noise, and there was little visible difference between signals in the healthy and unhealthy states. To extract the most informative part of the signals, sub-bands were first produced by applying the DWPT to each raw AE signal. The optimal sub-band was then selected according to the minimum wavelet entropy and used to reconstruct the AE signal. Figure 9 shows the difference between signals before and after denoising in both the time and frequency domains.

In the next step, the fault signature pool was created from the reconstructed AE signals in the analysis dataset, and the MSAC was used to evaluate the signatures. Table 5 lists the signatures from best to worst in terms of leak detection, together with their MSACs for each case. The two top-ranked signatures were then selected as the discriminatory feature subset. The feature subset that was most discriminative for both cases comprised one time-domain parameter, the square-mean-root, and one frequency-domain parameter, the spectral centroid. Figure 10 illustrates the distribution of data points according to the selected features for each leak case. Data points in the same class were more concentrated for the lower-level leak (pinhole size of 0.3 mm) than for the higher-level leak (pinhole size of 2.0 mm), while the separation between classes was smaller in the first case than in the second. A possible reason is that the instability of the AE signal increases along with the leakage level. It follows that leak detection using statistical parameters of AE signals is limited by the leak level in both directions.
Specifically, the greater the leakage level, the lower the intraclass concentration and the greater the interclass separability. To enhance the stability and quality of the SVM classifiers, the training dataset needs to be improved by detecting and removing outsiders, which may be noise data points because of their low probability and weak generalization. Based on the renovated analysis dataset, the SVM classifiers were trained and then used to detect leaks in the evaluation dataset. To evaluate the proposed method, this study used 10-fold cross validation to compare classification accuracies (CAs). The CA, given in Equation (12), is the ratio of the number of correctly classified data points (i.e., true points), $N_{TP}$, to the total number of data points, $N_{total}$:

$$CA = \frac{N_{TP}}{N_{total}} \times 100\% \tag{12}$$

The CAs of the three methods are shown in Table 6. Here, "All" denotes the conventional method, which uses all 21 fault signatures without signature selection or data renovation, whereas the method of [12] uses signature selection with the MSAC but no data renovation. In general, the proposed method, which adds the data enhancement block, outperformed the method in [12], which uses the same signature subset. Specifically, across the 10 assessments of both cases, the former was never worse than the latter, and it surpassed the latter four times in dataset 1 and three times in dataset 2. The former improved the average CA by 4.61% and 1.58% over the latter for datasets 1 and 2, respectively. These results indicate that the proposed method is both more accurate and more stable than the previous method. Compared with the method using all the signatures, the proposed method had a better CA in dataset 1 but a worse one in dataset 2.
However, the proposed method, which uses only two features, significantly reduces the dimensionality of the fault signature vector compared with the method without signature selection, which uses 21 features. This makes it possible to reduce the computational burden of configuring signature vectors in real applications. Moreover, low-dimensional signature vectors help reduce the time needed to train classifiers. Table 7 compares the computational time of the proposed method and the conventional method that employs all 21 signatures. Compared with the conventional method, the processing speed of the proposed method improved by 31.17% in training, 76.77% in testing, and 40.14% in total for dataset 1. Similarly, the improvements for dataset 2 were 41.63%, 76.80%, and 48.63%, respectively. All experiments were implemented in MATLAB R2018b on an Intel Core i7-7700 CPU operating at 3.60 GHz.
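The evaluation protocol above relies on two simple computations: the classification accuracy as a ratio of correct predictions, and a split of the data into 10 folds. The sketch below assumes contiguous, near-equal folds without shuffling; the labels, sample count, and function names are hypothetical.

```python
def classification_accuracy(true_labels, predicted_labels):
    """Percentage of data points whose predicted label matches the true one."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    folds = []
    start = 0
    for i in range(k):
        # The first (n_samples % k) folds receive one extra sample.
        size = n_samples // k + (1 if i < n_samples % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


# Hypothetical labels for four evaluation points.
truth = ["leak", "normal", "leak", "normal"]
predicted = ["leak", "normal", "normal", "normal"]
accuracy = classification_accuracy(truth, predicted)

folds = k_fold_indices(25, 10)
print(accuracy, [len(f) for f in folds])
```

In each of the 10 rounds, one fold serves as the test set and the remaining nine are used for training, so every data point is tested exactly once.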

Conclusions
In this paper, an intelligent leak detection method based on a model using statistical parameters extracted from AE signals was used for early leak detection. Because leak signals depend on many operating conditions, the training data available in real-life situations is usually small. To address the small dataset problem, a data improvement method that enhances the generalization ability of the data was proposed. To evaluate the proposed method, this study used datasets obtained from two artificial leak cases generated by pinholes with diameters of 0.3 mm and 2.0 mm. Experimental results showed that adding the data improvement block to the leak detection scheme enhances leak detection in terms of both accuracy and stability.

Conflicts of Interest:
The authors declare no conflict of interest.