A Novel Feature Selection Based on VMD and Information Gain for Pipe Blockages

Abstract: Targeting the challenge of determining the degree of blockage in buried pipelines and the difficulty of effectively extracting blockage features, a blockage detection method integrating variational mode decomposition (VMD) and information gain is proposed. Acoustic impulse response signals were obtained by deconvolving the output signals of the system and were then subjected to VMD to obtain 12 components in different frequency ranges. Next, information gain (IG) was introduced to characterize the 12 components quantitatively, through which the components containing rich information about the pipe conditions were selected. Sound pressure level conversion was then performed on the selected components to amplify any changes in the sound field. Finally, the root mean square entropy (RMSE) was calculated to constitute the feature eigenvectors, which were input into a random forest (RF) classifier for pipeline defect identification. The experimental results demonstrate that the proposed method can effectively determine the degree of blockage in the running state. It can also eliminate the interference of functional parts such as lateral connections during the identification process, thereby improving the identification accuracy. The present study has both theoretical significance and application value in the field of defect detection and recognition.


Introduction
Sewers are the lifelines of urban construction and social development. During sewer operation, factors such as overload, fatigue, and environmental pollution cause cracks, blockages, leakages, and other functional defects inside the pipes, thereby shortening their service life [1]. Blockage is a ubiquitous phenomenon in pipeline operation. If a slight blockage is not detected and managed in time, the blocked area enlarges continuously and eventually becomes a severe blockage [2]. A severe blockage compromises the carrying capacity of the pipe and the reliability of the system, increases the possibility of environmental pollution and the redundancy of the system, and causes over-pressure in parts of the system that raises the possibility of leakage, ultimately resulting in a serious waste of water resources and environmental pollution. The losses resulting from severe and multiple blockages can be minimized as long as minor blockages are detected and handled in time [3]. Because the pipelines are buried deep underground, evaluating their operational condition is complicated and challenging [4]. Hence, non-destructive testing of buried pipeline conditions is important for ensuring efficient and reliable operation, and it is both a focus and a challenge of urban infrastructure maintenance [5,6].
VMD casts signal decomposition in a variational framework and determines the center frequency and bandwidth of each component by iteratively searching for the optimal solution of the variational model [32]. By decomposing an actual signal $x(t)$ into $k$ discrete mode components $u(t)$, it adaptively separates the signal in the frequency domain into its components, which highlights the local features of the data and provides better noise robustness.
The variational problem is constructed under the constraint that the sum of the $k$ mode components equals the original signal $x(t)$. The frequency bandwidth of each mode is estimated as follows: (1) The Hilbert transform is applied to each mode function $u_k(t)$ to obtain its analytic signal and hence its one-sided spectrum. (2) Each analytic signal is multiplied by an exponential term tuned to its estimated center frequency $\omega_k$, which shifts the one-sided spectrum to baseband. (3) The bandwidth of each mode $u_k(t)$ is estimated through the Gaussian smoothness of the demodulated signal, and the sum of the bandwidths is minimized. The resulting constrained variational problem is expressed as:

$$\min_{\{u_k\},\{\omega_k\}} \left\{ \sum_{k} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\|_2^2 \right\} \quad \text{s.t.} \quad \sum_{k} u_k(t) = x(t) \tag{2}$$

where $\delta(t)$ represents the impulse function, $\{u_k\} = \{u_1, u_2, \cdots, u_k\}$ represents the set of mode function components, and $\{\omega_k\} = \{\omega_1, \omega_2, \cdots, \omega_k\}$ denotes the center frequencies of the mode components.
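To find the optimal solution of the foregoing variational model, the Lagrange multiplier $\lambda(t)$ and the quadratic penalty factor $\alpha$ are used to transform the constrained variational problem into an unconstrained one. The augmented Lagrangian is:

$$L(\{u_k\},\{\omega_k\},\lambda) = \alpha \sum_{k} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\|_2^2 + \left\| x(t) - \sum_{k} u_k(t) \right\|_2^2 + \left\langle \lambda(t),\; x(t) - \sum_{k} u_k(t) \right\rangle \tag{3}$$

During the solving process, the alternating direction method of multipliers (ADMM) is employed, in which $u_k^{n+1}$, $\omega_k^{n+1}$, and $\lambda^{n+1}$ are updated alternately. The saddle point $\zeta$ of Equation (3) is searched for, which is precisely the optimal solution of the variational problem in Equation (2). Accordingly, the signal $x(t)$ is decomposed into $k$ discrete mode components $u(t)$.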
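For reference, the alternating updates can be written out explicitly. The expressions below are the frequency-domain ADMM updates of the standard VMD formulation; they are not given in this section. The hat denotes the Fourier transform, and the dual ascent step size $\tau$ is a symbol introduced here as an assumption.

```latex
\hat{u}_k^{\,n+1}(\omega) =
  \frac{\hat{x}(\omega) - \sum_{i<k} \hat{u}_i^{\,n+1}(\omega)
        - \sum_{i>k} \hat{u}_i^{\,n}(\omega) + \hat{\lambda}^{n}(\omega)/2}
       {1 + 2\alpha\,(\omega - \omega_k^{n})^2}

\omega_k^{\,n+1} =
  \frac{\int_0^{\infty} \omega\,\lvert \hat{u}_k^{\,n+1}(\omega) \rvert^2 \, d\omega}
       {\int_0^{\infty} \lvert \hat{u}_k^{\,n+1}(\omega) \rvert^2 \, d\omega}

\hat{\lambda}^{\,n+1}(\omega) =
  \hat{\lambda}^{\,n}(\omega)
  + \tau \Bigl( \hat{x}(\omega) - \sum_{k} \hat{u}_k^{\,n+1}(\omega) \Bigr)
```

Each mode update acts as a Wiener-type filter centered on the current $\omega_k$, and each center frequency is re-estimated as the center of gravity of the power spectrum of its mode.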

Information Gain-Based Selection of Effective IMF Components
The method operates on the following principle. Consider a sample set $D$ and a continuous attribute $a$, and assume that $a$ takes $n$ different values on $D$, sorted in ascending order as $a_1, a_2, \ldots, a_n$. For two adjacent values $a_i$ and $a_{i+1}$, any partition point $t$ taken in the interval $[a_i, a_{i+1})$ produces the same division. Thus, the midpoint $(a_i + a_{i+1})/2$ of each interval is taken as a candidate partition point. A partition point $t$ divides the sample set $D$ into the subsets $D_t^-$ and $D_t^+$, where $D_t^-$ contains the samples whose value of $a$ is not greater than $t$ and $D_t^+$ contains those whose value of $a$ is greater than $t$. The information gain is computed as [33]:

$$\mathrm{Gain}(D, a) = \max_{t \in T_a} \mathrm{Gain}(D, a, t) = \max_{t \in T_a} \left[ \mathrm{Ent}(D) - \sum_{\lambda \in \{-,+\}} \frac{|D_t^{\lambda}|}{|D|}\, \mathrm{Ent}(D_t^{\lambda}) \right]$$

where $T_a$ is the set of candidate partition points, $\mathrm{Ent}(\cdot)$ is the information entropy, and $\mathrm{Gain}(D, a, t)$ denotes the information gain of the sample set $D$ after it is split in two at the partition point $t$. The partition point that maximizes $\mathrm{Gain}(D, a, t)$ is selected according to the following steps:
Step 1: The information entropy of the given sample set $D$ is computed.
Step 2: For each attribute $a$, i.e., each component, the information gain $\mathrm{Gain}(D, a, t)$ and the corresponding partition point $t$ are computed.
Step 3: The component with the maximum value of $\mathrm{Gain}(D, a, t)$ is chosen as the root node of the decision tree. The sample set is split into two parts according to the computed partition point, where the samples greater than $t$ are denoted by $D_t^+$ and the samples less than or equal to $t$ are denoted by $D_t^-$.
Step 4: The remaining components are treated as the dataset of the previous node, and the component with the largest information gain is chosen as the non-leaf node split from the root node.
Step 5: For each component, Steps 3 and 4 are repeated as long as the information gain value is greater than the given threshold.
Step 6: The component selection is terminated once the maximum information gain is less than the given threshold.
The information gain-based component selection method allows accurate extraction of the components that contain the majority of blockage features.
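The bi-partition rule for a single continuous attribute can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the toy values, and the base-2 logarithm are assumptions, since the text does not state the logarithm base.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a list of class labels (base 2 is an assumption)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def best_partition(values, labels):
    """Return (gain, t) maximizing Gain(D, a, t) over the candidate midpoints."""
    base = entropy(labels)
    distinct = sorted(set(values))
    best_gain, best_t = 0.0, None
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2  # candidate partition point: midpoint of adjacent values
        left = [y for v, y in zip(values, labels) if v <= t]   # D_t^-
        right = [y for v, y in zip(values, labels) if v > t]   # D_t^+
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - weighted
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_gain, best_t


# Toy attribute values (hypothetical, not measured data) for two pipe conditions.
values = [0.10, 0.20, 0.30, 0.80, 0.90, 1.00]
labels = ["normal", "normal", "normal", "blocked", "blocked", "blocked"]
gain, t = best_partition(values, labels)
print(gain, t)
```

In this toy case the midpoint between 0.30 and 0.80 separates the two classes perfectly, so the gain equals the entropy of the full set.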

Sound Pressure Level Conversion
The conversion of sound pressure level amplifies the content of acoustic signals and enhances the distinction among various severities of blockages, so that the characteristic information about different pipe conditions is more easily extracted in the subsequent processing [34]. The sound pressure level is computed as:

$$L_p = 20 \log_{10} \frac{p_e}{p_0}$$

where $p_e$ denotes the effective sound pressure value of the original acoustic signal and $p_0$ denotes the effective value of the reference sound pressure, which is taken as $2 \times 10^{-5}$ Pa herein.
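As a small worked example, the sketch below assumes the standard definition of sound pressure level, 20 log10(p_e / p_0), with the reference pressure of 2 × 10^-5 Pa stated in the text. Taking the effective pressure as the RMS of a frame of samples is an assumption, as are the function name and the toy frame.

```python
import math

P_REF = 2e-5  # reference effective sound pressure in Pa, as stated in the text


def sound_pressure_level(samples, p_ref=P_REF):
    """Sound pressure level in dB of one frame of pressure samples (in Pa)."""
    # Effective pressure taken as the RMS of the frame (assumption).
    p_eff = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(p_eff / p_ref)


# Hypothetical frame with an effective pressure of 0.02 Pa.
frame = [0.02, -0.02, 0.02, -0.02]
spl = sound_pressure_level(frame)
print(round(spl, 2))
```

A ratio of 1000 between the effective and reference pressures corresponds to 60 dB. A silent frame (all zeros) would make the logarithm undefined and needs separate handling.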

RMSE
For vibration signals, the RMS value indicates the change in instantaneous signal amplitude within the sampling period and thus reflects the vibration energy. Meanwhile, information entropy represents the system complexity resulting from multiple uncertain factors [35]. The RMSE $E_{\mathrm{RMS}}$ is obtained by integrating the RMS into the information entropy, which combines the advantages of the two. Different fault types can be represented with different RMSE values [36].
The computational procedure for the RMS entropy is as follows:
Step 1: The RMS values of the IMF components are computed following the sound pressure level conversion:

$$R_i = \sqrt{\frac{1}{N} \sum_{n=1}^{N} x_i^2(n)}$$

where $R_i$ denotes the RMS value of the $i$-th component, $x_i(n)$ is its $n$-th sample, and $N$ is the number of sample points.
Step 2: The RMS values are assembled into an eigenvector $A$:

$$A = [R_1, R_2, \ldots, R_K]$$

Step 3: The RMS values are normalized:

$$p_i = \frac{R_i}{\sum_{j=1}^{K} R_j}$$

Step 4: The RMSE is derived from the definition of information entropy as:

$$E_{\mathrm{RMS}} = -\sum_{i=1}^{K} p_i \log p_i$$

where $K$ represents the number of IMF components and $E_{\mathrm{RMS}}$ denotes the RMS entropy.
Appl. Sci. 2021, 11, 10824
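The four steps can be sketched in a few lines. The sketch assumes that the RMS of each component is the square root of its mean squared sample, that homogenization means dividing each RMS value by the sum of all RMS values, and that the entropy uses the natural logarithm. None of these details is spelled out in the text, and the function names and toy components are hypothetical.

```python
import math


def rms(signal):
    """Root mean square of one component."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def rms_entropy(components):
    """RMS entropy of a list of components (natural log is an assumption)."""
    r = [rms(c) for c in components]      # Steps 1-2: eigenvector A of RMS values
    total = sum(r)
    p = [v / total for v in r]            # Step 3: normalize so the values sum to 1
    return -sum(q * math.log(q) for q in p if q > 0)   # Step 4: entropy


# Two hypothetical components with RMS values 1 and 2.
components = [[1.0, -1.0, 1.0, -1.0], [2.0, -2.0, 2.0, -2.0]]
e_rms = rms_entropy(components)
print(round(e_rms, 4))
```

When all components carry the same RMS value the entropy reaches its maximum, log K, and it decreases as the energy concentrates in fewer components.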

Proposed Feature Extraction Method
The flowchart of the VMD-Information Gain pipe blockage detection process is illustrated in Figure 1.
Step 1: For each of the 9 statuses of buried pipelines, 300 sets of samples are collected, giving 2700 sets of sample data in total.
Step 2: The original acoustic signals in various statuses are denoised through the Butterworth filter, thereby obtaining acoustic signals with a [0-6000] Hz frequency range.
Step 3: The denoised signals in various statuses are decomposed by the VMD, and the number of optimal decomposition layers K is decided by determining whether overdecomposition is produced or not.
Step 4: VMD operation is performed on the 100 sets of sampled data in 9 statuses, thereby deriving K components. By utilizing information gain, M effective components are filtered from the K components. Information gain values of the components are selected as per the principle of decision tree selection. A threshold is set up, which is then compared with the filtered information gain values. Filtering is terminated if the information gain value of a component signal is less than the threshold.
Step 5: Sound pressure level conversion is performed on the filtered M effective components, and then the RMS entropy values are computed.
Step 6: The RMS entropy in various statuses are constituted separately into eigenvectors, which are then input into four different classifiers SVM, KNN, ELM, and RF, for effective identification of the pipeline operating conditions.

Experimental Setup and Experiment Conditions
An experimental platform was built in the laboratory [37], as shown in Figure 2, for acoustic detection of pipe blockages. In this experiment, the blocking substances were simulated with clay semi-cylinders, and sinusoidal sweep signals were used as the excitation signals because their frequency bands can be adjusted as required. Exciting a multi-DOF system with sinusoidal sweep signals concentrates the acoustic energy within the frequency range to which the pipe is sensitive, so that the required information in that band can be easily elicited. Such signals are often used as excitation signals for outdoor detection.
The clay pipeline had a diameter of 150 mm and a length of 14.4 m. During detection, a computer running LabVIEW software controlled the virtual instrument to generate sinusoidal sweep signals within a 100-6000 Hz frequency range. Then, the analog output port of the NI PXIe-6363 was controlled via the DAQ assistant of LabVIEW to output analog voltage signals. After the signals were amplified by a power amplifier, the loudspeaker was driven to generate audio signals, which were transmitted into the pipeline as the excitation source. These sound waves undergo complex interactions with the acoustic impedance discontinuities in the pipe interior. The echo signals were received by the microphone placed at the pipe head end and then uploaded to the computer for storage. The sampling frequency was 44,100 Hz. Changes in the acoustic performance of the pipe interior were identified by analyzing the received signals. The equipment comprised the LM4950 power amplifier (Texas Instruments, Dallas, TX, USA), the FR874OHM loudspeaker (Visaton, Germany), and the SPM0208HE5 microphone (Knowles Acoustics, Illinois, USA).
For the sake of practical simulation, experiments were conducted at varying degrees of blockage on the experimental platform shown in Figure 2. Water flow was provided in the pipe to simulate operating conditions, and its rate was set by the water pump. In the present experiments, the maximum simulated flow rate was 7 L/s, and the other simulated flow rates were 0.42 L/s, 1 L/s, 1.8 L/s, 4.25 L/s, and 6.1 L/s, which were used to form intra-pipe water levels of different heights. The severity of pipe blockage was defined by the laboratory as the ratio of the blockage height to the cross-sectional pipe height, expressed as a percentage. Simulated blockages with heights of 20 mm, 40 mm, and 55 mm were placed separately in the pipe of 150 mm in diameter. The heights of these rigid, non-porous blockages accounted for 13%, 26%, and 37% of the cross-sectional pipe height, respectively, which were approximately considered to be slightly blocked, moderately blocked, and moderately to severely blocked. The detailed data about the experiments are shown in Table 1.
In this study, the baseline pipe condition was the empty pipe in normal operation. A second condition was the normally operating empty pipe containing conventional parts, i.e., lateral connections (LC). The pipe conditions with single blockages included a 20 mm blockage, a 40 mm blockage, and a 55 mm blockage. The pipe conditions with multiple blockages included the coexistence of a 40 mm blockage and a 55 mm blockage placed at different positions inside the pipeline; the coexistence of a 40 mm blockage and a lateral connection; the coexistence of a 55 mm blockage and a lateral connection; and the coexistence of a 40 mm blockage, a 55 mm blockage, and a lateral connection. This comes to a total of 9 pipe conditions. There were 100 sets of samples for each condition, with a total of 900 sets.
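The severity definition can be checked with a short calculation. The sketch below applies the height-ratio definition given above to the three blockage heights and the 150 mm pipe; the function and variable names are illustrative.

```python
PIPE_DIAMETER_MM = 150.0  # pipe diameter stated in the text


def blockage_severity(height_mm, diameter_mm=PIPE_DIAMETER_MM):
    """Blockage height as a percentage of the cross-sectional pipe height."""
    return 100.0 * height_mm / diameter_mm


# Blockage heights used in the experiments.
severities = {h: round(blockage_severity(h), 1) for h in (20, 40, 55)}
print(severities)
```

The ratios come out at roughly 13.3%, 26.7%, and 36.7%, in line with the reported 13%, 26%, and 37% up to rounding.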

Data Pre-Processing
In signal processing, the aim is to maximize the extraction of valuable information from the signals. For physical systems such as buried pipelines, the sound pulse or frequency response is a characteristic quantity that contains detailed information, including the system geometry, sound velocity, and boundary conditions. Any changes in the properties of the physical system are reflected in the variations of the sound pulse or the relevant frequency response. During propagation in the pipe, acoustic signals collide with the pipe wall, blockages, and lateral connections, causing reflection, refraction, and diffraction. To investigate the frequency content of the measured sound pressure signals, the time-domain signals were transformed to the frequency domain via the Fourier transform. Figure 3 illustrates the time-domain and frequency-domain waveforms of the acoustic signals in typical pipe conditions.
In Figure 3a, the "Distance" on the abscissa is the propagation distance of sound waves in the pipe, which facilitates the positional identification of the pipe tail end, blockages, and lateral connections in the time-frequency domain diagrams. In the present paper, the propagation distance equals the propagation time multiplied by the sound propagation velocity in air (approximately 340 m/s). The velocity in air is used because the water flow accounts for 20% of the cross-sectional area of the drainage pipe, so the sound is transmitted mostly through the air. Figure 3a depicts the time-domain waveforms of the original response signals. With increasing propagation distance in the pipeline, the energy of the sound waves was attenuated continuously, which was manifested as decreasing signal amplitude with increasing distance in the sound pressure graphs. Nevertheless, the sound pressure waveforms in the three pipe conditions exhibit no distinct differences in the time domain, so it is difficult to distinguish the position and size of blockages in the time-domain waveforms, and blockages are hardly distinguishable even from the pipe fittings (lateral connections). This is attributed mainly to the presence of environmental noise and to the varying responses of objects to different frequency bands. As is clear from Figure 3b, the frequencies in the three pipe conditions are concentrated primarily in the 0-6000 Hz range of the spectrograms, while the components in other frequency bands are weak. Hence, the Butterworth filter was utilized to denoise the acoustic signals.
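The conversion from recorded samples to the "Distance" axis can be sketched as follows. The sketch uses the 44,100 Hz sampling frequency and the 340 m/s sound velocity given in the text and follows the stated rule that distance equals propagation time multiplied by velocity. Whether the echo path is halved to account for the round trip is not stated, so no halving is applied here. The function name is hypothetical.

```python
SAMPLING_RATE_HZ = 44_100      # sampling frequency stated in the text
SOUND_SPEED_M_PER_S = 340.0    # approximate sound velocity in air, as in the text


def sample_to_distance(index, fs=SAMPLING_RATE_HZ, c=SOUND_SPEED_M_PER_S):
    """Propagation distance in metres reached at a given sample index."""
    propagation_time_s = index / fs
    return propagation_time_s * c


# A sample recorded 0.1 s after the start of the response.
distance_m = sample_to_distance(4410)
print(distance_m)
```

One sample period corresponds to about 7.7 mm of propagation distance, which sets the spatial resolution of the time-domain plot.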
Given the difficulty of identifying the pipe operational status from conventional time-domain waveforms and spectrograms of the sound pressure signals, further analysis and processing of these signals were needed. The time-frequency map shows how the frequency content of the acoustic signal changes with time and represents the energy carried by each frequency component through cold and warm colors: the warmer the color, the greater the energy. Common time-frequency transformation methods include the short-time Fourier transform, the Wigner distribution, and the wavelet transform. Compared with the first two, the wavelet transform has adaptive time-frequency resolution and a faster algorithm. Therefore, the sound signal was converted into a time-frequency map by the continuous wavelet transform. As shown in Figure 4, typical pipe operating conditions are selected for time-frequency map analysis.
As shown in Figure 4a, a lighter color indicates higher energy at the corresponding site. In the time-frequency map of the normal pipeline, energy concentrations are present only at the pipeline head and tail ends, while other locations show no energy concentration.
For single blockages and lateral connections, as can be seen from Figure 4b,c, the energies have a good time-frequency concentration at the locations of lateral connections and blockages in the pipe, which conform to the acoustic theory. The energies gather at the blockage site while diverging slightly at the lateral connection site. In the case of multiple blockages, as can be seen from Figure 4d, with the increase in blockages, gradual attenuation was observed in the energies between various blockages and lateral connections. Moreover, the frequency bands, where the energies in different operating conditions appear, remain relatively stable. However, energy overlapping occurs when there are two blockages and a lateral connection in the pipeline. Although the energy spectrogram can locate the positions of the pipe head and tail ends, blockages, and lateral connections accurately, it is unable to determine the pipe operating condition properly.
Hence, further identification and research of the pipe condition were necessary. Since the sound waves of different frequencies were reflected in varying degrees at different rates, the intensities of signals and their sensitivities to the pipe condition vary in different frequency ranges. The characteristic components irrelevant to the pipeline blockage will interfere with the classification, thereby resulting in lowered classification accuracy.

Sensitive IMF Selection
VMD is applied to the acoustic signals for the buried pipe conditions. The center frequencies of the intrinsic mode function (IMF) components obtained by VMD are distributed from low to high, and candidate values of the number of IMF components K are evaluated starting from 1. Once the center frequency of the last IMF reaches its maximum value for the first time, the decomposition is no longer insufficient, so the value of K is increased gradually until the maximum center frequency remains relatively stable. The mode number K must be decided in advance; here it was determined according to a previous study [38] and set to 12.
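One possible reading of the stabilization criterion is sketched below. It assumes that the maximum center frequency has already been obtained from VMD runs with K = 1, 2, ..., and it returns the smallest K after which that maximum changes by no more than a relative tolerance. The tolerance, the function name, and the frequency values are hypothetical and are not taken from the paper, which sets K = 12 following [38].

```python
def choose_mode_number(max_center_freqs, rel_tol=0.02):
    """Smallest K whose maximum center frequency is stable when K grows by one.

    max_center_freqs[i] is the highest center frequency obtained with K = i + 1.
    """
    for k in range(1, len(max_center_freqs)):
        current, following = max_center_freqs[k - 1], max_center_freqs[k]
        if abs(following - current) <= rel_tol * abs(current):
            return k
    # No stable point found within the tested range: fall back to the largest K.
    return len(max_center_freqs)


# Synthetic maximum center frequencies in Hz for K = 1..8 (illustrative only).
synthetic_freqs = [800, 1900, 3100, 4200, 5100, 5600, 5650, 5660]
chosen_k = choose_mode_number(synthetic_freqs)
print(chosen_k)
```

With these synthetic values the maximum center frequency changes by less than 2% between K = 6 and K = 7, so the sketch returns 6.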
As displayed in Figure 5, a set of acoustic signals with the coexistence of 40 mm and 55 mm blockages and a lateral connection in the pipe was selected for VMD processing. After the pre-setting process and the analysis of the maximum center frequency of the components, the K value for the 9 pipe operating conditions was finalized at 12, yielding a total of 12 IMF components. The 12 IMF components represent the signal characteristics at different frequency scales, but the effective components differ among the operational conditions of the pipeline. The goals of component filtering were to simplify the feature space, reduce complexity, and enhance the system performance. A component is considered more important if it brings richer information to the classification model: its presence or absence leads to a larger change in the amount of information, and the difference in the amount of information before and after its addition is precisely the information gain it brings to the model. Using the information gain, M effective components were selected from the 12 original components and used for the identification of different pipe conditions. The major steps for information gain based filtration of effective components are as follows:
Step 1: Computation of the given sample set D. Nine types of pipe conditions are selected, each of which has 300 samples, totaling 2700 samples. For each sample, 12 components form a dataset D.
Step 2: Computation of information gain. The information gains of the 2700 samples are calculated according to formula (3), as well as the corresponding partition points.
Step 3: Selection of root node. As shown in Figure 6, the information gain value of "IMF1" is the largest among the 12 components, which is thus selected as the root node of the decision tree. The data set D is split into two parts based on the calculated partition points.

Figure 6. Component diagrams from information gain filtering. Nodes shown: IMF3 and IMF11; IMF10, IMF6, IMF4, and IMF2. Branch labels: component value less than or equal to the partition point; component value greater than the partition point.
Step 4: Selection of subnodes. For the remaining 11 components, the information gains are calculated as per Step 2 within each part of the split dataset, and the component with the largest information gain is selected as the subnode. As shown in Figure 6, "IMF3" has the largest information gain on the part of dataset D that is less than or equal to the partition point of IMF1, and is thus selected as the subnode of the left subtree at the second layer. Meanwhile, "IMF11" has the largest information gain on the part of D that is greater than the partition point of IMF1, and is selected as the subnode of the right subtree at the second layer. After passing through the second layer, the data are divided into four parts.
Step 5: For the remaining nine IMF components, the information gains are calculated as per Step 2, where the given threshold for information gain is 0.5. As shown in Figure 6, at the third layer the IMF10 component is selected as the left subnode of the left subtree, the IMF6 component as the right subnode of the left subtree, the IMF4 component as the left subnode of the right subtree, and the IMF2 component as the right subnode of the right subtree. The selection process of the IMFs is shown in Table 2.

Table 2. Sensitive IMF selection based on information gain.
Number of Layers | Screening of Sensitive IMF
The first layer selection IMF | IMF1
The second layer selection IMF | IMF3; IMF11
The third layer selection IMF | IMF10; IMF6; IMF4; IMF2

Step 6: After the third layer selection, the maximum information gain is less than the given threshold of 0.5, and the component filtering is terminated.
To sum up, the root node of the decision tree is component 1, which has the largest information gain at the first layer. After the second layer selection, component 3 has the largest information gain in the left node, whereas component 11 has the largest in the right node. The components in the third layer are IMF10, IMF6, IMF4, and IMF2. According to Figure 6, the components selected by the information gain filtering are IMF1, IMF3, IMF11, IMF10, IMF6, IMF4, and IMF2.
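The layer-by-layer selection can be sketched as a breadth-first tree growth in which each component is used at most once and a node is split only if its best information gain reaches the threshold. This is one interpretation of the steps, not the authors' code: the base-2 entropy, the tie-breaking order, the function names, and the toy data (four classes, four hypothetical components) are all assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def best_gain(values, labels):
    """Largest Gain(D, a, t) over midpoint candidates, with its partition point."""
    base, pts = entropy(labels), sorted(set(values))
    best = (0.0, None)
    for lo, hi in zip(pts, pts[1:]):
        t = (lo + hi) / 2
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        g = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if g > best[0]:
            best = (g, t)
    return best


def select_components(rows, labels, names, threshold=0.5):
    """Return the components chosen layer by layer (breadth-first)."""
    selected, remaining = [], list(names)
    frontier = [list(range(len(rows)))]      # sample indices of each open node
    while frontier and remaining:
        next_frontier = []
        for idx in frontier:
            ys = [labels[i] for i in idx]
            if not remaining or len(set(ys)) < 2:
                continue                     # nothing left to split on, or pure node
            scored = {a: best_gain([rows[i][a] for i in idx], ys) for a in remaining}
            top = max(remaining, key=lambda a: scored[a][0])
            gain, t = scored[top]
            if gain < threshold:
                continue                     # below threshold: stop growing this node
            selected.append(top)
            remaining.remove(top)            # each component is used only once
            next_frontier.append([i for i in idx if rows[i][top] <= t])
            next_frontier.append([i for i in idx if rows[i][top] > t])
        frontier = next_frontier
    return selected


# Toy data: IMF1 separates {A, B} from {C, D}; IMF2 and IMF3 refine; IMF4 is noise.
names = ["IMF1", "IMF2", "IMF3", "IMF4"]
labels = ["A", "A", "B", "B", "C", "C", "D", "D"]
columns = ([0.1, 0.1, 0.2, 0.2, 0.8, 0.8, 0.9, 0.9],
           [1, 1, 2, 2, 1, 1, 1, 1],
           [1, 1, 1, 1, 1, 1, 2, 2],
           [0, 1, 0, 1, 0, 1, 0, 1])
rows = [dict(zip(names, vals)) for vals in zip(*columns)]
selected = select_components(rows, labels, names)
print(selected)
```

On the toy data the uninformative fourth component is never selected, which mirrors how low-gain IMFs are dropped once the threshold is no longer met.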

Sound Pressure Level Conversion
The components filtered based on information gain are subjected to sound pressure level conversion to enhance the discrimination between components for easier feature extraction. Figure 7 compares the sound pressure signals with the signals after sound pressure level conversion.
As is clear from Figure 7a, for a normally operating pipeline, the inflection points of the signals are not easily distinguishable regardless of whether there is a lateral connection or not. As shown in Figure 7b, the inflection points of the signals that underwent sound pressure level conversion are distinguishable. The conversion of sound pressure level better reflects the local features of the signals, which enhances the differentiation between pipe conditions and improves the sensitivity of the acoustic signals.

RMS Entropy Features Extraction and Blockage Recognition
For 9 different pipe conditions, 300 sets of acoustic signals were collected as samples to perform Butterworth filtering, and then VMD was applied to derive 12 different IMF components. Seven effective components were selected through information gain filtering, which was then subjected to sound pressure level conversion for calculating the RMS of each set of samples. For vibration signals, RMS indicates the change in instantaneous signal amplitude within their sampling period, which can reflect their respective vibration energies. Meanwhile, information entropy represents the system complexity resulting from multiple uncertain factors. The RMSE is obtained by integrating the RMS into the information entropy, which combines the advantages of the two. As shown in Figure 8, the RMSE values of 100 samples of IMF1 components were selected.
Figure 8 shows that the ERMS values of different fault types fluctuate in different numerical ranges, so a rough classification of pipe conditions can be achieved after arrangement and comparison. ERMS can distinguish the blocked state from the normal state: the ERMS value of the effective component is lower for the normal pipe than for the blocked pipe. The analysis shows that when the pipe is blocked, the incident wave propagating along the axial direction of the pipe undergoes reflection, refraction, and diffraction at the blockage. The signal therefore becomes more disordered, which is manifested in a larger ERMS value for the blocked pipe than for the pipe in normal operation. The single blockages of different degrees (20 mm, 40 mm, 55 mm) differ only in blockage height, which leads to overlapping ERMS values and affects the final classification.
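One plausible reading of "integrating the RMS into the information entropy" is sketched below: the signal is cut into equal segments, the RMS of each segment is normalized into a probability-like distribution, and the Shannon entropy of that distribution is taken. This construction, the segment count, and the function names are assumptions made for illustration only; the exact ERMS definition used in the experiments may differ.

```python
import math


def rms(samples):
    """Root mean square of a sequence of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def rms_entropy(signal, n_segments):
    """Shannon entropy (nats) of the normalized segment-wise RMS values.

    Trailing samples that do not fill a whole segment are ignored.
    """
    seg_len = len(signal) // n_segments
    if seg_len == 0:
        raise ValueError("signal is shorter than the number of segments")
    values = [rms(signal[i * seg_len:(i + 1) * seg_len])
              for i in range(n_segments)]
    total = sum(values)
    if total == 0:
        return 0.0
    probs = [v / total for v in values]
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical signals: energy spread evenly vs. concentrated in one segment.
uniform = [1.0, -1.0] * 8
peaky = [1.0, -1.0] + [0.0] * 14
e_uniform = rms_entropy(uniform, n_segments=4)
e_peaky = rms_entropy(peaky, n_segments=4)
```

Under this assumed definition, a signal whose energy is spread over many segments gives a larger entropy than one whose energy sits in a single segment, which matches the qualitative statement that a more disordered signal has a larger ERMS value.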
From each class of samples in the above data set, 200 samples were randomly selected as training samples and 100 as testing samples. To verify the performance of the VMD-IG-RMSE-RF method, KNN, SVM, ELM, and RF were used to identify the running state of the pipeline. The identification accuracy over 20 cross-validation runs is shown in Table 3. It can be seen from Table 3 that the average recognition accuracy of the four classifiers (KNN, SVM, ELM, RF) reaches 87.56%. The average recognition accuracy of the RF classifier is 99.58%, which is higher than that of the other three classifiers, and its maximum recognition accuracy reaches 100%. This high accuracy is attributed to the relatively ideal test environment and the good performance of the data acquisition system, which make the characteristics of the signal data easier to identify. The results show that the method performs well. Therefore, RF is the classifier used in the subsequent comparative experiments.
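The sampling protocol can be sketched as a per-class random split repeated over several runs, with the accuracies then summarized by their mean and standard deviation. The sketch assumes 300 samples for each of the 9 conditions and reads the "20 cross-verifications" as 20 independent random splits; both are interpretations. The classifier itself is omitted, and all names and the seed are hypothetical.

```python
import random
import statistics


def stratified_split(labels, n_train_per_class, rng):
    """Return (train_idx, test_idx) with a fixed number of training samples per class."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        train_idx.extend(shuffled[:n_train_per_class])
        test_idx.extend(shuffled[n_train_per_class:])
    return train_idx, test_idx


def summarize(accuracies):
    """Mean and sample standard deviation of per-run accuracies."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Assumed layout: 9 pipe conditions with 300 samples each.
labels = [c for c in range(9) for _ in range(300)]
rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
train_idx, test_idx = stratified_split(labels, 200, rng)
```

Calling `stratified_split` once per run with the same `rng` gives a different split each time, and `summarize` can then be applied to the list of per-run accuracies produced by whichever classifier is being evaluated.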
To verify the effectiveness of the method based on VMD and IG, the proposed method (VMD-IG-RMSE-RF) and three similar available methods (EMD-IG-RMSE, LMD-IG-RMSE, WT-IG-RMSE) were used to analyze the same experimental data. The identification results of the four methods are shown in Table 4. The average recognition accuracy of the proposed method is 99.58%, which is significantly higher than that of the other three methods, showing that VMD can effectively extract the characteristics of pipeline blockage. The proposed method also has the smallest standard deviation, which verifies its stability and the effectiveness of VMD and IG within it.
To demonstrate the effectiveness of IG feature extraction, another set of acoustic detection signals covering the nine types of pipe conditions was selected randomly as testing samples. There are 100 sets of samples for each condition, totaling 900 samples. The 12 components of each sample are extracted and then subjected to Principal Component Analysis (PCA), Kullback-Leibler (K-L) divergence, and information gain filtering. PCA reduces the features to three dimensions, the K-L divergence selects four components, and information gain selects seven IMF components (IMF1, IMF3, IMF11, IMF10, IMF6, IMF4, and IMF2). Afterward, the ERMS values of the resulting feature sets are input to the KNN, SVM, ELM, and RF models, respectively, to examine the accuracy of pipe condition identification.
The average diagnostic accuracy of the different methods is shown in Figure 9. The identification accuracy of information gain filtering is higher than that of the other methods. This indicates that component filtering based on information gain retains the data feature information to the maximum extent, reduces the interference of redundant and noisy features, and improves the recognition accuracy of the model. Too many or too few sensitive features will reduce the accuracy of blockage identification: too few carry little blockage feature information, whereas too many introduce redundant blockage feature information. For all types of classifiers, the accuracy of the VMD-IG method is better than that of the other methods, confirming the advantage of the proposed feature extraction method.
Through the above analysis, it is found that the accuracy of acoustic signal identification for pipe conditions improves markedly after component filtering, which demonstrates its effectiveness. The amount of information about feature parameters varies among components, and the components selected by information gain contain more features of the pipe condition. Component filtering reduces the data size substantially and achieves high-accuracy identification in less time. The above results suggest that the present method is not only effective in identifying the pipeline blockage severity during operation but can also eliminate the influence of conventional parts such as lateral connections on blockage identification, which improves the identification accuracy of the pipe condition.
To further investigate the ability of the VMD-IG-RMSE algorithm to identify pipe operational conditions and the details of blockage misjudgment, a multi-class confusion matrix is introduced to analyze the recognition results quantitatively. The confusion matrix reflects the recognition accuracy and the number of misjudgments for pipelines at different blockage levels, as well as the class to which each misjudged sample was assigned. The confusion matrices of the four classifiers (KNN, SVM, ELM, and RF) are shown in Figure 10.
In Figure 10, Class 1 and Class 2 represent the two normal operating conditions, the clean pipe and the pipe with a lateral connection (LC), and the recognition accuracy for both on the test set reaches 100%. The algorithm therefore separates normal operating conditions from blocked conditions with 100% accuracy. Class 3, Class 4, and Class 5 represent single blockages of 20 mm, 40 mm, and 55 mm, respectively. The misjudgments in these classes occur only between single-blockage categories, that is, between different degrees of single blockage; none is misjudged as a multiple blockage. Class 6, Class 7, Class 8, and Class 9 represent the multiple-blockage conditions: a 40 mm blockage with an LC, a 55 mm blockage with an LC, a 40 mm blockage with a 55 mm blockage, and a 40 mm blockage with a 55 mm blockage and an LC, respectively. The misjudgments in these classes likewise occur only between multiple-blockage categories with different degrees of blockage; none is misjudged as a single blockage.
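The quantities read from such a confusion matrix can be sketched as follows. The labels are hypothetical 0-based class indices for a three-class toy case, not the experimental results, and the function names are illustrative.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Count matrix whose rows are true classes and columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]


def overall_accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical labels for three classes, not the experimental results.
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 2, 2, 1, 2]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
recalls = per_class_recall(cm)
accuracy = overall_accuracy(cm)
```

Off-diagonal entries show which class absorbed each misjudged sample, which is the information used above to check that errors stay within the single-blockage group or within the multiple-blockage group.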
It can be seen that the overall recognition rate for pipeline operating conditions reaches 99.56%. The experimental verification shows that the improved VMD-IG-RMSE-RF algorithm has superior recognition ability and high diagnostic accuracy for pipeline blockage.

Conclusions
To address the problem that only a few components from VMD contain useful information for blockage identification, an information gain-based selection technique is proposed. It is not only effective in selecting the feature components containing substantial blockage information but also plays a crucial role in the deep mining of that information. Sound pressure level conversion is performed on the acoustic pressure signals collected from the standard pipe and from the pipe fitted with a lateral connection for comparison. It is found that the sound pressure level reflects the local characteristics of pipeline conditions while enhancing the discrimination between operating conditions. The RMS entropy proves responsive to blockage changes in noisy acoustic signals and can thus be used as input to the classifiers for effective identification.
Given the complexity of pipeline topology and the detection environment, as well as the diversity of pipe defects, further research is needed in the following aspects. First, the sound propagation within pipelines with multiple defects should be explored to discover its patterns so that a condition prediction model for varying defects can be developed. Second, advanced active acoustic detection technology requires further study. Such research should involve the development of sensitive acoustic sensors with long-distance availability; the selection of a valid excitation signal type, wavelength, and frequency; the study of sound field distribution; investigation of the sound attenuation of new materials used in civil infrastructure; and more advanced data processing techniques.

Data Availability Statement:
The data included in this study are all owned by the research group and will not be transmitted.