Learnable Wavelet Scattering Networks: Applications to Fault Diagnosis of Analog Circuits and Rotating Machinery

Abstract: Analog circuits are a critical part of industrial electronics and systems. Estimates in the literature show that, even though analog circuits comprise less than 20% of all circuits, they are responsible for more than 80% of faults. Hence, analog circuit fault diagnosis and isolation can be a valuable means of ensuring the reliability of circuits. This paper introduces a novel technique of learning time-frequency representations, using learnable wavelet scattering networks, for the fault diagnosis of circuits and rotating machinery. Wavelet scattering networks, which are fixed time-frequency representations based on existing wavelets, are modified to be learnable so that they can learn features that are optimal for fault diagnosis. The learnable wavelet scattering networks are developed using the genetic-algorithm-based optimization of second-generation wavelet transform operators. The simulation and experimental results for the diagnosis of analog circuit faults demonstrate that the developed diagnosis scheme achieves greater fault diagnosis accuracy than other methods in the literature, even while considering a larger number of fault classes. The performance of the diagnosis scheme on benchmark datasets of bearing faults and gear faults shows that the developed method generalizes well to fault diagnosis in multiple domains and also has good transfer learning performance.


Introduction
Electronic circuits are ubiquitous in our everyday lives, in applications ranging from the commercial domain to the safety-critical domain. As a result, unforeseen circuit failures can have enormous consequences for the safety and financial well-being of their users and producers [1,2]. Analog circuit failures can be attributed to interconnected failures or component faults, which are associated with either parametric drift (soft faults) or short circuit/open circuit [3] (hard faults). Analog circuits have become increasingly complex and, consequently, fault diagnosis is increasingly difficult due to (a) component tolerances, (b) interactions among components, (c) inadequate accessible measurement nodes, and (d) the inherent non-linearity in the behavior of analog circuits. Compared to digital circuits, analog circuits are more susceptible to interference and have fewer measurement nodes. Interestingly, even though analog circuits account for less than 20% of all circuits, they are responsible for more than 80% of circuit faults [4,5]. Therefore, the fault diagnosis of analog circuits has become a highly important research area in recent years.
There are two broad categories for fault diagnosis approaches for circuits: analytical methods and data-driven methods. Circuit transfer function equations are required to apply analytical methods [6]. If these equations are unavailable, they can be determined using design principles or parameter identification techniques [7], and fault diagnosis is then achieved by exposing the circuit to a test stimulus and using the response to estimate the circuit parameters. This technique is suitable for linear analog circuits but is not feasible for nonlinear analog circuits because of the complexity involved [8].
The standard approach followed by the vast majority of data-driven methods is to extract features and apply a dimensionality reduction algorithm to obtain a lower-dimensional feature set, which is then fed to a classification algorithm. Extracting features that are informative for fault diagnosis requires technical expertise, which restricts the application of this approach as a generalized method. Recently, techniques have been proposed involving the direct application of deep learning methods for fault diagnosis. These techniques use input data to learn features autonomously through a multi-layered neural network. This avoids the need for manual feature extraction and feature selection. For example, different 2D representations [30,31] have been developed for circuit outputs for use with state-of-the-art deep learning networks such as ResNet50 [32] to achieve fault diagnosis. However, the creation of an optimal custom deep learning network structure for the problem at hand requires subject matter expertise and extensive trial-and-error [33]. Inspired by wavelet scattering theory [34] and the second-generation wavelet transform [35], we propose a novel technique whose structure does not need to be optimized and which learns wavelet filters, instead of random filters, from the data. Hence, it overcomes these shortcomings of deep learning networks.
The remainder of the paper is organized as follows: Section 2 presents the theoretical background of the techniques involved in the approach. Section 3 details the developed fault diagnosis methodology. Section 4 details the application of the approach to the fault diagnosis of two circuits, a bearing dataset, and a gear dataset. The conclusions follow in Section 5.

Theoretical Background
As mentioned earlier, in this paper, time-frequency representations are learnt from the circuit outputs for fault diagnosis using learnable wavelet scattering networks (LWSNs). This involves modifying wavelet scattering networks, which are fixed time-frequency representations based on existing wavelets, such that they can learn features that are optimal for fault diagnosis. Learnable wavelet scattering networks are developed using the genetic-algorithm-based optimization of second-generation wavelet transform operators. Support vector machines (SVMs) are used as classifiers for the features learned by the LWSN. In the following subsections, we review the basics of a wavelet transform, a wavelet scattering network, a genetic algorithm, and a support vector machine and introduce the concept of learnable wavelet scattering networks.

Wavelet Transform
A wavelet transform is a collection of bandpass filters with progressively broader bandwidths at higher frequencies. A wavelet is a time-limited waveform that has a nonzero norm and zero average value. Often, signals are piecewise smooth but have momentary transients; for example, edges in images or transients caused by rapid changes in economic conditions in financial time series. The Fourier basis is not suited to the sparse representation of such signals: its sinusoids have infinite duration, so sine waves of many different frequencies are required to represent a single transient. Wavelets, being irregular and of limited duration, allow a signal to be broken up into a limited number of scaled and shifted versions of the original (mother) wavelet, $\frac{1}{\sqrt{s}}\psi\left(\frac{t-u}{s}\right)$, where $s$ is the scale parameter and $u$ is the translation parameter. The scale parameter $s$ is inversely proportional to the frequency. A small scale $s$ leads to a compressed wavelet, which is ideal for high-frequency signals with rapidly changing details. A large scale $s$ leads to a stretched wavelet, which is ideal for slowly changing signals with coarse features; i.e., a low-frequency signal. This increases the flexibility of the time-frequency analysis. The wavelet transform (1) has scale-varying basis functions.
The continuous wavelet transform (CWT) (2) compares a signal with shifted and scaled versions of the mother wavelet.
Here, $v$ is the number of voices per octave, as it requires $v$ intermediate scales to increase the scale by an octave. Higher values of $v$ result in a finer discretization of the scale parameter and an increase in the amount of computation required. The discrete wavelet transform (DWT) has a much coarser discretization of the scale parameter such that the number of voices per octave is always one. Depending on the translation parameter discretization, there are two broad types of DWT: decimated DWT and non-decimated DWT.
Decimated DWT (3): The translation parameter is $2^{j}m$, where $m$ is a non-negative integer and $2^{j}$ is the scale. The decimated DWT is a sparse representation; hence, it is used for compression, denoising, signal transmission, etc.
Non-decimated DWT (4): Like in the case of the CWT, the translation parameter is independent of the scale parameter. The non-decimated DWT is a more redundant representation than the decimated DWT and is translation invariant.
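The difference between the two discretizations can be illustrated with a one-level Haar transform. The following is a minimal sketch, not the implementation used in this paper: the Haar filters, the circular boundary handling, and the function names are assumptions made for illustration. It shows that the decimated transform halves the number of coefficients, while the non-decimated transform keeps one coefficient per sample, and its coefficients simply shift along with the signal.

```python
import numpy as np

def haar_dwt_decimated(x):
    """One level of the decimated Haar DWT: one coefficient per pair of samples."""
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return approx, detail

def haar_dwt_undecimated(x):
    """One level of the non-decimated Haar DWT (circular boundary, assumed here)."""
    x = np.asarray(x, dtype=float)
    nxt = np.roll(x, -1)  # x[n + 1], wrapping around at the end
    approx = (x + nxt) / np.sqrt(2.0)
    detail = (x - nxt) / np.sqrt(2.0)
    return approx, detail

signal = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
shifted = np.roll(signal, 1)  # circular shift by one sample

a_dec, d_dec = haar_dwt_decimated(signal)
a_und, d_und = haar_dwt_undecimated(signal)
_, d_dec_shift = haar_dwt_decimated(shifted)
_, d_und_shift = haar_dwt_undecimated(shifted)

print(len(a_dec), len(a_und))                        # 4 coefficients vs. 8
print(np.allclose(d_und_shift, np.roll(d_und, 1)))   # coefficients shift with the signal
```

For the decimated transform, a one-sample shift changes the coefficient values themselves, which is why the non-decimated variant is preferred when translation invariance matters.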

Wavelet Scattering Networks (WSNs)
In an effort to create interpretable networks that mimic human performance on vision and auditory tasks, some researchers use wavelet-transform-based methods, as wavelets are an approximation of the response of the human visual cortex and cochlea to stimuli [36]. For example, the wavelet transform renders a time domain signal to the time-frequency plane with a decreasing frequency resolution with increasing frequency, which is similar to the human cochlear response.
Mallat [37] proposed WSNs (Figure 1) as a first step in understanding the success of convolutional neural networks (CNNs). A wavelet scattering network computes a representation that preserves high-frequency information, is stable to deformations, and is translation invariant, which makes it a good feature extractor for classification. It is a cascade (tree) of convolutions between Gabor wavelet transforms (represented by $\psi$ in Figure 1) and non-linear modulus and averaging operators (represented by $\phi$ in Figure 1), which "scatter" the signal along multiple paths. The number of paths at each node of the WSN is the scale of the wavelet transform (scale = 3 in Figure 1), and the number of layers of wavelet transforms is typically two. Discrete versions of WSNs were proposed by Wiatowski [36] and involve existing discrete orthogonal and biorthogonal wavelets. Unlike CNNs, a scattering network outputs coefficients at all layers, not just the last layer, and its filters are not learned from data but are predefined wavelets. Thus, the filters retain their physical meaning, which cannot be said of the filters that are developed through the learning process in a typical convolutional neural network. Operations in both CNNs and wavelet scattering networks can be represented as $P(\sigma(x * w))$, where $x$ is the input signal, $w$ is the filter weight, $\sigma$ is the nonlinearity, and $P$ is the pooling operator. In CNNs, the weights are those of learned random filters, while in WSNs, the weights are those of the fixed wavelet filters. Scattering networks provide state-of-the-art classification accuracies on simple to moderately complex datasets, such as textures in the CUReT dataset [34], musical genre and environmental sound classification [37], and images in the MNIST dataset [38]. However, for extremely complex datasets such as ImageNet [39] or the TIMIT Acoustic-Phonetic Continuous Speech Corpus [40], CNNs are still more accurate than scattering networks.
A major reason for this is that scattering networks are fixed-feature generators, while CNNs learn features from the data. For this reason, we make the discrete wavelet scattering network learnable, so that it can learn features from the data.
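The cascade of wavelet convolution, modulus, and averaging can be sketched in a few lines. The code below is only an illustration under stated assumptions: the Gaussian band-pass filters stand in for Gabor wavelets, the averaging operator is replaced by a global mean over time (the limiting case of a very wide low-pass filter), convolutions are circular, and all names and parameter values are hypothetical. With three scales and two layers, the full tree yields 1 + 3 + 9 = 13 coefficients.

```python
import numpy as np

def gabor_bank(n, num_scales, xi0=0.4, sigma0=0.1):
    """Frequency responses of Gabor-like band-pass filters, one per octave."""
    freqs = np.fft.fftfreq(n)  # cycles per sample, in [-0.5, 0.5)
    bank = []
    for j in range(num_scales):
        centre = xi0 / 2**j    # centre frequency halves at each scale
        width = sigma0 / 2**j  # bandwidth halves as well
        bank.append(np.exp(-((freqs - centre) ** 2) / (2.0 * width**2)))
    return np.array(bank)

def modulus_of_conv(x, filt_hat):
    """|x * psi| computed as a circular convolution in the frequency domain."""
    return np.abs(np.fft.ifft(np.fft.fft(x) * filt_hat))

def scattering_features(x, num_scales=3):
    """Zeroth-, first- and second-order scattering coefficients (global averaging)."""
    x = np.asarray(x, dtype=float)
    bank = gabor_bank(len(x), num_scales)
    feats = [x.mean()]                      # layer 0: averaged input
    for j1 in range(num_scales):
        u1 = modulus_of_conv(x, bank[j1])   # layer 1 path
        feats.append(u1.mean())
        for j2 in range(num_scales):
            u2 = modulus_of_conv(u1, bank[j2])  # layer 2 path
            feats.append(u2.mean())
    return np.array(feats)

rng = np.random.default_rng(0)
x = rng.standard_normal(256)
s_x = scattering_features(x)
s_shift = scattering_features(np.roll(x, 17))
print(s_x.shape, np.allclose(s_x, s_shift))
```

Coefficients are emitted at every layer, not only the last one, and the features of the circularly shifted signal match those of the original, which is the translation invariance referred to above.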

Learnable Wavelet Scattering Networks (LWSNs)
Instead of the fixed wavelet filters of the WSN, the wavelet filters in the LWSN are learnable using a second-generation wavelet transform (SGWT). The classical wavelet transform is realized through the translation and expansion of the mother wavelet function. This definition is very restrictive, so the SGWT does away with it. The lifting method [35], or lifting scheme (Figure 2), is a space-domain wavelet construction method used to construct the SGWT filters, and it builds sparse representations by exploiting the correlation inherent in most real-world data. It consists of three basic steps:
1. Split: Let $x(n)$ be the original signal. In this step, $x(n)$ is divided into two subsets: the even subset $x_e(n)$ and the odd subset $x_o(n)$. The subsets are correlated according to the correlation structure of the original signal.
2. Predict: The odd subset is predicted from the even subset, and the prediction difference (detail signal) is retained: $d(n) = x_o(n) - P(x_e(n))$, where $P$ is the predict operator.
3. Update: A coarse approximation $c(n)$ to the original signal is created by combining the even coefficients with a linear combination of the prediction differences: $c(n) = x_e(n) + U(d(n))$, where $U$ is the update operator.
By iterating on the approximation signal $c(n)$ using the three steps, the approximation and the detail signals are obtained at different levels. The optimization of the lifting scheme's Update (U) and Predict (P) operators in the LWSN is carried out using the genetic algorithm (GA). The optimized Update (U) and Predict (P) operators are converted to the wavelet ($\psi$) and averaging ($\phi$) operators using Claypoole's algorithm [35], such that the structure in Figure 1 can be used to learn time-frequency representations from the data. Table 1 illustrates the differences between deep learning networks, wavelet scattering networks, and learnable wavelet scattering networks.
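One level of the lifting scheme can be sketched as follows. This is a minimal illustration, not the paper's implementation: the operators are applied as short circular FIR filters, the two-tap values of `P` and `U` are the well-known linear-interpolation pair chosen only for the example (the paper optimizes length-8 operators with the GA), and the signal is assumed to have even length. Whatever coefficients are chosen, the transform remains exactly invertible, which is what makes the operators safe to optimize.

```python
import numpy as np

def predict(even, p):
    """Predict odd samples from neighbouring even samples (circular boundary)."""
    start = len(p) // 2 - 1
    return sum(c * np.roll(even, -(k - start)) for k, c in enumerate(p))

def update(detail, u):
    """Linear combination of neighbouring detail samples (circular boundary)."""
    start = len(u) // 2
    return sum(c * np.roll(detail, -(k - start)) for k, c in enumerate(u))

def lifting_forward(x, p, u):
    x = np.asarray(x, dtype=float)
    even, odd = x[0::2], x[1::2]          # split
    detail = odd - predict(even, p)       # predict
    approx = even + update(detail, u)     # update
    return approx, detail

def lifting_inverse(approx, detail, p, u):
    even = approx - update(detail, u)     # undo update
    odd = detail + predict(even, p)       # undo predict
    x = np.empty(2 * len(even))
    x[0::2], x[1::2] = even, odd          # merge
    return x

P = np.array([0.5, 0.5])    # illustrative predict operator (assumed values)
U = np.array([0.25, 0.25])  # illustrative update operator (assumed values)

x = np.arange(16.0)         # a linear ramp
approx, detail = lifting_forward(x, P, U)
x_rec = lifting_inverse(approx, detail, P, U)
print(detail)               # zero wherever the linear prediction is exact
print(np.allclose(x_rec, x))
```

For the ramp, the detail coefficients vanish everywhere except at the last position, where the circular boundary breaks the linear trend; this is the sparsity that the lifting scheme obtains by exploiting correlation in the data.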

Genetic Algorithm (GA)
The GA [43] mimics the theory of natural selection. As is the case in evolution, a population consists of individuals that reproduce to create the next generation. This reproduction involves the combination of genetic material from the parents to create an offspring. Each subsequent generation is created by parent individuals combining their genes. The selection of parents (individuals) to combine is based on their fitness, and the fitness of an individual is evaluated using the fitness function. The 10% of individuals with the best fitness move on to the next generation unchanged. This mechanism is called elitism, and the percentage of elite individuals can be changed. The remaining individuals take part in crossover, where the genes of two individuals (parents) are combined to create the genes of an individual of the next generation (child). Crossover is carried out until the required number of individuals (children) is created in the next generation. Analogous to mutation in natural reproduction, random changes are added to the genes of a fraction of the children created. This helps the optimization of the fitness function to avoid getting stuck in local minima. The process repeats for the new generation and subsequent generations until the predefined maximum number of generations is reached or there is no improvement in the fitness over consecutive generations.
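The loop described above can be sketched as follows. This is an illustrative sketch rather than the optimizer used in the paper: tournament selection, uniform crossover, Gaussian mutation, the initial range of the genes, and all names are assumptions, and the toy fitness function is a simple quadratic to be minimized.

```python
import numpy as np

def run_ga(fitness, num_genes, pop_size=100, elite_frac=0.10, mutation_rate=0.05,
           max_generations=200, patience=30, seed=0):
    """Minimize `fitness` using elitism, crossover and mutation."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-1.0, 1.0, size=(pop_size, num_genes))
    n_elite = max(1, int(round(elite_frac * pop_size)))
    history, stall = [], 0  # best fitness per generation, generations without gain
    for _ in range(max_generations):
        scores = np.array([fitness(ind) for ind in pop])
        order = np.argsort(scores)          # best (lowest) fitness first
        pop, scores = pop[order], scores[order]
        stall = stall + 1 if history and scores[0] >= history[-1] - 1e-9 else 0
        history.append(float(scores[0]))
        if stall >= patience:
            break                           # no improvement for `patience` generations
        children = []
        while len(children) < pop_size - n_elite:
            # Tournament selection: population is sorted, so the smallest index wins.
            a = rng.integers(0, pop_size, size=3).min()
            b = rng.integers(0, pop_size, size=3).min()
            mask = rng.random(num_genes) < 0.5
            child = np.where(mask, pop[a], pop[b])   # uniform crossover
            if rng.random() < mutation_rate:         # mutate a fraction of children
                child = child + rng.normal(0.0, 0.1, size=num_genes)
            children.append(child)
        # Elite individuals are copied unchanged into the next generation.
        pop = np.vstack([pop[:n_elite], np.array(children)])
    return pop[0], history

def sphere(genes):
    """Toy fitness: squared distance from the point (0.3, ..., 0.3)."""
    return float(np.sum((genes - 0.3) ** 2))

best, history = run_ga(sphere, num_genes=4)
print(len(history), history[-1])
```

Because the elite individuals survive unchanged, the best fitness can never get worse from one generation to the next, which is what makes the "no improvement for N generations" stopping rule meaningful.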

Support Vector Machine (SVM)
An SVM is based on the concept of finding decision planes or hyperplanes that maximize the separation between classes. If the classes are not linearly separable, a kernel trick is used to map the data into higher dimensions in an effort to separate them. To find the support vectors and hence construct an optimal hyperplane, the following optimization problem [44] is solved:
$$\min_{w,\,b,\,\xi} \; \frac{1}{2} w^{T} w + C \sum_{i=1}^{N} \xi_i \quad \text{subject to} \quad y_i\left(w^{T}\varphi(x_i) + b\right) \geq 1 - \xi_i, \quad \xi_i \geq 0,$$
where $C$ is the penalty parameter to guard against overfitting, and $\xi_i$ are the slack variables introduced to handle inseparable data. The input data consists of $x_i$ and $y_i$, which are the independent and the dependent variable (class label), respectively. The mapping $\varphi$ associated with the kernel function transforms the input data into higher dimensions.
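The role of the penalty parameter and the slack variables can be made concrete with the standard soft-margin objective, $\frac{1}{2}\lVert w \rVert^2 + C\sum_i \xi_i$ with $\xi_i = \max(0,\, 1 - y_i(w^T x_i + b))$. The sketch below only evaluates this objective for a given hyperplane and a linear kernel; it does not solve the optimization problem, and the data, the hyperplane, and the function name are made up for illustration.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Return the soft-margin SVM objective and the slack of each sample."""
    margins = y * (X @ w + b)                 # signed margin of each sample
    slack = np.maximum(0.0, 1.0 - margins)    # zero when the margin is at least 1
    return 0.5 * float(w @ w) + C * float(slack.sum()), slack

X = np.array([[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0])
w, b = np.array([1.0, 0.0]), 0.0

objective, slack = soft_margin_objective(w, b, X, y, C=10.0)
print(slack)      # only the second sample lies inside the margin
print(objective)  # 0.5 * ||w||^2 + C * sum(slack)
```

A larger `C` penalizes margin violations more heavily, trading a wider margin for fewer violations on the training data.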

Fault Diagnosis Methodology
The implementation of the diagnostic scheme is depicted in Figure 3. Firstly, a dataset of signals recorded while the circuit components are degrading is obtained via simulation or experimentation. This dataset is randomly split into a training dataset $[X_{train}, Y_{train}]$ and a testing dataset $[X_{test}, Y_{test}]$, where $X_{train}$ and $X_{test}$ represent the circuit output signals in the training and the testing dataset, respectively, and $Y_{train}$ and $Y_{test}$ represent the corresponding labels (degrading components). A subset of signals (30%), $X'_{train}$, is randomly selected from the entire training dataset to be used with the GA. This is done to prevent overfitting to the training dataset and to reduce the time taken for GA optimization. The fitness function used is the Davies-Bouldin (DB) index [45], as it considers the ratio of within-class to between-class distances. As a result, the minimization of the DB index leads to maximum separation between the classes. The GA is used to optimize the Predict and Update operators of the SGWT such that the DB index is minimized. The genes in each individual in the GA are the coefficients of the P and U operators that need to be optimized by the GA. The P and U operators are assumed to be of length 8; hence, the number of genes in each individual is 16. The other hyperparameters chosen for the GA are a population size of 100, an elite count of 10%, a crossover fraction of 90%, and a mutation rate of 5%; the GA stops when there is no appreciable improvement in the fitness function for 30 consecutive generations. The feature space created by the LWSN, with the optimized P and U operators, is classified using the SVM as the classifier. Since SVM hyperparameter optimization is not the focus of this paper, the hyperparameter optimization was carried out using built-in MATLAB functions.
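To make the fitness function concrete, the following sketch computes the Davies-Bouldin index for a labelled feature set and shows how a 16-gene individual could be split into the two length-8 operators. It is an illustration only: the Euclidean distance, the gene ordering (P first, then U), and the function names are assumptions, and the toy features are not taken from the paper's data.

```python
import numpy as np

def davies_bouldin(features, labels):
    """Davies-Bouldin index: lower values mean compact, well-separated classes."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    centroids = np.array([features[labels == c].mean(axis=0) for c in classes])
    # Within-class scatter: mean distance of the class members to their centroid.
    scatter = np.array([
        np.linalg.norm(features[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(classes)
    ])
    worst = []
    for i in range(len(classes)):
        ratios = [
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(len(classes)) if j != i
        ]
        worst.append(max(ratios))  # most similar (worst-separated) other class
    return float(np.mean(worst))

def decode_genes(genes, operator_length=8):
    """Split one GA individual into predict and update coefficients (assumed order)."""
    genes = np.asarray(genes, dtype=float)
    return genes[:operator_length], genes[operator_length:]

labels = np.array([0, 0, 1, 1])
far = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
near = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
db_far, db_near = davies_bouldin(far, labels), davies_bouldin(near, labels)
print(db_far, db_near)  # the better-separated classes give the lower index

p_coeffs, u_coeffs = decode_genes(np.linspace(-1.0, 1.0, 16))
```

In the scheme described above, each individual would be decoded into its operators, the resulting LWSN features of the 30% training subset would be computed, and the DB index of those features would serve as the fitness to be minimized.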

Experiments and Results
The proposed method was verified using two analog circuits, the Sallen-Key bandpass filter circuit and the two-switch forward convertor circuit, and two rotating machinery datasets, the CWRU bearing faults dataset and the UoC gear faults dataset.

Sallen-Key Bandpass Filter
The first circuit under test (CUT1) is the Sallen-Key bandpass filter (Figure 4), which is the most frequently studied circuit for analog circuit fault diagnosis. Unlike other papers that only consider the fault diagnosis of four of the seven passive components, we considered all seven passive components for fault diagnosis. The parametric fault ranges for the seven components considered are shown in Table 2. As can be seen from Table 2, we considered a single class for each component, as opposed to other papers in the literature that consider two classes for each component. The data for each class were split into training and testing datasets via a 75%-25% split. The LWSN was trained on the training data, and the testing accuracy of the LWSN is reported in Table 3, along with the testing accuracy of the original wavelet scattering network and the Gaussian-Bernoulli Deep Belief Network (GB-DBN)-based approach [22], which was used for comparison. That approach was chosen for comparison because it uses a deep-learning-based feature extractor, the DBN, along with an SVM for classification; hence, it is conceptually similar to our approach. The confusion matrix for the fault diagnosis of the Sallen-Key bandpass filter using the LWSN is shown in Table 4.
The Sallen-Key bandpass filter circuit involved seven fault types and one healthy class to detect and identify, which correspond to the 14 fault types for methods used in the literature. From Table 3, it can be seen that the proposed LWSN method achieved a marginal improvement of 0.7% in the fault diagnosis accuracy over comparable methods in the literature [18] and a 9% improvement in the fault diagnosis accuracy over a traditional WSN. As can be seen from the confusion matrix in Table 4, fault type F6, which corresponds to capacitor C1, was misdiagnosed most often; however, the diagnosis of the other fault types was almost perfect.

Two-Switch Forward Convertor
The second circuit under test (CUT2) is the two-switch forward convertor circuit (Figure 5). A forward converter is a switching power supply circuit that is used for energy transfer when the two switches (transistors) are simultaneously turned on. The parametric fault ranges for the components considered after sensitivity analysis are shown in Table  5, along with the values for experimental verification. As can be seen from Table 5, we considered a single class for each single fault (single component degradation) as opposed to other papers in the literature that consider two classes for each single fault. The advantage of doing so is that we could consider one class for every double fault (two components degrading simultaneously), as can be seen from Fault Codes F14 and F15. If we were to consider two classes for each single fault, we would have to consider four classes for every double fault. The data for each class were split into training and testing data sets via a 75%-25% split. The testing accuracy of the LWSN on both the simulation and experimental data is reported in Table 3, along with the testing accuracy of the original wavelet scattering network and the Gaussian-Bernoulli Deep Belief Network (GB-DBN)-based approach [22], which were used for comparison. The confusion matrix for the fault diagnosis of the two-switch forward convertor circuit using LWSN is shown in Table 6. The experimental setup that was used to demonstrate our approach is shown in Figure 6. The two-switch forward convertor circuit (CUT2) was used with pulse width waveforms to trigger the two switches, generated using an Agilent Arbitrary Waveform Generator 33250A. The circuit components were swapped out with the components with values shown in the Experimental values column of Table 5. For instance, to mimic the degradation of resistor R1 from its nominal value of 33 Ω, resistors of 10 Ω, 20 Ω, 40 Ω, and 50 Ω were substituted, and the circuit output was captured at every instance. 
The circuit responses captured at the output using an Agilent Digital Oscilloscope 54853A were classified using the developed fault diagnosis methodology, and the results are provided in Table 3.
The sixteen fault types and one healthy class considered for the two-switch forward convertor correspond to 28 fault types for methods in the literature, and this is a much more challenging fault diagnosis problem compared to CUT1. From Table 3, it can be seen that the proposed LWSN method achieved a significant improvement of 8.9% in the fault diagnosis accuracy over the comparable method in the literature [22] and a 10.9% improvement in the fault diagnosis accuracy over the traditional WSN. As can be seen from the confusion matrix in Table 6, fault type F3, which corresponds to resistor RL, was misdiagnosed as fault type F8 (resistor R8). Other notable misclassifications include the single fault F1 (resistor R1) and the double fault F15 (resistor R1 and R2). This highlights the complexity of analog circuit fault diagnosis. However, the developed LWSN method stands out in terms of fault diagnosis performance in comparison to existing methods.

Bearing Fault Diagnosis
In rotating machinery applications, rolling bearing faults are the most common, leading to the performance deterioration of machinery. Hence, bearing fault diagnosis plays a vital role in the health management of machinery [46]. To test the effectiveness of the method across different domains of fault diagnosis, the developed method was tested on a bearing faults benchmark dataset. The Case Western Reserve University (CWRU) motor bearing dataset was generated using a test rig consisting of a 2 hp Reliance Electric motor, a torque transducer/encoder, a dynamometer, and drive-end and fan-end Svenska Kullager-Fabriken deep-groove ball bearings. Inner ring, outer ring, and rolling element defects were manufactured into the bearings. The motor was run at a near-constant speed (1720-1797 r/min) with different loads (0-3 hp) provided by the dynamometer. Vibration data were collected using accelerometers, which were vertically attached to the housing with magnetic bases. Sampling frequencies were 12 kHz for some of the tests and 48 kHz for the others. Further details can be found at the CWRU Bearing Data Center website [47]. As shown in Table 7, one healthy bearing and three fault modes, including the inner ring fault, the rolling element fault, and the outer ring fault, were classified into ten categories (one health state and nine fault states) according to different fault sizes. A plot of the data can be seen in Figure 7. The data were resampled such that the entire dataset had a constant sampling rate, and then, the data were split into chunks with sizes of 1024. The dataset was then split into training and testing datasets in the ratio of 75%:25% using stratified sampling. The LWSN achieved 99.2% accuracy for the testing dataset, which is comparable to the state-of-the-art methods [48]. The confusion matrix is shown in Table 8.
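The segmentation and stratified split described above can be sketched as follows. This is a minimal illustration with synthetic signals, not the preprocessing code used for the CWRU data: non-overlapping chunks, the random seed, and the function names are assumptions, and resampling to a common rate is omitted.

```python
import numpy as np

def segment(signal, chunk_size=1024):
    """Cut a 1-D signal into non-overlapping chunks, dropping the incomplete tail."""
    signal = np.asarray(signal, dtype=float)
    n_chunks = len(signal) // chunk_size
    return signal[: n_chunks * chunk_size].reshape(n_chunks, chunk_size)

def stratified_split(labels, test_frac=0.25, seed=0):
    """Return train/test indices that keep the class proportions of `labels`."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))  # shuffle within class
        n_test = int(round(test_frac * len(idx)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

# Synthetic stand-in for three recordings, one per class.
rng = np.random.default_rng(0)
chunks, labels = [], []
for class_id in range(3):
    recording = rng.standard_normal(8 * 1024 + 100)
    class_chunks = segment(recording)
    chunks.append(class_chunks)
    labels.extend([class_id] * len(class_chunks))

X, y = np.vstack(chunks), np.array(labels)
train_idx, test_idx = stratified_split(y)
print(X.shape, len(train_idx), len(test_idx))
```

Splitting within each class, rather than over the pooled data, ensures that every fault class appears in the test set in the same 75%:25% proportion.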

The CWRU bearing dataset involves nine fault classes and one healthy class. As can be seen from the confusion matrix in Table 8, for the bearing fault diagnosis, fault types F3 and F9 were misdiagnosed most often; however, the diagnosis of other fault types was perfect.

Gear Fault Diagnosis
The second rotating machinery fault diagnosis dataset considered was the University of Connecticut (UoC) gear fault dataset [49]. The CWRU dataset and the UoC dataset were ranked the simplest and the most difficult benchmark dataset, respectively [48], for rotating machinery fault diagnosis. The average RMS and the average power of the signals in the CWRU and the UoC dataset were 0.27, −9.36 dB and 0.07, −21.91 dB, respectively. Preprocessing methods such as stochastic resonance [50] can be used to enhance weak fault characteristics in datasets such as UoC; however, in this paper, the LWSN method was applied directly to the raw vibration data.
In the UoC dataset, nine different gear conditions were introduced to the pinions on the input shaft, including healthy condition, root crack, missing tooth, spalling, and chipping tip with five different levels of severity. All the collected datasets were used and classified into nine categories (one health state and eight fault states: missing, crack, spall, chip5a, chip4a, chip3a, chip2a, and chip1a) to test the performance. The data were resampled such that the entire dataset had a constant sampling rate, and then the data were split into chunks of 1024 samples. The dataset was then split into training and testing datasets in the ratio 75%:25% using stratified sampling. The LWSN achieved 96.51% accuracy for the testing dataset, and the confusion matrix is shown in Table 9. Our result is marginally better than the best result reported in [48], which was 96.19%. Since the UoC dataset had 3600 samples per class and there were nine classes, this result also indicates that the developed method is able to process large rotating machinery datasets.

Transfer Learning
In recent years, transfer learning has been gaining importance, as it enables knowledge acquired through training on data from a source domain to be transferred to gain insight in a target domain. This importance arises from the fact that it is very challenging to collect data from all possible conditions that machinery may encounter. Umdale et al. [51] created different datasets by dividing the original CWRU dataset based on speed and load, as can be seen in Table 10. For instance, in dataset D1, the goal was to determine if training on lower speeds in the source dataset would still enable acceptable fault diagnosis on a dataset with higher rotational speeds, as can be seen from the target dataset of D1. In dataset D2, the opposite was true: the goal was to determine if datasets with higher speeds would have vital information for fault diagnosis at lower speeds. Mixtures of speeds were considered in datasets D3 and D4. The maximum training and testing accuracies reported by [51] are shown in Table 10, where the testing accuracies are an indication of the effectiveness of transfer learning. As can be seen from Table 10, the developed LWSN is more effective for transfer learning across all four datasets. This exploratory work suggests that the LWSN can perform at least as well as deep learning networks at transfer learning, but further work needs to be undertaken to determine if there is a fundamental improvement. These results suggest that the LWSN can extract discriminative information from raw data effectively and achieve fault classification with high accuracy across the datasets and domains considered.

Conclusions
Traditional fault diagnosis methods involve the extraction of fixed representations in the time domain, frequency domain, or time-frequency domain. These methods require technical expertise for designing appropriate features from the fixed representations. In this paper, a new feature extraction technique based on learnable wavelet scattering networks was developed to diagnose faults primarily in analog circuits and rotating machinery. By learning a time-frequency representation from the data, the developed method has a better ability to extract essential features of the fault signals. This results in better fault diagnosis accuracy, by almost 9%, compared to the state-of-the-art fault diagnosis method in the literature. By considering more classes for fault diagnosis than any other paper in the literature, a more thorough fault diagnosis was demonstrated. The fault diagnosis performance of this method was verified by experiments on the two-switch forward convertor circuit. The experiments indicated that the fault diagnosis model trained on simulation data is able to effectively diagnose faults from the actual circuit. Analog circuits and gears/bearings are the predominant sources of faults in electronic systems and rotary mechanical systems, respectively. The developed fault diagnosis approach was applied to the CWRU bearing faults and the UoC gear faults benchmark datasets and achieved fault diagnosis accuracy that is comparable to state-of-the-art methods. Since the UoC gear faults benchmark dataset is considered the most challenging benchmark dataset in rotating machinery fault diagnosis, this speaks to the ability of the developed method to extract weak fault signatures. Hence, the generalizability of the developed fault diagnosis approach across the most common industrial fault diagnosis domains was demonstrated. 
Initial experiments indicated that the developed approach is also effective in transfer learning; however, further experiments need to be carried out to confirm these observations.
The incorporation of learnability in traditional wavelet scattering networks resulted in a 10% improvement in fault diagnosis accuracy. As opposed to deep learning networks, the developed learnable wavelet scattering networks do not require an extensive trial-and-error process to optimize their structure. Additionally, the developed learnable wavelet scattering networks learn wavelet filters as opposed to the random filters learnt in deep learning networks. Hence, the filters learnt by learnable wavelet scattering networks are interpretable, which enables wavelets to be used to gain further insight into circuit faults. The interpretability of the wavelets learnt by the learnable wavelet scattering networks and digital circuit fault diagnosis are possible avenues for future research.