Lightweight Network with Variable Asymmetric Rebalancing Strategy for Small and Imbalanced Fault Diagnosis

Abstract: Deep learning-related technologies have achieved remarkable success in the field of intelligent fault diagnosis. Nevertheless, the traditional intelligent diagnosis methods are often based on the premise of sufficient annotation signals and balanced distribution of classes, and the model structure is so complex that it requires huge computational resources. To this end, a lightweight class imbalanced diagnosis framework based on a depthwise separable Laplace-wavelet convolution network with variable-asymmetric focal loss (DSLWCN-VAFL) is established. Firstly, a branch with few parameters for time-frequency feature extraction is designed by integrating wavelet and depthwise separable convolution. It is combined with the branch of regular convolution that fully learns time-domain features to jointly capture abundant discriminative features from limited samples. Subsequently, a new asymmetric soft-threshold loss, VAFL, is designed, which reasonably rebalances the contributions of distinct samples during the model training. Finally, experiments are conducted on the data of bearing and gearbox, which demonstrate the superiority of the DSLWCN-VAFL algorithm and its lightweight diagnostic framework in handling class imbalanced data.


Introduction
With the development of modern industrial technology, the working process of rotating machinery is more integrated and intelligent [1][2][3]. Mechanical components inevitably fail because of the complexity, harshness, and uncertainty of the working environment. The faults that are not detected early can cause serious damage to the equipment and significantly increase the cost of maintenance [4,5]. Therefore, providing effective fault monitoring and health management for mechanical systems plays a crucial role [6].
The response of the defective mechanical parts to the external excitation is abnormal, and thus, the fault signals are generated. The traditional condition monitoring method is to analyze the probability distribution of the signals for fault diagnosis. Such methods are based on artificial feature engineering with a large amount of expert experience, and their capabilities are limited by complex and variable mechanical systems [7,8].
In recent years, deep learning (DL) methods with multi-level nonlinear transformations have been used to autonomously mine information, such as statistical and structural relationships, between data to establish reliable diagnostic models. Consequently, DL methods that can express high-dimensional feature information of data have been widely developed. Lei et al. [9] systematically reviewed the development of intelligent diagnosis and provided future prospects. DL methods are continuously improved to solve specific problems. For example, to address samples being disturbed by complex environmental noise in industrial practice, Zhang et al. [10] applied multi-scale feature extraction units to vibration signals to learn complementary and rich fault information on different time scales. Then, a novel easy-to-train module based on adversarial learning was used to improve the feature learning ability and generalization ability of the model. Faced with the problem of variable working conditions, Shao et al. [11] proposed an improved convolutional neural network with transfer learning, which had excellent diagnostic performance in rotor-bearing systems under different working conditions. To monitor invisible faults, Chen et al. [12] exploited the domain-invariant knowledge of the data through adversarial learning between feature extractors and domain classifiers. At the same time, the fault classifier generalized the knowledge from the source domain to diagnose invisible faults. The interpretability of DL methods has also received attention recently. Zhao et al. [13] developed a model-driven deep unrolling approach to realize ante-hoc interpretability, the core of which was to unroll the optimization algorithm of a predefined model into a neural network, which was naturally interpretable.
Additionally, some advanced techniques, such as contrastive self-supervised learning [14], meta-learning [15], metric learning [16] and incremental learning [17], are also utilized by some scholars to solve specific problems in fault diagnosis.
Most existing DL-related methods assume that the distribution of training data is balanced. Nevertheless, rotating machinery systems often operate in a healthy state, and the collected fault samples account for only a small part of the data. DL models will be dominated by classes with sufficient samples and ignore the minority classes with insufficient feature understanding [18][19][20], which leads to overfitting. If the model is severely biased, resulting in a sharp decrease in the classification accuracy of the minority class, it will influence the maintenance efficiency of the mechanical system. More importantly, it is expensive to collect sufficient annotation signals from industrial equipment. Consequently, it is of great practical significance to correctly classify small and imbalanced data [21,22].
Fault diagnosis methods for small and imbalanced data can be mainly divided into three categories: methods based on sampling technology, data generation and cost-sensitive learning. In general, methods based on sampling techniques are classified as either oversampling the minority class or under-sampling the majority class [23]. Among them, the synthetic minority over-sampling technique (SMOTE) has yielded many achievements, which augments the data sets by randomly selecting some samples within the nearest neighbor range. Georgios et al. [24] proposed a heuristic over-sampling method based on K-means clustering and SMOTE to generate artificial data, which enabled various classifiers to attain high classification results on class imbalanced data sets. In addition, the adaptive synthetic (ADASYN) over-sampling approach has been used by many researchers to alleviate the degree of class imbalance. Li et al. [25] proposed a fault diagnosis model incorporating ADASYN, a reconstructed data manner and a deep coupled dense convolutional neural network (CDCN), which had satisfactory results on the data set of power transformers. Although resampling methods such as SMOTE and ADASYN have improved the diagnostic performance to a certain extent, the distribution of the sample feature space is difficult to learn due to the complexity of the vibration signals of mechanical equipment, and thereby problems such as distribution marginalization can occur that result in the generation of invalid samples.
With the in-depth study of generative deep learning models, data generation methods represented by generative adversarial networks (GANs) and variational auto-encoders (VAEs) have become the most common means to solve class-imbalanced problems because of the higher quality of their generated data [23]. VAEs and GANs using unsupervised learning do not aim at extracting features to establish a mapping between input and output but rather learn the distribution of training data and then generate similar data to weaken the impact of class imbalance. Liu et al. [26] proposed a novel data synthesis approach called deep feature enhanced generative adversarial network, where a pull-away function is integrated into the objective function of the generator to improve the stability of the generative adversarial network. This method shows great potential in class-imbalance bearing fault diagnosis. In Ref. [27], an approach based on a conditional variational auto-encoder generative adversarial network (CVAE-GAN) was proposed for imbalanced fault diagnosis. The method utilized an encoder to attain the sample distribution and then generated similar samples by a decoder, and it was optimized continuously through an adversarial learning mechanism. Since the optimization of deep generative models is a high-dimensional non-convex problem, such models are usually difficult to train and consume a lot of computational resources, so the optimal time for maintenance can be missed during actual fault monitoring. Additionally, if only a few samples are available for training, the real data distribution cannot be fully learned and the quality of the generated fault samples will be too low to meet the requirements of intelligent diagnosis.
The algorithms based on cost-sensitive learning are dedicated to adjusting the contribution of diverse samples in the model training process by applying cost-sensitive losses [28]. The class-imbalanced problem is solved by imposing cost penalties on distinct classes at the algorithmic level, and such methods are more economical in terms of computational resources and more suitable for establishing lightweight models. Recently, a series of cost loss functions, such as focal loss (FL) [29], class-balanced loss [30], etc., have been proposed to deal with long-tailed distribution data. In the field of fault diagnosis, Geng et al. [31] proposed a new loss function, namely imbalance-weighted cross-entropy (IWCE), which was employed for learning deep residual networks to handle imbalanced bogies fault data from rail transit systems. In Ref. [32], a new CNN-based imbalance diagnosis method was proposed because of the long-tail distribution data from the sensor system. The feature extraction module was optimized by the weighted-center-label loss, while the fault recognition module adopted the distance between the feature and the pattern center vector to diagnose the fault. This manner exhibited effective diagnosis capability for imbalanced data through the automatic extraction of separable and discriminative features. However, many existing cost-sensitive learning methods do not pay attention to the dynamic changes of the corresponding contributions of various samples during the model training. Furthermore, when faced with extremely small samples and serious class imbalance problems, the feature extraction module will fail to fully excavate key features from limited data, which further curbs the effectiveness of cost-sensitive learning methods.
In view of the above, a lightweight diagnosis framework based on a depthwise separable Laplace-wavelet convolutional network with variable-asymmetric focal loss (DSLWCN-VAFL) is constructed to improve the diagnostic performance in small and imbalanced cases while taking into account the timeliness of fault monitoring. In this method, on the one hand, the multi-scale regular convolutional branch fully learns the time-domain features of the data. On the other hand, the proposed depthwise separable Laplace-wavelet convolution layer, which contains fewer parameters, can excavate the time-frequency features of the data, and then the deeper abstract features are captured by the conventional convolution layer. The combination of these two branches allows a rich set of discriminative features to be attained from limited samples. In addition, the introduction of global average pooling (GAP) retains part of the spatial encoding information of the signals, which not only strengthens the inter-channel connection and reduces the number of parameters but also improves the robustness of the model by increasing the receptive field. Subsequently, a novel asymmetric soft-threshold loss, VAFL, is designed, which dynamically adjusts the contributions of distinct samples during the convergence of the neural network to alleviate the bias problem of the model. The main contributions of the work are as follows:

1. A lightweight framework for small and imbalanced fault diagnosis is established, namely DSLWCN-VAFL. This method performs well on extremely small samples and seriously imbalanced class data sets, and it consumes only a small amount of computational resources, which gives it good application prospects.

2. A new DSLWC branch with few parameters is designed. The branch containing the DSLWC layer can mine the time-frequency features from the input data while adding few parameters. It then cooperates with the multi-scale regular convolutional branch, which fully learns the time-domain features, so that the model can extract more abundant sensitive feature vectors of different types from the limited signal samples, thereby improving the classification ability.

3. A novel cost-sensitive loss, VAFL, is proposed. VAFL imposes a variable cost on samples of distinct categories to highlight the misclassified samples of a minority class, which reasonably rebalances the contributions of diverse samples and alleviates the bias problem caused by imbalanced class data.

4. Finally, experiments are conducted on the gear and bearing data sets. The experimental results demonstrate that, compared with several popular methods, the proposed method achieves a clear advantage in terms of diagnostic capability and efficiency in the case of limited samples, class imbalance and noise interference.
The rest of the paper is organized as follows. Section 2 introduces the basic theories briefly. The proposed method is described in detail in Section 3. Section 4 analyzes the proposed method on the gear and bearing signal data sets, respectively. Finally, the conclusion is drawn in Section 5.

Depthwise Separable Convolution (DSC)
DSC decomposes the regular convolution into two parts: channel-wise (depthwise) convolution and point-by-point (pointwise) convolution. The difference between depthwise separable convolution and ordinary convolution is shown in Figure 1. Specifically, the input is expressed as X ∈ R^{L×C}, the number of input channels is C, the number of output channels is C′, the size of each filter is k × 1, and the stride is 1. Then, the output can be expressed as X′ ∈ R^{L′×C′}, where L′ represents the length of the output features. In traditional convolution, the input is convolved with C′ filters to obtain C′ feature maps. In depthwise separable convolution, each input channel corresponds to one filter, which generates C feature maps. To obtain C′ feature maps, a 1 × 1 convolution is introduced to map the previous C feature maps to C′ feature maps.
The parameters P reg and Floating Point Operations (FLOPs) F reg of the regular convolution are expressed in Equations (1) and (2) [33], respectively.
The parameters P sep and FLOPs F sep of the depthwise separable convolution are expressed in Equations (3) and (4), respectively.
Therefore, the parameters and FLOPs can be reduced by:
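As a hedged sketch, the parameter counts can be written in a form that reproduces the worked example given later for DSLWC (k = 7, C = 50, C′ = 30 yields 10,530 and 1880 parameters). The bias convention (one bias per output channel, none in the depthwise stage) is an assumption inferred from those numbers, and the FLOPs expressions are deliberately not reconstructed.

```latex
\begin{align*}
P_{reg} &= k \cdot C \cdot C' + C' \\
P_{sep} &= k \cdot C + C \cdot C' + C' \\
\frac{P_{sep}}{P_{reg}} &\approx \frac{k \cdot C + C \cdot C'}{k \cdot C \cdot C'}
  = \frac{1}{C'} + \frac{1}{k} \qquad \text{(bias terms neglected)}
\end{align*}
```

Under these assumptions the saving grows with both the filter width k and the number of output channels C′.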

Basic Principle of Loss Function
Cross-entropy (CE) loss is a common loss function that measures the difference between the actual probability distribution of samples and the probability distribution predicted by a neural network, which is represented in Equations (6) and (7).
where y specifies the truth class and p ∈ [0, 1] is the probability estimated by the network. However, CE performs poorly on imbalanced problems. Focal loss (FL), an improved cross-entropy loss, has been shown to alleviate the poor performance of one-stage object detection on extremely imbalanced data [29], where L_FL focuses the loss on the low-confidence samples. As shown in Figure 2, the closer the probability p_t of a high-confidence sample is to 1, the faster the loss weight of that training sample converges to 0 compared with the cross-entropy.
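The down-weighting effect can be illustrated numerically. The sketch below assumes the standard binary forms from [29], CE = -log(p_t) and FL = -(1 - p_t)^γ log(p_t), with p_t = p when y = 1 and 1 - p otherwise; the function names and the default γ = 2 are illustrative choices, not taken from the paper's code.

```python
import math


def p_t(p, y):
    """Probability assigned to the true class (binary case, assumed form)."""
    return p if y == 1 else 1.0 - p


def cross_entropy(p, y):
    """Binary cross-entropy for a single sample: -log(p_t)."""
    return -math.log(p_t(p, y))


def focal_loss(p, y, gamma=2.0):
    """Focal loss: cross-entropy scaled by the modulating factor (1 - p_t)^gamma."""
    pt = p_t(p, y)
    return (1.0 - pt) ** gamma * -math.log(pt)


if __name__ == "__main__":
    # The ratio FL / CE equals (1 - p_t)^gamma, so it shrinks as confidence grows.
    for p in (0.6, 0.9, 0.99):
        ce, fl = cross_entropy(p, 1), focal_loss(p, 1)
        print(f"p_t={p:.2f}  CE={ce:.4f}  FL={fl:.6f}  FL/CE={fl / ce:.4f}")
```

With γ = 0 the focal loss reduces to the cross-entropy, which is a convenient sanity check when experimenting with the attenuation factor.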

Depthwise Separable Laplace Wavelet Convolution (DSLWC)
When the convolutional layer of a regular CNN performs a set of temporal convolutions between the input data and some finite impulse response filters, the information of key segments cannot be extracted sufficiently by the convolution operation [34]. Moreover, models based on regular convolution operations often suffer from overfitting problems due to the large number of parameters involved.
In order to alleviate the problem above, and inspired by the continuous wavelet convolution kernel [35], the Laplace wavelet is integrated into the convolution kernel, which adds constraints to the convolution kernel waveform to extract explicit periodic pulse information from the input data and fully mine the time-frequency features. In addition, for the purpose of simplifying the structure, the depthwise separable Laplace wavelet convolution layer is proposed to replace the regular convolution layer.
The definition of the basic wavelet dictionary ψ_{u,s}(t) is given in Equation (9), where ψ(·) is the wavelet basis function and t denotes the time. s is a scale factor, which scales the wavelet function so that each wavelet traversal approaches different signal frequencies. u is a translational factor, which allows the wavelet function to traverse the time axis of the signals. s and u are dynamic adaptive adjustable parameters. Wavelet analysis has a unique advantage in processing nonlinear signals. Mechanical vibration signals are non-stationary real signals, so the real Laplace wavelet basis function is adopted, as shown in Equation (10) [36]. From Equations (9) and (10), the real Laplace wavelet convolution (LWC) dictionary ψ^{Lap}_{u,s}(t) can be obtained, as shown in Equation (11).
As shown in Figure 3, DSLWC is further implemented on the basis of Equation (11), as represented in Equation (12), where x_i is the input feature mapping and δ(·) is a nonlinear activation function. The performance of DSLWC is mainly related to the translational factor u and the scale factor s. These two dynamic adaptive adjustable parameters are updated by backpropagation, as shown in Equations (13) and (14).

According to Equations (13) and (14), the gradients of s and u, namely, the composite partial derivatives of the loss function, need to be calculated before updating them. Here, ∂L/∂s and ∂L/∂u are the gradients of the loss L with respect to s and u, and α denotes the learning rate.
In the calculation process, the partial derivative of the loss function with respect to the feature output y_w is obtained first. Secondly, the partial derivative of y_w with respect to ψ^w_{u,s} is obtained according to Equation (12). Thirdly, on the basis of Equation (11), the gradients with respect to s_w and u_w are obtained, respectively. Finally, one backpropagation step is completed by subtracting the product of the learning rate α and the gradient from the previous value.
In terms of the chain rule and integrating Equations (13) and (14), the gradients of u and s can be calculated as expressed in Equations (15) and (16) to update these two parameters.
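The chain-rule update can be sketched for a single wavelet filter. This is an illustrative sketch only: the wavelet `psi` is a smooth Morlet-like stand-in rather than the Laplace wavelet of Equation (10), the squared-error loss is an assumption, and all function names are hypothetical. The update rule itself follows the description above, s ← s - α ∂L/∂s and u ← u - α ∂L/∂u.

```python
import math


def psi(z):
    """Stand-in wavelet (Morlet-like); NOT the Laplace wavelet of Eq. (10)."""
    return math.exp(-z * z) * math.cos(5.0 * z)


def dpsi(z):
    """Derivative of the stand-in wavelet with respect to its argument z."""
    return -math.exp(-z * z) * (2.0 * z * math.cos(5.0 * z) + 5.0 * math.sin(5.0 * z))


def forward(x, u, s):
    """Filter response y = sum_k x[k] * psi((k - u) / s)."""
    return sum(xk * psi((k - u) / s) for k, xk in enumerate(x))


def loss(x, u, s, target):
    """Assumed squared-error loss on the scalar filter response."""
    return 0.5 * (forward(x, u, s) - target) ** 2


def grads(x, u, s, target):
    """Chain rule: dL/dy * dy/dpsi * dpsi/dz * dz/du (or dz/ds)."""
    dl_dy = forward(x, u, s) - target
    g_u = g_s = 0.0
    for k, xk in enumerate(x):
        z = (k - u) / s
        g_u += dl_dy * xk * dpsi(z) * (-1.0 / s)  # dz/du = -1/s
        g_s += dl_dy * xk * dpsi(z) * (-z / s)    # dz/ds = -(k - u)/s^2 = -z/s
    return g_u, g_s


def sgd_step(x, u, s, target, lr=0.01):
    """One update: subtract learning rate times gradient from the previous value."""
    g_u, g_s = grads(x, u, s, target)
    return u - lr * g_u, s - lr * g_s
```

In a framework such as PyTorch these derivatives would be produced by automatic differentiation; the explicit form is shown only to make the chain of partial derivatives visible.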
Furthermore, wide convolutional kernels are commonly used in models dealing with 1D signal data. Although they make it easier to capture the low-frequency trend of the input data and thereby suppress high-frequency noise, they introduce more parameters. In addition, when fine-grained features are abstracted from the input data, the number of channels increases significantly to ensure dimensionality reduction without losing information, which also introduces a large number of parameters and thus affects the computational efficiency of the model. However, the number of parameters in DSLWC is much smaller than that in regular convolution, which effectively reduces the computational burden. For example, assume that the filter size is 7 × 1, the number of input channels is 50, and the number of output channels is 30. According to Equation (1), the number of parameters of the regular convolution is 10,530. The number of parameters of the depthwise separable convolution is 1880, as calculated by Equation (3). In contrast, DSLWC only needs to adapt s and u for each input channel, so the number of parameters required is 1630 (50 × 2 + 30 × 51), which is only 163/1053 (about 15.5%) of that in the regular convolution. In summary, the number of parameters required by DSLWC is very small, which is a considerable advantage in establishing lightweight networks.
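The three counts in this example can be checked with a few lines of Python. The function names are illustrative, and the bias convention (one bias per output channel, none in the depthwise stage) is an assumption chosen because it reproduces the stated figures.

```python
def regular_conv_params(k, c_in, c_out):
    """k x 1 regular convolution: one k-tap filter per (input, output) pair plus biases."""
    return k * c_in * c_out + c_out


def separable_conv_params(k, c_in, c_out):
    """Depthwise k x 1 filters (one per input channel) plus a 1 x 1 pointwise convolution."""
    return k * c_in + (c_in * c_out + c_out)


def dslwc_params(c_in, c_out):
    """Depthwise wavelet stage learns only (s, u) per input channel, plus the pointwise stage."""
    return 2 * c_in + (c_in * c_out + c_out)


if __name__ == "__main__":
    k, c_in, c_out = 7, 50, 30
    reg = regular_conv_params(k, c_in, c_out)
    sep = separable_conv_params(k, c_in, c_out)
    lap = dslwc_params(c_in, c_out)
    print(reg, sep, lap)
    print(f"DSLWC / regular = {lap / reg:.4f}")
```

Note that the DSLWC count does not depend on the filter width k, so widening the kernel adds no parameters to the wavelet stage.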

Variable-Asymmetric Focal Loss
A sample is defined as a positive sample if the label predicted by the fault diagnosis model is the same as the true label. Positive samples with estimated probability > 0.5 are easy positive samples, and hard positive samples are those with estimated probability ≤ 0.5. The definitions of easy negative samples and hard negative samples are the opposite of the above. In order to handle the diverse samples from a data set with a long-tailed distribution efficiently and pertinently, an asymmetric focal loss function that varies with the epochs is proposed.
Specifically, the role of vanilla focal loss (FL) is to seek trade-offs between the importance of easy and hard samples. When the attenuation factor γ is large, FL will inhibit the easy sample. Although the easy samples can be suppressed in this way, the contribution difference between positive and negative samples during the process of the convergence of a neural network is ignored. Therefore, the attenuation factor should be decoupled, and the contribution of positive and negative samples should be rebalanced to help the model update its weight in a better direction. An approach of asymmetric soft-thresholding is employed on the positive and negative parts of the loss to decouple the weighting factors between positive and negative labels, which is represented in Equation (17).
At the beginning of training, the deep learning model has not yet learned the features of the samples well, and the percentage of samples that are classified correctly with high confidence is not very large. At this stage, the attenuation factor should not be too large, so as to limit the impact on the samples with low confidence. As training advances, the majority classes are more easily identified and their samples become easy positives. By increasing the value of the attenuation factor, the dominance of easy positive samples in the loss can be reduced, and thus the weight of the minority classes can be increased. Therefore, for the positive samples, a dynamic positive attenuation factor is proposed, as shown in Equation (18).
where the hyper-parameter γ+ denotes the initial positive attenuation factor, e_i is the current training epoch, and e_n is the total number of training epochs. The value of γ_pos increases with α. In addition, the square root is very sensitive to errors and gives a good indication of the measurement precision of the data, which gives the attenuation factor better dynamic adaptability.

For negative samples, the proportion of easy samples first increases as the features are gradually learned. The easy negative samples dominate the negative part of the loss, which compresses the adjustment of the weights of the hard negative samples, which mainly come from the minority classes. Therefore, increasing the value of the attenuation factor can suppress the easy negative samples. However, in the middle and late stages of training, the number of negative samples decreases and the proportion of easy samples also decreases sharply. Accordingly, the attenuation factor must be reduced so that the loss of the hard samples does not become so low that it weakens the learning ability of the model. For these reasons, a cyclical negative attenuation factor is proposed for the negative samples, as expressed in Equation (19), where the hyper-parameter γ− (< γ+) denotes the initial negative attenuation factor, β is the maximum amount by which γ_neg can increase, and n_c (≥ 1) controls how quickly γ_neg reaches its maximum. γ_neg changes from the minimum to the maximum during the first 1/n_c of the training process and from the maximum back to the minimum during the rest of the epochs.

Integrating Equations (18) and (19) into Equation (17) yields the variable-asymmetric focal loss (VAFL). The discrete value of the sample loss is utilized to locate the boundaries between the easy and hard samples and between the positive and negative samples.
The contribution rate is then rebalanced according to the different sample losses during the process of backpropagation. Hence, the VAFL function can reasonably adjust the impact of different samples on the convergence process of the model, which is suitable for the intelligent fault diagnosis of class imbalance data.
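The qualitative behaviour of the two attenuation factors can be sketched as follows. The exact forms of Equations (17)-(19) are not assumed here: the square-root growth of γ_pos, the triangular cycle of γ_neg, and the binary asymmetric focal term are hypothetical stand-ins that only reproduce the behaviour described in the text (γ_pos grows with training, γ_neg rises during the first 1/n_c of training and then falls back). The defaults γ+ = 2, γ− = 1 and α = β = 3 follow the settings reported in the implementation details; n_c = 4 is an arbitrary illustrative value.

```python
import math


def gamma_pos(epoch, total, gamma_plus=2.0, alpha=3.0):
    """Hypothetical schedule: grows from gamma_plus to gamma_plus + alpha."""
    return gamma_plus + alpha * math.sqrt(epoch / total)


def gamma_neg(epoch, total, gamma_minus=1.0, beta=3.0, n_c=4):
    """Hypothetical triangular cycle peaking at epoch total / n_c."""
    peak = total / n_c
    if epoch <= peak:
        frac = epoch / peak                       # rising phase
    else:
        frac = (total - epoch) / (total - peak)   # falling phase
    return gamma_minus + beta * frac


def asymmetric_focal(p, is_positive, epoch, total):
    """Binary asymmetric focal term with epoch-dependent attenuation factors.

    p is the predicted probability of the class in question; is_positive says
    whether that class is the true label (an assumption of this sketch).
    """
    eps = 1e-12
    if is_positive:
        return -((1.0 - p) ** gamma_pos(epoch, total)) * math.log(p + eps)
    return -(p ** gamma_neg(epoch, total)) * math.log(1.0 - p + eps)


if __name__ == "__main__":
    total = 200
    for epoch in (0, 50, 100, 200):
        print(epoch, round(gamma_pos(epoch, total), 3), round(gamma_neg(epoch, total), 3))
```

Decoupling the two factors is what allows easy positives to be suppressed more and more strongly while hard negatives regain weight late in training.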

The Proposed Framework Based on DSLWCN-VAFL Algorithm
In the fault diagnosis of rotating machinery, it is very expensive and challenging to obtain labeled fault data from industrial equipment to establish a reliable fault diagnosis model. Meanwhile, the collected samples of fault conditions are usually far fewer than the samples of normal operation. However, most data-driven approaches are based on the premise of a balanced distribution of categories, and the model structure is so complex that it consumes huge computational resources. To this end, a new lightweight model for processing class-imbalanced data, namely DSLWCN-VAFL, is proposed in this paper, as shown in Figure 4. The introduction of the DSLWC layer allows the time-frequency features in the data to be mined. This layer cooperates with the multi-scale regular convolution branch, which fully learns the time-domain features, so that the model can extract more abundant sensitive feature vectors of different types from the limited signal samples and thus make the data distribution clearer. At the same time, a new cost-sensitive loss mechanism, VAFL, is designed, which reasonably rebalances the contributions of distinct samples during model training.
In industrial applications, the imbalanced fault diagnosis framework based on DSLWCN-VAFL is shown in Figure 5. The specific steps are as follows:

1. Obtain the vibration signals of the rotating machinery by acceleration sensors.

2. Perform data segmentation and normalization of the raw vibration signals.

3. Divide the collected data into a training sample set, a validation sample set and a test sample set.

4. Input the training sample set into DSLWCN-VAFL, and verify the classification performance on the validation set.

5. Feed the test sample set into the trained DSLWCN-VAFL for fault diagnosis and output the results.
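Steps 2 and 3 above can be sketched in plain Python. The window length, stride, z-score normalization and 60/20/20 split are illustrative assumptions, not settings reported in the paper, and the function names are hypothetical.

```python
import math
import random


def segment(signal, length, stride):
    """Cut a 1D signal into fixed-length windows (overlapping if stride < length)."""
    return [signal[i:i + length] for i in range(0, len(signal) - length + 1, stride)]


def zscore(window):
    """Normalize one window to zero mean and unit variance (assumed normalization)."""
    mean = sum(window) / len(window)
    var = sum((v - mean) ** 2 for v in window) / len(window)
    std = math.sqrt(var) or 1.0  # guard against constant windows
    return [(v - mean) / std for v in window]


def split(samples, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle and divide samples into training, validation and test sets."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    n_train = int(ratios[0] * len(idx))
    n_val = int(ratios[1] * len(idx))
    train = [samples[i] for i in idx[:n_train]]
    val = [samples[i] for i in idx[n_train:n_train + n_val]]
    test = [samples[i] for i in idx[n_train + n_val:]]
    return train, val, test


if __name__ == "__main__":
    # Synthetic stand-in for a measured vibration signal.
    raw = [math.sin(0.05 * n) + 0.1 * math.sin(1.3 * n) for n in range(1000)]
    windows = [zscore(w) for w in segment(raw, length=100, stride=50)]
    train, val, test = split(windows)
    print(len(windows), len(train), len(val), len(test))
```

When windows overlap, splitting by window can place nearly identical segments in both the training and test sets; splitting the raw record before segmentation avoids that leakage.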

Implementation Details
In order to verify the performance of the proposed imbalanced fault diagnosis method based on DSLWCN-VAFL, experimental studies are conducted on the bearing-gear data from Southeast University and the bearing data from Case Western Reserve University, respectively. All experiments are implemented in PyTorch 1.8.0 and Python 3.8.5, running on an AMD Ryzen 7 4800H with Radeon Graphics @ 2.90 GHz (16 GB RAM) and a GTX 1650 GPU.
In addition, some specific training parameters are set as follows. The optimizer of the network is Adam, and the learning rate is set to 0.001. According to Refs. [37,38], the batch size is set to 64. Early stopping is utilized to avoid overfitting. In the experiments, some hyperparameters in VAFL have a wide selection range. In general, the attenuation factor is set to 2 [29]. Therefore, the positive attenuation factor γ+ in VAFL is set to 2. Since γ− must be less than γ+, γ− is set to 1. In extremely imbalanced situations, where the number of easy samples is much larger than the number of hard samples, VAFL needs a stronger suppression capability. Therefore, the maximum control coefficients α and β are set to 3. The specific structures, parameters, and FLOPs of DSLWCN-VAFL are listed in Table 1, where the number of classes is expressed as C. According to Refs. [39,40], DSLWCN-VAFL, with its small number of parameters and FLOPs, can be called a lightweight model, which can effectively reduce the computational burden of fault diagnosis. When class-imbalanced data are used for fault diagnosis, even if the samples from the minority classes are classified wrongly, the accuracy can still remain high because of the samples from the majority class. Therefore, accuracy alone is not a good representation of the experimental effect. In addition to accuracy, G-mean and F1-Score are introduced as evaluation indexes to comprehensively evaluate the classification performance.
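Why accuracy alone is misleading can be seen in a small sketch. The multi-class definitions used here are assumptions, since the paper's own formulas are not shown: G-mean is taken as the geometric mean of the per-class recalls and F1-Score as the macro average of the per-class F1 values.

```python
import math


def evaluate(y_true, y_pred):
    """Return (accuracy, macro F1, G-mean) under the assumed definitions."""
    classes = sorted(set(y_true))
    recalls, f1s = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        denom = precision + recall
        f1s.append(2 * precision * recall / denom if denom else 0.0)
        recalls.append(recall)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    g_mean = math.prod(recalls) ** (1.0 / len(recalls))
    macro_f1 = sum(f1s) / len(f1s)
    return accuracy, macro_f1, g_mean


if __name__ == "__main__":
    # 90 normal samples and 10 fault samples; the model predicts "normal" for all.
    y_true = [0] * 90 + [1] * 10
    y_pred = [0] * 100
    acc, f1, gm = evaluate(y_true, y_pred)
    print(f"accuracy={acc:.2f}  macro-F1={f1:.4f}  G-mean={gm:.2f}")
```

In this example every fault sample is missed, yet the accuracy stays at 90%, whereas the G-mean collapses to zero and the macro F1 drops below 0.5.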
The experimental data are provided by Southeast University [41]. As shown in Figure 6, the steady-state signals are collected from the drivetrain dynamic simulator (DDS) with the rotating speed-system load set to 20 Hz-0 V. Among them, bearing faults are induced by cracks in different locations. The remaining gear faults are divided into four types: Chipped, Miss, Root and Surface. Both Chipped and Root are caused by cracks, but at distinct locations. The fault Miss is caused by a missing gear tooth. Surface indicates the presence of wear on the gear surface. According to Ref. [37], when the number of samples is less than 100, it can be called a small sample problem, and when the number of samples is 10, it is called an extremely small sample problem. When constructing data sets, researchers rarely limit the samples of all classes. In order to fully analyze the performance of the proposed method on small and imbalanced data, three different data sets, A, B and C, are constructed. The specific information is shown in Table

[Figure: raw and noisy signals; horizontal axis: number of data points; vertical axis: amplitude (m/s²)]

To verify how DSLWC and VAFL improve the network when performing small and imbalanced fault diagnosis, ablation experiments are conducted on the data sets A, B, C and the noisy samples. DSLWCN-VAFL is compared with three other models: DCNN-CE (without DSLWC and VAFL), DCNN-VAFL (without DSLWC) and DSLWCN-FL (without VAFL). After 200 training epochs, the diagnostic performance of the four models on distinct data sets is shown in Figures 8-10. In the case where the classes are balanced but the samples are extremely limited, DCNN lacks the ability to extract explicit periodic pulse information from the input data and fully exploit the time-frequency features, and the model complexity does not match the amount of data, which makes the model over-sensitive to noise and outliers and thus leads to overfitting. The lightweight model DSLWCN, with few parameters, can extract distinct types of multi-scale features to attain more effective features. Therefore, it exhibits excellent diagnostic performance when dealing with limited samples. According to Ref. [28], vanilla FL degrades classification performance when handling balanced data sets. However, the effect of VAFL is close to that of CE, so VAFL is more suitable than FL for processing balanced data sets.

The diagnostic results of DCNN-CE are disappointing when handling the imbalanced data sets. The main reason is that DCNN itself has poor feature extraction capability and cannot effectively learn features from fault classes with scarce data. More importantly, the cross-entropy loss function does not reasonably balance the contribution of easy-positive and hard-negative samples in the training process of the model, which results in the normal class with sufficient samples dominating the loss and thus failing to implement effective classification.
What is worse, the performance of DCNN-CE declines dramatically as the imbalance problem becomes more severe. With the support of VAFL, the imbalance diagnosis performance of DCNN is improved, while DSLWCN, with its better feature learning capability, also obtains better results with the help of FL. FL balances the contribution of easy and hard samples, which CE does not, and this improves imbalanced fault diagnosis. However, VAFL, which decouples the positive and negative samples and dynamically adjusts the attenuation factor, possesses a more reasonable contribution balance strategy, which allows DSLWCN-VAFL to achieve excellent diagnostic results even on severely imbalanced data sets. The accuracy, F1-Score and G-mean reach 98.57%, 98.31% and 98.37%, respectively. The comparison in Figure 7 shows that the key features used to identify the health of rotating machinery components are easily submerged in the noise, which seriously affects the performance of intelligent fault diagnostic models in practical applications. Nevertheless, the performance of DSLWCN-VAFL, with its detailed time-frequency feature extraction ability, does not deteriorate substantially under noise interference, which indicates that it has a certain anti-noise capability. Moreover, the standard deviation of DSLWCN-VAFL is the smallest among the compared models, suggesting that it possesses better stability. To further analyze the processing efficiency of the proposed model, the running time of the four models on distinct data sets is listed in Table 3. As a lightweight model, DSLWCN is clearly more advantageous in terms of diagnostic efficiency.

Results of Visualization
To obtain a more intuitive view of the prediction results of the different models, visualization tools are introduced. Taking data set C, with severely imbalanced classes and limited samples, as an example, the confusion matrix of each model for the first run is plotted. As shown in Figure 11, the overall diagnostic accuracies of the four models are 83.44%, 92.12%, 97.56%, and 98.77%, respectively. The diagnostic results of DCNN-CE are seriously biased. Nevertheless, after VAFL suppresses the large number of easy-to-classify samples and mines hard-to-classify samples from the minority classes, the tendency of the network to drift toward invalid learning is mitigated, and the update direction of the gradient is better optimized. Although the fault signal features corresponding to label 5 and label 6 are relatively difficult to distinguish and lead to misclassification, DSLWCN, with its stronger feature mining ability, can extract different types of multi-scale features from limited samples to improve the fault diagnosis capability.
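The three indicators reported in the experiments can all be derived from a confusion matrix. The sketch below is a minimal illustration under two assumptions that are not confirmed by this section: the F1-Score is macro-averaged over classes, and the G-mean is the geometric mean of the per-class recalls. The function name and the toy matrix are hypothetical.

```python
import math


def metrics_from_confusion(cm):
    """Return (accuracy, macro F1, G-mean) for a square confusion matrix.

    cm[i][j] is the number of samples with true label i predicted as label j.
    Macro-averaged F1 and recall-based G-mean are assumed definitions.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    recalls, f1_scores = [], []
    for i in range(n):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                        # true class i, predicted elsewhere
        fp = sum(cm[r][i] for r in range(n)) - tp   # other classes predicted as i
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
        recalls.append(recall)
    macro_f1 = sum(f1_scores) / n
    g_mean = math.prod(recalls) ** (1.0 / n)
    return accuracy, macro_f1, g_mean


if __name__ == "__main__":
    # Toy two-class case: the majority class dominates the accuracy.
    toy = [[90, 10],
           [5, 5]]
    acc, f1, g = metrics_from_confusion(toy)
    print(f"accuracy={acc:.4f}  macro-F1={f1:.4f}  G-mean={g:.4f}")
```

In the toy matrix the accuracy stays high even though half of the minority class is misclassified, whereas the G-mean drops sharply, which is why accuracy alone is not sufficient on imbalanced data sets.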
As another common visualization technique, t-distributed Stochastic Neighbor Embedding (t-SNE) is introduced to verify the diagnostic performance of the proposed method. Data set C is again taken as an example to obtain the feature visualization results after dimensionality reduction of the data. As can be seen from Figure 12, after diagnosis with DCNN, distinct faults overlap, indicating that the faults are not well differentiated. In contrast, DSLWCN increases the spatial separation between the fault classes and makes the intra-class distribution relatively dense. Furthermore, the proposed model, augmented by the improved rebalancing strategy VAFL, makes the distribution boundaries of the classes clearer and enhances their separability, which helps the classifier distinguish diverse classes of data and thus improves the monitoring capability for different health states.

Comparison of Various Class Imbalanced Methods
For further analysis of the effectiveness of VAFL, the proposed method is compared with five existing methods. Specifically, CE is the classical technique for class-balanced problems, while the other four are state-of-the-art long-tailed classification techniques: class-balanced loss [30], gradient harmonizing mechanism for classification loss (GHMCL) [42], AdaptiveFocalLoss [43], and label-distribution-aware margin loss (LDAML) [44]. To ensure fairness, the parameters of these methods are set to the optimal values reported in the respective papers. Experiments are conducted on the three data sets separately, and the diagnostic performance is measured by the F1-Score, as listed in Table 4. It is undeniable that most long-tailed classification techniques do not perform as well as CE on class-balanced problems, but VAFL still achieves promising results. Moreover, the F1-Score of the proposed method reaches 99.82% and 98.31% on the moderately imbalanced data set B and the severely imbalanced data set C, respectively. Consequently, VAFL also performs well on class-imbalanced data sets. The loss curves of the different methods on data set C are plotted in Figure 13. Compared with the others, VAFL converges faster, by about the 25th epoch, indicating that it is more computationally efficient. In addition, the fluctuation of VAFL in the later stage is small, and its stability is stronger. All in all, VAFL can replace CE as a more general technique for both balanced and imbalanced problems.

The effectiveness of the DSLWCN-VAFL algorithm and the class imbalanced fault diagnosis framework designed in this paper needs to be further explored. The bearing signal data provided by Case Western Reserve University (CWRU) is one of the most widely used standard public data sets in prognostics and health management (PHM) [45]. The signal data are acquired from the accelerometer of the motor-driven mechanical system. The sampling frequency is 12 kHz, the motor load is 1 hp, the operation is steady-state, and the corresponding speed is 1772 r/min. The experimental platform of CWRU is displayed in Figure 14. Some bearing signals and noisy signals under different health conditions are plotted in Figure 15.

Table 5.
Data set   Samples per class (ten classes)                     Noise
A          10   10  10  10  10  10  10  10  10  10             None
A-Noise    10   10  10  10  10  10  10  10  10  10             5
B          100  30  30  30  30  30  30  30  30  30             None
B-Noise    100  30  30  30  30  30  30  30  30  30             5
C          100  10  10  10  10  10  10  10  10  10             None
C-Noise    100  10  10  10  10  10  10  10  10  10             5
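The construction of the noisy data sets can be illustrated with a minimal sketch. It assumes that the noise value of 5 in Table 5 denotes a signal-to-noise ratio in dB and that the noise is additive white Gaussian; neither assumption is stated in this section. The function names and the 50 Hz test tone are hypothetical.

```python
import math
import random


def add_gaussian_noise(signal, snr_db, seed=0):
    """Add white Gaussian noise so that the result has roughly the target SNR in dB.

    Assumes SNR(dB) = 10 * log10(P_signal / P_noise); this is an illustrative
    convention, not a procedure confirmed by the paper.
    """
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB between a clean signal and its noisy version."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10.0 * math.log10(signal_power / noise_power)


if __name__ == "__main__":
    # Hypothetical 50 Hz tone sampled at 12 kHz (the sampling frequency stated above).
    clean = [math.sin(2 * math.pi * 50 * t / 12000) for t in range(24000)]
    noisy = add_gaussian_noise(clean, snr_db=5.0, seed=1)
    print(f"measured SNR: {measured_snr_db(clean, noisy):.2f} dB")
```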

Diagnosis Results and Analysis
Since the fault diagnosis task is similar, the experimental parameters in Case 2 are kept consistent with those in Case 1. The comparison results of the diagnostic performance of the four models are shown in Figures 16-18. A comprehensive analysis of the three indicators (Accuracy, F1-Score, G-mean) shows that DSLWCN-VAFL delivers satisfactory diagnostic capability both on class-balanced data sets with extremely limited samples and on class-imbalanced data sets. Furthermore, according to the running time of the models in Table 6, DSLWCN-VAFL offers excellent diagnostic efficiency as a lightweight class imbalance diagnostic model. In addition, the damage diameter corresponding to the slight fault shown in Figure 15 is smaller, and the features are more blurred under noise interference, yet DSLWCN-VAFL still achieves more than 98% in all three indices on the severely imbalanced noisy data sets. Taking data set C, with severely imbalanced classes and limited samples, as an example, the confusion matrix of each model for the first run is plotted in Figure 19. The overall classification accuracies of the four models are 87.44%, 94.43%, 98.90% and 99.50%, respectively. Although the feature extraction ability of DCNN is poor, once VAFL rebalances the contributions of samples from the majority and minority classes, the probability of a fault class being misclassified as the normal class is greatly reduced. The t-SNE feature visualization of the different models in Figure 20 shows more intuitively that the classification boundaries of the features learned by DSLWCN are clearer, and the corresponding 2-D embedding shows that the intra-class distribution is more compact and the inter-class distribution is more dispersed. The analysis above once again demonstrates that the proposed method has outstanding advantages in class imbalanced fault diagnosis.
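Part of the lightweight property comes from depthwise separable convolution requiring fewer parameters than regular convolution. The sketch below compares the parameter counts of a regular 1D convolution and a depthwise separable 1D convolution with ordinary learnable kernels. The channel and kernel sizes are hypothetical, and the sketch does not model the Laplace-wavelet parameterization of the DSLWC layer, which is not described in this section.

```python
def regular_conv1d_params(c_in, c_out, kernel_size, bias=True):
    """Parameters of a regular 1D convolution: one kernel per (input, output) channel pair."""
    return c_in * c_out * kernel_size + (c_out if bias else 0)


def depthwise_separable_conv1d_params(c_in, c_out, kernel_size, bias=True):
    """Parameters of a depthwise convolution followed by a pointwise (1x1) convolution."""
    depthwise = c_in * kernel_size + (c_in if bias else 0)  # one kernel per input channel
    pointwise = c_in * c_out + (c_out if bias else 0)       # channel mixing only
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer sizes, chosen only for illustration.
    c_in, c_out, k = 16, 32, 64
    regular = regular_conv1d_params(c_in, c_out, k, bias=False)
    separable = depthwise_separable_conv1d_params(c_in, c_out, k, bias=False)
    print(f"regular: {regular}, depthwise separable: {separable}, "
          f"ratio: {regular / separable:.1f}x")
```

The saving grows with the kernel size and the number of output channels, since the separable form replaces the product c_in * c_out * k with the sum c_in * k + c_in * c_out.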

Comparison with Other Diagnosis Frameworks
The effectiveness of VAFL and other state-of-the-art classification techniques is further compared on the bearing data sets and measured by the F1-Score, the results of which are presented in Table 7. The diagnostic results of the proposed method are the best and the standard deviation is the smallest, indicating that the stability of DSLWCN-VAFL is excellent.
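For context, the re-weighting used by one of the compared baselines, the class-balanced loss [30], can be sketched as follows. The sketch assumes the commonly used effective-number form, in which a class with n samples receives a weight proportional to (1 - beta) / (1 - beta^n), normalized so that the weights sum to the number of classes. The value beta = 0.999 is an illustrative assumption rather than the setting used in the experiments, and the class counts are those of data set C in Table 5.

```python
def class_balanced_weights(counts, beta=0.999):
    """Per-class weights based on the effective number of samples.

    effective_n = (1 - beta ** n) / (1 - beta); weight is proportional to 1 / effective_n.
    beta = 0.999 is an assumed, illustrative value.
    """
    effective = [(1.0 - beta ** n) / (1.0 - beta) for n in counts]
    raw = [1.0 / e for e in effective]
    scale = len(counts) / sum(raw)  # normalize so the weights sum to the class count
    return [w * scale for w in raw]


if __name__ == "__main__":
    # Data set C: one majority class with 100 samples, nine minority classes with 10 each.
    counts_c = [100] + [10] * 9
    weights = class_balanced_weights(counts_c)
    print([round(w, 3) for w in weights])
```

Such a scheme re-weights whole classes by their sample counts, whereas focal-type losses such as FL and VAFL re-weight individual samples according to how confidently they are classified.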
To verify the superiority of the intelligent fault diagnosis method proposed in this paper, some classical class imbalanced diagnosis frameworks are selected for comparison, such as those based on SMOTE [19], ADASYN [46], VAE [47] and GAN [48]. The specific comparison results on two class imbalanced data sets with different levels of severity are shown in Figure 21. The results of traditional methods such as SMOTE and ADASYN on the severely imbalanced data sets are unsatisfactory, with F1-Scores of only 91.42% and 89.62%, respectively. VAEs and GANs have become the most common techniques for solving class imbalanced problems. The new samples generated by such deep learning-related data synthesis methods extend the feature space of the original samples, making it easier for the classifier to distinguish diverse types of features, which gives these methods a diagnostic advantage over traditional data augmentation methods. However, these deep generative models are often difficult to train and require substantial computational resources. In contrast, as a lightweight model, DSLWCN-VAFL maintains a remarkable diagnostic ability while consuming limited computational resources, which indicates that its application is promising.
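The interpolation idea behind SMOTE-style oversampling can be shown in a minimal sketch: a synthetic minority sample is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration, not the implementation used in the compared frameworks; the function name, the neighbour count k = 3 and the toy points are hypothetical.

```python
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic samples by interpolating between minority neighbours.

    Simplified SMOTE-style sketch: brute-force neighbour search, Euclidean distance.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # interpolation coefficient in [0, 1)
        synthetic.append([a + lam * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


if __name__ == "__main__":
    # Hypothetical 2-D minority-class feature vectors.
    minority = [[0.0, 0.0], [1.0, 0.2], [0.8, 1.0], [0.1, 0.9]]
    for sample in smote_like(minority, n_new=3):
        print([round(v, 3) for v in sample])
```

Because every synthetic sample lies between existing minority samples, this kind of oversampling cannot extend the feature space beyond what the few original samples already cover, which is consistent with the weaker results reported for SMOTE and ADASYN on the severely imbalanced data sets.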

Conclusions
In this paper, a lightweight method named DSLWCN-VAFL is proposed to address fault diagnosis on small and imbalanced data sets. As one of the key technologies in this method, the DSLWC layer not only possesses fewer parameters than regular convolution but also captures time-frequency features from the input 1D data. The branch with the DSLWC layer, combined with the branch of multi-scale regular convolution that fully learns the time-domain features, captures abundant discriminative features from limited samples to improve the classification ability of the model. Furthermore, another key technology, the novel loss function VAFL, is designed. With its dynamic adjustment ability, this loss function rebalances the influence of different samples on the convergence of the neural network. Based on the gear and bearing data sets, the diagnostic performance and anti-noise capability of DSLWCN-VAFL in the presence of extremely limited samples and severe class imbalance are discussed in detail. In addition, the effectiveness of each module in the proposed method is verified by ablation experiments. The comparative experiments with some popular methods highlight the superiority of the proposed method. DSLWCN-VAFL not only has promising application prospects but also provides a new research idea for solving class-imbalanced problems.
For future work, the effective processing of multi-source heterogeneous data collected from different sensors is worth considering, and the robustness of the method under strong background noise needs to be further improved. In addition, for variable operating conditions or cross-device diagnosis, it is worthwhile investigating techniques such as domain adaptation or transfer learning to solve the imbalanced problem. Finally, methods for small and imbalanced fault diagnosis based on zero-shot learning remain to be explored for extreme cases where no fault samples are available at all.

Conflicts of Interest:
The authors declare no conflict of interest.