1. Introduction
With the rapid advancement of modern industrial technology, mechanical systems are evolving towards larger dimensions and higher precision, making the stability and safety of equipment particularly crucial [1]. Bearings, in particular, play a pivotal role in modern industry as core components of various rotating machinery. The performance and reliability of bearings directly impact the efficiency, stability, and safety of the entire system [2]. A bearing failure can lead not only to equipment shutdown and production interruptions but also to significant economic losses [3]. In practical applications, the operating conditions of bearings vary, and factors such as equipment downtime and data collection costs limit the acquisition of sample data required for fault diagnosis [4]. Therefore, researching fault diagnosis methods under conditions of small samples and varying operating conditions holds significant theoretical importance and practical value.
In the field of fault diagnosis, signal decomposition algorithms are critical for extracting sensitive features from non-stationary signals [5]. Although the Empirical Mode Decomposition (EMD) proposed by Huang et al. [6] can decompose non-stationary signals into multiple Intrinsic Mode Functions (IMFs), it suffers from issues such as mode mixing. To address these problems, subsequent researchers have proposed a series of improved methods, including the Ensemble Empirical Mode Decomposition (EEMD) by Zhang et al. [7], the Complete Ensemble Empirical Mode Decomposition (CEEMD) by Yeh et al. [8], and the Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN) by Torres et al. [9]. While these methods have made significant improvements, they still encounter the issue of residual noise. Dragomiretskiy et al. [10] proposed Variational Mode Decomposition (VMD), which was developed as an alternative to EMD and offers robustness along with a rigorous mathematical foundation. However, the effectiveness of this decomposition is constrained by the penalty factor α and the number of modes K [11]. Miao et al. [12] introduced feature mode decomposition (FMD), which updates the filter bank through correlated kurtosis, thereby overcoming some limitations of both VMD and EMD. FMD demonstrates superior performance in the decomposition of mechanical signals, particularly in the absence of prior knowledge regarding fault periods. Nevertheless, its results depend strongly on the chosen parameter combination in the presence of environmental noise and interference. Xu Shuai et al. [13] optimized FMD using the Whale Optimization Algorithm, but challenges remain, such as the insufficient sensitivity of the objective function to weak fault characteristics and the tendency of the optimization algorithm to converge to local optima. Shi Yifei et al. [14] found that when optimizing FMD with the Grey Wolf Optimization algorithm, the limited flexibility in weight allocation of the objective function adversely affected the efficiency of fault feature extraction. These methods have certain advantages; however, their ability to extract features is limited, and they are susceptible to local optima.
In recent years, various intelligent fault diagnosis technologies, particularly those utilizing deep learning, have been applied to bearing fault diagnosis. However, these models typically require substantial amounts of training data, and fault data are often scarce because equipment operates under normal conditions most of the time [15]. To address the challenge of insufficient samples, some researchers have made notable advancements in small-sample bearing fault diagnosis, including the application of meta-learning approaches. Meta-learning is a machine learning paradigm that enables algorithms to rapidly adapt to new tasks. The primary methods of meta-learning encompass metric-based meta-learning, model-based meta-learning, and optimization-based meta-learning [16]. Among these, metric-based meta-learning quantifies the relationship between samples by computing a similarity or distance between them, and includes techniques such as Siamese Networks, Matching Networks, Prototypical Networks, and relation networks [17]. Recent studies have mitigated the limitations associated with few-shot learning through metric learning, exemplified by the Improved Siamese Neural Network (ISNN), which achieved a diagnostic accuracy of 84.1% in a 10-sample scenario, representing a 34.3% improvement over traditional CNNs, thereby validating the efficacy of this approach in few-shot contexts [18]. Wu et al. [19] introduced a meta-learning-based few-shot transfer learning method tailored for variable working conditions, highlighting the significant advantages of relation networks in scenarios characterized by small sample sizes and relatively simple transfer tasks. Guo Min et al. [20] proposed a coordinate attention relation network that embeds coordinate information and generates coordinate attention to construct the network. This approach addresses the limitations of traditional relation network models, which struggle to establish long-range dependencies in feature maps and face challenges in accurately locating fault features. However, these methods do not fully utilize features across different scales. In contrast, Liu et al. [21] achieved a 12.3% improvement in diagnostic accuracy under variable speed conditions by employing a time–frequency domain/wavelet domain multi-view cross-attention mechanism, thereby underscoring the necessity for multi-scale fusion.
In light of the previous analysis, this study introduces a novel method for small-sample bearing fault diagnosis that combines FMD optimized by the artificial lemming algorithm (ALA-FMD) with a multi-scale coordinate attention relation network (MSCA-RN). The approach tunes the feature mode decomposition parameters by utilizing the global search capabilities of the artificial lemming algorithm (ALA), which reduces the risk of becoming trapped in local optima and allows vital signal characteristics to be extracted precisely. The optimal components are then converted into two-dimensional time–frequency images through continuous wavelet transform (CWT). By adopting the multi-scale feature fusion and coordinate attention mechanism of MSCA, the method highlights significant regions and enhances feature representation, so that the relation network (RN) can identify complex signals with high precision. The benefits of this strategy in the diagnosis of bearing faults include the following:
(1) The method is especially efficient in addressing the non-stationary signal traits of bearings, particularly when dealing with small sample sizes and fluctuating operating conditions. It demonstrates robust performance in fault feature extraction across scenarios with variable speed and load, while effectively uncovering latent information within the data.
(2) The multi-scale network architecture effectively captures the multi-dimensional characteristics of bearing faults, ranging from early micro-defects to progressive failures. This significantly improves the accuracy of bearing fault classification.
2. Related Algorithms
2.1. Feature Mode Decomposition Algorithm
The FMD algorithm achieves multi-mode signal decomposition through adaptive Finite Impulse Response (FIR) filters, demonstrating its capability to characterize both impulsive and periodic features in fault signals while maintaining robustness against noise and interference. The core procedure is as follows:
First, the FIR filter bank is initialized with Hanning-window filters whose input parameters, including the number of modes and the filter length, partition the signal frequency spectrum evenly into K segments. The decomposed modes are then derived through iterative filtering. In each iteration, the period is estimated from the autocorrelation spectrum of the current mode obtained from the original signal. The constrained optimization problem is then solved using the iterative eigenvalue method, updating the filter coefficients to maximize correlated kurtosis (CK) so that the filtered signal concentrates on high-impulse characteristic components. Throughout the iteration process, a matrix of cross-correlation coefficients (CC) between modes is constructed; of the two modes with the highest CC value, the one with the lower CK is eliminated as redundant, and the number of modes is gradually reduced to the specified value. The retained modes constitute the decomposition result. In this way, the algorithm achieves feature decomposition of signals through adaptive filtering and modal redundancy elimination. The calculation formula for CC is as follows:
$$CC_{pq} = \frac{\sum_{i=1}^{M} \left(u_p(i) - \bar{u}_p\right)\left(u_q(i) - \bar{u}_q\right)}{\sqrt{\sum_{i=1}^{M} \left(u_p(i) - \bar{u}_p\right)^2}\,\sqrt{\sum_{i=1}^{M} \left(u_q(i) - \bar{u}_q\right)^2}} \tag{1}$$
In the formula, $\bar{u}_p$ and $\bar{u}_q$ are the mean values of the modes $u_p$ and $u_q$, respectively, and $M$ is the number of samples in each mode.
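The redundancy-elimination step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the first-order CK (`shift=1`), and the assumption that a fault period (in samples) is already known for each mode are all choices made for this example.

```python
import numpy as np

def correlated_kurtosis(u, period, shift=1):
    """Correlated kurtosis of order `shift` for an integer period given in samples."""
    u = np.asarray(u, dtype=float)
    prod = u.copy()
    for m in range(1, shift + 1):
        shifted = np.zeros_like(u)
        shifted[m * period:] = u[:-m * period]   # u(n - m*T), zero-padded at the start
        prod *= shifted
    return np.sum(prod ** 2) / (np.sum(u ** 2) ** (shift + 1))

def drop_most_redundant(modes, periods):
    """Find the two most correlated modes and discard the one with the lower CK."""
    modes = np.asarray(modes, dtype=float)
    cc = np.corrcoef(modes)                # Pearson CC between every pair of modes
    np.fill_diagonal(cc, -np.inf)          # ignore each mode's correlation with itself
    p, q = np.unravel_index(np.argmax(cc), cc.shape)
    ck_p = correlated_kurtosis(modes[p], periods[p])
    ck_q = correlated_kurtosis(modes[q], periods[q])
    drop = p if ck_p < ck_q else q
    return np.delete(modes, drop, axis=0), int(drop)
```

Calling `drop_most_redundant` repeatedly until the desired number of modes remains mirrors the gradual reduction described in the text; the filter update itself is not shown here.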
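To compare the denoising capabilities of different methods, we use the signal-to-noise ratio (SNR), root mean square error (RMSE), and mean absolute error (MAE) to evaluate the noise reduction performance of the various approaches:
$$SNR = 10 \log_{10} \frac{\sum_{i=1}^{M} x(i)^2}{\sum_{i=1}^{M} \left(x(i) - \hat{x}(i)\right)^2} \tag{2}$$
$$RMSE = \sqrt{\frac{1}{M} \sum_{i=1}^{M} \left(x(i) - \hat{x}(i)\right)^2} \tag{3}$$
$$MAE = \frac{1}{M} \sum_{i=1}^{M} \left|x(i) - \hat{x}(i)\right| \tag{4}$$
In the equations, $x(i)$ is the original signal, $\hat{x}(i)$ is the denoised signal, and $M$ is the number of samples.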
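As a quick illustration of these three metrics, the sketch below uses their standard definitions (SNR in decibels, with the original signal as the reference). The function names are chosen for this example only.

```python
import numpy as np

def snr_db(x, x_hat):
    """Signal-to-noise ratio in dB, treating (x - x_hat) as the residual noise."""
    x, x_hat = np.asarray(x, dtype=float), np.asarray(x_hat, dtype=float)
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum((x - x_hat) ** 2))

def rmse(x, x_hat):
    """Root mean square error between the original and the denoised signal."""
    x, x_hat = np.asarray(x, dtype=float), np.asarray(x_hat, dtype=float)
    return float(np.sqrt(np.mean((x - x_hat) ** 2)))

def mae(x, x_hat):
    """Mean absolute error between the original and the denoised signal."""
    x, x_hat = np.asarray(x, dtype=float), np.asarray(x_hat, dtype=float)
    return float(np.mean(np.abs(x - x_hat)))
```

A higher SNR together with a lower RMSE and MAE indicates that more of the original signal survives the denoising step.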
2.2. Artificial Lemming Algorithm
The ALA [22] is a population-based algorithm that requires the positions of all search agents to be initialized before the iterative phase begins. The set of candidate solutions is represented as a matrix of size N × Dim, where N is the population size and Dim is the number of dimensions of the specific problem, constrained within designated upper and lower limits, as illustrated in Equation (5):
$$\vec{Z} = \begin{bmatrix} z_{1,1} & z_{1,2} & \cdots & z_{1,Dim} \\ z_{2,1} & z_{2,2} & \cdots & z_{2,Dim} \\ \vdots & \vdots & \ddots & \vdots \\ z_{N,1} & z_{N,2} & \cdots & z_{N,Dim} \end{bmatrix} \tag{5}$$
The best position in each iteration is regarded as the optimal solution obtained so far or a near-optimal solution. The decision variable $z_{i,j}$ for each dimension is calculated using Equation (6), so the population is initialized as follows:
$$z_{i,j} = LB_j + rand \times \left(UB_j - LB_j\right), \quad i = 1, \ldots, N, \; j = 1, \ldots, Dim \tag{6}$$
In the formula, rand is a random value within the range of 0–1, $LB_j$ is the lower bound of the j-th dimension, and $UB_j$ is the upper bound of the j-th dimension.
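The initialization can be sketched in a few lines. The bounds used in the example are the FMD parameter ranges given later in Section 3.1; the function name and the use of a seeded NumPy generator are assumptions made for illustration.

```python
import numpy as np

def init_population(n_agents, lb, ub, rng):
    """Draw each decision variable uniformly between its lower and upper bound."""
    lb, ub = np.asarray(lb, dtype=float), np.asarray(ub, dtype=float)
    return lb + rng.random((n_agents, lb.size)) * (ub - lb)

# Example: 40 agents, Dim = 3 (number of modes, filter length, frequency band divisions).
rng = np.random.default_rng(0)
population = init_population(40, lb=[3, 64, 2], ub=[15, 512, 20], rng=rng)
```

Each row of `population` is one candidate parameter combination; the values are continuous here and would still need to be mapped to valid integers before use.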
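During the exploration phase, two behaviors are modeled. The first is long-distance migration, which represents random movement during periods of food scarcity. The corresponding formula is as follows:
$$\vec{Z}_i(t+1) = \vec{Z}_{best}(t) + F \times \vec{BM} \times \left( \vec{R} \times \left(\vec{Z}_{best}(t) - \vec{Z}_i(t)\right) + \left(1 - \vec{R}\right) \times \left(\vec{Z}_i(t) - \vec{Z}_a(t)\right) \right) \tag{7}$$
In the equation, $\vec{Z}_i(t+1)$ represents the position of the i-th search agent in the (t + 1)-th iteration, and $\vec{Z}_{best}(t)$ denotes the current optimal solution. F is used to alter the search direction and is calculated by Formula (8). $\vec{BM}$ represents a random number vector describing Brownian motion. $\vec{R}$ is a vector of size 1 × Dim generated by Formula (9). $\vec{Z}_i(t)$ represents the current position of the i-th search agent. $\vec{Z}_a(t)$ denotes a randomly selected search individual from the population, where a is an integer index between 1 and N.
$$F = \begin{cases} 1, & \lfloor 2 \times rand + 1 \rfloor = 1 \\ -1, & \lfloor 2 \times rand + 1 \rfloor = 2 \end{cases} \tag{8}$$
$$\vec{R} = 2 \times rand(1, Dim) - 1 \tag{9}$$
In the equation, $\lfloor \cdot \rfloor$ represents the floor function.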
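The second behavior is burrowing, which simulates the construction of a safe shelter. The formula is shown below:
$$\vec{Z}_i(t+1) = \vec{Z}_i(t) + F \times L \times \left(\vec{Z}_{best}(t) - \vec{Z}_b(t)\right) \tag{10}$$
In the equation, L represents a random number related to the current iteration count, $\vec{Z}_b(t)$ denotes a randomly selected search individual, and b is a random integer index between 1 and N. The calculation formula for L is as follows:
$$L = rand \times \left(1 + \sin\left(\frac{t}{2}\right)\right) \tag{11}$$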
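During the exploitation phase, two further behaviors are modeled. The first is foraging, which simulates the lemming's search for food within its habitat. The corresponding formula is as follows:
$$\vec{Z}_i(t+1) = \vec{Z}_{best}(t) + F \times spiral \times rand \times \vec{Z}_i(t)$$
$$spiral = radius \times \left(\sin(2\pi \times rand) + \cos(2\pi \times rand)\right), \quad radius = \sqrt{\sum_{j=1}^{Dim} \left(z_{best,j}(t) - z_{i,j}(t)\right)^2}$$
In the equation, spiral describes the spiral shape of the random search during foraging, and radius is the Euclidean distance between the current position and the optimal solution.
The second behavior in the exploitation phase is predator avoidance, which simulates escape behavior upon encountering danger. This behavior can be expressed mathematically using the following formula:
$$\vec{Z}_i(t+1) = \vec{Z}_{best}(t) + F \times G \times Levy(Dim) \times \left(\vec{Z}_{best}(t) - \vec{Z}_i(t)\right)$$
$$G = 2 \times sign\left(rand - \frac{1}{2}\right) \times \left(1 - \frac{t}{T_{max}}\right)$$
In the equation, G signifies the escape coefficient, $T_{max}$ indicates the maximum number of iterations, and $Levy(\cdot)$ stands for the Lévy flight function.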
The energy factor is employed to maintain a balance between exploration and exploitation. When the energy level is sufficient (E > 1), the agent enters the exploration phase; otherwise, it transitions to the exploitation phase. The calculation of the energy factor can be expressed by the following formula:
$$E(t) = 4 \times \arctan\left(1 - \frac{t}{T_{max}}\right) \times \ln\left(\frac{1}{rand}\right)$$
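To make the interplay between the energy factor and the four behaviors concrete, the following is a compact sketch of an ALA-style optimizer. It is written under several assumptions that are not stated in the text above: the coefficient forms follow the commonly published ALA formulation, the 30%/70% and 50%/50% branch probabilities are taken from Section 3.1, F is drawn once per iteration, out-of-range positions are clipped to the bounds, and a new position replaces the old one only if it improves the agent's fitness. The names `ala_minimize` and `levy` are hypothetical.

```python
import math
import numpy as np

def levy(dim, rng, beta=1.5):
    """Levy-flight step via Mantegna's algorithm (beta = 1.5 is an assumed default)."""
    sigma = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
             / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma, dim)
    v = rng.normal(0.0, 1.0, dim)
    return 0.01 * u / np.abs(v) ** (1 / beta)

def ala_minimize(fitness, lb, ub, n_agents=40, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    lb, ub = np.asarray(lb, dtype=float), np.asarray(ub, dtype=float)
    dim = lb.size
    Z = lb + rng.random((n_agents, dim)) * (ub - lb)      # uniform initialization
    fit = np.array([fitness(z) for z in Z], dtype=float)
    best = int(np.argmin(fit))
    z_best, f_best = Z[best].copy(), float(fit[best])
    history = [f_best]                                    # best fitness per iteration
    for t in range(1, max_iter + 1):
        F = 1.0 if math.floor(2 * rng.random() + 1) == 1 else -1.0   # direction flag
        for i in range(n_agents):
            # Energy factor; 1 - rng.random() lies in (0, 1], so the log is finite.
            E = 4 * math.atan(1 - t / max_iter) * math.log(1 / (1 - rng.random()))
            if E > 1:                                     # exploration
                if rng.random() < 0.3:                    # long-distance migration
                    R = 2 * rng.random(dim) - 1
                    BM = rng.standard_normal(dim)         # Brownian-motion vector
                    a = rng.integers(n_agents)
                    new = z_best + F * BM * (R * (z_best - Z[i]) + (1 - R) * (Z[i] - Z[a]))
                else:                                     # burrowing
                    L = rng.random() * (1 + math.sin(t / 2))
                    b = rng.integers(n_agents)
                    new = Z[i] + F * L * (z_best - Z[b])
            else:                                         # exploitation
                if rng.random() < 0.5:                    # foraging (spiral search)
                    radius = np.linalg.norm(z_best - Z[i])
                    r = rng.random()
                    spiral = radius * (math.sin(2 * math.pi * r) + math.cos(2 * math.pi * r))
                    new = z_best + F * spiral * rng.random() * Z[i]
                else:                                     # predator avoidance
                    G = 2 * np.sign(rng.random() - 0.5) * (1 - t / max_iter)
                    new = z_best + F * G * levy(dim, rng) * (z_best - Z[i])
            new = np.clip(new, lb, ub)
            f_new = float(fitness(new))
            if f_new < fit[i]:                            # greedy replacement
                Z[i], fit[i] = new, f_new
                if f_new < f_best:
                    z_best, f_best = new.copy(), f_new
        history.append(f_best)
    return z_best, f_best, history
```

Because the energy factor shrinks as t approaches the maximum iteration count, early iterations are dominated by the two exploration moves and late iterations by the two exploitation moves.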
2.3. Relation Network
The relation network is a metric-based approach to meta-learning, as shown in Figure 1. To construct a classification task, N categories are chosen at random from the training dataset; K labeled samples from each category form the support set, and the remaining samples from these N categories form the query set. This setup is known as an N-way K-shot training strategy.
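The N-way K-shot episode construction can be sketched as below, using only the standard library. The function name and the choice to return sample indices rather than the samples themselves are assumptions for illustration.

```python
import random

def sample_episode(labels, n_way, k_shot, rng):
    """Return (classes, support, query) for one N-way K-shot episode, as sample indices."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    classes = rng.sample(sorted(by_class), n_way)    # N categories chosen at random
    support, query = [], []
    for c in classes:
        members = by_class[c][:]
        rng.shuffle(members)
        support += members[:k_shot]                  # K labeled samples per category
        query += members[k_shot:]                    # remaining samples of that category
    return classes, support, query

# Example: 10 fault classes with 20 samples each, one 5-way 5-shot episode.
labels = [i % 10 for i in range(200)]
classes, support, query = sample_episode(labels, n_way=5, k_shot=5, rng=random.Random(0))
```

Training then repeats this sampling many times, so that the network sees a different small classification task in every episode.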
The primary components consist of the embedding module and the relational module. The embedding module comprises four convolutional blocks along with two pooling layers, whereas the relational module consists of two convolutional blocks, two pooling layers, and two fully connected layers with eight and one neurons, respectively. Each convolutional block features a convolutional layer, a Batch Normalization layer, and a ReLU activation function.
2.4. Multi-Scale Coordinate Attention Mechanism
The attention mechanism enhances the model by generating corresponding weights for the extracted features, thereby making their characteristics more pronounced. Building upon coordinate attention [23], this paper introduces a multi-scale coordinate attention mechanism. This mechanism emphasizes relationships between different positions or coordinates in the input data, highlighting the significance of spatial information rather than merely considering relative feature relationships. By integrating a local scale branch, it enables cross-scale feature interaction and weight fusion. This approach processes spatial information in parallel across different scales, maintaining coordinate positioning accuracy while enhancing long-range dependency modeling capabilities. It is particularly effective for extracting and analyzing complex time–frequency features in mechanical fault diagnosis, offering a robust solution for high-precision diagnosis in scenarios characterized by small sample sizes and variable operating conditions. The structure of the multi-scale coordinate attention module is depicted in Figure 2.
First, convolution operations are performed on the feature map using 3 × 3 and 5 × 5 convolution kernels to obtain feature maps A3×3 and A5×5. These feature maps will have shapes identical to the input feature map, with dimensions C × H × W.
Simultaneously, the feature maps undergo average pooling along the horizontal and vertical directions. In this phase, the output of the c-th channel at height h can be represented as follows:
$$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i)$$
The output of the c-th channel at width w can be expressed as
$$z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)$$
where $x_c$ denotes the c-th channel of the input feature map.
The concatenation of pooled feature maps results in a feature map sized C × (H + W). Following this, the features are combined and processed through nonlinear transformations using convolutional operations alongside ReLU activation functions, which produces a feature map with dimensions C/r × (H + W), where r denotes a scaling factor. Another convolutional operation is then performed, producing a feature map of size C × (H + W). This finalized fused feature map is split into two segments: one measuring C × H × 1 and the other measuring C × 1 × W.
Convolution operations and Sigmoid activation functions are then applied to the separated feature maps, generating attention weight maps of size C × H × 1 and C × 1 × W. The values in these weight maps lie between 0 and 1 and reflect the importance of different positions in the feature maps. The attention weight maps are multiplied element-wise with both the original input feature map and the feature maps produced by the different convolution kernels, yielding feature maps that incorporate spatial attention information. These reweighted maps serve as the output feature maps B of the module.
The multi-scale coordinate attention module enhances the model’s feature representation and image analysis capabilities by extracting multi-scale features and applying attention mechanisms. This approach highlights crucial spatial information while suppressing irrelevant details in the input feature maps, ultimately providing higher-quality feature representations for subsequent network layers.
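The data flow through the module can be traced with a small NumPy sketch. Several simplifications are assumed here and are not taken from the paper: the 1 × 1 convolutions are written as plain matrix products with random weights, the 3 × 3 and 5 × 5 convolution branches are replaced by mean filters as stand-ins, batch normalization is omitted, and the reweighted branches are fused by summation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def box_filter(x, k):
    """Stand-in for a k x k convolution branch: same-padded mean filter per channel."""
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x)
    for di in range(k):
        for dj in range(k):
            out += xp[:, di:di + x.shape[1], dj:dj + x.shape[2]]
    return out / (k * k)

def coordinate_attention(x, w_reduce, w_expand, w_h, w_w):
    """x has shape (C, H, W); returns weights of shape (C, H, 1) and (C, 1, W)."""
    C, H, W = x.shape
    z_h = x.mean(axis=2)                          # pool along width  -> (C, H)
    z_w = x.mean(axis=1)                          # pool along height -> (C, W)
    z = np.concatenate([z_h, z_w], axis=1)        # (C, H + W)
    f = np.maximum(w_reduce @ z, 0.0)             # 1x1 conv + ReLU  -> (C/r, H + W)
    g = w_expand @ f                              # second 1x1 conv  -> (C, H + W)
    g_h, g_w = g[:, :H], g[:, H:]                 # split back into the two directions
    a_h = sigmoid(w_h @ g_h)[:, :, None]          # (C, H, 1), values in (0, 1)
    a_w = sigmoid(w_w @ g_w)[:, None, :]          # (C, 1, W), values in (0, 1)
    return a_h, a_w

def msca(x, weights):
    """Reweight the input and both multi-scale branches, then sum them (assumed fusion)."""
    a_h, a_w = coordinate_attention(x, *weights)
    branches = [x, box_filter(x, 3), box_filter(x, 5)]
    return sum(b * a_h * a_w for b in branches)

# Example with C = 8 channels, a 16 x 16 map, and reduction ratio r = 4.
rng = np.random.default_rng(0)
C, H, W, r = 8, 16, 16, 4
x = rng.standard_normal((C, H, W))
weights = (rng.standard_normal((C // r, C)), rng.standard_normal((C, C // r)),
           rng.standard_normal((C, C)), rng.standard_normal((C, C)))
B = msca(x, weights)
```

The output keeps the C × H × W shape of the input, so the module can be placed in front of the embedding module without changing the rest of the network.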
3. Small-Sample Fault Diagnosis Method
In order to tackle the issues related to complex feature extraction for rolling bearing faults and the limited availability of training samples, this study introduces a diagnosis approach for bearing faults that is founded on ALA-FMD and MSCA-RN.
Figure 3 displays the flowchart that depicts this method.
The original signal undergoes decomposition through FMD. To improve the precision of modal decomposition, ALA is utilized to fine-tune three parameters of FMD: the quantity of modes, the length of the filter, and the number of cutoff frequency bands. The minimum Residual Energy Index (REI) is used as the criterion for selecting the optimal modal components. The chosen optimal modal components are then converted from time-domain signals to two-dimensional time–frequency representations via CWT, which are later input into the MSCA-RN model for diagnosing faults.
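The conversion from a one-dimensional component to a time–frequency image can be illustrated with a small continuous wavelet transform. The sketch assumes a complex Morlet wavelet with centre frequency parameter `w0 = 6` and direct convolution in NumPy; the paper does not state which wavelet or implementation was used, and the function name is hypothetical.

```python
import numpy as np

def morlet_cwt(signal, scales, w0=6.0):
    """Magnitude scalogram |CWT| with a complex Morlet wavelet; rows follow `scales`."""
    signal = np.asarray(signal, dtype=float)
    n = signal.size
    out = np.empty((len(scales), n))
    for k, s in enumerate(scales):
        half = int(min(4 * s, (n - 1) // 2))          # truncate the wavelet support
        t = np.arange(-half, half + 1) / s
        wavelet = np.pi ** -0.25 * np.exp(1j * w0 * t - 0.5 * t ** 2) / np.sqrt(s)
        # Correlation with the wavelet = convolution with its reversed conjugate.
        out[k] = np.abs(np.convolve(signal, np.conj(wavelet[::-1]), mode="same"))
    return out

# Example: a 100 Hz tone sampled at 1 kHz, analysed at 50, 100 and 200 Hz.
fs = 1000.0
t = np.arange(1000) / fs
tone = np.sin(2 * np.pi * 100 * t)
freqs = np.array([50.0, 100.0, 200.0])
scales = 6.0 * fs / (2 * np.pi * freqs)               # scale for each target frequency
scalogram = morlet_cwt(tone, scales)
```

In practice a much denser set of scales would be used, and the resulting matrix would be resized and normalized before being fed to the network as an image.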
3.1. ALA-FMD Method
The FMD method lacks the capability for parameter self-adaptation. Manual adjustments to parameters have been shown to exhibit poor stability and low efficiency. Specifically, an insufficient number of modes may result in the omission of critical fault information, while an excessive number of modes can introduce noise or redundant components [24]. Furthermore, an excessively short filter length (L) diminishes separation accuracy, whereas an overly long length increases the computational burden. Increasing the number of filters (K) enhances frequency resolution but simultaneously exacerbates computational complexity. If K is too small, important signal features may be overlooked, adversely affecting decomposition performance. Therefore, this paper employs ALA to optimize these three parameters, thereby improving both decomposition effectiveness and efficiency. The detailed steps of the ALA-optimized FMD are outlined as follows:
- (1) Initialize the population. Set the maximum ALA iteration count to 100 and the population size N to 40. Since three parameters need to be optimized, the dimension size Dim is set to 3. This paper constrains the number of modes n to [3, 15], the filter length L to [64, 512], and the number of frequency band divisions K to [2, 20], where K ≥ n.
- (2) Determine the fitness function. First decompose the signal using the FMD parameters associated with each candidate solution. Next, calculate the envelope entropy of the reconstructed signal, which acts as the fitness function used to identify the best parameter combination. Both n and K must be valid integers; if not, adjust them by truncating or rounding to the nearest acceptable value. Additionally, because the filter must meet linear phase requirements to avoid signal distortion, its coefficients should conform to either symmetric or antisymmetric rules, and this paper therefore restricts L to even numbers.
- (3) Evaluate the energy factor. Calculate its magnitude and determine whether to enter the exploration phase or the exploitation phase.
- (4) During the exploration phase, there is a 30% probability of expanding the search range by combining the current optimal solution with random individuals through a Brownian motion-based random perturbation. Otherwise, with a 70% probability, periodic adjustments towards the optimal solution are made based on the current position.
- (5) During the exploitation phase, there is a 50% probability of performing a spiral search around the optimal solution for fine parameter tuning, and an equal probability of employing Lévy flights to make local jumps, thereby preventing entrapment in local optima.
- (6) Evaluate and update. Assess the fitness of the new individuals; if an individual is superior to the current optimal solution, update the optimal solution accordingly.
- (7) If the maximum number of iterations has not been reached, repeat steps (2)–(6); otherwise, output the optimal combination of parameters.
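Step (2) can be sketched as follows. Because no specific FMD implementation is named in the text, the decomposition is passed in as a caller-supplied function `decompose(signal, n, L, K)`, which is a hypothetical interface; summing the returned modes to obtain the reconstructed signal and computing the envelope with an FFT-based Hilbert transform are likewise assumptions made for this example.

```python
import numpy as np

def envelope(x):
    """Amplitude envelope from the analytic signal (FFT-based Hilbert transform)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(np.fft.fft(x) * h))

def envelope_entropy(x):
    """Shannon entropy of the normalized envelope; lower values mean sparser impulses."""
    env = envelope(x)
    p = env / env.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def repair_parameters(candidate, bounds):
    """Map a continuous position to integer (n, L, K) with L even and K >= n.

    Assumes the lower bound of L is even, as in the range [64, 512].
    """
    (n_lo, n_hi), (l_lo, l_hi), (k_lo, k_hi) = bounds
    n = int(np.clip(round(float(candidate[0])), n_lo, n_hi))
    L = int(np.clip(round(float(candidate[1])), l_lo, l_hi))
    L -= L % 2                                   # force an even filter length
    K = int(np.clip(round(float(candidate[2])), k_lo, k_hi))
    K = max(K, n)                                # enforce K >= n
    return n, L, K

def fmd_fitness(candidate, signal, decompose, bounds):
    """Fitness of one candidate: envelope entropy of the reconstructed signal."""
    n, L, K = repair_parameters(candidate, bounds)
    modes = decompose(signal, n, L, K)           # hypothetical FMD routine
    return envelope_entropy(np.sum(modes, axis=0))
```

With the bounds from step (1), a candidate such as `[3.6, 101.3, 2.2]` is repaired to `(4, 100, 4)` before the decomposition is run, and the optimizer then minimizes `fmd_fitness` over the three-dimensional search space.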
3.2. MSCA-RN Fault Diagnosis Model
This paper proposes a multi-scale coordinate attention relation network (MSCA-RN) model that combines the ability of relation networks to mine image features with the ability of the multi-scale coordinate attention mechanism to enhance features across varying scales, thereby facilitating the rapid diagnosis of bearing defects. By deeply integrating multi-scale feature extraction with relation networks, this model addresses the representational limitations of traditional attention mechanisms, which are often restricted to single-scale analysis. It offers a solution characterized by global structural localization, local detail enhancement, and multi-scale relationship measurement for bearing fault diagnosis. This approach is particularly effective in complex scenarios that involve small sample sizes and variable operating conditions.
3.3. Process Introduction
The fault diagnosis process of the multi-scale coordinate attention relation network is illustrated in Figure 4.
In the signal processing stage, the original vibration signals collected by vibration sensors are initially processed using the ALA-FMD. The optimal modal components are then selected based on the minimum residual index. Subsequently, these signals are transformed into two-dimensional time–frequency domain representations through continuous wavelet transform to elucidate fault characteristics. The processed time–frequency domain signals are categorized into training sets (comprising support sets and query sets) and test sets in accordance with the meta-learning strategy. Among them, the REI is used to measure the noise residue in the modal components, and the calculation formula is
In the equation, $x(i)$ represents the original signal, and $\hat{x}(i)$ represents the reconstructed signal of the modal component. The smaller the REI value, the more effective energy is retained in the component and the less noise is present.
During the model training phase, the MSCA-RN employs a meta-learning episodic training strategy to enhance its generalization capability for few-shot faults through multi-task learning. Initially, the input time–frequency images undergo global position localization and local detail enhancement via the MSCA module. Subsequently, features are extracted through the convolutional blocks of the embedding module to obtain multi-scale fused features. The feature vectors from the support and query sets are concatenated and fed into the relation module, where a nonlinear metric function generates relation scores within the range [0, 1]. The optimization process uses Mean Squared Error (MSE) as the loss function, in conjunction with the Adam optimizer, which adapts the learning rate dynamically. Through backpropagation, the MSCA parameters, the convolutional kernel weights of the embedding module, and the fully connected layer parameters of the relation module are updated, gradually reducing the discrepancy between predicted scores and ground truth labels. Training is repeated over episodes until the model converges. This stage combines multi-scale feature enhancement with meta-learning, allowing the model to grasp both the overall structure and the specific details of fault features in scenarios with limited examples and varying operating conditions.
During the model testing phase, unlabeled bearing time–frequency diagrams are input into the trained network. The MSCA and embedding modules generate feature vectors, which are concatenated sequentially with the feature vectors of each fault category in the support set to form feature pairs. Subsequently, these pairs are processed through the relation module to compute similarity scores, aiding in the identification of fault types.
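The role of the relation scores in training and testing can be illustrated with a small NumPy sketch. It assumes that the relation module has already produced one score per (query sample, support class) pair and that the training target is 1 for the matching class and 0 otherwise; the function names are chosen for this example.

```python
import numpy as np

def mse_relation_loss(scores, query_labels, n_way):
    """MSE between relation scores of shape (n_query, n_way) and one-hot targets."""
    scores = np.asarray(scores, dtype=float)
    targets = np.eye(n_way)[np.asarray(query_labels)]   # 1 for the true class, else 0
    return float(np.mean((scores - targets) ** 2))

def predict_fault_type(scores):
    """Assign each query sample to the support class with the highest relation score."""
    return np.argmax(np.asarray(scores, dtype=float), axis=1)

# Example: two query samples in a 2-way task.
scores = [[0.9, 0.1],
          [0.2, 0.8]]
loss = mse_relation_loss(scores, query_labels=[0, 1], n_way=2)
predicted = predict_fault_type(scores)
```

During training the loss is minimized over many episodes; at test time only the arg-max step is needed to identify the fault type.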