Bearing Fault Diagnosis of Split Attention Network Based on Deep Subdomain Adaptation

Abstract: Traditional convolutional neural networks learn key fault features insufficiently, and the feature distribution of rolling-bearing vibration data collected under variable working conditions is inconsistent; both problems reduce bearing fault diagnosis accuracy. To address them, a deep subdomain adaptation split attention network (SPDSAN) is proposed for intelligent fault diagnosis of bearings. Firstly, the time-frequency diagram of a vibration signal is obtained by the continuous wavelet transform to show its time-frequency characteristics. Secondly, a residual split-attention network (ResNeSt) that integrates multi-path and channel attention mechanisms is constructed to extract the key features of rolling bearings and prevent feature loss. Then, a subdomain adaptation layer is added to ResNeSt to align the distributions of related subdomains by minimizing the local maximum mean discrepancy. Finally, the SPDSAN model is validated using the Case Western Reserve University datasets. The results show that the average diagnostic accuracy of the proposed method is 99.9% when the test set samples are not labeled, which is higher than that of other mainstream intelligent fault diagnosis models.


Introduction
Rotating machinery is widely used in aerospace, automobile manufacturing, wind power generation, and other important engineering fields. Rolling bearings are a key component of rotating machinery. Because this equipment often operates under complex working conditions, bearings are prone to pitting, breaking, gluing, and other failures, which can bring the equipment to a standstill and cause significant economic losses [1]. Statistical analyses from multiple studies have shown that more than 40% of equipment faults are related to bearings [2]. Thus, improving bearing fault diagnosis under variable working conditions matters for the stable operation of the whole equipment and production line.
The traditional fault diagnosis method determines the equipment health state by establishing a corresponding dynamic model. For instance, Ambrokiewicz et al. [3] considered not only the internal stiffness, damping, clearance, and other nonlinear characteristics of the bearing, but also the external load, eccentricity, and other factors affecting the normal operation of the bearing balls. A two-degree-of-freedom dynamic model of the ball bearing motion was established to reveal the dimensionless relationships and their influence on the system response. Huangfu et al. [4] extended the traditional loaded tooth contact analysis (LTCA) method to calculate the mesh stiffness and contact stress of spalled gear pairs, and established a novel dynamic model describing the dynamic response of the gear pair under different spall modes. Such methods rely heavily on researcher expertise, and a specific dynamic model must be established for each specific device, which greatly limits their applicability [5].
In addition, some scholars analyze the characteristic frequencies of faults. For instance, Arkadiusz et al. [6] used the Fourier transform and recurrence analysis to analyze the vibration signal during engine operation and could accurately locate the failed cylinder. To address the difficulty of diagnosing compound bearing faults, Wang et al. [7] first established a functional model of bearing vibration signals and then separated single-fault frequency feature sets through decoupling, thereby completing the diagnosis of compound faults. Almounajjed et al. [8] used the discrete wavelet transform to analyze the electrical signal of the motor in the time domain, remove interference components, and extract the more obvious fault characteristic frequencies. This kind of research mainly removes noise interference in order to extract the characteristic fault frequencies of bearings, but it requires a large number of labeled signal samples as well as manual feature selection. In addition, the above studies verified their methods under a single working condition only, and fault diagnosis under variable working conditions remains underexplored.
In recent years, machine learning has accelerated the development of intelligent fault diagnosis through technologies such as ensemble learning [9], support vector machines (SVM) [10], and artificial neural networks [11]. However, such fault diagnosis methods require additional processing of data features and cannot provide fast diagnostic services [12,13].
As another branch of machine learning, deep neural networks have been successfully applied to fault diagnosis thanks to their ability to extract features automatically [14]. As a representative deep neural network, the convolutional neural network (CNN) offers parameter sharing and translation invariance, which help it extract more robust features. Wang et al. [15] established a multi-scale convolutional neural network model that integrates feature extraction and pattern recognition for fault diagnosis. Further, Wu et al. [16] proposed solving the data imbalance problem by using a convolutional neural network with a min-max objective function. Next, Wang et al. [17] introduced the 1 × 1 convolution kernel and replaced the fully connected layer of the traditional convolutional neural network with global average pooling, aiming to reduce the number of trainable parameters. Although CNNs have achieved good results, they face the following two problems in bearing fault diagnosis: (1) The max pooling or average pooling used by a CNN merges information directly, so key information may no longer be identifiable. (2) The training set and the test set must follow the same probability distribution, but this assumption is difficult to satisfy because working conditions in practical engineering are complex and changeable. When the working condition of the equipment changes greatly, the recognition performance of a CNN model decreases significantly.
To solve problem (1), a split attention module is introduced into the network, and a multi-channel structure and attention mechanism are adopted to enrich the diversity of fault features, strengthen the connection between fault features, improve the network's learning of fault features, and avoid the loss of key fault features. To solve problem (2), some scholars introduce the idea of transfer learning (TL) [18]. TL can address cross-domain distribution differences and is widely used in fault diagnosis. For instance, Yang et al. [19] used a polynomial kernel-induced distance to measure the distribution difference between the source and target domains. Chen et al. [20] used an enhanced transfer convolutional network to resolve decision boundary confusion between the two domains. Further, Cheng et al. [21] introduced the adversarial idea and trained the classifier to confuse the sample features of the two domains in order to align them. However, such transfer learning methods consider only the distribution difference of the whole domain, not the distribution differences of related subdomains. As shown in Figure 1a, the data within each domain become confused after global adaptation, and fault samples with similar characteristics may be wrongly classified.
To mitigate these deficiencies, the authors propose a bearing fault diagnosis model based on the deep subdomain adaptation split attention network (SPDSAN). This method extracts the signal features through a residual network fused with multipath and channel attention mechanisms. Next, the local maximum mean discrepancy (LMMD) is used to align the distributions of related subdomains in the two domains, as shown in Figure 1b. The scientific contributions of this paper are as follows:
1. The vibration signal was transformed into a time-frequency graph by the continuous wavelet transform as the learning object of the network. Compared with the one-dimensional vibration signal, the time-frequency graph provides both the time-domain and frequency-domain characteristics of the fault, and it prevents the network from learning only single-dimensional characteristics, which would reduce the diagnostic accuracy.
2. The split attention module was introduced into the feature extraction network, and the multi-channel structure and attention mechanism were adopted to enrich the feature map diversity and improve the ability of the network to learn fault features.
3. LMMD was used to measure the difference between relevant subdomains in the source domain and target domain data, and the distributions of relevant subdomains under the same category were adjusted to capture the fine-grained information of each category, so as to achieve subdomain alignment.
4. The performance of the method was compared with several widely used intelligent bearing fault diagnosis methods, and its effectiveness was verified.

Related Works

Problem Description
Transfer learning applies the knowledge learned in the source domain as "experience" to the target domain, given that the source domain contains several labeled samples and meets the network training requirements. Samples in the target domain do not contain labels and are used as the final test set of the network. In this study, the source domain is the rolling bearing fault state signal collected under working condition A by laboratory simulation of bearing faults, and the target domain is the bearing fault state signal collected under working condition B. Let $D_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ be the data in the source domain, where $n_s$ is the number of fault samples in $D_s$, $x_i^s$ is the $i$-th sample in $D_s$, and $y_i^s$ is its corresponding fault label. Assume that the bearing health status has $C$ classes, so that $y_i^s \in \{1, 2, \ldots, C\}$ and $y_i^s = j$ indicates that the sample belongs to the $j$-th fault type. Further, $D_t = \{x_j^t\}_{j=1}^{n_t}$ is the target domain, $n_t$ is the number of fault samples in $D_t$, and $x_j^t$ is the $j$-th sample in $D_t$. The probability distributions of $D_s$ and $D_t$ are denoted as $P$ and $Q$, respectively. It should be noted that $P \neq Q$.

ResNeSt-Split Attention Network
CNNs have a strong ability to learn signal features. However, as the network deepens, the gradient of a CNN vanishes [22]. In 2015, He et al. [23] proposed the residual neural network (ResNet). Using the residual block structure combined with "shortcut connections", the output of a previous residual block can flow into the following block without obstruction. Thus, the vanishing-gradient problem caused by excessive network depth is avoided. The residual split-attention network (ResNeSt) used in this paper adds multiple split-attention (SA) block modules to ResNet. Moreover, ResNeSt combines the multipath structure with the channel attention mechanism, applying channel attention across feature-map groups and weighting the feature channels of different branches to generate the final feature map (see Figure 2).

Firstly, the input feature map was divided into $K$ branches, and each branch was subdivided into $R$ subgroups. Hence, the total number of feature-map groups was $G = K \times R$. Secondly, $1 \times 1$ and $3 \times 3$ convolution operations were carried out for each subgroup, and different weights were assigned to each subgroup through the SA module before the subgroups were finally aggregated. The feature map outputs obtained via aggregation and the residual module were combined linearly. Figure 3 shows the SA module.
In the SA module, the combined feature $\hat{U}^k$ of the $k$-th branch was obtained by element-wise summation and fusion of its $R$ subgroups:

$$\hat{U}^k = \sum_{j=R(k-1)+1}^{Rk} U_j, \qquad \hat{U}^k \in \mathbb{R}^{H \times W \times C/K} \tag{1}$$

where $U_j$ represents the $j$-th input feature in the SA module (the output feature after the 3 × 3 convolution shown in Figure 3), and $H$, $W$, and $C/K$ represent the length, width, and number of channels of each output feature map, respectively. The global information obtained by global pooling of the fused feature maps was calculated next:

$$r_c^k = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \hat{U}_c^k(i, j) \tag{2}$$

where $r_c^k$ represents the $c$-th channel value of the $1 \times 1 \times C/K$ feature map of $\hat{U}^k$ following the global pooling. Further, $\hat{U}_c^k(i, j)$ represents the value at pixel $(i, j)$ in the $c$-th channel of $\hat{U}^k$.
Next, the weight of each subgroup is calculated adaptively from $r^k$ through the fully connected layers:

$$a_i^k(c) = \begin{cases} \dfrac{\exp\left(G_i^c(r^k)\right)}{\sum_{j=1}^{R} \exp\left(G_j^c(r^k)\right)}, & R > 1 \\[2ex] \dfrac{1}{1 + \exp\left(-G_i^c(r^k)\right)}, & R = 1 \end{cases} \tag{3}$$

where $a_i^k(c)$ is the weight of the $i$-th subgroup and $G_i^c$ is the weight function composed of two fully connected layers and a ReLU activation function.
Therefore, the final weighted fusion feature $V^k \in \mathbb{R}^{H \times W \times C/K}$ is generated by multiplying the original feature of each subgroup by the weight of each channel. The output of the $c$-th channel is as follows:

$$V_c^k = \sum_{i=1}^{R} a_i^k(c)\, U_{R(k-1)+i} \tag{4}$$

where $V_c^k$ represents the weighted fusion feature of the $c$-th channel of the $k$-th branch and $U_{R(k-1)+i}$ represents the features of the $(R(k-1)+i)$-th subgroup.
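The SA module steps described above (fusion of the subgroups, global pooling, weight computation, and weighted fusion) can be sketched for a single branch in plain Python. This is only an illustration under stated assumptions, not the authors' implementation: feature maps are nested lists, and the learned weight function $G$ (two fully connected layers with ReLU in the paper) is replaced by a hypothetical per-subgroup linear score `gains[i] * pooled`, chosen only to keep the example self-contained.

```python
import math

def split_attention(subgroups, gains):
    """Fuse the R subgroup feature maps of one branch with channel-wise attention.

    subgroups: list of R feature maps, each indexed as [channel][row][col].
    gains: R hypothetical scalars standing in for the learned weight function G
           (an assumption; the paper uses two fully connected layers with ReLU).
    """
    R = len(subgroups)
    C = len(subgroups[0])
    H = len(subgroups[0][0])
    W = len(subgroups[0][0][0])

    fused = []
    for c in range(C):
        # Element-wise sum of the R subgroups followed by global average pooling.
        pooled = sum(
            subgroups[i][c][h][w]
            for i in range(R) for h in range(H) for w in range(W)
        ) / (H * W)

        # Attention weights: softmax over subgroups if R > 1, sigmoid if R == 1.
        scores = [gains[i] * pooled for i in range(R)]
        if R > 1:
            top = max(scores)  # subtract the max for numerical stability
            exps = [math.exp(s - top) for s in scores]
            total = sum(exps)
            weights = [e / total for e in exps]
        else:
            weights = [1.0 / (1.0 + math.exp(-scores[0]))]

        # Weighted fusion of the original subgroup features for channel c.
        fused.append([
            [sum(weights[i] * subgroups[i][c][h][w] for i in range(R))
             for w in range(W)]
            for h in range(H)
        ])
    return fused
```

With equal gains the softmax weights are uniform, so the output reduces to the mean of the subgroup features; unequal gains shift the weight toward one subgroup.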

Subdomain Adaptation
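Maximum mean discrepancy (MMD) is widely used to evaluate the distribution difference between $D_s$ and $D_t$ [24]. However, using MMD to align the global distribution ignores the relationship between the relevant subdomains of the source and target domains, losing the fine-grained information of each subclass. As such, it usually causes data confusion between both domains. Therefore, this paper introduces LMMD to align the distributions of the relevant subdomains. LMMD is expressed as:

$$d_{\mathcal{H}}(P, Q) \triangleq E_c \left\| E_{P^{(c)}}\left[\phi(x^s)\right] - E_{Q^{(c)}}\left[\phi(x^t)\right] \right\|_{\mathcal{H}}^2 \tag{5}$$

where $x^s$ and $x^t$ represent the sample instances in $D_s$ and $D_t$, respectively, $P$ and $Q$ are the distributions followed by these domains (with $P^{(c)}$ and $Q^{(c)}$ their restrictions to class $c$), $\mathcal{H}$ is the reproducing kernel Hilbert space (RKHS), $\phi(\cdot)$ is the mapping function, and $E(\cdot)$ represents the mathematical expectation over the subclass. Introducing a weight for each sample, the LMMD can be estimated as follows:

$$\hat{d}_{\mathcal{H}}(P, Q) = \frac{1}{C} \sum_{c=1}^{C} \left\| \sum_{x_i^s \in D_s} \omega_i^{sc} \phi(x_i^s) - \sum_{x_j^t \in D_t} \omega_j^{tc} \phi(x_j^t) \right\|_{\mathcal{H}}^2 \tag{6}$$

$$\omega_i^c = \frac{y_{ic}}{\sum_{(x_j, y_j) \in D} y_{jc}} \tag{7}$$

where $\omega_i^{sc}$ and $\omega_j^{tc}$ are the weights of $x_i^s$ and $x_j^t$ belonging to subclass $c$, respectively. In this study, one-hot coding was used to calculate the weight $\omega$ of each source sample belonging to its class. Further, $y_{ic}$ is the $c$-th element of the label vector $y_i$, representing the probability that the sample belongs to class $c$. For target domain samples without labels, the pseudo-label $\hat{y}_j^t$ output by the SoftMax was used to calculate the weight $\omega_j^{tc}$ of sample $x_j^t$ belonging to class $c$. Finally, the SPDSAN generates activations in layer $l$, namely $\{z_i^{sl}\}_{i=1}^{n_s}$ and $\{z_j^{tl}\}_{j=1}^{n_t}$, to achieve deep network adaptation. Therefore, the subdomain adaptation function is:

$$\hat{d}_l(P, Q) = \frac{1}{C} \sum_{c=1}^{C} \left[ \sum_{i=1}^{n_s} \sum_{j=1}^{n_s} \omega_i^{sc} \omega_j^{sc} k\left(z_i^{sl}, z_j^{sl}\right) + \sum_{i=1}^{n_t} \sum_{j=1}^{n_t} \omega_i^{tc} \omega_j^{tc} k\left(z_i^{tl}, z_j^{tl}\right) - 2 \sum_{i=1}^{n_s} \sum_{j=1}^{n_t} \omega_i^{sc} \omega_j^{tc} k\left(z_i^{sl}, z_j^{tl}\right) \right] \tag{8}$$

where $z^l$ denotes the $l$-th layer activation ($l \in L = \{1, 2, \ldots, |L|\}$) and $k(\cdot, \cdot) = \langle \phi(\cdot), \phi(\cdot) \rangle$ is the kernel associated with $\phi$.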
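As a rough illustration of the subdomain adaptation term, the following sketch computes the class weights from label vectors and then evaluates an LMMD estimate with the kernel trick. It makes two assumptions that are not taken from the paper: a Gaussian kernel with a hypothetical bandwidth `sigma`, and plain Python lists in place of network activations.

```python
import math

def class_weights(label_probs):
    """w[i][c] = y_ic / sum_j y_jc; a class with zero total mass gets weight 0."""
    n, num_classes = len(label_probs), len(label_probs[0])
    col_sums = [sum(label_probs[i][c] for i in range(n)) for c in range(num_classes)]
    return [
        [label_probs[i][c] / col_sums[c] if col_sums[c] > 0 else 0.0
         for c in range(num_classes)]
        for i in range(n)
    ]

def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian kernel (an assumed choice; the source does not name the kernel)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def lmmd(zs, ys, zt, yt, sigma=1.0):
    """LMMD estimate between source features zs and target features zt.

    ys: one-hot source labels; yt: SoftMax pseudo-labels of the target samples.
    """
    ws, wt = class_weights(ys), class_weights(yt)
    num_classes = len(ys[0])
    total = 0.0
    for c in range(num_classes):
        ss = sum(ws[i][c] * ws[j][c] * gaussian_kernel(zs[i], zs[j], sigma)
                 for i in range(len(zs)) for j in range(len(zs)))
        tt = sum(wt[i][c] * wt[j][c] * gaussian_kernel(zt[i], zt[j], sigma)
                 for i in range(len(zt)) for j in range(len(zt)))
        st = sum(ws[i][c] * wt[j][c] * gaussian_kernel(zs[i], zt[j], sigma)
                 for i in range(len(zs)) for j in range(len(zt)))
        total += ss + tt - 2.0 * st
    return total / num_classes
```

When the two domains have identical class-conditional features the estimate is zero, and it grows as the per-class feature clouds move apart, which is the quantity the adaptation layer tries to minimize.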

The SPDSAN Diagnostic Process
The SPDSAN model process proposed in this paper is shown in Figure 4. It includes time-frequency image generation, the domain-shared feature extractor, subdomain adaptation, and fault classification.
The diagnostic process was carried out as follows. Firstly, the vibration signal was transformed into a time-frequency image as the network learning object. Secondly, ResNeSt-50 was used to extract image features. By assigning different weights to channels, the channel attention mechanism integrated in ResNeSt-50 increased the weight of fault features, thus enabling the network to learn more about the fault features in the samples. To reduce the training time and accelerate model convergence, the ResNeSt-50 model was pre-trained on the ImageNet 2012 data for general feature extraction. Then, the LMMD was used to measure the distribution differences of related subdomains in the subdomain adaptation layer, and it served as the optimization target LossB of the SPDSAN model. Finally, the error between the true label $y_i^s$ of $D_s$ and the classifier-predicted label $\hat{y}_i^s$ was taken as the optimization objective LossA.
In sum, the training goal of the SPDSAN is to minimize LossA and LossB to achieve higher diagnostic accuracy in the final diagnosis.

Target Optimization
The SPDSAN extracts domain-transferable feature representations through deep feature representation learning and local maximum mean discrepancy learning. There are two optimization objectives in the training process:
1. Minimize LossA, the difference between the real and predicted labels of the source domain samples. This increases the classifier accuracy when diagnosing source domain samples.
2. Minimize LossB, the LMMD between the source and target domains.
LossA can be expressed as:

$$\text{LossA} = \frac{1}{n_s} \sum_{i=1}^{n_s} J_s\left(f(x_i^s), y_i^s\right) \tag{9}$$

where $f(\cdot)$ is the predicted output of the source domain samples on the classifier and $J_s$ is the cross-entropy loss function.
The final optimization objective $J$ can be calculated as follows:

$$J = \text{LossA} + \alpha \cdot \text{LossB} \tag{10}$$

where $\text{LossB} = \hat{d}_{\mathcal{H}}(P, Q)$ and $\alpha$ is the trade-off parameter between the domain adaptation loss and the classifier loss.
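The way the two losses combine can be shown with a small numeric sketch. It assumes the classifier outputs are already SoftMax probabilities and that LossB has been computed separately; the value of the trade-off parameter used in the usage note is hypothetical, since the text above does not state one.

```python
import math

def cross_entropy(probs, label):
    """Cross-entropy of one sample given class probabilities and the true class index."""
    return -math.log(probs[label])

def total_objective(source_probs, source_labels, loss_b, alpha):
    """J = LossA + alpha * LossB, with LossA the mean source cross-entropy."""
    loss_a = sum(
        cross_entropy(p, y) for p, y in zip(source_probs, source_labels)
    ) / len(source_labels)
    return loss_a + alpha * loss_b
```

For example, `total_objective([[0.5, 0.5]], [0], loss_b=0.2, alpha=0.5)` adds half of the adaptation loss to a source cross-entropy of ln 2.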

Experiment and Analysis
All the presented experiments were completed using an i7-9700K processor, 128 GB of memory, an RTX 3070 Ti graphics card, and the Windows 10 operating system, with PyTorch as the code framework. The batch size was 16, and the stochastic gradient descent algorithm was used for training. The momentum was 0.9, and the learning rate was $\eta_\theta = 0.01/(1 + \alpha\theta)^{\beta}$, where $\alpha = 10$, $\beta = 0.75$, and $\theta$ changed linearly from 0 to 1 during the training process [25].
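The learning rate schedule above is easy to check numerically. The helper below is a minimal sketch; the function name and the range check are illustrative choices, while the constants come from the text.

```python
def learning_rate(theta, eta0=0.01, alpha=10.0, beta=0.75):
    """Annealed learning rate eta0 / (1 + alpha * theta) ** beta.

    theta is the training progress, moving linearly from 0 (start) to 1 (end).
    """
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    return eta0 / (1.0 + alpha * theta) ** beta
```

The rate starts at 0.01 and decays monotonically as training progresses, so later updates are smaller than early ones.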

Introduction to the Fault Datasets
The experimental data in this paper were collected from the bearing fault datasets of Case Western Reserve University (CWRU) [26]; the bearing model is SKF6205. The fault data acquisition test bench is shown in Figure 5. In this study, four datasets (0, 1, 2, and 3 HP) with different working conditions were generated to simulate the transfer learning tasks.

Appl. Sci. 2022, 12, 12762

Build Experimental Datasets
The continuous wavelet transform (CWT) converts one-dimensional vibration signals into time-frequency graphs. Time-frequency graphs contain both the time-domain and frequency-domain features of faults, which prevents the network from learning only a single kind of feature and thereby degrading diagnostic accuracy. Therefore, in this study, the CWT was used to convert the vibration signal into a two-dimensional time-frequency image as the input of the network [27].
Firstly, in order to expand the number of samples, an overlapping sampling technique was employed in each health condition dataset [28]. As shown in Figure 6, the original vibration signal is sliced with a window of 1024 points, so each data sample contains 1024 points. Then, CWT was used to convert the selected 1024 points into a time-frequency image with a size of 256 × 256; the wavelet basis was cmor3-3, and the scale sequence length was 64. The next 1024 consecutive sampling points were then selected by overlapping sampling to generate another time-frequency image, with adjacent samples overlapping by 500 points.
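The overlapping sampling step can be sketched as follows. The window length and overlap come from the text; the function name is illustrative, and the time-frequency conversion itself is left out of this sketch.

```python
def overlapping_windows(signal, window=1024, overlap=500):
    """Slice a 1-D signal into fixed-length windows that share `overlap` points."""
    step = window - overlap  # 524 points between consecutive window starts
    if step <= 0:
        raise ValueError("overlap must be smaller than the window length")
    return [
        signal[start:start + window]
        for start in range(0, len(signal) - window + 1, step)
    ]
```

With a 1024-point window and a 500-point overlap, consecutive windows start 524 points apart, so a recording yields roughly twice as many samples as non-overlapping slicing would.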

The number of obtained health status samples is shown in Table 2. The 10 health states under four different loads were alternately used as $D_s$ and $D_t$ for learning and transfer. Samples in $D_t$ were not labeled during the network learning process, so as to represent different fault samples under unknown working conditions. For example, in the task "Working condition 0-1", 10 types of health status data under 0 HP load were used as source domain samples, and another 10 under 1 HP load were used as target domain samples for the transfer learning task.
Combining Table 3 and Figure 7 shows that the DAN, which adopts global alignment, has an average diagnostic accuracy of 91.5%. However, the diagnostic accuracy of each task fluctuates greatly, especially for task 2-0 (only 88%). The reason for such behavior is that global adaptation aims to align the overall distributions of $D_s$ and $D_t$; thus, the correlation between subdomains is ignored. The DAAN has the weakest recognition effect, with an average diagnostic accuracy of 91.6%. Its recognition accuracy fluctuates greatly, and its robustness is the lowest. This is because its global alignment is based on adversarial learning, which requires large sample sets of $D_s$ and $D_t$ to confuse the domain discriminator so that it cannot judge the domain label of a sample. Therefore, using too few samples is the main reason for its low diagnostic accuracy.
The MRAN makes up for the DAN defect by extracting multi-representation features of sample images and aligning them in different feature spaces. However, the diagnostic accuracy in task 1-0 is only 94%. The primary reason is that when the samples of $D_s$ and $D_t$ have high similarity, the extracted multi-representation features are more similar. Hence, they cannot be classified correctly, yielding an average diagnostic accuracy of 98.8%.
Under the premise of using the subdomain transfer method, the average diagnostic accuracy of RDSAN, which uses ResNet-50 as a feature extractor, is 98.8%. This value is lower than the 99.9% obtained for the SPDSAN, indicating that the SPDSAN has a stronger ability to learn fault features.
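The transfer tasks compared above pair every load condition as a source with every other load condition as a target. A short sketch of that enumeration, using the "source-target" task naming from the text (the function and constant names are illustrative):

```python
from itertools import permutations

LOADS_HP = (0, 1, 2, 3)  # the four working conditions, in horsepower

def transfer_tasks(loads=LOADS_HP):
    """List every ordered source-target pair of distinct loads as 'source-target'."""
    return [f"{source}-{target}" for source, target in permutations(loads, 2)]
```

Four loads give 4 × 3 = 12 ordered pairs, so tasks such as 2-0 and 0-2 are distinct experiments with the roles of the two domains swapped.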

Feature Visualization
In this paper, the t-distributed stochastic neighbor embedding (t-SNE) algorithm [32] was used to visualize the $D_t$ data features of the 12 transfer learning tasks and present them as scatter plots, as shown in Figures 8 and 9.
Each subfigure in Figures 8 and 9 contains 10 health states, and each combination of shape and color represents one health state. The clustering results show that each health state is well clustered, with distinct regional characteristics, through the adaptive alignment of subdomains. However, in some of the transfer tasks, the 0.5334 mm ball fault was mistakenly assigned to other health states. This may be because the signal characteristics of this fault are similar to those of other health states in some periods. Hence, the diagnostic accuracy of these transfer tasks is slightly lower than that of the others.

Conclusions
To address the inconsistency in feature extraction network problems when applied to bearing fault diagnosis, such as insufficient ability to learn fault features and the characteristic distribution of vibration data of rolling bearing collected under variable working conditions, the authors proposed the SPDSAN diagnostic model.The experimental results have shown that the proposed model has higher robustness and diagnostic accuracy compared to other methods.Based on the results, the following conclusions can be made:

• Compared with ResNet, ResNeSt, which integrates multi-path and split-attention mechanisms, can more fully learn the transferable fault features in samples. This facilitates subsequent transfer learning tasks.

• In the domain adaptation layer, the subdomain alignment method is used to reduce the distribution difference between the source domain (D_s) and the target domain (D_t), and to reduce the misdiagnosis that arises when global alignment leaves the subdomains too close together. Therefore, the network only needs to be trained with samples from one working condition to complete fault diagnosis under all working conditions.

• The comparison and analysis of different experimental results show that the proposed method has good generalization and robustness.
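
The subdomain alignment summarized in the second conclusion can be illustrated with a minimal sketch of a local maximum mean discrepancy (LMMD) estimate. This is an illustration under stated assumptions, not the authors' implementation: the function names (`gaussian_kernel`, `class_weights`, `lmmd`), the single-bandwidth Gaussian kernel, and the toy feature vectors are all hypothetical. Source samples are weighted by one-hot labels, and target samples would normally be weighted by the network's predicted class probabilities.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two feature vectors (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def class_weights(memberships):
    """Normalize each class column so that its weights sum to 1."""
    n_classes = len(memberships[0])
    totals = [sum(row[c] for row in memberships) for c in range(n_classes)]
    return [
        [row[c] / totals[c] if totals[c] > 0 else 0.0 for c in range(n_classes)]
        for row in memberships
    ]


def weighted_kernel_sum(xs, wx, ys, wy, c, sigma):
    """Sum of class-c weighted kernel values over all pairs from xs and ys."""
    return sum(
        wx[i][c] * wy[j][c] * gaussian_kernel(xs[i], ys[j], sigma)
        for i in range(len(xs))
        for j in range(len(ys))
    )


def lmmd(src_feats, src_onehot, tgt_feats, tgt_probs, sigma=1.0):
    """Average, over classes, of the squared weighted MMD between subdomains."""
    ws = class_weights(src_onehot)
    wt = class_weights(tgt_probs)
    n_classes = len(src_onehot[0])
    total = 0.0
    for c in range(n_classes):
        ss = weighted_kernel_sum(src_feats, ws, src_feats, ws, c, sigma)
        tt = weighted_kernel_sum(tgt_feats, wt, tgt_feats, wt, c, sigma)
        st = weighted_kernel_sum(src_feats, ws, tgt_feats, wt, c, sigma)
        total += ss + tt - 2.0 * st
    return total / n_classes


# Toy data: two classes, two samples each (hypothetical values).
src = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
labels = [[1, 0], [1, 0], [0, 1], [0, 1]]
shifted = [[a + 1.0, b + 1.0] for a, b in src]

aligned_gap = lmmd(src, labels, src, labels)      # identical subdomains
shifted_gap = lmmd(src, labels, shifted, labels)  # target shifted by +1
```

In this toy case the estimate vanishes when the class-wise distributions coincide and grows when the target subdomains are shifted, which is the quantity a subdomain adaptation layer would be trained to minimize.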
Author Contributions: Conceptualization, H.W. and L.P.; formal analysis, H.W.; writing-original draft preparation, L.P.; writing-review and editing, L.P.; project administration, L.P.; funding acquisition, H.W. All authors have read and agreed to the published version of the manuscript.

Figure 1. Results of different methods on the domain adaptation problem. (a) Target domain adaptation based on the global domain adaptation method; (b) target domain adaptation based on the related subdomain adaptation method.

Figure 4. The structure diagram of the SPDSAN network model.

linearly changed from 0 to 1 during the training process [25].

Figure 5. Data acquisition test bench.

The bearing dataset contains inner ring faults (IF), ball faults (BF), and outer ring faults (OF) simulated by electric discharge machining; the sampling frequency is 12 kHz. Each fault type contains three signals of different fault sizes (0.1778 mm, 0.3556 mm, 0.5334 mm). The details are shown in Table 1. In this study, four datasets (0, 1, 2, 3 HP) with different working conditions were generated to simulate the transfer learning tasks.
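
As a small illustration of how the four load conditions yield the transfer tasks, the sketch below enumerates every ordered source-to-target pair of distinct loads. It assumes that the 12 transfer learning tasks are exactly these ordered pairs and that the 10 health states consist of the nine fault classes plus one normal state. Both are inferences from the counts given in the text, and the names `LOADS_HP`, `transfer_tasks`, and `health_states` are hypothetical.

```python
from itertools import permutations

LOADS_HP = [0, 1, 2, 3]                # motor loads of the four datasets
FAULT_TYPES = ["IF", "BF", "OF"]       # inner ring, ball, outer ring
FAULT_SIZES_MM = [0.1778, 0.3556, 0.5334]


def transfer_tasks(loads):
    """Return all ordered (source, target) pairs of distinct load conditions."""
    return list(permutations(loads, 2))


def health_states(fault_types, fault_sizes):
    """Nine fault classes plus an assumed normal state."""
    states = ["Normal"]
    states += [f"{kind}-{size}" for kind in fault_types for size in fault_sizes]
    return states


tasks = transfer_tasks(LOADS_HP)
states = health_states(FAULT_TYPES, FAULT_SIZES_MM)

for source, target in tasks:
    print(f"train on {source} HP (labeled) -> test on {target} HP (unlabeled)")
```

Under these assumptions, each task trains on labeled samples from one load and evaluates on unlabeled samples from another.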

Figure 7. Diagnostic accuracy of the CWRU dataset.

Table 1. Details of the CWRU dataset.
