1. Introduction
Bearings, as critical components within real-world mechanical systems, play a vital role in the reliability and functionality of machinery [1]. Consequently, bearing condition monitoring (BCM) and bearing fault diagnosis (BFD) are essential for preventing unexpected breakdowns and ensuring operational efficiency. Recently, data-driven methods for BCM and BFD have attracted significant attention due to their capability to effectively identify the health state of equipment. They mainly consist of machine learning algorithms and signal processing-based methods. Among these, machine learning (ML) [2] algorithms have emerged as a powerful tool, establishing mapping models between measured sensor data and the operational states of machinery, ultimately enabling accurate health state diagnosis [3]. Machine learning encompasses a variety of algorithms, such as K-nearest neighbors [4], Bayesian learning [5], deep learning [6] and meta-learning [7], that can be employed to analyze sensor data and predict fault conditions. Unlike machine learning algorithms, signal processing-based methods depend on domain-specific prior knowledge to guide feature extraction. For instance, Antoni [8] provided quantifiable measures to characterize signal deviations from healthy states, enabling the early detection of incipient faults through systematic feature quantification. The blind deconvolution techniques proposed by Chen et al. [9] effectively suppress background noise and enhance fault-related impulses. Additionally, Chen et al. [10] developed a signal decomposition method combined with cyclic spectral correlation, making it particularly effective for diagnosing faults under varying operating conditions. These signal processing methods complement data-driven machine learning approaches, forming a comprehensive toolkit for addressing the challenges in real-world condition monitoring and fault diagnosis scenarios. However, most data-driven algorithms rely on sufficient datasets; insufficient samples can hinder the effective training of ML models and severely limit their effectiveness [11]. In addition, machines often operate under varying conditions in real-world scenarios, which causes data distribution differences between training and testing data.
Technically, these challenges can be addressed by cross-domain learning approaches, such as statistical transfer learning (STL), joint learning and adversarial task augmentation [12]. Among them, STL provides a promising solution due to its simplicity and efficiency [13]. STL belongs to transfer learning (TL), which acquires knowledge from other relevant fields to overcome the drawbacks of traditional ML-based models. What distinguishes STL is its focus on leveraging source-domain-specific statistical knowledge to boost the performance of target tasks within a statistical modelling framework. Benefitting from both the parameter-learning ability of statistical models and the knowledge-transfer ability of transfer learning, STL can extract relatively more global transferable knowledge [14], thereby enabling excellent knowledge transfer and better generalization for BCM and BFD under changing operating conditions.
Recently, this method has attracted the attention of many scholars. For example, Hu et al. [15] developed a transfer learning strategy based on balanced adaptation regularization for unsupervised cross-domain BFD. Ganin et al. [16] introduced a domain adaptation framework that utilizes adversarial training to minimize the inter-domain distribution gap, and later [17] introduced an adversarial domain classifier to jointly optimize the classification task and domain adaptation. However, the nonlinearity of vibration data increases the complexity of the model-training process to some extent.
In light of this, kernel transfer learning (KTL), a kind of STL approach, has emerged as a particularly effective method. KTL utilizes kernel functions to project data onto higher-dimensional feature spaces, allowing for a more nuanced comprehension of the data's fundamental patterns [18]. Additionally, by leveraging the kernel trick and sparse representations, KTL can address the computational challenges arising from nonlinear data. Theoretically, KTL expands the application scope of STL techniques to more complex scenarios. Plenty of KTL strategies have emerged in recent years, covering two major aspects. (1) Optimization and selection of the kernel function. For example, Cao et al. [19] applied a single parameter to represent the relationship between the two domains in a standard two-domain transfer learning scenario. Wagle et al. [20] further extended this approach by differentiating between the covariances of the two domains. Wei et al. [21] generalized a transfer kernel to solve multi-source-domain tasks by setting up a similarity coefficient for each pair of source-target domains. (2) Optimization of domain alignment. For example, Wei et al. [22] developed a multi-kernel learning approach that integrates multiple kernel functions to facilitate parallel processing. Zheng et al. [23] proposed an adaptive two-stage Gaussian process regression model using a matching kernel. Lu et al. [24] utilized sample correlation and learnable kernels to realize domain-adaptive learning. Wu et al. [25] applied an adaptive deep transfer learning method for domain-adaptive learning. Additionally, recent advancements in zero-faulty or limited-fault data scenarios deserve attention. For example, emerging techniques such as the shrinkage mamba relation network (SMRN) [26] leverage out-of-distribution data augmentation to enable fault detection and localization in rotating machinery under zero-faulty data regimes. Such methods address the challenge of absent fault labels by generating synthetic anomalies, bridging gaps caused by data scarcity. However, most existing studies only consider and quantify inter-domain differences by adopting a single parameter ranging from 0 to 1 for kernel optimization, which limits the model's ability to capture global distributional differences.
In order to tackle the aforementioned challenges, this article proposes a novel transfer kernel enabled kernel extreme learning machine (TK-KELM) model for BCM and BFD. Firstly, a parallel structure for the pre-training of the model is designed to more fully represent the state and changes of vibration signals. Then, the intra-domain differences are incorporated into the optimization process, thereby releasing the weight parameters that describe the inter-domain relationships and enabling them to break the limitation of the original range from 0 to 1. Subsequently, the transfer kernels undergo optimization through a similarity-guided feature-directed transfer kernel optimization strategy (SFTKOS), which fine-tunes model parameters based on domain similarities across different feature scales. Additionally, an integrated framework that combines functional principal component analysis with maximum mean discrepancy (FPCA-MMD) is introduced to derive multi-scale domain-invariant degradation indicators, enhancing the model's overall robustness. Finally, the gradient iterative partitioning (GIP) algorithm and the support vector machine (SVM) are applied for BCM and BFD, respectively, demonstrating the effectiveness of this methodology. The key contributions of this study are outlined as follows:
A new mechanism for KTL is proposed to capture both intra-domain and inter-domain differences, which breaks the limitation of the original parameter range (from 0 to 1) in the conventional KTL framework and shows strong adaptability for cross-domain transfer learning.
A similarity-guided feature-directed transfer kernel optimization strategy is designed under the framework of the parallel structure and MMD to optimize kernel parameters, which is beneficial for fully mining domain-invariant features and improving cross-domain learning performance.
A novel transfer kernel enabled kernel extreme learning machine (TK-KELM) model is proposed, which is beneficial for boosting the adaptability of the model for BCM and BFD.
The organization of this article is outlined below. Section 2 delves into the related knowledge essential for understanding our work. Section 3 introduces the proposed method, detailing its components and theoretical underpinnings. Section 4 presents two case studies to illustrate the practical application and efficacy of the proposed approach. Finally, Section 5 concludes the article.
3. Proposed Method
Figure 1 presents the overall block diagram of the proposed methodology. The proposed method mainly includes three parts: (1) capturing global invariant characteristics, (2) training the model and (3) condition monitoring and fault diagnosis. Firstly, diverse sets of time-domain characteristics are extracted from the source domain, the domain correlation is captured by employing the kernel function, and the parameters are then optimized through SFTKOS. Secondly, the FPCA-MMD method is utilized to construct the bearing degradation index for model training and prediction. Finally, given the optimized transfer kernels as well as the domain-invariant indexes, BCM and BFD are realized by solving the problem of distribution differences caused by variable working conditions.
3.1. Capturing Global Invariant Characteristics
In STL, one of the challenges is quantifying the relationship between inter-domain data sets, particularly when the target domain has limited annotated data. This paper combines feature transfer and Bayesian methods to address this issue. First, the parameters $K_{SS}$, $K_{TT}$ and $K_{ST}$ are defined as the covariance matrices of $X_S$ with $X_S$, $X_T$ with $X_T$ and $X_S$ with $X_T$, respectively. The parameter $\lambda$ is then designed to represent the similarity between $X_S$ and $X_T$. For the feature transfer method, the characteristics matrix of the target domain will be $\lambda K_{ST}$, where $\lambda$ is in the range $[-1, 1]$. However, in the Bayesian approach, the relation between $X_S$ and $X_T$ is viewed probabilistically. In this case, $\lambda$ is bounded within $[0, 1]$, representing the probability associated with this relationship. Consequently, $\lambda$ should become a free parameter rather than a fixed probability distribution.
According to the literature [24], it is assumed that the prior probability for different similarities is uniform, which allows $\lambda$ to remain free while still providing a probabilistic interpretation. An additional parameter, $\beta$, is introduced to refine the posterior understanding of the source domain $X_S$ and the target domain $X_T$, thus further relaxing the constraints on the parameter $\lambda$.
As a result, the proposed transfer kernel can be expressed as follows:

$$\widetilde{K} = \begin{bmatrix} K_{SS} & \lambda K_{ST} \\ \lambda K_{TS} & \beta K_{TT} \end{bmatrix}$$

where $\lambda \in \mathbb{R}$, $\beta \geq \lambda^{2}$ and $k(\cdot,\cdot)$ represents the well-known radial basis function (RBF) kernel. This transfer kernel leverages shared parameters for both domains and incorporates prior information from the source set. The parameter $\lambda$ controls the inter-domain similarity and is no longer confined to the conventional $[0, 1]$ interval, allowing for a broader range of similarities. The parameter $\beta$ represents the different kernel functions of the two domains, offering them distinct posterior distributions. Moreover, the newly proposed transfer kernel must also meet the requirement of positive semi-definiteness (PSD). Suppose that $K = \begin{bmatrix} K_{SS} & K_{ST} \\ K_{TS} & K_{TT} \end{bmatrix}$ is a PSD matrix, where $K_{TS} = K_{ST}^{\top}$; then, for $\lambda \in \mathbb{R}$ and $\beta \geq \lambda^{2}$, the matrix $\widetilde{K}$ also satisfies PSD.
Proof. Based on the properties of PSD matrices, the following properties hold: (a) If there is a nonzero matrix $B$ of compatible dimensions, then for any PSD matrix $A$, the matrix $B^{\top} A B$ satisfies PSD. (b) Assuming the matrices $A_1$ and $A_2$ are both PSD, then $A_1 + A_2$ will be PSD. (c) Supposing that the matrix $A$ is PSD, then for any positive number $c$, $cA$ should be PSD as well.

The proof proceeds as follows:

$$\widetilde{K} = \begin{bmatrix} K_{SS} & \lambda K_{ST} \\ \lambda K_{TS} & \beta K_{TT} \end{bmatrix} = B^{\top} K B + \begin{bmatrix} 0 & 0 \\ 0 & (\beta - \lambda^{2}) K_{TT} \end{bmatrix}$$

where

$$B = \begin{bmatrix} I & 0 \\ 0 & \lambda I \end{bmatrix}, \qquad K = \begin{bmatrix} K_{SS} & K_{ST} \\ K_{TS} & K_{TT} \end{bmatrix}$$

where $K$ consists of kernel functions, thus is PSD, and the matrix $B$ is nonzero. Using property (a), the matrix $B^{\top} K B$ is PSD. If $\beta \geq \lambda^{2}$, then according to property (c), the matrix $\operatorname{diag}\!\left(0, (\beta - \lambda^{2}) K_{TT}\right)$ meets PSD. Therefore, based on property (b), the matrix $\widetilde{K}$ is PSD as well. Additionally, when $\beta = \lambda^{2}$, then $\widetilde{K} = B^{\top} K B$; thus, $\widetilde{K}$ will still be PSD. □
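For illustration, a minimal Python sketch of this construction is given below; it assembles the block transfer kernel from RBF sub-blocks and numerically checks positive semi-definiteness for a similarity parameter outside $[0, 1]$. The helper names and parameter values are illustrative rather than part of the proposed method.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """RBF kernel matrix: k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.clip(sq, 0.0, None))

def transfer_kernel(Xs, Xt, lam, beta, gamma=1.0):
    """Block transfer kernel [[K_SS, lam*K_ST], [lam*K_TS, beta*K_TT]];
    PSD is guaranteed whenever beta >= lam**2 (see the proof above)."""
    K_ss = rbf_kernel(Xs, Xs, gamma)
    K_st = rbf_kernel(Xs, Xt, gamma)
    K_tt = rbf_kernel(Xt, Xt, gamma)
    return np.block([[K_ss, lam * K_st],
                     [lam * K_st.T, beta * K_tt]])

# Numerical PSD check with a similarity parameter outside [0, 1]
rng = np.random.default_rng(0)
Xs, Xt = rng.normal(size=(40, 8)), rng.normal(size=(25, 8))
K = transfer_kernel(Xs, Xt, lam=1.7, beta=1.7**2)
print(np.linalg.eigvalsh(K).min() >= -1e-9)  # True: PSD holds
```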
However, although the transfer kernel has been proven to be PSD, it is often not effective when applied directly to the KELM. Therefore, it is necessary to optimize the transfer kernel to further improve inter-domain feature matching. Thereby, SFTKOS, as shown in Figure 2, is proposed. First of all, a parallel structure is designed to support multi-scale feature extraction and more fully characterize the state of vibration signals. Then, the correlation of the inter-domain data sets is calculated according to the transfer kernel. Furthermore, MMD minimization is employed to optimize the transfer kernel and obtain the values of the parameters $\lambda$ and $\beta$. Finally, the optimized kernel function is computed by a weighted fusion mechanism according to the minimized MMD values.
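The section does not specify SFTKOS at code level; the sketch below shows one plausible realization under simplifying assumptions, reusing the transfer_kernel helper from the previous sketch. For each kernel scale, $(\lambda, \beta)$ is chosen by grid search to minimize the biased empirical MMD, and the per-scale kernels are fused with weights inversely proportional to their minimized MMD.

```python
import numpy as np

def mmd2_from_blocks(K_ss, K_st, K_tt):
    """Biased empirical MMD^2 between the two domains under one kernel."""
    return K_ss.mean() + K_tt.mean() - 2.0 * K_st.mean()

def sftkos_grid(Xs, Xt, scales=(0.1, 1.0, 10.0),
                lams=np.linspace(-2.0, 2.0, 41)):
    """Per-scale grid search for (lam, beta), then MMD-weighted fusion."""
    kernels, weights = [], []
    n = len(Xs)
    for gamma in scales:                       # parallel multi-scale branches
        best_m, best_K = np.inf, None
        for lam in lams:
            beta = lam**2                      # enforce the PSD condition
            K = transfer_kernel(Xs, Xt, lam, beta, gamma)  # see sketch above
            m = mmd2_from_blocks(K[:n, :n], K[:n, n:], K[n:, n:])
            if m < best_m:
                best_m, best_K = m, K
        kernels.append(best_K)
        weights.append(1.0 / (best_m + 1e-12))  # smaller MMD -> larger weight
    w = np.array(weights) / np.sum(weights)
    return sum(wi * Ki for wi, Ki in zip(w, kernels))
```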
3.2. Training the Model
3.2.1. The Domain-Invariant Feature Extraction
In this study, a combination of functional principal component analysis and maximum mean discrepancy methods (FPCA-MMD) was designed to extract domain-independent tracking indicators from bearing time-domain data. According to the literature [29], when processing high-dimensional time-domain signals, FPCA can treat the data as a continuous function, thereby effectively capturing potential degradation characteristics and revealing the trend of the degradation process. However, FPCA itself has limitations, such as the inability to directly measure statistical differences between different time periods and the lack of a clear quantification of dynamic changes across degradation stages.

Therefore, combining FPCA with the MMD method makes up for this shortcoming. MMD can effectively measure the difference in distribution between different time periods or degradation stages, thereby helping to reveal the dynamic changes in the degradation process. Specifically, FPCA first analyzes the multi-dimensional data and extracts the features that represent the main changes in the degradation process; these characteristics capture the main trends of degradation. Meanwhile, MMD is employed to measure the differences in the characteristic distribution at different degradation stages. By calculating these differences, the changes between stages can be quantified, thereby more clearly identifying the key nodes in the degradation process. To capture the relevant characteristics from bearing data in the time domain, FPCA-based dimensionality reduction is applied, resulting in a feature set $F$ that characterizes the bearing's degradation. A sliding window technique [30] with step size $s$ is then employed to partition the feature set, producing subsets $F_1, F_2, \ldots, F_m$, in which $m$ denotes the number of subsets. To analyze the changes in the degradation process, the variation difference between each subset and the initial one is computed. The difference value is quantified by the MMD method as follows:

$$d_i = \mathrm{MMD}(F_1, F_i), \quad i = 1, 2, \ldots, m$$
where $F_i$ denotes the data segment corresponding to the sliding window, and $i$ represents the index of the segment. The MMD function described in Section 2 is utilized to generate a series of difference values $\{d_1, d_2, \ldots, d_m\}$. These values are fed into the TK-KELM model as input to track bearing degradation.
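As an illustration of this indicator, the sketch below substitutes ordinary PCA for functional PCA (an assumption made for brevity) and scores each window by its MMD to the first, presumably healthy, window; the window and step sizes are arbitrary placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA

def mmd(X, Y, gamma=1.0):
    """Biased empirical MMD between two sample sets under an RBF kernel."""
    def k(A, B):
        sq = (np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :]
              - 2 * A @ B.T)
        return np.exp(-gamma * np.clip(sq, 0.0, None))
    return np.sqrt(max(k(X, X).mean() + k(Y, Y).mean()
                       - 2 * k(X, Y).mean(), 0.0))

def degradation_indicator(features, win=64, step=16, n_comp=3):
    """Project the time-domain feature matrix (PCA stands in for FPCA),
    slide a window over it, and score each window by its MMD to the
    first (healthy) window."""
    Z = PCA(n_components=n_comp).fit_transform(features)
    windows = [Z[i:i + win] for i in range(0, len(Z) - win + 1, step)]
    return np.array([mmd(windows[0], W) for W in windows])
```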
3.2.2. KELM Model Training
Because the kernel function in KELM only needs to meet the Mercer condition, the new transfer kernel proposed above can be used directly. Thus, the KELM based on the transfer kernel is named TK-KELM in this work, while the structure of the new model remains consistent with the original KELM. The new transfer kernel matrix can be expressed as follows:

$$\Omega_{\mathrm{TK}} = \left[\widetilde{K}(x_i, x_j)\right]_{i,j=1}^{N}$$

Thereby, the new model output in this case is given by the following:

$$f(x) = \left[\widetilde{K}(x, x_1), \ldots, \widetilde{K}(x, x_N)\right] \alpha$$

Then, the model output weight $\alpha$ is solved as follows:

$$\alpha = \left(\frac{I}{C} + \Omega_{\mathrm{TK}}\right)^{-1} T$$

where $C$ is the regularization coefficient and $T$ is the training label matrix. Finally, the predicted output can be expressed solely in terms of the kernel matrix. The degradation indicator can then be taken as the input of the model, and the predicted label data is used to supplement the insufficient target-domain samples.
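A minimal sketch of TK-KELM training and prediction with a precomputed transfer-kernel matrix, assuming the standard KELM closed-form solution given above, could look as follows; the class and argument names are illustrative.

```python
import numpy as np

class TKKELM:
    """Minimal KELM working on a precomputed (transfer) kernel matrix,
    using the closed-form output weight alpha = (I/C + Omega)^{-1} T."""

    def __init__(self, C=100.0):
        self.C = C  # regularization coefficient

    def fit(self, K_train, T):
        # K_train: n x n transfer-kernel matrix over the training samples
        # T: n x c training target (label) matrix
        n = K_train.shape[0]
        self.alpha = np.linalg.solve(np.eye(n) / self.C + K_train, T)
        return self

    def predict(self, K_test):
        # K_test: m x n kernel matrix between test and training samples
        return K_test @ self.alpha
```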
3.3. Condition Monitoring and Fault Diagnosis
Accurate monitoring of the degradation stage and early identification of faults are critical for BCM and BFD. The process of condition monitoring consists of systematically appraising the operational state of equipment to spot early indicators of degradation. Early detection allows for timely intervention, potentially preventing catastrophic failures. Fault diagnosis, on the other hand, involves identifying the specific nature and cause of a failure once it has occurred, providing insights for corrective actions and optimizing maintenance strategies.
For identifying the condition change point, a gradient iterative partitioning algorithm is applied in this article. The gradient is used to evaluate the variations in the degradation signal: a sharp variation in the gradient indicates a significant change in the monitored condition, which can be detected as a change point. The gradient at each instance is calculated as follows:

$$g_t = \frac{y_{t+1} - y_t}{\Delta t}$$

where $g_t$ is the rate of change at moment $t$, that is, the gradient value, and $y_t$ denotes the degradation indicator at that moment. Additionally, a gradient threshold $\tau$ is set according to the 3σ principle. Based on the above settings, the objective is to identify the condition change points in the gradient sequence. A change point is considered valid only when $|g_t| > \tau$ and the next $n$ consecutive gradient values satisfy $|g_{t+k}| > \tau$ ($k = 1, \ldots, n$). If these conditions are not met, the point is treated as a false change point.
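A compact sketch of this change-point rule is given below; the concrete realization of the 3σ threshold and the confirmation length n are assumptions, as the section does not fix them numerically.

```python
import numpy as np

def gip_change_point(y, n_confirm=5):
    """Flag the first index whose gradient magnitude exceeds a 3-sigma
    threshold and stays above it for n_confirm consecutive steps."""
    g = np.gradient(np.asarray(y, dtype=float))
    tau = np.abs(g).mean() + 3.0 * np.abs(g).std()  # assumed 3-sigma rule
    hits = np.abs(g) > tau
    for t in range(len(g) - n_confirm):
        if hits[t] and hits[t + 1:t + 1 + n_confirm].all():
            return t          # valid change point
    return None               # only false (unconfirmed) change points found
```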
After the identification of the condition change points is completed, the next step is to analyze these points in depth to determine the specific nature behind them and identify which fault types they belong to. For fault diagnosis, a systematic implementation is designed as follows. First, the multi-scale features of the vibration signals from both the source and target domains are extracted using the transfer kernel optimized by SFTKOS. These features are then processed through the FPCA-MMD framework to derive domain-invariant degradation indicators, which serve as the input features for the classification model. Finally, an SVM is employed as the classifier, with the RBF chosen as the kernel function to handle the nonlinearity of the input features. The regularization parameter C of the SVM is determined via cross-validation within the range of 1 to 100 to balance the model's generalization ability and classification accuracy.
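Using scikit-learn as an assumed implementation environment, the SVM classification stage described above might be set up as follows; the cross-validation grid over C is illustrative within the stated range of 1 to 100.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def build_fault_classifier(X_train, y_train):
    """RBF-kernel SVM with C selected by 5-fold cross-validation
    inside the stated range of 1 to 100 (grid values are assumed)."""
    grid = GridSearchCV(SVC(kernel="rbf"),
                        param_grid={"C": [1, 10, 50, 100]},
                        cv=5)
    grid.fit(X_train, y_train)
    return grid.best_estimator_
```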
The computational complexity of the proposed TK-KELM model is dominated by the kernel matrix construction and model training steps. For $n$ samples and $d$ feature dimensions, the computational complexity for online monitoring is $O(n^{3}) + O(nd^{2}) + O(d^{3})$, where $O(n^{3})$ refers to the kernel matrix construction and model training steps; $O(nd^{2})$ is the complexity of computing the covariance of the $d$-dimensional features; and $O(d^{3})$ represents the computational complexity of the dimensionality reduction operations.