A Reinforcement Ensemble Learning Method for Rolling Bearing Fault Diagnosis under Variable Work Conditions

Ensuring the smooth operation of rolling bearings requires precise fault diagnosis. In particular, identifying fault types under varying working conditions holds significant importance in practical engineering. Thus, we propose a reinforcement ensemble method for diagnosing rolling bearing faults under varying working conditions. Firstly, a reinforcement model was designed to select the optimal base learner. Stratified random sampling was used to extract four datasets from the raw training data. The reinforcement model was trained on each of these four datasets, yielding four optimal base learners. Then, a sparse artificial neural network was designed as the ensemble model, and a reinforcement ensemble learning model that can successfully identify the fault type under variable work conditions was constructed. Extensive experiments were conducted, and the results demonstrate the superiority of the proposed method over other intelligent approaches, with significant practical engineering benefits.


Introduction
Modern large rotating machinery often runs in extreme environments, and if a failure occurs, the consequences can be catastrophic. Because rolling bearings are an important part of rotating machinery, it is vital to ensure their safe operation [1]. Hence, the precise identification of the health status of rolling bearings is crucial for the smooth and effective operation of large machinery and equipment [2]. Particularly in the face of increasingly complex working conditions, rolling bearings often need to operate at different speeds and under different loads, so accurately identifying fault states across different work conditions has become a new challenge.
Currently, transfer learning is often used to address fault diagnosis problems under different work conditions. For instance, Wu et al. introduced a deep transfer maximum classifier discrepancy approach for diagnosing rolling bearing faults [3]. He et al. proposed a transfer fault diagnosis method using an enhanced deep auto-encoder [4]. However, these methods still rely on the data distribution and cannot intelligently extract similar fault features through the network structure.
Recently, deep learning models have replaced shallow network models and have been extensively applied in fault diagnosis with remarkable results [5][6][7][8]. For instance, Lin et al. proposed a counter-based open-circuit switch method for a single-phase cascaded H-bridge multilevel converter [9]. Zhao et al. proposed a life prediction method based on a digital twin-driven model [10]. Mochammad et al. designed a bearing fault degradation model based on a multi-time-window fusion unsupervised health indicator [11]. Zhang et al. proposed a model-based analysis method for bearing faults [12].
As the research on intelligent diagnosis methods deepens, network structures are becoming more and more complex, and selecting the appropriate network model often requires considerable labor and time costs. Therefore, researchers began to explore automatic methods for building deep learning models for different tasks [13][14][15]. Currently, reinforcement learning is widely used in this approach due to its powerful autonomous decision-making capabilities. For instance, the Google DeepMind team employed reinforcement learning to discover the optimal neural network structure [16]. Wang et al., on the other hand, introduced a reinforcement-based neural architecture search approach for diagnosing rolling bearing faults [17].
Ensemble learning is an effective deep learning method. Mainstream ensemble learning methods combine several weak models to improve overall performance, and the integration effect mainly depends on the diversity of the models and the integration strategies; the ensemble policies are mostly voting methods. For example, Li et al. introduced an improved selective ensemble deep learning approach for fault diagnosis [18]. Feng et al. proposed an ensemble learning method for classifying failure modes and predicting bearing capacity [19]. In this paper, the ensemble learning method is combined with intelligent modules. The first module is a reinforcement learning module, which is used to intelligently select the base models that can identify the same fault type under different operating states. The second module is a sparse artificial neural network (SANN) model for constructing the ensemble strategy.
Therefore, a Reinforcement Ensemble Learning Method (RELM) is proposed for rolling bearing fault diagnosis under variable work conditions. Firstly, data of identical fault types under varying working conditions are given consistent labels, and four training sets are extracted by stratified random sampling. Secondly, the reinforcement learning model is trained by a policy gradient method, four different optimal network structures are selected for the four different datasets, and these optimal network structures are used as the base models. Finally, a sparse artificial neural network (SANN) is constructed as the ensemble model, and the outputs of the base models are the input features of the ensemble model. The proposed method is validated with a high-speed aeronautical bearing dataset; the experimental results demonstrate the effectiveness of the proposed method, and the main contributions are listed as follows.
(1) The RELM is proposed for rolling bearing fault diagnosis under variable working conditions. In contrast to traditional transfer learning methods, the RELM does not need to deal with the data distribution under different work conditions but directly seeks the optimal network structure that can extract key features reflecting the fault condition. (2) A reinforcement learning method is applied to design an optimal network structure selection model. This intelligent method replaces the significant labor and time costs previously required to design deep learning models. (3) A SANN is constructed to effectively reduce the overfitting problem of the secondary classifier.
The rest of this paper is organized as follows: the theoretical basis is introduced in Section 2. The proposed method is explained in detail in Section 3. Experimental verification is carried out in Section 4, and the conclusion is summarized in Section 5.

Reinforcement Learning
Reinforcement learning is utilized to describe and solve the task of an intelligent agent learning a policy to reach reward maximization or achieve a specific goal during its interaction with the environment. As depicted in Figure 1, the agent chooses actions, and the environment responds with feedback, causing the agent to transition to a new state. Simultaneously, the environment provides rewards, and the agent strives to maximize these rewards over time [20].

Reinforcement learning can be viewed as a Markov Decision Process (MDP), representing the interaction between an agent and its environment. This MDP comprises a quadruplet MDP = (S, A, P, R): when the agent is in an environment E, S represents the state space. Each state, denoted as s ∈ S, describes the environment perceivable by the agent, and the agent's possible actions make up the action space A. When the agent implements action a ∈ A on the current state s, the environment will transfer from the current state to another state with a certain probability P; meanwhile, the environment will give the agent a reward according to the potential reward function R [21].
Hence, a value function is defined to represent the long-term rewards under policy π in the current state. The value function combines the current reward with the cumulative subsequent reward, essentially capturing the expectation of cumulative rewards. The state value function of state s under policy π is denoted as V^π(s):

V^π(s) = E_π[∑_{t=0}^{∞} γ^t r_t | s_0 = s]

where E_π represents the expectation under policy π, which is the expected value of selecting actions according to policy π; ∑_{t=0}^{∞} γ^t r_t represents the discounted sum of the rewards r_t obtained at each time step t; and γ ∈ [0, 1] is the discount factor.

Expanding V^π(s), in which r_t denotes the reward at the future time t and s′ refers to the next state:

V^π(s) = E_π[r_0 + γ ∑_{t=0}^{∞} γ^t r_{t+1} | s_0 = s]

The state s_t at moment t and the state s_{t+1} at moment t + 1 are related through this recurrence, leading to the deduction of the Bellman equation as follows. Given a policy π and an initial state s, taking action a = π(s) leads to transitioning to the next state s′ with a probability P(s′|s, a). Consequently, the expectation E_π of the equation can be expanded as

V^π(s) = ∑_{s′} P(s′|s, a)[r + γ V^π(s′)]

where r represents the immediate reward obtained when transitioning from state s to state s′ by taking action a. The initial action, denoted as a, is determined by the policy π and the state s. G_t represents the return, which is the cumulative discounted reward obtained after time step t. Taking into account the value influence of action a, the action value function can be expressed as follows:

Q^π(s, a) = E_π[G_t | s_t = s, a_t = a]

Similarly, the Bellman equation for the action value function can be expressed as follows, where the system transitions to the next state s′, based on policy π, with a probability denoted as P(s′|s, a):

Q^π(s, a) = ∑_{s′} P(s′|s, a)[r + γ Q^π(s′, π(s′))]

The optimal policy is determined by maximizing the value function under various initial conditions; that is, finding the policy π* such that

π* = arg max_π V^π(s) for all s ∈ S.
In these equations, P_{ss′} represents the transition probability from state s to state s′ under the action a. In order to fit V^π(s) and Q^π(s, a) by the parameter θ, a neural network function approximator is used to approximate V^π(s) and Q^π(s, a) [22].
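The Bellman expectation backup above can be made concrete with a small numeric sketch. The two-state MDP, its transition matrix, and its rewards below are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Hypothetical two-state MDP under a fixed policy:
# P[s, s'] is the transition probability to s', r[s] the immediate reward.
P = np.array([[0.8, 0.2],
              [0.1, 0.9]])
r = np.array([1.0, 0.0])
gamma = 0.9

# Iterative policy evaluation: V <- r + gamma * P V (Bellman expectation backup)
V = np.zeros(2)
for _ in range(500):
    V = r + gamma * P @ V

# Closed-form fixed point V = (I - gamma P)^-1 r for comparison
V_exact = np.linalg.solve(np.eye(2) - gamma * P, r)
print(np.allclose(V, V_exact))  # True — the iteration converges to the fixed point
```

Since γ < 1, the backup is a contraction, so the iteration converges to the unique solution of the Bellman equation.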
Parameterize the policy as π_θ(s, a) = P(a|s, θ), use the model-free approach, put the agent into an uncertain dynamic environment, and use the initial value method for the objective function:

J(θ) = E_{π_θ}[V^{π_θ}(s_0)]

in which J(θ) represents the expected return under the policy π_θ. Maximizing J(θ) means seeking a set of parameter vectors θ that maximizes J(θ). Gradient ascent is used, which requires calculating ∇_θ J(θ):

∇_θ J(θ) = E_{π_θ}[∇_θ log π_θ(s, a) R(s, a)]

For multi-step MDPs, the policy gradient method is approached by the likelihood ratio, where the value function is the sum of the multi-step values, using Q^π(s, a) to replace the reward value R(s, a) of a single step:

∇_θ J(θ) = E_{π_θ}[∇_θ log π_θ(s, a) Q^π(s, a)]

The Monte Carlo policy gradient method uses stochastic gradient ascent to update the parameter θ by sampling; the update formula is

θ ← θ + α ∇_θ log π_θ(s_t, a_t) G_t

Adding a baseline function b(s), the equation can be modified to

θ ← θ + α ∇_θ log π_θ(s_t, a_t) (G_t − b(s_t))
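The Monte Carlo policy gradient update with a baseline can be sketched on a toy problem. The two-armed bandit, its payout means, the step size, and the running-average baseline below are hypothetical choices for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-armed bandit: arm 1 pays more on average, so the
# learned softmax policy should come to prefer it.
true_means = np.array([0.2, 0.8])

theta = np.zeros(2)          # policy parameters (softmax preferences)
alpha = 0.1                  # step size
baseline = 0.0               # running baseline b to reduce variance

for step in range(2000):
    pi = np.exp(theta) / np.exp(theta).sum()      # pi_theta(a)
    a = rng.choice(2, p=pi)
    G = rng.normal(true_means[a], 0.1)            # sampled return
    baseline += 0.01 * (G - baseline)             # slow-moving average of returns
    grad_log = -pi
    grad_log[a] += 1.0                            # d/dtheta log pi_theta(a)
    theta += alpha * grad_log * (G - baseline)    # REINFORCE-with-baseline update
print(np.argmax(theta))
```

After training, the preference for the higher-paying arm should dominate, illustrating how the baseline-corrected likelihood-ratio update drives the policy toward higher expected return.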

The Generation of Baseline Model
At present, deep learning methods are widely used and have achieved good results in fault diagnosis. However, selecting the optimal hyperparameters of a neural network is still a difficult task. Some optimization methods have been used to optimize the network structure, but they have not been widely adopted due to efficiency and network structure constraints [23][24][25]. Moreover, due to the variable data distributions of different datasets and the black-box characteristics of deep learning model architectures, it is difficult to summarize relevant design laws; the selection of a network structure still mainly relies on human experience. Therefore, there is an urgent need to develop intelligent methods for automatically selecting neural network structures.
In this paper, a reinforcement learning model for optimal network structure selection is presented. The reinforcement learning model, illustrated in Figure 2, can be seen as a quadruplet comprising a controller, actions, child models, and rewards. First, the controller generates a series of actions to construct the child model network structure. Then, the child model is trained using the corresponding dataset, and its classification accuracy is obtained. This accuracy is fed back to the controller as a reward. Due to the non-differentiable nature of the reward function, the controller parameters are updated using the policy gradient method of reinforcement learning. These steps are iteratively performed to select the optimal child model. The specific structure and function of each module are explained as follows.
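The controller–child–reward loop described above can be sketched as follows. The function names are hypothetical, and the child-model training is stubbed out with a random accuracy, since a real run would train one CNN per sampled structure:

```python
import random

# Search space described in the paper: per conv layer, a kernel size
# and a filter count.
KERNELS = [1, 3]
FILTERS = [16, 32, 64]

def controller_sample():
    """Stand-in for the GRU controller: sample one (kernel, filters)
    pair per convolutional layer. A real controller would condition
    each action on the previously emitted ones."""
    return [(random.choice(KERNELS), random.choice(FILTERS))
            for _ in range(4)]

def train_child(actions, dataset):
    """Stub for training the child CNN; returns a fake accuracy.
    In the paper this is the classification accuracy on the
    validation subset."""
    return random.uniform(0.90, 0.99)

def search(dataset, iterations=20):
    best_acc, best_actions = -1.0, None
    for t in range(iterations):
        actions = controller_sample()
        acc = train_child(actions, dataset)   # reward signal
        # A real controller is updated here by the policy gradient,
        # using acc (minus a baseline) as the reward.
        if acc > best_acc:
            best_acc, best_actions = acc, actions
    return best_actions, best_acc

random.seed(0)
arch, acc = search(dataset=None)
print(arch)  # one (kernel, filters) pair per conv layer
```

The 20-iteration budget mirrors the paper's setting of 20 generated child models per dataset.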

•
Controller. The controller consists of Gated Recurrent Unit (GRU) cells, which we use to generate actions that represent the structure of the child model. The calculation process can be developed as follows:

z_t = σ(W_z x_t + U_z h_{t−1} + b_z)
r_t = σ(W_r x_t + U_r h_{t−1} + b_r)
c_t = tanh(W x_t + U (r_t ⊙ h_{t−1}) + b_h)
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ c_t

in which x_t is the input state, h_t is the output state, and c_t represents the cell (candidate) state at time t. W_z is the connection matrix for the input x_t and the update gate, W_r is the connection matrix for the input x_t and the reset gate, and W is the weight matrix connected with the cell state c_t; U_z, U_r, and U are the corresponding recurrent weight matrices. b_z, b_r, and b_h are the bias vectors.
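A single GRU step with the weight names used in the text can be sketched in NumPy. The recurrent matrices U_z, U_r, U and the gating convention are standard GRU assumptions, not details reproduced from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, U_z, b_z, W_r, U_r, b_r, W, U, b_h):
    """One GRU step using the gate/weight names from the text."""
    z = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)    # update gate
    r = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)    # reset gate
    c = np.tanh(W @ x_t + U @ (r * h_prev) + b_h)  # candidate state c_t
    return (1.0 - z) * h_prev + z * c              # new hidden state h_t

rng = np.random.default_rng(0)
d_in, d_h = 4, 8  # illustrative sizes
mats = [rng.normal(scale=0.1, size=s) for s in
        [(d_h, d_in), (d_h, d_h), d_h] * 3]  # W_z,U_z,b_z, W_r,U_r,b_r, W,U,b_h
h = gru_step(rng.normal(size=d_in), np.zeros(d_h), *mats)
print(h.shape)  # (8,)
```

In the controller, the hidden state h_t would be fed to the softmax classifier to produce the next action.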
In this paper, the controller's structure consists of a one-hidden-layer GRU connected to a softmax classifier. The fundamental architecture of the controller is illustrated in Figure 3. The input to the controller is the child model network structure from the previous time step, and the output consists of predicted actions used to construct the child model.
Here, θ denotes the controller parameters, and the objective function J(θ) is the sum of the cross-entropy loss and the regularization loss, where λ represents the weight decay parameter. The exploration rate µ is 80%, and t stands for the number of child models generated for the current dataset, which is set to 20.
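The controller objective — a cross-entropy term plus a λ-weighted regularization loss — can be sketched as follows. The shapes, the L2 form of the regularizer, and the sample values are assumptions for illustration:

```python
import numpy as np

def objective(logits, labels, params, lam=1e-4):
    """J(theta): cross-entropy loss plus lambda-weighted L2
    regularization, as described for the controller objective."""
    # softmax cross-entropy, numerically stabilized
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(labels)), labels].mean()
    l2 = sum((p ** 2).sum() for p in params)  # regularization loss
    return ce + lam * l2

logits = np.array([[2.0, 0.0], [0.0, 3.0]])
labels = np.array([0, 1])
params = [np.ones((2, 2))]
print(round(objective(logits, labels, params, lam=0.01), 4))  # 0.1278
```

The weight decay λ trades off fitting the sampled actions against keeping the controller weights small.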



Currently, Adam, RMSProp, AdaDelta, and stochastic gradient descent (SGD) are commonly used optimization algorithms. However, there are no established guidelines for determining which optimization algorithm yields the best results, so the selection of the optimization algorithm is essentially experimental. To overcome the effect of the optimization algorithm on the selection of network structures, a different one of these four optimization algorithms is assigned to the controller for each of the four base model structure searches in this paper.
• Action. Actions are the link between the controller and the child model. Actions are generated by the controller and are used to select the child model network structure. After each parameter update, the controller produces a list of actions a = [a_1, a_2, ⋯, a_8], in which [a_1, a_2] are the values of the kernels and filters in the first convolution layer, [a_3, a_4] are the values of the kernels and filters in the second convolution layer, and the rest, [a_5, a_6] and [a_7, a_8], may be deduced by analogy.
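Decoding such an action list into per-layer structure choices might look like the following sketch. The index-based encoding is an assumption about how the controller's softmax outputs map onto the discrete options:

```python
# Decode a flat action list [a1..a8] into per-layer (kernel, filters)
# pairs, following the pairing described above.
KERNELS = [1, 3]          # kernel sizes: 1x1 or 3x3
FILTERS = [16, 32, 64]    # filter counts

def decode(actions):
    layers = []
    for i in range(0, 8, 2):
        kernel = KERNELS[actions[i]]        # even positions pick a kernel
        filters = FILTERS[actions[i + 1]]   # odd positions pick a filter count
        layers.append((kernel, filters))
    return layers

print(decode([1, 2, 0, 0, 1, 1, 0, 2]))
# -> [(3, 64), (1, 16), (3, 32), (1, 64)]
```

Each decoded pair then parameterizes one of the four convolutional layers of the child model.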

•
Child model. The child model is constructed by the actions generated in the previous step.
To ensure efficiency and minimize computational effort, we predefined the search space for the child model as a four-layer convolutional structure, with each layer defined by a kernel size and a filter count: the kernel size is chosen from [1 × 1] and [3 × 3], and the filter count is chosen from 16, 32, and 64. Each child model contains an input layer, four convolutional layers, a global average pooling layer, and a fully connected layer. At iteration t, the controller produces new actions, trains the child model with the corresponding dataset, and obtains an accuracy acc_t.
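The size of this predefined search space follows directly from the per-layer choices, which can be enumerated explicitly:

```python
from itertools import product

KERNELS = [1, 3]          # 2 kernel-size choices per layer
FILTERS = [16, 32, 64]    # 3 filter-count choices per layer

# Every child model is a 4-tuple of (kernel, filters) pairs,
# so the space holds (2 * 3) ** 4 candidate structures.
space = list(product(product(KERNELS, FILTERS), repeat=4))
print(len(space))  # 1296
```

Exhaustively training 1296 candidates per dataset would be expensive, which is why the controller searches the space with only 20 sampled child models instead.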

•
Reward (R). The configuration of the reward system is pivotal for optimizing the selection of the child model structure in terms of efficiency and effectiveness. To maximize cumulative rewards, an exponentially weighted average method is employed. In this method, the accuracy obtained at each iteration is assigned a different weight following an exponential decay function.
The exponentially weighted average accuracy EWA_t at iteration t can be calculated as follows:

EWA_t = β · EWA_{t−1} + (1 − β) · acc_t

In order to balance the current accuracy and the historical accuracy, β is selected as 0.8. After adding the bias correction:

EWA′_t = EWA_t / (1 − β^t)

The cumulative reward R_t is set to be acc_t at t iterations minus this bias-corrected baseline:

R_t = acc_t − EWA′_t

To select the optimal child model, which requires maximizing the cumulative reward R_t, the objective function of the controller can be written as

L(θ) = E_{P(a_{1:T}; θ)}[R_t]

Since R_t is non-differentiable, the hyperparameters θ of the controller are updated using a policy gradient method.
∇_θ L(θ) = E_{P(a_{1:T}; θ)}[∇_θ log P(a_t | a_{(t−1):1}; θ) R_t]

To reduce variance, the bias term b is introduced:

∇_θ L(θ) = E_{P(a_{1:T}; θ)}[∇_θ log P(a_t | a_{(t−1):1}; θ)(R_t − b)]

After adding the discount factor γ, ∇_θ L(θ) can be modified as

∇_θ L(θ) = E_{P(a_{1:T}; θ)}[∇_θ log P(a_t | a_{(t−1):1}; θ) γ^t (R_t − b)]

in which γ is 0.99.
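The exponentially weighted baseline and the resulting reward can be sketched numerically. The exact bookkeeping (updating the baseline before computing the reward) is one plausible reading of the description above, and the accuracy sequence is invented:

```python
def rewards(accs, beta=0.8):
    """Reward per iteration: current accuracy minus the bias-corrected
    exponentially weighted average of accuracies (beta = 0.8 as in the
    text)."""
    ewa, out = 0.0, []
    for t, acc in enumerate(accs, start=1):
        ewa = beta * ewa + (1.0 - beta) * acc  # EWA_t
        corrected = ewa / (1.0 - beta ** t)    # bias correction EWA'_t
        out.append(acc - corrected)            # R_t = acc_t - EWA'_t
    return out

accs = [0.90, 0.92, 0.95, 0.93]  # hypothetical child-model accuracies
print([round(r, 4) for r in rewards(accs)])
```

A child model that beats the running average of its predecessors yields a positive reward, which is what pushes the controller toward better structures.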

Cross-Validation Method
Traditional training methods commonly employ the complete training dataset, which can easily lead to overfitting. Cross-validation methods are therefore applied to ensemble methods, which can improve data utilization and effectively prevent overfitting and underfitting [26].
The main concept of the cross-validation method is shown in Figure 4. The original training dataset is divided into four subsets, with three subsets serving as the training set and the remaining subset serving as the validation set. For each iteration, a combination of three subsets is used to train the model, and the remaining subset is used to validate the performance of the model. The reinforcement model is trained with the four resulting datasets, respectively, so as to obtain the optimal network structure suitable for the current dataset as the base model. Eventually, four base models with different network structures are obtained.
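A stratified four-fold split of the kind described can be sketched as follows; the toy labels and the fold bookkeeping are illustrative assumptions:

```python
import numpy as np

def four_fold_splits(labels, rng):
    """Stratified split of sample indices into four folds, then the
    four (train, val) combinations described above: three folds train,
    the remaining fold validates."""
    folds = [[] for _ in range(4)]
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        for i, chunk in enumerate(np.array_split(idx, 4)):
            folds[i].extend(chunk.tolist())   # keep class balance per fold
    splits = []
    for v in range(4):
        train = [i for f in range(4) if f != v for i in folds[f]]
        splits.append((train, folds[v]))
    return splits

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(5), 40)   # toy data: 5 classes, 40 samples each
splits = four_fold_splits(labels, rng)
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 4 150 50
```

Each of the four (train, validation) combinations then trains one reinforcement search, yielding the four base models.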



The Sparse Artificial Neural Network
Traditional ensemble strategies like voting methods, bagging, or boosting typically employ shallow networks for their upper-layer models and attempt to enhance performance by adjusting the weights of the base models, but they have limitations. To overcome these limitations, this paper employs a stacking ensemble strategy that uses deep learning models for the upper layer. The likelihood matrix output by the base models is utilized as the input features for the upper model.
To optimize the ensemble model and prevent overfitting of the secondary learning machine, a sparse artificial neural network (SANN) is constructed as the ensemble model. The structure of the SANN is illustrated in Figure 5, comprising an input layer, three hidden layers, and an output layer. The network thus contains four weighted layers; the dimensionality of the input vector is 64, and the output vector is five-dimensional. The output vectors of each layer are represented as follows.
Input layer: x
First hidden layer: h^{l−3} = f_{l−3}(w_{l−3} x + b_{l−3})
Second hidden layer: h^{l−2} = f_{l−2}(w_{l−2} h^{l−3} + b_{l−2})
Third hidden layer: h^{l−1} = f_{l−1}(w_{l−1} h^{l−2} + b_{l−1})
Output layer: y = f_l(w_l h^{l−1} + b_l)

The weight matrices and bias vectors of the layers are w_{l−3}, b_{l−3}; w_{l−2}, b_{l−2}; w_{l−1}, b_{l−1}; and w_l, b_l. The activation functions of the layers are f_{l−3}, f_{l−2}, f_{l−1}, and f_l.
To address overfitting issues resulting from excessive feature learning, half of the feature detectors are randomly omitted in each batch. Consequently, during forward propagation, the activation values of specific neurons are set to 0 with a probability of P, enhancing the network's generalization.
Thus, the expression of hidden layer one is obtained:

net^{l−3} = w_{l−3} x + b_{l−3}
r ~ Bernoulli(1 − P)
h^{l−3} = r ⊙ f_{l−3}(net^{l−3})

where each element in net^{l−3} represents the weighted sum of the input layer vector and the bias vector, and the Bernoulli function is used to generate the mask vector r, which randomly produces a vector of 0s and 1s. And so on, for layer l:

y = f_l(w_l (r ⊙ h^{l−1}) + b_l)

During the training of the sparse network, half of the hidden neurons are randomly removed while the input and output neurons are kept intact. The input features are then forwarded through the altered network, and the loss function is backpropagated through this modified network. By employing the stochastic gradient descent method, a small set of training samples is used to update the parameters of the undeleted neurons. Subsequently, the deleted neurons are restored (during this phase, the deleted neurons remain unchanged, and the undeleted neurons have been updated), and this training process is repeated until training is complete.
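The random omission of feature detectors can be sketched as a dropout mask. The inverted-dropout scaling by 1/(1 − P) is a common convention assumed here, not something stated in the paper:

```python
import numpy as np

def dropout_layer(h, p, rng, training=True):
    """Randomly zero activations with probability p during training
    (the paper drops half the feature detectors, p = 0.5), scaling the
    survivors by 1/(1 - p) — the inverted-dropout convention."""
    if not training:
        return h
    r = rng.binomial(1, 1.0 - p, size=h.shape)  # Bernoulli keep mask
    return h * r / (1.0 - p)

rng = np.random.default_rng(0)
h = np.ones(1000)
out = dropout_layer(h, p=0.5, rng=rng)
print(round(float((out == 0).mean()), 1))  # roughly half the units are dropped
```

At inference time (`training=False`), the layer passes activations through unchanged, so no rescaling is needed.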

General Procedures of Proposed Method
To address the challenges of fault diagnosis under variable working conditions, this paper presents a reinforcement ensemble method. It automatically selects base learners and combines them using a sparse artificial neural network. The schematic representation of the proposed method is depicted in Figure 6, and the detailed procedures are outlined as follows: (1) The fault data samples of the same type collected under different working conditions are directly assigned the same labels to construct the training dataset. (2) The original data are divided into training sets and a test set.
(3) The reinforcement models are trained with the four training datasets to obtain four optimal network structures; the controller optimization algorithms are selected as Adam, RMSProp, AdaDelta, and SGD, respectively. (4) The sparse artificial neural network is designed as the ensemble model, and the four optimal network structures obtained in the previous step are used as the base learners. (5) The output matrix of the base learners is used as the input feature of the ensemble model. (6) The proposed method is verified with a high-speed aerospace bearing dataset under variable work conditions.


Experimental Verification

Introduction of Experimental Dataset
This paper was validated through data collected from experiments conducted on a test rig established at the DIRG Lab within the Department of Mechanical and Aerospace Engineering. The test rig is purposefully designed to evaluate high-speed aerospace bearings, capturing acceleration data across varying speeds, radial loads, and damage levels. The test stand, as depicted in Figure 7, consists of a high-speed spindle with a rotating drive shaft. The spindle's main body is securely fastened to a robust stand placed on a substantial steel base plate. The spindle's speed is regulated via the inverter's control panel. On the same base plate, two brackets hold two identical roller-bearing outer rings (positions B1 and B3). The inner rings of these bearings are connected to a short hollow shaft that is custom-designed for operation at speeds of up to 35,000 rpm. During the experiment, the precision sledge connected to the outer ring of the bearing is rotated and pulled by the nut, compressing two parallel springs to generate the required load [27].
The test data were gathered by measuring the acceleration of the bearing support and the electric spindle. For this purpose, a triaxial IEPE-type accelerometer was employed. This accelerometer is capable of recording frequencies within the range of 1 to 12,000 Hz (with an amplitude tolerance of ±5% and a phase tolerance of ±10°). It features a nominal frequency of 55 kHz and a nominal sensitivity of 1 mV/ms−2.
Table 1 provides details about the three distinct fault domains, while Table 2 describes the fault types within each domain. Raw signals collected under different working conditions using the accelerometer are illustrated in Figures 8–10.

The Effectiveness of the Proposed Method
To verify the REM method proposed in this paper, the four datasets were each used to train the reinforcement model for 20 epochs, yielding four base models that were adopted as the base learners of the ensemble model. The structures of the four base learners are shown in Table 3. The test results show that the highest accuracy among these base classifiers is achieved by base learner two, which reaches 97.90%. Base learner one has a slightly lower accuracy than base learners two and three, and the worst is base learner four, with a test accuracy of 94.71%. All four base classifiers recognize the second and fourth fault types well, with average accuracies above 99%. For the fifth fault type, all base learners except base learner four reach above 98%. In contrast, the recognition accuracy for the first and third fault types is generally poor.
Using the above four convolutional models as the base learners, followed by a sparse artificial neural network (SANN) as the secondary trainer, with a hidden-layer structure of 64-32-16 and the ReLU activation function, the reinforcement ensemble model was constructed and executed ten times under consistent conditions, resulting in the average accuracy presented in Table 4. A visual representation of the test results can be seen in Figure 11. After ensembling the base models with the SANN, the overall accuracy improved significantly, and the average accuracy exceeds 99%. The accuracy for the first and third fault types reaches more than 98%, and the accuracy for the remaining three fault types is close to 100%.
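The stacking step above can be sketched in NumPy. This is a minimal, hedged illustration: the 64-32-16 hidden layout and ReLU activation come from the text, while the input width (four base learners × five fault classes = 20 stacked probabilities) and the random initialization are our assumptions, and the sparsity of the SANN would be imposed during training (e.g., by pruning or L1 regularization), which is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Hidden layout 64-32-16 as in the paper; the 20 inputs (4 learners x 5
# classes) and 5 outputs are illustrative assumptions.
sizes = [20, 64, 32, 16, 5]
weights = [rng.standard_normal((i, o)) * 0.1 for i, o in zip(sizes, sizes[1:])]
biases = [np.zeros(o) for o in sizes[1:]]

def forward(x):
    """Forward pass of the stacked meta-learner over base-learner outputs."""
    for w, b in zip(weights[:-1], biases[:-1]):
        x = relu(x @ w + b)
    return softmax(x @ weights[-1] + biases[-1])

# Eight stacked prediction vectors -> eight 5-class probability rows.
probs = forward(rng.standard_normal((8, 20)))
```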
In summary, the reinforcement ensemble learning model efficiently selects the base models, and the sparse artificial neural network (SANN) as a secondary learning machine notably enhances the accuracy in recognizing various types of faults.
The confusion matrix analysis reveals consistent patterns in the performance of the base learners. Base learner one tends to struggle with the third fault type, often misclassifying it as the first fault type; likewise, the first fault type frequently gets misclassified as the third. In contrast, the second and fourth fault types exhibit high accuracy. Base learner two follows a similar pattern: the first fault type has low accuracy and is frequently misclassified as the third, while the second, fourth, and fifth fault types achieve accuracy rates exceeding 99%. Base learner three exhibits lower accuracy for the third fault type and often mistakes it for the first, and its first fault type is again prone to being misclassified as the third. Base learner four shares these characteristics, with both the first and third fault types showing lower accuracy and being confused with each other; its fifth fault type is also commonly misclassified as the third. These observations highlight the strengths and weaknesses of each base learner for specific fault types; the ensemble model leverages these differences to improve overall fault diagnosis performance (see Figure 12).
After being ensembled by the SANN, the accuracy for both the first and third fault types improves significantly, reaching above 98%. The accuracy for the remaining three fault types is above 99%. The proposed method thus improves diagnostic accuracy and can effectively identify various fault types under different working conditions.
Various ensemble methods, namely the Random Forest Classifier (RF), K-Nearest Neighbor Classifier (KNN), Fully Connected Artificial Neural Network (ANN), and Deep Belief Network (DBN), were applied to combine the predictions from the four base learners selected through the reinforcement learning model. These ensemble methods had specific configurations and were trained for 100 epochs to assess their performance. The Random Forest Classifier utilized 200 decision trees, while the K-Nearest Neighbor Classifier considered six neighboring points. The Fully Connected Artificial Neural Network had a hidden-layer structure of 64-32-16, and the Deep Belief Network was structured with hidden layers of 200, 100, and 50 units. The outcomes of these experiments are summarized in Table 5.
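The comparison settings above can be captured in a small configuration sketch, paired with the simplest possible combiner, majority voting over base-learner labels, as a reference point for what the learned combiners must beat. The configuration keys and the voting helper are illustrative, not the paper's implementation.

```python
import numpy as np

# Comparison-method settings as reported in the text (keys are illustrative).
ensemble_configs = {
    "RF":  {"n_estimators": 200},
    "KNN": {"n_neighbors": 6},
    "ANN": {"hidden_layers": (64, 32, 16)},
    "DBN": {"hidden_units": (200, 100, 50)},
}

def majority_vote(base_preds):
    """Combine per-learner class labels (shape: learners x samples) by
    majority vote -- the simplest reference combiner."""
    base_preds = np.asarray(base_preds)
    n_classes = int(base_preds.max()) + 1
    counts = np.apply_along_axis(
        lambda col: np.bincount(col, minlength=n_classes), 0, base_preds)
    return counts.argmax(axis=0)

# Three learners vote on three samples; ties break toward the lower label.
votes = majority_vote([[0, 1, 2], [0, 1, 1], [0, 2, 1]])  # -> [0, 1, 1]
```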
Compared to the test results in Table 4, all five ensemble methods improve the test accuracy over the base learners, but the proposed method is the most outstanding, followed by KNN, ANN, and DBN, with RF slightly worse. Since the diagnostic accuracy of the base learners selected by the reinforcement learning method is already relatively high, the overall accuracy gain of the proposed method is not very prominent. However, the proposed method performs significantly better than the other ensemble methods on the first and third fault types, which are the most difficult to identify.
The test results are represented visually in Figure 13. The proposed method demonstrates notable superiority, particularly in identifying the first and third fault types, which are challenging to distinguish. In contrast, for the second, fourth, and fifth fault types, the ensemble methods' performance exhibits only marginal differences; this can be attributed to the strong individual performance of the base learners in these cases. On the whole, the method introduced in this study surpasses the other ensemble techniques.

We randomly selected a group of test results to show the confusion matrix; the confusion matrices of the test results for the different ensemble methods are shown in Figure 14. For RF, the first fault type has the lowest accuracy rate of 95.15% and is most likely to be misclassified as the third fault type; this is followed by the third fault type, which is easily misclassified as the first, while the second, fourth, and fifth fault types reach accuracies above 99%. For KNN, the third fault type has the lowest accuracy and is prone to being misclassified as the first, followed by the first fault type, which is prone to being misclassified as the third; the second, fourth, and fifth fault types again reach accuracies above 99%. Similarly, in the ANN, the first fault type has the lowest accuracy rate and is most likely to be misclassified as the third, followed by the third fault type, which is easily misclassified as the first. Finally, in the DBN, the third fault type has the lowest accuracy and is likewise easily misclassified as the first, followed by the first fault type, which is easily misclassified as the third. For the proposed method, the accuracy of the first and third fault types is similarly relatively low, but the overall diagnostic accuracy is better than that of the other ensemble methods.
For the comparative Convolutional Neural Network (CNN), four convolutional layers are used with corresponding kernel and filter sizes of [1, 16], [3, 32], [1, 16], and [3, 32], respectively. These convolutional layers are followed by a global average pooling layer, a fully connected layer, and a softmax classifier. The LSTM (Long Short-Term Memory) network uses a single layer containing 128 hidden units.
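The shape flow through this comparative CNN can be traced with a short sketch. We read each "[kernel, filters]" pair as kernel sizes 1, 3, 1, 3 with 16, 32, 16, 32 filters; the input segment length of 1024, stride 1, and 'same' padding are assumptions not stated in the text.

```python
def conv1d_out_len(length, kernel, stride=1, padding="same"):
    """Output length of a 1-D convolution (stride 1, 'same' padding assumed)."""
    if padding == "same":
        return -(-length // stride)          # ceiling division
    return (length - kernel) // stride + 1   # 'valid' padding

# Kernel/filter pairs as listed in the text (our reading of "[1,16] [3,32]").
layers = [(1, 16), (3, 32), (1, 16), (3, 32)]

length, channels = 1024, 1        # assumed raw-signal segment, one channel
for kernel, filters in layers:
    length = conv1d_out_len(length, kernel)
    channels = filters

# Global average pooling collapses the time axis to one value per channel,
# so the fully connected layer receives `channels` features.
pooled_features = channels        # -> 32 features into the dense layer
```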
Therefore, 6750 samples were randomly sampled from the original training data while keeping the test dataset unaltered. Each method underwent training 100 times, yielding multiple sets of test results; the various methods were then executed ten times under identical conditions, and the average test accuracy is presented in Table 6. The proposed method consistently outperforms the other approaches in average accuracy, while detecting the same fault type under varying working conditions proves challenging for a straightforward intelligent diagnostic model. Among the baselines, the CNN and BP neural networks diagnose better than SVM and LSTM, with overall recognition accuracies reaching 87%. However, for the first, third, and fifth fault types their recognition accuracy is low; in particular, for the third fault type, the recognition accuracy of these two methods does not reach 70%. The diagnosis results of SVM and LSTM are poor, at 75.82% and 77.93%, respectively, with weak recognition of the third and fifth fault types. For the first fault type, the diagnosis accuracy of SVM reaches 94.05%, while that of LSTM is only 70.16%. All four intelligent methods, however, accurately identify the second and fourth fault types. The proposed method effectively identifies all fault types, achieving an overall recognition accuracy exceeding 99% and thereby demonstrating its efficacy.
Figure 15 visually demonstrates the diagnostic accuracy of these methods, with the proposed approach consistently exhibiting high accuracy across fault types. The other four intelligent diagnosis methods can only accurately identify the second and fourth fault types. For the BP network, the identification accuracy for the third fault type is the worst, followed by the fifth; the accuracy for the first fault type is slightly poor but still reaches 85%. The accuracy of SVM for the third and fifth fault types is very low. The CNN also has its lowest accuracy on the third fault type, followed by the first. The LSTM has poor accuracy on the three fault types that are difficult to identify.

Parameter Sensitivity Analysis
In the proposed approach, the hyperparameter µ controls the exploration rate during child model generation, and β is the forgetting coefficient that balances the current reward against the historical reward. The values of µ and β directly affect the diagnostic result of the proposed method, so this section conducts sensitivity analysis experiments to check the rationality of the two hyperparameters. The range of µ is {50%, 60%, 70%, 80%, 90%}, and the range of β is {0.5, 0.6, 0.7, 0.8, 0.9}.
We conducted experiments on the cross-domain fault diagnosis cases. Figure 16 illustrates the sensitivity of the diagnostic outcome to the two hyperparameters. The experimental findings highlight their significant influence: for µ = 80% and β = 0.8, the diagnostic accuracy reached its peak of 99.19%. These results support the selected hyperparameter values.
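The roles of the two hyperparameters can be sketched as follows. The exponential update R_t = β·R_{t−1} + (1 − β)·r_t is our reading of a "forgetting coefficient", and treating µ as the probability of keeping the best-known action during child model generation is likewise an assumption; the paper does not spell out either rule.

```python
import random

MU, BETA = 0.80, 0.8   # the values found best in the sensitivity analysis

def smoothed_reward(history, beta=BETA):
    """Blend each new reward with the running history:
    R_t = beta * R_{t-1} + (1 - beta) * r_t
    (our reading of the forgetting coefficient)."""
    r = history[0]
    for x in history[1:]:
        r = beta * r + (1 - beta) * x
    return r

def choose_action(best_action, actions, mu=MU, rng=random.Random(0)):
    """With probability mu keep the best-known action, otherwise explore
    a random alternative (one possible reading of the exploration rate)."""
    return best_action if rng.random() < mu else rng.choice(actions)

# Reward 1.0 followed by two zero rewards decays geometrically:
# 0.8 * (0.8 * 1.0 + 0.2 * 0.0) + 0.2 * 0.0 = 0.64
r = smoothed_reward([1.0, 0.0, 0.0])
```

A larger β makes the controller's reward estimate more stable but slower to react to newly promising architectures, which matches the peak appearing at an intermediate β = 0.8 rather than at either extreme.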
Conclusions
A reinforcement ensemble method was proposed for fault identification under different working conditions. Firstly, a reinforcement model was constructed to select optimal base learners. Secondly, stratified random sampling was used to extract four datasets from the raw training data; the reinforcement model was trained on these four datasets, respectively, and four optimal base learners were obtained. Finally, a sparse ANN was designed as the ensemble model, and a reinforcement ensemble learning model capable of identifying fault types under variable working conditions was constructed.

Figure 3. The Structure of the Controller.

Figure 6. Abstract graphic of the proposed method.

Figure 7. General view of the test rig.

Figure 11. Test result of the proposed method.

Figure 13. Test results of different ensemble methods.

Figure 15. Test results of different diagnosis methods.

Figure 16. Sensitivity analysis of µ and β.



Table 1. Details of the different fault domains.

Table 2. Descriptions of each fault condition.


Table 3. Base learners learned by the meta-learning method.

Table 4. Testing results of the proposed method.

Table 5. Testing results of different ensemble methods.

Table 6. Testing results of different diagnosis methods.