1. Introduction
Modern machines with rotating components tend to use rolling bearings for their bearing arrangements. For reasons of energy efficiency and limited design space, the bearings are designed to be as small as possible, which can lead to them being operated at the limits of their durability. An unforeseen failure of a bearing can cause considerable damage to the entire machine and its environment. Especially in safety-relevant systems, an unforeseen failure must be avoided under all circumstances. To prevent such failures, condition monitoring and predictive maintenance are becoming increasingly important [1]. Condition monitoring involves using suitable sensors to record measurement data during operation, which is then processed to draw conclusions about the condition of the component [2]. If the condition is judged to be critical, corrective actions such as maintenance can be planned. To carry out such planning with as little risk as possible, it is essential to estimate the remaining useful life (RUL) of components [3].
Rolling bearing damage can occur in various ways. It can be caused by lack of lubrication, short-term overload or material fatigue due to long-term stress. Material fatigue usually manifests itself in the form of propagating pitting within the raceway surfaces [4]. Recently, traditional condition monitoring methods for bearing damage detection have been increasingly combined with artificial intelligence (AI). Machine learning (ML), as a subfield of AI, plays an essential role here. ML algorithms can be used to recognize complex structures in data and to evaluate these structures [5]. This offers the possibility of automated inference from the data. Applied to RUL prediction, such approaches automatically draw conclusions about the RUL from the data measured at the component. The machine learning algorithms used for RUL prediction include different variants of neural networks, such as convolutional neural networks (CNN) [6], recurrent neural networks (RNN) [7], long short-term memory (LSTM) networks [8], and generative adversarial networks (GAN) [9]. Furthermore, there are contributions to state detection using random forest algorithms [10]. Machine learning is therefore becoming increasingly relevant, not least in the field of tribology [11].
When using machine learning, the achievable prediction quality is highly dependent on the type and quality of the data as well as the preprocessing used. Targeted data preprocessing has a significant impact on both the achievable prediction accuracy and the computational speed of the implemented algorithms [12,13]. In the context of rotating machinery, the measurement of structure-borne sound has proven particularly useful for drawing conclusions about the components’ condition [14,15,16]. Therefore, the present article also uses structure-borne sound measurements to investigate the condition of rolling bearings and to predict their RUL.
Recent approaches for predictive maintenance based on electrical impedance measurements of rolling bearings can complement or even replace structure-borne sound measurements with in situ information [17]. The quality of the underlying model is continuously increased by considering unloaded rolling elements and modeling the detailed rolling contact geometry [18]. ML approaches are used to further enhance the predictive capabilities [19].
In a previous paper by the present authors, the influence of feature engineering on condition monitoring of rolling bearings was shown using a random forest regressor [20]. That work presented a feature engineering approach which, compared to the features from Lei et al. [21], achieves particularly good results in structure-borne sound-based condition detection. Based on these results, the feature engineering approach is optimized and extended in this study with regard to the prediction of remaining useful life. The aim is to develop a methodology that leads to a RUL prediction model with high accuracy and good traceability. Therefore, the investigations focus on feature engineering and the consideration of information from the temporal past. To predict the RUL of rolling bearings, a methodology based on a random forest condition regression is presented.
  2. Materials and Methods
To evaluate the developed feature engineering methods in the context of RUL prediction, a methodology is used in which all other model components and their parameters remain constant as boundary conditions. The approach is illustrated in Figure 1. The individual model parts are described in more detail in the subsequent sections.
  2.1. Experiments
The investigations are based on structure-borne sound measurement data recorded on a rolling bearing test rig. The FE9 test rig used was originally designed for testing rolling bearing greases. An electric motor drives the test head shaft via a belt. On one side of the shaft, an ancillary bearing with circulating oil lubrication is mounted. The grease-lubricated test bearing, whose wear is to be examined, is located on the other side of the shaft. To accelerate grease aging and thus the wear of the test bearing, the bearing is heated. An axial load is applied by means of a spring preload. The test head of the FE9 test rig can be seen in Figure 2.
In the tests evaluated in this work, the test bearings are of type 6206-C-C3 (Schaeffler AG, Herzogenaurach, Germany) and lubricated with a low-temperature grease. Due to the applied thermal load, the grease is used beyond the limits of its specification, which greatly reduces the operating life. The test head shaft runs at a constant speed of 6000 rpm. The axial load is 1500 N and the temperature of the heater on the test bearing is set to 140 °C. The sensor used is a three-axis piezo accelerometer of type PCB-356A15 (PCB Piezotronics, Depew, NY, USA). The sensor is mounted close to the test bearing, as shown in Figure 3.
For the investigations carried out here, only data from the sensor’s X-axis, which is aligned radially to the bearing, is analyzed. The data is acquired at a sampling rate of 20 kHz. An imc CRONOSflex (imc Test & Measurement GmbH, Berlin, Germany) data acquisition system is used in the measuring chain with an 8th-order Cauer low-pass anti-aliasing filter having a cut-off frequency of 8 kHz. The amplitude is resolved with 24 bits. The measurement data is recorded in intervals of 1 s, separated by pauses of 59 s. A total of nine endurance runs are investigated. A threshold value in the power consumption of the driving electric motor is defined as the termination criterion for the experiments. This leads to test run times ranging from 10 h to 20 h. At the end of the tests, the test bearings show very similar damage patterns in the form of pitting. Figure 4 shows an example of the inner ring of one bearing after endurance testing.
  2.2. Data and Labeling
The aim of the procedure used here is to directly infer the bearing’s remaining useful life from the trained ML model. Therefore, a supervised learning approach in the form of a regression is used. The label must represent the progressive bearing damage. As already shown in [20], a label that increases linearly from 0 to 1 is used for this purpose. A similar labeling approach has also been presented in [23]. Figure 5 shows the label based on the structure-borne sound signals of a single test run. From a mathematical point of view, the label can be described as the normalized test run duration. The value 0 represents the original condition of the bearing, while 1 indicates the end of its useful life.
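As a minimal sketch of this labeling, assuming the label is computed from the measurement timestamps of one run (the function name `linear_label` and the 10 h example run are illustrative and not taken from the original work):

```python
import numpy as np

def linear_label(timestamps):
    """Normalized test-run duration: 0 at the first measurement, 1 at the last."""
    t = np.asarray(timestamps, dtype=float)
    return (t - t[0]) / (t[-1] - t[0])

# Hypothetical run: one measurement per minute over 10 h.
t_minutes = np.arange(0, 601)
label = linear_label(t_minutes)
```

Because the total run time is only known after failure, such a label can be assigned to training runs in hindsight but not to a bearing that is still running.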
  2.3. Feature Engineering
A wide variety of feature engineering methods have already been described in the literature [13,24]. The focus of the research conducted here is on the comparison of different feature engineering methods that consider the temporal past in the context of feature generation. As a basis, the so-called averaged-frequency-band (afb) features are used, which have already been shown in [20] to perform particularly well and to be computationally efficient compared to the features proposed by Lei et al. [21]. The studies in [20] were based on the same data set used for the present work. Starting from the afb features, additional features that contain information from the temporal past are generated, and their influence on the RUL prediction is investigated.
  2.3.1. Averaged-Frequency-Band Features
As base features, the so-called averaged-frequency-band features are used, the calculation of which is visualized in Figure 6. To calculate the afb features for each 1 s measurement interval, the data of the interval is first transformed into the frequency domain by means of a fast Fourier transform (FFT). The resulting amplitude spectrum is divided into frequency bands of equal width. Finally, the average values of the amplitudes within these frequency bands are used as features. Thus, an afb feature describes the average amplitude within one frequency band.
Based on preliminary investigations and in order to keep the total number of features and thus the model complexity at a moderate level, the number of frequency bands in this case is set to 8.
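The calculation described above can be sketched as follows. This is an illustrative reading, not the authors' implementation: the function name `afb_features`, the 1/N amplitude scaling, and the choice to spread the bands over the full range up to the Nyquist frequency are assumptions.

```python
import numpy as np

def afb_features(signal, n_bands=8):
    """Mean FFT amplitude in n_bands frequency bands of (nearly) equal width."""
    # One-sided amplitude spectrum; the 1/N scaling is an assumption.
    amplitude = np.abs(np.fft.rfft(signal)) / len(signal)
    # array_split tolerates a bin count that is not divisible by n_bands.
    bands = np.array_split(amplitude, n_bands)
    return np.array([band.mean() for band in bands])

# Synthetic 1 s interval at 20 kHz containing a single 3 kHz tone.
fs = 20_000
t = np.arange(fs) / fs
signal = np.sin(2 * np.pi * 3000 * t)
features = afb_features(signal)
```

With eight bands spanning 0 to 10 kHz, each band is 1.25 kHz wide, so the 3 kHz tone of this synthetic signal dominates the third feature.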
  2.3.2. Rolling Mean Features
To utilize information from the temporal past, rolling means can be used. In the case presented here, these rolling means are calculated from the afb features presented previously. To represent both the short-term dynamics and the long-term behavior, several averages are formed over different time spans. Progressively increasing time spans appear suitable for this purpose, which was confirmed in preliminary studies. The progressive staggering of rolling means is shown in Figure 7 for three rolling mean durations using the time course of afb1(8).
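A minimal sketch of such staggered rolling means with pandas is given below. The window lengths follow the rollingmeans(10, 80, 600) naming used in the results; interpreting them as numbers of consecutive measurements, the `min_periods=1` start-up handling, and the column names are assumptions.

```python
import numpy as np
import pandas as pd

def add_rolling_means(afb, windows=(10, 80, 600)):
    """Append trailing rolling means of every afb column for each window length."""
    parts = [afb]
    for w in windows:
        # Trailing window: only the current and earlier measurements are used.
        # min_periods=1 (an assumption) averages over what is available at the start.
        rolled = afb.rolling(window=w, min_periods=1).mean()
        parts.append(rolled.add_suffix(f"_rm{w}"))
    return pd.concat(parts, axis=1)

# Synthetic stand-in for the afb features of one run (1000 measurements, 8 bands).
rng = np.random.default_rng(0)
afb = pd.DataFrame(rng.random((1000, 8)), columns=[f"afb{i + 1}" for i in range(8)])
features = add_rolling_means(afb)
```

Each of the eight base features gains three smoothed companions, giving 32 columns in total. The short window follows recent changes, while the long window reflects the slow trend.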
  2.3.3. Cumulative Features
Another way to account for temporal information is to use accumulated quantities. The cumulative sum of values was already proposed in [25] to generate features with monotonic behavior. These accumulated features capture long-term trends, which helps the ML algorithm in its decision making. In the case presented here, the afb features are used for accumulation. Each afb feature is summed up cumulatively from the beginning of the experiment.
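The accumulation can be sketched in a few lines of NumPy. The function name `add_cumulative` and the synthetic input are illustrative assumptions; rows are assumed to be ordered in time, starting at the beginning of the run.

```python
import numpy as np

def add_cumulative(afb):
    """Append the running sum of each afb column (rows ordered in time)."""
    afb = np.asarray(afb, dtype=float)
    return np.hstack([afb, np.cumsum(afb, axis=0)])

# Synthetic stand-in for the afb features of one run (500 measurements, 8 bands).
rng = np.random.default_rng(0)
afb = rng.random((500, 8))
features = add_cumulative(afb)
```

Since averaged amplitudes are non-negative, each cumulative column can only grow over time, which gives the monotonic behavior mentioned above.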
  2.4. Machine Learning
The goal of machine learning is to approximate an unknown function that maps the input features to the label. Since the label is defined as a continuous variable, this is supervised learning in the form of a regression [26]. In the field of machine learning, there is a wide variety of regression algorithms [5]. A comprehensive overview of the available deep learning methods can be found in [27]. In [10], a random forest approach has already been used to detect the state of journal bearings. The aim of the present work is to show the influence of targeted feature engineering on RUL prediction performance. Therefore, traceability should be as good as possible. For this reason, deep learning algorithms are not used here; instead, a random forest regressor is chosen. A random forest is an ensemble method based on decision trees [28]. It is considered to be very robust and to provide consistently good results compared to other regression algorithms.
In the workflow used here to investigate feature engineering, the machine learning algorithm is considered as a constant boundary condition. Therefore, the parameters of the random forest are kept fixed. Based on preliminary studies, the number of trees is set to 500, and the maximum tree depth is limited to 20. The models used in this work are implemented in Python using the numpy, pandas, scipy, and matplotlib libraries. Additionally, the library Scikit-learn is used for the implementation of the random forest and metrics for result evaluation.
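For illustration, a random forest with the stated settings could be set up in scikit-learn roughly as follows. Only the number of trees and the depth limit come from the text; the synthetic data and the fixed `random_state` are assumptions made for this sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data: 8 features that drift with the label, plus noise.
rng = np.random.default_rng(0)
y = np.linspace(0.0, 1.0, 300)
X = y[:, None] * np.arange(1, 9) + rng.normal(scale=0.05, size=(300, 8))

# 500 trees and a depth limit of 20 as stated in the text; random_state is an
# added assumption for reproducibility.
model = RandomForestRegressor(n_estimators=500, max_depth=20, random_state=0)
model.fit(X, y)
pred = model.predict(X)
```

A forest prediction is an average of training labels, so the predicted label stays within the range of the labels seen during training, here 0 to 1.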
To evaluate the models built with the different feature engineering methods, a 9-fold cross-validation is used. Out of the total nine endurance test runs available, eight endurance tests are used for training. The remaining test run is used for the test data set, which means that the test data is always completely separated from the training data. This is repeated a total of nine times so that the data from each test is used as independent test data once.
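Holding out one complete test run per fold corresponds to a leave-one-group-out split, with the run index as the group. The sketch below uses scikit-learn's `LeaveOneGroupOut` for this; the choice of this class, the array layout, and the 50 measurements per run are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Hypothetical layout: 9 runs with 50 measurements each; `groups` holds the run index.
n_runs, n_per_run = 9, 50
groups = np.repeat(np.arange(n_runs), n_per_run)
X = np.zeros((groups.size, 8))  # placeholder feature matrix

folds = list(LeaveOneGroupOut().split(X, groups=groups))
held_out = [int(groups[test_idx][0]) for _, test_idx in folds]
```

Splitting by run rather than by individual measurement keeps neighboring, highly similar measurements of the same bearing from appearing in both the training and the test data.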
The quality of the prediction is evaluated using metrics. For this purpose, the MAE and the R² are chosen. The MAE (mean absolute error) provides a directly interpretable measure of the regression quality in the context of the label used here. For example, an MAE of 0.05 means that the predicted bearing condition deviates on average by 0.05, i.e., 5% of the normalized test-run duration, from the true value. Consequently, the MAE tends to 0 for a perfect model. In addition, the R², the coefficient of determination, provides a general measure of the quality of a regression. It tends towards a maximum value of 1 for optimal predictions. The smaller the value, the worse the prediction [29]. For the overall evaluation in the results section below, the average value of the nine metrics calculated during cross-validation is considered. This ensures an evaluation of the model quality based on the entire data set.
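Both metrics are available in scikit-learn. The toy example below, in which the prediction is offset from the true label by a constant 0.05, is purely illustrative; the helper `average_metrics` is a hypothetical name for the averaging over folds.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Toy example: the predicted label is offset from the true label by a constant 0.05.
y_true = np.linspace(0.0, 1.0, 101)
y_pred = y_true + 0.05

mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

def average_metrics(fold_metrics):
    """Average a list of (mae, r2) tuples, one per cross-validation fold."""
    return tuple(np.mean(fold_metrics, axis=0))
```

In this toy case the MAE is 0.05 and the R² is roughly 0.97, which shows that a constant offset of 5% of the run duration already lowers the R² noticeably below 1.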
  2.5. RUL Prediction
To infer the predicted remaining useful life $\mathrm{RUL}_{\mathrm{pred}}$ from the label predicted by the regression, the following equation is used:
$$\mathrm{RUL}_{\mathrm{pred}} = \frac{t}{y_{\mathrm{pred}}} - t \qquad (1)$$
Here, $t$ is the current operating time and $y_{\mathrm{pred}}$ is the label predicted at the corresponding time. This relationship follows from the definition of the label as the normalized test-run time. At the beginning of the measurement, where the predicted label $y_{\mathrm{pred}}$ is close to zero, RUL prediction is not practical due to large inaccuracies, which follows directly from Equation (1): dividing by a small $y_{\mathrm{pred}}$ turns slight variations in the predicted label into very large fluctuations in the RUL prediction. For this reason, RUL prediction is evaluated exclusively for the second half of the test runs. The result evaluation by means of the RUL-based MAE is also performed exclusively on the second half of the test runs.
To compare the predicted with the true remaining useful life $\mathrm{RUL}_{\mathrm{true}}$, the latter must also be calculated. This is done using the total operating time until bearing failure $t_{\mathrm{total}}$ and the true label at the respective time $y_{\mathrm{true}}$:
$$\mathrm{RUL}_{\mathrm{true}} = t_{\mathrm{total}} \left(1 - y_{\mathrm{true}}\right) \qquad (2)$$
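The conversion from label to RUL can be sketched as follows. The sketch assumes that, with the label defined as the normalized test-run time, the predicted RUL is the current operating time divided by the predicted label minus the current operating time, and the true RUL is the total run time multiplied by one minus the true label. This reading is reconstructed from the description in the text; the function names and the 15 h example run are hypothetical.

```python
import numpy as np

def rul_from_label(t, y_pred):
    """Predicted RUL, assuming the label is the normalized run time: t / y_pred - t."""
    return t / y_pred - t

def true_rul(t_total, y_true):
    """True RUL from the total run time and the true label."""
    return t_total * (1.0 - y_true)

# Hypothetical 15 h run, one measurement per minute.
t_total = 15.0
t = np.linspace(0.0, t_total, 901)
y_true = t / t_total

# Evaluate only the second half of the run, as in the text.
second_half = t >= t_total / 2
rul_true = true_rul(t_total, y_true[second_half])
rul_pred = rul_from_label(t[second_half], y_true[second_half])  # perfect label

# Sensitivity to a label error of 0.01, early vs. late in the run.
early = rul_from_label(0.3, 0.03) - rul_from_label(0.3, 0.02)   # about -5 h
late = rul_from_label(13.5, 0.91) - rul_from_label(13.5, 0.90)  # about -0.16 h
```

The same label error of 0.01 shifts the predicted RUL by about 5 h shortly after the start but by only about 0.16 h near the end of the run, which illustrates why the evaluation is restricted to the second half.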
  3. Results
The previously presented feature engineering approaches are now compared to each other. In detail, the three approaches listed in Table 1 are considered.
The first feature set is denoted as afb(8). No temporal past information is used with this feature set. It therefore serves as a reference. For the second feature set, the afb(8) approach is combined with the rollingmeans(10, 80, 600) approach. The third feature set is a combination of the reference features afb(8) and the cumsum approach.
To compare the three feature sets mentioned above, the workflow shown in Figure 1 is used, keeping the boundary conditions constant. The resulting regression and RUL predictions are shown in Figure 8 using a single test data set. While the plots on the left show the regression results of the trained machine learning algorithm, the plots on the right visualize the RUL prediction derived from it. The metrics MAE and R² of the visualized results are given within Figure 8.
The prediction scatters strongly when using only afb(8) features, see Figure 8a. Several states can be identified in the predicted label, between which the prediction changes quite abruptly. In the case of an ideal prediction, all prediction points (blue) would lie exactly on the reference line. Thus, the vertical deviation of the predictions from the reference line visualizes how inaccurate the prediction is. The same applies to the RUL plots. Here, with optimal prediction, the test data points would align with the orange line, representing the true RUL. The corresponding RUL prediction using the afb(8) features is very inaccurate due to the large prediction spread of the regression results and poorly represents the true RUL, as can be seen in Figure 8b. A significant improvement in prediction quality is achieved by adding the rolling means, as shown in Figure 8c,d. On average, the prediction shows similar trends, but is much less scattered. This is evident not only in the predicted label, but also in the resulting RUL prediction. Further improvement of the results is achieved with the combination of the afb(8) and cumsum features, which is visualized in Figure 8e,f. The steps visible with the other two feature sets disappear almost completely here. These improvements can be determined not only visually, but also from the calculated metrics: smaller MAEs and larger R² values reflect the improved predictions.
Since Figure 8 only illustrates one of the nine cross-validation runs, the overall cross-validation results are summarized in Table 2. For this purpose, the averages of the regression MAE and the regression R² calculated via cross-validation are listed. Additionally, the averaged MAE of the RUL prediction as well as the relative deviation of the MAE with respect to the test run times are evaluated in the last two rows.
Looking at the averaged metrics from cross-validation, the results already obtained in Figure 8 are supported. Adding the rolling mean features to the afb(8) features yields a significant improvement, with the cumulative features performing even better than the rolling mean features. In the MAE of the individual test bearings’ RUL, it is noticeable that this ranking of model performance does not apply quantitatively in the same way for each test bearing. Consequently, there is a non-negligible dispersion across the individual test data sets. A possible explanation for this dispersion is the different physical wear behavior in the various bearing endurance test runs.
  4. Discussion
Based on the experimental data used, the results presented show that a clear improvement in RUL prediction is possible with the help of temporal information, implemented by means of time-considering features. By using a well-defined workflow where only the feature sets are changed, the impact of the different features on the RUL prediction performance is evaluated. For the RUL prediction, a random forest regression approach is used. Comparing the two presented ways of incorporating temporal past information in the form of extended feature sets, the approach of cumulatively generated features performs particularly well. By using this extended feature set, the averaged MAE of RUL predictions can be reduced by 37% in comparison to the use of base features only. Calculating rolling means with progressively staggered window widths also adds value in terms of predictive accuracy, although the results are slightly worse than those obtained with the cumulative approach. In the case presented here, the base features are formed from the so-called averaged-frequency-bands, which have already been shown to perform particularly well on the data used in [20]. The authors assume that the methodology presented here will lead to improved RUL predictions for other base features in an analogous manner. A validation of this hypothesis is still pending at this point.
It should be noted that the evaluations carried out here are based on test data obtained on a rolling bearing test rig under constant operating conditions. Limitations are to be expected when implementing the methodology proposed here in a real application, with varying boundary conditions such as speeds, loads or temperatures. In particular, the formation of accumulated features could be error-prone, since each individual point in time has an influence on the entirety of the following time span. Thus, continuous and reliable measurement data acquisition is indispensable for the correct determination of accumulatively formed features.
Future work can investigate further approaches of feature engineering and the possibilities of considering temporal information. The implementation of further RUL prediction methods and the possibilities of deep learning algorithms have been omitted here in order to focus on the integration of temporal information via extended feature engineering approaches. For comparison purposes, it seems reasonable to also consider deep learning methods, such as CNNs, RNNs or LSTMs, which natively offer the possibility to take temporal information into account. However, with these methods, the comprehensibility of decision making is lost. Furthermore, with regard to hybrid models, it seems promising to motivate the development of novel features by physical backgrounds. The investigations should also be extended to additional data that are recorded at non-constant bearing operating conditions. In order to achieve satisfactory RUL prediction results even at non-constant operating conditions, the methods may have to be extended.