Classifying Multivariate Signals in Rolling Bearing Fault Detection Using Adaptive Wide-Kernel CNNs

Abstract: With the developments in improved computation power and the vast amount of (automatic) data collection, industry has become more data-driven. These data-driven approaches for monitoring processes and machinery require different modeling methods focusing on automated learning and deployment. In this context, deep learning provides possibilities for industrial diagnostics to achieve improved performance and efficiency. These deep learning applications can be used to automatically extract features during training, eliminating time-consuming feature engineering and prior understanding of sophisticated (signal) processing techniques. This paper extends on previous work, introducing one-dimensional (1D) CNN architectures that utilize an adaptive wide-kernel layer to improve classification of multivariate signals, e.g., time series classification in fault detection and condition monitoring context. We used multiple prominent benchmark datasets for rolling bearing fault detection to determine the performance of the proposed wide-kernel CNN architectures in different settings. For example, distinctive experimental conditions were tested with deviating amounts of training data. We shed light on the performance of these models compared to traditional machine learning applications and explain different approaches to handle multivariate signals with deep learning. Our proposed models show promising results for classifying different fault conditions of rolling bearing elements and their respective machine condition, while using a fairly straightforward 1D CNN architecture with minimal data preprocessing. Thus, using a 1D CNN with an adaptive wide-kernel layer seems well-suited for fault detection and condition monitoring. In addition, this paper clearly indicates the high potential performance of deep learning compared to traditional machine learning, particularly in complex multivariate and multi-class classification tasks.


Introduction
In the current industrial era, manufacturers rely more and more on the use of sensors for data collection and analysis. These developments push the industry towards a newer standard, known as Industry 4.0 [1,2]. These new approaches for improving performance and increasing production efficiency require scalable methods to process and explain the collected (often complex) data such as multivariate time series. With improved computational power, automated methods are easier to deploy while requiring less background knowledge on the engineering and physics part of the machinery, resulting in more efficient and "intelligent" approaches [3,4]. Due to the improved availability of large datasets derived from sensors, these automated learning techniques, e.g., deep learning, provide strong performance in signal classification tasks. This work focuses on the following contributions:
1. We investigate the performance of wide-kernel CNNs [10] in several settings and adaptations designed to process multivariate time series derived from various experiments. For this, we demonstrate how high performance in multi-class signal classification tasks can be achieved.
2. We propose a method for implementing an adaptive wide-kernel in the first convolutional layer that is able to transform any form of sequential sensor data without any dimensionality reduction, thereby abiding by the principles of a wide-kernel convolutional layer as proposed in [10,11]. We implement this in two models.
3. We evaluate our model options thoroughly on multiple datasets in varying contexts in order to show their generalizability, both with and without large amounts of training data.
4. Finally, we illustrate the impact of model settings and architecture adaptations on model performance, resulting in a model that is streamlined in terms of both performance and computational efficiency.
The rest of the paper is structured as follows: We discuss related work in Section 2, where we also briefly summarize the necessary background on deep learning. This is followed by the introduction of the proposed models in Section 3. Next, we describe the experiments and findings in depth in Section 4. Finally, Section 5 concludes with a discussion and summary and also outlines some promising future prospects.

Related Work and Background
This section covers relevant work relating to maintenance of industrial machinery connected to fault detection and condition monitoring, and describes related deep learning methods in the section below.

Maintenance
The usage of machinery is essential in industrial applications. Failure of these machines due to wear of the underlying elements is one of the most prevalent concerns in industry; therefore, equipment maintenance is critical in preventing malfunctions and, as a result, minimizing downtime. According to [12], maintenance expenses range from 15% to 60% of the total cost of the manufactured goods. Around 33% of maintenance expenses are directly connected to redundant and incorrect equipment maintenance within these margins. As a result, lowering the expenses of costly maintenance might substantially reduce overall production costs by improving equipment productivity [13].
According to [14,15] there are three distinct techniques for maintaining equipment: (1) modificative maintenance, in which components are upgraded to improve machine productivity and performance, (2) preventive maintenance, in which a component is replaced just before it fails and (3) break-down corrective maintenance, in which a part is replaced after it fails, leading to downtime of the machine. In this paper, we concentrate on (2) preventive maintenance, which itself is separated into two types: usage-based maintenance (UBM) and condition-based maintenance (CBM).
The UBM method relies entirely on arranging maintenance visits by the engineer when a specific threshold of consumption is reached. In practice, this implies that visits are scheduled with a certain interval between them, comparable to a yearly automobile inspection. This technique results in relatively little equipment downtime, which is good for production. However, this method has two major disadvantages: the high expenses of maintenance visits and the replacement of parts that are still usable. As a result, in many industrial applications, CBM is the recommended maintenance approach, making use of data-driven methods and approaches, e.g., cf. [14,16,17].
CBM assesses the current state of equipment to identify if maintenance is required. The concept behind CBM is to only execute maintenance when specific parameters, e.g., deviating behavior in the data, indicate a reduction in performance or a predicted rise in failures. This means fewer maintenance visits and more efficient usage of the underlying components, which in turn leads to lower overall maintenance costs.

Fault Detection and Condition Monitoring
Within CBM, fault detection and condition monitoring are common approaches for rotating industrial equipment where faults regularly occur [18]. In the past, this was achieved using physics-based models, which require background knowledge on the underlying processes. These models hardly adapt to changing circumstances and increments in the amount of data and variables [19]. Innovations in data-driven analytics and the advances in the Industrial Internet of Things (IIoT) have shifted the area of fault detection and condition monitoring towards a more intelligent approach [4,20]. These methods allow for automated data processing without prior understanding of technical elements of industrial machinery, while being easily adaptable to changing operating conditions.
Because of the increased availability of large-scale time series datasets and better processing capacity, the usage of deep learning applications has grown in popularity. These time series are recorded by sensors, which are increasingly being used for fault detection and condition monitoring. When elements of equipment decay over time, for example, the analog metrics of the machine will not immediately reflect this; however, increased power usage (motor current), vibrations or temperature of machine elements monitored with internal and external technologies, such as sensors, might indicate that the underlying parts need to be replaced [4,9,10]. These signals derived from sensors can be converted into numerical time series data for subsequent study. However, for reliable fault detection, considerable effort in feature extraction is typically required.
Traditional methods for extracting representative features to classify signals include time-domain analysis (e.g., statistical measures such as mean and standard deviation), frequency-domain analysis (e.g., Fourier transformations [7], see Figure 1) and time-frequency domain analysis (e.g., wavelet transformations [21,22]). As one could expect, the quantity of features derived from the different domains results in a high-dimensional dataset. Therefore, features are selected [23] and techniques are frequently used to decrease the dimensionality of these features, such as principal component analysis (PCA) [24,25] or linear discriminant analysis (LDA) [26]. Furthermore, [27], for example, utilized information entropy to preprocess the original time series data. For standard (non-deep learning) machine learning approaches, the essential preprocessing steps that precede supplying the final dataset to a classifier typically take a substantial amount of time and require high-level knowledge of signal processing and data processing. Additionally, the feature extraction process is influenced by the type of data gathered from a particular machine or sensor; for instance, vibration sensors require different preprocessing steps than analog sensors.
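To make the manual feature extraction step concrete, the following minimal sketch computes a few time-domain statistics and a frequency-domain feature for one signal segment. It uses only the Python standard library and a naive discrete Fourier transform; the function names, the choice of features and the example signal are illustrative assumptions and are not taken from the experiments in this paper.

```python
import cmath
import math
import statistics


def time_domain_features(signal):
    """Return simple statistical descriptors of one signal segment."""
    return {
        "mean": statistics.fmean(signal),
        "std": statistics.pstdev(signal),
        "rms": math.sqrt(sum(v * v for v in signal) / len(signal)),
    }


def magnitude_spectrum(signal):
    """Naive O(n^2) discrete Fourier transform; returns |X[k]| / n for k < n/2."""
    n = len(signal)
    spectrum = []
    for k in range(n // 2):
        coeff = sum(
            v * cmath.exp(-2j * math.pi * k * t / n) for t, v in enumerate(signal)
        )
        spectrum.append(abs(coeff) / n)
    return spectrum


def dominant_frequency(signal, sampling_rate):
    """Frequency in Hz of the largest spectral peak, ignoring the DC bin."""
    spectrum = magnitude_spectrum(signal)
    peak_bin = max(range(1, len(spectrum)), key=spectrum.__getitem__)
    return peak_bin * sampling_rate / len(signal)


# Hypothetical example: a 50 Hz sine sampled at 1 kHz for 200 samples.
example = [math.sin(2 * math.pi * 50 * t / 1000) for t in range(200)]
features = time_domain_features(example)
features["dominant_hz"] = dominant_frequency(example, sampling_rate=1000)
```

In practice a fast Fourier transform would replace the naive loop, and many more features per domain would be computed, which is what leads to the high-dimensional feature sets mentioned above.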
Within fault detection and condition monitoring, many different classifiers are researched, including k-nearest neighbors (K-NN) [28,29], support vector machines (SVM) [21,30,31,32], artificial neural networks [33,34] and interpretable machine learning methods such as random forests (RF) [35]. The performance of these techniques varies a lot depending on the data quality, thoroughness of the feature extraction process and complexity of the classification task; therefore, it is often difficult to find the right classifier for the task at hand. In other words, there is not one particular machine learning classifier that is most capable of distinguishing different fault conditions. As a result, a comparison between classifiers is deemed necessary for every fault detection task to find the best model. This work concentrates on deep learning applications, and more specifically on the use of one-dimensional CNNs in the context of fault detection utilizing time series data, i.e., multivariate signal data. To evaluate our proposed deep learning techniques, we used renowned benchmark datasets for fault detection and condition monitoring in various settings. We look at the generalizability of these techniques, their performance with limited training data and compare them to traditional machine learning approaches such as k-nearest neighbors, random forests and support vector machines.

Deep Learning
In general, deep learning methods offer strong processing and learning on complicated data. For instance, automatic feature generation and refinement techniques, such as for complicated classification problems, may typically leverage connections in the data in order to retrieve valuable information from the data into redefined structures. Using complicated multivariate signal data for fault detection and condition monitoring is an example of utilizing these complex data structures. Overall, there has been a lot of interest in utilizing neural networks for such complicated classification problems during the previous decade.
Initially, the multi-layer perceptron (MLP) was used, in which all network layers are fully linked [36]; however, because of the significant increase in calculation time, the depth of these networks is restricted. Thus, in the past years, more advanced neural network architectures were developed to accommodate for this.
The creation of recurrent neural networks (RNN), such as long short-term memory (LSTM) networks, has yielded promising results since they are able to account for time dependencies and therefore can handle time series data and signals very well [37,38]. However, because of memorizing long-term time dependencies, RNNs use vast amounts of memory (RAM). So, these models are less suited for long sequence data due to increased training times. This is especially the case for signal data from sensors, which is often sampled at a high frequency consisting of many data points. To tackle the training issue of RNNs, combined models utilizing autoencoders as feature extractors were created [39]. These models enhance computation but also increase the model's complexity and decrease its interpretability.
Deep learning algorithms have been used to detect faults and monitor machine conditions many times. The MLP [40] was one of the earliest deep learning applications in fault detection and condition monitoring. Later, RNNs [37] and CNNs [4,20,41,42] became more common in fault detection, where they have exhibited significant performance increases. Further, CNN approaches combined with data transformations, e.g., spectrograms, have been proposed several times [3,43]. Ref. [44] was successful in the creation of a 1D CNN that is able to handle raw signals by integrating automated feature extraction with time series classification. These 1D CNNs can also withstand noise effectively and can be trained with small amounts of data [45].

Overview-Convolutional Neural Networks
A convolutional neural network (CNN), in general, is a regularized MLP that specializes in processing two-dimensional inputs such as picture pixels and color channels. CNNs have previously proven to be effective in computer vision tasks including image classification and video identification [46,47].
The main advantages of a CNN, compared with a traditional neural network such as an MLP, are the use of local receptive fields, weight-sharing and sub-sampling. In particular, weight-sharing significantly reduces memory requirements and therefore improves algorithmic efficiency [48]. Commonly, a convolutional layer consists of three phases. The first phase performs a number of convolutions, followed by the second phase that consists of an activation function. Afterwards, a pooling function is applied [48].
Before employing a CNN on one-dimensional data, e.g., time series, the data has to be converted using signal processing techniques into a two-dimensional representation in the time-frequency spectrum or using wavelet transforms [22,49,50]. For example, one-dimensional signals can be transformed into two-dimensional spectrograms, which in turn can be fed as an image to the CNN. This approach is not able to process raw signals directly, thus contradicting the advantages of employing deep learning applications over standard machine learning approaches. The one-dimensional (1D) CNN was created to tackle this challenge by integrating automated feature extraction for time series classification tasks [44].
These models are good at handling noise in time series and can be trained with different data sizes, while being less computationally heavy compared to RNNs or MLPs. As a result, 1D CNNs are becoming more and more applied in time series classification tasks such as fault detection and condition monitoring [45].

Convolutional Layer
A convolutional layer convolves the input with filter kernels followed by the activation unit to generate output features. Each of these filters uses the same kernel to extract local features from the input local region, which is called weight-sharing. Results of the convolutional operations across the input are fed to the activation function that leads to the output features. The convolution operation is described as:
$$y_i^{l+1}(j) = k_i^l * M^l(j) + b_i^l$$
Here, $b_i^l$ denotes the bias and $k_i^l$ denotes the weights of the $i$-th filter kernel in layer $l$. $M^l(j)$ describes the $j$-th local region in layer $l$. $(*)$ represents the convolution operation that computes the dot product of the kernel and the local regions. $y_i^{l+1}(j)$ denotes the input of the $j$-th neuron in feature map $i$ of layer $l + 1$.
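As a minimal sketch of this operation, the following pure-Python function slides one shared kernel over the input and computes, for every local region, the dot product with the kernel plus the bias. The function name, the absence of padding and the example values are assumptions made for illustration.

```python
def conv1d(signal, kernel, bias=0.0, stride=1):
    """1D convolution without padding (cross-correlation form, as used in CNNs).

    Every output value is the dot product of the shared kernel with one
    local region of the input, plus the bias term.
    """
    width = len(kernel)
    outputs = []
    for start in range(0, len(signal) - width + 1, stride):
        local_region = signal[start:start + width]
        outputs.append(sum(k * m for k, m in zip(kernel, local_region)) + bias)
    return outputs


# Hypothetical input: a linear ramp and a simple difference kernel.
signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
feature_map = conv1d(signal, kernel=[1.0, 0.0, -1.0], bias=0.5)
```

Because the same kernel weights are reused at every position, the number of parameters per filter equals the kernel width plus one bias, independent of the input length.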

Activation Function
The activation function is embedded in every convolutional layer to acquire nonlinear features from the input after the convolutional operation. Depending on the input and the task at hand, there are several different activation functions available; however, in recent years, the rectified linear unit (ReLU) has proven to be efficient in its computations and is therefore the most commonly used activation function. In this study, the ReLU activation function was used in the convolutional layers and can be described as follows:
$$f(x) = \max\bigl(0,\; x + \mathcal{N}(0, \sigma(x))\bigr)$$
Here, $x$ represents the outputs of the convolutional operation $y_i^{l+1}(j)$ and $\mathcal{N}(0, \sigma(x))$ represents Gaussian distributed noise with mean 0 and variance $\sigma(x)$, which has proven to make optimization easier [51].

Pooling Layer
The outputs of the convolutional layer and activation function are usually fed to a pooling layer (also known as a sub-sampling layer). This layer reduces the spatial size of the input features by a down-sampling operation and decreases the number of parameters and computations in the network. There are different pooling functions such as max-pooling and average-pooling. The pooling function performs a local operation over the input features resulting in a representation that becomes invariant to small translations of the input. In general, the pooling function can be denoted as:
$$p_i^l = S(\omega_i \cdot M_i + b_i)$$
$S(\cdot)$ denotes the pooling operation where values of the convolved features are computed on different locations. For every layer $l$, the $i$-th weight matrix is denoted as $\omega_i$. $M_i$ represents the outputs of the convolutional layer (feature map) and $b_i$ denotes the bias. These calculations then result in the compressed feature representation $p_i^l$ given above.
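The two pooling functions named above can be sketched as follows. This is a minimal illustration with non-overlapping windows; the function name and the example feature map are assumptions.

```python
def pool1d(feature_map, size=2, mode="average"):
    """Non-overlapping pooling over a 1D feature map.

    mode="average" returns the mean of each window, mode="max" its maximum.
    Trailing values that do not fill a complete window are dropped.
    """
    pooled = []
    for start in range(0, len(feature_map) - size + 1, size):
        window = feature_map[start:start + size]
        pooled.append(max(window) if mode == "max" else sum(window) / size)
    return pooled


# Hypothetical feature map of length 6, pooled with window size 2.
feature_map = [1.0, 3.0, 2.0, 6.0, 5.0, 5.0]
averaged = pool1d(feature_map, size=2, mode="average")
maximum = pool1d(feature_map, size=2, mode="max")
```

With a window size of two, the length of the feature map is halved, which is the reduction that a pooling layer with this window size applies after a convolutional layer.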

Method
The design of a CNN and the quality of the data have a significant impact on its classification performance; for instance, sensor signals from industrial machinery regularly contain significant levels of noise. Previous work showed the strong performance of using a wide kernel in the first convolutional layer, followed by smaller kernels in the follow-up layers, for detecting faults and classifying conditions of rotating equipment [10,11].
However, these methods were not able to automatically scale with varying data inputs; therefore, we extend those approaches in this work by developing an adaptive wide-kernel layer that extracts features and filters out signal noise, transforming the input without any representation loss.
The general idea behind this adaptive layer is that depending on its input data, it will transform the data into an n-dimensional matrix without losing any information, e.g., the dot product of the first layer's output is equal to the dot product of the input data. Hence, the layer functions as a feature extractor without reducing its dimensionality, which results in a feature set extracted from low frequency bands [10].
This adaptive layer does require a set of rules to be adhered to. Otherwise, the output of the first wide-kernel layer does not correspond with the input data. The rules for the adaptive layer are summarized as follows:
• The sequence length and the width of the first kernel should each be a power of two.
• The width of the first kernel should be higher than the denominator for calculating the stride in the first layer (default = 4).
• The kernel width of the first layer should not exceed the sequence length.
In this case, we first calculate the stride of the first layer, $F_s = K / x$, where $K$ is the kernel size of the first layer and $x$ is the denominator used to calculate the stride, with a default setting of 4. Together with the sequence length $S$ of a signal, this results in a penalty function $\delta$. Based on these values, the number of filters $F$ used in the first layer of the model is calculated, where $TS$ denotes the number of time series available in the data, for example, the number of sensors. One of the main properties of this calculation is that the number of filters will increase when the kernel size, and therefore the stride, is set to a higher level, while the number of filters decreases with a lower value. This aligns with the idea that if the kernel size reduces the local information, there will be multiple versions of those local combinations. If the kernel size increases, there are fewer steps that the kernel is applied to per filter. Therefore, increasing the number of filters will equalize this deviation.
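The rules and the stride calculation can be sketched as follows. The stride follows the description above (kernel size divided by the denominator). The filter count in this sketch is an assumption, not the filter equation of this paper: it is chosen so that the number of elements in the first layer's output equals the number of elements in the input, which is one way to read the requirement that no information is lost. The function and parameter names are hypothetical, and the output length assumes padding such that the layer produces one output per stride step.

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ..."""
    return n > 0 and (n & (n - 1)) == 0


def adaptive_first_layer(seq_len, kernel_size, n_series, denominator=4):
    """Validate the rules for the adaptive layer and derive its settings.

    ASSUMPTION: filters = stride * n_series keeps the element count of the
    output equal to that of the input; this is not the paper's equation.
    """
    if not (is_power_of_two(seq_len) and is_power_of_two(kernel_size)):
        raise ValueError("sequence length and kernel width must be powers of two")
    if kernel_size <= denominator:
        raise ValueError("kernel width must be higher than the stride denominator")
    if kernel_size > seq_len:
        raise ValueError("kernel width must not exceed the sequence length")
    if kernel_size % denominator != 0:
        raise ValueError("denominator must divide the kernel width")

    stride = kernel_size // denominator   # stride of the first layer
    output_length = seq_len // stride     # assumes one output per stride step
    filters = stride * n_series           # assumption, see docstring
    return {"stride": stride, "output_length": output_length, "filters": filters}


# Settings named in the text: 2048-point sequences, kernel size 64, two sensors.
cfg = adaptive_first_layer(seq_len=2048, kernel_size=64, n_series=2)
```

Under these assumptions, a larger kernel yields a larger stride, a shorter output and more filters, which matches the qualitative behavior described above.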
We implemented this adaptive layer structure on several fault detection and condition monitoring tasks in the form of a time series classification task. In this section, the proposed 1D CNN models are described in more detail. In addition, a supplementary page (https://github.com/JvdHoogen/Adaptive-WCNN) is made available containing the code for the proposed models.

Adaptive Wide-Kernel CNN (A-WCNN)
Similar to our previously proposed WDTCNN model [11], the adaptive wide-kernel CNN (A-WCNN) contains five convolutional layers followed by two fully connected layers; however, the A-WCNN is able to adapt its first wide-kernel layer based on the dimensionality of the input data. So, this hybrid version can be easily deployed on different datasets with other dimensionalities, without having to manually adjust the architecture of the model. After each convolutional layer, the model utilizes local average pooling to halve the vector size of the convolutional output with length $T$, resulting in a pooled output length of $T/2$.
In our experiments, we assessed the performance of the A-WCNN under varying circumstances using multiple sensors. In addition, the model generalizes more easily and allows different kernel sizes; the kernel size is set to 64 in this paper. The architecture of the A-WCNN based on a two-dimensional time series input, segmented into sequences of 2048 data points, is shown on the left side of Figure 2.

Adaptive Wide-Kernel Multichannel CNN (Ada-WMCNN)
As we have presented and shown in our previous work, a wide-kernel deep multichannel CNN (WDMTCNN) [11] is able to better process and classify signals due to its separate feature extraction between the different time series dimensions. The model is fed by separate inputs each representing a univariate time series. These time series are processed at the same time by the separate CNNs, which are concatenated in the last stages of the model, e.g., before the fully connected layers. This approach leads to completely independent feature representations of the separate time series, which has proven to perform better in [11]. One main issue with the multichannel approach is that the model is less scalable than the non-multichannel versions, since the number of separate CNNs increases with every additional feature in the multivariate time series, resulting in a complex model with many more parameters. In many cases this leads to longer processing times while performance gains are not always evident.
We adapted the WDMTCNN by only implementing a separate CNN structure in the first wide-kernel layer. This optimized model is still able to extract the specific characteristics of every individual time series, e.g., features from low-frequency bands, but concatenates directly afterwards, resulting in fewer parameters and therefore a less computationally heavy model. To compare its performance properly with our previous multichannel model (WDMTCNN) that runs completely separate CNNs, which is called the A-WMCNN in this paper, we experimented with both model settings to show its behavior across multiple datasets and multiple experiments. In addition, our adapted multichannel model also exploits local average pooling after every convolutional layer, reducing the vector size by a factor of two. A complete overview of the Ada-WMCNN can be seen on the right side of Figure 2.

Parameter Settings
Besides the models' architecture, parameter settings are vital in finding the most optimal classification performance. In this section, we examine the parameter settings of both models.
• Normalization: The distribution of each layer's input varies throughout training, slowing down the process. Batch normalization is a technique for minimizing the influence of internal covariate shift on the training process, thus speeding it up [52]. We applied batch normalization right after the convolutional layer. For each input $x$, batch normalization normalizes each dimension $k$ using the expectation and variance computed over the training set:
$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$$
When it comes to nonlinear representations, normalizing each input may modify what the layer represents. For each activation, batch normalization has a scale and shift function to account for this constraint:
$$y^{(k)} = p^{(k)} \hat{x}^{(k)} + q^{(k)}$$
where $p^{(k)}$ and $q^{(k)}$ scale and shift each activation $x^{(k)}$, respectively. These parameters are learnt in tandem with the model's original parameters, restoring the network's representation power.
• Fully Connected Layer: Before the final classification layer, the models are provided with a fully connected layer. This layer is used to identify global compositions of the final convolutional output and is equivalent to a fully connected layer in a multilayer perceptron. The fully connected layer is denoted as follows:
$$y^{l+1} = f(\omega^l * x^l + b^l)$$
for layer $l$, where $\omega^l$ is the weight matrix and $x^l$ is the input for the fully connected layer, represented as a flattened vector of the output from the previous pooling layer. A dot product operation is denoted by $*$, while a bias term is denoted by $b^l$. The Sigmoid activation function employed in the layer is represented by $f(\cdot)$.
• Classifier: In the last layer of both models, a Sigmoid classifier is used, which is expressed as:
$$\sigma(x_i) = \frac{1}{1 + e^{-x_i}}$$
where $x_i$ is the value of the preceding layer's $i$-th output, and $e$ is Euler's number. This results in output values that are independent of each other and are not constrained to sum up to 1. Therefore, we chose Sigmoid because it excels in classifying signals of any length with modest or no variations across classes, necessitating the calculation of each probability separately.
• Optimization: An Adam stochastic optimizer is used to optimize the networks. It uses an adaptive learning rate for each individual parameter. The optimizer is computationally efficient and memory-light, making it ideal for models with a large number of trainable parameters that process high-dimensional data. Adam is particularly well suited for noisy and non-stationary signals [53]. Adam optimization can be denoted as:
$$g_t = \nabla_\theta f(\theta_{t-1})$$
$$m_t^{\theta} = \beta_1 m_{t-1}^{\theta} + (1 - \beta_1)\, g_t$$
$$v_t^{\theta} = \beta_2 v_{t-1}^{\theta} + (1 - \beta_2)\, g_t^2$$
$$\hat{m}^{\theta} = \frac{m_t^{\theta}}{1 - \beta_1^t}, \qquad \hat{v}^{\theta} = \frac{v_t^{\theta}}{1 - \beta_2^t}$$
$$\theta_t = \theta_{t-1} - \alpha \frac{\hat{m}^{\theta}}{\sqrt{\hat{v}^{\theta}} + \epsilon}$$
At step $t$, $f(\cdot)$ represents the stochastic objective function with parameters $\theta_t$, yielding gradient $g_t$. The first and second biased moment estimates are designated as $m_t^{\theta}$ and $v_t^{\theta}$, respectively, where the decay rates are $\beta_1 = 0.9$ and $\beta_2 = 0.999$. The bias-corrected moment estimates are represented by $\hat{m}^{\theta}$ and $\hat{v}^{\theta}$. Using a default learning rate of $\alpha = 0.001$ and $\epsilon = 10^{-8}$, the corrected estimates are utilized to update the parameters $\theta_t$.
• Loss Function: Both networks construct the loss function using mean squared error (MSE), which is more often employed in regression analysis than in classification tasks. MSE, on the other hand, may produce fewer differences between values, which we think is beneficial for our experiments, since certain fault circumstances seem to be comparable. The MSE is calculated as the average of the squared differences between the predicted values $\hat{y}$ and the actual values $y$ over $n$ samples, so that greater discrepancies are penalized more heavily. MSE is expressed as follows:
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
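The classifier, loss function and optimizer settings described in this section can be illustrated with a minimal scalar sketch. It uses the standard Sigmoid, MSE and Adam update with the hyperparameter values stated above; the function names and the single-parameter example are assumptions for illustration and do not reproduce the training code of the proposed models.

```python
import math


def sigmoid(x):
    """Sigmoid output for one value; outputs are independent per class."""
    return 1.0 / (1.0 + math.exp(-x))


def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def adam_step(theta, grad, m, v, t,
              alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter at step t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * grad        # first (biased) moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second (biased) moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Hypothetical single update: parameter 0.5 with gradient 2.0 at step 1.
theta, m, v = adam_step(theta=0.5, grad=2.0, m=0.0, v=0.0, t=1)
loss = mse([1.0, 0.0], [sigmoid(0.0), sigmoid(0.0)])
```

In the first step the bias correction makes the update size approximately equal to the learning rate, regardless of the gradient magnitude.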

Software and Resources
In this research, we used Python combined with TensorFlow and Keras to develop the proposed models. In addition, all other algorithms, such as the machine learning models, are derived from the scikit-learn (Sklearn) package. Calculations were performed with the support of NumPy and table formatting with pandas. To optimize processing time, the data was standardized using the StandardScaler algorithm provided in the Sklearn package. Furthermore, to reduce the overall training time, the models were trained on a dedicated server with two Intel Xeon CPUs (3.2 GHz), 256 GB RAM and an NVIDIA Quadro RTX 6000 (24 GB) GPU. After training, the models are fairly small (between 1 and 2 MB) and are deployable on standard PC hardware as well as edge computing platforms.

Results
To evaluate the proposed models, we used data from the Case Western Reserve University (CWRU) [54] and Paderborn University [8] reflecting bearing fault experiments of rotating equipment. In this section, both experiments and datasets are described in more detail. Furthermore, all results are discussed in this section.

CWRU Bearing Dataset
The CWRU bearing dataset represents a bearing fault experiment where damages were applied to the bearing element. To measure vibration signals, sensors were placed on the drive end and fan end of the machine; we refer to [54,55] for more detail. The test rig is also displayed at: https://engineering.case.edu/bearingdatacenter/apparatus-and-procedures; Figure 3 shows an example of the corresponding signals. These vibrations are digitized into two time series that we segmented into sequences of 2048 data points without overlap. The data from the experiments used in this study is divided into two categories, 12k fan end and 12k drive end, both utilizing a sampling rate of 12 kHz. The two experiments have different properties, resulting in different datasets with varying numbers of classes. Within these datasets, the vibrations are measured at different motor speeds: 1797, 1750, 1730 and 1772 revolutions per minute (RPM). Next to that, there is a normal condition for every motor speed. Overall, the CWRU bearing dataset [54,55] has been used in many fault detection and condition monitoring studies, e.g., [10,43,45,56,57,58]. It can be considered a de facto benchmark dataset for fault detection because it is publicly available and in principle modeled analogously to important industrial application settings, where data are typically not made openly available. Furthermore, the CWRU dataset is straightforward to interpret and analyze, while combining a large number of supplied class labels, which are typically difficult for standard machine learning models to handle.
There are three different depths of damage inflicted on the bearing: 0.007, 0.014 and 0.021 inches. Each damage depth has five distinct bearing fault locations (ball, inner race, outer race opposite, outer race orthogonal and outer race centered). To enhance the complexity of the task and restrict the number of data samples per class, we opted to count each condition as a single class. This results in a large-scale classification task where distinctions are made on fault level and machine condition. The data samples are made by segmenting the two time series into 2048-point sequences with no overlap. This sequence length has been extensively utilized in other research for efficient implementation of the fast Fourier transform (FFT) method, which is known as a powerful baseline [10,45,59]. Every fault condition has approximately the same number of samples per class, except for the normal conditions, where the number of samples fluctuates. An overview of the dataset with its number of classes and samples can be seen in Table 1. Within these experiments, we applied varying combinations of the different motor speeds. Combining these different motor speeds leads to various numbers of samples, but also changes the number of classes to classify, therefore increasing the complexity of the given classification task. In addition, we experimented with different percentages of training data, e.g., 80%, 40% and 20% training data, to exhibit the behavior of our model under different circumstances. Furthermore, we used k-fold cross validation with k set to 5 to improve the generalizability of the trained models. The classification accuracy of the models is calculated by averaging the accuracy across every fold. However, within every fold, we predicted on the same unseen test set.
We chose accuracy as the primary metric because the task is a multi-class classification problem with mostly balanced data (except for the normal conditions) and because accuracy is used in many previous studies [10,43,45].
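The evaluation protocol described above (five folds, each scored on the same held-out test set, with accuracies averaged) can be sketched as follows. This is one plausible reading under stated assumptions: the nearest-centroid classifier is only a dependency-free stand-in for the actual models, the data are synthetic, and all function names are hypothetical.

```python
import numpy as np

def fit_centroids(X, y):
    """Stand-in classifier: store one mean vector per class."""
    classes = np.unique(y)
    return classes, np.stack([X[y == c].mean(axis=0) for c in classes])

def predict_centroids(model, X):
    classes, centroids = model
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dists.argmin(axis=1)]

def fold_averaged_test_accuracy(X_train, y_train, X_test, y_test, k=5, seed=0):
    """Train on k-1 folds per round and score every round on the same test set."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X_train)), k)
    scores = []
    for i in range(k):
        # Fold i is held out (e.g., for validation); the rest is used for fitting.
        fit_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit_centroids(X_train[fit_idx], y_train[fit_idx])
        pred = predict_centroids(model, X_test)
        scores.append(float((pred == y_test).mean()))
    return float(np.mean(scores)), scores

# Synthetic two-class data; the 80% train split mirrors one setting in the text.
rng = np.random.default_rng(42)
X = np.concatenate([rng.normal(0.0, 1.0, (100, 4)), rng.normal(5.0, 1.0, (100, 4))])
y = np.repeat([0, 1], 100)
order = rng.permutation(len(X))
n_train = int(0.8 * len(X))
train, test = order[:n_train], order[n_train:]
mean_acc, scores = fold_averaged_test_accuracy(X[train], y[train], X[test], y[test])
print(f"mean test accuracy over {len(scores)} folds: {mean_acc:.3f}")
```

Lowering `n_train` (for example to 40% or 20% of the data) reproduces the idea of the reduced train splits while leaving the test set untouched.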

Paderborn Bearing Dataset
The Paderborn bearing dataset can also be seen as a benchmark for fault detection and condition monitoring of damaged rolling bearing elements of rotating equipment, cf. [20,42]. The dataset represents motor current signals of an electromechanical drive system and vibrations of the housing [8]. The motor current signals can be extracted from the existing frequency inverters. Therefore, no additional sensors need to be placed on the system, as was the case in the CWRU bearing experiments, which makes experimentation more resource-efficient and thus less expensive. A unique feature of this method is that it monitors damage in external bearings, which are positioned in the drive system but outside the electric motor. Regardless, the motor current signal was employed as an input for fault detection.
In total, the data derived from the experiments represent "healthy", "real damaged" and "artificially damaged" bearings. Data were recorded for approximately 4 s at a sampling rate of 64 kHz, resulting in many separate data files with approximately 256 thousand data points each. We concatenated the data files based on their specific bearing conditions, i.e., "healthy", "real damaged" and "artificially damaged" bearings. Because of the many different machine settings, we decided not to distinguish between them and to look only at the specific bearing fault condition. This resulted in a large multi-class classification task and a large-scale dataset containing many sequences of length 2048, similar to the CWRU data.
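As a rough check on these figures, and treating the approximate recording length and sampling rate as exact (an assumption), the number of data points per file and the number of complete 2048-point sequences it would yield work out as follows. The symbols $f_s$, $T$ and $N$ are notation introduced only for this calculation.

```latex
N = f_s \cdot T = 64\,000\ \mathrm{Hz} \times 4\ \mathrm{s} = 256\,000,
\qquad
\left\lfloor \frac{N}{2048} \right\rfloor
  = \left\lfloor \frac{256\,000}{2048} \right\rfloor = 125 .
```

Because the files were concatenated per bearing condition before segmentation, the actual number of sequences per class depends on the number and exact length of the files, which this estimate does not capture.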
However, in contrast to the CWRU data, the number of classes is somewhat lower, the classes are more balanced and, due to the high number of sequences, the training time of the models is much longer. Furthermore, the Paderborn bearing dataset contains three time series, i.e., two motor current signals and one vibration signal, all of which are taken into account during processing. Therefore, the first wide-kernel layer adapts itself to the input data, resulting in a slightly different architecture compared with our models for the CWRU data.
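To illustrate what it can mean for a first wide-kernel layer to adapt to the number of input signals, the sketch below implements a strided 1D convolution in NumPy whose kernel depth is taken from the input. It is an illustration only, not the architecture of the proposed models: the kernel size (64), stride (16) and filter count (16) are assumed values typical of wide-first-layer designs, and both function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def wide_kernel_conv1d(x, weights, stride):
    """Strided 'valid' 1D convolution (cross-correlation, as in most DL frameworks).

    x:       (time, channels) input sequence
    weights: (kernel_size, channels, filters)
    returns: (out_steps, filters)
    """
    kernel_size, channels, _ = weights.shape
    if x.shape[1] != channels:
        raise ValueError("kernel depth must match the number of input channels")
    # Shape (time - kernel_size + 1, channels, kernel_size); keep every stride-th window.
    patches = sliding_window_view(x, kernel_size, axis=0)[::stride]
    return np.einsum("tck,kcf->tf", patches, weights)

def make_first_layer(n_channels, kernel_size=64, n_filters=16, seed=0):
    """Create randomly initialized weights whose depth follows the input (assumed sizes)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((kernel_size, n_channels, n_filters)) * 0.01

# Two channels as in the CWRU data, three as in the Paderborn data.
for n_channels in (2, 3):
    x = np.random.default_rng(1).standard_normal((2048, n_channels))
    w = make_first_layer(n_channels)
    out = wide_kernel_conv1d(x, w, stride=16)
    print(n_channels, w.shape, out.shape)
```

With these assumed sizes, a 2048-point sequence is reduced to 125 time steps regardless of the number of channels; only the depth of the kernel changes with the input.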
For this experiment, we also used k-fold cross-validation with k = 5 to improve the generalizability of the trained models and tested the models' performance with different train splits. However, due to the larger number of available samples, we made the classification task harder by lowering the train split even further, e.g., to only 10% and 5% training data. Table 2 provides an overview of the different conditions with their respective number of samples.

Machine Learning Algorithms
Our proposed models are compared with traditional machine learning (ML) algorithms such as k-nearest neighbors (K-NN), random forest (RF) and support vector machine (SVM). These models do not lend themselves to processing raw signals, so we preprocessed the data into a set of features in both the time and frequency domain. The features used to feed the ML models are derived from various studies, e.g., [60,61], and are described in detail in Table 3. All of these features were used in the ML models and are calculated across the sequences (with length 2048). Frequency-domain features are calculated after transforming each time series to the frequency spectrum using the FFT algorithm, which computes the one-dimensional discrete Fourier transform (DFT), here with backward normalization; FFT-based features are known as a potent baseline [10,45,59]. In total, the feature set consists of nine different features for each of the separate time series.
Table 3. Set of features used for the ML algorithms with their formula and description.

Features | Formula | Description

Time-Domain
Mean | [...] | [...]
[...] | [...] | Square of standard deviation.
Median | $\mathrm{Median}_x = x_{(n+1)/2}$ | Median value of a sequence $x$ given by $x_i$ as above.
Minimum | $\mathrm{Min}_x$ | Minimum value of sequence $x$ given by $x_i$ as above.
Maximum | $\mathrm{Max}_x$ | Maximum value of sequence $x$ given by $x_i$ as above.
Range | $\mathrm{Range} = \mathrm{Max}_x - \mathrm{Min}_x$ | Difference between the maximum and minimum value.

Frequency-Domain
Signal Energy | $P_{eng} = \sum (\mathrm{fft}\, x_i)^2$ | Energy of a signal calculated using FFT.
[...] | [...] | Power of a signal calculated using FFT.
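A minimal sketch of how the recoverable features in Table 3 could be computed for one sequence with NumPy is given below. Several choices are assumptions rather than the authors' definitions: the variance is the population variance, the energy uses the squared magnitude of the FFT coefficients, and the power is taken as energy divided by sequence length. The function name `extract_features` is hypothetical.

```python
import numpy as np

def extract_features(x):
    """Time- and frequency-domain features for a single 1D sequence."""
    x = np.asarray(x, dtype=float)
    # One-dimensional DFT with backward normalization (NumPy's default scaling).
    spectrum = np.fft.fft(x, norm="backward")
    # Assumption: energy is the sum of squared magnitudes of the coefficients.
    energy = float(np.sum(np.abs(spectrum) ** 2))
    return {
        "mean": float(np.mean(x)),
        "variance": float(np.var(x)),        # assumption: population variance
        "median": float(np.median(x)),
        "minimum": float(np.min(x)),
        "maximum": float(np.max(x)),
        "range": float(np.max(x) - np.min(x)),
        "signal_energy": energy,
        "signal_power": energy / x.size,     # assumption: energy per sample
    }

features = extract_features([1.0, 2.0, 3.0, 4.0])
print(features)
```

For a multivariate sample, the same function would be applied to each time series separately and the resulting values concatenated into one feature vector.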
For all ML models, we used grid search optimization with five-fold cross-validation, as provided by the scikit-learn package, to assess which model performs best in every experiment. The parameter settings of the best performing model vary across the different experiments and datasets. Table 4 describes the properties of the grid search optimization. In addition, we compared the models with the standard WDCNN models proposed in [10]. The results are described per dataset, since the datasets have different properties and classification conditions. Further, we distinguished between different percentages of training data to see how well the ML models perform under varying data availability.
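A grid search with five-fold cross-validation of the kind described above can be sketched with scikit-learn as follows. The parameter grid is purely illustrative and is not the grid from Table 4; the data are synthetic, with 18 columns chosen to mimic nine features for each of two time series.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic feature matrix standing in for the extracted features.
X, y = make_classification(
    n_samples=200, n_features=18, n_informative=6, n_classes=3, random_state=0
)

# Hypothetical grid, used only to show the mechanics of the search.
param_grid = {"n_estimators": [25, 50], "max_depth": [None, 5]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,                 # five-fold cross-validation
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The same pattern applies to K-NN and SVM by swapping the estimator and its grid, after which `search.best_estimator_` can be evaluated on the held-out test set.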

Results CWRU Dataset
In general, the results on the CWRU dataset in Table 5 clearly show that in almost every case the deep learning applications outperform traditional ML approaches, emphasizing the power of automated learning approaches in fault detection and condition monitoring. Overall, the Ada-WMCNN shows the best performance across the different experimental settings, suggesting that a multichannel structure is particularly useful for the first wide-kernel layer. However, the margins compared to the A-WMCNN are very small, so a multichannel approach in general seems well-suited for the task at hand. In that case, one should consider the computational resources needed for the model, which are significantly lower for the Ada-WMCNN due to its fewer trainable parameters. This model merges the separate inputs at an earlier stage than the A-WMCNN, resulting in fewer convolutional operations. In addition, the results for all models demonstrate that lowering the amount of training data almost always results in lower performance on the test set, especially for the more complicated classification tasks, e.g., with all motor speeds (RPM). These results clearly indicate the importance of sufficient training samples for fault detection and condition monitoring. For this particular dataset (CWRU), we expected this behavior because few sequences are available. With just 20% training data in particular, the number of samples per class drops below 20 in the most complex case, resulting in major performance drops, sometimes even below 50% accuracy. This phenomenon can be observed more clearly in Figure 4, where the results of the CNNs are averaged across all CWRU bearing experiments for each train split.

Results Paderborn Dataset
As can be seen in Table 6, the results on the Paderborn dataset show different characteristics compared to the results on the CWRU dataset. One of the most prominent observations is the consistency of the results across multiple classification tasks with different train splits, which can also be observed when averaging the accuracy scores of the CNNs across all experiments (see Figure 5). This behavior is likely caused by three factors: first, the amount of data available is much higher than for the CWRU experiments, even with lower train splits; second, there are fewer classes to choose from, indicating that the classification task at hand is slightly easier; third, the data consist of three signals instead of two, representing motor currents and vibrations. Another remarkable observation is the much higher performance of the traditional ML approaches compared to the results in the CWRU experiments. It seems that the extracted features are particularly well-suited for motor current signals when detecting bearing faults, especially in the case of the random forest algorithm. However, the results clearly reveal that in this case the A-WCNN model outperforms the other models, showing the highest accuracy in every single experiment, followed closely by the Ada-WMCNN. This result suggests that an overarching first wide-kernel layer with many filters performs slightly better than the multichannel separation in the Ada-WMCNN model, while completely separating the signals and feeding them as univariate time series results in lower performance (results from the A-WMCNN model). A reason for this might lie in the nature of the data, where motor current signals tend to be more correlated than the separately installed vibration sensors used in the CWRU bearing experiments.

Overall Model Performance
The results displayed in Tables 5 and 6 show the performance of the models for every individual experiment. To obtain a more general picture, we calculated the mean accuracy score across all experiments for both datasets as an indication of the overall performance, regardless of the experiment combinations related to train splits and machine conditions. In addition, we report the standard deviation to describe the stability of the models' performance, which allows some tentative inferences regarding the generalizability of the models. Furthermore, for the CNNs, the model size in terms of parameters and the computation speed per epoch with GPU acceleration are described in Table 7.
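For reference, the two summary statistics mentioned here can be written as follows, where $a_m$ denotes the accuracy of a model in experiment $m$ and $M$ the number of experiments (notation introduced for this illustration). Whether the population form shown below or the sample form with $M - 1$ in the denominator was used is not stated, so the choice here is an assumption.

```latex
\bar{a} = \frac{1}{M} \sum_{m=1}^{M} a_m ,
\qquad
\sigma_a = \sqrt{\frac{1}{M} \sum_{m=1}^{M} \left( a_m - \bar{a} \right)^2 } .
```

A low $\sigma_a$ indicates that a model's accuracy changes little across train splits and machine conditions.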
As can be seen in Table 8, the proposed adaptive models clearly outperform all other models. However, which model performs best seems to depend on the properties of the dataset. For example, the A-WMCNN performs best on the CWRU dataset, also with the second lowest standard deviation, while having a lower performance on the Paderborn dataset, with a relatively high standard deviation. The A-WCNN model shows the same behavior in reverse. Another interesting observation is the performance of the optimized multichannel model (Ada-WMCNN). While it performs slightly worse than the best performing model in both experiments, its averaged overall performance is better. Further, as Table 7 suggests, this model has fewer parameters and a competitive computation speed compared to the other adaptive CNNs. Therefore, according to this study, the Ada-WMCNN is well-suited for a variety of fault detection and condition monitoring tasks.
Finally, it is worth mentioning that even though the ML models perform considerably worse than the deep learning approaches, the random forest (RF) algorithm achieves the highest performance among them with a fairly low standard deviation, indicating that this ML model is surprisingly stable under varying circumstances. The RF algorithm clearly outperforms the other ML models, which may suggest that tree-based algorithms are well-suited for fault detection and condition monitoring tasks, provided that the data are preprocessed into a rich feature set. This is one of the directions to pursue in future work.

Discussion
This work extends our previous research [11] by investigating the performance of our proposed models, which utilize an adaptive wide kernel in the first convolutional layer, in fault detection and condition monitoring tasks. This adaptive layer can scale to time series data of varying dimensionality while maintaining its core task of extracting valuable features without representation loss.
With this adaptive layer, we optimized our previously proposed models as described in [11], resulting in two models: the A-WCNN and the Ada-WMCNN. The first model starts with the adaptive wide-kernel layer followed by small-kernel layers, similar to the previously proposed WDTCNN model. The Ada-WMCNN is able to process multivariate time series in a univariate way in the first wide-kernel layer. This model follows a similar principle to the WDMTCNN model proposed in [11], but is more computationally efficient due to its lower number of trainable parameters and higher computation speed, as can be seen in Table 7. We compared the models with several traditional machine learning applications (optimized with a grid search) as well as with the original WDCNN proposed by [10] and our completely separate multichannel CNN (A-WMCNN).
The proposed models were tested in a wide range of experiments on two well-known bearing fault datasets: the CWRU bearing dataset, a dataset with few samples and many different fault and machine conditions, and the Paderborn bearing dataset, a large-scale dataset with a vast number of samples and fewer fault conditions. Compared to the ML models and the original WDCNN, our optimized models demonstrated better performance; we therefore conclude that an adaptive wide-kernel layer is a valuable addition to existing 1D CNN architectures. However, among the different settings for the adaptive models, the highest performance seems to depend on the properties of the dataset, i.e., the best performing model differs per dataset. Overall, we can conclude that the Ada-WMCNN model performs best across all experiments, suggesting that a separate multichannel structure in the first wide-kernel layer is effective and adaptable to changing environments. Furthermore, both models handle raw signals directly and perform well in detecting faults and distinguishing between machine conditions, even in cases with minimal training data.
On another note, in this research our models were trained only on experimental data that were carefully recorded and annotated, which are usually difficult to obtain in real-world settings. Thus, the generalization of the trained models to real-world settings is difficult to assess, also due to the scarcity of available data. Furthermore, in many cases, especially in the results on the Paderborn dataset, the classification task seems fairly easy for the models to perform. Therefore, for future research, we intend to apply transfer learning by training the models on experimental data and testing their performance on real-world data to assess the generalizability of the trained models in other contexts. However, in this particular situation, the properties of the data should be similar, e.g., data derived from vibration sensors or motor current signals. Further, we aim to investigate the performance of the adaptive models in a context with variable sequence lengths. In addition, we plan to research the application of attention mechanisms in time series classification tasks as described in [62,63] and to implement these mechanisms in fault detection and condition monitoring tasks.