Exploring the Relationship between Preprocessing and Hyperparameter Tuning for Vibration-Based Machine Fault Diagnosis Using CNNs

Abstract: This paper demonstrates the differences between popular transformation-based input representations for vibration-based machine fault diagnosis. This paper highlights the dependency of different input representations on hyperparameter selection with the results of training different configurations of classical convolutional neural networks (CNNs) with three common benchmarking datasets. Raw temporal measurement, Fourier spectrum, envelope spectrum, and spectrogram input types are individually used to train CNNs. Many configurations of CNNs are trained, with variable input sizes, convolutional kernel sizes, and strides. The results show that each input type favors different combinations of hyperparameters, and that each of the datasets studied yield different performance characteristics. The input sizes are found to be the most significant determiner of whether overfitting will occur. It is demonstrated that CNNs trained with spectrograms are less dependent on hyperparameter optimization over all three datasets. This paper demonstrates the wide range of performance achieved by CNNs when preprocessing method and hyperparameters are varied as well as their complex interaction, providing researchers with useful background information and a starting place for further optimization.


Introduction
Data-driven machine health monitoring (MHM) can allow machine operators to improve capitalization of a mechanical asset's useful lifetime and avoid unexpected interruptions to machine operability by detecting mechanical faults in advance of a machine breakdown. Traditionally, signals gathered from machine-mounted sensors are analyzed by system experts to determine whether maintenance action should be taken. This involves first gathering and preprocessing the signals, extracting useful features from the signals, and then analyzing the features to determine whether a fault is present. Current research for advancing MHM is focused on improving automatic feature selection and feature analysis to reduce the dependence on human experts [1].
Bearings and gears are common subjects for MHM since they are nearly ubiquitous in machine design and are accountable for a large proportion of machine failures [2]. Vibration, temperature, current, oil analysis, and acoustic emissions are among the types of signals used in MHM. Vibration signals are the most widely investigated signal type for their convenience and affordability, and because vibration signals can carry useful information about fault type and severity. Machine learning is regarded as a suitable tool for use in vibration-based MHM for its ability to detect complex patterns contained in vibration signals. Classical applications of machine learning using hand-crafted features such as k-nearest neighbors (kNN) and support vector machines (SVM) have been widely explored and perform well in many fault diagnosis problems [3]. The performance of these classical methods, however, depends heavily on the quality of the hand-crafted features, which motivates deep learning approaches such as CNNs that learn features automatically. The results will provide a template framework for presenting the design of CNN-based diagnosis algorithms and data representations, as well as determine whether some representations are universally stronger or stronger only when paired with certain CNN configurations and datasets. The following section reviews the basic concepts involved in bearing and gear fault diagnosis using CNNs, as well as introduces the different input types used to train CNNs.

Vibration-Based Bearing Fault Diagnosis Fundamentals
This section describes the fundamental nature of the observation data and some of the known factors of variation behind it. Additional useful resources include a model for single-point defect vibration signals given by McFadden and Smith [19] and a detailed bearing fault diagnosis tutorial from Randall and Antoni [20].
Most diagnostic algorithms focus on localized bearing faults since they are more easily distinguished by their vibrational signature and usually precede destructive component failure. Localized faults consist of geometric irregularities, such as pits or spalls, that inhibit the normal smooth motion of the bearing components relative to each other. These faults may arise due to any number of factors, such as overheating, lubrication failure, overloading, material defects, misalignment, and natural material wear or fatigue. Some of these factors can occur simultaneously and/or behave as catalysts that accelerate the natural rate of fatigue.
As the inner race and outer race rotate relative to each other, the contact zones between the different bearing components pass over the fault. This short-duration contact is similar to an impact and causes a burst of vibrations at frequencies known as the fault characteristic frequencies (FCFs) that propagate through the bearing housing to be detected elsewhere on the machine. The most basic form of bearing fault diagnostics involves checking vibration signals for repeated impacts at the FCFs.
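The FCFs follow directly from the bearing geometry and the shaft speed. The sketch below evaluates the standard kinematic formulas; the function name and the example geometry are illustrative and not taken from any of the datasets studied:

```python
from math import cos, radians

def fault_frequencies(fr, n, d, D, phi_deg=0.0):
    """Classical kinematic fault characteristic frequencies for a
    rolling element bearing with a stationary outer race.
    fr: shaft rotation frequency [Hz], n: number of rolling elements,
    d: rolling element diameter, D: pitch diameter, phi_deg: contact angle."""
    r = (d / D) * cos(radians(phi_deg))
    return {
        "BPFO": (n / 2) * fr * (1 - r),             # ball pass frequency, outer race
        "BPFI": (n / 2) * fr * (1 + r),             # ball pass frequency, inner race
        "BSF":  (D / (2 * d)) * fr * (1 - r ** 2),  # ball spin frequency
        "FTF":  (fr / 2) * (1 - r),                 # cage (fundamental train) frequency
    }

# Hypothetical bearing: 9 balls, d/D = 0.2, zero contact angle, 30 Hz shaft speed
fcf = fault_frequencies(fr=30.0, n=9, d=7.9, D=39.5, phi_deg=0.0)
```

A diagnostic check then amounts to looking for repeated impacts at whichever of these frequencies matches the spacing observed in the (envelope) spectrum.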
Several other factors combine to obfuscate the diagnostic information within the signal. Among these is the contamination of the signal by background noise, which is a prevalent issue in industrial applications, where many machines interfere with each other's sensors through common transmission paths. As bearing resonant frequencies tend to be quite high, it is generally easiest to find wide frequency bands dominated by bearing signals in the higher end of the spectrum rather than the lower end, as the lower end of the frequency spectrum is more likely to contain contamination from distant background machinery. Another complicating factor is the cyclical variation of transmission paths within the bearing itself caused by the relative rotation of bearing components. The influence of this amplitude modulation is illustrated in Figure 1.
A basic and powerful tool to simplify signal analysis and reveal the underlying shape of the vibration waveform involves taking the envelope of the signal, which effectively smooths the resonance waves induced during impacts against the fault zone into a single wave. This may be regarded as the effective instantaneous amplitude, and it more clearly reveals the FCF than the raw signal. Figure 1 shows how different fault locations involve different amplitude modulation patterns as the fault moves relative to the sensor.
Frequently changing operating conditions pose a challenge for in situ bearing fault diagnosis. A bearing within a machine may be subject to different speed and loading conditions depending on the machine's mode of operation. Varying speed will cause a change in the FCF being activated by a given fault and a change in loading conditions will change the vibration magnitude and can also change the modulation and attenuation characteristics of the vibration response. This motivates the development of diagnostic algorithms that are unaffected by changes in operating conditions or limits them to work on machines with consistent operating conditions.
Although laboratory test benches attempt to simulate real industrial situations, artificially damaged bearings and gears are imperfect approximations of real damage endured gradually through normal wear processes. The resulting difference in fault geometry induces a different fault signal during operation. Figure 2 shows test bench measurements from identical bearings under the same loading conditions with a seeded fault introduced to the outer race with electron discharge machining (left) and outer race spalling by natural wear occurring after accelerated life testing (right). The raw signals and frequency spectra are somewhat different in appearance, but the differences are minimized in the envelope spectra.
Vibration 2021, 4 FOR PEER REVIEW
Figure 1. Typical signals and envelope signals from local faults in rolling element bearings [19].

Figure 2. Comparison of seeded fault signals (left column) and natural fault signals (right column) with respect to raw signal (top row), raw frequency spectrum (middle row), and envelope spectrum (bottom row) for identical bearings under the same operating conditions.

Vibration-Based Gear Fault Diagnosis Fundamentals
A useful resource describing gear fault detection using vibration spectral analysis is given by P. D. McFadden [21]. Planetary gearboxes have unique characteristics and are prevalent enough to warrant individual consideration; an informative tutorial is given by Guo et al. [22], which also reviews conventional methods employing time-synchronous averaging and narrowband demodulation.
Unlike bearing vibration signals, which contain no periodic impulses in the healthy state, healthy gear signals contain sharp periodic impulses due to the repeated engagement and disengagement of the meshing teeth. Therefore, gear faults cannot be diagnosed simply by detecting the presence of new periodic fault signals, but must be diagnosed by detecting other changes in the gear signal. The frequency of tooth engagement is known as the gear meshing frequency (GMF) and is calculated as the product of the shaft frequency and the number of teeth on the gear mounted to that shaft.
Gear faults may arise due to several factors, including inadequate lubrication, contamination, installation error, and overloading. These factors may give rise to fatigue pitting on the tooth surfaces or cracks at the base of the tooth where stresses are concentrated. Spectral analysis reveals that sidebands often appear about the GMF and its harmonics when a local tooth defect is present in a gear. In systems with many pairs of meshing gears, it can be very difficult to distinguish the many components in the spectrum, complicating the diagnosis process. Additionally, the presence of sidebands does not always indicate fault severity.
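The GMF and the defect-induced sideband locations described above reduce to simple arithmetic. A minimal sketch, with made-up shaft speed and tooth count:

```python
def gear_mesh_frequency(shaft_freq_hz, n_teeth):
    # GMF = shaft rotation frequency x number of teeth on that shaft's gear
    return shaft_freq_hz * n_teeth

def sidebands(gmf, shaft_freq_hz, order=3):
    # A local tooth defect modulates the meshing signal once per revolution,
    # producing sidebands at GMF +/- k * f_shaft for k = 1..order
    return [gmf + k * shaft_freq_hz for k in range(-order, order + 1) if k != 0]

gmf = gear_mesh_frequency(25.0, 32)   # 25 Hz shaft with a 32-tooth gear -> 800 Hz
sb = sidebands(gmf, 25.0, order=2)    # sidebands at 750, 775, 825, 850 Hz
```

In a multi-stage gearbox, each meshing pair contributes its own GMF, harmonics, and sideband families, which is what makes the full spectrum hard to disentangle.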
It is not possible to obtain a pure recording of the gear meshing signal. Practical constraints require accelerometers to be placed on the external gearbox housing; this results in a variable transmission path that passes through other gears, shafts, bearings, and the housing itself. The resulting modulation can considerably distort the signal and cause sidebands to arise even for healthy gears. Figure 3 shows the raw signals and envelope spectra for a two-stage gear reducer when all gears are healthy and when the input gear is chipped.
With respect to gear fault diagnosis, the objective of this paper is to explore whether CNN architectures that are effective for bearing fault diagnosis are also effective for gear fault diagnosis, or whether gear fault diagnosis requires an entirely different configuration of hyperparameters.

Theory of CNNs
This section gives a general description of how CNNs operate without explaining finer details or providing mathematical formulae. Bouvrie provides useful resources and a detailed discussion of CNNs, along with their mathematical description [23]. Lei et al.'s review of deep learning for machine fault diagnosis also contains a useful visual explanation of the fundamental operations used in CNNs. In general, CNNs consist of three main layer types: convolutional layers, pooling layers, and fully connected layers. A general visual description is presented in Figure 4. A convolutional layer contains some number of filter kernels, each of which convolves the previous layer's output to produce a feature map. The filters may be of any size that fits within the dimensions of the previous layer's output. The convolution operation consists of taking the dot product between a region of the input space and the learnable weights in the kernel. The product is passed through an activation function such as a rectified linear unit (ReLU) and the result is mapped to the corresponding region of the feature map. The filter is swept over the input space using a step size called the stride, taking a dot product at each step to fully populate the feature map. Because the same kernel weights are reused at every position, convolution offers CNNs strong translation invariance with respect to the location of features in the input space without the need for many new learnable parameters. A new feature map is generated by each kernel in the convolutional layer. The pooling operation is a downsampling process that reduces the number of training parameters while preserving important information about the input.
The pooling layer also involves a filter of some size that is swept over the previous layer at increments determined by its stride. Common pooling operations are max pooling and mean pooling, which keep the maximum and mean of the input, respectively. Using pooling layers after convolutional layers reduces training time, as well as overfitting.
Several repetitions of convolution and pooling layers allow a CNN to learn complex features from the input data. The features can then be flattened into a 1D layer and used as the input to at least one fully connected layer through which the features are mapped to the target class. For supervised learning of machine health states, softmax classification is usually used for the last layer. The parameters within the convolutional kernels and fully connected layers are learned during training with a gradient descent algorithm, as with most neural network training algorithms. The configuration of the CNN, including the sequence of layers, number and size of filters in each layer, learning rate, stride, and other hyperparameters, is hand-picked by the network designer and constitutes a fully defined CNN architecture.
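The three basic operations above (strided convolution as a sliding dot product, ReLU activation, and max pooling) can be sketched in a few lines of plain Python. The kernel here is hand-picked for illustration rather than learned:

```python
def conv1d(x, w, stride=1):
    # Sweep kernel w over x with the given stride; each step is a dot
    # product between the kernel weights and one window of the input.
    k = len(w)
    return [sum(w[j] * x[i + j] for j in range(k))
            for i in range(0, len(x) - k + 1, stride)]

def relu(v):
    # Elementwise rectified linear unit
    return [max(0.0, a) for a in v]

def max_pool(v, size=2, stride=2):
    # Keep only the maximum of each pooling window (downsampling)
    return [max(v[i:i + size]) for i in range(0, len(v) - size + 1, stride)]

x = [0.0, 1.0, -1.0, 2.0, 0.5, -0.5, 1.5, 0.0]   # toy 1D input
fmap = max_pool(relu(conv1d(x, w=[1.0, -1.0], stride=1)))
# fmap -> [2.0, 1.5, 1.0]: a pooled feature map highlighting falling edges
```

A real convolutional layer simply repeats this with many kernels (one feature map each) and learns the kernel weights by gradient descent.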

Data Augmentation
Data augmentation is a critical tool in machine learning used when the available training data is limited or when it is necessary to improve the network's invariance to certain types of transformations. Data augmentation effectively expands the original training dataset by performing some transformation to the training samples and yields a greater number of unique new training samples. The transformation used must alter the original data in a way that makes it distinguishable from other transformed data, but without totally obscuring the underlying factors of variation. The class labels are usually preserved during data augmentation. For image classification datasets, augmentation techniques include cropping, rotating, and obscuring random subregions of an image, which improve translational and rotational invariance as well as overall generalization. Each of these transformations can involve randomized parameters to allow multiple new images to be generated from each original image.
For the present work, a sliding window is used to extract many short-duration training samples from each original experimental measurement in the time domain. Preprocessing transformations, such as the fast Fourier transform (FFT), are then performed as needed on the extracted sample. This method has several benefits: it is simple and easily reproduced, it offers a many-fold increase in available training samples, it dramatically reduces the dimensionality of the training samples, and it promotes translational invariance in the resulting trained network. The data augmentation technique is illustrated in Figure 5. A sliding window of some prescribed length is used to extract a sample from the original experimental measurement. The window is advanced in time by a set increment to extract the next training sample. Some overlap between windows is provisioned to increase the yield of new training samples, improving translational invariance.
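The moving-window augmentation above can be sketched directly; the window length and overlap here are hypothetical values, not those used in the case studies:

```python
def sliding_windows(signal, win_len, overlap=0.75):
    # Extract fixed-length training samples from one long measurement.
    # Adjacent windows share overlap * win_len samples, multiplying the
    # number of (partially correlated) training samples obtained.
    step = max(1, int(win_len * (1.0 - overlap)))
    return [signal[i:i + win_len]
            for i in range(0, len(signal) - win_len + 1, step)]

measurement = list(range(1000))   # stand-in for one vibration record
samples = sliding_windows(measurement, win_len=200, overlap=0.75)
# step = 50 samples, so (1000 - 200) / 50 + 1 = 17 windows are extracted
```

With no overlap the same record would yield only 5 windows, which illustrates how the overlap setting trades sample uniqueness against dataset size.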

CNN Input Types
Since CNNs can theoretically extract useful features directly from raw data, very good accuracy should be achievable without using complex data preprocessing. In practice, however, strong performance may require manipulation of the observation data, especially in cases where the data is initially of high dimensionality and the availability of training data is limited. One important cause for this is a phenomenon known as "the curse of dimensionality": as the number of dimensions increases, the volume of the space grows much more quickly than the number of available datapoints, and the distance between datapoints becomes much greater, leading to a sparse dataset. Reducing the number of dimensions used to represent the data, while preserving factors of variation that maintain a statistically significant relationship between the observations and their labels (e.g., the health states of the present application), can make it far easier for machine learning algorithms to learn features from the data and overcome sparsity.
Reducing dimensionality is not the only objective of preprocessing. Useful transformations to the raw data can highlight or accentuate explanatory factors, making classification easier. A common example of this is the Fourier transform, which is very useful when the observed data is a time-series measurement, but obvious explanatory factors are present in the frequency spectrum. Transformations may be selected by the algorithm designer using domain-specific expertise, or they may be performed by generic data reduction operations such as principal component analysis or autoencoders. This study is focused on common expert-chosen transformations, since they preserve spatial patterns that will be detectable by the CNN.
The following subsections introduce each of the input types that are used to train CNNs in each of the case studies. The aim is to provide the reader with an intuitive understanding of the transformations used without using detailed mathematical descriptions.

Raw Time Input
As the simplest possible input type, raw time provides only a few parameters that may be chosen by the algorithm designer to characterize this 1D input. Chiefly, one must consider the length or duration of the raw signal that is used as an input feature. In general, a longer signal duration is more likely to contain useful diagnostic information. If a signal is too short, it may not contain enough instances of periodic fault-induced emissions to establish a high probability of those emissions being the result of a fault-related pattern. However, increasing the signal duration also increases the computational power required to train the network. It also increases the probability of the network becoming overfit.
Another motivation for using a shorter window arises when training samples are extracted from a longer experimental measurement via moving window data augmentation. Shorter windows allow more unique training samples to be extracted from the available measured data.
When machine-mounted accelerometers operate at higher sample rates, more data is generated to describe the machine's vibrations over a given span of time. The two signals shown in Figure 6 illustrate how a different number of samples is needed to span a full shaft rotation, and how some fault-related signals manifest over multiple full rotations.
Generally, the resonant frequencies of bearing components are known, so it is possible to determine the upper bound beyond which it is no longer useful to increase the sample rate. However, for this work, no re-sampling of the signals is performed; all benchmark signals will be considered with their native sampling frequencies as stated in each case study.
The experiments conducted here include a study of the influence of input duration against performance in a range of different CNN configurations to determine when it may be advantageous to use a longer or shorter input.

Frequency Spectrum
Data augmentation is performed prior to using the fast Fourier transform (FFT) to obtain the 1D frequency spectrum samples. The absolute value of the single-sided spectrum is kept. Since no low-pass filter is used, the upper bound of the spectrum is defined by the Nyquist frequency, determined simply by halving the sample rate. The frequency resolution achieved depends on the duration of the signal extracted during data augmentation. In this case, the resulting input vector will contain roughly half the number of elements contained within the original time domain window.
The overlapping sliding windows used during data augmentation ensure that each spectrum obtained contains variations of the original signal. As with using a raw time input, the probability that a successful diagnosis can be performed on any given spectrum will depend on whether a window coincides with transient fault-induced signals; a very short window has a low probability of containing such information. Figure 7 shows two such spectra obtained from the same experimental measurement with a 75% overlap. The first sample (bottom left) coincides with an instance of a ball passing over an outer race fault, resulting in a spectrum that more clearly indicates the presence of the fault. Assuming the original measurement is stationary, extracted spectra become increasingly similar to the spectrum of the whole measurement as the window size is made larger. Therefore, using longer windows to extract data results in a more homogeneous dataset.
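This preprocessing amounts to a one-sided FFT of each augmented window. A minimal numpy sketch, using an assumed sample rate and a synthetic tone standing in for a measured window:

```python
import numpy as np

def single_sided_spectrum(window, fs):
    # Magnitude of the one-sided FFT of one time-domain window; with no
    # low-pass filtering the axis extends up to the Nyquist frequency fs/2.
    mag = np.abs(np.fft.rfft(window))
    freqs = np.fft.rfftfreq(len(window), d=1.0 / fs)
    return freqs, mag

fs = 12000                                   # Hz, hypothetical sample rate
t = np.arange(2048) / fs
x = np.sin(2 * np.pi * 1000 * t)             # 1 kHz tone as a stand-in signal
freqs, mag = single_sided_spectrum(x, fs)
peak_hz = freqs[np.argmax(mag)]              # lands near 1000 Hz
```

Note the 2048-sample window yields a 1025-element spectrum (n/2 + 1), which is why each spectrum input is roughly half the length of its time-domain window, and why the frequency resolution (fs/n, here about 5.9 Hz) improves with longer windows.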

Envelope Spectrum
The envelope transformation simplifies a signal to reveal its overall shape. It emphasizes lower-frequency modulating signal elements, such as low-frequency fault-related impacts. As a result, the spectrum of the envelope signal can more clearly outline FCFs and their harmonics. This removes information about the bearing components' resonant frequencies, which may not be useful, improving the model's ability to learn simpler and more distinct patterns. The transformation can yield an upper and a lower envelope. Since vibration signals tend to be symmetric, the upper and lower envelopes will have very similar frequency spectra. This work uses a standard MATLAB function (envspectrum(x,fs), where x is the signal and fs is the sample rate in Hz) to obtain the envelope spectrum. Figure 8 illustrates the transformation in multiple steps: from the original raw signal to the upper envelope signal to the envelope spectrum. The spectrum of the original signal is also shown for comparison. The upper envelope spectrum clearly illustrates the fault characteristic frequency with its harmonics, whereas high-frequency vibrations around bearing component resonance frequencies dominate the raw spectrum.
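A rough numpy analogue of this computation can be written with a Hilbert-transform-based analytic signal. This is a sketch of the general envelope-spectrum idea under simplifying assumptions (no band-pass prefiltering), not a reimplementation of MATLAB's envspectrum; the amplitude-modulated test signal and its frequencies are made up:

```python
import numpy as np

def envelope_spectrum(x, fs):
    # Build the analytic signal via the FFT (Hilbert transform), take its
    # magnitude as the upper envelope, remove the mean, and return the
    # one-sided spectrum of that envelope.
    n = len(x)
    X = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    h[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        h[n // 2] = 1.0
    analytic = np.fft.ifft(X * h)
    envelope = np.abs(analytic)          # effective instantaneous amplitude
    envelope -= envelope.mean()          # drop the DC component
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    return freqs, np.abs(np.fft.rfft(envelope))

# A 3 kHz "resonance" amplitude-modulated at a 100 Hz "FCF"
fs, n = 12000, 4096
t = np.arange(n) / fs
x = (1 + 0.8 * np.sin(2 * np.pi * 100 * t)) * np.sin(2 * np.pi * 3000 * t)
freqs, env_mag = envelope_spectrum(x, fs)
mod_hz = freqs[np.argmax(env_mag)]       # near the 100 Hz modulation rate
```

The raw spectrum of this test signal is dominated by energy around 3 kHz, while the envelope spectrum peaks at the 100 Hz modulation rate, which is exactly the behavior the FCF analysis relies on.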

Figure 8. Comparison of raw vibration signal, envelope signal, raw spectrum, and envelope spectrum for an outer race fault sampled at 64 kHz from the Paderborn University dataset.

Short Time Fourier Transform Spectrogram
The spectrogram provides a very useful time-frequency representation of the signal; it includes the benefits of spectral analysis as well as time localization of frequency components. The spectrogram is obtained by dividing the original signal into many shorter windows and taking the FFT of each one in sequence. The result is a 2D matrix with time progression along one direction and the frequency scale along the other.
Two parameters used in defining the transformation allow some flexibility in tailoring the spectrogram to suit a given application: window length and overlap. The window length is the result of the original signal length divided by the number of desired samples prior to taking the FFT. A shorter window provides finer division of the signal in time, and thus improves time localization, increasing the time resolution of the spectrogram. The penalty for increased time resolution is decreased frequency resolution due to the innate restrictions of the Fourier transform. The frequency resolution is proportional to the size of the windowed signal and is thus improved by increasing the window length. This creates a tradeoff between time and frequency resolution. The overlap used in obtaining the spectrogram lends some improvement to the time resolution by fitting more windows into the signal, with some samples shared by adjacent windows. For simplicity, and to avoid training CNNs with very large inputs, no overlap is used in this study.
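The non-overlapping spectrogram and its time-frequency tradeoff can be sketched with numpy; the sample rate and test tone below are illustrative:

```python
import numpy as np

def spectrogram(x, n_window, fs):
    # Non-overlapping STFT as used in this study: split the signal into
    # consecutive windows and take the one-sided FFT magnitude of each.
    n_frames = len(x) // n_window
    frames = np.reshape(x[:n_frames * n_window], (n_frames, n_window))
    spec = np.abs(np.fft.rfft(frames, axis=1))          # shape: (time, frequency)
    times = (np.arange(n_frames) + 0.5) * n_window / fs  # window centers [s]
    freqs = np.fft.rfftfreq(n_window, d=1.0 / fs)
    return times, freqs, spec

fs = 12000
x = np.sin(2 * np.pi * 1500 * np.arange(8192) / fs)
# Shorter window: more time steps, coarser frequency bins, and vice versa
t64, f64, s64 = spectrogram(x, n_window=64, fs=fs)       # 128 frames x 33 bins
t256, f256, s256 = spectrogram(x, n_window=256, fs=fs)   # 32 frames x 129 bins
```

The two calls make the tradeoff concrete: quadrupling the window length quarters the time resolution while quadrupling the number of frequency bins, and the product of the two dimensions (the CNN input size) stays roughly constant.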

Network Architectures
This paper aims to investigate the relationship between the basic CNN hyperparameters and the choice of signal preprocessing method. Therefore, the same sequence of layers is used in all experiments, with experimental variables that include input type, input size, convolutional kernel size, and stride. All network configurations use two convolutional layers, each followed by a ReLU layer, a 50% dropout layer, and a max pooling layer. Additional layers were considered, but a limit of two was selected due to the overfitting observed even in shallow networks. As the number of layers increases, deeper features can be extracted, but the increased number of parameters requires a larger training dataset to avoid overfitting, and none of the available open access datasets are large enough to properly train many-layered CNNs.
The size of the convolutional kernel is varied to investigate its influence. The same kernel size is used in the first and second layers for simplicity and to reduce the number of experimental network configurations. Max pooling for 1D inputs uses a 2 × 1 filter and a 2 × 1 stride, while max pooling for 2D inputs uses a 2 × 2 filter and a 2 × 2 stride. All convolutional layers have 16 filters. The output from the second max pooling layer is flattened and followed by a fully connected layer with 7 output neurons. A softmax layer is used to classify the output. Here, network configurations are denoted simply by the convolutional kernel size, stride, and input size used. As an example, Table 1 details the network architecture for a network with a kernel size of 8 × 1, a stride of 2 × 1, and an input size of 1248 × 1. In addition to the network architecture, the parameters for training a CNN greatly influence the final accuracy. Unless otherwise indicated, the same training parameters are used across all case studies. Training is ended after 8 epochs for Case Study 1, 3 epochs for Case Study 2, and 40 epochs for Case Study 3, since this was observed to provide network convergence across all data types and network configurations. All configurations are trained using the ADAM optimizer for gradient descent. 1D CNNs are trained with an initial learning rate of 0.01, while 2D CNNs are trained with an initial learning rate of 0.001. The number of training samples used for each configuration varies depending on the signal duration used in that configuration. The duration of the original signal, as well as the dimensions of the processed input data, are presented along with the accuracies for each configuration.
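As a sanity check on the architecture just described, the layer output sizes for the Table 1 example can be computed with the standard convolution/pooling size formula. The no-padding ("valid") convention assumed below is not stated explicitly in the paper, so the exact flattened size is an assumption:

```python
def conv_out(length, kernel, stride):
    # Output length of a 1D convolution or pooling with no padding
    return (length - kernel) // stride + 1

L = 1248                          # input length (Table 1 example)
k, s = 8, 2                       # kernel size and stride, shared by both conv layers

L = conv_out(L, k, s)             # conv1 (16 filters)
L = conv_out(L, 2, 2)             # max pool, 2x1 filter, 2x1 stride
L = conv_out(L, k, s)             # conv2 (16 filters)
L = conv_out(L, 2, 2)             # max pool, 2x1 filter, 2x1 stride

flat = L * 16                     # flatten across the 16 filters
print(L, flat)                    # 76 1216: 1216 features feed the 7-neuron fully connected layer
```

Under these assumptions the sequence of lengths is 1248 → 621 → 310 → 152 → 76; frameworks that pad or use ceil-mode pooling would give slightly different sizes.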

Case Study 1: Case Western Reserve University Bearing Fault Dataset
This case study explores the use of CNNs for identifying healthy bearings and bearings having various fault types. Key parameters for preparing the training data and CNN hyperparameters are varied to identify important trends. The objective here is to identify which variables strongly influence diagnostic accuracy and to identify the most accurate combination of input type and CNN configuration.

Dataset Description
The Case Western Reserve University (CWRU) dataset [24] is commonly used to benchmark algorithms for bearing fault diagnosis [25]. The dataset contains vibration measurements from multiple accelerometers mounted on an electric motor containing bearings in various states of health. Bearing faults were artificially introduced using electrodischarge machining. Different fault types were simulated separately at the inner raceway, outer raceway, or one of the rolling elements. Damaged bearings were installed at the fan end or drive end positions of the motor's shaft. Thus, seven fault states are created: one in which all bearings are healthy, and three different damage states for each of the two bearing locations. Using these seven states as labels, the trained CNN must accurately diagnose both the fault type and the fault location for a prediction to be counted as correct by the error function.
The motor was operated with loads of 0 to 3 HP and speeds ranging from 1720 to 1797 rpm. Three fault severities were simulated by machining defects of different sizes into the bearings. Vibration data was sampled at 12 kHz for a duration of 10 seconds for each configuration of motor load, fault severity, and fault type. Additional recordings were made with outer race faults in which the bearing was installed with the static fault at different locations relative to the accelerometer: directly below it, orthogonal to it, or opposite from it.

Data Preparation
Data from different runs of the test bench are divided between training and testing groups prior to performing data augmentation. To ensure an accurate appraisal of the testing accuracy of the trained networks, data from the training set must not overlap with data from the test set, including by being obtained from the same damaged bearing specimen. K-fold cross-validation is used to verify results, with k = 4; in each fold, one quarter of the original experimental data is held out for testing. Each reported testing and validation accuracy is the average obtained over the k trained CNNs.
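The run-level separation described above can be sketched in a few lines: whole experimental runs, not individual samples, are assigned to folds, so that augmented windows from one run can never appear on both sides of the split. The run identifiers below are hypothetical stand-ins for the test-bench recordings:

```python
# Hypothetical run IDs; in practice each indexes one test-bench recording
runs = [f"run_{i:02d}" for i in range(12)]
k = 4

# Round-robin assignment of whole runs to folds; augmentation happens afterwards,
# so no window extracted from a test run can leak into the training set
folds = [runs[i::k] for i in range(k)]

for test_runs in folds:
    train_runs = [r for r in runs if r not in test_runs]
    # Training and testing runs are disjoint in every fold
    assert not set(train_runs) & set(test_runs)
```

Performing the shuffle-and-split after augmentation instead would mix windows from the same run into both sets, which is the contamination the paper warns against.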
Though the experimental data contains two channels of vibration data, data from only one accelerometer is used. The fault type and faulty bearing location are used as labels, for a total of seven labels. The three different fault severities and four different loading conditions are lumped into each of these seven labels. The goal of the network is to diagnose the fault type and location irrespective of these factors.
The data augmentation method described in Section 2.4 is used for all experiments, with 25% overlap. Six input lengths are chosen, using multiples of the number of samples needed to span a full shaft rotation at the given sample rate and average rotation rate over all experiments. Prior to training, the data is normalized to have a mean of zero and standard deviation of one.
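The augmentation scheme amounts to a sliding window whose hop is 75% of the segment length (25% overlap). A minimal numpy sketch follows; the signal and segment lengths are chosen purely for illustration (the paper ties segment length to shaft rotations), and the per-segment normalization shown is an assumption, since the paper does not specify whether normalization is global or per segment:

```python
import numpy as np

signal = np.random.randn(10_000)   # stand-in for one experimental run
seg_len = 400                      # e.g. samples spanning one shaft rotation
hop = int(seg_len * 0.75)          # 25% overlap between consecutive segments

segments = np.stack([signal[i:i + seg_len]
                     for i in range(0, len(signal) - seg_len + 1, hop)])

# Normalize each segment to zero mean and unit standard deviation
segments = (segments - segments.mean(axis=1, keepdims=True)) \
           / segments.std(axis=1, keepdims=True)

print(segments.shape)              # (33, 400): 33 overlapping training samples
```

The 25% overlap multiplies the number of training samples per run by roughly 4/3 compared with non-overlapping segmentation.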

Results and Discussion
Tables 2-5 show the average training and testing accuracies taken over 4-fold cross-validation for various network input sizes, convolutional filter sizes, and strides. The results cover a broad range of accuracies, highlighting significant sensitivity to changing hyperparameters. The best accuracy obtained here (80.1%) still falls short of some other researchers' findings using shallow classical CNNs. This is likely a result of their allowing samples extracted from a given experimental run to exist in both the training and testing datasets. The tables are color-formatted to highlight higher accuracies in green and lower accuracies in red. The highest testing accuracy in each table is shown in bold, along with its corresponding training accuracy.
Evidently, different input types favor different hyperparameters for maximizing validation accuracy. Using raw temporal measurements appears best paired with mid-sized kernels and a smaller input space. Using FFT preprocessing achieves poor accuracy if large kernels are used and does not show a strong dependence on input size, whereas envelope spectrum preprocessing works best with the largest kernel and a mid-sized input. The spectrogram input type provides the most consistently strong diagnostic accuracy, with less sensitivity to changes in input size, kernel size, and stride than the other input types. Testing accuracy is consistently lower than training accuracy, especially for longer input durations, suggesting that overfitting remains a significant problem despite the use of dropout. This might be addressed with a more extensive and more sophisticated data augmentation method. If only the strongest configurations for each input type are considered, the Fourier spectrum, envelope spectrum, and spectrogram perform approximately equally. Figure 9 shows confusion matrices from the best performing configurations under each input type. Since 4-fold cross-validation is used, four confusion matrices can be produced from each configuration. The summation of these four confusion matrices is presented for each input type.
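Aggregating the per-fold confusion matrices, as done for Figure 9, is an elementwise sum. The sketch below uses synthetic labels and predictions in place of the trained CNNs' outputs:

```python
import numpy as np

n_classes = 7                                 # seven health-state labels
rng = np.random.default_rng(0)

total = np.zeros((n_classes, n_classes), dtype=int)
for fold in range(4):                         # 4-fold cross-validation
    y_true = rng.integers(0, n_classes, 100)  # synthetic fold labels
    y_pred = rng.integers(0, n_classes, 100)  # synthetic fold predictions
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)        # rows = true class, cols = predicted class
    total += cm

# Every test sample from every fold is counted exactly once in the summed matrix
assert total.sum() == 400
```

Summing rather than averaging keeps the entries as integer sample counts, so the matrix remains directly interpretable.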
Considering that practical users of diagnosis algorithms may not be concerned with the fault type and simply need to know whether the bearing is healthy or needs replacing, the best preprocessor appears to be the envelope spectrum. The envelope spectrum preprocessor resulted in no instances of misclassified healthy bearings, nor any faulty bearings falsely classified as healthy.

Case Study 2: Paderborn University Bearing Fault Dataset
This case study contains two parts. In part 1, the same CNN configurations used in Section 4 are trained with data from artificially damaged bearings and tested with bearings with real damage incurred during accelerated lifetime testing. In part 2, these CNN configurations are again used with training and testing datasets both originating from artificially damaged bearing experiments. The purpose of part 1 is to evaluate the cross-domain applicability of the studied CNN configurations in a situation that reflects a real scenario. Part 2 aims to determine the extent to which the inaccuracy observed in part 1 can be attributed to the domain difference. This case study will also reveal whether the trends in hyperparameter selection for each input type are consistent across these two parts.

Dataset Description
The Paderborn University bearing dataset [26] is another popular benchmarking dataset used for bearing fault diagnosis algorithms. An important differentiating characteristic of this dataset is that it includes bearings with seeded faults as well as bearings with natural faults. Three different methods are used to simulate bearing damage: electric discharge machining, drilling (various diameters), and electric engraving. Outer race and inner race fault types are studied. Training with seeded faults and testing with natural faults provides an analogue for situations when algorithms developed on lab data are deployed in industry where only natural faults exist.

Data Preparation
For part 1, the faults in the testing set developed naturally during an accelerated lifetime test before being measured. The dataset is split into training and testing with the same scheme as used by Chen et al. [11], so that the performance of various traditional CNNs can be directly compared to that of their novel CNN architecture. In the second part, training and testing are both done with bearings having artificial damage. Tables 6 and 7 indicate which experimental runs are included in each dataset for the two parts. The data augmentation procedure described for Case Study 1 is reused for Case Study 2 with the same overlap and input durations. Since the experiments in this dataset use different rotational speeds and sample rates, the input durations used do not correspond to integer numbers of shaft rotations here. However, reusing the same input durations allows for direct comparison of the same CNN configurations between datasets.

Part 1: Artificial to Natural Damage
The results in Tables 8-11 show the accuracy of the previously studied CNN configurations for the cross-domain task described above. All CNN configurations have significantly lower testing accuracy compared to those found using the CWRU dataset, though training accuracy remains high. While this is certainly attributable, to some extent, to the disparity between measurements of artificial and natural bearing damage, other factors might make the Paderborn dataset more difficult to learn from compared to the CWRU dataset. Foremost among these is the smaller number of experimental runs from which to learn, as this leads to dataset sparsity.
Unlike Case Study 1, the envelope spectrum appears to be the poorest choice of preprocessor based on overall accuracy. The remaining three preprocessing methods seem approximately equal, though all perform too poorly to be particularly useful.
The results of Part 1 broadly confirm the findings of Chen et al. [11], who demonstrate that classic implementations of CNNs are not able to learn enough useful features from artificially damaged bearings to accurately diagnose real faults.

Part 2: Artificial to Artificial Damage
The results in Tables 12-15 demonstrate the improvement in accuracy when the training and testing data both come from artificially damaged bearings. Part 2 eliminates the underlying domain difference between natural and artificial bearing damage measurements by using artificially damaged bearings for both training and validation. Validation accuracy improves overall, with the notable exception of CNNs trained with envelope spectrum data. This indicates that the inherent differences between artificial and natural bearing damage are a significant, but not sole, contributor to the poor accuracy achieved in Part 1.
As with Case Study 1, results obtained using spectrogram preprocessing appear to be the least sensitive to changes in kernel and input sizes. Other trends linking accuracy and hyperparameter values differ between Case Study 1 and Case Study 2. This suggests that different underlying factors, including experimental procedure and physical setup, influence hyperparameter optimization, and implies that CNNs may need to be tuned for different industrial applications unless a universally applicable architecture and training scheme is developed.

Case Study 3: 2009 PHM Challenge Gear Fault Dataset
This case study explores the effectiveness of the previously described CNN configurations for diagnosing various health conditions of a two-stage gear box using the 2009 PHM Challenge dataset [27]. The objective of this case study is to determine whether architectures apparently useful for bearing fault detection by CNNs are also able to perform gearbox fault diagnosis.

Dataset Description
This dataset contains eight unique health states for the gearbox, each having a different combination of subcomponents that are either healthy or artificially damaged. This gives rise to health states with multiple faults, leading to a more complicated diagnosis problem. Table 16 summarizes the states of the various gears, bearings, and shafts for each of the health states. The dataset contains two channels of vibration measurements and a tachometer signal. For this experiment, only the first channel is used. For all states, four seconds are sampled at a sampling frequency of 66.67 kHz.

Data Preparation
The same data augmentation procedure described above is used here to generate many more training and testing samples of different sizes from the original measurements. Again, k-fold cross-validation is used, with k = 3. One difference in procedure was required for envelope spectrum preprocessing: the data was normalized to have a mean of zero and a standard deviation of one. This was necessary to achieve network convergence under the same learning parameters as all other preprocessing methods.
Unlike the datasets used in the previous two case studies, there is only one measurement obtained for each health state. This means that training and testing data cannot be obtained from different sets of damaged components. This leads to a simpler machine learning problem for which less generalization is needed to achieve high testing accuracies. The results below indeed show that the testing accuracies achieved here are much higher than the previous case studies.

Results and Discussion
Tables 17-20 present the results for Case Study 3 in the same format used above.  Despite the increased complexity of the mechanical system studied, the diagnostic abilities of the studied CNNs appear much greater for this dataset. This is almost certainly a result of the fact that the data from all experimental runs appear in both training and testing, even if unique samples created during data augmentation do not appear in both datasets. As with the other case studies, different preprocessing methods yield different patterns in which kernel sizes and input sizes perform best.
Using the raw time input type appears to be successful only when larger kernels are used. Moreover, performance is somewhat improved by using mid-sized inputs. FFT input types appear to work very well irrespective of kernel size and benefit somewhat from larger inputs. The results for the envelope spectrum contrast starkly with those of the other preprocessing types: the average testing accuracy achieved is much poorer for the configurations trained here. It appears that a larger input duration than studied here would be needed for the envelope spectrum to be accurate, perhaps due to the longer period of repeating gear meshing envelope signatures. Training accuracies improved with larger input durations for all input types and kernel sizes, though corresponding validation accuracies did not necessarily follow. Raw time and envelope spectrum showed the most significant overfitting, especially with long input durations and small kernels. Spectrogram preprocessing gives strong results overall, with performance peaking when paired with mid-sized kernels and larger inputs.

Discussion
Few insights can be extrapolated with respect to which CNN configurations are best suited to each input type. The only commonality across all three datasets studied is the low sensitivity to hyperparameter selection for models trained with spectrograms, which makes them the safest choice if extensive hyperparameter optimization is not possible. This is not necessarily revealed when only the best overall configurations from each case study are considered, as in Table 21, though a wide range of input durations and kernel sizes is represented there. The three datasets studied yield different patterns and vastly different accuracies from the same CNNs. There is a complex relationship between input type and hyperparameter optimization that varies for each dataset. Case Study 2 demonstrates the ineffectiveness of classical CNNs for cross-domain problems involving training with artificially damaged bearings and testing with real damage, suggesting that they would not be adequate for real industrial applications; this remains an important area for future work.
The comparatively high performance of the studied CNNs with the PHM 2009 dataset in Case Study 3 shows how extracting training and testing samples from the same experimental runs leads to a far easier problem and inflates validation accuracies. This supports the findings of Pandhare et al. [17], who demonstrate that diagnostic accuracy can drop from 95-100% when experimental data is mixed to approximately 60% when the experimental data contained in the training and testing datasets is mutually exclusive. It seems probable that researchers reporting greater accuracies on the CWRU dataset using classically shallow CNNs achieve such results by "contaminating" validation datasets in this way. A frequent claim in papers such as this is that the application of deep learning for machine fault diagnosis will eliminate the need for a human expert, presumably an expert on the mechanical system being diagnosed. However true that claim might be, it neglects the fact that a different sort of expert is needed to create a viable solution using deep learning methods. Clearly, useful diagnoses cannot be achieved without a preprocessor and CNN architecture well suited to the target domain. It is also clear that misleadingly high diagnostic accuracies can be achieved if data augmentation is performed on the experimental data before the data is randomly shuffled and split into training and testing datasets, leading to contamination of the testing set.
If the diagnostic problem is fairly constructed to reflect real-world challenges, it appears that classically shallow CNNs are, irrespective of kernel size and input size, not able to perform in a manner that would motivate industrial implementation. However, the present study is not an exhaustive exploration of CNNs; other hyperparameters can be altered to give many more network architectures. Future work should focus on designing algorithms that can learn from simulated faults to provide accurate diagnosis of real faults and addressing overfitting problems.