Intelligent Fault Diagnosis of Industrial Bearings Using Transfer Learning and CNNs Pre-Trained for Audio Classification

The training of Artificial Intelligence algorithms for machine diagnosis often requires a huge amount of data, which is scarcely available in industry. This work shows that convolutional networks pre-trained for audio classification already contain knowledge for classifying bearing vibrations, since both tasks share the need to extract features from spectrograms. Knowledge transfer is realized through transfer learning to identify localized defects in rolling element bearings. This technique provides a tool to transfer the knowledge embedded in neural networks pre-trained for fulfilling similar tasks to diagnostic scenarios, significantly limiting the amount of data needed for fine-tuning. The VGGish model was fine-tuned for the specific diagnostic task using vibration samples. Data were extracted from the test bench for medium-size bearings specially set up in the mechanical engineering laboratories of the Politecnico di Torino. The experiment involved three damage classes. Results show that the model pre-trained using sound spectrograms can be successfully employed for classifying the bearing state through vibration spectrograms. The effectiveness of the model is assessed through comparisons with the existing literature.


Introduction
The monitoring of rotating systems through bearing sensing is part of the implementation of predictive maintenance strategies. The deployment of such approaches is motivated by the resulting benefits for industrial rotors in terms of cost reduction and increased production [1]. A primary concern of predictive maintenance and condition monitoring is the fault diagnosis of bearings, for two main reasons. First, durability assessments of rolling bearings are affected by significant uncertainties [2], given the complex interaction between a variety of parts. Additionally, it is well established that bearings are key nodes for retrieving information on the whole mechanical system [3]. In this context, the analysis of vibration signals represents one of the most informative tools for the assessment of machine conditions [4].
The past thirty years have seen increasingly rapid advances in this field thanks to the development of numerous signal processing techniques for fault identification. For instance, the literature on envelope analysis has been considerably developed [2,4–11]; this technique has shown its effectiveness in benchmark cases [12] and is being implemented in industry for condition monitoring purposes. The outcomes of this kind of signal processing tool have the benefit of being highly interpretable, since the models' assumptions are sharply identifiable. On the other hand, the outcomes may be user dependent. The extraction of diagnostic information from vibration signals is often affected by the assumptions of the identification models and by the user's experience. For instance, choosing an optimal demodulation band [13–17] naturally implies an inherent arbitrariness.

This paper discusses the application of a transfer learning methodology to the test rig available at Politecnico di Torino [58], which was designed to accommodate medium-sized bearings of industrial interest. To the best of the author's knowledge, this is the first work including experiments conducted on medium-size industrial bearings with localized faults. Additionally, this paper aims to explore the fault diagnosis capabilities of CNNs pre-trained for audio classification. Namely, the VGGish convolutional network [37,59,60] is employed to perform bearing fault diagnosis. The VGGish network was originally trained for large-scale audio classification by using millions of audio samples extracted from YouTube® videos [37]. In this work, the pre-trained model is fine-tuned with a few thousand vibration records retrieved under different working conditions of the machine. Such a knowledge transfer is inspired by the idea that the search for distinctive fault features in vibration spectrograms is conceptually similar to the identification of distinctive features in sound spectrograms [56].
The results corroborate this hypothesis and show that the feature extraction capabilities of the pre-trained VGGish network can be effectively transferred to fault diagnosis scenarios. Thanks to the use of TL, a large-scale and high-potential classification model can be reused for the purpose of machine diagnosis by fine-tuning with a very small dataset. Furthermore, it is shown that the pre-trained VGGish model outperforms the VGGish framework trained from scratch in the presence of a thousand-sample set. Additionally, it is found that, for the case under analysis, the VGGish performs better than models pre-trained for image recognition.
The overall structure of this paper takes the form of five sections. The introductory section presents the topic of intelligent fault diagnosis of industrial bearings and provides the motivations for this investigation with respect to the existing literature. The second section gives an insight into the AI methodologies involved in this study: CNNs, transfer learning and the VGGish model are presented. A description of the test rig for industrial bearings and of the vibration dataset is provided in the third section, whereas the fourth section includes results, discussion and implications. Finally, the fifth section provides the concluding remarks.

Transfer Learning for Bearing Fault Diagnosis
This section provides a short summary of the main AI devices involved in this study. CNNs, transfer learning and the VGGish audio feature extractor are introduced.

Convolutional Neural Networks (CNNs)
The typical structure of a CNN (Figure 1) includes a sequence of layers in which several algebraic operations take place. This claim is valid for the vast majority of deep learning approaches, but CNNs are differentiated by their ability to handle multidimensional data. That is one of the reasons why the introduction of CNNs [34,61] completely transformed image-based AI. A wide range of research areas thereafter took advantage of these structures. Indeed, as previously described, sound spectrograms were employed to train CNNs for audio classification [37]. The convolution operation mainly consists of applying filter kernels to the input data, whereas pooling layers carry out data downsampling. Finally, fully connected layers flatten multidimensional data [18,20,55,56] into one-dimensional vectors. For classification tasks, the last fully connected layer returns the output class. Convolutional and fully connected layers also implement nonlinear effects by means of activation functions. The Rectified Linear Unit (ReLU) is one of the possible activation functions for introducing nonlinearities in the output of convolutional layers [56].
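The layer operations described above can be sketched in a few lines of numpy. This is a toy illustration, not the network used in this work: the 8 × 8 input patch, the edge-detecting kernel and the 2 × 2 pooling size are arbitrary choices made for the example.

```python
import numpy as np

def conv2d(x, kernel):
    """Valid 2-D filtering (cross-correlation, as in most deep learning frameworks)."""
    kh, kw = kernel.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Rectified Linear Unit: zeroes negative activations."""
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Downsampling by taking the maximum over non-overlapping size x size blocks."""
    H2, W2 = x.shape[0] // size, x.shape[1] // size
    return x[:H2 * size, :W2 * size].reshape(H2, size, W2, size).max(axis=(1, 3))

# a toy 8x8 "spectrogram" patch passed through one conv -> ReLU -> pool -> flatten stage
patch = np.arange(64, dtype=float).reshape(8, 8)
kernel = np.array([[-1.0, 1.0], [-1.0, 1.0]])   # simple horizontal-gradient filter
features = relu(conv2d(patch, kernel))          # 7x7 feature map
pooled = max_pool(features, size=2)             # 3x3 map after 2x2 max pooling
flat = pooled.ravel()                           # 1-D vector for the fully connected layers
```

In a real CNN, many such filters are stacked in each layer and their weights are learned during training rather than fixed by hand.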
The training process aims to optimize a specific loss function, which can be interpreted as a measure of the distance between the predictions of the model and the ground truth. For instance, the cross-entropy of Equation (1) is the typical loss function employed for classification tasks with mutually exclusive classes,

L = − ∑_{m=1}^{M} ∑_{n=1}^{N} y_{mn} ln(ŷ_{mn})    (1)

where:
• M is the number of observations;
• N is the number of classes;
• ŷ_{mn} is the network output for the m-th observation and the n-th class;
• y_{mn} is the ground truth for the m-th observation and the n-th class.

At the end of the training stage, the weights of the network filters are optimized for the specific task and contain the knowledge related to the latter. In particular, the stacked convolutional layers learn hierarchical representations of the input data. Convolutional layers are mainly devoted to feature extraction. For deep learning models, the extraction process is automated and does not require manual feature selection. Moreover, deeper layers correspond to more abstract features. In other words, the convolutional layers learn to extract discriminating features of the input data during training. The extracted feature maps are condensed in the fully connected layers, which terminate in the network output for the classification.
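A minimal numerical illustration of the cross-entropy loss follows (here averaged over the M observations, a common normalization; the one-hot labels and predicted probabilities are invented for the example):

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean categorical cross-entropy over M observations and N mutually exclusive classes.

    y_true: (M, N) one-hot ground truth; y_pred: (M, N) predicted class probabilities.
    """
    y_pred = np.clip(y_pred, eps, 1.0)          # avoid log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# three observations, three classes (e.g. Normal, IR, OR)
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
y_good = np.array([[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.05, 0.05, 0.9]])
y_bad = np.array([[0.3, 0.4, 0.3], [0.4, 0.3, 0.3], [0.3, 0.4, 0.3]])

loss_good = cross_entropy(y_true, y_good)
loss_bad = cross_entropy(y_true, y_bad)   # worse predictions -> larger loss
```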


Transfer Learning
Transfer learning covers a wide range of techniques aimed at reusing the knowledge already contained in AI models. A complete exploration of all the TL methodologies is beyond the scope of this study; a comprehensive insight is given by the works of Pan and Yang [42] and Lei et al. [20]. Parameter-based TL is considered for the purpose of this investigation. Namely, it is assumed that the knowledge transfer can be carried out by reusing the parameters of a pre-trained model. In the case of CNNs, the parameters are represented by the network weights, which enclose the knowledge. Thanks to the data from the source domain, the pre-trained network acquires the feature extraction capabilities for accomplishing the specific source task. The knowledge is thus transferred to the target domain of interest to fulfil a target task. Figure 1 shows a typical transfer learning framework for CNNs. Some or all of the feature extraction layers are frozen, whereas the last layers are replaced with new ones. The weights of the latter are optimized by fine-tuning the model in the target domain. One of the most fascinating aspects of this technique is related to the amount of training data. Considering that the actual training involves few layers, the amount of training data is extremely low with respect to training from scratch. However, the potential of extracting complex features is preserved in the frozen layers.
This study investigates the case of knowledge transfer from an audio feature extractor to the assessment of bearing health state. The methodology is outlined in Figure 2. The model A is pre-trained for audio recognition. For instance, the label "Guitar" is assigned to guitar sounds. The ability of extracting spectrogram features is transferred to the domain of vibration signals by reusing part of the model A. Then, the model B is fine-tuned by employing a reduced amount of target data. As an example, the target task could be the assignment of the label "Bearing fault" to the vibration signal.
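The parameter-based TL scheme described above can be sketched as follows. This is a toy illustration, not the actual VGGish pipeline: an assumed fixed projection stands in for the frozen convolutional stack, the synthetic labels are generated for the example, and only a new softmax last layer is trained by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_extractor(x):
    """Stand-in for the pre-trained convolutional stack: a fixed, non-trainable
    random projection from raw inputs to an 8-dimensional feature embedding."""
    W_frozen = np.random.default_rng(7).normal(size=(x.shape[1], 8))
    return np.tanh(x @ W_frozen)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# toy target-domain data: 60 samples with 20 raw features each
X = rng.normal(size=(60, 20))
F = frozen_extractor(X)               # embeddings produced by the frozen layers
y = np.argmax(F[:, :3], axis=1)       # synthetic 3-class labels, learnable from F
Y = np.eye(3)[y]                      # one-hot ground truth

# fine-tuning: only the weights of the new last layer are updated
W_new = np.zeros((8, 3))
for _ in range(500):
    P = softmax(F @ W_new)
    W_new -= 0.5 * F.T @ (P - Y) / len(X)   # gradient of mean cross-entropy

accuracy = np.mean(np.argmax(F @ W_new, axis=1) == y)
```

The point of the sketch is the parameter count: only the 8 × 3 last-layer weights are trained, which is why far less target-domain data is needed than for training the whole stack from scratch.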

VGGish Network for Bearing Health Monitoring
An audio feature extractor is a CNN designed to unpack the most distinctive features detectable in an audio spectrogram. These features are condensed in a low-dimensional space, where a classifier can operate more conveniently to discern classes. This process is also known as feature embedding. The classifier can also be constituted of a series of fully connected layers attached to the end of the feature extractor. The author chose to transfer knowledge from an audio CNN because those networks can already identify spectrogram features, wherever the signal originates. However, the literature shows examples of knowledge transfer from image classification networks [55] to benchmark vibration datasets.
The VGGish architecture [37] summarized in Table 1 contains 62 million weights. The model was originally trained by Hershey et al. [37] in 2017 by using 70 million YouTube clips, for a total amount of 5.24 million hours and 30,871 audio labels. The network input is constituted of a 96 × 64 mel spectrogram [62,63], which is a time-frequency transformation typically applied to audio signals. The pre-trained framework can be used in two ways. First, it can act as a feature extractor to embed audio in the 128-element feature vector that feeds a classification model. Alternatively, the architecture can be part of a larger model that needs fine-tuning. Figure 3a and Figure 3b show examples of low-level and medium-level features, respectively, learned by the pre-trained VGGish. It is noted that deeper layers correspond to more complex spectrogram features.

The preprocessing steps result in a 96 × 64 patch, in accordance with the input of the network. The use of the mel spectrogram [62,63] is quite common in audio processing. Indeed, the mel scale is perceptually relevant for human hearing, which is more sensitive at lower frequencies. In this study, the same preprocessing steps are applied to vibration signals in order to enhance the similarities between the source and the target domain. According to the author of this work, it is reasonable to assume that this circumstance fosters knowledge transferability.

TL was applied to identify the bearing health conditions. For this purpose, the last layer of the VGGish was replaced with a new one. Namely, the regression layer was replaced with a fully connected layer with three neurons for classifying the three bearing health states. Next, a classification layer was added. Since the feature extraction layers remained unchanged, it can be stated that the original VGGish feature embedding fed the classification layer.
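As an illustration of this preprocessing idea, the following numpy sketch computes a log-mel spectrogram from a synthetic 1.6 s chunk sampled at 20,480 Hz. The FFT length, hop size and tone frequency are arbitrary choices for the example and do not reproduce the exact VGGish front-end parameters.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(x, fs, n_fft=512, hop=256, n_mels=64):
    """Magnitude STFT followed by a triangular mel filterbank and log compression."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2       # (frames, n_fft//2 + 1)

    # triangular filters, equally spaced on the mel scale between 0 Hz and fs/2
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fbank[m - 1, k] = (k - l) / max(c - l, 1)     # rising edge
        for k in range(c, r):
            fbank[m - 1, k] = (r - k) / max(r - c, 1)     # falling edge

    return np.log(spec @ fbank.T + 1e-10)                 # (frames, n_mels)

fs = 20480                                  # sampling rate used in this work
t = np.arange(int(fs * 1.6)) / fs           # one 1.6 s vibration chunk
x = np.sin(2 * np.pi * 180.0 * t)           # toy signal: a 180 Hz tone
mel = log_mel_spectrogram(x, fs)            # 64 mel bands per time frame
```

Note how the mel filterbank widens with frequency, so low-frequency content (where human hearing, and many bearing fault signatures, are concentrated) is resolved more finely.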
Moreover, a dropout layer was added before the last fully connected layer. Dropout layers randomly set activations to zero with a given probability during training in order to avoid overfitting. In this case, the dropout probability was set to 50%. When the training was run, only the weights related to the new layers were updated. Replacing only the last layer and adopting the dropout strategy proved to be the most effective approach for the analyzed case. Table 2 reports the set of hyperparameters adopted in this work.
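The dropout mechanism can be illustrated as follows (a generic "inverted dropout" sketch on an arbitrary activation batch, not the Matlab implementation used in this work):

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(x, p=0.5, training=True):
    """Inverted dropout: zero each activation with probability p during training and
    rescale the survivors so the expected activation is unchanged at inference time."""
    if not training:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

a = np.ones((1000, 128))                 # a batch of 128-dimensional activations
d = dropout(a, p=0.5)

dropped_fraction = np.mean(d == 0)       # close to 0.5
mean_kept = d.mean()                     # close to 1.0 thanks to the rescaling
```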

Vibration Dataset for Industrial Bearings
The TL methodology was applied to the dataset generated by a test rig for industrial bearings available at Politecnico di Torino [58]. To the best of the author's knowledge, the existing literature provides scant evidence of deep learning strategies applied to datasets covering medium-size bearings (360 mm outer diameter). Three health states were analyzed: normal condition, inner race damage and outer race damage. This section provides a description of the test rig, of the experimental activity and of the dataset construction.

Description of the Test Rig
The test rig presented in reference [58] (Figure 4) can house up to four bearings with outer diameters ranging from 280 mm to 420 mm. A full description of the test rig goes beyond the scope of this work, since a comprehensive outline of the design activity and equipment is already provided in [58]. A 30 kW three-phase induction motor is controlled by an inverter. The motor is connected to the shaft by means of a rubber joint. The shaft rotation is sustained by the two main bearings. The so-called "self-contained box" houses the test bearings, which can be loaded with up to 200 kN thanks to oil actuators. Two air-oil pumps control the radial and the axial actuators, respectively, by converting pneumatic pressure into oil pressure (up to 500 bar). Thus, the radial and the axial loads can be applied independently. The lubrication system consists of an external control unit that monitors the oil jet system. The ISO VG 150 oil is injected with a flow rate of 2.5 L/min and a pressure of 6 bar. The layout of the self-contained box (Figure 5) provides the advantage of balancing the loads of the actuators through the elastic deformation of the box itself. Thus, the test loads are internally accommodated and the load circuit is "self-contained". Four SKF CMS 2200T sensors are fitted to the four adapters in order to measure acceleration and temperature. The main features of the vibration sensors are reported in Table 3. The condition monitoring framework includes an LMS Scadas III data acquisition system. The latter is interfaced with a laptop for signal acquisition and post-processing.

Experimental Activity and Dataset Construction
This study takes into account three health states for the spherical roller bearing SKF 22240 CCK/W33 (Figure 6a). The bearings have an inner diameter of 200 mm with a 1:12 tapered bore and an outer diameter of 360 mm. In addition to the normal state, inner race (IR) damage (Figure 6b) and outer race (OR) damage (Figure 6c) are considered. The faults have a diameter of 2 mm and a depth of 0.5 mm. The damages were mechanically machined on the race that is most loaded when an axial load is applied. In order to apply the damages, the bearings were dismounted. Then, the faults were drilled on the race of interest by employing a solid carbide drill with a diameter of 2 mm. Although the produced faults are representative of localized defects in rolling bearings, the extracted vibration data obviously cannot represent the complete scenario of defects detectable in rolling bearings.
The experiment involved the analysis of four load cases at 10 different shaft speeds, as reported in Table 4. Thus, 40 signals were extracted for each health state, totaling 120 signals. The vibration signals were acquired by means of the data acquisition system and sampled at 20,480 Hz. Each acquisition lasted 30 s. Therefore, 1 hour of signal acquisition was taken into account.

The dataset was constructed by extracting non-overlapping chunks from the vibration signals (Table 5). The duration of each chunk was 1.6 s. Therefore, 18 chunks were extracted from each signal. The resulting dataset consisted of 2160 samples equally balanced among the three classes: Normal, IR and OR. The data labelling for the supervised learning scheme was achieved as a natural consequence of the experiment. The amount of data is remarkably low for the use of large deep learning architectures. However, fault diagnosis can be performed thanks to TL. The dataset was randomly split in order to test the applicability of the proposed method.
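The chunking procedure can be sketched as follows. The signal itself is synthetic; the sampling frequency, record length and chunk duration are those reported above.

```python
import numpy as np

fs = 20480                     # sampling frequency [Hz]
record_len = 30 * fs           # each acquisition lasts 30 s
chunk_len = int(1.6 * fs)      # 1.6 s chunks -> 32,768 samples each

signal = np.random.default_rng(0).normal(size=record_len)   # stand-in acquisition

n_chunks = record_len // chunk_len                          # 18 non-overlapping chunks
chunks = signal[:n_chunks * chunk_len].reshape(n_chunks, chunk_len)
```

With 120 acquisitions, 18 chunks each, this yields the 2160 samples of the dataset.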
Table 6 reports the information regarding the data split. A typical deep learning splitting strategy was applied: 80% of the data were used for fine-tuning the VGGish model, 10% of the data constituted the validation set, whereas the remaining 10% were used to test the method with new data.
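A sketch of the 80/10/10 random split follows (index-based, with an arbitrary seed; the actual splitting procedure used in this work may differ in detail):

```python
import numpy as np

rng = np.random.default_rng(1)

n_samples = 2160                          # 120 signals x 18 chunks
idx = rng.permutation(n_samples)          # random shuffle of sample indices

n_train = int(0.8 * n_samples)            # 1728 samples for fine-tuning
n_val = int(0.1 * n_samples)              # 216 samples for validation

train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]          # remaining 216 samples for testing
```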

Results and Discussion
This paper investigates the capabilities of CNNs pre-trained for audio classification to perform bearing fault diagnosis. It is argued that these networks are endowed with highly specific knowledge for extracting spectrogram features. For this purpose, the vibration dataset including damaged industrial medium-sized bearings was produced by means of a proper experimental activity conducted on a specifically conceived test rig. A detailed description of the hardware is provided in reference [58]. As anticipated in Section 2.3, the VGGish convolutional architecture can act as a spectrogram feature extractor, as long as a proper preprocessing is carried out. Figure 7a–c shows examples of normalized vibration signals for the normal state, IR and OR damages, respectively. Figure 8a–c shows the corresponding mel spectrograms obtained through the preprocessing. Finally, Figure 9a–c shows the corresponding 128-dimensional feature embeddings output by the pre-trained VGGish feature extractor. Essentially, the information dispersed in the multifaceted mel spectrograms is translated and synthesized into a low-dimensional feature space via feature embedding. The classifier can discern classes by learning the differences between feature embeddings. In this particular case, the feature embedding corresponds to a vector containing 128 elements.

The model was fine-tuned using the hyperparameters reported in Table 2. The training time was 936 s on a standard laptop without GPU acceleration (Intel® Core i7-10510U CPU @ 1.80 GHz). The model was implemented in the Matlab® environment by means of the machine learning, deep learning and audio toolbox libraries. It is worth noting that the original VGGish structure was trained on multiple GPUs for 184 hours [37]. Figure 10 shows the behavior of the loss functions during the training conducted according to the parameters in Table 2.
In particular, the validation set served to monitor potential overfitting by analyzing the trend in the validation loss. The maximum number of epochs was set to four (216 iterations), since it was observed that the training process stabilized at this point and overfitting did not occur, though it was detectable during the first two epochs. The accuracies reported in Table 7 reveal the applicability of the diagnosis model to new test data. The complete confusion matrix resulting from the test data is shown in Figure 11. A single normal sample is predicted as OR damaged and a single OR sample is predicted as normal. Therefore, the classifier showed high precision and recall, as reported in Table 8.
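The per-class precision and recall can be derived directly from the confusion matrix. The matrix below is a reconstruction, assuming 72 test samples per class (a balanced 10% split of the 2160 samples) and the two misclassifications described above:

```python
import numpy as np

# reconstructed test-set confusion matrix (rows: true class, columns: predicted class)
# class order: Normal, IR, OR
C = np.array([[71, 0, 1],
              [0, 72, 0],
              [1, 0, 71]])

accuracy = np.trace(C) / C.sum()           # overall fraction of correct predictions
precision = np.diag(C) / C.sum(axis=0)     # per class: correct / predicted-as-class
recall = np.diag(C) / C.sum(axis=1)        # per class: correct / actually-in-class
```

Under this assumption, accuracy is 214/216 ≈ 99.1%, with precision and recall of 71/72 for the Normal and OR classes and 1.0 for the IR class.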
Furthermore, the proposed model was compared with the VGGish model trained from scratch, the YAMNet model [56] and the VGG16 model pre-trained on ImageNet [34,35] proposed by Shao et al. [55]. Table 7 shows the accuracies obtained for the different models, whereas Table 8 reports the precision and the recall for the different classes. The VGGish trained from scratch reaches poor diagnosis accuracies and considerable overfitting phenomena occur. This is due to the fact that the original VGGish architecture was trained on millions of samples. Therefore, the structure is inherently unsuitable for correctly learning hierarchical features over a few thousand training samples. Given the limited amount of training data, millions of network weights are extremely prone to overfitting the training set. For this reason, TL is the most effective strategy. The YAMNet model [56] showed promising accuracies and reduced training times, but some overfitting was detectable. Finally, the VGG16 model [55] was trained by employing wavelet time-frequency images. The training of the model under the conditions reported in [55] required GPUs and was computationally expensive. The resulting metrics show that the VGG16 framework pre-trained on ImageNet is not suitable for the analyzed case.
According to the author of this work, this is due to the fact that several convolutional layers of the model [55] must be retrained; consequently, more training data are required. On the other hand, few layers of the pre-trained VGGish and YAMNet need fine-tuning, since audio classification models are already capable of extracting distinctive spectrogram features. In contrast, the knowledge contained in networks pre-trained on the ImageNet dataset cannot be considered highly specific for spectrogram recognition.
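The layer-freezing strategy underlying this comparison can be illustrated with a minimal NumPy sketch. Here a fixed random projection stands in for the pre-trained convolutional stack (it is not the actual VGGish backbone), and only a small classification head is fitted on synthetic data; all dimensions, class structures and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for flattened spectrogram patches: 3 well-separated
# classes, 60 samples each. Purely illustrative data, not the paper's.
n_per_class, n_in, n_feat, n_classes = 60, 256, 128, 3
X, y = [], []
for c in range(n_classes):
    centre = rng.normal(size=n_in)
    X.append(centre + 0.3 * rng.normal(size=(n_per_class, n_in)))
    y += [c] * n_per_class
X, y = np.vstack(X), np.array(y)

# "Frozen backbone": weights are fixed and never updated during fine-tuning,
# mimicking a pre-trained feature extractor that outputs a 128-dim embedding.
W_frozen = rng.normal(size=(n_in, n_feat)) / np.sqrt(n_in)
features = np.maximum(X @ W_frozen, 0.0)  # ReLU embedding

# Trainable head: multinomial logistic regression fitted by gradient descent.
W_head = np.zeros((n_feat, n_classes))
one_hot = np.eye(n_classes)[y]
for _ in range(300):
    logits = features @ W_head
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    grad = features.T @ (p - one_hot) / len(y)
    W_head -= 0.5 * grad  # only the head is updated; the backbone stays frozen

accuracy = ((features @ W_head).argmax(axis=1) == y).mean()
print(f"training accuracy with frozen backbone: {accuracy:.2f}")
```

Because the frozen embedding already separates the classes, fitting only the small head suffices, which is the same reason few fine-tuned layers suffice for VGGish and YAMNet.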
These encouraging results indicate that the TL methodology is a valuable approach for the fault diagnosis of bearings. Remarkably, the knowledge contained in a network pre-trained for sound recognition can be reused for condition monitoring tasks. Moreover, the amount of training data required is considerably lower than for a network trained from scratch: the original VGGish network was trained on 70 million audio samples, whereas fewer than 2000 samples were needed to perform fault diagnosis. Therefore, deep learning frameworks endowed with high knowledge content can be exploited without the need for millions of data samples. This remarkable implication follows from the fact that the features extracted by the pre-trained VGGish network already capture the typical structures of spectrograms; only slight adjustments are then needed to adapt the model to the classification of vibration spectrograms. The feature embedding into which the sound spectrograms are translated is therefore convenient for vibration spectrograms as well.
However, this poses an issue for the interpretation of the diagnosis outcomes. Indeed, the 128 features that flow through the classifier have no clear physical interpretation: in this case, acoustically relevant features were able to classify vibrations. In contrast to traditional signal processing tools, where parameters such as the kurtosis, the crest factor and the ball-pass frequencies have a physical meaning, in data-driven fault diagnosis the user does not know what the features actually represent, even though they may work perfectly. It is therefore quite challenging to estimate the variability of the features with respect to changes in the input signals. Additionally, the development of proper interpretability tools is of paramount importance for the correct visualization of domain alignment in transfer learning.
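By contrast, the physically interpretable indicators mentioned above are computed directly from the raw signal. A minimal sketch on synthetic data (the signal parameters are illustrative) shows how a localized defect raises both the kurtosis and the crest factor:

```python
import numpy as np
from scipy.stats import kurtosis

def crest_factor(x):
    """Peak absolute amplitude over RMS; rises when impulsive impacts appear."""
    rms = np.sqrt(np.mean(x ** 2))
    return np.max(np.abs(x)) / rms

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 20000, endpoint=False)
# Healthy: a rotational harmonic plus measurement noise.
healthy = np.sin(2 * np.pi * 50 * t) + 0.1 * rng.normal(size=t.size)
# Faulty: the same signal with periodic impacts from a localized defect.
faulty = healthy.copy()
faulty[::400] += 5.0

print("kurtosis:     ", kurtosis(healthy), kurtosis(faulty))
print("crest factor: ", crest_factor(healthy), crest_factor(faulty))
```

Both indicators increase markedly for the impulsive signal, which is precisely the kind of physical reasoning that the 128 learned features do not expose.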

Conclusions
This work proposes a transfer learning methodology for the fault diagnosis of industrial bearings. The VGGish architecture, originally pre-trained for sound classification on 70 million audio samples, is fine-tuned using fewer than 2000 vibration samples. The experimental data come from the test set-up designed at the Politecnico di Torino for the monitoring of medium-size industrial bearings. The experiment involved three health states over ten speeds and four load cases. Vibration data were classified with 99.07% accuracy, with a training time of 936 s. It is concluded that:

•
Deep learning CNNs are promising approaches for industrial condition monitoring;
•
The potential embedded in large deep learning architectures can be exploited for bearing fault diagnosis using small datasets, as long as transfer learning is applied;
•
Transfer learning drastically reduces the computational demand of applying deep learning to fault diagnosis tasks;
•
The acoustical features extracted by the VGGish network are also relevant for classifying bearing vibrations;
•
CNNs pre-trained for sound classification are more efficient and accurate than models pre-trained for image recognition.
The main limitation is the difficulty of interpreting the extracted features. Although this study exhibits promising results, further investigations are needed to apply this concept in industry, where fault data are scarce and balanced classes cannot be assumed. The transfer of knowledge to unseen working conditions or different machines should be investigated as well.