Deep Transfer Learning for Machine Diagnosis: From Sound and Music Recognition to Bearing Fault Detection

Today’s deep learning strategies require ever-increasing computational efforts and demand very large amounts of labelled data. Providing such expensive resources for machine diagnosis is highly challenging. Transfer learning recently emerged as a valuable approach to address these issues: the knowledge learned by deep architectures in different scenarios can be reused for the purpose of machine diagnosis, minimizing data collection efforts. Existing research provides evidence that networks pre-trained for image recognition can classify machine vibrations in the time-frequency domain by means of transfer learning. So far, however, there has been little discussion about the potential of networks pre-trained for sound recognition, which are inherently suited for time-frequency tasks. This work argues that deep architectures trained for music recognition and sound detection can perform machine diagnosis. The YAMNet convolutional network was designed to serve extremely efficient mobile applications for sound detection, and it was originally trained on millions of samples extracted from YouTube clips. That framework is employed here to detect bearing faults on the CWRU dataset. It is shown that transferring knowledge from sound and music recognition to bearing fault detection is successful. The maximum accuracy is achieved using a few hundred samples for fine-tuning the fault diagnosis model.


Introduction
Although regular and scheduled maintenance strategies are still employed in many industrial contexts, the need to rely on condition-based monitoring for machine health management has become increasingly pronounced for several reasons. Industrial rotating systems are ever more complex, and wear and fatigue life estimators may not be accurate enough to properly schedule maintenance. Indeed, the useful life of some machine components is characterized by a marked statistical scatter, as in the case of rolling element bearings (REBs) [1]. On account of these aspects, early scheduled maintenance is likely to increase avoidable machine downtimes, whereas late scheduled maintenance leads to an unacceptable number of failures [2]. Moreover, the structural health of rotors and that of REBs affect each other in actual operating conditions, given the underlying coupling between their dynamic behavior. These issues hardly comply with the requirements of modern industry, which increasingly tends to embrace the advantages offered by condition-based monitoring in terms of cost savings and the attainment of production targets [3]. Machine fault diagnosis concerns the study of techniques aimed at detecting, isolating, and identifying machinery faults on the basis of monitoring data [4,5].
The last four decades have seen a growing interest in this field within research and industry environments. In particular, signal processing methods have been extensively examined to uncover the links between vibration data [6][7][8][9][10][11][12][13] or acoustic emissions [14][15][16] and the REB health state. Typically, these approaches stem from human reasoning, which infers expected particularities of the data in light of the physical phenomena taking place in damaged REBs or, more generally, in defective industrial equipment. Nevertheless, monitoring data are simultaneously influenced by a variety of factors, such as noise and structural resonances. The superimposition of several sources of excitation further affects vibration responses. It may therefore occur that latent patterns are hardly detectable in diagnostic features by human inference alone, especially in real working conditions. For this reason, artificial intelligence (AI) has found fertile ground for applications in vibration-based diagnostics and prognostics.
Research on intelligent fault diagnosis (IFD) [5] has consistently populated the scientific literature since the beginning of this century. At first, machine learning approaches such as the support vector machine (SVM) [17][18][19][20] were mostly investigated. Later, deep learning and neural networks [21] revealed their capability of automatically identifying intricate yet discriminative features in large image datasets [22], with which traditional machine learning algorithms struggled. IFD soon took advantage of these AI tools with the application of autoencoders [23,24], convolutional neural networks (CNNs) [25][26][27] and long short-term memory networks (LSTMs) [28] for machinery diagnosis tasks. However, the encouraging results achieved by AI in fault diagnosis present some limitations: lack of engineering interpretability and lack of a sufficient number of labelled data.
The interpretability of AI models is a current research topic in IFD [5,29], and its insights represent some future trends for research in the upcoming decade. Deep learning has achieved goals in manifold scenarios (e.g., computer vision, medicine, and speech recognition) that were unthinkable only a few decades ago. This is also due to the availability of a sufficient number of labelled data for training deep architectures. However, a significant amount of data for faulty machines is rarely available in practical engineering contexts [5]. In such cases, training processes are difficult to manage, since millions of network parameters are extremely prone to overfit small datasets. Thus, the ability to generalize the learned knowledge is intrinsically undermined. Noteworthy research efforts have brought to light benchmark datasets [30][31][32][33], on which several methodologies were tested. Compared with image datasets, however, it is experimentally challenging to extract millions of samples which meaningfully portray the probability distributions relevant to machinery diagnosis.
Recently, transfer learning (TL) approaches began to address the problem of missing data [5,34]. The core idea of TL relates to the possibility of applying the knowledge learned in a source task to a target task. TL is inspired by the human brain, which can intelligently use the knowledge acquired in previous tasks to face new ones more quickly or more efficiently [34]. TL strategies applied to machine fault diagnosis typically rely on networks pre-trained on images, so as to capture low-level features in pictures drawn from vibration signal processing. This is partially explained by the fact that most of deep learning developed around images and CNNs. Consequently, several deep learning applications resort to images for formalizing problems of various kinds. One of the largest datasets to train AI models for image recognition is ImageNet (https://www.image-net.org/, accessed on 29 October 2021) [35]. Currently, the dataset contains more than 14 million images, and over the past ten years it has been extensively used to train deep learning architectures in extracting discriminative features of images for classifying them.
Applications of knowledge transfer for machine diagnosis using ImageNet are available in the literature. Shao et al. [36] successfully classified machine failures using neural networks pre-trained on ImageNet. Cao et al. [37] showed that a small set of training data is adequate to effectively perform a gear fault diagnosis using deep architectures trained with ImageNet. Zhang et al. [38] and Hasan and Kim [39], instead, employed TL to diagnose machine failures by transferring knowledge between different working conditions. However, no previous studies have investigated knowledge transfer from sound recognition tasks to machine diagnosis. The present work is motivated by the idea that networks pre-trained on audio for sound event detection (SED) [40][41][42] may encapsulate the necessary knowledge to classify REB spectrograms. The main goal of SED is to identify instances of sound events in audio recordings [40,43]. Applications of machine listening are available, for instance, in the fields of traffic monitoring [44], smart rooms [45], autonomous vehicles [46] and healthcare monitoring [47]. Identifying music, instruments or music genres is among the capabilities of SED networks [48,49].
The search for bearing characteristic tones in vibration signals is conceptually similar to pitch identification in sound, speech, and music recognition. Likewise, detecting occurrences of bearing failures through vibrations is not so different from identifying sound events in audio signals. It is therefore reasonable to assume that the knowledge required to label spectrograms originating from faulty bearings is partially enclosed in pre-trained SED networks, since some of these have specifically learned to extract dominant patterns in spectrograms.
Can the knowledge gained in these tasks be transferred to characterize machine vibrations? Is this transfer strategy successful for a small dataset? These are the questions that motivated this investigation. Obviously, vibration signals are not audio acquisitions, but they can be pre-processed in similar ways to comply with SED networks. Indeed, SED networks operate with feature spaces consistent with the human perception of sound. This study shows that this knowledge transfer works efficiently for a small dataset containing vibration signals.
The overall structure of this paper takes the form of six sections. The introduction has presented the condition-based monitoring topic and outlined the advantages and issues of AI in the field of machine diagnosis. An overview of the theoretical background of CNNs and TL is provided in the second section. The third section introduces YAMNet, an efficient CNN for sound detection. This deep architecture is employed in the fourth section to detect bearing faults in a limited data scenario. The findings are discussed in the fifth section, and finally some concluding remarks are provided in the sixth section.

Convolutional Neural Networks (CNNs)
Convolutional neural networks (CNNs) consist of deep layered architectures, in which multidimensional algebra operations are carried out sequentially. These architectures match especially well with multidimensional data, since the spatial information of the latter is preserved while being processed. For instance, the algebraic representation of RGB images needs two dimensions for populating pixel matrices, and a third dimension for the RGB channels. The capabilities of CNNs were originally exploited for image classification tasks [22,50], but nowadays the potential of these deep architectures is acknowledged in numerous fields concerning multidimensional data [25][26][27][42,51]. Generally, CNNs include three types of layers, corresponding to specific operations: convolution, pooling, and fully connected layers (Figure 1). Convolutional layers convolve the input tensors x_{l−1} of the layer (l − 1) by using filter kernels k_l [5,51,52], as reported in Equation (1):

x_l = σ(k_l ∗ x_{l−1} + b_l) (1)

where ∗ is the convolution operator, b_l the bias term of the layer l, and σ an activation function, which introduces nonlinear effects. The Rectified Linear Unit (ReLU) is one of the most common activation functions for convolutional layers: ReLU(x) = max(0, x). The output of the convolutional layer is the feature map x_l. A training process essentially attempts to optimize the elements of the kernel k_l. Such an optimization produces output feature maps that emphasize the most significant attributes for categorizing input data. This aspect is factually the core idea underlying automated feature extraction in CNNs, which learn the dominant patterns in training data. The feature extraction occurring in stacked convolutional layers is a hierarchical process: the first convolutional layers typically learn general and low-level features, whereas deeper layers learn more complex, specific, and high-level features.
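As an illustrative sketch of Equation (1), the following minimal Python code (not drawn from any specific deep learning library; the input array and kernel values are chosen arbitrarily) applies a single convolutional filter followed by a ReLU activation:

```python
import numpy as np

def relu(x):
    # ReLU activation: max(0, x), applied element-wise
    return np.maximum(0.0, x)

def conv2d_valid(x, k, b):
    # Minimal 2D "valid" convolution of Equation (1): x_l = ReLU(k * x_{l-1} + b)
    # (cross-correlation form, as commonly implemented in deep learning libraries)
    H, W = x.shape
    h, w = k.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + h, j:j + w] * k) + b
    return relu(out)

# A vertical-edge kernel applied to a toy "image" with an edge between columns 1 and 2
x = np.array([[0., 0., 1., 1.],
              [0., 0., 1., 1.],
              [0., 0., 1., 1.],
              [0., 0., 1., 1.]])
k = np.array([[-1., 1.],
              [-1., 1.]])
feature_map = conv2d_valid(x, k, b=0.0)  # shape (3, 3); activates only at the edge
```

The resulting feature map responds only where the edge pattern occurs, which is exactly the "emphasizing the most significant attributes" behavior described above; a trained CNN optimizes k to respond to the patterns most useful for the task.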
Figure 2 shows an example of hierarchical feature learning for the well-known CNN AlexNet [22], trained on the ImageNet dataset. Such representations can be produced by passing a random noise image through the network for several iterations. Namely, the input noise is gradually modified by updating the network gradient in order to maximize the activation of a chosen layer. Then, the convolutional structure outputs images that mostly activate the channels of that specific layer. These maps highlight the features learned by the network in such a layer. Pooling layers usually follow convolutional layers in CNN architectures (Figure 1). The output of pooling layers is a low-resolution feature map, which is obtained by downsampling the input map. Pooling operations reduce the number of network parameters in order to mitigate data overfitting. Furthermore, the low-dimensional outputs of pooling layers extract local features, while preserving the multidimensionality of data. An example of downsampling is provided by the Max Pooling function, which only saves the maximum value of the input map in the regions passed by the filter. Global Average Pooling, instead, performs a global mean over the input map (Figure 3).
The features learned over stacked convolutional and pooling layers merge in the fully connected layers, which receive flattened input vectors. The number of neurons of the last fully connected layer coincides with the number of output classes. For typical classification tasks, the activation function of the output layer is a saturating function such as Softmax, which can account for class probability. When new data are provided to the network, the learned features activate differently and culminate in the last fully connected layers for the assessment of the output class. The training is performed by minimizing an objective loss function. Equation (2) reports the cross-entropy loss function, typically employed for classification problems:

L = − Σ_{n=1}^{N} Σ_{k=1}^{K} y_nk ln(ŷ_nk) (2)

Given N observations and K classes, ŷ_nk is the network output for the n-th observation and k-th class, whereas y_nk is the target label (generally its value is 0 or 1 for classification tasks). The loss function is minimized by updating the network weights and biases using a Stochastic Gradient Descent (SGD) algorithm. Several optimization strategies, such as the Adam optimizer [53], were further developed in the last decade to improve the SGD performance.
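The Softmax output and the cross-entropy loss of Equation (2) can be sketched in a few lines; the logits and one-hot targets below are arbitrary illustrative values:

```python
import numpy as np

def softmax(z):
    # Softmax over the class axis; subtracting the row maximum improves
    # numerical stability without changing the result
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(y_hat, y):
    # Equation (2): L = -sum_n sum_k y_nk * ln(y_hat_nk)
    return -np.sum(y * np.log(y_hat))

# Two observations (N = 2), three classes (K = 3)
logits = np.array([[2.0, 0.5, 0.1],
                   [0.2, 0.1, 3.0]])
y_hat = softmax(logits)      # each row sums to 1: class probabilities
y = np.array([[1, 0, 0],
              [0, 0, 1]])    # one-hot target labels y_nk
loss = cross_entropy(y_hat, y)
```

The loss shrinks as the predicted probability of each true class approaches 1, which is what SGD (or Adam) drives the weights towards during training.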

Transfer Learning
The minimization of loss functions in CNNs is an optimization problem which involves millions of weights. It is evident that such procedures are computationally demanding. Moreover, a large number of weights results in many degrees of freedom for the SGD algorithm. When training datasets are not sufficiently large, the network weights may therefore overfit the data, producing models that are unable to generalize knowledge beyond the training sets. Convergence issues may also arise when SGD is unable to find minima in the loss function. TL is one of the tools developed within the AI field to face these issues.
Considering a source domain D_s and a target domain D_t, TL techniques are able to transfer the knowledge acquired in D_s for carrying out a source task T_s to the domain D_t for fulfilling a target task T_t [5,34,36]. For instance, fault diagnosis tasks learned in certain machine conditions can be transferred to different working conditions [38,39], or the diagnosis knowledge can be migrated between different machines [54].
Practically, one way of implementing TL consists of freezing the first layers of the CNN, saving the knowledge acquired for extracting low-level features. On the other hand, the last layers are modified and further trained to fine-tune the model. Favorably, such fine-tuning procedures require limited data for updating the remaining weights. This kind of approach is known in the literature as parameter-based [5], since some of the network parameters remain unchanged. In particular, the knowledge enclosed in the frozen weights concerns the ability of the network to identify specific features in the data. The pre-acquired knowledge can thus be employed to face new tasks in which those feature detection abilities can be re-used. In this sense, the training time is drastically reduced, and trainings on few data are less prone to overfitting. The number of replaced layers for fine-tuning essentially depends on the degree of diversity between the source and target domains [36]. Nevertheless, the development of structured transferability criteria and metrics is a challenging focus of current research [5]. Further TL approaches for machine diagnosis can be feature-based, instance-based or based on generative models [5].
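The freezing mechanism at the core of parameter-based TL can be sketched with a toy two-layer model; the parameter names, sizes, and gradients below are purely illustrative stand-ins for a real network and optimizer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a pre-trained network: W1 plays the role of the frozen
# feature-extraction layers, W2 the role of the replaced last layers.
params = {
    "W1": rng.normal(size=(8, 4)),  # pre-trained weights, to be frozen
    "W2": rng.normal(size=(4, 3)),  # new head, to be fine-tuned
}
frozen = {"W1"}

def sgd_step(params, grads, lr=0.1):
    # Parameter-based TL: update only the parameters that are not frozen
    for name in params:
        if name not in frozen:
            params[name] -= lr * grads[name]

grads = {name: np.ones_like(w) for name, w in params.items()}
W1_before = params["W1"].copy()
W2_before = params["W2"].copy()
sgd_step(params, grads)
# After the step, the frozen extractor is untouched and only the head has moved
```

Because only the head participates in the optimization, far fewer degrees of freedom are exposed to SGD, which is why small fine-tuning datasets become viable.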
Existing literature has emphasized the capabilities of networks pre-trained on ImageNet to detect machine faults [36,37]. In such cases, the low-level features characterizing ImageNet are exploited to classify time-frequency images of vibrational data. This work is motivated by the idea that networks pre-trained for SED, speech and music recognition hold enough knowledge for classifying bearing faults through spectrogram analysis. Indeed, SED frameworks are already trained for spectrogram classification.
Additionally, given the inherent similarity between the source and target domains, only a few layers are expected to need fine-tuning.

YAMNet: An Efficient CNN for Sound Event Detection
YAMNet [55][56][57] is a pre-trained CNN developed for SED tasks. The network is trained with AudioSet [48], a dataset containing 632 classes of sound events drawn from more than 2 million YouTube® clips. Environmental sounds, human sounds, music genres, and musical instruments are examples of classes included in the dataset.
YAMNet is built on the MobileNet [58] architecture. This is an efficient framework designed to fit low-latency artificial intelligence models in mobile applications, where the availability of computational resources is quite limited. In MobileNet architectures, the standard convolution is replaced with the depthwise separable convolution. This particular approach is further detailed in the original paper provided by Google Inc. in 2017 [58], where it is shown that the computational cost may be reduced up to nine times compared with standard convolution. In addition to the input and output layers, YAMNet contains 27 convolutional layers, 1 global average pooling layer, and 1 fully connected layer. Standard and depthwise separable convolutions are sequentially stacked up to the pooling layer. The convolutional layers employ ReLU activation functions and batch normalization [59]. Finally, the output layer provides the sound class prediction with a Softmax activation function.
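The cost saving can be verified with a back-of-the-envelope count of multiplications, following the cost model of the MobileNet paper [58]; the layer sizes below are illustrative, not YAMNet's actual ones:

```python
def standard_conv_mults(dk, m, n, df):
    # Standard convolution cost: D_K^2 * M * N * D_F^2 multiplications
    # (kernel size D_K, input channels M, output channels N, feature map size D_F)
    return dk * dk * m * n * df * df

def separable_conv_mults(dk, m, n, df):
    # Depthwise (D_K^2 * M * D_F^2) plus pointwise 1x1 (M * N * D_F^2) convolutions
    return dk * dk * m * df * df + m * n * df * df

# Example layer: 3x3 kernels, 64 input channels, 128 output channels, 56x56 map
std = standard_conv_mults(3, 64, 128, 56)
sep = separable_conv_mults(3, 64, 128, 56)
ratio = std / sep  # roughly 8-9x for 3x3 kernels, matching the reported savings
```

The ratio simplifies to 1/N + 1/D_K^2, so for 3×3 kernels and many output channels the saving approaches the factor of nine quoted above.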
To appreciate the effect of depthwise separable convolutions, it is possible to compare one of the early CNNs, such as AlexNet [22], with YAMNet. The number of parameters contained in AlexNet amounts to 61 million in 8 layers, whereas YAMNet contains 3.7 million parameters [41] in 30 layers. Fewer learnable weights further reduce the overfitting risks in such architectures.

Mel Spectrogram Features
The feature maps fed to YAMNet are constructed on the basis of coefficients drawn from time-frequency representations of sounds. However, the human perception of sound in terms of frequency content is not linear. It is widely known that humans perceive pitch with better resolution at lower frequency ranges. Indeed, given a fixed frequency gap between two tones, our hearing differentiates them better at lower frequencies than at higher ones, where the tones would sound quite similar. Consequently, artificial intelligence models that learn sound perception tasks should account for this aspect. To properly label auditory events, deep learning strategies for speech recognition, music recognition, and SED evolved by exploiting feature spaces consistent with the human perception of sound. In particular, YAMNet employs a Mel spectrogram feature map.
The Mel (root of the word Melody) scale [60] and the Mel spectrogram [61] take into account the logarithmic nature of human hearing. The Mel scale is empirically designed on the basis of the psychoacoustic perception of pitch. Equation (3) shows how frequencies f in Hz can be converted to Mel:

Mel(f) = 2595 log10(1 + f/700) (3)

In accordance with human experience, points that are equally spaced on the Mel scale are not linearly resolved on the Hz scale and vice versa. The Mel spectrogram is a time-frequency representation of sounds, produced by applying Mel scale filter banks to classic spectrograms [61]. Figure 4 shows the Mel conversion curve and reports an example of traditional Hz and Mel scale spectrograms for a linear sine sweep. Figure 5 shows the features learned by YAMNet: differently from the image-based features of AlexNet (Figure 2), the extracted features are distinctive of spectrogram attributes. As previously discussed, this is due to the fact that the weights of YAMNet are specifically optimized to detect spectrogram features. For this reason, the authors of this work hypothesize that some of the knowledge needed to classify REB spectrograms may be already included in the layers of SED networks. This actually means that only the last layers would require learning adjustments, whereas the feature extraction layers and the corresponding weights could be frozen without requiring further training. Then, a small dataset would be sufficient for fine-tuning the network weights, complying with limited data availability.
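Equation (3) and its inverse can be sketched in a few lines of Python; the sample frequencies below are arbitrary and serve only to illustrate the nonlinear resolution:

```python
import math

def hz_to_mel(f):
    # Equation (3): Mel(f) = 2595 * log10(1 + f / 700)
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # Inverse mapping back to Hz
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Equal steps in Hz are not equal steps in Mel: the same 100 Hz gap spans
# far more Mel at low frequency than at high frequency
gap_low = hz_to_mel(1100) - hz_to_mel(1000)
gap_high = hz_to_mel(8100) - hz_to_mel(8000)
```

Here gap_low is several times larger than gap_high, mirroring the finer pitch resolution of human hearing at low frequencies described above.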

Bearing Fault Detection Using YAMNet and Transfer Learning
The present work investigates the capabilities of pre-trained SED networks in fulfilling bearing fault diagnosis tasks. Namely, a TL approach was undertaken with the aim of transferring the knowledge held by the YAMNet layers to REB spectrogram classification. To this end, the Case Western Reserve University (CWRU) [30,62] bearing dataset was examined, representing a standard reference in this field. The dataset was pre-processed to comply with the YAMNet architecture, and it was split for the training, validation, and test phases. In this context, different labelling options were considered. Afterwards, the network was fine-tuned to deal with fault diagnosis scenarios characterized by restricted datasets. Finally, the model was tested on new data.

CWRU Dataset
The CWRU test bench (Figure 6) consists of an electric motor, a torque transducer, and a dynamometer; the motor shaft is supported by a drive end (DE) bearing and a fan end (FE) bearing. The CWRU experimental campaign considered localized faults in the DE and FE bearings, which were damaged by means of electro-discharge machining. The faults were introduced separately at the inner race (IR), the outer race (OR), and the rolling elements (B), and tests were run for motor powers of 0, 1, 2, and 3 hp with a shaft speed ranging between 1721 and 1796 rpm. This study considers the fault diameters of 0.007, 0.014, 0.021, and 0.028 inches for the DE bearing. Vibration signals were acquired in normal and faulty conditions by means of accelerometers placed on the motor housing. In the case hereby examined, normal data were sampled at 48 kHz, whereas DE fault data were sampled at 12 kHz. The OR fault condition with 0.028 in damage was not analyzed, since the corresponding data were not available. The label codification for dataset B reports the damage type (B, IR, OR) in the first part, whereas the second part denotes the damage severity (007, 014, 021, 028). For example, class IR014 contains the signals of experiments run with an inner race damage of 0.014 in. Dataset C, instead, is labelled using the damage type (B, IR, OR) and the motor load in horsepower (_0, _1, _2, _3). The label OR_3, for instance, indicates that a signal was extracted at 3 hp load with an outer race fault. The network determines the output class by assigning numerical scores at the end of the last layer (classification layer). Given the presence of a Softmax activation function, the output of the last neurons returns the probability of belonging to a certain class. Then, the label of the most likely class is assigned. Essentially, the main difference between the three datasets lies in the intra-class distributions. Class B_0 contains samples linked to different damage severities, just as class B007 includes signals extracted at different working loads. Dataset C includes fewer samples in the OR classes, since data for outer race damages with 0.028 in faults were not available.
The CWRU tests were conducted following the labelling of dataset B; therefore, some class imbalances inevitably occur for datasets A and C. The three datasets were then investigated to gain an insight into the effect of different intra-class distributions and imbalances on the model performances.
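The label sets of datasets B and C described above can be enumerated programmatically; the short sketch below only reproduces the naming scheme stated in the text (the class list of dataset A is not restated here):

```python
damage_types = ["B", "IR", "OR"]
severities = ["007", "014", "021", "028"]
loads = ["_0", "_1", "_2", "_3"]

# Dataset B: damage type + severity; OR with 0.028 in damage is unavailable
labels_B = [d + s for d in damage_types for s in severities
            if not (d == "OR" and s == "028")]

# Dataset C: damage type + motor load in hp
labels_C = [d + l for d in damage_types for l in loads]
```

Dataset B therefore counts 11 fault classes (the missing OR028 combination being the source of the imbalance mentioned above), while dataset C counts 12.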

Dataset Pre-Processing
The YAMNet trained model accepts input signals sampled at 16 kHz and normalized in the range [−1, 1] [56]. Hence, the vibration data were pre-processed to fit the model input requirements. Moreover, the similarities between the source and the target domains can be enhanced by using Mel spectrum features for the vibration signals as well. In so doing, a successful knowledge transfer is more likely to occur. The datasets were randomly split into training (70%), validation (20%), and test (10%) sets, as reported in Table 1. The chosen dataset split guaranteed the availability of 300 training samples. The training set was employed to fine-tune the model weights, whereas the validation set had the function of monitoring the model convergence and checking potential overfitting issues during training. Finally, the model effectiveness was verified on never-seen data through the test set. It must be pointed out that 300 training samples constitute a very small dataset for such a network, since 2 million AudioSet samples were necessary to properly train the YAMNet deep architecture. Nevertheless, TL leverages the pre-trained network and provides the opportunity to deal with few data.
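The amplitude normalization and the random 70/20/10 split can be sketched as follows; the dataset size of 430 samples is purely illustrative, the actual per-dataset sizes being those of Table 1:

```python
import numpy as np

rng = np.random.default_rng(42)

def normalize(signal):
    # Scale a vibration signal into [-1, 1], as required by YAMNet's input stage
    peak = np.max(np.abs(signal))
    return signal / peak if peak > 0 else signal

# Random split of the sample indices into 70% training, 20% validation, 10% test
n_samples = 430  # hypothetical dataset size, for illustration only
idx = rng.permutation(n_samples)
n_train, n_val = int(0.7 * n_samples), int(0.2 * n_samples)
train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]
```

Shuffling before partitioning keeps the three sets disjoint while drawing them from the same underlying distribution, which is what allows the validation set to act as an unbiased overfitting monitor.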
As previously mentioned, the effectiveness of the proposed strategy strongly depends on the similarities between the source and target domains. Therefore, the signals were treated with the same pre-processing that would be carried out for sound event detection. The data pre-processing involves the extraction of the Mel spectrum coefficients for feeding the YAMNet architecture, since those coefficients constitute the input feature map. The pre-processing procedure [55,56] provides that:
1. The signal is resampled at 16 kHz and normalized in the range [−1, 1];
2. The Mel spectrogram is computed using Hann windows with 400 samples length and 60% overlap. The Mel scale filter bank includes 64 filtering bands;
3. The resulting spectrogram is partitioned by using 96-frame sliding windows with 48 frames of overlap.
In this case, the spectrograms resulted 100 frames long, and a single 96 × 64 spectrogram was drawn from each signal. Table 2 reports the main features of the Mel spectrograms, which are consistent with the YAMNet architecture. Examples of Mel spectrograms extracted from the CWRU dataset are shown in Figure 7.
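The frame counts stated above follow directly from the window parameters; the sketch below reproduces the arithmetic (the signal length of 16,240 samples is an assumed value chosen so that exactly 100 frames result):

```python
def n_stft_frames(n_samples, win=400, overlap=0.6):
    # Number of full Hann windows of 400 samples with 60% overlap:
    # the hop is 400 * (1 - 0.6) = 160 samples
    hop = int(win * (1 - overlap))
    return (n_samples - win) // hop + 1

def n_patches(n_frames, patch=96, hop=48):
    # Number of 96-frame patches sliding with 48 frames of overlap
    if n_frames < patch:
        return 0
    return (n_frames - patch) // hop + 1

frames = n_stft_frames(16240)   # assumed signal length at 16 kHz (about 1 s)
patches = n_patches(frames)     # a 100-frame spectrogram yields one 96 x 64 patch
```

With these parameters, a roughly one-second signal at 16 kHz yields a 100-frame spectrogram, from which a single 96 × 64 input patch is extracted, consistently with the figures given in the text.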

YAMNet Fine-Tuning
It is claimed that the features detectable by the YAMNet framework (Figure 5) provide a source of knowledge to classify bearing spectrograms as well. To verify this assumption, the fully connected and the output layers of YAMNet were replaced with new layers, while the convolutional part remained unaltered. The weights of the replaced layers were consequently optimized by training the model with the CWRU data. Given the inherent similarities between the source and target domains, replacing only the fully connected layer proved effective. The main hyperparameters featuring the training phase are reported in Table 3. The Adam optimizer was employed for adapting the learning rate during training [53]. The batch normalization process [59] normalizes mini-batches of data for each network channel independently. Such a normalization occurs within the layers, thus preventing exploding gradient issues. Over an iteration, a mini-batch is passed through the network, whereas a whole epoch involves the complete training set. The validation frequency specifies the number of iterations elapsed between validations. The trend of the loss functions was monitored over the training process. As an example, Figure 8a shows the loss function resulting from the training with dataset C.
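The fine-tuning scheme described above, training only a new fully connected layer on top of frozen convolutional features, can be sketched with a toy example. The "embeddings" below are random stand-ins for the output of YAMNet's frozen convolutional part, with two well-separated clusters playing the role of two fault classes; plain gradient descent replaces Adam for brevity:

```python
import numpy as np

rng = np.random.default_rng(1)

# Frozen embeddings: stand-ins for the pooled output of the convolutional part
X = np.vstack([rng.normal(-2, 1, size=(50, 16)),
               rng.normal(2, 1, size=(50, 16))])
y = np.repeat([0, 1], 50)
Y = np.eye(2)[y]  # one-hot labels

# New fully connected layer: the only trainable parameters
W = np.zeros((16, 2))
b = np.zeros(2)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

lr = 0.1
for _ in range(200):
    P = softmax(X @ W + b)
    grad_logits = (P - Y) / len(X)      # gradient of cross-entropy w.r.t. logits
    W -= lr * (X.T @ grad_logits)       # only the head is updated;
    b -= lr * grad_logits.sum(axis=0)   # the feature extractor stays frozen

accuracy = np.mean(np.argmax(softmax(X @ W + b), axis=1) == y)
```

When the frozen features already separate the classes, as hypothesized here for YAMNet's spectrogram features, this small linear head converges quickly even on a few dozen samples per class.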
In this dataset, few samples were partitioned into a considerable number of classes (Table 1), and overfitting issues manifested. This is evident from the fact that the validation loss remained considerably higher than the training loss. To address these overfitting matters, a dropout layer with 90% dropout probability was added before the fully connected layer. Dropout layers represent an effective strategy to prevent overfitting in deep networks. Those layers set input elements to zero with a given probability, significantly reducing the number of parameters effectively active at each iteration. Convergence issues can also be hypothesized in this case. Figure 8b shows the improvements in the network performances due to the dropout layer. A smoother training process is achieved, and overfitting is remarkably mitigated. Training times and epochs are reported in Table 4. For all the datasets, the training stopped either when maximum accuracy was stably reached or when the maximum number of epochs was met.
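Dropout with such a high probability can be sketched as follows; the activation size is arbitrary, and inverted-dropout scaling is used, as is common in deep learning frameworks, so that the expected activation magnitude is preserved:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p=0.9, training=True):
    # Randomly zero inputs with probability p (here 90%, as used before the
    # fully connected layer); scale the survivors by 1/(1-p) to keep the
    # expected value of the activations unchanged
    if not training:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

activations = np.ones((4, 1024))
dropped = dropout(activations)
kept_fraction = np.mean(dropped != 0)  # close to 0.1 on average
```

Because a different random 10% of the inputs survives at each iteration, the fully connected layer cannot rely on any individual feature, which is the mechanism that mitigated the overfitting observed on dataset C.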

Model Validation
The fine-tuned model was tested on new data to validate its effectiveness, and the capabilities of TL approaches in limited data scenarios were assessed by leveraging networks pre-trained on sound events. The confusion matrices resulting from the test data are shown in Figure 9, whereas Table 5 shows the overall accuracies achieved at the end of training, validation, and test. There is evidence that the model can properly perform ideal fault diagnosis tasks for datasets A and B with 100% accuracy. Additionally, no overfitting occurred, although extremely small datasets were employed. The class imbalances associated with data availability for dataset A did not affect the model accuracy. In this sense, the TL strategy was effective when the knowledge of spectrogram features was transferred from an SED network to bearing fault detection tasks. Moreover, the replacement of a single layer proved adequate because the source domain of YAMNet shares numerous features with the domain of vibration signals, as long as these are properly processed. The adoption of Mel spectrum features is part of the pre-processing, and it enhances the similarities between the transfer domains. Dataset C showed lower accuracies but, when the dropout approach was applied, the performances markedly improved.
The results are consistent with the literature concerning parameter-based TL for CWRU diagnosis. For instance, in [36] it is observed that networks pre-trained on ImageNet are able to perform fault diagnosis tasks with 99.95% accuracy and a training time of 229 s. However, more than one layer required fine-tuning, and more training data were available (4000 samples). It can be argued that the dominant features learned on ImageNet were not highly specific to time-frequency images.

Discussion
The maximum accuracy is reached when data are split into a considerable number of classes, as long as balanced samples are used (dataset B). Clearly, this is a remarkable capability of TL combined with SED networks, since classes including little more than 20 training samples deliver a very efficient diagnosis model. Training YAMNet from scratch would actually require millions of samples. Similarly, 100% accuracy is also achieved in the event of class imbalances (dataset A), provided that the training classes are fed with more training data (Table 1). Interestingly, the performances slightly deviate from 100% when class imbalances are superimposed on extremely few training data (dataset C).
Although the accuracy achieved for dataset C is quite promising, appreciable overfitting occurred in this case, as suggested by the validation and test sets (Table 5). For the diagnosis task of dataset C, the training samples did not provide full diagnosis accuracy, and the model did not effectively generalize knowledge to new data. Nevertheless, the implementation of a consistent dropout strategy markedly mitigated this effect, and remarkable accuracies (higher than 90%) were reached.
Dataset C proved slightly harder from a diagnostic perspective. Contrary to dataset A, fewer data per class were available, and the data imbalances affected the performances. Depicting the intra-class variability at a general level was a harder task for those training data.
Indeed, when the dropout improvements are introduced, only the imbalanced OR classes present misclassifications (Figure 9d).
Future work may provide a systematic investigation of the performance variations with respect to the size of the training sets for such a TL strategy. In this context, weighted loss functions can be taken into account to deal with class imbalances. Further, the effect of additive Gaussian noise on the sample data can be investigated as well. The choice of the network hyperparameters can indeed be affected by noisy signals. The automated tuning of hyperparameters is currently a challenging aspect and deserves further improvements. Future research will include additional experimental validation based on different datasets.

Conclusions
This study aimed at investigating the capabilities of deep networks pre-trained on sound events in fulfilling bearing fault diagnosis. It is claimed that the inherent knowledge of these architectures in identifying features of audio spectrograms can be transferred to the characterization of machine vibrations as well. For this purpose, transfer learning is applied to an efficient convolutional network, originally designed to fit AI computing for sound detection in mobile devices. It is concluded that:
• Networks pre-trained on sound events can fulfill a fault diagnosis task with ideal accuracy by adopting transfer learning approaches;
• The features learned over the stacked convolutional layers of the YAMNet architecture are also relevant for spectrograms of machine vibrations;
• Limited data scenarios can be successfully addressed by replacing a single fully connected layer for fine-tuning YAMNet;
• When limited data are split into many fault classes with imbalances, overfitting may occur despite high accuracies. In such cases, dropout layers consistently mitigate this phenomenon, and further improvements in model accuracies are achieved.