Explainable AI for Machine Fault Diagnosis: Understanding Features’ Contribution in Machine Learning Models for Industrial Condition Monitoring

: Although the effectiveness of machine learning (ML) for machine diagnosis has been widely established, the interpretation of the diagnosis outcomes is still an open issue. Machine learning models behave as black boxes; therefore, the contribution given by each of the selected features to the diagnosis is not transparent to the user. This work is aimed at investigating the capabilities of the SHapley Additive exPlanation (SHAP) to identify the most important features for fault detection and classiﬁcation in condition monitoring programs for rotating machinery. The authors analyse the case of medium-sized bearings of industrial interest. Namely, vibration data were collected for different health states from the test rig for industrial bearings available at the Mechanical Engineering Laboratory of Politecnico di Torino. The Support Vector Machine (SVM) and k-Nearest Neighbour (kNN) diagnosis models are explained by means of the SHAP. Accuracies higher than 98.5% are achieved for both the models using the SHAP as a criterion for feature selection. It is found that the skewness and the shape factor of the vibration signal have the greatest impact on the models’ outcomes.


Introduction
One of the most important components of modern industry is rotating machinery [1].Rolling element bearings are one of the essential parts of these machines, because they provide stiff support with a low-power consumption, thanks to their overall low friction coefficient, in a wide range of operating conditions [2].They are also the most critical component, because they are susceptible to many failure causes not only related to wear and fatigue, but also to faulty installation, poor maintenance, and bad handling practices [3][4][5].Their damage and consequent failure can cause significant unexpected downtimes, economic losses, and safety-related issues [6,7].Therefore, in the past few decades, industries have recognised the importance of a reliable and robust condition monitoring (CM) system for fault detection and diagnosis [8,9].
Rolling element bearings are complex mechanical systems, in which the interaction between their elements produces vibrations even if they are "geometrically perfect", and the presence of faults, defects, and their location produce a characteristic impulsive vibration signature [10].Thus, vibration analysis has become one of the most used techniques in CM, because the signal acquired from the machine contains clear fault-related signatures and can be explored easily through signal processing techniques [11][12][13][14][15][16].In most situations the bearing vibration cannot be measured directly, so the vibration signature is affected by noise coming from the structure and other equipment in the system.For that reason, machine users would prefer an automatic method to shorten the maintenance cycle and improve the diagnosis accuracy, which is called intelligent fault diagnosis (IFD).This technique uses machine learning (ML) theories and algorithms, such as the support vector machine (SVM) [17][18][19][20][21], the k-nearest algorithm (kNN) [22,23], artificial neural networks (ANNs) [24,25], deep belief networks (DBNs) [26], and convolutional neural networks (CNNs) [27][28][29], for fault classification [30].A database of vibration data is used to extract some important and characteristic features to train the ML algorithms at the recognition of faults.Features can be extracted from the time, frequency, or time-frequency domain, but high-dimension feature vectors may contain either redundant or irrelevant information that can increase the computational cost and reduce the accuracy of the classification algorithm [31].Indeed, overfitting problems correlated to high-dimensional data can deteriorate the performance of ML algorithms, since their generalisability decreases as the number of features increases [32].Thus, feature selection, with a dimensionality-reduction technique, plays a fundamental role in designing powerful and robust AI algorithms [33].Then, unlike physics-based models, explaining the contribution given by each feature to the outcome of an IFD model is not straightforward.
The term "feature importance" refers to a set of strategies for allocating scores to input features in a predictive model, indicating the relative significance of each item when producing a prediction [34].Commonly used feature selection algorithms for bearing fault diagnosis include the following:

•
Filter-based methods: filters directly pre-process the collected features depending on the data intrinsic characteristics (correlation between features and class label), independent to the training of the algorithm.Typical importance indicators are Relief and Relief-F [35], information gain (IF) [36], Minimum Redundancy Maximum Relevance [37], Fischer score [38], and Distance Evaluation (DE) [39]; • Wrapper-based methods: wrappers evaluate the interaction of the feature selection activity with the training phase of classifiers [40,41]; • Embedded methods: feature selection is performed during the classification process by adding an L1 [42] or L2 [43] regularisation term in the cost function, making the computation faster than that of the wrappers.
Other widely used data-compression techniques are the principal component analysis (PCA) [44], the linear discriminant analysis (LDA) [45], and the partial least square (PLS) [46], which are based on an algebraic calculation and geometric projection to create the separability into the feature space.However, these approaches lack an appropriate explanation of the feature selection process and a proper justification about the influence of each feature on the final output of the IFD.Without a proper understanding of the feature selection process, all the ML-based classifiers only focus on the classification accuracy, such as the black box approach [9].
The SHAP (SHapley Additive exPlanations) value [47] is a novel feature importance calculation method, which is suitable for both regression and classification problems that can represent a valid alternative to the most used feature selection methods, because it can provide a local interpretation of the model on a single prediction, as well as the global average behaviour of the model in terms of the contribution of each feature to the outcome of the ML algorithm [48].According to the existing literature [9,[49][50][51], the terms "interpretability" and "explainability" are often used synonymously, referring to models where users can understand how the inputs are mapped into outputs.This is the case for the techniques aimed at quantifying the contribution given by each feature to determine the outcome of their importance and highlight those with the greatest impact on prediction.In recent years, the SHAP value has become a popular method for the interpretation of the feature contributions in fitted ML models, as it can be seen in the references [47,[52][53][54][55][56][57], posing the bases to Explainable Artificial Intelligence (XAI) [58] and new feature selection methods.

Literature Review
A review of the existing literature about Shapley values shows that the researches from Strumbelj et al. [55,59] were the first to propose them in the explanation of how features contribute to the prediction of classification and regression machine learning algorithms, posing the basis for XAI.They showed that Shapley values can reveal the influence of features, no matter the concept that the AI algorithm learns.An issue that affected all the algorithms based on the Shapley values was the high computational cost, but Song et al. [52] proposed a solution to make the computation feasible, even for a large number of input data and features, by sampling the permutations of the inputs randomly and estimating the cost functions related to the evaluated algorithms via the Monte Carlo simulation.The work by Lundberg et al. [47] introduced the kernel SHAP framework, which involved the standard Shapley value algorithm and the local interpretable modelagnostic explanation (LIME) algorithm, to assign each feature an importance value for a particular prediction.Further developments of the SHAP method were carried out by Aas et al. [53], who introduced the extension to the kernel SHAP to reduce the computation time and lead to more reliable predictions when features are correlated.Redelmeier et al. [56], who developed an algorithm able to explain the contribution of mixed (continuous, discrete, and categorical) and dependent features using conditional inference trees, and reported an improvement in the computation time when the features are dependent, with respect to Aas et al.
The kernel SHAP and the extension to the kernel SHAP were successfully used in studies regarding many different subjects to explain the outcome of both classifiers and the regression ML algorithms.Banerjee et al. [60] used the Shapley values to find causal connections between socioeconomic metrics and the spread of COVID-19 in the different phases of the pandemic.Vega Garcia et al. [34] applied the SHAP method to explain the forecasts of the air quality in terms of the NO 2 concentration made by the long short-term memory (LSTM) deep learning model, while Rohmer et al. [61] explained the predictions of a machine-learning-based model for sea-level projections.The SHAP method can also be applied to finance, as it was performed in the study by Moehle et al. [57], who wanted to assess the feature importance in the evaluation of the portfolio performance to genomics.This is similar to the work by Watson et al. [62], who explained the importance of interpretable machine learning (iML) for making the prediction of the ML algorithm understandable to human, when the observed phenomena are highly complex and dependent on a large number of variables, and to aviation, as cited in the research by Midftfjord et al. [63].They employed the SHAP method to highlight the most important features affecting the outcome of the XGBoost algorithm for the real-time evaluation of the airport runway conditions.The SHAP method can also be used for the multi-label classification problem, as it was reported in Dong et al. [64] and Goštautaite et al. [65]: the former proposed and tested a multi-label learning fused with the Shapley value (SHAPFS-ML), while the latter used the SHAP method to explain the predictions about the students' learning style.The SHAP explanation was also used in medicine (Chen et al. [66], Oh et al. [67]), psychology (Akimoto et al. [68], Li et al. [69]), and engineering (Sun et al. [70], Remman et al. [71]) In the field of feature selection, the research from Cohen et al. [72] was the first to present contribution selection algorithms (CSAs) that were based on Shapley values to select the most relevant features.The CSA was tested on seven different datasets and the authors reported its capability to improve the performance of their classifier, competing with the most used feature selection methods, especially in those cases where features interacted with each other.Marcilio et al. [73] stated that the absolute of the SHAP values can be extrapolated to be used as a feature selection mechanism, while providing a humancompatible explanation for the predictions of ML algorithms.However, they reported that the kernel SHAP approach takes a quadratic amount of time in both the dimensionality and size of the dataset, making the computation unfeasible even for moderate datasets, but they concluded that this problem could be decreased by selecting features based on correlation before feeding the SHAP algorithm.Zacharias et al. [74] developed a design framework of an XAI-based feature selection method that evaluates the performance of a ML model with a reduced number of features, selected depending on their SHAP values, with respect to the original one, which is trained considering all the features.They highlighted that the use of XAI for feature selection may make the task more accessible for intermediate data scientists and might thereby contribute to data democratisation, facilitating the exchange of information between ML developers and domain experts to enhance the effectiveness of feature selection.This kind of algorithm was employed by Guha et al. [75] in a ML model for visual human action recognition and by Jothi et al. [76] for a ML model predicting generalised anxiety disorder among women.They both reported a high improvement in the computation time of their models after feature selection, proving the validity of that method.A further development was proposed by Tripathi et al. [77] who developed a feature selection algorithm that was based on the SHAP values for a binary classification problem, which does not need a user-defined threshold on the importance values or the number of important features.
There are reports in the literature about the application of the SHAP method to explain the outcomes of ML classifying algorithms for fault detection in rotating machinery.The SHAP algorithm was applied to supervised learning in the works by Hasan et al. [9], who explained the decisions of a kNN classifier for a small-sized ball bearing fault diagnosis, highlighting the most relevant features and concluding that the presented model had a better generalisation capability than the standard algorithm, making it perform a bearing fault diagnosis under different configurations and working conditions.Mey et al. [78] explained that the prediction of deep neural networks applied to the fault diagnosis of rotating systems through the analysis of vibration data.On the other hand, Brito et al. [49] explained the outputs of an unsupervised machine learning algorithm for fault detection in rotating machinery with gears and employed the Shapley values in a fault diagnosis to find the root causes of the faults.

Scope of the Research and Outline
In the field of IFD that is related to rotating machinery, only few works deal with the SHAP value as an instrument to assess the importance of features [9,17,49,78,79].Additionally, the literature presents no evidence of XAI analysis performed for industrial applications characterised by medium-sized bearings and the assessment of the effects of SHAP-based feature selection on the testing accuracy of classifying algorithms.To the best of the authors' knowledge, this is the first work dealing with this topic for industrial cases, involving medium-sized spherical roller bearings for high-duty applications.
For that reason, the present work is aimed at investigating the capabilities of the SHAP method to analyse feature contributions, making their selection univocally justifiable and trackable to achieve more robust and reliable ML algorithms for IFD of industrial bearings in rotating machinery for industrial interest.To this aim, specific experimental activities were conducted on the test rig for medium-sized industrial bearings available at the Mechanical Engineering Laboratory of Politecnico di Torino [80].
This paper is organised as follows: The introduction provides and overview of IFD and XAI by exploring the existing literature.Additionally, the aim of this work is described.The second section provides insight into the theoretical arguments underlying SVM, kNN, and Shapley values.The third section describes the test rig and the experimental activity conducted on medium-sized bearings.Section 4 includes the application of the SHAP analysis to the trained IFD models, whereas the fifth section provides the results and discussion on the explainability of diagnosis models and feature selection abilities.Finally, the sixth section includes the concluding remarks.

Theoretical Background
This section introduces the Artificial Intelligence (AI) tools that are employed in this study for the IFD and interpretability of AI models.Namely, the SVM and the kNN algorithms are selected among the ML models, whereas the SHAP technique is employed for interpreting the diagnosis that is achieved through the black box models.

ML Models
AI algorithms for fault diagnoses of rotating machinery have become popular due to their robustness and adaptation capabilities The first and most traditionally used algorithms for fault diagnosis are the SVM and kNN [30].

Support Vector Machines (SVMs)
A SVM is a supervised learning AI algorithm developed by Cortes and Vapnik [81] that is widely used in classification tasks and can manage large feature spaces, because the dimension of the classified vectors used for the training phase does not have an influence on the performance.Thus, the SVM is a good approach for fault diagnosis, since the vibration features might not have to be limited.It has good generalisation properties, because in the goal of the training phase is the minimisation of the Structural Misclassification Risk (SMR) [82][83][84].The SMR is an inductive principle of use in ML, which was introduced by Vapnik and Chervonenkis [85] and is implemented through the minimisation of the expression reported in Equation ( 1): where E train is the training error, λ is the constant regularisation term, H is the regularisation function that controls the trade-off between minimising the training error and minimising the expected gap between the training and test errors, and w is the vector containing the weight of each feature.Thus, the cost function J of the algorithm can be written as reported in Equation (2): where N is the number of examples in the database, x the vector of the feature values related to the i-th example, h w is the predicting function of the classifier, y is the label of the i-th example, d is the number of features, and w is the j-th entry of the vector of weights.
Given the set of input data reported in Equation ( 3), which are labelled depending on the class (y i = ±1).The linear SVM theory supposes that there exists a hyperplane that represent the boundary between classes.The hyperplane is expressed in Equation ( 4), where b is a constant scalar term.The maximisation of the margin distance between the points and the decision hyperplanes provides some reinforcement to the algorithm, making the classification of new points simpler [86].
Considering the possible presence of noise in data with the slack variable ζ i and the error penalty constant C, the optimal hyperplane that separates the data can be obtained as a solution to the optimisation problem of Equation ( 5): Appl.Sci.2023, 13, 2038 6 of 22 Replacing the weight vector w with w = N ∑ i=1 α i y i x i , where α is the weight of the i-th sample, the optimisation problem can be written in the form of Equation ( 6): SVMs can also be used in nonlinear classification problems that apply kernel functions, which map the d-dimensional input vector x into a l-dimensional feature space to make a linear classification feasible.Using the non-linear vector function K x i , x j = Φ T (x i ) Φ x j , the decision function can be written in the form of Equation ( 7): The kernel function of the following Equation ( 8) returns the dot product of the feature space mapping from the original data points: There are many types of kernel functions, such as linear, polynomial, and Gaussian RBF (radial basis function).The selection of the most suitable kernel is crucial, because it defines the feature space in which the training set data will be classified [20,82].

k-Nearest Neighbour (kNN)
The kNN algorithm is a non-parametric, supervised learning method developed by Fix and Hodges [87], in which the input is made of the k closest training samples of a dataset and the output for a classification problem is a class membership: the object is assigned to the class most common among its k nearest neighbours.Given a set of observations D, where x i is the feature vector and y i is the corresponding class label of the i-th example.The (x i , y i ) is assumed to be i.i.d.from some unknown distribution P of (x, y) on R d × {ω 1 , . . . ,ω M }, where M is the number of classes, and the goal is the design of a classifying function φ n : R d → {ω 1 , . . . ,ω M } that maps a feature vector x into its desired class from {ω 1 , . . . ,ω M }.
The performance of the classifier is evaluated through the probability error of Equation ( 10) [88]: where J is the cost function and φ n is the classifying function.
If the underlying distribution is known, then the optimal decision rule φ * that minimises the probability of error is the Bayes decision rule of Equation ( 11): It can be shown that at any given point, x, the probability that its nearest neighbour x belongs to class ω i converges to the corresponding a posteriori probability P(ω i |x ) as the number of reference observations goes to infinity.It was shown by Cover and Hart [89] that under some continuity conditions on the underlying distribution, the asymptotic probability of error L NN for a multi-class kNN classifier is bound by the terms reported in Equation ( 12): where L * is the optimal Bayes probability of error.Thus, the nearest-neighbour rule is asymptotically optimal when the classes are not overlapping [90].
In the classification phase, after the constant k has been defined by the user, an unlabelled vector is classified by assigning the label that is most frequent among the k training samples that are nearest to that query point.The distance metric could be represented by the Euclidean distance, the Hamming distance, or the correlation coefficients.The value of k deeply affects the classification performance [40]: larger values reduce the effects of noise but make the boundaries between classes less distinct.The accuracy is also affected by the presence of noisy or irrelevant features, or if the feature scales are not consistent with their importance.Finally, data-reduction is one of the biggest problems when working with huge datasets, because only some of the data points, called prototypes, are needed for accurate classification.A commonly used algorithm for dataset reduction is the Hart algorithm [91].

Shapley Values
The Shapley values [92] were introduced to assign pay-outs to players depending on their contribution towards the total playout, but they can be effectively used to measure the global importance of a feature.Shapley values can be explained by using the game theory as follows: given a cooperative game with d players aiming at maximising a payoff and letting S ⊆ M = {1, . . . ,d} be a subset of |S| players, the contribution function ν(S), which maps a subset of players to the real number, called the contribution of coalition S, describes the total expected sum of payoffs that the member of S can obtain by cooperation.Thus, Shapley values embody a method to distribute the total gains to players.The amount that player j earns is given by Equation ( 13) [53]: This expression is a weighted mean over the contribution functions differences for all subsets S of players not containing player j.The algorithm also accounts for the nondistributed gain φ 0 = ν(∅).
In the ML theory, single predictions take the place of a pay-out, and features become the players.The main properties of the Shapley values can be summarised as follows [93].

•
Efficiency: the total gain is distributed • Symmetry: if i and j are two players that contribute equally to all coalitions, then their Shapley value is the same ν(S ∪ {i}) = ν(S ∪ {j}) → φ i = φ j ; • Dummy player: if a player does not contribute to any coalitions, then his Shapley value is zero φ j = 0; • Monotonicity: if two games v and v and a player always makes a greater contribution to v than to v for all S, then the gain for v will be greater than that for v Linearity: if two coalition games described by the gain functions v and w are combined, the distributed gain is the sum of the two gains is φ j (v + w) = φ j (v) + φ j (w).This is valid also for a multiplication by a constant value α, φ j (α v) = α φ j (v).
Given a training set of N labelled samples D = {(x 1 , y 1 ), . . . ,(x N , y N )} used to train a predictive ML model f (x) whose outcome should be as close as possible to label y, Shapley values can be used to explain the prediction y * = f (x * ) for a specific feature vector x = x * .The prediction can be written as reported in Equation ( 14): where φ 0 = E[ f (x)] and φ * j is the φ j for x = x * .Therefore, Shapley values explain the difference between the prediction y * = f (x * ) and the global average prediction.For the computation of the Shapley values for a prediction explanation, the contribution v(S) for a certain subset S must be defined.This function should take values similar to f (x * ) when only the value of the subset S of these features is known.As cited by Lundberg [47], the expected output of the predictive model of Equation (15), conditional to the feature values x S = x * S of this subset, can be used: Thus, the conditional expectation summarises the whole probability distribution and, if it is considered the prediction for f (x * ), it is also the minimiser of the squared error loss function.
There are two main algorithms for the computation of the Shapley: the KernelSHAP and the extension to the KernelSHAP method.Both define the value function of Equation ( 16) such that the sum of the Shapley values of a query point over all the features corresponds to the total deviation of the prediction for the query point from an average: Their main difference is in the way the feature contribution to the final prediction is evaluated: the latter is an interventional method, which attributes an influence onto x j only if the value function that is computed and considering that element is significantly different than if it was computed without it.On the contrary, the conditional approach of the former attributes even influences features with no interventional effect, only because of their presence in the model.Thus, this kind of approach requires further modelling of how the features are correlated [94].
The extension to the KernelSHAP algorithm [53] defines the value function of Equation ( 17) of the feature in S at the query point x using the conditional distribution of X S C , given that X S is the query point values:

Dataset for Industrial Bearings
The dataset employed in this study was constructed by means of the test rig for industrial medium-sized bearings available at the Mechanical Engineering Laboratory of Politecnico di Torino [80].The test rig was specifically designed for research activities.A comprehensive description of the test rig is provided in the reference [80].The dataset involves ten shaft speeds, three different radial loads, and three bearing health states.

Test Rig Description
The test rig shown in Figure 1 is powered by a 30 kW three-phase motor by SIEMENS ® , which is connected to the SIEMENS ® G120_CU240E_2 INVERTER with a brake resistor that drives the shaft up to 195 Nm of the maximum torque at the rated spin speed of 1470 rpm.The speed and torque controls, managed by the inverter, can set the motor speed up to 2500 rpm.The "self-contained box" (Figure 2) is made of two housing blocks with oil lubrication and is equipped with up to four different roller bearings, whose reactions are directly balanced by the box.Hydraulic actuators on the upper panel of the box and between each pair of bearings allow the application of radial and axial static loads, respectively, up to 200 kN.The innovative design of the box makes it an isolated system with respect to all of the loads applied; in fact, radial and axial loads act only inside the box and are uncoupled with the outer part of the rotor system.The vibration signature is in the time and frequency domain as long as the operating temperature of each bearing under test can be measured through four SKF CMSS 2200 T sensors, which are mounted along the radial direction of the bearing-box adapters.Vibration signals are acquired through the LMS Scadas III Digital Acquisition system, equipped with four PQMAs (Programmable Quad Microphone Amplifiers).The main advantages of this test rig can be resumed as follows:

Vibration Dataset
The vibration signals coming from any bearing are always affected by the noise coming from the surrounding parts of the machine, which can make the recognition of defects and faults difficult [5].In fact, those paths are usually complex and have non-stationary behaviour and the extraction of the required information could be hard [49].For those reasons, a preliminary analysis of the vibration signature of the test rig in regular operating conditions is essential before the beginning of any condition monitoring activity [80].
In the experimental testing session carried out for the generation of the database of vibration data used in this work, four SKF spherical self-aligning bearings 22,240

Vibration Dataset
The vibration signals coming from any bearing are always affected by the noise coming from the surrounding parts of the machine, which can make the recognition of defects and faults difficult [5].In fact, those paths are usually complex and have non-stationary behaviour and the extraction of the required information could be hard [49].For those reasons, a preliminary analysis of the vibration signature of the test rig in regular operating conditions is essential before the beginning of any condition monitoring activity [80].
In the experimental testing session carried out for the generation of the database of vibration data used in this work, four SKF spherical self-aligning bearings 22,240

Vibration Dataset
The vibration signals coming from any bearing are always affected by the noise coming from the surrounding parts of the machine, which can make the recognition of defects and faults difficult [5].In fact, those paths are usually complex and have non-stationary behaviour and the extraction of the required information could be hard [49].For those reasons, a preliminary analysis of the vibration signature of the test rig in regular operating conditions is essential before the beginning of any condition monitoring activity [80].
In the experimental testing session carried out for the generation of the database of vibration data used in this work, four SKF spherical self-aligning bearings 22,240 CCK/W33 with a tapered bore and a 360 mm outer diameter were mounted inside the test rig.The vibration signals were acquired through the sensors and the DAQ system and were digitally sampled at f s = 20, 480 Hz with AC coupling, which was used to remove the offset DC component of the voltage that was used to energise the piezoelectric accelerometers from the acquisition.A sixth-order Butterworth filter with a 10, 200 Hz cut-off frequency was initially applied to the acquired signals and some digital filters were then designed for the elimination of the mechanical and electromagnetic noise to improve the signal-to-noise ratio (SNR).
Experimental data were obtained for three radial load cases over ten different rotational speeds and three defect conditions.The rotational speed refers to the three-phase motor synchronous velocity.Point defects were made by removing a circular portion of the material with a diameter of 2 mm and a depth of 0.5 mm from the raceways and only a single defect was present at a time in the bearing.The inner race (IR) and the outer race (OR) defects are shown in Figure 3, whereas Table 1 shows the operating conditions for the tests.Each recorded vibration path is 60 s long and the whole database consists of 90 vibration paths, labelled depending on the defect condition of the tested bearing.
. 2023, 13, x FOR PEER REVIEW 10 of 22 the offset DC component of the voltage that was used to energise the piezoelectric accelerometers from the acquisition.A sixth-order Butterworth filter with a 10,200 Hz cut-off frequency was initially applied to the acquired signals and some digital filters were then designed for the elimination of the mechanical and electromagnetic noise to improve the signal-to-noise ratio (SNR).Experimental data were obtained for three radial load cases over ten different rotational speeds and three defect conditions.The rotational speed refers to the three-phase motor synchronous velocity.Point defects were made by removing a circular portion of the material with a diameter of 2 mm and a depth of 0.5 mm from the raceways and only a single defect was present at a time in the bearing.The inner race (IR) and the outer race (OR) defects are shown in Figure 3, whereas Table 1 shows the operating conditions for the tests.Each recorded vibration path is 60 s long and the whole database consists of 90 vibration paths, labelled depending on the defect condition of the tested bearing.Non-defective (0), inner race defect (1), and outer race defect (2)

SHAP Analysis for Industrial Bearings
In this paragraph the proposed methodology for feature selection through the SHAP analysis is presented, and the results obtained in the above-mentioned experimental database are reported.All the computations were carried out using Matlab ® and Python environments.These are the main steps of our method: 1. Vibration database acquisition and data pre-processing; 2. Feature extraction and creation of a database of the labelled samples; 3. Training of AI algorithms; 4. Shapley values computation and feature selection.
Figure 4 resumes the most important phases and operations performed in this work.

Features Extraction and Labelling of Vibration Samples
The extraction of the most relevant features from the data gathered during the collecting phase is the step following the data acquisition for the setup of the ML models.This stage is aimed at the identification and computation of representative features from

SHAP Analysis for Industrial Bearings
In this paragraph the proposed methodology for feature selection through the SHAP analysis is presented, and the results obtained in the above-mentioned experimental database are reported.All the computations were carried out using Matlab ® and Python environments.These are the main steps of our method: 1.
Vibration database acquisition and data pre-processing; 2.
Feature extraction and creation of a database of the labelled samples; 3.
Training of AI algorithms; 4.
Shapley values computation and feature selection.These characteristics can highlight the presence of defects, but they can also contain irrelevant information that can influence the performance of the classifier [96].
The features used in this work belong to the time and the frequency domain.The time-domain signal usually changes in its amplitude and distribution when a defect is present in one of the elements of the bearing since it induces some characteristics vibration impulses.For example, the mean, root mean square, and impulse factor reflect the vibration amplitude and energy, while the standard deviation, crest factor, kurtosis, and shape factor can be used to describe the time series distribution [97,98].The frequency-domain parameters can add some information, which is not present in the time-domain.They can be extracted after the computation of the frequency-domain signal through the Discrete

Features Extraction and Labelling of Vibration Samples
The extraction of the most relevant features from the data gathered during the collecting phase is the step following the data acquisition for setup of the ML models.This stage is aimed at the identification and computation of representative features from the recorded data using a time-domain statistical analysis and a Fourier spectral analysis [95].
These characteristics can highlight the presence of defects, but they can also contain irrelevant information that can influence the performance of the classifier [96].
The features used in this work belong to the time and the frequency domain.The time-domain signal usually changes in its amplitude and distribution when a defect is present in one of the elements of the bearing since it induces some characteristics vibration impulses.For example, the mean, root mean square, and impulse factor reflect the vibration amplitude and energy, while the standard deviation, crest factor, kurtosis, and shape factor can be used to describe the time series distribution [97,98].The frequency-domain parameters can add some information, which is not present in the time-domain.They can be extracted after the computation of the frequency-domain signal through the Discrete Fast Fourier Transform (DFFT) [99].Commonly used frequency-domain features include the mean frequency, which represents the vibration energy, frequency centre, and root mean square frequency, which shows the position shift of the main excited frequencies and the standard deviation frequency that describes the degree of convergence of the spectrum power [100].
In this work, prior to the feature extraction, the vibration paths were segmented into 400 ms long intervals to achieve more examples for the training phase of the AI algorithms.From each segment, twenty-three features, taken from Lei et al. [101,102], were extracted and they are reported in Table 2.The first eleven parameters belong to the time-domain, whereas the others belong to the frequency domain: x(i) is the i-th acceleration sample, N is the number of data points in the time segment, s(k) is the k-th spectrum amplitude, K is the number of spectrum lines, and f k is the frequency value of the k-th spectrum line.After the feature extraction activity, a sample with twenty-three features and a label depending on the defect condition was generated from each segment.Therefore, a wide database of labelled samples was setup for the subsequent training phase.

Time−Domain Features
Frequency−Domain Features

Training of AI Algortihms
ML algorithms such as the SVM and kNN have been among the most used theories since the introduction of IFD [30].For that reason, they were chosen for the analysis made in this work.Both the algorithms are characterised by some parameters that must be tuned to achieve their best performance.
As said previously, the goal of the SVM algorithm is the identification of the optimal hyperplane that maximises the margin of the training data and therefore, the distance between the nearest points of each class.The parameters that were tuned are the following:

•
The kernel function that defines the function of the hyperplane used to separate data; • C, called the regularisation term, which is the penalty parameter of the error term and controls the tradeoff between a smooth decision boundary and classifying the training points correctly; • γ, which influences the number of nearby points used for the calculation of the separating hyperplane using the radial basis function (RBF) kernel.
Regarding the kNN algorithm, the main tuneable parameter is the number of neighbours k, which represents the number of the closest training examples that are considered for the classification of a new point.The training of the algorithms and optimisation of the tuneable parameters were carried out through the Python function GridSearchCV, using a 5-fold cross-validation, and the main parameters of the optimised algorithms are resumed in Table 3, where d is the number of features and σ X is the variance of the feature database.Before the training phase, N = 2000 samples were randomly selected from the original database and then split into two sets with a ratio 80/20: the bigger dataset was used for the training phase itself, while the other one was used for the evaluation of the algorithms testing accuracy.

Shapley Values Computation and Features Selection
When a trained model provides an outcome for a regression or classification task, we may want to analyse its behaviour with respect to the inputs and outputs in an easy and understandable way, since the accuracy does not give a complete and exhaustive description of the performance [34].For example, the decision process of non-linear and complex models is often difficult to be understood.To evaluate the performance of the proposed AI algorithms and the contribution of each feature to the outcome of the classification problem, three stages were implemented in this study: The SHAP values can be analysed from two points of view: they are the expression of the contribution of each feature to the outcome of a single classification outcome or, on the other side, we can study how these values are attributed globally for each feature.
The SHAP values computed on each point explain how each feature contributed to the model classification prediction, by modifying the base value, which is the mean prediction of the model trained on the specific set of data used in the training phase.Figure 5 shows how each feature contributes to the change of the base value towards higher or lower values.The base value of each class is the probability that a sample is classified into that class, independent of the values of its features.Therefore, it is highly linked to the probability distribution of the training examples among the classes.On the contrary, the final value of the prediction represents the probability of the point under investigation to be classified in that specific class and it is the result of the feature contribution: if the computed value is one, the predicted probability of a specific fault by the model is 100%.Thus, the outcome of the classification would be the class with the highest probability.In this specific case the base value is around 0.34.This is plausible because, since the training examples were taken randomly from the database, the distribution among the selected examples of each of the three defect classes should be around 33%.Then, the base value is modified by each feature, depending on its value.The amounts of contribution of each feature is represented by the Shapley value and is displayed by an arrow that pushes to increase or decrease the prediction.During the training phase, the SHAP algorithm learns a threshold value for each feature that is related to each different class to determine if that feature, depending on its value, will have a positive or negative effect towards the prediction and the magnitude of the contribution.For example, a certain feature can have a positive effect on a class and a negative effect on the others.From Figure 5 It can be seen that p 17 , p 18 , and p 19 (red arrows) are the most important features in increasing the probability of this example to be classified in that class, whereas p 21 (blue arrow) is the most influent in decreasing the base value.Features can have a positive or a negative contribution depending on their value, which is related to the threshold value computed in the training phase of the algorithm.The final value, which is 0.40, tells that the sample under investigation has a 40% probability to be classified in this defect class.
final value of the prediction represents the probability of the point under investigation to be classified in that specific class and it is the result of the feature contribution: if the com puted value is one, the predicted probability of a specific fault by the model is 100%.Thus the outcome of the classification would be the class with the highest probability.In thi specific case the base value is around 0.34.This is plausible because, since the training examples were taken randomly from the database, the distribution among the selected examples of each of the three defect classes should be around 33%.Then, the base value is modified by each feature, depending on its value.The amounts of contribution of each feature is represented by the Shapley value and is displayed by an arrow that pushes to increase or decrease the prediction.During the training phase, the SHAP algorithm learn a threshold value for each feature that is related to each different class to determine if tha feature, depending on its value, will have a positive or negative effect towards the predic tion and the magnitude of the contribution.For example, a certain feature can have a pos itive effect on a class and a negative effect on the others.From Figure 5 It can be seen tha  ,  , and  (red arrows) are the most important features in increasing the probabil ity of this example to be classified in that class, whereas  (blue arrow) is the most in fluent in decreasing the base value.Features can have a positive or a negative contribution depending on their value, which is related to the threshold value computed in the training phase of the algorithm.The final value, which is 0.40, tells that the sample under investi gation has a 40% probability to be classified in this defect class.

Results and Discussion
Figure 6 shows the SHAP values obtained for different damage classes when the SVM and the kNN algorithms are applied; similarly, Figure 7 shows the mean SHAP val ues for the features.The plots of Figure 6 present the features sorted along the -axis de pending on their impact on the output, but they also report the SHAP values of every feature computed on each example of the dataset.As stated in Shapley [92], for the feature with the highest importance, their SHAP values span over a greater range than the rest Points are also coloured depending on their value.In this way, it is easy to figure out when a feature has a high or low impact on the output and when it contributes to increasing o decreasing the base probability value of each class.Interestingly, the impact on mode outputs given by the analysed features is similar in both the ML models.For instance high values of the feature  have a positive impact for the class zero and the class one whereas high values of the feature  have a negative impact for the class two, which i the outer race damage.This means that signals characterised by high values of  are

Results and Discussion
Figure 6 shows the SHAP values obtained for different damage classes when the SVM and the kNN algorithms are applied; similarly, Figure 7 shows the mean SHAP values for the features.The plots of Figure 6 present the features sorted along the y-axis depending on their impact on the output, but they also report the SHAP values of every feature computed on each example of the dataset.As stated in Shapley [92], for the feature with the highest importance, their SHAP values span over a greater range than the rest.Points are also coloured depending on their value.In this way, it is easy to figure out when a feature has a high or low impact on the output and when it contributes to increasing or decreasing the base probability value of each class.Interestingly, the impact on model outputs given by the analysed features is similar in both the ML models.For instance, high values of the feature p 6 have a positive impact for the class zero and the class one, whereas high values of the feature p 6 have a negative impact for the class two, which is the outer race damage.This means that signals characterised by high values of p 6 are very unlikely to belong to bearings that are damaged on the outer race.For some features, such as p 19 and p 3 , a similar attitude is detectable for both the models.
very unlikely to belong to bearings that are damaged on the outer race.For some features, such as  and  , a similar attitude is detectable for both the models.Figure 7 reports the SHAP values of the ten most important features, which are sorted depending on their average impact on predictions for the three different defect conditions.As it can be seen from those plots,  , which is the skewness parameter, is by far the most important feature for the classification of the bearing defect condition for both the SVM and kNN algorithms.Skewness is the ratio of the average deviation from the mean cubed divided by the standard deviation cubed; for a random variable with normal distribution, the skewness is zero.Higher values of skewness discriminate between an outer race defect and the other two defect conditions.For classes zero and two, the skewness has a much higher mean SHAP value than the other features.Other important features for both the Figure 7 reports the SHAP values of the ten most important features, which are sorted depending on their average impact on predictions for the three different defect conditions.As it can be seen from those plots, p 6 , which is the skewness parameter, is by far the most important feature for the classification of the bearing defect condition for both the SVM and kNN algorithms.Skewness is the ratio of the average deviation from the mean cubed divided by the standard deviation cubed; for a random variable with normal distribution, the skewness is zero.Higher values of skewness discriminate between an outer race defect and the other two defect conditions.For classes zero and two, the skewness has a much higher mean SHAP value than the other features.Other important features for both the algorithms that belong to the time-domain are p 3 , the squared mean root of absolute values (SMRA), which is determinant in the recognition of the defect-free class, and p 10 , the shape factor, which is a parameter dependent on the signal shape that seems to be important in the discrimination between the inner and the outer race defect, especially for the SVM.The skewness and shape factor were also found to be relevant features in other works, such as the one from Hui et al. [41].The last two relevant parameters that are in common with both the algorithms are p 18 and p 19 : the former, which is the root mean square frequency, is nearly the most important feature for class one, while the latter has a greater impact on kNN.
Appl.Sci.2023, 13, x FOR PEER REVIEW 16 of 22 algorithms that belong to the time-domain are  , the squared mean root of absolute values (SMRA), which is determinant in the recognition of the defect-free class, and  , the shape factor, which is a parameter dependent on the signal shape that seems to be important in the discrimination between the inner and the outer race defect, especially for the SVM.The skewness and shape factor were also found to be relevant features in other works, such as the one from Hui et al. [41].The last two relevant parameters that are in common with both the algorithms are  and  : the former, which is the root mean square frequency, is nearly the most important feature for class one, while the latter has a greater impact on kNN.After the SHAP analysis was carried out, the four most important features of each class were selected for the SVM and kNN algorithms.Table 4 reports all the selected features for each algorithm.In kNN, only five features appear in the top four places of the After the SHAP analysis was carried out, the four most important features of each class were selected for the SVM and kNN algorithms.Table 4 reports all the selected features for each algorithm.In kNN, only five features appear in the top four places of the three different defect-related classes, while the most relevant features related to the SVM are nine.Therefore, the kNN model is explained by fewer parameters, with respect to the SVM.The influence of the feature selection activity was evaluated by comparing the accuracy of the algorithm trained considering all the twenty-three features, or only the features in Table 4 that are in common to both the SVM and kNN, which are p 3 , p 6 , p 10 , p 18 , and p 19 .For this task, 100 different datasets containing N = 2000 randomly selected samples from the original database were used.Each dataset was split into two subsets with a ratio 80/20: the former was used for the training phase, while the latter was used for the computation of the testing accuracy.The accuracy of the algorithms in both the considered cases and for each set of data was recorded and the mean accuracy was computed to analyse the overall impact of the feature selection activity.The results are reported in Figure 8. three different defect-related classes, while the most relevant features related to the SVM are nine.Therefore, the kNN model is explained by fewer parameters, with respect to the SVM.The influence of the feature selection activity was evaluated by comparing the accuracy of the algorithm trained considering all the twenty-three features, or only the features in Table 4 that are in common to both the SVM and kNN, which are  ,  ,  ,  , and  .For this task, 100 different datasets containing  = 2000 randomly selected samples from the original database were used.Each dataset was split into two subsets with a ratio 80/20: the former was used for the training phase, while the latter was used for the computation of the testing accuracy.The accuracy of the algorithms in both the considered cases and for each set of data was recorded and the mean accuracy was computed to analyse the overall impact of the feature selection activity.The results are reported in Figure 8.

Model
Selected Features SVM  ,  ,  ,  ,  ,  ,  ,  ,  kNN  ,  ,  ,  ,  Thus, with the help of the insights provided by the SHAP analysis, the number of relevant features could be reduced with proper justification, making the ML models more robust and easily explainable in terms of feature importance for the outcome of the model.As it can be seen from Figure 8a, the accuracy for the SVM is only 1% worse, as if it has been trained considering all the features, whereas Figure 8b shows that the kNN algorithm performs better only if the most important features are considered.This was expected, because, as cited previously, this kind of algorithm suffers from feature redundancy.Then, diagnosis accuracies are not significantly affected by the selection of only five features out of twenty-three, as long as the SHAP values are employed to select the relevant features by means of model explanations.

Conclusions
This study aimed at investigating the capabilities of SHAP for explaining machine learning models in the intelligent fault diagnosis of industrial bearings.Additionally, this results in identifying the most relevant features for diagnosis.The explanations of the diagnosis outcomes were provided for the SVM and kNN models.The authors investigated the case provided by the test rig for industrial bearings available at the Mechanical Engineering Laboratory of Politecnico di Torino.The conclusions are as follows: • The SVM and the kNN models are able to achieve diagnosis accuracies higher than 98.5% for medium-sized industrial bearings; Thus, with the help of the insights provided by the SHAP analysis, the number of relevant features could be reduced with proper justification, making the ML models more robust and easily explainable in terms of feature importance for the outcome of the model.As it can be seen from Figure 8a, the accuracy for the SVM is only 1% worse, as if it has been trained considering all the features, whereas Figure 8b shows that the kNN algorithm performs better only if the most important features are considered.This was expected, because, as cited previously, this kind of algorithm suffers from feature redundancy.Then, diagnosis accuracies are not significantly affected by the selection of only five features out of twenty-three, as long as the SHAP values are employed to select the relevant features by means of model explanations.

Conclusions
This study aimed at investigating the capabilities of SHAP for explaining machine learning models in the intelligent fault diagnosis of industrial bearings.Additionally, this results in identifying the most relevant features for diagnosis.The explanations of the diagnosis outcomes were provided for the SVM and kNN models.The authors investigated the case provided by the test rig for industrial bearings available at the Mechanical Engineering Laboratory of Politecnico di Torino.The conclusions are as follows:

•
The SVM and the kNN models are able to achieve diagnosis accuracies higher than 98.5% for medium-sized industrial bearings;

•
The SHAP values are effective for interpreting machine learning models that are aimed at industrial condition monitoring;

•
The SHAP analysis is employed to show that four features out of twenty-three are really important to achieve good diagnosis accuracies; • The skewness and the shape factor of the vibration signals have the greatest impact on the outcomes of machine learning diagnosis models.
The generalisability of these results is subjected to certain limitations.For instance, the choice of the diagnosis model could influence the explainability.Then, future works will provide explanations for different ML models, and the results will be compared with those of this study.Especially, research will be focused on deep learning, which is the most used in IFD nowadays, because it allows the automatic learning of fault features from the collected data instead of the artificial feature extraction, attempting to provide end-to-end diagnosis models when handling increasingly grown data and connecting the raw monitoring data to their corresponding health states of machines, which will further release the contribution of human labour [30,103].Another field of investigation will be the effect of the feature selection through the SHAP analysis on the computation time.We noticed a little improvement in the computation time between the ML model trained with all the features and those trained only with the selected ones, but since the SHAP analysis is computationally very expensive, an overall time saving could not be achieved.However, as it is reported in the literature [75,76], ML models involving deep learning techniques could take advantage of the feature selection through the SHAP method in terms of the computation time, and, therefore, it will be the subject of our future study.Additionally, the feature selection capabilities should be assessed on a general level, and we will mainly focus on feature selection capabilities besides explainability, by providing proper comparisons with the common techniques targeted in this task.

• 22 •
Independent radial and axial loads of up to 200 kN on each tested bearing; • Simultaneous testing of four bearings, which allows the self-balance of applied loads with minimal transmission to the platform; • High modularity, which enables testing on different sized bearings up to 420 mm of the outer diameter; • Direct measure of friction torque of tested bearings; • Main bearings are immune to the loads acting on the test bearings.Appl.Sci.2023, 13, x FOR PEER REVIEW 9 of Direct measure of friction torque of tested bearings; • Main bearings are immune to the loads acting on the test bearings.

Figure 4
Figure 4 resumes the most important phases and operations performed in this work.

Figure 4 .
Figure 4. Flowchart of the procedure presented in this work.

Figure 4 .
Figure 4. Flowchart of the procedure presented in this work.

Figure 5 .
Figure 5. SHAP explanation for a single prediction.

Figure 5 .
Figure 5. SHAP explanation for a single prediction.Then, if this computation is replicated for the entire dataset, it is possible to analyse how the SHAP values are attributed to features globally.The computation of the Shapley values was performed using the "shapley" library in Python on both the optimised algorithms trained on the previously cited training set with N train = 1600 examples.The Shapley values were computed using the first 500 samples of the training set.Only the ten most important features, selected depending on the overall absolute value of their Shapley values, were plotted for each one of the three classes of defect.

Table 3 .
Hyperparameters for the SVM and kNN algorithms.

Table 4 .
Most important features according to SHAP analysis.