This study presents a comprehensive methodology for the fault diagnosis of the hoist head sheave device in mining operations, addressing the challenges of heavy-load dynamics, environmental harshness, and early fault detection. The approach integrates theoretical analysis of failure mechanisms, finite element-based simulations for virtual prototyping, experimental validation through controlled testing, advanced signal processing for multi-source data fusion, and a hybrid deep learning model for automated classification. The pipeline is designed for reproducibility and scalability, leveraging established tools: SolidWorks 2022 for 3D modeling, ANSYS 2023 for modal analysis, ADAMS 2023 for multi-body dynamics, MATLAB R2023a for signal processing, and Python 3.10 (with TensorFlow 2.12 and scikit-learn 1.3) for machine learning. The dataset encompasses simulated signals (n = 2000 samples) and experimental recordings (n = 1200 samples) across five fault classes: normal operation, wheel body crack, shaft fatigue, bearing inner ring fault, and bearing outer ring fault. Each class includes balanced representations under variable operational conditions (lift speeds: 5–15 m/s; rope tension: 300–600 kN).
Figure 1 presents the integrated research framework, highlighting the two-stage pipeline of rigid–flexible model construction and validation-driven intelligent fault diagnosis.
The methodology progresses sequentially: from theoretical foundations to simulation, signal preparation, experimental corroboration, and diagnostic modeling. This ensures a robust bridge between abstract mechanics and practical AI-driven insights, with validation at each stage to minimize sim-to-real discrepancies.
3.1. Theoretical Foundation: Failure Mechanisms and Modal Features
From the fundamental principles of structural dynamics, all types of mechanical components can be regarded as dynamic systems composed of key parameters such as stiffness characteristics, mass distribution, and damping coefficients. When defects or damage occur in the system, these fundamental parameters inevitably undergo corresponding changes, leading to alterations in the system’s vibration characteristics. This manifests as significant differences in modal parameters such as modal frequencies, mode shapes, and frequency response functions. Based on this principle, diagnostics can identify and evaluate structural health by monitoring changes in dynamic characteristics. Specifically, effective damage diagnosis methods are established by comparing modal parameter differences between intact and damaged states. This modal analysis-based damage identification technology provides crucial theoretical foundations and technical tools for structural health monitoring. This paper primarily employs modal frequency as the damage identification parameter.
To address practical challenges, the finite element method discretizes structural systems from infinite degrees of freedom to finite ones. Most reliable structural damage identification methods use finite element models as reference standards. These models provide benchmarks for structural damage, enabling detection by comparing modal parameters between current and reference models. Typically, damage identification begins with simple, cost-effective methods to determine whether damage has occurred. Once confirmed, complex analytical methods are employed for comprehensive damage detection.
Due to their accessibility and high precision, modal frequencies are the preferred choice for structural damage detection. Monitoring changes in modal frequencies offers a method that is not only simple and practical but also allows flexible adjustment of measurement point layouts according to specific requirements. Consequently, damage detection using modal frequencies holds significant advantages in practical applications [38].
In engineering vibration analysis, most mechanical vibration systems can be discretized into dynamic systems with a finite number of degrees of freedom. After the structure is discretized using the finite element method, its motion is described by an $n$th-order matrix differential equation:

$$M\ddot{x}(t) + C\dot{x}(t) + Kx(t) = F(t) \tag{1}$$

where $M$, $C$, and $K$ denote the mass, damping, and stiffness matrices, each of order $n \times n$ and typically symmetric; $x$, $\dot{x}$, and $\ddot{x}$ represent the displacement, velocity, and acceleration arrays, each of order $n \times 1$; and $F(t)$ denotes the external excitation array, of order $n \times 1$. When the mechanical energy loss of a system is negligible, it can be regarded as a conservative system. For an undamped vibrating system with $n$ degrees of freedom, the differential equation is:

$$M\ddot{x}(t) + Kx(t) = F(t) \tag{2}$$

Since Equation (2) is non-homogeneous, its complete solution can be decomposed into a linear superposition of two components: the general solution of the corresponding homogeneous equation, representing the system’s free response; and a particular solution of the non-homogeneous equation, reflecting the forced response under external excitation. For the free vibration of a structure:

$$M\ddot{x}(t) + Kx(t) = 0 \tag{3}$$

Suppose $x(t) = \varphi e^{j\omega t}$, where $\varphi$ is the amplitude array of the free response, of order $n \times 1$. Then, the homogeneous equation is expressed in the frequency domain as:

$$(K - \omega^{2} M)\varphi = 0 \tag{4}$$

Equation (4) is an eigenvalue problem, where $\varphi$ represents the eigenvector. Introduce a variable $\lambda$, where $\lambda = \omega^{2}$; $\lambda_i$ is the $i$th eigenvalue, and $\varphi_i$ is the corresponding $i$th-order mass-normalized displacement modal vector ($\varphi_i^{T} M \varphi_i = 1$). From this, the following relationship can be obtained:

$$(K - \lambda_i M)\varphi_i = 0 \tag{5}$$

In large mechanical structures such as head sheave devices, damage often leads to a significant decrease in structural stiffness with relatively little impact on mass distribution. Damage therefore introduces a small change $\Delta K$ in the stiffness matrix of the structure, which in turn changes $\lambda_i$ and $\varphi_i$. For the vibration of the damaged structure, the perturbation equation can be expressed as:

$$\left[(K + \Delta K) - (\lambda_i + \Delta\lambda_i) M\right](\varphi_i + \Delta\varphi_i) = 0 \tag{6}$$

where $\Delta K$ is the change in the overall stiffness matrix, $\Delta\lambda_i$ is the change in the eigenvalue, and $\Delta\varphi_i$ is the change in the eigenvector. Expanding Equation (6), neglecting the second-order terms (first-order approximation), and combining with Equation (5) gives:

$$(K - \lambda_i M)\Delta\varphi_i + (\Delta K - \Delta\lambda_i M)\varphi_i = 0 \tag{7}$$

Left-multiplying Equation (7) by $\varphi_i^{T}$ and using Equation (5) together with the symmetry of $K$ and $M$, so that $\varphi_i^{T}(K - \lambda_i M) = 0$, we obtain:

$$\Delta\lambda_i = \frac{\varphi_i^{T}\,\Delta K\,\varphi_i}{\varphi_i^{T} M \varphi_i} = \varphi_i^{T}\,\Delta K\,\varphi_i \tag{8}$$

According to Equation (8), when the structure is damaged, the eigenvalue change $\Delta\lambda_i$ is linearly related to the stiffness change $\Delta K$, and because $\lambda_i = \omega_i^{2}$, the modal frequency $\omega_i$ is affected by $\Delta K$. Therefore, if the structure is damaged, its modal frequency will also change, and this phenomenon is described in many studies as a frequency drift characteristic after structural damage [39, 40]. In addition, when the damage location is fixed, the more severe the damage, the greater the frequency change.
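The frequency drift argument can be checked numerically on a toy system. The following is a minimal sketch, not the head sheave finite element model: it assumes a hypothetical four-mass fixed-free spring chain (the masses, spring constants, and 5% stiffness loss are illustrative values) and compares the exact eigenvalue change after a local stiffness reduction with the first-order estimate obtained from the mass-normalized mode shapes.

```python
import numpy as np

def chain_stiffness(k):
    """Stiffness matrix of a fixed-free spring-mass chain (k[0] connects mass 0 to ground)."""
    n = len(k)
    K = np.zeros((n, n))
    for i in range(n):
        K[i, i] += k[i]
        if i + 1 < n:
            K[i, i] += k[i + 1]
            K[i, i + 1] -= k[i + 1]
            K[i + 1, i] -= k[i + 1]
    return K

def modal_solution(K, M):
    """Eigenvalues and mass-normalized mode shapes of K*phi = lam*M*phi (diagonal M assumed)."""
    m_inv_sqrt = 1.0 / np.sqrt(np.diag(M))
    A = K * np.outer(m_inv_sqrt, m_inv_sqrt)   # M^(-1/2) K M^(-1/2), symmetric
    lam, q = np.linalg.eigh(A)
    phi = q * m_inv_sqrt[:, None]              # columns satisfy phi_i^T M phi_i = 1
    return lam, phi

# Illustrative (assumed) parameters: four 10 kg masses, 1e6 N/m springs.
k_intact = np.full(4, 1.0e6)
M = np.diag(np.full(4, 10.0))
K = chain_stiffness(k_intact)
lam, phi = modal_solution(K, M)

# "Damage": 5% stiffness loss in the second spring.
k_damaged = k_intact.copy()
k_damaged[1] *= 0.95
dK = chain_stiffness(k_damaged) - K
lam_damaged, _ = modal_solution(K + dK, M)

dlam_exact = lam_damaged - lam
dlam_first_order = np.einsum("ji,jk,ki->i", phi, dK, phi)  # phi_i^T dK phi_i per mode

freq_intact = np.sqrt(lam) / (2 * np.pi)
freq_damaged = np.sqrt(lam_damaged) / (2 * np.pi)
for i in range(4):
    print(f"mode {i + 1}: {freq_intact[i]:8.3f} Hz -> {freq_damaged[i]:8.3f} Hz, "
          f"d_lambda exact {dlam_exact[i]:10.2f}, first-order {dlam_first_order[i]:10.2f}")
```

In this toy case one mode is unaffected because the damaged spring carries no strain in that mode shape, which illustrates why the sensitivity of each modal frequency depends on the damage location.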
3.3. Signal Processing
Effective fault diagnosis of the hoist head sheave device necessitates the extraction of discriminative features from non-stationary, multimodal signals generated under varying operational and environmental conditions. To this end, a two-channel signal processing framework is implemented, targeting vibration and acoustic emission (AE) signals. Each modality is processed through dedicated pipelines to capture complementary features, subsequently fused to form an enriched representation for classification.
3.3.1. Preprocessing
To ensure the diagnostic system effectively distinguishes fault-related patterns in the head sheave device, preprocessing is employed as a foundational step to enhance signal quality, remove irrelevant noise, and normalize inputs across varying operational states. Vibration and acoustic signals are simultaneously acquired using triaxial piezoelectric accelerometers and high-sensitivity directional microphones, respectively, each sampled at 10 kHz to preserve transient fault components. Prior to feature extraction, the vibration signals undergo a structured preprocessing pipeline comprising de-trending to eliminate quasi-static displacement components and zero-phase Butterworth bandpass filtering within the 0.5–2.5 kHz range. This frequency band is empirically chosen to capture bearing fault harmonics, structural resonance, and shaft-related vibrational modes, while suppressing low-frequency drift and high-frequency environmental noise. Each signal stream is segmented into overlapping windows of 1024 samples with 50% overlap using a Hann window to minimize spectral leakage during subsequent transformations. For acoustic signals, a pre-emphasis filter is applied to amplify high-frequency content typically associated with stress-wave emissions and material degradation. The acoustic waveform is then partitioned into 25 ms frames with 10 ms overlap, followed by Hamming windowing to preserve temporal continuity and reduce boundary effects during spectral transformation. Both signal types are standardized using z-score normalization to mitigate the influence of sensor variability and operational amplitude fluctuation. Outlier detection, based on kurtosis thresholds and envelope variance monitoring, is implemented to identify and discard low signal-to-noise ratio (SNR) segments or drifted traces. 
After preprocessing, all signal segments are temporally aligned and encoded into a structured tensor format that serves as the input for subsequent time–frequency transformation and model training. This preprocessing framework ensures robust feature representation under complex loading conditions and lays the foundation for high-performance fault classification.
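The vibration preprocessing chain described in this subsection can be sketched with SciPy as follows. This is an illustrative outline rather than the authors' MATLAB implementation: the filter order (4), the order of the normalization and windowing steps, and the synthetic test signal are assumptions, while the sampling rate, passband, window length, and overlap follow the values stated in the text.

```python
import numpy as np
from scipy import signal

FS = 10_000                 # vibration sampling rate from the text (Hz)
BAND_HZ = (500.0, 2500.0)   # passband from the text (0.5-2.5 kHz)
WIN, HOP = 1024, 512        # 1024-sample windows with 50% overlap

def preprocess_vibration(x, fs=FS, order=4):
    """De-trend, zero-phase bandpass, z-score, then cut into Hann-windowed frames."""
    x = signal.detrend(np.asarray(x, dtype=float), type="linear")
    sos = signal.butter(order, BAND_HZ, btype="bandpass", fs=fs, output="sos")
    x = signal.sosfiltfilt(sos, x)              # forward-backward filtering: zero phase
    x = (x - x.mean()) / (x.std() + 1e-12)      # z-score normalization of the record
    n_win = 1 + (len(x) - WIN) // HOP
    idx = np.arange(WIN)[None, :] + HOP * np.arange(n_win)[:, None]
    return x[idx] * signal.get_window("hann", WIN)

# Synthetic example: slow drift + 50 Hz component + 1 kHz component.
t = np.arange(FS) / FS
raw = 0.5 * t + np.sin(2 * np.pi * 50 * t) + 0.2 * np.sin(2 * np.pi * 1000 * t)
frames = preprocess_vibration(raw)
print(frames.shape)  # (number of frames, 1024)
```

After filtering, the drift and the 50 Hz component lie outside the passband, so each frame is dominated by the 1 kHz component.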
3.3.2. Vibration Signal Analysis
Vibration signals serve as the primary modality for capturing the dynamic response of the head sheave device under operating conditions characterized by cyclic loading, structural resonance, and mechanical wear. These signals inherently exhibit non-stationary behavior due to the combined influence of shaft rotation, rope tension fluctuation, and fault-induced impacts. To effectively extract fault-related features embedded in such complex signals, time–frequency domain analysis is adopted, with a particular emphasis on the S-Transform (ST) for its superior joint resolution and interpretability. The vibration acceleration sensor measurement points are shown in Figure 4.
The ST provides a scalable time–frequency representation that maintains absolute phase information while offering frequency-dependent windowing. This is particularly advantageous for detecting transient fault signatures such as bearing pitting, fatigue-induced cracks, and shaft imbalance. In this study, segmented vibration signals preprocessed as described in Section 3.3.1 are transformed using the ST to produce two-dimensional scalograms. These scalograms highlight the evolution of energy content across frequency bands over time, thereby enabling localization of fault-sensitive spectral patterns.
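For readers unfamiliar with the transform, a compact frequency-domain implementation of the standard discrete S-Transform (Stockwell transform) is sketched below. It is a generic textbook formulation in NumPy, not the generalized variant or the MATLAB code used in this study, and the test signal is synthetic.

```python
import numpy as np

def s_transform(x):
    """Discrete S-Transform of a real 1-D signal.

    Returns a complex array of shape (N//2 + 1, N): rows are frequency bins
    0..N/2 (bin k corresponds to k*fs/N), columns are time samples.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    X = np.fft.fft(x)
    m = np.fft.fftfreq(n) * n                 # wrapped integer frequency offsets
    n_freq = n // 2 + 1
    S = np.zeros((n_freq, n), dtype=complex)
    S[0, :] = x.mean()                        # zero-frequency row is the signal mean
    for k in range(1, n_freq):
        # Frequency-dependent Gaussian window: wider in frequency as k grows.
        gauss = np.exp(-2.0 * np.pi**2 * m**2 / k**2)
        S[k, :] = np.fft.ifft(np.roll(X, -k) * gauss)
    return S

# Synthetic check: a cosine located exactly on frequency bin 20.
n = 256
x = np.cos(2 * np.pi * 20 * np.arange(n) / n)
S = s_transform(x)
dominant_bin = int(np.argmax(np.abs(S).mean(axis=1)))
print(S.shape, dominant_bin)
```

The magnitude `np.abs(S)` is the time–frequency map that would be normalized and passed on as an image-like input.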
In addition to scalogram generation, several time and frequency domain features are extracted to enhance the representational capacity of the model. These include statistical features such as Root Mean Square (RMS), Kurtosis, and Crest Factor, which are sensitive to energy variations and impulsiveness associated with structural defects. Furthermore, envelope analysis is performed using the Hilbert Transform to demodulate amplitude-modulated components and identify key fault frequencies, including Ball Pass Frequency of Inner and Outer Race (BPFI and BPFO), fundamental shaft frequency ($f_{shaft}$), and sideband structures linked to modulation effects.
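The statistical features and the Hilbert envelope spectrum can be illustrated with a short sketch. The amplitude-modulated test signal below is synthetic, and its 100 Hz modulation frequency is a hypothetical stand-in for a fault frequency, not a measured value.

```python
import numpy as np
from scipy.signal import hilbert
from scipy.stats import kurtosis

def time_features(x):
    """RMS, kurtosis (non-excess, so a Gaussian signal gives about 3), and crest factor."""
    rms = np.sqrt(np.mean(x**2))
    return {
        "rms": rms,
        "kurtosis": kurtosis(x, fisher=False),
        "crest_factor": np.max(np.abs(x)) / rms,
    }

def envelope_spectrum(x, fs):
    """Amplitude spectrum of the Hilbert envelope (mean removed)."""
    env = np.abs(hilbert(x))
    env = env - env.mean()
    spec = np.abs(np.fft.rfft(env)) / len(env)
    freqs = np.fft.rfftfreq(len(env), d=1.0 / fs)
    return freqs, spec

# Synthetic example: 2 kHz resonance amplitude-modulated at a hypothetical 100 Hz.
fs = 10_000
t = np.arange(fs) / fs
x = (1 + 0.5 * np.cos(2 * np.pi * 100 * t)) * np.sin(2 * np.pi * 2000 * t)

features = time_features(x)
freqs, spec = envelope_spectrum(x, fs)
peak_hz = freqs[np.argmax(spec)]
print(features, peak_hz)
```

The envelope spectrum peaks at the modulation frequency rather than at the carrier, which is why demodulation is useful for locating BPFI, BPFO, and shaft-frequency components.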
To facilitate compatibility with deep learning architectures, the resulting scalograms and statistical feature vectors are reshaped into image-like inputs and normalized across the dataset. These representations serve as inputs to the CNN-RF model, allowing for both spatial feature learning and interpretable classification. The use of vibration-based S-Transform analysis not only enhances sensitivity to early fault signatures but also ensures robustness across varying load levels and operating states, making it an effective approach for real-time condition monitoring of the head sheave system.
3.3.3. Acoustic Signal Analysis
Acoustic signals serve as a valuable complementary modality to vibration data, particularly for detecting early-stage and subtle structural defects in the head sheave device. While vibration analysis excels at capturing low-frequency mechanical resonances and shock responses, acoustic emissions are more sensitive to high-frequency transient phenomena such as crack initiation, material friction, and stress-wave propagation. These acoustic patterns, typically imperceptible to human hearing and easily masked by environmental noise in mining environments, require specialized processing to extract fault-relevant features. The field acoustic sensor measurement points are shown in Figure 5.
In this work, raw acoustic signals acquired using directional condenser microphones with a flat frequency response from 20 Hz to 20 kHz are preprocessed and transformed into compact, high-resolution spectral features using the Mel-Frequency Cepstral Coefficient (MFCC) technique. MFCCs are particularly suitable for non-stationary acoustic signal processing due to their perceptual scaling, which emphasizes frequency bands where fault-induced emissions are most prominent. The MFCC extraction pipeline consists of pre-emphasis filtering to amplify high-frequency components, followed by frame segmentation (25 ms frames with 10 ms overlap) and Hamming windowing to reduce edge discontinuities. Each frame undergoes Fast Fourier Transform (FFT), after which a 40-channel Mel-scale filter bank is applied to approximate the human auditory response. The logarithmic filter bank energies are then converted to decorrelated coefficients using Discrete Cosine Transform (DCT), typically resulting in a 13-dimensional feature vector per frame.
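The MFCC pipeline above can be sketched in NumPy and SciPy as follows. The frame length, overlap, 40 Mel channels, and 13 coefficients follow the text; the pre-emphasis coefficient (0.97), FFT length (1024), the 22.05 kHz sampling rate used as the default, and the random test signal are assumptions made for illustration.

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters spaced uniformly on the Mel scale from 0 Hz to fs/2."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / (right - center)
    return fb

def mfcc(x, fs=22_050, frame_ms=25.0, overlap_ms=10.0,
         n_filters=40, n_coeffs=13, n_fft=1024, pre_emph=0.97):
    """Return an array of shape (n_frames, n_coeffs)."""
    x = np.asarray(x, dtype=float)
    x = np.append(x[0], x[1:] - pre_emph * x[:-1])           # pre-emphasis
    frame_len = int(round(fs * frame_ms / 1000.0))
    hop = frame_len - int(round(fs * overlap_ms / 1000.0))   # 10 ms overlap between frames
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hamming(frame_len)                  # Hamming windowing
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft
    energies = power @ mel_filterbank(n_filters, n_fft, fs).T
    log_energies = np.log(energies + 1e-10)                  # avoid log(0)
    return dct(log_energies, type=2, axis=1, norm="ortho")[:, :n_coeffs]

rng = np.random.default_rng(0)
audio = rng.normal(size=22_050)      # 1 s of synthetic noise
coeffs = mfcc(audio)
print(coeffs.shape)
```

Transposing `coeffs` gives the coefficient-by-time matrix that serves as the two-dimensional acoustic feature map.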
The extracted MFCC features are reshaped into two-dimensional matrices representing spectral evolution over time, effectively capturing the temporal dynamics of fault-induced acoustic variations. These feature maps are standardized via z-score normalization to ensure amplitude consistency across different operating sessions and sensor instances. Additionally, augmentation techniques such as time-warping and pitch shifting are employed to improve model generalization and address class imbalance within the acoustic dataset.
By transforming acoustic waveforms into MFCC representations, the diagnostic framework gains enhanced sensitivity to transient, non-periodic acoustic events associated with incipient failures. These features are subsequently fused with vibration-derived representations at the feature level, providing the hybrid CNN-RF diagnostic model with a multimodal input that enhances fault classification accuracy. The integration of acoustic signal analysis thereby strengthens the system’s ability to detect faults under complex and variable operational scenarios where vibration signals alone may be insufficient.
3.3.4. Multi-Source Fusion and Augmentation
To enhance the diagnostic performance and generalization capability of the proposed fault recognition system, a multi-source data fusion strategy is employed by integrating both vibration and acoustic features into a unified feature representation. This fusion approach leverages the complementary characteristics of each modality: vibration signals are adept at capturing periodic impacts and resonance behaviors, whereas acoustic emissions are more sensitive to high-frequency transient events caused by crack propagation or frictional interactions. The fusion process is executed at the feature level, enabling joint learning of multimodal information without introducing redundancy or misalignment.
The S-Transform scalograms of the preprocessed vibration signals and the MFCC-based acoustic representations are synchronized using the common time base established during data acquisition. These representations are concatenated channel-wise to form a fused feature tensor of shape N × C × T, where N denotes the number of samples, C the combined channel dimensions of both modalities, and T the time–frequency length. This composite input preserves spatial and spectral integrity and serves as input to the CNN-RF diagnostic architecture. As illustrated in Figure 6, the proposed architecture combines deep convolutional feature extraction from fused vibration and acoustic signals with an RF classifier, enabling accurate fault classification through ensemble learning.
To further improve model robustness and prevent overfitting, a series of data augmentation techniques are applied prior to training. In the time domain, Gaussian noise injection is used to simulate sensor-level variability and measurement uncertainty. Random time shifts and scaling transformations are employed to account for operational fluctuations and misalignment in signal timing. In the spectral domain, mixup augmentation blends two samples from different classes in a convex combination, which improves decision boundary smoothness and reduces class overfitting. These augmentations are applied uniformly across both vibration and acoustic channels to preserve alignment in the fused input.
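A minimal NumPy sketch of the channel-wise fusion and of the augmentation steps is given below. The array sizes, the SNR level, the maximum shift, and the mixup parameter are illustrative assumptions, and both modalities are assumed to have already been resampled to a common length T. Note that this simple mixup pairs samples by random permutation, so a pair may occasionally come from the same class.

```python
import numpy as np

def fuse(vib, ac):
    """Concatenate (N, C_v, T) and (N, C_a, T) feature maps along the channel axis."""
    assert vib.shape[0] == ac.shape[0] and vib.shape[2] == ac.shape[2]
    return np.concatenate([vib, ac], axis=1)

def add_gaussian_noise(x, snr_db, rng):
    """Add white Gaussian noise at a target signal-to-noise ratio per sample."""
    power = np.mean(x**2, axis=(1, 2), keepdims=True)
    noise_power = power / 10.0 ** (snr_db / 10.0)
    return x + rng.normal(size=x.shape) * np.sqrt(noise_power)

def random_time_shift(x, max_shift, rng):
    """Circularly shift all channels by the same random offset to keep them aligned."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(x, shift, axis=-1)

def mixup(x, y_onehot, alpha, rng):
    """Convex combination of randomly paired samples and their one-hot labels."""
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))
    return lam * x + (1 - lam) * x[perm], lam * y_onehot + (1 - lam) * y_onehot[perm]

rng = np.random.default_rng(0)
vib = rng.normal(size=(8, 129, 64))   # hypothetical S-Transform maps
ac = rng.normal(size=(8, 13, 64))     # hypothetical MFCC maps
labels = np.eye(5)[rng.integers(0, 5, size=8)]

fused = fuse(vib, ac)
augmented = random_time_shift(add_gaussian_noise(fused, snr_db=20.0, rng=rng), 4, rng)
x_mix, y_mix = mixup(augmented, labels, alpha=0.2, rng=rng)
print(fused.shape, x_mix.shape, y_mix.sum(axis=1))
```

Because noise, shifts, and mixup are applied to the fused tensor, the vibration and acoustic channels receive identical transformations and remain aligned.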
Moreover, statistical techniques such as Principal Component Analysis (PCA) are used during exploratory analysis to verify the separability and redundancy of fused features, ensuring that the multimodal representation retains high discriminative power while maintaining computational efficiency. The final fused and augmented dataset exhibits improved coverage of operational conditions, variability in fault manifestations, and resilience to noise, thereby enhancing the performance and generalization capability of the hybrid CNN-RF model in fault classification tasks.
3.4. Experimental Validation
A fault test platform for the head sheave assembly was designed and constructed, as shown in Figure 7. By simulating failures in the wheel body, shaft, and bearings, the feasibility of the monitoring system was validated.
Vibration sensors were mounted on the bearing housing surface using magnetic bases, while the acoustic sensor was positioned adjacent to the bearing housing. The vibration and acoustic signals were first conditioned by a signal conditioning circuit and then transmitted to the data acquisition card for sampling. The sampled data were subsequently transferred to a computer via a serial interface for storage and analysis. To reproduce the fault states and representative working conditions of the hoisting head sheave system in a controlled and repeatable manner, this study employs a scaled-down experimental platform consisting of a wheel body, rotating shaft, and a 6204 deep-groove ball bearing. The 6204 bearing was selected to facilitate controlled fault reproduction and repeatable data collection on the test rig, rather than to match the full-scale bearing size of an industrial head sheave. The main parameters of the 6204 bearing are listed in Table 1.
Four types of faults were artificially induced in the test specimen: wheel body failure, shaft failure, outer bearing ring failure, and inner bearing ring failure, as shown in Figure 8.
During the experiment, the rotational speed of the sheave assembly alone suffices to determine the fault characteristic frequencies of the sheave body and shaft. The fault characteristic frequencies of the bearing inner and outer rings are then obtained by substituting the bearing parameters and rotational frequency into Equations (9) and (10). These characteristic frequencies depend on the bearing’s geometric parameters and operating speed [43], with the calculation relationship as follows:

Bearing inner ring fault frequency $f_i$:

$$f_i = \frac{Z}{2}\left(1 + \frac{d}{D}\cos\alpha\right)\frac{n}{60} \tag{9}$$

Bearing outer ring fault frequency $f_o$:

$$f_o = \frac{Z}{2}\left(1 - \frac{d}{D}\cos\alpha\right)\frac{n}{60} \tag{10}$$

where $n$ is the rotational speed of the bearing inner ring, in RPM; $Z$ is the number of rolling elements; $d$ is the diameter of the rolling element, in mm; $D$ is the pitch circle diameter, in mm; and $\alpha$ is the contact angle.
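As a worked illustration, the standard ball-pass frequency relations for a bearing with a rotating inner ring and stationary outer ring can be evaluated as below. The geometry passed to the function (8 balls, 7.94 mm ball diameter, 33.5 mm pitch diameter, 0° contact angle) consists of commonly quoted nominal values for a 6204 bearing and is an assumption for this sketch; it is not taken from Table 1. The 600 RPM speed is the motor speed used in the experimental procedure.

```python
import math

def bearing_fault_frequencies(rpm, n_balls, ball_d_mm, pitch_d_mm, contact_deg=0.0):
    """Return (shaft frequency, inner-ring fault frequency, outer-ring fault frequency) in Hz."""
    f_r = rpm / 60.0
    ratio = (ball_d_mm / pitch_d_mm) * math.cos(math.radians(contact_deg))
    f_inner = 0.5 * n_balls * (1.0 + ratio) * f_r
    f_outer = 0.5 * n_balls * (1.0 - ratio) * f_r
    return f_r, f_inner, f_outer

# Assumed nominal 6204 geometry (illustrative, not from Table 1).
f_r, f_inner, f_outer = bearing_fault_frequencies(
    rpm=600, n_balls=8, ball_d_mm=7.94, pitch_d_mm=33.5
)
print(f"shaft: {f_r:.2f} Hz, inner ring: {f_inner:.2f} Hz, outer ring: {f_outer:.2f} Hz")
```

A useful sanity check is that the inner and outer ring frequencies always sum to the number of rolling elements times the shaft frequency.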
Subsequently, comparing the extracted characteristic frequencies from signal analysis with the theoretically calculated values enables precise diagnosis of the specific component failure within the sheave assembly. After completing the assembly of the sheave assembly failure test platform, failure simulation experiments for the sheave assembly can be conducted. The experimental procedure is as follows: (1) After assembling components under normal operating conditions on the test platform, set the sampling rates for vibration signals and acoustic signals to 10,000 Hz and 22,050 Hz, respectively. Adjust the motor speed to 600 RPM. Once the motor speed stabilizes, collect the vibration and acoustic data at this point. (2) Remove the components from normal operating conditions and sequentially replace them with the faulty wheel body, faulty shaft, faulty outer bearing ring, and faulty inner bearing ring. Repeat the experiment following the same procedure. (3) Analyze and process the vibration and sound signals from the normal operating condition and the conditions with wheel body failure, shaft failure, outer bearing ring failure, and inner bearing ring failure.
3.4.1. Modal Excitation Tests
To verify the accuracy and physical relevance of the finite element model and to characterize the dynamic behavior of the head sheave structure under realistic boundary conditions, a series of modal excitation tests were conducted. These experimental modal tests served two key purposes: (1) to extract the natural frequencies and mode shapes of the physical structure for comparison with simulation results, and (2) to provide a vibration baseline for distinguishing structural degradation due to faults.
The tests were carried out on a full-scale head sheave assembly mounted on a steel support frame designed to replicate in situ constraints. A modal hammer excitation method was employed using an instrumented force hammer with a built-in piezoelectric load cell to apply controlled broadband impacts to the wheel rim, web, and bearing seat locations. Corresponding response signals were recorded through triaxial accelerometers strategically placed on the sheave body, especially near regions of high modal curvature identified in simulation. The excitation and response signals were sampled at 10 kHz using a multi-channel data acquisition system with time-synchronized triggering to ensure phase coherence.
Frequency response functions (FRFs) were computed using a Hanning window and an average of multiple impact repetitions to improve spectral clarity. The resulting FRFs were processed to extract modal parameters including natural frequencies, damping ratios, and normalized mode shapes via a peak-picking method and confirmed using curve fitting techniques.
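An averaged FRF estimate with Hann windowing and simple peak picking can be sketched with SciPy as follows. This sketch uses the H1 estimator, which is an assumption since the text does not name the estimator, and it substitutes a synthetic single-resonance system driven by random excitation for the hammer-impact measurements; the 800 Hz resonance is a hypothetical value.

```python
import numpy as np
from scipy import signal

def h1_frf(force, response, fs, nperseg=2048):
    """H1 estimator: averaged cross-spectrum divided by averaged input auto-spectrum."""
    f, p_ff = signal.welch(force, fs=fs, window="hann", nperseg=nperseg)
    _, p_fx = signal.csd(force, response, fs=fs, window="hann", nperseg=nperseg)
    return f, p_fx / p_ff

def pick_peaks(f, frf, n_peaks=5):
    """Return the frequencies of the largest prominent FRF magnitude peaks."""
    mag = np.abs(frf)
    idx, _ = signal.find_peaks(mag, prominence=0.1 * mag.max())
    top = idx[np.argsort(mag[idx])[::-1][:n_peaks]]
    return np.sort(f[top])

# Synthetic single-mode system: two-pole digital resonator near 800 Hz.
fs = 10_000
rng = np.random.default_rng(0)
force = rng.normal(size=100_000)
f0, r = 800.0, 0.99
theta = 2 * np.pi * f0 / fs
response = signal.lfilter([1.0], [1.0, -2 * r * np.cos(theta), r**2], force)

freqs, frf = h1_frf(force, response, fs)
peaks = pick_peaks(freqs, frf)
print(peaks)
```

With measured data, the picked peaks would be compared against the finite element modal frequencies, and damping would be refined by curve fitting around each peak.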
Comparison with the simulation results from the ANSYS modal analysis revealed a strong agreement. The first five modal frequencies deviated by less than 3.5% between experiment and simulation, validating the fidelity of the finite element model. Notably, Mode 2 (first lateral bending) and Mode 4 (out-of-plane torsional mode) showed the most sensitivity to faults in the bearing and web regions, consistent with the location of induced cracks and spalls. This alignment between physical and simulated modal behavior substantiates the use of simulated data in generating training inputs for the fault diagnosis model.
Furthermore, these modal tests provide empirical evidence of frequency shifts and damping changes that correlate with defect progression. As such, the modal excitation results not only validate the structural model but also reinforce the suitability of vibration-based fault indicators extracted in earlier stages of the diagnostic pipeline.
3.4.2. Fault Simulation Platform
To generate realistic and labeled fault data under controlled conditions, a dedicated fault simulation platform was developed to replicate the operational dynamics of a mine hoist head sheave device. This platform enables the emulation of common mechanical failures, such as bearing spalling, inner race cracking, and shaft misalignment, under varying load and rotational speed conditions. The testbed was designed to reflect the structural and boundary characteristics observed in actual mining hoisting systems, ensuring high mechanical fidelity and relevance for model training and validation.
The simulation platform consists of a variable-speed motor-driven shaft assembly connected to a fabricated head sheave structure mounted on adjustable support bearings. The shaft and sheave are coupled via a flexible coupler to introduce torsional effects. The bearing housings are instrumented with triaxial accelerometers and directional microphones to capture vibration and acoustic signals, respectively. The data acquisition system is configured for synchronized multi-channel sampling at a rate of 10 kHz, with analog signal conditioning to maintain signal integrity. To simulate distinct fault types, controlled damage was introduced into the bearing and shaft assemblies: (a) Outer ring pitting was artificially created via electric discharge machining; (b) inner race cracks were initiated using notch fatigue under cyclic loading; and (c) shaft eccentricity was generated by intentional misalignment during installation.
Each fault condition was validated visually and with NDT (non-destructive testing) prior to data collection. The system was operated under variable speeds (150–450 RPM) and load conditions to simulate real-world variability. For each condition, multiple operational runs were recorded to capture repeatable and stable signal characteristics. Each recording was time-stamped and manually labeled according to the fault type and severity for supervised learning purposes.
The platform not only facilitated the generation of high-quality datasets for training the CNN-RF diagnostic model but also allowed for controlled experimentation on the influence of fault location, load level, and rotational speed on the signal features. By combining fault emulation with synchronized multi-sensor acquisition, the platform serves as a reliable foundation for model development, performance benchmarking, and comparative evaluation across diagnostic architectures.
3.4.3. Validation Metrics
To quantitatively evaluate the diagnostic performance of the proposed CNN-RF fault classification model, a comprehensive set of statistical validation metrics was employed. These metrics are designed to assess not only the overall accuracy of the model but also its sensitivity to minority class detection, its robustness to class imbalance, and its reliability in practical deployment scenarios. The following key metrics were used:
Accuracy (ACC): Measures the proportion of correctly classified instances over the total number of samples. It provides a general indicator of model effectiveness but may be less informative in imbalanced datasets.
Precision (P): Defined as the ratio of true positives (TPs) to the sum of true positives and false positives (FPs). Precision reflects the model’s ability to avoid false alarms, particularly relevant in safety-critical applications such as mine hoist monitoring.
Recall (R) or Sensitivity: The ratio of true positives to the sum of true positives and false negatives (FNs). It evaluates the model’s capacity to detect actual fault conditions, which is critical for early-stage fault identification.
F1 Score: The harmonic mean of precision and recall. F1 balances the trade-off between false alarms and missed detections, and is particularly valuable in multi-class fault scenarios where class distributions vary.
Confusion Matrix: A multi-class matrix summarizing the true and predicted labels for each fault type, allowing visual assessment of misclassification patterns, class confusion, and dominant fault detection pathways.
Receiver Operating Characteristic (ROC) Curve and AUC (Area Under Curve): While primarily applicable to binary classification tasks, ROC-AUC values were computed on a one-vs-rest basis for each fault class to evaluate discrimination capability across decision thresholds.
Training and Validation Loss Curves: Tracked during model training to assess convergence behavior, generalization gap, and overfitting tendencies.
All metrics were computed over stratified cross-validation folds to ensure statistical stability. The CNN-RF model achieved a mean classification accuracy exceeding 96%, with F1 scores above 0.93 for all critical fault classes. These results confirm the model’s effectiveness in identifying both localized and distributed fault modes under varying operating conditions, validating the overall robustness and reliability of the proposed diagnostic system.
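The metrics listed above can be computed over stratified folds with scikit-learn, as in the following sketch. The data are synthetic (generated with `make_classification`) and a plain Random Forest stands in for the CNN-RF model, so the printed numbers are unrelated to the results reported in this study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support, roc_auc_score)
from sklearn.model_selection import StratifiedKFold

# Synthetic five-class data standing in for the fused features.
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_classes=5, n_clusters_per_class=1, random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
acc, prec, rec, f1, auc = [], [], [], [], []
cm_total = np.zeros((5, 5), dtype=int)

for train_idx, test_idx in skf.split(X, y):
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[test_idx])
    proba = clf.predict_proba(X[test_idx])

    acc.append(accuracy_score(y[test_idx], pred))
    p, r, f, _ = precision_recall_fscore_support(
        y[test_idx], pred, average="macro", zero_division=0)
    prec.append(p)
    rec.append(r)
    f1.append(f)
    # One-vs-rest ROC-AUC averaged over the five classes.
    auc.append(roc_auc_score(y[test_idx], proba, multi_class="ovr"))
    cm_total += confusion_matrix(y[test_idx], pred, labels=np.arange(5))

print(f"accuracy {np.mean(acc):.3f}, precision {np.mean(prec):.3f}, "
      f"recall {np.mean(rec):.3f}, F1 {np.mean(f1):.3f}, AUC {np.mean(auc):.3f}")
print(cm_total)
```

Summing the per-fold confusion matrices gives one matrix covering every sample exactly once, which is convenient for inspecting class-specific confusion.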
3.5. Fault Diagnosis: Multi-Source Fusion CNN-RF Model
To achieve high-accuracy fault identification under complex working conditions, a hybrid fault diagnosis framework based on a Convolutional Neural Network and Random Forest (CNN-RF) is developed, leveraging multi-source fused features from both vibration and acoustic modalities. This architecture combines the deep feature extraction capabilities of CNNs with the classification stability and interpretability of ensemble learning through RFs, thereby addressing both the nonlinear complexity of signal features and the demand for reliable decision-making in critical safety systems.
The input to the model consists of fused two-dimensional representations: vibration scalograms derived from the S-Transform and acoustic MFCC feature maps. These are concatenated along the channel dimension after normalization and temporal alignment, producing a unified feature tensor that preserves both spectral and temporal information. The CNN component comprises multiple convolutional layers with ReLU activation, interleaved with MaxPooling layers to progressively extract spatial hierarchies and reduce dimensionality. Batch normalization is applied to accelerate convergence and improve generalization. The final convolutional output is flattened and passed through fully connected layers to generate high-level latent features.
Unlike conventional end-to-end CNN classification models, the proposed framework decouples feature extraction and classification. The deep features produced by the CNN are fed into a RF classifier, trained separately to improve robustness against overfitting and enhance fault class separability. This ensemble classifier constructs multiple decision trees based on random feature subsets and aggregates their outputs through majority voting, thereby mitigating noise sensitivity and capturing nonlinear class boundaries effectively.
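The decoupled two-stage idea can be sketched with Keras and scikit-learn as follows. This is a minimal illustration under stated assumptions, not the architecture used in this study: the layer sizes, the 64-dimensional latent layer, the 32 × 32 two-channel input, the single training epoch, and the random data are all hypothetical.

```python
import numpy as np
import tensorflow as tf
from sklearn.ensemble import RandomForestClassifier

def build_cnn(input_shape, n_classes, latent_dim=64):
    """Return (classifier, extractor) sharing the same convolutional layers."""
    inputs = tf.keras.Input(shape=input_shape)
    x = inputs
    for filters in (16, 32):
        x = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.MaxPooling2D(2)(x)
    x = tf.keras.layers.Flatten()(x)
    latent = tf.keras.layers.Dense(latent_dim, activation="relu", name="latent")(x)
    outputs = tf.keras.layers.Dense(n_classes, activation="softmax")(latent)
    return tf.keras.Model(inputs, outputs), tf.keras.Model(inputs, latent)

# Random stand-in for fused inputs: one vibration and one acoustic channel.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 32, 32, 2)).astype("float32")
y = rng.integers(0, 5, size=60)

# Stage 1: train the CNN with a temporary softmax head to learn the features.
classifier, extractor = build_cnn(X.shape[1:], n_classes=5)
classifier.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
classifier.fit(X, y, epochs=1, batch_size=16, verbose=0)

# Stage 2: freeze the CNN, extract latent features, and train the Random Forest.
features = extractor.predict(X, verbose=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(features, y)
predictions = rf.predict(features)
print(features.shape, predictions.shape)
```

Because the Random Forest only consumes the latent features, it can be retrained on new fault classes without retraining the convolutional layers.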
To interpret the contribution of different feature types and ensure transparency, the model integrates SHAP (SHapley Additive exPlanations) values to quantify feature importance at the output stage. The SHAP analysis confirms that fault-relevant frequency bands in both vibration and acoustic domains are among the top contributors to classification decisions, validating the effectiveness of multi-source fusion. Recent industrial-process studies have shown that SHAP can effectively reveal physically meaningful relationships in black-box models, improving transparency and supporting deployment in safety-critical engineering applications [44, 45].
As illustrated in Figure 9, the architecture adopts a dual-channel feature extraction pathway in which vibration and acoustic signals are independently processed via parallel CNN streams, followed by a unified ensemble classification stage.
The diagram highlights the decoupling of feature learning and decision-making processes, as well as the modularity and scalability of the proposed framework. This separation enables easy retraining or adaptation to new sensor modalities or fault types with minimal architectural changes. The CNN-RF model achieves superior performance compared to baseline classifiers, with classification accuracy exceeding 96% across five distinct fault categories. This model demonstrates strong performance on the tested dataset and provides interpretable decision support via SHAP analysis. This study focuses on validating an integrated, deployment-oriented diagnostic pipeline under the tested operating condition; cross-speed and cross-load generalization will be investigated in future work using held-out operating conditions. Moreover, the modular nature of the model allows flexible retraining or adaptation when new fault types or sensors are introduced, making it suitable for deployment in real-time monitoring systems for hoisting equipment in mining applications.
3.5.1. Data Preparation
Vibration and acoustic signals were acquired from the head sheave device fault test platform under five operating conditions: normal state, wheel body fault, shaft fault, outer ring fault, and inner ring fault. The vibration signal was sampled at 10,000 Hz, and the acoustic signal was sampled at 22,050 Hz. For both modalities, training samples are generated from each continuous recording using a sliding-window segmentation (window length 2048 points, step size 512 points, i.e., 75% overlap), corresponding to a segment duration of approximately 0.205 s (vibration, 10 kHz) and 0.093 s (acoustic, 22.05 kHz). For each operating condition, 1200 samples were collected, resulting in a total of 6000 vibration samples and 6000 acoustic samples across the five categories. The dataset was then divided into a training set (70%) and a test set (30%).
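The segmentation arithmetic can be illustrated with a minimal sketch. The function name `segment_signal` and the synthetic 3 s noise recording are assumptions made for illustration only; the window length, step size, and sampling rates are those stated above.

```python
import numpy as np

def segment_signal(signal, window=2048, step=512):
    """Split a 1-D recording into overlapping windows (one window per row)."""
    signal = np.asarray(signal)
    if signal.size < window:
        return np.empty((0, window), dtype=signal.dtype)
    n_segments = (signal.size - window) // step + 1
    starts = np.arange(n_segments) * step
    # Broadcasting builds an (n_segments, window) index grid; the result is a copy.
    return signal[starts[:, None] + np.arange(window)]

# Segment duration and overlap implied by the stated window, step, and sampling rates.
overlap = 1 - 512 / 2048
durations = {name: 2048 / fs for name, fs in (("vibration", 10_000), ("acoustic", 22_050))}
for name, seconds in durations.items():
    print(f"{name}: {seconds:.3f} s per segment, {overlap:.0%} overlap")

# Hypothetical 3 s vibration recording; random noise stands in for measured data.
fs_vib = 10_000
recording = np.random.default_rng(0).standard_normal(3 * fs_vib)
segments = segment_signal(recording)
print(segments.shape)
```

Because consecutive windows share 1536 of their 2048 points, neighbouring segments are strongly correlated, which is why the split between training and test data has to respect recording boundaries.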
It should be noted that the supervised learning experiments reported in this study (i.e., CNN-RF training and testing) are conducted only using the vibration and acoustic data collected from the physical fault test platform. The simulation results generated via ANSYS and ADAMS are used exclusively for fault mechanism analysis and physical verification (e.g., validating dynamic responses and characteristic signatures against theoretical and experimental observations), and are not included as training samples for the CNN-RF model.
Table 2 details the experimental dataset information, with samples labeled from 1 to 5 according to their respective operating conditions.
Figure 10 presents the generalized S-Transform of vibration and MFCC heatmaps of sound for the five operating conditions on the sheave assembly failure test platform.
For each operating condition, the vibration/acoustic signals were recorded as continuous time series (10 s per recording, repeated 5 times). Training samples were generated using a sliding-window segmentation strategy with window length 2048 points and step size 512 points (75% overlap). Therefore, the model inputs are overlapped segments extracted from continuous recordings rather than independent 1 s acquisitions. The segment duration is determined by the sampling rate and window length: 2048 points corresponds to approximately 0.205 s for the vibration signal (10 kHz) and 0.093 s for the acoustic signal (22.05 kHz).
Because training samples are generated from continuous recordings using overlapping windows, we perform train/test splitting strictly at the recording-run level (group-wise). All segments extracted from the same recording run/session are assigned exclusively to either training/validation or testing, and no run contributes samples to both sets. We repeat the run-wise evaluation N = 5 times using different run-level partitions and report results as mean ± standard deviation across repeats.
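A run-level split of this kind can be sketched as follows. The helper `run_level_split`, its `n_test_runs` parameter, and the synthetic bookkeeping arrays (20 segments per run) are hypothetical and are not taken from the study; holding out one run per class is an assumption chosen so that every class appears in the test set.

```python
import numpy as np

def run_level_split(labels, run_ids, n_test_runs=1, seed=0):
    """Hold out whole recording runs for each class; return (train_idx, test_idx)."""
    labels = np.asarray(labels)
    run_ids = np.asarray(run_ids)
    rng = np.random.default_rng(seed)
    test_mask = np.zeros(labels.size, dtype=bool)
    for c in np.unique(labels):
        runs = np.unique(run_ids[labels == c])
        held_out = rng.choice(runs, size=n_test_runs, replace=False)
        # Every segment of a held-out run goes to the test set.
        test_mask |= (labels == c) & np.isin(run_ids, held_out)
    return np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Hypothetical bookkeeping: 5 classes x 5 runs x 20 segments per run.
labels = np.repeat(np.arange(5), 5 * 20)
run_ids = np.tile(np.repeat(np.arange(5), 20), 5)

# Five repeats with different run-level partitions.
splits = [run_level_split(labels, run_ids, seed=repeat) for repeat in range(5)]
for repeat, (train_idx, test_idx) in enumerate(splits):
    print(f"repeat {repeat}: {train_idx.size} train segments, {test_idx.size} test segments")
```

In a full pipeline, the CNN-RF model would be trained and scored once per repeat, and the mean and standard deviation of the scores would then be reported. scikit-learn's group-aware splitters, such as `GroupShuffleSplit`, offer comparable group-wise partitioning.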
3.5.2. CNN Feature Extractor
In the proposed CNN-RF framework, the Convolutional Neural Network (CNN) component serves as a deep feature extractor, automatically learning hierarchical representations from complex time–frequency inputs derived from vibration and acoustic signals. This component is essential for capturing both local and global spatial patterns in the input data, which are often difficult to define through manual feature engineering, especially in the presence of non-stationary and nonlinear characteristics commonly found in mechanical fault signals.
Figure 9 illustrates the proposed multi-source fusion fault diagnosis model, with specific parameter settings shown in Table 3.
The CNN architecture is designed to process two-dimensional input matrices: S-Transform time–frequency maps from vibration signals and MFCC feature maps from acoustic signals. Each input passes through a dedicated CNN branch composed of multiple convolution layers, each followed by a nonlinear activation function (ReLU), batch normalization, and a MaxPooling layer. The convolution layers apply multiple filters (kernels) to detect edges, ridges, and fault-relevant textures in both the time and frequency directions, while the MaxPooling layers progressively reduce the spatial dimensions, which improves computational efficiency and provides a degree of translation invariance.
The CNN structure used in this study includes three convolutional-pooling blocks, followed by a flattening layer that transforms the final feature maps into a one-dimensional vector. This vector is passed through fully connected (dense) layers that perform high-level feature abstraction, enabling the network to learn discriminative representations that are robust to noise, load variation, and signal amplitude scaling. Dropout regularization is applied during training to prevent overfitting and enhance model generalization.
To integrate the two sensor modalities, the feature vectors from the vibration and acoustic branches are concatenated after their respective CNN paths, forming a unified feature embedding. This fused vector captures both structural and acoustic behavior of the monitored system, providing a comprehensive basis for fault diagnosis.
The CNN feature extractor is trained using cross-entropy loss and optimized with the Adam optimizer, employing a mini-batch gradient descent approach with learning rate scheduling and early stopping based on validation loss. The training process is guided by augmented data and evaluated through accuracy, precision, and F1-score metrics to ensure robust feature learning.
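The dual-branch extractor and its training configuration might be sketched in Keras as follows. All numeric settings here (input shapes, filter counts, embedding size, dropout rate, learning rate, and callback patience values) are illustrative assumptions; they are not the values of Table 3 or Table 4. The names `conv_branch` and `build_dual_branch_cnn` are likewise hypothetical.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_branch(inputs, prefix, filters=(16, 32, 64), embed_dim=64, drop=0.3):
    """Three conv-pool blocks, then flatten, dense abstraction, and dropout."""
    x = inputs
    for i, n in enumerate(filters, start=1):
        x = layers.Conv2D(n, 3, padding="same", activation="relu",
                          name=f"{prefix}_conv{i}")(x)
        x = layers.BatchNormalization(name=f"{prefix}_bn{i}")(x)
        x = layers.MaxPooling2D(2, name=f"{prefix}_pool{i}")(x)
    x = layers.Flatten(name=f"{prefix}_flatten")(x)
    x = layers.Dense(embed_dim, activation="relu", name=f"{prefix}_dense")(x)
    return layers.Dropout(drop, name=f"{prefix}_dropout")(x)

def build_dual_branch_cnn(vib_shape=(64, 64, 1), aco_shape=(40, 64, 1), n_classes=5):
    """Return (classifier, extractor); both models share the same weights."""
    vib_in = layers.Input(shape=vib_shape, name="vibration_tf_map")
    aco_in = layers.Input(shape=aco_shape, name="acoustic_mfcc")
    # Concatenate the two branch embeddings into one fused feature vector.
    fused = layers.Concatenate(name="fused_features")(
        [conv_branch(vib_in, "vib"), conv_branch(aco_in, "aco")])
    probs = layers.Dense(n_classes, activation="softmax", name="softmax_head")(fused)
    classifier = Model([vib_in, aco_in], probs, name="dual_branch_cnn")
    extractor = Model([vib_in, aco_in], fused, name="feature_extractor")
    return classifier, extractor

classifier, extractor = build_dual_branch_cnn()
classifier.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                   loss="sparse_categorical_crossentropy",  # integer class labels
                   metrics=["accuracy"])

# Learning-rate scheduling and early stopping, both driven by validation loss.
callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5),
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True),
]
```

After `classifier.fit(..., callbacks=callbacks)` has converged, `extractor.predict` would return the fused embeddings for a downstream classifier; the softmax head serves only to supervise feature learning in this sketch.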
3.5.3. Model Integration and Evaluation
Upon completion of the deep feature extraction via parallel CNN branches for vibration and acoustic signals, the resulting high-dimensional feature vectors are concatenated to form a unified representation. This fused feature vector embodies both time–frequency structural patterns and high-frequency acoustic characteristics, providing a comprehensive descriptor of the head sheave system’s operational condition.
The integrated feature set is subsequently fed into an RF classifier, which serves as the decision-making component of the hybrid CNN-RF architecture. RF, an ensemble learning method, constructs multiple decision trees using bootstrapped subsets of the training data and a randomized selection of features at each split. The final classification decision is determined via majority voting, which enhances model stability and reduces sensitivity to noise and overfitting. This model integration approach effectively decouples feature learning from classification, improving interpretability and allowing for independent tuning of each component.
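The classification stage can be illustrated with scikit-learn. In this sketch, synthetic Gaussian clusters stand in for the fused CNN embeddings, and the forest settings (200 trees, square-root feature sampling) are assumptions for illustration, not the configuration used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_classes, n_per_class, n_features = 5, 60, 128

# Synthetic stand-in for fused CNN embeddings: one Gaussian cluster per class.
centers = rng.normal(scale=3.0, size=(n_classes, n_features))
labels = np.repeat(np.arange(n_classes), n_per_class)
features = centers[labels] + rng.normal(size=(labels.size, n_features))

# Simple interleaved hold-out, used only for this toy example.
test_mask = np.arange(labels.size) % 4 == 0

# Bootstrapped trees with a random feature subset considered at each split.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            bootstrap=True, random_state=0)
rf.fit(features[~test_mask], labels[~test_mask])

predictions = rf.predict(features[test_mask])
accuracy = float(np.mean(predictions == labels[test_mask]))
print(f"toy test accuracy: {accuracy:.3f}")
```

Note that scikit-learn's `RandomForestClassifier` aggregates trees by averaging their predicted class probabilities rather than by a hard majority vote; the two rules usually agree, but the distinction matters when reproducing results with a different library.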
During training, the CNN is optimized using labeled samples via backpropagation and cross-entropy loss minimization, while the RF is trained using the CNN-extracted features with predefined fault class labels. For reproducibility, the complete hyperparameter configuration for CNN training (optimizer, learning rate, batch size, epochs, validation split, early stopping, and regularization) is summarized in Table 4.
Table 4 also reports the RF settings used in the final model (e.g., number of trees, maximum depth, splitting criterion, and random seed). The entire training process is monitored through validation metrics, including loss convergence curves and confusion matrix analysis, to ensure model robustness and prevent overfitting.
Model evaluation is performed on a held-out test dataset comprising samples not seen during the training or validation phases. As discussed earlier, classification performance is assessed using accuracy, precision, recall, F1-score, and a confusion matrix. The hybrid model demonstrates excellent diagnostic performance across the five categories, achieving an overall classification accuracy above 96%, with high recall and F1-scores for each individual fault type. To enhance explainability and validate the decision logic, SHAP values are computed for the RF output. These values quantify the contribution of individual features (e.g., frequency components, MFCC bands) to the model’s predictions, providing insights into the discriminative power of specific vibration and acoustic attributes.
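The evaluation metrics listed above follow directly from the confusion matrix. The sketch below uses a small made-up three-class example purely to show the arithmetic; the labels are not results from the study, and the helper names are hypothetical.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index the true class, columns the predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def per_class_metrics(cm):
    """Return per-class precision, recall, and F1 (0 where undefined)."""
    tp = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0)  # column sums: samples assigned to each class
    actual = cm.sum(axis=1)     # row sums: samples that belong to each class
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return precision, recall, f1

# Made-up labels for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred, n_classes=3)
accuracy = np.trace(cm) / cm.sum()
precision, recall, f1 = per_class_metrics(cm)
print(cm)
print(f"accuracy={accuracy:.3f}, macro-F1={f1.mean():.3f}")
```

Reporting the per-class values alongside overall accuracy shows whether a high aggregate score hides a weak class, for example confusion between the inner ring and outer ring bearing faults.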
The final integrated model is accurate on the tested dataset, interpretable, and scalable; its robustness to variations in operational conditions remains to be confirmed through the cross-speed and cross-load evaluation identified above as future work. Its modular structure enables easy retraining or fine-tuning when extended to new fault types or sensor configurations, making it well-suited for real-world deployment in condition monitoring systems of mining hoisting equipment.