Lightweight Fault Diagnosis of Port Crane Bearings Based on Multi-Source Feature Fusion Network and Structured Pruning

Yang, Yongsheng; Chen, Zehui; Wang, Heng

doi:10.3390/act15060322

by Yongsheng Yang *, Zehui Chen and Heng Wang

Institute of Logistics Science and Engineering, Shanghai Maritime University, Shanghai 201306, China

* Author to whom correspondence should be addressed.

Actuators 2026, 15(6), 322; https://doi.org/10.3390/act15060322

Submission received: 17 April 2026 / Revised: 28 May 2026 / Accepted: 4 June 2026 / Published: 6 June 2026

(This article belongs to the Special Issue Fault Diagnosis and Prognosis in Actuators)

Abstract

The operational health state of motor bearings is critical to the operational safety of harbor portal slewing cranes. However, in harsh industrial environments with strong noise and time-varying rotational speeds, existing bearing fault diagnosis methods still suffer from the problems of incomplete fault feature extraction from single-sensor signals and the excessively large size of multi-source fusion models, which makes them unable to adapt to edge deployment. To address these issues, this paper proposes a Multi-source Feature Fusion Lightweight Network (MTFL-Net) integrated with targeted structured channel pruning. First, vibration and current signals are preprocessed via differentiated time-frequency transformation and converted into 2D time-frequency images, to fully preserve transient impact and spectral fault features. Second, a multi-branch feature extraction architecture embedded with residual connections, multi-scale convolution and channel attention gating is designed, to alleviate feature degradation and adaptively enhance fault-sensitive features. Third, targeted structured channel pruning is performed on the feature extraction branches, to remove redundant channels while retaining the multi-source fusion logic and core feature extraction structure. Experiments on two public bearing datasets show that the original model achieves 99% diagnostic accuracy, and the pruned model still maintains an accuracy of 95%. The results demonstrate that MTFL-Net can significantly reduce model size and computational cost while retaining high diagnostic precision.

Keywords:

portal crane; bearing fault diagnosis; multi-source fusion; lightweight network; structured channel pruning

1. Introduction

Port portal slewing cranes serve as the core lifting equipment in modern logistics hubs. As critical components of the crane-driving motors, rolling bearings are essential to the stable operation of the entire crane. The drive motors of port portal slewing cranes adopt deep groove ball bearings as core supporting components. Yet these bearings usually operate continuously under harsh conditions characterized by heavy shock loads, strong electromagnetic interference and high humidity, making them highly prone to diverse failures including wear, spalling, cracking and pitting [1]. The initiation and propagation of bearing faults give rise to abnormal noise during equipment operation. In severe scenarios, such failures can trigger industrial safety accidents and incur substantial economic losses. Therefore, achieving accurate and rapid diagnosis of early bearing faults is a key technical link to improve the operational reliability of harbor cranes, reduce operation and maintenance costs, and ensure production safety. However, the weak impact features of early bearing faults are easily submerged by the strong industrial noise and vibration responses of healthy components, making it difficult to extract effective fault information. Qiao et al. [2] proposed a digital twin-guided physical–virtual denoising method, which combines high-fidelity dynamic simulation models with Wasserstein generative adversarial networks to accurately extract early fault features under strong noise background, providing a new approach for early fault detection of rolling bearings.

Traditional bearing fault diagnosis methods rely heavily on manual experience and prior knowledge of signal processing. Technicians need to manually extract fault features via time-domain statistical indicators, frequency-domain spectrum analysis, wavelet transform and other approaches. However, in signal scenarios with strong noise, non-stationarity and variable working conditions, the effectiveness of these extracted features degrades significantly, and diagnostic results are easily influenced by human subjectivity. With the continuous advancement of machine learning, methods such as Random Forest (RF) [3], Support Vector Machine (SVM) [4] and Extreme Learning Machine (ELM) [5] have been gradually applied to bearing fault identification. Although these methods have improved the automation level of diagnosis to a certain extent, they still suffer from drawbacks including time-consuming feature extraction, weak generalization ability and poor dataset adaptability, which can hardly meet the practical demands of real-time performance and precision for automated operation and maintenance in modern ports. In particular, under the harsh port environment with small labeled samples and strong background noise, traditional methods face severe performance degradation. To address this issue, Zhang et al. [6] proposed the DPCCNN model specifically for small-sample and high-noise environments. By integrating dilated hierarchical interactive convolution with a global aggregation module, the model achieves superior noise robustness and few-shot generalization capabilities with a lightweight architecture, providing a novel solution for fault diagnosis in complex industrial settings.

To address limitations of traditional machine learning, deep learning techniques have gained significant attention for bearing fault diagnosis. Through neural network-based end-to-end recognition, these methods adaptively extract fault-related features directly from raw data. At present, most deep learning diagnosis methods take a single vibration signal as input. Wang et al. [7] optimized feature extraction capability by designing a Spatial Reduction Window Attention module, and achieved accurate diagnosis of bearing faults under variable operating conditions and noise interference using an Enhanced Hierarchical Vision Transformer. Fan et al. [8] adopted the Local Maximum Synchrosqueezing Wavelet Transform to enhance time-frequency energy concentration, and combined the Squeeze-and-Excitation Network channel attention mechanism to optimize multiscale fault feature extraction, realizing high-precision bearing fault diagnosis in high-intensity noise environments. However, in actual industrial scenarios, single signals often lose critical fault information due to sensor installation constraints, motor casing shielding, common-mode noise from equipment foundations, and other factors, resulting in insufficient diagnostic robustness. Furthermore, under the time-varying rotational speed conditions of port cranes, bearing vibration signals exhibit non-stationary frequency modulation and amplitude modulation characteristics, which further increases the difficulty of fault diagnosis. Liu et al. [9] proposed a lightweight fault diagnosis framework VSFD-Net for variable speed rolling bearings, which adopts separable multi-scale convolution and broadcast self-attention modules to effectively capture non-stationary fault features while maintaining a lightweight model structure. In recent years, the rapid advancement of distributed intelligent sensing technology has strongly driven research on multi-sensor fusion for fault diagnosis. 
This technique can not only fuse multi-channel signals from a single sensor, but also integrate multi-dimensional monitoring signals such as vibration, current and rotational speed. It effectively makes up for the information limitations of single sensors and realizes the comprehensive capture of fault features. Currently, multi-source fusion methods are mainly divided into three categories: data-level fusion [10], feature-level fusion [11] and decision-level fusion [12]. Among them, feature-level fusion achieves the optimal balance between data dimensionality reduction and feature retention, and has become a research hotspot in multi-source fault diagnosis. Tong et al. [13] converted dual-source time-domain vibration and current signals of bearings into time-frequency representations via continuous wavelet transform, and constructed a diagnostic framework integrating Coordinate Attention and Efficient Multi-Scale Attention to effectively capture both the temporal information and global feature dependencies of bearing faults. Zhang et al. [14] allocated fusion weights to three-channel vibration signals using information entropy, and performed feature-level fusion on the shallow and deep features as well as different pooling features from CNN, which considerably enhanced the accuracy of bearing fault diagnosis.

Although existing fault diagnosis methods have achieved certain progress, they still suffer from a critical drawback: large model parameters and high computational complexity, making them difficult to deploy on edge computing devices with limited computing power and memory. Model lightweighting [15] provides an effective solution to this problem. In recent years, numerous lightweight fault diagnosis methods have been proposed for resource-constrained industrial scenarios, including a dedicated lightweight model compression framework for intelligent fault diagnosis of machines [16], which systematically integrates pruning, quantization and knowledge distillation to achieve efficient model deployment on edge devices. Among various lightweight strategies, structured channel pruning [17] takes convolution channels as the basic processing unit, preserves the complete network structure, and features strong hardware compatibility. It can significantly compress the model size while retaining core diagnostic performance, making it the optimal lightweight scheme for industrial edge deployment. Cheng et al. [18] employed a multi-scale feature extraction and fusion module to capture fault characteristics and adopted hierarchical structured pruning to eliminate redundant channels. This approach achieves significant model compression while maintaining excellent diagnostic performance. Sun et al. [19] improved the heterogeneous convolutional network with channel pruning for lightweight feature extraction, and further optimized fault features using the CBAM attention mechanism, achieving accurate bearing fault diagnosis with low parameters and computational cost.

Based on the above research status and technical bottlenecks, this paper integrates multi-source time-series feature fusion and structured pruning, and proposes a Multi-source Feature Fusion Lightweight Network (MTFL-Net). The main innovations are as follows:

A differentiated time-frequency preprocessing pipeline for multi-source signals is designed. Existing methods apply a unified time-frequency transformation to signals with different physical characteristics, which leads to insufficient expression of fault features. To address this, time-frequency transformation methods adapted to each signal type are used for the vibration and current signals, respectively. Core frequency bands are screened based on the energy contribution of bearing fault characteristic frequencies, and speed signals are mapped to high-dimensional features as auxiliary working condition information, so that the fault information in the different signals is retained more comprehensively.

A multi-branch heterogeneous feature extraction network is constructed. Most existing multi-source fusion networks adopt homogeneous branch structures, which adapt poorly to the feature distributions of different signals. Here, structurally adapted feature extraction branches are designed for the vibration, current and speed signals, respectively. Residual connections are introduced into each branch to alleviate the feature degradation problem of deep networks, and fault-sensitive features are adaptively enhanced through a channel attention mechanism to improve the effectiveness of feature extraction.

A selective structured pruning method for multi-branch networks is proposed. Traditional global structured pruning treats all convolutional layers indiscriminately and can easily destroy the information fusion logic of multi-branch networks. To avoid this, only the ordinary convolutional layers responsible for basic feature extraction in the vibration and current branches are pruned. The attention modules, residual connections, multi-source feature fusion layers and classification head are completely preserved, which maintains the core advantages of multi-source information fusion while making the model lightweight.

The rest of this paper is organized as follows. Section 2 introduces the fundamental theories of the proposed method and elaborates the overall architecture and core implementation of MTFL-Net. Section 3 conducts comparative experiments, noise robustness tests, lightweight performance evaluation and ablation studies on two public bearing datasets to verify the effectiveness and superiority of the proposed model. Section 4 summarizes the research findings, discusses the limitations of this study, and outlines future research directions.

2. Materials and Methods

This section elaborates the core basic theories involved in the proposed method, including multi-scale convolution, the residual network and structured channel pruning. For each theory, the formal definition, mathematical expression and a brief derivation are given, and its suitability for rolling bearing fault diagnosis under complex working conditions is explained, so as to provide a complete theoretical basis for the construction of the Multi-source Feature Fusion Lightweight Network.

2.1. Relevant Fundamental Research

2.1.1. Multi-Scale Convolution

Standard convolution layers rely on a fixed-size convolutional kernel to extract features, which can only capture information within a single receptive field. For the vibration and current signals of port crane bearings, which show non-stationary, transient-impact and wide-band characteristics, single-scale convolution fails to capture both local fault shock features and global frequency-domain distribution features at the same time, resulting in incomplete feature expression.

Multi-scale convolution is motivated by the idea that image features occur at multiple scales [20]. It adopts parallel convolutional branches with different kernel sizes so as to extract features at multiple scales simultaneously. Let the input feature map be $X \in \mathbb{R}^{C \times H \times W}$, where $C$ denotes the number of channels, and $H$ and $W$ denote the height and width of the feature map, respectively. A group of convolutional kernels with different receptive fields, $K_{1}, K_{2}, \dots, K_{n}$, is used to perform convolution operations on $X$ in parallel. The outputs of the parallel branches are concatenated along the channel dimension, and the mathematical expression is:

$$Y = \mathrm{Concat}(X * K_{1}, X * K_{2}, \dots, X * K_{n}) \tag{1}$$

where $*$ represents the discrete convolution operation, and $\mathrm{Concat}(\cdot)$ represents the concatenation operation along the channel dimension.

After concatenation, batch normalization is used to stabilize the feature distribution. In the scenario of bearing fault diagnosis, small convolutional kernels are responsible for capturing local fault shock features, while large convolutional kernels are used to extract global trend features and frequency-domain distribution rules. The parallel fusion of multi-scale features provides a more comprehensive feature representation for subsequent fault classification.
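The parallel structure of Eq. (1) can be illustrated with a minimal NumPy sketch. The kernel sizes (1, 3, 5), the 8 output channels per branch and the random weights are assumptions chosen for illustration, not values reported in this paper, and batch normalization is omitted. As in most CNN frameworks, the "convolution" is computed as a zero-padded cross-correlation.

```python
import numpy as np

def conv2d_same(x, kernels):
    """Zero-padded 'same' 2D convolution layer (cross-correlation, as in CNNs).

    x:       feature map of shape (C, H, W)
    kernels: weights of shape (C_out, C, k, k) with odd k
    returns: feature map of shape (C_out, H, W)
    """
    k = kernels.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    # windows[c, h, w] is the k x k patch of channel c centred on pixel (h, w)
    windows = np.lib.stride_tricks.sliding_window_view(xp, (k, k), axis=(1, 2))
    return np.einsum("chwij,ocij->ohw", windows, kernels)

def multi_scale_conv(x, kernel_bank):
    """Eq. (1): apply every branch to the same input, concatenate on the channel axis."""
    return np.concatenate([conv2d_same(x, k) for k in kernel_bank], axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 64, 64))        # C = 3, H = W = 64
kernel_sizes = (1, 3, 5)                    # hypothetical receptive fields
kernel_bank = [rng.standard_normal((8, 3, k, k)) for k in kernel_sizes]
y = multi_scale_conv(x, kernel_bank)
print(y.shape)  # (24, 64, 64): 3 branches x 8 channels each
```

Because every branch pads its input to keep the spatial size, the branch outputs can be concatenated without any cropping or resizing.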

2.1.2. Residual Network

Deep neural networks used for fault diagnosis often need sufficient depth to mine high-dimensional fault features. However, as the network depth increases, problems such as gradient vanishing and feature degradation become prominent, which makes the network difficult to train and leads to the loss of effective deep fault features.

The deep residual network [21] introduces a shortcut connection to reconstruct the learning objective of the network. A basic residual block consists of a main mapping branch and a shortcut branch. Let the input of the residual block be $x$ and the residual mapping learned by the main branch be $F(x, \{\omega_{i}\})$, where $\{\omega_{i}\}$ represents the weight parameters of each layer in the main branch. Then the output $H(x)$ of the residual block is:

$$H(x) = F(x, \{\omega_{i}\}) + x \tag{2}$$

Traditional deep networks directly learn the original mapping $H(x) = F(x)$. When the network is deep, the gradient tends to vanish during backpropagation, and the network cannot converge effectively. In the residual structure, the network learns the residual term $F(x) = H(x) - x$ instead of the original mapping. When the optimal mapping is close to the identity mapping, $F(x)$ approaches zero, and the network can still maintain stable information transmission through the shortcut connection. This structure effectively suppresses feature degradation and ensures that deep fault features in vibration and current signals can be stably extracted.
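The identity-mapping argument can be checked numerically. The following is a minimal sketch, assuming a two-layer fully connected main branch (the layer type, sizes and weights are illustrative choices, not the block used in MTFL-Net): when the main-branch weights are zero the block returns its input unchanged, and with small weights the output stays close to the input.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def residual_block(x, w1, w2):
    """Eq. (2): H(x) = F(x, {w_i}) + x, with F a two-layer mapping."""
    f = relu(x @ w1) @ w2      # residual mapping F(x, {w_i})
    return f + x               # shortcut adds the input back

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))            # 4 samples, 16 features

# With zero weights F(x) = 0, so the block reduces to the identity mapping.
zero = np.zeros((16, 16))
h_identity = residual_block(x, zero, zero)

# With small weights the block output is a small perturbation of the input.
w1 = 0.01 * rng.standard_normal((16, 16))
w2 = 0.01 * rng.standard_normal((16, 16))
h_small = residual_block(x, w1, w2)
print(np.max(np.abs(h_small - x)))
```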

2.1.3. Structured Channel Pruning

Structured channel pruning takes the convolutional channel as the minimum processing unit, which removes redundant channels that contribute little to feature extraction, so as to reduce the number of parameters and computational complexity of the model. Different from unstructured pruning, structured pruning does not destroy the overall topology of the network and has good hardware adaptability, which is suitable for lightweight deployment of multi-branch fault diagnosis models. However, conventional global structured pruning methods indiscriminately prune all convolutional layers without considering the functional differences between network modules. When applied to multi-branch heterogeneous fusion networks, this approach often damages the attention modules, residual connections and fusion layers that carry core diagnostic logic, leading to severe accuracy degradation or even network failure. This critical limitation has not been systematically addressed in existing lightweight fault diagnosis research, which motivates our proposed targeted pruning strategy.

(1): Channel Importance Evaluation

The importance of a channel is measured by the L1-norm of its convolutional kernel weights [22]. Channels with larger L1-norm values contribute more to fault feature extraction, while channels with smaller values are mostly redundant. For the $c$-th channel in a convolutional layer, the importance score $I_{c}$ is calculated as:

$$I_{c} = \sum_{k=1}^{K} \| W_{c,k} \|_{1} \tag{3}$$

where $W_{c,k}$ denotes the weight of the $k$-th convolutional kernel in the $c$-th channel, $K$ denotes the total number of kernels in one channel, and $\|\cdot\|_{1}$ denotes the L1-norm operation.

(2): Pruning Principle

All channels are sorted in descending order of importance scores. According to the preset optimal pruning rate, channels with low scores are removed, and channels with high scores that carry core fault information are retained. After pruning, a small learning rate is used for fine-tuning to restore the diagnostic accuracy lost in the pruning process.
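The scoring and ranking steps can be sketched in a few lines of NumPy. The sketch assumes that a "channel" is one output channel of a convolution with weights of shape (C_out, C_in, k, k), so that the sum over $k$ in Eq. (3) runs over its C_in kernels. The layer sizes, the 20% pruning rate and the helper names are hypothetical, and the fine-tuning step is not shown.

```python
import numpy as np

def channel_importance(weight):
    """Eq. (3): L1-norm of all kernel weights belonging to each output channel.

    weight has shape (C_out, C_in, k, k); the result has shape (C_out,).
    """
    return np.abs(weight).sum(axis=(1, 2, 3))

def select_channels(weight, prune_rate):
    """Return the sorted indices of the output channels that survive pruning."""
    scores = channel_importance(weight)
    n_keep = max(1, int(round(weight.shape[0] * (1.0 - prune_rate))))
    order = np.argsort(-scores, kind="stable")   # descending importance
    return np.sort(order[:n_keep])

def prune_pair(weight, next_weight, prune_rate):
    """Drop output channels of one layer and the matching input channels of the next."""
    keep = select_channels(weight, prune_rate)
    return weight[keep], next_weight[:, keep], keep

rng = np.random.default_rng(0)
conv1 = rng.standard_normal((10, 3, 3, 3))
conv1[[2, 7]] *= 0.01                     # make two channels nearly redundant
conv2 = rng.standard_normal((16, 10, 3, 3))

w1, w2, keep = prune_pair(conv1, conv2, prune_rate=0.2)
print(keep)                 # [0 1 3 4 5 6 8 9]
print(w1.shape, w2.shape)   # (8, 3, 3, 3) (16, 8, 3, 3)
```

Removing whole output channels, together with the corresponding input slices of the following layer, keeps both weight tensors dense, which is why structured pruning needs no sparse kernels at inference time.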

In the multi-branch network of this paper, structured pruning is only performed on the redundant channels of the feature extraction branches, and the fusion structure and residual connections are completely preserved. This strategy ensures the integrity of multi-source feature fusion while making the model lightweight.

Unlike existing multi-branch network pruning methods that adopt soft weight evaluation and indiscriminate layer pruning, the proposed targeted structured pruning separates prunable areas from protected areas with explicit, hard architectural boundaries. It only prunes ordinary convolutional layers used for basic feature extraction, while completely retaining the attention modules, residual connections, multi-source feature fusion layers and classification head. This design avoids damaging the multi-source information fusion logic, which is not systematically considered in conventional structured pruning for fault diagnosis networks.

2.2. The Proposed Method

We propose the Multi-source Feature Fusion Lightweight Network (MTFL-Net) with four key modules: differentiated time-frequency preprocessing for multi-source signals, multi-branch temporal feature extraction integrated with channel attention, multi-source feature fusion and Softmax classification, and targeted structured pruning for multi-branch architectures. Taking vibration, current and speed signals as input, the model strengthens fault-sensitive features and realizes multi-source feature fusion. It completes fault classification through standard fully connected layers and Softmax, and uses targeted structured channel pruning to compress the model for edge deployment with little degradation in diagnostic accuracy. The complete implementation framework is depicted in Figure 1.

2.2.1. Differentiated Time-Frequency Preprocessing of Multi-Source Signals

The vibration, current and speed signals of port crane bearings differ significantly in physical excitation mechanism, noise interference characteristics and time-series distribution. A single transformation method cannot retain complete fault features. This section designs a differentiated time-frequency preprocessing process for the three types of signals, converting one-dimensional time-series signals into 64 × 64 pixel two-dimensional time-frequency maps to provide standardized input for subsequent feature extraction.

For vibration signals with non-stationary, multi-component and transient impact characteristics, Wavelet Packet Decomposition (WPD) and Short-Time Fourier Transform (STFT) are combined. Given the original discrete vibration signal $x[n]$ with sampling frequency $f_{s}$, a 3-layer WPD is implemented using the Daubechies 4 wavelet basis, and the wavelet packet coefficient of the $k$-th node in the $j$-th layer is:

$$W_{j,k}[n] = \sum_{m} x[m] \cdot \psi_{j,k}(m - n) \tag{4}$$

where $\psi_{j,k}(\cdot)$ is the wavelet basis function of the $k$-th node in the $j$-th layer. The 3-layer WPD decomposes the original signal into $2^{3} = 8$ independent frequency band components with equal bandwidth. STFT is then performed on the wavelet packet coefficient sequence of each frequency band, generating eight 64 × 64 2D time-frequency images that together cover the full frequency band.

To eliminate input dimensional redundancy and suppress noise interference from invalid frequency bands, a fault-sensitive core frequency band screening strategy is adopted. The importance of each frequency band is evaluated by the energy contribution ratio of theoretically calculated bearing characteristic frequencies and their 1st–5th harmonics, including Ball Pass Frequency Outer race (BPFO), Ball Pass Frequency Inner race (BPFI), Ball Spin Frequency (BSF) and Fundamental Train Frequency (FTF). The specific calculation formulas are:

$$\mathrm{BPFO} = \frac{Z}{2} f_{r} \left(1 - \frac{d}{D} \cos\alpha\right) \tag{5}$$

$$\mathrm{BPFI} = \frac{Z}{2} f_{r} \left(1 + \frac{d}{D} \cos\alpha\right) \tag{6}$$

$$\mathrm{BSF} = \frac{D}{2d} f_{r} \left[1 - \left(\frac{d}{D} \cos\alpha\right)^{2}\right] \tag{7}$$

$$\mathrm{FTF} = \frac{1}{2} f_{r} \left(1 - \frac{d}{D} \cos\alpha\right) \tag{8}$$

where $Z$ is the number of rolling elements, $d$ is the rolling element diameter, $D$ is the bearing pitch diameter, $\alpha$ is the contact angle, and $f_{r}$ is the shaft rotation frequency. All 8 frequency bands are sorted in descending order of fault feature energy contribution, and the top 5 core frequency bands, which cover more than 90% of the effective fault feature energy, are retained. Only the time-frequency images corresponding to these 5 core frequency bands are used as the input of the subsequent vibration signal feature branch, while the remaining frequency band images, which carry little fault information and much noise, are discarded [23]. Each retained image is normalized to a 64 × 64 time-frequency image to highlight fault impacts and the characteristic frequency distribution.
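As a worked illustration of Eqs. (5) to (8) and of the band ranking, the sketch below computes the characteristic frequencies and their 1st to 5th harmonics, then keeps the five bands with the largest fault-feature energy. The bearing geometry, the shaft frequency and the per-band energies are illustrative assumptions, not values from this paper, and how the per-band fault energy is measured is left outside the sketch.

```python
import math

def bearing_frequencies(z, d, D, alpha_deg, f_r):
    """Eqs. (5)-(8): characteristic fault frequencies in Hz."""
    ratio = (d / D) * math.cos(math.radians(alpha_deg))
    return {
        "BPFO": 0.5 * z * f_r * (1.0 - ratio),
        "BPFI": 0.5 * z * f_r * (1.0 + ratio),
        "BSF": D / (2.0 * d) * f_r * (1.0 - ratio ** 2),
        "FTF": 0.5 * f_r * (1.0 - ratio),
    }

def harmonics(freq, orders=range(1, 6)):
    """1st to 5th harmonics of a characteristic frequency."""
    return [n * freq for n in orders]

def top_bands(fault_energy, n_keep=5):
    """Rank bands by fault-feature energy; return kept indices and their energy share."""
    order = sorted(range(len(fault_energy)),
                   key=lambda i: fault_energy[i], reverse=True)
    kept = order[:n_keep]
    share = sum(fault_energy[i] for i in kept) / sum(fault_energy)
    return kept, share

# Hypothetical deep groove ball bearing: 9 balls, d = 7.94 mm, D = 39.04 mm, alpha = 0.
freqs = bearing_frequencies(z=9, d=7.94, D=39.04, alpha_deg=0.0, f_r=25.0)
bpfo_harmonics = harmonics(freqs["BPFO"])

# Hypothetical fault-feature energy of the 8 WPD bands (arbitrary units).
energy = [0.30, 0.22, 0.04, 0.18, 0.02, 0.12, 0.03, 0.09]
kept, share = top_bands(energy)
print(freqs)
print(kept, round(share, 2))   # [0, 1, 3, 5, 7] 0.91
```

Two identities give a quick sanity check on any implementation: BPFO + BPFI equals $Z f_{r}$, and FTF equals BPFO divided by $Z$.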

For current signals dominated by motor electromagnetic characteristics and load fluctuations, Ensemble Empirical Mode Decomposition (EEMD) and Hilbert Transform (HT) are adopted. The stable intrinsic mode functions (IMFs) $c_{k}(t)$ are obtained by EEMD with the following parameters: the number of ensemble trials is 100, the standard deviation of the added Gaussian white noise is 0.2, and the decomposition stops when the standard deviation between the IMF components obtained from two adjacent iterations is less than $1 \times 10^{-4}$. Then the analytic signal is constructed by HT:

$$z_{k}(t) = c_{k}(t) + j \cdot H\{c_{k}(t)\} = A_{k}(t) e^{j\theta_{k}(t)} \tag{9}$$

where $A_{k}(t)$ is the instantaneous amplitude and $\theta_{k}(t)$ is the instantaneous phase. A time-frequency map with the same size as that of the vibration signal is generated accordingly.
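The Hilbert step of Eq. (9) can be sketched with an FFT-based analytic signal. EEMD itself is not implemented here: a pure 50 Hz tone stands in for one IMF $c_{k}(t)$, and the sampling rate and signal length are illustrative assumptions.

```python
import numpy as np

def analytic_signal(c):
    """Eq. (9): z(t) = c(t) + j*H{c(t)}, built by suppressing negative frequencies."""
    n = len(c)
    spectrum = np.fft.fft(c)
    gain = np.zeros(n)
    gain[0] = 1.0                       # keep the DC component once
    if n % 2 == 0:
        gain[n // 2] = 1.0              # keep the Nyquist component once
        gain[1:n // 2] = 2.0            # double the positive frequencies
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * gain)

def instantaneous_attributes(c, fs):
    """Instantaneous amplitude A(t), phase theta(t) and frequency in Hz."""
    z = analytic_signal(c)
    amplitude = np.abs(z)
    phase = np.unwrap(np.angle(z))
    frequency = np.diff(phase) * fs / (2.0 * np.pi)
    return amplitude, phase, frequency

fs = 1000.0                                   # hypothetical sampling rate in Hz
t = np.arange(1000) / fs
c_k = np.cos(2.0 * np.pi * 50.0 * t)          # stand-in for one IMF
amplitude, phase, frequency = instantaneous_attributes(c_k, fs)
print(amplitude.mean(), frequency.mean())
```

For this stand-in tone the instantaneous amplitude stays at 1 and the instantaneous frequency at 50 Hz. For real IMFs both vary over time, and it is this variation that a Hilbert-based time-frequency map displays.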

Speed signals, as one-dimensional scalar time-series, are synchronized with vibration and current signals by resampling and linear interpolation, and then mapped to high-dimensional feature vectors via a fully connected layer to complete dimension alignment of multi-source signals.
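The synchronization step can be sketched with linear interpolation onto the time base of the vibration and current signals. The sampling instants and speed values below are hypothetical. Note that `np.interp` holds the first and last values constant outside the measured range.

```python
import numpy as np

def align_speed(t_speed, speed, t_target):
    """Linearly interpolate the speed record onto the vibration/current time base."""
    return np.interp(t_target, t_speed, speed)

t_speed = np.array([0.0, 0.1, 0.2, 0.3])            # speed sampling instants in s
speed = np.array([600.0, 900.0, 1200.0, 1500.0])    # shaft speed in rpm
fs = 100.0                                          # target sampling rate in Hz
t_target = np.arange(0, 31) / fs                    # 0 to 0.30 s
speed_aligned = align_speed(t_speed, speed, t_target)
print(speed_aligned[:6])
```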

2.2.2. Multi-Branch Temporal Feature Extraction Module

To achieve accurate extraction of fault features and noise suppression while maintaining a simple and efficient network structure, lightweight channel attention is embedded in the vibration and current branches to enhance fault-sensitive channel features through adaptive weight assignment. The speed branch adopts fully connected mapping for feature dimension upgrade, and residual structure is introduced in each branch to alleviate feature degradation of deep networks. Table 1 presents the parameters of the vibration network and current network. The input dimension of each branch is defined as [Batch size (B), Number of sensors, Number of frequency bands, Number of image channels, Height (H), Width (W)]. The dimensions shown in the table correspond to the maximum sensor configuration of the KAIST dataset (4 vibration sensors, 3 current sensors). For the Paderborn University dataset, which contains 1 vibration signal and 2 current signals, the input dimensions are adjusted to B × 1 × 5 × 3 × 64 × 64 for the vibration branch and B × 2 × 5 × 3 × 64 × 64 for the current branch. The dimensionality reshaping layer automatically concatenates multi-sensor features along the channel dimension, so no modifications to the subsequent network structure are required.

(1): Vibration Signal Feature Branch

The vibration branch is based on depthwise separable convolution and multi-scale parallel convolution to extract multi-scale features of time-frequency maps. A channel attention module is connected after the convolution output. It compresses the spatial dimensions through global average pooling to construct channel feature descriptors, and learns the channel weight distribution through two fully connected layers to enhance key fault features and suppress noise channels. The attention-weighted features are output through a residual structure, and the residual mapping relationship is expressed as:

$$X_{v}^{l+1} = F_{v}(X_{v}^{'}) + X_{v}^{l} \tag{10}$$

where $X_{v}^{l}$ and $X_{v}^{l+1}$ are the input and output of the $l$-th residual block of the vibration branch, $X_{v}^{'}$ is the feature after attention weighting, and $F_{v}(\cdot)$ is the feature mapping function composed of multi-scale convolution and nonlinear transformation.
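The channel attention step described above (global average pooling followed by two fully connected layers) can be sketched as follows. The sigmoid gate, the ReLU between the two layers and the reduction ratio of 4 are squeeze-and-excitation-style assumptions, since the text does not state these details. The `mapping` argument is a placeholder for the multi-scale convolution $F_{v}(\cdot)$.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def channel_attention(x, w_down, w_up):
    """Reweight the channels of a (C, H, W) feature map.

    w_down: (C // r, C) weights of the first fully connected layer
    w_up:   (C, C // r) weights of the second fully connected layer
    """
    descriptor = x.mean(axis=(1, 2))                 # global average pooling -> (C,)
    hidden = np.maximum(w_down @ descriptor, 0.0)    # first FC layer + ReLU
    weights = sigmoid(w_up @ hidden)                 # second FC layer + gate in (0, 1)
    return x * weights[:, None, None], weights

def attention_residual_block(x, w_down, w_up, mapping):
    """Attention-weighted features pass through `mapping`; the shortcut adds x back."""
    x_att, _ = channel_attention(x, w_down, w_up)
    return mapping(x_att) + x

rng = np.random.default_rng(0)
channels, reduction = 16, 4                          # hypothetical sizes
x = rng.standard_normal((channels, 8, 8))
w_down = rng.standard_normal((channels // reduction, channels))
w_up = rng.standard_normal((channels, channels // reduction))

_, weights = channel_attention(x, w_down, w_up)
out = attention_residual_block(x, w_down, w_up, mapping=lambda f: 0.1 * f)
print(weights.round(2))
print(out.shape)   # (16, 8, 8)
```

Channels whose gate is close to 0 are suppressed before the feature mapping, while the shortcut still carries the unweighted input forward.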

(2): Current Signal Feature Branch

The current branch adopts a four-stage convolution-pooling sequence to expand the receptive field layer by layer to mine deep electromagnetic fault features. A channel attention module is connected after each stage of convolution, and its feature weighting and residual output logic are consistent with the vibration branch to ensure the dimensional matching and distribution consistency of multi-branch output features, laying a foundation for subsequent multi-source feature fusion.

(3): Speed Signal Feature Branch

As working condition auxiliary information, the speed signal is mapped from a one-dimensional scalar to a 512-dimensional feature vector through a single fully connected layer. This dimension design is consistent with the output feature dimensions of the vibration and current branches, ensuring dimension matching during multi-source feature fusion, and the linear transformation relationship is:

$$X_{r} = W_{r} \cdot x_{r} + b_{r} \tag{11}$$

where $x_{r}$ is the original speed scalar, $W_{r}$ and $b_{r}$ are the weight and bias of the fully connected layer, respectively, and $X_{r}$ is the final output high-dimensional speed feature.

Although the speed signal is only a scalar, it can act as a working condition embedding that modulates the decision boundary of the classifier through channel-wise concatenation with the high-dimensional fault features of vibration and current. Specifically, during training, the model learns the distribution differences in fault features at different speeds and converts speed information into weight corrections for the fault features, thereby adapting to time-varying fault characteristic frequencies under variable speed conditions. This design avoids complex temporal alignment operations while retaining the guiding role of working condition information in fault diagnosis.

2.2.3. Multi-Source Feature Fusion Mapping and Softmax Classification

This paper employs direct channel-wise concatenation to achieve multi-source feature fusion. This method requires no additional learnable parameters and features a simple structure with strong hardware compatibility. It fully preserves the fault features independently extracted from each modality, thereby achieving complementary fusion of vibration-shock information, electromagnetic response information, and operating condition information. The concatenated features are then subjected to nonlinear mapping and dimensionality reduction via a fully connected layer. Finally, the Softmax function is utilized to determine the probabilities of fault categories, establishing a complete end-to-end classification framework.

(1): Multi-Source Feature Channel Fusion

The three types of signals (vibration, current and speed) carry complementary information on fault impact, electromagnetic response and working condition state, respectively. After branch extraction and attention enhancement, they have a uniform feature dimension and time-series correspondence. This paper adopts direct channel-wise concatenation to realize multi-source feature fusion, which requires no additional learnable parameters, has a simple structure and is easy to implement in engineering. The fusion form is expressed as:

$$F_{cat} = \mathrm{Concat}(X_{v\text{-}out}, X_{c\text{-}out}, X_{r}) \tag{12}$$

where $X_{v\text{-}out}$, $X_{c\text{-}out}$ and $X_{r}$ are the output features of the vibration, current and speed branches, respectively, and $\mathrm{Concat}(\cdot)$ is the channel-wise concatenation operation. The fused feature $F_{cat}$ integrates multi-source fault information and working condition information, with stronger discriminability and robustness.

(2): Fully Connected Mapping of Fused Features

The high-dimensional fused features after concatenation contain redundant information, so a fully connected layer is used for nonlinear transformation and dimensionality reduction, mapping the fused features to a low-dimensional discriminative feature space to improve classification efficiency and generalization ability. The mapping of the fully connected layer is:

$$F_{fc} = \delta\left(W_{fc} \cdot F_{cat} + b_{fc}\right) \tag{13}$$

where $W_{fc}$ and $b_{fc}$ are the weight and bias of the fully connected layer, $\delta(\cdot)$ is the ReLU nonlinear activation function, and $F_{fc}$ is the output low-dimensional discriminative feature, which reduces the computational complexity of the subsequent classification layer.
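Equation (13) can be traced by hand on a toy example. The sketch below assumes a hypothetical 4-dimensional fused feature mapped to 2 dimensions; the weights, bias and the function name `fully_connected` are illustrative and not taken from the trained model.

```python
from typing import List


def relu(x: float) -> float:
    """ReLU activation: pass positive values, clamp negatives to zero."""
    return x if x > 0.0 else 0.0


def fully_connected(w: List[List[float]], b: List[float], x: List[float]) -> List[float]:
    """Compute ReLU(W x + b), where W holds one row per output unit."""
    out = []
    for row, bias in zip(w, b):
        pre_activation = sum(wi * xi for wi, xi in zip(row, x)) + bias
        out.append(relu(pre_activation))
    return out


# Hypothetical fused feature and layer parameters (illustrative values only).
f_cat = [1.0, -2.0, 0.5, 3.0]
w_fc = [
    [0.5, 0.0, 1.0, 0.0],     # pre-activation: 0.5 + 0.5 + 0.1 = 1.1
    [-1.0, 0.5, 0.0, -0.5],   # pre-activation: -1.0 - 1.0 - 1.5 + 0.2 = -3.3
]
b_fc = [0.1, 0.2]

f_fc = fully_connected(w_fc, b_fc, f_cat)
print(f_fc)
```

The second unit is clamped to zero by the ReLU, which is the nonlinearity that distinguishes this layer from a plain linear projection.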

(3): Softmax Fault Classification and Loss Function Construction

The standard Softmax function maps the low-dimensional discriminative features to the posterior probability distribution over fault categories, from which the bearing health state is determined. For a classification task with $N$ fault categories, the Softmax probability of the i-th category is:

$$P_{i} = \frac{\exp\left(F_{fc}^{(i)}\right)}{\sum_{j=1}^{N} \exp\left(F_{fc}^{(j)}\right)} \tag{14}$$

where $F_{fc}^{(i)}$ is the output of the fully connected layer corresponding to the i-th category, and $P_{i}$ satisfies $\sum_{i=1}^{N} P_{i} = 1$, so it directly characterizes the probability that the sample belongs to each category.
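A minimal sketch of Equation (14) follows. Subtracting the maximum score before exponentiation is a standard numerical-stability device that leaves the result unchanged; it is an addition of this sketch, not something stated in the text. The scores for $N = 4$ classes are hypothetical.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Softmax over class scores, shifted by the maximum to avoid overflow."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1, -1.0]  # hypothetical class scores
probs = softmax(scores)
predicted = max(range(len(probs)), key=probs.__getitem__)

print([round(p, 4) for p in probs], predicted)
```

The predicted category is simply the index of the largest probability, and the probabilities sum to one by construction.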

2.2.4. Targeted Structured Pruning for Multi-Branch Networks

To address the large parameter counts, high computational cost and difficult edge deployment of multi-source fusion models, this section proposes a targeted structured pruning framework for multi-branch heterogeneous networks. Unlike traditional global structured pruning, which indiscriminately prunes all convolutional layers, the proposed targeted structured channel pruning explicitly separates prunable regions from protected regions. It prunes only the feature extraction convolutional layers of the vibration and current branches, while fully retaining the attention modules, residual connections, multi-source fusion layers and classification head, so the model is made lightweight without damaging the multi-source feature fusion logic.

(1): Channel Importance Evaluation Criterion

This paper adopts the weight L1 norm described in Section 2.1 as the channel importance index. The underlying assumption is that the larger the L1 norm of a convolution channel's weights, the more that channel contributes to fault feature extraction; channels with small norms are treated as redundant. All candidate convolution channels are sorted in descending order of importance score, and the pruning threshold is determined from the pruning rate. Following the classic structured pruning paradigm, multiple groups of control experiments identified 60% as the optimal pruning rate: the top 40% of channels are retained, and the importance score at that cut-off position serves as the threshold. This setting achieves the best lightweighting effect with minimal accuracy loss.
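The ranking rule can be sketched in a few lines. The example below scores each output channel by the L1 norm of its filter weights and keeps the top 40% for a 60% pruning rate. The five 1×1 filters, their values and the function names are hypothetical; the sketch also ranks channels within a single layer, which is an assumption, since the text does not state whether ranking is per layer or across layers.

```python
from typing import List

# One filter = weights for one output channel, shaped [in_channels][kH][kW].
Filter = List[List[List[float]]]


def l1_norm(filt: Filter) -> float:
    """Sum of absolute weights of one output channel's filter."""
    return sum(abs(w) for plane in filt for row in plane for w in row)


def select_channels(filters: List[Filter], prune_rate: float) -> List[int]:
    """Return indices of the channels to keep, in ascending order."""
    scores = [l1_norm(f) for f in filters]
    n_keep = max(1, round(len(filters) * (1.0 - prune_rate)))
    ranked = sorted(range(len(filters)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Five hypothetical 1x1 filters with a single input channel each.
filters = [[[[0.9]]], [[[0.1]]], [[[-0.7]]], [[[0.05]]], [[[0.3]]]]
keep = select_channels(filters, prune_rate=0.6)
print(keep)  # [0, 2]
```

Note that the sign of a weight does not matter: the filter with weight -0.7 is kept because only its magnitude enters the score.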

(2): Execution Flow of Multi-Branch Targeted Pruning

Traditional global pruning easily damages the multi-branch network topology and attention modules. This paper therefore adopts a targeted pruning strategy that clearly separates prunable areas from protected areas: only the convolution channels of the vibration and current branches are pruned, while the channel attention modules, residual connections, speed branch, feature fusion layers and classification layers are fully retained.

The pruning execution flow is divided into four steps: ① Pre-train the complete model to convergence to obtain stable feature extraction ability. ② Calculate and sort the importance scores of all channels to be pruned. ③ Eliminate redundant channels according to the optimal pruning rate and reconstruct the network structure. ④ Fine-tune the model with a small learning rate to recover the accuracy loss caused by pruning.
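Step ③ requires that consecutive layers stay dimensionally consistent: removing an output channel from one convolution also removes the matching input channel of the next. The sketch below shows this bookkeeping on nested lists. The layer sizes, the kept indices and the helper names are hypothetical, and the sketch ignores batch normalization and skip connections, which a real implementation would also need to handle.

```python
from typing import List, Tuple

Kernel = List[List[float]]        # [kH][kW]
ConvWeight = List[List[Kernel]]   # [out_channels][in_channels][kH][kW]


def prune_conv_pair(
    w1: ConvWeight, b1: List[float], w2: ConvWeight, keep: List[int]
) -> Tuple[ConvWeight, List[float], ConvWeight]:
    """Drop output channels of layer 1 and the matching input channels of layer 2."""
    w1_pruned = [w1[i] for i in keep]                     # fewer filters in layer 1
    b1_pruned = [b1[i] for i in keep]                     # one bias per kept filter
    w2_pruned = [[filt[i] for i in keep] for filt in w2]  # fewer input planes in layer 2
    return w1_pruned, b1_pruned, w2_pruned


def zeros(out_c: int, in_c: int, k: int) -> ConvWeight:
    """Build an all-zero convolution weight of the given shape."""
    return [[[[0.0] * k for _ in range(k)] for _ in range(in_c)] for _ in range(out_c)]


# Hypothetical pair of 3x3 convolutions: 3 -> 5 channels, then 5 -> 4 channels.
w1, b1, w2 = zeros(5, 3, 3), [0.0] * 5, zeros(4, 5, 3)
keep = [0, 2]  # indices of retained channels, e.g. from an L1-norm ranking

w1_p, b1_p, w2_p = prune_conv_pair(w1, b1, w2, keep)
print(len(w1_p), len(w1_p[0]), len(w2_p), len(w2_p[0]))  # 2 3 4 2
```

The second layer keeps all four of its own filters; only its input side shrinks, which is why the parameter saving compounds across stacked layers.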

(3): Pruning Accuracy Recovery and Constraint Mechanism

Pruning causes a slight accuracy drop. This paper adopts a small-learning-rate fine-tuning strategy, using AdamW as the optimizer with a learning rate of $1 \times 10^{-5}$ and 20 training epochs, so that the model can quickly adapt to the pruned network structure.

To ensure the diagnostic performance and structural integrity of the pruned model, three constraint rules are formulated: ① Channel attention and residual modules are fully protected and not pruned. ② The structure of feature concatenation and classification layers is completely retained to ensure the smooth flow of multi-source fusion and classification processes. ③ The pruning rate does not exceed 70% to avoid accidental deletion of core fault feature channels.
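The three constraint rules can be expressed as a simple guard applied before any channel is removed. In the sketch below, the layer names, the prefix and keyword tuples, and the function `prunable_layers` are all hypothetical naming conventions introduced for illustration; only the 70% ceiling and the list of protected module types come from the text.

```python
from typing import List

MAX_PRUNE_RATE = 0.70                                  # rule 3: upper bound on the pruning rate
PRUNABLE_PREFIXES = ("vibration.", "current.")         # only these branches are pruned
PROTECTED_KEYWORDS = ("attention", "residual", "fusion", "classifier")  # rules 1 and 2


def prunable_layers(layer_names: List[str], prune_rate: float) -> List[str]:
    """Return the layers that may be pruned, or raise if the rate breaks rule 3."""
    if not 0.0 <= prune_rate <= MAX_PRUNE_RATE:
        raise ValueError(f"pruning rate {prune_rate} is outside [0, {MAX_PRUNE_RATE}]")
    return [
        name
        for name in layer_names
        if name.startswith(PRUNABLE_PREFIXES)
        and not any(keyword in name for keyword in PROTECTED_KEYWORDS)
    ]


# Hypothetical layer names for a three-branch network.
layers = [
    "vibration.conv1",
    "vibration.attention.fc1",
    "current.conv2",
    "current.residual.shortcut",
    "speed.fc",
    "fusion.fc",
    "classifier.out",
]
print(prunable_layers(layers, 0.6))  # ['vibration.conv1', 'current.conv2']
```

Using an allow-list of branches plus a deny-list of module types means a newly added layer is protected by default unless it sits in a prunable branch.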

3. Results

To verify the diagnostic performance of the proposed model under different working conditions and fault types, two internationally public rolling bearing datasets are adopted, namely the Paderborn University bearing dataset and the KAIST variable-condition bearing dataset. The two datasets cover steady-state and variable-speed conditions, multiple fault types, and strong noise interference scenarios, which can fully validate the practicability and robustness of the model.

3.1. Experimental Environment and Parameter Settings

All experiments are conducted on Windows 11 with an Intel Core i9-14900HX processor, an NVIDIA RTX 4090 graphics card (48 GB), and 64 GB of RAM; the deep learning framework is PyTorch 2.7.0+cu128, and the development environment is PyCharm 2024.3.3.

Fixed parameter configurations are applied to ensure the reproducibility of all experiments. The total number of training epochs is set to 100 with a batch size of 32. AdamW is used as the optimizer, with an initial learning rate of $1 \times 10^{-4}$ and a weight decay coefficient of $1 \times 10^{-3}$; the learning rate follows a cosine annealing schedule, and the loss function is label-smoothed cross-entropy with a smoothing factor of 0.1. In the fine-tuning stage after pruning, a smaller learning rate of $1 \times 10^{-5}$ is applied for 20 epochs to enable the model to adapt rapidly to the lightweight structure.
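For readers unfamiliar with the two training components named above, the sketch below writes out a cosine annealing schedule and a label-smoothed cross-entropy in plain Python. The minimum learning rate of 0 and the annealing period of 100 epochs are assumptions, since the text does not state them; the smoothing convention (the one-hot target mixed with a uniform distribution) is the common one and is likewise assumed.

```python
import math
from typing import List


def cosine_lr(epoch: int, base_lr: float = 1e-4, total_epochs: int = 100,
              min_lr: float = 0.0) -> float:
    """Learning rate at a given epoch under cosine annealing."""
    cos_term = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return min_lr + (base_lr - min_lr) * cos_term


def smoothed_cross_entropy(probs: List[float], target: int, eps: float = 0.1) -> float:
    """Cross-entropy against a target of (1 - eps) * one-hot + eps / N uniform."""
    n = len(probs)
    loss = 0.0
    for i, p in enumerate(probs):
        q = eps / n + ((1.0 - eps) if i == target else 0.0)
        loss -= q * math.log(p)
    return loss


for epoch in (0, 25, 50, 75, 100):
    print(epoch, f"{cosine_lr(epoch):.2e}")

# Hypothetical predicted probabilities for a 4-class problem, true class 0.
print(round(smoothed_cross_entropy([0.7, 0.1, 0.1, 0.1], target=0), 4))
```

With smoothing, even a confident correct prediction incurs a small loss from the non-target classes, which discourages over-confident outputs.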

Four quantitative metrics are used for comprehensive evaluation: fault diagnosis accuracy, number of trainable parameters, floating-point operations, and single-sample inference time, which characterize the diagnostic accuracy, lightweight level, and inference efficiency simultaneously. All experimental results reported in this paper are the mean values of 10 independent repeated experiments to ensure statistical stability and reproducibility.

3.2. Experimental Group A

3.2.1. Introduction to the Dataset

The bearing dataset from the University of Paderborn [24] contains samples of 32 operating states for the 6203 deep-groove ball bearings, covering three typical working conditions: artificial damage, accelerated life test damage, and healthy state. Figure 2 illustrates the experimental platform employed, primarily comprising an electric motor, torque measurement shaft, rolling bearing test module, flywheel, and load motor.

3.2.2. Dataset Preprocessing

For verification, eight representative bearing states are selected: KA03, KA15, KI01, KI07, KB27, KI21, K003 and K004. Vibration signals, current signals, and rotational speed signals are simultaneously collected as model inputs.

Due to data volume limitations, the original continuous signals are segmented using a non-overlapping sliding window with a length of 1024 data points, ensuring each segment contains 5–10 complete fault cycles. Each bearing state includes 2000 samples, forming a balanced dataset finally divided into training, validation, and test sets with a ratio of 7:2:1.
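The segmentation and partitioning steps can be sketched as follows. The window length of 1024 and the 7:2:1 ratio come from the text; the shuffling before the split, the fixed seed and the function names are assumptions made for this illustration.

```python
import random
from typing import List, Sequence, Tuple


def segment(signal: Sequence[float], window: int = 1024) -> List[Sequence[float]]:
    """Cut a signal into non-overlapping windows, discarding any incomplete tail."""
    n_windows = len(signal) // window
    return [signal[i * window:(i + 1) * window] for i in range(n_windows)]


def split_7_2_1(samples: list, seed: int = 0) -> Tuple[list, list, list]:
    """Shuffle, then split into training, validation and test sets (7:2:1)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 7 // 10
    n_val = len(shuffled) * 2 // 10
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


# A synthetic stand-in for one recorded channel (not real bearing data).
signal = [float(i) for i in range(5000)]
windows = segment(signal)
print(len(windows), len(windows[0]))  # 4 1024

train, val, test = split_7_2_1(list(range(2000)))
print(len(train), len(val), len(test))  # 1400 400 200
```

Because the windows do not overlap, no data point appears in more than one sample, so the three subsets share no raw signal values.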

3.2.3. Analysis of Experimental Results

Based on the Paderborn University bearing dataset, this section validates the comprehensive performance of MTFL-Net from four perspectives: diagnostic accuracy, lightweight efficiency, noise robustness, and module effectiveness. Comparative models are selected from mainstream fault diagnosis methods, including the traditional CNN models WDCNN [25] and ResNet18, the lightweight models CWT-AA-ResNet [26] and SWT-MCNN [27], and the latest general-purpose lightweight vision models ConvNeXt-Tiny [28] and TinyViT [29]. All models are trained under unified experimental parameters and data partition protocol to ensure a fair comparison.

The final results are comprehensively evaluated using four metrics: accuracy, number of parameters, floating-point operations (FLOPs), and single-sample inference time, as shown in Table 2.

The experimental results lead to the following observations:

(1): The original MTFL-Net achieves a diagnostic accuracy of 99.87% on the PU dataset, outperforming all comparative models. This performance gain is attributed to the complementary characteristics of multi-source signals and the enhancement of fault-sensitive features by the attention mechanism, which enables the model to distinguish subtle fault differences.
(2): The parameter count and computational cost of the original MTFL-Net are higher than those of SWT-MCNN and TinyViT, but significantly lower than those of ConvNeXt-Tiny. Notably, with only 8.17 M parameters, MTFL-Net achieves 5.7 and 3.55 percentage points higher accuracy than TinyViT and ConvNeXt-Tiny, respectively, balancing the strong feature representation capability of multi-source fusion against model compactness.
(3): After targeted structured channel pruning at a 60% rate, the model is compressed to 3.28 M parameters with FLOPs reduced to 292.09 M. The per-sample inference time is only 2.88 ms, and the accuracy drop is limited to 3.13%. These results satisfy the deployment requirements for edge devices.

To further assess how well the pruned model would work in real industrial settings, we analyzed its hardware needs and computational performance using the typical specifications of industrial control computers commonly used in port automation. The pruned MTFL-Net is only 12.6 MB in size, uses roughly 18 MB of memory when running, and peaks at less than 35% CPU utilization. These figures indicate that it meets the real-time requirement for industrial fault diagnosis and can run alongside other monitoring systems on the same edge device without causing performance issues.
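The reported compression figures can be cross-checked with simple arithmetic. The sketch below assumes 32-bit floating-point weights, which the text does not state; the parameter counts are the reported 8.17 M and 3.28 M.

```python
params_original = 8.17e6  # reported parameter count before pruning
params_pruned = 3.28e6    # reported parameter count after pruning
bytes_per_param = 4       # assumption: float32 storage

param_reduction = 1.0 - params_pruned / params_original
size_mb = params_pruned * bytes_per_param / 1e6     # decimal megabytes
size_mib = params_pruned * bytes_per_param / 2**20  # binary mebibytes

print(f"parameter reduction: {param_reduction:.1%}")
print(f"float32 weight size: {size_mb:.2f} MB ({size_mib:.2f} MiB)")
```

Under this assumption the weights alone occupy about 13.1 MB (12.5 MiB), which is of the same order as the reported 12.6 MB model size; the exact figure depends on the storage format and rounding of the parameter count.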

3.2.4. Confusion Matrix Analysis

To intuitively compare the classification performance of each model for different bearing operating conditions, we plotted the confusion matrices of all the comparative models based on the test set of the Paderborn bearing dataset, as shown in Figure 3. The test set covers eight types of bearing operating conditions, including outer race single-point damage (KA03, KA15), inner race single-point damage (KI01, KI07, KI21), inner–outer race composite damage (KB27), and two groups of healthy bearing samples (K003, K004), which essentially covers the common actual fault types of rolling bearings.

As can be seen from the confusion matrix results, the MTFL-Net proposed in this paper achieves an overall classification accuracy of 99.87%, outperforming all comparative models. For single-point damage samples with distinct feature differences, the model realizes 100% correct recognition without any misclassification. For inner–outer race composite damage samples with higher diagnostic difficulty, only two samples are misclassified, and the recognition accuracy still reaches 99%, with all remaining samples correctly identified.
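The quoted figures are mutually consistent, as the sketch below shows. It assumes a test set of 200 samples per class (10% of 2000) and a single run in which the only errors are the two KB27 samples; the class they are assigned to (KI21 here) is hypothetical, since the text does not say.

```python
from typing import Dict, List

classes = ["KA03", "KA15", "KI01", "KI07", "KB27", "KI21", "K003", "K004"]
per_class_test = 2000 // 10  # 10% of 2000 samples per class

# Rows are true classes, columns are predicted classes.
cm: List[List[int]] = [
    [per_class_test if r == c else 0 for c in range(len(classes))]
    for r in range(len(classes))
]
kb27 = classes.index("KB27")
wrong = classes.index("KI21")  # hypothetical destination of the two errors
cm[kb27][kb27] -= 2
cm[kb27][wrong] += 2


def overall_accuracy(matrix: List[List[int]]) -> float:
    """Correct predictions (diagonal) divided by all samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)


def per_class_accuracy(matrix: List[List[int]], names: List[str]) -> Dict[str, float]:
    """Fraction of each true class that is predicted correctly."""
    return {name: matrix[i][i] / sum(matrix[i]) for i, name in enumerate(names)}


print(f"overall: {overall_accuracy(cm):.4%}")
print(f"KB27: {per_class_accuracy(cm, classes)['KB27']:.2%}")
```

Two errors out of 1600 test samples give 99.875% overall, in line with the reported 99.87%, and 198 of 200 correct KB27 samples give the stated 99%.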

In contrast, the classification accuracy of all other comparative models shows a significant decline with the increase in fault complexity. Especially for WDCNN, the traditional CNN model, although it can maintain a recognition rate of over 90% for simple single-point damage, a large number of early weak fault samples are misclassified, and its performance is far inferior to that of MTFL-Net. Among the general-purpose lightweight models, ConvNeXt-Tiny achieves an overall accuracy of 96.32%, outperforming ResNet-18 (95.15%), while TinyViT achieves 94.17% accuracy, slightly lower than ResNet-18. Both models still have insufficient discriminative ability for composite faults and early weak faults due to the lack of specialized design for non-stationary bearing fault signals. While SWT-MCNN and CWT-AA-ResNet outperform traditional CNN in single fault recognition, they still have insufficient discriminative ability between composite faults and healthy samples. It can be clearly observed from the misclassification distribution that the multi-source feature fusion structure and channel attention mechanism designed in this paper can effectively enhance the feature representation of weak faults and composite faults.

3.3. Experimental Group B

3.3.1. Introduction to the Dataset

The Korea Advanced Institute of Science and Technology (KAIST) public dataset [30] is collected under a wide range of variable speed conditions from 680 RPM to 2460 RPM, covering four states: healthy, inner race fault, outer race fault, and rolling element fault. We selected 120 s state data of rolling element bearings under different rotational speeds from the original dataset and classified them according to the included fault states. The number of samples for each fault is expanded to 3000, and non-overlapping sliding windows are used for signal segmentation. The dataset is finally divided into training, validation, and test sets with a ratio of 7:2:1, consistent with the experimental protocol of the PU dataset. Figure 4 depicts the test rig employed.

3.3.2. Experimental Results and Analysis

This paper uses vibration, current and rotational speed data from the dataset. To quantitatively evaluate and compare the inherent noise robustness of different diagnostic models under strictly controlled and reproducible conditions, Gaussian white noise of different intensities is added to the original signals as a standard benchmark test. This is not intended to directly simulate the complex and diverse noise environment of actual port cranes, but rather to provide a unified baseline for fair performance comparison between models. Five noise levels are tested: raw data (no added noise), 30 dB, 20 dB, 10 dB, and 0 dB. All signals are resampled for time synchronization to ensure the temporal consistency of the multi-source signals. The signal-to-noise ratio is defined as the power ratio over the entire signal sequence:

$$\mathrm{SNR} = 10 \log_{10} \frac{P_{signal}}{P_{noise}} \tag{15}$$

where $P_{signal}$ and $P_{noise}$ represent the average power of the original clean signal and of the additive noise, respectively. Considering the impact of the actual working environment on the signals, noise is added only to the vibration and current signals. The diagnostic accuracy of each model under variable working conditions with different SNR levels is shown in Table 3.
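Equation (15) can be inverted to obtain the noise power needed for a target SNR, $P_{noise} = P_{signal} / 10^{\mathrm{SNR}/10}$. The sketch below applies this to a synthetic sine wave standing in for a measured signal; the test signal, the seed and the function names are assumptions of this illustration, not the authors' code.

```python
import math
import random
from typing import List, Tuple


def mean_power(x: List[float]) -> float:
    """Average power over the whole sequence."""
    return sum(v * v for v in x) / len(x)


def add_awgn(signal: List[float], target_snr_db: float,
             rng: random.Random) -> Tuple[List[float], List[float]]:
    """Add zero-mean Gaussian white noise scaled to reach the target SNR in dB."""
    noise_power = mean_power(signal) / (10.0 ** (target_snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    noise = [rng.gauss(0.0, sigma) for _ in signal]
    noisy = [s + n for s, n in zip(signal, noise)]
    return noisy, noise


def measured_snr_db(signal: List[float], noise: List[float]) -> float:
    """SNR in dB computed from the clean signal and the injected noise."""
    return 10.0 * math.log10(mean_power(signal) / mean_power(noise))


rng = random.Random(0)
# Synthetic 50 Hz sine sampled at 10 kHz, used only as a placeholder signal.
clean = [math.sin(2.0 * math.pi * 50.0 * t / 10_000.0) for t in range(20_000)]

for target in (30, 20, 10, 0):
    _, noise = add_awgn(clean, target, rng)
    print(target, round(measured_snr_db(clean, noise), 2))
```

At 0 dB the noise carries as much power as the signal itself, which is why that level is the hardest case in the benchmark.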

As shown in Table 3, the diagnostic accuracy of all models decreases with the reduction in SNR under variable working conditions, but the degradation degree varies significantly:

(1): CWT-AA-ResNet and SWT-MCNN suffer from severe accuracy degradation under variable working conditions, especially in low SNR scenarios. At 0 dB, the accuracy of CWT-AA-ResNet and SWT-MCNN drops to 87.93% and 85.74% respectively, with a decline of more than 10% compared with the high SNR condition. This is because the single signal input lacks complementary information of working conditions and electromagnetic features, making it difficult to distinguish fault features from background noise and speed fluctuations.
(2): The original MTFL-Net maintains the highest diagnostic accuracy at all SNR levels. Even at 0 dB under variable working conditions, its accuracy still reaches 93.36%. The pruned MTFL-Net also outperforms all comparative lightweight models, with an accuracy of 90.41% at 0 dB, verifying the robustness of the model structure and pruning strategy.
(3): The performance gap between MTFL-Net and other models widens as the working condition fluctuates and noise intensity increases, which fully demonstrates that the multi-source fusion framework, differentiated time-frequency preprocessing, and attention mechanism can effectively capture stable fault features under non-stationary and strong interference scenarios.

3.3.3. Confusion Matrix Analysis

To further analyze the classification performance of each model for different fault types under variable working conditions, the confusion matrices of the original MTFL-Net, CWT-AA-ResNet, SWT-MCNN, ViT, ResNet-18 and WDCNN on the test set are shown in Figure 5. The labels on the coordinate axes are: NS (normal state), IRF (inner race fault), ORF (outer race fault), REF (rolling element fault).

The confusion matrix leads to the following observations:

(1): MTFL-Net achieves a recognition accuracy of over 95% for all fault types, with 99.2% for normal state, 98.7% for inner race fault, 97.5% for outer race fault, and 95.3% for rolling element fault. Inner and outer race faults produce clear periodic impact signals with well-defined characteristic frequencies, which are relatively easy to capture even under speed fluctuations, so the model shows almost no misclassification for these two fault types. In contrast, rolling element faults generate much weaker and more dispersed impact features, and their characteristic frequencies are prone to overlap and aliasing with working condition interference components, making them the most challenging fault type to diagnose in practice. The misclassification in our model mainly occurs between rolling element fault and outer race fault, which is consistent with the actual feature variation law of bearing faults under variable speed conditions.
(2): CWT-AA-ResNet and SWT-MCNN show significant misclassification for rolling element fault, with recognition accuracies of only 91.2% and 88.5%, respectively. A large number of rolling element fault samples are misclassified as normal state or outer race fault, because the single input cannot fully capture the weak fault features under speed fluctuations, and the fixed convolution kernel cannot adapt to the change in fault characteristic frequency.
(3): Compared with the other models, MTFL-Net significantly improves the recognition accuracy of weak fault types under variable working conditions, which is attributed to the multi-source signal fusion that provides complementary fault information, and the channel attention mechanism that adaptively enhances the weight of weak fault features.

To verify the retention of multi-source fusion logic by the targeted pruning strategy, we counted the channel retention rate and fault feature distribution of each modality before and after pruning:

Channel retention distribution: The convolutional channel retention rate was 42% for the vibration branch and 38% for the current branch, with uniform distribution and no over-pruning of any modality;
Fault type accuracy comparison: After pruning, the accuracy of single-point faults (inner/outer race) decreased from 100% to 98.7%, and rolling element faults from 95.3% to 92.1%, with the accuracy drop of all fault types controlled within 3.2%.

The above results show that the proposed targeted pruning strategy can preferentially retain the fault-sensitive channels of each modality, and completely retain the core logic of multi-source fusion while significantly compressing the model.

3.3.4. Ablation Study for Variable-Condition Scenarios

To verify the effectiveness of each core module of MTFL-Net in variable working condition scenarios, five groups of ablation control experiments are designed, and the results are shown in Table 4. The evaluation metric is the diagnostic accuracy under 20 dB SNR and variable working conditions.

The ablation results show that:

Removing multi-source fusion leads to the largest accuracy drop, which confirms that the complementary information from current and vibration signals is critical for fault feature extraction under variable working conditions. Removing the speed signal input results in an accuracy decrease of 4.00%, which verifies that the introduction of speed working condition information can effectively help the model adapt to speed fluctuations and improve the generalization ability under variable working conditions, which is a key advantage of MTFL-Net over CWT-AA-ResNet and SWT-MCNN. Removing the residual connection and channel attention leads to accuracy drops of 5.72% and 2.79% respectively, indicating that the residual structure alleviates the feature degradation of deep networks, and the channel attention mechanism enhances the fault-sensitive features, both of which play an important role in maintaining model performance under non-stationary conditions.

To further isolate the working condition adaptation effect of the speed signal, we designed a control experiment: all samples were input into the model after the speed was uniformly normalized to 1500 RPM. At this time, the diagnostic accuracy dropped to 92.42%, which was lower than 93.15% without speed input. This result indicates that the core value of the speed signal is to provide a prior distribution of features under variable working conditions, helping the model distinguish between feature changes caused by speed fluctuations and those caused by real faults; when the working condition is fixed, the speed signal no longer provides effective information, but instead introduces redundant dimensions leading to a slight decrease in accuracy.

4. Discussion

4.1. Summary of Core Innovations

To address the core challenges in bearing fault diagnosis of port portal gantry cranes, including strong background noise, non-stationary working conditions, information loss from single-sensor measurements, and difficulty in edge deployment of multi-source fusion models, this paper proposes a lightweight fault diagnosis model termed MTFL-Net. The innovations and advantages of this study are mainly reflected in the following three aspects:

First, a multi-source signal fusion framework with differentiated time-frequency preprocessing is constructed. This framework integrates vibration, current, and rotational speed signals, and adopts targeted time-frequency transformation methods for different types of signals. It fully retains the transient impact features, electromagnetic response features, and working condition correlation features of faults, breaks through the information limitation of a single sensor, and improves the anti-interference ability and feature expression capability of the model in complex industrial scenarios.

Second, a multi-branch feature extraction structure integrating residual connections, multi-scale convolution, and channel attention is designed. The structure proposed in this paper not only alleviates feature degradation and gradient vanishing in deep networks through residual connections, but also adaptively enhances the weights of fault-sensitive features via the channel attention mechanism, balancing the local feature extraction capability of convolution and the global feature modeling ability of the attention mechanism.

Third, a targeted structured channel pruning strategy for multi-branch fusion networks is proposed. By pruning redundant channels in the feature extraction branches while fully preserving the attention module, residual connections, and multi-source fusion layers, this strategy achieves parameter compression while well maintaining fault diagnosis accuracy. It resolves the conflict between multi-source feature fusion and model lightweighting, and provides a feasible solution for the edge deployment of the model in practical port scenarios.

4.2. Limitations and Future Research Directions

This study has several limitations that warrant further refinement. The proposed multi-source fusion framework currently integrates only vibration, current, and rotational speed signals, excluding complementary sensing modalities such as temperature and acoustic emission that could expand the dimensionality of fault information captured by the model. The pruning rate of the current model is manually preset as well, while the optimal pruning rate may vary significantly under different working conditions, resulting in a lack of adaptive adjustment capability. Most notably, our robustness benchmarking relies solely on Gaussian white noise, which cannot fully replicate the complex noise profiles of real port environments—including impulsive mechanical noise from cargo handling, electromagnetic interference from high-power electrical systems, and structural vibrations transmitted through crane frames—and all experiments were conducted exclusively on public bearing datasets, leaving the model’s performance in actual port gantry crane field tests unvalidated. Despite these limitations, the proposed model can also be applied to deep groove ball bearings of the same type in rotating machinery including industrial motors, water pumps, and fans.

Building on this work, our future research will address these gaps systematically. We will collect real-world vibration, current, and noise data from in-service port cranes to rigorously validate and optimize the model’s performance under authentic industrial conditions. We will also extend the MTFL-Net framework to bearing remaining useful life prediction, broadening its practical application scope, and enrich the multi-source fusion input by incorporating temperature and acoustic emission monitoring signals to provide a more comprehensive characterization of bearing health states. In addition, we will explore an adaptive structured pruning method that can adjust to complex and variable working conditions, further enhancing the model’s lightweight performance and engineering applicability.

Author Contributions

Conceptualization, Y.Y., Z.C. and H.W.; methodology, Z.C. and H.W.; software, Z.C.; resources, Y.Y.; writing—original draft preparation, Z.C. and H.W.; writing—review and editing, Y.Y., Z.C. and H.W.; visualization, Z.C.; supervision, Y.Y. and H.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Khan, M.A.; Asad, B.; Kudelina, K.; Vaimann, T.; Kallaste, A. The Bearing Faults Detection Methods for Electrical Machines—The State of the Art. Energies 2023, 16, 296.
2. Qiao, Z.; Ning, S.; Gai, Y.; Xie, C. A Digital Twin Guided Physical-Virtual Denoising Method for Early Fault Detection of Rolling Element Bearings. Mech. Syst. Signal Process. 2026, 249, 114108.
3. Er, M.B.; Koca, T. A Novel Approach for Motor Bearing Fault Detection Using EMD-Based Denoising and Detrended Fluctuation Analysis & LSTM Multimodal Hybrid Features with K-Means Clustering. J. Vib. Eng. Technol. 2026, 14, 103.
4. Cao, J.; Wen, Z.; Zhao, Q.; Zhang, Y.; Huang, L. A Rolling Bearing Fault Diagnosis Model Based on SAO-VMD and IARO-SVM. IEEE Trans. Consum. Electron. 2026, 72, 1112–1121.
5. Liu, X.; Zhang, Z.; Meng, F.; Zhang, Y. Fault Diagnosis of Wind Turbine Bearings Based on CNN and SSA–ELM. J. Vib. Eng. Technol. 2023, 11, 3929–3945.
6. Zhang, J.; Zhao, Z.; Jiao, Y.; Zhao, R.; Hu, X.; Che, R. DPCCNN: A New Lightweight Fault Diagnosis Model for Small Samples and High Noise Problem. Neurocomputing 2025, 626, 129526.
7. Wang, J.; Zhao, Y.; Wang, W.; Wu, Z. Improving Bearing Fault Diagnosis Method Based on the Fusion of Time–Frequency Diagram and a Novel Vision Transformer. J. Supercomput. 2025, 81, 262.
8. Fan, Y.; Fu, Z.; Li, H.; Yang, Y. Motor Bearing Fault Diagnosis Based on LMSWT with Improved Multiscale Convolutional Neural Network. IEEE Trans. Instrum. Meas. 2025, 74, 3537011.
9. Liu, G.; Zhang, C.; Xu, S.; Zhang, J.; Wu, L. A New Lightweight Fault Diagnosis Framework Towards Variable Speed Rolling Bearings. IEEE Access 2024, 12, 70170–70183.
10. Wu, X.; Du, J. Few-Shot Fault Diagnosis of Switch Machines Based on Multi-Sensor Fusion and Improved Class-Center Fine-Tuning Prototype Network. Control Eng. Pract. 2026, 172, 106849.
11. Chen, F.; Zhao, Z.; Hu, X.; Liu, D.; Yin, X.; Yang, J. A Nonlinear Dynamics Method Using Multi-Sensor Signal Fusion for Fault Diagnosis of Rotating Machinery. Adv. Eng. Inform. 2025, 65, 103190.
12. Wang, Y.; Wang, H.; Bai, R.; Shi, Y.; Chen, X.; Xu, Q. Enhanced Rolling Bearing Fault Diagnosis Using Multimodal Deep Learning and Singular Spectrum Analysis. Appl. Sci. 2025, 15, 4828.
13. Tong, C.; Chen, L.; Zhang, J.; Zhao, Z. Intelligent Fault Diagnosis Based on Multi-Source Information Fusion and Attention-Enhanced Networks. Sci. Rep. 2025, 15, 36222.
14. Zhang, Z.; Jiao, Z.; Li, Y.; Shao, M.; Dai, X. Intelligent Fault Diagnosis of Bearings Driven by Double-Level Data Fusion Based on Multichannel Sample Fusion and Feature Fusion under Time-Varying Speed Conditions. Reliab. Eng. Syst. Saf. 2024, 251, 110362.
15. He, Y.; Xiao, L. Structured Pruning for Deep Convolutional Neural Networks: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 2900–2919.
16. Li, H.; Qiao, Z.; Zhang, C.; Lu, Y.; Zhang, X.; Wang, J. A Lightweight Model Compression Framework for Intelligent Fault Diagnosis of Machines on Resource-Constrained Devices. Nondestruct. Test. Eval. 2025, 1–22.
17. Lee, S.; Jeon, Y.; Lee, S.; Kim, J. Tailored Channel Pruning: Achieve Targeted Model Complexity Through Adaptive Sparsity Regularization. IEEE Access 2025, 13, 12113–12126.
18. Cheng, Y.; Lin, X.; Zhu, H.; Wu, J.; Shi, H.; Ding, H. A Novel Hierarchical Structural Pruning-Multiscale Feature Fusion Residual Network for Intelligent Fault Diagnosis. Mech. Mach. Theory 2023, 184, 105292.
19. Sun, J.; Liu, Z.; Wen, J. An Efficient Lightweight Network with Improved Heterogeneous Convolution for Bearing Fault Diagnosis. IEEE Access 2025, 13, 185759–185770.
20. Brock, A.; Lim, T.; Ritchie, J.M.; Weston, N. Neural Photo Editing with Introspective Adversarial Networks. arXiv 2017.
21. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
22. Li, H.; Kadav, A.; Durdanovic, I.; Samet, H.; Graf, H.P. Pruning Filters for Efficient ConvNets. arXiv 2017.
23. Nasiri, A.; Rahideh, A.; Agah, G.R.; Hedayati Kia, S. Ball-Bearing Fault Detection of Squirrel-Cage Induction Motors Based on Single-Phase Stator Current Using Wavelet Packet Decomposition and Statistical Features. IEEE Trans. Energy Convers. 2025, 40, 1529–1537.
24. Lessmeier, C.; Kimotho, J.K.; Zimmer, D.; Sextro, W. Condition Monitoring of Bearing Damage in Electromechanical Drive Systems by Using Motor Current Signals of Electric Motors: A Benchmark Data Set for Data-Driven Classification. PHM Soc. Eur. Conf. 2016, 3.
25. Zhang, W.; Peng, G.; Li, C.; Chen, Y.; Zhang, Z. A New Deep Learning Model for Fault Diagnosis with Good Anti-Noise and Domain Adaptation Ability on Raw Vibration Signals. Sensors 2017, 17, 425.
26. Cui, X.; An, X.; Hu, J. Fault Diagnosis Method for Rolling Bearings Based on Continuous Wavelet Transform and Optimized Residual Network. In Proceedings of the 2025 5th International Symposium on Artificial Intelligence and Intelligent Manufacturing (AIIM), Chengdu, China, 19–21 September 2025; pp. 335–338.
27. Ge, M.; Chen, Y.; Liu, H.; Chen, K. Research on Bearing Fault Diagnosis Based on Synchronous Compressed Wavelet Transform and Multi-Scale Convolutional Neural Networks. In Proceedings of the 2025 12th International Forum on Electrical Engineering and Automation (IFEEA), Xi’an, China, 7–9 November 2025; pp. 906–910.
28. Liu, Z.; Mao, H.; Wu, C.-Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11966–11976.
29. Wu, K.; Zhang, J.; Peng, H.; Liu, M.; Xiao, B.; Fu, J.; Yuan, L. TinyViT: Fast Pretraining Distillation for Small Vision Transformers. In Computer Vision—ECCV 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer Nature: Cham, Switzerland, 2022; pp. 68–85.
30. Jung, W.; Kim, S.-H.; Yun, S.-H.; Bae, J.; Park, Y.-H. Vibration, Acoustic, Temperature, and Motor Current Dataset of Rotating Machine under Varying Operating Conditions for Fault Diagnosis. Data Brief 2023, 48, 109049.

Figure 1. The overall framework of the proposed MTFL-Net.

Figure 2. Paderborn test bench.

Figure 3. Confusion matrices of each model under eight types of states on the Paderborn bearing dataset.

Figure 4. Layout of the rotating machine test bench and its components.

Figure 5. Confusion matrices for each comparative trial.

Table 1. Branch network parameters.

Network Branch	Layer Type	Input Dimensions	Output Dimensions
Vibration Branch	Input Layer	B × 4 × 5 × 3 × 64 × 64	B × 4 × 5 × 3 × 64 × 64
	Dimensionality Reshaping	B × 4 × 5 × 3 × 64 × 64	B × 20 × 3 × 64 × 64
	Initial Convolution Module	B × 20 × 3 × 64 × 64	B × 32 × 3 × 32 × 32
	Multi-scale Feature Extraction Module	B × 32 × 3 × 32 × 32	B × 128 × 3 × 32 × 32
	Hierarchical Max Pooling	B × 128 × 3 × 32 × 32	B × 512 × 3 × 2 × 2
	Global Average Pooling	B × 512 × 3 × 2 × 2	B × 512 × 3 × 1 × 1
	Dimensionality Flattening	B × 512 × 3 × 1 × 1	B × 1536
	Fully Connected Layer	B × 1536	B × 512
Current Branch	Input Layer	B × 3 × 5 × 3 × 64 × 64	B × 3 × 5 × 3 × 64 × 64
	Dimensionality Reshaping	B × 3 × 5 × 3 × 64 × 64	B × 15 × 3 × 64 × 64
	Conv Block Stage 1–4	B × 15 × 3 × 64 × 64	B × 256 × 3 × 2 × 2
	Global Average Pooling	B × 256 × 3 × 2 × 2	B × 256 × 3 × 1 × 1
	Dimensionality Flattening	B × 256 × 3 × 1 × 1	B × 768
	Fully Connected Layer	B × 768	B × 512
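
The shape bookkeeping in Table 1 can be checked with a short NumPy sketch. It reproduces only the reshaping, global average pooling and flattening steps; the convolution and pooling stages are replaced by a zero tensor of the tabulated output shape, so this is not the authors' network. The function name `trace_branch` and its arguments are hypothetical, and the table does not say what the leading axes of size 4, 3 and 5 represent.

```python
import numpy as np


def trace_branch(batch, lead_axes, pooled_channels):
    """Return the tensor shapes that Table 1 lists for one branch.

    lead_axes: sizes of the two axes that the reshaping step merges
               (their meaning is not given in the table).
    pooled_channels: channel count just before global average pooling.
    """
    a, b = lead_axes
    shapes = {}

    # Input layer: B x a x b x 3 x 64 x 64.
    x = np.zeros((batch, a, b, 3, 64, 64), dtype=np.float32)
    shapes["input"] = x.shape

    # Dimensionality reshaping: merge the two leading axes into one.
    x = x.reshape(batch, a * b, 3, 64, 64)
    shapes["reshaped"] = x.shape

    # Placeholder for the learned convolution and pooling stages:
    # only the output shape from Table 1 is reproduced here.
    x = np.zeros((batch, pooled_channels, 3, 2, 2), dtype=np.float32)
    shapes["pooled"] = x.shape

    # Global average pooling over the last two (spatial) axes.
    x = x.mean(axis=(-2, -1), keepdims=True)
    shapes["gap"] = x.shape

    # Flatten everything except the batch axis.
    x = x.reshape(batch, -1)
    shapes["flat"] = x.shape
    return shapes


vibration = trace_branch(batch=2, lead_axes=(4, 5), pooled_channels=512)
current = trace_branch(batch=2, lead_axes=(3, 5), pooled_channels=256)
```

With these inputs the flattened widths come out as 512 × 3 = 1536 for the vibration branch and 256 × 3 = 768 for the current branch, matching the inputs of the two fully connected layers in the table.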

Table 2. Performance comparison of different models on the PU bearing dataset.

Model	Diagnostic Accuracy (%)	Parameters (M)	FLOPs (M)	Inference Time per Sample (ms)
WDCNN	92.26	2.87	386.52	3.64
ResNet-18	95.15	9.18	1728.64	10.32
TinyViT	94.17	5.23	621.37	6.18
ConvNeXt-Tiny	96.32	21.58	408.72	6.52
SWT-MCNN	97.83	2.12	214.56	2.15
CWT-AA-ResNet	98.11	9.78	682.34	6.82
MTFL-Net (Original)	99.87	8.17	957.33	8.43
MTFL-Net (Pruned)	96.74	3.28	292.09	2.88
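
The effect of pruning can be expressed as relative reductions computed from the two MTFL-Net rows of Table 2. The sketch below is plain arithmetic on the tabulated values; the helper name `reduction_percent` and the dictionary layout are illustrative choices, not part of the original work.

```python
def reduction_percent(original: float, pruned: float) -> float:
    """Relative reduction from `original` to `pruned`, in percent."""
    return 100.0 * (original - pruned) / original


# (original, pruned) pairs copied from the MTFL-Net rows of Table 2.
table2 = {
    "parameters_M": (8.17, 3.28),
    "flops_M": (957.33, 292.09),
    "inference_ms": (8.43, 2.88),
}

reductions = {name: reduction_percent(o, p) for name, (o, p) in table2.items()}

# Accuracy cost of pruning, in percentage points.
accuracy_drop = 99.87 - 96.74

for name, value in reductions.items():
    print(f"{name}: {value:.1f}% lower after pruning")
print(f"accuracy drop: {accuracy_drop:.2f} points")
```

On these figures pruning removes roughly 60% of the parameters, 69% of the FLOPs and 66% of the per-sample inference time, at a cost of 3.13 percentage points of accuracy on the PU dataset.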

Table 3. Diagnostic accuracy (%) of each model under variable working conditions with different SNR levels.

Model	Raw Data	30 dB	20 dB	10 dB	0 dB
WDCNN	92.47	90.53	86.16	79.29	72.45
ResNet-18	95.12	92.27	89.34	83.62	76.39
ViT	94.58	92.16	88.25	82.18	74.88
SWT-MCNN	94.17	93.62	90.04	88.35	85.74
CWT-AA-ResNet	96.52	94.85	92.73	90.61	87.93
MTFL-Net (Original)	99.64	99.28	97.15	95.37	93.36
MTFL-Net (Pruned)	96.82	96.15	94.08	92.26	90.41
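
The SNR columns in Table 3 imply that noise was injected into the signals at controlled levels. This part of the article does not state how, so the following is a minimal sketch under the common assumption of additive white Gaussian noise with SNR defined as 10·log10(P_signal / P_noise). The function names `add_awgn` and `measured_snr_db` are hypothetical.

```python
import numpy as np


def add_awgn(signal, snr_db, rng=None):
    """Add white Gaussian noise so the result has the requested SNR in dB.

    Assumption: SNR = 10 * log10(P_signal / P_noise), with power taken as
    the mean squared amplitude of the clean signal.
    """
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=np.float64)
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


def measured_snr_db(clean, noisy):
    """Estimate the SNR in dB of `noisy` relative to the known `clean` signal."""
    clean = np.asarray(clean, dtype=np.float64)
    noise = np.asarray(noisy, dtype=np.float64) - clean
    return 10.0 * np.log10(np.mean(clean ** 2) / np.mean(noise ** 2))


# Synthetic 50 Hz tone standing in for a vibration or current record.
t = np.linspace(0.0, 1.0, 100_000, endpoint=False)
clean = np.sin(2.0 * np.pi * 50.0 * t)
noisy_10db = add_awgn(clean, snr_db=10.0, rng=np.random.default_rng(0))
noisy_0db = add_awgn(clean, snr_db=0.0, rng=np.random.default_rng(1))
```

At 0 dB the noise power equals the signal power, which is the hardest setting in Table 3.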

Table 4. Ablation study results under variable working conditions.

Model Variant	Diagnostic Accuracy (%)
Without multi-source fusion (single vibration signal)	89.27
Without speed signal input	93.15
Fixed speed condition (all samples normalized to 1500 RPM)	92.42
Without residual connection	91.43
Without channel attention	94.36
Full MTFL-Net	97.15
