1. Introduction
The water injection pump serves as a core power device in oilfield development, maintaining reservoir pressure through high-pressure water injection while undertaking the dual functions of energy conversion and fluid transportation. Sustained production system operation and safety hinge directly on the equipment’s operating condition, underscoring its pivotal role in crude oil recovery and oilfield development performance [1]. Due to prolonged exposure to harsh operating conditions such as high pressure and highly corrosive media, water injection pumps are prone to mechanical wear, seal failure, and other faults. These issues may lead to fluctuations in injection pressure, reduced injection efficiency, increased energy consumption, and, in severe cases, production accidents such as pipeline bursts. In deep oilfield development scenarios, where bottom-hole pressure gradients are high and medium compositions are more complex, water injection pump failures can lead to paralysis of the water injection system, reduce crude oil recovery rates, and severely compromise the economic benefits of oilfield development.
Figure 1 shows a photograph of a water injection pump at an oilfield site.
Among various types of water injection pumps, the plunger pump has become the mainstream model in modern oilfield water injection systems due to its high working pressure, high volumetric efficiency, and flexible flow rate adjustment capability. It pressurizes the medium through the reciprocating motion of plungers within cylinders, which alters the volume of the working chamber. However, its complex mechanical structure leads to diverse fault modes. Moreover, these faults involve high-dimensional feature spaces containing substantial redundant and ineffective features with coupling relationships. Different faults may exhibit similar patterns in pressure, flow rate, or vibration anomalies, further interfering with the diagnostic process and reducing the accuracy of fault identification.
In recent years, researchers have applied both traditional machine learning and deep learning to plunger pump fault diagnosis [2,3,4,5,6,7,8]. However, existing methods still suffer from several limitations. First, different fault types exhibit scale variations in feature representation, and fixed-size convolution kernels struggle to comprehensively capture multi-scale features, limiting the model’s adaptability to complex operating conditions. Second, high-dimensional feature spaces often contain substantial redundant and ineffective information, which interferes with the accuracy of fault identification. Third, existing diagnostic models generally suffer from high computational complexity and large memory consumption, making them difficult to deploy on industrial edge devices and unable to meet the requirements for real-time fault diagnosis and early warning in industrial settings. Therefore, a diagnostic model must not only possess strong feature extraction and representation capabilities but also address the requirements of lightweight design and real-time performance.
To address the above issues, this paper proposes a fault diagnosis method for plunger pumps based on multi-scale convolution and attention, termed MSLAN. The method captures fault features at different scales through a multi-scale fusion convolution module, introduces an attention mechanism to suppress redundant information, and adopts lightweight strategies such as depthwise separable convolutions to reduce model complexity. Experiments on a self-constructed plunger pump dataset and two public datasets validate the effectiveness and generalizability of the proposed method.
5. Experiment
5.1. Dataset
To comprehensively validate the effectiveness and generalization capability of the MSLAN model presented herein, three datasets were selected for experimental validation. Real plunger pump data collected from an oilfield in Shandong Province constitute a self-constructed dataset used to verify the method’s diagnostic performance in real-world engineering settings. Generalization across rotating machinery fault diagnosis tasks is assessed using the CWRU rolling bearing dataset [11] and the Southeast University gearbox dataset [12], demonstrating that the proposed approach is applicable beyond plunger pumps and can be extended to other mechanical components.
Self-Built Dataset: This dataset originates from real operating condition data of a plunger pump in an oilfield in Shandong Province, reflecting equipment operating states in actual industrial environments. The experimental object is a triplex single-acting plunger pump used in the oilfield water injection system. The monitoring system deploys multiple types of sensors at key positions of the pump body, including a pump discharge pressure sensor, a housing vibration acceleration sensor, a crosshead temperature sensor, and a drive motor current sensor. The data acquisition system synchronously records signals from each channel, forming a multi-source time-series dataset. The dataset contains three fault types of the plunger pump, as well as a healthy state. The fault types are connecting flange misalignment, stuffing box leakage, and dynamic seal failure. The dataset contains a total of 2412 samples, including 954 samples of healthy status, 504 samples of connecting flange misalignment, 504 samples of stuffing box leakage, and 450 samples of dynamic seal failure.
Figure 7 illustrates the sensor placement of the field data acquisition platform, where markers 1–7 indicate the installation positions of vibration sensors for collecting signals in the X, Y, and Z directions.
CWRU Dataset: The rolling bearing dataset from Case Western Reserve University (CWRU) is among the most authoritative and extensively employed public datasets in fault diagnosis research, originating from the university’s Electrical Engineering Laboratory. To assess the proposed method’s performance in bearing fault identification, this dataset was selected for supplementary experiments. The experimental setup comprises a motor, a torque sensor, and a dynamometer, with the test bearings supporting the motor shaft. During the experiments, three categories of single-point defects (inner race, outer race, and ball faults) were artificially introduced into the bearings via electric discharge machining, with fault diameters of 0.007, 0.014, and 0.021 inches [11]. Vibration signals were acquired using an accelerometer positioned on the drive-end motor housing, operating at a sampling frequency of 12 kHz. For this study, the operating condition under 0 hp load was adopted, with a sampling length of 512 and an overlap rate of 0.75. Ten distinct health conditions were selected, with each condition containing 117 samples.
Southeast University Dataset: The gearbox dataset from Southeast University is one of the commonly used public datasets within the fault diagnosis community. This dataset was selected for supplementary experiments to validate the proposed method’s capability for gear fault diagnosis. The experimental platform comprises a motor, motor controller, planetary gearbox, parallel-axis gearbox, brake, and controller, with the test gears housed inside the gearbox. During the experiments, four categories of faults (tooth surface wear, gear root crack, tooth defect, and tooth breakage) were artificially created in the gears through machining, alongside a healthy state serving as a control [12]. Vibration measurements were obtained via accelerometers installed at strategic locations on the test rig, collecting signals from eight channels at a sampling frequency of 5120 Hz. The operating condition with a rotational speed of 1200 r/min and a load of 0 N·m was selected for this study, utilizing a sampling length of 1024 and an overlap rate of 0.75. Five distinct health conditions were selected, with each condition containing 4092 samples.
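The overlapping segmentation described for the two public datasets can be sketched as follows. This is a minimal illustration, assuming that an overlap rate of 0.75 means each window advances by one quarter of its length; the function name `segment_signal` and the 2048-point record are hypothetical and are not taken from the original preprocessing code.

```python
def segment_signal(signal, window_length, overlap):
    """Split a 1-D sequence into fixed-length windows with fractional overlap."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    # Stride: number of new points by which each window advances.
    stride = max(1, int(round(window_length * (1.0 - overlap))))
    windows = []
    start = 0
    # Keep only complete windows; a trailing partial window is discarded.
    while start + window_length <= len(signal):
        windows.append(signal[start:start + window_length])
        start += stride
    return windows


# Hypothetical record of 2048 points, used only to illustrate the window count.
record = list(range(2048))
cwru_windows = segment_signal(record, window_length=512, overlap=0.75)   # stride 128
seu_windows = segment_signal(record, window_length=1024, overlap=0.75)   # stride 256
print(len(cwru_windows), len(seu_windows))
```

With a stride of 128 points (CWRU setting) or 256 points (Southeast University setting), consecutive samples share three quarters of their content, which increases the number of samples obtained from a record of fixed length.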
Each dataset was randomly divided into training, validation, and test sets.
Table 1 summarizes the sample sizes for each dataset.
5.2. Experimental Setup
All experiments were conducted on a hardware platform consisting of an Intel Core i5-12400 processor, an NVIDIA GeForce RTX 5060 8 GB graphics card, and 16 GB RAM. The programs were written in Python 3.9, with the deep learning framework PyTorch 2.7.1 and CUDA version 12.8.
All model parameters were initialized according to a preset random seed. During training, parameters were updated through the backpropagation algorithm and optimized using the Adam optimizer, with a learning rate of 0.001 and a batch size of 32. The maximum number of training epochs was set to 200, and an early stopping mechanism was introduced, with the F1-score on the validation set as the monitoring metric. If this metric showed no improvement for ten consecutive training epochs, training was terminated early, and the model weights with the best validation performance were saved.
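The early stopping rule described above can be sketched in a framework-independent way. The class `EarlyStopping`, the helper `run_training`, and the validation curve below are hypothetical names and values introduced for illustration; in an actual PyTorch training loop, the model weights would be saved at the point marked in the comment.

```python
class EarlyStopping:
    """Stop training when the monitored score has not improved for `patience` epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best_score = float("-inf")
        self.best_epoch = -1
        self.epochs_without_improvement = 0

    def step(self, epoch, score):
        """Record one epoch and return True if training should stop."""
        if score > self.best_score:
            self.best_score = score
            self.best_epoch = epoch
            self.epochs_without_improvement = 0
            # A real training loop would save the model weights here.
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


def run_training(validation_f1_per_epoch, max_epochs=200, patience=10):
    """Replay a validation F1 curve and report where training would end."""
    stopper = EarlyStopping(patience=patience)
    last_epoch = -1
    for epoch, f1 in enumerate(validation_f1_per_epoch[:max_epochs]):
        last_epoch = epoch
        if stopper.step(epoch, f1):
            break
    return stopper.best_epoch, stopper.best_score, last_epoch


# Hypothetical validation curve: improves for five epochs, then plateaus.
curve = [0.50, 0.60, 0.70, 0.80, 0.85] + [0.84] * 50
best_epoch, best_f1, stopped_at = run_training(curve)
print(best_epoch, best_f1, stopped_at)
```

In this sketch, an epoch counts as an improvement only if the validation F1-score strictly exceeds the best value seen so far; whether ties count as improvement is an assumption, since the text does not specify it.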
To evaluate the robustness of the proposed model in complex industrial environments, Gaussian white noise was added to the raw signals of the CWRU rolling bearing dataset and the Southeast University gearbox dataset to simulate background noise interference under actual operating conditions. A single noise level was employed in this experiment, with a uniform signal-to-noise ratio (SNR) of 10 dB. Specifically, for the CWRU dataset, the standard deviation of the raw signal is 0.36, the standard deviation of the added noise is 0.11, and the noise variance is 0.10 times the signal variance. For the Southeast University gearbox dataset, the standard deviation of the raw signal is 0.05, the standard deviation of the added noise is 0.02, and the noise variance is 0.10 times the signal variance. Noise addition was performed prior to data input, and all compared methods were trained and tested under the same noise conditions to ensure fairness of the experimental results.
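The relation between the 10 dB SNR and the reported standard deviations can be checked with a short sketch. It assumes that signal power is measured as the variance of the raw signal, consistent with the variance ratio quoted above; the function names and the sinusoidal test signal are hypothetical.

```python
import math
import random


def noise_std_for_snr(signal, snr_db):
    """Return the noise standard deviation that yields the requested SNR in dB."""
    mean = sum(signal) / len(signal)
    # Signal power is taken as the variance (mean removed).
    signal_var = sum((x - mean) ** 2 for x in signal) / len(signal)
    # SNR_dB = 10 * log10(signal_var / noise_var)
    noise_var = signal_var / (10.0 ** (snr_db / 10.0))
    return math.sqrt(noise_var)


def add_gaussian_noise(signal, snr_db, seed=0):
    """Add zero-mean Gaussian white noise at the requested SNR."""
    rng = random.Random(seed)
    sigma = noise_std_for_snr(signal, snr_db)
    return [x + rng.gauss(0.0, sigma) for x in signal]


# Hypothetical test signal: a sine wave scaled to a standard deviation of 0.36,
# the value reported for the raw CWRU signal.
n = 4096
amplitude = 0.36 * math.sqrt(2.0)
clean = [amplitude * math.sin(2.0 * math.pi * 8 * i / n) for i in range(n)]
sigma = noise_std_for_snr(clean, snr_db=10.0)
noisy = add_gaussian_noise(clean, snr_db=10.0)
print(round(sigma, 2))
```

At 10 dB the noise variance is one tenth of the signal variance, so the noise standard deviation is the signal standard deviation divided by the square root of 10. This gives roughly 0.11 for a signal standard deviation of 0.36 and roughly 0.02 for 0.05, in agreement with the values stated for the two datasets.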
The key architectural parameters of the proposed MSLAN model are presented in Table 2.
5.3. Comparison Experiments
To assess the performance of MSLAN in plunger pump fault diagnosis, a comparative experiment was designed against other deep learning models. The selected models encompass commonly used time series classification models (FCN [13], ResNet [13], InceptionTime [14], TimesNet [15]), models frequently employed in fault diagnosis (WDD-CNN [16], CNN-BiLSTM [17], WDCNN [18]), and lightweight models (MobileNet [19], ShuffleNet [20], GhostNet [21]). Through comparison with these three categories, the performance of MSLAN was evaluated in terms of diagnostic accuracy, model complexity, and computational efficiency. To ensure fairness, the learning rate was adjusted to the optimal value specified in the original papers or kept consistent with MSLAN training. The batch size was uniformly set to 32, the number of epochs to 100, and early stopping was adopted. The loss function was uniformly set as cross-entropy loss, and the Adam optimizer was used. Model performance was evaluated using F1-score, FLOPs, and Params, with experimental results presented in Table 3, Table 4 and Table 5.
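For reference, a multi-class F1-score can be computed as in the sketch below. The text does not state which averaging scheme is used, so macro averaging (the unweighted mean of per-class F1-scores) is an assumption here, and the label lists are hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1-scores (macro averaging, assumed here)."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        denominator = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical labels for a three-class problem.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

Macro averaging weights every class equally, which matters for the self-constructed dataset because the healthy class (954 samples) is roughly twice as large as each fault class.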
The comparative experimental results demonstrate that MSLAN achieves strong and stable diagnostic performance across all three datasets, with F1-scores of 88.95%, 98.89%, and 99.90% on the self-constructed dataset, CWRU dataset, and gearbox dataset, respectively. While maintaining high accuracy, both the computational cost (FLOPs) and the number of parameters (Params) of MSLAN are significantly lower than those of the other compared models. Compared with time-series classification models such as FCN, ResNet, InceptionTime, and TimesNet, MSLAN reduces computational overhead by one to two orders of magnitude while achieving comparable or even superior accuracy. In comparison with commonly used fault diagnosis models such as WDD-CNN, CNN-BiLSTM, and WDCNN, MSLAN exhibits a notable improvement in F1-score on the self-constructed dataset while maintaining a more lightweight architecture. Furthermore, relative to lightweight models such as MobileNet, ShuffleNet, and GhostNet, MSLAN achieves substantially higher diagnostic accuracy across all datasets while retaining a clear advantage in model complexity. Overall, among the compared models, MSLAN offers the most favorable trade-off among diagnostic accuracy, model size, and computational cost, supporting its effectiveness in the task of plunger pump fault diagnosis.
It is worth noting that the F1-scores of all comparative methods on the self-constructed dataset are significantly lower than those on the two public datasets, indicating that the self-constructed dataset itself presents a higher level of diagnostic difficulty. This is primarily attributed to the stronger environmental noise, more complex operational fluctuations, and inter-class overlap among fault features inherent in the field-collected plunger pump data, which faithfully reflects the intrinsic challenges of real-world industrial fault diagnosis.
5.4. Ablation Experiments
To achieve higher diagnostic accuracy in plunger pump fault diagnosis tasks, MSLAN introduces cross-channel convolution in the SMSF module to effectively integrate multi-source information, and employs depthwise separable convolution along with shared pointwise convolution strategies to reduce computational redundancy while efficiently extracting multi-dimensional fault features. Additionally, a multi-branch parallel attention mechanism is introduced in the MBPA module to enhance the perception capability of critical fault features. To validate the effectiveness of the above components in plunger pump fault diagnosis tasks, multiple ablation experiments are conducted on the self-constructed dataset, where one or a combination of modules is selected for each experiment. The final ablation experiment results are shown in Table 6. It is worth noting that, since multi-scale convolution serves as the core structure of the network, it is not subjected to ablation analysis in these experiments.
From the experimental results, it can be observed that the complete model, integrating cross-channel convolution, depthwise separable convolution with shared pointwise convolution strategies, and the MBPA attention mechanism, achieves the optimal diagnostic performance, with an F1-Score of 88.95%, which is 2.08 percentage points higher than that of the baseline model. Meanwhile, the computational complexity is significantly reduced to 8.79 MFLOPs, representing a reduction of over 90%, and the parameter count is reduced by an order of magnitude. Regarding the contribution of each component, the introduction of cross-channel convolution alone reduces computational complexity but leads to a decrease in accuracy due to information loss. The depthwise separable convolution with shared pointwise convolution strategy significantly improves computational efficiency while maintaining accuracy. In the comparison of attention mechanisms, MBPA achieves a 1.34 percentage point improvement in accuracy compared to the standard SE module under similar computational complexity, validating the effectiveness of the multi-branch parallel structure in modeling channel dependencies. In summary, the synergistic effect of SMSF and MBPA enables the model to achieve optimal diagnostic accuracy while maintaining extremely high computational efficiency, fully demonstrating the effectiveness and practicality of the proposed method in plunger pump fault diagnosis tasks.
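The source of the parameter savings from depthwise separable convolution and a shared pointwise convolution can be illustrated with a simple count for a multi-branch 1-D convolution layer. The channel numbers and kernel sizes below are hypothetical and do not correspond to the MSLAN configuration in Table 2; it is also assumed that sharing means one pointwise convolution serves all branches, and biases are ignored.

```python
def standard_conv1d_params(c_in, c_out, kernel_size):
    """Weights of a standard 1-D convolution (bias ignored)."""
    return c_in * c_out * kernel_size


def depthwise_params(c_in, kernel_size):
    """Weights of a depthwise 1-D convolution: one kernel per input channel."""
    return c_in * kernel_size


def pointwise_params(c_in, c_out):
    """Weights of a 1x1 (pointwise) convolution that mixes channels."""
    return c_in * c_out


def multi_branch_params(c_in, c_out, kernel_sizes, mode):
    """Total weights of a multi-scale layer with one branch per kernel size."""
    if mode == "standard":
        return sum(standard_conv1d_params(c_in, c_out, k) for k in kernel_sizes)
    depthwise_total = sum(depthwise_params(c_in, k) for k in kernel_sizes)
    if mode == "separable":
        # Each branch owns its own pointwise convolution.
        return depthwise_total + len(kernel_sizes) * pointwise_params(c_in, c_out)
    if mode == "shared_pointwise":
        # Assumption: a single pointwise convolution is reused by all branches.
        return depthwise_total + pointwise_params(c_in, c_out)
    raise ValueError("unknown mode: " + mode)


# Hypothetical layer: 32 input channels, 64 output channels, kernels 3, 5, 7.
counts = {
    mode: multi_branch_params(32, 64, (3, 5, 7), mode)
    for mode in ("standard", "separable", "shared_pointwise")
}
print(counts)
```

For this hypothetical layer the count falls from 30,720 weights with standard convolutions to 6,624 with per-branch depthwise separable convolutions and 2,528 with a shared pointwise convolution, a reduction of about 92% overall. The figure applies only to this example, although it is of the same order as the reductions reported in Table 6.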
5.5. Visualization Analysis
To further demonstrate the effectiveness of the MBPA module, this section randomly selects fault samples from the self-built plunger pump dataset, the CWRU dataset, and the Southeast University dataset and visualizes the feature maps of the two MBPA modules in the MSLAN network. The feature maps before and after each of the two MBPA modules are shown in Figure 8, Figure 9 and Figure 10, where the left side of each figure shows the original feature map and the recalibrated feature map of the first MBPA layer, and the right side shows those of the second MBPA layer.
The intra-module attention mechanism assigns channel-wise attention weights at each layer, revealing that channels with higher weights have greater influence on downstream computations. This channel prioritization intensifies as the network advances toward the output layer. By this stage, the model has nearly completed its processing of the original data, feature variations show little further change, and the most informative channels have been identified. From the self-built dataset results presented in Figure 8, the feature map changes observed in the first MBPA layer remain modest, while only a subset of channels retain large values within the second MBPA layer. In Figure 9 on the CWRU dataset, the attention weight changes in both MBPA modules are very distinct, with different channels being activated and suppressed after each layer. Figure 10, obtained from the Southeast University dataset, illustrates that with increasing network depth, the feature maps evolve from a uniformly distributed pattern in the first layer toward alternating light-dark stripe patterns in the second layer, a shift that indicates how the attention mechanism enables the network to discern channel significance.
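The channel recalibration visible in these feature maps can be illustrated with the standard squeeze-and-excitation (SE) operation, which serves as the attention baseline in the ablation study. The sketch below is not the MBPA module, whose multi-branch structure is not reproduced here; the function name, the feature map, and the hand-picked weight matrices are hypothetical.

```python
import math


def se_recalibrate(feature_map, w1, w2):
    """Standard SE-style channel attention on a 1-D feature map.

    feature_map: list of C channels, each a list of L values.
    w1: (C/r) x C weight matrix of the reduction layer.
    w2: C x (C/r) weight matrix of the expansion layer.
    """
    # Squeeze: global average pooling over the time axis, one value per channel.
    squeezed = [sum(channel) / len(channel) for channel in feature_map]
    # Excitation: reduction layer with ReLU, then expansion layer with sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    weights = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every channel by its attention weight.
    recalibrated = [
        [weight * value for value in channel]
        for weight, channel in zip(weights, feature_map)
    ]
    return weights, recalibrated


# Hypothetical 4-channel feature map with reduction ratio r = 2.
feature_map = [[1.0, 1.0], [2.0, 2.0], [0.0, 0.0], [4.0, 4.0]]
w1 = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
w2 = [[1.0, 0.0], [0.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]
weights, recalibrated = se_recalibrate(feature_map, w1, w2)
print([round(w, 3) for w in weights])
```

Each weight lies between 0 and 1, so channels with weights near 1 pass through almost unchanged while channels with small weights are attenuated. This is the kind of contrast between retained and suppressed channels that the recalibrated feature maps display.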
To intuitively evaluate the classification performance and feature learning capability of the MSLAN model, ResNet, WDCNN, and MobileNet were selected for visual comparison with MSLAN. The t-SNE [22] dimensionality reduction visualization and confusion matrix analysis were conducted on three datasets, and the results are presented in Figure 11, Figure 12, Figure 13, Figure 14, Figure 15, Figure 16, Figure 17 and Figure 18.
The t-SNE visualization results show that the features extracted by MSLAN exhibit the clearest clustering structure in two-dimensional space. Different fault categories are well-separated with distinct boundaries, and samples within the same class are tightly clustered, indicating that MSLAN effectively learns highly discriminative fault features. In contrast, the feature distributions of ResNet and WDCNN show some degree of inter-class overlap, with blurred boundaries between certain fault categories. MobileNet exhibits the poorest clustering performance, with multiple categories intermixed and difficult to distinguish. Compared to the public CWRU and Southeast University datasets, the self-constructed dataset exhibits more noticeable inter-class overlap in the t-SNE visualization, with samples of different fault types not being clearly separated and the clustering structure being relatively ambiguous. The self-constructed dataset originates from real-world oilfield operating conditions, where the signals inevitably contain complex environmental noise, operational fluctuations, and sensor measurement errors, resulting in more intricate fault feature distributions and relatively ambiguous inter-class boundaries. In contrast, the CWRU and Southeast University datasets are typically collected under controlled laboratory conditions, where fault patterns are more distinct and noise levels are lower, leading to better feature clustering performance. Although the clustering performance of the self-constructed dataset is inferior to that of the public datasets, MSLAN still achieves a better clustering structure compared to other methods, further demonstrating its resilience to complex industrial data.
From the confusion matrix analysis, MSLAN achieves the highest classification accuracy across all datasets, with noticeably darker diagonal cells than the other models and very few misclassified samples. ResNet and WDCNN exhibit some misclassifications between easily confusable fault categories, while MobileNet shows the most severe misclassification, with notably low recognition accuracy for several categories.
Overall, the visualization results confirm that MSLAN consistently outperforms ResNet, WDCNN, and MobileNet across both feature extraction and classification tasks, attesting to the proposed method’s robustness and efficacy in plunger pump fault diagnosis.
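For readers reproducing the confusion matrix analysis, the sketch below shows how such a matrix and the per-class recognition rate (recall) can be derived from predicted labels. The four class names follow the self-built dataset, but the label sequences are hypothetical and do not reproduce any reported result.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the true class and columns index the predicted class."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


def per_class_recall(matrix):
    """Fraction of each true class that lies on the diagonal."""
    recalls = []
    for index, row in enumerate(matrix):
        total = sum(row)
        recalls.append(row[index] / total if total else 0.0)
    return recalls


class_names = [
    "healthy",
    "connecting flange misalignment",
    "stuffing box leakage",
    "dynamic seal failure",
]
# Hypothetical labels, for illustration only.
y_true = [0, 0, 0, 1, 1, 2, 2, 3, 3, 3]
y_pred = [0, 0, 1, 1, 1, 2, 3, 3, 3, 2]
matrix = confusion_matrix(y_true, y_pred, num_classes=len(class_names))
recalls = per_class_recall(matrix)
for name, recall in zip(class_names, recalls):
    print(name, round(recall, 3))
```

Off-diagonal entries show which pairs of classes are confused. In this hypothetical example the confusion between stuffing box leakage and dynamic seal failure appears in the matrix cells at rows 2 and 3.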
6. Discussion
This paper proposes a lightweight fault diagnosis method based on multi-scale convolution and attention mechanisms, termed MSLAN, to address the challenges in plunger pump fault diagnosis, including the difficulty in capturing multi-scale fault features, interference from redundant information in high-dimensional feature spaces, and high model computational complexity. By leveraging the SMSF module, the method efficiently captures fault features across different scales with minimal computational expense, successfully addressing the receptive field limitation of a single convolution kernel. Additionally, the MBPA module is introduced, which significantly enhances the model’s ability to perceive critical fault features and effectively suppresses redundant information through refined modeling of complex inter-channel dependencies. Experiments conducted on three datasets—a self-built plunger pump dataset, the CWRU bearing dataset, and the Southeast University gearbox dataset—confirm that the proposed method delivers outstanding performance for plunger pump fault diagnosis while demonstrating strong generalization across other rotating machinery fault diagnosis tasks. More importantly, the incorporation of lightweight techniques—namely depthwise separable convolution and shared pointwise convolution—enables the model to considerably reduce parameter count and computational burden while sustaining high diagnostic accuracy, meeting the requirements for deployment on edge devices and real-time diagnosis in industrial contexts.
Although the proposed method achieves remarkable results in terms of diagnostic accuracy and computational efficiency, certain limitations remain.
First, in terms of multi-source feature fusion, while the proposed method can process input data from multiple channels including vibration, pressure, temperature, and current sensors, it primarily relies on channel concatenation for feature fusion, failing to fully exploit the complementarity and correlation among different sensor signals at the physical mechanism level. Specifically, pressure signals reflect the pump’s working load and fluid pulsation characteristics, vibration signals reflect impact and wear conditions of mechanical components, temperature signals reflect thermodynamic changes in friction pairs, and current signals reflect the motor’s driving state and load fluctuations. These signals often exhibit different temporal scale response characteristics and coupling relationships when faults occur. Simple channel stacking struggles to effectively model the deep synergistic mechanisms among heterogeneous sensors, potentially limiting the model’s comprehensive perception of complex fault patterns.
Second, the method was validated on one self-built dataset and two public datasets, each of which covers either controlled laboratory conditions or a limited set of specific operating conditions. In actual oilfield production, equipment operating conditions are complex and variable, with dynamic changes in load, rotational speed, environmental noise, and other factors imposing higher demands on the model’s generalization capability. Although the experiments simulated some interference by adding Gaussian white noise, this still deviates from the complex and non-stationary noise present in real industrial environments.
Third, the lower diagnostic performance of the proposed method on the self-constructed plunger pump dataset, compared with the two public datasets, is not a model flaw but an objective reflection of the inherent challenges posed by real-world industrial data. The self-constructed dataset, collected from actual oilfield operating conditions, contains stronger environmental noise, more complex operational fluctuations, and inter-class overlap of fault features, truly reflecting the greater difficulty of industrial fault diagnosis tasks. Future improvements for this challenging scenario may include the introduction of robust denoising modules, the design of imbalanced-data loss functions, and the exploration of multi-sensor fusion strategies to further improve diagnostic performance in complex industrial environments.
To address the above limitations, future work will focus on the following directions.
First, a physics-informed multi-source feature fusion mechanism will be constructed. We plan to deeply analyze the physical response patterns of sensor signals such as vibration, pressure, temperature, and current during fault occurrence, and design feature alignment and fusion strategies based on physical priors. For instance, temporal alignment mechanisms can handle differences in time scales among different signals, or attention mechanisms can be introduced to adaptively learn the contribution weights of each sensor under different fault modes. Leveraging synergistic information from heterogeneous sensors through deep mining, the model’s diagnostic capability for compound faults and incipient weak faults can be further augmented.
Second, the cross-domain generalization capability of the model will be enhanced. To cope with the variability of operating conditions, domain adaptation or meta-learning strategies will be adopted, empowering the model to extract transferable fault features from known conditions and achieve fast adaptation to unseen conditions using only a few samples. Additionally, we plan to construct industrial datasets covering a wider range of operating conditions, including more fault types and more complex environmental interference, to provide more rigorous test benchmarks for evaluating model robustness.
Third, efforts will be directed toward enhancing model interpretability. Deep learning models often operate as black boxes, rendering their decision-making processes challenging to interpret. To strengthen confidence in diagnostic results, future investigations will integrate visualization approaches, including class activation mapping, to conduct a granular analysis of the model’s attention regions and salient features. Uncovering the foundations of diagnostic decisions will provide field engineers with clearer fault insights, paving the way for human–machine collaborative intelligent diagnosis frameworks.
In summary, the MSLAN method proposed in this paper provides an effective solution for intelligent fault diagnosis of plunger pumps that achieves both high accuracy and high efficiency. Through continuous research in areas such as multi-source feature fusion, cross-condition generalization, and interpretability, this technology is expected to evolve toward greater intelligence, reliability, and practicality, providing strong support for intelligent operation and maintenance of oilfield equipment.