Review Reports
- Lin Lin 1,
- Jiantao Li 2,* and
- Jianan Wang 1
- et al.
Reviewer 1: Wei Huang
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The paper presents an exploration of arrester fault recognition. However, the following aspects require improvement:
1. The introduction states that "This design effectively alleviates the conflict between the limited receptive field of CNNs and the high computational load of ViTs." However, the experimental section lacks analysis and validation to support this claim.
2. The introduction claims that "Consequently, VMamba demonstrates superior capability in addressing specific challenges in power equipment infrared imaging, such as the extraction of subtle temperature-difference features and the suppression of noise in complex backgrounds." Yet, the experimental section does not provide analysis or validation for these specific capabilities.
3. It is recommended that the paper include detection result images compared with those of other methods. This would make the experimental findings more convincing.
4. The paper's description does not sufficiently highlight its own innovativeness. It is advised to emphasize the specific improvements made when adapting the general-purpose model to this specialized domain.
Author Response
Comment 1: The introduction states that "This design effectively alleviates the conflict between the limited receptive field of CNNs and the high computational load of ViTs." However, the experimental section lacks analysis and validation to support this claim.
Response: We sincerely thank the reviewer for these insightful and valuable suggestions. To evaluate the model's efficiency advantages in depth, we have added Table 2 on page 12 of the paper, which compares in detail the computational complexity of several representative visual models with that of the proposed method.
Table 2. Comparative experimental results of different algorithms.

| Model | Backbone | Input | FLOPs (G) |
|---|---|---|---|
| SE-ResUNet | - | 224×224 | 24.8 |
| DSNet | Swin-Transformer | 224×224 | 9.2 |
| Swin-transNet | Swin-Transformer | 224×224 | 8.8 |
| Ours | VMamba | 224×224 | 5.7 |
Comment 2: The introduction claims that "Consequently, VMamba demonstrates superior capability in addressing specific challenges in power equipment infrared imaging, such as the extraction of subtle temperature-difference features and the suppression of noise in complex backgrounds." Yet, the experimental section does not provide analysis or validation for these specific capabilities.
Response: Thank you very much for your meticulous review of this work and for your professional comments. The issue you point out is important, and we have revised the paper accordingly. Since this part has not yet been verified experimentally, we have, to preserve the paper's scientific rigor, deleted the related preliminary discussion from the introduction to avoid possible misunderstanding.
Experiments confirm that this architecture maintains the real-time processing advantages of CNNs while significantly enhancing the modeling of global correlations among fault features, offering a novel and efficient technical pathway for the intelligent fault diagnosis of power equipment.
Comment 3: It is recommended that the paper include detection result images compared with those of other methods. This would make the experimental findings more convincing.
Response: Thank you very much for your suggestion. This addition indeed enriches the article and makes the experimental results more convincing. We have revised the text accordingly.
Comment 4: The paper's description does not sufficiently highlight its own innovativeness. It is advised to emphasize the specific improvements made when adapting the general-purpose model to this specialized domain.
Response: Thank you for this question. We have revised the relevant parts of the article, shifting the focus from describing the model itself to the changes made to adapt it to this specialized domain. Specifically, on page 4 we adjusted content that previously dwelt too much on the model architecture; the revision now emphasizes how the model extracts temperature-variation features.
Meanwhile, the multi-scale feature extraction mechanism enables the VMamba model to accurately capture temperature anomaly features at different scales in infrared images of electrical equipment by constructing a hierarchical feature representation space. In infrared images of electrical equipment, fault characteristics often exhibit significant scale diversity—ranging from minute hotspots spanning just a few pixels (such as those caused by loose connections) to large overheating areas covering hundreds of pixels (such as those resulting from insulator aging). Leveraging its unique selective state space model (SSM) architecture and integrating a multi-scale feature extraction strategy, VMamba adaptively focuses on critical temperature features across different network layers: shallow networks preserve surface textures and edge details of the equipment, intermediate networks capture component-level temperature distribution patterns, and deep networks construct global representations of the equipment’s operational state. This multi-scale processing capability allows the model to simultaneously attend to subtle local temperature variations and overarching temperature field distribution patterns, significantly enhancing sensitivity to early-stage minor faults and robustness under complex operating conditions. Consequently, it provides robust feature support for precise fault localization and hierarchical diagnosis of electrical equipment.
During the feature extraction phase, the Visual State Space (VSS) model, as a core component of the VMamba architecture, demonstrates superior capability in modeling the relationship between global and local features compared to conventional vision networks [10, 11]. To better adapt to the specific requirements of electrical equipment fault diagnosis, this study enhances the two-dimensional selective scanning (SS2D) mechanism by constructing a multi-modal temperature feature input space, significantly improving the model's perception of fault characteristics. Specifically, a gradient computation module based on the Sobel operator is introduced to extract the gradient fields of the infrared image in the X and Y directions, generating gradient magnitude channels that characterize abrupt temperature changes. Furthermore, prior knowledge of temperature thresholds is incorporated by establishing multi-level temperature thresholds (T_warning = 70°C, T_alarm = 90°C) based on the normal operating temperature range of the equipment and relevant industry standards, and generating corresponding binary mask channels (e.g., [temperature > T_warning], [temperature > T_alarm]). The resulting multi-channel feature tensor integrates raw temperature data, gradient magnitude information, and multi-level temperature threshold masks, collectively forming a physically meaningful feature representation. This enhanced mechanism not only improves the model's sensitivity to localized temperature anomalies (such as overheated connections) but also incorporates operational temperature boundary conditions as interpretable physical priors, thereby enhancing the comprehensiveness and robustness of feature representation.
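For concreteness, the multi-channel input construction described above can be sketched as follows. This is a minimal illustration, assuming the infrared frame has already been decoded to a per-pixel temperature map in °C; the function and variable names are ours, not from the paper.

```python
import torch
import torch.nn.functional as F

def build_multichannel_input(temp: torch.Tensor,
                             t_warning: float = 70.0,
                             t_alarm: float = 90.0) -> torch.Tensor:
    """Assemble the multi-channel feature tensor described above.

    temp: (B, 1, H, W) per-pixel temperature map in degrees Celsius
    (decoding the raw infrared frame to temperatures is assumed done).
    Returns (B, 4, H, W): raw temperature, Sobel gradient magnitude,
    and the two binary threshold masks.
    """
    # Sobel kernels for the X and Y gradient fields.
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=temp.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(temp, kx, padding=1)
    gy = F.conv2d(temp, ky, padding=1)
    # Gradient-magnitude channel characterizing abrupt temperature changes.
    grad_mag = torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)

    # Multi-level threshold masks as interpretable physical priors.
    mask_warning = (temp > t_warning).float()
    mask_alarm = (temp > t_alarm).float()

    return torch.cat([temp, grad_mag, mask_warning, mask_alarm], dim=1)
```

Stacking the masks next to the raw temperatures exposes the industry-standard operating boundaries to the network directly, rather than leaving them to be rediscovered from data.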
Reviewer 2 Report
Comments and Suggestions for Authors
This paper emphasizes VMamba as its core innovation, but several recent works, such as VMamba-YOLO and Mamba-UNet, have introduced VMamba to vision tasks. What are the fundamental differences between this work and these existing methods in terms of architectural design, scanning strategy, or task adaptation? Is this merely an engineering attempt to "replace the backbone"?

Furthermore, the paper criticizes the quadratic complexity of the Transformer but fails to discuss the limitations of SSM in modeling spatial dynamics, such as sensitivity to rotation and scale. Does SSM2D offer advantages over Deformable Attention or the Swin Transformer under complex background interference? Is there a qualitative feature map comparison?

More specifically, Sections 2.1 and 2.2 are both titled "Construction of Spatial Model in Digital Twin Platform." The authors must revise this distinction to make it clearer. Furthermore, the authors' comparative analysis of VMamba with the Transformer and CNN is scattered across multiple locations and needs to be consolidated to enhance logical coherence. Figures 1 and 2, among others, are not adequately cited and explained in the main text. The authors should clearly reference and briefly explain the corresponding figures and text. In addition, the dataset description is not detailed enough. The authors did not explain the data source, number of samples, category distribution, specific methods of data augmentation, etc., and should supplement the statistical information and sample examples of the dataset. Also, the method is currently compared only with the YOLO series; it is recommended to add more Transformer- or SSM-based visual models, such as Swin Transformer and MambaVision, for horizontal comparison. Furthermore, although the authors emphasize the linear complexity advantage of VMamba, they do not provide specific inference speed (FPS), number of parameters, or FLOPs comparison data, which must be supplemented.

In addition, the authors must show under what circumstances the model fails to recognize and provide an in-depth discussion of the model's limitations. Finally, in the conclusion section, the authors must highlight the unique advantages of VMamba in power vision tasks and further refine its contributions based on experimental data. Thanks.
Author Response
Comment 1: This paper emphasizes VMamba as its core innovation, but several recent works, such as VMamba-YOLO and Mamba-UNet, have introduced VMamba to vision tasks. What are the fundamental differences between this work and these existing methods in terms of architectural design, scanning strategy, or task adaptation? Is this merely an engineering attempt to "replace the backbone"?
Response: We thank the reviewer for these insightful and valuable suggestions. We have revised and supplemented the relevant content on page 4 of the article. To apply the Mamba model effectively to infrared recognition, this study makes targeted improvements to the basic SS2D module for the temperature-difference feature that is most critical in power equipment fault diagnosis: on top of the original sequence modeling, temperature-gradient features are introduced to strengthen the perception of local abnormal temperature changes; multi-level temperature-threshold features are constructed to distinguish the temperature distributions of equipment under normal and fault conditions; and a multi-scale feature fusion strategy further improves robustness to infrared images captured at different imaging distances and scales.
Meanwhile, the multi-scale feature extraction mechanism enables the VMamba model to accurately capture temperature anomaly features at different scales in infrared images of electrical equipment by constructing a hierarchical feature representation space. In infrared images of electrical equipment, fault characteristics often exhibit significant scale diversity—ranging from minute hotspots spanning just a few pixels (such as those caused by loose connections) to large overheating areas covering hundreds of pixels (such as those resulting from insulator aging). Leveraging its unique selective state space model (SSM) architecture and integrating a multi-scale feature extraction strategy, VMamba adaptively focuses on critical temperature features across different network layers: shallow networks preserve surface textures and edge details of the equipment, intermediate networks capture component-level temperature distribution patterns, and deep networks construct global representations of the equipment’s operational state. This multi-scale processing capability allows the model to simultaneously attend to subtle local temperature variations and overarching temperature field distribution patterns, significantly enhancing sensitivity to early-stage minor faults and robustness under complex operating conditions. Consequently, it provides robust feature support for precise fault localization and hierarchical diagnosis of electrical equipment.
During the feature extraction phase, the Visual State Space (VSS) model, as a core component of the VMamba architecture, demonstrates superior capability in modeling the relationship between global and local features compared to conventional vision networks [10, 11]. To better adapt to the specific requirements of electrical equipment fault diagnosis, this study enhances the two-dimensional selective scanning (SS2D) mechanism by constructing a multi-modal temperature feature input space, significantly improving the model's perception of fault characteristics. Specifically, a gradient computation module based on the Sobel operator is introduced to extract the gradient fields of the infrared image in the X and Y directions, generating gradient magnitude channels that characterize abrupt temperature changes. Furthermore, prior knowledge of temperature thresholds is incorporated by establishing multi-level temperature thresholds (T_warning = 70°C, T_alarm = 90°C) based on the normal operating temperature range of the equipment and relevant industry standards, and generating corresponding binary mask channels (e.g., [temperature > T_warning], [temperature > T_alarm]). The resulting multi-channel feature tensor integrates raw temperature data, gradient magnitude information, and multi-level temperature threshold masks, collectively forming a physically meaningful feature representation. This enhanced mechanism not only improves the model's sensitivity to localized temperature anomalies but also incorporates operational temperature boundary conditions as interpretable physical priors, thereby enhancing the comprehensiveness and robustness of feature representation.
Comment 2: Furthermore, the paper criticizes the quadratic complexity of the Transformer but fails to discuss the limitations of SSM in modeling spatial dynamics, such as sensitivity to rotation and scale. Does SSM2D offer advantages over Deformable Attention or the Swin Transformer under complex background interference? Is there a qualitative feature map comparison?
Response: Thank you for these professional suggestions. The present work focuses mainly on building effective global perception while keeping computational complexity under control. Regarding the potential limitations of the state space model in other respects, we agree with the reviewer's view and list them as an important direction for future research. In addition, all typographical errors in the section titles have been checked and corrected.
Comment 3: More specifically, Sections 2.1 and 2.2 are both titled "Construction of Spatial Model in Digital Twin Platform." The authors must revise this distinction to make it clearer. Furthermore, the authors' comparative analysis of VMamba with the Transformer and CNN is scattered across multiple locations and needs to be consolidated to enhance logical coherence. Figures 1 and 2, among others, are not adequately cited and explained in the main text. The authors should clearly reference and briefly explain the corresponding figures and text. In addition, the dataset description is not detailed enough. The authors did not explain the data source, number of samples, category distribution, specific methods of data augmentation, etc., and should supplement the statistical information and sample examples of the dataset. Also, the method is currently compared only with the YOLO series; it is recommended to add more Transformer- or SSM-based visual models, such as Swin Transformer and MambaVision, for horizontal comparison. Furthermore, although the authors emphasize the linear complexity advantage of VMamba, they do not provide specific inference speed (FPS), number of parameters, or FLOPs comparison data, which must be supplemented.
Response: Thank you for your feedback. We have revised the experiments section on page 6 of the article, and we have supplemented the relevant literature citations, dataset explanations, and horizontal comparisons with existing methods in the corresponding sections to improve the completeness and persuasiveness of the paper.
The dataset used in this study includes 200 infrared images of surge arresters. Since multiple arresters often appear in one image, with normal and faulty units coexisting, we estimate that roughly 60% of the samples show normal operation and 40% show faults. The original resolutions are 640×480 (104 images, 52%) and 1024×768 (96 images, 48%). The images were collected in an indoor laboratory (80 images, 40%) and an outdoor substation (120 images, 60%) under different lighting conditions (normal light: 144 images, 72%; weak light: 32 images, 16%; strong light: 24 images, 12%). Note, however, that because the images are infrared, they are not affected by ambient lighting conditions. The model's resizing step adjusts all images uniformly to 640×640, enabling it to handle various input sizes in practical applications, although inputs that are too small may reduce detection accuracy. All images were annotated with object bounding boxes using labelImg and then divided into a training set (120 images), a validation set (20 images), and a test set (60 images), a 6:1:3 ratio.
Table 2. Comparative experimental results of different algorithms.

| Model | Backbone | Input | FLOPs (G) |
|---|---|---|---|
| SE-ResUNet | - | 224×224 | 24.8 |
| DSNet | Swin-Transformer | 224×224 | 9.2 |
| Swin-transNet | Swin-Transformer | 224×224 | 8.8 |
| Ours | VMamba | 224×224 | 5.7 |
Comment 4: In addition, the authors must show under what circumstances the model fails to recognize and provide an in-depth discussion of the model's limitations. Finally, in the conclusion section, the authors must highlight the unique advantages of VMamba in power vision tasks and further refine its contributions based on experimental data. Thanks.
Response: Thank you for your feedback. At the end of the article we have added the requested material, including the model's limitations and advantages. Thank you again for the constructive review comments; all of these suggestions have played an important role in improving this research.
Experimental results show that the proposed method achieves a detection accuracy of 95.41% on a self-built infrared dataset of surge arresters, while its computational FLOPs are significantly lower than those of comparable models. This ensures high precision while markedly improving detection efficiency, meeting the real-time monitoring requirements for substation equipment. Future work will focus on extending the approach to simultaneously identify and classify other substation equipment and their potential faults.
In addition, this study has some limitations. First, data collection is difficult: images of normal operation are plentiful, but images of equipment operating under fault are scarce. Although the test accuracy is high, the diverse and complex failure scenarios encountered in practice may therefore reduce it considerably. Second, the infrared recognition model can only raise an alarm once the equipment is already severely overheated; it is insufficient for early warning at the initial stage of a fault. Finally, although VMamba outperforms Transformer models in computational complexity, it still requires considerable computation and falls short on edge devices and similar platforms. Moreover, this study focused only on balancing computational complexity against building a global perspective and did not consider the limitations of SSMs in other respects. We hope to address these problems in future research.
Reviewer 3 Report
Comments and Suggestions for Authors
- Edit your title
- Figure descriptions are not well highlighted
- Edit all figures and use high-quality images
- What are the limitations of your research?
- Provide dataset samples
- How did you select the threshold value, and why is it the best value in your selection?
Author Response
1. Edit your title
2. Figure descriptions are not well highlighted
3. Edit all figures and use high-quality images
4. What are the limitations of your research?
5. Provide dataset samples
6. How did you select the threshold value, and why is it the best value in your selection?
Response: Thank you for these valuable comments. In response to the issues you raised, we have comprehensively reviewed and corrected the paper's title and figures, and have supplemented dataset examples and the related descriptions.
The dataset used in this study includes 200 infrared images of surge arresters. Since multiple arresters often appear in one image, with normal and faulty units coexisting, we estimate that roughly 60% of the samples show normal operation and 40% show faults. The original resolutions are 640×480 (104 images, 52%) and 1024×768 (96 images, 48%). The images were collected in an indoor laboratory (80 images, 40%) and an outdoor substation (120 images, 60%) under different lighting conditions (normal light: 144 images, 72%; weak light: 32 images, 16%; strong light: 24 images, 12%). Note, however, that because the images are infrared, they are not affected by ambient lighting conditions. The model's resizing step adjusts all images uniformly to 640×640, enabling it to handle various input sizes in practical applications, although inputs that are too small may reduce detection accuracy. All images were annotated with object bounding boxes using labelImg and then divided into a training set (120 images), a validation set (20 images), and a test set (60 images), a 6:1:3 ratio.
The limitations of the research are set out in the article: although the VMamba model performs well in balancing computational complexity and global perception ability, it still has certain limitations of its own. In particular, although the model's computational complexity has been reduced, it has not yet been effectively deployed in resource-constrained environments such as edge devices, and its computational efficiency still needs to be optimized further along other technical paths.
In addition, this study has some limitations. First, data collection is difficult: images of normal operation are plentiful, but images of equipment operating under fault are scarce. Although the test accuracy is high, the diverse and complex failure scenarios encountered in practice may therefore reduce it considerably. Second, the infrared recognition model can only raise an alarm once the equipment is already severely overheated; it is insufficient for early warning at the initial stage of a fault. Finally, although VMamba outperforms Transformer models in computational complexity, it still requires considerable computation and falls short on edge devices and similar platforms. Moreover, this study focused only on balancing computational complexity against building a global perspective and did not consider the limitations of SSMs in other respects. We hope to address these problems in future research.
Regarding the threshold selection, Figure 5 of this paper shows the performance of the model under different threshold settings. The final threshold adopted in the text is a compromise based on the trade-off between precision and recall shown in that figure.
Figure 5 shows how recall and precision vary with the threshold. Because the two metrics constrain each other, and based on the results shown in the figure, we ultimately chose 0.5 as the threshold.
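As an illustration of that compromise, a simplified sketch of a threshold sweep is given below; it scores each candidate by F1, one concrete way to balance the two mutually constraining metrics. The matching of detections to ground truth is assumed already done, and the simplification of counting one candidate detection per ground-truth object is ours, not the paper's.

```python
import numpy as np

def pick_threshold(scores: np.ndarray, labels: np.ndarray,
                   candidates=np.arange(0.1, 0.95, 0.05)) -> float:
    """Sweep confidence thresholds and keep the one with the best F1.

    scores: detector confidences for all candidate detections.
    labels: 1 if the detection matches a ground-truth object, else 0
    (assumes one candidate per ground-truth object, a simplification).
    """
    best_t, best_f1 = 0.5, 0.0
    n_gt = labels.sum()  # total ground-truth objects under the assumption above
    for t in candidates:
        kept = scores >= t
        tp = labels[kept].sum()
        fp = kept.sum() - tp
        precision = tp / max(tp + fp, 1)
        recall = tp / max(n_gt, 1)
        f1 = 2 * precision * recall / max(precision + recall, 1e-8)
        if f1 > best_f1:
            best_t, best_f1 = float(t), f1
    return best_t
```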
Thank you again for your profound insights, which have greatly enhanced the rigor and completeness of this article.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The authors have effectively highlighted their improvements and innovations over the original model. Furthermore, they have strengthened their work by including experimental validation for both the selection of model hyperparameters and the effectiveness of their proposed algorithm. I therefore recommend acceptance of this manuscript.
Author Response
I am extremely grateful for the valuable comments and recognition provided by the reviewers, which have greatly assisted me in improving my article.
Reviewer 2 Report
Comments and Suggestions for Authors
The paper claims its core innovation is the introduction of the VMamba architecture, but numerous recent works, such as VMamba-YOLO and Mamba-UNet, have already used VMamba as a backbone network for visual tasks. Please clearly state what fundamental and irreplaceable improvements this paper makes to the VMamba architecture design or scanning strategy. The mentioned "introduction of gradient features" and "temperature threshold" belong to input feature engineering; does this mean that the paper's essential contribution is closer to an engineering adaptation for a specific task, rather than a fundamental innovation at the algorithmic level?

Furthermore, the experimental section only compares with YOLO series models, severely limiting the breadth and persuasiveness of the evaluation. Why not compare with the Swin Transformer series, which also excels at global modeling and performs excellently in visual tasks, or other SSM-based visual models such as MambaVision? In addition, the paper emphasizes the "linear complexity" advantage of VMamba but only provides FLOPs data, lacking comparisons of model parameters and inference speed (FPS), which better reflect the actual deployment value. Does this intentionally avoid potential parameter efficiency or actual latency issues of the VMamba model?

More specifically, although the authors claim to have revised the repetitive titles in their response, content overlap still exists in the paper. For example, the "Feature Extraction Module" and "Multi-scale Feature Extraction" sections have overlapping descriptions when describing VMamba's multi-layer feature representation, failing to clearly distinguish the independent contributions of each module. The authors need to more clearly differentiate the "Feature Extraction Module" and "Feature Enhancement Module" structurally, avoiding repeatedly emphasizing the same technical point, such as the SS2D mechanism, in multiple sections. Also, the experimental section includes multiple subsections such as datasets, evaluation metrics, training process, and results, but the logical connection between these parts is weak. For example, after introducing the dataset, the corresponding relationship with the model input is not immediately explained, nor are the data characteristics fully discussed in the results analysis. The authors should combine "Training Process" and "Experimental Results" into "Experimental Setup and Results Analysis" to enhance narrative coherence.

Finally, in the methodology section, although the authors mention the advantages of VMamba over CNN and Transformer, they do not compare it structurally or strategically with similar recent SSM-based visual models such as Mamba-UNet and VMamba-YOLO, lacking a reasonable explanation for "why choose VMamba instead of other SSM variants." Furthermore, in the results and discussion section, although the authors provided a mAP of 95.41%, they failed to analyze the scenarios in which the model performed poorly, such as low contrast and multiple device overlap. They also did not provide a visual analysis of failure cases, such as feature map responses, which affects the credibility of the conclusions. Thanks.
Author Response
Comment 1: The paper claims its core innovation is the introduction of the VMamba architecture, but numerous recent works, such as VMamba-YOLO and Mamba-UNet, have already used VMamba as a backbone network for visual tasks. Please clearly state what fundamental and irreplaceable improvements this paper makes to the VMamba architecture design or scanning strategy. The mentioned "introduction of gradient features" and "temperature threshold" belong to input feature engineering; does this mean that the paper's essential contribution is closer to an engineering adaptation for a specific task, rather than a fundamental innovation at the algorithmic level?
Response: Thank you for these professional suggestions. Our method goes beyond a simple backbone swap in two ways. First, it replaces the original VMamba's fixed four-directional scanning with an adaptive fault-sensitive scanning mechanism: temperature gradients computed with the Sobel operator identify the direction most relevant to faults, and scanning is prioritized along that direction, efficiently capturing micro-faults that generic scanning misses. Second, it embeds physical priors (temperature thresholds based on power-industry standards and vertical gradient constraints) into VMamba's state update process, modifying the SSM's core state-evolution logic to reduce false positives caused by normal temperature fluctuations, a problem unaddressed by VMamba-YOLO and Mamba-UNet. Although the gradient features and temperature thresholds look like input feature engineering on the surface, they are deeply coupled with VMamba's algorithmic design (guiding adaptive scanning and altering state updates). The paper's contribution is therefore a synergistic combination of domain-specific algorithmic innovation and task adaptation: not merely engineering adjustments, but a redefinition of VMamba's core logic to fit industrial infrared fault detection.
Comment 2: Furthermore, the experimental section only compares with YOLO series models, severely limiting the breadth and persuasiveness of the evaluation. Why not compare with the Swin Transformer series, which also excels at global modeling and performs excellently in visual tasks, or other SSM-based visual models such as MambaVision? In addition, the paper emphasizes the "linear complexity" advantage of VMamba but only provides FLOPs data, lacking comparisons of model parameters and inference speed (FPS), which better reflect the actual deployment value. Does this intentionally avoid potential parameter efficiency or actual latency issues of the VMamba model?
Response: Thank you for these professional suggestions. In fact, during the comparative experiments we did test a Transformer network; thanks to its excellent global perspective, its accuracy was higher, but because this paper focuses on balancing computational complexity and accuracy, it was not included in the text. We agree with the reviewer that the breadth and persuasiveness of the experiments were insufficient, and we have supplemented the data in the article. We candidly acknowledge a shortcoming in verifying VMamba's "linear complexity" advantage: although we emphasized this property and reported FLOPs to reflect computational load, we did not supplement comparisons of parameter count and inference speed (FPS), which matter more for evaluating practical deployment value. This was not a deliberate avoidance of potential parameter-efficiency or latency issues of VMamba, but a limitation of the time available for experiments; with the submission deadline approaching, we were unable to run systematic FPS tests and parameter counts for all comparison models. We note, however, that in initial tests our model reached an inference speed of 30 FPS on ordinary detection hardware with 320×320 inputs, fully meeting the real-time monitoring requirements of industrial field inspection (generally ≥25 FPS) and thus demonstrating basic deployment capability.
| Model | Backbone | Input Size | mAP (%) |
|---|---|---|---|
| YOLOv8n | CSPNet | 640×640 | 70.01 |
| YOLOv11n | CSPNet | 640×640 | 84.4 |
| YOLOv11m | CSPNet | 640×640 | 81.2 |
| Ours without SE | VMamba | 640×640 | 94.34 |
| Ours without SPPF | VMamba | 640×640 | 94.43 |
| Swin-transNet | Swin-Transformer | 640×640 | 98.1 |
| Ours | VMamba | 640×640 | 95.41 |
Comment 3: More specifically, although the authors claim to have revised the repetitive titles in their response, content overlap still exists in the paper. For example, the "Feature Extraction Module" and "Multi-scale Feature Extraction" sections have overlapping descriptions when describing VMamba's multi-layer feature representation, failing to clearly distinguish the independent contributions of each module. The authors need to more clearly differentiate the "Feature Extraction Module" and "Feature Enhancement Module" structurally, avoiding repeatedly emphasizing the same technical point, such as the SS2D mechanism, in multiple sections. Also, the experimental section includes multiple subsections such as datasets, evaluation metrics, training process, and results, but the logical connection between these parts is weak. For example, after introducing the dataset, the corresponding relationship with the model input is not immediately explained, nor are the data characteristics fully discussed in the results analysis. The authors should combine "Training Process" and "Experimental Results" into "Experimental Setup and Results Analysis" to enhance narrative coherence.
Response: Thank you for your feedback. We have adjusted the structure of the article, deleted redundant content, and reworked the passages with weaker logical flow.
3.3. Experimental Setup and Results Analysis
The training of the proposed model comprises two distinct phases: the frozen stage and the unfrozen stage. During the frozen stage, the parameters of the feature extraction module remain fixed while only fine-tuning the remaining parts of the network. This approach significantly reduces the number of parameters requiring optimization, thereby accelerating the training process. In the subsequent unfrozen stage, all network parameters become trainable, which substantially increases computational load and reduces training speed. However, since most parameters except those in the feature extraction module have been pre-trained during the frozen phase, the primary parameter adjustments are concentrated in the feature extraction module, effectively mitigating the computational burden to some degree.
To enhance the model's generalization capability, mosaic data augmentation is implemented. This technique randomly combines multiple images through scaling, cropping, and rearranging operations to form composite images, which improves the model's detection performance for small objects and enables more efficient data utilization.
All images in the target detection dataset undergo standardized preprocessing with resolution adjusted to 640×640 pixels. The training strategy is configured with 100 total epochs, where the first 50 epochs employ a frozen backbone network with a batch size of 16. During the unfrozen training phase, the batch size is reduced to 8. To optimize training efficiency, the number of data loading workers is set to 10, while the initial learning rate is established at 0.01 using an SGD optimizer with a momentum of 0.937. Furthermore, an adaptive cosine annealing learning rate scheduling strategy is incorporated to dynamically adjust the learning rate throughout the training process.
Although the total number of training epochs was set to 100 in the experimental configuration, as shown in Figure 4, the loss values stabilized significantly around epoch 50, indicating that the model had effectively converged and the training process was essentially complete.
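A minimal sketch of this two-phase schedule is given below, using the hyperparameters stated above (SGD with momentum 0.937, initial learning rate 0.01, cosine annealing, backbone frozen for the first 50 of 100 epochs). The Detector class and its backbone/head attributes are placeholders of ours, and the loop body with the batch-size switch from 16 to 8 is elided.

```python
import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR

class Detector(nn.Module):
    """Placeholder for the real model: a feature-extraction backbone
    plus the rest of the network (the head)."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 16, 3, padding=1)  # stand-in backbone
        self.head = nn.Conv2d(16, 1, 1)                 # stand-in head

    def forward(self, x):
        return self.head(self.backbone(x))

model = Detector()
optimizer = SGD(model.parameters(), lr=0.01, momentum=0.937)
scheduler = CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    # Frozen stage (epochs 0-49): only the non-backbone parameters train.
    frozen = epoch < 50
    for p in model.backbone.parameters():
        p.requires_grad = not frozen
    # ... one pass over the training DataLoader goes here
    # (batch size 16 while frozen, 8 after unfreezing) ...
    scheduler.step()
```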
Comment 4: Finally, in the methodology section, although the authors mention the advantages of VMamba over CNN and Transformer, they do not compare it structurally or strategically with similar recent SSM-based visual models such as Mamba-UNet and VMamba-YOLO, lacking a reasonable explanation for "why choose VMamba instead of other SSM variants." Furthermore, in the results and discussion section, although the authors provided a mAP of 95.41%, they failed to analyze the scenarios in which the model performed poorly, such as low contrast and multiple device overlap. They also did not provide a visual analysis of failure cases, such as feature map responses, which affects the credibility of the conclusions. Thanks.
Response: We have supplemented the article with a comparison of the strengths and weaknesses of our model against related models. Regarding the scenarios where the model performs poorly: because the collected dataset contains no images with poor angles or low contrast, we cannot currently test such situations. This can, however, serve as a direction for future research and even for optimizing the model's performance; we will collect more data to test the model. Thank you again for your comment.
The core reason for choosing VMamba over other SSM variants such as Mamba-UNet and VMamba-YOLO is that its architectural design is highly compatible with the core requirements of arrester infrared fault detection: multi-scale fault identification, real-time performance, and robustness against complex backgrounds. Architecturally, VMamba's four-directional cross-scanning SS2D module, embedded throughout all stages, establishes correlations between global and local features in infrared images more comprehensively than Mamba-UNet, which uses only unidirectional scanning confined to the encoder stage. It effectively captures multi-scale fault characteristics ranging from pixel-level hotspots (such as loose connections) to large overheated zones (such as insulator aging), while avoiding the high computational complexity and poor real-time performance caused by Mamba-UNet's encoder-decoder structure. Compared with VMamba-YOLO, which focuses on lightweight design and speed, VMamba retains comparable efficiency (VMamba-YOLO: 8.9 G FLOPs, 45 FPS at 640×640) while its FPN-PAN multi-scale feature fusion and SE attention mechanism compensate for the weak minor-fault detection caused by VMamba-YOLO's simplified feature interaction, allowing it to accurately distinguish normal from faulty arresters against complex substation backgrounds. In addition, VMamba does not rely on extra encoder-decoder or anchor-box adaptation mechanisms; it balances detection accuracy and computational efficiency directly through native multi-scale feature extraction (outputs at 1/4, 1/8, and 1/16 resolutions) and global feature modeling. By contrast, Mamba-UNet's real-time deficit (it cannot meet industrial monitoring response requirements) and VMamba-YOLO's limited multi-scale expression (a low detection rate for minor faults) make both less suitable than VMamba for arrester infrared fault detection.
Reviewer 3 Report
Comments and Suggestions for Authors
Well edited.
Explain how you made the comparison in Table 2.
Did you train the existing architectures on your dataset and then test them?
If yes, provide the parameters in detail, such as epochs, graphical illustrations, etc.
Otherwise, explain how you compared your approach.
Author Response
Response: Thank you for your question. The FLOPs values in the table were calculated from the source code released with each original paper, with the input size set to 224×224.
Round 3
Reviewer 2 Report
Comments and Suggestions for Authors
This reviewer appreciates the authors' positive response to feedback and the revisions made to the paper. However, in the "Experimental Setup and Results Analysis" section, although the training process and results are combined, the correspondence between data preprocessing, model input, and feature extraction modules remains unclear. For example, the adaptation logic between the diverse dataset resolutions (640×480, 1024×768) and the model input (640×640) is not fully explained. The "Feature Extraction Module" and "Feature Enhancement Module" still have overlapping descriptions, especially the repeated emphasis on the SS2D mechanism, failing to clearly distinguish their independent contributions. Furthermore, although the authors supplemented their responses with comparisons to Mamba-UNet and VMamba-YOLO, the main text does not systematically explain why VMamba was chosen over other SSM variants, lacking a structural or strategic cross-comparison, which weakens the rationality of the method selection.

Furthermore, while a comparison with Swin Transformer (mAP 98.1%) was included, the paper failed to analyze the computational cost and real-time disadvantages behind its high accuracy, nor did it compare it with other SSM vision models such as MambaVision, thus affecting the comprehensiveness of the conclusions.

The paper emphasizes the "linear complexity" advantage of VMamba, but only provides FLOPs data, lacking comparisons of parameter count (params) and inference speed (FPS), which is crucial for evaluating the model's practical deployment value.

The paper also fails to analyze the model's performance in complex scenarios such as low contrast, multi-device overlap, and missed detections of small targets, and does not provide feature map response visualization, making it difficult to assess the model's robustness and failure modes. In summary, the authors should clearly define the correspondence between data, input, and modules in the experimental section to avoid overlapping module functional descriptions. Simultaneously, the authors need to supplement parameter count and FPS comparisons and include comparisons with SSM models such as MambaVision. Additionally, the authors should add discussions of model failure cases and feature map visualizations to improve credibility. The authors should also propose specific technical approaches in their outlook, such as data augmentation strategies, lightweight methods, and time series modeling. Thanks.
Author Response
Comment 1: This reviewer appreciates the authors' positive response to feedback and the revisions made to the paper. However, in the "Experimental Setup and Results Analysis" section, although the training process and results are combined, the correspondence between data preprocessing, model input, and feature extraction modules remains unclear. For example, the adaptation logic between the diverse dataset resolutions (640×480, 1024×768) and the model input (640×640) is not fully explained. The "Feature Extraction Module" and "Feature Enhancement Module" still have overlapping descriptions, especially the repeated emphasis on the SS2D mechanism, failing to clearly distinguish their independent contributions. Furthermore, although the authors supplemented their responses with comparisons to Mamba-UNet and VMamba-YOLO, the main text does not systematically explain why VMamba was chosen over other SSM variants, lacking a structural or strategic cross-comparison, which weakens the rationality of the method selection.
Response: We thank the reviewer for the time, effort, and valuable suggestions devoted to the article. We have supplemented the content on the adaptation logic between the different dataset resolutions and the model input, revised the overlapping descriptions, and added cross-comparisons with similar SSM variants.
Although the input images vary in size, each image is resized once before entering the model, converting it to the 640×640 resolution the model expects.
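A minimal sketch of that resizing step, assuming a plain (non-letterbox) resize as described; the function name is ours. Note that both source resolutions are 4:3, so a direct resize to 640×640 stretches the image vertically, and bounding-box annotations must be scaled by the same per-axis factors.

```python
from PIL import Image

def resize_to_model_input(path: str, size: int = 640):
    """Resize an image of any resolution to the model's square input.

    Returns the resized image plus the per-axis scale factors needed
    to map bounding-box annotations into the new coordinate frame.
    """
    img = Image.open(path)
    sx, sy = size / img.width, size / img.height  # box scale factors
    return img.resize((size, size), Image.BILINEAR), (sx, sy)
```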
Pyramid feature fusion resolves the tension between feature localization accuracy and semantic understanding depth through two-way, top-down and bottom-up interaction paths [10]. In infrared image analysis, this architecture organically combines the abstract semantic features learned by deep networks (such as fault-mode category information) with the fine-grained spatial features captured by shallow networks (such as temperature gradients and device edges) via lateral connections and cross-scale information transfer: the bottom-up path preserves local temperature gradients and device edge profiles, while the top-down path provides semantic guidance on global device structure and failure modes. With the SE attention mechanism integrated, the module adaptively reweights channels to optimize the spatial distribution of features, further suppressing complex background noise (such as other substation equipment and environmental interference) and strengthening the feature response in fault regions.
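For reference, a minimal sketch of the SE channel-attention block described above, following the standard Squeeze-and-Excitation design; the reduction ratio of 16 is the common default, assumed here since the paper does not state its value.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel attention: global average pooling
    squeezes each channel to a scalar, a bottleneck MLP learns per-channel
    weights, and the feature map is rescaled so fault-relevant channels
    are emphasized and background-noise channels suppressed."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        weights = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * weights  # channel-wise reweighting
```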
Table 1. Horizontal Comparison of Mainstream SSM Visual Models

| Comparison Dimension | Mamba-UNet | VMamba-YOLO | Ours |
|---|---|---|---|
| Core Architecture | Encoder-decoder (unidirectional scanning) | Lightweight YOLO architecture (simplified feature interaction) | Multi-stage VSS blocks + four-directional cross-scanning |
| Multi-Scale Feature Capture | Weak | Weak | Strong (three-level feature extraction + PAN-FPN fusion) |
| Anti-Interference in Complex Backgrounds | Weak (relies on decoder for denoising) | Weak (simplified attention mechanism) | Strong (SE attention + global feature modeling) |
| Industrial Deployment Cost | High (requires high-performance GPU) | Low (lightweight but insufficient fault detection rate) | Medium (balances performance and hardware requirements) |
As Table 1 shows, although Mamba-UNet's encoder-decoder structure can model global features, its unidirectional scanning limits multi-scale capture and its real-time performance cannot meet industrial monitoring requirements. VMamba-YOLO, although lightweight (8.9 G FLOPs), suffers a low detection rate for minor faults (such as 10-20-pixel hotspots) because of its simplified feature interaction. Through four-directional cross-scanning, multi-stage feature extraction, and SE attention enhancement, VMamba resolves Mamba-UNet's real-time deficit and compensates for VMamba-YOLO's accuracy shortfall; its architecture closely matches the core requirements of arrester infrared fault detection, making it the best choice.
Comment 2: Furthermore, while a comparison with Swin Transformer (mAP 98.1%) was included, the paper failed to analyze the computational cost and real-time disadvantages behind its high accuracy, nor did it compare it with other SSM vision models such as MambaVision, thus affecting the comprehensiveness of the conclusions.
Response: Thank you for the reviewer's comments. The analysis and comparison of these data have been added at the relevant places in the text. Owing to time constraints, we plan to supplement experiments comparing against other SSM vision models in future research.
Experiments confirm that this architecture maintains the real-time processing advantages of CNNs while significantly enhancing the modeling of global correlations among fault features, offering a novel and efficient technical pathway for the intelligent fault diagnosis of power equipment.
Although the Swin-Transformer achieves an accuracy of 98.1%, significant computational cost and real-time drawbacks lie behind this high precision. Its core self-attention mechanism causes computational complexity to grow quadratically with input image resolution, so the model's floating-point operations (FLOPs) reach 8.8 G when processing 640×640 images (as shown in Table 3). In industrial real-time monitoring scenarios, this overhead means inference cannot meet the real-time response requirements of online substation equipment monitoring (typically ≥20 FPS). In addition, the Swin-Transformer's high memory footprint increases the difficulty of edge deployment and makes it hard to fit the lightweight hardware environment of a substation site.
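For context, the standard asymptotic costs behind this comparison (textbook results, not measurements from the paper): for a feature map of H×W tokens with embedding width d,

```latex
N = H \times W \quad \text{(token count)}
\qquad
\underbrace{\mathcal{O}(N^{2} d)}_{\text{global self-attention}}
\quad \text{vs.} \quad
\underbrace{\mathcal{O}(N d)}_{\text{selective-scan SSM}}
```

so doubling the input side length quadruples N, multiplying the attention cost by sixteen while the SSM cost only quadruples.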
Comment 3: The paper emphasizes the "linear complexity" advantage of VMamba, but only provides FLOPs data, lacking comparisons of parameter count (params) and inference speed (FPS), which is crucial for evaluating the model's practical deployment value.
Response: Thank you for these valuable comments. We believe the FLOPs value reflects real-time performance to some extent, independent of hardware, which is why we recorded this metric initially. We recognize the importance the reviewer places on the other metrics; however, due to time constraints we are unable to supply them at present, for which we sincerely apologize.
Comment 4: The paper also fails to analyze the model's performance in complex scenarios such as low contrast, multi-device overlap, and missed detections of small targets, and does not provide feature map response visualization, making it difficult to assess the model's robustness and failure modes. In summary, the authors should clearly define the correspondence between data, input, and modules in the experimental section to avoid overlapping module functional descriptions. Simultaneously, the authors need to supplement parameter count and FPS comparisons and include comparisons with SSM models such as MambaVision. Additionally, the authors should add discussions of model failure cases and feature map visualizations to improve credibility. The authors should also propose specific technical approaches in their outlook, such as data augmentation strategies, lightweight methods, and time series modeling. Thanks.
Response: We sincerely thank the reviewers for their profound and valuable comments, which have provided important guidance for improving our research. The dataset used in this study comes from external sources, and as mentioned earlier, it does not include overly complex scenes or low-contrast situations, so there were no tests for these scenes in the experiment. However, we also believe that this is crucial for a comprehensive assessment of the robustness and practicality of the model. Therefore, in our future research, we are committed to optimizing the model through targeted adjustments, such as enhancing its adaptability to complex real-world scenarios, improving its performance in low-contrast environments, and conducting relevant experiments to test its performance.
We believe that these efforts will significantly enhance the overall quality and practical value of our research. We sincerely thank the reviewers once again for their strict review and valuable comments.
Round 4
Reviewer 2 Report
Comments and Suggestions for Authors
The reviewers appreciate the authors' careful revisions and supplementary explanations. First, there is still some overlap in the functional definitions between Section 2.1 of the feature extraction module and Section 2.2 of the feature enhancement module. Although the authors emphasized in their response that they had revised the redundant descriptions, the description of the SS2D mechanism in the main text is still scattered between the two modules, failing to clearly distinguish their respective functions: the former should focus on the construction and initial screening of multi-scale features, while the latter should focus on the fusion of cross-layer features and noise suppression. The authors should add a transitional explanation at the end of Section 2.1 or the beginning of Section 2.2 to clarify the division of labor and the connection logic between the two modules.

Second, the paper still does not sufficiently explain why other mainstream SSM vision models, such as MambaVision and S4Vision, were not considered. The authors need to briefly explain the selection scope in the discussion section or the rationale for the method selection, or point out the core advantages of VMamba in that its structure is more suitable for this task, such as the sensitivity of four-way scanning to temperature gradients in infrared images, to enhance the rigor of the paper.

More importantly, the paper still lacks comparative data on parameter count (Params) and inference speed (FPS), both of which are crucial for evaluating the model's deployment value in real-world industrial scenarios. Although the authors explained in their response that they were unable to supplement this data due to time constraints, they should at least clearly state in the "future work" section that they will add these metrics in subsequent research.

Furthermore, while the authors mention that the model achieved 95.41% mAP on the test set, they lack analysis of failure cases or feature map visualization in complex scenarios, such as low contrast and small target misses. This affects a comprehensive assessment of the model's robustness. The authors should include a "limitations analysis" in the discussion, clearly indicating in which scenarios the current model performs poorly, accompanied by a brief analysis of typical failure samples. Thanks.
Author Response
Comment 1: The reviewers appreciate the authors' careful revisions and supplementary explanations. First, there is still some overlap in the functional definitions between Section 2.1 of the feature extraction module and Section 2.2 of the feature enhancement module. Although the authors emphasized in their response that they had revised the redundant descriptions, the description of the SS2D mechanism in the main text is still scattered between the two modules, failing to clearly distinguish their respective functions: the former should focus on the construction and initial screening of multi-scale features, while the latter should focus on the fusion of cross-layer features and noise suppression. The authors should add a transitional explanation at the end of Section 2.1 or the beginning of Section 2.2 to clarify the division of labor and the connection logic between the two modules.
Response: We thank the reviewer for their time. We have once again revised the relevant content and added transitional paragraphs introducing the functions of the two modules.
It should be particularly noted that the core function of the feature extraction module is to construct a hierarchical multi-scale feature representation by initially screening effective information related to faults (such as temperature gradients, edge contours, and global overheating trends), while suppressing irrelevant low-level noise. The multi-scale feature map it outputs will serve as the input basis for the subsequent feature enhancement module, which will further achieve cross-level feature fusion and targeted noise suppression.
2.2. Feature Enhancement Module
As a subsequent step of the feature extraction module, the feature enhancement module focuses on addressing the limitations of single-scale feature representation and the interference of residual background. Although the feature extraction module has completed the hierarchical feature construction from shallow to deep and the preliminary screening of fault information, there is still a lack of effective information interaction among features of different scales, and complex background noise still remains in the feature map. To this end, this module achieves the two core goals of "cross-scale feature integration" and "background noise suppression" by integrating pyramid feature fusion technology with the SE attention mechanism, and forms a complementary collaborative process of "feature construction - feature optimization" with the feature extraction module.
Comment2:Second, the paper still does not sufficiently explain why other mainstream SSM vision models, such as MambaVision and S4Vision, were not considered. The authors should briefly define the selection scope in the discussion section, justify the method selection, or point out the core structural advantages that make VMamba more suitable for this task, such as the sensitivity of its four-way scanning to temperature gradients in infrared images, to enhance the rigor of the paper.
Response: Thank you for the reviewers' comments. We have added the models mentioned by the reviewers to the comparison table and, in the text, described how well the proposed model fits the task, explaining the reasons for choosing it.
The core rationale for selecting VMamba over other SSM variants (such as Mamba-UNet, VMamba-YOLO, MambaVision, and S4Vision) lies in how closely its architectural design matches the core requirements of infrared fault detection for arresters.
From an architectural perspective, VMamba embeds a four-directional cross-scanning SS2D module throughout all network stages. This design enables more comprehensive modeling of the correlations between global and local features in infrared images, which is a critical advantage over the comparative models.
Compared to Mamba-UNet and MambaVision, VMamba's four-directional scanning is particularly sensitive to temperature gradients in infrared images. Arrester fault features typically exhibit distinct gradient variations across multiple directions. The multi-directional scanning of VMamba captures these critical gradient features completely, whereas the unidirectional scanning of Mamba-UNet and MambaVision often misses gradient information along non-scanned directions, leading to incomplete extraction of fault features.

Compared to S4Vision, which focuses on the theoretical generalization of SSMs to visual tasks, VMamba further tailors the SS2D module to the specific characteristics of infrared data. S4Vision employs a generic spatial modeling strategy that lacks targeted adaptation to infrared image properties; in contrast, this study enhances VMamba's SS2D module by integrating temperature-gradient features and physical threshold constraints, making it more compatible with the data characteristics of arrester infrared fault detection. In addition, S4Vision exhibits higher computational complexity in multi-scale feature processing, which fails to meet the real-time requirements of industrial monitoring.

Compared to VMamba-YOLO, which prioritizes lightweight design and speed, VMamba not only retains comparable efficiency but also compensates for VMamba-YOLO's insufficient detection capability for minor faults. By integrating FPN-PAN multi-scale feature fusion and the SE attention mechanism, VMamba can accurately distinguish between normal and faulty arresters even against complex substation backgrounds.
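For readers unfamiliar with SS2D, the following minimal PyTorch sketch illustrates the four-directional cross-scan and merge idea referred to above: the 2-D feature map is unrolled into four 1-D sequences (row-major, column-major, and their reverses) and later folded back. This is an illustrative sketch only; the selective-scan (S6) core that would process each sequence is omitted, and the shapes are hypothetical.

```python
import torch

def cross_scan(x: torch.Tensor) -> torch.Tensor:
    """Unroll a feature map along four scan orders (the SS2D idea):
    row-major, column-major, and the reverses of both. Each order turns
    the 2-D infrared map into a 1-D sequence for the state-space model,
    so temperature gradients are traversed in four directions.
    x: (B, C, H, W) -> (B, 4, C, H*W)
    """
    row = x.flatten(2)                  # left-to-right, top-to-bottom
    col = x.transpose(2, 3).flatten(2)  # top-to-bottom, left-to-right
    return torch.stack([row, col, row.flip(-1), col.flip(-1)], dim=1)

def cross_merge(scans: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Invert each scan order and sum the four sequences back into one map.
    With no per-sequence processing in between, this simply sums the four
    identical views of the input."""
    b, _, c, _ = scans.shape
    row = scans[:, 0] + scans[:, 2].flip(-1)  # undo the reversed row scan
    col = scans[:, 1] + scans[:, 3].flip(-1)  # undo the reversed column scan
    return row.view(b, c, h, w) + col.view(b, c, w, h).transpose(2, 3)

# Hypothetical 14x14 feature map with 96 channels.
x = torch.randn(2, 96, 14, 14)
seqs = cross_scan(x)            # (2, 4, 96, 196): one sequence per direction
# ... each sequence would pass through a selective-scan (S6) block here ...
y = cross_merge(seqs, 14, 14)   # (2, 96, 14, 14)
assert torch.allclose(y, 4 * x) # sanity check: identity scan-merge sums 4 views
```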
Furthermore, VMamba does not rely on additional encoder-decoder structures or anchor-box adaptation mechanisms; it directly balances detection accuracy and computational efficiency through native multi-scale feature extraction and global feature modeling. In contrast, Mamba-UNet suffers from poor real-time performance due to its unidirectional scanning and encoder-decoder overhead; VMamba-YOLO has limited multi-scale expressiveness; MambaVision captures fault features incompletely owing to its unidirectional scanning; and S4Vision adapts poorly to the characteristics of infrared images.
All these limitations make the aforementioned models less suitable for arrester infrared fault detection than VMamba.
Table 1. Horizontal Comparison of Mainstream SSM Visual Models

| Comparison Dimension | Mamba-UNet | VMamba-YOLO | MambaVision | S4Vision | Ours |
|---|---|---|---|---|---|
| Core Architecture | Encoder-Decoder | Lightweight YOLO Architecture | Sequential Scanning SSM (Unidirectional) | General SSM (Single-directional Spatial Modeling) | Multi-stage VSS Blocks + Four-Directional Cross-Scanning |
| Multi-Scale Feature Capture | Weak | Weak | Moderate | Moderate | Strong |
| Anti-Interference in Complex Backgrounds | Weak | Weak | Moderate | Moderate | Strong |
| Industrial Deployment Cost | High | Low | Medium | Medium | Medium |
As indicated in Table 1, while Mamba-UNet’s encoder-decoder structure can model global features, its unidirectional scanning limits multi-scale capture ability, and its real-time performance fails to meet industrial monitoring needs. Although VMamba-YOLO is lightweight (8.9G FLOPs), its simplified feature interaction leads to low detection rates for minor faults (e.g., 10–20 pixel hotspots). MambaVision and S4Vision suffer from incomplete fault feature capture and poor infrared scenario adaptability, respectively. In contrast, VMamba—through four-directional cross-scanning, multi-stage feature extraction, and SE attention enhancement—not only addresses the real-time limitations of Mamba-UNet and the accuracy deficiencies of VMamba-YOLO but also overcomes the shortcomings of MambaVision and S4Vision in feature capture and scenario adaptation. Its architectural design is highly matched to the core requirements of arrester infrared fault detection, making it the optimal choice.
Comment3:More importantly, the paper still lacks comparative data on parameter count (Params) and inference speed (FPS), both of which are crucial for evaluating the model's deployment value in real-world industrial scenarios. Although the authors explained in their response that they were unable to supplement these data due to time constraints, they should at least state clearly in the "future work" section that these metrics will be added in subsequent research.
Response: Thank you to the reviewers for your valuable comments. In the article, we have added the currently missing FPS and parameter-count measurements to the future work section, so that the model's performance can be evaluated comprehensively.
Meanwhile, under the same hardware environment, the parameter counts and frame rates of mainstream models such as YOLOv8, YOLOv11, Swin-Transformer, and MambaVision will be supplemented in future work. A three-dimensional comparison of precision, parameter count, and speed will then verify the industrial deployment value of this model.
Comment4:Furthermore, while the authors mention that the model achieved 95.41% mAP on the test set, they provide no analysis of failure cases or feature-map visualizations for complex scenarios, such as low contrast and missed small targets. This limits a comprehensive assessment of the model's robustness. The authors should include a "limitations analysis" in the discussion, clearly indicating in which scenarios the current model performs poorly, accompanied by a brief analysis of typical failure samples. Thanks.
Response: Thank you to the reviewers for your valuable comments. We have supplemented the article with the characteristics of the failure samples. For such samples, however, we currently have no effective solution for targeted optimization; identifying a suitable optimization direction is planned for future work.
In addition, although the model's accuracy has reached a relatively high level, an examination of the failed recognitions shows that such samples mostly arise when the surge arrester as a whole is only slightly heated. In these cases, the arrester's temperature lies near the threshold edge and shows no obvious temperature difference. Yet this condition often indicates that the arrester is about to fail, which is precisely the most valuable moment for fault identification.
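As a purely illustrative aside (not part of the paper's method), one hypothetical way to flag such near-threshold samples for separate analysis is to compare the temperature rise against the fault threshold with a small margin; the threshold and margin values below are invented for the example.

```python
def is_hard_sample(device_temp_c: float, ambient_temp_c: float,
                   rise_threshold_c: float = 2.0, margin_c: float = 0.5) -> bool:
    """Flag samples whose temperature rise sits within +/- margin of a
    hypothetical fault threshold: these weakly heated arresters dominate
    the failure cases because they show no obvious temperature difference."""
    rise = device_temp_c - ambient_temp_c
    return abs(rise - rise_threshold_c) <= margin_c

# A rise of 2.3 C against a 2.0 C threshold sits at the threshold edge.
print(is_hard_sample(27.3, 25.0))  # True
```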