UCDnet: Double U-Shaped Segmentation Network Cascade Centroid Map Prediction for Infrared Weak Small Target Detection
Round 1
Reviewer 1 Report
The topic selection of the article has certain reference significance. Overall, the analysis of the problem is thorough and persuasive, but some details still need to be polished:
1. This article lacks a background introduction to such massive data represented by the Internet of Things. Writing background refers to the historical background in which the author created this article, and the context in which it was created. Therefore, please optimize the writing background of the article so that readers can gain a deeper understanding of the content.
2. In the introduction section of the article, the author's expression of the writing meaning is somewhat vague and difficult to understand. Please reorganize the language to express it so that readers can quickly understand the writing meaning and purpose of the article.
3. This article introduces the research of multiple scholars and explains their research methods. However, the literature section should emphasize the shortcomings of the various methods in order to motivate the newly proposed method. It would be better to add the following references to enrich the work and emphasize the role of remote sensing:
10.1016/j.media.2020.101949
10.1108/AEAT-02-2020-0030
10.1016/j.jksuci.2019.10.014
10.1109/JSTARS.2022.3188732
4. For evaluation purposes, the text should include a discussion section that compares the results with existing literature. There is a large amount of literature related to agricultural data, which should be relevant to the proposed research results and discussed.
5. The conclusion of this paper still provides a lot of background information, which obviously does not meet the requirements of conclusion writing. In addition, this article did not explain the shortcomings of the experimental section and the direction of future research.
6. Kindly add future work also.
Author Response
Cover Letter
Dear Reviewer:
Thank you for your comments concerning our manuscript entitled “UCDnet: Double U-shaped Segmentation Network cascade Centroid-map Prediction for Infrared Weak Small Target Detection” (ID: remotesensing-2516960). Those comments are all valuable and very helpful for revising and improving our paper, and they provide important guidance for our research. We have studied the comments carefully and have made corrections, which we hope will meet with approval. We used the “highlight” function in Microsoft Word in our revised file so that the changes are easily visible to you. The main corrections in the paper and the responses to your comments are as follows:
Point1: This article lacks a background introduction to such massive data represented by the Internet of Things. Writing background refers to the historical background in which the author created this article, and the context in which it was created. Therefore, please optimize the writing background of the article so that readers can gain a deeper understanding of the content.
Response: Thank you very much for your professional suggestion. Based on your suggestion, we have optimized the writing background in the Introduction (section 1) and made corresponding modifications to the Abstract, aiming to facilitate a better understanding of our paper for you and other researchers. The specific changes and additions are as follows:
Changes to the Abstract:
In recent years, the development of deep learning has brought great convenience to the work of target detection, semantic segmentation, and object recognition. In the field of infrared weak small target detection (e.g., surveillance and reconnaissance), it is not only necessary to accurately detect targets but also to perform precise segmentation and sub-pixel-level centroid localization for infrared small targets with low signal-to-noise ratio and weak texture information. To address these issues, we propose UCDnet…
Revisions and Additions to Introduction (Section 1):
Infrared small target detection technology, as the main technical support for surveillance and reconnaissance [1-2], precise localization [3-5], and attitude estimation [6], has been widely applied in various fields. In the field of surveillance and reconnaissance, such as drone tracking and search and rescue operations, not only the rough detection of infrared targets is required, but also precise segmentation and centroid localization of the targets. This is essential for effective prediction of target motion trajectories, thereby enabling early warning and appropriate measures. In the field of the Internet of Things (IoT), such as smart transportation and smart agriculture, precise detection is a prerequisite for achieving perception and decision-making. In the aforementioned application scenarios, targets often occupy a small number of pixels in the image and suffer from issues such as texture loss and low signal-to-noise ratio.
Challenges:
On one hand, as the distance between the imaging device and the target increases, the size of the target in the image becomes smaller, and the target appears in a faint and weak state. Existing semantic segmentation algorithms perform well in segmenting large-sized or clearly bounded targets, but they show limited effectiveness in segmenting infrared small targets with low signal-to-noise ratio and weak texture information.
On the other hand, existing infrared weak small target detection methods focus on bounding box detection and pay less attention to the problem of accurate centroid localization of these targets. When the infrared small targets have irregular shapes and undergo continuous changes in posture, the center of the bounding box cannot effectively represent the centroid of the target. Existing deep learning-based methods for predicting target centroids are still limited to pixel-level predictions, only able to predict which pixel the centroid belongs to in the image. In certain specific scenarios, a deviation of one pixel in predicting the target centroid can result in several meters of actual distance, and this deviation may be further amplified as the process continues.
Point2: In the introduction section of the article, the author's expression of the writing meaning is somewhat vague and difficult to understand. Please reorganize the language to express it so that readers can quickly understand the writing meaning and purpose of the article.
Response: Thank you very much for your valuable review. We apologize for the unclear presentation and the inadequate description of the purpose of the paper. First, based on your suggestions, we have made changes to the Introduction (section 1). In the first paragraph, we have provided a clearer description of the research background and application scenarios. In the second paragraph, we have elaborated on the technical challenges that the paper aims to address. The specific revisions and additions are as follows:
Infrared small target detection technology, as the main technical support for surveillance and reconnaissance [1-2], precise localization [3-5], and attitude estimation [6], has been widely applied in various fields. In the field of surveillance and reconnaissance, such as drone tracking and search and rescue operations, not only the rough detection of infrared targets is required, but also precise segmentation and centroid localization of the targets. This is essential for effective prediction of target motion trajectories, thereby enabling early warning and appropriate measures. In the field of the Internet of Things (IoT), such as smart transportation and smart agriculture, precise detection is a prerequisite for achieving perception and decision-making. In the aforementioned application scenarios, targets often occupy a small number of pixels in the image and suffer from issues such as texture loss and low signal-to-noise ratio.
Challenges:
On one hand, as the distance between the imaging device and the target increases, the size of the target in the image becomes smaller, and the target appears in a faint and weak state. Existing semantic segmentation algorithms perform well in segmenting large-sized or clearly bounded targets, but they show limited effectiveness in segmenting infrared small targets with low signal-to-noise ratio and weak texture information.
On the other hand, existing infrared weak small target detection methods focus on bounding box detection and pay less attention to the problem of accurate centroid localization of these targets. When the infrared small targets have irregular shapes and undergo continuous changes in posture, the center of the bounding box cannot effectively represent the centroid of the target. Existing deep learning-based methods for predicting target centroids are still limited to pixel-level predictions, only able to predict which pixel the centroid belongs to in the image. In certain specific scenarios, a deviation of one pixel in predicting the target centroid can result in several meters of actual distance, and this deviation may be further amplified as the process continues.
Next, we have revised the Abstract, and the content is as follows:
In recent years, the development of deep learning has brought great convenience to the work of target detection, semantic segmentation, and object recognition. In the field of infrared weak small target detection (e.g., surveillance and reconnaissance), it is not only necessary to accurately detect targets but also to perform precise segmentation and sub-pixel-level centroid localization for infrared small targets with low signal-to-noise ratio and weak texture information. To address these issues, we propose UCDnet…
Point3: This article introduces the research of multiple scholars and explains their research methods. However, the literature section should emphasize the shortcomings of the various methods in order to motivate the newly proposed method. It would be better to add the following references to enrich the work and emphasize the role of remote sensing:
10.1016/j.media.2020.101949
10.1108/AEAT-02-2020-0030
10.1016/j.jksuci.2019.10.014
10.1109/JSTARS.2022.3188732
Response: Thank you for your suggestions. We have read the references you provided, and they are highly relevant to the research content of our paper. We have added these references to relevant sections in the paper.
10.1016/j.media.2020.101949, discussion section on page 18, reference [33].
10.1108/AEAT-02-2020-0030, introduction section on page 1, reference [6].
10.1016/j.jksuci.2019.10.014, discussion section on page 18, reference [34].
10.1109/JSTARS.2022.3188732, conclusion section on page 19, reference [35].
Point4: For evaluation purposes, the text should include a discussion section that compares the results with existing literature. There is a large amount of literature related to agricultural data, which should be relevant to the proposed research results and discussed.
Response: Thank you very much for your professional suggestion. Based on your suggestion, we have conducted additional experiments on an agricultural field dataset for olive fruit semantic segmentation (as shown in Appendix A.3 and Figure A2 in Appendix A) to validate the performance of our algorithm in an agricultural setting. The results and analysis of these experiments have been included in the Discussion (section 3.2.5).
Supplementary experiments in Appendix A.3:
Figure A2. The first row in the figure shows the semantic segmentation results of motor magnetic tile defects, the second row presents the semantic segmentation results of cells, and the third row exhibits the semantic segmentation results of olive fruits.
Changes and additions to the Discussion (section 3.2.5):
Additionally, to verify the applicability and generalization ability of our proposed algorithm beyond the field of surveillance and reconnaissance, we evaluated our algorithm on three different publicly available datasets from diverse fields. On an industrial field dataset for motor magnetic tile defect detection, our algorithm demonstrated excellent segmentation results even when facing weak texture information of motor magnetic defects (see experimental results in the first row of Figure A2 in Appendix A). On a medical field dataset for cell semantic segmentation [33-34], our algorithm accurately extracted edges of irregularly shaped cells (see experimental results in the second row of Figure A2 in Appendix A). On an agricultural field dataset for olive fruit semantic segmentation, our algorithm achieved precise segmentation for small-sized olive fruits, remaining unaffected by factors like crowding and occlusion (see experimental results in the third row of Figure A2 in Appendix A). These supplementary experiments further demonstrate the high robustness of our proposed algorithm. Our algorithm proves to be effective not only on our simulated dataset but also on various datasets from other fields, showcasing its excellent performance.
Point5: The conclusion of this paper still provides a lot of background information, which obviously does not meet the requirements of conclusion writing. In addition, this article did not explain the shortcomings of the experimental section and the direction of future research.
Response: Thank you very much for your professional suggestion. We apologize for our unclear description. Based on your suggestion, we have added the shortcomings of the experimental section in the Discussion (section 3.2.5). Additionally, we have made changes to the Conclusions (section 4) of this paper by removing background information and adding the direction of future research.
Additions to the Discussion (Section 3.2.5):
The proposed algorithm in this paper also has some limitations: (1) Point targets in infrared images are prone to confusion with high-frequency noise and similar objects. The method proposed in this paper focuses on fine segmentation and localization but requires further improvement in false alarm removal. (2) Infrared weak small targets have low signal-to-noise ratio and can easily be overwhelmed by background clutter. The algorithm in this paper may experience missed detections when the target signal-to-noise ratio is below 1. This is unacceptable for applications such as security monitoring and autonomous driving, which demand high safety requirements. These limitations will be the focus of our future work.
Changes and Additions to Conclusions (Section 4):
This paper presents a novel method called UCDnet for infrared weak small target detection and centroid localization. The innovation of our proposed semantic segmentation subnet lies in the improved U-Net and DNAnet structures, with the integration of attention modules. The double U-shaped backbone feature extraction network in our approach enables more accurate segmentation of target edges, while the addition of attention modules improves the capturing of target positional information. The innovation of our centroid detection subnet lies in the ability to overcome the constraint of unit pixel size in the original image, achieving sub-pixel-level centroid localization and minimizing the difference between predicted and ground truth centroid positions. Extensive comparative experiments demonstrate the superiority of our proposed semantic segmentation and centroid localization methods in terms of detection precision and robustness compared to existing mainstream methods.
In the future, we plan to expand our research in three directions: (1) How to use the motion information of targets in sequential images to reduce false alarms. The algorithm proposed in this paper is based on single-frame image for target detection, which may encounter challenges in removing false alarms in certain low-quality images or complex scenes. We will explore the temporal characteristics of targets in sequential images, combining spatial and temporal features to improve target discriminability. (2) How to construct an integrated network for enhancement and detection to address the issue of missed detections when the target is extremely weak. In our future work, we will embed more efficient enhancement modules or super-resolution reconstruction modules into the detection network to enhance target features and capture target details, aiming to achieve or even surpass the human eye's detection limit. (3) How to utilize multi-source image information fusion for handling weak small targets [35]. The infrared images used in this paper are obtained from thermal radiation imaging. Due to the technological limitations of sensors, issues such as loss of target texture information, high noise, and low signal-to-noise ratio may exist. In the future, we will develop a weak small target detection network based on multi-source data, incorporating data from multiple platforms and payloads to achieve stable and continuous detection and localization capabilities.
Point6: Kindly add future work also.
Response: Thank you very much for your professional suggestion. We sincerely apologize for the inadequate description regarding future work in our paper. We have addressed this issue by adding a paragraph on future work in the Conclusions (section 4) of the paper. The specific details are as follows:
In the future, we plan to expand our research in three directions: (1) How to use the motion information of targets in sequential images to reduce false alarms. The algorithm proposed in this paper is based on single-frame image for target detection, which may encounter challenges in removing false alarms in certain low-quality images or complex scenes. We will explore the temporal characteristics of targets in sequential images, combining spatial and temporal features to improve target discriminability. (2) How to construct an integrated network for enhancement and detection to address the issue of missed detections when the target is extremely weak. In our future work, we will embed more efficient enhancement modules or super-resolution reconstruction modules into the detection network to enhance target features and capture target details, aiming to achieve or even surpass the human eye's detection limit. (3) How to utilize multi-source image information fusion for handling weak small targets [35]. The infrared images used in this paper are obtained from thermal radiation imaging. Due to the technological limitations of sensors, issues such as loss of target texture information, high noise, and low signal-to-noise ratio may exist. In the future, we will develop a weak small target detection network based on multi-source data, incorporating data from multiple platforms and payloads to achieve stable and continuous detection and localization capabilities.
We have tried our best to improve the manuscript and have made the changes described above.
We sincerely appreciate the reviewers' earnest work and hope that the corrections will meet with approval.
Once again, thank you very much for your comments and suggestions.
Author Response File: Author Response.pdf
Reviewer 2 Report
Loss of spatial precision can occur during the mapping of pixel-level predictions onto a centroid-map. Although it might not be able to depict the exact boundaries or minute features of the targets, the centroid-map approximates where the targets' centres are located. Particularly for items with variable shapes or boundaries, this loss of spatial accuracy can affect the overall performance of detection.
The usage of multiple U-shaped networks sequentially is implied by the double U-shaped segmentation network cascade technique. For real-time applications in particular, this increasing complexity may result in larger processing demands, which would slow down and reduce the effectiveness of the detection process. It's possible that the training and inference processes will take a lengthy period, needing a lot of computer power.
When a model overfits, it becomes overly focused on the training set and performs poorly on unseen data. What steps were taken to address this problem? Appropriate regularization methods and a substantial volume of diverse training data should be used.
The double U-shaped segmentation network cascade approach introduces additional hyperparameters that need to be carefully tuned. It can be difficult to determine the best configuration and parameters for each network in the cascade.
The Dice loss function does not give gradient information on the localization or spatial extent of the segmentation; it only considers the overlap or similarity between the predicted and ground truth masks. This restriction may make it difficult for the model to learn accurate boundary localization, particularly when precise boundary delineation is essential.
Making assumptions regarding the form and limits of objects is necessary for estimating sub-pixel centroids. Sub-pixel-level localization may create more ambiguity or inaccuracies when objects have irregular or unclear boundaries. Ambiguity in object boundaries can be caused by factors such as partial occlusion, object overlap, or inherent image noise.
Sub-pixel-level localization is influenced by the resolution and quality of the input images. How do you overcome this?
Author Response
Cover Letter
Dear Reviewer:
Thank you for your comments concerning our manuscript entitled “UCDnet: Double U-shaped Segmentation Network cascade Centroid-map Prediction for Infrared Weak Small Target Detection” (ID: remotesensing-2516960). Those comments are all valuable and very helpful for revising and improving our paper, and they provide important guidance for our research. We have studied the comments carefully and have made corrections, which we hope will meet with approval. We used the “highlight” function in Microsoft Word in our revised file so that the changes are easily visible to you. The main corrections in the paper and the responses to your comments are as follows:
Point1: Loss of spatial precision can occur during the mapping of pixel-level predictions onto a centroid-map. Although it might not be able to depict the exact boundaries or minute features of the targets, the centroid-map approximates where the targets' centres are located. Particularly for items with variable shapes or boundaries, this loss of spatial accuracy can affect the overall performance of detection.
Response: Thank you very much for your professional suggestion. The proposed method in this paper focuses on semantic segmentation and centroid localization of infrared weak small targets. Typically, small targets in infrared images exhibit a scattered pattern, and even if the shape of the target resembles that of a tadpole, the pixel values at the target centroid are significantly higher than those at the boundaries. In the centroid localization subnet, its final output is the result of multiplying the semantic segmentation mask and the centroid map. As long as the semantic segmentation subnet can predict the approximate shape of the target, the centroid localization subnet can accurately determine the centroid position of the target.
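The multiply-then-localize idea in this response can be illustrated with a minimal sketch. It is not the authors' centroid localization subnet: it assumes the mask and the centroid map are available as small nested lists, and it uses an intensity-weighted mean, a common way to obtain sub-pixel coordinates, as a stand-in for the final localization step. The function name `weighted_centroid` and the toy values are hypothetical.

```python
def weighted_centroid(mask, centroid_map):
    """Return the (row, col) intensity-weighted centroid of mask * centroid_map.

    mask: 2-D list of 0/1 values (assumed segmentation output).
    centroid_map: 2-D list of non-negative floats (assumed centroid-map output).
    """
    total = 0.0
    row_sum = 0.0
    col_sum = 0.0
    for r, (mask_row, map_row) in enumerate(zip(mask, centroid_map)):
        for c, (m, v) in enumerate(zip(mask_row, map_row)):
            w = m * v  # element-wise product removes responses outside the mask
            total += w
            row_sum += r * w
            col_sum += c * w
    if total == 0.0:
        return None  # nothing survived the masking
    return row_sum / total, col_sum / total


mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
centroid_map = [
    [0.9, 0.0, 0.0, 0.0],  # spurious response outside the mask
    [0.0, 1.0, 3.0, 0.0],
    [0.0, 1.0, 3.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
centroid = weighted_centroid(mask, centroid_map)
print(centroid)
```

In this toy case the spurious response at the top-left corner is discarded by the mask, and the estimate falls between pixel centers because the right-hand column of the target carries more weight.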
Point2: The usage of multiple U-shaped networks sequentially is implied by the double U-shaped segmentation network cascade technique. For real-time applications in particular, this increasing complexity may result in larger processing demands, which would slow down and reduce the effectiveness of the detection process. It's possible that the training and inference processes will take a lengthy period, needing a lot of computer power.
Response: In the initial design, we considered the complexity of the model. Compared to the U-Net network, the network architecture proposed in our paper is more complex. However, our network has fewer channels in the feature maps output by each layer. In the U-Net network, the feature maps output by the successive layers have 64, 128, 256, 512, and 1024 channels, respectively. In contrast, the corresponding layers of our proposed network have 16, 32, 64, 128, and 256 channels. Our network focuses more on extracting features of the target through dense connections. We have included the evaluation metric GFLOPs in the comparison table to demonstrate the model's complexity. The results show that our proposed model has lower complexity compared to the traditional U-Net network.
Table 1. Comparison of our semantic segmentation algorithm with five other algorithms.
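A back-of-the-envelope sketch can show why the narrower channel widths quoted in the response reduce cost. It is not the GFLOPs measurement reported in Table 1: it assumes one hypothetical 3×3 convolution per stage with equal input and output width, a 256×256 input, and resolution halved at each stage. The dense connections of the real network add layers, so the actual saving is smaller than this idealized ratio.

```python
def conv_params(c_in, c_out, kernel=3):
    """Weight count of one kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def conv_macs(c_in, c_out, out_h, out_w, kernel=3):
    """Multiply-accumulate operations of that convolution on an out_h x out_w map."""
    return conv_params(c_in, c_out, kernel) * out_h * out_w


def stage_macs(widths, input_size=256):
    """Sum the cost of one assumed conv per stage, halving resolution each stage."""
    total = 0
    size = input_size
    for width in widths:
        total += conv_macs(width, width, size, size)
        size //= 2
    return total


unet_widths = [64, 128, 256, 512, 1024]    # widths quoted for U-Net
narrow_widths = [16, 32, 64, 128, 256]     # widths quoted for the proposed network

ratio = stage_macs(unet_widths) / stage_macs(narrow_widths)
print(ratio)
```

Because the cost of a convolution scales with the product of input and output channels, dividing every width by 4 divides the cost of each such layer by 16.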
Point3: When a model overfits, it becomes overly focused on the training set and performs poorly on unseen data. What steps were taken to address this problem? Appropriate regularization methods and a substantial volume of diverse training data should be used.
Response: Thank you very much for your professional suggestion. As you mentioned, L1 regularization, L2 regularization, and dropout layers are effective methods for preventing neural network overfitting. The most effective approach to prevent overfitting is to increase the volume of data. During the training process, we applied various data augmentation techniques (e.g., image flipping, rotation, adjusting image brightness, and contrast) to prevent network overfitting. In the future, we plan to simulate a larger-scale infrared weak small target dataset for training purposes. Additionally, we will pretrain our model on other large-scale datasets to enhance the network's generalization capability. We will also adjust the complexity of the network based on the loss and IoU during the network's training and testing processes. Thank you for your valuable suggestions.
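The augmentations listed in this response can be sketched as follows. This is a minimal illustration on nested lists, not the authors' training pipeline; the function names, the linear gain/bias model for brightness and contrast, and the 8-bit value range are assumptions.

```python
def flip_horizontal(img):
    """Mirror each row left to right."""
    return [row[::-1] for row in img]


def rotate_90_clockwise(img):
    """Rotate the image by 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def adjust_brightness_contrast(img, gain=1.0, bias=0.0, max_value=255):
    """Map each pixel to gain * pixel + bias, clipped to [0, max_value]."""
    return [[min(max_value, max(0, gain * p + bias)) for p in row] for row in img]


image = [
    [10, 20],
    [30, 40],
]
flipped = flip_horizontal(image)
rotated = rotate_90_clockwise(image)
brighter = adjust_brightness_contrast(image, gain=2.0, bias=200)
print(flipped, rotated, brighter)
```

For segmentation, geometric transforms such as the flip and rotation must be applied identically to the label mask, whereas brightness and contrast changes apply to the image only.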
Point4: The double U-shaped segmentation network cascade approach introduces additional hyperparameters that need to be carefully tuned. It can be difficult to determine the best configuration and parameters for each network in the cascade.
Response: The proposed network in this paper incorporates not only conventional hyperparameters, such as learning rate, batch size, and optimizer, but also includes the loss weights for the semantic segmentation subnet and the centroid localization subnet. Adjusting these hyperparameters requires a certain level of experience in training neural networks. During the training process, we adjust the learning rate based on the rate of loss curve descent and the complexity of the model. We modify the batch size according to the dataset size to allow frequent parameter updates. The selection of an appropriate optimizer is based on the training speed and gradient stability. Additionally, we fine-tune the loss weights of the two subnets based on their respective loss and IoU curves. Through extensive experimentation, we have determined these hyperparameters to enhance network performance and expedite convergence.
Point5: The Dice loss function does not give gradient information on the localization or spatial extent of the segmentation; it only considers the overlap or similarity between the predicted and ground truth masks. This restriction may make it difficult for the model to learn accurate boundary localization, particularly when precise boundary delineation is essential.
Response: Thank you very much for your professional review. In our previous experiments, we also observed some of the issues with the Dice loss that you mention. However, the main research subject of this paper is infrared targets, whose boundaries are inherently blurred due to the thermal imaging principle. This makes it challenging to precisely determine the target boundaries. We initially chose Dice loss because it is sensitive to small targets and has high computational efficiency. In addition, the final output of the centroid localization subnet is the result of multiplying the semantic segmentation mask and the centroid map. As long as the semantic segmentation subnet can predict the approximate shape of the target, the centroid localization subnet can accurately determine the centroid position of the target. We sincerely appreciate the valuable suggestions you provided, and in the future, we will make improvements to the loss function of our semantic segmentation part based on your suggestion.
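The reviewer's point about Dice loss can be made concrete with a small sketch. It uses the standard soft Dice formulation on flattened masks; the smoothing constant `eps` and the toy masks are assumptions, and the code is not taken from the paper.

```python
def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 1 - (2 * |P ∩ G| + eps) / (|P| + |G| + eps)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


target = [0, 1, 1, 1, 1, 0]           # ground truth: four target pixels
pred_miss_edge = [0, 0, 1, 1, 1, 0]   # misses a boundary pixel
pred_miss_inner = [0, 1, 0, 1, 1, 0]  # misses an interior pixel

loss_edge = dice_loss(pred_miss_edge, target)
loss_inner = dice_loss(pred_miss_inner, target)
print(loss_edge, loss_inner)
```

Both predictions overlap the ground truth in three of four pixels, so they receive the same loss even though the errors sit in different places, which is the behavior the reviewer describes.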
Point6: Making assumptions regarding the form and limits of objects is necessary for estimating sub-pixel centroids. Sub-pixel-level localization may create more ambiguity or inaccuracies when objects have irregular or unclear boundaries. Ambiguity in object boundaries can be caused by factors such as partial occlusion, object overlap, or inherent image noise.
Response: Our proposed semantic segmentation network for single-frame infrared images may encounter false alarms or missed detections in certain scenarios. The issue you raised aligns with our upcoming research direction. We plan to focus on the semantic segmentation of sequential images. By utilizing a Generative Adversarial Network (GAN) combined with the historical information of the targets, we aim to establish a target model that can compensate for partially occluded or overlapped targets by reconstructing their overall shapes. The inherent noise in the images generally remains consistent in terms of its position. A target detection network for sequential images can accurately differentiate targets from noise based on the motion information. In the future plan section of our main text, we have also incorporated research on sequential image object detection. If we can design such a neural network model, it would represent a significant breakthrough in the field of infrared weak small target detection.
Point7: Sub-pixel-level localization is influenced by the resolution and quality of the input images. How do you overcome this?
Response: Thank you very much for your review. Before training the neural network, we perform data augmentation to enhance the training process. For example, we resize the images to change their resolution, adjust the brightness and contrast to create blurrier images, and apply image flipping or rotation to alter the shape of the objects. Additionally, if computational resources allow, we can utilize super-resolution reconstruction techniques to enhance the resolution of input images. Our proposed network consists of several residual-like structures, which contribute to improving the robustness of the network to a certain extent.
We have tried our best to improve the manuscript and have made the changes described above.
We sincerely appreciate the reviewers' earnest work and hope that the corrections will meet with approval.
Once again, thank you very much for your comments and suggestions.
Author Response File: Author Response.pdf
Reviewer 3 Report
Please kindly find the attached file.
Comments for author File: Comments.pdf
Author Response
Cover Letter
Dear Reviewer:
Thank you for your comments concerning our manuscript entitled “UCDnet: Double U-shaped Segmentation Network cascade Centroid-map Prediction for Infrared Weak Small Target Detection” (ID: remotesensing-2516960). Those comments are all valuable and very helpful for revising and improving our paper, and they provide important guidance for our research. We have studied the comments carefully and have made corrections, which we hope will meet with approval. We used the “highlight” function in Microsoft Word in our revised file so that the changes are easily visible to you. The main corrections in the paper and the responses to your comments are as follows:
Point1: Please provide a block diagram for the procedures of data simulation.
Response: Thank you very much for your meaningful review. We added a section (Appendix A.1) and a block diagram (Figure A1 in Appendix A) in our paper to illustrate the overall workflow of our data simulation. The overall steps include simulating background images, adding noise, designing target motion trajectories, adding targets to the background, randomly cropping image patches with targets, and generating semantic segmentation labels. The block diagram is shown in the following figure:
Figure A1. The schematic diagram of the simulation process for the proposed dataset of infrared weak small targets in this paper
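The simulation steps listed in the response above (background, noise, trajectory, target insertion, cropping, labelling) can be sketched roughly as follows. This is only a toy version under stated assumptions: the gradient background, the Gaussian target shape, the half-peak labelling threshold, the constant-velocity track, and all function names and default values are hypothetical and are not taken from Appendix A.1.

```python
import math
import random


def linear_trajectory(start, velocity, n_frames):
    """Sub-pixel target positions for a constant-velocity track."""
    (x0, y0), (vx, vy) = start, velocity
    return [(x0 + vx * t, y0 + vy * t) for t in range(n_frames)]


def simulate_patch(center, size=64, patch=32, sigma=1.0, peak=60.0,
                   noise_std=5.0, seed=0):
    """Return (image_patch, label_patch) for one simulated frame."""
    rng = random.Random(seed)
    cx, cy = center
    # Steps 1-2: background (a horizontal gradient here) plus Gaussian noise.
    image = [[100.0 + 50.0 * x / size + rng.gauss(0.0, noise_std)
              for x in range(size)] for _ in range(size)]
    # Step 4: add a small Gaussian-shaped target at the sub-pixel centre.
    target = [[peak * math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
               for x in range(size)] for y in range(size)]
    for y in range(size):
        for x in range(size):
            image[y][x] += target[y][x]
    # Step 5: choose a random crop window that still contains the target.
    ix, iy = int(cx), int(cy)
    left = rng.randint(max(0, ix - patch + 1), min(size - patch, ix))
    top = rng.randint(max(0, iy - patch + 1), min(size - patch, iy))
    rows = range(top, top + patch)
    cols = range(left, left + patch)
    image_patch = [[image[y][x] for x in cols] for y in rows]
    # Step 6: label pixels where the target exceeds half of its peak value.
    label_patch = [[int(target[y][x] > 0.5 * peak) for x in cols] for y in rows]
    return image_patch, label_patch


# Step 3: design a target trajectory, then simulate one patch per frame.
track = linear_trajectory(start=(20.3, 31.7), velocity=(0.5, -0.25), n_frames=3)
patches = [simulate_patch(c, seed=i) for i, c in enumerate(track)]
```

Generating the label from the noise-free target layer, rather than from the final image, keeps the ground truth independent of the background and noise realisation.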
Point2: Please kindly present a characteristics table and mention all of the applied elements of the proposed model in detail.
Response: Thank you very much for your review. In the appendix of our paper, we have added a section (Appendix A.2) and a characteristics table (Table A1 in Appendix A) listing the details of the Res_Attention block, Downsample block, Upsample_n block, and Multiply block used in our proposed network. These details include kernel size, stride, padding, and upsampling and downsampling ratios, among others. We appreciate your feedback, which has made our paper more comprehensive.
Table A1. Constituent elements of each block in the proposed network architecture.
Point3: You have presented some metrics to evaluate the performance of the proposed model. As you know, for models with many hyperparameters, such as deep neural networks, you have to evaluate them using different criteria such as accuracy, sensitivity, specificity, F-score, precision, recall, and so on. Please kindly provide more information about the aforementioned metrics to prove the efficiency of the suggested model.
Response: Thank you very much for your professional suggestion. We have added the F1-score as an evaluation metric for assessing the overall performance of the network (see Table 1 in section 3.2.2). We have also added the metric "GFLOPs" to evaluate the complexity of the model. The metric "Sensitivity" is equivalent to the metric "Recall", which we already use. We completely agree that the remaining criteria are appropriate for medium or large-sized targets. However, in the proposed dataset of infrared weak small targets, the image size is 256×256 pixels, and the targets are typically within 10×10 pixels. "Accuracy" is defined as (TP+TN)/(TP+FN+FP+TN), and "Specificity" is defined as TN/(TN+FP). In our detection results, TN is close to TP+FN+FP+TN, so both metrics are close to 1 for almost any segmentation result. Since the "Accuracy" and "Specificity" of all the other semantic segmentation methods except MLCM exceed 0.999, these two metrics do not distinguish between the methods, and we did not include them as criteria for evaluating the performance of our model.
Table 1. Comparison of our semantic segmentation algorithm with five other algorithms
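To make the argument about "Accuracy" and "Specificity" concrete, the short sketch below computes the metrics named in the response from pixel-level confusion counts. The counts are hypothetical (a deliberately poor detector on one 256×256 image with a 10×10 target) and are not results from the paper; the function name is likewise an assumption.

```python
def segmentation_metrics(tp, fp, fn, tn):
    """Pixel-level metrics from confusion-matrix counts (non-zero denominators assumed)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "specificity": tn / (tn + fp),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "iou": tp / (tp + fp + fn),
    }


# Hypothetical counts: the detector finds half of the 100 target pixels
# and raises 50 false-positive pixels elsewhere in the image.
total = 256 * 256
tp, fn, fp = 50, 50, 50
metrics = segmentation_metrics(tp, fp, fn, tn=total - tp - fn - fp)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, accuracy and specificity both exceed 0.998 even though the F1-score is only 0.5 and the IoU is about 0.33, which is why the two TN-dominated metrics say little about small-target segmentation quality.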
Point4: How have you created interconnection among the elements of the suggested model (It is not clear)? Please explain them deeply.
Response: Thank you very much for giving us the chance to explain again. As shown in the figure below, the gray lines indicate that no operation is performed on the feature maps, the blue lines represent feature maps being fed into the Res_Attention block, the green lines indicate downsampling of the input feature maps, the red lines represent upsampling of the input feature maps, and the brown lines represent element-wise multiplication of two feature maps at corresponding spatial positions. Taking the purple box as an example, the feature map FM1_2 is upsampled and concatenated with the feature maps FM0_0 and FM0_1 along the channel dimension; the result is then fed into the Res_Attention block to obtain the feature map FM0_2.
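The fusion described for the purple box can be sketched on toy feature maps as follows. This is an illustrative sketch only: feature maps are stored as nested [channel][row][column] lists, and the channel counts, spatial sizes, nearest-neighbour upsampling mode, concatenation order, and helper names are all assumptions. The Res_Attention block itself is not reproduced here.

```python
def upsample_nearest(fm, factor=2):
    """Nearest-neighbour upsampling of a feature map stored as [C][H][W] lists."""
    return [
        [[row[x // factor] for x in range(len(row) * factor)]
         for row in channel for _ in range(factor)]
        for channel in fm
    ]


def concat_channels(*fms):
    """Concatenate feature maps along the channel dimension."""
    height, width = len(fms[0][0]), len(fms[0][0][0])
    for fm in fms:
        if len(fm[0]) != height or len(fm[0][0]) != width:
            raise ValueError("spatial sizes must match before concatenation")
    return [channel for fm in fms for channel in fm]


def make_fm(channels, height, width, value):
    """Build a constant feature map for demonstration purposes."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]


fm0_0 = make_fm(2, 4, 4, 0.0)
fm0_1 = make_fm(2, 4, 4, 1.0)
fm1_2 = make_fm(4, 2, 2, 2.0)  # one level deeper: half the spatial resolution

# Upsample FM1_2 to the resolution of level 0, then stack along channels.
fused = concat_channels(fm0_0, fm0_1, upsample_nearest(fm1_2))
# `fused` (8 channels, 4x4) would then pass through Res_Attention to give FM0_2.
```

The explicit size check mirrors the reason the upsampling step is needed: feature maps from different levels can only be concatenated once their spatial dimensions agree.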
Point5: What kind of software programs and hardware tools have you applied to implement the proposed idea? Please kindly list them with full information.
Response: Thank you very much for your review. We apologize for the unclear and inadequate description. We have now added this content in the Experimental Setup (section 3.2.1). The primary hardware device was an NVIDIA RTX 3090 graphics card with 24 GB of VRAM. For software, we used the PyTorch deep learning framework, Matplotlib for plotting, OpenCV for image processing, and NumPy for scientific computing.
Point6: The section of the Discussion is weak. Please kindly modify it.
Response: Thank you very much for your review. We recognize that the Discussion section was weak. We have made significant revisions to our Discussion (section 3.2.5), and the content is as follows:
Currently, numerous classical target detection algorithms have been applied in the field of infrared weak small target detection. We selected five representative algorithms, namely MLCM, U-Net, HRNet, MTU-Net, and DNANet, for comparative experiments with our proposed method, and obtained outstanding results. The IoU of our semantic segmentation results reached a leading level because of the proposed double U-shaped structure with attention mechanism. Furthermore, we innovatively introduced a deep learning-based approach for sub-pixel-level centroid localization, significantly improving the centroid localization precision compared to existing methods. Extensive experiments validated that our proposed algorithm exhibits favorable performance and robustness.
Additionally, to verify the applicability and generalization ability of our proposed algorithm beyond the field of surveillance and reconnaissance, we evaluated our algorithm on three different publicly available datasets from diverse fields. On an industrial field dataset for motor magnetic tile defect detection, our algorithm demonstrated excellent segmentation results even when facing weak texture information of motor magnetic defects (see experimental results in the first row of Figure A2 in Appendix A). On a medical field dataset for cell semantic segmentation [33-34], our algorithm accurately extracted edges of irregularly shaped cells (see experimental results in the second row of Figure A2 in Appendix A). On an agricultural field dataset for olive fruit semantic segmentation, our algorithm achieved precise segmentation for small-sized olive fruits, remaining unaffected by factors like crowding and occlusion (see experimental results in the third row of Figure A2 in Appendix A). These supplementary experiments further demonstrate the high robustness of our proposed algorithm. Our algorithm proves to be effective not only on our simulated dataset but also on various datasets from other fields, showcasing its excellent performance.
The proposed algorithm in this paper also has some limitations: (1) Point targets in infrared images are prone to confusion with high-frequency noise and similar objects. The method proposed in this paper focuses on fine segmentation and localization but requires further improvement in false alarm removal. (2) Infrared weak small targets have low signal-to-noise ratio and can easily be overwhelmed by background clutter. The algorithm in this paper may experience missed detections when the target signal-to-noise ratio is below 1. This is unacceptable for applications such as security monitoring and autonomous driving, which demand high safety requirements. These limitations will be the focus of our future work.
Point7: Please try to bring some samples related to the proposed idea in the real world such as industry, economics, and so on.
Response: Thank you very much for your professional suggestion. Based on your advice, we have conducted additional experiments on an industrial field dataset for motor magnetic tile defect detection (as shown in Appendix A.3 and Figure A2 in Appendix A) to validate the performance of our algorithm in an industrial setting. The results and analysis of these experiments have been included in the Discussion (section 3.2.5).
Changes and additions to the Discussion (section 3.2.5):
Additionally, to verify the applicability and generalization ability of our proposed algorithm beyond the field of surveillance and reconnaissance, we evaluated our algorithm on three different publicly available datasets from diverse fields. On an industrial field dataset for motor magnetic tile defect detection, our algorithm demonstrated excellent segmentation results even when facing weak texture information of motor magnetic defects (see experimental results in the first row of Figure A2 in Appendix A). On a medical field dataset for cell semantic segmentation [33-34], our algorithm accurately extracted edges of irregularly shaped cells (see experimental results in the second row of Figure A2 in Appendix A). On an agricultural field dataset for olive fruit semantic segmentation, our algorithm achieved precise segmentation for small-sized olive fruits, remaining unaffected by factors like crowding and occlusion (see experimental results in the third row of Figure A2 in Appendix A). These supplementary experiments further demonstrate the high robustness of our proposed algorithm. Our algorithm proves to be effective not only on our simulated dataset but also on various datasets from other fields, showcasing its excellent performance.
Supplementary experiments in Appendix A.3:
Figure A2. The first row in the figure shows the semantic segmentation results of motor magnetic tile defects, the second row presents the semantic segmentation results of cells, and the third row exhibits the semantic segmentation results of olive fruits.
Point8: Please kindly present a comparison table and try to compare the obtained result with the other articles which have proposed robust models. Please refer to the articles that are newly published (in the period of 2019-2023)
Response: Thank you very much for your professional review. In response to your suggestions, we realized that our comparative experiments were not comprehensive enough. Therefore, we conducted two additional comparative experiments, with HRNet (2019) and MTU-Net (2023), and we have incorporated the experimental results into Table 1 in section 3.2.2.
Table 1. Comparison of our semantic segmentation algorithm with five other algorithms
We have done our best to improve the manuscript and have made a number of changes to it.
We sincerely appreciate the reviewers’ careful work and hope that the corrections will meet with approval.
Once again, thank you very much for your comments and suggestions.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
I have no further comments.