Self-Training Can Reduce Detection False Alarm Rate of High-Resolution Imaging Sonar
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The false alarm rate was reduced by self-training a deep learning detector on sonar images. The results showed that the false alarm rate decreased by 3.91% and 18.50% on 240 kHz and 450 kHz sonar images, respectively. The related work on current deep learning-based sonar image detection can be distinguished from the proposed ideas. Various target and background images were used from the public SAS450 and SAS240 datasets. The performance comparison between detectors is interesting with respect to the false alarm rate. The train/loss graph shown in Figure 12 is reasonable. The F1 scores obtained for detectors A to H are reasonable. There are no specific English grammar errors. The authors described the limitations of the proposed work, such as hardware constraints and performance.
The methodology for training and detection for the proxy classification tasks is described very well. In Figures 3 and 4, the authors provide the methodology and detailed classification information. Very detailed information about the deep learning network training framework and the transfer learning strategy for sonar data is given in the Discussion section. As mentioned, there are four different perspectives on the transfer training network and classification. In addition, there are several limitations regarding the sonar image acquisition process, detector performance, and the proxy tasks and their performance. Therefore, I conclude that the submitted manuscript requires a minor revision, with the following suggested comments below.
1. There is no Figure 13. It is empty.
2. Author contribution format is wrong.
3. In the References section, the authors need to use abbreviated journal names.
4. In conference papers, city, country, and date information need to be provided.
5. Why are the training and validation sets 80% and 20%, respectively, in Figure 4?
6. In Figure 2, why are proxy and detection tasks used instead of the detection models?
7. Please correct 240kHz and 450kHz to 240 kHz and 450 kHz.
Author Response
Thank you very much for taking the time to review this manuscript. Please find the detailed responses below and the corresponding revisions/corrections highlighted/in track changes in the re-submitted files.
Comments 1: There is no Figure 13. It is empty.
Response 1: Thank you for pointing this out. I am very sorry for my oversight in not noticing that Figure 13 was missing from the PDF version of the manuscript, while it was present in the Word document. In this submission of the manuscript, I have specifically checked to ensure that all figures, including Figure 13, are complete in both the PDF and Word files.
Comments 2: Author contribution format is wrong.
Response 2: Thank you for pointing out this issue. I have revised the description of the author contributions by referring to the example provided in the MDPI Word template. This part was highlighted in blue and can be found on line 661 of the revised manuscript.
Comments 3: In the References section, the authors need to use abbreviated journal names.
Response 3: Thank you for pointing out this issue. We have made adjustments to the references and changed the journal names to their abbreviations.
Comments 4: In conference papers, city, country, and date information need to be provided.
Response 4: Thank you for pointing out this issue. We have added the dates and locations to the conference papers in the references.
Comments 5: Why are the training and validation sets 80% and 20%, respectively, in Figure 4?
Response 5: Thank you for pointing this out. There is no fixed standard ratio for splitting the training and validation sets; the ratio adopted in this paper is based on the following considerations. In deep learning, the training set typically comprises 60% to 80% of the dataset, and a larger training set helps the model better learn the features and patterns of the data. Since the proxy task is designed to better learn the target features, we set the proportion of the training set to 80%. We have added an explanation of this point to the manuscript, highlighted in blue, which can be found on line 193 of the revised manuscript.
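For readers who wish to reproduce such a split, a minimal sketch is given below. The chip file names and dataset size are hypothetical; only the 80/20 ratio follows the description above.

```python
import random

def split_dataset(samples, train_ratio=0.8, seed=0):
    """Randomly split a list of samples into training and validation subsets."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = samples[:]          # copy, so the original order is left untouched
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_ratio)
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical example: 1000 sonar image chips -> 800 training / 200 validation.
chips = [f"chip_{i:04d}.png" for i in range(1000)]
train_set, val_set = split_dataset(chips, train_ratio=0.8)
print(len(train_set), len(val_set))   # 800 200
```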
Comments 6: In Figure 2, why are proxy and detection tasks used instead of the detection models?
Response 6: Thank you for your suggestions. The main work of the detection task is to train a detector under the detection model, but in addition to training the detector, the task may also include some image preprocessing (such as gamma enhancement). Therefore, we initially chose to use the term "detection task" in Figure 2. However, taking your advice into consideration, the purpose of Figure 2 is to compare training strategies, and what we want to show readers is how the different modules of the detection model load weights. We therefore think it is better to change the term "detection task" to "detection model". We have modified the figure accordingly; the change is highlighted in blue and can be found on line 168 of the revised manuscript.
Comments 7: Please correct 240kHz and 450kHz to 240 kHz and 450 kHz.
Response 7: Thank you for pointing out this issue. I have made the necessary revisions to the corresponding part of the manuscript.
Reviewer 2 Report
Comments and Suggestions for Authors
The benefits of the proxy task should be clearly mentioned, especially in the introduction. This may help the authors emphasize the contributions of this work more strongly.
Additionally, please recheck the writing of "Proxy task" in Figure 2.
Is there any particular significance to the distinction between the backbone weights and the neck-head weights beyond this application?
Please discuss the correlation between the training and testing datasets. This will help readers more clearly evaluate the sufficiency of the proposed scheme.
It would be helpful if the authors could also add a table presenting the results of the evaluation metrics.
The improvement in detection performance seems quite significant at 450 kHz, but the amount of 450 kHz data seems to be the smallest. Is there any relation between these two points?
Please clarify the result in Figure 12(b), where the accuracy in the last plot is always 1.
In Section 5 (Discussion), the authors are advised to revise the text a bit so that the results are presented in accordance with the main contributions of this work. Further discussion of its limitations is also very welcome.
Author Response
Thank you very much for taking the time to review this manuscript. In response to the 8 suggestions you raised, we have provided careful replies below. Regarding the suggestions for improvement, we have made corresponding revisions in the manuscript and highlighted these changes prominently. We hope that our responses meet your expectations regarding the adequacy of our method description.
Comments 1: The benefits of the proxy task should be clearly mentioned, especially in the introduction. This may help the authors emphasize the contributions of this work more strongly.
Response 1: Thank you for your valuable suggestions. A proxy task can be understood as an indirect task designed to achieve a specific training objective, with the benefit of simplifying the solution to the original task. The setup of a proxy task is usually related to the downstream task, but it takes a simpler form, or a form from which supervisory information is easier to obtain from unlabeled data. By solving proxy tasks, the model can learn the intrinsic structure and features of the data, thereby improving its performance on downstream tasks. We have added the relevant content to the introduction, highlighted in yellow, which can be found at line 67 of the article.
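As an illustration only (not the authors' code), the sketch below shows, in a PyTorch-like style, how a proxy classification task can be built on the same backbone that a detector would later reuse. The layer sizes, chip dimensions, and the four classes are assumptions made for the example.

```python
import torch
import torch.nn as nn

# Toy backbone standing in for the detector's feature extractor (sizes are illustrative).
backbone = nn.Sequential(
    nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)

# Proxy task: classify sonar image chips (e.g. target classes vs. background)
# with a lightweight head on top of the shared backbone.
proxy_head = nn.Linear(32, 4)            # 4 hypothetical classes
proxy_model = nn.Sequential(backbone, proxy_head)

x = torch.randn(8, 1, 64, 64)            # dummy batch of single-channel chips
logits = proxy_model(x)                  # training this model shapes the backbone features
print(logits.shape)                      # torch.Size([8, 4])
```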
Comments 2: Please recheck the writing of "Proxy task" in Figure 2.
Response 2: We are very sorry about this, as we did not notice the spelling error in the manuscript. Thank you for bringing it to our attention; we have already made the correction. The revised Figure 2 can be found at line 168 of the article.
Comments 3: Is there any particular significance to the distinction between the backbone weights and the neck-head weights beyond this application?
Response 3: Yes. In deep learning, the backbone, neck, and head refer to different parts of the network architecture, each serving a distinct function. The backbone is the main component of the entire deep neural network; in image processing tasks, it is typically used to extract global and local features of the image, such as edges, textures, and shapes. The neck is placed between the backbone and the head, and its main function is to reduce the dimensionality of, or otherwise adjust, the features coming from the backbone, further enhancing their diversity and robustness. The head is the final layer of the model, typically a classifier or regressor; it produces the network's output and makes predictions based on the extracted features. Therefore, we only loaded the backbone network weights obtained from the proxy task, with the aim of inheriting the target features learned by the proxy task, as sketched below. We have also added explanations of the functions of the backbone, neck, and head in the manuscript, highlighted in yellow, which can be found at line 205 of the revised manuscript.
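The following is a minimal PyTorch-style sketch (not the authors' implementation) of this weight-transfer step. It assumes the proxy classifier and the detector expose their shared feature extractor under the same "backbone." parameter prefix; only those parameters are copied, while the neck and head keep their fresh initialization.

```python
import torch
import torch.nn as nn

def make_backbone():
    # Shared feature extractor used by both the proxy classifier and the detector.
    return nn.Sequential(
        nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
        nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    )

class ProxyClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = make_backbone()
        self.head = nn.Linear(32, 4)                 # 4 hypothetical proxy classes

    def forward(self, x):
        feats = self.backbone(x).mean(dim=(2, 3))    # global average pooling
        return self.head(feats)

class Detector(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = make_backbone()
        self.neck = nn.Conv2d(32, 32, 1)             # toy neck
        self.head = nn.Conv2d(32, 5, 1)              # toy box/score head

    def forward(self, x):
        return self.head(self.neck(self.backbone(x)))

proxy, detector = ProxyClassifier(), Detector()
# ... the proxy model would be trained on the classification task here ...

# Copy only the "backbone.*" parameters from the proxy into the detector;
# the neck and head keep their random initialization for detection training.
backbone_weights = {k: v for k, v in proxy.state_dict().items()
                    if k.startswith("backbone.")}
missing, unexpected = detector.load_state_dict(backbone_weights, strict=False)
print(sorted({k.split(".")[0] for k in missing}))    # ['head', 'neck'] are left untouched
```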
Comments 4: Please discuss the correlation between the training and testing datasets. This will help readers more clearly evaluate the sufficiency of the proposed scheme.
Response 4: Thank you for your suggestion. We believe that the correlation between the training and testing datasets can be considered from the following aspects: (1) temporal correlation: whether the training and validation data are collected during the same time period; (2) spatial correlation: whether the training and validation data are collected at the same location; (3) feature correlation: whether the similarity of target sample features between training and validation is sufficiently high; (4) statistical correlation: whether the training and validation data are randomly assigned.
This paper conducts three experiments to evaluate the sufficiency of the proposed scheme. (1) 240 kHz SAS self-training for learning and inference on 240 kHz SAS data: here the dataset consists of sonar images of the same targets collected by the same sonar during the same time period and at the same location; the training and testing samples are randomly divided; self-training and detection model training are performed only on the training set, and the test set contains sonar images from perspectives that the model has not seen during training. The purpose of this experiment is to validate the sufficiency of the proposed scheme. (2) 450 kHz SAS self-training for learning and inference on 450 kHz SAS data: this experiment has the same purpose as the previous one, but uses images collected by a higher-frequency sonar, and validates the robustness of the proposed scheme with respect to frequency. (3) 240 kHz SAS self-training for learning and inference on 450 kHz SAS data: this experiment transfers self-training weights across sonar images collected at different frequencies, times, and locations and on different targets, and validates the adaptability of self-training across frequencies. In summary, these experiments evaluate the sufficiency of the proposed scheme in terms of temporal, spatial, feature, and statistical correlations.
To help readers fully understand the experimental content, we have made two changes. First, we have designed an experimental framework diagram, added as Figure 8 in the new manuscript, which can be found on line 354. Second, we have reorganized the experiments and results section: a new "Settings" subsection introduces the correlation of the data in the three experiments above, and this subsection, together with data preparation and model preparation, is now placed under Section 4.1, "Experimental Settings." These modifications are highlighted in yellow and can be found on line 346 of the revised manuscript.
Comments 5: It would be helpful if the authors could also add a table presenting the results of the evaluation metrics.
Response 5: Thank you for your suggestion. The evaluation metrics mentioned in Section 3.3, along with the corresponding results obtained by the various detectors, are recorded in Table 4, which can be found on line 567. The IoU metric is used as a threshold to determine whether a target has been detected, and we state in the experimental section (line 395) that IoU ≥ 0.5 is the criterion we used to consider a target as detected; a minimal illustration is given below. We also noticed that in the first version of the manuscript the data were too fragmented and there were too many tables, which might have hindered locating the results. Therefore, we have merged some of the tables and highlighted the key data, hoping to make the article easier to read.
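As a small, hedged illustration of the IoU ≥ 0.5 rule mentioned above (the boxes below are made-up pixel coordinates, not data from the paper):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Hypothetical ground-truth and predicted boxes in pixel coordinates.
gt, pred = (10, 10, 50, 50), (15, 12, 55, 52)
print(round(iou(gt, pred), 3))          # ~0.711
print(iou(gt, pred) >= 0.5)             # True -> counted as a detection under IoU >= 0.5
```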
Comments 6: The improvement in detection performance seems quite significant at 450 kHz, but the amount of 450 kHz data seems to be the smallest. Is there any relation between these two points?
Response 6: We agree with your point that the improvement in detection performance is more significant on the test set of the 450 kHz dataset, and that this is related to the limited size of that dataset. The SAS450 dataset contains fewer sonar images, so its test set is also smaller, and the percentage improvement in performance is therefore more pronounced. Additionally, besides having fewer images, the SAS450 dataset has a large proportion of background in its images. This leads to very few targets in the dataset, making it more difficult for the model to learn target features during training and more prone to generating false alarms on the test set; consequently, there is also more room to reduce false alarms. However, this is a common problem in underwater target detection: few sonar images are collected, and background occupies a large proportion of them, which results in datasets with few target samples and a lot of background. Therefore, the improvement in detection performance of this method on such small-scale datasets is also of referential significance. We have added an explanation in the manuscript regarding the more significant improvement of the detector on the SAS450 test set, highlighted in yellow, which can be found on line 587 of the revised manuscript.
Comments 7: Please clarify the result in Figure 12(b), where the accuracy in the last plot is always 1.
Response 7: Thank you for pointing this out. The 450 kHz dataset contains only three types of targets plus the background, making a total of four categories. The accuracy_top5 metric reflects the probability that the five most confident classifications include the correct label. Since there are only four categories in the dataset, the top five will always include the correct label, which is why the accuracy is shown as 1 in the figure (a small illustration follows below). We have added an explanation of why accuracy_top5 is always 1 to the manuscript, highlighted in yellow, which can be found on line 486 of the revised manuscript.
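A small sketch (illustrative only, not taken from the paper) of why a top-k accuracy saturates at 1 once k is at least the number of classes; with only four categories, the k = 4 case already shows the same saturation that the reported top-5 metric exhibits.

```python
import torch

def topk_accuracy(logits, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    topk = logits.topk(k, dim=1).indices             # (N, k) class indices
    hits = (topk == labels.unsqueeze(1)).any(dim=1)  # True if the label is in the top k
    return hits.float().mean().item()

# Only 4 classes (3 target types + background), as in the 450 kHz proxy task.
logits = torch.randn(32, 4)                          # random scores for 32 samples
labels = torch.randint(0, 4, (32,))
# k = 4 already covers every class, so the metric is trivially 1.0;
# reporting "top-5" for a 4-class task saturates in the same way.
print(topk_accuracy(logits, labels, k=4))            # 1.0
```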
Comments 8: In Section 5 (Discussion), the authors are advised to revise the text a bit so that the results are presented in accordance with the main contributions of this work. Further discussion of its limitations is also very welcome.
Response 8: Thank you for your valuable suggestions. We have revised this section accordingly. In terms of the contributions of the article, we have elaborated on the contributions of our work based on the results of the three experiments. Regarding the limitations of the article, we have identified limitations from the perspectives of data, model, and algorithm, and pointed out directions for future work. Additionally, considering that this section contains many conclusions, we have renamed Section 5 to "Conclusion and Discussion". The revised content of this part is highlighted in yellow and can be viewed starting from line 616 of the new manuscript.
|
Author Response File: Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have already considered all comments.
Only some further revisions, if any, may be requested by the editorial office.