Saliency-Guided Remote Sensing Image Super-Resolution
Round 1
Reviewer 1 Report
The paper has improved. However, I still have the same concerns as last time.
- For super-resolution with GANs, is it appropriate to use PSNR and SSIM for evaluation?
- More detailed discussion is needed. For example, which parts of the images are most affected when the saliency loss is added? How should the parameters be set to balance the four losses? How do the image characteristics (the detected salient areas) affect the results?
- SRGAN+Lsa seems to be much better than SG-GAN; can you explain this?
Overall, simply adding more images is not meaningful. The most important thing is to analyze the results in detail, even if only one image is used as a case study.
Author Response
Thanks for the reviewers' constructive suggestions, which are highly appreciated. We have carefully scrutinized the manuscript and made corresponding revisions. We sincerely hope that the revised manuscript has addressed all your comments and suggestions. Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
I applaud the authors' complete set of responses to the reviewer comments. I believe many of the straightforward concerns have been addressed, yet some of the responses seem insufficient. Relatedly, I cannot understand the reasoning provided in several of the cases, which suggests that the article structure remains somewhat unclear. My specific remaining concerns are listed below:
Q3: I still believe that this is a description of the authors’ method, rather than an individual contribution
Q4: This too is a description of what the authors did in the paper. The contribution of this paper, which is indeed important, is the creation of a more efficient super-resolution network that is able to focus training on salient regions of the image. The fact that this involves a saliency loss and that the authors validate their model are not, in themselves, contributions
Q6: I do not feel that the addition of this sentence addresses the comment, which is that no contextualizing of this work occurs during the literature review portion of the introduction. Instead, many differing papers are presented with little description of how this work relates to or builds upon these prior efforts. The breadth of the literature review is impressive, but the connection to this paper is difficult to understand.
Q9: The authors’ response is helpful in understanding why there is a need to reduce the total number of parameters in the model, but it does not address the comment, which is that these two units (the upscaling unit and the reducing parameter unit) are not described in the paper and it is not clear what part of the network architecture depicted in Figure 4 corresponds to these units
Q12: I do not understand the authors’ response
Q14: I am confused by the authors' response. In the paper, the authors suggest that the saliency part of the network is trained in advance, but to do so I believe one would need images that were labeled according to pixel saliency. This does not appear to be the case with the RAISE dataset, which they reference in this response as the source of training data for the saliency network. At a higher level, the paper could use a diagram illustrating when different parts of the model are trained and tested, as well as the datasets that are employed at each of these stages. As it is currently written, it is difficult to understand which datasets are employed at which stages of the training and testing workflow.
Q15: I believe I understand the authors' response. I see from their revisions that PSNR is simply a transformation of MSE, so that minimizing MSE should generally maximize PSNR. I do think that the paper would benefit from a clearer delineation of where and how in the process the various loss functions are used. If I understand correctly, BASNet uses a sum of BCE, SSIM, and IoU losses, but this model is entirely pre-trained and is not fine-tuned within SG-GAN. Within the bounds of the SG-GAN training described in this paper, it is the saliency, L1, perceptual, and adversarial loss functions that are summed and minimized. In general, the use of the various loss functions in different stages of the modeling effort is not delineated clearly in the paper and could use a simple summary to orient readers and make the training process more transparent (a minimal sketch of this breakdown is given after these point-by-point comments). Additionally, I am still a bit confused regarding the specification of some of these loss functions. For example, the saliency loss is represented as a sum over feature channels. However, isn't the saliency map output a single channel (i.e. greyscale)?
Q18: The authors' stated response does not appear to address this question, but I believe from their paper that an additional experiment was added in which they combined the developed saliency loss function with the SRGAN model. If I understand correctly, this is designed to address my original comment, which is that the finding that the loss function matters was not supported by their experiments. I believe I should be comparing the SRGAN and SRGAN+Lsa scores in order to isolate the contribution of the loss function. This is indeed a good experiment, but it is left to the reader to understand and interpret this connection. This should be clearly described in the paper, and the authors should explain that this is the data that supports their statement that the loss function is important.
Q19: Half of Table 3 is duplicated from Table 2, I believe, though there are some discrepancies. For example, the UCAS-AOD PSNR score for SRGAN is 31.91 in Table 3 but 30.91 in Table 2. This materially affects the authors' conclusion that the Lsa loss function addition improves the SRGAN performance even when scoring based on the entire image, regardless of saliency. In other words, if the SRGAN score is 31.91 (as in Table 3), the performance actually decreases with the addition of Lsa. There are a few other data discrepancies between these two tables as well.
Relatedly, I don't think there's any reason for these to be separate tables, and this structure is confusing for the reader.
In addition, I don't understand why the addition of the saliency loss function (i.e. comparing SRGAN+Lsa to SRGAN) should theoretically improve the model's performance across the entire image, rather than just in the saliency-weighted score. I do not understand why this should be the case unless the loss function in SRGAN was poorly defined relative to the scoring metrics. Making the model better at identifying salient regions shouldn't necessarily make it better at super-resolution overall, just in the salient areas, no? If this assumption is incorrect, I was not able to deduce the authors' argument for why this performance bump would occur. They should make this explicit in their article, as it is a key finding. If there is indeed no theoretical justification, their results lead me to wonder whether the SRGAN+Lsa model might have had additional epochs of training over the SRGAN model, such that this performance improvement simply comes from more training. This concern would be alleviated if the authors clearly described why one might see a better score for SRGAN+Lsa even when scoring performance across the entire image, regardless of saliency.
Finally, I believe that the authors meant that the figures show qualitative results, rather than quantitative results in their response to Q18.
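For reference, a minimal sketch of the loss breakdown described under Q15, as I understand it; the weights λ are generic placeholders, not the values actually used in the paper:

```latex
% PSNR is a monotone decreasing transform of MSE, so minimizing MSE maximizes PSNR:
\mathrm{PSNR} = 10 \log_{10}\left( \frac{\mathrm{MAX}_I^{2}}{\mathrm{MSE}} \right)

% BASNet is pre-trained separately with its BCE + SSIM + IoU objective and kept fixed;
% within SG-GAN training itself, only the following weighted sum is minimized:
\mathcal{L}_{\mathrm{total}}
  = \lambda_{1} \mathcal{L}_{1}
  + \lambda_{p} \mathcal{L}_{\mathrm{perceptual}}
  + \lambda_{a} \mathcal{L}_{\mathrm{adversarial}}
  + \lambda_{s} \mathcal{L}_{\mathrm{saliency}}
```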
-----------------
Beyond these specific concerns, while I cannot access my first round of comments (a link is broken on the website), I believe that I suggested that the authors make both their data and code available via open-source platforms like Zenodo, GitHub, Code Ocean, etc. The authors include a Data Availability statement noting that the data is available upon request and do not mention code availability (as far as I can tell). This form of data provision does not meet modern standards of academic data science and should be addressed. The authors should either host the code and data on an easily accessible platform or indicate why they cannot do so (e.g. intellectual property and/or licensing constraints, etc.).
Author Response
Thank you very much for the positive comments and constructive suggestions. We have carefully scrutinized the manuscript and made corresponding revisions. We sincerely hope that the revised manuscript has addressed all the reviewers' comments and suggestions. Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
I would like to thank the authors for considering my suggestions and answering my concerns correctly. I think now the article is suitable for publication.
Author Response
Thank you very much for your recommendation, which is highly appreciated. The positive comments and constructive suggestions are valuable for improving the manuscript.
Round 2
Reviewer 1 Report
Thanks to the authors. The paper has improved, and I think it is ready for publication. My only remaining comment is that there are too many references, and some of them could be deleted.
Author Response
Thank you very much for your recommendation, which is highly appreciated. The positive comments and constructive suggestions are valuable for improving the manuscript. Additionally, we have deleted 14 references in the revised manuscript.
Reviewer 2 Report
Thank you to the authors for responding to the latest round of comments. Many of my concerns have been addressed. Here are my responses and remaining concerns:
- Point 2: The authors' response was unrelated to my comment, which regards the third bullet point. The fact that they conducted validation experiments of their designed algorithm is not in itself a contribution of the paper - it is a necessary approach to ensure that their algorithm is working.
- Point 3: This addition helps clarify why the goal of this paper is important and how it is motivated by previous work. Thanks for including this! My one comment is that it is probably too strong of language to say that Gu et al "proved" that variation in super-resolution performance comes from areas with complex structure/texture. There are a variety of reasons for differing performance across algorithms, and this sentence would be more accurate to say something like "Gu et al suggested that variation in super-resolution performance is often due to algorithms' performance in areas of complex structure"
- Point 4: This description is helpful. Again, I would suggest that the authors make the connection between the different blocks in Figure 4 and the 5 units described in the text clearer. Labeling Figure 4 with the same terminology used to describe the 5 units in the text would go a long way toward the interpretability of the figure.
- Point 5: I am a little confused by the authors' response. I believe they have swapped the definitions of map1/map3 with map2/map4. The caption of the image suggests that map1/map3 DO contain the sigmoid layer, while this is the opposite of what is in the authors' response and what is now described in their new paragraph.
- Point 6: I appreciate the authors' clarification here. This now makes sense to me.
- Point 7: Most of the clarification I was hoping for was addressed in the authors' response to Point 6. However, I still do not understand how the saliency map is a 3-channel map. The figures shown in the paper are in greyscale, and the authors' explanation, "we use Python to visualize the height width and number of channels of the saliency map", does not make sense to me (a possible explanation is sketched after these points).
- Point 8: The addition of this section is quite helpful, thank you. However, the authors' response (and the paragraph itself) is somewhat confusing - particularly the statement "Therefore, we will count the image areas' repair results with more complex structures on the aforementioned remote sensing datasets." Does this mean that the results in Table 2 are calculated only on the salient regions? Or are they counted equally across the image? In general, the grammatical structure of this additional paragraph makes it difficult to interpret.
- Point 9: The authors seem to have swapped Table 3 out for an entirely different table, with different metrics compared on different underlying datasets. I don't understand this new table, why the swap was made, nor how this relates to my comment.
- Point 11: The quantitative improvement of SRGAN+Lsa over SRGAN is reported twice: once in this section and once in the new Section 4.5. The authors should avoid this repetition.
New comments:
- Point 12: I understand and respect the authors' interest in keeping their code proprietary. If, however, they intend to release it upon publication of this paper as indicated in their response, then there should be a code availability statement indicating the release and pointing to a repository where the code will be hosted (even if it is not there yet). This ensures transparency and reproducibility of the authors' results.
- Point 13: The authors seem to have successively added more and more images with each round of reviews. While visually appealing, it is difficult to keep track of what readers are to look for in each figure. I think the paper would benefit from a reduction in these figures and guidance within the captions of the remaining figures that indicate what visual behavior is indicated by the various images shown in the figure.
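Regarding Point 7, one possible explanation (an assumption on my part, not confirmed by the authors) is that the saved greyscale saliency map was loaded with OpenCV's default color flag, which replicates a single channel into three identical channels. A minimal sketch, with a hypothetical file path:

```python
import cv2
import numpy as np

path = "saliency_map.png"  # hypothetical greyscale saliency map on disk

# The default flag (IMREAD_COLOR) always returns a 3-channel array,
# even if the file stores a single greyscale channel.
as_color = cv2.imread(path)                       # shape: (H, W, 3)

# Reading with IMREAD_GRAYSCALE returns the single-channel map instead.
as_gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)  # shape: (H, W)

print(as_color.shape, as_gray.shape)
# The three "channels" are identical copies of the same greyscale values:
print(np.allclose(as_color[..., 0], as_color[..., 1]))
```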
Author Response
Thank you very much for the positive comments and constructive suggestions. We have carefully scrutinized the manuscript and made corresponding revisions. We sincerely hope that the revised manuscript has addressed the reviewers' comments and suggestions. Please see the attachment.
Author Response File: Author Response.pdf
This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.
Round 1
Reviewer 1 Report
An SG-GAN model is applied to perform saliency-guided image super-resolution. My notes are listed below:
1) I think the introduction needs improvement. The topic, problem statement, and research gaps should be explained in more detail. The authors should focus on image super-resolution and salient object detection, explain the problem and research gaps in these areas, and then present the main contribution of the work. At present, salient object detection is only explained in the last paragraphs of the Introduction. This should be addressed during the revision.
2) In the literature review, it would be good to discuss more related works on salient object detection and then highlight the added value of the proposed method.
3) Regarding GAN applications (line 35), I would recommend adding other applications of GANs, such as image segmentation (https://doi.org/10.1109/ACCESS.2020.3038225), image classification (https://doi.org/10.3390/rs9121328), and change detection (https://doi.org/10.1109/LGRS.2021.3066435).
4) It would be helpful to explain how images are generated progressively, with reference to the discriminator network.
5) Justify the following parameter choices:
(a) Adam is used as the optimizer. Why not other optimizers with different learning rates?
(b) Why was a batch size of 16 selected, and what is the effect of a batch size lower than 16?
6) It would be good to see the results for individual loss functions.
7) What evaluation metrics were used to assess accuracy? All of them should be provided with equations; for reference, the standard definitions of PSNR and SSIM are sketched after this list.
8) Please discuss the limitations of the method for the given task.
9) Also, the following works may be useful for improving the Introduction or methodology sections:
- https://doi.org/10.3390/rs12020216
- https://doi.org/10.1109/ACCESS.2021.3075951
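For reference, the standard definitions of PSNR and SSIM, the metrics most commonly reported for super-resolution (the exact variants implemented by the authors may differ):

```latex
% Mean squared error between reference I and reconstruction \hat{I}, and PSNR:
\mathrm{MSE} = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}\left( I(i,j)-\hat{I}(i,j) \right)^{2},
\qquad
\mathrm{PSNR} = 10 \log_{10}\left( \frac{\mathrm{MAX}_I^{2}}{\mathrm{MSE}} \right)

% Structural similarity between image patches x and y, with stabilizing constants c_1, c_2:
\mathrm{SSIM}(x,y) =
  \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}
       {(\mu_x^{2}+\mu_y^{2}+c_1)(\sigma_x^{2}+\sigma_y^{2}+c_2)}
```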
Reviewer 2 Report
In this paper, the authors propose a method to improve the computational efficiency of super-resolution tasks through the use of an algorithm that identifies and focuses on "salient" portions of the image. Super-resolution has many use cases across a wide range of computer vision problems and thus this research itself is quite salient.
The visual performance of their model compared to the benchmark models they compare to is clearly appealing and they appear to have developed a super-resolution algorithm with generally strong performance. The caveat is that I am not familiar with these comparison models and cannot confirm that they are actually state-of-the-art models.
While the demonstrated performance is strong and thus suggestive that this model could prove broadly useful in super-resolution tasks, there are numerous issues with the construction of arguments, the overly general interpretation of experimental results, and the use of overly technical and confusing language in the paper.
In addition to these concerns, I believe that a methodological contribution such as that described in this paper should if at all possible be accompanied by the release of data and code used in the study. Unless the authors have a compelling reason they cannot release this information, I would consider that a necessary step to ensure transparency before accepting this paper; however, I defer to the policy of the journal to determine if this step, in particular, is necessary.
I have divided my comments into a list of overall points and a list of line-by-line comments, questions, and suggestions.
High-level concerns:
- Extensive grammatical editing is required for this to be easily readable
- In general, the authors often use abbreviations and/or technical terminology before defining the words' meanings. This is confusing.
- The authors often make claims that do not appear to be founded on the results of their experiments. If the claims are based on findings from the analysis in this paper, it should be more clearly described how the results of the findings lead them to the stated conclusions. See the line-by-line comments for several examples. One such example is their claim that the choice of a loss function is important for super-resolution performance on remote sensing images. While this seems likely to be true, the experiment they described before making this claim was one in which they compared performance across a variety of different networks, rather than one in which they were able to isolate the performance impact of the loss function.
- If the authors wish to publish this in a remote sensing-focused journal, they will need to be clearer about the machine learning acronyms and terminology they employ. An alternative would be to transition this to a more computer science-focused journal, in which perhaps a greater level of knowledge of standard deep learning models and terminology could be expected.
- I am confused by a perceived mismatch between the stated goals and contribution of the authors' architecture and the design of the architecture itself. They suggest that they can improve computational efficiency by allowing the network to focus on performance in salient regions of the image while ignoring others. To me, this would imply weighting each pixel in the image-related loss functions by its saliency. This is what I expected the authors to have done. However, their approach is to calculate the unweighted perceptual, L1, and adversarial losses as you would in a more typical GAN, but then to also add a loss function associated with how well the generated image's saliency matches that of the high-resolution image. To me, this seems like it would cause the model to perform well at both (a) identifying salient regions, and (b) super-resolution in general, across the entirety of the image. It does not seem like this loss function is optimized to "improve super-resolution in salient regions", which is what I understand to be the stated outcome of this paper. The authors should either explore this alternative structure (weighting each pixel in the calculation of the L1, perceptual, and adversarial losses by its estimated saliency; a minimal sketch of this weighting appears after this list) or explain why this would not achieve the desired outcome.
- Relatedly, the authors describe how their method results in improved computational efficiency for achieving a desired performance in salient regions, yet they do not conduct any experiments to show computational efficiency. Their main experiment does standardize hyperparameters such as the number of epochs across all of the comparator networks they test against. However, with vastly different network structures, the actual run times of the models are likely to be quite different. Their claim of improved efficiency is unsupported until they provide some quantitative evidence of it.
- The authors make no mention of non-RGB remote sensing imagery, which plays a large role in a variety of use cases. More broadly, remote sensing does not even play an important role in this analysis, which is focused on a general computer vision task. They show experimental results on remote sensing imagery (as well as natural imagery) but never discuss why they have chosen remote sensing imagery as the type of imagery to highlight in their paper. This, combined with the overly technical computer science terminology used throughout the paper, makes me think that this article would be a better fit for a computer vision-focused journal, rather than one specifically targeting remote sensing.
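A minimal sketch of the alternative weighting scheme suggested in the point above about the perceived mismatch; this is PyTorch-style code with hypothetical tensor names, not the paper's implementation:

```python
import torch

def saliency_weighted_l1(sr, hr, saliency):
    """L1 loss in which each pixel's error is weighted by its estimated saliency.

    sr, hr:   generated and ground-truth images, shape (N, 3, H, W)
    saliency: saliency map in [0, 1], shape (N, 1, H, W)
    """
    per_pixel = torch.abs(sr - hr)        # (N, 3, H, W)
    weighted = saliency * per_pixel       # saliency broadcasts over the channel dimension
    # Normalize by total saliency mass (times channels) to get a weighted mean.
    return weighted.sum() / (saliency.sum() * sr.shape[1] + 1e-8)

# By contrast, the paper (as I read it) keeps the usual unweighted L1/perceptual/adversarial
# losses over the whole image and adds a separate term comparing the saliency maps of the
# SR and HR images.
```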
Line-by-line points:
- 23: The authors should define "reconstruction-based approach", as well as LC and AC
- 68: What does "appropriately and effectively" mean? These are not precise terms. Would "efficiently" be a more useful term? Also, I believe this should say "recover the high-resolution images from the low-resolution input"
- 69: This second bullet point is a description of the authors' approach, not a contribution of the paper. Also, the authors should define "map-level" as the meaning of this term is not obvious from context.
- 72: This also is not a separate contribution, but rather a clarification of what "appropriately and effectively recovering" means from the first bullet point. I believe there is only one (important) contribution of the paper, which is that it proposes a more efficient approach to conducting image super-resolution tasks
- 76-77: A discussion of experimental results appears in both Sections 4 and 5. These should be confined to one section for clarity.
- 86-126: The authors make little attempt to categorize this literature review and to situate their contribution within the existing methods. As such, this reads somewhat like a laundry list of former papers that have something to do with super-resolution using CNNs and GANs. It would be much more helpful if it was structured such that readers could understand how the contributions of the current work relate to these other papers.
- 119: Why does a VGG discriminator "guide the generator to pay attention to deep features"? This connection could use more description.
- 160: The authors state that they omit the batch normalization layer of their residual blocks to reduce the memory footprint, but do not describe what role this layer normally plays in a ResNet. Will this omission have an impact on performance? Some indication of why this change to the more classic structure is justified would be helpful (for reference, a generic BN-free residual block is sketched after this list).
- 163-165 (and figure 4): The reducing parameters unit and the upscaling unit are not described and it is not immediately obvious which parts of Figure 4 correspond to each of these two units. Why is the reducing parameters unit needed "to prepare for the upscaling unit"?
- 181: Define FCN
- 203-206: The description of why an activation function is useful does not seem necessary here, given that this is a fundamental component of neural networks and there are many other, more difficult-to-understand pieces of the authors' network structure that have not been defined. I think this can be removed.
- 215-217: This is confusing. Is this implying that the sigmoid activation is not actually used? If so, why is it described above, and why is its output shown in several figures?
- Eq. 4 - L239: Neither the variable D nor the subscript D is defined. I think there is a typo in that L^g_adversarial is repeated twice. No description of this function is provided. If these are the common GAN loss functions, then they do not need to be repeated here; otherwise, there should be some textual description of why these loss functions are used.
- 242-244: The authors should state how and why these loss function weights were chosen and provide some justification for why the L1 loss is weighted so much higher than the others
- Figure 7: Reorder columns 2 and 3. It is confusing to have column 3 referenced before column 2
- 269-270: How is this model trained in advance? i.e., what dataset are the labels coming from? Is this MSRA10K, as alluded to in L254-256?
- 271-275: If these are commonly used metrics for super-resolution performance, why are these not incorporated into the network's loss function?
- Figure 8: This figure would benefit from an additional column in which the LR image is blown up to the same size as the HR and various SR images. Also the FSRCNN network is referenced in the text but does not appear in this image.
- 305-307: The authors should explain why the use of L1 loss generates more precise images
- 310-312: It is unclear why the authors reach the conclusion that the loss function has a strong impact on SR performance for remote sensing imagery. They are comparing a wide array of network structures and do not seem to have conducted an experiment to isolate the contribution of the loss function to model performance
- Table 2: Include a description of the columns in the caption
- 343-350: This reads like an assertion, based on evidence, of why the performance boost is lower on the UCAS-AOD dataset; however, there is no clear evidence presented. Instead, this is a hypothesis about the mechanism by which this performance characteristic is realized. It should be rewritten so that it reads as a hypothesis, not an evidence-based finding. Relatedly, it is unclear why a higher starting resolution in the input image results in a larger performance boost from the SG-GAN model. The authors state this as one of the reasons for this finding, but it is not clear how this logic applies.
- 357-380: This description of the comparator models belongs in the introduction, not in the discussion section.
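For reference on the comment at line 160 about omitting batch normalization, a generic sketch of the BN-free residual block used in EDSR/ESRGAN-style super-resolution networks (the authors' exact block may differ):

```python
import torch
import torch.nn as nn

class ResidualBlockNoBN(nn.Module):
    """EDSR-style residual block with the batch-normalization layers removed.

    A classic SRResNet block is Conv-BN-ReLU-Conv-BN plus a skip connection;
    dropping the two BN layers saves memory (no running statistics or affine
    parameters to store) and, in super-resolution, avoids BN's tendency to
    wash out image-specific intensity range information.
    """

    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        # Identity skip connection around two convolutions, no normalization.
        return x + self.conv2(self.relu(self.conv1(x)))
```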
Reviewer 3 Report
A good idea. The paper is well written and easy to follow. But I still have some comments:
- For super-resolution with GANs, is it appropriate to use PSNR and SSIM for evaluation?
- More detailed discussion is needed. For example, which parts of the images are most affected when the saliency loss is added? How should the parameters be set to balance the four losses? How do the image characteristics (the detected salient areas) affect the results?
- In Figure 9, in the last image, SRGAN+Lsa seems to be much better than SG-GAN; can you explain this?