Self-Supervised Infrared Video Super-Resolution Based on Deformable Convolution
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
1. The schematic diagram in Figure 1 requires improvement in both visual clarity and logical coherence.
2. The comparative analysis in Section 3.2 demonstrates a lack of comprehensiveness.
3. The experimental validation requires enhanced scientific rigor due to insufficient ablation studies.
4. The visualization experiments should be enhanced through strategic emphasis on architectural nuances.
Author Response
(1) The schematic diagram in Figure 1 requires improvement in both visual clarity and logical coherence.
Response: The visual clarity and logical coherence are both improved in Figure 1.
(2) The comparative analysis in Section 3.2 demonstrates a lack of comprehensiveness.
Response: The comparative analysis between traditional super-resolution restoration methods and the proposed algorithm is improved in Section 3.2.
(3) The experimental validation requires enhanced scientific rigor due to insufficient ablation studies.
Response: The experimental validation is supplemented to improve scientific rigor.
(4) The visualization experiments should be enhanced through strategic emphasis on architectural nuances.
Response: The experiments and analysis are improved.
Reviewer 2 Report
Comments and Suggestions for Authors
Self-Supervised Infrared Video Super-Resolution Based on Deformable Convolution
This paper presents a novel self-supervised infrared video super-resolution (SR) method that utilizes deformable convolution to enhance motion estimation without requiring high-resolution video supervision. The proposed method effectively estimates blur kernels and learns motion information adaptively, addressing key challenges in infrared video SR.
The manuscript is well-organized and logically structured but has several limitations. Therefore, I suggest a major revision to allow the authors to improve the manuscript. My comments are summarized below:
- Why does this paper only compare its method with bicubic interpolation and one existing self-supervised method? It is recommended to include comparisons with additional methods, such as optical flow-based SR methods or state-of-the-art video SR models (e.g., EDVR, TDAN).
- It is recommended to include an ablation study to evaluate the contributions of different components, such as deformable convolution and blur kernel estimation.
- The dataset used in this study is limited to infrared images of aircraft and missiles. The authors should discuss the generalizability of the method to other infrared applications (e.g., pedestrian detection, surveillance).
- The paper lacks an analysis of computational efficiency. Given the importance of real-time processing in infrared applications, the inference time per frame should be reported.
- The manuscript contains grammatical errors. A thorough proofreading is recommended.
- It is recommended to add a discussion section to analyze how hyperparameter choices (e.g., learning rate, training epochs) affect performance.
Comments on the Quality of English Language
The manuscript contains grammatical errors. A thorough proofreading is recommended.
Author Response
(1) Why does this paper only compare its method with bicubic interpolation and one existing self-supervised method? It is recommended to include comparisons with additional methods, such as optical flow-based SR methods or state-of-the-art video SR models (e.g., EDVR, TDAN).
Response: The proposed method focuses on processing aircraft and missile targets against sky backgrounds, within the Strategic Priority Research Program of the Chinese Academy of Sciences. Bicubic interpolation is a general comparison method for image processing. Optical flow-based SR methods and state-of-the-art video SR models may not be suitable for this specialized target imagery.
(2) It is recommended to include an ablation study to evaluate the contributions of different components, such as deformable convolution and blur kernel estimation.
Response: The deformable convolution and blur kernel estimation are improved.
(3) The dataset used in this study is limited to infrared images of aircraft and missiles. The authors should discuss the generalizability of the method to other infrared applications (e.g., pedestrian detection, surveillance).
Response: The proposed method focuses on processing aircraft and missile targets against sky backgrounds, within the Strategic Priority Research Program of the Chinese Academy of Sciences, and it is well optimized for this specialized target imagery. Other infrared applications will be investigated in future work.
(4) The paper lacks an analysis of computational efficiency. Given the importance of real-time processing in infrared applications, the inference time per frame should be reported.
Response: The processing time analysis is improved.
(5) The manuscript contains grammatical errors. A thorough proofreading is recommended.
Response: The writing of the manuscript is improved.
(6) It is recommended to add a discussion section to analyze how hyperparameter choices (e.g., learning rate, training epochs) affect performance.
Response: As mentioned above, the proposed method focuses on processing aircraft and missile targets against sky backgrounds, within the Strategic Priority Research Program of the Chinese Academy of Sciences, and it is well optimized for this specialized target imagery.
Reviewer 3 Report
Comments and Suggestions for Authors
This article proposes a deep network for super-resolution of infrared video based on deformable convolution, where the proposed network is self-supervised, with no training references required. The deformable convolutional network is introduced to adaptively learn motion information and capture more accurate tiny motion changes, overcoming the limitations of optical flow prediction in handling complex motion. The idea makes sense, and the experimental results support the efficacy of the proposed model. However, the paper has several issues that must be addressed. My comments are below:
- The paper has many grammatical errors and typos (e.g., the tense in the Introduction section), which the authors must double-check.
- Improve the descriptions of the illustrations and the resolution of the figures, such as Figures 1-3.
- In Fig. 2, a skip connection is used between two features with a pooling layer in between. How do the authors adjust the dimensions? The authors should add more details, including but not limited to the dimensions at each stage.
- If the two terms of the loss function contribute equally to the overall loss, why did the authors add balancing parameters?
- Some relevant references are missing, such as “https://doi.org/10.1117/12.2680541”.
- “RCAB combines Channel Attention (CA) mechanism with residual concept.” This part needs an in-depth description.
- The computational cost analysis should be provided.
- As far as I can see, your method outperformed the others, which is good. However, as a reader, I do not know why your method surpassed the others, because the paper lacks a proper discussion section that examines the experimental results comprehensively, considering the advantages and disadvantages/limitations of all methods used (including yours). I also recommend adding more recent comparison methods.
- Expanding the future work section to discuss further optimization of the network's computational efficiency or applying this network to other video tasks would provide valuable insights into the broader applicability of the proposed method.
Author Response
(1) The paper has many grammatical errors and typos (e.g., the tense in the Introduction section), which the authors must double-check.
Response: The writing of the manuscript is improved.
(2) Improve the descriptions of the illustrations and the resolution of the figures, such as Figures 1-3.
Response: Figures 1-3 are improved.
(3) In Fig. 2, a skip connection is used between two features with a pooling layer in between. How do the authors adjust the dimensions? The authors should add more details, including but not limited to the dimensions at each stage.
Response: In the RCAB (Residual Channel Attention Block) module, the skip connection is applied between the input and the output of the residual attention branch, both of which retain the same spatial dimensions throughout the block.
It is true that a global average pooling operation is used within the channel attention mechanism. Specifically, we apply tf.reduce_mean(f, axis=(1, 2), keepdims=True) to compute the average over the spatial dimensions of the feature map f, resulting in a tensor of shape (B, 1, 1, C). This operation serves to summarize the global spatial context of each channel and is followed by two 1×1 convolutional layers and a sigmoid activation to generate channel-wise attention weights of the same shape (B, 1, 1, C). These weights are then multiplied element-wise with the original feature map f (of shape B, H, W, C), producing an attention-modulated output of the same shape.
The skip connection is formed by directly adding this output to the original input tensor, both of which have the identical shape (B, H, W, C). Therefore, no dimension adjustment is required at any stage of the skip connection.
To state the dimensions more clearly:
Input: (B, H, W, C)
After first conv + ReLU: (B, H, W, C)
After second conv: (B, H, W, C)
After global average pooling: (B, 1, 1, C)
After channel-down 1×1 conv + ReLU: (B, 1, 1, C // r), where r denotes the reduction ratio used in the channel attention bottleneck (typically set to 16 in our implementation)
After channel-up 1×1 conv + Sigmoid: (B, 1, 1, C)
After attention modulation (element-wise multiplication): (B, H, W, C)
After residual addition (skip connection): (B, H, W, C)
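For illustration, the following minimal TensorFlow sketch is consistent with the dimension walkthrough above; the layer arrangement and the reduction ratio r = 16 follow this response, while the function name and kernel sizes are illustrative rather than our exact implementation:

import tensorflow as tf

def rcab(x, reduction=16):
    # Minimal RCAB sketch: conv-ReLU-conv residual branch plus channel attention.
    # x has shape (B, H, W, C); the output keeps the same shape, so the skip
    # connection needs no dimension adjustment.
    c = x.shape[-1]
    f = tf.keras.layers.Conv2D(c, 3, padding="same", activation="relu")(x)  # (B, H, W, C)
    f = tf.keras.layers.Conv2D(c, 3, padding="same")(f)                     # (B, H, W, C)
    w = tf.reduce_mean(f, axis=(1, 2), keepdims=True)                       # (B, 1, 1, C)
    w = tf.keras.layers.Conv2D(c // reduction, 1, activation="relu")(w)     # (B, 1, 1, C // r)
    w = tf.keras.layers.Conv2D(c, 1, activation="sigmoid")(w)               # (B, 1, 1, C)
    return x + f * w                                                        # (B, H, W, C)

For example, rcab(tf.keras.Input((64, 64, 32))) builds the block for 32-channel feature maps.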
(4) If the two terms of the loss function contribute equally to the overall loss, why did the authors add balancing parameters?
Response: As correctly pointed out, the two terms in our loss function contribute equally to the overall optimization, and in our experiments, both balancing parameters are indeed set to 1. The inclusion of these parameters mainly serves to make the formulation more general and mathematically complete, following common conventions in the literature. This also leaves room for future extensions where different weightings might be beneficial in other scenarios or datasets.
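For concreteness, the objective discussed above has the generic weighted form (the symbols below are illustrative; the exact terms are defined in the paper):

L_total = λ1 · L_1 + λ2 · L_2, with λ1 = λ2 = 1 in all reported experiments,

so setting both weights to 1 recovers the equal-contribution case the reviewer describes, while the general form leaves room for re-weighting in other scenarios.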
(5) Some relevant references are missing, such as “https://doi.org/10.1117/12.2680541”.
Response: The relevant reference is added:
[33] Ma S, Khader A, Xiao L. Complementary features-aware attentive multi-adapter network for hyperspectral object tracking[C]//Fourteenth International Conference on Graphics and Image Processing (ICGIP 2022). SPIE, 2023, 12705: 686-695.
(6) “RCAB combines Channel Attention (CA) mechanism with residual concept.” This part needs an in-depth description.
Response: RCAB fuses the CA mechanism with the residual concept. Given an input feature, we first apply a convolution-ReLU-convolution sequence to obtain f. Then, f is rescaled by the CA module to obtain x. Finally, x is added to the input to obtain the output feature.
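In equation form, with ⊙ denoting element-wise multiplication:

f = Conv(ReLU(Conv(input))), x = CA(f) ⊙ f, output = input + x,

where CA(·) produces the (B, 1, 1, C) sigmoid attention weights described in our response to comment (3).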
(7) The computational cost analysis should be provided.
Response: The processing time analysis is improved.
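As an illustration of how per-frame inference time can be measured and reported, the sketch below times a TensorFlow model in eager mode; model and lr_frames are placeholders rather than names from the paper:

import time

# Warm-up runs exclude graph tracing and initialization from the measurement.
for _ in range(10):
    _ = model(lr_frames).numpy()

n = 100
start = time.perf_counter()
for _ in range(n):
    _ = model(lr_frames).numpy()  # .numpy() forces device synchronization
elapsed = time.perf_counter() - start
print(f"average inference time: {1000.0 * elapsed / n:.2f} ms per frame")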
(8) As far as I can see, your method outperformed the others, which is good. However, as a reader, I do not know why your method surpassed the others, because the paper lacks a proper discussion section that examines the experimental results comprehensively, considering the advantages and disadvantages/limitations of all methods used (including yours). I also recommend adding more recent comparison methods.
Response: The proposed method focuses on processing aircraft and missile targets against sky backgrounds, within the Strategic Priority Research Program of the Chinese Academy of Sciences. Bicubic interpolation is a general comparison method for image processing, and the proposed method is well optimized for this specialized target imagery. Other infrared applications will be investigated in future work. The comparative analysis between traditional super-resolution restoration methods and the proposed algorithm is improved in Section 3.2, and the experiments and analysis are improved.
(9) Expanding the future work section to discuss further optimization of the network's computational efficiency or applying this network to other video tasks would provide valuable insights into the broader applicability of the proposed method.
Response: As mentioned above, the proposed method focuses on processing aircraft and missile targets against sky backgrounds, within the Strategic Priority Research Program of the Chinese Academy of Sciences, and it is well optimized for this specialized target imagery. Other infrared applications will be investigated in future work.
Reviewer 4 Report
Comments and Suggestions for Authors
- The images, layout, and text in the paper have serious issues. The attached images in the paper are too blurry. There are obvious layout problems with the formula labels and line spacing. The most serious problem is that the title of Section 2.1, "Blur Kernel Estimation Network Nk," and the name of this module in Figure 1 are different. The names of Figure 4 and Figure 5 are the same, but they describe different networks. Have the authors carefully checked the paper? The resolution of Figures 7-8 is too low, and no local magnification is provided, making it impossible to clearly show the differences in SR effects.
- The division of the training and test sets mentioned only 25 videos and did not explain how the representativeness of the dataset was verified. Moreover, the comparison experiments are seriously insufficient: the method in this paper was compared with only two methods.
- The self-supervised deep blind video super-resolution proposed by the authors lacks innovation and is too similar to Self-Supervised Deep Blind Video Super-Resolution (2022).
- The usage of the blur kernel ki in Self-Supervised Deep Blind Video Super-Resolution is modified in the current network diagram of the paper, but this is not explained in the figures or text. The authors need to explain the treatment of this part.
- The references are not recent enough, and the citation format is not uniform: some conference names are abbreviated, and some are given in full.
- In the second section, the description of the network is chaotic, without a detailed introduction of the connections between modules. In the cascade of deformable convolutions, was verifying the multi-scale design through ablation experiments considered? In Figure 4, does the feature extraction network use a convolutional layer and five residual blocks to extract multi-scale features? Consider comparing networks of different depths to assess the impact on performance.
- There are too few experiments in Section 3, and only PSNR/SSIM is used as objective indicators. It is suggested to add comparative tests and to introduce the noise suppression ratio (NSR) and other indicators for comparison.
Author Response
(1) The images, layout, and text in the paper have serious issues. The attached images in the paper are too blurry. There are obvious layout problems with the formula labels and line spacing. The most serious problem is that the title of Section 2.1, "Blur Kernel Estimation Network Nk," and the name of this module in Figure 1 are different. The names of Figure 4 and Figure 5 are the same, but they describe different networks. Have the authors carefully checked the paper? The resolution of Figures 7-8 is too low, and no local magnification is provided, making it impossible to clearly show the differences in SR effects.
Response: We thank the reviewer for carefully reviewing the quality of the figures and the formatting details. Due to adjustments during typesetting and conversion, the clarity of some images has indeed decreased, and details such as line spacing and labels are not well presented. In the revised version, we will comprehensively check the layout, unify figure naming, improve image resolution, and add the necessary magnified regions and annotations to ensure overall clarity and professionalism.
The inconsistency between the title "Blur Kernel Estimation Network Nk" in Section 2.1 and the name of this module in Figure 1 is due to our negligence during the naming adjustment process; we will unify the naming in subsequent versions to avoid ambiguity. Likewise, where Figure 4 and Figure 5 share the same title but show different content, we will correct the correspondence between figure numbers and titles so that they better match reading conventions.
We thank the reviewer again for these careful observations. We will take all inconsistencies between the text and figures seriously to ensure that the final manuscript meets higher standards in both content and format.
(2) The division of the training and test sets mentioned only 25 videos and did not explain how the representativeness of the dataset was verified. Moreover, the comparison experiments are seriously insufficient: the method in this paper was compared with only two methods.
Response: The proposed method focuses on processing aircraft and missile targets against sky backgrounds, within the Strategic Priority Research Program of the Chinese Academy of Sciences. Bicubic interpolation is a general comparison method for image processing, and the proposed method is well optimized for this specialized target imagery. Other infrared applications will be investigated in future work.
In terms of comparative experiments, the two methods selected in this paper are representative and serve as meaningful references in the field, so the comparison remains significant. In the future, we plan to further expand the set of comparison methods to enhance the comprehensiveness and persuasiveness of the experimental results.
(3) The self-supervised deep blind video super-resolution proposed by the authors lacks innovation and is too similar to Self-Supervised Deep Blind Video Super-Resolution (2022).
Response: We thank the reviewer for the valuable feedback. The self-supervised strategy adopted in this paper does indeed draw on the basic ideas of related work, but our research focus is not a completely new framework; rather, building on that basis and considering the special requirements of infrared imaging, we optimize the video super-resolution module in a targeted manner. Specifically, we introduce deformable convolution instead of traditional optical flow estimation for inter-frame alignment, which better suits the unclear texture details and alignment difficulties of infrared images and further improves the reconstruction performance of the model on infrared videos.
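To make the alignment idea concrete, the sketch below shows a heavily simplified, offset-based alignment in TensorFlow: a single (dy, dx) offset is predicted per pixel and the neighbor frame's features are bilinearly resampled toward the reference. A full deformable convolution instead predicts separate offsets for every kernel sampling location (plus modulation masks in DCNv2); all names here are illustrative, not the paper's implementation.

import tensorflow as tf

def bilinear_sample(feat, coords):
    # Bilinearly sample feat (B, H, W, C) at float positions coords (B, H, W, 2),
    # where coords[..., 0] is y and coords[..., 1] is x.
    h = tf.shape(feat)[1]
    w = tf.shape(feat)[2]
    y, x = coords[..., 0], coords[..., 1]
    y0, x0 = tf.floor(y), tf.floor(x)
    wy, wx = y - y0, x - x0  # fractional parts of the sampling positions

    def gather(yy, xx):
        yy = tf.clip_by_value(tf.cast(yy, tf.int32), 0, h - 1)
        xx = tf.clip_by_value(tf.cast(xx, tf.int32), 0, w - 1)
        return tf.gather_nd(feat, tf.stack([yy, xx], axis=-1), batch_dims=1)

    top = (1 - wx)[..., None] * gather(y0, x0) + wx[..., None] * gather(y0, x0 + 1)
    bot = (1 - wx)[..., None] * gather(y0 + 1, x0) + wx[..., None] * gather(y0 + 1, x0 + 1)
    return (1 - wy)[..., None] * top + wy[..., None] * bot

def align_to_reference(ref, nbr):
    # Predict a per-pixel (dy, dx) offset field from the concatenated features
    # and resample the neighbor features toward the reference frame.
    offsets = tf.keras.layers.Conv2D(2, 3, padding="same")(
        tf.concat([ref, nbr], axis=-1))                      # (B, H, W, 2)
    h = tf.shape(nbr)[1]
    w = tf.shape(nbr)[2]
    gy, gx = tf.meshgrid(tf.cast(tf.range(h), tf.float32),
                         tf.cast(tf.range(w), tf.float32), indexing="ij")
    base = tf.stack([gy, gx], axis=-1)[None]                 # identity sampling grid
    return bilinear_sample(nbr, base + offsets)

Because the offsets are real-valued and the sampling is bilinear, the whole operation is differentiable and can be trained end-to-end without explicit optical flow.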
(4) The usage of the blur kernel ki in Self-Supervised Deep Blind Video Super-Resolution is modified in the current network diagram of the paper, but this is not explained in the figures or text. The authors need to explain the treatment of this part.
Response: We thank the reviewer for the attention to the design of the loss function. Our understanding is that the current loss function does not introduce supervision on the blur kernels. In the initial experimental stage of this work, we did attempt to incorporate a blur-kernel-related loss term into the overall optimization objective. However, in infrared video scenes, overly strong blur constraints actually degrade the model's ability to recover structural details, lowering overall performance.
Therefore, after comprehensively weighing the experimental results and the characteristics of infrared images, we chose the current loss design.
(5) The references are not recent enough, and the citation format is not uniform: some conference names are abbreviated, and some are given in full.
Response: The references are thoroughly improved.
(6) In the second section, the description of the network is chaotic, without a detailed introduction of the connections between modules. In the cascade of deformable convolutions, was verifying the multi-scale design through ablation experiments considered? In Figure 4, does the feature extraction network use a convolutional layer and five residual blocks to extract multi-scale features? Consider comparing networks of different depths to assess the impact on performance.
Response: We thank the reviewer for the attention to the details of the network architecture design. We introduced the overall network structure in the second section, and the connections between modules follow the basic logic of self-supervised learning and temporal modeling. Due to space limitations, some connection details are not elaborated in the main text, but we have expressed the hierarchy and information flow of the modules as clearly as possible in the diagram.
The design of the cascaded, multi-scale structure of the deformable convolutions was indeed a key consideration in our model construction. Given the unique spatial distribution and detail characteristics of infrared images, our design emphasizes the balance between structural compactness and practical effect. Through multiple rounds of experimental adjustment, the current cascade structure achieves a good compromise between performance and computational cost.
The feature extraction network in Figure 4 adopts a combination of convolutional layers and residual blocks, mainly to achieve cross-layer information fusion and multi-scale feature extraction. We agree that further analysis of network depth is worth exploring, and future work will consider introducing more structural variants and ablation experiments to further enrich the model analysis and validation.
(7) There are too few experiments in Section 3, and only PSNR/SSIM is used as objective indicators. It is suggested to add comparative tests and to introduce the noise suppression ratio (NSR) and other indicators for comparison.
Response: The experiments and analysis are improved, and the processing time analysis is improved. Our reasons for using PSNR and SSIM as general, comparable technical indicators for image processing are as follows:
Industry standards: international standards organizations (such as ITU-T and ISO) still regard PSNR as a core quality index in video coding standards (H.264/AVC and HEVC).
Comparability: PSNR provides a unified quantitative standard across algorithms and studies, which facilitates reproduction and comparison of results.
Combination with SSIM: PSNR reflects global fidelity, while SSIM focuses on local structure; combining the two allows a comprehensive quality evaluation.
The authority of PSNR stems from its mathematical simplicity, historical status, standardization support, and practicality in specific tasks. Although it does not align well with human visual perception, it remains an irreplaceable basic indicator in image processing as an objective measure of distortion. In practice, we recommend combining task requirements with perceptual indicators (such as SSIM) to balance efficiency and accuracy.
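For reference, both indicators are directly available in TensorFlow, which makes the evaluation protocol easy to reproduce; sr and hr below are placeholders for the super-resolved and ground-truth frame batches:

import tensorflow as tf

# sr, hr: (B, H, W, C) tensors with pixel values scaled to [0, 1].
psnr = tf.image.psnr(sr, hr, max_val=1.0)  # (B,) PSNR in dB, one value per frame
ssim = tf.image.ssim(sr, hr, max_val=1.0)  # (B,) SSIM, one value per frame
print(float(tf.reduce_mean(psnr)), float(tf.reduce_mean(ssim)))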
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The revised manuscript generally addresses the previous comments; however, it is recommended to enhance the quality of all figures throughout the manuscript, as many currently appear blurry and unclear.
Author Response
Reply to Reviewer 1
(1) The revised manuscript generally addresses the previous comments; however, it is recommended to enhance the quality of all figures throughout the manuscript, as many currently appear blurry and unclear.
Response:
We sincerely thank the reviewer for the recognition of the revisions and the valuable suggestion regarding figure quality. We acknowledge the importance of clear and high-resolution illustrations for effective communication. In the revised manuscript, we will carefully re-export all figures at higher resolution and ensure consistency in visual clarity throughout the paper to enhance readability and presentation quality. Your feedback is greatly appreciated and will help us further improve the manuscript.
Reviewer 2 Report
Comments and Suggestions for Authors
All my comments have been addressed.
Author Response
Reply to Reviewer 2
(1) All my comments have been addressed.
Response:
We sincerely thank the reviewer for the time and effort spent reviewing our manuscript. We are grateful for your constructive feedback and pleased that all your comments have been satisfactorily addressed.
Reviewer 3 Report
Comments and Suggestions for Authors
The responses to most of my previous comments are not convincing, e.g.:
1. The article still needs a review of grammar and typos; the authors should check which verb tense is appropriate for the literature review.
2. The response to comment 3 is not reflected in the new version of the paper.
3. For comment 6, an in-depth description and step-by-step derivation of RCAB must be provided.
4. The proposed method must be compared with more recent comparison methods. In addition, a discussion section that discusses the experimental results and the advantages and disadvantages/limitations of all methods used (including yours) must be provided.
5. The author must add a future work section to discuss further optimization of the network's computational efficiency or applying this network to other video tasks.
Comments on the Quality of English Language
Must be improved.
Author Response
Reply to Reviewer 3
(1) The article still needs a review of grammar and typos; the authors should check which verb tense is appropriate for the literature review.
Response:
We thank the reviewer for the helpful suggestion regarding grammar and verb tense usage. We agree that appropriate tense usage in the literature review is important for clarity and consistency. We will carefully proofread the manuscript to improve grammatical accuracy and ensure consistent verb tense, particularly in the literature-related sections.
(2) The response to comment 3 is not reflected in the new version of the paper.
In Fig. 2, a skip connection is used between two features with a pooling layer in between. How do the authors adjust the dimensions? The authors should add more details, including but not limited to the dimensions at each stage.
Response:
Thank you for the clarification. We have added these details to Section 2.1 of the paper.
(3) For comment 6, an in-depth description and step-by-step derivation of RCAB must be provided.
“RCAB combines Channel Attention (CA) mechanism with residual concept.” This part needs an in-depth description.
Response:
We appreciate the reviewer’s suggestion. In response, we have incorporated a more detailed explanation and step-by-step description of the RCAB module in Section 2.1 of the paper.
(4) The proposed method must be compared with more recent comparison methods. In addition, a discussion section that discusses the experimental results and the advantages and disadvantages/limitations of all methods used (including yours) must be provided.
Response:
We sincerely appreciate the reviewer’s insightful suggestions. We fully agree that including more recent comparison methods and a comprehensive discussion of experimental results can provide additional perspectives. In this study, we have aimed to ensure the clarity and focus of the evaluation by selecting representative baselines and highlighting the core contributions of our method. We acknowledge the importance of broader comparisons and more detailed discussions, and will carefully consider these directions in our subsequent research and future versions of the work.
(5) The author must add a future work section to discuss further optimization of the network's computational efficiency or applying this network to other video tasks.
Response:
Thank you for your feedback. We have added relevant discussions in the conclusion section.
Reviewer 4 Report
Comments and Suggestions for Authors
1. The revised draft has not added any comparative experiments. The method in this paper was compared with only two methods, which is insufficient to support the validation. It is suggested to add comparative experiments to verify the superior performance of the proposed method and the representativeness of the dataset.
2. It is suggested that the references include papers from the past three years. Currently, there are only three papers from 2023 and beyond.
3. The revised paper still did not conduct ablation experiments to verify the necessity of each module design. The reply mentioned that the modules were adjusted through experimental verification; was demonstrating this verification process considered? Regarding the previously raised points, namely verifying the multi-scale design of the cascaded deformable convolutions through ablation experiments, and the fact that the feature extraction network in Figure 4 uses convolutional layers and five residual blocks to extract multi-scale features, the choice of five residual blocks should be justified by comparing the performance impact of networks of different depths.
Author Response
Reply to Reviewer 4
(1) The revised draft has not added any comparative experiments. The method in this paper was compared with only two methods, which is insufficient to support the validation. It is suggested to add comparative experiments to verify the superior performance of the proposed method and the representativeness of the dataset.
Response:
We thank the reviewer for the valuable suggestion. We understand that additional comparative experiments could provide a more comprehensive evaluation of the proposed method. In this work, we have carefully selected two representative methods for comparison, which we believe are sufficient to demonstrate the effectiveness of our approach. While we acknowledge that expanding the comparison to include more recent methods could further strengthen our validation, we opted to focus on these baselines in order to maintain clarity and consistency in the evaluation. We will consider incorporating additional comparisons in future work as more relevant methods become available.
(2) It is suggested that the references include papers from the past three years. Currently, there are only three papers from 2023 and beyond.
Response:
Thank you for your suggestion. We have replaced references 1-4 with recent papers published in 2023 and later.
(3) The revised paper still did not conduct ablation experiments to verify the necessity of each module design. The reply mentioned that the modules were adjusted through experimental verification; was demonstrating this verification process considered? Regarding the previously raised points, namely verifying the multi-scale design of the cascaded deformable convolutions through ablation experiments, and the fact that the feature extraction network in Figure 4 uses convolutional layers and five residual blocks to extract multi-scale features, the choice of five residual blocks should be justified by comparing the performance impact of networks of different depths.
Response:
We thank the reviewer for the insightful comment. In this work, our primary focus is on demonstrating the effectiveness of the self-supervised approach rather than performing ablation experiments. While we acknowledge that ablation studies can provide a deeper understanding of the individual contributions of each module, our main goal in this paper is to showcase the impact of the self-supervised mechanism on video super-resolution. Nevertheless, we appreciate the reviewer’s suggestion and will certainly consider incorporating ablation experiments in future work to further validate the necessity of each module and design choice.
Round 3
Reviewer 3 Report
Comments and Suggestions for Authors
Thanks for responding to the comments. I have no more comments. Congratulations!
Author Response
We sincerely thank the reviewer for the time and effort spent reviewing our manuscript. We are grateful for your constructive feedback and pleased that all your comments have been satisfactorily addressed.
Reviewer 4 Report
Comments and Suggestions for Authors
- In reply 2, the authors stated that the paper mainly aims to prove the effectiveness of the self-supervised method, without ablation experiments. However, in report 1, the innovation of the self-supervised network was questioned, and the authors replied that the self-supervised framework was based on Self-Supervised Deep Blind Video Super-Resolution and that this paper focuses on improved deformable convolutions. Yet when, in report 2, I proposed validating these improvements through ablation experiments, the authors emphasized in response 2 that the paper focuses on self-supervised networks. The authors are invited to explain.
- Regarding comparative experiments: While Bicubic serves as the traditional baseline and "Self-supervised deep blind video super-resolution" (the very paper that inspired this work's framework) is included, comparing only these three methods provides insufficient evidence for the network's superiority - particularly when using a self-constructed dataset. The authors have not addressed requests from both review rounds to expand these comparisons.
Author Response
(1) In reply 2, the authors stated that the paper mainly aims to prove the effectiveness of the self-supervised method, without ablation experiments. However, in report 1, the innovation of the self-supervised network was questioned, and the authors replied that the self-supervised framework was based on Self-Supervised Deep Blind Video Super-Resolution and that this paper focuses on improved deformable convolutions. Yet when, in report 2, I proposed validating these improvements through ablation experiments, the authors emphasized in response 2 that the paper focuses on self-supervised networks. The authors are invited to explain.
Response:
We thank the reviewer for the attention to the details of the network architecture design. We have introduced the overall network structure in Section 2, where the connections between modules are based on the fundamental principles of self-supervised learning and temporal modeling. Due to space limitations, certain connection details have not been fully elaborated in the main text. However, we have made efforts to clearly express the hierarchical relationships and information flow among modules through the provided structural diagram. In addition, the design prioritizes logical consistency across modules to ensure stable information propagation and effective feature reuse during training.
The design of cascading and multi-scale structures for deformable convolutions was indeed a key consideration during model development. Given the unique spatial distributions, texture characteristics, and noise properties of infrared images, we placed particular emphasis on balancing structural compactness with performance effectiveness. Specifically, we observed that infrared image sequences often exhibit subtle motion and localized distortions, for which deformable convolutions are particularly well-suited. Through multiple rounds of experimental adjustments, including varying the number of cascading stages and the receptive field size, we found that the current cascading structure achieves a good trade-off between computational cost and performance, enabling the network to better capture fine-grained motion details without introducing excessive model complexity.
The feature extraction network shown in Figure 4 adopts a combination of convolutional layers and residual blocks, primarily aimed at facilitating cross-layer information fusion and multi-scale feature extraction. By integrating features from different levels, the network is able to simultaneously preserve local textures and capture broader contextual information, which is crucial for enhancing the quality of super-resolved frames. We agree with the reviewer that further investigation into the effects of network depth, contrast enhancement mechanisms, and feature aggregation strategies would be valuable. In future work, we plan to explore additional structural variants, such as dilated convolutions and attention mechanisms, and conduct more comprehensive ablation experiments to further enrich the analysis and validation of our model design choices.
We sincerely thank the reviewer for this insightful comment. In the present study, our primary focus is to demonstrate the effectiveness of the self-supervised learning framework in addressing the challenges of video super-resolution, especially under conditions where ground-truth high-resolution labels are unavailable. While we acknowledge that ablation experiments can provide deeper insights into the contribution of each module, the main goal of this work is to highlight the overall impact and feasibility of the self-supervised mechanism. Nevertheless, we greatly appreciate the reviewer’s suggestion. In future research, we will actively incorporate systematic ablation studies, including analyzing the influence of different module configurations and hyperparameter settings, to further validate the necessity and effectiveness of each component in our architecture.
(2) Regarding comparative experiments: While Bicubic serves as the traditional baseline and "Self-supervised deep blind video super-resolution" (the very paper that inspired this work's framework) is included, comparing only these three methods provides insufficient evidence for the network's superiority - particularly when using a self-constructed dataset. The authors have not addressed requests from both review rounds to expand these comparisons.
Response:
We thank the reviewer for the valuable suggestion. We understand that additional comparative experiments could provide a more comprehensive evaluation of the proposed method. In this work, we have carefully selected two representative methods for comparison, which we believe are sufficient to demonstrate the effectiveness of our approach. While we acknowledge that expanding the comparison to include more recent methods could further strengthen the experimental validation, we opted to focus on these baselines to maintain clarity and consistency in the evaluation process. We will consider incorporating additional comparisons in future work as more relevant methods become available.
It is important to note that the primary focus of the proposed method is on processing specific targets, namely aircraft and missiles against a sky background, in the context of the Strategic Priority Research Program of the Chinese Academy of Sciences. The unique characteristics of these targets—including their relatively small size, high motion dynamics, and the relatively simple but noisy background—pose distinct challenges that differ from those encountered in general video super-resolution tasks.
In this regard, bicubic interpolation is used as a standard baseline for image processing evaluation, providing a widely accepted point of reference. However, many existing optical flow-based SR methods and state-of-the-art video SR models are primarily designed for general scenes with rich textures and complex motion patterns, and may not be directly suitable for the specialized target processing required in this study. These methods often rely heavily on dense motion estimation or scene consistency assumptions, which are not always applicable when dealing with isolated targets moving against relatively homogeneous backgrounds.
As mentioned above, the proposed method has been specifically optimized for the processing of plane and missile imagery under sky backgrounds. Our approach emphasizes accurate target structure restoration and effective background suppression, which are critical for improving subsequent tasks such as detection and tracking. Therefore, we believe that the current comparative experiments, although limited in number, are appropriate and sufficiently demonstrate the advantages of the proposed method for the specific application scenario considered in this paper. In future work, we plan to further expand comparative evaluations by including specialized target-focused SR methods, should they become available.