Multi-Scale Adaptive Modulation Network for Efficient Image Super-Resolution
Round 1
Reviewer 1 Report (Previous Reviewer 1)
Comments and Suggestions for Authors
The authors have addressed all the comments proposed by the previous reviewers. This manuscript is well written now and I suggest to accept this manuscript in present form.
Author Response
Comments 1: [The authors have addressed all the comments proposed by the previous reviewers. This manuscript is well written now and I suggest to accept this manuscript in present form.]
Response 1: Please allow us to express our sincerest gratitude for your time and effort in reviewing our revised manuscript and for your final assessment of the modifications made in response to the previous review comments. We are truly delighted and honored to receive your final decision to recommend acceptance. Your positive feedback, stating that "the manuscript is now well-written and acceptable in its present form," is immensely encouraging to us.
Author Response File:
Author Response.pdf
Reviewer 2 Report (Previous Reviewer 2)
Comments and Suggestions for Authors
My initial comment 1 about a table of acronyms still stands. I do not understand the authors' answer about the practice in IEEE TIP, etc.; IEEE papers very often include nomenclature tables, although I do not see how this is relevant to this submission. In any case, my comment stands, and it would have been very easy to include the requested table.
About my 3rd comment (acknowledging the classical state-of-the-art interpolation methods), I cannot locate anywhere in the revised manuscript the relevant addition mentioned in the response letter: "we have refined the introduction to acknowledge these advanced classical techniques as per the suggestion above". Thus, the comment still stands.
About my 4th comment, the authors responded that “In response, we have now included the quantitative results (PSNR/SSIM) for bicubic interpolation in the main comparison table (Table 1) for all scaling factors (×2, ×3, ×4) across the five benchmark datasets”. However, I cannot find the added results in Table 1.
I also do not understand the authors' response about the memory requirements and the number of FLOPs for the classical methods. Yes, the classical methods are not trainable, but they have specific (very small) memory requirements and FLOPs per pixel, and it would have been very easy to add them to the manuscript.
About my 5th comment, the added Table 2 should be resized to normal sized text, because it is very difficult to read. The unit of time is also missing (I assume seconds).
Author Response
Comments 1: [My initial comment 1 about a table of acronyms still stands. I do not understand the authors' answer about the practice in IEEE TIP, etc.; IEEE papers very often include nomenclature tables, although I do not see how this is relevant to this submission. In any case, my comment stands, and it would have been very easy to include the requested table.]
Response 1: We sincerely thank the reviewer for their persistence on this matter and apologize for any misunderstanding in our previous response. We fully acknowledge the value of a nomenclature table for enhancing readability.
Following your recommendation, we have now added a comprehensive nomenclature table in the manuscript, positioned after the Conflicts of Interest section and before the References. To ensure clear visual distinction, the table is formatted with a light grey background.
We agree that this addition significantly enhances the readability of the manuscript and makes it more accessible to a broader audience. We are grateful for your valuable and persistent guidance on this point, which has undoubtedly improved our work.
Comments 2: [About my 3rd comment (acknowledging the classical state-of-the-art interpolation methods), I cannot locate anywhere in the revised manuscript the relevant addition mentioned in the response letter: "we have refined the introduction to acknowledge these advanced classical techniques as per the suggestion above". Thus, the comment still stands.]
Response 2: We sincerely thank the reviewer for their careful verification and apologize for any lack of clarity in highlighting this revision in the previous version. We confirm that the specific addition addressing the previous 3rd comment has been incorporated into the Introduction.
The relevant text has been added to the Introduction (first paragraph, lines 5-8). The added sentence reads: [Beyond basic bicubic interpolation [1], more sophisticated classical methods (e.g., overlapping bicubic interpolation [2] and prior-based methods [3-5]) have also been developed, pushing the performance boundaries of non-learning-based approaches.]
This revision was included to explicitly acknowledge the advanced classical interpolation techniques, as you rightly suggested. We appreciate your meticulous attention in ensuring this point was properly addressed, and we believe the added sentence now appropriately recognizes these methods in the context of our related work.
Thank you again for your valuable guidance in strengthening our manuscript.
Comments 3: [About my 4th comment, the authors responded that “In response, we have now included the quantitative results (PSNR/SSIM) for bicubic interpolation in the main comparison table (Table 1) for all scaling factors (×2, ×3, ×4) across the five benchmark datasets”. However, I cannot find the added results in Table 1.]
Response 3: We sincerely appreciate your thorough review of our manuscript and this important verification regarding our revisions.
In response to your fourth comment, we confirm that the quantitative results (PSNR/SSIM) for bicubic interpolation across all scaling factors (×2, ×3, ×4) and the five benchmark datasets have indeed been incorporated into the revised version of Table 1. As you may have noted, the first entry in each scaling factor section of Table 1 is now "Bicubic." For instance, in the "Scale ×2" section, the first row presents the PSNR and SSIM results of the Bicubic method on the Set5, Set14, B100, Urban100, and Manga109 datasets. The same structure applies consistently to the ×3 and ×4 sections.
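Should it help readers reproduce the bicubic reference numbers, the sketch below shows one common way such a baseline is computed (bicubic downscale and upscale, followed by PSNR/SSIM on the luminance channel with a border crop equal to the scale). This is an illustration of the standard protocol under our own assumptions, not the exact evaluation script used for Table 1, and the function name is ours.

```python
# Illustrative bicubic baseline (not the authors' evaluation script):
# downscale/upscale with bicubic interpolation, then PSNR/SSIM on the
# Y channel with a border crop equal to the scale, as is common in SR work.
import cv2
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def bicubic_baseline(hr_path: str, scale: int = 4):
    hr = cv2.imread(hr_path)                               # HR image, BGR uint8
    h, w = hr.shape[:2]
    h, w = h - h % scale, w - w % scale                    # make divisible by scale
    hr = hr[:h, :w]
    lr = cv2.resize(hr, (w // scale, h // scale), interpolation=cv2.INTER_CUBIC)
    sr = cv2.resize(lr, (w, h), interpolation=cv2.INTER_CUBIC)

    hr_y = cv2.cvtColor(hr, cv2.COLOR_BGR2YCrCb)[..., 0]   # luminance channel
    sr_y = cv2.cvtColor(sr, cv2.COLOR_BGR2YCrCb)[..., 0]
    hr_y = hr_y[scale:-scale, scale:-scale]                # crop image borders
    sr_y = sr_y[scale:-scale, scale:-scale]

    psnr = peak_signal_noise_ratio(hr_y, sr_y, data_range=255)
    ssim = structural_similarity(hr_y, sr_y, data_range=255)
    return psnr, ssim
```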
We hope this clarification assists in easily locating the added data. Thank you once again for your meticulous and rigorous review, which has been invaluable in enhancing the clarity of our manuscript. Should you have any further questions, please do not hesitate to let us know.
Comments 4: [I also do not understand the authors' response about the memory requirements and the number of FLOPs for the classical methods. Yes, the classical methods are not trainable, but they have specific (very small) memory requirements and FLOPs per pixel, and it would have been very easy to add them to the manuscript.]
Response 4: We thank the reviewer for this clarification and for pushing us to provide a more complete computational comparison. We apologize for any misunderstanding in our previous response.
You are absolutely correct. While classical methods like bicubic interpolation are not trainable (hence we listed their parameters as "-"), they do have specific computational costs per pixel. In direct response to your comment, we have now added the FLOPs for the bicubic interpolation method to the relevant table in our manuscript (the Bicubic row of Table 1). The FLOPs were calculated as the number of output pixels × operations per pixel. This provides a fair and consistent basis for comparing computational efficiency across all methods, both learning-based and classical.
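For illustration, a minimal back-of-the-envelope estimate in this "output pixels × operations per pixel" style is sketched below; it assumes a separable 4×4 bicubic support with one multiply and one add per tap and ignores kernel-weight evaluation and boundary handling, so the exact per-pixel operation count used in the manuscript may differ.

```python
# Rough, illustrative FLOPs estimate for bicubic upscaling.
# Assumption: a 4x4 support with one multiply and one add per tap,
# per output pixel and per channel; kernel-weight evaluation is ignored.
def bicubic_flops(out_w: int = 1280, out_h: int = 720,
                  channels: int = 3, taps: int = 4) -> int:
    ops_per_pixel = 2 * taps * taps          # weighted sum over the 4x4 window
    return out_w * out_h * channels * ops_per_pixel

print(f"~{bicubic_flops() / 1e6:.0f} MFLOPs for a 1280x720 RGB output")
```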
We believe this addition significantly strengthens our analysis and are grateful for your expert guidance, which has helped improve the rigor and completeness of our work.
Comments 5: [About my 5th comment, the added Table 2 should be resized to normal sized text, because it is very difficult to read. The unit of time is also missing (I assume seconds).]
Response 5: We sincerely thank the reviewer for pointing out these important issues regarding the presentation of Table 2. We have carefully revised the table according to your comments.
Specifically, the font size in Table 2 has been adjusted to standard text size to significantly improve readability. Furthermore, the unit for inference time has been clearly labeled as "#Avg.Time [s]" in the table header to avoid any ambiguity. Regarding the content, we have streamlined Table 2 to include the inference times for a representative subset of methods due to length limitations. For a complete runtime comparison that includes all methods, we kindly direct the reviewer to Figure 1 (titled "Runtime vs. PSNR"), which provides a comprehensive visual analysis of the runtime versus performance trade-off.
Specifically, we have added a description of the runtime to the second paragraph of Section 4.2 ("Running time comparisons") (page 11, lines 9-10): [Specifically, the overall runtime is displayed in Figure 1 (titled "Runtime vs. PSNR").]
We believe these revisions have fully addressed your concerns and greatly enhanced the clarity of our results. Thank you again for this constructive suggestion.
Author Response File:
Author Response.pdf
Reviewer 3 Report (New Reviewer)
Comments and Suggestions for Authors
The paper presents a novel method for single image super-resolution that showcases competitive performance with existing design approaches. It is specific in the sense that it combines multi-scale feature extraction and local detail feature extraction in order to capture both local and global vision cues. These are later combined through a Swin Transformer, which performs attention-based feature fusion. The paper is well written and organized, and in its current form could be acceptable for possible publication.
Although most of the experimental results are very well presented, I would also like to ask the authors to comment on or add to the text how the proposed method relates to the ESRGAN method, which has been considered in the literature as the main baseline for single image super-resolution:
Wang, Xintao, et al. "Esrgan: Enhanced super-resolution generative adversarial networks." Proceedings of the European conference on computer vision (ECCV) workshops. 2018.
Also, some more detail regarding the spectrum-based loss function in Eq. 2 (second term) would help to better emphasize the impact of this constraint on the trained network performance (was the L1 norm computed over the amplitude spectrum, or was the phase information also included in the loss computation?).
Nice work, good luck.
Author Response
Comments 1: [Although most of the experimental results are very well presented, I would also like to ask the authors to comment on or add to the text how the proposed method relates to the ESRGAN method, which has been considered in the literature as the main baseline for single image super-resolution: Wang, Xintao, et al. "Esrgan: Enhanced super-resolution generative adversarial networks." Proceedings of the European conference on computer vision (ECCV) workshops. 2018.]
Response 1: We sincerely thank the reviewer for this valuable suggestion regarding the comparison with ESRGAN, which is indeed a seminal work in perceptual super-resolution. We are pleased to confirm that we have already incorporated a discussion of ESRGAN in the revised manuscript.
As highlighted by the reviewer, Section 2.1 (Related Works) now explicitly references both SRGAN and ESRGAN. The added text acknowledges ESRGAN's role in advancing perceptual-quality super-resolution through its use of adversarial learning, while also noting the common challenges associated with GAN-based methods, such as training instability, occasional illusory textures, structural inconsistencies, and limitations in pixel-level precision. The relevant content has been added to the first paragraph (lines 11-16) of the CNN-based super-resolution part of Section 2.1 (Related Works). The added content reads: [To overcome the limitations of the smooth results generated by previous CNN-based methods [6,13,21], generative adversarial network (GAN)-based SR approaches (e.g., [23,24]) leverage adversarial learning to reconstruct high-resolution images with more realistic textures and enhanced detail. However, these methods still suffer from challenges (e.g., training instability, occasional illusory textures, and structural inconsistencies) and face limitations in maintaining pixel-level precision.]
We thank the reviewer for emphasizing the importance of this comparison, which strengthens the positioning of our contribution in the literature.
Comments 2: [Also, some more detail regarding the spectrum-based loss function in Eq. 2 (second term) would help to better emphasize the impact of this constraint on the trained network performance (was the L1 norm computed over the amplitude spectrum, or was the phase information also included in the loss computation?).]
Response 2: We thank the reviewer for raising this important question regarding the formulation of our spectrum-based loss function. We are pleased to clarify the implementation details of our FFT loss component.
In our implementation, the second term in Eq. 2 specifically computes the L1 norm between the amplitude spectra of the predicted and ground-truth images. This design choice is based on our experimental finding that constraining the amplitude spectrum effectively enhances the recovery of frequency content and textural details, while the spatial L1 loss already provides adequate constraints for structural consistency. Our ablation studies confirmed that incorporating this amplitude-based FFT loss yields consistent PSNR improvements of 0.01-0.04 dB compared to using the L1 loss alone.
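A minimal PyTorch-style sketch of an amplitude-only FFT loss of this kind is shown below; the "ortho" normalization and the 0.05 loss weight are assumptions for illustration and are not values taken from the manuscript.

```python
# Illustrative amplitude-spectrum L1 loss (phase is deliberately ignored);
# normalization and loss weight are assumptions, not manuscript values.
import torch
import torch.nn.functional as F

def fft_amplitude_l1(sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
    sr_amp = torch.fft.fft2(sr, norm="ortho").abs()   # amplitude spectrum only
    hr_amp = torch.fft.fft2(hr, norm="ortho").abs()
    return F.l1_loss(sr_amp, hr_amp)

def total_loss(sr: torch.Tensor, hr: torch.Tensor, fft_weight: float = 0.05):
    # Spatial L1 term plus the frequency-domain amplitude term.
    return F.l1_loss(sr, hr) + fft_weight * fft_amplitude_l1(sr, hr)
```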
We appreciate the reviewer's valuable suggestion, which has helped us improve the clarity of our methodology presentation.
Author Response File:
Author Response.pdf
Reviewer 4 Report (New Reviewer)
Comments and Suggestions for Authors
The paper introduces MAMN (Multi-scale Adaptive Modulation Network), an image super-resolution (SR) architecture aiming to balance reconstruction quality and computational efficiency for real-time or resource-limited applications. Here are some suggested revisions to help improve the paper:
- Figure 1 and Table 2 are unclear; please enlarge them for better visibility.
- Core components: MAML (Multi-scale Adaptive Modulation Layer) captures non-local multi-scale representations using variance-based adaptive weighting, LDEL (Local Detail Extraction Layer) extracts high-frequency local details with lightweight depth-wise convolutions, and STL (Swin Transformer Layer) enhances long-range dependency modeling for global context. Each component (MAML, LDEL, STL) contributes measurable PSNR improvements. While MAML and LDEL are reasonable extensions, should the paper better differentiate its design from similar hybrid CNN-Transformer SR methods like HAT, SAFMN, and SMFANet? The "variance-based modulation" concept could use more theoretical grounding or comparison with standard attention weighting.
- Regarding efficiency metrics: the inference test platform (CPU-based) may not fully reflect GPU performance; add GPU latency comparisons for fairness. Clarify the FLOP measurement assumptions (input size, color channels, etc.).
- If possible, report standard deviations or statistical significance to show the robustness of small PSNR gains (e.g., +0.04 dB).
- Figures 3 and 4 could include zoomed-in regions or difference maps to make improvements more visible.
- Please include comparisons with non-lightweight transformer-based SR models (e.g., HAT, SwinIR-Large) to show trade-offs beyond the lightweight domain.
- For future work, please add plans to explore low-rank attention or adaptive token pruning to further reduce the transformer cost.
Author Response
Comments 1: [Figure 1 and Table 2 are unclear; please enlarge them for better visibility.]
Response 1: We thank the reviewer for pointing out the issue with the clarity of Figure 1 and Table 2. We have carefully revised both elements to improve their visibility.
Specifically, we have enlarged the local regions in Figure 1 to ensure all details are clearly legible. Similarly, Table 2 has been resized with a larger, standard font size and an adjusted layout to enhance readability.
We believe these modifications have successfully addressed the visibility concerns. Thank you again for this valuable suggestion, which has helped improve the presentation quality of our manuscript.
Comments 2: [Core components: MAML (Multi-scale Adaptive Modulation Layer) captures non-local multi-scale representations using variance-based adaptive weighting, LDEL (Local Detail Extraction Layer) extracts high-frequency local details with lightweight depth-wise convolutions, and STL (Swin Transformer Layer) enhances long-range dependency modeling for global context. Each component (MAML, LDEL, STL) contributes measurable PSNR improvements. While MAML and LDEL are reasonable extensions, should the paper better differentiate its design from similar hybrid CNN-Transformer SR methods like HAT, SAFMN, and SMFANet? The "variance-based modulation" concept could use more theoretical grounding or comparison with standard attention weighting.]
Response 2: We sincerely thank the reviewer for these insightful suggestions regarding architectural differentiation and theoretical grounding. We have thoroughly addressed these concerns through substantial revisions to our manuscript.
- Enhanced Architectural Differentiation
In direct response to your comment, we have now separately listed the comparisons between CNN-based and similar Transformer-based methods (HAT, SAFMN, SMFANet) in Tables 1 and 2. The revised Section 4.2 now clearly highlights these differences and their impact on the performance-efficiency trade-off. The relevant content has been added to the second paragraph of the Quantitative comparison in Section 4.2. The added content reads: [Additionally, we also compared our approach with attention-based methods, including lightweight dynamic modulation (e.g., SAFMN [10], SMFANet [34], and SRConvNet [40]) and large-scale self-attention (e.g., SwinIR [11], HAT [26], and RGT [41]). As observed in Table 2, compared to similar lightweight models (SAFMN, SMFANet, SRConvNet), the proposed MAMN consistently achieves the best overall performance while maintaining a comparable parameter count. At scaling factors of ×3 and ×4, MAMN attains the highest metrics across all five test datasets. The most significant improvements are observed at the ×3 scale on the Urban100 (28.43/0.8570) and Manga109 (34.20/0.9478) datasets, while maintaining reasonable FLOPs control (only 21G at ×4). Compared to large-scale models (SwinIR, HAT, RGT) with parameter counts dozens of times greater than ours, MAMN achieves approximately 95% of HAT's PSNR performance in ×4 SR tasks while utilizing less than 3.1% of the parameters (0.31M vs. 10M-21M). In terms of computational efficiency, MAMN requires only 1.4%-3.5% of the FLOPs of these large models (e.g., 21G vs. 592G-1458G at ×4 scale). Particularly on the Set5 dataset for ×4 SR, the proposed method attains about 98% of HAT's performance with merely 1.5% of its parameters.]
- Experimental Validation of Variance-based Modulation
Following the reviewer's suggestion, we have conducted new ablation studies comparing our variance-based modulation mechanism with standard self-attention ("VM → self-attention" in Table 4). The results (now included in the revised Section 4.3) demonstrate that replacing variance modulation with standard attention mechanisms improves performance (within 0.05-0.07 dB in PSNR) but leads to a sharp increase in parameters and computational complexity, rising by 58.6% and 81%, respectively. This provides empirical evidence for the effectiveness of our design choice. The relevant content has been added to Variance modulation in Section 4.3. The added content reads: [Furthermore, replacing variance modulation with standard attention mechanisms improves performance but leads to a sharp increase in parameters and computational complexity, rising by 58.6% and 81%, respectively. These findings confirm that variance modulation plays a critical role in improving the representational capacity of the model.]
- Theoretical Foundation Development
We acknowledge the reviewer's valid point regarding theoretical grounding. While we have added initial theoretical motivation in Section 3.2 relating feature variance to local complexity, we recognize this requires further development. We are actively exploring more comprehensive theoretical frameworks, which we plan to address in future work.
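To make the intuition concrete, the toy sketch below illustrates the general idea of variance-based modulation: per-channel spatial variance serves as a cheap, parameter-free statistic of local complexity and is mapped to a gate that re-weights the features. It is a simplified illustration under our own assumptions, not the MAML implementation described in the paper.

```python
# Simplified illustration (not the paper's MAML): per-channel spatial variance
# acts as a parameter-free statistic that adaptively modulates the features.
import torch
import torch.nn as nn

class VarianceModulation(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.to_gate = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        var = x.var(dim=(2, 3), keepdim=True)      # (B, C, 1, 1); texture-rich regions -> high variance
        gate = torch.sigmoid(self.to_gate(var))    # adaptive per-channel weighting
        return x * gate                            # modulate the input features

x = torch.randn(1, 36, 64, 64)
y = VarianceModulation(36)(x)                      # same shape, re-weighted
```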
We believe these revisions significantly strengthen our methodological contribution and provide clearer differentiation from related works. Thank you for this constructive suggestion that has helped improve our paper's clarity and novelty presentation.
Comments 3: [Regarding efficiency metrics: the inference test platform (CPU-based) may not fully reflect GPU performance; add GPU latency comparisons for fairness. Clarify the FLOP measurement assumptions (input size, color channels, etc.).]
Response 3: We thank the reviewer for raising this critical point regarding hardware-specific performance evaluation. We have thoroughly revised our efficiency analysis section to address this concern directly.
Key improvements made in the revision:
(1) Platform Benchmarking: We now provide GPU inference times for all compared methods, measured on an Intel(R) i5-13600KF processor (20 cores @ 3.5GHz) with an NVIDIA GeForce RTX 4060 Ti GPU on the Windows operating system. The relevant content has been revised in the Running time comparisons of Section 4.2 (lines 12-16). The revised sentence reads: [The test platform utilizes an Intel(R) i5-13600KF processor (20 cores @ 3.5GHz), 32GB system memory and an NVIDIA GeForce RTX 4060 Ti GPU running on the Windows operating system.]
[As shown in Table 1 and Table 2, MAMN achieves superior reconstruction quality across all five benchmark datasets while being 20.5% faster than HNCT (0.066s vs. 0.083s). Compared to the lightweight SMFANet [34], which requires 0.034s, MAMN maintains a better trade-off between performance and speed, achieving significantly higher accuracy despite a moderate increase in inference time. Furthermore, the proposed method demonstrates substantially greater efficiency than several larger models: DPSR [38] requires 0.169s (156.1% slower), while LatticeNet [28] takes 0.120s (81.8% slower) to process the same data. Table 1 and Table 2 indicate that the proposed method balances computational efficiency with high reconstruction quality. Specifically, the overall runtime is displayed in Figure 1 (titled "Runtime vs. PSNR").]
(2) Explicit FLOPs Calculation Clarification: We have now stated the assumptions used for the FLOPs calculation in the manuscript. The FLOPs are clearly defined as being calculated (for a standard output size of 1280×720 pixels with 3 color channels) with the fvcore library, allowing for a direct and fair comparison across all evaluated methods; a brief measurement sketch is given below. The relevant content has been revised in the Quantitative comparisons of Section 4.2 (lines 12-16). The revised sentence reads: [Besides the widely adopted PSNR and SSIM metrics, we also provide parameters and FLOPs (3 color channels) to assess model complexity with the fvcore library (i.e., fvcore.nn.flop_count_str) for an LR image upscaled to 1280 × 720 pixels.]
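The sketch below illustrates both measurements under stated assumptions: fvcore's FlopCountAnalysis for the FLOP count and a warm-up-plus-synchronization loop for GPU latency. The stand-in model and the input size (a 320×180 LR patch yielding a 1280×720 output at ×4) are placeholders, not the authors' exact script.

```python
# Illustrative measurement sketch (not the authors' exact script): FLOPs via
# fvcore and GPU latency with warm-up and explicit CUDA synchronization.
import time
import torch
from fvcore.nn import FlopCountAnalysis, flop_count_str

device = "cuda" if torch.cuda.is_available() else "cpu"
# Stand-in module; replace with the SR network under test.
model = torch.nn.Conv2d(3, 3, 3, padding=1).to(device).eval()
lr = torch.randn(1, 3, 180, 320, device=device)   # 320x180 LR -> 1280x720 output at x4

with torch.no_grad():
    print(flop_count_str(FlopCountAnalysis(model, lr)))  # per-layer FLOP breakdown

    for _ in range(10):                  # warm-up to avoid startup overhead
        model(lr)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(100):
        model(lr)
    if device == "cuda":
        torch.cuda.synchronize()         # wait for queued kernels before stopping the clock
print(f"avg inference time: {(time.perf_counter() - start) / 100:.4f} s")
```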
These revisions ensure a comprehensive and fair evaluation of computational efficiency across different deployment scenarios. Thank you for this valuable suggestion, which has significantly strengthened the practical relevance of our efficiency analysis.
Comments 4: [If possible, report standard deviations or statistical significance to show the robustness of small PSNR gains (e.g., +0.04 dB).]
Response 4: We thank the reviewer for this important suggestion regarding the statistical robustness of the reported PSNR gains. We fully agree that demonstrating the significance of marginal improvements is crucial for validating the performance of our method.
In response to your comment, we have conducted additional statistical analysis on the performance metrics. While standard deviation values were not commonly reported in the cited benchmark studies for direct comparison, we have performed the following to address this point:
(1) Statistical Significance Testing: We conducted paired t-tests on the PSNR values obtained from multiple independent runs for our method and the top-performing baseline (LCRCA [29]) on the Set5 dataset at scale ×4. The results confirm that the observed improvement of +0.02 dB (32.35 vs. 32.33) is statistically significant (p < 0.05); a minimal sketch of such a test is given after this list.
(2) Consistency Across Datasets: The robustness of our method is further evidenced by its consistent top-tier performance across all five benchmark datasets (Set5, Set14, B100, Urban100, Manga109), rather than on a single test dataset. This consistent ranking across diverse content reinforces the practical significance of the gains.
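For completeness, a minimal sketch of the paired test described in point (1) is given below; the PSNR lists are hypothetical placeholders for the per-run measurements, not the actual experimental values.

```python
# Sketch of a paired t-test on per-run PSNR values (placeholder numbers).
from scipy.stats import ttest_rel

psnr_mamn  = [32.35, 32.36, 32.34, 32.35, 32.33]   # proposed method, repeated runs
psnr_lcrca = [32.33, 32.34, 32.32, 32.33, 32.31]   # baseline, same runs

t_stat, p_value = ttest_rel(psnr_mamn, psnr_lcrca)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")       # p < 0.05 -> significant difference
```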
We believe these additions strengthen the validity of our performance claims and thank the reviewer for the suggestion to enhance the statistical rigor of our analysis.
Comments 5: [Figures 3 and 4 could include zoomed-in regions or difference maps to make improvements more visible.]
Response 5: We thank the reviewer for this excellent suggestion. In response to your comment, we have now updated Figures 3 and 4 to include zoomed-in regions for a clearer and more direct visual comparison.
These views allow for a detailed inspection of the textural and structural improvements achieved by our method, particularly in complex areas where the advantages are most pronounced. We believe these revised figures now more effectively demonstrate the visual quality enhancements discussed in the manuscript.
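As an aside for readers who wish to produce similar visualizations, the sketch below shows one simple way to generate zoomed-in crops and, optionally, amplified difference maps; the file paths, crop box, and amplification factor are placeholders rather than the settings used for Figures 3 and 4.

```python
# Illustrative helper for zoomed-in crops and amplified difference maps;
# paths, crop box, and amplification factor are placeholders.
import cv2

def zoom_and_diff(sr_path: str, hr_path: str,
                  box=(100, 100, 64, 64), zoom: int = 4, amplify: float = 4.0):
    sr, hr = cv2.imread(sr_path), cv2.imread(hr_path)
    x, y, w, h = box
    crop_sr, crop_hr = sr[y:y + h, x:x + w], hr[y:y + h, x:x + w]
    # Nearest-neighbour enlargement keeps the reconstructed pixels visible.
    zoom_sr = cv2.resize(crop_sr, None, fx=zoom, fy=zoom,
                         interpolation=cv2.INTER_NEAREST)
    # Absolute error, amplified and colour-mapped: warmer colours = larger error.
    diff = cv2.absdiff(crop_sr, crop_hr)
    diff = cv2.applyColorMap(cv2.convertScaleAbs(diff, alpha=amplify),
                             cv2.COLORMAP_JET)
    return zoom_sr, diff
```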
We are grateful for your valuable input, which has undoubtedly strengthened the visual presentation of our results.
Comments 6: [Please include comparisons with non-lightweight transformer-based SR models (e.g., HAT, SwinIR-Large) to show trade-offs beyond the lightweight domain.]
Response 6: We thank the reviewer for this valuable suggestion. In response, we have expanded our experimental comparisons to include comprehensive evaluations against leading non-lightweight transformer-based SR models, specifically HAT, SwinIR, and RGT, as now detailed in Section 4.2 and Table 2 of the revised manuscript.
The results clearly demonstrate the performance-efficiency trade-offs: while our method maintains a lightweight design, it achieves highly competitive restoration quality. We have added the relevant explanatory content to the second paragraph of the Quantitative comparison in Section 4.2. The added content reads: [Additionally, we also compared our approach with attention-based methods, including lightweight dynamic modulation (e.g., SAFMN [10], SMFANet [34], and SRConvNet [40]) and large-scale self-attention (e.g., SwinIR [11], HAT [26], and RGT [41]). As observed in Table 2, compared to similar lightweight models (SAFMN, SMFANet, SRConvNet), the proposed MAMN consistently achieves the best overall performance while maintaining a comparable parameter count. At scaling factors of ×3 and ×4, MAMN attains the highest metrics across all five test datasets. The most significant improvements are observed at the ×3 scale on the Urban100 (28.43/0.8570) and Manga109 (34.20/0.9478) datasets, while maintaining reasonable FLOPs control (only 21G at ×4). Compared to large-scale models (SwinIR, HAT, RGT) with parameter counts dozens of times greater than ours, MAMN achieves approximately 95% of HAT's PSNR performance in ×4 SR tasks while utilizing less than 3.1% of the parameters (0.31M vs. 10M-21M). In terms of computational efficiency, MAMN requires only 1.4%-3.5% of the FLOPs of these large models (e.g., 21G vs. 592G-1458G at ×4 scale). Particularly on the Set5 dataset for ×4 SR, the proposed method attains about 98% of HAT's performance with merely 1.5% of its parameters.]
We believe this addition significantly strengthens the practical relevance of our contribution and thank the reviewer for the excellent suggestion.
Comments 7: [For future work, please add plans to explore low-rank attention or adaptive token pruning to further reduce the transformer cost.]
Response 7: We thank the reviewer for this insightful suggestion. We fully agree that exploring techniques like low-rank attention and adaptive token pruning represents a highly promising direction for reducing the computational cost of transformer-based models.
The relevant content has been added to the section Conclusion (lines 12-16). The added sentence reads: [Future work will focus on the development of architectural optimization and acceleration strategies (e.g., exploring low-rank attention mechanisms or adaptive token pruning) to substantially reduce the computational costs of transformer-based models and enhance operational efficiency in HR application scenarios.]
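Purely as an illustration of this future-work direction (and not part of MAMN), the toy sketch below shows a Linformer-style low-rank attention in which keys and values are projected from N tokens down to r tokens, reducing the attention cost from O(N^2) to O(N·r); all dimensions here are arbitrary placeholders.

```python
# Toy Linformer-style low-rank attention (illustration only, not part of MAMN):
# keys/values are projected along the token axis from N to r << N tokens.
import torch
import torch.nn as nn

class LowRankAttention(nn.Module):
    def __init__(self, dim: int, num_tokens: int, rank: int = 64):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.reduce = nn.Linear(num_tokens, rank)   # project the token axis N -> r
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, N, C)
        q = self.q(x)
        k, v = self.kv(x).chunk(2, dim=-1)
        k = self.reduce(k.transpose(1, 2)).transpose(1, 2)            # (B, r, C)
        v = self.reduce(v.transpose(1, 2)).transpose(1, 2)            # (B, r, C)
        attn = (q @ k.transpose(1, 2) * self.scale).softmax(dim=-1)   # (B, N, r)
        return attn @ v                                               # (B, N, C)

y = LowRankAttention(dim=36, num_tokens=4096)(torch.randn(1, 4096, 36))
```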
We believe this addition effectively addresses your suggestion and strengthens the perspective of our future research plan. Thank you for this constructive input.
Author Response File:
Author Response.pdf
Round 2
Reviewer 2 Report (Previous Reviewer 2)
Comments and Suggestions for Authors
The authors have responded to my comments satisfactorily.
This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This manuscript introduces an image super-resolution network combining multi-scale adaptive modulation and a Swin Transformer. However, the manuscript needs some modifications before it can be published. The following are the reviewer's comments and suggestions, aiming to improve the clarity, precision, and overall quality of the manuscript.
- The abstract lacks an explanation of the advantages of the proposed method, as well as a brief description of the overall operational workflow.
- The manuscript does not provide a description of the motivation behind the proposed MAMN framework, nor does it detail its implementation.
- The manuscript lacks a description of Figure 1, and the physical meaning of the parameters shown therein is not clearly explained.
- The description of the innovations in the manuscript is overly general and does not specifically explain the core innovation, practical value, and operational workflow of each module within the overall framework.
- The introduction section lacks an outline of the arrangement of the subsequent chapters.
- To improve the readability of the manuscript, necessary annotations should be added to the figures, such as “input” and “result.”
- The manuscript lacks a detailed description of each innovation, and no flow chart is provided in each part of the proposed method to show the operation process and design motivation.
- The experimental section does not describe the datasets used, nor does it explain the rationale for selecting these datasets as training and testing sets.
- Some of the comparative methods selected in the manuscript are outdated, which fails to highlight the advantages of the proposed method.
- The manuscript does not provide a justification or explanation of the principles underlying the evaluation metrics used.
- The conclusion states that the MAMN network achieves a trade-off between computational complexity and reconstruction quality, but the manuscript does not provide any explanation regarding computational complexity.
- There are many obvious small errors in the manuscript, such as incorrect letter case, e.g., on line 178 and line 198 of page 6.
Reviewer 2 Report
Comments and Suggestions for Authors
This manuscript proposes a state-of-the-art deep learning model for image zooming, or image super-resolution. Overall, the manuscript is interesting and relatively well written. The proposed deep learning components are described, and the results use a number of standard datasets with many deep learning methods. I also appreciate the ablation study.
Few issues need resolution, before reconsidering the paper.
I suggest to include a table for the acronyms used in the manuscript.
In Eq.(5), please do not use “*” for scalar multiplication.
The use of classical interpolation methods is neglected. Besides the standard bilinear and bicubic interpolation, the authors should reference the generalized convolution interpolation methods, i.e., B-splines (by Unser et al.) and the maximal-order interpolation of minimal support (MOMS, IEEE TIP 2001), as well as Hermite kernels for image zooming (IEEE DSP 2025).
The authors should also include quantitative results from at least one of the classic methods (such as bicubic interpolation) in at least one table as a baseline, and discuss the difference in required memory and floating-point operations.
Also, please discuss the applicability of the deep learning methods to typical-size images and provide the inference time needed to zoom an image on standard computer hardware.

