Review Reports
- Yapei Feng*,
- Yuxiang Tang and
- Hua Zhong
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
I want to thank the authors for their research contribution. The study presents an interesting topic relevant to many scientific fields. For this reason, the authors should supplement and improve the paper according to the comments below.
1. The introduction is limited to generic inpainting. The authors do not consider engineering contexts where inpainting has practical value. Why this omission? Do the authors plan to strengthen the application framework in biomedical or sensory contexts?
2. The authors do not provide sufficient details on hyperparameters, hardware configuration, and CPU effort, making reproducibility difficult. How do you intend to address this shortcoming? In addition, the method's effectiveness is only demonstrated on high-quality datasets. How does your model perform in real-world conditions with low-light acquisitions or limited instrumentation? Can contrast enhancement techniques be used to face this issue (see, for example, https://doi.org/10.1109/ACCESS.2017.2776349)?
3. The paper reports that the study provides innovative contributions without comparing them to diagnostic imaging methods. How do the authors intend to justify these claims? Furthermore, the authors limited their analysis to RGB images only. How do you plan to extend the method to multimodal scenarios?
4. In regions of high uncertainty, semantic priors risk introducing errors. Have the authors evaluated fuzzy approaches to improve robustness? Recent studies have investigated intuitionistic fuzzy divergence techniques for assessing the mechanical stress state of steel plates subject to bi-axial loads (https://doi.org/10.3233/ICA-230730). The authors should integrate the Methods section to provide a theoretical basis for addressing uncertainty.
Author Response
Comment 1: The introduction is limited to generic inpainting. The authors do not consider engineering contexts where inpainting has practical value. Why this omission? Do the authors plan to strengthen the application framework in biomedical or sensory contexts?
Response 1: We thank the reviewer for this insightful suggestion. We agree that emphasizing practical application contexts would significantly enhance the relevance of our work. In the revised manuscript, we have expanded the Introduction section to include specific engineering and biomedical application examples where image inpainting plays a critical role. For instance, we now discuss its applications in medical image restoration (such as repairing damaged MRI or CT scans) [1] and remote sensing (e.g., reconstructing cloud-occluded satellite imagery) [2]. These additions help situate our methodological contributions within practical application scenarios. We believe this enhancement better aligns the paper with the needs of applied research communities. The detailed content can be found in the first paragraph of Section 1, indicated in blue font.
Comment 2: The authors do not provide sufficient details on hyperparameters, hardware configuration, and CPU effort, making reproducibility difficult. How do you intend to address this shortcoming? In addition, the method's effectiveness is only demonstrated on high-quality datasets. How does your model perform in real-world conditions with low-light acquisitions or limited instrumentation? Can contrast enhancement techniques be used to face this issue (see, for example, https://doi.org/10.1109/ACCESS.2017.2776349)?
Response 2: (1) We sincerely apologize for the insufficient detail regarding hyperparameters, hardware configuration, and CPU computational cost, which has indeed hindered the reproducibility of the study. We also recognize the limitation of validating only on high-quality datasets, and we have comprehensively improved this part of the manuscript:
1) Hyperparameter details: We have detailed the hyperparameter settings in the last paragraph of Section 4.1 (marked in blue font): in our experiments, we use the Adam optimizer, with β₁ and β₂ set to 0.0 and 0.9, respectively, and a learning rate of 10⁻⁴. During training, the images in the training set are resized to 256×256. (A minimal configuration sketch is given after this list.)
2) Hardware and computational cost: We now specify the experimental hardware configuration: an Intel Xeon Gold 6348 CPU (2.60 GHz) and an NVIDIA RTX 4090 GPU. Additionally, we have listed the computational performance and efficiency of the algorithm in Section 4.5, providing a basis for reproducibility.
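For concreteness, the following is a minimal PyTorch sketch of the optimizer configuration described in point 1); the generator module here is a placeholder standing in for our network, not the actual architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder module standing in for the inpainting generator.
generator = nn.Conv2d(3, 3, kernel_size=3, padding=1)

# Adam with beta1 = 0.0, beta2 = 0.9 and a learning rate of 1e-4 (Section 4.1).
optimizer = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.0, 0.9))

# Training images are resized to 256x256 before entering the network.
img = torch.rand(1, 3, 512, 512)
img = F.interpolate(img, size=(256, 256), mode="bilinear", align_corners=False)
```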
(2) The question of how the model performs in real-world conditions with low-light acquisitions or limited instrumentation is discussed and analyzed in the last paragraph of Section 4.4.
(3) We thank the reviewer for suggesting the use of contrast enhancement techniques as a pre-processing step to improve performance under low-light conditions. This is a valuable insight for enhancing model robustness in practical applications.
After careful consideration, we have decided not to integrate this into our main experimental pipeline for the current study for the following reasons:
The core focus of this work is to propose a novel image inpainting algorithm based on structural prior and multi-scale fusion. Introducing an independent pre-processing module would shift the evaluation away from our core architectural innovation and complicate the ablation studies. Furthermore, through preliminary experiments, we observed that incorporating this technique could lead to the following issues:
1) Contrast enhancement is a "non-selective" process that amplifies both valid image information and noise. When dark areas are brightened, originally invisible sensor noise may become amplified. The inpainting algorithm might misinterpret this amplified noise as authentic texture, resulting in unnatural grain or patches in the inpainted regions, and even propagating noise to surrounding areas.
2) The essence of the inpainting algorithm is to learn the data distribution of natural images. Forcibly altering the global contrast of an image effectively shifts it away from this natural distribution. Consequently, the model may generate content that is inconsistent with the lighting and tone of the original image, causing the inpainted "patches" to appear visually incongruous and disrupting the overall harmony.
3) For models operating directly in pixel or feature space, unnatural contrast stretching can alter the activation values of feature maps and introduce misleading high-frequency components. This may bias the model's understanding of the image content, leading to the generation of incorrect structures or textures.
Therefore, while contrast enhancement techniques remain an important practical consideration, they fall outside the scope of this paper. Nevertheless, we have added a discussion on this point in the "Conclusions and Future Work" section of the revised manuscript, identifying it as a potential direction for future research.
Comment 3: The paper reports that the study provides innovative contributions without comparing them to diagnostic imaging methods. How do the authors intend to justify these claims? Furthermore, the authors limited their analysis to RGB images only. How do you plan to extend the method to multimodal scenarios?
Response 3: We thank the reviewer for raising this valuable point regarding comparison with diagnostic imaging methods. We fully recognize medical image inpainting as a significant application domain worthy of thorough investigation. After careful consideration, we have chosen to maintain our current focus on general image inpainting in this work, based on the following rationale:
First, the primary objective of this study is to establish a fundamental framework for semantic-structure aligned inpainting, which requires validation on standardized benchmarks to ensure fair comparison with established methods. Our comprehensive experiments across three challenging datasets (Places2, CelebA, and Paris StreetView), encompassing diverse scenarios from natural landscapes to urban environments, have systematically demonstrated our approach's effectiveness and generalization capability.
Second, diagnostic imaging involves domain-specific constraints and evaluation criteria that substantially differ from natural image inpainting. Medical image restoration typically demands:
- Strict preservation of anatomical consistency
- Specialized evaluation metrics (e.g., diagnostic accuracy retention)
- Domain-specific artifact handling (e.g., metal artifacts in CT scans)
Direct application of our method without substantial architectural adaptation and clinical validation would not yield meaningful comparisons with specialized diagnostic imaging techniques.
Nevertheless, we appreciate the potential applications of our approach in medical imaging. In response to the reviewer's suggestion, we have added a dedicated section in "Future Work" (Section 5.2) discussing:
- Adaptation requirements for medical imaging modalities
- Potential applications in CT/MRI artifact reduction
- Opportunities for collaboration with clinical researchers for domain-specific validation
We propose that establishing our method's robust performance on standard benchmarks provides a necessary foundation before extending it to specialized domains like medical imaging. While medical image inpainting represents an important application direction, this study deliberately maintains its focus on general image restoration for three key reasons: (1) methodological comparisons require standardized benchmarks that are currently lacking in medical imaging literature; (2) clinical applications demand specialized validation beyond architectural innovation; (3) our primary contribution lies in advancing the fundamental understanding of semantic-structure interaction in image restoration.
Comment 4: In regions of high uncertainty, semantic priors risk introducing errors. Have the authors evaluated fuzzy approaches to improve robustness? Recent studies have investigated intuitionistic fuzzy divergence techniques for assessing the mechanical stress state of steel plates subject to bi-axial loads (https://doi.org/10.3233/ICA-230730). The authors should integrate the Methods section to provide a theoretical basis for addressing uncertainty.
Response 4: We sincerely thank the reviewer for this profound and insightful comment. The risk of introducing errors through semantic priors in high-uncertainty regions is indeed a critical issue that requires careful consideration in our method. The recommended research on intuitionistic fuzzy divergence techniques [DOI:10.3233/ICA-230730] provides us with a novel perspective and important theoretical reference for addressing such uncertainties, for which we express our genuine appreciation.
After thorough deliberation, we are inclined not to directly integrate the intuitionistic fuzzy divergence theory into the methodology section within the current research framework, primarily for the following three reasons:
First, from the perspective of research scope, the core contribution of this paper lies in proposing a "structural prior-based multi-scale fusion network architecture", focusing on addressing the adaptive alignment between semantic guidance and texture features during the inpainting process. While we fully acknowledge that introducing fuzzy theory can provide a rigorous mathematical foundation for uncertainty modeling, doing so would shift the research focus of this paper from "validating the effectiveness of the new architecture" to "developing an uncertainty quantification framework," thereby deviating from our initial research objective.
Second, in terms of technical approach, our existing design already incorporates multi-level mechanisms to address uncertainty:
(1) Implicit Weighting at the Loss Function Level: As shown in the loss function formula in Section 3.2, the weighting term in our adopted L1 reconstruction loss is essentially a simple yet effective attention mechanism for high-uncertainty regions (i.e., masked regions); a minimal sketch of this idea is given after this list. Although mathematically simpler than mature fuzzy theory, this "adaptive weighting concept based on spatial location" is philosophically consistent with complex fuzzy weighting: both aim to provide special treatment for uncertain regions.
(2) Cross-validation Through Multi-scale Architecture: The multi-scale fusion mechanism we designed inherently enables cross-validation of semantic prior reliability across different scales. When semantic priors at a certain scale exhibit uncertainty, information from other scales can adaptively provide supplementation and correction, thereby effectively avoiding potential misjudgments that may arise from a single scale. This essentially constitutes a built-in fault-tolerant system.
(3) Selective Trust in Network Structure: Furthermore, the residual module design in our improved algorithm, through its feature selection mechanism, allows the network to adaptively adjust its reliance on information from different sources (such as semantic priors and texture features). This design enables the network to learn to "selectively trust" input information, naturally reducing sensitivity to unreliable priors.
Although these designs differ in theoretical form from fuzzy methods, they share the same functional goal: to systematically reduce the model's dependence on unreliable priors. Together, they form a multi-level solution for addressing uncertainty.
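The sketch referenced in point (1): a minimal, illustrative mask-weighted L1 reconstruction loss. The weight value and tensor names are assumptions for illustration, not the exact formulation of Section 3.2:

```python
import torch

def masked_l1_loss(pred, target, mask, hole_weight=6.0):
    # mask: 1 inside the missing (high-uncertainty) region, 0 on valid pixels.
    # hole_weight is an illustrative value, not the paper's setting.
    hole = (torch.abs(pred - target) * mask).mean()
    valid = (torch.abs(pred - target) * (1.0 - mask)).mean()
    return hole_weight * hole + valid

# Example with random tensors shaped (batch, channels, height, width).
pred = torch.rand(2, 3, 256, 256)
target = torch.rand(2, 3, 256, 256)
mask = (torch.rand(2, 1, 256, 256) > 0.7).float()  # broadcasts over channels
loss = masked_l1_loss(pred, target, mask)
```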
Notwithstanding the above, we highly value the direction suggested by the reviewer. To fully respect and adopt this suggestion, we have added a dedicated paragraph in the "Discussion and Future Work" section of the revised manuscript, systematically elaborating on the potential and implementation pathways of applying intuitionistic fuzzy divergence and other fuzzy set theories to quantify semantic prior uncertainty. We commit to making this a key direction for our subsequent research.
We believe that this approach maintains the focus on the core innovations of this paper while adequately demonstrating our attention to the uncertainty issue, and also points to a clear direction for future research. Once again, we thank the reviewer for this valuable suggestion, which has significantly broadened our research perspective.
Reviewer 2 Report
Comments and Suggestions for Authors
In this work, the authors present a semantic-aware hierarchical network (SAHN) that synergistically integrates multi-scale semantic guidance with structural consistency constraints. As a result, the work advances image restoration research by establishing an effective paradigm for joint semantic-texture reconstruction. The manuscript demonstrates strong technical merit and contributes to the semantic-guided image inpainting literature. The organization of the manuscript is generally acceptable, but the details of the proposed method are not well described. There are also some major problems, as shown below:
- The proposed network, while effective, appears computationally heavy. The manuscript would benefit from an analysis of model complexity (parameter counts, FLOPs, and inference time) to justify its feasibility for deployment. This would also align the contribution with ongoing research trends in lightweight and efficient visual restoration networks.
- The significance of the contribution is good, and the described subject is of current interest to MDPI Journal readers. I feel some references from MDPI journals may be needed.
- Introduction section: Should cite and explain latest 3~5 papers on lightweight architecture (Papers from 2023 to 2025).
Author Response
Comment 1: The proposed network, while effective, appears computationally heavy. The manuscript would benefit from an analysis of model complexity (parameter counts, FLOPs, and inference time) to justify its feasibility for deployment. This would also align the contribution with ongoing research trends in lightweight and efficient visual restoration networks.
Response 1: We have supplemented a comparative analysis between the improved algorithm and the main comparative methods (e.g., LaMa-Fourier, DDNM) in terms of model size (number of parameters in millions) and inference time. For details, please refer to Section 4.5.
We also compare the model scale and inference efficiency of the three approaches: our method employs an encoder-MSF-decoder architecture with 98.7M parameters (395MB storage); LaMa-Fourier utilizes a lightweight FFC architecture with 52.3M parameters (209MB storage); and DDNM relies on pre-trained diffusion models such as Stable Diffusion 1.5, requiring approximately 1.2GB of model data.
Testing on an NVIDIA RTX 4090 platform demonstrated that our method processes a single image in 178 milliseconds, LaMa-Fourier completes in 152 milliseconds, while DDNM requires 682 milliseconds due to its higher algorithmic complexity, significantly exceeding the other two methods.
Comment 2: The significance of the contribution is good, and the described subject is of current interest to MDPI Journal readers. I feel some references from MDPI journals may be needed.
Response 2: Thank you for your valuable suggestion regarding literature citations. We have added three references from MDPI journals [12-14] that are relevant to the theme of this paper. For specific content, please refer to the part marked in blue font in Paragraph 4 of Section 1.
Comment 3: Introduction section: should cite and explain the latest 3-5 papers on lightweight architectures (papers from 2023 to 2025).
Response 3: Thank you for your valuable suggestion regarding literature citations. Regarding the need to supplement literature in related fields, we have added citations to three relevant publications in the specified location (Section 1, Paragraph 4), including the research on lightweight image inpainting published by Wang et al. (2023) in Electronics and the MD-GAN model proposed by Liu et al. (2025) in Applied Sciences. These works focus on the application of lightweight architectures to image inpainting and are closely related to the theme of "computational efficiency optimization of semantic-aware networks" discussed in this paper. The citations have been formatted in accordance with the journal's specifications; for the specific content, please refer to the part marked in blue font in Paragraph 4 of Section 1.
Reviewer 3 Report
Comments and Suggestions for Authors
- Overall Evaluation
This paper presents a semantic-aware hierarchical network (SAHN) integrated with a multi-scale fusion generator for image restoration. The core contributions include a semantic prior mapper, a multi-scale fusion module, and a mask-guided discriminator, forming a coherent pipeline from semantic understanding to texture reconstruction. The manuscript is well-structured, and the experimental design is solid, utilizing three diverse datasets (Paris StreetView, CelebA, Places2) and comprehensive metrics to validate the method's effectiveness. The work demonstrates clear innovation and practical value. However, the manuscript requires minor revisions to address issues related to language clarity, missing methodological details, and insufficient depth in experimental analysis. Once these points are improved, the paper will be suitable for publication.
- Main Strengths
1. The proposed semantic prior mapper and multi-scale fusion generator effectively integrate high-level semantic guidance with low-level texture reconstruction, directly addressing common issues like semantic-texture disconnection.
2. Comprehensive experiments on multiple datasets and comparisons with both classical and recent state-of-the-art methods (e.g., EC, CTSDG, LaMa-Fourier, DDNM) provide strong, quantitative evidence for the method's performance. Ablation studies further validate key components.
3. The paper follows a standard and logical academic structure, from problem definition to conclusion, making it easy to follow.
- Specific Revision Suggestions
3.1 Language and Presentation
1. Grammatical Corrections and Terminology Unification: Correct grammatical errors and unify terminology. For instance, "kip-connected" should be "skip-connected". The phrase "L1 reconstruction loss is a constraint for..." is grammatically incomplete; rephrase it to "We use L1 reconstruction loss to constrain...". Unify the term for input images (e.g., use "damaged image" consistently instead of alternating with "broken image").
2. Consistency in Figure and Formula Citations: Figure references in the text (e.g., to Figures 1-4) are inconsistent with the figures provided (Figures 5-8). Supply the missing figures or renumber them sequentially according to their appearance in the text. Explain the functional role of Equation (7) in subsequent modules. Correct the formatting of subscripts (ω1 ~ ω4) in Equation (14).
3.2 Supplementary Details of the Method
1. Specification of Parameters for Pre-trained Model P: Provide detailed specifications for the pre-trained multi-label classification model P (trained on OpenImage with ASL loss). This includes: the backbone architecture, input image size (and whether it matches the 256x256 used in restoration), the channel number and spatial resolution of its output feature maps, and whether its weights are frozen.
2. Explanation of Differences Between the MSF Module and SPADE: Clarify the specific differences between the proposed Multi-Scale Fusion (MSF) module and the original SPADE module, such as modifications to the normalization layer modulation, the number of residual blocks, or the use of deformable convolutions.
3. Weight Values and Parameter Tuning Strategy for the Loss Function: Provide the specific weight values (e.g., ω1=1.0, ω2=0.1) for the loss components in Equation (14) (reconstruction, perceptual, adversarial, semantic prior). Briefly describe the rationale for these choices (e.g., empirical tuning, grid search).
3.3 Improvement of Experimental Design and Analysis
1. In-depth Comparison with State-of-the-Art Methods: Provide a more detailed comparative analysis against state-of-the-art methods like LaMa-Fourier and DDNM. For example, discuss specific advantages of your method, such as: "LaMa-Fourier may exhibit repetitive textures in large masked regions, which our semantic prior mapper helps to avoid," or "Our method offers an inference time of X ms/image compared to Y ms/image for DDNM, demonstrating superior efficiency."
2. Detailed Annotation of Qualitative Results: Enhance Figures 6-8 (Qualitative Comparison) by adding red boxes to highlight key regions where your method demonstrates superior restoration (e.g., jewelry details in CelebA, text in Paris StreetView). Include descriptive captions explaining these improvements (e.g., "Red box: Our method reconstructs sharper earring textures compared to the distorted output from CTSDG").
3. Analysis of Computational Efficiency: Report model size (number of parameters in Millions) and inference time (in milliseconds for a 256x256 image) for the proposed method and key competitors (e.g., LaMa-Fourier, DDNM). Specify the testing environment (e.g., NVIDIA RTX 3090 GPU).
Author Response
Comment 1: Grammatical Corrections and Terminology Unification: Correct grammatical errors and unify terminology. For instance, "kip-connected" should be "skip-connected". The phrase "L1 reconstruction loss is a constraint for..." is grammatically incomplete; rephrase it to "We use L1 reconstruction loss to constrain...". Unify the term for input images (e.g., use "damaged image" consistently instead of alternating with "broken image").
Response 1: Thank you for your valuable and constructive suggestions! We have comprehensively corrected the grammatical errors you pointed out and unified the terminology. The relevant revisions are marked in blue font in the main text for easy reference. Should you have any further comments on the revisions, please feel free to let us know at any time.
Comment 2: Consistency in Figure and Formula Citations: Figure references in the text (e.g., to Figures 1-4) are inconsistent with the figures provided (Figures 5-8). Supply the missing figures or renumber them sequentially according to their appearance in the text. Explain the functional role of Equation (7) in subsequent modules. Correct the formatting of subscripts (ω1 ~ ω4) in Equation (14).
Response 2: Thank you for your valuable and constructive suggestions! (1) We have added the relevant figures and standardized the formatting of the subscripts (ω₁~ω₄) in Equation (14); both the added figures and the revised subscripts are marked in blue for quick location and review. (2) We have provided an explanation of the equation's function below Equation (7), also marked in blue font. The explanation is as follows:
This carefully designed loss function ensures alignment between the semantic features F_pyr learned by the network from the damaged image and the desired ideal features F_goal. Through this focused, multi-scale, mask-aware supervision, the model is compelled to learn to extract, from any damaged image, a reliable multi-scale semantic prior F_prior that maintains high structural consistency with the intact image. This high-quality F_prior subsequently provides powerful and accurate structural guidance for the image inpainting process, such as steering texture generation via the MSF module.
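For intuition, the following is a minimal sketch of such multi-scale, mask-aware feature alignment; the names mirror the notation above, but this is an illustrative reading of Equation (7), not its exact form:

```python
import torch
import torch.nn.functional as F

def prior_alignment_loss(f_pyr, f_goal, mask):
    # f_pyr: per-scale features predicted from the damaged image.
    # f_goal: per-scale target features extracted from the intact image.
    # mask: single-channel map, 1 in the damaged region, 0 elsewhere.
    loss = 0.0
    for pred, goal in zip(f_pyr, f_goal):
        m = F.interpolate(mask, size=pred.shape[-2:], mode="nearest")
        loss = loss + (torch.abs(pred - goal) * m).mean()  # focus on damaged area
    return loss

# Example: a three-scale pyramid of random features.
f_pyr = [torch.rand(1, 64, s, s) for s in (64, 32, 16)]
f_goal = [torch.rand(1, 64, s, s) for s in (64, 32, 16)]
mask = (torch.rand(1, 1, 256, 256) > 0.7).float()
loss = prior_alignment_loss(f_pyr, f_goal, mask)
```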
Comment 3:Specification of Parameters for Pre-trained Model P: Provide detailed specifications for the pre-trained multi-label classification model P (trained on OpenImage with ASL loss). This includes: the backbone architecture, input image size (and whether it matches the 256x256 used in restoration), the channel number and spatial resolution of its output feature maps, and whether its weights are frozen.
Response 3: As detailed in lines 163–189, the configuration and usage of the pre-trained model are described as follows:
(1) Backbone Architecture: A pre-trained ResNet encoder (e.g., ResNet-50) is employed as the downsampling path (encoder) within the U-Net architecture.
(2) Input Size Compatibility: Although the original input size of the pre-trained model does not exactly match the 256×256 resolution used in our inpainting task, the ResNet architecture is inherently adaptable and can effectively process input images of varying dimensions.
(3) Feature Map Scaling: Throughout the U-Net encoder, the spatial resolution of the feature maps progressively decreases due to downsampling operations. When the input size is 256×256, there are differences between the scales of the intermediate feature maps (e.g., 128×128, 64×64) and those of the feature maps generated with the 224×224 input during pre-training (112×112, 56×56). To address this, we added an Adaptive Average Pooling layer to the last layer of the ResNet encoder, which uniformly maps the intermediate feature maps of different scales to a spatial size of 8×8. This ensures the dimensional consistency of the final output feature map (8×8, 2048 channels), thereby meeting the input requirements of the subsequent MSF module.
(4) Weight Update Strategy: In this study, the weights of this pre-trained model remain frozen throughout the network training process and are not updated. The model was originally pre-trained on the OpenImage dataset using an Asymmetric Loss (ASL) function and serves as a fixed feature extractor in our framework.
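A minimal sketch of this frozen-extractor setup, using torchvision's ResNet-50 as a stand-in (the actual encoder carries OpenImage/ASL pre-trained weights, which are not reproduced here):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Stand-in backbone; the paper's encoder uses OpenImage/ASL weights instead.
backbone = resnet50(weights=None)
encoder = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool and fc

# Freeze the extractor: its weights are never updated during training.
for p in encoder.parameters():
    p.requires_grad = False
encoder.eval()

# Adaptive average pooling maps any intermediate spatial size to a fixed 8x8.
pool = nn.AdaptiveAvgPool2d((8, 8))

x = torch.rand(1, 3, 256, 256)  # inpainting resolution
feat = pool(encoder(x))         # -> (1, 2048, 8, 8), matching the MSF input
```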
Comment 4:Explanation of Differences Between the MSF Module and SPADE: Clarify the specific differences between the proposed Multi-Scale Fusion (MSF) module and the original SPADE module, such as modifications to the normalization layer modulation, the number of residual blocks, or the use of deformable convolutions.
Response: Thank you very much for your constructive suggestions. We have supplemented a comparative analysis between the MSF module and SPADE, with details provided in the blue-font section of Section 3.2. The details are as follows:
The core innovations and improvements of the MSF module over the original SPADE can be summarized into the following four points:
(1) Multi-scale Feature Fusion Architecture: The input of the MSF module consists of two parts: the texture features extracted by the encoder (derived from the downsampling path of the U-Net), and the multi-scale semantic structural prior (obtained from the semantic prior mapper designed in this paper, rather than the single-scale semantic input of SPADE). The output is the enhanced feature that integrates multi-scale information, upsampled to a unified scale.
(2) Dynamic Computational Allocation Mechanism: The MSF module dynamically allocates different numbers of SPADE residual blocks for semantic priors at different input scales.
(3) Integration of Deformable Convolution: Deformable convolution is incorporated to enhance the model's ability to model geometric transformations, replacing the standard convolution used in SPADE.
(4) Retained SPADE Modulation Core: The MSF module retains and directly employs the core normalization and feature-modulation algorithm of the original SPADE. A schematic sketch of these points follows below.
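To make points (1)-(4) concrete, here is a schematic sketch of a SPADE-style modulation block whose prior branch uses deformable convolution; channel sizes and the exact layer layout are illustrative assumptions, not the paper's MSF implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d

class SpadeDeformBlock(nn.Module):
    def __init__(self, feat_ch, prior_ch, hidden=128):
        super().__init__()
        self.norm = nn.BatchNorm2d(feat_ch, affine=False)    # parameter-free, as in SPADE
        self.offset = nn.Conv2d(prior_ch, 18, 3, padding=1)  # 2*3*3 offsets for a 3x3 kernel
        self.shared = DeformConv2d(prior_ch, hidden, 3, padding=1)
        self.gamma = nn.Conv2d(hidden, feat_ch, 3, padding=1)
        self.beta = nn.Conv2d(hidden, feat_ch, 3, padding=1)

    def forward(self, feat, prior):
        # Resize the semantic prior to the texture feature's spatial size.
        prior = F.interpolate(prior, size=feat.shape[-2:], mode="nearest")
        h = F.relu(self.shared(prior, self.offset(prior)))
        # SPADE modulation: normalize, then scale and shift per pixel.
        return self.norm(feat) * (1 + self.gamma(h)) + self.beta(h)

# One block per scale; an MSF-style module would stack several of these.
block = SpadeDeformBlock(feat_ch=64, prior_ch=32)
out = block(torch.rand(2, 64, 64, 64), torch.rand(2, 32, 16, 16))
```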
Comment 5: Weight Values and Parameter Tuning Strategy for the Loss Function: Provide the specific weight values (e.g., ω1=1.0, ω2=0.1) for the loss components in Equation (14) (reconstruction, perceptual, adversarial, semantic prior). Briefly describe the rationale for these choices (e.g., empirical tuning, grid search).
Response: Thank you for this question. We selected the weights through empirical tuning.
Regarding the weights of each loss component in Equation (14), we determined the specific values through empirical tuning as follows: ω₁ (reconstruction loss) = 1.0, ω₂ (perceptual loss) = 0.3, ω₃ (adversarial loss) = 0.1, and ω₄ (semantic prior loss) = 0.5. The rationale for these choices is as follows (a short code rendering of the combined objective is given after the list):
- We prioritize ensuring reconstruction accuracy, so ω₁ is set to a baseline value of 1.0;
- Next, we use the perceptual loss to improve visual quality, hence ω₂ is set to 0.3;
- Meanwhile, we control the weight of the adversarial loss to avoid mode collapse, which leads to ω₃ = 0.1;
- For the semantic prior loss, it is necessary to balance structural consistency, so ω₄ is set to 0.5.
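Combined, the weighted objective of Equation (14) takes the following form in code; the individual loss terms are placeholders for the definitions given in the paper:

```python
# Weights of Equation (14), chosen by empirical tuning as described above.
w1, w2, w3, w4 = 1.0, 0.3, 0.1, 0.5

def total_loss(l_rec, l_perc, l_adv, l_prior):
    # Each argument is a scalar tensor computed by its respective loss module.
    return w1 * l_rec + w2 * l_perc + w3 * l_adv + w4 * l_prior
```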
Comment 6: In-depth Comparison with State-of-the-Art Methods: Provide a more detailed comparative analysis against state-of-the-art methods like LaMa-Fourier and DDNM. For example, discuss specific advantages of your method, such as: "LaMa-Fourier may exhibit repetitive textures in large masked regions, which our semantic prior mapper helps to avoid," or "Our method offers an inference time of X ms/image compared to Y ms/image for DDNM, demonstrating superior efficiency."
Response: Thank you for your advice. We have supplemented more in-depth comparisons and discussions in Section 4.2, and the relevant content has been marked in blue font. The content is listed as follows:
Experimental results in Tables 1-3 demonstrate that the proposed method places greater emphasis on visual quality and overall perceptual effect, ultimately achieving the best Fréchet Inception Distance (FID) score. A detailed comparative analysis is as follows: the SPL/SPN methods excel at pixel-level reconstruction, reflected in their significant advantage on the Mean Absolute Error (MAE) metric; the DDNM algorithm exhibits stable restoration performance under high mask-ratio scenarios; and the LaMa-Fourier method tends to suffer from texture repetition in large masked regions, a phenomenon that the semantic prior mapper of our proposed method effectively mitigates. Furthermore, in terms of inference efficiency, the proposed method achieves a 3.9× speedup over the DDNM algorithm and reduces memory usage by 50%, making it well suited for practical deployment scenarios.
Comment 7: Detailed Annotation of Qualitative Results: Enhance Figures 6-8 (Qualitative Comparison) by adding red boxes to highlight key regions where your method demonstrates superior restoration (e.g., jewelry details in CelebA, text in Paris StreetView). Include descriptive captions explaining these improvements (e.g., "Red box: Our method reconstructs sharper earring textures compared to the distorted output from CTSDG").
Response: We have marked key regions with red boxes to highlight the key information in the qualitative results. For details, please refer to Figures 6-8.
Comment 8: Analysis of Computational Efficiency: Report model size (number of parameters in Millions) and inference time (in milliseconds for a 256x256 image) for the proposed method and key competitors (e.g., LaMa-Fourier, DDNM). Specify the testing environment (e.g., NVIDIA RTX 3090 GPU).
Response: Thank you very much for your valuable suggestions on computational efficiency analysis. The relevant analysis is as follows:
This study compares the model scale and inference efficiency of the three approaches: our method employs an encoder-MSF-decoder architecture with 98.7M parameters (395MB storage); LaMa-Fourier utilizes a lightweight FFC architecture with 52.3M parameters (209MB storage); and DDNM relies on pre-trained diffusion models such as Stable Diffusion 1.5, requiring approximately 1.2GB of model data.
Testing on an NVIDIA RTX 4090 platform demonstrated that our method processes a single image in 178 milliseconds, LaMa-Fourier completes in 152 milliseconds, while DDNM requires 682 milliseconds due to its higher algorithmic complexity, significantly exceeding the other two methods.
Thank you again for your careful review and professional suggestions. We have made every effort to improve the paper, and we hope the revised version meets the requirements of the journal.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The authors have answered all my questions.
Author Response
We sincerely thank you for your careful review and confirmation that all questions have been addressed. Your valuable time, insightful comments, and constructive suggestions have greatly contributed to the improvement of our manuscript. We are grateful for your support throughout the review process.
Reviewer 2 Report
Comments and Suggestions for Authors
- This manuscript looks quite interesting; it just needs a few revisions, after which I think it can be accepted.
- The abstract of this paper needs to include objective numerical results at the end in order to highlight its contribution.
- How were the matrix coefficients in Figure 4 obtained? The corresponding rationale and theoretical basis need to be explained.
- It is recommended that the author, in the Introduction section, briefly mention some related works in the direction of "semantic priors + hierarchical networks + multi-scale fusion generators" to highlight the advantages of this manuscript over traditional techniques.
Author Response
Comment 1: This manuscript looks quite interesting; it just needs a few revisions, after which I think it can be accepted.
Response 1: We are very grateful to the reviewer for their positive and encouraging feedback on our manuscript. We have carefully considered all the suggestions and have revised the manuscript accordingly. The point-by-point responses to the specific comments are detailed below.
Comment 2: The abstract of this paper needs to include objective numerical results at the end in order to highlight its contribution.
Response 2: We sincerely thank the reviewer for this excellent suggestion. We fully agree that incorporating objective numerical results in the abstract will more effectively highlight the contribution and performance of our work. In the revised version of the abstract, we have added the key quantitative results at the end. The added sentence is:
Compared to existing methods, our algorithm attains the highest overall PSNR of 34.99 together with the best visual authenticity (the lowest FID of 11.56). Comprehensive evaluations on three datasets demonstrate its leading performance in restoring visual realism.
Comment 3: How were the coefficients as matrix in Figure 4 obtained? The corresponding rationale and theoretical basis need to be explained.
Response 3: We thank the reviewer for raising this important point. The matrix coefficients in Figure 4 are not pre-designed; they are produced automatically by the discriminator during the adversarial training process.
Derivation of the Matrix Coefficients: The matrix originates from the output of the discriminator in our generative adversarial network (GAN) architecture. The discriminator takes as input the ground-truth image (the real complete image) and the inpainted image (the generated repaired result), concatenated together. It consists of four convolutional layers, each performing 2× downsampling (halving the feature-map size) to extract multi-scale features. Finally, it outputs a prediction map in which each pixel value represents the authenticity of the corresponding image patch:
For the ground-truth image (upper matrix), all pixel values are 1, indicating the discriminator recognizes these regions as completely real.
For the inpainted image (lower matrix), pixels outside the original broken area inherit values of 1 (since they come from the real image), while pixels inside the original broken area (generated by the generator) have values less than 1 (e.g., 0.4, 0.7, 0.6, 0.3), reflecting the discriminator’s assessment of the generated pixels’ “realism gap” compared to real pixels.
In summary, the matrix coefficients are the discriminator’s quantitative output of patch authenticity, derived from feature extraction and GAN-driven adversarial learning, with theoretical underpinnings in deep learning for image authenticity discrimination and multi-scale feature analysis.
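A minimal sketch of such a four-layer, 2×-downsampling patch discriminator (a single 3-channel input is used for simplicity; channel widths are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Four stride-2 convolutions, each halving the spatial size; the final layer
# emits one value per patch, interpreted as that patch's realness in [0, 1].
discriminator = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(256, 512, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(512, 1, 3, padding=1), nn.Sigmoid(),
)

scores = discriminator(torch.rand(1, 3, 256, 256))  # -> (1, 1, 16, 16) patch map
```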
Comment 4: It is recommended that the author, in the Introduction section, briefly mention some related works in the direction of “semantic priors + hierarchical networks + multi-scale fusion generators” to highlight the advantages of this manuscript over traditional techniques.
Response 4: We appreciate this valuable suggestion. We agree that situating our work within this specific research direction will better contextualize our contributions. We have revised the Introduction to add a new paragraph (We have added relevant explanations in the 5th and 6th paragraphs of the introduction section in Chapter 1, which are marked in blue font.) that discusses these relevant approaches. Specifically, we now cite and briefly analyze the following representative works:
Nazeri [40] integrates structural priors with a hierarchical network, employing a dual-generator architecture for multi-scale feature fusion. However, the edge priors it relies on remain at a low structural level and lack understanding of high-level semantics, leading to insufficient semantic consistency in complex scenes and a relatively rigid multi-scale fusion mechanism. Liu [41] combines a mutual encoder-decoder with a multi-scale generator and utilizes feature equalization to enhance detail restoration. Nevertheless, this method does not incorporate external semantic priors for guidance and relies solely on data-driven internal features, making it prone to structural distortion or logical errors when reconstructing regions with strong semantic information such as human faces or architectural structures. Xiong [42] integrates semantic segmentation priors, hierarchical networks, and multi-scale fusion. However, its network structure is complex, requiring additional branches and loss functions to distinguish between foreground and background. This results in a model with a large parameter size and low inference efficiency, making it difficult to adapt to practical scenarios with high-resolution images and high masking ratios.
To address these issues, we propose a multi-scale semantic-driven inpainting method. It leverages semantic priors to guide the inpainting process, thereby enhancing semantic coherence in the generated results. A hierarchical, progressive network architecture refines the output in a coarse-to-fine manner, enabling effective handling of large missing regions. Furthermore, a multi-scale fusion generator aggregates features from different levels, simultaneously improving both the structural integrity and the textural details of the restored image.
Once again, we would like to express our sincere gratitude to the reviewer for their insightful comments, which have significantly helped us improve the quality and clarity of our manuscript. We hope that our revisions and responses have adequately addressed all the points raised, and we believe the manuscript is now suitable for acceptance.