Article

MD-GAN: Multi-Scale Diversity GAN for Large Masks Inpainting

Shibin Wang, Xuening Guo and Wenjie Guo
1 Key Laboratory of “Educational Artificial Intelligence and Personalized Learning” of Henan Province, Computer and Information Engineering College, Xinxiang 453007, China
2 Information Management Department, Henan Normal University, Xinxiang 453007, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(11), 2218; https://doi.org/10.3390/electronics14112218
Submission received: 10 April 2025 / Revised: 26 May 2025 / Accepted: 28 May 2025 / Published: 29 May 2025

Abstract

Image inpainting approaches have made considerable progress recently with the assistance of generative adversarial networks (GANs). However, current inpainting methods struggle with large masks and often produce structurally implausible results. We find that the main reason is the lack of an effective receptive field in the inpainting network. To alleviate this issue, we propose a new two-stage inpainting model called MD-GAN, a multi-scale diversity GAN. We inject dense combinations of dilated convolutions at multiple scales of the inpainting network to obtain larger effective receptive fields. Moreover, the result of inpainting a large mask is generally not uniquely determined. To this end, we propose the multi-scale probability diversity (MPD) module, which achieves diverse content generation through spatially-adaptive normalization. Meanwhile, the convolutional block attention module is introduced to improve the ability to extract complex features, and a perceptual diversity loss is added to enhance diversity. Extensive experiments on benchmark datasets including CelebA-HQ, Places2 and Paris Street View demonstrate that our approach effectively inpaints diverse and structurally reasonable images.

1. Introduction

Image inpainting aims to recover the missing areas of damaged images with semantically reasonable and structurally complete contents [1]. It has great research significance and wide application value in image editing, CT medical image completion, ancient typeface restoration and other scenarios where people want to remove unwanted objects from images [2,3].
Recent image inpainting methods primarily utilize GANs to learn deep features and semantic information of images [4]. For example, Pathak et al. [5] employ a convolutional encoder–decoder network, and Zeng et al. [6] propose a pyramid context encoder network optimized with a pyramid-style attention mechanism. These methods perform well with small masks. However, when the missing areas become large, the inpainting results suffer from structural distortions. In this paper, an area with a mask ratio of 30–40% is defined as a large missing region.
We argue that there are two essential factors for large-mask inpainting. One is the effective receptive field: large missing areas cause the network to extract too much blank information, and increasing the effective receptive field alleviates this problem and aids understanding of the global structure. The other is diverse results: a damaged image with a large mask admits many possible completions, which requires the network to be able to generate multiple plausible and reasonable results. Zheng et al. [7] first propose diverse image inpainting. Liu et al. [8] enhance the diversity of the network by leveraging prior knowledge, generating diverse images from different noise inputs. When dealing with large missing areas, although these methods produce different probable results, the outputs deviate noticeably from the ground-truth images. The main problem is that the networks lack a sufficiently large effective receptive field.
In this paper, we present an image inpainting model, MD-GAN, a two-stage network that incorporates dilated convolutions [9] and spatially-adaptive normalization (SPADE) [10]. The model is composed of a coarse network and a refinement network. The coarse network employs a partial convolutional encoder–decoder structure to produce a coarse prediction, which serves as a global structural prior for the refinement network. Guided by this prior, the refinement network generates varied details and outputs the final fine restoration result. The refinement network modulates the input random noise with a specially designed multi-scale probability diversity (MPD) module, which refines the deep features of the noise in a coarse-to-fine fashion by integrating the coarse prediction and the mask, enabling the network to generate multiple distinct images. Furthermore, the model integrates dense combinations of dilated convolutions to achieve larger effective receptive fields, which allows it to effectively repair images with large masked regions.
To further enhance MD-GAN’s feature map recovery capability, we incorporate the convolutional block attention module (CBAM) [11]. CBAM sequentially integrates a channel attention module and a spatial attention module, thereby enhancing the network’s representational power in both channel and spatial dimensions. Additionally, we refine the perceptual diversity loss to boost the network’s diversity generation capability. The experimental results demonstrate that these improvements significantly enhance the performance of the network.
The main contributions of our work are as follows:
(1)
We propose a novel two-stage diversity inpainting network for large-mask image inpainting, built on dilated convolutions and SPADE.
(2)
We design an MPD module as the basic block of our refinement inpainting network. It expands the effective receptive field at multiple scales to ensure that the diverse generations are structurally complete.
(3)
We incorporate the convolutional block attention module and perceptual diversity loss into the network to optimize the generated content and diversity while keeping the number of parameters acceptable.

2. Related Work

2.1. Image Inpainting

Current image inpainting methods can be divided into early traditional methods and deep learning-based methods. Early traditional methods are either diffusion-based [12] or patch-based [13]. Diffusion-based methods propagate adjacent information into the missing regions using PDEs or variational methods. Patch-based methods copy similar information from the background area to fill in the missing regions. Although traditional methods can synthesize textures, they are unable to learn deep semantic features of the image.
With the development of deep learning and generative adversarial networks, GAN-based deep generative methods have been widely applied to image inpainting tasks [14]. Radford et al. [15] propose the deep convolutional generative adversarial network, which integrates generative adversarial networks with convolutional neural networks. Yu et al. [16] propose a contextual attention layer that uses GANs to mimic traditional patch-based methods; this approach performs well on small masks but struggles with large irregular masks. Liu et al. [17] propose a partial convolution operation that updates the mask and the image simultaneously, enabling inpainting of irregularly masked images. Yu and Koltun [9] propose dilated convolutions, which expand the receptive field of the convolutional kernel through different dilation rates, allowing the network to capture more distant contextual information and achieve better inpainting results with large masks. Iizuka et al. [18] employ stacked dilated convolutions to obtain greater spatial support and achieve realistic results with globally and locally consistent adversarial training; however, stacked dilated convolutions lead to grid-pattern artifacts. Hui et al. [19] propose the dense multi-scale fusion block (DMFB) built on dilated convolutions, which alleviates the artifact issue while enlarging the receptive field.

2.2. Diverse Image Inpainting Networks

Diverse image inpainting can generate different plausible results for the same damaged image. Standard inpainting methods have demonstrated good restoration capability, but they cannot generate a variety of semantically meaningful results. When the missing regions are large or the image textures are complex, different inpainting experts may fill in different details; therefore, inpainting methods capable of producing multiple reasonable results have emerged. Zheng et al. [7] first propose diverse image inpainting, achieving pluralistic generation through a coupled design of two parallel GAN paths. Liu et al. [8] enhance the diversity of the network's output by strategically leveraging prior knowledge and incorporating a perceptual diversity loss into training. These methods can generate diverse inpainting results, but often suffer from artifacts and unreasonable structures. To address these issues, this paper designs the MD-GAN network, which achieves diverse content generation while producing high-quality completion results.

3. Approach

In this section, we present the proposed inpainting architecture, as shown in Figure 1. The MD-GAN is a two-stage network designed for enhanced stability. It consists of a coarse network and a refinement network. The coarse network is built upon the U-Net framework [20], while the refinement network utilizes the generator structure of Vanilla GAN [21]. The output of the coarse network serves as prior information for the refinement network. Guided by this prior information, the refinement network modulates different latent vectors z, thereby achieving diverse image inpainting results.

3.1. The Coarse Network

The coarse network is responsible for generating a rough repair result. It adopts a partial convolutional encoder–decoder structure, i.e., a U-Net built from partial convolutions and skip connections, in which the standard convolutional layers are replaced by partial convolutional layers and the ReLU activations in the decoder are replaced by LeakyReLU. The skip connections pass the features and masks from the encoder to the decoder, where they act as the input features and masks of the next partial convolutional layer. Let $I_{gt}$ denote the ground-truth image and $M$ the mask, whose occluded pixels are 0. The masked image is $I_{mask} = I_{gt} \odot M$. The four-channel input concatenates the masked image and the mask, $I_{in} = \mathrm{stack}(I_{mask}, M)$, and is fed into the coarse network, which produces the coarse result $P$ after passing through an encoder and a decoder. The encoder consists of eight partial convolutional layers; the decoder comprises eight nearest-neighbor upsampling layers with a scale factor of two and eight partial convolutional layers. The coarse prediction $P$, together with the mask $M$, is passed to the MPD modules of the refinement network as prior knowledge.
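To make the data flow concrete, the following is a minimal PyTorch sketch of the four-channel input construction and a simplified partial convolution layer. It follows the general partial-convolution idea of Liu et al. [17]; the layer configuration (7 × 7 kernel, stride 2, no bias) and the simplified renormalization are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Conv2d):
    """Simplified partial convolution (bias handling omitted for brevity):
    features are convolved only where the mask is valid, renormalized by the
    number of valid pixels under each window, and the mask is updated so that
    any window touching a valid pixel becomes valid for the next layer."""
    def forward(self, x, mask):
        # mask: (B, 1, H, W), 1 = known pixel, 0 = hole
        with torch.no_grad():
            window = torch.ones(1, 1, *self.kernel_size, device=x.device)
            valid = F.conv2d(mask, window, stride=self.stride, padding=self.padding)
        out = super().forward(x * mask)                     # convolve known pixels only
        out = out * (window.numel() / valid.clamp(min=1))   # renormalize by valid count
        new_mask = (valid > 0).float()
        return out * new_mask, new_mask

# Four-channel input of the coarse network: masked image concatenated with the mask.
I_gt = torch.rand(1, 3, 256, 256)                # toy ground-truth image
M = (torch.rand(1, 1, 256, 256) > 0.4).float()   # 1 = known, 0 = missing
I_mask = I_gt * M
I_in = torch.cat([I_mask, M], dim=1)             # shape (1, 4, 256, 256)

pconv = PartialConv2d(4, 64, kernel_size=7, stride=2, padding=3, bias=False)
features, mask_out = pconv(I_in, M)              # first encoder stage (illustrative)
```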

3.2. The Refinement Network

The refinement network adopts a generator structure similar to Vanilla GAN, with the proposed MPD layers replacing the standard convolutional layers. As shown in Figure 1, a random variable z of size 256 serves as the input to the refinement network. After passing through a fully connected layer, six MPD modules and an activation function, it outputs the refined result Iout. Additionally, an attention layer is applied after the fourth MPD module.

3.2.1. MPD Module

Image inpainting under large masks requires results that are both structurally reasonable and diverse. The SPADE module can modulate random vectors with semantic information to generate diverse outputs, but it lacks a large receptive field for capturing global texture and structural features. Complementarily, the DMFB module can expand the receptive field. We therefore design the MPD module by integrating the DMFB module with an improved SPADE module. After receiving the feature maps from the preceding layer, the MPD module first passes them to the improved SPADE module, which uses the prior information P to guide the diverse generation of the image. The result is then passed to the DMFB module to capture the overall features. The MPD module is shown in Figure 2.
We improve SPADE following an idea similar to PDGAN: patterns closer to the edge of the masked area are more strongly correlated with the surrounding known content. Therefore, when generating the image within the mask, we want the content near the mask boundary to be more constrained, while content closer to the center is given more freedom. To this end, a probabilistic diversity map is incorporated into the SPADE module to control how strongly the prior information P modulates the input Fin; it acts jointly with the two modulation variables γ and β produced by the convolutional layers. The improved SPADE module consists of hard SPDNE and soft SPDNE, controlled by the probabilistic diversity maps Dh and Ds, respectively. The structure is shown in Figure 2.
The hard SPDNE module computes its probability map Dh based on the distance to the edge of the mask M. The iterative calculation can be expressed as
$$M_i(x, y) = \begin{cases} 1 & \text{if } \sum_{(a,b) \in N(x,y)} M_{i-1}(a, b) > 0 \\ 0 & \text{otherwise} \end{cases}$$
where N(x, y) denotes the 3 × 3 neighborhood centered at position (x, y), and $M_i$ is the map obtained after the i-th iteration, with known pixels assigned the value 1. With each iteration, the unknown region shrinks inward ($M_{i-1} \subseteq M_i$), and the ring newly covered at iteration i is assigned the value $1/k^i$ in $D_h$. Thus, the probability approaches 0 toward the center of the mask.
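As an illustration of this iterative construction, here is a small PyTorch sketch of the hard probability map. The number of iterations and the base k are not specified in the text, so the values below are placeholders only.

```python
import torch
import torch.nn.functional as F

def hard_probability_map(mask, iterations=8, k=2.0):
    """Sketch of D_h: dilate the known region by a 3x3 ring per iteration and
    assign the pixels first reached at iteration i the value 1 / k**i, so the
    probability decays toward the center of the hole (known pixels keep 1)."""
    kernel = torch.ones(1, 1, 3, 3, device=mask.device)
    d_h, prev = mask.clone(), mask
    for i in range(1, iterations + 1):
        cur = (F.conv2d(prev, kernel, padding=1) > 0).float()  # M_i from M_{i-1}
        d_h = d_h + (cur - prev) / (k ** i)                    # newly covered ring
        prev = cur
    return d_h

mask = torch.zeros(1, 1, 64, 64)
mask[:, :, :, :16] = 1.0          # toy mask: left strip known, rest missing
D_h = hard_probability_map(mask)
```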
Dh depends only on the mask and ignores the overall structural features. To address this, soft SPDNE is used as a complement: it dynamically learns the global structural features of the prior information and stabilizes training. The probabilistic diversity map Ds of soft SPDNE is obtained by extracting features from Fin and the prior information P with a trainable network, which can be represented as
$$D_s = \sigma\left(\mathrm{Conv}\left(\left[F_p, F_{in}\right]\right)\right) \odot (1 - M) + M$$
where $F_p$ denotes the features extracted from the prior information P and $\sigma(\cdot)$ is the sigmoid activation.
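A compact sketch of this soft map is given below; the channel widths of $F_{in}$ and the prior features $F_p$ are not stated here, so they are illustrative parameters.

```python
import torch
import torch.nn as nn

class SoftProbabilityMap(nn.Module):
    """Sketch of D_s = sigmoid(Conv([F_p, F_in])) * (1 - M) + M:
    a learned map over the hole region, with known pixels fixed to 1."""
    def __init__(self, in_ch, prior_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + prior_ch, 1, kernel_size=3, padding=1)

    def forward(self, f_in, f_p, mask):
        d_s = torch.sigmoid(self.conv(torch.cat([f_p, f_in], dim=1)))
        return d_s * (1.0 - mask) + mask   # mask: 1 = known, 0 = hole

soft = SoftProbabilityMap(in_ch=64, prior_ch=3)
D_s = soft(torch.rand(1, 64, 64, 64), torch.rand(1, 3, 64, 64), torch.zeros(1, 1, 64, 64))
```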
After passing through the improved SPADE module, the generated results are more diverse. However, control over global information is still insufficient, which leads to unreasonable structures and texture details. Addressing this requires a sufficiently large effective receptive field to capture enough overall image information. Dilated convolutions can expand the receptive field, but their sparse kernels skip many effective pixels during computation, while large dense kernels [22] introduce a large number of parameters and slow down training. To expand the effective receptive field without introducing many parameters, the DMFB module is integrated with the improved SPADE module to form the MPD module. The structure of the DMFB module is shown in Figure 3.
Dilated convolution increases the effective size of the kernel by inserting a fixed number of spaces between kernel elements, so the kernel covers a larger input area and captures a wider range of contextual information. Compared with ordinary convolution, it adds one key parameter, the dilation rate, which determines how many zeros are inserted: a dilation rate of 1 is identical to ordinary convolution, while a dilation rate of 2 separates adjacent kernel elements by one zero. When dilated convolutions with different dilation rates are used in combination, more pixels can be covered with the same number of parameters. Therefore, the network uses four dilated convolutions with different dilation rates and combines them into a dense multi-scale fusion block (DMFB), as illustrated by the short receptive-field calculation below.
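For reference, the effective kernel size of a dilated convolution is $k_{eff} = k + (k - 1)(d - 1)$. The small calculation below assumes 3 × 3 kernels and dilation rates 1, 2, 4 and 8 as illustrative values; the exact rates used in the DMFB are not restated in the text.

```python
# Effective kernel size of a dilated convolution: k_eff = k + (k - 1) * (d - 1).
# Assuming 3x3 kernels and illustrative dilation rates 1, 2, 4, 8:
for d in (1, 2, 4, 8):
    k_eff = 3 + (3 - 1) * (d - 1)
    print(f"dilation {d}: effective kernel {k_eff}x{k_eff}")
# -> 3x3, 5x5, 9x9 and 17x17 footprints with the same 3x3 parameter count
```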
The input features are extracted by dilated convolutions with different dilation rates to obtain multi-scale features, which are then fused. We incorporate the DMFB module into the soft SPDNE module, replacing its standard convolutional layers, which further enhances its ability to capture global features. Additionally, we place a DMFB module after the improved SPADE module and adopt a residual structure: the input feature map is added to the output of the MPD module, which stabilizes training. This process can be expressed as
$$F_{out} = F_{in} + \mathrm{DMFB}\left(\mathrm{SPADE}\left(F_{in}, P\right)\right)$$
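The sketch below combines the pieces into one MPD-style residual block. The dilation rates, channel splits, 1 × 1 fusion, and the abstraction of the improved SPADE step as a generic callable are assumptions for illustration, not the exact published architecture.

```python
import torch
import torch.nn as nn

class DMFB(nn.Module):
    """Sketch of a dense multi-scale fusion block: parallel 3x3 dilated
    convolutions whose outputs are accumulated densely and fused by a 1x1 conv."""
    def __init__(self, ch, rates=(1, 2, 4, 8)):
        super().__init__()
        self.reduce = nn.Conv2d(ch, ch // 4, 1)
        self.branches = nn.ModuleList(
            nn.Conv2d(ch // 4, ch // 4, 3, padding=r, dilation=r) for r in rates)
        self.fuse = nn.Conv2d(ch, ch, 1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        t = self.act(self.reduce(x))
        outs, acc = [], 0
        for conv in self.branches:
            acc = conv(t) + acc            # dense combination across branches
            outs.append(acc)
        return self.fuse(torch.cat(outs, dim=1))

class MPDBlock(nn.Module):
    """Residual MPD sketch: F_out = F_in + DMFB(ImprovedSPADE(F_in, P)).
    The improved SPADE step is abstracted as a callable `spade` argument."""
    def __init__(self, ch, spade):
        super().__init__()
        self.spade, self.dmfb = spade, DMFB(ch)

    def forward(self, f_in, prior, mask):
        return f_in + self.dmfb(self.spade(f_in, prior, mask))

mpd = MPDBlock(64, spade=lambda f, p, m: f)   # identity stand-in for the SPADE step
out = mpd(torch.rand(1, 64, 64, 64), None, None)
```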

3.2.2. Attention Module

To enhance the network's ability to represent the overall structure and texture of an image, a CBAM attention module is introduced into the refinement network to emphasize key information. It cascades a channel attention module (CAM) and a spatial attention module (SAM) [11]. The channel attention focuses on the relationships between feature-map channels, while the spatial attention operates on pixel regions to complement it. The structure is shown in Figure 4.
The feature maps produced by the generator are fed into the attention module, where attention maps are derived sequentially along two independent dimensions: channel and spatial. Channel attention suppresses irrelevant semantic channels and enhances target channels, ensuring the semantic plausibility of the generated content. Spatial attention focuses on key regions such as mask edges, guiding the network to prioritize structural boundaries and improving the continuity of local details. Their synergy enables MD-GAN to maintain global structural consistency while refining local textures in large-mask inpainting. The attention maps are multiplied with the input feature maps to refine the features. Ablation experiments demonstrate that adding the CBAM module significantly improves the visual quality of the results, and our experiments show that placing it after the fourth MPD module in the refinement network works best.
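A minimal CBAM sketch is given below, following the standard formulation of Woo et al. [11]. The reduction ratio of 16 and the 7 × 7 spatial kernel are the common defaults and are assumptions here, since the paper does not restate them.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Sketch of CBAM: channel attention followed by spatial attention."""
    def __init__(self, ch, reduction=16, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1),
        )
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):
        # Channel attention: shared MLP over average- and max-pooled descriptors
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # Spatial attention: conv over channel-wise average and max maps
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

cbam = CBAM(64)
refined = cbam(torch.rand(1, 64, 32, 32))
```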

3.3. Loss Function

The loss function is a pivotal factor that significantly impacts the generative capability of the network. We use inpainting loss and adversarial loss to constrain the generation of images. We utilize perceptual diversity loss to enhance the diversity of output images.

3.3.1. Perceptual Diversity Loss

Perceptual loss compares two images in the feature space of a pre-trained network and correlates better with human perception of image quality than pixel-wise differences. Because our network uses a two-sample generation mechanism, it outputs two different results, I1 and I2, simultaneously, and we want them to differ more in order to increase diversity. Therefore, we adapt the perceptual loss in a manner similar to PDGAN and further simplify it. The perceptual diversity loss Lperd is defined as
$$\mathcal{L}_{perd} = \frac{1}{N}\sum_{i=1}^{N} \left\| \phi_i\left(I_1 \odot M\right) - \phi_i\left(I_2 \odot M\right) \right\|_1$$
where ϕi denotes the feature maps of the i-th selected layer of the pre-trained VGG-16 network [23] and N is the number of selected layers. During training, we maximize Lperd to enhance the diversity of our network. In VGG-16, high-level features are sensitive to semantic structure, so emphasizing them forces the network to generate results that differ in object shape and layout. To this end, a hierarchical weighting strategy is adopted: different weights are assigned to different VGG layers to strengthen the contribution of high-level semantic differences to diversity.
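The following sketch mirrors this loss with torchvision's VGG-16. The selected layers, their weights, and the masking convention (the map M selects the region where diversity is encouraged) are illustrative assumptions, since the exact layer set and weights are not listed in the text.

```python
import torch
import torchvision

class PerceptualDiversityLoss(torch.nn.Module):
    """Sketch of L_perd: weighted L1 distance between VGG-16 features of the
    two generated samples inside the masked region."""
    def __init__(self, layers=(3, 8, 15, 22), weights=(0.25, 0.5, 0.75, 1.0)):
        super().__init__()
        vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg = vgg
        self.layers = set(layers)
        self.register_buffer("w", torch.tensor(weights))

    def forward(self, i1, i2, mask):
        # mask selects the region where diversity is encouraged
        f1, f2, diffs = i1 * mask, i2 * mask, []
        for idx, layer in enumerate(self.vgg):
            f1, f2 = layer(f1), layer(f2)
            if idx in self.layers:
                diffs.append((f1 - f2).abs().mean())
            if idx >= max(self.layers):
                break
        return (self.w * torch.stack(diffs)).sum() / self.w.sum()
```

During training this quantity is maximized, for example by adding its negative to the total objective that is minimized.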

3.3.2. Inpainting Loss

Inpainting loss consists of reconstruction loss, style loss and feature matching loss. The reconstruction loss measures the pixel-level difference between the inpainted image Iout and the ground-truth image Igt. It is defined as
$$\mathcal{L}_{recon} = \frac{1}{N}\sum_{i=1}^{N} \left\| I_{gt} - I_{out} \right\|_1$$
Style loss Lstyle is defined as
$$\mathcal{L}_{style} = \frac{1}{N}\sum_{j=1}^{N} \left\| G_j^{\phi}\left(I_{gt}\right) - G_j^{\phi}\left(I_{out}\right) \right\|_2$$
where G(·) denotes the Gram matrix, which reflects the correlations and texture statistics between feature maps. It is computed as
$$G = F^{T} F$$
where F is the feature map tensor.
The feature matching loss compares the activation maps in the intermediate layers of the discriminator. It encourages the generator's feature representations to be similar to those of ground-truth images and stabilizes the training process. The feature matching loss Lmatch is defined as
$$\mathcal{L}_{match} = \frac{1}{N}\sum_{i=1}^{N} \left\| D^{(i)}(x) - D^{(i)}\left(G(z)\right) \right\|_1$$
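For clarity, minimal versions of these three terms are sketched below. The Gram normalization by c·h·w and the averaging over feature layers are common conventions rather than details stated in the text.

```python
import torch

def gram_matrix(feat):
    """Gram matrix G = F^T F of a feature map, normalized by its size
    (the normalization is a common convention, not stated in the text)."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def reconstruction_loss(i_out, i_gt):
    return (i_out - i_gt).abs().mean()                 # pixel-level L1

def style_loss(feats_out, feats_gt):
    # feats_*: lists of feature maps of the output / ground-truth images
    return sum(torch.norm(gram_matrix(a) - gram_matrix(b))
               for a, b in zip(feats_out, feats_gt)) / len(feats_out)

def feature_matching_loss(d_feats_fake, d_feats_real):
    # d_feats_*: intermediate discriminator activations for generated / real images
    return sum((a - b).abs().mean()
               for a, b in zip(d_feats_fake, d_feats_real)) / len(d_feats_fake)
```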

3.3.3. Adversarial Loss

We use the gradient penalty for the adversarial training of our inpainting framework. Adversarial loss Ladv is defined as
$$\mathcal{L}_{adv} = \min_{G}\max_{D} V(D, G) = \mathbb{E}\left[\ln D\left(I_{gt} \odot M\right)\right] + \mathbb{E}\left[\ln\left(1 - D\left(G(I)\right)\right)\right]$$
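The discriminator side of this objective, together with a gradient penalty, might look like the sketch below. The paper states that a gradient penalty is used but not its exact form, so a WGAN-GP style penalty on interpolated samples is shown as one common option; the stand-in discriminator is purely illustrative.

```python
import torch
import torch.nn.functional as F

def d_loss_with_gp(discriminator, real, fake, gp_weight=10.0):
    """Discriminator loss: non-saturating GAN terms plus a gradient penalty
    on interpolations between real and generated images (one possible form)."""
    loss_real = F.softplus(-discriminator(real)).mean()          # -log D(real)
    loss_fake = F.softplus(discriminator(fake.detach())).mean()  # -log(1 - D(fake))
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    mix = (eps * real + (1 - eps) * fake.detach()).requires_grad_(True)
    grad = torch.autograd.grad(discriminator(mix).sum(), mix, create_graph=True)[0]
    gp = ((grad.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()
    return loss_real + loss_fake + gp_weight * gp

# Toy usage with a stand-in discriminator (the real one is not shown here).
D = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3, padding=1), torch.nn.AdaptiveAvgPool2d(1),
                        torch.nn.Flatten(), torch.nn.Linear(8, 1))
loss_d = d_loss_with_gp(D, torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64))
```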

3.3.4. Total Loss

The final loss Ltotal is
$$\mathcal{L}_{total} = \lambda_1 \mathcal{L}_{perd} + \lambda_2 \mathcal{L}_{recon} + \lambda_3 \mathcal{L}_{style} + \lambda_4 \mathcal{L}_{match} + \lambda_5 \mathcal{L}_{adv}$$
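As a worked illustration of how the terms combine, the snippet below uses placeholder loss values and weights; the actual λ values are not reported in this section, so the numbers are assumptions for demonstration only.

```python
import torch

# Placeholder loss values and weights, purely to illustrate the combination.
l_perd, l_recon, l_style, l_match, l_adv = map(torch.tensor, (0.8, 0.05, 0.3, 0.2, 0.7))
lam = (1.0, 1.0, 250.0, 10.0, 1.0)   # illustrative weights only
# L_perd is maximized during training; one simple way to fold that into a
# single minimized objective is to give it a negative effective weight.
L_total = (-lam[0] * l_perd + lam[1] * l_recon + lam[2] * l_style
           + lam[3] * l_match + lam[4] * l_adv)
```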

4. Experiments

4.1. Datasets and Evaluation Metrics

We trained separate models on three public datasets: Paris Street View [5], CelebA-HQ [24] and Places2 [25]. For Paris Street View and CelebA-HQ, we used their standard splits. For Places2, we randomly selected 30,000 images and divided them into a training set of 28,000 images and a test set of 2000 images. The masks used in the experiments are the irregular masks proposed by Liu et al. [17], with masking ratios ranging from 10% to 50%. All images and masks used for training and testing are resized to 256 × 256 pixels.
To objectively evaluate the effectiveness of the restored images, the experiments adopt three evaluation metrics. These metrics are the peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and Fréchet inception distance (FID) score.

4.2. Implementation Details

To ensure reproducibility, all experiments were conducted on an NVIDIA RTX 3090 GPU with PyTorch v1.13.1, CUDA v11.6, and Python v3.9. Random seeds were fixed at 42 for both the model and data pipeline. The two-stage training strategy, gradient penalties, and improved losses were employed to stabilize GAN training and mitigate mode collapse. The Adam optimizer was used with β1 = 0.0 and β2 = 0.99. The model was trained for 120 epochs. The initial learning rate for the generator network was set to 1 × 10−4. The batch size was set to 4. Hyperparameters were tuned via grid search on the Places2 validation subset, with performance evaluated using PSNR, SSIM, and visual inspection of diverse outputs.
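The reported settings translate directly into the following setup sketch; the `generator` here is a stand-in module, since the full network definition is outside the scope of this snippet.

```python
import random
import numpy as np
import torch

# Reported training configuration: seed 42, Adam with beta1 = 0.0 / beta2 = 0.99,
# learning rate 1e-4, batch size 4, 120 epochs.
torch.manual_seed(42); random.seed(42); np.random.seed(42)
device = "cuda" if torch.cuda.is_available() else "cpu"

generator = torch.nn.Conv2d(4, 3, 3, padding=1).to(device)   # stand-in for the MD-GAN generator
optimizer_g = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.0, 0.99))
num_epochs, batch_size = 120, 4
```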
We compare the following five mainstream methods: image inpainting model based on local binary pattern learning and spatial attention (LBP) [26], high-resolution image inpainting model (CoordFill) [27], irregular hole image inpainting model using partial convolutions (PC) [17], pluralistic image completion model (PIC) [7] and probabilistic diverse image inpainting model (PDGAN) [8].

4.3. Results and Comparisons

4.3.1. Quantitative Comparisons

Table 1, Table 2 and Table 3 present quantitative comparison results of the various methods on the Places2, Paris Street View, and CelebA-HQ datasets, respectively, reporting PSNR, SSIM, and FID values at different mask ratios.
The data in the tables clearly demonstrate that the proposed method either outperforms or matches the other methods across all metrics at various mask ratios on the three datasets. Notably, as the mask area increases, the advantage on each metric becomes more pronounced. The key difference between MD-GAN and the other methods is the sufficiently large effective receptive field provided by dilated convolutions, which allows MD-GAN to capture the global structural features of the image and achieve optimal performance under large masks.
Regarding the peak signal-to-noise ratio (PSNR), a higher value implies better inpainting quality. The proposed method performs best at all ratios, with a significant edge at large mask ratios of 40–50%. In the 40–50% mask interval on the Places2 dataset, the PSNR of MD-GAN is 1.25 dB higher than that of the second-best method, PDGAN, meaning the repaired area is significantly closer to the real image in pixel-level intensity distribution, with less blurring and fewer artifacts, as seen, for example, in the edge sharpness of the building roof in Figure 5.
For the structural similarity index measure (SSIM), the proposed method has an overall advantage, and the semantic plausibility of the corresponding repair results is clearly enhanced; as shown in Figure 6, the generated face is more consistent with the real face. However, it performs slightly worse on the Paris Street View dataset, which may be attributed to the highly complex building structures in that dataset. Although the proposed method is superior in most cases when inpainting complex structures under large masks, there is still room for improvement.
In terms of the Fréchet inception distance (FID), a lower value indicates a more realistic generated image. The proposed method is sub-optimal in FID on the Paris Street View dataset but performs best on the other two datasets. This is attributable to the highly complex structures of the Paris Street View scenes and suggests that the network's inpainting of complex structures can be further improved. In the 40–50% mask setting on Paris Street View, the FID of MD-GAN is 1.02 points lower than that of PDGAN, indicating that the generated street-view structures, such as window arrangements and lane-line directions, are more consistent with the semantic distribution of real scenes, reducing the common phenomenon of structural confusion.
The aim of this paper is to generate high-quality and diverse inpainting results under large masks, and the metric comparisons confirm that the proposed algorithm achieves this goal.

4.3.2. Qualitative Comparisons

Figure 5, Figure 6 and Figure 7 present qualitative comparison results on the Paris Street View, CelebA-HQ and Places2 datasets, respectively. The mask ratios used are all large, ranging from 30% to 50%. The PC results are also used as the prior information for our model and PDGAN.
From the qualitative analysis of the visualization results, LBP and PC exhibit obvious artifacts and blurring when dealing with large missing regions: the structures of houses and the edges of faces are unclear. CoordFill generates smooth images but loses some image information, such as roadside railings, resulting in low credibility. Moreover, these three methods cannot generate diverse results. The diverse images generated by PIC contain semantically unreasonable content in the inpainted areas, with numerous artifacts and low image quality. The images generated by PDGAN are of good quality, and its diverse outputs do differ, for example in smile curvature and fine hair; however, structural confusion and texture blurring are still present. Although the latter two methods can generate diverse images, their credibility and naturalness still need improvement. In contrast, the images generated by MD-GAN have the clearest structures and textures: under large masks, the repaired roofs, door frames, and railings all have clear boundaries and reasonable textures. In terms of diversity, when repairing facial details MD-GAN can generate two plausible results with more or less facial hair, and when repairing street-view images it can generate different reasonable structures in the uncertain roof areas. In conclusion, MD-GAN achieves diverse results with stable quality and reasonable structures under large missing regions.

4.4. Ablation Studies

Let V1 denote the network in which the MPD module of MD-GAN is replaced with the original SPADE module, and V3 denote the full MD-GAN network proposed in this paper. To evaluate the contribution of the MPD module, Table 4, Table 5 and Table 6 present the quantitative comparison between V3 and V1 on the Places2, Paris Street View, and CelebA-HQ datasets, respectively. For the improved MPD module, most PSNR and SSIM values are higher than those of the original SPADE module, especially at mask ratios of 30~40% and 40~50%. This demonstrates that the MPD module effectively enhances the network's repair quality when dealing with large masks.
Let V2 denote the network with the CBAM attention module removed from MD-GAN. To assess the role of the attention module, Table 4, Table 5 and Table 6 also compare V2 with the full network (V3) on the three datasets. V3 outperforms V2 in all metrics, verifying that the CBAM attention layer effectively improves the network's repair quality.
To explore the optimal position of the CBAM attention layer within the network, the CBAM module was placed after the 3rd (L3), 4th (L4), and 5th (L5) MPD modules of the refinement network. The quantitative comparison results, obtained on the Places2 dataset with 40–50% masks, are presented in Table 7. The results indicate that placing the CBAM module after the fourth MPD module yields the best performance. Our analysis is that at higher layers the resolution is higher, so the attention module focuses more on image details than on structures, which degrades the repair effect; at lower layers the features contain excessive noise, so the attention layer is more likely to focus on useless noisy details, also harming the repair effect. Therefore, the CBAM attention layer is best placed in a middle layer of the network.

5. Conclusions

To address the challenges of image inpainting with large masks, this paper proposes a diverse image inpainting model, MD-GAN. MD-GAN consists of a coarse network and a refinement network, and repairs defective images from coarse to fine. The refinement network employs the new MPD module, which helps the network achieve diverse inpainting of damaged images while ensuring reasonable and complete structures. Additionally, an attention mechanism and appropriate loss functions are incorporated; they not only enhance the diversity of the inpainting process, but also enable the network to perform well when inpainting large masks. Quantitative and qualitative experiments demonstrate that our method not only produces diverse inpainting results but also ensures smooth and complete structures under large masks.
In summary, MD-GAN’s superiority stems from three innovations: (1) its multi-scale receptive field design for large-mask structure modeling, (2) the MPD module’s ability to generate semantically diverse solutions, and (3) CBAM’s feature refinement for realistic details. These collectively address the core challenges of existing methods: structural collapse in large masks, limited diversity, and feature blur.
While MD-GAN excels at inpainting large masks with diverse and structurally reasonable results, it faces challenges in highly complex texture synthesis and real-time inference. Future work will focus on lightweight architecture design and texture-specific feature learning to expand its applicability.

Author Contributions

Conceptualization, X.G.; methodology, X.G.; software, X.G.; validation, X.G.; formal analysis, X.G.; investigation, X.G.; resources, X.G.; data curation, X.G.; writing—original draft preparation, X.G.; visualization, X.G. and W.G.; writing—review and editing, S.W.; supervision, S.W.; project administration, S.W.; funding acquisition, S.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Research Project of Higher Education Institutions in Henan Province, grant number 24A520018 and the National Natural Science Foundation of China, grant number 62072160.

Data Availability Statement

Places2 Dataset, http://places2.csail.mit.edu/, accessed on 1 October 2023; CelebA-HQ Dataset, https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html, accessed on 1 October 2023; Paris Street View Dataset, https://github.com/pathak22/context-encoder, accessed on 1 October 2023; Mask Dataset, https://arxiv.org/abs/1804.07723, accessed on 1 October 2023.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Guillemot, C.; Meur, O.L. Image inpainting: Overview and recent advances. IEEE Signal Process. Mag. 2013, 31, 127–144. [Google Scholar] [CrossRef]
  2. Wang, W.; Jia, Y. Damaged region filling and evaluation by symmetrical exemplar-based image inpainting for Thangka. EURASIP J. Image Video Process. 2017, 38, 38. [Google Scholar] [CrossRef]
  3. Li, J.; Song, G.; Zhang, M. Occluded offline handwritten Chinese character recognition using deep convolutional generative adversarial network and improved GoogLeNet. Neural Comput. Appl. 2020, 32, 4805–4819. [Google Scholar] [CrossRef]
  4. Jam, J.; Kendrick, C.; Walker, K.; Drouard, V.; Hsu, J.G.-S.; Yap, M.H. A comprehensive review of past and present image inpainting methods. Comput. Vis. Image Underst. 2021, 203, 103–147. [Google Scholar] [CrossRef]
  5. Pathak, D.; Krähenbühl, P.; Donahue, J.; Darrell, T.; Efros, A.A. Context encoders: Feature learning by inpainting. In Proceedings of the 2016 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2536–2544. [Google Scholar] [CrossRef]
  6. Zeng, Y.; Fu, J.; Chao, H.; Guo, B. Learning pyramid context encoder network for high-quality image inpainting. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 1486–1494. [Google Scholar] [CrossRef]
  7. Zheng, C.; Cham, T.-J.; Cai, J. Pluralistic image completion. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 1438–1447. [Google Scholar] [CrossRef]
  8. Liu, H.; Wan, Z.; Huang, W.; Song, Y.; Han, X.; Liao, J. PD-GAN: Probabilistic Diverse GAN for Image Inpainting. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 9367–9376. [Google Scholar] [CrossRef]
  9. Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. In Proceedings of the 2016 International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar] [CrossRef]
  10. Park, T.; Liu, M.-Y.; Wang, T.-C.; Zhu, J.-Y. Semantic image synthesis with Spatially-adaptive normalization. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 2332–2341. [Google Scholar] [CrossRef]
  11. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the 2018 European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar] [CrossRef]
  12. Bertalmio, M.; Sapiro, G.; Caselles, V.; Ballester, C. Image inpainting. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, New Orleans, LA, USA, 23–28 July 2000; pp. 417–424. [Google Scholar] [CrossRef]
  13. Barnes, C.; Shechtman, E.; Finkelstein, A.; Goldman, D.B. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Trans. Graph. (TOG) 2009, 28, 1–11. [Google Scholar] [CrossRef]
  14. Gui, J.; Sun, Z.; Wen, Y.; Tao, D.; Ye, J. A review on generative adversarial networks: Algorithms, theory, and applications. IEEE Trans. Knowl. Data Eng. 2023, 35, 3313–3332. [Google Scholar] [CrossRef]
  15. Radford, A.; Metz, L.; Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. In Proceedings of the 2016 International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar] [CrossRef]
  16. Yu, J.; Lin, Z.; Yang, J.; Shen, X.; Lu, X.; Huang, T. Free-form image inpainting with gated convolution. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4470–4479. [Google Scholar] [CrossRef]
  17. Liu, G.; Reda, F.A.; Shih, K.J.; Wang, T.-C.; Tao, A.; Catanzaro, B. Image inpainting for irregular holes using partial convolutions. In Proceedings of the 2018 European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 89–105. [Google Scholar] [CrossRef]
  18. Iizuka, S.; Simo-Serra, E.; Ishikawa, H. Globally and locally consistent image completion. ACM Trans. Graph. (TOG) 2017, 36, 1–14. [Google Scholar] [CrossRef]
  19. Hui, Z.; Li, J.; Wang, X.; Gao, X. Image fine-grained inpainting. arXiv 2020, arXiv:2002.02609. [Google Scholar]
  20. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the 2015 Medical Image Computing and Computer-Assisted Intervention(MICCAI), Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar] [CrossRef]
  21. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
  22. Wang, Y.; Tao, X.; Qi, X.; Shen, X.; Jia, J. Image inpainting via generative multi-column convolutional neural networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 2–8 December 2018; pp. 329–338. [Google Scholar] [CrossRef]
  23. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 2014 International Conference for Learning Representations (ICLR), Banff, AB, Canada, 14–16 April 2014. [Google Scholar] [CrossRef]
  24. Liu, Z.; Luo, P.; Wang, X.; Tang, X. Deep learning face attributes in the wild. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 3730–3738. [Google Scholar] [CrossRef]
  25. Zhou, B.; Khosla, A.; Lapedriza, A.; Torralba, A.; Oliva, A. Places: A 10 million image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 2018, 40, 1452–1464. [Google Scholar] [CrossRef] [PubMed]
  26. Wu, H.; Zhou, J.; Li, Y. Deep generative model for image inpainting with local binary pattern learning and spatial attention. IEEE Trans. Multimed. (TMM) 2022, 24, 4016–4027. [Google Scholar] [CrossRef]
  27. Liu, W.; Cun, X.; Pun, C.-M.; Xia, M.; Zhang, Y.; Wang, J. CoordFill: Efficient high-resolution image inpainting via parameterized coordinate. In Proceedings of the 2023 AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; pp. 1746–1754. [Google Scholar] [CrossRef]
Figure 1. MD-GAN network and MPD module. The MD-GAN Network consists of a coarse network and a refinement network. The coarse network is based on partial convolutional layers and outputs, with coarse repair result P as the prior information for the refinement network. The refinement network generates diverse results through a random vector z and a multi-scale probability diversity (MPD) module. The MPD module is composed of hard SPADE, soft SPADE, and a dense multi-scale fusion block (DMFB).
Figure 2. (a) Hard SPDNE module. Hard SPDNE calculates the probability map Dh through mask edge distance to constrain the generated content in the mask edge region; (b) Soft SPDNE module. Soft SPDNE dynamically learns the global structural features of prior information P via a trainable network to generate the probability map Ds.
Figure 3. DMFB module. Multi-scale features are extracted through dilated convolutions with different dilation rates. After dense connection and fusion, the effective receptive field is expanded. Meanwhile, parameter redundancy caused by large convolution kernels is avoided.
Figure 4. The CBAM attention module. The CBAM module first filters key semantic features through channel attention, then locates structural boundaries via spatial attention, and finally outputs feature maps fusing global–local information.
Figure 5. Qualitative comparison to the state-of-the-art methods on the Paris Street View dataset.
Figure 6. Qualitative comparison to the state-of-the-art methods on the CelebA-HQ dataset.
Figure 7. Qualitative comparison to the state-of-the-art methods on the Places2 dataset.
Table 1. Quantitative comparison results of different methods on the Places2 dataset.
| Metric | Mask | PC | LBP | PIC | CoordFill | PDGAN | Ours |
|---|---|---|---|---|---|---|---|
| PSNR ↑ | 10~20% | 28.05 | 28.79 | 27.91 | 28.73 | 28.60 | 29.40 |
| PSNR ↑ | 20~30% | 25.53 | 25.53 | 25.14 | 25.51 | 25.86 | 26.28 |
| PSNR ↑ | 30~40% | 22.06 | 22.99 | 22.63 | 23.42 | 23.74 | 24.02 |
| PSNR ↑ | 40~50% | 21.22 | 20.99 | 20.78 | 21.68 | 21.56 | 22.81 |
| SSIM ↑ | 10~20% | 0.8764 | 0.9281 | 0.8714 | 0.8932 | 0.9445 | 0.9530 |
| SSIM ↑ | 20~30% | 0.8526 | 0.8690 | 0.8562 | 0.8822 | 0.9133 | 0.9257 |
| SSIM ↑ | 30~40% | 0.8254 | 0.7993 | 0.8289 | 0.8267 | 0.8777 | 0.8962 |
| SSIM ↑ | 40~50% | 0.8039 | 0.7219 | 0.8058 | 0.7569 | 0.8462 | 0.8707 |
| FID ↓ | 10~20% | 19.702 | 10.726 | 20.769 | 10.232 | 13.344 | 9.429 |
| FID ↓ | 20~30% | 26.267 | 19.252 | 27.703 | 18.365 | 20.946 | 15.980 |
| FID ↓ | 30~40% | 36.695 | 34.086 | 39.574 | 31.483 | 35.251 | 26.127 |
| FID ↓ | 40~50% | 43.499 | 53.791 | 46.140 | 41.258 | 42.938 | 32.434 |
Table 2. Quantitative comparison results of different methods on the Paris Street View dataset.
| Metric | Mask | PC | LBP | PIC | CoordFill | PDGAN | Ours |
|---|---|---|---|---|---|---|---|
| PSNR ↑ | 10~20% | 31.30 | 32.02 | 31.95 | 31.90 | 31.88 | 32.66 |
| PSNR ↑ | 20~30% | 28.94 | 28.33 | 28.54 | 28.99 | 29.50 | 29.53 |
| PSNR ↑ | 30~40% | 26.50 | 25.90 | 26.02 | 26.35 | 27.25 | 27.30 |
| PSNR ↑ | 40~50% | 24.32 | 24.15 | 23.79 | 24.96 | 25.56 | 25.58 |
| SSIM ↑ | 10~20% | 0.9088 | 0.9451 | 0.9384 | 0.9516 | 0.9627 | 0.9631 |
| SSIM ↑ | 20~30% | 0.8929 | 0.8956 | 0.8814 | 0.9060 | 0.9440 | 0.9444 |
| SSIM ↑ | 30~40% | 0.8742 | 0.8395 | 0.8615 | 0.8570 | 0.9233 | 0.9240 |
| SSIM ↑ | 40~50% | 0.8536 | 0.7724 | 0.8369 | 0.8085 | 0.8971 | 0.8961 |
| FID ↓ | 10~20% | 63.547 | 23.505 | 63.306 | 53.162 | 50.003 | 51.975 |
| FID ↓ | 20~30% | 72.906 | 43.130 | 75.136 | 63.856 | 61.420 | 61.140 |
| FID ↓ | 30~40% | 85.426 | 68.856 | 86.483 | 73.302 | 76.088 | 75.589 |
| FID ↓ | 40~50% | 90.848 | 100.391 | 90.848 | 95.436 | 87.557 | 86.540 |
Table 3. Quantitative comparison results of different methods on the CelebA-HQ dataset.
| Metric | Mask | PC | LBP | PIC | CoordFill | PDGAN | Ours |
|---|---|---|---|---|---|---|---|
| PSNR ↑ | 10~20% | 30.87 | 31.56 | 30.06 | 31.53 | 31.32 | 32.16 |
| PSNR ↑ | 20~30% | 27.85 | 27.79 | 27.52 | 29.09 | 29.06 | 29.43 |
| PSNR ↑ | 30~40% | 26.43 | 25.49 | 25.61 | 25.92 | 27.37 | 27.64 |
| PSNR ↑ | 40~50% | 25.10 | 23.34 | 24.63 | 24.93 | 25.26 | 26.06 |
| SSIM ↑ | 10~20% | 0.9242 | 0.9428 | 0.9174 | 0.9594 | 0.9702 | 0.9718 |
| SSIM ↑ | 20~30% | 0.9280 | 0.8912 | 0.8960 | 0.9232 | 0.9532 | 0.9539 |
| SSIM ↑ | 30~40% | 0.9121 | 0.8394 | 0.8793 | 0.8947 | 0.9376 | 0.9386 |
| SSIM ↑ | 40~50% | 0.8955 | 0.7745 | 0.8643 | 0.8446 | 0.9207 | 0.9210 |
| FID ↓ | 10~20% | 11.196 | 7.487 | 10.724 | 8.751 | 5.892 | 4.101 |
| FID ↓ | 20~30% | 8.996 | 20.407 | 14.001 | 13.856 | 6.848 | 6.215 |
| FID ↓ | 30~40% | 11.311 | 21.162 | 18.394 | 16.841 | 9.497 | 9.257 |
| FID ↓ | 40~50% | 13.822 | 32.982 | 19.494 | 17.953 | 11.507 | 11.648 |
Table 4. Ablation experiment on Places2 dataset.
| Metric | Method | 10~20% | 20~30% | 30~40% | 40~50% |
|---|---|---|---|---|---|
| PSNR ↑ | V1 | 28.93 | 25.96 | 23.77 | 22.09 |
| PSNR ↑ | V2 | 28.71 | 25.88 | 23.81 | 22.44 |
| PSNR ↑ | V3 | 29.40 | 26.28 | 24.02 | 22.81 |
| SSIM ↑ | V1 | 0.9484 | 0.9197 | 0.8801 | 0.8512 |
| SSIM ↑ | V2 | 0.9479 | 0.9221 | 0.8891 | 0.8602 |
| SSIM ↑ | V3 | 0.9530 | 0.9257 | 0.8962 | 0.8707 |
Table 5. Ablation experiment on Paris Street View dataset.
| Metric | Method | 10~20% | 20~30% | 30~40% | 40~50% |
|---|---|---|---|---|---|
| PSNR ↑ | V1 | 31.97 | 29.51 | 27.27 | 25.51 |
| PSNR ↑ | V2 | 31.81 | 29.47 | 27.28 | 25.57 |
| PSNR ↑ | V3 | 32.66 | 29.53 | 27.30 | 25.58 |
| SSIM ↑ | V1 | 0.9628 | 0.9441 | 0.9235 | 0.8957 |
| SSIM ↑ | V2 | 0.9621 | 0.9435 | 0.9237 | 0.8959 |
| SSIM ↑ | V3 | 0.9631 | 0.9444 | 0.9240 | 0.8961 |
Table 6. Ablation experiment on CelebA-HQ dataset.
| Metric | Method | 10~20% | 20~30% | 30~40% | 40~50% |
|---|---|---|---|---|---|
| PSNR ↑ | V1 | 31.95 | 29.53 | 27.26 | 25.58 |
| PSNR ↑ | V2 | 31.85 | 29.38 | 27.59 | 25.60 |
| PSNR ↑ | V3 | 32.16 | 29.43 | 27.64 | 26.06 |
| SSIM ↑ | V1 | 0.9632 | 0.9451 | 0.9240 | 0.8958 |
| SSIM ↑ | V2 | 0.9633 | 0.9428 | 0.9251 | 0.8962 |
| SSIM ↑ | V3 | 0.9718 | 0.9539 | 0.9386 | 0.9210 |
Table 7. Position of attention ablation experiment on Places2 dataset.
| Metric | L3 | L4 | L5 |
|---|---|---|---|
| PSNR ↑ | 22.785 | 22.791 | 22.776 |
| SSIM ↑ | 0.8701 | 0.8703 | 0.8699 |