SR-Inpaint: A General Deep Learning Framework for High Resolution Image Inpainting

Abstract: Recently, deep learning has enabled a huge leap forward in image inpainting. However, due to memory and computational limitations, most existing methods can handle only low-resolution inputs, typically smaller than 1K. With the improvement of Internet transmission capacity and mobile device cameras, the resolution of the image and video sources available to users via the cloud or locally is increasing. For high-resolution images, common inpainting methods simply upsample the inpainted result of the downscaled image, which yields a blurry result. There is therefore an urgent need to reconstruct the missing high-frequency information in high-resolution images and generate sharp texture details. Hence, we propose a general deep learning framework for high-resolution image inpainting, which first hallucinates a semantically continuous but blurred result using low-resolution inpainting, thereby suppressing computational overhead. The sharp high-frequency details at the original resolution are then reconstructed using super-resolution refinement. Experimentally, our method achieves encouraging inpainting quality on 2K and 4K resolution images, ahead of the state-of-the-art high-resolution inpainting technique. This framework is expected to be popularized for high-resolution image editing tasks on personal computers and mobile devices in the future.


Introduction
Image inpainting, or image completion, which involves the automatic recovery of missing pixels of an image according to the known information within the image, is an important research area in computer vision. With the rapid development of digital image editing technology, image inpainting has been widely applied to damaged photo restoration, occlusion removal, intelligent aesthetics and other graphics fields. Inpainting has been an active research area in the past few decades, and many studies have been devoted to achieving visual realism and vividness [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15][16]. However, due to the complexity of damaged images and the inherent ambiguity of the methods, semantically continuous and texture-clear inpainting remains a major challenge, especially for High-Resolution (HR) images [17]. Hence, our work is motivated by the fact that most existing image inpainting techniques cannot achieve high-quality completion of damaged HR images.
Early inpainting methods can be broadly divided into diffusion methods based on pixel propagation [1][2][3] and patch methods based on texture borrowing [4][5][6][7][8], both of which do a poor job of reconstructing complex details [9]. In recent years, deep learning approaches have achieved promising success in inpainting. One stream of these methods hallucinates missing pixels using a learned data distribution [9][10][11][18]. Another stream fills the hole in a data-driven manner using external image sources [12][13][14][15][16]. Though these methods can yield meaningful structure in missing regions, the generated regions are often blurred and accompanied by artifacts. In addition, with the improvement of Internet transmission capacity and mobile device cameras, the resolution of the image and video sources available to users via the cloud or locally is increasing [17]. However, for HR images, general image inpainting methods often yield limited results, and the input may even be rejected due to memory limitations [17]. There is now an urgent need for methods that can reconstruct the missing high-frequency information in HR images and generate sharp texture details.
Therefore, several inpainting strategies have been proposed for the high-resolution reconstruction of high-frequency information. For example, Ikehata et al. [19] proposed a combined framework of patch-based inpainting and super-resolution to generate a dense high-resolution depth map from a corrupted low-resolution depth map and its corresponding high-resolution texture image. Kim et al. [20] proposed a method called "Zoom-to-Inpaint", which enhances the high-frequency details of the inpainted area through a zoom-in, refine and zoom-out strategy, combined with high-resolution supervision and progressive learning. These frameworks improve the high-frequency reconstruction of the missing regions in general images. However, they are not yet fully applicable to HR images and still face problems such as computational limitations. On the other hand, Yi et al. [17] proposed an HR image inpainting algorithm, which upsamples the Low-Resolution (LR) inpainted result and adds a high-frequency residual image to the blurred image to generate a sharp result through a contextual residual aggregation mechanism. The method effectively suppresses the cost of memory and computing power and achieves compelling quality in natural photographs with a monotonous background. However, the realism and semantic continuity of the inpainted results for images with complex compositions or textures need to be further improved. For now, the visually realistic recovery of high-frequency information for HR images with complex backgrounds is still a tricky task.
To this end, we propose a novel deep learning framework for HR image inpainting. The framework mainly consists of two deep learning modules: (1) a low-resolution inpainting module for the reconstruction of high-frequency information in the missing region, and (2) a super-resolution module for the enhancement of the resolution of the inpainted region. We downsample the HR image and feed it to the inpainting network, which hallucinates an LR map with high semantic continuity and coherence; this map is then sent to the super-resolution network for refinement, finally yielding a visually realistic inpainted result at high resolution. Our method can take 2K and 4K resolution images as input and generate results at the same resolution while ensuring structural and semantic coherence, which puts it ahead of the state-of-the-art technology. In summary, our contributions are four-fold:
• A novel deep learning framework for high-resolution inpainting, which accepts 2K and 4K resolution images as input and yields equally sharp results.
• A "degradation and refinement" strategy that suppresses memory and computational overhead while guaranteeing high inpainting quality at high resolution.
• Enhanced structural coherence and visual fidelity of the inpainted results, ahead of the state-of-the-art technology.
• A general high-resolution inpainting pipeline consisting of an independent inpainter and refiner in series that can be trained and modified separately.

Image Inpainting
Image inpainting is a fundamental and long-standing problem in computer vision. Traditional inpainting methods can be broadly classified into two categories: (1) diffusion methods [1][2][3], which propagate neighboring pixels; (2) patch methods [4][5][6][7][8][21,22], which explicitly borrow textures from the surroundings. These methods are limited to locally available information and cannot recover meaningful structures in the missing regions, let alone complex details. The development of the image processing field, including image synthesis, image super-resolution and image inpainting, has been greatly facilitated by the advent of deep learning and Convolutional Neural Networks (CNN), especially Generative Adversarial Networks (GAN) [15,18][23][24][25][26][27][28]. For example, Pathak et al. [15] proposed a context encoder that makes a reasonable assumption about the hole in the picture by training a CNN. Furthermore, Yang et al. [18] proposed an optimization method based on GAN that produces more realistic and coherent results. In a GAN, higher-order semantic acquisition is trained together with low-order pixel synthesis, which effectively compensates for the shortcomings of traditional algorithms. However, due to the complexity and diversity of natural images, it is not enough to merely generate new pixels; the visual fidelity and vividness of the inpainted results must also be ensured [15]. A classical single-stage GAN tends to produce discontinuities, blurring, artifacts and over-smoothing. Therefore, researchers have improved and innovated on the GAN framework; for example, Iizuka et al. [12] used global and local two-stage discriminators to judge the semantics of the generated images and improve the consistency of the generated pixels with the original pixels.
EdgeConnect [9] is an effective GAN-based inpainting framework inspired by the idea of "lines first, color next" in art creation, which generates complex details through a two-stage GAN and adheres well to the structure-first principle. The result is impressive. However, general inpainting methods still struggle to remove their inherent blurriness, which becomes more obvious after zooming in. Hence, the high-resolution recovery of the missing high-frequency information is a non-negligible problem for HR image inpainting.

High-Frequency Image Content Reconstruction
For complex HR images, although some current methods can inpaint meaningful content, they suffer severe high-frequency information loss due to the input resolution limitation and the inherent ambiguity. For this reason, Yi et al. [17] proposed a contextual residual aggregation mechanism that produces high-frequency residuals for the missing content by weighted aggregation of residuals from contextual patches, and then adds them to the blurry image to yield a high-resolution result. However, this mechanism struggles to ensure the structural and semantic consistency of the inpainted results. If we want to take full advantage of existing semantically continuous inpainting methods, we can only downsample the input and thus obtain a low-resolution result. Hence, we propose to resolve this contradiction using Super-Resolution (SR) techniques. SR reconstruction allows an HR image to be extrapolated from an LR image and recovers as much high-frequency information as possible, such as texture details. Early SR algorithms were based on image processing in the frequency or spatial domain, such as the Multiframe Image Restoration proposed by Tsai et al. [29] and Projection onto Convex Sets (POCS) proposed by Stark et al. [30]. In 2014, Dong et al. pioneered the application of deep learning to super-resolution reconstruction. Since then, a large number of super-resolution models based on deep learning have been proposed, from CNNs to GANs. The SRCNN proposed by Dong et al. [31] uses a three-layer CNN, with the layers corresponding to feature extraction, nonlinear mapping, and high-quality reconstruction of the image, respectively. However, the network is too shallow, which leads to a receptive field that is too small. Compared with Deep Neural Networks (DNN), the SRCNN has a weaker fitting ability and does a poor job on complex details. The most direct solution is to increase the network depth.
A deeper network leads to a larger receptive field [32], which allows the network to utilize more contextual information and learn a more representative global mapping. For example, Reuben et al. [33] proposed a spatial light field super-resolution method, using a deep CNN to restore the entire light field with consistency across all angular views. Kim et al. [34] proposed a very deep network (VDSR) that improves the SR performance in terms of both PSNR and SSIM. However, both SRCNN and VDSR upsample the LR image by bicubic interpolation before feeding it to the network, resulting in low efficiency. For this reason, FSRCNN [35] and ESPCN [36] operate directly on the LR input and upsample at the end of the network. Although great success has been achieved in high-frequency recovery with bicubic degradation [37][38][39], these methods perform poorly on the arbitrary blur caused by LR inpainting due to the mismatch of degradation models. Zhang et al. [40] proposed a plug-and-play deep framework (DPSR) with a new degradation model that can handle LR images with arbitrary blur kernels, achieving promising results on synthetic and real LR images. Hence, we migrate this framework to the HR inpainting task to reconstruct high-frequency information when going from LR to HR inpainted images.

Framework and Flow
We divide the HR inpainting task into two distinct problems: HR image inpainting and high-frequency information reconstruction. Hence, we propose a novel HR inpainting framework that first downsamples the HR input and feeds it into an LR network for inpainting; the preliminary inpainted result is then fed into an SR network for detail refinement. Finally, the inpainted HR result with high-frequency details is obtained.
The entire framework is depicted in Figure 1. It mainly consists of two networks in series: (1) an LR inpainting network and (2) an SR network. As shown in Figure 1, a damaged HR image (2K or 4K) and its mask are used as input. First, the input image and mask are bicubically downsampled in the input layer to obtain the LR maps, avoiding the memory overflow caused by an excessively large input. Next, the LR maps are fed into the LR inpainting network to yield a structure-coherent and detail-rich result in the LR field-of-view. The LR inpainted map is then sent to the SR network and scaled up to the original resolution by nonlinear mapping. This process realizes the reconstruction of high-frequency information at high resolution. Finally, the generated content is fused with the remaining part of the ground-truth image to obtain the HR inpainted image. The algorithm flowchart of our HR image inpainting method is shown in Figure 2. The following subsections describe the technical details of the deep learning networks used in our method.
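The data flow described above can be sketched in a few lines of NumPy. This is only an illustrative skeleton under stated assumptions, not the implementation used in the paper: `box_downsample` is a simple block-average stand-in for the bicubic degradation, `nearest_upsample` is a trivial stand-in for the SR network, and `inpaint_fn` is a hypothetical callable representing the LR inpainting network. Only the order of operations and the final mask-based fusion follow the text.

```python
import numpy as np

def box_downsample(img, s):
    """Stand-in for bicubic degradation: average over s x s blocks."""
    h, w = img.shape[:2]
    return img.reshape(h // s, s, w // s, s, -1).mean(axis=(1, 3))

def nearest_upsample(img, s):
    """Stand-in for the SR network: repeat every pixel s times per axis."""
    return img.repeat(s, axis=0).repeat(s, axis=1)

def inpaint_hr(hr_image, mask, scale, inpaint_fn, sr_fn=nearest_upsample):
    """hr_image: (H, W, C) float array; mask: (H, W), 1 marks a missing pixel."""
    h, w = mask.shape
    assert h % scale == 0 and w % scale == 0, "sketch assumes divisible sizes"
    mask3 = mask[..., None].astype(hr_image.dtype)
    # 1. Degrade the damaged image and its mask to the LR field-of-view.
    lr_image = box_downsample(hr_image * (1.0 - mask3), scale)
    lr_mask = (box_downsample(mask3, scale) > 0).astype(hr_image.dtype)
    # 2. LR inpainting (hypothetical callable standing in for the network).
    lr_filled = inpaint_fn(lr_image, lr_mask)
    # 3. SR refinement back to the original resolution.
    hr_generated = sr_fn(lr_filled, scale)
    # 4. Fuse: generated content inside the hole, original pixels elsewhere.
    return mask3 * hr_generated + (1.0 - mask3) * hr_image
```

Because of the fusion in step 4, pixels outside the mask are returned unchanged regardless of what the two networks produce, which is why only the generated region depends on the quality of the inpainter and the refiner.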

LR Inpainting Network
The LR inpainting network aims to characterize variations across the entire image in the LR field-of-view and to recover missing information. Because it forms the foundation of the HR reconstruction quality, LR inpainting must ensure structural consistency, semantic continuity, and sufficient detail in the filled content in the LR field-of-view. Hence, we adopt a two-stage GAN framework [9] to realize high-quality image inpainting in the LR field-of-view. The LR inpainting network consists of an edge generator and an image completion network, each stage of which follows an adversarial model consisting of a generator-discriminator pair. Specifically, the generator follows the architecture proposed by Johnson et al. [41] and consists of two downsampling encoders, eight residual blocks [42], and two upsampling decoders. The discriminator uses the 70 × 70 PatchGAN [43,44] architecture, which classifies whether overlapping 70 × 70 image patches are real or fake.
The core task of the edge generator is to predict the edge map for the masked region, as shown in Equation (1). Let $I_{gt}$ be the ground truth image, and let $C_{gt}$ and $I_{gray}$ denote its edge map and grayscale counterpart, respectively. In this stage, we use the masked grayscale image $\tilde{I}_{gray} = I_{gray} \odot (1 - M)$ as input. Its edge map and the image mask are denoted as $\tilde{C}_{gt} = C_{gt} \odot (1 - M)$ and $M$, respectively, and used as pre-conditions. Here, $\odot$ denotes the Hadamard product.
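Under the definitions above, the edge prediction referred to as Equation (1) can plausibly be written as follows. The symbol $G_1$ for the edge generator is the one used in the training section, while $C_{pred}$ for the predicted edge map and the argument order are assumptions that follow the EdgeConnect formulation [9].

```latex
\tilde{I}_{gray} = I_{gray} \odot (1 - M), \qquad
\tilde{C}_{gt} = C_{gt} \odot (1 - M)

C_{pred} = G_1\left(\tilde{I}_{gray},\, \tilde{C}_{gt},\, M\right) \tag{1}
```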
The network is trained with both the adversarial loss $L_{adv,1}$ and the feature-matching loss $L_{FM}$, as shown in Equation (2), where $\lambda_{adv,1}$ and $\lambda_{FM}$ are regularization parameters. The feature-matching loss is very similar to the perceptual loss: it compares the activation maps in the intermediate layers of the discriminator, which further stabilizes the training process by forcing the generator to produce results whose representations are similar to those of real images.
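A form of Equation (2) consistent with this description, together with one common definition of the feature-matching term, is given below. Both follow the EdgeConnect formulation [9] and should be read as a reconstruction under that assumption: $D_1$ is the first-stage discriminator, $D_1^{(i)}$ its $i$-th layer activation, $N_i$ the number of elements in that activation, and $L$ the number of layers compared.

```latex
\min_{G_1} \max_{D_1} L_{G_1}
  = \min_{G_1} \left( \lambda_{adv,1} \max_{D_1} L_{adv,1} + \lambda_{FM} L_{FM} \right) \tag{2}

L_{FM} = \mathbb{E}\left[ \sum_{i=1}^{L} \frac{1}{N_i}
  \left\| D_1^{(i)}(C_{gt}) - D_1^{(i)}(C_{pred}) \right\|_1 \right]
```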
For the second stage, i.e., the image completion network, the incomplete color image $\tilde{I}_{gt} = I_{gt} \odot (1 - M)$ is used as input, conditioned on a composite edge map $C_{comp}$, which is constructed by combining the edges inferred in the first stage with the ground truth edges in the remaining part of the original image, i.e., $C_{comp} = C_{gt} \odot (1 - M) + C_{pred} \odot M$. The network infers a color image $I_{pred}$ with the missing regions inpainted. This procedure is denoted as Equation (3).
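The masking and compositing operations used by the two stages are plain element-wise products, as the following NumPy sketch illustrates. The function names are hypothetical and the generators themselves are not modeled; the sketch assumes a mask that is 1 inside the hole and 0 elsewhere, as in the text.

```python
import numpy as np

def edge_stage_inputs(i_gray, c_gt, mask):
    """Inputs of the edge generator: masked grayscale image, masked edges, mask."""
    keep = 1.0 - mask
    return i_gray * keep, c_gt * keep, mask

def composite_edges(c_gt, c_pred, mask):
    """C_comp: ground-truth edges outside the hole, predicted edges inside it."""
    return c_gt * (1.0 - mask) + c_pred * mask

def completion_stage_input(i_gt, mask):
    """Masked color image; the (H, W) mask is broadcast over the channel axis."""
    return i_gt * (1.0 - mask[..., None])
```

The second-stage generator would then be called on the outputs of `completion_stage_input` and `composite_edges`.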
This network is trained over a joint loss, as shown in Equation (4), containing the $\ell_1$ loss $L_{\ell_1}$, adversarial loss $L_{adv,2}$, perceptual loss $L_{perc}$ and style loss $L_{style}$. $\lambda_{\ell_1}$, $\lambda_{adv,2}$, $\lambda_p$ and $\lambda_s$ are all regularization parameters.
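Written out with the symbols introduced in the text, the completion step and its joint objective, Equations (3) and (4), would take the following form. The symbol $G_2$ for the completion generator is taken from the training section; the weighted-sum form of the loss is what the list of terms and weights implies, and the name $L_{G_2}$ for the total is an assumption.

```latex
I_{pred} = G_2\left(\tilde{I}_{gt},\, C_{comp}\right) \tag{3}

L_{G_2} = \lambda_{\ell_1} L_{\ell_1} + \lambda_{adv,2} L_{adv,2}
        + \lambda_p L_{perc} + \lambda_s L_{style} \tag{4}
```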

SR Network
The LR inpainting network can only produce LR inpainted results, which additionally carry the unavoidable blur and noise of the inpainting process, so the problem of high-frequency information reconstruction at high resolution must be addressed. Hence, the SR network aims to recover the missing high-frequency information in the HR field-of-view and enhance the resolution of the LR inpainted results. Since the blurring pattern of the generated LR content is unknown, we adopt a deep plug-and-play SR framework for arbitrary blur kernels (DPSR) [40].
Most existing SR methods assume some degradation model. A widely used general degradation model for SR is depicted as Equation (5).
where $x \otimes k$ denotes the convolution between the blur kernel $k$ and the HR image $x$, $\downarrow_s$ is a subsequent downsampling operation with scale factor $s$, and $n$ is additive white Gaussian noise (AWGN). However, DPSR employs a new degradation model that supports blur kernel estimation using existing deblurring methods. As shown in Equation (6), the degradation model of DPSR modifies the general degradation model by first bicubically downsampling the full-size image and then convolving it with the kernel $k$, rather than following the convolve-then-downsample order, which is effective in dealing with blurry LR images.
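With $y$ denoting the observed LR image (a symbol assumed here, since it is not named in the text), the two degradation models described above, Equations (5) and (6), differ only in the order of blurring and downsampling:

```latex
y = (x \otimes k)\downarrow_s + n \tag{5}

y = (x\downarrow_s) \otimes k + n \tag{6}
```

In Equation (6) the blur kernel acts on the already downsampled image, so it can be estimated directly from the LR observation.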
A noise term $n$ is added in both models. Once the degradation model is defined, an energy function is formulated according to the Maximum A Posteriori (MAP) probability, which contains two terms: a data fidelity (likelihood) term and a regularization term. This optimization problem is solved with the half-quadratic splitting (HQS) algorithm.
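For reference, a sketch of this derivation following the DPSR formulation [40] is given below. The symbols $\sigma$ (noise level), $\Phi$ (regularizer), $\lambda$ and $\mu$ (trade-off and penalty parameters), the auxiliary variable $z$ and the iteration index $t$ are not defined in the text and are assumptions introduced for illustration.

```latex
% MAP energy for the degradation model y = (x \downarrow_s) \otimes k + n
E(x) = \frac{1}{2\sigma^2} \left\| y - (x\downarrow_s) \otimes k \right\|^2 + \lambda \Phi(x)

% Half-quadratic splitting with the auxiliary variable z \approx x \downarrow_s
E_\mu(x, z) = \frac{1}{2\sigma^2} \left\| y - z \otimes k \right\|^2
            + \lambda \Phi(x) + \frac{\mu}{2} \left\| z - x\downarrow_s \right\|^2

% Alternating updates
z_{t+1} = \arg\min_z \left\| y - z \otimes k \right\|^2
        + \mu \sigma^2 \left\| z - x_t\downarrow_s \right\|^2

x_{t+1} = \arg\min_x \frac{\mu}{2} \left\| z_{t+1} - x\downarrow_s \right\|^2 + \lambda \Phi(x)
```

The $z$-update is a deblurring step at LR scale, while the $x$-update can be read as super-resolving $z_{t+1}$ under bicubic degradation with a noise level governed by $\mu$, which is where a learned super-resolver is plugged in.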
Finally, a super-resolver needs to be specified, which should also take the noise level as input. Here, we only need to modify an existing DNN-based super-resolver by adding a noise level map as input. Methods such as SRMD can also be adopted, as they already take the noise level map as input.

Training Configuration and Strategy
We train both the EdgeConnect and DPSR models on a single NVIDIA GeForce GTX 1080 Ti GPU with the PyTorch framework.
For the EdgeConnect model, the size of the input image is 256 × 256 and the batch size is set to 8. The Adam algorithm is adopted to optimize the model, with $\beta_1 = 0$ and $\beta_2 = 0.9$. First, generators $G_1$ and $G_2$ are trained separately using Canny edges with a learning rate of $10^{-4}$ until the training reaches a plateau. Then, the learning rate is reduced to $10^{-5}$ and $G_1$ and $G_2$ continue to train until convergence. Finally, the networks are fine-tuned by removing $D_1$: $G_1$ and $G_2$ are trained end-to-end with a learning rate of $10^{-6}$ until convergence. The learning rate of the discriminators is 1/10 of that of the generators.
For the DPSR model, we train an enhanced version of SRResNet, namely SRResNet+, as in Zhang et al. [40]. The Adam algorithm [45] is again adopted to optimize the SRResNet+ model. The learning rate is first set to $10^{-4}$; it is then halved every $5 \times 10^5$ iterations and finally fixed once it reaches $10^{-7}$. The batch size for training is 16 and the patch size of the LR input is 48 × 48. Data augmentation is performed by image rotation and flipping.
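The step-wise learning rate schedule described for SRResNet+ can be expressed as a small function. This is a sketch under one assumption: since repeated halving of $10^{-4}$ never lands exactly on $10^{-7}$, the rate is clamped at that floor once halving would fall below it. The function name is hypothetical.

```python
def srresnet_plus_lr(iteration, base_lr=1e-4, step=500_000, floor=1e-7):
    """Halve the learning rate every `step` iterations, never going below `floor`."""
    halvings = iteration // step
    return max(base_lr * 0.5 ** halvings, floor)
```

Under this reading the floor is reached after ten halvings, i.e., from iteration $5 \times 10^6$ onward.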

Experimental Results and Discussion
Our proposed method (SR-Inpaint) is evaluated on the DIV2K dataset [46] as well as on 200 2K images and 200 4K images of people, animals, nature, cities, objects, etc. Results are compared against the state-of-the-art HR image inpainting technology (HiFill by Yi et al. [17], CVPR 2020) both qualitatively and quantitatively. In the experiment, the damaged pixels account for 10.612% of the total pixels in the 2K test set and 12.799% in the 4K test set. Figure 3 displays the HR image inpainting pipeline in our framework. First, the damaged image and its mask are bicubically downsampled to 1K resolution. The LR damaged map is then fed to the LR inpainting network to yield an LR inpainted map. Subsequently, the LR inpainted map is fed into the SR network for inference to reconstruct the high-frequency details at high resolution. Note that a gate is used here to select the SR network matching the resolution of the input image: for 2K input, the "×2" network is selected to recover the original resolution; for 4K input, the "×4" network is selected. Finally, the SR-enhanced generated content is fused with the real background by masking to obtain the completed HR inpainting result.
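The gate that matches an SR network to the input resolution could be as simple as the following sketch. The rule used here, comparing the long side of the input with an assumed 1K working size of 1024 pixels and picking the nearest supported factor, is an illustrative assumption; the text only states that 2K inputs use the "×2" network and 4K inputs use the "×4" network.

```python
def select_sr_scale(width, height, lr_long_side=1024, supported=(2, 4)):
    """Pick the supported SR factor closest to the HR/LR long-side ratio."""
    ratio = max(width, height) / lr_long_side
    return min(supported, key=lambda s: abs(s - ratio))
```

For example, a 2048 × 1080 input maps to the ×2 network and a 3840 × 2160 input maps to the ×4 network.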

Qualitative Evaluation
First, our approach is compared with the state-of-the-art technology in terms of visual results. Figures 4 and 5 show examples of the images generated by our model and the comparison model under 2K and 4K inputs. For visualization, we replace the damaged area with black. It is clearly visible that our model is able to generate semantically continuous results that are closer to the ground truth. In addition, most of the image structures remain coherent. In contrast, the HiFill model does a poor job in terms of structure and semantics. In particular, for complex backgrounds, the results of the HiFill model suffer from deformation and semantic incoherence.
We attribute this to the fact that the HiFill algorithm borrows the surrounding texture to fill the holes. If the structure and semantics of the missing region are completely different from those of the surroundings, it is difficult to guarantee a meaningful structure. In contrast, our inpainting is based on edge connection, following the principle of "lines first, color next", and generates coherent structures through a two-stage GAN to achieve visual realism.
Compared to general image inpainting, for HR inpainting the sharpness of the inpainted area is as important as the global coherence of the picture. Therefore, Figure 6 shows a zoomed-in comparison of the inpainting results. It can be seen that the areas generated by bicubic upsampling and Gaussian pyramid upsampling are blurred and lack detail. The areas enhanced by the wavelet method have a non-negligible color difference from the original images. Meanwhile, the areas enhanced by the Super-Resolution (SR) enhancement mechanism show minimal blur. Compared with the traditional techniques, the SR enhancement mechanism based on deep learning achieves the reconstruction of high-frequency details at high resolution, which significantly improves the sharpness of the generated image.
The HiFill model can also generate high-frequency details at high resolution through the Contextual Residual Aggregation (CRA) mechanism, which aggregates high-frequency details using background residuals. However, if the high-frequency information in the background is not relevant to that in the missing region, the generated high-frequency details are meaningless. In contrast, the SR enhancement mechanism generates high-frequency details through global picture inference learned from large-scale data, which yields meaningful details in most cases. In summary, although both methods generate HR results, our method does significantly better than the current state-of-the-art HR inpainting method in terms of semantic consistency as well as structural continuity, and thus improves the visual realism of the inpainting.

Quantitative Evaluation
For a more objective comparison in terms of high-resolution inpainting, we tested our method against the state-of-the-art method on the 2K and 4K image test sets and calculated numerical metrics. The quality of our results is evaluated using the following metrics: Peak Signal-to-Noise Ratio (PSNR) [47], Structural SIMilarity (SSIM) [48], Normalized Root Mean Square Error (NRMSE), and Fréchet Inception Distance (FID) [49]. Among them, PSNR is used to measure the degree of deformation and noise; SSIM is used to describe the degree of similarity of the image structure; NRMSE is used to measure the pixel error; FID is used to measure the perceptual error based on deep features, using a pre-trained Inception-V3 model [50]. Table 1 presents the numerical results of our model and the current state-of-the-art model on the 2K and 4K image test sets. It can be seen that our model performs better on both test sets for all numerical metrics. This indicates that our framework is ahead of the state-of-the-art method in the quality of inpainting at 2K and 4K resolutions: it adds pixel-level details more faithfully, recovers the global structure better, and obtains perceptually more realistic results.
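As an illustration of the two pixel-level metrics, a minimal NumPy sketch is given below. PSNR follows its standard definition, $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. NRMSE admits several normalizations; dividing by the root mean square of the reference image is an assumption made here, since the text does not specify which one was used. SSIM and FID are omitted because they require substantially more machinery.

```python
import numpy as np

def psnr(reference, result, data_range=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    diff = reference.astype(np.float64) - result.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(data_range ** 2 / mse)

def nrmse(reference, result):
    """RMSE divided by the RMS of the reference (one of several conventions)."""
    ref = reference.astype(np.float64)
    err = ref - result.astype(np.float64)
    return np.sqrt(np.mean(err ** 2)) / np.sqrt(np.mean(ref ** 2))
```

Higher PSNR and lower NRMSE indicate a result that is closer to the ground truth; casting to `float64` first avoids wrap-around when the inputs are 8-bit images.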

Conclusions
We propose a general deep learning framework for the reconstruction of missing high-frequency information in high-resolution images through a super-resolution enhancement mechanism. Compared with traditional deep learning inpainting techniques, our model can handle both 2K and 4K images. Since our model adopts a "degradation and refinement" strategy, the computational overhead is well suppressed while the inpainting quality is maintained. In addition, compared with the current state-of-the-art high-resolution inpainting model, our model leads in both visual results and numerical metrics, achieving semantic continuity, texture clarity, and visual fidelity. In the future, we will further optimize the network structure and training strategy to achieve better results as well as higher efficiency.