LPIN: A Lightweight Progressive Inpainting Network for Improving the Robustness of Remote Sensing Images Scene Classification

Abstract: At present, the classification accuracy of high-resolution Remote Sensing Image Scene Classification (RSISC) has reached a quite high level on standard datasets. In practical applications, however, the intrinsic noise of satellite sensors and disturbances from the atmospheric environment often degrade real Remote Sensing (RS) images, introducing defects that impair the performance and reduce the robustness of RSISC methods. Moreover, due to the restrictions of memory and power consumption, such methods also need a small number of parameters and a fast computing speed to be implemented on small portable systems such as unmanned aerial vehicles. In this paper, a Lightweight Progressive Inpainting Network (LPIN) and a novel approach combining LPIN with existing RSISC methods are proposed to improve the robustness of RSISC tasks and satisfy the requirements of portable systems. The defects in real RS images are inpainted by LPIN to provide a purified input for classification. With the combined approach, the classification accuracy on RS images with defects can be restored to the original level of those without defects. The LPIN is designed with a lightweight model in mind: measures are adopted to ensure a high gradient transmission efficiency while reducing the number of network parameters, and multiple loss functions are used to obtain reasonable and realistic inpainting results. Extensive image inpainting tests of LPIN and classification tests with the combined approach on the NWPU-RESISC45, UC Merced Land-Use and AID datasets indicate that LPIN achieves state-of-the-art inpainting quality with fewer parameters and a faster inpainting speed. Furthermore, the combined approach keeps the classification accuracy on RS images with defects at a level comparable to that on images without defects, which improves the robustness of high-resolution RSISC tasks.
Author Contributions: Investigation, X.Z.; Methodology, W.A.; Project administration, Y.D.; Resources, H.W. and Y.D.; Software, W.A. and X.Z.; Supervision, J.S.; Validation, W.A., X.Z., W.Z. and Y.D.; Visualization, W.A.; Writing—original draft, W.A.; Writing—review & editing, W.A. and H.W.


Introduction
Remote Sensing (RS) images are widely used in earth science, agriculture, military reconnaissance, disaster rescue and many other fields. To fully understand and utilize the rich information of the earth surface contained in RS images, the task of RS image scene classification (RSISC) has become a research hotspot. Most existing RSISC methods [1] can be roughly divided into two categories according to their approaches to feature design and extraction. One is the traditional machine learning-based methods with hand-crafted features, such as models based on Bag of Visual Words (BoVW) [2], Randomized Spatial Partition (RSP) [3], Hierarchical Coding Vector (HCV) [4] and Fisher vectors (FVs) [5]. As deep learning technology has proved to have excellent performance in computer vision and pattern recognition [6,7], classification methods based on Convolutional Neural Networks (CNNs) [8][9][10][11][12][13][14][15][16][17][18][19] have been widely investigated because they can learn and extract image features automatically. Hu et al. [9] and Du et al. [10] used a pretrained CNN to extract image features. Yin et al. [11] defined three types of fusion-based models for RSISC and gave typical fusion methods. Li et al. [12] proposed a hybrid architecture of a pre-trained model combined with an unsupervised encoding method. Wang et al. [13] adopted an attention module and constructed an end-to-end network. Li et al. [14], Jiang et al. [15], and Zhang et al. [16] used Capsule Networks (CapsNet) in their models, which can preserve the spatial information of RS images. To date, RSISC tasks have achieved quite high accuracy on standard datasets.
For instance, the overall accuracy (OA) is 94.87% [16] on the NWPU-RESISC45 dataset [20] and 96.05% [16] on the Aerial Image Dataset (AID) [21], both with a training ratio of only 20%. Moreover, RS image classification using the red-green-blue bands of the EuroSAT dataset [22] achieves an accuracy of 99.17% via deep transfer learning [23].

The Dilemma of Existing RSISC Methods
As reviewed above, the existing RSISC methods have performed well on standard RS image datasets, but cannot achieve a similar accuracy on real RS images, which usually have defects due to satellite sensor failures and atmospheric environment disturbances. Three typical defects exist in real RS images: (1) Periodic stripes. On 31 May 2003, the failure of the Scan-Line Corrector (SLC) of the Landsat-7 Enhanced Thematic Mapper Plus (ETM+) resulted in stripe-shaped pixel loss in all images acquired since then. (2) Random noises. The electromagnetic interference between the satellite optical sensors and the load devices often generates random noises and leads to dead pixels in RS images. (3) Thick clouds. Thirty-five percent of the earth's surface is covered by clouds [24], which makes it difficult to obtain pure RS images without clouds. In summary, the defects in real RS images inevitably introduce invalid pixel interference into RSISC tasks.
Experimental results in this paper show that the classification accuracy of a model that performs well on the standard dataset may decrease by as much as 68.05% when dealing with defected RS images. The existing RSISC tasks take the defected RS images as input, and the invalid pixels directly participate in feature extraction, as shown in Figure 1. These invalid pixels induce a defect information flow during feature processing and may lead to misclassification by the scene classifier and a decrease in classification accuracy. Therefore, we believe that the existing RSISC tasks face a dilemma of accurate classification on real RS images with defects. Improving the robustness of RSISC methods and maintaining their classification accuracy on real RS images with defects is crucial for the practical application of RSISC and is also the research emphasis of this paper.

The Approach to Overcoming This Dilemma
A variety of methods have been proposed to reduce the interference of defects in the defected images. For example, Li et al. [25] replaced several layers in the model with discrete wavelet transform. Duan et al. [26] designed image features that are insensitive to defects. Chen et al. [27] proposed a method to learn the statistical characteristics of defects and used image filters to separate the defects. Chen et al. [28] also adopted a spectral-spatial preprocessing algorithm to improve the performance of the hyperspectral RS image classification. However, these methods are mainly designed for enhancing the anti-defect capability of their models instead of eliminating defects. Various modules are adopted in the models to decrease interference of the defects, which leads to an increase in the complexity and training time of the model.
In addition, to apply the inpainting network on small portable systems such as RS unmanned aerial vehicles (UAVs) for the practical application of RSISC tasks, the models must be lightweight due to the restrictions of memory and power consumption. We believe this is also one of the current challenges in dealing with defected RS images and deserves extra attention.
At present, few studies on improving the robustness of RSISC from the perspective of purifying the input images and lightweight design have been reported. Thus, to solve the dilemma fundamentally, we propose a Lightweight Progressive Inpainting Network (LPIN) based on a lightweight design, and a novel approach to purifying the front-end input RS images, as shown in Figure 2, instead of improving the anti-defect ability of the back-end classification methods.
Figure 2. New RSISC task procedure to solve the dilemma. The defect information flow introduced by invalid pixels is cut off by the inpainting network, as shown by the black X-shape, and the feature extractor can extract more purified features.
Different from commonly used image preprocessing methods, such as image filtering, the proposed LPIN focuses on image reconstruction to generate a completely new RS image and cut off the defect information flow introduced by invalid pixels. The LPIN has a light model weight and a fast computing speed, which makes it suitable for the practical application of RSISC tasks. It is then combined with existing RSISC methods to extract purified features and improve the classification accuracy and robustness of RSISC tasks. Besides, the proposed approach has wide applicability and can be applied to multiple back-end classification methods.

Related Works of Image Inpainting
Image inpainting technology is used to reconstruct images with undesired regions; it is a branch of image reconstruction and has been widely applied in computer vision [29], such as human face repairing, mosaic removal, watermark removal, cultural heritage image restoration and so on. For RS image inpainting, patch matching based on probability and statistics theories is the most widely used approach. Such methods search for and match suitable patches in valid regions and copy them to the corresponding defect regions [30][31][32][33]. Zheng et al. [34] inpainted hyperspectral RS images with Nonlocal Second-order Regularization (NSR) and used semi-patches to accelerate nearest neighbor searching. Zhuang et al. [35] proposed a Fast Hyperspectral Denoising (FastHyDe) model and a Fast Hyperspectral Inpainting (FastHyIn) model to deal with defects in hyperspectral RS images. Li et al. [36] used Patch Matching-based Multitemporal Group Sparse Representation (PM-MTGSR) with auxiliary images to inpaint defect regions caused by thick cloud occlusion and sensor faults. Lin et al. [37] used the temporal correlation of RS images to create patch clusters in the temporal and spatial layers, and searched and matched the missing information from the clusters to inpaint cloud-contaminated images. These methods can inpaint the defects in RS images, but if there are no suitable patches in the valid regions or the defect regions to be inpainted have complex structures, the inpainting often gives unsatisfactory results.
Compared with the patch-based methods, deep learning-based methods can achieve much better inpainting performance because they can learn the image content and style automatically and comprehend the image globally. Pathak et al. [38] proposed a Context Encoder (CE) model, which proved the feasibility of CNNs for image inpainting. Since then, various architectures and modules have been developed to generate more reasonable and realistic inpainting results. However, in the practical use of RSISC tasks on small portable systems, the inpainting network needs to have fewer parameters and a lightweight structure due to the restrictions of memory and power consumption, whereas researchers in the field of image inpainting have mainly focused on designing new modules or architectures and neglected the lightweight design of networks.
For example, to maintain the consistency of local texture and global content, Iizuka et al. proposed a Generative Adversarial Network (GAN) [39] based model, Globally and Locally Consistent Image Completion (GLCIC) [40], with a local discriminator and a global discriminator. To extract distant contextual information for the missing part of an image, Yu et al. [41] released a coarse-to-fine inpainting network, DeepFill, based on Wasserstein GAN (WGAN) [42]. To extract valid multi-scale features, Liu et al. [43] devised a Partial Convolution (PCONV) layer with a U-net [44] architecture. To avoid treating all image pixels as valid ones, Yu et al. [45] proposed a method called DeepFill v2, which uses a gated convolution to replace the vanilla convolution in the network. To generate structural information at different scales for a reasonably structured image, Li et al. [46] adopted two Progressive Reconstruction of Visual Structure (PRVS) layers in the encoder-decoder network. To avoid excessive smoothness and blurring, Nazeri et al. [47] proposed the EdgeConnect method, which first generates an edge map and then learns prior knowledge from the edge map to obtain the result. To ensure the semantic relevance and context continuity of the edges of the missing part, Liu et al. [48] [52]. To inpaint RS Sea Surface Temperature (SST) images. Zhang et al. [53] used a space-time-spectral framework based on CNN to inpaint periodic stripes and thick clouds in satellite RS images. Wong et al. [54] came up with an adaptive spectral feature extraction method to inpaint hyperspectral images using spectral features and spatial information.
As reviewed above, these researchers did not approach the inpainting task from the perspective of lightweight design, nor did they consider the complexity of the models and the inpainting time consumption. Table 1 shows the pretrained weight sizes of several inpainting models mentioned above. Their models often have large weights, which are not suitable for portable systems and thus cannot be directly applied to practical RS image inpainting. Unlike the existing inpainting methods, the proposed network LPIN is based on the innovative consideration of lightweight inpainting for RS images. As far as we know, this is the first time that an inpainting network has been designed from a lightweight perspective. To fully realize a lightweight design, we first disassemble the complex inpainting task into simple tasks by applying a multi-stage progressive architecture, which makes the inpainting task easier for each sub-stage. Afterwards, we adopt a weight sharing strategy among different stages to reduce the network parameters. Then, we enhance information transmission by adopting a residual architecture and multiple accesses of the input images to provide feature reuse in forward propagation and reduce gradient diffusion in backward propagation. Finally, we improve the inpainting performance by restricting the network with multiple loss functions. All these measures work together to give a lightweight but effective inpainting network. According to our experiments, the proposed lightweight network achieves state-of-the-art inpainting performance without complex modules and architectures.

Contribution
Our contributions are summarized as follows:

1.
A novel approach to improving the robustness of RSISC tasks is proposed, which combines an image inpainting network with an existing RSISC method. Unlike commonly used image preprocessing methods, the approach focuses on image reconstruction to generate a completely new RS image. To our knowledge, this is the first attempt to apply an image inpainting method to improve the robustness of RSISC tasks.

2.
An inpainting network, LPIN, is proposed based on the novel consideration of lightweight design. Compared with existing inpainting models, the LPIN has a model weight of only 1.2 MB, which is much smaller than that of other models and more suitable for the practical application of RSISC tasks on small portable systems.
In spite of the small model weight, the LPIN still achieves state-of-the-art inpainting performance due to the progressive inpainting strategy, residual architecture and multiple accesses of the input images, which deepen the LPIN and guarantee a high feature and gradient transmission efficiency.

3.
The proposed approach has wide applicability and can be applied to various RSISC methods to improve their classification accuracy on images with defects. The results of extensive experiments show that, on RS images with defects, the proposed approach generally achieves a classification accuracy of more than 94% (maximum 99.9%) of that on images without defects. This proves that it can greatly improve the robustness of RSISC tasks.

Materials and Methods
A well-performing image inpainting network is essential for improving the robustness of RSISC tasks. Combined with this inpainting network, existing classification methods can achieve satisfactory results on images with defects. In this section, we focus on the architecture of the inpainting network LPIN. A basic Residual Inpainting Unit (RIU) is first proposed, with which the defect parts of RS images are preliminarily inpainted. The progressive architecture of the LPIN is then discussed in detail, from which the final inpainting results are obtained. Finally, various loss functions are used to generate more reasonable and realistic results. The overall architecture of the proposed approach is shown in Figure 3.

Basic Residual Inpainting Unit
The proposed RIU is a basic inpainting block with a residual architecture, as shown in Figure 3a. It inpaints the defects and generates a preliminary result. The input x_in of the RIU is concatenated with the defected image I_in to form a 6-channel feature, which is processed through two convolution layers, four Residual Blocks (ResBlocks) and one convolution layer in succession to produce a residual result x_res. The final output x_out of the RIU is obtained by a pixel-wise addition of x_res and I_in. All convolution layers in the RIU have 3 × 3 filters.
The features x in the ResBlocks are updated as follows:

x_{l+1} = σ(Res(x_l) + x_l), (1)

where Res denotes the ResBlock and σ is the ReLU activation function. The ResBlock creates shortcuts in the RIU, i.e., extra paths, which make the transmission of image features and gradients more efficient. The shortcuts allow a deep network and prevent underfitting and gradient diffusion. Firstly, only part of the feature information is extracted by a convolution layer during forward propagation; thus, the deeper the network is, the more information is lost, which causes underfitting. The shortcuts solve this problem by passing the features of a former block to the subsequent blocks to provide extra information for them. Besides, gradients in a deep network are prone to diffuse during backward propagation. With the shortcuts, the later blocks transfer to the former block not only the gradients but also the gradients before derivation. This means the gradients of each block are amplified, thus reducing the probability of gradient diffusion. Furthermore, small-filter convolutions are adopted in the RIU to reduce the number of network parameters. Although a large-filter convolution gives a better perception of image features, it widens the network and increases its parameter count, which restricts the depth of the network. According to reference [56], a stack of several small-filter convolution layers equals a single large-filter convolution layer in its ability to perceive images while requiring fewer parameters. Besides, as the defect regions in RS images are scattered and each single region is relatively small, a small-filter convolution is more flexible.

Figure 3. Overall architecture of the proposed approach. I_0 is the real image without defects, I_in is the corresponding image with defects, x_in is the input of the RIU, x_res is a residual output, x_out^i is the output of the i-th RIU, and the output of the last RIU is the final inpainting result I_out.
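The ResBlock shortcut described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: a plain linear map stands in for the 3 × 3 convolutions, and the shapes are toy-sized.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def res_block(x, w):
    # One toy ResBlock update, x_{l+1} = relu(Res(x_l) + x_l),
    # with a linear map standing in for the 3x3 convolutions.
    return relu(x @ w + x)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))                       # toy feature map
weights = [0.01 * rng.standard_normal((16, 16)) for _ in range(4)]
for w in weights:                                      # four stacked ResBlocks, as in the RIU
    x = res_block(x, w)
```

Because the input `x` is added back before the activation, each block's output keeps a direct copy of the earlier features, which is the feature-reuse and gradient-shortcut property the text relies on.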

Progressive Networks
A progressive multi-stage framework is adopted to form a deep network and achieve better performance. As shown in Figure 3b, the proposed LPIN is composed of several RIUs, where the output of one serves as the input of the next. With the LPIN, the inpainting task is split into several simpler sub-tasks, and each sub-task inpaints part of the defects progressively. The i-th RIU takes the output of the (i−1)-th RIU, x_out^{i−1}, as its input x_in^i, which means that the prior knowledge of the (i−1)-th RIU is transmitted to the i-th RIU. This makes it easier for the subsequent RIUs to inpaint the defects. Multiple RIUs directly increase the depth of the LPIN without increasing its width, owing to the small-filter convolution layers used. Multiple accesses of the input image I_in, shown as the blue dashed lines in Figure 3b, are provided to each RIU to guarantee sufficient depth for the LPIN. The input of the i-th RIU is a concatenation of I_in and its own input x_in^i, which takes full advantage of the valid features in I_in. Meanwhile, the output of the i-th RIU is a pixel-wise addition of I_in and its own residual output x_res^i, with which the RIU forms a larger-scale residual structure. With these multiple accesses, the LPIN establishes extra paths: features and gradients can be passed directly to the top/bottom layers during forward and backward propagation, and the LPIN can be deepened by adding more RIUs while reducing the risk of gradient diffusion.
Weight sharing is another way to guarantee sufficient depth for the LPIN, as the green dashed boxes in Figure 3b show. The more RIUs are adopted in the LPIN, the longer the gradient propagation chain is, which might lead to over-fitting and gradient explosion or diffusion during backward propagation. In our network, all RIUs share their weights: the parameters are updated after each iteration in the training process and then synchronized to every RIU at the same time. The weight sharing strategy ensures a deeper but lightweight model, reducing the number of parameters while preventing over-fitting and gradient explosion or diffusion.
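The progressive, weight-shared structure can be sketched as a loop that reuses one parameter set across every stage, with I_in re-concatenated at each stage and the residual added back onto I_in. Again this is a toy NumPy sketch under stated assumptions (a tanh-squashed linear map replaces the RIU's convolution stack), not the paper's network.

```python
import numpy as np

def riu(x_in, i_in, w):
    # Toy RIU: concatenate the stage input with I_in (the multi-access path),
    # produce a residual, and add it back to I_in pixel-wise.
    feat = np.concatenate([x_in, i_in], axis=-1)   # analogue of the 6-channel concat
    x_res = np.tanh(feat @ w)                      # placeholder for the conv/ResBlock stack
    return i_in + x_res

rng = np.random.default_rng(1)
i_in = rng.standard_normal((64, 3))                # defected input image (toy, 3 channels)
w_shared = 0.1 * rng.standard_normal((6, 3))       # ONE weight set shared by all stages

x = i_in
for _ in range(4):                                 # four progressive stages reuse the same weights
    x = riu(x, i_in, w_shared)
```

Note that adding stages deepens the computation without adding parameters, which is exactly the trade-off the weight sharing strategy targets.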

Loss Functions
In addition to the progressive architecture of the LPIN, multiple loss functions are adopted to achieve reasonable and realistic results, as shown in Figure 3c. The mathematical symbols are defined as follows: I_0 is the Ground Truth (GT) image without defects, I_in is the input image with defects, and M is a binary mask with which I_in is simulated via a Hadamard product, i.e., I_in = M ⊙ I_0. In this section, four types of loss functions are proposed to restrict the inpainted images on both the pixel and semantic level, as follows:
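The mask simulation I_in = M ⊙ I_0 is just an element-wise product; a tiny NumPy example with an illustrative 3 × 4 "image":

```python
import numpy as np

i0 = np.arange(12, dtype=float).reshape(3, 4)   # toy GT image I_0
m = np.ones_like(i0)                            # binary mask M (1 = valid pixel)
m[1, 1:3] = 0.0                                 # mark a small defect region
i_in = m * i0                                   # Hadamard product: I_in = M * I_0
```

Pixels where M = 0 are zeroed out and become the defect region the network must inpaint; pixels where M = 1 pass through unchanged.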

Reconstruction Loss
Reconstruction loss is used to measure the direct similarity between the inpainted image I_out and the corresponding GT image I_0. It is calculated with the negative Structural Similarity (SSIM) [57]. Unlike the commonly used Mean Square Error (MSE), SSIM comprehensively measures the difference in brightness, contrast and structure of two images instead of comparing them pixel by pixel. The SSIM of two images x and y is calculated as:

SSIM(x, y) = ((2 µ_x µ_y + C_1)(2 σ_xy + C_2)) / ((µ_x^2 + µ_y^2 + C_1)(σ_x^2 + σ_y^2 + C_2)), (2)

where µ_x, µ_y, σ_x^2, σ_y^2 and σ_xy are the mean of x, the mean of y, the variance of x, the variance of y, and the covariance of x and y, respectively, and C_1 and C_2 are two constants. The more similar x and y are, the larger the SSIM value is, and thus we use the negative SSIM value as the loss function. The reconstruction losses for the defect region and the valid region are calculated separately and can be formulated as:

L_def = −λ_def · SSIM(I_out^def, I_0^def), (3)

L_val = −λ_val · SSIM(I_out^val, I_0^val), (4)

where I_0^def, I_out^def, I_0^val and I_out^val are the defect region of I_0, the defect region of I_out, the valid region of I_0, and the valid region of I_out, respectively, and λ_def and λ_val are the loss function weights of L_def and L_val.
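A minimal NumPy sketch of the negative-SSIM reconstruction loss, using a single global window (the common implementation slides a local window; the constants here are illustrative, not the paper's):

```python
import numpy as np

def ssim_global(x, y, c1=1e-4, c2=9e-4):
    # Single-window SSIM over whole images:
    # ((2*mu_x*mu_y + C1)(2*sigma_xy + C2)) /
    # ((mu_x^2 + mu_y^2 + C1)(sigma_x^2 + sigma_y^2 + C2))
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def reconstruction_loss(out, gt):
    return -ssim_global(out, gt)   # negative SSIM, minimised during training
```

For identical images the SSIM equals 1 and the loss reaches its minimum of −1, so minimising the loss pushes the inpainted region toward the GT in brightness, contrast and structure simultaneously.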

Content Loss
Content loss, also known as perceptual loss [58], is a high-level semantic loss, which is extracted by a CNN and restricts the content difference of two images. In this work, due to its simple network architecture and good image classification performance, we adopt the Visual Geometry Group (VGG) network [56] with weights pretrained on ImageNet [59] as the high-level semantic feature extractor, as shown in the black dashed box in Figure 3c. The high-level features of the inpainted image I_out and the GT image I_0 are extracted from the conv1_2, conv2_2 and conv3_3 layers of VGG16 and then compared through a smooth L1 function S_L1 as the content loss L_cnt^out, which is formulated as:

L_cnt^out = Σ_i (1 / (h_i w_i c_i)) S_L1(φ_i(I_out), φ_i(I_0)), (5)

where φ_i is the i-th feature extractor of VGG16, and h_i, w_i and c_i are the height, width and channel number of the corresponding extracted feature. In addition, we also calculate a content loss L_cnt^comp between a composed output I_comp, which takes the valid region from I_0 and the defect region from I_out, and the GT image I_0 to maintain the consistency of the defect region border:

L_cnt^comp = Σ_i (1 / (h_i w_i c_i)) S_L1(φ_i(I_comp), φ_i(I_0)). (6)

The total content loss is:

L_cnt = L_cnt^out + L_cnt^comp. (7)
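The smooth-L1 comparison at the heart of the content loss can be sketched in NumPy. The feature lists below are placeholders for the VGG16 activations (running the real VGG16 is outside the scope of this sketch), and the averaging inside `smooth_l1` supplies the 1/(h·w·c) normalisation:

```python
import numpy as np

def smooth_l1(a, b):
    # Element-wise smooth-L1 (Huber, delta = 1), averaged over all entries,
    # so each layer's term is already normalised by its own h*w*c.
    d = np.abs(a - b)
    return np.mean(np.where(d < 1.0, 0.5 * d ** 2, d - 0.5))

def content_loss(feats_out, feats_gt):
    # Sum of per-layer smooth-L1 distances; the two lists stand in for the
    # conv1_2 / conv2_2 / conv3_3 activations of I_out and I_0.
    return sum(smooth_l1(fo, fg) for fo, fg in zip(feats_out, feats_gt))
```

Smooth L1 behaves like MSE for small feature differences and like L1 for large ones, which keeps gradients bounded when early feature maps differ strongly.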

Style Loss
Style loss [60] is another high-level semantic loss used to describe the style similarity of two images. It can eliminate checkerboard artifacts [47] and is insensitive to pixel position. The style losses of the output image, L_style^out, and of the composed image, L_style^comp, are calculated with a Gram matrix G, which is the covariance-like matrix of an image feature F with position insensitivity:

G(F) = (1 / (h w)) F F^T, (8)

where h and w denote the height and width of the feature and F is reshaped into a c × (h·w) matrix. The style loss is calculated as:

L_style^out = Σ_i S_L1(G(φ_i(I_out)), G(φ_i(I_0))), (9)

L_style^comp = Σ_i S_L1(G(φ_i(I_comp)), G(φ_i(I_0))), (10)

and the total style loss is L_style = L_style^out + L_style^comp.
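A NumPy sketch of the Gram-matrix computation behind the style loss. Assumptions: features have shape (c, h, w), the normalisation constant is c·h·w, and a plain L1 distance replaces the smooth-L1 comparison for brevity:

```python
import numpy as np

def gram_matrix(feature):
    # feature has shape (c, h, w): flatten the spatial dims and take the
    # channel-by-channel inner products, normalised by c*h*w. The result
    # discards pixel positions and keeps only channel correlations.
    c, h, w = feature.shape
    f = feature.reshape(c, h * w)
    return f @ f.T / (c * h * w)

def style_loss(feat_out, feat_gt):
    # L1 distance between the two Gram matrices.
    return np.abs(gram_matrix(feat_out) - gram_matrix(feat_gt)).mean()
```

Because the spatial dimensions are summed out, shifting a texture around the image leaves the Gram matrix essentially unchanged, which is the position insensitivity the text refers to.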

Total Variation Loss
Total Variation (TV) loss is a pixel-level regularization term that does not depend on the GT image. It restrains adjacent pixels of the output image to reduce noise and improve spatial smoothness. The TV loss is calculated as:

L_tv = Σ_{i,j} (|I_out^{i+1,j} − I_out^{i,j}| + |I_out^{i,j+1} − I_out^{i,j}|), (11)

where i and j are the pixel coordinates along the height and width. In summary, the total loss function is:

L_total = L_def + L_val + λ_cnt L_cnt + λ_style L_style + λ_tv L_tv, (12)

where λ_cnt, λ_style and λ_tv are the weights of L_cnt, L_style and L_tv, respectively.
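The anisotropic TV term is a few lines of NumPy; the averaging over the image size here is a convention choice of this sketch:

```python
import numpy as np

def tv_loss(img):
    # Anisotropic total variation: absolute differences between vertically
    # and horizontally adjacent pixels, averaged over the image size.
    dv = np.abs(img[1:, :] - img[:-1, :]).sum()
    dh = np.abs(img[:, 1:] - img[:, :-1]).sum()
    return (dv + dh) / img.size
```

A constant image has zero TV loss, while isolated noisy pixels raise it, so minimising this term smooths the inpainted output without referencing the GT.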

Results
In this section, the experimental procedures are explained in detail. The datasets for training and testing the network are first described and the training process is then demonstrated. The hyper-parameter settings are discussed afterwards. Finally, the experimental results of image inpainting and scene classification are presented.

RS Image Datasets
Three representative RS image datasets for RSISC tasks are used to test the inpainting performance of LPIN and classification ability of the proposed approach: the most challenging NWPU-RESISC45 [20], the most widely used UC Merced Land-Use [2] and the most complex AID [21].
• NWPU-RESISC45 dataset was released by Northwestern Polytechnical University in 2017 and is the largest and most challenging dataset for RSISC tasks. It contains 31,500 images extracted from Google Earth with a fixed size of 256 × 256 and has 45 scene categories with 700 images in each category. The 45 categories are airplane, airport, baseball diamond, basketball court, beach, bridge, chaparral, church, circular farmland, cloud, commercial area, dense residential, desert, forest, freeway, golf course, ground track field, harbor, industrial area, intersection, island, lake, meadow, medium residential, mobile home park, mountain, overpass, palace, parking lot, railway, railway station, rectangular farmland, river, roundabout, runway, sea ice, ship, snowberg, sparse residential, stadium, storage tank, tennis court, terrace, thermal power station, and wetland. Some example images from the NWPU-RESISC45 dataset are shown in Figure 4.
• UC Merced Land-Use dataset was released by University of California, Merced in 2010 and is the most widely used dataset for RSISC tasks. It contains 2100 RS images of 256 × 256 pixels extracted from USGS National Map Urban Area Imagery collection and has 21 categories with 100 images per category. The 21 categories are agricultural, airplane, baseball diamond, beach, buildings, chaparral, dense residential, forest, freeway, golf course, harbor, intersection, medium residential, mobile home park, overpass, parking lot, river, runway, sparse residential, storage tanks and tennis court. Some example images from the UC Merced Land-Use dataset are shown in Figure 5.
• AID dataset was released by Wuhan University in 2017 and is one of the most complex datasets for RSISC tasks due to the images being extracted from different sensors and their pixel resolution varying from 8 m to 0.5 m. It contains 10,000 RS images of 600 × 600 pixels extracted from Google Earth imagery and has 30 scene categories with about 220 to 400 images per category. The 30 categories are airport, bareland, baseball field, beach, bridge, center, church, commercial, dense residential, desert, farmland, forest, industrial, meadow, medium residential, mountain, parking, park, playground, pond, port, railway station, resort, river, school, sparse residential, square, stadium, storage tanks and viaduct. Some example images from the AID dataset are shown in Figure 6.

Mask Datasets
An RS image with defects is simulated through a Hadamard product of a binary mask and a GT image. The masks of periodic stripes and random noises are randomly generated with Python. The masks of thick cloud are extracted from the NVIDIA Irregular Mask dataset [43]. All three mask datasets contain 2000 images with a 20–30% hole-to-image area ratio. Some example images from the three mask datasets are shown in Figure 7.
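The defect simulation described above can be sketched as follows; a minimal NumPy illustration in which the array sizes and the random mask generator are our own stand-ins, not the paper's mask pipeline:

```python
import numpy as np

rng = np.random.default_rng(1)
# stand-in GT image and a random binary mask (1 = valid pixel, 0 = defect)
gt = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)
mask = (rng.random((256, 256, 1)) > 0.25).astype(np.uint8)

# Hadamard (element-wise) product zeroes out the defect pixels
defect_img = gt * mask

hole_ratio = 1.0 - mask.mean()  # fraction of the image covered by defects
print(f"hole-to-image area ratio: {hole_ratio:.2f}")
```

The 0.25 threshold gives roughly the 20–30% hole ratio used in the paper's mask datasets; the real stripe, noise, and cloud masks are structured rather than i.i.d. random.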


Environments
The experiments are carried out on two NVIDIA Titan RTX 24G GPUs with Ubuntu 18.04, Pytorch 1.6.0 and CUDA 10.1. The LPIN is optimized by the ADAM optimizer with a learning rate of 0.002. All training images are cropped and resized to 3 × 256 × 256 pixels. The Pytorch DistributedDataParallel API for multi-GPU training and the NVIDIA Apex tool for mixed precision are used to accelerate the training process.


Training Flow
The NWPU-RSISC45 dataset is randomly divided into two parts: 20% for training (6300 images) and the remaining 80% for testing (25,200 images), which means the training ratio is 20%. The training data is then augmented six times by rotating by 90°, 180°, and 270° and flipping horizontally and vertically, i.e., we use 37,800 images for training. In the training phase, LPIN samples minibatch images I_0 from the augmented training dataset and normalizes them using the mean and standard deviation of ImageNet. Then, LPIN samples minibatch binary masks M from the mask dataset and performs a Hadamard product of I_0 and M to get the RS images with defects, namely the input images I_in. The RIU of each stage inpaints the defects progressively and the final RIU outputs the inpainting result of LPIN, I_out. Finally, the loss functions are used to constrain the value of I_out on the pixel and semantic levels. The detailed inpainting training flow is formally presented in Algorithm 1; its core per-stage steps are:

x_res_i = f_i(x_in_i)
x_out_i = x_res_i + I_in
x_in_(i+1) = cat(x_out_i, I_in)

after which the output of the last RIU is taken as I_out = x_out_N, the loss functions are calculated, and the parameters are updated by the ADAM optimizer.

Meanwhile, we choose six existing RSISC methods and train them on the same training dataset. Then, the GT images and the images with defects are sent to each RSISC method, as well as to the proposed methods which combine the LPIN with the RSISC methods, to test their classification accuracy, as shown in Figure 8.
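The per-stage recurrence of Algorithm 1 can be sketched as follows; `riu` here is a trivial stand-in for the shared residual inpainting unit (the real RIU is a small convolutional network), so only the data flow between stages, not the learning, is illustrated:

```python
import numpy as np

def riu(x):
    # hypothetical stand-in for the shared RIU: maps the concatenated
    # input back to image channels and predicts a residual
    return 0.5 * x[:1]

def lpin_forward(i_in, n_stages=7):
    x_in = i_in
    for _ in range(n_stages):
        x_res = riu(x_in)                             # x_res_i = f_i(x_in_i)
        x_out = x_res + i_in                          # residual connection to the input
        x_in = np.concatenate([x_out, i_in], axis=0)  # x_in_(i+1) = cat(x_out_i, I_in)
    return x_out                                      # I_out = x_out_N

i_out = lpin_forward(np.ones((1, 4, 4)))
print(i_out.shape)  # (1, 4, 4)
```

Note how the defect image I_in is re-injected at every stage, both through the residual addition and through the channel concatenation; this is the "multiple accesses of the input images" discussed later in the paper.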

Hyper-Parameters Tuning

Weight Value of Loss Term
L_def, L_val, L_cnt, and L_style are used to train the network separately and their values roughly converge to the following numbers: 3 × 10^−2, 1 × 10^−4, 2, and 7 × 10^−5, respectively, as shown in Figure 9. The L_tv is not involved in the training, and its rough value, calculated at the start of training, is 2 × 10^−1. In order to prevent any single loss from having more impact on the training process than the others, each loss value is scaled to the same order of magnitude by multiplying it by its weight. Thus, the weight values are preliminarily set as: λ_def = 0.6, λ_val = 160, λ_cnt = 0.01, λ_style = 280, and λ_tv = 0.08. For the reconstruction loss, considering that the defect regions are expected to have a better inpainting performance than the valid regions, we properly increase λ_def and decrease λ_val. Then we randomly select 450 images from the NWPU-RSISC45 dataset, with 10 images per category, and conduct a random search test on them. The final weights acquired from the test are as follows: λ_def = 20, λ_val = 10, λ_cnt = 0.05, λ_style = 100 and λ_tv = 0.1.
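The magnitude-balancing step can be checked with quick arithmetic: multiplying each rough loss value from Figure 9 by its preliminary weight brings every term to the same order of magnitude, around 2 × 10^−2. A sanity check of the numbers in the text, not the paper's code:

```python
# rough converged loss values and the preliminary weights quoted in the text
raw_loss = {"def": 3e-2, "val": 1e-4, "cnt": 2.0, "style": 7e-5, "tv": 2e-1}
weight = {"def": 0.6, "val": 160, "cnt": 0.01, "style": 280, "tv": 0.08}

scaled = {name: raw_loss[name] * weight[name] for name in raw_loss}
for name, value in scaled.items():
    print(f"{name:>5}: {value:.4f}")  # every weighted term lands near 2e-2
```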

Number of RIUs in LPIN
Too few RIUs may lead to insufficient inpainting capability, while too many RIUs may cause network redundancy and do not improve the performance. Therefore, the number of RIUs need to be determined properly before training. To find the optimal number, tests on LPIN with different RIU numbers are conducted. The Mean Absolute Error (MAE), Peak Signal-to-Noise Ratio (PSNR), and SSIM are used to evaluate the inpainting performance. Among them, the SSIM index has been described in Section 2.3.1. MAE (also known as L1 error) is an index to compare the L1 distance of two images and , formulated as: where is the pixel value in each image. The lower MAE is, the more similar two images are. PSNR is one of the most widely used image evaluation indexes, which is calculated as: where , is the L2 distance of and , is the maximum pixel value of an image, which is 255 for images in this paper. The larger PSNR is, the more realistic the inpainted image is.
The tests are carried out on multiple LPINs with RIU number changing from one to eighteen. Each test runs for 20 epochs and the inpainting results are show in Figure 10. We can see that the inpainting performance improves at first, but hardly gets better after

Number of RIUs in LPIN
Too few RIUs may lead to insufficient inpainting capability, while too many RIUs may cause network redundancy and fail to improve the performance. Therefore, the number of RIUs needs to be determined properly before training. To find the optimal number, tests on LPIN with different RIU numbers are conducted. The Mean Absolute Error (MAE), Peak Signal-to-Noise Ratio (PSNR), and SSIM are used to evaluate the inpainting performance. Among them, the SSIM index has been described in Section 2.3.1. MAE (also known as L1 error) is an index comparing the L1 distance of two images x and y, formulated as:

MAE(x, y) = (1/N) Σ_i |x_i − y_i|

where i indexes the pixels of each image and N is the number of pixels. The lower the MAE is, the more similar the two images are. PSNR is one of the most widely used image evaluation indexes, which is calculated as:

PSNR(x, y) = 10 log10 ( MAX_I^2 / MSE(x, y) )

where MSE(x, y) is the mean squared (L2) distance of x and y, and MAX_I is the maximum pixel value of an image, which is 255 for the images in this paper. The larger the PSNR is, the more realistic the inpainted image is.
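Both indexes are straightforward to compute; a minimal NumPy sketch (our own illustration, with a toy 4 × 4 image):

```python
import numpy as np

def mae(x, y):
    # mean absolute (L1) error between two images
    return np.mean(np.abs(x.astype(np.float64) - y.astype(np.float64)))

def psnr(x, y, max_i=255.0):
    # peak signal-to-noise ratio in dB; max_i is the maximum pixel value
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_i**2 / mse)

gt = np.full((4, 4), 100, dtype=np.uint8)
pred = gt.copy()
pred[0, 0] = 110  # a single pixel off by 10
print(mae(gt, pred))             # 0.625
print(round(psnr(gt, pred), 2))  # 40.17
```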
The tests are carried out on multiple LPINs with the RIU number changing from one to eighteen. Each test runs for 20 epochs and the inpainting results are shown in Figure 10. We can see that the inpainting performance improves at first, but hardly gets better after 8 RIUs. The LPIN with 8 RIUs performs slightly worse than the one with 7 RIUs, which already reaches the performance bottleneck. Thus, we set the number of RIUs to 7 in the following experiments.

Image Inpainting Results
We compare our LPIN with 4 state-of-the-art semantic inpainting models: PGN [55], PCONV [43], PRVS [46] and RFR [49]. Each model is trained on the NWPU-RESISC45 dataset under the same condition and inpaints 3 types of defects in RS images: periodic stripes, random noises, and thick clouds (stripes, noises, and clouds for short, respectively).

Model Complexity Analysis
The complexity of the model and its parameters is measured by Floating Point Operations (FLOPs) and Bytes. The FLOPs value indicates the time consumption of training a model and reflects its time complexity. If the FLOPs is too high, the model cannot converge quickly. The Bytes value measures the number of parameters and reflects the space complexity of a model. The larger the Bytes is, the more data is needed to train the model, which means that the model easily overfits during training.
The complexity, the model weight and the inpainting speed of our LPIN are compared with other inpainting models and the results are shown in Table 2. The model complexity is calculated by the Pytorch Torchstat API. We can see that due to the lightweight design, i.e., the multi-stage network of LPIN, the concise architecture of RIU, and the weight sharing strategy, the proposed LPIN has the least model complexity, number of parameters and model weight as well as the fastest inpainting speed.
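As a rough illustration of how such numbers arise, the parameter count and multiply-accumulate FLOPs of a single convolution layer can be estimated analytically (the layer sizes below are hypothetical; tools such as Torchstat traverse the whole network to get the totals in Table 2):

```python
def conv2d_stats(c_in, c_out, k, h_out, w_out, bias=True):
    # parameters: one k x k kernel per (input, output) channel pair, plus biases
    params = c_out * c_in * k * k + (c_out if bias else 0)
    # FLOPs: 2 ops (multiply + add) per kernel weight per output position
    flops = 2 * c_in * k * k * c_out * h_out * w_out
    return params, flops

# e.g. a 3x3 convolution from 3 to 64 channels on a 256 x 256 feature map
params, flops = conv2d_stats(3, 64, 3, 256, 256)
print(params, flops)  # 1792 226492416
```

This makes clear why a multi-stage network with weight sharing, as in LPIN, keeps both the parameter count and the model weight small: the same small set of convolution kernels is reused at every stage.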

Quantitative Comparisons
The quantitative inpainting results are listed in Table 3. As can be seen, the LPIN has a good performance on stripe and noise inpainting and achieves a much better result than the other inpainting models. Its inpainting results for clouds are, however, slightly inferior to the former two. This is because the LPIN is a lightweight network and has the advantage of inpainting structured defects rather than defects with large holes.

Qualitative Comparisons
The inpainting results of the three types of defects are shown in Figure 11. The details in the blue box on each image are magnified and shown in the red box. We can see from the figure that despite the small parameter weight, the LPIN achieves a good visual inpainting performance on images with defects. Although the LPIN is inferior to PRVS in MAE for thick cloud inpainting, there is no significant visual difference.


Scene Classification Results
The previous section has proven the effectiveness of the LPIN in inpainting three types of defects. In this section, scene classification tests are carried out to verify the robustness improvement of RSISC tasks with the proposed method. The GT images and the corresponding images with defects are sent to six existing RSISC methods and the proposed combined methods, respectively, to test the classification accuracy. Considering that most RSISC methods are not publicly available, we choose three classical CNN classification networks: VGG16 [56], ResNet50 [61], and Inception V3 [62], and train them on the dataset starting from weights pre-trained on ImageNet. In addition, we also use three other models published by our team: HCV + FV [4], ADFF [12], and F 2 BRBM [16]. The tests are conducted on the NWPU-RESISC45 dataset with a training ratio of 20%, and the final results are reported as the mean and standard deviation of 10 random repeated experiments.

Classification Accuracy Results
Three measurements are used to evaluate the classification accuracy and robustness improvement of RSISC as follows: (1) Overall accuracy (OA), which is defined as the ratio of the number of correctly classified images to the total number of images. (2) Defect-to-GT Ratio (D2GR), which is a robustness index representing the ratio of the OA on images with defects to the OA on GT images. The more robust an RSISC method is, the higher its D2GR.
(3) Confusion matrix, also known as the error matrix, which is used to visualize the classification results of specific categories. In an ideal confusion matrix, all classification results are distributed only on the diagonal.
The OA and D2GR of the six existing RSISC methods and the corresponding combined methods with LPIN are shown in Table 4. We can see that the OA of the original RSISC methods decreases dramatically on the images with defects, while the proposed methods still perform well. Specifically, they generally achieve a D2GR of around 99% for periodic stripe defects and random noise defects, and around 95% for thick cloud defects, which proves a great improvement in the robustness of RSISC. The confusion matrix of the original F 2 BRBM on GT images is shown in Figure 12, and those of the original F 2 BRBM and the corresponding combined method F 2 BRBM + LPIN on images with defects are shown in Figure 13. It can be seen from the left column in Figure 13 that defects of all 3 types cause a large number of confusing items for the original F 2 BRBM. This is due to the defects cutting off the continuous semantic information of the RS images, leaving F 2 BRBM able to classify images only by their backgrounds. As Figure 14 shows, some defect images from two different categories have very similar backgrounds, which subsequently causes misclassification. For example, 25.9% of the airport images with stripe defects are classified as railway stations, 69.4% of the circular farmland images with noise defects are classified as rectangular farmlands and 39.2% of the desert images with cloud defects are classified as lakes.
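The first two measurements can be sketched in a few lines (our own illustration with toy labels, not the paper's evaluation code):

```python
import numpy as np

def overall_accuracy(y_true, y_pred):
    # OA: fraction of correctly classified images
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def d2gr(oa_defect, oa_gt):
    # Defect-to-GT Ratio: robustness of a method to image defects
    return oa_defect / oa_gt

labels = [0, 1, 2, 2]
oa_gt = overall_accuracy(labels, [0, 1, 2, 2])   # 1.0 on clean images
oa_def = overall_accuracy(labels, [0, 1, 1, 2])  # 0.75 on defect images
print(d2gr(oa_def, oa_gt))  # 0.75
```

A D2GR of 1 means the classifier is completely insensitive to the defects; the combined approach in Table 4 pushes D2GR close to this ideal.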
The LPIN eliminates the semantically irrelevant defects, generates pixels that carry contextual semantic information, improves semantic coherence, and provides more information for F 2 BRBM, thus increasing the classification accuracy.
According to the right column of Figure 13, the classification accuracy of airports with stripe defects, circular farmlands with noise defects and deserts with cloud defects increases to 91.4%, 97.0% and 97.7%, respectively. It is also worth noting that the LPIN does not bring additional semantic information; therefore, the accuracy of categories that are apt to be misclassified in the first place is not much improved. For example, the original F 2 BRBM has a 15.2% probability of misclassifying palaces as churches on the GT images, and the combined method still has a 15.7% misclassification probability on the images with stripe defects.

OA Comparison of Different Inpainting Models
The OA results on images with defects of the original F 2 BRBM combined with different inpainting models are shown in Table 5 and Figure 15. We can see that the LPIN performs better than the other inpainting models on images with stripe and noise defects, but slightly inferior to PRVS and RFR on images with cloud defects. The LPIN strengthens the utilization of image contextual information through its residual architecture and the multiple accesses to the input images. Stripe and noise defects are small and scattered. They cut off continuous semantic information but retain the global content and local contextual information. Therefore, LPIN performs well when processing images with small and scattered defects. In contrast, cloud defects are relatively large and concentrated. Images with cloud defects lose contextual information, and the inpainting model needs a more comprehensive understanding of the local semantic information for them to be classified correctly. Therefore, the inpainting performance of LPIN on images with large and concentrated defects is not as good as on images with stripe and noise defects, and the improvement in classification accuracy is not as obvious as for the other two.



Generalization Ability Results of Different Datasets
The LPIN trained on the NWPU-RSISC45 dataset is directly applied to the UC Merced Land-Use dataset and the AID dataset to test the generalization ability of the proposed method. The inpainting results are shown in Table 6 and Figure 16. The OA and D2GR of F 2 BRBM and the corresponding combined method F 2 BRBM + LPIN on these two datasets are calculated as shown in Table 7 and Figure 17. We can see that the LPIN still has a good inpainting performance, and the combined method with LPIN can correct most misclassified items. It reaches a high level of D2GR on different datasets with defects and significantly increases the robustness of RSISC, which proves the good generalization ability of the proposed method.


Image Inpainting Ablation Studies
In this section, several ablation tests of the network architecture and the loss functions are carried out to analyze their influence on inpainted image quality. Each test is trained on NWPU RSISC-45 dataset for 20 epochs.

Network Architecture: With vs. without LSTM
The multi-stage network LPIN can be regarded as a time series structure. Due to the good performance of Long Short-Term Memory (LSTM) [63] on time series prediction [7], we adopt LSTM in LPIN to try to strengthen the inner connection between different RIU stages. The results are shown in Table 8, from which we can see that the inpainted image quality for LPIN with LSTM is worse than that for LPIN without LSTM. Although LSTM provides extra transmission of image features among RIUs, it increases the network weights. As a result, the number of its parameters exceeds that of the LPIN main body, which lowers the parameter gradient updating efficiency of image inpainting.

Reconstruction Loss: L1 vs. Negative SSIM
L1 and negative SSIM are two commonly used reconstruction losses for constraining two images. The results of LPIN with L1 or with negative SSIM are shown in Table 9. We can see that LPIN with the negative SSIM loss achieves better inpainting quality. The L1 loss constrains two images pixel by pixel; therefore, if two images only have a slight difference in brightness, their L1 distance might be very large. The negative SSIM overcomes this drawback by measuring the differences in brightness, contrast and structure of two images, which makes it a more effective reconstruction loss function.

Feature Extractor: VGG16 vs. ResNet50
VGG16 and ResNet50 are two well-performing feature extractors [23]. Different features extracted from them are used to calculate the content and style losses and compare the inpainting results, which are shown in Table 10. It can be seen that there is little difference between the two extractors in inpainting quality. We believe that ResNet50 pays more attention to the low-level features, which are extracted from the last few layers of the network and are useless for LPIN, since our network needs the high-level semantic features, which are extracted from the first several layers. Besides, VGG16 has a simpler structure and smaller convolution filters in the first several layers than ResNet50. As a result, we choose VGG16 as our feature extractor.

Feature Layer: Convolution vs. Maxpooling
The convolution layers and the maxpooling layers of VGG16 can both extract image features. A test on the two layer types is carried out and the results are shown in Table 11. It can be seen that the convolution layer gives a slightly better inpainting result. We speculate that partial pixels of the feature maps are discarded when passing through the maxpooling layer, so some effective information cannot be transmitted to the extractor. As a result, we choose the convolution layers to extract image features.

Inpainting Results of Images with Hybrid Defecs
It is not always the case that only one type of defect exists on real RS images; sometimes several types of defects appear on an image at the same time. Therefore, a test on images with combined defects is also conducted. The quantitative image inpainting results are shown in Table 12 and the qualitative visual results in Figure 18.

The OA and D2GR of the F2BRBM method and the corresponding combined method F2BRBM+LPIN on the NWPU-RESISC45 dataset with combined defects are shown in Table 13. The OA of the original F2BRBM decreases dramatically to only around 9%, whereas the proposed combined method still achieves a D2GR of around 90%, which demonstrates its strong robustness. The corresponding confusion matrices of these two methods on images with defects of all three types are shown in Figure 19.

As Figure 19a shows, many defective RS images are misclassified mainly into three categories: chaparral, harbor and parking lot, all of which have scattered features, i.e., the scattered trees in a chaparral, the scattered ships in a harbor and the scattered vehicles in a parking lot. As Figure 20a shows, we believe that the combination of the three defects adds many scattered patches to the RS images and makes them visually similar to these three categories, which is why the misclassification concentrates there. According to the confusion matrix in Figure 19a, 51.5% of the church images are classified as parking lots, 48.7% of the beach images as harbors, and 68.9% of the wetland images as chaparrals. With the help of the LPIN, however, the scattered disturbed patches are inpainted and the classification accuracy improves correspondingly, as Figure 20b shows. According to Figure 19b, the classification accuracy of churches, beaches and wetlands rises to 80.1%, 96.1% and 82.5%, respectively.
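The per-class percentages above are read off the confusion matrices. As a minimal, framework-free sketch of how such a matrix and the derived accuracies could be computed, the helpers below are illustrative assumptions rather than the paper's implementation; in particular, D2GR is assumed here to be the ratio of overall accuracy on defect images to that on clean images, which matches how it is used in the text:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Row i, column j counts samples of true class i predicted as class j."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def per_class_accuracy(cm):
    """Fraction of each true class predicted correctly (diagonal / row sum)."""
    return np.diag(cm) / cm.sum(axis=1)

def overall_accuracy(cm):
    """OA: correctly classified samples over all samples."""
    return np.trace(cm) / cm.sum()

def d2gr(oa_defect, oa_clean):
    """Assumed definition: OA on defect images relative to OA on clean images."""
    return oa_defect / oa_clean

# Toy example with 3 classes.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 2, 0, 1, 1, 2]
cm = confusion_matrix(y_true, y_pred, 3)
print(per_class_accuracy(cm))   # class-wise recall, e.g. Figure 19-style entries
print(overall_accuracy(cm))
```

An entry such as "51.5% of the church images are classified as parking lots" corresponds to one off-diagonal cell of the row-normalized matrix `cm / cm.sum(axis=1, keepdims=True)`.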

Conclusions
In this paper, a lightweight progressive inpainting network named LPIN is proposed to provide a purified input for RSISC methods. Compared with other state-of-the-art inpainting networks, the LPIN achieves better inpainting performance with a simpler structure, a lighter weight of only 1.2 MB and a faster inpainting speed of 67.29 fps, which makes it possible to deploy it on small portable devices. The LPIN is then combined with existing RSISC methods to form a combined classification approach, which effectively improves the classification accuracy of the existing methods on images with defects. The combined approach keeps the classification accuracy on RS images with defects comparable to that on images without defects, thus improving the robustness of high-resolution RSISC tasks. Experimental results on different datasets show that the proposed method also has good generalization ability.
There are three main limitations in our work. Firstly, RS images of a fixed region can be captured by satellites multiple times, so they carry not only spatial but also temporal information; in this work, however, we focus on single RS image inpainting and use only the spatial information. Secondly, acquiring real RS images with defects from satellites takes considerable effort, so we train and test the LPIN only on standard datasets. Finally, since an image inpainting network needs to capture both the global semantic and the local contextual information of an image, the LPIN requires larger datasets and more training iterations than other types of networks, such as image classification networks.
In our following work, we plan to carry out image inpainting research using historical RS images with temporal information, obtain real RS images from satellites, adopt more deep learning regularization techniques in the training process, such as early stopping [23], which we believe can reduce overfitting, and further enhance the generalization ability of our model while lowering its time consumption.
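Early stopping, as mentioned above, can be sketched in a few lines. The following framework-free illustration (the function name and the patience value are our own choices, not part of the paper) monitors per-epoch validation losses and halts training once the loss has not improved for a fixed number of epochs:

```python
def train_with_early_stopping(val_losses, patience=3):
    """Stop when the validation loss has not improved for `patience` epochs.

    `val_losses` is an iterable of per-epoch validation losses, supplied
    directly here to keep the sketch independent of any training framework.
    Returns the epoch index of the best loss and the best loss itself.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break  # no improvement for `patience` epochs: stop training
    return best_epoch, best_loss

# The loss improves up to epoch 2, then stagnates; training stops before
# ever reaching the final epoch.
print(train_with_early_stopping([1.0, 0.8, 0.7, 0.75, 0.76, 0.77, 0.5]))
```

In a real training loop, the model weights from `best_epoch` would be restored before evaluation, which is the mechanism by which early stopping curbs overfitting.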