Techniques for the Automatic Detection and Hiding of Sensitive Targets in Emergency Mapping Based on Remote Sensing Data

Abstract: Emergency remote sensing mapping can provide support for decision making in disaster assessment and disaster relief, and therefore plays an important role in disaster response. When processing sensitive targets, traditional emergency remote sensing mapping methods rely on manual retrieval and image-editing tools. Although these traditional methods can achieve target recognition, they are inefficient and cannot meet the high time-efficiency requirements of disaster relief. In this paper, we combine an object detection model with a generative adversarial network model to build a two-stage deep learning model for the detection and hiding of sensitive targets in remote sensing images, and we verify the model's performance on the aircraft-target processing problem in remote sensing mapping. To improve performance, we introduce a modification of the reconstruction loss function, candidate-frame optimization in the region proposal network, the PointRend algorithm, and a modified attention mechanism based on the characteristics of aircraft targets. Experiments reveal that our method is more efficient than traditional manual processing: the precision reaches 94.87%, the recall is 84.75% higher than that of the original mask R-CNN model, and the F1-score is 44% higher than that of the original model. In addition, our method can quickly and intelligently detect and hide sensitive targets in remote sensing images, thereby shortening the time needed for emergency mapping.


Introduction
Remote sensing images have the characteristics of wide coverage and high timeliness of data collection. They can provide timely and effective image surveying and mapping data for disaster areas and support governments and rescue agencies at all levels in emergency decision making, disaster assessment, and rescue deployment [1]. There have been many studies and applications of remote sensing mapping for emergency disaster response. For example, Yang Kejian et al. [2] discussed the application of remote sensing mapping technology in flood monitoring and evaluation; Fan Yida et al. [3] used remote sensing images to study an emergency disaster assessment method for the Wenchuan earthquake; and Xue Tengfei et al. [4] proposed an automatic mapping method based on remote sensing for earthquake emergencies. Recently, automatic mapping of disaster-affected areas has been studied using approaches based on the convolutional neural network (CNN). Hacıefendioğlu et al. [5] used a pretrained faster R-CNN to detect earthquake-induced ground failure areas and damaged structures, and Ghorbanzadeh et al. [6] applied two main deep-learning CNN streams combined with the Dempster-Shafer model to automatic landslide mapping.
In remote sensing mapping, according to China's Surveying and Mapping Law [7] and State Secrets Law [8], sensitive geographic targets that are relevant to national security, including military installations, large-scale weapons and equipment, secret agencies, and nuclear facilities, must be hidden before remote sensing images can be publicly released and used. At present, most processing methods rely on manual or semi-automatic methods of finding specific sensitive targets and using image editing tools to capture the background textures around these targets to cover and fill in the target areas. In the emergency remote sensing mapping scenario, however, this approach is not efficient or robust; the results are easily affected by the operator's skill, and the hiding effect is not ideal. These shortcomings restrict the rapid release and use of remote sensing image map products, preventing this approach from meeting the timeliness requirements for disaster emergency response. Therefore, how to automatically detect and hide sensitive targets is a subject worthy of study.
With the development of deep learning models and methods, extensive efforts have been made to achieve the automatic detection and hiding of objects. At present, the process of target recognition and detection is usually performed by machine learning algorithms. Feature extraction algorithms for application to remote sensing images can be divided into algorithms for low-level, mid-level, and high-level feature extraction. Low-level feature extraction algorithms extract a certain aspect of an image, such as gradient, color, or texture. Such algorithms include the histogram of oriented gradients (HOG) algorithm [9], the scale-invariant feature transform (SIFT) [10], local binary patterns (LBP) [11], and speeded-up robust features (SURF) [12]. Mid-level feature extraction refers to a combination of multiple low-level feature extraction algorithms to improve the expressiveness of the extracted features. Representative methods include MultiFtr [13], integral channel features (ChnFtrs) [14], and the fastest pedestrian detector in the west (FPDW) [15].
Traditional machine learning technology does not perform well when processing raw unstructured data. With the goal of extracting more hierarchical, abstract, and high-level features, a CNN can be regarded as a feature learning tool that can learn rich hierarchically structured feature representations from raw data. With these features, great performance improvements can be achieved in target recognition and detection and even in other vision tasks. Therefore, CNNs have undergone rapid development and have seen widespread use in computer vision applications. In the past two decades, a variety of effective detection frameworks have been developed to organically combine positioning, feature extraction, and other auxiliary algorithms for efficient detection, such as the fast region-based CNN (R-CNN) [16], faster R-CNN [17], mask R-CNN [18], you-only-look-once (YOLO) [19], and the single-shot multibox detector (SSD) [20]. Extensive efforts have been made to study how to use these deep CNNs for high-resolution remote sensing (HRRS). Researchers [21,22] have shown that transfer learning provides a powerful tool for remote sensing scene classification; the features from pretrained CNNs generalize well to HRRS datasets and are more expressive than low- and mid-level features. Because HRRS classification always needs a large amount of labeled data and cannot recognize images from an unseen scene class without visual samples in the labeled data, zero-shot scene classification has been used to overcome this drawback and recognize images from unseen scene classes [23,24]. CNNs have been increasingly established as adaptive methods for new challenges in the field of earth observation (EO). Hoeser et al. provided a comprehensive overview of the impact of CNNs on EO applications [25,26].
Mask R-CNN is currently the most advanced deep learning model for this purpose, with powerful object detection and instance segmentation capabilities. It has achieved promising progress in natural image recognition. However, due to the resolution, shadowing, scale, and data volume of high-resolution remote sensing images, deep learning models are not very effective for instance segmentation in remote sensing images, and because the mask size generated by mask R-CNN is 14 × 14 pixels or 28 × 28 pixels, a large amount of detailed information is lost due to the high zoom ratio at the boundaries of large objects.
Image inpainting is a technology for inferring the information content of a missing image area based on the known information around it and then repairing it to make the image complete. This technology is widely used for repairing damaged photographs, covering redundant information, hiding sensitive targets, and similar tasks [27]. In essence, the hiding of sensitive objects in remote sensing images is a form of image restoration processing. Traditional methods fall into two main categories: the first is based on partial differential equations, such as the self-adaptive curvature repair algorithm based on the curvature-driven diffusion (CDD) model proposed by Yin Yong et al. [28], and the second consists of sample-based methods, such as the PatchMatch algorithm proposed by Barnes et al. [29]. A generative adversarial network (GAN) model is an unsupervised model that uses random strategies to continuously learn and understand the abstract structures and high-dimensional semantic information of images through training. However, remote sensing images have the characteristics of complex backgrounds, diverse scales, and a large amount of information. When a GAN model is used to repair large-format remote sensing images, the internal structures and textures of the repaired areas typically have a high degree of similarity, while the edge parts and other background areas exhibit obvious differences.
To address the low accuracy of mask R-CNN in remote sensing image instance segmentation and the large differences at the edges of GAN-repaired patches, this paper proposes corresponding solutions. First, for mask R-CNN, we modify the loss function and optimize the region proposal network (RPN)-derived candidate frames to improve the accuracy of target detection; second, we introduce the PointRend [30] algorithm to enhance the accuracy at boundaries; and third, we combine the improved mask R-CNN model with a GAN model and apply the resulting method for the automatic detection and hiding of sensitive targets in remote sensing images. Experiments show that the accuracy and efficiency of this method in emergency remote sensing mapping are greatly improved as compared with traditional methods.

Overall Framework
The overall framework is shown in Figure 1. The model processing procedure is as follows: ① Input the original images and pretrained target weight parameters; then, use the mask R-CNN + PointRend model to quickly retrieve and screen the images to detect whether sensitive targets exist. ② If no sensitive targets exist, the images can be used directly for subsequent production; if sensitive targets do exist, those targets are located and segmented, and the coordinates of the sensitive targets and the corresponding masks are output. ③ Use the Deepfill model to hide the marked areas in accordance with the output masks to generate unclassified images for subsequent production.

Mask R-CNN
The mask R-CNN model is a deep CNN model proposed in 2017 by He et al. [18] for target detection tasks. It is based on the faster R-CNN model and additionally includes a branch network that predicts a segmentation mask for each region of interest (ROI) in order to generate high-quality masks for targets. The network structure of the mask R-CNN model, which is shown in Figure 2, includes a feature extraction network (a residual network, ResNet) [32], a feature pyramid network (FPN) [33], an RPN, and an ROIAlign and pixel segmentation network (a fully convolutional network, FCN) [34]. The conv2_x, conv3_x, conv4_x, and conv5_x structural blocks in the ResNet produce four feature maps representing different levels of semantic information of the targets. The FPN performs summation operations on the feature maps at different levels to obtain high-level semantic information while preserving the spatial information of the targets. The combination of the feature maps is shown in Equation (1):

P_i = U(P_{i+1}) ⊕ f(C_i), i = 2, 3, 4, (1)

where U is the upsampling convolution operation, f is the convolution operation, and ⊕ represents the summation of the values at corresponding positions of the feature-map matrices.
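The feature-map combination of Equation (1) can be sketched in a few lines of NumPy. Here, upsample2x, conv1x1, and the toy shapes are illustrative stand-ins for the actual U and f operations, not the model's implementation:

```python
import numpy as np

def upsample2x(p):
    """Nearest-neighbor 2x upsampling of a (H, W, C) feature map (stand-in for U)."""
    return p.repeat(2, axis=0).repeat(2, axis=1)

def conv1x1(c, w):
    """1x1 convolution: a per-pixel linear map over channels (stand-in for f)."""
    return c @ w  # (H, W, C_in) @ (C_in, C_out)

def fpn_merge(p_next, c_i, w):
    """P_i = U(P_{i+1}) (+) f(C_i): upsample the coarser map and add the lateral map."""
    return upsample2x(p_next) + conv1x1(c_i, w)

# Toy shapes: P3 is 4x4 with 8 channels; C2 is 8x8 with 16 channels reduced to 8.
p3 = np.zeros((4, 4, 8))
c2 = np.ones((8, 8, 16))
w = np.ones((16, 8)) / 16.0
p2 = fpn_merge(p3, c2, w)
print(p2.shape)  # (8, 8, 8)
```

The element-wise addition requires the upsampled map and the laterally convolved map to share the same spatial size and channel count, which is why the FPN reduces every C_i to a common channel dimension first.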

Region Proposal Network (RPN)
The RPN takes five feature maps of P2, P3, P4, P5, and P6 as input and generates rectangular candidate regions of five different scales and three different aspect ratios at each position in a sliding window manner. The default scales are (32, 64, 128, 256, and 512), and the aspect ratios are (0.5, 1.0, and 2.0).
According to the possible combinations of the different scales and aspect ratios listed above, approximately 36,000 candidate regions are generated on the image. The coordinates of each candidate area and the confidence that it is a foreground or background area are calculated, and 6000 candidate areas are reserved in accordance with their degrees of confidence. Finally, by using the strategy of non-maximum suppression (NMS) [15], 2000 ROI regions are obtained.
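The candidate-filtering step can be illustrated with a minimal pure-Python NMS. The boxes, scores, and the 0.5 threshold below are toy values for illustration, not the model's settings:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop any remaining box that overlaps a kept box above the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) < thresh for k in keep):
            keep.append(i)
    return keep

# Two heavily overlapping candidates and one distant one.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box (IoU ≈ 0.68 with the first) is suppressed, while the distant box survives; applied to the 6000 retained candidates, this is how the 2000 final ROI regions are selected.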

RoIAlign
In the newly added branch of the mask R-CNN model, an FCN is used to calculate the pixel values of the masks with a threshold of 0.5. The RoIAlign layer generates a 14 × 14 feature map for each ROI and uses bilinear interpolation to calculate the mask boundaries.
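The bilinear interpolation used by RoIAlign at fractional sample positions can be sketched as follows. The bilinear function and the toy feature map are illustrative, not the RoIAlign implementation:

```python
def bilinear(fmap, y, x):
    """Bilinearly interpolate a 2-D feature map at a fractional location (y, x):
    blend the four surrounding grid values by their distances."""
    y0, x0 = int(y), int(x)
    y1 = min(y0 + 1, len(fmap) - 1)
    x1 = min(x0 + 1, len(fmap[0]) - 1)
    dy, dx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bot = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bot * dy

fmap = [[0.0, 1.0],
        [2.0, 3.0]]
print(bilinear(fmap, 0.5, 0.5))  # 1.5
```

Because the sample position is never snapped to the nearest integer cell, the mask boundary avoids the quantization error of the earlier RoIPool operation.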

PointRend
CNNs for image segmentation typically operate on regular grids. The input image is a regular grid of pixels, the hidden representations are feature vectors on a regular grid, and the outputs are label maps on regular grids [31]. This tends to result in oversampling in low-frequency areas (belonging to the same object) and undersampling in high-frequency areas (the edges of objects). In instance segmentation, the pixels that are most likely to be misjudged by the model are those at the edges of objects [35].
PointRend [31] provides a way to solve the image segmentation problem by treating it as a rendering problem. Efficient procedures such as subdivision [36] and adaptive sampling [37] refine a coarse rasterization in areas where pixel values have larger variance.
Ray-tracing renderers often use oversampling [38], a technique that samples some points more densely than the output grid to avoid aliasing effects. Here, we apply classical subdivision to image segmentation: the fuzzy segmentation points along each target edge are further predicted in a process called fine segmentation. A flowchart of the PointRend algorithm is shown in Figure 3. The main process is as follows:
1. Generate a mask prediction (coarse prediction) from a lightweight coarse mask prediction head.
2. Select the N "points" that are most likely to differ from their surrounding points (such as points at the edge of an object).
3. For each selected point, extract a "feature representation". This representation consists of two parts, i.e., fine-grained features, which are obtained through bilinear interpolation on the low-level feature map (similar to RoIAlign), and high-level features (the coarse prediction), which are obtained in Step 1.
4. Use a simple multilayer perceptron (MLP) to compute a new prediction from the feature representations and update coarse_prediction_i to obtain coarse_prediction_i+1.
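Step 2 of the process above, point selection, can be sketched as follows. select_uncertain_points and the toy probability grid are hypothetical names used for illustration; uncertainty is taken as closeness of the foreground probability to 0.5:

```python
def select_uncertain_points(coarse_probs, n):
    """Pick the n grid points whose foreground probability is closest to 0.5,
    i.e., the points the coarse mask is least sure about (object edges)."""
    flat = [(abs(p - 0.5), (i, j))
            for i, row in enumerate(coarse_probs)
            for j, p in enumerate(row)]
    flat.sort()
    return [idx for _, idx in flat[:n]]

# Toy 3x3 coarse prediction: confident interior, confident background,
# and a fuzzy diagonal band in between.
coarse = [[0.95, 0.90, 0.10],
          [0.92, 0.55, 0.05],
          [0.48, 0.40, 0.02]]
print(select_uncertain_points(coarse, 2))  # [(2, 0), (1, 1)]
```

The selected points lie on the fuzzy band between foreground and background, which is exactly where the MLP in Step 4 re-predicts labels from the concatenated fine-grained and coarse features.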

Deepfill
The GAN-based image inpainting method has the problem that an inpainted area often has a distorted structure or fuzzy texture that is inconsistent with its surroundings. To address this problem, Jiahui Yu et al. [32] proposed a new method based on a deep generative adversarial network model in their article "Generative Image Inpainting with Contextual Attention". This model, namely the Deepfill model, can repair missing areas in an image while using the features of the surrounding parts of the image as a reference during network training to extract image content at a longer distance, thereby effectively solving the problems of edge structure distortion and texture blurring in the repaired image.

Coarse-Refinement Two-Stage Network
The network structure of the Deepfill model, which is shown in Figure 4, consists of two stages, i.e., a coarse network and a refinement network [39].
In the first stage, a CNN with dilated convolution is used to continuously learn, predict, and update the values of the convolutional weights in the missing area to finally obtain coarse repair results. The dilated convolution method can expand the receptive field of a convolution kernel to allow the kernel to capture a larger range of information.
Then, the coarse repair results are passed to the refinement network that serves as the second stage. The refinement network consists of the following two parallel convolutional network branches: One has a dilated convolution structure similar to that of the coarse network and continues the training process based on the results from the first stage, and the other branch has a contextual attention layer structure and matches candidate regions in the original input image that are similar to the preliminary repair results from the coarse network by performing multi-classification. Then, the final repaired image is obtained by merging the results from the two branches and upsampling. The entire Deepfill model consists of global and local Wasserstein GAN + gradient penalty (WGAN-GP) [40] networks. The global discriminator is used to evaluate the overall consistency of the repaired image and to determine whether the image has been successfully repaired, while the local discriminator focuses on distinguishing the consistency of each repaired area and its surroundings in the image.

Spatial attenuation of the reconstruction loss
In the image repair task, there are many different but acceptable repair results for the area to be repaired. Using only the distance from the original image as the criterion to measure the training loss value may cause a large number of feasible repair solutions to be eliminated, which will make the training of the entire network more difficult and "mislead" the direction of network optimization.
To address these problems, the Deepfill model includes a spatial attenuation mechanism for the reconstruction loss and an attention mechanism [41] that causes the convolutional weight in the repaired area to decrease with increasing distance from the center of the repaired area, and therefore reduces the impact on the loss value when the difference between the center of the repaired area and the original image is too large. Therefore, the training of the entire network is easier and more effective.

Evaluation Indexes
The main indicators used to evaluate the performance of the detection model are the precision, recall, miss rate, AP75, and F1-score. Let N denote the total number of targets in the test set and TP denote the number of correctly detected targets. The miss rate measures the degree to which the model misses targets, as shown in Equation (4):

miss rate = (N − TP) / N = 1 − recall. (4)

Average precision (AP) was used to evaluate both classification and detection for the VOC2007 challenge [42]; it summarizes the shape of the precision/recall curve and can highlight differences between methods to a greater extent [42]. AP75 represents the area under the precision/recall curve drawn when detected targets with an IoU greater than 75% are regarded as correct.

Since the precision and recall are often contradictory, the F1-score is introduced to measure them jointly and evaluate the model's performance, as shown in Equation (5):

F1 = 2 × precision × recall / (precision + recall). (5)

The performance of the hiding model is evaluated using two indicators, i.e., the peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM). The PSNR is defined in terms of the mean square error (MSE). For a given m × n image I and a noisy image K, the MSE is defined as follows:

MSE = (1 / (m n)) Σ_{i=0}^{m−1} Σ_{j=0}^{n−1} [I(i, j) − K(i, j)]².

The PSNR is then defined as:

PSNR = 10 · log10(MAX_I² / MSE),

where MAX_I represents the maximum possible pixel value in the image. This article considers red-green-blue (RGB) color images, and the final PSNR is obtained by calculating the PSNR of each of the three channels (R, G, and B) and taking the average value. The SSIM is an index for comparing the similarity of two images in terms of three aspects, i.e., brightness, contrast, and structure. Here, μ_x is the mean of x, μ_y is the mean of y, σ_x² is the variance of x, σ_y² is the variance of y, and σ_xy is the covariance of x and y. By setting the exponents ρ, ω, and τ equal to 1, we obtain:

SSIM(x, y) = [(2 μ_x μ_y + c_1)(2 σ_xy + c_2)] / [(μ_x² + μ_y² + c_1)(σ_x² + σ_y² + c_2)].
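The detection and PSNR formulas above can be computed in a few lines of pure Python. detection_metrics, psnr, and the toy counts below are illustrative helpers, not the paper's evaluation code:

```python
import math

def detection_metrics(tp, fp, total_targets):
    """Precision, recall, miss rate (Equation (4)), and F1-score (Equation (5))."""
    precision = tp / (tp + fp)
    recall = tp / total_targets
    miss_rate = 1 - recall                               # fraction of targets missed
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    return precision, recall, miss_rate, f1

def psnr(img_i, img_k, max_val=255.0):
    """PSNR of two equally sized single-channel images, via the MSE."""
    rows, cols = len(img_i), len(img_i[0])
    mse = sum((a - b) ** 2
              for row_i, row_k in zip(img_i, img_k)
              for a, b in zip(row_i, row_k)) / (rows * cols)
    return 10 * math.log10(max_val ** 2 / mse)

# Toy counts: 90 correct detections out of 100, with 120 targets in total.
p, r, m, f1 = detection_metrics(tp=90, fp=10, total_targets=120)
print(p, r, round(f1, 3))  # 0.9 0.75 0.818
```

For RGB images, as noted above, psnr would be applied per channel and the three values averaged.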

Application of the Two-Stage Processing Model with Aircraft Targets as an Example
We take aircraft target processing in remote sensing mapping as an example to verify the performance of the proposed two-stage processing model for target detection and hiding.

Aircraft Target Characteristics
Remote sensing images contain rich spatial information, spectral features, and texture information, as well as various target categories [43]. Aircraft targets have the following four main types of distinctive characteristics: ① Spectral characteristics, i.e., brightness ratio information in terms of color and gray level. The color of an aircraft target differs from the colors of natural features and thus can be used as a typical feature for recognition. ② Texture features, i.e., regular and repeated changes in the gray level of the target within a certain spatial range. ③ Shape features, i.e., the shape of the aircraft boundary and area, including characteristic scales and aspect ratios. The shape features of most targets fall within a small range but are affected by factors such as the acquisition height and the oblique viewing angle; the scale conversion range is large, and large-scale targets also exist. These features serve as an important basis for identifying and detecting aircraft targets in this study. ④ Contextual features, i.e., the spatial relationships between various objects in the image. In particular, the aircraft targets considered in this article are generally located at an airport [44].

Reconstruction Loss Function of the Detection Model
The task of target detection in emergency remote sensing mapping is more difficult than conventional image target detection tasks. According to the above analysis of aircraft target characteristics, the detection model should have the following characteristics: ① a high-precision extraction capability for small aircraft targets in multi-scene, multi-information high-resolution remote sensing images and ② generalizability and compatible extraction capabilities for the few large aircraft targets that span relatively large pixel extents.
Based on the mask R-CNN model, we adopt corresponding optimization strategies to improve the accuracy and robustness of aircraft target detection and better satisfy the mission requirements.

Reconstruction Loss Function
The loss function of the mask R-CNN model is defined as L = L_cls + L_box + L_mask, i.e., the sum of the classification loss, bounding box loss, and mask loss. The mask quality has a significant impact on the range of each hidden area and the results of target-hiding processing. Therefore, this paper introduces the parameter γ, as shown in Equation (11), to increase the contribution of L_mask to the loss function, adjust the direction of optimization during model training, and improve the mask quality, where γ is a constant greater than 1:

L = L_cls + L_box + γ · L_mask. (11)

This modification to the reconstruction loss function increases the model training cost and the difficulty of fitting. We selected multiple sets of parameters, compared the training results, and determined the optimal value of γ according to the total loss value and the mask loss value. Figure 5a,b shows the change trends of the total loss and the mask loss, respectively.
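The weighted loss of Equation (11) amounts to a one-line change in the training objective. total_loss and the γ value of 2.0 below are illustrative only, since the tuned optimum is determined experimentally:

```python
def total_loss(l_cls, l_box, l_mask, gamma=2.0):
    """Equation (11)-style weighting: scale the mask term by a constant
    gamma > 1 so optimization favors higher-quality masks.
    gamma = 2.0 is an illustrative value, not the tuned optimum."""
    return l_cls + l_box + gamma * l_mask

# With gamma = 2.0 the mask term contributes twice its raw value.
print(round(total_loss(0.3, 0.2, 0.4), 2))  # 1.3
```

Because the mask term is inflated, gradient descent trades some classification/box accuracy for mask quality, which is what the hiding stage downstream actually consumes.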

Region Proposal Network (RPN) Optimization
The RPN generates different anchor boxes depending on the input aspect ratio and scale parameters. Considering the characteristics of the semantic information associated with aircraft, we modified the aspect ratio and scale parameters to match the shape characteristics of aircraft targets to reduce the redundancy in the number of anchor boxes, reduce the amount of calculation, and improve the hit rate of the anchor boxes.
According to an analysis of the statistical results shown in Table 1, the aspect ratios of common models of aircraft lie in the range of 0.9 to 1.25. Therefore, we set the anchor box aspect ratio coefficient α to α ∈ {0.8, 1.0, 1.25}. Since most aircraft targets in remote sensing images are small targets, after comparing the detection results at multiple different scales, we set the scale parameter β to β ∈ {8, 32, 64, 128, 256}.
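Under these settings, each sliding-window position is assigned one anchor template per scale-ratio pair. anchor_shapes below is a hypothetical helper that enumerates them, using the common convention h = s·√r, w = s/√r:

```python
# Anchor settings adapted to aircraft shapes (alpha = aspect ratios, beta = scales).
aspect_ratios = (0.8, 1.0, 1.25)
scales = (8, 32, 64, 128, 256)

def anchor_shapes(scales, ratios):
    """Width/height of every scale-ratio combination:
    w = s / sqrt(r), h = s * sqrt(r), so w * h = s**2 for every ratio."""
    shapes = []
    for s in scales:
        for r in ratios:
            shapes.append((s / r ** 0.5, s * r ** 0.5))
    return shapes

print(len(anchor_shapes(scales, aspect_ratios)))  # 15 anchor templates per position
```

Restricting the ratios to near-square boxes removes the elongated 0.5 and 2.0 templates that rarely match an aircraft, which reduces anchor redundancy and computation while raising the hit rate.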

Model Structure
The mask R-CNN model predicts masks on a 14 × 14 (faster-rcnn-c4) or 28 × 28 (faster-rcnn-fpn) grid regardless of the object size. Consequently, for large aircraft targets, many details are lost due to high scaling. Therefore, in this experiment, we replaced the mask head in the original mask R-CNN with the PointRend algorithm, as shown in Figure 6. For the coarse features, we used a network structure similar to that of the mask head in the original mask R-CNN and finally obtained a 7 × 7 coarse segmentation map. On this basis, we sampled N points and their coarse features; then, we obtained the fine-grained features of these N points from the original CNN P2 layer. Finally, we concatenated the two sets of features to obtain the feature representations of the candidate points.

Mask Optimization
Limited by the quality of the mask output and the labeling quality of the training data, the original model is susceptible to a phenomenon in which some target masks may be incomplete and fail to completely cover the target area, which increases the difficulty of hiding processing and results in poor target-hiding effects. To improve the detection process, we added a mask optimization algorithm to the mask output layer of the original model, as shown in Figure 8.
The core idea of the optimization algorithm is inspired by the dilation algorithm in image morphology [47]. In Equation (12), A is the target mask output by the original model. A is used to determine the target anchor point, and the diffusion coefficient and the convolution kernel B are determined according to the target pixel value. Then, the convolution kernel B is used to convolve A such that the mask is expanded outward along the target contour, improving the quality of the target mask and its degree of target coverage:

A ⊕ B = ⋃_{b ∈ B} A_b. (12)

Figure 9 shows a comparative example of masks generated by the original model and the model with optimization. Image (a) shows two aircraft, and images (b,c) show the mask generated by the original model and the mask generated by the model with optimization, respectively. The mask generated by the model with optimization is of better quality, covers a higher proportion of the target, and achieves a good coverage effect for the shadows cast by the target due to factors such as the shooting angle and height of the remote sensing platform.
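The dilation of Equation (12) can be sketched in pure Python. dilate and the fixed square structuring element are illustrative; in the described method the kernel B is chosen adaptively from the target pixel values:

```python
def dilate(mask, k=1):
    """Binary dilation with a (2k+1)x(2k+1) square structuring element:
    a pixel becomes 1 if any pixel within k cells of it is 1, expanding
    the mask outward along the target contour."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            if any(mask[y][x]
                   for y in range(max(0, i - k), min(h, i + k + 1))
                   for x in range(max(0, j - k), min(w, j + k + 1))):
                out[i][j] = 1
    return out

mask = [[0, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
print(dilate(mask)[0])  # [1, 1, 1, 0]
```

The single foreground pixel grows into a 3 × 3 block, which is the mechanism by which an under-sized predicted mask is expanded to cover the full target and its shadow.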

Modification of the Hiding Model Attention Mechanism
The target-hiding process in a remote sensing image consists of generating background content to fill in the area to be hidden. The internal structure and texture of the background area have a high degree of similarity, while the edge regions are obviously different from other background or target areas.
Therefore, in the hiding processing task, more attention should be paid to the fusion of the background generated in the area to be hidden with the surrounding background to make the structure and texture more consistent. In this paper, the attention mechanism strategy of the original model is retained, and the strategy for calculating the weight matrix M is modified according to the needs of hiding processing, as shown in Equation (13):

M = λ^d, (13)

where the attenuation coefficient λ is set to 0.9 and d is the L2 distance from the current point to the nearest known pixel point (x, y). This increases the weight in boundary regions, weakens the influence of the center of the area on the model training process, and improves the degree of fusion between the boundary of the area to be hidden and the surrounding background, making the overall visual effect of the image more reasonable and natural.
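The boundary-weighted matrix can be sketched as follows. weight_matrix is a hypothetical helper that assigns each pixel the weight λ^d, under the assumption that d is measured to the nearest known pixel:

```python
def weight_matrix(known, lam=0.9):
    """Illustrative weighting: each pixel gets weight lam**d, where d is the
    L2 distance to the nearest known pixel, so points near the hole boundary
    get weights close to 1 and the hole center is down-weighted."""
    coords = [(i, j) for i, row in enumerate(known)
              for j, v in enumerate(row) if v]
    h, w = len(known), len(known[0])
    m = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            d = min(((i - y) ** 2 + (j - x) ** 2) ** 0.5 for y, x in coords)
            m[i][j] = lam ** d
    return m

# Left column is known background; weights decay toward the hole interior.
known = [[1, 0, 0],
         [1, 0, 0],
         [1, 0, 0]]
m = weight_matrix(known)
print([round(v, 2) for v in m[0]])  # [1.0, 0.9, 0.81]
```

The geometric decay means the loss (or attention) is dominated by pixels adjacent to real background, which is exactly where structural and textural continuity must hold.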

Data Collection and Preprocessing
The experimental data come from the Remote Sensing Object Detection (RSOD) dataset and the remote sensing Dataset for Object Detection in Aerial images (DOTA) [48] annotated by Wuhan University.
To improve the generalizability and robustness of the model, during the process of dataset construction, the samples for training the detection model were mainly drawn from the RSOD dataset, and a small number of samples were randomly selected from DOTA; by contrast, the samples for training the hiding model were mainly drawn from DOTA, and a small number of samples from the RSOD dataset were randomly added.
Preprocessing of the detection model training data was completed as follows: ① The sample data were normalized to 1024 × 800 pixels to facilitate labeling and training. ② LabelMe was used to relabel the aircraft targets in the original samples to construct a new remote sensing image aircraft target instance segmentation dataset. ③ The dataset was randomly divided into a training set and a validation set at a ratio of 0.8:0.2. The test set was randomly selected and stored separately from the original samples. The samples in the test set were not used in the training process and were used only to evaluate the quality of the results of the trained model.
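The random 0.8:0.2 split in Step ③ can be sketched as follows. split_dataset and the fixed seed are illustrative helpers, not the authors' code:

```python
import random

def split_dataset(samples, train_frac=0.8, seed=42):
    """Shuffle and split samples into train/validation sets at a fixed ratio.
    A fixed seed keeps the split reproducible across runs."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

samples = list(range(1607))  # 1607 augmented samples, as in the text
train, val = split_dataset(samples)
print(len(train), len(val))  # 1285 322
```

With 1607 samples, the 0.8:0.2 ratio yields the 1285/322 split reported for the detection dataset.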
A total of 210 images were labeled, and data augmentation operations such as inversion, left and right mirroring, flipping, and rotation were applied. The total number of samples was 1607, with 1285 samples in the training set and 322 samples in the validation set. Figure 10 shows several examples of target detection samples.

Preprocessing of the hiding model training data was completed as follows: The method of directly randomly generating masks from large-format remote sensing image samples cannot satisfy the requirements for training a model to hide only specific target regions. Therefore, we cropped the aircraft and airport samples in the dataset to 256 × 256 pixels and retained only image samples depicting airport and runway background areas.
A total of 9502 samples were obtained and randomly divided into a training set and a validation set at a ratio of 0.8:0.2. The training set contained 7601 images, and the validation set contained 1901 images.

Training
The training methods and loss function calculations used for the two stages of the model are different; therefore, we trained and tuned the two models separately, and then combined them to obtain a two-stage trained model, which we applied for the tasks of the automatic detection and hiding of aircraft targets.
In this study, we utilized PyTorch as our deep learning framework. All experiments were performed on a computer equipped with a 64-bit Intel i7-6700K CPU @ 4.0 GHz, 32 GB of RAM, and a GeForce GTX1070 GPU with 4 GB of memory, running under CUDA version 10.0. The operating system was Ubuntu 16.04 LTS.
The experimental parameters for the training of the detection model were set as follows: the learning rate was 0.0001, the batch size was 2, and the number of epochs was 50.
After the corresponding pretrained model was obtained, 46 remote sensing image samples were randomly selected from the test set to verify the model's detection performance.
The experimental parameters for the training of the intelligent hiding model were set as follows: the batch size was 8, the number of training times per epoch was 2000, the maximum number of iterations was 1,000,000, and the number of validation times per epoch was 200. The losses in the coarse network and the refinement network were both reduced to 0.5, and the maximum width and height parameters of the mask range were set to 256 pixels.

Model Performance Analysis
For the detection model, we randomly selected 46 images from the test set for detection to measure the performance of the model. The results are shown in Table 2. The actual number of targets is 475. The original mask R-CNN model detects a total of 214 targets, of which 210 are correct; the model proposed in this paper detects 411 targets, of which 389 are correct.
As shown in Figure 11, the precision of the original mask R-CNN model is 98.13%, while the recall is only 44.21% and the AP75 is 65.4%; consequently, the F1-score is only 60.96% because a large number of targets are not detected. After optimization, the precision of the detection model proposed in this paper remains as high as 94.87%, while the recall reaches 81.68% and the AP75 reaches 79.2%, which are 84.75% and 21.1% higher than those of the original mask R-CNN model, respectively. Accordingly, the F1-score reaches 87.78%, which is 44% higher than that of the original model.

For the hiding model, we used the 46 images from the test set subjected to target detection processing as input images. The peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) results are shown in Figure 12: Figure 12a shows the PSNR of the 46 sample images, with an average value of 32.26, and Figure 12b shows the SSIM, with an average value of 0.98, which means the generated images are very similar to the non-classified images.

Figure 13 presents comparative examples of the results of target detection and hiding processing; from left to right are the original images, detection results, and hiding processing results. The figure shows three different sets of images. The first row in each group shows the processing results of the original model, and the second row shows the processing results of our method. We can see that the two-stage model proposed in this article ① achieves better detection and recognition effects and a higher recall rate for aircraft targets in remote sensing images and ② shows a more powerful hiding processing ability and a more significant background restoration effect for the detected targets.

Comparative Analysis of Results
In terms of quality, the benchmark model clearly suffers from hiding failures caused by missed detections. Moreover, even for targets that are successfully detected, the generated masks sometimes do not completely cover the target areas; as a result, the aircraft texture is still visible after hiding processing. By contrast, the proposed two-stage model outputs high-quality masks, which, combined with the strong performance of the hiding model, enable more reliable hiding of sensitive targets and better fusion with the surrounding scene.
In terms of efficiency, our method takes approximately 15 min to process a 1:10,000 image with a size of 1 GB and a resolution of 0.2 m. With manual processing, depending on the operating conditions of the production unit, hiding alone typically takes 30 min, even before the time-consuming task of screening for sensitive targets is considered. Thus, our method reduces the processing time by more than 50% compared with manual processing, demonstrating that it can greatly accelerate emergency mapping.

Discussion of Experimental Results
The detection and hiding of sensitive targets are important steps in remote sensing emergency mapping. This study addressed this problem with a two-stage deep learning model and expanded the ideas and methods of related research in this field. First, this article proposes a complete model framework covering the two stages of target detection and target hiding, which largely realizes the automatic hiding of sensitive targets. Compared with existing purely manual or semi-automatic methods, the proposed framework maintains high accuracy while greatly improving efficiency, meeting the goal of rapid and accurate emergency mapping. Second, because the detection accuracy of sensitive targets largely determines the quality of target hiding, improving the detection stage is the key to optimizing the entire model. We combined the PointRend method with mask R-CNN, and this optimized model improved the detection accuracy of sensitive targets so that subsequent target hiding performs better. The optimized detection model can be extended to other similar cases. In addition, we applied the proposed detection and hiding model to actual remote sensing data and achieved satisfactory results, largely replacing manual work with automatic processing. This shows that the proposed model has good practical value and scalability and can be used in industrial production. Beyond the airplane case studied in this article, it can be used to detect and hide other similar sensitive targets, such as warships, military depots, and military training grounds, provided that the training dataset is large enough.

Conclusions
In our method, the detection recall, AP75, and F1-score for aircraft targets in remote sensing images are significantly improved, and the hiding results appear reasonable and natural. Moreover, significant time is saved in the overall remote sensing mapping process, thus demonstrating the practical applicability of the proposed two-stage model. In principle, this method is also suitable for the automatic detection and hiding of other single-type sensitive targets in emergency remote sensing mapping.
Despite its effectiveness, our method still has some limitations. The dataset is relatively small and relies on traditional data augmentation methods; as a result, the precision leaves room for improvement, and the model has not yet been compared with other strong models. In future work, zero-shot learning and GANs can be used to mitigate the problem of limited training data, and more recent CNN models will be considered for target detection, especially one-stage object detectors such as RetinaNet [49], SSD, and YOLOv3 [50].

Data Availability Statement: Restrictions apply to the availability of these data. The data are classified and cannot be published publicly.

Conflicts of Interest:
The authors declare no conflict of interest.