Target Detection Method for Low-Resolution Remote Sensing Image Based on ESRGAN and ReDet

Abstract: With the widespread use of remote sensing images, low-resolution target detection in remote sensing images has become a hot research topic in the field of computer vision. In this paper, we propose a Target Detection on Super-Resolution Reconstruction (TDoSR) method to solve the problem of low target recognition rates in low-resolution remote sensing images under foggy conditions. The TDoSR method uses the Enhanced Super-Resolution Generative Adversarial Network (ESRGAN) to perform defogging and super-resolution reconstruction of foggy low-resolution remote sensing images. In the target detection part, the Rotation Equivariant Detector (ReDet) algorithm, which currently has a high recognition rate, is used to identify and classify various types of targets. A large number of experiments have been carried out on the remote sensing image dataset DOTA-v1.5, and the results suggest that the proposed method achieves good results in the target detection of low-resolution foggy remote sensing images. The principal result is that the recognition rate of the TDoSR method increases by roughly 20% when compared with detection performed directly on low-resolution foggy remote sensing images.


Introduction
The task of target detection in remote sensing images is to locate, recognize, or classify ground objects. With the advent of the Convolutional Neural Network (CNN) [1], computer vision has become a hot spot in the field of artificial intelligence, especially in image processing, which has recently experienced unprecedented development [2,3]. Whether due to the low performance of some imaging equipment or to extreme weather conditions, collected remote sensing images of such low quality cannot satisfy practical requirements. The task of single image super-resolution (SISR) [4] is to recover a high-resolution image from a low-resolution image. Before deep learning methods were proposed, the Bicubic [4] method was usually used for single image super-resolution. However, this method only uses the pixel information of the low-resolution image itself: the value at each position is interpolated from the information around the corresponding pixel, so the super-resolution image obtained by this method is unsatisfactory and of poor quality. Learning a Deep Convolutional Network for Image Super-Resolution (SRCNN) [4] introduced the CNN to the task of image super-resolution reconstruction for the first time. The network structure of the SRCNN used only three convolutional layers. Compared with traditional reconstruction methods, the reconstruction effect improved, but the high-frequency details of the image were still not recovered well [5,6]. In response, Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network (SRGAN) [7] applied the Generative Adversarial Network (GAN) [8] to the super-resolution problem. The SRGAN added perceptual loss and adversarial loss to the GAN framework to increase the authenticity of the generated images.
While the visual effect of the super-resolution image reconstructed by the SRGAN had been improved, there

Related Work
Currently, there are no open-source, complete low-resolution remote sensing image datasets. Therefore, the public remote sensing image dataset DOTA-v1.5 [20] has been selected for this research. The Bicubic method is used to down-sample the DOTA dataset so as to obtain low-resolution remote sensing images. Then, fog is artificially simulated and added to the down-sampled remote sensing images by the mainstream RGB channel fog synthesis method. The resulting low-quality image dataset is then used for the following super-resolution reconstruction research.

Bicubic Interpolation
Bicubic interpolation, also called cubic convolution interpolation, is a relatively complex interpolation algorithm. It uses the gray values of the 16 points surrounding the point to be sampled for cubic interpolation; not only are the gray values of the four directly adjacent points considered, but also the influence of the rate of change of the gray values between adjacent points [21]. This paper uses this algorithm to down-sample the remote sensing images.
Suppose that the size of the source image A to be processed is u × v, and the size of the target image B scaled from A is U × V. According to the zoom ratio, the point (X, Y) on the target image B corresponds to the point (x, y) = (X × u/U, Y × v/V) on the source image A. In bicubic interpolation, the 16 pixels closest to (x, y) are selected when calculating the pixel value at (X, Y) on the target image B. The algorithm needs an interpolation basis function to fit the data, and the commonly used interpolation basis function is shown in Formula 1:

W(d) = 1.5|d|^3 − 2.5|d|^2 + 1, for |d| ≤ 1
W(d) = −0.5|d|^3 + 2.5|d|^2 − 4|d| + 2, for 1 < |d| < 2
W(d) = 0, otherwise (1)

The image of the interpolation basis function is shown in Figure 1a, while Figure 1b is a schematic diagram of the bicubic algorithm. The point Q is the coordinate point on the source image corresponding to the point (X, Y) on the target image B after scaling; the coefficients of the 16 points around Q are calculated, and the pixel value of Q is obtained by weighting, as shown in Figure 1b. Here, m is the distance between Q and the abscissa of the upper-left corner point, and n is the distance between Q and the ordinate of the upper-left corner point. To find the coefficient corresponding to each coordinate point: the distances between the four points in each row and the point Q in the X axis direction are 1 + m, m, 1 − m, and 2 − m, and the distances between the four points in each column and the point Q in the Y axis direction are 1 + n, n, 1 − n, and 2 − n.
From the interpolation basis function operation, if the row coefficient corresponding to point (i, j) is W(1 + m) and the corresponding column coefficient is W(1 + n), then the coefficient of this point is K 0,0 = W(1 + m)·W(1 + n).
The coefficients of the remaining points are calculated in the same way, so the coefficients of the four points in the first row are: W(1 + m)·W(1 + n), W(m)·W(1 + n), W(1 − m)·W(1 + n), and W(2 − m)·W(1 + n). The coefficients of the four points in the second row are the same row coefficients multiplied by the column coefficient W(n); the third row uses the column coefficient W(1 − n); and the fourth row uses the column coefficient W(2 − n). Therefore, the pixel value of the point Q can be obtained by adding the pixel values of the 16 points multiplied by their corresponding coefficients. The down-sampled image can then be obtained by the Bicubic interpolation algorithm.
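As an illustration, the kernel and the 16 weights above can be sketched in a few lines of Python (the kernel parameter a = −0.5 reproduces the 1.5/2.5 coefficients of Formula 1; the function names are ours):

```python
def cubic_kernel(d, a=-0.5):
    """Cubic convolution kernel W(d); a = -0.5 gives the 1.5/2.5
    coefficients of Formula 1."""
    d = abs(d)
    if d <= 1:
        return (a + 2) * d**3 - (a + 3) * d**2 + 1
    if d < 2:
        return a * d**3 - 5 * a * d**2 + 8 * a * d - 4 * a
    return 0.0

def bicubic_weights(m, n):
    """The 16 weights for the 4x4 neighborhood around Q, where (m, n)
    are the fractional offsets from the upper-left center pixel."""
    xs = [1 + m, m, 1 - m, 2 - m]  # row distances along X
    ys = [1 + n, n, 1 - n, 2 - n]  # column distances along Y
    return [[cubic_kernel(dx) * cubic_kernel(dy) for dx in xs] for dy in ys]
```

Because the kernel forms a partition of unity, the 16 weights always sum to 1, so a constant image is reproduced exactly.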
Finally, the down-sampled images are artificially fogged to obtain low-resolution remote sensing images under foggy conditions, as shown in Figure 2.

Effective Algorithms for SISR
The following will introduce some of the effective algorithms used in recent years for single image super-resolution reconstruction.

Generative Adversarial Network (GAN)
The Generative Adversarial Network [8] is mainly composed of two network models: the generator network G and the discriminator network D. The main function of the generator network is to receive a random noise vector z and generate from this noise an image similar to the original. The role of the discriminator network is to determine whether an image is real or synthesized by the generator. The two network models compete to improve their capabilities until the discriminator cannot determine whether a synthesized image is real or fake.

The cost function of the Generative Adversarial Network is:

min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]

where x represents a real image, z represents the random noise input to the generator G, and G(z) represents the image generated by G. D(x) represents the probability that the discriminator D judges the real image to be real; for D, the closer D(x) is to 1, the better. D(G(z)) represents the probability that D judges the image generated by G to be real. G hopes that D(G(z)) is as large as possible, which makes V(D, G) smaller; conversely, D hopes that D(x) is large and D(G(z)) is small, which makes V(D, G) larger.
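As a toy illustration of this value function, a batch estimate of V(D, G) can be computed directly from the discriminator's probability outputs (a minimal sketch; the function names are ours):

```python
import math

def gan_value(d_real, d_fake):
    """Batch estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))],
    given the discriminator's probability outputs on real and fake batches."""
    term_real = sum(math.log(p) for p in d_real) / len(d_real)
    term_fake = sum(math.log(1 - p) for p in d_fake) / len(d_fake)
    return term_real + term_fake
```

When the discriminator cannot tell real from fake (all outputs 0.5), V = log 0.25 ≈ −1.39; as the discriminator improves, V rises toward 0.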

SRGAN
SRGAN [7] applied the GAN [8] to the task of image super-resolution reconstruction for the first time and made improvements to the loss function. The SRGAN's network model is divided into three parts: the generator, the discriminator, and the VGG [14] network. In the training process, the generator and the discriminator are trained alternately against each other, iterating continuously; the VGG [14,22,23] network only participates in the calculation of the loss.
The generator of the SRGAN is an improvement made on the basis of SRResNet. The generator network contains multiple residual blocks, and each residual block contains two convolutional layers, each followed by batch normalization (BN) [10]. PReLU is used as the activation function, and two 2× sub-pixel convolution layers are used to increase the feature size [6]. The discriminator network of the SRGAN contains 8 convolutional layers. As the network deepens, the number of feature maps increases while the feature size decreases. LeakyReLU is selected as the activation function [22]; the network finally passes through two fully connected layers and a sigmoid activation function, which is used to predict the probability that the generated image is a real image.
The loss function of the SRGAN is divided into generator loss and discriminator loss. The generator loss consists of content loss and adversarial loss:

l^SR = l^SR_X + 10^−3 · l^SR_Gen

where l^SR_X is the content loss and l^SR_Gen is the adversarial loss. The content loss includes the MSE loss [6,21] and the VGG loss. The MSE loss measures the matching degree between pixels, and the VGG loss measures the matching degree in a feature layer. Using the MSE yields a good performance evaluation index, but a super-resolution reconstruction obtained with the MSE loss alone loses much high-frequency information. The purpose of adding the VGG loss is to recover the high-frequency information of the image more effectively.
The calculation of the MSE loss is as follows:

l^SR_MSE = (1 / (r^2 W H)) Σ_{x=1}^{rW} Σ_{y=1}^{rH} (I^HR_{x,y} − G_{θG}(I^LR)_{x,y})^2

where W represents the width of the image, H represents the height of the image, r is the up-scaling factor, I^HR is the real high-resolution image, and I^LR is the low-resolution image corresponding to the real high-resolution image. The calculation of the VGG loss is as follows:

l^SR_VGG/i,j = (1 / (W_{i,j} H_{i,j})) Σ_{x=1}^{W_{i,j}} Σ_{y=1}^{H_{i,j}} (φ_{i,j}(I^HR)_{x,y} − φ_{i,j}(G_{θG}(I^LR))_{x,y})^2

where φ_{i,j} represents the feature map obtained by the j-th convolution before the i-th max-pooling layer of the VGG network, and W_{i,j} and H_{i,j} are the dimensions of the corresponding feature map in the VGG network.
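The MSE part of the content loss is straightforward to sketch for a single channel (a minimal illustration, not the authors' code):

```python
def mse_loss(hr, sr):
    """Pixel-wise MSE between a real HR image and its reconstruction,
    both given as 2D lists of single-channel intensities."""
    h, w = len(hr), len(hr[0])
    return sum((hr[i][j] - sr[i][j]) ** 2
               for i in range(h) for j in range(w)) / (h * w)
```

The VGG loss has the same form, applied to feature maps instead of raw pixels.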
The adversarial loss of the generator is calculated as follows:

l^SR_Gen = Σ_{n=1}^{N} − log D_{θD}(G_{θG}(I^LR))

where D_{θD}(G_{θG}(I^LR)) is the estimated probability that the reconstructed image G_{θG}(I^LR) is a natural HR image [8].
The generator network G_{θG} and the discriminator network D_{θD} are optimized in an alternating manner to solve the adversarial min-max problem:

min_{θG} max_{θD} E_{I^HR∼p_train(I^HR)}[log D_{θD}(I^HR)] + E_{I^LR∼p_G(I^LR)}[log(1 − D_{θD}(G_{θG}(I^LR)))]

where I^HR is the real high-resolution image, I^LR is the low-resolution image corresponding to the real high-resolution image, and I^SR is the super-resolution image obtained by inputting the low-resolution image into the SRGAN network. However, the super-resolution image reconstructed by the SRGAN still has a large gap with the real image, and it cannot recover more realistic texture details or more semantic information.

EDSR
Compared with the SRGAN, the EDSR removes the BN layers from the network. For the task of image super-resolution reconstruction, the image generated by the network is required to be consistent with the input source image in terms of brightness, contrast, and color, while only the resolution and some details are changed. In image processing, a BN layer is equivalent to contrast stretching: it normalizes the color distribution of the image, which destroys the original contrast information [12]. Therefore, BN layers do not perform well in image super-resolution. In addition, BN layers increase the training time and can make training unstable or even divergent.
The model performance of the EDSR is improved by removing the BN layers in the residual network, increasing the number of residual blocks from 16 to 32, and thereby expanding the model size. The EDSR uses an L1-norm loss function [23] to optimize the network model. During training, the low-scale up-sampling model is trained first, and the obtained parameters are then used to initialize the high-scale up-sampling model, which not only reduces the training time of the high-scale model but also achieves a better training effect.
The EDSR has achieved a good effect in the super-resolution reconstruction task, but there is still a large gap in edge detail from the real image.

Experimental Method
Through the research and comparison of image super-resolution reconstruction algorithms, the ESRGAN algorithm, which thus far has the best effect in the field of remote sensing image reconstruction, is selected for the following research. Through the super-resolution processing of low-resolution remote sensing images, the generated super-resolution images are identified and classified. The flow chart of the whole identification pipeline is shown in Figure 3.

Figure 3. A flow chart of the remote sensing image recognition algorithm: the real high-definition image is down-sampled to obtain a low-resolution image, fogging is simulated on the low-resolution image to obtain a low-resolution foggy image, the foggy image is super-resolution reconstructed, and finally target recognition is carried out on the reconstructed image.

ESRGAN
ESRGAN's [13] generator network is based on the SRResNet structure, with two improvements. First, all BN layers in the network are removed; this improves both the generalization ability of the model and the training speed. Second, the original residual block is changed to the Residual-in-Residual Dense Block (RRDB), which combines multi-layer residual networks and dense connections [12,24]. In previous GAN-based super-resolution algorithms, the discriminator network judges whether the image generated by the generator is true and natural [9]. The most important improvement of the ESRGAN discriminator is that it instead estimates the probability that a real image is relatively more realistic than a generated one. The ESRGAN also uses the VGG features before activation in the perceptual loss, which overcomes two shortcomings. First, the activated features are very sparse, especially in deep networks; this sparse activation provides weaker supervision, which lowers the generator's performance. Second, using activated features causes the brightness of the super-resolution reconstructed image to differ from that of the real image.
The ESRGAN uses a relativistic average discriminator, and the loss function of the discriminator is defined as:

L_D^Ra = −E_{x_r}[log(D_Ra(x_r, x_f))] − E_{x_f}[log(1 − D_Ra(x_f, x_r))]

where x_r is the real image, x_f is the image generated by the generator, D_Ra(x_r, x_f) = σ(C(x_r) − E_{x_f}[C(x_f)]) estimates the probability that the real image is more realistic than the average generated image, and D_Ra(x_f, x_r) estimates the probability that a generated image is more realistic than the average real image.
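A minimal sketch of this relativistic average loss over raw discriminator outputs C(·) (a pure-Python illustration; the function names are ours):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def discriminator_loss(c_real, c_fake):
    """L_D = -E[log sigma(C(x_r) - E[C(x_f)])]
             - E[log(1 - sigma(C(x_f) - E[C(x_r)]))],
    where c_real and c_fake are raw (pre-sigmoid) discriminator outputs."""
    mean_fake = sum(c_fake) / len(c_fake)
    mean_real = sum(c_real) / len(c_real)
    term_r = sum(math.log(sigmoid(c - mean_fake)) for c in c_real) / len(c_real)
    term_f = sum(math.log(1 - sigmoid(c - mean_real)) for c in c_fake) / len(c_fake)
    return -(term_r + term_f)
```

The loss is smallest when real outputs sit well above the fake average and fake outputs sit well below the real average.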
The adversarial loss of the generator takes the symmetrical form:

L_G^Ra = −E_{x_r}[log(1 − D_Ra(x_r, x_f))] − E_{x_f}[log(D_Ra(x_f, x_r))]

and the total loss of the generator is defined as:

L_G = L_percep + λ L_G^Ra + η L_1

where L_percep is the perceptual loss, L_G^Ra is the adversarial loss of the generator, and L_1 is the pixel-wise loss; λ = 5 × 10^−3 and η = 0.01 in the experiment.

Rotating Equivariant Detector
Unlike natural images, targets in aerial images are usually arbitrarily oriented. In order to overcome this difficulty, researchers generally represent the detection of aerial targets as a task of orientation detection that relies on the characterization of oriented bounding boxes (OBBs) [17] instead of horizontal bounding boxes (HBBs) [17].
ReDet uses rotation-equivariant networks instead of traditional convolutional neural networks to extract features. While convolutional neural networks share weights only across translations, rotation-equivariant networks share weights across both translations and rotations. ReDet combines a rotation-equivariant network with ResNet and Feature Pyramid Networks (FPN) [25] as the backbone to realize a rotation-equivariant backbone network, named Rotation-equivariant ResNet (ReResNet), to extract rotation-equivariant features, which can accurately predict the orientation and significantly reduce the model size.
The horizontal RoIs (HRoIs) produced by the backbone network through the Region Proposal Network (RPN) [26] are taken as input, reduced to 10 channels by one convolution, and passed through a fully connected layer, which outputs a 5-dimensional vector; the ground-truth value of each dimension is the offset of the ground-truth RRoI relative to the HRoI. These offsets are input to the decoder module, which decodes the parameters of the RRoI, namely (x, y, w, h, θ). This makes the final RRoI as close as possible to the ground truth, reduces the number of parameters, and improves the performance of rotated-box detection. Concurrently, ReDet designs a novel Rotation-invariant RoI Align (RiRoI Align), which includes both spatial alignment and orientation alignment. Its task is to transform the rotation-equivariant features so as to obtain rotation-invariant features at the instance level; rotation invariance means that no matter how the input rotates, the output remains the same. RiRoI Align generates rotation-invariant RoI features from the rotation-equivariant feature map.
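The decoding step can be sketched as follows, using a common delta-style box parameterization (this parameterization is an assumption for illustration; ReDet's exact convention may differ):

```python
import math

def decode_rroi(hroi, deltas):
    """Decode 5-dimensional offsets (dx, dy, dw, dh, dtheta) predicted
    against a horizontal RoI (cx, cy, w, h) into a rotated RoI
    (x, y, w, h, theta). Delta-style parameterization, assumed here
    for illustration only."""
    cx, cy, w, h = hroi
    dx, dy, dw, dh, dt = deltas
    x = cx + dx * w              # center shift, scaled by RoI size
    y = cy + dy * h
    new_w = w * math.exp(dw)     # width/height offsets in log space
    new_h = h * math.exp(dh)
    theta = dt * 2 * math.pi     # angle offset, normalized to [0, 1)
    return (x, y, new_w, new_h, theta)
```

With all offsets zero, the decoder returns the input HRoI unchanged (with θ = 0), which is the identity behavior such parameterizations are built around.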
Given an input image, ReDet inputs it into the ReResNet network, extracts rotation-equivariant features, uses the RPN to generate HRoIs, and then uses the RoI Transformer to convert the HRoIs into RRoIs (x, y, w, h, θ). Finally, RiRoI Align is used to extract rotation-invariant features for RoI classification and bounding box regression.

TDoSR
In this method, the down-sampling and super-resolution reconstruction of the image are carried out according to the scale factor ×4. As the size of the image in the DOTA dataset is too large, it was cropped to a 1024 × 1024 size image before the experiment. Then we use the MATLAB Bicubic algorithm to down-sample the original high-definition remote sensing image to obtain a low-resolution remote sensing image with a size of 256 × 256. The method of the RGB channel synthesizing fog in MATLAB is used to artificially simulate and add fog to low-resolution remote sensing images.
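The fogging step can be illustrated with the atmospheric scattering model I = J·t + A·(1 − t), a common way to synthesize fog per RGB channel (a sketch of the idea, not the exact MATLAB routine used here; the transmission t and airlight A values are hypothetical):

```python
def add_fog(pixel, t=0.6, airlight=255):
    """Blend an (R, G, B) pixel toward a white airlight using the
    atmospheric scattering model I = J * t + A * (1 - t); t is the
    transmission (smaller = thicker fog). Parameter values are
    illustrative only."""
    return tuple(round(c * t + airlight * (1 - t)) for c in pixel)
```

Lower values of t pull every channel toward the airlight, washing out contrast the way fog does.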
The training process is divided into two stages. First, we train a PSNR-oriented model with the L1 loss. The learning rate is initialized as 2 × 10 −4 and decayed by a factor of 2 every 2 × 10 5 iterations. We then employ the trained PSNR-oriented model as an initialization for the generator. The generator is trained using the loss function in Equation (9) with λ = 5 × 10 −3 and η = 0.01. The learning rate is set to 1 × 10 −4 and halved at 50 K, 100 K, 200 K, and 300 K iterations. Pre-training with pixel-wise loss helps GAN-based methods to obtain more visually pleasing results. We use Adam [27] and alternately update the generator and discriminator network until the model converges. The low-resolution foggy remote sensing image is then input into the trained ESRGAN model for super-resolution reconstruction, and a high-resolution remote sensing image with a size of 1024 × 1024 is obtained.
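The generator learning-rate schedule described above can be expressed as a small helper (an illustration of the schedule, not the training code itself):

```python
def esrgan_lr(iteration):
    """Generator learning rate in the GAN stage: start at 1e-4 and
    halve at 50k, 100k, 200k, and 300k iterations."""
    lr = 1e-4
    for milestone in (50_000, 100_000, 200_000, 300_000):
        if iteration >= milestone:
            lr *= 0.5
    return lr
```

After the final milestone the rate has been halved four times, i.e. divided by 16.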
The training for ReDet is as follows. For the original ResNet, we directly use the ImageNet pretrained models from PyTorch [28]. For ReResNet, we implement it based on mmclassification [29]. We train ReResNet on ImageNet-1K with an initial learning rate of 0.1. All models are trained for 100 epochs, the learning rate is divided by 10 at epochs 30, 60, and 90, and the batch size is set to 256. The detector is then fine-tuned on the detection task. We adopt ResNet with FPN [25] as the backbone of the baseline method, and ReResNet with ReFPN as the backbone of the proposed ReDet. For the RPN, we set 15 anchors per location of each pyramid level. For the R-CNN, we sample 512 RoIs with a 1:3 positive-to-negative ratio for training. For testing, we adopt 10,000 RoIs (2000 for each pyramid level) before NMS and 2000 RoIs after NMS. We adopt the same training schedules as mmdetection [29]. The SGD optimizer is adopted with an initial learning rate of 0.01, and the learning rate is divided by 10 at each decay step. The momentum and weight decay are 0.9 and 0.0001, respectively. We train all models for 12 epochs on DOTA.
Then, the high-resolution remote sensing image obtained in the previous step is input into the trained ReDet detector, and the recognition rate of each type of target is finally obtained.

Experimental Results and Analysis
In this paper, low-resolution remote sensing images in a foggy interference environment are super-resolution reconstructed, and the reconstructed remote sensing images are subjected to target recognition. Due to space limitations, only representative algorithms for image super-resolution reconstruction are selected for comparison. The experimental environment is as follows. The hardware configurations of Device 2 are: CPU-Intel(R) Xeon(R) Gold 5218@2.30GHz x32 from Intel, San Francisco, USA; GPU-NVIDIA Quadro P5000 from NVIDIA, Santa Clara, USA; memory-128 GB from GALAXY, Hong Kong, China.
Software configuration: The environment configurations of the two devices are the same. The operating system is the 64-bit Ubuntu 18.04 LTS for both devices.
The driver version of the graphics card is: Nvidia-Linux-x64-450.80.02; CUDA version is 10.0; PyTorch 1.3.1.

Experimental Data
This experiment uses the DOTA-v1.5 dataset [17,20]. As the size of the original dataset images is large, which is not conducive to the training of the model, the original images are uniformly cropped into images of size 1024 × 1024. The cropped dataset is used for super-resolution reconstruction and the subsequent target detection. After cropping, there are 10,352 image samples used for training, 10,694 image samples used for verification, and 10,833 image samples used for testing.

Comparative Experiment
When training the super-resolution model in this paper, the original high-resolution dataset is first down-sampled, then these down-sampled images are artificially fogged, and finally a low-resolution remote sensing image dataset under foggy conditions is obtained. The low-resolution dataset and the original high-resolution dataset are then input into the super-resolution network for training so as to complete the reconstruction of super-resolution remote sensing images. The reconstructed images are input into the trained detector, and the performance of the super-resolution reconstruction network is tested by the recognition rate of different categories.
PSNR [30] and SSIM [31,32] are general indicators for evaluating image quality in the field of image processing, and both are used in this paper. PSNR is the most common and widely used objective image evaluation index. It is based on the error between corresponding pixels, that is, an error-sensitive image quality evaluation. It is calculated as follows:

MSE = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} (X(i, j) − Y(i, j))^2

PSNR = 10 × log10((2^n − 1)^2 / MSE)

where MSE represents the mean square error between the current image X and the reference image Y, X(i, j) and Y(i, j) represent the pixel values at the corresponding coordinates, H and W are the height and width of the image, respectively, and n is the number of bits per pixel (generally 8). The unit of PSNR is dB. The larger the value, the smaller the MSE, the closer the two images are, and the smaller the distortion.
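For reference, PSNR can be computed directly from this definition (a minimal single-channel sketch):

```python
import math

def psnr(x, y, bits=8):
    """PSNR in dB between two equal-sized single-channel images,
    given as 2D lists of pixel values."""
    h, w = len(x), len(x[0])
    mse = sum((x[i][j] - y[i][j]) ** 2
              for i in range(h) for j in range(w)) / (h * w)
    if mse == 0:
        return float('inf')  # identical images
    return 10 * math.log10((2 ** bits - 1) ** 2 / mse)
```

With 8-bit images and an MSE of 1, this gives 20·log10(255) ≈ 48.1 dB.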
SSIM is the structural similarity, an index measuring the similarity of two images. SSIM is the product of a luminance term l(X, Y), a contrast term c(X, Y), and a structure term s(X, Y):

l(X, Y) = (2 u_X u_Y + C_1) / (u_X^2 + u_Y^2 + C_1)
c(X, Y) = (2 σ_X σ_Y + C_2) / (σ_X^2 + σ_Y^2 + C_2)
s(X, Y) = (σ_XY + C_3) / (σ_X σ_Y + C_3)

where u_X and u_Y represent the mean values of images X and Y, respectively; σ_X and σ_Y represent the standard deviations of images X and Y, respectively; σ_X^2 and σ_Y^2 represent the variances of images X and Y, respectively; and σ_XY represents the covariance of images X and Y. C_1, C_2, and C_3 are constants, usually taken as C_1 = (K_1 L)^2, C_2 = (K_2 L)^2, and C_3 = C_2 / 2, with generally K_1 = 0.01, K_2 = 0.03, and L = 255.
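A sketch of SSIM from global statistics (with C_3 = C_2/2 the contrast and structure terms merge into one factor; standard implementations additionally average SSIM over local windows, which is omitted here):

```python
def ssim_global(x, y, k1=0.01, k2=0.03, L=255):
    """SSIM from global statistics of two equal-length pixel sequences,
    using C3 = C2 / 2 so the contrast and structure terms merge.
    (Standard implementations average SSIM over local windows.)"""
    n = len(x)
    ux = sum(x) / n
    uy = sum(y) / n
    vx = sum((a - ux) ** 2 for a in x) / n
    vy = sum((b - uy) ** 2 for b in y) / n
    cov = sum((a - ux) * (b - uy) for a, b in zip(x, y)) / n
    c1 = (k1 * L) ** 2
    c2 = (k2 * L) ** 2
    return ((2 * ux * uy + c1) * (2 * cov + c2)) / \
           ((ux ** 2 + uy ** 2 + c1) * (vx + vy + c2))
```

An image compared with itself scores exactly 1; very dissimilar images score close to 0.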
The images after super-resolution reconstruction are shown in Figure 4. The reconstruction effect of the ESRGAN algorithm is the best, and its reconstructed image is closest to the original image. The average values of the objective evaluation indexes PSNR and SSIM after super-resolution reconstruction by the various algorithms are calculated on the test set, as listed in Table 1. After comparison, the traditional interpolation algorithm has the worst effect, while the ESRGAN algorithm selected for this paper not only achieves the best objective evaluation index, but also shows its superiority in visual perception.

Results and Analysis
The unprocessed original real high-definition images are input into the detector model for testing, and the recognition accuracy of different categories in the original images is obtained, as shown in Table 2. The trained detector model selected in this paper has good performance and recognition ability. In the experiment of super-resolution reconstruction, three algorithms (Bicubic, SRGAN, and EDSR) are selected for comparison with the ESRGAN algorithm used in this paper. Among them, the Bicubic algorithm uses traditional interpolation methods to perform image super-resolution. The SRGAN algorithm is the first method to apply the GAN to deep-learning-based super-resolution. The EDSR algorithm is an improvement based on the SRGAN network. The ESRGAN algorithm currently performs best for the super-resolution reconstruction of remote sensing images.
The super-resolution images reconstructed by the different algorithms are input into the detector model for classification and recognition, whereby the recognition rate of each category is obtained. The recognition accuracy of each category is counted and sorted, as shown in Table 2, and the actual recognition effect is shown in Figure 5. In Table 2, the horizontal direction lists the recognition rate of each type of target, and the vertical direction lists the different super-resolution methods, where GT is the real image and LR is the image obtained by down-sampling the real image and then fogging it.
It may be concluded from the recognition accuracy in Table 2 that recognition is best on the original high-definition images, while the traditional Bicubic interpolation algorithm has the worst effect, with a decline of about 10% compared with the original images: since no additional effective information is introduced during interpolation, the reconstruction effect is poor and the recognition rate is the lowest. The other deep-learning-based super-resolution algorithms not only rebuild the images and improve the image resolution, but also introduce external information during reconstruction so that the images contain more detailed information. The ESRGAN algorithm selected for this paper performs best in terms of both visual effect and objective evaluation indicators; the reconstructed remote sensing images have rich texture details and more obvious edge information, and their recognition rate is the highest among all the algorithms, with an accuracy difference from the original images of only roughly 1.2%.
The remote sensing image recognition algorithm proposed in this paper effectively solves the problem of the low recognition rate of low-resolution remote sensing images in foggy scenes.

Discussion
This paper proposed a new method for target detection in low-resolution remote sensing images in foggy weather. The low-resolution foggy remote sensing image was super-resolution reconstructed via the ESRGAN network, and the reconstructed superresolution image was input to the recognition classification in the trained detector model. After many experiments, this method improved the target recognition rate of low-resolution remote sensing images by nearly 20%. The main contributions of this paper are as follows. First, the application of image super-resolution reconstruction technology to the task of target detection in remote sensing images has broadened the application range of image super-resolution reconstruction technology. Furthermore, this research has realized the recognition and detection of small and weak targets on low-resolution remote sensing images under foggy conditions and achieved a very good detection effect. Finally, this paper compared the different methods of image super-resolution reconstruction at this stage, and ultimately selected the ESRGAN method as the best through many experiments, which helps the target detection task of remote sensing images at low resolution. The research undertaken in this paper has some benefit to the application of super-resolution reconstruction technology in the field of target detection. In the past two years, Transformer has shown the advantages in processing computer vision tasks, and has provided new research directions for the future of for the super-resolution reconstruction of remote sensing images.