Accurate Localization of Oil Tanks in Remote Sensing Images via FGMRST-Based CNN

: Object localization is an important application of remote sensing images and the basis of information extraction. The acquired accuracy is the key factor to improve the accuracy of object structure information inversion. The ﬂoating roof oil tank is a typical cylindrical artiﬁcial object, and its top cover ﬂuctuates up and down with the change in oil storage. Taking the oil tank as an example, this study explores the localization by combining the traditional feature parameter method and convolutional neural networks (CNNs). In this study, an improved fast radial symmetry transform (FRST) algorithm called fast gradient modulus radial symmetry transform (FGMRST) is proposed and an approach based on FGMRST combined with CNN is proposed. It effectively adds the priori of circle features to the calculation process. Compared with only using CNN, it achieves higher precision localization with fewer network layers. The experimental results based on SkySat data show that the method can effectively improve the calculation accuracy and efﬁciency of the same order of magnitude network, and by increasing the network depth, the accuracy still has a signiﬁcant improvement.


Introduction
In computer vision, object localization is to determine the position of the object in a single object image, and object detection is to identify all the objects and determine their positions in an image in which the type and number of objects are not fixed. In recent years, object localization in remote sensing images has been widely used in the investigation of agriculture and forestry, urban surveying, marine monitoring, and military reconnaissance [1,2]. The circle is a basic shape of objects in 2D images. Its localization technology is an important task in computer vision and pattern recognition, and it is also a hot research topic at home and abroad. The floating roof oil tank in remote a sensing image, as a representative of a man-fabricated object with the circular feature, plays an important role in both military and civil fields [3].
The position of the top cover of the floating roof oil tank fluctuates up and down with the change in oil storage, so the remote sensing image of the oil tank has different circular shadows and multiple circular areas. The localization of the oil tank in this study refers to the determination of the center position of the tank roof, the center position of the circular-arc-shaped shadow cast by the sun on the floating roof, and the radius of the oil tank in the image, which is of great significance for the inversion of the oil tank structure [4] and the oil storage information [5].
There are two kinds of methods for localizing the circle center. One is the traditional parameterized feature extraction method, including the method based on Hough trans-2 of 18 form [6] and the method based on template matching [7,8]. Han et al. [9] improved Hough transform by using the gradient direction information and reducing the dimension of parameter space, and then combined the graph search strategy to act on an oil tank [10]. Zhang et al. [11] used the improved random Hough transform (RHT), improved Canny algorithm, and template matching algorithm to deal with an oil tank. The other is the emerging deep learning [12] method. It transforms the raw data into abstract high-level representation through a nonlinear model to discover the distributed feature representation of the data [13,14]. In 1989, LeCun et al. [15] proposed convolutional neural networks (CNNs). In 2012, the deep CNN model built by Hinton et al. achieved state-of-the-art results in the ImageNet competition [16], which attracted extensive attention from industry and academia. In 2015, ResNet proposed by He et al. [17] won three championships of image detection, image classification, and image localization in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) competition, which exceeded the accuracy of human recognition for the first time [18].
In response to the problem of localization of the oil tank in remote sensing images, the two methods mentioned above have their own advantages and disadvantages. The traditional parameterized feature extraction method makes effective use of the circumference feature, but it is not learnable, and the applicability of parameters is poor. It needs to rely on experience or priori knowledge to continuously adjust parameters manually, and the degree of automation is low. The method based on deep learning is an end-to-end method. However, in the face of an object with a circular structure such as an oil tank, it is unable to fully explore the unique patterns of data and requires a large receptive field. Therefore, it needs to increase the receptive field by cascading multiple networks, which leads to a great increase in its parameters and computation. Based on this, this study aims to make effective use of the priori features while adding the learning features of the existing data to reduce the amount of calculation and improve accuracy and efficiency. We propose a new method combining the above two methods.
This study discusses the coverage of the receptive field of the CNN method and proposes the fast gradient modulus radial symmetry transform (FGMRST) based on the characteristics of the oil tank image and the fast radial symmetry transform (FRST) [19]. We use FGMRST to realize the spatial region aggregation of circular features and construct a method combining FGMRST and CNN. This method can efficiently aggregate spatial features and improve the accuracy and efficiency of the network. We carried out verification experiments based on the data of the SkySat satellite [20]. The results show that, compared with ordinary CNN, the method can effectively improve the accuracy of the localization of the oil tank, and has good stability.

CNN-Based Localization Method of the Oil Tank
As a kind of supervised deep feedforward neural networks with convolution, CNN has been widely concerned and applied because of its excellent performance in the field of computer vision, and is one of the representative algorithms of deep learning. The main idea of the CNN-based localization method of oil tank is: using CNN to train the five parameters of the remote sensing image, which are the X and Y coordinates of the center of the tank roof, the X and Y coordinates of the center of the circular-arc-shaped shadow cast by the sun on the floating roof and the radius of the oil tank, and obtain their predictive values through regression. In the face of the localization of the oil tank, CNN can predict the radius and obtain an accurate location of the circle center. However, its disadvantages are obvious: 1.
The main feature of an object with a circular structure such as the oil tank is on the circumference, but not in the circle (as shown in Figure 1). However, at present, the neural networks need to traverse all the pixels (as shown in Figure 3a), which leads to a large increase in unnecessary computation and a low processing efficiency.
networks need to traverse all the pixels (as shown in Figure 2a), which leads to a large increase in unnecessary computation and a low processing efficiency. 2. Due to the limitation of the receptive field, CNN gradually increases the receptive field and aggregates the spatial features by cascading networks (as shown in Figure 3a). Therefore, the larger the object size is, the deeper the network structure is required, which leads to a large number of parameters and computation.
3. It depends on the abundance and quantity of training samples, which does not exist in the traditional parameterized feature extraction method.

Introduction of FRST Theory
FRST is a transform that utilizes local radial symmetry to highlight points of interest within a scene [19]. The approach was inspired by the results of the generalized symmetry transform [21][22][23], although the final method bears more similarity to the work of Sela and Levine [24] and the circular Hough transform [19,25,26]. The approach presented herein determines the contribution each pixel makes to the symmetry of pixels around it, rather than considering the contribution of a local neighborhood to a central pixel [19].  The new transform is calculated at one or more radii n ∈N, where N is the set of radii of the radially symmetric features to be detected. The value of the transform at the radius n indicates the contribution to radial symmetry of the gradients a distance n away from each point.
Firstly, the positively affected pixels and the negatively affected pixels are defined: for pixel p, the positively affected pixel is defined as the pixel that the gradient vector g(p) is pointing to, a distance n away from p, and the negatively affected pixel is the pixel a distance n away that the gradient is pointing directly away from.
The coordinates of the positively affected pixel are given by: while those of the negatively affected pixel are: where "round" rounds each vector element to the nearest integer. The positively affected pixel and the negatively affected pixel are shown in Figure 4:  The radial symmetry contribution at radius n is defined as th At each radius n, an orientation projection image O n and a magnitude projection image M n are formed. For each pair of affected pixels, the corresponding point p +ve in the orientation projection image O n and magnitude projection image M n is incremented by 1 and ||g(p)||, respectively, while the point corresponding to p −ve is decremented by these same quantities in each image (the orientation and magnitude projection images are initially zero). That is: O n (p +ve (p)) = O n (p +ve (p)) + 1 M n (p +ve (p)) = M n (p +ve (p))+ g(p) M n (p −ve (p)) = M n (p −ve (p))− g(p) The radial symmetry contribution at radius n is defined as the convolution Remote Sens. 2021, 13, 4646 and O n (p) = O n (p)ifO n (p) < k n k n otherwise (9) A n is a two-dimensional Gaussian, α is the radial strictness parameter, and k n is a scaling factor that normalizes M n and O n across different radii.
The full transform is defined as the average of the symmetry contributions over all the radii considered: An overview of the algorithm is shown in Figure 5 along with the key signals (images) involved. orientation projection image On and magnitude projection image Mn is incremented by 1 and ||g(p)||, respectively, while the point corresponding to p − ve is decremented by these same quantities in each image (the orientation and magnitude projection images are initially zero). That is:  (6) The radial symmetry contribution at radius n is defined as the convolution  n = A n n S F (7) where An is a two-dimensional Gaussian, α is the radial strictness parameter, and kn is a scaling factor that normalizes Mn and On across different radii.
The full transform is defined as the average of the symmetry contributions over all the radii considered: An overview of the algorithm is shown in Figure 5 along with the key signals (images) involved. The effect of FRST applied to the human eye is shown in Figure 6: The effect of FRST applied to the human eye is shown in Figure 6: In FRST, the idea of using image gradient to localize points of high radial symm makes it suitable for circle detection. In addition, it does not need to traverse the w circle as CNN does, but only needs to use the circumference and its surrounding pix complete the transformation (as shown in Figure 2b). Therefore, the computation plexity is low, and the runtimes are fast. The schematic diagram of the object with cular structure is shown in Figure 3b.

An Improved FRST Algorithm Based on the Characteristics of Oil Tank Image FGMRST
In this study, we conduct analyses and experiments based on the SkySat satellite a resolution of 0.8 m.
The analysis of feature extraction using FRST in oil tank images is provided b As shown in Figure 7, we used FRST to transform 68 × 68 images with different ra input. In FRST, the idea of using image gradient to localize points of high radial symmetry makes it suitable for circle detection. In addition, it does not need to traverse the whole circle as CNN does, but only needs to use the circumference and its surrounding pixels to complete the transformation (as shown in Figure 3b). Therefore, the computation complexity is low, and the runtimes are fast. The schematic diagram of the object with a circular structure is shown in Figure 2b.

An Improved FRST Algorithm Based on the Characteristics of Oil Tank Images-FGMRST
In this study, we conduct analyses and experiments based on the SkySat satellite with a resolution of 0.8 m.
The analysis of feature extraction using FRST in oil tank images is provided below. As shown in Figure 7, we used FRST to transform 68 × 68 images with different radii as input.

An Improved FRST Algorithm Based on the Characteristics of Oil Tank Images-FGMRST
In this study, we conduct analyses and experiments based on the SkySat satellite with a resolution of 0.8 m.
The analysis of feature extraction using FRST in oil tank images is provided below. As shown in Figure 7, we used FRST to transform 68 × 68 images with different radii as input. It can be seen that: 1. When the floating roof of the oil tank is lower than the tank wall, there will be two areas that are brighter and darker in the image (as shown in Figure 7a), which makes positively affected pixels and negatively affected pixels at the boundary of the floating roof offset each other during transformation. Thus, the distribution of pixel values of orientation projection image and magnitude projection image is affected, and the final transformation result is affected. The solution to this problem in this study is to change the processing object of FRST from pixel to gradient modulus of pixel and cancel the positively affected pixel. Since the positively and negatively affected pixels play decisive roles in aggregating "black circle" (circle whose contour is darker than the background) and "white circle" (circle whose contour is brighter than the background), respectively, the coexistence of these two pixels makes FRST have the function of aggregating both "black circle" and "white circle" on an image. However, when the circle in the image is not a It can be seen that: When the floating roof of the oil tank is lower than the tank wall, there will be two areas that are brighter and darker in the image (as shown in Figure 7a), which makes positively affected pixels and negatively affected pixels at the boundary of the floating roof offset each other during transformation. Thus, the distribution of pixel values of orientation projection image and magnitude projection image is affected, and the final transformation result is affected. The solution to this problem in this study is to change the processing object of FRST from pixel to gradient modulus of pixel and cancel the positively affected pixel. Since the positively and negatively affected pixels play decisive roles in aggregating "black circle" (circle whose contour is darker than the background) and "white circle" (circle whose contour is brighter than the background), respectively, the coexistence of these two pixels makes FRST have the function of aggregating both "black circle" and "white circle" on an image. However, when the circle in the image is not a complete "black circle" or "white circle", the coexistence of the positively and negatively affected pixels will interfere with each other, thus affecting the aggregation effect of circles. After all the pixels are processed into a gradient modulus, all circles in the image will become "white circles", and then only negatively affected pixels need to be retained (because positively affected pixels have little effect on aggregating "white circles"), which solves the problem that some positively and negatively affected pixels offset each other during transformation.

2.
In the transformation results, almost only one aggregation area can be seen (see Figure 7b,c), and it is impossible to distinguish the center of the tank roof from the center of the circular-arc-shaped shadow cast by the sun on the floating roof. Therefore, FRST cannot complete the accurate localization task of the oil tank in this study. In view of the fact that the positions of the two circle centers are close, and the blur effect of the 3 × 3 Sobel operator used in FRST to calculate the gradient will have an adverse effect on accurate localizing, a simple first-order difference is used to calculate the gradient instead in this study.

3.
The effect of the transformation is greatly limited by the range of the input radius, and the wider the range, the worse the effect, which leads to the transformation being extremely dependent on the priori knowledge of radius. It can be inferred that the parameters related to the radius in FRST are particularly important to the result. The value of scaling factor k n is 9.9 by default when the radius is not 1, but this value is an empirical value given in an experiment with a radius range of 2-30. However, in this study and practice, the radius may exceed 30. Therefore, in this study, M n and O n are normalized by dividing all pixels in M n and O n by the maximum value of pixels in M n and O n respectively instead of k n .
In this study, the modified algorithm is named fast gradient modulus radial symmetry transform (FGMRST), and the specific transformation process is as follows: FGMRST is calculated at one or more radii n ∈ N, where N is the set of radii of the radially symmetric features to be detected. The gradient modulus of pixel p is calculated to obtain ||g(p)||. For the convenience of subsequent expression, let: q =||g(p)|| (11) For q, the negatively affected pixel is defined as the pixel that the gradient vector g(q) is pointing directly away from, a distance n away from q. The coordinates of the negatively affected pixel are given by: where "round" rounds each vector element to the nearest integer. The negatively affected pixel is shown in Figure 8: of the circular-arc-shaped shadow cast by the sun on the floating roof. Therefore, FRST cannot complete the accurate localization task of the oil tank in this study. In view of the fact that the positions of the two circle centers are close, and the blur effect of the 3 × 3 Sobel operator used in FRST to calculate the gradient will have an adverse effect on accurate localizing, a simple first-order difference is used to calculate the gradient instead in this study.
3. The effect of the transformation is greatly limited by the range of the input radius, and the wider the range, the worse the effect, which leads to the transformation being extremely dependent on the priori knowledge of radius. It can be inferred that the parameters related to the radius in FRST are particularly important to the result. The value of scaling factor kn is 9.9 by default when the radius is not 1, but this value is an empirical value given in an experiment with a radius range of 2-30. However, in this study and practice, the radius may exceed 30. Therefore, in this study, Mn and On are normalized by dividing all pixels in Mn and On by the maximum value of pixels in Mn and On respectively instead of kn.
In this study, the modified algorithm is named fast gradient modulus radial symmetry transform (FGMRST), and the specific transformation process is as follows: FGMRST is calculated at one or more radii n ∊ N, where N is the set of radii of the radially symmetric features to be detected. The gradient modulus of pixel p is calculated to obtain ||g(p)||. For the convenience of subsequent expression, let: For q, the negatively affected pixel is defined as the pixel that the gradient vector g(q) is pointing directly away from, a distance n away from q. The coordinates of the negatively affected pixel are given by: where "round" rounds each vector element to the nearest integer. The negatively affected pixel is shown in Figure 8: The radial symmetry contribution at radius n is defined as the convolution in Equation (7), where A n is a two-dimensional Gaussian, α is the radial strictness parameter. The full transform is defined as the average of the symmetry contributions over all the radii considered in Equation (10).
An overview of the algorithm is shown in Figure 9 along with the key signals (images) involved.
The radial symmetry contribution at radius n is defined as the convolution:  n = A n n S F (15) where An is a two-dimensional Gaussian, α is the radial strictness parameter. The full transform is defined as the average of the symmetry contributions over all the radii considered, An overview of the algorithm is shown in Figure 9 along with the key signals (images) involved. The analysis of feature extraction using FGMRST in oil tank images is provided below. As shown in Figure 10, we used FGMRST to transform 68 × 68 images with different radii as input. The analysis of feature extraction using FGMRST in oil tank images is provided below. As shown in Figure 10, we used FGMRST to transform 68 × 68 images with different radii as input. The analysis of feature extraction using FGMRST in oil tank images is provided below. As shown in Figure 10, we used FGMRST to transform 68 × 68 images with different radii as input. Comparing Figure 10b,c with Figure 7b,c, it can be found that FGMRST can better extract and distinguish the two circle centers. However, the final symmetry map is produced after combining all radial symmetry images [27]. Thus, different radii information is accumulated in the output of the transform [27]. As long as the value of input radius is Comparing Figure 10b,c with Figure 7b,c, it can be found that FGMRST can better extract and distinguish the two circle centers. However, the final symmetry map is produced after combining all radial symmetry images [27]. Thus, different radii information is accumulated in the output of the transform [27]. As long as the value of input radius is a range rather than a single value, the number of pixels at certain locations and specific radii that contribute to the projection matrices is not possible to be evaluated, thus the purpose of accurately localizing the circle center and determining the radius cannot be achieved.
Based on the above analysis, it is found that FGMRST alone is difficult to achieve relatively good accuracy and a relatively high level of automation for localization of oil tank in remote sensing images. It is worth noting that FGMRST has the ability to reduce the localization range of the circle center and has good transformation results for the small radius range, so it may be combined with other methods to better achieve high-precision localization of the oil tank.

Overview of Methods
FGMRST cannot accurately localize the circle center and determine the radius, but it can reduce the localization range of the circle center and realize the spatial aggregation of the circular features. In this study, FGMRST and CNN are combined to construct network parameters, and an FGMRST-based CNN method is proposed. The schematic diagram of the method is shown in Figure 2. To make the receptive field at the center of the circle contain the features of the circle, CNN must expand the receptive field through multi-layer stacking, which is inefficient, while FGMRST can accumulate the peripheral pixels near the center of the circle, and the aggregation efficiency is high. Therefore, this study attempts two combination methods: New method one (called tandem method): FGMRST is used to process the original image, and then the processed image is input into CNN. New method two (called dual-channel method): the original image and the image processed by FGMRST are combined in a dual-channel manner, and then input into CNN. The comparison of calculation location and feature processing location of three methods is shown in Table 1. The convolution operation and FGMRST can only deal with the features at the circumference and the circle center, respectively. After combining them, the new method can process the circumference feature at both the circumference and the center.

Data set preparation
The center position of the tank roof, the center position of the circular-arc-shaped shadow cast by the sun on the floating roof, and the radius of the oil tank to label, were selected and the data set was created.
2. Pre-processing The bilinear interpolation method was used to compress the sample images to a uniform size.
3. Define CNN This study designed three CNN structures. The first CNN is named net_conv2, whose structure model is shown in Figure 13. It is divided into three parts: the input layer, hidden layer, and output layer. The input layer is the input sample data. The hidden layer is the key part of CNN, including two convolutional layers and two batch normalization (BN) layers. The convolutional layer is composed of a filter layer, a spatial sampling layer, and a nonlinear layer. The nonlinear layer is realized by the RELU activation function. Considering that the sampling position of the pixels for the max pooling will affect the accurate localization of the oil tank, the sampling is realized by setting the stride of two convolutional layers to 2 instead of adding the pooling layer. The output layer is a fully connected layer, which is designed as a three-layer perceptron, and the number of output features of each layer is 120, 84, and 5, respectively. After the samples are input into CNN, the hidden layer is used for feature extraction, and then the fully connected layer is used to predict the related parameters. The second CNN is named net_conv3, which adds a convolutional layer and a BN layer after the second BN layer of net_conv2, and the others remain unchanged. The structure model is shown in Figure 14. The third CNN is named

Data set preparation
The center position of the tank roof, the center position of the circular-arc-shaped shadow cast by the sun on the floating roof, and the radius of the oil tank to label, were selected and the data set was created.
2. Pre-processing The bilinear interpolation method was used to compress the sample images to a uniform size.
3. Define CNN This study designed three CNN structures. The first CNN is named net_conv2, whose structure model is shown in Figure 13. It is divided into three parts: the input layer, hidden layer, and output layer. The input layer is the input sample data. The hidden layer is the key part of CNN, including two convolutional layers and two batch normalization (BN) layers. The convolutional layer is composed of a filter layer, a spatial sampling layer, and a nonlinear layer. The nonlinear layer is realized by the RELU activation function. Considering that the sampling position of the pixels for the max pooling will affect the accurate localization of the oil tank, the sampling is realized by setting the stride of two convolutional layers to 2 instead of adding the pooling layer. The output layer is a fully connected layer, which is designed as a three-layer perceptron, and the number of output features of each layer is 120, 84, and 5, respectively. After the samples are input into CNN, the hidden layer is used for feature extraction, and then the fully connected layer is used to predict the related parameters. The second CNN is named net_conv3, which adds a convolutional layer and a BN layer after the second BN layer of net_conv2, and the others remain unchanged. The structure model is shown in Figure 14. The third CNN is named Figure 12. Flow chart of dual-channel method.

Data set preparation
The center position of the tank roof, the center position of the circular-arc-shaped shadow cast by the sun on the floating roof, and the radius of the oil tank to label, were selected and the data set was created.

2.
Pre-processing The bilinear interpolation method was used to compress the sample images to a uniform size.

3.
Define CNN This study designed three CNN structures. The first CNN is named net_conv2, whose structure model is shown in Figure 13. It is divided into three parts: the input layer, hidden layer, and output layer. The input layer is the input sample data. The hidden layer is the key part of CNN, including two convolutional layers and two batch normalization (BN) layers. The convolutional layer is composed of a filter layer, a spatial sampling layer, and a nonlinear layer. The nonlinear layer is realized by the RELU activation function. Considering that the sampling position of the pixels for the max pooling will affect the accurate localization of the oil tank, the sampling is realized by setting the stride of two convolutional layers to 2 instead of adding the pooling layer. The output layer is a fully connected layer, which is designed as a three-layer perceptron, and the number of output features of each layer is 120, 84, and 5, respectively. After the samples are input into CNN, the hidden layer is used for feature extraction, and then the fully connected layer is used to predict the related parameters. The second CNN is named net_conv3, which adds a convolutional layer and a BN layer after the second BN layer of net_conv2, and the others remain unchanged. The structure model is shown in Figure 14. The third CNN is named net_alex because it is modified from AlexNet [16]. Its structure model is shown in Figure 15. It is also divided into three parts: input layer, hidden layer, and output layer. The hidden layer contains five convolutional layers. The convolutional layer is also composed of a filter layer, a spatial sampling layer, and a nonlinear layer. The nonlinear layer is realized by the RELU activation function. Considering that the sampling position of the pixels for the max pooling affects the accurate localization of the oil tank, the sampling is realized by setting the stride of five convolutional layers as 2, 2, 1, 2, and 2, respectively, instead of adding the pooling layer. The output layer is a fully connected layer, which is designed as a two-layer perceptron, and the number of output features of each layer is 1024 and 5, respectively. net_alex because it is modified from AlexNet [16]. Its structure model is shown in Figure  15. It is also divided into three parts: input layer, hidden layer, and output layer. The hidden layer contains five convolutional layers. The convolutional layer is also composed of a filter layer, a spatial sampling layer, and a nonlinear layer. The nonlinear layer is realized by the RELU activation function. Considering that the sampling position of the pixels for the max pooling affects the accurate localization of the oil tank, the sampling is realized by setting the stride of five convolutional layers as 2, 2, 1, 2, and 2, respectively, instead of adding the pooling layer. The output layer is a fully connected layer, which is designed as a two-layer perceptron, and the number of output features of each layer is 1024 and 5, respectively.

256@5×5
Original image The image processed by FGMRST net_alex because it is modified from AlexNet [16]. Its structure model is shown in Figure  15. It is also divided into three parts: input layer, hidden layer, and output layer. The hidden layer contains five convolutional layers. The convolutional layer is also composed of a filter layer, a spatial sampling layer, and a nonlinear layer. The nonlinear layer is realized by the RELU activation function. Considering that the sampling position of the pixels for the max pooling affects the accurate localization of the oil tank, the sampling is realized by setting the stride of five convolutional layers as 2, 2, 1, 2, and 2, respectively, instead of adding the pooling layer. The output layer is a fully connected layer, which is designed as a two-layer perceptron, and the number of output features of each layer is 1024 and 5, respectively.

256@5×5
Original image The image processed by FGMRST net_alex because it is modified from AlexNet [16]. Its structure model is shown in Figure  15. It is also divided into three parts: input layer, hidden layer, and output layer. The hidden layer contains five convolutional layers. The convolutional layer is also composed of a filter layer, a spatial sampling layer, and a nonlinear layer. The nonlinear layer is realized by the RELU activation function. Considering that the sampling position of the pixels for the max pooling affects the accurate localization of the oil tank, the sampling is realized by setting the stride of five convolutional layers as 2, 2, 1, 2, and 2, respectively, instead of adding the pooling layer. The output layer is a fully connected layer, which is designed as a two-layer perceptron, and the number of output features of each layer is 1024 and 5, respectively.

Training network
The L1Loss function is suitable for regression problems, so we chose it as the loss function, as shown in the following formula: where y represents the real value andŷ represents the output of the model. For the consideration of convergence speed and accuracy, the Adaptive Moment Estimation (Adam) algorithm was selected to update network parameters.
The training samples were trained, the loss curve was drawn, and the training situation was judged by observing the trend of the curve.

5.
Testing network on the test set The test samples were used for testing, the average losses of each and the total of the five parameters were calculated, and the loss curve in sample units was drawn.

Training Details
In this study, nine experiments were designed based on the proposed method, as shown in Table 2.

Data sets
This paper uses SkySat satellite data, which is a sub-meter resolution microsatellite for business-oriented commercial earth observation [20]. The experimental data set in this paper is based on 1312 image slices fabricated from 0.8m resolution SkySat images, which is divided into 1024 training samples and 288 test samples (the proportion is close to 4:1).

Experimental Process and Results
The experiment relies on the algorithm designed in Section 2.3.2 and is carried out once each under the condition that the random seed is 0 and 1.
In the preparation before training, the learning rate of optimizer Adam is shown in Table 3. The exponential decay rate of the first moment estimation is 0.9, and that of the second moment estimation is 0.99. The MultiStepLR method was used to adjust the learning rate dynamically, setting epoch (the number of iterations) = 300 and epoch = 650 as the milestones of the learning rate, and the learning rate adjustment multiplier factor (gamma) is 0.5.
In the training process, 1024 samples (32 batches) of the train set were used for training, and GPU was used to accelerate the training. We set epoch = 1000, then drew the loss curve, and judge the training situation by observing the trend of the curve.
In the testing process, 288 samples (9 batches) of the test set were used for testing, and then the average losses of each and the total of the five parameters were calculated on the test result. The size of the sample image was restored to the state before compression, and the predicted values of sample parameters were converted according to the same compression ratio. The converted result was used to represent the predicted values of the sample parameters under the original size, and the loss between this value and the value of the original sample was calculated. The loss curve in sample units was drawn.
To ensure the effectiveness of the network used in the test set, the network corresponding to each epoch in the training process was applied to the test set, and the loss trend diagram of the test set is drawn as shown in Figures 16 and 17. To ensure the display effect, the y-axis was logarithmic and smoothed, and the smoothing coefficient was set to 0.95. compression, and the predicted values of sample parameters were converted according to the same compression ratio. The converted result was used to represent the predicted values of the sample parameters under the original size, and the loss between this value and the value of the original sample was calculated. The loss curve in sample units was drawn.
To ensure the effectiveness of the network used in the test set, the network corresponding to each epoch in the training process was applied to the test set, and the loss trend diagram of the test set is drawn as shown in Figures 16 and 17. To ensure the display effect, the y-axis was logarithmic and smoothed, and the smoothing coefficient was set to 0.95.  It can be seen from the figure that after the milestones of learning rate (epoch = 300 It can be seen from the figure that after the milestones of learning rate (epoch = 300 and epoch = 650), the loss is still decreasing. After the guarantee was obtained, the trained network was used to process the test set, and then the average prediction loss of five parameters was calculated on the result, as shown in Figures 18 and 19.
From the above two graphs (Figures 18 and 19), it can be found that for the prediction effect, no matter what kind of structure of the network is, the "dual-channel method" better than "only using CNN" which is better than the "tandem method"; and the effect of "dual-channel method" improves with the increase in network depth.  It can be seen from the figure that after the milestones of learning rate (epoch = 300 and epoch = 650), the loss is still decreasing. After the guarantee was obtained, the trained network was used to process the test set, and then the average prediction loss of five parameters was calculated on the result, as shown in Figures 18 and 19.   The total average prediction errors of these nine experiments for the five parameters are provided in the following table.
To measure the impact of the dual-channel method on computational efficiency, two physical quantities, params and multiplication and addition counts (MACs), were introduced to evaluate different network methods, as shown in Table 4. Figure 20 is drawn according to Table 4, and the data marked in the figure indicates the number of additional convolution layers added on net_conv3.  From the above two graphs (Figures 18 and 19), it can be found that for the prediction effect, no matter what kind of structure of the network is, the "dual-channel method" better than "only using CNN" which is better than the "tandem method"; and the effect of "dual-channel method" improves with the increase in network depth.
The total average prediction errors of these nine experiments for the five parameters are provided in the following table.   After analyzing Table 4 and Figure 20, it is found that: 1. When the number of additional convolution layers is 0, the number of parameters and computation (params and MAC) of the two methods are the same, but the prediction error of the dual-channel method is less than that of the ordinary CNN method. After analyzing Table 4 and Figure 20, it is found that: 1. When the number of additional convolution layers is 0, the number of parameters and computation (params and MAC) of the two methods are the same, but the prediction error of the dual-channel method is less than that of the ordinary CNN method.

2.
The prediction error of adding 4 layers in the dual-channel method is close to that of adding 15 layers in the ordinary CNN method, but the amount of parameters (params) of the latter is 1.6 times that of the former, and the amount of computation (MAC) of the latter is 3.2 times that of the former. 3.
The number of parameters and computation (params and MAC) of the two methods increases with the increase in the number of network layers, but the dualchannel method grows more slowly. Therefore, the dual-channel method effectively reduces the number of parameters and computation, that is, improves the computational efficiency.

Conclusions
To improve the accuracy and stability of the localization of the oil tank in remote sensing images, this paper proposes an FGMRST-based CNN method. Experiments based on SkySat show that: 1.
In the FGMRST-based CNN method, the average prediction error of the dual-channel method is reduced by 14.64% compared with the ordinary CNN method, which effectively improves the accuracy.

2.
In the shallowest network (net_conv2), the average prediction error of the dualchannel method is reduced by 19.66% compared with the ordinary CNN method. In the deeper network (net_conv3), the average prediction error is reduced by 15.73%. In the deepest network (net_alex), the average prediction error is reduced by 8.54%. It shows that the proposed method is still better than the method using only CNN when the number of network layers increases.

3.
In the FGMRST-based CNN method, the dual-channel method significantly improves the computational efficiency compared with the ordinary CNN method.
The main work and innovation of this paper are as follows: An improved FRST algorithm based on the characteristics of oil tank images is proposed, FGMRST. The advantages and disadvantages of FGMRST and CNN are discussed, and the FGMRST-based CNN method is designed, which is different from the traditional deep learning method that expends the receptive field by sacrificing resolution or increasing the model size. With only a little extra computation cost of FGMRST, the dual-channel method can efficiently aggregate spatial features, effectively improve the accuracy and stability of localization of the oil tank and improve the real-time performance in practical application.
The disadvantage of this study are as follows: Due to the limitation of time and the author's ability, only two combination methods of FGMRST and CNN are designed in this paper. In future work, we will continue to explore the combination of the traditional parameterized feature extraction method and the deep learning method. A key challenge in the work was the cast of shadows of the object. It was addressed using a gradient approach. However, there are still some unwanted shadows in the image (such as those caused by the uneven ground and tank top), and the method in this paper cannot eliminate the interference caused by these shadows. Therefore, in future work, we will try to combine our method with some advanced shadow detection and removal methods (such as a framework for automatic detection and removal of shadows proposed by Salman et al. [28], and methods based on paired regions proposed by Guo et al. [29,30]) to eliminate the localizing interference caused by redundant shadow information.