STC-Det: A Slender Target Detector Combining Shadow and Target Information in Optical Satellite Images

Abstract: Object detection has made great progress, but the detection of slender targets in optical satellite remote sensing images remains inadequate because of the way these images are formed. First, the satellite perspective is small, so the features of slender targets are severely lost during imaging and the detector has too little information to work with. Second, the appearance of a slender target in the image depends strongly on the satellite perspective, which limits the generalization of conventional detection models. We address both problems. First, we introduce the shadow as auxiliary information to complement the trunk features of the target that are lost in imaging. Second, because shadow information is not affected by the satellite perspective, we design STC-Det, which treats the shadow and the target as two different types of targets and uses the shadow to assist detection, reducing the impact of the satellite perspective. To improve the performance of STC-Det, we propose an automatic matching method (AMM) for shadows and targets and a feature fusion method (FFM). Finally, we propose a new method for calculating detector heatmaps, which verifies the effectiveness of the proposed network visually. Experiments show that the precision of STC-Det increases by 1.7% when the satellite perspective is variable and by 5.2% when the satellite perspective is small.


Introduction
In recent years, object detection has made great progress [1][2][3][4][5]. Slender target detection is an important part of object detection. As a typical slender target, the high-voltage transmission tower is one of the most important pieces of infrastructure in the power transmission system. Its operating state affects the operation of the entire power grid and the safety of people [6,7], and detecting transmission towers helps to monitor their operating status. Early inspection of transmission towers was mainly manual, which was time-consuming and laborious. Later, UAVs were introduced to increase detection efficiency [8][9][10][11][12][13][14]. However, because high-voltage transmission towers are sparsely distributed and a single UAV image covers a small area, large-scale inspection of transmission towers is still difficult. Recently, with the development of satellite remote sensing, spaceborne monitoring methods have matured, and a single scene of satellite imagery covers a much wider area, which makes large-scale detection of sparse targets possible.
A great deal of research has been conducted on the detection of high-voltage transmission towers in remote sensing images. Parkpoom et al. [15] used Canny and Hough to detect candidates and then classified them to detect power transmission towers in aerial images. Wang et al. [16] established an aerial data set and compared the performance of Faster R-CNN [17] and YOLO-v3 [18]. These studies are all based on aerial images, whose single-scene coverage is small. Because high-voltage transmission towers are generally sparsely distributed, it is difficult to complete large-scale inspections quickly using aerial images. Existing research on transmission tower detection in satellite remote sensing is mainly based on SAR images. Zhou et al. [19] used machine learning methods to detect power transmission towers in SAR images. Wenhao et al. [20] built a two-stage detector combining YOLO v2 and VGG to improve the accuracy and efficiency of transmission tower detection in SAR images. Tian et al. [21] proposed an improved rotating box detector (DRBox) and a wire consumables estimation model based on high-resolution optical satellite remote sensing images to count transmission towers and estimate related consumables. Current research based on optical satellite remote sensing achieves results similar to those based on SAR. Both are limited to the image level and do not use external information such as target characteristics and observation conditions, so the advantages of optical satellite remote sensing are not fully exploited.
Huang et al. [22] proposed SI-STD, a network that introduces shadow information to compensate for the target features lost during imaging. When illumination is sufficient, this largely solves the problem of information loss. However, SI-STD is only a preliminary attempt to use external information such as shadows for detection. It does not fully consider how imaging conditions affect the presentation of features, and it still depends strongly on the features present in the training data set. When the view angle changes, the model generalizes poorly and its performance decreases. The underlying issue is that the shadow and the target have completely different origins: the shadow is determined entirely by the solar elevation angle, and its appearance in the image has nothing to do with the satellite perspective, whereas the appearance of the target in the image depends essentially on the satellite perspective. SI-STD nevertheless merges the shadow and the target and detects them as one large target. This approach is clearly affected by the satellite perspective, and any change in these parameters affects the results. As shown in Figure 1, when the satellite perspective is relatively small, the target itself is barely visible in the image and recognition relies essentially on shadow information. When the satellite perspective is larger, the target gradually becomes more visible, and recognition can consider the features of both the shadow and the target. Because SI-STD merges the two, it cannot automatically assign weights to the shadow and the target. Figure 2b shows the heatmap of SI-STD.
The darker the color in the figure, the higher the attention of the detector. Most of the attention of SI-STD is on the target itself and little is on the shadow, even though the shadow in the image clearly contains more information than the target. This confirms that although SI-STD can solve the problem of feature loss, it still has shortcomings when the satellite perspective changes. In summary, the detection of high-voltage transmission towers in optical satellite images still faces the following problems:
• During imaging, a large number of features of slender targets are lost. As shown in Figure 1, the main features of stumpy targets such as houses are concentrated on the top. This information is not lost during imaging, so traditional object detection methods are little affected. In contrast, the main features of slender targets such as high-voltage transmission towers are concentrated on the vertical trunk. During imaging, the trunk is strongly compressed in the vertical direction and many features are lost, which hinders object detection.
• According to the imaging geometry model of optical satellite remote sensing, the imaging results are greatly affected by the satellite perspective. The same target appears differently under different satellite perspectives. As shown in Figure 1a,b, high-voltage transmission towers look different under different satellite perspectives, and target information and shadow information contribute differently to detection in each case.
From the above analysis, the shadow and the target should not be merged. For this reason, we propose STC-Det, which treats the shadow and the tower as two types of targets to be detected separately and automatically assigns weights to shadow information and tower information.
In summary, the contributions of this paper are as follows:
• STC-Det is proposed, which broadens the application range of slender target detection and the range of satellite perspectives over which it can be applied.
• Drawing on deformable convolution, an automatic shadow and target matching method is designed. It achieves fast matching of shadows and targets with only a small increase in network complexity.
• A new feature fusion method is designed. It fuses shadow features and target features and weights them automatically, which further improves the utilization of shadow and target feature information.
• To show intuitively how shadow information and target information influence detection when the satellite perspective changes, we improve Group-CAM [23] so that it can be used to visualize heatmaps for object detection, and we use it to verify the effectiveness of STC-Det.

Materials and Methods
In this part, we first analyze the imaging geometry model of optical satellite images, then introduce the overall architecture of STC-Det and its two main modules, AMM and FFM, and finally introduce our method for heatmap calculation. Figure 3 shows the imaging geometry model of slender targets in optical satellite images. Unlike stumpy targets, slender targets have most of their features distributed along the trunk, on the side elevation of the target. Optical satellites use side-view imaging. When the satellite perspective is small, most of the trunk features are compressed during imaging. Even when the satellite perspective is not too small, some of the trunk features are inevitably compressed. As a result, the same slender target produces different images under different satellite perspectives. The shadow of the slender target is produced by the sun. It reflects the trunk characteristics of the target and has nothing to do with the satellite perspective. Shadows are flat features, so according to the imaging model the image of a shadow is essentially unaffected by the satellite perspective. Therefore, when the satellite perspective is small, we can use shadow information to supplement the trunk information lost in imaging. When the satellite perspective changes, the imaged shadow remains essentially unchanged, so we can use the constant shadow to assist in detecting the changing target, thereby reducing the impact of the satellite perspective on the detection of slender targets. Based on this property, we propose STC-Det.
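The geometric argument above can be illustrated with a small numerical sketch. It is not taken from the paper: it assumes flat terrain, interprets the "satellite perspective" as the off-nadir viewing angle, and uses hypothetical function names. Under those assumptions, the ground-projected length of a vertical tower depends on the viewing angle, while the shadow length depends only on the solar elevation.

```python
import math

def projected_tower_length(height_m, off_nadir_deg):
    """Ground-projected length of a vertical tower (assumption: flat terrain,
    'satellite perspective' read as the off-nadir viewing angle)."""
    return height_m * math.tan(math.radians(off_nadir_deg))

def shadow_length(height_m, solar_elevation_deg):
    """Shadow length on flat ground; independent of the viewing angle."""
    return height_m / math.tan(math.radians(solar_elevation_deg))

if __name__ == "__main__":
    height = 50.0  # hypothetical tower height in metres
    for view_angle in (0.0, 5.0, 15.0, 30.0):
        print(view_angle,
              round(projected_tower_length(height, view_angle), 2),
              round(shadow_length(height, 45.0), 2))
```

With a near-nadir view the tower trunk collapses to almost nothing in the image, while the shadow column stays constant for a fixed sun position.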

STC-Det
As shown in Figure 4, STC-Det keeps the structure of Faster R-CNN [17] but improves its region proposal network (RPN) and Head, which we name DBBoxRPN and DBBoxHead, respectively. In DBBoxRPN, the candidate regions of the shadow and the target are extracted separately, and the automatic matching module (AMM) matches each shadow with its target. In DBBoxHead, the FFM module fuses shadow features and target features and weights them automatically.
In Figure 4, BackBone extracts features of the data at different scales. It is composed of ResNet [24] and FPN [25]. ResNet is a deep convolutional network made up of convolutional, pooling, and activation layers, and it is used to extract features at different depths. FPN is a feature combination network that combines the features extracted by ResNet into feature groups of different scales and depths. The input of BackBone is the original image $x_{img} \in \mathbb{R}^{3 \times H_0 \times W_0}$ after normalization and other preprocessing, and the output is a five-layer feature.
The role of DBBoxRPN is similar to that of the RPN in Faster R-CNN, except that it must extract not only the candidate regions of the target but also the candidate regions of the shadow corresponding to the target, as shown in Figure 4a. DBBoxRPN consists of a shared convolution layer and five convolution channels. The five layers of features extracted by BackBone are input into DBBoxRPN in turn, and all feature layers share the same set of convolution operations. The specific formula is as follows: where $C_0(\cdot)$, $C_1(\cdot)$, $C_2(\cdot)$, $C_3(\cdot)$, $C_4(\cdot)$, and $C_5(\cdot)$ are six convolutional layers of different sizes, and $p_s$, $p_t$, $b_s$, $b_t$, and $d$ are, respectively, the confidence of the shadow candidate regions predicted by the network, the confidence of the target candidate regions, the regression of the shadow candidate regions, the regression of the target candidate regions, and the offset of the shadow relative to the target (the offset is used for the automatic matching of shadow and target and is described in detail in the AMM section). Each regression has the form $(x, y, w, h)$, where $(x, y)$ is the regression of the center of the prediction box, and $w$ and $h$ are the regressions of the width and height of the predicted region, respectively.
The offset is $d = (d_x, d_y)$, where $d_x$ and $d_y$ are the offsets of the shadow center relative to the target center of the same target in the x and y coordinates, respectively. The ground truths of the shadow and the target are $b_{st} = (x_{st}, y_{st}, w_{st}, h_{st})$ and $b_{tt} = (x_{tt}, y_{tt}, w_{tt}, h_{tt})$, respectively. The ground truths of the categories of the shadow and the target are $p_s$ and $p_t$, respectively. The anchors of the shadow and the target are $b_{sa} = (x_{sa}, y_{sa}, w_{sa}, h_{sa})$ and $b_{ta} = (x_{ta}, y_{ta}, w_{ta}, h_{ta})$. The predicted proposals for the shadow and the target are $b_{sp} = (x_{sp}, y_{sp}, w_{sp}, h_{sp})$ and $b_{tp} = (x_{tp}, y_{tp}, w_{tp}, h_{tp})$, respectively. They have the following relationships:

$$x_{sp} = x_{sa} + x_s w_{sa}, \quad x_{tp} = x_{ta} + x_t w_{ta} \tag{6}$$
$$y_{sp} = y_{sa} + y_s h_{sa}, \quad y_{tp} = y_{ta} + y_t h_{ta} \tag{7}$$
$$w_{sp} = w_{sa} e^{w_s}, \quad w_{tp} = w_{ta} e^{w_t} \tag{8}$$
The deviations of the ground truth of the shadow and the target from the anchor are $b_s = (x_s, y_s, w_s, h_s)$ and $b_t = (x_t, y_t, w_t, h_t)$, respectively. They can be obtained by the following equations:
After extracting the shadow and target candidate regions, the shadow and the target must be matched. From the imaging geometry model, the shadow in the image is related to the target. The shadow is the image of the target cast by the sun, so the two must be connected in space, and the side-view imaging of the satellite does not change this. The AMM module (described in detail in Section 2.3) uses the relationship between the shadow and the target in the image to match the extracted shadow candidate regions with the target candidate regions. DBBoxRPN outputs a series of paired proposals $P = \{(p_{si}, p_{ti}),\ i = 1, 2, 3, \ldots, N\}$, where $p_{si}$ and $p_{ti}$ are the shadow candidate region and the target candidate region of the same target, and $N$ is the maximum number of candidate regions extracted from a single image.
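The anchor parameterization in Equations (6)-(8) can be sketched in Python as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, and the height decoding and the inverse (encoding) step are assumed to follow the standard Faster R-CNN form, since those equations are not shown above. The same two functions apply to shadow and target boxes alike.

```python
import math

def decode_box(anchor, delta):
    """Apply a regression (x, y, w, h) to an anchor (x_a, y_a, w_a, h_a).

    The x, y, w terms follow Equations (6)-(8); the h term is assumed
    to mirror the w term.
    """
    xa, ya, wa, ha = anchor
    dx, dy, dw, dh = delta
    return (xa + dx * wa, ya + dy * ha, wa * math.exp(dw), ha * math.exp(dh))

def encode_box(anchor, gt):
    """Deviation of a ground-truth box from an anchor (assumed inverse of decode_box)."""
    xa, ya, wa, ha = anchor
    x, y, w, h = gt
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))

if __name__ == "__main__":
    anchor = (100.0, 120.0, 32.0, 16.0)   # hypothetical shadow anchor
    gt = (108.0, 116.0, 40.0, 12.0)       # hypothetical shadow ground truth
    delta = encode_box(anchor, gt)
    print(delta)
    print(decode_box(anchor, delta))      # recovers gt up to rounding
```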
The loss of DBBoxRPN is: where $N_t$ represents the total number of extracted target candidate regions, $N_s$ represents the total number of extracted shadow candidate regions, $N_l$ represents the number of target candidate regions that can be matched to shadow candidate regions, and $d$ represents the ground truth of the offset from the shadow center to the target center. $L_{cls}$, $L_{reg}$, and $L_{loc}$ are the classification, regression, and position shift losses, respectively. $L_{cls}$ uses FocalLoss [26], with gamma and alpha set to 2.0 and 0.25, respectively. $L_{reg}$ and $L_{loc}$ use SmoothL1Loss (as shown in Formula (15)). The weight of each loss in Formula (14) is 1.
The proposals extracted by DBBoxRPN are passed through RoIAlign [27] to extract the corresponding features $F_s$ and $F_t$, for example $F_s = B(P_s)$, where $B(\cdot)$ is an interpolation operation.
Finally, $F_s$ and $F_t$ are input into DBBoxHead for classification and regression. DBBoxHead contains two channels. One channel is used for classification and contains a feature fusion module (FFM) and two fully connected layers. The other channel is used to regress the target position and contains only two fully connected layers, specifically: where $c$ represents the category results of the network, $B$ represents the target position regression results of the network, $fc(\cdot)$ represents a fully connected layer, $flatten(\cdot)$ represents the flattening operation, and $FFM(\cdot)$ represents the feature fusion module (described in detail in Section 2.3).
The loss of DBBoxHead is: where $L_{cls}$ uses cross-entropy loss, and $L_{reg}$ uses SmoothL1Loss (as shown in Formula (15)). The weight of each loss in Formula (18) is 1.
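To make the two loss terms named above concrete, the following sketch gives scalar versions of the focal loss with the stated gamma = 2.0 and alpha = 0.25 and of the smooth L1 loss. It is only an illustration under assumptions: the binary form of the focal loss and the threshold beta = 1 in the smooth L1 loss are the common textbook forms, not values confirmed by the text, and the function names are hypothetical.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for a predicted probability p and a label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def smooth_l1(x, beta=1.0):
    """Smooth L1 loss of a residual x (beta = 1 is an assumption)."""
    ax = abs(x)
    if ax < beta:
        return 0.5 * ax * ax / beta
    return ax - 0.5 * beta

if __name__ == "__main__":
    # A confident correct prediction is down-weighted relative to a wrong one.
    print(focal_loss(0.9, 1), focal_loss(0.1, 1))
    # Quadratic near zero, linear for large residuals.
    print(smooth_l1(0.5), smooth_l1(2.0))
```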
The total loss of the entire network is:

AMM
The automatic matching module (AMM) is located in DBBoxRPN, as shown in Figure 4b, and is used to match the extracted shadow candidate regions with the target candidate regions. Generally, matching a shadow with its target relies on the position information of both. Preliminary matching can be carried out this way, but the candidates must be matched and screened one by one, which is slow, and matching by position alone is also error-prone. AMM matches shadow candidate regions with target candidate regions simply and quickly. Figure 5b is a schematic diagram of deformable convolution [28]. Deformable convolution learns a position offset for each sampling position through the network, which changes the receptive field of the convolution kernel and allows convolution over arbitrary shapes. Deformable convolution can be calculated by the following formula: where $f$ represents the feature before convolution, $F$ represents the feature after convolution, and $(\Delta x_n, \Delta y_n)$ represents the learned offset of each feature point $(x_0 + x_n, y_0 + y_n)$. In effect, deformable convolution learns a mapping from the convolution kernel to different parts of the target. According to the imaging geometry model of the optical satellite, a target and its shadow are connected. If the target and the shadow are regarded as a whole, they are two parts of that whole. Drawing on the idea of deformable convolution, we designed AMM to learn the mapping from the center of the target to the center of the shadow directly, which is equivalent to learning the mapping between the two parts. With this mapping, we can map each target candidate region to the position of its shadow candidate region and match the candidate regions quickly.
As shown in Figure 5a, the blue rectangle represents the candidate region of the shadow, the red rectangle represents the candidate region of the target, the blue dot represents the center point of the shadow, and the red dot represents the center point of the target. Using the mapping from the target to the shadow, we can obtain the shadow center point directly from the target candidate region. If the center point of a shadow candidate region coincides with the obtained shadow center point, that shadow candidate region matches the target candidate region. The specific calculation is shown below.
The calculation of the offset $d$ from the target center to the shadow center is shown in Section 2.1. The ground truth of the center offset is $d = (d_x, d_y)$: When matching the shadow with the target, the matching shadow center $b_s = (x_s, y_s)$ can be obtained from the target candidate region $b_{tp}$ and the center offset $d = (d_x, d_y)$: Through the above calculations, we obtain the shadow center $b_s$ and the predicted shadow candidate regions $b_{sp}$. We then only need to find the shadow prediction region that corresponds to the calculated shadow center $b_s$. If there is a shadow prediction region $b$ centered on $b_s$, then $b$ is the shadow prediction region that matches the target prediction region $b_{tp}$.
In training, the predicted regions of both the target and the shadow differ from the ground truths. To improve robustness, after the shadow center $b_s$ is obtained, it is not compared directly with the center of each shadow prediction region. Instead, we first calculate the center of each shadow prediction region. If that center falls within a certain range centered on $b_s$, the region is regarded as a matching candidate, and we then select the candidate with the greatest confidence as the final result.
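For reference, the deformable convolution described above is usually written in the following form. This is the common formulation from the deformable convolution literature, stated with the symbols defined in the text; the kernel weight $w_n$ and the sampling grid $\mathcal{R}$ are notation assumed here, not symbols taken from the paper.

```latex
F(x_0, y_0) = \sum_{(x_n, y_n) \in \mathcal{R}} w_n \,
  f\bigl(x_0 + x_n + \Delta x_n,\; y_0 + y_n + \Delta y_n\bigr)
```

Setting every offset $(\Delta x_n, \Delta y_n)$ to zero recovers an ordinary convolution over the regular grid $\mathcal{R}$.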
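The matching rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name and data layout are hypothetical, and the predicted offset is assumed to be normalized by a constant scale (written `lam` here) so that the shadow center is the target center plus `lam` times the offset. The search radius plays the role of the "certain range" around the computed shadow center.

```python
import math

def match_shadow(target_box, offset, shadows, lam=50.0, radius=50.0):
    """Return the (box, confidence) of the best-matching shadow, or None.

    target_box: (x, y, w, h) of a target candidate region (center format).
    offset:     predicted (d_x, d_y); assumed normalized by lam.
    shadows:    list of ((x, y, w, h), confidence) shadow candidate regions.
    """
    xt, yt, _, _ = target_box
    xs = xt + lam * offset[0]   # estimated shadow center, x
    ys = yt + lam * offset[1]   # estimated shadow center, y
    in_range = [
        (box, conf) for box, conf in shadows
        if math.hypot(box[0] - xs, box[1] - ys) <= radius
    ]
    if not in_range:
        return None
    # Among the candidates near the estimated center, keep the most confident.
    return max(in_range, key=lambda item: item[1])

if __name__ == "__main__":
    target = (100.0, 100.0, 10.0, 40.0)
    shadows = [((125.0, 92.0, 30.0, 10.0), 0.6),
               ((118.0, 95.0, 28.0, 9.0), 0.9),
               ((300.0, 300.0, 30.0, 10.0), 0.99)]
    print(match_shadow(target, (0.4, -0.2), shadows))
```

The distant shadow is ignored despite its high confidence, because only candidates inside the search radius compete.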

FFM
The feature fusion module (FFM) is located in DBBoxHead (as shown in Figure 4c), and its main function is to fuse the features used for classification. DBBoxRPN extracts not only the candidate regions of the target but also the shadow candidate regions corresponding to the target. When classifying, we can use only the shadow features, only the target features, or the combined features of the two with automatically set weights, as shown in Figure 6. From the analysis of the satellite imaging geometry model, when the satellite perspective is small, the features of the target are not very obvious while the shadow information is more obvious, so the shadow information can be used for classification. In theory, as long as the fusion weights of the two types of features are allocated reasonably, classification using both target and shadow features should perform better. For this reason, we designed the FFM module to perform feature selection and fusion. Figure 6 shows six feature fusion methods, of which (a) and (b) use only one of the shadow and target features, (c) and (d) add the shadow and target features together, and (e) and (f) concatenate the shadow and target features. The three fusion methods on the left in Figure 6 do not use 1 × 1 convolution after fusion, while the three methods on the right do. A 1 × 1 convolution is a point-by-point convolution across channels. Here it acts as a weighted average and can learn the weighting coefficients dynamically. After the above theoretical analysis and comparative experiments (Table 1), we choose Figure 6f as the FFM module, which is specifically:

$$F_{cls} = \mathrm{Conv}(\mathrm{concat}(F_s, F_t))$$

where $F_s$, $F_t$, and $F_{cls}$ represent the extracted shadow features, the extracted target features, and the output of FFM, respectively, $\mathrm{concat}(\cdot)$ stands for the feature stacking operation, and $\mathrm{Conv}(\cdot)$ stands for the 1 × 1 convolution operation.

Table 1. Experiment results of using different feature fusion methods. This table shows the experimental results of the fusion methods shown in Figure 6. All results are obtained on the TowerM data set. "Feature selection" indicates which information is selected: "shadow" means that only shadow information is used, "tower" means that only tower information is used, "add" means that shadow information and tower information are added together, and "concat" means that shadow information and tower information are concatenated. "Postoperation" refers to the operation applied after the information is selected, where "none" means no operation and "1*1 conv" means that a convolution with a 1 × 1 kernel is applied. The last two columns show the changes in AP and AR at an IoU threshold of 0.5 for the different operations.
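The fusion chosen above (concatenation followed by a 1 × 1 convolution) can be illustrated with a small dependency-free sketch. The function names and the nested-list tensor layout `[channels][height][width]` are assumptions made for this example, and the weights would be learned in practice rather than set by hand. The point of the sketch is that a 1 × 1 convolution is a per-pixel weighted sum over channels, which is why it can act as a learned weighting of shadow and target features.

```python
def concat_channels(f_s, f_t):
    """Stack shadow and target features along the channel axis."""
    return f_s + f_t

def conv1x1(feat, weight, bias):
    """1 x 1 convolution: feat is [C_in][H][W], weight is [C_out][C_in]."""
    height, width = len(feat[0]), len(feat[0][0])
    return [
        [
            [
                bias[o] + sum(weight[o][c] * feat[c][i][j] for c in range(len(feat)))
                for j in range(width)
            ]
            for i in range(height)
        ]
        for o in range(len(weight))
    ]

def ffm(f_s, f_t, weight, bias):
    """Fuse shadow and target features: Conv(concat(F_s, F_t))."""
    return conv1x1(concat_channels(f_s, f_t), weight, bias)

if __name__ == "__main__":
    f_s = [[[1.0, 2.0]]]        # one shadow channel, 1 x 2 map
    f_t = [[[3.0, 4.0]]]        # one target channel, 1 x 2 map
    weight = [[0.25, 0.75]]     # hypothetical weights: 25% shadow, 75% target
    print(ffm(f_s, f_t, weight, [0.0]))
```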

Heatmap Visualization Algorithm
Group-CAM is a heatmap visualization algorithm for image classification. We have improved Group-CAM to enable it to be used for object detection. The calculation of Group-CAM is divided into four steps:

1. Input the original image $I_0$ into the network $F$ and extract the feature layer $A$ to be viewed and the corresponding gradient values $W$.
2. Filter the original image $I_0$ to obtain $I_1$. Use the extracted gradient values $W$ to weight the feature layer $A$, and group the result to obtain the feature masks $M = [M_1, M_2, M_3, \cdots, M_L]$, where $L$ is the number of groups.
3. Merge each feature mask in $M$ with the filtered image $I_1$ by weighting to obtain the masked image $I$. Input $I$ into the network to obtain the weight of the corresponding feature layer.
4. Use the obtained feature layer weights to weight the corresponding feature layers and obtain the final heatmap:

$$CAM = \alpha M \tag{30}$$

In image classification, inputting the data into the network yields the confidence score of the corresponding category, and this score can be used directly as the weight. In object detection, inputting the data into the network yields a series of prediction regions and confidences, which cannot be used directly as a weight.
To make Group-CAM suitable for object detection tasks, we adjusted the third step of the original Group-CAM. When calculating the weight in the third step, we do not use the final result directly but use the loss of the corresponding data as the weight. The loss behaves in the opposite way to the original confidence: in the original Group-CAM, the greater the confidence, the greater the weight, whereas here, the greater the loss, the smaller the weight, specifically:
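The idea of turning a per-mask detection loss into a heatmap weight can be sketched as follows. The exact mapping used by the authors is not recoverable from the text, so the normalized `exp(-loss)` mapping below is purely an assumption chosen because it decreases as the loss grows. The function names are hypothetical, and the final combination follows the weighted-sum form of Equation (30).

```python
import math

def loss_to_weights(losses):
    """Map per-mask losses to weights; larger loss gives smaller weight.

    Assumption: weights are exp(-loss), normalized to sum to 1.
    """
    raw = [math.exp(-loss) for loss in losses]
    total = sum(raw)
    return [r / total for r in raw]

def combine_masks(weights, masks):
    """Weighted sum of L masks, each an H x W nested list."""
    height, width = len(masks[0]), len(masks[0][0])
    return [
        [sum(a * m[i][j] for a, m in zip(weights, masks)) for j in range(width)]
        for i in range(height)
    ]

if __name__ == "__main__":
    masks = [[[1.0, 0.0]], [[0.0, 1.0]]]   # two toy 1 x 2 masks
    weights = loss_to_weights([0.1, 2.0])  # first masked image has lower loss
    print(weights)
    print(combine_masks(weights, masks))
```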

Data Set
The data used in the experiment include three types of transmission towers with heights between 15 m and 70 m, namely the wine-glass tower, the dry-type tower, and the cat-head tower, the same as in SI-STD [22]. To verify the effectiveness of the methods proposed in this paper, we built two data sets. To distinguish them, we named the first data set "TowerS" and the second "TowerM", as shown in Figure 7. The satellite perspective of TowerS is relatively small, and the satellite perspective of TowerM is variable. Both data sets are cropped from BJ-2, WV-2, and Google Earth imagery, and each image is 600 × 600 pixels. TowerS contains 1574 training images and 686 test images, while TowerM contains 2575 training images and 904 test images. The ratio of the three tower types in all data sets is approximately 1:1:1.

Training Configurations
All experiments in this paper are carried out in the PyTorch environment. The experimental platform has two RTX 2080 Ti graphics cards, and the batch size is set to 4. The experimental optical images are 600 × 600 pixels and, to facilitate subsequent operations, are zero-padded to 608 × 608. Training runs for a total of 20 epochs. The initial learning rate is set to 0.001667. Over the following 3600 steps, the learning rate increases by 9.44 × 10^−7 at each step until it reaches 0.005. From the seventh to the twelfth epoch, the learning rate drops to 0.0005, and from the thirteenth to the twentieth epoch, it drops to 0.00005. In the DBBoxRPN part, the network uses an anchor-based method to extract the shadow and target candidate regions. There are 16 shadow anchors, with aspect ratios [1.0, 2.0, 3.0, 4.0] and scales [4, 8, 16, 32]. For the target, 12 anchors are set, with aspect ratios [0.5, 1.0, 2.0, 3.0] and scales [4, 8, 16, 32]. When matching the shadow with the target, $\lambda_x$ and $\lambda_y$ in Equations (21)-(24) are set to 50. To improve the robustness of matching, after calculating the center position of the predicted shadow region, we perform matching within a radius of 50 (this value is determined through experiments, as shown in Figure 8).
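The learning-rate schedule described above can be written as a small function. This is a sketch under assumptions: the function name and signature are hypothetical, epochs are counted from 1, and the rate is assumed to stay at 0.005 between the end of the warm-up and the seventh epoch, which the text implies but does not state.

```python
def learning_rate(step, epoch,
                  base=0.001667, increment=9.44e-7,
                  peak=0.005, warmup_steps=3600):
    """Learning rate for a global step (from 0) and an epoch (from 1)."""
    if epoch >= 13:
        return 0.00005
    if epoch >= 7:
        return 0.0005
    if step < warmup_steps:
        # Linear warm-up, capped at the peak value.
        return min(peak, base + increment * step)
    return peak

if __name__ == "__main__":
    for step, epoch in [(0, 1), (1000, 2), (3600, 6), (5000, 8), (9000, 14)]:
        print(step, epoch, learning_rate(step, epoch))
```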

Parameter Selection and Ablation Experiment
In the AMM of the DBBoxRPN part, when matching the shadow with the target, we start from the tower and look for the corresponding shadow. From the center of the target candidate region and the offset from the target to the shadow, we obtain the position of the shadow center $b_s$. The shadow candidate region centered on $b_s$ could be selected directly as the matching shadow, but the shadow and target candidate regions predicted by the network deviate somewhat from the ground truth, so using $b_s$ directly introduces errors. To reduce errors and improve the robustness of the network, we do not use $b_s$ directly as the shadow center. Instead, we treat the shadow candidate regions within a circle centered on $b_s$ as possible matches and select the one with the highest confidence as the matching shadow. The search radius is a hyperparameter. To find the optimal value, we conducted a series of experiments with different search radii, as shown in Figure 8. The experiments show that the precision is highest when the circle is centered on $b_s$ and the search radius is 50. This value is used in subsequent experiments.
In the FFM of the DBBoxHead part, we propose several feature fusion methods, as shown in Figure 6. To choose the best one, we conducted further experiments on these methods. The results are shown in Table 1. In this table, "shadow" means that only shadow information is used, "tower" means that only target information is used, and "add" and "concat" mean that shadow and target information are used together. Using only shadows for classification gives the lowest precision, followed by using only targets. Classification using both shadow and target information is more precise than using either type of information alone. This supports our idea of using shadows to complete the missing target information and assist classification. Regarding the postoperation, except when shadow information alone is used for classification, every method is more accurate with 1 × 1 convolution than without it. A 1 × 1 convolution is equivalent to a weight allocator, which indicates that redistributing feature weights before classification is effective.

Figure 8. The precision corresponding to different search radii in AMM. All results are obtained on the TowerM data set. The abscissa represents the search radius in pixels, ranging from 1 to 65. The ordinate represents the precision corresponding to each search radius, ranging from 0.8 to 1.0. The blue line shows the change in AP at an IoU threshold of 0.5 for different search radii, and the red line shows the change in AR at an IoU threshold of 0.5.
The result in row 3 in Table 1 is the result when only the target information is used for classification and no postoperation is performed. This experiment just divides the classification and regression of the Head part in the original Faster R-CNN framework into two channels for processing separately. Comparing the results of Faster R-CNN in Table 2, it can be seen that in the final classification, separating classification and regression can improve the precision and reduce the mutual influence of classification and regression. Comparing the results of the first four rows with the last four rows of Table 1, it can be seen that the comprehensive use of shadow and target information for classification can indeed improve the final precision. This also reflects that our AMM is indeed effective and can accurately match the shadow to the target.

Performance on the Two Data Sets
In the experiment part, there are five comparison networks: Faster R-CNN [17], TSD [29], ATSS [30], RetinaNet [26], and SI-STD [22]. To verify the effectiveness of STC-Det when the satellite perspective changes, we established two data sets, called TowerS and TowerM. The satellite perspective of the data in TowerS is relatively simple, while that of the data in TowerM is variable. There are six experimental indicators: "AP", "AP50", "AP75", "AR", "Params", and "FPS". "AP", "AP50", and "AP75" represent the precision of the networks under different IoU thresholds; "AR" represents the average recall of the networks under different sizes of bboxes; "Params" represents the number of network parameters; and "FPS" represents the number of frames the network processes per second. The calculations of "Precision" and "Recall" are shown in Formulas (33) and (34), and the calculation of the six indicators is consistent with the MS COCO data set [31]. The performance of the experimental networks on the two data sets is shown in Tables 2 and 3.
Tables 2 and 3 show that the precision of STC-Det on the TowerS data set is 5.2% higher than that of the other networks, and that its precision on the TowerM data set is still 1.7% higher (here, considering the average precision AP). On the TowerS data set, the recall of STC-Det increased by 1.6%, and STC-Det also maintained a high recall on the TowerM data set. Meanwhile, the number of parameters of STC-Det did not increase much and is far smaller than that of TSD, indicating that STC-Det improves the detection precision without a significant increase in parameters. The FPS index, however, shows that STC-Det is the slowest of these networks, almost twice as slow as the others. This is because the network needs to extract the shadow and target candidate regions at the same time and also to match the shadow to the target, which greatly increases the amount of computation. When the requirements on detection speed are low, however, it is worthwhile to trade detection speed for a substantial increase in precision.
Figure 9 shows the visualization results of the networks used in the experiment part. Together with Tables 2 and 3, it shows that all of these networks can complete the detection task. Faster R-CNN, TSD, ATSS, and RetinaNet perform unremarkably on both data sets. SI-STD performs better when the satellite perspective is small, but its precision decreases when the satellite perspective increases. A possible reason is that the target features gradually become obvious as the satellite perspective increases. SI-STD then attends almost only to the target, so the shadow next to the target interferes with the detection. STC-Det performs well in both cases, which may be due to its separate use of shadow and target features and the different weights it assigns to them.
Figure 10 can also explain this phenomenon. The heatmaps show that traditional networks such as Faster R-CNN, TSD, ATSS, and RetinaNet focus on the target itself and ignore the shadow next to it. When the satellite perspective is small, the target features are very inconspicuous, which makes the detection precision low. When the satellite perspective increases, the target features become obvious and the detection precision improves. Figure 10e shows that when the satellite perspective is small, SI-STD pays attention to the shadow and the target at the same time. Because the target features are not obvious in this case, the network leans toward the shadow features, which is reasonable. When the satellite perspective increases, however, the target features become obvious. SI-STD then attends almost only to the target features, but it still extracts the shadow features, and the extracted shadow interferes with the overall detection and reduces the detection precision. Figure 10f shows that STC-Det can comprehensively use shadow and target features regardless of whether the satellite perspective is large or small. When the satellite perspective is small, STC-Det pays more attention to the shadow features; when it is large, STC-Det is more inclined to focus on the target features, which matches our expectation. This indicates that, compared with SI-STD, STC-Det can flexibly set different weights for shadow and target features and can automatically select different degrees of attention in different situations, which broadens the range of satellite perspectives to which the network applies.
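The precision and recall indicators at a fixed IoU threshold, as used in this section, can be made concrete with a small sketch. It is a simplified illustration and not the MS COCO evaluation code: the helper names (`iou`, `precision_recall`), the greedy score-ordered matching, and the example boxes are assumptions. The full COCO AP additionally integrates precision over recall levels and averages over several IoU thresholds, which is omitted here.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(detections, ground_truths, iou_threshold=0.5):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN).

    detections:    list of (score, box) pairs
    ground_truths: list of boxes
    Detections are matched greedily in descending score order; each
    ground-truth box can be matched at most once.
    """
    matched = set()
    tp = 0
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truths):
            if idx in matched:
                continue
            value = iou(box, gt)
            if value > best_iou:
                best_iou, best_idx = value, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            tp += 1
    fp = len(detections) - tp
    fn = len(ground_truths) - tp
    precision = tp / (tp + fp) if detections else 0.0
    recall = tp / (tp + fn) if ground_truths else 0.0
    return precision, recall


# Hypothetical example: two towers, three detections (one is a false alarm).
gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(0.9, (0, 0, 10, 10)), (0.8, (21, 21, 31, 31)), (0.3, (50, 50, 60, 60))]

p50, r50 = precision_recall(dets, gts, iou_threshold=0.5)    # 2/3 and 1.0
p75, r75 = precision_recall(dets, gts, iou_threshold=0.75)   # 1/3 and 0.5
```

In this example the second detection overlaps its ground-truth box with an IoU of 81/119 (about 0.68), so it counts as a true positive at the 0.5 threshold but as a false positive at 0.75, which is why the stricter AP75 indicator is generally lower than AP50.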

Discussion
The trunk information of a slender target in optical satellite remote sensing is greatly compressed during imaging, and its appearance is strongly affected by the satellite perspective. In most cases, the shadow of a slender target reflects the missing structural features of the target well, and the imaging of the shadow is basically unaffected by the satellite perspective. Based on this property, we designed the STC-Det network, which uses shadow information to assist in detecting slender targets and automatically adjusts the weights of shadows and targets under different satellite perspectives, which improves the detection accuracy and the robustness to the satellite perspective. SI-STD was our preliminary attempt to detect slender targets based on shadow information. Building on SI-STD, STC-Det is proposed to address the impact of the satellite perspective on detection. However, some problems remain. First, shadow-based detection methods depend heavily on the quality of the shadows. In some extreme weather or under special solar parameters, the shadow information is blurred, which not only fails to help the detection but also interferes with it. Second, the current method makes little use of external information such as satellite parameters and is a completely data-driven detector, which requires a large amount of data and computation and makes the algorithm inefficient. To address this, external prior information such as satellite parameters could be added to reduce the dependence on data and the complexity of the algorithm. Finally, the existing research does not consider drastic changes in the solar altitude and the solar azimuth, which is also a problem to be addressed in the future.
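The claim that the shadow is largely unaffected by the satellite perspective can be illustrated with a simplified flat-terrain geometry. The sketch below is not the imaging model analyzed in this paper. It assumes that "satellite perspective" corresponds to the off-nadir viewing angle, that the ground is flat and horizontal, and that the image is referenced to the ground plane. The function names and the numeric values (a 50 m tower, 45° solar elevation) are hypothetical.

```python
import math


def shadow_length(height_m, sun_elevation_deg):
    """Ground length of the shadow cast by a vertical object on flat terrain."""
    return height_m / math.tan(math.radians(sun_elevation_deg))


def relief_displacement(height_m, off_nadir_deg):
    """Apparent ground-plane offset of the object's top from its base.

    This is the length over which the side of the object is visible in the
    image; it is zero for a nadir view and grows with the off-nadir angle.
    """
    return height_m * math.tan(math.radians(off_nadir_deg))


tower_height = 50.0   # metres, hypothetical
sun_elevation = 45.0  # degrees, hypothetical

for off_nadir in (0.0, 10.0, 30.0):
    side = relief_displacement(tower_height, off_nadir)
    shadow = shadow_length(tower_height, sun_elevation)
    print(f"off-nadir {off_nadir:4.1f} deg: side extent {side:6.2f} m, "
          f"shadow length {shadow:6.2f} m")
```

Under these assumptions the visible side extent of the tower shrinks to zero as the view approaches nadir, while the shadow length depends only on the tower height and the solar elevation. The same formula also shows why changes in solar altitude matter: a higher sun shortens the shadow and leaves less structural information to exploit.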

Conclusions
In this paper, we identified two problems in the detection of slender targets in optical satellite images. First, the main features of slender targets, which are located on the side elevation of the trunk, are greatly compressed during imaging, which is not conducive to detection. Second, the appearance of slender targets such as power transmission towers in the image is strongly affected by the satellite perspective, and the same target looks completely different under different satellite perspectives. To address these two problems, we analyzed the imaging geometry model of slender targets and their shadows in optical satellite images. Slender targets are easily affected by changes in the satellite perspective, while their shadows are less affected by it. When the satellite perspective is small, shadows can be used to make up for the missing side elevation shape information of the slender targets. When the satellite perspective changes, the invariant shadow features can be used to assist in the interpretation of the changing slender targets. Based on this idea, we propose STC-Det. Imitating deformable convolution, we designed the AMM module to automatically match the shadow candidate regions with the target candidate regions. The FFM module, which was selected after many experiments, fuses the shadow and target information and automatically distributes the weights between the two. Finally, to show intuitively how changes in the satellite perspective affect the imaging of slender targets and their shadows, we improve the existing heatmap algorithm to make it suitable for object detection and use it to verify the proposed method visually.
In this paper, we propose a series of improvements for the detection of high-voltage power transmission towers and other slender targets in optical satellite images. These methods are effective, but shortcomings remain. Some hyperparameters of the STC-Det network need to be adjusted manually, and the detection speed of the network needs to be improved. Therefore, the next phase of work will focus on reducing the number of hyperparameters in the network and improving its efficiency.