TRDet: Two-Stage Rotated Detection of Rural Buildings in Remote Sensing Images

Fast and accurate acquisition of the outlines of rural buildings from remote sensing images is an effective method for monitoring illegal rural construction. Traditional object detection produces useless background information when detecting rural buildings; semantic segmentation cannot accurately separate the contours of adjacent buildings; and instance segmentation cannot obtain regular building contours. Rotated object detection can effectively solve the problem that traditional methods cannot accurately extract building outlines. However, rotated object detection methods easily lose the location information of small objects in high-level feature maps and are sensitive to noise. To resolve these problems, this paper proposes a two-stage rotated object detection network for rural buildings (TRDet) built on a deep feature fusion network (DFF-Net) and a pixel attention module (PAM). Specifically, TRDet first fuses low-level location information and high-level semantic information through DFF-Net and then reduces the interference of noise on the network through PAM. The experimental results show that the mean average precision (mAP), precision, recall rate, and F1 score of the proposed TRDet are 83.57%, 91.11%, 86.5%, and 88.74%, respectively, outperforming the R2CNN model by 15%, 15.54%, 4.01%, and 9.87%. The results demonstrate that TRDet achieves better detection of small and dense rural buildings.


Introduction
For a long time, much cultivated land has been illegally occupied for homesteads, especially in rural China, due to the lack of unified planning and the random siting of rural building construction. Remote sensing technology is widely used in hydrological analysis, dynamic detection, fire severity assessment, habitat monitoring, etc. [1][2][3][4][5][6]. With the rapid development of remote sensing technology, remote sensing image acquisition has become more convenient and low-cost [7]. Effective and scientific management of rural buildings based on remote sensing images has proved to be a competitive approach, which is of great significance to the sustainable development of urban and rural areas [8][9][10].
Deep learning (DL) has become increasingly popular recently and has obtained state-of-the-art results on building datasets [11,12]. According to their detection principles and outputs, this paper divides DL-based building detection methods into three categories: object detection, semantic segmentation, and instance segmentation.
The semantic segmentation method is a per-pixel classification algorithm [13]. The contours of buildings can be extracted from remote sensing images using semantic segmentation, and in recent years semantic segmentation methods have been widely used in building area extraction and change detection [14][15][16]. To solve the above issues, this paper proposes a two-stage rotated object detection model for small and dense rural buildings, called the two-stage rotated detection network (TRDet). TRDet uses DFF-Net to fuse semantic information across different scale-spaces, adds PAM to suppress the interference of noise information on the network, and uses an IoU (intersection over union) loss function to regress the object box. Experiments show that our network model achieves good detection performance.
The rest of this paper is organized as follows. In Section 2, the dataset and method are explained. Section 3 introduces the experimental settings. Section 4 presents the results of the experiment. Section 5 shows a brief discussion of this work. Finally, Section 6 introduces the conclusion of the paper.

Dataset and Methods
This Section describes the dataset and then introduces the proposed methods used to detect rural buildings, including DFF-Net and PAM. DFF-Net addresses the loss of small objects' location information and insufficient sampling as network depth increases, and PAM effectively reduces the interference of noise information on the network.

Rural Building Dataset
The study area was located in Yichang City, China. Because remote sensing datasets often mix images from multiple sources and multiple acquisition times, model validation can easily be affected by errors [41]. To overcome these problems, the data source of this dataset was unmanned aerial vehicle (UAV) imagery of rural areas in Zigui County in 2014 and Dianjun District in 2020, with a spatial resolution of 0.2 m. The UAV used in this paper was a vertical take-off and landing UAV (independently developed by China Three Gorges University and equipped with a power suite and a flight control navigation system), carrying a positioning system (GPS/GLONASS) and a modified camera. The camera has 42 million effective pixels, a sensor size of 35.8 mm × 23.9 mm, and a focal length of 40 mm. Raw images were stored in RGB mode (7952 × 5304 pixels) in JPEG format. The UAV's average relative flight altitude was 1800 m, the forward (course) overlap rate was 80%, and the side overlap rate was 70%.
To improve the diversity of the data, the selected images included buildings of different densities, distribution directions, and sizes. In addition, during the selection process, we recorded precise geographic coordinates to ensure that there were no duplicate images among the selected pictures. We cropped the UAV images to 1000 × 1000 pixels and marked each image with five-value (x, y, w, h, θ) labeling [41]. To ensure that the data distributions of the training and validation sets were approximately the same, we randomly divided the dataset into training and validation sets, as shown in Table 1. When the UAV captures images, differences in flight altitude lead to slight changes in image resolution, especially in mountainous areas. As shown in Figure 1, the pixel size of rural buildings in our dataset ranged from 15 to 270 pixels. In addition, as shown in Table 2, we divided all instances in the dataset into three parts according to the height of the horizontal bounding box and judged object size according to the thresholds defined in [42]: rural buildings under 32 pixels were small objects, rural buildings over 96 pixels were large objects, and the rest were medium objects. As shown in Figure 2, the dataset contains various building patterns, architectures, scene features, lighting conditions, and styles. Although these varied rural buildings can easily be identified by visual inspection, this is not easy for DL models. The challenges of the dataset include small rural buildings, different building densities, confusing details, and complex land-cover types in object areas.
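The size grouping described above can be sketched as a small helper (the 32- and 96-pixel thresholds come from the text; the function name is illustrative):

```python
def size_category(box_height_px: float) -> str:
    """Bin a building instance by the height of its horizontal bounding
    box, following the thresholds used for Table 2: < 32 px is small,
    > 96 px is large, everything in between is medium."""
    if box_height_px < 32:
        return "small"
    if box_height_px > 96:
        return "large"
    return "medium"
```

For instance, the 15-pixel and 270-pixel extremes of the dataset fall into the small and large bins, respectively.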

Method
This Section outlines the two-stage rotated object detection model TRDet. As shown in Figure 3, in the first stage, TRDet uses the first four layers of ResNet101 [43] as the backbone network to extract features from top to bottom; DFF-Net and PAM then obtain richer feature information and reduce noise. In the second stage, RPN is first used to generate dense horizontal anchors, and regression based on five parameters (x, y, w, h, θ) together with rotated non-maximum suppression (R-NMS) is carried out to obtain results for arbitrarily rotated boxes. This paper adopts ROI Align [21] to align features and uses a global average pooling (GAP) layer instead of a fully connected layer.
The backbone network used in TRDet is shown in Figure 4. ResNet is mainly composed of residual block structures with convolutional layers of sizes 1 × 1, 3 × 3, and 1 × 1. The backbone network of TRDet contains 91 convolutional layers, and a ReLU layer follows each convolutional layer. Given an input image, the backbone produces the feature maps C* = {C1, C2, C3, C4}. The feature maps are 1/2, 1/4, 1/8, and 1/16 of the original image size, with 64, 256, 512, and 1024 channels, respectively.
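The stride and channel progression of C1–C4 can be sketched as follows (a minimal illustration of the shapes stated above; integer division stands in for the padding behaviour of the real network):

```python
def backbone_shapes(h: int, w: int):
    """Return (channels, height, width) of the feature maps C1..C4 for an
    input of size h x w, using the strides (2, 4, 8, 16) and channel
    counts (64, 256, 512, 1024) described in the text."""
    strides = [2, 4, 8, 16]
    channels = [64, 256, 512, 1024]
    return [(c, h // s, w // s) for c, s in zip(channels, strides)]
```

For the 1000 × 1000 crops used in this paper, C4 comes out at roughly 62 × 62 with 1024 channels.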

DFF-Net
To solve the issues of small buildings' location information being lost and insufficient sampling in high-level feature maps, we fuse the features of C3 and C4 through DFF-Net and set the stride of the anchor to an appropriate size to ensure sufficient anchor samples for small targets. Specifically, DFF-Net first resamples C3 and C4 to the same scale relative to the original image, and the number of channels in C3 is expanded through the deep feature module (DFM) so that the resulting feature map D has the same number of channels as C4. Finally, D is fused with the resampled C4 to obtain the final feature map F3.
As shown in Figure 5, the DFM is divided into two parts: one part retains the original feature information, and the other fuses spatial information at different scales to guide the feature transformation in the original feature space C3; convolving features at two different scales, in particular, expands the receptive field. Given an input X_n = [x_1, x_2, ..., x_n] ∈ R^(n×h×w) and an output Y_m = [y_1, y_2, ..., y_m] ∈ R^(m×h×w), a conventional 2D convolutional layer F is parameterized by a group of filter sets K = [k_1, k_2, ..., k_m], where k_i denotes the i-th set of filters with size n. The output feature map at channel i can be written as

y_i = k_i * X_n,

where * denotes convolution. Part One: a feature transformation on C_3 is performed based on K_1:

D_1 = F_1(C_3) = K_1 * C_3.

Part Two: first, we apply average pooling with filter size 4 × 4 and stride 4 on C_3:

P = AvgPool_(4×4, s=4)(C_3).

The second step fuses the two features at different scales by element-wise summation:

S = C_3 + Up(F_2(P)),

where F_2(P) = K_2 * P and Up(·) is a bilinear interpolation operator that converts the feature map to the same size as C_3. A weight value is obtained through the sigmoid function to guide the transformation of the original feature map; this third operation can be formulated as

D'_2 = σ(S) · F_3(C_3),

where σ(·) is the sigmoid function, F_3(C_3) = K_3 * C_3, and · denotes element-wise multiplication. After that, the feature map D_2 is obtained through a convolution operation:

D_2 = K_4 * D'_2.

Finally, D_1 and D_2 are concatenated to obtain D with the same number of channels as C_4. Eventually, DFF-Net fuses the feature maps of the up-sampled C_4 and D by element-wise summation to reach F_3. The feature map F_3 obtained by our fusion method balances semantic information with location information well. Compared with FPN [44], DFF-Net adds only a few parameters and reduces the interference of useless information from the low-level feature maps.
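The DFM computation above can be sketched in numpy. This is a minimal sketch under simplifying assumptions: the filter sets K1–K4 are reduced to 1 × 1 convolutions (channel-mixing matrices), nearest-neighbour repetition stands in for the bilinear Up(·), and all names and shapes are illustrative.

```python
import numpy as np

def conv1x1(x, k):
    # x: (n, h, w), k: (m, n) -> (m, h, w); a 1x1 convolution is a channel mix
    return np.einsum('mn,nhw->mhw', k, x)

def avg_pool4(x):
    # 4x4 average pooling with stride 4 (input sides assumed divisible by 4)
    n, h, w = x.shape
    return x.reshape(n, h // 4, 4, w // 4, 4).mean(axis=(2, 4))

def upsample4(x):
    # nearest-neighbour stand-in for the bilinear Up(.) in the text
    return x.repeat(4, axis=1).repeat(4, axis=2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dfm(c3, k1, k2, k3, k4):
    d1 = conv1x1(c3, k1)                              # Part One: D1 = K1 * C3
    p = avg_pool4(c3)                                 # Part Two, step 1: pooled C3
    s = c3 + upsample4(conv1x1(p, k2))                # step 2: element-wise fusion
    d2 = conv1x1(sigmoid(s) * conv1x1(c3, k3), k4)    # steps 3-4: gated transform
    return np.concatenate([d1, d2], axis=0)           # D = [D1; D2]
```

With C3 carrying 4 channels and K1/K4 each producing 6, the concatenated D carries 12 channels, mirroring how the real module expands C3 to match C4.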
This paper sampled the pre-fusion feature maps to different sizes to accommodate different anchor strides (SA). Table 3 lists the detection accuracy and time consumption for the different strides. We found that with SA = 4, the detection mAP of the rotated detection (RD) task and the horizontal detection (HD) task was 0.01% and 0.26% higher, respectively, than with SA = 8. This further validated that the smaller the SA, the higher the expected max overlapping score [40], and the better the model captures small buildings. However, with SA = 4 the model took nearly twice as long as with SA = 8; therefore, TRDet sets the SA to the more efficient 8.

PAM
In remote sensing images, ground objects are complex; thus, the RPN can easily introduce a lot of noise information into the ROI. This noise interferes with the network, increasing the probability of false and missed detections. A great number of studies have shown that attention mechanisms can effectively reduce the interference of noise information on the network [45][46][47].
We compared several attention mechanisms experimentally and introduced a supervised PAM to reduce the interference of noise information on the network. As shown in Figure 6, the feature map F3 is convolved to obtain a two-channel saliency map. Because the saliency map is continuous, non-object information is not eliminated entirely. The values of the saliency map are then limited to [0, 1] by the softmax function and multiplied element-wise with F3 to obtain the final feature map A3. In addition, the PAM produces a binary image (background 0, foreground 1) from the ground truth. The cross-entropy loss between the binary image and the saliency map is used as part of the loss function to guide model learning.
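The PAM forward pass described above can be sketched in numpy (a minimal sketch: a 1 × 1 convolution stands in for the saliency convolution, and the names are illustrative):

```python
import numpy as np

def pixel_attention(f3, w_sal):
    """Minimal PAM sketch: produce a two-channel saliency map from F3,
    softmax it over the channel axis so each pixel gets a foreground
    probability in [0, 1], then re-weight F3 pixel-wise."""
    sal = np.einsum('cn,nhw->chw', w_sal, f3)        # 2-channel saliency map
    e = np.exp(sal - sal.max(axis=0, keepdims=True))
    prob = e / e.sum(axis=0, keepdims=True)          # softmax over the 2 channels
    a3 = f3 * prob[1]                                # channel 1 = foreground prob
    return a3, prob[1]
```

Note that non-object pixels are attenuated rather than zeroed, matching the remark above that non-object information is not eliminated entirely.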

IoU Loss Function
The IoU loss function is defined as [28]

L = (α/N) Σ_{n=1..N} t_n Σ_{j ∈ {x,y,w,h,θ}} L_reg(v'_{nj}, v*_{nj}) + (β/(h×w)) Σ_i Σ_j L_att(u_{ij}, u*_{ij}) + (γ/N) Σ_{n=1..N} L_cls(p_n, t_n),

where x, y, w, h, and θ denote the proposed box center coordinates, width, height, and angle, respectively; N is the number of boxes; t_n is a binary value (t_n = 1 for foreground, t_n = 0 for background, and the background undergoes no regression); v*_{nj} and v'_{nj} represent the true and predicted target vectors; u*_{ij} and u_{ij} represent the labeled and predicted mask pixels, respectively; p_n is the probability distribution over categories calculated by the softmax function; t_n also serves as the object label in the classification term; and α, β, and γ are hyper-parameters. In addition, the regression loss L_reg is the IoU-smooth L1 loss [28], the classification loss L_cls is the softmax cross-entropy [48], and the pixel attention loss L_att is the pixel-wise softmax cross-entropy. The hyper-parameters were set to α = 4, β = 1, γ = 2.
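The |−log(IoU)| factor at the heart of the IoU-smooth regression term can be illustrated with axis-aligned boxes (a sketch only: the paper regresses rotated boxes, whose IoU is more involved to compute):

```python
import math

def iou_axis_aligned(a, b):
    """IoU of two axis-aligned boxes (x1, y1, x2, y2); a stand-in for the
    rotated-box IoU used in the paper."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def iou_reg_weight(iou, eps=1e-6):
    """The |-log(IoU)| modulation: near 0 when the boxes overlap well,
    large when they barely overlap."""
    return abs(-math.log(max(iou, eps)))
```

Identical boxes give IoU = 1 and a weight of 0, while poorly overlapping boxes are penalized heavily, which is why this term suppresses the angular-boundary loss surge discussed later.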

Experimental Settings
This Section describes the evaluation metrics and related implementation details used in this paper.

Evaluation Metrics
This paper adopted the mAP, Recall (R), Precision (P), and F1 score (F1) to quantitatively evaluate the performance of the proposed methods. The mAP is widely used to estimate the quality of object detectors, and the DOTA metric was adopted to compute mAP. P measures the model's ability to find only relevant objects, that is, the proportion of the model's predictions that hit true objects. R measures the model's ability to find all relevant objects, that is, the number of true objects that the model's predictions can cover at most. The F1 score was used to evaluate the overall performance of the model. P, R, and F1 were calculated by

P = TP / (TP + FP), R = TP / (TP + FN), F1 = 2 × P × R / (P + R),

where TP is true positives, FP is false positives, and FN is false negatives. If the IoU between a ground truth and a prediction exceeds 0.5, the prediction counts as a TP when the class labels match and as an FP otherwise. Ground truths without corresponding TP predictions were labeled FN. As shown in Figure 7, in the HD task, the IoU is calculated as

IoU = area(B_p ∩ B_gt) / area(B_p ∪ B_gt),

where B_p is the predicted box and B_gt is the ground-truth box.
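The three metric formulas above can be checked with a few lines of Python:

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """P, R, and F1 exactly as defined in the text."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1
```

For example, 8 TPs with 2 FPs and 2 FNs give P = R = F1 = 0.8.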

Implementation Details
All experiments were performed with TensorFlow [49] version 1.12 and run on an Intel Xeon E5-2680 v3 processor (produced by Intel, Taiwan, China), 128 GB of memory, and an Nvidia GeForce RTX 2080Ti GPU (produced by Leadtek, Taiwan, China) with 11 GB of memory.
This paper adopted ResNet101 [43] as the pre-trained model to initialize the parameters of the feature extractor. To make full use of the pre-trained weight file, the fully connected layer was replaced by the C5 block of the pre-trained model to initialize the parameters. The ablation study in Section 2.2.1 showed that a suitable SA was 8; the size of the base anchor was 256, and the anchor ratios were 1, 1/2, 1/3, 1/4, 1/5, 1/6, 1/7, 1/8, and 1/9. The model was trained for 300,000 iterations with a learning rate of 0.0003. An anchor was assigned as a positive sample when IoU > 0.7 and as a negative sample when IoU < 0.3. To test the possible configurations and the performance of the hyper-parameters, we conducted several experiments and finally chose the optimal specific values above.
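The anchor geometry implied by these settings can be sketched as follows. The constant-area parameterization is an assumption on our part (common in RPN-style detectors), since the paper does not spell out how the ratios map to widths and heights:

```python
import math

def anchor_dims(base=256, ratios=(1, 1/2, 1/3, 1/4, 1/5, 1/6, 1/7, 1/8, 1/9)):
    """(width, height) pairs for base-256 anchors at the ratios listed in
    the text, keeping the anchor area roughly constant at base**2."""
    dims = []
    for r in ratios:
        w = base * math.sqrt(r)   # ratio r = w / h
        h = base / math.sqrt(r)
        dims.append((round(w), round(h)))
    return dims
```

Under this convention, ratio 1 yields a 256 × 256 anchor, and the smaller ratios produce progressively taller, narrower boxes suited to elongated buildings.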

Results
This Section explores the potential of TRDet in the detection and classification of rural buildings in remote sensing images of urban and suburban areas. Section 4.1 presents the ablation study, and Section 4.2 presents the results on the rural building detection dataset.

Ablation Study
An ablation study was carried out to verify the effectiveness of the proposed modules. This Section introduces the influence of DFF-Net, PAM, and the IoU loss function on R, P, F1, and mAP, respectively.

Effect of DFF-Net
Feature fusion and reducing the SA were effective means of improving building detection. As shown in Table 4, adding DFF increased the mAP by 12.03%. We also compared against SF-Net [28]; this method can improve building detection results, but it does not obtain the best results. We attribute the improvement in accuracy to our proposed DFM: the feature map F3 obtained by DFF-Net better balances semantic information and location information.

Effect of PAM
As shown in Table 4, the performance of PAM is better than that of the other attention modules, resulting in a 1.74% increase in network mAP.

Effect of the IoU Loss Function
With the IoU loss function, the mAP increased by 1.23%. As shown in Figure 8, it can effectively guide the network's learning and make it easier for the model to obtain object coordinates.

Result on the Rural Building Dataset
We compared performance across four different models on our dataset: R2CNN [38], R3Det [36], SCRDet [28], and our TRDet model. All experimental data and parameter settings remained the same to fairly validate the models.

RD Task
The purpose of this experiment was to evaluate rotated detection performance. The experimental results of the four models on the validation set are shown in Table 5. Our proposed TRDet model achieved state-of-the-art performance, with a mAP of 83.57%, P of 91.11%, R of 86.5%, and F1 of 88.74%. As shown in Figure 9, our model achieves excellent detection results on small and dense buildings, and it can even identify small buildings sandwiched between two large buildings in Figure 9b. Compared with the R2CNN model, our model improved these performance indicators significantly, by 15%, 15.54%, 4.01%, and 9.87%, respectively. As shown in Figure 10, R2CNN ignored many small buildings and even identified fields incorrectly as buildings. In the first row, R2CNN missed the rural building with a special roof, while our model detected this building with high confidence. In the second row, the R2CNN model mistakenly regarded a large truck and a canopy as rural buildings, while TRDet correctly identified them as background. In the third row, the R2CNN model was not as good as TRDet at locating the coordinate edges of rural buildings. In the third and fourth rows, the R2CNN model considered arable land to be rural buildings, while TRDet identified it as background. TRDet also performs well in complex environments. These detection results demonstrate the superiority of our algorithm. Compared with the SCRDet model, which is an improved R2CNN model, the proposed TRDet model increased mAP by 3.07%. The mainstream single-stage R3Det model achieved a mAP of 77.35%, P of 82.64%, R of 85.96%, and F1 of 84.26%. As shown in Figure 11, TRDet worked better than the other two models. In the second row, R3Det left detection boxes un-suppressed by R-NMS for relatively dense buildings. In the third row, R2CNN incorrectly identified two buildings as one.
This further proved that our modules could make the network achieve a wonderful recognition effect in difficult scenarios.

HD Task
This experiment evaluates horizontal detection performance. The results of the HD task for the three models are shown in Table 6; our model also achieves the best performance, with mAP, P, R, and F1 scores of 86.21%, 91.11%, 88.77%, and 89.92%, respectively. As shown in Figure 12, the detection results of the HD boxes contain a large amount of useless information, and the detection results in dense scenes are messy and difficult to identify. R2CNN missed many objects, and the performance of SCRDet was also unsatisfactory.

Comparison of Similar Studies and the Contribution of TRDet
The area detected by traditional object detection contains much background information. Figure 12 shows the results of horizontal bounding-box detection. Many limitations arise in practical applications of such results, such as estimating the area of buildings. Rotated object detection can more accurately locate the position of tilted objects. Dickenson et al. [50] proposed a rotated object detection model based on VGG and BFP and experimented on four cities (Las Vegas, Paris, Shanghai, Khartoum) in the DeepGlobe challenge dataset, but obtained poor building detection results in dense scenes. In this paper, the feature network adopts ResNet to obtain better accuracy. Wen et al. [51] introduced RRPN into Mask RCNN for building rotation detection. However, RRPN needs to generate enough boxes with different angles in practical applications, resulting in a large time overhead; if the spacing of the angles increases, the model's accuracy decreases.
R2CNN is a network converted from Faster-RCNN, which first generates a series of horizontal boxes and then performs rotation regression. However, because R2CNN directly performs regression and classification in the last layer, the features learned by the network are relatively simple, and there are many false detections. In the RD and HD tasks, the P obtained by TRDet was 15.54% and 7.48% higher than R2CNN, respectively. As a single-stage rotated object detection method, R3Det introduced the FRM, and its accuracy improved compared with other single-stage rotated object detection networks, but its training speed was slow. In the RD task, the mAP of the TRDet method was 15% and 6.22% higher than R2CNN and R3Det, respectively.
This paper provides a highly effective method for detecting rural buildings with rotated object detection. The proposed methodology effectively addresses the problems that rotated object detection methods easily lose the location information of small objects in high-level feature maps and are sensitive to noise. SCRDet is also a method that improves network accuracy through feature fusion and an attention model, but TRDet obtains a 3.07% higher mAP in the RD task. As shown in Figure 9, we can observe from the experimental results that TRDet delivers effective detection performance on small and dense rural buildings.

Comparison of Different Models
As in Section 4.1, we compared similar models to demonstrate the effectiveness and superiority of our method, and the experimental results are satisfactory. The proposed methods have the following three advantages:

1.
In rural building detection, the size of buildings may vary greatly due to different flight altitudes. The proposed DFF-Net can extract rural buildings of different scales. Unlike traditional feature fusion methods, the feature map produced by the DFM integrates information at two scales. Compared with ordinary channel expansion, D better balances semantic information and location information and obtains a larger receptive field. As a common feature extraction network, FPN has a complex structure and many parameters. In the HD task, the mAP of the TRDet method was 11.64% higher than FPN, the P was close, and the R differed by 12.89%. DFF-Net uses the feature maps of the C3 and C4 layers, ignoring the less relevant bottom-level features, and adds only a small number of parameters. In Table 4, the mAP of our DFF-Net was 3.03% higher than that of SF-Net. We attribute the improvement in accuracy to our proposed DFM. As shown in Figure 10, DFF-Net can closely fit the contours of large buildings and can also capture small buildings.

2.
The noise in remote sensing images affects the model during the training phase, resulting in false and missed detections. The attention mechanism is a common method for alleviating noise interference; however, not all attention mechanisms are effective here. In Section 4.1.2, we can see that accuracy decreased by 1.24% after adding SE [46] (a typical channel attention module). The MDA [28], which uses channel attention, was also unsatisfactory. Specifically, channel attention assigns different weights to each channel, which raises the weights of simple samples and ignores the information of hard samples, reducing detection accuracy on hard samples. The PAM module assigns supervised weights to each pixel to scale the generated feature maps between zero and one, which reduces the influence of noise and enhances the information of target objects without eliminating non-object information. Therefore, it is highly effective at alleviating false and missed detections. In Figure 12, SCRDet cannot effectively distinguish the boundary between two buildings when buildings are dense.

3.
Due to the periodicity of the angle, the traditional smooth L1 function is prone to sudden increases at the boundary. The IoU loss used in this paper, for which |−log(IoU)| ≈ 0 under the boundary condition, can eliminate this surge. As shown in Figure 8, the IoU loss can accurately evaluate the loss of the predicted box relative to the true box during training.

Future Work
In this paper, the model was trained with an image size of 1000 × 1000 pixels, so images must be cut to the required size. However, this may cause some buildings to be divided into multiple parts. Figure 11d shows that the proposed method can find incomplete buildings, but the confidence for such objects is lower than for common buildings. Therefore, reasonably setting the confidence screening threshold can effectively avoid this false detection phenomenon. In actual testing, adjusting the overlap rate of the cropped images according to the resolution can also avoid missed detections in this case.
Building extraction is still an open problem that needs more research. In the future, we plan to establish a larger dataset, including common buildings, buildings under construction, buildings with special shapes, and buildings with complex backgrounds, and then design and train dedicated networks to detect these buildings.

Conclusions
This paper presents a two-stage rotated object detection model for rural buildings based on deep feature fusion and pixel attention modules. Compared with the R2CNN network, our DFF-Net can effectively enlarge the receptive field and fuse feature information at different scales, while PAM eliminates the interference of noise information on the network. Experimental results on a newly collected large remote sensing dataset with diverse rural buildings under complex backgrounds show that each module plays an effective role, and our TRDet network model achieves good recognition results; it improves the mAP by 15% and achieves good performance in rural building detection. Future work will focus on improving the speed of model training and the generalization ability of the network. In addition, we will also try to extend our rural building detection model to other object detection tasks.
Author Contributions: All authors contributed in a substantial way to the manuscript. B.P. conceived, designed, and performed the research and wrote the manuscript. D.R., C.Z. and A.L. made contributions to the design of the research and data analysis. All authors discussed the basic structure of the manuscript. All authors have read and agreed to the published version of the manuscript.