Surround-Net: A Multi-Branch Arbitrary-Oriented Detector for Remote Sensing

Abstract: With the development of oriented object detection technology, especially in the area of remote sensing, significant progress has been made, and multiple excellent detection architectures have emerged. Oriented detection architectures can be broadly divided into five-parameter systems and eight-parameter systems, which encounter the periodicity problem of angle regression and the discontinuity problem of vertex regression during training, respectively. Therefore, we propose a new multi-branch anchor-free one-stage model, called Surround-Net, that can effectively alleviate the corner cases that arise when representing rotated objects. The contributions of this paper mainly include three aspects. Firstly, a multi-branch strategy is adopted so that the detector adaptively chooses the best regression path, addressing the discontinuity problem. Secondly, to address the inconsistency between classification and quality estimation (location), a modified high-dimensional Focal Loss and a new Surround IoU Loss are proposed to enhance the unity of the features. Thirdly, in the refinement process after backbone feature extraction, a center vertex attention mechanism is adopted to deal with the environmental noise introduced in remote sensing images. This auxiliary module focuses the model's attention on the boundary of the bounding box. Finally, extensive experiments were carried out on the DOTA dataset, and the results demonstrate that Surround-Net solves the regression boundary problems and achieves more competitive performance (e.g., 75.875 mAP) than other anchor-free one-stage detectors at higher speeds.


Introduction
As one of the most important basic tasks in computer vision, object detection has attracted the attention of many researchers and has been studied extensively. The main goal of this task is to find the location of an object in an image and label the category that it belongs to. It can be used as a key component in many downstream tasks.
With the rise of deep learning, object detection has developed rapidly. Architectures can be divided into two-stage and one-stage architectures depending on the processing pipeline [1]. For example, the R-CNN family [2][3][4] and its improvements [5][6][7] follow the "coarse-to-fine" pipeline. In contrast, one-stage architectures, such as the YOLO family [8][9][10][11], SSD [12], and RetinaNet [13], complete the detection task in one step. Although they run fast, their accuracy is lower than that of two-stage architectures. In addition, models can also be divided into the anchor-based category [2][3][4][5][6][7][9][10][11][12][13][14], which requires multiple anchors of different sizes and aspect ratios, and the anchor-free category [15][16][17][18], in which the object is represented as a point. The former has higher accuracy, while the latter is easier to train and tune.
Recently, object detection in arbitrary orientations has received more and more attention. In the field of remote sensing, oriented detection can identify not only densely packed objects but also objects with huge aspect ratios. Figure 1 shows the limitations of horizontal object detection. Most oriented object detectors are adapted from horizontal object detectors, and they can be broadly divided into five-parameter models [19][20][21][22][23][24][25] and eight-parameter models [26][27][28][29][30]. The five-parameter models add an angle parameter (x, y, w, h, θ) and have achieved great success. However, these detectors inevitably suffer from the angle boundary problem during training. As shown in Figure 2a, assuming that the center coordinates of the two boxes coincide (without loss of generality), the red box (prediction, −1/2π) needs to match the brown box (ground truth, 3/8π), which can be achieved by adjusting the angle and size. An ideal regression method rotates the prediction box counterclockwise without resizing it. However, due to the boundary characteristics of the angle, the prediction box cannot achieve this because of the sudden change at the boundary. Instead, it must rotate 3/8π clockwise and adjust its size simultaneously. SCRDet [25] introduces an IoU loss to enable the model to find a better regression path, but it cannot eliminate the problem. The eight-parameter models were proposed to solve the angle problems present in the five-parameter models. They treat detection as a point-based task by directly predicting four vertices. However, such a direct method introduces new problems. First, it is necessary to sort the vertices when calculating the regression loss; otherwise, an almost identical rectangle can still produce a colossal loss (as shown in Figure 2b). Second, as described in Figure 2c, the sorted order may still be sub-optimal.
In addition to the clockwise regression method (green line), there is an ideal regression method (brown line). RSDet [31,32] proposes comparing the loss after moving the vertices one unit clockwise and one unit counterclockwise. However, this approach only solves part of the problem. When faced with the situation shown in Figure 2d, no matter how the adjustments are made, there are still sub-optimal paths (two longer regression paths always exist).
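The periodicity issue above can be made concrete in a few lines of code: a naive five-parameter angle regression target can be several times larger than the physical rotation actually needed. This toy example (our illustration, not code from the paper) uses the Figure 2a configuration:

```python
import math

def angle_distance_5param(theta_pred, theta_gt):
    """Plain regression distance between the two angle parameters (radians)."""
    return abs(theta_pred - theta_gt)

def rotation_needed(theta_pred, theta_gt):
    """Smallest physical rotation aligning the two boxes, accounting for
    the pi-periodicity of a rectangle's orientation."""
    d = (theta_gt - theta_pred) % math.pi
    return min(d, math.pi - d)

# Prediction at -pi/2, ground truth at 3/8*pi (the case in Figure 2a).
pred, gt = -math.pi / 2, 3 * math.pi / 8
print(angle_distance_5param(pred, gt))  # naive target: 7/8*pi
print(rotation_needed(pred, gt))        # true rotation: pi/8
```

The naive target is seven times the rotation actually required, which is exactly the boundary discontinuity that forces a detector to "take the long way around".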
After analyzing the above sorting and regression discontinuity problems, we propose a multi-branch eight-parameter model called Surround-Net. It decomposes the prediction process into multiple branches to take all cases into account in an anchor-free and one-stage way. To improve the model's consistency between training and testing, a modified multi-branch-aware adaptive Focal Loss function [13] is proposed so that branch selection can be trained jointly with classification. A size-adaptive dense prediction module is proposed to alleviate the imbalance between positive and negative samples. Moreover, to enhance the model's localization performance, this work also presents a novel center vertex attention mechanism and a geometric soft constraint. The results further show that Surround-Net solves the sub-optimal regression problem present in previous models and achieves competitive results (e.g., 75.875 mAP and 74.071 mAP at 12.57 FPS) in an anchor-free one-stage way. Overall, our contributions are as follows:

1. A multi-branch anchor-free detector for oriented object detection is proposed; it solves the sorting and sub-optimal regression problems encountered with eight-parameter models;
2. To jointly train branch selection and class prediction, we propose a modified Focal Loss function, and a size-adaptive dense prediction module is adopted to alleviate the imbalance between the positive and negative samples;
3. We propose a center vertex attention mechanism to distinguish the environment area and use soft constraints to refine the detection boxes.

Materials and Methods
First, we will provide an overview of the content structure. The architecture of our proposed anchor-free one-stage model is introduced in Section 2.1. Section 2.2 elaborates on the multi-branch structure and the adaptive function design for dense predictions as well as on the multi-branch adaptive Focal Loss for joint training. The prediction of the circumscribed rectangle and sliding ratios are discussed in Section 2.3, and the soft constraints for refinement are introduced in that section as well. Finally, we describe how to encode a rotating detection box using all of the predicted values. The center vertex attention mechanism for feature optimization is introduced in Section 2.4.

Architecture
As shown in Figure 3, the whole pipeline can be divided into four cascading modules: the feature extraction module, the feature combination module, the feature refinement module, and the prediction head module. Initially, we use the conv1–conv5 stages of ResNet-101 (ResNet-152 for better performance) and resize the final output feature map to 1/4 of the original input image size. In the up-sampling process, we first use 3 × 3 convolutions to resize the small-scale feature maps with rich semantic information to the same size as the feature maps from the previous levels. At the same time, a concatenation is performed with the feature map from the previous level, which contains finer details, via a 3 × 3 convolutional layer. Each concatenation is followed by a 1 × 1 convolutional layer to enhance the fusion of the channel elements in the feature map. Before entering the next up-sampling stage, batch normalization [33] and Leaky ReLU [34] are used to normalize the feature map and improve its nonlinear fitting ability. At the tail of the feature combination module, an attention module is proposed to refine the feature map; this part is detailed in Section 2.4. Inspired by TSD [35], we also added additional convolutional layers for each prediction head in a decoupled manner.
Three detection heads follow the refined feature map: the multi-branch selection and classification head, the circumscribed rectangle prediction head, and the sliding ratio prediction head. For the multi-branch selection and classification head, the number of filters is 4 × C, where C is the number of categories and 4 is the number of different branches; for the circumscribed rectangle prediction head, the number is 4, representing the four distances (l, r, t, b) from the center point to the corresponding circumscribed rectangle; for the sliding ratio prediction head, the number is 2, representing the two required ratios. As shown in Figure 4a–d, we divide the rotating bounding boxes obtained from the circumscribed rectangle into four cases corresponding to the multi-branch selection head. The yellow star represents the midpoint of the boundary of the circumscribed rectangle. According to the coordinates falling on the boundary, the regression process can be divided into two cases (Figure 4a–d). In the first case, starting from a vertex of the circumscribed rectangle and sliding in the horizontal and vertical directions, a rotating bounding box can be obtained optimally. In the second case, the sliding vertices can be selected either counterclockwise or clockwise. This again yields an optimal regression method and achieves the same results as RSDet [31,32]. The cases corresponding to Figure 4e,f can be regarded as the prediction of a horizontal bounding box whose sliding ratios are close to 0. Therefore, with the multi-branch regression method, the sub-optimal regression problem shown in Figure 2b–d is solved. Section 2.2 describes how to use multi-branch prediction to find the best regression method among the above four regression branches.
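The head layout described above can be sketched in PyTorch as follows. The channel counts (4 × C, 4, and 2) follow the text; the depth and width of each head are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class SurroundHeads(nn.Module):
    """Sketch of the three decoupled prediction heads. Channel counts
    (4*C, 4, 2) follow the text; each head's internal layers are
    illustrative assumptions rather than the paper's exact design."""
    def __init__(self, in_ch=256, num_classes=15):
        super().__init__()
        def head(out_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, in_ch, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_ch, out_ch, 1))
        self.branch_cls = head(4 * num_classes)  # 4 branches x C classes (PQES)
        self.rect = head(4)                      # distances (l, r, t, b)
        self.ratios = head(2)                    # two sliding ratios

    def forward(self, feat):
        scores = torch.sigmoid(self.branch_cls(feat))
        rect = self.rect(feat)
        ratios = 0.5 * torch.sigmoid(self.ratios(feat))  # within (0, 0.5)
        return scores, rect, ratios

x = torch.randn(1, 256, 150, 150)  # e.g. a 600x600 input down-sampled 4x
scores, rect, ratios = SurroundHeads()(x)
print(scores.shape, rect.shape, ratios.shape)
```

With the 15 DOTA categories the selection/classification head emits 60 channels per location, one PQES score per (branch, class) pair.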

Potential Points of the Object
According to Figure 3, the output of the multi-branch selection and classification head is a heat map in R^(4C×W'×H') for an input RGB image in R^(3×H×W), where H and W are the height and width of the image and H' = H/4, W' = W/4. We expand the prediction channels and replace the classification score with the PQES (prediction quality estimated score), explained in detail later, to measure whether the rotated bounding box obtained through the current branch has the highest IoU (Intersection-over-Union) with the ground truth.

To improve the model's generalization ability against possible manual labeling errors [36] and overfitting, we employ an adaptive modified Gaussian kernel function to label-smooth the center of the object together with the surrounding positions. The kernel of the modified Gaussian function is centered at the center coordinate of the m-th object on the feature map, where W^box_m and H^box_m are the width and height of the ground-truth bounding box. An element-wise maximum strategy is adopted where objects of the same category overlap. We chose the value of min(W^box_m, H^box_m) to cover five standard deviations of the Gaussian distribution, so the value of Z_1 was set to 5.

Considering that the pixels belonging to an actual object in a remote sensing image account for a small proportion of all of the pixels, the imbalance between positive and negative samples is severe. Therefore, the dense prediction method [17,28] was also adopted. However, different from previous work, we use an adaptive logarithmic strategy to alleviate the gap in size between categories, followed by a shape-adaptive positive sample expansion kernel. The positive sample expansion function follows two principles: (1) the coverage is less than or equal to K_m; (2) the ground-truth boundary cannot be exceeded.
We treat min(W^box_m, H^box_m) as dis and α_1 as x; the above principles can then be expressed as a mathematical formula. The partial derivative of the original function with respect to the variable x is always less than 0, so the function decreases monotonically in x. Setting the partial derivative with respect to the other variable dis to zero gives 1 − x/(dis · ln 2) = 0, i.e., dis = x/ln 2. Substituting the value of (7) into (4), setting the equation equal to 0, and taking the second partial derivative with respect to the variable dis, then substituting Equations (6) and (8) into Equation (9), we find that the second derivative is greater than 0. It is therefore reasonable to set the value of α_1 to 1.88. Figures 5 and 6 show the growth of the corresponding dense prediction intervals as the size of the ground-truth box increases.

We calculate the loss for each position in the multi-branch selection and classification head tensor, and the targets to be learned can be defined as follows: IoU_heat-map is the PQES (prediction quality estimated score) mentioned at the beginning. IoU_1 is the Intersection-over-Union (IoU) between the predicted circumscribed bounding box and the real circumscribed bounding box; IoU_2 is the IoU between the predicted rotating bounding box and the ground truth. For all of the negative samples, the entire IoU_heat-map is set to 0. α_2 and α_3 are weight factors determined in the subsequent experimental section. Following the idea of Focal Loss [13], we propose a new modified multi-branch-aware adaptive Focal Loss function to train the model. The ground truth of L_heat-map can likewise be fetched dynamically during training (refer to Equations (12) and (13)), where l is the supervised value and p is the prediction. θ_total and θ_positive represent the number of all samples and of positive samples in the feature map, respectively. We need to scale down the contribution of negative samples.
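As a concrete illustration of the Gaussian label-smoothing target, the sketch below builds a heat map with a standard deviation of min(W_box, H_box)/Z_1 and merges overlapping objects of the same class by an element-wise maximum; the isotropic kernel form is our assumption based on the description above, not the paper's exact formula:

```python
import numpy as np

def gaussian_target(shape, center, w_box, h_box, z1=5.0):
    """Gaussian label-smoothing target around one object center.
    sigma = min(W_box, H_box) / Z1 so that min(W, H) spans five
    standard deviations, per the text; the isotropic form is assumed."""
    H, W = shape
    sigma = min(w_box, h_box) / z1
    cx, cy = center
    ys, xs = np.ogrid[:H, :W]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

heat = np.zeros((128, 128))
# objects of the same class are merged with an element-wise maximum
for center, w, h in [((40, 40), 30, 20), ((90, 70), 24, 24)]:
    heat = np.maximum(heat, gaussian_target((128, 128), center, w, h))
print(heat[40, 40], heat[70, 90])  # 1.0 1.0 at the two object centers
```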
µ_1 and µ_2 follow the Varifocal Loss [37] setting: µ_1 = 0.75 and µ_2 = 2.
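A Varifocal-style loss matching this setting might be sketched as follows; `multibranch_varifocal` is a hypothetical helper, with positives weighted by their IoU-aware target score and negatives down-weighted by µ_1 · p^µ_2, as in VFL:

```python
import torch
import torch.nn.functional as F

def multibranch_varifocal(logits, target, mu1=0.75, mu2=2.0):
    """Varifocal-style loss over the 4C-channel PQES heat map (sketch).
    `target` holds IoU-aware scores for positives and 0 for negatives.
    Positives are weighted by their target score; negatives are
    down-weighted by mu1 * p**mu2, following the VFL setting above."""
    p = torch.sigmoid(logits)
    pos = target > 0
    weight = torch.where(pos, target, mu1 * p.detach() ** mu2)
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    return (weight * bce).sum() / pos.sum().clamp(min=1)
```

Normalizing by the positive count keeps the negative-sample contribution from dominating, mirroring the scaling described above.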

Size Regression of Rectangle
For the regression of the circumscribed rectangle, we calculate the four distances (l, r, t, b) from the feature points to the circumscribed rectangle and use the GIoU [38] for training. The loss function can be written as follows, where bbox_pre and bbox_m represent the m-th predicted circumscribed bounding box and the real one. Utilizing the two sliding ratios, we can deduce the coordinates of the final rotating bounding box. Using the regression method in Figure 4a as an example, the process is illustrated in Figure 7:

x^1_tr = x_tr − p_1(l + r),  y^1_tr = y_tr  (16)
x^2_tr = x_tr,  y^2_tr = y_tr + p_2(t + b)  (17)

We use the Sigmoid function [39] in the model and multiply it by the constant 0.5 so that the sliding ratios lie within (0, 0.5) and meet the following conditions (a–d denote the four regression ways in Figure 4). For training the sliding ratios, a new form of IoU loss combining Smooth L1 Loss [4] and GIoU [38] Loss, called Surround-IoU Loss, is adopted, where bbox^r_pre and bbox^r_m represent the m-th predicted rotating bounding box and the ground truth. Furthermore, the shape of the rotating bounding box cannot be guaranteed to be rectangular. As displayed in Figure 8, the soft constraints satisfy the following geometric properties: (x_l − x_t, y_l − y_t) represents vec_t→l, and (x_r − x_t, y_r − y_t) represents vec_t→r; the rest are vec_r→b and vec_r→t. Thus, the soft constraints can be written in the following form:
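For the Figure 4a branch, the decoding of Equations (16) and (17) can be sketched as below. Obtaining the two remaining vertices by mirroring through the rectangle center is our assumption; note that without the soft constraints the four points generally form a parallelogram rather than a rectangle, which is exactly what the constraints are meant to fix:

```python
def decode_branch_a(cx, cy, l, r, t, b, p1, p2):
    """Decode one rotating box for the Figure 4a branch, following
    Equations (16)-(17): slide from the top-right corner of the
    circumscribed rectangle by p1*(l+r) horizontally and p2*(t+b)
    vertically. Mirroring through the rectangle center for the other
    two vertices is our assumption."""
    x_tl, y_tl = cx - l, cy - t          # circumscribed rect corners
    x_br, y_br = cx + r, cy + b
    x_tr, y_tr = x_br, y_tl
    v1 = (x_tr - p1 * (l + r), y_tr)     # Eq. (16)
    v2 = (x_tr, y_tr + p2 * (t + b))     # Eq. (17)
    mx, my = (x_tl + x_br) / 2, (y_tl + y_br) / 2
    v3 = (2 * mx - v1[0], 2 * my - v1[1])  # mirror of v1
    v4 = (2 * mx - v2[0], 2 * my - v2[1])  # mirror of v2
    return [v1, v2, v3, v4]

print(decode_branch_a(50, 50, 20, 20, 10, 10, 0.25, 0.25))
```

For these inputs the decoded quadrilateral is a parallelogram whose diagonals bisect each other but whose corners are not right angles, motivating the soft constraints of Figure 8.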

Center Vertex Attention
Typically, the size of a circumscribed rectangle is always greater than that of a horizontal one (as shown in Figure 9). Therefore, inspired by CBAM [40], a center vertex attention strategy is adopted to improve the ability of the model to identify different points. We arrange the mask-based attention mechanism at two kinds of sites: the center point and the four vertices of the rotating bounding box. The module architecture is portrayed in Figure 10.
F_O represents the original feature map, F_F represents the refined feature map, and M_C represents the center vertex region generated by the modified Gaussian function. The following equation expresses the fusion process and the target to be learned. We use the same loss function as CenterNet [16] for supervision. Moreover, SENet [41] is also employed as an auxiliary channel attention network, with a reduction ratio of 16.
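A minimal sketch of the mask-based spatial attention might look as follows; the residual fusion form F_F = F_O · (1 + sigmoid(mask)) is our assumption, with the predicted mask also returned so it can be supervised against M_C:

```python
import torch
import torch.nn as nn

class CenterVertexAttention(nn.Module):
    """Sketch of the mask-based center/vertex spatial attention.
    A single-channel mask is predicted from the features and used in
    an assumed residual form F_F = F_O * (1 + sigmoid(mask)); the raw
    mask is returned for supervision against M_C."""
    def __init__(self, channels=256):
        super().__init__()
        self.mask_head = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, f_o):
        m = torch.sigmoid(self.mask_head(f_o))
        f_f = f_o * (1 + m)   # emphasize masked regions, keep original signal
        return f_f, m

feat = torch.randn(1, 256, 150, 150)
refined, mask = CenterVertexAttention()(feat)
print(refined.shape, mask.shape)
```

The 1 + m residual form keeps the original activations intact where the mask is near zero, so the attention can only amplify, never suppress, the backbone features.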
The total loss function L total can be written as follows (where N represents the number of positive samples in an input image, and λ is a hyperparameter for balance):

Results and Discussion
We evaluated our model on the DOTA dataset using the PyTorch 1.7.1 + cu110 [42] framework. Training was performed on a workstation with an NVIDIA Quadro RTX 5000 16 GB GPU and an Intel(R) Xeon(R) Silver 4214 CPU @ 2.20 GHz. Testing was performed on an NVIDIA Quadro RTX 4000 GPU with 8 GB of memory.

DOTA Dataset
Two detection tasks are defined on the DOTA dataset [43]. Task1 uses oriented bounding boxes (OBB) as the ground truth, and Task2 uses horizontal bounding boxes (HBB). Task1 was used for rotation detection. The dataset contains 2806 aerial images of various scales and orientations, ranging from 800 × 800 to 4000 × 4000 pixels. There are 188,282 target instances in total, split into the following 15 categories: plane, baseball diamond (BD), bridge, ground track field (GTF), small vehicle (SV), large vehicle (LV), ship, tennis court (TC), basketball court (BC), storage tank (ST), soccer ball field (SBF), roundabout (RA), harbor, swimming pool (SP), and helicopter (HE). In order to satisfy the Surround-Net size requirements, the original input images and the corresponding labels needed to be adjusted: the step size was fixed to 100 pixels, and the window size was set to 600 pixels. After clipping, there were 69,337 images in the training-verification set and 35,777 images in the test set. The decentralized detection results were combined by retaining the top 300 heat map values (with the threshold set to 0.1). Figure 11 shows part of the DOTA dataset (note that these images have been cropped).
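The sliding-window cropping can be sketched as below, using the window and step sizes quoted above (600 and 100 pixels). Clamping the last window flush with the image border is a common convention that we assume here:

```python
def crop_origins(size, window=600, stride=100):
    """Top-left crop coordinates along one image axis, using the window
    (600 px) and step (100 px) sizes quoted above. Clamping the final
    window flush with the border is an assumed, common convention."""
    if size <= window:
        return [0]
    origins = list(range(0, size - window, stride))
    origins.append(size - window)  # ensure the border is covered
    return origins

# an 800x800 DOTA image yields 3 x 3 = 9 overlapping 600x600 crops
print(crop_origins(800))
```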

Evaluation Indicators
We adopted the same acceptance criteria used in PASCAL VOC2007 [44] and employed mean average precision (mAP) to evaluate the performance.
The P-R curves can be drawn according to the precomputed precision and recall of the detection results. The average precision (AP) is the area under the P-R curve; hence, the higher the AP value, the better the performance. The AP is defined as follows, where P and R represent the single-point values of precision and recall, respectively. The mean average precision (mAP) is the average of the AP over all categories. In the figure, the white part represents the area where the value is nonzero; in the source zone, the value approaches one, and in other areas it can be obtained using Equation (26).
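The VOC2007 11-point interpolated AP adopted above can be computed as follows (a sketch over precomputed single-class P-R points):

```python
def voc07_ap(precisions, recalls):
    """PASCAL VOC2007 11-point interpolated AP: average the maximum
    precision at recall >= t for t in {0, 0.1, ..., 1.0}."""
    ap = 0.0
    for t in [i / 10 for i in range(11)]:
        p = max((p for p, r in zip(precisions, recalls) if r >= t), default=0.0)
        ap += p / 11
    return ap

# toy P-R points for a single class
print(voc07_ap([1.0, 0.8, 0.6], [0.2, 0.5, 0.9]))
```

mAP is then simply the mean of these per-class AP values over the 15 DOTA categories.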

Parameters Setting
In this section, a series of ablation experiments is performed on the DOTA validation dataset to determine the values of the hyperparameters used in Surround-Net. Note that we used ResNet-50 as the backbone network for the ablation experiments and set the number of training epochs to 50. The input images were resized to a resolution of 600 × 600. There are three hyperparameters: the weight factor λ of L_total, the weight factor α_2 of the IoU_heat-map, and the weight factor α_3 of the ground-truth heat map.

1. L_total weight: We first analyzed the impact of the total loss hyperparameter on the detection performance. We referred to the loss-weight design of some mainstream multitask learning models [2][3][4][5][6][7][8][9][10][11][12][13] and conducted tests between 0.1 and 2 under the same experimental conditions. As shown in Table 1, λ achieved the best performance at a value of 1.25. It was also observed that the detector's performance decreases to varying degrees when the selected value is too high or too low. After adopting the GIoU and normalizing the Smooth L1, L^r_reg was unified with L_heat-map in the same order of magnitude, so there is no need to downscale the regression loss; otherwise, the model would over-focus on a single task and damage the detector's performance. Therefore, it is reasonable to employ 1.25 as the value of the hyperparameter λ, and we adopt this value in the following experiments;
2. The values of α_2 and α_3: These two hyperparameters appear in Equations (12) and (13). Parameter α_2 counterbalances the value generated by the modified Gaussian kernel and the IoU between the prediction box and the ground-truth box. Likewise, parameter α_3 makes a trade-off between two different IoU styles: one between two horizontal bounding boxes, and the other between two rotating bounding boxes. We restrict the range to between 0 and 1 to satisfy the IoU range. As an intuitive starting point, parameters α_2 and α_3 were set to 0.5 when testing λ. It was therefore necessary to investigate whether other combinations could further enhance the detector's performance. We first experimented with parameter α_2 while fixing α_3 to 0.5. As shown in Table 2, the detector achieved the best performance when parameter α_2 was 0.5, with only slight differences for other values. This indicates that the proposed model can maintain classification and location consistency; in other words, there is no wide gap between the classification and location scores. Similarly, the test results for parameter α_3 are exhibited in Table 3. However, we discovered that increasing the proportion of the IoU between horizontal bounding boxes leads to a new optimal performance. In actuality, because a slight angular deflection can seriously reduce the IoU between rotating bounding boxes, the regression of rotating bounding boxes is more difficult than that of horizontal bounding boxes, consistent with the phenomena observed in DAL [24]. Accordingly, we employed the final hyperparameters α_2 = 0.5 and α_3 = 0.7 in the following experiments.

Contributions of Several Modules in Surround-Net
In this section, we conduct ablation experiments to corroborate the contributions of several modules mentioned in Section 2: the soft constraint module, the dense prediction module, and the center vertex attention module. When evaluating the modules' effectiveness, we used the entire DOTA dataset instead and ResNet-152 as the backbone.

1. Soft constraint module: The results are shown in Table 4. The first row corresponds to deleting the model's soft constraint component during training, i.e., L_soft is not calculated. The model's performance decreases by approximately 0.7% after losing the soft constraint. Indeed, Figure 8 illustrates that the soft constraints are essential for generating rectangular bounding boxes: they ensure the right-angle characteristics and assist the regression of the sliding ratios. In addition, to further explore the effectiveness of the soft constraint module, we visualized the comparison results without this constraint. As shown in Figure 13, most of the bounding boxes predicted by the model in Figure 13a are parallelograms, contrary to the rectangles that we need. In Figure 13b, this phenomenon is largely alleviated;
2. Dense prediction module: The second row in the table represents the model's performance without the dense prediction module. The analysis in Section 2 points out that if the number of positive samples is not increased, the model will overfit because there are too few positive samples. The results show that the dense prediction module improves the overall performance of the model by roughly 1.13%;
3. Center vertex attention module: The penultimate row in the table shows the contribution of the center vertex attention module to the overall performance. This module is introduced to enhance the feature extraction ability so that the model focuses better on the object position and the four vertices of the corresponding rotating bounding box. The addition of the attention module resulted in a 1.501% gain in the overall performance of the model, which further demonstrates the necessity of an attention module in detection tasks on remote sensing images.

Analysis of Surround IoU
In this section, we evaluate the effectiveness of the Surround IoU Loss by comparing the stability of the loss curves. For this experiment, the complete DOTA training set was chosen, and the number of iterations was uniformly set to 100 for comparison purposes. The loss curve in Figure 14a depicts training with a direct Smooth L1 Loss SmoothL1(·). It should be noticed that the downward trend is not gentle, and there are "mutations" at around the 7th and 12th iterations. Based on the analysis in Section 2, this is because direct vertex coordinate regression cannot reflect the relative position difference between the detection boxes. Therefore, we borrow the idea of calculating the loss of a rotating bounding box from SCRDet [25] by introducing normalization and IoU information. Figure 14b shows the loss during training after replacing the original loss function with the Surround IoU, and the loss curve is clearly flatter than the former.

Analysis of Multi-Branch Regression
This subsection discusses the impact of the proposed multi-branch prediction structure on model performance. According to the discussion in Section 2, there are four types of regression branches: oriented primary diagonal slender rectangular regression, oriented minor diagonal slender rectangular regression, oriented primary diagonal rectangular regression, and oriented minor diagonal rectangular regression. The model's validity can be verified by masking some branches artificially. In the following experiments, we compared the performance when shielding the Figure 4a,b branches and the Figure 4c,d branches. The results and visual analysis are provided in Table 5 and Figure 15. We completed the testing process by zeroing the different branches of the prediction tensor and picking a new maximum value among the remaining branches. According to the results in Table 5, keeping only some of the branches significantly impairs the model's performance. If only the branches in Figure 4a,b are kept, the AP values of the large vehicle, small vehicle, ship, and harbor categories show a relatively colossal drop compared to the best performance. However, if only the branches in Figure 4c,d are kept, the AP values for the roundabout, basketball court, and storage tank categories show a similar decline. In the first row of pictures in Figure 15, the first two represent the prediction results when only branches c-d are kept, and the last two represent the results obtained when only branches a-b are kept. Compared to keeping all of the branches (second row), the detection boxes for the harbor and large vehicle categories could only regress in the c-d directions, showing a high level of redundancy. Moreover, in the large vehicle category, many objects are missed because of NMS [45].
In the last two pictures in the first row, since the regression can only be carried out in the a-b directions, the obtained detection boxes cannot completely cover the target, resulting in a low IoU with the ground truth. The results in the second row show that multi-branch regression can adaptively select the appropriate detection box.
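The branch-shielding experiment can be sketched as follows: zero the masked branches of the (4, C, H, W) score tensor and re-pick the per-location maximum over the remaining branches. `mask_branches` is an illustrative helper, not the paper's code:

```python
import numpy as np

def mask_branches(scores, keep):
    """Zero out regression branches of a (4, C, H, W) score tensor and
    re-pick the winning branch per location among the remaining ones."""
    out = scores.copy()
    for b in range(4):
        if b not in keep:
            out[b] = 0.0
    best_branch = out.max(axis=1).argmax(axis=0)  # (H, W) winning branch ids
    return out, best_branch

scores = np.random.rand(4, 15, 8, 8)
_, best = mask_branches(scores, keep={0, 1})
print(np.unique(best))  # only branches 0 and 1 survive
```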

Comparisons with State-of-the-Art Detectors
In this part, we compare Surround-Net with other state-of-the-art detectors on the DOTA dataset; the accuracy and FPS results are shown in Tables 6 and 7. Furthermore, we randomly selected some of the detection results, shown in Figure 16. During the experiments, Surround-Net used both ResNet-101 and ResNet-152 as the backbone and multi-scale testing to achieve the best performance. As shown in Table 6, the detectors are divided into two groups: anchor-based (A) and anchor-free (AF) detectors. Table 6 indicates that ResNet-152 performs better than ResNet-101, revealing a heavy dependence on feature extraction.
Overall, although lower than some anchor-based two-stage models and anchor-free models, Surround-Net still obtained competitive results (75.88 mAP) while maintaining a high processing speed (12.57 FPS), representing a good performance trade-off. In particular, compared to the eight-parameter models [26][27][28][29][30][31][32], Surround-Net achieves state-of-the-art results, confirming the correctness of the method in solving boundary problems. Table 7 shows the FPS results obtained under the same test conditions. Surround-Net-152 still reaches 12.57 FPS, showing the efficiency of our model. Figure 16 shows that Surround-Net can not only achieve high-precision detection for horizontal objects but also obtain excellent results on objects with a high aspect ratio.

Final Discussion and Conclusions
After analyzing the ranking problems and regression discontinuity problems in the five-parameter and eight-parameter models, we introduced a new one-stage anchor-free model with multi-branch prediction for oriented detection tasks. In order to maintain consistency between classification, localization, and branch selection, we replaced the original classification label with the corresponding PQES (prediction quality estimated score). Further, we added center vertex spatial attention to ensure that our model fully utilizes features to distinguish the foreground from the background. A soft constraint was also proposed to refine the bounding box. At the same time, to improve the prediction recall and to alleviate the imbalance between positive and negative samples in the anchor-free model, we adopted a dense prediction strategy.
In the experiments, we first discussed the impact of the weights of the loss function. The experiments show that the loss weight λ should be 1.25, and in the PQES, the weights α_2 and α_3 should be set to 0.5 and 0.7, respectively. Further, we investigated the influence of our three proposed modules. The soft constraint module yields a performance enhancement of 0.7% and, more importantly, enables the model to output rectangular detection boxes that meet the geometric requirements. The dense prediction module improves the performance by 1.13%, while the attention mechanism improves it by 1.501%. To enhance the smoothness of the loss curve, we adopted the Surround IoU Loss, which incorporates location information to train the sliding ratios. In addition, we also conducted experiments on the effectiveness of multi-branch prediction, which showed that a single regression method damages the detector's performance in oriented detection. Finally, we compared the results with the state of the art and visualized the detection results. The proposed model achieved 75.88 mAP and 74.07 mAP at 12.57 FPS, which is a competitive result and represents a trade-off between performance and running speed.
However, it should be noted that there is still a slight gap between Surround-Net and the state of the art. Among the anchor-based and two-stage models [19][20][21][25][26][31][46][47] listed in Table 6, the best Surround-Net performance ranks above the middle of the list. Among the anchor-free and one-stage models [27][28][30][48][49][50][51][52][53], our best values are only lower than those achieved by DAFNE [53]. However, when multi-scale testing is not used and ResNet-101 is used as the backbone, Surround-Net-101 (72.66%) outperforms DAFNE-101 [53] (70.75%). In addition, we found a few missed detections (in the first two pictures in the last line of Figure 16) in some specific categories (small vehicle and ship). This is because the output feature map undergoes 4-fold down-sampling, and some objects may share the same center. As such, reducing the probability of missed detections is an important direction for our future research.

Data Availability Statement: Data used in this study are available from the corresponding authors by request.

Conflicts of Interest:
The authors declare no conflict of interest.