Constraint Loss for Rotated Object Detection in Remote Sensing Images

Rotated object detection is an extension of object detection that uses an oriented bounding box instead of a general horizontal bounding box to define the object position. It is widely used in remote sensing images, scene text, and license plate recognition. Existing rotated object detection methods usually add an angle prediction channel to the bounding box prediction branch and use smooth L1 loss as the regression loss function. However, we argue that smooth L1 loss causes a sudden change in loss and slow convergence due to the angle convention of OpenCV (the angle between the horizontal line and the first side of the bounding box, measured counter-clockwise, is defined as the rotation angle), and this problem exists in most existing regression loss functions. To solve these problems, we propose a decoupling modulation mechanism that overcomes the sudden change in loss. On this basis, we also propose a constraint mechanism whose purpose is to accelerate the convergence of the network and ensure optimization toward the ideal direction. In addition, the proposed decoupling modulation mechanism and constraint mechanism can be integrated into popular regression loss functions individually or together, which further improves model performance and makes the model converge faster. The experimental results show that our method achieves 75.2% performance on the aerial image dataset DOTA (OBB task) while saving more than 30% of computing resources. The method also achieves state-of-the-art performance on HRSC2016 while saving more than 40% of computing resources, which confirms the applicability of the approach.


Introduction
Remote sensing images are an important manifestation of remote sensing information and are vital to national defense security. Object detection in remote sensing images is a prerequisite and basis for tasks such as spatial object tracking and instance segmentation [1]. With the extensive application of convolutional neural networks (CNNs) in computer vision, object detection has undergone rapid development [2]. R-CNN [3] is a representative early method for object detection based on deep learning. After 2016, a series of two-stage detectors based on candidate regions became mainstream, such as Fast R-CNN [4], Faster R-CNN [5], and R-FCN [6]. Two-stage detectors achieve good detection accuracy; however, their detection speed is poor owing to the complex network structure. YOLO [7], SSD [8], and RetinaNet [9] are representative one-stage detectors; they omit the region proposal network and greatly improve detection speed, but detection accuracy is sacrificed. While anchor-based methods developed rapidly, anchor-free methods have also received attention in recent years following the proposal of CornerNet [10]. The more popular ones are FCOS [11], CenterNet [12], and ExtremeNet [13]. They replaced the previous generation of anchor-based methods by predicting key points, thereby opening a new direction for research on object detection technology [14]. In addition, there is research on high resolution [15,16], unbalanced samples [17], and other issues [18]. The above methods have achieved good performance on natural-image datasets such as COCO [19] and Pascal VOC [20], and have therefore been applied to remote sensing image object detection tasks. For example, Zhang et al. [21] combined fast registration with YOLOv3 and proposed an effective method for detecting moving vehicles in aerial infrared image sequences. Liao et al. [22] proposed the Local Perception Region Convolutional Neural Network (LR-CNN), a new method for vehicle detection in aerial images. Lei et al. [23] proposed a tiny-vehicle detection method based on spatio-temporal information, which realizes the detection of tiny moving vehicles in satellite video. However, horizontal bounding boxes cannot provide accurate orientation and scale information in remote sensing image object detection tasks [24][25][26] (see Figure 1). Therefore, research on rotated object detection in remote sensing images is of great significance for engineering applications.

In recent years, rotated object detection has been derived from classic object detection [27][28][29], and most existing methods use five parameters (coordinates of the center point, width, height, and rotation angle) to describe the oriented bounding box. The initial exploration of rotated object detection involved rotating the RPN [30]; however, this requires more anchors, which implies additional running time. Ding et al. [31] proposed the RoI Transformer, which converts an axis-aligned RoI into a rotatable RoI to solve the misalignment between the RoI and the oriented object. Han et al. [32] proposed S2ANet, which performs deep feature alignment for rotated object detection. In addition, SCRDet [33] reported for the first time the problem of sudden changes in loss in rotated object detection tasks (see Figure 2) and proposed IoU-smooth L1 loss to overcome this problem. Similarly, PIoU loss [34] and R3Det [35] both add a very small weight to the loss function to suppress the sudden change in loss. However, these methods all merely inhibit the sudden change in loss and do not solve the problem fundamentally. Therefore, some novel ideas have been proposed.
Zhu [26] and Xu [36] each proposed a new method for representing oriented objects in aerial images, which avoids complicated calculation rules, but the performance is not ideal. In CSL [37], a circular label is designed to convert the angle regression problem into a classification problem. In RSDet [38], a modulated rotation loss is proposed to eliminate the discontinuity in loss. Although CSL and RSDet both effectively solve the problem of sudden changes in loss, some new problems remain unconsidered: the detection performance of CSL is not ideal, and RSDet does not consider the consumption of training resources. Therefore, a new regression loss mechanism is essential for the development of rotated object detection.

The reason for sudden changes in loss can be illustrated as follows. Consider only the angle θ regression, assuming that the center points and sizes of the prediction box and the ground truth coincide, with the long and short sides being 30 and 15, respectively. The two bounding boxes can then be described by five parameters: the prediction box (x, y, 30, 15, 85°) and the ground truth box (x, y, 15, 30, 5°). The prediction offset is (0, 0, 15, 15, 80°), whereas the ideal offset is (0, 0, 0, 0, 10°). The L1 loss is therefore far greater than the ideal owing to the exchange of width and height and the periodicity of the angle.
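The numbers in the worked example above can be checked directly. The sketch below (a standalone illustration, not the paper's implementation) compares the naive five-parameter offset with the ideal offset under an elementwise smooth L1 loss:

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    """Elementwise smooth L1 loss: quadratic near zero, linear beyond beta."""
    x = np.abs(x)
    return np.where(x < beta, 0.5 * x ** 2 / beta, x - 0.5 * beta)

# The worked example above: same centre, width/height exchanged,
# angles 85 deg vs 5 deg (degrees kept as raw numbers for illustration).
pred = np.array([0.0, 0.0, 30.0, 15.0, 85.0])   # (x, y, w, h, theta)
gt   = np.array([0.0, 0.0, 15.0, 30.0, 5.0])

naive_offset = pred - gt                          # (0, 0, 15, -15, 80)
ideal_offset = np.array([0.0, 0.0, 0.0, 0.0, 10.0])

naive_loss = smooth_l1(naive_offset).sum()        # 108.5
ideal_loss = smooth_l1(ideal_offset).sum()        # 9.5
```

Although the two boxes are almost identical geometrically, the naive parameterization yields a loss more than ten times larger than the ideal one, which is exactly the sudden change described in the text.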
To solve the above problems, we propose a new loss function with a decoupling modulation mechanism and constraint mechanism. The decoupling modulation mechanism divides the deviation of smooth L 1 loss into three parts (center point, size, and rotation angle) and modulates them, which effectively overcomes the sudden change in loss. On this basis, the constraint mechanism provides a constraint domain for the center point and size of the bounding box so that it has a tolerance for deviation in the regression process. This improves the performance and convergence speed of the model. In addition, the decoupling modulation mechanism and constraint mechanism we proposed are general and they perform well when applied to most popular regression loss functions.
In summary, the contributions of this paper are as follows:
• We propose a decoupling modulation mechanism that decouples the loss deviation into three parts and modulates them separately. It overcomes the problem of sudden changes in loss when detecting rotated objects and makes the training process more stable.
• We propose a constraint mechanism, which effectively solves the problem of slow network convergence and improves model performance by adding a constraint domain for the center point and size of the bounding box. Experiments reveal improvements of 1.2% in mAP and 40% in convergence speed on the DOTA dataset, and of 0.5% in mAP and 30% in convergence speed on the HRSC2016 dataset.
• Our method is independent of the model; thus, it is generic and can be applied to most regression loss functions. Experimental results for nine popular loss functions (including deviation-based and IoU-based losses) verify its effectiveness.
The remainder of this paper is organized as follows. Section 2 describes our motivation, the proposed method, and a detailed analysis of the characteristics of the method. Section 3 reports the details of the experiment, including the datasets, implementation details, ablation study, and experimental results. Finally, Section 4 presents the conclusions of this article.

Materials and Methods
In this section, we first describe the proposed constraint loss function and then analyze the constraint parameters (CPs). Finally, the adjustability and generalizability of the constraint loss are discussed.

Constrained Loss Function
In the early development of rotated object detection, smooth L1 loss played an important role; the corresponding regression is represented in (1).
where (x, y) is the coordinate of the center point of the rectangular box, (w, h) represents its width and height, and θ is defined as the acute angle to the X-axis, with a value range of [0, π/2] or [−π/2, π/2], as defined by OpenCV (see Figure 3). The superscript * denotes the ground truth labels, and the regression calculation is normalized to avoid overfitting.

The sudden change in loss during oriented bounding box regression is mainly caused by two factors: (i) the exchange of width and height, and (ii) the periodicity of the angle. To solve these two problems, we decouple and modulate the regression calculation of the rotated bounding box.
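The exact normalization is given in (1); as one common anchor-based convention (an assumption here, since the paper does not reprint it), the five regression targets can be encoded as:

```python
import math

def encode(box, anchor):
    """Normalized five-parameter regression targets under a common
    anchor-based convention (the paper's exact normalization may differ).
    Both box and anchor are (x, y, w, h, theta) with theta in radians."""
    x, y, w, h, t = box
    xa, ya, wa, ha, ta = anchor
    return (
        (x - xa) / wa,      # centre offsets scaled by anchor size
        (y - ya) / ha,
        math.log(w / wa),   # log-scale size ratios
        math.log(h / ha),
        t - ta,             # angle offset
    )
```

Scaling the offsets by the anchor size keeps the targets comparable across object scales, which is the overfitting-avoidance role the normalization plays in (1).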
Decoupling and Modulation: Inspired by [38], the regression calculation of the oriented bounding box is first decoupled into three parts: (i) center point regression, (ii) size regression, and (iii) angle regression; the latter two parts are then modulated. In [38], the exchange of edges is always accompanied by modulation in the angle period, and the bounding box regression is expressed as follows, where L2^1 represents the first row of L2 in (3) (with the same definitions for L2^2, L3^1, and L3^2). This design is useful for the regression of most bounding boxes; however, some special problems are ignored. For example, the L2 and L3 modulations are not synchronized, as shown in Figure 4. Therefore, we divide the regression of the bounding box into three parts, as shown in (2)-(4), and modulate each separately. Finally, the output result is modulated twice, and the bounding box is mirrored and rotated. In addition, when the 180° representation is used, L2 modulation is suppressed, and the modulation term π/2 of L3 is replaced by π.
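The decoupling idea above can be sketched as follows. This is an illustrative reading of (2)-(4) under the 90° (OpenCV) angle definition, not the paper's exact formulation; the function name and the use of absolute deviations are assumptions:

```python
import math

def decoupled_modulated_loss(pred, gt):
    """Sketch of decoupling + modulation: centre, size, and angle terms
    are computed independently, and the size/angle terms each take the
    minimum over the width/height exchange and the pi/2 angle period.
    Boxes are (x, y, w, h, theta) with theta in radians."""
    x, y, w, h, t = pred
    xs, ys, ws, hs, ts = gt

    l_center = abs(x - xs) + abs(y - ys)           # part (i): centre

    l_size = min(abs(w - ws) + abs(h - hs),        # part (ii): direct match
                 abs(w - hs) + abs(h - ws))        # ... or w/h exchanged

    dt = abs(t - ts) % (math.pi / 2)               # part (iii): angle folded
    l_angle = min(dt, math.pi / 2 - dt)            # into its pi/2 period

    return l_center + l_size + l_angle
```

On the example from the introduction (prediction (x, y, 30, 15, 85°) vs. ground truth (x, y, 15, 30, 5°)), the size term vanishes via the exchange branch and the angle term reduces to 10°, so the loss stays small instead of jumping.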
Constraint: To enable the network to converge faster and reduce resource consumption, the calculation of the deviation between the center point and the size of the bounding box must be constrained. The constraint is expressed as follows: where con(.) is the conditional function, which means L_i* = L_i when the condition is met; otherwise, L_i = 0. α and β represent the constraints on the center point and the bounding box scale, respectively, ensuring that the model always evolves in the correct direction during the training process.
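A minimal sketch of the conditional function con(.) and the resulting constrained loss (function names are hypothetical; the angle term is left unconstrained, as in the text):

```python
def con(deviation, radius):
    """con(.): a deviation is kept when it lies outside the constraint
    domain and zeroed once it falls inside (i.e. close enough to GT)."""
    return deviation if deviation >= radius else 0.0

def constrained_loss(l_center, l_size, l_angle, alpha, beta):
    """L_C sketch: alpha constrains the centre term, beta the size term;
    the angle term always contributes."""
    return con(l_center, alpha) + con(l_size, beta) + l_angle
```

Once the centre or size deviation enters its constraint domain, that component stops generating gradient, so subsequent updates concentrate on the remaining (typically angle) term.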
The regression loss function is shown in Figure 5, where the constraint line comes from the constraint parameters (CPs) and coincides with the X-axis. In particular, loss contributions below the constraint line are discarded; that is, the loss value is set to 0 when the deviation between the predicted box and the ground truth falls within the constraint range.
In summary, the proposed constraint loss functions L C are expressed as follows:

Constraint Analysis
The introduction of α and β changes the fine-tuning behavior of the network. We explore its geometric meaning and analyze it using mathematical reasoning.
Geometric meaning: The geometric meaning of the CPs proposed in this study is shown in Figure 6. In the figure, α represents the constraint radius of the center point: the center point deviation is set to 0 when the distance between the center points of the prediction box and the ground truth is less than α. β represents the constraint radius of the bounding box size: the size deviation is set to 0 when the difference between the predicted box size and the ground truth size is less than β. In addition, the endpoints of the bounding box are limited to the same-colored area. It is worth noting that this resembles bounding box filtering, but the purpose is entirely different: our goal is to prevent the prediction box from drifting in a poor direction once its center point or size is already close to the ground truth.

Gradient Analysis: Figure 7 shows a simplified diagram of the network structure. In the process of loss backpropagation, the convolutional layer, pooling layer, and fully connected layer occupy the dominant position. Therefore, we analyze the gradients of these three key layers and study the influence of the constraint loss function on them.
The gradient of the loss function S(L) with respect to the output layer is expressed as follows: where δ_Y represents the gradient vector of the loss function S with respect to the predicted value Y, σ′ represents the derivative of the activation function z(X), and ⊙ is the Hadamard product, i.e., the point-to-point multiplication operation between matrices or vectors. The gradient propagation of the fully connected layer is expressed analogously. The input P_{l−1} of the pooling layer can be obtained from P_l; this process is usually called upsampling.
where the second term can be understood as the constant 1 in the pooling process, because no activation function is involved in the pooling layer. The input C l−1 of the convolutional layer can be obtained from C l .
A more detailed gradient analysis is given in Appendix A; based on it, we know the influence of the constraint loss function on the key layers. When the center point constraint (L1*) or the size constraint (L2*) is activated, the corresponding components of the regression deviation L_C are reduced, and this response is directly transferred to the gradient values of the main layers (δ_Y, δ_l, P_{l−1}, C_{l−1}), so that the parameters are adjusted without additional steps. This simplifies the backpropagation task, and the effect is more pronounced when L1* and L2* are activated simultaneously: only the angle parameter θ is then adjusted (L = L3), making the task simple and easy to implement. It is worth noting that when L1* and L2* are activated, the center point and size of the candidate box have already been adjusted into the constraint range (see Figure 6), and subsequent adjustments are suppressed. This avoids additional calculations and development in unfavorable directions.
Convergence Analysis: Through geometric and gradient analyses, we found that the introduction of CPs had a positive effect on the convergence speed of the network. This is reasonable because the introduction of CPs is equivalent to increasing the tolerance of the prediction box. In the process of returning the prediction box to the truth box, the existence of tolerance enables the network to stabilize faster. To verify this idea, we tested the number of iterations when the network reached stability under different CPs, and the experimental results verified our hypotheses.

Adjustability and Generalizability
Adjustability: Inspired by self-paced learning [39], a cascaded sequence of CPs is proposed [40]. At the beginning of the training phase, a larger constraint parameter filters out candidate boxes with lower confidence, so that training focuses on high-confidence bounding boxes. As training progresses and the constraint parameters become smaller, the network begins to pay attention to the modification and optimization of the bounding box. In addition, the bounding box is restrained from expanding toward the constraint range, so that the network always develops in the ideal direction.
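The cascade can be implemented as a simple per-epoch lookup. The stage boundaries and (α, β) values below are illustrative placeholders; the actual sequences are those shown in Figure 8 and Table 5:

```python
def cascade_cps(epoch, stages=((5, 5, 5), (10, 4, 4), (20, 3, 3), (30, 2, 3))):
    """Hypothetical cascade of constraint parameters.  Each stage is
    (last_epoch, alpha, beta); CPs shrink as training progresses so the
    network moves from coarse filtering to fine box optimization."""
    for last_epoch, alpha, beta in stages:
        if epoch <= last_epoch:
            return alpha, beta
    return 0, 0   # unconstrained once all stages are exhausted
```

Early epochs thus run with loose constraints (fast, coarse convergence), and later epochs tighten them so that box refinement is no longer masked.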
Generalizability: Based on the above analysis, we know that the introduction of α and β can improve the performance of the regression loss function and prompt the network to always train in the right direction. Therefore, we consider applying this constraint method to other regression loss functions and analyze its generality.
The existing rotated bounding box regression loss functions can be divided into two categories: deviation-based and IoU-based. The deviation-based loss functions include MSE, MAE, Huber [41], Log-Cosh [42], and Quantile [43]; the IoU-based loss functions include IoU [44], GIoU [45], DIoU [46], and CIoU. For the deviation-based loss functions, a cascade constraint parameter sequence is set (see Figure 8). For the IoU-based loss functions, we design an overlap sequence as the CP sequence: when the IoU between the prediction box and the ground truth is greater than the threshold, the loss is considered to be zero (the general IoU loss is 1 − IoU). The initial overlap threshold is 0.5 and gradually increases to 0.75 as training progresses. A more detailed parameter design is presented in Table 1.
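The IoU-based variant of the constraint can be sketched as below. The linear ramp of the threshold is an assumption (the text only states it grows from 0.5 to 0.75; the exact schedule is in Table 1):

```python
def constrained_iou_loss(iou, epoch, total_epochs=30, t0=0.5, t1=0.75):
    """IoU-based constraint: the usual loss is 1 - IoU, but it is zeroed
    once IoU exceeds a threshold that ramps from t0 to t1 over training
    (linear ramp assumed here for illustration)."""
    t = t0 + (t1 - t0) * min(epoch / total_epochs, 1.0)
    return 0.0 if iou > t else 1.0 - iou
```

Early in training a moderately overlapping box (IoU 0.6) already contributes zero loss; by the end, only boxes above 0.75 IoU are exempt, forcing finer alignment.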

Experiments and Results
Our experiments were carried out on a server running Ubuntu 14.04 with a Titan X Pascal GPU and 12 GB of memory. This section first describes the datasets and evaluation protocol, followed by a detailed ablation analysis and an overall evaluation of the proposed method.

Datasets and Evaluation Protocol
DOTA [47] is a large and challenging aerial image dataset in the field of object detection, including 2806 pictures and 15 categories, with image scales ranging from 800 × 800 to 4000 × 4000. It contains a training set, validation set, and test set, which account for 1/2, 1/6, and 1/3 of the total dataset, respectively. The training and validation sets are annotated with 188,282 instances, each marked with an arbitrarily oriented quadrilateral box. In this study, we used version 1.0 of the rotated object detection task, and each image was cropped into 600 × 600 slices, which were scaled to 800 × 800 during training. The category abbreviations are plane (PL), baseball diamond (BD), bridge (BR), ground track field (GTF), small vehicle (SV), large vehicle (LV), ship (SH), tennis court (TC), basketball court (BC), storage tank (ST), soccer-ball field (SBF), roundabout (RA), harbor (HA), swimming pool (SP), and helicopter (HC). The official DOTA evaluation protocol in terms of mAP is used.
HRSC2016 [48] is a dataset dedicated to ship detection in the field of object detection. The dataset contains 1061 images covering two scenarios: ships at sea and ships close to shore. There are three levels of tasks (single-class, four-class, and 19-class ship detection and identification). Image sizes range from 300 × 300 to 1500 × 900, and most are larger than 1000 × 600. The training, validation, and test sets contain 436, 181, and 444 images, respectively. Instances are annotated with rotated rectangular boxes, and the standard HRSC2016 evaluation protocol in terms of mAP is used.

Implementation Details
The proposed method is implemented based on the rotation detection benchmark proposed by Yang et al. [49]. We used RetinaNet as the baseline method and ResNet50, ResNet101, and ResNet152 as backbone networks. To ensure fairness, all comparative experiments used the same backbone network, and the batch size was set to 8 owing to the limitation of GPU memory. In all experiments, we used the momentum SGD optimizer with momentum 0.9 and weight decay 1 × 10^−4. The initial learning rate was 5 × 10^−4 and decayed to 0.1 times its previous value at scheduled training epochs, where the epoch schedule depends on the number of training samples. The hyperparameters α and β follow a sequence (see Figure 8) that gradually decreases as the number of iterations increases.
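The step decay described above can be sketched as follows; the milestone epochs are hypothetical, since the text says the schedule depends on the dataset size:

```python
def learning_rate(epoch, base_lr=5e-4, milestones=(18, 24), gamma=0.1):
    """Step-decay sketch of the schedule described above: the learning
    rate is multiplied by gamma at each milestone epoch (milestones here
    are illustrative only)."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr
```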

Ablation Study
The ablation study covers the effect of the modulation mechanism and the constraint mechanism on the network, as well as the convergence and generality of the proposed constraint loss function. (For convenience of comparison, the DOTA validation set is used for evaluation, because the test set labels have not been released.)

Effects of the modulation mechanism: We applied the proposed decoupling modulation mechanism to the regression loss function and compared it with popular loss functions such as L1, smooth L1, IoU-smooth L1 [33], and L_mr [38]. In the experiment, we used the same backbone network (ResNet50), the five-parameter regression method, and RetinaNet as the baseline. The experimental results in Table 2 show that our modulation mechanism achieves better performance than L_mr [38], with improvements of 0.4 and 0.6 on DOTA and HRSC2016, respectively. This further supports our idea that decoupling the regression parameters can improve the performance of the network model.

Effects of the constraint mechanism: To verify the effectiveness of the constraint loss function (L_C), we experimented with the center point constraint and the size constraint separately, exploring the influence of different constraint domains on model performance. The CPs α and β were each set to 5, 4, 3, 2, 1, and 0 and tested in turn, where 0 indicates the unconstrained state. To ensure fairness, each comparison experiment was trained for 30 epochs (20,673 iterations per epoch), and the proposed decoupling modulation loss was used. The experimental results are shown in Figure 9. Clearly, the introduction of α and β has both positive and negative effects; compared with the unconstrained case, effective CPs accelerate network convergence.
In particular, when CPs = 4 or 5, the network converges faster, but model performance is sacrificed. Both convergence speed and model performance improve when CPs = 1, 2, or 3. Performance is best when α = 2 and β = 3, with improvements of 3.5 and 3.2, respectively. Although this is not the final result (only 30 epochs were trained) and the improvement shrinks as training is extended, it is sufficient to confirm our method: introducing CPs not only improves model performance but, more importantly, greatly accelerates convergence and reduces resource consumption.

Based on the above experiments, we found that appropriate center point and size constraints have a positive effect on network training; therefore, it is necessary to explore their combinations. In the experiment, we predefined an optimal combination (α, β) = (2, 3) (from the ablation experiments on the two constraints) and designed several other combinations. The results are shown in Table 3: the predefined combination (α, β) = (2, 3) with L_dm achieves the best performance among all combinations and loss functions, which confirms our idea.

Convergence: The training resource consumption is reported in Table 4, where different CPs show significant differences. Larger CPs mean greater tolerance for the predicted bounding boxes, fewer training resources, and faster convergence and stabilization, while smaller CPs have the opposite effect.

Adjustability: Combining the characteristics of different CPs, we designed a cascade constraint sequence experiment. We trained for 30 epochs divided into six equal parts, defined as A0, ..., A5 = [epoch 1, epoch 5], ..., [epoch 26, epoch 30]. A series of sequence experiments was then designed, and the results are listed in Table 5.
As the sequence complexity increases, model performance improves. In particular, in the G4 case, the model's mAP improves by 2.7 compared with the G0 case, which confirms the effectiveness of the cascade constraint sequence.

Generalizability: To verify the generality of the constraint loss function, we applied our method to popular regression loss functions. RetinaNet was used as the baseline method and ResNet50 as the backbone network, and the constraint parameter sequence adopted was the G4 sequence in Table 5, owing to its excellent performance. We tested the performance of different regression loss functions on the OBB and HBB tasks of DOTA; the results are shown in Table 6. Clearly, after optimization with our L_dm or CPs, model performance improves to varying degrees. The improvement is larger for the deviation-based methods: Huber loss gains 2.4 mAP on DOTA, and MAE loss gains 1.9 mAP on HRSC2016. One possible explanation is that L_dm plays an important role in the OBB task.
Effects of data augmentation: Many studies have shown that data augmentation can effectively improve detector performance. We augmented the data with random horizontal and vertical flipping, random graying, random rotation, and random channel changes. In addition, extra augmentation was applied to categories with few samples (such as helicopters and bridges). The experimental results are shown in Table 7: a 3.1% improvement was obtained on DOTA (from 67.9% to 71.0%) and a 1.7% improvement on HRSC2016 (from 86.2% to 87.9%). We also explored larger backbone networks, and the results show that a larger backbone yields better performance: the final performance of our improved model was 74.3% and 88.9% with ResNet152 as the backbone. In addition, our number of training iterations is 8 × 10^5, far fewer than that of popular methods, so the resource consumption of training is greatly reduced.

Overall Evaluation
We compare our proposed constraint loss function with state-of-the-art rotated object detection methods on two datasets, DOTA [47] and HRSC2016 [48].
Results on DOTA: We first evaluated our method on the DOTA dataset and compared it with popular rotated object detection methods, as shown in Table 8. The overall evaluation results were obtained by submitting our model to the official DOTA evaluation server. The DOTA training and validation sets were used as training samples, and the test set was used to verify model performance. The compared methods include the scene text detection methods R2CNN [50] and RRPN [30]; the popular rotated object detection methods ICN [51], RoI Transformer [31], and Gliding Vertex [36]; and methods that consider sudden changes in loss, such as SCRDet [33], R3Det [35], and RSDet [38]. Our method performs 1.1 better than the best result among the compared methods (RSDet + ResNet152 + Refine). Although our method did not achieve state-of-the-art performance in the DOTA rankings, it shows the best performance among one-stage detectors. In addition, it saves more than 30% of the computing resources compared with most methods. The visualization results are shown in Figure 10.

Results on HRSC2016: We also evaluated our method on HRSC2016 and compared it with popular detectors; the results are shown in Table 9. First, comparative experiments with the scene text detection methods RRPN and R2CNN yielded unsatisfactory detection accuracy. RoI Transformer and Gliding Vertex achieved good detection accuracy but require more training resources. RetinaNet-H and RetinaNet-R were used as baseline methods, of which RetinaNet-R obtained 89.1% detection accuracy, and R3Det [35] achieved better accuracy than the above methods. In the end, our method achieved an accuracy of 89.7%, reaching state-of-the-art performance under the optimization of L_dm and CPs while saving nearly half of the training resources. The visualization results are shown in Figure 11.

Conclusions
In this study, a constraint loss function is proposed, comprising the decoupling modulation loss and the cascade constraint mechanism. The former overcomes the sudden change in loss during oriented bounding box regression, and the latter allows the direction of parameter updates to be supervised and improves the convergence speed of the network. As a result, our RetinaNet-R-based method achieves 75.2% performance on the DOTA benchmark dataset and state-of-the-art performance on the HRSC2016 dataset. In addition, applying the cascade constraint mechanism to popular regression loss functions achieves better performance.
In this study, we improved the regression loss function through the modulation mechanism and the constraint mechanism, which not only improved the performance of the model but also saved training resources. However, the following limitations remain:
• The selection of the constraint parameters is supervised and set manually, which limits performance.
• Although the cascade constraint sequence is variable, the range of each stage is fixed (as shown in Table 5).
Therefore, in future research, we intend to explore a learnable constraint-parameter selection mechanism to improve the generalization ability of the model. Another future direction is to explore new ways of defining a rotated bounding box. Existing methods use either rotated rectangles or four vertices: the former suffers from angular periodicity, and the latter is sensitive to the order of the vertices. Therefore, the study of a better and more effective rotated bounding box definition is of great significance to the development of rotated object detection.

Data Availability Statement:
The data presented in this study are available on request from the first author.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Details of Gradient Analysis
For an input image X ∈ [W × H × 3], after network processing, the output is Y_i = [x, y, w, h, θ], where i = 1, 2, 3, ..., n indexes the corresponding proposals. Let the ground truth of the prediction box be Y* = [x*, y*, w*, h*, θ*]; we obtain the deviations of the center point, size, and angle of the target box (L_1, L_2, L_3) according to our design in Equations (2)-(4). We let W denote the set of all weight parameters, b the bias, and n the number of samples. The forward propagation process is z = WX + b, with activation function y = σ(z).
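The decomposition of the regression deviation into center, size, and angle components can be sketched as follows (a minimal NumPy illustration of the grouping into L_1, L_2, L_3 described above; the exact normalization of the paper's Equations (2)-(4) is not reproduced here):

```python
import numpy as np

def box_deviation(pred, gt):
    """Split the raw deviation between a predicted box [x, y, w, h, theta]
    and its ground truth into center (L1), size (L2), and angle (L3) parts."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    d = pred - gt
    L1 = d[:2]    # center-point deviation (x, y)
    L2 = d[2:4]   # size deviation (w, h)
    L3 = d[4:5]   # angle deviation (theta)
    return L1, L2, L3

L1, L2, L3 = box_deviation([10, 12, 30, 20, 0.5], [8, 12, 28, 24, 0.3])
```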
(1) First, we calculate the partial derivative of the loss function S(L) with respect to the output layer. From the chain rule:

δ^Y = ∇_Y S ⊙ σ'(z), (A1)

and, from the definition of S,

∂S/∂L = L, (A2)

where ∇_Y S represents the gradient vector of the loss function S with respect to the predicted value Y, σ' represents the derivative of the activation function, and ⊙ is the Hadamard product, i.e., the element-wise multiplication between matrices or vectors.
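The output-layer step can be sketched in NumPy as follows (a minimal illustration, not the paper's implementation; a quadratic loss S = ½‖L‖², so that ∂S/∂L = L, and a sigmoid activation are assumed):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_layer_delta(L, z):
    """delta_Y = (dS/dY) ⊙ sigma'(z).  For the assumed quadratic loss,
    the gradient dS/dL equals the prediction deviation L itself."""
    grad_S = L                                  # dS/dL = L
    sigma_prime = sigmoid(z) * (1 - sigmoid(z)) # derivative of sigmoid
    return grad_S * sigma_prime                 # Hadamard (element-wise) product
```

For z = 0 the sigmoid derivative is 0.25, so each deviation component is simply scaled by 0.25.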
(2) Next, we calculate the partial derivative for the fully connected (FC) layers. According to the above analysis, the partial derivative of the j-th element of the l-th layer can be expressed as:

δ_j^l = (Σ_k W_kj^(l+1) δ_k^(l+1)) σ'(z_j^l). (A3)

The vector form is expressed as:

δ^l = ((W^(l+1))^T δ^(l+1)) ⊙ σ'(z^l). (A4)

During the parameter update process, the partial derivatives with respect to the parameters W and b can be expressed as:

∂S/∂W_jk^l = δ_j^l σ(z_k^(l−1)), (A5)
∂S/∂b_j^l = δ_j^l, (A6)

or, in vector form,

∂S/∂W^l = δ^l (σ(z^(l−1)))^T, (A7)
∂S/∂b^l = δ^l. (A8)

(3) The pooling layer compresses the input during the forward propagation process. In back propagation, the input gradient P^(l−1) of the pooling layer can be obtained from P^l; this process is usually called an upsample:

P^(l−1) = upsample(P^l) ⊙ σ'(z^(l−1)), (A9)

where the first term represents the upsampling, and the second term is the derivative of the activation function. The second term can be treated as the constant 1 in the pooling process, because no activation function is involved in the pooling layer. In addition, there is no parameter update in the pooling layer, because the parameters W and b are not involved.
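The upsampling step of the pooling back propagation can be sketched as follows (a minimal NumPy illustration; max pooling is assumed here, since the pooled gradient is then routed back to the argmax position of each window):

```python
import numpy as np

def maxpool_backward(P_l, x, pool=2):
    """Recover P^{l-1} from P^l by 'upsampling': each pooled gradient is
    routed to the argmax position of its window; the activation-derivative
    factor is 1 because pooling involves no activation function."""
    H, W = x.shape
    P_prev = np.zeros_like(x)
    for i in range(0, H, pool):
        for j in range(0, W, pool):
            win = x[i:i + pool, j:j + pool]
            r, c = np.unravel_index(np.argmax(win), win.shape)
            P_prev[i + r, j + c] = P_l[i // pool, j // pool]
    return P_prev
```

All non-maximal positions receive zero gradient, which is exactly the "compression" of the forward pass undone in reverse.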
(4) Similar to the back propagation process of the pooling layer, the input gradient C^(l−1) of a convolutional layer can be obtained from C^l:

C^(l−1) = C^l ∂z^l/∂z^(l−1) = (C^l * rot180(W^l)) ⊙ σ'(z^(l−1)), (A10)

where * denotes convolution and rot180(·) rotates the kernel by 180°. During the parameter update process, the gradients of W and b can be expressed as:

∂S/∂W^l = σ(z^(l−1)) * C^l, (A11)
∂S/∂b^l = Σ_(u,v) (C^l)_(u,v). (A12)

Based on the above analysis, we find that the partial derivative of the loss function S with respect to L is always equal to the prediction deviation L (see (A2)) and is linearly positively correlated with the gradient of the output-layer elements (see (A4)). This means that introducing the constraint loss function has a direct impact on the back propagation process.
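The rot180 convolution in the input-gradient computation can be illustrated with a minimal NumPy sketch (the activation-derivative factor is omitted for clarity; this illustrates the standard operation, not the paper's code):

```python
import numpy as np

def conv_backward_input(C_l, W_l):
    """Propagate the gradient C^l through a convolutional layer:
    'full' convolution of C^l with the kernel rotated by 180 degrees."""
    W_rot = np.rot90(W_l, 2)  # rotate kernel by 180 degrees
    kh, kw = W_rot.shape
    # zero-pad so that the output has the input's spatial size ('full' mode)
    padded = np.pad(C_l, ((kh - 1, kh - 1), (kw - 1, kw - 1)))
    H = padded.shape[0] - kh + 1
    W = padded.shape[1] - kw + 1
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * W_rot)
    return out
```

A single-pixel gradient reproduces the kernel itself in the input gradient, which is a quick sanity check for the rot180 convention.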
When the center-point constraint (L_1*) or the size constraint (L_2*) is activated, the corresponding components of the regression deviation L are reduced, and this response is directly transferred to the gradient computation of the main layers (δ^Y, δ^l, P^(l−1), and C^(l−1)) and to the parameter adjustment without additional steps. This simplifies the back propagation task, and the effect is more pronounced when L_1* and L_2* are activated simultaneously: only the angle parameter θ is then adjusted (L = L_3), and the task becomes simple and easy to implement. It is worth noting that when L_1* and L_2* are activated, the center point and size of the candidate box have already been adjusted into the constraint range (see Figure 6), so subsequent adjustments are also suppressed. This avoids additional calculations and optimization in unfavorable directions.
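The effect of activating the constraints on the deviation vector can be sketched as follows (a hypothetical illustration: the boolean flags stand in for the actual activation conditions, which follow the constraint ranges of Figure 6 and are not reproduced here):

```python
import numpy as np

def constrained_deviation(L, center_ok, size_ok):
    """Zero out the deviation components whose constraints are activated.
    When both L1* and L2* fire, only the angle term L3 remains (L = L3),
    so back propagation adjusts the angle parameter theta alone."""
    L = np.array(L, float)
    if center_ok:
        L[:2] = 0.0   # center-point constraint L1* activated
    if size_ok:
        L[2:4] = 0.0  # size constraint L2* activated
    return L
```

Because the gradient is linearly correlated with these components, zeroing them suppresses further updates of the already-satisfied parameters.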