Figure 1.
Some RSIs in the DOTA and HRSC2016 datasets. (a) The orientation of objects in RSIs is arbitrary. The HBB (top) and OBB (bottom) are two representation methods in RSI object detection. (b) RSIs tend to contain complex backgrounds. (c) The scales of objects within the same RSI may vary dramatically, e.g., small vehicles versus ground track fields. (d) Many objects in RSIs have large aspect ratios, such as slender ships.
Figure 2.
Framework of our method. The backbone network, HRGANet, is followed by the RIE prediction model. The HRGANet backbone contains HRNet and the GAM. Up samp. represents a bilinear upsampling operation followed by a 1 × 1 convolution. Down samp. denotes a 3 × 3 convolution with a stride of 2. Conv unit. is a 1 × 1 convolution.
Figure 3.
The network structure of the GAM. W, H, and C represent the width, height, and channel number of the feature maps, respectively. ⊗ represents the broadcast multiplication operation. ⊕ denotes the concatenation operation. Conv is a convolution operation with 1 × 1 kernels, BN is a batch normalization operation, and ReLU is the ReLU activation function. A weight block is composed of a 1 × 1 convolution operation, a BN operation, and a ReLU operation.
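The weight block described in the caption chains a 1 × 1 convolution, batch normalization, and ReLU. Below is a minimal pure-Python sketch of one block, with our own function name and scalar BN parameters; following the parameter counts in Table 6, we assume each block reduces its C input channels to a single-channel weight map. The real module operates on batched tensors with learned BN statistics.

```python
import math

def weight_block(fmap, kernel, bias=0.0, gamma=1.0, beta=0.0, eps=1e-5):
    """One GAM weight block: a 1 x 1 convolution reducing C channels to a
    single channel, batch normalization over that channel, then ReLU.
    `fmap` is a list of C feature maps (H x W nested lists); `kernel`
    holds the C scalar weights of the 1 x 1 convolution."""
    C, H, W = len(fmap), len(fmap[0]), len(fmap[0][0])
    # 1 x 1 convolution: per-pixel weighted sum across channels
    conv = [[sum(kernel[c] * fmap[c][i][j] for c in range(C)) + bias
             for j in range(W)] for i in range(H)]
    # batch normalization over the H x W activations of the output channel
    flat = [v for row in conv for v in row]
    mean = sum(flat) / len(flat)
    var = sum((v - mean) ** 2 for v in flat) / len(flat)
    bn = [[gamma * (v - mean) / math.sqrt(var + eps) + beta for v in row]
          for row in conv]
    # ReLU activation
    return [[max(0.0, v) for v in row] for row in bn]
```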
Figure 4.
(a) RRB representation, defined by the center point, the width w, the height h, and the rotation angle with a small angle jitter. (b) RIE representation of the target used in our method, defined by the center point, the long half-axis vertex, and the four outer-rectangle vertices of the RIE. e denotes the eccentricity, and an additional label encodes the orientation. Yellow lines denote the short half-axis b. Red, blue, and green lines represent the offsets of the long half-axis a.
Figure 5.
(a) IoU curves under different height–width ratios and angle biases. a/b represents the height–width ratio, i.e., the aspect ratio of the object. (b) IoU curves under different orientation offsets and RIE eccentricities.
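The IoU behavior plotted in (a) can be reproduced numerically. The sketch below (helper names are ours) computes the IoU of two rotated rectangles via Sutherland–Hodgman polygon clipping; it illustrates how, at the same angle bias, a high-aspect-ratio box loses IoU much faster than a square one.

```python
import math

def rect_corners(cx, cy, w, h, theta):
    """Counter-clockwise corners of a rotated rectangle."""
    c, s = math.cos(theta), math.sin(theta)
    half = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(cx + x * c - y * s, cy + x * s + y * c) for x, y in half]

def _clip(poly, a, b):
    """Keep the part of `poly` on the left of the directed edge a -> b."""
    def side(pt):
        return (b[0] - a[0]) * (pt[1] - a[1]) - (b[1] - a[1]) * (pt[0] - a[0])
    out = []
    for i in range(len(poly)):
        p, q = poly[i], poly[(i + 1) % len(poly)]
        sp, sq = side(p), side(q)
        if sp >= 0:
            out.append(p)
        if (sp >= 0) != (sq >= 0):  # edge p->q crosses the clip line
            t = sp / (sp - sq)
            out.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
    return out

def polygon_area(poly):
    """Shoelace area of a simple polygon."""
    if len(poly) < 3:
        return 0.0
    s = sum(poly[i][0] * poly[(i + 1) % len(poly)][1]
            - poly[(i + 1) % len(poly)][0] * poly[i][1]
            for i in range(len(poly)))
    return abs(s) / 2.0

def rotated_iou(box1, box2):
    """IoU of two rotated boxes given as (cx, cy, w, h, theta in radians)."""
    p1, p2 = rect_corners(*box1), rect_corners(*box2)
    inter = p1
    for i in range(4):  # clip box1 against each edge of box2
        inter = _clip(inter, p2[i], p2[(i + 1) % 4])
        if not inter:
            return 0.0
    ai = polygon_area(inter)
    return ai / (polygon_area(p1) + polygon_area(p2) - ai)
```

For example, at a 10° angle bias an 8:1 box retains far less IoU than a 1:1 box, matching the trend in (a).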
Figure 6.
(a) The distribution of instance counts per category in the DOTA dataset. The outer ring represents the distribution over the 15 categories. The inner ring denotes the overall distribution of small (green), middle (blue), and large (yellow) instances. (b) The size distribution of instances in each category in the DOTA dataset. We divided all of the instances into three splits according to their OBB height: small instances for heights from 10 to 50 pixels, middle instances for heights from 50 to 300 pixels, and large instances for heights above 300 pixels.
Figure 7.
Visualization of the detection results of our method on the DOTA dataset.
Figure 8.
Visualization of the detection results of our method on the HRSC2016 dataset.
Figure 9.
Accuracy versus speed on the HRSC2016 dataset.
Table 1.
The structure of the HRNet backbone network. It comprises four stages. The 1st (2nd, 3rd, and 4th) stage is composed of 1 (1, 4, and 3) repeated modularized block(s), and each modularized block in the 1st (2nd, 3rd, and 4th) stage consists of 1 (2, 3, and 4) branch(es), each operating at a different resolution. Each branch contains four residual units and one fusion unit. In the table, each cell in the Stage box is composed of three parts: the first part represents the residual unit, the second number denotes the number of repetitions of the residual unit, and the third number denotes the number of repetitions of the modularized block. ≡ in the Fusion column represents the fusion unit. C is the channel number of the residual unit. We set C to 48 and denote the network HRNet-W48. Res. is the abbreviation of resolution.
Res. | Stage1 | Fusion | Stage2 | Fusion | Stage3 | Fusion | Stage4 | Fusion |
---|---|---|---|---|---|---|---|---|
| | ≡ | | ≡ | | ≡ | | ≡ |
| | ≡ | | ≡ | | ≡ | | ≡ |
| | | | ≡ | | ≡ | | ≡ |
| | | | | | ≡ | | ≡ |
Table 2.
Comparison with state-of-the-art methods of oriented object detection in RSIs on the DOTA dataset. We set the IoU threshold to 0.5 when calculating the AP.
Method | Backbone | FPN | PL | BD | BR | GTF | SV | LV | SH | TC | BC | ST | SBF | RA | HA | SP | HC | mAP |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Anchor-based |
FR-O [56] | ResNet-50 | - | 79.42 | 77.13 | 17.70 | 64.05 | 35.30 | 38.02 | 37.16 | 89.41 | 69.64 | 59.28 | 50.30 | 52.91 | 47.89 | 47.40 | 46.30 | 54.13 |
R-DFPN [26] | ResNet-101 | | 80.92 | 65.82 | 33.77 | 58.94 | 55.77 | 50.94 | 54.78 | 90.33 | 66.34 | 68.66 | 48.73 | 51.76 | 55.10 | 51.32 | 35.88 | 57.94 |
R²CNN [27] | ResNet-101 | - | 80.94 | 65.67 | 35.34 | 67.44 | 59.92 | 50.91 | 55.81 | 90.67 | 66.92 | 72.39 | 55.06 | 52.23 | 55.14 | 53.35 | 48.22 | 60.67 |
RRPN [28] | ResNet-101 | - | 88.52 | 71.20 | 31.66 | 59.30 | 51.85 | 56.19 | 57.25 | 90.81 | 72.84 | 67.38 | 56.69 | 52.84 | 53.08 | 51.94 | 53.58 | 61.01 |
ICN [29] | ResNet-101 | | 81.40 | 74.30 | 47.70 | 70.30 | 64.90 | 67.80 | 70.00 | 90.80 | 79.10 | 78.20 | 53.60 | 62.90 | 67.00 | 64.20 | 50.20 | 68.20 |
RoI Trans [30] | ResNet-101 | | 88.64 | 78.52 | 43.44 | 75.92 | 68.81 | 73.68 | 83.59 | 90.74 | 77.27 | 81.46 | 58.39 | 53.54 | 62.83 | 58.93 | 47.67 | 69.56 |
CAD-Net [32] | ResNet-101 | | 87.80 | 82.40 | 49.40 | 73.50 | 71.10 | 63.50 | 76.70 | 90.90 | 79.20 | 73.30 | 48.40 | 60.90 | 62.00 | 67.00 | 62.20 | 69.90 |
R³Det [33] | ResNet-101 | | 89.54 | 81.99 | 48.46 | 62.52 | 70.48 | 74.29 | 77.54 | 90.80 | 81.39 | 83.54 | 61.97 | 59.82 | 65.44 | 67.46 | 60.05 | 71.69 |
SCRDet [34] | ResNet-101 | | 89.98 | 80.65 | 52.09 | 68.36 | 68.36 | 60.32 | 72.41 | 90.85 | 87.94 | 86.86 | 65.02 | 66.68 | 66.25 | 68.24 | 65.21 | 72.61 |
ProjBB [39] | ResNet-101 | | 88.96 | 79.32 | 53.98 | 70.21 | 60.67 | 76.20 | 89.71 | 90.22 | 78.94 | 76.82 | 60.49 | 63.62 | 73.12 | 71.43 | 61.69 | 73.03 |
Gliding Vertex [36] | ResNet-101 | | 89.64 | 85.00 | 52.26 | 77.34 | 73.01 | 73.14 | 86.82 | 90.74 | 79.02 | 86.81 | 59.55 | 70.91 | 72.94 | 70.86 | 57.32 | 75.02 |
APE [37] | ResNet-50 | | 89.96 | 83.62 | 53.42 | 76.03 | 74.01 | 77.16 | 79.45 | 90.83 | 87.15 | 84.51 | 67.72 | 60.33 | 74.61 | 71.84 | 65.55 | 75.75 |
S²A-Net [35] | ResNet-101 | | 88.70 | 81.41 | 54.28 | 69.75 | 78.04 | 80.54 | 88.04 | 90.69 | 84.75 | 86.22 | 65.03 | 65.81 | 76.16 | 73.37 | 58.86 | 76.11 |
CSL [22] | ResNeXt101 [60] | | 90.25 | 85.53 | 54.64 | 75.31 | 70.44 | 73.51 | 77.62 | 90.84 | 86.15 | 86.69 | 69.60 | 68.04 | 73.83 | 71.10 | 68.93 | 76.17 |
Anchor-free |
IENet [42] | ResNet-101 | | 57.14 | 80.20 | 65.54 | 39.82 | 32.07 | 49.71 | 65.01 | 52.58 | 81.45 | 44.66 | 78.51 | 46.54 | 56.73 | 64.40 | 64.24 | 57.14 |
PIoU [46] | DLA-34 [61] | - | 80.90 | 69.70 | 24.10 | 60.20 | 38.30 | 64.40 | 64.80 | 90.90 | 77.20 | 70.40 | 46.50 | 37.10 | 57.10 | 61.90 | 64.00 | 60.50 |
Axis Learning [43] | ResNet-101 | | 79.53 | 77.15 | 38.59 | 61.15 | 67.53 | 70.49 | 76.30 | 89.66 | 79.07 | 83.53 | 47.27 | 61.01 | 56.28 | 66.06 | 36.05 | 65.98 |
P-RSDet [19] | ResNet-101 | | 89.02 | 73.65 | 47.33 | 72.03 | 70.58 | 73.71 | 72.76 | 90.82 | 80.12 | 81.32 | 59.45 | 57.87 | 60.79 | 65.21 | 52.59 | 69.82 |
O²-DNet [48] | Hourglass-104 | - | 89.20 | 76.54 | 48.95 | 67.52 | 71.11 | 75.86 | 78.85 | 90.84 | 78.97 | 78.26 | 61.44 | 60.79 | 59.66 | 63.85 | 64.91 | 71.12 |
BBAVectors [50] | ResNet-101 | - | 88.35 | 79.96 | 50.69 | 62.18 | 78.43 | 78.98 | 87.94 | 90.85 | 83.58 | 84.35 | 54.13 | 60.24 | 65.22 | 64.28 | 55.70 | 72.32 |
DRN [47] | Hourglass-104 | - | 89.71 | 82.34 | 47.22 | 64.10 | 76.22 | 74.43 | 85.84 | 90.57 | 86.18 | 84.89 | 57.65 | 61.93 | 69.30 | 69.63 | 58.48 | 73.23 |
CBDA-Net [45] | DLA-34 [61] | - | 89.17 | 85.92 | 50.28 | 65.02 | 77.72 | 82.32 | 87.89 | 90.48 | 86.47 | 85.90 | 66.85 | 66.48 | 67.41 | 71.33 | 62.89 | 75.74 |
RIE * | HRGANet-W48 | - | 89.23 | 84.86 | 55.69 | 70.32 | 75.76 | 80.68 | 86.14 | 90.26 | 80.17 | 81.34 | 59.36 | 63.24 | 74.12 | 70.87 | 60.36 | 74.83 |
RIE | HRGANet-W48 | - | 89.85 | 85.68 | 58.81 | 70.56 | 76.66 | 82.47 | 88.09 | 90.56 | 80.89 | 82.27 | 60.46 | 63.67 | 76.63 | 71.56 | 60.89 | 75.94 |
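For reference, the AP scores above are obtained at an IoU threshold of 0.5 by sweeping detections in descending confidence order and integrating precision over recall. The sketch below uses all-point interpolation with hypothetical inputs (function and parameter names are ours); the DOTA benchmark's exact evaluation protocol may differ in detail.

```python
def average_precision(scores, is_tp, num_gt):
    """AP from per-detection confidences and true-positive flags (a
    detection counts as TP when it matches an unmatched ground truth
    with IoU >= 0.5). All-point interpolation of the precision envelope."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        tp, fp = (tp + 1, fp) if is_tp[i] else (tp, fp + 1)
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # precision envelope: make precision non-increasing from the right
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # integrate precision over recall
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```

The mAP reported in the table is then the mean of the per-category APs.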
Table 3.
Comparison of the results of accuracy and parameters on the HRSC2016 dataset.
Model | Backbone | Resolution | AP (%) | Parameters |
---|---|---|---|---|
BL2 [23] | ResNet-101 | - | 69.60 | - |
R²CNN [27] | ResNet-101 | 800 × 800 | 73.07 | - |
RC1&RC2 [40] | VGG-16 | 800 × 800 | 75.70 | - |
RRPN [28] | ResNet-101 | 800 × 800 | 79.08 | 181.5 MB |
R²PN [31] | VGG-16 | - | 79.60 | - |
RRD [41] | VGG-16 | 384 × 384 | 82.89 | - |
RoI Trans [30] | ResNet-101-FPN | 512 × 800 | 86.20 | 273.8 MB |
R³Det [33] | ResNet-101-FPN | 800 × 800 | 89.26 | 227.0 MB |
S²A-Net [35] | ResNet-101-FPN | 512 × 800 | 90.17 | 257.0 MB |
IENet [42] | ResNet-101-FPN | 1024 × 1024 | 75.01 | 212.5 MB |
Axis learning [43] | ResNet-101-FPN | 800 × 800 | 78.15 | - |
BBAVector [50] | ResNet-101 | 608 × 608 | 88.60 | 276.3 MB |
PIoU [46] | DLA-34 [61] | 512 × 512 | 89.20 | - |
GRS-Det [20] | ResNet-101 | 800 × 800 | 89.57 | 200.0 MB |
DRN [47] | Hourglass-104 | 768 × 768 | 92.70 | - |
CBDA-Net [45] | DLA-34 [61] | - | 90.50 | - |
RIE | HRGANet-W48 | 800 × 800 | 91.27 | 207.5 MB |
Table 4.
Ablation study of the RIE. All of the models were evaluated on the HRSC2016 and DOTA datasets.
Model | GAM | ewoLoss | Recall | Precision | F1-Score | HRSC2016 mAP | DOTA mAP |
---|---|---|---|---|---|---|---|
Baseline | - | - | 91.76 | 72.81 | 81.19 | 86.15 | 71.89 |
RIE | - | 🗸 | 93.18 | 78.95 | 85.48 (+4.29) | 88.63 (+2.48) | 73.71 (+1.82) |
RIE | 🗸 | - | 94.21 | 80.33 | 86.72 (+5.53) | 89.90 (+3.75) | 74.83 (+2.94) |
RIE | 🗸 | 🗸 | 95.11 | 81.78 | 87.94 (+6.75) | 91.27 (+5.12) | 75.94 (+4.05) |
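The F1-scores in Table 4 are the harmonic mean of precision and recall; e.g., for the baseline, 2 × 72.81 × 91.76 / (72.81 + 91.76) ≈ 81.19. A one-line check (function name ours):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (here given in percent)."""
    return 2 * precision * recall / (precision + recall)
```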
Table 5.
Results of the comparison between the angle-based and RIE-based representation methods on the DOTA and HRSC2016 datasets based on three backbone networks.
Dataset | Representation Method | Backbone | mAP (%) |
---|---|---|---|
DOTA | Angle-based | ResNet-101 | 68.87 |
DOTA | Angle-based | HRNet-W48 | 70.36 |
DOTA | Angle-based | HRGANet-W48 | 71.46 |
DOTA | RIE-based | ResNet-101 | 73.28 (+4.41) |
DOTA | RIE-based | HRNet-W48 | 74.15 (+3.79) |
DOTA | RIE-based | HRGANet-W48 | 75.94 (+4.48) |
HRSC2016 | Angle-based | ResNet-101 | 83.40 |
HRSC2016 | Angle-based | HRNet-W48 | 85.60 |
HRSC2016 | Angle-based | HRGANet-W48 | 87.47 |
HRSC2016 | RIE-based | ResNet-101 | 87.63 (+4.23) |
HRSC2016 | RIE-based | HRNet-W48 | 89.90 (+4.30) |
HRSC2016 | RIE-based | HRGANet-W48 | 91.27 (+3.80) |
Table 6.
Statistical results of the GAM parameters.
GAM Architecture | GAM Layers | Parameters | Memory (MB) |
---|---|---|---|
Weight block 1 | Conv1 × 1 | 1 × (1 × 1 × 48) = 48 | |
 | BN | 2 | |
 | ReLU | 0 | |
Weight block 2 | Conv1 × 1 | 1 × (1 × 1 × 48 × 2) = 96 | |
 | BN | 2 | |
 | ReLU | 0 | |
Weight block 3 | Conv1 × 1 | 1 × (1 × 1 × 48 × 4) = 192 | |
 | BN | 2 | |
 | ReLU | 0 | |
Weight block 4 | Conv1 × 1 | 1 × (1 × 1 × 48 × 8) = 384 | |
 | BN | 2 | |
 | ReLU | 0 | |
Fusion | Conv1 × 1 | 256 × (1 × 1 × 48 × 15) = 184,320 | 0.7031 |
Softmax | softmax function | 0 | 0 |
Total | - | 185,048 | 0.7059 |
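The totals in Table 6 can be reproduced by counting 1 × 1 convolution weights (input channels × output channels, no bias) plus two BN parameters per weight block, with memory at 4 bytes per float32 parameter. A sketch with our own helper names:

```python
def conv1x1_params(c_in, c_out):
    """Weight count of a bias-free 1 x 1 convolution: c_out x (1 x 1 x c_in)."""
    return c_out * c_in

def param_memory_mb(n_params, bytes_per_param=4):
    """Memory footprint in MB, assuming float32 parameters."""
    return n_params * bytes_per_param / 1024 / 1024

def gam_param_count(c=48):
    """Reproduce Table 6: four weight blocks reduce c, 2c, 4c, and 8c
    input channels to one channel each (plus 2 BN parameters per block);
    the fusion conv maps 15c channels to 256; softmax adds nothing."""
    total = 0
    for mult in (1, 2, 4, 8):
        total += conv1x1_params(c * mult, 1) + 2  # conv + BN gamma/beta
    total += conv1x1_params(c * 15, 256)          # fusion: 184,320
    return total
```

With the defaults, `gam_param_count()` gives 185,048 parameters, about 0.706 MB, matching the table's total.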