Author Contributions
Conceptualization, methodology, formal analysis, L.X. and C.P.; writing—original draft preparation, experiment, project administration, L.X.; validation, writing—review and editing, C.P.; resources, data curation, experiment, supervision, Y.G. and Z.S. All authors have read and agreed to the published version of the manuscript.
Figure 1.
The overall framework of the proposed CF-SSD. The input image is first normalized and then fed to the backbone network, which produces four basic feature maps. These feature maps are then transformed into new features by the GAM and CF modules. Our work comprises (I) the GAM, (II) the CF module, (III) anchor design, and (IV) loss function design.
Figure 2.
The structure of the GAM. Each input feature map from the backbone passes through a 1 × 1 convolution and average pooling, and the pooled results are concatenated into a single vector. Linear layers then produce a weight vector for every channel of every feature map.
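To make the data flow in Figure 2 concrete, below is a minimal PyTorch sketch of the squeeze, concatenate, and reweight pattern the caption describes; the channel counts and layer widths are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class GAMSketch(nn.Module):
    """Minimal sketch of the global attention idea in Figure 2.

    Each backbone feature map is reduced by a 1x1 convolution and global
    average pooling; the pooled vectors are concatenated and fed through
    linear layers that emit one weight per channel of every input map.
    All widths below are illustrative, not the paper's exact values.
    """

    def __init__(self, in_channels=(256, 512, 1024, 2048), reduced=64, hidden=256):
        super().__init__()
        self.squeeze = nn.ModuleList(
            nn.Conv2d(c, reduced, kernel_size=1) for c in in_channels
        )
        total_out = sum(in_channels)           # one weight per channel overall
        self.fc = nn.Sequential(
            nn.Linear(reduced * len(in_channels), hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, total_out),
            nn.Sigmoid(),                      # weights in (0, 1)
        )
        self.in_channels = in_channels

    def forward(self, feats):
        # 1x1 conv + global average pool on every scale, then concatenate.
        pooled = [conv(f).mean(dim=(2, 3)) for conv, f in zip(self.squeeze, feats)]
        weights = self.fc(torch.cat(pooled, dim=1))
        # Split the joint weight vector back per feature map and rescale channels.
        out, start = [], 0
        for f, c in zip(feats, self.in_channels):
            w = weights[:, start:start + c].view(-1, c, 1, 1)
            out.append(f * w)
            start += c
        return out
```

Each returned map keeps the shape of its input, so a module of this form can sit between the backbone and the fusion stage.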
Figure 3.
The structure of the CF module. The feature maps from the GAM are first projected to the same channel dimension by 1 × 1 convolutions. Combinational fusions among them are then performed by up-sampling and down-sampling across the different scales.
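The combinational fusion of Figure 3 can be sketched in the same spirit: project every scale to a common width, then let each output scale absorb the others through resizing. The fusion combinations and the merge operator used here (summation) are assumptions; only the resize-and-combine pattern is taken from the caption.

```python
import torch.nn as nn
import torch.nn.functional as F

class CFSketch(nn.Module):
    """Minimal sketch of the combinational fusion idea in Figure 3.

    All scales are projected to a common channel width by 1x1 convolutions;
    each output scale is then the sum of every projected map, resized to
    that scale by bilinear up-sampling or adaptive-pooling down-sampling.
    """

    def __init__(self, in_channels=(256, 512, 1024, 2048), common=256):
        super().__init__()
        self.project = nn.ModuleList(
            nn.Conv2d(c, common, kernel_size=1) for c in in_channels
        )

    def forward(self, feats):
        projected = [p(f) for p, f in zip(self.project, feats)]
        fused = []
        for target in projected:
            h, w = target.shape[2:]
            acc = target.clone()
            for other in projected:
                if other is target:
                    continue
                if other.shape[2] > h:           # larger map: down-sample
                    acc = acc + F.adaptive_avg_pool2d(other, (h, w))
                else:                            # smaller map: up-sample
                    acc = acc + F.interpolate(other, size=(h, w),
                                              mode="bilinear",
                                              align_corners=False)
            fused.append(acc)
        return fused
```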
Figure 4.
Transforming a feature map of size C × W × H into C′ × W × H by convolution operations. (a) Transformation by a 3 × 3 convolution. (b) Transformation by 1 × 1 and 3 × 3 convolutions. (c) Transformation by 1 × 1, 1 × 3, and 3 × 1 convolutions. Every convolution is followed by batch normalization (BN) and ReLU.
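The three transformations in Figure 4 correspond to small convolution stacks, sketched below in PyTorch; the intermediate channel width (half of C′) is an assumed reduction ratio.

```python
import torch.nn as nn

def conv_bn_relu(c_in, c_out, kernel_size, padding):
    # Every convolution in Figure 4 is followed by BN and ReLU.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size, padding=padding, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

def block_a(c, c_out):
    # (a) a single 3x3 convolution.
    return conv_bn_relu(c, c_out, 3, 1)

def block_b(c, c_out, mid=None):
    # (b) 1x1 channel reduction followed by a 3x3 convolution.
    mid = mid or c_out // 2
    return nn.Sequential(conv_bn_relu(c, mid, 1, 0),
                         conv_bn_relu(mid, c_out, 3, 1))

def block_c(c, c_out, mid=None):
    # (c) 1x1 reduction, then a 3x3 factorized into 1x3 and 3x1.
    mid = mid or c_out // 2
    return nn.Sequential(conv_bn_relu(c, mid, 1, 0),
                         conv_bn_relu(mid, mid, (1, 3), (0, 1)),
                         conv_bn_relu(mid, c_out, (3, 1), (1, 0)))
```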
Figure 5.
The positions of the real anchor (yellow box) and the predicted anchor (blue box). In (a–e), the predicted boxes all deviate from the real boxes. In (a–c), the Lloc values of Equation (14) can be the same. In (d,e), the Lloc values of the original SSD loss, which comprises the width, height, and center losses, can be the same.
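A quick numerical illustration of the caption's point: under a smooth L1 loss on (cx, cy, w, h), two predictions that deviate in different ways can receive exactly the same Lloc while overlapping the ground truth to different degrees. The sketch below uses raw coordinate differences rather than SSD's variance-scaled log encodings, so it only illustrates the effect.

```python
import torch
import torch.nn.functional as F

def corners(box):
    # (cx, cy, w, h) -> (x1, y1, x2, y2)
    cx, cy, w, h = box.tolist()
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def iou(a, b):
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

gt = torch.tensor([0.0, 0.0, 10.0, 10.0])       # 10x10 box at the origin
pred_a = torch.tensor([1.0, 0.0, 10.0, 10.0])   # center shifted by 1
pred_b = torch.tensor([0.0, 0.0, 11.0, 10.0])   # width enlarged by 1

for name, pred in (("shifted center", pred_a), ("wider box", pred_b)):
    lloc = F.smooth_l1_loss(pred, gt, reduction="sum").item()
    print(f"{name}: Lloc = {lloc:.3f}, IoU = {iou(corners(gt), corners(pred)):.3f}")
# Both predictions get Lloc = 0.500, but IoU = 0.818 vs. 0.909.
```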
Figure 6.
Sample images from the SSDD dataset: small ship targets in large sea areas, blurred ships, ships in offshore areas, and noisy backgrounds are shown in the first, second, third, and fourth columns, respectively.
Figure 7.
Different structures for reducing convolution computation. (a) Transformation by 1 × 1 and 3 × 3 convolutions. (b) Transformation by 1 × 1, 1 × 3, and 3 × 1 convolutions. (c) Transformation by two parallel branches of (a). (d) Transformation by two parallel branches, one of (a) and one of (b). (e) Transformation by two parallel branches of (b).
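A compact sketch of the Figure 7 structures, assuming the two parallel branches each produce half of the output channels and are merged by concatenation (the merge operator is an assumption):

```python
import torch
import torch.nn as nn

def cbr(c_in, c_out, k, p):
    # conv + BN + ReLU, as used throughout Figures 4 and 7
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, padding=p, bias=False),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

def path_a(c_in, c_out):
    # Figure 7 (a): 1x1 reduction then a 3x3 convolution
    return nn.Sequential(cbr(c_in, c_out // 2, 1, 0), cbr(c_out // 2, c_out, 3, 1))

def path_b(c_in, c_out):
    # Figure 7 (b): 1x1 reduction then factorized 1x3 and 3x1 convolutions
    m = c_out // 2
    return nn.Sequential(cbr(c_in, m, 1, 0), cbr(m, m, (1, 3), (0, 1)),
                         cbr(m, c_out, (3, 1), (1, 0)))

class TwoWay(nn.Module):
    # Figure 7 (c)-(e): two parallel branches, outputs concatenated on channels
    def __init__(self, b1, b2):
        super().__init__()
        self.b1, self.b2 = b1, b2

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x)], dim=1)

variant_c = TwoWay(path_a(256, 128), path_a(256, 128))   # (c): (a) + (a)
variant_d = TwoWay(path_a(256, 128), path_b(256, 128))   # (d): (a) + (b)
variant_e = TwoWay(path_b(256, 128), path_b(256, 128))   # (e): (b) + (b)
```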
Figure 8.
Comparison of the different detectors; from left to right: SSD, SSD+FPN, RetinaNet480, and CF-SSD. (a) Inshore ship; (b) small offshore ship; (c) docked ship; (d) inshore and docked ships; (e) blurred offshore ship; (f) blurred offshore ship.
Figure 9.
Qualitative examples of ship detection on NWPUVHR-10. In each group, the left, middle, and right images show the results of SSD512, SSD+FPN, and CF-SSD512, respectively. (a) Offshore ship; (b) offshore ship; (c) offshore ship; (d) inshore and docked ships; (e) inshore and docked ships; (f) docked ship.
Figure 10.
Failure cases of CF-SSD on the SSDD dataset: (a) ground truth and (b) detection result.
Table 1.
Scale matching between the prior anchors and the objects of the SSDD dataset, grouped by the ratio of object area to image area.
| Anchor Setting | <0.004 | 0.004–0.01 | 0.01–0.05 | ≥0.05 |
|---|---|---|---|---|
| SSDD dataset | ✓ | ✓ | ✓ | ✓ |
| Default prior anchor | ✕ | ✕ | ✓ | ✓ |
| Adjusted prior anchor | ✓ | ✓ | ✓ | ✓ |
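The coverage pattern in Table 1 follows from a simple rule of thumb: an anchor with relative scale s best matches objects whose area ratio is roughly s², so the default SSD scales (smallest about 0.1) only reach ratios of about 0.01 and above. The check below reproduces the table's pattern with assumed scale values; the paper's exact adjusted scales may differ.

```python
# Anchor scales are relative to the image side; the adjusted values here
# are illustrative assumptions, not the paper's exact settings.
default_scales = [0.1, 0.2, 0.37, 0.54, 0.71, 0.88]
adjusted_scales = [0.04, 0.07, 0.1, 0.2, 0.37, 0.54, 0.71, 0.88]

# The four area-ratio bins used in Table 1.
bins = [(0.0, 0.004), (0.004, 0.01), (0.01, 0.05), (0.05, 1.0)]

def covered(scales, lo, hi):
    # A bin is covered if some anchor's area ratio s^2 falls inside it.
    return any(lo <= s * s < hi for s in scales)

for name, scales in (("default", default_scales), ("adjusted", adjusted_scales)):
    marks = ["✓" if covered(scales, lo, hi) else "✕" for lo, hi in bins]
    print(name, marks)
# default  -> ['✕', '✕', '✓', '✓']
# adjusted -> ['✓', '✓', '✓', '✓']
```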
Table 2.
Comparison of the results of various algorithms on VOC2007.
| Method | Backbone | Input Size | FPS | mAP |
|---|---|---|---|---|
| Faster RCNN [4] | VGG16 | 600 × 1000 | 7 | 0.732 |
| Faster RCNN [4] | ResNet101 | 600 × 1000 | 2.4 | 0.764 |
| ION [5] | VGG16 | 600 × 1000 | 1.25 | 0.765 |
| R-FCN [46] | ResNet101 | 600 × 1000 | 9 | 0.805 |
| R-FCN Cascade [2] | ResNet101 | 600 × 1000 | 7 | 0.810 |
| CoupleNet [47] | ResNet101 | 600 × 1000 | 7 | 0.817 |
| YOLOv2 [7] | Darknet19 | 352 × 352 | 81 | 0.737 |
| YOLOv3 [8] | ResNet34 | 320 × 320 | − | 0.801 |
| SSD300 [9] | VGG16 | 300 × 300 | 46 | 0.772 |
| DSSD320 [14] | ResNet101 | 320 × 320 | 9.5 | 0.786 |
| RSSD300 [17] | VGG16 | 300 × 300 | 35 | 0.785 |
| FSSD300 [21] | VGG16 | 300 × 300 | 36 | 0.788 |
| RefineDet320 [19] | VGG16 | 320 × 320 | 40 | 0.800 |
| RFBNet300 [22] | VGG16 | 300 × 300 | − | 0.807 |
| AFP-SSD [48] | VGG16 | 300 × 300 | 21 | 0.793 |
| F_SE_SSD [24] | VGG16 | 300 × 300 | 35 | 0.804 |
| BPN320 [20] | VGG16 | 320 × 320 | 32 | 0.803 |
| CF-SSD300 | ResNet50 | 300 × 300 | 33 | 0.809 |
Table 3.
Results of different components of CF-SSD on the SSDD dataset.
| Component | mAP |
|---|---|
| Original SSD | 0.8822 |
| SSD | 0.8871 |
| SSD + CF | 0.8994 |
| SSD + CF + Mixed loss | 0.9011 |
| SSD + GAM + CF + Mixed loss | 0.9030 |
| SSD + SE + CF + Mixed loss | 0.9003 |
| SSD + SA + CF + Mixed loss | 0.9002 |
Table 4.
Comparison of the results of various algorithms on the SSDD dataset.
| Method | Input Size | Backbone | FPS | mAP |
|---|---|---|---|---|
| SSD [9] | 300 × 300 | VGG16 | 49 | 0.887 |
| SSD+FPN | 300 × 300 | ResNet50 | 40 | 0.896 |
| FSSD [21] | 300 × 300 | VGG16 | 38 | 0.894 |
| RetinaNet384+FPN [18] | 384 × 384 | ResNet50 | 24 | 0.878 |
| RetinaNet480+FPN [18] | 480 × 480 | ResNet50 | 19 | 0.896 |
| Faster RCNN [4] | 320 × 320 | ResNet50 | 5 | 0.888 |
| FCOS+FPN [10] | 384 × 384 | ResNet50 | 16 | 0.901 |
| CF-SSD | 300 × 300 | ResNet50 | 35 | 0.903 |
Table 5.
Comparison of the results of various algorithms on NWPUVHR-10.
| Method | Input Size | Backbone | Inference Time (s) | mAP |
|---|---|---|---|---|
| R-P-Faster RCNN [49] | 512 × 512 | VGG16 | 0.155 | 0.765 |
| SSD512 [9] | 512 × 512 | VGG16 | 0.061 | 0.784 |
| Deformable R-FCN [50] | 512 × 512 | ResNet101 | 0.201 | 0.791 |
| Faster RCNN [4] | 600 × 1000 | VGG16 | 0.16 | 0.809 |
| Deformable Faster RCNN [51] | 600 × 1000 | VGG16 | − | 0.844 |
| RetinaNet512 [18] | 512 × 512 | ResNet101 | 0.17 | 0.882 |
| RDAS512 [52] | 512 × 512 | VGG16 | 0.057 | 0.895 |
| Multi-scale CNN [53] | 512 × 512 | VGG16 | 0.11 | 0.896 |
| YOLOv3 [8] | 512 × 512 | Darknet53 | 0.047 | 0.896 |
| FMSSD [54] | 512 × 512 | VGG16 | − | 0.904 |
| CF-SSD512 | 512 × 512 | ResNet50 | 0.084 | 0.906 |