Author Contributions
Conceptualization, F.Z. and J.Z.; methodology, F.Z.; software, F.Z.; validation, F.Z.; formal analysis, F.Z.; investigation, F.Z.; resources, F.Z. and G.Z.; data curation, F.Z.; writing—original draft preparation, F.Z. and G.Z.; writing—review and editing, F.Z.; visualization, F.Z.; supervision, F.Z.; project administration, F.Z. and J.Z.; funding acquisition, J.Z. All authors have read and agreed to the published version of the manuscript.
Figure 1.
The overall pipeline of the proposed FFEDet. DarkNet53 serves as the backbone network and extracts features at four distinct levels. ECFA-PAN then enhances the extracted features. Finally, object detection is performed on three feature maps that contain rich semantic information at different levels.
Figure 2.
Composition of the CBAM structure. CBAM consists of a channel attention module (CAM) and a spatial attention module (SAM). The CAM applies global average pooling (GAP) and global max pooling (GMP) along the spatial dimensions, and the SAM applies channel average pooling (CAP) and channel max pooling (CMP) along the channel dimension.
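To make the pooling axes in this caption concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes a single feature map stored as nested lists with layout (C, H, W); the function names are hypothetical. GAP and GMP reduce each channel over its spatial positions and yield one value per channel, whereas CAP and CMP reduce over the channels at each spatial position and yield one H × W map.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # assumed layout: (C, H, W)


def global_pool(x: FeatureMap):
    """GAP and GMP: reduce each channel over its spatial dimensions."""
    gap, gmp = [], []
    for channel in x:
        values = [v for row in channel for v in row]
        gap.append(sum(values) / len(values))
        gmp.append(max(values))
    return gap, gmp  # two vectors of length C


def channel_pool(x: FeatureMap):
    """CAP and CMP: reduce over the channel dimension at each position."""
    num_channels = len(x)
    height, width = len(x[0]), len(x[0][0])
    cap = [[sum(x[c][i][j] for c in range(num_channels)) / num_channels
            for j in range(width)] for i in range(height)]
    cmp_ = [[max(x[c][i][j] for c in range(num_channels))
             for j in range(width)] for i in range(height)]
    return cap, cmp_  # two maps of shape (H, W)


# Toy input: 2 channels on a 2 x 2 spatial grid.
x = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[8.0, 6.0], [4.0, 2.0]],
]
gap, gmp = global_pool(x)     # per-channel descriptors
cap, cmp_ = channel_pool(x)   # per-position descriptors
```

In a full attention module these pooled descriptors would be passed through learned layers and a sigmoid to produce the attention weights; that part is omitted here because it depends on details not given in the caption.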
Figure 3.
Composition of the SPPCSPC structure. The SPPCSPC module processes the input feature map through multi-scale pooling and convolution operations to generate higher-dimensional features, followed by multiple convolutions and concatenations, ultimately outputting the enhanced feature map.
Figure 4.
Composition of the ECFA structure, which receives feature maps from three hierarchical scales, namely , , and . Downsampling and upsampling operations are applied to and . The resulting outcomes are then added to and concatenated, with spatial and channel attention mechanisms applied.
Figure 5.
The structure of CSP, ELAN, and E-ELAN. (a) CSP: the input is passed through two branches, one of which repeats a residual structure for multiple iterations; the outputs of both branches are then concatenated along the channel dimension. (b) ELAN and E-ELAN: in the ELAN module, feature fusion is accomplished by integrating the output of each stacked module layer, while in the E-ELAN module, feature fusion is achieved by incorporating the output of each convolutional layer within the stacked module.
Figure 6.
The structure of SEConv and S-ELAN. (a) SEConv: convolutions with different receptive fields extract multi-scale information; for example, pointwise convolutions extract inter-channel dependency information. (b) S-ELAN: the input splits into two branches, employing stacked SEConv modules and residual fusion techniques.
Figure 7.
Sample images from the datasets. Panels (a–d) correspond to PASCAL VOC, VisDrone-DET2021, TGRS-HRRSD, and DOTAv1.0, respectively.
Figure 8.
Qualitative examples of small object scene detection on PASCAL VOC. (a) Qualitative example of image 1. (b) Qualitative example of image 2. (c) Qualitative example of image 3. Each row shows, from left to right, the ground truth bounding boxes, the detection results of YOLOv7, and the detection results of our algorithm.
Figure 9.
Qualitative examples of small object scene detection on VisDrone-DET2021. (a) Qualitative example of image 1. (b) Qualitative example of image 2. (c) Qualitative example of image 3. Each row shows, from left to right, the ground truth bounding boxes and the detection results of YOLOv5, YOLOv7, YOLOv8, and our algorithm.
Figure 10.
Qualitative examples of small object scene detection on TGRS-HRRSD. (a) Qualitative example of image 1. (b) Qualitative example of image 2. (c) Qualitative example of image 3. Each row shows, from left to right, the ground truth bounding boxes and the detection results of YOLOv5, YOLOv7, YOLOv8, and our algorithm.
Figure 11.
Qualitative examples of small object scene detection on DOTAv1.0. (a) Qualitative example of image 1. (b) Qualitative example of image 2. (c) Qualitative example of image 3. Each row shows, from left to right, the ground truth bounding boxes and the detection results of YOLOv5, YOLOv7, YOLOv8, and our algorithm.
Table 1.
Performance comparison between Conv and SEConv.
Method | Params(M) | FLOPs | Input | Output |
---|---|---|---|---|
Conv(3 × 3) | 0.59 | 24.2 | (1,256,640,640) | (1,256,640,640) |
SEConv | 0.21 | 8.6 | (1,256,640,640) | (1,256,640,640) |
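As a sanity check on the Conv(3 × 3) row, the parameter count of a standard convolution can be computed directly. The sketch below assumes 256 input and 256 output channels (read from the tensor shapes in Table 1) and no bias term; the helper name is hypothetical. The internal layout of SEConv is not specified in this section, so only the standard convolution is computed.

```python
def conv2d_params(c_in: int, c_out: int, kernel: int, bias: bool = False) -> int:
    """Number of learnable parameters in a standard 2D convolution."""
    weights = c_in * c_out * kernel * kernel
    return weights + (c_out if bias else 0)


n = conv2d_params(256, 256, 3)
print(f"{n} parameters = {n / 1e6:.2f} M")  # 589824 parameters = 0.59 M
```

The result agrees with the 0.59 M reported for Conv(3 × 3). By the reported figures, SEConv uses roughly 36% of the parameters and FLOPs of the 3 × 3 convolution (0.21/0.59 and 8.6/24.2).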
Table 2.
Experimental environment.
Configuration | Parameter |
---|---|
Operating System | Ubuntu 18.04 |
GPU | NVIDIA RTX A6000 |
CUDA | 12.2 |
Framework | PyTorch 2.0.1 |
Programming Language | Python 3.8 |
Table 3.
Experimental results of different detection algorithms on the PASCAL VOC, VisDrone-DET2021, TGRS-HRRSD, and DOTAv1.0 datasets.
Dataset | Method | Params(M) | AP50 | AP | AP75 | APS | APM | APL | FPS(f/s) |
---|---|---|---|---|---|---|---|---|---|
PASCAL VOC | Baseline | 37.62 | 84.4 ± 0.2 | 62.5 ± 0.1 | 67.8 ± 0.3 | 25.9 ± 0.2 | 50.9 ± 0.1 | 71.4 ± 0.2 | 68.6 |
PASCAL VOC | YOLOv5l | 46.20 | 80.5 ± 0.3 | 58.8 ± 0.2 | 62.9 ± 0.4 | 21.6 ± 0.3 | 44.0 ± 0.2 | 66.7 ± 0.3 | 73.2 |
PASCAL VOC | YOLOv8l | 43.60 | 82.7 ± 0.4 | 62.3 ± 0.3 | 64.0 ± 0.2 | 24.9 ± 0.3 | 48.7 ± 0.2 | 68.2 ± 0.4 | 70.5 |
PASCAL VOC | Ours | 39.52 | 85.5 ± 0.2 | 63.8 ± 0.1 | 69.3 ± 0.3 | 27.0 ± 0.2 | 51.7 ± 0.1 | 73.0 ± 0.2 | 69.2 |
VisDrone-DET2021 | Baseline | 37.62 | 49.1 ± 0.3 | 28.0 ± 0.2 | 27.6 ± 0.2 | 18.7 ± 0.1 | 39.1 ± 0.3 | 49.3 ± 0.2 | 71.6 |
VisDrone-DET2021 | YOLOv5l | 46.20 | 41.4 ± 0.2 | 24.4 ± 0.1 | 24.8 ± 0.2 | 16.1 ± 0.1 | 34.0 ± 0.2 | 41.7 ± 0.3 | 77.4 |
VisDrone-DET2021 | YOLOv8l | 43.60 | 42.9 ± 0.3 | 25.9 ± 0.2 | 26.4 ± 0.2 | 17.4 ± 0.1 | 33.9 ± 0.2 | 43.1 ± 0.3 | 60.8 |
VisDrone-DET2021 | Ours | 39.52 | 53.9 ± 0.2 | 32.1 ± 0.1 | 32.7 ± 0.2 | 23.2 ± 0.1 | 42.9 ± 0.2 | 49.7 ± 0.2 | 72.7 |
TGRS-HRRSD | Baseline | 37.62 | 89.8 ± 0.2 | 68.9 ± 0.1 | 81.5 ± 0.3 | 28.7 ± 0.2 | 58.1 ± 0.1 | 60.3 ± 0.2 | 74.5 |
TGRS-HRRSD | YOLOv5l | 46.20 | 89.1 ± 0.3 | 63.9 ± 0.2 | 75.0 ± 0.4 | 30.2 ± 0.3 | 54.0 ± 0.2 | 58.7 ± 0.3 | 74.1 |
TGRS-HRRSD | YOLOv8l | 43.60 | 89.9 ± 0.2 | 69.4 ± 0.1 | 80.9 ± 0.3 | 28.6 ± 0.2 | 60.4 ± 0.1 | 62.3 ± 0.2 | 69.8 |
TGRS-HRRSD | Ours | 39.52 | 91.4 ± 0.2 | 70.1 ± 0.1 | 83.1 ± 0.3 | 31.0 ± 0.2 | 61.0 ± 0.1 | 63.4 ± 0.2 | 69.9 |
DOTAv1.0 | Baseline | 37.62 | 76.7 ± 0.2 | 51.6 ± 0.1 | 54.1 ± 0.2 | 25.4 ± 0.1 | 51.3 ± 0.2 | 60.5 ± 0.1 | 73.4 |
DOTAv1.0 | YOLOv5l | 46.20 | 73.0 ± 0.3 | 49.0 ± 0.2 | 50.9 ± 0.3 | 20.5 ± 0.2 | 45.3 ± 0.1 | 58.6 ± 0.2 | 68.3 |
DOTAv1.0 | YOLOv8l | 43.60 | 74.5 ± 0.2 | 52.9 ± 0.1 | 56.1 ± 0.2 | 24.9 ± 0.1 | 50.4 ± 0.2 | 57.3 ± 0.1 | 74.4 |
DOTAv1.0 | Ours | 39.52 | 78.2 ± 0.2 | 53.1 ± 0.1 | 55.0 ± 0.2 | 26.8 ± 0.1 | 53.7 ± 0.2 | 60.9 ± 0.1 | 70.6 |
Table 4.
Ablation study on the VisDrone-DET2021 dataset.
ECFA | SEConv | DFSLoss | WIoUv3 | AP50 | AP |
---|---|---|---|---|---|
| | | | 49.1 | 28.0 |
√ | | | | 52.0 | 30.6 |
| √ | | | 49.7 | 28.2 |
| | √ | | 50.3 | 28.6 |
| | | √ | 50.5 | 29.2 |
√ | √ | | | 52.4 | 30.8 |
| | √ | √ | 51.1 | 29.7 |
√ | √ | √ | | 53.1 | 29.8 |
√ | √ | | √ | 53.5 | 31.9 |
√ | √ | √ | √ | 53.9 | 32.1 |
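The contribution of each component can be read off as its difference from the baseline row (no component enabled). The snippet below is a small illustration using values copied from Table 4; the dictionary keys are shorthand labels introduced here, and each tuple holds the first metric column followed by AP.

```python
# (first metric column, AP) copied from Table 4
baseline = (49.1, 28.0)
rows = {
    "ECFA": (52.0, 30.6),
    "SEConv": (49.7, 28.2),
    "DFSLoss": (50.3, 28.6),
    "WIoUv3": (50.5, 29.2),
    "all four": (53.9, 32.1),
}

# Gain of each configuration over the baseline, rounded to one decimal place.
gains = {
    name: tuple(round(v - b, 1) for v, b in zip(vals, baseline))
    for name, vals in rows.items()
}
for name, (d_first, d_ap) in gains.items():
    print(f"{name:>8}: +{d_first:.1f} / +{d_ap:.1f}")
```

Among the single-component rows, ECFA gives the largest gain (+2.9 and +2.6 points), and the full configuration improves on the baseline by 4.8 and 4.1 points.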
Table 6.
Performance comparison of DFSLoss combined with each bounding box loss on VisDrone-DET2021.
Method | AP50 | AP | AP75 |
---|---|---|---|
CIoU [39] + DFSLoss | 50.3 | 28.6 | 28.1 |
GIoU [37] + DFSLoss | 50.2 | 28.7 | 28.3 |
DIoU [38] + DFSLoss | 50.1 | 28.3 | 27.9 |
SIoU [50] + DFSLoss | 49.7 | 27.9 | 27.4 |
EIoU [51] + DFSLoss | 49.3 | 27.4 | 27.1 |
MPDIoU [52] + DFSLoss | 50.0 | 28.1 | 27.8 |
WIoUv1 + DFSLoss | 50.4 | 28.9 | 28.3 |
WIoUv2 + DFSLoss | 50.9 | 29.3 | 28.8 |
WIoUv3 + DFSLoss | 51.1 | 29.7 | 29.3 |
Table 7.
Comparison results with other state-of-the-art methods on the VisDrone-DET2021 dataset.
Method | Backbone | Params(M) | AP50 | AP | AP75 | FPS(f/s) |
---|---|---|---|---|---|---|
YOLOv3 [5] | Darknet53 | 61.53 | 40.0 | 22.2 | 22.4 | 54.6 |
YOLOv4 [16] | Darknet53 | 52.50 | 39.2 | 23.5 | 23.4 | 55.0 |
YOLOv5l [17] | Darknet53 | 46.20 | 41.4 | 24.4 | 24.8 | 77.4 |
YOLOX [15] | Darknet53 | 54.20 | 39.1 | 22.4 | 22.7 | 68.9 |
YOLOv6l [18] | EfficientRep | 58.50 | 41.8 | 25.4 | 25.8 | 116 |
YOLOv8l [55] | Darknet | 43.60 | 42.9 | 25.9 | 26.4 | 60.8 |
CascadeNet [56] | ResNet101 | 184.00 | 47.1 | 28.8 | 29.3 | - |
RetinaNet [57] | ResNet50 | 59.20 | 44.9 | 26.2 | 27.1 | 54.1 |
HRDNet [53] | ResNet18 + 101 | 63.60 | 49.3 | 28.3 | 28.2 | - |
GFLV2 [58] (CVPR 2021) | ResNet50 | 72.50 | 50.7 | 28.7 | 28.4 | 19.4 |
RFLA [59] (ECCV 2022) | ResNet50 | 57.30 | 45.3 | 27.4 | - | - |
QueryDet [19] (CVPR 2022) | ResNet50 | - | 48.1 | 28.3 | 28.8 | 14.9 |
ScaleKD [54] (CVPR 2023) | ResNet50 | 43.57 | 49.3 | 29.5 | 30.0 | 20.1 |
Ours | Darknet53 | 39.52 | 53.9 | 32.1 | 32.7 | 72.7 |