Figure 1.
Overview of Our Architecture Adjustments. Grey convolutions and unsaturated channels are removed to decrease the runtime while preserving good accuracy. The mask branch improves the detection accuracy during training and can be removed for inference. A centerness prediction branch is added to suppress inaccurate detections that are triggered by a feature map pixel far from the detection’s center. Depending on the targeted runtime, we use a smaller backbone.
Figure 2.
Three images from iSAID. The first image shows the high variety in appearance and size of the classes in iSAID. The second and third images show the large difference in terms of GSD. The color coding for the classes is applied in all figures.
Figure 3.
Distribution of object sizes in iSAID [52] (a) and MS-COCO [57] (b). The distribution of iSAID is heavily skewed towards small objects; thus, a focus on small objects is needed. However, even though large objects are rare, each of them is more important for the average precision since they typically belong to rare classes. This discrepancy, combined with the many small and thus hard-to-detect objects, poses a major challenge for the detector.
Figure 4.
Histogram of the number of instances per class.
Figure 5.
Comparing RetinaNet with different backbones. The ResNet family shows a good accuracy-runtime trade-off in almost all circumstances. Only in the high-FPS regime does RegNetX achieve a higher precision at the same runtime compared to ResNet-18. However, further experiments have shown that ResNet-18 is better when the other improvements are also applied.
Figure 6.
Comparison of different classification head settings, i.e., number of stacked convolutions and number of channels, for input images of three sizes (left, middle and right). Reducing the resolution by one step impairs the precision only slightly while significantly increasing the frame rate. Reducing it further decreases the precision heavily for large model configurations. However, for small model configurations the relative impairment is lower while the frame rate still increases significantly. Thus, both steps are considered for selecting the final models.
Figure 7.
Qualitative results of our 15 FPS model (second row), 30 FPS model (third row), 60 FPS model (fourth row) and 90 FPS model (fifth row) compared to RetinaNet (bottom row) and the ground truth annotations (top row). These results confirm the quantitative results, with the faster models dropping in accuracy because of more missed objects. While RetinaNet does not miss many objects, it outputs many false positive predictions, which leads to its poor quantitative results.
Figure 8.
Qualitative result of RetinaNet predicting oriented bounding boxes. The resulting predictions are almost perfect. Elongated objects that are not axis-aligned have a much tighter bounding box when predicting OBBs compared to HBBs. This also reduces the risk of the NMS dropping correct bounding boxes because of a high overlap.
Table 1.
Comparison of our proposed object detector to different state-of-the-art detectors. Even though the compared detectors are close in terms of accuracy, our detectors have a significantly higher frame rate. The increased accuracy compared to the baseline RetinaNet is due to the multiple improvements mentioned in Section 3.2. The main reasons for the increased performance of the 15 FPS model are the application of TensorRT and half-precision floating-point arithmetic. The faster models benefit from carefully scaling the backbone, the head and the resolution.
Model | Backbone | Parameters | AP | AP50 | AP75 | APs | APm | APl | FPS |
---|---|---|---|---|---|---|---|---|---|
Faster R-CNN + FPN [12] | ResNet-50 | 41,424,031 | 41.6 | 64.2 | 45.7 | 44.1 | 50.4 | 42.2 | 2.8 |
RetinaNet [4] | ResNet-50 | 36,620,267 | 33.2 | 53.1 | 35.9 | 35.2 | 41.2 | 27.6 | 3.0 |
Mask R-CNN [15] | ResNet-50 | 44,050,863 | 42.8 | 65.2 | 47.7 | 45.3 | 52.6 | 44.8 | 2.3 |
Mask R-CNN + PA-FPN [61] | ResNet-50 | 47,591,343 | 42.5 | 64.7 | 47.1 | 44.9 | 52.8 | 67.5 | 2.2 |
EfficientDet D2 [48] | EfficientNet-B2 | 8,020,764 | 25.4 | 45.6 | 25.7 | 27.0 | 32.4 | 33.0 | 7.6 |
YOLOv4 [7] | CSPDarkNet53 | 64,079,388 | 44.4 | 70.1 | 47.6 | 46.7 | 51.0 | 52.5 | 3.3 |
15 FPS (ours) | ResNet-50 | 34,770,281 | 45.3 | 68.4 | 49.3 | 46.8 | 57.9 | 70.2 | 15.7 |
30 FPS (ours) | ResNet-50 | 27,910,733 | 43.6 | 65.7 | 47.8 | 45.2 | 58.6 | 87.0 | 30.7 |
60 FPS (ours) | ResNet-18 | 13,005,801 | 39.4 | 62.9 | 41.7 | 40.4 | 56.8 | 76.1 | 63.3 |
90 FPS (ours) | ResNet-18 | 13,005,801 | 31.9 | 53.2 | 33.2 | 31.6 | 56.7 | 79.9 | 97.8 |
Table 2.
Comparison of our proposed object detector to different results published for iSAID. All evaluations are executed on the test set.
Model | Backbone | AP | AP50 | AP75 | APs | APm | APl |
---|---|---|---|---|---|---|---|
Mask R-CNN [52] | ResNet-101 | 37.2 | 60.8 | 40.7 | 39.8 | 43.7 | 16.0 |
PANet [52] | ResNet-101 | 46.3 | 66.9 | 51.7 | 48.9 | 53.3 | 26.5 |
PANet [52] | ResNet-152 | 47.0 | 68.1 | 52.4 | 49.5 | 55.1 | 28.0 |
15 FPS (ours) | ResNet-50 | 43.6 | 64.3 | 48.7 | 46.2 | 49.4 | 19.6 |
30 FPS (ours) | ResNet-50 | 41.9 | 62.3 | 46.5 | 44.4 | 49.8 | 19.1 |
60 FPS (ours) | ResNet-18 | 38.3 | 60.1 | 41.6 | 40.1 | 49.6 | 18.6 |
90 FPS (ours) | ResNet-18 | 31.0 | 51.0 | 32.5 | 31.3 | 49.0 | 29.2 |
Table 3.
Comparing RetinaNet evaluated with MMDetection and TensorRT at different floating-point precisions. Applying TensorRT and reducing the precision to 16 bit each increase the frame rate slightly on their own. However, applying both optimizations multiplies the frame rate while not impacting the accuracy.
Executor | Precision | AP | AP50 | AP75 | FPS |
---|---|---|---|---|---|
MMDet | FP32 | 35.3 | 58.6 | 37.2 | 3.0 |
MMDet | FP16 | 35.3 | 58.6 | 37.2 | 5.0 |
TensorRT | FP32 | 35.2 | 58.4 | 37.1 | 4.2 |
TensorRT | FP16 | 35.3 | 58.4 | 37.2 | 14.8 |
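Such an FP16 engine can be produced at engine build time. The following is a minimal sketch of the conversion using the TensorRT Python API, assuming the detector has already been exported to ONNX (the file names are hypothetical):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("retinanet.onnx", "rb") as f:  # hypothetical ONNX export
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # allow half-precision kernels
engine = builder.build_serialized_network(network, config)

with open("retinanet_fp16.engine", "wb") as f:
    f.write(engine)
```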
Table 4.
Comparison of RetinaNet with varying octave base scales and different feature maps. Note that start and end indicate the first and last pyramid level used as feature maps. Reducing the size of the smallest anchor to accommodate the many small objects, either by reducing the anchors’ base size or by incorporating an earlier feature map in the detection head, increases the AP. While incorporating an earlier feature map leads to the best average precision, it heavily impairs the FPS because of the processing of the high-resolution feature map.
Base Size | Start | End | AP | AP50 | AP75 | FPS |
---|---|---|---|---|---|---|
4 | P3 | P7 | 32.7 | 52.5 | 35.2 | 14.8 |
4 | P2 | P7 | 35.8 | 57.2 | 38.7 | 6.4 |
3 | P3 | P7 | 35.2 | 56.4 | 38.3 | 14.8 |
2 | P3 | P7 | 35.3 | 58.4 | 37.2 | 14.8 |
1 | P3 | P7 | 26.1 | 49.2 | 24.9 | 14.8 |
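For reference, a sketch of the standard RetinaNet anchor convention that the base sizes above plug into; with three scales per octave, the smallest anchor at level Pl covers base_size · 2^l pixels:

```python
# Anchor sizes per pyramid level (RetinaNet convention): level Pl has
# stride 2**l, and anchors span three octave scales on top of the base.
def anchor_sizes(base_size, start=3, end=7):
    return {
        f"P{l}": [base_size * 2**l * 2 ** (o / 3) for o in range(3)]
        for l in range(start, end + 1)
    }

# Base size 4 starting at P3 yields a smallest anchor of 32 px, while
# base size 2 (or starting at P2 instead) yields 16 px.
print(anchor_sizes(4)["P3"][0])  # 32.0
```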
Table 5.
Comparison of RetinaNet with and without ATSS. The centerness score prediction is typically applied after filtering low-confidence detections. However, the TensorRT NMS has no support for such an operation. Thus, we evaluate both multiplying the centerness score before filtering detections (*) and ignoring the centerness score (†). Both variants significantly increase the precision compared to not applying ATSS, while multiplying the centerness score with the class score prediction before the filtering procedure shows the higher precision. To accommodate the lower scores, we reduce the score threshold, which slightly increases the precision. All variants with ATSS significantly increase the FPS since ATSS uses only 1 anchor instead of 9.
Detector | Threshold | AP | AP50 | AP75 | FPS |
---|---|---|---|---|---|
RetinaNet | 0.5 | 35.3 | 58.4 | 37.2 | 14.8 |
RetinaNet | 0.25 | 35.3 | 58.5 | 37.2 | 14.8 |
ATSS * | 0.5 | 37.8 | 58.5 | 41.0 | 15.7 |
ATSS * | 0.25 | 37.9 | 58.8 | 41.1 | 15.7 |
ATSS † | 0.5 | 37.2 | 58.5 | 40.0 | 15.7 |
ATSS † | 0.25 | 37.2 | 58.6 | 40.1 | 15.7 |
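A minimal sketch of the score-fusion variant described above (tensor names and shapes are assumptions): the centerness is multiplied into the class scores before the score threshold is applied, which is why the threshold is lowered.

```python
import torch

def fuse_and_filter(cls_logits, centerness_logits, threshold=0.25):
    # cls_logits: (N, num_classes); centerness_logits: (N, 1)
    scores = cls_logits.sigmoid() * centerness_logits.sigmoid()
    keep = scores.max(dim=1).values > threshold  # lowered score threshold
    return scores[keep], keep
```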
Table 6.
Comparison of dataset preparation settings, i.e., with and without discarding boxes that have a low IoU with the original box in the uncropped image, and with and without class-balanced resampling. The third line is the default data preparation strategy proposed by [52]. To prevent confusing the detector with unidentifiable objects during training, we discard objects with a visibility of less than 50% in the experiments for the ablation studies. However, since the validation set includes such objects, discarding impairs the performance. The denser cropping strategy with a 400 pixel overlap in the application of the sliding window significantly increases the precision compared to the default overlap of only 200 pixels.
Discard | Overlap | Class-Bal. | AP | AP50 | AP75 |
---|---|---|---|---|---|
✓ | 400 | - | 35.3 | 58.6 | 37.2 |
✓ | 400 | ✓ | 37.9 | 61.9 | 40.2 |
- | 200 | - | 34.0 | 57.4 | 35.8 |
- | 400 | - | 38.3 | 62.5 | 41.1 |
- | 400 | ✓ | 41.1 | 66.0 | 44.2 |
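The overlap parameter above controls the stride of the sliding window used to cut the large aerial images into patches. A short sketch of the cropping geometry, assuming 800 × 800 patches:

```python
# Window origins along one axis for sliding-window cropping; a larger
# overlap means a smaller stride and thus denser crops.
def window_origins(image_size, patch=800, overlap=400):
    stride = patch - overlap
    last = max(image_size - patch, 0)
    origins = list(range(0, last + 1, stride))
    if origins[-1] != last:  # ensure the image border is covered
        origins.append(last)
    return origins

print(window_origins(2000, overlap=400))  # [0, 400, 800, 1200]
print(window_origins(2000, overlap=200))  # [0, 600, 1200]
```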
Table 7.
Class-wise results comparing RetinaNet with and without class-balanced sampling. Even though rare classes like Helicopter obviously benefit most from the oversampling of the images containing them, frequent classes like Small Vehicle also show a better precision. Ship is the only class that is slightly impaired by the class balancing. Abbreviations: BD—Baseball Diamond, GTF—Ground Track Field, SV—Small Vehicle, LV—Large Vehicle, TC—Tennis Court, BC—Basketball Court, ST—Storage Tank, SBF—Soccer Ball Field, RA—Roundabout, SP—Swimming Pool, HC—Helicopter.
| Plane | BD | Bridge | GTF | SV | LV | Ship | TC | BC | ST | SBF | RA | Harbor | SP | HC |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
% of Annotations | 2.8 | 0.2 | 0.5 | 0.2 | 69.1 | 8.2 | 11.9 | 0.8 | 0.2 | 3.0 | 0.3 | 0.2 | 2.1 | 0.6 | 0.1 |
No class-bal. | 60.6 | 42.8 | 19.2 | 24.3 | 21.4 | 34.1 | 52.3 | 68.9 | 32.3 | 38.4 | 20.1 | 32.6 | 39.7 | 36.1 | 7.3 |
With class-bal. | 61.1 | 48.1 | 20.4 | 27.7 | 21.8 | 35.0 | 52.2 | 71.7 | 37.8 | 39.3 | 25.0 | 35.1 | 40.8 | 38.3 | 14.6 |
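One common way to realize such image-level class balancing is repeat-factor sampling; whether the paper uses exactly this variant is an assumption, but the sketch conveys the idea of oversampling images that contain rare classes:

```python
import math

def image_repeat_factor(image_classes, class_frequency, threshold=0.01):
    # class_frequency: fraction of all annotations belonging to each class
    return max(
        1.0,
        max(math.sqrt(threshold / class_frequency[c]) for c in image_classes),
    )

# An image containing a Helicopter (0.1% of annotations) is repeated about
# sqrt(0.01 / 0.001) ≈ 3.2 times as often as a typical image.
print(image_repeat_factor({"HC"}, {"HC": 0.001}))
```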
Table 8.
Comparing RetinaNet with an auxiliary mask branch and without a mask branch for different input resolutions. Across all resolutions, the auxiliary mask branch shows a significant increase in precision while not impairing the FPS since it is not executed during inference.
Mask Branch | Res. | AP | AP50 | AP75 | FPS |
---|---|---|---|---|---|
- | 800 × 800 | 38.2 | 62.3 | 41.1 | 14.8 |
- | 600 × 600 | 37.2 | 60.1 | 39.8 | 23.5 |
- | 400 × 400 | 30.6 | 52.1 | 31.7 | 39.6 |
✓ | 800 × 800 | 39.6 | 63.6 | 42.8 | 14.8 |
✓ | 600 × 600 | 38.4 | 61.6 | 41.3 | 23.6 |
✓ | 400 × 400 | 32.1 | 53.6 | 33.5 | 39.8 |
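A hypothetical sketch (module interfaces are assumptions) of how such a train-only auxiliary branch is wired up: it contributes a loss during training and is simply skipped at inference, which is why the FPS columns with and without the branch match.

```python
import torch.nn as nn

class DetectorWithAuxMask(nn.Module):
    def __init__(self, detector, mask_head):
        super().__init__()
        self.detector = detector
        self.mask_head = mask_head

    def forward(self, images, targets=None):
        features, outputs = self.detector(images)
        if self.training:  # auxiliary supervision during training only
            outputs["loss_mask"] = self.mask_head(features, targets)
        return outputs
```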
Table 9.
Comparing different optimization algorithms on RetinaNet. Since Adam-based optimizers typically require a lower learning rate, we execute Adam and AdamW with a learning rate of 1 × 10⁻⁴ instead of 1 × 10⁻². Adam shows a significant increase in terms of precision compared to SGD, and AdamW further increases the precision slightly.
Optimizer | Learning Rate | AP | AP50 | AP75 | APs | APm | APl |
---|---|---|---|---|---|---|---|
SGD | 1 × 10⁻² | 39.7 | 63.7 | 43.0 | 42.0 | 47.6 | 49.5 |
Adam [63] | 1 × 10⁻⁴ | 41.2 | 65.9 | 44.6 | 43.5 | 50.3 | 45.0 |
AdamW [64] | 1 × 10⁻⁴ | 41.9 | 66.5 | 45.3 | 43.8 | 53.7 | 50.0 |
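As a sketch, the three settings in PyTorch; the learning rates follow the caption’s reconstruction and all other hyperparameters are left at PyTorch defaults, which is an assumption:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # stand-in for the detector

sgd = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
adam = torch.optim.Adam(model.parameters(), lr=1e-4)
adamw = torch.optim.AdamW(model.parameters(), lr=1e-4)
```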
Table 10.
Runtimes of RetinaNet’s components. Due to technical reasons, some operations could not be assigned to a specific component. The backbone and the head are the two significant parts, sharing almost 90% of the runtime equally. The remaining share is mainly consumed by the neck.
Component | ms per Image | Share |
---|---|---|
Backbone | 28.0 | 42.98% |
Neck | 6.1 | 9.41% |
Head | 29.1 | 44.64% |
NMS | 0.9 | 1.44% |
Unassigned | 1.0 | 1.52% |
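Such per-component timings can be gathered with CUDA events; a minimal sketch (the split into backbone, neck and head is framework-internal, which is why some operations remain unassigned):

```python
import torch

def time_component(fn, *args, iters=100):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call
```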
Table 11.
Comparing RetinaNet (ResNet-50 and ResNet-18) with different head and neck configurations. The first two steps of downscaling the head and the neck lead to a significant speed-up with both backbones while only slightly impairing the precision. Further reducing the number of channels to 64 and removing all stacked convolutions in the head leads to a significant drop in accuracy. Thus, we do not consider this setting for further experiments.
Backbone | Neck Channels | Head Channels | Stacked Convs. | AP | AP50 | AP75 | APs | APm | APl | FPS |
---|---|---|---|---|---|---|---|---|---|---|
R50 | 256 | 256 | 4 | 35.3 | 58.4 | 37.2 | 37.9 | 38.8 | 21.9 | 14.8 |
R50 | 170 | 170 | 3 | 34.8 | 58.1 | 36.7 | 37.6 | 38.5 | 22.6 | 18.3 |
R50 | 128 | 128 | 2 | 34.1 | 57.6 | 35.5 | 37.0 | 35.6 | 25.8 | 22.7 |
R50 | 64 | - | 0 | 30.3 | 53.1 | 30.7 | 33.4 | 26.4 | 12.2 | 25.4 |
R18 | 256 | 256 | 4 | 31.5 | 53.9 | 32.7 | 34.2 | 35.1 | 11.9 | 20.1 |
R18 | 170 | 170 | 3 | 30.1 | 53.2 | 30.6 | 33.1 | 30.9 | 16.3 | 27.1 |
R18 | 128 | 128 | 2 | 29.0 | 51.0 | 29.3 | 32.0 | 26.2 | 18.8 | 38.2 |
R18 | 64 | - | 0 | 22.8 | 42.7 | 21.8 | 26.3 | 13.3 | 2.7 | 48.1 |
Table 12.
Selected model configuration for each of the four targeted frame rates. Channels are the number of channels in the neck and in the head. Convolutions are the number of stacked convolutions in the head. For each targeted frame rate, the most precise configuration is selected from Figure 6.
FPS | Backbone | Channels | Convolutions | Input Resolution |
---|---|---|---|---|
15 | ResNet-50 | 256 | 4 | |
30 | ResNet-50 | 170 | 3 | |
60 | ResNet-18 | 128 | 2 | |
90 | ResNet-18 | 128 | 2 | |
Table 13.
Power consumption and required VRAM of the four models in the 30 W power mode. Average power is the total power measured during inference, including loading images. Model power is the power averaged over the raw model execution only, i.e., the periods spent loading images are excluded. The required power is significantly below the configured 30 W limit since not all of the Jetson’s processing units are utilized. Moreover, the power draw decreases with a higher frame rate since the utilization of the high-power GPU decreases. The required VRAM decreases for the faster models since the feature maps have a lower resolution and the smaller models have fewer parameters.
Model | Average Power | Model Power | VRAM |
---|---|---|---|
15 FPS | 12.2 W | 15.3 W | 1777 MiB |
30 FPS | 10.3 W | 14.1 W | 1725 MiB |
60 FPS | 8.4 W | 12.7 W | 1669 MiB |
90 FPS | 7.6 W | 11.7 W | 1657 MiB |
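A sketch of how the two figures relate, assuming a power monitor that tags each sample with whether the model was executing (the tagging mechanism is hypothetical); averaging only over the execution intervals yields the higher model power:

```python
def mean(values):
    return sum(values) / len(values)

# (watts, during_model_execution) samples from a power monitor
samples = [(15.4, True), (15.2, True), (6.1, False), (15.3, True)]

average_power = mean([w for w, _ in samples])               # whole run
model_power = mean([w for w, active in samples if active])  # model only
```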
Table 14.
Comparing RetinaNet predicting horizontal bounding boxes (HBBs) and oriented bounding boxes (OBBs) in different configurations. Both bounding box types are trained and evaluated on their respective dataset type. While the AP is heavily impaired, the AP50 is only slightly reduced when predicting OBBs, since a precise estimation of the angle is not needed for objects with an aspect ratio close to 1 to achieve an IoU above 50%. Among the tested configurations, predicting the usual bounding box parameters and an angle while using the modulated loss to handle the angle’s periodicity achieves the best results.
Configuration | AP | AP50 | AP75 |
---|---|---|---|
HBB | 38.3 | 62.5 | 41.1 |
5 Par. | 26.0 | 54.2 | 20.5 |
5 Par. + Modulated Loss | 27.6 | 55.3 | 23.4 |
8 Par. | 25.4 | 52.5 | 20.5 |
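A sketch of the modulated-loss idea for the 5-parameter encoding (cx, cy, w, h, θ): the loss takes the minimum over the two equivalent encodings of the same box, which removes the 90° ambiguity of the angle. The exact loss variant used in the paper may differ.

```python
import torch

def modulated_l1(pred, target):
    # boxes encoded as (cx, cy, w, h, theta), theta in radians
    plain = (pred - target).abs().sum(dim=-1)
    alt = target[..., [0, 1, 3, 2, 4]].clone()   # swap w and h
    alt[..., 4] = target[..., 4] + torch.pi / 2  # rotate by 90 degrees
    swapped = (pred - alt).abs().sum(dim=-1)
    return torch.minimum(plain, swapped).mean()
```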
Table 15.
Evaluating the impact of parallelizing the skew IoU computation and of increasing the IoU threshold used for checking for a horizontal overlap. Without further optimizations, the NMS becomes a bottleneck when predicting OBBs and reduces the advantage of TensorRT and half-precision floating-point arithmetic. Parallelizing the skew IoU computation to increase the GPU utilization, and calculating the skew IoU only if a significant horizontal IoU is given, increases the performance to a level close to HBB prediction.
Executor | OBB IoU | HBB IoU | AP | AP50 | AP75 | FPS |
---|---|---|---|---|---|---|
TensorRT | HBB Reference | – | 38.2 | 62.3 | 41.1 | 14.8 |
MMDet | Sequential | >0% | 27.6 | 55.3 | 23.4 | 2.3 |
TensorRT | Sequential | >0% | 27.5 | 55.2 | 23.3 | 8.8 |
TensorRT | Sequential | >60% | 27.5 | 55.0 | 23.3 | 12.8 |
TensorRT | Parallel | >0% | 27.5 | 55.2 | 23.3 | 10.7 |
TensorRT | Parallel | >60% | 27.5 | 55.0 | 23.3 | 13.9 |
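A sketch of the gated overlap test inside the OBB NMS: the cheap horizontal IoU acts as a pre-filter, and the expensive skew IoU is computed only for pairs above the gate (0.6 in the last rows above). shapely is used here for clarity; the paper’s implementation runs the computation in parallel on the GPU.

```python
from shapely.geometry import Polygon

def skew_iou(corners_a, corners_b):
    # corners: list of four (x, y) vertices of an oriented box
    pa, pb = Polygon(corners_a), Polygon(corners_b)
    inter = pa.intersection(pb).area
    return inter / (pa.area + pb.area - inter + 1e-9)

def gated_iou(corners_a, corners_b, hbb_iou, gate=0.6):
    # compute the expensive skew IoU only when the cheap HBB IoU is high
    return skew_iou(corners_a, corners_b) if hbb_iou > gate else 0.0
```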