DTA-Head: Dynamic Task Alignment Head for Regression and Classification in Small Object Detection
Abstract
1. Introduction
- Dynamic Task Alignment Head (DTA-Head): This detection head dynamically aligns the classification and regression branches by adjusting predicted bounding boxes whose two branch outputs disagree substantially. Enforcing this cross-task consistency makes the model's predictions more reliable, which is critical for small object detection. A minimal sketch of the alignment idea is given after this list.
- Diverse-Scale Channel-Specific Convolution (DSCSC): This convolution reduces the model's overall parameter count while enabling more efficient information exchange among feature channels. The richer fused features that result are advantageous for detecting small targets. A hedged sketch of one possible realization follows the DTA-Head sketch below.
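To make the alignment idea concrete, the sketch below is a minimal illustration, not the paper's implementation: it uses a TOOD-style alignment metric t = s^α · IoU^β and flags predictions whose classification confidence and localization quality disagree. The function names, the threshold, and the metric itself are assumptions introduced only for illustration.

```python
# Minimal sketch of the task-alignment idea behind a DTA-style head.
# Assumption: the paper's exact formulation differs; this only illustrates
# how classification/regression disagreement can be measured and flagged.
import torch

def alignment_metric(cls_score, iou, alpha=1.0, beta=6.0):
    """TOOD-style joint score of classification confidence and box quality."""
    return cls_score.pow(alpha) * iou.pow(beta)

def realign_boxes(cls_score, iou, disagreement_thr=0.4):
    """Flag predictions whose classification and regression quality disagree.

    cls_score: (N,) classification confidence in [0, 1]
    iou:       (N,) IoU of each predicted box with its assigned ground truth

    Returns a boolean mask of predictions that an alignment-aware head would
    adjust (e.g. by re-weighting their loss or refining their box offsets).
    """
    disagreement = (cls_score - iou).abs()   # large gap = misaligned tasks
    t = alignment_metric(cls_score, iou)     # joint task-alignment score
    return (disagreement > disagreement_thr) & (t < t.mean())

# Toy usage with random scores and IoUs.
cls_score = torch.rand(8)
iou = torch.rand(8)
print(realign_boxes(cls_score, iou))
```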
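The next sketch shows one plausible low-parameter realization of a diverse-scale, channel-specific convolution, assuming depthwise branches with kernel sizes {1, 3, 5, 7} (matching the kernel-size ablation in Section 4.4) followed by a pointwise fusion for cross-channel exchange. The class name and exact structure are illustrative assumptions, not the published module.

```python
# Hedged sketch of a diverse-scale channel-specific convolution block.
# Assumption: the actual DSCSC design may differ; this shows one cheap way to
# mix several kernel scales while keeping each spatial convolution per-channel.
import torch
import torch.nn as nn

class DSCSCSketch(nn.Module):
    def __init__(self, channels, kernel_sizes=(1, 3, 5, 7)):
        super().__init__()
        assert channels % len(kernel_sizes) == 0
        group = channels // len(kernel_sizes)
        # One depthwise (channel-specific) conv per channel group, each branch
        # with its own kernel scale.
        self.branches = nn.ModuleList(
            [nn.Conv2d(group, group, k, padding=k // 2, groups=group)
             for k in kernel_sizes]
        )
        # Pointwise conv to exchange information across all channels after the split.
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        chunks = torch.chunk(x, len(self.branches), dim=1)  # split channels by scale
        out = torch.cat([b(c) for b, c in zip(self.branches, chunks)], dim=1)
        return self.fuse(out)

# Toy usage: a 64-channel feature map keeps its shape.
x = torch.randn(1, 64, 40, 40)
print(DSCSCSketch(64)(x).shape)  # torch.Size([1, 64, 40, 40])
```

Because the spatial convolutions are depthwise and only the 1×1 fusion mixes channels, this kind of block keeps the parameter count well below a dense multi-scale convolution of the same width.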
2. Related Work
2.1. Multi-Scale Learning
2.2. Anchor-Free Mechanism
2.3. Attention Mechanism
3. Methodology
3.1. Dynamic Task Alignment Head
3.2. Diverse-Scale Channel-Specific Convolution
4. Experiment
4.1. TinyPerson Dataset Object Detection
4.2. Evaluation Metrics and Implementation Details
4.3. Experimental Results
4.4. Supplementary Experiment
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. CenterNet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6569–6578.
- Yu, X.; Gong, Y.; Jiang, N.; Ye, Q.; Han, Z. Scale match for tiny person detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass, CO, USA, 1–5 March 2020; pp. 1257–1265.
- Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv 2016, arXiv:1506.01497.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
- Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object detection via region-based fully convolutional networks. arXiv 2016, arXiv:1605.06409.
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37.
- Cai, Z.; Fan, Q.; Feris, R.S.; Vasconcelos, N. A unified multi-scale deep convolutional neural network for fast object detection. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part IV; Springer: Berlin/Heidelberg, Germany, 2016; pp. 354–370.
- Bell, S.; Zitnick, C.L.; Bala, K.; Girshick, R. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2874–2883.
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
- Liang, Z.; Shao, J.; Zhang, D.; Gao, L. Small object detection using deep feature pyramid networks. In Proceedings of the Advances in Multimedia Information Processing–PCM 2018: 19th Pacific-Rim Conference on Multimedia, Hefei, China, 21–22 September 2018; Proceedings, Part III; Springer: Berlin/Heidelberg, Germany, 2018; pp. 554–564.
- Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: A simple and strong anchor-free object detector. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 1922–1933.
- Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9759–9768.
- Kong, T.; Sun, F.; Liu, H.; Jiang, Y.; Li, L.; Shi, J. FoveaBox: Beyound anchor-based object detection. IEEE Trans. Image Process. 2020, 29, 7389–7398.
- Zhu, C.; He, Y.; Savvides, M. Feature selective anchor-free module for single-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 840–849.
- Zhu, C.; Chen, F.; Shen, Z.; Savvides, M. Soft anchor-point object detection. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part IX; Springer: Berlin/Heidelberg, Germany, 2020; pp. 91–107.
- Zand, M.; Etemad, A.; Greenspan, M. ObjectBox: From centers to boxes for anchor-free object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 390–406.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
- Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542.
- Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722.
- Cai, X.; Lai, Q.; Wang, Y.; Wang, W.; Sun, Z.; Yao, Y. Poly kernel inception network for remote sensing detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 27706–27716.
- Wang, C.; He, W.; Nie, Y.; Guo, J.; Liu, C.; Wang, Y.; Han, K. Gold-YOLO: Efficient object detector via gather-and-distribute mechanism. arXiv 2023, arXiv:2309.11331.
- Dai, X.; Chen, Y.; Xiao, B.; Chen, D.; Liu, M.; Yuan, L.; Zhang, L. Dynamic head: Unifying object detection heads with attentions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7373–7382.
- Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable ConvNets v2: More deformable, better results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9308–9316.
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159.
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
- Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589.
- Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976.
- Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616.
- Ma, X.; Dai, X.; Bai, Y.; Wang, Y.; Fu, Y. Rewrite the Stars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–24 June 2024; pp. 5694–5703.
- Fan, Q.; Huang, H.; Chen, M.; Liu, H.; He, R. RMT: Retentive networks meet vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–24 June 2024; pp. 5641–5651.
- Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–24 June 2024; pp. 16965–16974.
- Chen, L.; Gu, L.; Zheng, D.; Fu, Y. Frequency-Adaptive Dilated Convolution for Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–24 June 2024; pp. 3414–3425.
- Wang, A.; Chen, H.; Lin, Z.; Han, J.; Ding, G. RepViT: Revisiting mobile CNN from ViT perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–24 June 2024; pp. 15909–15920.
- Han, K.; Wang, Y.; Guo, J.; Wu, E. ParameterNet: Parameters are all you need for large-scale visual pretraining of mobile networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–24 June 2024; pp. 15751–15761.
- Zheng, M.; Sun, L.; Dong, J.; Pan, J. SMFANet: A lightweight self-modulation feature aggregation network for efficient image super-resolution. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 359–375.
- Ding, X.; Zhang, X.; Han, J.; Ding, G. Scaling up your kernels to 31×31: Revisiting large kernel design in CNNs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11963–11975.
- Draelos, R.L.; Carin, L. Use HiResCAM instead of Grad-CAM for faithful explanations of convolutional neural networks. arXiv 2020, arXiv:2011.08891.
Method | AP@.5 | AP@[.5,.95] | Params (M) | FLOPs (G) |
---|---|---|---|---|
Yolov5-N | 23.8 | 7.42 | 2.50 | 7.1 |
Yolov6-N [29] | 22.3 | 6.85 | 4.23 | 11.8 |
Yolov8-N | 24.1 | 7.39 | 3.00 | 8.1 |
Yolov9-T [30] | 21.9 | 6.80 | 1.97 | 7.6 |
Yolov8+DTA-N | 26.1 | 8.06 | 2.24 | 8.6 |
Yolov8+DTA+DSCSC-N | 26.4 | 8.17 | 2.14 | 8.4 |
Yolov5-S | 26.5 | 8.13 | 9.11 | 23.8 |
Yolov6-S [29] | 25.1 | 8.12 | 16.30 | 44.0 |
Yolov8-S | 26.7 | 8.59 | 11.13 | 28.4 |
Yolov9-S [30] | 26.8 | 8.55 | 7.17 | 26.7 |
Yolov8+DTA-S | 29.0 | 9.16 | 8.88 | 33.0 |
Yolov8+DTA+DSCSC-S | 29.9 | 9.48 | 8.47 | 32.2 |
Yolov5-M | 27.7 | 8.97 | 25.05 | 64.0 |
Yolov6-M [29] | 19.8 | 6.12 | 51.98 | 161.1 |
Yolov8-M | 27.6 | 9.05 | 25.84 | 78.7 |
Yolov9-M [30] | 27.3 | 8.71 | 20.16 | 77.0 |
Yolov8+DTA-M | 30.7 | 9.45 | 23.15 | 98.3 |
Yolov8+DTA+DSCSC-M | 30.9 | 9.68 | 21.68 | 95.2 |
Method | Reference | Input Size | AP@.5 | AP@[.5,.95] | Params (M) | FLOPs (G) |
---|---|---|---|---|---|---|
DynamicHead-N [24] | CVPR 2021 | 640 | 23.2 | 7.40 | 3.49 | 9.6 |
Gold-YOLO-N [23] | NeurIPS 2023 | 640 | 24.2 | 7.38 | 5.98 | 10.2 |
PKINet-N [22] | CVPR 2024 | 640 | 23.1 | 7.42 | 6.06 | 25.1 |
StarNet-N [31] | CVPR 2024 | 640 | 20.7 | 6.70 | 2.21 | 6.5 |
RMT-N [32] | CVPR 2024 | 640 | 24.3 | 7.96 | 14.83 | 43.2 |
HGNetV2-N [33] | CVPR 2024 | 640 | 23.6 | 7.33 | 2.35 | 6.9 |
FADC-N [34] | CVPR 2024 | 640 | 25.2 | 7.86 | 3.02 | 8.0 |
RepViT-N [35] | CVPR 2024 | 640 | 23.1 | 7.26 | 2.28 | 6.3 |
ParameterNet-N [36] | CVPR 2024 | 640 | 24.2 | 7.64 | 4.43 | 6.9 |
SMFANet-N [37] | ECCV 2024 | 640 | 23.2 | 7.19 | 2.67 | 7.3 |
DTA+DSCSC-N (Ours) | - | 640 | 26.4 | 8.17 | 2.14 | 8.4 |
DynamicHead-S [24] | CVPR 2021 | 640 | 25.9 | 8.58 | 10.85 | 28.1 |
Gold-YOLO-S [23] | NeurIPS 2023 | 640 | 27.1 | 8.68 | 13.61 | 29.9 |
PKINet-S [22] | CVPR 2024 | 640 | 23.4 | 7.49 | 10.46 | 36.1 |
StarNet-S [31] | CVPR 2024 | 640 | 22.4 | 7.31 | 6.54 | 17.3 |
RMT-S [32] | CVPR 2024 | 640 | 25.3 | 8.30 | 19.39 | 54.4 |
HGNetV2-S [33] | CVPR 2024 | 640 | 25.5 | 8.18 | 8.47 | 23.3 |
FADC-S [34] | CVPR 2024 | 640 | 26.6 | 8.52 | 11.16 | 28.0 |
RepViT-S [35] | CVPR 2024 | 640 | 26.4 | 8.31 | 8.24 | 21.5 |
ParameterNet-S [36] | CVPR 2024 | 640 | 26.8 | 8.69 | 16.80 | 23.7 |
SMFANet-S [37] | ECCV 2024 | 640 | 26.8 | 8.64 | 9.59 | 25.0 |
DTA+DSCSC-S (Ours) | - | 640 | 29.9 | 9.48 | 8.47 | 32.2 |
DynamicHead-M [24] | CVPR 2021 | 640 | 25.9 | 8.40 | 24.71 | 75.2 |
Gold-YOLO-M [23] | NeurIPS 2023 | 640 | 28.1 | 9.14 | 26.69 | 76.7 |
PKINet-M [22] | CVPR 2024 | 640 | 23.9 | 7.64 | 18.36 | 59.9 |
StarNet-M [31] | CVPR 2024 | 640 | 23.2 | 7.33 | 14.41 | 41.1 |
RMT-M [32] | CVPR 2024 | 640 | 26.4 | 8.50 | 405.89 | 78.4 |
HGNetV2-M [33] | CVPR 2024 | 640 | 26.2 | 8.64 | 18.39 | 57.9 |
FADC-M [34] | CVPR 2024 | 640 | 28.0 | 9.09 | 25.92 | 77.6 |
RepViT-M [35] | CVPR 2024 | 640 | 27.8 | 8.96 | 16.34 | 49.3 |
ParameterNet-M [36] | CVPR 2024 | 640 | 28.3 | 9.08 | 44.40 | 59.4 |
SMFANet-M [37] | ECCV 2024 | 640 | 28.2 | 9.19 | 21.38 | 65.6 |
DTA+DSCSC-M (Ours) | - | 640 | 30.9 | 9.68 | 21.68 | 95.2 |
Method | t = 20% | t = 30% | t = 50% | t = 99% |
---|---|---|---|---|
DynamicHead-M [24] | 3.8% | 6.2% | 12.4% | 61.3% |
Gold-YOLO-M [23] | 5.3% | 8.7% | 17.4% | 88.2% |
PKINet-M [22] | 2.1% | 4.4% | 12.4% | 89.4% |
StarNet-M [31] | 0.8% | 2.2% | 8.5% | 88.8% |
RMT-M [32] | 1.1% | 2.6% | 9.1% | 77.9% |
HGNetV2-M [33] | 3.5% | 6.2% | 13.7% | 84.7% |
FADC-M [34] | 4.4% | 7.1% | 14.2% | 86.4% |
RepViT-M [35] | 4.3% | 7.0% | 14.2% | 73.0% |
ParameterNet-M [36] | 4.9% | 7.8% | 15.6% | 87.6% |
SMFANet-M [37] | 4.9% | 7.8% | 15.9% | 83.6% |
Yolov8-M | 5.0% | 8.2% | 16.4% | 84.7% |
DTA+DSCSC-M (Ours) | 6.6% | 10.9% | 21.0% | 92.3% |
K (kernel sizes) | AP@.5 | AP@[.5,.95] | Params (M) | FLOPs (G) |
---|---|---|---|---|
[1, 3, 3, 5] | 29.4 | 9.58 | 20.35 | 92.3 |
[3, 5, 5, 7] | 29.6 | 9.56 | 22.49 | 96.9 |
[5, 7, 7, 9] | 30.5 | 9.62 | 25.69 | 103.7 |
[3, 5, 7, 9] | 29.7 | 9.49 | 24.36 | 100.8 |
[1, 3, 5, 7] | 30.9 | 9.68 | 21.68 | 95.2 |