FSD-YOLO: A Fusion Framework for Region Segmentation and Deformable Object Detection in Container Yards
Abstract
1. Introduction
- Difficulty in small object detection: Distant personnel often occupy only a tiny fraction of the image, and insufficient shallow feature representation leads to low recall rates for small targets.
- Inadequate modeling of deformable objects: Targets such as cranes exhibit significant geometric deformations across different operational states, which are difficult to capture using conventional convolutions with fixed receptive fields, resulting in inaccurate localization.
- Limited effectiveness of feature fusion: Traditional bilinear interpolation–based upsampling tends to lose semantic information during multi-scale feature fusion, thereby degrading detection accuracy.
- Lack of regional semantic awareness: Single-task detection models are unable to determine the safety attributes of the regions where targets are located, making it difficult to support semantic rule–based intelligent safety warning.
- A dual-branch architecture that integrates semantic segmentation and object detection is designed. A SegFormer-based module is employed for pixel-level region segmentation, and decision-level fusion is applied to enable region-aware intelligent safety warning, effectively addressing the lack of regional semantic information.
- A C2f-shallow module is introduced into the shallow layers of the YOLOv8n backbone to fully exploit fine-grained details in high-resolution features, thereby enhancing small object detection performance.
- A C2fDCN module is proposed by embedding deformable convolutions into the detection head. By learning adaptive sampling offsets, the receptive field shape is dynamically adjusted, improving the modeling capability for deformable targets such as cranes.
- The CARAFE content-aware upsampling operator is adopted to replace conventional bilinear interpolation, optimizing the multi-scale feature fusion process through adaptive kernel prediction.
- A dynamic loss-weighting mechanism for small objects is designed, in which bounding box loss weights are adaptively adjusted according to target area, reinforcing the model's training focus on small targets (a minimal sketch follows this list).
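As a rough illustration of this weighting idea, the sketch below (PyTorch) scales per-box regression loss weights inversely with ground-truth area. The 32×32-pixel cutoff, the power-law profile, and the maximum weight are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def small_object_box_weights(boxes_xyxy: torch.Tensor,
                             small_thresh: float = 32 * 32,
                             gamma: float = 1.5,
                             w_max: float = 2.0) -> torch.Tensor:
    """Per-box loss weights that grow as the ground-truth area shrinks.

    boxes_xyxy: (N, 4) boxes in pixels. Returns (N,) weights in [1, w_max];
    boxes at or above small_thresh pixels keep weight 1.0. The COCO-style
    32x32 cutoff and power-law shape are assumptions for this sketch.
    """
    wh = (boxes_xyxy[:, 2:] - boxes_xyxy[:, :2]).clamp(min=0)
    area = wh[:, 0] * wh[:, 1]
    ratio = (area / small_thresh).clamp(max=1.0)  # 1.0 for non-small boxes
    return 1.0 + (w_max - 1.0) * (1.0 - ratio) ** gamma

# Usage: scale each box's regression loss before reduction, e.g.
#   box_loss = (ciou_per_box * small_object_box_weights(gt_boxes)).mean()
```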
2. Related Work
3. Proposed Method
3.1. Overall System Architecture
3.1.1. Input Layer
3.1.2. Semantic Segmentation Branch
3.1.3. Object Detection Branch
- Backbone. The backbone extracts multi-scale features from P1/2 to P5/32. Unlike the original YOLOv8 architecture, a C2f-shallow module is introduced at the P1/2 stage to enhance high-resolution feature representation, thereby improving the modeling of distant small targets. The remaining stages adopt standard convolutional downsampling and C2f blocks, and an SPPF module is employed at the end of the backbone to enlarge the receptive field.
- Neck. The neck adopts a bidirectional feature fusion structure combining FPN and PAN. In the top-down pathway, CARAFE content-aware upsampling replaces conventional bilinear interpolation, alleviating the semantic information loss incurred during upsampling and improving multi-scale fusion quality (a sketch of the operator appears after this list). In addition, at the P2 and P3 fusion stages, standard C2f blocks are replaced with C2fDCN blocks that embed deformable convolution, enabling the receptive field to adapt to deformable targets.
- Detection Head. A four-scale detection head is employed, operating on feature maps P2/4, P3/8, P4/16, and P5/32. Compared with the original three-scale head in YOLOv8, the additional P2 layer significantly enhances small-object detection, making the head particularly suitable for recognizing distant personnel and local operational targets in container yards.
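To make the neck's upsampling step concrete, below is a minimal CARAFE sketch in PyTorch: a lightweight encoder predicts a normalized reassembly kernel for every output pixel, which then weights the corresponding k×k input neighborhood. The compressed channel width and kernel sizes follow the common defaults of Wang et al. and are assumptions here, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CARAFE(nn.Module):
    """Content-aware upsampling (Wang et al., ICCV 2019); sizes are defaults."""
    def __init__(self, c: int, scale: int = 2, k_up: int = 5,
                 k_enc: int = 3, c_mid: int = 64):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        self.comp = nn.Conv2d(c, c_mid, 1)  # channel compressor
        # Predicts (scale*k_up)^2 values per location -> one k_up*k_up kernel
        # for each of the scale^2 output pixels spawned by a source pixel.
        self.enc = nn.Conv2d(c_mid, (scale * k_up) ** 2, k_enc,
                             padding=k_enc // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        s, k = self.scale, self.k_up
        # Per-output-pixel kernels, normalized with softmax over the k*k taps.
        kernels = F.pixel_shuffle(self.enc(self.comp(x)), s)   # (b, k*k, sh, sw)
        kernels = F.softmax(kernels, dim=1)
        # k*k neighborhood of every source pixel, replicated to output size.
        feat = F.unfold(x, k, padding=k // 2).view(b, c, k * k, h, w)
        feat = feat.repeat_interleave(s, dim=3).repeat_interleave(s, dim=4)
        # Content-aware reassembly: weighted sum over each neighborhood.
        return (feat * kernels.unsqueeze(1)).sum(dim=2)        # (b, c, sh, sw)

# Example: CARAFE(256)(torch.randn(1, 256, 40, 40)).shape == (1, 256, 80, 80)
```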
3.1.4. Region-Based Safety Assessment Module
3.2. Improved Object Detection Model
3.2.1. C2f-Shallow Module
3.2.2. C2fDCN Module
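This outline omits the module's internals; as a hedged sketch of the deformable-convolution idea named in the contributions, the block below predicts 2D sampling offsets from the input feature map and feeds them to torchvision's DeformConv2d. Its exact placement inside the C2f structure is the paper's design and is not reproduced here.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformBottleneck(nn.Module):
    """Sketch of a C2fDCN-style bottleneck: a 3x3 deformable convolution
    whose sampling offsets are learned from the input feature map."""
    def __init__(self, c: int):
        super().__init__()
        # 2 offsets (dx, dy) per kernel tap: 2 * 3 * 3 = 18 channels.
        self.offset = nn.Conv2d(c, 18, 3, padding=1)
        self.dcn = DeformConv2d(c, c, 3, padding=1)
        self.bn = nn.BatchNorm2d(c)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection mirrors the standard C2f bottleneck.
        return x + self.act(self.bn(self.dcn(x, self.offset(x))))
```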
3.2.3. CARAFE Upsampling
3.2.4. Dynamic Loss Weighting for Small Objects
3.3. Semantic Segmentation Model
3.3.1. Model Architecture
3.3.2. Segmentation Task Definition
3.3.3. Training Configuration
3.4. Decision Fusion Module
3.4.1. Region Assignment Algorithm
3.4.2. Safety Warning Rules
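Only the headings of the fusion module survive in this outline. As an illustrative sketch consistent with the decision-level fusion described in the contributions, the snippet below assigns each detection a region label by majority vote over the segmentation mask under the box footprint and applies placeholder warning rules; the region taxonomy, the foot-region heuristic, and the rule table are all assumptions, not the paper's actual rules.

```python
import numpy as np

# Illustrative region labels; the actual class map comes from the
# SegFormer branch and may differ.
SAFE, OPERATION, HAZARD = 0, 1, 2
# Placeholder rule table: (detected class, region) -> action.
WARN_RULES = {("person", HAZARD): "alarm", ("person", OPERATION): "caution"}

def assign_region(mask: np.ndarray, box_xyxy) -> int:
    """Majority-vote region label under the lower half of a detection box
    (a common 'feet position' heuristic for person boxes, assumed here)."""
    x1, y1, x2, y2 = (max(0, int(v)) for v in box_xyxy)
    foot = mask[(y1 + y2) // 2 : y2, x1:x2]
    if foot.size == 0:
        return SAFE
    return int(np.bincount(foot.ravel().astype(np.int64)).argmax())

def safety_decision(mask: np.ndarray, detections) -> list:
    """detections: iterable of (class_name, box_xyxy, score) tuples."""
    events = []
    for cls, box, score in detections:
        action = WARN_RULES.get((cls, assign_region(mask, box)))
        if action:
            events.append({"class": cls, "score": float(score),
                           "action": action})
    return events
```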
4. Experimental Results and Analysis
4.1. Dataset and Evaluation Metrics
4.1.1. Dataset
4.1.2. Evaluation Metrics
4.2. Experimental Settings
4.3. Experimental Analysis
4.3.1. Training Process Analysis
4.3.2. Error Analysis
4.3.3. Real-Time Performance Analysis
4.4. Comparison with Advanced Methods
4.5. Ablation Study
4.6. Training and Results of the Semantic Segmentation Model
4.7. Performance of the Fusion System
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Correction Statement
References
- Kaidabettu, C.D.; Lange, A.K.; Jahn, C. Gantry crane scheduling and storage techniques in rail-road terminals. In Adapting to the Future: Maritime and City Logistics in the Context of Digitalization and Sustainability; Proceedings of the Hamburg International Conference of Logistics (HICL); epubli GmbH: Berlin, Germany, 2021; Volume 32, pp. 457–492.
- Wu, F.; Hu, M.; Xie, F.; Bu, W.; Zhang, Z. A multimodal sensor fusion and dynamic prediction-based personnel intrusion detection system for crane operations. Processes 2025, 13, 4017.
- Fabiano, B.; Currò, F.; Reverberi, A.P.; Pastorino, P. Port safety and the container revolution: A statistical study on human factor and occupational accidents. Saf. Sci. 2010, 48, 980–990.
- Ning, S.; Ding, F.; Chen, B. Research on the method of foreign object detection for railway tracks based on deep learning. Sensors 2024, 24, 4483.
- Lee, J.; Lee, S. Construction site safety management: A computer vision and deep learning approach. Sensors 2023, 23, 944.
- Liu, L.; Guo, Z.; Liu, Z.; Zhang, Y.; Cai, R.; Hu, X.; Yang, R.; Wang, G. Multi-task intelligent monitoring of construction safety based on computer vision. Buildings 2024, 14, 2429.
- Cuong, T.N.; You, S.S.; Cho, G.S.; Choi, B.; Kim, H.S.; Vinh, N.Q.; Yeon, J.H. Safe operations of a reach stacker by computer vision in an automated container terminal. Alex. Eng. J. 2024, 109, 285–298.
- An, T.T.; You, S.S.; Bao Long, L.N.; Tan, N.D.; Kim, H.S. Robust visual-based tracking using deep learning with image enhancement for reach stackers in container terminals. Alex. Eng. J. 2025, 132, 1–26.
- Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2961–2969.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37.
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988.
- Xu, X.; Chen, X.; Wu, B.; Wang, Z.; Zhen, J. Exploiting high-fidelity kinematic information from port surveillance videos via a YOLO-based framework. Ocean Coast. Manag. 2022, 222, 106117.
- Chen, X.; Ma, Q.; Wu, H.; Shang, W.; Han, B.; Biancardo, S.A. Autonomous port traffic safety orientated vehicle kinematic information exploitation via port-like videos. Transp. Saf. Environ. 2025, 7, tdaf048.
- Kim, H.; Kim, T.; Jo, W.; Kim, J.; Shin, J.; Han, D.; Choi, Y. Multispectral benchmark dataset and baseline for forklift collision avoidance. Sensors 2022, 22, 7953.
- Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8 [Software]. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 15 October 2025).
- Wang, X.; Li, K.; Fan, F.; Wu, Y.; Zhang, Y. Optimized YOLOv8s framework with deformable convolution for underwater object detection. Sci. Rep. 2025, 15, 45446.
- Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 764–773.
- Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. CARAFE: Content-aware reassembly of features. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3007–3016.
- Feng, Q.; Xu, X.; Wang, Z. Deep learning-based small object detection: A survey. Math. Biosci. Eng. 2023, 20, 6551–6590.
- Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090.
- Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 1290–1299.
- Tran, S.V.; Lee, D.; Bao, Q.L.; Yoo, T.; Khan, M.; Jo, J.; Park, C. A human detection approach for intrusion in hazardous areas using 4D-BIM-based spatial-temporal analysis and computer vision. Buildings 2023, 13, 2313.
- Hou, X.; Li, C.; Fang, Q. Computer vision-based safety risk computing and visualization on construction sites. Autom. Constr. 2023, 156, 105129.
- Ning, S.; Ding, F.; Chen, B.; Huang, Y. Railway intrusion risk quantification with track semantic segmentation and spatiotemporal features. Sensors 2025, 25, 5266.
- Zhang, Z.; Chen, P.; Huang, Y.; Dai, L.; Xu, F.; Hu, H. Railway obstacle intrusion warning mechanism integrating YOLO-based detection and risk assessment. J. Ind. Inf. Integr. 2024, 38, 100571.
- Ouadou, A.; Huangal, D.; Alshehri, M.; Scott, G.; Hurt, J.A. Semantic segmentation of burned areas in Sentinel-2 satellite imagery using deep learning transformer and convolutional attention networks. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 17728–17739.
- Zhang, X.; Song, Y.; Song, T.; Yang, D.; Ye, Y.; Zhou, J.; Zhang, L. LDConv: Linear deformable convolution for improving convolutional neural networks. Image Vis. Comput. 2024, 149, 105190.
- Liu, H.; Zhang, C.; Yao, Y.; Wei, X.-S.; Shen, F.; Tang, Z.; Zhang, J. Exploiting web images for fine-grained visual recognition via dynamic loss correction and global sample selection. IEEE Trans. Multimed. 2022, 24, 1105–1115.
- Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101.
- Kendall, A.; Gal, Y.; Cipolla, R. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7482–7491.
- Fu, R.; Hu, Q.; Dong, X.; Guo, Y.; Gao, Y.; Li, B. Axiom-based Grad-CAM: Towards accurate visualization and explanation of CNNs. arXiv 2020, arXiv:2008.02312.
Table: Real-time performance of the detection branch, the segmentation branch, and the end-to-end fusion framework.

| Component | Avg Latency (ms) | P50 (ms) | P95 (ms) | FPS | Peak VRAM (MB) |
|---|---|---|---|---|---|
| Detection branch | 7.91 | 7.91 | 7.97 | 126.44 | 492.1 |
| Segmentation branch | 21.64 | 21.64 | 22.39 | 46.21 | 411.4 |
| End-to-end fusion framework | 31.36 | 31.34 | 31.81 | 31.89 | 483.7 |
Table: Comparison with advanced detection methods.

| Model | Parameters (M) | GFLOPs | mAP50-95 | mAP50 | P | R | F1 | Avg Latency (ms) | FPS |
|---|---|---|---|---|---|---|---|---|---|
| YOLOv8n | 3.01 | 4.1 | 0.5394 | 0.8435 | 0.9327 | 0.7775 | 0.8480 | 4.46 | 224.34 |
| YOLOv8s | 11.14 | 14.3 | 0.5664 | 0.8691 | 0.9230 | 0.8197 | 0.8683 | 4.80 | 208.46 |
| YOLOv8m | 25.86 | 39.5 | 0.5626 | 0.8742 | 0.9600 | 0.7731 | 0.8565 | 6.89 | 145.06 |
| YOLOv8l | 43.63 | 82.7 | 0.5716 | 0.8917 | 0.9014 | 0.8336 | 0.8662 | 9.02 | 110.87 |
| YOLOv8x | 68.16 | 129.1 | 0.5640 | 0.8737 | 0.9205 | 0.8088 | 0.8610 | 11.53 | 86.76 |
| YOLOv11n | 2.59 | 3.2 | 0.5154 | 0.8184 | 0.9276 | 0.7326 | 0.8186 | 5.18 | 193.12 |
| YOLOv11s | 9.43 | 10.8 | 0.5431 | 0.8705 | 0.9191 | 0.8046 | 0.8580 | 5.26 | 190.21 |
| YOLOv11m | 20.06 | 34.1 | 0.5643 | 0.8559 | 0.8965 | 0.7988 | 0.8448 | 6.48 | 154.38 |
| Fast R-CNN R50 | 41.7 | 828.0 | 0.5771 | 0.9335 | 0.9335 | 0.6369 | 0.7572 | 27.47 | 36.41 |
| Mask R-CNN R50 | 44.2 | 1040.0 | 0.5781 | 0.9340 | 0.9340 | 0.6405 | 0.7599 | 27.82 | 35.94 |
| Faster R-CNN R50 | 41.7 | 828.0 | 0.5777 | 0.9434 | 0.9434 | 0.6412 | 0.7635 | 27.93 | 35.81 |
| FSD-YOLO (Detection) | 7.59 | 177.2 | 0.6433 | 0.9565 | 0.9534 | 0.9267 | 0.9399 | 7.91 | 126.44 |
| FSD-YOLO-Lite (Detection) | 3.01 | 58.5 | 0.6269 | 0.9480 | 0.9260 | 0.9083 | 0.9170 | 5.99 | 167.01 |
Table: Ablation study of the proposed modules on the YOLOv8n baseline (✓ = module enabled; DLW = dynamic loss weighting for small objects).

| Method | C2f-shallow | C2fDCN | CARAFE | DLW | mAP50 | mAP50-95 | Precision | Recall |
|---|---|---|---|---|---|---|---|---|
| Baseline (M0) | | | | | 0.8435 | 0.5394 | 0.9327 | 0.7775 |
| +C2f-shallow (M1) | ✓ | | | | 0.9424 | 0.6285 | 0.8891 | 0.9262 |
| +C2fDCN (M2) | | ✓ | | | 0.9343 | 0.6158 | 0.8919 | 0.9105 |
| +CARAFE (M3) | | | ✓ | | 0.9336 | 0.6258 | 0.9438 | 0.8893 |
| +DLW (M4) | | | | ✓ | 0.9497 | 0.6274 | 0.9283 | 0.9182 |
| +C2fDCN+CARAFE | | ✓ | ✓ | | 0.9305 | 0.6136 | 0.9194 | 0.9160 |
| +C2fDCN+DLW | | ✓ | | ✓ | 0.9348 | 0.6158 | 0.8954 | 0.8993 |
| +CARAFE+DLW | | | ✓ | ✓ | 0.9407 | 0.6143 | 0.9126 | 0.9197 |
| FSD-YOLO (M5) | ✓ | ✓ | ✓ | ✓ | 0.9565 | 0.6433 | 0.9534 | 0.9267 |
| FSD-YOLO-Lite | ✓ | ✓ | ✓ | ✓ | 0.9480 | 0.6269 | 0.9260 | 0.9083 |
Share and Cite
Dai, L.; Liang, Z.; Feng, Q.; Xie, S.; Li, H. FSD-YOLO: A Fusion Framework for Region Segmentation and Deformable Object Detection in Container Yards. Sensors 2026, 26, 2029. https://doi.org/10.3390/s26072029

