Fs2PA: A Full-Scale Feature Synergistic Perception Architecture for Vehicular Infrared Object Detection via Physical Priors and Semantic Constraints
Highlights
- A Gradient-Informed Attention (GIA) module is proposed to explicitly inject physical geometric priors, successfully breaking the CNN texture bias and overcoming target boundary blurring in thermal crossover scenarios.
- A Full-Scale Feature Synergistic Perception Architecture (Fs2PA) is constructed, utilizing a high-resolution layer and a cross-scale shared detection head to simultaneously prevent feature erosion of tiny objects and suppress background noise.
- The architecture achieves a state-of-the-art balance between accuracy and efficiency (64.06% mAP@50 and 547 FPS on FLIR v2), proving highly viable for real-time edge deployment in autonomous vehicles.
- It provides a generalizable design paradigm for bridging physical thermal radiation features and deep learning semantic constraints, substantially enhancing all-weather perception robustness.
Abstract
1. Introduction
- A physics-guided feature extraction method: To address the problem of blurred features caused by the lack of texture in infrared images, the GIA module is proposed. Through a normalized gradient initialization strategy, explicit geometric priors are injected into the deep network, breaking the “texture bias” of traditional CNN models and significantly enhancing the model’s boundary perception capability in thermal crossover scenarios.
- A Full-Scale feature pyramid construction method oriented towards vehicular scenarios: To address the “feature erosion” of long-distance tiny objects under the vehicular perspective, a Full-Scale feature architecture that includes a high-resolution layer is constructed. It counteracts the loss of feature information caused by deep downsampling, so that the features of long-distance tiny thermal-source objects are preserved in the feature maps.
- A synergistic lightweight detection head design method: The SAS-Head is proposed to solve the problems of computational redundancy and background noise introduced by the P2 layer. This module forms a “Mutual Redemption” relationship with the P2 layer: by leveraging a cross-scale parameter sharing mechanism, it not only compresses the head parameters but also forces deep semantics to exert strong constraints on shallow features, effectively suppressing false alarms.
- State-of-the-art vehicular infrared object detection performance: Extensive comparative experiments were conducted on two mainstream infrared object detection datasets, FLIR v2 and M3FD. The results show that the proposed architecture reaches 64.06% mAP@50 on FLIR v2 while maintaining an inference speed of 547 FPS (measured at batch size 16 with FP16), supporting the effectiveness of the synergistic design of physical perception and semantic constraints.
2. Related Works
2.1. Infrared Object Detection and Physical Feature Enhancement
2.2. Small Object Detection and Multi-Scale Architectures
2.3. Lightweight Detection Head Design
3. Materials and Methods
3.1. Overall Architecture
- 1. Physics-Guided Backbone: To overcome the lack of color and texture information in infrared images, we do not directly adopt the native C3k2 modules of YOLOv11. Instead, we embed GIA modules within the backbone network. This module utilizes explicit physical gradient operators to initialize convolution kernels, forcing the network to shift its attention from flat thermal radiation regions to the geometric boundaries of the objects. This design injects a “geometric prior” at the early stages of feature extraction, effectively countering the thermal blurring and low contrast inherent to infrared imagery, and providing the subsequent feature pyramid with structurally rich shallow-layer features.
- 2. Full-Scale Feature Pyramid: Given the significant “micro-scale shift” phenomenon in vehicular scenarios, standard three-scale (P3–P5) neck networks lead to severe missed detections of long-distance tiny objects. To this end, we extend the topological depth of the PANet to construct a Full-Scale feature pyramid that includes a P2 detection layer (Stride = 4). The introduction of this high-resolution branch allows the network to preserve the complete geometric contours of tiny thermal sources (< pixels), avoiding the “feature erosion” caused by deep downsampling. Meanwhile, we retain the deep P5 branch to maintain perception of close-range large-scale objects, so that both near and far targets in driving scenarios are covered.
- 3. Scale-Aware Shared Head: To address the computational redundancy introduced by the P2 layer and the vulnerability of shallow features to background thermal noise, we designed the SAS-Head. This module abandons traditional independent parameter designs and employs a unified set of shared convolutional weights to simultaneously process all feature levels from P2 to P5. Physically, this mechanism compresses the head parameters, offsetting the computational overhead of the high-resolution P2 layer. Logically, it constructs a cross-scale “semantic resonance”, forcing the global semantic information from the deep P5 layer to participate in the gradient updates of the shallow P2 layer. This suppresses high-frequency background false alarms, improving both computational efficiency and detection robustness.
3.2. The Gradient-Informed Attention (GIA) Module
3.2.1. Overall Architecture of GIA
- Gradient Branch: Dominated by Gradient Prior Convolution (GPConv), this branch acts as a “physical edge detector”. It bypasses pure data-driven learning to extract isotropic gradient magnitudes via Sobel-initialized convolutions, followed by a convolution for channel integration.
- Attention Branch: This branch first compresses channels via a convolution, then integrates a Target Channel Recalibration (TCR) module [31]. TCR aggregates global spatial context and dynamically recalibrates channel weights via a Multi-Layer Perceptron (MLP), followed by a convolution to enhance target-specific features while suppressing thermal noise.
3.2.2. Core Component I: Gradient Prior Convolution (GPConv)
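As a rough illustration of the operator that GPConv is initialized from, the following sketch computes a Sobel gradient magnitude on a small single-channel image in plain Python. It is not the authors' implementation: the 1/8 scaling stands in for the paper's normalization, which is not specified here, and the function names are hypothetical.

```python
import math

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
NORM = 1.0 / 8.0  # assumed normalization factor, not taken from the paper


def correlate3x3(img, kernel, r, c):
    """Apply a 3x3 kernel centred on pixel (r, c)."""
    return sum(
        kernel[i][j] * img[r + i - 1][c + j - 1]
        for i in range(3)
        for j in range(3)
    )


def gradient_magnitude(img):
    """Gradient magnitude per pixel; border pixels are left at zero."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = NORM * correlate3x3(img, SOBEL_X, r, c)
            gy = NORM * correlate3x3(img, SOBEL_Y, r, c)
            out[r][c] = math.hypot(gx, gy)
    return out


# A vertical step edge: left half cold (0), right half warm (1).
step = [[0, 0, 0, 1, 1, 1] for _ in range(5)]
edges = gradient_magnitude(step)
```

On this step image the response is 0.5 in the two columns adjacent to the edge and 0 in the flat regions, so the operator responds to boundaries rather than to uniform thermal areas.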
3.2.3. Core Component II: Target Channel Recalibration (TCR Module)
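The recalibration described for TCR (global spatial aggregation followed by an MLP that produces channel weights) can be sketched in plain Python. This is a minimal sketch in the style of squeeze-and-excitation [31]; the ReLU and sigmoid activations, the layer sizes, and all names are assumptions, and the actual TCR module may differ.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def global_avg_pool(fmap):
    """fmap is a list of channels, each a 2D list; returns one mean per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]


def linear(vec, weights, bias):
    """Fully connected layer: weights has one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def recalibrate(fmap, w1, b1, w2, b2):
    """Scale each channel by a gate computed from global context."""
    squeezed = global_avg_pool(fmap)
    hidden = [max(0.0, h) for h in linear(squeezed, w1, b1)]  # ReLU (assumed)
    gates = [sigmoid(g) for g in linear(hidden, w2, b2)]      # sigmoid (assumed)
    scaled = [[[g * v for v in row] for row in ch] for ch, g in zip(fmap, gates)]
    return scaled, gates


# Two 2x2 channels and hypothetical weights with a single hidden unit.
fmap = [[[1.0, 1.0], [1.0, 1.0]], [[3.0, 3.0], [3.0, 3.0]]]
scaled, gates = recalibrate(fmap, [[1.0, 0.0]], [0.0], [[0.0], [0.0]], [0.0, 0.0])
```

With the zero-valued second layer used here, both gates equal 0.5, so every activation is halved; trained weights would instead raise the gate of target-relevant channels and lower the gate of noisy ones.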
3.3. Fine-Grained Detection Branch
3.3.1. Data Observations and Physical Challenges: Feature Erosion
3.3.2. Architectural Implementation
- Geometric Details: Directly inherits the rich edge and texture information from the C2 layer (benefiting from the gradient enhancement of the GIA module).
- Initial Semantics: Integrates contextual information from deeper layers.
3.4. The Scale-Aware Shared Head (SAS-Head)
3.4.1. Design Motivation and Core Mechanism
3.4.2. Module Architecture and Implementation
- A. Shared Depthwise Separable Convolution:
- B. Group Normalization (GN) for Cross-Scale Distribution Adaptation:
- C. Scale-Aware Regression Calibration:
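To make the parameter saving from weight sharing concrete, the sketch below counts the weights of one depthwise separable block when it is replicated per level versus shared across the four levels P2 to P5. It is a back-of-the-envelope estimate under stated assumptions: all levels use the same channel width (a hypothetical 64), biases and GN affine parameters are ignored, and the scale-aware calibration is modelled as one learnable scalar per level, similar in spirit to the per-level scale used in FCOS [29]. The actual SAS-Head layout may differ.

```python
def dw_separable_params(channels: int, kernel: int = 3) -> int:
    """Weights of one depthwise k x k conv plus one pointwise 1 x 1 conv (no bias)."""
    depthwise = channels * kernel * kernel
    pointwise = channels * channels
    return depthwise + pointwise


def head_params(channels: int, num_levels: int, shared: bool) -> int:
    """Total conv weights of the head for the given number of pyramid levels."""
    per_block = dw_separable_params(channels)
    if shared:
        # One set of conv weights reused at every level, plus one
        # calibration scalar per level (assumed form of the calibration).
        return per_block + num_levels
    return per_block * num_levels


CHANNELS = 64    # hypothetical channel width
NUM_LEVELS = 4   # P2 to P5

independent = head_params(CHANNELS, NUM_LEVELS, shared=False)
shared = head_params(CHANNELS, NUM_LEVELS, shared=True)
print(independent, shared)
```

Under these assumptions the independent design needs 18,688 weights and the shared design 4,676, roughly a fourfold reduction for this block, which is consistent in direction with the GFLOPs drop from 13.7 to 11.3 reported in the ablation.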
4. Results and Discussions
4.1. Datasets and Evaluation Metrics
4.1.1. Dataset Description
FLIR v2 Dataset (Primary Dataset)
- Training Set: Consists of 10,742 images, used for model parameter optimization.
- Validation Set: Consists of 1144 images, used to monitor model convergence and select the optimal weights during the training process.
- Test Set: Consists of 3749 images sourced from independent video sequences, used to evaluate the model’s generalization capabilities in continuous dynamic scenarios.
M3FD Dataset (Generalization Dataset)
- Class Mapping: The original label “People” was mapped to Person, and “Motorcycle” was mapped to Bicycle.
- Class Filtering: Considering that the thermal signatures of Bus and Truck in M3FD differ significantly from the Car category (primarily sedans) defined in FLIR v2, we treated categories such as Bus, Truck, and Lamp as background and removed them to avoid evaluation errors caused by definition ambiguity. Only the original “Car” label was retained for the Car category.
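The mapping and filtering rules above amount to a small lookup table. The sketch below shows one way to apply them; the annotation format (a label paired with a bounding box) and the function name are hypothetical.

```python
# M3FD label -> FLIR v2 label, following the mapping described above.
# Labels absent from this table (Bus, Truck, Lamp, ...) are dropped.
M3FD_TO_FLIR = {
    "People": "Person",
    "Motorcycle": "Bicycle",
    "Car": "Car",
}


def remap_annotations(annotations):
    """Return the annotations with M3FD labels mapped to FLIR v2 classes."""
    remapped = []
    for label, box in annotations:
        target = M3FD_TO_FLIR.get(label)
        if target is None:
            continue  # treated as background and removed
        remapped.append((target, box))
    return remapped


sample = [
    ("People", (10, 20, 30, 80)),
    ("Bus", (100, 40, 300, 200)),
    ("Motorcycle", (50, 60, 90, 110)),
    ("Car", (200, 80, 280, 140)),
    ("Lamp", (5, 5, 15, 40)),
]
print(remap_annotations(sample))
```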
4.1.2. Evaluation Metrics
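mAP@50 is the mean of the per-class AP@50 values. As a quick consistency check, the sketch below recomputes mAP@50 from the per-class AP values reported in the ablation table (Person, Bicycle, Car) and compares the result with the reported mAP@50 column; the variable and function names are hypothetical.

```python
def mean_ap(per_class_ap: dict) -> float:
    """Mean of per-class average precision values."""
    return sum(per_class_ap.values()) / len(per_class_ap)


# Per-class AP@50 (%) copied from the ablation table.
ABLATION_AP50 = {
    "Baseline (YOLOv11n)": {"Person": 72.43, "Bicycle": 25.35, "Car": 74.88},
    "+GIA": {"Person": 73.65, "Bicycle": 29.67, "Car": 76.74},
    "+P2": {"Person": 76.58, "Bicycle": 28.92, "Car": 80.98},
    "+SAS (Fs2PA)": {"Person": 77.16, "Bicycle": 33.18, "Car": 81.83},
}

# Reported mAP@50 (%) from the same table.
REPORTED_MAP50 = {
    "Baseline (YOLOv11n)": 57.55,
    "+GIA": 60.02,
    "+P2": 62.16,
    "+SAS (Fs2PA)": 64.06,
}

for model, per_class in ABLATION_AP50.items():
    print(model, round(mean_ap(per_class), 2), REPORTED_MAP50[model])
```

The recomputed means agree with the reported mAP@50 values to within rounding for all four configurations.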
4.2. Experimental Environment and Implementation Details
- Hardware and Software Setup
- Training Strategy
- Implementation and Evaluation Details
4.3. Ablation Studies
4.3.1. Progressive Effectiveness Analysis
- (1) GIA Module:
- (2) Full-Scale Layer:
- (3) SAS-Head:
4.3.2. Architectural Paradigm Analysis: The Paradox of P5
- (1) The Dilemma of Standard Heads: P5 as a “Noise Source”
- (2) Semantic Activation of SAS: P5 as a “Semantic Anchor”
- Without P5 (Model C): When applying SAS to the P2–P4 architecture, mAP@50 drops to 61.44%, below the 62.24% obtained with the standard head on the same scales (Model B). This suggests that without the global context provided by the deep-layer hierarchy, it is difficult for shared weights to converge to a state that is robust across all scales.
- With P5 (Model D-Fs2PA): When P5 is retained, SAS utilizes its global semantic information to guide the update of shared features, pushing the mAP@50 up to 64.06%.
4.4. Comparative Experiments
4.4.1. Intra-Scale Comparison (Nano-Scale)
4.4.2. Cross-Scale Comparison
4.5. Generalization Analysis
4.6. Qualitative Analysis
4.6.1. Qualitative Detection Results
- Scenario A: Long-range Tiny Objects. As shown in Figure 11a, due to the perspective effect caused by distance, targets occupy very few pixels (<) in the field of view and exhibit blurred textures. As the baseline model’s finest detection head is based on the P3 layer (8× downsampling), the spatial features of tiny objects are severely eroded during deep convolutions, leading to systematic missed detections. In contrast, benefiting from the introduction of the Full-Scale Layer, our method preserves high-resolution geometric features from shallow layers. It captures faint thermal signals and recalls the vast majority of tiny objects, significantly outperforming the Baseline’s “blindness” to them.
- Scenario B: Boundary Dissolution Induced by Thermal Crossover. Figure 11b illustrates the severe feature degradation caused by “thermal crossover”. The thermal radiation intensity of the vehicle body highly converges with the background vegetation. This thermal equilibrium, coupled with complex vegetation textures, causes the gradient boundaries to almost disappear, visually “melting” the target into the environment. The Baseline model struggles to separate targets from such low-contrast backgrounds, frequently resulting in missed detections. However, our proposed GIA module enhances the perception of faint contours by injecting explicit gradient priors. Experiments demonstrate that even when target boundaries undergo severe dissolution, our method can still accurately delineate potential targets.
- Scenario C: Cluttered Background & Semantic Interference. Figure 11c presents complex street scenes containing numerous non-target heat sources (e.g., truck structures, road hot spots). Lacking sufficient semantic discriminability, the Baseline model easily misclassifies background noise with similar thermal characteristics as vehicles, leading to a surge in the false alarm rate. Conversely, our model utilizes the Deep Semantic Constraints provided by the SAS-Head to effectively leverage global contextual information for calibrating the logical validity of target categories. It successfully filters out false alarms caused by background hot spots and semantic confusion, demonstrating exceptionally strong anti-interference robustness.
4.6.2. Internal Mechanism Verification via Grad-CAM
4.7. Discussions
5. Conclusions
- The GIA module injects explicit geometric priors through normalized gradient initialization and attention recalibration, thereby breaking the traditional “texture bias” to acutely capture blurred boundaries in thermal crossover scenarios.
- A Full-Scale feature pyramid introduces a high-resolution branch to reverse the feature erosion caused by deep downsampling, effectively preserving the geometric contours of long-range tiny objects.
- The SAS-Head utilizes a cross-scale parameter-sharing mechanism to forge a “semantic resonance”, leveraging deep-layer semantics to suppress shallow-layer background noise while effectively neutralizing computational redundancy.
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276.
- Krišto, M.; Ivasic-Kos, M.; Pobar, M. Thermal object detection in difficult weather conditions using YOLO. IEEE Access 2020, 8, 125459–125476.
- Bijelic, M.; Gruber, T.; Mannan, F.; Kraus, F.; Ritter, W.; Dietmayer, K.; Heide, F. Seeing through fog without seeing fog: Deep multimodal sensor fusion in unseen adverse weather. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11682–11692.
- Farooq, M.A.; Shariff, W.; O’callaghan, D.; Merla, A.; Corcoran, P. On the role of thermal imaging in automotive applications: A critical review. IEEE Access 2023, 11, 25152–25173.
- Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Attentional local contrast networks for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2021, 59, 9813–9824.
- Liu, J.; Fan, X.; Huang, Z.; Wu, G.; Liu, R.; Zhong, W.; Luo, Z. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 5802–5811.
- Yoon, S.; Cho, J. Deep multimodal detection in reduced visibility using thermal depth estimation for autonomous driving. Sensors 2022, 22, 5084.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 28.
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Cham, Switzerland, 2014; pp. 740–755.
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: New York, NY, USA, 2009; pp. 248–255.
- Geirhos, R.; Rubisch, P.; Michaelis, C.; Bethge, M.; Wichmann, F.A.; Brendel, W. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
- Su, Z.; Liu, W.; Yu, Z.; Hu, D.; Liao, Q.; Tian, Q.; Pietikäinen, M.; Liu, L. Pixel difference networks for efficient edge detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 5117–5127.
- Xu, C.; Wang, J.; Yang, W.; Yu, H.; Yu, L.; Xia, G.S. Detecting tiny objects in aerial images: A normalized Wasserstein distance and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2022, 190, 79–93.
- Chen, Q.; Wang, Y.; Yang, T.; Zhang, X.; Cheng, J.; Sun, J. You only look one-level feature. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13039–13048.
- Zhang, H.; Fromont, E.; Lefèvre, S.; Avignon, B. Guided attentive feature fusion for multispectral pedestrian detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 72–80.
- Zhang, M.; Zhang, R.; Yang, Y.; Bai, H.; Zhang, J.; Guo, J. ISNet: Shape matters for infrared small target detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 877–886.
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
- Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768.
- Li, S.; Ma, Q.; Zhang, S.; Yang, C. DCCS-Det: Directional Context and Cross-Scale-Aware Detector for Infrared Small Target. IEEE Trans. Geosci. Remote Sens. 2026, 64, 4400215.
- Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 2778–2788.
- Benjumea, A.; Teeti, I.; Cuzzolin, F.; Bradley, A. YOLO-Z: Improving small object detection in YOLOv5 for autonomous vehicles. arXiv 2021, arXiv:2112.11798.
- Li, M.; Liu, X.; Chen, S.; Yang, L.; Du, Q.; Han, Z.; Wang, J. MST-YOLO: Small object detection model for autonomous driving. Sensors 2024, 24, 7347.
- Yang, C.; Chen, M.; Xiong, Z.; Yuan, Y.; Wang, Q. Cm-net: Concentric mask based arbitrary-shaped text detection. IEEE Trans. Image Process. 2022, 31, 2864–2877.
- Yang, C.; Chen, M.; Yuan, Y.; Wang, Q. Reinforcement shrink-mask for text detection. IEEE Trans. Multimed. 2022, 25, 6458–6470.
- Yang, C.; Chen, M.; Yuan, Y.; Wang, Q. Zoom text detector. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 15745–15757.
- Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430.
- Jocher, G.; Qiu, J.; Jing, Y. Ultralytics YOLO11. 2024. Available online: https://github.com/ultralytics/ultralytics (accessed on 20 January 2026).
- Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: A simple and strong anchor-free object detector. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 1922–1933.
- Vollmer, M.; Möllmann, K.P. Infrared Thermal Imaging: Fundamentals, Research and Applications; John Wiley & Sons: Hoboken, NJ, USA, 2018.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
- Zeng, J.; Zhong, H. YOLOv8-PD: An improved road damage detection algorithm based on YOLOv8n model. Sci. Rep. 2024, 14, 12052.
- Wu, Y.; He, K. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
- Teledyne FLIR. Teledyne FLIR Free ADAS Thermal Dataset v2. 2022. Available online: https://adas-dataset-v2.flirconservator.com (accessed on 25 October 2025).
- Padilla, R.; Netto, S.L.; Da Silva, E.A. A survey on performance metrics for object-detection algorithms. In Proceedings of the 2020 International Conference on Systems, Signals and Image Processing (IWSSIP), Nitra, Slovakia, 1–3 July 2020; IEEE: New York, NY, USA, 2020; pp. 237–242.
- Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6569–6578.
- Peng, Y.; Li, H.; Wu, P.; Zhang, Y.; Sun, X.; Wu, F. D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025.
- Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 16965–16974.
- Jocher, G.; Stoken, A.; Borovec, J.; Changyu, L.; Hogan, A.; Diaconu, L. Ultralytics YOLOv5 (Version 3.1). 2020. Available online: https://github.com/ultralytics/yolov5 (accessed on 20 January 2026).
- Lei, M.; Li, S.; Wu, Y.; Hu, H.; Zhou, Y.; Zheng, X.; Ding, G.; Du, S.; Wu, Z.; Gao, Y. Yolov13: Real-time object detection with hypergraph-enhanced adaptive visual perception. arXiv 2025, arXiv:2506.17733.
- Xu, Y.; Jiao, Y. An improved YOLO11-based UAV infrared vehicle detection method with HS-FPN and CBAM mechanisms. In Proceedings of the Fifth International Conference on Digital Signal and Computer Communications (DSCC 2025), Lanzhou, China, 23–25 May 2025; SPIE: Cergy-Pontoise, France, 2025; Volume 13653, pp. 236–242.
- Zhu, R.; Zhang, J.; Yang, D.; Zhao, D.; Chen, J.; Zhu, Z. Exploring Attention Placement in YOLOv5 for Ship Detection in Infrared Maritime Scenes. Technologies 2025, 13, 391.
- Wen, S.; Li, L.; Ren, W. A lightweight and effective yolo model for infrared small object detection. Int. J. Pattern Recognit. Artif. Intell. 2025, 39, 2551009.
- Atrash, A.; Ertekin, S.; Ugur, O.; Moured, O.; Chen, Y.; Zhang, J. TY-RIST: Tactical YOLO Tricks for Real-time Infrared Small Target Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Honolulu, HI, USA, 19–23 October 2025; pp. 2201–2210.
- Wu, J.; Wu, L.; Wang, D.; Peng, Y.; Liao, Z. YOLOv8-IRD: Infrared Road Small Object Detection Algorithm Based on Improved YOLOv8. In Proceedings of the 2024 China Automation Congress (CAC), Qingdao, China, 1–3 November 2024; IEEE: New York, NY, USA, 2024; pp. 1243–1248.
- Li, S.; Li, Y.; Li, Y.; Li, M.; Xu, X. Yolo-firi: Improved yolov5 for infrared image object detection. IEEE Access 2021, 9, 141861–141875.
- Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 20 January 2026).
- Wang, C.Y.; Yeh, I.H.; Mark Liao, H.Y. Yolov9: Learning what you want to learn using programmable gradient information. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Cham, Switzerland, 2024; pp. 1–21.
- Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J. Yolov10: Real-time end-to-end object detection. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2024; Volume 37, pp. 107984–108011.
- Tian, Y.; Ye, Q.; Doermann, D. Yolov12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524.
- Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626.












Inference speed (FPS) at different batch sizes and numerical precisions.

| Model | Batch Size = 1, FP32 | Batch Size = 1, FP16 | Batch Size = 16, FP32 | Batch Size = 16, FP16 |
|---|---|---|---|---|
| Baseline (YOLOv11n) | 112.0 | 113.2 | 713.9 | 704.6 |
| Fs2PA (Ours) | 70.3 | 70.3 | 525.5 | 547.2 |

Progressive ablation of the GIA module, the P2 layer, and the SAS-Head (Section 4.3.1).

| Model | GIA | P2 Layer | SAS-Head | mAP@50 (%) | AP Person (%) | AP Bicycle (%) | AP Car (%) | GFLOPs |
|---|---|---|---|---|---|---|---|---|
| Baseline (YOLOv11n) | - | - | - | 57.55 | 72.43 | 25.35 | 74.88 | 6.3 |
| +GIA | ✓ | - | - | 60.02 | 73.65 | 29.67 | 76.74 | 9.4 |
| +P2 | ✓ | ✓ | - | 62.16 | 76.58 | 28.92 | 80.98 | 13.7 |
| +SAS (Fs2PA) | ✓ | ✓ | ✓ | 64.06 | 77.16 | 33.18 | 81.83 | 11.3 |

Architectural paradigm analysis of feature scales and head type (Section 4.3.2).

| Model ID | Feature Scales | Head Type | mAP@50 (%) | mAP@50:95 (%) | GFLOPs | Description |
|---|---|---|---|---|---|---|
| A | P2–P5 | Standard | 62.16 | 36.23 | 13.7 | Redundant P5 |
| B | P2–P4 | Standard | 62.24 | 36.13 | 13.0 | Naive Light |
| C | P2–P4 | SAS-Head | 61.44 | 36.20 | 10.7 | Incomplete |
| D (Fs2PA) | P2–P5 | SAS-Head | 64.06 | 37.02 | 11.3 | Optimal |

Intra-scale (nano-scale) comparison with other detectors (Section 4.4.1).

| Method | Scale | Prec. (%) | Rec. (%) | mAP@50 (%) | mAP@50:95 (%) | GFLOPs | FPS |
|---|---|---|---|---|---|---|---|
| CenterNet-R18 [36] | N/A | - | - | 17.00 | 5.90 | 22.5 | 232 |
| YOLOX-nano [27] | N | - | - | 45.90 | 23.70 | 2.6 | 585 |
| YOLOX-tiny [27] | T | - | - | 52.20 | 29.00 | 15.2 | 621 |
| RT-DETR-N [38] | N | 61.48 | 46.65 | 51.45 | 26.58 | 26.9 | 262 |
| D-FINE-N [37] | N | 78.12 | 63.20 | 57.50 | 32.00 | 7.1 | 87 |
| YOLOv8-CBAM [41,42] | N | 70.60 | 50.80 | 57.00 | 32.40 | 8.1 | 769 |
| YOLOv8-CA [43,44] | N | 72.50 | 50.40 | 57.50 | 33.10 | 8.1 | 833 |
| YOLOv8-IRD [45] | N | 69.20 | 54.10 | 60.60 | 34.60 | 10.8 | 500 |
| YOLO-FIRI [46] | T | 60.90 | 45.70 | 44.80 | 20.60 | 15.2 | 270 |
| YOLOv5nu [39] | N | 69.53 | 50.31 | 56.13 | 31.73 | 7.1 | 604 |
| YOLOv8n [47] | N | 69.60 | 51.01 | 56.75 | 32.44 | 8.1 | 622 |
| YOLOv9t [48] | T | 70.69 | 52.14 | 58.36 | 33.03 | 7.6 | 408 |
| YOLOv10n [49] | N | 68.95 | 49.17 | 57.01 | 32.63 | 6.5 | 739 |
| YOLOv11n [28] | N | 68.31 | 51.33 | 57.55 | 32.55 | 6.3 | 705 |
| YOLOv12n [50] | N | 68.32 | 51.80 | 57.51 | 32.57 | 5.8 | 651 |
| YOLOv13n [40] | N | 69.85 | 49.81 | 56.15 | 30.98 | 6.1 | 492 |
| Fs2PA | N | 73.48 | 56.56 | 64.06 | 37.02 | 11.3 | 547 |

Cross-scale comparison with small-scale (S) detectors (Section 4.4.2).

| Method | Scale | Prec. (%) | Rec. (%) | mAP@50 (%) | mAP@50:95 (%) | GFLOPs | FPS |
|---|---|---|---|---|---|---|---|
| YOLOv5su | S | 72.14 | 55.29 | 61.77 | 36.18 | 23.8 | 460 |
| YOLOv8s | S | 72.87 | 55.57 | 62.27 | 35.94 | 28.4 | 424 |
| YOLOv9s | S | 72.11 | 55.82 | 61.89 | 36.24 | 26.7 | 394 |
| YOLOv10s | S | 72.48 | 53.19 | 61.80 | 36.32 | 21.4 | 525 |
| YOLOv11s | S | 71.32 | 55.04 | 61.59 | 36.32 | 21.3 | 552 |
| Fs2PA | N | 73.48 | 56.56 | 64.06 | 37.02 | 11.3 | 547 |

Generalization comparison (Section 4.5).

| Method | Scale | Prec. (%) | Rec. (%) | mAP@50 (%) | mAP@50:95 (%) | FPS |
|---|---|---|---|---|---|---|
| CenterNet-R18 | N/A | - | - | 15.00 | 5.40 | 108 |
| YOLOX-nano | N | - | - | 38.80 | 18.30 | 610 |
| YOLOX-tiny | T | - | - | 45.20 | 24.10 | 565 |
| RT-DETR-N | N | 63.37 | 47.48 | 51.85 | 27.27 | 256 |
| D-FINE-N | N | 73.99 | 62.31 | 51.10 | 27.40 | 79 |
| YOLOv8-CBAM | N | 69.10 | 49.40 | 54.20 | 30.50 | 769 |
| YOLOv8-CA | N | 74.40 | 48.30 | 54.40 | 30.20 | 833 |
| YOLOv8-IRD | N | 69.60 | 47.90 | 53.40 | 29.00 | 500 |
| YOLO-FIRI | T | 58.90 | 45.90 | 43.30 | 19.00 | 250 |
| YOLOv5nu | N | 74.19 | 46.51 | 52.91 | 28.71 | 554 |
| YOLOv8n | N | 70.30 | 49.05 | 53.60 | 29.53 | 589 |
| YOLOv9t | T | 71.06 | 49.47 | 54.61 | 30.01 | 426 |
| YOLOv10n | N | 70.52 | 48.36 | 54.19 | 29.79 | 709 |
| YOLOv11n | N | 70.67 | 48.85 | 53.96 | 29.89 | 738 |
| YOLOv12n | N | 66.56 | 48.65 | 53.62 | 29.74 | 620 |
| YOLOv13n | N | 71.13 | 47.14 | 53.23 | 29.03 | 488 |
| Fs2PA | N | 71.40 | 52.31 | 57.94 | 31.89 | 524 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Pei, B.; Wu, L.; Zheng, X.; Zhou, C.; Wang, D. Fs2PA: A Full-Scale Feature Synergistic Perception Architecture for Vehicular Infrared Object Detection via Physical Priors and Semantic Constraints. Sensors 2026, 26, 2257. https://doi.org/10.3390/s26072257

