In public safety scenarios, the precise detection and localization of prohibited weapons such as firearms and knives, together with the personnel involved, are core prerequisite technologies for violent-risk warning and emergency response. Security surveillance scenarios, however, commonly suffer from object occlusion, difficulty in capturing small-sized weapons, and complex background interference, so existing general-purpose object detection models fall short on security-related detection and localization tasks, showing poor adaptability, low detection accuracy, and insufficient robustness in complex scenes. This paper therefore proposes PDGV-DETR, a threat-object detection framework for security scenarios based on adaptive dynamic convolution and cross-scale semantic fusion, optimized specifically for detecting and localizing weapon and personnel objects in static security surveillance images. The research focuses on object-level category recognition and pixel-level spatial localization; it does not address the classification or identification of violent behaviors from temporal information, a task with clearly distinct technical boundaries and scene constraints. The framework is built around three core modules: a dynamic hierarchical channel interaction convolution module that reduces computational complexity while enhancing the detection of occluded and incomplete objects; an improved bidirectional hybrid feature pyramid network, combined with a cross-scale fusion module, that strengthens multi-scale feature representation and accommodates the simultaneous detection of small weapon objects and large personnel objects; and a global semantic weaving and elastic feature alignment network that addresses the low discriminability between objects and complex backgrounds.
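The abstract does not specify the internals of the dynamic hierarchical channel interaction convolution; as a minimal illustrative sketch of the general idea behind input-conditioned (dynamic) convolution, the example below mixes a small bank of per-channel 1x1 kernels with attention weights derived from the input's global context, so the effective kernel adapts to each image. All names and shapes here are hypothetical, not the paper's design.

```python
import numpy as np

def dynamic_channel_conv(x, kernels):
    """Input-conditioned mixture of K per-channel 1x1 kernels (a sketch).

    x       : (C, H, W) feature map
    kernels : (K, C) bank of depthwise 1x1 kernels
    Attention over the K kernels comes from global average pooling of x,
    so the effective kernel is adapted to the current input.
    """
    context = x.mean(axis=(1, 2))            # (C,) global average pooling
    logits = kernels @ context               # (K,) affinity of each kernel
    att = np.exp(logits - logits.max())
    att = att / att.sum()                    # softmax over the kernel bank
    eff = (att[:, None] * kernels).sum(0)    # (C,) input-adaptive kernel
    return x * eff[:, None, None]            # depthwise 1x1 application

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))           # toy 4-channel feature map
kernels = rng.standard_normal((3, 4))        # bank of 3 kernels
y = dynamic_channel_conv(x, kernels)
print(y.shape)
```

Because the mixing weights depend on the input, different inputs select different effective kernels at no extra convolution cost, which is the usual efficiency argument for dynamic convolution.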
Under identical experimental configurations, the proposed model is compared against current mainstream models on representative datasets. On a dataset of 2421 conflict-scene images of personnel violence, PDGV-DETR reached a peak average precision (mAP50) of 85.9%. Statistical verification shows a significant improvement over the baseline RT-DETR (mean ± standard deviation 0.840 ± 0.007 versus 0.858 ± 0.004 for PDGV-DETR, p < 0.01). The model accurately localizes personnel object regions, with an accuracy improvement of 15.1% over Deformable DETR. On the weapon-specific OD-WeaponDetection dataset, mAP for gun and knife detection reached 93.0%, a 2.2% improvement over RT-DETR. Whereas other general object detection models fluctuate in complex security scenarios, PDGV-DETR not only achieves better detection and localization accuracy for security-related objects but also markedly improves generalization and stability. The results show that PDGV-DETR effectively balances localization accuracy, detection accuracy, and computational efficiency, completing end-to-end detection and localization of weapon and personnel objects in static security surveillance images with highly competitive performance. It thereby provides core object-level preprocessing support for public-area monitoring, intelligent video surveillance, and violent-risk early warning, and supplies base data for subsequent violent-behavior recognition based on temporal information.
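The reported comparison (0.858 ± 0.004 versus 0.840 ± 0.007, p < 0.01) can be sanity-checked from summary statistics alone with Welch's t-test. The abstract does not state the number of runs, so n = 5 per model below is a purely illustrative assumption; only the means and standard deviations come from the text.

```python
import math

def welch_t_from_summary(m1, s1, n1, m2, s2, n2):
    """Welch's t statistic and degrees of freedom from summary stats."""
    se1, se2 = s1**2 / n1, s2**2 / n2
    t = (m1 - m2) / math.sqrt(se1 + se2)
    df = (se1 + se2)**2 / (se1**2 / (n1 - 1) + se2**2 / (n2 - 1))
    return t, df

# Reported mAP50 means/stds; n = 5 runs is an assumption for illustration.
t, df = welch_t_from_summary(0.858, 0.004, 5, 0.840, 0.007, 5)
print(round(t, 2), round(df, 1))  # t ≈ 4.99, df ≈ 6.4
```

With t ≈ 4.99 well above the two-tailed critical value of about 3.71 at df ≈ 6 and α = 0.01, a p-value below 0.01 is plausible under this assumed run count.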