AI-Based Weapon Detection for Security Surveillance: Recent Research Advances (2016–2025)

Thangavel Murugan; Nasurudeen Ahamed Noor Mohamed Badusha; Amnah Rashed Obaid Ali Semaihi; Maryam Mohamed Rashed Alkindi; Eman Mohammed Rashed Alnaqbi; Ghala Hmouda Turki Alketbi

doi:10.3390/electronics14234609

,

and

College of Information Technology, United Arab Emirates University, Al Ain P.O. Box 15551, United Arab Emirates

^*

Author to whom correspondence should be addressed.

Electronics2025, 14(23), 4609;https://doi.org/10.3390/electronics14234609

This article belongs to the Special Issue Advances in Image Processing, Artificial Intelligence and Intelligent Robotics, 2nd Edition

Version Notes

Order Reprints

Abstract

The necessity for intelligent monitoring has grown more urgent as the number of crimes involving firearms and knives in homes and public areas has increased. Traditional CCTV systems require human operators, whose attentiveness and the impracticability of monitoring multiple video feeds simultaneously limit their effectiveness. Artificial intelligence (AI)-based vision systems can automatically detect firearms and enhance public safety, thereby overcoming this constraint. In accordance with the Preferred Reporting Items for Systematic Reviews (PRISMA) criteria, a systematic evaluation of AI-based weapon detection for security monitoring is conducted. The paper summarizes research works on AI, machine learning, and deep learning techniques for identifying weapons in surveillance footage from 2016 to 2025, encompassing 101 research papers. The reported precision ranged from 78% to 99.5%, recall ranged from 83% to 97%, and mean average precision (mAP) ranged from approximately 70% to 99%. While AI-based monitoring significantly enhances detection accuracy, issues with inconsistent evaluation criteria, limited real-world validation, and dataset variability persist. The research study emphasizes the need for uniform benchmarking, robust privacy protections, and standardized datasets to ensure the ethical and reliable implementation of AI-driven weapon-detection systems.

Keywords:

artificial intelligence; YOLO; SSD; Faster R-CNN; weapon detection

1. Introduction

Today, there is a high demand for active monitoring systems to curb the rising incidence of public crimes. As incidents involving knives and firearms, such as robberies, shootings, and threats at gunpoint, increase, CCTV cameras are becoming more widespread in public spaces. Early detection of weapons in surveillance footage is crucial for security and safety. Recent progress in object detection and recognition, driven by advances in machine learning and deep learning [1,2], has been significant, including the use of CNNs and YOLO [3,4].

Detecting weapons poses challenges for typical smart city security systems. These issues fall into three categories: technical, operational, and moral. Technical—Low-quality footage, resulting from poor resolution, inadequate lighting, or obstructions, can significantly limit the system’s ability to detect firearms. Operational—One of the main challenges is ensuring effectiveness and safety when managing and storing large amounts of surveillance data. Moral—There are serious privacy concerns with continuous monitoring. It is not easy to find a balance between the public’s right to privacy and security needs. Surveillance systems are not responsive enough to detect and respond to suspicious activities, such as armed robbery, in public locations. As a result, there is a need for technology that can automatically identify criminal behavior from Closed Circuit Television (CCTV) footage without requiring human intervention. A range of high-performance computing algorithms has been developed, but they have been limited to specific conditions [5].

Artificial intelligence (AI)-enhanced smart city security surveillance systems leverage advanced technologies to address many of the issues faced by traditional surveillance systems. Large datasets are used to train AI models, particularly machine learning and deep learning algorithms, which improve weapon detection accuracy by reducing false positives and negatives. Although AI-enhanced smart city security surveillance offers numerous benefits, it also presents several challenges that need to be addressed. Training AI models requires high-quality, labeled data; however, collecting and annotating large datasets is both expensive and time-consuming. Monitoring criminal behavior necessitates continuous human observation. Most of the time, these incidents occur due to the use of portable handheld weapons, specifically a pistol and a gun. There exist object detection algorithms that detect weapons such as knives and handguns. Detecting handguns and knives is one of the most challenging tasks due to occlusion, viewpoint variations, background clutter, and other factors [6].

Prior studies have focused on AI-based surveillance in general, but not specifically on AI-based weapon identification. The purpose of this study is to review all of the research conducted in the domain of weapon detection using AI, machine-learning, and deep-learning models between the years of 2016 and 2025. It aims to examine advancements and performance patterns in weapon detection systems over time, analyze the challenges in research, and highlight any evolution in AI-based surveillance methods. The purpose of this research is to evaluate and analyze AI-based methods for identifying weapons in surveillance systems. The primary focus is to determine whether AI, machine learning, and deep learning models were deployed between 2016 and 2025, and, if so, what their reported accuracy, precision, and recall were. The review will also examine the datasets and evaluation measures employed. Additionally, the review will raise awareness of research gaps in developing robust, reliable AI-based weapon-detection systems and ultimately highlight pressing issues such as data variability and real-world applicability. This research study makes several significant contributions:

i.: Develop and evaluate a thorough review of AI-enabled weapon detection models across a variety of datasets and settings.
ii.: The research evaluates the performance of a number of the most popular and referenced deep learning models (YOLO, Faster R-CNN, SSD, CNN) based on their accuracy, speed, and real-time feasibility.
iii.: The study assesses the efficiencies of public vs. private data for detecting guns and knives.
iv.: Develop a comparative framework that can be utilized to analyze model performances relative to environmental and hardware constraints.
v.: Presenting future research opportunities for improved real-time performance and multi-weapon detection accuracy.

The remaining sections of this research study are organized as follows: Section 2 covers Research Methodology. Section 3 covers Methodological Framework. Section 4 discusses AI-enhanced Smart City Surveillance—Weapon Detection: Gun. Section 5 covers AI-enhanced Smart City Surveillance—Weapon Detection: Knife. Section 6 discusses AI-enhanced Smart City Surveillance—Weapon Detection: Knife and Gun. Section 7 discusses the Research Summary with Key Findings. Section 8 addresses the Research Gap and Future Scope. Section 9 presents the Conclusion.

2. Research Methodology

The research methodology employs a detailed review strategy to minimize bias and produce more reliable and trustworthy findings. In addition to identifying research gaps, challenges, and obstacles worth exploring within the context of Weapon Detection, this review aims to contribute to a comprehensive analysis of the latest literature. This survey examines studies with relevant results and applies them to the Field of Weapon Detection.

Table 1 provides a comparative summary of critical research studies focusing on the weapon detection literature, including topics of interest, the time period covered, and the methods employed. Warsi et al. [6] studied knife and handgun detection algorithms between 2013 and 2018. Debnath et al. [7] provide a synthesis of computer vision-based methodologies for automatic gun and knife detection from 1962 to 2020. Yadav et al. [5] evaluate the performance of classical machine learning and deep learning models for weapon identification, published between 2011 and 2022. Santos et al. [3] also provide a systematic review of deep learning-based weapon detection research, examining models, datasets, and barriers in the literature from 2014 to 2022. In summary, the table illustrates the progression of weapon detection research over time, including the increasing use of deep learning methods and systematic reviews.

Table 1. Summary of Key Studies on Weapon Detection.

This study distinguishes itself from the existing literature summarized in Table 1 by its exhaustive, systematic, and data-driven assessment of AI-based weapon detection, conducted in accordance with the PRISMA framework, which supports methodological transparency and rigor. In contrast to previous studies that may have been primarily narrative or case studies that focused on a particular model or dataset, our paper covers a broader temporal range—from 2016 to 2025—and synthesizes an overall description of the findings from 101 research studies on weapon detection, which also presents a quantitative assessment of popular model evaluation metrics: precision, recall, and mean average precision (mAP). Moreover, our study goes beyond reporting on algorithm comparisons. It robustly discusses several areas of evaluation inconsistencies, dataset considerations, and ethical concerns. These findings underscore the importance of research published using standardized benchmarks and privacy-friendly approaches. This multidimensional approach provides not only technical and methodological perspectives but also ethical insights; hence, the review offers an up-to-date, systematic, and practical contribution compared to other studies, serving as a reference point for the development of ethical, reliable, and efficient AI-based surveillance.

The articles are selected based on their focus on Weapon Detection and the use of Artificial Intelligence, Machine Learning, and Deep Learning. The various sources used as our search databases are shown in Figure 1 and Figure 2, which list Conferences and Journals, respectively. These databases include verified conferences and articles.

Figure 1. Article Count from Conferences.

Figure 2. Article Count from Journals.

The search selection parameters demonstrate how we focus on the relevant content returned by search engines. Our goal is to find primary publications on our subject that address our research questions. A total of 101 studies were selected from various databases. The study analysis (Figure 3) included articles published from 2016 to 2025. Many duplicate papers unrelated to the research requirements had to be eliminated. For this process, research studies were selected from seven electronic databases. The number of publications chosen from each of the twelve electronic databases is shown in Figure 1 and Figure 2. The papers were categorized into two groups: conference papers and articles. For conference papers, IEEE Xplore leads with 73% of the focus (27 papers), followed by ACM and Springer, with 14% (5 papers) and 8% (3 papers), respectively. With 36% of the articles selected, journals indexed by Scopus received the most attention (23 papers), followed by ScienceDirect (23%; 15 papers), IEEE (16%; 10 papers), and MDPI (11%; 7 papers).

Figure 3. Articles Selection Process.

In developing this study, we followed the criteria outlined in the Preferred Reporting Items for Systematic Reviews (PRISMA) to help ensure the quality and repeatability of our review process. For a methodical, peer-reviewed review process, PRISMA (see Figure 3) provides a standard checklist of guidelines that we closely followed in this work. We developed a review protocol that outlined the criteria for selecting articles, the search protocol, data extraction, and the data analysis process. The inclusion and exclusion criteria are detailed in Table 2.

Table 2. Selection Criteria for the Literature: Inclusion and Exclusion.

To ensure transparency and replicability, we have provided a detailed account of our literature search strategy. The review now indicates that studies were recovered from [list of databases, i.e., Scopus, IEEE Xplore, SpringerLink, ScienceDirect, MDPI, Wiley Online Library] which included predetermined search strings that combined the following keywords: “weapon detection”, “firearm recognition”, “knife detection”, and “surveillance”. As previously stated, searches were conducted over the period 2016–2025, with the last search performed on [30 September 2025]. Only English language publications were considered. A two-phase screening process (i.e., titles and abstracts followed by full-text review) was conducted by multiple reviewers. If there was any disagreement about inclusion decisions, it was resolved through discussion and consensus. The PRISMA flowchart (Figure 3) and counts have been updated to reflect the current state of the process as reported.

All records were screened in two stages: first, a title and abstract review, followed by a full-text review, applying predetermined inclusion and exclusion criteria to assess each record’s eligibility. Discussions were conducted to maintain objectivity and resolve conflicts with consensus and debate. To apply consistency in the data collection process, a systematic approach of data extraction was developed. The protocol outlined the significant characteristics for documentation, including the year range, dataset type, AI/ML/DL model employed, evaluation metrics (e.g., precision, recall, mAP), and key results. The extraction protocol also suggested the use of standardized forms to maintain uniformity in extraction procedures; the authors independently verified the quality and reliability of the data extraction.

Extraction of Information: The papers were categorized into two groups: conference papers and articles. The process involves examining each document and classifying it by publication year. This results in a list of research studies (both journal and conference publications) along with supporting documentation for each research topic. The study covers research conducted from 2016 to 2025. Significant progress was made in 2023 in identifying and categorizing weapons. The growth in research studies during this period is illustrated in Figure 4.

Figure 4. Article Count—Distribution by Year.

Distribution of Research Papers: An analysis of research articles investigated or shortlisted reveals patterns in Weapon Detection research from 2016 to 2025. Journals and conferences were the two categories into which all the shortlisted studies were divided. Figure 5 shows the distribution of articles across these two categories: journal publications account for 63%, and conference papers account for 37%.

Figure 5. Division as per the Source of Distribution.

Figure 6 illustrates the comprehensive workflow of an AI-based weapon-detection system that complements surveillance cameras to provide real-time information on potential threats. Surveillance cameras capture video footage and provide continuous images as input, which may contain potential weapons (e.g., knives, guns). Images of potential weapons undergo a preprocessing step that resizes all frames to usable dimensions and omits unnecessary visual features using convolutional neural networks (CNNs). Preprocessed or volunteer frames are submitted for analysis by an object detection model (e.g., YOLO) to report on the identification and localization of the weapon in the images (i.e., bounding box and weapon classification). Once a weapon is identified and localized, the algorithm takes action, determining its location and automatically generating a threat alert to notify security staff of the weapon requiring immediate attention. The demonstration of the threat capture pipeline highlights an efficient, end-to-end AI-driven approach that leverages computer vision, deep learning, and intelligent alerting. The efficiency of functions combines to increase situational awareness and improve safety and security, specifically in surveillance.

Figure 6. AI-Driven Weapon Detection Workflow.

Figure 7 illustrates the chronological timeline of the YOLO (You Only Look Once) family of object detection algorithms, starting from 2015 (the original YOLO), and addresses developments that continued until 2025. The YOLO development process began with the original YOLO implementation, YOLOv1, introduced in 2015. The introduction of YOLOv1 enabled real-time object detection with a single neural network. YOLO development rapidly progressed to YOLOv2 in 2016, using improved accuracy and speed through architecture advances. Following the YOLOv2 release, YOLOv3 introduced improvements in multi-scale detection and feature extraction by using deeper convolutional layers in 2018. YOLOv4 and YOLOv5 were launched in 2020, using the CSPDarknet architecture and incorporating new training-time optimizations. YOLO tracking models offer faster performance and have been widely adopted in real-time computer vision tasks. YOLOv6 and YOLOv7 follow the FIR approach, which was released in 2021. These models utilize optimizations that improve both efficiency and accuracy, making them suitable for edge devices. In 2023, YOLOv8 introduced transformer-based applications and trained heads for the detection task. YOLOv9 through YOLOv12, which emerged between 2024 and 2025, represent the next generation of architecture, integrating transformer augmentation, improved feature fusion, and adaptive learning. In summary, the trajectory of advances in YOLO model families reflects continuous innovations and refinements to YOLO model designs and research, ultimately leading to improvements in real-time object and weapon detection.

Figure 7. Evolution Timeline of YOLO Models.

3. Methodological Framework

3.1. Input Modality (Image/Video Based)

Investigations into weapon detection are being developed across multiple input modalities, including images, videos, and multimodal data (audio-visual combined with textural or infrared). Most of the work on weapon detection is based on images [8,9,10,11,12]. These works employed datasets such as COCO and IMFDB and achieved very high precision above 199% using YOLO, Faster R-CNN, and Mask R-CNN as their main models. Video-based weapon detection studies [13,14,15,16] focus on real-time surveillance use cases and achieve good temporal performance; however, most approaches do not address occlusion or changes in lighting conditions. There are some new studies [17,18] focused on multimodal gun detection employing audiovisual, thermal, or synthetic data. These studies show that they can robustly detect the weapon promptly. Still, they highlight the need for greater diversity in their training dataset and for improved strategies for dataset fusion.

The articles reviewed utilized a variety of input modalities, including image-based data, X-ray, infrared, and multi-frequency (Millimeter-Wave (MMW) and Terahertz (THz)) data. Image-based approaches [19,20,21,22,23] are the most common because of their availability and ease of use, achieving high accuracy levels of 88–100% with convolutional neural network (CNN) and YOLO architectures. The X-ray imaging study demonstrated success in detecting concealed weapons [24], achieving an improved mAP of 91.3%; however, lighting and object similarity posed challenges for both performance and accuracy. Infrared-based detection (e.g., [25] appears to offer simple data acquisition in low-visibility environments, but its performance should still be validated. Multi-modal sensing data from millimeter-wave (MMW) and terahertz (THz) sources [26] produced precision measurements of 97–98%, demonstrating even better secret weapon detection performance and security screening efficiency.

The weapon detection studies implemented use widely different modalities for their input data, including images, videos, synthetic data, or multimodal datasets. However, images (e.g., [27,28] are the most common, are generated from publicly available datasets such as COCO and Open Images, and are used to train deep learning methods that achieve accuracies of up to 99%. Generally, these methods perform well for detecting static images and extracting features, but often struggle to detect images under different illumination or with occlusion. Alternatively, video models [29,30,31] leveraged real-time surveillance or behavioral recognition over time, with CNN and YOLO frameworks achieving the highest F1 scores (over 90%). However, these implementations have challenges with motion blur and crowded environments. A third approach synthesizes and/or provides a multimodal dataset [32,33,34,35] containing data from the real world and simulated environments—PMMW, mock attack, terahertz, and inference measures—to improve generalization and performance in complex or low visibility conditions. The synthetic and demonstration methods generally yielded mAP values of 97%.

3.2. Model Type (ML, CNN, R-CNN, YOLO (v1–v8)

In the studies examined, it is evident that deep learning models, particularly YOLO (versions v3–v8), were the most discussed for their speed and real-time detection precision. CNN-based and hybrid CNN models [2,36,37,38] achieved high accuracy in detecting the requested objects; however, they required large datasets to enable generalization. Models like Faster R-CNN or Mask R-CNN [1,9,11,39] achieved higher precision and accuracy but required more computational time. More recently, studies have also used Transformer models (ViT) and hybrid models like SCSP-ResNet and Capsule Networks [34,40,41]. These recent trends have benefited the study by enabling adaptive and accurate detection of small objects. Nonetheless, the mAP scores (over ~90%) and inference time of the YOLO framework models show that they are the best frameworks for adopting new technologies.

Deep learning models vary in their advantages and disadvantages across datasets. Traditional feature-based models, such as FAST [19], achieved 97% precision in the controlled image recognition study. CNN and DaCoLT [20] improved accuracy in automatic weapon detection, but it decreased dramatically for outdoor images. However, in the lightweight architecture category—MobileNet, PoseNet, and Mask R-CNN [21]—an accurate and efficient balance was measured for multi-view detection by meeting an impressive 95% accuracy. SSD with FPN [24] and YOLOv5 [23,36] extend the fast frame capture with a precision accuracy of 97.7%. Overall, the YOLO-based and hybrid CNN models demonstrated improved performance and potential flexibility while maintaining relatively high performance across datasets. In contrast, other deep learning models may have offered additional benefits but did not perform as efficiently in the assessment context, particularly compared to YOLO. Models balance in use cases for scalability and adaptation, but larger, more diverse datasets often improve the results for better, more robust deployment.

The studies examined provide clear evidence of the preeminence of deep learning frameworks, with YOLO versions (v3–v8) leading the pack in frequency, followed by CNN-, R-CNN-, and Transformer- or hybrid-based architectures. YOLO frameworks [33,42,43] are the most common due to their trade-off between speed and precision; some studies even report precision and recall over 90% in real-time scenarios. Most CNN and DCNN [27,28,44] papers report nearly 100% accuracy on controlled datasets, confirming that CNNs and DCNNs are also excellent frameworks for processing feature representations. R-CNN family frameworks [30,31,45,46] generally report higher accuracy as well, but exhibit more resource requirements for model training, testing, or prediction. Lastly, Transformer- [40], hybrid- [47], or other transformer-based [48] models signal a modern movement toward better understanding of context and scaling multi-scale object detection, particularly when small or partially obscured weapons are displayed. YOLOv5–v8, regardless of the architectural framework, consistently achieved the best mAP of 85–98% and the fastest inference times among the studies.

3.3. Application Environment (Real-Time Surveillance)

The application of weapon detection frameworks occurs in multiple environments, whether in the field through live surveillance, for example, a public safety monitor, or through evidence gathered in controlled laboratory environments. The environments listed feed into the development and production of weapon detection systems [13,49,50,51]. Priority is placed on speed and environmental adaptability to deploy weapon-detection systems in (3) controlled environments/research dataset papers for testing the accuracy of a model under constraints. Recent engagement applications have adopted weapon detection [52,53] and mobile/embedded systems [43,54], which lean towards a light, compact architecture for mobile embedded systems on chip. Overall, having a model adapt to the complexity of a given environment to achieve successful deployment still depends on how accurately it compensates for environmental variables, such as heat, light, blur, or density.

These models are used across a range of operations, including real-time surveillance, security screening, and military applications. The first two studies [19,22] were focused on real-time image classification for general surveillance, but are also applicable to other surveillance-based tasks. Studies [20,21] applied convolutional neural networks (CNNs) and MobileNet models, respectively, to indoor detection and multi-view space. Studies of interest using X-ray and MMW/THz [24,26] focused on security checkpoints and concealed weapons, respectively. These studies support research findings of increased accuracy in surveillance-based detection in contrived settings, but at the cost of higher computational and memory requirements. Infrared-based tasks [25] and YOLOv5-based models [23] used practical settings to achieve high accuracy for low-light surveillance. Still, each of these studies indicated the need for real-world applications and hardware considerations.

In the field of weapon detection, detection has been applied in several settings, including real-time surveillance, public safety monitoring, laboratory testing, and security/defense. Real-time and CCTV studies [29,31,49,51] focus on low-latency, high-speed inference using YOLO or other lightweight CNNs, achieving reliable rates (up to 97%) in live remote monitoring systems. Controlled or lab experiments [27,28,36,38] tend to provide studies that test for accuracy and robustness, lending insight into how well the models perform and helping develop performance baselines. Defense and other high-security applications [24,32,34,52,53] have used X-ray imaging, PMMW, or infrared imaging for concealed weapon detection with high precision, but typically required costly specialized hardware. More recently, works have investigated even mobile and edge-based systems [43,54], where resources may be limited to achieve faster inference, but no more or less accurate than in other studies.

4. AI-Enhanced Smart City Surveillance—Weapon Detection: Gun

Legal authorities have faced increasing pressure to control and reduce crime rates in cities as urban populations have grown. Monitoring public use of weapons and having real-time data on criminal incidents are essential. Additionally, firearm-related crime rates are a significant concern globally. Although many countries permit firearm possession, it remains crucial to monitor their public use. Gun violence is one of the most critical public health emergencies today. Real-time gunshot location and identification are vital for military, security, and law enforcement operations. Recent advances in artificial intelligence have enabled the development of effective gunshot recognition systems. Deep learning, a branch of artificial intelligence that mimics the human brain’s function to perform tasks and recognize patterns, plays a key role. Video surveillance for real-time firearm detection in public spaces is among the most effective strategies for preventing and identifying criminal activity. Figure 8 illustrates the process of detecting weapons using artificial intelligence or other AI techniques.

Figure 8. Gun Detection Workflow Diagram Using Artificial Intelligence.

Hussein et al. [8] discussed the challenge of using image-processing techniques to detect concealed firearms in public places. The study detects hidden weapons, such as knives and guns, using real-time infrared (IR) and RGB images. The proposed workflow includes image acquisition, preprocessing, fusion, and edge detection, utilizing the Discrete Wavelet Transform (DWT) for image fusion and the Otsu algorithm for segmentation. Images were resized to 500 × 500 pixels and processed with various image operations as part of the experimental setup. Achieving a 99% detection accuracy, the technology outperformed previous methods in precision. For automatic weapon detection, future research suggests implementing this technique in real-time applications. A 99% accuracy was achieved using the Otsu Algorithm for threshold-based segmentation in automatic weapon detection; however, the algorithm could not operate in real time and required further improvements.

Verma et al. [9] explore the application of Faster R-CNN, a popular object detection model, for detecting handheld guns in public spaces. The Internet Movie Firearm Database (IMFDB) is employed in the study, focusing on weaponry such as shotguns, rifles, and revolvers across a variety of settings, including challenges such as background variation and occlusion. Based on the Faster R-CNN (Region-Based Convolutional Neural Network), specifically a refined VGG-16 architecture, the researchers proposed a gun detection system. The process involved refining gun images from the IMFDB after transferring learning weights from pre-trained models and using VGG-16 as a feature extractor. By using ImageNet-pretrained weights to reduce training time, the model was trained with MatConvNet on a single CPU. Accuracy, true positive rate (TPR), and false positive rate (FPR) were the evaluation metrics. The SVM (Support Vector Machine) classifier achieved a high accuracy of 93.1%, outperforming previous methods such as SIFT (Scale Invariant Feature Transform) and SURF (Speeded Up Robust Features), which had accuracies of 84.26% and 88.67%, respectively. The system achieves high detection accuracy, but the research discusses trade-offs in processing time that may limit its suitability for real-time surveillance without further optimization. For practical deployment, additional research is recommended to substantially reduce computing costs and enhance real-time performance. Faster R-CNN, which efficiently leverages VGG-16 with high accuracy, detects firearms by employing a support vector machine classifier pre-trained on ImageNet. The primary drawback of Faster R-CNN is its computational expense, which limits its real-time capabilities.

Idree et al. [55] discuss the limitations of police surveillance systems operated by humans, noting that this approach can lead to mistakes and missed detections due to attention fatigue. The researchers recommend using computer vision techniques to automate video surveillance, with a focus on real-time detection of objects, actions, and anomalies. These systems are evaluated using standard, pre-curated, annotated datasets split into training and test sets. The development includes features such as activity labeling (e.g., theft and assault), object identification (e.g., recognizing police officers and detecting weapons), and extended video summaries. The proposed solution is tested within a local police department and implemented with deep neural network models trained on annotated datasets. Although challenges such as distinguishing between normal and abnormal actions remain, early results show promising accuracy. Future efforts should focus on addressing deployment issues and enhancing real-time detection capabilities. The utilization of computer vision provided an efficient method for monitoring weapon activity. However, the performance metrics were neither studied nor unambiguously reported.

de Azevedo Kanehisa et al. [10] discuss the challenge of identifying firearms in public areas to enhance response times in potentially dangerous situations. The researchers employed the YOLO technique, renowned for its effectiveness in real-time object detection. They utilized a dataset compiled from the Internet Movie Firearms Database (IMFDB), which comprises over 4600 labeled images of firearms. A manually labeled bounding-box method was applied to fine-tune the model for reliable detection after training on labeled images with a 90/10 train-test split. The system demonstrated strong firearm recognition capabilities, achieving a mean average precision (mAP) of 70.72%, 95.73% sensitivity, 97.30% specificity, and 96.26% accuracy, even with partial obstruction, as indicated by experimental results. However, false positives from objects that resemble guns remain a limitation; future research aims to incorporate other imaging modalities, such as infrared, to improve resilience and diversity in detection with lower-quality photos. YOLO with Darknet and CNN provided a robust, practical solution for high-accuracy detection but showed weaknesses on low-quality images, leading to imprecision.

Dubey [1] discusses the development of a trained model capable of identifying concealed firearms. Manually examining security photos to assess gun-related risks is costly, time-consuming, and prone to human error. This study aims to determine which neural network models, such as Faster R-CNN and SSD, are most suitable for detecting guns in still images. Open-source platforms provided the dataset for this research. An accuracy-based metric may introduce bias toward larger classes, since real-world datasets often contain multiple classes with uneven distributions. The risk of misclassification must also be considered. The model achieved its lowest accuracy of 64% when the image and background had reduced contrast, but it reached 99% accuracy when contrast was high. Due to the slow detection speed—approximately 15 s per object—real-time object detection is currently not feasible. More real-time surveillance images with additional classifications, such as shotgun, rifle, and handgun, could be used to train the model. The Faster R-CNN demonstrated high accuracy and speed in weapon detection, but requires further training on real-time surveillance data to improve its adaptiveness.

Gelana et al. [2] highlight the lower danger faced by victims in active shooter situations. Their article addresses the challenge of detecting firearms in security camera footage, particularly in high-traffic public areas such as theaters and shopping centers. The dataset, which focuses solely on firearms, contains 1869 positive and 4000 negative images. The proposed method uses a TensorFlow-based Convolutional Neural Network (CNN) for weapon classification, complemented by image processing techniques such as background subtraction and Canny edge detection. A CCTV video dataset is used in the experimental setup to evaluate the algorithm, yielding a detection accuracy of 97.78%, with a sensitivity of 93.84% and a specificity of 99.73%. Although the algorithm is designed for real-time detection, it struggles with concealed weapons and can only identify visible handguns. Future work will involve expanding the dataset and refining the CNN to enhance detection speed and accuracy. A convolutional neural network (CNN) yielded accurate, real-time firearm detection. However, there were still too many false positives, which undermined the reliability of the detection process.

Vallez et al. [39] discuss reducing false positives in surveillance systems for pistol detection. The data were collected from CCTV cameras using a dataset from the University of Seville in Spain. This was conducted in controlled settings, including 871 photos and 177 firearms with annotations. Initial detection used the Faster R-CNN technique, followed by an autoencoder trained on simulated high school hallway data created with Unreal Engine 4. The autoencoder’s design incorporated reconstruction error to filter out common false positives. The results showed a precision of 77.24 and a 37.9% reduction in false positives, with no loss of detection capability. Future research will explore other detection architectures and improve the autoencoder’s thresholding. The Faster-RCNN model effectively decreased false positives; however, the general framework still needed improvement to be effective at broader weapon detection.

Warsi et al. [49] argue that, given the current state of the world, automated visual monitoring is crucial for security personnel to identify handguns. The goal of this research is to visually identify firearms in live recordings. The researchers combined their collection of pistols from various angles with the ImageNet dataset to improve results. The YOLOv3 algorithm is employed in the proposed method, while the Faster R-CNN algorithm is used for comparison to evaluate false positives and false negatives. YOLOv3 has a speed advantage over Faster R-CNN. YOLOv3 can process 45 frames per second, whereas Faster R-CNN can process eight frames per second. Two of the four videos outperformed Faster R-CNN in terms of accuracy. The precision for the proposed approach is 96.51%, and the F1 score is 75%. YOLOv3’s fast processing speeds and reasonable accuracy make it suitable for real-time applications. However, this research identifies challenges in detecting small or concealed weapons and suggests further refinement of the model to address these issues. The YOLOv3 model performed fast, reliable detection of a pistol, even under partial occlusion. Performance suffered when low-quality frames were used or when frames obscured the gun.

González et al. [52] address the challenges of detecting guns in real time through CCTV surveillance. This article introduces a new dataset derived from a university’s actual CCTV system and the creation of synthetic images. The researchers demonstrate that a Faster R-CNN object detector with FPN (Feature Pyramid Network) performs well after training on both real and synthetic images in a real CCTV setting. State-of-the-art weapon recognition was improved through a two-stage training process using a Faster R-CNN with FPN and ResNet-50, resulting in a weapon detection model suitable for quasi-real-time CCTV applications (inference time of 90 ms with an NVIDIA GeForce GTX 1080 Ti). The results showed a confidence level of 0.95 or 0.99 and an IoU of 0.50. Since the primary goal of this study is to detect hazardous items and the researchers are not interested in differentiating them, knives and other weapons were excluded, and only the class “weapon” was used to detect pistols and rifles. The Faster R-CNN model with FPN achieved high detection accuracy when evaluated on CCTV datasets. However, the detection of other types of weapons was not included in the model’s design during the evaluation.

Pang et al. [56] discuss how this work enhances security by using passive millimeter-wave (PMMW) imaging to detect hidden metallic objects on the human body in real time. Beihang University provided the dataset, which includes 1634 PMMW images showing people carrying metallic weapons in various settings. The proposed method utilizes YOLOv3 (You Only Look Once) in two versions: YOLOv3-13 and YOLOv3-53, which differ in computational load and network depth. Both models employ a one-step detection approach to identify objects at different scales. YOLOv3-53 achieves a maximum mean average precision (mAP) of 95% at 35 frames per second, while YOLOv3-13 reaches 85% mAP at 150 frames per second. Even with limited sample data, this approach is more practical and efficient for real-time detection of weapon contraband on the human body in PMMW images. Future research aims to improve accuracy and adaptability across various scenarios, including those involving thick clothing and diverse environmental conditions. The combination of YOLOv3 and SSD models demonstrated improvements in both detection effectiveness and feasibility. However, this combined approach took longer to train correctly, resulting in improved accuracy in both models.

Vílchez et al. [11] present improved evaluation accuracy and reduced bias in license exams. The study focuses on automating the detection of bullet impacts on shooting-range silhouettes. The detection model, Mask R-CNN, uses a pre-trained model trained on the COCO dataset. The proposed Bullet Impact Detection (MBID) method employs Mask R-CNN (ResNet 50 and 101) with four steps: preprocessing, impact detection, edge detection, and results evaluation. The dataset contains 600 images with 2401 bullet impacts. Mask R-CNN outperforms other methods, such as YOLOv3, SVM, and the Circular Hough Transform, in object detection, classification, and zone delineation. ResNet 50 achieves 97.6% accuracy, 99.5% precision, and 97.9% recall. Further research could enhance detection accuracy by addressing irregular impact shapes and refining dataset conditions. Mask R-CNN achieved nearly perfect precision and recall, resulting in very high accuracy. The model’s limitation was the artificially small dataset size.

Xu et al. [57] discuss the issue of low-cost, automated weapon detection in surveillance videos and developed a TensorFlow-based approach using the SSD-MobileNet model for efficient identification. They used a small weapon dataset of 1218 images, mainly from COCO, with the main categories being handgun, shotgun, and rifle. The system extracts key frames from videos based on interframe differences to reduce redundant information. It employs SSD-MobileNet, a lightweight deep neural network for feature extraction, and SSD for detection. The model achieved precision values of 0.8524 at an IoU of 0.5 and 0.7006 at an IoU of 0.75 on an Intel i3 processor paired with an AMD Radeon GPU, demonstrating its reliability across different precision metrics. The system should be reconfigurable, low-cost, and capable of energy-saving through key-frame processing, especially in low-traffic areas. The work also involves increasing training data to improve accuracy and implementing the system in dual operational modes to accommodate various usage scenarios. The study revealed that SSD with CNN enabled the detection of firearms live. However, accuracy decreased when the source video was shot at a lower resolution than the SSD video.

Aftab et al. [36] discuss the need to create a suitable plan to effectively respond to any terrorist attack on hospitals, airports, shopping centers, schools, universities, colleges, railway stations, passport offices, bus stands, dry ports, and other important private and public locations. The dataset includes images from various sources. An automated system must quickly detect and classify weapons to plan a proper response and minimize damage. This study is based on a Convolutional Neural Network (CNN) model that addresses earlier challenges. It uses datasets compiled on YOLO and is designed for single-use viewing, focusing on real-time firearm identification. The proposed system for detecting and classifying guns has undergone extensive training and testing. The researchers evaluated the system using a 70-30 split rule. Results show that our model performed well, achieving 97.01% accuracy in firearm detection. In the future, researchers plan to use advanced techniques to gather larger datasets and expand the range of categories. The study’s accuracy and F1 scores for CNN-based weapon detection were excellent; however, more data would be needed for generalizability.

Akbulut et al. [58] highlight deficiencies in existing surveillance systems for quickly detecting guns and underscore the urgent need for improved public safety measures to prevent armed threats. Focusing on handguns and rifles, the research utilizes the Coco dataset and its own dataset to categorize different types of guns. The study uses the YOLO algorithm, trains the model on labeled images, performs live object recognition, and analyzes video frames to identify weapons. Implementation is performed on a PC equipped with a GPU that supports TensorFlow and OpenCV. The speed and accuracy of the YOLO model are evaluated using live video feeds after it has been trained on labeled data. The device achieves 95% accuracy, detecting weapons more quickly and precisely than previous systems. Future research will aim to add new features, such as contextual notifications that follow the movements of recognized guns, and improve the model to reduce false alarms. This article also examines how environmental conditions affect detection accuracy. YOLO’s ease of use and accuracy for efficient, effective detection were limited by the small dataset used in the analysis.

Bhatti et al. [59] state that the central issue of interest is the real-time identification of firearms in real-life video footage to better protect against and eliminate unlawful acts. The datasets used include a small self-made dataset based on personal images and videos, YouTube videos, open-source GitHub repositories, the University of Granada dataset, and the IMFDB (Internet Movie Firearms Database). The weapons employed are diverse, including primary weapons such as pistols, secondary weapons such as small handheld firearms, and deception items such as cellular phones, metal detectors, and wallets. Some anticipated features of the proposed technique include the use of the most advanced deep learning methods for real-time weapon detection in CCTV footage. Yolov4 outperformed all other algorithms, with an F1-score of 91% and a mean average precision that was 91.73% higher than the previous record. YOLOv4 achieved some of the highest accuracy and reliability across datasets of varying sources, though it primarily focused on small weapons.

Hashmi et al. [60] focus on utilizing deep learning techniques for real-time weapon detection in video feeds from surveillance systems. The researchers generate a weapons dataset for training by selecting assets and photos from Google Photos. The researchers propose a system that uses Convolutional Neural Networks (CNNs) to process video frames and identify weapons. The researchers in this study have compared two state-of-the-art models for weapon identification: YOLOv3 and YOLOv4. The precision, recall, F1 score, and mAP of YOLOv3 and YOLOv4 are 84%, 85%, 71%, and 78%, and 77.30% and 84.85%, respectively. Afterwards, using a separate, independently created dataset of weapons, the researchers conducted a comparative study. This comparison allows researchers to analyze in greater depth and understand how small changes can lead to better outcomes. To improve weapon detection, researchers will develop a method to augment our dataset with additional photos and enhance the classification process. YOLOv3 and v4, which achieved varying degrees of improvement in accuracy and recall, were resource-intensive. A dataset with a broader reach was provided.

Qi et al. [61] present that around the world, especially in the US, gun violence is a serious issue. Deep learning techniques have been developed to identify firearms in smart IP cameras and surveillance video cameras, enabling real-time notification to security staff. The lack of large public datasets is a challenge in developing gun detection algorithms. In this study, the researchers first released a dataset containing 51K gun photos annotated for detection. They then collected 51K cropped gun images from various sources for gun categorization. The researchers introduce a gun detection system that uses a cloud server to manage devices, data, and alerts, along with a smart IP camera as an embedded edge device to reduce the false-positive rate further. Based on these findings, they employed ResNet50 as the complex classifier on the cloud server and ResNet18 as the lightweight classifier on the edge device. The accuracy rates are 97.83% and 96.97%, respectively. This edge/cloud framework enables real-world gun detection and is expected to significantly reduce the false-positive rate. The ResNet-based YOLOv3 and CenterNet are comparable in accuracy. However, this produced more false positives, limiting the measure’s viability and reliability.

Ramon et al. [62] show that one of the highest rates of violence in the world occurs in Latin America, leading to murders, robberies, firearms, and insecurity. Beyond just taking lives, homicide destroys the lives of the victims’ families and the community. It fosters a violent environment that harms institutions, the economy, and society. Due to limited data, the study was difficult and time-consuming. The images used for training were sourced from the internet through various datasets or image searches. To identify different types of firearms in public places, such as shops, ATMs, and streets, researchers used object detection. Four types of firearms were used to train the YOLO v3 and Efficient D0 models: pistols, submachine guns, shotguns, and rifles. The study’s findings indicate that YOLO v3, with an accuracy of 0.80, is the best network for detecting weapons. EfficientNet YOLOv3 performed well with tasks involving firearm detection. However, it would require a more diverse dataset to make consistent predictions.

Ruiz-Santaquiteria et al. [63] aim to improve weapon detection in surveillance recordings by building on previous methods that rely solely on visual cues, which often fail in low-light, distant, or obstructed views. The primary challenge is accurately recognizing weapons in complex scenes by combining gun features with human body positions to enhance detection precision and minimize false alarms. The dataset comprises images from various sources, including YouTube videos, publicly available firearm datasets, and synthetic images generated from video games. This study introduces a new approach (Hand Region Classifier [HRC] + Pose Data [P]) that integrates human pose information with weapon appearance into a unified system. To generate binary images of postures, key points are first predicted to locate hand regions. Then, the final bounding boxes are formed by merging outputs from different subnetworks. Since the 2D human stance is often used for action or gesture recognition, the Monash data in HRC show a precision, recall, and AP of 96.83%, 20.33%, and 24.72%, respectively. Meanwhile, HRC + P achieves 34.68% (AP), 33.67% (Recall), and 90.18% (Precision). Sometimes, objects are hard to see due to distance, poor lighting, or partial/full obstruction. In these cases, human body posture helps identify firearms that might otherwise go unnoticed. False positives in other image regions can be filtered out because pose data are used only to classify hand regions of detected individuals. Finally, everyday handheld items like wallets, keys, and cell phones can lead to misclassifications or false positives in real-world applications. Further research will address these issues in more detail. The object detectors that utilized deep learning and hand region and position data achieved better results in terms of weapon detection, specifically. Still, because of the low recall, adding false positives did not increase its robustness.

Salido et al. [50] propose a study on the difficulty of handgun detection in video surveillance images to prevent violent incidents. The dataset they used consists of annotated video surveillance images featuring handguns and includes 6800 images for training. The focus was specifically on handguns. The techniques tested involved three CNN-based methods: RetinaNet, YOLOv3, and Faster R-CNN. They applied a sequence of CNN-based methods to extract features from handguns and used pose information to reduce false positives—the experimental setup involved training and testing three models, with YOLOv3 being the fastest. RetinaNet achieved the highest performance, with an average precision of 96.36% and a recall of 97.23%. Advantages observed included high accuracy with fewer false positives. Limitations noted included lower-quality images, which may not allow detection of small, occluded handguns. Future work recommends deploying the system in real time across more complex environments. The YOLO (You Only Look Once)-based model achieved real-time, low-cost handgun detection with high recall; however, it struggled in crowded scenes or when the handgun was partially occluded.

Ahmed et al. [64] address the problem of detecting real-time weapons in surveillance footage, a vital aspect of ensuring public safety. They created datasets (Custom Weapons) comprising 8327 images across diverse backgrounds and angles to enhance detection robustness, with a primary focus on handheld weapons, including pistols, revolvers, and rifles. They propose a model based on Scaled-YOLOv4, optimized with TensorRT, and deployed on high-performance GPUs, such as the RTX 2080 Ti, as well as on edge devices, including the Jetson Nano. During preprocessing, they employed mosaic augmentation to enhance the detection of small objects and improve accuracy. The model achieved an mAP of 92.1% and an FPS of 85.7 on an RTX 2080 TI, demonstrating performance suitable for real-time applications. The advantages of this approach include high accuracy, efficient real-time performance, edge deployment capability, reduced latency, and enhanced privacy. Future work is encouraged to address challenges related to false positives and model optimization, especially for low-power devices. Scaled-YOLOv4 demonstrated good detection in both open and crowded scenes; however, there is still room for improvement, particularly with low-resolution video input.

Ashraf et al. [65] addressed the problem of minimizing false positives and false negatives in the detection of weapons in video surveillance systems. The dataset (an open-source pistol dataset from the University of Granada and the Internet Movie Firearms Database (IMFDB)) consisted of 15,000 images of weapons, including rifles and handguns. The preprocessing techniques included a Gaussian blur to remove background noise. The researchers suggested using the YOLO-V5S algorithm combined with a CNN to improve detection speed and accuracy. The process involved preprocessing the dataset, training YOLO-V5S on resized images to 416 × 416, and applying a Gaussian blur to reduce background distractions. The experimental setup compared the YOLO-V5 model with other models, such as Faster R-CNN, using metrics like precision, recall, and F1 score. The model achieved 99.5% precision and 84.6% recall, with frame processing completed in 0.011 s, making this solution both fast and accurate. The researchers also identified several areas for improvement, such as handling customized weapons that deviate from the typical pistol appearance and employing tested methods to distinguish between objects of similar size and shape, which are significant challenges in weapon detection. Moving forward, they plan to enhance the model by incorporating additional preprocessing techniques, such as brightness control. To reduce false positives and false negatives, they also aim to improve the training set by including videos of moving pistols, custom graphics, and adjusting colors and contrast to increase visibility. YOLOv5 and Faster R-CNN achieved similar accuracy; however, both struggled with images with low brightness and contrast.

Jadhav et al. [66] emphasize the need for better technologies due to the rise in armed robberies, school shootings, and terrorism. Therefore, this project aims to develop a system that automatically detects weapons in CCTV footage, reducing the number of people required for monitoring and lowering potential risks. The research uses two datasets: one from Kaggle and one from Open Images, each containing 5687 images of weapons (primarily guns, handguns, revolvers, etc.) and images depicting police and judicial services in Pakistan. Several objects are included within the images to minimize false positives caused by confounding objects such as handguns. The study examines the feasibility of two deep learning networks renowned for their effectiveness in object detection: YOLOv4 and Scaled YOLOv4. The process involves image annotation with bounding boxes and the creation of models for weapon recognition using labeled datasets. The performance of each model was evaluated on images and videos using metrics like mAP, precision, and F1 score. Experiments were conducted in a Google Colab environment with GPU acceleration. The models were trained on both the Kaggle (mAP: 100%) and Open Images (mAP: 71.58%) datasets, with object annotations performed using YOLO Label. YOLOv4 achieved an average precision of 86.19%, a loss error of 0.1933, a precision of 79%, and an F1 score of 77%, slightly outperforming Scaled-YOLOv4. Future research aims to further reduce false positives and negatives. The YOLOv4 with CSPDarkNet53 achieved outstanding results in terms of mean Average Precision (mAP) and general accuracy across various datasets. However, the number of false positives and false negatives remained high across all datasets.

Kiran et al. [67] propose that anomalies are identified and corrected using computer vision. Video surveillance systems capable of recognizing scenes and unusual events are essential for meeting the growing demands for safety, confidentiality, and protection of personal property. Monitoring such activities helps reduce crime and social offenses by identifying disruptive behavior. The approach uses publicly available videos from various sources. The core of the proposed solution involves applying specific AI-based algorithms for weapon detection to enhance existing conventional methods. For weapon detection, the system uses SSD and Faster R-CNN, both based on convolutional neural networks (CNNs). To improve accuracy, the Faster R-CNN was trained with pre-labeled video datasets. Both methods are effective and produce better results, with Faster R-CNN achieving 84.6% accuracy. Faster R-CNN demonstrated higher accuracy. The researchers do not evaluate any other detection methods in this approach. The RCNN achieved good results with accelerated accuracy; however, it required additional training and experimentation with various models to achieve improvements.

Manikandan et al. [68] describe that, for security reasons, closed-circuit television (CCTV) cameras are installed and monitored in both public and private areas. In home and commercial security, image and video footage are utilized for object detection, identification verification, and rapid response. Different classifications are necessary to detect objects and humans based on features observed in static and motion footage. The collection includes various images and videos of dangerous objects (weapons) observed in CCTV footage. This article introduces an Attuned Object Detection Scheme (AODS) for identifying harmful objects from CCTV inputs. The proposed scheme uses a convolutional neural network (CNN) for object detection and classification. The classification is performed based on features extracted and analyzed by CNN. The method’s performance is validated using metrics such as accuracy, precision, and F1-score. In this approach, training with an external dataset reduced error and complexity by 7.47% and 8.23%, respectively, and increased accuracy by 8.08%. It is expected that future object classification will utilize labels. The CNN-AODS model produced good results in distinguishing the presence of firearms. It was not versatile enough to detect other objects, and the presence of different objects was observed.

Mishra et al. [13] present this study, which addresses the critical issue of firearm detection in public spaces—a significant concern given the rising gun violence globally. The dataset (open source datasets) used for training and testing includes both real and synthetic images of firearms; it contains over 1000 images compiled from various sources, including those generated with 3D techniques. The proposed method is based on the YOLO object detection model, known for its high speed and accuracy in real-time applications. Among the available versions, YOLOv4 was selected to strike a balance between processing speed and detection accuracy. It can process up to 24 frames per second, meeting the needs of real-time video surveillance. The model was trained on a single GPU with images annotated with bounding boxes for firearm localization. YOLOv4’s architecture features a PANet path-aggregation neck, a YOLOv3 head, and a CSPDarknet53 core. Implementing this new architecture in the backbone and modifying the neck increased the frames per second (FPS) by 12% and the mean average precision (mAP) by 10%. Furthermore, training this neural network on a single GPU simplifies the process. Its fast processing speed enables efficient operation on live feeds from CCTV cameras. However, limitations include developing detection capabilities for other types of weapons and improving performance in low-visibility conditions. YOLOv3 achieved higher Global mAP and FPS scores. However, it limits performance in low-light or occluded scenes. Therefore, it would be restricted for use in real-time, live surveillance of people and traffic.

Rasheed et al. [69] address the growing problem of thefts in the banking and retail sectors, focusing on the use of artificial intelligence to detect and alert to weapons. The researchers provide a dataset of 7801 images categorized as handguns or rifles, sourced from Google Images, movies, and CCTV footage. A real-time object detection algorithm, “You Only Look Once” (YOLOv5), is proposed to locate guns and rifles. When video frames are captured, YOLOv5 runs to detect objects and sends alerts to local law enforcement. The system works on both mobile and web platforms, using a Raspberry Pi or Jetson Nano for real-time tracking. It is trained on Google Colab for 100 epochs with a batch size of 16. The system connects to a NodeMCU for internet notifications and to a GSM module for communication. After 3 h of training, it achieved a mean average precision (mAP) of 87.8%, with an accuracy of 88% and a recall of 83%. The YOLOv5s version, built with the YOLOv5s modifier, outperformed YOLOv4 and YOLOv3 in terms of faster inference and improved accuracy. However, the system currently relies on visible cameras, which could be improved by integrating firearms detection. High-quality graphics are also desirable, but this could be achieved by running the model on a GPU. While the YOLOv5 model demonstrated strong capabilities in addiction classification, recent versions may offer even better performance.

Al-Mousa et al. [38] propose that the rate of gun violence has rapidly increased in recent years. The majority of modern security systems depend on human workers to continuously patrol hallways and lobbies. Future closed-circuit television (CCTV) and security systems should be able to identify threats and take appropriate action when necessary, thanks to advances in machine learning, particularly deep learning. The most crucial component of developing any deep learning solution is data. Applications that rely on photos, like these, require large amounts of data. Princess Sumaya University for Technology’s Innovation Lab is where data collection takes place (PSUT). The security system architecture presented in this work leverages deep learning and image processing to detect weapons in real time. To identify individuals carrying different types of weapons, the system processes a video feed and occasionally captures images of them. These images are fed into a convolutional neural network (CNN), which then determines whether or not the image poses a threat. If it does, the system notifies security personnel via a mobile application and provides a picture of the situation. A Raspberry Pi B+ (with an ARM Cortex-A53 1.4 GHz processor, Wi-Fi connectivity, and 1 GB of RAM) is used, placed on a Table 1. 8 m high, along with a connected camera. The camera is approximately 4.5 m from the lab entrance. To monitor the process, the Raspberry Pi is connected to a display. The system achieved 92.5% accuracy during testing and completed detection in only 1.6 s. Other weapon types, such as knives, rifles, and semi-automatic guns, can be added to the dataset. The CNN model is a fast and reasonably accurate firearm classification algorithm; it would require more data to provide reasonable confidence.

Chatterjee et al. [46] state that any violence is a shame to our civilized society. However, violence still plays a significant role in our society today and claims many innocent lives every day. Using a gun is one of the most traditional forms of violence. Nowadays, firearm-related fatalities are a global issue. It poses a significant challenge to law enforcement and poses a threat to civilization. The 3698-picture dataset, containing 4703 annotated objects, has been created for the detection of firearms and human faces. The primary images were sourced from the WIDER FACE dataset and the Internet Movie Firearms Database. To detect firearms and human faces, this research employs a range of detection methods, including the latest EfficientDet-based architectures and Faster Region-Based Convolutional Neural Networks (Faster R-CNN). At the post-processing stage, an ensemble approach that utilizes Weighted Box Fusion, Non-Maximum Suppression, and Non-Maximum Weighted procedures has enhanced detection performance in distinguishing between weapons and human faces. The mean average precision scores for mAP0.5, mAP0.75, and mAP [0.500.95] are 77.02%, 16.40%, and 29.73%, respectively, based on the Weighted Box Fusion-based Ensemble Detection Scheme. Among all tested options, these results show the best performance. The work may eventually be adapted for real-time testing, allowing research on its scalability and reliability in real-world applications. The Faster R-CNN model produced good mAP results on the firearm datasets; however, it did not test any in live scenarios.

Doan et al. [70] present a study on the rise in crime rates driven by heated weapons. Security systems must detect potentially violent situations early. The experimental dataset comprises 6420 224 × 224 photos collected from public sources using the Pistol Detection Dataset, Pistol Classification Dataset, and Sohas Weapon Dataset. These photos feature a diverse range of pistol perspectives and sizes. This research aims to apply YOLOv5, v7, and v8 models to enhance the accuracy and diversity of pistol detection in surveillance cameras. The models YOLOv5-n, YOLOv5-m, YOLOv5-L, YOLOv7-X, YOLOv7-W6, YOLOv7-E6, YOLOv8-l, and YOLOv8-x performed exceptionally well in the hot weapon identification challenge, with the YOLOv8-x model achieving the highest accuracy of 95.6%. Based on the comparison results, the best model is YOLOv7-E6. The testing results show that these YOLO models can be integrated with the latest versions to improve security surveillance systems, enabling early warnings and proactive security measures in critical situations. The YOLOv5–v8 models demonstrated substantial precision and recall, though results fluctuated across devices and environments.

Khalid et al. [51] focus on identifying and segmenting guns in surveillance videos. The system’s performance was evaluated using a publicly available dataset for weapon detection, which includes images of handguns and other firearms in different contexts. For weapon detection, they utilized a deep learning model called YOLOv5. The detected weapons included handguns, pistols, and revolvers, featuring a variety of grip styles. YOLOv5 is the latest and fastest version of the YOLO family, commonly used in real-time applications. The specific YOLOv5 model used in this study was based on a conventional deep learning framework, with hyperparameters tuned through experimentation. Training was performed on a high-performance GPU to reduce computational load. The system achieved precision of 63%, recall of 25%, and accuracy of 43%. However, it may still struggle to identify partly hidden firearms or very small ones. Enhancing the model’s resistance to occlusions and improving its performance in small-object detection would increase its usability in real-world scenarios. The YOLOv5 image detector achieved high accuracy and robustness; however, it struggled to identify smaller firearms accurately.

Khan et al. [71] identify the primary concern of this research: the increasing demand for weapon-detection systems capable of recognizing firearms, with a focus on pistols in video feeds to enhance public safety and security. The dataset used in this study is the Custom Pistol Dataset. The weapons are pistols. The algorithm or technique employed is YOLOv5 (You Only Look Once version 5). YOLOv5 is a deep learning architecture for real-time object detection in the YOLO series. It was tested on standard computational platforms suitable for processing real-time video. In a live video stream, the YOLOv5 model was trained to detect pistols with an accuracy of approximately 95%. The system could also detect other items, such as rifles, knives, and other firearms or dangerous objects, expanding its functionality. The YOLOv5 model for detecting pistols was highly sophisticated and performed well, but it should be extended to detect other weapons as well.

Nale et al. [14] propose that the demand for computer vision-based automated surveillance has grown due to the increasing use of closed-circuit television (CCTV) systems in modern security applications. The primary goal is to minimize human intervention while improving early threat detection and real-time security assessments. In the absence of a pre-existing dataset for real-time detection, the researchers assembled one from various sources, such as Roboflow Computer Vision Datasets, YouTube CCTV recordings, film data, camera images, and online photographs. The proposed weapon detection method prioritizes accuracy and recall in object identification, particularly in challenging conditions such as low-light environments, by combining Detectron2 and YOLOv7. With the YOLOv7 model trained on a new database, object detection algorithms that used Region of Interest (ROI) performed better than those that did not, yielding notable results. It outperformed earlier real-time studies with a mean average precision (mAP) of 87.3%, an F1 score of 91%, and a confidence level of nearly 98%. The research aims to enhance security while also attracting security-conscious tourists and investors, thereby generating economic benefits. Future research will focus on further reducing false positives and negatives, potentially extending to additional classes. The YOLOv7 improved the robustness of the image detection system and enhanced security; however, it still struggles with false positives and negatives.

Nanda et al. [72] focus on developing an effective video forensics system capable of detecting weapons, disguises, and suspicious activities within video footage. The dataset used contains 1346 images for gun detection and 1043 images for mask detection, collected from Kaggle, GitHub, and other sources. The main features identified are guns and facial masks, which stand out from other devices on the system and help detect any immoral or abnormal behavior. The study employs a custom CNN architecture, combined with the YOLO (You Only Look Once) approach, for object detection. The system provides frame labels for the detected objects: “M” indicates a mask, “NM” indicates no mask, and “G” indicates a gun. The system was run on a computer with an Intel Core i7 processor and an NVIDIA GeForce RTX 2060 graphics card. The YOLO model detected masks with an accuracy of 92.3%, while gun detection accuracy was 100%, surpassing the customized CNN model by 61.5%. The YOLO model outperformed previous methods, especially in real-time applications, achieving higher speed and accuracy. Future improvements could include increasing the sample size to broaden the range of suspicious activities the model can detect and optimizing the CNN further to enhance detection accuracy. The YOLO network achieved excellent accuracy and efficiency in detecting guns; however, the customized CNN performed poorly, and the datasets were too small.

Naseeba et al. [73] present the application of computer vision (CV) and deep learning (DL) to enhance weapon detection in video surveillance. The fact that the model was trained on a large dataset (Self-Created Image) containing still images and videos of a wide variety of weapons ensures its adaptability and resilience. In this study, researchers use VGG16 and Faster RCNN to automate firearm detection. The method uses manually annotated image datasets. This is surpassed by Faster RCNN, which achieves 0.736 s/frame. VGG16’s frame rate is 1.606 frames per second, by comparison. Faster R-CNN is more accurate, reaching 94.8% compared to VGG16’s 92.5%. The former can achieve live detection due to its higher speed and accuracy. Additionally, it can be trained with larger datasets using GPUs. This approach does not detect various types of weapons or large datasets. Both VGG-16 and Faster R-CNN achieved good accuracy and speed, but they did not generalize beyond specific categories, such as weapon types.

Pullakandam et al. [74] discussed real-time weapon detection using YOLOv8, optimized for faster, lighter deployment in resource-constrained environments via quantization. Approximately 3000 images, curated from various online sources, were included in the dataset and labeled by type, including guns and knives. The architecture of YOLOv8 features a new anchor-free detection head that improves its precision and flexibility across different object sizes. It achieved reduced model size and faster inference by using a quantization technique. In this regard, YOLOv8 reached 90.1% mAP and reduced processing time by 15%. The experimental results show that it outperforms earlier versions, such as YOLOv5, across most metrics, particularly in terms of computational efficiency, making it well-suited for mobile and edge devices. Limitations, such as sensitivity to low-resolution images, were noted, indicating the need for further work to optimize the quantization model for varying image qualities. While the results for detecting guns in YOLOv5 and YOLOv8 achieved high accuracy, the work presented could benefit from further statistical optimization to improve efficiency.

Rahil et al. [12] state that it is now crucial to incorporate advanced automatic pistol detection systems into public surveillance cameras to enhance their effectiveness, given the widespread issue of gun violence. The researchers collected and generated their own dataset (COCO) of images featuring various firearms. To improve the model’s performance across multiple environments and reduce its reliance on human security personnel, a carefully curated dataset of diverse firearm images was utilized. This work uses the advanced YOLO-V5 algorithm to provide a comprehensive approach for real-time firearm detection in security footage. The results demonstrate that, even in challenging, complex scenarios, the YOLO-V5 model achieves high accuracy (95%) and recall (92%) in detecting handguns. This research catalyzes further investigation in this exciting area and marks a significant step toward the development of efficient firearm detection systems. This approach addresses an urgent need for improved public safety by automating firearm identification in real-time surveillance footage, offering valuable insights into the capabilities of intelligent surveillance technology. While YOLOv5 produced fast, accurate detection results, the framework required more comprehensive data to achieve scalable, generalizable results.

Raza et al. [53] address the problem of detecting gunshots to prevent crimes using sound recognition. The dataset consists of 851 audio clips of gunshots from eight gun models, collected from public YouTube videos. The proposed technique involves a novel Discrete Wavelet Transform Random Forest Probabilistic (DWT-RFP) approach for feature extraction and a meta-learning model, Meta-RF-KN (MRK), for classification. The workflow includes extracting Mel-Frequency Cepstral Coefficients (MFCC) features, combining them with probabilistic features, and applying the MRK model for gunshot detection. The experimental setup employed an 80:20 train-test split and was executed in a Google Colab environment with GPU support. The proposed model achieved 99% accuracy, outperforming other state-of-the-art methods. Future improvements include reducing computational complexity and addressing real-world noise and issues related to simultaneous gunfire. The Discrete Wavelet Transform Random Forest Probabilistic (DWT-RFP) detection model demonstrated high accuracy in detecting gunshots, improving on-scene safety; however, background noise and computational cost were significant issues.

Ruiz-Santaquiteria et al. [40] discuss improving handgun detection in CCTV video recordings by addressing issues of over- and under-detection, especially in challenging monitoring situations. They use a hybrid dataset that combines images from the Guns Movies Database, YouTube clips, synthetic video game data, and the COCO dataset, with a focus on handguns. Their proposed approach integrates visual cues and body pose information via a dual-branch framework that targets hand regions and body pose key points to enhance detection accuracy. The model is developed in TensorFlow using several CNN and transformer-based architectures, such as Darknet-53, ViT, and SEIT, along with additional filtering to reduce false positives. The model achieved excellent results across various complex datasets, with an average precision of 91.73% and exceptional performance on challenging data. Future work aims to reduce dependence on pose estimation accuracy, regardless of camera placement, and to incorporate spatiotemporal data from actual videos to further mitigate false positives. The Vision Transformer, using a CNN architecture, also achieved substantial average precision (AP); however, it was less robust due to high computational requirements and reliance on computationally intensive human pose estimation.

Shah et al. [75] focus on the challenge of real-time pistol detection to help combat firearm-related crimes in public spaces. Manual CCTV monitoring is a common practice nowadays; unfortunately, most efforts fail due to human error, especially when observation must be maintained for extended periods. This research explains the implementation of the YOLO algorithm within an automated system to reliably and efficiently detect pistols. The dataset (COCO) includes 3000 images of pistols captured under various lighting conditions, quality levels, and camera angles that reflect real-world scenarios. These images were annotated and split into training and validation sets. The study compares the performance of three YOLO versions—YOLOv3, YOLOv4, and YOLOv5—in terms of speed and accuracy. Additionally, by reviewing prior research and analyzing implementation results, a comparative assessment of these models’ performance was conducted. Evaluation metrics focused on precision and recall, which are more relevant than accuracy in object detection. With higher F1 scores and mean average precision, YOLOv5 (mAP50 of 98.60 and F1 score of 98%) outperformed earlier versions. YOLOv5 offers a lightweight, fast, and accurate model. Although this system provides excellent detection capabilities, it currently only detects handguns. Future work should extend the system to recognize all types of firearms and sharp pointed objects. The YOLOv5 method demonstrated exceptional detection precision and speed; however, it was limited to detecting only specific types of guns.

Sumi et al. [76] discuss concerns about excessive violence, misconduct, and weapon use captured by open security cameras when using YOLOv5 for weapon detection with Information Expansion. The datasets include various weapons, such as rifles, handguns, and blades. These datasets (Custom Dataset, Utility Synthetic Dataset, and Mock Attack Dataset) are used to evaluate the effectiveness of the YOLOv5-based weapon detection system by combining different data augmentation strategies to assess its viability. The method proposed in the article relies on the YOLOv5 (You Only Look Once) algorithm for identifying weapons. The YOLOv5-based weapon detection system was trained using an experimental setup on an NVIDIA GeForce RTX 2080 Ti GPU and CUDA 11. Given the datasets, the proposed weapon detection framework using YOLOv5 achieved promising results, attaining 97% mAP with real data and 93% mAP with synthetic data. Future work should focus on expanding evaluation metrics based on findings, suggesting room for improvement despite the high accuracy achieved with YOLOv5 and data augmentation. Overall, the YOLOv5-based framework demonstrated high precision and a more robust backend; however, it still needed improvement in accuracy.

Wang et al. [17] address real-time weapon detection in CCTV footage, focusing on gun identification, and find that small objects, such as handguns and rifles, are challenging to detect when heavily cluttered in scenes that change rapidly under varying lighting conditions. They utilize a synthetic dataset generated using the Unity Game Engine, combined with real-world CCTV footage, and augment images from the OpenImg dataset, resulting in a total of 2582 images of handguns and rifles. In this research, an improved YOLO v4 (SCSP-ResNet-based + receptive field boost) is proposed, and the F-PaNet is introduced to enhance small-object detection by expanding spatial information and fusing multi-scale features. Using transfer learning, the method was trained on both synthetic and real-world datasets, achieving a mean Average Precision of 81.75%—an improvement of 7.37% over the baseline—with a 4.2% reduction in inference time when implemented on the Darknet platform. In the future, researchers plan to experiment with more complex synthetic datasets and explore their scalability beyond image classification to other computer vision tasks such as image segmentation and object tracking. SCSP-ResNet YOLOv4 demonstrated robust performance in small-object detection across complex environments. However, the dataset complexity and background variation required improvement.

Arora et al. [77] discuss the detection of guns and other weapons, a challenging task with many applications in law enforcement, safety, and monitoring. A custom dataset of 9000 images of firearms and heavy weapons, carefully collected from various sources, including the web, public-domain datasets, and private collections, is used to train the YOLOv8 model. The goal is to develop a YOLOv8-based object detection system. The system performed exceptionally well on multiple datasets of gun and weapon photos, demonstrating high accuracy. Its ability to reliably recognize weapons is shown by its effectiveness in real-time applications and an impressive average precision of 90.5%. To keep the system adequate and relevant in the challenge of weapon detection, future research should focus on enhancing its capabilities through dataset expansion, algorithm improvements, and realistic deployment scenarios. The YOLOv8 framework successfully identified firearms from customized datasets. Opportunities for improvement were identified through database enrichment and algorithm enhancements.

Flores et al. [15] discuss automated, real-time weapon-detection systems enabled by recent advances in deep learning. These systems primarily use body-pose key-point extraction and object-detection frameworks to extract hand-region data as spatial features and to identify body posture. However, the effectiveness of these methods is limited by their inability to contextualize stance and spatial elements. The study uses the Mobile Guns Dataset and the Monash Guns Dataset, which together comprise 155 videos totaling 6558 video frames. The proposed method starts by detecting people’s poses in a video frame using OpenPose to locate hand areas. After cropping the hand region, spatial features are extracted using DarkNet-53. A binary image of a human stance is generated by placing human pose key points on an image. A convolutional neural network subsequently processes this image to extract features related to human pose. Results clearly show that incorporating temporal information enhances the model’s performance in average accuracy, recall, precision, and F1 scores. Future work may explore other temporal representations, such as using transformers or incorporating additional input features, to enhance the model’s capabilities further. OpenPose and DarkNet53 achieved good temporal performance in firearm detection. Future studies may wish to assess different representations of temporal data.

More et al. [16] address the problem of real-time detection of violence and weapons in video streams to enhance public safety. It utilizes a combination of datasets, including the Hockey Fights Dataset and a composite dataset for violence detection, as well as a custom dataset for weapon detection that encompasses various types of weapons, such as guns, shotguns, and rifles. The proposed technique combines Convolutional Neural Networks (CNN) with Bidirectional Long Short-Term Memory (Bi-LSTM) for violence detection and YOLOv8/YOLOv9 for weapon detection. The system processes video frames, extracts spatial and temporal features for violence classification, and identifies weapons using the YOLO algorithm. The experimental setup involves training models on 30-frame sequences with 100 × 100 pixel dimensions for violence detection and YOLOv8/YOLOv9 for weapon identification. The model achieved a 99.85% training accuracy and 98.19% validation accuracy for violence detection, while YOLOv8 achieved a mean Average Precision (mAP) of 0.805 for weapon detection. Future work suggests optimizing the system to reduce computational time while enhancing real-time detection capabilities. Further, CNN-BiLSTM with YOLOv8 or YOLOv9 improved firearm detection accuracy and security. However, other weapons were not included in the study.

Nadeem et al. [78] propose that the first research question addresses a key topic: the acquisition and handling of real data for violence detection in computer vision applications. The dataset used is WVD, which includes various types of weapons and violence scenarios, all developed from GTA V. The Weapon Violence Dataset (WVD) employs optical flow, specifically the Dense Gunning–Farnebäck technique. In terms of study design, the experimental setup ensures the systematic and accurate creation and use of synthetic data for violence detection, while also recognizing the strengths and limitations of using such data. Hot Violence Detection: The researchers achieved 87% accuracy in detecting hot violence scenes involving firearms and explosives using models trained on the WVD dataset. The article’s future work and drawbacks suggest that models should be validated with real data to ensure proper functioning in real-world environments. The WVD’s initial detection capability was scalable. However, issues with authenticity and randomness reduced the dataset’s reliability.

Valliappan et al. [18] present research on audio-based gun detection, particularly useful when visual data are unavailable. The researchers introduce a dataset using YAMNet, a deep learning model for audio classification, comprising 1174 audio samples from 12 types of firearms, collected from sources such as the Gunshot Audio Dataset and the Gunshot Audio Forensics Dataset. In this work, Mel spectrograms were used to transform audio signals into a format that YAMNet could analyze visually. Transfer learning was applied to YAMNet, fine-tuning the model to identify gun types with 94.96% accuracy. Its high accuracy and robustness to noisy audio inputs have made it valuable for forensics and real-time applications. This model uses transfer learning to reduce reliance on large amounts of labeled audio data, thereby improving efficiency. This work highlights potential future improvements, such as enhancing accuracy by expanding audio datasets and exploring more optimized model architectures for deployment on embedded audio devices in real-world scenarios. Audio-visual cues enhanced classification accuracy when both YAMNet and InceptionV3 were utilized. However, using audio data would require more samples and greater diversity.

Yadav et al. [79] present the specific research question in this investigation: the identification of weapons in CCTV videos under low-light conditions at night. The dataset used in this work is the Custom Nighttime Weapon Dataset, obtained from the internet. It consists of 15,367 images captured in low-light environments or at night. The weapons include pistols and handguns. The use of YOLOv7 aims to combine a deep learning approach with a brightness enhancement feature to improve weapon detection at night. This approach addresses gaps in the literature, particularly in night-time weapon detection, by combining deep learning methods with image enhancement algorithms to improve system performance. The improvements made with YOLOv7-DarkVision over the standard YOLOv7 resulted in approximately a 10% increase in accuracy at night, achieving a precision score of 95.50% and an F1-score of 93.41%. However, the proposed YOLOv7-DarkVision model still faces limitations when deployed in extremely low-light conditions. YOLOv7 achieved a good overall precision and firm performance in low-light and nighttime conditions. But it still had some limitations in some low-lit environments.

You et al. [80] argue that, for military decision-making and assessing firepower threats, fine-grained detection of military targets is essential because it can provide more precise and comprehensive battlefield situational data. To address the issues of imprecise categorization and low detection accuracy in fine-grained troop target detection, the researchers collected data by extracting images from the Internet, war films, and other sources. They used the Label Image tool to annotate the targets and developed a fine-grained detection dataset of army targets. To improve detection accuracy and reduce misclassification of army targets across various weapons, they propose a fine-grained detection method based on the YOLOv8-AD network. This approach accurately identifies and classifies soldiers armed with different weapons, including firearms and rocket launchers. The Soldier Target dataset achieves an mAP50 of 79.6% with YOLOv8-AD, as indicated by comparative experimental results. The method has an average detection time of 11.2 ms per image, with detection accuracy and recall rates of 75.7% and 74%, respectively, on 1333 random untrained images. It offers a novel approach to automatic weapons and security monitoring, enabling fine-grained detection of military targets. By incorporating temporal data and balancing sample distribution, the researchers aim to mitigate training difficulties in future research and enhance real-time detection performance and efficiency. YOLOv8-AD provided a high level of secure, efficient weapon monitoring. However, the complexity of training makes it difficult to optimize fully. Table 3 summarizes the earliest survey utilizing Gun Detection.

Table 3. Summary Table—Gun Detection.

5. AI-Enhanced Smart City Surveillance—Weapon Detection: Knife

In recent years, many crimes involving various weapons, including swords, knives, and others, have been reported in both homes and public places. CCTV cameras are installed in public areas to monitor and help reduce these types of crimes. Security personnel typically monitor video footage from these cameras, and video surveillance systems that can identify and analyze scenes and unusual activities are crucial for intelligence monitoring, especially amid the growing demand for safety, security, and protection of personal property. Figure 9 illustrates the process of detecting sharp cutting weapons using artificial intelligence or related AI techniques.

Figure 9. Knife Detection Workflow Diagram Using Artificial Intelligence.

Buckchash et al. [19] discuss the challenges of knife detection due to the wide variety of blade shapes, textures, and sizes. Implementing an automatic weapon (knife) detection system can leverage the widespread use of surveillance cameras to help reduce crime. The researchers compiled an internal dataset of over 1500 images, including both positive and negative samples, along with a few online attack videos, since no common dataset exists for face-off footage. This research introduces a novel object detection method for visual knife detection in video data. The proposed approach consists of three steps: foreground segmentation, Multi-Resolution Analysis (MRA) for classification and target confirmation, and conspicuous feature recognition based on the Accelerated Segment Test (FAST) for image localization. The performance of this method is compared to traditional object localization techniques, with a frame-by-frame analysis showing over 70% overlap and proper localization of the knife. Although the current framework achieves a detection speed of four frames per second, this could be improved by reducing the number of extracted patches. The researchers plan to extensively test the framework in various real-world scenarios in the future. The FAST algorithm demonstrated adequate performance in terms of accuracy and F1-scores, enabling acceptable levels of feature extraction and detection. Further testing is necessary to assess the robustness of this approach in other circumstances.

Castillo et al. [20] address the problem of identifying cold-steel weapons in surveillance footage, particularly when brightness fluctuations occur. To train the detection model effectively, a new dataset must be designed, which requires time and effort. The study proposes a deep learning model with a brightness-guided preprocessing strategy, called DaCoLT (Darkening and Contrast at Learning and Test stages), to enhance identification accuracy across various lighting conditions using a bespoke dataset comprising a diverse range of cold steel weapons. A convolutional neural network (R-FCN with ResNet101) was used to build the model, which was then trained and evaluated across a range of illumination conditions. An F1 score of 84.15% was obtained through experimentation, with Recall and Precision at 72.65% and 100%, respectively. This results in a new brightness-based automatic cold steel weapon detection model for video surveillance. The researchers will discuss the challenging task of identifying weapons in outdoor settings, where moving background objects and unfavorable weather conditions complicate detection. The CNN with the DaCoLT model achieved perfect precision but only reasonable detection performance in the automatic weapon use case; however, this capability decreased when tested outdoors, where lighting and background conditions varied.

Guo et al. [24] investigate the automatic detection of knives in X-ray security scans to enhance the performance of security inspections. The dataset developed, SDCK2019, comprises 5672 X-ray images of seven types of knives, including kitchen knives and daggers. The contribution of this work to improving the SSD (Single Shot Detector) model lies in replacing the VGG16 backbone with ResNet101. This improves feature extraction, especially for small objects. Consequently, detection is enhanced by an SSD model with feature fusion, which integrates more features and boosts knife detection accuracy. Experiments conducted on SDCK2019 showed an average Precision of 91.3%. Key strengths of the model include its robustness for detecting small targets and its ability to reduce false negatives. However, a weakness is the high computational requirements. Future improvements could involve further optimizing the model to enhance real-time performance. The improved SSD with FPN demonstrated reasonable detection of small objects in complex X-ray images; however, it struggled to separate visually similar objects under different lighting conditions.

Noever et al. [21] address the issue, although assessing threat intent via knife detection in surveillance videos is challenging, given the various forms and contexts in which knives appear. For this reason, the researchers share their training procedures using the private knife dataset, which includes diverse images of blades at different angles and with different silhouettes. They propose three methods: classification with MobileNet, hand segmentation from knives using MaskRCNN, and analysis of hand positions with PoseNet to identify threatening poses. When applied to affordable devices, these models achieved over 95% classification accuracy and more than 90% confidence in simultaneously detecting hand and knife positions. Future work aims to reduce high false-negative and false-positive rates by expanding datasets and incorporating new sensors, such as X-ray or thermal imaging, better to handle complex environments, including schools and public events. The integration of MobileNet, PoseNet, and Mask R-CNN achieved high performance through a multi-viewpoint approach. However, false positives persisted in the system, affecting reliability.

Amador-Salgado et al. [25] address the problem of detecting knives in indoor environments using surveillance cameras. The dataset used includes videos captured with a 1 MP CCTV camera under both white and infrared lighting conditions. The proposed technique combines color analysis and invariant moment calculations to detect knives, utilizing OpenCV and MATLAB for implementation. The experimental setup involves a commercial CCTV system, with video frames processed to extract key knife features. The system achieved new security solutions and a 1.192% error rate. Future improvements could focus on expanding detection to include objects similar to knives and enhancing the system’s performance under various lighting conditions. The infrared-based detection strategy was simple and easy to implement, yielding image data with ease; however, its performance was not thoroughly evaluated, as it was justifiable only through impressive, measurable results.

Babu et al. [22] address the crucial issue in this conceptual work: the lack of an optimal, particularly automated method for identifying and categorizing weapons in video sequences. The dataset downloads include COCO (Common Objects in Context), a large dataset with object annotations in the text; the Open Images Dataset, a rich dataset with extensive content; and a custom Weapon Dataset, which marks images specifically. These include firearms, knives, swords, and cutters. The proposed technique uses CNNs to identify and categorize weapons in video data. The system for weapon detection and classification uses a deep learning algorithm based on Convolutional Neural Networks (CNNs) to detect weapons in surveillance videos. Handheld gun detection with Faster R-CNN achieves 88% accuracy in identifying and locating guns in images. The limitations of this study are low weapon perceptiveness to the picture environment; specifically, the current system only recognizes and categorizes weapons on a picture-by-picture basis. CNN models trained on COCO and custom datasets achieved accurate real-time weapon detection. The weapon detection was confined to standard images rather than a dynamically changing scene, which would have enriched the identification levels.

Jayachitra et al. [26] present a real-time method for detecting concealed or hidden objects beneath people’s clothing, a challenging task in video surveillance. This model uses two datasets: MMW and THz. The MMW dataset contains 6700 photos of hidden objects, while the THz dataset has 3476. The study employs the Modified Weighted You Only Look Once v5 (MWYOLOv5) model to develop an effective technique for detecting concealed weapons in humans. The weighted boxes fusion (WBF) approach reduces the likelihood of incorrect predictions due to low confidence levels. Consequently, higher-confidence boxes have a greater influence on the final fused box coordinates than lower-confidence boxes. Additionally, selecting optimal hyperparameter values is essential for training the YOLOv5 model with feature extractors based on CSPDarknet53 and feature aggregation using Path Aggregation Networks (PANet) to detect hidden objects. This is achieved by fine-tuning YOLO hyperparameters —including learning rate, momentum, weight decay, and batch size—using a novel crossover salp swarm algorithm (CSSA). Compared to existing methods, this approach provides greater accuracy in identifying hidden objects in THz and MMW images. Both datasets are used to train and evaluate the proposed hazardous weapon classification model, with results showing high performance—achieving impressive mAP@.5 (98.97%) and mAP@—mAP@-5:95 (97.15%). Future improvements to enhance concealed item detection include: (1) increasing the model’s generalization ability through data augmentation, and (2) reducing computational complexity. YOLOv5 demonstrated superior precision on the MMW and THz datasets, making it an effective tool for detecting concealed weapons. Its computational demand remained high, limiting its use in real time.

Huynh et al. [23] note that automatic object detection in images and videos has become increasingly important as their use grows in modern times. The research data, comprising 2155 class knives and 2078 knife photos, was obtained from GitHub. The dataset can be used to train a reliable object detection model because it contains images of various sizes, with an average resolution of approximately 1280 × 720 pixels. Two subsets of the dataset were created: one for validation (30%) and one for training (70%). In this article, the researchers propose a method based on the YOLOv5 model for detecting knives in images and videos. The YOLOv5 model was developed using the YOLO (You Only Look Once) architecture to optimize speed and performance. The YOLOv5 backbone comprises multiple convolutional layers, including C3 (3 × 3 convolutional) layers, which extract image features. According to the analysis, the recommended method (YOLOv5s) achieves 97.7% precision, 94% recall, and mAP50 of 97.6%. The main advantages of the YOLOv5s model include its lightweight design, faster training and application, high recall and precision scores, and suitability for small to medium-sized objects. Our research identifies several limitations that need to be addressed for future progress, including Low Confidence Threshold False Positives, Dataset Expansion, and Real-World Implementation. YOLOv5s demonstrated superior precision, recall, and mAP while training in a reasonable time. However, its performance needs to be verified on larger datasets and through bootstrapping to achieve broader applicability. Table 4 provides a summarized overview of the research findings on Knife Detection.

Table 4. Summary Table—Knife Detection.

6. AI-Enhanced Smart City Surveillance—Weapon Detection: Gun and Knife

CCTV cameras were initially used to record live footage from the coverage area, enabling monitoring of individual activities. Artificial intelligence for criminal detection, such as identifying guns and knives in video surveillance, has garnered significant interest from security experts and law enforcement agencies in recent years. Additionally, improving the effectiveness and reliability of video surveillance systems for disaster and crime prevention greatly benefits from the use of artificial intelligence (AI) and deep learning. Figure 10 illustrates the process of detecting weapons, such as guns and knives, using AI or related techniques.

Figure 10. Handguns and Knife Detection Workflow Diagram Using Artificial Intelligence.

Grega et al. [29] present a study addressing the challenge of automatically detecting firearms and knives in CCTV footage, reducing reliance on human operators and enabling rapid response times in the event of a threat. The researchers collected their own dataset of CCTV footage, focusing on images involving firearms, especially handguns, and knives. It proposes a Haar Cascade classifier integrated with OpenCV that primarily detects weapons near human silhouettes within the frame, adding contextual relevance to the detection. The method uses MPEG-7 visual descriptors to enhance object recognition, aiming to balance sensitivity and specificity. The algorithm was tested on real CCTV footage under various conditions, including low resolution and inconsistent lighting. The results demonstrated high success, with firearm detection achieving nearly zero false positives and knife detection exceeding the specificity of similar studies. Training and test sets were created from 8.5 min of footage, with each set containing approximately 12,000 frames. Forty percent of each set included positive examples (firearms in plain sight), while sixty percent included negative examples (no guns, but other objects in hand). When edge histogram features are employed, accuracy improves to 91%. Ongoing work aims to optimize detection for additional weapon types further and enhance real-time performance. Thanks to the hybrid CNN and image processing method, effective real-time performance was enabled with an adequate accuracy of 90%. However, the downside, of course, was that false positives occasionally occurred, which detracted from its reliability.

Navalgund et al. [30] report that CCTV cameras are frequently used to reduce crime in an area. Despite installing CCTV in both public and private spaces to monitor the environment, crime rates have not decreased. Although the use of weapons is strictly prohibited in places such as ATMs, banks, and specific public areas, the system is tested using datasets of videos and images collected from YouTube and Google. These datasets include burglary, murder, and other illegal activities involving weapons. The proposed method employs the pre-trained deep learning model VGG-19, which detects guns and knives in a person’s hand and identifies when they are pointed at another person. Additionally, the researchers examined how two different pre-trained models, such as GoogleNet InceptionV3, performed during training. VGG19 achieved higher training accuracy, reaching 91%, 93%, and 92%, respectively. Compared with other current crime-detection methods, this proposed approach yields promising results. The researchers do not further explore the analysis methods across multiple approaches. FRCNN with VGG-19 achieved adequate accuracy and recall, thereby improving overall detection performance. However, additional sample training was necessary to allow for successful generalization.

Dwivedi et al. [27] present that enhanced surveillance systems are necessary to address the increasing crime rate, specifically concerning firearms and blades, as will be critically discussed in this research. The internet provides a diverse, high-quality source of laboratory samples that enhance effective model development and assessment; these resources include a range of tools and equipment. The proposed approaches leverage deep learning and transfer learning to improve weapon classification. Therefore, the initial transfer learning using a pre-trained VGG16 model and modifications of network parameters, such as dropout rates and fully connected layer sizes, are the primary focus of the experimental setup aimed at improving weapon detection performance. From the above analysis, it can be seen that these models achieve retrieval and classification accuracies higher than 99%. Additional research may explore alternative hyperparameters, such as learning rates, to train current models on relatively limited datasets. The Deep Convolutional Neural Network model achieved remarkable accuracy (99%), indicating very efficient performance. The flip side was the minuscule dataset, making scalability difficult.

Fernandez-Carrobles et al. [45] propose a solution for real-time recognition of guns and knives in video surveillance, aiming to enhance public safety in crowded areas. The gun dataset is from Private Collection, and our own COCO knife localization dataset initially contained only 4326 images but expanded to 12,978 through cycling training. The planned system utilizes Faster R-CNN, incorporating the GoogleNet and SqueezeNet architectures for feature extraction, and features a Regional Proposal Network (RPN) for generating object proposals. GoogleNet, with 22 layers, offers the highest accuracy, while SqueezeNet is smaller and simpler, suitable for real-time applications. Researchers fine-tuned both GoogleNet and SqueezeNet for 30 epochs with different initial learning rates using stochastic gradient descent with momentum and L2 regularization. SqueezeNet achieved the best performance in gun detection with an AP50 of 85.44%, whereas GoogleNet was less accurate for knife detection, with an AP50 of 46.68%. CCTV cameras can enhance their performance and detection accuracy by utilizing new deep learning techniques. Future work will explore more lightweight architectures for embedded systems, and the dataset will be significantly expanded to improve robustness and accuracy, particularly to address SqueezeNet’s lower performance in knife detection. The Ventura Faster R-CNN model, which combined GoogleNet and SqueezeNet, achieved a reasonably high level of accuracy in gun and knife detection. To further improve accuracy, it developed a new model architecture for knife detection.

Egiazarov et al. [28] discuss improving public safety by comparing two methods for detecting firearms in images: a semantic segmentation model and an end-to-end deep CNN. The testing dataset comprises images (a custom dataset) of AR-15 rifles, divided into parts such as the stock, magazine, barrel, and receiver, and utilized with various CNNs. While the deep CNN directly identifies the entire weapon, the segmentation model combines individual detections to make a final decision. Experimental results demonstrate that the segmentation model provides greater flexibility, requires less data, and exhibits greater stability in low-data scenarios. In contrast, the deep CNN achieves higher accuracy—around 92.5% (Training Full AR)—but requires more data and computing resources. Future improvements could focus on optimizing the segmentation model’s aggregation process to enhance accuracy. The CNN approach was found to be flexible and achieved notably high accuracy (92.5%), though it could be improved in performance on a few items across various detection scenarios.

Ağdaş et al. [81] classify images into guns, knives, and non-weapon objects using transfer learning techniques on pre-trained CNN architectures, including AlexNet, VGG16, and VGG19. The dataset comprises 16,000 images—9500 knives, 3500 guns, and 3000 ordinary images—collected from open-access datasets and the internet. With accuracy rates of 99.73% and 99.67%, respectively, the VGG16 model, trained via fine-tuning for 2 and 3 classes, achieved the highest accuracy in identifying illegal devices in the experimental results. Each model was trained on a GPU, enhancing its potential for real-time applications. Results are highly reliable for real-time weapon classification, with future improvements focused on increasing dataset diversity to encompass a broader spectrum of crime-related objects. VGG16 and ResNet50 achieved high accuracy without background variation and with the use of large datasets (achieving 100% accuracy with modified ResNet50). However, performance degraded when background variation occurred or when the dataset was small.

Kavya et al. [37] define a deep learning model for classifying seven categories of weapons—assault rifles, bazookas, grenades, hunting rifles, knives, pistols, and revolvers—using a CNN inspired by VGGNet. A dataset of 5214 images collected from the internet was used; preprocessing steps included background masking and cleaning. The developed CNN model achieved a very high accuracy of 98.4%, compared to VGG-16 (89.75%), ResNet-50 (93.7%), and ResNet-101 (83.33%). In this research, the architecture is simplified to make the network more lightweight, enabling faster training and compatibility with lower-end hardware. While effective, future work should address occlusion-related challenges in detection to improve robustness in complex scenarios. VGGNet demonstrated solid performance in real-time detection, achieving a reported accuracy of 98.4%. However, low-lighting and occluded weapons proved to be catch-alls.

Narejo et al. [32] present the core question addressed in this work: the ability to detect weapons, particularly guns, in video surveillance systems to enhance security and combat crime. The database used in the study is PMMW. The weapons targeted include guns, knives, firearms, and concealed weapons. The approach relies on the YOLO (You Only Look Once) v3 algorithm for object detection. This enhances system security by promptly alerting the operator, enabling a response to a weapon in the field, and recording and analyzing incidents for future reference. The accuracy of the trained YOLO V3 model is 98% and 89%. Future development could lead to improved, flexible surveillance systems for security operations and response in dynamic, unstructured environments. YOLOv3 improved weapon detection accuracy on the PMMW and custom datasets, but struggled in complex environments and with associated background noise.

Belurkar et al. [82] describe a weapon-detection problem in real-time surveillance footage that could enhance public security through automatic detection. The researchers proposed using the YOLOv4 algorithm in combination with a CNN to improve object recognition accuracy while maintaining faster processing speeds. This is achieved using a custom-trained YOLOv4 model with labeled datasets, which will be further deployed in real-time CCTV scenarios. Additionally, CNNs are used to classify object proposals into specific categories, such as weapon types. The experiment’s foundation involved training the model in Google Colab, which offers free GPU access, and testing it on live video feeds, processing frames at up to 60 FPS to significantly enhance real-time detection performance. However, some limitations include reduced performance in cluttered environments and challenges in detecting concealed weapons, highlighting areas for future improvement in object localization. YOLOv4, combined with Harris Corner Detection, resulted in improved frame processing speed; however, the results were not clearly analyzed or reported.

Bushra et al. [42] discuss in this research how to detect weapons such as pistols, rifles, and knives and perform facial recognition in crowded public places using YOLO V5. The dataset used to train the model in the GitHub repository is from Roboflow, a computer vision development platform. The weapons employed are handguns and blades. The algorithm used is YOLO V5, the latest in the You Only Look Once family of deep learning models for real-time object detection. The process of integrating the YOLO V5 approach for weapon detection in surveillance involves an experimental procedure with the following stages: Preprocessing, Configuration, Training, Testing. The model achieved 332 detections, 98% of which were actual weapon detections across 324 observations. Future work could focus on improving the model’s accuracy across different lighting conditions and with various types of weapons. The YOLOv5 model achieved high precision at high speed and efficiency. However, the model’s performance decreased with high variability and/or complexity.

Fathy et al. [83] address the task of weapon recognition in video surveillance and propose integrating deep learning-based IoT and fog computing with SDN to achieve rapid detection while simultaneously reducing network bandwidth. The research uses the SOHAs weapon database, which includes small weapons such as pistols and knives, and tests several variants of the YOLOv5 model, with YOLOv5n performing best. The developed system features a layer of edge computing devices responsible for weapon recognition and keyframe selection, thereby limiting network traffic volume. Built on Mininet with OpenFlow switches and a Ryu controller, the system manages bandwidth among current network flows based on existing traffic. Effective flow management strategies result in a 75% increase in average throughput, while 14.7% and 32.5% packet losses, along with a degradation of 1,321,070 in mean jitter, highlight the limitations of generally non-adaptive quality of service models. The next step will focus on practical implementation aspects such as the number and placement of SDN controllers, as well as improving detection accuracy and enabling real-time operation. The SDN with the YOLOv5 model achieved a very high throughput and low packet drop rate. However, the model’s overall detection accuracy required improvement.

Hnoohom et al. [31] report that Thailand has experienced instability in recent years, similar to other countries worldwide. The number of offenses endangering individuals or assets is likely to increase if current trends persist. Nowadays, closed-circuit television (CCTV) technology is widely used for surveillance and tracking to enhance public safety. The Armed CCTV Footage (ACF) dataset, consisting of self-collected mock CCTV footage of pedestrians carrying pistols and knives, was assembled to address scenarios not adequately covered by existing datasets in the field of CCTV weapon detection. This work aimed to introduce a deep learning method for recognizing small-weapon objects using image tiling. Experiments were conducted on a publicly available benchmark dataset (Mock Attack) to evaluate detection performance. The proposed tiling approach achieved a significantly higher mAP, 10.22 times higher. To analyze the improvement, different object detection models were trained using the tiling technique. The tiled ACF dataset achieved an mAP of 0.758 when detecting handguns and knives with SSD MobileNet V2. The detection of weapons can be substantially improved by the proposed tiling method combined with our ACF dataset. Although our high-quality ACF dataset enhances weapon detection in CCTV footage, it has certain limitations because it was collected from images of knives and pistols under sunny or overcast daylight conditions. During inference, detection accuracy may decrease in nighttime conditions or when encountering other types of weapons or objects not included in our collection, potentially leading to incorrect detections. The Faster R-CNN model with SSD MobileNet provided solid performance in armed CCTV detection; however, its object detection accuracy required tuning.

Lamas et al. [84] report that a significant proportion of false negatives persists when CNN-based object detection models are used for weapon detection in CCTV footage. Most current research in this area focuses on a single weapon type—mainly firearms—and improves detection using various pre- and post-processing techniques. The Sohas weapon identification dataset’s knife and pistol classes were used to train the detection models. There are a total of 3250 images, with 1825 in the knife class and 1425 in the pistol class. This study proposes a top-down approach that uses a weapon identification model to evaluate hand regions after they are identified via human posture estimation. Researchers developed a new factor, the Adaptive Pose Factor, that accounts for the body’s distance from the camera to determine the optimal location for each hand region. Our tests demonstrate that the top-down method, known as Weapon Detection over Pose Estimation (WeDePE), enhances detection performance. In fifteen movies analyzed across various settings, results demonstrated that the top-down strategy increases accuracy by up to 17.5%, recall by 20.8%, and F1 score by 19.4% compared to a bottom-up approach. The WeDePE methodology surpasses current state-of-the-art detection models and alternative bottom-up methods in both indoor and outdoor recording scenarios. The researchers plan to incorporate multiple weapon detection techniques for different objectives. Top-down detection, including pose training, improved robustness and enabled the detection of armed subjects with specific precision and recall. Still, it would need to be modified to detect multiple types of weapons.

Maddileti et al. [85] address the challenge of detecting unusual weapons in real-time from surveillance footage using a hybrid model that combines YOLOv3 and Faster R-CNN. The dataset includes images of firearms—guns, knives, and rifles—from Kaggle. The approach integrates the proposed method with object detection through YOLO v3 for rapid detection and Faster R-CNN for precise object localization, employing a two-stage process: preprocessing and feature extraction. The experimental setup involved training the model on the dataset and evaluating detection performance using metrics such as detection scores and confusion matrices. The results were highly favorable, surpassing the previous designs with an mAP of 85%, a FPS of 53, an SSIM of 0.99, a sensitivity of 97.23%, and a throughput of 94.23%, demonstrating high effectiveness in real-time surveillance. Although significant progress has been made in this area, further enhancements could lead to numerous valuable real-time applications. While the systems YOLOv3 and RCNN demonstrated high throughput and accuracy for weapon detection, structural changes were still necessary for scalability, resulting in inefficient use.

Tejashwini et al. [86] discuss the main problem of interest in this research: the lack of an automated system for identifying prohibited weapons in CCTV-captured videos. The datasets include COCO and Open Images, with the focus on pistols, and the technique of interest is real-time weapon detection using YOLOv5 from CCTV footage. The system’s design ensures that the hardware operates effectively under normal conditions and can continuously generate security alerts. The accuracy results showed pistol detection with 87% precision according to the WHO website and knife detection with 96% precision. Although YOLOv5 is recognized as one of the lightest object detection models, the computational power required for real-time processing—such as processing full HD video streams—can be substantial. YOLOv5 was able to accurately identify pistols and knives; however, there were overheads in computational time and processing power that, for this study, limited the experimental system’s efficiency.

Borthakur et al. [87] discuss enhancing military surveillance by automating object detection with YOLO models. The primary focus is on detecting vehicles, drones, personnel, and weapons, including guns and knives. In this research, two datasets are utilized: MCVC, comprising 6772 images with six classes—tank, truck, airplane—and the extended MIOD dataset, which includes 16,012 images and four additional classes—person, knife, gun, and drone. The researchers compared YOLOv3, YOLOv4, and YOLOv5m on the MCVC dataset. The highest mAP at 0.5 was 95.9%, achieved with YOLOv5m. Additionally, YOLOv5m was retrained on the MIOD dataset, achieving a test mAP@0.5 of 82.3%, with a precision of 83.5% and a recall of 79.4%. Subsequently, the Slicing Aided Hyper Inference (SAHI) technique was applied to enhance the model’s ability to detect smaller objects within larger frames. Future work could include addressing class imbalance in datasets and improving the detection of camouflaged or noisy objects. While YOLOv5 with SAHI was accurately detecting, especially with smaller objects, there needs to be a structural progression to use at military detection levels.

Devasenapathy et al. [88] state that object detection is one of the most important research topics in digital forensics. The object detection approach can be helpful across various organizations and businesses, including data rescue at local and international levels, criminal justice, airport safety, traffic surveillance, and medical diagnostic scanning. This study employs several enhancement techniques to identify guns in video surveillance images. The researchers used additional methods to improve the accuracy and overall performance of an artificial neural network, including image segmentation with Clustering-Based Segmentation, HOG feature extraction, and background noise reduction using the Kolmogorov filter. Images with 500 × 500, 650 × 450, or 900 × 650 pixels are required for our project. MATLAB software and low-resolution Google photos of people holding knives will be used to evaluate the approach. 80% of the data are used for training and 20% for testing, following the proposed 80-20 split in machine learning. The artificial neural network (ANN) using histogram of oriented gradients (HOG) features achieved good accuracy (89.78%), and it was reasonable to expand the dataset to improve performance further.

Dugyala et al. [89] emphasize that security is a primary concern across all industries, particularly in light of the rise in crime during crowded events. Early identification of potentially violent situations is enabled by weapon detection (WD). Detecting weapons is challenging even with advanced closed-circuit television (CCTV) systems and deep learning (DL) algorithms. This research proposes a PELSF-DCNN model for weapon detection. The approach involves eight stages. After each step, an experimental investigation is conducted to demonstrate the effectiveness of the proposed model. The model achieves 96.8% precision and 97.5% accuracy. Additionally, YOLO v8 has shown improved results. Similarly, the proposed model performs better overall. Therefore, the model is quite effective for WD. Although it could distinguish between real and fake guns, it could also detect firearms. Future work will expand to differentiate between genuine guns and other deadly weapons by introducing a new ensemble classifier. The You Only Look Once version 8 (YOLOv8) model, utilizing PELSF–DCNN, achieved high accuracy (97.5%) and reliability; however, it was unable to detect more than one weapon.

González et al. [90] propose a semi-supervised learning approach to detect firearms in a large, unannotated dataset, thereby improving accuracy. The researchers introduce a new dataset containing 458,599 Instagram images labeled with firearms-related hashtags. The focus of this work is firearms, using a conditioned cooperative training methodology in which a teacher model generates pseudo-labels from a student model. These models leverage the DETR architecture, thus iteratively enhancing both models through knowledge transfer. The model demonstrated an improvement of up to 10.5% in average precision over previous semi-supervised methods in the experimental setup. The UGR and YouTube GDD Dataset metrics for the proposed approach are approximately 71.32 and 58.56 (AP), respectively. Perhaps the most significant benefit is the effective utilization of unlabeled data, reducing the need for extensive manual annotation. One drawback is that it relies heavily on highly confident pseudo-labels, which can lead to errors propagating across successive training cycles. Future research could improve this pipeline by developing a more robust threshold selection process for pseudo-labeling, thereby further enhancing accuracy. Additionally, cooperative training reduced annotation costs while maintaining competitive results. The performance decreased when applied in complex or cluttered scenes.

Sharma et al. [91] discuss the detection of various types of weapons, including guns, knives, and heavy firearms, using the real-time YOLOv8 algorithm. In this research, a dataset comprising sources such as the internet, public-domain datasets, and private datasets, containing approximately 16,000 labeled images of weapons, was used for training. Images were collected from sources such as Roboflow and other publicly available datasets. The methodology involved training the YOLOv8 model, which is faster and more accurate than earlier versions, through a plan that included data extraction, cleaning, and model training. The experimental setup evaluated the YOLOv8 model’s performance on a validation dataset of 1400 images, achieving a mean average precision (mAP) of 88.2%. These results demonstrate the model’s high accuracy in detecting firearms; however, the research suggests that future work should focus on detecting concealed weapons in cluttered environments. A larger dataset should be developed, and additional features such as motion tracking and occlusion management should be incorporated. The YOLOv8 variant demonstrated fast, accurate detection. It required larger dataset sizes to improve generalization.

Sivakumar et al. [33] focus on handgun detection in poor lighting conditions, where traditional models often struggle to perform effectively. The dataset comprises the Weapon Detection Dataset (WDD), the Gun Dataset (GD), and the Gun Object Detection dataset, totaling 15,367 images and five videos collected in various dark environments, with a primary focus on handguns. This study proposes a new pipeline comprising a collection of convolutional neural networks with distinct architectural designs. Each neural network is trained with a distinct mini-batch with minimal or no overlap in the training samples. The researchers developed an improved YOLOv7 model by incorporating a brightness enhancement algorithm for low-light conditions. The system detects dark frames, applies real-time image enhancement, and then detects weapons. Training the modified YOLOv7 on the presented WDP Model achieved 94% accuracy and 95.7% precision. It is also important to note that advancing weapon detection systems could significantly help prevent misuse and monitor individuals carrying weapons. Future research could focus on optimizing the framework to support more neural networks without substantially increasing computational requirements. The YOLOv7 and Faster R-CNN variants reported substantial precision and accuracy across all datasets. However, structural determination still needed to be remedied.

Tram et al. [47] address the problem of detecting weapons, such as pistols, knives, and rifles, in surveillance videos using state-of-the-art YOLOv5, YOLOv7, and Swin Transformer, combined with IDM tools and Image Downloader. Data are collected from open online sources based on keywords related to knives, pistols, and rifles. Additionally, information was obtained from publicly accessible sources, including Roboflow data, Knife Detection datasets, and Handgun Detection datasets. The proposed methods combine the YOLO model with the Swin Transformer to achieve high-accuracy object detection via image preprocessing, feature extraction, and Mask R-CNN object segmentation. Multiple YOLO versions are trained in an experimental setup and then compared on WeaponData_VN. Under ideal recognition conditions, YOLO-v7e6 appears to have the highest accuracy, ranging from 97 to 99%. The accuracy of YOLOv7-e6 surpasses that of other YOLO models. These results indicate that using the same Swin Transformer model as in this study significantly improves object detection, supporting future comparisons of various weapon-detection techniques. The combination of YOLO and Swin Transformer variants reported high accuracy, consistent with all other versions. However, further refinement would be needed to achieve increased performance in the future.

Ugany et al. [43] discuss the challenge of detecting weapons in real time from crime scene videos or surveillance footage. This research uses a dataset of 8327 images containing guns and knives amid diverse backgrounds to enhance Tiny YOLO’s accuracy. Tiny YOLO is relatively fast and efficient; therefore, it has been preferred over models such as Faster R-CNN and YOLOv6 due to its higher mAP of 92.33%, precision of 96%, and F1-score of 92%. In the experimental setup, Tiny YOLO was trained using different image filters to adapt to real-time detection conditions. As a result, this algorithm demonstrated high accuracy and speed in identifying weapons in real-world scenarios. Future work will focus on reducing false positives and on detecting other object classes. Tiny YOLO can efficiently detect weapons on low-resource devices while maintaining high accuracy. However, it may require improved detection of small objects in complex scenes.

Yadav et al. [92] address the challenge of real-time weapon detection by resource-constrained mobile robots. The dataset used consisted of approximately 3000 images from three sources: IITP_W, Handgun, and Sohas, featuring various types of weapons, including pistols, rifles, and knives. The proposed technique, YOLO-WL, is a lightweight version of the PicoDet model designed for low-power devices such as the Raspberry Pi and Jetson Nano. The workflow optimizes the backbone, neck, and head modules to reduce computational costs while incorporating attention-based learning to enhance feature detection. In the experimental setup, the model was trained on NVIDIA A100 GPUs and tested against PicoDet on these three datasets. The model achieved precision scores of 94.4%, 89.8%, and 91.58% for the IITP_W, Handgun, and Sohas datasets, respectively. In terms of prediction accuracy, the suggested model performs better; however, it has a slightly higher computational cost compared to PicoDet. The YOLO-WL model achieved a high mAP across three datasets for weapon detection. Similarly, it was computationally intensive and impractical.

Abdullah et al. [93] present that the number of terrorists and criminals using light weapons, such as knives and guns, has increased significantly in recent years worldwide. Unfortunately, most monitoring systems currently rely heavily on human observation and involvement. Therefore, the need for an intelligent system capable of identifying various weapons becomes crucial in the fields of computer vision and security. A self-collected dataset was used to train and evaluate the proposed system. This dataset comprises numerous images of various types of weapons, including knives, rifles, and handguns. It contains 11 classes (9 for weapons and 2 for armed human postures), with 1500 images per class, ensuring even distribution across categories. This balanced distribution enhances weapon detection accuracy by reducing bias and overfitting. The system primarily uses deep learning techniques, specifically You Only Look Once (YOLO) version 8 (YOLOv8), to identify distinct classes of light weapons. The proposed approach was evaluated using a custom weapons dataset comprising hundreds of images across various weapon categories. The results demonstrated a mean average precision of 97.2%, indicating the effectiveness of ensemble learning in accurately detecting multiple types of weapons. Extending the system’s detection capabilities beyond firearms to other item categories and incorporating contextual data, such as scene analysis and behavior modeling, could further improve its performance. YOLOv8 achieved a high mAP (97.2%) and demonstrated extreme reliability in weapon detection. Overall, it would be more useful with a greater range of objects.

Akhila et al. [94] present a firearm detection system. Stabbing and shootings are violent crimes that threaten public safety and cause significant trauma. To prevent lone-wolf attacks without human oversight, technology is necessary. Therefore, the most effective way to locate and identify weapons using neural networks is to develop an automatic weapon detection system powered by deep learning. Google is used to obtain the dataset’s photos, and YouTube provides frames from live videos. This study focuses on both unified and Two-Stage object detectors, whose resulting models not only detect the presence of weapons but also classify them, such as rifles, pistols, knives, and revolvers, in addition to detecting people. For model validation and training, the study employs the Faster R-CNN and YOLOv5 families. To improve speed and performance, YOLOv5 underwent pruning and assembly. Models with an inference speed of 8.1 ms achieve a maximum score of 78%. Faster R-CNN models, however, reach a top AP of 89%. Data augmentation techniques, including the use of random noise images, increase model robustness. Risks to validity include selection bias and timing issues. For future model training, the resulting model may be provided as a pre-trained model. Since the dataset is homogeneous, selection bias—which threatens internal validity—is minimized. Both YOLOv5 and Faster R-CNN achieved high overall accuracy and precision. Still, they both suffered from selection bias and a lack of diversity within their training and validation datasets.

Berardini et al. [95] propose that, to improve public safety, early identification of knives and weapons from CCTV footage is essential. Although Deep Learning (DL) techniques for generic object detection are becoming more advanced, there are still unresolved issues with weapon detection from video surveillance. A Dahua camera was used to collect the CCTV data necessary for training the DL algorithm to detect hazardous items. The dataset used to train the DL algorithms was created by processing the collected data. To achieve this, each video was cut after verifying the start and end of each collection session. The weapon-detection challenge is approached from an edge computing perspective in the proposed study. A two-step DL technique was developed and tested on a custom indoor CCTV dataset, and its performance was compared with that of existing state-of-the-art methods. The method employs a first Convolutional Neural Network (CNN) to detect individuals, which then directs a second CNN to identify knives and pistols. This system was implemented on an NVIDIA Jetson Nano edge device connected to an IP camera to evaluate performance in a real indoor environment. The system operated nearly in real-time without requiring expensive hardware. The results from the low-power NVIDIA Jetson Nano, with COCO Average Precision (AP = 79.30) and Frames per Second (FPS = 5.10), demonstrated the superiority of the proposed method over alternatives, encouraging the adoption of accessible automated video surveillance systems. Future work will focus on collecting a similar dataset in outdoor settings and testing the proposed method in those settings. A video surveillance system based on CNNs demonstrated the automation of various surveillance tasks. However, this dataset did not include outdoor samples and dude samples.

Idakwo et al. [41] discuss the illegal ownership and use of weapons, including knives, rifles, and handguns, which have led to a significant increase in insecurity and have prompted a greater use of surveillance cameras for live video monitoring. The effectiveness of any deep learning model depends on the dataset and on its ability to learn image features efficiently. Therefore, creating a library of high-quality weapon images is essential. As a result, a comprehensive dataset with excellent weapon photos was obtained from the public Mock assault dataset. Using their respective size ratios relative to the input of the redesigned capsule network, the high-definition images from the surveillance cameras in the public Mock Attack dataset were automatically split into smaller tiles. Due to its fast prediction speed, low data requirements, and ease of recognizing posture, texture, and image distortions, the capsule network was chosen for system detection and classification. The average F1 score, recall, accuracy, and precision were 99.43%, 98.14%, 98.77%, and 98.45%, respectively. Classifying weapons into their respective categories will help security authorities determine the most effective methods for locating and apprehending offenders. The Modified Capsule Network achieved 99.43% accuracy, demonstrating a remarkable ability. However, this research did not expose the Modified Capsule Network to a wide range of datasets.

Jacob et al. [54] present a novel surveillance system that uses a customized YOLOv8 (You Only Look Once version 8) object detection algorithm to identify knives and weapons in real time, addressing rising public safety concerns. To ensure the model’s robustness, the dataset encompasses a range of scenarios, lighting conditions, and angles. The system’s real-time object recognition capability is examined in detail, including video frame recording and deploying the YOLOv8 model on a Raspberry Pi. Evaluation metrics—accuracy, precision, recall, and F1 score—demonstrate that the system is well-balanced, achieving high accuracy (90.9%) and sensitivity in detecting positive cases. The YOLOv8 model demonstrates robust real-time object detection, accurately identifying knives and firearms across various scenarios. Its generalization can be further improved by continuously adding new scenarios, lighting conditions, and object variations to the dataset. In relation to the datasets described in this paper, YOLOv8 demonstrated consistent performance and greater precision and recall. However, it could have been further improved with either an expanded or a fine-tuned dataset.

Keerthana et al. [96] present the development of advanced surveillance systems driven by the recent increase in security concerns. The dataset includes various images and videos of firearms from multiple sources. This research introduces a novel weapon identification system that utilizes the cutting-edge YOLOv8 algorithm for real-time object detection. User authentication can be easily integrated into the system to ensure secure access and control over its features. The proposed method can identify guns in various settings, including photos, videos, and live webcam feeds, by leveraging the high accuracy of the YOLOv8 model. This research employs the Yolov8 algorithm to detect weapons with an impressive 94% detection accuracy. Future versions of our weapon detection system will focus on improving the YOLOv8 model’s accuracy by using diverse datasets. Additionally, researchers aim to further enhance our framework. While YOLOv8 shows good speed and accuracy across various datasets, its performance has yet to be validated on more diverse datasets.

Martinez, H. et al. [97] describe how detection and surveillance systems have evolved in tandem with technological advancements. It is crucial to have surveillance measures in place to prevent armed individuals from appearing. Two significant datasets are used in the learning process: the Terahertz Human Dataset and ImageNet. These datasets contain 26,505 images primarily depicting the human body. However, some of the researchers used additional tailored datasets. The researchers introduced a new, fog computing-optimized surveillance system for identifying both humans and weapons. This strategy can be integrated into 5G deployments. It efficiently utilizes the Edge and Fog layers, with the Fog layer running many deep learning algorithms on embedded GPUs. The system has been trained to recognize knives, rifles, and firearms. Researchers aim to evaluate a situation with hundreds of camera sensors and numerous fog nodes in the future. Additionally, they plan to provide security features for low-resource devices. The implementation of Edge–Fog–Cloud computing facilitated scalability and improved system ease of use; however, there was insufficient discussion of the findings and performance metrics.

Mukto et al. [98] discuss a comprehensive crime-monitoring system that efficiently detects weapons, violence, and faces in real time using CCTV. The CMS (Crime Monitoring System) operates on three layers: weapon detection via YOLOv5, violence detection with MobileNetV2, and face recognition using the LBPH (Local Binary Pattern Histogram) algorithm. It uses a diverse dataset of labeled video frames containing weapons, such as guns and knives, to ensure model accuracy. Standard preprocessing and augmentation steps, including rescaling and flipping, are employed to enhance detection performance. The experimental results showed that YOLOv5 achieved over 80% accuracy in weapon detection, MobileNetV2 reached about 95% in violence detection, and the face recognition model achieved 97%. When tested in various real-world scenarios, the integrated CMS demonstrated exceptional performance, effectively ensuring security and safety by detecting criminal activities and issuing prompt alerts. Future improvements will focus on further optimizing the system for additional crime scenarios and reducing false negatives. YOLOv5 was considered sufficiently fast and overall respectable for detecting potential weapons in real-time monitoring. Notwithstanding this, the false-positive issues and systematic speed issues were observed in crowded scenes.

Rao et al. [99] emphasize that public safety is a top priority in today’s society, and maintaining the security of public areas is crucial. Especially in surveillance scenarios, weapon-detection devices are vital to ensuring public safety. However, current approaches for detecting occluded and modified weapons have shown below-par performance. The publicly available weapon detection dataset, created by extracting images from online video clips, is used to evaluate the proposed model. This research introduces an effective, nonlinear, and scalar-growing cosine-based Deep Convolutional Neural Network (NSGCU-DCNN) for weapon identification in surveillance footage. The primary objectives of this work are to predict specific objects with minimal effort and cost, and to detect occluded objects in noisy backgrounds. Analysis conducted using publicly accessible datasets shows that the proposed NSGCU-DCNN achieves top metrics with 97.85% accuracy, 98.24% sensitivity, and 96.81% specificity. Currently, the methods only detect guns and knives; other types of weapons are not considered. Future improvements will include recognizing different kinds of weapons and employing more advanced neural networks for detection. The NSGCU-DCNN demonstrated robustness and reliability, achieving a high overall sensitivity level. Nonetheless, the handling of the diverse weapons remains insufficient.

Sivakumar et al. [100] propose that security forces urgently need to implement computerized command systems due to the rising number of criminal acts. A carefully selected and arranged dataset of seven distinct weapon types has been used to train the proposed model. This model is built using the Keras architecture, based on TensorFlow, and utilizes the VGGNet architecture. It outperforms the VGG-16 model (89.75% accuracy), ResNet-50 (93.70%), and ResNet-101 (83.33%) in classification accuracy, achieving an impressive 98.40%. This study offers a crucial perspective on the effectiveness of deep learning for weapon categorization, yielding promising results that could significantly enhance the security forces’ ability to combat crime. Future research may focus on developing a robust infrastructure for autonomous robot warriors that can independently recognize and analyze inputs, such as weapon detection problems considered in this study. However, a far more robust infrastructure would need to be developed for use at scale.

Talib et al. [34] propose advanced weapon-detection systems, which have become necessary due to the increasing use of light weapons in criminal and terrorist activities. Since a conventional weapons dataset is not readily available, the study used a synthetic dataset generated automatically. This research focuses on utilizing You Only Look Once version 7 (YOLOv7), a high-speed deep learning detection model, to meet the demand for quick and effective threat identification. The proposed method can simultaneously recognize and distinguish multiple weapons because it was trained on a self-curated dataset containing numerous photos of various deadly light weapons. Specific weapon categories achieved a mean average precision (mAP) of 97%, demonstrating YOLOv7’s exceptional speed and accuracy, making it unique in object and weapon identification. The system’s performance could be further improved by adding more images for each weapon type. Future research, especially in autonomous security systems, is expected to be guided by and informed through the study’s conclusions. YOLOv7 achieved a good mAP (97%) and increased efficiency for security forces, despite autonomous systems still being in an early stage of development.

Appavu et al. [44] discuss an attack detection technique based on convolutional neural networks (CNNs) for surveillance footage. Using both machine learning and deep learning techniques, as recommended, improves accuracy. Performance evaluations demonstrated the effectiveness of the proposed method in detecting violent content in videos. The future solution outperforms existing methods for categorizing violent and criminal content in movies, according to test results. InceptionV3, ResNetV2, Inception_ResNetV2, Violence_Net, and the proposed vioNet models were all trained on four datasets. The classification results for the endorsement data were as follows: InceptionV3, 92%; ViolenceNet pseudo-97%; InceptionResNetV2, 90%; and VioNet, 99.72%. The DenseNet prototype was chosen for our application layout over the more complex Inception or InceptionResNet models due to its more straightforward feature map integration. The CNN-based VioNet and related models achieved very high accuracy on available violence datasets; extending these datasets would further enhance generalization capability.

Apeh et al. [101] propose a dispersed wireless sensor network for monitoring and self-defense. There are benefits to combining artificial intelligence (AI) with wireless sensor networks (WSN) to collect security data regarding terrorist movement patterns in conflict areas. Through autonomous monitoring, it increases the effectiveness of highly advanced technology in the fight against terrorists, which is more resilient than front-line human personnel. Using a deep convolutional neural network (DCNN) integrated within the YOLOv5 platform, the YOLOv5 model is used to experiment with object identification and categorization on images of human-conceived weapons. The simulation of attacker detection in a wireless sensing model based on spanning tree networks yielded average detection accuracy values of 94.85%, 95.10%, 96.58%, 93.57%, 95.26%, and 97.17%, respectively. Additionally, at a processing time of 0.875 s, the detection accuracy for gun identification is 100%. A Deep Convolutional Neural Network (DCNN) using YOLOv5 achieved perfect accuracy in detecting firearms and strong detection of possible intrusions. Still, it did not use the latest YOLO versions to explore additional possibilities.

Berardini et al. [48] propose that Video Surveillance Systems (VSSs) are essential for reducing crimes involving weapons, especially knives and firearms. The high incidence of crimes involving knives and handguns emphasizes the importance of early weapon identification. This has driven the development of automated methods for detecting weapons in surveillance camera footage and the proliferation of video surveillance systems. The researchers use our specially created WeaponSense dataset, comprising 52 video sequences captured with the Dahua 4 MP Bullet Network Camera, for all their tests. Due to their small size, firearms remain difficult to detect effectively in real time, despite advances from traditional computer vision to Deep Learning (DL) techniques. The resource-intensive nature of current DL methods, which employ sophisticated detection architectures, makes them costly and energy-consuming, limiting their deployment on effective edge devices. This renders these approaches impractical for edge and real-time applications with limited resources. Our work proposes YOLOSR, which combines an Enhanced Deep Super Resolution (EDSR)-based network with a shared backbone and a You Only Look Once (YOLO) v8-small model to address these issues. YOLOSR maintains the same computational complexity, with 28.8 billion floating-point operations and an on-device latency of 101 ms per image, compared to the state-of-the-art YOLOv8-small model. The Average Precision improved by 10.2 percentage points. The researchers plan to evaluate the methods on various edge devices in the future to understand device-specific limitations and further optimize our models for broader use and enhanced real-world performance. YOLOSR used YOLOv8 and focused on weapon-sense detection, specifically on improving learning ability. However, there is little discussion of basic assessment metrics, such as accuracy and precision.

Yadav et al. [35] present WeaponVision AI, a sophisticated software program capable of accurately recognizing weapons in recorded videos, live streams, and photos. Additionally, this software can detect firearms even in dimly lit environments. To develop WeaponVision AI, a large dataset of 79,558 weapon images was used to train a deep learning architecture based on a modified YOLOv7 architecture. After training, the model showed strong performance, with notable metrics: a mean average precision of 92.15% and a precision rate of 91.75%. The ability of WeaponVision AI to accurately detect weapons across various visual and environmental conditions highlights its effectiveness. Weapon Vision AI implemented YOLOv7 to demonstrate good precision and mAP for weapons detection, but only focused on a small subset of weapon types. Table 5 provides a summarized overview of the research findings on Weapon Detection.

Table 5. Summary Table—Weapon (Knife and Gun) Detection.

7. Research Summary—Key Findings

Figure 11 represents the yearly spread of research activity and algorithm use among five deep learning models—Faster RCNN, CNN, YOLO, SSD, and RCNN—from 2017 to 2024. Each colored marker represents the use of a particular algorithm in a given year, and the number of markers and the thickness of the color indicate the level of use and prominence within that time frame. From this figure, we can see that Faster R-CNN maintains steady usage over several years, especially between 2017 and 2023, which evidences its continual importance in the field of object detection. This likely reflects the implications of hardware that poses specific challenges for researchers when using a similar algorithm in a rapidly developing space. CNN, introduced in 2019, has notably gained traction from 2019 to 2024, suggesting its status as an essential backbone on which many other deep learning methods have been built and continue to be built. Looking at YOLO, there are relatively continuous, densely occurring indicators visually between 2019 and 2024. YOLO is a reliable, recently popular algorithm among researchers and practitioners across multiple settings, offering the best performance for real-time detection. SSD popped up intermittently between 2019 and 2020, suggesting some moderate interest; however, short interest again shows some proper connection to hardware challenges. RCNN appeared only for one year in 2022, likely due to the prevalence of very early classics that had fewer and fewer use cases as libraries became streamlined and code became more efficient.

Figure 11. Year-Wise Distribution of Base Detection Models.

Figure 12 illustrates the distribution of various object detection models and datasets employed in AI-driven weapon detection research. It divides the object detection models into three major dataset distributions, Custom Datasets, COCO Dataset, and Public and/or Private Datasets, indicating any relationship with the specific deep learning models used. Models such as YOLOv3 to YOLOv5 are often trained on custom datasets to detect weapons or similar objects in specific environmental or surveillance scenarios. The COCO dataset, a benchmark that hosts a large number of labeled image objects, supports models such as Mask R-CNN and Single Shot Detection (SSD), which use a convolutional neural network (CNN) architecture for general object detection. On the other hand, the Public and/or Private Datasets reveal models such as YOLOv3, YOLOv5, YOLOv7, Faster R-CNN, and SSD that are utilized for both general-purpose object identification and specific domains. Overall, the graph illustrates the significance of dataset type for particular models: custom datasets are required only for specialized weapon detection, while standard datasets, such as COCO, serve as foundational baselines for model performance.

Figure 12. Relationship between the model and the dataset in weapon detection.

Table 6 lists publicly available datasets for classifying and recognizing weapons, detailing the publication date, the number of images, and the weapon categories. Similarly, Table 7 lists private datasets for weapon classification and recognition, detailing publication date, number of images, and weapon category.

Table 6. Overview of publicly available dataset metrics.

Table 7. Overview of private dataset metrics.

7.1. Dataset Limitations

Most studies on weapon detection face significant limitations due to the available datasets, which affect the model’s effectiveness and generalization. Many datasets used to train weapon detection models include limited viewing angles or fixed camera positions, which limit the model’s ability to detect weapons from non-traditional angles. Fluctuations in lighting, particularly in outdoor night-time surveillance, limit consistency in feature extraction, leading to false negatives in subsequent trained models, such as YOLO and Faster R-CNN. While synthesized data can aid in a variety of datasets, it inherently lacks real-world noise and texture complexity, leading to inaccurate detection when applied to real surveillance footage. Further complicating feature extraction and accuracy, low-resolution camera inputs and video compression artifacts degrade edge detection clarity, affecting precision and recall, especially in fast-moving or crowded environments. While the studies citing COCO, IMFDB, and custom CCTV datasets all report entitled limitations as significant barriers to run-time adaptation, they emphasize the importance of obtaining more environment-specific training with high diversity of images, and high-quality photos (i.e., low-res and higher-res) to ensure more real-world potential of firearm and knife detection in a security context.

7.2. Method Applicability Across Scenarios

A thorough comparative assessment of the deployment of various AI-based weapon detection architectures in relation to different operational environments, identifying the best alternative for each operational environment. The survey examines multiple architectures, including CNN, Faster R-CNN, YOLO (v3–v8), SSD, and Mask R-CNN. It compares them based on detection speed, detection accuracy, and hardware compatibility. The results of the survey indicate that general-purpose models based on the YOLO (v3–v8) architecture outperform other models in real-time surveillance applications. This means that general-use YOLO models are quick and extract more information about detected objects, making them optimal for on-site CCTV applications and portable devices. Other models, such as Faster R-CNN and Mask R-CNN, achieve higher accuracy but require substantial processing power, placing them in the category of requiring expensive server space, which may only be available for longer-term offline analysis. Some models, such as an SSD with a lightweight CNN variant, offer a more balanced performance profile, enabling them to run on embedded or low-resource devices. Ultimately, the comparison table clearly identifies the trade-off between speed and accuracy in terms of model characteristics. Ideally, the recommended model should be determined by specific application conditions, hardware resources, and the desired tradeoff between accuracy and the convenience of recall time.

7.3. Research Discussion—Datasets and Model Performance

7.3.1. Gun

Public Datasets: Research utilizing publicly available datasets such as the COCO dataset, IMFDB, Kaggle, Open Images, and YouTube collections has evidenced the capabilities of high-level performative deep learning models, such as YOLO (v3–v8), Faster R-CNN, and Mask R-CNN. Broadly, the studies that used these models each reported high accuracy, ranging from 93% to 99%, and similarly high mAP values, indicating high reliability in model performance for use with large and diverse image datasets. Specifically, COCO was used in studies that classified multi-weapon detection utilizing YOLOv5 and Mask R-CNN, while IMFDB and Open Images datasets accurately classified firearms. The experimental designs typically relied on GPU-enabled frameworks (e.g., PyTorch, TensorFlow, Darknet) and were deemed successful if the reported values for accuracy, precision, recall, and F1 Scores were calculated.
Private Datasets: Studies using privately held datasets, including those drawn from a university setting or other evidentiary surveillance-based datasets, have mainly utilized three models: CNN, YOLO, and Faster R-CNN. These models have demonstrated accuracies of 92–98%, effectively detecting guns and knives across a variety of controlled environments and limited detection scenarios. Private datasets were collected primarily from universities or from local CCTV footage to bolster each model’s detection capabilities further when applied to realistic textures and scenes. As part of the experimental procedures, most studies evaluated detection accuracy in real time using mid- to high-end graphics processing units (GPUs), and metrics such as accuracy, recall, and F1-score were used to assess the benefits of each model in live surveillance applications.

7.3.2. Knife

Public Datasets: Public datasets like COCO (Common Objects in Context), Open Images, and X-ray imaging collections have been utilized to improve overall detection accuracy and robustness across various situations. For instance, with respect to overall detection capabilities, using SSD (Single Shot Detector) with Feature Pyramid Networks improved performance, achieving a mean average precision (mAP) of 91.3% and consistently detecting small objects in challenging background settings [24]. In a separate study, a CNN (Convolutional Neural Network) model that used both the COCO and Open Image datasets achieved approximately 88% accuracy in real-time weapon classification [22]. The majority of the models referenced were implemented using TensorFlow or PyTorch, with GPU acceleration, and were evaluated using metrics (e.g., accuracy, precision, recall, mAP) on large-scale datasets.
Private Datasets: Studies examining private datasets (institutional or lab-collected images) [20,21,23] primarily utilized CNN, YOLOv5, and hybrid architectures (e.g., MobileNet–PoseNet–Mask R-CNN) to achieve a performance accuracy of 95–100%. Studies demonstrated high efficiency in controlled environments, but weapon detection in outdoor and real-world environments decreased when lighting and occlusion were present. Experimental designs commonly use GPU-based systems and performance metrics such as precision, recall, F1-score, and mAP to develop real-time weapons-detection models.

7.3.3. Weapon—Gun and Knife

Public Datasets: Research into public datasets such as COCO, Open Images, YouTube, and SOHAs Weapon Dataset has shown that deep-learning-based models such as YOLO (v3–v8), Faster R-CNN, and CNN architectures are effective in weapon detection. The models consistently demonstrated accuracies ranging from 87% to 99%, with strong performance from YOLOv5 and Mask R-CNN, as evidenced by pinned precision, recall, and mAP scores. For example, [30,86,87] reported mAP scores above 82% and demonstrated robustness in multi-weapon detection. The experimental setups used GPU-accelerated libraries (PyTorch or TensorFlow) and reviewed most papers that focused on evaluating real-time performance, computational efficiency, and environmental suitability. Notably, these models were generally trained and assessed on public datasets and may not be applicable without fine-tuning in low-light and outdoor environments.
Private Datasets: The literature was primarily built from private dataset sources (i.e., collected from a university research study, public safety agencies, or custom surveillance systems), populated using CNN, YOLO, and Faster R-CNN architectures, which returned accuracies in the region of 92–99%. The earliest papers [21,45,84,93] yielded rigorous results, but, in terms of specialized data usage, which was almost exclusively drawn from private dataset sources, they were better and ensured greater control over image quality and annotation consistency. Furthermore, the methodology adopted by the models demonstrated precision and recall, with sufficient justification for its portrayal of effectiveness in models trained for real-time firearm and knife detection in constrained or institutionalized settings. The experimental methodology typically consisted of training and validating an NVIDIA GPU-based model using literature-based methods, which included accuracy metrics such as F1-score, mAP, and recall. The identified datasets are limited in diversity, which would otherwise hinder the model’s generalization for broader scenario evaluation.

7.4. Hardware Settings

7.4.1. Gun

Hardware configurations were an essential factor contributing to speed and performance metrics reported across the included studies on weapon detection. Studies that were trained and tested on GPUs or server-grade hardware (e.g., YOLOv5–v8, Faster R-CNN, Mask R-CNN) all achieved higher frame rates and faster inference, often exceeding 30–45 frames per second (FPS) to achieve real-time performance in surveillance contexts. However, studies implementing CPU-only or edge-device solutions—such as Tiny YOLO or MobileNet-based models—demonstrated reduced processing speed but improved energy efficiency, making these solutions more attractive for embedded or mobile surveillance systems. Studies that utilized a lightweight architecture (e.g., YOLOv5s, SSD) on restricted hardware achieved a balance between a drop in detection accuracy and associated latency, in contrast to studies utilizing a server-grade GPU that enabled highly computed models such as Vision Transformers (ViT) to achieve high detection accuracy, but at a decline in the CPU’s resource consumption. Overall, differences between the performance on any shooting study are reflective of the extent to which hardware optimizations and availability of resources impact the feasibility of reactivity in real-time, energy performance of deploying a solution for a given situation, and ultimately the scalability of deploying an artificial intelligence solution in weapon detection.

7.4.2. Knife

Variations in hardware setups play a significant role in the variations in speed and performance statistics listed in the studies reviewed. Those performance estimates were made using models trained on GPU or server systems (e.g., YOLOv5 in [23,26] or SSD-FPN in [24], resulting in faster inference and real-time detection at highly accurate rates above 95%. In contrast, methods tested on CPU setups or limited hardware [19,20] showed lower detection speeds, and latencies were evident when processing high-resolution or multi-view input streams. Testing on edge devices and lower-power models, such as MobileNet and FAST [19,21], achieved energy efficiencies suitable for portable surveillance, but at the cost of lower computational depth and detection robustness. Overall, the articles provide evidence that you can achieve greater accuracy and speed with more complex models when using GPU systems, but edge and CPU-based deployments prioritize access and efficiency for low-power and real-time applications.

7.4.3. Gun and Knife

The hardware configuration is a fundamental factor in the speed, accuracy, and overall performance of weapon detection models across the reviewed studies. Regardless of YOLOv5, v6, YOLOv7, and YOLOv8, Faster R-CNN, and Vision Transformer hardware frameworks being executed on GPU or server-grade systems versus the architecture and frameworks of lightweight CNNs, MobileNet versions, and other parallel neural networks it was observed that models with hardware configurations on GPUs and multi-GPU architecture (as shown by processing performance averaging between 30 FPS and 45 FPS) were significantly faster on a per-frame basis when comparing model performance and mAP or precision performances due to the benefit of parallelization and optimizing computational representation. However, models trained or tested on CPUs and thin clients or edge devices achieved slower inference rates than those trained or tested on GPU- or server-grade architectures. Still, they did provide optimization for energy efficiency and portability for embedded solutions in surveillance and applications that require a low-power energy source. The studies in [43,97] further supported the claims, suggesting that edge or fog computing enabled decentralized processing and scalable energy efficiency on CPUs, but at the cost of minor latency and organizational accuracy, as their centralized models were deployed to GPU and server-grade energy resources. In summary, performance across the reviewed studies showed that GPU-based models outperformed CPU-based models, and that edge- or fog-based models improved reliability, speed, and accuracy, with a focus on energy efficiency over accuracy and speed in the real world, and on applicability.

7.5. Challenges

The literature outlines a range of real-world scenarios that threaten the robustness of AI-based weapon-detection systems. In crowded situations, it is common for objects to overlap or for parts of weapons to be obscured. This leads to increased false alerts or missed detections, as seen in public surveillance operations. Occlusion from clothing, bags, or other contexts reduces model confidence, as weapons are often only partially visible, limiting the capabilities of YOLO and R-CNN models. Motion blur from moving subjects or an unsteady camera will diminish image sharpness, ultimately preventing feature extraction and reducing performance, especially in live CCTV feeds. Further, the quality of the camera, such as resolution, lighting, or compression artifacts, will impact model performance when finding weapon shape, as noted in studies that used IMFDB alongside real CCTV datasets. Environments with glare or shadow, or simply varied weather, also create an additional challenge. In summary, real-world deployments require models tuned for visual noise, dynamic preprocessing, and training datasets of sufficient quality and variability for continual performance across environments.

8. Research Gap and Future Scope

While weapon detection is a challenging and time-consuming task, it remains a critical issue for public security and safety. The rising demand for CCTV cameras capable of identifying and analyzing scenes and unusual events is necessary for security, safety, and the protection of private assets. This is especially important for intelligence surveillance. Mass shootings and gun violence are increasing in many parts of the world. Early detection of a gun is vital for preventing property damage and injuries. Using artificial intelligence and other technologies, this study aims to provide effective real-time weapon identification. Despite significant advances, the research examined had several drawbacks. There is still a lack of diversity in datasets, as many models are trained on small or synthetic datasets that do not accurately reflect real surveillance scenarios. The significant variations hinder a direct comparison between research studies in evaluation benchmarks and performance criteria. The generalizability and practical implementation of AI-based weapon detection systems are further limited by the scarcity of research that incorporates real-time field testing or considers privacy and ethical implications. The creation of standardized, openly accessible databases for weapon detection should be the top priority for future research, facilitating repeatability and equitable benchmarking. More accurate model comparisons will be made possible by establishing standardized evaluation procedures and criteria. Real-time deployment across various environmental conditions, including occlusions and low-light levels, should be the focus of future research. Incorporating ethical and privacy-preserving AI frameworks is also necessary to guarantee the proper use of surveillance technologies in both public and private settings. Many researchers present their own techniques and findings; however, they often discuss datasets from public or private sources without addressing their quality or availability. There are a few high-quality, labeled datasets for weapon detection; however, these datasets are limited, which hinders the development of reliable AI models. The limited variety of available datasets reduces the effectiveness of models in real-world scenarios across weapon types, angles, lighting conditions, and occlusions.

Utilizing artificial intelligence (AI) to detect weapons facilitates the provision of protection and practical solutions. These techniques can enhance the quality and accessibility of datasets. This involves collaborating with security and law enforcement organizations to develop diverse and comprehensive datasets. To increase dataset diversity and simulate various scenarios, apply data augmentation techniques. Data augmentation refers to modifying training images to generate a synthetic dataset larger than the original dataset. This reduces the risk of overfitting and boosts model performance.

9. Conclusions

This research study primarily focuses on the use of artificial intelligence for weapon detection in smart city security surveillance. Today, most crimes involve portable weapons such as guns, pistols, revolvers, and knives. Several surveys indicate that the most common weapon used in crimes like rape and burglary is a handgun. Therefore, weapon detection is a crucial need in today’s world. This study presents an artificial intelligence-based method for detecting weapons, including guns and knives, in complex environments. It is also vital to detect weapons early, identify them, alert the public, and take swift action to ensure safety. Additionally, researchers believe that future readers will find a brief overview of weapon detection (guns, knives, or both) and current advanced studies on weapon identification valuable.

Our research demonstrates that using artificial intelligence to detect guns and knives early can help police departments across regions automatically identify weapons and respond swiftly to prevent dangerous incidents. By integrating recent advances in deep learning architectures and real-time object detection frameworks, this study offers a more comprehensive synthesis of AI-based weapon-detection techniques than earlier systematic reviews. This assessment highlights the evolution and performance patterns from 2016 to 2025, providing clearer insights into the advantages and disadvantages of current techniques than previous evaluations, which often concentrated solely on specific algorithms or lacked updated comparisons among models. Table 2, Table 3 and Table 4 provide an overview of the included studies, including the reference, the year of publication, the dataset used, the proposed approaches (AI/ML/DL model utilized) (e.g., YOLO, SSD, Faster R-CNN), the reported performance metrics (e.g., precision, recall, mAP), and any significant constraints. These tables provide a comprehensive overview of the performance patterns and methodological diversity observed in the included research.

There is still room for improvement; therefore, using data augmentation techniques can increase dataset diversity and simulate a broader range of scenarios. This method also aims to reduce false positives and negatives. In the future, weapon detection models can be improved by incorporating advanced artificial intelligence architectures, such as transformer-based models and newly released YOLO models (e.g., YOLOv10, YOLOv11, and YOLOv12), to enhance precision and contextual awareness. Future research should incorporate the development of more sophisticated multimodal sensing techniques, the deployment of edge and federated learning for real-time applications, the collection of all viable benchmark data, and the integration of explainable AI capabilities to provide meaning and certainty in surveillance contexts.

Author Contributions

Conceptualization, T.M.; Methodology, T.M.; Software, T.M. and N.A.N.M.B.; Validation, T.M.; Formal Analysis, T.M. and N.A.N.M.B.; Investigation, T.M., N.A.N.M.B., A.R.O.A.S., M.M.R.A., E.M.R.A. and G.H.T.A.; Resources, T.M. and N.A.N.M.B.; Data Curation, T.M. and N.A.N.M.B.; Writing—Original Draft Preparation: N.A.N.M.B., A.R.O.A.S., M.M.R.A., E.M.R.A. and G.H.T.A.; Writing—Review and Editing, T.M.; Visualization, N.A.N.M.B.; Supervision, T.M.; Project Administration, T.M.; Funding Acquisition, T.M. All authors have read and agreed to the published version of the manuscript.

Funding

The United Arab Emirates University partly supported this work through the Summer Undergraduate Research Experience (SURE) PLUS Research Grant (12T072).

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Acknowledgments

The authors are grateful to the College of Information Technology and the Research Office—United Arab Emirates University.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Dubey, S. Building a Gun Detection Model Using Deep Learning. Program Chair & Proceedings Editor: M. Afzal Upal, PhD, Chair of Computing & Information Science Department, Mercyhurst University, 501, 2019. Available online: www.academia.edu/download/60006683/GLS220190714-14844-1y0chif.pdf#page=24 (accessed on 19 November 2025).
Gelana, F.; Yadav, A. Firearm detection from surveillance cameras using image processing and machine learning techniques. In Proceedings of the Smart Innovations in Communication and Computational Sciences: Proceedings of ICSICCS-2018, Jaipur, India, 20–21 January 2018; Springer: Singapore; pp. 25–34. [Google Scholar]
Santos, T.; Oliveira, H.; Cunha, A. Systematic review on weapon detection in surveillance footage through deep learning. Comput. Sci. Rev. 2024, 51, 100612. [Google Scholar] [CrossRef]
Shanthi, P.; Manjula, V. A systematic review on CNN-YOLO techniques for face and weapon detection in crime prevention. Discov. Comput. 2025, 28, 204. [Google Scholar] [CrossRef]
Yadav, P.; Gupta, N.; Sharma, P.K. A comprehensive study towards high-level approaches for weapon detection using classical machine learning and deep learning methods. Expert Syst. Appl. 2023, 212, 118698. [Google Scholar] [CrossRef]
Warsi, A.; Abdullah, M.; Husen, M.N.; Yahya, M. Automatic handgun and knife detection algorithms: A review. In Proceedings of the 2020, the 14th International Conference on Ubiquitous Information Management and Communication (IMCOM), Incheon, Republic of Korea, 3–5 January 2020; IEEE: New York, NY, USA, 2020; pp. 1–9. [Google Scholar]
Debnath, R.; Bhowmik, M.K. A comprehensive survey on computer vision-based concepts, methodologies, analysis, and applications for automatic gun/knife detection. J. Vis. Commun. Image Represent. 2021, 78, 103165. [Google Scholar] [CrossRef]
Hussein, N.J.; Hu, F. An alternative method to discover concealed weapon detection using critical fusion image of color image and infrared image. In Proceedings of the 2016, the First IEEE International Conference on Computer Communication and the Internet (ICCCI), Wuhan, China, 13–15 October 2016; IEEE: New York, NY, USA, 2016; pp. 378–383. [Google Scholar]
Verma, G.K.; Dhillon, A. A handheld gun detection using faster R-CNN deep learning. In Proceedings of the 7th International Conference on Computer and Communication Technology, Allahabad, India, 24–26 November 2017; pp. 84–88. [Google Scholar]
de Kanehisa, A.; Fumihiro, R.; de Almeida Neto, A. Firearm Detection using Convolutional Neural Networks. In Proceedings of the ICAART (2), Praha, The Czech Republic, 19–21 February 2019; pp. 707–714. [Google Scholar]
Vílchez, R.F.; Mauricio, D. Bullet Impact Detection in Silhouettes Using Mask R-CNN. IEEE Access 2020, 8, 129542–129552. [Google Scholar] [CrossRef]
Rahil, I.; Bouarifi, W.; Ghizlane, R.; Mustapha, O. An Improved Real-Time Handgun Detection System Using Yolo V5 on a Novel Dataset. J. Theor. Appl. Inf. Technol. 2023, 101, 7674–7688. [Google Scholar]
Mishra, S.; Chaurasiya, V.K. Automatic Firearm Detection in Images and Videos Using YOLO-Based Model. In Proceedings of the International Conference on Neural Information Processing, New Delhi, India, 22–26 November 2022; Springer Nature: Singapore; pp. 553–566. [Google Scholar]
Nale, P.; Gite, S.; Dharrao, D. Real-Time Weapons Detection System using Computer Vision. In Proceedings of the 2023 Third International Conference on Smart Technologies, Communication and Robotics (STCR), Sathyamangalam, India, 7–8 September 2023; IEEE: New York, NY, USA, 2023; Volume 1, pp. 1–6. [Google Scholar]
Flores, J.M.; Liban, L.; Ribaya, R.L.; Paguio, I.A.; Suárez, J.J. SPoTem: Handgun Detection in Videos using Spatial, Pose, and Temporal Features. In Proceedings of the 2024 International Conference on Innovation in Artificial Intelligence, Tokyo, Japan, 16–18 March 2024; pp. 111–117. [Google Scholar]
More, P.; Patil, S.; Pattanshetti, T. Real-time Violence and Weapon Detection and Alert System 2024. PREPRINT (Version 1). Res. Sq. 2024. [Google Scholar] [CrossRef]
Wang, G.; Ding, H.; Duan, M.; Pu, Y.; Yang, Z.; Li, H. Fighting against terrorism: A real-time CCTV autonomous weapons detection based on improved YOLO v4. Digit. Signal Process. 2023, 132, 103790. [Google Scholar] [CrossRef]
Valliappan, N.H.; Pande, S.D.; Vinta, S.R. Enhancing Gun Detection with Transfer Learning and YAMNet Audio Classification. IEEE Access 2024, 12, 58940–58949. [Google Scholar] [CrossRef]
Buckchash, H.; Raman, B. A robust object detector: Application to detection of visual knives. In Proceedings of the 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Hong Kong, China, 10–14 July 2017; IEEE: New York, NY, USA, 2017; pp. 633–638. [Google Scholar]
Castillo, A.; Tabik, S.; Pérez, F.; Olmos, R.; Herrera, F. Brightness guided preprocessing for automatic cold steel weapon detection in surveillance videos with deep learning. Neurocomputing 2019, 330, 151–161. [Google Scholar] [CrossRef]
Noever, D.A.; Noever, S.E.M. Knife and threat detectors. arXiv 2020, arXiv:2004.03366. [Google Scholar] [CrossRef]
Babu, P.S.; Sudarshan, H.D.; Gowda, B.L.R.; Sujan, P.; Aldi, S.M. Weapon Detection Using Artificial Intelligence and Deep Learning for S Application. Int. J. Res. Eng. Sci. Manag. 2021, 4, 279–281. [Google Scholar]
Huynh, S.P.T.; Dang, T.Q.; Nguyen, H.D.; Tran, P.N.H.; Nguyen, N.Q.V. Knife Detection using YOLOV5: A Deep Learning Approach. In Proceedings of the 2024 9th International Conference on Intelligent Information Technology, Bali, Indonesia, 15–16 March 2024; pp. 7–12. [Google Scholar]
Guo, R.; Zhang, L.; Ying, Y.; Sun, H.; Han, Y.; Tan, H. Automatic detection and identification of controlled knives based on improved SSD model. In Proceedings of the 2019 Chinese Automation Congress (CAC), Hangzhou, China, 22–24 November 2019; IEEE: New York, NY, USA, 2019; pp. 5120–5125. [Google Scholar]
Amador-Salgado, Y.; Padilla-Medina, J.-A.; Pérez-Pinal, F.-J.; Barranco-Gutiérrez, A.-I.; Rodríguez-Licea, M.-A.; Martínez-Nolasco, J.J. Knife detection using indoor surveillance camera. In Proceedings of the 2021 IEEE World AI IoT Congress (AIIoT), Seattle, WA, USA, 10–13 May 2021; IEEE: New York, NY, USA; pp. 0062–0068. [Google Scholar]
Jayachitra, J.; Devi, K.S.; Manisekaran, S.V.; Satti, S.K. An optimal deep learning model for recognition of hidden hazardous weapons in terahertz and millimeter wave images. Earth Sci. Inform. 2023, 16, 2709–2726. [Google Scholar] [CrossRef]
Dwivedi, N.; Singh, D.K.; Kushwaha, D.S. Weapon classification using a deep convolutional neural network. In Proceedings of the 2019 IEEE Conference on Information and Communication Technology, Lucknow, India, 6–8 December 2019; IEEE: New York, NY, USA, 2019; pp. 1–5. [Google Scholar]
Egiazarov, A.; Zennaro, F.M.; Mavroeidis, V. Firearm detection via convolutional neural networks: Comparing a semantic segmentation model against end-to-end solutions. In Proceedings of the 2020 IEEE International Conference on Big Data (Big Data), Atlanta, GA, USA, 10–13 December 2020; IEEE: New York, NY, USA, 2020; pp. 1796–1804. [Google Scholar]
Grega, M.; Matiolański, A.; Guzik, P.; Leszczuk, M. Automated detection of firearms and knives in a CCTV image. Sensors 2016, 16, 47. [Google Scholar] [CrossRef]
Navalgund, U.V.; Priyadharshini, K. Crime intention detection system using deep learning. In Proceedings of the 2018 International Conference on Circuits and Systems in Digital Enterprise Technology (ICCSDET), Canterbury, UK, 21–22 August 2018; IEEE: New York, NY, USA, 2018; pp. 1–6. [Google Scholar]
Hnoohom, N.; Chotivatunyu, P.; Jitpattanakul, A. ACF: An armed CCTV footage dataset for enhancing weapon detection. Sensors 2022, 22, 7158. [Google Scholar] [CrossRef] [PubMed]
Narejo, S.; Pandey, B.; Vargas, D.E.; Rodriguez, C.; Anjum, M.R. Weapon detection using YOLO V3 for smart surveillance system. Math. Probl. Eng. 2021, 2021, 9975700. [Google Scholar] [CrossRef]
Sivakumar, H.; Balamurugan, G. Novel Deep Learning Pipeline for Automatic Weapon Detection. In Proceedings of the 2023 IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE), Brisbane, Australia, 4–6 December 2023; IEEE: New York, NY, USA, 2023; pp. 1–5. [Google Scholar]
Talib, M.; Saud, J.H. A Multi-Weapon Detection Using Deep Learning. Iraqi J. Inf. Commun. Technol. 2024, 7, 11–22. [Google Scholar] [CrossRef]
Yadav, P.; Gupta, N.; Sharma, P.K. WeaponVision AI: A software for strengthening surveillance through deep learning in real-time automated weapon detection. Int. J. Inf. Technol. 2025, 17, 1717–1727. [Google Scholar] [CrossRef]
Aftab, R.M. An Expert System for Weapon Identification and Categorization Using Machine Learning Technique to Retrieve Appropriate Response. Lahore Garrison Univ. Res. J. Comput. Sci. Inf. Technol. 2021, 5, 27–35. [Google Scholar] [CrossRef]
Kaya, V.; Tuncer, S.; Baran, A. Detection and classification of different weapon types using deep learning. Appl. Sci. 2021, 11, 7535. [Google Scholar] [CrossRef]
Al-Mousa, A.; Alzaibaq, O.Z.; Hashyeh, Y.K.A. Deep learning-based real-time weapon detection system. Int. J. Comput. Digit. Syst. 2023, 14, 531–540. [Google Scholar] [CrossRef]
Vallez, N.; Velasco-Mata, A.; Corroto, J.J.; Deniz, O. Weapon detection for particular scenarios using deep learning. In Proceedings of the Pattern Recognition and Image Analysis: 9th Iberian Conference, IbPRIA 2019, Madrid, Spain, 1–4 July 2019; Springer International Publishing: Heidelberg, Germany, 2019; pp. 371–382. [Google Scholar]
Ruiz-Santaquiteria, J.; Velasco-Mata, A.; Vallez, N.; Deniz, Ó.; Bueno, G. Improving handgun detection through a combination of visual features and body pose-based data. Pattern Recognit. 2023, 136, 109252. [Google Scholar] [CrossRef]
Idakwo, M.A.; Yoro, R.E.; Achimugu, P.; Achimugu, O. An Improved Weapons Detection and Classification System. J. Netw. Innov. Comput. 2024, 12, 1–10. [Google Scholar]
Bushra, S.N.; Shobana, G.; Maheswari, K.U.; Subramanian, N. Smart video surveillance-based weapon identification using Yolov5. In Proceedings of the 2022 International Conference on Electronic Systems and Intelligent Computing (ICESIC), Chennai, India, 22–23 April 2022; IEEE: New York, NY, USA, 2022; pp. 351–357. [Google Scholar]
Uganya, G.; Sudha, I.; Lakshmanan, V.; Shadrach, F.D.; Krishnammal, P.M.; Nandhini, T.J. Crime Scene Object Detection from Surveillance Video by using Tiny YOLO Algorithm. In Proceedings of the 2023, the 3rd International Conference on Pervasive Computing and Social Networking (ICPCSN), Salem, India, 16–17 March 2023; IEEE: New York, NY, USA, 2023; pp. 654–659. [Google Scholar]
Appavu, N. Real-Time Violence Recognition in CCTV Video Surveillance Enhanced by AI Deep Technique. In Proceedings of the 2025, the 6th International Conference on Mobile Computing and Sustainable Informatics (ICMCSI), Kuala Lumpur, Malaysia, 7–8 January 2025; IEEE: New York, NY, USA, 2025; pp. 1187–1193. [Google Scholar]
Fernandez-Carrobles, M.M.; Deniz, O.; Maroto, F. Gun and knife detection based on faster R-CNN for video surveillance. In Proceedings of the Iberian Conference on Pattern Recognition and Image Analysis, Madrid, Spain, 1–4 July 2019; Springer International Publishing: Cham, Switzerland, 2019; pp. 441–452. [Google Scholar]
Chatterjee, R.; Chatterjee, A.; Pradhan, M.R.; Acharya, B.; Choudhury, T. A Deep Learning-Based Efficient Firearms Monitoring Technique for Building Secure Smart Cities. IEEE Access 2023, 11, 37515–37524. [Google Scholar] [CrossRef]
Tram, N.T.K.; Son, D.T.; Thái, A.V. Weapon Detection Using Deep Learning. In Proceedings of the 12th International Symposium on Information and Communication Technology, Hanoi, Vietnam, 7–8 December 2023; pp. 101–109. [Google Scholar]
Berardini, D.; Migliorelli, L.; Galdelli, A.; Marín-Jiménez, M.J. Edge artificial intelligence and super-resolution for enhanced weapon detection in video surveillance. Eng. Appl. Artif. Intell. 2025, 140, 109684. [Google Scholar] [CrossRef]
Warsi, A.; Abdullah, M.; Husen, M.N.; Yahya, M.; Khan, S.; Jawaid, N. Gun detection system using YOLOv3. In Proceedings of the 2019 IEEE International Conference on Smart Instrumentation, Measurement and Application (ICSIMA), Kuala Lumpur, Malaysia, 27–29 August 2019; IEEE: New York, NY, USA, 2019; pp. 1–4. [Google Scholar]
Salido, J.; Lomas, V.; Ruiz-Santaquiteria, J.; Deniz, O. Automatic handgun detection with deep learning in video surveillance images. Appl. Sci. 2021, 11, 6085. [Google Scholar] [CrossRef]
Khalid, S.; Waqar, A.; Tahir, H.U.A.; Edo, O.C.; Tenebe, I.T. Weapon detection system for surveillance and security. In Proceedings of the 2023 International Conference on IT Innovation and Knowledge Discovery (ITIKD), Bangkok, Thailand, 23–24 March 2023; IEEE: New York, NY, USA, 2023; pp. 1–7. [Google Scholar]
González, J.L.; Zaccaro, C.; Álvarez-García, J.A.; Morillo, S.; Caparrini, F.S. Real-time gun detection in CCTV: An open problem. Neural Netw. 2020, 132, 297–308. [Google Scholar]
Raza, A.; Rustam, F.; Mallampati, B.; Gali, P.; Ashraf, I. Preventing crimes through gunshot recognition using novel feature engineering and a meta-learning approach. IEEE Access 2023, 11, 103115–103131. [Google Scholar] [CrossRef]
Jacob, E.I.R.; Sivasakthi, T.; Brindha, S.; Hariharasudhan, S.; Priyadharsan, M.; Vishal, V.S. An Enhanced Surveillance System for Gun and Knife Detection using YOLOv8 and Raspberry Pi. In Proceedings of the 2024 International Conference on Communication, Computing and Internet of Things (IC3IoT), Chennai, India, 15–16 February 2024; IEEE: New York, NY, USA, 2024; pp. 1–6. [Google Scholar]
Idrees, H.; Shah, M.; Surette, R. Enhancing camera surveillance using computer vision: A research note. Polic. Int. J. 2018, 41, 292–307. [Google Scholar] [CrossRef]
Pang, L.; Liu, H.; Chen, Y.; Miao, J. Real-time concealed object detection from passive millimeter wave images based on the YOLOv3 algorithm. Sensors 2020, 20, 1678. [Google Scholar] [CrossRef]
Xu, S.; Hung, K. Development of an AI-based system for automatic detection and recognition of weapons in surveillance videos. In Proceedings of the 2020 IEEE 10th Symposium on Computer Applications & Industrial Electronics (ISCAIE), Penang, Malaysia, 18–19 April 2020; IEEE: New York, NY, USA, 2020; pp. 48–52. [Google Scholar]
Akbulut, Y.; Khalaf, R. Smart Arms Detection System Using YOLO Algorithm and OpenCV Libraries. Turk. J. Sci. Technol. 2021, 16, 129–136. [Google Scholar]
Bhatti, M.T.; Khan, M.G.; Aslam, M.; Fiaz, M.J. Weapon detection in real-time CCTV videos using deep learning. IEEE Access 2021, 9, 34366–34382. [Google Scholar] [CrossRef]
Hashmi, T.S.S.; Haq, N.U.; Fraz, M.M.; Shahzad, M. Application of deep learning for weapons detection in surveillance videos. In Proceedings of the 2021 International Conference on Digital Futures and Transformative Technologies (ICoDT2), Rawalpindi, Pakistan, 20–21 May 2021; IEEE: New York, NY, USA, 2021; pp. 1–6. [Google Scholar]
Qi, D.; Tan, W.; Liu, Z.; Yao, Q.; Liu, J. A dataset and system for real-time gun detection in surveillance video using deep learning. In Proceedings of the 2021 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Melbourne, Australia, 17–20 October 2021; IEEE: New York, NY, USA; pp. 667–672. [Google Scholar]
Ramon, A.O.; Guaman, L.B. Detection of weapons using Efficient Net and Yolo v3. In Proceedings of the 2021 IEEE Latin American Conference on Computational Intelligence (LA-CCI), Temuco, Chile, 17–19 November 2021; IEEE: New York, NY, USA, 2021; pp. 1–6. [Google Scholar]
Ruiz-Santaquiteria, J.; Velasco-Mata, A.; Vallez, N.; Bueno, G.; Álvarez-García, J.A.; Deniz, Ó. Handgun detection using combined human pose and weapon appearance. IEEE Access 2021, 9, 123815–123826. [Google Scholar] [CrossRef]
Ahmed, S.; Bhatti, M.T.; Khan, M.G.; Lövström, B.; Shahid, M. Development and optimization of deep learning models for weapon detection in surveillance videos. Appl. Sci. 2022, 12, 5772. [Google Scholar] [CrossRef]
Ashraf, A.H.; Imran, M.; Qahtani, A.M.; Alsufyani, A.; Almutiry, O.; Mahmood, A.; Attique, M.; Habib, M. Weapons detection for security and video surveillance using CNN and YOLO-v5s. CMC-Comput. Mater. Contin 2022, 70, 2761–2775. [Google Scholar]
Jadhav, R.S. Automatic Weapon Detection in CCTV Systems Using Deep Learning. Ph.D. Thesis, National College of Ireland, Dublin, Ireland, 2022. [Google Scholar]
Kiran, A.; Purushotham, P.; Priya, D.D. Weapon Detection using Artificial Intelligence and Deep Learning for Security Applications. In Proceedings of the 2022 International Conference on Advancements in Smart, Secure and Intelligent Computing (ASSIC), Bhubaneswar, India, 25–26 November 2022; IEEE: New York, NY, USA, 2022; pp. 1–5. [Google Scholar]
Manikandan, V.P.; Rahamathunnisa, U. A neural network aided attuned scheme for gun detection in video surveillance images. Image Vis. Comput. 2022, 120, 104406. [Google Scholar] [CrossRef]
Rasheed, O.; Ishaq, A.; Asad, M.; Hashmi, T.S.S. Multiplatform Surveillance System for Weapon Detection using YOLOv5. In Proceedings of the 2022, the 17th International Conference on Emerging Technologies (ICET), Swat, Pakistan, 28–29 November 2022; IEEE: New York, NY, USA, 2022; pp. 37–42. [Google Scholar]
Doan, T.S.; Nguyen, T.K.T.; Vo, T.A. Weapon Detection with YOLO Model Version 5, 7, 8. 2023. Available online: https://elib.vku.udn.vn/handle/123456789/2698 (accessed on 19 November 2025).
Khan, S.; Sayyed, M.; Yadav, S.; Bhalerao, B.; Patil, A.D. Weapon Detection and Alarm System Using Yolov5. Int. Res. J. Innov. Eng. Technol. 2023, 7, 102. [Google Scholar]
Nanda, S.K.; Ghai, D.; Ingole, P.; Pande, S. Analysis of video forensics system for detection of gun, mask, and anomaly using soft computing techniques. In Proceedings of the AIP Conference Proceedings, Medan, Indonesia, 7 December 2022; AIP Publishing: Melville, NY, USA, 2023; Volume 2800. [Google Scholar]
Naseeba, B.; Trisha, G.; Challa, N.P.; Varun, D. Real-time Weapon Surveillance using CV-based Motion Capture and DL-driven Analysis. In Proceedings of the 2023 International Conference on Innovative Computing, Intelligent Communication and Smart Electrical Systems (ICSES), Chennai, India, 16–17 June 2023; IEEE: New York, NY, USA, 2023; pp. 1–6. [Google Scholar]
Pullakandam, M.; Loya, K.; Salota, P.; Yanamala, R.M.R.; Javvaji, P.K. Weapon object detection using quantized yolov8. In Proceedings of the 2023, the 5th International Conference on Energy, Power and Environment: Towards Flexible Green Energy Technologies (ICEPE), Shillong, India, 9–10 June 2023; IEEE: New York, NY, USA, 2023; pp. 1–5. [Google Scholar]
Shah, A.; Kudalkar, D.; Shah, J.; Katre, N. Proposed Methodology for Real-Time Pistol Detection System. In Proceedings of the 2023 International Conference on Advanced Computing Technologies and Applications (ICACTA), Chennai, India, 21–22 December 2023; IEEE: New York, NY, USA, 2023; pp. 1–6. [Google Scholar]
Sumi, L.; Dey, S. YOLOv5-based weapon detection systems with data augmentation. Int. J. Comput. Appl. 2023, 45, 288–296. [Google Scholar] [CrossRef]
Arora, S.; Dalal, S.; Sethi, M.N. Interpretable features of YOLO v8 for Weapon Detection: A Performance-driven approach. In Proceedings of the 2024 International Conference on Emerging Innovations and Advanced Computing (INNOCOMP), Kochi, India, 11–12 April 2024; IEEE: New York, NY, USA, 2024; pp. 87–93. [Google Scholar]
Nadeem, M.S.; Kurugollu, F.; Atlam, H.F.; Franqueira, V.N.L. Weapon Violence Dataset 2.0: A synthetic dataset for violence detection. Data Brief 2024, 54, 110448. [Google Scholar] [CrossRef]
Yadav, P.; Gupta, N.; Sharma, P.K. Robust weapon detection in dark environments using Yolov7-DarkVision. Digit. Signal Process. 2024, 145, 104342. [Google Scholar] [CrossRef]
You, Y.; Wang, J.; Yu, Z.; Sun, Y.; Peng, Y.; Zhang, S.; Bian, S.; Wang, E.; Wu, W. A fine-grained detection network model for soldier targets adopting attack action. IEEE Access 2024, 12, 107445–107458. [Google Scholar] [CrossRef]
Ağdaş, M.T.; Türkoğlu, M.; Gülseçen, S. Deep neural networks based on transfer learning approaches to classification of gun and knife images. Sak. Univ. J. Comput. Inf. Sci. 2021, 4, 131–141. [Google Scholar] [CrossRef]
Belurkar, A.; Waghmare, A.; Mallick, S.; Waghamode, N.; Totare, R. Weapon Detection using Yolov4 CNN. Int. J. Res. Appl. Sci. Eng. Technol. 2022, 10, 2058–2062. [Google Scholar] [CrossRef]
Fathy, C.; Saleh, S.N. Integrating deep learning-based IoT and fog computing with software-defined networking for detecting weapons in video surveillance systems. Sensors 2022, 22, 5075. [Google Scholar] [CrossRef]
Lamas, A.; Tabik, S.; Montes, A.C.; Pérez-Hernández, F.; García, J.; Olmos, R.; Herrera, F. Human pose estimation for mitigating false negatives in weapon detection in video-surveillance. Neurocomputing 2022, 489, 488–503. [Google Scholar] [CrossRef]
Maddileti, T.; Sirisha, J.; Srinivas, R.; Saikumar, K. Pseudo Trained YOLO R_CNN Model for Weapon Detection with a Real-Time Kaggle Dataset. Int. J. Integr. Eng. 2022, 14, 131–145. [Google Scholar]
Tejashwini, P.S.; Nargund, S.G.; Sumuk, K.; Rohith, G.R.; Trinetra, B.M. Weapon Detection and Classification in CCTV Footage. Int. Res. J. Eng. Technol. 2022, 9, 1870–1872. [Google Scholar]
Borthakur, S.; Kumar, G.; Rajput, A.; Sarvaiya, J.N. Object Detection for Military Surveillance using YOLO Framework. In Proceedings of the 2023 IEEE 20th India Council International Conference (INDICON), Hyderabad, India, 14–16 December 2023; IEEE: New York, NY, USA, 2023; pp. 126–131. [Google Scholar]
Devasenapathy, D.; Raja, M.; Dwibedi, R.K.; Vinoth, N.; Jayasudha, T.; Ganesh, V.D. Artificial Neural Network using Image Processing for Digital Forensics Crime Scene Object Detection. In Proceedings of the 2023, the 2nd International Conference on Edge Computing and Applications (ICECAA), Namakkal, India, 19–21 July 2023; IEEE: New York, NY, USA, 2023; pp. 652–656. [Google Scholar]
Dugyala, R.; Reddy, M.V.V.; Reddy, C.T.; Vijendar, G. Weapon detection in surveillance videos using Yolov8 and Pelsf-DCNN. E3S Web Conf. 2023, 391, 01071. [Google Scholar] [CrossRef]
González, J.L.; Salazar, J.A.; Álvarez-García, F.J.; Rendón-Segador, F.C. Conditioned cooperative training for semi-supervised weapon detection. Neural Netw. 2023, 167, 489–501. [Google Scholar] [CrossRef]
Sharma, P.; Arora, S. The All-Seeing Eye for Constructive Weapon Detection Using YOLOv8 Object Detection Model. In Proceedings of the 5th International Conference on Information Management & Machine Intelligence, Jaipur, India, 21–22 December 2023; pp. 1–5. [Google Scholar]
Yadav, R.; Halder, R.; Thakur, A.; Banda, G. A Lightweight Deep Learning-based Weapon Detection Model for Mobile Robots. In Proceedings of the 2023 6th International Conference on Advances in Robotics, Goa, India, 22–24 June 2023; pp. 1–6. [Google Scholar]
Abdullah, M.; Al-Noori, A.H.Y.; Suad, J.; Tariq, E. A multi-weapon detection using ensembled learning. J. Intell. Syst. 2024, 33, 20230060. [Google Scholar] [CrossRef]
Akhila, K.; Ahmed, K.R. Real Time Deep Learning Weapon Detection Techniques for Mitigating Lone Wolf Attacks. arXiv 2024, arXiv:2405.14148. [Google Scholar] [CrossRef]
Berardini, D.; Migliorelli, L.; Galdelli, A.; Frontoni, E.; Moccia, A.M.E.S. A deep-learning framework running on edge devices for handgun and knife detection from indoor video-surveillance cameras. Multimed. Tools Appl. 2024, 83, 19109–19127. [Google Scholar] [CrossRef]
Keerthana, S.M.; Sujitha, R.; Yazhini, P. Weapon Detection for Security Using the Yolo Algorithm with Email Alert Notification. In Proceedings of the 2024 International Conference on Innovations and Challenges in Emerging Technologies (ICICET), Coimbatore, India, 11–12 March 2024; IEEE: New York, NY, USA, 2024; pp. 1–6. [Google Scholar]
Martinez, H.; Rodriguez-Lozano, F.J.; León-García, F.; Palomares, J.M.; Olivares, J. Distributed Fog computing system for weapon detection and face recognition. J. Netw. Comput. Appl. 2024, 232, 104026. [Google Scholar] [CrossRef]
Mukto, M.M.; Hasan, M.; Al Mahmud, M.M.; Haque, I.; Ahmed, M.A.; Jabid, T.; Ali, M.S.; Rashid, M.R.A.; Islam, M.M.; Islam, M. Design of a real-time crime monitoring system using deep learning techniques. Intell. Syst. Appl. 2024, 21, 200311. [Google Scholar] [CrossRef]
Rao, A.S.V.; Kainth, S.; Bhattacharya, A.; Amgoth, T. An efficient weapon detection system using NSGCU-DCNN classifier in surveillance. Expert Syst. Appl. 2024, 255, 124800. [Google Scholar] [CrossRef]
Sivakumar, M.; Ruthwik, M.S.; Amruth, G.; Bellam, K. An Enhanced Weapon Detection System using Deep Learning. In Proceedings of the 2024, the 2nd International Conference on Networking and Communications (ICNWC), Chennai, India, 22–23 February 2024; IEEE: New York, NY, USA, 2024; pp. 1–8. [Google Scholar]
Apeh, S.T.; Ajao, L.A.; Nyitamen, D.S.; Wamdeo, C.L.; Edeh, R. Artificial Intelligence-Based Wireless Sensor Network Model for Intrusion Detection and Firearms Image Detection in the Conflict Zone. Cloud Comput. Data Sci. 2025, 6, 94–114. [Google Scholar] [CrossRef]

Figure 1. Article Count from Conferences.

Figure 2. Article Count from Journals.

Figure 3. Articles Selection Process.

Figure 4. Article Count—Distribution by Year.

Figure 5. Division as per the Source of Distribution.

Figure 6. AI-Driven Weapon Detection Workflow.

Figure 7. Evolution Timeline of YOLO Models.

Figure 8. Gun Detection Workflow Diagram Using Artificial Intelligence.

Figure 9. Knife Detection Workflow Diagram Using Artificial Intelligence.

Figure 10. Handguns and Knife Detection Workflow Diagram Using Artificial Intelligence.

Figure 11. Year-Wise Distribution of Base Detection Models.

Figure 12. Relationship between the model and the dataset in weapon detection.

Table 1. Summary of Key Studies on Weapon Detection.

Ref	Assess the Subject	Period of Coverage	Year
[6]	Exploring Knife and Handgun Detection Algorithms	From 2013 to 2018	2020
[7]	An overview of computer vision-based ideas, techniques, and uses for automatic gun and knife detection.	From 1962 to 2020	2021
[5]	This study evaluates various classical machine learning and deep learning algorithms for the practical identification of weapons.	From 2011 to 2022	2023
[3]	This study systematically reviews deep learning-based weapon detection research to identify commonly used models, dataset characteristics, and key challenges in automatic weapon detection.	From 2014 to 2022	2024

Table 2. Selection Criteria for the Literature: Inclusion and Exclusion.

Genre of Criteria	Inclusion	Exclusion
Language	English Language Publications	Non-English Publications
Year of Publication	Research released from 2016 to 2025	Before 2016, studies were published.
Type of Publication	Peer-reviewed conference papers and journal articles	Sources that are not subjected to peer review, such as preprints, theses, blogs, and reports
Significance of the Topic	Research was conducted on the use of AI, ML, or DL to detect weapons in surveillance footage or photos.	Studies that concentrated on other types of item detection or had no relation to AI, ML, or DL-based weapon detection.
Replication	Once, unique records are included.	Excluded are duplicate studies.
Availability	The full text is available for review.	Inaccessible or abstract-only papers are not included.

Table 3. Summary Table—Gun Detection.

Ref	Dataset	Proposed Method	Results Obtained	Advantages	Drawbacks	Year
[4]	Private Images	Otsu Algorithm	Accuracy: 99%	Automatic Weapon Detection	Need to Improve the Real-Time Settings.	2016
[9]	Internet Movie Firearms Database (IMFDB)	Faster R-CNN with VGG-16 architecture, SVM classifier, ImageNet pre-trained weights	Accuracy: 93.1%	Faster training with pre-trained weights, superior to earlier methods	Requires further optimization to reduce computational cost and improve real-time performance.	2017
[55]	Annotated Datasets	Computer Vision	Not Specified	To Monitor Effectively	Need to analyze the performance results.	2018
[10]	Internet Movie Firearms Database (IMFDB)	Darknet—YOLO, CNN	Accuracy: 96.26% mAP: 70.72%	Robustness and Feasibility	Low Quality Images.	2019
[1]	Open Source Platforms.	Faster R-CNN	Accuracy: 99%	High Speed for object detection	Need to train this model for real-time surveillance images.	2019
[2]	Private Dataset	CNN	Accuracy: 97.78%	Automatic Real-Time Firearm Detection Methodology	Need to reduce the false positives.	2019
[39]	Private Dataset (University of Seville, Spain)	Faster RCNN	Precision: 77.24%	The number of False Positives (FP) is reduced	Need to implement another detection framework.	2019
[49]	Custom dataset created by combining handgun images with ImageNet	YOLOv3 for real-time firearm detection, evaluated against Faster R-CNN	Precision: 96.51%. F1 Score: 75%	High processing speed (45 fps), reliable detection of pistols even with partial occlusions	Challenges with low-quality or obscured video frames.	2019
[52]	CCTV Dataset. Synthetic Dataset	Faster R-CNN object detector that uses FPN (Feature Pyramid Network)	IoU: 0.50. Confidence Level: 0.95/0.99	More Accuracy	Not including other weapons.	2020
[56]	Private Dataset (Beihang University, China)	YOLO v3 and SSD	mAP: 95%	More Effective and Feasible	Need more training tests.	2020
[11]	COCO Dataset	Mask R-CNN	Precision: 99.5%. Recall: 97.9%	High Accuracy	Need to increase the dataset sizes.	2020
[57]	COCO Dataset	Single Shot Multibox Detector (SSD) and Convolutional Neural Network (CNN)	Precision: 85%	Real-time processing capability	Performance issues with low-resolution video input.	2020
[36]	Private Dataset	CNN and YOLO	Accuracy: 97.01% Recall: 95.7% Precision: 98% F1: 96.9%	High Accuracy	Need to extend the dataset.	2021
[58]	COCO and Own Dataset	YOLO	Accuracy: 95% (Maximum)	It is Very Simple and Accurate	Need to extend large datasets.	2021
[59]	Custom dataset sourced from personal photos, YouTube videos, GitHub repositories, the University of Granada dataset, and the IMFDB	Object classification/Object detection—Yolo v4	F1 Score: 91%. mAP: 91.73%	High Precision and Reliable	Currently, it is just focused on small types of weapons.	2021
[60]	Google Images	YOLO v3 and YOLO v4	YOLOv3 and YOLOv4: Precision: 84%, 85% Recall: 71%, 78% F1 Score: 77%, 82% mAP: 77.30%, 84.85%	Better Performance	Need to extend the large datasets.	2021
[61]	To collect from various sources (Public Dataset)	YOLO v3, Simplified CenterNet	Accuracy: ResNet18: 96.97% ResNet34: 97.57% ResNet50: 97.83%	High Accuracy Rate	False Positive Rate is High.	2021
[62]	From Internet	Yolo v3, Efficient Net	Yolo v3: Precision: 80% Recall: 74%. F1: 77%	Best Model for detecting Firearms.	Need to take a variety of datasets.	2021
[63]	Multi Sources (YouTube, Synthetic Dataset)	Deep Learning, Hand Region Classifier (HRC) and Pose Data (P)	HRC and HRC + P: Precision: 96.83%, 90.18% Recall: 20.33%, 33.67% AP: 24.72%, 34.68%	Better Performance	Need to reduce the false positives.	2021
[50]	Custom Handgun Image Dataset	YOLO-based deep learning model	Highest Recall: 97.23% Precision Values: 96.23%	Real-time detection with low computational cost	Struggles with crowded scenes or partially obscured weapons.	2021
[64]	Custom-labeled dataset	State-of-the-art Scaled-YOLOv4	mAP:92.1%	Robust detection across conditions	Improve performance on low-res videos, add weapon types.	2022
[65]	Private Dataset (University of Granada) and IMFDB	YOLOv5 and Faster R-CNN	Precision: 99.5% Recall:84.6% F1:91.4%	Accurate Detection	Need to increase the contrast and brightness.	2022
[66]	Kaggle Websites, Open Images	YOLO v4, CSP DarkNet53	mAP: 100% (Kaggle). mAP: 71.58% (Open images). YOLO v4: mAP:86.19%. Precision:79%. F1:77%	Better Performance	Need to reduce the false positives and negatives.	2022
[67]	Public Dataset	RCNN	Accuracy: 84.6%	High Accuracy	Need to train various models.	2022
[68]	Public Dataset (YouTube Videos)	CNN and AODS (Attuned Object Detection Scheme)	Accuracy: 86.92% Precision: 89%. F1: 91%	Better Performance	Need to identify other objects.	2022
[13]	Open Source Dataset	YOLOv3 model	Improvement of mAP 10% and FPS 12%	Suitable for live surveillance with real-time processing	Difficulties in low-light conditions or with partially hidden firearms.	2022
[69]	Multi Sources (Films, CCTV, and Google Images)	YOLO v3, v4, v5	YOLO v5: Precision: 88%, Recall: 83%, F1: 86%, mAP: 87.8%	High Accuracy	Need to use recent versions.	2022
[38]	Private Dataset (Princess Sumaya University for Technology—Jordan)	CNN	Accuracy: 92.5% Precision: 93.4% Recall: 91.5%. F1: 92.4%	High Speed	Need to extend the datasets.	2023
[46]	WIDER FACE dataset and the Internet Movie Firearms Database	Faster RCNN	mAP 0.50: 77.02% mAP 0.75: 15.49%	Best Performance	Need to perform Real-Time Testing.	2023.
[70]	Pistol Detection, Pistol Classification, Sohas Weapon Datasets	YOLO v5, v7, v8	Yolo v7-E6: Precision: 92.2% Recall: 90.2% Yolo v8-X: Precision: 95.6% Recall: 82.2%	Better Performances	Need to improve results across devices.	2023
[51]	Publicly available dataset	YOLOv5-based weapon detection	Precision: 94.63% Recall: 96.25% Accuracy: 95.43%	High Accuracy and Robustness	The system may face some challenges in detecting small firearms.	2023
[71]	Custom Pistol dataset	YOLO V5	95% accuracy in detecting pistols	Efficiency and Accuracy	Detect more dangerous objects.	2023
[14]	YouTube CCTV recordings, film data, camera pictures, and online photographs	YOLO v7	mAP: 87.3% F1 Score: 91% Confidence Score: 98%	Enhance the Security	Need to reduce the false positives and negatives.	2023
[72]	GitHub and Kaggle Websites	Customized CNN, YOLO	Accuracy (Gun Detection) YOLO: 100% Customized CNN: 61.5%	High Speed.	Need to extend the larger dataset.	2023
[73]	Created Own Dataset	VGG-16. Faster RCNN	Accuracy: VGG16—92.5% Faster RCNN—94.8%	High Speed. Superior Accuracy	Need to utilize the other weapon detection.	2023
[74]	Custom Dataset	YOLO v5/YOLO v8	mAP: 90.1% (YOLOv8)	Highly accurate	Need to improve and optimize the framework.	2023
[12]	COCO	Yolo v5	Precision: 95% Recall: 92%. F1: 94%	It is High Speed and Efficiency	Need to Expand the Dataset.	2023
[53]	Public Dataset (YouTube)	Discrete Wavelet Transform Random Forest Probabilistic (DWT-RFP)	99% Accuracy	Accurate Gunshot Detection and Preventing Crimes	Background Noise. High Computational Complexity.	2023
[40]	Movie Database and YouTube Clips, COCO Dataset.	Vision Transformer (ViT) and CNN	AP Score: 91.73%	It Reaches a High AP Score	It relies heavily on the pose estimation method.	2023
[75]	COCO Dataset	YOLO (v3, v4, v5), Darknet-53 and CSPNet	YOLO v5: mAP50—98.60 and F1 Score: 98%	Very Light, Quick, and accurate model	Need to extend the system to detect other kinds of weapons.	2023
[76]	Custom dataset. Unity synthetic dataset. Mock attack dataset	Yolov5-based weapon detection system	Precision: 97%	Better Performance. Enhanced detection system robustness	Improve accuracy.	2023
[17]	Synthetic Datasets and Real-World CCTV Footage	YOLO v4 configuration with SCSP-ResNet (Spatial Cross Stage Partial)	mAP of 81.75% F1 Score: 80%. Recall: 77%. Precision: 84%.	Efficient and quicker detection of small objects Good detection under challenging scenarios	Optimizing and enhancing datasets with complex background assessment.	2023
[77]	Customized Dataset	Yolo v8	AP: 90.5%	To identify the weapons accurately	Should concentrate on improving its capabilities through dataset enrichment and algorithmic improvements.	2024
[15]	Mobile Guns Dataset, Monash Guns Dataset	Open Pose, DarkNet 53.	Temporal Pose: Accuracy: 91%. Precision: 91%. F1: 91%. Temporal Spatial: Accuracy: 89.80%. Precision: 89.91%. F1: 89.73%.	Better Performance in Gun Detection	Need to consider other temporal representations.	2024
[16]	Hockey Fights Dataset	Convolutional Neural Networks (CNN) with Bidirectional Long Short-Term Memory (Bi-LSTM). YOLO v8/v9	Training Accuracy: 99.85% Validation Accuracy: 98.19%	It Enhances the Security Measures	Need to utilize the other weapon detection.	2024
[78]	Weapon Violence Dataset (WVD)	Implementation of the Weapon Violence Dataset (WVD)	Accuracy: 87%	Scalability	Lack of randomness. The issue of authenticity.	2024
[18]	Gunshot Audio Dataset and the Gunshot Audio Forensics Dataset	YAMNet for audio and InceptionV3 for image recognition via transfer learning	94.96% accuracy	Combines audio and visual data for improved robustness	Explore more diverse audio datasets for broader environment coverage.	2024
[79]	From the Internet	YOLO V7	Precision Score: 95.50% F1-Score: 93.41	Enhanced Nighttime Performance	There may be some limitations in low-light conditions.	2024
[80]	Multi Sources (Internet, YouTube)	YOLO v8 AD	Precision: 73.5% Recall: 61.4% mAP50: 67.7%	Secure Monitoring	Need to reduce the difficulty of training.	2024

Table 4. Summary Table—Knife Detection.

Ref	Dataset	Proposed Method	Results Obtained	Advantages	Drawbacks	Year
[19]	From Internet	Features from Accelerated Segment Test (FAST)	Accuracy: 98.48% F1: 98% Precision: 97% Recall: 98%	Better Performance	Need to conduct extensive tests.	2017
[20]	Private Dataset	CNN and DaCoLT	Precision: 100% F1 Score: 84.15% Recall: 72.65%	Automatic Weapon Detection	It is challenging to detect weapons outdoors.	2019
[24]	Public Images (X-Ray)	Improved SSD model with Feature Pyramid Networks (FPN)	87% mean average precision (mAP) Improved mAP is 91.3%	Effective for small objects, good performance in complex backgrounds	Struggles with similar objects and varied lighting conditions.	2019
[21]	Private Dataset	MobileNet, PoseNet, and MaskRCNN	Accuracy: 95%.	Its Multi-View Points	Need to reduce False positives.	2020
[25]	Own Dataset	Infrared Lighting	Not Specified	It is easy to acquire	Need to improve and acquire results.	2021
[22]	COCO (Common Objects in Context). Open Images Dataset Custom Weapon Dataset	Convolutional Neural Networks (CNNs)	Accuracy: 88%	High accuracy and reliability Real-time detection	The current system primarily focuses on detecting and classifying weapons in individual images.	2021
[26]	MMW and THz Datasets	Yolo v5	Precision: MMW: 98.97% THz: 97.15%	Better Results	Need to reduce computation complexity.	2023
[23]	Private Dataset (GitHub)	YOLO v5	YOLO v5s: Precision:97.7%, Recall: 94%, mAP50: 97.6%	Faster Training and applicability, High recall and precision scores	Expansion of the Dataset and Real-World Implementation.	2024

Table 5. Summary Table—Weapon (Knife and Gun) Detection.

Ref	Dataset	Proposed Method	Results Obtained	Advantages	Drawbacks	Year
[29]	Custom CCTV Footage	Hybrid CNN and Image Processing	90% accuracy	Real-time operation	Occasional false positives.	2016
[30]	Public Dataset (YouTube Videos)	FRCNN, VGGNET 19	VGGNet19: Accuracy:91%, Recall: 93%, F1: 92%	Better Results	Need to train various models.	2018
[27]	Internet-sourced and lab-captured images	Deep Convolutional Neural Networks (DCNN)	Accuracy: 99%	Model efficiency	A current model trained on small datasets.	2019
[45]	Private Dataset	Faster R-CNN model, GoogleNet, SqueezeNet	Gun Detection: AP 85.44%. Knife Detection: AP 46.68%	High Accuracy Rate	Need to improve the new architecture.	2019
[28]	Custom Dataset	CNN	Accuracy: 92.5 (Full AR)	Reliability, Flexibility	Need to improve various detection approaches.	2020
[81]	Open Access Datasets	VGG16 and ResNet50 with Transfer Learning	Accuracy: 99.73% and 99.67%	Effective transfer learning, high accuracy	Sensitivity to background variations, limited dataset size.	2021
[37]	Collecting from the internet	VGGNet Architecture	98.40% accuracy	Robust real-time detection	Low-light and occluded weapons detection issues.	2021
[32]	PMMW Dataset. Custom Dataset	Yolo v3	Accuracy: 98.89%	Better security systems	The challenge is to detect weapons in complex environments.	2021
[82]	Not Specified	YOLOv4 and Harris Corner Detection Algorithm	Not Specified	Faster FPS	Need to analyze the results.	2022
[42]	GitHub Repository	YOLO V5	Accuracy: 98%	Speed and efficiency	Need to improve model performance across more complex, diverse scenarios.	2022
[83]	SOHAs Weapon Dataset	Software Defined Network (SDN) and YOLO v5	Average Throughput: 75%	High Average Throughput. Low Packet Loss	Need to improve the Results further.	2022
[31]	Armed CCTV Footage	Faster RCNN and SSD MobileNet	mAP: 75.8%	Better Performance	Need to improve object detection accuracy.	2022
[84]	Private (Sohas Weapon) Dataset	Top Down Weapon Detection Over Pose Estimation	Precision: 91.6%. Recall: 79.5% F1: 84.8%.	It is More Robust.	Need to improve weapon detection across various weapons.	2022
[85]	Kaggle Website	YOLO v3 and RCNN	mAP: 85%, Throughput: 94.23%	High Effective	Need Modification for the future.	2022
[86]	COCO Dataset. Open Images Dataset	Yolov5	Accuracy: 87% (Pistol). 96% (Knives)	High Accuracy	Computational and Processing Overheads.	2022
[87]	Public Dataset	YOLO v5 and Slicing Aided Hyper Inference (SAHI)	YOLO v5: Precision: 83.5% Recall:79.4% mAP50: 82.3%	High Accuracy	There is a need to improve the structure in military applications.	2023
[88]	Multi Sources	Artificial Neural Network and Histogram Oriented Gradient	Confusion Matrix: 89.78%	Better Performance	Need to expand the Dataset.	2023
[89]	Not Specified	Yolo v8 and PELSF–DCNN.	Accuracy: 97.5%	Better Results.	Need to extend by detecting other weapons.	2023
[90]	Custom Surveillance Dataset	Conditioned Cooperative Training	UGR Metrics: 71.32. YouTube-GDD Metrics: 58.56	Reduced data annotation costs	Struggles with complex scenes.	2023
[91]	Custom Dataset (From Public/Private Datasets)	YOLO v8	mAP: 88.2%	Very Fast	Need to extend larger datasets.	2023
[33]	Weapon Detection Dataset (WDD), Gun Dataset (GD), and the Gun Object Detection dataset (GDD)	YOLO v7 and Faster RCNN	Accuracy: 94%. Precision: 95.7%	More, Faster, and More Accurate	Need to optimize the structure.	2023
[47]	Public Sources, Handgun and Knife Detection Datasets, Data on Roboflow	YOLO v5/v7/v8 and Swin Transformer Model	YOLO v7-E6 Precision: 91.3% mAP50: 91.7%	Better Accuracy	Need to improve in the future.	2023
[43]	Numerous Websites	Tiny YOLO	Precision: 96% Recall: 90% F1-Score: 92%	Efficient for low-resource devices	Enhance small object detection and accuracy in cluttered scenes.	2023
[92]	Dataset: Sohas, Handgun, and Private (Own)	YOLO-WL Model	Own Dataset: Precision: 94.4%, Recall: 85%	Better Performance	High Computation Cost.	2023
[93]	Private Dataset	YOLO v8	mAP: 97.2%	High Accuracy	Need to expand the other object classes.	2024
[94]	Numerous Sources(YouTube and Google)	Yolo v5 and Faster RCNN	Yolo v5: F1 Score: 78% Faster RCNN: AP: 89%	Achieve Better Results	To avoid the selection bias.	2024
[95]	Public Dataset (YouTube Videos)	CNN	AP: 79.30	Automated Video Surveillance	The dataset needs to be taken from outdoors.	2024
[41]	Mock Attack Dataset	Modified Capsule Network	Accuracy: 99.43%	High Accuracy	Need to test various datasets.	2024
[54]	Multi Sources	YOLO v8	Accuracy: 90.9%, Precision: 95.3%, Recall: 96%, F1: 87%	Better Performance	Dataset Expansion. Fine Tuning.	2024
[96]	Multiple Sources.	Yolo v8	Precision: 98%, Recall: 95%, F1: 94%	Speed and Accuracy	Should work on different datasets.	2024
[97]	Terahertz Human Dataset and ImageNet	Leveraging Edge, Fog, and Cloud Computing	Not Discussed	High and Simple Scalability	Need to Discuss the Results.	2024
[98]	Numerous Websites	YOLO v5	F1: 86%. Performance: 97%	Suitable for real-time monitoring	Reduce false positives, improve speed in crowds.	2024
[99]	Public Dataset (From Internet)	NSGCU-DCNN	Accuracy: 97.85%. Sensitivity: 98.24%. Specificity: 96.81%	More Reliable and Robust	Need to Consider Other Weapons.	2024
[100]	Customized Dataset	VGG-16. ResNet 50. ResNet 101	Accuracy: VGG-16—89.75% ResNet50—93.70%. ResNet101—83.33%	It is Dealing with the Complex Problem	Need to develop a robust infrastructure.	2024
[34]	Synthetic Dataset	Yolo v7	mAP: 97%	Better Efficiency of security forces	Should develop the autonomous security systems.	2024
[44]	Collect From Movies, Violent Fight Dataset, Hockey Fight	CNN	VioNet model 99.72%, InceptionResNetV2 90%, ViolenceNet pseudo-97, and InceptionV3 92%	Greater Accuracy	Need to extend the dataset.	2025
[101]	From CCTV	DCNN and YOLO v5	Intrusion detection accuracy is 97.17%, Sensitivity is 1.0000, and precision is 0.3750. Firearm detection accuracy is 100%	Better Performance in Accuracy	The researchers did not use the YOLO Updated Version.	2025
[48]	Weapon Sense dataset	YOLOSR Integrates YOLO v8	AP 50: 66 (Hand Gun) and 44.20 (Knife) F1: 66.60 (Hand Gun) and 46.60 (Knife)	Learning Better Features	The researchers do not discuss the Accuracy, Precision, and F1 Score.	2025
[35]	Pistol Detection Dataset (PDD), YouTube Gun Detection Dataset (YouTubeGDD), Mock Attack Dataset (MAD), Weapon Detection Dataset (WDD)	Weapon Vision AI, YOLO V7	Precision rate of 91.75% and Mean average precision of 92.15%	Accurate Weapon Identification	It is focused on only limited weapon.	2025

Table 6. Overview of publicly available dataset metrics.

Public Datasets
Ref	Dataset Type	Weapon Type	No. of Images/Videos	Year
[19]	Images	Knife	1500	2017
[30]	Images	Gun, Knife	NA	2018
[24]	Images	Knife	5672	2019
[81]	Images	Gun, Knife	13,000	2021
[13]	Videos	Guns	22,000	2022
[51]	Images	Guns	2971	2023
[14]	Images	Guns	2693	2023
[79]	Images	Guns	15,367	2024

NA = Not Available.

Table 7. Overview of private dataset metrics.

Private Datasets
Ref	Dataset Type	Weapon Type	No. of Images/Videos	Year
[20]	Images	Knife	3702	2019
[2]	Videos	Knife	585	2019
[39]	Videos	Gun	871	2019
[21]	Images	Knife	12,799	2020
[36]	Images	Guns	7000	2021
[25]	Videos	Knife	NA	2021
[84]	Images	Knife, Gun	3250	2022
[23]	Images	Knife	2078	2024

NA = Not Available.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

AI-Based Weapon Detection for Security Surveillance: Recent Research Advances (2016–2025)

Abstract

1. Introduction

2. Research Methodology

3. Methodological Framework

3.1. Input Modality (Image/Video Based)

3.2. Model Type (ML, CNN, R-CNN, YOLO (v1–v8)

3.3. Application Environment (Real-Time Surveillance)

4. AI-Enhanced Smart City Surveillance—Weapon Detection: Gun

5. AI-Enhanced Smart City Surveillance—Weapon Detection: Knife

6. AI-Enhanced Smart City Surveillance—Weapon Detection: Gun and Knife

7. Research Summary—Key Findings

7.1. Dataset Limitations

7.2. Method Applicability Across Scenarios

7.3. Research Discussion—Datasets and Model Performance

7.3.1. Gun

7.3.2. Knife

7.3.3. Weapon—Gun and Knife

7.4. Hardware Settings

7.4.1. Gun

7.4.2. Knife

7.4.3. Gun and Knife

7.5. Challenges

8. Research Gap and Future Scope

9. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics