Article

Audiovisual Gun Detection with Automated Lockdown and PA Announcing IoT System for Schools

School of Engineering, Eastern Michigan University, Ypsilanti, MI 48197, USA
IoT 2026, 7(1), 15; https://doi.org/10.3390/iot7010015
Submission received: 15 December 2025 / Revised: 26 January 2026 / Accepted: 29 January 2026 / Published: 31 January 2026

Abstract

Gun violence in U.S. schools not only causes loss of life and physical injury but also leaves enduring psychological trauma, damages property, and results in significant economic losses. One way to reduce these losses is to detect the gun early, notify the police as soon as possible, and implement lockdown procedures immediately. In this project, a novel gun detector Internet of Things (IoT) system is developed that automatically detects the presence of a gun from images or from gunshot sounds and, within a second, sends notifications with exact location information to first responders' smartphones over the Internet. The device also sends wireless commands using the Message Queuing Telemetry Transport (MQTT) protocol to close the smart door locks in classrooms and automatically broadcasts instructions over the public address (PA) system. The proposed system removes the burden of manually calling the police and implementing the lockdown procedure during such traumatic situations. Police can arrive sooner, helping to stop the shooter early, so that injured people can be taken to the hospital quickly and more lives can be saved. Two custom deep learning models are used: (a) a model that detects guns from image data with an accuracy of 94.6%, and (b) a model that detects gunshot sounds from audio data with an accuracy of 99%. No single gun detector device in the literature can detect guns from both image and audio data, implement a lockdown, and make PA announcements automatically. A prototype of the proposed gun detector IoT system and a smartphone app were developed and tested in real time with gun replicas and blank guns.

1. Introduction

In 2025 alone, there were at least 141 incidents of gunfire on school grounds in the United States, resulting in 44 deaths and 129 injuries [1]. Studies show that the average duration between a gunshot and a 911 call is about 1 to 5 min; in some cases it may take even longer if witnesses are unsure about what they heard, are in shock, or fear potential danger [2]. According to first responders, every second matters in this type of situation. During an active shooter incident, the ALICE protocol [3] (Alert, Lockdown, Inform, Counter, and Evacuate) is implemented manually by humans, which causes delays in such time-critical situations. The goal of this project is to develop an innovative gun detection Internet of Things (IoT) system capable of automatically identifying the presence of a firearm from images, gunshot sounds, or both. Within a second, it sends notifications with location information to the smartphones of first responders and school administrators via the Internet. The system also sends wireless commands using the Message Queuing Telemetry Transport (MQTT) protocol to close the smart door locks in classrooms and automatically announces over the public address (PA) system where occupants should move to reach a safe location. The proposed system removes the burden of manually calling the police and implementing the lockdown procedure during such traumatic situations and can save lives.
The needs for and significance of the proposed system are outlined below:
  • Audiovisual detection: There are several software solutions [4,5,6,7,8,9] that integrate with existing security cameras to monitor live video feeds and detect firearms. One advantage of this approach is that it can detect a gun before the first shot. However, visual detection relies on security cameras having a clear line of sight to the firearm, which means its effectiveness is restricted by camera placement, angle, lighting conditions, and potential occlusion in the environment. In contrast, a gunshot sound can be heard and detected without a line of sight and from any corner, making sound-based systems [10,11,12,13,14,15,16] more effective in different environments. However, one shortcoming of this approach is that the gun is detected only after the first shot is fired. In this project, a gun is detected using both image and microphone sensors. The system detects guns in camera images before any shot is fired; if the gun is missed due to occlusion or lighting conditions, it is detected from the gunshot sound when the first bullet is fired, making the detection system more robust.
  • Auto lockdown: The proposed system automates the lockdown process by sending commands to the classroom door locks as soon as a gun is detected, saving time and effort in implementing the ALICE protocol.
  • Auto PA announcement: The system automatically announces over the PA system where the gun was detected and advises occupants to move away from that area. For instance, if a gun is detected in the north-side hallway, people outside the classrooms are automatically advised via the PA system to move towards the south. This mirrors current practice in schools, and automating it can save time and lives.
  • Privacy: In [4], camera images are continuously transmitted to cloud-based servers for primary classification, enabling persistent third-party access and raising privacy concerns. In contrast, the proposed system performs all image and audio classification locally within the private network. Images are sent to an external vision model only as a secondary verification step and only when the local classifier detects a potential firearm with high confidence. This event-driven transmission is infrequent and limited to high-risk scenarios and contains the same visual information that is provided to first responders for situational awareness. Therefore, the system does not enable continuous external surveillance and preserves user privacy.
The remainder of the paper is organized as follows. Section 2 reviews related works and positions the proposed approach within the existing literature. Section 3 describes the materials and methods, including image-based gun detection and sound-based gunshot detection using deep learning. Section 4 presents the prototype development, detailing the audio-visual detection system, MQTT broker integration, smart lock design, IoT-connected public address (PA) system, and the smartphone application. Section 5 reports the experimental results, including deep learning model performance and real-time prototype evaluation. Section 6 discusses key design decisions, limitations, and future improvements, and Section 7 concludes the paper.

2. Related Works

Related works, comprising commercial products and the published literature, are discussed below.

2.1. Commercial Products

Several commercial systems exist for firearm or gunshot detection, and a comparison of those systems with this work is shown in Table 1. ZeroEyes [4] analyzes live camera feeds using AI to detect visible firearms. Although effective when the gun is clearly visible, visual systems depend heavily on camera placement, lighting, and line of sight, and require human verification before sending alerts. In testing, notification delays of 25–30 s were observed, which limits the ability to prevent harm once a shooter draws a weapon. Sound-based systems overcome these limitations because audio travels in all directions and does not require line of sight. AmberBox [10] uses audio and infrared signatures to detect indoor gunshots in about 3.6 s. Outdoor city-scale solutions such as ShotSpotter [16] are costly (up to $90,000 per square mile annually) and designed for municipal deployment rather than indoor use in schools or facilities.
The proposed system provides both image-based and sound-based gun detection designed specifically for indoor environments such as schools, malls, banks, and houses of worship. Unlike AmberBox, the proposed system uses lightweight deep-learning models, enabling faster inference and quicker response times. Unlike ZeroEyes, no human confirmation is required, and the solution works even when a firearm is not visible to a camera. Compared with existing works, no single gun detector system on the market can detect guns from both image and audio data, implement a lockdown, and make PA announcements automatically; this research fills that gap.

2.2. Published Literature

Several recent studies have focused on IoT-based emergency response and access-control systems that emphasize reliability, security, and real-time communication rather than direct weapon detection. For instance, an IoT-enabled emergency alert and GPS tracking system was developed using an ESP32 microcontroller and a GSM/GPS module, where location and device status are periodically transmitted via MQTT to a cloud platform, and an SOS mechanism triggers emergency calls and SMS notifications with GPS coordinates [17]. In a related direction, the GUARDTRACK system introduced a multi-layer IoT-based smart entry and access-control framework that combines Wi-Fi-based presence tracking with RFID and one-time-password (OTP) authentication delivered via GSM [18]. By integrating device identification, network-level monitoring, and multi-factor authentication, the system effectively mitigates unauthorized access and RFID cloning attacks while maintaining real-time operational performance. Together, these studies highlight the growing role of IoT architectures, MQTT-based messaging, and embedded authentication mechanisms in building dependable safety and security infrastructures, which complement but do not directly address visual or acoustic weapon detection.
The related works on image-based gun detection, audio-based gunshot detection, and audiovisual gunshot detection are discussed below and compared with the proposed system.

2.2.1. Image-Based Gun Detection

A substantial body of research has explored image-based firearm detection using classical computer-vision techniques and modern deep-learning models. Early approaches primarily relied on handcrafted features, while recent work has shifted to convolutional neural networks (CNNs) and state-of-the-art object detection frameworks. A recent comprehensive review [19] surveys AI based approaches for visual weapon detection from 2016 to 2025, covering both gun and knife detection in complex indoor and outdoor environments. The review highlights the widespread adoption of real-time object detection frameworks, including YOLO, SSD, and Faster R-CNN variants, and summarizes their reported performance in terms of precision, recall, and mean average precision across diverse datasets. It also identifies persistent challenges such as false positives caused by visually similar objects, limited dataset diversity, and performance degradation under occlusion, motion blur, and poor lighting. The review further discusses emerging research directions, including data augmentation strategies, transformer-based architectures, and newer YOLO variants (e.g., YOLOv10–YOLOv12), as well as multimodal sensing, edge deployment, and explainable AI, which collectively represent promising avenues for improving future weapon-detection systems.
Jain et al. [5] proposed one of the earlier real-time gun-detection systems using the Haar Cascade classifier with OpenCV. Their method identifies firearm type and model directly from video frames, achieving good accuracy for specific weapon categories: 95% for submachine guns, 87.5% for assault rifles, and lower performance for pistols (80%). While lightweight and computationally efficient, Haar-based methods depend heavily on illumination, pose, and handcrafted features, limiting their robustness in real-world surveillance conditions.
With the rise of deep learning, several studies have adopted region-based detectors. Alaqil et al. [6] developed an automatic firearm-detection system using Faster R-CNN with multiple CNN backbones (Inception-ResNetV2, ResNet50, VGG16, MobileNetV2). Extensive experiments showed that Faster R-CNN with Inception-ResNetV2 achieved the highest mean average precision (mAP) of 82%.
Other researchers have explored more traditional yet optimized techniques. Debnath and Bhowmik [7] introduced a rotation- and scale-invariant template-matching method for detecting guns carried by a moving person. To reduce the high computational burden of template matching, the authors incorporated background subtraction, achieving improved efficiency and reaching an accuracy of 95%. Despite strong results, template-based methods generally struggle with occlusion, varied backgrounds, and diverse firearm appearances.
Mehta et al. [8] trained a YOLOv3-based model to detect gun violence and fire events in video streams. Their system achieved real-time performance at 45 FPS and reported high accuracy (89.3%) on datasets such as the Internet Movie Firearms Database (IMFDB).
More recent work has examined instance-segmentation methods. Goenka and Sitara [9] applied Mask R-CNN to detect weapons in surveillance imagery. They incorporated Gaussian deblurring as a preprocessing step to enhance handgun features, especially in blurred or low-quality frames. Their system achieved 82.76% accuracy.
Vallez et al. [20] presented a handgun detection framework that incorporates a deep autoencoder to reduce false positives in video surveillance systems. In their approach, a conventional object detector is first used to identify candidate handgun regions, after which a deep autoencoder is trained on false-positive samples collected from the target environment. During operation, detections that deviate significantly from the learned false-positive patterns are treated as true handgun events, while others are suppressed. Experimental evaluation on real surveillance footage demonstrated up to 87.2% reduction in false alarms.
P. S. and V. M. [21] proposed a hybrid framework that integrates Faster R-CNN and Mask R-CNN with YOLOv8, leveraging both high-resolution spatial feature extraction and fast frame-wise detection. The combined FMR-CNN–YOLOv8 model demonstrated strong robustness to occlusion, scale variation, and lighting changes, achieving a detection accuracy of 98.7% with improved localization performance, highlighting the potential of hybrid architectures for enhanced crime prevention in public surveillance environments.
Arockia Abins et al. [22] proposed a deep learning–based weapon detection system designed for rapid threat response, achieving a processing speed of approximately 0.05 s per frame. The system employs convolutional neural networks in conjunction with a PELSF-DCNN classifier to detect weapons while minimizing false positives. In addition to visual detection, a behavior analysis module evaluates object motion patterns to infer potential threats, and an alerting module notifies authorities upon detection. Sliding-window techniques and feature selection methods are used to enhance overall detection performance and efficiency.
Guanbo Wang et al. [23] introduced an enhanced YOLOv4-based framework for real-time weapon detection in CCTV surveillance systems, targeting security and counter-terrorism applications. Their approach incorporates a Non-Uniform YOLOv4 backbone with a spatial cross-stage partial ResNet (SCSP-ResNet), a receptive field amplification module, and a Fusion-PaNet (F-PaNet) structure to improve feature representation. Model pruning is applied to reduce parameter count and model size, enabling faster real-time inference without significant accuracy degradation. Additionally, k-means clustering is used to optimize anchor boxes, improving detection accuracy and recall for small weapon objects.
Bushra S. N. et al. [24] proposed a weapon-detection system for public surveillance environments using the YOLOv5 object detection model. The approach analyzes full-frame image regions to identify weapons in crowded scenes and supports the detection of suspicious activities under complex visual conditions. The study utilizes the Roboflow platform for dataset collection, annotation, preprocessing, and model training, enabling efficient deployment with custom datasets. Experimental results demonstrate the feasibility of YOLOv5 for real-time weapon detection in common public spaces.
Khalid Sulaiman et al. [25] developed a smart surveillance framework for automated weapon detection using the YOLOv6 object detection model. The system applies deep learning–based object detection and recognition techniques to identify visible weapons in surveillance footage and generates real-time alerts to authorized personnel or security agencies. Experimental results demonstrate the effectiveness of YOLOv6 in supporting timely threat detection in security monitoring applications.
Pavithra et al. [26] proposed an object identification and behavior analysis framework based on the YOLOv7 architecture to detect targets and track their motion trajectories in video streams. The method incorporates temporal gaps between movements and categorical behavior features to distinguish abnormal activities. Evaluated on COCO-based scenarios including burglary attempts and false-alarm cases, the system achieved a precision of 93%, recall of 94.71%, an F1-score of 94.42%, and an overall accuracy of 95%.

2.2.2. Audio-Based Gunshot Detection

Audio-based gunshot detection has been widely studied, with approaches ranging from classical signal-processing techniques to modern deep-learning models. Lopez-Morillas et al. [11] proposed a semi-supervised gunshot detection method based on Non-negative Matrix Factorization (NMF), using separate training and separation stages. Their system achieved a maximum true-positive rate of 50% at an SNR of 5 dB, indicating limited robustness in noisy environments.
Valenzise et al. [12] developed a dual-classifier system using Gaussian Mixture Models (GMMs) to distinguish gunshots and screams from background noise. By training separate GMMs with different sets of audio features, their method achieved 93% precision at 10 dB SNR. However, the system was evaluated only in software, without an embedded implementation or real-time alerting capability.
With the emergence of deep learning, several studies have explored CNN-based audio classification. Bajzik et al. [13] used transfer learning with VGG16, InceptionV3, and ResNet18, using MFCC features extracted from gunshot audio. ResNet18 achieved over 99% accuracy on their dataset, demonstrating the effectiveness of deeper architectures for acoustic event detection.
Morehead et al. [14] implemented a low-cost gunshot detection system using a custom CNN trained on spectrograms. The model achieved over 99% accuracy and was deployed on a Raspberry Pi connected to a USB microphone and SMS modem. Although the system provided real-time SMS alerts to predefined contacts, reliance on SMS introduces ongoing per-message costs and may limit scalability.
In the author’s earlier work [15], a custom dataset of gunshot sounds was developed and a deep learning model was trained to distinguish gunshot and non-gunshot audio with 99% accuracy. A complete IoT-based gunshot detection system was developed, including a central server, detector devices on the Raspberry Pi Zero 2 W embedded platform, and a smartphone application for receiving notifications with maps. The system was successfully tested using blank gunshots as well as various false-alarm scenarios.

2.2.3. Audiovisual Gun Detection

Researchers have also explored multimodal approaches. Chen et al. [27] combined audio and visual cues for gunshot event recognition using an SVM classifier. Their system integrated gunshot sound signatures with human-emotion and activity analysis from video. Although the maximum precision for gunshot events reached only 73.46%, the work highlights the importance of integrating complementary modalities for improved reliability. However, the work does not discuss any IoT based hardware implementation.
As summarized in Table 2, the proposed system uniquely combines image-based firearm recognition and audio-based gunshot detection within a single device, supported by two custom deep-learning models with accuracies of 94.6% (Residual Separable Convolutional Neural Network, RS-CNN, for images) and 99% (CNN for audio). Most related works lack instant smartphone notifications with image and audio evidence that help users understand the context of the event, as well as real-time map-based localization, device control and monitoring, or automated lockdown and PA announcements. In contrast, the proposed solution further incorporates MQTT-enabled lockdown of smart door locks and automated PA alerts, features not found in any previous work. This comprehensive, dual-modality, real-time emergency-response capability distinguishes the proposed system from existing approaches.

3. Materials and Methods

In this work, deep-learning-based classifiers are developed to detect guns from images and from gunshot sounds. Then, the proposed IoT system, consisting of a server, smart locks, PA devices, and a smartphone app, is developed.

3.1. Image-Based Gun Detection Using Deep Learning

The creation of the custom dataset, the design of the deep learning model architecture, the training procedures, and verification with OpenAI (OpenAI Inc., San Francisco, CA, USA) are described below.

3.1.1. Custom Dataset Creation

To classify guns from camera images, a custom dataset consisting of two classes: gun and non-gun, was recorded using an Amcrest IP camera (Amcrest Technologies, Houston, TX, USA). Its shutter speed was set to 1/100 to reduce motion blur. Sample images of gun datasets are shown in Figure 1a–f, and sample images of non-gun datasets are shown in Figure 1g–l. For the gun class, volunteers of different ages and genders posed and walked with various types of gun replicas [28] including rifles, shotguns, submachine guns (SMG), and pistols, and simulated realistic shooter scenarios. For the non-gun class, the same volunteers carried or interacted with common objects such as bags, books, cell phones, cans, water bottles, brooms, basketballs, boxes, etc. In addition, artificially generated rainbow color pattern images collected from online sources, as shown in Figure 1k,l, were incorporated to further diversify the non-gun dataset.
To further enhance the dataset and improve the model’s ability to generalize, data augmentation techniques were applied. These included random horizontal flipping to simulate mirrored perspectives, small random rotations to account for camera angle variations, slight random zooming to mimic different distances from the camera, and minor brightness adjustments to represent variations in lighting conditions. For instance, Figure 1b is an augmented image generated from Figure 1a.
The dataset consisted of 4000 images split between two classes: 2000 gun images and 2000 non-gun images. The two classes of images were stored in two different folders for training. As the goal is image classification instead of object detection with boundary boxes, there was no need for costly data annotation of drawing boundary boxes. The dataset was divided into 70% for training, 15% for validation, and 15% for testing. Because a fixed batch size of 128 was used during training, the final split resulted in 2816 training images, 512 validation images, and 672 testing images.
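For illustration, the two-folder dataset layout and the augmentation pipeline described above can be set up with standard Keras utilities, as in the minimal sketch below. The folder names, random seed, and augmentation magnitudes are assumptions, the built-in utility splits data into only training and validation subsets (so the held-out test set would be prepared separately), and the RandomBrightness layer requires TensorFlow 2.9 or newer.

```python
import tensorflow as tf
from tensorflow.keras import layers

IMG_SIZE = (150, 200)   # (height, width), matching the model input
BATCH_SIZE = 128

# "dataset/" is assumed to contain two sub-folders: gun/ and non_gun/
train_ds = tf.keras.utils.image_dataset_from_directory(
    "dataset", labels="inferred", label_mode="binary",
    validation_split=0.15, subset="training", seed=42,
    image_size=IMG_SIZE, batch_size=BATCH_SIZE)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "dataset", labels="inferred", label_mode="binary",
    validation_split=0.15, subset="validation", seed=42,
    image_size=IMG_SIZE, batch_size=BATCH_SIZE)

# Augmentation mirroring the transformations described above (magnitudes are illustrative)
augment = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),   # mirrored perspectives
    layers.RandomRotation(0.03),       # small camera-angle variations
    layers.RandomZoom(0.1),            # different distances from the camera
    layers.RandomBrightness(0.1),      # minor lighting changes
])
train_ds = train_ds.map(lambda x, y: (augment(x, training=True), y))
```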

3.1.2. Residual Separable Convolutional Neural Network Architecture

Multiple iterations were performed to design a compact yet efficient convolutional architecture capable of achieving high classification accuracy on image inputs while minimizing overfitting and computational cost. The custom-designed model adopts a Residual Separable Convolutional Neural Network (RS-CNN) structure [29,30] inspired by the Xception and ResNet designs. It leverages depthwise separable convolutions and residual connections to enhance feature reuse, maintain stable gradients, and reduce parameter count. The overall architecture is illustrated in Figure 2, its key components are summarized below, and a minimal code sketch follows the list:
  • Input and Normalization: The model accepts input images of size 150 × 200 × 3 (height, width, channel). A Rescaling layer normalizes pixel intensities to the 0–1 range by dividing them by 255, improving numerical stability and facilitating faster convergence during training.
  • Convolution Layer: The network begins with a Conv2D layer [31] with 16 filters and a 5 × 5 kernel, followed by batch normalization [32] and ReLU activation [33]. This layer captures low-level spatial and color features that provide the foundation for deeper representations.
  • Residual Separable Convolutional Blocks: The model includes three progressively deeper residual blocks, each combining separable convolutions with residual connections [29,30]. These residual connections preserve low-level information and mitigate vanishing gradients, resulting in more stable optimization. Although the input image is initially normalized by a Rescaling layer, batch normalization is still applied inside each block because the internal feature distributions continue to shift as they pass through multiple nonlinear layers. Batch normalization in these deeper layers reduces internal covariate shift, stabilizes activation statistics, allows the use of higher learning rates, and helps gradients propagate more reliably through the network.
In each block, the main branch consists of two SeparableConv2D layers with increasing filter sizes (32, 64, and 128), each followed by batch normalization and ReLU activation. Separable convolutions factorize a standard convolution into depth-wise and point-wise operations, significantly reducing the number of parameters and computational cost while preserving representational power. They are especially effective once the feature maps are decorrelated after the first convolutional layers, unlike the original RGB input where channels are strongly correlated. A MaxPooling2D operation [34] with stride 2 then reduces the spatial resolution, emphasizing the most salient features. In parallel, the residual pathway uses a 1 × 1 Conv2D projection with stride 2 to match both the down-sampled spatial dimensions and the number of filters of the main branch. The outputs of the main and residual paths are added elementwise, enabling the block to learn a residual mapping y = f(x) + x. This design preserves low-level information, mitigates vanishing gradients, and leads to more stable and efficient optimization.
  • Final Feature Extraction Layer: After the last residual block, a SeparableConv2D layer with 256 filters further enriches the learned feature representations. Batch normalization and ReLU activation are applied afterward.
  • Global Feature Aggregation and Dropout: A GlobalAveragePooling2D layer reduces each feature map to a single representative value, producing a compact 256-dimensional vector. A dropout layer is applied to reduce overfitting by preventing the model from relying too heavily on specific activations.
  • Output Layer: A Dense layer with sigmoid activation outputs a probability indicating whether the input image contains a gun. Values greater than or equal to 0.5 indicate the gun class, while values below 0.5 correspond to the non-gun class.
  • Loss Function and Optimizer: The model is trained using binary cross-entropy loss, appropriate for two-class classification tasks. The Adam optimizer with a learning rate of 1 × 10−5 is used.
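The following is a minimal Keras sketch of the RS-CNN structure described above. The kernel sizes of the separable convolutions, the pooling window, and the dropout rate are assumptions, so the sketch does not reproduce the exact 84,257-parameter configuration reported later; it is intended only to make the block structure concrete.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_rs_cnn(input_shape=(150, 200, 3)):
    inputs = tf.keras.Input(shape=input_shape)
    x = layers.Rescaling(1.0 / 255)(inputs)          # normalize pixels to [0, 1]

    # Entry convolution: 16 filters, 5x5 kernel
    x = layers.Conv2D(16, 5, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)

    # Three residual separable-convolution blocks (32, 64, 128 filters)
    for filters in (32, 64, 128):
        shortcut = x
        x = layers.SeparableConv2D(filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
        x = layers.SeparableConv2D(filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
        x = layers.MaxPooling2D(3, strides=2, padding="same")(x)
        # 1x1 projection with stride 2 on the residual path, then elementwise add
        shortcut = layers.Conv2D(filters, 1, strides=2, padding="same")(shortcut)
        x = layers.Add()([x, shortcut])

    # Final feature extraction, global pooling, dropout, sigmoid output
    x = layers.SeparableConv2D(256, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dropout(0.5)(x)                        # dropout rate is an assumption
    outputs = layers.Dense(1, activation="sigmoid")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model
```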

3.1.3. Model Training

The deep learning model shown in Figure 2 was developed in Python version 3.9 using the Keras framework, a high-level neural network API built on top of TensorFlow [35]. Keras was selected for its simplicity, flexibility, and efficient model-building workflow. Training was performed on a desktop computer equipped with a 12th-generation Intel Core i7 processor (6 cores, 2.10 GHz), 32 GB of RAM, and an NVIDIA GeForce RTX 3070 GPU, providing sufficient computational power for training.
After training, the TensorFlow model was converted into a LiteRT format [36] to enable faster and more efficient inference. This conversion increases inference speed—enabling higher frames per second (FPS) during real-time processing.
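As a sketch of this conversion and the subsequent lightweight inference, the standard TensorFlow Lite converter and interpreter can be used as shown below. The file name, the optional optimization flag, and the placeholder stand-in model are assumptions; in practice the trained RS-CNN from the previous section would be converted.

```python
import numpy as np
import tensorflow as tf

# Stand-in for the trained RS-CNN (illustrative only)
model = tf.keras.Sequential([
    tf.keras.layers.Input((150, 200, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Convert the Keras model to a .tflite flatbuffer
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # optional size/latency optimization
with open("rs_cnn_gun.tflite", "wb") as f:
    f.write(converter.convert())

# Load the converted model and run one inference on a placeholder frame
interpreter = tf.lite.Interpreter(model_path="rs_cnn_gun.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
frame = np.zeros((1, 150, 200, 3), dtype=np.float32)   # resized camera frame would go here
interpreter.set_tensor(inp["index"], frame)
interpreter.invoke()
prob_gun = float(interpreter.get_tensor(out["index"])[0][0])
print(prob_gun)
```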

3.1.4. Verification with OpenAI

The proposed method incorporates an additional verification stage using a vision-enabled large multimodal model provided by OpenAI to further reduce false alarms caused by visually ambiguous objects. A gun detection is confirmed only when both the local RS-CNN classifier and the OpenAI-based verifier independently identify the presence of a firearm.
After the local RS-CNN produces a GUN decision through majority voting over the three most recent frames, the current frame is uploaded to the GPT-4.1 vision model for secondary verification. The model is prompted to distinguish real firearms from other objects such as toys, tools, or household items and is asked to return a binary response indicating whether a real firearm is clearly visible in the image.
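A hedged sketch of this secondary verification call using the OpenAI Python SDK is shown below. The prompt wording, the JSON field name, and the fallback behavior on unparseable replies are assumptions, not the exact implementation used in the system.

```python
import base64
import json
from openai import OpenAI   # requires the OPENAI_API_KEY environment variable

client = OpenAI()

def verify_firearm(jpeg_bytes: bytes) -> bool:
    """Ask the GPT-4.1 vision model whether a real firearm is clearly visible."""
    image_b64 = base64.b64encode(jpeg_bytes).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4.1",
        timeout=5,   # seconds; matches the per-request timeout used during deployment
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("Does this image clearly show a real firearm (not a toy, "
                          "tool, or household item)? Reply only with JSON such as "
                          '{"firearm": true} or {"firearm": false}.')},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    try:
        return bool(json.loads(response.choices[0].message.content)["firearm"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return False   # treat an unparseable reply as "not confirmed"
```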

3.2. Sound-Based Gunshot Detection Using Deep Learning

In this work, the author’s previously published deep learning–based gunshot sound classifier from [15] was reused. In that study, a custom gunshot sound dataset was developed using blank guns fired in multiple indoor locations. Non-gunshot sounds were collected from common indoor environments such as schools, offices, and grocery stores. The dataset was further expanded through audio augmentation. One-second audio clips were transformed into 18 × 64 image-like feature maps by stacking Mel Frequency Cepstral Coefficients (MFCCs) with five time-domain energy features, allowing the problem to be treated as an image classification task. A compact CNN architecture with two convolution–max-pooling stages, followed by dropout and fully connected layers with a sigmoid output neuron, was developed and trained. The model was finally converted to the LiteRT format for fast inference.
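The following is a minimal sketch of this 18 × 64 feature construction using librosa and NumPy. The FFT size, hop length, and the exact choice of the five time-domain statistics (the fifth is taken here as the max minus min range) are assumptions based on the description above and in Section 4.1.3, not the published implementation.

```python
import numpy as np
import librosa

SR = 16000            # 1-s clips sampled at 16 kHz
SEGMENTS = 64
SEG_LEN = SR // SEGMENTS   # 250 samples per segment

def feature_map(y):
    """Turn one second of mono audio into an 18 x 64 image-like feature map."""
    y = y - np.mean(y)                       # remove DC offset
    y = y / (np.max(np.abs(y)) + 1e-9)       # normalize amplitude
    # 13 MFCCs; hop length chosen so one second yields at least 64 frames
    mfcc = librosa.feature.mfcc(y=y, sr=SR, n_mfcc=13,
                                n_fft=512, hop_length=SEG_LEN)[:, :SEGMENTS]
    # Five time-domain statistics over 64 equal-length segments
    seg = y[: SEGMENTS * SEG_LEN].reshape(SEGMENTS, SEG_LEN)
    stats = np.stack([
        np.mean(np.abs(seg), axis=1),        # average absolute amplitude
        seg.max(axis=1),                     # maximum
        seg.min(axis=1),                     # minimum
        seg.std(axis=1),                     # standard deviation
        seg.max(axis=1) - seg.min(axis=1),   # range (assumed "difference" feature)
    ])                                        # shape (5, 64)
    return np.vstack([mfcc, stats]).astype(np.float32)   # shape (18, 64)

print(feature_map(np.random.randn(SR)).shape)   # (18, 64)
```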

4. Prototype Development

The architecture of the proposed gunshot detection system consists of multiple interconnected components: cameras with built-in microphones, a mini-PC hosting the MQTT broker and the image/audio processing software, IoT-enabled smart locks, IoT-connected PA devices, Firebase cloud storage, and a smartphone application—illustrated in Figure 3.
The hardware setup involves mounting the cameras on walls or ceilings, installing smart locks on doors, and connecting PA devices to standard power outlets. After the devices are connected to Wi-Fi, users can receive real-time smartphone notifications from anywhere with Internet access. Wi-Fi provisioning, user and device management, and the database structure for such IoT systems were discussed in the author’s earlier work [15,37]. This paper focuses on developing a proof-of-concept prototype that integrates audio-visual gunshot detection, automated lockdown, and automatic PA announcements. A brief overview of the system’s key modules is provided below.

4.1. Audio-Visual Gun Detection and MQTT Broker

A mini-PC [38] hosts the MQTT broker and the image/audio processing software for audio-visual gun detection. The mini-PC runs the Windows 11 operating system and is equipped with an Intel Alder Lake N100 CPU running at 3.4 GHz, 16 GB of RAM, and a 1 TB SSD. Due to its compact physical footprint (5.7 × 2.5 × 1 inch) and low power consumption, it is well suited for continuous, energy-efficient operation. The software running on this device is briefly described below.

4.1.1. MQTT Broker Server

The Mosquitto MQTT broker [39] is hosted on the mini-PC; it uses password-based authentication and can be accessed securely from outside the local network. Public access is enabled by assigning a static private IP address, configuring port forwarding in the router, and opening the corresponding port in Windows Firewall. MQTT [40] operates using a lightweight publish–subscribe communication model, in which client devices connect to the MQTT broker and publish messages to named topics, and any client subscribed to a topic automatically receives its messages. This architecture allows rapid and scalable dissemination of event alerts to all relevant IoT components.
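A minimal publish–subscribe sketch with the Paho MQTT client is shown below, assuming the paho-mqtt 1.x constructor (version 2.x additionally requires a CallbackAPIVersion argument). The broker address, credentials, and topic names are hypothetical placeholders, not the deployed values.

```python
import paho.mqtt.client as mqtt

BROKER_HOST = "192.168.1.50"   # hypothetical static private IP of the mini-PC
BROKER_PORT = 1883

def on_message(client, userdata, msg):
    # Every subscriber to a topic receives the messages published to it
    print(f"{msg.topic}: {msg.payload.decode()}")

client = mqtt.Client()
client.username_pw_set("school_iot", "secret")        # password-based authentication
client.on_message = on_message
client.connect(BROKER_HOST, BROKER_PORT, keepalive=60)
client.subscribe("school/locks/+/status")             # hypothetical observer topics
client.publish("school/locks/lock1/cmd", "1", qos=1)  # hypothetical command topic
client.loop_forever()
```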

4.1.2. Gun Detection from Image

For image-based gun detection, Amcrest IP cameras (Amcrest Technologies, Houston, TX, USA) are used to capture visual data. A Python-based system running on the mini-PC processes these streams using a multi-camera architecture, in which each camera continuously delivers video to the mini-PC for real-time analysis. At initialization, the system loads the TFLite version of the RS-CNN classifier discussed in Section 3.1. The mini-PC simultaneously launches a dedicated worker thread for each camera, enabling fully parallel processing of video streams and ensuring that detection performance scales with the number of deployed cameras.
The flowchart of each camera worker thread is shown in Figure 4. Each worker thread maintains a persistent RTSP connection and attempts to operate at a target frame rate. When a frame is acquired, it is resized to the model’s fixed input dimension of 150 × 200 pixels and passed through the TFLite interpreter, which generates an instantaneous prediction (“GUN” or “NONGUN”). To mitigate the impact of occasional misclassifications—such as those caused by motion blur, occlusion, or transient lighting variations—the system employs a temporal smoothing strategy using a fixed-length FIFO queue. The queue retains the most recent 3 predictions per camera, and the final label is obtained via majority voting. This approach significantly stabilizes frame-level predictions while preserving responsiveness.
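As an illustration, the per-camera temporal smoothing can be implemented with a short fixed-length queue, as in the sketch below; the conservative default label returned before the window fills is an assumption.

```python
from collections import deque

WINDOW = 3                        # most recent predictions retained per camera
votes = deque(maxlen=WINDOW)      # one such queue would exist per camera

def smoothed_label(raw_label):
    """Append the newest frame-level prediction and return the majority vote."""
    votes.append(raw_label)       # raw_label is "GUN" or "NONGUN"
    if len(votes) < WINDOW:
        return "NONGUN"           # stay conservative until the window is full (assumption)
    return "GUN" if list(votes).count("GUN") > WINDOW // 2 else "NONGUN"
```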
To further reduce false alarms arising from visually ambiguous objects, the system incorporates an additional verification step using a vision-enabled large multimodal model from OpenAI. When the FIFO vote indicates a potential firearm, and a verification cooldown of at least 2 s has elapsed since the last verification attempt on that camera, the current frame is transmitted to the GPT-4.1 model with a timeout of 5 s. The model is prompted to return a structured JSON response indicating whether a real firearm is visibly present. A gun event is confirmed only when both the local RS-CNN and the OpenAI verifier agree that a firearm is present. During this verification step, the RTSP stream is temporarily released and reopened afterward to avoid accumulating buffered frames and to maintain real-time behavior. To avoid repeated alerts from the same visual incident, a per-camera event cooldown of 5 s must elapse before another gun event can be triggered.
When a gun is confirmed, the system records the event timestamp and initiates a coordinated sequence of emergency actions. First, MQTT commands are sent to all smart door locks associated with the corresponding camera, instructing them to close and thereby initiating a localized lockdown. Second, PA devices mapped to the camera are instructed—via MQTT—to stop any ongoing playback and broadcast predefined voice instructions (e.g., directing occupants in the affected area to move toward a safer region). Third, the confirmed frame is encoded and uploaded to Firebase Storage [41] under a directory structured by device and camera identifiers. A local copy is also stored for redundancy. Finally, once the upload succeeds, the resulting public URL is embedded in an MQTT message that includes the camera ID, event timestamp, GPS coordinates, room label, and institution name. This message is published to a dedicated topic, to which the security personnel’s smartphone application is subscribed. Upon receiving the message, the phone processes the payload and generates an immediate push notification. Through this mechanism, the image-based module not only identifies the presence of a firearm with low latency and high reliability but also provides detailed contextual information.
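As an illustration of the final notification step, a confirmed event might be published as a JSON payload similar to the sketch below. The topic name, field names, and coordinates are hypothetical; `client` is assumed to be a connected Paho MQTT client and `image_url` the public Firebase URL produced by the upload step.

```python
import json
import time

def publish_gun_event(client, camera_id, room, institution, lat, lon, image_url):
    """Publish a confirmed gun-image event to the observer topic (illustrative fields)."""
    payload = {
        "camera_id": camera_id,
        "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
        "room": room,
        "institution": institution,
        "lat": lat,
        "lon": lon,
        "image_url": image_url,
    }
    client.publish("school/events/gun_image", json.dumps(payload), qos=1)
```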
To maintain a consistent processing rate, the execution time of each iteration of the worker thread is measured and compared against the target frame period. If the total processing time is shorter than the target period, the thread enters a sleep state for the remaining duration, thereby enforcing the desired frame rate. When the processing time exceeds the target frame period—such as during heavy computation or network delays—the computed sleep interval becomes zero, allowing the system to run at the maximum achievable rate without additional delay. This adaptive delay mechanism ensures stable real-time performance while avoiding unnecessary CPU utilization.
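A minimal sketch of this pacing logic is shown below; the placeholder workload stands in for the per-frame grab, resize, inference, and voting steps.

```python
import time

TARGET_FPS = 30
FRAME_PERIOD = 1.0 / TARGET_FPS

def process_one_frame():
    """Placeholder for frame grab + resize + TFLite inference + majority voting."""
    time.sleep(0.01)

for _ in range(100):                              # stand-in for the camera worker loop
    t0 = time.monotonic()
    process_one_frame()
    elapsed = time.monotonic() - t0
    # Sleep only for the remainder of the frame period; zero when the work overruns
    time.sleep(max(0.0, FRAME_PERIOD - elapsed))
```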

4.1.3. Gunshot Detection from Sound

For sound-based gunshot detection, the system utilizes the built-in microphones of the Amcrest IP cameras, configured at a sampling rate of 16 kHz with the microphone gain set to approximately 40%. Each camera’s RTSP audio stream is decoded on the mini-PC using FFmpeg (development build N-111059-gd78bffbf3d) [42], which extracts a continuous mono audio signal. The audio pipeline operates on 1-s analysis windows per camera, providing near real-time responsiveness while maintaining sufficient temporal resolution to capture the impulsive characteristics of gunshots.
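A hedged sketch of this decoding step is shown below, using an FFmpeg subprocess to emit raw 16 kHz mono PCM that is read in 1-s chunks; the RTSP URL is a hypothetical placeholder.

```python
import subprocess
import numpy as np

RTSP_URL = "rtsp://user:pass@camera-ip/stream"   # hypothetical camera URL
SR = 16000
BYTES_PER_SECOND = SR * 2                        # 16-bit mono PCM

# Ask FFmpeg to drop video and write raw 16 kHz mono PCM to stdout
proc = subprocess.Popen(
    ["ffmpeg", "-loglevel", "quiet", "-i", RTSP_URL,
     "-vn", "-acodec", "pcm_s16le", "-ac", "1", "-ar", str(SR), "-f", "s16le", "-"],
    stdout=subprocess.PIPE,
)

while True:
    chunk = proc.stdout.read(BYTES_PER_SECOND)   # one 1-s analysis window
    if len(chunk) < BYTES_PER_SECOND:
        break                                    # stream ended or connection lost
    audio = np.frombuffer(chunk, dtype=np.int16).astype(np.float32) / 32768.0
    # "audio" now feeds the feature extraction and TFLite classifier described below
```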
In this work, the deep learning–based gunshot sound classifier developed in the authors’ previous work [15] is reused. As described in [15], each 1-s audio segment is first preprocessed to remove DC offset and normalize amplitude. Time-domain statistics (such as average absolute amplitude, maximum, minimum, standard deviation, and differences) are computed over 64 equal-length segments, while Mel Frequency Cepstral Coefficients (MFCCs) are extracted in the frequency domain. These features are then stacked to form an 18 × 64 image-like representation, which serves as the input to a compact convolutional neural network. The CNN, implemented as a TensorFlow Lite (TFLite) model, outputs a probability that the given 1-s segment corresponds to a gunshot. A decision threshold of 0.5 is used to classify the segment as “gunshot” or “non-gunshot.”
The mini-PC processes audio streams from all configured cameras using dedicated worker threads, with one thread assigned to each camera. Each audio worker continuously extracts non-overlapping 1-s audio segments from the RTSP stream, performs feature generation, and executes TFLite-based inference in real time. This parallel design ensures that audio processing for one camera does not block or delay processing for others and enables continuous monitoring across multiple zones.
When a gunshot is detected for a particular camera, the system records the event date and time and triggers the same emergency-response chain as in the image-based module. Specifically, MQTT commands are sent to all smart door locks associated with that camera to initiate localized lockdown, PA devices for that zone are instructed to broadcast a gunshot-specific warning message, and the corresponding 1-s audio segment is saved locally as a WAV file and uploaded to Firebase Storage. The public URL of the uploaded audio file, together with the camera ID, event timestamp, GPS coordinates, room label, and institution name, is then embedded in an MQTT message published to the observer topic. The smartphone application subscribed to this topic receives the event and generates a push notification for first responders. In this way, the sound-based module provides a complementary detection pathway that can identify gunshots even when the weapon is not visible in the camera’s field of view, while leveraging the same IoT-based lockdown, PA, and notification infrastructure as the image-based subsystem.

4.2. Smart Lock

The smart locks can be installed on classroom doors. Whenever a gun image or gunshot sound is detected on a camera, the smart lock devices assigned to that camera are closed automatically. The smart locks can be opened or closed using a physical button on the device from inside the classroom or using the smartphone app from anywhere in the world. The status of each smart lock is indicated by an RGB LED on the device and can also be monitored in the smartphone app. The hardware and firmware of the smart lock are described below.

4.2.1. Hardware

The block diagram of the smart lock hardware is shown in Figure 5. The smart lock module is built around an ESP32-S3 Tiny microcontroller board [43], which provides integrated Wi-Fi connectivity, GPIO, an RGB LED, and processing capability for real-time control and MQTT communication. A TB6612FNG dual H-bridge motor driver [44] is used to drive a linear stepper actuator [45]. A sliding door latch [46] is attached to the shaft of the linear actuator to implement the lock-unlock mechanism. In the motor driver, the BIN1/BIN2 pair is assigned to forward motion, which closes the lock, and the AIN1/AIN2 pair is assigned to backward motion, which opens the lock. A momentary push-button connected to a GPIO allows manual toggling of the lock state. An onboard RGB LED provides visual feedback: green indicates an open or unlocked state, red indicates a closed or locked state, and yellow indicates motion or connection attempts. A 110 V AC to 5 V, 2 A DC power adapter is used as the power supply. The power regulator on the ESP32-S3 Tiny board generates 3.3 V from the 5 V supply. In the motor driver, the 3.3 V rail powers the IC and the 5 V rail powers the motor.

4.2.2. Firmware

After initialization of GPIOs, EEPROM, and the dual stepper drivers using AccelStepper library [47], the device connects to the configured Wi-Fi network and then to the MQTT broker using Arduino-MQTT library [48] as a client. As part of the MQTT setup, it configures a last-will message that is automatically published by the broker if the device disconnects unexpectedly, allowing the smartphone app to display lock-offline conditions. After a boot, each lock publishes its connection status and open/closed state to a designated observer topic and subscribes to a corresponding command topic, where it receives integer-coded commands such as CLOSE and OPEN. Upon receiving such a command—or a falling-edge interrupt from the physical push button—the firmware updates the internal open/closed state, computes the required displacement based on the calibrated step-to-distance conversion factor, and actuates the appropriate stepper motor (forward to close, backward to open) at a constant speed.
The step-to-distance conversion factor for the linear actuator was determined through empirical measurement. For proper open and close operation, the lock mechanism must travel 24 mm within 10 s under a constant motor speed of 90 steps per second. This operating speed results in 900 motor steps over the 10-s interval. Dividing the total number of steps by the measured travel distance yields a conversion ratio of 37.5 steps per millimeter. This factor is used throughout the firmware to convert linear motion distance into the appropriate number of motor steps.
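The calibration arithmetic can be summarized with the short illustrative calculation below; this is not the actual firmware, which runs on the ESP32 and uses the AccelStepper library.

```python
# Worked numbers from the calibration described above
MOTOR_SPEED_SPS = 90        # steps per second
TRAVEL_TIME_S = 10          # seconds for a full open/close stroke
TRAVEL_DISTANCE_MM = 24     # measured latch travel

total_steps = MOTOR_SPEED_SPS * TRAVEL_TIME_S     # 900 steps
steps_per_mm = total_steps / TRAVEL_DISTANCE_MM   # 37.5 steps per millimeter

def mm_to_steps(distance_mm):
    """Convert a requested latch displacement into motor steps."""
    return round(distance_mm * steps_per_mm)

print(mm_to_steps(TRAVEL_DISTANCE_MM))   # 900
```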
After motion completes, the device updates EEPROM with the new state and publishes the latest open/closed status, while the RGB LED is updated to reflect the final lock position. Storing the lock state in non-volatile EEPROM ensures that the device can restore its previous open/closed status after power loss or reset.

4.3. IoT Connected PA

The IoT-connected public-address (PA) units serve as distributed audio warning devices that automatically announce safety messages when a gunshot sound or gun image is detected. Each camera is assigned to one or more PA devices, enabling location-specific instructions to occupants in real time. For instance, if a gunshot sound is detected in the North hallway and the safe location for that camera is set to the South hallway, the PA announces “Gunshot sound detected in North Hallway. Please move to South Hallway” three times. The hardware and firmware design of the PA module are described below.

4.3.1. Hardware

The hardware architecture of the IoT-connected PA module is shown in Figure 6. Each unit is built around a Raspberry Pi Zero 2 W single-board computer, which provides Wi-Fi connectivity, Linux-based processing capability, and sufficient computational resources for text-to-speech synthesis and audio playback. A WM8960 audio HAT [49] is mounted on top of the Raspberry Pi, providing a low-power stereo audio codec with I2S digital audio interfacing. The HAT integrates a dual-channel speaker driver, enabling direct connection of two 8 Ω loudspeakers without the need for an external amplifier. This configuration allows each PA device to produce clear, high-fidelity audio while maintaining a compact and energy-efficient hardware footprint. The Raspberry Pi and WM8960 HAT are powered using a 100–240 V AC to 5 V DC converter module [50], capable of supplying up to 600 mA of continuous current. All electronics are housed in an enclosure with an integrated wall-outlet AC plug [51], enabling the entire PA unit to be deployed as a self-contained, always-powered device in classrooms, hallways, and other indoor locations.

4.3.2. Firmware

The PA firmware is implemented in Python and runs automatically on the Raspberry Pi at system startup [52]. The device operates on the 64-bit Raspberry Pi OS (Bookworm), and all synthesized speech is routed through the ALSA default audio device to the WM8960 codec for playback [49]. Using the Paho-MQTT library [53], each PA unit establishes a connection to the Mosquitto MQTT broker as a client. As part of the MQTT initialization procedure, the device configures a last-will message that is published automatically by the broker if the PA unit disconnects unexpectedly, ensuring that the smartphone app can immediately identify offline PA nodes.
After connecting to the broker, the PA device publishes its connection status and subscribes to an assigned command topic, through which it receives short integer-coded instructions. Two types of announcement triggers are supported: one for gunshot sound detection and another for gun-image detection. Each command contains (i) the location where the threat was detected and (ii) the recommended safer destination area. Upon receiving an announcement command, the firmware generates the verbal instruction using the gTTS text-to-speech engine [54] and stores the synthesized audio in a temporary MP3 file. Playback is handled by the lightweight mpg123 player [55] through the WM8960 HAT, and announcements are repeated a configurable number of times.
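A minimal sketch of the synthesis-and-playback step is shown below. The message template matches the announcements described above, while the temporary-file handling and repeat count are illustrative assumptions rather than the deployed firmware.

```python
import os
import subprocess
import tempfile
from gtts import gTTS

def announce(threat_location, safe_location, repeats=3):
    """Synthesize a location-specific PA announcement and play it via mpg123."""
    message = (f"Gunshot sound detected in {threat_location}. "
               f"Please move to {safe_location}.")
    tts = gTTS(text=message, lang="en")
    with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as f:
        path = f.name
    tts.save(path)                                   # write synthesized speech to MP3
    for _ in range(repeats):
        subprocess.run(["mpg123", "-q", path], check=False)   # play through the WM8960 HAT
    os.remove(path)

announce("North Hallway", "South Hallway")
```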
Incoming commands are processed using a background worker thread and a synchronized queue, enabling orderly handling of multiple events without blocking MQTT communication. A STOP command is also supported, allowing higher-priority events to immediately interrupt ongoing playback and clear any pending announcements.

4.4. Smartphone Application

The smartphone application serves as the primary user interface of the proposed gunshot detection and response system. It provides device status monitoring, manual control of the IoT devices, and immediate emergency notifications to security personnel and users. The application is developed for Android and communicates with the backend infrastructure through the MQTT publish–subscribe messaging model and Firebase cloud storage.

4.4.1. MQTT Connectivity and System Monitoring

The smartphone connects to the Mosquitto MQTT broker using jMQTT library [56] and operates as a subscriber to multiple observer topics. These topics broadcast connectivity and operational status updates from smart locks, PA devices, and the gun detection mini-PC. By subscribing to these topics, the application continuously monitors whether each device is online, as well as the current open or closed state of each smart lock.

4.4.2. Control of Smart Locks and PA Devices

In addition to monitoring, the smartphone application enables users to actively control IoT devices through MQTT command messages. Smart locks can be remotely opened or closed using on-screen toggle switches, which publish integer-coded control commands to the corresponding device command topics. This allows classroom doors to be unlocked manually after an incident.
Similarly, the application provides manual control over PA devices. Dedicated control buttons allow users to stop ongoing audio announcements, for example, after a false alarm or once evacuation procedures are complete. This control command is transmitted as an MQTT message and executed immediately by the corresponding PA nodes.

4.4.3. Gun Image and Gunshot Event Notifications

The smartphone subscribes to dedicated gun-event topics published by the mini-PC, which report both image-based and sound-based detections. When a gun image is detected, the application receives an event message containing the camera identifier, timestamp, room label, geographic coordinates, and a Firebase URL pointing to the captured image frame. A high-priority push notification is generated using NB6 library [57] with custom alarm tone and action buttons to immediately alert the user. The captured image is retrieved asynchronously from Firebase storage and presented to the user through a rich notification layout with an embedded preview. This visual evidence enables rapid assessment of the threat and supports informed decision-making by security personnel.
When a gunshot sound is detected, the application receives a corresponding event message containing similar contextual metadata, along with a Firebase URL linking to the recorded audio clip. In addition to the notification, the application allows the recorded audio clip to be played directly on the smartphone, enabling responders to assess the acoustic context to further evaluate the credibility and severity of the event.

4.4.4. Navigation Assistance

To support rapid response, the application integrates location-aware features using the GPS coordinates included in gun-event messages. Each notification provides a “Direction” action that launches Google Maps with turn-by-turn navigation to the incident location. This functionality is particularly useful for campus security or law enforcement personnel who may be off-site at the time of detection.

5. Results

In this section, the results of the deep learning models and the overall performance of the proposed prototype system are presented.

5.1. Deep Learning Model Results

The results of the deep learning models for image-based and sound-based gun detection are presented below.

5.1.1. Deep Learning Model Result for Image Based Detection

The deep learning model was trained and validated simultaneously until one of two stopping criteria was met: the validation loss falling to 0.1 or below, or the completion of 50,000 training epochs. A batch size of 128 was used for both training and validation, and the learning rate was set to 1 × 10−5. The training process terminated at epoch 272, where the validation loss reached 0.097, requiring a total training time of 33 min and 6.58 s. At this point, the training and validation accuracies converged to approximately 96% and 97%, respectively.
After training and validation, the final model—comprising 84,257 parameters, of which 83,073 are trainable and 1184 are non-trainable—was evaluated on an unseen test dataset containing 672 images. The model achieved an accuracy of 94.64% for the test dataset with corresponding loss of 0.1336. The trained model occupies 1.2 MB of storage, making it suitable for deployment on resource-constrained systems. Table 3 summarizes the loss and accuracy across the training, validation, and test datasets, demonstrating consistent performance and generalization.
The confusion matrix for the test dataset is shown in Table 4. Of the 321 true gun samples in the test dataset, 308 were correctly identified, resulting in 13 false negatives (4.0%). While false negatives are particularly critical in emergency-response scenarios, their impact is mitigated by the system’s continuous high-frame-rate operation. Even if a gun is missed in an individual frame, or during secondary verification, the same object is typically re-evaluated in subsequent frames, allowing detection to occur within a short time window, generally within a few seconds. Conversely, among 351 non-gun samples, 332 were correctly classified, with 19 false positives (5.4%). Although false positives may lead to unnecessary lockdown actions, the proposed system mitigates their impact through multiple safeguards, including temporal majority voting and secondary OpenAI-based verification. This layered decision pipeline prioritizes early threat detection while reducing the likelihood of erroneous autonomous actions, making the system suitable for safety-critical deployment scenarios.
Precision, recall, and F1-score metrics are reported in Table 5.

5.1.2. Deep Learning Model Result for Sound Based Detection

The detailed performance of the deep learning model for sound-based gunshot detection was reported in the author’s previous work [15]. The model achieved training and validation accuracies of approximately 98% and 99%, respectively. When tested on an unseen test dataset, it attained a test accuracy of 99.17%.

5.2. Prototype Results

A working prototype of the proposed system, as shown in Figure 3, consisting of two cameras, a network switch, a mini-PC, two smart locks, two PA devices, and smartphone applications, has been developed and successfully tested. The components of the proposed system and the image-based and sound-based testing results are discussed below.

5.2.1. Components of the Proposed System

This section illustrates the different components and the physical deployment of the proposed audio-visual gun detection system within the test environment. Two IP cameras are installed: one in the north hallway (Figure 7a) and the other in the south hallway (Figure 7b), providing complementary visual and audio coverage of the monitored area. Both cameras are connected via Ethernet cables to a Power-over-Ethernet (PoE) network switch (Figure 7c), which supplies power to the cameras while simultaneously handling data communication. The mini-PC (Figure 7d) hosts the Mosquitto MQTT broker and runs the real-time audio-visual processing software as discussed in Section 4.1. The PoE switch and the mini-PC are connected to a router, enabling network connectivity and Internet access.
Figure 8 presents the proof-of-concept smart door lock hardware, as discussed in Section 4.2, in both operational states. The open configuration, Figure 8a, shows the integration of the ESP32-S3 Tiny microcontroller with the motor driver and stepper-motor-based linear actuator, which mechanically translates motor rotation into linear motion of the door latch. The momentary push-button provides local manual control, and the onboard RGB LED offers immediate visual feedback of the lock status. The closed configuration in Figure 8b shows the engaged latch, with the RGB LED indicating a locked state. Toggling the lock state by pressing the push-button switch was tested and worked successfully.
Figure 9 shows the hardware realization of the IoT-connected public address (PA) system as discussed in Section 4.3. The disassembled view, Figure 9a, shows the Raspberry Pi Zero 2 W integrated with a WM8960 audio HAT, which provides stereo audio decoding and direct speaker drive capability. The system is powered through a compact 110 V AC to 5 V DC converter and is housed in an enclosure designed for wall-outlet mounting. Figure 9b shows the fully assembled unit housed in the wall-mount casing with an integrated plug, resulting in a compact, self-contained PA device suitable for easy installation in indoor environments such as hallways or classrooms.
Figure 10 shows a screenshot of the smartphone application. It displays the MQTT connection status (connected or disconnected) of the smart locks, PA devices, and the smartphone. During testing, the slide switches were used to successfully control the lock open and close operations, and the stop buttons were used to mute an ongoing PA announcement.

5.2.2. Image-Based Gun Detection Testing Results

The image-based gun detection system was evaluated in real time by conducting controlled walkthrough tests in the building hallways using realistic gun replicas [28]. Volunteers of different ages and genders carried and walked with various types of replica firearms, including rifles, shotguns, submachine guns (SMGs), and pistols, to simulate realistic shooter scenarios. When a volunteer entered the field of view of the North hallway camera, the system immediately detected the presence of a firearm, automatically closed the smart door lock associated with that camera, initiated PA announcements—“Gun image detected in North Hallway. Please move to South Hallway”—and sent a smartphone notification containing the event timestamp, location information, and an image preview, as shown in Figure 11a. The notification also included a Direction button. Selecting this option opened Google Maps and provided navigation guidance to the detected location, as illustrated in Figure 11b. Figure 11c shows a screenshot of the smartphone application following the event, where the slide control for Lock 1 automatically switches to the OFF position, indicating that the lock has been closed. Similar experiments were conducted in the South Hallway, and in all cases the system correctly detected the firearm and executed the corresponding lockdown, PA announcement, and notification actions.
To test the system for possible false alarms, volunteers walked with common objects such as bags, water bottles, boxes, chairs, pillows, flowers, and black sticks; none of these triggered a false alarm.
The runtime performance of the proposed image-based gun detection system was evaluated on the mini-PC using two simultaneous camera streams. Although the system was configured with a target processing rate of 30 frames per second per camera, the actual throughput was limited by the cumulative preprocessing, inference, and verification workload. Averaged across both cameras, the preprocessing stage required approximately 4.6 ms per frame, while TFLite inference required approximately 20.0 ms per frame, resulting in an average per-frame vision processing time of about 24.6 ms. Under continuous operation with no enforced sleep delay, the mini-PC achieved an average actual processing rate of approximately 18.7 FPS per camera, reflecting the maximum sustainable throughput given the available compute resources. OpenAI-based secondary verification was invoked only for confirmed gun candidates, with an average verification latency of approximately 934 ms per call. Depending on Internet conditions, the MQTT broker introduces an additional 10–100 ms delay [58] for delivering event notifications to the smartphone. Consequently, the worst-case end-to-end delay for an image-based gun event notification can be approximated as 24.6 ms (local vision processing) + 934 ms (OpenAI verification) + 100 ms (MQTT delivery) ≈ 1.06 s, which remains acceptable for real-time emergency response scenarios while providing high confidence through multi-stage verification.
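The worst-case figure quoted above is simply the sum of the measured stage latencies; a small worked example is shown below, using the values reported in this subsection with the MQTT delay taken at its 100 ms upper bound.

```python
# Worst-case image-event latency budget (illustrative arithmetic only).
preprocess_ms = 4.6      # average preprocessing per frame
inference_ms = 20.0      # average TFLite inference per frame
verify_ms = 934          # average OpenAI secondary verification per call
mqtt_ms = 100            # upper bound of broker delivery delay [58]

total_s = (preprocess_ms + inference_ms + verify_ms + mqtt_ms) / 1000
print(f"Worst-case image-event delay: {total_s:.2f} s")   # prints ~1.06 s
```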

5.2.3. Sound-Based Gunshot Detection Testing Results

The sound-based gunshot detection system was evaluated in real time using a blank gun, following a testing procedure similar to that reported in the author’s previous work [15]. When the blank gun was fired in the South Hallway, the system immediately detected the gunshot sound, automatically closed the smart door lock associated with that camera, initiated the PA announcement—“Gunshot sound detected in South Hallway. Please move to North Hallway”—and sent a smartphone notification containing the event timestamp, location information, and a Play button to access the recorded audio, as shown in Figure 12a. Selecting the Play button opens an audio player, illustrated in Figure 12b, allowing responders to listen to the recorded gunshot sound to better understand the context of the event and assess potential false alarms. The notification also included a Direction button; selecting it opened Google Maps and provided navigation guidance to the detected location, similar to the gun-image notification shown in Figure 11b. Figure 12c shows a screenshot of the smartphone application after the event, where the slide control for Lock 2 has automatically switched to the OFF position, indicating that the lock has been closed. Similar experiments were conducted in the North Hallway, and in all cases the system correctly detected the gunshot sound and executed the corresponding lockdown, PA announcement, and notification actions.
To evaluate the robustness of the sound-based gunshot detection module against false alarms, a variety of non-gunshot sounds were tested, including normal conversation, clapping, balloon popping, and playback of gunshot sounds from movies. None of these scenarios triggered false alarms. A false positive was observed only when a person screamed very close to the camera microphone, at a distance of approximately 1 ft or less. This scenario is unlikely to occur in practical deployments, as cameras are typically mounted on walls or ceilings and positioned several feet away from occupants’ mouths.
The runtime performance of the sound-based gunshot detection module was evaluated on the same mini-PC while the image-processing program was also running. Averaged across both cameras, the feature-extraction stage required approximately 5.5 ms per 1-s audio segment, while TFLite inference required only about 0.18 ms, resulting in negligible computational overhead relative to the audio acquisition time. From an end-to-end notification latency perspective, the detection delay is dominated by the 1-s audio window required for classification, followed by feature extraction, inference, and MQTT delivery. Consequently, the worst-case notification latency for a sound-based gunshot event can be approximated as 1.0 s (audio window) + 5.5 ms (feature extraction) + 0.18 ms (inference) + 100 ms (MQTT delivery) ≈ 1.11 s. This delay remains well within acceptable bounds for real-time emergency response, while providing reliable and continuous acoustic surveillance across multiple camera locations.
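A minimal sketch of this windowed classification loop is given below. It assumes MFCC-style features computed with librosa and a TFLite classifier whose input shape matches the feature array; the actual feature pipeline and model follow the author’s previous work [15], so the feature type, sample rate, and model file name shown here are assumptions for illustration only.

```python
# Hedged sketch: classify one 1-s audio window and time the two stages.
import time
import numpy as np
import librosa
import tensorflow as tf

SR = 16000                                                           # assumed sample rate
interpreter = tf.lite.Interpreter(model_path="gunshot_cnn.tflite")   # assumed model file
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

def classify_window(audio_1s: np.ndarray):
    t0 = time.perf_counter()
    feats = librosa.feature.mfcc(y=audio_1s, sr=SR, n_mfcc=40)   # feature extraction (~5.5 ms)
    t1 = time.perf_counter()
    x = feats[np.newaxis, ..., np.newaxis].astype(np.float32)    # must match the model's input shape
    interpreter.set_tensor(inp["index"], x)
    interpreter.invoke()                                         # TFLite inference (~0.18 ms)
    score = interpreter.get_tensor(out["index"])[0]
    t2 = time.perf_counter()
    return score, (t1 - t0) * 1e3, (t2 - t1) * 1e3               # (output, feature ms, inference ms)
```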

6. Discussion

A key design decision in the proposed system is the use of a lightweight, locally executed RS-CNN model for continuous image-based gun detection, with cloud-based OpenAI vision verification invoked only after a gun is confirmed by majority voting. Relying exclusively on a cloud-based large vision model for every frame would be prohibitively expensive and inefficient. At a target rate of 30 FPS per camera, a single camera would generate approximately 2.6 million frames per day, or nearly 78 million frames per month. Submitting even a small fraction of these frames to a cloud vision API would result in substantial token costs and increased network dependency. In contrast, the local RS-CNN performs real-time inference without any token cost, while OpenAI verification is triggered only for rare, high-confidence gun events—typically a few times per month in realistic deployments. This hybrid approach dramatically reduces cloud usage and cost while preserving high confidence through secondary verification, making the system economically viable for continuous, long-term operation in schools and other public facilities.
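A hedged sketch of this verification gate is shown below, assuming the OpenAI Python SDK’s Chat Completions API with a base64-encoded frame. The prompt wording and the simple yes/no parsing are illustrative rather than the deployed implementation; only frames that have already passed the local RS-CNN majority vote reach this function, so its cost and latency are incurred only for rare candidate events.

```python
# Hedged sketch of cloud-based secondary verification for a candidate frame.
import base64
import cv2
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def verify_gun_with_openai(frame) -> bool:
    ok, jpeg = cv2.imencode(".jpg", frame)
    b64 = base64.b64encode(jpeg.tobytes()).decode()
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Does this image contain a real firearm? Answer yes or no."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    answer = resp.choices[0].message.content.strip().lower()
    return answer.startswith("yes")   # a negative answer prevents escalation
```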
Since OpenAI does not publish task-specific accuracy metrics for the GPT-4.1 vision model, the standalone detection accuracy of the external verifier cannot be quantified. Consequently, the exact analytical accuracy of the combined RS-CNN and GPT-4.1 verification pipeline cannot be directly computed and is instead evaluated empirically through system-level testing. Real-time experiments indicate that the GPT-4.1 model is generally effective at distinguishing real firearms from visually similar black objects, such as TV remote controls that could otherwise be mistaken for pistols, and helps reduce false alarms.
Although cloud-based models were explored for secondary verification of gunshot sounds, experimental evaluation showed that such verification was not sufficiently reliable for this task. The OpenAI-based audio verification exhibited inconsistent performance, leading to potential false confirmations or rejections. For this reason, cloud-based verification was not used for sound events in the final system; instead, the previously validated local deep learning sound classifier [15] is relied upon exclusively.
Recent progress in real-time object detection has been driven by the YOLO family of one-stage detectors, which could be adopted in surveillance and security applications [59]. Ultralytics YOLOv8 introduced an anchor-free design with decoupled classification and regression heads, improving detection accuracy for small and partially occluded objects while maintaining real-time performance [60]. Subsequent YOLO-style architectures, including YOLOv10 and newer Ultralytics models (e.g., YOLO11), further optimize backbone efficiency, feature pyramid fusion, and loss functions to reduce false positives and improve localization stability [61,62]. In future work, the authors plan to evaluate newer Ultralytics YOLO models, such as YOLO11 and YOLO26, for firearm detection in indoor surveillance scenarios.
The proposed system is not intended to replace human judgment or make irreversible autonomous decisions. Instead, it is designed as a decision-support and early-warning system that accelerates situational awareness during high-risk events. While certain actions—such as temporary door locking and PA announcements—are triggered automatically to reduce response time, these actions are reversible, localized, and time-limited, and they are intended to mitigate immediate risk rather than enforce punitive measures.
False positives are unavoidable in any sensing system. Therefore, multiple safeguards are incorporated, including temporal majority voting, secondary verification using an external vision model, and cooldown mechanisms to prevent repeated triggers. Moreover, real-time notifications with visual and audio evidence are immediately sent to human responders, who retain full authority to assess the situation and take appropriate actions, such as unlocking the door and stopping the PA announcement if the event is a false positive. Note that locked classroom doors can be opened from inside using a push-button, as shown in Figure 5.
An event is confirmed only when both the local RS-CNN and the external verifier indicate the presence of a real firearm. If the external verifier returns a negative response, the system does not escalate the event, thereby providing conservative failure handling. In addition, uncertainty is mitigated at the system level by providing human responders with contextual information. For image-based events, a preview image is included in the smartphone notification (see Figure 11a), allowing responders to visually assess the situation. For sound-based events, a recorded audio clip is sent (see Figure 12b) and can be played directly from the notification to understand the acoustic context and assess potential false alarms. These mechanisms ensure that, even under model uncertainty, responders can make informed decisions using visual and auditory evidence rather than relying solely on automated outputs.
Trigger frequency is controlled through a multi-stage decision pipeline. For image-based detection, a firearm event is generated only after majority voting across three consecutive frames produces a consistent GUN decision, followed by secondary verification using an external vision model. In addition, a per-camera cooldown interval is enforced after each confirmed image-based event, preventing repeated triggers from the same visual incident. As a result, external verification requests and system actions are infrequent and event-driven rather than continuous. For sound-based detection, events are constrained by a fixed 1-s audio window and an enforced cooldown interval, which prevents repeated triggers from a single acoustic incident. This design ensures that system actions occur only when sustained evidence of a threat is present.
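The following sketch illustrates this trigger-control logic for the image pathway: a per-camera vote over three consecutive frame decisions, followed by a cooldown window that suppresses repeated escalations. The cooldown length shown is an assumed value for illustration, not the deployed parameter.

```python
# Hedged sketch of per-camera majority voting and cooldown control.
import time
from collections import defaultdict, deque

VOTE_FRAMES = 3        # consecutive frame decisions considered
COOLDOWN_S = 60.0      # assumed per-camera cooldown interval

votes = defaultdict(lambda: deque(maxlen=VOTE_FRAMES))
last_event_time = defaultdict(float)

def should_escalate(camera: str, frame_is_gun: bool) -> bool:
    votes[camera].append(frame_is_gun)
    window = votes[camera]
    if len(window) < VOTE_FRAMES or sum(window) <= VOTE_FRAMES // 2:
        return False                                  # no consistent GUN decision yet
    if time.time() - last_event_time[camera] < COOLDOWN_S:
        return False                                  # same visual incident, still cooling down
    last_event_time[camera] = time.time()             # open a new cooldown window
    window.clear()
    return True                                       # proceed to secondary verification
```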
In U.S. practice, automated systems are widely deployed in safety-critical domains, including fire alarm systems that automatically activate sprinklers, industrial emergency shutoff mechanisms, and vehicle collision-avoidance systems. In these applications, immediate automated responses are considered essential for reducing harm, even when the systems operate without prior human verification. The proposed system operates within this established framework: its automated actions are limited to non-punitive, reversible safety measures (temporary door locking, public address announcements, and event notifications) intended to mitigate immediate risk. We acknowledge that regulatory and legal requirements may differ across countries and jurisdictions. As such, the proposed system is presented as a proof-of-concept prototype rather than a finalized, policy-ready deployment. Country-specific compliance, regulatory approval processes, and human-in-the-loop requirements are recognized as important considerations and are explicitly identified as future work. These aspects will be addressed through collaboration with regulatory authorities, legal experts, and institutional stakeholders during any real-world deployment phase.
In the current prototype, the smart door lock requires approximately 10 s to complete a full open or close operation. While this duration is acceptable for a proof-of-concept system, faster actuation could be achieved by employing a higher-speed linear actuator. However, experimental testing revealed that increasing the speed of the selected actuator introduced undesirable mechanical vibration and instability. Future designs can reduce actuation time by selecting linear actuators with higher rated speeds.
In the current implementation, the association between cameras, smart locks, and PA devices is statically defined in configuration files. While this approach is sufficient for prototype validation, it limits flexibility during deployment and reconfiguration. As future work, a graphical user interface (GUI) can be developed to allow administrators to visually map which smart locks and PA devices are activated by each camera. Such an interface would simplify installation, support dynamic reconfiguration as building layouts change, and reduce the likelihood of configuration errors, thereby improving scalability and usability in real-world institutional deployments.
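For reference, the kind of static mapping currently used could look like the hedged sketch below; the device identifiers and structure are placeholders, and the proposed GUI would simply read and edit such a mapping instead of requiring manual file changes.

```python
# Hedged sketch of a static camera-to-device mapping held in a config module.
DEVICE_MAP = {
    "north_hallway_cam": {"locks": ["lock1"], "pa": ["pa_north", "pa_south"]},
    "south_hallway_cam": {"locks": ["lock2"], "pa": ["pa_north", "pa_south"]},
}

def devices_for(camera: str):
    """Return the lock and PA device IDs associated with a camera."""
    entry = DEVICE_MAP[camera]
    return entry["locks"], entry["pa"]
```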
While the proposed system demonstrates strong real-time performance, several limitations and extensions remain as future work. Expanding the dataset to include more diverse background environments and viewing conditions would further improve robustness. In addition, system performance under network congestion or partial connectivity loss has not been explicitly evaluated and should be investigated in future studies, particularly for large-scale deployments. A formal threat model is also an important extension, including potential vulnerabilities such as MQTT message spoofing, node compromise, and denial-of-service attacks. Future work will explore mitigation strategies such as stronger authentication, anomaly-based intrusion detection, rate limiting, and secure communication frameworks, based on recent advances in AI-driven intrusion detection and prevention systems for next-generation networks [59]. Finally, pilot deployments in real school environments are planned to assess long-term reliability, usability, cybersecurity resilience, and integration with existing safety procedures.

7. Conclusions

This work presented a novel audio-visual IoT-based gun detection and emergency response system designed to reduce response time and automate critical safety actions during active shooter incidents. The proposed system integrates two custom deep learning models—one for image-based gun detection and another for sound-based gunshot detection—achieving test accuracies of 94.6% and 99%, respectively. By combining visual and acoustic sensing, the system provides complementary detection pathways that improve reliability and robustness in real-world indoor environments. A complete prototype was developed, including PoE cameras with built-in microphones, a mini-PC hosting the MQTT broker and inference services, IoT-enabled smart door locks, IoT-connected public address devices, and a smartphone application for first responders. Real-time testing with gun replicas and blank gunshots demonstrated that the system can detect events, trigger localized lockdowns, broadcast evacuation instructions, and deliver rich notifications with image or audio context and location information to smartphones within approximately one second.

8. Patents

A U.S. patent application (No. 63/707,567) has been filed based on the work reported in this manuscript.

Funding

This research was funded by the Summer Research/Creative Activity Award of Eastern Michigan University.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

The author would like to thank the following individuals for volunteering in the creation of the gun image dataset: Abdul-Azeez Madani, Andrew Ross, Chi Wang, Fenghao Jia, Jiahao Mei, Levi Lloyd-Zeeryp, Longju Wang, Muhammad Ashraf, Osama Salim, Preston Carter, Radee Khan, Rafia Tahsin, Taqi Khan, Yuanshu Ge (Eason), and Zihan Zhao. Special thanks are extended to Mohamed Hagras for his significant involvement throughout the data collection process. Thanks to Matthew Lige from the Department of Public Safety (DPS) at Eastern Michigan University for providing police support.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Gunfire on School Grounds in the United States. Available online: https://everytownresearch.org/maps/gunfire-on-school-grounds/ (accessed on 4 December 2025).
  2. Active Shooter Notification Time Costs Lives. Available online: https://guard911.com/active-shooter-notification-time-costs-lives/ (accessed on 4 December 2025).
  3. ALICE Active Shooter Response Training. Available online: https://www.alicetraining.com/ (accessed on 2 December 2025).
  4. ZeroEyes. Available online: https://zeroeyes.com/ (accessed on 29 October 2024).
  5. Jain, A.; Garg, G. Gun Detection with Model and Type Recognition using Haar Cascade Classifier. In 2020 Third International Conference on Smart Systems and Inventive Technology (ICSSIT), Tirunelveli, India; IEEE: New York, NY, USA, 2020; pp. 419–423. [Google Scholar]
  6. Alaqil, R.M.; Alsuhaibani, J.A.; Alhumaidi, B.A.; Alnasser, R.A.; Alotaibi, R.D.; Benhidour, H. Automatic Gun Detection from Images Using Faster R-CNN. In 2020 First International Conference of Smart Systems and Emerging Technologies (SMARTTECH), Riyadh, Saudi Arabia; IEEE: New York, NY, USA, 2020; pp. 149–154. [Google Scholar]
  7. Debnath, R.; Bhowmik, M.K. Automatic Visual Gun Detection Carried by A Moving Person. In 2020 IEEE 15th International Conference on Industrial and Information Systems (ICIIS), Rupnagar, India; IEEE: New York, NY, USA, 2020; pp. 208–213. [Google Scholar]
  8. Mehta, P.; Kumar, A.; Bhattacharjee, S. Fire and Gun Violence based Anomaly Detection System Using Deep Neural Networks. In 2020 International Conference on Electronics and Sustainable Communication Systems (ICESC), Coimbatore, India; IEEE: New York, NY, USA, 2020; pp. 199–204. [Google Scholar]
  9. Goenka, A.; Sitara, K. Weapon Detection from Surveillance Images using Deep Learning. In 2022 3rd International Conference for Emerging Technology (INCET), Belgaum, India; IEEE: New York, NY, USA, 2022; pp. 1–6. [Google Scholar]
  10. AmberBox. Available online: https://amberbox.com/ (accessed on 29 October 2024).
  11. Lopez-Morillas, J.; Canadas-Quesada, F.J.; Vera-Candeas, P.; Ruiz-Reyes, N.; Mata-Campos, R.; Montiel-Zafra, V. Gunshot detection and localization based on Non-negative Matrix Factorization and SRP-Phat. In 2016 IEEE Sensor Array and Multichannel Signal Processing Workshop (SAM); IEEE: New York, NY, USA, 2016; pp. 1–5. [Google Scholar] [CrossRef]
  12. Valenzise, G.; Gerosa, L.; Tagliasacchi, M.; Antonacci, F.; Sarti, A. Scream and gunshot detection and localization for audio-surveillance systems. In 2007 IEEE Conference on Advanced Video and Signal Based Surveillance; IEEE: New York, NY, USA, 2007; pp. 21–26. [Google Scholar] [CrossRef]
  13. Bajzik, J.; Prinosil, J.; Koniar, D. Gunshot Detection Using Convolutional Neural Networks. In 2020 24th International Conference Electronics; IEEE: New York, NY, USA, 2020; pp. 1–5. [Google Scholar] [CrossRef]
  14. Morehead, A.; Ogden, L.; Magee, G.; Hosler, R.; White, B.; Mohler, G. Low Cost Gunshot Detection using Deep Learning on the Raspberry Pi. In 2019 IEEE International Conference on Big Data (Big Data); IEEE: New York, NY, USA, 2019; pp. 3038–3044. [Google Scholar] [CrossRef]
  15. Khan, T.H. A deep learning-based gunshot detection IoT system with enhanced security features and testing using blank guns. Internet Things (IoT) 2025, 6, 5. [Google Scholar] [CrossRef]
  16. ShotSpotter. Available online: https://www.soundthinking.com/law-enforcement/gunshot-detection-technology (accessed on 31 December 2025).
  17. Chinnasamy, P.; Sivakrishnaiah, C.; Sathiya, T.; Alam, I.; Degala, D.P. Design and Implementation of an IoT-based Emergency Alert and GPS Tracking System using MQTT and GSM/GPS Module. In 2025 5th International Conference on Trends in Material Science and Inventive Materials (ICTMIM), Kanyakumari, India; IEEE: New York, NY, USA, 2025; pp. 1286–1291. [Google Scholar]
  18. Chinnasamy, P.; Subramanian, A.; Nithish Selvam, R.; Kabilash, K.N.; Ibrahim, S.N.M.; Swetha, D.Y. GUARDTRACK: RFID and Wi-Fi based Smart Entry System. In 2025 5th International Conference on Trends in Material Science and Inventive Materials (ICTMIM), Kanyakumari, India; IEEE: New York, NY, USA, 2025; pp. 667–672. [Google Scholar]
  19. Shanthi, P.; Manjula, V. A systematic review on CNN-YOLO techniques for face and weapon detection in crime prevention. Discov. Comput. 2025, 28, 204. [Google Scholar] [CrossRef]
  20. Vallez, N.; Velasco-Mata, A.; Deniz, O. Deep autoencoder for false positive reduction in handgun detection. Neural Comput. Appl. 2021, 33, 5885–5895. [Google Scholar] [CrossRef]
  21. Shanthi, P.; Manjula, V. Weapon detection with FMR-CNN and YOLOv8 for enhanced crime prevention and security. Sci. Rep. 2025, 15, 26766. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
  22. Abins, A.A.; Priyadharshini, P.; Rohidh, G.; Cheran, R. Weapon recognition in CCTV videos: Deep learning solutions for rapid threat identification. In 2024 Second International Conference on Emerging Trends in Information Technology and Engineering (ICETITE); IEEE: New York, NY, USA, 2024; pp. 1–8. [Google Scholar]
  23. Wang, G.; Ding, H.; Duan, M.; Pu, Y.; Yang, Z.; Li, H. Fighting against terrorism: A real-time CCTV autonomous weapons detection based on improved YOLOv4. Digit. Signal Process. 2023, 132, 103790. [Google Scholar] [CrossRef]
  24. Bushra, S.N.; Shobana, G.; Maheswari, K.U.; Subramanian, N. Smart video surveillance-based weapon identification using YOLOv5. In 2022 International Conference on Electronic Systems and Intelligent Computing (ICESIC); IEEE: New York, NY, USA, 2022; pp. 351–357. [Google Scholar]
  25. Khalid, S.; Waqar, A.; Tahir, H.U.A.; Edo, O.C.; Tenebe, I.T. Weapon detection system for surveillance and security. In 2023 International Conference on IT Innovation and Knowledge Discovery (ITIKD); IEEE: New York, NY, USA, 2023; pp. 1–7. [Google Scholar]
  26. Yadav, P.; Gupta, N.; Sharma, P.K. Robust weapon detection in dark environments using YOLOv7-DarkVision. Digit. Signal Process. 2024, 145, 104342. [Google Scholar] [CrossRef]
  27. Chen, C.; Abdallah, A.; Wolf, W. Audiovisual Gunshot Event Recognition. In 2006 IEEE International Conference on Systems, Man and Cybernetics; IEEE: New York, NY, USA, 2006; pp. 4807–4812. [Google Scholar]
  28. Replica Guns. Available online: https://www.amazon.com/dp/B01MQS74AT/ (accessed on 4 December 2025).
  29. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA; IEEE: New York, NY, USA, 2017; pp. 1800–1807. [Google Scholar]
  30. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA; IEEE: New York, NY, USA, 2016; pp. 770–778. [Google Scholar]
  31. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
  32. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France; PMLR: New York, NY, USA, 2015; pp. 448–456. [Google Scholar]
  33. Nair, V.; Hinton, G.E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10); Omnipress: Madison, WI, USA, 2010; pp. 807–814. [Google Scholar]
  34. Nagi, J.; Ducatelle, F.; Di Caro, G.A.; Cireşan, D.; Meier, U.; Giusti, A.; Nagi, F.; Schmidhuber, J.; Gambardella, L.M. Max-Pooling Convolutional Neural Networks for Vision-based Hand Gesture Recognition. In IEEE International Conference on Signal and Image Processing Applications (ICSIPA2011); IEEE: New York, NY, USA, 2011. [Google Scholar]
  35. Keras: The Python Deep Learning Library. Available online: https://keras.io (accessed on 15 December 2025).
  36. Convert TensorFlow Models. Available online: https://ai.google.dev/edge/litert/models/convert_tf (accessed on 15 December 2025).
  37. Khan, T.H. Towards an indoor gunshot detection and notification system using deep learning. Appl. Syst. Innov. (ASI) 2023, 6, 94. [Google Scholar] [CrossRef]
  38. Youyeetoo Mini Computers. Available online: https://www.amazon.com/youyeetoo-Computers-Windows-Preinstalled-Business/dp/B0D4DQQYYX/ (accessed on 9 December 2025).
  39. Eclipse Mosquitto. Available online: https://mosquitto.org/ (accessed on 11 December 2025).
  40. MQTT: The Standard for IoT Messaging. Available online: https://mqtt.org/ (accessed on 9 December 2025).
  41. Cloud Storage for Firebase. Available online: https://firebase.google.com/docs/storage (accessed on 9 December 2025).
  42. FFmpeg. Available online: https://www.ffmpeg.org/ (accessed on 11 December 2025).
  43. ESP32-S3-Tiny. Available online: https://www.waveshare.com/wiki/ESP32-S3-Tiny (accessed on 12 December 2025).
  44. TB6612FNG Motor Driver. Available online: https://www.sparkfun.com/sparkfun-motor-driver-dual-tb6612fng-with-headers.html (accessed on 12 December 2025).
  45. Stepper Motor Linear Actuator. Available online: https://www.amazon.com/Stepper-Linear-Actuator-Engraving-Machine/dp/B09BZDSY7V (accessed on 12 December 2025).
  46. Door Latch. Available online: https://www.amazon.com/JQK-Security-Stainless-Thickened-HBB120-P2/dp/B09Y5MXCDN/ (accessed on 12 December 2025).
  47. AccelStepper Library for Arduino. Available online: https://www.airspayce.com/mikem/arduino/AccelStepper/index.html (accessed on 12 December 2025).
  48. Arduino-MQTT Library. Available online: https://github.com/256dpi/arduino-mqtt (accessed on 12 December 2025).
  49. WM8960 Hi-Fi Sound Card HAT. Available online: https://www.waveshare.com/wm8960-audio-hat.htm (accessed on 12 December 2025).
  50. HLK PM01 AC DC Converter 220V to 5V. Available online: https://www.amazon.com/EC-Buying-Step-Down-Intelligent-3-3V/dp/B09Z253MQ2 (accessed on 12 December 2025).
  51. PM2320 AC Wall Plug Enclosure. Available online: https://www.polycase.com/pm2320#PM2320T03XWT (accessed on 12 December 2025).
  52. How to Run a Raspberry Pi Program on Startup. Available online: https://learn.sparkfun.com/tutorials/how-to-run-a-raspberry-pi-program-on-startup/all#method-2-autostart (accessed on 12 December 2025).
  53. Paho-mqtt. Available online: https://pypi.org/project/paho-mqtt/ (accessed on 12 December 2025).
  54. gTTS (Google Text-to-Speech). Available online: https://pypi.org/project/gTTS/ (accessed on 12 December 2025).
  55. mpg123—Fast MP3 Player for Linux and Unix Systems. Available online: https://www.mpg123.de/ (accessed on 12 December 2025).
  56. jMQTT Library. Available online: https://www.b4x.com/android/help/jmqtt.html (accessed on 12 December 2025).
  57. NB6—Notifications Builder. Available online: https://www.b4x.com/android/forum/threads/nb6-notifications-builder.91819/ (accessed on 12 December 2025).
  58. Gavrilov, A.; Bergaliyev, M.; Tinyakov, S.; Krinkin, K.; Popov, P. Using IoT Protocols in Real-Time Systems: Protocol Analysis and Evaluation of Data Transmission Characteristics. J. Comput. Netw. Commun. 2022, 2022, 7368691. [Google Scholar] [CrossRef]
  59. Chinnasamy, P.; Yarramsetti, S.; Ayyasamy, R.K.; Rajesh, E.; Vijayasaro, V.; Pandey, D.; Pandey, B.K.; Lelish, M.E. AI-Driven intrusion detection and prevention systems to safeguard 6G networks from cyber threats. Sci. Rep. 2025, 15, 37901. [Google Scholar] [CrossRef] [PubMed]
  60. Jocher, G.; Qiu, J.; Chaurasia, A. YOLOv8: Ultralytics YOLO. Ultralytics. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 28 January 2026).
  61. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
  62. Ultralytics. Ultralytics YOLO Models. 2024. Available online: https://docs.ultralytics.com (accessed on 28 January 2026).
Figure 1. Samples from the image dataset. (a–f) show images from the gun class, while (g–l) show images from the non-gun class. Faces are covered for privacy.
Figure 2. The custom designed Residual Separable Convolutional Neural Network (RS-CNN) based architecture.
Figure 3. System architecture of the proposed system in a school. Power-over-Ethernet (PoE) cameras with built-in microphones (a,b) are connected to a network switch (c), which links to the router (d). A mini-PC (e) hosts the MQTT broker with external access enabled through port forwarding and runs the image and audio classification software. The router receives Internet connectivity (f) from the modem (g). When a gunman (h) appears or a gunshot occurs near the north side of the building, camera (a) captures the event, the software in the mini-PC classifies using local deep learning model, confirms with OpenAI (i), and then issues MQTT commands using Wi-Fi to close nearby smart door locks (j,k) of the classrooms, activate PA devices (l,m) to instruct individuals in the hallway to move toward the south side, upload the captured media to Firebase storage (n), and send a GPS-tagged notification with an image/audio’s Firebase link to the security personnel’s (o) smartphone (p).
Figure 4. Flowchart of a camera worker thread.
Figure 5. Block diagram of smart lock hardware.
Figure 6. Block diagram of the PA hardware.
Figure 7. (a) Camera at north hallway, (b) Camera at south hallway, (c) Network switch where cameras are connected using Ethernet cable, (d) mini-PC hosting the MQTT broker and audio-visual processing software, with its compact size illustrated by comparison to a U.S. 25-cent coin.
Figure 8. Smart door lock prototype in (a) open and (b) closed positions. In (a), the main components are shown: (1) ESP32-S3 Tiny microcontroller, (2) motor driver, (3) push-button switch, (4) stepper-motor–based linear actuator in the open position, (5) door latch, (6) DC power adapter, and (7) RGB LED illuminated green to indicate an open state. In (b), the linear actuator is shown in the closed position with the door latch engaged, and the RGB LED illuminated red to indicate a locked state.
Figure 9. IoT-connected public address (PA) system prototype. (a) Individual hardware components: (1) Raspberry Pi Zero 2W single-board computer, (2) WM8960 audio HAT, (3) speakers, (4) 110 VAC to 5 V DC power converter, and (5) bottom part of the enclosure with an integrated wall plug. (b) Fully assembled PA device housed in the enclosure and connected directly to a wall outlet.
Figure 10. Screenshot of the smartphone application showing the MQTT connection status (connected or disconnected) of the smart locks, PA devices, and the smartphone. The slide switches are used to control and monitor the locks, and the stop buttons are used to mute an ongoing PA announcement.
Figure 11. Real-time image-based gun detection and system response: (a) Smartphone notification showing detected gun image with timestamp and location information; (b) Google Maps navigation view opened via the Direction button in the notification; (c) Smartphone application interface after the event, where the slide control for Lock 1 automatically switches to the OFF position, indicating that the lock has been closed.
Figure 12. Sound-based gunshot detection and notification results: (a) smartphone notification generated after a gunshot sound is detected in the South Hallway, including timestamp, location information, and a Play button to access the recorded audio; (b) audio player interface allowing responders to listen to the recorded gunshot sound; (c) smartphone application screen showing the updated system status after the event, where the slide control for Lock 2 automatically switches to the OFF position, indicating that the lock has been closed.
Table 1. Comparison with products in the market.

| | ShotSpotter [16] | ZeroEyes [4] | AmberBox [10] | This Work |
| Indoor/Outdoor | Outdoor | Indoor | Indoor | Indoor |
| Detection Method | Sound | Image | Sound | Image + Sound |
| Human in the loop | | | | |
| Response Time (s) | 60 | 30 | 3.6 | Image: 1.06; Sound: 1.11 |
| Auto Lockdown | | | | |
| Auto PA announcement | | | | |
Table 2. Comparison with other published works.

| | P. S. et al. [21] | Alaqil et al. [6] | Debnath et al. [7] | Mehta et al. [8] | Goenka and Sitara [9] | J. Morillas et al. [11] | G. Valenzise et al. [12] | J. Bajzik et al. [13] | A. Morehead et al. [14] | T. Khan [15] | Chen et al. [27] | This Work |
| Modality | Image | Image | Image | Image | Image | Audio | Audio | Audio | Audio | Audio | Image + Audio | Image + Audio |
| Classifier | FMR-CNN–YOLOv8 | Faster R-CNN | Template-matching | YOLOv3 | Mask R-CNN | NMF | GMM | CNN | CNN | CNN | SVM | Image: RS-CNN; Audio: CNN |
| Accuracy % | 98.7 | - | 95 | 89.3 | 82.76 | - | - | 99 | 99 | 99 | 73.46 | Image: 94.6; Audio: 99 |
| Precision % | 90.1 | 82 | - | - | - | - | 93 | - | - | 100 | - | Image: 94.2; Audio: 100 |
| Smartphone notification (using SMS) | | | | | | | | | | | | |
| Plot on map | | | | | | | | | | | | |
| Realtime testing with replica and blank gun | | | | | | | | | | | | |
Table 3. Performance evaluation of the model on training, validation, and test datasets.

| | Training | Validation | Test |
| Loss | 0.1119 | 0.097 | 0.1336 |
| Accuracy | 0.9599 | 0.9707 | 0.9464 |
Table 4. Confusion matrix of the proposed model evaluated on the test dataset.

| True/Predicted | Gun | Non-Gun |
| Gun | 308 | 13 |
| Non-gun | 19 | 332 |
Table 5. Precision, recall, and F1-scores of the proposed model evaluated on the test dataset.

| | Precision | Recall | F1-Score |
| Gun | 0.9419 | 0.9595 | 0.9506 |
| Non-gun | 0.9623 | 0.9459 | 0.9540 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
