Article

DeepGun: Deep Feature-Driven One-Class Classifier for Firearm Detection Using Visual Gun Features and Human Body Pose Estimation

by Harbinder Singh *, Oscar Deniz *, Jesus Ruiz-Santaquiteria, Juan D. Muñoz and Gloria Bueno
VISILAB, Universidad de Castilla-La Mancha, 13071 Ciudad Real, Spain
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2025, 15(11), 5830; https://doi.org/10.3390/app15115830
Submission received: 14 April 2025 / Revised: 17 May 2025 / Accepted: 20 May 2025 / Published: 22 May 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract
The increasing frequency of mass shootings at public events and public buildings underscores the limitations of traditional surveillance systems, which rely on human operators monitoring multiple screens. Delayed response times often hinder security teams from intervening before an attack unfolds. Since firearms are rarely seen in public spaces and constitute anomalous observations, firearm detection can be treated as an anomaly detection (AD) problem, for which one-class classifiers (OCCs) are well-suited. To address this challenge, we propose a holistic firearm detection approach that integrates OCCs with visual hand-held gun features and human pose estimation (HPE). In the first stage, a variational autoencoder (VAE) learns latent representations of firearm-related instances, ensuring that the latent space is dedicated exclusively to the target class. Hand patches of variable sizes are extracted from each frame using body landmarks, dynamically adjusting based on the subject’s distance from the camera. In the second stage, a unified feature vector is generated by integrating VAE-extracted latent features with landmark-based arm-positioning features. Finally, an isolation forest (IFC)-based OCC model evaluates this unified feature representation to estimate the probability that a test sample belongs to the firearm-related distribution. By utilizing skeletal representations of human actions, our approach overcomes the limitations of appearance-based gun features extracted by the camera, which are often affected by background variations. Experimental results on diverse firearm datasets validate the effectiveness of our anomaly detection approach, achieving an F1-score of 86.6%, accuracy of 85.2%, precision of 95.3%, recall of 74.0%, and average precision (AP) of 83.5%. These results demonstrate the superiority of our method over traditional approaches that rely solely on visual features.

1. Introduction

Weapon detection technology is designed to identify and mitigate threats from firearms, knives, explosives, and other weapons in images and video frames [1,2,3,4,5]. According to a recent survey by the Gun Violence Archive (GVA) [6], mass shootings are increasing, raising serious concerns for public safety in crowded, shared spaces. In robbery incidents, an active shooter typically prevents any movement or activity, including the ability to send messages to local law enforcement. Therefore, once an active firearm is detected through visible gun features and dangerous pose information [7], a surveillance system must accurately confirm the presence of the firearm in the frame and immediately alert law enforcement. By leveraging advanced technologies, these AI-enabled systems can even detect concealed weapons, enhancing safety and helping prevent violent incidents in both public and private spaces [8]. Most weapon detection methods rely on deep learning-based object detection using binary classifiers, in which the firearm is visually identifiable within the scene [9,10,11] and the classifiers are trained on both normal and abnormal samples.
In light of this serious concern, several AI-powered commercial tools have emerged, recognized for their advancements in weapon and threat detection systems [12,13,14]. These tools leverage AI to enhance security by identifying potential threats in images and video frames, improving accuracy, and reducing response times. Egiazarov et al. [15] proposed a weapon detection system based on an ensemble of semantic convolutional neural networks (CNNs). Their approach decomposes the task of detecting and locating a weapon into a series of smaller subproblems, each focused on identifying individual component parts of the weapon, resulting in more accurate and robust detection. In another approach [16], transfer learning using VGG-16 as a feature extractor was employed to extract visible gun features, which were then classified by state-of-the-art support vector machine (SVM) classifiers [17] trained on a typical gun database. The model’s performance was evaluated in various real-world scenarios, including different backgrounds with firearms, occlusion, and other challenges. Very recently, a two-stage gun detection (TSGD) method was proposed [18], in which Stage 1 utilizes an image-augmented model to effectively classify videos as “gun” or “non-gun”. For videos classified as “gun”, Stage 2 employs an object detection model to precisely locate the firearm within the video frames.
Previous research has also explored various AI-driven approaches to developing reliable surveillance systems based on human action recognition in videos. However, most existing video analysis models (e.g., Kondratyuk et al. [19]; Wasim et al. [20,21]) are primarily designed for general video action recognition and have been evaluated on benchmark datasets. Ref. [19] introduced MoViNets, a model designed for efficient video recognition. Their approach incorporates stream buffers, which process videos in small consecutive subclips, maintaining constant memory usage while preserving long-range temporal dependencies. This innovation enables online inference while reducing memory and computational costs. One line of research in action recognition, ref. [20], explored transformer-based models for long-range spatiotemporal context modeling in video recognition. While self-attention mechanisms capture the global context effectively, they come with high computational costs. To address this, they proposed Video-FocalNets, a novel and efficient architecture that integrates both local and global contexts. Video-FocalNets utilizes a spatiotemporal focal modulation approach, reversing the interaction and aggregation steps of self-attention to improve efficiency. Another approach [21] focuses on spatiotemporal convolutions for video analysis, particularly in action recognition. The motivation behind this work stems from the strong performance of 2D-CNNs applied to individual video frames. Their study highlights the advantages of 3D-CNNs over 2D-CNNs within a residual learning framework. Additionally, they demonstrate that factorizing 3D convolutional filters into separate spatial and temporal components leads to significant accuracy improvements. Despite advancements in action recognition, these models often struggle with firearm detection due to the limited availability of labeled video datasets for training [18]. Through an empirical analysis of existing video action recognition models for robbery detection using a firearm video dataset that includes visible guns and dangerous poses, we evaluate the effectiveness of action recognition in firearm detection. To develop a reliable firearm detection-based surveillance system, we integrate both gun feature extraction and human action recognition, framing gun detection as a one-class classification (OCC) problem.

2. Related Work and Our Contributions

AI-powered imaging technology helps detect active shooter threats before a firearm becomes visible or shots are fired. Ref. [22] introduced a radar sensor-based 3D imaging approach for walkthrough screening systems, designed to enhance security in crowded areas by detecting concealed weapons on moving individuals. Their method leverages a fully convolutional architecture to generate a 3D peak-shaped probability map for effective weapon screening. For an efficient surveillance system, integrating advanced imaging technology with deep learning is essential for detecting firearm threats and ensuring public safety in both public and private spaces [23]. Furthermore, once a firearm becomes visible, the visual features extracted from the camera can be combined with action recognition techniques to enhance the performance of AI-enabled surveillance systems. In [24], a CNN model was utilized to detect and classify normal and abnormal activities in video frames extracted from an imaging device.
Human action recognition techniques focusing on posture and human body landmarks have been applied to analyze various physical conditions [25], including handgun detection [26], sports assistance [27], fall detection [28], and children’s gross motor skill analysis [29]. Recently, a new strategy has emerged that combines multiple neural networks (NNs) with the DEtection TRansformer (DETR) network for firearm detection in images [30]. The methodology presented in their work promotes collaboration and self-coordination among networks in the fully connected layers of DETR using multiple self-coordinating artificial neural networks (MANNs), eliminating the need for a coordinator. This self-coordination involves training the networks sequentially and integrating their outputs without additional components, resulting in high-level outcomes with a firearm detection precision of 84%. Extracting information specific to human actions is an effective method for detecting abnormal poses, including those associated with firearms, and offers robustness against complex backgrounds, illumination changes, and dynamic camera scenes [31]. In [7], the integration of skeleton features with visual features presented a promising yet underexplored approach for firearm detection, highlighting the need for further research in this area. They employed CNN- and transformer-based architectures for visual feature extraction and integrated these visual features with language models for firearm detection using a simple CNN model [7].
Dever et al. [32] proposed a method for analyzing silhouettes to recognize the classic holdup position in armed robberies. Their approach involves segmenting the silhouette’s skeleton into distinct body parts and identifying arm positions. Their results demonstrate that leveraging skeleton-based analysis effectively identifies human body parts and accurately detects holdup positions. Detecting active hands holding a visible firearm adds complexity to active shooter identification [33]. In some scenarios, individuals such as security personnel or police officers may already be present in the frame while holding firearms in normal poses, making it difficult to extract distinctive features for identifying active threats. Human pose estimation (HPE), which predicts the positions of body parts and joints in images or videos, enables the analysis of activities involving major muscle groups in the torso, arms, and legs. In recent years, various HPE methods [34,35] have played a critical role in enabling sign language recognition, full-body gesture control, and the quantification of physical exercises. These advancements serve as a foundation for applications in sports, fall detection, children’s gross motor skill assessment, and other fitness-related areas. Rather than scanning the entire frame, skeleton landmarks can be leveraged to pinpoint hand positions, and human activity recognition can be integrated with visual features to enhance firearm detection system performance.
Anomalous events refer to unusual actions, behaviors, or situations that pose risks to public health, safety, or the economy. The work in [36] proposed a real-world anomaly detection approach using the UCF-Crime dataset, a large-scale collection of surveillance videos that includes 13 types of realistic anomalies, such as fighting, road accidents, burglary, and firearm usage, alongside normal activities. The existing approaches to OCC-based visual anomaly detection (AD) primarily focus on learning a model of the normal distribution using autoencoders [37,38,39], generative adversarial networks (GANs) [40,41], or other unsupervised adaptation techniques [42]. VAEs are particularly well-suited for anomaly detection. They leverage deep learning networks, allowing for automatic feature learning, and they do not require labeled data from all classes of interest. Instead, VAE models are trained exclusively on objects from the target class, which are typically abundant. More recently, studies such as [43] have proposed a data-driven approach for constructing OCCs using a variational autoencoder (VAE-SIMCA) [44]. Their method determines classification decisions based on a linear combination of distances between the original and reconstructed images. These distances are approximated using a scaled chi-square distribution, with the decision boundary derived from the theoretical quantile function of this distribution and a predefined probability. Notably, their approach eliminates the need for specific optimization, as the decision boundary is solely determined by the model outcomes from the training set. Despite this lack of adaptation, these models exhibit strong AD performance and reliable spatial localization of defects in industrial products. The fundamental principle behind these methods lies in feature matching between test and normal samples while leveraging the multi-scale nature of deep feature representations. High-resolution features facilitate the segmentation of subtle structural deviations, whereas image-level AD is effectively handled by features at higher levels of abstraction [45]. OCC-based approaches enhance the accuracy and robustness of AD systems, making them well-suited for real-world applications.
Firearms are, by definition, rarely observed and considered uncommon objects [18]. Consequently, firearm detection can be framed as an AD problem, where OCCs are particularly well-suited. The absence of predefined classes and the limited availability of labeled data for firearm-related incidents present significant challenges for training supervised machine learning models for firearm detection. Traditional binary classification for firearm detection relies on both positive and negative samples; however, obtaining images of positive cases in publicly monitored spaces is highly challenging due to strict legal restrictions on carrying and displaying firearms. Even possessing a replica gun in public can cause alarm and lead to legal consequences. In contrast, negative samples are readily available in abundance. Our goal is to develop an OCC-based firearm detection model that leverages skeletal representations of human actions and appearance-based gun features while achieving performance comparable to fully supervised methods. To the best of the authors’ knowledge, this is the first time that such an approach has been taken for the purpose of weapon detection. In summary, this paper makes the following contributions:
  • Unsupervised Firearm Detection: We propose a fully unsupervised approach for detecting active firearms using a unified feature representation derived from a VAE and HPE. This method eliminates the need for prior labeling of anomalous samples containing a handgun and instead exploits averaged latent sample representations for improved accuracy.
  • Adaptive Patch Extraction: We introduce an automatic variable patch size extraction technique based on estimating the distance between the subject and the camera using HPE. This novel approach eliminates the need for cropping or oversizing patches extracted from visual gun instances in frames, ensuring robust and efficient feature extraction.
  • Comprehensive Benchmark Evaluation: We conduct extensive experiments using the VISILAB, UCF-Crime and YouTube benchmark firearm datasets, evaluating the performance of various video action analysis models, feature extraction techniques, and state-of-the-art OCC methods for firearm detection.
The remainder of this manuscript is structured as follows: Section 3 presents our proposed methodology. Section 4 details the experimental setup, results, cross-validation evaluation, ablation study, failure cases, and the potential for future research. Finally, Section 5 discusses our findings and concludes this study.

3. Proposed Algorithm

In this section, we introduce the proposed methodology and the motivation behind this work. Figure 1 presents a schematic diagram of the proposed firearm detection solution using an OCC approach. Our deep feature-driven OCC model, DeepGun, leverages both visual gun features and HPE to enhance firearm detection. By integrating these complementary data sources, DeepGun constructs a more precise and comprehensive representation, enabling effective decision-making, prediction, and classification of both normal and anomalous cases.
DeepGun is designed for outlier detection using classical OCC, identifying firearms as objects that deviate from expected patterns. This two-stage model first fuses deep features and body landmarks associated with visible firearms and dangerous poses, respectively. The first stage employs a VAE to extract latent features and significant body landmarks indicative of threatening postures associated with brandishing a weapon. These features are then combined into a unified feature vector. The key idea is to train the VAE on normal samples, enabling the model to recognize the presence of guns as outliers and thus distinguish firearms from normal samples.
The second stage applies classical OCCs to the unified feature vector, generating anomaly scores to identify samples with out-of-distribution features. The key advantage of this approach is its ability to reduce the uncertainty in visual gun feature extraction from the camera by integrating human action recognition (HAR) through body landmarks.
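At a high level, the two stages can be summarized by the following Python outline. This is a minimal sketch, not the released implementation; the helper names `extract_landmarks`, `extract_hand_patches`, `encode_patches`, and `arm_pose_features` are hypothetical placeholders for the steps detailed in Section 3.2, Section 3.3 and Section 3.4.

```python
import numpy as np

def deepgun_score(frame, vae, occ):
    """Two-stage DeepGun outline: fuse VAE hand features with pose landmarks,
    then score the unified vector with a one-class classifier (e.g., IFC)."""
    landmarks = extract_landmarks(frame)               # Stage 1a: HPE (BlazePose)
    patches = extract_hand_patches(frame, landmarks)   # adaptive hand ROIs
    v = encode_patches(vae, patches)                   # Stage 1b: VAE latent features
    l = arm_pose_features(landmarks)                   # normalized arm landmarks
    unified = np.concatenate([v, l])                   # early fusion
    return occ.predict(unified[None, :])[0]            # Stage 2: anomaly decision
```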

3.1. Firearm Dataset Used

We evaluate our anomaly detection approach on three firearm-related datasets, VISILAB [11], UCF-Crime [36], and YouTube samples [7], which together encompass a wide range of real-world scenarios, including indoor, outdoor, and nighttime low-resolution video frames, respectively.
The VISILAB dataset [11] used in this study was collected in a controlled environment, as most image recognition algorithms and HPE models are sensitive to variations in lighting conditions, subject size, background, distance from the camera, differences in imaging sensors, and other factors. To address these challenges, the dataset includes subjects both with and without firearms, holding various common objects while assuming different normal poses, and wearing diverse clothing colors. The videos were filmed indoors, ensuring a range of conditions that meet the requirements of indoor surveillance systems.
Figure 2 showcases examples from the VISILAB firearm benchmark dataset [11], annotated with full-body landmarks and hand detection, featuring both normal and abnormal samples extracted from videos. This video dataset consists of a total of 398 videos, categorized into three classes: ‘handgun’ (141 videos), ‘machine gun’ (139 videos), and ‘no gun’ (118 videos), with an average of 251 frames per video. The data were collected from various surveillance cameras with different resolutions, capturing diverse objects, venues, and backgrounds. The ‘no gun’ category includes commonly used objects such as empty hands, various types of smartphones, water bottles, sanitizers, boxes, and laptops. Localizing handguns in firearm detection scenarios is particularly challenging due to the limited availability of abnormal samples and the variability of anomalies. These anomalies can range from subtle structural defects to more pronounced issues such as brightness variations, color distortions, and partial occlusion of the handgun. In such cases, the confidence values of key hand landmarks, including the wrist, pinky, index, and thumb, are crucial for accurate hand detection. In our approach, we use a confidence threshold of 20% to confirm the visibility of both hands. The ROIs of the left and right hands are highlighted with cyan and red bounding boxes, respectively (see Figure 2).
UCF-Crime is a challenging dataset consisting of long, untrimmed surveillance videos encompassing 13 categories of real-world anomalies. For our analysis, we exclude videos in which firearms are not visibly present. From the filtered dataset, 1629 frames depicting the act of shooting someone with a firearm are selected for use during the inference phase. Figure 3 displays three representative frames from the UCF-Crime dataset, each illustrating a shooting-related anomaly.
Another noteworthy dataset used in our analysis is the YouTube dataset used in [7], which captures outdoor shooting scenarios under challenging conditions such as low lighting and distant subjects, while depicting the act of shooting someone with a handgun. For our evaluation, we consider 2451 frames extracted from this dataset. Figure 3 presents three representative frames, each illustrating a shooting-related anomaly within typical outdoor environments. The augmented dataset simulates environments with low or poor illumination by generating darkened versions of the original images, achieved through adjustments to the value component in the HSV color space. Another important factor affecting firearm detection performance, particularly for small objects, is camera distance. To introduce further variability into the YouTube dataset, an additional augmented set was created by reducing the image size by half and padding the remaining area with black pixels, thereby preserving the original image resolution. One example of the original and augmented images is shown in Figure 3c to illustrate the visual differences between the dataset versions.

3.2. Hand Patch Extraction Based on HPE

Human body landmark detection in outdoor scenarios, including surveillance applications, presents significant challenges. These include the vast range of possible poses, numerous degrees of freedom, occlusions, varying subject distances from the camera, and diverse clothing styles. In the proposed firearm detection approach, we utilized the BlazePose Full model (i.e., MediaPipe 0.10.15) [34], a lightweight CNN architecture designed for human body landmark detection. The keypoint topology of BlazePose is shown in Figure 4. During inference, the network detects 33 body keypoints for a single person and operates at over 30 frames per second. This efficiency makes it particularly well-suited for applications such as action recognition [47].
The proposed algorithm utilizes deep features derived from visible hand observations captured by the camera and firearm-related human body landmarks. By analyzing the combination of hand landmarks and elbow positioning, it effectively detects dangerous poses. For dangerous pose detection, twenty key body landmarks are derived from the normalized skeleton [7]: ten on the right arm, including the hand, and ten on the left arm, including the hand. Instead of using the neck keypoint, as in OpenPose [35], we normalize landmarks based on torso size, calculated from the positions of the left_hip, right_hip, left_shoulder, and right_shoulder. Normalization is a pre-processing step that standardizes the coordinate origin for each detected keypoint, if present. This process ensures that only the relative positions between keypoints are considered, removing irrelevant factors such as the subject’s size, position within the image, camera distance, and image resolution.
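The landmark extraction and normalization described above can be sketched with the MediaPipe Python API as follows. This is a minimal example: the frame path is hypothetical, the keypoint indices follow the BlazePose topology of Figure 4 (shoulders 11/12, hips 23/24, arm/hand keypoints 13–22), and the exact grouping of keypoints into the feature vector is an assumption for illustration.

```python
import cv2
import mediapipe as mp
import numpy as np

mp_pose = mp.solutions.pose
frame = cv2.imread("frame.jpg")  # hypothetical video frame (BGR)

with mp_pose.Pose(static_image_mode=True, model_complexity=2) as pose:
    results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

if results.pose_landmarks:
    # Each landmark provides normalized x, y and a visibility (confidence) score.
    pts = np.array([[p.x, p.y] for p in results.pose_landmarks.landmark])
    # Torso size from the shoulder (11, 12) and hip (23, 24) keypoints.
    torso = 0.5 * (np.linalg.norm(pts[11] - pts[23]) +
                   np.linalg.norm(pts[12] - pts[24]))
    # Arm/hand keypoints (elbows, wrists, pinky, index, thumb on both sides),
    # re-centered on the hip midpoint and scaled by the torso size so that
    # only relative positions remain.
    arm_ids = [13, 14, 15, 16, 17, 18, 19, 20, 21, 22]
    hip_center = 0.5 * (pts[23] + pts[24])
    l_features = ((pts[arm_ids] - hip_center) / torso).flatten()
```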
The landmarks associated with the firearm are represented as a feature vector, denoted as l. These landmark details are essential, as they provide critical pose information for anomaly detection and the keypoints needed for accurate patch selection. To accurately extract deep features from both hands, it is essential to carefully select patch sizes to avoid cropping the hands or using oversized patches, as either could introduce distortions from background-related features captured by the imaging sensors. This necessitates adjusting the patch size dynamically based on the subject’s relative distance from the camera. The proposed patch size detection algorithm relies on observing skeleton size variations as the subject moves relative to the camera or is captured by multiple cameras. This process requires quantifying the skeleton size in each video frame and correlating these measurements with the appropriate patch size. To achieve this, the subject’s relative distance from the camera must be estimated in each frame and used to determine an optimal patch size for deep feature extraction via the VAE, ensuring it is neither too small nor too large. Given that the relative distance may fluctuate over time, the skeleton size will also vary accordingly. However, this variation does not follow a linear relationship and can instead be described by a second-order (quadratic) function. To address this, our approach utilizes an easy-to-implement polynomial function (see Figure 5b) of the following form:
f(d) = A d^2 + B d + C    (1)
where d represents the observed distance. We rely on this quadratic function (cf. Equation (1)), with coefficients A, B, C ∈ ℝ, so that a patch size r can be obtained from d. Fitting the observations using the function f(d) in Equation (1) results in a fully adaptive model for determining r based on the subject’s distance d:
r = f(d) + ϵ    (2)
where ϵ is a user-defined regularization factor used to fine-tune r. This parameter helps adjust the torso size-based patch size estimation to account for unseen cases, such as video frames extracted from different datasets. For the VISILAB dataset, the default value of ϵ is set to 10. This adaptive approach ensures that the extracted r dynamically adjusts according to d (cf. Figure 5a), optimizing the accuracy of feature extraction without unnecessary cropping or oversizing of the ROI for deep feature extraction. Figure 5c displays the regression model fitted to the VISILAB firearm dataset. Following the model fitting, the residuals are plotted in Figure 5d. The residuals are randomly scattered around the zero line, suggesting that the model effectively determines patch size across varying distances based on torso size estimation.
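A minimal sketch of this fitting step is given below, using numpy’s polynomial fit. The (d, r) observation pairs are illustrative placeholders; the actual coefficients are learned from the VISILAB subject tracks, and only the reported patch size bounds (32 × 32 to 90 × 90) and ϵ = 10 come from the text.

```python
import numpy as np

# Hypothetical (distance, patch size) observation pairs collected from the
# segmented subjects at their minimum and maximum camera distances.
d_obs = np.array([0.12, 0.18, 0.25, 0.33, 0.41])   # torso-based distances
r_obs = np.array([32, 44, 58, 74, 90])             # chosen patch sizes (px)

# Fit the quadratic model of Equation (1): f(d) = A*d^2 + B*d + C.
A, B, C = np.polyfit(d_obs, r_obs, deg=2)

def patch_size(d, eps=10):
    """Equation (2): adaptive patch size with regularization factor eps."""
    r = A * d**2 + B * d + C + eps
    return int(np.clip(r, 32, 90))  # bounds reported for the VISILAB dataset
```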
To estimate the values of d, we compute the Euclidean distance between shoulder and hip points, which serves as an approximation of the subject’s torso size. As the subject moves away from the camera, this Euclidean distance decreases, indicating a smaller torso size. Conversely, as the subject approaches the camera, the Euclidean distance increases, reflecting a larger torso size. To ensure this measurement is invariant to rotation, we compute the diagonal Euclidean distance using the differences in x and y coordinates between the shoulder and hip points. Based on the estimated torso size, we determine the appropriate patch size to extract from both hands. For the body landmarks associated with the torso, as illustrated in Figure 4, we use the following notation: let (J_11^x, J_11^y), (J_12^x, J_12^y), (J_23^x, J_23^y), and (J_24^x, J_24^y) denote the unnormalized (x, y) coordinates of the right_shoulder, left_shoulder, right_hip, and left_hip, respectively. To ensure robustness, the subject’s distance from the camera is estimated using both the left and right torso landmark pairs. The Euclidean distance is then computed as follows:
d_r(J_11, J_23) = sqrt((J_11^x − J_23^x)^2 + (J_11^y − J_23^y)^2),
d_l(J_12, J_24) = sqrt((J_12^x − J_24^x)^2 + (J_12^y − J_24^y)^2),
d = [d_r(J_11, J_23) + d_l(J_12, J_24)] / 2    (3)
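In code, Equation (3) reduces to two hypotenuse computations and an average. The sketch below assumes a hypothetical dictionary of unnormalized (x, y) pixel coordinates keyed by the Figure 4 landmark indices.

```python
import numpy as np

def torso_distance(kp):
    """Equation (3): average of the two diagonal shoulder-to-hip distances.

    `kp` is a hypothetical dict of unnormalized (x, y) pixel coordinates for
    the four torso landmarks of Figure 4 (keys "J11", "J12", "J23", "J24").
    """
    d_r = np.hypot(kp["J11"][0] - kp["J23"][0], kp["J11"][1] - kp["J23"][1])
    d_l = np.hypot(kp["J12"][0] - kp["J24"][0], kp["J12"][1] - kp["J24"][1])
    return (d_r + d_l) / 2.0
```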
For our experiments, we utilize a video dataset recorded using three types of cameras: a PTZ camera with a 640 × 480 resolution and two mobile phones (OnePlus 6T and OnePlus 5T), both with a 1920 × 1080 resolution [11]. From this dataset, 20,461 normal frames and 7352 firearm frames are extracted. Additionally, we assume that the focal length of the imaging device remains fixed during video capture and that the camera position does not change. These constraints are commonly met in most surveillance systems for a given location. The VISILAB video dataset used to fit the function in Equation (1) consists of 259 videos, categorized into two classes: ‘handgun’ (141 videos) and ‘no gun’ (118 videos). We introduce the following notation: let K ∈ ℕ represent the number of subject tracks captured by different cameras, where each track has a length N_m ∈ ℕ, with m ∈ {1, …, K}. The distance and patch data are represented as observation pairs:
(d_1^1, r_1^1), …, (d_{N_1}^1, r_{N_1}^1)    (1st subject)
(d_1^2, r_1^2), …, (d_{N_2}^2, r_{N_2}^2)    (2nd subject)
⋮
(d_1^K, r_1^K), …, (d_{N_K}^K, r_{N_K}^K)    (Kth subject)    (4)
Initially, to obtain the distance and patch value pairs (d, r) used for function fitting in Equation (1), a representative and suitable patch size r must be carefully selected along with its corresponding distance d from the given subject segment. To ensure a diverse selection, we choose suitable r values and their corresponding d values from the segmented subjects, considering both the minimum and maximum distances from the different cameras used in the VISILAB video dataset. The graph in Figure 6a illustrates the varying patch sizes extracted from the handgun and no-gun dataset based on the subject’s distance from the imaging device. As shown, the patch size is determined automatically for the entire dataset, with a maximum size of 90 × 90 and a minimum size of 32 × 32. Observing the patches in the left column of Figure 6b, we see that, in some frames, the fixed patch size successfully captures hands holding a gun in the last two cases. However, for other instances, the fixed patch size fails to effectively extract hand details when the subject is closer to the camera. Conversely, as shown in the right column of Figure 6b, when patch sizes are adjusted based on the calculated distance from landmarks, both gun-holding and empty hands are extracted accurately without cropping or oversizing.
Once the subject’s distance is determined using the measurements between torso landmarks (see Figure 4), the hand patches are extracted based on the confidence values of the key landmarks associated with the left and right hands (i.e., the wrist, pinky, index, and thumb). This ensures accurate hand patch selection for deep feature extraction while maintaining consistency in the deep feature representation.
The process for selecting patches for both the left and right hands is as follows (a code sketch is given after the list):
  • Step 1: For each frame, compute the maximum confidence value of the landmarks belonging to the left and right hands (i.e., the wrist, pinky, index, and thumb landmarks).
  • Step 2: Discard patches and landmarks if the confidence computed in Step 1 is below 20% for both the left and right hands. Frames that do not meet this threshold are excluded from both the training and testing phases.
  • Step 3: If one hand (left or right) is not visible, copy the landmark values and patches for that hand from the last frame in which it was visible. This reduces the waste of patches and body landmarks. Therefore, if the confidence for one hand is above the threshold but the other hand’s confidence is below it, the information for the invisible hand is taken from the most recent frame in which that hand was detected.
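The following sketch captures these per-frame rules. The helpers `hand_confidence(lm, side)` (maximum visibility of the wrist, pinky, index, and thumb landmarks) and `crop_patch(frame, lm, side, r)` are hypothetical placeholders for the steps described above.

```python
CONF_THR = 0.20
last_seen = {"left": None, "right": None}  # (patch, landmarks) per hand

def select_hand_patches(frame, lm, r):
    """Apply Steps 1-3: threshold per-hand confidence, drop empty frames,
    and fall back to the last visible observation for an occluded hand."""
    conf = {s: hand_confidence(lm, s) for s in ("left", "right")}
    if conf["left"] < CONF_THR and conf["right"] < CONF_THR:
        return None  # Step 2: discard the frame entirely
    out = {}
    for side in ("left", "right"):
        if conf[side] >= CONF_THR:
            out[side] = (crop_patch(frame, lm, side, r), lm)
            last_seen[side] = out[side]
        else:
            out[side] = last_seen[side]  # Step 3: reuse the last visible hand
    return out
```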

3.3. Variational Autoencoders (VAE)

The VAE is an NN designed to extract spatial feature information of a visible gun from the ROIs, denoted by R, extracted from both hands. Its network architecture resembles that of a standard autoencoder (AE) [48], but with a key difference: the encoder (E) in a VAE enforces the latent representation z to follow a predefined prior probability distribution p(z) (e.g., N(0, 1)) (see Figure 7). The decoder (G) then reconstructs realistic data using latent codes sampled from p(z).
In the proposed approach, the VAE is trained using normal data samples. When anomalous patches containing visible gun features extracted from the camera are processed, they cause deviations in the latent variables during gradient backpropagation and attention generation. These deviations act as indicators for detecting anomalies associated with the presence of a visible gun.
For a datapoint R ∈ ℝ^d, the conditional distributions of E and G are denoted as q_ϕ(z|R) and p_θ(R|z), respectively. Here, d is the dimension of the R extracted from the hands, and ϕ and θ are the hidden parameters of E and G, respectively. Since the data distribution p_θ(R) is intractable through analytical methods, variational inference techniques are employed to optimize the objective function, the maximum likelihood estimation of log p_θ(R):
L(R) = log p_θ(R) − KL[q_ϕ(z|R) ‖ p_θ(z|R)]
     = E_{q_ϕ(z|R)}[log p_θ(R) + log p_θ(z|R) − log q_ϕ(z|R)]
     = −KL[q_ϕ(z|R) ‖ p_θ(z)] + E_{q_ϕ(z|R)}[log p_θ(R|z)]    (5)
where KL denotes the Kullback–Leibler divergence (KLD), a metric that quantifies the similarity between q_ϕ(z|R) and p_θ(z|R). To optimize the objective function L(R), the VAE maximizes the evidence lower bound (ELBO). To optimize the second term in Equation (5), the VAE minimizes the reconstruction error between input and output using the mean squared error (MSE) between the inputs and their reconstructions, L_MSE(R, R_r) = ‖R − R_r‖². Therefore, the objective function of the VAE in Equation (5) can be rewritten as follows:
L_VAE = L_MSE(R, G_θ(z)) + λ L_KL(E_ϕ(R)) = L_MSE(R, R_r) + λ L_KL(μ, σ)    (6)
where λ is a scaling hyperparameter that balances the trade-off with the KLD term. In Equation (6), μ and σ represent the mean and standard deviation of the Gaussian distribution q_ϕ(z|R), which define the parameter vectors estimated by E to optimize the KLD:
L_KL(μ, σ) = KL(q_ϕ(z|R) ‖ p_θ(z)) = KL(N(z; μ, σ²) ‖ N(z; 0, 1))
           = ∫ N(z; μ, σ²) log[N(z; μ, σ²) / N(z; 0, 1)] dz
           = −(1/2)(1 + log(σ²) − μ² − σ²)    (7)
In the proposed firearm detection approach, a VAE is preferred over an AE because it leverages the reconstruction probability obtained by sampling z from the prior p_θ(z) for L iterations. Two fully connected NNs are utilized for E and G, respectively. During the training of the VAE for feature extraction, patches of variable sizes are resized to a fixed size of 32 × 32. The specific parameters for each component of the VAE are detailed in Table 1. Figure 8 presents a graph of the training and validation loss, where the VAE was trained on the normal dataset, using the test dataset as validation data for early stopping. A specific set of hyperparameters was selected, including a batch size of 32, a minimum change of 0.00075, a patience value of 5, ReLU activation in the intermediate layers, and a sigmoid activation function in the final layer. The VAE is used to encode visual gun features extracted from image patches within video frames, compensating for information loss caused by the skeletonization of the human subject. These deep features, denoted as v, are incorporated into a unified feature vector that combines object-specific visual data extracted from the hands by the imaging device with pose information obtained from HPE. This comprehensive representation enhances the model’s ability to detect firearms by preserving both visual and contextual human pose information. A comprehensive analysis of the proposed approach is provided in the results and evaluation Section 4.
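The following PyTorch sketch illustrates a fully connected VAE of this kind trained with the MSE-plus-KLD objective of Equations (6) and (7). The layer widths, input channels, and latent dimension of 8 are assumptions made for illustration; the paper’s exact settings are those listed in Table 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseVAE(nn.Module):
    """Fully connected VAE sketch for 32x32 hand patches (assumed RGB input)."""
    def __init__(self, in_dim=32 * 32 * 3, latent_dim=8):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, 128), nn.ReLU(),
        )
        self.fc_mu = nn.Linear(128, latent_dim)
        self.fc_logvar = nn.Linear(128, latent_dim)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, 512), nn.ReLU(),
            nn.Linear(512, in_dim), nn.Sigmoid(),  # sigmoid output layer
        )

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.dec(z), mu, logvar

def vae_loss(x, x_rec, mu, logvar, lam=1.0):
    """Equation (6): MSE reconstruction term plus lambda-weighted KLD of Eq. (7)."""
    mse = F.mse_loss(x_rec, x, reduction="mean")
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return mse + lam * kld
```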

3.4. Feature Fusion and Anomaly Detection

At this stage, the features extracted from the hand patches by the VAE and the body pose landmarks are represented as feature vectors v and l, respectively. To integrate these feature vectors, we employ an early fusion [49] mechanism that creates a unified representation, combining the information extracted from both the hand patches and HPE. This mechanism achieves fusion by concatenating the feature vectors into a single unified feature vector. Subsequently, an isolation forest classifier (IFC) [50] model is trained using normal feature vectors, derived from frames without a gun and involving normal actions, to capture the correlations and interactions between features from both sources. For the implementation of the IFC, the default settings were used, with the contamination parameter set to “auto” and the number of base estimators in the ensemble fixed at 40 across all experiments. Given the feature vectors v and l, which contain the two types of features, the prediction p of the final OCC model, denoted as h, can be expressed as follows:
p = h([v, l]) = h([v_1, v_2, …, v_K, l_1, l_2, …, l_N])    (8)
where K and N represent the number of features extracted by the VAE and HPE, respectively. Regarding HPE, we consider ten body landmarks related to the positioning of the left and right arms, including the hands. These are left_wrist, left_pinky, left_index, left_thumb, left_elbow, right_wrist, right_pinky, right_index, right_thumb, and right_elbow, all of which are crucial for accurate dangerous pose prediction. Similarly, eight deep features are extracted from the VAE after dimensionality reduction [51], corresponding to the hands, with or without a gun in either hand. Therefore, a total of twenty-eight features are used to train the IFC model for firearm prediction. Based on the anomaly score, samples containing a handgun and a dangerous pose are assigned a score of “1”, indicating an anomaly, while samples without a gun and in a normal pose receive a score of “0”, signifying normal behavior. The ROC curves of the proposed DeepGun method implemented using the IFC, support vector machine (SVM) [52], Gaussian mixture model (GMM) [53], and minimum covariance determinant estimator (MCDE) [54] are shown in Figure 9. For a fair comparison, all OCC implementations use the default parameter settings of IFC, SVM, MCDE, and GMM as provided by the scikit-learn library. In Figure 9, the DeepGun implementation using the IFC exhibits the largest area under the ROC curve, indicating the best average performance compared to the other implementations, including SVM, GMM, and MCDE.
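A minimal scikit-learn sketch of the fusion and scoring step is shown below. The arrays `V_train`/`L_train` (VAE and landmark features from “no gun” frames) and `V_test`/`L_test` are hypothetical inputs; the IFC settings (40 estimators, contamination “auto”) are those stated above.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Early fusion: concatenate the VAE latent features v (K values) with the
# pose-landmark features l (N values) into one unified vector per frame.
X_train = np.concatenate([V_train, L_train], axis=1)

# Isolation forest OCC trained on normal ("no gun") feature vectors only.
ifc = IsolationForest(n_estimators=40, contamination="auto", random_state=0)
ifc.fit(X_train)

# scikit-learn returns +1 for inliers and -1 for outliers; map them to the
# convention used here: 1 = firearm anomaly, 0 = normal behavior.
X_test = np.concatenate([V_test, L_test], axis=1)
pred = (ifc.predict(X_test) == -1).astype(int)
```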

4. Experiments and Evaluation

This section presents a comparative analysis of the results obtained using the proposed methodology for firearm detection and evaluates the performance of different variants of the proposed firearm detection approach with various state-of-the-art feature extractors, including AE [48], ResNet50 [55], VGG19 [56], NasNetLarge [57], DenseNet201 [58], and MobileNetV2 [59]. First, we present a comparative analysis with existing firearm detection and action recognition frameworks. Then, we evaluate the proposed method in conjunction with various deep feature extraction models and state-of-the-art OCC models, including IFC, SVM, GMM, and MCDE. For a fair comparison, all OCC implementations utilize the default parameter settings provided by the scikit-learn library across all feature extractor-based approaches. Finally, we conduct an ablation study by replacing the NN with a CNN in the VAE, resulting in a convolutional variational autoencoder (CVAE). We also analyze the effect of removing specific components of the VAE model on firearm detection performance and compare the impact of variable patch sizes against a fixed patch size.
For comparative analysis, this paper utilizes the recently introduced VISILAB benchmark dataset [11]. From this dataset, 20,461 normal frames and 7352 firearm frames are extracted and shuffled from the 118 ‘no gun’ videos and the 141 ‘handgun’ videos, respectively. These frames are used to extract the feature vector v of hand patches, while the corresponding human body landmark features l are obtained using HPE. To prevent sharing of temporal context, scene layout, clothing, and background, the training and testing datasets are shuffled prior to training and inference. During inference, 2000 randomly selected samples from the ‘no gun’ category and 2000 randomly selected samples from the ‘handgun’ category are used to evaluate the performance of the proposed OCC-based firearm detection model, ensuring a fair and balanced assessment.

4.1. Comparative Analysis with Other Firearm Detection and Action Recognition Frameworks

To evaluate the performance of our firearm detection model, we chose accuracy (ACC) as the primary evaluation metric for comparative analysis. The performance of DeepGun was assessed in comparison to R-CNN [16], TSGD [18], Video-FocalNets [20], MoViNets [19], and 3D-CNNs [21] using the VISILAB and UCF-Crime firearm datasets, as presented in Table 2. Furthermore, we compare our DeepGun approach with recently proposed data-driven anomaly detection methods for images, namely VAE-SIMCA [43] and TSGD [18]. For this comparison, we selected ACC, precision (P), and specificity (S) as evaluation metrics for a more comprehensive performance assessment (see Table 3).
From Table 2, we observe that the DeepGun method outperforms other firearm detection and action recognition approaches in terms of ACC. The performance of Video-FocalNets and MoViNets is comparable to that of DeepGun. Furthermore, Table 3 indicates that when comparing our VAE-based feature extraction approach with VAE-SIMCA, which is a reconstruction error-based OCC anomaly detection method, DeepGun demonstrates superior performance in terms of ACC, precision (P), and specificity (S). However, TSGD achieves higher precision compared to both DeepGun and VAE-SIMCA. Overall, the proposed DeepGun method, leveraging a combination of body landmarks and visual gun features, proves to be highly effective for firearm detection.

4.2. DeepGun vs. Deep Architectures as Feature Extractor

Although our proposed two-step approach primarily relies on VAE for visual gun feature extraction, we also performed a comparative evaluation against other deep learning architectures. The core idea behind this evaluation is to utilize pre-trained networks on the ImageNet database for feature extraction from hand patches captured by the imaging device. These pre-trained models are designed to learn rich feature representations from large-scale image datasets, making them highly effective for extracting meaningful features from new images.
By utilizing these pre-trained networks, we aim to convert hand patches into a set of relevant and discriminative features. These features serve as a compact yet informative representation of the hands, enabling the classification of images into two categories: hands holding a firearm and hands without a firearm. During the inference phase, these extracted features, combined with the body landmarks, play a crucial role in distinguishing between firearm and non-firearm scenarios, enhancing the overall accuracy and robustness of the detection system. For a comprehensive analysis, we replaced our VAE-based feature extraction with five alternative deep architectures: ResNet50 [55], VGG19 [56], NasNetLarge [57], DenseNet201 [58], and MobileNetV2 [59].
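As an illustration, a pre-trained backbone can stand in for the VAE feature extractor as follows. This is a sketch using torchvision’s ImageNet-pretrained MobileNetV2; the framework, preprocessing, and pooling choices here are assumptions, since the paper does not specify them.

```python
import numpy as np
import torch
from torchvision import models, transforms

# ImageNet-pretrained MobileNetV2 with its classifier head removed, so that the
# forward pass returns the 1280-D pooled embedding of each hand patch.
backbone = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)
backbone.classifier = torch.nn.Identity()
backbone.eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),                       # HxWx3 uint8 -> float CHW in [0, 1]
    transforms.Resize((224, 224)),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def hand_features(patch_rgb: np.ndarray) -> np.ndarray:
    """Map a hand patch (H x W x 3, uint8, RGB) to a deep feature vector."""
    x = preprocess(patch_rgb).unsqueeze(0)
    return backbone(x).squeeze(0).numpy()
```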
On the VISILAB firearm dataset, our VAE-based feature extraction and IFC-based OCC model outperforms these six architectures, demonstrating superior performance in terms of F1 score, ACC, P, R, and AP. The results, presented in Table 4, provide a comprehensive comparative evaluation of firearm detection across both ‘handgun’ and ‘no gun’ samples.
Notably, replacing the simple AE [48] with the VAE for feature extraction significantly improves firearm detection performance, as shown in the second-to-last column of Table 4, when implemented with different state-of-the-art OCC methods, including IFC, SVM, and MCDE, except GMM. While the AE- and GMM-based implementations exhibit some performance improvements, they lag behind the other feature extraction-based firearm detection approaches. The experimental results show that the proposed OCC-based firearm detection framework, which leverages body landmarks and visual gun features, performs effectively across various deep architectures, including the AE and VAE. These findings not only highlight the model’s adaptability to unified features but also confirm its strong generalization ability, making it well-suited for different deep learning networks. Another notable finding from Table 4 is that when firearm detection is performed using state-of-the-art OCCs and deep feature extractors, the SVM model achieves remarkably high recall. This indicates that combining the SVM with the various feature extractor implementations results in fewer false negatives (FNs). A similar trend is observed in implementations using MCDE with other feature extractors. Precision, on the other hand, measures the proportion of correct positive predictions. A model with high precision produces fewer false positive (FP) predictions. The IFC- and GMM-based implementations exhibit the opposite trend, achieving higher precision than recall. This suggests a lower number of FPs, making them suitable for scenarios where minimizing false alarms is a priority. However, for highly sensitive locations, an SVM-based implementation may be more appropriate due to its superior recall performance.
The performance of the various feature extraction methods was also evaluated on the UCF-Crime and YouTube datasets, as shown in Table 5 and Table 6, respectively. From Table 5, we observe that NasNetLarge, MobileNetV2, and DeepGun with the IFC exhibit comparable performance. Notably, the MobileNetV2 with IFC integration outperforms all other implementations across all five performance metrics on this challenging dataset.
Conversely, Table 6 indicates that DeepGun, when combined with IFC and SVM, achieves the highest performance among all feature extraction methods on the YouTube dataset.
Overall, maintaining a balance between recall and precision emerges as the most effective strategy. An optimal firearm detection model should minimize both false positives and false negatives. In this regard, the DeepGun implementation with IFC demonstrates strong performance across diverse datasets by effectively leveraging both visual firearm features and body pose information.

4.3. Cross-Validation Evaluation and Ablation Study

In this section, we perform 5-fold cross-validation to evaluate the robustness of the proposed DeepGun-based firearm detection approach. The performance metrics obtained using the various feature extraction methods are summarized in Table 7. As shown in Table 7, the DeepGun model with the IFC implementation outperforms all other feature extraction methods when combined with state-of-the-art OCC techniques, demonstrating stable performance with minimal standard deviation in most cases, which confirms its robustness. Another notable finding from the cross-validation is that ResNet50, when combined with the SVM, outperforms all other methods. The performance trends of the MCDE and GMM-based implementations in Table 7 are largely consistent with the evaluation results presented earlier in Table 4.
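For reference, one reasonable reading of this protocol can be sketched as follows: the OCC is fit on the normal portion of each training fold and scored on a mixed held-out set. The arrays `X_normal` and `X_gun` are assumed unified feature matrices built as in Section 3.4, and the way the ‘handgun’ samples are attached to each fold is an assumption, since the exact fold construction is not detailed here.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kf.split(X_normal):
    ifc = IsolationForest(n_estimators=40, contamination="auto", random_state=0)
    ifc.fit(X_normal[train_idx])                       # fit on normal frames only
    X_test = np.vstack([X_normal[test_idx], X_gun])    # mixed held-out evaluation set
    y_true = np.r_[np.zeros(len(test_idx)), np.ones(len(X_gun))]
    y_pred = (ifc.predict(X_test) == -1).astype(int)   # -1 (outlier) -> firearm
    scores.append(f1_score(y_true, y_pred))
print(f"F1: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```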
We conduct ablation studies to validate the effectiveness of the proposed firearm detection method using a VAE. Specifically, we replace the NN in the VAE with a CNN, forming a CVAE. We analyze the impact of removing specific layers of the VAE model on firearm detection performance and compare the effects of variable patch sizes versus a fixed patch size. The objective results are presented and summarized in Table 8. In this table, we evaluate the F1, ACC, P, R, and AP values on the test set, which consists of a total of 4000 patches extracted from “handgun” and “no gun” frames in the input videos. For this study, we have chosen the best-performing state-of-the-art OCC, IFC with VAE, for the final firearm detection step. Based on the anomaly score generated by the IFC, a score of “1” is assigned to anomalous samples containing a handgun and a dangerous pose, while a score of “0” is assigned to samples without a gun and in a normal pose.
To validate the effectiveness of DeepGun, we compare the performance metrics obtained by replacing the NN in the VAE with a CNN, as shown in the first column of Table 8. Both networks are trained using the MSE loss and KL loss with the same parameter settings. We refer to them as DeepGun (see the last column of Table 8) and DeepGun-CNN (see the first column of Table 8), respectively. Compared to DeepGun-CNN, DeepGun performs better in terms of F1, ACC, P, and AP, while yielding the same performance in terms of R.
To assess the effectiveness of the encoder module in the VAE, we compare DeepGun trained with four NN encoder layers to DeepGun trained with two NN encoder layers, as shown in the second column of Table 8. Both models are trained using the same parameter settings. The model trained with two encoder layers is referred to as DeepGun-E2. A comparison between DeepGun and DeepGun-E2 reveals that their performance is nearly identical, with only a slight reduction in performance for DeepGun-E2. This suggests that the VAE with NN implementation is effective in extracting deep features that align well with those obtained from the HPE module, facilitating firearm detection in frames extracted from videos.
Additionally, by comparing the performance of DeepGun with variable patch sizes extracted from video frames, as explained in Section 3.2, we evaluate the impact of patch size on firearm detection. We refer to the model trained with fixed patch sizes ( 40 × 40 ) as DeepGun-FPS. From Table 8, we observe that features extracted from patches based on the subject’s distance from the camera outperform those from fixed-size patches, leading to improved firearm detection performance by DeepGun. We suggest that this difference occurs because fixed-size patches may be too small or too large as the subject moves closer to or farther from the imaging device, respectively. This can lead to capturing irrelevant background details or missing important features, such as the hand, due to improper patch sizing. In contrast, adaptively selecting patches is more effective, as it focuses on optimizing feature-level similarity, leading to improved firearm detection.

4.4. Failure Cases and the Potential for Future Research

In this section, we address the failure cases of the proposed DeepGun approach and provide a discussion of directions for future research. A primary objective in any firearm detection system is to sustain a high detection rate while minimizing both FPs and FNs during testing. However, maintaining a high detection rate without keeping the FP rate low can result in an unacceptably high number of false alarms when the system is deployed in real-world surveillance scenarios. The failure cases observed in the VISILAB dataset are illustrated in Figure 10.
In Figure 10a, the subject is either carrying objects other than a handgun or has empty hands. In such cases, partial occlusion of the hands or the extraction of background patches instead of meaningful hand regions can lead to unreliable features, which are a common cause of FPs. To mitigate this issue, we propose integrating an FP filtering mechanism that preserves the handgun detection capability of the system, as suggested in [61]. This filtering step may involve training a specialized model on the false positive detections collected over time while deploying the firearm detector in the target environment.
In another example, illustrated in Figure 10b, the subject is approaching the camera while carrying a handgun. However, due to poor lighting conditions and the firearm being inadequately depicted, the patch extracted from the hand does not yield robust features. In such cases, distorted firearm characteristics and ambiguous pose information can result in an unreliable unified feature vector, often leading to false negatives (FNs). To address this challenge, we propose utilizing the approach introduced in [62], which extracts multiple ROIs (multi-ROI) from the hands to construct a fused feature vector. Additionally, a more robust VAE, as described in [63], could be employed to extract more discriminative features from the hand patches.
In our approach, accurately extracting the hand positions for both hands from the skeleton is crucial to ensure that the hand ROI is correctly cropped, enabling reliable feature extraction by the VAE. Moreover, body landmarks are vital for detecting dangerous poses. However, MediaPipe often fails to provide consistent results. In some videos, landmarks are successfully detected in one frame but entirely missing in the subsequent frame, even though the frames are visually similar and the subject is clearly visible. An example of this failure case is illustrated in Figure 10c, where body landmarks could not be extracted using the default settings, despite the subject being clearly captured by the imaging device. We attempted to adjust various confidence thresholds, but these efforts did not resolve the issue. Therefore, to improve the robustness and reliability of the firearm detection framework, we propose replacing MediaPipe with a more effective keypoint extraction algorithm, as suggested in [64].

5. Conclusions

In this paper, we propose a novel OCC-based anomaly detection approach for firearm detection, leveraging deep features extracted from visual gun cues captured by the camera and body landmarks obtained through HPE. The integration of a VAE with HPE for firearm detection marks a significant advancement at the intersection of AI and public safety. Experimental results demonstrate that body landmarks play a critical role in accurately identifying the ROI, effectively complementing visual gun features and helping to minimize both false positives and false negatives.
Extensive testing demonstrates the robustness of our model, achieving a strong balance between precision and recall—critical for ensuring reliable firearm detection in high-stakes environments where accuracy is paramount. Furthermore, the comparative analysis of firearm detection across various deep feature extraction methods and classical OCCs highlights the efficiency of our approach, making it well-suited for anomaly detection in video surveillance applications.
As security concerns continue to rise in urban settings, AI-driven firearm detection systems have the potential to enhance proactive threat mitigation strategies. Future research involving diverse datasets will provide deeper insights and further validate the adaptability of our method across various indoor and outdoor environments. Additionally, investigating alternative HPE techniques and refining feature extraction methods could further improve detection accuracy, particularly in complex scenarios.
Overall, our findings pave the way for advancing firearm detection methodologies, offering promising applications in next-generation surveillance systems designed to enhance public safety.

Author Contributions

Conceptualization, H.S. and J.R.-S.; methodology, H.S.; software, J.D.M.; validation, O.D.; formal analysis, H.S.; investigation, O.D. and G.B.; resources, O.D. and G.B.; data curation, J.R.-S. and J.D.M.; writing—original draft, H.S.; writing—review and editing, O.D.; visualization, J.R.-S., J.D.M. and G.B.; supervision, G.B.; project administration, O.D.; funding acquisition, G.B. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially funded by projects Horizon Europe dAIEdge Grant n. 101120726 by the European Commission and project SBPLY/21/180501/000025 by the Autonomous Government of Castilla-La Mancha, ERDF and the European Commission.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Thakur, A.; Shrivastav, A.; Sharma, R.; Kumar, T.; Puri, K. Real-Time Weapon Detection Using YOLOv8 for Enhanced Safety. arXiv 2024, arXiv:2410.19862.
  2. Burnayev, Z.R.; Toibazarov, D.O.; Atanov, S.K.; Canbolat, H.; Seitbattalov, Z.Y.; Kassenov, D.D. Weapons Detection System Based on Edge Computing and Computer Vision. Int. J. Adv. Comput. Sci. Appl. 2023, 14, 0140586.
  3. Santos, T.; Oliveira, H.; Cunha, A. Systematic review on weapon detection in surveillance footage through deep learning. Comput. Sci. Rev. 2024, 51, 100612.
  4. Akshaya, P.; Reddy, P.B.; Panuganti, P.; Gurusai, P.; Subhahan, A. Automatic weapon detection using Deep Learning. In Proceedings of the 2023 International Conference on Research Methodologies in Knowledge Management, Artificial Intelligence and Telecommunication Engineering (RMKMATE), Chennai, India, 1–2 November 2023; pp. 1–7.
  5. Salido, J.; Lomas, V.; Ruiz-Santaquiteria, J.; Deniz, O. Automatic handgun detection with deep learning in video surveillance images. Appl. Sci. 2021, 11, 6085.
  6. Gun Violence Archive. Available online: https://www.gunviolencearchive.org/reports/mass-shooting/year-2023 (accessed on 26 February 2025).
  7. Ruiz-Santaquiteria, J.; Velasco-Mata, A.; Vallez, N.; Deniz, O.; Bueno, G. Improving handgun detection through a combination of visual features and body pose-based data. Pattern Recognit. 2023, 136, 109252.
  8. Muñoz, J.D.; Ruiz-Santaquiteria, J.; Deniz, O.; Bueno, G. Concealed Weapon Detection Using Thermal Cameras. J. Imaging 2025, 11, 72.
  9. Olmos, R.; Tabik, S.; Herrera, F. Automatic handgun detection alarm in videos using deep learning. Neurocomputing 2018, 275, 66–72.
  10. Yadav, P.; Gupta, N.; Sharma, P.K. A comprehensive study towards high-level approaches for weapon detection using classical machine learning and deep learning methods. Expert Syst. Appl. 2023, 212, 118698.
  11. Ruiz-Santaquiteria, J.; Muñoz, J.D.; Maigler, F.J.; Deniz, O.; Bueno, G. Firearm-related action recognition and object detection dataset for video surveillance systems. Data Brief 2024, 52, 110030.
  12. Grega, M.; Matiolański, A.; Guzik, P.; Leszczuk, M. Automated Detection of Firearms and Knives in a CCTV Image. Sensors 2016, 16, 47.
  13. Omnilert. Available online: https://www.omnilert.com/solutions/ai-gun-detection (accessed on 26 February 2025).
  14. Pelco. Available online: https://www.pelco.com/blog/weapons-detection-systems (accessed on 26 February 2025).
  15. Egiazarov, A.; Mavroeidis, V.; Zennaro, F.M.; Vishi, K. Firearm Detection and Segmentation Using an Ensemble of Semantic Neural Networks. In Proceedings of the 2019 European Intelligence and Security Informatics Conference (EISIC), Oulu, Finland, 26–27 November 2019; pp. 70–77.
  16. Verma, G.K.; Dhillon, A. A Handheld Gun Detection using Faster R-CNN Deep Learning. In Proceedings of the 7th International Conference on Computer and Communication Technology, ICCCT-2017, New York, NY, USA, 24–26 November 2017; pp. 84–88.
  17. Hearst, M.A.; Dumais, S.T.; Osuna, E.; Platt, J.; Scholkopf, B. Support vector machines. IEEE Intell. Syst. Their Appl. 1998, 13, 18–28.
  18. Das, B.C.; Amini, M.H.; Wu, Y. Accurate and Efficient Two-Stage Gun Detection in Video. arXiv 2025, arXiv:2503.06317.
  19. Kondratyuk, D.; Yuan, L.; Li, Y.; Zhang, L.; Tan, M.; Brown, M.; Gong, B. Movinets: Mobile video networks for efficient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 16020–16030.
  20. Wasim, S.T.; Khattak, M.U.; Naseer, M.; Khan, S.; Shah, M.; Khan, F.S. Video-FocalNets: Spatio-temporal focal modulation for video action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 13778–13789.
  21. Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; Paluri, M. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6450–6459. [Google Scholar]
  22. Khan, N.S.; Ogura, K.; Cosatto, E.; Ariyoshi, M. Real-time concealed weapon detection on 3D radar images for walk-through screening system. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 673–681. [Google Scholar]
  23. Kaul, C.; Mitchell, K.J.; Kassem, K.; Tragakis, A.; Kapitany, V.; Starshynov, I.; Villa, F.; Murray-Smith, R.; Faccio, D. AI-Enabled Sensor Fusion of Time-of-Flight Imaging and mmWave for Concealed Metal Detection. Sensors 2024, 24, 5865. [Google Scholar] [CrossRef]
  24. Ingle, P.Y.; Kim, Y.G. Real-Time Abnormal Object Detection for Video Surveillance in Smart Cities. Sensors 2022, 22, 3862. [Google Scholar] [CrossRef] [PubMed]
  25. Cha, Y.W.; Price, T.; Wei, Z.; Lu, X.; Rewkowski, N.; Chabra, R.; Qin, Z.; Kim, H.; Su, Z.; Liu, Y.; et al. Towards fully mobile 3D face, body, and environment capture using only head-worn cameras. IEEE Trans. Vis. Comput. Graph. 2018, 24, 2993–3004. [Google Scholar] [CrossRef]
  26. Velasco-Mata, A.; Ruiz-Santaquiteria, J.; Vallez, N.; Deniz, O. Using human pose information for handgun detection. Neural Comput. Appl. 2021, 33, 17273–17286. [Google Scholar] [CrossRef]
  27. Chung, J.L.; Ong, L.Y.; Leow, M.C. Comparative Analysis of Skeleton-Based Human Pose Estimation. Future Internet 2022, 14, 380. [Google Scholar] [CrossRef]
  28. Ramirez, H.; Velastin, S.A.; Meza, I.; Fabregas, E.; Makris, D.; Farias, G. Fall Detection and Activity Recognition Using Human Skeleton Features. IEEE Access 2021, 9, 33532–33542. [Google Scholar] [CrossRef]
  29. Suzuki, S.; Amemiya, Y.; Sato, M. Skeleton-based visualization of poor body movements in a child’s gross-motor assessment using convolutional auto-encoder. In Proceedings of the 2021 IEEE International Conference on Mechatronics (ICM), Kashiwa, Japan, 7–9 March 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–6. [Google Scholar]
  30. Soares, R.A.A.; de Oliveira, A.C.M.; de Almeida Ribeiro, P.R.; de Almeida Neto, A. Firearm detection using DETR with multiple self-coordinated neural networks. Neural Comput. Appl. 2024, 36, 22013–22022. [Google Scholar] [CrossRef]
  31. Ruiz-Santaquiteria, J.; Velasco-Mata, A.; Vallez, N.; Bueno, G.; Alvarez-Garcia, J.A.; Deniz, O. Handgun detection using combined human pose and weapon appearance. IEEE Access 2021, 9, 123815–123826. [Google Scholar] [CrossRef]
  32. Dever, J.; da Vitoria Lobo, N.; Shah, M. Automatic visual recognition of armed robbery. In Proceedings of the 2002 International Conference on Pattern Recognition, Quebec City, QC, Canada, 11–15 August 2002; Volume 1, pp. 451–455. [Google Scholar] [CrossRef]
  33. Darker, I.; Gale, A.; Ward, L.; Blechko, A. Can CCTV Reliably Detect Gun Crime? In Proceedings of the 2007 41st Annual IEEE International Carnahan Conference on Security Technology, Ottawa, ON, Canada, 8–11 October 2007; pp. 264–271. [Google Scholar] [CrossRef]
  34. Bazarevsky, V.; Grishchenko, I.; Raveendran, K.; Zhu, T.; Zhang, F.; Grundmann, M. BlazePose: On-device Real-time Body Pose tracking. arXiv 2020, arXiv:2006.10204. [Google Scholar]
  35. Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.E.; Sheikh, Y. Openpose: Realtime multi-person 2d pose estimation using part affinity fields. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 172–186. [Google Scholar] [CrossRef]
  36. Sultani, W.; Chen, C.; Shah, M. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6479–6488. [Google Scholar]
  37. Seliya, N.; Abdollah Zadeh, A.; Khoshgoftaar, T.M. A literature review on one-class classification and its potential applications in big data. J. Big Data 2021, 8, 1–31. [Google Scholar] [CrossRef]
  38. Ruff, L.; Vandermeulen, R.; Goernitz, N.; Deecke, L.; Siddiqui, S.A.; Binder, A.; Müller, E.; Kloft, M. Deep one-class classification. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; PMLR: Breckenridge, CO, USA, 2018; pp. 4393–4402. [Google Scholar]
  39. Khan, S.S.; Madden, M.G. One-class classification: Taxonomy of study and review of techniques. Knowl. Eng. Rev. 2014, 29, 345–374. [Google Scholar] [CrossRef]
  40. Perera, P.; Nallapati, R.; Xiang, B. Ocgan: One-class novelty detection using gans with constrained latent representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2898–2906. [Google Scholar]
  41. Liu, K.; Li, A.; Wen, X.; Chen, H.; Yang, P. Steel surface defect detection using GAN and one-class classifier. In Proceedings of the 2019 25th International Conference on Automation and Computing (ICAC), Lancaster, UK, 5–7 September 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–6. [Google Scholar]
  42. Perera, P.; Patel, V.M. Learning deep features for one-class classification. IEEE Trans. Image Process. 2019, 28, 5450–5463. [Google Scholar] [CrossRef]
  43. Petersen, A.; Kucheryavskiy, S. VAE-SIMCA—Data-driven method for building one class classifiers with variational autoencoders. Chemom. Intell. Lab. Syst. 2025, 256, 105276. [Google Scholar] [CrossRef]
  44. Kingma, D.P.; Welling, M. An introduction to variational autoencoders. Found. Trends Mach. Learn. 2019, 12, 307–392. [Google Scholar] [CrossRef]
  45. Goyal, S.; Raghunathan, A.; Jain, M.; Simhadri, H.V.; Jain, P. DROCC: Deep robust one-class classification. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; PMLR: Breckenridge, CO, USA, 2020; pp. 3711–3721. [Google Scholar]
  46. MediaPipe ML Solutions in MediaPipe. Available online: https://chuoling.github.io/mediapipe/ (accessed on 1 March 2025).
  47. Alsawadi, M.S.; Rio, M. Human Action Recognition using BlazePose Skeleton on Spatial Temporal Graph Convolutional Neural Networks. In Proceedings of the 2022 9th International Conference on Information Technology, Computer, and Electrical Engineering (ICITACEE), Semarang, Indonesia, 25–26 August 2022; pp. 206–211. [Google Scholar] [CrossRef]
  48. Li, P.; Pei, Y.; Li, J. A comprehensive survey on design and application of autoencoder in deep learning. Appl. Soft Comput. 2023, 138, 110176. [Google Scholar] [CrossRef]
  49. Ramachandram, D.; Taylor, G.W. Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Process. Mag. 2017, 34, 96–108. [Google Scholar] [CrossRef]
  50. Liu, F.T.; Ting, K.M.; Zhou, Z.H. Isolation Forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008; pp. 413–422. [Google Scholar] [CrossRef]
  51. Abdi, H.; Williams, L.J. Principal component analysis. Wiley Interdiscip. Rev. Comput. Stat. 2010, 2, 433–459. [Google Scholar] [CrossRef]
  52. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  53. Reynolds, D.A. Gaussian mixture models. Encycl. Biom. 2009, 741, 3. [Google Scholar]
  54. Rousseeuw, P.J.; van Driessen, K. A Fast Algorithm for the Minimum Covariance Determinant Estimator. Technometrics 1999, 41, 212–223. [Google Scholar] [CrossRef]
  55. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  56. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  57. Zoph, B.; Vasudevan, V.; Shlens, J.; Le, Q.V. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8697–8710. [Google Scholar]
  58. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  59. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  60. Saul, L.K.; Weiss, Y.; Bottou, L. Advances in Neural Information Processing Systems 17: Proceedings of the 2004 Conference; MIT Press: Cambridge, MA, USA, 2005; Volume 17. [Google Scholar]
  61. Vallez, N.; Velasco-Mata, A.; Deniz, O. Deep autoencoder for false positive reduction in handgun detection. Neural Comput. Appl. 2021, 33, 5885–5895. [Google Scholar] [CrossRef]
  62. Nie, Y.; Liu, C.; Long, C.; Zhang, Q.; Li, G.; Cai, H. Multi-RoI Human Mesh Recovery with Camera Consistency and Contrastive Losses. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany; pp. 439–458. [Google Scholar]
  63. Akrami, H.; Joshi, A.A.; Li, J.; Aydöre, S.; Leahy, R.M. A robust variational autoencoder using beta divergence. Knowl.-Based Syst. 2022, 238, 107886. [Google Scholar] [CrossRef]
  64. Wang, J.; Tan, S.; Zhen, X.; Xu, S.; Zheng, F.; He, Z.; Shao, L. Deep 3D human pose estimation: A review. Comput. Vis. Image Underst. 2021, 210, 103225. [Google Scholar] [CrossRef]
Figure 1. The proposed firearm detection framework. For simplicity, the process is explained using a single input test frame. We use the following acronyms to denote specific inputs and operations: OCC (One-Class Classifier) and ROI (Region-of-Interest).
Figure 2. Examples from the VISILAB benchmark handgun firearm datasets. Superimposed on the images are the human-body landmarks from MediaPipe [46]. Based on the visibility threshold, the ROIs of the left and right hands are highlighted with cyan and red bounding boxes, respectively. Image credit: VISILAB firearm image database [11] (https://data.mendeley.com/datasets/bbzpxhd22j/2, accessed on 3 February 2025). (a) Firearm VISILAB; (b) ROIs VISILAB; (c) Normal samples VISILAB; (d) ROIs VISILAB.
Figure 3. Examples from the UCF-Crime and YouTube benchmark handgun firearm datasets. Superimposed on the images are the human-body landmarks from MediaPipe [46]. Based on the visibility threshold, the ROIs of the left and right hands are highlighted with cyan and red bounding boxes, respectively. (a) Firearm UCF-Crime; (b) ROIs UCF-Crime; (c) Firearm YouTube; (d) ROIs YouTube.
Figure 4. Keypoints topology of human body landmarks extracted by BlazePose.
Figure 5. Distance-based patch size estimation: (a) the patch size of the observed hand changes with the distance between the subject and the camera, and this relation can be described by a second-order polynomial; (b) the second-order polynomial used to fit the data; (c) graphical representation of the fitted function; and (d) residual plot of the predicted values. The variable patch sizes computed for the VISILAB no gun and handgun dataset are shown in Figure 6.
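To make the fitting step in Figure 5 concrete, the following sketch fits a second-order polynomial to (distance, patch size) calibration pairs with NumPy; the sample values are invented for illustration and do not come from the VISILAB data.

```python
import numpy as np

# Hypothetical calibration pairs: subject-to-camera distance (m) vs. observed hand patch size (px).
distance = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
patch_px = np.array([96.0, 64.0, 48.0, 38.0, 32.0, 28.0])

# Second-order polynomial fit, as described in Figure 5b.
coeffs = np.polyfit(distance, patch_px, deg=2)
predict_patch = np.poly1d(coeffs)

# Residuals of the fit, the basis of a residual plot like Figure 5d.
residuals = patch_px - predict_patch(distance)

print(predict_patch(2.5))  # patch size suggested for a subject about 2.5 m from the camera
```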
Figure 6. Patch sizes computed for VISILAB no gun and handgun dataset: (a) patch size vs. distance; (b) examples of patches with variable and fixed sizes.
Figure 7. An illustration of the VAE architecture. In the proposed approach, R represents the flattened input extracted from a 32 × 32 patch within the video frame, v denotes the encoded feature vector, and R′ corresponds to the reconstructed patch generated by the VAE.
Figure 8. Training and validation loss of VAE trained on the normal ‘no gun’ dataset for visual gun feature extraction.
Figure 9. The ROC curves for DeepGun implemented with IFC, SVM, GMM, and MCDE.
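For readers who want to reproduce this kind of comparison, the sketch below fits four one-class models on target-class feature vectors and scores a test set. The scikit-learn estimators (IsolationForest, OneClassSVM, GaussianMixture, and EllipticEnvelope as a minimum-covariance-determinant-based detector), their hyperparameters, and the random feature vectors are plausible stand-ins, not the exact configuration behind Figure 9.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.mixture import GaussianMixture
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 68))   # unified feature vectors of the target class (illustrative)
X_test = rng.normal(size=(100, 68))    # feature vectors of unseen test frames (illustrative)

scores = {
    # Higher score = closer to the training (firearm-related) distribution.
    "IFC":  IsolationForest(random_state=0).fit(X_train).decision_function(X_test),
    "SVM":  OneClassSVM(kernel="rbf", nu=0.1).fit(X_train).decision_function(X_test),
    "GMM":  GaussianMixture(n_components=4, random_state=0).fit(X_train).score_samples(X_test),
    "MCDE": EllipticEnvelope(support_fraction=0.9, random_state=0).fit(X_train).decision_function(X_test),
}
# ROC curves like those in Figure 9 can then be drawn from these scores and ground-truth labels.
```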
Figure 10. Sample frames from the VISILAB dataset illustrating failure cases: (a) false positive (FP); (b) false negative (FN); and (c) failure of MediaPipe Holistic to detect any body landmarks (MediaPipe Failure).
Table 1. Parameters of the VAE used to extract deep latent features from hand patches.
Network Layer | Network Structure | Input | Output
Encoder layer 1 | Fully Connected | 1024 | 512
Encoder layer 2 | Fully Connected | 512 | 256
Encoder layer 3 | Fully Connected | 256 | 128
Encoder layer 4 | Fully Connected | 128 | 64
Decoder layer 1 | Fully Connected | 64 | 128
Decoder layer 2 | Fully Connected | 128 | 256
Decoder layer 3 | Fully Connected | 256 | 512
Decoder layer 4 | Fully Connected | 512 | 1024
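A minimal PyTorch sketch of a fully connected VAE with the layer widths of Table 1 is given below; the class name, the split of the last encoder layer into mean and log-variance heads, the ReLU/sigmoid activations, and the MSE reconstruction term are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class HandPatchVAE(nn.Module):
    """Illustrative fully connected VAE; widths follow Table 1 (1024-512-256-128-64)."""
    def __init__(self, input_dim=1024, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        self.fc_mu = nn.Linear(128, latent_dim)      # mean of q(z|x)
        self.fc_logvar = nn.Linear(128, latent_dim)  # log-variance of q(z|x)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, input_dim), nn.Sigmoid(),  # flattened 32 x 32 patch in [0, 1]
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return self.decoder(z), mu, logvar

def vae_loss(x, x_rec, mu, logvar):
    # Reconstruction term plus KL divergence to the unit Gaussian prior.
    rec = nn.functional.mse_loss(x_rec, x, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kld
```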
Table 2. Performance evaluation of firearm detection accuracy (ACC) on the VISILAB and UCF-Crime benchmark firearm datasets.
Method | ACC (VISILAB) | ACC (UCF-Firearm) | Model Specifications
DeepGun | 85.2 | 84.1 | VAE with KL loss and reconstruction loss, NN encoder with 4 blocks, 1D kernel, latent space 64, 200 epochs, early stopping with patience 10, input patch size 32 × 32.
R-CNN [16] | 82.2 | 80.5 | VGG-16 pretrained on ImageNet, mini-batch gradient descent with momentum, 50 epochs, early stopping with patience 10, input size 224 × 224.
TSGD [18] | 51.9 | 67.16 | YOLOv11 for gun detection, VGG trained on ImageNet combined with a Transformer [60] to handle temporal features in videos, 200 epochs, early stopping, input frame size 640 × 640.
Video-FocalNets [20] | 83.33 | 75 | Spatio-temporal focal modulation, 4 blocks with 2D depthwise followed by 1D pointwise convolution, 120 epochs, batch size 512.
Movinets [19] | 83.44 | 79.16 | Vision transformer, 5 blocks with 3D-CNN kernel, 240 epochs, batch size 1024.
3D-CNNs [21] | 67.39 | 67.16 | Spatiotemporal 3D CNNs with 34 layers, 45 epochs, input size 128 × 171.
Table 3. Comprehensive analysis of DeepGun, TSGD, and VAE-SIMCA on the VISILAB firearm action recognition dataset.
Method | ACC | Precision (P) | Specificity
DeepGun | 85.2 | 95.3 | 96.4
VAE-SIMCA [43] | 62.0 | 93.8 | 43.0
TSGD [18] | 51.9 | 98.1 | 50.0
Table 4. Comprehensive quantitative analysis of firearm detection results obtained with different deep architectures as feature extractors and four state-of-the-art OCCs, implemented on the VISILAB benchmark firearm dataset.
Classifier | Metric | NasNetLarge | DenseNet201 | VGG19 | MobileNetV2 | ResNet50 | AE | DeepGun
IFC | F1 | 85.1 | 82.3 | 83.7 | 83.2 | 83.7 | 81.3 | 86.6
IFC | ACC | 83.4 | 80.9 | 82.2 | 81.9 | 82.6 | 79.3 | 85.2
IFC | P | 90.1 | 87.2 | 89.8 | 88.2 | 87.7 | 87.2 | 95.3
IFC | R | 75.0 | 72.4 | 72.6 | 73.6 | 75.8 | 68.6 | 74.0
IFC | AP | 80.1 | 76.9 | 78.9 | 78.1 | 78.6 | 75.5 | 83.5
SVM | F1 | 62.1 | 68.6 | 69.4 | 66.0 | 64.8 | 53.5 | 79.2
SVM | ACC | 70.6 | 74.3 | 75.1 | 72.7 | 72.5 | 66.3 | 81.1
SVM | P | 64.2 | 67.8 | 68.3 | 66.3 | 65.6 | 60.5 | 76.3
SVM | R | 93.0 | 92.2 | 93.6 | 92.2 | 94.2 | 94.8 | 90.2
SVM | AP | 63.2 | 66.4 | 67.1 | 65.0 | 64.7 | 59.8 | 73.7
MCDE | F1 | 61.0 | 61.0 | 61.1 | 61.2 | 62.3 | 61.0 | 77.7
MCDE | ACC | 51.6 | 51.6 | 51.7 | 51.8 | 54.5 | 51.7 | 79.1
MCDE | P | 51.0 | 51.0 | 51.1 | 51.2 | 53.1 | 51.1 | 83.4
MCDE | R | 76.0 | 75.8 | 76.0 | 76.2 | 75.4 | 75.8 | 72.8
MCDE | AP | 50.8 | 50.8 | 50.8 | 50.9 | 52.3 | 50.8 | 74.3
GMM | F1 | 41.0 | 49.0 | 57.2 | 53.1 | 44.7 | 68.3 | 56.8
GMM | ACC | 47.8 | 49.4 | 52.3 | 52.0 | 47.8 | 65.7 | 52.0
GMM | P | 47.1 | 49.3 | 51.8 | 51.9 | 47.2 | 63.4 | 51.6
GMM | R | 36.4 | 48.8 | 63.8 | 54.4 | 42.6 | 74.2 | 63.3
GMM | AP | 48.9 | 49.7 | 51.1 | 51.0 | 48.8 | 59.9 | 51.0
Table 5. Comprehensive quantitative analysis of firearm detection results obtained with different deep architectures as feature extractors and four state-of-the-art OCCs, implemented on frames with a visible handgun extracted from the UCF-Crime shooting video dataset.
Classifier | Metric | NasNetLarge | DenseNet201 | VGG19 | MobileNetV2 | ResNet50 | AE | DeepGun
IFC | F1 | 68.4 | 62.2 | 45.6 | 69.4 | 44.7 | 46.4 | 65.0
IFC | ACC | 85.6 | 83.1 | 78.8 | 85.8 | 78.4 | 77.3 | 84.1
IFC | P | 88.6 | 86.6 | 81.7 | 89.2 | 81.5 | 82.0 | 87.6
IFC | R | 92.6 | 91.8 | 92.3 | 92.9 | 92.3 | 89.5 | 92.0
IFC | AP | 87.6 | 85.7 | 81.2 | 88.1 | 81.4 | 81.3 | 86.6
SVM | F1 | 30.4 | 35.9 | 32.0 | 28.1 | 11.6 | 10.0 | 21.9
SVM | ACC | 75.0 | 75.9 | 75.0 | 74.4 | 71.3 | 65.4 | 73.1
SVM | P | 78.4 | 79.5 | 78.7 | 78.0 | 75.3 | 73.6 | 76.9
SVM | R | 92.2 | 91.7 | 91.7 | 92.0 | 92.0 | 84.4 | 92.0
SVM | AP | 78.2 | 79.1 | 78.4 | 77.8 | 75.5 | 73.9 | 76.8
MCDE | F1 | 84.3 | 83.3 | 83.3 | 83.5 | 83.2 | 83.3 | 83.3
MCDE | ACC | 74.4 | 72.0 | 72.0 | 72.2 | 71.9 | 72.1 | 72.0
MCDE | P | 77.3 | 75.6 | 75.7 | 75.5 | 75.5 | 75.7 | 75.7
MCDE | R | 92.6 | 92.7 | 92.7 | 93.4 | 92.2 | 75.8 | 92.7
MCDE | AP | 77.2 | 75.5 | 75.6 | 75.5 | 75.5 | 75.7 | 75.7
GMM | F1 | 24.1 | 37.3 | 18.4 | 33.4 | 37.9 | 38.0 | 18.3
GMM | ACC | 25.3 | 28.4 | 71.4 | 28.8 | 29.9 | 32.2 | 71.5
GMM | P | 16.0 | 23.8 | 30.9 | 21.6 | 24.2 | 24.5 | 31.0
GMM | R | 48.3 | 86.9 | 13.0 | 72.6 | 87.2 | 84.7 | 13.0
GMM | AP | 20.4 | 23.9 | 25.3 | 22.4 | 24.2 | 24.5 | 25.5
Table 6. Comprehensive quantitative analysis of firearm detection results obtained with different deep architectures as feature extractors and four state-of-the-art OCCs, implemented on frames with a visible handgun extracted from the YouTube dataset.
Classifier | Metric | NasNetLarge | DenseNet201 | VGG19 | MobileNetV2 | ResNet50 | AE | DeepGun
IFC | F1 | 76.7 | 78.0 | 78.8 | 80.1 | 80.5 | 66.6 | 81.8
IFC | ACC | 84.7 | 85.5 | 85.8 | 86.8 | 86.8 | 68.7 | 87.6
IFC | P | 88.2 | 89.3 | 90.2 | 90.4 | 91.4 | 95.7 | 92.1
IFC | R | 89.1 | 89.0 | 88.5 | 89.9 | 88.6 | 55.8 | 89.2
IFC | AP | 86.0 | 86.9 | 87.5 | 88.1 | 88.7 | 83.1 | 89.4
SVM | F1 | 75.0 | 74.7 | 76.7 | 75.5 | 78.5 | 49.4 | 84.7
SVM | ACC | 84.7 | 84.4 | 85.5 | 84.9 | 86.4 | 32.8 | 90.1
SVM | P | 86.0 | 86.1 | 87.3 | 86.4 | 88.2 | 10.0 | 92.0
SVM | R | 92.2 | 91.6 | 91.7 | 92.0 | 92.0 | 10.0 | 93.3
SVM | AP | 84.6 | 84.5 | 85.6 | 68.9 | 86.6 | 67.1 | 90.3
MCDE | F1 | 76.9 | 79.4 | 79.5 | 78.3 | 79.7 | 80.0 | 79.6
MCDE | ACC | 64.0 | 68.7 | 68.8 | 66.4 | 69.5 | 70.1 | 69.0
MCDE | P | 67.5 | 71.0 | 71.1 | 69.1 | 72.0 | 72.4 | 71.2
MCDE | R | 89.4 | 90.1 | 90.0 | 90.4 | 89.2 | 89.4 | 90.1
MCDE | AP | 67.5 | 70.6 | 70.7 | 68.9 | 71.5 | 71.9 | 70.8
GMM | F1 | 25.7 | 44.8 | 25.8 | 41.6 | 43.8 | 27.4 | 44.9
GMM | ACC | 61.7 | 33.3 | 66.5 | 33.3 | 34.1 | 65.4 | 33.4
GMM | P | 35.6 | 30.6 | 47.8 | 29.2 | 30.4 | 44.4 | 30.8
GMM | R | 20.2 | 82.3 | 13.0 | 72.1 | 78.0 | 84.7 | 82.4
GMM | AP | 33.4 | 31.1 | 35.5 | 30.2 | 30.9 | 35.1 | 31.2
Table 7. Five-fold cross-validation evaluation on VISILAB dataset.
Classifier | Metric | NasNetLarge | DenseNet201 | VGG19 | MobileNetV2 | ResNet50 | AE | DeepGun
IFC | F1 | 83 ± 0.00 | 82 ± 0.01 | 83 ± 0.01 | 82 ± 0.02 | 85 ± 0.01 | 83 ± 0.01 | 85 ± 0.01
IFC | ACC | 82 ± 0.00 | 80 ± 0.01 | 82 ± 0.01 | 80 ± 0.02 | 83 ± 0.01 | 81 ± 0.01 | 83 ± 0.01
IFC | P | 88 ± 0.01 | 86 ± 0.02 | 87 ± 0.01 | 85 ± 0.03 | 92 ± 0.02 | 89 ± 0.01 | 91 ± 0.01
IFC | R | 73 ± 0.01 | 73 ± 0.01 | 74 ± 0.01 | 74 ± 0.01 | 73 ± 0.00 | 71 ± 0.01 | 74 ± 0.01
IFC | AP | 78 ± 0.01 | 76 ± 0.01 | 78 ± 0.01 | 76 ± 0.02 | 81 ± 0.01 | 77 ± 0.01 | 80 ± 0.01
SVM | F1 | 68 ± 0.02 | 69 ± 0.01 | 70 ± 0.01 | 68 ± 0.01 | 71 ± 0.01 | 22 ± 0.05 | 69 ± 0.01
SVM | ACC | 74 ± 0.01 | 74 ± 0.01 | 75 ± 0.00 | 74 ± 0.01 | 76 ± 0.00 | 49 ± 0.03 | 74 ± 0.01
SVM | P | 68 ± 0.01 | 68 ± 0.01 | 69 ± 0.00 | 67 ± 0.01 | 70 ± 0.00 | 49 ± 0.02 | 68 ± 0.01
SVM | R | 92 ± 0.01 | 92 ± 0.00 | 92 ± 0.00 | 92 ± 0.01 | 92 ± 0.00 | 83 ± 0.04 | 92 ± 0.01
SVM | AP | 66 ± 0.01 | 67 ± 0.01 | 67 ± 0.00 | 66 ± 0.00 | 68 ± 0.00 | 50 ± 0.02 | 66 ± 0.01
MCDE | F1 | 59 ± 0.00 | 66 ± 0.01 | 66 ± 0.01 | 59 ± 0.01 | 65 ± 0.01 | 64 ± 0.00 | 67 ± 0.01
MCDE | ACC | 51 ± 0.01 | 62 ± 0.02 | 62 ± 0.02 | 47 ± 0.01 | 61 ± 0.02 | 59 ± 0.03 | 63 ± 0.03
MCDE | P | 50 ± 0.00 | 60 ± 0.02 | 60 ± 0.02 | 48 ± 0.01 | 59 ± 0.02 | 57 ± 0.02 | 61 ± 0.03
MCDE | R | 72 ± 0.01 | 74 ± 0.01 | 74 ± 0.01 | 74 ± 0.01 | 73 ± 0.01 | 73 ± 0.01 | 73 ± 0.01
MCDE | AP | 50 ± 0.00 | 57 ± 0.01 | 57 ± 0.01 | 49 ± 0.00 | 56 ± 0.01 | 55 ± 0.02 | 58 ± 0.02
GMM | F1 | 38 ± 0.23 | 44 ± 0.28 | 33 ± 0.28 | 46 ± 0.25 | 42 ± 0.28 | 65 ± 0.01 | 35 ± 0.27
GMM | ACC | 49 ± 0.02 | 50 ± 0.01 | 50 ± 0.01 | 50 ± 0.02 | 49 ± 0.00 | 51 ± 0.02 | 50 ± 0.01
GMM | P | 47 ± 0.04 | 49 ± 0.03 | 47 ± 0.04 | 48 ± 0.04 | 47 ± 0.03 | 51 ± 0.01 | 48 ± 0.03
GMM | R | 42 ± 0.39 | 58 ± 0.46 | 40 ± 0.44 | 48 ± 0.42 | 54 ± 0.43 | 89 ± 0.04 | 42 ± 0.45
GMM | AP | 50 ± 0.01 | 50 ± 0.00 | 50 ± 0.00 | 50 ± 0.01 | 49 ± 0.00 | 51 ± 0.01 | 50 ± 0.00
Table 8. Ablation study. For simplicity, we use the following acronyms to denote different versions of the proposed DeepGun approach: DeepGun-E2 (DeepGun with 2 layers in the encoder for feature extraction), DeepGun-CNN (DeepGun with a CNN architecture), and DeepGun-FPS (DeepGun with a fixed patch size).
Metric | DeepGun-CNN | DeepGun-E2 | DeepGun-FPS | DeepGun
F1 | 83.7 | 85.9 | 81.3 | 86.6
ACC | 82.6 | 84.3 | 79.3 | 85.2
P | 88.8 | 94.1 | 87.2 | 95.3
R | 74.0 | 73.6 | 68.6 | 74.0
AP | 78.7 | 83.0 | 75.5 | 83.5