1. Introduction
Automated detection of physical violence has become a significant technical challenge in multimedia content analysis, where surveillance systems and digital platforms process large amounts of images daily [1]. Manual processing of such content faces scalability issues and high operational costs, driving the development of automated methods based on deep learning [2].
Current violence detection systems struggle with data heterogeneity. Traditional methods treat all training samples identically, ignoring differences in classification difficulty that can hinder learning efficiency [3]. This homogeneous strategy leads to computational inefficiencies, particularly in large-scale settings with limited resources [4]. Convolutional neural networks (CNNs), specifically DenseNet121, have demonstrated strong performance in image classification tasks due to their dense connectivity architecture, which enhances feature reuse [5]. However, practical deployment of these models requires careful consideration of computational efficiency for real-time applications [6].
Recent research has examined various methods for violence detection, including multimodal fusion [7], attention mechanisms [8], and lightweight architectures [9]. Alejo et al. [10] used confidence-based active learning to categorize data into Safe, Border, and Average subsets, showing improvements in training efficiency. However, a gap remains in systematically applying confidence-based sample selection to optimize computational resources for violence detection more effectively.
Confidence-based active learning iteratively constructs models by identifying representative samples [11]. This approach uses model uncertainty to direct training data selection, lowering computational requirements without sacrificing performance [12]. However, its application to violence detection within images remains underexplored, especially regarding specialized sample selection based on predictive confidence levels [13].
This work proposes a confidence-based selection method to optimize the training of DenseNet121 for automated violence detection. The methodology partitions the dataset into three specialized subsets: Border, defined symmetrically around the decision boundary to capture maximally uncertain samples; Safe, representing high-confidence predictions at both extremes; and Average, encompassing moderately certain cases. This design applies a symmetric treatment of uncertainty at the decision boundary while acknowledging functional asymmetries in sample utility across the confidence spectrum. Each subset is tailored to distinct classification complexity characteristics, enabling a data-centric strategy that balances efficiency and performance.
The key contributions of this approach include (1) a confidence-based selection procedure for computational resource optimization in violence detection, (2) the empirical validation of stratified training strategies based on classification complexity, (3) demonstration that strategic data selection can outperform volume-driven approaches, and (4) a comparative analysis of precision–efficiency trade-offs across specialized subsets.
2. Related Work
Automated violence detection with deep learning is an interdisciplinary research area that combines computer vision, spatiotemporal processing, and multimodal fusion. Current methods mainly fall into three categories: (1) unimodal systems using visual features, (2) audiovisual architectures for detecting aggression, and (3) optimization techniques for edge computing. The literature shows a clear shift from unimodal models toward hybrid, efficient, and adaptable solutions.
Table 1 synthesizes key studies in automated violence classification, organizing them by author, architecture, methodology, dataset, and performance metrics. It evaluates prior techniques for addressing classification complexity, including architectural optimizations and data processing strategies [6,14], to contextualize our confidence-based methodology within the current scientific landscape. The review progresses from broad surveys such as [15] to specialized implementations [9], culminating in works that explicitly use confidence thresholds [11].
2.1. Evolution of Methodologies
The field of automated violence detection has experienced significant methodological progress, driven by the need to improve both accuracy and computational efficiency. Early methods primarily relied on unimodal visual analysis; however, challenges in managing ambiguous situations and real-world variability led to innovations in multimodal fusion, temporal modeling, and lightweight architectures. This subsection critically reviews key milestones in this development, from basic benchmarking studies to advanced hybrid systems, illustrating how architectural innovations and optimization techniques have addressed the core challenges of violence classification. By following this progression, we understand how confidence-based sample selection has emerged as a natural way to enable resource-efficient training without losing performance. In this sense, the following studies offer different approaches to address this problem:
Benchmarking Studies: [15] established a comparative framework for analyzing CNNs, RNNs, and transformers on benchmark datasets (Hockey Fights, RWF-2000, Violent Flows), reporting an accuracy of over 90% in optimized configurations. Their systematic review highlights trends toward attention mechanisms and lightweight architectures.
Multimodal Fusion: Ye et al. [7] integrated MFCC audio features with C3D visual processing using Dempster–Shafer theory, achieving 97% accuracy in school environments. This demonstrates how auditory signals complement visual data in violence scenarios with identifiable acoustic components.
Temporal Modeling: Mumtaz et al. [8] combined CNNs, Bi-LSTMs, and multiscale attention with statistical control charts for risk monitoring, attaining 89–91% accuracy across datasets. Their work shows that selective attention improves robustness in complex scenes.
Computational Optimization: Wang et al. [6] implemented EfficientNet with bidirectional motion attention and TSM modules, achieving perfect accuracy on Movie Fights and >90% on other datasets with only 1.21 GFLOPs. This validates the feasibility of efficient real-time models.
Edge Deployment: Khan et al. [14] designed an industrial surveillance pipeline combining person detection (CNN) with action classification (3D-CNN), reducing latency by processing only regions of interest. Their cascaded architecture outperforms baselines while optimizing resources.
Transformer Architectures: Rendón-Segador et al. [9] proposed CrimeNet, a transformer model with adaptive sliding windows that improved inter-dataset robustness by 15% while maintaining 99% AUC across 11 public datasets.
Confidence-Based Learning and Efficiency: Abundez et al. [11] pioneered confidence-based active learning, categorizing images into Safe, Border, and Average subsets to optimize DenseNet121 and EfficientNet training. Their method increased the AUC from 0.44 to 0.81–0.91 on the AIRTLab, RLVS, and SCVD datasets, demonstrating that curating ambiguous examples enhances generalization without requiring the expansion of training data.
2.2. Spatiotemporal and Edge-Capable Models
Recent advances in violence detection focus on models that balance temporal reasoning with computational efficiency, allowing deployment in resource-limited edge environments. This subsection examines architectures that combine spatiotemporal feature extraction (such as 3D CNNs and ConvLSTMs) with lightweight design principles, tackling the challenges of capturing dynamic violent behaviors while maintaining real-time performance. From hybrid networks with dense connectivity to compact systems optimized for embedded devices, these methods show how innovative algorithms can expand the practical use of violence detection systems beyond cloud-dependent setups.
ViolenceNet [1]: A 3D DenseNet121 with bidirectional ConvLSTM and multi-head attention achieved 95–100% intra-dataset and 70–81% cross-dataset accuracy, capturing long-term temporal dynamics.
Compact Architectures: Huillcen Baca et al. [17] combined DenseNet121/MobileNetV2 with BiLSTM, reaching 98.2–100% accuracy on Hockey Fight and Movie Fight with only 3.5M parameters.
Edge Deployment: Azzakhnini et al. [18] developed LAVID, an autonomous camera using DenseNet121 and DSCNN-BiLSTM (0.57M parameters), achieving 96.6% accuracy on Violent Flows, proving that edge-based advanced analysis is viable.
2.3. Trends and Gaps
Table 1 highlights DenseNet121’s dominance in violence detection due to its balance of accuracy and efficiency. It has been validated for real-time detection [6], active learning frameworks [11], and computational efficiency in systematic reviews [15]. As can be observed, the recent literature shows a shift from computationally heavy models to hybrid, efficient, and adaptable solutions. The rise of multimodal fusion and active learning signals a new era of surveillance systems with greater precision, efficiency, and cross-domain transferability.
Despite significant progress in violence detection models, several critical limitations persist. One of the most pressing issues is cross-dataset generalization. While many models report intra-dataset accuracies exceeding 95%, their performance drops markedly (often to between 60% and 75%) when evaluated on different datasets (e.g., transferring from Hockey Fights to RWF-2000). This discrepancy underscores the challenges of domain adaptation and the need for more robust generalization techniques.
Another unresolved issue involves dynamic thresholding. Most current approaches rely on static confidence thresholds, which fail to account for context-dependent variability. This can result in poor discrimination in ambiguous scenes. Developing adaptive thresholding mechanisms that respond to scene complexity could enhance decision-making under uncertainty. Additionally, in terms of multimodal limitations, while audio–visual fusion has achieved high classification performance (up to 97% accuracy in some cases), these systems typically underperform in scenarios where audio cues are absent, such as silent violent acts. Finally, evaluation consistency remains a concern. The use of disparate metrics (such as AUC, classification accuracy, and computational cost) across studies complicates direct comparisons between models.
This landscape positions our work at the intersection of confidence-based efficiency and temporal robustness, offering a pathway to address generalization gaps through dynamic sample selection.
3. Methods
The methodology employed in this study was structured into a rigorous five-phase protocol to ensure the reliability and reproducibility of results. First, a comprehensive data preparation and balancing phase was conducted, addressing class imbalance and ensuring representative sampling across the dataset. Next, the base model training phase involved the use of DenseNet121 as the foundational architecture, selected for its proven performance in visual classification tasks. In the third phase, the trained model was used for prediction generation and confidence analysis, capturing the predicted labels and the associated confidence scores. These outputs served as the basis for subsequent intelligent data partitioning. The fourth phase introduced intelligent segmentation of the dataset into three specialized subsets (Safe, Average, and Border). Finally, an optimized model training and evaluation phase was executed.
3.1. Datasets and Preprocessing
To conduct this study, we used four publicly available datasets, which are summarized in Table 2. It is essential to note that the datasets were not used directly; instead, we developed an experimental framework built on three carefully curated datasets, DS1, DS2, and an evaluation set, each designed to address specific evaluation needs while maintaining class balance.
DS1 served as our baseline dataset, comprising exclusively AIRTLab content with 6000 selected images (3000 violent and 3000 non-violent scenes). This homogeneous composition allowed for controlled initial model training and validation.
For enhanced diversity, DS2 combined material from all four source datasets (AIRTLab, RLVS, Pexels, and SCVD), totaling 5600 images (2800 per class). This composite dataset intentionally incorporates the varied visual characteristics of surveillance footage (SCVD), real-world YouTube content (RLVS), and controlled stock imagery (Pexels), creating a more challenging training environment.
The evaluation set (4400 images: 2200 violent/2200 non-violent) followed a stratified sampling approach across all sources, ensuring proportional representation of each dataset’s unique characteristics while maintaining complete mutual exclusivity with the training sets (no image appears in more than one of DS1, DS2, and the evaluation set). This design ensures that the evaluation set does not contain overlapping or visually similar images from DS1 or DS2 and includes diverse visual contexts from AIRTLab, RLVS, SCVD, and Pexels. Therefore, although this study does not test across datasets, the construction of the evaluation set effectively simulates a cross-domain evaluation scenario. This approach aligns with recommendations in the recent literature, which emphasize the importance of testing in unseen environments to assess the generalization capability of physical violence detection systems [15].
The final purpose of creating this data combination is to develop a broad spectrum of scene diversity. First, violent scenes include physical confrontations (hand-to-hand combat, pushing), organized urban conflicts (riots, vandalism), domestic violence scenarios, and weapon-involved aggression (from SCVD). Then, non-violent counterparts consist of peaceful social interactions (conversations, group activities), sports and recreational events, family gatherings, workplace activities, and ambiguous but non-aggressive behaviors (arguing without physical contact).
This intentional diversity ensures models learn discriminative features beyond superficial visual patterns, forcing them to recognize contextual violence indicators while avoiding overfitting to specific scene types. The inclusion of CCTV footage (SCVD) alongside consumer-grade video (RLVS) and studio-quality images (Pexels) further enhances real-world applicability across different capture conditions.
To ensure compatibility with the DenseNet121 architecture and improve model robustness, a structured preprocessing pipeline was implemented. First, all input frames were resized to 224 × 224 pixels to meet the input requirements of DenseNet121 and allow consistent processing across the dataset. Next, normalization was applied by scaling pixel values to the [0, 1] range, which promotes faster convergence during training and enhances numerical stability. During training, data augmentation techniques were used to increase variability and reduce overfitting. These included random rotations up to ±15°, horizontal scaling within ±10%, and brightness adjustments of up to ±20%. These transformations mimicked real-world variations in camera angles, distances, and lighting conditions, helping the model generalize better.
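To make the pipeline concrete, the following minimal sketch shows how such preprocessing and augmentation could be configured with the TensorFlow/Keras tools listed in Section 3.2. The directory layout and the exact generator arguments (e.g., using zoom to approximate horizontal scaling) are illustrative assumptions, not the authors' exact configuration.

import tensorflow as tf

# Illustrative preprocessing/augmentation setup; parameter values mirror the ranges stated above.
train_gen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1.0 / 255.0,          # scale pixel values to [0, 1]
    rotation_range=15,            # random rotations up to +/-15 degrees
    zoom_range=0.10,              # approximates horizontal scaling within +/-10%
    brightness_range=(0.8, 1.2),  # brightness adjustments of up to +/-20%
    validation_split=0.2,         # hold out 20% for validation (the paper uses an 80/20 split)
)

train_flow = train_gen.flow_from_directory(
    "data/DS1",                   # hypothetical path; one subfolder per class (violent / non-violent)
    target_size=(224, 224),       # DenseNet121 input resolution
    class_mode="categorical",
    batch_size=32,
    subset="training",
)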
3.2. DenseNet121 Architecture and Model Configuration
The DenseNet121 architecture was chosen as our base model because of its proven effectiveness in complex image classification tasks and its ability to capture subtle visual patterns through dense inter-layer connections (Figure 1). Unlike traditional CNNs, DenseNet121 uses an innovative approach where each layer directly receives inputs from all preceding layers, resulting in improved feature reuse (via concatenated feature maps), fewer parameters (20% fewer than ResNet variants), better gradient flow during backpropagation, and natural feature diversification through multi-scale feature aggregation.
The model implementation included several key customizations specific to the violence detection task. A transfer learning approach was used, starting with ImageNet pre-trained weights to utilize learned visual features. The original classification head of DenseNet121 was removed and replaced with a custom dense layer that has two output neurons, representing violence and non-violence classes. A softmax activation function was applied to generate normalized class probabilities.
The training used the Adam optimizer with a starting learning rate of 0.001, chosen for its adaptive gradient properties. To improve convergence, an automatic learning rate reduction was applied whenever validation performance stagnated. The loss function was categorical cross-entropy, suitable for the two-class probabilistic output. Additionally, a dropout layer with a probability of 0.3 was included before the final classification layer to reduce overfitting and improve generalization. The complete layer-by-layer architecture, including all dimensional transformations introduced during customization, is presented in Table 3.
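A minimal sketch of this customization in Keras is given below; the function name and the global-average pooling choice are illustrative assumptions, and Table 3 remains the authoritative description of the architecture.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_violence_classifier():
    # DenseNet121 backbone initialized with ImageNet weights; the original classification head is removed.
    base = tf.keras.applications.DenseNet121(
        include_top=False, weights="imagenet",
        input_shape=(224, 224, 3), pooling="avg",
    )
    x = layers.Dropout(0.3)(base.output)                  # dropout before the final classifier
    outputs = layers.Dense(2, activation="softmax")(x)    # violence / non-violence probabilities
    model = models.Model(inputs=base.input, outputs=outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model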
All experiments were conducted on a workstation equipped with an NVIDIA RTX 4060 GPU (8 GB VRAM) and 48 GB of system RAM, running Windows 11 Pro. This hardware configuration provided sufficient computational resources for training multiple deep learning models efficiently.
The software stack was built around Python 3.10.16, with core deep learning functionality provided by TensorFlow 2.10.0 and Keras 2.10.0. GPU acceleration was enabled through CUDA 11.2 and cuDNN 8.1, ensuring compatibility and optimized performance. For numerical operations, NumPy 1.26.4 and SciPy 1.15.1 were utilized, while TensorBoard 2.10.1 and Matplotlib 3.10.0 facilitated real-time visualization and analysis of results.
This configuration enabled efficient training and evaluation of multiple model variants, with the GPU’s tensor cores delivering up to 4× speedup over CPU-only execution for batch sizes of up to 64 images. Moreover, the standardized software environment contributed to the reproducibility and consistency of results across experimental runs.
3.3. Model Performance Metrics
To evaluate the effectiveness of our proposal, we employed both confusion matrix analysis (Table 4) and standard classification metrics, including precision, recall, F1-Score, and g-mean. These metrics provide complementary perspectives on model performance across both violent and non-violent classes.
From the confusion matrix (Table 4), four key performance metrics were calculated. Precision (Equation (1)) indicates how accurately violence is identified by measuring the proportion of correctly predicted violent cases out of all cases predicted as violent. Recall (Equation (2)) assesses the model’s ability to detect violent events, reflecting coverage within the positive class. The F1-Score (Equation (3)) offers a balanced evaluation by combining precision and recall into a single harmonic mean; this is especially useful in imbalanced classification situations. Lastly, the geometric mean (g-mean, Equation (4)) evaluates the model’s overall performance by emphasizing balanced behavior across both classes, ensuring that neither class is disproportionately favored.
These metrics collectively address the key requirements of violence detection systems: high true positive rates (minimizing missed violence) while maintaining low false positives (avoiding unnecessary alerts). The g-mean proves especially relevant for security applications where both over-reporting and under-reporting carry significant consequences.
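As a compact reference, the sketch below computes these four metrics directly from confusion-matrix counts, assuming the standard definitions (with g-mean taken as the square root of sensitivity times specificity); Equations (1)–(4) give the formal statements.

def classification_metrics(tp, fp, tn, fn):
    # Standard binary-classification metrics from confusion-matrix counts.
    precision = tp / (tp + fp)                            # proportion of predicted violence that is correct
    recall = tp / (tp + fn)                               # sensitivity: proportion of violence detected
    f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of precision and recall
    specificity = tn / (tn + fp)                          # true negative rate
    g_mean = (recall * specificity) ** 0.5                # assumed g-mean: sqrt(sensitivity * specificity)
    return {"precision": precision, "recall": recall, "f1": f1, "g_mean": g_mean}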
3.4. Approach for Sample Selection: Safe, Average, and Border Subsets
This section details our methodology for selecting relevant samples based on the prediction confidence scores produced by the neural network. Inspired by active learning principles, we adapt these concepts to deep learning in violence detection, introducing three specialized data subsets: Safe, Average, and Border.
The core idea is to identify specialized subsets of the training data based on prediction confidence levels, reflecting different degrees of model certainty. Unlike a strict partition, these subsets are defined by overlapping confidence intervals to independently assess the utility of samples in distinct reliability regimes.
A threshold, θ, is applied to the predicted class probabilities to create the specialized datasets. The Safe subset includes samples with high-confidence predictions, i.e., those whose predicted probability lies near either extreme of the probability range, far from the decision boundary. These represent clear-cut cases where the model is confident in its prediction, corresponding to easily classifiable examples. Note that the probabilities are always bounded within [0, 1].
On the other hand, the Border subset captures samples with maximum uncertainty, i.e., those whose predicted probability falls within a narrow band centered on the decision boundary. These lie near the decision boundary, where the model assigns nearly equal probabilities to both classes, making them the most ambiguous and potentially informative for learning.
Finally, the Average subset comprises samples of intermediate confidence. This group includes moderately certain predictions, excluding only the most extreme high-confidence cases.
Note that, due to the choice of threshold values, these subsets may overlap; for instance, a sample near the shared boundary may belong to both Safe and Border. However, each subset is used independently for retraining, so the overlap does not affect the evaluation.
The threshold θ was determined empirically by analyzing the distribution of confidence scores from the initial model trained on DS1 (6000 balanced images). After experimentation, we selected a value that yielded meaningful and well-populated subsets while maintaining class balance. This choice aligns with prior work in confidence-based sampling [10,11], where thresholds in this range have been shown to effectively capture high- and low-confidence predictions in binary classification tasks, particularly the selection of 0.4 and 0.6 as boundaries for the Border subset.
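The sketch below illustrates this segmentation step on per-sample violence probabilities. The threshold values shown are placeholders (only the 0.4/0.6 Border boundaries are mentioned above, and the remaining cut-offs were set empirically), so they should not be read as the exact values used in the experiments.

import numpy as np

def partition_by_confidence(p_violence, theta=0.9, border=(0.4, 0.6)):
    # p_violence: predicted probability of the violent class for each sample, in [0, 1].
    # theta and border are illustrative placeholders, not the paper's exact values.
    p = np.asarray(p_violence)
    safe = np.where((p >= theta) | (p <= 1.0 - theta))[0]          # high confidence at both extremes
    border_idx = np.where((p >= border[0]) & (p <= border[1]))[0]  # near the decision boundary
    average = np.where((p > 1.0 - theta) & (p < theta))[0]         # moderately certain; excludes extremes
    # Note: with these placeholder values the Border band lies inside the Average band,
    # consistent with the remark above that the subsets may overlap.
    return safe, border_idx, average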
The methodology unfolds through three systematic phases, visualized in Figure 2. The initial phase establishes the confidence threshold θ through base model evaluation on DS1, our curated dataset of 6000 balanced images. This calibration step ensures that the threshold values adapt to the specific characteristics of violence detection in visual data.
Phase two applies these thresholds to DS2 (5600 images) for subset identification, generating the Safe, Border, and Average partitions. All subsets maintain class balance while focusing on distinct confidence regions, and each is drawn entirely from DS2.
In the third phase, the initial model is retrained using only the subsets obtained in the previous phase, along with DS2 (the entire dataset), to compare the classifier’s effectiveness when trained on partial data (i.e., the Safe, Border, and Average subsets) versus when trained on the full dataset (DS2). Subsequently, the evaluation set is used to assess the impact of utilizing Safe, Border, and Average samples on the overall training process of the neural model. It is important to note that the datasets used (DS1, DS2, and the evaluation set) are mutually exclusive, i.e., no image appears in more than one of them. The complete procedure of the proposed method is detailed in Algorithm 1.
Algorithm 1: Confidence Segmentation for Violence Detection Optimization
Input: Base model (DenseNet121), initial dataset DS1 (6000 images), segmentation dataset DS2 (5600 images), evaluation set (4400 images), threshold θ
Output: Specialized models and comparative analysis
1: // Phase 0: Experimental reproducibility — repeat all phases for each random seed
2: // Phase 1: Base model training with balanced dataset
3: Train the base model M0 on DS1 = {3000 V / 3000 NV} with an 80/20 stratified split
4: Augment: rotation, flipping, brightness; dropout = 0.3; early stopping (patience of 10 epochs)
5: // Phase 2: Confidence analysis on DS2
6: P ← M0.predict(DS2) // probabilistic predictions in [0, 1]
7: C ← max(P, axis = 1) // maximum confidence per sample
8: Calculate mean(C) and std(C)
9: // Phase 3: Segmentation, specialization, and evaluation
10: for each subset type ∈ {Safe, Border, Average} do
11: assign each sample of DS2 to the subset whose confidence interval contains its score
12: organize each subset by class
13: end for
14: for each subset S do
15: M_S ← clone(M0)
16: train M_S on S with the same augmentations, dropout = 0.3, and early stopping
17: evaluate M_S on the evaluation set = {2200 V / 2200 NV}
18: end for
19: for each trained model do
20: compute precision, recall, F1-Score, and g-mean
21: analyze the confusion matrix
22: end for
23: return the trained models and a comparative report
The training protocol was designed to optimize the effectiveness of each specialized subset and to ensure a fair and consistent comparison of experimental results. For each subset (Safe, Border, and Average), as well as for the entire dataset DS2, a dedicated DenseNet121 model was trained. All models used the same base architecture but were specifically fine-tuned to match the unique confidence characteristics of each segment.
The standardized protocol included several components: initializing with pre-trained weights from the base model to leverage prior knowledge, stratified cross-validation with an 80% training and 20% validation split, and early stopping based on validation loss to prevent overfitting. Consistent data augmentation techniques were applied: random rotations, horizontal scaling, and brightness adjustments, along with dropout regularization at a rate of 0.3. Model training was continuously monitored using key performance metrics, including precision, recall, F1-Score, and g-mean.
To ensure experimental reliability, each setup was run six times independently, using different random seeds for weight initialization and data splitting. This method allowed for evaluating the performance and the consistency of each training approach.
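A condensed sketch of this protocol is shown below, reusing the hypothetical helpers from the earlier sketches (build_violence_classifier, train_flow, val_flow). The ReduceLROnPlateau settings are assumptions, since only the early-stopping patience of ten epochs and the dropout rate are stated explicitly; the epoch cap is likewise illustrative.

import tensorflow as tf

callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True),   # early stopping on validation loss
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss",
                                         factor=0.5, patience=3),  # assumed learning-rate reduction settings
]

for seed in range(6):                        # six independent runs with different random seeds
    tf.keras.utils.set_random_seed(seed)     # controls weight initialization and data shuffling
    model = build_violence_classifier()      # fresh DenseNet121 per run (see the earlier sketch)
    model.fit(train_flow, validation_data=val_flow,
              epochs=100, callbacks=callbacks)  # epoch cap is illustrative; early stopping governs length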
4. Results
This section presents experimental findings on confidence-threshold-based sample selection (Safe, Border, and Average subsets) for physical violence detection using DenseNet121, a deep convolutional network with dense connections known for high accuracy in image classification. The results demonstrate how strategic sample selection impacts two key criteria: classifier performance metrics and training set size reduction.
Also, to evaluate the generalizability of our confidence-based subset selection strategy, we conducted experiments on other architectures: MobileNetV2, a lightweight and efficient CNN designed for mobile and edge devices, prioritizing speed and low memory usage; and Vision Transformer (ViT), a Transformer-based architecture that models global image structure through self-attention, representing a different inductive bias from CNNs.
For each model, we applied the same confidence thresholds to extract the Safe, Border, and Average subsets from DS2. Each subset was used independently to retrain the model. This allows us to assess how the impact of sample selection behaves across models with different capacities, inductive biases, and computational demands.
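Architecturally, the only change required for this cross-model evaluation is swapping the backbone behind the same two-class head; a hedged sketch is shown below for the two CNNs bundled with tf.keras.applications. A ViT backbone is not included in TensorFlow 2.10 and would need an external implementation, so it is omitted here, and the function name is an assumption.

import tensorflow as tf
from tensorflow.keras import layers, models

BACKBONES = {
    "densenet121": tf.keras.applications.DenseNet121,
    "mobilenetv2": tf.keras.applications.MobileNetV2,
}

def build_classifier(backbone="densenet121"):
    # Same two-class head reused across backbones; ViT is handled separately.
    base = BACKBONES[backbone](include_top=False, weights="imagenet",
                               input_shape=(224, 224, 3), pooling="avg")
    x = layers.Dropout(0.3)(base.output)
    outputs = layers.Dense(2, activation="softmax")(x)
    return models.Model(base.input, outputs)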
Table 5 compares the effectiveness of each subset against the full-dataset baseline (DS2) using four performance metrics: precision, recall, F1-Score, and g-mean. The number of training samples and epochs is also reported for each configuration. All values represent the mean over six independent training runs, with the standard deviation included in parentheses to reflect result consistency. As shown, the standard deviation across runs is consistently low, particularly for g-mean, which indicates high reproducibility. This confirms that the observed trends are stable and not due to random initialization effects.
4.1. Performance Evaluation on DenseNet121
We begin our evaluation using DenseNet121 as the reference architecture to assess the impact of confidence-based subset selection on model performance and training efficiency. This CNN-based model was selected due to its proven effectiveness in image classification tasks and its widespread use in prior work on violence detection. The Safe, Border, and Average subsets were independently used to retrain the model, allowing for a controlled analysis of how sample relevance, as defined by prediction confidence, affects learning dynamics.
The full dataset (DS2) achieved near-perfect classification (all metrics at approximately 0.99) with 5600 images and 20-epoch convergence, confirming DenseNet121’s inherent capability for violence detection. This establishes the upper bound for subset comparisons.
The Safe subset exhibited high precision (0.97) but significantly reduced recall (0.73), yielding moderate F1-Score (0.84) and g-mean (0.89) values. While sample reduction was modest (12% smaller than DS2), its 34-epoch convergence suggests efficient learning for unambiguous cases. The precision–recall trade-off indicates a bias toward minimizing false alarms at the cost of missed detections.
The Border subset showed high computational efficiency, achieving 97.2% sample reduction, but had lower overall performance, with an F1-Score of 0.82. Its balanced precision (0.79) and recall (0.83) indicate consistent difficulty in classifying ambiguous samples near the decision boundary. Although it needed 35 epochs, the small data volume makes this approach suitable for resource-limited situations.
Finally, the Average subset achieved the best balance among the subsets, with strong metrics (F1-Score = 0.88, g-mean = 0.93) and an 80% data reduction. The 44-epoch training duration, though longer than other subsets, remains practical given the significantly reduced computational load. This subset’s performance indicates that moderately challenging samples provide optimal information density for efficient model training.
To provide more detail on the preceding discussion of results, the confusion matrices for each scenario are shown in Figure 3. This figure offers a comparative analysis of the classifier’s performance across the various configurations and highlights the distinct classification behavior patterns that support the findings for each subset.
Starting with the Border subset, a conservative strategy is observed, producing 1568 true negatives and 1833 true positives, along with 632 false positives and 367 false negatives. This error pattern reflects a recall-oriented profile, where the rise in false positives is offset by a significant decrease in false negatives, an expected trade-off in confidence-based segmentation.
In contrast, the Safe subset displays highly specific behavior, with 2168 true negatives and 1600 true positives. The subset has a notably low false positive count (32), but at the expense of a significant increase in false negatives (600). This shift indicates a strong bias toward minimizing false alarms, although it reduces sensitivity in detecting violent events (the positive class).
The Average subset shows the most balanced performance, with 1815 true negatives and 1858 true positives and errors spread proportionally (385 false positives and 342 false negatives). This symmetry in classification errors suggests that the subset is suitable for general-purpose use, especially where a trade-off between precision and recall is required, such as in real-time physical violence detection systems.
The complete dataset (DS2) acts as the benchmark setup, producing nearly optimal results with 2172 true negatives and 2169 true positives, along with minimal error rates (31 false positives and 28 false negatives). These findings verify the technical strength of the DenseNet121 architecture and confirm the reliability of the evaluation set.
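As a quick consistency check, the matrix attributed above to the Safe subset (1600 TP, 32 FP, 2168 TN, 600 FN) can be substituted into Equations (1) and (2); the result is consistent with the averaged values reported in Section 4.1 (precision 0.97, recall 0.73):

\[
\text{precision} = \frac{1600}{1600 + 32} \approx 0.98, \qquad
\text{recall} = \frac{1600}{1600 + 600} \approx 0.73 .
\]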
Overall, the observed patterns confirm that the Average subset provides the best balance between false positives and false negatives, making it a strong option for practical use, especially given its significant reduction in dataset size. Meanwhile, although the Safe subset performs well in terms of precision for the positive class, it sacrifices sensitivity and offers limited data reduction. Conversely, the Border subset maximizes computational efficiency but sacrifices overall classification performance. These distinctions confirm that confidence-based sample selection is a promising approach for enhancing neural model training by aligning dataset features with task-specific goals.
In addition to performance, we evaluated the computational efficiency of each subset. As shown in Table 5, training on smaller subsets significantly reduces the total time despite requiring more epochs. For instance, the Average subset achieves an F1-Score of 0.89 in just 7.71 min, compared to 9.76 min for the full dataset (DS2), representing a roughly 20% reduction in training time. Even more striking, the Border subset completes training in only 1.74 min (an 82% reduction), making it highly suitable for rapid prototyping or resource-constrained environments.
4.2. Cross-Architecture Evaluation
To assess the generalizability of our confidence-based subset selection strategy, we extended the evaluation to multiple deep learning architectures with different inductive biases and computational profiles. This section presents the performance and efficiency results for MobileNetV2 and ViT, enabling a comparative analysis with the reference DenseNet121 model.
MobileNetV2 demonstrated exceptional computational efficiency while maintaining competitive performance. As shown in Table 5, the model achieves an F1-Score of 0.77 on the Border subset (104 samples, 2% of DS2) in just 0.08 min of total training time, the fastest among all models. It also reaches an F1-Score of 0.79 on the Average subset (3490 samples) in 3 min, despite requiring 36 epochs. Finally, it delivers a strong baseline performance (F1 = 0.98) on the full dataset (DS2), confirming its suitability for this task.
The extremely low per-epoch time (0.004 min for Border, 0.08 min for Average) confirms MobileNetV2’s suitability for edge deployment and real-time applications. Notably, the Safe subset underperforms relative to DenseNet121 (F1 = 0.81), suggesting that high-confidence samples may not transfer as effectively to this lightweight architecture, which is an interesting direction for future analysis.
As shown in Table 5, ViT achieves moderate performance on the full dataset (F1 = 0.88) but lags behind DenseNet121 (F1 = 0.99). This gap is consistent with the well-documented sample inefficiency of Transformers in low-data regimes [23]. Vision Transformers rely heavily on large-scale pretraining and global attention mechanisms, which struggle to generalize when training data is limited, a common constraint in real-world violence detection scenarios.
Nevertheless, the subset-based models demonstrate meaningful gains. The Average subset achieves F1 = 0.74 with only 47% of the training data, and the Border subset reaches F1 = 0.73 using just 13% of samples. Importantly, training time is reduced by up to 97% (e.g., Border: 0.66 min vs. 24.5 min), highlighting that even for architectures with lower absolute performance, confidence-based selection significantly improves efficiency.
This suggests a key strength of our methodology: it is not dependent on achieving peak model accuracy but on enabling faster, data-efficient training across diverse architectural families, including those less suited to the task.
In this sense, the previous findings confirm that our confidence-based selection method is not limited to CNNs but can be applied to diverse architectures to improve training efficiency. While absolute performance varies by model, the relative gains from subset selection remain consistent, supporting the robustness of our approach. We note that the ViT and MobileNetV2 results serve only as a cross-architecture evaluation; additional work is needed to ensure a fully fair comparison, since our primary goal is to evaluate the well-established DenseNet121 architecture.
Training efficiency was also assessed for MobileNetV2 and ViT. Concerning MobileNetV2, despite requiring more epochs to converge, the use of specialized subsets results in exceptional time savings due to its extremely low per-epoch cost. The Border subset reduces the total training time from 11.30 min (full dataset) to just 0.08 min, which is a 99.3% reduction, while achieving an F1-Score of 0.77. Similarly, the Average subset completes training in only 3.03 min (73% faster than full training), maintaining an F1-Score of 0.79. These results demonstrate that confidence-based selection unlocks fast training cycles, making MobileNetV2 a highly suitable candidate for real-time, edge-based, or resource-constrained applications.
On the other hand, despite the higher computational cost per epoch of ViT, the use of specialized subsets leads to dramatic time savings for this architecture. The Border subset reduces the total training time from 25.5 min (full dataset) to only 0.66 min (97% reduction) while maintaining an F1-Score of 0.73. Similarly, the Average subset trains in 18.2 min (29% faster than full training), demonstrating that confidence-based selection is effective even for architectures with higher baseline costs. This further reinforces the practical value of our approach: it maximizes efficiency gains across diverse deployment scenarios.
It can be observed from the cross-model comparison that while absolute performance and computational demands vary significantly across architectures, the relative benefits of confidence-based subset selection remain consistent. MobileNetV2 achieves the fastest training times (as low as 0.08 min) for the Border subset, making it ideal for real-time or edge-based deployment. DenseNet121 delivers the highest accuracy, particularly on the full dataset, at the cost of higher training time. ViT, though computationally heavier and less accurate in this low-data regime, still benefits substantially from subset-based training, reducing the total time by up to 97%.
This spectrum of behaviors confirms the robustness and adaptability of our approach. Irrespective of whether the primary objective is computational efficiency, predictive accuracy, or architectural preference, confidence-based subset selection consistently enhances training efficiency. Thus, the proposed method can be readily applied across a wide range of model architectures, enabling researchers and practitioners to address specific operational requirements without compromising the benefits derived from strategic data utilization.
Finally, the results in Table 5 reveal an interesting interplay between structural symmetry and functional asymmetry in sample utility. The Border subset, although symmetrically defined around the decision boundary, exhibits asymmetric practical utility: it enables ultra-fast training (e.g., 0.08 min with MobileNetV2) but yields only moderate accuracy. Conversely, the Average subset, despite being defined on an asymmetric confidence band, achieves the most balanced trade-off between performance and efficiency across all architectures.
4.3. From Active Learning to Training Efficiency: Evolution of Confidence-Based Selection
Our approach is inspired by the confidence-based sample selection strategies proposed in active learning, particularly the work of Alejo et al. [10]. We also build on the ideas of Abundez et al. [11], which used a threshold-based criterion during training. However, we extend these frameworks in both scope and application to better suit deep learning in visual recognition tasks.
Abundez et al. [11] introduced a threshold-based method to identify ambiguous samples using a single threshold parameter, primarily aimed at reducing labeling effort through active querying. In contrast, Alejo et al. [10] proposed a three-way partitioning into Safe, Border, and Average subsets, but applied it to tabular data using traditional machine learning models such as Multi-Layer Perceptrons (with one hidden layer). Their method further relied on Gaussian kernel functions to smooth and analyze prediction differences across model ensembles, increasing computational complexity.
Our work differs in several key aspects. First, we operate in the domain of image classification, specifically physical violence detection, where raw pixel data demands deep feature extraction. We leverage the DenseNet121 architecture to generate reliable confidence scores directly from the softmax output, eliminating the need for ensemble-based uncertainty estimation or auxiliary smoothing functions. This results in a simpler, faster, and more scalable segmentation process.
Second, our objective diverges from active learning: rather than selecting samples for labeling, we investigate how specialized subsets defined by confidence thresholds impact model performance and training efficiency when used in isolation. In particular, we show that the Average subset achieves 89% of the F1-Score of the full dataset using only 20% of the training samples, highlighting its potential for efficient model training.
Thus, by adapting and simplifying prior confidence-based frameworks for modern deep learning pipelines, our approach enables a practical and effective strategy for data-centric model optimization.
Notably, the cross-architecture evaluation demonstrates that the benefits of our method extend beyond a single model. Despite significant differences in architecture, inductive bias, and computational demands from lightweight models like MobileNetV2 to attention-based ViT, the relative gains in training efficiency and data utilization are consistently observed. This confirms the generalization capability of our confidence-based selection strategy, positioning it as a flexible and scalable tool for improving deep learning pipelines.
5. Conclusions
This work presents an empirical study focused on assessing the effectiveness of selecting representative samples to enhance the training process of the DenseNet121 neural model for classifying physical violence in images. The approach relies on the model’s output confidence scores to divide the data into three subsets: Safe (high confidence), Average (moderate confidence), and Border (low confidence), which indicate different levels of prediction certainty. The Border subset is defined symmetrically around the decision boundary, reflecting a balanced treatment of uncertainty. In contrast, the Safe regions exhibit asymmetries in sample distribution and impact, as high-confidence predictions for violence and non-violence may not contribute equally to model efficiency.
Experimental results, obtained using the public datasets AIRTLab, RLVS, Pexels, and SCVD, demonstrate that confidence-based sample selection has great potential for reducing dataset size and improving computational efficiency. However, this efficiency gain often comes at the cost of decreased classification performance, especially in the Border subset. In contrast, the Safe subset does not provide a significant reduction in dataset size, and the classifier’s performance is negatively impacted. Notably, the Average subset achieves a substantial reduction in dataset size while maintaining competitive classification performance, making it the most promising configuration and generating a functional asymmetry between data volume and performance.
These findings support the hypothesis that training on moderate-confidence (Average) samples can result in more efficient learning without sacrificing classification quality. Therefore, the proposed approach provides a practical strategy for enhancing the training process of DenseNet121 in binary violence classification tasks.
Additionally, the cross-architecture evaluation confirms that the benefits of confidence-based sample selection extend beyond DenseNet121. Despite differences in model design and computational demands, architectures such as MobileNetV2 and ViT exhibit consistent efficiency gains when trained on specialized subsets, particularly Average and Border. This demonstrates the generalization capability of our approach and supports its use as a flexible, architecture-agnostic strategy for efficient deep learning in other environments.
Building on the findings of this study, several promising research directions emerge. One main path involves optimizing confidence thresholds using adaptive methods, such as meta-learning or reinforcement learning, to go beyond fixed empirical values and enable flexible, data-driven segmentation strategies. Another key direction is expanding the proposed methodology to other neural architectures, including transformer-based models and multimodal systems that combine visual and auditory information. This would allow for a more thorough assessment of the framework’s applicability and robustness across different learning paradigms. Additionally, emphasis should be on real-world validation, especially through deployment in edge-computing environments for surveillance purposes. These scenarios provide valuable insights into the trade-offs among latency, computational costs, and detection accuracy under operational limits.
This work provides both a methodological foundation for confidence-based sample selection and empirical evidence of its effectiveness in violence detection tasks, particularly when utilizing the Average confidence regime, which strikes a balance between dataset reduction and classification performance. Thus, it demonstrates how controlled asymmetries in data selection can enhance learning efficiency.