DARTS Meets Ants: A Hybrid Search Strategy for Optimizing KAN-Based 3D CNNs for Violence Recognition in Video

Buribayev, Zholdas; Zhassuzak, Mukhtar; Aouani, Maria; Zhangabay, Zhansaya; Abdirazak, Zemfira; Yerkos, Ainur

doi:10.3390/app152011035


by Zholdas Buribayev ¹, Mukhtar Zhassuzak ¹,*, Maria Aouani ², Zhansaya Zhangabay ², Zemfira Abdirazak ¹ and Ainur Yerkos ¹

¹ Institute of Information and Computational Technologies, Almaty 050010, Kazakhstan

² Department of Computer Science, Al-Farabi Kazakh National University, Almaty 050040, Kazakhstan

* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(20), 11035; https://doi.org/10.3390/app152011035

Submission received: 17 August 2025 / Revised: 10 October 2025 / Accepted: 13 October 2025 / Published: 14 October 2025

(This article belongs to the Section Computing and Artificial Intelligence)


Abstract

The optimization capabilities of Kolmogorov–Arnold Networks (KANs) remain largely unexplored, which has limited their practical use in video anomaly recognition compared to conventional 3D-CNNs. To address this gap, we introduce a novel hybrid optimization framework that combines a MAX–MIN Ant System (MMAS) for hyperparameter selection with a modified DARTS strategy for adaptive tuning of the 3D KAN architecture. Unlike existing approaches, our method simultaneously optimizes both learning dynamics and architectural configurations, enabling KANs to better exploit their expressive power in spatiotemporal feature extraction. Applied to a three-class video dataset, the proposed approach improved model accuracy to 87%, surpassing the performance of a standard 3D-CNN by 6 percentage points.

Keywords:

3D CNN; Kolmogorov–Arnold Network; neural architecture search; ant colony optimization; violence detection; spatiotemporal features

1. Introduction

In recent years, there has been a significant increase in per capita crime rates and a corresponding increase in the amount of video footage capturing such incidents. According to the United Nations Office on Drugs and Crime (https://dataunodc.un.org/dp-crime-violent-offences (accessed on 11 July 2025)), the number of kidnappings and serious assaults increased significantly in many countries around the world between 2020 and 2023. These alarming trends point to a global problem of escalating crime, which requires the cooperative and rapid implementation of new technologies to detect and prevent such incidents.

Three-dimensional CNN models have proven to be a powerful tool for analyzing time-based video streams [1,2,3]. However, due to their low interpretability and rigid convolutional filters, they may not be very effective in violence recognition, where subtle motion cues and contextual information are important [4].

Recently, the alternative architecture of Kolmogorov–Arnold Network [5] has started to gain popularity and is considered as potentially more interpretable and flexible than conventional convolutions. Unlike traditional convolutions, which use fixed, local receptive fields and linear weight filters, KAN layers represent functions through B-spline compositions, enabling a more expressive and continuous mapping from inputs to outputs. This structure allows for local adaptivity to the data and greater flexibility in capturing complex, non-linear spatiotemporal dependencies, especially in ambiguous scenes. Additionally, the use of explicit basis functions improves interpretability by making it easier to trace how specific input features influence the final decision. Despite the promising nature of this approach, the question of effective methods for selecting architectural and training parameters remains open [6]. This issue is especially pressing when working with imbalanced datasets.

Evolutionary optimization methods have gained wide recognition in the scientific community, especially for tuning hyperparameters of deep learning models [7]. For this reason, three evolutionary algorithms, together with more complex variants that borrow optimization techniques from related algorithms, were selected for hyperparameter fitting: the Genetic Algorithm, Differential Evolution, and the MAX–MIN Ant System. In addition, a technique inspired by differentiable architecture search [8] is applied to select the architectural parameters of the KAN layers.

The rest of the article is structured as follows. Section 2 reviews related methods and prior work. Section 3 describes the dataset in detail and presents the proposed model architecture and methodology. Section 4 describes how the experiments were evaluated. Section 5 presents the obtained results, and Section 6 analyzes them, compares them with existing methods, and discusses potential areas for future research.

2. Related Work

2.1. Evolutionary Hyperparameter Tuning

Detecting abnormal human activity in videos and photos has been the subject of many research papers, as it is an actively developing area of computer vision. Various studies describe methods such as Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) networks, Transformers, 3D CNNs, and multimodal methods [9,10,11]. For example, one such work tests the impact of an LSTM used together with a 3D CNN in a combined approach for binary classification. The accuracy of the combined method was 87.60%, which is slightly better than that of the standard 3D CNN model (86.11%) [11]. Despite the frequent mention of 3D CNNs in video analysis tasks, their potential and the opportunities to improve their accuracy have not been fully explored. In this paper, methods for optimizing video processing with respect to the time parameter are investigated.

2.2. Differentiable Architecture Search

Evolutionary optimization methods have gained wide acceptance in the scientific community, especially for tuning hyperparameters of deep learning models. Differential evolution has shown high performance compared to other hyperparameter tuning methods. For example, in one study [12], it outperformed the Bayesian optimization method SMAC in 19–37% of hyperparameter tuning problems on small datasets. In another study, this algorithm was used to tune Transformer architectures in load forecasting tasks, demonstrating stable and accurate tuning [13]. It has also been used in applied tasks such as human activity recognition, raising accuracy on challenging datasets to 85.94–96.5% [14]. Genetic Algorithm methods are often used in model tuning, especially for discrete and complex hyperparameter spaces. Their successful application in various problems has been reported in the scientific literature, although they are often inferior to differential evolution in convergence and accuracy [7]. Ant algorithms and MMAS are traditionally oriented to combinatorial problems, but there are works adapting them to continuous problems, demonstrating stable search and robustness to local optima [15]. The mentioned evolutionary algorithms may be inefficient for tuning architectural parameters. For this task, Neural Architecture Search (NAS) [16] is most commonly used in the scientific literature, including in the context of KANs [17]. However, NAS requires significant computational resources, as it involves separately training and then comparing a large number of architectures. KAN convolutional layers contain many more parameters than conventional convolutions, which further increases the computational cost. This makes the application of NAS in tasks with KAN layers extremely computationally expensive.
In contrast to conventional NAS, Differentiable Architecture Search (DARTS) optimizes architectural parameters in a single training cycle, which reduces the computational cost. Since KAN layers are already differentiable, the DARTS method is the most suitable for their configuration.

2.3. KAN in Related Works

Firstly, we chose 3D CNNs because they directly capture spatio-temporal dependencies across consecutive frames, which is critical for violence recognition. While 2D CNNs can also be applied by processing frames individually and then aggregating temporal features, they are generally better suited for object detection tasks, as shown in recent work on safety helmet monitoring using YOLOv10 and transformer-based architectures [18]. In contrast, Tang et al. introduced M3Net [19], which employs multi-view encoding, matching, and fusion to enhance fine-grained action recognition under few-shot settings, demonstrating another direction for modeling complex motion representations. KANs, in scientific research, are applied to problems with interpretable structures in small data sets. They are successfully used in time series analysis, vision tasks, and sequence processing, especially for learning meaningful representations and features [20,21,22]. The combination of KANs and convolutions has shown promise in signal and time series analytics due to the compactness of the models and ease of interpretation. One of the comprehensive studies, related to Kolmogorov–Arnold Convolutions, performed a set of empirical evaluations of scalable convolutions on different datasets [23]. This and other works on KANs opened a question—how to optimize convolutional KANs.

3. Methods

In this project, an optimization approach for the KAN convolution-based model was developed. The generalized method pipeline described in this section is demonstrated in Figure 1. First, we designed and implemented a 3D CNN model, and then evaluated three types of evolutionary algorithms and their modifications for hyperparameter tuning: GA, DE, and ACO. Secondly, we designed and implemented a KAN convolution-based version of our model and evaluated the same optimization algorithms on it. After making the final choice of an algorithm for hyperparameter tuning of the model, we utilized the logic behind DARTS to implement KAN-adapted architecture parameter tuning.

3.1. Dataset

The dataset used in this project was collected from two main sources: CCTV video records and professional boxing videos. As there are already plenty of 3D convolutional approaches tested for binary classification [1,2,3,4], a decision was made in favor of multiclass classification. Overall, three classes were labeled: fighting, meeting, and normal. The classes ‘Normal’ and ‘Fighting’ contained 50 videos each, while the class ‘Meeting’ contained 95 videos. The source videos varied in length from 30 s to a few minutes. This variety of material and length allowed training of a more generalized model, but raised questions about data imbalance. The database was manually collected from various open sources to ensure diversity in scene, context, and visual style. The videos include clips from films and television series containing scenes of fights and mass gatherings, real-world videos from the internet, including rallies, street conflicts, and scenes of ordinary city traffic, and footage from combat competitions (e.g., MMA, boxing) used for the “Fighting” category. Videos were manually selected based on clarity of visual context, clear membership in one of the classes, and absence of complex scenes with multiple behavior types simultaneously. Each video was previewed, after which 60 s segments were selected that most clearly reflected the corresponding behavior. All videos were pre-processed and standardized: each resulting video is in .mp4 format and is approximately 60 s long. Video resolution varied within standard values sufficient for behavior analysis (from 480p to 720p).

To achieve a balanced class distribution between the training and test sets, a stratified split was employed [24]. To further address class imbalance, a multi-stage strategy was implemented. Specifically, the RandomOverSampler technique was applied to the training data to equalize class frequencies and ensure adequate representation of minority categories. In addition, class-specific weighting was incorporated within the Focal Loss formulation, thereby reducing the dominance of majority classes and improving the model’s sensitivity to underrepresented instances. Collectively, these measures alleviated the adverse effects of imbalance, enhanced robustness, and prevented the model from being biased toward overrepresented classes.
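The two imbalance measures described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `random_oversample` and `focal_loss`, the focusing parameter `gamma = 2.0`, and the class weights are hypothetical, and the class sizes 50 / 50 / 95 are borrowed from the dataset description purely as an example (in the paper, oversampling is applied to the training split only).

```python
import math
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every class
    is as frequent as the largest one (the idea behind random oversampling)."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_samples.extend(items + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


def focal_loss(probs, target, class_weights, gamma=2.0):
    """Class-weighted focal loss for a single sample.

    probs: predicted class probabilities (they sum to 1)
    target: index of the true class
    class_weights: per-class weight (larger for rarer classes)
    gamma: focusing parameter; gamma = 0 reduces to weighted cross-entropy
    """
    p_t = probs[target]
    return -class_weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)


# Illustrative class sizes (50 / 50 / 95); integers stand in for video clips.
labels = ["Normal"] * 50 + ["Fighting"] * 50 + ["Meeting"] * 95
samples = list(range(len(labels)))
balanced_samples, balanced_labels = random_oversample(samples, labels)
class_counts = Counter(balanced_labels)

weights = [1.0, 1.0, 0.5]  # hypothetical: down-weight the majority class
easy = focal_loss([0.9, 0.05, 0.05], target=0, class_weights=weights)
hard = focal_loss([0.4, 0.3, 0.3], target=0, class_weights=weights)
```

The factor `(1 - p_t) ** gamma` shrinks the loss of confidently correct samples, so the `easy` example contributes far less to the total loss than the `hard` one.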

3.2. Dataflow in 3D-CNN

The key parameter that differentiates a 3D convolutional neural network (3D-CNN) from a conventional 2D-CNN is the temporal dimension, which is explicitly modeled to capture motion information in video data. This temporal dimension can be obtained through the analysis and preprocessing of the video dataset. The dataflow pipeline, illustrated in Figure 2, enables the extraction of both spatial and temporal features from video sequences, facilitating accurate action classification.

In the first step of preprocessing, a fixed number k of consecutive frames is extracted from each video. The value of k corresponds to the clip_len parameter [25], which determines the temporal extent of the input sample. The videos, pre-split into individual frames, are stored in corresponding folders, with each folder representing a single video of variable length. The selected frames are converted into NumPy arrays with the shape [clip_len, height, width, channels], and subsequently transformed into PyTorch tensors of shape [batch_size, channels, clip_len, height, width], where the number of channels is 3 for RGB color images.

These tensors are processed by a sequence of three 3D convolutional layers. The first convolutional layer transforms the input tensor of shape [batch_size, 3, clip_len, 112, 112] into [batch_size, 32, clip_len/2, 56, 56], halving both the temporal and the spatial dimensions. The second layer halves them again and increases the channel depth, producing [batch_size, 64, clip_len/4, 28, 28]. The third layer outputs [batch_size, 128, clip_len/8, 14, 14].

The feature maps from the last convolutional layer are flattened and fed into the first fully connected layer, with the input dimension calculated as [batch_size, 128 × (clip_len/8) × 14 × 14]. This layer outputs feature vectors of shape [batch_size, 256], which are passed to the final fully connected layer that produces class predictions of shape [batch_size, num_classes], where num_classes = 3 (Normal, Fighting, Meeting).
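The shape bookkeeping above can be checked with a short calculation. The sketch below assumes that every stage halves the temporal and spatial sizes with floor division (typical of pooling with kernel and stride 2) and uses clip_len = 30, the clip length reported in the training setup; the function name `trace_shapes` is hypothetical.

```python
def trace_shapes(clip_len=30, size=112, channels=(32, 64, 128)):
    """Return the (channels, frames, height, width) shape after each stage,
    assuming every stage halves time and space with floor division."""
    frames, height, width = clip_len, size, size
    shapes = []
    for out_channels in channels:
        frames, height, width = frames // 2, height // 2, width // 2
        shapes.append((out_channels, frames, height, width))
    return shapes


shapes = trace_shapes()
last_c, last_t, last_h, last_w = shapes[-1]
# Number of inputs to the first fully connected layer after flattening.
flat_features = last_c * last_t * last_h * last_w
```

With 30 input frames, the temporal size goes 30, 15, 7, 3 under this assumption, so clip_len/8 is rounded down to 3 before flattening.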

3.3. Dataflow in KAN3DCNN

In KAN3DCNN (Figure 3), the first stage is similar to the previous one (Figure 4), although the tensors pass through three consecutive KANConv3DLayer layers: modified 3D convolutions with B-spline approximation. Each such layer first applies GELU activation and a classical 3D convolution on each channel, then calculates B-spline bases for each spatio-temporal value, unfolds them into an additional set of “spline channels”, and runs them through a second 3D convolution, after which it sums both results, normalizes, passes them through PReLU and Dropout3D [21]. Due to this, the first KAN layer reduces the dimension to [batch_size, 32, clip_len/2, 56, 56], the second to [batch_size, 64, clip_len/4, 28, 28], and the third to [batch_size, 128, clip_len/8, 14, 14]. The output of the last layer is unfolded into a flat vector of dimension [batch_size, 128 × (clip_len/8) × 14 × 14], passed to a fully connected layer with 256 neurons, and then to the final classifier [batch_size, num_classes], where num_classes = 3 (Normal, Fighting, Meeting). However, although KANConv3DLayer uses the idea of one-dimensional B-spline functions, it does not fully implement the multi-level composition of all input coordinates inherent in the classical Kolmogorov–Arnold theorem, but only adds a spline approximation to the usual convolution on each channel.

The difference from the regular 3D-CNN is that in the regular 3D-CNN, each 3D convolutional block consists of a simple sequence: convolution, BatchNorm, ReLU, and MaxPool; all the nonlinear approximation is provided by a single activation after the linear filters. In KANConv3DLayer, on the contrary, each block is split into two parts: a base branch, where after GELU there is a regular Conv3d, and a spline branch, where for each input value, B-spline bases are built, unfolded into an additional set of channels, and convolved with separate spline filters. Both results are summed up and normalized. This allows KAN layers to capture complex local nonlinear dependencies via spline approximation [26], whereas standard convolutions are limited to a linear combination of inputs and a single point activation.
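To make the spline branch more concrete, the following sketch evaluates the B-spline bases for a single input value with the Cox–de Boor recursion. It is an illustration only: the uniform grid on [-1, 1], the knot extension, and the function name `bspline_basis` are assumptions, and the actual KANConv3DLayer may build its grid differently. Under this construction, one input value expands into grid_size + spline_order basis values, which is what gets unfolded into the additional spline channels.

```python
def bspline_basis(x, grid_size, spline_order, lo=-1.0, hi=1.0):
    """Evaluate all B-spline basis functions of degree `spline_order` at x.

    The grid has `grid_size` uniform intervals on [lo, hi] and is extended by
    `spline_order` extra knots on each side (an assumed, common construction).
    """
    step = (hi - lo) / grid_size
    n_knots = grid_size + 2 * spline_order + 1
    knots = [lo + (i - spline_order) * step for i in range(n_knots)]

    # Degree 0: indicator function of each knot interval.
    basis = [1.0 if knots[i] <= x < knots[i + 1] else 0.0
             for i in range(n_knots - 1)]

    # Cox-de Boor recursion: raise the degree one step at a time.
    for degree in range(1, spline_order + 1):
        raised = []
        for i in range(len(basis) - 1):
            left = (x - knots[i]) / (knots[i + degree] - knots[i])
            right = (knots[i + degree + 1] - x) / (knots[i + degree + 1] - knots[i + 1])
            raised.append(left * basis[i] + right * basis[i + 1])
        basis = raised
    return basis


# One activation value expanded with grid_size = 5 and spline_order = 3.
values = bspline_basis(0.3, grid_size=5, spline_order=3)
n_spline_channels = len(values)  # grid_size + spline_order
```

Inside the grid range the basis values are non-negative and sum to one, so the spline convolution effectively learns a smooth, piecewise-polynomial function of each input value.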

In addition to the baseline KAN3DCNN architecture, an extended variant incorporating spatial-temporal attention was developed. This modification integrates a 3D Squeeze-and-Excitation (SE) mechanism after each KANConv3D_DARTS block. The SEBlock3D module performs global spatio-temporal pooling for each feature map, followed by a two-layer channel recalibration with a reduction ratio of 16. The resulting attention weights are applied multiplicatively to the original feature maps, allowing the network to emphasize the most informative channels while suppressing less relevant ones.

The enhanced architecture retains the three-stage KANConv3D backbone with progressive channel expansion (32 → 64 → 128) and intermediate pooling operations, but each convolutional stage is immediately followed by an SE attention block before pooling. As in the original version, the final representation is obtained via adaptive pooling to a fixed volume (3×14×14), flattened, passed through a fully connected layer with 256 neurons, and mapped to the output layer corresponding to the three behavior classes. This design aims to improve the model’s ability to jointly capture appearance and motion cues while dynamically re-weighting feature channels according to their relevance for the classification task.
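The channel recalibration performed by the SE block can be sketched without any deep learning framework. The example below is a simplified illustration: the ReLU and sigmoid nonlinearities follow the usual Squeeze-and-Excitation design and are assumed here, biases are omitted, the weights are placeholders rather than learned values, and the function name `se_block` is hypothetical.

```python
import math


def se_block(feature_maps, w1, w2):
    """Channel recalibration in the style of Squeeze-and-Excitation.

    feature_maps: list of channels, each a flat list of T*H*W activations
    w1: (hidden x channels) weight matrix of the reduction layer
    w2: (channels x hidden) weight matrix of the expansion layer
    """
    # Squeeze: global spatio-temporal average pooling per channel.
    squeezed = [sum(ch) / len(ch) for ch in feature_maps]
    # Excitation: bottleneck layer with ReLU, then expansion with sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: multiply every activation of a channel by its gate in (0, 1).
    scaled = [[g * v for v in ch] for g, ch in zip(gates, feature_maps)]
    return gates, scaled


channels, reduction = 32, 16
hidden_units = channels // reduction  # reduction ratio 16 leaves 2 hidden units
feature_maps = [[float(c % 4)] * 8 for c in range(channels)]  # toy activations
w1 = [[0.1] * channels for _ in range(hidden_units)]          # placeholder weights
w2 = [[0.5 if c % 2 == 0 else -0.5] * hidden_units for c in range(channels)]
gates, scaled = se_block(feature_maps, w1, w2)
```

Each gate is a single number per channel, so the block adds few parameters while letting the network amplify or suppress whole feature maps.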

3.4. Training

The model training process can be divided into two hierarchical levels. The first (external) level is the hyperparameter search, which operates on the basis of MMAS and tunes the learning_rate and weight_decay parameters. The second (internal) level is the KAN parameter search, which operates on the basis of a DARTS-inspired method and selects the most suitable KANConv3D layer configurations based on spline_order and grid_size values.

The models were implemented in PyTorch 2.4.0 and trained with the Adam optimizer. During the MMAS search, the learning rate (1 × 10⁻⁴–1 × 10⁻²) and weight decay (1 × 10⁻⁶–1 × 10⁻²) were sampled, with best results around 3 × 10⁻⁴ and 1 × 10⁻⁴, respectively. Each candidate was trained for 3 epochs with a batch size of 8, using six ants. Video clips of 30 frames were used, and Focal Loss with class weights addressed class imbalance. These settings were chosen to balance convergence speed, regularization, and computational feasibility while ensuring comparability across models. Other configurations (candidate set and lambda) are discussed in the Discussion Section.

3.4.1. MMAS Level Training

As shown in Figure 5A, at each stage of MMAS, a generation is initialized and individual solutions, referred to as “ants”, are created. Each ant trains a separate model instance. The first ant randomly initializes the learning_rate and weight_decay hyperparameters and samples three pairs of grid_size and spline_order parameters, which define the form, smoothness, and complexity of the B-spline basis in the KANConv3D layers. Each subsequent ant stochastically samples parameter values using pheromones, which represent the popularity of each parameter set [27].

KANConv3D takes a different approach from a classic KAN. Both methods use B-splines for nonlinear representation of input values; however, architecture-wise, KAN convolutions differ significantly from the classical implementation. In the original formulation, a KAN output neuron is computed by transforming every input $x_j$ with its corresponding nonlinear function $\phi_{ij}$, which is built from B-splines and weighted with a coefficient $a_{ij}$ [5]:

$$y_i = \sum_{j} a_{ij} \cdot \phi_{ij}(x_j). \tag{1}$$

In the KANConv3D method, the input tensor x passes through an activation first, and then two convolutions are applied: a regular convolution and a B-spline convolution. B-splines are formed by a uniform grid and applied to the whole tensor instead of individual features:

$$\mathrm{Output}(x) = \mathrm{Conv3D}_{\mathrm{base}}(\phi(x)) + \mathrm{Conv3D}_{\mathrm{spline}}(\mathrm{Bsplines}(x)), \tag{2}$$

where $\phi$ is an activation function, while $\mathrm{Bsplines}(x)$ is the transformed input. This means that B-splines serve as additional inputs to the convolutional operation, not as an independent function [23]. An ant builds a model that consists of three sequential KANConv3D layers (Figure 5B), while each layer contains three architectural candidates. These candidates are combined using softmax-weighting, where weights $\alpha_i$ are trainable parameters of the model. Therefore, the final output of each layer is calculated as a weighted sum of the outputs of all three variants, which provides a differentiable learning path for the architecture within a single model.

When all ants in a generation are trained, the three most accurate architectural solutions are selected, and pheromones are updated for them. The pheromones of the most successful configurations are increased, while the rest are partially “evaporated”. This approach allows the method to strengthen its preferences towards more efficient architectures.
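The sampling and pheromone update described in this subsection can be illustrated with a small, self-contained sketch. The discretised learning-rate grid, the evaporation rate `rho`, the deposit amount, and the bounds `tau_min` and `tau_max` are all assumptions chosen for illustration; the paper does not report these values, and the function names are hypothetical.

```python
import random


def sample_choice(pheromone, rng):
    """Pick an index with probability proportional to its pheromone level."""
    return rng.choices(range(len(pheromone)), weights=pheromone, k=1)[0]


def update_pheromone(pheromone, rewarded, rho=0.1, deposit=1.0,
                     tau_min=0.1, tau_max=5.0):
    """Evaporate all trails, reinforce the rewarded indices, then clamp the
    result to [tau_min, tau_max] as in a MAX-MIN style ant system."""
    updated = []
    for index, tau in enumerate(pheromone):
        tau = (1.0 - rho) * tau                 # evaporation
        tau += deposit * rewarded.count(index)  # reinforcement of good choices
        updated.append(min(tau_max, max(tau_min, tau)))
    return updated


learning_rates = [1e-4, 3e-4, 1e-3, 3e-3, 1e-2]  # hypothetical discretised grid
pheromone = [1.0] * len(learning_rates)

rng = random.Random(0)
ant_choices = [sample_choice(pheromone, rng) for _ in range(6)]  # six ants

# Suppose the three most accurate ants of this generation all used index 1.
pheromone = update_pheromone(pheromone, rewarded=[1, 1, 1])
```

Clamping the trails keeps every option selectable, which is the mechanism a MAX–MIN scheme uses to avoid premature convergence on one configuration.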

3.4.2. DARTS Level Training

The method based on DARTS was adapted by using its main principle: the relaxation of discrete architectural choices into a continuous search with learnable weights. This method searches for pairs of configurations (grid_size, spline_order). In the final step, it constructs a KANConv3D layer that combines outputs from each variant of configurations.

This combination is performed by weighting the sum of the outputs of all candidates using a SoftMax function. The weights $\alpha$ are learnable parameters, optimized through standard backpropagation. The input tensor x is sequentially processed by each of the N candidates. Each candidate applies a unique filter to its output, and the results are weighted using the coefficients $\alpha$ [8]:

$$\mathrm{Out}(x) = \sum_{i=1}^{N} \mathrm{SoftMax}(\alpha_i) \cdot f_i(x). \tag{3}$$
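Equation (3) can be illustrated with scalar stand-ins for the candidate layers. In the sketch below, the three lambda functions are placeholders for KANConv3D candidates with different (grid_size, spline_order) pairs, and the function names are hypothetical; only the softmax weighting itself follows the equation.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_output(x, candidates, alphas):
    """Weighted sum of all candidate outputs, following Equation (3)."""
    weights = softmax(alphas)
    return sum(w * f(x) for w, f in zip(weights, candidates))


# Placeholder candidates standing in for three KANConv3D variants.
candidates = [lambda x: x, lambda x: 2.0 * x, lambda x: x * x]

alphas = [0.0, 0.0, 0.0]                 # equal logits give equal weights
uniform_mix = mixed_output(3.0, candidates, alphas)

alphas = [0.0, 4.0, 0.0]                 # the second candidate now dominates
peaked_mix = mixed_output(3.0, candidates, alphas)
peaked_weights = softmax(alphas)
```

Because the mixture is a smooth function of the logits, gradients of the loss flow into `alphas` just as they flow into ordinary weights, which is what makes the architectural choice trainable.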

A regularization term on the weights $\alpha_i$ was added to the classification loss $L_{ce}$ in order to prevent the model from focusing on a single variant and ignoring the others [28]:

$$L_{total} = L_{ce} + \lambda_{\alpha} \cdot \sum_{i=1}^{N} \mathrm{SoftMax}(\alpha_i) \cdot \log(\mathrm{SoftMax}(\alpha_i) + \varepsilon), \tag{4}$$

where $\lambda_{\alpha}$ is the regularization coefficient controlling the influence of the penalty term, and $\varepsilon$ is a small constant to avoid computing the logarithm of zero. The sum in Equation (4) is the negative entropy of the SoftMax distribution, so minimizing $L_{total}$ favors higher entropy.

Entropy regularization penalizes the model if the choice is overly confident and the SoftMax value for a specific candidate is significantly higher than for others. This regularization term discourages extremely peaked SoftMax distributions and prevents overconfident architectural selection. If one candidate’s SoftMax score dominates, the entropy decreases and the penalty increases. This mechanism is particularly important during the early stages of training, ensuring that all architectural variants receive gradient updates and allowing the model to explore the search space more thoroughly before converging to an optimal configuration. Thus, entropy regularization helps the model avoid early architectural collapse—where learning becomes biased toward a single, suboptimal structure—thereby maintaining architectural diversity and improving the robustness of the search process.
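The behavior of the regularization term in Equation (4) can be verified numerically. In this sketch, the coefficient `lam = 0.005` is a hypothetical value chosen inside the range of λ reported for the best configurations, `eps = 1e-8` is an assumed constant, and the function names are illustrative.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_penalty(alphas, lam=0.005, eps=1e-8):
    """Regularization term of Equation (4):
    lam * sum_i p_i * log(p_i + eps), i.e. minus lam times the entropy."""
    probs = softmax(alphas)
    return lam * sum(p * math.log(p + eps) for p in probs)


uniform = entropy_penalty([0.0, 0.0, 0.0])  # all candidates equally weighted
peaked = entropy_penalty([5.0, 0.0, 0.0])   # one candidate dominates
```

The term is most negative for the uniform distribution and approaches zero as one candidate takes over, so a peaked distribution yields a larger total loss than a balanced one.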

4. Evaluation

In addition to MMAS, other evolutionary algorithms for global optimization—Genetic Algorithm (GA), Differential Evolution (DE), and Ant Colony Optimization (ACO)—were also tested in this project. However, the best results were obtained with MMAS, which was the main reason for selecting it as the final optimization method. Each optimization strategy was tested at least three times on the same dataset to increase confidence in the results and to account for the stochastic nature of all algorithms. The final evaluation was based on the average results of each strategy.

The classification report was used to assess the quality of the model after every stage of training. It included metrics such as precision, recall, F1-score, and overall accuracy. In addition, weighted and macro-averaged metric values were calculated to account for class imbalance. The model was tested on a test set obtained through a stratified split. The full algorithm of the proposed optimization strategy for the network based on KANConv3D is shown in Figure 5. It combines both the MMAS and DARTS processes, as well as the training and evaluation stages of the project. The detailed steps of the proposed optimization technique for the KANConv3D Network are illustrated in the pseudocode shown in Figure 6.
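The difference between macro and weighted averaging can be seen in a small worked example. The labels below are toy values invented for illustration only and are not results from the paper; the function names are hypothetical.

```python
def per_class_metrics(y_true, y_pred, classes):
    """Precision, recall, F1 and support for each class."""
    report = {}
    pairs = list(zip(y_true, y_pred))
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[cls] = {"precision": precision, "recall": recall,
                       "f1": f1, "support": tp + fn}
    return report


def averaged_f1(report):
    """Macro average (every class counts equally) and weighted average
    (classes weighted by their number of true samples)."""
    total = sum(m["support"] for m in report.values())
    macro = sum(m["f1"] for m in report.values()) / len(report)
    weighted = sum(m["f1"] * m["support"] for m in report.values()) / total
    return macro, weighted


# Toy labels, invented for illustration only (not results from the paper).
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]
report = per_class_metrics(y_true, y_pred, classes=[0, 1, 2])
macro_f1, weighted_f1 = averaged_f1(report)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Macro averaging gives the smallest class the same influence as the largest one, which is why it is the more demanding figure on imbalanced test sets.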

5. Results

Experimental Results

To evaluate the effectiveness of the proposed approach, comparative experiments were conducted with several models, including both baseline versions of 3D-CNN and extended architectures incorporating KANConv3D layers. Each model was tested on a single dataset with three classes describing human behavior: “Fighting”, “Meeting”, and “Normal”. The main evaluation metrics included average accuracy, recall, precision, and F1-score computed across the three classes. Key results obtained from these experiments are presented in Table 1. For reproducibility, each experiment was repeated three times with different random seeds. Performance analysis, including ROC curves and confusion matrix on the original dataset, is illustrated in Figure 7. Across all runs, the KAN3D-MMAS-DARTS model achieved an accuracy in the range of 0.85–0.89 (mean ± std: 0.87 ± 0.03). Other models exhibited similar stability, with standard deviations not exceeding ±0.04. Additionally, experiments were conducted on an augmented dataset with additional improvements, such as SEBlock3D and RandomOverSampler, as these enhancements require larger and more diverse data to be effective. Applying them only to the augmented dataset ensured that their benefits could be reliably assessed without overfitting to the very limited original data.

The KAN3D-MMAS-DARTS model achieved the highest overall performance with an accuracy of 0.87, followed by KAN3D (MMAS) with 0.85. Among the classical architectures, the best result was obtained by 3D-CNN (MMAS) with an accuracy of 0.81.

To further assess the robustness of the proposed approach, the original dataset was augmented with additional video recordings, thereby increasing the diversity of examples for all three classes. Following this augmentation, two sequential experiments were conducted. In the first stage, training was performed on a limited portion of the extended dataset to evaluate the models’ performance under data scarcity conditions. In the second stage, training was conducted on the full augmented dataset to assess scalability and generalization capabilities. The performance analysis on the augmented dataset is illustrated in Figure 8.

The results showed that even with a reduced dataset, the SEBlock3D architecture combined with the update_lambda_a method consistently outperformed the SEBlock3D with the RandomOverSampler baseline. When trained on the full dataset, the highest metrics were achieved by the SEBlock3D combined with RandomOverSampler.

The results for all tested configurations of KANConv3D and MMAS-DARTS on the augmented dataset are summarized in Table 2. While the SEBlock3D + update_lambda_a variant demonstrated stable performance across both limited and large datasets, the SEBlock3D + RandomOverSampler achieved the highest metrics when trained on the full augmented dataset, with an F1-score of 0.93 and accuracy of 0.97. Importantly, the KANConv3D (without SEBlock3D) configuration also exhibited strong results (F1-score of 0.90 and accuracy of 0.94), confirming that competitive performance can be achieved without additional attention modules. This comparison provides a systematic evaluation of the contribution of SE blocks, showing that while they can boost precision, the KANConv3D backbone itself captures sufficient discriminative features even without explicit channel recalibration.

Figure 9 shows the key difference between KAN and traditional CNN in terms of the behavior of activation functions. In the KAN model (left), each colored curve corresponds to a spline activation function learned by a separate filter. The abscissa shows the input signal values, and the ordinate shows the corresponding activation values. It is clear that the learned functions have diverse forms: some are oscillatory, while others are monotonic or piecewise nonlinear. This indicates that each neuron generates its own adaptive nonlinearity learned directly from the data. This flexibility is achieved through spline parameterization, which allows the model to approximate a wide range of functional dependencies beyond simple rectification functions. The functional diversity of activations allows the KAN to represent complex input–output dependencies with fewer layers and ensures greater interpretability of the model.

In contrast, a CNN (right) uses a single fixed activation function, the ReLU, defined as f(x) = max(0, x). While ReLU effectively introduces nonlinearity, it is unable to adapt to the data distribution or the specific features of individual neurons. As a result, all neurons in a CNN apply the same transformation to their inputs, forcing the network to rely solely on weights to model complex relationships. In KANs, the learning process is distributed between weights and activations, enabling a more expressive and flexible data representation.

6. Discussion

6.1. Analysis of Results

The experimental results demonstrated that the combined MMAS + DARTS approach can substantially improve both the accuracy and robustness of models on video data with high variability (e.g., varying durations, camera angles, and scenes). This improvement was particularly evident when using KANConv3D blocks with trained spline functions.

As shown in Table 2, the SEBlock3D + RandomOverSampler method achieved the highest performance among all considered approaches, reaching an accuracy of 0.97, an F1-score of 0.93, and high recall and precision values.

Additionally, preliminary experiments were conducted on a reduced version of the RWF-2000 dataset, with a training split of 640 non-violent and 381 violent videos and a test split of 160 non-violent and 96 violent videos. The same model, together with the proposed optimization method, achieved an accuracy of 0.69–0.75. While these results are below the state of the art, they suggest that the proposed approach is applicable to larger benchmarks.

Table 3 presents the top-performing configurations obtained during the MMAS + DARTS-based hyperparameter search. Each row corresponds to one of the best configurations found during optimization, including the attention coefficient $\lambda$ and the selected candidate operation pairs.

The best-performing configuration achieved an accuracy of 0.9688 with $\lambda = 0.00661$ and the candidate set $(4, 7), (4, 5), (2, 11)$. Most high-accuracy configurations feature $\lambda$ values in the range $[0.0003, 0.0066]$, indicating that excessively large or very small values tend to degrade model performance.

Overall, the KAN3D-MMAS+DARTS model demonstrated the best results across all metrics. Compared with traditional 3D-CNNs optimized with GA or DE, it achieved clear improvements, particularly in F1-measure and recall, indicating a more accurate and balanced classification. Replacing conventional convolutional layers with KANConv3D proved effective, as the model could extract more expressive spatio-temporal features due to the use of B-splines. In contrast, using a KAN without DARTS led to a notable performance drop, suggesting that hyperparameter tuning alone is insufficient without the architectural flexibility enabled by the differentiable candidate selection mechanism in DARTS. (The evaluation set contained only 32 samples (20 from class 0, 9 from class 1, and 3 from class 2), which may lead to higher variance in the reported metrics; the results should, therefore, be interpreted with a degree of caution.)

An important advantage of Kolmogorov–Arnold Networks compared with conventional 3D-CNNs lies in their interpretability. While CNN filters operate as black-box feature extractors, the KANConv3D layers rely on spline-based approximations, which provide explicit functional mappings between input coordinates and learned representations. This design allows decomposition of complex spatio-temporal interactions into sums of low-dimensional functions, consistent with the Kolmogorov–Arnold representation theorem. As a result, the model’s decision process can be more directly traced to specific spline bases and their coefficients, offering a clearer view of how local spatial or temporal structures contribute to final predictions. In contrast, conventional CNN kernels lack such explicit interpretability, making it more difficult to associate learned parameters with meaningful input variations. This property of KAN is particularly beneficial in video anomaly recognition, where understanding which motion patterns or regions drive classification can help validate model decisions and support practical deployment.

From a computational perspective, the overall complexity is expressed in terms of the number of models trained at each iteration of MMAS:

$$O(A \cdot G \cdot C \cdot T)$$

where A is the number of ants and G is the number of generations, both set to 6 in our work; C is the number of candidates, which was set to 3, as mentioned earlier in this paper; and T is the time required for one training run, which was 9 min on average, given that every run lasted 3 epochs. Since each ant generates configurations that require full training, the complexity is linear in the number of ants A, the number of generations G, and the number of candidates C. A differentiable choice of operations was used within blocks, which added complexity linear in the number of operations at each training step. This corresponds to an estimated budget of 5–6 GPU hours on a single NVIDIA RTX 4090. Compared to conventional hyperparameter optimization methods, this is significantly more resource-intensive, which limits the scalability of the approach to larger datasets and more complex models.
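The budget can be checked with simple arithmetic. The sketch below uses the values reported above (6 ants, 6 generations, 9 min per training run, 3 candidates); the function name `search_budget_hours` and the `runs_per_ant` parameter are hypothetical. The reported 5–6 GPU hours match one training run per ant per generation, which suggests (our reading, not stated explicitly in the text) that the 9-minute run already covers all candidates mixed inside a single model.

```python
def search_budget_hours(ants: int, generations: int, minutes_per_run: float,
                        runs_per_ant: int = 1) -> float:
    """Total hours if all training runs execute sequentially on one GPU."""
    total_runs = ants * generations * runs_per_ant
    return total_runs * minutes_per_run / 60.0


if __name__ == "__main__":
    # Reported values: A = 6 ants, G = 6 generations, T = 9 min, C = 3 candidates.
    shared = search_budget_hours(6, 6, 9.0)                    # candidates share one run
    separate = search_budget_hours(6, 6, 9.0, runs_per_ant=3)  # one run per candidate
    print(f"one run per ant: {shared:.1f} h")
    print(f"one run per candidate: {separate:.1f} h")
```

Under the first assumption the total is 5.4 h; if every candidate required its own run, the same settings would need 16.2 h.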

The observed 3% improvement in accuracy over the base model is explained by the hybrid optimization strategy. DARTS acted as an architectural regularizer: during the search, overcomplicated substructures were eliminated, and only those that reduced the error on the validation set, not just on the training sample, were retained. Two mechanisms contribute to this effect. First, the entropy penalty on the distribution α discourages premature and rigid selection of a single operation, allowing the model to explore the architecture space longer and form more stable combinations of operations, which increases the diversity of extracted features. Second, limiting the complexity of substructures prevents overfitting to the training data. Both mechanisms improve generalization.
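The effect of an entropy term on the α distribution can be illustrated with a minimal sketch. The exact form and sign convention of the penalty are not given in this section, so the version below (subtracting a weighted entropy of softmax(α) from the task loss) is an assumption; `regularized_loss` and `entropy_weight` are hypothetical names.

```python
import math
from typing import List


def softmax(alpha: List[float]) -> List[float]:
    """Numerically stable softmax over architecture weights."""
    m = max(alpha)
    exps = [math.exp(a - m) for a in alpha]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def regularized_loss(task_loss: float, alpha: List[float], entropy_weight: float) -> float:
    """Assumed form: subtracting entropy makes peaked alpha distributions cost more."""
    return task_loss - entropy_weight * entropy(softmax(alpha))


if __name__ == "__main__":
    uniform = [0.0, 0.0, 0.0]  # no operation preferred yet
    peaked = [8.0, 0.0, 0.0]   # one operation already dominates
    print(regularized_loss(1.0, uniform, 0.01))
    print(regularized_loss(1.0, peaked, 0.01))
```

With the same task loss, the peaked distribution receives the larger total loss, so the gradient pushes against committing to one operation too early.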

Additionally, dynamically updating the learning rate and weight decay with MMAS adapted the training parameters to the current training stage. This balanced convergence speed against resistance to overfitting: a reduced learning rate in late phases and stronger weight regularization helped to prevent the model from overfitting to training patterns.

In addition to the main experiments, the dataset was extended with additional video samples to increase variability across all classes. Two sequential tests were conducted: first, training on a smaller subset to assess model performance under limited data conditions, and then on the full extended dataset to evaluate scalability and generalization. The results confirmed that the proposed approach retains its advantage even in data-scarce scenarios and benefits further from larger, more diverse training sets.

Moreover, in the updated algorithm, a mechanism for dynamic adjustment of the attention coefficient λ_a was introduced, allowing the model to adaptively change the balance between spatial and temporal features during training. Attention map visualization was also implemented, providing an intuitive analysis of which regions of the video frames have the greatest impact on classification.

6.2. Comparison with Other Methods

The results were expected, as the 3D-CNN has already proven to be a strong backbone in similar tasks [1,2,3,4,11,25]. A 3D-CNN captures the temporal dependencies between frames, which is essential in video, where the input is a sequence rather than a static frame. Even without advanced architecture search methods, a 3D-CNN already provides an adequate representation of motion and scene content.

As for the evolutionary algorithms, all the methods used are well established and have been applied in many other works. However, GA relies on random mutations. Although crossover and various selection techniques improve search efficiency [6,7,12], too many random changes can cause the algorithm to jump over promising regions of the search space.

DE, in turn, copes well with continuous-valued problems but, as noted in other works [12,13], performs worse on discrete and combinatorial problems. Accordingly, it performed well with the 3D-CNN model but was less effective with a KAN, where pairs of the parameters spline_order and grid_size must be selected, because this is a combinatorial problem. To address this, we used DARTS logic to tune these architecture parameters.
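The idea of relaxing a discrete choice among candidate pairs into a differentiable mixture can be sketched as follows. This is a toy illustration, not the authors' KANConv3D_DARTS implementation: each candidate pair is represented by a hypothetical scalar stand-in (`make_toy_op`) instead of a real layer, and the ordering of values inside each pair is not interpreted here. The candidate pairs are taken from the first row of Table 3.

```python
import math
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[int, int]


def softmax(alpha: Sequence[float]) -> List[float]:
    """Numerically stable softmax over architecture weights."""
    m = max(alpha)
    exps = [math.exp(a - m) for a in alpha]
    total = sum(exps)
    return [e / total for e in exps]


def make_toy_op(pair: Pair) -> Callable[[float], float]:
    """Hypothetical stand-in for a layer built from one candidate pair."""
    a, b = pair
    return lambda x: a * x + b


def mixed_output(x: float, alpha: Sequence[float],
                 ops: Sequence[Callable[[float], float]]) -> float:
    """Continuous relaxation: softmax-weighted sum of all candidate outputs."""
    return sum(w * op(x) for w, op in zip(softmax(alpha), ops))


def select_candidate(alpha: Sequence[float], candidates: Sequence[Pair]) -> Pair:
    """Discretization step: keep the candidate with the largest alpha."""
    best = max(range(len(alpha)), key=lambda i: alpha[i])
    return candidates[best]


if __name__ == "__main__":
    candidates = [(4, 7), (4, 5), (2, 11)]
    ops = [make_toy_op(c) for c in candidates]
    alpha = [0.1, 2.0, -1.0]  # illustrative values, normally learned by gradient descent
    print(mixed_output(1.0, alpha, ops))
    print(select_candidate(alpha, candidates))
```

Because the mixture is a smooth function of α, the combinatorial choice among pairs can be driven by gradients during training and only discretized at the end.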

MMAS records the history of successful choices through pheromones, which reinforces good solutions [15,27]. This enables a more targeted search: the more successful a choice has been, the more likely it is to be sampled again.
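The pheromone mechanism can be sketched with a generic MMAS-style update: evaporation, a deposit on the best option, and clamping of pheromone values to fixed bounds. This is a minimal illustration under assumed settings, not the configuration used in the paper; the evaporation rate `rho`, the bounds `tau_min` and `tau_max`, and the accuracy values in the demo are hypothetical.

```python
import random
from typing import Dict, Hashable


def sample_option(pheromone: Dict[Hashable, float], rng: random.Random) -> Hashable:
    """Sample an option with probability proportional to its pheromone level."""
    options = list(pheromone)
    weights = [pheromone[o] for o in options]
    return rng.choices(options, weights=weights, k=1)[0]


def update_pheromone(pheromone: Dict[Hashable, float], best_option: Hashable,
                     reward: float, rho: float = 0.1,
                     tau_min: float = 0.05, tau_max: float = 5.0) -> None:
    """Evaporate all trails, reinforce the best option, then clamp to the bounds."""
    for option in pheromone:
        pheromone[option] *= (1.0 - rho)
    pheromone[best_option] += reward
    for option in pheromone:
        pheromone[option] = min(tau_max, max(tau_min, pheromone[option]))


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical validation accuracies for three learning rates.
    scores = {1e-4: 0.70, 1e-3: 0.85, 1e-2: 0.60}
    pheromone = {lr: 1.0 for lr in scores}
    for _ in range(6):                                              # generations
        trials = [sample_option(pheromone, rng) for _ in range(6)]  # ants
        best = max(trials, key=scores.get)
        update_pheromone(pheromone, best, reward=scores[best])
    print(pheromone)
```

The clamping step is what distinguishes this family of ant systems: the lower bound keeps rarely chosen options reachable, and the upper bound prevents one option from dominating the sampling too early.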

6.3. Future Research

Despite the positive results, our work has several limitations. First, the current implementation of KANConv3D_DARTS assumes a fixed number of architectural candidates, which limits the flexibility of the search. Hundreds of parameter combinations are possible, and expanding the set of candidates will require either more resources or a more intelligent selection strategy.

Second, the model is trained from scratch for each ant, which makes the whole process computationally expensive. Although the number of ants and generations in MMAS was limited, scaling this approach to larger datasets or more complex models may be impractical without optimizations. It is worth noting that an experiment with adaptive hyperparameter selection of MMAS during training [29] was conducted, but was not successful.

It is also worth noting that the pheromone selection strategy works well in a limited search space but may be unstable in high-dimensional parameter spaces. This raises the question of whether the idea of differentiable pheromone search can be adapted, or whether MMAS can be combined with other training signals.

Finally, although adding α-weight regularization helped to avoid premature specialization on individual paths, the DARTS approach itself may be sensitive to the number of candidates and requires further study of its behavior in unstable video data.

In addition, future research should be conducted in the areas of improving pheromone sampling and applying the method to more realistic problems, such as real-time usage or noisy datasets. Beyond these algorithmic aspects, practical deployment challenges also remain. These include the high computational cost of training on large-scale video streams, robustness in unseen or dynamically changing environments, and fairness or ethical concerns in sensitive application domains. Addressing these issues will be crucial for transitioning the method from research settings to real-world scenarios.

7. Conclusions

This study proposes an automated procedure for selecting architectural and training parameters for a video violence recognition model. The Minimax Ant System (MMAS) is used to sample combinations of layer parameters (grid_size and spline_order), while the Differentiable Architecture Search (DARTS) mechanism optimizes α-weights inside each sampled model to select the most appropriate operation per block. This two-stage strategy enables efficient exploration of the architecture space without exhaustive manual search. Training hyperparameters (learning_rate, weight_decay) were also tuned, which improved stability and accuracy. The proposed approach outperformed other tested global optimizers (GA, DE) on the evaluated dataset.

Despite these positive outcomes, several limitations remain. The search currently uses a fixed candidate set, which constrains flexibility: enlarging the search space requires substantially more computing or smarter candidate selection. Training each MMAS solution from scratch is computationally expensive and hinders scalability to larger datasets or more complex models. The pheromone-based selection works well in a restricted search space but may become unstable in high-dimensional settings, motivating investigation of hybrid strategies that combine pheromone signals with differentiable or gradient-based cues.

We outline three practical directions for further work. First, extend and adapt the candidate space with intelligent selection or dynamic candidate pools to balance exploration and cost. Second, reduce optimization cost by reusing weights between generations (transfer learning/warm-start), adopting weight-sharing (one-shot) schemes, or otherwise accelerating training. Third, improve robustness to noisy and unstable video by strengthening regularization, enhancing temporal augmentations, and introducing adaptation mechanisms for degraded frames. Addressing these points should improve the method’s flexibility, efficiency, and applicability to real-world video surveillance and complex spatio-temporal tasks.

Author Contributions

Conceptualization, Z.B., A.Y. and M.Z.; methodology, Z.B., Z.A. and M.Z.; software, Z.A., Z.Z. and M.Z.; validation, M.A., Z.Z. and M.Z.; formal analysis, M.A., Z.Z. and M.Z.; investigation, M.A., Z.Z. and M.Z.; resources, A.Y., Z.A., Z.B. and M.Z.; data curation, A.Y., M.A. and M.Z.; writing—original draft preparation, M.A., Z.Z. and M.Z.; writing—review and editing, M.A., Z.Z. and M.Z.; visualization, M.A., Z.Z. and M.Z.; supervision, Z.B. and M.Z.; project administration, Z.B. and M.Z.; funding acquisition, Z.B. and M.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by The Committee of Science of the Ministry of Science and Higher Education of the Republic of Kazakhstan grant number AP19579370.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Park, J.H.; Mahmoud, M.; Kang, H.S. Conv3D-based video violence detection network using optical flow and RGB data. Sensors 2024, 24, 317.
2. Karim, A.; Razin, J.I.; Ahmed, N.U.; Shopon; Alam, T. An automatic violence detection technique using 3D convolutional neural network. In Sustainable Communication Networks and Application: Proceedings of ICSCN 2020; Springer: Singapore, 2021; pp. 17–28.
3. Chakole, P.D.; Satpute, V.R. Analysis of anomalous crowd behavior by employing pre-trained efficient-X3D net for violence detection. Sādhanā 2025, 50, 30.
4. Maqsood, R.; Bajwa, U.I.; Saleem, G.; Raza, R.H.; Anwar, M.W. Anomaly recognition from surveillance videos using 3D convolution neural network. Multimed. Tools Appl. 2021, 80, 18693–18716.
5. Liu, Z.; Wang, Y.; Vaidya, S.; Ruehle, F.; Halverson, J.; Soljačić, M.; Tegmark, M. KAN: Kolmogorov-Arnold networks. arXiv 2024, arXiv:2404.19756.
6. Long, Q.; Wang, B.; Xue, B.; Zhang, M. A Genetic Algorithm-Based Approach for Automated Optimization of Kolmogorov-Arnold Networks in Classification Tasks. arXiv 2025, arXiv:2501.17411.
7. Sen, A.; Mazumder, A.R.; Dutta, D.; Sen, U.; Syam, P.; Dhar, S. Comparative evaluation of metaheuristic algorithms for hyperparameter selection in short-term weather forecasting. In Proceedings of the 15th International Joint Conference on Computational Intelligence—ECTA, Rome, Italy, 13–15 November 2023; pp. 238–245.
8. Liu, H.; Simonyan, K.; Yang, Y. DARTS: Differentiable architecture search. arXiv 2018, arXiv:1806.09055.
9. Mohtavipour, S.M.; Saeidi, M.; Arabsorkhi, A. A multi-stream CNN for deep violence detection in video sequences using handcrafted features. Vis. Comput. 2022, 38, 2057–2072.
10. Rendón-Segador, F.J.; Álvarez-García, J.A.; Salazar-González, J.L.; Tommasi, T. CrimeNet: Neural structured learning using vision transformer for violence detection. Neural Netw. 2023, 161, 318–329.
11. Shanmughapriya, M.; Gunasundari, S.; Fenitha, J.R.; Sanchana, R. Fight detection in surveillance video dataset versus real time surveillance video using 3DCNN and CNN-LSTM. In Proceedings of the 2022 International Conference on Computer, Power and Communications (ICCPC), Chennai, India, 14–16 December 2022; IEEE: New York, NY, USA, 2022; pp. 313–317.
12. Kachitvichyanukul, V. Comparison of three evolutionary algorithms: GA, PSO, and DE. Ind. Eng. Manag. Syst. 2012, 11, 215–223.
13. Sen, A.; Mazumder, A.R.; Sen, U. Differential evolution algorithm based hyper-parameters selection of transformer neural network model for load forecasting. In Proceedings of the 2023 IEEE Symposium Series on Computational Intelligence (SSCI), Mexico City, Mexico, 5–8 December 2023; IEEE: New York, NY, USA, 2023; pp. 234–239.
14. Verma, K.K.; Singh, B.M. Deep multi-model fusion for human activity recognition using evolutionary algorithms. Int. J. Interact. Multimedia Artif. Intell. 2021, 7, 44–58.
15. Abdelmoaty, A.M.; Ibrahim, I.I. Comparative Analysis of Four Prominent Ant Colony Optimization Variants: Ant System, Rank-Based Ant System, Max-Min Ant System, and Ant Colony System. arXiv 2024, arXiv:2405.15397.
16. Ren, P.; Xiao, Y.; Chang, X.; Huang, P.Y.; Li, Z.; Chen, X.; Wang, X. A comprehensive survey of neural architecture search: Challenges and solutions. ACM Comput. Surv. 2021, 54, 1–34.
17. Latypov, V.; Hvatov, A. Exploring convolutional KAN architectures with NAS. In Proceedings of the First Conference of Mathematics of AI, Sochi, Russia, 24–28 March 2025.
18. Wang, S.; Park, S.; Kim, J.; Kim, J. Safety helmet monitoring on construction sites using YOLOv10 and advanced transformer architectures with surveillance and body-worn cameras. J. Constr. Eng. Manag. 2025, 151, 04025186.
19. Tang, H.; Liu, J.; Yan, S.; Yan, R.; Li, Z.; Tang, J. M3Net: Multi-view encoding, matching, and fusion for few-shot fine-grained action recognition. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 1719–1728.
20. Vaca-Rubio, C.J.; Blanco, L.; Pereira, R.; Caus, M. Kolmogorov-Arnold networks (KANs) for time series analysis. In Proceedings of the 2024 IEEE Globecom Workshops (GC Wkshps), Cape Town, South Africa, 8–12 December 2024.
21. Cheon, M. Demonstrating the efficacy of Kolmogorov-Arnold networks in vision tasks. arXiv 2024, arXiv:2406.14916.
22. Genet, R.; Inzirillo, H. TKAN: Temporal Kolmogorov-Arnold networks. arXiv 2024, arXiv:2405.07344.
23. Drokin, I. Kolmogorov-Arnold Convolutions: Design Principles and Empirical Studies. arXiv 2024, arXiv:2407.01092.
24. Szeghalmy, S.; Fazekas, A. A comparative study of the use of stratified cross-validation and distribution-balanced stratified cross-validation in imbalanced learning. Sensors 2023, 23, 2333.
25. Vrskova, R.; Hudec, R.; Kamencay, P.; Sykora, P. Human activity classification using the 3DCNN architecture. Appl. Sci. 2022, 12, 931.
26. Bodner, A.D.; Tepsich, A.S.; Spolski, J.N.; Pourteau, S. Convolutional Kolmogorov-Arnold networks. arXiv 2024, arXiv:2406.13155.
27. He, P.; Jiang, G.; Lam, S.K.; Sun, Y. ML-MMAS: Self-learning ant colony optimization for multi-criteria journey planning. Inf. Sci. 2022, 609, 1052–1074.
28. Jing, K.; Chen, L.; Xu, J. An architecture entropy regularizer for differentiable neural architecture search. Neural Netw. 2023, 158, 111–120.
29. Ali, Y.A.; Awwad, E.M.; Al-Razgan, M.; Maarouf, A. Hyperparameter search for machine learning algorithms for optimizing the computational complexity. Processes 2023, 11, 349.

Figure 1. Method pipeline: outer MMAS samples hyperparameters; inner DARTS-like optimization tunes α-weights for candidate KAN configurations.


Figure 2. Conversion of video frames into tensors and sequential processing in 3D-CNN for spatiotemporal feature extraction.

Figure 3. Structure of a KANConv3DLayer, combining B-splines, normalization, PReLU, and Dropout3D for richer representations.

Figure 4. Overview of the 3D CNN-based model for action classification.

Figure 5. Training algorithm with MMAS-DARTS optimization strategy: (A) full algorithm; (B) single epoch visualization.

Figure 6. Pseudocode of the proposed optimization technique for KANConv3D Network.

Figure 7. Performance of the KAN3D-MMAS-DARTS model on the original dataset: ROC curves across all iterations, averaged ROC curve, and confusion matrix of the best-performing configuration.

Figure 8. Performance of the SEBlock3D + RandomOverSampler model on the augmented dataset: ROC curves across all iterations, averaged ROC curve, and confusion matrix of the best-performing configuration.

Figure 9. Comparison of activation functions in KAN (on the left) and CNN (on the right) architectures. The KAN learns diverse, data-driven spline activations for each filter, while CNNs use a single fixed function (ReLU) shared across all neurons.

Table 1. Results of experiments on the original dataset.

Model	Recall	Precision	F1-Score	Accuracy
3D-CNN (GA)	0.69	0.69	0.68	0.69
3D-CNN (DE)	0.67	0.68	0.67	0.67
3D-CNN (MMAS)	0.73	0.76	0.73	0.73
KAN3D (Ant)	0.82	0.82	0.82	0.81
KAN3D (Genetic)	0.84	0.84	0.84	0.83
KAN3D (MMAS)	0.85	0.87	0.86	0.85
KAN3D (DARTS)	0.76	0.80	0.78	0.77
KAN3D-MMAS-DARTS	0.85	0.90	0.87	0.87

Table 2. Results of experiments on the augmented dataset.

Model	Recall	Precision	F1-Score	Accuracy
SEBlock3D + RandomOverSampler (big_dataset)	0.80	0.80	0.80	0.80
SEBlock3D + update_lambda_a (big_dataset)	0.81	0.82	0.81	0.82
SEBlock3D + update_lambda_a	0.83	0.83	0.79	0.80
KANConv3D (without SEBlock3D)	0.95	0.87	0.90	0.94
SEBlock3D + RandomOverSampler	0.89	0.98	0.93	0.97

Table 3. Top and Representative Results from MMAS+DARTS Experiments.

Accuracy	$λ$	Candidates
0.9688	0.00661	(4, 7), (4, 5), (2, 11)
0.9375	0.00052	(4, 9), (4, 11), (2, 7)
0.9375	0.00037	(5, 5), (1, 11), (5, 5)
0.9062	0.00610	(3, 11), (3, 7), (5, 5)
0.9062	0.00530	(4, 9), (3, 5), (4, 11)
0.9062	0.00002	(3, 5), (2, 5), (5, 5)
0.9062	0.00173	(3, 7), (4, 7), (2, 3)
0.8750	0.00109	(5, 9), (2, 11), (3, 7)
0.8750	0.00223	(1, 5), (3, 5), (4, 9)
0.1250	0.00706	(4, 5), (2, 11), (3, 7)

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Buribayev, Z.; Zhassuzak, M.; Aouani, M.; Zhangabay, Z.; Abdirazak, Z.; Yerkos, A. DARTS Meets Ants: A Hybrid Search Strategy for Optimizing KAN-Based 3D CNNs for Violence Recognition in Video. Appl. Sci. 2025, 15, 11035. https://doi.org/10.3390/app152011035

