Article

TricP: A Novel Approach for Human Activity Recognition Using Tricky Predator Optimization Based on Inception and LSTM

1 Bhagwan Parshuram Institute of Technology (BPIT), Guru Gobind Singh Indraprastha University (GGSIPU), PSP-4, Dr. K. N. Katju Marg, Sector-17 Rohini, New Delhi 110089, India
2 College of Computer Science and Information Technology, Wasit University, Al-Rabee District, Al Kut 52001, Wasit Governorate, Iraq
3 Department of Automation, Széchenyi István University, Egyetem Tér 1, 9026 Gyor, Hungary
4 School of Computing Science and Engineering, Galgotias University, Plot No. 2, Yamuna Expy, Opposite Buddha International Circuit, Sector 17A, Greater Noida 203201, Uttar Pradesh, India
5 Department of Information Technology, Guru Gobind Singh Indraprastha University (GGSIPU), Sector 16 C, Dwarka, Delhi 110078, India
6 Space Technology and Space Law Research Centre, Széchenyi István University, 9026 Gyor, Hungary
7 Computer Department, College of Education for Pure Sciences, Wasit University, Al-Rabee District, Al Kut 52001, Wasit Governorate, Iraq
8 Ministry of Education, Wasit Education Directorate, Al-Houra Street, Al Kut 52001, Wasit Governorate, Iraq
* Author to whom correspondence should be addressed.
Telecom 2026, 7(2), 32; https://doi.org/10.3390/telecom7020032
Submission received: 30 November 2025 / Revised: 2 March 2026 / Accepted: 10 March 2026 / Published: 19 March 2026

Abstract

Human Activity Recognition (HAR) is a pivotal research area for applications such as automated surveillance, smart homes, security, healthcare, and human behavior analysis. Traditional machine-learning approaches often rely on manual feature engineering, which can limit generalization. Although deep learning has improved HAR through automatic representation learning, achieving high detection performance under computational constraints remains challenging. This paper proposes an efficient HAR framework that combines deep learning with hybrid optimization. Surveillance videos are first decomposed into frames, and a keyframe selection stage identifies distinctive frames to reduce redundancy and computational cost while preserving informative content. Motion and appearance features are then extracted using Histogram of Oriented Optical Flow (HOOF) and a ResNet-101 model, respectively, and concatenated into a unified feature representation. Classification is performed using an Inception-based Long Short-Term Memory (Incept-LSTM) network, which is fine-tuned via the proposed Tricky Predator Optimization (TricP) over a restricted, low-dimensional parameter vector. TricP is inspired by predator poaching behavior and the social dynamics of Latrans to enhance exploration and exploitation during search. Experiments on the UCF-Crime dataset show that the proposed method achieves 96.84% specificity, 92.16% sensitivity, and 93.62% accuracy.

1. Introduction

Human Activity Recognition (HAR) is a challenging and rapidly evolving research area with applications in automated surveillance, healthcare, smart homes, Ambient Assisted Living (AAL), Human–Computer Interaction (HCI), and human behavior understanding. HAR can be formulated as a pattern-recognition problem in which postural and ambulatory activities—such as walking, running, sitting, and other body movement patterns—are analyzed and classified. A typical HAR pipeline comprises four main stages: data collection, preprocessing, feature extraction, and activity classification.
Based on the acquisition modality, HAR approaches are commonly grouped into video-based and sensor-based techniques. In video-based HAR, optical sensors or cameras capture activities within an environment. These approaches must address practical challenges such as occlusion, privacy constraints, scene obstruction, camera placement, and illumination variations [1]. In sensor-based HAR, wearable or ambient sensors (e.g., accelerometers and gyroscopes) are employed, often integrated into smartphones, wristbands, and smartwatches. Sensor-based methods are attractive because they are easy to deploy and typically require lower computational cost.
In this work, we focus on video-based HAR due to its direct relevance to surveillance. Modern security infrastructures increasingly rely on CCTV cameras for monitoring, producing time-series video streams. In many practical systems, videos are segmented into fixed-length clips (e.g., using sliding windows) before feature extraction and classification. However, designing HAR systems that are both accurate and computationally efficient remains difficult, particularly when videos are long, the scenes are complex, and the number of samples is large.
Traditional machine-learning (ML) methods can achieve good HAR performance, but they often require substantial domain expertise to design discriminative handcrafted features and may face several limitations: (i) failure to focus on the most informative content can increase computational complexity and reduce performance; (ii) several methods in [2,3,4] do not incorporate feature or parameter optimization that could improve classification; (iii) optimization-based deep learning methods [5] can improve performance but may suffer from slow convergence and weight decay; and (iv) some approaches are validated only on small benchmarks, and their performance may degrade on larger, real-world datasets [3,6].
Recently, deep learning (DL) has become a dominant approach for HAR due to its ability to learn representations automatically and its strong performance in vision tasks [7]. Deep neural networks are typically trained to minimize a loss function through iterative optimization, enabling feature learning without manual engineering. Convolutional Neural Networks (CNNs) are widely used to extract spatial patterns and short-term temporal cues from video frames and clips [8]. However, standard CNN-based models do not explicitly capture long-range temporal dependencies. For sequential data, Recurrent Neural Networks (RNNs), particularly Long Short-Term Memory (LSTM) networks, are commonly employed to model temporal evolution while mitigating vanishing-gradient issues. Moreover, motion information can be transformed into alternative representations suited to the task, and deep learning can then extract discriminative features from these representations [9].
Despite their effectiveness, deep architectures often contain a large number of parameters and typically require substantial training data and computational resources. To mitigate these requirements, transfer learning is frequently adopted, where pretrained networks are leveraged for feature extraction, reducing training time and improving generalization when labeled data are limited.
The overall goal of this research is to develop an efficient HAR system for surveillance video analysis. First, surveillance videos are converted into frames, and a keyframe-selection stage identifies informative frames to reduce redundancy and computational complexity. Next, distinctive features are extracted using Histogram of Oriented Optical Flow (HOOF) and ResNet-101 descriptors and then concatenated into a unified representation. Finally, HAR is performed using an Inception-based LSTM (Incept-LSTM). Consistent with the revised optimization scope discussed later in the paper, the proposed Tricky Predator optimization algorithm (TricP) is employed as an outer-loop optimizer to fine-tune a restricted, low-dimensional subset of the Incept-LSTM parameterization, rather than performing full-network weight optimization.
The major objectives of this research are as follows:
  • To introduce an efficient deep learning-based HAR technique for realistic, multi-view surveillance video sequences. Pretrained models are adopted for deep feature extraction to reduce training cost and improve generalization.
  • To propose a novel optimization algorithm, Tricky Predator (TricP), inspired by the poaching behavior of predators and the social dynamics of Latrans, to enhance exploration and exploitation during search.
  • To perform HAR using a Tricky Predator-based Incept-LSTM classifier, where TricP acts as an outer-loop optimizer to fine-tune a restricted subset of parameters, avoiding prohibitive full-network search while improving recognition performance.
The remainder of this paper is organized as follows. Section 2 presents related work on HAR. Section 3 describes the proposed Tricky Predator-based Incept-LSTM framework, including keyframe selection and feature extraction. Section 4 details the Tricky Predator optimization algorithm and its mathematical formulation. Section 5 presents the experimental setup and results, and Section 6 provides the analysis and comparative discussion. Finally, Section 7 concludes the paper and outlines future research directions.

2. Related Works

In 2021, Ronald et al. [2] proposed a Human Activity Recognition (HAR) approach based on deep learning using the Intelligent Signal Processing Lab Inception (iSPLInception) architecture. The Inception module integrates convolutional and pooling layers to reduce feature dimensionality and improve feature mapping, achieving good scalability and resource utilization. However, cross-validation was not reported, and further improvement in prediction accuracy remains necessary.
In the same year, Garcia et al. [3] introduced an ensemble-based HAR framework using autoencoders for both online and offline recognition. In the online scenario, activities were recognized by minimizing reconstruction error, whereas in the offline scenario, segmented data were classified using a separate classifier. The method achieved high accuracy with relatively low computation time, but its complexity increased substantially when applied to large-scale datasets.
Anagnostis et al. [4] developed an RNN-based HAR model for agriculture-oriented human–robot interaction. Sensor data were collected and denoised, and temporal dependencies were modeled using Long Short-Term Memory (LSTM) networks. Performance was improved by concatenating features obtained from individual participants. However, the method’s performance tended to degrade as the temporal window width increased.
Tasnim et al. [5] proposed a spatiotemporal deep learning-based HAR method using 3D skeletal joint data. Feature concatenation was employed prior to classification to improve recognition accuracy; nevertheless, weight decay and slow convergence adversely affected performance.
CNN-based HAR for embedded applications was investigated by Xu et al. [9] using a stochastic gradient-based optimizer with a learning-rate schedule that gradually decreased over training. Multiple sensors were used to record activity signals, which were then processed by a CNN for recognition. Although satisfactory accuracy was reported, the method did not explicitly incorporate a feature selection mechanism, which may increase computational complexity.
Ma et al. [10] developed two architectures to capture spatiotemporal information: (1) a Temporal Segment RNN (TS-LSTM) and (2) a Temporal ConvNet in an Inception-style framework. Using spatiotemporal feature matrices, both LSTM-based RNNs and Temporal ConvNets leveraged spatiotemporal dynamics and improved performance. Their models achieved accuracies of 94.1% and 69.0% on the UCF101 and HMDB51 datasets, respectively. However, some classes still exhibited low prediction accuracy (around 30%), despite reduced computation time.
In 2024, Jandhyam, Rengaswamy, and Satyala [11] proposed an optimized Deep LSTM-based framework for human action recognition. Their approach integrates handcrafted feature extraction with Deep LSTM classification and employs the IIWBPR algorithm to improve optimization. Experimental results on the UCF101 dataset achieved a recognition accuracy of 92.3%. However, evaluation on a single dataset may limit the generalizability of the reported results.
Weng et al. [12] introduced WLSTM with Saliency-aware Motion Enhancement (SME) for video activity prediction. The method combines motion segmentation, dynamic contrast segmentation, and saliency-aware enhancement to suppress background noise, and includes a long-range attention mechanism to capture long-term dependencies by focusing on semantically relevant frames. Experiments on UT-Interaction and sub-JHMDB showed that the approach outperformed several State-of-the-Art methods; however, under low observational ratios, classes with similar short-term motion patterns remained prone to misclassification.
Aghaei et al. [13] proposed a framework combining a ResNet feature extractor, Conv-Attention-LSTM, Bidirectional LSTM (BiLSTM), and fully connected layers. A sparse layer was added after each LSTM layer to mitigate premature convergence. In addition to RGB frames, optical flow was used to encode motion. Video sequences were segmented based on inter-frame similarity, and optical flow was computed between successive frames. The method achieved accuracies of 95.24% and 71.62% on UCF101 and HMDB51, respectively; nonetheless, confusion remained high for visually similar classes.
Saoudi, Jaafari, and Andaloussi [14] proposed a hybrid framework for human action recognition that combines a 3D CNN with an attention-based LSTM to capture both spatiotemporal features and long-range temporal dependencies in video sequences. The model was evaluated on the UCF101 and HMDB51 benchmark datasets, demonstrating competitive performance and emphasizing the effectiveness of integrating attention mechanisms with deep sequential models for action recognition.
Pandey and Muppalaneni [15] selected video clips from two extreme classes (alert and sleepy) to develop a drowsiness detection system. Two models were designed: a temporal model based on handcrafted temporal features processed by an LSTM, and a spatial model in which CNN features were processed by an LSTM. Although the temporal model was more complex and slightly less accurate, it exhibited favorable training-time behavior in the confusion-matrix and AUC–ROC analysis.
Guo and Chen [16] combined a ResNet-based residual network with an LSTM to recognize dynamic facial expressions from video sequences. Spatial features were extracted by the CNN and temporal information was modeled by the LSTM. The method outperformed static single-frame recognition but was sensitive to pose, resolution, and input size, which affected both accuracy and efficiency.
Ahmad et al. [17] developed a CNN–LSTM architecture for recognizing physical activities from wearable sensor signals (accelerometer and gyroscope). A multi-branch CNN processed signals from different sensors in parallel; the extracted feature vectors were concatenated and fed into an LSTM followed by a dense layer. The approach outperformed traditional classifiers and several deep baselines.
Alom et al. [18] proposed an Improved Inception-Residual CNN (IRCNN) that integrates recurrent convolutional layers into an Inception-style network for activity and object recognition. The recurrent connections improved contextual modeling and training/testing accuracy. The model was evaluated on multiple benchmark datasets and achieved notable gains over several deep CNN baselines.
Xu et al. [19] introduced InnoHAR, which combines Inception-like convolutional modules and an RNN to model complex HAR. Multi-channel sensor waveforms were processed by Inception modules to extract multi-dimensional features, and Gated Recurrent Units (GRUs) modeled temporal dependencies. Experiments on three public HAR datasets showed improved accuracy and generalization relative to prior approaches.
Xia et al. [20] proposed a deep HAR method that combines convolutional layers with LSTM units. Global Average Pooling (GAP) was used to reduce parameters, and Batch Normalization (BN) was applied to accelerate convergence. The model achieved accuracies of 95.78%, 95.85%, and 92.63% on UCI-HAR, WISDM, and OPPORTUNITY, respectively, but its complexity remained relatively high.
Mustafa et al. [21] compared Faster R-CNN Inception-v2 and YOLOv3 for video-based behavior recognition. Both models were fine-tuned on the UCF-ARG dataset, and YOLOv3 consistently outperformed Faster R-CNN Inception-v2.
Dua et al. [22] proposed a wearable sensor-based HAR system using a CNN-based architecture to extract informative features. The classifier captured both spatial and temporal patterns and achieved high accuracy; however, optimization strategies were not explored, leaving scope for improved recognition performance.
Nafea et al. [23] presented a spatiotemporal HAR framework using wireless sensors in which CNN and LSTM components were combined for activity recognition. The approach achieved high precision by selecting informative features, although further gains in accuracy remained possible.
Finally, Xu et al. [9] also investigated deep feature extraction for HAR using an iOS application, demonstrating real-time classification on mobile devices. However, the approach was not evaluated on large-scale benchmark datasets, limiting direct comparison with State-of-the-Art methods.
Overall, prior studies indicate that combining CNN-based feature extraction with temporal modeling (RNN/LSTM/GRU), often supported by transfer learning, can be effective for HAR. Nonetheless, challenges remain in optimization (e.g., convergence behavior and local minima), computational efficiency, and robustness on complex real-world datasets. Metaheuristic optimization has been explored in several contexts, although the resulting improvements are usually better interpreted in terms of stability and robustness rather than a drastic escape from severe local minima. At the same time, applying population-based metaheuristics directly to the full weight space of deep networks can be computationally prohibitive, motivating hybrid strategies that restrict the search to a low-dimensional subset of parameters.
The present work addresses these challenges by combining keyframe-based processing with HOOF and ResNet-101 feature extraction, and by introducing a Tricky Predator optimization algorithm to fine-tune an Incept-LSTM classifier for surveillance videos from the UCF-Crime dataset.

3. Materials and Methods

In this section, we describe the overall architecture of the proposed HAR model and its main components. Input videos from the UCF-Crime dataset [24] are processed to extract keyframes. Frames containing distinctive and informative content are selected to reduce redundancy and computational cost. For each selected keyframe, features are extracted using Histogram of Oriented Optical Flow (HOOF) and ResNet-101, and then concatenated into a single feature vector. This combined feature vector is provided as input to the Inception-based LSTM (Incept-LSTM) classifier for activity recognition. Each stage of the model is described in the following subsections.

3.1. Proposed Methodology for Human Activity Recognition Using an Optimized Incept-LSTM Classifier

Human Activity Recognition (HAR) is widely used in many domains to analyze and predict human movements. Despite extensive research, accurate recognition in realistic environments remains challenging. To address this, we propose an automatic HAR framework that combines deep learning with metaheuristic optimization.
The proposed Tricky Predator optimization-based Incept-LSTM model takes as input surveillance videos from the UCF-Crime dataset [24]. First, keyframes are extracted to retain frames with distinctive visual content. Next, discriminative features are computed from these keyframes using HOOF and ResNet-101 to reduce the computational load of downstream recognition. The resulting feature sets are concatenated into a single feature vector, which is then fed into the classifier for activity recognition.
The Incept-LSTM network serves as the classifier and is fine-tuned using the proposed Tricky Predator optimization algorithm (TricP). TricP is designed by combining the social behavior of Latrans with the tricky poaching behavior of predators. It is used as an outer-loop optimizer to improve the selected optimization variables described in Section 3.5, thereby enhancing recognition performance. A schematic diagram of the proposed methodology is shown in Figure 1.

3.2. Data Set

The UCF-Crime dataset [24] is used to evaluate the proposed technique. It contains approximately 128 h of video, including 1900 long, untrimmed surveillance videos. Thirteen anomalous activity categories are included, along with normal activities; therefore, the dataset is suitable for real-world anomaly detection in video surveillance.
An input video from UCF-Crime is converted into a sequence of frames. From this sequence, frames with distinctive content are selected and referred to as keyframes. Let the input video G comprise r frames, which can be expressed as
$G = \{ G_l \}, \quad 1 \le l \le r$
Each frame $G_l$ has spatial dimensions $[V_1 \times V_2]$, from which distinctive features are extracted to simplify the subsequent computations.

3.3. Extraction of Distinctive Video Frames from the Dataset

Distinctive frames are selected by measuring the Euclidean distance between successive frames based on pixel-wise intensity differences. Specifically, the distance between two consecutive frames F t and F t + 1 is defined as
$D(F_t, F_{t+1}) = \sqrt{ \sum_{i=1}^{H} \sum_{j=1}^{W} \left( F_t(i,j) - F_{t+1}(i,j) \right)^2 }$
where $H$ and $W$ are the frame height and width (numbers of rows and columns of pixels), $i \in \{1, \ldots, H\}$ and $j \in \{1, \ldots, W\}$ are the row and column indices, and $F_t(i,j)$ is the pixel intensity at location $(i,j)$ in frame $t$.
The set of distinctive frames detected from the full sequence is expressed as
$\mathrm{Distinctive\ frame}_t, \quad 0 \le t \le a$
where a denotes the total number of distinctive frames. The value of a is significantly smaller than the total number of frames, which reduces redundancy and computational cost.
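For illustration, the following Python/NumPy sketch shows one way to implement this distance-based keyframe selection. The fixed threshold and the comparison against the most recently kept frame are assumptions made for the example; the paper specifies the distance measure but not this exact selection rule, and its experiments were run in MATLAB.

```python
import numpy as np

def select_keyframes(frames, threshold):
    """Select distinctive frames whose Euclidean distance to the previously
    kept frame exceeds a threshold (illustrative criterion only)."""
    keyframes = [frames[0]]                      # always keep the first frame
    for frame in frames[1:]:
        prev = keyframes[-1].astype(np.float64)
        curr = frame.astype(np.float64)
        # Pixel-wise Euclidean distance between the kept frame and the current frame
        dist = np.sqrt(np.sum((prev - curr) ** 2))
        if dist > threshold:                     # distinctive enough -> keep
            keyframes.append(frame)
    return keyframes

# Usage with synthetic grayscale frames of size 240x320
rng = np.random.default_rng(0)
frames = [rng.integers(0, 256, size=(240, 320), dtype=np.uint8) for _ in range(10)]
keys = select_keyframes(frames, threshold=1000.0)
print(f"kept {len(keys)} of {len(frames)} frames")
```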

3.4. Feature Extraction

For each preprocessed video, two types of features are extracted from the selected keyframes to limit the computational complexity of the recognition network:
Histogram of Optical Flow (HOOF) features [25], and
ResNet-101 deep features [26].
These two feature sets are described below.

3.4.1. Histogram of Optical Flow (HOOF)

HOOF features capture motion patterns by summarizing optical-flow orientations weighted by their magnitudes. For each pixel in a frame, optical flow is computed between consecutive frames. The optical-flow vector at a pixel is represented as,
$x = [\, x_u \ \ x_v \,]$
where x u and x v denote the horizontal and vertical components of the optical flow, respectively. The orientation of the flow vectors ranges between −180° and 180°, and the magnitude is evaluated as
$\mathrm{Magnitude} = \sqrt{ x_u^2 + x_v^2 }$
Let $\mathrm{Mag}_j$ and $\mathrm{Ori}_j$ denote the magnitude and orientation at pixel location $j$, respectively. Orientations are quantized into $n$ bins, and each bin accumulates the magnitudes of flow vectors whose orientations fall into that bin. The motion histogram computed over a window $P$ is expressed as
$A(P) = ( a_1, a_2, \ldots, a_n )$
where $n$ is the number of orientation bins. The normalized HOOF features are represented as
$\hat{a}_i = \dfrac{a_i}{\sum_{g=1}^{n} a_g}$
$a_i = \sum_{j : \mathrm{Ori}_j \in \mathrm{bin}_i} \mathrm{Mag}_j$
These computations are performed for all pixels in each keyframe. The resulting histograms are concatenated to form the final HOOF feature vector.
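A minimal sketch of this HOOF computation is given below. It assumes OpenCV's Farneback dense optical flow and eight orientation bins; both choices are illustrative assumptions rather than the configuration used in the paper.

```python
import numpy as np
import cv2  # OpenCV is assumed to be available for dense optical flow

def hoof(prev_gray, next_gray, n_bins=8):
    """Histogram of Oriented Optical Flow between two consecutive grayscale
    frames: orientations are quantized into n_bins and weighted by magnitude."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    xu, xv = flow[..., 0], flow[..., 1]
    magnitude = np.sqrt(xu ** 2 + xv ** 2)
    orientation = np.degrees(np.arctan2(xv, xu))            # in [-180, 180]
    # Accumulate magnitudes into orientation bins
    hist, _ = np.histogram(orientation, bins=n_bins,
                           range=(-180.0, 180.0), weights=magnitude)
    total = hist.sum()
    return hist / total if total > 0 else hist               # L1-normalized HOOF

# Usage with two synthetic frames
rng = np.random.default_rng(1)
f1 = rng.integers(0, 256, size=(240, 320), dtype=np.uint8)
f2 = rng.integers(0, 256, size=(240, 320), dtype=np.uint8)
print(hoof(f1, f2, n_bins=8))
```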

3.4.2. ResNet-101

ResNet is a deep neural network architecture based on residual blocks with skip connections, enabling the extraction of rich feature representations. ResNet-101 is a CNN with 101 layers; its architecture is illustrated in Figure 2.
Let the output of a residual block without a skip connection be expressed as,
$E(y) = f(wy + e)$
where f ( ) denotes the activation function, w is the weight, and e is the bias term. After including the skip connection, the output is expressed as,
$E(y) = f(wy + e) + y$
If there is a dimensional mismatch between input and output, the skip connection can be implemented by zero-padding or by applying a 1 × 1 convolution to match dimensions.

3.4.3. Feature Concatenation

The extracted HOOF and ResNet-101 features are combined to form a single feature vector:
$f_{\mathrm{combined}} = ( f_{\mathrm{HOOF}}, \ f_{\mathrm{ResNet}} )$
This combined vector captures both motion (HOOF) and appearance (ResNet-101) information and is provided as input to the Incept-LSTM classifier. Prior to concatenation, the ResNet-101 feature vector is passed through a dense projection layer to reduce its dimensionality to be comparable with the HOOF representation. Both feature sets are normalized (zero mean and unit variance) before concatenation. This design prevents dominance of either feature type and enables the classifier to jointly exploit motion and appearance cues.
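The sketch below illustrates the projection, normalization, and concatenation steps. The dimensions (a 2048-dimensional ResNet-101 feature projected to 64 dimensions, 8 HOOF bins) and the per-vector standardization are assumptions for brevity; in practice, normalization statistics would be computed over the training set.

```python
import numpy as np

def combine_features(f_hoof, f_resnet, w_proj, b_proj):
    """Project the ResNet-101 feature to a lower dimension, z-score normalize
    both descriptors, and concatenate them into a single vector."""
    f_proj = w_proj @ f_resnet + b_proj              # dense projection layer
    def zscore(v):
        std = v.std()
        return (v - v.mean()) / std if std > 0 else v - v.mean()
    return np.concatenate([zscore(f_hoof), zscore(f_proj)])

# Usage: 2048-d ResNet-101 pooled feature projected to 64 dims, 8-bin HOOF
rng = np.random.default_rng(2)
f_resnet = rng.standard_normal(2048)
f_hoof = rng.random(8)
w_proj = rng.standard_normal((64, 2048)) * 0.01
b_proj = np.zeros(64)
f_combined = combine_features(f_hoof, f_resnet, w_proj, b_proj)
print(f_combined.shape)   # (72,)
```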

3.5. Human Action Recognition System

The proposed HAR system is based on an Incept-LSTM classifier trained using a hybrid strategy. Gradient-based training is used to learn the main network weights, and TricP is applied to fine-tune a restricted subset of parameters (the final classification layer and selected scalar training parameters), thereby avoiding the curse of dimensionality associated with full-weight metaheuristic optimization.

Inception LSTM

The conventional Long Short-Term Memory (LSTM) network maintains an internal cell state (memory) updated over time. It comprises three gates—input, forget, and output gates—and may include peephole connections. The input gate controls how much of the current input is stored, the forget gate controls how much of the previous state is retained, and the output gate determines the output based on the internal state.
Inception-based LSTM (Incept-LSTM) [27] incorporates convolutional filters with different kernel sizes into all LSTM gates. In the proposed HAR system, three kernel sizes are used: $1 \times 1$, $3 \times 3$, and $5 \times 5$. The block diagram of the Incept-LSTM architecture is shown in Figure 3.
In the Incept-LSTM, the standard LSTM equations are adapted as follows. The input gate is given by
$r_i = \delta\!\left( p_i S_{lp} + h_{i-1} S_{lh} + R_{\mathrm{input}} \right)$
where R denotes the bias term, h i 1 is the output of the previous time step, p i is the current input, and S represents the weight matrices.
The candidate state is computed using the hyperbolic tangent function, considering both the current input and the previous layer’s weighted sum output:
$l_i = \tanh\!\left( p_i S_{lp} + h_{i-1} S_{lh} + R_{\mathrm{input\ node}} \right)$
The internal state q i is updated in a linear manner based on the activation function and is given by
$q_i = r_i \,\Theta\, l_i + q_{i-1}$
where Θ denotes element-wise multiplication, and q i 1 is the previous internal state.
Using the forget gate, the internal cell state is initialized or updated as
$u_i = \delta\!\left( p_i S_{ph} + h_{i-1} S_{uh} + R_{\mathrm{forget}} \right)$
The output gate is computed as
$H_i = \delta\!\left( p_i S_{pe} + h_{i-1} S_{he} + R_{\mathrm{output\ gate}} \right)$
The output at the memory cell is then expressed as
$h_i = \tanh( q_i ) \,\Theta\, H_i$
with the internal state update
$q_i = l_i \,\Theta\, r_i + q_{i-1} \,\Theta\, u_i$
For the Incept-LSTM, different convolution kernels are applied before the gates of the LSTM, and the gate activations are expressed as,
$r_i = \delta\!\left( S_r^{1 \times 1} \ast [p_i, h_{i-1}] + S_r^{3 \times 3} \ast [p_i, h_{i-1}] + S_r^{5 \times 5} \ast [p_i, h_{i-1}] \right)$
$l_i = \delta\!\left( S_l^{1 \times 1} \ast [p_i, h_{i-1}] + S_l^{3 \times 3} \ast [p_i, h_{i-1}] + S_l^{5 \times 5} \ast [p_i, h_{i-1}] \right)$
$u_i = \delta\!\left( S_u^{1 \times 1} \ast [p_i, h_{i-1}] + S_u^{3 \times 3} \ast [p_i, h_{i-1}] + S_u^{5 \times 5} \ast [p_i, h_{i-1}] \right)$
$H_i = \delta\!\left( S_H^{1 \times 1} \ast [p_i, h_{i-1}] + S_H^{3 \times 3} \ast [p_i, h_{i-1}] + S_H^{5 \times 5} \ast [p_i, h_{i-1}] \right)$
The internal state and output are then updated as
$q_i = r_i \,\Theta\, l_i + q_{i-1} \,\Theta\, u_i$
$h_i = \tanh( q_i ) \,\Theta\, H_i$
In this work, TricP is not applied to optimize the full Incept-LSTM weight space. Instead, TricP is used as a hybrid optimizer that fine-tunes a restricted subset of parameters, namely: (i) the final classification layer parameters (weights and biases), and (ii) a small set of scalar training hyperparameters, specifically the learning rate and dropout rate. All remaining Incept-LSTM parameters are learned using standard gradient-based training. This restricted scope avoids the curse of dimensionality associated with population-based, gradient-free optimization in very high-dimensional parameter spaces, while still enabling TricP to improve generalization-related metrics (e.g., specificity and sensitivity).
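As a rough illustration of the gate computations above, the following PyTorch sketch implements an Inception-style LSTM cell in which each gate sums the responses of 1-D convolutions with kernel sizes 1, 3, and 5 applied to the concatenated input and previous hidden state. The use of 1-D convolutions, the summation of the branches, and all dimensions are assumptions for the example and do not reproduce the exact architecture of Figure 3.

```python
import torch
import torch.nn as nn

class InceptLSTMCell(nn.Module):
    """Minimal Inception-style LSTM cell: each gate sums the responses of
    kernel-size-1, -3, and -5 1-D convolutions applied to [input, previous hidden]."""
    def __init__(self, in_ch, hid_ch):
        super().__init__()
        def branch(k):
            # padding keeps the sequence length unchanged
            return nn.Conv1d(in_ch + hid_ch, 4 * hid_ch, kernel_size=k, padding=k // 2)
        self.branches = nn.ModuleList([branch(k) for k in (1, 3, 5)])

    def forward(self, p, h_prev, q_prev):
        z = torch.cat([p, h_prev], dim=1)                 # [p_i, h_{i-1}]
        gates = sum(b(z) for b in self.branches)          # combine the three kernels
        r, l, u, H = torch.chunk(gates, 4, dim=1)
        r, u, H = torch.sigmoid(r), torch.sigmoid(u), torch.sigmoid(H)
        l = torch.tanh(l)
        q = r * l + q_prev * u                            # internal state update
        h = torch.tanh(q) * H                             # output of the memory cell
        return h, q

# Usage: batch of 4 sequences, 72-d features, length 16, hidden width 32
cell = InceptLSTMCell(in_ch=72, hid_ch=32)
p = torch.randn(4, 72, 16)
h = torch.zeros(4, 32, 16)
q = torch.zeros(4, 32, 16)
h, q = cell(p, h, q)
print(h.shape)  # torch.Size([4, 32, 16])
```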

4. TricP: Proposed Tricky Predator—An Optimization Algorithm

The proposed Tricky Predator optimization algorithm, denoted TricP, is inspired by the social conditions of Latrans [28] and the tricky poaching behavior of predators [29]. By fusing these behavioral models, TricP is designed to solve optimization problems through a balance between exploration (searching new regions) and exploitation (refining candidate solutions in promising regions).

4.1. Proposed Tricky Predator Optimization Algorithm

TricP combines two main concepts:
  • Predator poaching behavior: Predators live in packs and hunt wild and domestic animals using stealthy crawling behavior and intelligent movement patterns. A dominant predator couple leads and protects the territory. Offspring may inherit the parents’ territory or migrate to establish a new territory upon maturity.
  • Latrans social behavior: Latrans (coyotes) are social predators that follow dominance rules and cultural tendencies [28]. These social rules support herd maintenance and promote a balance between exploration and exploitation.
In TricP, local search is modeled as poaching within the current habitat, while global search is modeled by predators moving between forest and open land in search of prey. Interactions among predators, prey, and herd members are used to construct the optimization mechanism.
To maintain population size and diversity, two mechanisms are incorporated:
  • A predator may leave its territory to form a new herd or be removed (e.g., killed by a poacher).
  • Reproduction is modeled by replacing weak predators with new offspring at each iteration, which helps retain diversity and refine the best solution found so far.
By integrating Latrans social conditions (dominance rules and cultural tendencies) with predator search behavior, TricP supports herd maintenance and enhances the probability of identifying high-quality solutions in complex search spaces.

4.2. Mathematical Modeling

4.2.1. Solution Encoding and Search-Space Dimension

In our hybrid optimization scheme, each TricP agent represents a candidate solution vector x R m that encodes only the parameters optimized by TricP. Specifically, TricP optimizes: (i) the final classification layer parameters of the Incept-LSTM (weights and biases), and (ii) a small set of scalar training hyperparameters (learning rate and dropout rate). All remaining Incept-LSTM parameters are learned via gradient-based training and are not included in x , which keeps the search space low-dimensional and avoids the curse of dimensionality associated with population-based, gradient-free optimization.
Let $d$ denote the input feature dimension to the final classification layer and $C$ denote the number of classes. The final-layer weights $W_{fc} \in \mathbb{R}^{d \times C}$ and biases $b_{fc} \in \mathbb{R}^{C}$ contribute $d \times C + C$ decision variables. In addition, TricP optimizes $k$ scalar hyperparameters; in our implementation, $k = 2$ (learning rate and dropout rate). Therefore, the search-space dimension is
$m = d \times C + C + k$
For the UCF-Crime dataset used in this work, C = 14 .
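A small sketch of this encoding is shown below, using a hypothetical final-layer input dimension of $d = 128$. It simply packs and unpacks the final-layer weights, biases, and the two scalar hyperparameters into a single decision vector.

```python
import numpy as np

def pack(W_fc, b_fc, lr, dropout):
    """Flatten the final-layer weights/biases and the two scalar
    hyperparameters into a single TricP decision vector x in R^m."""
    return np.concatenate([W_fc.ravel(), b_fc.ravel(), [lr, dropout]])

def unpack(x, d, C):
    """Recover W_fc (d x C), b_fc (C), and the two hyperparameters from x."""
    W_fc = x[:d * C].reshape(d, C)
    b_fc = x[d * C:d * C + C]
    lr, dropout = x[-2], x[-1]
    return W_fc, b_fc, lr, dropout

# Example: d = 128 inputs to the final layer (hypothetical), C = 14 classes, k = 2
d, C = 128, 14
m = d * C + C + 2
x = np.zeros(m)
W_fc, b_fc, lr, dropout = unpack(x, d, C)
print(m, W_fc.shape, b_fc.shape)   # 1808 (128, 14) (14,)
```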

4.2.2. Population Representation and Objective Function

Let $x_\tau^p \in \mathbb{R}^m$ denote the position (candidate solution) of the $p$-th predator (agent) at iteration $\tau$, where $p = 1, 2, \ldots, N$. The objective (fitness) function is denoted by $f(\cdot)$, and the current best solution is
$x_\tau^* = \arg\max_{p \in \{1, \ldots, N\}} f( x_\tau^p )$
with best fitness value $f( x_\tau^* )$.

4.2.3. Fitness Evaluation

The fitness of each predator is evaluated using three HAR performance metrics: accuracy, sensitivity, and specificity. The fitness function is defined as
$\mathrm{Fitness} = \dfrac{ \mathrm{Acc} + \mathrm{Sen} + \mathrm{Spe} }{3}$
Accuracy, sensitivity, and specificity are computed as
$\mathrm{Acc} = \dfrac{TP + TN}{TP + TN + FP + FN}$
$\mathrm{Sen} = \dfrac{TP}{TP + FN}$
$\mathrm{Spe} = \dfrac{TN}{TN + FP}$
Here, T P , T N , F P , and F N denote true positives, true negatives, false positives, and false negatives, respectively.
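The fitness computation can be expressed compactly as below. For the multi-class UCF-Crime setting, $TP$, $TN$, $FP$, and $FN$ would be obtained by one-vs-rest aggregation; the counts used in the example are hypothetical.

```python
import numpy as np

def fitness(tp, tn, fp, fn):
    """Average of accuracy, sensitivity, and specificity used to rank agents."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)
    spe = tn / (tn + fp)
    return (acc + sen + spe) / 3.0

# Example with hypothetical validation counts
print(round(fitness(tp=450, tn=1200, fp=40, fn=35), 4))
```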

4.3. Exploration Phase

When food is unavailable in the local habitat, predators move to distant regions in search of prey. Information about food availability is shared among herd members, and the current best predator $x_\tau^*$ is selected based on fitness. Herd members are then moved toward the best predator. The Euclidean distance between a predator and the best position is computed as
$e( x_\tau^p ) = \left\| x_\tau^p - x_\tau^* \right\|_2$
The position of each predator is updated using
$x_{\tau+1}^p = x_\tau^p + \beta \cdot \mathrm{sign}\!\left( x_\tau^* - x_\tau^p \right)$
where $\beta \in \left( 0, \, e( x_\tau^p ) \right)$ is a random step-size parameter drawn at each iteration. If the new position yields better fitness, the predator remains at the new position; otherwise, it reverts to the previous position.
After completing exploration, predators return to their home territory. Since exploration may be risky, killing and reproduction mechanisms are applied to maintain population size, as described in Section 4.5.
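A minimal NumPy sketch of this exploration update, including the greedy acceptance rule, is shown below on a toy maximization problem. The bound on the random step follows the definition of $\beta$ above; the toy fitness function and dimensionality are assumptions.

```python
import numpy as np

def exploration_step(x, x_best, fit, rng):
    """Move each agent toward the current best with a random step bounded by
    its distance to the best; keep the move only if fitness improves."""
    x_new = x.copy()
    for p in range(x.shape[0]):
        e = np.linalg.norm(x[p] - x_best)               # distance to best agent
        beta = rng.uniform(0.0, e) if e > 0 else 0.0    # random step size
        candidate = x[p] + beta * np.sign(x_best - x[p])
        if fit(candidate) > fit(x[p]):                  # greedy acceptance
            x_new[p] = candidate
    return x_new

# Usage on a toy 5-dimensional problem (maximize the negative sphere function)
rng = np.random.default_rng(3)
fit = lambda v: -np.sum(v ** 2)
X = rng.standard_normal((6, 5))                         # 6 agents, m = 5
best = X[np.argmax([fit(v) for v in X])]
X = exploration_step(X, best, fit, rng)
```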

4.4. Exploitation Phase

In the exploitation phase, predators search for prey within the habitat. The predator may circle the prey slowly before attacking. The movement mode is controlled by a random number γ ( 0 ,   1 ) :
$\mathrm{Mode} = \begin{cases} \text{Closely Move}, & \text{if } \gamma > 0.75 \\ \text{Never Move}, & \text{if } \gamma \le 0.75 \end{cases}$
While circling, the radius is evaluated using an observation angle $\alpha_0 \in (0, 2\pi)$ and a scaling parameter $u \in (0, 0.2)$:
$b = \begin{cases} u \cdot \dfrac{\sin \alpha_0}{\alpha_0}, & \text{if } \alpha_0 \ne 0 \\ \varepsilon, & \text{if } \alpha_0 = 0 \end{cases}$
where $\varepsilon \in (0, 1)$ is a random value.
The position update in m -dimensional space is written compactly as
$x_{\tau+1}^p(j) = x_\tau^p(j) + u \cdot b \cdot \sin \alpha_j, \qquad j = 1, 2, \ldots, m$
where $\alpha_j \in (0, 2\pi)$ are randomly selected angles.
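The exploitation move can be sketched as follows; the branch probabilities, the radius term, and the coordinate-wise perturbation follow the equations above, while the toy dimensionality is an assumption.

```python
import numpy as np

def exploitation_step(x_p, rng):
    """Local circling move around the current position: each coordinate is
    perturbed by u * b * sin(alpha_j), with a radius term derived from alpha_0."""
    gamma = rng.uniform(0.0, 1.0)
    if gamma <= 0.75:                       # "never move" branch
        return x_p.copy()
    alpha0 = rng.uniform(0.0, 2.0 * np.pi)
    u = rng.uniform(0.0, 0.2)
    b = u * np.sin(alpha0) / alpha0 if alpha0 != 0 else rng.uniform(0.0, 1.0)
    alphas = rng.uniform(0.0, 2.0 * np.pi, size=x_p.shape)
    return x_p + u * b * np.sin(alphas)     # coordinate-wise circling update

# Usage
rng = np.random.default_rng(4)
x_p = rng.standard_normal(5)
print(exploitation_step(x_p, rng))
```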

4.5. Latrans Social Update and Combined Rule

To enhance survival and maintain a balance between exploration and exploitation, Latrans social conditions are incorporated via dominance rules. Information discovered during exploration is shared among herd members according to Latrans behavior. The Latrans-based update for a coordinate can be written as
$x_{\tau+1}^p(j) = x_\tau^p(j) + r_1 \zeta_1(j) + r_2 \zeta_2(j)$
where $r_1, r_2 \in (0, 1)$ are random numbers, $\zeta_1$ denotes the leader position component, and $\zeta_2$ denotes the cultural tendency component (e.g., a median-based measure).
Following [30], the predator and Latrans updates are combined as
$x_{\tau+1}^p = 0.5\, x_{\tau+1, \mathrm{predator}}^p + 0.5\, x_{\tau+1, \mathrm{Latrans}}^p$
This combined rule supports diversification (searching new regions) and intensification (refining around promising solutions).

4.6. Breeding and Leaving (Population Maintenance)

In the natural habitat, some predators are removed (e.g., killed by poachers), some die, and others migrate; only a subset survives. The surviving predators reproduce and generate offspring. In TricP, the reproducing predators are the alpha couple, i.e., the top two predators based on fitness, denoted $x_\tau^1$ and $x_\tau^2$.
The herd center is computed as
$x_\tau^c = \dfrac{ x_\tau^1 + x_\tau^2 }{2}$
The distance between the alpha couple is computed as
$d_\tau^c = \left\| x_\tau^1 - x_\tau^2 \right\|_2$
A random parameter $n \in (0, 1)$ is used to decide whether a nomadic individual is introduced or the alpha couple reproduces:
$\mathrm{Replacement} = \begin{cases} \text{New nomadic individual}, & n \ge 0.45 \\ \text{Alpha couple reproduce}, & n < 0.45 \end{cases}$
The reproduced predator is generated as
$x_\tau^{\mathrm{rep}} = \dfrac{ x_\tau^1 + x_\tau^2 }{2}$
Weak predators are removed and replaced with newly generated predators (or nomadic individuals), maintaining population size and diversity.
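The population-maintenance step can be sketched as below, replacing the worst agent either with a random nomadic individual or with the mean of the alpha couple according to the 0.45 threshold; the search bounds are an assumption for the example.

```python
import numpy as np

def maintain_population(X, fit, rng, bounds=(-1.0, 1.0)):
    """Replace the worst agent either with a new nomadic individual or with the
    mean of the two best agents (the alpha couple), depending on a random draw."""
    order = np.argsort([fit(v) for v in X])[::-1]        # best first
    alpha1, alpha2 = X[order[0]], X[order[1]]
    worst = order[-1]
    n = rng.uniform(0.0, 1.0)
    if n >= 0.45:
        X[worst] = rng.uniform(bounds[0], bounds[1], size=X.shape[1])  # nomad
    else:
        X[worst] = (alpha1 + alpha2) / 2.0               # reproduced offspring
    return X

# Usage
rng = np.random.default_rng(5)
fit = lambda v: -np.sum(v ** 2)
X = rng.uniform(-1, 1, size=(6, 5))
X = maintain_population(X, fit, rng)
```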

4.7. Termination and Pseudo-Code

The exploration, exploitation, social update, and reproduction processes are repeated until a stopping criterion is satisfied or the maximum number of iterations τ m a x is reached. The pseudo-code of TricP is provided in Algorithm 1.
Algorithm 1. Pseudo-code for the proposed Tricky Predator Optimization algorithm.
Begin
  1. Initialize the predator population $\{ x_0^p \}_{p=1}^{N}$ and parameters (e.g., $\alpha_0$), and set the maximum iteration $\tau_{\max}$.
  2. While $\tau < \tau_{\max}$ do
    a. For each predator $p = 1, \ldots, N$, evaluate fitness using (26).
    b. Sort predators by fitness; select the best predator $x_\tau^*$.
    c. For each predator, update its position according to the exploration and exploitation rules (30)–(36).
    d. Sort predators by updated fitness.
    e. Remove the worst predators (simulate hunting).
    f. Generate and include new predators using (40) (and the rule in (39)).
    g. Update iteration counter: $\tau = \tau + 1$.
  3. End while
  4. Return the best solution $x^*$.
End
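Tying the phases together, the following sketch mirrors Algorithm 1. It assumes the exploration_step, exploitation_step, and maintain_population helpers sketched in Sections 4.3, 4.4, and 4.6 are in scope, and it approximates the Latrans social update by attraction toward the leader and the population median before applying the combined rule; these simplifications are assumptions of the example, not the paper's exact implementation.

```python
import numpy as np

def tricp(fit, m, N=20, tau_max=30, bounds=(-1.0, 1.0), seed=0):
    """Outer TricP loop following Algorithm 1: exploration toward the best agent,
    local exploitation, a simplified Latrans social update, the combined rule,
    and population maintenance at each iteration."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(bounds[0], bounds[1], size=(N, m))   # initial population
    for _ in range(tau_max):
        f = np.array([fit(v) for v in X])
        best = X[np.argmax(f)].copy()                    # current best predator
        # Predator moves: exploration toward the best, then local exploitation
        X_pred = exploration_step(X, best, fit, rng)
        X_pred = np.array([exploitation_step(v, rng) for v in X_pred])
        # Latrans social update: leader attraction + cultural (median) tendency
        median = np.median(X, axis=0)
        r1, r2 = rng.uniform(size=(2, N, 1))
        X_lat = X + r1 * (best - X) + r2 * (median - X)
        # Combined rule and population maintenance
        X = 0.5 * X_pred + 0.5 * X_lat
        X = maintain_population(X, fit, rng, bounds)
    f = np.array([fit(v) for v in X])
    return X[np.argmax(f)]

# Usage on a toy maximization problem
best = tricp(lambda v: -np.sum(v ** 2), m=5)
print(np.round(best, 3))
```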

5. Experimentation and Results

In this section, the proposed method is evaluated on the UCF-Crime dataset using accuracy, sensitivity, and specificity as performance metrics.

5.1. Experimental Setup

All experiments were conducted in MATLAB 24.1 using the proposed Tricky Predator-based Incept-LSTM framework for Human Activity Recognition (HAR).
Dataset and metrics. The UCF-Crime dataset [24] was used to assess the proposed technique. It contains approximately 128 h of video and 1900 long, untrimmed surveillance videos spanning 13 anomalous activity categories, in addition to normal activities. Owing to its scale and complexity, UCF-Crime is suitable for evaluating anomaly detection in real-world surveillance scenarios. Performance was evaluated using accuracy, sensitivity, and specificity, defined in Equations (27)–(29).
Baseline methods. To benchmark effectiveness, we compared the proposed method with several existing HAR approaches reported in the literature: EKVN [3], Stochastic-CNN [9], iSPLInception [2], and DeepCNN-BiLSTM [23]. All methods were evaluated under comparable conditions on the UCF-Crime dataset.

5.2. Implementation Details

All quantitative results reported in the tables were obtained using a fixed Tricky Predator (TricP) population size of N = 20 . Population sizes N { 10 ,   20 ,   30 ,   40 ,   50 } were additionally considered only for the sensitivity analysis reported in Figure 4; these sensitivity settings are not the configuration used for the timing results in Table 1 or the main tabulated comparisons.
Because TricP optimizes a restricted, low-dimensional vector x R m (Section 4.2), rather than the full deep-network weight space, the resulting computational overhead remains moderate compared with end-to-end population-based optimization over all network parameters.
Definition of epoch. For gradient-based baselines, an epoch denotes one full pass over the training set. For TricP, an “epoch” denotes one complete population-update cycle (Algorithm 1), i.e., evaluating all N candidate agents and updating the population. Therefore, one TricP epoch corresponds to N fitness evaluations (one per agent).
Evaluation strategy and hardware. TricP fitness evaluations are executed in batched GPU mode to reduce overhead; within each TricP epoch, the N candidates are evaluated in parallel batches using a mini-batch size. The population update logic is executed on the CPU. Experiments were run on an Intel Core i9-14900HX CPU, 64 GB DDR5 RAM, and an NVIDIA RTX 4080 GPU (16 GB VRAM) using MATLAB and the Deep Learning Toolbox.
Forward-pass and fitness-evaluation accounting. The total number of fitness evaluations is
$N_{\mathrm{fitness}} = E_{\mathrm{TricP}} \times N$
where E TricP is the number of TricP epochs (population updates). The corresponding number of forward passes is estimated as
$\mathrm{total\_forward\_passes} = E_{\mathrm{TricP}} \times N \times B, \qquad B = \left\lceil \dfrac{ n_{\mathrm{eval}} }{ \mathrm{batch\_size} } \right\rceil$
where $n_{\mathrm{eval}}$ is the number of samples used for fitness evaluation and $B$ is the number of mini-batches required to evaluate one candidate solution. Using the timing table's reported TricP training length ($E_{\mathrm{TricP}} \approx 30$) and $N = 20$, the total number of fitness evaluations is approximately $30 \times 20 = 600$; the forward-pass count depends on $B$.
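As a worked example of this accounting, with hypothetical values for the number of evaluation samples and the mini-batch size (neither is specified in the text):

```python
import math

E_tricp = 30          # TricP epochs (population updates), from the timing table
N = 20                # population size
n_eval = 2048         # hypothetical number of fitness-evaluation samples
batch_size = 64       # hypothetical mini-batch size

n_fitness = E_tricp * N                      # 600 fitness evaluations
B = math.ceil(n_eval / batch_size)           # mini-batches per candidate
total_forward_passes = E_tricp * N * B
print(n_fitness, B, total_forward_passes)    # 600 32 19200
```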

5.3. Experimental Parameters

The performance of the proposed TricP-based method was analyzed in terms of specificity, sensitivity, and accuracy. Two complementary studies were conducted: (i) a population-size sensitivity analysis, and (ii) a main comparative evaluation using a fixed population size ( N = 20 ) and different proportions of training data, to characterize the behavior of the Tricky Predator-based Incept-LSTM classifier.
Figure 4 illustrates the performance of the Tricky Predator-based Incept-LSTM as a function of population size ( N { 10 , 20 , 30 , 40 , 50 } ) and training-data percentage: Figure 4a shows specificity, Figure 4b shows sensitivity, and Figure 4c shows accuracy. As the training percentage increases, specificity improves, and within the tested range larger population sizes tend to yield higher specificity, indicating improved exploration of the search space. Sensitivity and accuracy follow similar trends: increasing the training percentage improves both metrics, and larger population sizes within the tested range generally lead to higher sensitivity and accuracy than smaller populations.
Training-time trade-off. TricP’s gains in accuracy, sensitivity, and specificity justify the moderate increase in training time, while inference-time efficiency remains unchanged because TricP affects training only. Under the fixed configuration ( N = 20 ), TricP achieves specificity up to 96.84%, sensitivity up to 92.16%, and accuracy up to 93.62% compared with the baselines. To make this trade-off explicit, Table 1 reports the approximate observed training times, noting that fitness evaluations are executed in batched GPU mode.
Table 1. Approximate observed training time. For gradient-based baselines, an epoch denotes one full pass over the training set. For TricP, an epoch denotes one complete population-update cycle (evaluation of N agents + population update) with fixed population size N = 20 . Fitness evaluations are executed in batched GPU mode (see Section 5.2) *.
Method | Training Time/Epoch | Total Epochs | Relative Training Time
Stochastic CNN | ~2.4 min | ~40 epochs | Baseline (1×)
DeepCNN-BiLSTM | ~3.1 min | ~35 epochs | 1.1×
iSPLInception | ~3.3 min | ~31 epochs | 1.2×
Tricky Predator-based Incept-LSTM | ~5.0 min | ~30 epochs | 2.0×
* This was further confirmed by retraining the baseline models on the same pre-extracted features (HOOF + ResNet-101) to verify that TricP's computational advantage did not simply originate from feature caching. Doing so reduced their per-epoch training time but led to a consistent drop in accuracy compared with end-to-end training, confirming that the performance gains of TricP come from its population-based optimization strategy rather than from feature reuse alone.
For strict computational fairness, the Adam baseline was evaluated under two configurations: (i) full end-to-end training from raw frames, and (ii) training on the identical cached feature representation (HOOF + ResNet-101) used by TricP. Reporting the Adam training time under the cached representation ensures a transparent, like-for-like comparison and removes any ambiguity that would arise from comparing partial optimization time against full end-to-end training. As shown in Table 2, TricP incurs additional training overhead due to population-based fitness evaluations; however, inference-time efficiency remains unchanged, since the trained Incept-LSTM architecture is identical during deployment.
A detailed numerical comparison of the proposed method and baseline approaches, as a function of training-data percentage, is provided in Table 3.

6. Analysis

The performance of the proposed TricP-based Incept-LSTM is analyzed in this section and compared with the baseline methods introduced in Section 5.1 (EKVN, Stochastic-CNN, iSPLInception, and DeepCNN-BiLSTM). The comparison is carried out under two evaluation perspectives: (i) varying the training data percentage, and (ii) varying the K-Fold value used in cross-validation.

6.1. Interpretation Based on Training Data Percentage

Figure 5 illustrates the performance of the Tricky Predator-based Incept-LSTM for different training data percentages: Figure 5a shows specificity, Figure 5b shows sensitivity, and Figure 5c shows accuracy.
At a training percentage of 70%, the proposed method achieves a specificity of 95.56%, corresponding to reported improvements of 8.56%, 5.71%, 3.00%, and 1.45% over EKVN, Stochastic-CNN, iSPLInception, and DeepCNN-BiLSTM, respectively. For the same training percentage, the achieved sensitivity is 90.45%, with reported gains of 20.56%, 5.39%, 3.05%, and 1.36% over the corresponding baselines.
In terms of accuracy, for 60% training data, the proposed method attains an accuracy of 91.69%, reported as 9.69%, 5.32%, 3.68%, and 1.45% higher than EKVN, Stochastic-CNN, iSPLInception, and DeepCNN-BiLSTM, respectively. The detailed numerical results for all training percentages are summarized in Table 4. To further validate the effectiveness of the proposed optimizer under identical architecture, we additionally report results for the same Incept-LSTM trained using the standard Adam optimizer as an internal baseline.

6.2. Interpretation Based on K-Fold

Figure 6 presents performance trends when varying the K-Fold parameter in cross-validation: Figure 6a shows specificity, Figure 6b shows sensitivity, and Figure 6c shows accuracy.
For K-Fold = 7, the proposed method achieves a specificity of 91.75%, reported as an improvement of 7.20%, 5.64%, 4.01%, and 2.03% over EKVN, Stochastic-CNN, iSPLInception, and DeepCNN-BiLSTM, respectively. For the same K-Fold value, the sensitivity is 89.85%, with reported gains of 23.19%, 5.86%, 3.93%, and 1.50% compared with the baselines.
For K-Fold = 8, the proposed method achieves an accuracy of 91.94%, reported to be 10.28%, 5.92%, 3.94%, and 1.66% higher than EKVN, Stochastic-CNN, iSPLInception, and DeepCNN-BiLSTM, respectively.

6.3. Interpretation Based on ROC

Receiver operating characteristic (ROC) analysis is used to further evaluate the discriminative capability of the proposed method. The ROC curves for the Tricky Predator-based Incept-LSTM and the comparative methods are shown in Figure 7.
At a false positive rate (FPR) of 50%, the proposed method achieves a true positive rate (TPR) of 90.27%, and at an FPR of 90%, it attains a TPR of 94.28%. In addition, for a TPR of 60%, the reported values for EKVN, Stochastic-CNN, iSPLInception, DeepCNN-BiLSTM, and the proposed method are 70.59%, 84.74%, 87.27%, 89.03%, and 90.63%, respectively. These results indicate that the proposed approach provides stronger ROC characteristics than the baseline techniques.

6.4. Comparative Discussion

A consolidated comparison of the Tricky Predator-based Incept-LSTM and the conventional methods is presented in Table 4. For the training-percentage evaluation, the best specificity, sensitivity, and accuracy achieved by the proposed method are 96.84%, 92.16%, and 93.62%, respectively.
Relative to EKVN, Stochastic-CNN, iSPLInception, and DeepCNN-BiLSTM, the proposed method exhibits reported improvements of:
  • Specificity: 8.03%, 6.47%, 3.54%, and 1.45%, respectively.
  • Sensitivity: 21.19%, 6.67%, 3.93%, and 1.92%, respectively.
  • Accuracy: 8.40%, 5.15%, 3.22%, and 1.61%, respectively.
Table 5 further summarizes the best-case performance of all methods under both evaluation perspectives (training percentage and K-Fold). Overall, TricP-based Incept-LSTM consistently attains the highest values across the three metrics, with particularly notable gains in specificity and sensitivity, which are critical for anomaly-relevant recognition performance.
Interestingly, TricP exhibits slightly lower run-to-run variance than Adam (±0.09% vs. ±0.13%). This does not contradict the role of population-based exploration. Instead, it suggests that TricP reduces sensitivity to stochastic initialization effects by evaluating multiple candidate solutions per iteration and selecting the best-performing agent. In contrast, Adam follows a single optimization trajectory, which may converge to slightly different nearby basins across runs. The population-selection mechanism in TricP therefore acts as an implicit variance stabilizer.
Comparison with the Adam-optimized baseline (modest but consistent gains)
In addition to the conventional baselines, we report results for the same Incept-LSTM architecture trained using the standard Adam optimizer to enable a direct and controlled comparison. The results indicate that TricP provides modest improvements over Adam in some settings. For example, at K-Fold = 10, Incept-LSTM (Adam) achieves 92.26% accuracy, whereas the TricP-based Incept-LSTM achieves 93.17% accuracy (≈+0.91 percentage points). The gains are more evident in anomaly-relevant metrics: specificity increases from 92.38% to 93.70% (+1.32 points) and sensitivity increases from 90.78% to 91.64% (+0.86 points). Figure 8 illustrates representative convergence behavior of the Adam baseline under the experimental settings described in Section 5.
Figure 8 illustrates the convergence behavior of a single representative, high-performing Adam baseline run under the experimental configuration used for Table 6; it is shown solely to visualize typical training dynamics and does not represent averaged multi-run performance. Training was conducted for 40 epochs, after which validation performance exhibited marginal improvement (<0.05%), indicating practical convergence. To ensure strict comparability, both the Adam and TricP-based models were trained under identical data splits, initialization protocols, and stopping criteria. Although the Adam curve in Figure 8 appears to continue improving near epoch 40, extended experiments up to 60 epochs showed negligible additional gains (<0.1%), confirming that convergence had effectively been reached within the reported training schedule.
Statistical validation across multiple runs
To account for run-to-run variability, we performed five independent runs with fixed random seeds {S1, S2, S3, S4, S5}, which controlled weight initialization, data shuffling, and all stochastic operations. Table 6 reports the resulting mean and standard deviation of accuracy and constitutes the sole statistical basis for the quantitative comparison between Adam and TricP; because convergence curves vary across runs, the peak validation accuracy of the single run in Figure 8 is higher than the averaged Adam accuracy reported in Table 6 (92.26%). Across the five runs, Adam achieves 92.26 ± 0.13% accuracy, while TricP achieves 93.17 ± 0.09%. A paired t-test yields p = 0.0036, confirming that the observed improvement over Adam is statistically significant (p < 0.05) despite the marginal absolute accuracy gain; the performance gains reported for TricP therefore remain valid.
The TricP algorithm primarily improves the anomaly-related metrics (specificity and sensitivity). Table 7 reports all three metrics (accuracy, specificity, and sensitivity) for each of the five independent runs. The results confirm that TricP consistently outperforms Adam not only in accuracy but also in specificity and sensitivity, with reduced variance between runs.
The sample variance was computed using:
$s^2 = \dfrac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2, \qquad n = 5$
These values confirm that TricP demonstrates lower run-to-run variability than Adam.
Table 6 summarizes run-to-run accuracy for both the Adam baseline and the proposed TricP optimizer over the five independent runs, expressed as mean ± standard deviation to convey central tendency and variability. Adam attains 92.26 ± 0.13% accuracy on average, whereas TricP attains 93.17 ± 0.09%. The smaller standard deviation for TricP indicates better optimization stability and reduced vulnerability to stochastic initialization, and the reported p-value (0.0036) indicates a statistically significant improvement (p < 0.01), showing that the gain is not merely the result of random fluctuation.
Using the same five independent runs, Table 7 provides a detailed breakdown of specificity, sensitivity, and accuracy. In every run, TricP outperforms Adam on all three metrics; in particular, it achieves higher sensitivity while retaining superior specificity, indicating a better balance between true-positive detection and false-positive control. Together with the reduced standard deviations in Table 6, these consistent improvements across independent trials indicate that the proposed TricP-based optimization methodology is reliable, robust, and statistically valid.
Discussion of accuracy–cost trade-off and practical implications
Overall, the TricP-based Incept-LSTM improves HAR performance relative to the conventional baselines, particularly in specificity and sensitivity. The comparison with the Adam-optimized Incept-LSTM shows that the improvement in overall accuracy can be moderate in some settings; therefore, the benefit should be interpreted in terms of the balanced improvements achieved across metrics rather than accuracy alone. TricP’s population-based search helps mitigate stagnation in local optima by exploring alternative candidate solutions, which can be advantageous when robustness and stable generalization are prioritized over training speed.
Importantly, the observed training overhead remains moderate because TricP operates over a restricted, low-dimensional parameter vector (Section 3.5 and Section 4.2) rather than optimizing the full deep-network weight space. Although the proposed framework incurs additional training cost compared with lightweight baselines (e.g., EKVN and Stochastic-CNN) due to the combined use of Inception blocks, temporal modeling, and repeated fitness evaluations, this overhead affects training only. Inference-time efficiency remains unchanged because the trained Incept-LSTM is used for prediction without running TricP. Moreover, the keyframe-selection stage and the complementary feature extraction (HOOF and ResNet-101) reduce redundancy and enhance the informativeness of the input representation, contributing to the achieved gains.
In summary, the proposed method benefits from multi-scale feature extraction via Inception modules, temporal dependency modeling via LSTM, and hybrid optimization through TricP applied to a restricted set of parameters. This combination supports improved generalization on complex surveillance videos and strengthens anomaly-relevant metrics, particularly specificity.
Although population-based metaheuristics are often associated with stochastic exploration, the slightly lower variance observed for TricP can be interpreted through its structured selection dynamics. Because TricP operates over a restricted, low-dimensional parameter vector and employs competitive elimination of weaker agents, it behaves similarly to a multi-start optimization process with elitist selection. This reduces sensitivity to individual initialization effects compared with single-trajectory gradient descent, leading to marginally lower run-to-run dispersion without implying that the underlying optimization problem is convex.
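To make this interpretation concrete, the sketch below illustrates the general pattern of an outer-loop, population-based search with elitist selection over a restricted three-dimensional parameter vector. It is a generic illustration under stated assumptions, not the TricP algorithm itself: the parameter names and bounds, the population size, the update rule, and the synthetic `evaluate_fitness` stand-in are all hypothetical; in the actual framework each fitness evaluation would train and validate the Incept-LSTM.

```python
import numpy as np

rng = np.random.default_rng(0)

def evaluate_fitness(params):
    """Placeholder fitness: in practice this would train/validate the classifier
    with the candidate parameter vector (e.g., learning rate, dropout, LSTM units)
    and return a validation loss. A synthetic bowl-shaped function stands in here."""
    target = np.array([1e-3, 0.3, 128.0])
    return float(np.sum(((params - target) / (np.abs(target) + 1e-9)) ** 2))

# Restricted, low-dimensional search space: [learning rate, dropout, LSTM units].
lower = np.array([1e-4, 0.1, 32.0])
upper = np.array([1e-2, 0.5, 256.0])

pop_size, n_iter = 10, 30
pop = lower + rng.random((pop_size, lower.size)) * (upper - lower)
fitness = np.array([evaluate_fitness(p) for p in pop])

for _ in range(n_iter):
    best = pop[fitness.argmin()].copy()               # elitist reference agent
    for i in range(pop_size):
        # Move each agent toward the current best (exploitation) with a random
        # perturbation (exploration), then clip to the search bounds.
        step = rng.random(lower.size) * (best - pop[i])
        noise = 0.05 * (upper - lower) * rng.standard_normal(lower.size)
        candidate = np.clip(pop[i] + step + noise, lower, upper)
        cand_fit = evaluate_fitness(candidate)
        if cand_fit < fitness[i]:                      # competitive elimination:
            pop[i], fitness[i] = candidate, cand_fit   # the weaker solution is replaced

print("best parameters:", pop[fitness.argmin()], "fitness:", fitness.min())
```

Because only the best candidates survive each comparison, the search behaves like several restarts sharing an elitist reference, which matches the reduced run-to-run dispersion discussed above.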

7. Conclusions and Future Scope

This work presented an efficient Human Activity Recognition (HAR) framework based on an Incept-LSTM classifier with a hybrid optimization strategy. A novel Tricky Predator optimization algorithm (TricP) was devised by fusing predator hunting behavior during food searching with the social behavior of Latrans for information sharing. In the proposed framework, TricP is employed as an outer-loop optimizer to fine-tune a restricted, low-dimensional subset of parameters (rather than the full deep-network weight space), supporting effective exploration while avoiding the curse of dimensionality associated with population-based optimization in very high-dimensional settings.
Surveillance videos from the UCF-Crime dataset were used as input. Video sequences were converted into frames, and a keyframe-selection strategy based on a distance measure was applied to identify distinctive frames. From these keyframes, complementary motion and appearance features were extracted using Histogram of Oriented Optical Flow (HOOF) and ResNet-101, and then concatenated to form a discriminative feature representation. This representation was used as input to the Incept-LSTM classifier.
Experimental results on UCF-Crime showed that the proposed method achieved 96.84% specificity, 92.16% sensitivity, and 93.62% accuracy. These results indicate that the combination of keyframe selection, complementary feature extraction, and hybrid optimization can improve anomaly-relevant metrics while maintaining practical inference-time efficiency, since TricP affects training only.
A limitation of the proposed approach is the increased training cost compared with lightweight baselines, due to the combined use of Inception blocks, temporal modeling, and repeated fitness evaluations during optimization. Although this overhead is mitigated by restricting TricP to a low-dimensional search vector and by batched GPU execution during fitness evaluation (as described in Section 5.2), scaling to substantially larger datasets or higher frame rates may still require high-performance computing resources.
Future work will extend the framework to additional application domains (e.g., robotics, health monitoring, gaming, and interactive systems) and evaluate generalization across multiple benchmark datasets under standardized protocols. Further research will investigate methods to reduce training cost and memory footprint, including lightweight backbones, compressed representations, and more efficient temporal modeling. In addition, advanced multi-stream designs and ensemble strategies may be explored to improve robustness in complex environments. Finally, benchmarking against more recent state-of-the-art video architectures, including 3D CNNs and Transformer-based models, will be pursued to further assess comparative effectiveness.

Author Contributions

Conceptualization, P.G., M.A.-S., P.J., D.V., H.T. and O.A.H.; data curation, P.G., P.J. and D.V.; formal analysis, P.G., M.A.-S., P.J. and O.A.H.; Funding acquisition, P.J.; investigation, P.G., M.A.-S., D.V. and H.T.; methodology, P.G. and H.T.; project administration, P.J. and O.A.H.; resources, P.G., P.J. and D.V.; software, P.G.; supervision, P.G., D.V. and H.T.; validation, P.G., M.A.-S., P.J., D.V., H.T. and O.A.H.; visualization, M.A.-S. and H.T.; writing—original draft, P.G., M.A.-S. and O.A.H.; writing—review and editing, M.A.-S., H.T. and O.A.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Mekruksavanich, S.; Jitpattanakul, A. Biometric user identification based on human activity recognition using wearable sensors: An experiment using deep learning models. Electronics 2021, 10, 308.
2. Ronald, M.; Poulose, A.; Han, D.S. iSPLInception: An inception-ResNet deep learning architecture for human activity recognition. IEEE Access 2021, 9, 68985–69001.
3. Garcia, K.D.; de Sá, C.R.; Poel, M.; Carvalho, T.; Mendes-Moreira, J.; Cardoso, J.M.; de Carvalho, A.C.; Kok, J.N. An ensemble of autonomous auto-encoders for human activity recognition. Neurocomputing 2021, 439, 271–280.
4. Anagnostis, A.; Benos, L.; Tsaopoulos, D.; Tagarakis, A.; Tsolakis, N.; Bochtis, D. Human activity recognition through recurrent neural networks for human–robot interaction in agriculture. Appl. Sci. 2021, 11, 2188.
5. Tasnim, N.; Islam, M.K.; Baek, J.H. Deep learning based human activity recognition using spatio-temporal image formation of skeleton joints. Appl. Sci. 2021, 11, 2675.
6. Alemayoh, T.T.; Lee, J.H.; Okamoto, S. New sensor data structuring for deeper feature extraction in human activity recognition. Sensors 2021, 21, 2814.
7. Zhang, L.; Lim, C.P.; Yu, Y. Intelligent human action recognition using an ensemble model of evolving deep networks with swarm-based optimization. Knowl. Based Syst. 2021, 220, 106918.
8. Russell, B.; McDaid, A.; Toscano, W.; Hume, P. Moving the Lab into the Mountains: A Pilot Study of Human Activity Recognition in Unstructured Environments. Sensors 2021, 21, 654.
9. Xu, Y.; Qiu, T.T. Human activity recognition and embedded application based on convolutional neural network. J. Artif. Intell. Technol. 2021, 1, 51–60.
10. Ma, C.Y.; Chen, M.H.; Kira, Z.; AlRegib, G. TS-LSTM and temporal-inception: Exploiting spatiotemporal dynamics for activity recognition. Signal Process. Image Commun. 2019, 71, 76–87.
11. Jandhyam, L.A.; Rengaswamy, R.; Satyala, N. An optimized Deep LSTM model for human action recognition. Rev. D’intelligence Artif. 2024, 38, 11–23.
12. Weng, Z.; Li, W.; Jin, Z. Human activity prediction using saliency-aware motion enhancement and weighted LSTM network. EURASIP J. Image Video Process. 2021, 2021, 3.
13. Aghaei, A.; Nazari, A.; Moghaddam, M.E. Sparse deep LSTMs with convolutional attention for human action recognition. SN Comput. Sci. 2021, 2, 151.
14. Saoudi, E.M.; Jaafari, J.; Andaloussi, S.J. Advancing human action recognition: A hybrid approach using attention-based LSTM and 3D CNN. Sci. Afr. 2023, 21, e01796.
15. Pandey, N.N.; Muppalaneni, N.B. Temporal and spatial feature based approaches in drowsiness detection using deep learning technique. J. Real-Time Image Process. 2021, 18, 2287–2299.
16. Guo, H.; Chen, J. Dynamic facial expression recognition based on ResNet and LSTM. IOP Conf. Ser. Mater. Sci. Eng. 2020, 790, 012145.
17. Ahmad, W.; Kazmi, B.M.; Ali, H. Human activity recognition using multi-head CNN followed by LSTM. In Proceedings of the 2019 15th International Conference on Emerging Technologies (ICET); IEEE: Washington, DC, USA, 2019; pp. 1–6.
18. Alom, M.Z.; Hasan, M.; Yakopcic, C.; Taha, T.M.; Asari, V.K. Improved inception-residual convolutional neural network for object recognition. Neural Comput. Appl. 2020, 32, 279–293.
19. Xu, C.; Chai, D.; He, J.; Zhang, X.; Duan, S. InnoHAR: A deep neural network for complex human activity recognition. IEEE Access 2019, 7, 9893–9902.
20. Xia, K.; Huang, J.; Wang, H. LSTM-CNN architecture for human activity recognition. IEEE Access 2020, 8, 56855–56866.
21. Mustafa, T.; Dhavale, S.; Kuber, M.M. Performance Analysis of Inception-v2 and Yolov3-Based Human Activity Recognition in Videos. SN Comput. Sci. 2020, 1, 138.
22. Dua, N.; Singh, S.N.; Semwal, V.B. Multi-input CNN-GRU based human activity recognition using wearable sensors. Computing 2021, 103, 1461–1478.
23. Nafea, O.; Abdul, W.; Muhammad, G.; Alsulaiman, M. Sensor-based human activity recognition with spatio-temporal deep learning. Sensors 2021, 21, 2141.
24. UCF Crime Dataset. Available online: https://www.kaggle.com/datasets/odins0n/ucf-crime-dataset (accessed on 15 July 2024).
25. Prabha, B.; Shanker, N.R.; Priya, M.; Ganesh, E. Human Anomalous Activity Detection: Shape and Motion Approach in Crowded Scenes. J. Phys. Conf. Ser. 2021, 1921, 012074.
26. Hnoohom, N.; Maitrichit, N.; Chotivatunyu, P.; Sornlertlamvanich, V.; Mekruksavanich, S.; Jitpattanakul, A. Blister Package Classification Using ResNet-101 for Identification of Medication. In Proceedings of the 2021 25th International Computer Science and Engineering Conference (ICSEC); IEEE: Washington, DC, USA, 2021; pp. 406–410.
27. Hosseini, M.; Maida, A.S.; Hosseini, M.; Raju, G. Inception-inspired LSTM for next-frame video prediction. arXiv 2019, arXiv:1909.05622.
28. Pierezan, J.; Coelho, L.D.S. Coyote optimization algorithm: A new metaheuristic for global optimization problems. In Proceedings of the 2018 IEEE Congress on Evolutionary Computation (CEC); IEEE: Washington, DC, USA, 2018; pp. 1–8.
29. Połap, D.; Woźniak, M. Red fox optimization algorithm. Expert Syst. Appl. 2021, 166, 114107.
30. Binu, D.; Kariyappa, B.S. Rider-deep-LSTM network for hybrid distance score-based fault prediction in analog circuits. IEEE Trans. Ind. Electron. 2020, 68, 10097–10106.
Figure 1. Proposed methodology for Human Activity Recognition using the optimized Incept-LSTM classifier.
Figure 2. ResNet Architecture.
Figure 3. Block diagram of the Incept-LSTM.
Figure 4. Interpretation of the Tricky Predator-based Incept-LSTM performance based on: (a) specificity, (b) sensitivity, and (c) accuracy.
Figure 5. Performance comparison with varying training percentage: (a) specificity, (b) sensitivity, and (c) accuracy.
Figure 6. Performance comparison with varying K-Fold: (a) specificity, (b) sensitivity, and (c) accuracy.
Figure 7. ROC-based comparison of EKVN, Stochastic-CNN, iSPLInception, DeepCNN-BiLSTM, and Tricky Predator-based Incept-LSTM.
Figure 8. Representative convergence behavior of the Adam baseline for a single best-performing run.
Table 2. Transparent computational comparison under different feature pipelines.
Method | Feature Pipeline | Optimization Mode | Training Time/Epoch (min) | Total Epochs | Total Training Time (min) | Relative Cost
Adam (Baseline) | End-to-End (Raw Frames) | Gradient-Based | ~2.4 | 40 | ~96 | 1.00×
Adam (Baseline) | Cached Features (HOOF + ResNet-101) | Gradient-Based | ~1.6 | 40 | ~64 | 0.67×
TricP (Proposed) | Cached Features (HOOF + ResNet-101) | Hybrid (Population + Gradient) | ~5.0 | 30 | ~150 | 1.56× (vs. E2E) / 2.34× (vs. Cached Adam)
Table 3. Performance interpretation based on training data percentage.
Methods/Metrics | EKVN | Stochastic-CNN | iSPLInception | DeepCNN-BiLSTM | Tricky Predator-Based Incept-LSTM
Specificity based on Training
40% | 82.94 | 86.44 | 88.48 | 89.87 | 91.80
50% | 84.98 | 87.98 | 89.54 | 90.92 | 93.08
60% | 86.42 | 89.03 | 91.51 | 92.90 | 94.29
70% | 87.38 | 90.11 | 92.70 | 94.17 | 95.56
80% | 89.06 | 90.57 | 93.41 | 95.44 | 96.84
Sensitivity based on Training
40% | 57.40 | 78.80 | 80.08 | 85.68 | 87.31
50% | 61.84 | 79.47 | 81.49 | 87.33 | 89.12
60% | 70.09 | 85.12 | 86.81 | 88.04 | 89.53
70% | 71.85 | 85.57 | 87.69 | 89.22 | 90.45
80% | 72.63 | 86.01 | 88.54 | 90.39 | 92.16
Accuracy based on Training
40% | 78.25 | 83.48 | 84.96 | 87.57 | 89.13
50% | 81.07 | 83.78 | 85.63 | 89.51 | 90.89
60% | 82.81 | 86.82 | 88.32 | 90.36 | 91.69
70% | 84.09 | 87.51 | 89.30 | 90.96 | 92.46
80% | 85.76 | 88.80 | 90.61 | 92.11 | 93.62
Table 4. Interpretation based on training data percentage and K-Fold (including Incept-LSTM (Adam) baseline).
Methods/Metrics | EKVN | Stochastic-CNN | iSPLInception | DeepCNN-BiLSTM | Incept-LSTM (Adam) | Tricky Predator-Based Incept-LSTM
Specificity based on K-Fold
5 | 81.73 | 85.39 | 87.19 | 89.59 | 90.10 | 91.39
6 | 82.65 | 85.98 | 87.52 | 89.74 | 90.36 | 91.45
7 | 85.15 | 86.57 | 88.07 | 89.89 | 90.32 | 91.75
8 | 85.65 | 87.28 | 88.73 | 91.06 | 91.45 | 92.74
9 | 85.73 | 87.84 | 89.26 | 91.20 | 92.11 | 93.33
10 | 86.39 | 88.43 | 89.94 | 91.82 | 92.38 | 93.70
Sensitivity based on K-Fold
5 | 66.15 | 82.68 | 84.54 | 87.30 | 87.85 | 88.64
6 | 68.28 | 83.28 | 85.18 | 87.99 | 88.40 | 89.25
7 | 69.01 | 84.59 | 86.32 | 88.50 | 90.12 | 89.85
8 | 69.18 | 84.85 | 86.79 | 88.96 | 90.78 | 90.37
9 | 70.80 | 85.32 | 86.97 | 89.71 | 92.02 | 91.19
10 | 72.85 | 85.80 | 87.64 | 90.23 | 90.78 | 91.64
Accuracy based on K-Fold
5 | 74.65 | 84.39 | 86.27 | 88.68 | 89.12 | 90.26
6 | 76.22 | 85.34 | 87.04 | 89.32 | 89.74 | 90.84
7 | 81.76 | 86.23 | 87.93 | 89.84 | 90.45 | 91.42
8 | 82.49 | 86.49 | 88.32 | 90.42 | 91.10 | 91.94
9 | 82.57 | 87.13 | 88.96 | 91.00 | 91.82 | 92.59
10 | 82.69 | 87.77 | 89.54 | 91.51 | 92.26 | 93.17
Table 5. Best-case performance comparison of the proposed Tricky Predator-based Incept-LSTM with EKVN, Stochastic-CNN, iSPLInception, and DeepCNN-BiLSTM.
Methods/Metrics | EKVN | Stochastic-CNN | iSPLInception | DeepCNN-BiLSTM | Tricky Predator-Based Incept-LSTM
Interpretation based on Training
Specificity | 89.06 | 90.57 | 93.41 | 95.44 | 96.84
Sensitivity | 72.63 | 86.01 | 88.54 | 90.39 | 92.16
Accuracy | 85.76 | 88.80 | 90.61 | 92.11 | 93.62
Interpretation based on K-Fold
Specificity | 86.39 | 88.43 | 89.94 | 91.82 | 93.70
Sensitivity | 72.85 | 85.80 | 87.64 | 90.23 | 91.64
Accuracy | 82.69 | 87.77 | 89.54 | 91.51 | 93.17
Table 6. Statistical validation of TricP vs. Adam over multiple runs.
Method | Run 1 (%) | Run 2 (%) | Run 3 (%) | Run 4 (%) | Run 5 (%) | Mean ± Std (%) | p-Value
Adam (Baseline) | 92.18 | 92.34 | 92.09 | 92.41 | 92.28 | 92.26 ± 0.13 | —
TricP (Proposed) | 93.05 | 93.22 | 93.11 | 93.26 | 93.21 | 93.17 ± 0.09 | 0.0036
Table 7. Per-run performance metrics across five independent runs.
Method | Run | Specificity (%) | Sensitivity (%) | Accuracy (%)
Adam | 1 | 90.74 | 95.91 | 92.18
Adam | 2 | 90.88 | 96.03 | 92.34
Adam | 3 | 90.69 | 95.82 | 92.09
Adam | 4 | 90.93 | 96.11 | 92.41
Adam | 5 | 90.81 | 95.97 | 92.28
TricP | 1 | 91.82 | 96.41 | 93.05
TricP | 2 | 92.04 | 96.66 | 93.22
TricP | 3 | 91.91 | 96.53 | 93.11
TricP | 4 | 92.18 | 96.78 | 93.26
TricP | 5 | 92.09 | 96.71 | 93.21
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
