1. Introduction
Human Activity Recognition (HAR) aims to identify and classify the physical activities performed by an individual or a group of individuals, depending on the specific application context. This recognition process involves understanding and interpreting various movements and actions, which can range from simple daily tasks to more complex activities. For instance, activities such as walking, running, jumping, or sitting can be performed by a single person, each involving distinct patterns of body motion and posture changes. These tasks require the system to capture variations in body movements to accurately detect and classify the activities being performed. HAR systems are designed to extract meaningful information from these movements, making it possible to monitor, analyze, and respond to human behavior in diverse scenarios. This capability is particularly valuable in fields such as healthcare, where monitoring patient mobility can be crucial, or in surveillance, where detecting suspicious actions can enhance security. By accurately recognizing and distinguishing these activities, HAR technologies contribute significantly to improving the responsiveness and efficiency of automated systems in various domains [
1,
2]. Certain activities are carried out through the movement of specific body parts, such as making gestures with one’s hands [
3,
4]. In some instances, activities involve interacting with objects, such as preparing meals in the kitchen [
5,
6]. Several surveys have comprehensively reviewed the development of wearable sensors and multimodal interfaces for human activity recognition [
7,
8,
9]. The process of recognizing human activities using deep learning architectures, particularly Convolutional Neural Networks (CNNs), involves multiple critical stages that together form a cohesive system and constitute a general framework for CNN-based activity recognition. The initial phase focuses on the selection and integration of appropriate sensing devices tailored to the specific recognition task. Following this, data acquisition is conducted, in which edge devices capture data from various input sources. This data is then transmitted to a centralized server using communication technologies such as Wi-Fi or Bluetooth.
Edge computing plays a crucial role in this architecture by bringing computational and storage capabilities closer to the data source, thus facilitating efficient and real-time processing. This approach ensures that sensors deployed for data collection can transmit information directly to edge servers, which are capable of processing the data promptly. By leveraging edge computing, the system achieves low-latency responses and enhanced performance, making it particularly suitable for applications requiring real-time activity recognition. The authors in [
8] conducted an in-depth review of HAR frameworks that specifically utilize accelerometer data. This analysis included discussions on various parameters, such as sampling rates, window sizes, and overlap percentages, which are critical for processing time-series data. The review also shed light on feature extraction techniques and highlighted key factors that influence the effectiveness of HAR systems. Cornacchia et al. [
9] emphasized the use of wearable sensors in HAR systems, covering a wide range of sensor types, including pressure sensors, accelerometers, gyroscopes, depth sensors, and hybrid modalities. They classified recent HAR studies based on the sensor data processing techniques employed, particularly those leveraging machine learning algorithms.
Additionally, Beddiar et al. [
10] provided a comprehensive survey on the latest advancements in HAR, focusing on significant features like the types of activities recognized, input data formats, validation methods, targeted body parts, and camera viewpoints used in data collection. Their review involved a comparative analysis of state-of-the-art methods based on the diversity of activities they could detect. Furthermore, they offered a detailed overview of vision-based datasets, which are crucial for advancing HAR research by providing standardized benchmarks for model evaluation. Collectively, these studies highlight the rapid advancements in HAR technologies, particularly in leveraging wearable sensors and machine learning techniques. They emphasize the need for continued innovation in sensor fusion, data processing, and feature extraction to develop robust systems capable of real-time and accurate activity recognition across diverse environments.
Moreover, incorporating edge devices reduces the dependency on cloud infrastructure by enabling localized data processing, which is vital for maintaining data privacy and reducing network congestion. Thus, the entire system is designed to handle large-scale, continuous data streams efficiently, which is essential for complex tasks like human activity recognition in dynamic environments.
In recent years, deep learning (DL) algorithms have gained significant traction due to their ability to automatically extract features from complex datasets, including visual or image data and sequential time-series data. This eliminates the need for manual feature engineering, which is often time-consuming and domain-specific, making DL models highly efficient and adaptable across diverse applications. As a result, DL methods have been widely adopted in areas such as computer vision, natural language processing, and predictive analytics, where high-dimensional data is prevalent. The flexibility and robustness of DL models in capturing intricate patterns have made them a preferred choice for tasks requiring accurate, data-driven insights [
11,
12].
Convolutional Neural Networks (CNNs) are known to effectively learn hierarchical features, progressing from low-level to high-level representations. Researchers have observed that the features extracted by CNNs tend to outperform traditional handcrafted ones. In recent years, substantial efforts have been dedicated to designing neural networks that can effectively capture spatiotemporal characteristics for human activity recognition. Many studies leverage deep learning approaches in this area, as they enable automated extraction and learning of hierarchical features crucial for behavior analysis. This has led to the development of various systems demonstrating promising outcomes in human activity recognition. Deep Learning (DL) techniques have garnered significant interest due to their impressive performance across diverse domains. It is, therefore, unsurprising that DL-based models have seen a surge in applications for tasks such as identification, prediction, and intention recognition. In particular, Recurrent Neural Networks (RNNs) have achieved notable success in behavior analysis, with the Long Short-Term Memory (LSTM) architecture being the most prevalent. LSTMs, an enhanced form of RNNs, utilize gated memory cells that efficiently manage long-term temporal dependencies, making them especially suitable for understanding sequential and time-dependent data in human behavior analysis. Uddin et al. [
11] proposed an innovative approach to Human Activity Recognition (HAR) using a distinctive feature descriptor known as the Adaptive Local Motion Descriptor (ALMD), which builds upon the Local Ternary Pattern (LTP). The ALMD effectively captures human motion and appearance within video sequences, utilizing a random forest classifier for recognition. This method was validated on three well-known datasets—KTH, UCF Sports, and UCF-50—demonstrating high accuracy.
Similarly, Luvizon et al. [
12] proposed a skeleton-sequence–based HAR method that learns discriminative combinations of pose-derived features. Starting from per-frame joint coordinates (and their temporal variations), their deep model aggregates joint trajectories over time and automatically weights the most informative joints and motions, producing a compact spatio-temporal representation for action classification. By focusing on skeleton cues rather than appearance, the approach is robust to background and illumination changes and reports competitive results on standard skeleton-based benchmarks. Gao et al. [
13] introduced an advanced model for multidimensional Human Activity Recognition (HAR) by leveraging image set-based analysis and group sparsity techniques. This method begins with the extraction of dense trajectory features, followed by the construction of a codebook using k-means clustering. The resulting Bag-of-Words (BoW) representation is then utilized alongside the codebook to recognize actions from multiple viewpoints. The effectiveness of this approach was evaluated across three well-known datasets: Northwestern UCLA, IXMAS, and CVS-MV-RGBD-Single. Experimental results demonstrated that this method significantly enhances recognition accuracy and achieves robust performance, especially in scenarios involving complex multi-view action sequences. By focusing on multi-dimensional feature representation, Gao et al.’s model offers a promising solution for recognizing activities across different perspectives and conditions. To enhance the performance of human activity recognition, our proposed approach integrates Convolutional Neural Networks (CNNs) with dilated convolutions alongside Long Short-Term Memory (LSTM) networks. The incorporation of dilated convolutions significantly reduces computational complexity while maintaining a broader receptive field, allowing the model to capture essential spatiotemporal features more effectively. This design not only accelerates the training process but also optimizes resource efficiency, leading to superior performance compared to traditional methods. The synergy of CNNs for spatial feature extraction, combined with LSTMs for handling temporal dependencies, results in more accurate recognition of human activities. The proposed approach demonstrates remarkable improvements in accuracy and training speed, offering a robust alternative to conventional deep learning models in this domain. Lightweight networks used in activity recognition also frequently employ grouped convolution techniques to further reduce computational cost.
2. Literature Review
Several factors, such as background noise, varying viewpoints, and changes in lighting, can significantly affect the environment in which human activities are captured in real-time video. These elements can make it challenging to clearly observe and recognize actions, as they introduce variability that obscures human activities. Traditional activity recognition techniques aim to overcome this issue by extracting specific features from the video data to classify different patterns.
Deep learning approaches, particularly convolutional neural networks (CNNs), provide a more effective solution by automatically learning hierarchical features. This capability allows the system to progressively build complex representations from simpler elements, improving the recognition process. The use of pooling layers and weight-sharing in convolutional architectures helps streamline the network’s search space, making it more efficient by leveraging the inherent structure of images. Additionally, pooling operations and weight-sharing mechanisms enhance the model’s robustness to changes in scale and spatial variations, enabling more consistent and accurate recognition of activities across different conditions. By doing so, CNNs can effectively handle the challenges posed by real-world video data, leading to more reliable human activity recognition.
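To make the role of weight sharing and pooling concrete, the following minimal sketch (layer sizes and filter counts are illustrative assumptions, not a model from the cited works) stacks two convolutional layers, each reusing a small set of kernels across the whole frame, with pooling layers that downsample the feature maps and add tolerance to small spatial shifts.

```python
# Minimal illustrative sketch: weight sharing (a few kernels slid over the whole
# frame) plus pooling, which shrinks the search space and adds robustness to
# small spatial shifts. Shapes and filter counts are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

cnn_block = models.Sequential([
    layers.Input(shape=(256, 256, 3)),                         # a single RGB frame
    layers.Conv2D(32, 3, padding="same", activation="relu"),   # 32 shared 3x3 kernels
    layers.MaxPooling2D(2),                                     # halve spatial resolution
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(2),
])
cnn_block.summary()  # far fewer parameters than a fully connected alternative
```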
2.1. CNN Approach for Human Activity Recognition
Zeiler and Fergus [
14] showed that the filters learned by convolutional neural network (CNN) models follow a hierarchical structure. In the early layers, the network detects simple, low-level features like edges and textures. As the data moves through deeper layers, the network identifies more abstract, high-level features, such as shapes and object components. This hierarchical approach not only demonstrates the flexibility of CNNs but also their effectiveness as generalized feature extractors across a wide range of tasks. By progressively enhancing the feature details at each layer, CNNs can build intricate data representations, significantly improving their performance in areas such as human activity recognition. This capability of automatically learning both fundamental and complex patterns from the data reduces the dependence on manually crafted features, making CNNs exceptionally robust for analyzing extensive datasets.
2.2. Deep Convolutional Neural Network for Human Activity Recognition (DCNN)
Granada et al. [
15] utilized deep convolutional neural networks (CNNs) to extract video representations directly from raw inputs, such as RGB frames and optical flow fields, training their recognition model in a fully end-to-end fashion. By feeding these CNNs with both types of input data, they generated probability scores for each class and predicted activity labels using a fusion technique. This approach demonstrated superior performance in activity recognition compared to methods that rely solely on handcrafted features or deep learning models using only RGB or optical flow inputs. However, despite the advantages of deep learning, the results did not always surpass handcrafted feature-based methods, particularly due to the limited availability of large-scale RGB activity recognition datasets needed for effective supervised training. The scarcity of comprehensive datasets hampers the full potential of CNNs in this domain, emphasizing the need for more extensive and diverse data to achieve better generalization in recognizing human activities.
2.3. Binary Motion Image Method for HAR
Dobhal et al. [
16] proposed a human activity recognition model that utilizes a unique 2D representation of actions by combining sequences of images into a single image, known as a Binary Motion Image (BMI). In their approach, they first generate binary foreground images using a Gaussian Mixture Model (GMM) to highlight motion areas, which are then combined to form the BMI. A convolutional neural network (CNN) is subsequently trained on these BMIs for effective activity classification. The authors extended their work to include 3D depth maps, where a similar feature extraction method was applied to compute BMIs from depth data. This approach demonstrated the flexibility of the BMI representation, as it efficiently captures spatiotemporal motion patterns, allowing the CNN to learn from both 2D visual data and 3D depth information. By leveraging depth maps, the model was able to enhance recognition accuracy, particularly in scenarios where standard RGB data alone may fall short due to changes in lighting or background noise.
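As a rough illustration of the BMI idea (not Dobhal et al.’s exact pipeline; the function name and image size are assumptions), the sketch below uses OpenCV’s GMM-based background subtractor to obtain per-frame binary foreground masks and merges them into a single 2D motion image that a CNN could then classify.

```python
# Illustrative sketch of a Binary Motion Image: a Gaussian-mixture background
# subtractor yields per-frame binary foreground masks, which are merged into a
# single 2D motion image summarizing where motion occurred in the clip.
import cv2
import numpy as np

def binary_motion_image(video_path, size=(64, 64)):
    cap = cv2.VideoCapture(video_path)
    subtractor = cv2.createBackgroundSubtractorMOG2()  # GMM-based foreground model
    bmi = np.zeros(size, dtype=np.float32)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = subtractor.apply(frame)                  # per-frame foreground mask
        mask = cv2.resize(mask, size)
        bmi = np.maximum(bmi, (mask > 127).astype(np.float32))  # union of motion regions
    cap.release()
    return bmi
```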
2.4. 3D Convolutional Neural Network for Human Activity Recognition
Ji et al. [
17] introduced an innovative approach to activity recognition by utilizing 3D convolutional networks, which extend traditional 2D convolutions into the temporal domain. Unlike standard 2D CNNs that operate solely on spatial dimensions, 3D convolutional networks employ filters that span both spatial and temporal axes. This allows them to effectively capture spatiotemporal features and motions embedded across consecutive video frames, enabling a deeper understanding of dynamic content. By incorporating temporal information directly into the learning process, these networks can better model the motion patterns essential for human activity recognition. However, for optimal performance, the network requires additional input, such as optical flow, to enhance its training capabilities. Ji et al. [
17] demonstrated through experiments that 3D convolutional networks significantly outperform traditional 2D CNNs, particularly in scenarios where understanding motion across frames is crucial. This improvement highlights the potential of 3D convolutions in extracting richer feature representations, making them well-suited for complex video-based applications where both spatial details and temporal dynamics are important for accurate recognition.
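The following small sketch (shapes and filter counts are assumptions, not Ji et al.’s published configuration) illustrates how a 3D convolution spans both space and time: each kernel covers three consecutive frames as well as a 3 × 3 spatial neighbourhood, so motion across frames is encoded directly in the learned filters.

```python
# Sketch of a tiny 3D-convolutional classifier (all hyperparameters are assumptions):
# each kernel spans 3 frames and a 3x3 spatial window, capturing motion directly.
import tensorflow as tf
from tensorflow.keras import layers, models

c3d_stub = models.Sequential([
    layers.Input(shape=(16, 112, 112, 3)),                  # 16-frame RGB clip
    layers.Conv3D(32, kernel_size=(3, 3, 3), padding="same", activation="relu"),
    layers.MaxPooling3D(pool_size=(1, 2, 2)),                # pool space, keep time
    layers.Conv3D(64, kernel_size=(3, 3, 3), padding="same", activation="relu"),
    layers.GlobalAveragePooling3D(),
    layers.Dense(10, activation="softmax"),                  # placeholder class count
])
```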
2.5. Slow Fusion Method for Human Activity Recognition
Karpathy et al. [
18] introduced a method to enhance the temporal awareness of convolutional networks through a technique known as slow fusion. In this approach, the network is provided with multiple adjacent segments of a video, and it processes them using the same set of convolutional layers. By doing this, the network captures temporal information across these video segments, enabling it to learn the patterns of motion and events over time. The network’s output for each segment is then processed by fully connected layers to generate a comprehensive video descriptor. Furthermore, Karpathy et al. [
18] proposed the use of a multi-resolution approach, where two separate networks are employed, each handling smaller inputs. This method not only improves the accuracy of activity recognition by allowing the network to focus on different resolutions of the video data but also reduces the number of parameters the network needs to learn. By utilizing smaller inputs and processing them in parallel streams, the network becomes more efficient, allowing for faster training and improved generalization across diverse video sequences. This approach significantly boosts the network’s ability to recognize actions with higher precision while keeping the computational cost manageable.
2.6. 3DDCNN Approach for Human Activity Recognition
Liu et al. [
19] proposed a novel 3D convolutional deep neural network (3DDCNN) designed to automatically learn spatiotemporal features from raw depth sequences. In their method, the network also integrates a Joint Vector, which is calculated using the position and angle information of skeleton joints, to improve the recognition of human activities. This approach allows the model to capture both the spatial and temporal aspects of human motion, which are essential for activity recognition tasks. One of the key advantages of this method is that the learned feature representation is both time-invariant and viewpoint-invariant. This means that the model is capable of recognizing activities accurately regardless of the time at which they occur or the viewpoint from which the action is captured. As a result, the network can generalize better to different scenarios, making it robust to variations in camera angles and temporal shifts. The method achieves results that are comparable to state-of-the-art techniques, demonstrating its effectiveness in recognizing complex human activities while maintaining a high level of accuracy. This approach highlights the potential of combining depth information with skeleton-based features to improve the robustness and performance of activity recognition systems.
2.7. 4K-Dimensional Per-Segment Descriptors Based on CNN
Ryoo et al. [
20] evaluated temporal pooling strategies (average, max, and pooled time series with temporal pyramids) to summarize CNN features into ~4K-dimensional per-segment descriptors. These representations capture scene dynamics over time, improving robustness to noisy motion and enabling multi-scale temporal reasoning. Building on the need for efficient spatio-temporal representations, recent sensor-centric HAR work explores lightweight neural designs that reduce parameters and compute while preserving accuracy. LIMUNet [
21] exemplifies this direction for smartwatch data, using compact architectural choices tailored to resource-constrained devices and reporting competitive recognition performance in a low-overhead footprint. Complementary to lightweight CNNs, sequence models remain effective for modeling temporal dependencies in dynamic activities. Hassan et al. [
22] present a Deep BiLSTM approach enhanced with transfer-learning–based feature extraction, showing that pretrained feature encoders coupled with bidirectional temporal modeling can yield strong results on dynamic HAR benchmarks.
Broader surveys synthesize these trends across modalities, models, and deployment settings. Gu et al. [
23] review deep-learning HAR end-to-end—from architectures and fusion strategies to datasets and evaluation protocols—highlighting open issues such as cross-dataset generalization and domain shift. For wearable-sensor pipelines specifically, Zhang et al. [
24] provide an extensive overview of deep models, preprocessing/windowing choices, and practical considerations (latency, energy, on-device inference). Wang et al. [
25] survey deep learning for sensor-based HAR, detailing convolutional/recurrent formulations, feature learning for time-series signals, and challenges in real-world deployment (subject variability, annotation cost, and robustness). Together, refs. [
23,
24,
25] position lightweight CNNs and sequence models [
21,
22] as complementary tools for efficient, accurate HAR under practical constraints.
4. Results and Discussions
4.1. Dataset UCF 50
The UCF50 dataset is a comprehensive action recognition dataset that features 50 distinct action categories, sourced from realistic videos on YouTube. It is an expanded version of the earlier YouTube Action dataset (UCF11), which focused on only 11 action categories. Unlike many existing datasets in the field, which rely on staged and controlled environments, the UCF50 dataset emphasizes realism as can be seen in
Figure 5. It provides the computer vision community with challenging, real-world scenarios to test the robustness of action recognition models. One of the primary strengths of this dataset is its diversity and complexity. The videos exhibit significant variations in factors such as camera motion, object appearance, pose, scale, and viewpoint, alongside varying illumination conditions and cluttered backgrounds. These variations make the dataset highly challenging and suitable for evaluating models’ performance in real-world scenarios. The dataset is organized into 50 action categories, which are further divided into 25 groups per category. Each group contains at least four video clips, often sharing common traits like the same individual, similar backgrounds, or comparable viewpoints, thereby testing models’ ability to generalize across different contexts. The 50 action classes span a wide range of activities, including both sports and everyday actions. Some of the notable categories are Basketball Shooting, Bench Press, Billiards Shot, Drumming, Horse Riding, Kayaking, Pull-Ups, Tennis Swing, Volleyball Spiking, and Walking with a Dog, among others. This diversity in actions, captured in realistic settings, makes UCF50 a valuable resource for advancing research in human activity recognition.
4.2. Dataset Preprocessing
The preprocessing of the UCF50 dataset involved several essential steps to prepare the data for effective training of the ConvLSTM model. Initially, a subset of action categories was chosen from the dataset, focusing on classes such as WalkingWithDog, TaiChi, and Swing. For each selected class, the script identified all available video files within the designated directory. The core of the preprocessing was handled by the frame-extraction function, which read each video using OpenCV’s VideoCapture, determined the total frame count, and extracted a fixed number of frames (the sequence length was set to 20) at uniform intervals to ensure consistency. To achieve this, a frame-sampling interval was computed from the video’s total frame count, allowing the model to learn temporal patterns efficiently. Each extracted frame was resized to a standard dimension of 256 × 256 pixels to maintain uniformity across the dataset. The frames were then normalized by scaling pixel values to the range between 0 and 1, which enhances training performance by stabilizing gradient updates. The dataset-construction function brought everything together by processing each video file, extracting frames, and storing them in structured lists. Only videos with at least 20 frames were included, ensuring all sequences were of consistent length. The resulting frames, labels, and video paths were then converted into NumPy arrays for efficient handling during model training. This comprehensive preprocessing strategy ensured the data was optimally formatted, allowing the ConvLSTM model to focus on learning the spatial and temporal patterns critical for accurate action recognition.
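A condensed sketch of this frame-extraction step is shown below (function and constant names are illustrative, not the project’s actual code): 20 frames are sampled at a uniform interval, resized to 256 × 256, and scaled to the range [0, 1].

```python
# Illustrative frame-extraction sketch: sample SEQUENCE_LENGTH frames at a
# uniform interval, resize them, and normalize pixel values to [0, 1].
import cv2
import numpy as np

SEQUENCE_LENGTH = 20
IMAGE_SIZE = (256, 256)

def extract_frames(video_path):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total // SEQUENCE_LENGTH, 1)          # uniform sampling interval
    frames = []
    for i in range(SEQUENCE_LENGTH):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * step)
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, IMAGE_SIZE)
        frames.append(frame.astype(np.float32) / 255.0)   # normalize pixels
    cap.release()
    # Drop videos that yield fewer than SEQUENCE_LENGTH frames
    return np.asarray(frames) if len(frames) == SEQUENCE_LENGTH else None
```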
4.3. Model Evaluation Performance Metrics
After the model has been trained, testing it on real-world data such as YouTube videos involves several key steps. First, the test video is obtained by providing its YouTube URL; the video is fetched, its title is extracted, and it is saved with a meaningful filename in a designated directory. Once downloaded, the video is prepared for action recognition: it is read frame by frame, and specific frames are selected according to the sequence length required by the model. These frames are resized to the input dimensions expected by the model and normalized so that pixel values lie between 0 and 1. After preprocessing, the fixed-length frame sequence is passed to the trained LRCN model, which generates a probability distribution over the possible classes; the class with the highest probability is selected as the predicted action. Finally, the predicted action label and its confidence score (the probability of the predicted class) are output, indicating how certain the model is about its prediction and giving insight into its performance on the test video. This procedure allows for effective evaluation of the model on unseen video data and demonstrates its robustness and versatility in handling varied, complex real-world footage. Such real-time action recognition capability can be applied in a wide range of industries, from surveillance and security to sports and entertainment, highlighting its potential for practical use beyond controlled datasets.
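A hedged sketch of this inference stage is given below, assuming the clip has already been downloaded to a local file and reusing the frame-extraction helper from the preprocessing sketch; the model filename and class list are hypothetical.

```python
# Illustrative inference sketch (model file, class list, and helper are assumptions):
# preprocess the downloaded clip exactly as during training, then take the class
# with the highest predicted probability as the recognized action.
import numpy as np
from tensorflow.keras.models import load_model

CLASS_NAMES = ["WalkingWithDog", "TaiChi", "Swing", "HorseRace"]  # illustrative subset

model = load_model("dilated_convlstm.h5")            # hypothetical saved model
frames = extract_frames("downloaded_video.mp4")      # 20 x 256 x 256 x 3 tensor
probs = model.predict(frames[np.newaxis, ...])[0]    # probability per class
pred = int(np.argmax(probs))
print(f"Predicted action: {CLASS_NAMES[pred]} (confidence {probs[pred]:.2f})")
```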
The model was tested on three different YouTube videos, each depicting a distinct action: a child swinging at a playground, a horse race, and Tai Chi. For the first test (
Figure 6a), the model correctly identified the activity as “Swing,” demonstrating sensitivity to repetitive, cyclic motion. In the second test (
Figure 6b), the model predicted “Horse Race,” highlighting proficiency in recognizing high-speed activities involving multiple moving subjects. The final test (
Figure 6c) featured Tai Chi, characterized by slow and controlled movements; the model accurately classified it as “Tai Chi,” indicating the ability to capture subtle, fluid motion patterns. Across all three examples, the predicted labels were accompanied by high confidence scores, reflecting certainty in each decision.
This ability is essential for recognizing actions that do not involve abrupt movements but instead require a keen understanding of slower, deliberate gestures. The overall performance comparison of the proposed Dilated ConvLSTM model with other baseline architectures is summarized in
Table 1. The table presents the accuracy and loss values for both training and validation phases, demonstrating that the Dilated ConvLSTM achieves the highest accuracy (94.9%) and the lowest loss (0.20), indicating superior learning efficiency and generalization capability. These results underline the effectiveness of the Dilated ConvLSTM model in performing action recognition on diverse types of activities, ranging from rapid motions to slow, controlled movements.
The incorporation of dilated convolutions in the CNN layers enables the model to expand its receptive field, allowing it to capture more extensive spatial patterns without a corresponding increase in computational cost. This improvement enhances its ability to recognize complex spatial and temporal features across a larger context. Additionally, by integrating Long Short-Term Memory (LSTM) units, the model is able to effectively learn and predict dynamic sequential dependencies, making it well-suited for action recognition tasks. This combination of dilated convolutions and LSTMs results in superior performance compared to conventional CNN and CNN-LSTM models.
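An illustrative sketch of this dilated-CNN + LSTM design is shown below; layer sizes, dilation rates, and the four-class output are assumptions rather than the exact published configuration. Each frame is encoded by a dilated convolutional stack applied through a TimeDistributed wrapper, and an LSTM then models the 20-frame temporal sequence.

```python
# Sketch of a dilated-CNN + LSTM classifier (hyperparameters are assumptions):
# dilation enlarges the per-frame receptive field without extra parameters,
# and the LSTM captures temporal dependencies across the frame sequence.
import tensorflow as tf
from tensorflow.keras import layers, models

frame_encoder = models.Sequential([
    layers.Input(shape=(256, 256, 3)),
    layers.Conv2D(32, 3, dilation_rate=2, padding="same", activation="relu"),
    layers.MaxPooling2D(4),
    layers.Conv2D(64, 3, dilation_rate=2, padding="same", activation="relu"),
    layers.GlobalAveragePooling2D(),                  # per-frame feature vector
])

model = models.Sequential([
    layers.Input(shape=(20, 256, 256, 3)),            # 20-frame clip
    layers.TimeDistributed(frame_encoder),            # dilated CNN applied to each frame
    layers.LSTM(64),                                  # temporal modeling
    layers.Dense(4, activation="softmax"),            # number of selected classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```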
4.4. Plotting Accuracy vs. Validation Accuracy
The Total Accuracy vs. Validation Accuracy graph for the Dilated ConvLSTM model, shown in Figure 7, clearly demonstrates its superior performance. The blue curve represents total (training) accuracy over epochs, and the red curve represents validation accuracy. As seen in the graph, the model maintains a significantly higher training accuracy (94.9%) and validation accuracy (88.34%) than the other models. This indicates not only strong learning on the training data but also excellent generalization to unseen data, which is crucial for real-world applications. The close alignment between training and validation accuracy further supports the model’s robustness and its ability to handle action recognition tasks across diverse inputs. Although the two curves align closely, a slight divergence in later epochs may indicate mild overfitting, which could be mitigated through stronger regularization or larger datasets in future work.
4.5. Plotting Loss vs. Validation Loss
The Total Loss vs. Validation Loss graph, shown in Figure 8, highlights the efficient training process of the Dilated ConvLSTM model. The blue curve represents total (training) loss over epochs, and the red curve represents validation loss. As depicted, the model achieves a low training loss of 0.20 and a validation loss of 0.32, both of which are significantly lower than those of the other models. This shows that the model not only minimizes error during training but also maintains a good balance between training and validation performance, suggesting it is not overfitting. The small gap between training and validation loss indicates strong generalization, which is essential for the model’s effectiveness in real-world action recognition tasks. Future work could also incorporate ablation studies on regularization strength and alternative optimization strategies to better validate robustness.
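For reference, a minimal plotting sketch is included below (assuming a Keras History object named history returned by model.fit); it reproduces the style of the curves in Figures 7 and 8.

```python
# Minimal sketch for plotting training vs. validation curves from a Keras
# History object (assumed to exist as `history`); mirrors Figures 7 and 8.
import matplotlib.pyplot as plt

def plot_metric(history, metric, val_metric, title):
    plt.figure()
    plt.plot(history.history[metric], "b-", label=f"Training {metric}")
    plt.plot(history.history[val_metric], "r-", label=f"Validation {metric}")
    plt.xlabel("Epoch")
    plt.ylabel(metric.capitalize())
    plt.title(title)
    plt.legend()
    plt.show()

plot_metric(history, "accuracy", "val_accuracy", "Total Accuracy vs. Validation Accuracy")
plot_metric(history, "loss", "val_loss", "Total Loss vs. Validation Loss")
```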
5. Conclusions
This project effectively demonstrates the application of a Dilated CNN-LSTM (Long Short-Term Memory) model for recognizing actions in video sequences. By integrating dilated convolutional layers with LSTM units, the model captures both spatial and temporal dynamics, enabling it to identify complex actions over time. The model was trained on a varied set of video data, achieving strong performance in action classification. During the testing phase, the model was successfully applied to real-world YouTube videos, accurately predicting actions. This highlights the model’s robustness and its ability to generalize across different video scenarios. Additionally, the process of downloading, preprocessing, and evaluating the model on previously unseen data, along with its assessment based on accuracy and loss metrics, further illustrates the model’s practical applicability in real-world settings. Future research could focus on enhancing the model’s performance by refining the architecture, incorporating larger, more diverse datasets, or adapting it for real-time use. Overall, the findings suggest that the Dilated CNN-LSTM model holds considerable promise for a variety of applications, such as surveillance, entertainment, and human–computer interaction.