A Hybrid Approach Based on GAN and CNN-LSTM for Aerial Activity Recognition
Abstract
1. Introduction
2. Related Works
2.1. Handcrafted Methods
2.2. Deep Learning Methods
2.3. Data Augmentation
3. Proposed Approach
3.1. Phase 1: Data Augmentation
3.2. Phase 2: Preprocessing
3.3. Phase 3: Feature Extraction
3.3.1. Spatial Feature Extraction
3.3.2. Temporal Feature Extraction
3.4. Phase 4: WGAN for Generating Synthetic Features
3.5. Phase 5: Action Classification
4. Experiments and Results
4.1. Dataset
4.2. Implementation Setup
4.3. Results
5. Conclusions
- We propose a new hybrid GAN-based CNN-LSTM approach for aerial activity recognition that overcomes the challenges posed by the small size and class imbalance of aerial video datasets (a minimal sketch of the full pipeline follows this list).
- We propose a new data augmentation approach that combines classical data transformations with a GAN-based technique.
- To the best of our knowledge, we are the first to propose a GAN-based technique that synthesizes discriminative spatio-temporal features to augment the training of the softmax classifier.
- We show that our proposed aerial action recognition model outperforms state-of-the-art results while reducing computational complexity.
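As referenced above, the five phases chain together in a straightforward way. The following is a minimal, runnable sketch on dummy data in which every stage is an illustrative stand-in (horizontal flips for Phase 1, a random projection for the CNN-LSTM features, noise for the WGAN output), not the authors' implementation:

```python
# Minimal runnable sketch of the five-phase pipeline on dummy data; every
# stage below is an illustrative stand-in, not the authors' implementation.
import numpy as np

rng = np.random.default_rng(0)
clips = rng.random((4, 8, 224, 224, 3), dtype=np.float32)   # (clips, frames, H, W, RGB)
labels = rng.integers(0, 5, size=4)                         # 5 hypothetical action classes

# Phase 1 (augmentation, stand-in): horizontal flips double the dataset.
clips = np.concatenate([clips, clips[:, :, :, ::-1, :]])
labels = np.tile(labels, 2)

# Phase 2 (preprocessing): the dummy frames are already 224x224x3 in [0, 1].

# Phase 3 (feature extraction, stand-in): global pooling plus a random
# projection to 32-D, standing in for the trained CNN-LSTM features.
pooled = clips.mean(axis=(1, 2, 3))                         # (N, 3)
features = pooled @ rng.standard_normal((3, 32)).astype(np.float32)

# Phase 4 (stand-in): noise standing in for WGAN-generated synthetic features.
features = np.concatenate([features, rng.standard_normal((4, 32)).astype(np.float32)])
labels = np.concatenate([labels, rng.integers(0, 5, size=4)])

# Phase 5: softmax classification over the combined real + synthetic features.
logits = features @ rng.standard_normal((32, 5)).astype(np.float32)
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)                   # (12, 5) class probabilities
```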
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
**CNN-LSTM architecture and training parameters**

| Parameter | Value |
|---|---|
| Input frame resizing | 224 × 224 × 3 |
| Number of CNN layers | 4 |
| Filter sizes | 16, 32, 64, and 128 |
| Kernel size | 3 × 3 |
| Max pooling | Yes |
| Number of LSTM units | 32 |
| Epochs | 100 |
| Batch size | 30 |
| Optimizer | Adam |
| Loss function | Categorical cross-entropy |
| Dropout rate | 0.25 |
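The table fixes the layer sizes but not how they are wired. The Keras sketch below is one plausible reading, with TimeDistributed 2D convolutions feeding the 32-unit LSTM; the frame count per clip, the number of classes, and the placement of the 0.25 dropout are assumptions not specified by the table:

```python
# A minimal Keras sketch of the CNN-LSTM, assuming TimeDistributed 2D
# convolutions; num_frames, num_classes, and dropout placement are guesses.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_lstm(num_frames=16, num_classes=10):
    inputs = layers.Input(shape=(num_frames, 224, 224, 3))   # resized RGB frames
    x = inputs
    for filters in (16, 32, 64, 128):                        # four CNN layers
        x = layers.TimeDistributed(
            layers.Conv2D(filters, (3, 3), padding="same", activation="relu"))(x)
        x = layers.TimeDistributed(layers.MaxPooling2D((2, 2)))(x)
    x = layers.TimeDistributed(layers.Flatten())(x)          # per-frame feature vectors
    x = layers.LSTM(32)(x)                                   # temporal modeling
    x = layers.Dropout(0.25)(x)
    return models.Model(inputs, layers.Dense(num_classes, activation="softmax")(x))

model = build_cnn_lstm()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=100, batch_size=30)     # per the table
```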
**WGAN generator parameters**

| Parameter | Value |
|---|---|
| Noise vector dimension *z* | 100 |
| Number of layers | 4 |
| Activation functions | LeakyReLU (hidden); hyperbolic tangent (output) |
| Neurons per layer | 128, 256, 512, 32 |
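A matching generator sketch follows, assuming the 32-D output is the spatio-temporal feature dimension (equal to the number of LSTM units above) and a LeakyReLU slope of 0.2, neither of which the table specifies:

```python
# WGAN generator sketch: Dense 128 -> 256 -> 512 -> 32, LeakyReLU hidden
# activations, tanh output; the 0.2 slope is an assumption.
from tensorflow.keras import layers, models

def build_generator(z_dim=100, feat_dim=32):
    model = models.Sequential(name="wgan_generator")
    model.add(layers.Input(shape=(z_dim,)))                 # noise vector z
    for units in (128, 256, 512):
        model.add(layers.Dense(units))
        model.add(layers.LeakyReLU(0.2))
    model.add(layers.Dense(feat_dim, activation="tanh"))    # synthetic 32-D feature
    return model
```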
**WGAN critic (discriminator) parameters**

| Parameter | Value |
|---|---|
| Number of layers | 4 |
| Activation functions | LeakyReLU (hidden); linear (output) |
| Neurons per layer | 512, 256, 128, 1 |
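And the corresponding critic, whose linear scalar output is what the Wasserstein loss requires; again a sketch, with the LeakyReLU slope assumed:

```python
# WGAN critic sketch: Dense 512 -> 256 -> 128 -> 1 with a linear output,
# so it scores features rather than classifying them.
from tensorflow.keras import layers, models

def build_critic(feat_dim=32):
    model = models.Sequential(name="wgan_critic")
    model.add(layers.Input(shape=(feat_dim,)))
    for units in (512, 256, 128):
        model.add(layers.Dense(units))
        model.add(layers.LeakyReLU(0.2))
    model.add(layers.Dense(1))                              # raw critic score, no activation
    return model
```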
**WGAN training parameters**

| Parameter | Value |
|---|---|
| Optimizer | RMSprop (learning rate = 0.00001) |
| Batch size | 128 |
| Number of epochs | 50,000 |
| Loss function | Wasserstein loss |
| Gradient penalty (GP) coefficient λ | 10 |
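Putting the pieces together, one training step under these settings might look like the sketch below, reusing the generator and critic builders above. The gradient penalty is computed on random interpolates between real and generated features and weighted by λ = 10:

```python
# One WGAN-GP training step per the table (RMSprop, lr = 1e-5, lambda = 10);
# a sketch assuming the build_generator/build_critic helpers defined earlier.
import tensorflow as tf

gen, critic = build_generator(), build_critic()
g_opt = tf.keras.optimizers.RMSprop(learning_rate=1e-5)
c_opt = tf.keras.optimizers.RMSprop(learning_rate=1e-5)
GP_WEIGHT, BATCH, Z_DIM = 10.0, 128, 100

@tf.function
def train_step(real_feats):                 # real_feats: (BATCH, 32) CNN-LSTM features
    z = tf.random.normal((BATCH, Z_DIM))
    with tf.GradientTape() as c_tape:
        fake = gen(z, training=True)
        # Gradient penalty on random interpolates between real and fake features.
        eps = tf.random.uniform((BATCH, 1))
        inter = eps * real_feats + (1.0 - eps) * fake
        with tf.GradientTape() as gp_tape:
            gp_tape.watch(inter)
            inter_score = critic(inter, training=True)
        grads = gp_tape.gradient(inter_score, inter)
        gp = tf.reduce_mean((tf.norm(grads, axis=1) - 1.0) ** 2)
        # Wasserstein critic loss: E[score(fake)] - E[score(real)] + lambda * GP.
        c_loss = (tf.reduce_mean(critic(fake, training=True))
                  - tf.reduce_mean(critic(real_feats, training=True))
                  + GP_WEIGHT * gp)
    c_opt.apply_gradients(zip(c_tape.gradient(c_loss, critic.trainable_variables),
                              critic.trainable_variables))
    with tf.GradientTape() as g_tape:
        g_loss = -tf.reduce_mean(critic(gen(z, training=True), training=True))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, gen.trainable_variables),
                              gen.trainable_variables))
    return c_loss, g_loss
```

Note that the table does not specify how often the critic is updated per generator step; the single alternation above is an assumption, whereas WGAN-GP training commonly uses several critic updates (typically five) per generator update.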
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).