Detecting Human Actions in Drone Images Using YoloV5 and Stochastic Gradient Boosting

Human action recognition and detection from unmanned aerial vehicles (UAVs), or drones, has emerged as a popular technical challenge in recent years, since it relates to many use-case scenarios, from environmental monitoring to search and rescue. It faces a number of difficulties, mainly due to image acquisition and content as well as processing constraints. Since drones’ flying conditions constrain image acquisition, human subjects may appear in images at variable scales and orientations and with varying degrees of occlusion, which makes action recognition more difficult. We explore low-resource methods for machine learning (ML)-based action recognition using a previously collected real-world dataset (the “Okutama-Action” dataset). This dataset contains representative situations for action recognition, yet is controlled for image acquisition parameters such as camera angle and flight altitude. We investigate a combination of object recognition and classifier techniques to support single-image action identification. Our architecture integrates YoloV5 with a gradient boosting classifier; the rationale is to couple a scalable and efficient object recognition system with a classifier that is able to incorporate samples of variable difficulty. In an ablation study, we test different architectures of YoloV5 and evaluate the performance of our method on the Okutama-Action dataset. Our approach outperformed previous architectures applied to the Okutama dataset, which differed in their object identification and classification pipelines: we hypothesize that this is a consequence of both YoloV5’s performance and the overall adequacy of our pipeline to the specificities of the Okutama dataset in terms of the bias–variance tradeoff.


Introduction
In recent years, drones and unmanned aerial vehicles (UAVs) have found numerous applications in urban surveillance, search and rescue, and situational awareness. Several of these applications require the ability to recognize actions from UAV cameras, either through video or single-image analysis. Human action recognition is considered a challenging task, which has been addressed to a fair extent over the last decade [1]. However, extending this task to drone-captured images and videos is an emerging topic. Human action recognition is a well-studied problem, which is categorized into (i) pose-based [2][3][4], (ii) single-image-based [5][6][7], and (iii) video-based action recognition [8][9][10]. However, detecting actions in single images is a less explored area because it faces the problem of the unavailability of annotated temporal data for action detection [11][12][13]. This task requires the integration of components for entity detection and classification that can be adapted to the distribution of target situations, as well as to practical deployment constraints. With the availability of popular and efficient object detection methods such as Yolo, it becomes possible to envision solutions that build an action recognition module on top of object detection. Eweiwi et al. [2] built an efficient pose-based action recognition method. The recognition and detection of actions in aerial images, in contrast, is a less developed area, and differs from previous work that simply adopts the perspective of pedestrians in the scene [31,35].
Over time, several aerial action datasets have been collected. The UCF-ARG dataset [36] contains 10 realistic human actions captured in three settings: ground, rooftop, and aerial cameras. It is considered a challenging dataset, as it contains various instances of camera motion and humans tend to occupy only a few pixels within images. Perera et al. [37] recorded a slow, low-altitude (about 10 ft) UAV video dataset for detecting 13 different gestures in aerial videos. These gestures are mainly related to UAV navigation and aircraft handling. The authors investigated a pose-based convolutional neural network (CNN) for this work. Ding et al. [38] overcame the challenging problem of heavy computation by devising a lightweight model for real-world drone action recognition. The backbone architecture of this method is a temporal segment network with MobileNetV3, where the temporal structures are responsible for capturing self-attention and a focal loss emphasizes misclassified samples. Geraldes et al. [39] proposed a UAV-based situational awareness system called Person-Action-Locator (PAL). The PAL system is robust enough to automatically detect people and then recognize their actions in near-real-time.
Mliki et al. [40] introduced a two-stage methodology for recognizing human activities in UAV-captured videos. The first stage generates human/non-human entity and human activity models using CNNs. The second, inference phase employs the CNN-based human activity models to recognize human activities, using majority voting over the whole video sequence. Choi et al. [41] investigated the emerging problem of action recognition in drone videos using unsupervised and semi-supervised domain adaptation. Their method transfers knowledge from the source to the target domain using a video- and instance-based adaptation methodology. The authors also created a dataset of 5250 videos for evaluating their method.
Barekatain et al. [42] presented the Okutama-Action dataset as a concurrent aerial-view dataset for human action recognition and detection. This dataset contains 43 min of video with 12 different action categories. The Okutama-Action dataset poses a generic challenge due to the realistic conditions of video acquisition, which result in dynamic transitions of actions, as well as challenges specific to single-image action detection, such as significant changes in the scale and aspect ratio of subjects, abrupt camera movement, side and top views of the subjects, and multi-labeled actors.
If we could devise a working pipeline for such a dataset, it would make the approach better suited to processing real-world situations.

Methodology
The offline pipeline of our proposed method for action detection and recognition in drone images is illustrated in Figure 1. In the first stage, a camera mounted on a drone at either 45° or 90° captures the outdoor scene. In the second stage, these drone-captured images are input into a single-stage YoloV5 detector. A gradient boosting classifier then accepts the output of the Yolo detector and detects and recognizes the different actions. The final stage draws the bounding boxes and a confidence score for each prediction. We present a detailed diagram of the YoloV5 architecture and the gradient boosting classifier in Figure 2.
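The detect-then-classify flow above can be sketched as follows. This is a minimal illustration only: `detect_people` and `crop_features` are hypothetical stand-ins for the YoloV5 detector and its feature extraction, and the classifier is trained on synthetic features rather than Okutama crops.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical stand-in for the YoloV5 detector: returns person boxes
# as (x1, y1, x2, y2, confidence) for one image.
def detect_people(image):
    return [(10, 10, 50, 90, 0.92), (120, 40, 160, 130, 0.88)]

# Hypothetical feature extractor for a cropped detection; a real system
# would use richer features than raw pixel statistics.
def crop_features(image, box):
    x1, y1, x2, y2, _ = box
    crop = image[y1:y2, x1:x2]
    return [crop.mean(), crop.std(), (x2 - x1) / (y2 - y1)]

# Train a toy action classifier on synthetic 3-dimensional features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.integers(0, 3, size=200)          # 3 toy action classes
clf = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(X, y)

# Final stage: one predicted action label per detected person box.
image = rng.integers(0, 255, size=(200, 200, 3)).astype(float)
for box in detect_people(image):
    action = clf.predict([crop_features(image, box)])[0]
    print(box[:4], "-> action", action)
```

The key design point is that the detector and classifier stay decoupled: the detector can be swapped for any YoloV5 variant without retraining the classifier interface.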

YoloV5 for Drone Action Detection
Two-stage object detectors, including the R-CNN series [43][44][45], have been a popular choice among the research community. Compared to two-stage detectors, single-stage detectors are faster because they simultaneously predict the bounding box and the class of objects; the compromise is a slight drop in accuracy. Prominent single-stage detectors include the Yolo series [22][23][24][25], SSD [46], and RetinaNet [47]. YoloV5 is one of the most popular detection algorithms due to its speed and high accuracy. YoloV5 divides the image into a grid system, where each cell in the grid is responsible for detecting objects within itself. This approach provides a specific advantage when multiple objects are involved, which is of particular importance for multi-person action recognition [48]. YoloV5 is the most recent model of the Yolo detection family and incorporates the best architectural choices of that family. We use the terms object detection and action detection interchangeably.
There are five different models of YoloV5: YoloV5s, YoloV5m, YoloV5l, YoloV5x, and YoloV5n, which offer options adapted to different computational and deployment constraints. We choose YoloV5 as our baseline, with CSPDarknet53 as the backbone, PANet as the neck, and the Yolo detection head, forming a single-stage detector [22]. During training, we noticed that YoloV5x outperforms the other models, i.e., YoloV5s, YoloV5m, YoloV5l, and YoloV5n. One clear disadvantage of YoloV5x is its longer training time and larger model size. YoloV5n is the tiny version of YoloV5; it reduces the depth of YoloV5s by one-third and therefore achieves a 75% reduction in model parameters (7.5 M to 1.9 M), which makes it an ideal choice for deployment on mobile devices and CPU-only machines. The YoloV5 architecture also includes other recent advancements, namely YoloV5-P5 and YoloV5-P6. YoloV5-P5 models have three output layers, P3, P4, and P5, with stride sizes of 8, 16, and 32 at an image size of 640 × 640. YoloV5-P6 models, on the other hand, have four output layers, P3, P4, P5, and P6, with stride sizes of 8, 16, 32, and 64, trained at an image size of 1280 × 1280. The YoloV5-P6 layer with a stride size of 64 works well for detecting larger objects in high-resolution training images. At the time of this study, YoloV7 had not been released; however, YoloV5 as a single-stage detector is an appropriate choice for single-image action detection, since it comes in different variants, thus offering a compromise between usability and performance for a UAV scenario.
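The stride arithmetic above can be checked directly: each output layer divides the input resolution by its stride, so a P5 model at 640 × 640 produces 80 × 80, 40 × 40, and 20 × 20 grids, and a P6 model at 1280 × 1280 adds a coarser level for large objects.

```python
# Grid size (cells per side) of each output layer: image size // stride.
def grid_sizes(image_size, strides):
    return {f"P{i + 3}": image_size // s for i, s in enumerate(strides)}

# YoloV5-P5: strides 8/16/32 at 640x640.
print(grid_sizes(640, [8, 16, 32]))        # 80, 40, 20 cells per side
# YoloV5-P6: strides 8/16/32/64 at 1280x1280; the stride-64 P6 layer
# still yields a 20-cell grid, but each cell covers a larger region.
print(grid_sizes(1280, [8, 16, 32, 64]))   # 160, 80, 40, 20 cells per side
```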

Gradient Boosting Classifier
Previous work on the Okutama dataset explored various architectures based on a pipeline of entity recognition and temporal feature recognition using specialized variants of a classical CNN-LSTM approach. Barekatain et al. [42], more specifically, used an SSD classifier and ensembled two streams, RGB and optical flow, for action detection, where both streams work in a complementary fashion. Geraldes et al. [39] also ensembled the outputs of two architectures, POINet and ActivityNet. POINet detects the bounding box for each person in the scene using a CNN; meanwhile, ActivityNet employs an LSTM to compute the temporal features and action labels for each person.
The new architecture we introduce here adopts a similar philosophy, but aims at upgrading the individual components for object identification and action classification, while also substituting boosting for the RNN component. Although boosting does not handle temporal information as naturally as an RNN, this is less of an issue for single-image action recognition. In addition, boosting offers more flexibility in terms of learning behavior and is gaining popularity for action recognition tasks.
It should be noted that some previous work has explored a tight integration between CNNs and boosting for vision tasks, e.g., by incorporating boosting weights into the deep learning architecture [60]. In contrast, our use of boosting follows the aforementioned pipeline processing philosophy, with independent processing stages.
We follow the same rationale in our work by using gradient boosting, which ensembles the outputs of different classifiers working in a complementary fashion. The Okutama-Action dataset contains images captured at camera angles of 45° and 90°, at an altitude of 30 m, and with varying distances between subjects and camera. Subjects in the dataset images are also self-occluded and occluded by different objects in the scene. All these factors contribute to the high variance of the Okutama-Action dataset.
Gradient boosting (GB) classifiers are good at mitigating high variance and high bias, which cause overfitting and underfitting problems, respectively. Gradient boosting significantly reduces high variance by decreasing the learning rate, because a higher learning rate captures the variation among training samples more aggressively [61]. On the other hand, gradient boosting controls high bias by increasing the number of boosting rounds, where each round corresponds to the addition of a new decision tree [62,63]. The bias term consistently decreases as the number of boosting rounds increases. Gradient boosting thus provides a mechanism for reducing both the bias and the variance components of the expected prediction error. Training a gradient boosting model with a low learning rate and a higher number of boosting rounds results in low bias and low variance and correspondingly improves model performance. This is another strong motivation for using gradient boosting as a powerful general-purpose learning algorithm in our work.
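The tradeoff described above can be illustrated on synthetic data; the dataset and the two hyperparameter pairs below are illustrative, not the settings used in our experiments.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification task with label noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(600, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=600) > 0).astype(int)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

# A high learning rate with few rounds vs. a low learning rate with many
# rounds: the second setting trades per-round aggressiveness for more
# boosting rounds, targeting low bias AND low variance.
for lr, rounds in [(1.0, 20), (0.05, 400)]:
    gb = GradientBoostingClassifier(learning_rate=lr, n_estimators=rounds,
                                    max_depth=3, random_state=0).fit(Xtr, ytr)
    print(f"lr={lr}, rounds={rounds}: "
          f"train={gb.score(Xtr, ytr):.3f}, test={gb.score(Xte, yte):.3f}")
```

A large train/test gap in the first setting signals variance (overfitting), while the second setting typically closes that gap.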
Gradient boosting is often considered to perform better than other machine learning algorithms, such as bagging and random forests, because it adjusts weights using decision tree predictions. Modern GB implementations also support cross-validation, efficient handling of missing data, regularization to avoid overfitting, tree pruning, and parallelized tree building [64]. Gradient boosting fits a nonlinear (piecewise) decision boundary, while a linear SVM always fits linear boundaries even if the dataset is not linearly separable; therefore, GB brings more flexibility and can perform better than polynomial-kernel SVM approaches [65].
GB assigns different weights to different samples, such that difficult-to-classify samples are weighted more heavily whilst easily-classified samples receive less weight. In gradient boosting, weak learners are sequentially added to better classify the difficult samples. We employ the log-likelihood as the loss function for the gradient boosting classifier. Gradient boosting is explained in Algorithm 1 and illustrated in Figure 3. During experimentation, we used 100 trees (boosting rounds) with a maximum depth of 3 and a learning rate of 0.001. We used the Scikit-learn package for the implementation of the gradient boosting algorithm [66].
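A minimal sketch of this configuration in Scikit-learn, using the hyperparameters reported above (100 rounds, depth 3, learning rate 0.001; Scikit-learn's default objective is the log-loss/deviance used here). The feature vectors and labels are synthetic placeholders, not Okutama features.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Hyperparameters as reported in the text.
clf = GradientBoostingClassifier(n_estimators=100, max_depth=3,
                                 learning_rate=0.001, random_state=0)

# Synthetic stand-in for per-detection feature vectors and action labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = rng.integers(0, 12, size=300)          # 12 Okutama action classes
clf.fit(X, y)
print(clf.predict(X[:3]).shape)            # one action label per detection
```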

Description of Architecture
In our work, gradient boosting is implemented by increasing the frequency of difficult samples in order to learn from them better. Gradient boosting adjusts the weights based on the previous decision tree's predictions. The residual error is computed, added to the initial values, and then fine-tuned, so that the final prediction approaches the ground-truth values.
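The residual-fitting loop just described can be sketched on a toy regression problem; this is a hand-rolled illustration of the mechanism, not our actual pipeline.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy target: fit y = sin(x) by repeatedly fitting small trees to residuals.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0])

pred = np.full_like(y, y.mean())           # initial value
lr = 0.1                                   # shrinkage (learning rate)
for _ in range(50):
    residual = y - pred                    # error w.r.t. ground truth
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    pred += lr * tree.predict(X)           # fine-tune toward ground truth

print(round(float(np.mean((y - pred) ** 2)), 4))  # MSE shrinks toward 0
```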
The main challenges that occur during action detection in the Okutama-Action dataset are that (i) human subjects are not as well-exposed to the camera as subjects in images taken on the ground, (ii) subjects are very small compared to the whole image, and (iii) subjects appear from different camera viewpoints.

Input
Following common practice in object detection, the input image is expected to have a size of H × W × C, where H and W correspond to the image height and width, and C represents the number of channels. In our case, the number of channels is set to three (RGB).

Backbone
Multi-scale features are extracted using either ResNet [50], ResNeXt [67], or DenseNet [51] as the encoder. These features are made compatible and input into a feature pyramid network (FPN). The feature maps extracted at stages 1 to N are denoted C1 to CN, respectively.

Feature Pyramid Network (FPN)
The feature pyramid network helps to detect objects at different scales. The lower-level layers in the FPN have higher resolution but fewer semantic details, whilst higher-level layers have lower resolution but stronger semantic meaning. The residual connections fuse features between different layers and thus facilitate the detection of smaller objects.
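The top-down fusion just described can be sketched in a few lines of numpy, with nearest-neighbour upsampling and element-wise addition; the feature map sizes are illustrative.

```python
import numpy as np

# Nearest-neighbour 2x upsampling of a coarse feature map.
def upsample2x(f):
    return f.repeat(2, axis=0).repeat(2, axis=1)

# FPN-style fusion: upsample the coarser (semantically stronger) map and
# add the lateral, higher-resolution map of the finer level.
def fpn_fuse(c_fine, c_coarse):
    return c_fine + upsample2x(c_coarse)

rng = np.random.default_rng(0)
c4 = rng.normal(size=(40, 40))   # higher level: low resolution, strong semantics
c3 = rng.normal(size=(80, 80))   # lower level: high resolution, weak semantics
p3 = fpn_fuse(c3, c4)
print(p3.shape)                  # fused map keeps the finer resolution
```

The fused map keeps the fine spatial resolution needed to localize small subjects while inheriting semantics from the coarse level.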

Detection Heads
The prediction heads carry out per-pixel prediction, where a prediction is output from three heads of similar architecture, i.e., a 2D convolution → group normalization → a rectified linear unit (ReLU). The three heads are the centerness head, the class prediction head, and the box-regression head, as shown in Figure 2.
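A shape-level sketch of the three heads, with a random linear projection standing in for the conv → GN → ReLU stack; the channel counts and feature map size are assumptions for illustration.

```python
import numpy as np

# Per-pixel head: projects an H x W x C feature map to out_channels
# per pixel, followed by a ReLU (stand-in for conv + group norm + ReLU).
def head(features, out_channels, rng):
    h, w, c = features.shape
    weights = rng.normal(size=(c, out_channels)) * 0.01
    return np.maximum(features @ weights, 0.0)   # ReLU

rng = np.random.default_rng(0)
feat = rng.normal(size=(80, 80, 256))            # assumed feature map
centerness = head(feat, 1, rng)                  # (80, 80, 1)
class_pred = head(feat, 12, rng)                 # (80, 80, 12) action classes
box_reg = head(feat, 4, rng)                     # (80, 80, 4) box offsets
print(centerness.shape, class_pred.shape, box_reg.shape)
```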

Network Settings
In our experiments, we used CSPDarknet53 and ResNet-50 as backbone architectures. We chose all network parameters empirically; e.g., the initial base learning rate was set to 0.001, with a weight decay of 0.0001 and a momentum of 0.9. The image mosaic parameter was set to high to take advantage of data augmentation. We ran our experiments for 200 epochs with a batch size of 32. We implemented our method on the PyTorch platform [68], while the Scikit-learn library was used to implement the gradient boosting algorithm with 100 trees and a depth of 3. A Tesla P100-PCIE GPU with CUDA 10.2 was the hardware used for this implementation.
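For reference, a single SGD parameter update with the settings above (learning rate 0.001, weight decay 0.0001, momentum 0.9) can be written out by hand; this is a sketch of the standard momentum-SGD update rule, not our training code, and the parameter and gradient values are arbitrary.

```python
import numpy as np

lr, weight_decay, momentum = 0.001, 0.0001, 0.9

w = np.array([1.0, -2.0])          # arbitrary parameter vector
grad = np.array([0.5, 0.5])        # arbitrary gradient
velocity = np.zeros_like(w)

g = grad + weight_decay * w        # L2 weight decay folded into the gradient
velocity = momentum * velocity + g # momentum buffer accumulates gradients
w = w - lr * velocity              # parameter step
print(w)
```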

Dataset Description
The Okutama-Action dataset [42] consists of 43 min of annotated video sequences covering 12 different outdoor action categories, with about 77,000 image frames. This is a challenging dataset because it includes abrupt camera motion, dynamic transitions of actions, scale variation due to the near-far movement of the drone, and variation in the aspect ratio of actors performing different actions. Some example images of the Okutama-Action dataset are shown in Figure 4. The Okutama-Action videos were captured using a DJI Phantom 4 UAV at a baseball field in Okutama, Japan. The twelve actions of the Okutama-Action dataset are grouped into three types: • Human-to-human interaction (handshaking, hugging). • Human-to-object interaction (reading, drinking, pushing/pulling, carrying, calling). • No interaction (running, walking, lying, sitting, standing). The Okutama-Action dataset covers two different scenarios for data collection, in morning and noon settings, to incorporate different lighting conditions (sunny and cloudy). Additionally, the dataset was captured using two different drones operated by two different pilots with different speeds and maneuvers. For some of the videos, metadata was provided for altitude, speed, and gimbal angle. During data collection, a 4K high-resolution UAV-mounted camera was operated at 30 FPS.

Ablation Study
We evaluate the performance of our proposed models using precision, mAP@50%IoU, recall, and F1-score.
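These metrics can be sketched from first principles: IoU determines whether a detection counts as a true positive (here at the 50% threshold), and precision, recall, and F1-score follow from the true/false positive and false negative counts. The box coordinates and counts below are illustrative.

```python
# Intersection-over-union of two boxes given as (x1, y1, x2, y2).
def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

# Precision, recall, and F1 from detection counts at a fixed IoU threshold.
def prf1(tp, fp, fn):
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

print(round(iou((0, 0, 10, 10), (5, 0, 15, 10)), 3))   # half-overlapping boxes
print(prf1(tp=80, fp=20, fn=10))                       # illustrative counts
```

mAP@50%IoU additionally averages the precision over recall levels (and over classes), using 50% IoU as the match criterion.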

Baseline Model
We define YoloV5s and YoloV5s6 as our baseline models. YoloV5s belongs to the YoloV5-P5 family and has three output layers with strides of 8, 16, and 32 for an image size of 640 × 640, whilst YoloV5s6 belongs to YoloV5-P6, having four output layers with strides of 8, 16, 32, and 64 for an image size of 1280 × 1280. We report the performance of YoloV5s and YoloV5s6 in Table 1, whilst a performance chart for each category of actions, primary and secondary, is reported in Table 2. Primary actions are performed by a single subject without any interaction with another subject or object (e.g., run, walk, lying, sit, stand). Secondary actions may involve interaction with an object (e.g., the read, call, and drink actions require a book, phone, and bottle) or interaction with another subject (e.g., handshake, hug). We conducted our experiments with different YoloV5 architectures and report our findings in Table 3. We tested different architectures of YoloV5-P5, namely YoloV5s, YoloV5n, YoloV5m, YoloV5l, and YoloV5x, and of YoloV5-P6, namely YoloV5s6, YoloV5n6, YoloV5m6, YoloV5l6, and YoloV5x6. YoloV5n and YoloV5n6 are the smallest in size, with a slight compromise in accuracy, which makes them an ideal choice for deployment in drone applications, where computational resources are always constrained. Our experiments were mainly run offline, but the small Yolo architectures are suited to online operation, as shown in Table 3. The YoloV5-P6 output layer with stride 64 performs well for detecting larger objects in high-resolution images. The YoloV5x and YoloV5x6 architectures are the largest in size and performed better than the other architectures. Table 3. Comparison among different YoloV5 architectures. The first main comparison is between YoloV5-P5 and YoloV5-P6. The parameters are measured in millions (M), average precision in %, training time in hours, and model size in MB.
In the YoloV5 model names, the suffixes "s", "n", "m", "l", and "x" refer to the small, nano, medium, large, and extra-large network architectures.

Comparison with State-of-the-Art Methods
We present a comparison of our proposed method with other state-of-the-art methods in Table 4. Barekatain et al. [42] used SSD for detecting actions in the Okutama-Action dataset with an image size of 512 × 512. The authors ran their experiments on RGB and optical flow streams and then combined the results of both streams for better accuracy. During experimentation, they found that the SSD model performed best when the camera angle was 45 degrees. They also noticed that strongly temporally related actions, e.g., running, resulted in lower recognition accuracy because SSD performs detection sequentially, frame by frame. Soleimani et al. [69] proposed a two-stage architecture for identifying the action categories in the Okutama-Action dataset. The first stage relies on SSD for finding objects of interest, whereas the second stage uses another CNN to learn the latent sub-space associating aerial imagery and action labels. The main difference between the SSD and Yolo architectures lies in the handling of bounding boxes: SSD treats each bounding box prediction as a regression problem, whereas the Yolo architecture computes non-maximum suppression and thus retains the final bounding box. Yolo methods are slightly better than SSD at detecting smaller objects, which is exactly the situation in our scenario due to the distance between flying drones and human subjects on the ground. Our proposed architecture resulted in a significant increase of 28.3% in mAP at 50% IoU. Geraldes et al. [39] devised POINet (Position and Orientation Invariant Neural Networks) employing MobileNetV2 [70] and reported the action detection performance separately for primary and secondary actions. The performance of our proposed method in terms of primary and secondary actions is also better than that of [39].
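The non-maximum suppression step mentioned above can be sketched in a few lines: among heavily overlapping predictions, only the highest-confidence box is kept. The boxes, scores, and threshold below are illustrative.

```python
import numpy as np

# IoU of two boxes given as arrays (x1, y1, x2, y2).
def iou(a, b):
    x1, y1 = np.maximum(a[:2], b[:2])
    x2, y2 = np.minimum(a[2:], b[2:])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

# Greedy NMS: visit boxes by descending score, keep a box only if it does
# not overlap an already-kept box above the threshold.
def nms(boxes, scores, thresh=0.5):
    keep = []
    for i in np.argsort(scores)[::-1]:
        if all(iou(boxes[i], boxes[j]) < thresh for j in keep):
            keep.append(int(i))
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # the overlapping lower-score box is suppressed
```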
In Table 4, we explicitly mention the type of classifier for each methodology; we find that the gradient boosting classifier with the Yolo action detector performs better than the fully-connected feed-forward classifiers paired with the other detectors. The qualitative detection results of our proposed methodology are presented in Figure 5. Table 4. Comparison of our proposed method with other state-of-the-art methods for the Okutama-Action dataset. We report YoloV5x6 as the best-performing Yolo architecture. The performance is measured in mean average precision (%) for 50% IoU. PA stands for primary actions, while SA stands for secondary actions. FFNN represents fully-connected feed-forward neural networks, while GB denotes gradient boosting.


Conclusions
Deep learning is a method of choice for drone action recognition and detection due to its high performance and easy deployment. In this paper, we proposed a Yolo-based framework with gradient boosting for detecting and recognizing twelve different actions. We investigated the performance of our model on the Okutama-Action dataset, where it outperformed other methods for individual actions as well as for primary and secondary actions. We evaluated the performance of our method under different settings, such as morning or noon recording and a gimbal angle of 45° or 90°, at an altitude of 30 m. We also evaluated the performance of different variants of YoloV5. Since YoloV5 is efficient at detecting multiple objects in an image, and the Okutama-Action dataset contains concurrent actions in a frame, our proposed method achieved better results for single-image action detection than other methods in the literature that may include explicit temporal information, e.g., LSTM or 3D-CNN.
The main limitation of our work is that the performance of our algorithm may degrade if the gimbal angle approaches 90° or if the drone flies above 30 m. Moreover, the speed of the drone may also negatively affect the performance of our algorithm due to motion blurring. In addition, the low lighting conditions of early morning and evening, and extreme weather conditions such as rain, clouds, or snow, may severely degrade the performance of our method.
YoloV7 shows good results for pose estimation and is beneficial for single-image object detection; therefore, in future work, YoloV7 could be a good choice for action detection. As an alternative to YoloV7, self-attention transformers and their variants, e.g., ViT or ScaleViT, are good candidates to replace the YoloV5 and gradient boosting pipeline for action detection.
Author Contributions: H.P. and Y.M. conceptualized the idea of action recognition and detection in drone images for the Okutama-Action dataset. T.A. made an early investigation of Yolo models for the Okutama dataset and devised a methodology in collaboration with H.P., M.C., Y.M. and T.A. carried out experiments and validated the results. All authors contributed to the preparation of the manuscript. Finally, the funding for this project was acquired by Y.M. and H.P. All authors have read and agreed to the published version of the manuscript.
Funding: This work was partially supported by a donation from Matsuo Institute, Japan.
Institutional Review Board Statement: Not Applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.