Drone-Action: An Outdoor Recorded Drone Video Dataset for Action Recognition

: Aerial human action recognition is an emerging topic in drone applications. Commercial drone platforms capable of detecting basic human actions such as hand gestures have been developed. However, a limited number of aerial video datasets are available to support increased research into aerial human action analysis. Most of the datasets are conﬁned to indoor scenes or object tracking and many outdoor datasets do not have sufﬁcient human body details to apply state-of-the-art machine learning techniques. To ﬁll this gap and enable research in wider application areas, we present an action recognition dataset recorded in an outdoor setting. A free ﬂying drone was used to record 13 dynamic human actions. The dataset contains 240 high-deﬁnition video clips consisting of 66,919 frames. All of the videos were recorded from low-altitude and at low speed to capture the maximum human pose details with relatively high resolution. This dataset should be useful to many research areas, including action recognition, surveillance, situational awareness, and gait analysis. To test the dataset, we evaluated the dataset with a pose-based convolutional neural network (P-CNN) and high-level pose feature (HLPF) descriptors. The overall baseline action recognition accuracy calculated using P-CNN was 75.92%.


Introduction
Drones or Unmanned aerial vehicles (UAVs) are increasingly popular due to their affordability and applicability in numerous commercial applications.Some popular application areas of drones are photogrammetry [1], agriculture [2], crowd monitoring [3], sports activity recording [3], parcel delivery [3], and search and rescue [4,5].Recently, these areas have experienced rapid advances through convergence of technologies [6].
However, limited research has been done on human-drone interactions and associated applications.Although some commercial drones can recognize gestures, the capabilities rely on high-quality videos captured at low altitudes and are limited to a small number of simple gestures.Going beyond gestures, enabling a drone to recognize general human actions is far more challenging.
A limited number of studies have been conducted to understand the sophisticated human body movements recorded from a drone [16].Reasons for the slow progress in human action recognition in aerial videos include:

•
Human action recognition is inherently a challenging problem.Most of the action recognition studies are focused on popular video datasets that consist mostly of ground-level videos [27].
Even when using high quality videos, the state of the art in action recognition is still fallible [28].

•
The quality of aerial videos is often marred by a lack of image details, perspective distortion, occlusions, and camera movements.

•
Many drone applications require online rather than offline processing, but the resource constraints of embedded hardware platforms limit the choice and complexity of action recognition algorithms.

•
There are not enough relevant video datasets to help train algorithms for action recognition in the aerial domain.Currently available aerial video datasets are mostly limited to object tracking.While there are some datasets supporting research in aerial action recognition [16], they are limited.
Based on the above, we reason that action recognition using aerial videos is less studied than general human action recognition.Recognizing the importance of datasets for action recognition research in the aerial domain, we present in this study, a new full high-definition (FHD) video dataset with rich human details, that was recorded on a drone in a controlled manner.
The work presented here is specifically focused on providing accurate and rich human body details in aerial videos recorded from a moving platform.State-of-the-art action recognition techniques mostly rely on high-resolution human videos to achieve high accuracy, typically by leveraging by computationally intensive techniques.However, many real-life aerial videos suffer from low resolution and undesired camera movement resulting from the apparatus.We believe that current limitations of aerial camera systems should not deprive action recognition research of high-quality and highly detailed aerial videos.
The dataset presented here consists of 13 action classes recorded at FHD (1920 × 1080 resolution) from a quadrotor drone flying slowly at low altitude.Some actions were recorded while the drone was hovering (e.g., kicking, stabbing, and punching), and others were recorded while the drone was following the subject (e.g., walking, jogging, and running).A hovering camera should not be confused with a stationary camera.A hovering camera can drift in any direction due to platform mobility and susceptibility to wind gusts.All of the videos were recorded in a way that preserved as much body surface area as possible.This dataset was designed to support research related to search and rescue, situational awareness, surveillance, pose estimation, and action recognition.We assume that in most practical missions, a drone operator or an autonomous drone follows these general rules:

•
Avoid flying at overly low altitude, which is hazardous to humans and equipment; • Avoid flying at overly high altitude, so as to maintain sufficient image resolution; • Avoid high-speed flying, and therefore, motion blur; • Hover to acquire more details of interesting scenes; • Record human subjects from a viewpoint that gives minimum perspective distortion.
Our dataset was created assuming the guidelines above were followed.For testing the dataset, a pose-based convolutional neural network (P-CNN) [29] was used as the baseline action recognition algorithm.P-CNN utilizes the CNN features of body parts extracted using the estimated pose.The dataset was also evaluated with high-level pose features (HLPF) [30] which recognize action classes based on the temporal relationships of body joints and their variations.The baseline accuracy and the experiment details have been compared with recently published human activity datasets.
The rest of this article is organized as follows.Section 2 discusses closely related work on popular action recognition datasets, aerial video datasets, and their limitations in terms of suitability for aerial human action recognition.Section 3 describes the steps involved in preparing the dataset, and compares the dataset with other recently published video datasets.Section 4 reports experimental results.A discussion of issues and potential improvements is presented in Section 5. Section 6 concludes.Throughout this manuscript we will use the noun "drone" synonymously with the acronym UAV which describes an aircraft in terms of not being manned.

Related Work
A number of video datasets have been published for general human action recognition and aerial human action recognition.Here, we briefly discuss some studies related to our work.
Human action recognition datasets (mostly focused on ground videos): KTH [31] and Weizmann [32] were the most popular early action recognition datasets, and they were recorded in a controlled setting with a static background.These datasets helped to progress action recognition research.UCF101 [33] was one of the largest and most diverse datasets in this category.It was created with 13,320 YouTube videos belonging to 101 action classes.JHMDB [30] and Penn action [34] are another two notable action datasets created using YouTube videos.These datasets come with human pose and action annotations.
The recently made action recognition datasets are massive in terms of their numbers of videos and action classes.Sports-1M [35] and ActivityNet [36] were two datasets introduced under this category.These datasets do not provide spatial localization details of human subjects.For example, Sports-1M videos are annotated automatically using YouTube topics and are, therefore, weakly labeled [37].The most recent additions to this category were YouTube-8M [38] Kinetics [39], and HACS [40].The Kinetics dataset consisted of 400, 600, and 700 action classes in its successive releases.HACS dataset was created with 1.55 million annotated clips across 200 classes.The action classes of both datasets cover a diverse range of daily human actions collected from YouTube videos.Diversified large datasets are useful for pre-training action recognition networks on them and enhancing networks' performances on small datasets.
Aerial datasets: The available aerial video datasets can be broadly categorized as follows based on their intended applications.

•
Object detection and tracking datasets: There has been a surge of interest in aerial object detection and tracking studies.They are mainly focused on vehicle and human detection and tracking.VisDrone [11] is the largest object detection and tracking dataset in this category.This dataset covers various urban and suburban areas with a diverse range of aerial objects (it includes 2.5 million annotated instances).The dataset has been arranged into four tracks for object detection in images/videos and single/multi object tracking.The similar but relatively smaller UAVDT dataset [41] contains 80 thousand annotated instances recorded from low altitude drones.UAV123 dataset [15] has been presented for single object tracking from 123 aerial videos.Videos in these datasets have been annotated for the ground truth object bounding boxes.

•
Gesture recognition datasets: A gesture dataset recorded from a low-altitude hovering drone was presented in [17].The UAV-Gesture dataset contains outdoor recorded of 13 drone commanding signals.NATOPS [42] is a similar indoor recorded gesture dataset consisting of 24 aircraft signal gestures.

•
Human action recognition datasets: A large-scale VIRAT dataset [13] contains 550 videos covering a range of realistic and controlled human actions.The dataset has been recorded from both static and moving cameras (called VIRAT ground and aerial datasets).There are 23 outdoor event types with 29 h of videos.A limitation of the VIRAT aerial dataset is its low resolution of 480 × 720 pixels restricts algorithms from retrieving rich activity information from relatively small human subjects.A 4k resolution Okutama-Action [16] video dataset was introduced to detect 12 concurrent actions by multiple subjects.The dataset was recorded in a baseball field using two drones.Abrupt camera movements are present in the dataset making it more challenging for action recognition.However, the 90 degree elevation angle of the camera creates severe self-occlusions and perspective distortions in the videos.Other notable datasets in this category are UCFARG [14], UCF aerial action [43], and Mini-drone [44].UCF aerial and UCF ARG datasets have been recorded from an R/C-controlled blimp and a helium balloon respectively.Both datasets have similar action classes.UCF ARG is a multi-view dataset with synchronized ground, rooftop, and aerial camera views of the actions.However, UCF aerial action is a single-view dataset.Mini-drone was developed to study various aspects and definitions of privacy in aerial videos.The dataset was captured in a car park covering illicit, suspicious, and normal behaviors.
Most state-of-the-art human action recognition systems are implemented in high-end or graphics processing unit (GPU) powered computers, and those algorithms cannot perform as intended on typical embedded platforms due to the slowing of gains in commodity processor performance despite strong market pull [45].The platform dependency and the limitations of processing power in drones have been bottlenecks to the development of onboard machine learning and AI techniques for human action or gesture recognition.Notably, action recognition from a mobile platform is not an extensively studied research area.Some notable works related to aerial image analysis are limited to offline processing of recorded videos [9,[46][47][48][49].In order to utilize the drone's mobile and operational capabilities to meet the emerging demand for drone technology, more studies are needed.In particular, the aerial action recognition studies need more datasets to develop technologies tailored to search and rescue, surveillance, and situational awareness.We tried to collect the data against relatively clutter-free backgrounds that were nonetheless normal outdoor environments.

Preparing the Dataset
This section explains the data collection process, the selected action classes, and the variations in the collected data.The dataset is compared with recently published aerial and ground video datasets.

Data Collection
The data collection was done on an unsettled road located in the middle of a wheat field.The aerial platform was a 3DR SOLO rotorcraft.The flight was slow and done at low altitudes (8-12 m).Videos were recorded using a GoPro Hero 4 Black camera with an anti-fish eye replacement lens (5.4 mm, 10 MP, IR CUT) and a 3-axis Solo gimbal.All videos in the dataset are HD (1920 × 1080) format captured at 25 fps.
A total of 13 actions were recorded while the drone was in hovering and following modes.In all videos, the subject was roughly in the middle of the frame and performed each action five to ten times except for walking, jogging, and running.For walking, jogging, and running actions, the subject was asked to traverse between two markers while the drone followed the subject from different directions.
When recording the actions, sometimes the drone drifted from its initial hovering position due to wind gusts.This adds random and characteristic camera motion to the videos corresponding to practical scenarios.
The total number of human subjects in the dataset was 10.The participants were asked to perform the actions in a selected section of the road.We divided the actions into the following three categories based on the viewpoint and the camera movement: following, side-view, and front-view.A detailed illustration of these three settings is provided in Figure 1.
Following actions: For the walking, jogging, and running actions, the subjects were recorded from four viewpoints.Each action was repeated four times while recording from the drone positioned at four different viewpoints in succession.As shown in Figure 1a, the left and right side views were recorded by maintaining a roughly perpendicular view with respect to the walking direction.However, due to mismatches in the moving speeds of the subject and the drone, in some videos, the subject was not always in the middle of the frame.Irrespective of the horizontal and vertical movements of the subject with respect to the camera frame, the videos were trimmed to include only the full body details of the person undertaking the particular action.We tried to follow and record the subject throughout the action, but in some videos, the drone was in hover mode or hover and follow modes combined.
In particular, when the subject was running fast, the recordings were mostly done in the hovering mode or had a little movement towards the direction in which the subject was moving.In terms of the camera motion, the recorded actions are from both moving and hovering modes.
Side-view actions: Five actions were recorded from both left and right sides of the subject when the drone was in hover mode; namely, punching, hitting with a bottle, hitting with a stick, kicking, and stabbing.We used mock foam weapons to represent sticks, bottles, and knives.Each action was repeated five to ten times (see Figure 1b).The actions were recorded with diversity across the class.For example, for the kicking action, each subject performed kicks towards the left and right sides of the camera frame.Each instance (left or right) had five to ten randomly performed kicks.
Front-view actions: Clapping and waving hands actions were recorded while the drone was hovering in front of the subject (see Figure 1c).In these videos, the subject was roughly in the middle of the frame and repeated each action five to ten times.

Action Class Selection
The action categories described in Section 3.1 were selected considering three application scenarios.
(i) Analyzing human movements: This includes the following actions (walking, jogging, and running).Each action was recorded from four viewpoints from the front, back, and left and right sides of the subject.In the dataset, we consider the action recorded from the front and back views as one class and the action recorded from left and right sides as another class.For example, the walking_side action class has 20 videos recorded from left and right sides of 10 subjects, while walking_front_back has 20 videos recorded from front and back viewpoints.The 120 videos in these six action classes are also suitable for gait analysis and tracking.
(ii) Violence detection: The five side-view actions were selected from common violent acts.Punching and kicking are the most common bare hand actions of humans when fighting with each other [50].It is also common to use some objects to hit or throw at each other [51].To cover these scenarios, we included hitting with a bottle, hitting with a stick, and stabbing.Each action was recorded from left and right sides of the subject, capturing the maximum body surface area possible of these asymmetric actions.There are 100 videos showing the violent actions of 10 subjects.
(iii) Signaling the drone: In some scenarios, people might try to signal to the drone for help.The most popular way of signaling someone at a visible distance is using hand gestures.We selected two actions (clapping and waving hands) for this scenario and recorded from the front view as front-view actions (Figure 2).Each class contains 10 videos.

Variations in Data
The actors were asked to perform each action five to ten times, with most of the actions being performed ten times.Each actor performed the actions differently, as they wished, capturing a variety of natural responses.All actors were volunteers from our research group and consisted of seven males and three females.
There are rich variations in the recorded actions in terms of the phase, orientation, camera movement, and the body shapes of the actors.The camera movements were caused by wind gusts and by the movement of the drone.In some videos, the scale rapidly changes.An example of scale change is the running action recorded from front or back viewpoints.The drone could not always maintain the speed of the subject in low-altitude flight.This resulted in scale variation of subjects during the course of the video.In some videos, the skin color of the actor is close to the background color.These variations create a challenging dataset for action recognition, and also makes it more representative of real-world situations.

Dataset Annotations
All of the videos have been annotated with subject ID, action class, and bounding box.Subject IDs are S1 to S10 for the ten actors.Along with bounding box annotations, body joint estimations computed using the widely used pose estimator OpenPose [52] were also included.
The experiments conducted here used three randomly generated split sets.These split sets were generated using the subject IDs associated with the videos; i.e., by avoiding overlapping of subject IDs between the train and test sets.Subject IDs can be used to generate customized split sets.The split sets used in our experiments are available for download with the dataset.

Dataset Summary
The dataset contains a total of 240 clips with 66919 frames.The number of actors in the dataset is 10 and they performed each action 5-10 times.All the videos are provided with 1920×1080 resolution and 25 fps.The average duration of each action was 11.15 sec.A summary of the dataset is given in Table 1.The total clip length and mean clip length of each class are represented on the left side (blue) and right side (amber) bar graphs of Figure 3 respectively.In Table 2, we compare our dataset with eight recently published video datasets.These datasets have helped progress research in action recognition, gesture recognition, event recognition, and object tracking.

Experimental Results
We experimented with two popular feature types used in human action recognition; namely, pose-based Cheron et al.'s CNN features [29] and Jhuang et al.'s high-level pose features [30].In this section, we report and compare experimental results using these two approaches.

High-Level Pose Features (HLPF)
HLPFs are formed by combining spatial and temporal properties of body keypoints throughout the action.We used the publicly available HLPF code [30] with minor modifications.HLPF was calculated using 15 keypoints (head, neck, shoulders, elbows, wrists, abdomen, hips, knees, and ankles).We used the pose estimator OpenPose [52] to find keypoints (see Figure 4a,b).In each frame, four spatial properties were calculated as follows:

•
Normalized positions: Each key point was normalized with respect to the belly key point.A total of 30 descriptors were obtained (x and y coordinates of 15 joints).

•
Distance relationships: The distance between two key points was calculated for all 15 key points, resulting in 105 descriptors.

•
Angular relationships: A total of 1365 angles were calculated.Each angle was that of two vectors connecting a triplet of key points.

•
Orientation relationships: 104 angles were calculated between each vector connecting two key points and the vector from neck to abdomen.
The temporal features were obtained by combining the following trajectories: • Cartesian trajectory: 30 descriptors were obtained from the translation of 15 key points along x and y axes.

•
Radial trajectory: The radial displacement of 15 key points provides 15 descriptors.

•
Distance relationship trajectory: 105 descriptors were calculated from the differences between distance relationships.

•
Angle relationship trajectory: 1365 descriptors were calculated from the differences between angle relationships.

•
Orientation relationship trajectory: 104 descriptors were calculated from the differences between distance relationships.
An HLPF vector (including 3223 descriptor types) was generated by combining the above spatial and temporal descriptors.A codebook was generated by performing k-means clustering on each descriptor type.The codeword size was selected to have k = 20.Classification was done using a SVM with linear kernels.

Pose-Based CNN (P-CNN)
P-CNN features [29] were created from person-centric motion and appearance features extracted using body joint positions.The P-CNN workflow is similar to Simonyan and Zisserman's two-stream CNN architecture [55].For this work, we used the publicly available P-CNN code with minor modifications.The process of P-CNN feature extraction is shown in Figure 5.The main modules of the process are explained below.

Pose Estimation
We used the latest version of OpenPose [52] for pose estimation (see Figure 4a,b).OpenPose is a state-of-the-art method of pose estimation [56].The pose estimator provides 18 key point estimations (eyes, ears, nose shoulders, elbows, wrists, hips, knees, and ankles).Head position was calculated by taking the center of the eyes, ears, and nose key points.Sometimes, all of the key points in the head were not visible (e.g., from a side view).The abdomen key point was calculated by taking the centroid between the shoulder and hip key points.A total of 15 key points were extracted for the experiments (head, neck, shoulders, elbows, wrists, abdomen, hips, knees, and ankles).

Optical Flow Calculation
Optical flow was calculated between all consecutive frames of the video using Brox et al.'s implementation [57].The calculated optical flow image was transformed to a heat map at the pixel level.The velocities along x and y axes, namely, v x , v y , were mapped to the range [0, 255] by [ vx , vy ] = s[v x , v y ] + c, with values outside the range [0, 255] truncated at 0 or 255, according to [58].The scalar s was calculated as s = c/a, where a = 8 was the maximum absolute value of the flow, and c = 128 was the new center of motion velocities.

Extracting Part Patches
The scale of each pose was used to extract consistent sized patches from the hands, upper body, and entire body of the subject.The scale for each pose was obtained by dividing the pose height (in pixels) by 240.In the image coordinate system, height was calculated using the maximum difference in y-coordinates between the head and ankles.
A region of interest around a body part was extracted by adding some margin to the top-left and bottom-right key point positions of the body part-we called this a part patch.The margin was calculated as C × S, where C was selected to be 100 pixels, and S was the scale.The extracted patch was resized to 224 × 224 to match the input size of the CNN model.Figure 4 illustrates the process of extracting part patches using an image from the Kicking class.

CNN Feature Aggregation
For each body part and full image, the appearance (RGB) and optical flow patches were extracted, and their CNN features were computed using two pre-trained networks (see Figure 5).For appearance patches, the publicly available "VGG-f" network [59] was used, whereas for optical flow patches, the motion network from Gkioxari and Malik's Action Tube implementation [58] was used.Both networks have the same architecture with five convolutional and three fully-connected layers.For each body part and frame t, a frame descriptor f t was obtained from the output of the second fully-connected layer.Static and dynamic features were separately aggregated over time to obtain a static video descriptor v sta and a dynamic video descriptor v dyn respectively.
Min and max aggregation schemes were used to calculate the minimum and maximum values of each descriptor dimension k (k ∈ {1, ..., 4096}) over T frames.
For static and dynamic video descriptors, the time-aggregated video descriptors were concatenated as follows: ∆ represents temporal differences in the video descriptors.The aggregated features (v sta and v dyn ) were normalized and concatenated over the number of body parts to obtain appearance features v app and flow features v o f .The final P-CNN descriptor was obtained by concatenating v app and v o f .

Performance Evaluation
The evaluation metric selected for the experiment was accuracy.Accuracy was calculated using the scores returned by the action classifiers.We experimented with the dataset by generating three split sets.Each split set was generated randomly with a 70 : 30 train-to-test ratio using the subject ID.
We report a mean classification accuracy of 64.36% for HLPF.As explained in Section 4.1, classification was performed using linear kernels.Accuracies for the three split sets were 63.89%, 68.09%, and 61.11%.
The mean accuracy reported for P-CNN was 75.92%.The accuracies of the three split set were 72.22%, 81.94%, and 73.61%.Here, the classification was performed using a linear SVM.
In Table 3, we compare our dataset with two other action recognition datasets; namely, JHMDB and MPII.Both of these datasets used Cherian et al.'s pose estimation algorithm [60].The performance metrics for the JHMDB and MPII datasets were taken from [29].
Two confusion matrices were calculated (see Figure 6) for assessing the action recognition accuracies of the HLPF and P-CNN approaches.Both HLPF and P-CNN had some difficulty distinguishing between hitting with bottle and stabbing actions, which were associated with the lowest scores in the confusion matrix for P-CNN and noticeably low scores for HLPF.The main reason for this was the similar patterns of body joint movements for these two classes of actions, although HLPF does not take into account the characteristics of associated objects.Another possibility specific to P-CNN was that the color of the bottle used in hitting with bottle was close to the background color and sometimes difficult to identify.The classifiers can also confuse between jogging and running actions, as the speeds of these actions varied from one actor to another.Overall, P-CNN showed better performance for most of the classes.
As an alternative measure of accuracy, Cohen's kappa coefficient [61] was calculated for both the HLPF and PCNN confusion matrices.Cohen's kappa was originally proposed in the area of psychology as a measure of agreement between two "judges" that each classifies some objects into some mutually exclusive categories.The coefficient was defined as κ p o −p e 1−p e , where p o is the proportion of agreements, and p e is the proportion of chance agreements.According to Landis and Koch [62], a κ value of < 0 should be interpreted as no agreement, 0-0.20 as slight, 0.21-0.40 as fair, 0.41-0.60 as moderate, 0.61-0.80 as substantial, and 0.81-1 as almost perfect agreement.In the context of multi-class classification, an agreement is a true positive.Given confusion matrix [m i,j ] and k classes, .
Using Cardillo's code [63], Cohen's κ coefficients for the HLPF and P-CNN confusion matrices were found to be 0.6433 and 0.7593 respectively.As per Landis and Koch's [62] interpretation, both κ coefficients indicated substantial agreement, with the κ for P-CNN being conspicuously higher than that for HLPF, and in fact nearly falling in the region of almost perfect agreement.
We also performed experiments by (i) combining P-CNN and HLPF features to form a new feature vector, and (ii) fusing the classification scores of P-CNN and HLPF classifiers.In both cases, the performance was degraded below the accuracy reported for P-CNN alone.Therefore, we report the P-CNN feature-based action classification as the baseline algorithm used for our dataset.

Discussion
HLPF computation is based only on the spatial and temporal relationships among the key points.It does not consider any additional information associated with the actions, such as the objects used in the actions.In comparison, P-CNN computation includes the body parts and action-specific objects visible in the part patches.This results in a higher action recognition accuracy for P-CNN compared to HLPF.This observation is applicable to all three datasets used in this work.
In both HLPF and P-CNN approaches, actions with similar body joint movements can be confused as one another, resulting in an overall reduction in accuracy.Both feature descriptors performed well with distinctive actions.It was noted that HLPF accuracy decreased when the action was associated with additional objects, while P-CNN was robust whether additional objects were involved or not.
We used the state-of-the-art pose estimator OpenPose to find key points.However, as with all estimators, sometimes it missed the key points due to varying human scales and viewpoint occlusions.The estimator can be made more robust to scale variation by increasing the input image size of the pose estimator model, at the expense of computational overhead.Perspective distortion caused by the use of an aerial camera is minimal in most of the videos, although in the front/back following actions, the perspective distortion varies when the subject moves closer to or further away from the drone.This distortion can be compensated for to some extent using an approach similar to that reported by Perera et al. [64].
This dataset contains only single-person actions.As explained in Sections 1 and 2, drone-based human action recognition is still an emerging research area.The closest research area to this is gesture-based drone control.We believe that single-person actions provide an opportunity for many researchers to start or advance their research in aerial action recognition.That is why this dataset focuses on the first step in aerial action recognition: single-person action recognition.Our ongoing research includes extending this dataset to a multi-person version.
In multi-person action recognition studies, normally, the actions are broadly categorized based on the event.A relevant dataset and approach are described in the VIRAT aerial dataset [13].When detecting the actions of a group of people, each individual in the scene should be detected and analyzed for the collective or individual actions [16].However, depending on the scene, multiple subjects can improve the detection accuracy in the actions (e.g., fighting).
This dataset has been created to help progress research in aerial action recognition.The work presented in this paper is intended to provide a sufficient amount of data to cover some common human actions.These actions can be visible from a low-altitude slow-flying drone.When recording videos, we used a relatively clutter-free background and maximized the image details by keeping the subject roughly in the middle of the frame throughout the drone's hover and follow modes.Therefore, our dataset does not include videos with complex backgrounds and challenging flight maneuvers.

Conclusions
We presented an action recognition dataset, called Drone-Action, recorded by a hovering drone.The dataset contains 240 HD videos lasting for a total of 44.6 minutes.The dataset was prepared using 13 selected actions from common outdoor human actions.The actions were recorded from 10 participants in an outdoor setting while the drone was in hover or follow mode.The rich variation of body size, camera motion, and phase, makes our dataset challenging for action recognition.The dataset contains videos recorded from a low-altitude, slow-flying drone.The action classes, the relatively large images of human subjects and high resolution image details extend the dataset's applicability to a wider research community.
We evaluated this new dataset using HLPF and P-CNN descriptors, and reported an overall baseline action recognition accuracy of 75.92% using P-CNN.This dataset is useful for research involving surveillance, situational awareness, general action recognition, violence detection, and gait recognition.The Drone-Action dataset is available at https://asankagp.github.io/droneaction.

Figure 1 .
Figure 1.The actions were recorded in three different settings using a single drone.They are (a) following actions, (b) side-view actions, and (c) front-view actions.

Figure 2 .
Figure 2. The action classes of the Drone-Action dataset.The images shown were selected randomly and cropped around the subject for clear demonstration.

Figure 3 .
Figure 3.The total clip length (blue) and the mean clip length (amber) are shown in the same graph in seconds.

Figure 4 .
Figure 4. Extracted part patches (left hand, right hand, upper body, and full body) are shown using an image from the kicking class.(a) The estimated pose is shown with the background.The image is cropped around the subject for clear demonstration.(b) The keypoint locations are used compute high-level pose features (HLPF) and posed-based convolutional neural network (cP-CNN) features.(c) Part locations were determined based-on the keypoints.Purple color bounding box covers the full image.(d) RGB and flow patches were extracted from the RGB images and optical flow images respectively, using the bounding boxes shown in (c).

Figure 5 .
Figure 5.The steps involved in a P-CNN feature descriptor calculation.For a given image patch, one RGB and one flow CNN descriptor f t is calculated per frame t.

Figure 6 .
Figure 6.Confusion matrices calculated as percentage accuracies for HLPF (on the left) and P-CNN (on the right).

Table 1 .
A summary of the dataset.

Table 2 .
Comparison of recently published video datasets.

Table 3 .
The best reported HLPF and P-CNN action recognition results for different datasets.