1. Introduction
The proliferation of unmanned aerial vehicles (UAVs) across diverse application domains has fundamentally transformed approaches to data collection, monitoring, and inspection tasks [1]. Modern quadcopters are deployed extensively in infrastructure inspection [2], agricultural monitoring [3], pipeline surveillance [4], search and rescue operations [5], and cartographic surveying [6]. The integration of UAVs with Internet of Things (IoT) ecosystems has emerged as a particularly promising paradigm, enabling real-time data transmission, cloud-based processing, and seamless interoperability with existing sensor networks [7,8].
Despite these technological advances, the human–machine interface for UAV control remains a significant barrier to widespread adoption, particularly for non-expert operators. Conventional remote controllers demand substantial training time and impose a cognitive burden that can compromise mission effectiveness [9,10]. Research by Zlotnikov et al. [11] identified operator training as a primary cost driver in UAV deployment, with novice pilots requiring 15–25 h of practice before achieving operational proficiency. This limitation has motivated extensive research into alternative control paradigms, including voice commands [12], brain–computer interfaces [13], and gesture-based systems [14,15].
Gesture-based control has emerged as a particularly promising approach due to its intuitive nature and minimal hardware requirements [16,17]. The human hand provides a naturally expressive interface with high bandwidth for communicating spatial intent [18]. Recent advances in computer vision, particularly the development of real-time hand tracking frameworks such as MediaPipe [19,20], have substantially lowered the technical barriers to implementing robust gesture recognition systems. However, existing approaches often suffer from limited gesture vocabularies, insufficient robustness to environmental variations, or inadequate consideration of the unique requirements of UAV control tasks [21].
The integration of gesture control systems with IoT architectures presents additional challenges and opportunities. As Chamola et al. [8] note, IoT-enabled UAVs must balance computational efficiency, communication latency, and energy constraints while maintaining responsive control. Edge computing paradigms, wherein processing is distributed between on-device capabilities and proximate compute nodes, offer a promising approach to addressing these challenges [22]. The emergence of powerful embedded platforms such as the NVIDIA Jetson series has made real-time deep learning inference feasible for aerial robotics applications [23].
The Robot Operating System 2 (ROS2) has become the de facto standard for robotics middleware, providing a publish–subscribe communication model, standardized message formats, and extensive tooling for simulation and deployment [24]. ROS2's support for real-time systems, improved security, and multi-platform compatibility make it particularly suitable for UAV applications that must operate reliably in diverse environments [25]. The Gazebo simulator, tightly integrated with ROS2, enables comprehensive validation of control algorithms before deployment on physical hardware [26,27].
In this paper, we present Aerokinesis, a comprehensive IoT software–hardware system for intuitive gesture-driven quadcopter control. Our approach addresses the limitations of existing systems through several key contributions:
1. Hierarchical Control Architecture: We propose a two-tier control framework that separates high-level discrete commands (arm/disarm, takeoff, landing, speed selection) from low-level continuous flight control (roll, pitch, yaw, throttle). This separation enables both precise maneuvering and rapid execution of safety-critical operations through distinct gesture vocabularies.
2. 3D Pose-Based Continuous Control: Unlike approaches relying solely on 2D image coordinates, our system leverages depth camera data to compute control signals from true 3D hand orientation using camera intrinsic parameters and trigonometric analysis, providing an intuitive mapping between hand pose and aircraft motion.
3. Robust Signal Filtering Pipeline: We develop a multi-stage filtering approach combining median-based outlier detection, sliding window averaging, and exponential smoothing that achieves a balance between noise rejection and control responsiveness, with parameters validated through extensive experimentation.
4. Complete Implementation and Validation: We provide full implementation details, experimental validation in both simulation and physical deployment using the DJI Tello platform and NVIDIA Jetson Orin NX, and comparative user studies demonstrating significant advantages for novice operators.
The remainder of this paper is organized as follows. Section 2 reviews related work in gesture-based UAV control, IoT integration, and human–robot interaction. Section 3 describes the overall system architecture and hardware configuration. Section 4 details the gesture classification approach for high-level commands. Section 5 presents the continuous control methodology, including 3D pose estimation and signal filtering. Section 6 describes the ROS2 implementation. Section 7 presents experimental results from simulation and real-world testing. Section 8 discusses findings and limitations, and Section 9 concludes with directions for future work.
4. Gesture Classification for High-Level Commands
4.1. Hand Landmark Extraction
The MediaPipe Hands framework provides the foundation for hand detection and landmark estimation. The framework employs a two-stage pipeline: first, a palm detector locates hands in the input image; second, a hand landmark model estimates 21 3D keypoints representing finger joints and the wrist (Figure 2).
Each landmark is represented by normalized coordinates $(x_i, y_i) \in [0, 1]^2$ relative to the image dimensions, plus a depth estimate $z_i$ representing relative distance from the camera. For gesture classification, we utilize only the 2D coordinates to form the input feature vector
$$\mathbf{x} = (x_1, y_1, x_2, y_2, \ldots, x_{21}, y_{21}) \in \mathbb{R}^{42}.$$
The normalized coordinate representation provides inherent invariance to hand position in the frame and image resolution, simplifying classifier training and improving generalization.
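For illustration, the following Python sketch shows how such a 42-dimensional feature vector can be assembled from MediaPipe's normalized landmarks. The function names are ours, and the sketch assumes the standard mediapipe.solutions.hands interface rather than the project's actual code.

```python
import numpy as np
import mediapipe as mp

mp_hands = mp.solutions.hands


def landmarks_to_feature_vector(hand_landmarks) -> np.ndarray:
    """Flatten the 21 normalized (x, y) landmark coordinates into R^42."""
    coords = [(lm.x, lm.y) for lm in hand_landmarks.landmark]   # 21 pairs
    return np.asarray(coords, dtype=np.float32).reshape(-1)     # shape (42,)


def detect_and_encode(rgb_image: np.ndarray):
    """Run MediaPipe Hands on one RGB frame and return the feature vector,
    or None when no hand is detected. In a real pipeline the Hands object
    would be created once and reused across frames."""
    with mp_hands.Hands(static_image_mode=False, max_num_hands=1) as hands:
        result = hands.process(rgb_image)
    if not result.multi_hand_landmarks:
        return None
    return landmarks_to_feature_vector(result.multi_hand_landmarks[0])
```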
4.2. Gesture Vocabulary
The high-level command gesture vocabulary comprises eight distinct static hand poses, as shown in Figure 3 and Table 1.
The use of two-gesture sequences (prefix + command) provides safety interlocks preventing accidental activation of critical commands such as motor arming or landing. For example, takeoff requires the sequence OK → ONE, which is unlikely to occur inadvertently during normal hand motion.
It is important to note that landing uses the SDK's fixed descent profile; continuous throttle control is available in Control Mode for gradual altitude reduction. Speed levels 1–5 scale velocity commands by a proportional coefficient (Section 5.8), and safe_mode limits operation to level 1 for novice operators.
Scalability to More Complex Gestures
The current gesture vocabulary comprises eight static hand poses optimized for essential UAV control operations. However, the proposed architecture supports expansion to more complex gesture sets:
Expanded static gesture vocabulary: The MLP classifier architecture can accommodate additional gesture classes with minimal modification. The 42-dimensional landmark feature space provides sufficient discriminative capacity for distinguishing substantially more than eight hand poses. Preliminary experiments with expanded vocabularies containing 15 gestures (adding poses such as “pointing,” “thumbs down,” “peace sign,” and directional indicators) maintained classification accuracy above 97%, though certain gesture pairs exhibited increased confusion rates due to visual similarity.
Dynamic gesture recognition: For applications requiring temporal gesture patterns (e.g., swipe motions, circular movements, or gesture sequences), the system architecture can be extended using recurrent neural networks (LSTM, GRU) or temporal convolutional networks operating on sequences of landmark coordinates. Such extensions would enable more expressive control vocabularies including gestures for waypoint marking, camera gimbal control, and mission-specific commands.
Computational and cognitive considerations: Expanding the gesture vocabulary introduces trade-offs. From a computational perspective, increasing to 20+ static gestures has negligible impact on inference time, while adding dynamic gesture recognition would reduce system throughput from 30 FPS to approximately 20–25 FPS depending on required sequence length. From a user perspective, our observations suggest that novice operators can reliably learn and recall 8–10 distinct gestures within a 15 min training session; beyond this threshold, error rates increase, particularly under time pressure or stress. For applications requiring larger command sets, we recommend hierarchical gesture menus (where a primary gesture enters a sub-mode with additional commands) rather than flat vocabularies exceeding 12–15 gestures.
4.3. Neural Network Architecture
Gesture classification employs a fully connected multilayer perceptron (MLP) architecture optimized for the 42-dimensional input feature space. The network comprises two hidden layers with ReLU activation functions:
$$\mathbf{h}_1 = \mathrm{ReLU}(W_1 \mathbf{x} + \mathbf{b}_1), \qquad \mathbf{h}_2 = \mathrm{ReLU}(W_2 \mathbf{h}_1 + \mathbf{b}_2),$$
where $\mathrm{ReLU}(z) = \max(0, z)$ provides nonlinearity, enabling the network to learn complex decision boundaries.
The complete classification function is
$$\hat{\mathbf{y}} = \mathrm{softmax}(W_3 \mathbf{h}_2 + \mathbf{b}_3),$$
where $W_1$, $W_2$, and $W_3$ are learned weight matrices, and $\mathbf{b}_1$, $\mathbf{b}_2$, $\mathbf{b}_3$ are bias vectors.
Figure 4 illustrates the network architecture schematically.
The hidden layer dimensions (168 and 546 neurons) were determined through hyperparameter search, balancing classification accuracy against inference time. Dropout regularization is applied during training to prevent overfitting.
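A minimal PyTorch sketch of this topology (42 → 168 → 546 → 8) is shown below; the dropout probability is a placeholder, as the exact value is not reproduced here.

```python
import torch
import torch.nn as nn


class GestureMLP(nn.Module):
    """Two-hidden-layer MLP over the 42-dimensional landmark vector,
    following the 42 -> 168 -> 546 -> 8 topology described in the text.
    The dropout probability is a placeholder, not the paper's value."""

    def __init__(self, n_inputs: int = 42, n_classes: int = 8, p_drop: float = 0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_inputs, 168),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(168, 546),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(546, n_classes),   # logits; softmax/argmax applied outside
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


# Example: classify a single feature vector (batch of one).
model = GestureMLP().eval()
features = torch.zeros(1, 42)                 # stand-in for a real landmark vector
predicted_class = model(features).argmax(dim=1).item()
```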
4.4. Training and Validation
The classifier was trained on a dataset of 12,000 hand landmark samples (1500 per gesture class) collected from 10 participants (7 male, 3 female; age range 20–32 years) under varying lighting conditions (natural daylight, fluorescent, and mixed illumination). Each participant performed each gesture approximately 150 times across multiple recording sessions to capture intra-subject variability.
Data splitting strategy: To ensure rigorous evaluation of generalization to unseen users, we employed a subject-wise split for the test set. Specifically, data from eight participants (9600 samples) was used for model development, while data from two held-out participants (2400 samples, 300 per class) constituted the final test set. Within the development set, we applied a random 80/20 split for training and validation, resulting in 7680 training samples and 1920 validation samples. This subject-wise test split is critical for assessing whether the model generalizes to new operators not seen during training.
Data augmentation included random coordinate perturbation and random dropout of individual landmarks to improve robustness to hand tracking noise and partial occlusions.
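A NumPy sketch of these two augmentations follows; the noise scale and landmark dropout probability are illustrative placeholders rather than the values used in training.

```python
import numpy as np


def augment_landmarks(features: np.ndarray,
                      noise_std: float = 0.01,          # placeholder scale
                      landmark_drop_prob: float = 0.05  # placeholder probability
                      ) -> np.ndarray:
    """Apply random coordinate perturbation and random landmark dropout to one
    42-dimensional (x1, y1, ..., x21, y21) feature vector."""
    augmented = features + np.random.normal(0.0, noise_std, size=features.shape)
    # Zero out whole landmarks (both x and y) with small probability to mimic
    # tracking dropouts and partial occlusions.
    drop_mask = np.random.rand(21) < landmark_drop_prob
    augmented = augmented.reshape(21, 2)
    augmented[drop_mask] = 0.0
    return augmented.reshape(-1)
```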
Training employed the Adam optimizer with cross-entropy loss and a batch size of 64. Early stopping based on validation accuracy terminated training after 150 epochs. The final model achieved 99.3% accuracy on the held-out test set comprising participants not included in training, demonstrating strong generalization to unseen users.
Classifier Architecture Selection
To justify the MLP architecture selection, we conducted a comparative evaluation of five classification approaches using stratified 5-fold cross-validation on the development set (eight participants, 9600 samples).
Table 2 summarizes the results.
Note that k-NN requires storing all training samples (∼300 KB for landmarks), while the 1D-CNN uses three convolutional layers (32–64–128 filters) with global average pooling.
The MLP architecture achieved the highest accuracy while maintaining the lowest inference latency and smallest model footprint—critical factors for embedded deployment on the Jetson platform. The k-Nearest Neighbors classifier suffered from slow inference due to distance computations across training samples. Support Vector Machines with an RBF kernel achieved competitive accuracy but required substantially more memory for support vector storage. Random Forest provided robust performance but with inference overhead from tree traversal. The 1D-CNN, treating landmark coordinates as sequential data, approached MLP accuracy but with higher computational cost.
The superior MLP performance can be attributed to the relatively simple decision boundary geometry in the 42-dimensional landmark space: after MediaPipe normalization, gesture classes form approximately convex, well-separated clusters that are efficiently partitioned by ReLU-activated layers. More complex architectures (deeper networks, attention mechanisms) were evaluated but provided no accuracy improvement while increasing inference time.
5. Continuous Control from 3D Hand Pose
5.1. Problem Formulation
Continuous control mode maps hand pose to quadcopter velocity commands with four degrees of freedom: roll (lateral movement), pitch (forward/backward movement), yaw (rotation), and throttle (vertical movement). The design goal is an intuitive mapping wherein hand orientation and position correspond naturally to intended aircraft motion.
Let $\mathbf{p}_i = (u_i, v_i, d_i)$ denote the 3D position of landmark $i$, where $(u_i, v_i)$ are image coordinates and $d_i$ is the depth value from the aligned depth map. The control computation extracts orientation angles from relative landmark positions using geometric analysis.
5.2. Coordinate Transformation
To compute true 3D angles, we transform 2D image coordinates to 3D camera-frame coordinates using the camera's intrinsic parameters. Let $f_x$, $f_y$ denote focal lengths and $c_x$, $c_y$ denote principal point coordinates from the camera calibration matrix. For a landmark at image position $(u, v)$ with depth $d$, the 3D position vector is
$$\mathbf{P} = \begin{pmatrix} X \\ Y \\ Z \end{pmatrix} = \begin{pmatrix} (u - c_x)\,d / f_x \\ (v - c_y)\,d / f_y \\ d \end{pmatrix}.$$
This transformation accounts for perspective projection, ensuring that computed angles reflect true hand orientation rather than apparent orientation distorted by camera position.
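A direct implementation sketch of this back-projection (variable names are ours):

```python
import numpy as np


def backproject(u: float, v: float, d: float,
                fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Pinhole back-projection of an image point (u, v) with aligned depth d
    into camera-frame coordinates, following the transformation above."""
    x = (u - cx) * d / fx
    y = (v - cy) * d / fy
    return np.array([x, y, d])
```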
5.3. Roll Computation
Roll angle is computed from the orientation of the line connecting the index finger base (landmark 5) and the pinky finger base (landmark 17). Let $\mathbf{P}_5$ and $\mathbf{P}_{17}$ denote the 3D positions of these landmarks computed via Equation (4). The direction vector is
$$\mathbf{v} = \mathbf{P}_{17} - \mathbf{P}_5.$$
The roll angle $\phi$ (angle between the hand plane and the camera XY plane) is
$$\phi = \operatorname{atan2}\!\left(v_z,\ \sqrt{v_x^2 + v_y^2}\right).$$
This formulation yields positive roll when the hand tilts left (pinky higher than index) and negative roll when tilted right, matching standard aircraft convention.
5.4. Pitch Computation
Pitch is computed similarly, using the wrist (landmark 0) and the average position of the finger bases. Let $\mathbf{P}_0$ denote the 3D wrist position and $\bar{\mathbf{P}}_b = (\mathbf{P}_5 + \mathbf{P}_{17})/2$ the average of landmarks 5 and 17. The pitch angle $\theta$ is
$$\theta = \operatorname{atan2}\!\left(w_z,\ \sqrt{w_x^2 + w_y^2}\right), \qquad \mathbf{w} = \bar{\mathbf{P}}_b - \mathbf{P}_0.$$
Positive pitch corresponds to fingers pointed upward relative to wrist (aircraft pitches nose-up), and negative pitch to fingers pointed downward.
5.5. Yaw Computation
Yaw represents hand rotation in the image plane, computed from the orientation of the finger axis. Using landmarks 5 and 8 (index finger base and tip) and 9 and 12 (middle finger base and tip), we compute the average finger base position $\bar{\mathbf{B}} = (\mathbf{P}_5 + \mathbf{P}_9)/2$ and the average tip position $\bar{\mathbf{T}} = (\mathbf{P}_8 + \mathbf{P}_{12})/2$, then find the yaw angle $\psi$:
$$\psi = \operatorname{atan2}\!\left(a_x,\ -a_y\right), \qquad \mathbf{a} = \bar{\mathbf{T}} - \bar{\mathbf{B}},$$
with a wrap-around adjustment applied to normalize the range. This maps a hand pointing straight up to zero yaw, left to positive yaw, and right to negative yaw.
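The following sketch gathers the roll, pitch, and yaw computations of Sections 5.3–5.5 in one function. The sign conventions mirror the formulas above and should be treated as a reconstruction rather than the project's exact code.

```python
import numpy as np


def hand_angles(P: dict) -> tuple:
    """Approximate roll, pitch, and yaw (radians) from back-projected landmarks.
    P maps MediaPipe landmark indices to camera-frame points (x, y, z);
    sign conventions are illustrative reconstructions, not the paper's code."""
    # Roll: elevation of the index-base -> pinky-base line (5 -> 17)
    # relative to the camera XY plane.
    v = P[17] - P[5]
    roll = np.arctan2(v[2], np.hypot(v[0], v[1]))

    # Pitch: elevation of the wrist -> finger-base vector (0 -> mean of 5, 17).
    w = (P[5] + P[17]) / 2.0 - P[0]
    pitch = np.arctan2(w[2], np.hypot(w[0], w[1]))

    # Yaw: in-plane orientation of the finger axis, from the average of the
    # index/middle bases (5, 9) to the average of their tips (8, 12).
    a = (P[8] + P[12]) / 2.0 - (P[5] + P[9]) / 2.0
    yaw = np.arctan2(a[0], -a[1])   # zero when the fingers point straight up
    return roll, pitch, yaw
```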
5.6. Throttle Computation
Throttle control leverages the depth dimension directly. We compute the palm center position as the average of the wrist and finger bases, then extract the depth value $d_{\mathrm{palm}}$ at this location. The throttle command $u_T$ is computed by mapping depth to the range $[-1, 1]$:
$$u_T = 2\,\frac{d_{\mathrm{palm}} - d_{\min}}{d_{\max} - d_{\min}} - 1,$$
where $d_{\min} = 15$ cm and $d_{\max} = 60$ cm define the operating depth range. Moving the hand closer to the camera (reducing $d_{\mathrm{palm}}$) produces negative throttle (descend), while moving it away produces positive throttle (ascend).
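A sketch of this depth-to-throttle mapping over the 15–60 cm range (assuming the linear law given above; the helper name is ours):

```python
def depth_to_throttle(d_palm_cm: float,
                      d_min_cm: float = 15.0,
                      d_max_cm: float = 60.0) -> float:
    """Map palm depth to a throttle command in [-1, 1]: hand close to the
    camera -> descend (-1), far away -> ascend (+1). A linear mapping is
    assumed here for illustration."""
    d = min(max(d_palm_cm, d_min_cm), d_max_cm)   # clamp to the operating range
    return 2.0 * (d - d_min_cm) / (d_max_cm - d_min_cm) - 1.0
```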
5.7. Saturation and Dead Zones
To ensure controllable aircraft behavior, we apply saturation (limiting) and dead zones to all control outputs. The dead zone prevents unintended drift when the operator holds a nominally neutral hand position:
$$u_{\text{out}} = \begin{cases} 0, & |u| < \delta_{\text{dead}}, \\ \operatorname{sign}(u)\,u_{\max}, & |u| > u_{\max}, \\ u, & \text{otherwise}. \end{cases}$$
The angular dead zone and maximum angle limits, together with a throttle dead zone of 0.25 (normalized units), were tuned experimentally to balance responsiveness against stability.
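A sketch of the combined dead-zone and saturation logic; only the throttle dead zone (0.25) is stated in the text, so the limit used in the example is a placeholder.

```python
def apply_deadzone_and_saturation(value: float,
                                  dead_zone: float,
                                  max_value: float) -> float:
    """Zero out small inputs, clamp large ones, and pass the rest through."""
    if abs(value) < dead_zone:
        return 0.0
    return max(-max_value, min(max_value, value))


# Example with the stated throttle dead zone; the limit of 1.0 is an assumption.
throttle_cmd = apply_deadzone_and_saturation(0.18, dead_zone=0.25, max_value=1.0)  # -> 0.0
```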
5.8. Speed Level Control
The system implements a configurable speed scaling mechanism that modulates control signal magnitude across all four degrees of freedom. Operators can select speed levels 1–5 through gesture commands (ROCK → ONE/TWO/THREE/FOUR/FIVE), with each level applying a proportional scaling coefficient $k_s$, where $s \in \{1, \ldots, 5\}$ is the selected speed level. All velocity commands are scaled accordingly:
$$\mathbf{u}_{\text{scaled}} = k_s\,\mathbf{u}.$$
For safety during initial testing and novice operator training, a safe_mode parameter constrains the system to speed level 1 regardless of gesture commands. This mechanism prevents accidental high-speed maneuvers while operators familiarize themselves with the control interface.
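A sketch of the speed-level scaling and safe_mode clamp; the linear coefficient $k_s = s/5$ is an assumption for illustration, since the exact coefficient is not reproduced here.

```python
def scale_command(velocity: float, speed_level: int, safe_mode: bool) -> float:
    """Scale a velocity command by the selected speed level (1-5).
    The linear coefficient s/5 is an assumption for illustration;
    safe_mode forces level 1 regardless of the gesture-selected level."""
    level = 1 if safe_mode else max(1, min(5, speed_level))
    return velocity * (level / 5.0)
```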
Landing considerations: The speed scaling affects the continuous control mode, enabling gradual altitude reduction through throttle manipulation at operator-selected velocities. However, the discrete landing command (OK → TWO) invokes the DJI Tello SDK’s built-in land() function, which executes a fixed descent profile managed by the drone’s internal firmware. This design prioritizes landing safety by delegating the critical touchdown phase to proven onboard algorithms, while continuous throttle control allows for precise altitude management during approach. For applications requiring fully customizable landing profiles, integration with flight controllers supporting velocity-mode descent (e.g., PX4 via MAVLink) would enable gesture-controlled landing speed throughout the entire maneuver.
5.9. Signal Filtering Pipeline
Raw control signals from frame-by-frame pose estimation exhibit considerable noise due to landmark prediction jitter, depth sensor noise, and hand micro-movements. We employ a three-stage filtering pipeline to produce smooth control outputs while maintaining responsiveness.
5.9.1. Stage 1: Sliding Window with Outlier Rejection
Control values are accumulated in a sliding window (deque) of size $N = 5$. On each update, we compute the median $m$ and standard deviation $\sigma$ of the window contents, then reject outliers:
$$\mathcal{W}_{\text{clean}} = \{\, x_i \in \mathcal{W} : |x_i - m| \le k\,\sigma \,\}.$$
The cleaned mean (or the median, if too few samples survive rejection) proceeds to the next stage. The rejection threshold $k$ was chosen to remove gross errors while preserving intentional control changes; the three-sigma rule would introduce excessive lag.
5.9.2. Stage 2: Exponential Smoothing
The cleaned mean $\bar{u}_t$ is processed by an exponential moving average filter:
$$\hat{u}_t = \alpha\,\bar{u}_t + (1 - \alpha)\,\hat{u}_{t-1},$$
where $\alpha \in (0, 1]$ balances responsiveness and smoothness. The exponential smoothing removes high-frequency noise components while allowing the control signal to track sustained changes within approximately $1/\alpha$ frames.
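A sketch of the two filtering stages acting on a single control channel; the window size of 5 matches the reported value, while the rejection multiplier and smoothing factor are placeholders.

```python
from collections import deque

import numpy as np


class ControlFilter:
    """Sliding-window outlier rejection followed by exponential smoothing.
    The rejection multiplier k and smoothing factor alpha are placeholders;
    the window size of 5 matches the value reported in the paper."""

    def __init__(self, window: int = 5, k: float = 2.0, alpha: float = 0.3):
        self.buffer = deque(maxlen=window)
        self.k = k
        self.alpha = alpha
        self.smoothed = 0.0

    def update(self, raw_value: float) -> float:
        self.buffer.append(raw_value)
        values = np.asarray(self.buffer)
        median, sigma = np.median(values), np.std(values)
        # Stage 1: drop samples far from the window median, then average.
        kept = values[np.abs(values - median) <= self.k * sigma] if sigma > 0 else values
        cleaned = kept.mean() if kept.size else median
        # Stage 2: exponential moving average.
        self.smoothed = self.alpha * cleaned + (1.0 - self.alpha) * self.smoothed
        return self.smoothed
```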
Figure 5 illustrates the three-stage filtering pipeline.
Table 3 summarizes the filter parameters and their effects on system behavior.
6. ROS2 Implementation
This section details the software architecture implementing the Aerokinesis gesture control system within the ROS2 framework. We describe the distributed node architecture, inter-node communication patterns, the hierarchical state machine governing system behavior, and the real-time processing pipeline.
6.1. Distributed Node Architecture
The system follows the ROS2 design philosophy of modular, loosely coupled nodes communicating via publish–subscribe messaging.
Figure 6 illustrates the complete node graph and data flow.
The architecture comprises three primary nodes:
realsense_node: Camera driver (Intel RealSense SDK wrapper) providing hardware-synchronized RGB and depth streams at 30 FPS. Depth frames are aligned to the color camera coordinate system, enabling direct pixel-to-depth correspondence for 3D landmark computation.
hand_gesture_detector: Central processing node implementing MediaPipe inference, gesture classification, continuous control computation, and signal filtering. This node encapsulates all vision processing logic and maintains the operational state machine.
tello_driver: Flight controller interface translating ROS2 velocity commands to DJI Tello SDK calls over WiFi. Handles connection management, telemetry reception, and command rate limiting.
This separation enables independent testing of each component and facilitates adaptation to alternative hardware (e.g., different cameras or flight controllers) by replacing only the relevant driver node.
6.2. Hierarchical State Machine
The system operates as a hierarchical finite state machine (FSM) with two levels: the operational mode (Command/Control) and the flight state (Disarmed/Armed/Flying).
Figure 7 presents the complete state transition diagram.
Key design decisions in the state machine include:
Two-gesture safety sequences: Critical commands (takeoff, landing) require a prefix gesture (OK) followed by confirmation (ONE/TWO), preventing accidental activation. The system maintains a 2 s timeout between prefix and completion gestures.
Mode transition delay: Entering Control Mode triggers a 5 s countdown, allowing operators to reposition their hand from the THUMBS_UP gesture to a neutral control pose without inadvertent commands.
Fail-safe behavior: If hand tracking is lost during Control Mode (no detection for 500 ms), the system automatically publishes zero velocity commands, causing the drone to hover in place rather than continuing its last trajectory.
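As an illustration of this fail-safe, the following rclpy sketch publishes a zero-velocity Twist once 500 ms have elapsed without a hand detection; the node name, topic, and method names are ours, not the actual Aerokinesis identifiers.

```python
from geometry_msgs.msg import Twist
from rclpy.node import Node


class HoverWatchdog(Node):
    """Publish zero velocity if no hand detection has arrived for 500 ms.
    Node and topic names ('hover_watchdog', '/cmd_vel') are illustrative."""

    def __init__(self):
        super().__init__('hover_watchdog')
        self.cmd_pub = self.create_publisher(Twist, '/cmd_vel', 10)
        self.last_detection = self.get_clock().now()
        self.create_timer(0.05, self.check_timeout)   # check at 20 Hz

    def hand_detected(self):
        """Call whenever a valid hand detection is processed."""
        self.last_detection = self.get_clock().now()

    def check_timeout(self):
        elapsed = (self.get_clock().now() - self.last_detection).nanoseconds / 1e9
        if elapsed > 0.5:
            self.cmd_pub.publish(Twist())   # all-zero twist -> hover in place
```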
6.3. Real-Time Processing Pipeline
Figure 8 details the frame-by-frame processing pipeline executed at 30 FPS within the hand_gesture_detector node.
The pipeline implements several optimizations for real-time performance:
Early termination: Frames without detected hands skip all downstream processing, reducing computational load during operator absence;
Shared landmark buffer: MediaPipe landmarks are extracted once and reused for both gesture classification (2D normalized coordinates) and control computation (3D reconstruction with depth);
Asynchronous publishing: ROS2 message publication is non-blocking, allowing the next frame to begin processing immediately after command generation;
GPU acceleration: MediaPipe inference leverages CUDA on the Jetson platform, reducing detection latency from ∼85 ms (CPU) to ∼28 ms (GPU).
6.4. Inter-Node Communication
Table 4 summarizes the ROS2 topic interface. The system uses standard message types for hardware interoperability.
The geometry_msgs/Twist message encodes the four control channels: linear.x (roll/lateral), linear.y (pitch/forward), linear.z (throttle/vertical), and angular.z (yaw/rotation). This standard message type ensures compatibility with diverse robotic platforms beyond the DJI Tello.
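For illustration, a minimal sketch of packing the four filtered control values into this message (the function and variable names are ours):

```python
from geometry_msgs.msg import Twist


def build_twist(roll_cmd: float, pitch_cmd: float,
                throttle_cmd: float, yaw_cmd: float) -> Twist:
    """Pack the four control channels using the mapping described above.
    Variable names are illustrative; only the field assignment follows the text."""
    msg = Twist()
    msg.linear.x = roll_cmd       # roll / lateral
    msg.linear.y = pitch_cmd      # pitch / forward
    msg.linear.z = throttle_cmd   # throttle / vertical
    msg.angular.z = yaw_cmd       # yaw / rotation
    return msg
```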
6.5. Reproducibility and Deployment
The ROS2 workspace deployment requires:
ROS2 Humble on Ubuntu 22.04;
Intel RealSense SDK 2.0 with ROS2 wrapper;
MediaPipe 0.10+ with Python 3.10+ bindings;
PyTorch 2.0+ (CPU or CUDA);
DJI Tello SDK (or compatible flight controller).
A single launch file initializes all nodes with configurable parameters for camera resolution, control sensitivity, and safety limits.
6.6. Node Architecture
The system is implemented as a ROS2 node (HandGestureDetector) that subscribes to camera topics and publishes velocity commands. Gesture recognition is performed on every incoming frame from the depth camera stream, synchronized with the RGB stream at 30 FPS. This frame-by-frame processing approach, rather than temporal aggregation or periodic sampling, ensures maximum responsiveness to operator commands, while the filtering pipeline (Section 5.9) handles inter-frame noise. The main processing loop (Algorithm 1) operates as follows:
Algorithm 1: Main Processing Loop
Input: synchronized color frame $I_c$, depth frame $I_d$, camera intrinsics $K$
1: Detect hand landmarks in $I_c$ using MediaPipe
2: if no hand is detected then
3:  if in Control Mode and aircraft in flight then
4:   Publish hover command (zero velocity)
5:  end if
6:  return
7: end if
8: Buffer the landmarks for reuse: normalized 2D coordinates and 3D positions reconstructed from $I_d$ and $K$
9: if in Command Mode then
10:  Classify the gesture from the normalized 2D landmarks (MLP)
11:  Process gesture according to state machine
12: else
13:  Compute roll, pitch, yaw, and throttle from the 3D hand pose
14:  Apply the filtering pipeline (Section 5.9)
15:  Publish velocity command
16: end if
6.7. Topic Structure
The node interfaces with the ROS2 ecosystem through the topics shown in Table 5.
6.8. State Machine
The command mode operates as a finite state machine tracking the current prefix gesture and expected completions. The THUMBS_UP gesture transitions from Command Mode to Control Mode with a configurable delay (default 5 s), allowing the operator to position their hand. The TWO gesture in Control Mode returns to Command Mode.
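A compact sketch of this prefix-and-confirmation logic with the 2 s timeout described in Section 6.2; the class structure and method names are ours, and only the takeoff/landing pairs from Table 1 are shown.

```python
import time


class CommandSequencer:
    """Track prefix gestures (e.g., 'OK') and their confirmations.
    The 2 s prefix timeout follows the text; the mapping below shows only
    the takeoff/landing pairs as an example."""

    SEQUENCES = {('OK', 'ONE'): 'takeoff', ('OK', 'TWO'): 'land'}
    PREFIX_TIMEOUT_S = 2.0

    def __init__(self):
        self.prefix = None
        self.prefix_time = 0.0

    def on_gesture(self, gesture: str):
        now = time.monotonic()
        if self.prefix and now - self.prefix_time > self.PREFIX_TIMEOUT_S:
            self.prefix = None                       # prefix expired
        if self.prefix is None:
            if gesture in {p for p, _ in self.SEQUENCES}:
                self.prefix, self.prefix_time = gesture, now
            return None
        command = self.SEQUENCES.get((self.prefix, gesture))
        self.prefix = None
        return command                               # e.g., 'takeoff' or None
```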
7. Experimental Evaluation
This section presents a comprehensive evaluation of the Aerokinesis system across multiple dimensions. We first describe the experimental setup, including both simulation and hardware platforms used for testing (Section 7.1). Subsequently, we evaluate gesture classification accuracy and per-class performance metrics (Section 7.2). System latency characteristics are analyzed to validate real-time operation (Section 7.3). A comparative user study quantifies the practical benefits of gesture control for operators with varying experience levels (Section 7.4). Finally, we demonstrate the effectiveness of the signal filtering pipeline through time-series analysis (Section 7.5).
7.1. Experimental Setup
7.1.1. Simulation Environment
Initial testing employed the Gazebo Ignition simulator with a DJI Tello model. The simulation environment includes a course of ring-shaped waypoints requiring coordinated roll, pitch, yaw, and throttle control to navigate. This standardized task enables quantitative comparison across operators and control modalities.
Figure 9 illustrates the gesture control system operating in the Gazebo simulation environment.
7.1.2. Hardware Platform
Physical experiments used the NVIDIA Jetson Orin NX 8GB with an Intel RealSense D435i camera, housed in a custom 3D-printed enclosure (PETG, FLSUN V400 printer) (Figure 10). The DJI Tello quadcopter was connected via WiFi for command transmission. The system achieved consistent frame rates of 25–30 FPS during operation.
7.2. Gesture Classification Performance
The gesture classifier was evaluated on a subject-wise held-out test set of 2400 samples (300 per class) collected from two participants whose data was entirely excluded from the training and validation process. This evaluation protocol ensures that reported metrics reflect the model’s ability to generalize to new users rather than memorizing participant-specific hand characteristics.
Table 6 reports per-class precision, recall, and F1 scores.
The classifier achieves an overall accuracy of 99.3% with inference time of 0.8 ms per sample on GPU. The lowest-performing classes (THREE, FOUR) exhibit minor confusion due to subtle differences in partially extended finger configurations.
7.3. Control Latency
End-to-end latency from camera frame capture to control command publication was measured over 1000 frames during active control.
Table 7 reports the timing breakdown.
MediaPipe hand detection dominates latency but remains well within real-time requirements. The total latency of approximately 32 ms enables smooth 30 FPS control. Importantly, gesture recognition is performed on every frame without temporal aggregation or frame skipping, ensuring that the system responds to operator input with minimal delay. The 30 FPS recognition frequency provides sufficient temporal resolution for both discrete command detection (where gestures are held for 0.5–1.0 s) and continuous control (where hand pose changes smoothly between frames).
End-to-End System Latency
While Table 7 characterizes vision processing latency (∼32 ms), the complete end-to-end delay from operator hand motion to observable UAV response includes WiFi command transmission, drone firmware processing, and motor response dynamics, as shown in Table 8. We measured this total latency using synchronized high-speed video (120 FPS) capturing both operator hand movement initiation and drone response onset, analyzing frame timestamps to determine system delay (Figure 11).
Under optimal conditions (single WiFi network, minimal interference, <3 m operator-to-base-station distance), end-to-end latency averaged 200 ms, with the 95th percentile below 280 ms. Users reported this delay as acceptable, perceiving control as “responsive” in post-experiment surveys. However, in degraded RF environments—multiple 2.4 GHz networks, Bluetooth devices, or microwave interference—latency increased to 400–580 ms, approaching the threshold where operators reported noticeable lag affecting control precision.
WiFi constraints: the DJI Tello operates exclusively on 2.4 GHz WiFi using only three non-overlapping channels (1, 6, and 11), making it highly susceptible to interference in typical indoor environments. This represents a fundamental platform limitation inherited from the consumer drone design. Future work will investigate integration with drones supporting (1) 5 GHz WiFi bands with reduced congestion; (2) dedicated low-latency control links (e.g., ExpressLRS, TBS Crossfire) achieving sub-10 ms transmission; or (3) wired tether connections for latency-critical applications. Based on our measurements, we estimate that eliminating WiFi variability could reduce end-to-end latency to approximately 120 ms, approaching the responsiveness of wired gaming controllers.
7.4. User Study: Task Completion Time
To evaluate practical usability, we conducted a comparative user study with 12 participants (8 male, 4 female; age range 22–35). Participants were categorized as novice (no prior drone experience) or experienced (more than 10 h of prior drone piloting).
7.4.1. Protocol
Each participant completed the waypoint navigation task using both gesture control and a smartphone touchscreen application (Tello App), with order counterbalanced. Participants received 10 min of instruction for each control modality before testing. Task completion time (time to pass through all eight waypoints) was recorded for three trials per condition, with the median used for analysis.
7.4.2. Results
Table 9 reports mean task completion times with standard deviations.
Statistical analysis (paired t-test) reveals a statistically significant advantage for gesture control among novice operators, with a mean improvement of 52.6%. For experienced operators, the difference is not statistically significant, indicating that gesture control achieves parity with traditional controllers once operators gain familiarity.
7.5. Filter Effectiveness
Figure 12 compares raw and filtered control signals for roll (a) and throttle (b) channels during representative flight segments.
The roll channel exhibits higher noise due to the compound computation from multiple depth measurements, demonstrating the necessity of robust filtering. The throttle channel is more stable but occasionally shows spikes from depth sensor artifacts, which the outlier rejection stage effectively removes.
8. Discussion
8.1. Contributions and Findings
The Aerokinesis system demonstrates that vision-based gesture control can provide an effective alternative to traditional UAV controllers, particularly for novice operators. The 52.6% reduction in task completion time for beginners suggests that the intuitive hand-to-aircraft mapping substantially reduces the cognitive burden of UAV operation.
The hierarchical control architecture addresses a key limitation of existing systems by separating discrete commands from continuous control. Gesture sequences for safety-critical operations (takeoff, landing) prevent accidental activation, while the continuous control mode enables precise maneuvering. This design pattern may be applicable to other robotic systems requiring both high-level mode switching and fine-grained control.
The signal filtering pipeline represents a practical contribution for gesture-based control applications. The combination of outlier rejection, moving average, and exponential smoothing provides demonstrably robust noise handling without introducing excessive lag. The empirically determined parameters (smoothing factor, window size of 5, and outlier rejection threshold) offer a reasonable starting point for similar systems.
Real-World Flight Demonstration
To validate the practical applicability of the proposed Aerokinesis system, we conducted real-world flight tests using the DJI Tello quadcopter controlled via hand gestures.
Figure 13 demonstrates the system in operation, showing an operator controlling the quadcopter’s position and altitude through intuitive hand movements.
The demonstration setup includes the NVIDIA Jetson Orin NX computing platform with a connected touchscreen display showing real-time control feedback, the Intel RealSense D435i depth camera mounted on a tripod for optimal hand tracking, and the DJI Tello quadcopter responding to gesture commands via WiFi connection. During testing, the system maintained stable control, with the quadcopter successfully executing hover, translation, and altitude adjustment commands based on operator hand poses.
The real-world tests confirmed that the filtering pipeline effectively compensates for environmental noise sources present in practical deployment scenarios, including varying ambient lighting, background clutter, and the inherent latency of WiFi-based command transmission. The quadcopter exhibited smooth, predictable responses to operator commands, validating the effectiveness of the proposed control signal processing approach.
8.2. Limitations
Several limitations warrant acknowledgment:
1. Confidence Estimation: The current implementation does not incorporate explicit confidence thresholding for gesture classification. The classifier uses argmax over softmax outputs without rejecting low-confidence predictions. However, the system implements implicit safety mechanisms: when no hand is detected in the frame, the UAV automatically enters a hover state (zero velocity commands), preventing uncontrolled drift. Additionally, frames with partially visible hands (landmarks extending beyond image boundaries) are automatically rejected to avoid erroneous classifications.
2. Environmental Constraints: The system requires adequate lighting for reliable MediaPipe detection and depth sensing within the 15–60 cm operating range. Outdoor deployment would require adaptation for varying lighting and potential depth sensor interference from sunlight.
3. Single-Operator Design: The current implementation tracks a single hand. Multi-operator scenarios or two-handed control would require architectural extensions.
4. Simulation–Reality Gap: While we validated on physical hardware, the user study employed simulation. Real-world deployment introduces additional variables, including wind disturbance, communication latency, and aircraft dynamics, that may affect comparative performance.
5. Gesture Vocabulary: The eight-gesture vocabulary suffices for basic operation but may limit advanced mission profiles requiring additional commands.
Environmental Constraints and Extreme Conditions
It should be noted that all experimental validation was conducted indoors under controlled room lighting conditions (approximately 300–500 lux, typical office illumination). The system performance in extreme environmental scenarios has not been systematically evaluated and represents a known limitation:
Strong lighting and direct sunlight: The Intel RealSense D435i depth camera employs active infrared stereo sensing, which is susceptible to interference from strong ambient infrared sources, including direct sunlight. Outdoor operation under bright sunlight conditions would likely cause depth measurement degradation or failure, directly affecting throttle control accuracy, which relies on palm depth estimation. The RGB-based MediaPipe hand detection demonstrates greater robustness to lighting variations but may exhibit reduced landmark accuracy under extreme contrast conditions such as strong backlighting or harsh shadows.
Dense fog and low visibility conditions: Fog, smoke, dust, and other atmospheric obscurants scatter both visible and infrared light, degrading the performance of both the depth camera and RGB-based hand detection. The system was not tested under such conditions and reliable hand detection would likely fail when visibility drops below approximately 5 m.
Complex and cluttered backgrounds: MediaPipe’s palm detector is a learned model that may produce false positive detections when the background contains hand-like visual features (flesh-toned surfaces, gloves, posters with hands) or may fail to detect hands against high-clutter backgrounds with numerous edge features. Our experiments utilized relatively uniform backgrounds (solid-colored walls, uncluttered laboratory environment). System performance in visually complex environments such as outdoor foliage, industrial facilities, or crowded spaces requires additional investigation.
Future work should include systematic characterization of system performance across a range of environmental parameters to establish clear operational boundaries and develop appropriate mitigation strategies for challenging conditions.
8.3. Applications in IoT Ecosystems
The gesture control paradigm extends naturally to broader IoT applications beyond UAVs. Smart home automation, industrial robot guidance, and assistive technology for mobility-impaired users represent domains where intuitive gesture interfaces could replace or complement traditional controls. The modular ROS2 architecture facilitates integration with diverse IoT platforms through standard middleware protocols.
The demonstrated approach of edge computing for real-time inference combined with lightweight communication (velocity commands rather than video streams) addresses the latency and bandwidth constraints characteristic of IoT deployments. As embedded computing capabilities continue improving, similar vision-based interfaces may become feasible for a wider range of IoT devices.
9. Conclusions
This paper presented Aerokinesis, an IoT software–hardware system for gesture-driven quadcopter control within the ROS2 framework. The system achieves over 99% gesture classification accuracy using a MediaPipe-based hand tracking frontend with a lightweight neural network classifier. Continuous control computation from 3D hand poses enables intuitive four-degree-of-freedom maneuvering, with a hybrid filtering pipeline ensuring robust performance despite sensor noise.
Experimental validation demonstrated significant advantages for novice operators (52.6% reduced task completion time) compared to traditional controllers, while maintaining parity for experienced users. The complete system operates in real-time (approximately 30 FPS) on embedded hardware suitable for IoT deployment. It should be noted that all experiments were conducted under controlled indoor lighting conditions; system performance characterization across diverse environmental conditions (outdoor sunlight, fog, complex backgrounds) remains an important direction for future investigation.
Future work will address current limitations through (1) outdoor-capable depth sensing modalities such as stereo matching; (2) expanded gesture vocabularies including dynamic gestures for mission commands; (3) multi-agent control enabling single-operator coordination of drone swarms; (4) integration with autonomous navigation for shared-control paradigms; and (5) incorporation of confidence estimation for gesture recognition, where predictions below a learned threshold would trigger automatic transition to a safe hover state, further improving operational safety for novice users.
The gesture interface paradigm offers promising potential for democratizing UAV operation by reducing training requirements and providing intuitive control mappings. As computer vision capabilities continue advancing, such natural interaction modalities may become standard for IoT-enabled robotic systems across diverse application domains.