1. Introduction
The proliferation of unmanned aerial vehicles (UAVs) across diverse application domains has fundamentally transformed approaches to data collection, monitoring, and inspection tasks [1]. Modern quadcopters are deployed extensively in infrastructure inspection [2], agricultural monitoring [3], pipeline surveillance [4], search and rescue operations [5], and cartographic surveying [6]. The integration of UAVs with Internet of Things (IoT) ecosystems has emerged as a particularly promising paradigm, enabling real-time data transmission, cloud-based processing, and seamless interoperability with existing sensor networks [7,8].
Despite these technological advances, the human–machine interface for UAV control remains a significant barrier to widespread adoption, particularly for non-expert operators. Conventional remote controllers demand substantial training time and impose a cognitive burden that can compromise mission effectiveness [9,10]. Research by Zlotnikov et al. [11] identified operator training as a primary cost driver in UAV deployment, with novice pilots requiring 15–25 h of practice before achieving operational proficiency. This limitation has motivated extensive research into alternative control paradigms, including voice commands [12], brain–computer interfaces [13], and gesture-based systems [14,15].
Gesture-based control has emerged as a particularly promising approach due to its intuitive nature and minimal hardware requirements [16,17]. The human hand provides a naturally expressive interface with high bandwidth for communicating spatial intent [18]. Recent advances in computer vision, particularly the development of real-time hand tracking frameworks such as MediaPipe [19,20], have substantially lowered the technical barriers to implementing robust gesture recognition systems. However, existing approaches often suffer from limited gesture vocabularies, insufficient robustness to environmental variations, or inadequate consideration of the unique requirements of UAV control tasks [21].
The integration of gesture control systems with IoT architectures presents additional challenges and opportunities. As Chamola et al. [8] note, IoT-enabled UAVs must balance computational efficiency, communication latency, and energy constraints while maintaining responsive control. Edge computing paradigms, wherein processing is distributed between on-device capabilities and proximate compute nodes, offer a promising approach to addressing these challenges [22]. The emergence of powerful embedded platforms such as the NVIDIA Jetson series has made real-time deep learning inference feasible for aerial robotics applications [23].
The Robot Operating System 2 (ROS2) has become the de facto standard for robotics middleware, providing a publish–subscribe communication model, standardized message formats, and extensive tooling for simulation and deployment [24]. ROS2's support for real-time systems, improved security, and multi-platform compatibility make it particularly suitable for UAV applications that must operate reliably in diverse environments [25]. The Gazebo simulator, tightly integrated with ROS2, enables comprehensive validation of control algorithms before deployment on physical hardware [26,27].
In this paper, we present Aerokinesis, a comprehensive IoT software–hardware system for intuitive gesture-driven quadcopter control. Our approach addresses the limitations of existing systems through several key contributions:
1. Hierarchical Control Architecture: We propose a two-tier control framework that separates high-level discrete commands (arm/disarm, takeoff, landing, speed selection) from low-level continuous flight control (roll, pitch, yaw, throttle). This separation enables both precise maneuvering and rapid execution of safety-critical operations through distinct gesture vocabularies.
2. 3D Pose-Based Continuous Control: Unlike approaches relying solely on 2D image coordinates, our system leverages depth camera data to compute control signals from true 3D hand orientation using camera intrinsic parameters and trigonometric analysis, providing an intuitive mapping between hand pose and aircraft motion.
3. Robust Signal Filtering Pipeline: We develop a multi-stage filtering approach combining median-based outlier detection, sliding window averaging, and exponential smoothing that achieves a balance between noise rejection and control responsiveness, with parameters validated through extensive experimentation.
4. Complete Implementation and Validation: We provide full implementation details, experimental validation in both simulation and physical deployment using the DJI Tello platform and NVIDIA Jetson Orin NX, and comparative user studies demonstrating significant advantages for novice operators.
The remainder of this paper is organized as follows. Section 2 reviews related work in gesture-based UAV control, IoT integration, and human–robot interaction. Section 3 describes the overall system architecture and hardware configuration. Section 4 details the gesture classification approach for high-level commands. Section 5 presents the continuous control methodology, including 3D pose estimation and signal filtering. Section 6 describes the ROS2 implementation. Section 7 presents experimental results from simulation and real-world testing. Section 8 discusses findings and limitations, and Section 9 concludes with directions for future work.
4. Gesture Classification for High-Level Commands
4.1. Hand Landmark Extraction
The MediaPipe Hands framework provides the foundation for hand detection and landmark estimation. The framework employs a two-stage pipeline: first, a palm detector locates hands in the input image; second, a hand landmark model estimates 21 3D keypoints representing finger joints and the wrist (Figure 2).
Each landmark is represented by normalized coordinates $(x_i, y_i) \in [0, 1]^2$ relative to the image dimensions, plus a depth estimate $z_i$ representing relative distance from the camera. For gesture classification, we utilize only the 2D coordinates to form the input feature vector
$$\mathbf{x} = (x_1, y_1, x_2, y_2, \ldots, x_{21}, y_{21}) \in \mathbb{R}^{42}.$$
The normalized coordinate representation provides inherent invariance to hand position in the frame and image resolution, simplifying classifier training and improving generalization.
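For illustration, the following Python sketch shows how such a 42-dimensional feature vector can be assembled from MediaPipe's normalized landmarks. The function names are ours, and the sketch assumes the standard mediapipe.solutions.hands interface rather than the project's actual code.

```python
import numpy as np
import mediapipe as mp

mp_hands = mp.solutions.hands


def landmarks_to_feature_vector(hand_landmarks) -> np.ndarray:
    """Flatten the 21 normalized (x, y) landmark coordinates into R^42."""
    coords = [(lm.x, lm.y) for lm in hand_landmarks.landmark]   # 21 pairs
    return np.asarray(coords, dtype=np.float32).reshape(-1)     # shape (42,)


def detect_and_encode(rgb_image: np.ndarray):
    """Run MediaPipe Hands on one RGB frame and return the feature vector,
    or None when no hand is detected. In a real pipeline the Hands object
    would be created once and reused across frames."""
    with mp_hands.Hands(static_image_mode=False, max_num_hands=1) as hands:
        result = hands.process(rgb_image)
    if not result.multi_hand_landmarks:
        return None
    return landmarks_to_feature_vector(result.multi_hand_landmarks[0])
```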
4.2. Gesture Vocabulary
The high-level command gesture vocabulary comprises eight distinct static hand poses, as shown in Figure 3 and Table 1.
The use of two-gesture sequences (prefix + command) provides safety interlocks preventing accidental activation of critical commands such as motor arming or landing. For example, takeoff requires the sequence OK → ONE, which is unlikely to occur inadvertently during normal hand motion.
It is important to note that landing uses the SDK's fixed descent profile; continuous throttle control is available in Control Mode for gradual altitude reduction. Speed levels 1–5 scale velocity commands by a proportional coefficient (Section 5.8), and safe_mode limits operation to level 1 for novice operators.
Scalability to More Complex Gestures
The current gesture vocabulary comprises eight static hand poses optimized for essential UAV control operations. However, the proposed architecture supports expansion to more complex gesture sets:
Expanded static gesture vocabulary: The MLP classifier architecture can accommodate additional gesture classes with minimal modification. The 42-dimensional landmark feature space provides sufficient discriminative capacity for distinguishing substantially more than eight hand poses. Preliminary experiments with expanded vocabularies containing 15 gestures (adding poses such as “pointing,” “thumbs down,” “peace sign,” and directional indicators) maintained classification accuracy above 97%, though certain gesture pairs exhibited increased confusion rates due to visual similarity.
Dynamic gesture recognition: For applications requiring temporal gesture patterns (e.g., swipe motions, circular movements, or gesture sequences), the system architecture can be extended using recurrent neural networks (LSTM, GRU) or temporal convolutional networks operating on sequences of landmark coordinates. Such extensions would enable more expressive control vocabularies including gestures for waypoint marking, camera gimbal control, and mission-specific commands.
Computational and cognitive considerations: Expanding the gesture vocabulary introduces trade-offs. From a computational perspective, increasing to 20+ static gestures has negligible impact on inference time, while adding dynamic gesture recognition would reduce system throughput from 30 FPS to approximately 20–25 FPS depending on required sequence length. From a user perspective, our observations suggest that novice operators can reliably learn and recall 8–10 distinct gestures within a 15 min training session; beyond this threshold, error rates increase, particularly under time pressure or stress. For applications requiring larger command sets, we recommend hierarchical gesture menus (where a primary gesture enters a sub-mode with additional commands) rather than flat vocabularies exceeding 12–15 gestures.
4.3. Neural Network Architecture
Gesture classification employs a fully connected multilayer perceptron (MLP) architecture optimized for the 42-dimensional input feature space. The network comprises two hidden layers with ReLU activation functions:
$$\mathbf{h}_1 = \mathrm{ReLU}(W_1 \mathbf{x} + \mathbf{b}_1), \qquad \mathbf{h}_2 = \mathrm{ReLU}(W_2 \mathbf{h}_1 + \mathbf{b}_2),$$
where $\mathrm{ReLU}(z) = \max(0, z)$ provides nonlinearity, enabling the network to learn complex decision boundaries.
The complete classification function is
$$\hat{\mathbf{y}} = \mathrm{softmax}(W_3 \mathbf{h}_2 + \mathbf{b}_3),$$
where $W_1$, $W_2$, and $W_3$ are learned weight matrices, and $\mathbf{b}_1$, $\mathbf{b}_2$, $\mathbf{b}_3$ are bias vectors.
Figure 4 illustrates the network architecture schematically.
The hidden layer dimensions (168 and 546 neurons) were determined through hyperparameter search, balancing classification accuracy against inference time. Dropout regularization is applied during training to prevent overfitting.
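A minimal PyTorch sketch of this topology (42 → 168 → 546 → 8) is shown below; the dropout probability is a placeholder, as the exact value is not reproduced here.

```python
import torch
import torch.nn as nn


class GestureMLP(nn.Module):
    """Two-hidden-layer MLP over the 42-dimensional landmark vector,
    following the 42 -> 168 -> 546 -> 8 topology described in the text.
    The dropout probability is a placeholder, not the paper's value."""

    def __init__(self, n_inputs: int = 42, n_classes: int = 8, p_drop: float = 0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_inputs, 168),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(168, 546),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(546, n_classes),   # logits; softmax/argmax applied outside
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


# Example: classify a single feature vector (batch of one).
model = GestureMLP().eval()
features = torch.zeros(1, 42)                 # stand-in for a real landmark vector
predicted_class = model(features).argmax(dim=1).item()
```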
4.4. Training and Validation
The classifier was trained on a dataset of 12,000 hand landmark samples (1500 per gesture class) collected from 10 participants (7 male, 3 female; age range 20–32 years) under varying lighting conditions (natural daylight, fluorescent, and mixed illumination). Each participant performed each gesture approximately 150 times across multiple recording sessions to capture intra-subject variability.
Data splitting strategy: To ensure rigorous evaluation of generalization to unseen users, we employed a subject-wise split for the test set. Specifically, data from eight participants (9600 samples) was used for model development, while data from two held-out participants (2400 samples, 300 per class) constituted the final test set. Within the development set, we applied a random 80/20 split for training and validation, resulting in 7680 training samples and 1920 validation samples. This subject-wise test split is critical for assessing whether the model generalizes to new operators not seen during training.
Data augmentation included random coordinate perturbation and random dropout of individual landmarks to improve robustness to hand tracking noise and partial occlusions.
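A NumPy sketch of these two augmentations follows; the noise scale and landmark dropout probability are illustrative placeholders rather than the values used in training.

```python
import numpy as np


def augment_landmarks(features: np.ndarray,
                      noise_std: float = 0.01,          # placeholder scale
                      landmark_drop_prob: float = 0.05  # placeholder probability
                      ) -> np.ndarray:
    """Apply random coordinate perturbation and random landmark dropout to one
    42-dimensional (x1, y1, ..., x21, y21) feature vector."""
    augmented = features + np.random.normal(0.0, noise_std, size=features.shape)
    # Zero out whole landmarks (both x and y) with small probability to mimic
    # tracking dropouts and partial occlusions.
    drop_mask = np.random.rand(21) < landmark_drop_prob
    augmented = augmented.reshape(21, 2)
    augmented[drop_mask] = 0.0
    return augmented.reshape(-1)
```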
Training employed the Adam optimizer with cross-entropy loss and a batch size of 64. Early stopping based on validation accuracy terminated training after 150 epochs. The final model achieved 99.3% accuracy on the held-out test set comprising participants not included in training, demonstrating strong generalization to unseen users.
Classifier Architecture Selection
To justify the MLP architecture selection, we conducted a comparative evaluation of five classification approaches using stratified 5-fold cross-validation on the development set (eight participants, 9600 samples).
Table 2 summarizes the results.
Note that k-NN requires storing all training samples (∼300 KB for landmarks), while the 1D-CNN uses three convolutional layers (32–64–128 filters) with global average pooling.
The MLP architecture achieved the highest accuracy while maintaining the lowest inference latency and smallest model footprint—critical factors for embedded deployment on the Jetson platform. The k-Nearest Neighbors classifier suffered from slow inference due to distance computations across training samples. Support Vector Machines with an RBF kernel achieved competitive accuracy but required substantially more memory for support vector storage. Random Forest provided robust performance but with inference overhead from tree traversal. The 1D-CNN, treating landmark coordinates as sequential data, approached MLP accuracy but with higher computational cost.
The superior MLP performance can be attributed to the relatively simple decision boundary geometry in the 42-dimensional landmark space: after MediaPipe normalization, gesture classes form approximately convex, well-separated clusters that are efficiently partitioned by ReLU-activated layers. More complex architectures (deeper networks, attention mechanisms) were evaluated but provided no accuracy improvement while increasing inference time.
5. Continuous Control from 3D Hand Pose
5.1. Problem Formulation
Continuous control mode maps hand pose to quadcopter velocity commands with four degrees of freedom: roll (lateral movement), pitch (forward/backward movement), yaw (rotation), and throttle (vertical movement). The design goal is an intuitive mapping wherein hand orientation and position correspond naturally to intended aircraft motion.
Let $\mathbf{p}_i = (u_i, v_i, d_i)$ denote the 3D position of landmark $i$, where $(u_i, v_i)$ are image coordinates and $d_i$ is the depth value from the aligned depth map. The control computation extracts orientation angles from relative landmark positions using geometric analysis.
5.2. Coordinate Transformation
To compute true 3D angles, we transform 2D image coordinates to 3D camera-frame coordinates using the camera's intrinsic parameters. Let $f_x$, $f_y$ denote focal lengths and $c_x$, $c_y$ denote principal point coordinates from the camera calibration matrix. For a landmark at image position $(u, v)$ with depth $d$, the 3D position vector is
$$\mathbf{P} = \begin{pmatrix} X \\ Y \\ Z \end{pmatrix} = \begin{pmatrix} (u - c_x)\,d / f_x \\ (v - c_y)\,d / f_y \\ d \end{pmatrix}.$$
This transformation accounts for perspective projection, ensuring that computed angles reflect true hand orientation rather than apparent orientation distorted by camera position.
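A direct implementation sketch of this back-projection (variable names are ours):

```python
import numpy as np


def backproject(u: float, v: float, d: float,
                fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Pinhole back-projection of an image point (u, v) with aligned depth d
    into camera-frame coordinates, following the transformation above."""
    x = (u - cx) * d / fx
    y = (v - cy) * d / fy
    return np.array([x, y, d])
```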
5.3. Roll Computation
Roll angle is computed from the orientation of the line connecting the index finger base (landmark 5) and the pinky finger base (landmark 17). Let $\mathbf{P}_5$ and $\mathbf{P}_{17}$ denote the 3D positions of these landmarks computed via Equation (4). The direction vector is
$$\mathbf{v} = \mathbf{P}_{17} - \mathbf{P}_5.$$
The roll angle $\phi$ (angle between the hand plane and the camera XY plane) is
$$\phi = \operatorname{atan2}\!\left(v_z,\ \sqrt{v_x^2 + v_y^2}\right).$$
This formulation yields positive roll when the hand tilts left (pinky higher than index) and negative roll when tilted right, matching standard aircraft convention.
5.4. Pitch Computation
Pitch is computed similarly, using the wrist (landmark 0) and the average position of the finger bases. Let $\mathbf{P}_0$ denote the 3D wrist position and $\bar{\mathbf{P}}_b = (\mathbf{P}_5 + \mathbf{P}_{17})/2$ the average of landmarks 5 and 17. The pitch angle $\theta$ is
$$\theta = \operatorname{atan2}\!\left(w_z,\ \sqrt{w_x^2 + w_y^2}\right), \qquad \mathbf{w} = \bar{\mathbf{P}}_b - \mathbf{P}_0.$$
Positive pitch corresponds to fingers pointed upward relative to wrist (aircraft pitches nose-up), and negative pitch to fingers pointed downward.
5.5. Yaw Computation
Yaw represents hand rotation in the image plane, computed from the orientation of the finger axis. Using landmarks 5 and 8 (index finger base and tip) and 9 and 12 (middle finger base and tip), we compute the average finger base position $\bar{\mathbf{B}} = (\mathbf{P}_5 + \mathbf{P}_9)/2$ and the average tip position $\bar{\mathbf{T}} = (\mathbf{P}_8 + \mathbf{P}_{12})/2$, then find the yaw angle $\psi$:
$$\psi = \operatorname{atan2}\!\left(a_x,\ -a_y\right), \qquad \mathbf{a} = \bar{\mathbf{T}} - \bar{\mathbf{B}},$$
with a wrap-around adjustment applied to normalize the range. This maps a hand pointing straight up to zero yaw, left to positive yaw, and right to negative yaw.
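The following sketch gathers the roll, pitch, and yaw computations of Sections 5.3–5.5 in one function. The sign conventions mirror the formulas above and should be treated as a reconstruction rather than the project's exact code.

```python
import numpy as np


def hand_angles(P: dict) -> tuple:
    """Approximate roll, pitch, and yaw (radians) from back-projected landmarks.
    P maps MediaPipe landmark indices to camera-frame points (x, y, z);
    sign conventions are illustrative reconstructions, not the paper's code."""
    # Roll: elevation of the index-base -> pinky-base line (5 -> 17)
    # relative to the camera XY plane.
    v = P[17] - P[5]
    roll = np.arctan2(v[2], np.hypot(v[0], v[1]))

    # Pitch: elevation of the wrist -> finger-base vector (0 -> mean of 5, 17).
    w = (P[5] + P[17]) / 2.0 - P[0]
    pitch = np.arctan2(w[2], np.hypot(w[0], w[1]))

    # Yaw: in-plane orientation of the finger axis, from the average of the
    # index/middle bases (5, 9) to the average of their tips (8, 12).
    a = (P[8] + P[12]) / 2.0 - (P[5] + P[9]) / 2.0
    yaw = np.arctan2(a[0], -a[1])   # zero when the fingers point straight up
    return roll, pitch, yaw
```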
5.6. Throttle Computation
Throttle control leverages the depth dimension directly. We compute the palm center position as the average of the wrist and finger bases, then extract the depth value $d_{\mathrm{palm}}$ at this location. The throttle command $u_T$ is computed by mapping depth to the range $[-1, 1]$:
$$u_T = 2\,\frac{d_{\mathrm{palm}} - d_{\min}}{d_{\max} - d_{\min}} - 1,$$
where $d_{\min} = 15$ cm and $d_{\max} = 60$ cm define the operating depth range. Moving the hand closer to the camera (reducing $d_{\mathrm{palm}}$) produces negative throttle (descend), while moving it away produces positive throttle (ascend).
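A sketch of this depth-to-throttle mapping over the 15–60 cm range (assuming the linear law given above; the helper name is ours):

```python
def depth_to_throttle(d_palm_cm: float,
                      d_min_cm: float = 15.0,
                      d_max_cm: float = 60.0) -> float:
    """Map palm depth to a throttle command in [-1, 1]: hand close to the
    camera -> descend (-1), far away -> ascend (+1). A linear mapping is
    assumed here for illustration."""
    d = min(max(d_palm_cm, d_min_cm), d_max_cm)   # clamp to the operating range
    return 2.0 * (d - d_min_cm) / (d_max_cm - d_min_cm) - 1.0
```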
5.7. Saturation and Dead Zones
To ensure controllable aircraft behavior, we apply saturation (limiting) and dead zones to all control outputs. The dead zone prevents unintended drift when the operator holds a nominally neutral hand position:
$$u_{\text{out}} = \begin{cases} 0, & |u| < \delta_{\text{dead}}, \\ \operatorname{sign}(u)\,u_{\max}, & |u| > u_{\max}, \\ u, & \text{otherwise}. \end{cases}$$
The angular dead zone and maximum angle limits, together with a throttle dead zone of 0.25 (normalized units), were tuned experimentally to balance responsiveness against stability.
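A sketch of the combined dead-zone and saturation logic; only the throttle dead zone (0.25) is stated in the text, so the limit used in the example is a placeholder.

```python
def apply_deadzone_and_saturation(value: float,
                                  dead_zone: float,
                                  max_value: float) -> float:
    """Zero out small inputs, clamp large ones, and pass the rest through."""
    if abs(value) < dead_zone:
        return 0.0
    return max(-max_value, min(max_value, value))


# Example with the stated throttle dead zone; the limit of 1.0 is an assumption.
throttle_cmd = apply_deadzone_and_saturation(0.18, dead_zone=0.25, max_value=1.0)  # -> 0.0
```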
5.8. Speed Level Control
The system implements a configurable speed scaling mechanism that modulates control signal magnitude across all four degrees of freedom. Operators can select speed levels 1–5 through gesture commands (ROCK → ONE/TWO/THREE/FOUR/FIVE), with each level applying a proportional scaling coefficient $k_s$, where $s \in \{1, \ldots, 5\}$ is the selected speed level. All velocity commands are scaled accordingly:
$$\mathbf{u}_{\text{scaled}} = k_s\,\mathbf{u}.$$
For safety during initial testing and novice operator training, a safe_mode parameter constrains the system to speed level 1 regardless of gesture commands. This mechanism prevents accidental high-speed maneuvers while operators familiarize themselves with the control interface.
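A sketch of the speed-level scaling and safe_mode clamp; the linear coefficient $k_s = s/5$ is an assumption for illustration, since the exact coefficient is not reproduced here.

```python
def scale_command(velocity: float, speed_level: int, safe_mode: bool) -> float:
    """Scale a velocity command by the selected speed level (1-5).
    The linear coefficient s/5 is an assumption for illustration;
    safe_mode forces level 1 regardless of the gesture-selected level."""
    level = 1 if safe_mode else max(1, min(5, speed_level))
    return velocity * (level / 5.0)
```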
Landing considerations: The speed scaling affects the continuous control mode, enabling gradual altitude reduction through throttle manipulation at operator-selected velocities. However, the discrete landing command (OK → TWO) invokes the DJI Tello SDK’s built-in land() function, which executes a fixed descent profile managed by the drone’s internal firmware. This design prioritizes landing safety by delegating the critical touchdown phase to proven onboard algorithms, while continuous throttle control allows for precise altitude management during approach. For applications requiring fully customizable landing profiles, integration with flight controllers supporting velocity-mode descent (e.g., PX4 via MAVLink) would enable gesture-controlled landing speed throughout the entire maneuver.
5.9. Signal Filtering Pipeline
Raw control signals from frame-by-frame pose estimation exhibit considerable noise due to landmark prediction jitter, depth sensor noise, and hand micro-movements. We employ a three-stage filtering pipeline to produce smooth control outputs while maintaining responsiveness.
5.9.1. Stage 1: Sliding Window with Outlier Rejection
Control values are accumulated in a sliding window (deque) of size $N = 5$. On each update, we compute the median $m$ and standard deviation $\sigma$ of the window contents, then reject outliers:
$$\mathcal{W}_{\text{clean}} = \{\, x_i \in \mathcal{W} : |x_i - m| \le k\,\sigma \,\}.$$
The cleaned mean (or the median, if too few samples survive rejection) proceeds to the next stage. The rejection threshold $k$ was chosen to remove gross errors while preserving intentional control changes; the three-sigma rule would introduce excessive lag.
5.9.2. Stage 2: Exponential Smoothing
The cleaned mean $\bar{u}_t$ is processed by an exponential moving average filter:
$$\hat{u}_t = \alpha\,\bar{u}_t + (1 - \alpha)\,\hat{u}_{t-1},$$
where $\alpha \in (0, 1]$ balances responsiveness and smoothness. The exponential smoothing removes high-frequency noise components while allowing the control signal to track sustained changes within approximately $1/\alpha$ frames.
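A sketch of the two filtering stages acting on a single control channel; the window size of 5 matches the reported value, while the rejection multiplier and smoothing factor are placeholders.

```python
from collections import deque

import numpy as np


class ControlFilter:
    """Sliding-window outlier rejection followed by exponential smoothing.
    The rejection multiplier k and smoothing factor alpha are placeholders;
    the window size of 5 matches the value reported in the paper."""

    def __init__(self, window: int = 5, k: float = 2.0, alpha: float = 0.3):
        self.buffer = deque(maxlen=window)
        self.k = k
        self.alpha = alpha
        self.smoothed = 0.0

    def update(self, raw_value: float) -> float:
        self.buffer.append(raw_value)
        values = np.asarray(self.buffer)
        median, sigma = np.median(values), np.std(values)
        # Stage 1: drop samples far from the window median, then average.
        kept = values[np.abs(values - median) <= self.k * sigma] if sigma > 0 else values
        cleaned = kept.mean() if kept.size else median
        # Stage 2: exponential moving average.
        self.smoothed = self.alpha * cleaned + (1.0 - self.alpha) * self.smoothed
        return self.smoothed
```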
Figure 5 illustrates the three-stage filtering pipeline.
Table 3 summarizes the filter parameters and their effects on system behavior.
6. ROS2 Implementation
This section details the software architecture implementing the Aerokinesis gesture control system within the ROS2 framework. We describe the distributed node architecture, inter-node communication patterns, the hierarchical state machine governing system behavior, and the real-time processing pipeline.
6.1. Distributed Node Architecture
The system follows the ROS2 design philosophy of modular, loosely coupled nodes communicating via publish–subscribe messaging.
Figure 6 illustrates the complete node graph and data flow.
The architecture comprises three primary nodes:
realsense_node: Camera driver (Intel RealSense SDK wrapper) providing hardware-synchronized RGB and depth streams at 30 FPS. Depth frames are aligned to the color camera coordinate system, enabling direct pixel-to-depth correspondence for 3D landmark computation.
hand_gesture_detector: Central processing node implementing MediaPipe inference, gesture classification, continuous control computation, and signal filtering. This node encapsulates all vision processing logic and maintains the operational state machine.
tello_driver: Flight controller interface translating ROS2 velocity commands to DJI Tello SDK calls over WiFi. Handles connection management, telemetry reception, and command rate limiting.
This separation enables independent testing of each component and facilitates adaptation to alternative hardware (e.g., different cameras or flight controllers) by replacing only the relevant driver node.
6.2. Hierarchical State Machine
The system operates as a hierarchical finite state machine (FSM) with two levels: the operational mode (Command/Control) and the flight state (Disarmed/Armed/Flying).
Figure 7 presents the complete state transition diagram.
Key design decisions in the state machine include:
Two-gesture safety sequences: Critical commands (takeoff, landing) require a prefix gesture (OK) followed by confirmation (ONE/TWO), preventing accidental activation. The system maintains a 2 s timeout between prefix and completion gestures.
Mode transition delay: Entering Control Mode triggers a 5 s countdown, allowing operators to reposition their hand from the THUMBS_UP gesture to a neutral control pose without inadvertent commands.
Fail-safe behavior: If hand tracking is lost during Control Mode (no detection for 500 ms), the system automatically publishes zero velocity commands, causing the drone to hover in place rather than continuing its last trajectory.
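As an illustration of this fail-safe, the following rclpy sketch publishes a zero-velocity Twist once 500 ms have elapsed without a hand detection; the node name, topic, and method names are ours, not the actual Aerokinesis identifiers.

```python
from geometry_msgs.msg import Twist
from rclpy.node import Node


class HoverWatchdog(Node):
    """Publish zero velocity if no hand detection has arrived for 500 ms.
    Node and topic names ('hover_watchdog', '/cmd_vel') are illustrative."""

    def __init__(self):
        super().__init__('hover_watchdog')
        self.cmd_pub = self.create_publisher(Twist, '/cmd_vel', 10)
        self.last_detection = self.get_clock().now()
        self.create_timer(0.05, self.check_timeout)   # check at 20 Hz

    def hand_detected(self):
        """Call whenever a valid hand detection is processed."""
        self.last_detection = self.get_clock().now()

    def check_timeout(self):
        elapsed = (self.get_clock().now() - self.last_detection).nanoseconds / 1e9
        if elapsed > 0.5:
            self.cmd_pub.publish(Twist())   # all-zero twist -> hover in place
```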
6.3. Real-Time Processing Pipeline
Figure 8 details the frame-by-frame processing pipeline executed at 30 FPS within the hand_gesture_detector node.
The pipeline implements several optimizations for real-time performance:
Early termination: Frames without detected hands skip all downstream processing, reducing computational load during operator absence;
Shared landmark buffer: MediaPipe landmarks are extracted once and reused for both gesture classification (2D normalized coordinates) and control computation (3D reconstruction with depth);
Asynchronous publishing: ROS2 message publication is non-blocking, allowing the next frame to begin processing immediately after command generation;
GPU acceleration: MediaPipe inference leverages CUDA on the Jetson platform, reducing detection latency from ∼85 ms (CPU) to ∼28 ms (GPU).
6.4. Inter-Node Communication
Table 4 summarizes the ROS2 topic interface. The system uses standard message types for hardware interoperability.
The geometry_msgs/Twist message encodes the four control channels: linear.x (roll/lateral), linear.y (pitch/forward), linear.z (throttle/vertical), and angular.z (yaw/rotation). This standard message type ensures compatibility with diverse robotic platforms beyond the DJI Tello.
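For illustration, a minimal sketch of packing the four filtered control values into this message (the function and variable names are ours):

```python
from geometry_msgs.msg import Twist


def build_twist(roll_cmd: float, pitch_cmd: float,
                throttle_cmd: float, yaw_cmd: float) -> Twist:
    """Pack the four control channels using the mapping described above.
    Variable names are illustrative; only the field assignment follows the text."""
    msg = Twist()
    msg.linear.x = roll_cmd       # roll / lateral
    msg.linear.y = pitch_cmd      # pitch / forward
    msg.linear.z = throttle_cmd   # throttle / vertical
    msg.angular.z = yaw_cmd       # yaw / rotation
    return msg
```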
6.5. Reproducibility and Deployment
The ROS2 workspace deployment requires:
ROS2 Humble on Ubuntu 22.04;
Intel RealSense SDK 2.0 with ROS2 wrapper;
MediaPipe 0.10+ with Python 3.10+ bindings;
PyTorch 2.0+ (CPU or CUDA);
DJI Tello SDK (or compatible flight controller).
A single launch file initializes all nodes with configurable parameters for camera resolution, control sensitivity, and safety limits.
6.6. Node Architecture
The system is implemented as a ROS2 node (HandGestureDetector) that subscribes to camera topics and publishes velocity commands. Gesture recognition is performed on every incoming frame from the depth camera stream, synchronized with the RGB stream at 30 FPS. This frame-by-frame processing approach, rather than temporal aggregation or periodic sampling, ensures maximum responsiveness to operator commands, while the filtering pipeline (Section 5.9) handles inter-frame noise. The main processing loop (Algorithm 1) operates as follows:
Algorithm 1: Main Processing Loop
Input: synchronized color frame $I_c$, depth frame $I_d$, camera intrinsics $K$
1: Detect hand landmarks in $I_c$ using MediaPipe
2: if no hand is detected then
3:  if in Control Mode and aircraft in flight then
4:   Publish hover command (zero velocity)
5:  end if
6:  return
7: end if
8: Buffer the landmarks for reuse: normalized 2D coordinates and 3D positions reconstructed from $I_d$ and $K$
9: if in Command Mode then
10:  Classify the gesture from the normalized 2D landmarks (MLP)
11:  Process gesture according to state machine
12: else
13:  Compute roll, pitch, yaw, and throttle from the 3D hand pose
14:  Apply the filtering pipeline (Section 5.9)
15:  Publish velocity command
16: end if
6.7. Topic Structure
The node interfaces with the ROS2 ecosystem through the topics shown in Table 5.
6.8. State Machine
The command mode operates as a finite state machine tracking the current prefix gesture and expected completions. The THUMBS_UP gesture transitions from Command Mode to Control Mode with a configurable delay (default 5 s), allowing the operator to position their hand. The TWO gesture in Control Mode returns to Command Mode.
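A compact sketch of this prefix-and-confirmation logic with the 2 s timeout described in Section 6.2; the class structure and method names are ours, and only the takeoff/landing pairs from Table 1 are shown.

```python
import time


class CommandSequencer:
    """Track prefix gestures (e.g., 'OK') and their confirmations.
    The 2 s prefix timeout follows the text; the mapping below shows only
    the takeoff/landing pairs as an example."""

    SEQUENCES = {('OK', 'ONE'): 'takeoff', ('OK', 'TWO'): 'land'}
    PREFIX_TIMEOUT_S = 2.0

    def __init__(self):
        self.prefix = None
        self.prefix_time = 0.0

    def on_gesture(self, gesture: str):
        now = time.monotonic()
        if self.prefix and now - self.prefix_time > self.PREFIX_TIMEOUT_S:
            self.prefix = None                       # prefix expired
        if self.prefix is None:
            if gesture in {p for p, _ in self.SEQUENCES}:
                self.prefix, self.prefix_time = gesture, now
            return None
        command = self.SEQUENCES.get((self.prefix, gesture))
        self.prefix = None
        return command                               # e.g., 'takeoff' or None
```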
7. Experimental Evaluation
This section presents a comprehensive evaluation of the Aerokinesis system across multiple dimensions. We first describe the experimental setup, including both simulation and hardware platforms used for testing (Section 7.1). Subsequently, we evaluate gesture classification accuracy and per-class performance metrics (Section 7.2). System latency characteristics are analyzed to validate real-time operation (Section 7.3). A comparative user study quantifies the practical benefits of gesture control for operators with varying experience levels (Section 7.4). Finally, we demonstrate the effectiveness of the signal filtering pipeline through time-series analysis (Section 7.5).
7.1. Experimental Setup
7.1.1. Simulation Environment
Initial testing employed the Gazebo Ignition simulator with a DJI Tello model. The simulation environment includes a course of ring-shaped waypoints requiring coordinated roll, pitch, yaw, and throttle control to navigate. This standardized task enables quantitative comparison across operators and control modalities.
Figure 9 illustrates the gesture control system operating in the Gazebo simulation environment.
7.1.2. Hardware Platform
Physical experiments used the NVIDIA Jetson Orin NX 8GB with an Intel RealSense D435i camera, housed in a custom 3D-printed enclosure (PETG, FLSUN V400 printer) (Figure 10). The DJI Tello quadcopter was connected via WiFi for command transmission. The system achieved consistent frame rates of 25–30 FPS during operation.
7.2. Gesture Classification Performance
The gesture classifier was evaluated on a subject-wise held-out test set of 2400 samples (300 per class) collected from two participants whose data was entirely excluded from the training and validation process. This evaluation protocol ensures that reported metrics reflect the model’s ability to generalize to new users rather than memorizing participant-specific hand characteristics.
Table 6 reports per-class precision, recall, and F1 scores.
The classifier achieves an overall accuracy of 99.3% with inference time of 0.8 ms per sample on GPU. The lowest-performing classes (THREE, FOUR) exhibit minor confusion due to subtle differences in partially extended finger configurations.
7.3. Control Latency
End-to-end latency from camera frame capture to control command publication was measured over 1000 frames during active control.
Table 7 reports the timing breakdown.
MediaPipe hand detection dominates latency but remains well within real-time requirements. The total latency of approximately 32 ms enables smooth 30 FPS control. Importantly, gesture recognition is performed on every frame without temporal aggregation or frame skipping, ensuring that the system responds to operator input with minimal delay. The 30 FPS recognition frequency provides sufficient temporal resolution for both discrete command detection (where gestures are held for 0.5–1.0 s) and continuous control (where hand pose changes smoothly between frames).
End-to-End System Latency
While Table 7 characterizes vision processing latency (∼32 ms), the complete end-to-end delay from operator hand motion to observable UAV response includes WiFi command transmission, drone firmware processing, and motor response dynamics, as shown in Table 8. We measured this total latency using synchronized high-speed video (120 FPS) capturing both operator hand movement initiation and drone response onset, analyzing frame timestamps to determine system delay (Figure 11).
Under optimal conditions (single WiFi network, minimal interference, <3 m operator-to-base-station distance), end-to-end latency averaged 200 ms, with the 95th percentile below 280 ms. Users reported this delay as acceptable, perceiving control as “responsive” in post-experiment surveys. However, in degraded RF environments—multiple 2.4 GHz networks, Bluetooth devices, or microwave interference—latency increased to 400–580 ms, approaching the threshold where operators reported noticeable lag affecting control precision.
WiFi constraints: the DJI Tello operates exclusively on 2.4 GHz WiFi using only three non-overlapping channels (1, 6, and 11), making it highly susceptible to interference in typical indoor environments. This represents a fundamental platform limitation inherited from the consumer drone design. Future work will investigate integration with drones supporting (1) 5 GHz WiFi bands with reduced congestion; (2) dedicated low-latency control links (e.g., ExpressLRS, TBS Crossfire) achieving sub-10 ms transmission; or (3) wired tether connections for latency-critical applications. Based on our measurements, we estimate that eliminating WiFi variability could reduce end-to-end latency to approximately 120 ms, approaching the responsiveness of wired gaming controllers.
7.4. User Study: Task Completion Time
To evaluate practical usability, we conducted a comparative user study with 12 participants (8 male, 4 female; age range 22–35). Participants were categorized as novice (no prior drone experience) or experienced (more than 10 h of prior drone piloting).
7.4.1. Protocol
Each participant completed the waypoint navigation task using both gesture control and a smartphone touchscreen application (Tello App), with order counterbalanced. Participants received 10 min of instruction for each control modality before testing. Task completion time (time to pass through all eight waypoints) was recorded for three trials per condition, with the median used for analysis.
7.4.2. Results
Table 9 reports mean task completion times with standard deviations.
Statistical analysis (paired t-test) reveals a statistically significant advantage for gesture control among novice operators, with a mean improvement of 52.6%. For experienced operators, the difference is not statistically significant, indicating that gesture control achieves parity with traditional controllers once operators gain familiarity.
7.5. Filter Effectiveness
Figure 12 compares raw and filtered control signals for roll (a) and throttle (b) channels during representative flight segments.
The roll channel exhibits higher noise due to the compound computation from multiple depth measurements, demonstrating the necessity of robust filtering. The throttle channel is more stable but occasionally shows spikes from depth sensor artifacts, which the outlier rejection stage effectively removes.
8. Discussion
8.1. Contributions and Findings
The Aerokinesis system demonstrates that vision-based gesture control can provide an effective alternative to traditional UAV controllers, particularly for novice operators. The 52.6% reduction in task completion time for beginners suggests that the intuitive hand-to-aircraft mapping substantially reduces the cognitive burden of UAV operation.
The hierarchical control architecture addresses a key limitation of existing systems by separating discrete commands from continuous control. Gesture sequences for safety-critical operations (takeoff, landing) prevent accidental activation, while the continuous control mode enables precise maneuvering. This design pattern may be applicable to other robotic systems requiring both high-level mode switching and fine-grained control.
The signal filtering pipeline represents a practical contribution for gesture-based control applications. The combination of outlier rejection, moving average, and exponential smoothing provides demonstrably robust noise handling without introducing excessive lag. The empirically determined parameters (smoothing factor, window size of 5, and outlier rejection threshold) offer a reasonable starting point for similar systems.
Real-World Flight Demonstration
To validate the practical applicability of the proposed Aerokinesis system, we conducted real-world flight tests using the DJI Tello quadcopter controlled via hand gestures.
Figure 13 demonstrates the system in operation, showing an operator controlling the quadcopter’s position and altitude through intuitive hand movements.
The demonstration setup includes the NVIDIA Jetson Orin NX computing platform with a connected touchscreen display showing real-time control feedback, the Intel RealSense D435i depth camera mounted on a tripod for optimal hand tracking, and the DJI Tello quadcopter responding to gesture commands via WiFi connection. During testing, the system maintained stable control, with the quadcopter successfully executing hover, translation, and altitude adjustment commands based on operator hand poses.
The real-world tests confirmed that the filtering pipeline effectively compensates for environmental noise sources present in practical deployment scenarios, including varying ambient lighting, background clutter, and the inherent latency of WiFi-based command transmission. The quadcopter exhibited smooth, predictable responses to operator commands, validating the effectiveness of the proposed control signal processing approach.
8.2. Limitations
Several limitations warrant acknowledgment:
1. Confidence Estimation: The current implementation does not incorporate explicit confidence thresholding for gesture classification. The classifier uses argmax over softmax outputs without rejecting low-confidence predictions. However, the system implements implicit safety mechanisms: when no hand is detected in the frame, the UAV automatically enters a hover state (zero velocity commands), preventing uncontrolled drift. Additionally, frames with partially visible hands (landmarks extending beyond image boundaries) are automatically rejected to avoid erroneous classifications.
2. Environmental Constraints: The system requires adequate lighting for reliable MediaPipe detection and depth sensing within the 15–60 cm operating range. Outdoor deployment would require adaptation for varying lighting and potential depth sensor interference from sunlight.
3. Single-Operator Design: The current implementation tracks a single hand. Multi-operator scenarios or two-handed control would require architectural extensions.
4. Simulation–Reality Gap: While we validated on physical hardware, the user study employed simulation. Real-world deployment introduces additional variables, including wind disturbance, communication latency, and aircraft dynamics, that may affect comparative performance.
5. Gesture Vocabulary: The eight-gesture vocabulary suffices for basic operation but may limit advanced mission profiles requiring additional commands.
Environmental Constraints and Extreme Conditions
It should be noted that all experimental validation was conducted indoors under controlled room lighting conditions (approximately 300–500 lux, typical office illumination). The system performance in extreme environmental scenarios has not been systematically evaluated and represents a known limitation:
Strong lighting and direct sunlight: The Intel RealSense D435i depth camera employs active infrared stereo sensing, which is susceptible to interference from strong ambient infrared sources, including direct sunlight. Outdoor operation under bright sunlight conditions would likely cause depth measurement degradation or failure, directly affecting throttle control accuracy, which relies on palm depth estimation. The RGB-based MediaPipe hand detection demonstrates greater robustness to lighting variations but may exhibit reduced landmark accuracy under extreme contrast conditions such as strong backlighting or harsh shadows.
Dense fog and low visibility conditions: Fog, smoke, dust, and other atmospheric obscurants scatter both visible and infrared light, degrading the performance of both the depth camera and RGB-based hand detection. The system was not tested under such conditions and reliable hand detection would likely fail when visibility drops below approximately 5 m.
Complex and cluttered backgrounds: MediaPipe’s palm detector is a learned model that may produce false positive detections when the background contains hand-like visual features (flesh-toned surfaces, gloves, posters with hands) or may fail to detect hands against high-clutter backgrounds with numerous edge features. Our experiments utilized relatively uniform backgrounds (solid-colored walls, uncluttered laboratory environment). System performance in visually complex environments such as outdoor foliage, industrial facilities, or crowded spaces requires additional investigation.
Future work should include systematic characterization of system performance across a range of environmental parameters to establish clear operational boundaries and develop appropriate mitigation strategies for challenging conditions.
8.3. Applications in IoT Ecosystems
The gesture control paradigm extends naturally to broader IoT applications beyond UAVs. Smart home automation, industrial robot guidance, and assistive technology for mobility-impaired users represent domains where intuitive gesture interfaces could replace or complement traditional controls. The modular ROS2 architecture facilitates integration with diverse IoT platforms through standard middleware protocols.
The demonstrated approach of edge computing for real-time inference combined with lightweight communication (velocity commands rather than video streams) addresses the latency and bandwidth constraints characteristic of IoT deployments. As embedded computing capabilities continue improving, similar vision-based interfaces may become feasible for a wider range of IoT devices.
9. Conclusions
This paper presented Aerokinesis, an IoT software–hardware system for gesture-driven quadcopter control within the ROS2 framework. The system achieves over 99% gesture classification accuracy using a MediaPipe-based hand tracking frontend with a lightweight neural network classifier. Continuous control computation from 3D hand poses enables intuitive four-degree-of-freedom maneuvering, with a hybrid filtering pipeline ensuring robust performance despite sensor noise.
Experimental validation demonstrated significant advantages for novice operators (52.6% reduced task completion time) compared to traditional controllers, while maintaining parity for experienced users. The complete system operates in real-time (approximately 30 FPS) on embedded hardware suitable for IoT deployment. It should be noted that all experiments were conducted under controlled indoor lighting conditions; system performance characterization across diverse environmental conditions (outdoor sunlight, fog, complex backgrounds) remains an important direction for future investigation.
Future work will address current limitations through (1) outdoor-capable depth sensing modalities such as stereo matching; (2) expanded gesture vocabularies including dynamic gestures for mission commands; (3) multi-agent control enabling single-operator coordination of drone swarms; (4) integration with autonomous navigation for shared-control paradigms; and (5) incorporation of confidence estimation for gesture recognition, where predictions below a learned threshold would trigger automatic transition to a safe hover state, further improving operational safety for novice users.
The gesture interface paradigm offers promising potential for democratizing UAV operation by reducing training requirements and providing intuitive control mappings. As computer vision capabilities continue advancing, such natural interaction modalities may become standard for IoT-enabled robotic systems across diverse application domains.