1. Introduction
Driven by the surge in deep learning, the field of unmanned aerial vehicle (UAV) computer vision has experienced a revolutionary shift in architectural design. Object detection algorithms have evolved from the early Region-based Convolutional Neural Network (R-CNN) [
1] series to end-to-end inference frameworks such as You Only Look Once (YOLO) [
2] and the detection transformer (DETR), achieving a superior balance between detection precision and real-time responsiveness. Integrated with mature multi-object tracking (MOT) strategies such as DeepSORT [
3] and ByteTrack [
4], UAV platforms have fully leveraged their mobility advantages in practical applications such as aerial logistics, infrastructure inspection, and search-and-rescue operations [
5]. However, existing control modalities still face severe challenges in practical operations. Traditional remote-control modes impose an excessively high cognitive workload on operators and exhibit limited response efficiency when handling concurrent multi-target switching tasks [
6]. At a deeper level, fully autonomous visual decision-making systems struggle with reliability in typical aerial photography environments. These environments are often disrupted by dynamic backgrounds, occasional target occlusion, and vibrations from the aircraft caused by wind. Additionally, these systems lack an intuitive and efficient way for humans to step in and provide feedback, making it difficult to integrate human intelligence when needed [
7].
Simultaneously, the essence of a brain–computer interface (BCI) is to bypass the traditional peripheral nerve and muscle pathways, creating a direct connection between the human central nervous system and external devices and enabling a seamless flow of information [
8]. By demodulating specific neuro-electrophysiological signals in real-time, this technology has transformed the way patients with motor dysfunction interact in clinical rehabilitation [
9]. It has also proven to be strategically valuable in advanced tasks such as coordinating complex unmanned systems and controlling intelligent equipment [
10]. Among the various BCI paradigms, the steady-state visual evoked potential (SSVEP) has become a core technical path for constructing high-rate human–machine collaborative frameworks, owing to its high information transfer rate (ITR) and minimal user training requirements [
11]. The field of BCI has advanced significantly through the continuous development of feature extraction paradigms, such as canonical correlation analysis (CCA) and filter bank canonical correlation analysis (FBCCA) [
12]. Consequently, the robust recognition accuracy of modern SSVEP-BCIs now provides a reliable foundation for real-time human–machine interaction in complex environments [
13].
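The CCA step referenced above can be made concrete with a short sketch: a multichannel EEG epoch is correlated against sine and cosine reference templates at each candidate stimulus frequency, and the frequency with the largest canonical correlation is selected. This is an illustrative minimal implementation, not the authors' pipeline; the channel count, harmonic number, and sampling rate are assumptions.

```python
import numpy as np

def canonical_corr(X, Y):
    """Largest canonical correlation between the row spaces of X and Y."""
    X = X - X.mean(axis=1, keepdims=True)
    Y = Y - Y.mean(axis=1, keepdims=True)
    Qx, _ = np.linalg.qr(X.T)   # orthonormal basis over samples
    Qy, _ = np.linalg.qr(Y.T)
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return s[0]

def make_reference(freq, fs, n_samples, n_harmonics=3):
    """Sine/cosine reference set at a stimulus frequency and its harmonics."""
    t = np.arange(n_samples) / fs
    refs = []
    for h in range(1, n_harmonics + 1):
        refs.append(np.sin(2 * np.pi * h * freq * t))
        refs.append(np.cos(2 * np.pi * h * freq * t))
    return np.vstack(refs)

def cca_classify(X, freqs, fs):
    """Pick the stimulus frequency whose references correlate best with X."""
    n = X.shape[1]
    rhos = [canonical_corr(X, make_reference(f, fs, n)) for f in freqs]
    return freqs[int(np.argmax(rhos))]
```

In practice, filter bank variants (FBCCA) apply this same correlation over several sub-band filtered copies of the epoch and fuse the scores.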
Despite the technical maturity of both BCI and computer vision (CV), their deep integration into dynamic UAV operational scenarios remains constrained by several bottlenecks. First, mainstream SSVEP stimulus interfaces are largely limited to static layouts, making them difficult to adapt to the rapidly evolving targets and backgrounds observed from a UAV’s perspective [
14]. This spatio-temporal decoupling between stimulus sources and physical targets directly leads to confusion in electroencephalography (EEG) decoding and semantic fragmentation. Second, constrained by the power-to-performance ratio of embedded edge computing platforms such as the RK3588 controller, the high computational demands of native YOLO or transformer models often result in performance degradation, which hinders the real-time execution and on-device deployment of integrated systems [
15]. Additionally, the visual fatigue induced by the SSVEP paradigm becomes especially pronounced during long missions; combined with interference from UAV turbulence, it severely challenges the reliability of EEG feature extraction. Most importantly, in dynamic environments where targets change frequently, traditional touch- or button-based controls struggle to create a smooth connection between intent and action, limiting seamless human–machine collaboration in fast-moving scenarios [
16].
Based on these considerations, exploring the coordination mechanisms between SSVEP-BCI, CV and UAV control while addressing the adaptation issues between intent decoding and target locking in dynamic scenes offers significant theoretical value and pressing engineering opportunities [
17]. To address the complex challenges of dynamic UAV operations, this study proposes a dynamic vision–BCI control framework. By designing a spatio-temporally bound visual stimulus paradigm, the SSVEP stimulus sources move synchronously with the detected visual targets. To overcome edge-side resource limitations, lightweight object detection algorithms and tracking logic have been optimized. Finally, by establishing a closed-loop feedback mechanism of intent demodulation-visual locking-control law mapping, the system enables natural target selection and robust tracking by the user.
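The closed-loop cycle of intent demodulation, visual locking, and control-law mapping can be sketched schematically as follows. All names here (the `Track` structure, the frequency-matching tolerance, the pixel-error output) are hypothetical stand-ins for illustration, not the system's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Track:
    track_id: int
    cx: float    # bounding-box centre, pixels
    cy: float
    freq: float  # SSVEP tag frequency bound to this track

class ClosedLoopController:
    """Schematic intent-demodulation -> visual-locking -> control-mapping loop."""
    def __init__(self, tracks):
        self.tracks = {t.track_id: t for t in tracks}
        self.locked_id = None

    def demodulate_intent(self, decoded_freq):
        """Map a decoded SSVEP frequency back to the physical track it tags."""
        for t in self.tracks.values():
            if abs(t.freq - decoded_freq) < 0.1:
                return t.track_id
        return None

    def step(self, decoded_freq, frame_center=(640, 360)):
        tid = self.demodulate_intent(decoded_freq)
        if tid is not None:
            self.locked_id = tid          # visual locking on the chosen ID
        if self.locked_id is None:
            return (0.0, 0.0)
        t = self.tracks[self.locked_id]   # control-law mapping: centre the target
        return (t.cx - frame_center[0], t.cy - frame_center[1])
```

Because the lock persists on the track ID, an undecodable or spurious frequency in a later window leaves the previously selected target under tracking, matching the closed-loop behaviour described above.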
The remainder of this paper is organized as follows:
Section 2 outlines the technical scheme of the dynamic vision–BCI system, systematically elucidating the dynamic SSVEP encoding strategy, the improved YOLOv11 detection algorithm, and their collaborative logic at the command mapping layer by deconstructing the integrated architecture of the embedded edge platform and gimbal.
Section 3 focuses on experimental validation, analyzing the system’s operational effectiveness in dynamic scenarios and the robustness of online closed-loop control based on offline classification accuracy, hardware-in-the-loop (HIL) simulations and real-world paradigm tests.
Section 4 discusses the adaptive advantages of the hybrid system in UAV missions, examining the trade-offs between algorithmic precision and edge-side real-time performance while identifying current research limitations. Finally,
Section 5 summarizes the core findings and presents a vision for the future development of human–machine collaborative UAV technology.
3. Results and Analysis
3.1. Pilot Study Results and Analysis
The pilot study phase involved a systematic evaluation of SSVEP response characteristics under varying frequency and color configurations. This investigation aimed to establish an optimal parameter baseline for subsequent formal experiments through a comprehensive analysis of classification accuracy, ITR and quantified visual fatigue metrics [
39]. Preliminary results indicate that classification performance and ITR demonstrate strong statistical robustness across diverse color features, providing an empirical foundation for parameter deployment in complex operational scenarios. In the initial testing round, four visual paradigms (white, red, green, and blue) were evaluated across three distinct time windows (1.0 s, 1.4 s and 1.8 s) using the CCA algorithm [
40] to assess the demodulation efficacy of neural signals. The results are shown in
Figure 10 and
Figure 11. The experimental data suggest that the combination of stimulus frequency and chromatic background significantly modulates the temporal and spectral characteristics of SSVEP signals. Furthermore, these factors likely influence the subject’s fatigue levels indirectly by regulating visual attention and the allocation of neural resources. These findings offer theoretical support for the selective regulation mechanisms of SSVEP signals and serve as a valuable reference for optimizing SSVEP-based BCI system designs. Quantitatively, the classification accuracy exhibited a monotonic upward trend as the data length increased, rising from 88.14% to 95.68%. Conversely, the ITR decayed from 91.55 bits/min to 72.61 bits/min, a consequence of the increased temporal cost associated with longer windows. Statistical inference revealed a significant gain in accuracy within the 1.0 s to 1.4 s interval (
p < 0.05); however, the rate of improvement plateaued beyond the 1.4 s mark. Notably, except for Paradigm 1 (white) at the 1.8 s window, no statistically significant differences in accuracy were observed between the various chromatic paradigms at equivalent time steps. This phenomenon implies that stimulus color is not the primary variable governing classification performance.
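The accuracy–ITR trade-off quantified above follows from the standard Wolpaw formula, which converts per-selection accuracy into bits per minute. The sketch below is generic: the exact figures reported in the text additionally depend on the number of targets and on any gaze-shift time counted into the selection period, which are not reproduced here.

```python
import math

def itr_bits_per_min(n_targets, accuracy, selection_time_s):
    """Wolpaw ITR: bits per selection, scaled to bits/min."""
    p, n = accuracy, n_targets
    if p <= 1.0 / n:          # at or below chance: no information transferred
        return 0.0
    bits = math.log2(n) + p * math.log2(p)
    if p < 1.0:
        bits += (1 - p) * math.log2((1 - p) / (n - 1))
    return bits * 60.0 / selection_time_s
```

This makes the observed decay explicit: at a fixed accuracy, lengthening the window from 1.0 s to 1.8 s divides the throughput by 1.8, so accuracy gains must outpace the temporal cost for ITR to rise.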
Subsequently, this research focused on evaluating the adverse effects of dynamic SSVEP paradigms and complex backgrounds on signal decoding. To ensure an objective and comprehensive assessment, we conducted a horizontal benchmark test across seven mainstream algorithms: CCA, Extended Canonical Correlation Analysis (eCCA), Multi-stimulus Extended Canonical Correlation Analysis (ms-eCCA), Ensemble Task-Related Component Analysis (eTRCA), Multi-stimulus Ensemble Task-Related Component Analysis (ms-eTRCA), and Task-Discriminant Component Analysis (TDCA), the last of which serves as an improved benchmark in this study. The seventh is a Hybrid Extended Canonical Correlation Analysis and Ensemble Task-Related Component Analysis (Hybrid-eCCA-eTRCA), hereafter referred to as Combined in subsequent figures and analysis for the sake of conciseness. As illustrated in
Figure 9, these comparative experiments revealed the unique performance advantages of TDCA under the specified complex constraints, providing empirical evidence for its robustness in dynamic environments.
Notably, the algorithm achieved a mean peak accuracy of 94.44% within a 1.2 s time window, effectively suppressing non-stationary noise interference induced by the environmental background. Even under the dynamic evolution paradigm, TDCA demonstrated rapid responsiveness, attaining a recognition rate of 96.67% at the 1.2 s mark. Given the scarcity of training samples in practical applications and the requirement for cross-paradigm compatibility across diverse scenarios, the comprehensive performance exhibited by TDCA renders it the optimal algorithmic choice for underpinning the dynamic hybrid BCI architecture of this study.
3.2. Object Detection Performance in Dynamic UAV Backgrounds
Beyond the precise decoding of human intent, the accurate identification and robust tracking capabilities of the visual system are critical determinants of overall system performance.
Figure 12 illustrates the evolution of the loss functions and performance metrics over a 200-epoch training cycle. From the topological structure of the loss convergence, the model demonstrates exceptional learning efficiency and stability. Specifically, the bounding box regression loss (Box Loss) and classification loss (Cls Loss) exhibit a steep decline within the initial 50 epochs before rapidly transitioning into a stable asymptotic convergence phase. Notably, the validation loss (Val Loss) closely tracks the decline of the training loss without significant rebound or oscillation. This phenomenon provides compelling evidence that the improved model possesses well-defined generalization boundaries within the feature space, effectively mitigating the over-fitting risks commonly associated with deep networks.
Corresponding with the convergence of the loss functions, the evaluation metrics show a synchronized upward trajectory. The mean average precision (mAP@0.5) ascended into the saturation zone (approaching 1.0) within a remarkably short training duration, while the more stringent mAP@0.5:0.95 metric maintained a steady growth slope, eventually stabilizing above 0.85. This synergistic enhancement of precision and recall further confirms that the introduced improvement mechanisms have likely heightened the network’s sensitivity to small-object features. Consequently, the model successfully preserves the low-level spatial details essential for precise localization while retaining high-level semantic information.
To further elucidate the discriminative logic of the model within authentic aerial photography scenarios,
Figure 13 presents detection results across multi-scale perspectives alongside their corresponding Class Activation Mapping (CAM) visualizations. UAV-acquired imagery is frequently characterized by high-frequency background noise, such as intricate pavement textures, architectural shadows, and arboreal occlusions, which imposes rigorous challenges on the detector’s background suppression capabilities.
The improved YOLO model, specifically optimized for the dynamic BCI system, demonstrates a highly targeted spatial attention distribution pattern. Even in scenarios featuring dense vehicle clusters or surrounding distractors, the high-response regions (represented by the red core zones) accurately converge upon the geometric centers of the target vehicles, with activation boundaries aligning precisely with the contours of the physical objects. This focusing effect indicates that the BCI-related modules (or the attention mechanisms inspired by them) effectively guide the convolutional network to suppress feature responses in non-salient regions, thereby achieving “denoising and purification” at the feature map level.
In terms of bounding box confidence, the model demonstrates the ability to accurately capture dense traffic flows without omission or false detection, even in wide-field-of-view nadir perspectives where target pixels constitute a minimal proportion of the frame. This consistent performance across multiple scales and complex backgrounds confirms that the improved YOLO architecture possesses a superior capacity to decouple regions of interest (ROI) from intricate backgrounds during dynamic environmental perception tasks. Such capabilities satisfy the dual requirements of real-time responsiveness and precision essential for UAV-based ground observation missions.
While static detection metrics validate the model’s feature extraction capabilities at the single-frame level, the temporal consistency of target motion remains the decisive factor for ensuring the stable anchoring of SSVEP stimulus sources within a “human-in-the-loop” closed-loop control system. As shown in
Figure 14a, the experimental results intuitively demonstrate the identity preservation stability of the proposed YOLO-BCI + DeepSort framework under significant detection confidence fluctuations. Even when the bounding box confidence scores fluctuate sharply due to the interference of complex backgrounds and small target scales, the system still maintains consistent target ID assignment without ID switching or identity loss. This superior performance benefits from the cascaded matching strategy that fuses appearance features and kinematic motion information, which effectively suppresses the negative impact of transient confidence jitter on tracking continuity.
Figure 14b intuitively illustrates the dynamic tracking sequences following the integration of the DeepSort algorithm, where red vector lines represent the Kalman filter’s real-time posterior estimation of the target’s motion state, including velocity vectors and direction. To avoid ambiguity, we explicitly mark the opposite motion directions in
Figure 14a and clarify that
Figure 14b shows two independent scenes captured at different UAV altitudes (50 m and 100 m), rather than adjacent frames.
From the perspective of sequential frame evolution, UAV aerial photography is typically characterized by the coupled interference of significant ego-motion and non-linear target maneuvers, which imposes rigorous challenges on the tracker’s prediction-update mechanism. As observed in the illustrations, the algorithm effectively utilizes a kinematic prediction model to perform smooth corrections on bounding box coordinates, even in relatively high-speed non-linear scenarios. This state-space-based prediction mechanism not only compensates for transient jitter inherent in visual detection but also ensures that the visual stimulus anchors remain precisely aligned with the centers of physical targets. Notably, due to the deep fusion of appearance features and motion information via the cascaded matching strategy, the system achieved a 100% success rate in maintaining target ID for the video segments utilized in the EEG recognition evaluation of this study.
The identity identifiers of the vehicles (such as ID:11 in the upper sequence and the individual IDs within the dense lower traffic flow) maintained strict uniqueness across continuous frames, with no instances of the “ID Switch” phenomenon common in multi-object tracking. This trajectory-level robustness constitutes the cornerstone of the hybrid system’s reliable operation; it fundamentally prevents abrupt fluctuations in stimulus frequency caused by tracking drift or ID confusion. Consequently, this ensures the mapping constancy between frequency-tagging and physical semantics, providing a solid physical baseline for precise user intent locking in dynamic environments.
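The Kalman prediction–update cycle that underpins this smoothing can be illustrated with a reduced constant-velocity model over the bounding-box centre. This is a simplified sketch: DeepSORT's actual state vector also carries aspect ratio and height, and the noise parameters here are illustrative assumptions.

```python
import numpy as np

class CenterKalman:
    """Constant-velocity Kalman filter over a bounding-box centre (sketch)."""
    def __init__(self, cx, cy, dt=1.0, q=1e-2, r=1.0):
        self.x = np.array([cx, cy, 0.0, 0.0])   # state: [cx, cy, vx, vy]
        self.P = np.eye(4) * 10.0               # state covariance
        self.F = np.eye(4)
        self.F[0, 2] = self.F[1, 3] = dt        # position advances by velocity
        self.H = np.eye(2, 4)                   # we observe position only
        self.Q = np.eye(4) * q                  # process noise
        self.R = np.eye(2) * r                  # measurement noise

    def predict(self):
        """Propagate the state one frame ahead; returns predicted centre."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, zx, zy):
        """Correct the prediction with a detected centre; returns new estimate."""
        z = np.array([zx, zy])
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]
```

Because the posterior blends the kinematic prediction with the raw detection, transient confidence jitter in the detector perturbs the stimulus anchor far less than the raw boxes would.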
3.3. Results and Analysis of the Dynamic Hybrid System
In light of the pilot study’s finding that static paradigms in real-world backgrounds are susceptible to spatio-temporal aliasing from adjacent stimuli, this research specifically optimized the formal experimental protocol. The objective was to provide a rigorous analysis of the recognition robustness of the dynamic SSVEP paradigm within complex operational environments. This validation process serves not only as an efficacy assessment of the proposed algorithmic framework but also establishes critical data benchmarks and theoretical anchors for the subsequent construction of the dynamic vision–BCI system.
This section evaluates the comprehensive performance of the vision–BCI system through quantitative metrics to verify its effectiveness in authentic environments. The experimental cohort comprised 10 subjects selected from an initial pool of 15 (Sub1–Sub15); these individuals underwent fundamental training and exhibited heterogeneous physiological response characteristics, allowing for an investigation into the system’s generalization capabilities under fluctuating Signal-to-Noise Ratio (SNR) conditions. To explore the non-linear relationship between command latency and decoding performance, five sliding time windows (TW) ranging from 1.0 s to 1.8 s (with a 0.2 s increment) were established. A horizontal benchmark test was conducted across eight representative algorithms, including CCA and its variants (eCCA, ms-eCCA) [
41], eTRCA [
42], ms-eTRCA [
43], the Combined method [
44], TDCA and the TDCA-V model proposed in this study. Given the consistency of the target set, recognition accuracy was consistently employed as the primary performance metric. The results are shown in
Figure 15.
Data window length serves as the pivotal trade-off variable between the ITR and the reliability of an SSVEP system, directly constraining the completeness of feature extraction. Statistical results indicate that recognition accuracy exhibits a significant monotonic increasing trend across the entire sample set as the time window extends. Under the constraint of a 1.0 s short-time window, the traditional CCA algorithm demonstrates severe performance oscillations due to the limited frequency resolution caused by sample sparsity. Taking the low-response subject Sub5 as an example, the CCA recognition rate at 1.0 s was a mere 42.59%, failing to meet the threshold requirements for robust control. In sharp contrast, the TDCA series of algorithms demonstrated superior weak-feature enhancement capabilities, elevating Sub5’s accuracy to 79.63% under identical conditions, a gain of approximately 37 percentage points. As the window extends to approximately 1.4 s, overall system performance tends toward convergence, with certain high-performing algorithms rapidly entering a saturation zone. For instance, Sub6 achieved 100% recognition accuracy using TDCA-V within a 1.6 s window, confirming the algorithm’s capacity to overcome dynamic background noise and achieve zero-error closed-loop control given sufficient data support.
To provide an intuitive representation of the aforementioned temporal dynamic characteristics and the robustness variations among algorithms,
Figure 16 illustrates a comprehensive performance comparison at three critical time nodes: 1.0 s, 1.4 s, and 1.8 s. In this visualization, the bar heights represent mean accuracy, while the superimposed error bars quantify the standard deviation, effectively reflecting the stability distribution of the system across different individuals. Observation of the visual data reveals that as the time window contracts, traditional correlation analysis methods, such as CCA and eCCA, not only exhibit a significant attenuation in mean accuracy but also a substantial expansion in error variance, suggesting a high sensitivity to short-term data noise. Conversely, TDCA-V, leveraged by its optimized spatial filtering structure, maintains the most compact error distribution and the highest mean accuracy even at the 1.0 s limit (indicated by the light green bars). This visual evidence strongly corroborates the superiority of the improved algorithm in suppressing non-stationary interference, aligning closely with previous statistical inferences.
A horizontal comparison reveals a clear gradient stratification of algorithmic efficacy. TDCA and its variant, TDCA-V, reside firmly in the top tier due to their significant statistical advantages. Specifically, subject Sub6 achieved a mean accuracy of 97.03% under the TDCA algorithm, which further ascended to 99.26% within the TDCA-V framework. This phenomenon suggests that the TDCA architecture possesses a superior signal separation mechanism for suppressing non-stationary disturbances, such as mechanical vibrations from the UAV and dynamic visual background noise. In contrast, while eTRCA and the Combined methods outperform the baseline CCA, they remain slightly inferior to the TDCA framework in terms of peak performance; for example, Sub3 achieved 88.89% accuracy using eTRCA at the 1.4 s window, whereas TDCA had already converged to 96.30% under identical conditions.
Experimental observations indicate that the multi-stimulus algorithm (ms-eCCA) performed worse than the foundational eCCA for certain low-response subjects (e.g., Sub11). This anomalous finding reflects the spectral complexity of EEG signals in real-world environments, where excessive decomposition may introduce redundant features or non-stationary noise, thereby weakening the classifier’s discriminative boundaries. Furthermore, the experimental data highlight the potential of advanced spatial filtering algorithms to address the challenge of “BCI-illiteracy”. For Sub11, who exhibited weak physiological responses, the baseline CCA algorithm yielded a mean accuracy of only 74.07%, rendering the system unusable. However, the introduction of the TDCA-V algorithm elevated the mean accuracy to 88.89%, reaching a practical benchmark of 96.20% at the 1.8 s window. This advancement proves that spatial filtering mechanisms based on task-relatedness can significantly enhance the saliency of evoked components, effectively broadening the system’s audience applicability.
Focusing on the specific performance of the algorithms, the proposed TDCA-V demonstrates a clear advantage over the standard TDCA, notably in its ability to maintain high precision while reducing performance fluctuations across participants; this improvement is quantified by a roughly 5% net gain in overall accuracy, with subject Sub6 providing a striking example. These gains are not accidental; they stem from the synergistic integration of filter bank decomposition, multi-delay ensembles, and dynamic shrinkage regularization. Rather than merely filtering out noise, this combination allows the system to capture subtle harmonic features and adapt to individual variations in neural latency, all while preventing the overfitting common in small-sample learning. Consequently, the resulting feature distributions are more concentrated, establishing the stable classification boundaries necessary for reliable multi-target discrimination. To analyze the granular performance across multi-classification tasks, this study further constructed a series of confusion matrices for the 1.8 s time window, as shown in
Figure 17a–c.
Figure 17a presents the raw sample counts, where each row sum (
N = 90) represents the total experimental trials for a given stimulus frequency. Building upon this,
Figure 17b and
Figure 17c provide the normalized Precision and Recall matrices, respectively.
The high response values along the diagonal of the Recall matrix (
Figure 17c) confirm the system’s exceptional “catch rate,” demonstrating that the algorithm successfully identifies the vast majority of target signals from the 90 samples provided per class. Simultaneously, the Precision matrix (
Figure 17b) reflects the reliability of the algorithm’s predictions; by calculating the ratio of true positives to the total number of times a frequency was predicted (column sum), it proves that the system effectively minimizes false alarms.
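The row- versus column-normalisation described here can be written out explicitly: given a raw-count confusion matrix (rows: true class, columns: predicted class), precision divides each cell by its column sum (total predictions of that frequency), while recall divides by its row sum (the 90 trials per class). A generic sketch:

```python
import numpy as np

def precision_recall_matrices(conf):
    """Normalise a raw-count confusion matrix into precision and recall matrices."""
    conf = np.asarray(conf, dtype=float)
    col = conf.sum(axis=0, keepdims=True)   # per-predicted-class totals
    row = conf.sum(axis=1, keepdims=True)   # per-true-class totals (trials)
    precision = np.divide(conf, col, out=np.zeros_like(conf), where=col > 0)
    recall = np.divide(conf, row, out=np.zeros_like(conf), where=row > 0)
    return precision, recall
```

The `where` guards simply avoid division by zero for frequencies that were never predicted.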
Recognition accuracy for primary targets generally exceeds 90%, with performance for the 10 Hz and 14 Hz targets being particularly prominent, reaching 96.72% and 98.44%, respectively. The few instances of misclassification were primarily concentrated between the adjacent frequencies of 8 Hz and 9 Hz. This phenomenon suggests that spectral leakage or subject visual fatigue may have compromised frequency demodulation, leading some 8 Hz targets to be misidentified as 9 Hz or vice versa. In summary, the experimental data confirm that the TDCA-V algorithm achieves an optimal configuration in terms of recognition accuracy, response speed, and resilience to individual variability. Within a window setting of 1.2 s to 1.6 s, this algorithm effectively balances the dual requirements of high precision and high ITR, providing robust algorithmic support for real-time control, as shown in
Table 1.
3.4. Hardware-in-the-Loop Simulation Results and Analysis
Due to the complex electromagnetic environment of the designated airspace and the fact that the components utilized in this study are exclusively consumer-grade, the physical safety of real-world flight experiments cannot be guaranteed. Consequently, the experimental evaluation adopts a hardware-in-the-loop (HIL) configuration to ensure both operational safety and data repeatability during EEG acquisition. Specifically, the UAV platform is mounted on a multi-degree-of-freedom mechanical test rig that emulates the motion characteristics of actual flight. As illustrated by the structural schematic in
Figure 18a and the system components in
Figure 18b, although the experiments were not conducted directly in real airspace, the multi-degree-of-freedom mechanical test rig and the actual UAV system effectively simulate environmental noise, such as vibrations, typically encountered in real-world scenarios.
The DSI-24 EEG cap used for data acquisition is shown in
Figure 18c. In place of real-world flight tests, a display screen positioned beneath the platform plays pre-recorded urban traffic videos (captured at an altitude of approximately 50 m) to simulate realistic aerial observation conditions; the overall system integration and testing scenario are depicted in
Figure 18d. During each trial, the vision module detects and tracks candidate vehicles in real-time while synchronously overlaying flickering SSVEP stimulus blocks onto the targets, as shown in the experimental interface in
Figure 18e. Once the subject’s intention is decoded, the gimbal controller locks onto the selected target and continuously adjusts the camera orientation to maintain the target near the center of the field of view.
To systematically validate the target tracking efficacy of the brain-controlled UAV within dynamic scenarios, this study selected the three top-performing subjects from the previous experiments to participate in online closed-loop testing. The experimental environment was configured with the UAV in a fixed-point hovering mode (at an altitude of 50 m), utilizing the relative motion between the UAV’s perspective and ground targets to construct a dynamic testing environment.
At the perception layer, the system loads the pre-trained and enhanced YOLOv11 model to perform real-time detection of dynamic targets within the visual field. A spatio-temporal binding strategy is employed to synchronously overlay SSVEP stimulus blocks onto the target coordinates. The experiment established an upper limit of 9 concurrent dynamic targets within the field of view, with the stimulus frequency configuration kept consistent with the formal experiments described earlier. When the number of visible targets is at or below this limit, the system dynamically maps frequency parameters based on the temporal priority of targets entering the field of view.
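The temporal-priority mapping can be sketched as a small allocator that hands out tag frequencies in target-arrival order and recycles a frequency once its track leaves the scene. The nine-frequency pool below is an assumption chosen for illustration, not the paper's exact frequency set.

```python
class FrequencyAllocator:
    """Assigns SSVEP tag frequencies to tracks by temporal priority (sketch)."""
    def __init__(self, pool=(8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0)):
        self.free = list(pool)   # ordered: earliest-available first
        self.assigned = {}       # track_id -> frequency

    def on_track_enter(self, track_id):
        """Give a newly appearing track the next free frequency, if any."""
        if track_id in self.assigned or not self.free:
            return self.assigned.get(track_id)
        self.assigned[track_id] = self.free.pop(0)
        return self.assigned[track_id]

    def on_track_exit(self, track_id):
        """Recycle the frequency of a track that has left the field of view."""
        f = self.assigned.pop(track_id, None)
        if f is not None:
            self.free.append(f)
```

Keeping the assignment keyed to track IDs for the track's whole lifetime is what preserves the frequency-to-target constancy the paradigm relies on.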
To clarify the experimental task, the subject’s goal during each trial was to select a specific target vehicle on the screen using their gaze, which then triggered the UAV’s camera gimbal to swivel and track it. A single trial sequence was designed as follows: initially, a 2 s visual cue indicated which target vehicle the subject needed to select. Next, a 4 s SSVEP stimulation phase was initiated, during which the bounding boxes of all targets flickered at different frequencies, and the subject visually focused on their designated target. In real-time, the BCI system decoded the EEG signals to identify the user’s intent. This control command was then transmitted to the host computer, which immediately directed the UAV’s gimbal to rotate and lock onto the selected target. Finally, a 2 s rest period was provided before the next trial. This process simulated the target locking and tracking actions of the UAV gimbal with high fidelity in the physical dimension. Upon completion of a trial, the gimbal camera automatically returned to its initial position, and the video sequence was reset in preparation for the subsequent test. Each subject was required to complete a total of 54 continuous trials in the online environment.
Crucially, this experiment introduced an explicit visual closed-loop feedback mechanism: regardless of the correctness of the decoding result, the UAV gimbal executed tracking actions based strictly on the real-time output of the BCI system. Via the real-time video feedback stream, subjects could intuitively assess the current tracking status (i.e., the consistency between the decoding result and the cued target) and subsequently dynamically regulate their own attention levels and psychological states to adapt to subsequent trials. Experimental results (refer to
Table 2) indicate that the three subjects achieved a mean accuracy of 91.98% and a mean ITR of 42.42 bits/min, strongly validating the robustness and efficiency of the proposed system in online human–machine collaborative tasks.
4. Discussion
The dynamic hybrid BCI system developed in this study aims to overcome the robustness bottlenecks of single-modality control in complex UAV operational scenarios. By introducing a “human-in-the-loop” interaction paradigm, the system effectively integrates the flexibility of human cognition with the high efficiency of machine vision.
In dynamic tracking tasks, the improved YOLOv11-DeepSort algorithmic framework provides a stable physical anchor for SSVEP stimulus sources, while the induced SSVEP signals act as a high-level semantic filter. Compared to fully autonomous tracking systems that rely solely on computer vision, this collaborative mechanism excels in scenarios involving target occlusion or semantic ambiguity. When fully autonomous detection algorithms suffer from bounding box jitter due to feature overlap, the operator’s gaze intent can lock onto a specific ID, forcing the control law to maintain tracking of the selected target. This suppresses, at the decision-making level, control divergence caused by visual false positives.
A critical consideration in the design of this interface is the choice between neural and ocular input modalities. While eye-tracking systems offer strengths in low-latency spatial localization, this study deliberately utilizes an EEG-based BCI to resolve the “Midas Touch” problem—the inherent difficulty in distinguishing between a user’s spontaneous environmental scanning and a deliberate intent to select a target. In complex UAV missions, operators must frequently shift their gaze to maintain situational awareness. Traditional eye-tracking often leads to false triggers during these rapid gaze transitions. In contrast, the SSVEP paradigm requires a stable neural resonance at a specific frequency, which effectively filters out non-intentional gaze shifts and establishes a robust “third control channel” when the operator’s hands are occupied by primary flight maneuvers.
Furthermore, our framework implements an optimized human–computer interaction (HCI) strategy of “Short-term Selection, Long-term Tracking” to mitigate visual fatigue. Unlike eye-tracking systems that often require continuous gaze maintenance, the operator here only needs to provide a brief “cognitive trigger” within a short time window (typically 1.0 s to 1.4 s) to lock onto a target. Once the intent is demodulated, the system hands the task over to the improved YOLOv11 and DeepSort algorithms for autonomous, continuous tracking. This intermittent control strategy is far less demanding than continuous gaze-based control, achieving a clean decoupling of “cognitive intent” from “physical execution”. It not only sidesteps the inherently limited ITR of BCI systems but also markedly reduces the operator’s cognitive load and fatigue accumulation during long-endurance missions while preserving task success rates.
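The “Short-term Selection, Long-term Tracking” hand-off can be illustrated with a minimal state machine. The class and method names below are hypothetical, chosen for the sketch rather than taken from the system’s actual controller.

```python
from enum import Enum, auto

class Mode(Enum):
    SELECTING = auto()   # awaiting a brief SSVEP "cognitive trigger"
    TRACKING = auto()    # vision tracker follows the locked ID

class IntermittentController:
    """Toy sketch of intermittent control: one demodulated intent
    locks a track ID, after which the autonomous tracker carries the
    task until the target is lost or the operator re-selects."""

    def __init__(self):
        self.mode = Mode.SELECTING
        self.target_id = None

    def on_ssvep_decode(self, target_id):
        # A single decoded selection (~1.0-1.4 s window) suffices;
        # no continuous gaze maintenance is required afterwards.
        self.target_id = target_id
        self.mode = Mode.TRACKING

    def on_frame(self, detections):
        # detections: {track_id: (x, y, w, h)} from the MOT stage.
        if self.mode is Mode.TRACKING and self.target_id in detections:
            return detections[self.target_id]  # steer toward this box
        return None  # no command: hover / await (re-)selection

    def on_target_lost(self):
        self.mode = Mode.SELECTING
        self.target_id = None
```

The design point is that EEG decoding sits outside the per-frame control loop: frames are consumed at tracker rate, while neural input arrives only at sparse selection events.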
Building upon the validated interaction logic, comparative experiments in real-world paradigms further reveal the superior robustness of the TDCA-V algorithm in non-stationary environments. While traditional spatial filtering methods provide a baseline,
Table 1 reveals a critical temporal dependency in performance. Specifically, advanced variants such as eCCA and eTRCA struggle to maintain high responsiveness under dynamic UAV perspectives. Although eCCA eventually converges to a competitive accuracy of 89.38% at a longer time window of 1.8 s, its performance drops sharply to 74.57% at 1.0 s. This latency lag suggests that purely EEG-based methods require extended integration times to filter out broadband environmental noise, rendering them ill-suited for emergency braking or rapid maneuvering tasks in aerial control.
In contrast, TDCA-V exhibits superior rapid-response capabilities. By effectively decoupling interference through multi-scale feature fusion, it achieves a high accuracy of 85.31% within just 1.0 s, establishing a significant lead of approximately 11% over eCCA. This ‘fast-settling’ characteristic confirms that the visual auxiliary stream provides an immediate, stable reference for the decoder, effectively compensating for the initial instability of EEG signals.
Furthermore, the comparison between standard TDCA and vision-enhanced TDCA-V isolates and confirms the contribution of the proposed improvement strategy. The evaluation incorporates experimental data from all 15 subjects, comprising 10 who underwent simple training and 5 who received no training at all. The results indicate that the improved strategy yields an average net accuracy gain of approximately 5%. Taking the 1.0 s time window as an example, accuracy increased from 80.49% to 85.31%, and, more importantly, the standard deviation was reduced from ±14.56% to ±12.63%. High standard deviations in baseline methods, such as the ±19.77% of eTRCA, indicate strong sensitivity to individual signal quality, implying effectiveness only for ‘high-quality’ subjects. The markedly lower variability of TDCA-V demonstrates its generality, ensuring reliable decoding even for low-response subjects or ‘BCI-illiterate’ users. This combination of speed, precision, and population robustness verifies the necessity of the hybrid dynamic BCI architecture for constructing practical, safety-critical UAV systems.
Complementing the stability of neural decoding is the trade-off between computational power and precision involved in deploying high-precision detection algorithms on the RK3588. As the core of UAV airborne edge computing, this platform integrates an NPU with a peak computing power of 6 TOPS; however, its memory bandwidth and sustained compute delivery remain limited by the power constraints of embedded devices, so it cannot bear the computational load of the full-channel convolutions in the native YOLOv11. Rather than merely shrinking the computational volume, the improved YOLOv11 performs hardware-aware optimization tailored to the architectural characteristics of the RK3588’s NPU by replacing traditional full-channel convolutions with PConv. By selectively activating channel features, PConv reduces invalid computation and redundant memory accesses by more than 40%, which both aligns with the NPU’s parallel computing logic and avoids the bandwidth bottleneck of full-channel convolutions. Consequently, the model’s inference speed rises from 32 FPS for the native YOLOv11 to 59 FPS, meeting the real-time requirements of dynamic UAV tracking. This speedup is not achieved at the expense of precision: after 200 epochs of training, the improved model achieves an mAP@0.5 of 0.985 and an mAP@0.5:0.95 stable around 0.85, with a 100% target-ID maintenance success rate in the selected dynamic scenarios, realizing the synergistic optimization of “speed and precision”.
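The saving behind PConv can be made concrete with a back-of-the-envelope FLOP count: convolving only a fraction r of the channels of a c-to-c layer scales that layer’s multiply-accumulates by roughly r². The layer dimensions and channel ratio below are illustrative assumptions for a single layer, not figures from this study’s network.

```python
def conv_flops(h, w, c_in, c_out, k):
    """Multiply-accumulate count of a dense k x k convolution
    over an h x w feature map (stride 1, same padding)."""
    return h * w * c_in * c_out * k * k

def pconv_flops(h, w, c, k, ratio=0.25):
    """Partial convolution (PConv): only a `ratio` fraction of the
    c channels is convolved; the remaining channels pass through
    untouched, saving both compute and memory traffic."""
    cp = int(c * ratio)
    return h * w * cp * cp * k * k

# Hypothetical 40x40x256 layer with a 3x3 kernel.
full = conv_flops(40, 40, 256, 256, 3)
part = pconv_flops(40, 40, 256, 3, ratio=0.25)
print(f"PConv FLOPs fraction per layer: {part / full:.2%}")
```

At ratio 0.25 a single layer keeps only about 1/16 of the dense-convolution FLOPs; the network-level reduction cited above is smaller because only some layers are replaced and memory access does not scale identically with FLOPs.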
More importantly, the millisecond-level response speed ensures that the end-to-end latency from visual acquisition, target detection, stimulus overlay to control command output is controlled within 50 ms, minimizing the negative impact of “perception-display latency” on SSVEP closed-loop feedback, a prerequisite for generating high-quality evoked EEG signals. The “spatio-temporal synchronization mechanism” at the core of this study requires real-time alignment between SSVEP stimulus sources and target displacements. If the detection latency exceeds 100 ms, the stimulus blocks will produce spatial offsets from physical targets, leading to confusion in the frequency components of EEG signals and directly reducing the decoding accuracy of the TDCA-V algorithm. The low-latency characteristic of the improved YOLOv11 precisely guarantees the tight anchoring between stimulus sources and dynamic targets, providing a stable visual induction environment for neural decoding.
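The spatio-temporal anchoring constraint above reduces to a simple gating rule: the expected on-screen offset of a flicker block is perception latency times target pixel speed, and the overlay remains valid only while that offset stays within a tolerance. The function names and thresholds in this sketch are illustrative, not the system’s actual implementation.

```python
def expected_offset_px(latency_s, speed_px_s):
    """Pixels by which a stimulus block trails its moving target
    when the perception-display pipeline lags by latency_s."""
    return latency_s * speed_px_s

def overlay_ok(latency_s, speed_px_s, max_offset_px):
    """Gate the SSVEP stimulus overlay: draw the flicker block only
    when the predicted offset keeps it anchored on the target;
    otherwise the evoked frequency components would be corrupted."""
    return expected_offset_px(latency_s, speed_px_s) <= max_offset_px
```

For example, a 50 ms pipeline tracking a target moving at 200 px/s yields a 10 px offset, while a 150 ms lag triples that and would violate a 20 px anchoring tolerance.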
Furthermore, to address the typical issues of sparse features, large tilt angles, and uneven aspect ratios of small targets seen from UAV aerial perspectives, the GD-Mechanism and Shape-IoU loss function introduced in the improved YOLOv11 form a dual guarantee of “feature enhancement-regression optimization”: through global feature aggregation and cross-layer distribution, the GD-Mechanism resolves the feature-dilution problem of traditional FPNs in small-target detection, while the Shape-IoU loss factors the shape and scale of the bounding box itself into the regression penalty, improving localization for targets with extreme aspect ratios. Even in scenarios such as vehicle tilt and partial occlusion, the system can still ensure accurate coverage of target centers by the SSVEP stimulus blocks. Notably, these two optimizations add no extra inference burden, and the model’s overall parameter count grows by only about 10%, fully compatible with the resource constraints of the RK3588.
This customized optimization for specific hardware and scenarios has been validated not only in laboratory environments but also through pragmatic HIL simulations. In dynamic tracking tests with the UAV hovering at an altitude of 50 m, the improved YOLOv11, collaborating with the DeepSort algorithm, stably outputs target coordinates and IDs even under real operating conditions such as live traffic flow and background texture interference, supporting the system in achieving an average tracking accuracy of 91.98%. This result confirms that, on the resource-constrained edge side, the stable operation of a complex vision–BCI system can be fully realized through deep synergy between algorithms and hardware. It not only breaks through the traditional edge-computing dilemma of “insufficient precision” or “inadequate speed” but also provides a critical empirical basis for engineering deployment in practical scenarios such as the low-altitude economy and emergency rescue, echoing this study’s core objective of “addressing the constraints of dynamic UAV operations”.
Despite the demonstrated feasibility of the system in hardware-in-the-loop simulations and real-world tests, several limitations remain to be explored in future research. First, the current SSVEP stimulus paradigm still relies on screen graphic overlays, which may suffer from weakened induction intensity in bright outdoor environments due to reduced screen contrast. Future work could explore augmented reality (AR) glasses as a stimulus carrier, utilizing their high-brightness displays and retinal projection technology to enhance outdoor adaptability. Second, while DeepSort addresses short-term occlusion to an extent, the ReID capability after long-term target disappearance remains to be studied. Integrating advanced Transformer-based trajectory prediction models or spatio-temporal memory networks may be effective routes for resolving long-term occlusion and trajectory repair [
45]. Finally, while the current command set focuses on discrete target selection, it could be expanded to high-level tactical UAV instructions, such as using dynamic BCI paradigms to achieve multi-UAV formation switching or continuous adjustment of reconnaissance modes, thereby constructing a more comprehensive brain-controlled UAV swarm ecosystem.