1. Introduction
Driven by the surge in deep learning, the field of unmanned aerial vehicle (UAV) computer vision has experienced a revolutionary shift in architectural design. Object detection algorithms have evolved from the early Region-based Convolutional Neural Network (R-CNN) [
1] series to end-to-end inference frameworks such as You Only Look Once (YOLO) [
2] and the detection transformer (DETR), achieving a superior balance between detection precision and real-time responsiveness. Integrated with mature multi-object tracking (MOT) strategies such as DeepSORT [
3] and ByteTrack [
4], UAV platforms have fully leveraged their mobility advantages in practical applications such as aerial logistics, infrastructure inspection, and search-and-rescue operations [
5]. However, existing control modalities still face severe challenges in practical operations. Traditional remote-control modes impose an excessively high cognitive workload on operators and exhibit limited response efficiency when handling concurrent multi-target switching tasks [
6]. At a deeper level, fully autonomous visual decision-making systems struggle with reliability in typical aerial photography environments. These environments are often disrupted by dynamic backgrounds, occasional target occlusion, and vibrations from the aircraft caused by wind. Additionally, these systems lack an intuitive and efficient way for humans to step in and provide feedback, making it difficult to integrate human intelligence when needed [
7].
Simultaneously, the essence of a brain–computer interface (BCI) is to bypass the traditional peripheral nerve and muscle pathways, creating a direct connection between the human central nervous system and external devices and enabling a seamless flow of information [
8]. By demodulating specific neuro-electrophysiological signals in real-time, this technology has transformed the way patients with motor dysfunction interact in clinical rehabilitation [
9]. It has also proven to be strategically valuable in advanced tasks such as coordinating complex unmanned systems and controlling intelligent equipment [
10]. Among the various BCI paradigms, the steady-state visual evoked potential (SSVEP) has become a core technical path for constructing high-rate human–machine collaborative frameworks, owing to its high information transfer rate (ITR) and minimal user training requirements [
11]. The field of BCI has advanced significantly through the continuous development of feature extraction paradigms, such as canonical correlation analysis (CCA) and filter bank canonical correlation analysis (FBCCA) [
12]. Consequently, the robust recognition accuracy of modern SSVEP-BCIs now provides a reliable foundation for real-time human–machine interaction in complex environments [
13].
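The CCA step referenced above can be made concrete with a short sketch: a multichannel EEG epoch is correlated against sine and cosine reference templates at each candidate stimulus frequency, and the frequency with the largest canonical correlation is selected. This is an illustrative minimal implementation, not the authors' pipeline; the channel count, harmonic number, and sampling rate are assumptions.

```python
import numpy as np

def canonical_corr(X, Y):
    """Largest canonical correlation between the row spaces of X and Y."""
    X = X - X.mean(axis=1, keepdims=True)
    Y = Y - Y.mean(axis=1, keepdims=True)
    Qx, _ = np.linalg.qr(X.T)   # orthonormal basis over samples
    Qy, _ = np.linalg.qr(Y.T)
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return s[0]

def make_reference(freq, fs, n_samples, n_harmonics=3):
    """Sine/cosine reference set at a stimulus frequency and its harmonics."""
    t = np.arange(n_samples) / fs
    refs = []
    for h in range(1, n_harmonics + 1):
        refs.append(np.sin(2 * np.pi * h * freq * t))
        refs.append(np.cos(2 * np.pi * h * freq * t))
    return np.vstack(refs)

def cca_classify(X, freqs, fs):
    """Pick the stimulus frequency whose references correlate best with X."""
    n = X.shape[1]
    rhos = [canonical_corr(X, make_reference(f, fs, n)) for f in freqs]
    return freqs[int(np.argmax(rhos))]
```

In practice, filter bank variants (FBCCA) apply this same correlation over several sub-band filtered copies of the epoch and fuse the scores.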
Despite the technical maturity of both BCI and computer vision (CV), their deep integration into dynamic UAV operational scenarios remains constrained by several bottlenecks. First, mainstream SSVEP stimulus interfaces are largely limited to static layouts, making them difficult to adapt to the rapidly evolving targets and backgrounds observed from a UAV’s perspective [
14]. This spatio-temporal decoupling between stimulus sources and physical targets directly leads to confusion in electroencephalography (EEG) decoding and semantic fragmentation. Second, constrained by the power-to-performance ratio of embedded edge computing platforms such as the RK3588 controller, the high computational demands of native YOLO or transformer models often result in performance degradation, which hinders the real-time execution and on-device deployment of integrated systems [
15]. Additionally, the visual fatigue induced by the SSVEP paradigm becomes especially pronounced during long missions; combined with interference from UAV turbulence, it severely challenges the reliability of EEG feature extraction. Most importantly, in dynamic environments where targets change frequently, traditional touch- or button-based controls struggle to create a smooth connection between intent and action, limiting seamless human–machine collaboration in fast-moving scenarios [
16].
Based on these considerations, exploring the coordination mechanisms between SSVEP-BCI, CV and UAV control while addressing the adaptation issues between intent decoding and target locking in dynamic scenes offers significant theoretical value and pressing engineering opportunities [
17]. To address the complex challenges of dynamic UAV operations, this study proposes a dynamic vision–BCI control framework. By designing a spatio-temporally bound visual stimulus paradigm, the SSVEP stimulus sources move synchronously with the detected visual targets. To overcome edge-side resource limitations, lightweight object detection algorithms and tracking logic have been optimized. Finally, by establishing a closed-loop feedback mechanism of intent demodulation-visual locking-control law mapping, the system enables natural target selection and robust tracking by the user.
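The closed-loop cycle of intent demodulation, visual locking, and control-law mapping can be sketched schematically as follows. All names here (the `Track` structure, the frequency-matching tolerance, the pixel-error output) are hypothetical stand-ins for illustration, not the system's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Track:
    track_id: int
    cx: float    # bounding-box centre, pixels
    cy: float
    freq: float  # SSVEP tag frequency bound to this track

class ClosedLoopController:
    """Schematic intent-demodulation -> visual-locking -> control-mapping loop."""
    def __init__(self, tracks):
        self.tracks = {t.track_id: t for t in tracks}
        self.locked_id = None

    def demodulate_intent(self, decoded_freq):
        """Map a decoded SSVEP frequency back to the physical track it tags."""
        for t in self.tracks.values():
            if abs(t.freq - decoded_freq) < 0.1:
                return t.track_id
        return None

    def step(self, decoded_freq, frame_center=(640, 360)):
        tid = self.demodulate_intent(decoded_freq)
        if tid is not None:
            self.locked_id = tid          # visual locking on the chosen ID
        if self.locked_id is None:
            return (0.0, 0.0)
        t = self.tracks[self.locked_id]   # control-law mapping: centre the target
        return (t.cx - frame_center[0], t.cy - frame_center[1])
```

Because the lock persists on the track ID, an undecodable or spurious frequency in a later window leaves the previously selected target under tracking, matching the closed-loop behaviour described above.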
The remainder of this paper is organized as follows:
Section 2 outlines the technical scheme of the dynamic vision–BCI system, systematically elucidating the dynamic SSVEP encoding strategy, the improved YOLOv11 detection algorithm, and their collaborative logic at the command mapping layer by deconstructing the integrated architecture of the embedded edge platform and gimbal.
Section 3 focuses on experimental validation, analyzing the system’s operational effectiveness in dynamic scenarios and the robustness of online closed-loop control based on offline classification accuracy, hardware-in-the-loop (HIL) simulations and real-world paradigm tests.
Section 4 discusses the adaptive advantages of the hybrid system in UAV missions, examining the trade-offs between algorithmic precision and edge-side real-time performance while identifying current research limitations. Finally,
Section 5 summarizes the core findings and presents a vision for the future development of human–machine collaborative UAV technology.
3. Results and Analysis
3.1. Pilot Study Results and Analysis
The pilot study phase involved a systematic evaluation of SSVEP response characteristics under varying frequency and color configurations. This investigation aimed to establish an optimal parameter baseline for subsequent formal experiments through a comprehensive analysis of classification accuracy, ITR and quantified visual fatigue metrics [
39]. Preliminary results indicate that classification performance and ITR demonstrate strong statistical robustness across diverse color features, providing an empirical foundation for parameter deployment in complex operational scenarios. In the initial testing round, four visual paradigms (white, red, green, and blue) were evaluated across three distinct time windows (1.0 s, 1.4 s and 1.8 s) using the CCA algorithm [
40] to assess the demodulation efficacy of neural signals. The results are shown in
Figure 10 and
Figure 11. The experimental data suggest that the combination of stimulus frequency and chromatic background significantly modulates the temporal and spectral characteristics of SSVEP signals. Furthermore, these factors likely influence the subject’s fatigue levels indirectly by regulating visual attention and the allocation of neural resources. These findings offer theoretical support for the selective regulation mechanisms of SSVEP signals and serve as a valuable reference for optimizing SSVEP-based BCI system designs. Quantitatively, the classification accuracy exhibited a monotonic upward trend as the data length increased, rising from 88.14% to 95.68%. Conversely, the ITR decayed from 91.55 bits/min to 72.61 bits/min, a consequence of the increased temporal cost associated with longer windows. Statistical inference revealed a significant gain in accuracy within the 1.0 s to 1.4 s interval (
p < 0.05); however, the rate of improvement plateaued beyond the 1.4 s mark. Notably, except for Paradigm 1 (white) at the 1.8 s window, no statistically significant differences in accuracy were observed between the various chromatic paradigms at equivalent time steps. This phenomenon implies that stimulus color is not the primary variable governing classification performance.
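The accuracy–ITR trade-off quantified above follows from the standard Wolpaw formula, which converts per-selection accuracy into bits per minute. The sketch below is generic: the exact figures reported in the text additionally depend on the number of targets and on any gaze-shift time counted into the selection period, which are not reproduced here.

```python
import math

def itr_bits_per_min(n_targets, accuracy, selection_time_s):
    """Wolpaw ITR: bits per selection, scaled to bits/min."""
    p, n = accuracy, n_targets
    if p <= 1.0 / n:          # at or below chance: no information transferred
        return 0.0
    bits = math.log2(n) + p * math.log2(p)
    if p < 1.0:
        bits += (1 - p) * math.log2((1 - p) / (n - 1))
    return bits * 60.0 / selection_time_s
```

This makes the observed decay explicit: at a fixed accuracy, lengthening the window from 1.0 s to 1.8 s divides the throughput by 1.8, so accuracy gains must outpace the temporal cost for ITR to rise.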
Subsequently, this research focused on evaluating the adverse effects of dynamic SSVEP paradigms and complex backgrounds on signal decoding. To ensure an objective and comprehensive assessment, we conducted a horizontal benchmark test across seven mainstream algorithms: CCA, Extended Canonical Correlation Analysis (eCCA), Multi-stimulus Extended Canonical Correlation Analysis (ms-eCCA), Ensemble Task-Related Component Analysis (eTRCA), Multi-stimulus Ensemble Task-Related Component Analysis (ms-eTRCA), and Task-Discriminant Component Analysis (TDCA), the last of which serves as an improved benchmark in this study. The seventh is a Hybrid Extended Canonical Correlation Analysis and Ensemble Task-Related Component Analysis (Hybrid-eCCA-eTRCA), hereafter referred to as Combined in subsequent figures and analysis for the sake of conciseness. As illustrated in
Figure 9, these comparative experiments revealed the unique performance advantages of TDCA under the specified complex constraints, providing empirical evidence for its robustness in dynamic environments.
Notably, the algorithm achieved a mean peak accuracy of 94.44% within a 1.2 s time window, effectively suppressing non-stationary noise interference induced by the environmental background. Even under the dynamic evolution paradigm, TDCA demonstrated rapid responsiveness, attaining a recognition rate of 96.67% at the 1.2 s mark. Given the scarcity of training samples in practical applications and the requirement for cross-paradigm compatibility across diverse scenarios, the comprehensive performance exhibited by TDCA renders it the optimal algorithmic choice for underpinning the dynamic hybrid BCI architecture of this study.
3.2. Object Detection Performance in Dynamic UAV Backgrounds
Beyond the precise decoding of human intent, the accurate identification and robust tracking capabilities of the visual system are critical determinants of overall system performance.
Figure 12 illustrates the evolution of the loss functions and performance metrics over a 200-epoch training cycle. From the topological structure of the loss convergence, the model demonstrates exceptional learning efficiency and stability. Specifically, the bounding box regression loss (Box Loss) and classification loss (Cls Loss) exhibit a steep decline within the initial 50 epochs before rapidly transitioning into a stable asymptotic convergence phase. Notably, the validation loss (Val Loss) closely tracks the decline of the training loss without significant rebound or oscillation. This phenomenon provides compelling evidence that the improved model possesses well-defined generalization boundaries within the feature space, effectively mitigating the over-fitting risks commonly associated with deep networks.
Corresponding with the convergence of the loss functions, the evaluation metrics show a synchronized upward trajectory. The mean average precision (mAP@0.5) ascended into the saturation zone (approaching 1.0) within a remarkably short training duration, while the more stringent mAP@0.5:0.95 metric maintained a steady growth slope, eventually stabilizing above 0.85. This synergistic enhancement of precision and recall further confirms that the introduced improvement mechanisms have likely heightened the network’s sensitivity to small-object features. Consequently, the model successfully preserves the low-level spatial details essential for precise localization while retaining high-level semantic information.
To further elucidate the discriminative logic of the model within authentic aerial photography scenarios,
Figure 13 presents detection results across multi-scale perspectives alongside their corresponding Class Activation Mapping (CAM) visualizations. UAV-acquired imagery is frequently characterized by high-frequency background noise, such as intricate pavement textures, architectural shadows, and arboreal occlusions, which imposes rigorous challenges on the detector’s background suppression capabilities.
The improved YOLO model, specifically optimized for the dynamic BCI system, demonstrates a highly targeted spatial attention distribution pattern. Even in scenarios featuring dense vehicle clusters or surrounding distractors, the high-response regions (represented by the red core zones) accurately converge upon the geometric centers of the target vehicles, with activation boundaries aligning precisely with the contours of the physical objects. This focusing effect indicates that the BCI-related modules (or the attention mechanisms inspired by them) effectively guide the convolutional network to suppress feature responses in non-salient regions, thereby achieving “denoising and purification” at the feature map level.
In terms of bounding box confidence, the model demonstrates the ability to accurately capture dense traffic flows without omission or false detection, even in wide-field-of-view nadir perspectives where target pixels constitute a minimal proportion of the frame. This consistent performance across multiple scales and complex backgrounds confirms that the improved YOLO architecture possesses a superior capacity to decouple regions of interest (ROI) from intricate backgrounds during dynamic environmental perception tasks. Such capabilities satisfy the dual requirements of real-time responsiveness and precision essential for UAV-based ground observation missions.
While static detection metrics validate the model’s feature extraction capabilities at the single-frame level, the temporal consistency of target motion remains the decisive factor for ensuring the stable anchoring of SSVEP stimulus sources within a “human-in-the-loop” closed-loop control system. As shown in
Figure 14a, the experimental results intuitively demonstrate the identity preservation stability of the proposed YOLO-BCI + DeepSort framework under significant detection confidence fluctuations. Even when the bounding box confidence scores fluctuate sharply due to the interference of complex backgrounds and small target scales, the system still maintains consistent target ID assignment without ID switching or identity loss. This superior performance benefits from the cascaded matching strategy that fuses appearance features and kinematic motion information, which effectively suppresses the negative impact of transient confidence jitter on tracking continuity.
Figure 14b intuitively illustrates the dynamic tracking sequences following the integration of the DeepSort algorithm, where red vector lines represent the Kalman filter’s real-time posterior estimation of the target’s motion state, including velocity vectors and direction. To avoid ambiguity, we explicitly mark the opposite motion directions in
Figure 14a and clarify that
Figure 14b shows two independent scenes captured at different UAV altitudes (50 m and 100 m), rather than adjacent frames.
From the perspective of sequential frame evolution, UAV aerial photography is typically characterized by the coupled interference of significant ego-motion and non-linear target maneuvers, which imposes rigorous challenges on the tracker’s prediction-update mechanism. As observed in the illustrations, the algorithm effectively utilizes a kinematic prediction model to perform smooth corrections on bounding box coordinates, even in relatively high-speed non-linear scenarios. This state-space-based prediction mechanism not only compensates for transient jitter inherent in visual detection but also ensures that the visual stimulus anchors remain precisely aligned with the centers of physical targets. Notably, due to the deep fusion of appearance features and motion information via the cascaded matching strategy, the system achieved a 100% success rate in maintaining target ID for the video segments utilized in the EEG recognition evaluation of this study.
The identity identifiers of the vehicles (such as ID:11 in the upper sequence and the individual IDs within the dense lower traffic flow) maintained strict uniqueness across continuous frames, with no instances of the “ID Switch” phenomenon common in multi-object tracking. This trajectory-level robustness constitutes the cornerstone of the hybrid system’s reliable operation; it fundamentally prevents abrupt fluctuations in stimulus frequency caused by tracking drift or ID confusion. Consequently, this ensures the mapping constancy between frequency-tagging and physical semantics, providing a solid physical baseline for precise user intent locking in dynamic environments.
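The Kalman prediction–update cycle that underpins this smoothing can be illustrated with a reduced constant-velocity model over the bounding-box centre. This is a simplified sketch: DeepSORT's actual state vector also carries aspect ratio and height, and the noise parameters here are illustrative assumptions.

```python
import numpy as np

class CenterKalman:
    """Constant-velocity Kalman filter over a bounding-box centre (sketch)."""
    def __init__(self, cx, cy, dt=1.0, q=1e-2, r=1.0):
        self.x = np.array([cx, cy, 0.0, 0.0])   # state: [cx, cy, vx, vy]
        self.P = np.eye(4) * 10.0               # state covariance
        self.F = np.eye(4)
        self.F[0, 2] = self.F[1, 3] = dt        # position advances by velocity
        self.H = np.eye(2, 4)                   # we observe position only
        self.Q = np.eye(4) * q                  # process noise
        self.R = np.eye(2) * r                  # measurement noise

    def predict(self):
        """Propagate the state one frame ahead; returns predicted centre."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, zx, zy):
        """Correct the prediction with a detected centre; returns new estimate."""
        z = np.array([zx, zy])
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]
```

Because the posterior blends the kinematic prediction with the raw detection, transient confidence jitter in the detector perturbs the stimulus anchor far less than the raw boxes would.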
3.3. Results and Analysis of the Dynamic Hybrid System
In light of the pilot study’s finding that static paradigms in real-world backgrounds are susceptible to spatio-temporal aliasing from adjacent stimuli, this research specifically optimized the formal experimental protocol. The objective was to provide a rigorous analysis of the recognition robustness of the dynamic SSVEP paradigm within complex operational environments. This validation process serves not only as an efficacy assessment of the proposed algorithmic framework but also establishes critical data benchmarks and theoretical anchors for the subsequent construction of the dynamic vision–BCI system.
This section evaluates the comprehensive performance of the vision–BCI system through quantitative metrics to verify its effectiveness in authentic environments. The experimental cohort comprised 10 subjects selected from an initial pool of 15 (Sub1–Sub15); these individuals underwent fundamental training and exhibited heterogeneous physiological response characteristics, allowing for an investigation into the system’s generalization capabilities under fluctuating Signal-to-Noise Ratio (SNR) conditions. To explore the non-linear relationship between command latency and decoding performance, five sliding time windows (TW) ranging from 1.0 s to 1.8 s (with a 0.2 s increment) were established. A horizontal benchmark test was conducted across eight representative algorithms, including CCA and its variants (eCCA, ms-eCCA) [
41], eTRCA [
42], ms-eTRCA [
43], the Combined method [
44], TDCA and the TDCA-V model proposed in this study. Given the consistency of the target set, recognition accuracy was consistently employed as the primary performance metric. The results are shown in
Figure 15.
Data window length serves as the pivotal trade-off variable between the ITR and the reliability of an SSVEP system, directly constraining the completeness of feature extraction. Statistical results indicate that recognition accuracy exhibits a significant monotonic increasing trend across the entire sample set as the time window extends. Under the constraint of a 1.0 s short-time window, the traditional CCA algorithm demonstrates severe performance oscillations due to the limited frequency resolution caused by sample sparsity. Taking the low-response subject Sub5 as an example, the CCA recognition rate at 1.0 s was a mere 42.59%, failing to meet the threshold requirements for robust control. In sharp contrast, the TDCA series of algorithms demonstrated superior weak-feature enhancement capabilities, elevating Sub5’s accuracy to 79.63% under identical conditions, a gain of approximately 37 percentage points. As the window extends to approximately 1.4 s, overall system performance tends toward convergence, with certain high-performing algorithms rapidly entering a saturation zone. For instance, Sub6 achieved 100% recognition accuracy using TDCA-V within a 1.6 s window, confirming the algorithm’s capacity to overcome dynamic background noise and achieve zero-error closed-loop control given sufficient data support.
To provide an intuitive representation of the aforementioned temporal dynamic characteristics and the robustness variations among algorithms,
Figure 16 illustrates a comprehensive performance comparison at three critical time nodes: 1.0 s, 1.4 s, and 1.8 s. In this visualization, the bar heights represent mean accuracy, while the superimposed error bars quantify the standard deviation, effectively reflecting the stability distribution of the system across different individuals. Observation of the visual data reveals that as the time window contracts, traditional correlation analysis methods, such as CCA and eCCA, not only exhibit a significant attenuation in mean accuracy but also a substantial expansion in error variance, suggesting a high sensitivity to short-term data noise. Conversely, TDCA-V, leveraged by its optimized spatial filtering structure, maintains the most compact error distribution and the highest mean accuracy even at the 1.0 s limit (indicated by the light green bars). This visual evidence strongly corroborates the superiority of the improved algorithm in suppressing non-stationary interference, aligning closely with previous statistical inferences.
A horizontal comparison reveals a clear gradient stratification of algorithmic efficacy. TDCA and its variant, TDCA-V, reside firmly in the top tier due to their significant statistical advantages. Specifically, subject Sub6 achieved a mean accuracy of 97.03% under the TDCA algorithm, which further ascended to 99.26% within the TDCA-V framework. This phenomenon suggests that the TDCA architecture possesses a superior signal separation mechanism for suppressing non-stationary disturbances, such as mechanical vibrations from the UAV and dynamic visual background noise. In contrast, while eTRCA and the Combined methods outperform the baseline CCA, they remain slightly inferior to the TDCA framework in terms of peak performance; for example, Sub3 achieved 88.89% accuracy using eTRCA at the 1.4 s window, whereas TDCA had already converged to 96.30% under identical conditions.
Experimental observations indicate that the multi-stimulus algorithm (ms-eCCA) performed worse than the foundational eCCA for certain low-response subjects (e.g., Sub11). This anomalous finding reflects the spectral complexity of EEG signals in real-world environments, where excessive decomposition may introduce redundant features or non-stationary noise, thereby weakening the classifier’s discriminative boundaries. Furthermore, the experimental data highlight the potential of advanced spatial filtering algorithms to address the challenge of “BCI-illiteracy”. For Sub11, who exhibited weak physiological responses, the baseline CCA algorithm yielded a mean accuracy of only 74.07%, rendering the system unusable. However, the introduction of the TDCA-V algorithm elevated the mean accuracy to 88.89%, reaching a practical benchmark of 96.20% at the 1.8 s window. This advancement proves that spatial filtering mechanisms based on task-relatedness can significantly enhance the saliency of evoked components, effectively broadening the system’s audience applicability.
Focusing on the specific performance of the algorithms, the proposed TDCA-V demonstrates a clear advantage over the standard TDCA, notably in its ability to maintain high precision while reducing performance fluctuations across participants; this improvement is quantified by a roughly 5% net gain in overall accuracy, with subject Sub6 providing a striking example. These gains are not accidental; they stem from the synergistic integration of filter bank decomposition, multi-delay ensembles, and dynamic shrinkage regularization. Rather than merely filtering out noise, this combination allows the system to capture subtle harmonic features and adapt to individual variations in neural latency, all while preventing the overfitting common in small-sample learning. Consequently, the resulting feature distributions are more concentrated, establishing the stable classification boundaries necessary for reliable multi-target discrimination. To analyze the granular performance across multi-classification tasks, this study further constructed a series of confusion matrices for the 1.8 s time window, as shown in
Figure 17a–c.
Figure 17a presents the raw sample counts, where each row sum (
N = 90) represents the total experimental trials for a given stimulus frequency. Building upon this,
Figure 17b and
Figure 17c provide the normalized Precision and Recall matrices, respectively.
The high response values along the diagonal of the Recall matrix (
Figure 17c) confirm the system’s exceptional “catch rate,” demonstrating that the algorithm successfully identifies the vast majority of target signals from the 90 samples provided per class. Simultaneously, the Precision matrix (
Figure 17b) reflects the reliability of the algorithm’s predictions; by calculating the ratio of true positives to the total number of times a frequency was predicted (column sum), it proves that the system effectively minimizes false alarms.
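The row- versus column-normalisation described here can be written out explicitly: given a raw-count confusion matrix (rows: true class, columns: predicted class), precision divides each cell by its column sum (total predictions of that frequency), while recall divides by its row sum (the 90 trials per class). A generic sketch:

```python
import numpy as np

def precision_recall_matrices(conf):
    """Normalise a raw-count confusion matrix into precision and recall matrices."""
    conf = np.asarray(conf, dtype=float)
    col = conf.sum(axis=0, keepdims=True)   # per-predicted-class totals
    row = conf.sum(axis=1, keepdims=True)   # per-true-class totals (trials)
    precision = np.divide(conf, col, out=np.zeros_like(conf), where=col > 0)
    recall = np.divide(conf, row, out=np.zeros_like(conf), where=row > 0)
    return precision, recall
```

The `where` guards simply avoid division by zero for frequencies that were never predicted.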
Recognition accuracy for primary targets generally exceeds 90%, with performance for the 10 Hz and 14 Hz targets being particularly prominent, reaching 96.72% and 98.44%, respectively. The few instances of misclassification were primarily concentrated between the adjacent frequencies of 8 Hz and 9 Hz. This phenomenon suggests that spectral leakage or subject visual fatigue may have compromised frequency demodulation, leading some 8 Hz targets to be misidentified as 9 Hz or vice versa. In summary, the experimental data confirm that the TDCA-V algorithm achieves an optimal configuration in terms of recognition accuracy, response speed, and resilience to individual variability. Within a window setting of 1.2 s to 1.6 s, this algorithm effectively balances the dual requirements of high precision and high ITR, providing robust algorithmic support for real-time control, as shown in
Table 1.
3.4. Hardware-in-the-Loop Simulation Results and Analysis
Due to the complex electromagnetic environment of the designated airspace and the fact that the components utilized in this study are exclusively consumer-grade, the physical safety of real-world flight experiments cannot be guaranteed. Consequently, the experimental evaluation adopts a hardware-in-the-loop (HIL) configuration to ensure both operational safety and data repeatability during EEG acquisition. Specifically, the UAV platform is mounted on a multi-degree-of-freedom mechanical test rig that emulates the motion characteristics of actual flight. As illustrated by the structural schematic in
Figure 18a and the system components in
Figure 18b, although the experiments were not conducted directly in real airspace, the multi-degree-of-freedom mechanical test rig and the actual UAV system effectively simulate environmental noise, such as vibrations, typically encountered in real-world scenarios.
The DSI-24 EEG cap used for data acquisition is shown in
Figure 18c. In place of real-world flight tests, a display screen positioned beneath the platform plays pre-recorded urban traffic videos (captured at an altitude of approximately 50 m) to simulate realistic aerial observation conditions; the overall system integration and testing scenario are depicted in
Figure 18d. During each trial, the vision module detects and tracks candidate vehicles in real-time while synchronously overlaying flickering SSVEP stimulus blocks onto the targets, as shown in the experimental interface in
Figure 18e. Once the subject’s intention is decoded, the gimbal controller locks onto the selected target and continuously adjusts the camera orientation to maintain the target near the center of the field of view.
To systematically validate the target tracking efficacy of the brain-controlled UAV within dynamic scenarios, this study selected the three top-performing subjects from the previous experiments to participate in online closed-loop testing. The experimental environment was configured with the UAV in a fixed-point hovering mode (at an altitude of 50 m), utilizing the relative motion between the UAV’s perspective and ground targets to construct a dynamic testing environment.
At the perception layer, the system loads the pre-trained and enhanced YOLOv11 model to perform real-time detection of dynamic targets within the visual field. A spatio-temporal binding strategy is employed to synchronously overlay SSVEP stimulus blocks onto the target coordinates. The experiment established an upper limit of 9 concurrent dynamic targets within the field of view, with the stimulus frequency configuration kept consistent with the formal experiments described earlier. When the number of visible targets is at or below this limit, the system dynamically maps frequency parameters based on the temporal priority of targets entering the field of view.
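The temporal-priority mapping can be sketched as a small allocator that hands out tag frequencies in target-arrival order and recycles a frequency once its track leaves the scene. The nine-frequency pool below is an assumption chosen for illustration, not the paper's exact frequency set.

```python
class FrequencyAllocator:
    """Assigns SSVEP tag frequencies to tracks by temporal priority (sketch)."""
    def __init__(self, pool=(8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0)):
        self.free = list(pool)   # ordered: earliest-available first
        self.assigned = {}       # track_id -> frequency

    def on_track_enter(self, track_id):
        """Give a newly appearing track the next free frequency, if any."""
        if track_id in self.assigned or not self.free:
            return self.assigned.get(track_id)
        self.assigned[track_id] = self.free.pop(0)
        return self.assigned[track_id]

    def on_track_exit(self, track_id):
        """Recycle the frequency of a track that has left the field of view."""
        f = self.assigned.pop(track_id, None)
        if f is not None:
            self.free.append(f)
```

Keeping the assignment keyed to track IDs for the track's whole lifetime is what preserves the frequency-to-target constancy the paradigm relies on.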
To clarify the experimental task, the subject’s goal during each trial was to select a specific target vehicle on the screen using their gaze, which then triggered the UAV’s camera gimbal to swivel and track it. A single trial sequence was designed as follows: initially, a 2 s visual cue indicated which target vehicle the subject needed to select. Next, a 4 s SSVEP stimulation phase was initiated, during which the bounding boxes of all targets flickered at different frequencies, and the subject visually focused on their designated target. In real-time, the BCI system decoded the EEG signals to identify the user’s intent. This control command was then transmitted to the host computer, which immediately directed the UAV’s gimbal to rotate and lock onto the selected target. Finally, a 2 s rest period was provided before the next trial. This process simulated the target locking and tracking actions of the UAV gimbal with high fidelity in the physical dimension. Upon completion of a trial, the gimbal camera automatically returned to its initial position, and the video sequence was reset in preparation for the subsequent test. Each subject was required to complete a total of 54 continuous trials in the online environment.
Crucially, this experiment introduced an explicit visual closed-loop feedback mechanism: regardless of the correctness of the decoding result, the UAV gimbal executed tracking actions based strictly on the real-time output of the BCI system. Via the real-time video feedback stream, subjects could intuitively assess the current tracking status (i.e., the consistency between the decoding result and the cued target) and subsequently dynamically regulate their own attention levels and psychological states to adapt to subsequent trials. Experimental results (refer to
Table 2) indicate that the three subjects achieved a mean accuracy of 91.98% and a mean ITR of 42.42 bits/min, strongly validating the robustness and efficiency of the proposed system in online human–machine collaborative tasks.
4. Discussion
The dynamic hybrid BCI system developed in this study aims to overcome the robustness bottlenecks of single-modality control in complex UAV operational scenarios. By introducing a “human-in-the-loop” interaction paradigm, the system effectively integrates the flexibility of human cognition with the high efficiency of machine vision.
In dynamic tracking tasks, the improved YOLOv11-DeepSort algorithmic framework provides a stable physical anchor for SSVEP stimulus sources, while the induced SSVEP signals act as a high-level semantic filter. Compared to fully autonomous tracking systems that rely solely on computer vision, this collaborative mechanism excels in scenarios involving target occlusion or semantic ambiguity. When fully autonomous detection algorithms suffer from bounding box jitter due to feature overlap, the operator’s gaze intent can lock onto a specific ID, forcing the control law to maintain tracking of the selected target. This suppresses, at the decision-making level, control divergence caused by visual false positives.
A critical consideration in the design of this interface is the choice between neural and ocular input modalities. While eye-tracking systems offer strengths in low-latency spatial localization, this study deliberately utilizes an EEG-based BCI to resolve the “Midas Touch” problem—the inherent difficulty in distinguishing between a user’s spontaneous environmental scanning and a deliberate intent to select a target. In complex UAV missions, operators must frequently shift their gaze to maintain situational awareness. Traditional eye-tracking often leads to false triggers during these rapid gaze transitions. In contrast, the SSVEP paradigm requires a stable neural resonance at a specific frequency, which effectively filters out non-intentional gaze shifts and establishes a robust “third control channel” when the operator’s hands are occupied by primary flight maneuvers.
Furthermore, our framework implements an optimized human–computer interaction (HCI) strategy of “Short-term Selection, Long-term Tracking” to mitigate visual fatigue. Unlike eye-tracking systems that often require continuous gaze maintenance, the operator here only needs to provide a brief “cognitive trigger” within a short time window (typically 1.0 s to 1.4 s) to lock onto a target. Once the intent is demodulated, the system hands the task over to the improved YOLOv11 and DeepSort algorithms for autonomous, continuous tracking. This intermittent control strategy is far less demanding than continuous gaze-based control, achieving a clean decoupling of “cognitive intent” from “physical execution”. It not only sidesteps the inherently limited ITR of BCI systems but also markedly reduces the operator’s cognitive load and fatigue accumulation during long-endurance missions while preserving task success rates.
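The “Short-term Selection, Long-term Tracking” hand-off can be illustrated with a minimal state machine. The class and method names below are hypothetical, chosen for the sketch rather than taken from the system’s actual controller.

```python
from enum import Enum, auto

class Mode(Enum):
    SELECTING = auto()   # awaiting a brief SSVEP "cognitive trigger"
    TRACKING = auto()    # vision tracker follows the locked ID

class IntermittentController:
    """Toy sketch of intermittent control: one demodulated intent
    locks a track ID, after which the autonomous tracker carries the
    task until the target is lost or the operator re-selects."""

    def __init__(self):
        self.mode = Mode.SELECTING
        self.target_id = None

    def on_ssvep_decode(self, target_id):
        # A single decoded selection (~1.0-1.4 s window) suffices;
        # no continuous gaze maintenance is required afterwards.
        self.target_id = target_id
        self.mode = Mode.TRACKING

    def on_frame(self, detections):
        # detections: {track_id: (x, y, w, h)} from the MOT stage.
        if self.mode is Mode.TRACKING and self.target_id in detections:
            return detections[self.target_id]  # steer toward this box
        return None  # no command: hover / await (re-)selection

    def on_target_lost(self):
        self.mode = Mode.SELECTING
        self.target_id = None
```

The design point is that EEG decoding sits outside the per-frame control loop: frames are consumed at tracker rate, while neural input arrives only at sparse selection events.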
Building upon the validated interaction logic, comparative experiments in real-world paradigms further reveal the superior robustness of the TDCA-V algorithm in non-stationary environments. While traditional spatial filtering methods provide a baseline,
Table 1 reveals a critical temporal dependency in performance. Specifically, advanced variants such as eCCA and eTRCA struggle to maintain high responsiveness under dynamic UAV perspectives. Although eCCA eventually converges to a competitive accuracy of 89.38% at a longer time window of 1.8 s, its performance drops sharply to 74.57% at 1.0 s. This latency lag suggests that purely EEG-based methods require extended integration times to filter out broadband environmental noise, rendering them ill-suited for emergency braking or rapid maneuvering tasks in aerial control.
In contrast, TDCA-V exhibits superior rapid-response capabilities. By effectively decoupling interference through multi-scale feature fusion, it achieves a high accuracy of 85.31% within just 1.0 s, establishing a significant lead of approximately 11% over eCCA. This ‘fast-settling’ characteristic confirms that the visual auxiliary stream provides an immediate, stable reference for the decoder, effectively compensating for the initial instability of EEG signals.
Furthermore, the comparison between standard TDCA and vision-enhanced TDCA-V isolates and confirms the contribution of the proposed improvement strategy. The evaluation incorporates experimental data from all 15 subjects, comprising 10 who underwent simple training and 5 who received no training at all. The results indicate that the improved strategy yields an average net accuracy gain of approximately 5%. Taking the 1.0 s time window as an example, accuracy increased from 80.49% to 85.31%, and, more importantly, the standard deviation was reduced from ±14.56% to ±12.63%. High standard deviations in baseline methods, such as the ±19.77% of eTRCA, indicate strong sensitivity to individual signal quality, implying effectiveness only for ‘high-quality’ subjects. The markedly lower variability of TDCA-V demonstrates its generality, ensuring reliable decoding even for low-response subjects or ‘BCI-illiterate’ users. This combination of speed, precision, and population robustness verifies the necessity of the hybrid dynamic BCI architecture for constructing practical, safety-critical UAV systems.
Complementing the stability of neural decoding is the trade-off between computational power and precision involved in deploying high-precision detection algorithms on the RK3588. As the core of UAV airborne edge computing, this platform integrates an NPU with a peak computing power of 6 TOPS; however, its memory bandwidth and sustained compute delivery remain limited by the power constraints of embedded devices, so it cannot bear the computational load of the full-channel convolutions in the native YOLOv11. Rather than merely shrinking the computational volume, the improved YOLOv11 performs hardware-aware optimization tailored to the architectural characteristics of the RK3588’s NPU by replacing traditional full-channel convolutions with PConv. By selectively activating channel features, PConv reduces invalid computation and redundant memory accesses by more than 40%, which both aligns with the NPU’s parallel computing logic and avoids the bandwidth bottleneck of full-channel convolutions. Consequently, the model’s inference speed rises from 32 FPS for the native YOLOv11 to 59 FPS, meeting the real-time requirements of dynamic UAV tracking. This speedup is not achieved at the expense of precision: after 200 epochs of training, the improved model achieves an mAP@0.5 of 0.985 and an mAP@0.5:0.95 stable around 0.85, with a 100% target-ID maintenance success rate in the selected dynamic scenarios, realizing the synergistic optimization of “speed and precision”.
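The saving behind PConv can be made concrete with a back-of-the-envelope FLOP count: convolving only a fraction r of the channels of a c-to-c layer scales that layer’s multiply-accumulates by roughly r². The layer dimensions and channel ratio below are illustrative assumptions for a single layer, not figures from this study’s network.

```python
def conv_flops(h, w, c_in, c_out, k):
    """Multiply-accumulate count of a dense k x k convolution
    over an h x w feature map (stride 1, same padding)."""
    return h * w * c_in * c_out * k * k

def pconv_flops(h, w, c, k, ratio=0.25):
    """Partial convolution (PConv): only a `ratio` fraction of the
    c channels is convolved; the remaining channels pass through
    untouched, saving both compute and memory traffic."""
    cp = int(c * ratio)
    return h * w * cp * cp * k * k

# Hypothetical 40x40x256 layer with a 3x3 kernel.
full = conv_flops(40, 40, 256, 256, 3)
part = pconv_flops(40, 40, 256, 3, ratio=0.25)
print(f"PConv FLOPs fraction per layer: {part / full:.2%}")
```

At ratio 0.25 a single layer keeps only about 1/16 of the dense-convolution FLOPs; the network-level reduction cited above is smaller because only some layers are replaced and memory access does not scale identically with FLOPs.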
More importantly, the millisecond-level response speed ensures that the end-to-end latency from visual acquisition, target detection, stimulus overlay to control command output is controlled within 50 ms, minimizing the negative impact of “perception-display latency” on SSVEP closed-loop feedback, a prerequisite for generating high-quality evoked EEG signals. The “spatio-temporal synchronization mechanism” at the core of this study requires real-time alignment between SSVEP stimulus sources and target displacements. If the detection latency exceeds 100 ms, the stimulus blocks will produce spatial offsets from physical targets, leading to confusion in the frequency components of EEG signals and directly reducing the decoding accuracy of the TDCA-V algorithm. The low-latency characteristic of the improved YOLOv11 precisely guarantees the tight anchoring between stimulus sources and dynamic targets, providing a stable visual induction environment for neural decoding.
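The spatio-temporal anchoring constraint above reduces to a simple gating rule: the expected on-screen offset of a flicker block is perception latency times target pixel speed, and the overlay remains valid only while that offset stays within a tolerance. The function names and thresholds in this sketch are illustrative, not the system’s actual implementation.

```python
def expected_offset_px(latency_s, speed_px_s):
    """Pixels by which a stimulus block trails its moving target
    when the perception-display pipeline lags by latency_s."""
    return latency_s * speed_px_s

def overlay_ok(latency_s, speed_px_s, max_offset_px):
    """Gate the SSVEP stimulus overlay: draw the flicker block only
    when the predicted offset keeps it anchored on the target;
    otherwise the evoked frequency components would be corrupted."""
    return expected_offset_px(latency_s, speed_px_s) <= max_offset_px
```

For example, a 50 ms pipeline tracking a target moving at 200 px/s yields a 10 px offset, while a 150 ms lag triples that and would violate a 20 px anchoring tolerance.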
Furthermore, to address the typical issues of sparse features, large tilt angles, and uneven aspect ratios of small targets seen from UAV aerial perspectives, the GD-Mechanism and Shape-IoU loss function introduced in the improved YOLOv11 form a dual guarantee of “feature enhancement-regression optimization”: through global feature aggregation and cross-layer distribution, the GD-Mechanism resolves the feature-dilution problem of traditional FPNs in small-target detection, while the Shape-IoU loss factors the shape and scale of the bounding box itself into the regression penalty, improving localization for targets with extreme aspect ratios. Even in scenarios such as vehicle tilt and partial occlusion, the system can still ensure accurate coverage of target centers by the SSVEP stimulus blocks. Notably, these two optimizations add no extra inference burden, and the model’s overall parameter count grows by only about 10%, fully compatible with the resource constraints of the RK3588.
This customized optimization for specific hardware and scenarios has been validated not only in laboratory environments but also through pragmatic HIL simulations. In dynamic tracking tests with the UAV hovering at an altitude of 50 m, the improved YOLOv11, collaborating with the DeepSort algorithm, stably outputs target coordinates and IDs even under real operating conditions such as live traffic flow and background texture interference, supporting the system in achieving an average tracking accuracy of 91.98%. This result confirms that, on the resource-constrained edge side, the stable operation of a complex vision–BCI system can be fully realized through deep synergy between algorithms and hardware. It not only breaks through the traditional edge-computing dilemma of “insufficient precision” or “inadequate speed” but also provides a critical empirical basis for engineering deployment in practical scenarios such as the low-altitude economy and emergency rescue, echoing this study’s core objective of “addressing the constraints of dynamic UAV operations”.
Despite the demonstrated feasibility of the system in hardware-in-the-loop simulations and real-world tests, several limitations remain to be explored in future research. First, the current SSVEP stimulus paradigm still relies on screen graphic overlays, which may suffer from weakened induction intensity in bright outdoor environments due to reduced screen contrast. Future work could explore augmented reality (AR) glasses as a stimulus carrier, utilizing their high-brightness displays and retinal projection technology to enhance outdoor adaptability. Second, while DeepSort addresses short-term occlusion to an extent, the ReID capability after long-term target disappearance remains to be studied. Integrating advanced Transformer-based trajectory prediction models or spatio-temporal memory networks may be effective routes for resolving long-term occlusion and trajectory repair [
45]. Finally, while the current command set focuses on discrete target selection, it could be expanded to high-level tactical UAV instructions, such as using dynamic BCI paradigms to achieve multi-UAV formation switching or continuous adjustment of reconnaissance modes, thereby constructing a more comprehensive brain-controlled UAV swarm ecosystem.