1. Introduction
UAVs are now widely used across many areas, including smart agriculture [1], remote sensing [2], emergency response [3], logistics [4], and infrastructure inspection [5]. However, the rapid increase in low-altitude UAV operations has brought new challenges for airspace safety. Unauthorized or unknown drones may interfere with navigation or communication systems and raise potential security concerns for critical infrastructure and national defense [6,7]. Therefore, reliable identification of UAV communication signals and effective detection of unknown or suspicious UAVs are essential for maintaining situational awareness in complex electromagnetic environments.
At present, UAV identification is mainly based on acoustic, visual, radar, and radio-frequency (RF) signals. Early works relied on hand-crafted features combined with traditional machine learning models. For example, Muhammad et al. [8] used Mel-frequency cepstral coefficients from UAV acoustic signals and trained support vector machines (SVMs) for recognition. Chu et al. [9] applied histograms of oriented gradients for vision-based detection, while Zhang et al. [10] used the short-time Fourier transform and principal component analysis for radar-based classification. For RF-based UAV recognition, Nie et al. [11] combined fractal dimension and dual-spectrum features to extract RF fingerprints. Methods based on hand-crafted features can achieve reasonable performance in relatively controlled experimental conditions. However, their effectiveness depends heavily on the quality of feature design, and their performance often drops noticeably at low signal-to-noise ratios (SNRs), under channel distortions, or in dynamically changing environments.
Compared with other sensing modalities, RF signals offer several practical advantages. They directly reflect the communication behavior and operational status of UAVs and can be collected passively without strict line-of-sight requirements. This makes them suitable for continuous monitoring and anti-UAV applications. In recent years, deep learning has greatly improved UAV signal recognition, since neural networks can learn hierarchical and discriminative features directly from raw data. In related research fields, studies have explored a variety of network architectures [12,13,14,15] and learning strategies, such as few-shot learning [16,17], semi-supervised learning [18,19], self-supervised learning [20,21], and transfer learning [22,23], to improve model performance in difficult conditions. Such strategies have gradually been introduced into RF signal processing and UAV identification tasks. For instance, Akter et al. [24] used convolutional neural networks (CNNs) with angle-of-arrival features for UAV recognition; Domenico et al. [25] proposed a real-time RF identification framework; and Cai et al. [26] built a lightweight multi-scale CNN for efficient UAV RF fingerprinting.
Since RF signals naturally contain information in both the time and frequency domains, effective representation learning usually requires jointly modeling these two aspects. Classical time-frequency analysis methods, such as the short-time Fourier transform (STFT) [27] and the continuous wavelet transform (CWT) [7], have been widely used to generate two-dimensional representations for CNN-based classifiers. Ozturk et al. [28] applied spectrogram-based CNNs for UAV detection and achieved robust performance across different drone platforms. Zhang et al. [29] proposed a multi-channel physical feature convolution and tri-branch fusion network for automatic modulation recognition. More recently, Dong et al. [30] introduced a second-order synchrosqueezing transform with attention mechanisms for enhanced analysis of non-stationary signals. In other domains, feature modulation techniques have shown strong capability in integrating heterogeneous information. De Vries et al. [31] proposed conditional batch normalization for language-guided visual processing, and Perez et al. [32] introduced FiLM layers that use one modality to modulate features from another. These works suggest that conditioning temporal features on global spectral information may benefit RF signal recognition, although such mechanisms have not been systematically explored for UAV RF signals.
Another line of research focuses on network architectures that can capture both local and global patterns. Conventional CNNs are effective for local feature extraction but are limited by their receptive fields [33]. Transformer-based models [34] address this issue by using self-attention to model long-range dependencies. To combine the advantages of both, hybrid convolution-attention architectures have been proposed. Wu et al. [35] introduced convolutional operations into Vision Transformers, while Dai et al. [36] developed CoAtNet by integrating depthwise convolutions with attention layers. In speech processing, the Conformer architecture [37] alternates between convolution and attention modules and has achieved strong results. For RF signals, Xu et al. [38] designed parallel complex convolution and attention branches for modulation classification; Huynh-The et al. [39] built a lightweight multi-scale convolutional network for UAV fingerprinting; and Dhakal et al. [40] explored frameworks combining physical-layer fingerprints with deep attention mechanisms.
However, in most existing hybrid architectures, time-domain and frequency-domain information are usually processed separately or combined through simple feature fusion. Although such designs allow the network to access information from both domains, they do not explicitly consider how frequency-domain characteristics influence temporal signal patterns. In UAV RF signals, frequency-domain features often reflect the overall transmission state of the signal, while temporal features describe local waveform variations. When the two domains are treated independently, the interaction between global spectral characteristics and local temporal structures may not be fully captured. Instead of directly concatenating features from different domains, we adopt a modulation-based strategy, in which frequency-domain information is used to adjust temporal representations. This design provides a simple way to introduce cross-domain interaction while preserving the original temporal structure, and avoids significantly increasing model complexity.
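The modulation idea described above can be illustrated with a small FiLM-style sketch. It is not the network used in this paper: the choice of spectral statistics (`spectral_stats` returns a normalized spectral centroid and spread), the affine form `gamma * feature + beta`, and all function and parameter names are assumptions made for illustration only.

```python
import cmath
import math


def spectral_stats(x):
    """Normalized centroid and spread of the magnitude spectrum (naive DFT)."""
    n = len(x)
    half = n // 2 + 1  # non-negative frequency bins of a real signal
    mags = []
    for k in range(half):
        s = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(s))
    total = sum(mags) or 1.0
    centroid = sum(k * m for k, m in enumerate(mags)) / total
    spread = math.sqrt(sum((k - centroid) ** 2 * m for k, m in enumerate(mags)) / total)
    return centroid / half, spread / half


def film_modulate(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Scale and shift each temporal channel using the spectral condition vector.

    features: list of channels, each a list of time samples.
    cond:     global spectral statistics (hypothetical choice, see above).
    The scale is parameterized around 1 so zero weights leave features unchanged.
    """
    out = []
    for c, channel in enumerate(features):
        gamma = 1.0 + sum(w * s for w, s in zip(w_gamma[c], cond)) + b_gamma[c]
        beta = sum(w * s for w, s in zip(w_beta[c], cond)) + b_beta[c]
        out.append([gamma * v + beta for v in channel])
    return out


# Usage: a pure tone at DFT bin 4 conditions two temporal feature channels.
n = 32
tone = [math.cos(2 * math.pi * 4 * t / n) for t in range(n)]
cond = spectral_stats(tone)
features = [tone[:], [2.0 * v for v in tone]]
modulated = film_modulate(
    features, cond,
    w_gamma=[[0.5, 0.0], [0.0, 0.5]], b_gamma=[0.0, 0.0],
    w_beta=[[0.1, 0.0], [0.0, 0.1]], b_beta=[0.0, 0.0],
)
```

Because the modulation only rescales and shifts each channel, the time axis of the features is left intact, which is the property the text refers to as preserving the original temporal structure.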
Beyond feature extraction, the recognition paradigm itself also requires reconsideration. Most existing UAV signal recognition methods assume a closed-set scenario, in which all test classes are seen during training. In real UAV communication environments, this assumption rarely holds, as new modulation types and private protocols often appear. Traditional Softmax-based classifiers [41] must classify each input into a single known class and cannot recognize or reject new signals. To solve this problem, open-set recognition (OSR) has been proposed, aiming to correctly classify known samples while detecting and rejecting unknown ones.
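The forced-choice behavior of a Softmax classifier can be seen in a few lines. The sketch below contrasts plain arg-max prediction with a generic confidence-threshold rejection rule; this thresholding is a common baseline used here only to illustrate the closed-set versus open-set distinction, not the rejection mechanism proposed in this paper, and the `min_confidence` value is an arbitrary assumption.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def closed_set_predict(logits):
    """Always returns one of the known class indices, however flat the logits are."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


def open_set_predict(logits, min_confidence=0.6):
    """Returns a known class index, or -1 (unknown) when confidence is too low."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= min_confidence else -1


# Nearly flat logits, as might arise from a signal unlike any training class.
ambiguous = [0.10, 0.20, 0.15]
forced = closed_set_predict(ambiguous)   # still picks a known class
rejected = open_set_predict(ambiguous)   # flags the input as unknown
```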
Figure 1 shows the basic concept of OSR. In a closed-set scenario, the model performs well when all test samples come from known classes. However, when unknown samples appear, closed-set models tend to misclassify them as one of the known classes. In contrast, an open-set model can not only correctly recognize known samples but also reject unknown ones. The concept of OSR was first introduced by Scheirer et al. [42]. Early OSR methods were based on traditional classifiers such as SVMs [43], sparse representation [44], and k-nearest neighbors [45]. Later, Bendale et al. [46] proposed OpenMax, extending OSR to deep neural networks. After that, researchers extended OSR by using counterfactual samples [47], generative adversarial networks [48,49], and reciprocal point learning [50,51]. Geng et al. [52] developed an OSR method based on a hierarchical Dirichlet process, and Wang et al. [53] employed energy modeling to construct high-energy regions for unknown detection.
Generally, OSR methods can be roughly divided into generative and discriminative approaches. Generative methods attempt to model the data distribution of known classes and synthesize unknown samples, but they often struggle to generate realistic and diverse data, especially in complex signal environments. Discriminative methods, on the other hand, focus on learning compact and separable feature spaces. Although such methods usually achieve better classification performance, their robustness can still be limited under practical conditions, such as noise contamination, channel drift, and varying RF environments. Moreover, many existing approaches rely on a single latent space and fixed rejection thresholds, which makes it difficult to simultaneously achieve strong feature separability, stable decision boundaries, and adaptive unknown rejection. Although OSR has been applied to other fields such as bias detection [54] and pathogen identification [55], it is still rarely used for UAV RF signals. This task is challenging because UAV signals often share overlapping frequency bands, exhibit subtle inter-class differences caused by hardware imperfections, and show strong non-stationary motion features, which make detecting unseen signals even harder.
To deal with these challenges, this paper proposes the GE-OSR method for UAV signal classification. Unlike previous works, GE-OSR integrates geometric embedding learning with energy-based modeling within a unified framework. This design enables the model to learn both a structured feature space and a discriminative energy distribution, leading to better recognition of known samples and more reliable rejection of unknown ones. The main contributions of this paper are summarized as follows:
(1) A time-frequency convolutional hybrid network for UAV signal representation. Considering the complex and non-stationary characteristics of UAV communication signals, a time-frequency convolutional hybrid network is designed to jointly exploit temporal and spectral information, providing stable and representative signal features from raw UAV data.
(2) A geometric embedding mechanism to enhance feature separability. To obtain a more compact and discriminative feature space, a geometric embedding mechanism with learnable class embeddings and a dual-constraint loss is introduced, which improves intra-class compactness and inter-class separability.
(3) An energy-based regularization strategy for learning discriminative energy distributions. To address the difficulty of distinguishing known and unknown samples in open-set scenarios, an energy-based regularization strategy is adopted, consisting of an explicit energy formulation and its regularization term, which together form a more discriminative energy landscape.
(4) An adaptive energy threshold for open-set rejection. Instead of relying on a fixed threshold, an adaptive energy thresholding mechanism is introduced that uses the empirical energy distribution of known classes to achieve more reliable rejection of unknown signals.
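As a rough illustration of the last contribution, the sketch below derives a rejection threshold from the empirical energies of known-class samples by taking an upper quantile. The quantile rule, the value `q=0.95`, and the function names are assumptions chosen for illustration; the section does not specify how the threshold is computed from the distribution.

```python
import math


def quantile_threshold(known_energies, q=0.95):
    """Smallest observed energy with at least a fraction q of known samples at or below it."""
    ordered = sorted(known_energies)
    idx = int(math.ceil(q * len(ordered))) - 1
    idx = min(max(idx, 0), len(ordered) - 1)
    return ordered[idx]


def is_unknown(energy, threshold):
    """Reject a sample when its energy exceeds the data-derived threshold."""
    return energy > threshold


# Hypothetical energies of known-class validation samples.
known = [float(i) for i in range(1, 21)]
threshold = quantile_threshold(known, q=0.95)
flags = [is_unknown(e, threshold) for e in (3.0, 25.0)]
```

Because the threshold is read off the observed energies rather than fixed in advance, it moves with the known-class distribution whenever that distribution is re-estimated.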
4. Discussion
The effectiveness of the proposed GE-OSR framework mainly arises from the joint modeling of feature geometry and energy distribution. From the geometric perspective, the DCEL module improves the compactness of feature distribution by pulling samples of the same class closer to their embeddings, so features from the same UAV become more clustered in the embedding space. At the same time, the inter-class constraint enlarges the angular separation between different class embeddings, which helps reduce overlap between classes and improves discrimination, especially under noisy conditions. From the energy perspective, the FEAL module keeps the energy of known samples within a stable, low range, while unknown samples are pushed to higher-energy areas. Samples that are far from all known class embeddings naturally produce higher energy values and are therefore easier to identify as unknown. When these two parts work together, they can build a clear, stable boundary in the geometry-energy space, which helps the model distinguish between known and unseen UAV signal classes more accurately and robustly.
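The interplay between geometry and energy described above can be made concrete with a toy example. The exact DCEL and FEAL formulations are not given in this section, so everything below is an assumed stand-in: cosine similarity to class embeddings, an intra-class term `1 - cos`, an inter-class hinge on pairwise embedding cosines, and a free-energy score computed as the negative log-sum-exp of scaled similarities.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / norm for a in v]


def cosine(u, v):
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def intra_class_loss(feature, own_embedding):
    """Small when the feature points in the direction of its class embedding."""
    return 1.0 - cosine(feature, own_embedding)


def inter_class_loss(embeddings, max_cosine=0.0):
    """Penalizes pairs of class embeddings whose cosine exceeds max_cosine."""
    penalty, pairs = 0.0, 0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            penalty += max(0.0, cosine(embeddings[i], embeddings[j]) - max_cosine)
            pairs += 1
    return penalty / pairs if pairs else 0.0


def energy(feature, embeddings, scale=10.0):
    """Free energy: negative log-sum-exp of scaled similarities to all classes."""
    logits = [scale * cosine(feature, e) for e in embeddings]
    m = max(logits)
    return -(m + math.log(sum(math.exp(z - m) for z in logits)))


# Two hypothetical class embeddings, one known-like feature, one outlier.
embeddings = [[1.0, 0.0], [0.0, 1.0]]
known_feature = [0.9, 0.1]
unknown_feature = [-1.0, -1.0]
e_known = energy(known_feature, embeddings)
e_unknown = energy(unknown_feature, embeddings)
```

In this toy setting the outlier has low similarity to every class embedding and therefore receives a higher energy than the known-like feature, which mirrors the qualitative behavior described in the paragraph.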
At the feature representation level, the FCTM module and the CTBlocks work together to improve feature quality. The role of FCTM is to bring global frequency information into temporal feature learning. Specifically, spectral statistics extracted from the input signal are used to modulate time-domain features, so that temporal representations are adjusted according to the overall frequency characteristics of the signal. This process allows the network to adapt its temporal features based on different spectral patterns, rather than treating all signals in the same way. Building on these frequency-aware temporal features, the CTBlocks further model temporal information at different scales. The convolutional branch mainly captures local temporal patterns, such as short-term variations and fine-grained structures, while the self-attention branch focuses on long-range temporal dependencies and global context. By processing these two types of information in parallel, the network can exploit both local and global temporal characteristics without introducing excessive model complexity. Thanks to this design, the proposed model maintains strong recognition performance even under low SNR conditions. At the same time, the model has only about 0.057 million parameters, making it very lightweight and suitable for real-time UAV detection systems, including portable or edge devices often used in smart cities, emergency response, and perimeter surveillance.
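A minimal sketch of the parallel local and global processing is given below. It is only an illustration under stated assumptions: the fixed smoothing kernel, the single-head attention with identity projections, and the residual sum used to merge the two branches are all hypothetical choices, since this section does not describe the internal layout of the CTBlocks.

```python
import math


def conv_branch(seq, kernel=(0.25, 0.5, 0.25)):
    """Depthwise 1-D convolution along time with edge replication (local patterns)."""
    half = len(kernel) // 2
    steps, dim = len(seq), len(seq[0])
    out = []
    for t in range(steps):
        row = [0.0] * dim
        for k, w in enumerate(kernel):
            src = min(max(t + k - half, 0), steps - 1)
            for c in range(dim):
                row[c] += w * seq[src][c]
        out.append(row)
    return out


def attention_branch(seq):
    """Single-head scaled dot-product self-attention with identity projections."""
    dim = len(seq[0])
    out = []
    for query in seq:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim) for key in seq]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, seq)) / total for c in range(dim)])
    return out


def hybrid_block(seq):
    """Runs both branches on the same input and merges them with a residual sum."""
    local = conv_branch(seq)
    context = attention_branch(seq)
    return [
        [x + l + g for x, l, g in zip(x_row, l_row, g_row)]
        for x_row, l_row, g_row in zip(seq, local, context)
    ]


# Usage: five time steps with two feature channels each.
sequence = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [1.0, 1.0], [0.0, 0.0]]
mixed = hybrid_block(sequence)
```

Each output step thus combines a neighborhood-limited view from the convolution with a view over the whole sequence from attention, while keeping the input shape unchanged.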
However, there are still some limitations. The EMA-based threshold can be slow to react when channel conditions change quickly. The model may also fail to identify signals that are completely different from all known categories. In addition, the inference latency still needs to be reduced to improve the system's performance in real-time tasks. In the future, we plan to improve the threshold method, enhance generalization to unseen signals, and further simplify the model architecture to meet the strict real-time and operational requirements of practical UAV monitoring applications.
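The lag of an exponential-moving-average (EMA) estimate can be shown with a short numerical example. The update rule below is the standard EMA recursion, and the smoothing factors and the step change in energy level are hypothetical values; the actual update used by the threshold is not specified in this section.

```python
def ema_track(values, alpha, start=0.0):
    """Standard EMA recursion: ema <- (1 - alpha) * ema + alpha * value."""
    ema = start
    history = []
    for v in values:
        ema = (1.0 - alpha) * ema + alpha * v
        history.append(ema)
    return history


# The typical known-sample energy jumps from 0.0 to 2.0, e.g. after a channel change.
step_change = [2.0] * 5
slow = ema_track(step_change, alpha=0.1)  # heavy smoothing, slow to follow
fast = ema_track(step_change, alpha=0.5)  # light smoothing, quick to follow
```

After n updates following a step from 0 to a constant level v, the estimate equals v(1 - (1 - alpha)^n), so a small alpha leaves the estimate well short of the new level for many batches. A larger alpha reacts faster but also passes more batch-to-batch noise into the threshold, which is the trade-off behind this limitation.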
Overall, GE-OSR demonstrates high accuracy, clear interpretability, and strong generalization ability. The model maintains stable open-set recognition performance at low SNRs and with a large number of unknown signals, conditions under which many existing methods degrade. These results indicate that the joint constraint of geometry and energy is an effective strategy for managing complex electromagnetic signal environments. The combination of geometry and energy not only supports robust UAV signal recognition but also provides a promising approach for other intelligent perception tasks in open and dynamic environments, such as anti-drone surveillance, autonomous UAV navigation, and urban airspace management.