MD-Net: A Lightweight Dual-Branch Network with Adaptive Time-Frequency Masking for Robust UAV RF Signal Classification

Huang, Min; Dou, Leihan; Sun, Qiuhong

doi:10.3390/info17060562

by Min Huang, Leihan Dou and Qiuhong Sun *

School of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang 050018, China

* Author to whom correspondence should be addressed.

Information 2026, 17(6), 562; https://doi.org/10.3390/info17060562 (registering DOI)

Submission received: 20 April 2026 / Revised: 1 June 2026 / Accepted: 2 June 2026 / Published: 5 June 2026

(This article belongs to the Section Information and Communications Technology)

Abstract

In the multi-class recognition task for unmanned aerial vehicle (UAV) radio frequency (RF) signals, noise interference and complex backgrounds can significantly degrade recognition performance. The stability and accuracy of existing recognition methods often fail to meet the requirements of practical applications. To enhance the stability and accuracy of UAV RF signal recognition, especially to mitigate performance degradation in complex backgrounds, a UAV RF signal classification method, MD-Net, is proposed that integrates Adaptive Time-Frequency Masking and a dual-network architecture. First, an Adaptive Time-Frequency Masking mechanism is constructed. By analyzing the energy distribution of RF signals in the time-frequency domain, the masking region is automatically determined, ensuring that the training data maintains a diverse distribution across different interference scenarios. This significantly improves the model’s anti-interference performance and discriminative stability in complex environments. Subsequently, a dual-branch recognition network architecture is designed, integrating a multi-layer perceptron (MLP) and a long short-term memory (LSTM) network. The MLP extracts static amplitude features from the signals, while the LSTM learns time-series features. These two feature types are then fused to achieve complementary characteristics, ultimately enabling accurate classification of UAV RF signals. Extensive comparative experiments conducted on the DroneRF dataset demonstrate that the MD-Net model achieves an average recognition accuracy of 85.58%, an improvement of 5.27 percentage points over the baseline model. The experimental results show that Adaptive Time-Frequency Masking can effectively enhance the model’s adaptability to real-world interference environments, while the dual-network fusion mechanism fully integrates static amplitude and time-series features, providing a feasible and highly reliable technical approach for UAV RF signal recognition.

Keywords: UAV radio frequency signals; Adaptive Time-Frequency Masking; MLP-LSTM dual-network; signal classification; feature fusion

1. Introduction

As unmanned aerial vehicles (UAVs) become widely deployed across civilian and military domains, the associated security risks, including airport intrusions, unauthorized flights in prohibited zones, privacy breaches, and electromagnetic interference, have become increasingly prominent, making real-time detection and identification vital [1]. Existing identification methods mainly include acoustic, optical, radar, and RF signal-based systems [2,3,4,5]. Figure 1 shows the detection ranges of these methods against small commercial drones. The baseline performance data are derived from technological reviews [4] and further calibrated against real-world environmental factors, such as background noise, terrain occlusion, and the specific radar cross-section of target drones, to provide an objective evaluation under complex operational settings. Acoustic recognition detects rotor noise; it is low-cost and well concealed but easily disturbed by ambient noise and limited in range. Optical recognition uses visible or infrared light and is also well concealed, but it is highly sensitive to lighting and visibility, which makes it unreliable at night or in bad weather. Radar covers a wide area but is affected by complex terrain, is not concealed, is costly, and is less sensitive to small UAVs. RF signal identification detects the communication or control links. It works in all weather, is highly sensitive to small UAVs, resists environmental interference, and offers long-range detection, although it can be affected by co-frequency clutter and may require expensive equipment. RF signal identification has recently become a focus of UAV identification research.

This paper presents a four-class classification method for UAV RF signals that uses Adaptive Time-Frequency Masking and a dual-branch network. First, the time-frequency representation of the RF signal is extracted, and real channel degradations (e.g., frequency-band shielding and local time-domain damage) are simulated with Adaptive Time-Frequency Masking to improve robustness. An MLP then extracts static amplitude features, while an LSTM learns time-series features. These features are fused to provide complementary information, yielding more stable recognition. The main contributions of this paper are summarized as follows.

We propose an Adaptive Time-Frequency Masking mechanism for RF physical fading. Unlike traditional data augmentation in computer vision, this approach is tightly tied to the physical properties of radio wave propagation. It adaptively masks certain frequency bands and time segments in the feature map. In doing so, it simulates “frequency-selective fading” and “burst signal truncation” seen in real-world electromagnetic environments. This targeted augmentation strategy improves the model’s robustness to non-stationary background noise.
We construct an ultra-lightweight dual-branch fusion architecture based on time-frequency feature decoupling. Unlike conventional approaches that map RF signals to images and use high-complexity convolutional networks (such as 2D-CNNs), MD-Net introduces a physical-feature decoupling design. It uses an MLP branch to efficiently compress static envelope features of the global spectrum. At the same time, an LSTM branch captures dynamic time-hopping temporal dependencies in communication protocols. This heterogeneous fusion architecture preserves complete topological correlations in the time-frequency domain. It also improves UAV signal discrimination under complex backgrounds, with an extremely lightweight parameter count of only 0.32 M.

To fully evaluate the proposed method, this paper tests it on the DroneRF dataset. Additional cross-dataset validation is performed on the independent DroneRFa dataset. This approach demonstrates the model’s generalization across different acquisition conditions.

The remainder of this article is organized as follows. Section 2 outlines relevant research. Section 3 details the MD-Net model, including Adaptive Time-Frequency Masking and the MLP-LSTM dual-branch network. Section 4 covers the experimental setup. Section 5 presents experimental results and analysis, including performance evaluation and comparison. Section 6 concludes the paper.

2. Relevant Research

Before the rise of deep learning, UAV RF signal identification relied primarily on manual feature extraction and traditional machine learning. UAV type identification was usually achieved by analyzing time-domain features of the signal (such as amplitude, phase, and instantaneous frequency), frequency-domain features (such as power spectral density and spectral peaks), or RF fingerprint features (such as IQ deviation, frequency offset, and nonlinear distortion), combined with classifiers such as support vector machines (SVMs), K-nearest neighbors, and random forests. Other studies extract time-frequency features using wavelet transforms or short-time Fourier transforms and then classify them with traditional classifiers. These methods are simple to implement and have low computational overhead. However, under low signal-to-noise ratio (SNR), multipath interference, and complex channels, their robustness and generalization are limited, as is their recognition performance across many categories or complex working modes. In recent years, end-to-end deep neural networks have become a mainstream approach to UAV RF signal recognition. Alam et al. [6] studied RF signal detection and recognition on the CardRF dataset; their end-to-end deep learning model achieved 97.53% accuracy in a three-class detection task (UAV, Wi-Fi, and Bluetooth), while its overall accuracy in the specific device recognition task was 76.42%. Aouladhadj et al. [7] used Mel spectra and pre-trained models for classification and recognition, achieving better results. Spectral features and traditional machine learning methods have also been applied to UAV classification. For instance, Kılıç et al. [8] achieved four- and ten-class classification by combining features such as power spectral density (PSD) and Mel-frequency cepstral coefficients (MFCC) with SVMs.

Although progress has been made with existing methods, key issues remain: Firstly, in real-world environments with low SNR, multipath interference, and band congestion, RF fingerprint features degrade significantly, and the model’s robustness is low [9]. Secondly, most methods adopt a single network structure and lack joint modeling of static and dynamic multidimensional features. This deficiency limits the model’s ability to make judgments based on a single feature, thereby reducing its adaptability to changes in RF fingerprint features under complex channel conditions. Furthermore, in scenarios with low signal-to-noise ratio and multipath interference, problems such as incomplete discrimination information and decreased classification performance are prone to occur [10]. In addition, the collection environment in public datasets is often simpler than that in real-world scenarios, leading to insufficient model generalization [11].

In response to these problems, recent research has begun to focus on time-frequency feature modeling. For instance, Zhou Xi et al. [12] proposed extracting RF fingerprints from time-frequency graphs for identification. Su Zhigang et al. [13] proposed an identification model robust to time-varying channels, emphasizing improved feature stability. Al-S’ad et al. [14], using the public dataset [15], applied a DNN to detect and identify three types of UAVs and their operational modes. Their results showed that classification accuracy decreased significantly as the number of categories grew: 99.7% for two classes (drone present or absent), 84.5% for four classes (drone type), and 46.8% for ten classes (drone mode). Swinney and Woods [16] represented the signals as images based on PSD and spectrograms, extracted features using VGG-16, and then classified them using SVM, logistic regression, and random forest, achieving 100% accuracy in binary classification, 88.6% in four-class classification, and 87.3% in ten-class classification. Bai et al. [17] proposed a two-stage spatio-temporal network that identifies RF signals by integrating temporal and frequency-domain features, providing a reference for temporal modeling. In addition, studies have characterized RF signals using wavelet analysis and temporal feature modeling, thereby enhancing discrimination among different types of UAVs [18].

3. Introduction to the MD-Net Model

The MD-Net model incorporates an Adaptive Time-Frequency Masking mechanism and a dual MLP-LSTM network structure built on the baseline MLP model. This integration facilitates collaborative modeling of both the static frequency-domain and dynamic timing-domain characteristics of radio frequency signals from unmanned aerial vehicles. Initially, the original signal undergoes adaptive occlusion to produce a variety of time-frequency features, thereby enhancing the model’s resilience to interference. Following this, static feature extraction and time-series learning are performed on the enhanced data, with feature-level fusion to enable effective classification and discrimination.

3.1. Overall Structure of the MD-Net Model

To enhance the robustness of unmanned aerial vehicle (UAV) radio frequency signal classification in complex backgrounds and weak-interference scenarios, this study introduces MD-Net, a UAV radio frequency signal classification model. This model integrates Adaptive Time-Frequency Masking and a dual-network structure, with a core logic of “data augmentation—dual-branch coding—fusion discrimination.” By improving sample distribution at the data layer and combining static spectral features with dynamic time-series features at the feature layer, it achieves comprehensive modeling of multidimensional discrimination information in UAV RF signals. The model’s data flow commences with the original radio frequency signal, generates samples via data preprocessing and enhancement, extracts static and dynamic features using the dual-branch encoder, and ultimately produces the discrimination outcome through global fusion and classification, establishing a coherent recognition process. The schematic representation of its overall architecture is depicted in Figure 2.

The data preprocessing and enhancement module is crucial for improving the model’s resistance to interference. Its primary role is to remove redundant information and increase sample diversity through standardization, dimensionality reduction, and adaptive enhancement. Initially, Z-score standardization is applied to the amplitude-spectrum features of the original radio-frequency signal to address dimensional variations across frequency components. Subsequently, PCA is employed to preserve 98% of the essential features, reducing computational costs and preventing overfitting. It should be noted that the introduction of PCA in this study is not intended as an architectural innovation, but rather as an indispensable prerequisite for data dimensionality reduction. Due to the extremely high dimensionality of the raw radio frequency (RF) signals, directly inputting them into the network would trigger severe out-of-memory (OOM) errors and system crashes. Therefore, PCA serves as a computationally necessary dimensionality-reduction step to ensure the smooth execution of the subsequent dual-branch network under constrained computational resources. The Adaptive Time-Frequency Masking module, central to data augmentation, comprises two operations: time-frequency automatic occlusion (IF-automask ×2) and noise injection. It automatically identifies the occlusion region based on the core interval of the signal’s instantaneous frequency (IF), thereby mimicking real-world degradation scenarios such as spectral loss and multipath fading. By introducing low-intensity noise, the model’s robustness in complex electromagnetic environments is bolstered, safeguarding the signal’s core characteristics from deterioration.
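The standardization and PCA step described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `zscore_pca`, the synthetic data, and the use of an SVD to obtain the principal components are assumptions. Only the Z-score step and the 98% retention target come from the text.

```python
import numpy as np

def zscore_pca(features, keep_variance=0.98, eps=1e-8):
    """Z-score each column, then keep the leading principal components."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    z = (features - mean) / (std + eps)  # eps guards against constant columns
    # z is column-centered, so its SVD yields the principal axes directly;
    # squared singular values are proportional to the explained variance.
    _, s, vt = np.linalg.svd(z, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    # Smallest number of components whose cumulative share reaches the target.
    k = int(np.searchsorted(np.cumsum(explained), keep_variance) + 1)
    k = min(k, len(s))
    return z @ vt[:k].T, k

# Hypothetical data: 200 samples of 50 correlated features driven by 5 factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
features = latent @ mixing + 0.05 * rng.normal(size=(200, 50))

reduced, k = zscore_pca(features)
print(reduced.shape, k)
```

In a training pipeline, the mean, standard deviation, and projection matrix would normally be estimated on the training split only and then reused for validation and test data.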

The hierarchical dual-branch encoder is the core component of feature extraction. By parallelizing MLP and LSTM branches, it captures the signal’s static amplitude and dynamic timing features, respectively, achieving complementary modeling of multidimensional features. Among them, the MLP branch primarily extracts global statistical information from the feature vectors. The feature dimension is gradually compressed via a three-layer fully connected network (Dense-256, Dense-128, Dense-64), and normalization and regularization are applied at each layer to stabilize training and suppress overfitting. Ultimately, a static feature representation with strong discriminative ability is formed. The LSTM branch models the correlation structure among feature dimensions. Firstly, the one-dimensional feature vector is reconstructed into a pseudo-sequence form of “time series step size × feature dimension” through the Reshape operation to meet the input requirements of the time series modeling network. It should be noted that this reshaping operation is not an artificial random segmentation, but rather a simulation of the framing and packetization mechanisms of the underlying radio communication protocols. This constructs a structured pseudo-sequence representation inspired by communication framing, rather than a true physical time sequence, thereby enabling the precise capture of dynamic temporal hopping patterns between continuous communication frames. Subsequently, a bidirectional LSTM (BiLSTM) is adopted to encode the reconstructed sequence, capturing feature dependencies in both forward and reverse directions. On this basis, a fully connected layer (Dense, 32 units) is introduced to further compress and non-linearly map the timing coding results, yielding a compact feature-correlation representation.

The global feature fusion module and the classification output head jointly optimize features and category discrimination. Among them, the Global feature fusion module concatenates the static features from the MLP branch and the temporal features from the LSTM branch (Concat (Global)). Subsequently, through a fully connected layer and a regularization mechanism, nonlinear transformations and the suppression of redundant information are performed to further enhance the discriminative ability of the features. The classification output head adopts the Softmax activation function to conduct probability prediction for four types of RF signals: Background, Bebop, AR, and Phantom, and outputs the attribution probabilities of each category, ultimately achieving the four-classification task of unmanned aerial vehicle (UAV) RF signals. The entire model structure not only ensures the full mining of static spectral features but also precisely captures dynamic temporal dependencies, providing structural support for highly reliable identification in complex environments.

3.2. Adaptive Time-Frequency Masking Module

With the widespread deployment of unmanned aerial vehicles (UAVs) in both civilian and military applications, RF signals are often affected by multipath interference, low signal-to-noise ratio, and spectral congestion in real-world environments, leading to significant degradation of traditional signal features and thereby reducing the robustness of model classification. To address this issue, this paper designs an Adaptive Time-Frequency Masking module. Its core objective is to generate diverse samples by simulating channel degradation during training, thereby enhancing the model’s anti-interference performance in complex environments. The overall module process is shown in Figure 3.

First, a short-time Fourier transform (STFT) is applied to the original RF signal $x (t)$, and the signal is analyzed segment by segment through the sliding window function $w [n]$. The window length N was fixed at 256 (n_fft = 256) in the experiments. This value was chosen from common practice in time-frequency analysis as a stable balance between temporal and frequency resolution, rather than tuned experimentally. Under the sliding window, the original signal is divided into multiple short-time segments that are locally approximately stationary. The Fourier transform is performed on each segment, forming a two-dimensional time-frequency representation (time-frequency diagram) that contains both time and frequency information. The calculation is shown in Equation (1). The samples then pass through the Random Enhancement Operator Pool. This module does not randomly select among multiple enhancement methods; instead, it controls whether enhancement is triggered through a multi-level random mechanism. First, each sample is enhanced with probability $p = 0.5$. Then, Adaptive Time-Frequency Masking is performed with a probability of 0.7, and the starting position of the occlusion in time is chosen at random.

$S (f, t) = \sum_{n = 0}^{N - 1} x [n] w [n - t] e^{- j 2 π f n / N}$

(1)

In the formula, $S (f, t)$ represents the time-frequency representation of the final output. $f$ is the frequency index, whose value range is jointly determined by the sampling rate and the window length, corresponding to the discrete frequency points within the target frequency band of 2400–2480 MHz, achieving precise characterization of the frequency-domain components of the signal. $e^{- j 2 π f n / N}$ is a complex exponential term, essentially the kernel function of the discrete Fourier transform, which plays a core role in converting the local time-domain sampled signal $x [n]$ to the frequency domain. By multiplying and summing the weighted signal sampling values, the extraction and amplitude calculation of specific frequency components are completed. $w [n - t]$ represents the window function sliding along the signal with the frame-shift step size to achieve segmented windowing analysis. The time frame index $t$ increases in steps of hop = N/4 = 64 according to the STFT framing rule, taking the values $t = 0, 64, 128, \dots, L - N$ (where $L$ is the signal length). Subsequently, multiple enhancement operations are applied to the time-frequency graph. This module generates diverse training samples through the steps listed below, thereby enhancing the model’s stability under low SNR and when a local frequency band is missing.
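Before turning to the individual enhancement steps, the framing in Equation (1) can be illustrated with a short NumPy sketch using N = 256 and hop = 64. The Hann window, the function name `stft`, and the test tone are assumptions, since the text does not name the window type. The sketch also indexes samples relative to each frame start, which is the usual FFT convention and differs from absolute indexing only by a per-frame phase factor.

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    """Return S with shape (n_fft, n_frames) and the frame start indices t."""
    if len(x) < n_fft:
        raise ValueError("signal is shorter than one window")
    window = np.hanning(n_fft)  # assumed window; the source does not specify one
    # Frame starts t = 0, hop, 2*hop, ..., up to the last full window.
    starts = np.arange(0, len(x) - n_fft + 1, hop)
    frames = np.stack([x[s:s + n_fft] * window for s in starts])
    # FFT along each frame, then transpose to (frequency, time).
    return np.fft.fft(frames, axis=1).T, starts

# Hypothetical test signal: a complex tone centered exactly on bin 32.
n = np.arange(1024)
x = np.exp(2j * np.pi * 32 * n / 256)

S, starts = stft(x)
print(S.shape, starts[-1], int(np.argmax(np.abs(S[:, 0]))))
```

With a 1024-sample signal the frame starts run from 0 to 768, which matches the rule $t = 0, 64, 128, \dots, L - N$ given above.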

RF-IF-AutoMask enhancement: The core energy region of the signal is located by calculating the instantaneous frequency, and the signal amplitude in the non-core region is locally occluded so that the key features of the signal are not damaged. The occlusion is controlled by the time occlusion parameter $T_{mask}$ and the frequency occlusion parameter $F_{mask}$: $T_{mask}$ is the number of consecutive units the occluded area spans in the time dimension of the time-frequency graph, and $F_{mask}$ is the number of consecutive units it spans in the frequency dimension. Together they determine the shape and size of the time-frequency mask matrix $M (f, t)$, which is composed of several rectangular regions of size $F_{mask} \times T_{mask}$ distributed around the instantaneous main frequency trajectory. Complex Gaussian noise then replaces the occluded area to generate the enhanced sample $\tilde{S} (f, t)$, as shown in Equation (2).

$\tilde{S} (f, t) = S (f, t) \cdot (1 - M (f, t)) + N (f, t) \cdot M (f, t)$

(2)

In the formula, $M (f, t)$ is a binary mask matrix determined by $T_{mask}$ and $F_{mask}$, and $N (f, t)$ is complex Gaussian noise whose amplitude is adaptively scaled according to $| S (f, t) | \times 10^{- SNR / 20}$, so that this scale factor controls the ratio of noise to signal amplitude. Complex Gaussian noise is chosen to match the complex-valued time-frequency graph $S (f, t)$ (which includes amplitude and phase information) and to avoid the signal phase distortion that local replacement would otherwise cause. In addition, the occlusion ratio is strictly limited during occlusion and replacement (not exceeding 20% of the total signal area) to ensure that the main features of the signal are not damaged.
AFH enhancement: A smooth random phase perturbation is applied to the signal frequency range to simulate frequency drift. The calculation is shown in Equation (3).

$x_{AFH} (t) = x (t) \cdot e^{j ϕ_{AFH} (t)}, ϕ_{AFH} (t) \sim SmoothNoise ()$

(3)

Here, $x (t)$ represents the original RF complex baseband signal, $e^{j ϕ_{AFH} (t)}$ is the phase modulation term that varies with time, $ϕ_{AFH} (t)$ is the phase perturbation sequence generated by a smooth random process, and $x_{AFH} (t)$ is the enhanced signal. The frequency drift simulation is achieved indirectly through a phase-progressive offset.
PNT enhancement: PNT augmentation involves superimposing smooth noise on the signal phase to simulate the real channel phase variation. Its calculation is shown in Equation (4).

$x_{PNT} (t) = x (t) \cdot e^{j ϕ_{PNT} (t)}, ϕ_{PNT} (t) \sim N (0, σ_{ϕ}^{2})$

(4)

Among them, $e^{j ϕ_{PNT} (t)}$ is the random phase modulation term, which only acts on the signal phase and does not change the amplitude characteristics. $ϕ_{PNT} (t)$ is the phase noise component, which follows a zero-mean Gaussian distribution $N (0, σ_{ϕ}^{2})$ . $x_{PNT} (t)$ is the enhanced signal, precisely restoring the channel degradation mode of “only phase-disturbed”.
PSD gated screening [19]: To ensure that the spectral characteristics of the enhanced samples are consistent with the original signal, a power spectral density similarity gating is introduced, retaining only the samples with spectral differences less than the threshold $τ$ . Its calculation is shown in Equation (5).

${PSD}_{sim} = \frac{∥ P_{orig} - P_{aug} ∥_{2}}{∥ P_{orig} ∥_{2}} \leq τ$

(5)

In the formula, $P_{orig}$ is the power spectral density (PSD) of the original signal, $P_{aug}$ is the power spectral density of the enhanced signal, and ${∥ \cdot ∥}_{2}$ is the Euclidean norm.
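The masking step in Equation (2) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: the per-frame peak bin stands in for the instantaneous-frequency trajectory, and the function name `adaptive_tf_mask`, the `guard` band, the block sizes, the number of blocks, and the `snr_db` value are all hypothetical. The noise scaling $| S (f, t) | \times 10^{- SNR / 20}$ and the 20% cap follow the text.

```python
import numpy as np

def adaptive_tf_mask(S, f_mask=8, t_mask=4, n_blocks=2, guard=2,
                     snr_db=20.0, max_ratio=0.2, rng=None):
    """Replace rectangles beside the dominant-frequency track with scaled noise."""
    rng = np.random.default_rng() if rng is None else rng
    n_freq, n_time = S.shape
    # Dominant bin per frame: a simple stand-in for the IF trajectory.
    peak = np.argmax(np.abs(S), axis=0)
    freq_idx = np.arange(n_freq)[:, None]
    core = np.abs(freq_idx - peak[None, :]) <= guard  # protected core region
    M = np.zeros(S.shape, dtype=bool)
    for _ in range(n_blocks):
        t0 = int(rng.integers(0, n_time - t_mask + 1))  # random start time
        # Place the block just above or just below the core at frame t0.
        offset = guard + 1 if rng.random() < 0.5 else -(guard + f_mask)
        f0 = int(np.clip(peak[t0] + offset, 0, n_freq - f_mask))
        block = np.zeros_like(M)
        block[f0:f0 + f_mask, t0:t0 + t_mask] = True
        block &= ~core  # never occlude the core energy region
        # Skip blocks that would push the masked share above the cap.
        if (M | block).sum() <= max_ratio * S.size:
            M |= block
    unit = (rng.standard_normal(S.shape)
            + 1j * rng.standard_normal(S.shape)) / np.sqrt(2)
    N = unit * np.abs(S) * 10.0 ** (-snr_db / 20.0)  # amplitude-adaptive noise
    # Equation (2): S * (1 - M) + N * M
    return np.where(M, N, S), M

# Hypothetical complex time-frequency map with 64 bins and 32 frames.
rng = np.random.default_rng(0)
S = rng.standard_normal((64, 32)) + 1j * rng.standard_normal((64, 32))
S_aug, M = adaptive_tf_mask(S, rng=rng)
print(M.mean())
```

Cells outside the mask are returned unchanged, so the augmented sample differs from the original only inside the occluded rectangles.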

In addition, to balance the feature fitting of the original signal and the diversity of the enhancement samples after the above-mentioned enhancement operations, the module introduces an enhancement probability control to adjust the trigger frequency of the enhancement operations. During training, each input sample is selected with 50% probability to undergo the above Adaptive Time-Frequency Masking, while unselected samples retain their original features after preprocessing and participate in training. This setting not only avoids “pseudo-feature learning” caused by excessive enhancement, but also ensures that the model can fully contact the samples in complex interference scenarios, thereby enhancing the model’s anti-interference ability. Meanwhile, global Gaussian white noise with an amplitude of 0.001–0.002 (SNR ≈ 40 dB) is superimposed to simulate the slight global noise interference that is common in real channels. By combining three types of targeted enhancements (local spectrum deficiency, frequency drift, and phase noise) with global mild noise injection, the module systematically expands the distribution diversity of training samples, specifically strengthening the model’s adaptability to local spectrum deficiency, frequency drift, phase noise, and mild global interference, laying a stable data foundation for subsequent feature extraction and classification tasks.
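The phase perturbations in Equations (3) and (4), the PSD gate in Equation (5), and the 50% trigger probability can be combined in a small sketch. The moving-average construction of the smooth noise, the periodogram PSD estimate, the function names, and the values of `scale`, `width`, `sigma`, and `tau` are assumptions made for illustration, because the text does not give them. The time-frequency masking step, which the text applies with probability 0.7, is omitted here.

```python
import numpy as np

def smooth_noise(n, width, rng):
    """Moving-average filtered Gaussian noise: one possible smooth process."""
    raw = rng.standard_normal(n + width - 1)
    return np.convolve(raw, np.ones(width) / width, mode="valid")  # length n

def afh(x, rng, scale=0.5, width=64):
    """Eq. (3): slowly varying phase rotation (scale and width are assumed)."""
    return x * np.exp(1j * scale * smooth_noise(len(x), width, rng))

def pnt(x, rng, sigma=0.05):
    """Eq. (4): zero-mean Gaussian phase noise; magnitudes are unchanged."""
    return x * np.exp(1j * rng.normal(0.0, sigma, size=len(x)))

def psd_distance(x_orig, x_aug):
    """Eq. (5): relative L2 distance between periodogram PSD estimates."""
    p_orig = np.abs(np.fft.fft(x_orig)) ** 2
    p_aug = np.abs(np.fft.fft(x_aug)) ** 2
    return np.linalg.norm(p_orig - p_aug) / np.linalg.norm(p_orig)

def augment(x, rng, p_aug=0.5, tau=0.5, noise_amp=(0.001, 0.002)):
    """Return (signal, applied); unselected or rejected samples stay unchanged."""
    if rng.random() >= p_aug:  # sample not selected for enhancement
        return x, False
    y = pnt(afh(x, rng), rng)
    amp = rng.uniform(*noise_amp)  # mild global noise amplitude
    y = y + amp * (rng.standard_normal(len(x))
                   + 1j * rng.standard_normal(len(x)))
    if psd_distance(x, y) > tau:  # PSD gate: discard spectrally distant samples
        return x, False
    return y, True

# Hypothetical unit-amplitude complex tone as the input signal.
rng = np.random.default_rng(0)
x = np.exp(2j * np.pi * 0.1 * np.arange(2048))
results = [augment(x, rng) for _ in range(200)]
applied_share = np.mean([applied for _, applied in results])
print(applied_share)
```

Because the gate rejects some candidates, the share of samples that end up enhanced can be lower than the nominal trigger probability.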

3.3. MLP-LSTM Dual-Branch Network Structure

Traditional single-branch networks struggle to simultaneously capture the static frequency-domain and dynamic timing-domain characteristics of RF signals. To this end, this paper proposes an MLP-LSTM dual-branch network architecture that combines MLP and LSTM to capture complementary static and dynamic features. The network structure is shown in Figure 4. Unlike traditional network-based physical time series modeling, the dual-branch network proposed in this paper uses standardized, PCA-reduced radio-frequency feature vectors as a unified input. Multi-view feature representation is achieved through structural design, thereby enhancing the model’s classification robustness in complex electromagnetic environments. It mainly consists of five parts: the input layer, the MLP static feature extraction branch, the LSTM dynamic feature modeling branch, the feature fusion module, and the classification output layer. The functions of each module are as follows.

1. Input layer: The two branches share the same input, namely the RF feature vector produced by data preprocessing and adaptive time-frequency occlusion. The original UAV RF signal is first subjected to a discrete Fourier transform to obtain a high-dimensional frequency-domain amplitude feature containing approximately 5000 frequency components. To eliminate dimensional differences among frequency components, Z-score standardization is applied to this feature. Principal Component Analysis (PCA) is then applied in the input layer of the MD-Net model for dimensionality reduction. As an unsupervised linear dimensionality reduction algorithm, PCA retains 98% of the valid discriminant information through covariance matrix calculation, eigendecomposition, and principal component selection. Before this step, the data are high-dimensional, highly redundant, and strongly correlated, which increases the model’s computational cost and can lead to overfitting. After it, a low-dimensional feature vector is obtained, which removes redundant information and some noise and reduces computational cost while retaining the key information. The result is a one-dimensional feature vector x of dimension D, denoted $x \in R^{D}$, where $R$ denotes the field of real numbers. This feature vector serves as the common input to the subsequent MLP and LSTM branches, providing a unified data basis for dual-branch feature coding.
2. MLP static feature extraction branch: The MLP branch primarily extracts global, static discriminative features from the RF signals. This branch takes the input feature vector x directly, without additional flattening, and performs nonlinear mapping and feature compression layer by layer through a multi-layer fully connected network. The structure consists of three fully connected layers with 256, 128, and 64 units in sequence. Each layer uses the ReLU activation function to enhance the network’s nonlinear expressiveness and mitigate the vanishing gradient problem. The forward propagation is shown in Equation (6), where $σ (\cdot)$ is the ReLU activation function, $W_{i}$ and $b_{i}$ are trainable parameters, $h_{i}$ is the output feature vector of the i-th fully connected layer, and $h_{0} = x$ is the input. Through layer-by-layer compression and feature recombination, the MLP branch ultimately outputs a 64-dimensional static feature vector that characterizes the stable differences among UAV types at the level of the overall spectral distribution.

$h_{i} = σ (W_{i} h_{i - 1} + b_{i}), i = 1, 2, 3$

(6)
3. LSTM dynamic feature modeling branch: The LSTM branch captures sequential dependencies in the RF features. Its goal is not to model a physical time series but to use the LSTM’s strengths in sequence modeling to explore dependencies between feature dimensions. The input to this branch is the one-dimensional feature vector (dimension D) obtained after standardization and PCA dimensionality reduction. Since the LSTM requires sequence input, the vector is first reconstructed using the Reshape operation: the one-dimensional vector is divided into T contiguous feature sub-segments, each containing F feature dimensions (satisfying $T \times F = D$, so that the total number of features remains unchanged and no information is lost). It is thereby reconstructed into a two-dimensional pseudo-sequence X of “time sequence step T × feature dimension F”, as shown in Equation (7). As a structured pseudo-sequence representation inspired by communication framing rather than a true physical time sequence, this reconstruction adapts the data to the LSTM’s input requirements while capturing the underlying protocol characteristics. Subsequently, a bidirectional LSTM (BiLSTM) with 64 hidden units encodes the reconstructed feature sequence X, modeling the association patterns between feature subspaces in both forward and reverse directions and outputting a 128-dimensional high-order feature representation. To further compress redundant information and improve compatibility for feature fusion, a fully connected compression layer (Dense 32, ReLU) is added after the BiLSTM to map the high-dimensional features into a 32-dimensional compact feature vector. This vector characterizes the correlation structure among RF features and complements the MLP branch features in the subsequent fusion.

$x \in \mathbb{R}^{D} \to X \in \mathbb{R}^{T \times F}$

(7)
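The reshaping in Equation (7) amounts to cutting the length-D vector into T contiguous rows of F values. The sketch below shows this on a toy vector; the function name `to_pseudo_sequence` and the toy sizes are illustrative assumptions, and the value of D used in the paper is not stated in this section.

```python
def to_pseudo_sequence(x, T):
    """Split a length-D vector into T contiguous sub-segments of F = D / T features."""
    D = len(x)
    if D % T != 0:
        raise ValueError("T must divide the feature length D so that T * F = D")
    F = D // T
    return [x[t * F:(t + 1) * F] for t in range(T)]

x = list(range(12))              # toy feature vector, D = 12
X = to_pseudo_sequence(x, T=4)   # 4 steps with 3 features each
print(X)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
```

Flattening `X` row by row returns the original vector, which is the sense in which the reshaping loses no information.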
4.: Feature fusion and classification output layer: The static features from the MLP branch are concatenated with the dynamic features from the LSTM branch and fed to the Softmax classifier to obtain the final prediction. The calculation is shown in Equation (8), where $[\cdot]$ represents the feature concatenation operation, $W_{c}$ and $b_{c}$ are the parameters of the classification layer, and $\hat{y}$ is the final multi-classification prediction output.

$\hat{y} = \mathrm{Softmax}(W_{c} [h_{\mathrm{MLP}}, h_{\mathrm{LSTM}}] + b_{c})$

(8)
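Equation (8) can be sketched as a concatenation followed by a linear layer and a softmax. The branch widths (64 and 32) and the four output classes follow the text; the weight values and function names below are placeholders chosen for illustration, not trained parameters.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_and_classify(h_mlp, h_lstm, W_c, b_c):
    """Compute Softmax(W_c [h_MLP, h_LSTM] + b_c)."""
    h = list(h_mlp) + list(h_lstm)   # feature concatenation [.,.]
    logits = [sum(w * a for w, a in zip(row, h)) + b for row, b in zip(W_c, b_c)]
    return softmax(logits)

h_mlp = [0.1] * 64                   # stand-in for the MLP branch output
h_lstm = [0.2] * 32                  # stand-in for the LSTM branch output
W_c = [[0.01 * (k + 1)] * 96 for k in range(4)]   # placeholder 4 x 96 weights
b_c = [0.0] * 4
y_hat = fuse_and_classify(h_mlp, h_lstm, W_c, b_c)
predicted_class = max(range(4), key=lambda k: y_hat[k])
```

The predicted label is the index of the largest entry of `y_hat`, which sums to one by construction.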

A single network struggles to capture static and dynamic features at the same time; the dual-branch architecture addresses this by combining the complementary strengths of the MLP and the LSTM. The MLP branch contributes global feature extraction and the LSTM branch contributes sequence modeling, and the fused multidimensional representation allows the model to maintain high recognition accuracy in complex electromagnetic environments, providing an efficient network for UAV RF signal classification.

4. Experimental Setup

4.1. Introduction to the Dataset

4.1.1. DroneRF Dataset

This study adopts the publicly available DroneRF dataset [15], which contains RF communication signals from three types of commercial UAVs (DJI Phantom, Parrot Bebop, and Parrot AR) together with background-environment signals, providing multi-scenario samples for UAV detection and classification. At the protocol level, the Parrot series (AR and Bebop) follows standard Wi-Fi (802.11 b/g/n), whose physical layer uses Direct Sequence Spread Spectrum (DSSS) and Orthogonal Frequency Division Multiplexing (OFDM) modulation with a typical channel bandwidth of 20 MHz. The DJI Phantom uses an enhanced Wi-Fi or the early Lightbridge proprietary protocol, whose OFDM subcarrier structure is optimized to support adaptive bandwidth switching between 10 and 20 MHz. These protocol-level physical differences provide the technical basis for distinguishing drone platforms. Data were acquired with two NI USRP-2943 software-defined radios recording synchronously, one capturing the lower and one the upper portion of the band, for a total sampling bandwidth of 80 MHz that covers common communication bands (such as Wi-Fi). The dataset totals approximately 40 GB and contains 454 RF record files forming 227 segments, consistent with one lower-band and one upper-band file per segment; each record lasts about 0.25 s [20]. In addition, a longer background segment was collected for UAV presence detection. For task design, the data can be labeled with two classes (drone/no drone), four classes (Bebop/AR/Phantom/background), or ten classes (subdivided by flight and transmission mode). Bebop and AR are recorded in multiple operating modes, whereas Phantom is recorded only in the control-connection state. 
The original RF signal is transformed into a frequency-domain representation containing 5000 frequency components by the discrete Fourier transform, which is used for subsequent feature extraction and classification modeling. Background samples account for approximately 18% [21], thereby enhancing the model’s robustness to real-world environmental noise and supporting the detection of unmanned aerial vehicle presence and multi-category recognition tasks. The classification structure of the DroneRF dataset is shown in Figure 5.

4.1.2. DroneRFa Dataset

DroneRFa is a large-scale open-source RF signal dataset for low-altitude unmanned aerial vehicle (UAV) detection. It aims to fill the gap in real-world data for UAV RF identification research and to support the development of counter-UAV detection and identification technology. The dataset was collected with software-defined radio equipment (USRP-2955, National Instruments, Austin, TX, USA) from communication signals between UAVs and their remote controllers, covering three ISM bands: 915 MHz, 2.4 GHz, and 5.8 GHz. It includes 9 types of UAV signals recorded outdoors in urban scenes, 15 types of indoor signals, and 1 type of background reference signal. Each signal type contains at least 12 segments, and each segment contains over 100 million sampling points stored as I/Q baseband components in double-precision floating-point format, supporting time-frequency-domain feature analysis. File names follow the rule “prefix character + binary code”, which encodes key information such as flight distance and operating frequency band and makes the target data easy to retrieve. This study selects six signal types from the dataset: the background reference signal (T0000), which contains environmental interference such as Bluetooth and Wi-Fi; DJI MATRICE 200 (T0011); DJI MATRICE 100 (T0100); DJI Air 2S (T0101); DJI Mini 3 Pro (T0110); and DJI Inspire 2 (T0111). 
The selected models span the evolution of drone communication technology from fixed-bandwidth to highly adaptive systems. The DJI MATRICE series and the Inspire 2 mainly adopt the proprietary Lightbridge 2 video transmission system, which combines Orthogonal Frequency Division Multiplexing (OFDM) with Frequency Hopping Spread Spectrum (FHSS) to keep the link stable, while the DJI Air 2S and Mini 3 Pro are equipped with the more advanced OcuSync 3.0 (O3) system, which uses a denser dynamic frequency-hopping mechanism and can adjust the channel bandwidth between 1.4 MHz and 20 MHz in real time according to environmental interference. These complex physical-layer protocols and non-stationary spectral characteristics are the key technical challenges for the MD-Net recognition task. The selected signals cover mainstream commercial UAV models and complex urban electromagnetic backgrounds; their multi-band communication characteristics, high sampling precision, and real-world interference can effectively support the training and validation of UAV RF identification models [11].

4.2. Experimental Environment and Parameters

The experimental platform for this study is Python 3, and the model is built with the TensorFlow/Keras 2.13.0 framework. The experimental parameters are listed in Table 1. Key training hyperparameters (number of epochs, batch size, optimizer, and loss function) are kept identical to those of the baseline MLP model. Adaptive Time-Frequency Masking is enabled for the training set with an augmentation probability of 0.5. This value follows common engineering practice for balancing data diversity against the stability of the original distribution, so that the model learns from augmented samples without a distribution shift caused by excessive perturbation. The time masking parameter $T_{mask} = 10$ and the frequency masking parameter $F_{mask} = 6$ are based on strategies from the relevant literature and were heuristically adapted to the physical characteristics of the DroneRF dataset. Given that each RF record lasts approximately 0.25 s and contains 5000 frequency components, masking windows of this size can simulate local burst interference in real environments without obscuring the macroscopic time-frequency energy distribution of the drone communication protocols (such as Wi-Fi or Lightbridge). This medium-scale masking configuration pushes the network to learn the underlying protocol details in the unmasked regions, improving its discriminative stability in non-stationary electromagnetic backgrounds and its classification robustness to complex environmental interference.

After weighing RF signal characteristics, model expressiveness, and generalization performance, the MLP branch is configured with three fully connected layers of 256, 128, and 64 neurons, respectively. The layer width decreases step by step, so the network gradually compresses the feature dimension while strengthening class discrimination, yielding a more compact and discriminative global feature representation. The LSTM branch employs a bidirectional LSTM (BiLSTM) to capture sequential dependencies in the RF features. The number of LSTM units is set to 64, followed by a fully connected layer with 32 units that further maps and compresses the sequence features, reducing redundant information and bringing the output to a dimensionality compatible with the MLP branch features, which improves the stability of feature fusion. The output layer consists of four Softmax units corresponding to the class labels Background, Bebop, AR, and Phantom.

4.3. Experimental Evaluation Index

In multi-category classification tasks, Accuracy is typically used as the overall performance metric, while Precision, Recall, and F1 score are calculated for each category. Let the true positive be TP, the false positive be FP, the false negative be FN, and the true negative be TN. The total number of samples is $N = TP + FP + FN + TN$. Accuracy is defined in Equation (9).

$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$

(9)

Here, for a given category, a true positive (TP) is a sample of that category that the model correctly predicts as that category; a false positive (FP) is a sample of another category that the model wrongly predicts as that category; a false negative (FN) is a sample of that category that the model wrongly predicts as another category; and a true negative (TN) is a sample of another category that the model correctly predicts as not belonging to that category. The total sample size N is the number of samples across all categories.

Precision measures the proportion of samples predicted as belonging to a given category that actually belong to it, and is used to assess the reliability of the model’s positive predictions. Its definition is shown in Equation (10).

$\mathrm{Precision} = \frac{TP}{TP + FP}$

(10)

Here, the numerator TP is the number of samples that are predicted as belonging to this category and actually belong to it, while the denominator $TP + FP$ is the total number of samples predicted as belonging to this category (the correctly predicted TP plus the wrongly predicted FP). Precision therefore reflects the purity of the model’s predictions: the higher the value, the fewer false positives appear among the predictions for this category.

Recall focuses on the proportion of real samples in a given category that the model correctly predicts as that category and is used to measure the completeness of the model’s predictions. Its definition is shown in Equation (11).

$\mathrm{Recall} = \frac{TP}{TP + FN}$

(11)

Here, the numerator TP is the number of samples of this category that the model correctly identifies, and the denominator $TP + FN$ is the total number of true samples in this category (the correctly identified TP plus the missed FN). Recall therefore reflects the model’s ability to capture samples of this category: the higher the value, the fewer true samples are missed. Precision and recall often trade off against each other (for instance, an increase in precision may lead to a decrease in recall, and vice versa). The F1 score therefore combines the two through their harmonic mean to measure the model’s balanced performance on this category, as defined in Equation (12).

$F_{1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$

(12)

Here, the numerator is twice the product of precision and recall and the denominator is their sum, so the score is high only when both are high. Its purpose is to avoid the limitations of a single indicator and to reflect the model’s balance between precision and completeness. The F1 score takes values in $[0, 1]$; the closer it is to 1, the better the balance of category recognition. It is especially suitable for scenarios with imbalanced class sizes (for example, in this article, the proportions of Phantom and background samples differ significantly).
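Equations (9) to (12) can be computed per category from a confusion matrix, treating each category in turn as the positive class. The sketch below assumes rows are true labels and columns are predicted labels; the 3 × 3 matrix is a made-up toy example, not data from this study, and the function name is illustrative.

```python
def per_class_metrics(cm):
    """Return (overall accuracy, list of per-class precision/recall/F1).

    cm[i][j] counts samples of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                          # class k predicted as another class
        fp = sum(cm[r][k] for r in range(n)) - tp     # other classes predicted as k
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results.append({"precision": precision, "recall": recall, "f1": f1})
    accuracy = sum(cm[k][k] for k in range(n)) / total
    return accuracy, results

toy_cm = [[5, 0, 0],
          [0, 3, 1],
          [0, 2, 4]]   # hypothetical counts for illustration only
accuracy, metrics = per_class_metrics(toy_cm)
```

For the second class of the toy matrix, TP = 3, FP = 2, and FN = 1, giving a precision of 0.6 and a recall of 0.75, which shows how the two metrics can diverge for the same category.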

4.4. Model Implementation and Training

This study implements a classification model based on the dual-network fusion of an MLP and an LSTM, and introduces an Adaptive Time-Frequency Masking method to expand the training data. During preprocessing, the original spectral features are first standardized with the Z-score, and Adaptive Time-Frequency Masking is then applied: several time-domain segments and frequency-domain components of the signal are randomly occluded and filled with white noise (with the signal-to-noise ratio set to 40 dB), generating augmented samples and improving the model’s robustness to noise. The augmented data are merged with the original data and used jointly for model training.
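The occlusion step can be sketched on a small time-frequency array as follows. This is a simplified illustration under several assumptions that the text does not confirm: the features are arranged as a 2-D list of time steps by frequency bins, `t_mask` and `f_mask` are treated as maximum window widths, one window of each kind is placed uniformly at random, and the noise level is set relative to the mean power of the sample. The energy-based selection of the masking region is not reproduced here, and the function name is hypothetical.

```python
import random

def mask_time_frequency(spec, t_mask=10, f_mask=6, snr_db=40.0, p=0.5, rng=None):
    """Return a copy of spec with one time band and one frequency band
    replaced by white Gaussian noise, applied with probability p."""
    rng = rng or random.Random()
    out = [row[:] for row in spec]        # never modify the original sample
    if rng.random() >= p:
        return out                        # sample left unaugmented
    n_t, n_f = len(out), len(out[0])
    power = sum(v * v for row in out for v in row) / (n_t * n_f)
    sigma = (power / 10 ** (snr_db / 10)) ** 0.5   # noise std for the target SNR
    t_w = rng.randint(0, min(t_mask, n_t))         # time window width
    t0 = rng.randint(0, n_t - t_w)
    f_w = rng.randint(0, min(f_mask, n_f))         # frequency window width
    f0 = rng.randint(0, n_f - f_w)
    for t in range(t0, t0 + t_w):                  # occlude whole time steps
        out[t] = [rng.gauss(0.0, sigma) for _ in range(n_f)]
    for row in out:                                # occlude frequency bins
        for f in range(f0, f0 + f_w):
            row[f] = rng.gauss(0.0, sigma)
    return out

spec = [[1.0] * 20 for _ in range(50)]             # toy 50 x 20 feature map
augmented = mask_time_frequency(spec, rng=random.Random(0))
```

Returning a copy keeps the original sample available, which matches the description of merging augmented and original data for training.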

The model architecture consists of two branches. The MLP branch processes the full-spectrum features, with an input dimension equal to the feature length; after the fully connected layers compress the feature dimension layer by layer, it outputs a static feature vector. The LSTM branch reshapes the input features into a sequence (50 steps × the number of frequency-domain features per step), extracts sequence-dependent features with a bidirectional LSTM (BiLSTM) with 64 hidden units, and then applies a fully connected layer (32 units, ReLU activation) for feature compression, outputting the dynamic feature vector. The two branch outputs are concatenated and passed to a softmax output layer for four-class prediction. Training uses the cross-entropy loss function and the Adam optimizer as implemented in TensorFlow/Keras 2.13.0 (Google LLC., Mountain View, CA, USA), with early stopping and learning-rate decay callbacks that monitor validation performance to prevent overfitting. The entire experimental process was evaluated with 10-fold cross-validation, training each fold for 30 epochs. During training, the best model weights, training logs, and loss/accuracy curves are recorded and saved, as shown in Figure 6 and Figure 7, providing complete data for subsequent performance verification and analysis. In addition, to ensure a fair performance comparison, we conducted extended training experiments with 50 epochs; the corresponding loss and accuracy curves are shown in Figure 8 and Figure 9.

After training, the best model weights from the current fold are loaded, and performance is evaluated on the held-out test subset that did not participate in training for that fold of the stratified 10-fold cross-validation. The evaluation metrics are calculated, and classification reports and confusion matrices are generated. The model’s confusion matrix is shown in Figure 10, and its ROC curve is shown in Figure 11. The ROC curve is computed from the test-set output probabilities after binarization. For the multi-class case, the AUC for each category and the micro-average AUC are calculated to reflect the model’s overall discriminative ability across categories. All experimental results were saved for comparison and subsequent analysis.
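The idea behind stratified k-fold evaluation is that every fold keeps roughly the same class proportions as the full dataset. The sketch below shows one simple way to build such folds with the standard library; the class counts are invented for illustration, the function name is hypothetical, and the study itself may rely on a library routine rather than this procedure.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Assign sample indices to k folds so each class is spread evenly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for j, i in enumerate(indices):
            folds[j % k].append(i)        # deal each class out round-robin
    return folds

# Hypothetical class counts, chosen only to make the example easy to check.
labels = (["Background"] * 40 + ["Bebop"] * 80 + ["AR"] * 60 + ["Phantom"] * 20)
folds = stratified_folds(labels, k=10)

# Fold f is the held-out test set; the remaining nine folds form the training set.
test_idx = folds[0]
train_idx = [i for f in folds[1:] for i in f]
```

Each sample appears in exactly one test fold, so averaging the ten fold scores uses every sample once for evaluation.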

5. Experimental Results and Analysis

5.1. Model Index Evaluation

This study evaluates a four-class task on the public DroneRF dataset, with the categories Background, Bebop, AR, and Phantom. The evaluation metrics are precision, recall, F1-score, and overall accuracy. The model was trained with stratified 10-fold cross-validation, and the reported result is the average over the 10 folds. First, the baseline model and the proposed MD-Net model were compared: the baseline achieved a classification accuracy of 80.31%, while MD-Net achieved 85.58%, an improvement of 5.27 percentage points. The four-class results of the MD-Net model are shown in Table 2. The Background category shows stable performance (Precision = 1, Recall ≈ 1), indicating that background signals are highly separable in the RF data. The Phantom category is the most difficult to identify for all models. The main cause is the extreme class imbalance in the original dataset acquisition [15] (Bebop was recorded in 4 flight modes but Phantom in only 1), which induces a preference for the majority classes during training. In addition, Phantom’s communication protocol and waveform features are closer to the background interference components. To mitigate the imbalance, future work could incorporate a class-weighted loss that assigns higher penalties to misclassified minority-class samples.

5.2. Model Efficiency Evaluation

To validate the engineering deployment potential of the MD-Net model and comprehensively assess the performance gains and architectural overhead introduced by the dual-branch network and adaptive enhancement module, this section systematically compares the baseline model (single-branch MLP) with the MD-Net model across four key dimensions: parameter count, computational complexity, inference latency, and memory usage. All efficiency metrics were evaluated under the same experimental environment as previously described to ensure comparability with the classification performance metrics. The comparison results are shown in Table 3.

The cost-effectiveness analysis shows that MD-Net has a total of only 0.3214 million parameters, a compact architecture that keeps storage overhead low. Although MD-Net adds both an Adaptive Time-Frequency Masking augmentation mechanism and a dual-branch recognition structure to the baseline model, its parameter-efficient, compact layers limit the growth in model size, so the absolute increases in FLOPs and inference latency remain small. In resource-constrained edge computing scenarios, this slight computational overhead has little impact on overall system power consumption and real-time performance, while average recognition accuracy improves by 5.27 percentage points (from 80.31% to 85.58%). Taken together, the efficiency metrics and classification results show that MD-Net improves feature extraction capability and discrimination stability in complex interference environments at a lightweight architectural cost, a favorable balance between performance and efficiency. These results indicate that the proposed model is cost-effective for UAV RF monitoring systems with limited computational resources and overcomes the performance limitations that simple models face in complex-background recognition tasks.

It should be noted that the efficiency metrics for all models reported in Table 3 were evaluated on a standard computing platform equipped with an Intel Core CPU, without relying on GPU acceleration. In realistic engineering deployment scenarios, the target hardware for UAV monitoring systems typically consists of resource-constrained miniature edge devices (such as the NVIDIA Jetson Nano or a Raspberry Pi 4 without a dedicated GPU). The lightweight architectural characteristics of MD-Net, along with its low memory overhead of 3.33 MB, enable it to align perfectly with the storage and computational constraints of low-power edge nodes, effectively circumventing edge computing bottlenecks. Furthermore, in a pure CPU environment without GPU acceleration, MD-Net achieves a single-sample inference time of 27.32 ms, meaning the system can process approximately 36 radio frequency (RF) signal frames per second. Since UAVs continuously broadcast RF signals during flight operations, this low latency provides ample system reaction time to trigger air defense alerts or activate countermeasures, successfully meeting the requirements of practical, real-time UAV monitoring tasks.

Furthermore, it is essential to discuss the error tolerance and practical usefulness of MD-Net in real-world UAV monitoring scenarios. Although the average recognition accuracy of 85.58% implies a misclassification rate of approximately 14%, this metric describes a single rapid inference on an extremely short RF segment (e.g., 0.25 s). Real-world detection and countermeasure systems include error-tolerance mechanisms to handle instantaneous false alarms caused by complex multipath fading and burst noise. Practical deployments rarely rely on a single short-term prediction for a final operational decision; instead, they use continuous monitoring with a time-window majority voting strategy. With MD-Net’s low single-step inference latency (27.32 ms), detection equipment can perform dozens of inferences per second, and aggregating these single-frame decisions over time smooths out random misjudgments caused by isolated noise. Provided that per-frame errors are not strongly correlated, system-level alerting reliability can therefore be considerably higher than the single-frame accuracy. Moreover, the usefulness of an RF identification method cannot be separated from hardware constraints. On portable spectrum detection nodes or UAV-borne defense platforms, stringent power and storage limits make high-complexity large models impractical. With its compact dual-branch architecture (only 0.32 M parameters), MD-Net keeps front-end computational overhead low while remaining sensitive to non-stationary underlying protocol features. 
This trade of minimal computational cost for high-frequency, stable early warning in complex electromagnetic environments supports the practical value of the method in real-world engineering, providing an efficient and reliable recognition solution for lightweight RF sensing applications.
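A time-window majority vote of the kind described above can be sketched in a few lines. The window length of 9 frames, the function names, and the example label stream are illustrative assumptions. The second function gives the probability that more than half of n frames are classified correctly under the strong simplifying assumption that frame errors are independent, which real RF interference may violate.

```python
import math
from collections import Counter, deque

def majority_vote_stream(predictions, window=9):
    """Smooth per-frame labels with a sliding-window majority vote."""
    recent = deque(maxlen=window)          # keeps only the latest frames
    smoothed = []
    for label in predictions:
        recent.append(label)
        smoothed.append(Counter(recent).most_common(1)[0][0])
    return smoothed

def prob_majority_correct(p, n):
    """P(more than n/2 of n frames are correct), assuming independent
    frames that are each correct with probability p."""
    need = n // 2 + 1
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(need, n + 1))

# One isolated misclassification in a run of otherwise consistent frames.
frames = ["Bebop"] * 4 + ["AR"] + ["Bebop"] * 4
smoothed = majority_vote_stream(frames, window=5)

frames_per_second = 1000 / 27.32           # about 36 inferences per second
```

In this example the single "AR" frame never changes the voted label, which is the smoothing effect the paragraph refers to. A strict majority for the correct class is a sufficient, not a necessary, condition for the vote to be correct in a multi-class setting, so the binomial expression is a conservative estimate under the independence assumption.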

5.3. Model Comparison Experiment

To further verify the effectiveness of the proposed MD-Net model for multi-category identification of UAV RF signals, this section compares its performance on the DroneRF dataset with that of existing methods: the MLP model, the CNN model [22], the LSTM model [22], the RandomForest model, the Transformer model, the DNN model [14], and the 1D-CNN model [21]. The comparison focuses on classification accuracy for the UAV four-class task, while also reporting computational cost and inference time to quantify each model’s overall trade-offs. The results are shown in Table 4.

As shown in Table 4, the MD-Net model achieves the best accuracy in the UAV four-class task, 85.58%, an improvement of 0.13 percentage points over 1D-CNN, 1.06 percentage points over DNN, and 1.11 percentage points over the Transformer model. Its F1-score of 85.00% is also higher than that of the other methods, indicating better class balance and reduced impact of noise interference, complex backgrounds, and inter-class confusion. Although the Transformer model was made extremely lightweight (only 0.07 M parameters) to accommodate computational constraints, its computational complexity (6.17 M FLOPs) remains higher than that of MD-Net (4.35 M FLOPs) because of the quadratic complexity of the self-attention mechanism. More importantly, this parameter compression sharply reduces the Transformer’s F1-score (79.01%) and recall (76.55%), producing a pronounced tendency toward missed detections. Compared with 1D-CNN, MD-Net’s absolute improvement in per-sample accuracy (0.13 percentage points) is small; its core advantage lies instead in combining this accuracy with low inference latency (27.32 ms). In practical deployment, this low latency makes a sliding-window voting strategy feasible: the system can filter out occasional misclassifications by aggregating frequent decisions, meeting the requirements of continuous system-level early warning with minimal computational overhead. Among these models, the DNN model obtains the highest precision (91.08%), which is likely related to its architecture. DNN is a traditional deep fully connected network. 
DNN can accurately classify only high-confidence samples with “significant static features and no ambiguity”. When such samples are assigned to the target category, the probability that they truly belong to it is very high; that is, the proportion of TP among positive predictions is high. However, for samples with “ambiguous static features that depend on temporal information for distinction”, DNN cannot identify them effectively and excludes them from the target category, thereby achieving a higher precision. The cost is that many true target-category samples are missed because of unclear features; that is, false negatives (FN) increase and recall decreases. As a result, its overall accuracy is lower than that of the MD-Net model. In addition, the 10-fold cross-validation results show that the baseline model’s accuracy fluctuates widely (80.31% ± 6.37%), whereas MD-Net remains stable at 85.58% ± 0.59%. This low fold-to-fold variability demonstrates the robustness of the dual-branch architecture in handling complex environmental interference and supports its practical deployment in real electromagnetic environments.

Beyond performance metrics such as accuracy, assessing the practical value of an RF identification algorithm requires balancing feature fidelity against hardware resource constraints. To clarify the deployment trade-offs for readers focused on engineering implementation, Table 4 also reports the computational cost and inference time of each model. Some complex network architectures have large parameter counts and high inference latency, which severely limits their real-time capability on resource-constrained edge nodes. In contrast, the proposed MD-Net offers a favorable overall trade-off. Although the dual-branch structure incurs slightly more computation than very simple baselines such as DNN or MLP, MD-Net maintains a compact footprint of 0.32 M parameters and 4.35 MFLOPs. More importantly, it achieves the highest classification accuracy while keeping the inference time at 27.32 ms. Trading a small, acceptable computational cost for more accurate monitoring is what makes MD-Net useful in engineering practice. Together with the generalization results on the DroneRFa dataset presented in Section 5.5, these results support the overall performance advantage of MD-Net.

The RF signals of UAVs exhibit clear temporal and pattern characteristics. MD-Net captures long-range dependencies in the signal features through the LSTM module and combines them with the multidimensional representation provided by the feature extraction layers, which better supports the classification of UAV types. Compared with traditional MLP and DNN models, MD-Net’s main advantage is its ability to model sequence features. Compared with the 1D-CNN model, MD-Net not only extracts local features but also models dynamic changes along the signal sequence, improving overall classification performance.

Furthermore, based on the 50-epoch extended training experiments introduced in Section 4.4 (Figure 8 and Figure 9), we compared the fully converged models. The extended training curves indicate that, although the baseline model reaches a stable, converged state after prolonged training, its validation accuracy plateaus at around 82.14%, making further improvement difficult. In contrast, MD-Net maintains an accuracy of approximately 85.82% throughout the 50-epoch training period. This confirms that the number of epochs does not alter the comparative conclusion, and indicates that the performance gap between the two models arises from the limited expressive capacity of the single-branch structure when handling complex non-stationary features. Thanks to its architectural design, MD-Net achieves robust recognition performance against complex background interference while maintaining low computational overhead.

5.4. Model Ablation Experiment

This section presents ablation experiments that examine the influence of each component of the MD-Net model on signal detection performance. Taking the original MLP classifier as the baseline model (Baseline), the Adaptive Time-Frequency Masking module (denoted as A) and the MLP-LSTM dual network structure (denoted as B) are introduced successively. The experimental results are shown in Table 5.

The experimental results show that when all modules are integrated, the model accuracy increases from 80.31% at the baseline to 85.58%, an increase of 5.27 percentage points, confirming that the two modules are complementary. The Adaptive Time-Frequency Masking module increases accuracy by 3.78 percentage points relative to the baseline without adding model parameters. The reason is that partially occluding time-frequency features gives the network an “anti-interference training experience”, which enhances its robustness to environmental noise and interference signals and reduces misclassification. Introducing the MLP-LSTM dual-branch architecture increases accuracy by 3.69 percentage points relative to the baseline. Here, the bidirectional LSTM (BiLSTM) path models the dependencies among the feature dimensions of the RF signal, and the MLP path mines high-dimensional nonlinear relationships among the spectral features; their complementarity gives the network a more comprehensive feature representation of the UAV RF signal and thereby improves recognition performance. The combined gain (5.27 points) is smaller than the sum of the individual gains (3.78 + 3.69 = 7.47 points), which suggests that the two modules partly correct the same errors; the full model nevertheless achieves the best accuracy. Overall, both the Adaptive Time-Frequency Masking module and the MLP-LSTM network structure play a key role in improving the model’s performance, and using them together improves classification accuracy while maintaining a lightweight structure and stable training. In conclusion, the ablation experiment demonstrates that the proposed improvement strategy enhances UAV RF signal recognition performance and verifies the effectiveness and reliability of the MD-Net model under weak interference, complex backgrounds, and noisy environments.

5.5. Model Generalization Experiment

To verify the generalization ability of the MD-Net model, a six-class experiment was conducted using six types of signals from the DroneRFa dataset (background reference signals and five types of DJI drone signals, covering multiple frequency bands and real environmental interference), and the performance of the baseline MLP model and the MD-Net model was compared. The results are shown in Table 6. The MD-Net model reached a classification accuracy of 83.15% on the new dataset, 2.23 percentage points higher than the MLP model's 80.92%. Precision, recall, and F1-score also improved, each by roughly 2 percentage points.

In addition, to validate the model's anti-interference capability under strictly controlled conditions, we injected Gaussian white noise at graded levels into the test data. The model maintains stable accuracy across a wide SNR (signal-to-noise ratio) range: 85.58% at SNR = 40 dB, 84.95% at 30 dB, 85.69% at 20 dB, 85.52% at 10 dB, and 85.48% without added noise. Even under extreme interference at SNR = −40 dB, the accuracy remains at 84.88%. This consistent performance across controlled SNR levels indicates that the Adaptive Time-Frequency Masking module provides strong robustness against interference, while the MLP-LSTM dual-branch structure enables effective multidimensional feature extraction. As a result, the model can cope with an expanded set of UAV models, frequency band changes, and real-world environmental interference, showing good generalization performance and providing support for practical engineering applications in UAV monitoring scenarios.
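For readers who wish to reproduce a controlled-SNR test, the following is a minimal sketch of noise injection at a target SNR. It assumes SNR is defined as the ratio of mean signal power to mean noise power, which is the usual convention but is not stated explicitly here; the function names `add_awgn` and `measured_snr_db` are hypothetical. Under this definition, SNR = −40 dB means the noise power is 10^4 times the signal power.

```python
import math
import random


def add_awgn(signal, snr_db, rng=None):
    """Add zero-mean Gaussian white noise so that the result has the target SNR (dB).

    Assumption: SNR = 10 * log10(P_signal / P_noise), with P the mean power.
    """
    rng = rng or random.Random(0)
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR of `noisy` relative to `clean`, in dB."""
    p_signal = sum(c * c for c in clean) / len(clean)
    p_noise = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(p_signal / p_noise)


# Toy signal: a sampled sine wave standing in for an RF segment.
clean = [math.sin(2 * math.pi * 0.01 * n) for n in range(20000)]
for target in (40, 30, 20, 10, -40):
    noisy = add_awgn(clean, target)
    print(target, round(measured_snr_db(clean, noisy), 2))
```

Measuring the SNR after injection, as the loop does, is a quick way to confirm that the noise level actually applied matches the level reported.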

It is worth noting that the slight drop in accuracy when moving from the 4-class DroneRF dataset (85.58%) to the 6-class DroneRFa dataset (83.15%) is consistent with a pattern widely reported in the machine learning literature. As the number of target categories grows, maintaining high accuracy becomes harder because the decision boundaries become more complex and the classes become harder to separate. Against this backdrop, the ultra-lightweight MD-Net (only 0.32 M parameters) achieves 83.15% accuracy on a 6-class task with complex real-world interference, which is competitive for a model of this size. This does, however, raise the question of how useful MD-Net is in scenarios that require classifying a very large number of UAV categories.

In large-scale deployments, MD-Net's core value lies not in performing fine-grained classification over many categories on its own, but in serving as an agile front-end early-warning trigger within a hierarchical recognition pipeline. It is well suited to high-frequency, coarse-grained tasks under severe computational constraints. For operations that require distinguishing among dozens of specific UAV models, the system would benefit from a heavier, higher-fidelity, fine-tuned model. Under a cloud-edge collaborative architecture, the lightweight MD-Net can monitor the spectrum continuously and provide initial detection at the edge. Once a potential threat is detected, it can activate the heavier model, either on a local server or in the cloud, to complete downstream fine-grained identification. This tiered strategy balances low-power continuous monitoring against high-fidelity multi-category classification.
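The tiered strategy can be sketched as a small dispatch routine. Everything here is illustrative: `tiered_classify`, `Detection`, the `min_conf` threshold, and the two model callables are hypothetical names, and the stubs stand in for a real edge model and a real fine-grained model. The only detail taken from the paper is the existence of a "Background" class (Table 2).

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class Detection:
    coarse_label: str                  # output of the lightweight edge model
    confidence: float
    fine_label: Optional[str] = None   # filled in only if the heavy model ran


def tiered_classify(
    features: Sequence[float],
    edge_model: Callable[[Sequence[float]], Tuple[str, float]],
    heavy_model: Callable[[Sequence[float]], str],
    background: str = "Background",
    min_conf: float = 0.5,             # hypothetical escalation threshold
) -> Detection:
    """Run the cheap model first; escalate only confident non-background detections."""
    label, conf = edge_model(features)
    detection = Detection(label, conf)
    if label != background and conf >= min_conf:
        detection.fine_label = heavy_model(features)
    return detection


# Stub models for illustration only.
def stub_edge(x):
    return ("Background", 0.99) if sum(x) == 0 else ("Phantom", 0.80)


def stub_heavy(x):
    return "fine-grained-model-id"


quiet = tiered_classify([0.0, 0.0], stub_edge, stub_heavy)
alert = tiered_classify([0.3, 0.7], stub_edge, stub_heavy)
```

In this arrangement the heavy model is invoked only for the small fraction of segments that the edge model flags, which is what keeps the average compute cost close to that of the lightweight model.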

6. Conclusions

This paper addresses the vulnerability of UAV RF signal recognition to noise interference in multi-class settings, where background complexity is high and temporal patterns vary considerably, and proposes a four-class UAV classification method based on MD-Net. The research aims to improve the stability and anti-interference ability of UAV RF signal recognition under complex conditions and to provide feasible technical support for practical UAV monitoring systems. Systematic experiments on the DroneRF dataset lead to the following main conclusions:

The Adaptive Time-Frequency Masking module increases the diversity of training samples without adding model parameters, thereby improving the model's anti-interference performance in complex interference environments. Experiments show that this module raises average accuracy by 3.78 percentage points and significantly reduces performance fluctuations across cross-validation folds.
The MLP-LSTM dual-branch structure captures both the static amplitude characteristics and the time-series dependencies of RF signals, which strengthens the feature representation and raises model accuracy by 3.69 percentage points.
When the two modules are integrated, the average accuracy of the model reaches 85.58%, compared with 80.31% for the baseline model. This indicates that the two modules are complementary and can improve the stability of RF signal identification in real-world interference environments.

The results show that the proposed MD-Net model is robust and reliable under strong interference, multiple backgrounds, and non-stationary signals. However, this study has several limitations. The dataset covers only a limited set of scenarios, so the model's adaptability in real dynamic environments remains to be verified. The Adaptive Time-Frequency Masking strategy is effective, but its parameters are not yet optimized adaptively. Challenges also remain for real-time system integration and physical edge deployment of the model. Although Section 5.2 demonstrated the lightweight advantages of MD-Net using hardware-independent proxy metrics, such as parameter count (approximately 0.32 M), computational complexity (4.35 MFLOPs), and peak memory usage (3.33 MB), a comprehensive evaluation on actual resource-constrained hardware platforms (such as embedded NPUs or microcontrollers) has yet to be conducted. In practical mechatronic system integration, physical constraints such as cross-layer data transmission, hardware power allocation, heat dissipation, and end-to-end system latency still pose substantial engineering challenges.

Future research can pursue several directions to improve the engineering applicability of UAV RF recognition technology: large-scale, multi-scenario RF data collection; learning-based adaptive augmentation mechanisms; conversion of the model into edge-executable formats (such as TensorRT/ONNX); and benchmarking of power consumption and real-time performance on real embedded devices. These efforts will be crucial for advancing the practical deployment of MD-Net in resource-constrained environments.

Author Contributions

Conceptualization, M.H. and L.D.; methodology, L.D.; software, L.D.; validation, M.H., L.D. and Q.S.; formal analysis, M.H. and Q.S.; investigation, L.D.; resources, M.H.; data curation, M.H. and L.D.; writing—original draft preparation, L.D.; writing—review and editing, M.H., L.D. and Q.S.; visualization, M.H.; supervision, Q.S.; project administration, M.H.; funding acquisition, M.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Key Laboratory on Electromagnetic Environment Effects Foundation (6142205240201).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used to support the research findings in this study are from publicly available datasets. The access URL for the DroneRF dataset is: https://data.mendeley.com/datasets/f4c2b4n755/1, accessed on 5 March 2025. The access URL for the DroneRFa dataset is: https://www.scidb.cn/detail?dataSetId=34f0a91e8a544904998b8fdc44477380, accessed on 1 May 2025.

Conflicts of Interest

The authors declare no conflicts of interest.


Figure 1. Detection ranges of different UAV identification methods.

Figure 2. Structure of the MD-Net model.

Figure 3. Adaptive Time-Frequency Masking Module.

Figure 4. MLP-LSTM Dual-Branch Network. The green special symbol indicates feature fusion; arrows indicate information flow; colors are used to distinguish different functional modules.

Figure 5. Classification structure diagram of DroneRF dataset.

Figure 6. Loss and accuracy curves of the baseline model.

Figure 7. Loss and accuracy curves of the MD-Net model.

Figure 8. Loss and accuracy curves of the baseline model (epoch = 50).

Figure 9. Loss and accuracy curves of the MD-Net model (epoch = 50).

Figure 10. Model confusion matrix diagram.

Figure 11. Model ROC curve graph. The black dashed diagonal line represents the random-classifier baseline.

Table 1. Experimental parameter settings.

Parameter	Numerical Value
Epochs	30
Batch Size	10
Dropout Rate	0.2
Optimizer	Adam
Loss Function	Categorical Cross-Entropy
Learning Rate	0.001
MLP Hidden Layers	[256, 128, 64]
LSTM Units	64
Time Mask Param	10
Frequency Mask Param	6
Augmentation Noise SNR	40 dB

Table 2. MD-Net model classification results.

Type	F1-Score	Recall	Precision
Background	1.00	1.00	1.00
Bebop	0.84	0.98	0.73
AR	0.87	0.78	0.98
Phantom	0.52	0.36	0.97
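As a consistency check on Table 2, the F1-score can be recomputed from precision and recall as their harmonic mean, F1 = 2PR / (P + R). The sketch below uses only the rounded values printed in the table, so the recomputed figures may differ from the reported ones in the last digit. It also makes the Phantom result easier to read: a recall of 0.36 pulls F1 down to about 0.52 even though precision is 0.97.

```python
# (precision, recall, reported F1) copied from Table 2.
TABLE2 = {
    "Background": (1.00, 1.00, 1.00),
    "Bebop": (0.73, 0.98, 0.84),
    "AR": (0.98, 0.78, 0.87),
    "Phantom": (0.97, 0.36, 0.52),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


for name, (p, r, reported) in TABLE2.items():
    recomputed = f1_score(p, r)
    # Inputs are rounded to two decimals, so allow a small tolerance.
    print(f"{name}: recomputed {recomputed:.4f}, reported {reported:.2f}")
```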

Table 3. Comparison of efficiency indicators between baseline model and MD-Net model.

Model	Params (M)	FLOPs (M)	Inference Time (ms)	Inference Peak Memory Usage (MB)
MLP	0.30	0.59	23.20	2.14
MD-Net	0.32	4.35	27.32	3.33

Table 4. Model performance comparison.

Model Name	Accuracy	F1-Score	Recall	Precision	Params	FLOPs	Inference Time
MLP *	80.31%	79.56%	80.43%	84.02%	0.30 M	0.59 M	23.20 ms
Existing CNN	61.80%	59.00%	60.00%	86.00%	2.03 M	10.32 M	26.30 ms
LSTM	62.90%	60.00%	62.00%	84.00%	–	–	–
RandomForest	81.37%	73.80%	72.38%	89.05%	–	–	–
Transformer	84.47%	79.01%	76.55%	90.03%	0.07 M	6.17 M	25.78 ms
DNN	84.52%	78.81%	76.43%	91.08%	0.16 M	0.32 M	24.70 ms
1D-CNN	85.45%	84.68%	–	–	–	–	–
MD-Net ⋆	85.58%	85.00%	85.49%	89.02%	0.32 M	4.35 M	27.32 ms

Note: * indicates the baseline model, ⋆ indicates the improved model, and the others are reference models.

Table 5. Module ablation experiment.

Baseline	A	B	Accuracy	F1-Score	Recall	Precision
🗸	×	×	80.31%	79.56%	80.43%	84.02%
🗸	🗸	×	84.09%	83.00%	83.86%	88.00%
🗸	×	🗸	84.00%	82.82%	84.05%	87.73%
🗸	🗸	🗸	85.58%	85.00%	85.49%	89.02%

Note: 🗸 indicates that the corresponding module is included, and × indicates that the corresponding module is not included.
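The gains quoted in the ablation discussion follow directly from the accuracy column of Table 5, as the short calculation below shows. The variable names are illustrative. One observation from the arithmetic: the combined gain (5.27 points) is smaller than the sum of the two individual gains (7.47 points), which suggests that the improvements contributed by the two modules partly overlap rather than simply adding up.

```python
# Accuracy (%) from Table 5.
ACCURACY = {"baseline": 80.31, "A": 84.09, "B": 84.00, "A+B": 85.58}

gain_a = round(ACCURACY["A"] - ACCURACY["baseline"], 2)     # masking only
gain_b = round(ACCURACY["B"] - ACCURACY["baseline"], 2)     # dual-branch only
gain_ab = round(ACCURACY["A+B"] - ACCURACY["baseline"], 2)  # both modules
sum_of_individual = round(gain_a + gain_b, 2)

print(f"A: +{gain_a}, B: +{gain_b}, A+B: +{gain_ab}, A alone + B alone: {sum_of_individual}")
```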

Table 6. Model generalization results.

Dataset	Model	Accuracy	F1-Score	Recall	Precision
DroneRF	MLP	80.31%	79.56%	80.43%	84.02%
DroneRF	MD-Net	85.58%	85.00%	85.49%	89.02%
DroneRFa	MLP	80.92%	80.88%	80.92%	81.32%
DroneRFa	MD-Net	83.15%	83.06%	83.15%	83.20%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
