1. Introduction
In the current era of highly developed digitalization, wireless communication technologies have been widely applied in various fields. From the interconnection of smart home devices in the civilian domain to the transmission of critical instructions in the military domain, the security of wireless communication is of utmost importance [1,2]. For example, during flight an aircraft must maintain close contact with the ground station through wireless communication, transmitting key information such as position, speed, altitude, and flight attitude in real time; the accurate and secure transmission of this information is crucial for flight safety. In traditional encryption-based authentication, the security of the encryption algorithm depends on the confidentiality of the key: once the key is leaked, the entire communication system faces serious risks. Moreover, the encryption and decryption processes are computationally intensive, which can severely degrade the performance of resource-constrained devices [3,4]. Specific emitter identification (SEI) is a non-cryptographic authentication method based on device physical characteristics [5]. Owing to hardware differences such as I/Q imbalance [6], power amplifier non-linearity [7], and frequency offset [8], each transmitter leaves unique and hard-to-forge radio frequency fingerprints (RFFs) in its signal. Compared with cryptography-based methods, SEI offers significant advantages and has great application value in fields such as cognitive radio, spectrum monitoring, and electronic countermeasures [9,10,11].
Traditional methods based on feature engineering rely heavily on expert knowledge and human experience. They manually extract features such as signal amplitude, phase, and frequency [12,13,14] and design classifiers to identify different transmitters. With the continuous increase in the number of wireless devices, selecting effective fingerprint features becomes increasingly difficult, making it hard to adapt to the rapidly changing electromagnetic environment.
In recent years, deep learning (DL) [15] has achieved remarkable success in fields such as computer vision [16] and natural language processing [17]. With its outstanding automatic feature extraction capability, deep learning can mine the subtle feature patterns hidden in radio frequency signals from large amounts of data, and DL methods have therefore been applied to SEI. Some methods use the raw In-phase and Quadrature (IQ) data as the input of convolutional neural networks (CNNs) [18,19,20] and recurrent neural networks (RNNs) [21]. Merchant et al. [22] utilized a CNN to extract features from IQ signals for identifying ZigBee devices; the experimental results demonstrated the method's robustness across various signal-to-noise ratios (SNRs). Sankhe et al. [23] proposed the ORACLE model, composed of two convolutional layers and two fully connected layers, to extract and fuse features from I and Q signals. Yu et al. [24] introduced a robust method based on a multisampling convolutional neural network (MSCNN), employing multiple downsampling transformations for multiscale feature extraction. Al-Shawabka et al. [25] fed IQ signals into the RFF-LSTM model, which consists of three stacked Long Short-Term Memory (LSTM) layers and one fully connected layer. In [26], the authors proposed the ConvRNN model, which combines convolutional layers with RNN layers to simultaneously extract spatial and temporal features from IQ signals. A lightweight Transformer-based GLFormer method was presented in [27]; it automatically extracts long-term dependencies from IQ signals. Wang et al. [28] proposed a complex-valued neural network (CVNN), comprising nine complex-valued convolutional layers and two fully connected layers, along with a model compression method that reduces model complexity while maintaining accuracy.
Other methods adopt the bispectrum transform [29], time–frequency transform [30], Hilbert–Huang transform [31], constellation diagrams [32], etc., as the input to extract features in the transform domain. Shen et al. [33] used time–frequency spectrograms obtained through the Short-Time Fourier Transform (STFT) as input to a Transformer model. To overcome the impact of wireless channels, Shen et al. [34] proposed a channel-independent spectrogram and a data augmentation method to enhance the system's robustness against channel variations. In [35], the authors processed the transmitter signals through empirical mode decomposition (EMD) and the Hilbert transform (HT) into graph tensors and applied graph neural networks (GNNs) for graph classification. Peng et al. [32] introduced an SEI method based on a differential constellation trace figure (DCTF), designing a CNN to recognize the DCTF features of different devices. Subsequently, Peng et al. [8] proposed the heat constellation trace figure (HCTF) method. Lin et al. [36] proposed converting the IQ signal into a contour stellar image (CSI) with statistical significance and then using a CNN model for classification.
However, DL-based SEI methods are data-driven and rely heavily on large amounts of labeled samples. Labeling massive numbers of samples entails expensive manual overhead, since professional knowledge is needed to accurately associate each transmitter signal with its corresponding transmitter category. In many practical non-cooperative electromagnetic environments, only unlabeled samples can be obtained, and it is difficult to acquire sufficient labeled data to adequately train deep neural networks (DNNs), which often leads to model overfitting.
To fully leverage unlabeled data, self-supervised learning (SSL) has been proposed. Unlike traditional supervised learning, SSL does not rely on a large number of labeled samples. Instead, it designs pre-training tasks that enable the model to generate high-quality feature representations, thereby enhancing the performance of downstream tasks. Contrastive and generative methods are the two key paradigms in SSL pre-training. In contrastive methods, feature representations are learned by constructing positive and negative samples. Specifically, for a given input sample, two augmented views are generated and regarded as a positive pair: although the views may differ in form due to data augmentation, they share the same semantic content. Augmented views derived from different input samples are treated as negative samples. The model minimizes the distance between positive pairs and maximizes the distance between positive and negative samples, thereby learning discriminative feature representations. Wu et al. [37] devised positive sample pairs via random slicing for contrastive learning, employing the pre-trained model as an initialization in the subsequent fine-tuning stage. Liu et al. [38] utilized BYOL [39] as the backbone for self-supervised contrastive learning, pre-training an RFF extractor and introducing adversarial augmentation to enhance the self-supervised learning process; they then fine-tuned the extractor and classifier with a limited set of labeled samples. Inspired by SimCLR [40], Liu et al. [41] converted signals into constellation trace figures and adopted image augmentation techniques from the computer vision domain to generate extensive positive and negative sample pairs for contrastive learning. In contrast to contrastive methods, which emphasize distinguishing between sample pairs, generative methods focus on learning to reconstruct or predict partially observed data. By training the model to reconstruct input data from latent representations or to generate new samples consistent with the real data distribution, these methods capture the intrinsic structures and features of the data. Xie et al. [42] utilized the Hilbert time–frequency spectrum as the input to an auto-encoder for pre-training, followed by fine-tuning the pre-trained auto-encoder with a small amount of labeled data. Huang et al. [43] proposed a pre-training method based on a symmetric masked auto-encoder (MAE), which predicts the masked segments of signals. To achieve better feature extraction and reduce the computational cost of symmetric decoders, Yao et al. [44] introduced an asymmetric masked auto-encoder (AMAE) for pre-training the RFF extractor. Li et al. [45] proposed a CML framework for few-shot SEI, which decouples masked reconstruction and contrastive learning into a dual-branch architecture comprising an encoder–decoder branch for signal reconstruction and a momentum encoder branch for instance discrimination.
As the preceding analysis indicates, contrastive learning demonstrates remarkable advantages in feature extraction by learning discriminative representations through the construction of positive–negative sample pairs, while generative methods effectively capture latent distribution characteristics via data reconstruction. Contrastive methods excel at inter-sample discriminability but overlook critical fine-grained local patterns, whereas generative methods capture signal structures but lack explicit mechanisms for inter-class separation. Considering this, we bridge the gap by combining masked reconstruction with contrastive learning, which reinforces device identity discrimination through feature aggregation and separation and achieves representation learning from local details to global discriminability. Distinct from the dual-branch architecture in [45], the proposed method employs a single-branch asymmetric auto-encoder incorporating a lightweight single-layer convolutional decoder. This design significantly reduces parameter and computational costs while retaining effective signal reconstruction capability. Furthermore, deviating from conventional random masking, we introduce a novel complementary masking data augmentation strategy that generates complementary masked views of identical signals, compelling the model to learn robust feature representations by correlating information from non-overlapping signal segments. To mitigate limitations in generic feature extraction, we integrate channel squeeze-and-excitation residual blocks within the encoder; these blocks dynamically recalibrate feature responses to enhance focus on discriminative radio frequency fingerprint regions. Additionally, a lightweight feature projection head compresses features into a compact, discriminative latent space. We further establish dual optimization objectives: unlike traditional methods that reconstruct the entire signal, the reconstruction task specifically targets the complementary masked regions, while the contrastive task explicitly optimizes inter-device discrimination. This dual focus on local detail reconstruction and global separability effectively overcomes the limitations of single-paradigm approaches.
To this end, we propose a novel contrastive asymmetric masked learning-based SEI (CAML-SEI) method, aiming to integrate the synergistic advantages of contrastive and generative learning to build robust and discriminative RFF feature representations. Specifically, we first employ a complementary masking data augmentation technique to construct positive and negative sample pairs for RFF feature representation contrastive learning. Subsequently, we design channel squeeze-and-excitation residual blocks (CSERBlocks) to build an encoder architecture that maps sample pairs into the feature space. Additionally, a lightweight feature projection head is introduced to eliminate redundant information while enhancing feature discriminability. Finally, we jointly optimize the network by combining masked signal reconstruction loss and contrastive loss. The main contributions of this paper are summarized as follows:
An effective CAML-SEI method is proposed for SSL-SEI. This method enhances the model’s ability to extract RFF features and effectively overcomes the limitation of limited labeled samples.
A novel asymmetric auto-encoder architecture is designed to implement CAML-SEI, comprising a CSERBlock-based encoder, a lightweight decoder, and a lightweight feature projection head.
The pre-trained encoder is fine-tuned using only a limited number of labeled samples, and a simple classifier containing only one fully connected layer is trained to classify the RFF features for SEI.
The proposed CAML-SEI method is evaluated on real-world ADS-B and Wi-Fi datasets. Experimental results show that the recognition performance of the proposed method is superior to that of other comparison methods.
3. Proposed Method
The proposed CAML-SEI method is illustrated in Figure 1. The framework incorporates a masked signal reconstruction task to enable the model to learn local structural characteristics of signals, while simultaneously employing a contrastive learning task to enhance intra-class compactness and inter-class separability among samples from similar devices. The joint optimization of these two objectives enables the model to simultaneously capture fine-grained features and establish globally discriminative representations. Furthermore, the CSERBlock is specifically designed to construct the feature encoder.
During the pre-training stage, only unlabeled samples are required to learn generalized signal characteristics. During the fine-tuning stage, the encoder parameters obtained from self-supervised pre-training are used to initialize the encoder, and the encoder and classifier undergo end-to-end training. This ensures the encoder can both learn generalized RFF features from unlabeled data and adapt to the specific task through a small number of labeled samples. Extensive experiments validate the enhanced feature learning capability of the proposed method, which achieves superior recognition performance compared with other methods.
3.1. Data Augmentation
Traditional masked signal reconstruction methods typically train by applying a single random mask to the input signal, while contrastive learning paradigms rely on similarity learning between paired augmented samples. The data augmentation in this paper consists of two steps: masking and adding Gaussian white noise. To achieve synergistic optimization of the two tasks, a complementary masking augmentation strategy is proposed that enables the model to infer complete signal features from different local information, thereby enhancing the comprehensiveness of feature extraction. The strategy generates two augmented samples with complementary mask positions from the original signal $x$ according to a preset masking ratio $\rho$. Specifically, the starting point $p_s$ of the masked interval is randomly selected within the signal length $L$, and the ending point is calculated as $p_e = p_s + \lfloor \rho L \rfloor$, ensuring that the two mask vectors $M_1$ and $M_2$ operate on non-overlapping regions of the signal: $M_1$ sets the interval $[p_s, p_e)$ to zero while preserving the remaining parts, whereas $M_2$ retains this interval and masks out all other regions. This complementary masking facilitates the model in extracting effective features from distinct local segments and establishing cross-region feature associations, thereby effectively mitigating the local feature overfitting that may arise from single-mask training. After complementary masking, Gaussian white noise $n_1$ and $n_2$ is added to the two masked signals, respectively, to generate the augmented samples $\tilde{x}_1$ and $\tilde{x}_2$:

$$\tilde{x}_1 = M_1 \odot x + n_1, \qquad \tilde{x}_2 = M_2 \odot x + n_2,$$

where $\odot$ denotes the Hadamard product. The two complementarily masked signals obtained through this augmentation serve as the inputs to the feature encoder.
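To make the augmentation concrete, the following is a minimal PyTorch sketch. The (batch, 2, L) layout for IQ signals, the default masking ratio, and the noise standard deviation are illustrative assumptions, not the paper's settings.

```python
# Complementary masking augmentation sketch; parameter values are illustrative.
import torch

def complementary_mask_augment(x, mask_ratio=0.5, noise_std=0.01):
    """Return two complementary masked views of x plus the mask vectors."""
    B, _, L = x.shape
    seg = int(mask_ratio * L)                    # masked interval length rho*L
    m1 = torch.ones_like(x)
    for i in range(B):
        ps = torch.randint(0, L - seg + 1, (1,)).item()  # random start p_s
        m1[i, :, ps:ps + seg] = 0.0              # M1 zeros the interval [p_s, p_e)
    m2 = 1.0 - m1                                # M2 masks the complementary region
    n1 = noise_std * torch.randn_like(x)         # Gaussian white noise n1
    n2 = noise_std * torch.randn_like(x)         # Gaussian white noise n2
    x1 = m1 * x + n1                             # x~1 = M1 (Hadamard) x + n1
    x2 = m2 * x + n2                             # x~2 = M2 (Hadamard) x + n2
    return x1, x2, m1, m2
```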
3.2. Network Structure
To enhance the feature extraction capability of the encoder, we design an encoder architecture based on the CSERBlock, as illustrated in Figure 2. The encoder comprises a 1D convolutional layer, seven CSERBlocks, and a fully connected layer. Conv1d (C, K, S) denotes a 1D convolutional layer with output channels C, kernel size K, and stride S. The output channel number C is set to 64 to balance feature richness and computational efficiency: too few channels would limit the model's ability to capture key features, while too many would increase computational overhead and invite overfitting in few-shot scenarios. The kernel size K is set to 5 as a trade-off that accommodates both short-term and long-term feature extraction, and the stride S = 1 maximizes the retention of fingerprint details. These structural hyperparameters remain fixed throughout training; only the convolution kernel weights are adaptively updated through backpropagation. In the CSERBlock, the input features are first transformed into a more compact representation through 1D convolution. In the "excitation" step, weights are generated via activation functions to adaptively recalibrate the feature responses at different positions on the feature maps. This mechanism enables the model to focus on more informative regions, thereby optimizing the quality of feature representations. Simultaneously, residual connections ensure direct gradient flow, mitigating the vanishing gradient problem during the training of deep networks. Consequently, the residual block with channel squeeze-and-excitation enhances critical features while suppressing less significant ones by reweighting the importance of features at each position, improving the model's feature extraction capability. Finally, a max-pooling layer completes the feature extraction process.
Specifically, in the CSERBlock, the channel squeeze-and-excitation function $F_{se}(\cdot)$ is composed of a 1D convolutional layer followed by the sigmoid activation function $\sigma(\cdot)$, which maps inputs to values within the range of 0 to 1. Applying $F_{se}(\cdot)$ to the input feature map $X$ generates the output response $A$ as follows:

$$A = F_{se}(X) = \sigma\big(\mathrm{Conv1d}(X)\big) \quad (13)$$
Each element $a_i$ in $A$ indicates the relative importance of the feature at position $i$. The input feature map $X$ is then adjusted through the weighting mechanism, recalibrating the importance of the features at each position. This enhances the influence of critical features while suppressing non-critical ones as follows:

$$\tilde{X} = A \cdot X \quad (14)$$
where $\cdot$ denotes element-wise multiplication across channels. The excitation mechanism enhances focus on discriminative regions through a spatial attention mechanism that dynamically assigns location-specific weights to the feature maps. Specifically, as defined in (13), a 1D convolutional layer compresses the multi-channel features at each temporal position into a scalar, followed by a sigmoid activation to generate a spatial weight vector. These weights recalibrate the original features via element-wise multiplication: higher weights intensify responses to discriminative features, while lower weights suppress less informative regions. This weighting mechanism is adaptively optimized through backpropagation during training.
The function $F_{res}(\cdot)$ consists of two 1D convolutional layers applied to the input feature map $X$. Its output is added to the recalibrated feature map $\tilde{X}$, forming a residual connection that ensures direct gradient flow and helps mitigate the vanishing gradient problem during the training of deep networks. Finally, a max-pooling layer achieves dimensionality reduction and feature extraction as follows:

$$Y = \mathrm{MaxPool}\big(F_{res}(X) + \tilde{X}\big) \quad (15)$$
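The following is a minimal PyTorch sketch of the CSERBlock and the encoder stack described above. The stated hyperparameters (C = 64, K = 5, S = 1, seven blocks) follow the text; the ReLU placement inside $F_{res}$, the pooling size, the feature dimension, and the signal length are illustrative assumptions.

```python
# CSERBlock and encoder sketch under stated assumptions; not the exact
# published implementation.
import torch.nn as nn

class CSERBlock(nn.Module):
    """Channel squeeze-and-excitation residual block (Eqs. (13)-(15))."""
    def __init__(self, channels=64, kernel_size=5):
        super().__init__()
        pad = kernel_size // 2
        # F_se: squeeze channels to one weight per position, then sigmoid (Eq. (13))
        self.se = nn.Sequential(nn.Conv1d(channels, 1, kernel_size, padding=pad),
                                nn.Sigmoid())
        # F_res: two 1D convolutional layers on the input feature map
        self.res = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=pad),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=pad))
        self.pool = nn.MaxPool1d(2)

    def forward(self, x):
        a = self.se(x)              # position-wise weights A (Eq. (13))
        x_tilde = a * x             # recalibrated features (Eq. (14))
        y = self.res(x) + x_tilde   # residual connection
        return self.pool(y)         # dimensionality reduction (Eq. (15))

class Encoder(nn.Module):
    """Conv1d stem, seven CSERBlocks, and a fully connected layer."""
    def __init__(self, in_ch=2, channels=64, feat_dim=256, sig_len=4096):
        super().__init__()
        self.stem = nn.Conv1d(in_ch, channels, kernel_size=5, stride=1, padding=2)
        self.blocks = nn.Sequential(*[CSERBlock(channels) for _ in range(7)])
        self.fc = nn.Linear(channels * (sig_len // 2**7), feat_dim)

    def forward(self, x):
        h = self.blocks(self.stem(x))
        return self.fc(h.flatten(1))
```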
We adopt an asymmetric encoder–decoder architecture, where the decoder employs only a single 1D convolutional layer to reconstruct the signal. Compared with traditional symmetric encoder–decoder architectures, this asymmetric design significantly simplifies the decoder. Due to its reduced parameter count, the asymmetric decoder generalizes better and effectively mitigates the risk of overfitting. Furthermore, this design enhances flexibility and significantly lowers the model's computational burden while maintaining efficient performance.
In contrastive learning tasks, two complementary masked augmented signals from the same signal may exhibit potential distribution shifts in their encoded features due to differences in semantic information density within masked regions. This asymmetric semantic distribution hinders the contrastive loss function from effectively minimizing the intra-class distance of positive sample pairs, thereby limiting the model’s capability to learn useful representations. Consequently, direct contrastive learning on raw features may encounter significant challenges in efficiently extracting discriminative feature representations. To address this issue, we introduce two lightweight feature projection heads subsequent to the feature encoder. Each projection head comprises merely a single fully connected layer, which employs non-linear projection to compress high-dimensional features into a low-dimensional space, thereby eliminating redundant information while enhancing feature discriminability. This design aims to mitigate distributional discrepancies among contrastive features and facilitate model optimization.
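A minimal sketch of the lightweight decoder and projection head follows. The text specifies a single 1D convolutional layer for the decoder and a single fully connected layer per projection head; how the latent vector is mapped back to a feature map before the convolution is not specified, so the reshape-and-interpolate step below is purely an assumption (it also assumes feat_dim is divisible by the channel count).

```python
# Asymmetric decoder and projection head sketch; the latent-to-feature-map
# reshape is an assumption, not the paper's stated design.
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Lightweight decoder: a single 1D convolutional layer."""
    def __init__(self, channels=64, out_ch=2, sig_len=4096, kernel_size=5):
        super().__init__()
        self.channels, self.sig_len = channels, sig_len
        self.conv = nn.Conv1d(channels, out_ch, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, z):
        h = z.view(z.size(0), self.channels, -1)     # assumed reshape of latent
        h = F.interpolate(h, size=self.sig_len)      # restore signal length
        return self.conv(h)

class ProjectionHead(nn.Module):
    """Single fully connected layer compressing features for contrast."""
    def __init__(self, feat_dim=256, proj_dim=128):
        super().__init__()
        self.fc = nn.Linear(feat_dim, proj_dim)

    def forward(self, z):
        return self.fc(z)
```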
The classifier in the fine-tuning stage comprises a fully connected layer and a dropout layer to mitigate overfitting. Detailed network configurations are presented in Table 1.
3.3. Pre-Training Stage
During the self-supervised pre-training stage, the two augmented masked signals $\tilde{x}_1$ and $\tilde{x}_2$ are fed into the feature encoder to extract the latent fingerprint representations $z_1$ and $z_2$. For the signal reconstruction task, the latent representations are passed to the decoder to produce the reconstructed signals $\hat{x}_1$ and $\hat{x}_2$. To minimize the discrepancy between the reconstructed and original signals, the mean squared error (MSE) is employed as the loss function, calculated exclusively on the masked regions between the decoder predictions and the original signals:

$$\mathcal{L}_{rec} = \frac{1}{B} \sum_{i=1}^{B} \left( \left\| M_2^{(i)} \odot \big(\hat{x}_1^{(i)} - x^{(i)}\big) \right\|_2^2 + \left\| M_1^{(i)} \odot \big(\hat{x}_2^{(i)} - x^{(i)}\big) \right\|_2^2 \right) \quad (16)$$

where $B$ denotes the training batch size. Since $M_1 + M_2 = \mathbf{1}$, the model must fully reconstruct the signal from these two mutually exclusive subsets. This compels the encoder to learn a global feature representation from local fragments and to maintain semantic consistency across fragments. For random masking, due to the existence of overlapping regions, the model might neglect feature extraction from certain regions. The latent fingerprint features $z_1$ and $z_2$ of the two augmented samples are further compressed via the projection heads to obtain $h_1$ and $h_2$. We utilize cosine similarity as the similarity metric; its advantage lies in its robustness to the magnitudes of the feature vectors, enabling the model to focus on learning discriminative semantic features. The cosine similarity between the projected features is computed as follows:

$$\mathrm{sim}(h_1, h_2) = \frac{h_1^{\top} h_2}{\|h_1\| \, \|h_2\|} \quad (17)$$
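The two quantities above translate directly into code. Below is a minimal sketch of the masked reconstruction loss of Eq. (16) and the cosine similarity of Eq. (17); the per-batch averaging convention is assumed.

```python
# Masked MSE (Eq. (16)) and cosine similarity (Eq. (17)) sketches.
import torch.nn.functional as F

def masked_mse_loss(x, x1_hat, x2_hat, m1, m2):
    """MSE restricted to each view's masked region: the region zeroed by M1
    is exactly where M2 = 1, and vice versa."""
    loss1 = ((m2 * (x1_hat - x)) ** 2).sum(dim=(1, 2))  # masked region of view 1
    loss2 = ((m1 * (x2_hat - x)) ** 2).sum(dim=(1, 2))  # masked region of view 2
    return (loss1 + loss2).mean()                       # average over the batch

def cosine_sim(h1, h2):
    """(B, B) matrix of cosine similarities between projected features."""
    return F.normalize(h1, dim=1) @ F.normalize(h2, dim=1).T
```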
To establish a discriminative feature representation space, we further design a contrastive learning-based optimization objective. Specifically, for positive signal pairs (i.e., the two masked augmented signals) derived from the same original signal, the latent features are encouraged to be close to each other while being repelled from the features of negative signals originating from different signals. Formally, given a training batch containing $B$ signals, for each signal $x^{(i)}$ the features of its two augmented versions, $h_1^{(i)}$ and $h_2^{(i)}$, form a positive pair, and the negative set comprises the complementary mask signal features from the other signals within the batch. Based on this construction, the contrastive loss is formulated as follows:

$$\mathcal{L}_{1 \rightarrow 2} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp\big(\mathrm{sim}(h_1^{(i)}, h_2^{(i)})/\tau\big)}{\sum_{j=1}^{B} \exp\big(\mathrm{sim}(h_1^{(i)}, h_2^{(j)})/\tau\big)} \quad (18)$$

where $\tau$ denotes the temperature parameter. Similarly, $\mathcal{L}_{2 \rightarrow 1}$ is defined as

$$\mathcal{L}_{2 \rightarrow 1} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp\big(\mathrm{sim}(h_2^{(i)}, h_1^{(i)})/\tau\big)}{\sum_{j=1}^{B} \exp\big(\mathrm{sim}(h_2^{(i)}, h_1^{(j)})/\tau\big)} \quad (19)$$

The total contrastive loss is defined as

$$\mathcal{L}_{con} = \frac{1}{2}\big(\mathcal{L}_{1 \rightarrow 2} + \mathcal{L}_{2 \rightarrow 1}\big) \quad (20)$$

In this way, the model learns to discriminate between positive signals derived from the same signal and negative signals derived from different signals, thereby improving the discriminative capability of the latent features. The overall learning objective is formulated as a combination of the reconstruction loss $\mathcal{L}_{rec}$ and the contrastive loss $\mathcal{L}_{con}$, defined as

$$\mathcal{L}_{total} = \mathcal{L}_{rec} + \mathcal{L}_{con} \quad (21)$$
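A minimal sketch of the symmetric contrastive objective of Eqs. (18)–(20) follows; the temperature value is an illustrative assumption. The cross-entropy over the similarity matrix, with the diagonal as targets, is exactly the negative log-softmax form of Eqs. (18) and (19).

```python
# Symmetric contrastive loss sketch (Eqs. (18)-(20)); tau is illustrative.
import torch
import torch.nn.functional as F

def contrastive_loss(h1, h2, tau=0.1):
    """Positives: the two complementary views of the same signal (diagonal);
    negatives: complementary views of the other signals in the batch."""
    sim = (F.normalize(h1, dim=1) @ F.normalize(h2, dim=1).T) / tau  # Eq. (17), scaled
    targets = torch.arange(h1.size(0), device=h1.device)
    l_12 = F.cross_entropy(sim, targets)     # Eq. (18): h1 -> h2 direction
    l_21 = F.cross_entropy(sim.T, targets)   # Eq. (19): h2 -> h1 direction
    return 0.5 * (l_12 + l_21)               # Eq. (20)
```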
Since the model is required to recover complete signals from masked inputs, the reconstruction loss helps the model capture the inherent structures and patterns of the signals, enabling it to learn more comprehensive and enriched feature representations. The contrastive loss focuses on optimizing both the agreement between the two augmented views of a signal and the separation between different signals, encouraging the learned features to exhibit desirable instance discriminability.
3.4. Fine-Tuning Stage
After the parameter optimization of the feature encoder in the pre-training stage, the pre-trained encoder weights are transferred as the initial parameters for the fine-tuning stage. This design effectively leverages the general feature representation capabilities acquired through self-supervised pre-training, providing an efficient initial foundation for the subsequent fine-tuning task. The classifier incorporates a fully connected layer. To minimize the discrepancy between the probability distribution $\hat{y}$ predicted by the model and that of the ground-truth labels $y$, the cross-entropy loss function is employed as the optimization objective:

$$\mathcal{L}_{ce} = -\frac{1}{B'} \sum_{i=1}^{B'} \sum_{c=1}^{C} y_c^{(i)} \log \hat{y}_c^{(i)} \quad (22)$$

where $B'$ is the batch size for model training in the fine-tuning stage and $C$ is the number of device classes. The training procedure of CAML-SEI is summarized in Algorithm 1, followed by a compact code sketch of the same two-stage procedure.
Algorithm 1 Training procedure of the CAML-SEI method.

Input: $\mathcal{D}_u$, $\mathcal{D}_l$: unlabeled and labeled datasets, respectively; $E_{pre}$, $E_{ft}$: numbers of pre-training and fine-tuning epochs, respectively; $I_{pre}$, $I_{ft}$: numbers of pre-training and fine-tuning iterations, respectively; $\theta_e$, $\theta_d$, $\theta_c$: parameters of the feature encoder, feature decoder, and classifier, respectively; $\theta_{p1}$, $\theta_{p2}$: parameters of the projection heads; $\eta$: learning rate.
Output: Trained parameters $\theta_e$ and $\theta_c$.

Pre-training Stage:
1: for epoch $= 1, \dots, E_{pre}$ do
2:   for iteration $= 1, \dots, I_{pre}$ do
3:     Sample a batch of training signals $x$ from $\mathcal{D}_u$;
4:     Generate two batches of masked signals $\tilde{x}_1$ and $\tilde{x}_2$ via data augmentation;
5:     Feed the masked signals into the feature encoder: $z_1 = f_{\theta_e}(\tilde{x}_1)$, $z_2 = f_{\theta_e}(\tilde{x}_2)$;
6:     Project the latent features via the heads: $h_1 = g_{\theta_{p1}}(z_1)$, $h_2 = g_{\theta_{p2}}(z_2)$;
7:     Reconstruct the signals using the decoder: $\hat{x}_1 = d_{\theta_d}(z_1)$, $\hat{x}_2 = d_{\theta_d}(z_2)$;
8:     Compute the reconstruction loss $\mathcal{L}_{rec}$ via Equation (16);
9:     Compute the contrastive loss $\mathcal{L}_{con}$ via Equations (18)–(20);
10:    Compute the total loss: $\mathcal{L}_{total} = \mathcal{L}_{rec} + \mathcal{L}_{con}$;
11:    Update the encoder: $\theta_e \leftarrow \theta_e - \eta \nabla_{\theta_e} \mathcal{L}_{total}$;
12:    Update the projection heads: $\theta_{p1} \leftarrow \theta_{p1} - \eta \nabla_{\theta_{p1}} \mathcal{L}_{total}$, $\theta_{p2} \leftarrow \theta_{p2} - \eta \nabla_{\theta_{p2}} \mathcal{L}_{total}$;
13:    Update the decoder: $\theta_d \leftarrow \theta_d - \eta \nabla_{\theta_d} \mathcal{L}_{total}$;
14:   end for
15: end for
16: Save the feature encoder parameters;

Fine-tuning Stage:
17: Load the pre-trained encoder parameters;
18: for epoch $= 1, \dots, E_{ft}$ do
19:   for iteration $= 1, \dots, I_{ft}$ do
20:     Sample a batch of labeled signals $(x, y)$ from $\mathcal{D}_l$;
21:     Extract features: $z = f_{\theta_e}(x)$;
22:     Predict labels: $\hat{y} = c_{\theta_c}(z)$;
23:     Compute the cross-entropy loss $\mathcal{L}_{ce}$ via Equation (22);
24:     Update the encoder: $\theta_e \leftarrow \theta_e - \eta \nabla_{\theta_e} \mathcal{L}_{ce}$;
25:     Update the classifier: $\theta_c \leftarrow \theta_c - \eta \nabla_{\theta_c} \mathcal{L}_{ce}$;
26:   end for
27: end for
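As referenced above, the following is a compact PyTorch sketch of the two-stage procedure in Algorithm 1, reusing the modules and loss functions from the earlier sketches (Encoder, Decoder, ProjectionHead, complementary_mask_augment, masked_mse_loss, contrastive_loss); the optimizer choice, learning rates, and epoch counts are illustrative assumptions.

```python
# Two-stage training sketch mirroring Algorithm 1; hyperparameters are illustrative.
import torch
import torch.nn.functional as F

def pretrain(encoder, decoder, head1, head2, unlabeled_loader, epochs=100, lr=1e-3):
    params = (list(encoder.parameters()) + list(decoder.parameters())
              + list(head1.parameters()) + list(head2.parameters()))
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for x in unlabeled_loader:
            x1, x2, m1, m2 = complementary_mask_augment(x)      # step 4
            z1, z2 = encoder(x1), encoder(x2)                   # step 5
            h1, h2 = head1(z1), head2(z2)                       # step 6
            x1_hat, x2_hat = decoder(z1), decoder(z2)           # step 7
            loss = (masked_mse_loss(x, x1_hat, x2_hat, m1, m2)  # Eq. (16)
                    + contrastive_loss(h1, h2))                 # Eqs. (18)-(20)
            opt.zero_grad(); loss.backward(); opt.step()        # steps 11-13

def finetune(encoder, classifier, labeled_loader, epochs=50, lr=1e-4):
    opt = torch.optim.Adam(list(encoder.parameters())
                           + list(classifier.parameters()), lr=lr)
    for _ in range(epochs):
        for x, y in labeled_loader:
            logits = classifier(encoder(x))                     # steps 21-22
            loss = F.cross_entropy(logits, y)                   # Eq. (22)
            opt.zero_grad(); loss.backward(); opt.step()        # steps 24-25
```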
5. Conclusions
Aiming at the challenge of scarce labeled samples in SEI tasks under non-cooperative communication scenarios, this paper proposes a self-supervised learning method based on contrastive asymmetric masked learning. The method integrates the dual optimization mechanisms of masked signal reconstruction and contrastive learning: the complementary masking augmentation strategy preserves critical fingerprint information in the signals while using the masked regions to guide robust feature learning. Furthermore, a channel squeeze-and-excitation residual network combined with an asymmetric encoder–decoder architecture enhances key feature representation while reducing model complexity, significantly improving feature extraction capability. Experimental results demonstrate that the proposed CAML-SEI method outperforms the comparison methods under multiple extreme labeling conditions (only 10, 15, 20, or 25 samples per category) on both the 30-class ADS-B dataset and the 16-class Wi-Fi dataset, verifying its superiority in learning robust and generalizable RFF representations and providing a reliable technical solution for RFF feature extraction in label-scarce scenarios.
The proposed CAML-SEI method offers several advantages, and our experiments have validated its robustness under moderate SNR conditions. However, performance may degrade under low SNR conditions, for example, at SNR levels below 0 dB; future work will therefore integrate a signal denoising module to enhance robustness in such conditions. Considering the issue of signal interference in non-cooperative scenarios, we will also incorporate interference detection as an auxiliary pre-training task in the self-supervised framework to improve model performance. Regarding future research directions, we will focus on cross-domain SEI, specifically investigating whether pre-trained models can be adapted to SEI recognition tasks where signal distributions change over time; to address this problem, we plan to introduce domain adaptation techniques. Additionally, we will employ model compression strategies such as pruning, quantization, and knowledge distillation to further reduce model size and enable edge deployment. Finally, for ultra-large-scale device identification scenarios, we will explore combining memory bank mechanisms with adaptive hard sample mining to further improve the quality of negative samples and training scalability.