An Adaptive Focal Loss Function Based on Transfer Learning for Few-Shot Radar Signal Intra-Pulse Modulation Classification

Abstract: To address the difficulty of radar signal classification when only a few samples are available, we propose an adaptive focus loss algorithm based on transfer learning. First, we trained a one-dimensional convolutional neural network (CNN) with radar signals of three intra-pulse modulation types in the source domain, which were easily obtained and had sufficient samples. Then, we transferred the knowledge obtained by the convolutional layers to nine types of few-shot complex intra-pulse modulation classification tasks in the target domain. We propose an adaptive focal loss function based on the focal loss function, which can estimate the focusing parameter from the ratio of hard samples to easy samples in the data set. Compared with existing algorithms, our proposed algorithm uses transfer learning to carry the acquired prior knowledge over to new domains, allowing the CNN model to converge quickly and achieve good recognition performance with insufficient samples. The improvement based on the focal loss function allows the model to focus on the hard samples while estimating the focusing parameter adaptively instead of through tedious repeated experiments. The experimental results show that the proposed algorithm had the best recognition rate at different sample sizes, with an average recognition rate improvement of 4.8%, and the average recognition rate was better than 90% for different signal-to-noise ratios (SNRs). In addition, a comparison of the training processes of different models shows that the proposed method converged in the fewest generations and the shortest time under the same experimental conditions.


Introduction
Intra-pulse modulation classification of radar signals is an essential area within the field of electronic countermeasures (ECM); it determines the system, usage, and type of enemy radar by analyzing the data received from radar reconnaissance systems [1]. Generally, the classification of intra-pulse modulation can be divided into feature-based and data-based classification [2]. Feature-based classification requires extracting features from the radar signal and carefully designing classifiers based on these features [3], while data-based classification fully retains the data of the radar signal and is the focus of this paper.
Deep learning is a new research direction in the field of machine learning and is a collective term for a class of pattern analysis methods [4]. With its rapid development, many researchers now use deep learning to study the intra-pulse modulation classification of radar signals. Compared to traditional feature extraction, deep learning can automatically extract deep features from radar signals. In [5], Z. Huang proposed a deep convolutional neural network-based approach that used the amplitude information of single-polarization SAR images as input to automatically extract hierarchical spatial characteristics. These features may be more abstract, but they are representative and suitable for classification. Deep learning models can learn all the representation layers jointly. Because the features are learned together, once the model modifies an internal feature, all other features that depend on it adapt automatically. To improve the classification accuracy of radar signals, various methods have been proposed. Many researchers convert one-dimensional (1D) radar signals into two-dimensional (2D) time-frequency images for classification. For example, the short-time Fourier transform and CNNs were used to identify six different intra-pulse modulation signals, and the overall classification success rate was over 90% [6]. Ma, XR et al. [3] designed a method combining the short-time Ramanujan Fourier transform with pseudo-Zernike moment invariant features to recognize different modulation schemes under different parameter variation conditions. To improve classification performance under a low signal-to-noise ratio (SNR), an improved convolutional denoising autoencoder [7] was proposed. When the SNR was −9 dB, the encoder could classify 12 kinds of modulated signals, and the classification accuracy was over 95%.
Gao, LP et al. [8] proposed an image fusion algorithm using non-multi-scale decomposition to fuse images of a single signal obtained with different time-frequency methods. Liu et al. [9] proposed an algorithm for radar emitter signal recognition that transforms raw radio signals into time-frequency images using the Choi-Williams distribution function. Moreover, various studies show that using 1D radar signals directly for identification is worthwhile, which is also the focus of this paper. Sun, J et al. [10] designed a novel encoding method to generate high-dimensional sequences of equal length as new features in cases of inconsistent features between samples, and proposed a 1D convolutional neural network to classify the encoded high-dimensional radar signals. Li, X et al. [11] proposed an attention-based approach for radar emitter classification that uses recurrent neural networks to classify the radar signals. To improve the classification accuracy of radar signals with SNRs of −14~20 dB, a novel network was proposed that combines a shallow convolutional neural network (CNN), a long short-term memory (LSTM) network, and a deep neural network (DNN). Wu, B et al. [12] proposed a novel 1D CNN with an attention mechanism to extract more discriminative features and recognize radar emitter signals. Although methods such as deep CNNs can omit feature engineering and automatically extract and learn features from the data [5], the number of samples they require restricts the development of deep learning in the field of radar signal classification. Most researchers focus only on signal classification under different SNRs. These methods fail to overcome the obstacle of training a deep network with limited radar signals, which is a few-shot recognition problem [13].
To solve these issues, we propose a novel deep network based on transfer learning. Transfer learning is good at applying knowledge or patterns learned in one domain or task to a different but related domain or problem [14]. In general, transfer learning can be classified into three categories: instance-based, feature-based, and shared parameter-based [15]. Instance-based transfer learning studies how to select instances from the source domain that are useful for training in the target domain. For example, an effective assignment of weight to labeled data instances in the source domain can make the distribution of instances in the source and target domains close, so that a reliable learning model with high classification accuracy can be built in the target domain. Dai WY et al. [16] proposed the TrAdaBoost algorithm to improve the classification effect by adjusting the weights of misclassified samples in the source and target domains. Feature-based approaches extract and identify representative features shared between the source and target domains and then use these features to transfer knowledge [17]. Shared parameter-based methods investigate how to find common parameters or prior distributions between the spatial models of source and target data [18]. For few-shot Synthetic Aperture Radar (SAR) image classification, shared parameter-based methods are good at migrating labeled data or learned knowledge structures from related domains with sufficient samples [18][19][20][21][22]. Huang, Z et al. [18] designed an assembled CNN architecture consisting of a classification pathway and a reconstruction pathway, together with an additional feedback bypass. A novel method, deep memory convolution neural networks, for alleviating the problem of overfitting caused by insufficient SAR image samples was proposed in [19]. Rostami, M et al. [22] proposed a novel deep neural network for classifying SAR images that eliminates the need for a huge labeled dataset.
Different from the classification of 2D SAR images, we recognize the received 1D radar signals directly instead of converting them into images via time-frequency analysis. First, we trained a 1D deep CNN on a source dataset consisting of a large number of labeled samples of three simple radar signal types; the source task was to classify these three signal types as accurately as possible. Because the source samples were simple and plentiful, a well-trained deep CNN was easy to obtain. We then discarded the classifier and the higher convolutional layers, leaving only the structure and parameters of the lower layers to transfer to the target domain. The lower convolutional layers of the CNN (those closer to the input) extract more general features, while the classifier and higher convolutional layers respond to task-specific features; the higher layers contain more feature semantics, and the lower layers contain less feature semantics but more location information. Meanwhile, we propose an adaptive focus loss function based on focal loss [23], which adjusts the focusing parameter according to the ratio of hard samples to easy samples in the data set. The experimental results demonstrate that, compared with existing algorithms, the proposed algorithm significantly improved classification accuracy and convergence speed while using less training data.
The remainder of the paper proceeds as follows. Section 2 is concerned with the methodology used for this study. Detailed experimental procedures and discussion of the results are given in Section 3. Section 4 is the conclusion of this paper.

Convolutional Neural Network
Deep learning arose rapidly in the first decade of the 21st century as computing power increased. The convolutional neural network (CNN), the exemplar of deep learning, was first established by Y. LeCun [24], who designed the famous LeNet-5 to classify handwritten digits, drawing on artificial neurons and visual perception mechanisms. CNNs share many similarities with ordinary neural networks, in that both mimic the structure of human nerves and consist of neurons with learnable weights and bias constants [24]. However, CNNs are more widely used because they avoid complex pre-processing of data and can take raw data directly as input, relying on convolution layers to extract feature maps [25]. In the following years, CNNs evolved from this classical structure. In 2012, A. Krizhevsky, I. Sutskever, and G. Hinton designed AlexNet [26], which builds on LeNet by introducing a nonlinear activation function (ReLU) and methods to prevent overfitting (dropout and data augmentation). In 2014, K. Simonyan et al. [27] proposed VGG-Net, which contains more layers and uses convolutional filters of the same size throughout. The Inception structure of GoogLeNet [28] allows the entire network structure to be expanded in both width and depth. ResNet [29] introduced a residual learning framework that reduces the training burden on the network.

Radar Signal Intra-Pulse Modulation Classification
Radar emitter signal identification, which aims to obtain information concerning radar systems by analyzing the emitter signals, is an important aspect of electronic warfare and has been extensively studied by numerous researchers. All the CNNs mentioned in Section 2.1.1 have achieved good results in many fields, so it is reasonable to use CNNs to learn temporal correlations and deep features from radar signals for classification. Wei, SJ et al. [30] used sequences in the time, frequency, and autocorrelation domains of the original signal as inputs to a shallow CNN; the deep features extracted by the CNN were then used as the input to an LSTM network, and finally a DNN, acting as the classification network, directly output the modulation type of the signal. This approach achieved high accuracies for four common kinds of measured radar signals. In [9], Z. Liu created a deep CNN that takes time-frequency spectra of radar signal intra-pulse modulation as input, replacing manually constructed features that are time-consuming to design and neglect delicate characteristics. In [31], Y. Pan et al. used the Hilbert-Huang transform to obtain a wealth of information on the nonlinear and non-stationary properties of radar signals and built a deep residual network to avoid the degradation problem.
Remote Sens. 2022, 14, 1950

Transfer Learning
Transfer learning is a machine learning technique that transfers knowledge learned in the source domain to the target domain to enhance the learning of the target task. Transfer learning typically includes the following elements: a source domain $D_S$, a target domain $D_T$, a source learning task $T_S$, and a target learning task $T_T$. Based on the differences between the source/target domains and tasks, S.J. Pan et al. classified transfer learning into inductive transfer learning, unsupervised transfer learning, and transductive transfer learning [15]. These three types of transfer learning can be further grouped into four cases: instance transfer learning, feature-representation transfer learning, parameter transfer learning, and relational-knowledge transfer learning [15]. Transfer learning has been successfully adopted in many fields, such as image and video quality assessment, visual categorization, and machinery fault diagnosis. Ling, S et al. [32] surveyed state-of-the-art transfer learning algorithms in visual categorization applications such as object recognition, image classification, and human action recognition. Varga, D et al. [33] pre-trained different types of CNNs and fused the decisions of multiple image quality scores to better characterize authentic image distortion and effectively estimate perceived image quality. In [34], a CNN pre-trained on the ImageNet database and equipped with global average pooling layers was proposed to transfer the learned knowledge, so that the module can be easily generalized to any input image size and pre-trained CNN. In [35], Li, C et al. reviewed the research progress on deep transfer learning for machinery fault diagnosis in recent years. In the field of radar target classification, Huang, ZL et al. [18] used inductive transfer learning to transfer reconstructed knowledge from convolutional self-encoders to the SAR target classification task. 
They innovatively used a large number of unlabeled SAR scene images to train the convolutional self-encoder to reconstruct the features well and transferred only the encoder results during the target classification task. Qing, W et al. [36] designed a two-channel CNN combined with bi-directional LSTM architecture to improve the classification performance of the waveforms of cognitive passive radar. They used transductive transfer learning to initialize the target domain classifier with source domain parameters. In this paper, since the source and target domains have different but related distributions, we used parameter transfer learning, in which the source and target tasks share some of the network parameters.

The Focal Loss Function
The imbalance of different sample categories in target detection is a critical issue impacting accuracy. Online hard example mining (OHEM) [37], a typical algorithm for dealing with class imbalance, increases the weight of misclassified samples but ignores the easy-to-classify samples. Focal loss was proposed in [23] to solve the category imbalance problem and is obtained by modifying the standard cross-entropy loss. Compared to OHEM, the focal loss function allows the model to focus more on difficult samples by reducing the weights of easy-to-classify samples during the training process [23]. Currently, focal loss is commonly used in different fields. In the area of text detection in computer vision, X. Tian et al. [38] designed a focal text detection network that uses focal loss to train the network well with an uneven number of samples; it obtained better performance when the number of samples was insufficient. In the field of medical image processing, where sample imbalance is a serious problem, [39] proposed a network framework using residual neural networks (ResNet) [29] combined with focal loss for left ventricle segmentation from cardiac MRI images. For the problem of ship detection in high-resolution SAR images, ref. [40] designed a RetinaNet-Plus method based on the RetinaNet network, which uses focal loss in the training process to resolve class imbalance and reduce the loss weights of easy-to-classify samples.

Transfer Learning-Based Convolutional Neural Network
Based on parameter transfer learning, we constructed the network frameworks for the source and target tasks separately. The source domain used three simple intra-pulse modulation types of radar signals, which were easily obtained and had sufficient samples, and the goal was to classify these three radar signals as accurately as possible. In the target domain, we trained with a small number of signals of complex intra-pulse modulation types, nine in total, and initialized the convolutional layers of the target domain network with the parameters learned in the source domain instead of random values. In the following, we describe the details of the method.
A sequential structure is a convolutional neural network in which the output of each layer serves as the input of the next layer. Because of its simplicity, it has been widely used as a classical structure. VGG is a typical sequentially structured network model, first proposed by K. Simonyan et al. [27]. They added convolutional layers to AlexNet one by one to study the effect of network depth on recognition performance. Their experiments showed that deeper networks recognized better; performance improved significantly when the network grew to 16 and 19 layers, and these models were therefore called VGG-16 and VGG-19. To keep the training simple and the model easy to transfer and tune, we used a sequential structure and simplified the VGG network to build the source and target networks. To ensure that the layer parameters could transfer properly between the source and target domains, we constructed sequential structures that share part of the same convolutional layers in the source and target domains. We designed a 1D CNN for the input 1D intra-pulse modulated radar signal sequence; the receptive field of the 1D convolution kernel was translated continuously over the data sequence, and because convolution is translation-invariant, significant features could be observed wherever they occurred. Each convolution filter of the convolution layer slid across the whole input sequence, and the convolution result formed a feature map containing the local features of the radar signals. Each convolution filter applied the same parameters at every position, i.e., the same weight matrix and bias term, which were transferred to the target domain after training in the source domain. The convolved 1D feature map was fed to the pooling layer, where max pooling, a nonlinear down-sampling method [41], was used for sampling. 
After acquiring radar signal sequence features by convolution, directly using all the extracted feature data to train the classifier for classification usually expends great computational effort, so the maximum pooling sampling method can be used to down-sample the convolutional features. The convolution and pooling process is shown in Figure 1. By training the source domain network, we continuously optimized the feature extraction capability of the convolutional layers for 1D radar signals. Thus, these pre-trained convolutional layers could extract data features well in the face of complex target tasks with few-shot samples and inconsistent sample distribution.
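The convolution and pooling steps described above can be illustrated with a minimal pure-Python sketch. The input sequence and the 1 × 3 kernel values are hypothetical, chosen only so the arithmetic is easy to follow; the actual filters in the paper are learned, and the pooling size of 2 is an assumption.

```python
def conv1d(sequence, kernel, bias=0.0):
    """Valid-mode 1D convolution (cross-correlation, as used in CNN layers).

    The same kernel weights and bias are reused at every position,
    which is the weight sharing described in the text.
    """
    k = len(kernel)
    return [
        sum(w * sequence[i + j] for j, w in enumerate(kernel)) + bias
        for i in range(len(sequence) - k + 1)
    ]


def max_pool1d(feature_map, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [
        max(feature_map[i:i + size])
        for i in range(0, len(feature_map) - size + 1, size)
    ]


# Hypothetical toy input: a coarse sampled sinusoid (not real radar data).
signal = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0, 0.0]
kernel = [1.0, 0.0, -1.0]  # hypothetical 1 x 3 filter

feature_map = conv1d(signal, kernel)      # 9 samples -> 7 feature values
pooled = max_pool1d(feature_map, size=2)  # 7 feature values -> 3 pooled values
```

Pooling keeps only the strongest response in each window, which is why it reduces the amount of data passed to the classifier while retaining the most salient local features.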
To better assemble and train the network and highlight the effect of transfer learning, we designed a single-input, single-output convolutional network structure based on the above structure, as shown in Figure 2. The source network consisted of five convolutional layers, each followed by batch normalization to prevent vanishing gradients and speed up training. To transfer the parameters properly between networks, we kept the structure of the lower convolutional layers of the source network the same as that of the target network, with the same number and size of convolutional kernels. Table 1 shows the number of feature maps and the size of the convolutional kernels for each convolutional layer. We used multiple 1 × 3 (or 1 × 5) convolutional kernels instead of large-sized kernels to minimize the number of parameters and the amount of computation while ensuring that the receptive field was not altered [42]. As the number of layers deepened, we increased the number of convolutional kernels to extract more deep features and then down-sampled the feature maps via max pooling to prevent overfitting. This design gave the network an inverted triangular shape, i.e., the closer to the input layer, the smaller the number of parameters, and the closer to the output layer, the larger the number of parameters. Such an inverted triangular structure prevents the neural network from losing gradients too quickly during backpropagation [27]. 
In terms of activation function selection, the Rectified Linear Unit (ReLU) is widely used because it can mitigate the vanishing gradient problem [43], but its sparsity tends to lead to dying ReLU units, so we used Leaky ReLUs, which assign a non-zero slope to all negative values, after each convolutional layer to solve this problem [44,45].
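The difference between ReLU and Leaky ReLU can be seen in a few lines. This is a small sketch; the negative slope of 0.01 is an assumed, commonly used default and is not a value stated in the paper.

```python
def relu(x):
    """Standard ReLU: negative inputs are clamped to zero."""
    return x if x > 0 else 0.0


def leaky_relu(x, negative_slope=0.01):
    """Leaky ReLU: negative inputs keep a small non-zero slope.

    negative_slope=0.01 is an assumed default, not taken from the paper.
    """
    return x if x > 0 else negative_slope * x


def leaky_relu_grad(x, negative_slope=0.01):
    """Derivative of Leaky ReLU; never exactly zero, so units cannot 'die'."""
    return 1.0 if x > 0 else negative_slope


pre_activations = [-2.0, -0.5, 0.0, 1.5]
relu_out = [relu(x) for x in pre_activations]
leaky_out = [leaky_relu(x) for x in pre_activations]
leaky_grads = [leaky_relu_grad(x) for x in pre_activations]
```

With ReLU, a unit whose pre-activation stays negative outputs zero and receives zero gradient, so it stops updating; the small negative slope keeps a gradient flowing in that regime.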
In ref. [46], Y, W. et al. showed that different convolutional layers extract feature information at different levels, and therefore that during transfer learning the appropriate convolutional layer parameters should be selected for transfer, instead of all of them. The lower convolutional layers of a CNN (the layers closer to the input) extract more general features, i.e., they contain few feature semantics but more location information, while the classifier and higher convolutional layers respond to task-specific features and extract more feature semantics; the semantic features learned in the last few layers differ considerably between datasets. Therefore, the higher convolutional layers are generally tied to the task objectives and classification, and the lower convolutional layers are more suitable as feature extractors of general features for transfer learning. As a result, in the transfer process, we kept only the first four convolutional layers of the source network and discarded the other layers, which contained more semantic information. These optimized lower convolutional layers were more general and could effectively extract structural and detailed features in the radar signal, even in the face of new data.
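The parameter transfer step can be sketched as copying the first four convolutional layers from the trained source network into the target network while leaving the remaining layers at their fresh initialization. The sketch below uses plain dictionaries as stand-ins for layer parameters; the layer names (`conv1` ... `conv5`), the three-tap kernels, and the assumption that the target network also has five convolutional layers are illustrative and not taken from the paper.

```python
import copy
import random


def init_conv_layers(num_layers, rng):
    """Randomly initialised stand-ins for conv layers (hypothetical layout)."""
    return {
        f"conv{i}": {
            "weight": [rng.gauss(0.0, 0.1) for _ in range(3)],
            "bias": 0.0,
        }
        for i in range(1, num_layers + 1)
    }


def transfer_lower_layers(source_layers, target_layers, num_transferred=4):
    """Return a copy of the target whose first layers come from the source.

    Layers beyond `num_transferred` (and any classifier) keep the
    target network's own initialisation.
    """
    initialised = copy.deepcopy(target_layers)
    for i in range(1, num_transferred + 1):
        name = f"conv{i}"
        initialised[name] = copy.deepcopy(source_layers[name])
    return initialised


rng = random.Random(0)
source = init_conv_layers(5, rng)  # stands in for the trained source network
target = init_conv_layers(5, rng)  # freshly initialised target network
target_init = transfer_lower_layers(source, target, num_transferred=4)
```

Deep copies are used so that later fine-tuning of the target network cannot modify the stored source parameters.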

Adaptive Focus Loss Function (AFL)
To further improve the classification performance of radar signals under few-shot learning, we replaced the original cross-entropy loss with an adaptive focal loss, which automatically sets the focusing parameter of focal loss according to the numbers of easy and hard samples in the training set.
The focal loss function is widely used and has achieved good results in the field of target detection. The authors of [23] proposed focal loss (FL) based on the cross-entropy (CE) loss:
$$\mathrm{CE}(p, y) = \begin{cases} -\log(p), & y = 1 \\ -\log(1 - p), & y = -1 \end{cases}$$
where $p \in [0, 1]$ is the model's estimated probability for the class with label $y = 1$ and $y \in \{-1, 1\}$ indicates the ground-truth class. Writing $p_t = p$ if $y = 1$ and $p_t = 1 - p$ otherwise, the CE loss becomes $\mathrm{CE}(p_t) = -\log(p_t)$. Facing the problem of sample imbalance, the authors added a modulating factor to the CE loss that assigns different weights to the samples:
$$\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t)$$
The focusing parameter satisfies $\gamma \ge 0$. When $\gamma = 0$, the focal loss reduces to the traditional cross-entropy loss, and as $\gamma$ increases, the effect of the modulating factor $(1 - p_t)^{\gamma}$ also increases. $\gamma$ smoothly adjusts the proportion of the loss accounted for by samples of different difficulties. If a hard sample is misclassified, the $p_t$ value is small, so $(1 - p_t)^{\gamma} \to 1$ and the focal loss does not change significantly compared to the original loss. By contrast, when an easy sample is correctly classified, $p_t \to 1$, the modulating factor tends to 0, and the contribution to the total loss is small. Based on this principle, focal loss solves most sample imbalance problems very well, but in some scenarios its effect is not ideal. Focal loss solves the sample imbalance problem by adjusting the parameter $\gamma$ to give more weight to the hard, poorly classified samples. However, a larger $\gamma$ is not always better. In ref. [23], the best recognition was achieved when $\gamma = 2$, and the performance decreased when $\gamma > 2$. For other models, the optimal value of $\gamma$ needs to be determined through a large number of experiments. We can simplify the determination of the optimal parameter by estimating the range of $\gamma$. Therefore, we propose an adaptive focus loss function that estimates the value of $\gamma$ from the relative numbers of hard- and easy-to-classify samples. According to the research in [47], there is a huge difference in quantity between easy and hard samples during the training process. 
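The behavior of the focal loss for easy and hard samples can be checked numerically. The following is a small sketch of the standard focal loss from [23]; the example probabilities 0.9 and 0.1 are illustrative values for an easy and a hard sample.

```python
import math


def cross_entropy(p_t):
    """Cross-entropy for the probability assigned to the true class."""
    return -math.log(p_t)


def focal_loss(p_t, gamma=2.0):
    """Focal loss: cross-entropy scaled by the modulating factor (1 - p_t)^gamma."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


easy_p_t, hard_p_t = 0.9, 0.1  # illustrative easy and hard samples

# Fraction of the CE loss that survives the modulating factor at gamma = 2.
easy_weight = focal_loss(easy_p_t) / cross_entropy(easy_p_t)
hard_weight = focal_loss(hard_p_t) / cross_entropy(hard_p_t)
```

At $\gamma = 2$, the easy sample keeps about 1% of its cross-entropy loss while the hard sample keeps about 81%, which is the down-weighting of easy samples described above.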
First, we trained a base classifier using CE; then we ran it on the training set and counted the numbers of easy and hard samples, denoted $N_e$ and $N_h$, respectively. (For radar signal intra-pulse modulation classification, we considered $p_t \le 0.1$ to be a hard sample and $p_t \ge 0.9$ to be an easy sample. For different models and problems, one can change the thresholds that separate hard and easy samples.) According to the focal loss function, the loss gap between hard and easy samples, measured by the ratio of their modulating factors, is at least:
$$\frac{(1 - 0.1)^{\gamma}}{(1 - 0.9)^{\gamma}} = 9^{\gamma}$$
We defined the difficulty of the training set as the ratio of the number of easy samples to the number of hard ones:
$$r = \frac{N_e}{N_h}$$
The focusing parameter $\gamma$ was used to adjust the contribution of easy and hard samples to the overall loss to balance their large quantitative differences. Therefore, the loss gap should not be less than the ratio of the number of easy samples to the number of hard ones, and the value of $\gamma$ should increase as $r$ increases, i.e., the more easy samples there are, the greater the focus on hard samples. We then derived the estimate of $\gamma$, denoted $\hat{\gamma}$, which should satisfy the following:
$$9^{\hat{\gamma}} \ge \frac{N_e}{N_h}, \quad \text{i.e.,} \quad \hat{\gamma} \ge \log_9 \frac{N_e}{N_h}$$
In summary, for the multi-classification problem of radar signal intra-pulse modulation, we propose the Adaptive Focus Loss function (AFL) as follows:
$$\mathrm{AFL}(p_{prediction}, y_{groundtruth}) = -\sum_{i=1}^{9} y_{groundtruth,i} \, (1 - p_{prediction,i})^{\hat{\gamma}} \log(p_{prediction,i})$$
In this paper, we obtained $\hat{\gamma} = \log_9 \frac{N_e}{N_h}$, and $p_{prediction}$ was a 1 × 9 vector containing our model's estimated probabilities for the nine radar signal classes. $y_{groundtruth}$ was the vector of labels after one-hot encoding.
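The estimation procedure can be sketched as follows, assuming the thresholds stated in the text (0.9 for easy, 0.1 for hard) and the estimate $\hat{\gamma} = \log_9 (N_e / N_h)$. The function names, the fallback to $\hat{\gamma} = 0$ when there are no hard samples or when easy samples do not outnumber hard ones, and the `eps` clamp inside the logarithm are assumptions added for numerical safety, not part of the paper. The probabilities are made-up illustrative values.

```python
import math


def count_easy_hard(p_t_values, easy_threshold=0.9, hard_threshold=0.1):
    """Count easy (p_t >= 0.9) and hard (p_t <= 0.1) training samples."""
    n_easy = sum(1 for p in p_t_values if p >= easy_threshold)
    n_hard = sum(1 for p in p_t_values if p <= hard_threshold)
    return n_easy, n_hard


def estimate_gamma(n_easy, n_hard):
    """gamma_hat = log_9(N_e / N_h); falls back to 0 (plain CE) otherwise.

    The fallback is an assumption of this sketch, keeping gamma_hat >= 0.
    """
    if n_hard == 0 or n_easy <= n_hard:
        return 0.0
    return math.log(n_easy / n_hard, 9)


def adaptive_focal_loss(p_prediction, y_groundtruth, gamma_hat, eps=1e-12):
    """AFL for one sample: one-hot label times modulated log-probability."""
    return -sum(
        y * (1.0 - p) ** gamma_hat * math.log(max(p, eps))
        for p, y in zip(p_prediction, y_groundtruth)
    )


# Hypothetical base-classifier outputs on a training set:
# 81 easy samples, 1 hard sample, 18 in between.
p_t_values = [0.95] * 81 + [0.05] * 1 + [0.5] * 18
n_easy, n_hard = count_easy_hard(p_t_values)
gamma_hat = estimate_gamma(n_easy, n_hard)  # log_9(81) = 2

# One sample of the nine-class problem, true class at index 0.
p_prediction = [0.5] + [0.0625] * 8
y_groundtruth = [1] + [0] * 8
loss = adaptive_focal_loss(p_prediction, y_groundtruth, gamma_hat)
```

Because the estimate only needs one pass of a cross-entropy-trained base classifier over the training set, it replaces the repeated experiments otherwise needed to tune $\gamma$.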

Experiments and Results
In this section, we generated several simulated radar signal datasets with different sample sizes to represent different small-sample cases, which were used to train and test the proposed method and the baseline methods. In all experiments, we used a computer equipped with an Intel 10900K CPU, 64 GB of RAM, and an RTX 3070 GPU.

Dataset and Parameters Setting
Generally, the typical radar signal is dominated by high-power radio frequency (RF) pulses with a carrier band ranging from 3 MHz to 100 GHz. The radar receiver in our simulation mixed the high-frequency radar signal with a local oscillator to reduce the frequency of the received signal, and then output a lower-frequency signal through the intermediate frequency (IF) amplifier. Specifically, to replicate how a radar system operating in a real environment reduces the frequency of the received signal, a mixer was simulated. The mixer multiplied the RF signal by the local oscillator signal to obtain two output frequencies, the sum and the difference of the radio frequency $f_{RF}$ and the local oscillator frequency $f_{LO}$, which can be expressed as:
$$\cos(2\pi f_{RF} t)\cos(2\pi f_{LO} t) = \frac{1}{2}\left[\cos\big(2\pi (f_{RF} + f_{LO}) t\big) + \cos\big(2\pi (f_{RF} - f_{LO}) t\big)\right]$$
By using a low-pass filter, the sum frequency $f_{RF} + f_{LO}$ could be well suppressed, so that we obtained the difference frequency $f_{RF} - f_{LO}$, which is the IF signal. In this paper, we simulated the low-frequency radar signals at the receiver output and used them to train and test our proposed method. For the source domain dataset, we selected three modulation types of simple and widely obtainable radar signals: single-carrier frequency (SCF) signals, linear frequency modulation (LFM) signals, and sinusoidal frequency modulation (SFM) signals. For the target domain dataset, we used nine different kinds of radar signals with complex modulation types, comprising binary phase-shift keying (BPSK) signals, binary frequency-shift keying (BFSK) signals, quadrature frequency-shift keying (QFSK) signals, Frank phase-coded (Frank) signals, even quadratic frequency modulation (EQFM) signals, dual-frequency modulation (DLFM) signals, multiple linear frequency modulation (MLFM) signals, and two kinds of composite modulation (LFM-BPSK, BPSK-BFSK) signals. The sampling frequency was 1 GHz, the pulse width of all radar signals varied from 1 µs to 10 µs, and other signal parameters are shown in Table 2. 
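The mixing and low-pass filtering steps can be reproduced numerically at a reduced scale. In this sketch, all frequencies (`FS`, `F_RF`, `F_LO`) are hypothetical values scaled down from the gigahertz range so the example runs quickly, and the five-tap circular moving average is a deliberately crude stand-in for the receiver's low-pass filter, chosen because it has a null exactly at the sum frequency.

```python
import cmath
import math

FS = 2000     # hypothetical sampling rate in Hz (scaled down from 1 GHz)
N = 2000      # one second of samples, so each tone completes whole cycles
F_RF = 450.0  # hypothetical "RF" tone in Hz
F_LO = 350.0  # hypothetical local-oscillator tone in Hz


def tone(freq):
    """Unit-amplitude cosine sampled at FS."""
    return [math.cos(2 * math.pi * freq * n / FS) for n in range(N)]


def amplitude_at(samples, freq):
    """Amplitude of the component at `freq`, from a single-bin DFT."""
    acc = sum(
        x * cmath.exp(-2j * math.pi * freq * n / FS)
        for n, x in enumerate(samples)
    )
    return 2 * abs(acc) / len(samples)


def circular_moving_average(samples, length):
    """Crude low-pass filter: mean of the last `length` samples (wrapping)."""
    n = len(samples)
    return [
        sum(samples[(i - k) % n] for k in range(length)) / length
        for i in range(n)
    ]


# Mixer: multiply the RF signal by the local oscillator.
mixed = [a * b for a, b in zip(tone(F_RF), tone(F_LO))]

# Low-pass filter: suppress the sum frequency, keep the difference (IF).
if_signal = circular_moving_average(mixed, length=5)

sum_before = amplitude_at(mixed, F_RF + F_LO)
diff_before = amplitude_at(mixed, F_RF - F_LO)
sum_after = amplitude_at(if_signal, F_RF + F_LO)
diff_after = amplitude_at(if_signal, F_RF - F_LO)
```

Before filtering, the sum and difference components each carry half the amplitude, as the product-to-sum identity predicts; after filtering, only the difference (IF) component remains, slightly attenuated by the simple filter.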
To simulate the real electromagnetic environment, we added additive white Gaussian noise (AWGN) to all signals. The radar signal intercepted by the receiver is modeled as $s(t) + n(t)$, where $n(t)$ is white Gaussian noise and $s(t)$ is the radar signal. The SNR is defined as:
$$\mathrm{SNR} = 10 \log_{10} \frac{P_s}{P_n} \qquad (14)$$
where $P_s$ represents the effective power of the signal and $P_n$ is the effective power of the noise. In Figure 3, taking the LFM signal as an example, we simulated the time-domain waveforms of the same signal at −5 dB, 0 dB, 5 dB, and without noise. Different research fields require different numbers of samples. To better investigate the relationship between the number of training samples and the classification performance of the model in radar intra-pulse modulation classification, we introduced a learning curve [48] to plot the classification accuracy versus the size of the training set. The learning curve models the classification accuracy $y$ as a function of the training set size $x$, with parameters $b_1$ and $b_2$ corresponding to the learning rate and decay rate, respectively. Figure 4 shows the learning curve of classification accuracy versus number of samples for the nine types of intra-pulse modulated radar signals in the target domain.
According to the learning curve, the classification accuracy levels off and the model converges when the total number of training samples reaches 2000, so in this paper, we defined radar signal intra-pulse modulation classification with fewer than 2000 training samples as small sample learning.
To validate our proposed method for different sample sizes, we randomly generated samples of each signal type in the target domain, with the number per type increasing from 50 to 140 in increments of 10, constituting 10 training sets with different sample sizes. For the three types of radar signals in the source domain, the number of samples for each signal was 5000. The number of training samples for each type of radar signal is shown in Table 3.
Table 3. The number of training sets for each type of radar signal.
Each dataset was generated under the same SNR conditions, ranging from −5 dB to 5 dB in 1 dB steps. An additional set of noise-free signals was generated as a control group to verify the effect of noise on the model performance. For all the above training sets, we produced validation and test sets so that the training, validation, and test sets followed a ratio of 4:1:1. For example, when the target domain training set had 450 samples for each SNR (−5~5 dB and a noise-free dataset, for a total of 12 SNR conditions), the validation set and test set each contained 112 samples. Figures 5 and 6 respectively show the waveforms of the source-domain and target-domain intra-pulse modulated radar signals over time when the SNR was 0 dB.
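The noise model and SNR definition used to build the datasets can be sketched as follows. This is a minimal illustration, not the authors' generator: the LFM start frequency and bandwidth are hypothetical values (the actual ranges are in Table 2), the signal is assumed to be real-valued, and the function names are introduced here. Only the 1 GHz sampling frequency, the 10 µs pulse width, and the −5 dB to 5 dB grid come from the text.

```python
import math
import random

def lfm_pulse(fs_hz, width_s, f0_hz, bandwidth_hz):
    """Real-valued LFM pulse; f0_hz and bandwidth_hz are illustrative values."""
    n = int(round(fs_hz * width_s))
    chirp_rate = bandwidth_hz / width_s  # Hz per second
    return [math.cos(2 * math.pi * (f0_hz * t + 0.5 * chirp_rate * t * t))
            for t in (i / fs_hz for i in range(n))]

def power(x):
    """Mean power of a real-valued sequence."""
    return sum(v * v for v in x) / len(x)

def add_awgn(signal, snr_db, rng):
    """Return signal + n(t), with noise power set from SNR = 10*log10(Ps/Pn)."""
    noise_power = power(signal) / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in signal]

def measured_snr_db(clean, noisy):
    """Empirical SNR of a noisy realisation, given the clean signal."""
    noise = [y - s for s, y in zip(clean, noisy)]
    return 10 * math.log10(power(clean) / power(noise))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
clean = lfm_pulse(fs_hz=1e9, width_s=10e-6, f0_hz=100e6, bandwidth_hz=50e6)
noisy = {snr_db: add_awgn(clean, snr_db, rng) for snr_db in range(-5, 6)}

for snr_db, samples in noisy.items():
    print(snr_db, round(measured_snr_db(clean, samples), 2))
```

The measured SNR of each realisation should lie close to its target value; the small deviation comes from estimating the noise power from a finite number of samples.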

Experiments on the Source Domain Network
In this section, the source domain network described in Section 2 was trained with only three types of intra-pulse modulated radar signals. The source task was to classify these three radar signals as accurately as possible in order to optimize the feature extraction capability of the convolutional layers for 1D radar signals. For this training stage, we added three fully connected layers after the convolutional layers, which contained a hidden layer of 256 neurons and a "Leaky ReLU" activation function. The cross-entropy loss function and the Adam optimizer were used with a learning rate of 0.001 and a batch size of 64, and the network weights with the highest validation accuracy were saved for transfer. The result of the classification task is shown in Table 4. In the subsequent transfer process, we kept only the first four convolutional layers of the source network and discarded the other layers, which contained more semantic information. These optimized lower convolutional layers were more general and could extract structural and detailed features of the radar signal well, even when faced with new data.

Experiments on the Target Domain Network
In this section, we transferred the learned weights to the target network and trained the proposed 1D-TLAFLCNN using the different datasets in the target domain generated in Section 3.1, through the following procedure:
1. Initialize the corresponding convolutional layers of the target domain network with the weights learned from the first four convolutional layers of the source domain network, and freeze these weights;
2. Randomly initialize the parameters of the fully connected layers using a Gaussian distribution;
3. Train the classification layers using the target domain dataset;
4. Fine-tune the entire network by unfreezing all convolutional layers and setting a low learning rate (set to 0.0001) to retrain the entire network in order to incrementally fit the pre-trained features to the new data.
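The four steps can be summarised in a framework-agnostic sketch that only tracks which layers are updated in each stage. It is a schematic under stated assumptions, not the authors' training code: the `Layer` class, the layer names, the number of fully connected layers in the target network, and the learning rate of 0.001 for step 3 are hypothetical; only the four transferred convolutional layers and the fine-tuning rate of 0.0001 come from the text.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    pretrained: bool  # True if the weights were copied from the source network
    trainable: bool   # False means the weights are frozen

def build_target_network(n_conv=4, n_fc=3):
    """Steps 1 and 2: transferred conv layers start frozen, FC layers start fresh."""
    conv = [Layer(f"conv{i + 1}", pretrained=True, trainable=False) for i in range(n_conv)]
    fc = [Layer(f"fc{i + 1}", pretrained=False, trainable=True) for i in range(n_fc)]
    return conv + fc

def training_stage(layers, learning_rate, unfreeze_all=False):
    """Return the learning rate and the names of the layers updated in this stage."""
    if unfreeze_all:
        for layer in layers:
            layer.trainable = True
    updated = [layer.name for layer in layers if layer.trainable]
    return {"lr": learning_rate, "updated": updated}

network = build_target_network()
# Step 3: train only the classification layers (learning rate assumed here).
stage_3 = training_stage(network, learning_rate=1e-3)
# Step 4: unfreeze everything and fine-tune with the low learning rate.
stage_4 = training_stage(network, learning_rate=1e-4, unfreeze_all=True)

print(stage_3)
print(stage_4)
```

In a deep learning framework, the `trainable` flag would correspond to excluding the frozen parameters from gradient updates during step 3 and re-enabling them for step 4.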
When the SNR is 0 dB and the number of training samples is 450 (50 for each signal), the average accuracy value during the training process is shown in Figure 7.

Figure 7 shows that the accuracy of the model on the validation dataset stabilized as the number of epochs increased and became essentially constant once the epoch count reached 90, which indicates that the model had converged. Subsequently, we repeated these experiments on training datasets with different SNRs and different sample sizes and tested the classification performance of the proposed model using the weights with the highest accuracy on the validation dataset. The experiment was repeated 20 times for each case, and the average accuracy was taken as the final classification accuracy. The classification accuracies for the nine intra-pulse modulation signals, based on 1D-TLAFLCNN for the different cases, are given in Table 5.
Table 5 covers the target domain training sets described in Section 3.1, in which the number of samples per signal type increased from 50 to 140 in steps of 10, constituting ten training sets with different sample sizes. It can be concluded that the proposed algorithm performed well across the different small sample sizes, and the average classification accuracy improved as the number of samples increased. Moreover, the proposed method performed well in both noiseless and noisy environments, and the classification accuracy steadily improved with increasing SNR. When the SNR was greater than or equal to −1 dB, the classification accuracy on all datasets was over 90%.

Comparisons with Other Baseline Methods
To show the effectiveness of transfer learning, focal loss, and our proposed adaptive focus loss function (AFL), we constructed five models that differ only in which of these improvements they include, while ensuring that they had the same convolutional layers, convolutional kernel size, fully connected layers, batch normalization layer, etc. We used the 1D-CNN as a blank control group, and denoted the 1D-CNN with only the focal loss function added (we took the default optimal value γ = 2 [23]) as 1D-FLCNN (γ = 2), the 1D-CNN with only transfer learning added as 1D-TLCNN, the 1D-CNN with the focal loss function added on top of transfer learning as 1D-TLFLCNN (γ = 2), and the transfer learning-based AFL proposed in this paper as 1D-TLAFLCNN, as shown in Table 6.
Table 6. Differences between the proposed method and other baseline methods.

Model                 Focal Loss   Transfer Learning   Adaptive Focus Loss
1D-CNN                no           no                  no
1D-FLCNN (γ = 2)      yes          no                  no
1D-TLCNN              no           yes                 no
1D-TLFLCNN (γ = 2)    yes          yes                 no
1D-TLAFLCNN           no           yes                 yes
"yes" means that the model uses this improvement, "no" means it is not used.
During the training process, we found that the three methods using transfer learning achieved higher classification accuracy on the validation set from the very first iterations, because they did not have to learn from scratch. Transfer learning gives the model some prior knowledge when facing new samples and allows it to converge faster. The accuracy of the five models during the training process at an SNR of 0 dB and with 450 training samples (50 for each signal) is shown in Figure 8.
In addition, we compared the number of iterations used to reach convergence and the total time required for the five models in Table 7 and Figure 9.
We also used some representative algorithms as baselines, including CNN-Qu [7], CNN-Wu [12], and CNN-Wei [30]. All of these methods were proposed for intra-pulse modulation classification of radar signals and have been shown to achieve good accuracy.
In Table 8 and Figure 10, we compare the classification accuracy of the method proposed in this paper (1D-TLAFLCNN) with other baseline methods at different sample sizes in the 0 dB case and calculate their average accuracy (AA).
In addition, we compared the classification accuracy of all methods on the test set for different SNRs. Figure 11 shows the variation in average accuracy with SNR when the sample size was 900 (100 for each signal). The experimental results demonstrate that 1D-TLAFLCNN performed best under all SNR conditions, with varying degrees of improvement over the other algorithms. The classification results for the different training sets in Table 8 show that combining transfer learning and AFL yielded a greater improvement than using transfer learning or FL alone: transfer learning reduced the number of generations required for convergence, while AFL, building on FL, estimated γ adaptively and improved the classification accuracy.

AFL Compared with Different Values of the Focusing Parameter Based on FL
To investigate the effect of different values of the focusing parameter γ in FL on classification performance, and thus demonstrate the effectiveness of our proposed method, we compared the average accuracy over all SNRs for several different few-shot sample sizes, as shown in Figure 12. As shown in the figure, for the same classification task, the value of the focusing parameter affected the classification accuracy, as it represents how much attention the model pays to the hard samples. A large value of the focusing parameter (e.g., γ = 5) tended to over-focus the model on some hard samples and bias the trained model toward these "outliers", which is often fatal when training samples are insufficient. In addition, even when the focusing parameter took the same value, the classification accuracy kept fluctuating with the sample size. This is because the number of hard and easy samples changed with the sample size, which means that the classification task also changed, and the same value of the focusing parameter could not suit all classification tasks. Typically, for a new classification task, repeated quantitative tests are needed to find an appropriate value of the focusing parameter γ, and our method is proposed to solve this problem. As we can see in the figure, our proposed method could estimate the range of γ by calculating the proportion of hard and easy samples in the dataset, which was a clear improvement over the fixed integer values of γ.
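The sensitivity to γ can be made concrete with the standard focal loss, FL(p_t) = −(1 − p_t)^γ log(p_t), where p_t is the predicted probability of the true class. The sketch below is an illustration only: it omits the optional class-balancing factor, the probabilities 0.9 ("easy") and 0.2 ("hard") are hypothetical, and it does not implement the adaptive estimation of γ, whose formula is given elsewhere in the paper.

```python
import math

def focal_loss(p_t, gamma):
    """Standard focal loss for one sample (no class-balancing factor)."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

def hard_to_easy_ratio(p_hard, p_easy, gamma):
    """How many times larger the loss of a hard sample is than that of an easy one."""
    return focal_loss(p_hard, gamma) / focal_loss(p_easy, gamma)

P_EASY, P_HARD = 0.9, 0.2  # hypothetical predicted probabilities

for gamma in (0, 1, 2, 5):
    ratio = hard_to_easy_ratio(P_HARD, P_EASY, gamma)
    print(f"gamma={gamma}: hard/easy loss ratio = {ratio:.1f}")
```

With γ = 0 the focal loss reduces to cross-entropy; each unit increase in γ multiplies the hard-to-easy ratio by (1 − 0.2)/(1 − 0.9) = 8 for these two samples, which shows how quickly a large γ lets a few hard samples dominate the loss.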

Effect of Different Noise Environments on Experimental Results
The classification performance of a model in noisy environments is one measure of its stability. To this end, we explored the effect of SNR on the models in both noisy and noiseless environments. We compared the classification performance of the different models for a total of 12 scenarios, from −5 dB to 5 dB along with pure signals without noise, as shown in Table 9 and Figure 13. The accuracy in the chart is the average classification accuracy over the ten different sample sets. It can be seen that our proposed method had the highest average accuracy, 97.86%, in a noise-free environment. The classification performance of all models on noise-free signals was considerably better than in noisy environments, with an average accuracy improvement of 2-5%, since the models can more easily extract features from pure signals, which are often masked in noisy environments. However, even in the −5 dB case, the average accuracy of our proposed algorithm was over 85%. Meanwhile, the recognition capability of the model steadily improved as the SNR increased, which indicates that our algorithm has a degree of noise robustness.

Effect of Different Sample Sizes on Experimental Results
As shown in Table 8 and Figure 10, the proposed method had the best accuracy on the different few-shot datasets, and the average accuracy improved by 8% compared with the traditional CNN algorithm. The comparison between 1D-TLAFLCNN and 1D-TLFLCNN shows that the estimate of γ calculated using AFL gave slightly better classification performance than the default value γ = 2. Overall, the classification results were better when using the transfer learning approach. The method using FL alone (default value γ = 2) yielded improved results in most cases, but the classification accuracy decreased in some cases, which may be because FL focuses excessively on outliers in the samples and causes the model to misclassify. In comparison with the other baseline methods, the proposed method had the best classification accuracy for all few-shot sample sizes. The method using TL alone had a flatter accuracy curve, which increased steadily as the sample size increased. Transfer learning provides the model with some prior knowledge, which gives it good feature extraction ability and effective performance when facing new few-shot problems.
However, as the number of samples increased, our method also showed certain shortcomings: the improvement was not obvious for some sample sets, with an accuracy gain of less than 0.2%. Our analysis suggests that as the sample size increased, the difficulty of model training decreased and the ratio of hard to easy samples tended to become balanced, which reduced the benefit of transfer learning and AFL.

Improvement in Training Model Time Consumption
Besides classification accuracy, fast classification is also necessary for radar reconnaissance systems, which need real-time classification. Shorter training times and faster convergence mean that our models could make predictions faster in real-world applications, helping to speed up our analysis of enemy radar systems.
In real radar reconnaissance systems, real-time training and classification of few-shot unknown signals in a new scene is an important requirement, which calls for a model with fast convergence. Consider an airborne X-band radar with a pulse repetition frequency of $f$ Hz that emits $n$ pulses at one beam position; the time-on-target in one scan is then roughly $n/f$ seconds. From this, we can estimate when the next radar echo will be received, and the model must finish converging before the signal is received again. Therefore, considering this practical application, the number of epochs to convergence and the time spent can serve as criteria for judging the performance of a method.
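As a rough worked example of this time budget, the sketch below computes the time-on-target n/f and compares it with a training time. The pulse count and pulse repetition frequency are hypothetical values chosen for illustration, the 258 s figure is the training time reported for the proposed method, and the function name is introduced here.

```python
def time_on_target_s(n_pulses, prf_hz):
    """Dwell time at one beam position: n pulses at a repetition frequency of f Hz."""
    if prf_hz <= 0:
        raise ValueError("pulse repetition frequency must be positive")
    return n_pulses / prf_hz

# Hypothetical radar parameters (not taken from the paper).
N_PULSES = 64
PRF_HZ = 1000.0
TRAINING_TIME_S = 258.0  # training time reported for the proposed method

dwell = time_on_target_s(N_PULSES, PRF_HZ)
print(f"time-on-target: {dwell:.3f} s")
print(f"training time is {TRAINING_TIME_S / dwell:.0f} times the dwell time")
```

Under these assumed values the dwell time is a small fraction of a second, which illustrates why a training time of several minutes is still far from a real-time budget.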
Comparing the number of iterations and the total training time required to reach convergence for the five models in Table 7 and Figure 9 shows that the proposed method converged in 90 generations and 258 s, reducing the time cost by at least 10%. However, this improvement is still far from sufficient to meet the time requirements for real-time training and testing in practical applications. In future work, we will continue to improve the convergence speed of the model and reduce the training time.

Conclusions
To address the difficulty of training deep convolutional neural networks and the insufficient training data available for radar signals, which lead to low classification accuracy for intra-pulse modulation, a 1D-TLAFLCNN method is proposed for the classification of few-shot intra-pulse modulation radar signals. We used a transfer learning method to transfer the knowledge learned from a large number of simple intra-pulse modulated radar signals in the source domain to a complex modulation classification task in the target domain, and estimated the factor γ adaptively based on FL, which ensured that the model could focus more on the hard samples.
The experimental results show that the proposed method reduced the number of generations to convergence compared with the other baseline methods, and the model converged within 90 generations with the shortest time of 258 s. In the experiments with different values of the focusing parameter, AFL achieved a maximum accuracy improvement of approximately 1.5% over FL and could reduce repeated experiments by estimating the range of the parameter. In addition, the proposed method displayed noise immunity, and its average accuracy was over 85% in the −5 dB case. From further experiments exploring its classification performance on different few-shot datasets, we found that transfer learning helped the model gain rich prior knowledge and feature extraction ability in few-shot cases, improving the average accuracy by 8% compared with the traditional CNN algorithm. However, we are also aware of the shortcomings of our method. First, the focusing parameter derived from AFL was a fixed value, but as training proceeded the proportion of hard and easy samples changed, and a fixed focusing parameter could not adapt to the new proportion, which limited the usefulness of the focal loss function. Second, as the number of samples increased, the improvement from transfer learning was not obvious. In future work, we hope to develop a method that dynamically adjusts the focusing parameter as the ratio of hard to easy samples changes.