Deep Multi-Modal Generic Representation Auxiliary Learning Networks for End-to-End Radar Emitter Classification

Abstract: Radar data mining is the key module for signal analysis, where patterns hidden inside signals gradually become available during the learning process; its superiority is significant for enhancing the security of the radar emitter classification (REC) system. Although the radio frequency fingerprint (RFF) caused by the imperfection of an emitter's hardware is difficult to forge, current REC methods based on deep-learning techniques, e.g., the convolutional neural network (CNN) and long short-term memory (LSTM), find it difficult to capture stable RFF features. In this paper, an online and non-cooperative multi-modal generic representation auxiliary learning REC model, namely the multi-modal generic representation auxiliary learning network (MGRALN), is put forward. Multi-modal means that multi-domain transformations are unified into a generic representation. This representation is then employed to facilitate mining the implicit information inside the signals and to achieve better model robustness, which is realized by using the available generic representation to guide network training and learning. Online means that the learning process of the REC is performed only once and the REC is end-to-end. Non-cooperative denotes that no demodulation techniques are used before the REC task. Experimental results on measured civil aviation radar data demonstrate that the proposed method achieves superior performance.


Introduction
Radar emitter classification (REC), also referred to as specific emitter identification (SEI), is the process of extracting radio frequency fingerprints (RFFs) for device authentication and recognizing emitter individuals based on emitter-specific RFFs. It has not only become increasingly important in some military fields, e.g., air reconnaissance, battlefield surveillance, guidance and command operations, but also has broad application prospects in new hi-tech sectors such as cognitive radio, self-organized networking, air navigation and traffic control [1][2][3]. With the rapid development of radar technology, radar signals are characterized by large quantities and increasing types and density, resulting in a complicated electromagnetic space (noise, multi-path, interference, etc.). How to mine the stable RFF representation hidden inside radar signals has thus become the key factor in enhancing the security of electronic warfare systems. As shown in Figure 1, the REC system mainly consists of several subsystems, i.e., the RF system, data collection, data preprocessing, feature extraction, emitter identification, the radar type management system and the database. Essentially, RFF extraction is a signal recognition task [4][5][6][7][8][9], a cutting-edge scientific problem in signal analysis. The RFF representation is constructed by calculating numerical characteristics from the observations. The main pipeline is to extract the unique RFF device calibration and to recognize the individual radar emitters [10].

Figure 1. The architecture of a typical REC system, where radar signals are obtained by the RF system, sampled by data collection, and processed by the backend RFF representation and classification modules, which provide intelligence support and online updates for the database.
To the best of our knowledge, current RFF representation and recognition approaches have moved from manually designed features to data-driven features; from temporal information mining to transform-domain encoding; from hand-crafted classifiers to automatic deep-learning models; and from multi-step processing pipelines to end-to-end processing. It has been demonstrated that the convolutional neural network (CNN) [36] and long short-term memory (LSTM) [37,38], etc., are among the most effective data-driven techniques for recognizing individual emitters. Although the above REC methods have achieved superior performance, a long observation duration is usually expected to extract stable features. However, the REC system often suffers from short-data and data-hungry problems, in which case it is very difficult to construct a compact RFF representation.
Aiming at the challenges above, an end-to-end REC method based on multi-modal features, namely the multi-modal generic representation auxiliary learning network (MGRALN), is proposed. Multi-modal refers to the different time or frequency transformations. We also notice that the same classifier applied to different transformations may obtain different results, in that multi-modal features have various physical meanings, i.e., feature heterogeneity. Thus, the signals and corresponding transformations are mapped into a mutual subspace to form the generic RFF representation (auxiliary branch), which is meanwhile utilized to assist the deep-learning architecture in mining the RFF representation from the signals (main branch). With the auxiliary branch guiding the learning of the main branch, the multi-modal information is unified as the generic representation and is incorporated into the main branch for the REC. Finally, when the auxiliary branch is removed from the MGRALN, the stable RFF remains available from the main branch, achieving a signal-to-prediction REC task. The MGRALN is an online end-to-end REC method with a strong feature extraction ability that benefits from multi-modal features. Additionally, it is worth noting that this work not only has academic value and broad application prospects, but also has practical significance for improving electromagnetic spectrum perception.
The remainder of this paper is organized as follows. Section 2 gives a detailed description of the signal model and the proposed method, which is followed by the numerical results in Section 3. Conclusions are given in Section 4.

Signal Model
Consider the signal in a digital communication system, where s[k] is the k-th symbol signal, representing the amplitude s_k and the phase θ_k:

s[k] = s_k e^(jθ_k).

Thus, the transmitted signal s_b(t) can be expressed as

s_b(t) = Re{ Σ_k s[k] g(t - kT) e^(j2πf_c t) },

where g(t) is the pulse-shaping function, T is the symbol period, and f_c is the carrier frequency. For the RF system, we have

r(t) = s_b(t) + n(t),

where r(t) represents the received or intercepted signal and n(t) is noise. Generally, using two orthogonal carriers, the complex baseband signal r_c(t), shown in Figure 2, is obtained as

r_c(t) = r(t) cos(2πf_c t) - j r(t) sin(2πf_c t).

The intercepted signal in a communication system is usually converted to an orthogonal double-channel zero intermediate-frequency signal, also referred to as a complex baseband I/Q signal, through digital down conversion (DDC). Because the REC is usually performed under non-cooperative circumstances, the classification results may be unsatisfactory if demodulation techniques are used to demodulate the intercepted signal; additionally, demodulation errors may accumulate in the subsequent REC. Thus, this paper does not adopt any demodulation technique to obtain the modulation type of the intercepted signal.
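As a rough illustration of the DDC step above, the following numpy sketch mixes a real intercepted signal with two orthogonal carriers and low-pass filters the products to obtain the complex baseband I/Q signal. A simple moving-average filter stands in for a real DDC's decimating low-pass filter, and the sample rate and carrier values are illustrative, not taken from the paper's receiver:

```python
import numpy as np

def digital_down_convert(r, fs, fc, lp_taps=33):
    """Mix a real RF signal down to complex baseband I/Q (illustrative DDC).

    r: real-valued received samples, fs: sample rate (Hz), fc: carrier (Hz).
    A moving average stands in for a proper decimating low-pass filter.
    """
    n = np.arange(len(r))
    i = r * np.cos(2 * np.pi * fc * n / fs)    # in-phase mixer output
    q = -r * np.sin(2 * np.pi * fc * n / fs)   # quadrature mixer output
    h = np.ones(lp_taps) / lp_taps             # crude low-pass filter
    i_f = np.convolve(i, h, mode="same")
    q_f = np.convolve(q, h, mode="same")
    return i_f + 1j * q_f                      # complex baseband r_c[n]

# A pure carrier should land near DC after down-conversion.
fs, fc = 12.5e6, 2.0e6
t = np.arange(4096) / fs
rc = digital_down_convert(np.cos(2 * np.pi * fc * t), fs, fc)
```

Mixing a carrier-frequency tone with itself produces a DC term plus a double-frequency term; the low-pass filter removes the latter, leaving the baseband component.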

MGRALN-Based REC
(1) Multi-modal Transformations Selection: The first step of the MGRALN-based REC method is to select the multi-modal transformations, in that the signal transformation determines the distinctiveness of the RFF. In principle, the desired transform layer should satisfy the following properties: the inputs of the auxiliary branches require high discrimination; all the transformations should be online with good adaptivity; and the training parameters are updated not layer-wise but simultaneously. This paper focuses on the guiding model to construct the mutual subspace and to learn a robust RFF representation. Here, we select the following existing features. The signal envelope (SE) presents transient information, such as the change of the signal edge, the pulse width, the peak position, and the rising and falling edges of the signal.
Ambiguity function (AF) is mainly used to measure the distinguishability of the target in distance and velocity dimensions. The ambiguity function decreases rapidly along the axis of the frequency offset. We select several frequency offset slices of ambiguity function to act as transform layers. The slices AF0, AF2 and AF4 represent the features when the frequency offset is set to 0 Hz, 2 Hz, and 4 Hz, respectively.
Power spectrum density (PSD) represents the change of signal power with the frequency, which is defined as the Fourier transform of the autocorrelation function of radar signals.
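The three transform layers described above can be sketched in numpy as follows. These are generic textbook definitions (envelope magnitude, a fixed-frequency-offset ambiguity-function slice via lag correlation, and a periodogram-style PSD via the Wiener-Khinchin relation), not the paper's exact implementations:

```python
import numpy as np

def signal_envelope(x):
    """SE: instantaneous envelope of a complex baseband signal."""
    return np.abs(x)

def ambiguity_slice(x, fs, offset_hz):
    """One frequency-offset slice of the ambiguity function.

    Returns |AF(tau, nu)| over all delays tau for a fixed offset nu,
    computed as the correlation of x with its frequency-shifted copy.
    """
    n = np.arange(len(x))
    shifted = x * np.exp(1j * 2 * np.pi * offset_hz * n / fs)
    # numpy conjugates the second argument, giving a proper correlation
    return np.abs(np.correlate(shifted, x, mode="full"))

def power_spectral_density(x):
    """PSD: Fourier transform of the autocorrelation (Wiener-Khinchin),
    computed here directly as the normalized squared magnitude spectrum."""
    return np.abs(np.fft.fft(x)) ** 2 / len(x)
```

For the AF0 slice (zero offset), the correlation peaks at zero lag, and by Parseval's theorem the PSD bins sum to the signal energy.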
(2) The MGRALN Model: The MGRALN model for REC is shown in Figure 3. A hierarchy of convolutions is utilized to unify the distributions of the branches; thus, the RFF representation benefits from both the data and transform pipelines. After this, the radar signal and RFF features are aligned by a pixel-wise convolution and are together input to a CNN architecture to recognize the radar emitter individuals. Given a complex baseband signal set (X, Y) = {(x_i, y_i)}, i = 1, ..., N, y_i represents the ground-truth emitter, X ∈ C^(N×2×L), Y ∈ R^(N×1), N denotes the number of signals, and L represents the sampling length. The convolutional feature of the main branch (i.e., the signal pipeline) can be expressed as

x_i^(0) = F_1(x_i; W_0),

where F_1 denotes a stack of Convolution-BN-Swish assembly units and W_0 denotes its learnable parameters. The radar signal is successively passed through the transform layers, the Convolution-BN-Swish assemblies, and the temporal global pooling and consensus unit. After this, with the learnable parameters shared between the signal-based CNN and the consensual-response-based CNN, the final prediction is obtained by fusing P_i^(1) and P_i^(2).
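A minimal single-channel sketch of one Convolution-BN-Swish assembly unit, the building block of F_1, might look as follows. Real MGRALN layers are multi-channel with learned kernels; the fixed smoothing kernel here is purely for illustration:

```python
import numpy as np

def conv1d(x, w):
    """Valid 1-D convolution of a single-channel sequence x with kernel w."""
    k = len(w)
    return np.array([np.dot(x[i:i + k], w) for i in range(len(x) - k + 1)])

def batch_norm(x, eps=1e-5):
    """Normalize activations to zero mean / unit variance (inference-style,
    without the learned scale and shift parameters)."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def swish(x):
    """Swish activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def conv_bn_swish(x, w):
    """One Convolution-BN-Swish assembly unit, as stacked in F_1."""
    return swish(batch_norm(conv1d(x, w)))

# Apply one unit to a toy waveform with a fixed smoothing kernel.
out = conv_bn_swish(np.sin(np.linspace(0, 8 * np.pi, 64)),
                    np.array([0.25, 0.5, 0.25]))
```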
The signal transformations follow similar processing; thus,

x_i^(k) = conv(T_k(x_i); W_k),

where k represents the branch index, k ∈ [1, K], K is the total number of auxiliary branches, W_k denotes the learnable parameters, T_k denotes the k-th transform layer, and conv is a convolution layer.
To avoid enormous parameters, the parameters are shared among all the paths. After randomly initializing the weights, the signal and multi-modal paths are activated. Then, the convolutional responses are concatenated, and the concatenated responses are supplied to the consensual function as well as a softmax layer. Denote M as the response of the consensual function; we construct three types, as follows.

Element-wise Average (EA): M = (1/(K+1)) Σ_{k=0}^{K} x_i^(k).

Element-wise Multiplication (EM): M = x_i^(0) ⊙ x_i^(1) ⊙ ⋯ ⊙ x_i^(K), where ⊙ represents element-wise multiplication.

Element-wise Concatenation (EC): M = [x_i^(0); x_i^(1); ⋯; x_i^(K)], i.e., the channel-wise concatenation of the responses.
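The consensual functions can be sketched as follows. The response shapes and the handling of EC (plain channel concatenation, without the subsequent realignment convolution) are illustrative assumptions:

```python
import numpy as np

def consensus(responses, mode="EA"):
    """Fuse K+1 aligned convolutional responses into consensual feature M.

    responses: array-like of shape (K+1, C, L). EC is shown as a plain
    concatenation along the channel axis (any realignment convolution
    that restores the channel count is omitted here).
    """
    r = np.asarray(responses)
    if mode == "EA":                      # element-wise average
        return r.mean(axis=0)
    if mode == "EM":                      # element-wise multiplication
        return np.prod(r, axis=0)
    if mode == "EC":                      # element-wise (channel) concatenation
        return np.concatenate(r, axis=0)
    raise ValueError("unknown consensus mode: %s" % mode)
```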
Next, the obtained consensual responses and radar signals are together input to a CNN architecture for the preliminary predictions,

P_i^(1) = H(F_2(x_i^(0); W)) and P_i^(2) = H(F_3(M; W')),

where O is the number of radar emitters, F_2 and F_3 are the CNNs, H is a softmax function, and W and W' represent their learnable parameters, satisfying W = W', i.e., F_2 = F_3. Finally, the emitter is predicted by fusing the preliminary predictions above.

(3) The learning principle: To achieve an effective REC task, a predefined loss for the MGRALN model is expected to realize the parameter update. For the REC, the deep-learning networks in the MGRALN minimize a cross-entropy loss,

L_R = -Σ_{o=1}^{O} y_{i,o} log P_{i,o}.

In order to capture the generic RFF, an additional learning principle is defined for consistency. For a radar signal, the consensual loss is given by

L_c = Σ_{r=1}^{K} (1 - <x_i^(0), x_i^(r)> / (|x_i^(0)| |x_i^(r)|)) + Σ_{v<q} (1 - <x_i^(v), x_i^(q)> / (|x_i^(v)| |x_i^(q)|)),

where <·,·> denotes the inner product operator, |·| denotes the modulus operator, and r, v, q ∈ {1, 2, ⋯, K}. The first term keeps the distribution consistency between the signal and each transformation, while the second term guarantees a high correlation between the transformations.
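One plausible instantiation of the per-sample losses can be sketched as follows. The paper defines the consensual loss through inner products and moduli and leaves its exact form open; the normalized cosine-similarity version below is one assumption-laden reading:

```python
import numpy as np

def cross_entropy(p, y):
    """L_R for one sample: p is the softmax output, y the emitter index."""
    return -np.log(p[y] + 1e-12)

def cosine(a, b):
    """Normalized inner product <a, b> / (|a| |b|)."""
    return np.vdot(a, b).real / (np.linalg.norm(a) * np.linalg.norm(b))

def consensual_loss(x0, branches):
    """A sketch of L_c: penalize low similarity between the signal feature
    x0 and each transform feature, and between transform-feature pairs.
    This cosine instantiation is an assumption, not the paper's exact form."""
    sig_term = sum(1.0 - cosine(x0, b) for b in branches)
    pair_term = sum(1.0 - cosine(branches[v], branches[q])
                    for v in range(len(branches))
                    for q in range(v + 1, len(branches)))
    return sig_term + pair_term
```

When all branch features point in the same direction, both terms vanish, matching the intent that the mutual subspace aligns the distributions.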
Considering the consensual loss evaluated on all the training data, we have L_C = Σ_{i=1}^{N} L_c. The final objective function is given by

L = U(L_R, L_C) = L_R + λ L_C,

where U represents the associated learning function and λ is a compromise between the recognition and the construction of the mutual subspace. However, optimizing the two terms naively would confuse the MGRALN architecture, because it not only recognizes radar emitter individuals, which requires distinctive features, but also constructs the RFF representation in the auxiliary pipeline. A suboptimal cross-training method could be proposed by alternately optimizing L_R and L_C; in practice, with a CNN it is difficult to ascertain whether the network training would be synchronized under the two learning criteria. This paper therefore proposes associated learning, as given by the objective above. Taking the derivative of the associated loss L with respect to the training parameters W_0, W_k, W, W' and W_M, the learnable parameters of the MGRALN are updated. That is, the overall loss of the MGRALN is a weighted sum of the cross-entropy loss and the consensual loss, and the gradients of L with respect to W_0, W_k, W, W' and W_M are applied simultaneously. After training, if the signal alone is input to the model (removing the auxiliary branches), the learned model parameters still enable one to recognize the radar emitter individuals in an end-to-end manner, as shown in Figure 4.
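The simultaneous (associated) update, as opposed to alternating cross-training over L_R and L_C, can be sketched as a single step that applies both gradient contributions to every parameter group at once. The parameter names and the dictionary interface are illustrative:

```python
def associated_step(w, grad_lr, grad_lc, lam=1.0, eta=1e-4):
    """One simultaneous gradient-descent update of all learnable parameters
    under the associated loss L = L_R + lambda * L_C.

    w, grad_lr, grad_lc: dicts mapping parameter-group names to values and
    to the gradients of L_R and L_C, respectively (names are hypothetical).
    """
    return {name: w[name] - eta * (grad_lr[name] + lam * grad_lc[name])
            for name in w}
```

In contrast, cross-training would apply grad_lr and grad_lc in separate alternating steps, with no guarantee that the two criteria stay synchronized.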

Data Collection and Implementation Details
Our data are collected from seven airplanes, and each of them contains 200 files. For each airplane, we sample 100 snapshots, including the I and Q components. Five snapshots are collected per second, according to the hardware sampling rate of 12.5 MHz and the signal duty ratio of 0.01/0.2 = 5%. The radar scanning cycle is 6 s, and the 100 snapshot signals are scanned three times by the main lobe of the radars, which results in strong signals. Dataset I consists of 608 samples collected from seven airplanes, and dataset II contains 6123 samples collected from 15 airplanes. The receiving and processing system for the civil aviation radar signals is shown in Figure 5. Note that the receiver does not apply any modulation classification technique to obtain the modulation format. We consider the duration of each sequence to be 8 µs and 46 µs. Meanwhile, we consider the difference in the number of samples between classes, both balanced (taking dataset I for instance, see radars 1, 2, 3 and 6, or radars 5 and 7) and imbalanced (see radars 4 and 5), as shown in Figure 6.

For dataset I and dataset II, the CNNs follow different design schemes. For dataset I, the CNN consists of three assembly units with 64, 128 and 256 convolutional kernels, respectively, and the three convolutional layers in the Convolution-BN-Swish assembly contain 128, 256 and 100 kernels, respectively. The last layer has 100 kernels mainly to guarantee that the mutual feature M in the mutual subspace has the same dimension as x_i. For dataset II, the CNN contains five similar assembly units with 64, 128, 256, 512 and 1024 convolutional kernels, respectively, and the corresponding convolutional layers in the Convolution-BN-Swish assembly contain 256, 512 and 500 kernels, respectively. To increase the receptive field in the temporal dimension, the kernel size of the first convolutional layer in the CNN is 1, and the stride size is adjusted to 2. The spatial size of the pooling layers is 2.
For simplicity, the tradeoff parameter λ in the associated loss is set to 1, and the dropout ratio is set to 0.5. In the training stage, the MGRALN is trained on a GeForce RTX 3090 Ti GPU. Meanwhile, the learnable parameters are optimized using adaptive moment estimation with a learning rate of 0.0001 [39]. Because radar signals are one-dimensional, a mandatory relation of relative spatial position might be unduly imposed by 2D or 3D convolution operators. Due to the lack of spatial correlation in random and unpredictable radar signals, one-dimensional convolution offers obvious advantages [40]. Since a pooling operator tends to halve the spatial size of the convolutional feature relative to its input, the number of convolutional kernels is doubled to enhance the representational ability of the model [41]. We conform to this practice, and the convolutional responses are arranged so that the channels form the larger dimension. We also employ global average pooling to summarize the temporal convolutional responses and to increase the temporal receptive field.

Recognition Performance of MGRALN
(1) Consensus comparisons. The consensual function is a key factor in the MGRALN due to its vital influence on recognition performance, and its construction remains flexible and challenging. In this section, we give a detailed ablation study of the contribution of the consensual function M to the recognition performance. The form of the consensual function is an open problem; here, we compare the accuracy of the MGRALN with the SE transformation on dataset I under different consensus functions, including (i) EA, (ii) EM, and (iii) EC. As shown in Figure 7, the MGRALN with the EA consensus achieves superior accuracy, outperforming the MGRALN with EM and the MGRALN with EC. In the traditional machine-learning REC task, SE does not seem to be a comprehensive representation of the RFF; nevertheless, the MGRALN still enables one to recognize the radar emitter individuals when the generic RFF representation generated by the signal and the SE transformation acts as the auxiliary pipeline. In the following experiments, we select EA as the default consensual function.
(2) Accuracy comparisons of MGRALN when K = 2. This section compares the MGRALN with the signal and a single transformation; that is, the generic RFF representation, which is utilized to assist the training of the signal pipeline, is obtained by fusing the signal and one transformation. Specifically, we choose the signal and only one transformation, i.e., SE, AF or PSD, as the input of the auxiliary branch. Figure 8 shows histograms of the recognition accuracy of the MGRALN, traditional SVM methods, and the deep-learning-based CNN method on dataset I. Observe that less training data leads to network over-fitting, thereby resulting in inferior recognition performance. As the training ratio increases, the MGRALN gradually fits the data distribution and the recognition accuracy improves. Compared with the feature-based SVM and CNN methods, the MGRALNs improve the recognition performance.
It can also be observed that PSD_SVM or PSD_CNN performs better than PSD_MGRALN when K = 2. The main reason can be ascribed to two factors. On the one hand, for AF0_MGRALN and SE_MGRALN, the generic representation is achieved by the signal and SE or the signal and AF0, both of which are temporal operators. In contrast, the generic representation of PSD_MGRALN is constructed from the signal and the PSD feature by unifying time and frequency features, which essentially exhibit a large distribution gap.

(3) Accuracy comparisons of MGRALN when K = 3. As shown in Figure 9a,b, the training ratio of 30% is the critical point beyond which the data-driven MGRALN method outstrips the traditional SVM method. It is worth noting that the performance improvement is large, especially when AF0 acts as an auxiliary branch, which can be ascribed to the transform having high overlap with the radar emitter signal. Compared with the PSD and SE, AF0 does not seem to be a high-quality distinctive transform; however, when it is utilized as the transform layer, the performance of the MGRALN instead matches that based on the PSD. We conjecture that AF has a large overlap with the radar emitter signal in the mutual subspace. Since SE seems to be a relatively weak representation of the specific radar emitter, we compare the recognition performance of SE_MGRALN with the support vector machine (SE_SVM) and CNN (SE_CNN), and observe that SE_MGRALN still achieves better recognition performance. In our work, the CNN is one effective tool for breaking the distribution isolation; that is, any data-driven technique can act as an available tool to achieve this goal.
The MGRALN differs from many existing REC approaches in that it expects explicit insight into the number of auxiliary branches and into which combinations of auxiliary branches can achieve a higher-quality RFF. This section introduces dual transform layers and demonstrates their effect on measured radar emitter data. For comprehensiveness, the dual auxiliary branches are selected by random sampling with replacement, i.e., both items are randomly selected from PSD, AF0 and SE. In order to eliminate the influence of network over-fitting, we set the training ratio to 80% and explore the recognition performance of the MGRALN when K = 3, i.e., the signal and dual transformations. Table 1 shows the recognition results on dataset I. The diagonal entries denote that the signal and two transformations of the same type act as the inputs, which is, in essence, a sort of boosting in ensemble learning. Despite this encouraging progress, the MGRALNs with K = 3 are still marginally inferior to those with K = 2, which can be ascribed to two aspects. One possible reason is that the form of L_C is an open problem that directly affects the quality of the auxiliary RFF representation. Additionally, under the two learning rules L_R and L_C, it is very challenging to guarantee the synchronization of model learning.
Taken together, the cross-entropy loss and the consensual loss have a perceptible effect on the recognition performance of the REC, because the former is the foundation of the latter, and both L_R and L_C determine the final recognition performance. This paper, in its current form, does not dedicate excessive attention to the learning principles, but focuses its best efforts on how to construct a stable RFF formulation.
(4) Accuracy comparisons of MGRALN without auxiliary pipelines. It can be observed that the MGRALN has the ability to project the signal and its transformations into the mutual subspace. The proposed MGRALNs with the auxiliary pipeline have been demonstrated to yield competitive performance, using transform layers with back-propagation ability to embed the transformations inside the learning of the MGRALN. Although such a practice avoids extracting the RFF features offline, there is still a lack of an intuitive demonstration of the effects of the auxiliary pipelines. In view of this, for a pure signal-to-prediction module, the recognition accuracy of the MGRALN (maximum accuracy) is shown in Table 2. Observe that the MGRALN with the AF0 pipeline removed achieves superior recognition accuracy on dataset I. For dataset II, the MGRALN with the SE pipeline removed is superior to the other forms. Table 3 shows the accuracy comparisons of the end-to-end MGRALN with the mainstream methods, including SVM, LSTM and a combination of CNN and LSTM (CNN-LSTM). On dataset I and dataset II, the recognition accuracy of the MGRALN outperforms the SVM recognition method by large margins of 22.2% and 3.6%, respectively. Compared with the deep-learning radar emitter recognition method that inputs signals to an LSTM, the MGRALN yields accuracy improvements of 37.7% and 12.3%. Further, considering the advantages of the CNN and LSTM, we compare the MGRALN with a CNN concatenated with an LSTM, over which the MGRALN achieves 5.3% and 1.4% accuracy boosts.
Compared with the CNN, the computational complexity of the proposed MGRALN is concentrated mainly in the Convolution-BN-Swish assembly structure. For deep-learning architectures, floating point operations (FLOPs) are utilized to calculate the complexity. For a convolutional layer, the FLOPs are 2 × C_i × K_i × C_o × L_o, where C_i is the number of input channels, K_i is the kernel size, C_o is the number of output channels, and L_o is the feature size of the output. Specifically, the total FLOPs for dataset I are on the order of O(3.43 × 10^6) and O(8.74 × 10^7), respectively. For the MGRALN without auxiliary branches, the computational complexity at test time is the same as that of the CNN. Compared with the MGRALN and CNN, the SVM has a lower computational complexity. The proposed MGRALN achieves superior radar emitter recognition by bringing weak signal-to-recognition methods to high-level end-to-end prediction. From a scientific standpoint, the MGRALN attempts to demonstrate the validity of distribution unification. From the perspective of engineering applications, the MGRALN performs end-to-end REC in a complex electromagnetic environment. From the perspective of the model, the MGRALN is a new model for online non-cooperative REC, which can be treated as an alternative scheme, or as a complementary approach to the mainstream approaches, e.g., multi-path REC classifiers.
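The per-layer FLOPs formula above is straightforward to evaluate directly. The layer shapes below are hypothetical and only exercise the formula; they are not the MGRALN's actual configuration:

```python
def conv1d_flops(c_in, k, c_out, l_out):
    """FLOPs of one 1-D convolutional layer: 2 * C_i * K_i * C_o * L_o
    (each multiply-accumulate counted as two operations)."""
    return 2 * c_in * k * c_out * l_out

# Hypothetical two-layer stack: I/Q input (2 channels) -> 64 -> 128 channels.
total = conv1d_flops(2, 3, 64, 100) + conv1d_flops(64, 3, 128, 50)
```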

Conclusions and Future Work
This paper aims to solve the radar emitter classification (REC) problem in which an unstable RFF is easily forged, masked, or damaged by noise and interfering radar signals. This motivates us to construct a more stable RFF representation by fusing multi-modal features to facilitate deep-learning model training, which can be embedded inside the process to further mine the radar signal. The resultant model is the MGRALN.
Compared with SVM, CNN, LSTM, etc., the advantage of the MGRALN is its superior recognition performance on two measured civil aviation radar datasets. Additionally, our MGRALN is an online end-to-end prediction architecture for the REC task, especially in cases where only radar signals are available and no demodulation techniques are used.
Despite its validity in the civil airplane scenario, the proposed MGRALN is restricted by the design of the auxiliary branches, in that their construction is an open problem that determines the generic representation. Additionally, the proposed MGRALN cannot completely replace the existing REC methods; it is just one feasible path. From the perspective of signal mining, it provides a sparse yet distinctive signal representation for signal analysis, which may be applied to other signal recognition tasks, e.g., automatic modulation classification. Directions for future work include the construction of the learning principle, more insight into the internal operation and interpretability of the RFF mechanism, boosting [42], and multi-path feature or classifier fusion [43,44], including the complementation with CNN, LSTM, BiLSTM, etc. Additionally, some attempts may focus on deeper model construction and attention mechanisms, including transformer-based REC [45].