A Novel Data-Driven Specific Emitter Identification Feature Based on Machine Cognition

Abstract: Machine learning is becoming increasingly promising in specific emitter identification (SEI), particularly in feature extraction and target recognition. Traditional features, such as radio frequency (RF), pulse amplitude (PA), and power spectral density (PSD), usually show limited recognition performance when only slight differences exist between radar signals. Numerous two-dimensional features in transform domains, such as various time-frequency representations and the ambiguity function, are used to increase the available information, but they usually incur an unacceptable computational burden. To solve this problem, some carefully handcrafted features in the transform domain have been proposed, such as the representative slice of the ambiguity function (AF-RS) and the compressed sensing mask (CS-MASK), to extract representative information that contributes to the machine recognition task. However, most handcrafted features only utilize a neural network as a classifier, and few of them focus on mining deep informative features from the perspective of machine cognition. Such feature extraction, which is based on human cognition instead of machine cognition, may miss some seemingly negligible texture information that actually contributes greatly to recognition, or collect too much redundant information. In this paper, a novel data-driven feature extraction method based on machine cognition (MC-feature) is proposed, which resorts to saliency detection. Saliency detection exhibits positive contributions and suppresses irrelevant contributions in a transform domain with the help of a saliency map calculated from the accumulated gradients of each neuron with respect to the input data. Finally, the positive and irrelevant contributions in the saliency map are merged into a new feature. Numerous experimental results demonstrate that the MC-feature can greatly strengthen the slight intra-class differences in SEI and provides a possibility of interpreting CNNs.


Introduction
Feature extraction is an important part of specific emitter identification (SEI), but it is challenging: in the modern complex electromagnetic environment, SEI tasks can no longer be accomplished by relying only on primitive properties of radar signals, such as radio frequency (RF), pulse amplitude (PA), pulse width (PW), and power spectral density (PSD) [1][2][3][4]. Various two-dimensional (2D) transform features, such as the short-time Fourier transform (STFT), wavelet transform (WT), S transform (ST), Wigner-Ville distribution (WVD), and ambiguity function (AF), are used as the feature input to the classifier in order to represent more comprehensive information in a feature. Although they can achieve good results, the information representation capacity of a transformed feature and its data dimensionality are always in tension. To address this problem, some compressive sensing (CS) extraction methods have been proposed: Wang, L. and Ji, H. propose a representative slice of the ambiguity function (AF-RS) in [5], which selects the most representative slice in a

2D Transform Domain Feature
With the increasing complexity of signal forms and modulation modes, particularly for signals with time-varying information, traditional one-dimensional features, such as RF, PA, PW, and PSD, can hardly satisfy the recognition requirements. Many two-dimensional transform domain features have been proposed to overcome the inadequate representation of one-dimensional features. These features can be divided into primitive 2D transform features and handcrafted 2D transform features, which are introduced in Section 2.1 and Section 2.2, respectively.

Primitive Transform Domain Feature
Numerous 2D transforms or representations have been proposed, such as STFT, WT, ST, WVD, and AF. These features can capture the time-varying frequency or amplitude information of signals, which gives them a larger representation capacity than one-dimensional features. In this section, STFT and AF are introduced in detail, because the MC-feature and the two comparative features are based on STFT and AF, respectively.

Short Time Fourier Transform
STFT is a widely used time-frequency representation tool in signal processing. STFT transforms signals into the time-frequency domain by performing the Fourier transform in a fixed window that traverses the entire time domain. The STFT of a given signal $x(t)$ with a window $g(t)$ in the Schwartz class is defined as [13]:

$$V_g x(\tau, f) = \int_{-\infty}^{+\infty} x(t)\, g^{*}(t - \tau)\, e^{-j 2\pi f t}\, dt \quad (1)$$

where $g^{*}(t)$ is the complex conjugate of $g(t)$, $\tau$ is the time delay, and $f$ is the frequency. Compared with one-dimensional features, STFT can reflect the variation of the frequency of each component with time; hence, STFT holds a larger capacity of information representation. Nonetheless, the fixed width of the time window limits the time-frequency resolution. To ease this limitation, numerous time-frequency transform methods with variable time-frequency resolution have been proposed, such as WT and ST. Even though the time-frequency resolution of STFT is limited, its concise and easily implemented mathematical form makes STFT one of the most widely used time-frequency analysis tools in signal processing. Here, four typical frequency modulation (FM) signals are taken as examples to show the effects of STFT: single-frequency $x_1(t)$, linear frequency modulation (LFM) $x_2(t)$, second-order frequency modulation $x_3(t)$, and triangle frequency modulation $x_4(t)$. The sampling frequency and sampling time are set to 256 Hz and 1 s, respectively. The specific forms of the signals are as follows:

$$x_4(t) = \sin\left(2\pi \cdot 10 \cos(2\pi \cdot 0.5\, t)\right) \quad (5)$$

Figure 1 shows the waveform and STFT of each signal. It can be seen that STFT clearly reflects the variation of frequency with respect to time, whereas the time-frequency resolution is fixed regardless of whether the frequency is low or high.
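As a rough illustration of the STFT described above, the following Python sketch computes a discrete STFT of the signal in Equation (5) at the stated 256 Hz sampling frequency over 1 s. The 64-sample Hann window, the hop of 8 samples, and the helper name `stft` are assumptions made for this example; they are not taken from the paper.

```python
import numpy as np

def stft(x, window, hop):
    """Discrete STFT: slide `window` over `x` in steps of `hop` and FFT each frame.

    The window is real, so no conjugate is needed. The phase of each frame is
    referenced to the start of that frame, which does not affect the magnitude.
    """
    n_win = len(window)
    n_frames = 1 + (len(x) - n_win) // hop
    frames = np.stack([x[i * hop : i * hop + n_win] * window
                       for i in range(n_frames)])
    # Rows: frequency bins from 0 to fs/2; columns: time frames.
    return np.fft.rfft(frames, axis=1).T

fs = 256                       # sampling frequency in Hz (from the text)
t = np.arange(fs) / fs         # 1 s of samples (from the text)
x4 = np.sin(2 * np.pi * 10 * np.cos(2 * np.pi * 0.5 * t))   # Equation (5)

window = np.hanning(64)        # assumed window: 64-sample Hann
V = stft(x4, window, hop=8)    # assumed hop: 8 samples
spectrogram = np.abs(V) ** 2   # energy per time-frequency cell
```

With these assumed settings the frequency bins are spaced 4 Hz apart for every frame, which is the fixed resolution the text refers to.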

Ambiguity Function
AF is the inverse Fourier transform of the instantaneous autocorrelation function of a signal over the time variable. Unlike STFT, which transforms signals into the joint time-frequency domain (TFD), AF transforms signals into the time lag-Doppler lag domain [7]. The AF of a signal $x(t)$ is defined by:

$$A_x(\tau, \nu) = \int_{-\infty}^{+\infty} x\left(t + \frac{\tau}{2}\right) x^{*}\left(t - \frac{\tau}{2}\right) e^{j 2\pi \nu t}\, dt \quad (6)$$

where $\tau$ is the time-lag and $\nu$ is the Doppler-lag. It is clear from (6) that, when the Doppler-lag is 0, the AF reduces to the integral of the instantaneous autocorrelation function of the signal with respect to time. Therefore, the peak value of the AF is located at the origin, according to the properties of the autocorrelation function. Figure 2 shows the AF of each signal. As Figure 2 shows, the energy is mostly concentrated near the origin in the AF domain, and the differences between the signals can be easily detected from the different shapes of the sidelobes.
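The definition above can be sketched numerically as follows. This is a minimal discrete approximation, not the implementation used in the paper: the function name `ambiguity_function`, the zero-padding at the signal edges, and the integer lag index `m` (which corresponds to a time-lag of 2m samples) are assumptions of the example.

```python
import numpy as np

def ambiguity_function(x):
    """Discrete AF: inverse FFT over time n of r[m, n] = x[n + m] * conj(x[n - m]).

    Returns an array indexed [lag, doppler]; the origin is at [N // 2, N // 2].
    Samples outside the signal are treated as zero.
    """
    x = np.asarray(x, dtype=complex)
    N = len(x)
    lags = np.arange(-(N // 2), N - N // 2)
    n = np.arange(N)
    r = np.zeros((N, N), dtype=complex)
    for row, m in enumerate(lags):
        valid = (n + m >= 0) & (n + m < N) & (n - m >= 0) & (n - m < N)
        r[row, valid] = x[n[valid] + m] * np.conj(x[n[valid] - m])
    # np.fft.ifft uses the e^{+j...} kernel; multiply by N to undo its 1/N factor.
    A = np.fft.ifft(r, axis=1) * N
    # Move zero Doppler-lag to the centre of the Doppler axis.
    return np.fft.fftshift(A, axes=1)

fs = 256
t = np.arange(fs) / fs
x4 = np.sin(2 * np.pi * 10 * np.cos(2 * np.pi * 0.5 * t))   # Equation (5)
A = ambiguity_function(x4)
origin = A[fs // 2, fs // 2]   # equals the signal energy
```

At the origin the sum reduces to the signal energy, and by the Cauchy-Schwarz inequality no other point can exceed it in magnitude, which is the peak-at-origin property stated in the text.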

Handcrafted Transform Domain Feature
While the primitive 2D features can represent more informative details of signals, their data quantity usually grows as the square of the data quantity of the processed signal, leading to an unacceptable computational burden. With the development of the theory of compressed sensing (CS), some CS-based algorithms have been proposed to alleviate this problem by extracting the most representative parts of primitive 2D features as a new feature with a smaller data quantity. However, such compression inevitably discards some detailed texture information; hence, we usually seek a trade-off between representative capability and data quantity.
In this section, two excellent handcrafted transform domain features that are based on AF, namely AF-RS and CS-MASK, are introduced.

Representative Slice of Ambiguity Function
Because the energy of the AF shows a peak value at the origin, the slices of the AF at Doppler-lags near zero (including the zero-slice) can be considered the major representative feature of radar signals, named the representative slice.
A class-dependent algorithm to classify radar emitters is proposed by Gillespie and Atlas [14,15]; it extracts representative features of radar signals and is followed by a kernel optimization scheme in the AF domain. The Direct Discriminant Ratio (DDR) is used in [14] as the criterion to rank the kernel; it is defined in terms of $\bar{A}_i$, the "average" auto-ambiguity function of class $i$. The frequency offset $\nu$ is usually set near zero in order to rank the points along the major direction of the ambiguity function distribution. By this criterion, AF-RS extracts the slices near the origin in the delay-frequency offset domain as a feature set. The most representative slice is selected as the feature, resulting in a great reduction in data quantity. Figure 3 shows the AF-RS of each signal. This feature, despite its low data quantity (24 × 24), clearly exhibits the differences between the signals in a visible form.
It should be noted that AF-RS still has some flaws that can be improved [7]: (1) a single slice at a fixed frequency offset may neglect slices at other Doppler-lags that contribute to identification; (2) the selection of the representative slice must be implemented by complex feature fusion optimization based on recognition rate feedback, which results in high computational complexity and probably makes real-time optimization infeasible.
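The idea of ranking near-zero Doppler slices by how well they separate classes can be sketched as below. The scoring rule used here, the sum of squared pairwise differences between class-average AF magnitudes, is an assumption chosen for illustration and is not claimed to be the exact ratio of [14]; the function name `rank_doppler_slices` and the synthetic data are likewise hypothetical.

```python
import numpy as np

def rank_doppler_slices(class_mean_af, candidates):
    """Rank candidate Doppler-lag slices from most to least discriminative.

    class_mean_af: array [n_classes, n_lag, n_doppler] of class-average |AF|.
    candidates:    list of Doppler indices to compare (typically near zero).
    Score (assumed): sum over all class pairs and lags of squared differences.
    """
    scores = []
    for k in candidates:
        s = class_mean_af[:, :, k]               # [n_classes, n_lag]
        diff = s[:, None, :] - s[None, :, :]     # every ordered class pair
        scores.append(np.sum(np.abs(diff) ** 2))
    order = np.argsort(scores)[::-1]             # highest score first
    return [candidates[i] for i in order]

# Synthetic example: three classes that differ only on Doppler index 5.
af = np.zeros((3, 8, 9))
af[:, :, 5] = np.arange(3)[:, None]
ranking = rank_doppler_slices(af, [3, 4, 5])
best_slice = ranking[0]
```

Selecting only `best_slice` reproduces the data reduction described above, and also the first flaw noted in the text: information on the discarded slices is lost.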

Compressed Sensing Mask
CS-MASK is an ingenious handcrafted feature extraction method that combines CS theory with the properties of AF and WVD [6]. CS-MASK seeks a small region in the AF that contains the most representative information, by constraining the error between the WVD reconstructed from CS-MASK and the original WVD.
It can be seen from (6) that the two-dimensional inverse Fourier transform of the Wigner-Ville distribution is the ambiguity function, as shown below:

$$A_x(\tau, \nu) = \iint WVD_x(t, f)\, e^{j 2\pi (\nu t + f \tau)}\, dt\, df$$

where $WVD_x(t, f)$ is the Wigner-Ville distribution (WVD) of the signal $x(t)$, a quadratic time-frequency representation tool, defined as:

$$WVD_x(t, f) = \int_{-\infty}^{+\infty} x\left(t + \frac{\tau}{2}\right) x^{*}\left(t - \frac{\tau}{2}\right) e^{-j 2\pi f \tau}\, d\tau$$

Writing the WVD and the ambiguity function as column vectors of size N × 1, the matrix form of the AF is:

$$A_x = \Psi W_D$$

where $\Psi$ is the Fourier transform matrix of size N × N, and $W_D$ is the matrix form of the WVD of $x(t)$.
To reconstruct the Wigner-Ville distribution from the ambiguity function with high accuracy and low data dimension, a CS-based measurement matrix $\Phi$ of size M × N (M ≪ N) is used to extract the large non-zero values in $A_x$. The feature $\Theta$ (M × 1), an M-sparse measurement of the AF, is obtained as follows:

$$\Theta = \Phi A_x$$

The optimal CS-MASK $\Theta_{Best}$ is obtained under a constraint in which $F^{-1}_{2d}$ denotes the two-dimensional (2D) inverse Fourier transform and $\varepsilon$ is a user-specified bound that keeps the error within an acceptable range. Because the 2D Fourier transform of the ambiguity function is the WVD, the efficacy of the CS-MASK feature can be examined by the error between its 2D inverse Fourier transform and the WVD according to Equation (15). Figure 4 depicts the CS-MASK of each signal. The signal length is 256 and the size of the original ambiguity function is 256 × 256; in comparison, the size of the CS-MASK extracted according to Equations (14) and (15) is 24 × 24, a great reduction in feature size.
The representative information contained in CS-MASK can be measured by the error between its recovered WVD and the original WVD. Figure 5 shows the WVD of the original signal and the WVD reconstructed from CS-MASK. Notably, even though CS-MASK only takes a zone of size 24 × 24 near the origin, it contains most of the information needed to reconstruct the complete WVD.
Nevertheless, CS-MASK still has some imperfections [7]: (1) the selection of the mask region is based on an optimization criterion related to the reconstructed WVD, so the computational cost is high, even though the rapid optimization method in two projects is proposed in [6]; (2) the size of the optimized feature is not fixed, but changes with factors such as signal modulation and noise interference.

MC-Feature
In this section, the rationale of the MC-feature is elucidated in detail. As discussed above, 2D features can be regarded as grayscale images, since a neural network perceives a 2D input through the value of each pixel rather than as an object in the sense of human cognition. Therefore, many saliency detection methods used in image processing, such as LRP, input cropping, deconvolution, and gradient algorithms [16], can also be applied in SEI. Image-specific class saliency visualisation (ISCSV), an effective saliency detection technique based on a gradient algorithm, is the theoretical basis of our proposed method. Firstly, ISCSV is introduced in Section 3.1. Subsequently, the production of saliency maps is shown in Section 3.2. Finally, the operation procedures of the proposed method are presented in Section 3.3.

Image-Specific Class Saliency Visualisation
With the rapid development of the computing capability of electronic devices, deep networks, especially deep convolutional neural networks (CNNs) [17,18], have become one of the most prevalent choices for image classification [19,20], and remarkable achievements have been made. However, the inner recognition mechanism of CNNs still lacks a systematic interpretation. Understanding the cognition of CNNs has become increasingly important and necessary when they are applied in special scenarios where almost no error is acceptable, such as driverless automobiles, missile guidance, and military radar image processing. In earlier work, [21] visualised deep networks by seeking, through optimisation in the image space, an input image that maximises the activity of a neuron. More recently, the problem of CNN visualisation was addressed in [22] by the Deconvolutional Network (DeconvNet) architecture, which aims to approximately reconstruct the input of each layer from its output. Yet, both of these methods consume large computational resources [21,22]. Ref. [12] proposed a very handy method to obtain image-specific class saliency visualisation (ISCSV) by calculating the accumulated gradients of each pixel during back propagation. In this paper, ISCSV is used as the visualisation tool to generate a saliency map that reflects the contribution of each pixel in the 2D transform domain to the recognition network. The details of the processing are as follows. Assume that the STFT $V_g x(t, f)$ corresponds to a class $i$. A class score activation function $C_i(\cdot)$ describes the relationship between the input and output of the classification CNN. Generally, a CNN contains multiple convolutional and fully connected layers with their corresponding activation functions; hence, $C_i(\cdot)$ is a highly non-linear function.
The complete Taylor expansion of $C_i(\cdot)$ in the neighbourhood of $V_g x(t, f)$ contains terms of all orders. Here, we approximate $C_i(\cdot)$ with a linear function in the neighbourhood of $V_g x(t, f)$ by computing the first-order Taylor expansion:

$$C_i(V) \approx C_i(V_g x) + \left.\frac{\partial C_i}{\partial V}\right|_{V_g x} \left(V - V_g x\right)$$

Usually, there is a loss function $J(C_i, L)$ with respect to $C_i$ and the true labels $L$ that measures the error between the current output of the NN and the target labels. Noting that the terms of the expansion that do not depend on $V$ can be regarded as constants, the quantities of interest are the partial derivatives of $C_i(\cdot)$ and $J$ with respect to the STFT $V$, evaluated at $V_g x$:

$$\left.\frac{\partial C_i}{\partial V}\right|_{V_g x}, \qquad \left.\frac{\partial J}{\partial V}\right|_{V_g x} \quad (19)$$

It is clear from Equation (19) that the magnitude of the derivative indicates which pixels need to be changed the least to affect the class score the most; therefore, once the gradient of each pixel is obtained, it is feasible to detect the positively contributing pixels in a 2D feature as well as the irrelevant ones.
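The gradient-based reasoning above can be checked on a toy example. The sketch below uses a small hypothetical two-layer score function in place of a CNN (the sizes, random weights, and function names are all assumptions), computes the gradient of the class score with respect to the input, and compares the first-order Taylor prediction with the true score after a small perturbation.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, hidden = 4, 4, 5                 # toy sizes, chosen only for illustration
W1 = rng.normal(size=(hidden, H * W))  # hypothetical first-layer weights
b1 = rng.normal(size=hidden)
w2 = rng.normal(size=hidden)           # hypothetical output weights for class i

def class_score(V):
    """Toy non-linear class score C_i(V) for a 2D input V."""
    return w2 @ np.tanh(W1 @ V.ravel() + b1)

def score_gradient(V):
    """Analytic gradient dC_i/dV, same shape as V (one backward pass)."""
    a = np.tanh(W1 @ V.ravel() + b1)
    return (W1.T @ (w2 * (1.0 - a ** 2))).reshape(V.shape)

V0 = rng.normal(size=(H, W))           # stands in for the STFT V_g x
w = score_gradient(V0)                 # derivative evaluated at V0
saliency = np.abs(w)                   # large magnitude = influential pixel

# First-order Taylor prediction for a small perturbation dV around V0.
dV = 1e-5 * rng.normal(size=(H, W))
linear_prediction = class_score(V0) + np.sum(w * dV)
true_score = class_score(V0 + dV)
```

For a real CNN the same gradient would normally come from the framework's automatic differentiation rather than a hand-written backward pass.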

Saliency Map
Based on the successful applications of ISCSV in saliency detection, it is considered that the inner recognition mechanism can probably be reflected by ISCSV. As mentioned in Section 1, various two-dimensional transformed features can be regarded as equivalent to the input in Equation (16). At the beginning of the forward propagation of a CNN, each pixel in the 2D transformed feature is perceived by different convolutional kernels. The partial derivative of the loss between the network output and the label with respect to the network input is propagated back from layer to layer in the form of the neuron error $\delta$. An overview of back propagation as well as its detailed algorithmic presentation can be found in [23]. Here, we take a simple network model to explain how to obtain the saliency maps of a 2D transform feature. Figure 6 shows the structure of a simple neural network. $\omega_{ij}^{(q)}$ denotes the weight from the jth neuron in the (q − 1)th layer to the ith neuron in the qth layer, $b_i^{(q)}$ is the bias, and $z_i^{(q)}$ is the weighted input of the ith neuron in the qth layer, defined as:

$$z_i^{(q)} = \sum_j \omega_{ij}^{(q)} x_j^{(q-1)} + b_i^{(q)}$$

$x_j^{(q)}$ denotes the output of the jth neuron in the qth layer, defined as:

$$x_j^{(q)} = \sigma\left(z_j^{(q)}\right)$$

where $\sigma(\cdot)$ denotes the activation function of the qth layer. The output of this network $y_i$ equals the output of the last layer $x_i^{(Q)}$. This process is called forward propagation. Consider $P$ training samples $V_p$ (STFTs of processed radar signals) with their corresponding labels $L_p$, $p = 1, \dots, P$. After forward propagation, the output of the last layer $x_p^{(Q)}$ is obtained, and a loss $J_p(x_p^{(Q)}, L_p)$ is then used to measure the error between the output of the network and the true labels. The loss function $J$ can be formulated as the summation of each error:

$$J = \sum_{p=1}^{P} J_p\left(x_p^{(Q)}, L_p\right)$$

Now, by the chain rule, the partial derivative of the loss function $J$ with respect to each weight $\omega_{ij}^{(q)}$ can be expressed as:

$$\frac{\partial J}{\partial \omega_{ij}^{(q)}} = \sum_{p=1}^{P} \frac{\partial J_p}{\partial z_i^{(q)}}\, \frac{\partial z_i^{(q)}}{\partial \omega_{ij}^{(q)}} = \sum_{p=1}^{P} \delta_{i,p}^{(q)}\, x_{j,p}^{(q-1)}$$

where $\delta_{i,p}^{(q)} = \partial J_p / \partial z_i^{(q)}$ is the neuron error and $x_{j,p}^{(q-1)}$ is the corresponding neuron output for sample $p$. Accordingly, we can set q = 1 to obtain the gradient $G_p$ of $J$ with respect to the weights $\omega_{ij}^{(1)}$ connecting the input data to the network, where $G_p$ is a matrix of the same size as the input data.
Usually, the NN is trained for many epochs, and the gradient is accumulated in every epoch; hence, the final saliency map $H_p$ is obtained by accumulating $G_p$ over all epochs $n = 1, \dots, K$, where $n$ denotes the epoch index and $K$ is the maximum number of epochs.
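The accumulation over epochs can be sketched as follows. This toy example replaces the CNN with a softmax-regression classifier and accumulates, for every sample, the gradient of the cross-entropy loss with respect to the input pixels; both choices are assumptions of the sketch, as are the synthetic data and the use of µ = 0.01 as a learning rate (the text gives µ = 0.01 and K = 400 without further detail here).

```python
import numpy as np

rng = np.random.default_rng(0)
P, rows, cols, n_classes = 200, 6, 6, 2      # toy sizes (assumptions)
labels = rng.integers(0, n_classes, size=P)
X = 0.1 * rng.normal(size=(P, rows, cols))   # background "texture"
X[:, 1, 1] += np.where(labels == 0, 1.0, -1.0)   # the only class-dependent pixel
Y = np.eye(n_classes)[labels]                # one-hot labels

weights = np.zeros((n_classes, rows * cols))
mu, K = 0.01, 400        # mu treated as the learning rate (assumption); K epochs
saliency = np.zeros_like(X)                  # accumulated map for every sample

flat = X.reshape(P, -1)
for epoch in range(K):
    logits = flat @ weights.T
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    err = probs - Y                                 # dJ/dlogits for cross-entropy
    G = (err @ weights).reshape(X.shape)            # dJ/dinput for each sample
    saliency += G                                   # accumulate over epochs
    weights -= mu * (err.T @ flat) / P              # gradient-descent step

# Pixel with the largest average accumulated magnitude.
peak = np.unravel_index(np.abs(saliency).mean(axis=0).argmax(), (rows, cols))
```

In this synthetic setting the accumulated map concentrates on the single pixel that actually distinguishes the classes, which is the behaviour the saliency map is intended to expose.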

MC-Feature
The proposed method can be divided into three main parts: initial training, saliency map production, and re-training. Firstly, the 2D transform features of the radar signals are divided into a training set and a testing set; the training set is fed into a network (network1) to train it, and the testing set is used to measure the performance of network1. Secondly, all of the data are forward propagated through the well-trained network1 again, and the error is then back propagated to the first hidden layer. In this way, a saliency map is obtained for each 2D transform feature. Thirdly, the saliency maps of the original training set are used to train a new network with the same structure as network1. The performance of this new network is measured on the saliency maps of the original testing set. Figure 7 shows the flowchart of the proposed algorithm, and Algorithm 1 elaborates its detailed steps.
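The three parts described above can be sketched end to end as follows. A softmax-regression model stands in for the CNN, the data are synthetic, and all function names are hypothetical; the sketch only mirrors the order of the steps (train network1, produce saliency maps for all data, re-train a second network of the same structure on those maps) and is not the paper's Algorithm 1.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train(X, labels, n_classes, lr=0.1, epochs=200):
    """Fit a softmax-regression stand-in for the CNN; returns its weights."""
    flat = X.reshape(len(X), -1)
    Y = np.eye(n_classes)[labels]
    weights = np.zeros((n_classes, flat.shape[1]))
    for _ in range(epochs):
        err = softmax(flat @ weights.T) - Y
        weights -= lr * (err.T @ flat) / len(X)
    return weights

def saliency_maps(weights, X, labels):
    """Back-propagate the output error of a trained model to its input."""
    flat = X.reshape(len(X), -1)
    err = softmax(flat @ weights.T) - np.eye(weights.shape[0])[labels]
    return (err @ weights).reshape(X.shape)

def accuracy(weights, X, labels):
    pred = (X.reshape(len(X), -1) @ weights.T).argmax(axis=1)
    return float(np.mean(pred == labels))

# Synthetic 2D "features": two classes that differ in a single pixel.
rng = np.random.default_rng(0)
n, n_classes = 200, 2
labels = rng.integers(0, n_classes, size=n)
X = 0.1 * rng.normal(size=(n, 6, 6))
X[:, 1, 1] += np.where(labels == 0, 1.0, -1.0)
train_idx, test_idx = np.arange(0, 100), np.arange(100, 200)

# 1. Initial training of network1 and its evaluation on the testing set.
net1 = train(X[train_idx], labels[train_idx], n_classes)
acc1 = accuracy(net1, X[test_idx], labels[test_idx])
# 2. Saliency map production for all data with the trained network1.
S = saliency_maps(net1, X, labels)
# 3. Re-training a second network of the same structure on the saliency maps.
net2 = train(S[train_idx], labels[train_idx], n_classes)
acc2 = accuracy(net2, S[test_idx], labels[test_idx])
```

Note that in this sketch the saliency maps are computed from the output error, so the labels of all samples enter step 2, as the description of back propagating the error implies.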

Experimental Results
In this section, the detailed experimental results are presented and analyzed. Section 4.1 introduces the detailed information about the data used in our experiment. Section 4.2 depicts the structure of the recognition CNN. Section 4.3 shows the analysis of experimental results, including the interpretation of CNN and comparison of recognition rate by different features.

Data Information
In real scenarios, there are numerous complex types of radar signals for different applications. According to the type of radar transmission, signals can be divided into impulse radar signals, swept-frequency radar signals, and continuous-wave radar signals [24]. According to the modulation mode, radar signals can be divided into analog modulated signals and digital modulated signals. Analog modulation includes linear frequency modulation, quadrature amplitude modulation (QAM), triangular frequency modulation, etc. [25]. Digital modulation includes binary phase shift keying (BPSK), quadrature phase shift keying (QPSK), 8 phase shift keying (8PSK), etc. Besides, ultra wide bandwidth (UWB) signals have been used successfully in radar systems for many years [26]. In this experiment, four datasets are selected. In each dataset, the signals are set with the same parameters (such as radio frequency and frequency modulation mode) and are transmitted by 10 radars of the same type produced by the same manufacturer. Dataset I involves 10 types of 500 approximately single-frequency signals from 10 civil aviation meteorological radars. Dataset II contains 10 types of 500 single-frequency signals from 10 radar generators. Dataset III and Dataset IV include 10 types of 500 LFM signals collected from radar signal generators. Each signal contains 500 time sampling points. These datasets are selected because (1) they are typical and representative radar signals in SEI; and (2) limited by experimental conditions, signals with more complex modulation are unavailable to us. We randomly select a signal from each dataset to show its STFT in Figure 8.

Recognition CNN Structure
The structure of a CNN greatly influences the final recognition performance; hence, many researchers in artificial intelligence have studied network structure and designed numerous effective networks, such as LeNet-5 designed by Yann LeCun [27], AlexNet proposed by Hinton and Alex Krizhevsky [28], VGG proposed by the Oxford Visual Geometry Group [29], and GoogLeNet proposed by Christian Szegedy [30]. It should be noted that this paper focuses on studying "what" is learned by a certain network rather than the differences between various networks. Accordingly, LeNet-5, a very simple and effective CNN, is used as the classifier in this paper. Figure 9 shows the structure of LeNet-5, which contains 7 hidden layers: 2 convolutional layers, 2 pooling layers, and 3 fully connected layers. It should be pointed out that this parameter setting is only suitable for STFT and the MC-feature. For AF-RS and CS-MASK, the parameters of the convolutional kernels and fully connected layers should be adjusted according to the size of these two features.
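To see why the layer parameters depend on the feature size, the sketch below computes the length of the flattened vector that enters the first fully connected layer. It assumes the classic LeNet-5 configuration (5 × 5 convolution kernels without padding, 2 × 2 pooling, 16 channels after the second convolution); the parameters actually used in Figure 9 may differ, and the 32 × 32 input is only the classic LeNet-5 example size.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a square convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

def lenet5_flatten_size(input_size, kernel=5, pool=2, channels=16):
    """Flattened feature length after conv-pool-conv-pool (assumed classic LeNet-5)."""
    s = conv_out(input_size, kernel)        # convolutional layer 1
    s = conv_out(s, pool, stride=pool)      # pooling layer 1
    s = conv_out(s, kernel)                 # convolutional layer 2
    s = conv_out(s, pool, stride=pool)      # pooling layer 2
    return channels * s * s

size_32 = lenet5_flatten_size(32)   # classic 32 x 32 input
size_24 = lenet5_flatten_size(24)   # 24 x 24, the AF-RS / CS-MASK size in Section 2.2
```

Under these assumptions a 24 × 24 feature yields a much shorter flattened vector than a 32 × 32 input, so the first fully connected layer (and possibly the kernel size) has to be adapted, as the text states.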

Results Analysis
Figures 10-13 exhibit the AF-RS, CS-MASK, STFT, and MC-feature of two signals belonging to different classes in each dataset. It should be noted that the bright red and dark blue pixels in the proposed feature represent a high gradient magnitude, meaning a great impact on the classification of the network, whereas the light green pixels make a negligible contribution to classification. In this way, we can obtain the information that the network really "looks" at in the process of recognition. In comparison to AF-RS and CS-MASK, which usually focus on the area where the energy is concentrated in the transformed domain, the proposed features are visibly more representative, as shown in Figures 10-13. The divergence between the other three features of two signals from different classes is not very apparent, while the divergence of the proposed feature can even be distinguished by eye. In order to verify the superiority of the proposed algorithm, AF-RS, CS-MASK, STFT, and the MC-feature are sent into LeNet-5 for recognition. We set µ = 0.01, K = 400, and the training rate from 5% to 60%. Tables 1-4 show the recognition rate on each dataset. Figure 14 exhibits the curves of recognition rate versus training rate for the different features on each dataset. In general, the proposed feature obtains a very high recognition rate compared with the other features on each dataset, even when the training rate is low. For dataset I, AF-RS cannot provide the network with enough useful information, so that the 10 types of signals are divided into only 2 classes; hence, the recognition rate remains at approximately 20% no matter how the training rate changes. For dataset II and dataset III, AF-RS and CS-MASK contain some information that is conducive to the recognition task of the network; however, the recognition rate declines markedly when the training rate is below 20%. In addition, even when the training rate is over 40%, the network cannot learn more information from AF-RS and CS-MASK.
This is probably because these two features lose some detailed texture information as the cost of dimensionality compression. A similar phenomenon also appears in dataset IV: the recognition rate of AF-RS and CS-MASK is only 10% when the training rate is 5%, which, with 10 classes, is the chance level and means that the network learns nothing from these two features. On all four datasets, the performance of STFT is relatively stable, but still much lower than that of the proposed feature, especially when the training rate is less than 30%.
The experimental results demonstrate the superiority of the proposed method. Furthermore, they also indicate that some information based on human cognition, such as the area of concentrated energy, may not be the key to a classification network, and that some more informative parts behind the primitive 2D features are not mined. From the perspective of machine learning, deep informative features can be extracted by the proposed algorithm.

Conclusions
In this paper, we propose a novel SEI feature from the perspective of machine cognition instead of human cognition. The MC-feature obtains considerably higher recognition accuracy than other handcrafted features, particularly when data samples are scarce, which can greatly alleviate the model immaturity caused by insufficient training radar signals. In addition, even though the MC-feature still remains to be deeply interpreted, it demonstrates that the information the network relies on for identification is very different from the various handcrafted features based on human cognition. This shows that understanding the inner mechanism of the recognition network is both necessary and promising. In the future, a clear and profound interpretation of the exact physical meaning of the MC-feature will be the focus of our study. Once the specific physical meaning of the MC-feature can be explained, the recognition network will no longer be a "black box", but an analytical mathematical tool, which will widely broaden the application of machine learning in many scenarios. Our research team will also study the physical meaning of the MC-feature for complex radar signals, such as UWB and frequency hopping signals.