Research on Illegal Mobile Device Identification Based on Radio Frequency Fingerprint Feature

Internet of Things (IoT) technology is widely used in new power systems, but it also opens many new avenues for network attacks. Illegal terminal device identification is therefore a significant topic in the field of wireless authentication technology. Some kinds of power network equipment are located in sparsely populated areas and rely on IoT terminals for real-time monitoring. Attackers connect illegal terminals to the power IoT devices used for production monitoring and carry out network attacks, which may cause serious damage, such as power data theft and misoperation of power network equipment. Radio frequency fingerprint (RFF) techniques extract hardware features from different devices and are widely used for device identification and authentication. The area over which power network equipment is placed is vast, and there are many wireless communication devices and terminals. It is difficult to identify illegal devices through commonly used network management techniques, and thus difficult to distinguish between the mobile terminals of employees and illegal terminals in general spectrum screening. In response, this paper uses the characteristics of the squared spectrum of random access preamble signals to extract hardware device features, proposes an illegal device identification algorithm based on Gaussian distribution theory, and evaluates its performance. The experimental results show that, when the signal-to-noise ratio (SNR) is greater than 15 dB, the average recognition rate is greater than 92%. In addition, the algorithm has low computational complexity and high engineering application value.


Introduction
In recent years, with the continuous promotion of China's 2060 carbon neutrality target, the construction of new power network systems has accelerated. The development of power network systems has thus been rapid, aiming at improving public health, sparking technological innovation, and creating new economic opportunities. Power network stations cover a large area, and the space is open. With limited human resources, managers can rely on only a small number of people to complete on-site duty work, which results in a lack of supervision and high network security risks. For example, it is difficult to detect when outsiders come into direct contact with control equipment on site, and equipment is prone to having external lines attached and to man-in-the-middle attacks [1][2][3][4][5][6]. Some criminals, for instance, connect illegal mobile phones to power network systems through cellular mobile communication. Figure 1 presents a schematic diagram of an attack scenario, which illustrates an example of power network data leakage. First, illegal terminal devices connect to the power network system through cellular wireless networking. In this way, attackers can access the local network of power stations remotely through cellular mobile communications and conduct sniffing, tapping, and network attacks, including infiltrating and controlling various terminal devices in the local network, collecting network traffic, and even forging and tampering with power control instructions in the network. After attacking the power network system successfully, the illegal equipment transmits the stolen information to the external base station. This example shows that, if illegal devices are connected to the power equipment, large quantities of important information may be leaked. Therefore, effective device authentication mechanisms are needed to improve network security.
Radio frequency fingerprint (RFF) identification is an effective technology for classifying wireless device identities [7][8][9][10][11][12][13][14][15]. As shown in Figure 1, a detector is used in this example, playing the role of third-party testing equipment. Illegal devices use cellular communication, which has the characteristics of small coverage and low power transmission. Moreover, mobile phones used by venue employees share the same network as criminal devices, which makes it difficult for the detector to accurately and effectively detect illegal devices.

This article considers an identification method for illegal equipment in new power network systems and proposes a recognition algorithm based on a Gaussian distribution fitting test. The main steps of the investigation are as follows: first, the arrival time of the preamble signal of the wireless device is estimated through the Haar wavelet transform; then, the squared spectrum of the preamble signal is calculated and its differences between device types are analyzed to provide a basis for feature extraction. On this basis, this article reports a recognition algorithm based on a goodness of fit test of the Gaussian distribution to distinguish illegal equipment from legal equipment.
The rest of the article is organized as follows: Section 2 provides a detailed summary of related work. Section 3 describes the methodology used, including the signal model, analysis of the Zadoff-Chu (ZC) sequence, signal collection, and the analysis approach, including the confusion matrix and goodness of fit test. Section 4 provides the experimental results. Section 5 presents the conclusions.

Related Work
Existing RFF recognition methods involve two main schemes. One is traditional device fingerprint technology, which usually selects one or more signal features for fingerprint extraction, such as the I/Q imbalance, frequency offset, and phase noise. The other uses neural networks to automatically extract features for device identification. By increasing the size of the network, deep learning methods can improve the capacity of a fingerprint model and the degree of differentiation of the device fingerprint, and they have been extensively studied in recent years.
Linning Peng proposed a method to extract the RF fingerprint using a differential constellation trace figure [8], which is obtained by oversampling the received signal and performing differential operations on it to plot the sample points. The cluster centers are then calculated using the K-means clustering method. A USRP was used as the receiving platform to identify 12 CC2530 devices using OQPSK modulation. The experimental results indicated that the recognition accuracy exceeded 95% when the SNR was above 15 dB and reached 99% when the SNR was above 30 dB. Based on this work, Linning Peng later proposed a hybrid device classification scheme based on multiple RF fingerprint features [7], which uses several modulation features, namely, a differential constellation trace figure (DCTF), a carrier frequency offset, a modulation offset, a constellation offset, and an I/Q offset. A hybrid classifier was designed to adaptively combine the different features according to the channel signal-to-noise ratio (SNR). The weight of each feature was obtained during the training period, and the features were combined with the weights selected according to the estimated SNR during the testing period. The classification error rate was as low as 0.048.
Laxima Niure Kandel exploited channel state information (CSI) for recognition [16]. The author collected measured data in different locations as training data and designed a classifier to determine whether the equipment was legal. Comprehensive experiments in diverse real environmental settings were conducted using a training set and a test set with a ratio of 8.5:1.5. The results indicated that an accuracy of 98% was obtained when the transmitter and receiver were static, and a device identification accuracy of 92% was obtained in the moving case.
Observing that most existing RFF technologies are data-dependent, Yang Yang proposed a data-independent RFF extraction scheme [17], which was implemented on random data segments, such as communication data. In this study, a method called least mean square (LMS)-based adaptive-filter-based stacking (LAFS) was designed for RFF extraction; the tap coefficients of an adaptive filter were then used to represent the features. With the proposed LAFS, stable device fingerprints can be extracted from changing data. The experimental results indicated that the classification accuracy of LAFS could reach 98.9%, outperforming the deep learning network.
All the methods mentioned above are state of the art in traditional RFF recognition. Nevertheless, the selected features, whether used alone or in small combinations, are effective only for distinguishing a limited number of devices. The distinguishing accuracy decreases as the number of devices grows [10].
In recent years, with the development of artificial intelligence technology, machine learning algorithms have been widely applied in fingerprint extraction and device recognition [18][19][20]. Amani Al Shawabka et al. [21] proposed a deep-learning-based recognition algorithm, which first analyzes features and then uses a convolutional neural network (CNN) to achieve fingerprint accuracy levels that traditional low-dimensional algorithms cannot achieve. Pengcheng Yin proposed a novel multi-channel convolutional neural network (MCCNN) for long-term evolution (LTE) terminal identification [9]. The MCCNN is leveraged for feature extraction from the different parts of the signal, including the transient-on part, the modulation part, and the transient-off part. The extracted features are then combined to achieve higher classification accuracy. The experimental results indicate that the identification accuracy achieved was as high as 98.96%.
At the same time, experts in some fields are gradually starting to research and use deep neural networks (DNNs) to construct modulation recognition classifiers in order to improve the effectiveness and reliability of recognition algorithms. However, this type of method also has certain limitations: it requires a large number of training samples and has high computational costs, making it unsuitable for applications with high real-time requirements.
In conclusion, traditional RFF identification methods have difficulty recognizing a large number of devices accurately. RFF identification based on deep learning requires large training samples and struggles to meet high real-time requirements. In addition, the aim of our article is to identify whether a terminal device is moving, which differs slightly from traditional RFF research, whose goal is to identify whether a device belongs to a specific category. Such research on recognizing static devices has not been carried out before. Therefore, a new method for identifying static terminals based on RFF with high real-time effectiveness and reliability still requires investigation.

Signal Model
When a terminal is connected to a mobile network, it sends initial information to achieve synchronization with the base station, which is referred to as sending preamble signals. The preamble signals of wireless devices are generated by one or several ZC sequences, with a total of 64 different ZC sequences in each cell. Mobile devices randomly select one ZC sequence for access. The ZC sequence can be represented as described in [9]: where $n$ is the number of sequences and there are a total of 64 different ZC sequences; $q$ and $N$ are adjustable parameters and $N_{ZC}^{RS}$ is the length of the ZC sequence. The signal after adding Gaussian white noise can be expressed as where $w(n)$ represents the added Gaussian white noise signal, whose mean value is zero and whose variance is $\sigma^2$.
The identification problem of illegal devices in this design can be expressed in terms of the following hypothesis-testing model: In this scenario, illegal devices are wired to power devices and steal data, usually in a static state; legitimate devices refer to mobile devices (such as mobile phones) used by employees. These devices generally follow employees around as they operate the equipment, making it difficult for them to remain stationary. Therefore, the identification of illegal devices can be further transformed into the identification of moving/static devices, which can be expressed as

$$\begin{cases} H_0: & \text{identified as static device} \\ H_1: & \text{identified as moving device} \end{cases} \tag{4}$$

Analysis of ZC Sequence
The preamble signals of a mobile phone are composed of multiple cyclic shifts of a ZC sequence [9]. Different ZC sequences are orthogonal. According to simulations, the real part of the squared spectrum of the ZC sequence follows a Gaussian distribution; the distribution fitting results are shown in Figure 2.
The reason for this phenomenon is that, for static equipment, different cyclic shifts of the ZC sequence are orthogonal. For equipment in a static environment, different ZC sequences are therefore independent of each other, and the real part of the squared spectrum of the preamble signal obeys a Gaussian distribution. For equipment in a moving environment, the Doppler effect destroys the orthogonality between different cyclic shifts of the ZC sequence, so different ZC sequences are no longer mutually independent. Therefore, when a moving device is connected to the network, the squared spectrum of the preamble signal does not follow a Gaussian distribution; that is, the squared spectrum of the legitimate device's preamble signal does not follow a Gaussian distribution. Reflecting this characteristic, this article proposes a recognition algorithm based on the goodness of fit test of a Gaussian distribution, which can effectively identify illegal equipment.
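The orthogonality argument above can be illustrated with a minimal sketch. It assumes the standard root ZC form $x_q(n) = \exp(-j\pi q n(n+1)/N_{ZC})$ with an odd prime length, and it models the Doppler effect as a plain fractional frequency offset, which is a simplification of a real mobile channel. The length 839, root 129, shift 13, and offset 0.3 are illustrative values, not parameters taken from the experiments.

```python
import cmath
import math

def zadoff_chu(q, n_zc):
    """Root ZC sequence of odd length n_zc (assumed standard form)."""
    # Reduce the integer phase index modulo 2*n_zc to keep the argument small.
    return [cmath.exp(-1j * math.pi * ((q * n * (n + 1)) % (2 * n_zc)) / n_zc)
            for n in range(n_zc)]

def cyclic_correlation(a, b, shift):
    """Inner product of a with b cyclically shifted by `shift` samples."""
    n = len(a)
    return sum(a[i] * b[(i + shift) % n].conjugate() for i in range(n))

def apply_frequency_offset(x, offset_bins):
    """Rotate x by a frequency offset given in DFT bins (stand-in for Doppler)."""
    n = len(x)
    return [v * cmath.exp(2j * math.pi * offset_bins * i / n) for i, v in enumerate(x)]

N_ZC, ROOT, SHIFT = 839, 129, 13  # hypothetical example values
x = zadoff_chu(ROOT, N_ZC)

# Static case: a sequence and its cyclic shift are orthogonal.
static_leak = abs(cyclic_correlation(x, x, SHIFT)) / N_ZC
# Moving case: a fractional frequency offset leaves a non-zero correlation.
moving_leak = abs(cyclic_correlation(apply_frequency_offset(x, 0.3), x, SHIFT)) / N_ZC

print(f"static leak: {static_leak:.2e}, moving leak: {moving_leak:.2e}")
```

In this sketch the normalized correlation is numerically zero without the offset and clearly non-zero with it, which matches the qualitative claim that motion breaks the orthogonality between cyclic shifts.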

Collection of Preamble Signal
The preamble signal of a wireless device is the initial message sent by the terminal while accessing the mobile network before transmitting data; it is also used to synchronize the user equipment with the base station to obtain base station resources. Usually, the preamble is triggered when it is necessary to connect to a mobile network. The typical preamble signal waveform is shown in Figure 3. Estimating the arrival time of signals in low SNR environments is an important topic in signal processing and analysis. In [22], a signal arrival time estimation method based on the Haar wavelet transform is reported; the Haar wavelet transform provides edge detection and mutation point localization and has been widely applied in signal processing. This article uses the Haar wavelet transform algorithm to reduce the noise level of the received signal [22], thereby estimating the start and end times of the preamble signal without prior signal information. The discrete Haar wavelet transform function used in this article is as follows [22]: where $a$ represents the scale and $n$ represents the translation. $\psi(n/a)$ is the mother wavelet function, which can be written as:
In our experimental approach, the collected noisy signal is first subjected to the wavelet transform described above. Then, an optimal approximation to the original signal is found in the function space formed by scaling and translating the mother wavelet function, which removes the noise contained in the signal. Finally, the processed wavelet coefficients are subjected to an inverse wavelet transform to obtain the denoised signal. The waveform of the denoised preamble signal is shown in Figure 4. The arrival and end times of the preamble signal can thus be accurately estimated.
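As a rough illustration of how a Haar wavelet localizes the start and end of a burst, the sketch below correlates a signal envelope with a single-scale Haar-shaped kernel and takes the extreme responses as edge estimates. It is not the procedure of [22]: the sign convention, the single fixed scale, and the synthetic envelope (noise floor plus a unit-amplitude burst between samples 300 and 700) are all assumptions made for the example.

```python
import math
import random

def haar_edge_response(envelope, scale):
    """Single-scale Haar-like response: sum of the `scale // 2` samples
    starting at n minus the sum of the `scale // 2` samples before n."""
    half = scale // 2
    prefix = [0.0]
    for v in envelope:
        prefix.append(prefix[-1] + v)
    response = [0.0] * len(envelope)
    for n in range(half, len(envelope) - half + 1):
        after = prefix[n + half] - prefix[n]
        before = prefix[n] - prefix[n - half]
        response[n] = (after - before) / math.sqrt(scale)
    return response

def estimate_burst_edges(envelope, scale=64):
    """Return (start, end): the largest response marks the rising edge,
    the smallest response marks the falling edge."""
    response = haar_edge_response(envelope, scale)
    indices = range(len(response))
    start = max(indices, key=response.__getitem__)
    end = min(indices, key=response.__getitem__)
    return start, end

# Synthetic envelope (hypothetical): noise floor with a burst on [300, 700).
rng = random.Random(1)
envelope = [abs(rng.gauss(0.0, 0.1)) for _ in range(1000)]
for n in range(300, 700):
    envelope[n] += 1.0

start, end = estimate_burst_edges(envelope)
print(start, end)
```

A larger scale averages over more samples and is more robust to noise, at the cost of blurring bursts shorter than the kernel.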

Analysis of Feature
As shown in Figures 5 and 6, the time-domain and frequency-domain waveforms of the preamble signals of static and moving devices are very similar. The waveforms appear random in both the time and frequency domains and do not have any significant peak. Therefore, it is difficult to distinguish the preamble signals emitted by static and moving devices through general processing, such as time-domain modulus extraction and the Fourier transform of the signal. For this reason, this article considers an illegal device recognition algorithm based on Gaussian distribution RF fingerprint features. According to the analysis of the ZC sequence above, the real part of the squared spectrum of the preamble signal of a static device follows a Gaussian distribution, while that of a moving device does not. The algorithm therefore first extracts the squared spectrum of the two types of preamble signals and calculates their real parts, which are shown in Figure 7. Figure 7 indicates that the real part of the squared spectrum of the preamble signal of the static device appears random and does not have any significant peak, while that of the moving device exhibits two significant peaks. The peaks indicate that the signal comprises a deterministic part and additive phase noise, which do not follow a Gaussian distribution. This difference in the real part of the squared spectrum of the preamble signals can be used to distinguish the two signals.
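The feature extraction step can be sketched as follows, assuming that "squared spectrum" means the discrete Fourier transform of the squared time-domain signal $r(n)^2$; the text does not spell out the definition, so this reading is an assumption. A plain DFT is used to keep the example dependency-free, and the test tone is hypothetical.

```python
import cmath
import math

def dft(x):
    """Direct O(N^2) discrete Fourier transform, adequate for short examples."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def squared_spectrum_real(r):
    """Real part of the DFT of r(n)**2 (assumed meaning of 'squared spectrum')."""
    return [v.real for v in dft([s * s for s in r])]

# Hypothetical check signal: a complex tone at bin 5 of a 64-point grid.
n_points = 64
tone = [cmath.exp(2j * math.pi * 5 * t / n_points) for t in range(n_points)]
feature = squared_spectrum_real(tone)

# Squaring doubles the tone frequency, so the peak moves to bin 10.
peak_bin = max(range(n_points), key=lambda k: abs(feature[k]))
print(peak_bin)
```

A deterministic component therefore shows up as an isolated spectral line in this feature, whereas a noise-like input spreads over all bins, which is the contrast the goodness of fit test exploits.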

Goodness of Fit Test Algorithm for Distribution
In theoretical research, when it is necessary to test whether a group of random samples conforms to a certain probability distribution, a goodness of fit test of the distribution is usually used. Widely used methods include the Anderson-Darling (AD) test and the Kolmogorov-Smirnov (KS) test [23]. The goodness of fit test used in this paper is the KS test, which tests whether the distribution of the real part of the squared spectrum of the preamble approximately follows a Gaussian distribution. If it follows a Gaussian distribution, the device is determined to be illegal equipment. If it does not follow a Gaussian distribution, the device is determined to be legal equipment.
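The decision rule above can be sketched in a few lines. The sketch fits a Gaussian to the samples, computes the KS statistic against that fit, and compares it with a threshold derived from the one-term asymptotic approximation $P(\sqrt{N} Q > x) \approx 2e^{-2x^2}$. That threshold formula is an assumption rather than the one used in this article, and it is conservative when the Gaussian parameters are estimated from the same samples. The function names and the test data are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev

def ks_statistic_gaussian(samples):
    """Largest gap between the empirical CDF and a Gaussian fitted to the samples."""
    xs = sorted(samples)
    n = len(xs)
    model = NormalDist(fmean(xs), pstdev(xs))
    q = 0.0
    for i, x in enumerate(xs):
        f = model.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        q = max(q, (i + 1) / n - f, f - i / n)
    return q

def ks_threshold(n, p_fa):
    """Asymptotic threshold for false alarm probability p_fa (assumed formula)."""
    return math.sqrt(-0.5 * math.log(p_fa / 2)) / math.sqrt(n)

def is_static_device(samples, p_fa=0.05):
    """Accept H0 (Gaussian, hence static/illegal device) when Q <= threshold."""
    return ks_statistic_gaussian(samples) <= ks_threshold(len(samples), p_fa)

rng = random.Random(0)
gaussian_like = [rng.gauss(0.0, 1.0) for _ in range(2000)]
# Stand-in for a non-Gaussian feature; not a model of a real moving device.
skewed = [rng.expovariate(1.0) for _ in range(2000)]

print(is_static_device(gaussian_like), is_static_device(skewed))
```

Lowering `p_fa` raises the threshold, so fewer Gaussian inputs are rejected by mistake, but small departures from a Gaussian distribution become harder to detect.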
In this design, it is assumed that the samples $x_1 \le x_2 \le \ldots \le x_N$ are independent, identically distributed observations of the squared spectrum of the preamble signal, arranged in ascending order, with empirical distribution function $F_\gamma(x)$. Under the null hypothesis, the underlying distribution is Gaussian.
The null hypothesis $H_0$ needs to be tested: where $F$ is the distribution family with parameter $\theta$. The basic idea of the goodness of fit test of an empirical distribution function is to compare the distance between the hypothetical distribution $F(x, \theta)$ and the empirical distribution $F_\gamma(x)$.
The identification of illegal devices can be further transformed into an identification model expressed as:
The empirical distribution function can be calculated using the following formula:

$$F_\gamma(x) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(x_i \le x)$$

The basic process of the KS test is as follows:
Step 1: Calculate the empirical cumulative distribution function $F_\gamma$.
Step 2: Define the absolute value of the maximum difference between the two cumulative distribution functions as the test statistic of goodness of fit:

$$Q = \max_{x} \left| F_\gamma(x) - F(x, \theta) \right|$$

Step 3: Compare the goodness of fit test statistic with the threshold $\lambda$; accept $H_0$ ($H_0 = 0$) if $Q \le \lambda$ and reject $H_0$ if $Q > \lambda$. When the significance level, that is, the false alarm probability $P_{fa}$ of the test, is given, the threshold can be obtained by solving the following equation: where

Evaluation Metrics
In this article, we use the confusion matrix to evaluate the performance of the goodness of fit test for the distribution algorithm. The confusion matrix summarizes the classification outcomes in a compact form, so it is used here to assess classification performance.
The basic evaluation criteria include true positive (TP), false positive (FP), false negative (FN), and true negative (TN), which are shown in Table 1. However, these counts alone cannot measure the advantages and disadvantages of the proposed algorithm. Therefore, the confusion matrix is extended with four secondary indicators computed from the basic counts, shown in Table 2: accuracy, precision, recall, and specificity. Using these four secondary indicators, the counts in the confusion matrix are converted into proportions between 0 and 1.
From these indicators, a third-level indicator called the $F_1$-score is derived, which can be represented as:

$$F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

The $F_1$-score ranges from 0 to 1; a value of 1 represents the best output of the model, while 0 represents the worst output of the model.
In this article, the performance of the goodness of fit test for the distribution algorithm can, thus, be evaluated by the criteria described above.
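For reference, the indicators named above follow directly from the four confusion matrix counts. The sketch below uses the standard definitions; the counts in the example are made up for illustration and are not results from this article.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Secondary indicators and F1-score from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # also called the true positive rate (TPR)
    specificity = tn / (tn + fp)     # also called the true negative rate (TNR)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }

# Hypothetical counts for illustration only.
metrics = confusion_metrics(tp=90, fp=5, fn=10, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```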

Results and Analysis
This section introduces the construction of the experimental platform and the setting of the experimental parameters. Then, we display the empirical and theoretical distributions of the real parts of the squared spectrum of the two types of signals and, finally, analyze the performance of the algorithm.

Establishment of Experimental Platform and Parameter Settings
In order to verify the illegal device recognition algorithm based on Gaussian distribution features proposed in this article, an experimental system was built, which included a static mobile phone serving as an illegal device and a moving mobile phone serving as legitimate equipment carried by staff.
Software defined radio (SDR) is a radio broadcasting communication technology that controls traditional hardware circuits through software to receive and transmit wireless signals of different frequency bands and standards. This design uses a USRP B210 device to build a base station system on Ubuntu 18.04 using the srsRAN 21 open-source software suite, while a USRP N210 device is used to capture signals. The related parameters of the USRP B210 and USRP N210 are shown in Table 3. These two USRP devices mainly perform front-end processing, such as signal transmission, filtering, mixing, and sampling. The maximum frequency of the USRP is 6 GHz and the maximum processing bandwidth is 56 MHz. By using variable sampling theory, it can be converted to a sampling rate of 30.72 Msps, meeting the requirements of a 10 MHz bandwidth and 2.565 GHz center frequency in this study.

Experimental Results
As shown in Figures 8 and 9, the solid blue line represents the theoretical Gaussian distribution curve of the squared spectrum sequence of the preamble signal, and the actual distribution of the squared spectrum sequence is represented by small red circles. The figures give the estimated empirical distribution functions for the two scenarios together with the theoretical distribution functions obtained by fitting a Gaussian distribution to the data. It can be seen that, under hypothesis $H_0$, the empirical distribution function of the squared spectral sequence fits a Gaussian distribution well ($H_0 = 0$); under hypothesis $H_1$, the actual distribution function is significantly different from a Gaussian distribution ($H_1 = 1$). This verifies that the squared spectrum of the preamble signal of illegal equipment (static equipment) follows a Gaussian distribution, while the squared spectrum of the preamble signal of legal equipment (moving equipment) does not. In the experiment, each hypothetical case corresponds to one of the evaluation criteria, as shown in Table 4.
In this design, SNR is one of the main factors affecting algorithm performance. Therefore, this article investigates its impact on algorithm performance.
In this study, 1000 sets of preamble signals were collected in two different scenarios, and varying degrees of noise were added. The minimum SNR was 0 dB and the maximum was 30 dB, with an increment of 3 dB. Each preamble signal contained 10,000 sampling points. For each SNR, the collected signals were identified, and the recognition results obtained from these simulations were used to compute the respective recognition accuracy. As shown in Table 5, five criteria were evaluated, each of which gradually increases as the SNR increases.
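The noise sweep described above can be sketched as follows. The SNR grid (0 to 30 dB in 3 dB steps) and the 10,000-sample length come from the text; the clean test signal, the noise scaling helper, and the SNR measurement are assumptions for illustration, with a synthetic unit-power tone standing in for a captured preamble.

```python
import math
import random

def add_noise(signal, snr_db, rng):
    """Add complex white Gaussian noise scaled to the requested SNR."""
    power = sum(abs(s) ** 2 for s in signal) / len(signal)
    noise_power = power / 10 ** (snr_db / 10)
    sigma = math.sqrt(noise_power / 2)  # per real and imaginary component
    return [s + complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)) for s in signal]

def measured_snr_db(clean, noisy):
    """SNR actually realized by one noisy copy of the clean signal."""
    p_signal = sum(abs(c) ** 2 for c in clean)
    p_noise = sum(abs(y - c) ** 2 for c, y in zip(clean, noisy))
    return 10 * math.log10(p_signal / p_noise)

rng = random.Random(0)
snr_grid_db = list(range(0, 31, 3))  # 0, 3, ..., 30 dB

# Hypothetical clean signal: unit-power complex tone with 10,000 samples.
clean = [complex(math.cos(0.1 * n), math.sin(0.1 * n)) for n in range(10_000)]

realized = {snr: measured_snr_db(clean, add_noise(clean, snr, rng)) for snr in snr_grid_db}
for snr, value in realized.items():
    print(f"target {snr:2d} dB, realized {value:.2f} dB")
```

In a full evaluation, each noisy copy would be passed through the feature extraction and goodness of fit test, and the decisions would be accumulated into a confusion matrix per SNR value.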
It can be seen from Table 6 that TNR and FPR are basically not affected by SNR. The reason is that, after the spectrum processing described in this article, the squared spectrum of the preamble signal of the static equipment can approximately be regarded as the spectrum of Gaussian white noise. Therefore, no matter how low the SNR is, its spectrum can be regarded as a noise spectrum obeying a Gaussian distribution, and the performance is affected little by SNR. TPR and FNR are greatly affected by SNR because there are obvious peak spectral lines in the squared spectrum of the moving signal, which is a combination of the noise spectrum and the signal spectrum. When the SNR is too low, the power of the signal spectrum component decreases, so the peak spectral lines become small, resulting in unclear characteristics and a decrease in recognition accuracy. Table 5 lists the five extended performance criteria of the proposed algorithm, all of which increase as the SNR grows. They all exceed 90% when the SNR is more than 15 dB, which is aligned with our expectations.

Comparison with Other Scheme
As shown in Figure 10, the TPR and TNR of the proposed scheme are compared with those of the learning-based RFF scheme in [16], which performs device identification using a training set and a test set with a ratio of 8.5:1.5. Following [16], this article simulates 1000 samples of CSI collected from different devices, which are divided randomly into two parts: 85% for the training set and 15% for the test set. The comparison shows that the performance of the scheme proposed in this article differs little from that of the deep learning scheme. The RFF identification in this article reaches an accuracy of 96.6%, while the scheme in [16] reaches 98.1%, so the performance of our scheme is slightly lower. However, compared with the deep learning method, this algorithm does not require a training stage and has low complexity, resulting in high engineering value.

Conclusions
This article proposed an illegal mobile device identification scheme based on RFF. First, we studied the characteristics of the ZC sequence. As discussed in Related Work, most existing schemes have difficulty achieving high accuracy and meeting high real-time requirements at the same time. To overcome this shortcoming, we proposed a recognition algorithm based on a Gaussian distribution fitting test by analyzing the squared spectrum characteristics of the preamble signals of illegal and legitimate devices in a power network system. The real part of the squared spectrum of the preamble signals of illegal devices follows a Gaussian distribution, while the real part of the squared spectrum of the preamble signals of legitimate devices does not. The experimental results showed that, in an environment with an SNR of 30 dB, the average recognition accuracy of the algorithm could reach 96.6%, and the algorithm has low complexity and high engineering value. In conclusion, this algorithm displays high precision and excellent classification capability, making it suitable for the identification of illegal equipment in power network stations. This design has the advantage of recognizing signals of the same type and recognizing whether they come from static devices. When the signals are composite signals, it may be difficult to use this method to identify an illegal device. The application of this method to composite signals (mixed with other types of signals) will be studied in the future.

Figure 2. Schematic diagram for distribution fitting of ZC square spectral sequence.

Figure 4. Schematic diagram of comparison before and after wavelet transform.

Figure 5. The time domain waveform of preamble signals for static (left) and moving (right) devices.

Figure 7. The real part of the squared spectrum waveform of preamble signals for static (left) and moving (right) devices.

Figure 8. Fitting result of the squared spectral distribution of the preamble signal of an illegal device ($H_0 = 0$).

Figure 9. Fitting result of the squared spectral distribution of the preamble signal of a legal device ($H_1 = 1$).

Figure 10. Comparison of recognition performance with the scheme proposed in [16] by Kandel, L. in 2019.

Table 2. The secondary indicators.

Table 3. USRP equipment and related parameters.

Table 4. Correspondence of the hypothetical case and the evaluation criteria.

Table 5. Extended performance diagram of distribution fitting recognition algorithm.

Table 6. Basic performance diagram of distribution fitting recognition algorithm.