1. Introduction
The development of wireless communication technologies has exposed end users to various external security attacks. For this reason, the protection of wireless communication networks has become an important concern for ensuring data security and user privacy. Currently, one way of achieving this is to employ traditional upper-layer security techniques. However, physical layer protection techniques can also be used to protect wireless networks from malicious attacks. Such techniques are based on identifying wireless communication devices from their physical characteristics, a process known as radio frequency fingerprinting (RFF). In RFF, the distinctive or unique features of the physical waveforms (signals) emitted by wireless devices are utilized to classify the authorized users. This paves the way for identifying possible threats to the network [1, 2].
A typical RFF method is composed of three main stages, namely data acquisition, signal processing and classification. In the literature, both high-end receivers [3, 4, 5] and low-end receivers [5, 6] have been used for data acquisition. High-end receivers use higher sampling rates, which increases the data size, so a large memory is needed to record the signals. It should be noted that the sampling rate is one of the most critical parameters affecting the accuracy. Specifically, a higher sampling rate can result in undesired frequency components in the signals, while lower sampling rates can cause the loss of the unique features needed for RFF. To overcome this trade-off, sub-Nyquist rates [7] or downconverters [3] can be used. In this context, an RF front-end system can be used for data acquisition [8, 9]. Another critical parameter in data acquisition is the device diversity, together with the number of signals to be recorded. For a reliable RFF method, the number of devices and captured signals should be kept as high as possible. In the signal processing stage, on the other hand, either the transient or steady-state regions of the transmitted signals are used to extract the distinctive or unique features (so-called “RF fingerprints”). The extracted features are then used in the subsequent stage to classify the transmitting devices according to their model and manufacturer.
Among the implementation stages of the RFF method, the data acquisition stage plays a critical role, as it directly affects the upper bound of the RFF performance. This is because even a small error or deficiency in the data acquisition, such as an insufficient number of devices, might adversely affect the subsequent stages and, consequently, lead to a poor device identification ability. Therefore, the data acquisition stage should be planned carefully in order to provide an accurate, robust and adequately sized database. The purpose of this study is to introduce a database consisting of datasets of Bluetooth (BT) signals that were collected from various smartphones of different brands and models and recorded at different sampling frequencies. To this end, the data acquisition system is presented in detail. In addition, the results of two transient-based RFF methods that use the dataset are provided. To the best of our knowledge, this is the first freely available database that enables the testing or development of RFF methods with various BT devices. Therefore, it is believed that the database will help researchers working on the RFF identification of BT devices.
3. Data Usage Example
The recorded data are real-valued time series (voltage versus time). First, they should be transformed into analytic signals by using the Hilbert transform (HT) [9, 10, 11, 12]. Then, a digital downconverter can be used if further down-conversion is needed. Next, the I/Q data can be generated through MATLAB or AWR.
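The steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the processing chain used for the database: the function names, the toy 100 Hz tone and the sampling rate are assumptions, and the low-pass filtering and decimation that a practical digital downconverter would include are omitted. The analytic signal is built with the usual FFT method (the same idea that `scipy.signal.hilbert` implements).

```python
import numpy as np

def analytic_signal(x):
    """Return x + j*H{x} using the FFT method (zero the negative frequencies)."""
    n = len(x)
    spectrum = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0                       # keep the DC bin
    if n % 2 == 0:
        h[n // 2] = 1.0              # keep the Nyquist bin
        h[1:n // 2] = 2.0            # double the positive frequencies
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * h)

def to_baseband_iq(x, fs, fc):
    """Mix the analytic signal down by fc (Hz) and return the I and Q parts."""
    z = analytic_signal(x)
    t = np.arange(len(x)) / fs
    baseband = z * np.exp(-2j * np.pi * fc * t)
    return baseband.real, baseband.imag

# Toy input (hypothetical): a 100 Hz cosine sampled at 1 kHz for exactly 1 s.
fs, fc = 1000.0, 100.0
t = np.arange(1000) / fs
x = np.cos(2 * np.pi * fc * t)
i_data, q_data = to_baseband_iq(x, fs, fc)
```

For this toy tone, mixing down by its own frequency leaves a constant in the I branch and approximately zero in the Q branch, which is a quick sanity check before applying the same functions to a recorded BT capture.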
As discussed previously, RFF utilizes the distinctive or unique features of the physical waveforms transmitted from a wireless device to classify the authorized users in a network. The distinctive features can be extracted from the transient or steady-state regions of the transmitted signals, as shown in Figure 4. Recent studies have shown that most of the distinctive features can be extracted from the transient regions [9, 10, 11, 12]. To do this, it is first necessary to detect the transient signals. Then, the features can be extracted from the detected transient signal. Mostly, instantaneous signal characteristics [13] and the time–frequency–energy distribution (TFED) [14] are utilized to extract the features. Next, the extracted features are used in the classification of the transmitting devices by brand, model or series. The choice of classifier type might strongly affect the RFF performance. Deep learning (DL) [4], support vector machines (SVM) [14], k-nearest neighbor (KNN) [15] and multiple discriminant analysis (MDA) [16] are well-known approaches for the identification of transmitting devices in RFF.
To demonstrate the use of the dataset, an implementation of RFF is presented here. To this end, two different transient signal-based RFF methods, based on instantaneous signal characteristics and TFED features, respectively, were experimentally tested. These methods are described in the following subsections. Before testing these methods, a transient detection method that utilizes the energy envelope of the emitted signals was employed to detect the transients of the recorded signals [17]. The transient signals detected from four BT devices are shown in Figure 5. Moreover, the normalized energy of the signals from the same devices is also shown in Figure 6 for comparison.
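To illustrate the idea of energy-envelope-based transient detection, a minimal sketch is given below. It is not the method of [17]: the moving-average window, the two threshold levels and the function names are assumptions chosen only for illustration, and the synthetic turn-on ramp stands in for a recorded signal.

```python
import numpy as np

def energy_envelope(x, window):
    """Moving-average energy of a real or complex signal, normalized to a peak of 1."""
    energy = np.abs(x) ** 2
    kernel = np.ones(window) / window
    env = np.convolve(energy, kernel, mode="same")
    return env / env.max()

def detect_transient(x, window=32, start_level=0.05, end_level=0.9):
    """Return (start, end) sample indices of the turn-on transient.

    start: first sample whose normalized energy exceeds start_level
    end:   first sample at or after start whose energy exceeds end_level
    Both levels are assumed values, not taken from the paper.
    """
    env = energy_envelope(x, window)
    start = int(np.argmax(env > start_level))
    end = start + int(np.argmax(env[start:] > end_level))
    return start, end

# Synthetic test signal: 500 silent samples, a 200-sample ramp, then steady state.
amplitude = np.concatenate([np.zeros(500), np.linspace(0.0, 1.0, 200), np.ones(800)])
signal = amplitude * np.exp(2j * np.pi * 0.1 * np.arange(amplitude.size))
start, end = detect_transient(signal)
transient = signal[start:end]
```

With noisy recordings, the thresholds would normally be set relative to the estimated noise floor rather than fixed fractions of the peak energy.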
3.1. Device Identification Using Instantaneous Signal Characteristics
In the transient signal-based RFF method based on instantaneous signal characteristics, three higher-order statistical (HOS) features (skewness, kurtosis and variance) are derived from the signal’s characteristics, namely the instantaneous amplitude, instantaneous frequency and instantaneous phase. The process of extracting the HOS features has been presented in detail in [9, 10, 11, 12, 13].
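As a rough sketch of this feature extraction, the snippet below computes the three instantaneous characteristics of an analytic (complex) transient and then the variance, skewness and kurtosis of each. The function names and the toy chirp are assumptions. Computing all three statistics for all three characteristics gives nine values; the exact feature list used in the experiments is described in the cited works and may differ.

```python
import numpy as np

def instantaneous_characteristics(z, fs):
    """Amplitude, unwrapped phase (rad) and frequency (Hz) of an analytic signal."""
    amplitude = np.abs(z)
    phase = np.unwrap(np.angle(z))
    frequency = np.diff(phase) * fs / (2 * np.pi)
    return amplitude, phase, frequency

def hos_features(v):
    """Variance, skewness and (non-excess) kurtosis of a 1-D array."""
    centered = v - v.mean()
    var = np.mean(centered ** 2)
    skew = np.mean(centered ** 3) / var ** 1.5
    kurt = np.mean(centered ** 4) / var ** 2
    return var, skew, kurt

def feature_vector(z, fs):
    """Concatenate the HOS features of amplitude, phase and frequency."""
    feats = []
    for characteristic in instantaneous_characteristics(z, fs):
        feats.extend(hos_features(characteristic))
    return np.array(feats)

# Toy transient (hypothetical): rising amplitude with a 50 Hz to 150 Hz linear chirp.
fs = 1000.0
t = np.arange(1000) / fs
z = np.linspace(0.1, 1.0, t.size) * np.exp(2j * np.pi * (50.0 * t + 50.0 * t ** 2))
features = feature_vector(z, fs)
```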
After extracting the HOS features, the next step is the classification of the BT devices (smartphones). The classification performances of two classifiers, the support vector machine (SVM) and neural networks (NN), were examined. The feature set was divided into training and test sets for each BT device in the dataset. Each training set consisted of 120 transient signals (out of 150) while the test set consisted of the remaining 30 transients. The training and test sets were chosen randomly from the dataset.
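The random per-device split described above could be reproduced along the following lines. This is a sketch under stated assumptions: the label layout (16 devices with 150 transients each), the seed and the function name are illustrative and not taken from the original experiments.

```python
import numpy as np

def split_per_device(labels, n_train, rng):
    """Return (train_idx, test_idx), drawing n_train random records per device."""
    train_idx, test_idx = [], []
    for device in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == device))
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return np.array(train_idx), np.array(test_idx)

rng = np.random.default_rng(0)            # fixed seed only for repeatability
labels = np.repeat(np.arange(16), 150)    # assumed layout: 16 devices x 150 transients
train_idx, test_idx = split_per_device(labels, n_train=120, rng=rng)
```

Splitting within each device, rather than over the pooled records, keeps the 120/30 ratio identical for every class.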
In the training stage, a non-linear SVM classifier was used, as the data were not linearly separable. In a non-linear SVM classifier, the kernel function maps the dataset onto a higher-dimensional vector space. Although there are several types of kernel functions (radial basis, sigmoid and polynomial), the quadratic polynomial kernel function was chosen, as it provided higher classification accuracy. The multi-layer NN structure used for the training of the NN classifier can be found in [9]. The input layer of the NN structure consisted of ten neurons, corresponding to the number of generated signal features. The output layer consisted of sixteen neurons, as there were sixteen classes in each dataset. Two hidden layers with four neurons were used initially. Then, the number of neurons and hidden layers was increased gradually to achieve the best training performance. It was found that three hidden layers with sixty-four neurons each were sufficient to achieve the targeted performance. Furthermore, the tansig function was chosen for the activation of all neurons. For the network training, the epoch limit was set to 2500. The classification accuracy of the classifiers is given in Table 2.
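The role of the quadratic polynomial kernel can be illustrated with a small numerical check: evaluating the kernel on two input vectors gives the same value as an inner product in an explicit higher-dimensional feature space. The kernel offset `c = 1` and the two sample vectors are assumptions made for the example; the kernel parameters used in the experiments are not stated here.

```python
import numpy as np

def quadratic_kernel(x, y, c=1.0):
    """Polynomial kernel of degree 2: K(x, y) = (x . y + c)^2."""
    return (np.dot(x, y) + c) ** 2

def explicit_map(x, c=1.0):
    """Explicit 6-D feature map for a 2-D input that reproduces the kernel."""
    x1, x2 = x
    return np.array([
        x1 ** 2,
        x2 ** 2,
        np.sqrt(2.0) * x1 * x2,
        np.sqrt(2.0 * c) * x1,
        np.sqrt(2.0 * c) * x2,
        c,
    ])

a = np.array([0.5, -1.2])
b = np.array([2.0, 0.3])
k_direct = quadratic_kernel(a, b)                        # computed in the 2-D input space
k_mapped = np.dot(explicit_map(a), explicit_map(b))      # computed in the 6-D feature space
```

Both values agree, which is why the classifier can work in the higher-dimensional space without ever forming the mapped vectors explicitly.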
From the results listed in Table 2, the NN classifier might at first appear to be overfitted because of its high training accuracy. During training, the initial weights were chosen randomly in every run, and the network that achieved the highest training accuracy among the trained networks was used. For this reason, such high training accuracies are to be expected. If similar data were trained, it would not be possible to achieve such accuracy rates. Furthermore, the training and test sets consisted of different data. If the network were overfitted, it would memorize only the given training data, and its accuracy on the test data would be much lower. In this context, the high accuracies achieved with both the training and test data indicate that the network was not overfitted.
3.2. Device Identification Using Time Frequency Energy Distribution (TFED) Features
An RFF method based on TFED features is presented in [14]. In this method, the features are extracted by using the Hilbert–Huang transform (HHT) of the transient signals. Put simply, the HHT calculates the energy of each frequency component at a certain time resolution. In this way, the device’s signal characteristics can be analyzed in terms of both time and frequency.
To extract the TFED features, the dataset consisting of BT signals sampled at 20 Gsps was used (dataset C). The features extracted from the TFED of the BT signals are listed in Table 3. The resulting feature set was smoothed by a median filter in order to evaluate the effect of filtering the features on the classification performance. Then, the smoothed and unsmoothed features were employed separately for comparison. Three classifiers, the linear support vector machine (L-SVM), linear discriminant analysis (LDA) and the complex decision tree (CDT), were used for the same feature set. As in the previous section, the feature set was divided into training and testing sets. The training set consisted of 60 records (out of 150 records), while the remaining 90 records were used in the testing set.
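A median filter applied to a feature set can be sketched as follows. The filter length and the axis along which the features were smoothed are not specified in the text, so the choices below (a length-3 window sliding along the record axis of a records-by-features matrix, with edge values repeated) are assumptions for illustration only.

```python
import numpy as np

def median_smooth(features, kernel_size=3):
    """Median-filter each feature column along the record axis (edges repeated)."""
    half = kernel_size // 2
    padded = np.pad(features, ((half, half), (0, 0)), mode="edge")
    # Stack the shifted copies so that axis 0 holds the sliding window.
    windows = np.stack([padded[i:i + len(features)] for i in range(kernel_size)])
    return np.median(windows, axis=0)

# One feature column with a single outlier record (hypothetical values).
raw = np.array([[1.0], [100.0], [2.0], [3.0], [4.0]])
smoothed = median_smooth(raw)
```

In this toy column, the outlier value of 100 is replaced by a neighboring value, which is the kind of effect whose impact on classification is compared in the smoothed and unsmoothed cases.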
The CDT classifier consists of three main types of nodes: the root node, split nodes and leaves (terminal nodes). The root node contains the whole data, while the split nodes generate internal nodes. CDT is a supervised learning algorithm that is widely used in classification, and it requires high-quality training data to generate the model. The CDT classifier is based on the idea that the data are split into subsets, each of which belongs to a unique class.
On the other hand, the LDA is based on modeling the differences between the data classes. This classifier projects high-dimensional feature vectors onto low-dimensional ones by means of a linear transformation. In this way, the generated vectors provide an efficient separation of the classes.
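The projection idea behind LDA can be shown with a simplified two-class Fisher discriminant. This is only a sketch: the experiments use a multi-class problem, and the synthetic Gaussian data, the seed and the midpoint threshold below are assumptions introduced for the example.

```python
import numpy as np

def fisher_direction(x0, x1):
    """Two-class Fisher discriminant direction w = Sw^{-1} (m1 - m0)."""
    m0, m1 = x0.mean(axis=0), x1.mean(axis=0)
    # Within-class scatter: sum of the two (unnormalized) class covariances.
    sw = (np.cov(x0, rowvar=False) * (len(x0) - 1)
          + np.cov(x1, rowvar=False) * (len(x1) - 1))
    return np.linalg.solve(sw, m1 - m0)

rng = np.random.default_rng(1)
x0 = rng.normal(loc=[0.0, 0.0, 0.0], scale=1.0, size=(200, 3))   # class 0 (synthetic)
x1 = rng.normal(loc=[3.0, 3.0, 0.0], scale=1.0, size=(200, 3))   # class 1 (synthetic)

w = fisher_direction(x0, x1)
p0, p1 = x0 @ w, x1 @ w                  # 3-D feature vectors projected onto one axis
threshold = 0.5 * (p0.mean() + p1.mean())
accuracy = 0.5 * ((p0 < threshold).mean() + (p1 > threshold).mean())
```

After projection, a single threshold on the one-dimensional values separates most of the two classes, which is the low-dimensional separation the paragraph above refers to.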
Finally, the linear support vector machine (L-SVM) is a supervised machine learning algorithm which maps the input data ($\mathbf{x}$) onto high-dimensional data. For this, it utilizes a mapping function $\varphi(\mathbf{x})$, along with a linear transformation $f(\mathbf{x}) = \mathbf{w}^{T}\varphi(\mathbf{x}) + b$, where $\mathbf{w}$ and $b$ are the optimized coefficients. With the help of $f(\mathbf{x})$, the data can be separated, from which a hyperplane is generated. Maximizing the margin between the separating hyperplanes results in minimizing the upper bound of the error, and thus, the structure of the L-SVM is constructed. The L-SVM classifier was employed for the given BT data, and the details are reported in [10, 12].
The performances of the classifiers for both the smoothed and unsmoothed features are presented in Table 4.
3.3. Discussion
In order to evaluate the usability of the database, two different transient signal-based RFF methods have been experimentally tested. During the tests, the distinctive features extracted from the transient signals were used in the classification stage to identify the BT devices. The classification performance of the classifiers indicates that the datasets are reliable, since such performance could not be achieved without accurately recorded data. The results also support the effectiveness of the data acquisition system. Therefore, the database provided in this study could be valuable for the research community, giving it greater flexibility for developing better RFF methods. In addition, novel or existing RFF methods can be implemented with the database. It may also be used in developing and testing transient or steady-state signal detection techniques.
4. Conclusions
This paper describes a BT signal database which is freely available to the research community for developing RFF methods. The BT database was recorded in an isolated laboratory at Atilim University, Ankara, Turkey. A set of 27 smartphones of various models produced by six manufacturers was used in the data collection. This is the work of a team who dedicated substantial time and effort to generating such an extensive and reliable BT signal database. Even a small mistake in the data acquisition process can adversely affect the following stages of the RFF methods. For this reason, this paper presents not only the database but also the data acquisition methodology for a reliable database of BT signals. Moreover, two well-known RFF methods have been reviewed, and the demonstration results are presented to show the usability of the database. The results of the RFF methods demonstrate the effectiveness of both the acquisition system and the database for further research on RFF methods. As future work, the authors intend to create a new version of the database that might include Wi-Fi signals.