1.1. Motivation
In IoMT systems, traditional cryptographic algorithms are difficult to deploy on resource-constrained medical sensor nodes. These nodes are typically micro-embedded systems with limited computational power and storage capacity, so running complex cryptographic algorithms on them directly may cause performance bottlenecks or resource exhaustion. It is therefore necessary to develop a secure authentication method that imposes little or no additional computational burden on the sensor nodes. Medical devices possess unique hardware-specific differences that are difficult to replicate. Even within the same device class, subtle variations arise in the internal components during manufacturing and usage. By analyzing and extracting these inherent differences, RF fingerprinting features can be obtained. These features are distinct and remain stable in the short term [1], making them suitable for device identification. Moreover, RF fingerprinting identification operates at the physical layer, offering a potential solution to the problem of spoofing attacks in the IoMT [2].
To handle unknown spoofing attacks when few samples are available for model training, traditional clustering methods have been applied to RF fingerprinting identification. In a typical medical environment, however, RF signals are subject to noise, channel fading, and interference [3]. On such signals, traditional identification methods suffer from high computational complexity, ineffective dimensionality reduction, poor noise resistance, and suboptimal clustering performance. As a result, traditional machine-learning-based clustering methods cannot be applied effectively to RF fingerprinting in the IoMT [4].
In recent years, deep learning's capacity for automatic feature extraction and for approximating complex functions has drawn extensive attention to its combination with clustering tasks, giving rise to research on deep clustering [5]. The success of deep clustering depends on an effective sample representation. The learned features should not only be a low-dimensional approximation of the original samples but also preserve much of their structural characteristics, thereby yielding better clustering.
Existing deep clustering methods rely on deep neural networks to learn sample representations and then cluster on the basis of those representations. According to how the representation learning module and the clustering module interact, existing deep clustering methods can be grouped into the following four branches [6].
- (1) Multi-stage deep clustering methods
In this type of method, the representation learning module and the clustering module are connected sequentially. Deep unsupervised representation learning is first used to learn a representation of each data instance, and the learned representations are then fed into a classic clustering model to obtain the final clustering result. Separating representation learning from clustering makes clustering analysis convenient, and the approach is general enough to apply to almost all research scenarios. For example, the authors in [6] trained a deep autoencoder to learn the representations of samples, and these representations could be directly input into k-means for clustering.
Multi-stage deep clustering methods are easy to implement and intuitive in principle. However, this simple combination of deep representation learning and traditional machine learning clustering often cannot achieve the optimal result [7]. Firstly, most representation learning methods are not designed specifically for clustering tasks, so the learned sample representations do not necessarily cluster well. Secondly, because the two stages are separated, the clustering result cannot be fed back to guide the representation learning module toward better data representations. This direct cascade therefore cuts off the information exchange between representation learning and clustering, the limitations of each stage carry over to the final performance, and the algorithm can achieve only suboptimal clustering results.
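The two-stage pipeline described above can be illustrated with a minimal sketch. It rests on several assumptions that do not come from the cited works: a PCA projection stands in for the deep autoencoder, the data are synthetic Gaussian features rather than RF signals, and the helper names `pca_embed` and `kmeans` are hypothetical. The point is only to show that the clustering stage receives fixed embeddings and returns nothing to the representation stage.

```python
import numpy as np

def pca_embed(x, dim):
    """Stage 1 stand-in: project onto the top `dim` principal components."""
    centered = x - x.mean(axis=0)
    # Rows of vt are principal directions, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:dim].T

def kmeans(z, k, iters=50, seed=0):
    """Stage 2: plain Lloyd's k-means on the fixed embeddings."""
    rng = np.random.default_rng(seed)
    centers = z[rng.choice(len(z), size=k, replace=False)]
    for _ in range(iters):
        # Assign every embedding to its nearest center.
        dists = np.linalg.norm(z[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each non-empty center to the mean of its members.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = z[labels == j].mean(axis=0)
    return labels

rng = np.random.default_rng(1)
# Synthetic stand-in for two devices: 16-dimensional features around distinct means.
x = np.vstack([rng.normal(0.0, 0.3, (50, 16)), rng.normal(3.0, 0.3, (50, 16))])
z = pca_embed(x, dim=2)   # representation learning, with no clustering signal
labels = kmeans(z, k=2)   # clustering, with no feedback to stage 1
```

Because `pca_embed` never sees `labels`, any weakness in the embedding is passed on to k-means unchanged, which is the limitation discussed above.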
- (2) Iterative deep clustering methods
To address the limitations of multi-stage deep clustering methods, iterative deep clustering methods allow the clustering results to guide the representation learning in turn. Generally speaking, the clustering module in iterative deep clustering generates pseudo-labels, which are used to train the representation learning module in a supervised manner. DeepCluster [8] is a representative and mature iterative deep clustering method that has achieved success in image clustering and video clustering [2]. DeepCluster alternates between updating the backbone representation module and the k-means clustering module, minimizing the gap between the cluster assignments predicted by the representation learning module and the pseudo-labels, and thereby achieves better clustering results.
Iterative deep clustering enables representation learning and clustering to reinforce each other. However, it is also susceptible to error propagation during the iterative process. Especially in the early stage of training, inaccurate clustering results may cause the representation learning module to produce confused representations, which in turn degrade the clustering results. The model may then fail to achieve the expected effect or even fail to converge.
- (3) Parallel deep clustering methods
Although iterative deep clustering methods allow the representation learning module and the clustering module to guide each other, the two modules are optimized in an explicit, alternating manner and cannot be updated simultaneously. In parallel deep clustering methods, the representation learning module and the clustering module are optimized simultaneously in an end-to-end manner. DEC [3] is a representative method that combines an autoencoder with a self-training strategy to optimize clustering and representation learning simultaneously, and this idea has had a profound impact on subsequent research. The authors in [4] introduced an additional noise encoder and improved the robustness of the autoencoder by minimizing the reconstruction error of each layer between the noise decoder and the original encoder. The authors in [5] applied self-training between the original branch and the enhanced branch, further improving the robustness of clustering. The authors in [9] improved the target distribution by increasing the normalized frequency of the clusters, which addressed data imbalance and uneven sample distribution while preserving the distinguishability of small groups.
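The self-training strategy mentioned above for DEC can be sketched as follows. The sketch follows the commonly used DEC formulation: a Student's t kernel gives soft assignments, a sharpened target distribution is derived from them, and the clustering loss is the KL divergence between the two. The embeddings and cluster centers are made-up numbers, the function names are hypothetical, and the encoder that would produce `z` is omitted, so this shows only the clustering objective and not a full training loop.

```python
import numpy as np

def soft_assign(z, centers, alpha=1.0):
    """Soft assignment q_ij: Student's t kernel between embeddings and centers."""
    sq_dist = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Target p_ij: square q, divide by the soft cluster frequency, renormalize."""
    weight = q ** 2 / q.sum(axis=0)
    return weight / weight.sum(axis=1, keepdims=True)

def kl_clustering_loss(p, q):
    """KL(P || Q), summed over all samples and clusters."""
    return float(np.sum(p * np.log(p / q)))

# Illustrative values only: four 2-D embeddings and two cluster centers.
z = np.array([[0.0, 0.1], [0.2, 0.0], [2.0, 2.1], [1.9, 2.0]])
centers = np.array([[0.0, 0.0], [2.0, 2.0]])

q = soft_assign(z, centers)
p = target_distribution(q)   # more confident than q for each sample
loss = kl_clustering_loss(p, q)
```

In an end-to-end setting, the gradient of `loss` would flow into both the cluster centers and the encoder, which is what allows the two modules to be updated simultaneously.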
Contrastive learning has been one of the most popular unsupervised representation learning techniques in recent years. Its basic idea is to pull positive instance pairs closer together and push negative instance pairs farther apart. The representative contrastive clustering method is CC [10], which constructs positive and negative sample pairs, regards each cluster as a data instance in the low-dimensional space, minimizes the distance between similar samples, and maximizes the distance between dissimilar samples. Several variants of CC have also been proposed. PICA [11] directly separates different clusters by minimizing the cosine similarity between the cluster-wise assignment statistics vectors, and DRC [12] introduces a regularization term on the clusters.
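The idea of pulling positive pairs together and pushing negative pairs apart can be made concrete with a generic instance-level contrastive loss. The sketch below uses the NT-Xent form, which is an assumption for illustration and not necessarily the exact objective of CC, PICA, or DRC. The function name `nt_xent`, the temperature value, and the random vectors standing in for two augmented views are likewise hypothetical.

```python
import numpy as np

def nt_xent(za, zb, temperature=0.5):
    """NT-Xent loss for n positive pairs (za[i], zb[i]); other rows act as negatives."""
    z = np.vstack([za, zb])
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit norm: dot product = cosine
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                    # a sample is not its own negative
    n = len(za)
    # Row i (from za) has its positive at row n + i (from zb), and vice versa.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(2 * n), pos].mean())

rng = np.random.default_rng(0)
base = rng.normal(size=(8, 4))
# Correctly paired views: each sample and a lightly perturbed copy of itself.
aligned = nt_xent(base, base + 0.01 * rng.normal(size=(8, 4)))
# Deliberately wrong pairing: each sample is paired with a different sample.
mismatched = nt_xent(base, base[::-1] + 0.01 * rng.normal(size=(8, 4)))
```

The loss is lower for the correctly paired views than for the mismatched ones, so minimizing it drives embeddings of the same instance together and embeddings of different instances apart.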
- (4) Generative deep clustering methods
Generative deep clustering can be further divided into methods based on variational autoencoders (VAEs) and methods based on generative adversarial networks (GANs). The VAE is a probabilistic model based on variational inference that is trained under an assumed distribution of the latent variables. It has inspired many clustering models, including GMVAE [13] and VaDE [14]. GANs have achieved great success in computer vision and in the estimation of complex data distributions, and in recent years they have also been applied to deep clustering. The authors in [15] proposed stacking a Gaussian mixture model (GMM) with a GAN, using the GMM as the prior distribution for generating data instances. The authors in [16] proposed directly replacing the GMM with a GAN, together with a new method for solving the convergence problem in the early stage of training. The authors in [17] proposed applying the Sobel operation before the discriminator of the GAN to improve model performance.
Although deep generative clustering models can generate samples while completing clustering, they also have some disadvantages. Firstly, the training of generative models usually involves Monte Carlo sampling, which may lead to unstable training and high computational complexity. Secondly, VAE-based models usually require prior assumptions about the data distribution, which may not hold in practice, whereas GAN-based algorithms, although more flexible and diverse, may suffer from mode collapse and slow convergence.
In summary, several problems remain in the research on radio-frequency (RF) fingerprints.
Most existing research treats RF fingerprint recognition as a supervised task, which requires the collected RF signals to be annotated manually in advance to form a dataset. However, the actual electromagnetic environment is often complex and changeable, and in such an environment it is not always feasible to collect data in advance and construct a dataset through manual annotation. Annotation requires the participation of industry experts, which is difficult and incurs high labor and time costs. Moreover, the quality of the manual annotation directly affects the subsequent recognition performance. In addition, when facing specific types of network attacks, such as spoofing attacks and Sybil attacks, these supervised methods often fail because unknown devices appear, whereas unsupervised blind recognition methods can effectively counter such attacks.
In the actual electromagnetic environment, multipath effects and noise are introduced into the signals, and the subtle differences between devices of the same model are not easy to detect, making RF fingerprints difficult to extract and recognize. Most existing research on RF fingerprints has been carried out in ideal scenarios. The resulting models are easily affected by the characteristics of the electromagnetic environment, which leads to overfitting or performance degradation and has created a disconnect between theoretical research and practical applications.