1. Introduction
Multiple entities, whether humans or machines, that participate in wireless communication must share the same limited electromagnetic (EM) frequency spectrum, which regulatory bodies allocate to specific wireless communication standards. Meanwhile, the wireless communications field has been expanding in terms of both the number of users and application areas, and this trend appears set to continue [1]. This growth has been creating more demand for access to the limited EM frequency spectrum, making it a scarce resource that must be managed efficiently to meet the demand [2].
The field of wireless communications has witnessed extensive research and implementation efforts to address the increasing demand for efficient and reliable communication systems. Various strategies have been explored, including, but not limited to, the utilization of cellular systems, advancements in multiplexing and multiple access techniques, the integration of radar sensing and communication functionalities [3,4,5], beamforming for spatial user separation [6], and the adoption of massive multiple-input multiple-output (MIMO) technology [7].
The concept of cognitive radio (CR) has emerged as a promising solution with the potential to achieve efficient utilization of the electromagnetic frequency spectrum and transform the wireless communications protocol stack into a more adaptable system. By understanding its radio environment and making intelligent decisions in real time, CR aims to address various user needs in different application scenarios [2,8]. Even though there are multiple facets to consider in equipping the radio with cognitive abilities, such as spectrum sharing and management, spectrum-aware networking, sensing scheduling, and joint software and hardware design [9,10], it can be claimed that sensing the spectrum to identify opportunities, e.g., available time-frequency slots in the spectrum space [11], is the initiator of the CR’s capabilities. Thus, being an essential component of CR, spectrum sensing has attracted the interest of the research community.
It appears that this interest will persist. According to the findings of [12], while 5G networks generally outperform 4G networks in terms of service quality and speed, their performance has diminished over time because the number of users accessing 5G networks has increased without a corresponding increase in network capacity. The primary limiting factor in addressing this capacity constraint is the scarcity of available EM spectrum resources. Considering the envisioned goals of 6G [13,14], which strive to establish ubiquitous connectivity across diverse vertical application domains, it can be inferred that efficient EM spectrum utilization will remain important. Consequently, the research topic of spectrum sensing retains its importance, as it directly addresses the need to effectively detect and allocate available spectrum resources in support of 6G’s objectives.
Spectrum sensing research is often classified into two primary categories: wideband sensing and narrowband sensing. Wideband sensing can be further characterized by two distinct methodologies: Nyquist-based and sub-Nyquist-based methods. Similarly, narrowband sensing is investigated under two discrete classifications: coherent methods and non-coherent methods [15,16].
Nyquist-based wideband sensing methods often face challenges due to their reliance on high sampling rates, which necessitate specialized and expensive equipment. In contrast, sub-Nyquist wideband sensing methods alleviate these requirements. However, a substantial portion of these methods rely on assumptions of sparsity and the estimation of sparsity levels, thereby introducing high computational complexity [17]. An increase in computational complexity is frequently associated with increased energy consumption, which is particularly undesirable for radios operating in resource-constrained environments, such as mobile phones or military radios that are deployed in field conditions.
While wideband sensing methods offer the potential to sense a significantly broader spectrum in comparison with narrowband sensing methods, the latter, characterized by their relatively lower sampling rates, prove more compatible with conventional, non-bulky, cost-effective equipment, including mobile phones, software-defined radios, and military radio devices. Despite the generally high performance exhibited by coherent narrowband sensing methods, such as matched filtering and waveform detection, their notable drawback lies in the necessity of possessing either partial or complete prior knowledge of the primary signal characteristics, which is frequently unavailable.
Alternative narrowband sensing techniques, such as cyclostationary feature detection and covariance-based detection, exhibit commendable performance even in challenging channel conditions. Nonetheless, these methods are encumbered by the drawback of high computational complexity. Energy detection is a non-coherent and straightforward narrowband sensing technique that is widely adopted in the literature. It involves subjecting a test statistic to binary hypothesis testing against a threshold value. Its popularity stems from its low computational complexity and the absence of the need for prior knowledge about primary user signals. However, it exhibits limitations in terms of reliability at low signal-to-noise ratio (SNR) values and susceptibility to noise uncertainty.
Finally, it is worth mentioning that artificial intelligence (AI)-based approaches have been leveraged either as standalone methodologies or in combination with the aforementioned techniques to yield enhanced and more efficient approaches [18]. In [19], a system is introduced that incorporates convolutional neural network (CNN) and long short-term memory (LSTM) deep learning (DL) architectures. CNN extracts energy-correlation features from sensing data covariance matrices, while LSTM utilizes these features across multiple sensing periods to learn the primary user activity patterns and enhance the detection probability. Notably, this approach operates without making assumptions about signal-noise models. In [20], researchers utilize short-time Fourier transform to create time-frequency matrices from the received samples, which serve as inputs for the CNN model used in spectrum occupancy classification. Effectively, this approach transforms the problem into an image classification task. The proposed method places no constraints on primary user signals and remains robust in the presence of SNR variations. Ref. [21] presents a system using spectral correlation function outputs as the input for a CNN. This approach not only determines channel occupancy, but also identifies the types of signals occupying the channel, demonstrating robustness even in challenging channel conditions. In [22], the authors employ power spectrum as the CNN input, normalizing received signal power to mitigate noise power uncertainty. They use the Residual Neural Network model, training it with an extensive dataset of signal and noise data. Transfer learning is applied, where the network is initially trained with simulated data and then fine-tuned with real-world signals. This method performs well under colored noise and exhibits the ability to generalize when detecting unknown signals. In contrast, Ref. [23] directly employs complex-valued samples as CNN inputs, eliminating the need for feature extraction. Transfer learning techniques are integrated to address performance degradation in diverse scenarios beyond the training dataset. Likewise, in [24], raw signals act as inputs for a neural network model, which includes one-dimensional CNN, bidirectional LSTM (BiLSTM), and self-attention (SA) components. This fusion utilizes CNN for local pattern extraction, BiLSTM for capturing long- and short-term dependencies, and SA for highlighting specific features.
In the context of energy detection, establishing a threshold that ensures a consistent false alarm rate and missed detection rate requires either having explicit information about the noise variance and SNR, or estimating these parameters from the available data. Extensive research has focused on the development of adaptive threshold determination methods, driven by the recognition that the noise floor commonly exhibits non-stationary behavior. The objective of these studies is to devise techniques that can dynamically adjust the threshold to accommodate the temporal fluctuations in the noise floor, enabling more robust and accurate detection performance.
For example, Ref. [25] introduces an adaptive thresholding method by leveraging the binarization concept used in image processing. The proposed approach represents the threshold as a linear function of the mean and standard deviation of the received signal samples. Ref. [26] proposes a heuristic method in which the threshold is modeled as a linear function of the signal-to-interference-plus-noise ratio. The authors highlight the simplicity and practicality of this approach for engineering applications, and note that although the proposed method necessitates complex offline optimization, it offers straightforward online threshold control. Ref. [27] addresses the problem of minimizing the error decision probability by formulating it as a function of the primary user’s spectrum utilization ratio (ranging from 0 to 1) and the threshold. The authors demonstrate that when the spectrum utilization ratio is fixed, the error decision probability function exhibits convex behavior, allowing for the derivation of solutions. A novel three-event energy detection algorithm is presented in [28]. The authors employ Newton’s method with forced convergence in a single iteration to approximate the optimal decision threshold, minimizing the error decision probability.
In [29], the authors present an approach centered on deep reinforcement learning. They employ a custom-designed reward function within the deep Q-network algorithm to adjust the energy detection threshold, and they integrate this method with a clustered cooperative spectrum sensing architecture. In [30], the authors present a method for adaptive threshold determination, employing noise power estimates obtained at each sensing interval. The noise power estimation relies on the spectral minima tracking (SMT) technique. Furthermore, the authors perform a correlation analysis to assess the relationship between the parameters of the SMT technique and the resulting noise power estimates. Leveraging the insights from this analysis, they fine-tune the parameters of the SMT method to enhance the accuracy of their noise power estimations. Ref. [31] enhances the fixed double-threshold scheme used with a conventional mean energy detection algorithm by determining the intermediate threshold adaptively. The authors formulate the intermediate threshold as a weighted combination of high- and low-value thresholds and employ a decision error probability metric to optimize its value.
With the primary objective of enhancing the utilization of the limited electromagnetic spectrum in real-world applications, this study introduces a novel multi-stage DL approach to spectrum sensing. It employs energy detection with adaptive thresholding, utilizing multiple stages of DL techniques applied to the time domain representation of the narrowband channel readings to estimate the threshold. The key contributions of the proposed approach are:
- The introduction of a multi-stage DL approach that effectively mitigates channel impairments and jointly conducts spectrum sensing while maintaining superior performance, resulting in low false alarm (3.85%) and missed detection (3.06%) rates. 
- The enhancement of system interpretability through a multi-stage DL approach, distinguishing it from monolithic DL models. This distinctive feature substantially diminishes the ‘black box’ nature often associated with DL systems. 
- The integration of DL techniques to dynamically estimate the energy detection threshold. To the best of our knowledge, although AI-based methods for spectrum sensing exist, the utilization of DL techniques to adaptively determine the threshold in energy detection has not been extensively explored in the literature; thus, it represents a novel contribution. 
- Our exclusive use of the time domain samples eliminates the need for typically resource-intensive operations and transformations, although it presents additional challenges. Certain transformations excel at extracting valuable features, some of which exhibit remarkable performance even under severe channel impairments. Nonetheless, our system effectively addresses these challenges within the time domain. 
- The exploration of diverse DL architectures, expanding beyond conventional choices such as CNN and LSTM. The incorporation of autoencoders serves as a bridge toward the wider integration of generative AI for the future in solving spectrum sensing problems. 
The subsequent sections are organized as follows: Section 2 provides a comprehensive overview of the system model, energy detection, proposed neural network architectures, training and testing datasets, hardware setup, hyperparameter settings, as well as the block bootstrapping method, which is one of the various methods utilized for performance measurement. Section 3 presents the results obtained from the experimental analysis. Finally, Section 4 concludes the paper by discussing the findings in detail and highlighting their implications.
  2. Materials and Methods
Although various methods in the literature extend beyond mere channel occupancy detection, spectrum sensing fundamentally aims to determine EM channel occupancy, essentially reducing to binary hypothesis testing [32]. In the following section, we present the mathematical representation of the spectrum sensing scenario.
  2.1. Mathematical Representation of the Spectrum Sensing Scenario
The spectrum sensing scenario that we consider can be mathematically represented as:
$$
\mathcal{H}_0:\; y[n] = w[n], \qquad \mathcal{H}_1:\; y[n] = x[n]\, e^{j\left(2\pi f_o n + \phi\right)} + w[n], \tag{1}
$$
where $\mathbf{y}$ represents the vector of received samples; $\mathbf{x}$ represents the vector of transmitted signal samples; $f_o$ represents the carrier frequency offset (CFO) expressed in terms of data rate; $n$ represents the zero-based array index; $\phi$ represents the phase offset in radians; $\mathbf{w}$ represents the vector of independent and identically distributed additive white Gaussian noise (AWGN) samples, where $w[n]$ follows a complex Gaussian distribution $\mathcal{CN}(0, \sigma_w^2)$; $\mathcal{H}_0$ denotes the null hypothesis indicating the vacancy of the channel; and, finally, $\mathcal{H}_1$ denotes the alternative hypothesis representing channel occupancy.
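As an illustration of the signal model in Equation (1), the following minimal Python sketch generates received samples under either hypothesis. The function names (`awgn`, `received_samples`) are hypothetical, and the sketch assumes that the CFO is given in cycles per sample, i.e., that the sampling frequency is normalized to 1; it is not the simulation code used in this study.

```python
import cmath
import math
import random


def awgn(num_samples, noise_var, rng):
    """Draw i.i.d. circularly symmetric complex Gaussian samples CN(0, noise_var)."""
    sigma = math.sqrt(noise_var / 2.0)  # standard deviation per real/imaginary part
    return [complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
            for _ in range(num_samples)]


def received_samples(x, cfo, phase, noise_var, rng, occupied=True):
    """Return noise only (H0) or the impaired signal plus noise (H1).

    Assumption: `cfo` is in cycles per sample and `phase` is in radians.
    """
    w = awgn(len(x), noise_var, rng)
    if not occupied:
        return w  # H0: channel is vacant
    # H1: rotate each sample by the CFO-induced and constant phase, then add noise
    return [x_n * cmath.exp(1j * (2.0 * math.pi * cfo * n + phase)) + w_n
            for n, (x_n, w_n) in enumerate(zip(x, w))]
```

For example, `received_samples(x, 0.01, 0.3, 0.1, random.Random(0))` would return one impaired realization of a hypothetical transmitted vector `x`.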
  2.2. Binary Hypothesis Testing and Energy Detection
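The received signal samples, denoted as $\mathbf{y}$, are utilized to construct the test statistic $T$ according to Equation (2) [32]:
$$
T = \sum_{n=0}^{N-1} \left| y[n] \right|^2, \tag{2}
$$
where $N$ is the number of received samples. Subsequently, this test statistic is compared with a threshold $\lambda$ to facilitate the hypothesis test:
$$
T \underset{\mathcal{H}_0}{\overset{\mathcal{H}_1}{\gtrless}} \lambda. \tag{3}
$$
In Equation (3), $\lambda$ represents the decision threshold, dictated by the channel’s noise variance. It can be predefined or statistically estimated, and it plays a crucial role in achieving accurate and reliable detection results.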
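The test described above can be sketched in a few lines of Python. This is a minimal illustration that assumes the test statistic is the plain sum of squared magnitudes (no normalization by the number of samples); the names `energy` and `energy_detect` are hypothetical.

```python
def energy(samples):
    """Test statistic: sum of squared magnitudes of the (complex) samples."""
    return sum(abs(s) ** 2 for s in samples)


def energy_detect(y, threshold):
    """Return True (decide H1, occupied) if the energy exceeds the threshold."""
    return energy(y) > threshold
```

With a fixed threshold, the detector is only as good as the knowledge of the noise floor behind that threshold, which is what motivates the adaptive approach that follows.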
  2.3. Proposed Adaptive Thresholding Approach
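Motivated by the principle of energy conservation, we adopt a perspective that centers around Equation (1) in the following manner:
$$
E_{\mathbf{y}} = E_{\mathbf{x}} + E_{\mathbf{w}}. \tag{4}
$$
Here, $E$ represents energy, and the subscripts denote the corresponding signal vectors. By leveraging insights from [33,34,35], we propose a deep learning-based method to estimate $\mathbf{x}$ from $\mathbf{y}$. This estimation allows us to determine $E_{\hat{\mathbf{w}}} = E_{\mathbf{y}} - E_{\hat{\mathbf{x}}}$, which, in turn, enables us to adaptively determine the threshold $\lambda$, which, in this case, is equivalent to $E_{\hat{\mathbf{w}}}$.
Thus, Equation (3) can be reformulated by substituting $E_{\mathbf{y}}$ for $T$ and $E_{\hat{\mathbf{w}}}$ for $\lambda$ while introducing a correction factor $k$, as follows:
$$
E_{\mathbf{y}} \underset{\mathcal{H}_0}{\overset{\mathcal{H}_1}{\gtrless}} k\, E_{\hat{\mathbf{w}}}. \tag{5}
$$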
Introducing the correction factor k addresses limitations in accurately estimating the transmitted signal vector’s energy. Under the null hypothesis, where the signal vector consists of zeros, ideal estimation yields zero energy. However, inherent imperfections lead to estimates close to zero but not precisely zero. This discrepancy requires accounting for and is addressed by introducing the correction factor k to adjust the threshold.
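The role of the correction factor can be seen in the following minimal Python sketch of the adaptive test. It assumes the noise energy is obtained as the received energy minus the energy of the estimated transmitted signal, as described above; the function names are hypothetical and the estimate `x_hat` is simply passed in rather than produced by a neural network.

```python
def energy(samples):
    """Sum of squared magnitudes of the (complex) samples."""
    return sum(abs(s) ** 2 for s in samples)


def adaptive_energy_detect(y, x_hat, k):
    """Decide H1 if the received energy exceeds k times the estimated noise energy.

    y     : received samples
    x_hat : estimate of the transmitted samples (all zeros in the ideal H0 case)
    k     : correction factor (> 1) absorbing imperfect estimation under H0
    """
    e_y = energy(y)
    e_w_hat = e_y - energy(x_hat)  # estimated noise energy
    return e_y > k * e_w_hat
```

Under the null hypothesis, a near-zero but nonzero `x_hat` makes the estimated noise energy slightly smaller than the received energy, so a factor of exactly one would trigger a false alarm; a value of k above one leaves a margin for this residual.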
  2.4. System Design
As shown in Equation (1), RF impairments, such as CFO, phase offset, and AWGN, affect the transmitted signal, causing deviations in the received signal. Addressing and compensating for these impairments enables accurate estimation of the transmitted signal from the received one.
We propose a multi-stage DL system that combines explicit and implicit estimation approaches to address these impairments. A dedicated neural network (NN) model estimates the CFO, and this estimation is used with signal processing techniques to mitigate its effect on the received signal.
For the remaining impairments, such as phase offset and AWGN, an implicit estimation approach is adopted. Another dedicated neural network jointly estimates and corrects these impairments by capturing complex patterns and dependencies in the data. Training on a moderately diverse dataset, the network learns to implicitly estimate and compensate for the combined effect of phase offset and noise.
By referring to Figure 1 and following the sequential numbering assigned to each line, readers can follow the system design. The system begins by estimating the CFO using a dedicated fully convolutional network (FCN). The estimated CFO is then utilized in the CFO correction block, where the received signal $\mathbf{y}$ is multiplied by $e^{-j 2\pi \hat{f}_{\mathrm{coarse}} n}$ to obtain an intermediate signal $\mathbf{y}_{\mathrm{coarse}}$, which undergoes a certain level of CFO correction. This stage is known as coarse CFO estimation and correction.
Subsequently, the same FCN is employed, with the previous stage’s output, i.e., the CFO-corrected intermediate signal $\mathbf{y}_{\mathrm{coarse}}$, serving as the input. This stage, referred to as fine CFO estimation and correction, further refines the CFO estimation and yields the signal $\mathbf{y}_{\mathrm{fine}}$, which represents a well-CFO-corrected signal. The coarse estimation result is denoted as $\hat{f}_{\mathrm{coarse}}$, while the fine estimation result is denoted as $\hat{f}_{\mathrm{fine}}$.
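The coarse and fine stages described above can be sketched as follows. In this hedged Python illustration, the estimator is passed in as a callable; `phase_increment_estimator` is a simple classical stand-in for an unmodulated tone and is not the FCN used in this study. All names are hypothetical, and the CFO is assumed to be in cycles per sample.

```python
import cmath
import math


def correct_cfo(samples, cfo_estimate):
    """Derotate the samples by the estimated CFO (cycles per sample)."""
    return [s * cmath.exp(-2j * math.pi * cfo_estimate * n)
            for n, s in enumerate(samples)]


def two_stage_cfo_correction(y, estimate_cfo):
    """Apply the same estimator twice: coarse stage, then fine stage."""
    f_coarse = estimate_cfo(y)
    y_coarse = correct_cfo(y, f_coarse)      # intermediate, partially corrected
    f_fine = estimate_cfo(y_coarse)          # residual CFO after the coarse stage
    y_fine = correct_cfo(y_coarse, f_fine)   # well-corrected signal
    return y_fine, f_coarse, f_fine


def phase_increment_estimator(samples):
    """Stand-in estimator (NOT the FCN): mean phase step between adjacent samples."""
    acc = sum(b * a.conjugate() for a, b in zip(samples, samples[1:]))
    return cmath.phase(acc) / (2.0 * math.pi)
```

Reusing one estimator for both stages means the fine stage only has to resolve the small residual offset left by the coarse stage.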
Afterwards, a U-Net-based autoencoder [36], known for its effectiveness in noise mitigation applications [37,38,39], is employed to compensate for phase offset and denoise the signal.
Next, the energies of both the received signal $\mathbf{y}$ and the estimated transmitted signal $\hat{\mathbf{x}}$ are computed, as depicted in Figure 1. These energy values are utilized to determine the energy of the noise $E_{\hat{\mathbf{w}}}$. Finally, the estimated noise energy is multiplied by a factor $k$, and together with $E_{\mathbf{y}}$, it is employed in the binary hypothesis testing to make a decision between $\mathcal{H}_0$ and $\mathcal{H}_1$.
Determining the value of parameter $k$ through an exhaustive search over all real values from one to infinity, and selecting the $k$ at which the probabilities of false alarm and missed detection intersect, is theoretically valid. However, implementing such an approach is computationally infeasible. Therefore, we employ a practical and empirical method to determine an appropriate value for $k$. Specifically, we create a vector of 200 evenly spaced numbers ranging from 1 to 1.5, with the upper limit of 1.5 chosen based on empirical analysis. Each element of this vector is substituted for $k$, and the corresponding false alarm and missed detection rates are evaluated. The value of $k$ at which the false alarm and missed detection rates intersect is identified as the most favorable choice, as demonstrated in Figure 2. This approach allows for an efficient selection of the parameter $k$. The average central processing unit time required to perform this process is approximately 800 ms.
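The grid search over k can be sketched as below. This minimal Python illustration assumes that, for each labeled example, the received energy and the estimated noise energy are already available as a pair; the function name `select_k` and the choice of the grid point with the smallest gap between the two error rates (as a discrete proxy for their intersection) are assumptions, not details taken from the study.

```python
def select_k(h0_pairs, h1_pairs, k_min=1.0, k_max=1.5, num_points=200):
    """Pick k from an evenly spaced grid where false alarm and missed detection rates meet.

    h0_pairs, h1_pairs : lists of (received_energy, estimated_noise_energy)
                         for vacant-channel and occupied-channel examples
    """
    grid = [k_min + (k_max - k_min) * i / (num_points - 1)
            for i in range(num_points)]
    best = None
    for k in grid:
        # False alarm: H1 decided although the channel is vacant
        fa = sum(1 for e_y, e_w in h0_pairs if e_y > k * e_w) / len(h0_pairs)
        # Missed detection: H0 decided although the channel is occupied
        md = sum(1 for e_y, e_w in h1_pairs if e_y <= k * e_w) / len(h1_pairs)
        gap = abs(fa - md)
        if best is None or gap < best[0]:
            best = (gap, k, fa, md)
    _, k, fa, md = best
    return k, fa, md
```

Because the rates are step functions of k on finite data, the two curves may not cross exactly at a grid point, which is why the sketch minimizes the absolute gap instead of testing for equality.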
While recurrent neural networks (RNNs) are commonly used for time series problems, CNNs and FCNs have also proven effective [40,41,42,43]. In our system, we employ FCNs for CFO estimation due to their ability to detect and combine complex patterns, making them suitable for this task.
Table 1 provides a comprehensive summary of the FCN layers and their associated parameters utilized for CFO estimation.
The architecture of our U-Net-based autoencoder, which has 165,378 trainable parameters, can be seen in Figure 3.
During the training stage for both networks, a mean square error loss function was utilized.
  2.5. Dataset Descriptions
In our simulations, we utilize quadrature phase-shift keying (QPSK) modulated baseband signals with a length of 512 samples. Each symbol within the signal consists of 8 samples. For pulse shaping, we utilize a square-root raised cosine filter with a roll-off factor of 0.25 and a span of 10. To simplify the analysis and implementation, we set the sampling frequency to 1 in our simulations.
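The signal parameters above can be made concrete with a short, stdlib-only Python sketch. It assumes the textbook square-root raised cosine impulse response, unit-energy tap normalization, Gray-agnostic random QPSK symbols, and trimming of the filtered output to the central 512 samples; these choices, and the function names, are illustrative assumptions rather than details confirmed by the study.

```python
import math
import random


def srrc_taps(beta, span, sps):
    """Square-root raised cosine taps: span symbols long, sps samples per symbol."""
    half = span * sps // 2
    taps = []
    for i in range(-half, half + 1):
        t = i / sps  # time in symbol periods
        if i == 0:
            h = 1.0 - beta + 4.0 * beta / math.pi
        elif abs(abs(4.0 * beta * t) - 1.0) < 1e-12:
            # Removable singularity at t = +/- 1 / (4 * beta)
            h = (beta / math.sqrt(2.0)) * (
                (1.0 + 2.0 / math.pi) * math.sin(math.pi / (4.0 * beta))
                + (1.0 - 2.0 / math.pi) * math.cos(math.pi / (4.0 * beta)))
        else:
            num = (math.sin(math.pi * t * (1.0 - beta))
                   + 4.0 * beta * t * math.cos(math.pi * t * (1.0 + beta)))
            den = math.pi * t * (1.0 - (4.0 * beta * t) ** 2)
            h = num / den
        taps.append(h)
    norm = math.sqrt(sum(h * h for h in taps))  # normalize to unit energy
    return [h / norm for h in taps]


def qpsk_baseband(num_symbols, sps, taps, rng):
    """Random QPSK symbols, zero-stuffed to sps samples/symbol and pulse shaped."""
    scale = 1.0 / math.sqrt(2.0)
    symbols = [complex(rng.choice((-1, 1)), rng.choice((-1, 1))) * scale
               for _ in range(num_symbols)]
    upsampled = [0j] * (num_symbols * sps)
    for k, s in enumerate(symbols):
        upsampled[k * sps] = s
    delay = (len(taps) - 1) // 2  # filter group delay in samples
    shaped = []
    for n in range(len(upsampled)):
        acc = 0j
        for m, h in enumerate(taps):
            idx = n + delay - m  # centered ("same"-length) convolution
            if 0 <= idx < len(upsampled):
                acc += h * upsampled[idx]
        shaped.append(acc)
    return shaped
```

With a roll-off of 0.25, a span of 10 symbols, and 8 samples per symbol, `srrc_taps(0.25, 10, 8)` yields 81 taps, and `qpsk_baseband(64, 8, taps, random.Random(1))` yields a 512-sample signal.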
During the training of the CFO network, we utilize CFO values that are uniformly distributed within the range of  relative to the data rate. To simplify the training process, we keep the phase offset and SNR constant. The constant phase offset was chosen because it should not have an effect on the CFO estimation. Moreover, we use a relatively high SNR of 20 dB to enable the network to effectively learn the patterns caused by the CFO. Our training dataset comprises 20,000 signals, each of which consists of 512 samples with in-phase and quadrature components. We partition this dataset into 70% for training and 30% for validation. For ease of training, we normalize the CFO values to fall within the range of , simplifying their use as target variables in the network. Additionally, we generate a separate test dataset consisting of 20,000 signals. This test dataset introduces phase offset values that are uniformly distributed from  to 45 degrees in 10-degree increments and SNR values that are uniformly distributed from 0 to 20 dB in 5 dB steps.
Table 2 provides a concise summary of the datasets utilized for both training and testing the CFO estimation network.
The training and testing datasets for the U-Net-based autoencoder are presented in Table 3, following a similar fashion as the CFO estimation dataset.
During the training of the U-Net-based autoencoder network, the trained CFO estimation network is used for CFO correction. This involves mitigating CFO in the data with estimated values before U-Net training. It is important to note that some residual CFO remains due to estimation limitations. This approach mimics real-world scenarios with residual CFO, enhancing the U-Net’s ability to handle remaining CFO components and improve practical performance.
Before discussing training and testing setups, we must describe the energy detection dataset. It consists of 10,000 signals with RF impairments, generated as previously explained. Additionally, 10,000 noise vectors following the complex normal distribution represent scenarios where the null hypothesis cannot be rejected. These sub-datasets are merged to form the final dataset for energy detection.
  2.6. Training and Testing Setup
The training and testing of both networks were conducted on a machine equipped with an Intel i7-6700HQ CPU (4 cores, 8 threads) running at 2.6 GHz base frequency, an NVIDIA GeForce GTX 950M GPU, and 16 GB RAM. The CFO network underwent training for 90 epochs, employing a learning rate of . To enhance the learning process, a learning rate scheduler known as ReduceLROnPlateau was utilized, reducing the learning rate by a factor of 0.5 with a patience of 10. The Adam optimizer was employed with the same learning rate, and the optimizer parameters were set to  and .
Similarly, the U-Net-based autoencoder network was trained for 35 epochs, employing the same learning rate and Adam optimizer parameters. However, the patience value for the learning rate scheduler was adjusted to 5. This configuration facilitated effective network training and optimization of model performance.
Table 4 provides a concise summary of the training hyperparameters utilized in our study.
With the hardware setup detailed in Table 5, our trained neural network can efficiently execute a single CFO estimation and correction event for an individual signal in approximately 0.6 ms. Furthermore, the other neural network can perform phase offset estimation, correction, and denoising for the same signal in approximately 2.2 ms.
  2.7. Block Bootstrapping
To evaluate the performance of our system, we utilized various assessment methods. One of these methods involves employing the block bootstrapping technique [44] to construct 95% prediction intervals, which are presented in Section 3. This method is applied individually to each signal estimation. First, we computed the residuals as the differences between our estimate and the true transmitted signal. Then, we generated multiple blocks of length 81 by striding over the residuals vector with a stride of 1, extracting a block at each stride. These blocks were then randomly sampled with replacement, concatenated, and trimmed to the length of 512. This process was repeated multiple times, resulting in the construction of 3000 residual vectors in our simulation.
To obtain the 95% prediction interval for the corresponding estimation, we added each of the residual vectors to our estimation one at a time. This process resulted in the formation of a new dataset. Finally, we computed the 2.5th and 97.5th percentiles of this dataset, which served as the lower and upper bounds of the prediction interval, respectively.
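The procedure above can be sketched in stdlib-only Python as follows. The sketch operates on a real-valued sequence (for instance, the in-phase component, which is an assumption, since the handling of complex samples is not specified), uses linear interpolation for the percentiles, and all function names are hypothetical.

```python
import math
import random


def percentile(sorted_values, pct):
    """Percentile of pre-sorted data using linear interpolation between ranks."""
    rank = (pct / 100.0) * (len(sorted_values) - 1)
    lo, hi = math.floor(rank), math.ceil(rank)
    frac = rank - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac


def block_bootstrap_interval(estimate, truth, block_len, num_resamples, rng):
    """Return (lower, upper) 95% prediction bounds for each sample of `estimate`."""
    n = len(estimate)
    residuals = [e - t for e, t in zip(estimate, truth)]
    # Overlapping blocks extracted with a stride of 1
    blocks = [residuals[i:i + block_len] for i in range(n - block_len + 1)]
    blocks_needed = math.ceil(n / block_len)
    resampled = []
    for _ in range(num_resamples):
        vec = []
        for _ in range(blocks_needed):
            vec.extend(rng.choice(blocks))  # sample blocks with replacement
        resampled.append(vec[:n])           # trim to the signal length
    lower, upper = [], []
    for i in range(n):
        # Add each bootstrapped residual to the estimate, then take percentiles
        column = sorted(estimate[i] + r[i] for r in resampled)
        lower.append(percentile(column, 2.5))
        upper.append(percentile(column, 97.5))
    return lower, upper
```

With the settings reported here, the call would use a block length of 81 and 3000 resamples on 512-sample vectors, e.g., `block_bootstrap_interval(estimate, truth, 81, 3000, random.Random(0))`.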
The utilization of the block bootstrapping method is preferred over naive resampling approaches due to the interdependence of contiguous samples in our transmitted signal, and consequently the estimated signal, caused by the pulse shaping filter. By attempting to mimic the underlying process that generated the time series, such as the residuals in our case, the block bootstrapping method aims to capture the temporal dependencies.
The choice of block length is a critical factor in block bootstrapping, as it directly impacts the ability of the method to capture the desired temporal dependencies. It serves as a tunable parameter, and an inappropriate selection of the block length may hinder the effectiveness of block bootstrapping in capturing the intended dependencies. In our case, we chose a block length of 81 to align with the length of our pulse shaping filter, ensuring that relevant dependencies within the signal were properly captured.
  4. Discussion and Conclusions
Spectrum sensing encompasses numerous aspects, with one fundamental yet critical challenge being the balance between computational complexity and performance. This study proposes a multistage approach that employs DL to adaptively estimate the energy detector threshold by utilizing only the time domain representation of received signal samples. The aim is to achieve a delicate equilibrium between computational complexity and performance.
The proposed model distinguishes itself through its simplicity, relying exclusively on time domain samples affected by RF impairments, thus avoiding the need for additional feature extraction operations or transformations. It is noteworthy that some transformations excel at extracting valuable features, even in the presence of severe channel impairments. Therefore, abstaining from their use presents additional challenges. Nevertheless, our system effectively tackles these challenges within the time domain, achieving outstanding performance with low false alarm (3.85%) and missed detection (3.06%) rates, all while jointly mitigating channel impairments and conducting spectrum sensing.
We leverage the received samples, which are influenced by RF impairments, to estimate the transmitted signal. By estimating the energies of both the transmitted and received signals, we can compute the noise energy necessary for threshold estimation. This noise energy is then multiplied by a correction factor, denoted as k, to further refine the threshold estimation process.
The multistage approach incorporates an FCN for accurate CFO estimation and explicit CFO correction. Phase offset estimation, correction, and noise suppression are implicitly handled by a U-Net-based autoencoder network. Energy calculations are performed on the received and estimated transmitted signals to estimate the noise energy. Finally, the energy of the received signal and the k-adjusted noise energy are used for binary hypothesis testing, enabling effective spectrum sensing. Quantitative analyses confirm the efficacy of our proposed method.
Based on our literature research, although AI-based methods for spectrum sensing are available, the application of DL techniques for dynamically determining the energy detection threshold has not been comprehensively investigated in the existing literature. This represents a unique and innovative contribution of our work.
We also investigate a range of DL architectures, going beyond the typical selections, such as CNN and LSTM. The inclusion of autoencoders serves as a gateway to the broader integration of generative AI, a path to potentially improving spectrum sensing in the future.
We acknowledge the limitations of our approach. The system’s performance degrades when encountering phase offsets exceeding 45 degrees, making it less suitable for complex modulation schemes like 16-QAM or 64-QAM, as well as presenting additional challenges in accurate detection and parameter estimation when applied to signals conforming to standards like LTE and 5G NR, which exhibit noise-like properties.
Motivated by the identified limitations, we propose several potential research directions to address the challenges. One immediate but challenging approach is to design custom loss functions specifically tailored to train the neural networks. The customized loss functions would incorporate various characteristics of the received signal, particularly its statistical properties. By incorporating relevant statistical information into the training process, we anticipate that the neural networks could better adapt to the inherent complexities and variations present in the received signal, ultimately improving the overall system performance. However, developing such custom loss functions would require careful consideration and exploration of appropriate statistical metrics and techniques, making it a challenging yet promising avenue for further investigation.
In addition to exploring custom loss functions, another promising research direction is to investigate the integration of state-of-the-art DL techniques into the proposed system. Techniques such as attention mechanisms [45] and diffusion models [46] have shown significant advancements in various domains and have the potential to enhance the mitigation of RF impairments.
A more robust approach would involve leveraging cyclostationary or covariance-based features, which provide valuable information about the underlying signal structure. However, extracting these features traditionally requires significant computational resources. Therefore, a promising research direction is to explore DL methods for estimating these features directly from the received samples. This approach would involve developing specialized architectures and training techniques to efficiently extract informative features and enhance system performance.
In our future work, we intend to conduct real-world experiments to validate the system’s performance under practical conditions. Furthermore, we aim to explore various use case scenarios and implement necessary modifications to extend the system’s applicability to more complex modulation schemes and signals that adhere to specific wireless communication standards.