4.1. Network Architecture
The architecture of the proposed PE-MSRN algorithm is illustrated in
Figure 2. By integrating residual connections and multi-scale feature extraction within a deep convolutional neural network, combined with position encoding-enhanced data preprocessing, the algorithm achieves robust deep learning of 5G signal features. The architecture comprises an image input layer, a convolutional network, residual blocks, a squeeze-and-excitation (SE) attention network, a multi-scale feature extraction mechanism, and a fully connected output stage. Dynamic size adaptation and optimized training strategies significantly enhance positioning performance. The data processing pipeline within PE-MSRN can be summarized as follows. First, the raw 5G CSI features undergo an enhancement process, incorporating positional encoding for angular data and nonlinear transformations for time delay, resulting in an extended feature vector of dimension $N$. This vector, now rich with spatial information, is systematically reshaped into a 2D image-like format of size $H \times W$, making it amenable to convolutional processing. This image serves as the input to our deep network. The initial layers of the network perform primary feature extraction, followed by a series of residual blocks to learn deeper representations. Crucially, a multi-scale feature extraction module then processes the feature maps through parallel branches with different kernel sizes to capture both fine-grained details and broader contextual patterns simultaneously. Finally, these multi-scale features are fused and passed to fully connected layers to regress the final 3D coordinates. This hierarchical and multi-scale design is key to the high performance of the algorithm.
The algorithm flowchart of the proposed PE-MSRN algorithm is shown in
Figure 3. The enhanced features are first transformed into a single-channel feature map. This transformation converts the feature vector into a two-dimensional image input, making it easier for convolution operations to capture spatial dependencies. The map is then processed by an initial convolutional layer for primary feature extraction, followed by residual blocks employing convolutions with different channel widths to extract spatial features while mitigating gradient vanishing.
The SE attention mechanism applies global average pooling and generates channel weights through fully connected layers with a compression ratio of $r$ and sigmoid activation, thereby enhancing key feature responses and improving robustness to noise. The multi-scale feature extraction network captures diverse spatial patterns through three parallel convolutional branches with small, medium, and large kernels, fuses them via a $1 \times 1$ convolution, and optimizes gradient flow with residual connections. After dimensionality reduction through global average pooling, the features are processed by three fully connected layers with progressively reduced dimensions, each combining batch normalization, a leaky rectified linear unit (LeakyReLU), and dropout; the normalized coordinates are finally output through a fully connected layer and a regression layer. The pseudocode of the proposed PE-MSRN algorithm is shown in Algorithm 1.
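For illustration, a minimal PyTorch sketch of the SE attention stage is given below; the compression ratio $r = 16$ is a common default and an assumption here, not the configured value of PE-MSRN.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation attention: a minimal sketch of the SE stage
    described above. The compression ratio r = 16 is an illustrative
    placeholder, not the paper's configured value."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: global average pooling
        self.fc = nn.Sequential(                     # excitation: two FC layers
            nn.Linear(channels, channels // r),
            nn.LeakyReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),                            # channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                 # re-weight channel responses
```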
Algorithm 1: The PE-MSRN algorithm for 5G positioning.
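To complement Algorithm 1, the following sketch summarizes the training flow in PyTorch; the model object, the data loader, and the Adam optimizer with an MSE regression loss are illustrative stand-ins for the setup described above, not confirmed choices of the paper.

```python
import torch
import torch.nn as nn

def train_pe_msrn(model: nn.Module, loader, epochs: int, lr: float = 1e-3):
    """Hedged skeleton of the PE-MSRN training loop (Algorithm 1)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()               # regression of normalized 3D coordinates
    model.train()
    for _ in range(epochs):
        for images, coords in loader:      # images: (B, 1, H, W) enhanced feature maps
            optimizer.zero_grad()
            pred = model(images)           # conv -> residual -> SE -> multi-scale
                                           # -> GAP -> FC head -> (x, y, z)
            loss = criterion(pred, coords)
            loss.backward()
            optimizer.step()
    return model
```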
4.2. Design of Positional Encoding Algorithm
The basic idea of positional encoding is to combine a position-related encoding value with each input feature, typically generated using sine and cosine functions, thus providing a unique representation for each position. In this study, the positional encoding from the Transformer architecture is integrated into the data preprocessing stage of 5G deep learning positioning [43]. By enhancing the 2D estimated angles, i.e., the azimuth $\hat{\theta}$ and the elevation $\hat{\phi}$, with positional encoding, additional spatial information is provided to the network. This allows each pixel in the image to contain not only its own feature information but also its spatial position attributes. Additionally, by incorporating other cross-features related to AOA and TOA, the algorithm can achieve more accurate position estimation.
Firstly, the 2D estimated angles are converted to radians. The positional encoding dimension $d$ is divided into four parts, each of dimension $d/4$, which are used for the sine and cosine encodings of the two angles. For the $k$-th frequency dimension, the corresponding frequency factor $\omega_k$ can be defined as

$$\omega_k = \omega_{\min} \left( \frac{\omega_{\max}}{\omega_{\min}} \right)^{\frac{k}{d/4 - 1}}, \quad k = 0, 1, \ldots, \frac{d}{4} - 1,$$

where $\omega_{\min}$ is a reference initial value that determines the starting point of the frequency distribution and is used to control the frequency range of the position encoding. The definition of the frequency factor allows it to cover multiple scales, enabling the capture of positional features at different frequency ranges. Through a geometric progression, the frequency starts from the small value $\omega_{\min}$ and gradually increases until it reaches $\omega_{\max}$, thereby spanning multiple scales in the frequency range [43,44] and capturing finer-grained angular variation information.
Secondly, for the $i$-th azimuth angle $\hat{\theta}_i$ and elevation angle $\hat{\phi}_i$, the corresponding sine and cosine terms for positional encoding can be written as

$$\mathbf{p}_{\theta,i} = \left[ \sin(\omega_k \hat{\theta}_i) \right]_{k=0}^{d/4-1} \oplus \left[ \cos(\omega_k \hat{\theta}_i) \right]_{k=0}^{d/4-1}, \qquad \mathbf{p}_{\phi,i} = \left[ \sin(\omega_k \hat{\phi}_i) \right]_{k=0}^{d/4-1} \oplus \left[ \cos(\omega_k \hat{\phi}_i) \right]_{k=0}^{d/4-1},$$

where $\oplus$ denotes vector concatenation. Finally, all the encodings for the 2D estimated angles are concatenated into the position encoding vector, as shown below:

$$\mathbf{p}_i = \mathbf{p}_{\theta,i} \oplus \mathbf{p}_{\phi,i} \in \mathbb{R}^{d}.$$
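As a concrete illustration of the angular encoding, a NumPy sketch follows; the encoding dimension $d = 64$ and the upper frequency $\omega_{\max} = 10^3$ are illustrative assumptions, not values from the paper.

```python
import numpy as np

def angle_positional_encoding(theta_deg: float, phi_deg: float, d: int = 64,
                              omega_min: float = 2 * np.pi,
                              omega_max: float = 1.0e3) -> np.ndarray:
    # Split the encoding dimension into four parts of size d/4:
    # sin/cos of the azimuth and sin/cos of the elevation.
    m = d // 4
    k = np.arange(m)
    # Geometric progression of frequencies from omega_min up to omega_max.
    omega = omega_min * (omega_max / omega_min) ** (k / max(m - 1, 1))
    theta = np.deg2rad(theta_deg)   # angles are converted to radians first
    phi = np.deg2rad(phi_deg)
    return np.concatenate([np.sin(omega * theta), np.cos(omega * theta),
                           np.sin(omega * phi), np.cos(omega * phi)])
```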
This part of the encoding captures the periodic variation of the angles. The frequency ranges at different scales allow the network to leverage the combination of low-frequency and high-frequency signals, enhancing its feature representation ability. To further utilize the geometric information of the angles, $\hat{\theta}_i$ and $\hat{\phi}_i$ are used to compute a direction vector through trigonometric functions; converting the spherical coordinates to Cartesian coordinates yields the unit direction vector

$$\mathbf{u}_i = \left[ \cos\hat{\phi}_i \cos\hat{\theta}_i, \; \cos\hat{\phi}_i \sin\hat{\theta}_i, \; \sin\hat{\phi}_i \right]^{T}.$$
This directional quantity can directly represent the spatial direction of the signal transmission path, helping the network capture spatial features in different directions. The original feature package contains the time delay information $\hat{\tau}$; to further enrich the input features, a nonlinear transformation is applied to the time delay, in which a small adjustment factor $\epsilon$ is introduced to avoid numerical deviations. Finally, all features processed by positional encoding and transformation are integrated to form a new enhanced feature vector, which is then input into the network for training. The extended input dataset based on positional encoding features contains the SNR estimate, the AOA values $\hat{\theta}$ and $\hat{\phi}$, the time delay estimate $\hat{\tau}$, the received signal power, the sine and cosine transformations of the 2D angles, the 3D unit direction vector, the transformed time delay, cross-features of SNR and power, and the positional encoding.
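The direction vector and delay transform can be sketched as follows; the logarithmic form of the delay transform and the value of $\epsilon$ are assumptions consistent with the "small adjustment factor" described above, not the exact transformation used in PE-MSRN.

```python
import numpy as np

def geometric_features(theta_deg: float, phi_deg: float, tau: float,
                       eps: float = 1e-9):
    """Sketch of the geometric enhancements: unit direction vector and a
    nonlinear delay transform (log form is an assumption)."""
    theta = np.deg2rad(theta_deg)
    phi = np.deg2rad(phi_deg)
    # Spherical-to-Cartesian unit direction vector of the propagation path.
    u = np.array([np.cos(phi) * np.cos(theta),
                  np.cos(phi) * np.sin(theta),
                  np.sin(phi)])
    # Nonlinear delay transform; eps avoids the logarithm of zero.
    tau_t = np.log(tau + eps)
    return u, tau_t
```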
After the feature expansion operation described above, the input feature dimension becomes $N$. The image area is set to $H \times W$, where each feature value is mapped to a pixel in the image. The image area must be at least $N$. For a square image, the ideal image dimensions are calculated as follows:

$$H = W = \left\lceil \sqrt{N} \right\rceil,$$

where $\lceil \sqrt{N} \rceil$ denotes the ceiling of the square root of $N$, ensuring the image width and height can accommodate all features. For each feature under each SNR value at every coordinate point, the feature index is denoted as $n \in \{0, 1, \ldots, N-1\}$, and its corresponding pixel coordinates in the image are $(r, c)$, where $r$ is the row index and $c$ is the column index, calculated as follows:

$$r = \left\lfloor \frac{n}{W} \right\rfloor, \quad c = n \bmod W.$$

Given an image $I \in \mathbb{R}^{H \times W \times C}$, where $H$ and $W$ represent the height and width and $C$ is the number of channels ($C = 1$ here), each feature $x_n$ is assigned to the corresponding position in the image, as shown below:

$$I(r, c, 1) = x_n.$$
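This mapping admits a direct implementation; in the sketch below, the trailing pixels beyond the $N$-th feature are zero-padded, which is an assumption about how unused positions are filled.

```python
import numpy as np

def features_to_image(x: np.ndarray) -> np.ndarray:
    """Map an N-dimensional enhanced feature vector to a single-channel
    square image of side ceil(sqrt(N)), row-major, zero-padding the rest."""
    n_feat = x.shape[0]
    side = int(np.ceil(np.sqrt(n_feat)))        # H = W = ceil(sqrt(N))
    img = np.zeros((side, side, 1), dtype=x.dtype)
    for n, val in enumerate(x):
        r, c = divmod(n, side)                  # r = floor(n / W), c = n mod W
        img[r, c, 0] = val                      # I(r, c, 1) = x_n
    return img
```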
The reference initial value $\omega_{\min}$ is a critical hyperparameter that was set to $2\pi$ in our experiments. This choice is based on the rationale that the fundamental frequency of the encoding should correspond to a full cycle of the angular unit, allowing the subsequent geometric progression to capture variations from this base frequency up to very high frequencies, thus covering a comprehensive spectrum of positional information.
4.3. Residual Connection Mechanism
To enhance the training stability and representational capacity of deep neural networks, this study integrates a residual connection network based on the pre-activation structure into the PE-MSRN algorithm. This network consists of two convolutional layers, each preceded by batch normalization (BN) and a LeakyReLU activation function. Compared with the conventional Conv→BN→ReLU arrangement, the pre-activation structure (BN→LeakyReLU→Conv) enables more effective gradient backpropagation in deep layers. The output of the residual block can be defined as

$$\mathbf{y} = \mathcal{F}(\mathbf{x}) + h(\mathbf{x}).$$

When the input and output dimensions are consistent, an identity mapping $h(\mathbf{x}) = \mathbf{x}$ is used; otherwise, a $1 \times 1$ convolutional kernel is applied for dimensional alignment. The LeakyReLU activation function is used in the residual blocks to improve gradient flow and avoid neuron inactivation. Unlike standard ReLU, which outputs zero for all negative inputs, LeakyReLU allows a small negative slope $\alpha$ to preserve non-zero gradients, as given by

$$f(x) = \begin{cases} x, & x \geq 0, \\ \alpha x, & x < 0. \end{cases}$$
This structure offers significant advantages during backpropagation. The gradient can bypass the nonlinear transformations through the identity shortcut, thereby alleviating vanishing gradient issues. Specifically, the derivative with respect to the input can be expressed as

$$\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \left( \mathbf{I} + \frac{\partial \mathcal{F}(\mathbf{x})}{\partial \mathbf{x}} \right),$$

where $\mathcal{L}$ denotes the loss function and $\mathbf{I}$ is the identity matrix. This design not only improves the trainability of deep neural networks but also helps preserve the statistical distribution consistency between the input and output features.
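A minimal PyTorch sketch of such a pre-activation residual block is given below; the $3 \times 3$ kernel size and the LeakyReLU slope of 0.01 are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PreActResidualBlock(nn.Module):
    """Pre-activation residual block (BN -> LeakyReLU -> Conv, twice) with a
    1x1 convolution shortcut when dimensions change. The 3x3 kernel is an
    illustrative choice, not the paper's confirmed size."""
    def __init__(self, c_in: int, c_out: int, slope: float = 0.01):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(c_in), nn.LeakyReLU(slope),
            nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
            nn.BatchNorm2d(c_out), nn.LeakyReLU(slope),
            nn.Conv2d(c_out, c_out, kernel_size=3, padding=1),
        )
        # Identity shortcut if shapes match, otherwise 1x1 conv alignment.
        self.shortcut = (nn.Identity() if c_in == c_out
                         else nn.Conv2d(c_in, c_out, kernel_size=1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x) + self.shortcut(x)    # y = F(x) + h(x)
```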
4.4. Multi-Scale Feature Extraction Mechanism
To more effectively capture the spatial diversity and scale variation inherent in the input features, a multi-scale feature extraction mechanism is designed in the proposed network. This module is embedded after the final residual block and extracts features across different receptive fields through parallel convolutional branches at multiple scales.
Specifically, this block consists of three parallel branches: the first branch employs a small-kernel convolution to extract local fine-grained features; the second branch applies a medium-kernel convolution to extract mid-scale spatial representations; and the third branch employs a larger convolution kernel, enabling the capture of more global spatial patterns. Each convolutional path is followed by a BN operation and a LeakyReLU activation function, which help accelerate convergence and enhance nonlinear representational capacity. The multi-scale feature extraction process of the three branches can be expressed as follows:

$$\mathbf{F}_i = \sigma\left( \mathrm{BN}\left( \mathbf{W}_i * \mathbf{X} \right) \right), \quad i = 1, 2, 3,$$

where $\mathbf{X}$ denotes the input feature map of the network, $*$ represents the convolution operation, and $\mathbf{W}_i$ denotes the convolution kernel of the $i$-th branch. $\mathrm{BN}(\cdot)$ is the batch normalization operation, and $\sigma(\cdot)$ denotes the LeakyReLU activation function. The outputs from the three parallel branches are concatenated along the channel dimension to form a unified feature representation, as given by

$$\mathbf{F}_{\mathrm{cat}} = \mathrm{Concat}\left( \mathbf{F}_1, \mathbf{F}_2, \mathbf{F}_3 \right).$$
To integrate multi-scale information and reduce dimensionality, the concatenated features are fused using a $1 \times 1$ convolution, as given by

$$\mathbf{F}_{\mathrm{fuse}} = \mathbf{W}_{f} * \mathbf{F}_{\mathrm{cat}}.$$

To preserve the original information and enhance gradient flow, a shortcut connection is introduced, and the final output of the network is then obtained by

$$\mathbf{Y} = \mathbf{F}_{\mathrm{fuse}} + \mathbf{X},$$

where $\mathbf{W}_{f}$ is a $1 \times 1$ convolution kernel. The proposed multi-scale feature extraction mechanism enables parallel modeling with different receptive fields, allowing joint learning of features at multiple spatial resolutions. On the one hand, small-scale convolutions help capture fine-grained geometric structures, such as localized directional variations; on the other hand, large-scale paths are capable of capturing spatial correlations across regions, thereby enhancing the network's ability to accurately estimate position information in complex environments.
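For illustration, the block can be sketched in PyTorch as follows; the kernel sizes 3/5/7 and the equal per-branch channel widths are assumptions, not the exact configuration of PE-MSRN.

```python
import torch
import torch.nn as nn

def _branch(c_in: int, c_out: int, k: int) -> nn.Sequential:
    """One Conv -> BN -> LeakyReLU branch with a k x k kernel."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=k, padding=k // 2),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.01),
    )

class MultiScaleBlock(nn.Module):
    """Sketch of the multi-scale extraction block: three parallel branches
    (assumed kernel sizes 3/5/7), channel-wise concatenation, 1x1 fusion,
    and a residual shortcut."""
    def __init__(self, channels: int):
        super().__init__()
        self.b1 = _branch(channels, channels, 3)   # local fine-grained features
        self.b2 = _branch(channels, channels, 5)   # mid-scale representations
        self.b3 = _branch(channels, channels, 7)   # broader spatial patterns
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = torch.cat([self.b1(x), self.b2(x), self.b3(x)], dim=1)
        return self.fuse(f) + x                    # 1x1 fusion plus shortcut
```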
4.6. Complexity Discussion of Different Algorithms
For the deep neural network, the computational complexity mainly depends on three key components: the input layer, the hidden layers (including the attention mechanism and residual blocks), and the output layer. The following summarizes the computational complexity of each part and then gives the total computational cost. The variable notations and corresponding explanations are listed in Table 1 above.
For each single training epoch, the primary computation of the input layer involves both the convolutional and the fully connected layers. The calculation of the input layer mainly includes two parts: matrix multiplication and bias addition. For each batch of data, the fully connected computational complexity of the input layer can be expressed as

$$C_{\mathrm{in}} = B \left( d_{\mathrm{in}} d_{h} + d_{h} \right),$$

where $B$ is the batch size, $d_{\mathrm{in}}$ the input feature dimension, and $d_{h}$ the hidden dimension. The residual stage involves attention mechanisms and residual connections, with repeated multiplicative operations and additive operations for the residual connections. For a single attention residual block, the main path involves two types of fully connected layers and the attention structure, and the total computation over $L$ blocks can be calculated as

$$C_{\mathrm{res}} = \mathcal{O}\left( L \cdot B \cdot d_{h}^{2} \right).$$

The output layer is responsible for mapping the high-dimensional feature representations to the final prediction results, and its computational complexity is directly related to the hidden dimension and the output dimension, as given by

$$C_{\mathrm{out}} = B \cdot d_{h} \cdot d_{\mathrm{out}}.$$

Thus, the total computational complexity for a single training epoch can be computed as

$$C_{\mathrm{total}} = C_{\mathrm{in}} + C_{\mathrm{res}} + C_{\mathrm{out}}.$$
For Algo.5 (PE-FCNet), substituting its parameter configuration (listed below) into the above expressions yields its specific computational complexity. For convolutional networks, the computational complexity is related to the size of the convolution kernel, the size of the input feature map, the size of the output feature map, and the number of convolution kernels. For a single training round, the computational complexity of the convolutional layers in the network can be expressed as

$$C_{\mathrm{conv}} = \sum_{i} B \cdot K_{i}^{2} \cdot C_{\mathrm{in},i} \cdot C_{\mathrm{out},i} \cdot H_{i} W_{i},$$

where $K_{i}$ is the kernel size, $C_{\mathrm{in},i}$ and $C_{\mathrm{out},i}$ are the input and output channel numbers, and $H_{i} \times W_{i}$ is the output feature map size of the $i$-th convolutional layer. Each convolution kernel performs a convolution operation with the input feature map and calculates the weighted sum of multiple local regions. Therefore, the complexity of a convolutional layer is directly related to the size of the feature map and the number of convolution kernels. The fully connected layers are used for feature integration, and their computational complexity is mainly related to the dimension of the input features and the number of output nodes. The computational complexity of the $j$-th fully connected layer can be calculated as

$$C_{\mathrm{fc},j} = B \cdot d_{j-1} \cdot d_{j}.$$

The total training computational complexity can be defined as

$$C_{\mathrm{train}} = C_{\mathrm{conv}} + \sum_{j} C_{\mathrm{fc},j}.$$
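As a worked example of these counting rules, the short Python snippet below tallies multiply-accumulate operations for a purely illustrative configuration; none of the layer sizes are taken from Table 1 or Table 3.

```python
def fc_flops(batch, dims):
    """MACs of a fully connected stack: sum over j of B * d_{j-1} * d_j."""
    return sum(batch * dims[j - 1] * dims[j] for j in range(1, len(dims)))

def conv_flops(batch, layers):
    """MACs of conv layers; each entry is (K, C_in, C_out, H_out, W_out),
    costed as B * K^2 * C_in * C_out * H_out * W_out."""
    return sum(batch * k * k * ci * co * h * w for k, ci, co, h, w in layers)

# Illustrative (not the paper's) configuration: one 3x3 conv (1 -> 64
# channels, 16x16 output map) plus a 512 -> 128 -> 3 FC head, batch 64.
total = conv_flops(64, [(3, 1, 64, 16, 16)]) + fc_flops(64, [512, 128, 3])
print(f"approximate MACs per batch: {total:,}")
```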
Algo.1 (PE-MSRN) consists of a series of convolutional layers and fully connected layers. The network reduces the image size gradually while increasing the kernel size to extract richer features. The output channels range from 64 to 512, and the input channel number increases layer by layer from 1 to 512 at the deepest layer. In contrast, Algo.2 (CNN) has fewer convolutional layers, uses a fixed kernel size, and its output channels range from 32 to 256; the input channel number progressively increases from a single channel to 256 through the network depth.
For Algo.3 (FCNet), the configuration is specified by the input feature dimension, the number of neurons in the hidden layer, the number of residual attention blocks, and the output dimension; Algo.4 (FE-FCNet) and Algo.5 (PE-FCNet) adopt the same parameterization with different values. Substituting the network parameters into the complexity calculation formulas of the different algorithms above, the numerical analysis of the network training operation count can be obtained, as shown in Table 3.