CSI Feedback Model Based on Multi-Source Characterization in FDD Systems

In wireless communication, to fully utilize the spectrum and energy efficiency of the system, it is necessary to obtain the channel state information (CSI) of the link. However, in Frequency Division Duplexing (FDD) systems, CSI feedback wastes part of the spectrum resources. In order to save spectrum resources, the CSI needs to be compressed. However, many current deep-learning algorithms have complex structures and a large number of model parameters. When the computational and storage resources are limited, the large number of model parameters will decrease the accuracy of CSI feedback, which cannot meet the application requirements. In this paper, we propose a neural network-based CSI feedback model, Mix_Multi_TransNet, which considers both the spatial characteristics and temporal sequence of the channel, aiming to provide higher feedback accuracy while reducing the number of model parameters. Through experiments, it is found that Mix_Multi_TransNet achieves higher accuracy than the traditional CSI feedback network in both indoor and outdoor scenes. In the indoor scene, the NMSE gains of Mix_Multi_TransNet are 4.06 dB, 4.92 dB, 4.82 dB, and 6.47 dB for compression ratio η = 1/8, 1/16, 1/32, 1/64, respectively. In the outdoor scene, the NMSE gains of Mix_Multi_TransNet are 3.63 dB, 6.24 dB, 4.71 dB, 4.60 dB, and 2.93 dB for compression ratio η = 1/4, 1/8, 1/16, 1/32, 1/64, respectively.


Introduction 1.Background and Motivations
With the development of mobile communication technology, the need for highprecision channel feedback is becoming increasingly urgent.In a large-scale Multi-Input Multi-Output (MIMO) system, the Base Station (BS) is configured with a large number of antennas to fully utilize spatial diversity and spatial multiplexing to increase the channel capacity [1].As the amount of transmitted data increases, the channel becomes congested.We can use reconfigurable intelligent surfaces (RISs) for safe and efficient effects.RISs act as a relay to strengthen the signal strength and provide energy for the subsequent signal transmission [2,3].RISs mainly serve as a relay to enhance the transmission energy.Beamforming is required to concentrate the signal energy on specific User Equipment (UE) and achieve less interference and leakage at a high signal-noise ratio (SNR) while ensuring the transmission [4].One of the keys to achieving beamforming is to have accurate downlink Channel State Information (CSI).Usually, in Time Division Duplexing (TDD) systems where uplink data and downlink data can be transmitted at the same frequency point, the uplink and downlink channels are reciprocal, i.e., the UE can back-propagate the uplink channel through the downlink channel and, thus, does not have to spend many resources to implement the CSI estimation [5].
In recent years, Frequency Division Duplexing (FDD) systems have gained more popularity due to their higher spectrum utilization [6].In FDD, on the other hand, the CRNet increases the feedback accuracy to a certain extent, but it singularly focuses on the spatial features of the CSI and similarly neglects the temporal nature of CSI.CLNet [24] weights the complex signals by using the correlation between real and imaginary parts after dividing the complex signals into real and imaginary parts.Compared to other algorithms, CLNet is trained with a relatively simple network structure, which reduces the information loss problem caused by complex splitting and complexity.However, the method also singularly considers the spatial features of the CSI and ignores the temporal nature of CSI.CsiNet-LSTM [25] and Attention-CSI [26] gather an increased temporal nature from the training samples by introducing LSTM.Although both networks focus on the temporal nature of CSI to ensure some feedback accuracy, they ignore the spatial features of CSI.STNet [27] proposes a spatially divisible attention mechanism, which mainly refers to the attention mechanism part of Transformer [28], using local grouping self-attention and global sampling attention.The method improves the performance of the network with some increase in complexity.The algorithm pays some attention to both the spatial characteristics of CSI and the temporal order of the CSI but does not balance the feedback accuracy of CSI.

Contributions
The above networks have paid attention to their spatial characteristics or temporal properties by analyzing the downlink CSI features.However, the networks consider them only from a single part, ignoring the influence between the two factors, or the feedback accuracy is too low.When the network only thinks of the spatial characteristics, the network breaks the connection between the elements in the downlink CSI, which affects the final feedback accuracy.When the network only finds from the temporal nature, the feedback matrix is not sufficiently compressed, which significantly increases the overhead cost of the feedback.The CSI feedback accuracy is improved by paying attention to and learning the spatial characteristics and timing of the downlink CSI.In this study, the Mix_Multi_TransNet network is proposed to consider both the spatial factors and temporality of the downlink CSI, and its main work is summarized as follows: (1) Use dual-path neural networks.Path1 and Path2 learn the channel matrix's spatial features and temporal properties, respectively.Path1 adopts multiple sensory fields to understand the channel matrix's spatial characteristics comprehensively, and Path2 combines the Transformer Attention Mechanism with Convolutional Neural Networks to thoroughly learn the temporal properties of the channel information.(2) Path1 uses multiple sensory fields; Path2 uses the Transformer attention mechanism combined with a convolutional neural network to ensure the learning of temporal relationships while reducing model parameters.The two path feature representations are eventually fused to improve the model's performance, robustness, and generalization ability.

Paper Organization
The article consists of five parts in total.This chapter focuses on the background of this study and the innovations and contributions of our entire study; the second part mainly introduces our system model; the third part mainly introduces the design of the Mix_Multi_TransNet network; the fourth part is our experimental results and comparisons; and the fifth part is our conclusion.

System Model
In the FDD system, the BS have N t -transmitting antennas and the UE have N r antennas, N t ≥ N r .For simplicity, this study lets N r = 1.An Orthogonal Frequency Division Multiplexing (OFDM) system with N c subcarriers is used, and the system model is shown in Figure 1.

System Model
In the FDD system, the BS have  -transmitting antennas and the UE have  antennas,   .For simplicity, this study lets  = 1 .An Orthogonal Frequency Division Multiplexing (OFDM) system with  subcarriers is used, and the system model is shown in Figure 1.The received signal  ∈ ℂ can be expressed as: where  ∈ ℂ denotes the transmitted symbol vector in one OFDM cycle and  ∈ ℂ is the additive noise vector.The diagonal matrix  = (ℎ  , … , ℎ  ),  =  , where ℎ ∈ ℂ and  ∈ ℂ , denotes the downlink channel response vector and beamforming precoding vector,  ∈ 1, … ,  , respectively.
In order to obtain the beamforming vectors, it is necessary to obtain the corresponding ℎ at the base station.Define the downlink channel matrix ℂ .The matrix contains   elements, each of which includes information about the real and imaginary parts of the CSI, proportional to the number of antennas.In the case of massive MIMO,   will be a vast number, which is a huge challenge for data processing.Since the channel matrix is sparse in the angular time-delay domain,  can be transformed from the spatial frequency domain to the angular time-delay domain using the two-dimensional discrete Fourier transform (DFT), as shown in Equation (2).

𝐻 = 𝐹 𝐻𝐹
(2) where  and  denote DFT matrices of size   and   , respectively.For each element in the angular delay domain matrix,  ℂ corresponds to a path delay with a certain angle of arrival (AoA).Only the first  rows of  obtained after twodimensional discrete Fourier transform (DFT) contain helpful information, and the rest of the rows indicate paths with considerable propagation delays, so the elements are almost zero.Therefore, only the first  is truncated to obtain the matrix  ℂ representing the downlink channel matrix.
By truncating  , the channel matrix is downscaled.However,  is still a matrix with extensive data elements.Therefore, further dimensionality reduction is needed.The neural network can handle the problem of exaggerated dimensions and uneven feature distribution of  very well.Therefore,  can be inputted into the encoder part of the UE consisting of neural networks, and the encoder can obtain codeword  according to the compression ratio : where  (•) denotes the compression process and  denotes a set of parameters of the encoder.The received signal y ∈ C N c ×1 can be expressed as: where x ∈ C N c ×1 denotes the transmitted symbol vector in one OFDM cycle and z ∈ C N c ×1 is the additive noise vector.The diagonal matrix A = diag h H 1 p 1 , . . ., h H n p n , n = N c , where h i ∈ C N t ×1 and p i ∈ C N t ×1 , denotes the downlink channel response vector and beamforming precoding vector, i ∈ {1, . . . ,N c }, respectively.
In order to obtain the beamforming vectors, it is necessary to obtain the corresponding h i at the base station.Define the downlink channel matrix H C N c ×N t .The matrix contains N c × N t elements, each of which includes information about the real and imaginary parts of the CSI, proportional to the number of antennas.In the case of massive MIMO, N c × N t will be a vast number, which is a huge challenge for data processing.Since the channel matrix is sparse in the angular time-delay domain, H can be transformed from the spatial frequency domain to the angular time-delay domain using the two-dimensional discrete Fourier transform (DFT), as shown in Equation (2).
where F c and F t denote DFT matrices of size N c × N c and N t × N t , respectively.For each element in the angular delay domain matrix, H C N c ×N t corresponds to a path delay with a certain angle of arrival (AoA).Only the first N a rows of H obtained after two-dimensional discrete Fourier transform (DFT) contain helpful information, and the rest of the rows indicate paths with considerable propagation delays, so the elements are almost zero.Therefore, only the first N a is truncated to obtain the matrix H a C N a ×N t representing the downlink channel matrix.
By truncating H , the channel matrix is downscaled.However, H a is still a matrix with extensive data elements.Therefore, further dimensionality reduction is needed.The neural network can handle the problem of exaggerated dimensions and uneven feature distribution of H a very well.Therefore, H a can be inputted into the encoder part of the UE consisting of neural networks, and the encoder can obtain codeword υ according to the compression ratio η: where F ξ (•) denotes the compression process and Θ ξ denotes a set of parameters of the encoder.
After the feedback link is sent to the BS, υ is recovered in the decoder part of the BS.Like the encoder, the decoder also consists of a neural network.The recovery process of H a can be represented as follows: where F R (•) denotes the recovery process and Θ R denotes a set of parameters of the decoder.The H a obtained after recovery is padded for its zero values and then undergoes a two-dimensional discrete Fourier inverse transform (IDFT) to obtain H. Combining Equations ( 3) and (4) and using the mean square error as a metric, the entire compression and recovery process can be expressed as: (5) It is assumed that the uplink is in an ideal state, i.e., there is no loss in υ obtained from the encoder processing and then passed through the uplink to the decoder and recovered at the decoder.Therefore, the main objective of this study is to train and design the network for Θ ξ and Θ R .

Mix_Multi_TransNet Design
This section describes the design and principles of the Mix_Multi_TransNet network and its key components.The overall architecture of Mix_Multi_TransNet is shown in Figure 2. Mix_Multi_TransNet is an encoder-decoder framework divided into two paths with four embedded modules to solve the CSI feedback problem.
After the feedback link is sent to the BS,  is recovered in the decoder part of the BS.Like the encoder, the decoder also consists of a neural network.The recovery process of  can be represented as follows: where  ℛ (•) denotes the recovery process and  ℛ denotes a set of parameters of the decoder.The  obtained after recovery is padded for its zero values and then undergoes a two-dimensional discrete Fourier inverse transform (IDFT) to obtain  .Combining Equations ( 3) and (4) and using the mean square error as a metric, the entire compression and recovery process can be expressed as: It is assumed that the uplink is in an ideal state, i.e., there is no loss in  obtained from the encoder processing and then passed through the uplink to the decoder and recovered at the decoder.Therefore, the main objective of this study is to train and design the network for  and  ℛ .

Mix_Multi_TransNet Design
This section describes the design and principles of the Mix_Multi_TransNet network and its key components.The overall architecture of Mix_Multi_TransNet is shown in Figure 2. Mix_Multi_TransNet is an encoder-decoder framework divided into two paths with four embedded modules to solve the CSI feedback problem.

Network Modeling Processes
As shown in Figure 2, the whole network is divided into two parts: the encoder and the decoder.Before entering the encoder, firstly, the two-dimensional discrete Fourier transform of  is utilized to obtain  using Equation (2), and then truncation is performed to obtain the complex matrix  .Then, the complex matrix  ℂ is directly divided into real and imaginary parts: denotes the real part of the complex number and  denotes the imaginary part of the complex number; then, the resulting real and imaginary parts are turned into a 2  new matrix:

Network Modeling Processes
As shown in Figure 2, the whole network is divided into two parts: the encoder and the decoder.Before entering the encoder, firstly, the two-dimensional discrete Fourier transform of H is utilized to obtain H using Equation ( 2), and then truncation is performed to obtain the complex matrix H a .Then, the complex matrix H a C N a ×N t is directly divided into real and imaginary parts: a denotes the real part of the complex number and b denotes the imaginary part of the complex number; then, the resulting real and imaginary parts are turned into a 2N a × N t new matrix: where 0 ≤ i < Na, 0 ≤ j < Nt, 0 ≤ k < 2Na, and i = k or i = k − Na.The new matrix H is taken as input into both Path1 and Path2 of the encoder, and the whole processing in the encoder is conducted as shown in Equation ( 6); finally, υ1&υ2 is obtained after passing through Path1_E and Path2_E.In the decoder, υ1&υ2 is input into Path1 and Path2, corresponding to the decoder, respectively.υ1&υ2 is obtained after passing through Path1_D and Path2_ D processing and υ1 &υ2 is obtained; then, the two matrices of υ1 &υ2 are fused: The obtained H is recovered as a complex matrix H a .The entire processing in the decoder is shown in Equation (4).

Path1
A detailed description of the encoder part and decoder part of the network is given, as shown in Figure 3. From Figure 3, it can be seen that the whole self-encoding network is divided into two paths, Path1 and Path2.Path1 consists of Path1_E and Path1_D, denoted as the encoder and decoder part, respectively.
where 0  , 0  , 0  2, and  =  or  =  − .The new matrix ℋ is taken as input into both Path1 and Path2 of the encoder, and the whole processing in the encoder is conducted as shown in Equation ( 6); finally, 1&2 is obtained after passing through Path1_E and Path2_E.In the decoder, 1&2 is input into Path1 and Path2, corresponding to the decoder, respectively.1&2 is obtained after passing through Path1_D and Path2_ D processing and 1 &2 is obtained; then, the two matrices of 1 &2 are fused: The obtained ℋ is recovered as a complex matrix  .The entire processing in the decoder is shown in Equation (4).

Path1
A detailed description of the encoder part and decoder part of the network is given, as shown in Figure 3. From Figure 3, it can be seen that the whole self-encoding network is divided into two paths, Path1 and Path2.Path1 consists of Path1_E and Path1_D, denoted as the encoder and decoder part, respectively.
Y in denotes the matrix of the input convolution and W in denotes the corresponding convolution weights.In this path, H a first passes through the ConvB3* module and the convolution weight matrix has a size of 3*3.Then, the result is fed into ConvB1*9 and the corresponding convolution weight matrix has a size of 1*9.Finally, after ConvB9*1, the convolution weight matrix has a size of 9*1.After ConvB3*, ConvB1*9, and ConvB9*1, we get the feature matrix Y p1 _E1 of the first large convolution path.Large convolution operation can increase the sensory field and obtain more null domain information of the channel matrix.However, in order to compensate for the blurring effect of feature information brought by the large convolution operation and to improve the local and fine-grained feature-learning effect, the ConvB3* module of the other path in Path1_E uses the two-dimensional convolution with a convolution size of 3*3.It performs the same operation as in Equation ( 9) for H a to obtain Y p1 _E2.Then, the two paths Y p1 _E1 and Y p1 _E2, are subjected to the Concat operation: The resulting two feature matrices are fused with features in the dimension dim = 1, and then a 1*1 convolution operation, as in Equation (9), is performed to reduce the network parameters.Note that a batch normalization is performed after each convolution operation: Here, mean(Y_i) and var(Y_i) denote the mean and variance of the feature matrix Y_i on each channel, ep is a tiny constant used to avoid division by zero, and g and b are the learnable scaling factor and bias term, respectively.The feature matrix Y_i is the result after each convolution.After batch normalization, the training speed of the network is increased, the gradient propagation is improved, and the model's generalization ability is also improved.The activation function LeakyReLU is then used after batch normalization to provide nonlinearity.Then, after a dimensional change: Y a denotes the matrix obtained after the activation function.Finally, the obtained sequence data Y view is passed through the fully connected layer EnFc_1: Here, W view is a weight matrix of shape total size , total size η and B view is a bias term of shape total size η , .The final result obtained from Path1_E is compressed according to the compression ratio η to υ1.
The same 2D convolution operation with different convolution kernel sizes and number of convolutions is used in Path1_D.Path1_D mainly follows the CRBlock module in CRNet [21].In this way, to recover the channel matrix accurately and efficiently, in Path1_D, it first goes through the fully connected layer DeFc_1 to restore υ1 to the sequence size before compression: W D is a weight matrix of shape total size η , total size and B D is a shape (total size ,).It is restored to the original size, and then goes through the ConvB5* module, which represents a two-dimensional convolution with a convolution size of 5*5, computed as in Equation ( 9), and keeps the result: Here, I(•) makes the reservation of the calculation result.Then, the obtained result υ1 D is fed into two convolution paths, in which the first convolution path consists of three parts, ConvB3*, ConvB1*9, ConvB9*1, and each convolution operation is calculated as in Equation (10).After the first convolution path, we get Y p1 _D1.The second convolution path consists of ConvB1*5, ConvB5*1, and each convolution operation is calculated as in Equation ( 9) after the first convolution path to get Y p1 _D2, and then the two paths get Y p1 _D1 and Y p1 _D2 for the Concat operation, as in Equation ( 16): The two feature matrices obtained are fused with features in the dimension dim = 1, and then a 1*1 convolution operation is performed as in Equation ( 9) to reduce the network parameters.Note that after each convolution operation, batch normalization is performed, computed as in Equation (11), and then the activation function LeakyReLU is used to provide the nonlinearity to obtain υ1 D .Here, a residual join is made between υ1 D and the previously retained υ1 D in order to better transfer the features: F1(•) After denoting the ConvB5* module, a series of operations are obtained for υ1 D .We summarize operations as the ConvBlock module, as shown in Figure 3. Subsequently, the obtained υ1 is fed into the next ConvBlock module to obtain υ2.Finally, using the Sigmoid function, the output is mapped to [0, 1] to obtain the output υ1 of Path1. _ =   _1,  _2 ,  = 1 The two feature matrices obtained are fused with features in the dimension  = 1, and then a 1*1 convolution operation is performed as in Equation ( 9) to reduce the network parameters.Note that after each convolution operation, batch normalization is performed, computed as in Equation (11), and then the activation function LeakyReLU is used to provide the nonlinearity to obtain 1 .Here, a residual join is made between 1 and the previously retained 1 in order to better transfer the features: 1(•) After denoting the ConvB5* module, a series of operations are obtained for 1 .We summarize operations as the ConvBlock module, as shown in Figure 3.
Subsequently, the obtained 1 is fed into the next ConvBlock module to obtain 2 .Finally, using the Sigmoid function, the output is mapped to 0,1 to obtain the output 1 of Path1.First, we will explain how T_En1, T_En2, and T_De work.The input to Path2_En is a complex split with dimensional changes to obtain a new matrix, ℋ. ℋ is first fed into the T_En1 module, which employs the decoder layer of the Transformer.ℋ passes through the multiattendance module, through three separate linear layers.These three independent linear layers will have three outputs, computed as follows: First, we will explain how T_En1, T_En2, and T_De work.The input to Path2_En is a complex split with dimensional changes to obtain a new matrix, H. H is first fed into the T_En1 module, which employs the decoder layer of the Transformer.H passes through the multiattendance module, through three separate linear layers.These three independent linear layers will have three outputs, computed as follows:

Path2
where Q n , K n , and V n represent the outputs of the three linear layers, respectively.W Q n , W K n , and W V n represent the weights of the three linear layers on the nth head of the multi-head attention module, respectively, and then the attention score matrix is calculated: where Q n K T n is calculated to derive the degree of correlation between each element of the matrix.In order to eliminate the Q n multiplied by K n , the gradient is prevented from vanishing by eliminating the change in magnitude brought about by multiplying the matrix by the K-transpose matrix.After transposing the matrix, divide by √ d, where d denotes the dimension of the matrix.Then, after the So f tmax(•) function is normalized, finally, multiply by the matrix V n to obtain the attention score matrix.Then, after residual joining and layer normalization, the processing result is sent to the feed-forward layer and, finally, after one more residual joining and layer normalization to obtain the result of the T_En1 part, the working principle of T_En2 is the same as that of T_En1, but the difference is that the input of T_En2 becomes the processing result of the En_Multi_CNN part.
The corresponding T_De uses Transformer's decoder layer structure, which works similarly to the encoder layer with the difference that the decoder part uses a masked multihead attention mechanism.Unlike the multi-head attention mechanism, the masked multihead attention mechanism uses a whole new layer of weights to represent the criticality of each part of the feature data.It uses a masking mechanism to prevent label leakage.The T_De1 section first undergoes processing by the masked multi-head attention mechanism, followed by feeding the results into residual concatenation and layer normalization.It then undergoes processing by the multi-head attention mechanism, followed by another residual join and layer normalization.Immediately after the input to the feed-forward layer, it finally undergoes residual concatenation and layer normalization to obtain the output of the T_De part.
In Path2, the input H first passes through the T_En1 section, yielding the result Y p2 _T_En1, and then the result is fed into the En_Multi_CNN part.En_Multi_CNN uses different sizes of convolution kernels and connections to process the one-dimensional sequences passing through T_En1.Unlike the convolution in the Path1 path, after the T_En1 operation, the data dimensions are changed.To better learn the features of the data in this one dimension, the one-dimensional convolution operation is used.
The details of the En_Multi_CNN section are shown in Figure 5, where one-dimensional convolution is used for all convolutions: Sensors 2023, 23, x FOR PEER REVIEW 9 of 16 where  ,  , and  represent the outputs of the three linear layers, respectively. ,  , and  represent the weights of the three linear layers on the th head of the multihead attention module, respectively, and then the attention score matrix is calculated: Where   is calculated to derive the degree of correlation between each element of the matrix.In order to eliminate the  multiplied by  , the gradient is prevented from vanishing by eliminating the change in magnitude brought about by multiplying the matrix by the K-transpose matrix.After transposing the matrix, divide by √, where  denotes the dimension of the matrix.Then, after the (⋅) function is normalized, finally, multiply by the matrix  to obtain the attention score matrix.Then, after residual joining and layer normalization, the processing result is sent to the feed-forward layer and, finally, after one more residual joining and layer normalization to obtain the result of the T_En1 part, the working principle of T_En2 is the same as that of T_En1, but the difference is that the input of T_En2 becomes the processing result of the En_Multi_CNN part.
The corresponding T_De uses Transformer's decoder layer structure, which works similarly to the encoder layer with the difference that the decoder part uses a masked multi-head attention mechanism.Unlike the multi-head attention mechanism, the masked multi-head attention mechanism uses a whole new layer of weights to represent the criticality of each part of the feature data.It uses a masking mechanism to prevent label leakage.The T_De1 section first undergoes processing by the masked multi-head attention mechanism, followed by feeding the results into residual concatenation and layer normalization.It then undergoes processing by the multi-head attention mechanism, followed by another residual join and layer normalization.Immediately after the input to the feed-forward layer, it finally undergoes residual concatenation and layer normalization to obtain the output of the T_De part.
In Path2, the input ℋ first passes through the T_En1 section, yielding the result  _T_En1, and then the result is fed into the En_Multi_CNN part.En_Multi_CNN uses different sizes of convolution kernels and connections to process the one-dimensional sequences passing through T_En1.Unlike the convolution in the Path1 path, after the T_En1 operation, the data dimensions are changed.To better learn the features of the data in this one dimension, the one-dimensional convolution operation is used.
The details of the En_Multi_CNN section are shown in Figure 5, where onedimensional convolution is used for all convolutions: Here, Y in1d denotes the sequence of inputs and W in1d denotes the weight matrix of the inputs, with dimensions determined by the input and output channels.Path2 first goes through a convolution of one dimension, with convolution size 3.In order to learn the features of different sizes more fully, it is divided into two paths; the first path directly adopts a one-dimensional convolution of size 3, which reduces the parameters while learning small-size features quickly and improving the efficiency of local and finegrained feature learning, and obtains Y p2 _E1.The second path uses two one-dimensional convolutions of size 5, and the two convolutions are concatenated together to learn features with larger sensory fields to help edge feature learning, obtaining Y p2 _E2.Here, all the convolutions are computed as shown in Equation (20).After each convolution operation, one-dimensional batch normalization is performed: Here, the Y 1D _i denotes the input sequence and mean(Y 1Di ) denotes the mean of the feature sequence, var(Y 1D _i) denotes the variance of the feature sequence, ep is a tiny constant used to avoid division by zero, and g 1d and b 1d denote the scaling factor and bias term.Then, after batch normalization, the nonlinearity is obtained using the LeakyReLU activation function.The outputs of the two trails are then Concat spliced: " " denotes sequence splicing.Subsequently, a one-dimensional convolution operation of size 1 is accessed as in Equation ( 20), followed by a one-shot batch normalization and LeakyReLU activation function to reduce the network parameters.Finally, the input to En_Multi_CNN concerning the output of En_Multi_CNN with reference to the idea of residual connection: F2(•) denotes the sequence of operations to obtain Y p2 _E.Then, Y p2 _E is fed to T_En2, which is processed in the same way as T_En1 to obtain Y p2 _E , and the nonlinearity is introduced using the Relu activation function.Finally, Y p2 _E is passed into EnFc_2, which is computed as in Equation ( 13), according to the compression ratio η, to obtain υ2.
υ2 is transmitted to the Path2_D path through the ideal feedback link.υ2 is first restored to the pre-compression dimension calculation as in Equation ( 15) through DeFc_2 to obtain υ2 D .Then, υ2 D is used as an input to get Y p2 _T _De after T_De processing, and it is used as an input to De_Multi_CNN1.The structure of De_Multi_CNN1 and De_Multi_CNN2 is shown in Figure 6.Two different paths are used, and the two paths have different sizes and convolutions.The first path comprises a one-dimensional convolution of size 3.The second path is composed of a one-dimensional convolution of size 3 and a one-dimensional convolution of size 9 in series.Each convolution is computed as in Equation ( 20), followed by batch normalization as in Equation ( 21), and the activation function LeakyReLU is used to provide nonlinearity.Then, the two path features are subjected to Concat operation as in Equation (21), and, finally, the network parameters are reduced by a one-dimensional convolution operation of size 1.The convolution is formulated as in Equation (20).After convolution, the batch normalization is performed as in Equation (21), and, finally, Y p2 _De _M is obtained.At the end of De_Multi_CNN1, Y p2 _T _De and Y p2 _De _M are summed up using residual join ideas.Calculated as in Equation ( 22) and using the Relu activation function to provide nonlinearity, the calculation yields Y p2 _D .Then, it is processed by De_Multi_CNN2 with the same process as De_Multi_CNN1 to obtain υ2 .

Mix_Multi_TransNet Network Outputs
After the processing of the two paths Path1 and Path2, the output is obtained, 1 and 2 .Finally, 1 and 2 are fused: " ⨁ " denotes the addition of the two matrices, which ultimately results in the recovery of the channel matrix

Mix_Multi_TransNet Steps
The flow of the Mix_Multi_TransNet algorithm is shown in Algorithm 1.The input is  and the output is  .Initialize the transmit antenna  = 32, subcarrier  = 1024, and then truncate the first Na=32 rows.Firstly, we obtain H1 after a two-dimensional discrete Fourier variation, and then, according to the initialized truncation row  , we truncate  to get  .Then, we split the complex matrix  into a real part, ( ), and an imaginary part, ( ), and then the real and fictional elements are combined to form a new real matrix ℋ , and we feed ℋ to our encoder and decoder.ℋ is provided to Path1 and finally gets 1 ; see Section 3.2 Path1 for detailed steps.At the same time of feeding ℋ to Path1, it is also fed to Path2 and finally gets 2 , see Section 3.3 Path2 for exact steps.Finally, after Equation (24), we obtain the recovered matrix  .

Mix_Multi_TransNet Network Outputs
After the processing of the two paths Path1 and Path2, the output is obtained, υ1 and υ2 .Finally, υ1 and υ2 are fused: " " denotes the addition of the two matrices, which ultimately results in the recovery of the channel matrix H a

Mix_Multi_TransNet Steps
The flow of the Mix_Multi_TransNet algorithm is shown in Algorithm 1.The input is H and the output is H a .Initialize the transmit antenna N t = 32, subcarrier N c = 1024, and then truncate the first Na = 32 rows.Firstly, we obtain H1 after a two-dimensional discrete Fourier variation, and then, according to the initialized truncation row N a , we truncate H to get H a .Then, we split the complex matrix H a into a real part, Re(H a ), and an imaginary part, Im(H a ), and then the real and fictional elements are combined to form a new real matrix H, and we feed H to our encoder and decoder.H is provided to Path1 and finally gets υ1 ; see Section 3.2 Path1 for detailed steps.At the same time of feeding H to Path1, it is also fed to Path2 and finally gets υ2 , see Section 3.3 Path2 for exact steps.Finally, after Equation (24), we obtain the recovered matrix H a .

Simulation Results and Analysis
This section describes the detailed setup of the experiment and the network performance and compares the network accuracy with the state-of-the-art CSI feedback algorithm.

Data Sets, Training Programs, and Assessment Indicators
This study generates a dataset using COST2100 [29], and the proposed Mix_Multi_TransNet is compared with the existing state-of-the-art algorithms CsiNet, CsiNet+, CLNet, CRNet, and STNet.Two scenarios are considered in generating the dataset: an indoor scenario at 5.3 GHz and an outdoor scenario at 300 MHz, with a uniform linear array (ULA) model with N t = 32 at the BS.For FDD systems, N c = 1024 subcarriers are taken in the frequency domain.After two-dimensional Discrete Fourier Transform (DFT), N a = 32 is taken in the angular delay domain.It is divided into two scenarios, each with 150,000 independently generated channels, divided into training, validation, and test datasets containing 100,000, 30,000 and 20,000 channel matrices, respectively.The experiments split the data into individual matrix data for data loading convenience.
This experiment was conducted in Windows Server 2019 Standard environment using NVIDIA Quadro RTX 5000 Graphics Processing Unit (GPU) with 16 GB of video memory, 128 GB of RAM, and a Central Processing Unit (CPU) of i9-10900K clocked at 3.7 GHz for network training.The network is implemented based on Pytorch, and the Adam optimizer is used to train the network with 100 epochs, and the batch is set to 16.The learning rate is set to 0.001, and the learning rate is adjusted using the cosine annealing method, and the formula is calculated as follows: where lr denotes the current learning rate, l n denotes the final value after the learning rate has decayed, l str denotes the initial value of the learning rate, T cur denotes the current epoch value, and T max denotes the total epoch value.
In this thesis, the normalized mean square error (NMSE) between H a and H a is used as a criterion for the network accuracy, which is calculated as given by Equation (26):

Mix_Multi_TransNet Network Performance
Comparison details with existing CSI feedback algorithms are shown in Table 1.After 100 training rounds, Mix_Multi_TransNet starts to outperform other deep learning algorithms in both indoor and outdoor scenes.In the indoor scenario, the compression ratio η values equal to 1/8, 1/16, 1/32, and 1/64 outperform the other algorithms.They are compared with the best accuracy of the other algorithms, and Mix_Multi_TransNet obtains NMSE gains of 4.06 dB, 4.92 dB, 4.82 dB, and 6.47 dB, respectively.This means that in the indoor scenario, with compression ratio values of η = 1/8, 1/16, 1/32, 1/64, the accuracy is improved by 49.59%, 64.70%, 65.60%, and 32.24%, respectively, compared with the optimal accuracy of the existing algorithm.In the outdoor scene, all the compression ratio η values outperform the other algorithms, and Mix_Multi_TransNet obtains NMSE gains of 3.63 dB, 6.24 dB, 4.71 dB, 4.60 dB, and 2.93 dB, respectively.This implies that in the indoor scene, the accuracy improves compared to the optimal accuracy of the existing algorithms by 46.30%, 76.80%, 63.03%, 64.16%, and 50.04%, respectively.
In the bottom half of Table 1 is our network's training time and each Batch's response time when tested.The units are minutes and milliseconds, respectively.
As shown in Figure 7, in the indoor scenario, all compression ratio cases outperform the other accuracies, except for the compression ratio value η = 1/4 when the network performance is slightly worse than CLNet.In the outdoor scenario, all compression ratio values outperform the other algorithms.The overall network performance shows a decreasing trend with increasing compression ratio values; the more significant the compression ratio value, the more the loss of channel state information and, after feedback, the accuracy of recovering the channel state information decreases.In outdoor scenarios, the advantage of the Mix_Multi_TransNet algorithm is more significant than other algorithms, which can maximize feedback accuracy and is more suitable for complex outdoor environments.As shown in Figure 7, in the indoor scenario, all compression ratio cases outperform the other accuracies, except for the compression ratio value  = 1/4 when the network performance is slightly worse than CLNet.In the outdoor scenario, all compression ratio values outperform the other algorithms.The overall network performance shows a decreasing trend with increasing compression ratio values; the more significant the compression ratio value, the more the loss of channel state information and, after feedback, the accuracy of recovering the channel state information decreases.In outdoor scenarios, the advantage of the Mix_Multi_TransNet algorithm is more significant than other algorithms, which can maximize feedback accuracy and is more suitable for complex outdoor environments.

Ablation Experiment
As shown in Table 2, this study compares the network's performance status when the network uses specific modules alone.Because this network uses dual paths, the spatial and temporal features of the channel state information are learned separately; finally, the matrix information of the two paths is fused, which improves the problem of low accuracy

Ablation Experiment
As shown in Table 2, this study compares the network's performance status when the network uses specific modules alone.Because this network uses dual paths, the spatial and temporal features of the channel state information are learned separately; finally, the matrix information of the two paths is fused, which improves the problem of low accuracy when training and learning from one aspect alone.Path1 focuses on learning the spatial features of the channel state information, while Path2 focuses on learning the temporal features of the channel state information.In Path2, the one-dimensional convolutional modules En_Multi_CNN, De_Multi_CNN1, and De_Multi_CNN1 are also used for more excellent compression.As obtained from Table 2, the network performance is not optimal when a particular path is used alone or in some of these modules.In the indoor scene, using Path1 alone reduces the average NMSE gain by 2.10 dB compared to using the Path1 + Path2 no-convolution module and reduces the average NMSE gain by 7.61 dB compared to using the Path1 + Path2 complete network NMSE.This implies that in the indoor scene, using Path1 alone reduces the average NMSE gain by 53.18% compared to using the Path1 + Path2 no-convolution module The average NMSE accuracy is reduced by 53.18% and the average NMSE gain is reduced by 80.00% using the Path1 + Path2 complete network.Using Path2 alone in the indoor scene reduces the average gain by 2.31 dB over the Path1 + Path2 convolution-free module NMSE and 7.81 dB over the Path1 + Path2 full network NMSE, which implies that the average accuracy of Path2 alone in the indoor scene is reduced by 50.77% over the Path1 + Path2 convolution-free module NMSE.The average accuracy is reduced by 50.77% and the average gain is reduced by 79.05% compared to using the Path1 + Path2 complete network NMSE.
In the outdoor scene, the average NMSE gain of Path1 alone is 0.738 dB lower than that of the Path1 + Path2 no-convolution module and 5.51 dB lower than that of the Path1 + Path2 full network NMSE, which implies that the average NMSE accuracy of the outdoor scene is 14.58% lower than that of the Path1 + Path2 no-convolution module, and 78.42% lower than the Path1 + Path2 complete network NMSE.The average NMSE accuracy is reduced by 14.58%, and the average NMSE gain is reduced by 78.42% when using the Path1 + Path2 complete network.In the outdoor scene, using Path2 alone reduces the average gain by 2.39 dB over the Path1 + Path2 convolution-free module NMSE.It reduces the average gain by 7.16 dB over the Path1 + Path2 complete network NMSE, which means that in the outdoor scene, the average accuracy of Path2 alone reduces by 42.42% over the Path1 + Path2 convolution-free module NMSE.The average accuracy is reduced by 42.42%, and the average gain is reduced by 85.59% compared to using the Path1 + Path2 complete network NMSE.

Conclusions
This thesis proposes a multi-source neural network, Mix_Multi_TransNet, to solve the CSI feedback problem.The network learns different channel state information features and the two proposed paths learn spatial and temporal features, respectively, and obtain the encoder compression results of the channel state information matrix.The encoder compresses the channel state information matrix.Then, it is decoded by the decoder and the highest precision of the matrix is restored to the original channel state information matrix by fusing the feature information of the two paths.The Normalized Mean Square Error (NMSE) is used as an error measure and compared with existing algorithms from the COST2100 dataset.In the indoor scene, Mix_Multi_TransNet obtained the highest accuracy for compression ratios η = 1/8, 1/16, 1/32, 1/64, and the NMSE gain was 4.06 dB, 4.92 dB, 4.82 dB, and 6.47 dB, respectively.Mix_Multi_TransNet obtained the highest accuracy in the outdoor scene for all compression ratio values n.In the outdoor stage, all the compression ratio values η, Mix_Multi_TransNet have the highest accuracy, and the gain obtained by NMSE was 3.63 dB, 6.24 dB, 4.71 dB, 4.60 dB, and 2.93 dB, respectively.
In this study, a significant gain in NMSE was achieved in the indoor and outdoor scenarios compared to other algorithms.In future work, we will focus more on improving the model's inference speed and the model's size for use in natural industrial environments.

Figure 3 . 9 )
Figure 3. Encoder and Decoder.*' denotes the convolution dimension hyphen, e.g., 3* denotes a 2D kernel of size 3 3; ⨁' indicates the addition of two matrices.In order to thoroughly learn the spatial features of the channel matrix and to reduce the network parameters, Path1_E and Path1_D use a multidimensional pure convolutional structure.Due to the sparsity of the channel matrix in the angular delay domain, the distribution of the channel state information is still unevenly distributed after the interception of the former  rows.The convolution structure in Path1_E adopts two two-dimensional convolution operations with two different convolution sizes and a number of convolutions to enhance the effect of learning the features of the channel matrix, reducing the network parameters and enabling fast training.In Path1_E, the first path consists of ConvB3*, ConvB1*9, and ConvB9*1, indicating three convolution sizes of 3*3, 1*9, and 9*1 structures, respectively.Each 2D convolution is as in Equation (9):  = 2( ,  ) (9)  denotes the matrix of the input convolution and  denotes the corresponding convolution weights.In this path,  first passes through the ConvB3* module and the convolution weight matrix has a size of 3*3.Then, the result is fed into ConvB1*9 and the corresponding convolution weight matrix has a size of 1*9.Finally, after ConvB9*1, the convolution weight matrix has a size of 9*1.After ConvB3*, ConvB1*9, and ConvB9*1, we

Figure 3 .
Figure 3. Encoder and Decoder.'*' denotes the convolution dimension hyphen, e.g., 3* denotes a 2D kernel of size 3 × 3; ' ' indicates the addition of two matrices.In order to thoroughly learn the spatial features of the channel matrix and to reduce the network parameters, Path1_E and Path1_D use a multidimensional pure convolutional structure.Due to the sparsity of the channel matrix in the angular delay domain, the distribution of the channel state information is still unevenly distributed after the interception of the former N a rows.The convolution structure in Path1_E adopts two two-dimensional convolution operations with two different convolution sizes and a number of convolutions to enhance the effect of learning the features of the channel matrix, reducing the network parameters and enabling fast training.In Path1_E, the first path consists of ConvB3*, ConvB1*9, and ConvB9*1, indicating three convolution sizes of 3*3, 1*9, and 9*1 structures, respectively.Each 2D convolution is as in Equation (9):

3. 3 .
Path2Path2 also consists of two parts, Path2_E and Path2_D, which are denoted as the encoder part and decoder part, respectively.Different from Path1, this path mainly focuses on and learns the temporal nature of the CSI to compensate for the shortcoming of Pat1, which only focuses on the spatial features.Path2_E consists of T_EN1, En_Multi_CNN, T_EN2, and EnFc_2.Among them, T_EN1 and T_EN2 structures are shown on the left of Figure4, respectively, adopting Transformer's encoder layer structure.Path2_D consists of DeFc_2, T_De, De_Multi_CNN1, and De_Multi_CNN2, of which the T_De structure is shown on the right of Figure4, adopting Transformer's decoder layer structure.
Path2 also consists of two parts, Path2_E and Path2_D, which are denoted as the encoder part and decoder part, respectively.Different from Path1, this path mainly focuses on and learns the temporal nature of the CSI to compensate for the shortcoming of Pat1, which only focuses on the spatial features.Path2_E consists of T_EN1, En_Multi_CNN, T_EN2, and EnFc_2.Among them, T_EN1 and T_EN2 structures are shown on the left of Figure 4, respectively, adopting Transformer's encoder layer structure.Path2_D consists of DeFc_2, T_De, De_Multi_CNN1, and De_Multi_CNN2, of which the T_De structure is shown on the right of Figure 4, adopting Transformer's decoder layer structure.

Figure 7 .
Figure 7. Performance comparison with other algorithms in Indoor and Outdoor.

Figure 7 .
Figure 7. Performance comparison with other algorithms in Indoor and Outdoor.

Table 1 .
NMSE Comparison between Mix_Multi_TransNet and Other Methods.
η: Compression Ratio Value; In: indoor Scene; Out: outdoor Scene.Training Time (minutes) and Batch's Response Time (milliseconds).The table below corresponds to "In" and "Out" above.

Table 1 .
NMSE Comparison between Mix_Multi_TransNet and Other Methods.