1. Introduction
Multiple-input multiple-output (MIMO) technology is a key enabler for enhancing the spectral efficiency and reliability of wireless communication systems. In frequency-division duplex (FDD) systems, accurate channel state information (CSI) at the base station (BS) is essential for effective downlink pre-coding. However, as the number of antennas grows, traditional codebook-based CSI quantization schemes suffer from excessive feedback overhead and limited quantization accuracy. To overcome these limitations, deep-learning (DL)-based CSI feedback methods have been proposed and have demonstrated strong potential [1]. A variety of DL-based CSI feedback architectures have since emerged, including Transformer backbone networks [2], domain-knowledge-guided meta-learning approaches [3], and feature vector designs tailored to pre-coding structures [4]. To improve the generalization ability of CSI feedback models under distribution drift, recent studies have also explored lightweight adaptive frameworks [5] and unsupervised learning pathways [6]. Most existing methods adopt a two-sided architecture, in which a user equipment (UE)-side encoder network compresses the CSI and a corresponding BS-side decoder reconstructs it. Although these methods have yielded notable performance improvements, their encoders impose substantial computational and memory burdens on resource-constrained UEs. This limitation is especially acute in emerging applications such as wearable electronics and industrial wireless sensor networks, where low-power sensor nodes cannot sustain the energy cost of complex deep learning inference. Furthermore, the UE-side encoder must be updated frequently to cope with time-varying channel conditions, resulting in poor generalization and additional training overhead.
To address these challenges, a one-sided feedback scheme has been proposed [7]. In this architecture, computationally intensive deep learning models are deployed entirely at the BS, while the UE performs only lightweight compression operations (such as a linear projection), thereby significantly reducing its computational burden. Beyond explicit compression, an ultra-low-rate implicit CSI feedback scheme that leverages the reciprocity of the bidirectional channel has been developed to further reduce uplink overhead [8]. Luo et al. further employed Type I/II codebooks at the UE side for CSI compression and investigated codebook-independent enhancement methods for deep-learning-based CSI feedback [9]. However, the performance of one-sided frameworks critically depends on the reconstruction capability of the BS-side decoder. Under highly compressed CSI or dynamically varying channel conditions, existing decoders often struggle to recover the CSI with sufficient accuracy.
In recent years, large language models (LLMs) have gained widespread attention for their remarkable capabilities [10], and their potential to shape future 6G systems has been highlighted in recent surveys [11]. The application of LLMs has long transcended natural language processing and now extends into wireless communications, where several foundation models have emerged to address various wireless tasks. For example, WirelessGPT introduced a multitask pre-training framework with about 80 million parameters [12], while LLM4WM explored the adaptation of LLMs for wireless multitasking [13]. In channel modeling, ChannelGPT employs a GPT-2-based architecture to tackle long-distance channel prediction and multimodal perception tasks [14]. Motivated by these advances, this work incorporates LLMs into the one-sided feedback framework, leveraging their modeling strength to enhance the accuracy of CSI reconstruction and prediction.
Inspired by Liu et al. [15], this paper proposes LLM4FB, a novel framework that integrates a pre-trained LLM into the BS-side decoder. The core idea is to leverage the rich representations encapsulated in the pre-trained LLM to enhance CSI reconstruction. Specifically, we employ a parameter-efficient fine-tuning (PEFT) strategy, in which the majority of the LLM parameters remain frozen while only specific modules, such as normalization layers, are updated. To further optimize system-level performance, we introduce a composite loss function that jointly minimizes the normalized mean square error (NMSE) and maximizes spectral efficiency (SE).
The proposed LLM4FB framework demonstrates superiority over existing CSI feedback paradigms in three critical aspects: reconstruction fidelity, computational efficiency, and cross-domain generalization.
In terms of reconstruction accuracy, the framework exhibits exceptional robustness, particularly in challenging compression regimes. By treating the coarse pseudoinverse reconstruction as a corrupted sequence and leveraging the LLM's denoising capabilities, LLM4FB effectively recovers channel semantics. In the least-compressed setting, the method achieves an NMSE of 0.044, a 31% improvement over the Transformer baseline (0.064), and significantly outperforms the CNN (0.051) and LSTM (0.077) architectures. More notably, under extreme compression (CR = 64), where conventional methods suffer severe performance collapse (the CNN and Transformer degrade to NMSEs of 0.53 and 0.494, respectively), LLM4FB maintains a resilient NMSE of 0.464.
Regarding parameter efficiency, the proposed fine-tuning strategy substantially reduces the training overhead typically associated with large-scale models. Unlike conventional approaches that require full-parameter updates, LLM4FB restricts optimization to the layer normalization layers, resulting in only 0.97 M trainable parameters out of a total of 85.23 M (approximately 1.1% of the model capacity). This reduction translates to a 50-fold decrease in computational cost and memory usage compared to full fine-tuning. Despite this sparse update mechanism, the framework yields a spectral efficiency of 8.494 bps/Hz, approaching the 8.510 bps/Hz achieved by full-parameter training. This finding suggests that the pre-trained weights already possess sufficient general feature extraction capability, requiring only minimal distribution alignment to adapt to wireless channel characteristics.
Furthermore, the framework offers superior adaptability across diverse propagation environments. While traditional deep learning models often require hundreds of epochs to converge when facing shifts in antenna configurations or channel scenarios (e.g., UMa to UMi), LLM4FB exploits the inherent alignment between the next-token prediction task in NLP and temporal sequence prediction in CSI feedback. Empirical results indicate that cross-scenario fine-tuning converges within 10–50 epochs, achieving an adaptation speed 2–10 times faster than training from scratch. This rapid deployment capability is particularly advantageous for the dynamic environmental requirements of 6G systems.
The main contributions of this paper are as follows:
A novel one-sided CSI feedback and prediction framework LLM4FB is proposed, which uses a pre-trained LLM to enhance the BS-side decoder capability and achieves high-accuracy CSI prediction for lightweight UEs.
An efficient parameter fine-tuning strategy is designed, and a multiobjective loss function is proposed that jointly optimizes NMSE and SE, enabling further improvement of system performance.
Extensive simulations verify the effectiveness of LLM4FB under various compression ratios and moving speeds, and its performance surpasses multiple existing baseline methods.
3. LLM-Based One-Sided CSI Feedback Framework
To address the high computational burden and limited generalization capability of conventional two-sided CSI feedback architectures, the LLM4FB framework is developed—a one-sided feedback framework that shifts the major inference workload to the BS. The overall pipeline is depicted in
Figure 1.
3.1. UE-Side Low-Complexity Compression
Within the LLM4FB framework, the UE performs only a random linear projection to compress CSI, corresponding to a minimalist encoder
. For the CSI tensor
spanning
subcarriers and
consecutive time slots, processing is simplified by compressing the per-antenna CSI matrix
independently for each antenna
. Specifically, the UE vectorizes the real and imaginary parts of
separately and concatenates them into a real-valued vector of dimension
, which is then projected by a random matrix
:
where
is the random projection matrix and
denotes the compressed representation to be fed back. The compression ratio is defined as
. In this architecture, the UE avoids storing a large-scale neural network—only retaining the seed used to generate the deterministic realization of
, thereby substantially reducing storage requirements.
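As a concrete illustration, the UE-side compression described above can be sketched in a few lines of NumPy; the dimensions, seed, and variable names below are illustrative placeholders, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed, shared with the BS

T, Nf = 16, 48                          # time slots and subcarriers (placeholder sizes)
H = rng.standard_normal((Nf, T)) + 1j * rng.standard_normal((Nf, T))  # per-antenna CSI

# Vectorize real and imaginary parts and concatenate into one real vector.
x = np.concatenate([H.real.ravel(), H.imag.ravel()])   # dimension 2 * Nf * T
N = x.size

CR = 8                                  # compression ratio N / M
M = N // CR
A = rng.standard_normal((M, N))         # random Gaussian projection matrix
y = A @ x                               # compressed vector fed back to the BS

print(y.shape)                          # (M,)
```

Only `y` (and the seed) leaves the UE; no neural network weights are stored or executed on the terminal.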
To ensure reproducibility and theoretical guarantees, the random projection matrix
is constructed following specific design principles. Each element of
is independently drawn from a standard Gaussian distribution
:
This Gaussian ensemble is chosen because it provably satisfies the restricted isometry property (RIP) with high probability. Specifically, for an $S$-sparse channel vector of ambient dimension $N$, the matrix preserves Euclidean distances when the number of measurements satisfies $M \ge C\,S\log(N/S)$ for a universal constant $C$, which provides theoretical justification for accurate recovery even under extreme compression.
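A quick numerical sanity check of this norm-preservation property; the dimensions are illustrative, and the matrix is scaled by $1/\sqrt{M}$ (an assumption for the check, so that the projected norm is unbiased):

```python
import numpy as np

rng = np.random.default_rng(42)
N, M, S = 512, 128, 5                   # ambient dim, measurements, sparsity (illustrative)

# S-sparse test vector standing in for a delay-domain channel
x = np.zeros(N)
x[rng.choice(N, S, replace=False)] = rng.standard_normal(S)

# Gaussian ensemble scaled by 1/sqrt(M) so that E||Ax||^2 = ||x||^2
A = rng.standard_normal((M, N)) / np.sqrt(M)

ratio = np.linalg.norm(A @ x) / np.linalg.norm(x)
print(round(ratio, 3))                  # close to 1 with high probability
```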
The pseudoinverse recovery in (10) implicitly normalizes the measurement energy. Although no explicit scaling is applied during compression, the Moore–Penrose pseudoinverse yields the minimum-norm least-squares reconstruction, automatically accounting for the energy distribution of the projection.
Critically, the same projection matrix is applied to all antennas. This design choice offers two advantages: (1) Preserving spatial coherence: applying identical linear transformations across antennas maintains the relative phase and amplitude relationships in the compressed domain, enabling the LLM to exploit spatial correlations during reconstruction. (2) Implementation efficiency: sharing a single matrix across all antennas reduces the storage overhead by a factor equal to the number of antennas, which is crucial for massive MIMO systems with hundreds of antennas. To ensure consistency across experiments, the random seed used to generate the projection matrix is fixed, guaranteeing that the BS and UE operate on identical matrices.
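The seed-sharing mechanism can be sketched as follows; the seed value and dimensions are hypothetical:

```python
import numpy as np

def projection_matrix(seed: int, M: int, N: int) -> np.ndarray:
    """Deterministically regenerate the shared random projection from a seed."""
    return np.random.default_rng(seed).standard_normal((M, N))

SEED, M, N = 2024, 192, 1536            # illustrative values
A_ue = projection_matrix(SEED, M, N)    # generated at the UE
A_bs = projection_matrix(SEED, M, N)    # regenerated at the BS

# Both ends hold the identical matrix without ever transmitting it.
assert np.array_equal(A_ue, A_bs)
```

Because only the integer seed must be agreed upon, the uplink never carries the matrix itself.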
3.2. BS-Side CSI Recovery and LLM Enhancement
The BS-side decoder is responsible for recovering and predicting the CSI from the received compressed vectors. The recovery process consists of two stages: an initial linear reconstruction step, followed by an LLM-based enhancement step.
Upon reception of
, the BS first obtains an initial estimate
by applying the (Moore–Penrose) pseudoinverse of the projection matrix:
where
denotes the Moore–Penrose pseudoinverse. (When
is non-singular, one valid form is
). As the compression is highly underdetermined (
), the initial estimate
inevitably contains substantial reconstruction error and information loss.
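A minimal sketch of this behavior, assuming illustrative dimensions: the pseudoinverse estimate is exactly consistent with the measurements, yet still far from the true CSI, which is precisely the gap the LLM stage must close:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 256, 32                          # highly underdetermined: M << N
A = rng.standard_normal((M, N))

x = rng.standard_normal(N)              # vectorized per-antenna CSI (ground truth)
y = A @ x                               # compressed feedback

x0 = np.linalg.pinv(A) @ y              # minimum-norm least-squares initial estimate

# The estimate reproduces the measurements exactly ...
assert np.allclose(A @ x0, y)
# ... but carries a large residual error relative to the true CSI.
rel_err = np.linalg.norm(x - x0) / np.linalg.norm(x)
print(round(rel_err, 3))
```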
To recover high-fidelity CSI from this noisy initial estimate, an LLM-based enhancement module is introduced. By leveraging the expressive prior knowledge embedded in a pre-trained LLM, the module models complex joint time-frequency correlations in the CSI and performs refined reconstruction and prediction. The enhancement module comprises four components—preprocessing, embedding, the LLM backbone, and an output head—which collectively convert the initial real-valued reconstructions into LLM-compatible inputs and generate the final CSI estimates. Detailed architecture and processing steps are described in the following subsections.
3.2.1. Preprocessing and Tokenization Strategy
The initial reconstruction yields a tensor of complex values stored in floating-point format. However, LLMs typically operate on discrete tokens, necessitating a mapping from continuous-valued CSI signals to the discrete input space.
To help the model extract time- and frequency-domain information more effectively, an inverse discrete Fourier transform (IDFT) is applied to the coarsely recovered CSI, mapping it from the frequency domain to the delay domain. In the delay domain, the signal energy is concentrated in a small number of taps, allowing the model to learn features more effectively and converge quickly.
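The energy-concentration effect can be verified numerically; the toy channel below, built from three delay taps, is an assumption chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
Nf = 48                                 # subcarriers

# Synthesize a frequency-domain channel from 3 delay taps (sparse multipath).
taps = np.zeros(Nf, dtype=complex)
taps[[0, 2, 5]] = rng.standard_normal(3) + 1j * rng.standard_normal(3)
h_freq = np.fft.fft(taps)               # frequency-domain CSI across subcarriers

# IDFT back to the delay domain: energy re-concentrates in a few taps.
h_delay = np.fft.ifft(h_freq)
energy = np.abs(h_delay) ** 2
top3 = np.sort(energy)[-3:].sum() / energy.sum()
print(round(top3, 6))                   # the three taps carry essentially all energy
```

In the frequency domain the energy is spread over all 48 subcarriers; in the delay domain it collapses onto the sparse taps, which is the representation the model consumes.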
Inspired by the patch strategy of Vision Transformers [16], instead of feeding the entire matrix into the model at once, we divide the delay-time CSI matrix into several non-overlapping patches of size P. Each patch captures the local delay-time characteristics of a specific channel region; after flattening, it is mapped to a latent vector space through a learnable linear projection. This process is analogous to tokenization in natural language processing: each patch acts as a high-dimensional feature describing a segment of the channel state, shortening the sequence while extracting local features effectively.
Mathematically, given the delay-time CSI matrix
for antenna
i, we partition it into
patches
, where each patch
. Each patch is then flattened and projected into a
-dimensional embedding space:
where
and
are learnable parameters. The resulting sequence of patch embeddings
serves as the input to the LLM backbone.
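A minimal NumPy sketch of the patchify-and-project step, with illustrative dimensions and random weights standing in for the learned projection:

```python
import numpy as np

rng = np.random.default_rng(5)
Nd, T, P, D = 16, 16, 4, 64             # delay bins, time slots, patch size, embed dim (illustrative)

H = rng.standard_normal((Nd, T))        # delay-time CSI matrix (one antenna, real part)

# Split into non-overlapping P x P patches, then flatten each patch row-major.
patches = (H.reshape(Nd // P, P, T // P, P)
             .transpose(0, 2, 1, 3)
             .reshape(-1, P * P))       # (num_patches, P*P) = (16, 16)

# Learnable linear projection (random weights stand in for trained ones).
W = rng.standard_normal((P * P, D)) * 0.02
b = np.zeros(D)
embeddings = patches @ W + b            # (16, D) token sequence for the LLM

print(embeddings.shape)
```

The first patch corresponds to the top-left P × P block of the delay-time matrix, so spatial locality survives the flattening.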
To preserve the sequential order essential for prediction, positional embeddings are added to the patch embeddings. Since standard self-attention is permutation-invariant, it cannot capture the order of the CSI patches on its own. The sinusoidal positional encoding strategy is employed. For a patch at position $pos$ in the sequence and a dimension index $2i$ or $2i+1$ in the embedding space, the positional encoding is defined as
$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right),$$
where $d_{\mathrm{model}}$
is the dimension of the embedding vectors. These positional encodings are added element-wise to the patch embeddings before being fed into the LLM backbone.
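The standard sinusoidal encoding can be computed as follows (a sketch; the sequence length and embedding dimension are illustrative):

```python
import numpy as np

def sinusoidal_pe(seq_len: int, d_model: int) -> np.ndarray:
    """Standard sinusoidal positional encoding (Vaswani et al.)."""
    pos = np.arange(seq_len)[:, None]               # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]            # (1, d_model/2)
    angle = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                     # even dimensions
    pe[:, 1::2] = np.cos(angle)                     # odd dimensions
    return pe

pe = sinusoidal_pe(seq_len=16, d_model=64)          # one encoding per CSI patch
print(pe.shape)
```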
In our implementation, we set the patch size to to balance local feature extraction and computational efficiency. Given the historical CSI sequence length time slots and delay-domain dimension , the input delay-time matrix is . After patch partitioning with stride , we obtain patches in total (4 patches along the delay axis × 4 patches along the time axis). Each patch is flattened into a 16-dimensional vector and projected to dimensions. After processing through the LLM backbone, these 16 patch embeddings are unpacked and reshaped back to the original delay-time structure before final CSI reconstruction. This design preserves spatial locality in both delay and time domains while maintaining a manageable sequence length for the attention mechanism.
3.2.2. Model Architecture Selection and Pre-Training Task Alignment
We adopt the GPT-2 Small architecture [10] as the backbone for LLM4FB, which consists of 12 Transformer decoder blocks with 12 attention heads per layer and a hidden dimension of 768. The total parameter count is approximately 117 million. To balance inference efficiency and model capacity for the CSI feedback task, we truncate the architecture to 6 layers, reducing the total parameters to 85.23 million while maintaining sufficient representational power.
The selection of GPT-2 is motivated by the inherent alignment between its pre-training task and CSI prediction. GPT-2 is pre-trained on a next-token prediction objective, in which the model learns to predict the probability distribution of the next token given all preceding tokens. Mathematically, this corresponds to modeling autoregressive conditional dependencies in sequential data. The task is fundamentally congruent with channel prediction, where the objective is to estimate future CSI conditioned on historical observations. Both tasks require capturing long-range temporal dependencies and extrapolating patterns beyond the observed sequence.
Through the patch embedding strategy described earlier, we convert the continuous-valued CSI signal into a sequence of discrete feature vectors, enabling direct utilization of GPT-2’s learned capabilities in modeling long-range causal dependencies. During pre-training on massive text corpora, GPT-2 develops internal representations that encode sequential patterns, temporal correlations, and contextual reasoning—skills that transfer effectively to wireless channel modeling despite the domain shift from language to radio signals. This cross-domain transferability has been empirically validated in recent works applying LLMs to time-series forecasting [
15] and multimodal sensing tasks.
3.2.3. LLM Backbone and Self-Attention Mechanism
The core of our framework is the pre-trained LLM backbone (specifically, a GPT-2 variant), which consists of stacked Transformer decoder blocks. The fundamental operation within these blocks is the Masked Multi-Head Self-Attention (MSA). Mathematically, for an input sequence of CSI embeddings
, the attention mechanism computes three matrices: queries ($\mathbf{Q}$), keys ($\mathbf{K}$), and values ($\mathbf{V}$). The attention output is calculated as
$$\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{\mathsf T}}{\sqrt{d_k}}\right)\mathbf{V},$$
where $d_k$ is the dimension of the key vectors. In the CSI prediction problem, the attention mechanism allows the model to concentrate on the time points that actually contribute to the prediction. For instance, when the channel exhibits a periodic fading pattern caused by a particular scatterer, the model can identify and emphasize earlier samples that share the same trend, helping it anticipate the upcoming state. This behavior contrasts with RNNs and LSTMs, which process sequences in order and often have difficulty preserving information over long spans. The Transformer overcomes this limitation by attending to the entire sequence context simultaneously, which is particularly useful when the channel response is formed by multiple paths with different Doppler shifts.
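A self-contained sketch of single-head masked attention over a sequence of patch embeddings; the causal mask mirrors GPT-2's decoder blocks, while the dimensions and random weights are illustrative:

```python
import numpy as np

def masked_self_attention(X, Wq, Wk, Wv):
    """Single-head masked (causal) scaled dot-product attention."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    dk = K.shape[-1]
    scores = Q @ K.T / np.sqrt(dk)
    # Causal mask: each CSI patch may attend only to itself and earlier patches.
    S = scores.shape[0]
    scores = np.where(np.tril(np.ones((S, S), dtype=bool)), scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(7)
S, D = 16, 64                           # 16 patch embeddings of dimension 64
X = rng.standard_normal((S, D))
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.05 for _ in range(3))
out, w = masked_self_attention(X, Wq, Wk, Wv)
print(out.shape)
```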
To reduce training overhead while retaining the LLM’s pre-trained knowledge, a parameter-efficient fine-tuning (PEFT) strategy is adopted, updating only the layer normalization and positional embedding parameters. This allows LLM4FB to exploit a large model capacity while training only a small fraction of the parameters compared to conventional deep learning approaches. The output of the LLM, denoted as , represents the enhanced CSI feature embeddings.
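The PEFT selection rule can be expressed as a simple name filter; the parameter inventory below is a hypothetical stand-in for the truncated GPT-2 checkpoint, not its actual layout:

```python
# Hypothetical parameter inventory for a truncated 6-layer GPT-2 backbone;
# names and counts are illustrative, not the actual checkpoint.
params = {
    "wte.weight": 38_597_376,                                  # token embeddings (frozen)
    "wpe.weight": 786_432,                                     # positional embeddings
    **{f"h.{i}.attn.c_attn.weight": 1_769_472 for i in range(6)},
    **{f"h.{i}.mlp.c_fc.weight": 2_359_296 for i in range(6)},
    **{f"h.{i}.ln_1.weight": 768 for i in range(6)},
    **{f"h.{i}.ln_2.weight": 768 for i in range(6)},
}

def is_trainable(name: str) -> bool:
    # PEFT rule from the paper: update only LayerNorm and positional embeddings.
    return "ln_" in name or name.startswith("wpe")

trainable = sum(n for name, n in params.items() if is_trainable(name))
total = sum(params.values())
print(f"{trainable / total:.2%} of parameters are trainable")
```

In a real training loop, the same filter would decide which tensors receive gradients while everything else stays frozen.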
3.3. Computational Complexity Analysis
A key motivation for the proposed one-sided framework is to offload the computational burden from the UE to the BS. Here, the computational complexity of both ends is analyzed in terms of floating-point operations (FLOPs).
At the UE side, the compression process involves a linear projection of the vectorized channel matrix. Given the input dimension $N$ and the compressed dimension $M$, the complexity is dominated by the matrix-vector multiplication, which is $\mathcal{O}(MN)$. As $M \ll N$ due to the high compression ratio, this operation is extremely lightweight and can be implemented efficiently on low-power mobile chipsets.
At the BS side, the complexity is primarily determined by the LLM inference. For a Transformer-based model with $L$ layers, hidden dimension $D$, and sequence length $S$, the self-attention mechanism costs $\mathcal{O}(L S^2 D)$ and the feed-forward network contributes $\mathcal{O}(L S D^2)$, so the total BS-side complexity is approximately $\mathcal{O}\big(L(S^2 D + S D^2)\big)$. Although this scales quadratically with the sequence length, the BS is typically equipped with high-performance computing resources (e.g., GPUs), making this cost acceptable. This asymmetric complexity distribution aligns well with the resource constraints of practical FDD massive MIMO systems.
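The asymmetry can be made concrete with rough FLOP counts; the dimensions below are illustrative, and the FFN is assumed to use the standard 4x expansion:

```python
def ue_flops(n_in: int, m: int) -> int:
    """Matrix-vector projection: one multiply-add per matrix entry."""
    return 2 * m * n_in

def bs_flops(layers: int, d: int, seq: int) -> int:
    """Rough Transformer cost: attention O(S^2 D) + FFN O(S D^2) per layer."""
    attn = 2 * seq * seq * d
    ffn = 2 * seq * (4 * d) * d          # assumed standard 4x FFN expansion
    return layers * (attn + ffn)

# Illustrative numbers: 1536-dim CSI compressed 8x; 6-layer, 768-dim backbone, 16 tokens.
ue = ue_flops(1536, 192)
bs = bs_flops(6, 768, 16)
print(f"UE: {ue:,} FLOPs   BS: {bs:,} FLOPs   ratio: {bs / ue:,.0f}x")
```

Even with these modest placeholder sizes, the BS performs several hundred times more arithmetic than the UE, which is exactly the intended workload split.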
3.4. Output and Loss Function
The enhanced features are mapped back to the original CSI dimension via a linear projection layer, producing the final predicted CSI .
To directly optimize communication performance, inspired by multitask learning [
17], a multiobjective loss function is defined that combines the NMSE with an additional spectral efficiency term:
LLM4FB was originally trained with the NMSE loss alone. To further enhance performance, a combined loss function is proposed:
where
R denotes the SE computed using the predicted CSI
, and
is a weighting hyperparameter (set to 0.9 in experiments). The
operation prevents gradient propagation through this term. Referring to the task-oriented design approach [
18], this loss formulation uses SE as a guiding signal to bias the model toward predictions yielding higher SE, while preserving the primary NMSE-driven gradient. In the training procedure, LLM4FB is first trained using the NMSE loss (
14), and then LLM4FB+ is obtained by fine-tuning the trained LLM4FB model for 10 additional epochs using the combined loss (
15).
Specifically, the term functions as a dynamic coefficient
:
By multiplying the rate
R by
, we effectively scale the magnitude of the rate-based loss component to match the current magnitude of the NMSE loss. The
operation is critical here; it treats
and
as constants during backpropagation. This prevents the optimizer from manipulating the scaling factor itself to minimize the loss (e.g., by artificially inflating
to reduce the weight), ensuring that gradients flow only through the optimization targets.
Consequently, the gradient of the total loss with respect to the network parameters
is given by
This formulation ensures that the contribution of the spectral efficiency to the parameter updates is always proportionally aligned with the reconstruction error, allowing for stable joint optimization where the SE maximization task provides a guided auxiliary gradient without overwhelming the primary objective of minimizing NMSE. The parameter
, thus, acts as a fine-tuning knob for the relative priority of SE, independent of the numerical scale of
R. The impact of
will be analyzed in subsequent sections.
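Under these definitions, the combined objective can be sketched as follows; the stop-gradient is emulated by treating the dynamic coefficient as a plain constant, and the numeric values are taken from the ablation results purely for illustration:

```python
def combined_loss(nmse: float, rate: float, gamma: float = 0.9) -> float:
    """Sketch of the combined objective: NMSE minus a rate reward whose
    dynamic coefficient sg[NMSE / R] is treated as a constant (stop-gradient).
    In an autodiff framework, coeff would be computed under detach()/stop_gradient
    so that gradients flow only through the nmse and rate terms themselves."""
    coeff = nmse / rate                  # constant w.r.t. backpropagation
    return nmse - gamma * coeff * rate

# The rate term is automatically rescaled to the NMSE's magnitude:
nmse, rate = 0.146, 7.969
loss = combined_loss(nmse, rate, gamma=0.9)
# gamma * (nmse/rate) * rate == gamma * nmse, so the SE term can never
# numerically overwhelm the reconstruction objective.
print(loss)
```

Note that while the loss value collapses to $(1-\gamma)\,\mathrm{NMSE}$, the gradients do not: with the coefficient detached, the rate term still contributes its own gradient, scaled proportionally to the current reconstruction error, which is precisely the stabilizing behavior described above.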
4. Simulation Results
4.1. Simulation Environment and Parameters
The design of wireless datasets has been shown to influence the performance of AI communication systems [
19]. An open-source CSI channel prediction dataset is used for the experiments [
15,
20], which is generated with the QuaDRiGa channel simulator and complies with the 3GPP 38.901 channel standard. This paper considers a single-cell MISO-OFDM system, where the BS is equipped with a uniform planar array (UPA) and the UE has a single antenna. The system bandwidth is 8.64 MHz, comprising 48 resource blocks (RBs). FDD is adopted, with an uplink center frequency of 2.4 GHz. The model predicts future CSI for
time slots based on historical CSI from
consecutive time slots, with a pilot interval of 0.5 ms. The channel scenario follows the 3GPP urban macro (UMa) non-line-of-sight (NLOS) model. The training, validation, and test sets contain 8000, 1000, and 10,000 samples, respectively, covering UE velocities ranging from 10 km/h to 100 km/h.
To investigate the effect of varying compression ratios (CR) on model performance, CR values ranging from 2 to 64 are evaluated. To ensure a fair comparison, the random seed for the projection matrix generation is fixed.
Several representative baselines are considered for comparison, including the traditional PAD model [
21], classical deep learning models (DNN, RNN, LSTM, CNN), and a state-of-the-art Transformer model. As summarized in
Table 1, despite the large total parameter count, LLM4FB requires significantly fewer trainable parameters (0.97 M) than other DL models, as only the fine-tuning parameters are updated.
4.2. Computational Complexity and Resource Consumption Analysis
To comprehensively evaluate the practical deployment feasibility of LLM4FB, we conduct a detailed analysis of computational complexity, memory consumption, and inference latency for both UE and BS sides.
Table 2 presents a comprehensive comparison across all baseline methods.
Params-UE denotes the storage footprint of the projection matrix
. All methods except CS-CsiNet employ fixed random projections with zero trainable parameters at the UE side; CS-CsiNet uses a learned projection matrix, resulting in 0.59 M parameters. CS-CsiNet was originally designed for the compressed feedback task; to adapt it to the prediction task, we replace the final Sigmoid with ReLU and append three linear layers, two LN layers, and ReLU activations at the output. From
Table 2, several observations can be made regarding the computational distribution between UE and BS:
All methods maintain identical UE-side complexity (0.045 ms latency, 0.59 M parameters, 10.85 M memory). This confirms that the one-sided architecture successfully decouples the decoder complexity from the UE burden—the terminal only performs a fixed linear projection regardless of the BS-side model sophistication. This property is essential for resource-constrained IoT devices and wireless sensors.
LLM4FB achieves a BS-side inference latency of 6.435 ms, which is 6.4 times faster than the Transformer (41.295 ms) while maintaining comparable or better NMSE (0.106 vs. 0.109). This efficiency results from our 6-layer truncated GPT-2 architecture, which prioritizes feature extraction depth over sequence processing length. The 6.435 ms latency remains practical for systems with 0.5 ms pilot intervals, especially considering the substantial performance gain.
Regarding computational resources, LLM4FB requires 480.25 M memory at the BS—higher than traditional models (12.95–263.15 M) but acceptable for modern GPU-equipped base stations. The computational cost of 0.93 GFLOPs is lower than CNN (4.83 GFLOPs). More importantly, LLM4FB (PEFT) reduces NMSE by 45.6% compared to CS-CsiNet (0.106 vs. 0.195), demonstrating that the memory overhead is justified by the performance improvement.
Comparing PEFT with full-parameter training shows that while full training achieves slightly better NMSE (0.087 vs. 0.106), the PEFT strategy requires only 1.57 M trainable parameters and maintains identical inference cost. This validates that the pre-trained LLM already contains sufficient structural knowledge for CSI reconstruction, requiring only minimal fine-tuning of normalization layers and embeddings to adapt to the wireless domain.
4.3. Performance Evaluation
Table 3 and
Table 4 compare the NMSE and SE performance under varying compression ratios at SNR of 10 dB. Several observations can be drawn.
Regarding NMSE, as the compression ratio increases from 2 to 64, the prediction error of all methods inevitably rises due to the growing information loss. However, LLM4FB consistently achieves the lowest NMSE across all compression ratios, indicating its strength in CSI reconstruction and prediction. Meanwhile, under high-compression scenarios (e.g., CR = 64), traditional deep learning methods such as CNN and DNN suffer significant performance degradation, whereas LLM4FB maintains a relatively low error. This robustness is attributed to the LLM's powerful contextual reasoning capability, which enables it to infer missing channel details from extremely sparse observations. For example, at CR = 8, LLM4FB achieves an NMSE of 0.144, comparable to the best-performing Transformer baseline, while using significantly fewer trainable parameters.
In terms of SE, the multiobjective optimized variant, LLM4FB+, exhibits the best overall performance. Although its NMSE is marginally higher than that of the standard LLM4FB in some cases, it achieves the highest SE across all tested compression ratios. This phenomenon highlights that minimizing the mean squared error does not always translate to maximizing the communication rate, as NMSE weights all channel elements uniformly, whereas SE is more sensitive to the accuracy of the dominant eigenmodes used for beamforming. By directly adding SE to the loss function, LLM4FB+ is able to optimize towards the channel features that are most important for improving the downlink rate. At a compression ratio of CR = 8, the model achieves a spectral efficiency of 8.036 bps/Hz, which is about 2.6% higher than the Transformer baseline. This result indicates that the joint optimization approach works as expected.
Figure 2 and
Figure 3 illustrate the NMSE and SE performance under varying compression ratios. It is observed that as the compression ratio increases, the NMSE of all methods increases, whereas the SE decreases. LLM4FB and LLM4FB+ consistently outperform the other baseline methods. In scenarios with low compression ratios, the improvement of SE is clearly noticeable.
Figure 4 shows the models' behavior at UE speeds from 20 km/h up to 90 km/h. As the user speed increases, the Doppler spread widens, leading to faster temporal variations in the channel impulse response, and the performance of all feedback schemes degrades. Nevertheless, LLM4FB and LLM4FB+ demonstrate superior resilience compared to the baselines. Even at high speeds (e.g., 90 km/h), where the channel coherence time is short, our method maintains a significant performance advantage over the conventional RNN and LSTM models. This indicates that the pre-trained LLM learns the channel's temporal patterns well enough to make accurate predictions even in highly dynamic environments.
Figure 5 illustrates the spectral efficiency performance across varying SNR conditions. As expected, SE increases with higher SNR for all methods due to improved channel quality. Notably, LLM4FB maintains a consistent performance advantage over baseline methods across the entire SNR range. At low SNR (0 dB), the gap is more pronounced, demonstrating the robustness of the LLM-based denoising capability. This validates that our framework achieves robust performance under diverse channel conditions.
4.4. Impact of the Multiobjective Weight on Performance Trade-Offs and Scenario Generalization
To investigate the effect of the multiobjective weight parameter
in Equation (
15) and validate the robustness of LLM4FB across different propagation environments, we conduct ablation experiments across different compression ratios. The parameter
controls the trade-off between NMSE minimization and SE maximization. To further validate the generalization capability of our framework, we use the pre-trained base model originally trained on the UMa scenario and fine-tune it on the UMi (urban micro) scenario with different
values. This cross-scenario fine-tuning strategy allows us to verify both the optimal
range and the model’s adaptability to diverse channel conditions.
Table 5 presents the ablation results for the multiobjective weight parameter defined in Equation (15), with values ranging from 0 to 2.0, evaluated across all tested compression ratios at SNR = 10 dB.
From the experimental results, we observe that for PEFT-based models, weight values in the range 0.5–1.0 achieve the best trade-off between NMSE and SE. With pure NMSE optimization (weight set to 0), the model achieves the lowest reconstruction error but suboptimal SE: the NMSE is 0.143, yet SE reaches only 7.782 bps/Hz. As the weight increases to 0.9, the NMSE degrades slightly to 0.146, while SE improves significantly to 7.969 bps/Hz, a 2.4% gain. This demonstrates that minimizing NMSE does not necessarily maximize the communication rate, as SE is more sensitive to the accuracy of the dominant channel eigenmodes used in beamforming.
For full-parameter training, performance is less sensitive to the weight: even at the largest tested value, NMSE degradation remains within 3% of the pure-NMSE setting. This indicates that models with sufficient capacity can optimize both objectives simultaneously without severe trade-offs. However, excessively large weights (>1.0) provide diminishing returns and may occasionally degrade NMSE, particularly under high compression ratios.
The sensitivity to the weight also varies with the compression ratio. At low compression, NMSE variations across weight values are minimal (<3%), suggesting that abundant feedback information makes both objectives easy to satisfy. In contrast, at high compression, the choice of weight becomes more critical, with NMSE variations of up to 5%. This indicates that careful hyperparameter tuning is essential in resource-constrained scenarios.
Based on these observations, we adopt a fixed weight from the 0.5–1.0 range for the LLM4FB+ variant in our main experiments, which provides a practical balance between reconstruction accuracy and spectral efficiency.
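To make the trade-off concrete, the following sketch combines reconstruction and rate objectives as described above. Since Equation (15) is not reproduced here, the exact form is an assumption: the loss is taken as NMSE minus a weighted SE term, with SE approximated by beamforming along the dominant eigenmode of the reconstructed channel (the mechanism the text identifies as driving SE sensitivity). Function names and the SNR value are illustrative.

```python
import numpy as np

def nmse(h_true, h_est):
    """Normalized MSE between true and reconstructed CSI (lower is better)."""
    return np.sum(np.abs(h_true - h_est) ** 2) / np.sum(np.abs(h_true) ** 2)

def spectral_efficiency(h_true, h_est, snr_linear=10.0):
    """SE proxy: beamform along the dominant right singular vector of the
    *reconstructed* channel and measure the rate achieved on the *true* one."""
    _, _, vh = np.linalg.svd(h_est)
    w = vh[0].conj()                       # dominant eigenmode -> precoder
    gain = np.abs(h_true @ w) ** 2         # per-receive-antenna beamforming gains
    return np.log2(1.0 + snr_linear * gain.sum())

def multiobjective_loss(h_true, h_est, weight):
    """Weighted sum: minimize NMSE while rewarding SE (the minus sign turns
    SE maximization into a loss term). weight = 0 recovers pure NMSE training."""
    return nmse(h_true, h_est) - weight * spectral_efficiency(h_true, h_est)
```

With weight = 0 the loss reduces to pure NMSE; increasing the weight lets gradients favor solutions that preserve the dominant eigenmodes even at slightly higher NMSE, matching the behavior reported in Table 5.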
4.5. Model Architecture and Pre-Training Benefits
To address the question of whether performance gains stem from the GPT-2 architecture itself or from leveraging pre-trained weights,
Table 6 presents controlled ablation experiments that isolate these two factors.
We construct four variants under identical conditions: (1) Transformer baseline with standard architecture trained from scratch; (2) GPT-2 architecture with random initialization and full-parameter training; (3) GPT-2 with random initialization but only LN layers trainable; (4) our proposed LLM4FB with pre-trained GPT-2 weights and only LN layers fine-tuned.
The results reveal several findings. Comparing GPT-2 (Scratch + Full) with the Transformer baseline shows that the GPT-2 architecture itself provides substantial benefits even without pre-training: at one compression ratio, NMSE improves from 0.064 to 0.047, and at a higher compression ratio, from 0.146 to 0.109. This 25–26% reduction demonstrates that GPT-2's design, including its residual connections, layer normalization placement, and attention patterns, is inherently better suited to CSI reconstruction.
The failure of GPT-2 (Scratch + PEFT) is particularly instructive. With only the LN layers trainable from random initialization, the model performs worse even than the Transformer baseline at the highest compression ratio (NMSE 0.562 vs. 0.494). This indicates that fine-tuning only 0.97 M parameters out of 85.23 M total is insufficient when starting from random weights: the model simply cannot learn meaningful channel representations with such limited trainable capacity.
LLM4FB (pre-trained + PEFT) achieves the best performance across all compression ratios despite having the same training configuration as GPT-2 (Scratch + PEFT). At a moderate compression ratio, the NMSE is 0.145 compared to 0.156 for the randomly initialized version, a 7% improvement. At the highest compression ratio, the gap widens further, with an NMSE of 0.464 compared to 0.562, a 17% improvement. This comparison directly demonstrates the value of pre-trained weights: they provide a strong initialization that enables successful adaptation with minimal parameter updates.
These results establish that LLM4FB’s effectiveness arises from three factors working together: a well-designed architecture optimized for sequential modeling, pre-trained weights encoding general temporal patterns, and an efficient fine-tuning strategy that adapts only the normalization layers.
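The LN-only fine-tuning strategy discussed above can be sketched as a simple parameter-selection rule. The sketch below assumes GPT-2-style parameter names (`ln_1`, `ln_2`, `ln_f`); the toy shapes are illustrative and much smaller than the paper's 85.23 M / 0.97 M split, but they reproduce the key property that LN parameters are a vanishing fraction of the total.

```python
# LN-only PEFT: given a model's named parameters, mark only the layer
# normalization scales/shifts as trainable and freeze everything else.

def split_trainable(named_shapes):
    """Return (trainable, frozen) parameter-name lists under the LN-only strategy."""
    trainable, frozen = [], []
    for name in named_shapes:
        (trainable if ".ln_" in name or name.startswith("ln_f") else frozen).append(name)
    return trainable, frozen

def count_params(named_shapes, names):
    """Total element count of the listed parameters."""
    total = 0
    for n in names:
        size = 1
        for d in named_shapes[n]:
            size *= d
        total += size
    return total

# Tiny GPT-2-like block: attention and MLP weights dwarf the LN parameters.
toy = {
    "h.0.ln_1.weight": (768,), "h.0.ln_1.bias": (768,),
    "h.0.attn.c_attn.weight": (768, 2304),
    "h.0.ln_2.weight": (768,), "h.0.ln_2.bias": (768,),
    "h.0.mlp.c_fc.weight": (768, 3072),
    "ln_f.weight": (768,), "ln_f.bias": (768,),
}
trainable, frozen = split_trainable(toy)
frac = count_params(toy, trainable) / count_params(toy, toy.keys())
# frac is well below 1%, mirroring the 0.97 M-of-85.23 M ratio in Table 6
```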
4.6. Fine-Tuning Strategy Comparison: LN-Only and LN + PE
To determine the optimal parameter-efficient fine-tuning configuration, we compare two strategies: fine-tuning only layer normalization (LN) parameters (0.97 M trainable) versus jointly fine-tuning LN and positional embedding (PE) parameters (1.76 M trainable). Both strategies maintain significantly lower trainable parameter counts compared to full fine-tuning (85.23 M), but differ in which components are updated during adaptation.
Table 7 presents the performance comparison between the two fine-tuning configurations across all compression ratios.
The results reveal that both strategies achieve nearly identical performance across all compression ratios. At the lowest compression ratio, the NMSE difference is merely 0.001 (0.044 vs. 0.045), a negligible 2.3% variation. Similarly, SE values are 8.494 and 8.492 bps/Hz, respectively, effectively identical within measurement precision. This pattern persists across the entire compression range: at a moderate ratio, the NMSE values are 0.145 vs. 0.146 (0.7% difference), and at the highest ratio, 0.464 vs. 0.465 (0.2% difference). The SE performance exhibits comparable consistency, with differences typically below 0.5%.
The minimal performance gap between the two configurations indicates that fine-tuning only the layer normalization parameters is sufficient for effective domain adaptation. Layer normalization controls activation distributions through scale ($\gamma$) and shift ($\beta$) parameters:
$$\mathrm{LN}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta,$$
where $\mu$ and $\sigma^2$ are the mean and variance of $x$ over the feature dimension and $\epsilon$ is a small constant for numerical stability.
By adjusting these parameters, the model recalibrates pre-trained feature representations to match wireless channel statistics without modifying core attention mechanisms. The positional embeddings, which encode temporal ordering information, appear largely redundant in this task—the self-attention mechanism already captures temporal dependencies through learned attention weights during pre-training.
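The recalibration mechanism can be made explicit with a minimal numpy implementation of layer normalization, where gamma and beta are the only quantities LN-only fine-tuning updates:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer normalization over the last axis: normalize each token's features
    to zero mean / unit variance, then recalibrate with the learnable scale
    (gamma) and shift (beta) -- the sole trainable parameters under LN-only PEFT."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```

Adapting gamma and beta rescales the frozen backbone's feature distributions toward wireless-channel statistics while leaving the attention weights themselves untouched.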
The practical implication is significant: adding 0.79 M more trainable parameters provides no measurable performance benefit. This validates our design choice of fine-tuning only LN layers as the default configuration for LLM4FB. The LN-only strategy offers three advantages: (1) 45% fewer trainable parameters, reducing training memory requirements; (2) faster convergence due to smaller optimization space; (3) implicit regularization that may prevent overfitting in data-limited scenarios.
4.7. Denoising Capability Analysis Under High Compression
A critical concern in one-sided feedback with high compression ratios is the severe noise introduced by the pseudoinverse reconstruction. To validate the LLM’s denoising capability, we analyze the reconstruction quality at different processing stages across various compression ratios.
Table 8 quantifies the reconstruction quality at three critical stages: (1) the initial pseudoinverse estimate of the historical CSI, (2) a naive baseline that repeats the last historical time slot as the future prediction (NMSE = 0.769 across all CRs), and (3) the final LLM-enhanced output. Several observations validate the LLM's robust denoising and prediction capability.
Figure 6 visualizes the delay-domain CSI across different compression ratios and processing stages. Each row corresponds to a specific CR, and the four columns show the following: (1) Delay Hist GT, the ground-truth historical CSI in the delay domain; (2) Delay 4 Init, the pseudoinverse reconstruction transformed to the delay domain; (3) Delay Future GT, the ground-truth future CSI; and (4) Delay LLM Pred, the LLM prediction in the delay domain.
These results demonstrate that even under extreme compression (e.g., CR = 64), where the initial pseudoinverse estimate is highly noisy (NMSE = 0.985), the LLM is able to reduce the error. This indicates that the LLM denoises the corrupted input by leveraging learned temporal patterns.
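The noise floor of the pseudoinverse stage can be illustrated with a small sketch of the one-sided pipeline's linear front end: the UE applies a random projection, and the BS forms the initial estimate via the pseudoinverse. The dimensions and projection matrix below are illustrative, not the paper's exact configuration; the point is that with compression ratio 64 the pseudoinverse recovers only the small subspace component of the CSI, so its NMSE sits near 1 before the LLM stage.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 256                      # vectorized CSI dimension (illustrative)
cr = 64                      # compression ratio, as in the CR = 64 case
m = n // cr                  # feedback dimension after linear projection

# "True" CSI vector and a random complex projection matrix at the UE.
h = rng.normal(size=n) + 1j * rng.normal(size=n)
A = (rng.normal(size=(m, n)) + 1j * rng.normal(size=(m, n))) / np.sqrt(2 * m)

y = A @ h                          # UE side: lightweight linear compression
h_init = np.linalg.pinv(A) @ y     # BS side: initial pseudoinverse estimate

nmse_init = np.sum(np.abs(h - h_init) ** 2) / np.sum(np.abs(h) ** 2)
# pinv(A) @ A projects h onto the m-dimensional row space of A, so with
# m << n almost all of the CSI energy is lost and nmse_init is close to 1.
# This heavily corrupted estimate is the input the LLM stage must denoise.
```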