Speeding up Training of Linear Predictors for Multi-Antenna Frequency-Selective Channels via Meta-Learning

An efficient data-driven prediction strategy for multi-antenna frequency-selective channels must operate based on a small number of pilot symbols. This paper proposes novel channel-prediction algorithms that address this goal by integrating transfer and meta-learning with a reduced-rank parametrization of the channel. The proposed methods optimize linear predictors by utilizing data from previous frames, which are generally characterized by distinct propagation characteristics, in order to enable fast training on the time slots of the current frame. The proposed predictors rely on a novel long short-term decomposition (LSTD) of the linear prediction model that leverages the disaggregation of the channel into long-term space-time signatures and fading amplitudes. We first develop predictors for single-antenna frequency-flat channels based on transfer/meta-learned quadratic regularization. Then, we introduce transfer and meta-learning algorithms for LSTD-based prediction models that build on equilibrium propagation (EP) and alternating least squares (ALS). Numerical results under the 3GPP 5G standard channel model demonstrate the impact of transfer and meta-learning on reducing the number of pilots for channel prediction, as well as the merits of the proposed LSTD parametrization.


Introduction
The capacity to accurately predict channel state information (CSI) is a key enabler of proactive resource allocation strategies, which are central to many visions for efficient and low-latency communications in 6G and beyond (see, e.g., [1]). The problem of channel prediction is relatively straightforward in the presence of known channel statistics. In fact, under the common assumption that multi-antenna frequency-selective channels follow stationary complex Gaussian processes, optimal channel predictors can be obtained via linear minimum mean squared error (LMMSE) estimators such as the Wiener filter [2]. However, in practice, the channel statistics are not known, and predictors need to be optimized based on training data obtained through the transmission of pilot signals [3][4][5][6][7][8]. The problem addressed by this paper concerns the design of data-efficient channel predictors for multi-antenna frequency-selective channels.

Context and Prior Art
A classical approach to tackle this problem is to optimize finite impulse response (FIR) filters [9], or recursive linear filters via autoregressive (AR) models [3][4][5] such as Kalman filtering (KF) [6][7][8], by estimating channel statistics from the available pilot data. Although recursive linear filters generally outperform FIR filters when an accurate model of the state-transition dynamics is available [10,11], FIR filters are typically advantageous in the presence of limited amounts of pilot data [12]. More recently, deep learning-based nonlinear predictors have also been proposed to adapt to channel statistics [5,6]. However, deep learning-based predictors tend to require excessive training (pilot) data, while failing to outperform well-designed linear filters in the low-data regime.
In this paper, we propose to leverage meta-learning in order to further reduce the data requirements of linear CSI predictors. The key idea is to leverage multiple related channel datasets from various propagation environments to optimize the hyperparameters of a training procedure. Specifically, we first develop a low-complexity offline solution that obtains hyperparameters and predictors in closed form. Then, an online algorithm is proposed based on gradient descent and equilibrium propagation (EP) [7], [8]. Previous applications of meta-learning to communication systems include demodulation [9], [10], channel equalization [11], encoding/decoding [12], [13], [14], MIMO detection [15], beamforming [16], [17], and resource allocation [18].

Figure 1. Illustration of the frame-based transmission system under study. At any frame f, based on the N previous channels collected in H^N_{l,f}, we investigate the problem of optimizing the δ-lag prediction ĥ_{l+δ,f}.

This paper takes a different approach that allows us to move beyond the single-antenna setting studied in [12]. As described in the next subsection, key ingredients of the proposed methods are transfer learning and meta-learning. Transfer learning [21] and meta-learning [22] aim at using knowledge from distinct tasks in order to reduce the data requirements of a new task of interest. Given a sufficiently close resemblance between tasks, both transfer learning and meta-learning have shown remarkable performance in reducing sample complexity in general machine learning problems [23]. Transfer learning applies to a specific target task, whereas meta-learning caters to adaptation to any new task (see, e.g., [24]).

Contributions
This paper proposes novel efficient data-driven channel prediction algorithms that reduce pilot requirements by integrating transfer and meta-learning with a novel long short-term decomposition (LSTD) of the linear predictors. Unlike the prior articles reviewed above, the proposed methods apply to multi-antenna frequency-selective channels whose statistics change across frames (see Figure 1). Specific contributions are as follows.

• We develop efficient predictors for single-antenna frequency-flat channels based on transfer/meta-learned quadratic regularization. Transfer and meta-learning are used to leverage data from multiple frames in order to extract shared useful knowledge that can be used for prediction on the current frame (see Figure 2).

Figure 2. Illustration of the considered transfer and meta-learning methods. With access to pilots from previously received frames, transfer learning and meta-learning aim at obtaining the hyperparameters V̄ to be used for channel prediction in a new frame.
• Targeting multi-antenna frequency-selective channels, we introduce the LSTD-based model class of linear predictors, which builds on the well-known disaggregation of standard channel models into long-term space-time signatures and fading amplitudes [5,41-44]. Accordingly, the channel is described by multipath features, such as angles of arrival, delays, and path losses, that change slowly across the frame, as well as by fast-varying fading amplitudes. Transfer learning and meta-learning algorithms for LSTD-based prediction models are proposed that build on equilibrium propagation (EP) and alternating least squares (ALS).

• Numerical results under the 3GPP 5G standard channel model demonstrate the impact of transfer and meta-learning on reducing the number of pilots for channel prediction, as well as the merits of the proposed LSTD parametrization.
Part of this paper was presented in [45], which only covered meta-learning for the case of single-antenna frequency-flat channels. As compared to [45], this journal version includes both transfer and meta-learning, and it addresses the general scenario of multi-antenna frequency-selective channels by introducing and developing the LSTD model class of linear predictors.

Organization
The rest of the paper is organized as follows. In Section 2, we detail the system and channel models, and describe conventional, transfer, and meta-learning concepts. In Section 3, we develop solutions for single-antenna frequency-flat channels. In Section 4, multi-antenna frequency-selective channels are considered, and we propose LSTD-based linear prediction schemes. Numerical results are presented in Section 5, and Section 6 concludes the paper.
Notation: In this paper, (·)^T denotes transposition; (·)^† Hermitian transposition; ‖·‖_F the Frobenius norm; |·| the absolute value; ‖·‖ the Euclidean norm; vec(·) the vectorization operator that stacks the columns of a matrix into a column vector; [·]_i the i-th element of a vector; and I_S the S × S identity matrix for an integer S.

System Model
As shown in Figure 1, we study a frame-based transmission system, with each frame containing multiple time slots. Each frame carries data from a possibly different user to the same receiver, e.g., a base station. The receiver has N_R antennas, and the transmitters have N_T antennas. The channel h_{l,f} in slot l = 1, 2, . . . of frame f = 1, 2, . . . is a vector with S = N_R N_T W entries, with W being the delay spread measured in number of transmission symbols. Within each frame f, the multi-path channels h_{l,f} ∈ C^{N_R N_T W × 1} are characterized by fixed, frame-dependent average path powers, path delays, Doppler spectra, and angles of arrival and departure [46]. For instance, in a frame f, we may have a slow-moving user in line-of-sight condition subject to time-invariant fading, whereas in another, the channel may exhibit significant scattering with fast temporal variations and a large Doppler frequency. In both cases, the frame is assumed to be short enough that average path powers, path delays, Doppler spectra, and angles of arrival and departure do not change within the frame [41,42].
As also seen in Figure 1, for each frame f, we are interested in addressing the lag-δ channel prediction problem, in which channel h_{l+δ,f} is predicted based on the N past channels collected in the S × N matrix

H^N_{l,f} = [h_{l,f}, h_{l-1,f}, . . . , h_{l-N+1,f}].  (1)

We adopt linear prediction with regressor V_f ∈ C^{SN × S}, so that the prediction is given as

ĥ_{l+δ,f} = V_f^† vec(H^N_{l,f}).  (2)

The focus on linear prediction is justified by the optimality of linear estimation for Gaussian stationary processes [47], which provide standard models for fading channels in rich scattering environments. Assuming no prior knowledge of the channel model, we adopt a data-driven approach to the design of the predictor (2). Accordingly, to train the linear predictor (2), for any frame f, the receiver is assumed to have available a training set encompassing L^tr input-output examples. Dataset Z^tr_f can be constructed from the L^tr + N + δ - 1 channels {h_{1,f}, . . . , h_{L^tr+N+δ-1,f}} by using the lag-δ channel h_{l+δ,f} as the label for the covariate vector vec(H^N_{l,f}). In practice, the channel vectors h_{l,f} are estimated by using pilot symbols, and estimation noise can be easily incorporated in the model (see Section 2.5). Throughout, we implicitly assume that the channels h_{l,f} correspond to estimates available at the receiver. We collect the covariates vec(H^N_{l,f})^† by row into the L^tr × SN input matrix X^tr_f and the labels h^†_{l+δ,f} by row into the L^tr × S target matrix Y^tr_f, so that the dataset can be expressed as the pair Z^tr_f = (X^tr_f, Y^tr_f).
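The dataset construction described above can be sketched in a few lines. The following Python snippet is an illustrative sketch (the function name and array layout are our own): it builds the covariate/target pairs from a sequence of estimated channel vectors for one frame.

```python
import numpy as np

def build_dataset(h, N, delta):
    """Build training pairs (vec(H^N_{l,f}), h_{l+delta,f}) from a
    sequence of channel snapshots for one frame.

    h     : (T, S) array of channel vectors h_{1,f}, ..., h_{T,f}
    N     : number of past channels used as regressor
    delta : prediction lag
    Returns X (L, S*N) covariates and Y (L, S) targets,
    with L = T - N - delta + 1 usable examples.
    """
    T, S = h.shape
    L = T - N - delta + 1
    X = np.empty((L, S * N), dtype=complex)
    Y = np.empty((L, S), dtype=complex)
    for i in range(L):
        l = i + N - 1                      # index of the most recent channel
        # stack h_l, h_{l-1}, ..., h_{l-N+1} and vectorize (column stacking)
        window = h[l - N + 1:l + 1][::-1]  # (N, S), most recent first
        X[i] = window.reshape(-1)
        Y[i] = h[l + delta]
    return X, Y
```

Each row of `X` corresponds to one vectorized window of N past channels, and the matching row of `Y` holds the lag-δ channel used as label.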

Channel Model
We adopt the standard spatial channel model [46]. Accordingly, a channel vector h_{l,f} for slot l in frame f is obtained by sampling the continuous-time multipath vector channel impulse response

h_{l,f}(τ) = Σ_{d=1}^{D} √Ω_{d,f} a_{d,f} e^{j2πγ_{d,f} t_l} g(τ - τ_{d,f}),  (4)

which is the sum of contributions from D paths. In (4), the waveform g(τ) is given by the convolution of the transmitted waveform and the matched filter at the receiver. Furthermore, the contribution of the d-th path depends on the average power Ω_{d,f}, the path delay τ_{d,f}, the N_T N_R × 1 spatial vector a_{d,f}, the Doppler frequency γ_{d,f}, and the starting wall-clock time t_l of the l-th slot. The average powers Ω_{d,f}, path delays τ_{d,f}, spatial vectors a_{d,f}, and Doppler frequencies γ_{d,f} are constant within one frame because they depend on large-scale geometric features of the propagation environment. However, they may change over frames following Clause 7.6.3.2 (Procedure B) in [46]. The number of paths is assumed, without loss of generality, to be the same for all frames f, because one can set Ω_{d,f} = 0 for frames with a smaller number of paths.
In [46], the spatial vector a_{d,f} has a structure that depends on the field patterns and steering vectors of the transmit and receive antennas, as well as on the polarization of the antennas. Mathematically, the entry of the spatial vector a_{d,f} corresponding to receive antenna element n_R and transmit antenna element n_T can be modeled as [46]

[a_{d,f}]_{n_R,n_T} = F_{rx,n_R}(θ_{d,f,ZOA}, φ_{d,f,AOA})^T M_{d,f} F_{tx,n_T}(θ_{d,f,ZOD}, φ_{d,f,AOD}) e^{-j2π l_{d,f,n_R,n_T}/λ_0},  (5)

where F_{rx,n_R}(·,·) and F_{tx,n_T}(·,·) are the 2 × 1 field patterns; θ_{d,f,ZOA}, φ_{d,f,AOA}, θ_{d,f,ZOD}, and φ_{d,f,AOD} are the zenith angle of arrival (ZOA), azimuth angle of arrival (AOA), zenith angle of departure (ZOD), and azimuth angle of departure (AOD) (in degrees); λ_0 is the wavelength (in m) of the carrier frequency; l_{d,f,n_R,n_T} is the length of the path (in m) between the two antennas; and M_{d,f} is the polarization coupling matrix

M_{d,f} = [ e^{jΦ^{θθ}_{d,f}},  √(κ_{d,f}^{-1}) e^{jΦ^{θφ}_{d,f}} ; √(κ_{d,f}^{-1}) e^{jΦ^{φθ}_{d,f}},  e^{jΦ^{φφ}_{d,f}} ],  (6)

with random initial phases Φ^{(·,·)}_{d,f} ∼ U(-π, π) and log-normally distributed cross-polarization power ratio (XPR) κ_{d,f} > 0 [46].
In order to obtain the S × 1 vector h_{l,f}, we sample the continuous-time channel h_{l,f}(τ) in (4) at the Nyquist rate 1/T, obtaining the W discrete-time N_R N_T × 1 channel impulse responses h_{l,f}[w] = h_{l,f}(wT) for w = 1, . . . , W. Following [41], the channel vector h_{l,f} ∈ C^{N_R N_T W × 1} is then obtained by stacking these W impulse responses as h_{l,f} = [h_{l,f}[1]^T, . . . , h_{l,f}[W]^T]^T.
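As a toy illustration of this sampling step, the sketch below generates one discrete-time channel vector from a handful of path parameters. It is not the full 3GPP model of [46]: the pulse g is assumed to be an ideal sinc, spatial vectors are supplied directly rather than built from field patterns, and all names are our own.

```python
import numpy as np

def channel_snapshot(t_l, paths, W, T):
    """Sample a simplified multipath channel at slot start time t_l:
    each path contributes its spatial vector scaled by the square root
    of its average power, a Doppler phase, and a delayed pulse g,
    assumed here to be an ideal sinc (an illustrative assumption).

    paths: list of dicts with keys 'omega' (avg power), 'tau' (delay, s),
           'a' (length N_R*N_T spatial vector), 'gamma' (Doppler, Hz).
    Returns the stacked S = N_R*N_T*W channel vector.
    """
    A = len(paths[0]['a'])
    taps = np.zeros((W, A), dtype=complex)
    for p in paths:
        phase = np.exp(2j * np.pi * p['gamma'] * t_l)
        for w in range(W):
            # g(wT - tau) with Nyquist-spaced samples, w = 1..W (1-based)
            g = np.sinc((w + 1) - p['tau'] / T)
            taps[w] += np.sqrt(p['omega']) * phase * g * p['a']
    return taps.reshape(-1)
```

For a single path with delay equal to two sampling periods and zero Doppler, the snapshot places all channel energy in the second tap, as expected.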

Conventional Learning
The optimization of the linear predictor V_f in (2) can be formulated as a supervised learning problem, as will be detailed in Section 3. In conventional learning, the predictor V_f is designed separately in each frame f based on the corresponding dataset Z^tr_f. In order for this predictor V_f to generalize well to slots in the same frame f outside the training set, it is necessary to have a sufficiently large number of training slots L^tr [48,49].

Transfer Learning and Meta-Learning
In conventional learning, the number of required training slots L tr can be reduced by selecting hyperparameters in the learning problem that reflect prior knowledge about the prediction problem at hand. In the next sections, we will explore solutions that optimize such hyperparameters based on data received from multiple previous frames. To this end, as illustrated in Figure 2, we assume the availability of channel data collected from F frames received in the past. In each frame, the channel follows the model described in Section 2.2. Accordingly, data from previous frames consists of L + N + δ − 1 channels {h 1, f , . . . , h L+N+δ−1, f } for some integer L.
By using these channels, the dataset Z_f can be obtained as explained in Section 2.1, where L is typically larger than L^tr, although this will not be assumed in the analysis. Correspondingly, we also define the L × N input matrix X_f and the L × 1 target vector y_f. We will propose methods that leverage the historical knowledge available from the datasets Z_f, for f = 1, . . . , F, via transfer learning and meta-learning, with the goal of reducing the number of pilots, L^tr, needed for channel prediction in a new frame (i.e., frame F + 1 in Figure 2).

Incorporating Estimation Noise
Until now, we have assumed that the channel vectors h_{l,f} are available noiselessly to the predictor. In practice, channel information needs to be estimated via pilots. To elaborate on this point, consider the received signal model

y_{l,f} = x h_{l,f} + n_{l,f},  (10)

where y_{l,f} is the received signal corresponding to a training symbol x with energy E_x, and n_{l,f} ∼ CN(0, N_0 I_S) is additive white complex Gaussian noise, so that the average signal-to-noise ratio (SNR) is given as E_x/N_0. From (10), we can estimate the channel as

ȟ_{l,f} = y_{l,f}/x = h_{l,f} + ξ_{l,f},  (11)

which suffers from channel estimation noise ξ_{l,f} ∼ CN(0, SNR^{-1} I_S). If P training symbols are available in each block, the channel estimation noise power can be reduced via averaging to SNR^{-1}/P. The channels ȟ_{l,f} can be used as training data in the schemes described in the previous subsections. More efficient channel estimation methods, including sparse Bayesian learning [50] and approximate message passing [51], may further reduce the channel estimation noise.
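The pilot-averaging step above can be sketched as follows; this is an illustrative sketch with our own function name, in which the pilot symbol is absorbed into the noise scaling for simplicity.

```python
import numpy as np

def estimate_channel(h, snr, P, seed=0):
    """Pilot-based channel estimate (illustrative sketch). Each of the P
    pilot repetitions yields h + xi with xi ~ CN(0, snr^{-1} I_S), as in
    the text; averaging the P observations reduces the estimation-noise
    power to snr^{-1}/P.
    """
    rng = np.random.default_rng(seed)
    S = h.shape[0]
    sigma = np.sqrt(1.0 / snr)
    # circularly-symmetric complex Gaussian noise, variance sigma^2 per entry
    noise = rng.standard_normal((P, S)) + 1j * rng.standard_normal((P, S))
    obs = h + sigma / np.sqrt(2.0) * noise
    return obs.mean(axis=0)
```

At high SNR and with several pilot repetitions, the estimate concentrates tightly around the true channel vector.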

Single-Antenna Frequency-Flat Channels
In this section, we propose transfer learning and meta-learning methods for single-antenna flat-fading channels, for which S = 1. Throughout this section, we write the prediction matrix V_f ∈ C^{SN × S} in (2) as the vector v_f ∈ C^{N × 1}, and the target data Y^tr_f ∈ C^{L^tr × S} as the vector y^tr_f ∈ C^{L^tr × 1}. Correspondingly, we rewrite the linear predictor (2) as

ĥ_{l+δ,f} = v_f^† h^N_{l,f},  (12)

where h^N_{l,f} = [h_{l,f}, h_{l-1,f}, . . . , h_{l-N+1,f}]^T collects the N past channels.

Conventional Learning
Assuming the standard quadratic loss, we formulate the supervised learning problem as the ridge regression optimization

v^*(Z^tr_f | v̄) = argmin_v ‖y^tr_f - X^tr_f v‖² + λ‖v - v̄‖²,  (13)

with hyperparameters (λ, v̄) given by the scalar λ > 0 and by the N × 1 bias vector v̄. The bias vector v̄ can be thought of as defining the prior mean of the predictor v_f, whereas λ > 0 specifies the precision (i.e., the inverse of the variance) of this prior knowledge. The solution of problem (13) can be obtained explicitly as

v^*(Z^tr_f | v̄) = (X^{tr†}_f X^tr_f + λ I_N)^{-1} (X^{tr†}_f y^tr_f + λ v̄).  (14)
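The closed-form ridge solution can be computed directly from the normal equations; the sketch below (function name our own) implements the biased ridge estimator described above.

```python
import numpy as np

def ridge_with_bias(X, y, lam, v_bar):
    """Solve min_v ||y - X v||^2 + lam ||v - v_bar||^2 in closed form:
    v* = (X^H X + lam I)^{-1} (X^H y + lam v_bar).
    """
    N = X.shape[1]
    A = X.conj().T @ X + lam * np.eye(N)
    return np.linalg.solve(A, X.conj().T @ y + lam * v_bar)
```

As a sanity check, the solution interpolates between the least-squares fit (λ → 0) and the bias vector v̄ (λ → ∞).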

Transfer Learning
Transfer learning uses the datasets Z_f introduced in Section 2.4 from the previous F frames, i.e., with f = 1, . . . , F, to optimize the hyperparameter vector v̄ in (13) as

v̄^trans = argmin_{v̄} Σ_{f=1}^{F} ‖y_f - X_f v̄‖².  (15)

The rationale for this choice is that the vector v̄^trans provides a useful prior mean to be used in the ridge regression problem (13). Note that, at deployment time, this approach has the same computational complexity as conventional learning, because the bias vector is treated as a constant.
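The transfer-learned bias is simply a least-squares fit pooled over the previous frames' data, which can be solved via the accumulated normal equations; a minimal sketch (names our own):

```python
import numpy as np

def transfer_bias(frames):
    """Pooled least-squares fit over data from all previous frames:
    v_bar^trans = argmin_v sum_f ||y_f - X_f v||^2 (normal equations).

    frames: list of (X_f, y_f) pairs, X_f of shape (L, N), y_f of (L,).
    """
    A = sum(X.conj().T @ X for X, _ in frames)
    b = sum(X.conj().T @ y for X, y in frames)
    return np.linalg.solve(A, b)
```

When all frames share the same underlying predictor, the pooled fit recovers it exactly in the noiseless case.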

Meta-Learning
Unlike transfer learning, which utilizes all the available datasets {Z_f}_{f=1}^{F} from the previous frames at once as in (15), meta-learning allows for the separate adaptation of the predictor in each frame. To this end, for each frame f, we split the L data points into L^tr training pairs, forming the dataset Z^tr_f, and L^te = L - L^tr test pairs, forming the dataset Z^te_f. We correspondingly define the L^tr × N input matrix X^tr_f and the L^tr × 1 target vector y^tr_f, as well as the L^te × N input matrix X^te_f and the L^te × 1 target vector y^te_f. The hyperparameter vector v̄ is then optimized by minimizing the sum loss of the predictors v^*(Z^tr_f | v̄) in (13) that are adapted separately for each frame f = 1, . . . , F given the bias vector v̄. Accordingly, estimating the loss in each frame f via the test set Z^te_f yields the meta-learning problem

v̄^meta = argmin_{v̄} Σ_{f=1}^{F} ‖y^te_f - X^te_f v^*(Z^tr_f | v̄)‖².  (17)

As studied in [52], the minimization in (17) is a least squares problem that can be solved in closed form by substituting the ridge solution (14) into (17), which leaves an ordinary least-squares problem in v̄. Unlike meta-learning schemes based on deep neural networks [25,29,30,32-34], the proposed meta-learning procedure adopts linear models, significantly reducing the computational complexity of meta-learning [52].
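One way to compute the closed-form solution of the meta-learning problem (17) numerically is to write the adapted predictor as an affine function of the bias, v^*(Z^tr_f | v̄) = c_f + A_f v̄, and solve the resulting least-squares problem in v̄ via accumulated normal equations. The sketch below is our own implementation of this route and is not necessarily the exact formulation of [52].

```python
import numpy as np

def meta_bias(frames, lam):
    """Closed-form meta-learned bias (sketch). For each frame,
    v*(Z^tr | v_bar) = c_f + A_f v_bar, with
    A_f = lam (X_tr^H X_tr + lam I)^{-1} and
    c_f = (X_tr^H X_tr + lam I)^{-1} X_tr^H y_tr.
    Substituting into the test loss in (17) gives an ordinary
    least-squares problem in v_bar, solved here by normal equations.

    frames: list of (X_tr, y_tr, X_te, y_te) tuples.
    """
    N = frames[0][0].shape[1]
    G = np.zeros((N, N), dtype=complex)
    r = np.zeros(N, dtype=complex)
    for X_tr, y_tr, X_te, y_te in frames:
        M = np.linalg.inv(X_tr.conj().T @ X_tr + lam * np.eye(N))
        A_f = lam * M
        c_f = M @ (X_tr.conj().T @ y_tr)
        Xt = X_te @ A_f               # effective covariates for v_bar
        yt = y_te - X_te @ c_f        # effective targets
        G += Xt.conj().T @ Xt
        r += Xt.conj().T @ yt
    return np.linalg.solve(G, r)
```

If every frame shares the same noiseless predictor, the meta-learned bias coincides with it, since the per-frame adaptation then incurs zero test loss.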
After meta-learning, similarly to transfer learning, based on the meta-learned hyperparameter v̄^meta, we train a channel predictor via ridge regression (13), obtaining the predictor v^*(Z^tr_f | v̄^meta) in (14).

Multi-Antenna Frequency-Selective Channels
In this section, we study the more general scenario with any number of antennas and with frequency-selective channels, resulting in S > 1. As we will discuss, a naïve extension of the techniques presented in the previous sections is undesirable, because this would not leverage the structure of the channel model (4). For this reason, in the following, we will introduce novel hybrid model-and data-driven solutions that build on the channel model (4).

Naïve Extension
We start by briefly presenting the direct extension of the approaches studied in the previous section to any S > 1. Unlike the previous section, we adopt the general matrix notation introduced in Section 2. First, conventional learning obtains the predictor by solving the generalization of problem (13) to any S > 1,

V^*(Z^tr_f | V̄) = argmin_V ‖Y^tr_f - X^tr_f V‖²_F + λ‖V - V̄‖²_F,  (20)

as a minimization over the linear prediction matrix V_f in (2). Similarly, transfer learning computes the bias matrix V̄^trans by solving the generalization of problem (15),

V̄^trans = argmin_{V̄} Σ_{f=1}^{F} ‖Y_f - X_f V̄‖²_F,  (21)

followed by the evaluation of the predictor V^*(Z^tr_f | V̄^trans) using (20); whereas meta-learning addresses the generalization of minimization (17),

V̄^meta = argmin_{V̄} Σ_{f=1}^{F} ‖Y^te_f - X^te_f V^*(Z^tr_f | V̄)‖²_F,  (22)

over the bias matrix V̄ ∈ C^{SN × S}, which is used to compute the predictor V^*(Z^tr_f | V̄^meta) in (20).
The issue with the naïve extensions (21) and (22) is that the dimensions of the predictor V and of the hyperparameter matrix V̄ can become extremely large as S grows. This, in turn, may lead to overfitting in the hyperparameter space [53] when the number of frames, F, is limited. This form of overfitting may prevent transfer learning and meta-learning from effectively reducing the sample complexity of problem (20), because the optimized hyperparameter matrix V̄ would depend excessively on the data received in the F previous frames. To solve this problem, we propose next to utilize the structure of the channel model (4) in order to reduce the dimension of the channel parametrization.

LSTD Channel Model
The channel model (4) implies that the channel vector h_{l,f} in (7) and (8) can be written as the product of a frame-dependent N_R N_T W × D matrix T_f, which collects the space-time signatures of the D paths, and of a slot-dependent D × 1 fading amplitude vector β_{l,f}, i.e., as in [41],

h_{l,f} = T_f β_{l,f}.  (23)

The frame-dependent matrix T_f is typically rank-deficient, because the paths are generally not all resolvable [54,55]. To account for this structural property of the channel, as in [41], we introduce an N_R N_T W × K full-rank matrix B_f with orthonormal columns, such that span{T_f} = span{B_f}, and redefine (23) as

h_{l,f} = B_f d_{l,f},  (25)

where d_{l,f} is the K × 1 vector of transformed fading amplitudes. As an example, the matrix B_f can be obtained from the singular value decomposition of T_f [41]. For future reference, we also rewrite (25) as

h_{l,f} = Σ_{k=1}^{K} d^k_{l,f} b^k_f,  (26)

where d^k_{l,f} is the k-th element of the vector d_{l,f} and b^k_f is the k-th column of the matrix B_f. We will refer to the matrix B_f in (26) as the long-term space-time feature matrix, or feature matrix for short, whereas the vector d_{l,f} will be referred to as the short-term amplitude vector. The parametrizations (25) and (26) are particularly efficient when the feature matrix B_f can be accurately estimated from the available data. For conventional learning, this requires observing a sufficiently large number of slots per frame, i.e., a large L [41], as well as a channel that varies sufficiently quickly across each frame. In contrast, as we will explore, transfer and meta-learning can potentially leverage data from multiple frames in order to enhance the estimation of the feature matrix.
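When T_f is unknown, a feature matrix spanning the dominant channel subspace can be estimated directly from channel snapshots via the singular value decomposition; the following sketch (names our own) illustrates the idea.

```python
import numpy as np

def lstd_features(H, K):
    """Estimate a long-term feature matrix from channel snapshots.

    H : (S, L) matrix whose columns are channel vectors h_{l,f}
    K : assumed rank of the space-time signature matrix
    Returns B (S, K) with orthonormal columns and the short-term
    amplitudes D (K, L) such that H ≈ B D.
    """
    U, s, Vh = np.linalg.svd(H, full_matrices=False)
    B = U[:, :K]               # dominant left singular vectors
    D = B.conj().T @ H         # amplitudes d_{l,f} = B^H h_{l,f}
    return B, D
```

For a channel that is exactly rank-K, the factorization reconstructs the snapshots perfectly; with noise, it retains the K strongest spatial-temporal directions.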

LSTD-Based Prediction Model
Given the LSTD channel model (25) and (26), in this subsection we redefine the problem of predicting channel h l+δ, f = B f d l+δ, f as the problem of estimating the feature matrix B f and predicting the amplitude vector d l+δ, f based on the available data. This will lead to a reduced-rank parametrization of the linear predictor (2).
To start, we write the predicted channel as

ĥ_{l+δ,f} = B̂_f d̂_{l+δ,f},  (27)

where B̂_f and d̂_{l+δ,f} are the estimated feature matrix and the predicted amplitude vector, respectively. To define the corresponding predictor, we first observe that the input matrix H^N_{l,f} in (1) can be expressed by using (25) as

H^N_{l,f} = B_f D^N_{l,f},  (29)

where D^N_{l,f} = [d_{l,f}, d_{l-1,f}, . . . , d_{l-N+1,f}] is the K × N matrix of past amplitude vectors. Consider now the prediction of the k-th amplitude d^k_{l+δ,f}. Generalizing (12), we adopt the linear predictor

d̂^k_{l+δ,f} = v^{k†}_f d^{k,N}_{l,f},  (30)

where v^k_f is an N × 1 prediction vector, and d^{k,N}_{l,f} is the (transposed) k-th row of the matrix D^N_{l,f} in (29), which collects the past N fading amplitudes corresponding to the k-th feature b^k_f. Plugging the prediction (30) into (27) yields the predicted channel (cf. (26))

ĥ_{l+δ,f} = Σ_{k=1}^{K} (v^{k†}_f d^{k,N}_{l,f}) b̂^k_f.  (32)

As detailed in Appendix A, inserting (30) and (31) into (32), we can express the LSTD-based prediction (32) in the form (2), with an LSTD-based predictor matrix V^{(K)}_f given by a sum of K Kronecker-structured terms of the form v^k_f ⊗ b̂^k_f, where ⊗ is the Kronecker product. An overall illustration of LSTD-based channel prediction is provided in Figure 3.
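The amplitude-domain prediction pipeline above (project, filter per feature, map back) can be sketched directly, without forming the Kronecker-structured predictor matrix; the function and argument names below are our own.

```python
import numpy as np

def lstd_predict(H_window, B_hat, V):
    """LSTD-based prediction sketch: project the N past channels onto
    the estimated features, predict each amplitude with its own
    length-N linear filter, then map back to channel space.

    H_window : (S, N) matrix [h_l, h_{l-1}, ..., h_{l-N+1}]
    B_hat    : (S, K) estimated feature matrix (orthonormal columns)
    V        : (N, K) matrix whose k-th column is the filter v^k
    """
    D = B_hat.conj().T @ H_window          # (K, N) past amplitudes
    d_hat = np.array([V[:, k].conj() @ D[k] for k in range(D.shape[0])])
    return B_hat @ d_hat                   # predicted channel, (S,)
```

For instance, with amplitudes that are constant over time, the filter that simply copies the latest amplitude yields an exact prediction.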

Conventional Learning for LSTD-Based Prediction
In conventional learning, the goal is to optimize the LSTD-based predictor V^{(K)}_f over the variables (B̂_f, {v^k_f}_{k=1}^{K}) via the regularized problem (35), whose hyperparameters (λ, V̄^{(K)}) are given by the scalar λ > 0 and by the SN × S LSTD-based bias matrix V̄^{(K)} defined, as in (34), in terms of bias vectors {b̄^k, v̄^k}_{k=1}^{K}. Because the Euclidean norm regularization ‖V_f - V̄^{(K)}‖²_F in (35) mixes long-term and short-term dependencies due to (34) and (36), we propose the modified problem

min_{B̂_f, {v^k_f}} ‖Y^tr_f - X^tr_f V^{(K)}_f‖²_F + Σ_{k=1}^{K} ( λ_1 ‖v^k_f - v̄^k‖² - λ_2 |b̂^{k†}_f b̄^k|² ),  (37)

with hyperparameters (λ_1, λ_2, b̄^1, . . . , b̄^K, v̄^1, . . . , v̄^K) given by the scalars λ_1, λ_2 > 0, by the S × 1 long-term bias vectors b̄^1, . . . , b̄^K, and by the N × 1 short-term bias vectors v̄^1, . . . , v̄^K. For each feature k, the considered regularization minimizes the Euclidean distance between the short-term prediction vector v^k_f and the short-term bias vector v̄^k, as in Section 3, while maximizing the alignment between the long-term feature vector b̂^k_f and the long-term bias vector b̄^k in a manner akin to the kernel alignment method of [56].
To address problem (37), inspired by [57,58], we propose a sequential approach, in which the pair (v^k_f, b̂^k_f) consisting of the k-th predictor v^k_f and the k-th feature vector b̂^k_f is optimized in the order k = 1, 2, . . . , K. Specifically, at each step k, we solve problem (37) restricted to the single pair (v^k_f, b̂^k_f), with the L^tr × S residual target matrix (Y^tr_f)_k in lieu of Y^tr_f, where (Y^tr_f)_k is obtained by subtracting from Y^tr_f the predictions produced by the already optimized pairs (v^{k',*}_f, b̂^{k',*}_f) with k' < k [57,58]. Because the resulting per-step problem (38) is nonconvex, we adopt alternating least squares (ALS) [59] to obtain a solution (b̂^{k,*}_f, v^{k,*}_f) by iterating between the optimization of v^k_f for fixed b̂^k_f and the optimization of b̂^k_f for fixed v^k_f until convergence. Closed-form solutions for the two ALS steps (42) and (43) can be found in Appendix B, and the overall LSTD-based conventional learning scheme is summarized in Algorithm 1.
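The ALS iteration for a single feature can be sketched as a rank-1 fit of the residual targets, alternating between the filter and the feature direction. This is our own illustrative implementation, not the exact closed-form steps of Appendix B: the feature update is a simple alignment step followed by renormalization.

```python
import numpy as np

def als_rank1(X, Y, lam1, lam2, v_bar, b_bar, iters=50):
    """Alternating least squares for a single LSTD feature (sketch):
    fit Y ≈ (X v) b^H while regularizing v toward v_bar (strength lam1)
    and aligning b with b_bar (strength lam2); b is renormalized to unit
    norm after each update.

    X : (L, N) past-amplitude covariates; Y : (L, S) residual targets.
    """
    N = X.shape[1]
    b = b_bar / np.linalg.norm(b_bar)
    A = X.conj().T @ X + lam1 * np.eye(N)
    for _ in range(iters):
        # v-step: min_v ||Y b - X v||^2 + lam1 ||v - v_bar||^2
        v = np.linalg.solve(A, X.conj().T @ (Y @ b) + lam1 * v_bar)
        # b-step: align b with the targets projected on the predicted
        # amplitudes z = X v, plus the bias term, then renormalize
        z = X @ v
        b = Y.conj().T @ z + lam2 * b_bar
        b = b / np.linalg.norm(b)
    return v, b
```

On exactly rank-1 data and with vanishing regularization, the iteration recovers the rank-1 factorization up to a common phase, which cancels in the product.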

Transfer Learning for LSTD-Based Prediction
Similarly to conventional learning, transfer learning for LSTD-based prediction can be obtained from the naïve extension (21) by utilizing the LSTD parametrization V^{(K)} in (34) in lieu of the unconstrained predictor V̄, yielding the bias matrix V̄^{(K),trans}. This problem can also be solved via the ALS-based sequential approach detailed in Section 4.4, producing the sequences b̄^{1,trans}, . . . , b̄^{K,trans} and v̄^{1,trans}, . . . , v̄^{K,trans} (cf. (38)), where the residual target matrix (Y_f)_k is defined as in (39) in terms of the k-th optimized predictor. Details for transfer learning can be found in Appendix C, and the overall transfer learning scheme for LSTD prediction is summarized in Algorithm 2. After transfer learning, similarly to Section 3.2, based on the optimized hyperparameters b̄^{1,trans}, . . . , b̄^{K,trans} and v̄^{1,trans}, . . . , v̄^{K,trans}, the LSTD-based channel predictor for a new frame f_new can be obtained via (37), which can also be solved in the sequential manner of (38).

Meta-Learning for LSTD-Based Prediction
Plugging (37) into the naïve extension (22), we can formulate the meta-learning problem for LSTD-based prediction over the hyperparameters {b̄^k, v̄^k}_{k=1}^{K}. Similarly to the sequential approach (38) described in Section 4.4, we propose a hierarchical sequential approach for meta-learning, applying (38) in the order k = 1, . . . , K, with the residual target matrix (Y^te_f)_k defined as in (39). The resulting bilevel nonconvex optimization problem (50) is addressed through gradient-based updates, with gradients computed via equilibrium propagation (EP) [60,61]. EP uses finite differences to approximate the gradient of the bilevel optimization (50): the difference is computed between gradients obtained at two stationary points, namely (b̂^{k,*}_f, v^{k,*}_f) for the original per-frame problem (38), and (b̂^{k,α}_f, v^{k,α}_f) for the modified version of (38) that adds α times the prediction loss on the test set Z^te_f, where α ∈ R is an additional real-valued hyperparameter, generally chosen to be a small non-zero value [60,61]. Specifically, denoting by C(·,·) the per-frame training objective in (38), EP leverages the asymptotic equalities (as α → 0) [60]

∇_{b̄^k} ≈ (1/α) [ ∇_{b̄^k} C(b̂^{k,α}_f, v^{k,α}_f) - ∇_{b̄^k} C(b̂^{k,*}_f, v^{k,*}_f) ]  (52)

and

∇_{v̄^k} ≈ (1/α) [ ∇_{v̄^k} C(b̂^{k,α}_f, v^{k,α}_f) - ∇_{v̄^k} C(b̂^{k,*}_f, v^{k,*}_f) ].  (53)

Derivations for the gradients (52) and (53) can be found in Appendix D.
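To make the EP finite-difference idea concrete, the sketch below applies it to the simpler quadratic-regularized predictor of Section 3 (real-valued for simplicity, names our own), where the hypergradient can also be computed analytically and used as a check. For this inner objective, dC/dv̄ = 2λ(v̄ - v), so the EP estimate reduces to (2λ/α)(v* - v^α).

```python
import numpy as np

def ep_hypergradient(X_tr, y_tr, X_te, y_te, v_bar, lam, alpha=1e-4):
    """Equilibrium-propagation estimate of the gradient of the test loss
    with respect to the bias v_bar (real-valued sketch).

    Inner objective: C(v) = ||y_tr - X_tr v||^2 + lam ||v - v_bar||^2.
    The augmented objective adds alpha * ||y_te - X_te v||^2.
    EP: grad ≈ (1/alpha) [dC/dv_bar at v_alpha - dC/dv_bar at v_star],
    with dC/dv_bar = 2 lam (v_bar - v).
    """
    N = X_tr.shape[1]
    def solve(extra_A, extra_b):
        A = X_tr.T @ X_tr + lam * np.eye(N) + extra_A
        return np.linalg.solve(A, X_tr.T @ y_tr + lam * v_bar + extra_b)
    v_star = solve(0.0, 0.0)                                   # original problem
    v_alpha = solve(alpha * X_te.T @ X_te, alpha * X_te.T @ y_te)  # augmented
    return (2 * lam / alpha) * (v_star - v_alpha)
```

As α → 0, the EP estimate converges to the exact hypergradient of the test loss with respect to the bias.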
To reduce the computational complexity of the gradient-based updates, we adopt stochastic gradient descent with the Adam optimizer, as done in [61], in order to update b̂_k and v̂_k based on (52) and (53). The overall LSTD-based meta-learning scheme is detailed in Algorithm 3.
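The EP-based hypergradient described above can be illustrated on a toy bilevel problem. The sketch below assumes a ridge-regression inner problem whose stationary point is available in closed form, and approximates the gradient of the test loss with respect to a bias hyperparameter by the finite difference of the inner-loss partial derivatives at the unperturbed and α-perturbed stationary points; the function name, the toy problem, and all dimensions are illustrative, not the paper's exact problems (38) and (50).

```python
import numpy as np

def ep_hypergrad(X, y, Xte, yte, b, lam=1.0, alpha=1e-4):
    # Inner problem: theta*(b) = argmin_th ||X th - y||^2 + lam ||th - b||^2.
    A = X.T @ X + lam * np.eye(X.shape[1])
    th0 = np.linalg.solve(A, X.T @ y + lam * b)
    # Perturbed stationary point of: inner loss + alpha * test loss.
    A_alpha = A + alpha * Xte.T @ Xte
    th_alpha = np.linalg.solve(A_alpha, X.T @ y + lam * b + alpha * Xte.T @ yte)
    # EP finite difference of d/db of the inner loss at the two
    # stationary points; d/db [lam ||th - b||^2] = 2 lam (b - th).
    return (2.0 * lam / alpha) * (th0 - th_alpha)
```

For small α, this finite difference matches the exact hypergradient of the test loss obtained by implicit differentiation, up to an O(α) error, which is the asymptotic equality that EP exploits.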
After meta-learning, as in Section 3.3, based on the optimized b̂_1,meta, . . . , b̂_K,meta and v̂_1,meta, . . . , v̂_K,meta, the LSTD-based channel predictor for a new frame f_new can be obtained via (37), which can be solved in a sequential way, as in (38). The computational complexity order of the considered schemes is summarized in Tables 1 and 2. At deployment time, as seen in Table 1, all schemes require the same computational complexity as conventional learning. In contrast, in the offline meta-learning or transfer learning phase, the computational overhead depends on the dimension of the channel vector S = N_R N_T W. LSTD-based schemes can reduce the computational overhead as compared to naïve solutions when the channel vector is large, i.e., S ≫ 1, and the rank K is sufficiently small. This is quantified in Table 2, where I_ALS is the number of iterations for ALS and I_EP is the number of iterations for EP.

Table 1. Computational complexity analysis at deployment (meta-testing).

Rank-Estimation for LSTD-Based Prediction
The total number of features K for LSTD-based prediction depends on the rank of the unknown space-time signature matrix T_f, as discussed in Section 4.2. This rank can be estimated by using available channels from previous frames under the assumption that the number of features does not change over multiple frames. This can be achieved via a standard method, Akaike's information criterion (AIC) (Equation (16) in [62]), which is applicable to all the proposed LSTD-based techniques. However, as AIC-based rank estimation generally tends to overestimate the rank [62,63], we propose a potentially more effective estimator for meta-learning, which utilizes a validation dataset.
To this end, we first split the available F frames into F_tr meta-training frames f = 1, . . . , F_tr and F_val meta-validation frames f = F_tr + 1, . . . , F. Then, we compute the sum-loss over the meta-validation frames (cf. (49)), where the hyperparameters {b̂_k',meta, v̂_k',meta} for k' = 1, . . . , k are computed by using the F_tr meta-training frames, as explained in the previous section. The rank-estimation procedure sequentially evaluates the meta-validation loss (56) in order to minimize it over the selection of k. In this regard, it is worth noting that an increase in the total number of features k always decreases the meta-training loss in (49), whereas this is not necessarily true for the meta-validation and meta-test losses.
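Validation-based rank selection of this kind can be sketched as follows: fit an unconstrained least-squares predictor, truncate it to rank k via SVD, and pick the k minimizing the validation loss. For illustration only, the sketch replaces the paper's meta-learned LSTD components with simple SVD truncation; `select_rank` and all dimensions are hypothetical.

```python
import numpy as np

def select_rank(Xtr, Ytr, Xval, Yval, k_max):
    # Full least-squares predictor, then truncate to rank k via SVD.
    V_full = np.linalg.lstsq(Xtr, Ytr, rcond=None)[0]
    U, s, Vh = np.linalg.svd(V_full, full_matrices=False)
    val_losses = []
    for k in range(1, k_max + 1):
        Vk = (U[:, :k] * s[:k]) @ Vh[:k]          # rank-k predictor
        err = Yval - Xval @ Vk
        val_losses.append(float(np.mean(np.abs(err) ** 2)))
    # Training loss decreases monotonically in k; the validation loss
    # does not, so its minimizer is used as the rank estimate.
    return int(np.argmin(val_losses)) + 1, val_losses
```

The validation curve typically dips at the true rank and rises again as additional components start fitting noise, mirroring the behavior of the meta-validation loss (56).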

Experiments
In this section, we present experimental results for the prediction of multi-antenna and/or frequency-selective channels. Numerical examples for single-antenna frequency-flat channels, for both offline and online learning scenarios, can be found in the conference version of this paper [45]. For all the experiments, we compute the normalized mean squared error (NMSE) ||ĥ_{l+δ,f} − h_{l+δ,f}||² / ||h_{l+δ,f}||², averaged over 100 samples for 200 new frames. To avoid discrepancies between the evaluation measures used during the training and testing phases, we also adopt the NMSE as the training loss function by normalizing the training dataset for the new frame f_new (cf. (3)), and we similarly redefine the datasets from previous frames f = 1, . . . , F for transfer and meta-training (cf. (58)). As summarized in Table 3, we consider a window size N = 5 with lag size δ = 3. All of the experimental results follow the 3GPP 5G standard SCM channel model [46], with variations of the long-term features over frames following Clause 7.6.3.2 (Procedure B) [46], under the UMi-Street Canyon environment, as discussed in Section 2.2. The normalized Doppler frequency ρ = γ_{d,f}/γ_SRS ∈ [0, 1] within each frame f, defined as the ratio between the Doppler frequency γ_{d,f} (4) and the frequency of the pilot symbols γ_SRS, or sounding reference signal (SRS) [46], is randomly selected in one of the two following ways: (i) for slow-varying environments, it is uniformly drawn in the interval [0.005, 0.05]; and (ii) for fast-varying environments, it is uniformly drawn in the interval [0.1, 1].
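The NMSE criterion above can be computed as in the following sketch, where each row of the arrays holds one channel sample; the helper names are illustrative.

```python
import numpy as np

def nmse(h_hat, h):
    # Per-sample normalized squared error ||h_hat - h||^2 / ||h||^2,
    # averaged over samples (rows).
    num = np.sum(np.abs(h_hat - h) ** 2, axis=-1)
    den = np.sum(np.abs(h) ** 2, axis=-1)
    return float(np.mean(num / den))

def nmse_db(h_hat, h):
    # Same metric on a decibel scale, as commonly plotted.
    return 10.0 * np.log10(nmse(h_hat, h))
```

A trivial all-zero predictor attains NMSE = 1 (0 dB), which is why NMSE values close to 1 indicate essentially uninformative predictions.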
In the following, we study the impact of (i) the number of antennas N_R N_T, (ii) the number of channel taps W, (iii) the number of training samples L_new, and (iv) the number of previous frames F, for various prediction schemes: (a) conventional learning, (b) transfer learning, and (c) meta-learning, where each scheme is implemented by using either the naïve or the LSTD parametrization. We set λ = 0, λ_1 = 0, and λ_2 = 0 for conventional learning [9,45], whereas we set λ = 1, λ_1 = 1, and λ_2 = 1 for transfer and meta-learning.

Multi-Antenna Frequency-Flat Channels
We begin by considering multi-antenna frequency-flat channels, evaluating the NMSE as a function of the total number of antennas N_R N_T under a fast-varying environment (Figure 4) or a slow-varying environment (Figure 5). We set K = 1 in the LSTD model. Specific antenna configurations are described in Appendix E. Both transfer and meta-learning are seen to provide significant advantages as compared to conventional learning, as long as one chooses the type of parametrization, naïve or LSTD, as a function of the type of variability in the channel, with meta-learning generally outperforming transfer learning. In particular, as seen in Figure 4, for fast-varying environments, meta-learning with the LSTD parametrization has the best performance, significantly reducing the NMSE with respect to both conventional and transfer learning. This is because meta-learning with LSTD can account for the need to adapt to fast-varying channel conditions, while also leveraging the reduced-rank structure of the channel. In contrast, as shown in Figure 5, for slow-varying channels, the naïve parametrization tends to be preferable because, as explained in Section 4.2, the long-term and short-term features of the channel become indistinguishable when channel variability is too low. It is also interesting to observe that increasing the number of antennas is generally useful for prediction, as the predictor can build on a larger vector of correlated covariates. This is, however, not the case for conventional learning in slow-varying environments, for which the features tend to be too correlated, resulting in overfitting. As a final note, although absolute NMSE values close to 1 may be insufficient for use in applications such as precoding, they can provide useful information for other applications such as proactive resource allocation [40,64].

Rank Estimation
In the previous experiments, we have considered channels with unitary rank, for which one can assume, without loss of optimality, a number of features in the LSTD parametrization equal to K = 1. In order to implement predictors for multi-antenna frequency-selective channels, one instead needs to first address the problem of estimating the number of features. Here, we evaluate the performance of the approach proposed in Section 4.7 for rank estimation. To this end, we set the number of antennas as N_R = 8 and N_T = 8, and consider the 19-cluster channel model with delay spread ratio 2. Figure 6 shows the NMSE evaluated on the meta-training, meta-validation, and meta-test datasets as a function of the total number of features K. The meta-training set contains 20 frames, the meta-test set 200 frames, and the meta-validation set 20 frames. The meta-training loss is monotonically decreasing with K, because a richer parametrization enables a closer fit of the training data. In contrast, both the meta-test and meta-validation losses are minimized at an intermediate value of K. The main point of the figure is that the meta-validation loss, while computed on only 20 frames, provides useful information to choose a value of K that approximately minimizes the meta-test loss. In contrast, although K = 3 is a proper estimate of the channel rank for the considered set-up, AIC-based rank estimation gives the highly overestimated value K = 200, which deteriorates the prediction performance, as can be seen in Figure 6. Throughout the following experiments, we follow the proposed procedure to select K for meta-learning, whereas for all the other schemes, we adopt AIC-based rank estimation to determine K. Results are evaluated with F_tr = 20 previous frames for meta-training, F_val = 20 for meta-validation, and F_te = 200 for meta-test.

Single-Antenna Frequency-Selective Channels
Before considering multi-antenna frequency-selective channels, we first consider the impact of the level of frequency selectivity on the prediction of single-antenna frequency-selective channels. To this end, starting from 45 ns, we increase the delay spread by a multiplicative factor, and correspondingly also increase the number of taps by the same amount; this factor is referred to as the delay spread ratio in Figure 6. The number of taps W is obtained as the smallest number of taps that contains more than 90% of the average channel power, following the ITU-R report [65]. Figure 7 shows that the dependence on the delay spread of the channel is qualitatively similar to the dependence on the number of antennas in Figures 4 and 5, with the top of Figure 7 representing the performance under a fast-varying environment and the bottom depicting the NMSE for a slow-varying environment. Accordingly, as discussed in the previous subsection, meta-learning outperforms both transfer and conventional learning, as long as the parametrization is correctly selected: naïve for slow-varying channels, and LSTD for fast-varying environments.
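The 90%-power rule for choosing the number of taps W can be sketched as follows, assuming the input is a power-delay profile of average per-tap powers ordered by delay; the function name and the handling of the exact-threshold case are illustrative rather than the precise ITU-R procedure.

```python
import numpy as np

def num_taps(power_delay_profile, frac=0.9):
    # Smallest W such that the first W taps (ordered by delay) carry
    # at least `frac` of the total average channel power.
    p = np.asarray(power_delay_profile, dtype=float)
    cum = np.cumsum(p) / p.sum()
    return int(np.searchsorted(cum, frac)) + 1
```

For example, a profile with per-tap power fractions (0.5, 0.3, 0.15, 0.05) needs W = 3 taps to reach the 90% threshold.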

Multi-Antenna Frequency-Selective Channel Case
We now consider the prediction performance for multi-antenna frequency-selective channels as a function of the number of training samples L_new in Figures 8 and 9, as well as versus the number of frames F in Figure 10. For meta-learning, we set L_tr = L_new in order to avoid discrepancies between meta-training and meta-testing [29]. Figures 8 and 9 show that meta-learning and transfer learning, which utilize F = 500 previous frames, can significantly outperform conventional learning in terms of the number of required pilots L_new. This key observation motivates the use of transfer and meta-learning in the presence of limited training data. Furthermore, confirming the analysis in Sections 3.3 and 4.6, meta-learning can outperform all other schemes as long as one selects the naïve parametrization for slow-varying environments and the LSTD parametrization for fast-varying environments. For sufficiently large L_new, transfer learning can, however, improve over meta-learning in fast-varying environments, as seen in Figure 8. This stems from the split of training and testing sets applied by meta-learning, which can lead to a performance loss as L_new increases. Lastly, we investigate the effect of the number of previous frames F for transfer and meta-learning. As a general result, as demonstrated by Figure 10, an increase in the number F of previous frames results in better performance for both transfer and meta-learning. Furthermore, in a slow-varying environment with a small value of F, transfer learning can outperform meta-learning due to the limited need for adaptation, whereas meta-learning, with the correctly selected type of parametrization, outperforms transfer learning otherwise.

Conclusions
In this paper, we have introduced data-driven channel prediction strategies for multi-antenna frequency-selective channels that aim at reducing the number of pilots by integrating transfer and meta-learning with a novel parametrization of linear predictors. The methods leverage the underlying structure of wireless channels, which can be expressed in terms of a long short-term decomposition (LSTD) into long-term space-time features and fading amplitudes. To enable transfer and meta-learning under an LSTD-based model, we have proposed an optimization strategy based on equilibrium propagation (EP) and alternating least squares (ALS). Numerical experiments have shown that the proposed LSTD-based transfer and meta-learning methods far outperform conventional prediction methods, especially in the few-pilots regime. For instance, under a standard 3GPP SCM channel model, assuming four transmit antennas and two receive antennas, and using only one pilot, meta-learning with LSTD can reduce the normalized prediction MSE by 3 dB as compared to standard learning techniques. Future work may consider the use of deep neural networks in lieu of linear prediction filters, although related results for multi-antenna frequency-flat channels have not reported any significant advantage to date [14][15][16][19].

The solution of (A4) can be obtained by taking the eigenvector of ((X̌_{k,f})†(X̌_{k,f}) − (Y̌_{k,f})†(X̌_{k,f}) − (X̌_{k,f})†(Y̌_{k,f})) that corresponds to the smallest eigenvalue, with the matrices X̌_{k,f} and Y̌_{k,f} defined as above, where we denote (X tr . Note that we arbitrarily started ALS with the vector v_f^k, as this ordering did not show any meaningful impact on the final results, as also reported in [66]. This concludes the derivation of (52) and (53); a generalized proof, along with useful properties of EP, can be found in [60]. For initialization, the hyperparameter b̂_k is set as the one-hot vector at position k, while the v̂_k are chosen as all-zero vectors [52].
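The smallest-eigenvalue step in the appendix fragment above, i.e., minimizing a Hermitian quadratic form over unit-norm vectors, can be sketched as follows; the matrix A stands in for the Hermitian combination of the X̌ and Y̌ matrices, which is not constructed here.

```python
import numpy as np

def min_quadratic_form(A):
    # argmin over unit-norm x of x^H A x, for Hermitian A: the eigenvector
    # associated with the smallest eigenvalue. np.linalg.eigh returns
    # eigenvalues in ascending order, so column 0 is the minimizer.
    w, V = np.linalg.eigh(A)
    return V[:, 0], float(w[0])
```

Since x^H A x, restricted to the unit sphere, is bounded below by the smallest eigenvalue of A, the returned pair attains that bound exactly.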

Appendix E. Details on the Antenna Configuration in Section 5.1
The following table contains the specification of the antenna configurations in Section 5.1. We denote (N hor R , N ver R , N